1st January 2010, 01:27   #213
sethk
Registered User
 
Join Date: Jan 2003
Posts: 20
Quote:
Originally Posted by LoRd_MuldeR
And that's a low-performance "on board" chip, which isn't anywhere near NVidia's "high end" boards. I wouldn't expect great performance from that one.

Sure, it will beat everything that Intel can deliver currently, but unfortunately that doesn't mean much...



No, it doesn't solve the fundamental problem at all. All data still needs to go through the "slow" PCIe bus, twice! That means a serious bottleneck, bandwidth-wise and delay-wise.

All that "zero-copy access" does is: Data doesn't need to be uploaded to "global" memory before it can be copied to the "shared" memory of the individual block. It can go to the "shared" memory of the destination block immediately now. That avoids one indirection in some cases, but certainly not in all cases. In cases where more than one single thread block needs to access the data, we still need to store it in "global" memory. Also the data still needs to be uploaded via PCIe, which IMO is the most important problem! For a project I was working on we had to move calculations to the GPU, although they were slower(!) on the GPU than on the CPU. But downloading all the intermediate data to the CPU (host memory) was so slow, that we had to do everything on the GPU (global device memory) and only downloaded the final result. So believe me: The bottleneck of having to upload/download all the data through the PCIe bus (which currently is the only way on all competitive GPU's) is not just some "theoretical" thing. It's something you will encounter in reality! And it's something that can easily kill all the "nice" speedup you did expect by GPGPU processing. People are even trying to transfer data from one GPU board to another GPU board via DVI to avoid PCIe ^^
Definitely true that the MCP is a slow solution - I didn't mean to imply otherwise. I mentioned it for two reasons: one, the GT200 is not the only current part that supports zero-copy, and two, since the MCP is an integrated chip using system memory, it doesn't have the same latency issues with zero-copy access.

As you say, this is all academic because of the low shader pipeline count on this solution.

As for the GT200: even with the high latency of reaching main memory across the PCIe bus, it has enough bandwidth to host memory (about 8 GB/s) that I imagine zero-copy could be used in creative ways in conjunction with local memory.
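For reference, this is roughly what zero-copy access to host memory looks like in CUDA: mapped, pinned host memory that the kernel reads directly over PCIe. The scale_kernel here is a made-up placeholder, and the access pattern matters a lot since every read crosses the bus:

Code:
// Zero-copy sketch: the kernel reads host RAM directly over PCIe via mapped
// pinned memory, with no explicit cudaMemcpy for the input buffer.
// scale_kernel is a hypothetical placeholder.
#include <cuda_runtime.h>

__global__ void scale_kernel(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i] * 0.5f;
}

int main(void)
{
    const int n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);            // enable mapped host memory first

    float *h_src, *d_src, *d_dst;
    cudaHostAlloc((void **)&h_src, n * sizeof(float), cudaHostAllocMapped);  // pinned + mapped
    cudaHostGetDevicePointer((void **)&d_src, h_src, 0);  // device-side alias of the host buffer
    cudaMalloc((void **)&d_dst, n * sizeof(float));

    for (int i = 0; i < n; ++i) h_src[i] = (float)i;

    // Every read of d_src goes across PCIe, so it should be coalesced and
    // streamed through once, not accessed randomly or repeatedly.
    scale_kernel<<<(n + 255) / 256, 256>>>(d_src, d_dst, n);
    cudaThreadSynchronize();

    cudaFree(d_dst);
    cudaFreeHost(h_src);
    return 0;
}

Whether 8 GB/s (shared with everything else on the bus) is actually enough obviously depends on how much data each kernel has to touch.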

Now, the GT200 has 1 GB of local memory, so you could certainly use the local graphics memory (GDDR) to buffer large groups of uncompressed frames: have the main CPU handle decompression and stream the uncompressed frames into GPU memory (which could hold a deep enough buffer of frames for backwards and forwards frame access), then write the compressed data back to main memory without hitting a PCIe memory bottleneck. Since most of the work needs to be done on the uncompressed frames, and that could happen in GPU memory instead of main memory, the PCIe latency may not be as big a deal as if you tried to do everything in main memory.
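Take this as a rough sketch rather than working code, but the structure I'm picturing is something like the following - process_frame and decode_frame_on_cpu are made-up placeholders, and the frame and buffer sizes are arbitrary:

Code:
// Frames are decoded on the CPU, uploaded once each into a ring of device-resident
// slots (GDDR), processed there, and only the (much smaller) compressed output
// would come back over PCIe. process_frame and decode_frame_on_cpu are placeholders.
#include <cuda_runtime.h>

#define FRAME_BYTES (1920 * 1080)                 // one 8-bit luma plane, for example
#define SLOTS 16                                  // frames kept resident in GPU memory

__global__ void process_frame(const unsigned char *frame, unsigned char *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) result[i] = 255 - frame[i];        // placeholder per-pixel work
}

static void decode_frame_on_cpu(int idx, unsigned char *dst)
{
    // hypothetical: the CPU-side decoder writes one uncompressed frame into dst
    (void)idx; (void)dst;
}

int main(void)
{
    unsigned char *h_staging, *d_frames, *d_result;
    cudaHostAlloc((void **)&h_staging, FRAME_BYTES, cudaHostAllocDefault);  // pinned staging buffer
    cudaMalloc((void **)&d_frames, (size_t)SLOTS * FRAME_BYTES);            // ring buffer in GDDR
    cudaMalloc((void **)&d_result, FRAME_BYTES);

    for (int f = 0; f < 100; ++f) {
        unsigned char *slot = d_frames + (size_t)(f % SLOTS) * FRAME_BYTES;
        decode_frame_on_cpu(f, h_staging);                          // CPU handles decompression
        cudaMemcpy(slot, h_staging, FRAME_BYTES,
                   cudaMemcpyHostToDevice);                         // one PCIe upload per frame
        // once several slots are filled, kernels can look backwards/forwards across them
        process_frame<<<(FRAME_BYTES + 255) / 256, 256>>>(slot, d_result, FRAME_BYTES);
        // the compressed output (small) would be copied back from the device here
    }
    cudaThreadSynchronize();

    cudaFreeHost(h_staging);
    cudaFree(d_frames);
    cudaFree(d_result);
    return 0;
}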

I haven't attempted any of this, and I may be off-base in my thinking, but it was what I pictured as a reasonable approach.

Last edited by sethk; 1st January 2010 at 01:30.