Old 14th December 2009, 16:52   #181  |  Link
Firebird
Registered User
 
Join Date: Mar 2008
Posts: 61
Quote:
Is CUDA getting closer to x264
No. It will never be as good as x264 is.
Firebird is offline   Reply With Quote
Old 14th December 2009, 17:01   #182  |  Link
LoRd_MuldeR
Software Developer
 
LoRd_MuldeR's Avatar
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
Quote:
Quote:
Is CUDA getting closer to x264
No. It will never be as good as x264 is.
That statement doesn't make sense. CUDA is a platform technology, while x264 is one specific piece of software. So you are comparing apples and oranges.

So the real question is: Will GPGPU-based (CUDA, Stream, OpenCL, etc.) H.264 encoders eventually beat CPU-only encoders performance-wise and quality-wise?

Well, currently it doesn't look like this will happen soon. But this may change with upcoming GPU generations...
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊
LoRd_MuldeR is offline   Reply With Quote
Old 14th December 2009, 17:04   #183  |  Link
Cyber-Mav
Registered User
 
Join Date: Dec 2005
Posts: 244
CUDA is getting closer now in quality.
Cyber-Mav is offline   Reply With Quote
Old 14th December 2009, 17:12   #184  |  Link
nakTT
Registered User
 
Join Date: Dec 2008
Posts: 415
Quote:
Originally Posted by LoRd_MuldeR View Post
That statement doesn't make sense. CUDA is a platform technology, while x264 is one specific piece of software. So you are comparing apples and oranges.

So the real question is: Will GPGPU-based (CUDA, Stream, OpenCL, etc.) H.264 encoders eventually beat CPU-only encoders performance-wise and quality-wise?

Well, currently it doesn't look like this will happen soon. But this may change with upcoming GPU generations...
Thanks LoRd_MuldeR, that's what I meant.

Quality-wise, what does the upcoming GPU generation have to do with it? It's just hardware; I could understand it if it were speed-wise. Please shed some light on this.

nakTT is offline   Reply With Quote
Old 14th December 2009, 17:29   #185  |  Link
roozhou
Registered User
 
Join Date: Apr 2008
Posts: 1,181
Quote:
Originally Posted by nakTT View Post
Thanks LoRd_MuldeR, that's what I meant.

Quality-wise, what does the upcoming GPU generation have to do with it? It's just hardware; I could understand it if it were speed-wise. Please shed some light on this.

There is no video encoding chip on any GPU, so the encoding quality has nothing to do with the GPU generation or CPU model. You get the same quality from a P3 and an i7 with x264 if you use the same settings.

Quality is only determined by the algorithm that the encoder uses. This applies to both CUDA-based encoders and x264.
roozhou is offline   Reply With Quote
Old 14th December 2009, 17:42   #186  |  Link
nakTT
Registered User
 
Join Date: Dec 2008
Posts: 415
Quote:
Originally Posted by roozhou View Post
There is no video encoding chip on any GPU, so the encoding quality has nothing to do with the GPU generation or CPU model. You get the same quality from a P3 and an i7 with x264 if you use the same settings.

Quality is only determined by the algorithm that the encoder uses. This applies to both CUDA-based encoders and x264.
That is the understanding I was trying to share in my previous post. Perhaps we are right, or perhaps wrong?
nakTT is offline   Reply With Quote
Old 14th December 2009, 18:16   #187  |  Link
nakTT
Registered User
 
Join Date: Dec 2008
Posts: 415
Thanks for your thoughts, Stephen. It didn't occur to me until you pointed it out. Thanks again.
nakTT is offline   Reply With Quote
Old 14th December 2009, 19:24   #188  |  Link
LoRd_MuldeR
Software Developer
 
LoRd_MuldeR's Avatar
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
Quote:
Originally Posted by roozhou View Post
There is no video encoding chip on any GPU...
Not yet. But there is already dedicated decoder hardware on every modern graphics card. Also, there are encoding solutions available that ship with dedicated encoder hardware/sticks.

Therefore it's not completely absurd to think about adding dedicated encoder chips to future GPU generations...

Quote:
Originally Posted by roozhou View Post
...so the encoding quality has nothing to do with the GPU generation or CPU model.
Well, the capabilities of the first GPGPU-enabled GPU generation were pretty limited. Since then, the GPU manufacturers have added new GPGPU-specific capabilities with each generation.

Certain encoding algorithms that cannot be implemented (efficiently) on the current GPU generation may be implementable on future ones.

So we may indeed see improved GPGPU encoders on future GPU generations! Especially since the development of GPUs is currently rapid, while the development of CPUs is slowing down.

Of course there is no guarantee. We'll have to wait and see whether future GPU generations will be more suitable for video encoding than the current GPUs.
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊

Last edited by LoRd_MuldeR; 14th December 2009 at 19:33.
LoRd_MuldeR is offline   Reply With Quote
Old 15th December 2009, 03:45   #189  |  Link
nakTT
Registered User
 
Join Date: Dec 2008
Posts: 415
Quote:
Originally Posted by LoRd_MuldeR View Post
Well, the capabilities of the first GPGPU-enabled GPU generation were pretty limited. Since then, the GPU manufacturers have added new GPGPU-specific capabilities with each generation.

Certain encoding algorithms that cannot be implemented (efficiently) on the current GPU generation may be implementable on future ones.

So we may indeed see improved GPGPU encoders on future GPU generations! Especially since the development of GPUs is currently rapid, while the development of CPUs is slowing down.

Of course there is no guarantee. We'll have to wait and see whether future GPU generations will be more suitable for video encoding than the current GPUs.
So is my understanding correct that CPU encoding gives the programmer a lot more room to maneuver, as opposed to GPGPU?

Thanks for a very newbie-friendly explanation. I have always liked the way you treat newbies. Keep it up.


nakTT is offline   Reply With Quote
Old 16th December 2009, 04:37   #190  |  Link
kidjan
Registered User
 
kidjan's Avatar
 
Join Date: Oct 2008
Posts: 39
Quote:
Originally Posted by St Devious View Post
ok, doing that.

In the meantime here's some more images

CUDA 6 Mbps


x264 6Mbps


Source
IMO, quality comparisons like this would be a lot more useful with SSIM (and possibly PSNR) measurements. My $.02, possibly wrong. It's a lot easier to encode the same video to equal bitrates and then see how it fares with an objective measurement than to post screenshots.
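
For what it's worth, PSNR at least is trivial to compute yourself from the decoded frames - it's just 10*log10(255^2/MSE) - and x264 can report both metrics with its --psnr and --ssim switches. A rough sketch of the PSNR part (illustration only, not taken from x264 or any other tool):

Code:
// PSNR of a decoded 8-bit plane against the source plane of the same size.
// Rough sketch for illustration; SSIM is more involved.
#include <cmath>
#include <cstddef>

double psnr(const unsigned char* ref, const unsigned char* dec, size_t n)
{
    double sse = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)ref[i] - (double)dec[i];
        sse += d * d;
    }
    double mse = sse / (double)n;
    if (mse == 0.0) return 99.0;   // identical planes; clamp instead of infinity
    return 10.0 * std::log10(255.0 * 255.0 / mse);
}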
kidjan is offline   Reply With Quote
Old 16th December 2009, 09:19   #191  |  Link
Puncakes
Registered User
 
Join Date: Aug 2009
Posts: 26
Quote:
Originally Posted by kidjan View Post
IMO, quality comparisons like this would be a lot more useful with SSIM (and possibly PSNR) measurements. My $.02, possibly wrong. It's a lot easier to encode the same video to equal bitrates and then see how it fares with an objective measurement than to post screenshots.
I don't know about you, but considering the fact that those measurements are useless for comparing actual visual quality, I think I'd rather have screenshots.
Puncakes is offline   Reply With Quote
Old 16th December 2009, 10:52   #192  |  Link
LoRd_MuldeR
Software Developer
 
LoRd_MuldeR's Avatar
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
Quote:
Originally Posted by nakTT View Post
So is my understanding correct that CPU encoding gives the programmer a lot more room to maneuver, as opposed to GPGPU?
You must think of the GPU as a massively parallel processor. So GPGPU (CUDA, Stream, OpenCL, etc.) gives the programmer access to a massively parallel co-processor.

And we are not talking about four or eight threads here. We are talking about hundreds or, even better, thousands of threads that need to run on the GPU!

So if you want to leverage the theoretical processing power of a GPU, your problem must be highly parallelizable, and new algorithms are needed that scale to hundreds or thousands of threads.

Therefore not every problem is suitable for the GPU. There are inherently sequential problems that don't fit on the GPU at all!
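
Just to give an idea of the scale, here is a toy CUDA sketch (made up for illustration, not from x264 or any real CUDA encoder): one thread per 16x16 block position of a 1920x1088 frame computes a plain SAD against the co-located reference block - and that alone is already more than 8000 threads for a single frame.

Code:
// Toy sketch (illustration only): one GPU thread per 16x16 block position
// computes the SAD against the co-located block of a reference frame.
__global__ void sad16x16(const unsigned char* cur, const unsigned char* ref,
                         int width, int height, int stride, unsigned int* out)
{
    int bx = blockIdx.x * blockDim.x + threadIdx.x;   // block column
    int by = blockIdx.y * blockDim.y + threadIdx.y;   // block row
    if (bx * 16 >= width || by * 16 >= height) return;

    unsigned int sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++) {
            int d = (int)cur[(by * 16 + y) * stride + bx * 16 + x]
                  - (int)ref[(by * 16 + y) * stride + bx * 16 + x];
            sad += (d < 0) ? -d : d;
        }
    out[by * (width / 16) + bx] = sad;
}

// Launch for a 1920x1088 frame: 120x68 = 8160 block positions, i.e. 8160 threads.
//   dim3 threads(16, 16);
//   dim3 grid((120 + 15) / 16, (68 + 15) / 16);
//   sad16x16<<<grid, threads>>>(d_cur, d_ref, 1920, 1088, 1920, d_sad);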


The GPU cores are many, but each of them is very limited. In particular, access to the (global) GPU memory is extremely slow, because it's not cached at all (except for texture memory).

Thus we must try to "hide" the slow memory accesses behind calculations, which means that we need many more GPU threads than we have GPU cores.

Well, each group/block of GPU cores has its own local "shared" memory that is fast, but the size of that per-block shared memory is small. Way too small for many things!

Also we can't sync the shared memories of different blocks, so whenever threads from different blocks need to "communicate", this needs to be done through the slow "global" memory.

Even organizing/synchronizing the threads within a block is a tough task, because "bad" memory access patterns can slow down your GPU program significantly!
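
As a (simplified, made-up) illustration of the shared-memory idea: each block stages a tile of the image in the fast on-chip memory once, synchronizes its own threads, and then works from that tile instead of hitting the slow global memory over and over.

Code:
// Simplified sketch (launched with 16x16 threads per block): stage a tile
// once in shared memory, then filter from fast on-chip memory.
__global__ void blur3x3(const unsigned char* src, unsigned char* dst,
                        int width, int height)
{
    __shared__ unsigned char tile[16][16];            // fast, but per-block and tiny

    int gx = blockIdx.x * 16 + threadIdx.x;
    int gy = blockIdx.y * 16 + threadIdx.y;

    if (gx < width && gy < height)
        tile[threadIdx.y][threadIdx.x] = src[gy * width + gx];

    __syncthreads();   // barrier for the threads of THIS block only;
                       // there is no such barrier across different blocks

    // Only interior threads of the block filter, so every neighbour they read
    // is already in the tile (a real kernel would also load an "apron" border).
    if (threadIdx.x > 0 && threadIdx.x < 15 && threadIdx.y > 0 && threadIdx.y < 15
        && gx < width - 1 && gy < height - 1) {
        int sum = 0;
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++)
                sum += tile[threadIdx.y + dy][threadIdx.x + dx];
        dst[gy * width + gx] = (unsigned char)(sum / 9);
    }
}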


Last but not least, the GPU cannot access the main/host memory at all. Hence the CPU program needs to upload all input data to the graphics device first and later download all the results.

That "host <-> device" data transfer is a serious bottleneck and means that you cannot run "small" functions on the GPU, even if they are a lot faster there.

What is it worth to complete a calculation in 1 ms instead of 10 ms if it takes 20 ms to upload/download the data to/from the graphics device? Yes, it's completely useless!

So if we move parts of our program to the GPU, these must be significant parts with enough "work" to justify the communication delay. It's not trivial to find such parts in your software.

Remember: those parts must also be highly parallelizable, and efficient parallel algorithms for the individual problem must exist (or must be developed).
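
To make the "upload / compute / download" round trip concrete, this is roughly what every GPGPU call looks like on the host side (CUDA runtime API, error checking stripped, kernel name made up):

Code:
#include <cstddef>
#include <cuda_runtime.h>

// Every piece of work sent to the GPU pays for these two copies across PCIe.
// If the kernel only saves a few milliseconds, the transfers can eat the gain.
void process_frame_on_gpu(const unsigned char* host_in, unsigned char* host_out,
                          size_t bytes)
{
    unsigned char *d_in = 0, *d_out = 0;
    cudaMalloc((void**)&d_in,  bytes);
    cudaMalloc((void**)&d_out, bytes);

    cudaMemcpy(d_in, host_in, bytes, cudaMemcpyHostToDevice);    // upload

    // some_kernel<<<grid, block>>>(d_in, d_out);                // do the actual work
    cudaThreadSynchronize();                                     // wait for completion

    cudaMemcpy(host_out, d_out, bytes, cudaMemcpyDeviceToHost);  // download

    cudaFree(d_in);
    cudaFree(d_out);
}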


See also:
http://developer.download.nvidia.com..._Guide_2.0.pdf
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊

Last edited by LoRd_MuldeR; 16th December 2009 at 13:43.
LoRd_MuldeR is offline   Reply With Quote
Old 16th December 2009, 10:58   #193  |  Link
Dark Shikari
x264 developer
 
Dark Shikari's Avatar
 
Join Date: Sep 2005
Posts: 8,666
Quote:
Originally Posted by LoRd_MuldeR View Post
And we are not talking about four or eight threads here. We are talking about hundreds or, even better, thousands of threads that run on the GPU.
More like 20,000.
Dark Shikari is offline   Reply With Quote
Old 16th December 2009, 11:30   #194  |  Link
nakTT
Registered User
 
Join Date: Dec 2008
Posts: 415
Thanks again, LoRd_MuldeR, for your informative post. I really enjoyed reading it.


nakTT is offline   Reply With Quote
Old 16th December 2009, 14:53   #195  |  Link
Limit
Registered User
 
Join Date: Oct 2005
Location: .DE
Posts: 15
I wonder if the next-generation CPU/GPU combo chips like Llano/Fusion would make any difference. The latencies should be significantly lower, although they are still connected over PCIe. What do you think, would such an APU be useful for x264?
Limit is offline   Reply With Quote
Old 16th December 2009, 15:21   #196  |  Link
nakTT
Registered User
 
Join Date: Dec 2008
Posts: 415
Quote:
Originally Posted by Limit View Post
I wonder if the next-generation CPU/GPU combo chips like Llano/Fusion would make any difference. The latencies should be significantly lower, although they are still connected over PCIe. What do you think, would such an APU be useful for x264?
IMHO an integrated GPU (be it just in the same package or on the same silicon) will be nowhere near the power of a high-end discrete GPU. Please note that those kinds of CPU-with-GPU chips are targeted at notebooks and other lightly demanding graphics usage, like office PCs.
nakTT is offline   Reply With Quote
Old 16th December 2009, 15:30   #197  |  Link
LoRd_MuldeR
Software Developer
 
LoRd_MuldeR's Avatar
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
Quote:
Originally Posted by Limit View Post
I wonder if the next-generation CPU/GPU combo chips like Llano/Fusion would make any difference. The latencies should be significantly lower, although they are still connected over PCIe. What do you think, would such an APU be useful for x264?
Well, it may make the bottleneck less critical, but it certainly doesn't remove it, as the basic architecture is still the same. Unless they use more PCIe lanes for the internal interconnect than they used for the "external" PCIe bus, there won't be much difference. And even if there is a difference, the way we call GPU kernels/programs is still the same: Upload input data from the host to the device, invoke the GPU kernel, wait for completion (while maybe doing other things on the CPU) and finally download the results from the device back to the host. Also I doubt that the combined CPU/GPU chip packages will contain very powerful GPUs. It will be more like what we have as "on board" graphics chips now. Not anywhere near high-end GPUs.

However, with NVidia's new "Fermi" GPU generation there will be significant improvements to the per-block "shared" GPU memory: it's now much larger, and it can (optionally) be used to cache accesses to the global GPU memory. This may (or may not) significantly help with specific problems. This is also one example of what I said before: future GPU generations may be more suitable for implementing video compression algorithms than the current generation. In the case of Fermi, I cannot tell you whether it helps video encoding or not. The codec gurus need to decide ^^
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊

Last edited by LoRd_MuldeR; 16th December 2009 at 15:48.
LoRd_MuldeR is offline   Reply With Quote
Old 16th December 2009, 16:14   #198  |  Link
Limit
Registered User
 
Join Date: Oct 2005
Location: .DE
Posts: 15
Quote:
Originally Posted by LoRd_MuldeR View Post
Well, it may make the bottleneck less critical, but it certainly doesn't remove it, as the basic architecture is still the same. Unless they use more PCIe lanes for the internal interconnect than they used for the "external" PCIe bus, there won't be much difference.
PCIe is a point-to-point link, so it should be possible to run the GPU's link at a much higher clock rate. Standard PCIe runs at 100 MHz. With the CPU, GPU and PCIe controller on the same die, a much higher frequency for the GPU's PCIe link should be only a small problem. For example, if you get it running at 1 GHz, you increase the bandwidth and decrease the latency by a factor of 10.

Quote:
Originally Posted by LoRd_MuldeR View Post
And even if there is a difference, the way we call GPU kernels/programs is still the same: Upload input data from the host to the device, call the kernel, wait for completion (while maybe doing other things on the CPU) and finally download the results from the device back to the host.
AFAIK the integrated GPUs have no memory of their own besides the small caches. So there is no need to copy data from host memory to device memory, because it is the same memory.

Quote:
Originally Posted by LoRd_MuldeR View Post
Also I doubt that the combined CPU/GPU chip packages will contain very powerful GPUs. It will be more like what we have as "on board" graphics chips now. Not anywhere near high-end GPUs.
That is clear. The last rumours I heard speak of 240 shader units for AMD/ATI's first-generation Fusion APU. That is far from high-end, but its computing power is still higher than any available CPU's.
Limit is offline   Reply With Quote
Old 16th December 2009, 16:28   #199  |  Link
LoRd_MuldeR
Software Developer
 
LoRd_MuldeR's Avatar
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
Quote:
Originally Posted by Limit View Post
AFAIK the integrated GPUs have no memory of their own besides the small caches. So there is no need to copy data from host memory to device memory, because it is the same memory.
Well, it then "shares" the RAM modules with the CPU - not to be confused with the on-chip shared memory. But this doesn't mean that the GPU can directly access the same memory locations that the CPU uses. We don't know it yet, but I would assume they simply "lock" a certain range of the physical main memory address space for the GPU. So we'd still have to copy the input data from the "regular" memory area (used by the CPU) over to some place in the memory area reserved for GPU - and the results need to copied back the same way.

Also, we are talking about Intel CPUs here, the upcoming "Arrandale" to be precise. So far Intel doesn't offer any GPGPU API for their GPUs. Until Intel does so (probably by making their GPUs accessible through OpenCL), we cannot use those combined CPU/GPU chips for anything but graphics output or video decoding at all! And if you look at the OpenCL API, it is defined similarly to the CUDA API. In particular, there is "host" memory that OpenCL kernels explicitly cannot access! And there's the "global" (device) memory, which all OpenCL kernels can access.

Quote:
Originally Posted by Limit View Post
PCIe is a point-to-point link, so it should be possible to run the GPU's link at a much higher clock rate. Standard PCIe runs at 100 MHz. With the CPU, GPU and PCIe controller on the same die, a much higher frequency for the GPU's PCIe link should be only a small problem. For example, if you get it running at 1 GHz, you increase the bandwidth and decrease the latency by a factor of 10.
That sounds like pure speculation. Unless there are some facts, I will assume that the "internal" PCIe-based interconnect will be roughly at the same level as "external" PCIe 2.0 is nowadays...
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊

Last edited by LoRd_MuldeR; 16th December 2009 at 16:58.
LoRd_MuldeR is offline   Reply With Quote
Old 16th December 2009, 16:39   #200  |  Link
ajp_anton
Registered User
 
ajp_anton's Avatar
 
Join Date: Aug 2006
Location: Stockholm/Helsinki
Posts: 805
Quote:
Originally Posted by LoRd_MuldeR View Post
Also, we are talking about Intel CPUs here, the upcoming "Arrandale" to be precise. So far Intel doesn't offer any GPGPU API for their GPUs. Until Intel does so (probably by making their GPUs accessible through OpenCL), we cannot use those combined CPU/GPU chips for anything but graphics output or video decoding at all!
Not to mention that when we say "the integrated GPUs aren't powerful", we mean the low end of AMD and Nvidia, which is still far ahead of what Intel has to offer =)
ajp_anton is offline   Reply With Quote