Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion. Before you start posting please read the forum rules. By posting to this forum you agree to abide by the rules. |
|
|
16th December 2009, 19:12 | #201 | Link |
Registered User
Join Date: Dec 2005
Posts: 106
|
Fermi has 128 KB of L2 cache per memory controller (768 KB total) that can be used to speed up thread block synchronization.
Also, MCP7X/GT200 and later GPUs can use system memory without copying it to dedicated (video) memory first, which can be a significant performance improvement; NVIDIA calls this "zero-copy access". Last edited by edison; 16th December 2009 at 19:15. |
16th December 2009, 20:48 | #202 | Link |
Software Developer
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
|
Yup, that's one of the new features introduced with CUDA 2.2; I don't know whether OpenCL will offer similar functionality. But it doesn't solve the fundamental problem! All that "zero-copy access" does is let the GPU fetch memory directly from the host process' memory space. This fetch is done across PCIe and bypasses the "global" device memory. But the speed of accessing the main ("host") memory from the GPU still isn't anywhere near that of the CPU, as all data must still go through the PCIe bus - which means limited bandwidth as well as additional latency! You may be able to "hide" the latency by doing "stream" processing: upload the next data element to the device while the current element is being processed. When the current element finishes processing, the next one has already arrived at the device, so the device doesn't need to wait - it can continue processing immediately. But whether stream processing is feasible depends highly on the individual application/problem.
Last but not least, it should be mentioned that "zero-copy access" is only supported by the GeForce 200 series and later, which currently excludes most CUDA-enabled devices!
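To make the latency-hiding idea concrete, here is a minimal host-side sketch of double buffering in plain Python. Note that `upload()` and `process()` are hypothetical stand-ins for an asynchronous host-to-device copy and a kernel launch, not real CUDA calls - only the overlap pattern itself is the point:

```python
from concurrent.futures import ThreadPoolExecutor

def upload(chunk):
    # Stand-in for a host-to-device copy (e.g. an async PCIe transfer).
    return list(chunk)

def process(device_chunk):
    # Stand-in for the GPU kernel working on the uploaded data.
    return sum(device_chunk)

def pipelined(chunks):
    """Overlap the upload of chunk i+1 with the processing of chunk i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as uploader:
        pending = uploader.submit(upload, chunks[0])
        for nxt in chunks[1:]:
            current = pending.result()              # wait for the upload to finish
            pending = uploader.submit(upload, nxt)  # start the next upload...
            results.append(process(current))        # ...while processing this one
        results.append(process(pending.result()))
    return results
```

If upload and processing take similar time per chunk, the transfer cost is almost entirely hidden; if either side dominates, the pipeline degenerates to the slower of the two.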
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊ Last edited by LoRd_MuldeR; 16th December 2009 at 20:55. |
17th December 2009, 03:13 | #204 | Link | |
Registered User
Join Date: Oct 2008
Posts: 39
|
Quote:
Furthermore, I find a "screenshot" completely inadequate, given that any single image from encoded video may not be indicative of the overall quality. SSIM or some other objective measurement should be used over the duration of the video, and in a perfect world you'd report a box plot of cumulative SSIM/PSNR/whatever scores. That box plot would communicate A) outliers (which correspond to particularly bad frames), B) the median SSIM, and C) how consistent image quality is, based on the quartiles/whiskers. I digress, and I'm open to the possibility that I'm being a pedantic jerk. But I find this method of posting screenshots of single frames to be almost completely devoid of merit. Last edited by kidjan; 17th December 2009 at 03:23. |
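The box-plot summary described above can be sketched in a few lines with the Python standard library - `frame_scores` is assumed to be a list of per-frame SSIM (or PSNR) values obtained elsewhere:

```python
import statistics

def score_summary(frame_scores):
    """Box-plot style summary of per-frame quality scores (SSIM, PSNR, ...).

    Returns the median, the lower/upper quartiles and any outlier frames -
    i.e. the numbers a box plot communicates: typical quality, consistency,
    and particularly bad (or good) frames.
    """
    q1, med, q3 = statistics.quantiles(frame_scores, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the usual whisker rule
    outliers = [s for s in frame_scores if s < lo or s > hi]
    return {"median": med, "q1": q1, "q3": q3, "outliers": outliers}
```

A clip with one badly encoded scene shows up immediately as low outliers, even when the median looks fine.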
|
17th December 2009, 16:30 | #205 | Link | |||
Registered User
Join Date: Mar 2005
Location: Finland
Posts: 2,641
|
Quote:
Quote:
Quote:
|
|||
17th December 2009, 22:43 | #206 | Link | ||
Registered User
Join Date: Oct 2008
Posts: 39
|
Quote:
Quote:
Lastly, if someone is going to use the mean-opinion-score approach being taken in this thread, they should stop labeling the encoder output. The labels bias the results, since observers know which encoder produced which output. Assuming the MOS approach is preferable (again, I disagree), this thread is a wonderful example of what not to do in experimental design. And now I'm definitely a pedantic jerk. Sorry. Will stop posting now. |
||
18th December 2009, 00:28 | #207 | Link | |||
Registered User
Join Date: Mar 2005
Location: Finland
Posts: 2,641
|
Quote:
Quote:
Quote:
I'd only use PSNR and SSIM for measuring the effect of certain (non-psy) parameters of an encoder, against itself. Last edited by nm; 18th December 2009 at 00:32. |
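For reference, per-frame PSNR is straightforward to compute; a minimal sketch for 8-bit samples, with frames as flat lists (purely illustrative, real tools operate on planes):

```python
import math

def psnr(frame_a, frame_b, peak=255.0):
    """PSNR in dB between two equally sized 8-bit frames (flat sample lists)."""
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak * peak / mse)
```

This is exactly why PSNR is blind to psy optimizations: it only scores per-sample error energy, with no notion of where in the image the error lands.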
|||
18th December 2009, 08:42 | #208 | Link | |
Registered User
Join Date: Oct 2008
Posts: 39
|
Quote:
Regardless, it doesn't change the fact that the methods used in this thread are completely without scientific merit. If they're going to do MOS, at least do it with some semblance of rigor. Last edited by kidjan; 18th December 2009 at 08:49. |
|
18th December 2009, 11:10 | #209 | Link | |
Software Developer
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
|
Quote:
You can try it out yourself: encode the same source clip with "--ssim --tune film" and with "--ssim --tune ssim" (the latter includes "--no-psy"). Then you'll see... (Side note: I'm currently developing an SSIM-based metric for a specific application and I'm having a hard time measuring a certain "effect" - one that is clearly visible to human viewers.)
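For the curious, a single-window ("global") SSIM can be sketched as below. This is a simplification: real SSIM implementations slide a (often Gaussian-weighted) window over the image and average the local scores, so this only approximates the usual per-frame value:

```python
import statistics

def global_ssim(x, y, peak=255.0):
    """Single-window SSIM over two flat sample lists (no sliding window)."""
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2  # standard stabilizers
    mx, my = statistics.fmean(x), statistics.fmean(y)
    vx = statistics.fmean([(a - mx) ** 2 for a in x])
    vy = statistics.fmean([(b - my) ** 2 for b in y])
    cov = statistics.fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx * mx + my * my + c1) * (vx + vy + c2))
```

Identical inputs score 1.0; losing all structure (e.g. a flat frame) pulls the covariance term toward zero and the score well below 1.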
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊ Last edited by LoRd_MuldeR; 18th December 2009 at 11:26. |
|
23rd December 2009, 02:41 | #210 | Link | |
Registered User
Join Date: Oct 2008
Posts: 39
|
Quote:
I'd be interested to see how you arrived at these results (i.e. what counts as "significant", how you determined that MOS dropped, etc.), but maybe in a different thread; I've already disrupted this one enough. Thanks for the feedback. |
|
30th December 2009, 02:24 | #211 | Link | |
Registered User
Join Date: Jan 2003
Posts: 20
|
Quote:
The GT200 on the other hand is a more complicated scenario, but it sounds like it might still be helpful in some encoding situations. Last edited by sethk; 30th December 2009 at 02:28. |
|
30th December 2009, 02:39 | #212 | Link | ||
Software Developer
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
|
Quote:
Sure, it will beat everything that Intel can deliver currently, but unfortunately that doesn't mean much... Quote:
All that "zero-copy access" does is remove one step: data doesn't need to be uploaded to "global" memory before it can be copied to the "shared" memory of an individual block - it can now go to the "shared" memory of the destination block immediately. That avoids one indirection in some cases, but certainly not in all cases. Where more than one thread block needs to access the data, we still need to store it in "global" memory. And the data still needs to be uploaded via PCIe, which IMO is the most important problem!

For a project I was working on, we had to move calculations to the GPU although they were slower(!) on the GPU than on the CPU: downloading all the intermediate data to the CPU (host memory) was so slow that we had to do everything on the GPU (global device memory) and only download the final result.

So believe me: the bottleneck of having to upload/download all the data through the PCIe bus (which currently is the only way on all competitive GPUs) is not just some "theoretical" thing. It's something you will encounter in reality! And it can easily kill all the "nice" speedup you expected from GPGPU processing. People are even trying to transfer data from one GPU board to another via DVI to avoid PCIe ^^
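A back-of-envelope model shows why the PCIe hop hurts. The bandwidth and latency figures below are rough, assumed round numbers (nominal PCIe 2.0 x16 versus device memory of that era), not measurements:

```python
def transfer_ms(megabytes, gb_per_s, latency_us=10.0):
    """Rough time to move a buffer across a bus: fixed latency + size/bandwidth."""
    return latency_us / 1000.0 + megabytes / (gb_per_s * 1024.0) * 1000.0

# A 1080p YV12 frame is 1920*1080*1.5 bytes, i.e. roughly 2.97 MB.
frame_mb = 1920 * 1080 * 1.5 / (1024 * 1024)
pcie = transfer_ms(frame_mb, 8.0)     # ~8 GB/s: nominal PCIe 2.0 x16
vram = transfer_ms(frame_mb, 100.0)   # ~100 GB/s: assumed device-memory figure
```

Per frame the PCIe hop costs an order of magnitude more than a device-memory access, and a pipeline that bounces intermediate data back and forth pays it on every round trip.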
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊ Last edited by LoRd_MuldeR; 30th December 2009 at 03:05. |
||
1st January 2010, 01:27 | #213 | Link | |
Registered User
Join Date: Jan 2003
Posts: 20
|
Quote:
As you say, this is all academic because of the low shader pipeline count on this solution. For the GT200, even with the high latency induced by accessing main memory through the PCIe bus, it has enough bandwidth to main memory (8 GB/s) that I imagine it could be used in creative ways in conjunction with local memory.

Since the GT200 has 1 GB of local memory, you could certainly use the local graphics memory (GDDR) to buffer large groups of uncompressed frames: have the main CPU handle decompression and stream the uncompressed frames to GPU memory (which could hold a large enough buffer of frames to handle backwards and forwards frame access), then write the compressed data back to main memory without hitting a PCIe memory bottleneck. Since most of the work would be done on uncompressed frames, and that could happen in GPU memory instead of main memory, the PCIe latency may not be as big a deal as if you tried to do everything in main memory. I haven't attempted any of this, and I may be off base in my thinking, but it's what I pictured as a reasonable approach. Last edited by sethk; 1st January 2010 at 01:30. |
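A quick sanity check of the buffering idea - the YV12 frame size and the 1 GB budget are the only inputs, and the helper is purely illustrative:

```python
def frames_in_buffer(mem_bytes, width, height, bytes_per_pixel=1.5):
    """How many uncompressed YV12 frames (1.5 bytes/pixel) fit in a memory budget."""
    return mem_bytes // int(width * height * bytes_per_pixel)

# With ~1 GB of GT200 local memory, 1080p YV12 frames:
capacity = frames_in_buffer(1 << 30, 1920, 1080)  # several seconds of video
```

Roughly 345 frames fit, which is more than enough lookahead/lookback for motion estimation across a GOP without touching host memory.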
|
1st January 2010, 19:52 | #214 | Link |
Registered User
Join Date: Jan 2006
Posts: 30
|
As stated in CUDA2.2PinnedMemoryAPIs.pdf, mapped pinned memory is always beneficial for integrated GPUs (and AFAIK it's supported by all CUDA-capable integrated GPUs, since all that's needed is for the driver to map the memory the GPU uses into the application's VM space). Discrete GPUs are completely different: mapped pinned memory is pretty much only beneficial over copying the data to device memory if the data is accessed only once and all accesses are coalesced. Thus, for most kernels, zero-copy is slower on discrete GPUs.
Last edited by yuvi; 1st January 2010 at 19:54. |
2nd January 2012, 11:09 | #216 | Link |
Constant Quality..
Join Date: Oct 2009
Location: China
Posts: 4
|
If anybody noticed: the bitrate in x264 peaks at 2229 kbps(!) to make an average of 900 kbps...
The CUDA clip peaks at 1776 kbps, which means less variability, and thus MORE bitrate for low-motion, low-contrast scenes. Just think: to have scenes at 2230 kbps and yet an average of 900 kbps, you have to have a lot of scenes at, let's say, only 300 kbps. With a peak at 1780 kbps you can have low-motion scenes at maybe 500 kbps.

Anyway, comparing a FRAME without knowing the FRAME SIZE the encoder used is pointless! A fair comparison would look at one whole GOP structure (I-frame and dependent B/P-frames), have both encoders encode the same total size for all the frames in the GOP, and then look at the quality.

The more variability a bitrate has, the more will be allocated to high-contrast, high-motion scenes and the less to low-contrast, low-motion scenes. The grabbed picture looks extremely dark and low-contrast, probably without much motion, which means x264 would have allocated MORE bitrate to better-lit and higher-motion scenes. Since it's not known how the rest of the clip looks, the rest might well look better for x264! When comparing only single frames, the SIZE OF THE FRAME should be equal for the comparison to be fair! (The frame size can be looked up by enabling the OSD in ffdshow.) Indeed a pointless and unfair comparison! |
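The arithmetic above can be made explicit. The 20% peak-time fraction below is an assumed, illustrative number; the helper just solves the weighted average for the remaining scenes:

```python
def low_scene_rate(avg_kbps, peak_kbps, peak_fraction):
    """Average bitrate left for the remaining scenes when `peak_fraction` of
    the clip runs at the peak bitrate and the overall average must still hold."""
    return (avg_kbps - peak_fraction * peak_kbps) / (1.0 - peak_fraction)

# Assumed: 20% of the clip runs at the peak. With a 900 kbps overall average:
x264_rest = low_scene_rate(900, 2230, 0.2)  # kbps left for the other 80%
cuda_rest = low_scene_rate(900, 1780, 0.2)  # kbps left for the other 80%
```

Under this assumption the x264 clip leaves about 568 kbps for its quiet scenes while the CUDA clip leaves about 680 kbps, which is exactly the variability difference described above.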
2nd January 2012, 14:49 | #217 | Link |
Registered User
Join Date: Apr 2002
Location: Germany
Posts: 5,391
|
Ok, then let's look at full video streams. No change - the CUDA encode still just looks like crap.
__________________
- We´re at the beginning of the end of mankind´s childhood - My little flickr gallery. (Yes indeed, I do have hobbies other than digital video!) |
2nd January 2012, 14:53 | #218 | Link |
Registered User
Join Date: Apr 2002
Location: Germany
Posts: 4,926
|
I can speak from experience with the Nvidia encoder, and it was the most interesting one to follow: I analyzed almost every change Nvidia's engineers made to it, driver revision by driver revision, saw a lot of bugs go by, and was impressed that it was the first to support FRExt. That lasted until QuickSync came up, which is improving rapidly and just got another quality boost.
Though I would really like to see how ORBX http://us.download.nvidia.com/downlo...8_GTC2010..wmv currently compares to H.264 on the GPU, since it was designed entirely for the GPU from the ground up. Jules has brought some powerful pieces together in recent years, bringing Paul Debevec's Light Stage HDR rendering research on board for his cloud vision and engine work, alongside his GPU video codec work. AMD may also be coming back: despite their far-from-good (research-wise) performance in recent years, they seem to have some interesting things coming out of the labs this year. The real-time deshaking introduction was a small surprise, and now its v2.0 improvement in their upcoming GPUs could also give their general video encoding a big boost.
__________________
all my compares are riddles so please try to decipher them yourselves :) It is about Time Join the Revolution NOW before it is to Late ! http://forum.doom9.org/showthread.php?t=168004 Last edited by CruNcher; 2nd January 2012 at 16:49. |
|
|