Old 16th December 2009, 19:12   #201  |  Link
edison
Registered User
 
Join Date: Dec 2005
Posts: 106
Fermi has 128 KB of L2 cache per memory controller (768 KB total) that can be used to speed up thread-block synchronization.

Also, MCP7x/GT200 and later GPUs can use system memory without copying it to dedicated (video) memory, which can give a significant performance improvement; NVIDIA calls this "zero-copy access".
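
In CUDA runtime terms the setup looks roughly like this (a minimal sketch with a placeholder kernel and buffer size; error checking omitted):

Code:
// Minimal zero-copy setup with the CUDA runtime API.
#include <cuda_runtime.h>

__global__ void scale(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;            // placeholder work
}

int main()
{
    const int n = 1 << 20;

    cudaSetDeviceFlags(cudaDeviceMapHost);       // must happen before the context is created

    float *h_in = 0;                             // pinned host memory the GPU can address directly
    cudaHostAlloc((void**)&h_in, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = 0;                             // device-side alias of the same host buffer
    cudaHostGetDevicePointer((void**)&d_in, h_in, 0);

    float *d_out = 0;
    cudaMalloc((void**)&d_out, n * sizeof(float));

    // The kernel reads h_in across the bus / memory controller - no cudaMemcpy of the input.
    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    cudaFreeHost(h_in);
    return 0;
}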

Old 16th December 2009, 20:48   #202  |  Link
LoRd_MuldeR
Software Developer
 
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
Yup, that's one of the new features introduced with CUDA 2.2; I don't know whether OpenCL will have similar functionality. But it doesn't solve the fundamental problem! All that "zero-copy access" does is this: the GPU can now fetch memory directly from the host process' memory space. This fetch is done across PCIe and bypasses the "global" device memory. But the GPU still can't access main ("host") memory anywhere near as fast as the CPU can, as all data must still go through the PCIe bus - this means limited bandwidth as well as additional latency! You may be able to "hide" the latency by doing "stream" processing: upload the next data element to the device while the current one is being processed. When the current element is finished, the next one has already arrived at the device, so the device doesn't need to wait - it can continue processing immediately. But whether stream processing is feasible depends highly on the individual application/problem.
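
A minimal double-buffering sketch of that idea in CUDA could look like this (the kernel and chunk size are placeholders, the source buffer is assumed to be pinned, and error checking is omitted):

Code:
// Overlap PCIe uploads with kernel execution using two CUDA streams.
// h_src must be pinned (cudaHostAlloc/cudaHostRegister) for the async copies to overlap.
#include <cuda_runtime.h>

__global__ void processChunk(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 0.5f;                   // placeholder work
}

void processAll(const float *h_src, int chunks, int chunkElems)
{
    size_t bytes = (size_t)chunkElems * sizeof(float);
    cudaStream_t stream[2];
    float *d_buf[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc((void**)&d_buf[i], bytes);
    }

    for (int c = 0; c < chunks; ++c) {
        int s = c & 1;                           // ping-pong between buffers/streams
        // Upload chunk c in one stream while the previous chunk is still processing in the other.
        cudaMemcpyAsync(d_buf[s], h_src + (size_t)c * chunkElems, bytes,
                        cudaMemcpyHostToDevice, stream[s]);
        processChunk<<<(chunkElems + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], chunkElems);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaFree(d_buf[i]);
        cudaStreamDestroy(stream[i]);
    }
}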

Last but not least it should be mentioned that "zero-copy access" is only supported by the GeForce 200 series and later, which currently excludes most CUDA-enabled devices!
Old 16th December 2009, 21:59   #203  |  Link
slavickas
I'm Shpongled
 
 
Join Date: Nov 2001
Location: Lithuania
Posts: 303
>Last but not least it should be mentioned that "zero-copy access" is only supported by the GeForce 200 series and later
Minus the GTS 250, thanks to renaming.
Old 17th December 2009, 03:13   #204  |  Link
kidjan
Registered User
 
 
Join Date: Oct 2008
Posts: 39
Quote:
Originally Posted by Puncakes View Post
I don't know about you, but considering the fact that those measurements are useless for comparing actual visual quality, I think I'd rather have screenshots.
You're not comparing "actual visual quality"--you're comparing a reference image (i.e. input to the encoder) with an outputted image (i.e. decoded output). If you were comparing actual visual quality, we'd all be scrutinizing the quality of the input material.

Furthermore, I find a "screenshot" completely inadequate, given that any singular image from encoded video may not be indicative of the overall quality. SSIM or some other objective measurement should be used over the duration of the video, and in a perfect world you'd return a box-plot of the per-frame SSIM/PSNR/whatever scores. That box plot would communicate A) outliers (which correspond to particularly bad frames), B) median SSIM, and C) how consistent image quality is based on the quartiles/whiskers.
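
To make that concrete, here is a small sketch of the summary I have in mind, assuming you already have one SSIM score per frame (the input format is made up; any per-frame log would do):

Code:
// Summarize per-frame SSIM scores the way a box plot would:
// median, quartiles and 1.5*IQR low outliers (i.e. particularly bad frames).
#include <algorithm>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<double> ssim;                    // one score per frame, read from stdin
    double v;
    while (std::scanf("%lf", &v) == 1)
        ssim.push_back(v);
    if (ssim.empty()) return 1;

    std::sort(ssim.begin(), ssim.end());
    // Nearest-rank quantile approximation; good enough for a summary.
    auto quantile = [&](double q) { return ssim[(size_t)(q * (ssim.size() - 1))]; };

    double q1 = quantile(0.25), med = quantile(0.50), q3 = quantile(0.75);
    double lowFence = q1 - 1.5 * (q3 - q1);      // classic box-plot whisker limit

    size_t badFrames = 0;
    for (size_t i = 0; i < ssim.size(); ++i)
        if (ssim[i] < lowFence) ++badFrames;

    std::printf("median %.4f  Q1 %.4f  Q3 %.4f  low outliers: %zu of %zu frames\n",
                med, q1, q3, badFrames, ssim.size());
    return 0;
}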

I digress, and I'm open to the possibility that I'm being a pedantic jerk. But I find this method of posting screen shots of singular frames to be almost completely devoid of merit.

Old 17th December 2009, 16:30   #205  |  Link
nm
Registered User
 
Join Date: Mar 2005
Location: Finland
Posts: 2,641
Quote:
Originally Posted by kidjan View Post
You're not comparing "actual visual quality"--you're comparing a reference image (i.e. input to the encoder) with an outputted image (i.e. decoded output). If you were comparing actual visual quality, we'd all be scrutinizing the quality of the input material.
He means the actual (subjective) visual quality compared to the reference, of course.

Quote:
Furthermore, I find a "screenshot" completely inadequate, given that any singular image from encoded video may not be indicative of the overall quality.
I agree, although not completely. I think screenshots are more useful than SSIM/PSNR scores or graphs when comparing different encoders.

Quote:
SSIM or some other objective measurement should be used over the duration of the video, and in a perfect world you'd return a box-plot of the per-frame SSIM/PSNR/whatever scores. That box plot would communicate A) outliers (which correspond to particularly bad frames), B) median SSIM, and C) how consistent image quality is based on the quartiles/whiskers.

I digress, and I'm open to the possibility that I'm being a pedantic jerk. But I find this method of posting screen shots of singular frames to be almost completely devoid of merit.
"Objective" measures such as PSNR or SSIM do not represent visual quality very well, so I'd argue that posting them as box-plots or graphs is also almost completely devoid of merit. I prefer full videos and randomly selected screenshots (with matched frametypes).
Old 17th December 2009, 22:43   #206  |  Link
kidjan
Registered User
 
 
Join Date: Oct 2008
Posts: 39
Quote:
Originally Posted by nm
He means the actual (subjective) visual quality compared to the reference, of course.
No, he does not mean the "visual quality" compared to the reference. He means the "visual similarity" compared to the reference. Quality is the wrong word.

Quote:
Originally Posted by nm
"Objective" measures such as PSNR or SSIM do not represent visual quality very well, so I'd argue that posting them as box-plots or graphs is also almost completely devoid of merit. I prefer full videos and randomly selected screenshots (with matched frametypes).
Of course they don't represent "visual quality"; that isn't what SSIM or PSNR are used for, nor is it what the comparisons here are really about. Again: the comparisons are between a reference image and a decoded output. That does not need to be a "qualitative" comparison, nor should it be. For this sort of task, an objective measure is clearly preferable to improperly conducted ad-hoc comparisons of single frames, because it allows us to quantify how good a job the encoder did of matching the input.
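
For reference, the usual single-scale SSIM index between two aligned patches x and y (one from the reference, one from the decode) is

SSIM(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}

where the \mu are local means, the \sigma^2 local variances, \sigma_{xy} the covariance, and C_1, C_2 small stabilizing constants. It reaches 1.0 only where the decode matches the reference exactly, which is precisely the kind of "how well did the encoder match the input" number I mean.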

Lastly, if someone is going to take the Mean-Opinion-Score approach being used in this thread, the encoder outputs shouldn't be labeled. Labeling biases the results, since observers know which encoder produced which output. Even assuming the MOS approach is preferable (again, I disagree), this thread is a wonderful example of what not to do in experimental design.

And now I'm definitely a pedantic jerk. Sorry. Will stop posting now.
Old 18th December 2009, 00:28   #207  |  Link
nm
Registered User
 
Join Date: Mar 2005
Location: Finland
Posts: 2,641
Quote:
Originally Posted by kidjan View Post
No, he does not mean the "visual quality" compared to the reference. He means the "visual similarity" compared to the reference. Quality is the wrong word.
You can replace the word "quality" with "similarity" if it pleases you. I doubt you'll get other people to use it though, since the term "quality" is pretty established in this context. See http://en.wikipedia.org/wiki/Video_quality and some of the references listed there, for example. "Similarity" is used in content-based retrieval and pattern matching research.

Quote:
Of course they don't represent "visual quality"; that isn't what SSIM or PSNR are used for, nor is it what the comparisons here are really about. Again: the comparisons are between a reference image and a decoded output.
Yes, we all know what these comparisons are about. You don't need to get that worked up about terminology; it's beside the point and completely irrelevant.

Quote:
That does not need to be a "qualitative" comparison, nor should it be. For this sort of task, an objective measure is clearly preferable to improperly conducted ad-hoc comparisons of single frames, because it allows us to quantify how good a job the encoder did of matching the input.
The issue is that not all encoders try to optimize for PSNR or SSIM. Some have other "psy" optimizations that produce significantly worse SSIM scores but higher visual similarity.

I'd only use PSNR and SSIM for measuring the effect of certain (non-psy) parameters of an encoder, against itself.

Old 18th December 2009, 08:42   #208  |  Link
kidjan
Registered User
 
 
Join Date: Oct 2008
Posts: 39
Quote:
Originally Posted by nm View Post
Some have other "psy" optimizations that produce significantly worse SSIM scores but higher visual similarity.
I'm skeptical of this claim. On what information do you base this assertion?

Regardless, it doesn't change the fact that the methods used in this thread are completely without scientific merit. If people are going to do MOS, they should at least do it with some semblance of rigor.

Old 18th December 2009, 11:10   #209  |  Link
LoRd_MuldeR
Software Developer
 
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
Quote:
Originally Posted by kidjan View Post
I'm skeptical of this claim. On what information do you base this assertion?
x264 development! When psy optimizations were added, the subjective quality improved significantly, as most users agreed. At the same time the SSIM metric dropped significantly.

You can try it out yourself: Encode the same source clip with "--ssim --tune film" and with "--ssim --tune ssim" (the latter includes "--no-psy"). Then you'll see...

(Side note: I'm currently developing an SSIM-based metric for a specific application, and I'm having a hard time measuring a certain "effect" - one that is clearly visible to human viewers)
Old 23rd December 2009, 02:41   #210  |  Link
kidjan
Registered User
 
 
Join Date: Oct 2008
Posts: 39
Quote:
Originally Posted by LoRd_MuldeR View Post
x264 development! When psy optimizations were added, the subjective quality improved significantly, as most users agreed. At the same time the SSIM metric dropped significantly.

You can try it out yourself: Encode the same source clip with "--ssim --tune film" and with "--ssim --tune ssim" (the latter includes "--no-psy"). Then you'll see...

(Side note: I'm currently developing an SSIM-based metric for a specific application, and I'm having a hard time measuring a certain "effect" - one that is clearly visible to human viewers)
Thanks Mulder.

I'd be interested to see how you arrived at these results (i.e. what is "significant", how did you determine MOS dropped, etc.), but maybe in a different thread - I've already disrupted this one enough. Thanks for the feedback.
Old 30th December 2009, 02:24   #211  |  Link
sethk
Registered User
 
Join Date: Jan 2003
Posts: 20
Quote:
Originally Posted by LoRd_MuldeR View Post
Yup, that's one of the new features introduced with CUDA 2.2; I don't know whether OpenCL will have similar functionality. But it doesn't solve the fundamental problem! All that "zero-copy access" does is this: the GPU can now fetch memory directly from the host process' memory space. This fetch is done across PCIe and bypasses the "global" device memory. But the GPU still can't access main ("host") memory anywhere near as fast as the CPU can, as all data must still go through the PCIe bus - this means limited bandwidth as well as additional latency! You may be able to "hide" the latency by doing "stream" processing: upload the next data element to the device while the current one is being processed. When the current element is finished, the next one has already arrived at the device, so the device doesn't need to wait - it can continue processing immediately. But whether stream processing is feasible depends highly on the individual application/problem.

Last but not least it should be mentioned that "zero-copy access" is only supported by the GeForce 200 series and later, which currently excludes most CUDA-enabled devices!
Zero-copy access is also available on the MCP79x series of integrated GPUs, where the integrated GPU is also the memory controller (northbridge) for the CPU, and my understanding is that it doesn't have much of a penalty on this family of chips. Data accessed this way does not need to be copied to another area before being accessed, although that is an option. In fact, the whole point of zero-copy access is to have no copying going on. (See post here).

The GT200 on the other hand is a more complicated scenario, but it sounds like it might still be helpful in some encoding situations.

Old 30th December 2009, 02:39   #212  |  Link
LoRd_MuldeR
Software Developer
 
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
Quote:
Originally Posted by sethk View Post
Zero-copy access is also available on the MCP79x series integrated GPUs, where the integrated GPU is also the memory controller (northbridge) for the CPU and my understanding is it doesn't have much of a penalty in use on this family of chips. Data accessed using this command does not need to be copied to another area before being accessed, although that is an option. In fact the whole point of zero-copy access is to .. have no copying going on. (See post here).
And that's a low-performance "on board" chip, which isn't anywhere near NVIDIA's "high end" boards. I wouldn't expect great performance from that one.

Sure, it will beat everything that Intel can deliver currently, but unfortunately that doesn't mean much...

Quote:
Originally Posted by sethk View Post
The GT200 on the other hand is a more complicated scenario, but it sounds like it might still be helpful in some encoding situations.
No, it doesn't solve the fundamental problem at all. All data still needs to go through the "slow" PCIe bus, twice! That means a serious bottleneck, bandwidth-wise and delay-wise.

All that "zero-copy access" does is this: data doesn't need to be uploaded to "global" memory before it can be copied to the "shared" memory of the individual block; it can now go to the "shared" memory of the destination block immediately. That avoids one indirection in some cases, but certainly not in all cases. In cases where more than one thread block needs to access the data, we still need to store it in "global" memory. Also, the data still needs to be uploaded via PCIe, which IMO is the most important problem! For a project I was working on we had to move calculations to the GPU even though they were slower(!) on the GPU than on the CPU, because downloading all the intermediate data to the CPU (host memory) was so slow that we had to do everything on the GPU (in global device memory) and only download the final result. So believe me: the bottleneck of having to upload/download all the data through the PCIe bus (which currently is the only way on all competitive GPUs) is not just some "theoretical" thing. It's something you will encounter in reality! And it's something that can easily kill all the "nice" speedup you expected from GPGPU processing. People are even trying to transfer data from one GPU board to another via DVI to avoid PCIe ^^
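
Schematically, what we ended up with was something like this (the kernels and the "work" they do are placeholders; the point is that only the first and the last memcpy ever touch PCIe):

Code:
// Chain every processing step on device memory and download only the final result,
// so that no intermediate data ever crosses the PCIe bus.
#include <cuda_runtime.h>

__global__ void stepA(const float *in, float *out, int n)
{ int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) out[i] = in[i] + 1.0f; }

__global__ void stepB(const float *in, float *out, int n)
{ int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) out[i] = in[i] * 2.0f; }

void pipeline(const float *h_in, float *h_result, int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *d_a = 0, *d_b = 0;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);

    cudaMemcpy(d_a, h_in, bytes, cudaMemcpyHostToDevice);     // the only upload
    int blocks = (n + 255) / 256;
    stepA<<<blocks, 256>>>(d_a, d_b, n);                      // intermediate stays in d_b...
    stepB<<<blocks, 256>>>(d_b, d_a, n);                      // ...and then in d_a
    cudaMemcpy(h_result, d_a, bytes, cudaMemcpyDeviceToHost); // the only download

    cudaFree(d_a);
    cudaFree(d_b);
}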
Old 1st January 2010, 01:27   #213  |  Link
sethk
Registered User
 
Join Date: Jan 2003
Posts: 20
Quote:
Originally Posted by LoRd_MuldeR View Post
And that's a low-performance "on board" chip, which isn't anywhere near NVidia's "high end" boards. I wouldn't expect great performance from that one

Sure, it will beat everything that Intel can deliver currently, but unfortunately that doesn't mean much...



No, it doesn't solve the fundamental problem at all. All data still needs to go through the "slow" PCIe bus, twice! That means a serious bottleneck, bandwidth-wise and delay-wise.

All that "zero-copy access" does is: Data doesn't need to be uploaded to "global" memory before it can be copied to the "shared" memory of the individual block. It can go to the "shared" memory of the destination block immediately now. That avoids one indirection in some cases, but certainly not in all cases. In cases where more than one single thread block needs to access the data, we still need to store it in "global" memory. Also the data still needs to be uploaded via PCIe, which IMO is the most important problem! For a project I was working on we had to move calculations to the GPU, although they were slower(!) on the GPU than on the CPU. But downloading all the intermediate data to the CPU (host memory) was so slow, that we had to do everything on the GPU (global device memory) and only downloaded the final result. So believe me: The bottleneck of having to upload/download all the data through the PCIe bus (which currently is the only way on all competitive GPU's) is not just some "theoretical" thing. It's something you will encounter in reality! And it's something that can easily kill all the "nice" speedup you did expect by GPGPU processing. People are even trying to transfer data from one GPU board to another GPU board via DVI to avoid PCIe ^^
Definitely true about the MCP being a slow solution - I didn't mean to imply otherwise, but I mention it for two reasons: one, the GT200 is not the only current solution supporting zero-copy, and two, the MCP doesn't have the same latency issues with zero-copy access.

As you say, this is all academic because of the low shader pipeline count on this solution.

For the GT200, even with the high latency induced by accessing main memory through the PCIe bus, there is enough bandwidth to main memory (8 GB/s) that I imagine it could be used in creative ways in conjunction with local memory.

The GT200 has 1 GB of local memory, so you could certainly use the local graphics memory (GDDR) to buffer large groups of uncompressed frames: have the main CPU handle decompression and stream the uncompressed frames to GPU memory (which could hold a large enough buffer of frames to handle backwards and forwards frame access), and write the compressed data back to main memory without hitting a PCIe memory bottleneck. Since most of the work would be done on uncompressed frames, and that could happen in GPU memory instead of main memory, the PCIe latency may not be as big a deal as if you tried to do everything in main memory.

I haven't attempted any of this, and I may be off-base in my thinking, but it was what I pictured as a reasonable approach.

Old 1st January 2010, 19:52   #214  |  Link
yuvi
Registered User
 
Join Date: Jan 2006
Posts: 30
As stated in CUDA2.2PinnedMemoryAPIs.pdf, mapped pinned memory is always beneficial for integrated GPUs (and AFAIK it is supported by all CUDA-capable integrated GPUs, since all that's needed is for the driver to map the memory the GPU uses into the application's VM space). Discrete GPUs are completely different: mapped pinned memory is pretty much only beneficial over copying the data to device memory if the data is accessed only once and all accesses are coalesced. Thus, for most kernels, zero-copy is slower on discrete GPUs.
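
In kernel terms, the "good" case is the one where each mapped element is touched exactly once and neighbouring threads read neighbouring addresses; a rough sketch of both cases (the arithmetic is just a placeholder):

Code:
#include <cuda_runtime.h>

// Zero-copy-friendly pattern: thread i reads mapped element i exactly once (coalesced),
// so each byte of host memory crosses the bus a single time.
__global__ void goodZeroCopy(const float *mappedIn, float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] = mappedIn[i] * 0.5f;
}

// Anti-pattern for zero-copy: every thread re-reads a window of neighbours, so the same
// host data is fetched over the bus many times. Stage this input in device memory instead.
__global__ void badZeroCopy(const float *mappedIn, float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = -3; k <= 3; ++k) {          // 7 bus reads per output element
            int j = min(max(i + k, 0), n - 1);
            acc += mappedIn[j];
        }
        d_out[i] = acc / 7.0f;
    }
}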

Old 16th September 2010, 22:46   #215  |  Link
DiKey
Videomaker
 
Join Date: May 2008
Location: Russia, Engels town
Posts: 44
Is there anything new regarding Fermi's capabilities? Any new programs or algorithms?
Old 2nd January 2012, 11:09   #216  |  Link
Pakmenu
Constant Quality..
 
Join Date: Oct 2009
Location: China
Posts: 4
Has anybody noticed that the bitrate of the x264 clip peaks at 2229 kbps(!) to make an average of 900 kbps?
The CUDA clip peaks at 1776 kbps, which means less variability, and thus MORE bitrate for low-motion, low-contrast scenes.
Just think: to have scenes at 2230 kbps and yet an average of 900 kbps, you have to have a lot of scenes at, say, only 300 kbps. With a peak of 1780 kbps you can have low-motion scenes at maybe 500 kbps.

Anyway: comparing a FRAME without knowing the FRAME SIZE the encoder used is pointless! A fair comparison would look at one whole GOP structure (the I-frame and its dependent B/P-frames), have both encoders spend the same total size on all the frames in the GOP, and then look at the quality.

The MORE variability a bitrate has, the more bitrate will be allocated to high-contrast, high-motion scenes and the LESS to low-contrast, low-motion scenes. The grabbed picture looks extremely dark and low-contrast, and probably doesn't have a lot of motion.
This means x264 would have allocated MORE bitrate to better-lit and higher-motion scenes.
Since it's not known how the rest of the clip looks, the rest of the clip might look better for x264!

Anyway, when looking only at a FRAME comparison, the SIZE OF THE FRAME should be equal for the comparison to be fair!
(The frame size can be looked up by enabling the OSD in ffdshow.)

Indeed a pointless and unfair comparison!
Old 2nd January 2012, 14:49   #217  |  Link
Didée
Registered User
 
Join Date: Apr 2002
Location: Germany
Posts: 5,391
Ok, then let's look at full video streams. No change - the CUDA encode still looks like crap.
Old 2nd January 2012, 14:53   #218  |  Link
CruNcher
Registered User
 
 
Join Date: Apr 2002
Location: Germany
Posts: 4,926
I can speak from experience with the NVIDIA encoder, and it was the most interesting one: I analyzed almost every change NVIDIA's engineers made to it, driver revision by driver revision (I saw a lot of bugs go by, and was impressed that it was the first to support FRExt), until Quick Sync came along, which is improving rapidly and just got another quality boost.

Though I would really like to see how ORBX http://us.download.nvidia.com/downlo...8_GTC2010..wmv currently compares to H.264 on the GPU, having been designed entirely for the GPU from the ground up.
Jules has brought some powerful stuff together over the last few years, bringing Paul Debevec's Light Stage HDR rendering research on board for his cloud vision, his engine work and his GPU video codec work.

It could also be that AMD is coming back: despite their far-from-good performance (research-wise) in recent years, it seems they have some interesting stuff in the cradle coming out of the labs this year (the real-time deshaking introduction was a bit of a surprise, and now its v2.0 improvement in their upcoming GPUs could also give their general video encoding a big boost).
Old 7th January 2012, 16:57   #219  |  Link
hajj_3
Registered User
 
Join Date: Mar 2004
Posts: 1,125
@Dark Shikari, I don't suppose you might want to write an updated article about H.265/HEVC, as the complete draft of the spec is due in February?