Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion. Before you start posting please read the forum rules. By posting to this forum you agree to abide by the rules. |
|
|
16th December 2009, 19:12 | #201 | Link |
Registered User
Join Date: Dec 2005
Posts: 106
|
Fermi has 128 KB of L2 cache per memory controller (768 KB total) that can be used to speed up thread block synchronization.
Also, MCP7X/GT200 and later GPUs can use system memory without copying it to dedicated (video) memory first, which can be a significant performance improvement; NVIDIA calls this "zero-copy access". Last edited by edison; 16th December 2009 at 19:15. |
16th December 2009, 20:48 | #202 | Link |
Software Developer
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
|
Yup, that's one of the new features introduced with CUDA 2.2; I don't know whether OpenCL will offer similar functionality. But it doesn't solve the fundamental problem! All that "zero-copy access" does is let the GPU fetch memory directly from the host process' memory space. This fetch is done across PCIe and bypasses the "global" device memory. But the speed of accessing the main ("host") memory from the GPU still isn't anywhere near that of the CPU, as all data must still go through the PCIe bus - which means limited bandwidth as well as additional latency! You may be able to "hide" the latency by doing "stream" processing: upload the next data element to the device while the current element is being processed. When the current element finishes processing, the next one has already arrived at the device, so the device doesn't need to wait - it can continue processing immediately. But whether stream processing is feasible depends highly on the individual application/problem.
Last but not least, it should be mentioned that "zero-copy access" is only supported by the GeForce 200 series and later, which currently excludes most CUDA-enabled devices!
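To make the latency-hiding idea concrete, here is a minimal host-side sketch of double buffering in plain Python. Note that `upload()` and `process()` are hypothetical stand-ins for an asynchronous host-to-device copy and a kernel launch, not real CUDA calls - only the overlap pattern itself is the point:

```python
from concurrent.futures import ThreadPoolExecutor

def upload(chunk):
    # Stand-in for a host-to-device copy (e.g. an async PCIe transfer).
    return list(chunk)

def process(device_chunk):
    # Stand-in for the GPU kernel working on the uploaded data.
    return sum(device_chunk)

def pipelined(chunks):
    """Overlap the upload of chunk i+1 with the processing of chunk i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as uploader:
        pending = uploader.submit(upload, chunks[0])
        for nxt in chunks[1:]:
            current = pending.result()              # wait for the upload to finish
            pending = uploader.submit(upload, nxt)  # start the next upload...
            results.append(process(current))        # ...while processing this one
        results.append(process(pending.result()))
    return results
```

If upload and processing take similar time per chunk, the transfer cost is almost entirely hidden; if either side dominates, the pipeline degenerates to the slower of the two.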
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊ Last edited by LoRd_MuldeR; 16th December 2009 at 20:55. |
17th December 2009, 03:13 | #204 | Link | |
Registered User
Join Date: Oct 2008
Posts: 39
|
Quote:
Furthermore, I find a "screenshot" completely inadequate, given that any single image from encoded video may not be indicative of the overall quality. SSIM or some other objective measurement should be used over the duration of the video, and in a perfect world you'd report a box plot of cumulative SSIM/PSNR/whatever scores. That box plot would communicate A) outliers (which correspond to particularly bad frames), B) the median SSIM, and C) how consistent image quality is, based on the quartiles/whiskers. I digress, and I'm open to the possibility that I'm being a pedantic jerk. But I find this method of posting screenshots of single frames to be almost completely devoid of merit. Last edited by kidjan; 17th December 2009 at 03:23. |
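The box-plot summary described above can be sketched in a few lines with the Python standard library - `frame_scores` is assumed to be a list of per-frame SSIM (or PSNR) values obtained elsewhere:

```python
import statistics

def score_summary(frame_scores):
    """Box-plot style summary of per-frame quality scores (SSIM, PSNR, ...).

    Returns the median, the lower/upper quartiles and any outlier frames -
    i.e. the numbers a box plot communicates: typical quality, consistency,
    and particularly bad (or good) frames.
    """
    q1, med, q3 = statistics.quantiles(frame_scores, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the usual whisker rule
    outliers = [s for s in frame_scores if s < lo or s > hi]
    return {"median": med, "q1": q1, "q3": q3, "outliers": outliers}
```

A clip with one badly encoded scene shows up immediately as low outliers, even when the median looks fine.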
|
17th December 2009, 16:30 | #205 | Link | |||
Registered User
Join Date: Mar 2005
Location: Finland
Posts: 2,641
|
Quote:
Quote:
Quote:
|
|||
17th December 2009, 22:43 | #206 | Link | ||
Registered User
Join Date: Oct 2008
Posts: 39
|
Quote:
Quote:
Lastly, if someone is going to use the mean-opinion-score approach being taken in this thread, they should stop labeling the encoder output. The labels bias the results, since observers know which encoder produced which output. Assuming the MOS approach is preferable (again, I disagree), this thread is a wonderful example of what not to do in experimental design. And now I'm definitely a pedantic jerk. Sorry. Will stop posting now. |
||
18th December 2009, 00:28 | #207 | Link | |||
Registered User
Join Date: Mar 2005
Location: Finland
Posts: 2,641
|
Quote:
Quote:
Quote:
I'd only use PSNR and SSIM for measuring the effect of certain (non-psy) parameters of an encoder, against itself. Last edited by nm; 18th December 2009 at 00:32. |
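For reference, per-frame PSNR is straightforward to compute; a minimal sketch for 8-bit samples, with frames as flat lists (purely illustrative, real tools operate on planes):

```python
import math

def psnr(frame_a, frame_b, peak=255.0):
    """PSNR in dB between two equally sized 8-bit frames (flat sample lists)."""
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak * peak / mse)
```

This is exactly why PSNR is blind to psy optimizations: it only scores per-sample error energy, with no notion of where in the image the error lands.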
|||
18th December 2009, 08:42 | #208 | Link | |
Registered User
Join Date: Oct 2008
Posts: 39
|
Quote:
Regardless, it doesn't change the fact that the methods used in this thread are completely without scientific merit. If they're going to do MOS, at least do it with some semblance of rigor. Last edited by kidjan; 18th December 2009 at 08:49. |
|
18th December 2009, 11:10 | #209 | Link | |
Software Developer
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
|
Quote:
You can try it out yourself: encode the same source clip with "--ssim --tune film" and with "--ssim --tune ssim" (the latter includes "--no-psy"). Then you'll see... (Side note: I'm currently developing an SSIM-based metric for a specific application and I'm having a hard time measuring a certain "effect" - one that is clearly visible to human viewers.)
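For the curious, a single-window ("global") SSIM can be sketched as below. This is a simplification: real SSIM implementations slide a (often Gaussian-weighted) window over the image and average the local scores, so this only approximates the usual per-frame value:

```python
import statistics

def global_ssim(x, y, peak=255.0):
    """Single-window SSIM over two flat sample lists (no sliding window)."""
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2  # standard stabilizers
    mx, my = statistics.fmean(x), statistics.fmean(y)
    vx = statistics.fmean([(a - mx) ** 2 for a in x])
    vy = statistics.fmean([(b - my) ** 2 for b in y])
    cov = statistics.fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx * mx + my * my + c1) * (vx + vy + c2))
```

Identical inputs score 1.0; losing all structure (e.g. a flat frame) pulls the covariance term toward zero and the score well below 1.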
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊ Last edited by LoRd_MuldeR; 18th December 2009 at 11:26. |
|
23rd December 2009, 02:41 | #210 | Link | |
Registered User
Join Date: Oct 2008
Posts: 39
|
Quote:
I'd be interested to see how you arrived at these results (i.e. what counts as "significant", how you determined that MOS dropped, etc.), but maybe in a different thread; I've already disrupted this one enough. Thanks for the feedback. |
|
30th December 2009, 02:24 | #211 | Link | |
Registered User
Join Date: Jan 2003
Posts: 20
|
Quote:
The GT200 on the other hand is a more complicated scenario, but it sounds like it might still be helpful in some encoding situations. Last edited by sethk; 30th December 2009 at 02:28. |
|
30th December 2009, 02:39 | #212 | Link | ||
Software Developer
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
|
Quote:
Sure, it will beat everything that Intel can deliver currently, but unfortunately that doesn't mean much... Quote:
All that "zero-copy access" does is remove one step: data doesn't need to be uploaded to "global" memory before it can be copied to the "shared" memory of an individual block - it can now go to the "shared" memory of the destination block immediately. That avoids one indirection in some cases, but certainly not in all cases. Where more than one thread block needs to access the data, we still need to store it in "global" memory. And the data still needs to be uploaded via PCIe, which IMO is the most important problem!

For a project I was working on, we had to move calculations to the GPU although they were slower(!) on the GPU than on the CPU: downloading all the intermediate data to the CPU (host memory) was so slow that we had to do everything on the GPU (global device memory) and only download the final result.

So believe me: the bottleneck of having to upload/download all the data through the PCIe bus (which currently is the only way on all competitive GPUs) is not just some "theoretical" thing. It's something you will encounter in reality! And it can easily kill all the "nice" speedup you expected from GPGPU processing. People are even trying to transfer data from one GPU board to another via DVI to avoid PCIe ^^
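A back-of-envelope model shows why the PCIe hop hurts. The bandwidth and latency figures below are rough, assumed round numbers (nominal PCIe 2.0 x16 versus device memory of that era), not measurements:

```python
def transfer_ms(megabytes, gb_per_s, latency_us=10.0):
    """Rough time to move a buffer across a bus: fixed latency + size/bandwidth."""
    return latency_us / 1000.0 + megabytes / (gb_per_s * 1024.0) * 1000.0

# A 1080p YV12 frame is 1920*1080*1.5 bytes, i.e. roughly 2.97 MB.
frame_mb = 1920 * 1080 * 1.5 / (1024 * 1024)
pcie = transfer_ms(frame_mb, 8.0)     # ~8 GB/s: nominal PCIe 2.0 x16
vram = transfer_ms(frame_mb, 100.0)   # ~100 GB/s: assumed device-memory figure
```

Per frame the PCIe hop costs an order of magnitude more than a device-memory access, and a pipeline that bounces intermediate data back and forth pays it on every round trip.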
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊ Last edited by LoRd_MuldeR; 30th December 2009 at 03:05. |
||
1st January 2010, 01:27 | #213 | Link | |
Registered User
Join Date: Jan 2003
Posts: 20
|
Quote:
As you say, this is all academic because of the low shader pipeline count on this solution. For the GT200, even with the high latency induced by accessing main memory through the PCIe bus, it has enough bandwidth to main memory (8 GB/s) that I imagine it could be used in creative ways in conjunction with local memory.

Since the GT200 has 1 GB of local memory, you could certainly use the local graphics memory (GDDR) to buffer large groups of uncompressed frames: have the main CPU handle decompression and stream the uncompressed frames to GPU memory (which could hold a large enough buffer of frames to handle backwards and forwards frame access), then write the compressed data back to main memory without hitting a PCIe memory bottleneck. Since most of the work would be done on uncompressed frames, and that could happen in GPU memory instead of main memory, the PCIe latency may not be as big a deal as if you tried to do everything in main memory. I haven't attempted any of this, and I may be off base in my thinking, but it's what I pictured as a reasonable approach. Last edited by sethk; 1st January 2010 at 01:30. |
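A quick sanity check of the buffering idea - the YV12 frame size and the 1 GB budget are the only inputs, and the helper is purely illustrative:

```python
def frames_in_buffer(mem_bytes, width, height, bytes_per_pixel=1.5):
    """How many uncompressed YV12 frames (1.5 bytes/pixel) fit in a memory budget."""
    return mem_bytes // int(width * height * bytes_per_pixel)

# With ~1 GB of GT200 local memory, 1080p YV12 frames:
capacity = frames_in_buffer(1 << 30, 1920, 1080)  # several seconds of video
```

Roughly 345 frames fit, which is more than enough lookahead/lookback for motion estimation across a GOP without touching host memory.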
|
1st January 2010, 19:52 | #214 | Link |
Registered User
Join Date: Jan 2006
Posts: 30
|
As stated in CUDA2.2PinnedMemoryAPIs.pdf, mapped pinned memory is always beneficial for integrated GPUs (and AFAIK it's supported by all CUDA-capable integrated GPUs, since all that's needed is for the driver to map the memory the GPU uses into the application's VM space). Discrete GPUs are completely different: mapped pinned memory is pretty much only beneficial over copying the data to device memory if the data is accessed only once and all accesses are coalesced. Thus, for most kernels, zero-copy is slower on discrete GPUs.
Last edited by yuvi; 1st January 2010 at 19:54. |
2nd January 2012, 11:09 | #216 | Link |
Constant Quality..
Join Date: Oct 2009
Location: China
Posts: 4
|
If anybody noticed: the bitrate in x264 peaks at 2229 kbps(!) to make an average of 900 kbps...
The CUDA clip peaks at 1776 kbps, which means less variability, and thus MORE bitrate for low-motion, low-contrast scenes. Just think: to have scenes at 2230 kbps and yet an average of 900 kbps, you have to have a lot of scenes at, let's say, only 300 kbps. With a peak at 1780 kbps you can have low-motion scenes at maybe 500 kbps.

Anyway, comparing a FRAME without knowing the FRAME SIZE the encoder used is pointless! A fair comparison would look at one whole GOP structure (I-frame and dependent B/P-frames), have both encoders encode the same total size for all the frames in the GOP, and then look at the quality.

The more variability a bitrate has, the more will be allocated to high-contrast, high-motion scenes and the less to low-contrast, low-motion scenes. The grabbed picture looks extremely dark and low-contrast, probably without much motion, which means x264 would have allocated MORE bitrate to better-lit and higher-motion scenes. Since it's not known how the rest of the clip looks, the rest might well look better for x264! When comparing only single frames, the SIZE OF THE FRAME should be equal for the comparison to be fair! (The frame size can be looked up by enabling the OSD in ffdshow.) Indeed a pointless and unfair comparison! |
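The arithmetic above can be made explicit. The 20% peak-time fraction below is an assumed, illustrative number; the helper just solves the weighted average for the remaining scenes:

```python
def low_scene_rate(avg_kbps, peak_kbps, peak_fraction):
    """Average bitrate left for the remaining scenes when `peak_fraction` of
    the clip runs at the peak bitrate and the overall average must still hold."""
    return (avg_kbps - peak_fraction * peak_kbps) / (1.0 - peak_fraction)

# Assumed: 20% of the clip runs at the peak. With a 900 kbps overall average:
x264_rest = low_scene_rate(900, 2230, 0.2)  # kbps left for the other 80%
cuda_rest = low_scene_rate(900, 1780, 0.2)  # kbps left for the other 80%
```

Under this assumption the x264 clip leaves about 568 kbps for its quiet scenes while the CUDA clip leaves about 680 kbps, which is exactly the variability difference described above.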
2nd January 2012, 14:49 | #217 | Link |
Registered User
Join Date: Apr 2002
Location: Germany
Posts: 5,391
|
Ok, then let's look at full video streams. No change - the CUDA encode still just looks like crap.
__________________
- We´re at the beginning of the end of mankind´s childhood - My little flickr gallery. (Yes indeed, I do have hobbies other than digital video!) |
2nd January 2012, 14:53 | #218 | Link |
Registered User
Join Date: Apr 2002
Location: Germany
Posts: 4,926
|
I can speak from experience with the Nvidia encoder, and it was the most interesting one to follow: I analyzed almost every change Nvidia's engineers made to it, driver revision by driver revision, saw a lot of bugs go by, and was impressed that it was the first to support FRExt. That lasted until QuickSync came up, which is improving rapidly and just got another quality boost.
Though I would really like to see how ORBX http://us.download.nvidia.com/downlo...8_GTC2010..wmv currently compares to H.264 on the GPU, since it was designed entirely for the GPU from the ground up. Jules has brought some powerful pieces together in recent years, bringing Paul Debevec's Light Stage HDR rendering research on board for his cloud vision and engine work, alongside his GPU video codec work. AMD may also be coming back: despite their far-from-good (research-wise) performance in recent years, they seem to have some interesting things coming out of the labs this year. The real-time deshaking introduction was a small surprise, and now its v2.0 improvement in their upcoming GPUs could also give their general video encoding a big boost.
__________________
all my compares are riddles so please try to decipher them yourselves :) It is about Time Join the Revolution NOW before it is to Late ! http://forum.doom9.org/showthread.php?t=168004 Last edited by CruNcher; 2nd January 2012 at 16:49. |
|
|