PureVideo/UVD: max decoding speed ?
divide
8th February 2009, 10:11
I was wondering what the maximum decoding speed of these hardware-based decoders is. We always hear about CPU usage when using the GPU (DXVA benchmarks), but never about the maximum decoded frames per second the GPU can achieve with, let's say, 1080p 25 Mbit/s H.264.
Any info on this ?
Plus, how do they compare with CoreAVC/DivX7 on a quad-core CPU ?
DJ Bobo
8th February 2009, 12:37
I don't think it makes sense to measure such things, because it doesn't make sense to produce non-standard framerates.
The highest standard framerate is 29.97 fps, and the UVD units are designed to decode such framerates even at 1920x1080 and 40 Mbit/s, which is the highest video bitrate possible on Blu-ray discs.
That's all you have to know if you ask me.
divide
8th February 2009, 14:08
It makes sense to me as I'm gonna use DXVA to process videos, not for realtime playback. So the higher the framerate, the quicker the process...
Kado
8th February 2009, 15:45
Use DXVA checker and benchmark a renderer using DXVA.
divide
8th February 2009, 17:31
Thanks !
divide
8th February 2009, 17:47
BTW, the framerate calculated by DXVA Checker is kinda buggy (it reports "average 40fps" while displaying a slow framerate). Fortunately it also shows the time needed to bench the entire video, which is a far better indication :)
My results (9600GT, Core 2 Duo 2.5 GHz; the original video is 10.4 sec long, 17 Mbps H.264):
MPC DXVA Decoder: 6.2 sec
CoreAVC: 12.6 sec
DivX 7: 12.5 sec
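To put those timings in perspective, here is a minimal Python sketch that turns each elapsed time into a "times realtime" factor (clip length divided by decode time). The function name `realtime_factor` is made up for illustration, and the source framerate was not given, so no fps figure is derived.

```python
# Timings reported above for a 10.4-second clip (lower elapsed time is faster).
CLIP_SECONDS = 10.4
decode_seconds = {
    "MPC DXVA Decoder": 6.2,
    "CoreAVC": 12.6,
    "DivX 7": 12.5,
}


def realtime_factor(clip_seconds, elapsed_seconds):
    """Seconds of video processed per second of wall-clock time."""
    return clip_seconds / elapsed_seconds


# Hypothetical helper dict: decoder name -> speed relative to realtime.
factors = {
    name: realtime_factor(CLIP_SECONDS, elapsed)
    for name, elapsed in decode_seconds.items()
}

for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x realtime")
```

A factor above 1 means the clip was processed faster than it plays back; below 1 means slower than realtime on this machine.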
Reimar
8th February 2009, 19:46
It makes sense to me as I'm gonna use DXVA to process videos, not for realtime playback. So the higher the framerate, the quicker the process...
Then there is not much sense in measuring playback speed: copying the data from the GPU to the host _can_ take a lot of time, possibly more than the decoding itself.
divide
9th February 2009, 00:00
So basically you're saying that such decoders are useless ? :
http://neuron2.net/dgavcdecnv/AVCQuickStart.html
I'm not sure it takes that long to copy a 1080p texture back to the CPU. PCI-Express isn't that slow...
~bT~
9th February 2009, 00:29
^ isn't the whole point of using that sort of decoder to reduce CPU usage so that the CPU can be used elsewhere?
Leak
9th February 2009, 00:35
^ isn't the whole point of using that sort of decoder to reduce CPU usage so that the CPU can be used elsewhere?
Sure - but usually for immediate display, which means the decoded images are already in video memory and only need to be displayed.
Reading them back into RAM puts extra load on the CPU and costs bandwidth on the PEG lanes.
wozio
9th February 2009, 07:45
I think copying from video memory to RAM with CUDA is much faster than with Direct3D. Maybe this is the solution?
divide
9th February 2009, 08:48
According to the PCI-Express x4 specifications, the transfer speed from GPU to RAM is 1000 MB/s, which means you can transfer up to 160 (1080p RGB) frames per second. So there's no bottleneck here.
Plus, I don't think it costs the CPU anything to issue a memory transfer.
Reimar
9th February 2009, 11:03
According to the PCI-Express x4 specifications, the transfer speed from GPU to RAM is 1000 MB/s, which means you can transfer up to 160 (1080p RGB) frames per second.
I'd expect you'd want YV12 rather than RGB if you want to do more filtering/encoding (though that actually halves the bandwidth requirements).
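As a quick check of those figures, a small Python sketch of the arithmetic. It assumes the quoted 1000 MB/s means 10^9 bytes per second, 3 bytes per pixel for RGB (24-bit) and 1.5 bytes per pixel for YV12 (4:2:0); the function name is made up for illustration, and real-world throughput will be lower than this ceiling.

```python
LINK_BYTES_PER_SECOND = 1000 * 10**6  # 1000 MB/s as quoted, decimal MB assumed
WIDTH, HEIGHT = 1920, 1080


def max_frames_per_second(link_bytes_per_second, width, height, bytes_per_pixel):
    """Upper bound on frames per second the link can carry, ignoring all overhead."""
    frame_bytes = width * height * bytes_per_pixel
    return link_bytes_per_second / frame_bytes


rgb24_fps = max_frames_per_second(LINK_BYTES_PER_SECOND, WIDTH, HEIGHT, 3)   # 24-bit RGB
yv12_fps = max_frames_per_second(LINK_BYTES_PER_SECOND, WIDTH, HEIGHT, 1.5)  # YV12, 12 bits/pixel

print(f"RGB24: {rgb24_fps:.1f} frames/s")
print(f"YV12:  {yv12_fps:.1f} frames/s")
```

With these assumptions the RGB figure comes out at roughly 160 frames per second, and YV12 at about twice that, since each frame is half the size.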
Going by CUDA though, there is at least a 20 ms overhead. In addition, if you have to use ordinary (non-page-locked) memory, you need an extra memcpy, so you'd better have fast RAM (the CPU cache does not really help here, whereas with software decoding you might be lucky and everything stays in the CPU's cache).
So there's no bottleneck here.
According to the numbers by "divide", the decoding speed would be 40 fps. Even if the transfer can do 160 fps, a bad implementation (or old hardware that cannot copy and compute at the same time, which is the whole G8? generation I think; the CUDA manual has the details) would reduce the speed to 32 fps.
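The 32 fps figure follows from adding the per-frame decode and copy times when the two steps cannot overlap. A minimal Python sketch of that arithmetic (the function names are made up for illustration, and the overlapped case is an idealized model, not a measurement):

```python
def serial_fps(decode_fps, transfer_fps):
    """Throughput when each frame is decoded and then copied, with no overlap."""
    seconds_per_frame = 1.0 / decode_fps + 1.0 / transfer_fps
    return 1.0 / seconds_per_frame


def overlapped_fps(decode_fps, transfer_fps):
    """Idealized throughput when the copy of one frame overlaps the decode of the next."""
    return min(decode_fps, transfer_fps)


serial = serial_fps(40, 160)
overlapped = overlapped_fps(40, 160)

print(f"serial: {serial:.1f} fps, overlapped: {overlapped:.1f} fps")
```

In the serial case each frame costs 1/40 s plus 1/160 s, i.e. 1/32 s, whereas with perfect overlap the slower stage (decoding at 40 fps) sets the pace.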
Plus, I don't think it costs the CPU anything to issue a memory transfer.
Again, it depends on the implementation. The CUDA memory transfer will completely block one CPU. Actually it will completely block the CPU while first waiting for the calculation to complete and then for the memory transfer to complete. This can be avoided but only with additional overhead. Hopefully they implemented it in a more sane way for DXVA.
My point was that when you want to know the speed of one thing it does not make that much sense to measure a different thing.
Yes, copying the data back should not be a big issue, but this kind of usage is quite new, and some of the features necessary to make it work well are not available in older hardware, and some _might_ be missing in the software.
divide
9th February 2009, 11:19
I've taken note of all the points you mention here. Fortunately I'm free in my implementation choices, as I'm not doing this through CUDA but through DirectX/DXVA.
Kado
9th February 2009, 11:55
These are the benchmark results (in frames per second) using DXVA Checker on my system:
Cyberlink = 60 frames (with or without DXVA)
Divx 7 = 60 frames
CoreAVC = 60 frames
FFDshow-mt (rev 2644)= 50 frames
FFDshow (rev 2666) = 32 frames
MPC-HC = 32 frames (no DXVA) / 60 frames (DXVA)
As you can see, there's a cap at 60 frames per second, maybe because of my monitor's refresh rate (60 Hz), something like vsync.
CPU = Pentium D @ 4GHz
GPU = GeForce 9800GTX
Video = 10.1 Mbps, 1080p, 29.970 fps, CABAC, 1 reference frame, H.264.
CruNcher
9th February 2009, 23:16
Kado, hehe, you were render-refresh limited in your test. Try the other VMR renderer, it's not limited.
Kado
10th February 2009, 14:26
Using VMR7 there's no 60 frames limit. I'll do some tests and report later on.
divide
11th February 2009, 09:24
With VMR7, MPC-HC/DXVA won't work...