View Full Version : H.264 CPU/DXVA codec comparison - Core2Duo vs UVD 2.2
NikosD
12th February 2011, 11:22
UPDATE 22/4/2011: UVD+ (Radeon 3650 results added)
UPDATE 31/3/2011: CoreAVC 2.5.1 (CPU & DXVA results added)
This is my second post regarding codec performance/benchmarking.
The first one is here:
http://forum.doom9.org/showthread.php?t=156660
This time DXVA and CPU codecs are included too.
All tests have been done on
Win 7 SP1 x64 - Core 2 Duo @ 2.83GHz - Radeon 5750 (UVD 2.2) - Catalyst 11.1a
Second system is (for DXVA only):
Win XP SP3 32bit - Core 2 Duo @ 2.83GHz - Radeon 3650 AGP (UVD+) - Catalyst 11.2
The benchmark tool is DXVAChecker v2.4.0 (32bit)
Home page: http://bluesky23.yu-nagi.com/en/
You can find all reference video files here:
ftp://helpedia.com/pub/multimedia/x264/testvideos/
1.Twinpeaks1080p30fps-27Mbps
2.Samsung.Demo.Oceanic.Life-1080p30fpsRef16-40Mbps
3.Basketball - 1088p60fpsRef8-10Mbps
4.Girls.YoonYoon-1080p60fpsRef5-21Mbps
5.Birds_1080p60fpsReF2-30Mbps
6.Cat-1080p60fpsRef4-25Mbps
Benchmark instructions are here:
http://forum.doom9.org/showthread.php?t=156660
Codecs included in comparison:
CoreAVC v2.5.1 - CPU & DXVA
CoreAVC v2.0 - CPU only
DiAVC v1.2.2 - CPU only
FFMpeg-mt v52.110.0 (rev3757) - CPU only
DivX H.264 v1.2.1 Build 9.0.1.21 - CPU & DXVA
FFDshow DXVA rev3757 - DXVA only
MPC-HC v1.5.1.2910 (32bit - standalone filter) - CPU & DXVA
Cyberlink PowerDVD 10 v1.0.2229 (latest as of 12th Feb) - CPU & DXVA
Microsoft DirectShow H.264 (built-in Win 7) - CPU & DXVA
Microsoft MediaFoundation H.264 (built-in Win 7) - CPU & DXVA
Four comments:
1) CoreAVC v2.5.1 CPU is the fastest codec. Second best is again CoreAVC, the previous version v2.0
2) CoreAVC v2.5.1 DXVA has almost identical results to MPC-HC DXVA & FFDShow DXVA
3) Core2Duo@2.83 GHz is faster (with optimized codecs) than UVD 2.2 in H.264 decoding
4) UVD 2.2 is very close to UVD+ (Radeon 3650) in MIN and AVG frame rate on the video clips that UVD+ supports (BluRay spec only)
Results:
A. Twinpeaks-30fps
Codec Codec type Min/Avg/Max fps
1) CoreAVC v2.5.1 CPU 80/97/107
2) CoreAVC v2.0 CPU 75/93/101
3) FFMpeg-mt CPU 68/85/94
4) DiAVC CPU 64/74/79
5) Microsoft DS CPU 55/69/95
6) PowerDVD CPU 56/68/76
7) Microsoft MFT CPU 54/66/86
8) MPC-HC CPU 44/62/69
9) Microsoft DS DXVA 50/60/100
10) DivX DXVA 49/60/86
11) FFDShow DXVA 49/60/86
12) CoreAVC v2.5.1 DXVA 50/59/83
13) MPC-HC DXVA 50/59/83
14) PowerDVD DXVA 50/59/78
15) Microsoft MFT DXVA 50/59/76
DivX CPU ---
B. Samsung-30fps
1) CoreAVC v2.5.1 CPU 32/49/90
2) FFMpeg-mt CPU 32/49/89
3) DiAVC CPU 34/49/81
4) CoreAVC v2.0 CPU 32/48/96
5) DivX DXVA 37/46/79
6) Microsoft DS DXVA 32/46/80
7) MPC-HC DXVA 37/46/75
8) CoreAVC v2.5.1 DXVA 37/46/74
9) DivX CPU 31/46/86
10) PowerDVD DXVA 32/45/65
11) PowerDVD CPU 22/43/79
12) Microsoft DS CPU 23/40/78
13) MPC-HC CPU 19/28/60
Microsoft MFT CPU ---
Microsoft MFT DXVA ---
FFDShow DXVA ---
C. Basket-60fps
1) CoreAVC v2.5.1 CPU 71/89/110
2) CoreAVC v2.0 CPU 72/88/111
3) DiAVC CPU 75/84/103
4) FFMpeg-mt CPU 73/83/104
5) DivX CPU 70/83/98
6) PowerDVD CPU 67/78/97
7) Microsoft DS CPU 43/60/85
8) PowerDVD DXVA 55/58/70
9) Microsoft DS DXVA 50/57/107
10) DivX DXVA 55/57/81
11) MPC-HC DXVA 55/57/79
12) CoreAVC v2.5.1 DXVA 54/57/77
13) FFDShow DXVA 52/57/76
14) MPC-HC CPU 40/48/68
Microsoft MFT CPU ---
Microsoft MFT DXVA ---
D. Girls-60fps
1) CoreAVC v2.0 CPU 58/73/94
2) CoreAVC v2.5.1 CPU 59/72/90
3) DivX CPU 62/69/79
4) DiAVC CPU 58/68/83
5) FFMpeg-mt CPU 58/65/82
6) PowerDVD CPU 52/62/82
7) MPC-HC DXVA 56/57/80
8) DivX DXVA 55/57/82
9) CoreAVC v2.5.1 DXVA 55/57/81
10) FFDShow DXVA 55/57/80
11) PowerDVD DXVA 55/57/79
12) Microsoft DS DXVA 43/57/82
13) Microsoft DS CPU 43/52/72
14) MPC-HC CPU 37/40/55
Microsoft MFT CPU ---
Microsoft MFT DXVA ---
E. Birds-60fps
1) Microsoft MFT DXVA 51/163/404
2) PowerDVD CPU 52/68/71
3) DiAVC CPU 54/63/69
4) Microsoft MFT CPU 39/61/93
5) DivX CPU 53/61/71
6) CoreAVC v2.5.1 CPU 53/59/68
7) CoreAVC v2.0 CPU 41/57/71
8) PowerDVD DXVA 52/56/68
9) DivX DXVA 52/55/77
10) CoreAVC v2.5.1 DXVA 52/55/75
11) FFDShow DXVA 52/55/74
12) Microsoft DS DXVA 51/55/89
13) MPC-HC DXVA 50/55/80
14) FFMpeg-mt CPU 48/54/67
15) Microsoft DS CPU 37/48/68
16) MPC-HC CPU 30/35/41
F. Cat-60fps
1) CoreAVC v2.5.1 CPU 66/70/76
2) CoreAVC v2.0 CPU 66/70/75
3) DiAVC CPU 64/70/74
4) FFMpeg-mt CPU 63/67/72
5) PowerDVD DXVA 52/57/69
6) FFDShow DXVA 54/56/76
7) CoreAVC v2.5.1 DXVA 48/56/85
DivX CPU ---
PowerDVD CPU ---
DivX DXVA ---
Microsoft DS DXVA ---
Microsoft MFT DXVA ---
MPC-HC DXVA ---
Microsoft MFT CPU ---
Microsoft DS CPU ---
MPC-HC CPU ---
Second system:
FFDShow DXVA rev3828
A. FFDShow DXVA 43/53/61
B. Corrupted image due to L5.1 (not supported by UVD/UVD+)
C. Corrupted image due to L5.1 (not supported by UVD/UVD+)
D. FFDShow DXVA 50/55/60
E. FFDShow DXVA 49/53/59
F. FFDShow DXVA 50/55/61
Feel free to add your comments/ results.
altruist
31st March 2011, 03:51
This is an excellent study. I can't believe no one else has thanked you for this.
I recently noticed CoreAVC now supports ATI hardware decoding through DXVA. Thought you might want to know if you don't already :)
NikosD
31st March 2011, 08:43
Thanks for your comments.
I know that CoreAVC 2.5.1 has DXVA support, but there is no demo version AFAIK.
When I get the new version, I will definitely try it.
UPDATE: Results for CoreAVC v2.5.1 added (CPU & DXVA)
bobdynlan
31st March 2011, 15:21
UPDATE 31/3/2011:
E. Birds-60fps
1) Microsoft MFT DXVA 51/163/404
Did you retest this? Does not seem like a valid result, maybe you should remove it from the top of the list.
CruNcher
31st March 2011, 16:18
According to that test the Samsung clip seems the most complex one, even more complex than the 60 fps ones, most likely due to the high bitrate + CABAC and ref frames
Are any of them sliced?
Cyberlink also seems to have to fight with it quite hard, and FFmpeg-mt can even hold its own against CoreAVC and DiAVC on that one. Interesting.
You should add Arcsoft and Mainconcept to that list. Many falsely believe the DivX and Mainconcept implementations are identical; that is false, there are differences, especially as Mainconcept is @ SDK 8.8 and DivX still somewhere @ the 8.5-8.7 codebase :)
Also Elecard released their H.264 DXVA implementation that seems also fast :)
Also please add on which Power Profile you tested on Win 7 :)
the latest PowerDVD decoder is also 1.0.0.2610
BetaBoy
31st March 2011, 18:40
Thanx for the comparison. I'd hold off on 'true' DXVA comp stats with CoreAVC 2.5.x as we are about to release more DXVA features in upcoming releases. We have not even begun optimizations for DXVA... and we already know of bottlenecks whose removal should make it even better/faster.
On the DivX / Main concept diffs.... they still use the same 'cores' from what others have posted here on D9.
pirlouy
31st March 2011, 19:57
Not sure I understand.
From these stats, can we say that CPU is better than GPU (DXVA) for a 24fps movie?
neoufo51
31st March 2011, 21:42
Edit: Never mind
mark0077
31st March 2011, 21:55
Hi,
Is it worth adding scores for the new LAV CUVID decoder http://forum.doom9.org/showthread.php?t=160290
Excellent thread btw, very useful.
neoufo51
31st March 2011, 22:07
Hi,
Is it worth adding scores for the new LAV CUVID decoder http://forum.doom9.org/showthread.php?t=160290
Excellent thread btw, very useful.
He can't because that's only for Nvidia cards and he is doing this on ATI cards.
NikosD
1st April 2011, 07:43
Did you retest this? Does not seem like a valid result, maybe you should remove it from the top of the list.
All tests were done 3 times and the numbers are all true.
You can see an analysis and possible explanation of those strange figures here:
http://forum.doom9.org/showthread.php?t=156660&page=7
Check out the posts of a member named hwti and my comments
NikosD
1st April 2011, 07:58
According to that test the Samsung clip seems the most complex one, even more complex than the 60 fps ones, most likely due to the high bitrate + CABAC and ref frames
The Samsung clip is the famous difficult Samsung clip with 16 ReFrames, huge bitrate etc., but the best CPU and DXVA codecs manage to stay more than 50% above its 30fps, at 46-49fps on average.
On the other hand, because the other clips need double the frame rate (60fps), none of the DXVA & CPU codecs manages to stay 50% above 60fps, which would mean 90fps on average.
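The headroom argument above is simple arithmetic and can be checked with a few lines of Python. This is only a sketch: the helper name `headroom` is made up for the illustration, and the two averages are the best CPU results from the first post (49 fps for Samsung, 89 fps for Basketball).

```python
def headroom(avg_fps, nominal_fps):
    """Return decode speed as a multiple of the clip's nominal frame rate."""
    return avg_fps / nominal_fps

samsung = headroom(49, 30)      # best average on the Samsung clip (30 fps)
basketball = headroom(89, 60)   # best average on the Basketball clip (60 fps)
target_60 = 1.5 * 60            # average needed for 50% headroom at 60 fps

print(f"Samsung: {samsung:.2f}x, Basketball: {basketball:.2f}x, "
      f"1.5x target at 60 fps: {target_60:.0f} fps")
```

So the 30 fps clip is decoded at roughly 1.6 times real time, while the best 60 fps result lands just under the 1.5 times mark.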
are any of them sliced ?
Sorry I don't get it.
Cyberlink also seems to have to fight with it quite hard, and FFmpeg-mt can even hold its own against CoreAVC and DiAVC on that one. Interesting.
True
You should add Arcsoft and Mainconcept to that list. Many falsely believe the DivX and Mainconcept implementations are identical; that is false, there are differences, especially as Mainconcept is @ SDK 8.8 and DivX still somewhere @ the 8.5-8.7 codebase :)
Also Elecard released their H.264 DXVA implementation that seems also fast :)
I have some results of ArcSoft here:
http://forum.doom9.org/showthread.php?t=156660
Mainconcept and Elecard are not so popular. If you give me some links I'll try it :)
Also please add on which Power Profile you tested on Win 7 :)
It's High Performance, but what's the difference ? They are all tested under the same conditions.
the latest PowerDVD decoder is also 1.0.0.2610
Mine was the latest as of 12th February - date of my first post
kypec
1st April 2011, 08:02
@NikosD: thanks for the effort you put into these extensive test rounds. Could you please post the results in a more flexible format, like a Google spreadsheet perhaps? I think that would provide better edit options for you and better sorting options for viewers as well.
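For anyone who wants the numbers in a sortable form, the result lines can be turned into CSV with a short script. What follows is only a sketch in Python, assuming each line follows the "N) codec type min/avg/max" pattern used in the first post. The names `parse_results` and `to_csv` and the column names are made up for this illustration, and runs marked "---" are simply skipped.

```python
import csv
import io
import re

# Matches lines such as "1) CoreAVC v2.5.1 CPU 80/97/107".
# The codec name may contain spaces; the type is assumed to be CPU or DXVA.
RESULT_LINE = re.compile(
    r"^\s*\d+\s*\)\s*(?P<codec>.+?)\s+(?P<mode>CPU|DXVA)\s+"
    r"(?P<min>\d+)/(?P<avg>\d+)/(?P<max>\d+)\s*$"
)

FIELDS = ["clip", "codec", "mode", "min_fps", "avg_fps", "max_fps"]

def parse_results(clip, lines):
    """Turn ranked result lines for one clip into a list of dict rows."""
    rows = []
    for line in lines:
        m = RESULT_LINE.match(line)
        if m is None:
            continue  # header lines and failed runs ("---") do not match
        rows.append({
            "clip": clip,
            "codec": m["codec"],
            "mode": m["mode"],
            "min_fps": int(m["min"]),
            "avg_fps": int(m["avg"]),
            "max_fps": int(m["max"]),
        })
    return rows

def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

sample = [
    "1) CoreAVC v2.5.1 CPU 80/97/107",
    "9) Microsoft DS DXVA 50/60/100",
    "15) Microsoft MFT DXVA 50/59/76",
    "DivX CPU ---",
]
rows = parse_results("Twinpeaks", sample)
print(to_csv(rows))
```

The CSV text can be pasted into any spreadsheet and sorted by `avg_fps` or `min_fps` per clip.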
NikosD
1st April 2011, 08:25
Not sure I understand.
From these stats, can we say that CPU is better than GPU (DXVA) for a 24fps movie?
Not better, faster.
A Core2Duo@2.83GHz supported by optimized and multithreaded codecs (like CoreAVC, FFmpeg-mt, DiAVC etc) is faster than UVD 2.2.
So, a modern Core i7 with 4 or 6 cores would be a lot, lot faster than UVD or VPx (Nvidia)
But as long as UVD or VPx (Nvidia) or Intel's hardware solution manages to play the clips with a minimum frame rate above the clip's own (1x) frame rate - say 24fps or 30fps or 60fps - then it's working.
The main reason for using DXVA and dedicated hardware for decoding video formats is power (laptops) - because Core2Duo consumes 10 times more power than UVD.
Of course, if you have a slow CPU then speed does matter, too.
And of course, during playback on the dedicated fixed function hardware inside the GPU (UVD, VPx etc), the CPU is free to do other things, because the CPU utilization is <5%.
So, with dedicated hardware in the GPU and DXVA, you have an extra dedicated, extremely low-power processor in your system, besides your CPU, capable of decoding several video formats like H.264, VC-1, MPEG2, WMV, MPEG4 ASP (DivX, Xvid).
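The "it's working" criterion from this post (minimum frame rate at or above the clip's own frame rate) can be written down as a small check. This is a sketch only: `plays_in_real_time` is a made-up helper, the minimum values are copied from the first post, and a benchmark minimum below the clip rate does not by itself prove visible drops in a real player.

```python
def plays_in_real_time(min_fps, nominal_fps):
    """True if even the slowest measured moment keeps up with the clip."""
    return min_fps >= nominal_fps

# (clip, nominal fps, codec, minimum fps) taken from the first post.
results = [
    ("Twinpeaks", 30, "MPC-HC DXVA", 50),
    ("Girls", 60, "DivX CPU", 62),
    ("Girls", 60, "CoreAVC v2.5.1 CPU", 59),
    ("Girls", 60, "MPC-HC DXVA", 56),
]

verdicts = {
    (clip, codec): plays_in_real_time(min_fps, nominal)
    for clip, nominal, codec, min_fps in results
}

for (clip, codec), ok in verdicts.items():
    print(f"{clip:10s} {codec:20s} {'OK' if ok else 'dips below real time'}")
```

By this measure UVD 2.2 has plenty of margin on the 30 fps clip, while on the 60 fps Girls clip only DivX CPU stays above the line at every moment.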
pirlouy
1st April 2011, 12:10
Ok. For me DXVA is dangerous, because it depends too much on Nvidia, ATI or Intel development. And I'd rather trust ffmpeg coding than those 3.
I've tried DXVA (through MPC-HC, ffdshow and another decoder), but was not really convinced, because it was sometimes jerky.
Since I don't use a laptop, power is not something to consider (even for Earth, since I don't watch HD videos all day). I prefer using the GPU to upscale only (madVR renderer).
Since I don't do any post-processing, your results confirm that it's better (in my case) to use the CPU to decode.
Thanks for your tests.
NikosD
1st April 2011, 12:23
I forgot to say that on recent ATI hardware with UVD 2.x and above, you can do post-processing at the driver level within the Catalyst suite during DXVA playback, using the GPU shaders (which DXVA decoding does not use), with 0% CPU utilization, for all video formats supported by UVD.
Even if you don't use post-proc or a laptop, you pay the bill for the power you consume :)
pirlouy
1st April 2011, 13:26
I've done a new test with DXVA. If the bit rate is too high (m2ts with 20 Mbps for example), I get a lot of dropped frames (jerky videos). Yet I have nearly the same GPU as you. Strange... I have Catalyst from February (preview 2). But I won't search more. Everything is ok with CPU.
ps: Even if I watched 100 HD videos in a year, the difference in power would be ridiculous. 2€ for me, and that won't make a difference for global warming. :-)
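The order of magnitude of that estimate is easy to check. In the sketch below only the 100 videos per year come from the post; the 2 hours per video, the 40 W of extra draw for CPU decoding and the 0.20 EUR/kWh price are assumptions picked for illustration, not measurements from this thread.

```python
def extra_cost_eur(videos_per_year, hours_per_video, extra_watts, eur_per_kwh):
    """Yearly cost of the extra power drawn by CPU decoding instead of DXVA."""
    extra_kwh = videos_per_year * hours_per_video * extra_watts / 1000.0
    return extra_kwh * eur_per_kwh

# 100 videos is from the post; the other three values are assumptions.
cost = extra_cost_eur(videos_per_year=100, hours_per_video=2.0,
                      extra_watts=40.0, eur_per_kwh=0.20)
print(f"Extra cost per year: {cost:.2f} EUR")
```

With these assumed inputs the difference stays in the range of a couple of euros per year, in line with the figure quoted in the post.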
Nikos, all the samples are offline. Can you re-up them please?
NikosD
1st April 2011, 16:22
None of them is offline.
I think you do something wrong.
NikosD
1st April 2011, 16:28
ps: Even if I watched 100 HD videos in a year, the difference in power would be ridiculous. 2€ for me, and that won't make a difference for global warming. :-)
I'm not trying to convince you of anything, but you have a dedicated co-processor in your system designed to do one thing better than anything else, and you prefer to give that job to your main processor, which has a lot more to do and is definitely not designed for it.
Your choice.
P.S And of course video acceleration is all over Internet because of Flash videos and the new version 10.2 which uses hardware acceleration for H.264 HD videos with minimum CPU utilization.
The same goes for HTML5, too.
CruNcher
1st April 2011, 19:15
That 4 Girls clip is also nice: some peaks there really kill VP2, at least with 95% DSP utilization :D It's nicely @ the edge (slightly over) of what VP2 is capable of. With VMR7 you don't see the frame drops, as it tries to hold playback stable, jitters by 10ms and plays @ 50fps; with VMR9 you see every drop. Nice for testing the max edge of VP2. For that clip on VP2 I would definitely switch automatically to software playback, as with CUDA à la CUVID/CoreAVC it jitters even more, 24ms (no surprise) :)
Also a nice clip to test Intel's HD2000/3000 decoder with and compare vs VPx and UVD :)
Though it's interesting to compare that to the Sony PlayStation Network Wipeout 60 fps clip, which plays super fluid and seems to fit perfectly into the VP2 specs. So I wonder what the encoder of that clip won in terms of visual quality by leaving the specs; I guess not really much. But that has sadly become standard thinking with x264 encoders (encode as complex as possible, don't care about specs and force hardware manufacturers to become better, even if they don't want to and are happy with being able to play Blu-rays ;) )
NikosD
1st April 2011, 20:01
I'm looking forward to VP4 and Intel benchmark results, too.
Where are you guys ? :)
CruNcher
1st April 2011, 20:20
There is no real Benchmark for this Full Hardware Playback/ Live PostPro Framework yet :)
Cyberlink DXVA (VP2) + VMR9 Renderless + Sharpen Complex 2 (simple) + High Resolution Timing
http://img863.imageshack.us/img863/5233/gpulivesharpencyberlink.png
None of them is offline.
I think you do something wrong.
Correct ! :D
nevcairiel
1st April 2011, 20:48
I'm looking forward to VP4 and Intel benchmark results, too.
Where are you guys ? :)
Due to stupid HotFiles limitations, i only have some clips so far.
This is VP4 on the 270.51 driver (GTX 570 - but all VP4 chips should run at the same speed, it's not dependent on the clock domain of the main GPU)
CoreAVC 2.5.1 crashed the driver in DXVA mode, so only CUDA tested.
1. Twinpeaks
MS MFT DXVA - 79/88/104
MS DS DXVA - 81/84/86
MPC-HC DXVA - 81/84/86
CoreAVC 2.5.1 CUDA - 67/70/72
LAV CUVID - 78/81/84
3. Basketball
MS MFT DXVA - ---
MS DS DXVA - 74/82/103
MPC-HC DXVA - 69/82/99
CoreAVC 2.5.1 CUDA - 75/84/106
LAV CUVID - 71/84/106
5. Birds
MS MFT DXVA - 62/75/87
MS DS DXVA - 70/77/84
MPC-HC DXVA - 70/77/85
CoreAVC 2.5.1 CUDA - 55/61/67
LAV CUVID - 68/77/85
Let me take the opportunity to question the results of your MS MFT DXVA test on the Birds sample, it's just not in line with any other results.
I think it's also interesting that my totally unoptimized decoder (in an unoptimized debug build, too!) is faster than CoreAVC CUDA in some samples, although they should be using the same decode engine.
Oh, and just for fun, some CPU decoding values using CoreAVC 2.5.1 and ffdshow (using ffmpeg-mt)
This is on a Core i7 2600K in stock (turbo) speeds.
Of course not comparable to your CPU measurements.
Twin Peaks:
CoreAVC - 520 fps
ffdshow - 476 fps
I think the Twin Peaks sample gives false values because it's completely decoded so fast that the fps calculation is probably wrong.
Basketball:
CoreAVC - 459 fps
ffdshow - 473 fps
Birds
CoreAVC - 268 fps
ffdshow - 265 fps
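On the remark that a very fast decode may give a wrong fps figure: one way a very short measurement could be skewed is timer granularity. The sketch below is purely illustrative. It assumes a clock that only advances in whole ticks of 1/64 s, which is not known to be how DXVAChecker measures time, and the function name `measured_fps` and the frame counts are made up.

```python
import math

def measured_fps(frames, true_seconds, tick_seconds):
    """fps as seen through a clock that only advances in whole ticks."""
    ticks = math.floor(true_seconds / tick_seconds)
    if ticks == 0:
        return float("inf")  # the clock did not advance at all
    return frames / (ticks * tick_seconds)

TICK = 1 / 64  # assumed clock granularity in seconds (hypothetical)

# A decoder that truly runs at 500 fps, measured over a short and a long window.
short_window = measured_fps(frames=10, true_seconds=10 / 500, tick_seconds=TICK)
long_window = measured_fps(frames=3000, true_seconds=3000 / 500, tick_seconds=TICK)

print(f"short window: {short_window:.0f} fps, long window: {long_window:.0f} fps")
```

Under this assumption the short window overstates the speed while the long one is accurate, so the faster a clip finishes, the less its fps number can be trusted.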
pirlouy
2nd April 2011, 01:04
I'm not trying to convince you of anything, but you have a dedicated co-processor in your system designed to do one thing better than anything else, and you prefer to give that job to your main processor, which has a lot more to do and is definitely not designed for it.
I disagree. :-)
CPU can do everything and all statistics in this thread show the CPU suffers less than GPU.
It is confirmed in my case. My E8400 decodes 20 Mbps without problem (30% CPU) whereas my HD5770 renders jerky videos. Maybe it's a driver problem, but it's a problem I don't have using the CPU. And I can't use "hardware" decoders, I don't think there are any for ATI...
For HTML5 and Flash, it's different: the CPU is also used by JavaScript, CSS and all the other stuff, so it can be helpful to have the GPU.
I hope I won't piss you off. I was not sure I had read the results correctly, but I think I was correct. Especially with a new CPU, if you don't do a lot of post-processing, it's better to use CPU than DXVA.
NikosD
2nd April 2011, 08:50
Let me take the opportunity to question the results of your MS MFT DXVA test on the Birds sample, it's just not in line with any other results.
You can see an analysis and possible explanation of those strange figures here:
http://forum.doom9.org/showthread.php?t=156660&page=7
Check out the posts of a member named hwti and my comments
NikosD
2nd April 2011, 10:17
Oh, and just for fun, some CPU decoding values using CoreAVC 2.5.1 and ffdshow (using ffmpeg-mt)
This is on a Core i7 2600K in stock (turbo) speeds.
It would be extremely interesting to post some benchmark results of Intel Clear Video HD hardware.
I've seen really big numbers using Quick Sync in decoding - encoding (transcoding) applications.
nevcairiel
2nd April 2011, 10:34
I have a P67 board, i cannot use the integrated GPU.
In any case, Quick Sync is only for encoding, the decode engine is probably not significantly faster than previous generations of ClearVideo HD.
As the encoding is done in the GPU, decoding with the CPU for re-encoding tasks is probably better. (260 fps decoding, hooray)
NikosD
2nd April 2011, 11:30
Well, according to Intel and several hardware review sites, Quick Sync is not only for encoding but for decoding, too.
Because, as I wrote in my previous post, transcoding is a two-phase task: decode and encode.
So if you want the fastest transcoding engine possible (like Quick Sync), you have to optimize both parts.
That's why Intel boosted decode performance a lot in Sandy Bridge processors and built a fixed-function hardware encoder for the first time.
The previous version of the Clear Video decoding hardware is a lot weaker than Sandy Bridge's, though I don't know by how much.
More details here:
http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i7-2600k-i5-2500k-core-i3-2100-tested/8
As for your second thought that maybe CPU decoding is faster than hardware decoding, it seems that it's definitely not true for Quick Sync according to here:
http://www.tomshardware.com/reviews/video-transcoding-amd-app-nvidia-cuda-intel-quicksync,2839-7.html
We have to wait for actual results posted by someone who wants to help the debate :)
Anyone ?
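The point that transcoding speed depends on both phases can be put into numbers. This is a sketch under simple assumptions: the 260 fps decode figure is the one mentioned a few posts up, the 100 fps encode figure is invented for the example, and `serial_fps` and `pipelined_fps` are made-up helper names for two idealized models (stages run one after the other per frame, or fully overlapped).

```python
def serial_fps(decode_fps, encode_fps):
    """Throughput when each frame is decoded, then encoded, with no overlap."""
    return 1.0 / (1.0 / decode_fps + 1.0 / encode_fps)

def pipelined_fps(decode_fps, encode_fps):
    """Throughput when both stages overlap: the slower stage sets the pace."""
    return min(decode_fps, encode_fps)

decode, encode = 260.0, 100.0  # encode rate is a hypothetical value

print(f"serial:    {serial_fps(decode, encode):.1f} fps")
print(f"pipelined: {pipelined_fps(decode, encode):.1f} fps")
# Doubling decode speed helps the serial model but not the pipelined one.
print(f"serial, 2x decode:    {serial_fps(2 * decode, encode):.1f} fps")
print(f"pipelined, 2x decode: {pipelined_fps(2 * decode, encode):.1f} fps")
```

In either model a faster decoder only pays off as far as the encoder can keep up, which is why both halves need to be fast for a quick transcode.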
nevcairiel
2nd April 2011, 11:35
Maybe Quick Sync can directly access the decoded image in the GPU's memory; saving the memory transfer there would make it a lot faster. I still think that the raw decode performance isn't that much greater. In any case, I will upgrade to a Z68 board once it's out, so I can use the integrated GPU. If no one else does some tests until then, I can do it then.
NikosD
2nd April 2011, 11:58
OK!
Looking forward to UVD 3 results,too. Because I'm not going to upgrade to a Radeon 6xxx card soon :)
renq
2nd April 2011, 12:55
OK!
Looking forward to UVD 3 results,too. Because I'm not going to upgrade to a Radeon 6xxx card soon :)
I'll download the clips and get to it then:)
Since I'm not a premium user at HotFile, it might take a while, but since I have already downloaded the first clip:
MS MFT DXVA: 51/57/61
Configuration:
Windows 7 x64 SP1
Phenom II X2 560 @ X4 B60 3885MHz
4GB DDR3
2GB HD6950 with unlocked shaders (stock clocks)
32bit dxva checker 2.4.0.0.0.0
Catalyst 11.4 beta
CruNcher
2nd April 2011, 13:19
I disagree. :-)
CPU can do everything and all statistics in this thread show the CPU suffers less than GPU.
It is confirmed in my case. My E8400 decodes 20 Mbps without problem (30% CPU) whereas my HD5770 renders jerky videos. Maybe it's a driver problem, but it's a problem I don't have using the CPU. And I can't use "hardware" decoders, I don't think there are any for ATI...
For HTML5 and Flash, it's different: the CPU is also used by JavaScript, CSS and all the other stuff, so it can be helpful to have the GPU.
I hope I won't piss you off. I was not sure I had read the results correctly, but I think I was correct. Especially with a new CPU, if you don't do a lot of post-processing, it's better to use CPU than DXVA.
It heavily depends. It's not a given per se that it's always more efficient, power consumption wise, to use a full GPU framework, especially as you are still bound to CPU overhead even on Win 7 ;)
You rather have to carefully balance the pros/cons and combine them smartly within the possibilities of the OS to get the best results (power consumption savings).
But saying per se that it's overall less efficient is also wrong. We are just @ the beginning of how to manage the GPU efficiently inside Windows, and WDDM 1.1 is not really much different in that regard than Windows XP. In the future we will hopefully see better GPU/CPU management and less overhead ;)
I have a P67 board, i cannot use the integrated GPU.
In any case, Quick Sync is only for encoding, the decode engine is probably not significantly faster than previous generations of ClearVideo HD.
As the encoding is done in the GPU, decoding with the CPU for re-encoding tasks is probably better. (260 fps decoding, hooray)
Yeah, some reviews show 25W with full 1080p Blu-ray playback (I think it was the Xbitlabs review) :)
Not that bad, though nothing Nvidia or ATI couldn't achieve if such a low-power, discrete, DSP-only card existed from them. It consumes roughly 3W for 1080p Blu-ray (that's not the DSP alone, measuring that is quite impossible to do; it will be somewhere @ 500mW or roughly 1W), where Intel's old decoder core consumed somewhere around 8W (though also not the decoder IP alone, which came from Imagination http://www.imgtec.com/powervr/powervr-technology.asp, but all the surrounding CPU logic that needed to be active @ the same time, in that case Atom ;) ).
Also, I don't believe the Nvidia encoder results in most reviews, because they don't set it up for maximum performance (they all rely on data they gather from ISV implementations), and Nvidia is continuously improving their encoder: in the 270. driver they just fixed a very bad motion estimation bug. (One of the main developers of Nvidia's GPU encoder core is an ex-MSU researcher, http://www.linkedin.com/pub/anton-obukhov/6/527/78b; he was also one of the main brains behind the FRExt implementation in the Nvidia encoder, which even Mainconcept still has to fight with.) :)
And as you said, the interesting thing is how Intel's DSP and EUs work together, and whether they are more efficient than what Nvidia/AMD/ATI currently have as discrete parts. Do the shorter paths really help a lot (or is Nvidia's/ATI's optimization so good that it can cope even with that, winning @ the driver level)?
No one has really benched that yet except for encoding, but there, even in THG's test (which is currently no doubt the most reliable test existing), it was only tested from a consumer-level POV :P
NikosD
2nd April 2011, 15:38
the first clip:
MS MFT DXVA: 51/57/61
Configuration:
Windows 7 x64 SP1
Phenom II X2 560 @ X4 B60 3885MHz
4GB DDR3
2GB HD6950 with unlocked shaders (stock clocks)
32bit dxva checker 2.4.0.0.0.0
Catalyst 11.4 beta
This is a complete disaster!
Your score is lower than mine!
You could check out two things:
1) Do not leave any default options enabled in the Video settings of Catalyst Control Center. You have to disable every default post-processing filter that Catalyst applies after driver installation, like Color Vibrance, Flesh tone correction, Dynamic range, Edge-enhancement, De-noise, Mosquito noise reduction etc.
You have to disable them all.
2) Select the null renderer (black screen) during benchmarking mode in DXVAchecker.
Could you retest it?
NikosD
2nd April 2011, 16:09
Maybe Quick Sync can directly access the decoded image in the GPU's memory; saving the memory transfer there would make it a lot faster. I still think that the raw decode performance isn't that much greater
It heavily depends. It's not a given per se that it's always more efficient, power consumption wise, to use a full GPU framework, especially as you are still bound to CPU overhead even on Win 7 ;)
You rather have to carefully balance the pros/cons and combine them smartly within the possibilities of the OS to get the best results (power consumption savings).
But saying per se that it's overall less efficient is also wrong. We are just @ the beginning of how to manage the GPU efficiently inside Windows, and WDDM 1.1 is not really much different in that regard than Windows XP. In the future we will hopefully see better GPU/CPU management and less overhead ;)
I reply to both as an opportunity to clarify one thing.
GPU video acceleration is a misleading and confusing term, describing something that has not existed for the last 4 years.
When ATI and Nvidia tried to accelerate video decoding in their cards, they did it using the same piece of hardware they used for 2D and 3D acceleration.
The results were limited and very hard to implement in video player applications.
In 2007 ATI used for the first time a dedicated piece of logic - UVD - in ATI HD 2000 series which completely offloaded both CPU & GPU shaders and it was completely independent from the rest of the card. It uses of course the video card memory, but it doesn't use GPU shaders and CPU at all (<5% CPU utilization & 0% GPU utilization)
GPU in general has nothing to do with video decoding nowadays.
When we say "hardware acceleration" or DXVA or GPU acceleration referring to video files, we always mean the independent, dedicated, added fixed function logic circuit which exists on the same die of what we call "GPU"
It's a little IC, an extremely fast and extremely low-power processor that we call UVD in ATI GPUs and VP2, VP3, VP4 in Nvidia GPUs. Intel has not given it a particular name AFAIK.
Quick Sync is something a little different, because it's a dedicated fixed-function decoder and, for the first time, an encoder too. So Quick Sync is the first dedicated fixed-function transcoder, a bold move by Intel.
AMD and Nvidia rely on GPU shaders (or the CPU) to encode video and on a fixed-function processor (UVD and VPx) to decode video.
So, after all this, I can say that there is no CPU overhead, GPU framework etc.
My Power Scheme is High Performance but I have enabled in BIOS the SpeedStep feature of Core2Duo and during DXVA playback my 2.83GHz processor goes in deep sleep mode running at the minimum clock speed of 1.7GHz.
Because it is not used at all!
CruNcher
2nd April 2011, 16:49
If you had read the 2nd part you would know that I'm pretty aware of what you just explained. You misunderstood me: I talked about the framework, not decoding alone, and surely there is a CPU overhead. Depending on your view, it is important to you or not; surely for most consumers it isn't ;)
Post-processing on every GPU (Nvidia/Intel/AMD/ATI/mobile ones) depends on the so-called EUs or shaders. If you can keep paths as short as possible, theoretically you should have a performance advantage; that's what nev asked himself. And if you believe that encoding with Quick Sync shows you CPU utilization as low as if you were decoding, that is wrong, there is overhead ;)
I could also explain to you
http://www.www.xbitlabs.com/images/video/intel-hd-graphics-2000-3000/transcode-1.png
http://www.www.xbitlabs.com/images/video/intel-hd-graphics-2000-3000/transcode-2.png
where this massive difference between Cyberlink's and Arcsoft's encoding frameworks comes from, especially the Nvidia part ;)
Playback power consumption on SB without Post Processing
http://www.www.xbitlabs.com/images/video/intel-hd-graphics-2000-3000/power-5.png
Btw, the picture I showed above with Complex Sharpen 2 consumes about 45W (only the card, no CPU) on a 9800 GT (non-Fermi cores). GPU load is @ avg 30%, and frames still get lost, also because the Nvidia VP2, as I said before, can't cope with this stream; that's why jitter is so high too, at 34ms. It's a perfect benchmark for Vista/7 Aero (DWM) and also Sandy Bridge :)
NikosD
2nd April 2011, 17:47
These graphs you posted are puzzles to be solved by them (Cyberlink, Arcsoft etc)
The reviewer, as you have read in the article, had no answer, and Cyberlink and Arcsoft had no answer either!
Who am I to have an answer:)
I think nobody has a solid answer.
But I was referring to DXVA only - which means decoding only - because this is the subject of the whole thread and the question of Pirlouy I was answering.
We compare CPU vs DXVA codecs, so by definition we are referring to decoding only.
I think it's clear now.
renq
2nd April 2011, 18:11
This is a complete disaster!
Your score is lower than mine!
You could check out two things:
1) Do not leave any default options enabled in the Video settings of Catalyst Control Center. You have to disable every default post-processing filter that Catalyst applies after driver installation, like Color Vibrance, Flesh tone correction, Dynamic range, Edge-enhancement, De-noise, Mosquito noise reduction etc.
You have to disable them all.
2) Select the null renderer (black screen) during benchmarking mode in DXVAchecker.
Could you retest it?
DIVX ver is 9.01.21
Arcsoft ver 2.27.319.108
FFDShow DXVA ver is 3800
All post-processing turned OFF.
1. Clip
MS MFT - 48/57/85
DIVX DXVA - 48/58/85
Arcsoft - 45/57/69
Ffdshow - 44/57/70
2. Clip
DIVX - 38/48/75
MS DTV/DVD - 31/48/86
Arcsoft - Wouldn't play
FFDSHow - Same
3. Clip
DIVX - 51/57/84
MS DTV/DVD - 48/57/79
Arcsoft - 53/57/64
ffdshow - 54/57/71
CruNcher
2nd April 2011, 18:14
And sure, playback wise SB could be more efficient than VP2 or VP4, especially power consumption wise, as with a high-performance decoder such as CoreAVC you don't have the discrete GFX card's consumption (default idle consumption) to take into account anymore. And with certain streams you won't have any chance and can't use only the DSP, as they either won't be compatible or the DSP is too weak for realtime playback. But I tried to explain to pirlouy that this isn't always a good thing if you work in frameworks where post-processing is added, which tasks the CPU another time. If you mix the most efficient parts of a system in a balanced way you can get better results, especially if you also want to keep 3D performance @ the same time, which Sandy Bridge can't deliver ;)
Because he thinks the CPU is always more efficient, and that is per se wrong even on a very powerful low-power CPU like Sandy Bridge. Now you can just imagine how powerful it can be to mix Intel's Media SDK 2.0 ecosystem and Nvidia's Nvcuvid/Nvcuvenc together into 1 framework, utilizing both fixed-function and shader units for different tasks @ the same time ;)
And yes, I have that answer that xbitlabs and (lol) Arcsoft don't have (you can also find it on doom9) ;)
NikosD
2nd April 2011, 18:28
It seems that UVD 3 has the same performance as UVD 2.2, or worse in minimum framerate.
I have to say the results - in terms of raw performance of UVD 3 - are a little disappointing.
Thanks renq.
CruNcher
2nd April 2011, 18:54
ATI/AMD never claimed that performance is better, just that the feature set was enhanced. The same goes for Nvidia's side.
It seems the 4 Girls stream performs practically the same on Nvidia VP2 as on UVD 2.2 (11) PowerDVD DXVA 55/57/79): it doesn't reach the full 60 fps, so no big difference. Though I reach only 50 fps in real playback, without benching with DXVAChecker, and that is on XP with VMR7 windowed. Let's see what DXVAChecker says.
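To put a number on "doesn't reach the full 60 fps", a rough sketch in Python, assuming the benchmark average can simply be compared against the clip's nominal frame rate. The helper name `realtime_margin` is made up for this post.

```python
def realtime_margin(avg_fps: float, clip_fps: float) -> float:
    """Return decode headroom as a percentage of the clip frame rate.

    A negative value means the decoder falls short of realtime on average.
    """
    return (avg_fps - clip_fps) / clip_fps * 100.0


# PowerDVD DXVA on the 60 fps Girls clip: MIN/AVG/MAX = 55/57/79
margin = realtime_margin(avg_fps=57, clip_fps=60)
print(f"{margin:+.1f}% of realtime")  # prints "-5.0% of realtime"
```

This only looks at the average; a decoder with a positive average margin can still drop frames if its minimum dips below the clip rate.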
pirlouy
2nd April 2011, 19:18
Thanks NikosD and CruNcher for the enlightenment. I admit I don't know everything about DXVA, that's why my first post was a question. :-)
I know it's a bit of a pity not to use this module (UVD in my case). I don't know if it's my module which has a problem, or if it's a driver problem, but in my case it's safer to use the CPU (no dropped frames).
Are there any OpenCL video decoders on the market? Maybe it's a young technology/language, but it could be better than CUDA (for general users, not necessarily Nvidia users), since it's not reserved for one company...
nevcairiel
2nd April 2011, 19:27
Are there any OpenCL video decoders on the market? Maybe it's a young technology/language, but it could be better than CUDA (for general users, not necessarily Nvidia users), since it's not reserved for one company...
That's a common misconception. A CUDA video decoder does not use CUDA to decode. CUDA offers a special API to access the video decoder - you just "use" CUDA to access that API, the decoder is not written in CUDA. That's why I named my decoder CUVID (the name of the API), and try to avoid using the term CUDA in general when talking about it.
Because of this, OpenCL does not qualify for this. The "common" API on Windows is DXVA. Sadly it does come with limitations, but because it's mostly meant for playback, I don't see the vendors investing in a common API anytime soon.
NikosD
2nd April 2011, 19:29
The situation looks similar to MLAA.
ATI introduced MLAA as a Radeon 6xxx series feature only, but later added MLAA to the 5xxx series, too.
When ATI introduced UVD 2.2 they said it had all the features that UVD 3 has: full MPEG2 acceleration and MPEG4 ASP support (DivX, Xvid).
I'm pretty sure that UVD 3 & UVD 2.2 are the same hardware.
They decided to expose the "extra" features only in UVD 3.
Please ATI fanboys, don't respond to that guess. It's just an evil thought of mine :)
But I don't think they will act the same way as they did with MLAA.
NikosD
2nd April 2011, 19:39
That's a common misconception. A CUDA video decoder does not use CUDA to decode. CUDA offers a special API to access the video decoder - you just "use" CUDA to access that API, the decoder is not written in CUDA. That's why I named my decoder CUVID (the name of the API), and try to avoid using the term CUDA in general when talking about it.
Because of this, OpenCL does not qualify for this. The "common" API on Windows is DXVA. Sadly it does come with limitations, but because it's mostly meant for playback, I don't see the vendors investing in a common API anytime soon.
This is not exactly right.
The latest OpenCL 1.1 specification has added direct calls to DXVA, like CUDA.
I'm not a developer, so I don't know the exact differences between CUDA and OpenCL access to DXVA, but I'm pretty sure that OpenCL will finally catch up with CUDA in DXVA access, if it hasn't done that already.
But I'm not aware of any "OpenCL" decoder.
nevcairiel
2nd April 2011, 20:11
This is not exactly right.
Oh, but it is.
The latest OpenCL 1.1 specification has added direct calls to DXVA, like CUDA.
It does not.
OpenCL 1.1 spec: http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
Go find me any reference to video decoding.
ATI is trying to be smart, and invented an API called "OpenVideo Decode", but it's ATI-only, not "Open", and it's still in its very early days (and it's not related to OpenCL, although it has interop with OpenCL like CUVID has with CUDA).
CruNcher
2nd April 2011, 20:20
VP2 WIN XP SP3 Forceware 270.51 9800 GT
4 Girls
DXVA:
Renderer: Video Mixing Renderer 7
Decoder: CyberLink Video Decoder
Decoder Device: ModeH264_VLD_NoFGT
Processor Device: -
Time: 03:47.269
Average FPS: 53,012
Min/Max FPS: Min: 50 Max: 58
Renderer: Video Mixing Renderer 9
Decoder: CyberLink Video Decoder
Decoder Device: ModeH264_VLD_NoFGT
Processor Device: -
Time: 03:47.192
Average FPS: 53,030
Min/Max FPS: Min: 50 Max: 58
Renderer: Video Mixing Renderer 7
Decoder: CoreAVC Video Decoder
Decoder Device: ModeH264_VLD_NoFGT
Processor Device: -
Time: 03:53.328
Average FPS: 51,635
Min/Max FPS: Min: 47 Max: 56
Renderer: Video Mixing Renderer 9
Decoder: CoreAVC Video Decoder
Decoder Device: ModeH264_VLD_NoFGT
Processor Device: -
Time: 03:52.397
Average FPS: 51,842
Min/Max FPS: Min: 48 Max: 56
Nvcuvid via Dshow Surprise Surprise ;) (not really)
Renderer: Video Mixing Renderer 7
Decoder: CoreAVC Video Decoder
Decoder Device: -
Processor Device: -
Time: 04:02.820
Average FPS: 49,617
Min/Max FPS: Min: 44 Max: 55
Renderer: Video Mixing Renderer 9
Decoder: CoreAVC Video Decoder
Decoder Device: -
Processor Device: -
Time: 04:02.846
Average FPS: 49,612
Min/Max FPS: Min: 44 Max: 54
Renderer: Video Mixing Renderer 7
Decoder: LAV CUVID Decoder
Decoder Device: -
Processor Device: -
Time: 04:24.156
Average FPS: 45,609
Min/Max FPS: Min: 43 Max: 52
Renderer: Video Mixing Renderer 9
Decoder: LAV CUVID Decoder
Decoder Device: -
Processor Device: -
Time: 04:23.916
Average FPS: 45,651
Min/Max FPS: Min: 43 Max: 53
Renderer: Video Mixing Renderer 7
Decoder: CUDA Video Decoder
Decoder Device: -
Processor Device: -
Time: 04:41.867
Average FPS: 42,733
Min/Max FPS: Min: 40 Max: 46
Renderer: Video Mixing Renderer 9
Decoder: CUDA Video Decoder
Decoder Device: -
Processor Device: -
Time: 04:41.806
Average FPS: 42,742
Min/Max FPS: Min: 40 Max: 46
CPU:
Renderer: Video Mixing Renderer 7
Decoder: CoreAVC Video Decoder
Decoder Device: -
Processor Device: -
Time: 00:56.136
Average FPS: 214,622
Min/Max FPS: Min: 200 Max: 234
Renderer: Video Mixing Renderer 9
Decoder: CoreAVC Video Decoder
Decoder Device: -
Processor Device: -
Time: 00:55.837
Average FPS: 215,771
Min/Max FPS: Min: 201 Max: 248
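A quick sanity check on the logs above, as a small Python sketch. It assumes the "Time" field is mm:ss.mmm and that the comma in "Average FPS" is a decimal separator (53,012 = 53.012 fps); both are my reading of the output, not documented behaviour. The helper names are made up. Multiplying time by average FPS should give roughly the same frame count for every run of the same clip.

```python
def parse_time(text: str) -> float:
    """Convert 'mm:ss.mmm' to seconds (assumed format)."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)


def parse_fps(text: str) -> float:
    """Convert a decimal-comma value like '53,012' to a float."""
    return float(text.replace(",", "."))


def frame_count(time_text: str, fps_text: str) -> int:
    """Estimate the number of decoded frames as time x average FPS."""
    return round(parse_time(time_text) * parse_fps(fps_text))


dxva = frame_count("03:47.269", "53,012")   # CyberLink DXVA, VMR7
cpu = frame_count("00:56.136", "214,622")   # CoreAVC CPU, VMR7
speedup = parse_fps("214,622") / parse_fps("53,012")
print(dxva, cpu, f"{speedup:.2f}x")
```

Both runs come out at about the same frame count, which suggests the numbers are comparable, and the ratio of the two averages gives the CPU vs DXVA speed difference directly.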
NikosD
2nd April 2011, 20:23
I meant this:
http://developer.amd.com/gpu/amdappsdk/pages/default.aspx
"Support for UVD video hardware component through OpenCL"
I thought it was an OpenCL 1.1 feature, not ATI-only.
I'm pretty sure that if ATI did it, Khronos will do it also in a future spec.
What do you think?
nevcairiel
2nd April 2011, 20:52
I don't think it'll be standardized in OpenCL.