H.264 decoding performance - benchmarked!


Blue_MiSfit
17th March 2009, 18:56
Numbers have been revised, all are correct now :)

Hey all,

There's been a bit of bickering around here recently about H.264 decoding performance. I thought I'd run some tests (pure software decoding only) with recent software and show the world.

First, a brief disclaimer: this test is meant as an evaluation of decoding performance on high-end systems. These numbers might be irrelevant if you're using an older CPU, are mostly concerned with GPU decoding, or otherwise not interested in high-speed H.264 decoding! :) In other words, please take all these numbers with a grain of salt, and make your own tests when necessary!

System: 2x Quad Core "Harpertown" 2.66 GHz Xeon (Dell 1950), 4GB RAM, Windows Server 2003 (32 bit)
Source: ~8.6 Mbps High Profile 1080p, 3 reference frames, 3 B-frames, CABAC, deblocking, MKV container.

All measurements made with timecodec.exe


CoreAVC Professional 1.9.5
dfps: 160.2
Notes: Using ~ 4 cores (~56% utilization). Limited to 4 cores, it delivers 151.3dfps.

libavcodec, from ffdshow-tryouts build 2737 (March 2 2009)
dfps: 39.2
Notes: Uses only one core

ffmpeg-mt, from ffdshow-tryouts build 2737 (March 2 2009)
dfps: 211.4
Notes: Uses all 8 cores! Limited to 4 cores, it delivers 120.9dfps

DivX H.264 Decoder 1.0
dfps: 255.3
Notes: Uses all 8 cores! Limited to 4 cores it delivers 166.0 dfps

http://img144.imageshack.us/img144/8392/decoderperformance.gif (http://img144.imageshack.us/my.php?image=decoderperformance.gif)

DivX H.264 really surprised me. It scales well from 4 to 8 cores (about 1.5x by the numbers above), is blazing fast, and free!
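For anyone who wants to put a number on the 4-to-8-core scaling, here is a minimal Python sketch that works only from the dfps figures listed above. The "efficiency" definition (speedup divided by 2, since the core count doubles) is my own assumption for illustration, not something timecodec reports. Keep in mind that CoreAVC was only using about 4 cores in the unrestricted run, so its figure says little about how it would scale if it used all 8.

```python
# dfps figures copied from the results above: (limited to 4 cores, all 8 cores)
RESULTS = {
    "CoreAVC Professional 1.9.5": (151.3, 160.2),
    "ffmpeg-mt (ffdshow-tryouts 2737)": (120.9, 211.4),
    "DivX H.264 Decoder 1.0": (166.0, 255.3),
}


def scaling(dfps_4, dfps_8):
    """Return (speedup, efficiency) when going from 4 to 8 cores.

    Efficiency is an illustrative metric: 1.0 would mean that doubling
    the cores doubled the throughput.
    """
    speedup = dfps_8 / dfps_4
    return speedup, speedup / 2.0


for name, (d4, d8) in RESULTS.items():
    speedup, efficiency = scaling(d4, d8)
    print(f"{name}: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

The same function works for any pair of measurements, for example a 1-core vs 2-core run.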

Also, just for fun I ran another test with CoreAVC's CUDA support on a 9800gt. I got 48.8fps decoding, and CPU usage hovered around 2-5%. That's handy-dandy for transcoding if you're on a system with CUDA support! I imagine the numbers would be quite similar with DGAVCDecNV in an AviSynth script.

~MiSfit

Blue_MiSfit
17th March 2009, 19:30
And wouldn't you know it?

CoreAVC releases a new version (1.9.5) this morning!!! :)

I'll update the OP with new numbers

Manao
17th March 2009, 19:50
Notes: Uses all 8 cores! (but doesn't scale as well as ffmpeg-mt)

That isn't conclusive. You can't be sure that if ffmpeg-mt were using all the cores, it would scale as well as it currently does.

Blue_MiSfit
17th March 2009, 19:59
True that. I might re-run the test on a quad-core system.

~MiSfit

Blue_MiSfit
17th March 2009, 20:05
I re-ran the tests with CoreAVC 1.9.5, and the dfps was about the same. It was well within a 5% margin of error, so I'm not updating the OP :)

~MiSfit

turbojet
17th March 2009, 20:11
Couldn't you just set timecodec's affinity to 4 cores in your current setup?
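On Windows, process affinity is expressed as a bitmask (this is what the Win32 SetProcessAffinityMask call takes, and what Task Manager's "Set Affinity" checkboxes build for you). A small Python sketch of how such a mask is formed is below. Which core indices to pick is an assumption on my part: the thread doesn't say which four cores were used, and on a dual-socket box the choice between four cores on one socket or two on each could matter.

```python
def affinity_mask(cores):
    """Build the bitmask form of a CPU set: bit n set means core n is allowed."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core index must be non-negative")
        mask |= 1 << core
    return mask


# Illustrative choices only: the first four cores, then all eight.
print(hex(affinity_mask(range(4))))
print(hex(affinity_mask(range(8))))
```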

Blue_MiSfit
17th March 2009, 20:25
Indeed. How could I have overlooked the obvious solution :) I'll post results

Blue_MiSfit
17th March 2009, 20:28
Limited to 4 cores, DivX H.264 decoder gives:
481.6fps dfps:166.0

So, it looks like ffmpeg-mt is much faster with 4 cores as well.

~MiSfit

turbojet
17th March 2009, 20:31
Are the CoreAVC and ffmpeg-mt results the same with affinity set to 4 CPUs?

Blue_MiSfit
17th March 2009, 20:35
Uh oh...

Running things again, it looks like I made a huge mistake!

ffmpeg-mt used all 8 cores to get 211.4fps. D'oh!

I'm going to put a disclaimer on the OP (but leave it up for posterity's sake), and make another post with updated numbers.

Sorry guys :(

~MiSfit

turbojet
17th March 2009, 20:48
Oh ok, I was going to say ffmpeg-mt has made a huge leap since tests from a few months ago. I think updating the OP with a 4-core ffmpeg-mt result, plus a 4-core CoreAVC result if it changes with affinity set, would be sufficient.

Blue_MiSfit
17th March 2009, 21:06
I'm re-running everything, and throwing in CoreAVC 1.9.5's CUDA support on a 9800gt as well.

~MiSfit

Blue_MiSfit
17th March 2009, 21:19
Alright, all numbers are updated, and I made a pretty little chart for everyone :D

DivX H.264 for the win! Wow!

~MiSfit

Snowknight26
17th March 2009, 23:06
What's the max number of cores (threads?) that each decoder can use? Might do some benchmarks on my 16 core beast.

Sagittaire
17th March 2009, 23:37
Well ... anyway, a simple $30 GPU card can now perfectly decode an H.264 BD stream at 40 Mbps. BD9 uses only 10% of my CPU with my little C2D at 2.66 GHz. High-efficiency software decoders like CoreAVC and the DivX decoder are simply useless for me now.

turbojet
17th March 2009, 23:44
For transcoding I've found that CoreAVC CUDA is about 10% slower on the first pass and 5% faster on the second pass or CRF than ffmpeg-mt. So if you are doing 2-pass it's still faster overall to use software solutions. I've tested this on an X2, Q6600 and i7 with pretty comparable results.

Blue_MiSfit
17th March 2009, 23:59
@Sagittaire:

Sure, if you only encode at ~40fps :)

If you need high speed transcoding, CUDA won't help you. Unless of course you have a real monster like a GTX285!

On that note - @turbojet - what GPU did you perform the tests on?

The 9800gt I used is very upper-low end these days. It's basically an overclocked 8800gt, with 1gb of VRAM to help me out in fft3dgpu. It doesn't even begin to compare to a GTX260 or a GTX285. I wonder how those would do :)

~MiSfit

turbojet
18th March 2009, 00:43
i7 is 9800GT
6600 and X2 is 8600GT

ajp_anton
18th March 2009, 01:58
If you need high speed transcoding, CUDA won't help you. Unless of course you have a real monster like a GTX285!

Will a faster card really decode faster? Isn't the work still done in VP2 with CoreAVC, and isn't that the same in all cards?

Blue_MiSfit
18th March 2009, 02:12
I don't think so, since this isn't DXVA. It's CUDA.

I may be totally wrong, though!

~MiSfit

Dark Shikari
18th March 2009, 02:24
I don't think so, since this isn't DXVA. It's CUDA.

I may be totally wrong, though!

~MiSfit

CUDA is just being used as an interface to access PureVideo. It isn't as if CoreCodec wrote an entirely new decoder in CUDA.

Cyber-Mav
18th March 2009, 02:24
It would be great to see single- and 2-core tests done too.

Either way, thanks for your input on this. I didn't know DivX was so fast as a decoder. I would love to see single- and 2-core tests on it, especially vs. CoreAVC.

Sagekilla
18th March 2009, 03:06
Dark Shikari, isn't that more or less what every hardware acceleration option is today? I can't think of any decoders that actually use the shaders to perform the decoding. Most telltale sign would be high power consumption compared to other decoders, since running the shaders eats up quite a lot of power.

Blue_MiSfit
18th March 2009, 03:09
I stand corrected :)

I thought they did!

But that makes a lot more sense.

I may run some more tests for dual core systems, but I work with quad-cores as a worst case scenario :)

~MiSfit

BetaBoy
18th March 2009, 14:55
Blue_MiSfit.... while I'm partial, as you might think, your results are a mixed bag considering the different CPU/GPU/RAM configurations ppl might have. It's why we scrapped our 'compare' webpage efforts, as it would only cause more confusion, and that's something we surely don't want. Instead, as you're finding out, we want 'each' user to test to see what works best for their configuration/system. So I would point out that your results may/will differ for someone else (even with the same configuration).

Additionally (and as Dark Shikari can confirm) we have very little in the way of CPU instruction set optimizations in CoreAVC 1.x at the moment (unlike DivX, which noted that they rely heavily on optimizations for speed)... that's what (in part) is coming in CoreAVC 2.0.

Cyber-Mav
18th March 2009, 16:27
I thought CoreAVC was already fully optimised using SSE1, 2, 3 and 4. That's what gives CoreAVC its speed, I thought?

Shakey_Jake33
18th March 2009, 17:12
^I think it's CoreAVC's lack of reliance on such things that gives it such a speed advantage on older hardware that might not support SSE2/3/4 (like Athlon XP's).

kemuri-_9
18th March 2009, 18:42
^I think it's CoreAVC's lack of reliance on such things that gives it such a speed advantage on older hardware that might not support SSE2/3/4 (like Athlon XP's).

I imagine that CoreAVC will use the additional optimizations only if they are there,
to maintain compatibility with old hardware, similar to how x264 does things.

so it won't remove old hardware from the current supported list.

BetaBoy
18th March 2009, 20:55
kemuri-_9... correct.

Blue_MiSfit
18th March 2009, 21:02
@BetaBoy:
Interesting. Quite impressive that you can manage such high speed decoding without heavy reliance on SIMD!

I can hardly wait to see the performance of CoreAVC 2.0!

I didn't mean this test to be an exhaustive case of determining which decoder was faster in general, but rather which decoder would be fastest given a relatively high-end configuration - i.e. a modern, multi-core CPU. Most folks doing encoding here have relatively beefy systems (i.e. at least a Core 2 Duo, or a recent Athlon 64 X2). I'd submit that my test is entirely relevant for anyone using such a CPU.

But of course, no benchmark or test can be perfectly descriptive of every possible hardware configuration :)

I appreciate your contribution. I'll append a little disclaimer to the OP.

-Derek

Koorogi
18th March 2009, 22:14
@Sagittaire:
Sure, if you only encode at ~40fps :)

If you need high speed transcoding, CUDA won't help you. Unless of course you have a real monster like a GTX285!


My understanding is that while CUDA might have a speed advantage on decoding, there is an extra cost associated with moving the decoded image back over the bus to memory. It's not a problem if you're decoding for display, but if you need to run it through any filters or an encoder running on the CPU, it can be significant. With AGP, it was so costly that it negated any speedup you might have gotten from CUDA. I'm not sure of the speed hit nowadays.

squid_80
18th March 2009, 23:14
With PCI-E the bottleneck is the GPU, the decoding speed remains the same even if the decoded frames aren't copied back into main memory.

benwaggoner
19th March 2009, 00:13
With PCI-E the bottleneck is the GPU, the decoding speed remains the same even if the decoded frames aren't copied back into main memory.
Fair point. Speaking as someone who has spent WAY too much time lately devising performance tests for H.264 and VC-1 playback...

As a parallel test, it'd be interesting to look at CPU load playing back at full screen as a somewhat more applicable metric; we can assume that 400 fps with 24 fps content means a 6% CPU load on 8 cores and no worse than 48% on a single core, but I don't know if that's really a fair assumption :).
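The arithmetic behind that estimate can be sketched in a few lines of Python. This only restates the back-of-the-envelope reasoning above; the function names are made up for illustration, and the single-core figure is an upper bound that assumes the decoder loses nothing (and gains nothing) when its work is squeezed onto one core.

```python
def cpu_load(content_fps, decode_fps):
    """Fraction of the measured CPU budget needed to decode in real time."""
    return content_fps / decode_fps


def single_core_upper_bound(load, cores):
    """Upper bound if all the work spread over `cores` landed on one core."""
    return load * cores


load = cpu_load(24, 400)  # 24 fps content, 400 fps measured decode speed
print(f"{load:.0%} across 8 cores")
print(f"at most {single_core_upper_bound(load, 8):.0%} of a single core")
```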

Also, we don't care about average playback so much as worst-case playback; a file that has an 80% average CPU load during decoding, but really varies between 40% and 120%, is going to give a really bad experience at 120%!

Lately, I've been trying to get traces that measure both CPU requirements and actual frames played back every second, so we can see where things go wrong. That's useful for defining minimum system requirements for an experience.
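To illustrate why the average alone hides the problem, here is a rough Python sketch that summarizes a per-second trace. The trace format (one pair of CPU demand and frames played per second) and the sample numbers are hypothetical, invented only to mirror the 40%-to-120% example; "load" here means demand relative to what the machine can supply, so anything over 100 means frames will be dropped.

```python
def summarize(samples, source_fps):
    """Summarize a trace of (cpu_load_percent, frames_played) per second."""
    loads = [load for load, _ in samples]
    played = sum(frames for _, frames in samples)
    expected = source_fps * len(samples)
    return {
        "avg_load": sum(loads) / len(loads),
        "peak_load": max(loads),
        "seconds_over_100": sum(1 for load in loads if load > 100),
        "frames_played_pct": 100.0 * played / expected,
    }


# Hypothetical 4-second trace of 24 fps content: fine on average, bad at the peak.
trace = [(40, 24), (80, 24), (120, 18), (80, 24)]
print(summarize(trace, source_fps=24))
```

The average comes out at a comfortable-looking 80%, yet one second in four drops frames, which is the case the average would never show.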

We've found that how a H.264 file is encoded can cause a whole lot of variability in decode load as well; it's not like you can cite a particular frame size or data rate and be done with it. CAVLC/CABAC, number of slices with the latter, pyramid B-Frames, and multiple reference frames can have a significant impact on average playback, and an even bigger difference on start time and random access time.

It'd be interesting to add Flash and Silverlight to this testing, although that would be based on CPU load and dropped frames, since I don't think either offers a way to play back frames as fast as possible.

I can't believe how often this particular piece of fiction gets quoted in other web sites:
http://www.adobe.com/products/flashplayer/systemreqs/

The following minimum hardware configurations are recommended for an optimal playback experience:

1,920x1,080 (1080p), 24 fps:
Intel Core Duo 1.8GHz, AMD Athlon™ 64 X2 4200+ processor (or equivalent)

128MB of RAM

64MB of VRAM
Now, there may be some 1080p24 H.264 Baseline 2 Mbps with 1 reference frame that can play back in Flash on Core Duo 1.8, maybe. But it sure can't do it in 128 MB of RAM, and it certainly isn't going to play back all the frames!

Dark Shikari
19th March 2009, 00:17
Now, there may be some 1080p24 H.264 Baseline 2 Mbps with 1 reference frame that can play back in Flash on Core Duo 1.8, maybe. But it sure can't do it in 128 MB of RAM, and it certainly isn't going to play back all the frames!

Correct, it won't do it in 128MB of RAM, it'll do it in about 4MB of RAM.

turbojet
19th March 2009, 00:40
It would be nice to see flash and silverlight results but how can these be compared to timecodec tests without a directshow/wmf filter?

Blue_MiSfit
19th March 2009, 02:17
:shrug:

I guess I could do some tests in Media Player Classic, and record CPU utilization over a span of time (anyone know how to do this?) - say a 10 minute test clip that has most of the "magic" turned on - i.e. 4+ mixed references, 8x8dct, CABAC, no slices, high-ish bitrate, 1080p. I'd have to stay consistent with my renderer of course, but I'd vary the DirectShow decoder and core affinity.

I could then compare that to playing the same MP4 in Firefox or IE 7 with Silverlight 3 and Flash...

Do the Flash and Silverlight decoders support multithreading (I'd assume and sincerely hope so!)? Also, last I checked with Flash, a huge bottleneck was with full screen playback (i.e. scaling and rendering). I see this shouldn't be an issue with Silverlight, as it can easily offload this process to GPU. Very cool! I wonder how it compares to EVR or Haali Renderer? Gosh I'm getting a little excited now!

I also wonder how Flash and Silverlight handle PC / TV luma levels? This has been a cause for endless rage for me... I'm sure it has something to do with a combination of GPU manufacturer, OS, and driver version... :p

Thoughts? I'll wait to formulate a decent plan instead of flailing about looking like an idiot as I did with this initial test :D

~MiSfit

Snowknight26
20th March 2009, 02:26
Anyone know the max number of threads that each decoder supports? I was able to push 50% CPU usage on my 16 core server using ffmpeg-mt (8 threads) to decode a ~35Mbps H.264 Blu-ray, but I'd love to do more than that.

For those that were curious:
libavcodec: User: 39s, kernel: 0s, total: 39s, real: 337s, fps: 274.5, dfps: 32.3
ffmpeg-mt: User: 39s, kernel: 0s, total: 39s, real: 149s, fps: 276.9, dfps: 73.1

And a ~5Mbps H.264 480p 60fps clip:
libavcodec: User: 151s, kernel: 15s, total: 167s, real: 1370s, fps: 705.7, dfps: 86.4
ffmpeg-mt: User: 132s, kernel: 2s, total: 134s, real: 251s, fps: 878.6, dfps: 471.3
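If you want to compare runs like these without eyeballing them, a small Python sketch for parsing the result lines is below. It assumes the line format is exactly what is pasted above (field name, colon, number, optional "s"); I have not checked it against other timecodec versions, and it makes no claim about how timecodec defines fps versus dfps.

```python
import re

# Matches "name: number" pairs, with an optional trailing "s" for seconds.
FIELD = re.compile(r"(\w+):\s*(\d+(?:\.\d+)?)s?")


def parse_timecodec(line):
    """Parse 'User: 39s, kernel: 0s, ..., dfps: 32.3' into a dict of floats."""
    return {key.lower(): float(value) for key, value in FIELD.findall(line)}


# Result lines copied from the post above (~35Mbps Blu-ray clip).
runs = {
    "libavcodec": "User: 39s, kernel: 0s, total: 39s, real: 337s, fps: 274.5, dfps: 32.3",
    "ffmpeg-mt": "User: 39s, kernel: 0s, total: 39s, real: 149s, fps: 276.9, dfps: 73.1",
}

parsed = {name: parse_timecodec(line) for name, line in runs.items()}
ratio = parsed["ffmpeg-mt"]["dfps"] / parsed["libavcodec"]["dfps"]
print(f"ffmpeg-mt vs libavcodec dfps ratio: {ratio:.2f}x")
```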

IgorC
20th March 2009, 10:20
Bigger isn't better.

I don't see any point of 8-cores support unless extra high encode speed. 2 situations:

1. Decoder(playback) . There is no need for more than 2 cores to decode most heavy 1080p content.

2. Normal Encoding. (For each instance) 1 core to decode source, 3 cores to encode. 2 cores to decode source, 2 cores to encode...... No difference in performance.

Imagine you have multithreaded mp3 decoder. For what?

Sagekilla
20th March 2009, 14:12
Encoding is what Snowknight26 does on that 16 core behemoth. In his case, having very fast decoding would benefit his encoding, especially if he uses fast enough settings in x264.

clsid
20th March 2009, 15:29
I don't think that ffmpeg-mt has a limit on the number of threads that it supports. At least there is no hardcoded limit in the code. I do remember reading a comment of its dev somewhere that there is a small decoding bug with a large number of threads (32+).

Snowknight26
20th March 2009, 19:28
But can you set the number of decoding threads to more than 8 in ffdshow's config?

IgorC
20th March 2009, 21:05
Encoding is what Snowknight26 does on that 16 core behemoth. In his case, having very fast decoding would benefit his encoding, especially if he uses fast enough settings in x264.

Umh, that makes really big sense despite:
1. 16 cores represents <0.01% of total PCs.
2. <1% of persons who encode at nonsense speed of 250 fps.

But... it really makes sense for ..let me see.... 0.01%*1% = 0.0001% of total video compressionists who need 8-core scaling. It's really useful, wow. Despite another 99.9999% don't need it.

Blue_MiSfit
20th March 2009, 21:40
Are you kidding me?

8 core systems are incredibly affordable to a professional.

Who DOESN'T want 8 core scaling of anything?

Back to my last question - does anyone know how to log CPU usage over a period of time? I'm getting interested in running round 2 of my benchmarks

~MiSfit

clsid
20th March 2009, 21:57
But can you set the number of decoding threads to more than 8 in ffdshow's config?

It seems not. But that limit could be increased if there is anyone that really needs it.

Sagekilla
20th March 2009, 22:20
Umh, that makes really big sense despite:
1. 16 cores represents <0.01% of total PCs.
2. <1% of persons who encode at nonsense speed of 250 fps.

But... it really makes sense for ..let me see.... 0.01%*1% = 0.0001% of total video compressionists who need 8-core scaling. It's really useful, wow. Despite another 99.9999% don't need it.

I never said that 16-core encoding represents more than a minority in PCs, nor did I ever say that lots of people encode at absurd speeds.

Yes, there's tons of people who simply do not need that much computing power. On average, most people would be fine with 2 cores at best. Video encoding though? Sure, if I could get a 16 core machine (and I am considering building a dual quad core rig) then I would jump on it. Having that many cores doesn't necessarily mean you're looking for extreme decoding + fast encoding @ 250 fps either though.


Decoding 1080p blu-rays, resizing to 720p, then doing expensive filtering (MDegrain3), and finally passing it off to x264 using mostly maxed out settings isn't fast at all. I get maybe 1 fps if I'm lucky on my Opteron 170 (dual core, 2 GHz). Yes, I know I could speed it up but the tremendous space savings (2 - 3 GB rips @ crf 19) are well worth it. But, if I can get some fast decoding going, and multithread my filtering, then I could get a substantial decrease in encoding time.

But, take note and don't put words in my mouth, I never said it's a practical solution for everyone.

Snowknight26
20th March 2009, 22:28
Back to my last question - does anyone know how to log CPU usage over a period of time? I'm getting interested in running round 2 of my benchmarks

Give ProcessExplorer a shot.

temporance
20th March 2009, 23:07
Bigger isn't better.

I don't see any point of 8-cores support unless extra high encode speed. 2 situations:

1. Decoder(playback) . There is no need for more than 2 cores to decode most heavy 1080p content.

2. Normal Encoding. (For each instance) 1 core to decode source, 3 cores to encode. 2 cores to decode source, 2 cores to encode...... No difference in performance.

Imagine you have multithreaded mp3 decoder. For what?

Faster-than-real-time decoding is needed when seeking / scrubbing video with inter-frame prediction. So an efficient parallel decoder potentially reduces the delay before playback restarts after a seek, even if it's not so useful during 1x playback.
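A rough way to see the effect: to show a frame in the middle of a GOP, the decoder has to start at the preceding keyframe and decode forward. The Python sketch below estimates the worst-case wait under some loud assumptions of mine: a keyframe interval of 250 frames (a hypothetical value for the example, not something stated in this thread), every frame between the keyframe and the target must be decoded, and decode speed equals the dfps numbers from the first post. Real players may skip non-reference frames or snap to keyframes, so treat this as an upper-bound illustration.

```python
def worst_case_seek_delay(keyframe_interval, decode_fps):
    """Seconds to decode from a keyframe to the last frame before the next one."""
    frames_to_decode = keyframe_interval  # the keyframe plus every frame after it
    return frames_to_decode / decode_fps


# dfps values taken from the first post; the keyframe interval is an assumption.
for name, dfps in [("libavcodec", 39.2), ("DivX H.264 Decoder 1.0", 255.3)]:
    delay = worst_case_seek_delay(250, dfps)
    print(f"{name}: up to {delay:.1f} s to land on the requested frame")
```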

BigDid
20th March 2009, 23:49
...
does anyone know how to log CPU usage over a period of time? I'm getting interested in running round 2 of my benchmarks

~MiSfit
Hi,

RMClock has a logging function, not tested though ...

Did

Blue_MiSfit
21st March 2009, 05:46
Cool, thanks guys.

I'll do some testing, and possibly do another round of benchmarks this weekend.

Take care,

~MiSfit

fields_g
21st March 2009, 12:28
If you do update, would you be interested in also adding 1 and 2 core comparisons? I'd like more than 2 points to see exactly how the trends are moving.

benwaggoner
21st March 2009, 17:51
I guess I could do some tests in Media Player Classic, and record CPU utilization over a span of time (anyone know how to do this?) - say a 10 minute test clip that has most of the "magic" turned on - i.e. 4+ mixed references, 8x8dct, CABAC, no slices, high-ish bitrate, 1080p. I'd have to stay consistent with my renderer of course, but I'd vary the DirectShow decoder and core affinity.

I could then compare that to playing the same MP4 in Firefox or IE 7 with Silverlight 3 and Flash...
I'd definitely be interested in that!

We're continuing to tune SL3 playback performance, so I'm always happy to get more data from the field.

Do the Flash and Silverlight decoders support multithreading (I'd assume and sincerely hope so!)? Also, last I checked with Flash, a huge bottleneck was with full screen playback (i.e. scaling and rendering). I see this shouldn't be an issue with Silverlight, as it can easily offload this process to GPU. Very cool! I wonder how it compares to EVR or Haali Renderer? Gosh I'm getting a little excited now!
Yes, the Silverlight H.264 decoder is multithreaded, as is Silverlight itself pretty pervasively. We're continuing to tune that, particularly for highly multicore systems.

To measure pure decoder performance, you're probably best off just not scaling anything. In Silverlight, this will activate our "Fast Path" behavior.

http://on10.net/blogs/benwagg/Building-high-performance-Silverlight-Media-Players/

I also wonder how Flash and Silverlight handle PC / TV luma levels? This has been a cause for endless rage for me... I'm sure it has something to do with a combination of GPU manufacturer, OS, and driver version... :p
In Silverlight 3 final, we'll support the correct matrices for 601 and 709, with our conversion mapping Y'=16 to R'G'B'=0 and Y'=235 to R'G'B'=255. The latest beta release has a few...issues in that area, but they're already fixed in our current builds. However, that'll have no impact on decode performance, just my own personal embarrassment.

601/709 is determined by frame size, with width >1024 or height >576 as the trigger for 709.
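As an illustration of the behavior described above (not Silverlight's actual code, which I have no visibility into), here is a minimal Python sketch of a limited-range Y'CbCr to full-range R'G'B' conversion with the frame-size rule for picking the matrix. The luma coefficients are the standard BT.601 and BT.709 values; the 16..240 chroma range is the usual studio-range convention and is my assumption, since only the luma mapping is stated here.

```python
# Standard luma coefficients (Kr, Kb) for BT.601 and BT.709.
COEFFS = {"601": (0.299, 0.114), "709": (0.2126, 0.0722)}


def pick_matrix(width, height):
    """Frame-size rule as described above: 709 if width > 1024 or height > 576."""
    return "709" if width > 1024 or height > 576 else "601"


def ycbcr_to_rgb(y, cb, cr, matrix):
    """Convert one 8-bit limited-range Y'CbCr sample to full-range 8-bit R'G'B'."""
    kr, kb = COEFFS[matrix]
    kg = 1.0 - kr - kb
    yn = (y - 16) / 219.0    # Y' 16..235 maps to 0..1
    pb = (cb - 128) / 224.0  # Cb 16..240 maps to -0.5..0.5 (assumed range)
    pr = (cr - 128) / 224.0  # Cr likewise
    r = yn + 2.0 * (1.0 - kr) * pr
    b = yn + 2.0 * (1.0 - kb) * pb
    g = (yn - kr * r - kb * b) / kg
    # Scale to 0..255 and clip anything that falls outside the range.
    return tuple(max(0, min(255, round(c * 255.0))) for c in (r, g, b))


matrix = pick_matrix(1920, 1080)
print(matrix, ycbcr_to_rgb(16, 128, 128, matrix), ycbcr_to_rgb(235, 128, 128, matrix))
```

Note that the rule uses strict comparisons, so a 1024x576 frame still gets 601.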

Thoughts? I'll wait to formulate a decent plan instead of flailing about looking like an idiot as I did with this initial test :D
I'd also recommend using content with burned-in timecode and using something like FRAPS to capture frames getting played back. It's better to play 100% of frames at 80% CPU load than 85% of frames at 40% CPU load :).

Sagekilla
21st March 2009, 18:19
Good idea Ben, the only issue there is that fraps itself will eat up a bit more CPU time and you might get frame drops because of that, so it could create a false performance drop :(

benwaggoner
21st March 2009, 18:48
Good idea Ben, the only issue there is that fraps itself will eat up a bit more CPU time and you might get frame drops because of that, so it could create a false performance drop :(
It's not too bad if you shrink the screen size to the smallest that'll fit the player, and then have FRAPS record at quarter size.

I've been able to record 2560x1600p60 with FRAPS on my 8 core with reasonable foreground performance. It's some impressive tech.

Certainly, capturing the video out would be even better yet, but FRAPS has the advantage of being $37 :). Also, since its overhead would be pretty consistent, it shouldn't bias the results overmuch.

allouh
18th April 2009, 09:07
I have an old system, and my results were very close to those posted here as to which decoder is fastest.
I ran the test on some HD samples.

CoreAVC User: 4s, kernel: 3s, total: 7s, real: 65s, fps: 250.3, dfps: 28.8
DivX User: 4s, kernel: 1s, total: 6s, real: 56s, fps: 303.2, dfps: 33.1

CoreAVC User: 3s, kernel: 5s, total: 9s, real: 40s, fps: 191.2, dfps: 45.7
DivX User: 4s, kernel: 4s, total: 9s, real: 35s, fps: 198.9, dfps: 51.4

CoreAVC User: 13s, kernel: 13s, total: 27s, real: 138s, fps: 221.5, dfps: 43.8
DivX User: 14s, kernel: 7s, total: 22s, real: 131s, fps: 271.9, dfps: 45.9

CoreAVC User: 5s, kernel: 7s, total: 12s, real: 53s, fps: 385.5, dfps: 88.9
DivX User: 6s, kernel: 4s, total: 10s, real: 46s, fps: 464.7, dfps: 101.3

My system is Intel 3400MHz Single Core with HT, 2GB@667 memory, Nvidia 6200tc VGA.
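To boil the four sample pairs above down to one figure, one option is the geometric mean of the per-sample DivX/CoreAVC dfps ratios. A short Python sketch follows; choosing the geometric mean (rather than a plain average) is my own suggestion for averaging ratios, and four clips on one machine is a small sample either way.

```python
import math

# (CoreAVC dfps, DivX dfps) for the four samples listed above.
PAIRS = [(28.8, 33.1), (45.7, 51.4), (43.8, 45.9), (88.9, 101.3)]


def geometric_mean(values):
    """Geometric mean, the usual way to average ratios."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = [divx / coreavc for coreavc, divx in PAIRS]
for (coreavc, divx), ratio in zip(PAIRS, ratios):
    print(f"CoreAVC {coreavc} vs DivX {divx}: {ratio:.3f}x")
print(f"geometric mean: {geometric_mean(ratios):.3f}x")
```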

benwaggoner
19th April 2009, 01:46
Back to my last question - does anyone know how to log CPU usage over a period of time? I'm getting interested in running round 2 of my benchmarks
We've been using Xperf for this.

http://msdn.microsoft.com/en-us/library/cc305221.aspx

I'm not enough of a software engineer to do much more than capture traces myself, but our test and dev folks seem to be able to take those and turn them into "ah-ha!" moments pretty quickly.