View Full Version : Increased x264 performance by 20% by removing decoding wait time!


morph166955
2nd September 2007, 17:24
I've gone back and forth for a good while on whether this really mattered, so I finally got around to running some tests and I was shocked by the results. While I realize that this does not scale well to HD content, its performance on DVD content was mind-blowing for me!

What I did was create a simple program that reads and buffers raw yuv frames between mencoder and x264, so that x264 does not have to wait for mencoder to decode the next frame. My encoding setup on my Linux box uses mencoder to dump raw yuv (-vf format=i420) video into a fifo pipe, and x264 reads from that fifo pipe as its source. My buffer program works by creating two threads in addition to the original: one whose only job is to read frames from mencoder as fast as it can, and the other to write frames to x264 as fast as x264 wants them. I put in a few failsafes that stop the read thread from overwriting unused frames should the buffer be full (I'll explain this in more detail if someone wants me to; it has to do with the fact that I only allocate the buffer space once and keep looping through it).
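
For anyone who wants to experiment with the idea, here is a minimal Python sketch of this kind of buffer. It is not the program described above (whose source was not posted); the names `FrameRing`, `read_exact`, `reader`, `writer` and `pump` are made up for illustration. It allocates a fixed pool of frame slots once, loops through them, and uses two semaphores as the failsafe that keeps the read thread from overwriting frames that have not been written out yet.

```python
import threading


class FrameRing:
    """Fixed pool of frame slots, allocated once and reused in a loop."""

    def __init__(self, frame_size, slots):
        self.frame_size = frame_size
        self.slots = [bytearray(frame_size) for _ in range(slots)]
        self.lengths = [0] * slots
        self.free = threading.Semaphore(slots)  # slots the reader may fill
        self.filled = threading.Semaphore(0)    # slots the writer may drain


def read_exact(src, buf):
    """Fill buf from src; the result is short only at end of stream."""
    view = memoryview(buf)
    got = 0
    while got < len(buf):
        n = src.readinto(view[got:])
        if not n:
            break
        got += n
    return got


def reader(src, ring):
    """Pull frames from the decoder side as fast as slots become free."""
    i = 0
    while True:
        ring.free.acquire()  # failsafe: block while the buffer is full
        n = read_exact(src, ring.slots[i])
        ring.lengths[i] = n
        ring.filled.release()
        if n < ring.frame_size:  # short or empty read marks end of stream
            return
        i = (i + 1) % len(ring.slots)


def writer(dst, ring):
    """Hand frames to the encoder side as fast as it accepts them."""
    i = 0
    while True:
        ring.filled.acquire()  # block while the buffer is empty
        n = ring.lengths[i]
        dst.write(memoryview(ring.slots[i])[:n])
        ring.free.release()
        if n < ring.frame_size:
            return
        i = (i + 1) % len(ring.slots)


def pump(src, dst, frame_size, slots):
    """Copy src to dst through the ring using one reader and one writer thread."""
    ring = FrameRing(frame_size, slots)
    threads = [
        threading.Thread(target=reader, args=(src, ring)),
        threading.Thread(target=writer, args=(dst, ring)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With the two fifo pipes from this setup it would be called along the lines of `pump(open("testin.yuv", "rb"), open("testout.yuv", "wb"), frame_size, slots)`, where `frame_size` is the size of one raw frame in bytes and `slots` is however many frames fit in the RAM you want to spend.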

The problem that I have seen which this program fixes is that this order has to happen for each frame:

1) x264 makes a read request for x bytes (where x=size of a raw yuv420 frame) from the fifo pipe.
2) mencoder receives this request and decodes a frame (yes, it's more complicated than it just receiving the request, but for simplicity this is OK for now)
3) mencoder sends the raw frame to x264 through the fifo pipe
4) x264 encodes the frame, and then starts the cycle all over again.

After watching the output from my buffering program during my initial test, what I saw was that x264 was at times encoding frames faster than mencoder was spitting them out. Having a lot of RAM in my encoder (4 GB), I was able to do my first tests using a very large buffer to store the raw frames for x264. However, I was able to maintain the same boosted rate using as little as 256MB of RAM (128MB showed a 5% decrease from 256MB).

The test source was the dvd of "Die Another Day." The tests were all done using the following settings:


mencoder dvd:// -dvd-device DIEANOTHERDAY_DISC1.ISO -nosound -ovc raw -of rawvideo -vf scale=854:480,crop=848:352:4:62,harddup,format=i420 -mc 0 -noskip -o testin.yuv

(My buffer program read from testin.yuv and dumped to testout.yuv, both are fifo pipes)

x264 --crf 22 --threads 8 --progress -o bondtest.mkv testout.yuv 848x352
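
To put the buffer sizes in perspective, the short Python calculation below works out how many 848x352 i420 frames fit in a given amount of RAM. It assumes "MB" means 2^20 bytes and ignores any per-frame bookkeeping the buffer program may add; the function names are made up for this example.

```python
def i420_frame_bytes(width, height):
    """One raw i420 frame: a full-resolution Y plane plus two half-by-half chroma planes."""
    chroma_w = (width + 1) // 2
    chroma_h = (height + 1) // 2
    return width * height + 2 * chroma_w * chroma_h


def frames_in_buffer(buffer_mib, width, height):
    """Whole frames that fit in buffer_mib * 2**20 bytes (assumed meaning of 'MB')."""
    return (buffer_mib * 1024 * 1024) // i420_frame_bytes(width, height)


print("bytes per frame:", i420_frame_bytes(848, 352))
for mib in (128, 256, 384, 512, 2048):
    print(f"{mib:>5} MB: {frames_in_buffer(mib, 848, 352)} frames")
```

Each frame is a little under 450 KB, so a 256MB buffer holds roughly 600 frames and a 128MB buffer roughly 300.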


Here are the results that I generated:

WITH BUFFER @ 2GB: 119.83%
time x264 --crf 22 --threads 8 --progress -o /s2/bond1.mkv testout.yuv 848x352
x264 [info]: using cpu capabilities: MMX MMXEXT SSE SSE2 SSSE3
x264 [info]: slice I:3241 Avg QP:21.46 size: 20954 PSNR Mean Y:44.62 U:47.16 V:48.35 Avg:45.38 Global:45.10
x264 [info]: slice P:187134 Avg QP:24.19 size: 4576 PSNR Mean Y:42.56 U:46.32 V:47.04 Avg:43.49 Global:42.91
x264 [info]: mb I I16..4: 41.3% 0.0% 58.7%
x264 [info]: mb P I16..4: 8.5% 0.0% 4.8% P16..4: 31.8% 14.3% 3.1% 0.0% 0.0% skip:37.4%
x264 [info]: SSIM Mean Y:0.9736933
x264 [info]: PSNR Mean Y:42.599 U:46.339 V:47.065 Avg:43.519 Global:42.938 kb/s:970.92

encoded 190375 frames, 149.13 fps, 971.03 kb/s
6268.68user 61.60system 21:19.55elapsed 494%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1major+22787minor)pagefaults 0swaps

WITH BUFFER @ 512MB: 120.19%
time x264 --crf 22 --threads 8 --progress -o /s2/bond2.mkv testout.yuv 848x352
x264 [info]: using cpu capabilities: MMX MMXEXT SSE SSE2 SSSE3
x264 [info]: slice I:3241 Avg QP:21.46 size: 20954 PSNR Mean Y:44.62 U:47.16 V:48.35 Avg:45.38 Global:45.10
x264 [info]: slice P:187134 Avg QP:24.19 size: 4576 PSNR Mean Y:42.56 U:46.32 V:47.04 Avg:43.49 Global:42.91
x264 [info]: mb I I16..4: 41.3% 0.0% 58.7%
x264 [info]: mb P I16..4: 8.5% 0.0% 4.8% P16..4: 31.8% 14.3% 3.1% 0.0% 0.0% skip:37.4%
x264 [info]: SSIM Mean Y:0.9736933
x264 [info]: PSNR Mean Y:42.599 U:46.339 V:47.065 Avg:43.519 Global:42.938 kb/s:970.92

encoded 190375 frames, 149.58 fps, 971.03 kb/s
6273.16user 61.78system 21:14.44elapsed 497%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+22782minor)pagefaults 0swaps

WITH BUFFER @ 384MB: 120.38%
time x264 --crf 22 --threads 8 --progress -o /s2/bond4.mkv testout.yuv 848x352
x264 [info]: using cpu capabilities: MMX MMXEXT SSE SSE2 SSSE3
x264 [info]: slice I:3241 Avg QP:21.46 size: 20954 PSNR Mean Y:44.62 U:47.16 V:48.35 Avg:45.38 Global:45.10
x264 [info]: slice P:187134 Avg QP:24.19 size: 4576 PSNR Mean Y:42.56 U:46.32 V:47.04 Avg:43.49 Global:42.91
x264 [info]: mb I I16..4: 41.3% 0.0% 58.7%
x264 [info]: mb P I16..4: 8.5% 0.0% 4.8% P16..4: 31.8% 14.3% 3.1% 0.0% 0.0% skip:37.4%
x264 [info]: SSIM Mean Y:0.9736933
x264 [info]: PSNR Mean Y:42.599 U:46.339 V:47.065 Avg:43.519 Global:42.938 kb/s:970.92

encoded 190375 frames, 149.82 fps, 971.03 kb/s
6272.31user 60.25system 21:11.74elapsed 497%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+22788minor)pagefaults 0swaps

WITH BUFFER @ 256MB: 119.79%
time x264 --crf 22 --threads 8 --progress -o /s2/bond4.mkv testout.yuv 848x352
x264 [info]: using cpu capabilities: MMX MMXEXT SSE SSE2 SSSE3
x264 [info]: slice I:3241 Avg QP:21.46 size: 20954 PSNR Mean Y:44.62 U:47.16 V:48.35 Avg:45.38 Global:45.10
x264 [info]: slice P:187134 Avg QP:24.19 size: 4576 PSNR Mean Y:42.56 U:46.32 V:47.04 Avg:43.49 Global:42.91
x264 [info]: mb I I16..4: 41.3% 0.0% 58.7%
x264 [info]: mb P I16..4: 8.5% 0.0% 4.8% P16..4: 31.8% 14.3% 3.1% 0.0% 0.0% skip:37.4%
x264 [info]: SSIM Mean Y:0.9736933
x264 [info]: PSNR Mean Y:42.599 U:46.339 V:47.065 Avg:43.519 Global:42.938 kb/s:970.92

encoded 190375 frames, 149.09 fps, 971.03 kb/s
6273.07user 62.24system 21:17.68elapsed 495%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1major+22772minor)pagefaults 0swaps

WITH BUFFER @ 128MB: 115.50%
time x264 --crf 22 --threads 8 --progress -o /s2/bond3.mkv testout.yuv 848x352
x264 [info]: using cpu capabilities: MMX MMXEXT SSE SSE2 SSSE3
x264 [info]: slice I:3241 Avg QP:21.46 size: 20954 PSNR Mean Y:44.62 U:47.16 V:48.35 Avg:45.38 Global:45.10
x264 [info]: slice P:187134 Avg QP:24.19 size: 4576 PSNR Mean Y:42.56 U:46.32 V:47.04 Avg:43.49 Global:42.91
x264 [info]: mb I I16..4: 41.3% 0.0% 58.7%
x264 [info]: mb P I16..4: 8.5% 0.0% 4.8% P16..4: 31.8% 14.3% 3.1% 0.0% 0.0% skip:37.4%
x264 [info]: SSIM Mean Y:0.9736933
x264 [info]: PSNR Mean Y:42.599 U:46.339 V:47.065 Avg:43.519 Global:42.938 kb/s:970.92

encoded 190375 frames, 143.75 fps, 971.03 kb/s
6250.02user 59.18system 22:05.33elapsed 476%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+22789minor)pagefaults 0swaps

WITHOUT BUFFER: 100%
time x264 --crf 22 --threads 8 --progress -o /s2/bond2.mkv testout.yuv 848x352
x264 [info]: using cpu capabilities: MMX MMXEXT SSE SSE2 SSSE3
x264 [info]: slice I:3241 Avg QP:21.46 size: 20954 PSNR Mean Y:44.62 U:47.16 V:48.35 Avg:45.38 Global:45.10
x264 [info]: slice P:187134 Avg QP:24.19 size: 4576 PSNR Mean Y:42.56 U:46.32 V:47.04 Avg:43.49 Global:42.91
x264 [info]: mb I I16..4: 41.3% 0.0% 58.7%
x264 [info]: mb P I16..4: 8.5% 0.0% 4.8% P16..4: 31.8% 14.3% 3.1% 0.0% 0.0% skip:37.4%
x264 [info]: SSIM Mean Y:0.9736933
x264 [info]: PSNR Mean Y:42.599 U:46.339 V:47.065 Avg:43.519 Global:42.938 kb/s:970.92


encoded 190375 frames, 124.45 fps, 971.03 kb/s
6194.82user 50.39system 25:29.83elapsed 408%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1major+22784minor)pagefaults 0swaps


The percentages next to "WITH BUFFER @ ___MB" give that pass's fps value as a percentage of the baseline run (bottom), which had no buffer. The mencoder and x264 command lines were unchanged between all passes; the only difference was the buffer size, which was set with a command-line option to my program. Just to note, the max CPU% on my system is 800% because I have 8 cores, so even though x264 wasn't able to use all of that, it was a marked improvement.
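
For reference, the percentages can be reproduced from the fps figures in the logs with a few lines of Python. The fps values are copied from the runs above; the last digit may differ from the quoted percentages depending on whether you round or truncate.

```python
# fps reported by x264 for each run, copied from the logs above
baseline_fps = 124.45
buffered_fps = {
    "2GB": 149.13,
    "512MB": 149.58,
    "384MB": 149.82,
    "256MB": 149.09,
    "128MB": 143.75,
}


def relative_speed(fps, baseline):
    """Speed of a run as a percentage of the unbuffered baseline."""
    return 100.0 * fps / baseline


for size, fps in buffered_fps.items():
    print(f"{size:>6}: {relative_speed(fps, baseline_fps):.2f}%")
```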

I'm going to run some tests on HD content later today, but I'm skeptical that the results will be more than a 5-10% increase (although that is still something!). Just thought I would share my findings and see what others think about it.

akupenguin
2nd September 2007, 17:29
1) x264 makes a read request for x bytes (where x=size of a raw yuv420 frame) from the fifo pipe.
2) mencoder receives this request and decodes a frame (yes, it's more complicated than it just receiving the request, but for simplicity this is OK for now)
3) mencoder sends the raw frame to x264 through the fifo pipe
4) x264 encodes the frame, and then starts the cycle all over again.

Not quite. --thread-input (which is implied by --threads) creates a 1-frame buffer. But that's a far cry from your 256MB buffer.

Also, I expect the amount of speedup to vary quite a lot depending on your x264 settings. If x264 uses slow enough settings that mencoder can always decode one frame during the time that x264 encodes one frame, then a 1-frame buffer should be enough. OTOH, at -m1 the speedup should be even greater (or maybe not, if it means x264 always finishes first).
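
A toy model makes this concrete. The Python sketch below is a simplification with made-up per-frame timings, not a measurement: it assumes the decoder needs a free slot before it starts a frame and that a slot is freed the moment the encoder picks that frame up. `total_time` is a name invented for this example.

```python
def total_time(decode_times, encode_times, buffer_frames):
    """Time until the last frame is encoded, with buffer_frames slots between decoder and encoder."""
    if buffer_frames < 1:
        raise ValueError("buffer_frames must be at least 1")
    n = len(decode_times)
    enc_start = [0] * n
    dec_done = 0
    enc_done = 0
    for i in range(n):
        # The decoder may only start frame i once frame i - buffer_frames
        # has been picked up by the encoder (that is what frees a slot).
        slot_free = enc_start[i - buffer_frames] if i >= buffer_frames else 0
        dec_done = max(dec_done, slot_free) + decode_times[i]
        # The encoder starts frame i when it is idle and the frame is decoded.
        enc_start[i] = max(enc_done, dec_done)
        enc_done = enc_start[i] + encode_times[i]
    return enc_done


# Case 1 (hypothetical): decoding is always faster than encoding.
steady_dec = [1] * 50
steady_enc = [2] * 50

# Case 2 (hypothetical): 100 frames that decode quickly but encode slowly,
# followed by 100 frames that decode slowly but encode quickly.
bursty_dec = [1] * 100 + [3] * 100
bursty_enc = [3] * 100 + [1] * 100

for name, dec, enc in (("steady", steady_dec, steady_enc),
                       ("bursty", bursty_dec, bursty_enc)):
    for slots in (1, 200):
        print(name, "buffer", slots, "->", total_time(dec, enc, slots))
```

In the steady case the buffer size makes no difference, because the encoder never has to wait. In the bursty case a large buffer lets the decoder run ahead during the slow-to-encode stretch, and the encoder then works through that backlog while decoding is slow.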

Sagekilla
2nd September 2007, 17:40
Just curious, but it seems like you're running a quad or octo-core rig, right? I see no other reason to use that many threads if you're not running a rig like that.

I have the weird feeling that the speedup you're encountering is specific to that very fact.

buzzqw
2nd September 2007, 18:56
yes, it's a great 2*4cpu server

@akupenguin

Could there be a reason to make x264 buffer more frames? (better scalability, better buffering... I don't know...)

BHH

morph166955
2nd September 2007, 21:33
@aku, could you add some way to set the number of frames that x264 should buffer (e.g. something like "--thread-input 4" to buffer 4 frames)? Possibly something highly expandable and "limitless", so that I could do --thread-input 500 or similar? And yes, before anyone asks, I did see an instance where having a buffer that large helped, because there was a high-speed action scene that depleted the buffer in my program to that extent.

And yes, I'm sure I'm seeing it more noticeably because I'm running 8 cores on my machine, but if you look at the CPU percentages, without the buffer in place x264 was using 408% CPU, whereas with the buffer in place and 256+ MB in it, it was closer to 497% CPU.

foxyshadis
3rd September 2007, 04:48
It would probably help with x264farm for certain patterns, since heavy threading can be left high and dry while the network pulls more frames. It's essentially a much slower version of the same situation morph's god box is in.

morph166955
3rd September 2007, 17:24
I just ran the same tests as before, but this time with --thread-input enabled, and actually saw something I didn't expect: a decrease of about 1% from the baseline (no buffer or thread-input) when it was enabled versus when it wasn't. I'm kind of confused as to why this would be happening. aku, any ideas?

akupenguin
3rd September 2007, 18:51
The most obvious explanation is that your experiment is only accurate to +/-1%. Which is not entirely unexpected if it involves reading a high-bitrate DVD ISO from a hard drive.

... Although your statement did inspire me to find an unrelated bug, whereby --thread-input is only implied by explicit --threads values, not --threads=auto.

morph166955
3rd September 2007, 18:55
While I agree that the experiment does have its margin of error, I was seeing that in almost all cases (and I re-ran the original ones at the same time just in case) the runs with it on were slower by just that little bit than the ones without it.

And glad I could be of remote help with the bug find lol.

morph166955
3rd September 2007, 19:12
As for HD content, no matter what I do I can't decode the HD source fast enough to feed x264's needs or keep my buffer loaded above 1 frame. mplayer/mencoder is maxing out at 22fps on this system since it isn't doing multithreaded decoding of the h.264 source (in this case a Blu-ray m2ts file).

Manao
3rd September 2007, 19:51
Just take a 352x288 raw source (foreman, for example), and upsize it with mencoder before piping it to x264. The hard drive's bandwidth won't be saturated, and mencoder should be able to do that fast enough.