x264 feature test: multithreaded encoding for dual-cpus
ok akupenguin/pengvado has a new funky goodie for us:
multithreaded encoding, which can make full use of dual cpus and hopefully results in a nice speed increase
the only problem is, not everyone has a dual cpu, so this can't be tested extensively :D
so i'm starting this thread to point people who have such equipment to the patch that enables this, available here (http://students.washington.edu/lorenm/src/x264/x264_threads.0.diff). maybe someone can make a build so people can test this!?
if it's found to be useful and working, pengvado will commit it to svn
hehe, maybe it's even ~twice as fast as with a single cpu ;)
enjoy :)
Sharktooth
28th May 2005, 14:49
uhm... downloading :D
I'll make an "experimental" build with this patch so ppl can download and test... :)
Doom9
28th May 2005, 15:26
hmm.... looking at the other diffs, it almost looks as if somebody is working on high profile features ;) and the last RD diff is a month old :/
Sharktooth
28th May 2005, 17:52
ok, here's a multithreaded build (rev239):
http://www.webalice.it/f.corriga/x264/X264CLI_239thr_mmx.7z
superdump
28th May 2005, 17:57
RDO is going to be attempted again for the 8x8/4x4 transform decision in high profile. It proved fairly fruitless in aku's previous attempts but maybe it will be useful in this area.
The 8x8 patch currently (2005/05/28) forces the 8x8 transform, and as aku says, this produces worse results than the 4x4 transform alone does in main profile. It's the ability to choose between the two that brings the extra compression.
Also note the ffh264_8x8 patch for libavcodec. This allows decoding of HP streams, though deblocking currently doesn't produce bit-exact output compared to x264 and JM, despite looking correct. It won't properly decode CQM HP encodes either.
Still, plenty to play with. :D Good job aku!
thx a lot for this build! :)
btw it's also usable for people who don't have dual cpus (like me :D); in this case x264 will simply write frames with multiple slices, but it won't be any faster with --threads enabled
just in case someone notices problems with decoding the resulting multislice streams with ffdshow, mplayer (or any other libavcodec-based player), there seems to be a bug in libavcodec (not x264!):
1) using --threads 2 + cabac together results in artefacts that i also saw when decoding multislice moonlight and vss samples with libavcodec, so i assume libavcodec has a problem combining those two features
2) using --threads 2 + cabac + b-frames + b-pyramid together shows the artefacts described in 1), but also crashes libavcodec after a short time (with b-pyramid disabled it's the same as 1), without the crash)
if you have problems, either disable cabac or use another decoder, like nero, to play the files until there is a fix in libavcodec :)
AlexB17
28th May 2005, 20:17
People, why don't you include such useful options as zones & multithreading in the VFW version? Many people don't wanna mess with CLI builds and prefer VDubMod.
BBugsBunny
28th May 2005, 20:31
I've got a dual Xeon 3.6 GHz.
I've already done some testing of SSE3 builds and could compare encoding speed very well, since I'd use the same test avi.
Could someone make a VFW compile? I've never used the CLI; I prefer VFW + VirtualDub. Dual-CPU codecs work very well with VFW too, e.g. the Mainconcept DV codec.
By the way, does someone maybe have a vote left for me for the nero beta programme? I think my hardware setup could be an interesting addition to it.
you guys have a dual cpu, but can't use the cli? is it really that hard to let vfw go :rolleyes:
see, the situation is as follows: if you want a big encoding speed increase you have to drop vfw and use the cli, it's as easy as that ;)
BBugsBunny
28th May 2005, 20:49
Well I think I will test the CLI version anyway - I've been using computers since the C64 so it will not be a big problem.
My personal opinion about CLI is that it's like going back to MS-Dos.
Still, there are some good things about a command shell.
Using VirtualDub and VFW codecs is more comfortable, and with VFW I'd already have some SSE3 benchmark results to compare with.
PS: If someone is interested in building a dual Xeon system, here is my upgrade story:
http://forums.2cpu.com/showthread.php?threadid=58351
superdump
28th May 2005, 21:05
Sorry, deblocking in ffh264 isn't disabled, it's just not producing results identical to x264 and JM. Original post edited.
BBugsBunny
28th May 2005, 21:19
OK did a quick test:
J:\>x264 -p1 -ot.mp4 deintno.avs
avis [info]: 480x640 @ 25.00 fps (405 frames)
x264 [info]: using cpu capabilities MMX MMXEXT SSE SSE2
mp4 [info]: initial delay 0 (scale 25)
x264 [info]: slice I:2 Avg QP:23.00 Avg size: 43203 PSNR Mean Y:41.63 U:50.20 V:52.29 Avg:43.15 Global:43.15
x264 [info]: slice P:403 Avg QP:26.00 Avg size: 10429 PSNR Mean Y:37.89 U:48.11 V:50.33 Avg:39.48 Global:39.48
x264 [info]: slice I Avg I4x4:85.8% I16x16:14.3%
x264 [info]: slice P Avg I4x4:0.1% I16x16:0.5% P:66.5% P8x8:9.6% PSKIP:23.4%
x264 [info]: PSNR Mean Y:37.91 U:48.12 V:50.34 Avg:39.50 Global:39.49 kb/s:2118.2
encoded 405 frames, 9.86 fps, 2118.27 kb/s
with --threads 2
J:\>x264 --threads 2 -p1 -ot.mp4 deintno.avs
avis [info]: 480x640 @ 25.00 fps (405 frames)
x264 [info]: using cpu capabilities MMX MMXEXT SSE SSE2
mp4 [info]: initial delay 0 (scale 25)
x264 [info]: slice I:2 Avg QP:23.00 Avg size: 43323 PSNR Mean Y:41.64 U:50.18 V:52.30 Avg:43.16 Global:43.16
x264 [info]: slice P:403 Avg QP:26.00 Avg size: 10458 PSNR Mean Y:37.89 U:48.11 V:50.32 Avg:39.48 Global:39.48
x264 [info]: slice I Avg I4x4:85.8% I16x16:14.2%
x264 [info]: slice P Avg I4x4:0.1% I16x16:0.5% P:66.8% P8x8:9.6% PSKIP:23.1%
x264 [info]: PSNR Mean Y:37.91 U:48.12 V:50.33 Avg:39.50 Global:39.49 kb/s:2124.1
encoded 405 frames, 13.12 fps, 2125.15 kb/s
AVISynth script deintno.avs:
AVISource("Capture.avi")
ConvertToYV12()
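The two logs above boil down to two numbers: the speed gain from --threads 2 and the bitrate cost of the extra slices at the same average QP. A small Python helper does the arithmetic; `compare_runs` is a hypothetical name used only for illustration:

```python
def compare_runs(fps_1, kbps_1, fps_n, kbps_n):
    """Compare a multithreaded encode against a single-threaded one.

    Returns (speedup, bitrate_overhead):
      speedup          -- fps_n / fps_1
      bitrate_overhead -- relative bitrate increase of the multithreaded run
    """
    speedup = fps_n / fps_1
    bitrate_overhead = (kbps_n - kbps_1) / kbps_1
    return speedup, bitrate_overhead


# Final "encoded 405 frames" lines of the two runs above.
speedup, overhead = compare_runs(9.86, 2118.27, 13.12, 2125.15)
print(f"speedup {speedup:.2f}x, bitrate overhead {overhead:.2%}")
# prints: speedup 1.33x, bitrate overhead 0.32%
```

On this short clip the slices cost about a third of a percent in bitrate for roughly a third more speed; a longer test would give more reliable figures.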
Doom9
28th May 2005, 21:56
405 frames is a bit tiny.. I use 10'000 frames in my codec comparison and I don't even consider that to be enough.. it's merely a choice from a practicality standpoint as I simply cannot not use my PCs while encoding.
APF_Gandalf
28th May 2005, 23:06
made a little test with this build on a 1001-frame file using avisynth on my dual 3.0 GHz Xeon (hyper-threading enabled)
mpc with ffdshow 2005-05-27
osmo4 player 0.2.5-DEV
nero show time 2.0.0.26
nero media player refuses to play anything (because of vobsub?), btw I never use it.
using a simple avs:
avisource("E:\Tsubasa\Tsubasa chronicle.avi")
addborders(0,66,0,66)
trim(0,1000)
common settings:
F:\x264>x264.exe --bframe 2 --ref 5 --pass 1 --stats "x264_stat.log" --qcomp 0.75 --ipratio 1.10 --pbratio 1.30 --analyse "all" --weightb --progress -o tsubasa.mp4 02.avs
avis [info]: 704x528 @ 23.98 fps (1001 frames)
x264 [info]: using cpu capabilities MMX MMXEXT SSE SSE2
default threads (not enabled)
first pass
cpu usage ~27%
encoded 1001 frames, 11.73 fps, 767.80 kb/s
file plays fine in osmo4, MPC and nero
second pass
cpu usage ~27%
3,77 MB (3 962 176 bytes)
encoded 1001 frames, 10.90 fps, 397.23 kb/s
file plays fine in osmo4, MPC and nero
--threads 2
first pass
cpu usage ~41%
encoded 1001 frames, 12.13 fps, 785.49 kb/s
file crashes osmo4 and MPC; nero show time freezes on a frame with artefacts
second pass
cpu usage ~43%
encoded 1001 frames, 16.00 fps, 398.24 kb/s
file crashes osmo4 and MPC; nero show time freezes on a frame with artefacts
--threads 8
first pass
cpu usage ~60%
encoded 1001 frames, 12.13 fps, 785.49 kb/s
file crashes osmo4 and MPC; nero show time freezes on a frame with artefacts
second pass
cpu usage ~66%
encoded 1001 frames, 15.37 fps, 399.29 kb/s
file crashes osmo4 and MPC; nero show time freezes on a frame with artefacts
--threads 16
first pass
cpu usage ~60%
encoded 1001 frames, 11.99 fps, 785.76 kb/s
second pass
cpu usage ~66%
encoded 1001 frames, 15.58 fps, 399.33 kb/s
file crashes osmo4 and MPC; nero show time freezes on a frame with artefacts
first thing to say, the speed increase is really interesting, but I need a decoder that can decode it
I'll try later with a complete anime episode (36 000 frames)
Sharktooth
28th May 2005, 23:20
8 and 16 are useless since you have 4 execution units.
--threads 4 is the ideal choice.
APF_Gandalf
29th May 2005, 00:15
8 and 16 were done because:
-the cpu wasn't saturated yet
-I wanted to see the influence on how precisely the target bitrate/filesize is hit
If I understand it well, it's more or less like slicing the movie into "n" equal parts, encoding them separately (at the same time) and then joining them back together.
If I'm right, the rate control may be less efficient, and the more threads you launch, the less efficient it will be.
akupenguin
29th May 2005, 00:37
No. The movie is sliced into N equal parts within each frame, so ratecontrol is not affected. The only detrimental effects are: cabac contexts are reset, and macroblocks on the top edge of a slice don't get to predict MVs from the row above (both slightly increase bitrate).
With the current patch, threads are capped at 4 even if you have more execution units. If your CPU isn't saturated, it's because not all of the encoding is easily parallelizable. Frame type decisions, deblocking, and half-pel interpolation are done single-threaded.
Note that lavc/ffdshow is known to be buggy with cabac+multislice. I'm investigating it.
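akupenguin's description can be sketched as follows: each frame's macroblock rows are split into N contiguous slices, one per thread, with N capped at 4 in the current patch. This is a Python illustration under assumptions: the helper name is hypothetical, and how the patch distributes leftover rows is not stated in the thread (here the first slices get one extra row).

```python
import math


def slice_rows(height, threads, max_threads=4):
    """Split one frame's macroblock rows into contiguous slices, one per thread.

    Returns (first_row, last_row) pairs, inclusive. Hypothetical helper:
    the cap of 4 mirrors the limit akupenguin describes; giving leftover
    rows to the first slices is an assumption, not taken from the patch.
    """
    rows = math.ceil(height / 16)            # one macroblock row per 16 lines
    n = max(1, min(threads, max_threads, rows))
    base, extra = divmod(rows, n)
    slices, start = [], 0
    for i in range(n):
        count = base + (1 if i < extra else 0)
        slices.append((start, start + count - 1))
        start += count
    return slices


# APF_Gandalf's 704x528 source has 33 macroblock rows.
print(slice_rows(528, 2))   # [(0, 16), (17, 32)]
print(slice_rows(528, 16))  # capped at 4: [(0, 8), (9, 16), (17, 24), (25, 32)]
```

Every slice after the first starts at a row whose macroblocks cannot predict MVs from the row above, and CABAC contexts restart there, which is why more slices cost slightly more bits.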
Joe Fenton
29th May 2005, 08:27
I applied the patch to my cvs checkout of x264. I tried it on a raw dump of the opening of Angelic Layer (720x480, 24 FPS, 1402 frames raw yuv 4:2:0). I used the same settings as APF_Gandalf simply because I'm not familiar with x264. Any suggestions on settings are appreciated.
I'm running Fedora Core 3 for AMD64 (64bit linux), 2.6.11 kernel, on an MSI Master2-FAR with two Opteron 240 CPUs, and 1 GByte of DDR333 memory.
one thread:
pass 1 - encoded 1402 frames, 6.91 fps, 1549.57 kb/s
pass 2 - encoded 1402 frames, 7.82 fps, 1549.57 kb/s
CPU usage is 50 - 55%
two threads:
pass 1 - encoded 1402 frames, 10.62 fps, 1571.93 kb/s
pass 2 - encoded 1402 frames, 12.38 fps, 1571.39 kb/s
CPU usage is 77 - 83%
I would say this is a significant speedup for slower dual CPU systems. I'm doing roughly 50% faster encoding. If I had this for xvid, I'd be a REALLY happy camper. :D Maybe a similar kind of optimization can be made in xvid. As it is, it really helps x264.
Sharktooth
29th May 2005, 15:53
above 50% here too, but it also depends on the avisynth filters.
i'll do a more accurate test with yuv samples asap.
Joe Fenton
29th May 2005, 19:17
Since most filters work on individual frames, it seems to me that you could make a couple threads that processed every other frame to speed filters in AVISynth.
708145
30th May 2005, 01:51
Originally posted by Joe Fenton
Since most filters work on individual frames, it seems to me that you could make a couple threads that processed every other frame to speed filters in AVISynth.
And on top of that, pipelining the filter chain of avisynth would give additional parallelism :D
Wow! 100th post :cool:
bis besser,
Tobias
Else try my new avisynth filter MT (http://forum.doom9.org/showthread.php?t=94996), which splits the frame up into smaller segments and processes them in parallel (like GeForce 6 SLI). Good for some filters (fft3dfilter or removegrain), bad for others (subtitle). Sorry for the commercial, but it seems that the people posting in this thread do indeed have SMP computers.
BBugsBunny
30th May 2005, 22:10
Some time ago (12.2003) I made an SMP version of Donald Graft's MSharpen filter (VirtualDub) simply by using the OpenMP extension of the Intel compiler (#pragma omp) to parallelize a for loop.
The parallelization is very effective: the SMP version gives nearly double the frame rate of the non-SMP version.
If someone wants to take a look:
http://members.chello.at/nagiller/vdub/index.html
sharktooth: why do you post a singlethreaded and a multithreaded cli compile? why not only the multithreaded one?
Sharktooth
31st May 2005, 19:55
coz the multithreaded build includes pthreadGC2.dll and it's bigger than the singlethreaded one.
However i should also compile the singlethreaded CLI to build the VFW, so including the singlethreaded CLI does not hurt.
Originally posted by Sharktooth
coz the multithreaded build includes pthreadGC2.dll and it's bigger than the singlethreaded one.
well the difference is only 19kb!?
wasn't it possible to statically link the pthread lib? :(
Doom9
31st May 2005, 20:18
and what about the decoding issue when using slices and cabac? not that there's a problem with x264, but if decoders can't handle it, might not be a good idea to use it..
oh, and once again, I'm gonna have to split off the thread.....
Originally posted by Doom9
and what about the decoding issue when using slices and cabac? not that there's a problem with x264, but if decoders can't handle it, might not be a good idea to use it..
well, other decoders than libavcodec seem to handle it fine, like nero, moonlight...
Sharktooth
31st May 2005, 20:37
well the difference is only 19kb!?
wasn't it possible to statically link the pthread lib? :(
19kb when zipped. Once unzipped, the difference is bigger.
As for static linking, i tried it with no success...
Originally posted by Doom9
and what about the decoding issue when using slices and cabac? not that there's a problem with x264, but if decoders can't handle it, might not be a good idea to use it..
oh, and once again, I'm gonna have to split off the thread.....
As bond said, it's a problem with libavcodec. Nero, for example, handles it without problems.
So, if you want to decode streams made with --threads > 1, you need a compliant decoder or have to use CAVLC.
akupenguin
31st May 2005, 20:37
Originally posted by akupenguin
Note that lavc/ffdshow is known to be buggy with cabac+multislice. I'm investigating it.
fixed.
Originally posted by Sharktooth
As for static linking, i tried it with no success...
ic :(
split off the posts and merged them into the multithreading thread, as i think they fit there
Originally posted by akupenguin
fixed.
:thanks: Fast - faster - x264 development! :D
Sharktooth
1st June 2005, 16:59
ic :(
i think it's a problem with pthreads :scared:
Originally posted by akupenguin
fixed.
great, has this been committed to the ffmpeg cvs already?
Kostarum Rex Persia
7th June 2005, 01:46
One question: can someone tell me how to use the new Threads option in the VFW rev 245E codec? The default value is 1 thread, but what happens when threads is set to 4? How does that improve H.264 compression? I have an AMD Athlon XP 3200+. I am sorry if the question is obsolete.
bob0r
7th June 2005, 02:15
Seeing you here, i think you need to learn a bit more english; this forum can be very technical, so even non-English-speaking people like myself have to read everything carefully.
The answer to your question:
1: use search and read: http://forum.doom9.org/showthread.php?t=95097
2: very short answer: threads = faster encode (not better quality)
Threads higher than 1 only have an effect when you have more than 1 CPU (or 1 CPU with 2 cores), or both.
So if you have a single CPU, you don't need to use threads.
ok, i have merged the threads to keep things together
about multithreading: it's all about speeding up encoding by making use of dual cpus
- it doesn't help quality (on the contrary, it makes it a bit worse) and
- it makes no sense to use it on a single cpu
Inventive Software
7th June 2005, 11:42
Originally posted by Joe Fenton
Maybe a similar kind of optimization can be made in xvid. As it is, it really helps x264.
It is possible to enable multi-threading in XviD. It's on the same page where you set the FOURCC code and see which extensions are being used. Look near the bottom and find a box "No of threads".
See if that helps.
I am quite interested, because in the codec test I'm doing (it's going great BTW) x264 encodes quite slowly, often taking 3-4 hours per 20 minutes of video, whereas other codecs take between 1 and 2 hours to encode 20 minutes.
Before the rants start coming in, check the computer specs in my signature and you'll know what I'm talking about.
XadoX
10th January 2006, 10:13
is there also a speed increase with HT cpus, or only with real dual-core cpus?
Inventive Software
10th January 2006, 12:34
Bearing in mind HT involves having 2 threads on one core, this should speed it up, though I wouldn't recommend it. HT can be bad in some apps. Best go for the real deal, i.e dual core. ;)
XadoX
10th January 2006, 12:52
oh, i only have an AMD64 3700+ (SanDiego) :(
tomos
13th January 2006, 00:19
trying it here now on my 4400+ X2 and it worked a treat.
changing threads from 1 to 2 got me from 60% to 99% CPU usage :)
ChrisBensch
13th January 2006, 01:08
I've got a dual-core 2.8 GHz Pentium D... but using 2 threads gets me around 80%, and increasing the thread count never gets me above 85%...
ChronoReverse
13th January 2006, 01:35
Originally posted by Inventive Software
Bearing in mind HT involves having 2 threads on one core, this should speed it up, though I wouldn't recommend it. HT can be bad in some apps. Best go for the real deal, i.e dual core. ;)
That said, a proper implementation of SMT is actually useful (like in IBM's POWER cpu). Too bad Intel didn't seem to get it right. Dual core with SMT would be a really nice thing.
Doom9
13th January 2006, 09:05
was it necessary to revive this old thread?
ndkamal
25th January 2006, 14:56
I have a Pentium 4 with hyperthreading technology, and my encoding speed doesn't change with one or two threads. Windows XP and the BIOS recognize 2 CPUs. Can someone give me an explanation? Thanks in advance.
tomos
25th January 2006, 15:07
hyperthreading doesn't give a boost in all apps. it's still just 1 core in the end, so if x264 already loads your cpu @ 100%, adding a 2nd thread can't go above that.
worked a lot for me with my X2 tho. :)
ndkamal
25th January 2006, 19:35
TOMOS, can you give the difference in encoding speed between one and two CPUs?
tomos
25th January 2006, 20:20
any programs to bench with? or just change cpu affinity to 1 core? (not really fair since it gets 100%cpu time while background tasks use the other core)
ndkamal
26th January 2006, 00:11
With X264 Sharktooths daily builds the last release. ;)
tomos
26th January 2006, 00:16
may give this a go tomorrow then. have something on the go at the moment that will take another 14-15 hrs to finish (1080p)
would you like dvd res, or hdtv res as the test?
HardwareGeek
26th January 2006, 08:48
Originally posted by ndkamal
I have a Pentium 4 with hyperthreading technology and my speed of encoding doesn't change with one or two threads. ...
Also, a program needs to be written to take advantage of hyperthreading, in order to see significant improvement. A multithreaded program may see a little improvement, but I don't think always. Single-threaded programs frequently see less performance with hyperthreading.
ndkamal
26th January 2006, 10:34
For TOMOS: I'm talking about a DVD backup, like Matrix or another one; you could post the resolution, the length of the movie, and the encoding speed with one and two cores, using one of the latest versions of x264. Thanks in advance.
Doom9
26th January 2006, 11:16
I put some performance data in the hardware forum when I first got my X2.
tomos
26th January 2006, 11:17
ah ok. have matrix at home so use that - although i havent done a dvd backup in ages.
is using dgindex, and avs to encode in vdub ok? aiming at 700mb for just the film? 2-passes etc.
-Doom9--
i'm kinda curious now as well anyway so will give this a go for my own curiosity :D
ndkamal
26th January 2006, 12:47
You can use the CLI version, MP4 output, a bitrate of 2000 with CE Highprofile without turbo, automated 2-pass, resolution 672x272, the other parameters unchanged; not the whole movie, but one or two chapters, like chapters 10 and 11. Post the fps and the encoding time.
tomos
26th January 2006, 13:27
never used CLI ver before but will give it a go
Doom9
26th January 2006, 14:37
I think you can set the number of threads in VfW as well.. but if you want to use the cli, we have a whole forum with 3 tools that use the x264 commandline encoder.
@ndkamal: with a little searching, you'd have found this: http://forum.doom9.org/showthread.php?t=94226&highlight=doom9+x2
ndkamal
26th January 2006, 15:49
Thanks to the moderator; I hadn't seen this topic before.
tomos
26th January 2006, 18:36
ok, doing this now (with megui). oddly, 2 threads give 85-90% cpu usage?? never noticed it before
may try with 3 threads to see if it picks up the slack
tomos
26th January 2006, 20:22
ok. just did chapters 10+11 of matrix with the latest release in megui - no resizing (couldn't see how in megui), and with the parameters that ndkamal specified
1st - 2 threads, both cores used
start - 17.27.42 end - 17.43.32 FPS - 11.8495
start - 17.54.04 end - 18.08.10 FPS - 13.3062
2nd - 1 thread, one core used (process affinity set, priority above normal)
start - 18.10.13 end - 18.37.41 FPS - 6.83
start - 18.37.41 end - 19.03.01 FPS - 7.40
so 1st avg = 12.5778
and 2nd avg = 7.115
i figure a 77% performance increase from using dual core over single. although to be fair, that's single core without any OS processes on it. maybe an 80-85% increase in performance?
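tomos's figure can be reproduced from the per-pass fps values. Since both passes encode the same frames, total encoding time is proportional to the sum of 1/fps, so a harmonic mean is arguably the fairer average. A short Python check (standard library only) shows the two ways of averaging differ by about one percentage point:

```python
from statistics import harmonic_mean, mean

# Per-pass fps reported above (pass 1, pass 2) for chapters 10+11.
two_threads = [11.8495, 13.3062]
one_thread = [6.83, 7.40]

# tomos's method: arithmetic mean of the two passes' fps.
arith_gain = mean(two_threads) / mean(one_thread) - 1

# Both passes encode the same frames, so total time is proportional to
# sum(1/fps); the matching average is the harmonic mean.
harm_gain = harmonic_mean(two_threads) / harmonic_mean(one_thread) - 1

print(f"arithmetic-mean gain: {arith_gain:.0%}")  # 77%
print(f"harmonic-mean gain:   {harm_gain:.0%}")   # 76%
```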
ndkamal
27th January 2006, 01:00
Thanks TOMOS. I have made some tests on my Pentium 4 and noticed that multithreaded encoding works in the VFW version but not in the CLI version, perhaps a bug; I get a 20% speed increase.
ndkamal
27th January 2006, 01:03
Some questions for TOMOS about your encoding: what resolution? Do you use CLI or VFW?
tomos
27th January 2006, 08:26
in that one i just did, it was using megui which afaik is just a front end for the CLI.
normally, i use virtualdubmod though
Doom9
27th January 2006, 09:22
Originally posted by ndkamal
and I have noticed that the multithreaded encoding works in VFW version and not in CLI version, perhaps a bug,
It's VirtualDub using two threads by default.. seems virtual SMP likes that more than the traditional read->write approach.
Revgen
27th January 2006, 09:27
CLI works fine with my AMD Athlon X2. Perhaps virtual threads don't work as well for CLI as true multi-threading.
ndkamal
27th January 2006, 10:07
I would like the opinion of other P4 HT owners. I have noticed a performance increase of about 80% in the latest version of x264 and 60% in build 270.
tomos
27th January 2006, 13:42
just as a sidenote. i tried adding a 3rd thread, but that made it a little slower than 2 threads.
Doom9
27th January 2006, 14:03
thread synchronization takes its time, and the thread scheduler is likely more efficient if it has the same number of performance-hungry threads as cores it can distribute the threads on. It's like breaking down a two-man job into three bits and letting two people work at it.. they can do just fine with jobs one and two, but if each job takes the same amount of time, they'll finish both at the same time, and if job 3 cannot be further divided, the two people can not work as efficiently as they could if they both had a separate job. I know that's grossly simplified, but I think imagining multithreading in terms of tasks at work and people to do them makes it quite easy to understand.
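Doom9's analogy can be made concrete with a toy model: equal, indivisible jobs handed to whichever worker is free first. This is a deliberately literal Python reading of the analogy; real OS schedulers time-slice threads, so it overstates the effect, and the `makespan` helper is hypothetical:

```python
import heapq


def makespan(job_times, workers):
    """Wall-clock finish time when indivisible jobs are handed out, in order,
    to whichever worker becomes free first (greedy list scheduling)."""
    free_at = [0.0] * workers              # time at which each worker is free
    for t in job_times:
        start = heapq.heappop(free_at)     # earliest-available worker
        heapq.heappush(free_at, start + t)
    return max(free_at)


total_work = 1.0
print(makespan([total_work / 2] * 2, 2))  # 0.5: two jobs, two workers
print(makespan([total_work / 3] * 3, 2))  # about 0.667: the third job runs alone
```

With three equal jobs on two workers, one worker has to run the third job by itself, so in this model the wall-clock time rises from 0.5 to about 0.67 of the total work.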
tomos
27th January 2006, 16:10
just testing tbh since my CPU wasnt running @ 100%
if it just used 5% more then over a long encoding time it would have been worth it
live and learn tho :)
sjchmura
28th January 2006, 17:01
So is anyone compiling (I understand it's non-standard) a dual-core SMP build like the new 408 one???
Doom9
28th January 2006, 17:23
@sjchmura: you know, a little research cannot hurt. Ever since I don't know how many builds there's an option to select the number of threads you want. The fact that you're asking shows that you have not done the research we expect our members to do before they post a question.
sjchmura
29th January 2006, 21:10
Originally posted by Doom9
@sjchmura: you know, a little research cannot hurt. Ever since I don't know how many builds there's an option to select the number of threads you want. The fact that you're asking shows that you have not done the research we expect our members to do before they post a question.
I am not sure why you want to insult me, but... as someone who has helped a lot of people here, this attack was not warranted. If it was my confusion over numbering, fine. But there ARE different builds and different ways they are numbered. For example, the sharktooth build is 408 (recently held back due to bugs). I was assuming the 270 dual CPU build was older, but if this is a new numbering system then that is all you needed to say.
Anyone trying to move from Nero AVC to x264 needs to do A LOT of research, and it is not easy at times to keep up with build numbers and differences.
I am sorry I offended you or the authors of the test build.
Best
STeve
madoka
30th January 2006, 07:40
Originally posted by akupenguin
No. The movie is sliced into N equal parts within each frame, so ratecontrol is not affected. The only detrimental effects are: cabac contexts are reset, and macroblocks on the top edge of a slice don't get to predict MVs from the row above (both slightly increase bitrate)...
What is the advantage of this approach over a simpler one, where one divides the entire movie into N equal parts? Then one just runs N different processes, setting their processor/core affinity accordingly.
Doom9
30th January 2006, 09:31
Originally posted by madoka
where one divides the entire movie into N equal parts? Then one just runs N different processes while setting their processor/core affinity accordingly.
Consider this (and it's a pretty obvious thing if you ask me): part 1 is a slow motion scene, part 2 a high motion scene. If you divide the target size by N, part 1 will look very good, part 2 will look bad because they both get the same bitrate.
708145
30th January 2006, 09:48
Originally posted by Doom9
Consider this (and it's a pretty obvious thing if you ask me): part 1 is a slow motion scene, part 2 a high motion scene. If you divide the target size by N, part 1 will look very good, part 2 will look bad because they both get the same bitrate.
well, splitting in the temporal domain does work but needs a bit of thinking and development to get it right.
ELDER does keyframe exact splitting (actually frame types are the same for each frame in xvid default vs. ELDER) and uses a bitrate distribution _very_ close to xvid's default.
The main drawback I can see is that I lack the spare time to dedicate to ELDER development... there are too many things with higher priority :)
Anyway I will get out beta4 some time soon but probably not with all the features I planned for this version.
madoka
30th January 2006, 17:39
Originally posted by Doom9
Consider this (and it's a pretty obvious thing if you ask me): part 1 is a slow motion scene, part 2 a high motion scene. If you divide the target size by N, part 1 will look very good, part 2 will look bad because they both get the same bitrate.
Forgive my naivete, but I'm not quite following. Suppose there are three parts: part A is slow motion, part B has moderate motion, and part C is high motion. So ideally fewer bits should be used for part A, and the savings used on part C. But how would a single-pass rate control algorithm know this? I imagine there's some sort of look-ahead window which the algorithm uses to tweak the bit rate, but in general the window will be small. Hence, while in part A it can't "see" far enough to realize that part C needs more bits.
Now, a 2-pass rate control algorithm wouldn't be bound by this restriction, since the statistics gathered in the 1st pass would indicate that part 2 is high motion. However, since the 1st pass is essentially encoding with a constant quantizer, the amount of motion doesn't matter, so dividing the movie into N parts also doesn't matter. Then, in the 2nd pass, the process encoding part 1 will know, based on the 1st pass statistics, to use fewer bits. Similarly, the process encoding part 2 will also know it can use more bits.
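The disagreement between Doom9 and madoka can be shown with a toy budget calculation: splitting the target size equally between parts versus deriving each part's budget from global first-pass statistics. In this Python sketch the per-frame complexities are made up, and "bits proportional to complexity" is a deliberate simplification; it is not x264's 2-pass ratecontrol:

```python
def equal_split(total_bits, n_parts):
    """Naive approach: every part gets the same share of the budget."""
    return [total_bits / n_parts] * n_parts


def stats_split(total_bits, complexities, boundaries):
    """Budget per chunk from *global* first-pass statistics.

    complexities -- hypothetical per-frame complexity numbers from pass 1
    boundaries   -- chunk start indices, e.g. [0, 3] for two chunks
    Simplification: bits are proportional to complexity; real 2-pass
    ratecontrol is more involved.
    """
    total_cplx = sum(complexities)
    ends = boundaries[1:] + [len(complexities)]
    return [total_bits * sum(complexities[s:e]) / total_cplx
            for s, e in zip(boundaries, ends)]


# Part 1: slow motion, part 2: high motion (made-up numbers).
cplx = [1, 1, 1, 9, 9, 9]
print(equal_split(1200, 2))             # [600.0, 600.0]
print(stats_split(1200, cplx, [0, 3]))  # [120.0, 1080.0]
```

With global stats the high-motion part gets most of the budget wherever the chunk boundaries fall, and the per-chunk budgets still add up to the target, which is madoka's point; the equal split is the failure case Doom9 describes.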
The procedure above requires some changes in x264, but it seems less complicated than threading.
Lastly, the examples we came up with are both quite artificial. Theories and models aside, isn't it safe to assume that real videos distribute their high and low motion scenes over time roughly uniformly? If not, then what about dividing the movie into k*N pieces, and each processor/core encodes k random pieces?
akupenguin
30th January 2006, 18:00
Originally posted by madoka
Forgive my naivete, but I'm not quite following. Suppose there are three parts: part A is slow motion, part B has moderate motion, and part C is high motion. So ideally fewer bits should be used for part A, and the savings used on part C. But how would a single-pass rate control algorithm know this?
That's where x264 wins. 1pass ABR can do an approximate distribution without any lookahead. Strictly speaking, ABR could still work if you split the movie into pieces... but since each piece could encode at different speeds, the result would be non-deterministic, which sucks for a developer.
Originally posted by madoka
Lastly, the examples we came up with are both quite artificial. Theories and models aside, isn't it safe to assume that real videos distribute their high and low motion scenes over time roughly uniformly?
Random distribution, maybe. I'm not even sure of that; there may be a bias for more action towards the end or something. But when local bitrate varies by a factor of >10, a 2 hour movie is not nearly long enough to make a random distribution appear even.
Originally posted by madoka
If not, then what about dividing the movie into k*N pieces, and each processor/core encodes k random pieces?
Doesn't that just make the distribution problem even worse?
Doom9
30th January 2006, 18:29
Originally posted by madoka
isn't it safe to assume that real videos distribute their high and low motion scenes over time roughly uniformly?
I'm sure if you ask a film student he'd have to disagree. Movie storylines generally follow a certain pattern with a bunch of climaxes. For an action oriented movie, those climaxes would generally signify a high amount of action, and the climaxes are not uniformly distributed. I can't get into more details though since I just don't recall them.
Ice =A=
30th January 2006, 18:48
Just one more thing: Splitting the whole movie in two parts would also increase the encoding time on multicore systems, since the two parts would likely not be finished simultaneously.
madoka
30th January 2006, 23:26
Originally posted by akupenguin
That's where x264 wins. 1pass ABR can do an approximate distribution without any lookahead.
Interesting. Can you give me a brief explanation on how the algorithm works?
Originally posted by madoka
If not, then what about dividing the movie into k*N pieces, and each processor/core encodes k random pieces?
Originally posted by akupenguin
Doesn't that just make the distribution problem even worse?
I'll concede Doom9's point that a typical movie doesn't uniformly distribute its low/high motion scenes over time. So the idea is to sample the movie randomly; if the sample is large enough it should have roughly the same low/high motion distribution as the movie, no? Of course, if the pieces get too small, the discontinuities at the boundaries will become problematic.
However, if I understand how 2-pass rate control works, then none of the above would matter. Besides scene complexity, is there other information that requires coordination across all processes?
On the other hand, synchronization is a problem that needs to be addressed. But the same problem exists for threads, no? There's no guarantee that each thread finishes processing its respective frame slice at the same time, either.
I'm curious about the feasibility/flaws/drawbacks to this approach, because it seems more easily extensible to cluster encoding. I was thinking of doing a project on this, but 708145 (http://www.funknmary.de/bergdichter/projekte/index.php?page=ELDER) beat me to it. I guess I'll just have to think of something else to do...:(
akupenguin
31st January 2006, 00:44
Originally posted by madoka
Interesting. Can you give me a brief explanation on how the algorithm works?
:search: and/or look in x264's "doc" directory.
Originally posted by madoka
I'm curious about the feasibility/flaws/drawbacks to this approach, because it seems more easily extensible to cluster encoding.
Slices:
+ Really easy to implement. (x264's multithreading code is a total of 50 lines of C in one function)
- Doesn't perfectly fill even 2 CPUs.
- Each slice boundary costs a few bits.
GOP threading:
+ Scales perfectly to decent-sized clusters.
- Requires application support / restricts API (seekable input, heavily buffered output.) Not usable for realtime streaming.
- Requires CQ or 2pass (no ABR).
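The GOP-threading side of akupenguin's comparison can be sketched in Python: cut the movie at keyframes and hand whole GOPs to workers, balancing by frame count. The keyframe positions are hypothetical (in practice they would have to be known up front, e.g. from a first pass), frame count stands in for actual encoding cost, and none of this is x264 code:

```python
import heapq


def gop_ranges(keyframes, n_frames):
    """Turn sorted keyframe indices (starting with 0) into (start, end) GOPs, end exclusive."""
    ends = keyframes[1:] + [n_frames]
    return list(zip(keyframes, ends))


def assign_gops(gops, workers):
    """Give each GOP to the currently least-loaded worker, longest GOPs first.

    Load is measured in frames, a stand-in for real encoding cost.
    """
    heap = [(0, w) for w in range(workers)]   # (frames assigned, worker id)
    plan = {w: [] for w in range(workers)}
    for start, end in sorted(gops, key=lambda g: g[1] - g[0], reverse=True):
        load, w = heapq.heappop(heap)
        plan[w].append((start, end))
        heapq.heappush(heap, (load + end - start, w))
    return plan


# Hypothetical keyframes in a 1001-frame clip.
gops = gop_ranges([0, 250, 400, 650, 900], 1001)
print(assign_gops(gops, 2))
# {0: [(0, 250), (650, 900)], 1: [(400, 650), (250, 400), (900, 1001)]}
```

Each worker has to seek to its own keyframes, and the encoded GOPs must be buffered and concatenated in order afterwards, which are the API restrictions listed above; the bit budget per GOP would still have to come from a constant quantizer or 2-pass stats.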
JamPS
2nd October 2006, 00:11
Report:
When using --threads 2 I get around 95% CPU utilization when encoding with StaxRip (x264 runs alone from the command line).
So I'd say multithreading support is nearly perfect on my computer :)
I don't think it will (or needs to) get any better...
Sharktooth
2nd October 2006, 04:26
JamPS, well... it all depends on the CPU you have...
Zarxrax
5th October 2006, 20:58
I did a search and can't seem to find an answer, so I figure I'll ask.
Does --thread-input create an additional thread for avisynth aside from the other threads that have been created?
If I use --threads 2 --thread-input, would this make a total of 3 threads, or just 2?
akupenguin
5th October 2006, 21:26
--threads 2 creates a total of 3 threads, one of which does avisynth.
--thread-input matters only if you're not otherwise using threaded encoding.
ChronoCross
5th October 2006, 21:28
so it would be more efficient on an SMP system to do:
x264.exe --threads 2
and for single processor
x264.exe --threads 1 --thread-input
Would the first one be faster on single processor than the second one or is there a reason the second one would be faster?
akupenguin
5th October 2006, 23:28
for 2x SMP, do
x264.exe --threads 2
(which automatically enables thread-input also)
for a single processor with avisynth input, do
x264.exe --threads 1
for a single processor with rawyuv file input (i.e. the input is limited by harddrive speed, not CPU), do
x264.exe --threads 1 --thread-input
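As akupenguin describes it, --thread-input moves reading the input into its own thread, so the encoder isn't left waiting on a slow source. A generic producer/consumer sketch in Python, assuming a bounded queue between reader and encoder, illustrates the pattern only; it is not x264's implementation, and `read_frames`, `encode_all` and `depth` are hypothetical names:

```python
import queue
import threading


def read_frames(n_frames, out_q):
    """Reader thread: a stand-in for a slow input such as a raw YUV file."""
    for i in range(n_frames):
        out_q.put(f"frame {i}")   # blocks while the buffer is full
    out_q.put(None)               # sentinel: no more input


def encode_all(n_frames, depth=4):
    """Encoder side: consumes frames while the reader keeps the buffer filled."""
    buf = queue.Queue(maxsize=depth)   # bounded buffer between the two threads
    reader = threading.Thread(target=read_frames, args=(n_frames, buf))
    reader.start()
    encoded = []
    while True:
        frame = buf.get()
        if frame is None:
            break
        encoded.append(frame)          # real encoding work would happen here
    reader.join()
    return encoded


print(len(encode_all(10)))  # 10
```

This only pays off when reading can overlap with encoding, which is why it helps with disk-limited raw YUV input but not with CPU-bound avisynth input on a single processor.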
Sharktooth
6th October 2006, 00:24
so --thread-input could be useful even if the input is limited by the network, right?
let's say we have 8 PCs encoding simultaneously and getting data from a single fileserver. the server's network card easily gets overloaded, and its bandwidth is split among the 8 PCs.
--thread-input would help as in the 3rd example you gave... is my assumption correct?