x264's 1st pass always <100% CPU Usage (& --b-adapt 1 crash)


Snowknight26
20th November 2008, 16:51
Recently got a 16 core server that wound up with x264. Ready to see my speeds skyrocket to levels I've never seen before, I decided to start off with a simple 1st pass:

x264 --pass 1 --bitrate 11848 --stats "thunderball/thunderball.stats" --bframes 3 --b-adapt 2 --b-pyramid --direct auto --deblock -3:-3 --subme 4 --analyse none --me hex --threads auto --thread-input --progress --no-psnr --output NUL "thunderball/thunderball.avs"

Low number of --bframes, relatively low --subme, and --me.
Expecting to see speeds a bit faster than my 3.2GHz Q6600 (my reference), I was dismayed when I saw it was actually slower.. almost 50% slower. So I checked the task manager and noticed only 16% CPU usage (~2-3 cores) -- my Q6600 would do 50% CPU usage (2 cores) too.

Hoping to get higher CPU usage (and hopefully no speed loss due to unused cycles), I upped the settings:

x264 --pass 1 --bitrate 11848 --stats "thunderball/thunderball.stats" --bframes 3 --b-adapt 2 --b-pyramid --direct auto --deblock -3:-3 --subme 9 --analyse none --me umh --threads auto --thread-input --progress --no-psnr --output NUL "thunderball/thunderball.avs"

This time it hovered around 30-40%. (http://www.stfcc.org/misc/server/encoding.png) Again, increased the settings per DS's advice to what I assumed would really max out the beast:

x264 --pass 1 --bitrate 11848 --stats "thunderball/thunderball.stats" --bframes 3 --b-adapt 2 --b-pyramid --direct auto --deblock -3:-3 --subme 9 --analyse all --8x8dct --me tesa --threads 32 --thread-input --progress --no-psnr --output NUL "thunderball/thunderball.avs"

Nope. (http://www.stfcc.org/misc/server/encoding.bottleneck.png) However, this time, the last 4 cores were being maxed out, but the first 12 weren't. Short of using a higher --merange, I couldn't think of any slower settings. Maybe it's an x264 bottleneck? I've ruled out any avs/HDD issues, so I'm stumped as to what else it could be.

LoRd_MuldeR
20th November 2008, 16:58
What happens if you replace "--b-adapt 2" with "--b-adapt 1" ???

Snowknight26
20th November 2008, 17:03
50% CPU usage for about 10 seconds followed by about half a minute of 5% (http://www.stfcc.org/misc/server/encoding.fail.png) then x264 silently closes.

LoRd_MuldeR
20th November 2008, 17:06
50% CPU usage for about 10 seconds followed by about half a minute of 5% (http://www.stfcc.org/misc/server/encoding.fail.png) then x264 silently closes.

Something is going seriously wrong :confused:

shon3i
20th November 2008, 17:07
What if you use --threads 24 instead of --threads auto?

Snowknight26
20th November 2008, 17:08
It would be the same thing because auto is cores*1.5.

LoRd_MuldeR
20th November 2008, 17:14
Maybe the "thread pool" patch is helpful with that enormous number of frames?

Also you could try to throw even more threads on the machine ;)

Snowknight26
20th November 2008, 17:15
I'd be willing to test it out if a link to x264 compiled with that patch was provided.

LoRd_MuldeR
20th November 2008, 17:21
I'd be willing to test it out if a link to x264 compiled with that patch was provided.

Try here: http://komisar.gin.by/

Snowknight26
20th November 2008, 17:26
Also you could try to throw even more threads on the machine ;)

It's already at 32. ;)

Try here: http://komisar.gin.by/

Same issue. (http://www.stfcc.org/misc/server/encoding.testbuild.png)

kemuri-_9
20th November 2008, 17:31
Dark Shikari mentioned elsewhere (i don't remember where now)
that the number of threads should not cause the following inequality to break:
vertical resolution / threads > 32 or 40
(iirc)
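
The rule of thumb above can be checked against the numbers in this thread. The following is a minimal Python sketch, not x264 code: the function names are made up for illustration, the `cores * 1.5` rule is taken from Snowknight26's earlier post, and the 32/40 bounds are the ones kemuri-_9 recalls (iirc).

```python
def auto_threads(cores):
    """Thread count for --threads auto, using the cores * 1.5 rule quoted in this thread."""
    return int(cores * 1.5)


def rows_per_thread(height, threads):
    """Vertical resolution divided by the number of threads."""
    return height / threads


def satisfies_rule(height, threads, min_rows):
    """True if 'vertical resolution / threads > min_rows' holds."""
    return rows_per_thread(height, threads) > min_rows


if __name__ == "__main__":
    height = 816  # the cropped source from this thread (1920x816)
    for threads in (auto_threads(16), 32):
        for bound in (32, 40):
            ok = satisfies_rule(height, threads, bound)
            print(f"threads={threads} bound={bound} "
                  f"rows/thread={rows_per_thread(height, threads):.1f} ok={ok}")
```

For the 816-row source, 24 threads gives 34 rows per thread, which passes the bound of 32 but not 40; 32 threads gives 25.5 rows per thread, which fails both.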

as you claim to have weeded out avs errors, it doesn't seem like it.
as the silent close is a known symptom of crashing by virtual memory limit overflow.

so you should watch the memory/virtual memory usage by x264 as it's using avs, and make sure it doesn't reach the 32 bit 2GB limit.

if it is, a way of working around the limit would be to use avs2yuv and pipe the y4m output to x264's stdin.
so then both avs2yuv(avs) and x264 each have a separate 2GB limit rather than them together having the limit.
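
The workaround above boils down to two processes joined by a pipe, so each gets its own address space. Here is a minimal Python sketch of that plumbing only. `run_pipeline` is a hypothetical helper, and the argument lists for avs2yuv and x264 are deliberately left to the caller, since the exact flags depend on the builds in use and are not given in this thread.

```python
import subprocess
import sys


def run_pipeline(producer_cmd, consumer_cmd):
    """Run producer | consumer as two separate processes.

    producer_cmd would be the avs2yuv command line (writing y4m to stdout),
    consumer_cmd the x264 command line (reading from stdin). Both are plain
    argument lists supplied by the caller; none are assumed here.
    """
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout)
    # Close our copy of the pipe so the producer gets a broken pipe
    # if the consumer exits early (for example, if it crashes).
    producer.stdout.close()
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    return producer_rc, consumer_rc
```

Because the two return codes come back separately, a crash in the encoder can be told apart from a failure in the script-serving process.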

Sharktooth
20th November 2008, 17:31
the input thread is too slow causing the encoder to "wait" for data...
use a faster or multithreaded decoder for your source and simplify your avs script. try avisynth MT also.

LoRd_MuldeR
20th November 2008, 17:32
I have to ask again: Are you really 100% sure that there is no bottleneck in your source?

And did you make sure that x264's affinity isn't limited to certain cores?

Snowknight26
20th November 2008, 17:33
as the silent close is a known symptom of crashing by virtual memory limit overflow.

so you should watch the memory/virtual memory usage by x264 as it's using avs, and make sure it doesn't reach the 32 bit 2GB limit.

That's a possibility, but x264.exe only goes up to about 1,800,000KB. Overall RAM usage goes from 1.3GB -> 3.1GB then back to 1.3GB as x264 exits.

I have to ask again: Are you really 100% sure that there is no bottleneck in your source?

Yes, positive. Decoder can decode at 100fps+, avs does at least 50fps.
And did you make sure that x264's affinity isn't limited to certain cores?
Yes. (http://www.stfcc.org/misc/server/encoding2.png)


Anyway, point being, since I don't use --b-adapt 1 which causes this silent crash, the only issue I'm concerned about (can't say about others) is the sub 100% CPU usage.

crypto
20th November 2008, 18:29
the input thread is too slow causing the encoder to "wait" for data...
use a faster or multithreaded decoder for your source and simplify your avs script. try avisynth MT also.

I bet that's the cause. I had similar effects on 4 cores before switching to CUDA decoding. Also resizing is a limiting factor.

@Snowknight26
What fps do you get in pass #1?
Are you downsizing?
What is the input resolution?
What preprocessing filters do you use?

MasterNobody
20th November 2008, 18:34
Try here: http://komisar.gin.by/
Same issue. (http://www.stfcc.org/misc/server/encoding.testbuild.png)
For full activation of the thread-pool patch you must specify a --thread-queue value larger than --threads (for your configuration I would recommend --threads 24 --thread-queue 48).

LoRd_MuldeR
20th November 2008, 18:34
Snowknight26, just to be sure: Open your source AVS file in VirtualDub and save the filtered video to the HuffYUV format.

Then run the encode from the HuffYUV file with only AVISource() in your script...

(I assume that the HDD throughput shouldn't be a limiting factor on your server machine ^^)

Snowknight26
20th November 2008, 18:48
What fps do you get in pass #1?
Are you downsizing?
What is the input resolution?
What preprocessing filters do you use?

It varies based on the settings, so I can't specifically answer that.
No resizing, input resolution is 1920x1080, only using crop in the avs (->1920x816).

For full activation of the thread-pool patch you must specify a --thread-queue value larger than --threads (for your configuration I would recommend --threads 24 --thread-queue 48).

I found that --threads 32 increased my encoding speeds by about 6%, but regardless, x264 still silently closes with:

x264test --pass 1 --bitrate 11848 --stats "thunderball/test.stats" --bframes 16 --b-adapt 1 --weightb --b-pyramid --direct auto --deblock -3:-3 --subme 9 --analyse all --8x8dct --me tesa --threads 24 --thread-queue 48 --thread-input --progress --no-psnr --output NUL "thunderball/test.avs"

Edit: Just did some more tests with the above settings and the changes below:
--thread-queue 48 - Crash
--thread-queue 32 - No crash

Similarly, with techouse's r1028 build (without --thread-queue of course):
--bframes 16 --b-adapt 1 - Crash
--bframes 3 --b-adapt 2 - No crash

Snowknight26, just to be sure: Open your source AVS file in VirtualDub and save the filtered video to the HuffYUV format.

Then run the encode from the HuffYUV file with only AVISource() in your script...


I don't think there's any need for that, as my 2nd pass can sustain 6fps+ while using 100% of the CPU, and the 1st pass can't do either. If there are no other options, I will.

LoRd_MuldeR
20th November 2008, 18:53
You still should get a debug build (http://kemuri9.net/dev/x264/x264_mod2_debug.exe) and track down the "--b-adapt 1" crash. Also did you try "--thread-queue" as suggested by MasterNobody yet?

crypto
20th November 2008, 18:57
It varies based on the settings, so I can't specifically answer that.
No resizing, input resolution is 1920x1080, only using crop in the avs (->1920x816).
...
I don't think there's any need for that, as my 2nd pass can sustain 6fps+ while using 100% of the CPU, and the 1st pass can't do either. If there are no other options, I will.
At that speed forget about my questions. Decoding speed and resizing will have a significant impact above 30-50 fps.

Snowknight26
20th November 2008, 19:04
You still should get a debug build (http://kemuri9.net/dev/x264/x264_mod2_debug.exe) and track down the "--b-adapt 1" crash. Also did you try "--thread-queue" as suggested by MasterNobody yet?

Yes, edited my previous reply.

Hmm, now I can't seem to reproduce the silent crash, but when I check the task manager, x264 only uses 1GB now instead of close to 2GB.

Edit: There we go, crashing again. Debug build does not crash, techouse's and komisar's do.

I can give VNC access to the box if it would help track the problem down faster.

LoRd_MuldeR
20th November 2008, 19:10
Yes, edited my previous reply.

Hmm, now I can't seem to reproduce the silent crash, but when I check the task manager, x264 only uses 1GB now instead of close to 2GB.

Edit: There we go, crashing again. Will try the debug build now.

Use gdb to run the debug build:
http://sourceforge.net/project/showfiles.php?group_id=2435&package_id=20507
(You need the "gdb-6.8-mingw-3.tar.bz2" file)

Simply prefix "gdb --args" to your command line, like this:
gdb --args x264 --pass 1 --bitrate 11848 --b-adapt 1 [...] --stats "thunderball/thunderball.stats" --output NUL "thunderball/thunderball.avs"

Type "run" to make it start. When it crashes, type "bt" and press Enter to get a backtrace...

Snowknight26
20th November 2008, 19:27
#0 x264_malloc (i_size=6983680) at common/common.c:739
#1 0x0042892d in x264_frame_new (h=<incomplete type>) at common/frame.c:66
#2 0x0042c1a5 in x264_frame_pop_unused (h=<incomplete type>)
at common/frame.c:947
#3 0x0040a2ed in x264_encoder_encode (h=<incomplete type>, pp_nal=0x22faa8,
pi_nal=0x22faac, pic_in=0x22fc90, pic_out=0x22fab0)
at encoder/encoder.c:1419
#4 0x0040266b in Encode_frame (h=0x63790c0, hout=0x77bf1d08, pic=0x22fc90)
at x264.c:768
#5 0x004029c4 in Encode (param=0x22fd30, opt=0x22fd10) at x264.c:852
#6 0x004013d0 in main (argc=31, argv=0x3f4430) at x264.c:113

LoRd_MuldeR
20th November 2008, 19:29
Obviously a crash in x264_malloc(), that is: memory allocation code...
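
To put the failing allocation in context, here is a rough Python sketch of the arithmetic. It uses only figures from this thread: the `i_size` from the backtrace, the ~1,800,000KB task manager reading, and the 2GB limit for a 32-bit process. It assumes 1KB = 1024 bytes and ignores address-space fragmentation, so the real number of allocations that still succeed may be lower.

```python
ADDRESS_SPACE_LIMIT = 2 * 1024**3   # 2GB limit for a 32-bit process
ALLOC_SIZE = 6983680                # i_size from the x264_malloc frame in the backtrace
REPORTED_USAGE_KB = 1_800_000       # x264.exe size seen in the task manager


def allocations_that_fit(limit_bytes, used_bytes, alloc_bytes):
    """How many more allocations of alloc_bytes fit below the limit,
    assuming (optimistically) that the free space is contiguous."""
    remaining = limit_bytes - used_bytes
    return max(remaining, 0) // alloc_bytes


if __name__ == "__main__":
    used = REPORTED_USAGE_KB * 1024
    print(allocations_that_fit(ADDRESS_SPACE_LIMIT, used, ALLOC_SIZE))
```

Under these assumptions only a few dozen more allocations of that size fit below the limit, so a process that looks like it is "only" at 1.8GB is already close to failing.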

Snowknight26
20th November 2008, 20:38
Now that that problem has been identified, what about that other (http://www.stfcc.org/misc/server/encoding.bottleneck.png) persistent problem (http://www.stfcc.org/misc/server/encoding.bottleneck2.png)? I'm missing out on a potential 200% speed increase. Slower settings won't work because CPU usage will go up fractionally while encoding speed goes down dramatically. My guess is the bframe decision is holding everything else back.

Dark Shikari
20th November 2008, 20:56
#0 x264_malloc (i_size=6983680) at common/common.c:739
#1 0x0042892d in x264_frame_new (h=<incomplete type>) at common/frame.c:66
#2 0x0042c1a5 in x264_frame_pop_unused (h=<incomplete type>)
at common/frame.c:947
#3 0x0040a2ed in x264_encoder_encode (h=<incomplete type>, pp_nal=0x22faa8,
pi_nal=0x22faac, pic_in=0x22fc90, pic_out=0x22fab0)
at encoder/encoder.c:1419
#4 0x0040266b in Encode_frame (h=0x63790c0, hout=0x77bf1d08, pic=0x22fc90)
at x264.c:768
#5 0x004029c4 in Encode (param=0x22fd30, opt=0x22fd10) at x264.c:852
#6 0x004013d0 in main (argc=31, argv=0x3f4430) at x264.c:113

That's a malloc failure. Your operating system ran out of memory to give to x264.

Snowknight26
20th November 2008, 21:09
With a 64-bit OS and 8GB of RAM (~6 of which are unused), I have my doubts. Anything else that could be causing it?

kemuri-_9
20th November 2008, 21:15
x264 is 32 bit on windows, it has a 2GB page/memory limit.

(i say this with assurance, since no one would realistically use the x64 version that has no handwritten asm, which makes it a turtle in comparison to the x86)

LoRd_MuldeR
20th November 2008, 21:16
With a 64-bit OS and 8GB of RAM (~6 of which are unused), I have my doubts. Anything else that could be causing it?

32-Bit processes running on a 64-Bit OS are still limited to the 32-Bit virtual address space. So the process can't allocate more than ~3 GB of memory ...

Snowknight26
20th November 2008, 21:17
Right, just as I implied, but there must be something causing x264.exe to hit the 2GB limit.

kemuri-_9
20th November 2008, 21:21
32-Bit processes running on a 64-Bit OS are still limited to the 32-Bit virtual address space. So the process can't allocate more than ~3 GB of memory ...

it can only use 3GB if it's compiled with large address awareness,
under standard compilation it is only 2GB.


the easiest reason for x264 to hit the limit is avisynth taking up more memory than it needs
use SetMemoryMax() in avs to something reasonable.

and iirc, avisynth defaults to using 25% of the system memory as the default max,
so with 8GB of system memory, the max is 2GB, leaving no room for x264 for its own allocation
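
The budget described above can be written out as a small Python sketch. The 25% default is kemuri-_9's recollection (iirc), not a verified Avisynth constant, and the function names are made up for illustration.

```python
GIB = 1024**3
MIB = 1024**2

PROCESS_LIMIT = 2 * GIB  # 32-bit process without large address awareness


def avisynth_default_max(system_bytes, fraction=0.25):
    """Default Avisynth memory cap, assuming the 25%-of-system-memory rule recalled above."""
    return int(system_bytes * fraction)


def headroom_for_x264(avs_max_bytes, limit_bytes=PROCESS_LIMIT):
    """Address space left for x264's own allocations when Avisynth runs in the same process."""
    return limit_bytes - avs_max_bytes


if __name__ == "__main__":
    default_cap = avisynth_default_max(8 * GIB)
    print(headroom_for_x264(default_cap) // MIB, "MiB left with the default cap")
    print(headroom_for_x264(256 * MIB) // MIB, "MiB left with SetMemoryMax(256)")
```

With 8GB of system memory the assumed default cap equals the whole 2GB limit, leaving no headroom; an explicit SetMemoryMax(256) leaves 1792 MiB.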

Snowknight26
20th November 2008, 21:24
Eh, I already had it set to 256 then 64MB.

LoRd_MuldeR
20th November 2008, 21:33
I guess either Avisynth or x264 or both are leaking memory somehow... :o

Snowknight26
20th November 2008, 21:39
I guess either Avisynth or x264 or both are leaking memory somehow... :o

Well since it only happens with --b-adapt 1, I'd say it's x264.

I can give VNC access to the box if it would help track the problem down faster.
Offer still stands. :P

Inventive Software
20th November 2008, 21:49
I still can't believe you've got a 16 core system. :p

On a separate note, what's CRF and 1 pass bitrate encoding like? Does that crash too?

Snowknight26
20th November 2008, 21:59
I still can't believe you've got a 16 core system. :p
Yea, I can't either. ;)

On a separate note, what's CRF and 1 pass bitrate encoding like? Does that crash too?

So far using crf has not made it crash.

Dark Shikari
20th November 2008, 22:00
Well since it only happens with --b-adapt 1, I'd say it's x264.

--b-adapt 2 requires significantly more memory.

video_magic
20th November 2008, 22:06
Just a couple of curiosities with me:

Will Snowknight26 probably lose quality by using so many threads, even though he wants to speed up his encoding - and if so does he realise that?

I am not clear on the difference - if any- of encoding on many cores & many threads. I mean, my HT P4 is 'emulating 2 cores' (essentially 2 physical CPUs right)?

If the quality issue exists like I have read it does, will it manifest itself on say 16 cores trying to use 16 threads, pretty much the same as someone using 4 cores and trying to use 16 threads?

Thanks to anyone who can answer my newbie question.

Snowknight26
20th November 2008, 22:18
I did a 5000 frame comparison between 32 threads and 4 threads. Quants were raised anywhere from .05 to .5 and SSIM dropped a tad, but I'm not too worried (unless I should be).

Inventive Software
20th November 2008, 22:26
I think what you'd probably get away with doing is splitting the encode up into 4 pieces and using 4 cores (6 threads in total) with x264. Have you looked at x264farm?

Snowknight26
20th November 2008, 22:30
The whole point of me getting this machine was to encode on it and not on my desktop.
Anyway, I appreciate your alternatives but I want to get these issues resolved.

Inventive Software
21st November 2008, 03:34
I never suggested using your desktop. ;)

Have you tried running x264 using 4 separate cores with 6 or 8 threads (setting the affinity appropriately) and seeing what the CPU usage is like then? If you can max out 4 cores, try running 4 different x264 processes, use a batch file to string the results all together with an appropriate tool.
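
One way to picture the split is as frame ranges, one per x264 process. The Python sketch below only computes the ranges. Selecting each range with x264's `--seek` and `--frames` options is an assumption about how the split would be done (the thread does not say), and setting the affinity and joining the outputs are left out entirely.

```python
def split_segments(total_frames, parts):
    """Split total_frames into `parts` contiguous (start, length) ranges
    whose lengths differ by at most one frame."""
    base, extra = divmod(total_frames, parts)
    segments = []
    start = 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)
        segments.append((start, length))
        start += length
    return segments


def segment_args(start, length):
    """x264 arguments selecting one segment (assumed approach: --seek / --frames)."""
    return ["--seek", str(start), "--frames", str(length)]


if __name__ == "__main__":
    for start, length in split_segments(5000, 4):
        print(" ".join(segment_args(start, length)))
```

Each argument list would be appended to an otherwise identical x264 command line, one per process.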

Snowknight26
21st November 2008, 03:35
That's a workaround. I'm looking for a solution that would let the b-frame decision use more than 1 thread (if that's what the problem is), so that I can encode one thing at a time using all the cores.

Inventive Software
21st November 2008, 03:39
The new b-frame decision (b-adapt 2) is not multi-threaded. Patch welcome. I think the old one is, so use "--b-adapt 1" and see what it's like.

EDIT: That was a bloody quick response. Are you stalking this thread? :p

LoRd_MuldeR
21st November 2008, 03:42
The new b-frame decision (b-adapt 2) is not multi-threaded. Patch welcome. I think the old one is, so use "--b-adapt 1" and see what it's like.

He already did try "--b-adapt 1" and it always crashed (in the memory allocation code). Which makes no sense, as "--b-adapt 2" actually requires more memory :confused:

There must be some other problem :o

kemuri-_9
21st November 2008, 03:47
i would suggest the memory split method i mentioned before and see how that turns out.
may be able to get a clearer picture of what's going on (and may work as well).

as a reminder, that was to use avs2yuv and pipe to x264.

Dark Shikari
21st November 2008, 04:02
The new b-frame decision (b-adapt 2) is not multi-threaded. Patch welcome. I think the old one is

No it isn't.

That's a workaround solution. I'm looking for one that would force the b-frame decision to use more than 1 thread..if thats what the problem is, that way I can encode one thing at a time using all the cores.

Sure, it's quite possible, but would require a thousand-or-two line patch to move the lookahead ahead of the main encoding thread, add all necessary mutexes and synchronization primitives, and buffer all necessary frames.

Snowknight26
21st November 2008, 04:06
as a reminder, that was to use avs2yuv and pipe to x264.

Just tried it, still crashes.

kemuri-_9
21st November 2008, 04:32
ok, and what was the memory usage for both avs2yuv and x264 up to the crash?

Snowknight26
21st November 2008, 04:34
avs2yuv: ~120MB
x264 before dumprep: ~1.2GB
x264 after dumprep: ~1.7GB

akupenguin
21st November 2008, 04:35
Sure, it's quite possible, but would require a thousand-or-two line patch to move the lookahead ahead of the main encoding thread, add all necessary mutexes and synchronization primitives, and buffer all necessary frames.
That's moving B-adapt to a separate thread from the i/o loop, which is independent of whether to multithread it. Simple multithread doesn't need any mutexes, only `make`-style dependencies.

kemuri-_9
21st November 2008, 04:38
avs2yuv: ~120MB
x264 before dumprep: ~1.2GB
x264 after dumprep: ~1.7GB

definitely far from the limits, so i'm out of explanations...
except the classic one... M$ failure...
use linux and see how well it goes. (lol)

Snowknight26
21st November 2008, 04:41
Sorry, that's not a realistic suggestion for me so I can't.

Dark Shikari
21st November 2008, 04:42
That's moving B-adapt to a separate thread from the i/o loop, which is independent of whether to multithread it. Simple multithread doesn't need any mutexes, only `make`-style dependencies.

Wait, isn't that what he asked for though?

Multithreading the lookahead means moving it to a separate thread from the I/O loop.

akupenguin
21st November 2008, 06:25
notation: each line is 1 thread, "->" is a dependency, "..." is a thread that remains running while x264_encoder_encode() returns

currently:
frame in -> lookahead loop -> spawn thread for encode ...
... encode another frame ................................
... encode another frame -> sync oldest thread -> out

option 1:
This devotes 1 whole core to lookahead, but if you have 4 cores and run B-adapt=2 on a fast 1st pass, it'll still bottleneck.
frame in -> lookahead loop .......................................
older frame -> sync older lookahead -> spawn thread for encode ...
... encode another frame .........................................
... encode another frame -> sync oldest thread -> out

option 2:
This allows multiple cores for lookahead, but leaves it in the loop, so the distribution of parallelism over time is uneven.
        /-> lookahead -\
frame in -> lookahead --> spawn thread for encode ...
        \-> lookahead -/
... encode another frame ............................
... encode another frame -> sync oldest thread -> out

option both:
        /-> lookahead ............................................
frame in -> lookahead ............................................
        \-> lookahead ............................................
older frame -> sync older lookahead -> spawn thread for encode ...
... encode another frame .........................................
... encode another frame -> sync oldest thread -> out
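
As a toy illustration of "option 1" above (lookahead in its own thread, feeding the encoder threads), here is a minimal Python sketch. It is not x264's code: the frame-type decision is a placeholder, frames are plain integers, and the queues and lock are simply what Python needs to pass work between threads.

```python
import queue
import threading

SENTINEL = None  # marks the end of the frame stream


def lookahead_worker(frames_in, decided):
    """Single lookahead thread: decides a frame type, then hands the frame on."""
    while True:
        frame = frames_in.get()
        if frame is SENTINEL:
            decided.put(SENTINEL)
            return
        # Placeholder decision: every third frame is "P", the rest are "B".
        frame_type = "P" if frame % 3 == 0 else "B"
        decided.put((frame, frame_type))


def encode_worker(decided, results, lock):
    """Encoder thread: takes decided frames and 'encodes' them (here: records them)."""
    while True:
        item = decided.get()
        if item is SENTINEL:
            decided.put(SENTINEL)  # pass the end marker on to the other encoders
            return
        with lock:
            results.append(item)


def run(num_frames, num_encoders):
    frames_in = queue.Queue(maxsize=8)  # bounded: input waits if lookahead falls behind
    decided = queue.Queue(maxsize=8)
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=lookahead_worker, args=(frames_in, decided))]
    threads += [threading.Thread(target=encode_worker, args=(decided, results, lock))
                for _ in range(num_encoders)]
    for t in threads:
        t.start()
    for n in range(num_frames):
        frames_in.put(n)
    frames_in.put(SENTINEL)
    for t in threads:
        t.join()
    return sorted(results)
```

With a single lookahead thread the encoders can still starve when the decision is slower than the encode, which is the bottleneck described for option 1.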

Dark Shikari
21st November 2008, 06:40
Oh, you just mean multithreading within the lookahead.

That wouldn't be too bad as long as you make sure none of the stuff you run concurrently depends on the same MVs/etc.

Sagekilla
21st November 2008, 06:47
There any possibility of getting a threaded b-adapt sooner than later then? :)