Current Patches, Where to get them, How they affect speed/output [Archive] - Page 40

J_Darnley

5th July 2009, 13:22

Sorry for posting here, didn't want a new thread @ newbies, where can i find "x264 core 67 r1153M 7b6ce6a" ? i need it to see if the M stands for autovaq because the aq is set at 1:1.00 and autovaq 0.3 diff was not made @ 1153.
Not necessarily. The 'M' just means modified. Look at the details posted for the file you downloaded.

juGGaKNot

5th July 2009, 13:35

Not necessarily. The 'M' just means modified. Look at the details posted for the file you downloaded.

I know, but what was the modification? That's what I'm asking.

Writing library : x264 core 67 r1153M 7b6ce6a aq=1:1.00

It's not a new autovaq 0.3 build, since that would show 2:1.00, but it could be a 0.2 build or not autovaq at all. That is the problem: what does the M stand for in this particular build?

Sharktooth

5th July 2009, 14:14

M = Modified.

microchip8

7th July 2009, 10:23

Could someone update x264_hrd_pulldown.13_interlace.diff ?

Patching fails on latest x264 with the presets system added

JEEB

7th July 2009, 11:09

http://pastebin.ca/1486671

Applied nal-hrd onto an unpatched source tree, applied the failing parts by hand. diff -NdurwBE 'd it. Took the x264.c parts and put them into a patch. Should work, but I've yet to test. The changes are minor so I guess it should >_>

And now to the issue of the fprofile failing...

komisar

7th July 2009, 11:18

Adapted for 1178 x264_hrd_pulldown.13_interlace.1178.diff (http://komisar.gin.by/x.patch/x264_hrd_pulldown.13_interlace.1178.diff) (change TAB to SPACE, remove trailing space)

edit: the options "--no-ssim --no-psnr --progress -w -8" in the Makefile make profiling fail...

edit: fixing fprofile x265_profile_fix.1178.diff (http://komisar.gin.by/x.patch/x265_profile_fix.1178.diff)

JEEB

7th July 2009, 19:47

Since with rev1179 all of it seems to have calmed down, more or less - here's a build. 32bit only for now, as VMware server doesn't want to connect to localhost it seems.

x264 r1179 32bit
download (http://jeeb.fiveforty.jp/x264/1179/x264.exe) ; release notes (http://jeeb.fiveforty.jp/x264/1179/relnotes.txt)

built on Jul 7 2009, gcc: 4.3.3
fprofiled, -march=i686

x264 r1179 64bit
download (http://jeeb.fiveforty.jp/x264/1179_x64/x264.exe) ; release notes (http://jeeb.fiveforty.jp/x264/1179_x64/relnotes.txt)

built on Jul 10 2009, gcc: 4.3.4 20090220 (prerelease) (x64.generic.Komisar)
fprofiled, -march=core2

patched with:

x264_win_zone_parse_fix_05.diff
x264_hrd_pulldown.13_interlace_modified.diff (http://pastebin.ca/1486671)
x264_AutoVAQ.03.diff

techouse

8th July 2009, 12:29

x264_x64_r1179_unpatched (http://techouse.project357.com/builds/revision1179/x264.exe) | MD5 (http://techouse.project357.com/builds/revision1179/x264.md5)
GCC 4.4.0 20090524 (x64.core2.Komisar), unpatched, generic, fprofiled

________________________________________________________________________________

x264_x86_r1179_techouse (http://techouse.project357.com/builds/x264_x86_r1179_techouse.7z) | INFO (http://techouse.project357.com/nfo/x264_x86_r1179_techouse.txt)
GCC 4.4.0 20090524 (x86.core2.Komisar), fprofiled, -march=core2

x264_x64_r1179_techouse (http://techouse.project357.com/builds/x264_x64_r1179_techouse.7z) | INFO (http://techouse.project357.com/nfo/x264_x64_r1179_techouse.txt)
GCC 4.4.0 20090524 (x64.core2.Komisar), fprofiled, -march=core2

Patches used:

x264_hrd_pulldown.13_interlace_modified.diff
x264_win_zone_parse_fix_05.diff
x264_AutoVAQ.03.diff

JEEB

11th July 2009, 09:15

x264 r1181 32bit
download (http://jeeb.fiveforty.jp/x264/1181/x264.exe) ; release notes (http://jeeb.fiveforty.jp/x264/1181/relnotes.txt)

built on Jul 11 2009, gcc: 4.3.3
fprofiled, -march=i686

x264 r1181 64bit
download (http://jeeb.fiveforty.jp/x264/1181_x64/x264.exe) ; release notes (http://jeeb.fiveforty.jp/x264/1181_x64/relnotes.txt)

built on Jul 11 2009, gcc: 4.3.4 20090220 (prerelease) (x64.generic.Komisar)
fprofiled, -march=core2

patched with:

x264_win_zone_parse_fix_05.diff
x264_hrd_pulldown.13_interlace_modified.diff (http://pastebin.ca/1486671)
x264_AutoVAQ.03.diff

G_M_C

11th July 2009, 10:32

/me is hoping for an update of IMK's ICC / SSE4.x build :)

juGGaKNot

11th July 2009, 11:03

/me is hoping for an update of IMK's ICC / SSE4.x build :)

What does it bring to the table so special ? intel cpus ?

G_M_C

11th July 2009, 11:10

What does it bring to the table so special ? intel cpus ?

I find that ICC builds are slightly faster on my QX9650.

Fr4nz

11th July 2009, 11:42

I find that ICC builds are slightly faster on my QX9650.

Unfortunately we'll have to wait because his video card is broken...read here (I received this message from him via Youtube yesterday):

However, the video card in my computer died and I'm $120 short of a new one. Until I can replace that video card I can't compile any new builds. I'm pretty broke at the moment, so it could take a month or more until I set aside $120 for a card. I'll let you know when I get it replaced.

We'll wait :(

G_M_C

11th July 2009, 14:56

Unfortunately we'll have to wait because his video card is broken...read here (I received this message from him via Youtube yesterday):

We'll wait :(

I'll wait too. Bummer that the video card broke. I've never had one break in my system, but I don't OC video cards (only CPUs, which is easy with an unlocked "Extreme edition" ;) ).

rack04

11th July 2009, 19:23

What is the difference between -march=i686 and -march=core2?

LoRd_MuldeR

11th July 2009, 19:30

What is the difference between -march=i686 and -march=core2?

The C compiler is instructed to optimize the build either for an i686-family CPU or for an Intel Core2.
While the former allows the compiler to use the "PentiumPro" instruction set, the latter also allows the compiler to use SSE instructions (everything up to SSSE3).
Furthermore, with "-march=core2" the compiler will optimize for different CPU timings...

In short: The "-march=i686" build should run on every CPU, except for some archaic ones. The "-march=core2" build should run a bit faster on a Core2 Duo.

See for details:
http://gcc.gnu.org/onlinedocs/gcc-4.3.3/gcc/i386-and-x86_002d64-Options.html#i386-and-x86_002d64-Options

Note that this only affects the plain C code in x264. All the "hand-optimized" assembler code is not affected by compiler optimizations at all!
Also note that x264 uses its own runtime CPU detection to decide which assembler functions will be used (or not used).

Compiler optimizations can squeeze out a bit more performance (ICC more than GCC), but the important speed-up happens in the assembler part of x264 ;)

kemuri-_9

11th July 2009, 20:32

pengvado (either here or on irc) stated that it's not icc's C compilation that is causing the speed up...
the cause is the icc equivalent to gcc's -mtune

G_M_C

11th July 2009, 20:50

pengvado (either here or on irc) stated that it's not icc's C compilation that is causing the speed up...
the cause is the icc equivalent to gcc's -mtune

[...]
Note that this only affects the plain C code in x264. All the "hand-optimized" assembler code is not affected by compiler optimizations at all!
Also note that x264 uses its own runtime CPU detection to decide which assembler functions will be used (or not used).

Compiler optimizations can squeeze out a bit more performance (ICC more than GCC), but the important speed-up happens in the assembler part of x264 ;)

That's why I said that they're only slightly faster. But on clips with > 200.000 frames even "slightly" counts as a measurable time saving ;)

I find that ICC builds are slightly faster on my QX9650.

IgorC

11th July 2009, 22:26

That's why I said that they're only slightly faster. But on clips with > 200.000 frames even "slightly" counts as a measurable time saving ;)
1% is still a tiny speed-up for 100 or 10000 or any other number of frames.
This percentage is hardly noticeable even for >200.000 frames. If an encode takes 10 hours, then 1% amounts to only 6 minutes. That is nothing compared to 10 hours.
It's still 1%.

akupenguin

12th July 2009, 02:38

pengvado stated (http://forum.doom9.org/showthread.php?p=1294634#post1294634) that it's not icc's C compilation that is causing the speed up...
the cause is the icc equivalent to gcc's -mtune
-mtune affects only C compilation.
I said that any difference between icc-sse2 and icc-ssse3 must be due to the -mtune part rather than the -march part, because icc-ssse3 didn't actually use any ssse3 (but it did include plenty of asm differences in non-sse code). This was not meant to explain any comparison between icc and some other compiler.

Fr4nz

12th July 2009, 07:25

1% is still a tiny speed-up for 100 or 10000 or any other number of frames.
This percentage is hardly noticeable even for >200.000 frames. If an encode takes 10 hours, then 1% amounts to only 6 minutes. That is nothing compared to 10 hours.
It's still 1%.

Well, IIRC sometimes there's a difference of 4-5% in favor of ICC build, which would save 20-30 minutes...not much, but also better than nothing :)
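
The arithmetic in this exchange can be checked with a short Python sketch. It follows the posts in treating an N% speed-up as N% of the total encode time saved, which is a simplification. The 10-hour encode is the example used above; the helper name `minutes_saved` is made up for illustration.

```python
def minutes_saved(encode_hours, speedup_percent):
    """Minutes saved if speedup_percent of the total encode time is cut."""
    return encode_hours * 60 * speedup_percent / 100


# 1% is the figure argued over above; 4% and 5% are the ICC range recalled above.
for percent in (1, 4, 5):
    saved = minutes_saved(10, percent)
    print(f"{percent}% of a 10 hour encode = {saved:.0f} minutes")
```

So both sides have their numbers right: 1% of 10 hours is 6 minutes, and 4-5% lands in the 20-30 minute range.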

Changing the subject, I have one question: my father has an AMD Phenom X4 9550 CPU and if I use the "DXVA-HQ preset" in MeGUI the first pass is slightly slower than the second pass (this does not happen on my Intel E6750 @ 3.3 GHz, on which the first pass is ~2.3x faster than the second pass)...how's it possible?

nm

12th July 2009, 11:12

Changing the subject, I have one question: my father has an AMD Phenom X4 9550 CPU and if I use the "DXVA-HQ preset" in MeGUI the first pass is slightly slower than the second pass (this does not happen on my Intel E6750 @ 3.3 GHz, on which the first pass is ~2.3x faster than the second pass)...how's it possible?
B-adapt 2 slows it down, probably. Frametype decision (which is done only in the first pass) is single-threaded, so it acts as a bottleneck on multi-core CPUs.

Fr4nz

12th July 2009, 11:14

B-adapt 2 slows it down, probably. Frametype decision (which is done only in the first pass) is single-threaded, so it acts as a bottleneck on multi-core CPUs.

This makes sense, but how is it possible that the Phenom is slowed down so much??

To give you an idea: in the first pass I get 10-11 fps with my E6750 @ 3.3 GHz, while with the Phenom I get merely 5 fps...

Dark Shikari

12th July 2009, 11:25

This makes sense, but how is it possible that the Phenom is slowed down so much??

Phenom has 4 cores, Core 2 Duo has only two?

Fr4nz

12th July 2009, 11:31

Phenom has 4 cores, Core 2 Duo has only two?

Intel E6750 has only 2 cores.

Furthermore, in order to give you a better idea of the "situation", second pass is faster on Phenom 9550 than on E6750.

Dark Shikari

12th July 2009, 11:37

Intel E6750 has only 2 cores.

That's what I just said. What don't you understand? More cores means the penalty for using settings that cripple multithreading will hurt speed more.

Fr4nz

12th July 2009, 11:46

That's what I just said. What don't you understand? More cores means the penalty for using settings that cripple multithreading will hurt speed more.

Ok, what I don't understand is this: is frametype decision (which is single-threaded and indicated as the "culprit" by nm) so "heavy" relative to the other algorithms used by x264 that the Phenom 9550 is brutally outperformed (~2x) by an Intel E6750 in the first pass?

12th July 2009, 12:26

One core of your overclocked E6750 outperforms one core of the Phenom by almost 2x. If b-adapt 2 dominates the encoding time because of fast first-pass parameters, this means that overall encoding is also twice as fast on the E6750. Just check x264's CPU usage during the first pass. I guess it's less than 50 % on the Phenom.
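
The bottleneck described here is what Amdahl's law models. The thread does not name that law; it is brought in as an illustration. A minimal sketch, assuming a made-up serial fraction of 50% for the single-threaded frametype decision (the real fraction depends on the settings and the source):

```python
def amdahl_speedup(serial_fraction, cores):
    """Speed-up over one core when serial_fraction of the work cannot be threaded."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def cpu_utilization(serial_fraction, cores):
    """Average share of all cores kept busy, i.e. speed-up divided by core count."""
    return amdahl_speedup(serial_fraction, cores) / cores


SERIAL_FRACTION = 0.5  # hypothetical value, not measured from x264

for cores in (2, 4):
    speedup = amdahl_speedup(SERIAL_FRACTION, cores)
    usage = cpu_utilization(SERIAL_FRACTION, cores)
    print(f"{cores} cores: {speedup:.2f}x speed-up, {usage:.0%} average CPU usage")
```

Under that assumption the quad-core averages well under half load while the dual-core stays above it, so the faster single core decides the result.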

kemuri-_9

12th July 2009, 16:16

$ ./x264 -o NUL test.y4m --frames 1000 -b16
yuv4mpeg: 640x480@30/1fps, 0:0
x264 [info]: using cpu capabilities: MMX2 SSE2Fast FastShuffle SSEMisalign LZCNT
x264 [info]: profile High, level 3.0
x264 [info]: slice I:6 Avg QP:13.84 size: 9284
x264 [info]: slice P:570 Avg QP:21.80 size: 7107
x264 [info]: slice B:424 Avg QP:24.58 size: 1312
x264 [info]: consecutive B-frames: 37.5% 22.5% 4.5% 7.6% 15.6% 11.5% 0.7% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%

encoded 1000 frames, 92.89 fps, 1119.34 kb/s
~92% CPU across 4 cores

$ ./x264 -o NUL test.y4m --frames 1000 -b5 --b-adapt 2
x264 [info]: slice I:6 Avg QP:13.97 size: 8909
x264 [info]: slice P:507 Avg QP:21.69 size: 6518
x264 [info]: slice B:487 Avg QP:24.01 size: 1869
x264 [info]: consecutive B-frames: 22.2% 23.3% 47.4% 1.2% 1.0% 4.8%

encoded 1000 frames, 54.47 fps, 1024.58 kb/s
~70% CPU across 4 cores

$ ./x264 -o NUL test.y4m --frames 1000 -b16 --b-adapt 2
x264 [info]: slice I:6 Avg QP:13.97 size: 8906
x264 [info]: slice P:500 Avg QP:21.71 size: 6543
x264 [info]: slice B:494 Avg QP:23.96 size: 1853
x264 [info]: consecutive B-frames: 21.6% 23.3% 47.1% 1.2% 1.0% 1.8% 2.1% 0.8% 0.0% 1.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%

encoded 1000 frames, 13.72 fps, 1017.91 kb/s
~37% CPU across 4 cores

b-adapt 2 reduces threading improvement as you increase the number of bframes:
5 being noticeable on default settings...
with 16 nearly destroying all threading speedup...
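
To put the three runs above side by side, here is a small Python sketch that expresses each result relative to the first run. The fps and CPU figures are copied from the logs above; the labels are shorthand, and the first run is described as "default b-adapt" only because no --b-adapt option appears on its command line.

```python
# (label, fps, approximate CPU % across 4 cores), copied from the logs above
runs = [
    ("-b16 (default b-adapt)", 92.89, 92),
    ("-b5 --b-adapt 2", 54.47, 70),
    ("-b16 --b-adapt 2", 13.72, 37),
]

baseline_fps = runs[0][1]
relative = {label: fps / baseline_fps for label, fps, _ in runs}

for label, fps, cpu in runs:
    print(f"{label:24s} {fps:6.2f} fps  {relative[label]:6.1%} of baseline  ~{cpu}% CPU")
```

Note that the first two runs also differ in the number of B-frames, so only the first and third are a like-for-like comparison of b-adapt modes.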

Fr4nz

12th July 2009, 19:46

b-adapt 2 reduces threading improvement as you increase the number of bframes:
5 being noticeable on default settings...
with 16 nearly destroying all threading speedup...

I see, thank you for the clarification! I didn't know that b-adapt is computationally so "heavy"...

imk

14th July 2009, 06:09

Dead video card is still dead. Using an old 1MB ATI All-in-Wonder PCI card for the time being. 1024x768 VGA mode is great. :(

Anyway...
x264-r1181M-imk-win.7z (http://imk.cx/pc/x264/x264-r1181M-imk-win.7z)
win_build_info.txt (http://imk.cx/pc/x264/win_build_info.txt)

G_M_C

14th July 2009, 08:55

Fr4nz

14th July 2009, 09:03

Zachs

14th July 2009, 09:19

Is the threaded slicetype patch stable enough to be included yet? specifically v15: http://mailman.videolan.org/pipermail/x264-devel/2009-July/005988.html

Thanks.

Underground78

14th July 2009, 09:21

There is a v16 : http://mailman.videolan.org/pipermail/x264-devel/2009-July/006022.html ...

Zachs

14th July 2009, 09:28

@imk

Would greatly appreciate it if you could include x264-r1179-threaded-slicetype-v16-fix.diff

found here - http://kemuri9.net/dev/x264/patches/

G_M_C

14th July 2009, 09:56

@imk

Would greatly appreciate it if you could include x264-r1179-threaded-slicetype-v16-fix.diff

found here - http://kemuri9.net/dev/x264/patches/

Hmmm, i'd prefer only using patches that are known to be stable. Hopefully DS will be able to review this patch :)

kemuri-_9

14th July 2009, 18:53

Hmmm, i'd prefer only using patches that are known to be stable. Hopefully DS will be able to review this patch :)

DS has reviewed the v16 patch, and the -fix I made is there to fix the crash he found...
other things that need fixing are non-critical...
1. removing changes to osdep.h
2. possibly removing/altering the section that sets the lookahead thread's priority (as it breaks BeOS compilation and from discussion from pengvado, negative values for priority should only work on linux and as root)

not to mention it's not official yet as I'm not the patch maintainer...
He's been busy with his job and hasn't yet come back to irc to discuss the slight change I made from v16

but I've been using the patch as it's progressed along and have had no issues using it.

Zachs

15th July 2009, 01:45

Hmm. I gave kemuri-9's build a spin but didn't notice any difference at all using --rc-lookahead auto vs without. I would've thought that patch is to enable x264 to utilize multicores in first pass more effectively when --b-adapt 2 is used? With or without --rc-lookahead, my dual core 2.13GHz CPU stays at around 75%.

Dark Shikari

15th July 2009, 01:50

Yes, I'm not sure what's up with that; it seems to give far less benefit than I expected.

Either:

1. There is some major inefficiency in the patch (quite possible).
2. B-adapt 2 needs internal multithreading (I'll do this once threaded slicetype is committed).

kemuri-_9

15th July 2009, 05:24

$ ./x264 -o NUL test.y4m --frames 1000 -b5 --b-adapt 2 --rc-lookahead auto
encoded 1000 frames, 57.30 fps, 1024.59 kb/s

$ ./x264 -o NUL test.y4m --frames 1000 -b16 --b-adapt 2 --rc-lookahead auto
encoded 1000 frames, 14.18 fps, 1017.92 kb/s

the difference from no lookahead to automatic on this quadcore is:
5 bframes: 54.47 fps -> 57.30 fps
16 bframes: 13.72 fps -> 14.18 fps

the speedup is there, it's just small compared to how much b-adapt 2 can currently destroy threading performance.

my builds are also set to "-march=AMD", since that's what all 4 of my pcs+laptop are.
so if you're on an intel chip, you probably will get slightly lower speeds with my build compared to everyone else's "-march=intel" builds

2. B-adapt 2 needs internal multithreading (I'll do this once threaded slicetype is committed).
yes, this is very much needed from what I've seen.

komisar

15th July 2009, 07:48

threaded-slicetype speed result for Intel core i7 with settings "--preset placebo --tune touhou"
lookahead=0: 8.42 fps
lookahead=30: 8.95 fps
lookahead=60: 9.20 fps
lookahead=90: 9.26 fps
lookahead=120: 9.15 fps
lookahead=150: 9.19 fps
lookahead=180: 9.28 fps
lookahead=210: 9.10 fps
lookahead=240: 9.22 fps
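
For reference, the gains in this table can be worked out with a few lines of Python. The fps values are copied from the list above; nothing else is measured, and a single run per setting says little about run-to-run noise.

```python
# lookahead size -> fps, copied from the results above
results = {
    0: 8.42, 30: 8.95, 60: 9.20, 90: 9.26, 120: 9.15,
    150: 9.19, 180: 9.28, 210: 9.10, 240: 9.22,
}

baseline = results[0]
# Percentage gain of each setting over lookahead=0
gains = {la: (fps / baseline - 1) * 100 for la, fps in results.items()}
best = max(results, key=results.get)

for la, gain in gains.items():
    print(f"lookahead={la:3d}: {results[la]:.2f} fps ({gain:+.1f}%)")
print(f"fastest setting in this run: lookahead={best}")
```

Most of the gain is already there at lookahead=60; beyond that the differences are small enough that they may just be measurement variation.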

Edit -> kemuri-_9, sure :) this is my inattention

kemuri-_9

15th July 2009, 16:43

lookahead=30: 8.95 fps (default for "auto")

that seems incorrect, auto sets the value to (bframes + threads) * 2
placebo has 16 bframes, so even without the thread value the automatic lookahead size is already > 30.
iirc, the i7 has 4 real and 4 ht cores, so auto threads should set it to 12,
so (12 + 16) * 2 = 56 is what lookahead auto should be setting the lookahead size to.
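
The calculation above as a minimal Python sketch. The lookahead formula is the one stated in this post. The 1.5 threads per logical core rule is an assumption inferred from the post's own numbers (8 logical cores giving 12 threads), and both function names are invented for illustration rather than taken from x264.

```python
def auto_threads(logical_cores):
    # Assumption: 1.5 threads per logical core, inferred from "8 cores -> 12 threads".
    return logical_cores * 3 // 2


def auto_lookahead(bframes, threads):
    # Formula as stated in the post: (bframes + threads) * 2
    return (bframes + threads) * 2


threads = auto_threads(8)              # i7: 4 real + 4 HT cores
print(threads)                         # 12
print(auto_lookahead(16, threads))     # 56, placebo uses 16 bframes
print(auto_lookahead(16, 0))           # 32, already above 30 without any threads
```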

Zachs

16th July 2009, 09:12

An idea for a patch:

The Intel TBB (Threading Building Blocks) Allocator patch, using cache_aligned_allocator etc. instead of x264_malloc. From my own experience using it in other projects, the tbb allocator is about 15% faster than HeapAlloc (called by malloc) and even faster when 2 or more threads are allocating / freeing at the same time, especially across different cores.

This should be fairly straightforward considering x264_malloc() / x264_free() already centralize all allocations / free ops. If I have some time to set up the build environment with all the tools to merge diffs and stuff, I'd probably make one myself using MSVC2005.

Dark Shikari

16th July 2009, 09:19

Zachs

16th July 2009, 09:24

Yeah that's true. Even the frames in frame.c are recycled...

Would TBB help in any way though? Like parallel loops and stuff?

Dark Shikari

16th July 2009, 09:37

Yeah that's true. Even the frames in frame.c are recycled...

Would TBB help in any way though? Like parallel loops and stuff?

x264 is already heavily threaded. There are no individual loops that run long enough without inter-iteration dependencies to make threading worth it.

ACoolie

16th July 2009, 17:36

If we're free to post ideas here: I tried working on this one but don't really have the experience to get it done. An option --encoder-fps would tune your options every so many frames to try to match the encoding speed to the --encoder-fps value. Bframes would be lowered or raised, refs decreased or increased, etc. It would be very useful if performing faster than a certain speed is crucial. The only issue I'd foresee is that certain options (like bframes) probably can't be changed very easily in the middle of encoding.
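
A rough Python sketch of what such a feedback loop could look like. Everything in it is hypothetical: --encoder-fps is only the idea proposed above, and the `Settings` class, the `retune` function, the 5% tolerance and the order in which refs and bframes are adjusted are assumptions for illustration, not x264 code.

```python
from dataclasses import dataclass

MAX_REFS = 16      # assumed upper bounds for the sketch
MAX_BFRAMES = 16


@dataclass
class Settings:
    refs: int = 4
    bframes: int = 3


def retune(settings, measured_fps, target_fps, tolerance=0.05):
    """Return adjusted settings after one measurement window (hypothetical)."""
    refs, bframes = settings.refs, settings.bframes
    if measured_fps < target_fps * (1 - tolerance):
        # Too slow: drop reference frames first, then B-frames.
        if refs > 1:
            refs -= 1
        elif bframes > 0:
            bframes -= 1
    elif measured_fps > target_fps * (1 + tolerance):
        # Headroom: spend it on reference frames first, then B-frames.
        if refs < MAX_REFS:
            refs += 1
        elif bframes < MAX_BFRAMES:
            bframes += 1
    return Settings(refs, bframes)


current = Settings()
for fps in (20.0, 21.5, 24.8, 30.0):   # made-up measurements, target is 25 fps
    current = retune(current, fps, target_fps=25.0)
    print(fps, current)
```

As the post points out, changing bframes in the middle of an encode may not be practical, so a real implementation might have to restrict itself to options that can safely change between frames.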