Current Patches, Where to get them, How they affect speed/output

MasterNobody
8th April 2010, 20:31
**removed until I can figure out why avi output format wasn't working**
To compile correctly with AVI output, you need to configure ffmpeg with these additional options: --enable-muxer=avi --enable-protocol=file. Or simply take komisar's libpack from this page (http://komisar.gin.by/mingw/index.html).

LoRd_MuldeR
8th April 2010, 20:45
I tried lots of different settings, and in each case I got borked output in CRF mode with mb-tree on. I just tried the amdfam10 build of x264 r1323 from xvidvideo.ru and that has the exact same problem as outlaw.78's build, borked output with mb-tree on, whereas with the core2 build from xvidvideo.ru it works fine.

Some testes from my Intel Core2 Q6600 machine:
File                                        Size (bytes)  RIPEMD-160
parkrun.1280x720.crf20.gcc345.generic.avi   41.164.020    B54683EEFF99DF9032951E69A20372B9769F9860
parkrun.1280x720.crf20.gcc450.generic.avi   41.164.020    B54683EEFF99DF9032951E69A20372B9769F9860
parkrun.1280x720.crf20.gcc450.core2.avi     41.164.020    B54683EEFF99DF9032951E69A20372B9769F9860
parkrun.1280x720.crf20.gcc450.pentium3.avi  41.164.020    B54683EEFF99DF9032951E69A20372B9769F9860
parkrun.1280x720.crf20.gcc450.amdfam10.avi  53.389.452    E6BDE3785594AD489583F88D88F3307581B7AA9D
parkrun.1280x720.crf20.gcc443.generic.avi   41.164.020    B54683EEFF99DF9032951E69A20372B9769F9860
parkrun.1280x720.crf20.gcc443.amdfam10.avi  53.389.452    E6BDE3785594AD489583F88D88F3307581B7AA9D

x264 version:
version: 0.92.1523M 25ca5b0

Settings that were used:
options: 1280x720 fps=50/1 timebase=1/50 cabac=1 ref=8 deblock=1:-1:-1 analyse=0x3:0x133 me=umh subme=10 psy=1 psy_rd=1.00:0.15 mixed_ref=1 me_range=24 chroma_me=1 trellis=2 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-3 threads=6 sliced_threads=0 nr=0 decimate=1 interlaced=0 constrained_intra=0 bframes=16 b_pyramid=2 b_adapt=2 b_bias=0 direct=3 wpredb=1 wpredp=2 keyint=750 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=60 rc=crf mbtree=1 crf=20.0 qcomp=0.60 qpmin=10 qpmax=51 qpstep=4 ip_ratio=1.40 aq=1:1.00

Dark Shikari
8th April 2010, 20:56
If the amdfam version used an instruction incompatible with Intel chips, it would crash, not give incorrect output.

It's likely, given gcc's track record, that the build is just broken.

burfadel
8th April 2010, 21:09
Interestingly enough, for me it went in the other direction: an output file around 25 percent smaller, not 25 percent larger! Anyway, it looks as though the last few GCC builds are broken (it happens with 1310 as well). I think earlier GCC 4.5.0 builds worked, but how far back that was I have no idea!

Is there any reason why it would only occur with mb-tree on? Is there an instruction used there and not elsewhere? At least a bug report can be submitted to the GCC development team if the cause is known :)

LoRd_MuldeR
8th April 2010, 21:13
Is there any reason why it would only occur with mb-tree on? Is there an instruction used there and not elsewhere? At least a bug report can be submitted to the GCC development team if the cause is known :)

Obviously GCC either miscompiled the MB-Tree code itself or it miscompiled something else that becomes apparent with MB-Tree enabled.

Also: Even if the miscompilation can be tracked down to MB-Tree, it doesn't mean you will be fine with MB-Tree disabled. Other more subtle things may be miscompiled too.

(Looking at the output from the amdfam10-optimized build I see some kind of "flickering" that isn't there in the output from the other builds)

outlaw.78
8th April 2010, 21:51
hmmmmm, i will try to compile using gcc 4.4.3 from komisar and post as soon as i am done

LoRd_MuldeR
8th April 2010, 22:09
hmmmmm, i will try to compile using gcc 4.4.3 from komisar and post as soon as i am done

Exactly same result with GCC 4.4.3 (Komisar's build). Please see this (http://forum.doom9.org/showpost.php?p=1389891&postcount=3102) post.

outlaw.78
8th April 2010, 22:21
Exactly same result with GCC 4.4.3 (Komisar's build). Please see this (http://forum.doom9.org/showpost.php?p=1389891&postcount=3102) post.

i see...
i wonder if that has something to do with -ffast-math switch

burfadel
8th April 2010, 23:42
Some testes from my Intel Core2 Q6600 machine:

Tests with that extra 'e' means something completely different :)

It would be good to get to the bottom of the issue, it would be bad for someone trying x264 for the first time using a miscompiled build...

MasterNobody
9th April 2010, 00:47
It seems this is not a miscompilation. These builds probably only work correctly on AMD K10 CPUs (which support SSE4A / AMD SSE4), because gcc generates the LZCNT instruction, which seems to be silently interpreted as BSR, without crashing, on CPUs that lack SSE4A support.

LoRd_MuldeR
9th April 2010, 09:29
It seems this is not a miscompilation. These builds probably only work correctly on AMD K10 CPUs (which support SSE4A / AMD SSE4), because gcc generates the LZCNT instruction, which seems to be silently interpreted as BSR, without crashing, on CPUs that lack SSE4A support.

http://www.cosgan.de/images/smilie/konfus/g070.gif

Dark Shikari
9th April 2010, 09:52
A miscompilation check for this case has been added to x264. The next release will error out loudly if an amdfam10 build is run on any non-Phenom CPU.

LoRd_MuldeR
9th April 2010, 10:01
A miscompilation check for this case has been added to x264. The next release will error out loudly if an amdfam10 build is run on any non-Phenom CPU.

:thanks:

EDIT: Even on an older AMD machine (Athlon64) the "amdfam10" build produced different output than the generic build.

Dark Shikari
9th April 2010, 10:28
:thanks:

EDIT: Even on an older AMD machine (Athlon64) the "amdfam10" build produced different output than the generic build.

Obviously, it will do so on any machine that doesn't have SSE4a.

LoRd_MuldeR
9th April 2010, 11:47
Obviously, it will do so on any machine that doesn't have SSE4a.

Help me understand this: AMD re-used opcodes in SSE4a that they already had used for something else on their own processors before? How braindead is this? :eek:

Anyway, if you add the aforementioned miscompilation check for "amdfam10" on "non-Phenom" CPUs, will there be a way to disable/skip this check in the "fprofiling" phase?

Otherwise I wouldn't be able to fprofile "amdfam10" builds anymore :o

J_Darnley
9th April 2010, 11:50
So AMD re-used opcodes in SSE4a that they already had used for something else on their own processors before? How braindead is this? :eek:

Anyway, if you add the aforementioned miscompilation check for "amdfam10" on "non-Phenom" CPUs, will there be a way to disable/skip this check in the "fprofiling" phase?

Otherwise I wouldn't be able to fprofile "amdfam10" builds anymore :o

The result is not wrong when you have this instruction. See: http://pastebin.org/142635
[EDIT] Oh, do you mean you want to run one of these on a problematic CPU?

LoRd_MuldeR
9th April 2010, 11:57
Oh, do you mean you want to run one of these on a problematic CPU?

I don't want to use an "amdfam10" optimized build on my Intel CPU and it's good that x264 normally won't let me do that. But: I still want to be able to fprofile those builds on my CPU. But if it always aborts, I cannot run fprofile for "amdfam10" on Non-Phenom CPU's. And I don't have a Phenom available. Getting "borked" output from fprofiling shouldn't be a big deal, as we discard that output anyway...

kemuri-_9
9th April 2010, 13:14
I don't want to use an "amdfam10" optimized build on my Intel CPU and it's good that x264 normally won't let me do that. But: I still want to be able to fprofile those builds on my CPU. But if it always aborts, I cannot run fprofile for "amdfam10" on Non-Phenom CPU's. And I don't have a Phenom available. Getting "borked" output from fprofiling shouldn't be a big deal, as we discard that output anyway...

let those of us with phenom pcs worry about getting our fprofiled -march=amdfam10 builds, ok?
I think it's better for x264 to stop with an error and protect those who don't know better than to let those who do know better fprofile it.
there are more who don't know better than those who do.

LoRd_MuldeR
9th April 2010, 13:37
Sure. I was more thinking about something like a "#ifdef FPROFILE_GENERATE ... #endif" around the miscompilation check(s) ;)

aegisofrime
9th April 2010, 16:24
So... the conclusion is the amdfam10 builds are fine to use, but only if you have a K10 CPU?

LoRd_MuldeR
9th April 2010, 16:27
So... the conclusion is the amdfam10 builds are fine to use, but only if you have a K10 CPU?

Right. And once this (http://pastebin.org/142635) patch is committed, they will simply error out on unsupported CPU's.

alexins
9th April 2010, 16:43
So... the conclusion is the amdfam10 builds are fine to use, but only if you have a K10 CPU?

Yes, only for: Athlon X2, Athlon II, Phenom, Phenom II, Phenom X3, Phenom X4.

burfadel
9th April 2010, 16:44
That patch is a great way of doing it, as it is instruction specific and not processor specific, meaning new processors that support those instructions will still work and not error out if the current specific supporting processors aren't detected :)
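
The instruction-specific approach described above can be sketched as a feature-flag lookup instead of a CPU-model lookup. This is only an illustration, not the x264 patch from the pastebin link: the helper names are hypothetical, the sample CPU descriptions are made up and abbreviated, and it assumes the Linux /proc/cpuinfo convention in which LZCNT support is reported under the "abm" flag.

```python
def has_cpu_flag(cpuinfo_text, flag):
    """Return True if `flag` is listed on a 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags" and flag in value.split():
            return True
    return False


def missing_flags(cpuinfo_text, required_flags):
    """List the required feature flags that the CPU does not report."""
    return [f for f in required_flags if not has_cpu_flag(cpuinfo_text, f)]


# Hypothetical, abbreviated samples (not real dumps).
K10_LIKE = "model name\t: sample K10 CPU\nflags\t\t: fpu sse sse2 sse3 abm sse4a\n"
CORE2_LIKE = "model name\t: sample Core 2 CPU\nflags\t\t: fpu sse sse2 sse3 ssse3\n"

# Assumption: an amdfam10 build needs LZCNT, which Linux lists as "abm".
REQUIRED = ["abm"]

for name, text in (("K10-like", K10_LIKE), ("Core2-like", CORE2_LIKE)):
    missing = missing_flags(text, REQUIRED)
    print(name, "OK" if not missing else "refuse to run, missing: " + ", ".join(missing))
```

Because the check keys on the feature and not on a list of known models, a future CPU that reports the flag would pass without any change to the check.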

MasterNobody
9th April 2010, 18:11
Help me understand this: AMD re-used opcodes in SSE4a that they already had used for something else on their own processors before? How braindead is this? :eek:

Not fully correct. They re-used the combination REP:BSR (which with 99.99% probability would never be emitted by any compiler to generate a BSR instruction) for the new LZCNT instruction. But because the REP prefix is safe to use with most instructions (and is simply ignored in combinations that don't make sense), it doesn't crash even on CPUs that don't support it.
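
To see why running the same bytes as BSR instead of LZCNT silently changes results, here is a small simulation of the two instructions on 32-bit values. It is a sketch in Python, not what x264 or gcc actually emit; the function names are made up for illustration.

```python
def bsr32(x):
    """Index of the highest set bit of a 32-bit value, as BSR returns it."""
    x &= 0xFFFFFFFF
    if x == 0:
        # On real hardware the BSR destination is undefined for a zero source.
        raise ValueError("BSR result is undefined for 0")
    return x.bit_length() - 1


def lzcnt32(x):
    """Number of leading zero bits of a 32-bit value, as LZCNT returns it."""
    return 32 - (x & 0xFFFFFFFF).bit_length()


SAMPLES = (1, 0x80, 0x12345, 0x80000000)

for value in SAMPLES:
    # Code compiled for LZCNT expects the second column; a CPU that ignores
    # the REP prefix and executes BSR produces the third column instead.
    print(f"{value:#010x}  lzcnt={lzcnt32(value):2d}  bsr={bsr32(value):2d}")
```

For nonzero input the two results are related by lzcnt = 31 - bsr, so the program keeps running but computes with the wrong number, which fits the observation of different output without a crash.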

Dark Shikari
9th April 2010, 19:46
I don't want to use an "amdfam10" optimized build on my Intel CPU and it's good that x264 normally won't let me do that. But: I still want to be able to fprofile those builds on my CPU. But if it always aborts, I cannot run fprofile for "amdfam10" on Non-Phenom CPU's. And I don't have a Phenom available. Getting "borked" output from fprofiling shouldn't be a big deal, as we discard that output anyway...

This is stupid. If the output of x264 is "borked", surely the results of an fprofile cannot be reliable anyways.

deank
9th April 2010, 20:36
For those like me - not going after speed and stuff - is it really hard to build a release without all these fprofiled requirements like AMD or Intel compatibility? I guess I'm missing something, but I have a Celeron, an AMD and a Core Duo PC... Does it mean that I need to use 2 or 3 different builds just to be sure that x264 works with all 3 PCs??

Dark Shikari
9th April 2010, 20:51
For those like me - not going after speed and stuff - is it really hard to build a release without all these fprofiled requirements like AMD or Intel compatibility?

It's completely unnecessary and gives practically no speed benefit. It just makes people feel better.

outlaw.78
10th April 2010, 18:51
It's completely unnecessary and gives practically no speed benefit. It just makes people feel better.

agree.
on my phenom x4 running my builds vs generic builds is at most 1-2 % gain and sometimes even 0% ....
its just for fun!

outlaw.78
11th April 2010, 20:18
x264 0.93.1538 x64 & x86 (http://www.mediafire.com/?4zd20wmzjom)


gcc 4.5.0 20100406 (pre-release)
ffmpeg svn 22838
swscale svn 31029
ffms2 svn 309
pthreads 2.9.0.0 static (thanks to rack04 for providing the patches)
x264 0.93.1538 x86,x64,amdfam10,fprofiled

jpsdr
12th April 2010, 08:47
I personally noticed a speed difference between the rack04 build and the test build DS gave for validating the nal-hrd update (around 1.15 fps vs around 1.3 fps).

Dark Shikari
12th April 2010, 08:56
I personally noticed a speed difference between the rack04 build and the test build DS gave for validating the nal-hrd update (around 1.15 fps vs around 1.3 fps).

That's hardly surprising. My build was using an ancient gcc (3.4).

jpsdr
12th April 2010, 13:19
At least, on my i7@965, i've noticed that fact. Maybe the profile/options/intel optimisations rack04 use on his build is more noticeable on my CPU...
Well, i'm not complaining... ^_^

burfadel
12th April 2010, 18:18
I think the argument about a noticeable speed gain from profiling vs non-profiling is really only valid when comparing builds made with the same GCC version and the same settings and tools, differing only in the -fprofile option. That way the speed difference is due to the profiling, and not just because the build was compiled with a different system.

An example is having two cars of the same model and year, let's call them model X. One of them is the standard model and the other a slightly tweaked sports model (fprofiled). It's easy enough to say that the sports model is faster, it seems obvious. However, if you replace the engine in the sports model (currently say GCC 4.4.3/GCC 4.5.0 prerelease) with one from the same type of car, model X sports, but from around 5 years before (GCC 3.4), the comparison is no longer valid. Although the engine may have the same displacement, it may be less efficient to the point where the standard model of the current year's car is actually more efficient.

rack04
12th April 2010, 19:59
Here are various builds that I hope will assist in comparing the speed of gcc optimizations and thread pooling to x264 defaults. I will add links as they are available.

Please note that all the builds in this post have the following configuration:

Platform: X86
System: MINGW
asm: yes
avs input: yes
lavf input: no
ffms input: no
mp4 output: yes
pthread: yes
debug: no
gprof: no
PIC: no
shared: no
visualize: no

Toolchain:
GCC 4.4.3
GPAC 0.4.6-DEV Revision 5
Pthreads 2.9.0.0
Yasm 2320

clean x264 fprofiled builds:

x264_x86_core2_fprofiled_r1538 (http://www.multiupload.com/V2A7XNQRJ2)

x264_x86_generic_fprofiled_r1538 (http://www.multiupload.com/91Y3LKTC9W)

clean x264 builds:

x264_x86_core2_r1538 (http://www.multiupload.com/UBYF9JG9Y2)

x264_x86_generic_r1538 (http://www.multiupload.com/Y4UCUWFPL7)

patched x264 fprofiled builds (x264_thread_pool_v2.7_r1538 (http://pastebin.com/q2dz3WbM)):

x264_x86_core2_fprofiled_r1538M (http://www.multiupload.com/YS0K3VU05K)

x264_x86_generic_fprofiled_r1538M (http://www.multiupload.com/7WILZEDH91)

patched x264 builds (x264_thread_pool_v2.7_r1538 (http://pastebin.com/q2dz3WbM)):

x264_x86_core2_r1538M (http://www.multiupload.com/1PAVCFFPWG)

x264_x86_generic_r1538M (http://www.multiupload.com/UJDX1UEYPE)

outlaw.78
12th April 2010, 22:03
What does the thread pool patch improve?

rack04
12th April 2010, 22:44
x264_x86_core2_fprofiled_r1538

Trial  fps   kb/s
1      5.03  4910.57
2      5.17  4910.42
3      5.08  4911.10
4      5.16  4910.10
5      5.20  4910.08

Mean                5.13
Standard Error      0.0315
Median              5.16
Mode                #N/A
Standard Deviation  0.0705
Sample Variance     0.0050
Kurtosis            2.1909
Skewness            1.0954
Range               0.17
Minimum             5.03
Maximum             5.20
Sum                 25.64
Count               5
Confidence (95%)    0.0213


x264_x86_core2_fprofiled_r1538M

Trial  fps   kb/s
1      5.20  4910.51
2      5.08  4910.14
3      5.27  4910.09
4      5.28  4910.53
5      5.18  4910.47

Mean                5.20
Standard Error      0.0361
Median              5.20
Mode                #N/A
Standard Deviation  0.0807
Sample Variance     0.0065
Kurtosis            2.1909
Skewness            1.0954
Range               0.20
Minimum             5.08
Maximum             5.28
Sum                 26.01
Count               5
Confidence (95%)    0.0244


x264_x86_core2_r1538

Trial  fps   kb/s
1      5.20  4910.42
2      5.19  4910.09
3      5.36  4910.43
4      5.15  4910.09
5      5.21  4910.52

Mean                5.22
Standard Error      0.0360
Median              5.20
Mode                #N/A
Standard Deviation  0.0804
Sample Variance     0.0065
Kurtosis            2.1909
Skewness            1.0954
Range               0.21
Minimum             5.15
Maximum             5.36
Sum                 26.11
Count               5
Confidence (95%)    0.0243


x264_x86_core2_r1538M

Trial  fps   kb/s
1      5.17  4910.48
2      5.34  4910.69
3      5.28  4910.12
4      5.26  4910.88
5      5.22  4910.71

Mean                5.25
Standard Error      0.0286
Median              5.26
Mode                #N/A
Standard Deviation  0.0639
Sample Variance     0.0041
Kurtosis            2.1909
Skewness            1.0954
Range               0.17
Minimum             5.17
Maximum             5.34
Sum                 26.27
Count               5
Confidence (95%)    0.0193


x264_x86_generic_fprofiled_r1538

Trial  fps   kb/s
1      5.22  4910.89
2      5.28  4910.58
3      5.17  4910.53
4      5.08  4910.58
5      5.39  4910.42

Mean                5.23
Standard Error      0.0521
Median              5.22
Mode                #N/A
Standard Deviation  0.1165
Sample Variance     0.0136
Kurtosis            2.1909
Skewness            1.0954
Range               0.31
Minimum             5.08
Maximum             5.39
Sum                 26.14
Count               5
Confidence (95%)    0.0351


x264_x86_generic_fprofiled_r1538M

Trial  fps   kb/s
1      5.21  4910.13
2      5.31  4910.49
3      5.21  4910.09
4      5.19  4910.82
5      5.20  4910.48

Mean                5.22
Standard Error      0.0218
Median              5.21
Mode                5.21
Standard Deviation  0.0488
Sample Variance     0.0024
Kurtosis            2.1909
Skewness            1.0954
Range               0.12
Minimum             5.19
Maximum             5.31
Sum                 26.12
Count               5
Confidence (95%)    0.0147


x264_x86_generic_r1538

Trial  fps   kb/s
1      5.30  4910.48
2      5.19  4910.09
3      5.23  4910.48
4      5.20  4910.44
5      5.21  4910.49

Mean                5.23
Standard Error      0.0196
Median              5.21
Mode                #N/A
Standard Deviation  0.0439
Sample Variance     0.0019
Kurtosis            2.1909
Skewness            1.0954
Range               0.11
Minimum             5.19
Maximum             5.30
Sum                 26.13
Count               5
Confidence (95%)    0.0133
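
For anyone who wants to re-derive the descriptive statistics from the raw trials, here is a minimal sketch using the five fps values posted for x264_x86_core2_fprofiled_r1538. The 95% half-width at the end is this sketch's own choice (Student's t with 4 degrees of freedom) and is an assumption about what a 95% interval should mean here; it will not match the "Confidence (95%)" figure in the post, which was evidently computed differently.

```python
import math
import statistics

# fps values of the five trials for x264_x86_core2_fprofiled_r1538
fps_trials = [5.03, 5.17, 5.08, 5.16, 5.20]

mean = statistics.mean(fps_trials)
variance = statistics.variance(fps_trials)   # sample variance, divides by n - 1
stdev = statistics.stdev(fps_trials)         # sample standard deviation
std_error = stdev / math.sqrt(len(fps_trials))

# Assumption: two-sided 95% interval from Student's t, n - 1 = 4 degrees of freedom.
T_CRIT_95_DF4 = 2.776
ci_half_width = T_CRIT_95_DF4 * std_error

print(f"mean={mean:.2f} variance={variance:.4f} stdev={stdev:.4f} "
      f"stderr={std_error:.4f} ci95=+/-{ci_half_width:.4f}")
```

The mean, variance, standard deviation and standard error agree with the posted values after rounding.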

Dark Shikari
12th April 2010, 22:52
Numbers like that are useless without stddev.
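
To illustrate why the spread matters, here is a rough sketch that compares two of the posted builds (core2 fprofiled vs plain core2, five fps trials each) with Welch's t statistic. Welch's test is a choice made for this example, not something used in the thread, and with only five trials per build it is no more than a sanity check.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic for mean(sample_b) - mean(sample_a)."""
    se = math.sqrt(statistics.variance(sample_a) / len(sample_a)
                   + statistics.variance(sample_b) / len(sample_b))
    return (statistics.mean(sample_b) - statistics.mean(sample_a)) / se


core2_fprofiled = [5.03, 5.17, 5.08, 5.16, 5.20]  # x264_x86_core2_fprofiled_r1538
core2_plain = [5.20, 5.19, 5.36, 5.15, 5.21]      # x264_x86_core2_r1538

t_stat = welch_t(core2_fprofiled, core2_plain)
print(f"difference of means = {statistics.mean(core2_plain) - statistics.mean(core2_fprofiled):.3f} fps")
print(f"Welch t = {t_stat:.2f}")
```

As a rough rule (an assumption of this sketch, not a claim from the thread), an absolute t below about 2 with samples this small is not strong evidence of a real speed difference.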

rack04
13th April 2010, 17:43
Numbers like that are useless without stddev.

I am in the process of updating the numbers. Hopefully these will be more useful.

rack04
13th April 2010, 22:29
Here is a test build that includes ffmpeg-mt/ffms2-mt input and avi output.

x264_x86_r1538M (http://www.multiupload.com/SN93174MDO)

./configure

Platform: X86
System: MINGW
asm: yes
avs input: yes
lavf input: yes
ffms input: yes
mp4 output: yes
avi output: yes
pthread: yes
debug: no
gprof: no
PIC: no
shared: no
visualize: no

make

roozhou
14th April 2010, 03:48
Here is a test build that includes ffmpeg-mt/ffms2-mt input and avi output.

Any good reason to use ffmpeg-mt against non-mt ffmpeg?

Mr VacBob
14th April 2010, 03:51
Huffyuv is faster until you run out of HD bandwidth. H264 is faster until it's more profitable to start more x264 threads. Which probably happens pretty fast, but I haven't benchmarked it.

rack04
14th April 2010, 04:10
Any good reason to use ffmpeg-mt against non-mt ffmpeg?

For testing to see if there is any speed difference.

Blue_MiSfit
14th April 2010, 06:57
H.264 decoding benefits hugely in ffmpeg-mt, especially when one tries to do high speed transcoding (i.e. 20x realtime for SD). Removing the decoder bottleneck lets x264 itself run as quickly as it can ;)

My experience is in using ffms2 with and without ffmpeg-mt. I saw MASSIVE speedup in cases as I've described.

~MiSfit
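
The bottleneck argument can be put into a toy model: if decoding and encoding run concurrently, the overall rate is limited by the slower stage. The function name and all fps figures below are hypothetical and only illustrate the reasoning; they are not measurements from this thread.

```python
def pipeline_fps(decode_fps, encode_fps):
    """Throughput of a two-stage pipeline whose stages run concurrently."""
    return min(decode_fps, encode_fps)


# Hypothetical case 1: slow, high-quality encode. The encoder is the limit,
# so a faster (multithreaded) decoder changes nothing.
slow_single = pipeline_fps(decode_fps=60, encode_fps=6)
slow_multi = pipeline_fps(decode_fps=180, encode_fps=6)

# Hypothetical case 2: very fast encode settings. Now the decoder is the
# limit and multithreaded decoding raises the overall rate.
fast_single = pipeline_fps(decode_fps=60, encode_fps=400)
fast_multi = pipeline_fps(decode_fps=180, encode_fps=400)

print(slow_single, slow_multi, fast_single, fast_multi)
```

In this simplified picture a multithreaded decoder only pays off once the encoder is fast enough that decoding becomes the slower stage.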

roozhou
14th April 2010, 09:42
H.264 decoding benefits hugely in ffmpeg-mt, especially when one tries to do high speed transcoding (i.e. 20x realtime for SD). Removing the decoder bottleneck lets x264 itself run as quickly as it can ;)

My experience is in using ffms2 with and without ffmpeg-mt. I saw MASSIVE speedup in cases as I've described.

~MiSfit

20x real-time transcoding from H264 to H264 for what purpose?

jpsdr
14th April 2010, 10:57
What does the thread pool patch improve?

I'm curious too, what this patch is for ?

LoRd_MuldeR
14th April 2010, 12:45
I'm curious too, what this patch is for ?

I think currently x264 creates a new "encoder" thread for each frame and destroys that thread once the frame has been encoded. Creating and destroying threads has some overhead. The idea of having a "pool of threads" is: Create a certain number of threads once and re-use those threads over and over again. If you need a thread to run a task, you simply pick an idle thread from the pool (or wait for one to become idle). And if the thread has done its work (i.e. completed the task), it returns to the pool and waits for a new task to run. This could give some speed-up, because the thread creation/destruction overhead is avoided. But I have no numbers for x264. I assume if the thread pool really gave a significant speed-up, it would have been committed already. This patch has been floating around for a long time now...

http://en.wikipedia.org/wiki/Thread_pool
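
A minimal sketch of the thread pool idea, using Python's standard ThreadPoolExecutor. This is not the x264 patch, which is written in C; encode_frame is a hypothetical stand-in for the real per-frame work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def encode_frame(frame_index):
    """Hypothetical per-frame task; returns the frame index and the worker that ran it."""
    return frame_index, threading.current_thread().name


def encode_all(num_frames, num_workers):
    # The worker threads are created once and re-used for every frame,
    # instead of creating and destroying one thread per frame.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(encode_frame, range(num_frames)))


results = encode_all(num_frames=32, num_workers=4)
worker_names = {name for _, name in results}
print(f"{len(results)} frames handled by {len(worker_names)} worker thread(s)")
```

All 32 tasks are served by at most four threads, so the thread creation and destruction cost is paid per worker and not per frame.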

rack04
14th April 2010, 15:43
H.264 decoding benefits hugely in ffmpeg-mt, especially when one tries to do high speed transcoding (i.e. 20x realtime for SD). Removing the decoder bottleneck lets x264 itself run as quickly as it can ;)

My experience is in using ffms2 with and without ffmpeg-mt. I saw MASSIVE speedup in cases as I've described.

~MiSfit

In the test that I just performed I noticed faster speeds without ffmpeg-mt, 6.04 fps vs 5.96 fps.

Atak_Snajpera
14th April 2010, 16:02
6.04 fps vs 5.96 fps.

whole 1% :) ffmpeg-mt is very useful when you encode BD or any AVC-HD source to PSP (480x272), iPhone and so on

rack04
14th April 2010, 16:08
whole 1% :) ffmpeg-mt is very useful when you encode BD or any AVC-HD source to PSP (480x272), iPhone and so on

All I was trying to convey is that in my limited testing ffmpeg-mt and ffms2-mt decoding as a x264 input method doesn't seem to benefit from the multithreading. The source in my test was a Blu-ray Disc. However, your point about the benefits of resizing is a non issue since lavf/ffms as a x264 input method doesn't support resizing, cropping, or padding yet.

roozhou
14th April 2010, 16:17
whole 1% :) ffmpeg-mt is very useful when you encode BD or any AVC-HD source to PSP (480x272), iPhone and so on

How do you accomplish this with x264's internal lavf or ffms?
I would rather do it with ffdshow. The decoder and the resizer will run in separate threads.