Futzing with the x264 code -- possible improvements

Dark Shikari · 11th July 2007, 12:56

I've been futzing with the x264 code today. I figured it would be a fun idea to find parts of the code that could be improved to increase encoding quality, even if just a bit.

I found the following in the code of the --me UMH motion search function, probably the most commonly used one:

Quote:

/* FIXME if the above DIA2/OCT2/CROSS found a new mv, it has not updated omx/omy. We are still centered on the same place as the DIA2. is this desirable? */
CROSS( cross_start, i_me_range, i_me_range/2 );

In other words, even if the above functions found a new motion vector, the CROSS motion search is activated at the previous position... but it's not clear whether this is necessarily a bad thing. So I decided to test this. I grabbed a short clip from Ocean's Eleven, about 10 seconds, that had a lot of motion and was relatively denoised. I tested three different code sets:

1: Original, no change.

2:

Quote:

omx = bmx; omy = bmy;
CROSS( cross_start, i_me_range, i_me_range/2 );

i.e. doing exactly what the comment suggested.

3:

Quote:

CROSS( cross_start, i_me_range, i_me_range/2 );
omx = bmx; omy = bmy;
CROSS( cross_start, i_me_range, i_me_range/2 );

Doing the original and what the comment suggested. My reasoning for this is that it's quite possible that an earlier search missed some vectors near the origin, and that CROSS exists here to catch some of them that were missed earlier. On the other hand, the comment is making a good suggestion; it might be good to run the CROSS at the new location, too. So why not try both and see what happens?

My command lines were relatively low-bitrate with pretty high settings:

--trellis 2 --no-fast-pskip --subme 7 --bframes 4 --ref 16 --b-pyramid --partitions all --8x8dct --me umh --bime --b-rdo --mixed-refs --direct auto --weightb --progress --crf 35 --deblock 0:0

--trellis 2 --subme 6 --bframes 4 --ref 6 --b-pyramid --partitions all --8x8dct --me umh --bime --b-rdo --mixed-refs --direct auto --weightb --progress --crf 35 --deblock 0:0

--trellis 1 --subme 6 --bframes 4 --ref 3 --b-pyramid --partitions all --8x8dct --me umh --bime --b-rdo --mixed-refs --direct auto --weightb --progress --crf 35 --deblock 0:0

(tests 1, 2, and 3 respectively)

Results:

To measure the improvement with a single % value, I used the following formula:

(1 / (1 - SSIM)) / Bitrate

as my quality-per-bitrate metric. It divides by (1-SSIM) to represent the fact that SSIM increases in quality as it converges towards 1. Since doubling the distance of the SSIM value from 1 halves the quality, 1/(1-SSIM) effectively converts the nonlinear SSIM metric into a linear metric that can be directly compared.
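As a rough illustration of the formula above, here is a minimal C sketch. The function names and any inputs you feed it are hypothetical, not taken from the actual test runs; it only shows how a single percentage falls out of two (SSIM, bitrate) pairs.

```c
#include <assert.h>

/* Quality-per-bitrate score from the post: (1 / (1 - SSIM)) / bitrate. */
double quality_per_bitrate(double ssim, double bitrate_kbps)
{
    assert(ssim < 1.0 && bitrate_kbps > 0.0);
    return (1.0 / (1.0 - ssim)) / bitrate_kbps;
}

/* Percent change of the modified encode's score relative to the original's.
 * Positive means the modified encode scores better under this metric. */
double percent_improvement(double ssim_orig, double bitrate_orig,
                           double ssim_mod, double bitrate_mod)
{
    double q_orig = quality_per_bitrate(ssim_orig, bitrate_orig);
    double q_mod = quality_per_bitrate(ssim_mod, bitrate_mod);
    return 100.0 * (q_mod / q_orig - 1.0);
}
```

For example, an encode with unchanged SSIM and a bitrate 1% lower would score about 1% better, while halving the distance of SSIM from 1 at the same bitrate would score 100% better.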

Test 1: 0.1725110171 % quality improvement with Single CROSS modification
Test 2: 0.1697954506 % quality improvement with Single CROSS modification
Test 3: 0.3167555529 % quality improvement with Single CROSS modification

As you can see, the tests get a free improvement of roughly 0.17-0.32% in quality for zero extra CPU cost; not bad!

Since the Single CROSS modification had no significant effect on the speed (as it's running the same code, just from a different starting point), all the FPSs are within error tolerances of the original.

Test 1: 0.4983579238 % quality improvement with Double Cross modification
Test 2: 0.5774497533 % quality improvement with Double Cross modification
Test 3: 0.3762606716 % quality improvement with Double Cross modification
(note that the above are improvements from the original, not from the Single Cross modification)

Double CROSS gave even more of a boost: 0.37% at the least, and around 0.5% on the first two tests.

Test 1: 9.1346153846 % Speed loss with Double Cross modification
Test 2: 5.9615384615 % Speed loss with Double Cross modification
Test 3: 2.5839793282 % Speed loss with Double Cross modification

The disadvantage, of course, is a speed loss ranging from 2.5-9%.

In summary, there are two possible changes here: one that adds a minor quality boost at zero CPU cost, and another that gives a further minor quality boost, comparable to minor command-line options like --no-fast-pskip, at a pretty reasonable CPU cost.

Either would need to be tested further before being incorporated into the code, but this is my futzing for today.

System: Core 2 Duo 2 GHz/2 GB RAM
Compiler: Cygwin gcc 3.4.4
Extra compiler options: -march=opteron (seems to work best on my Core 2 Duo)

Inventive Software · 11th July 2007, 13:06

Could be useful as a HQ option extension for UMH.

Dark Shikari · 11th July 2007, 13:10

Quote:

Originally Posted by Inventive Software

Could be useful as a HQ option extension for UMH.

Yeah, I think we could use something between UMH and ESA; perhaps a modification of UMH with some extra searches added in intelligent places to improve overall quality/bitrate.

This is partially because ESA is completely useless in most cases because of its speed; one might even be better off with a longer-range UMH search than a shorter-range ESA.

Another note is that most of the settings that really slow down H.264 don't give much improvement, like --ref 16 instead of --ref 6 on non-cartoon sources, or even --subme 7, which I find is overall not very useful compared to --subme 6. So the "bar" of quality improvement relative to speed loss for a new version of UMH for x264 is not very high, and it should not be that difficult to find improvements that are worth using.

Inventive Software · 11th July 2007, 16:36

So replacing this for ESA would work better dya think? I personally never use ESA, but I don't know what benefits it would bring, because UMH seems to justify the search process... if you're checking each pixel difference like ESA does, you'd need a very fast and efficient algorithm or fast hardware, so this would do well to replace ESA.

Just chucking things round that would appeal to akupenguin.

Dark Shikari · 11th July 2007, 16:36

I've had some other ideas of what to do with this section that I'll work with later; one is to use an if statement to only do the second CROSS if the motion vector changed previously. Another possibility is to perform a search other than CROSS in that if statement and see if that does any better.

Quote:

Originally Posted by Inventive Software

So replacing this for ESA would work better dya think? I personally never use ESA, but I don't know what benefits it would bring, because UMH seems to justify the search process... if you're checking each pixel difference like ESA does, you'd need a very fast and efficient algorithm or fast hardware, so this would do well to replace ESA.

Just chucking things round that would appeal to akupenguin.

I'm not talking about replacing ESA: ESA at X merange will always be better than or equal to UMH at X merange. It exists for a specific purpose; there would never be any reason to replace it IMO.

burfadel · 11th July 2007, 17:02

I've heard from somewhere (can't remember where) that the code for x264 is rather unoptimised, there's a lot of places where MMX/MMXEXT/SSE/SSE2/SSE3/SSSE3 code can be included for extra speed but are currently missing. Is this true?

Manao · 11th July 2007, 17:53

Dark Shikari : everything you did is OK, except the way you compare the results.

As you have noticed, the result quality depends on both the bitrate and the PSNR/SSIM/metric, so since both change at the same time, it's not easy to compare them. You decided to avoid that issue by saying, arbitrarily, that 'quality = 1/(1-SSIM)/bitrate', and then comparing qualities together.

That is definitely not how it should be done. The proper way is to encode at several CRFs, and then to draw the curve metric/bitrate. Once curves are drawn, you can compare the modifications.

Especially, you can say "at the same bitrate, the metrics differ by XXX", or "at the same metrics, the bitrate differs by YYY %".

It's slower, but it works.
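The curve comparison described above could be sketched roughly as follows in C. This assumes each build was encoded at several CRFs and that simple linear interpolation between neighbouring samples is good enough; the struct, the function name, and any sample values are hypothetical, not measurements from this thread.

```c
#include <assert.h>

/* One encode: its resulting bitrate and its metric (PSNR or SSIM). */
typedef struct {
    double bitrate_kbps;
    double metric;
} rd_point_t;

/* Interpolates the metric at target_kbps along a curve whose points are
 * sorted by ascending bitrate. Returns 1 and writes *out on success, or
 * returns 0 if the target lies outside the sampled bitrate range. */
int metric_at_bitrate(const rd_point_t *pts, int n, double target_kbps, double *out)
{
    assert(pts != 0 && out != 0 && n >= 2);
    for (int i = 0; i + 1 < n; i++) {
        double lo = pts[i].bitrate_kbps;
        double hi = pts[i + 1].bitrate_kbps;
        if (hi > lo && target_kbps >= lo && target_kbps <= hi) {
            double t = (target_kbps - lo) / (hi - lo);
            *out = pts[i].metric + t * (pts[i + 1].metric - pts[i].metric);
            return 1;
        }
    }
    return 0;
}
```

Calling it once per build with the same target bitrate gives the "at the same bitrate, the metrics differ by XXX" number; swapping the roles of bitrate and metric gives the "at the same metrics, the bitrate differs by YYY %" number.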

Manao · 11th July 2007, 18:00

burfadel : you've heard wrong. x264 can be made faster - everything can be made faster. But it's definitely not "rather unoptimized".

What is missing, last time I checked, is SSSE3 for 32bits OSs ( since akupenguin uses a 64bits OS ), and, perhaps, some SSE2 functions instead of MMXEXT ( it would help on P4/conroe ). Imho, that won't represent more than 5/10% of speed gain.

And, imho, if development time were to be spent on x264, I would rather look toward psychovisual enhancements, there are none at the moment, and it can dramatically improve things.

akupenguin · 11th July 2007, 18:22

While you're at it, remove MMX1, SSE1, and SSE3 from your list of instruction sets. SSE1 and SSE3 are floating-point and thus useless for video coding, and the last cpu that only had MMX1 was way too slow for x264 anyway.

Dark Shikari · 11th July 2007, 19:01

Quote:

Originally Posted by Manao

Dark Shikari : everything you did is OK, except the way you compare the results.

As you have noticed, the result quality depends on both the bitrate and the PSNR/SSIM/metric, so since both change at the same time, it's not easy to compare them. You decided to avoid that issue by saying, arbitrarily, that 'quality = 1/(1-SSIM)/bitrate', and then comparing qualities together.

That is definitely not how it should be done. The proper way is to encode at several CRFs, and then to draw the curve metric/bitrate. Once curves are drawn, you can compare the modifications.

Especially, you can say "at the same bitrate, the metrics differ by XXX", or "at the same metrics, the bitrate differs by YYY %".

It's slower, but it works.

My metric should be relatively valid for similar qualities--i.e. the same CRF, where differences in quality and bitrate are very small. You are right that dividing by bitrate is probably simplistic, as quality and bitrate are not linearly related; that definitely restricts the use of such a metric to small intervals such as mine, and no more. Anyways, it only tells half the story--the quality difference at that CRF.

I agree that the results will differ at different CRFs, of course. I would have to do testing at multiple CRFs to see the true results at more bitrates. I would guess from experience that the higher the bitrate, the less effective the optimizations.

Even if you disagree with my metric, you can always compare quality and bitrate separately, i.e. say "quality improved by 1%" and "bitrate improved by 1%" as separate statements.

I'm testing a number of improvements to the code other than the one I've stated that should improve the effectiveness of the UMH algorithm... I'll post with more later.

Dark Shikari · 11th July 2007, 19:29

CROSS( cross_start, i_me_range, i_me_range/2 );
if( saved_omx != bmx || saved_omy != bmy )
{
    omx = bmx; omy = bmy;
    CROSS( cross_start, i_me_range, i_me_range/2 );
}

gives the exact same results and seems to be a bit faster, so this would be preferable to the Double Cross solution above.

This requires this:

int saved_omx = omx;
int saved_omy = omy;

to be placed after the previous instance of

omx = bmx; omy = bmy;
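To make the control flow above concrete, here is a self-contained toy model in C. It is not x264 code: the cost function (distance to a hidden "true" vector), the `toy_` names, and the simplified cross pattern are all assumptions chosen for illustration, and the search always starts from the origin. It only shows the idea of re-running the cross from the new best vector when the first cross moved it.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy search state: best vector so far, its cost, the hidden target, and
 * how many candidate positions have been evaluated. */
typedef struct {
    int bmx, bmy, bcost;
    int tx, ty;
    int evals;
} toy_me_t;

/* Stand-in for a real block-matching cost such as SAD. */
static int toy_cost(const toy_me_t *m, int mx, int my)
{
    return abs(mx - m->tx) + abs(my - m->ty);
}

static void toy_check(toy_me_t *m, int mx, int my)
{
    int c = toy_cost(m, mx, my);
    m->evals++;
    if (c < m->bcost) { m->bcost = c; m->bmx = mx; m->bmy = my; }
}

/* Simplified cross: probe every second offset along both axes through
 * the centre (omx, omy). */
static void toy_cross(toy_me_t *m, int omx, int omy, int start, int x_max, int y_max)
{
    for (int i = start; i < x_max; i += 2) { toy_check(m, omx + i, omy); toy_check(m, omx - i, omy); }
    for (int i = start; i < y_max; i += 2) { toy_check(m, omx, omy + i); toy_check(m, omx, omy - i); }
}

/* Runs one cross from the origin, then (if allowed) a second cross from the
 * new best vector, but only when the best vector actually moved. */
int toy_search(int tx, int ty, int me_range, int allow_second, int *evals)
{
    toy_me_t m;
    assert(evals != 0 && me_range > 0);
    m.bmx = 0; m.bmy = 0; m.tx = tx; m.ty = ty; m.evals = 0;
    m.bcost = toy_cost(&m, 0, 0);

    int omx = 0, omy = 0;
    int saved_omx = omx, saved_omy = omy;
    toy_cross(&m, omx, omy, 1, me_range, me_range / 2);
    if (allow_second && (saved_omx != m.bmx || saved_omy != m.bmy)) {
        omx = m.bmx; omy = m.bmy;
        toy_cross(&m, omx, omy, 1, me_range, me_range / 2);
    }
    *evals = m.evals;
    return m.bcost;
}
```

In this toy, a target off both axes is only reached when the second cross runs, and a target at the origin skips the second cross entirely, which is where the speed saving over the unconditional Double Cross would come from.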

akupenguin · 11th July 2007, 19:44

The problem isn't comparing a wide range of bitrate. Nonlinearity kicks in even for asymptotically small ranges: while the curves will be straight on sufficiently small intervals, their slopes are not necessarily the same.
In general, .05 db psnr is equivalent to 1% bitrate. But that doesn't mean 40.06 psnr & 1010 kb/s is better than 40.00 psnr & 1000 kb/s. Because sometimes it's .03db/%br and sometimes it's .07db/%br. (The same applies to ssim, I just picked psnr because I know the right numbers off-hand.)

Sure you can say "encode A is 1% better quality and 1% better bitrate than encode B", knowing that it doesn't mean A is 2% better than B -- it may be more or less than 2% depending on how exactly quality maps to bitrate. The problem comes when encode A is 3% better quality and 1% worse bitrate -- that's not necessarily better at all.
If you encode at multiple values of CRF then you can interpolate between them. You can think of it as experimentally determining the constant of proportionality between quality and bitrate for your specific content and settings, though interpolation is more general in that with sufficiently many samples it can handle a wide range of bitrate and thus non-constant proportionality.
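The 40.06 dB at 1010 kb/s versus 40.00 dB at 1000 kb/s example above can be worked through with a small C sketch. The function name is hypothetical, and the calculation assumes the curve is locally straight with a known slope in dB per percent of bitrate, which is exactly the quantity that has to be measured rather than guessed.

```c
#include <assert.h>

/* PSNR gain of encode A over encode B after subtracting what A's extra
 * bitrate would have bought anyway at the given slope (dB per % bitrate).
 * A positive result means A is better than B at equal bitrate. */
double net_psnr_gain(double psnr_a, double bitrate_a,
                     double psnr_b, double bitrate_b,
                     double db_per_percent)
{
    assert(bitrate_a > 0.0 && bitrate_b > 0.0);
    double bitrate_percent = 100.0 * (bitrate_a / bitrate_b - 1.0);
    return (psnr_a - psnr_b) - db_per_percent * bitrate_percent;
}
```

With a slope of 0.03 dB/% the 40.06 dB encode comes out about 0.03 dB ahead, with 0.05 dB/% about 0.01 dB ahead, and with 0.07 dB/% about 0.01 dB behind, so the same pair of results can point either way depending on the slope.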

Dark Shikari · 11th July 2007, 19:52

Quote:

Originally Posted by akupenguin

The problem isn't comparing a wide range of bitrate. Nonlinearity kicks in even for asymptotically small ranges: while the curves will be straight on sufficiently small intervals, their slopes are not necessarily the same.
In general, .05 db psnr is equivalent to 1% bitrate. But that doesn't mean 40.06 psnr & 1010 kb/s is better than 40.00 psnr & 1000 kb/s. Because sometimes it's .03db/%br and sometimes it's .07db/%br. (The same applies to ssim, I just picked psnr because I know the right numbers off-hand.)

Sure you can say "encode A is 1% better quality and 1% better bitrate than encode B", knowing that it doesn't mean A is 2% better than B -- it may be more or less than 2% depending on how exactly quality maps to bitrate. The problem comes when encode A is 3% better quality and 1% worse bitrate -- that's not necessarily better at all.
If you encode at multiple values of CRF then you can interpolate between them. You can think of it as experimentally determining the constant of proportionality between quality and bitrate for your specific content and settings, though interpolation is more general in that with sufficiently many samples it can handle a wide range of bitrate and thus non-constant proportionality.

Yeah, that is probably true.

Of course, in my case both the bitrate and the quality generally improved, so the point is moot unless one gets a true bitrate/quality tradeoff involved.

If I find a change that creates such a tradeoff I'll make sure to try your method first; I agree that the quality/bitrate curve can be nastily nonlinear at times, and you certainly have much more experience with the curve as one of the coders behind x264, so I'll trust you on that.

Dark Shikari · 12th July 2007, 05:14

It appears that changing the hexagon grid in UMH to:

/* hexagon grid */
omx = bmx; omy = bmy;

for( i = 1; i <= i_me_range/4; i++ )
{
    static const int hex4[20][2] = {
        {-4, 2}, {-4, 1}, {-4, 0}, {-4,-1}, {-4,-2},
        { 4,-2}, { 4,-1}, { 4, 0}, { 4, 1}, { 4, 2},
        { 2, 3}, { 0, 4}, {-2, 3},
        {-2,-3}, { 0,-4}, { 2,-3},
        { 3, 2}, { 3,-2}, {-3, 2}, {-3,-2}
    };

    if( 4*i > X264_MIN4( mv_x_max-omx, omx-mv_x_min,
                         mv_y_max-omy, omy-mv_y_min ) )
    {
        for( j = 0; j < 20; j++ )
        {
            int mx = omx + hex4[j][0]*i;
            int my = omy + hex4[j][1]*i;
            if( CHECK_MVRANGE(mx, my) )
                COST_MV( mx, my );
        }
    }
    else
    {
        COST_MV_X4( -4*i, 2*i, -4*i, 1*i, -4*i, 0*i, -4*i,-1*i );
        COST_MV_X4( -4*i,-2*i,  4*i,-2*i,  4*i,-1*i,  4*i, 0*i );
        COST_MV_X4(  4*i, 1*i,  4*i, 2*i,  2*i, 3*i,  0*i, 4*i );
        COST_MV_X4( -2*i, 3*i, -2*i,-3*i,  0*i,-4*i,  2*i,-3*i );
        COST_MV_X4( -3*i, 2*i, -3*i,-2*i,  3*i, 2*i,  3*i,-2*i );
    }
}

gives a decent boost on the clips/settings I've tried it on (adding 4 more spots to the hexagon).
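To get a feel for how many extra candidates the 20-point grid above evaluates, here is a small standalone C sketch. It copies the offset table from the post; the bounds struct, the function name, and the inclusive range check are assumptions for illustration and are not x264's actual macros.

```c
#include <assert.h>

/* The 20 offsets from the post; the last row appears to hold the four
 * added points. */
static const int hex4_ext[20][2] = {
    {-4, 2}, {-4, 1}, {-4, 0}, {-4,-1}, {-4,-2},
    { 4,-2}, { 4,-1}, { 4, 0}, { 4, 1}, { 4, 2},
    { 2, 3}, { 0, 4}, {-2, 3},
    {-2,-3}, { 0,-4}, { 2,-3},
    { 3, 2}, { 3,-2}, {-3, 2}, {-3,-2}
};

/* Hypothetical motion-vector limits, inclusive on both ends. */
typedef struct {
    int x_min, x_max, y_min, y_max;
} mv_bounds_t;

/* Counts the candidates the scaled rings would evaluate around (omx, omy),
 * skipping any that fall outside the bounds. */
int count_hex_candidates(int omx, int omy, int me_range, mv_bounds_t b)
{
    int count = 0;
    assert(me_range >= 4);
    for (int i = 1; i <= me_range / 4; i++) {
        for (int j = 0; j < 20; j++) {
            int mx = omx + hex4_ext[j][0] * i;
            int my = omy + hex4_ext[j][1] * i;
            if (mx >= b.x_min && mx <= b.x_max && my >= b.y_min && my <= b.y_max)
                count++;
        }
    }
    return count;
}
```

With a search range of 16 and generous bounds this gives 4 rings of 20 points, i.e. 80 candidates, compared with 64 for a 16-point ring, which is the extra work being traded for the quality boost.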

Inventive Software · 12th July 2007, 09:37

Quote:

Originally Posted by Dark Shikari

It appears that changing the hexagon grid in UMH... gives a decent boost on the clips/settings I've tried it on (adding 4 more spots to the hexagon).

Speed or quality boost?

Got metrics to prove it?

Dark Shikari · 12th July 2007, 10:18

Quote:

Originally Posted by Inventive Software

Speed or quality boost?

Got metrics to prove it?

Quality, obviously, not speed. How in the world would adding more spots to the hexagon boost speed?

I'll run some more metrics in a bit, I didn't save the particulars but it'll be easy to run again and post here.

One thing I noticed is that adding more Y-direction motion searching didn't help much, probably because most input video is wider than it is tall and has more side-to-side motion than up-down motion, so it's safe to be biased in that manner; if anything, further Y-direction motion searching actually hurt SSIM.

DeathTheSheep · 24th July 2007, 04:15

Any results for those metrics you were planning to run?

Assuming this "futzing" would indeed yield such improvement in the general case, what effect would changing --merange have with this new algorithm? Would X- and Y-direction motion searching be offset proportionally to the overall extension in search range?

Also, in the neighborhood of suggested improvements, I would without hesitation suggest shunting the Exhaustive search onto a different thread than all the other processing. That is, if it proves too difficult to implement ESA into the current multi-thread framework.
