Avisynth filters/general code optimization


SansGrip
10th October 2002, 19:48
Forgive me if this is beating a dead (or even just slightly light-headed) horse. I did search on this topic but found very little.

I like to think that I write fairly CPU-friendly C/C++ code, but I've never before written anything as performance-critical as an Avisynth filter, and could be missing something obvious.

Currently I'm ensuring that all one-time calculations get done in the constructor, and that all per-frame calculations get done outside of the main loop. But apart from this obvious stuff, what should I particularly look out for?

Am I right in saying that the compiler should automatically optimize, for example, power-of-two divisions and multiplications, or do I need to do this manually?

Also, are issues such as maximizing CPU cache usage as relevant when discussing higher-level code like C++?

I guess there's a limit to how much one can optimize C/C++ code, but as of yet I don't know ASM. It's probably fair to say that one cannot achieve very high performance without at least some hand-crafted ASM or, preferably, extensive use of multimedia extensions.

I suppose what I'm getting at is this: to what degree is it possible to optimize in C/C++, and would it be worth learning ASM and the various multimedia extensions?

If anyone can point me to threads I missed, or other good online/paper resources on code optimization and general video processing topics, I'd be very grateful.

Thanks.

-h
10th October 2002, 20:59
The best thing to do is streamline the algorithm itself - find ways to remove calculations which are performed twice (say by caching the initial result), or reduce the body of data which must be processed. A good example of this is motion estimation - a good scheme such as PMVFAST or EPZS will perform ~140 times faster than a full search, and the quality of the resulting search will still be 99% as good.

The next best places for optimization are lookup table candidates and integer divisions.

A good use for lookup tables in a filter would be applying a threshold to every pixel in the image - you could do {if (*pixel < threshold) *pixel = threshold;} for every pixel, but you'd generate a ton of branch mispredictions. A better idea would be to generate a 256-entry array, the first (threshold) entries of which are equal to (threshold), and the remaining entries equal to their own index. Then you could do {*pixel = lookup[*pixel];} to perform the same threshold without the possible misprediction. An expansion of this idea would be to create a 65536-entry array to threshold 2 pixels at a time, or even to use two separate loops - the first would threshold 32 pixels per iteration (to reduce loop overhead), then fall through to a loop with single-pixel granularity to process the remainder of the image.
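Something like this (a rough sketch - buf, width, height and threshold are made-up names, and real Avisynth code would walk each plane row by row using the pitch):

/* Build the 256-entry clamp table once, e.g. in the filter constructor. */
unsigned char lookup[256];
for (int i = 0; i < 256; i++)
    lookup[i] = (i < threshold) ? (unsigned char)threshold : (unsigned char)i;

/* Per frame: one table read per pixel, no branch in the inner loop. */
for (int i = 0; i < width * height; i++)
    buf[i] = lookup[buf[i]];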

If you find yourself performing lots of integer divisions, say to find an average where there may be more than 2 values being considered, it is faster to multiply by a precomputed reciprocal and then shift right, instead of dividing directly.

Say you wanted to divide a value by 13. You could do {val /= 13;}, but integer division is famously costly on the x86 architecture. Instead, you could do {val = (5042 * val) >> 16;}. This will give you the same result, without the division overhead. The constant (in this case 5042) is precalculated as (((1 << n) / q) + 1), where n is the number of bits you will shift the result by (16 in this case) and q is the number you wish to divide by. Usually the constants will be precalculated into a table, so you would replace something like {val /= q;} with {val = (offset[q] * val) >> 16;}.
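Building the table is trivial (a sketch, sized for divisors up to 255; q = 1 is skipped since its constant needs 17 bits, and you'd never divide by 1 anyway):

/* Build the reciprocal table once: offset[q] = ((1 << 16) / q) + 1. */
unsigned int offset[256];
for (int q = 2; q < 256; q++)
    offset[q] = ((1u << 16) / q) + 1;

/* Later, instead of {val /= 13;}: */
val = (offset[13] * val) >> 16;   /* offset[13] == 5042 */

It's exact for the small dividends you typically see in 8-bit pixel math, but don't assume it for arbitrary values - test it against your actual input range.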

Of course MMX is a dream come true for image processing, and no amount of tweaking in C can get you close to an MMX interpolation or SAD function, or pretty much anything that the p* opcodes can cater to. There's still a fair bit of improvement possible for plain C.
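To give a taste, here's roughly what an 8x8 SAD looks like with the MMX/ISSE intrinsics (a sketch, not production code - psadbw via _mm_sad_pu8 needs a P3/Athlon XP or newer):

#include <xmmintrin.h>  /* MMX/ISSE intrinsics, including _mm_sad_pu8 (psadbw) */

unsigned sad_8x8(const unsigned char *a, const unsigned char *b, int pitch)
{
    __m64 acc = _mm_setzero_si64();
    for (int y = 0; y < 8; y++) {
        __m64 pa = *(const __m64 *)(a + y * pitch);
        __m64 pb = *(const __m64 *)(b + y * pitch);
        acc = _mm_add_pi16(acc, _mm_sad_pu8(pa, pb));  /* psadbw: |a-b| summed over 8 bytes */
    }
    unsigned sad = (unsigned)_mm_cvtsi64_si32(acc);
    _mm_empty();  /* emms - restore FPU state after MMX use */
    return sad;
}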

Oh and sorry if you already knew this stuff :) I hope it inspires someone to correct/use/expand the ideas.

Edit: forgot to mention, when implementing multiply-shift division, be sure to watch out for overflows! You don't want the multiplication to produce a value that doesn't fit in 32 bits (31 for signed values), or you'll just get back nonsense! Thus it is only useful when you know the constraints of your input.
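A quick sanity check for the unsigned 16-bit-shift variant above - the product must stay below 2^32, so the largest safe input for a given divisor is:

unsigned max_safe_val = 0xFFFFFFFFu / offset[q];  /* ~851,000 for q == 13 */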

-h

alexnoe
10th October 2002, 21:18
Note that there's absolutely no need to learn 3DNow!. It's a bunch of crap IMHO.
BTW with my new Pentium 4, I just started to make some RGB->YUV with 128-bit MMX :) But the semester starts next week, so I'll have other priorities :(

SansGrip
10th October 2002, 22:00
A good example of this is motion estimation - a good scheme such as PMVFAST or EPZS will perform ~140 times faster than a full search, and the quality of the resulting search will still be 99% as good.

Interesting - I'm playing with motion detection right now; is that the same thing as estimation? I'll ask Dr Google about those two algorithms.

The next best places for optimization are lookup table candidates and integer divisions.

Lookup tables I'm familiar with, but the threshold example was an eye-opener. I never would've thought of that one on my own :)

BTW, what is branch misprediction?

If you find yourself performing lots of integer divisions, say to find an average where there may be more than 2 values being considered, it is faster to multiply by a precomputed reciprocal and then shift right, instead of dividing directly.

Again, fascinating. I guess this stuff is probably all covered in programming 101, but I never had any formal training as such.

Of course MMX is a dream come true for image processing, and no amount of tweaking in C can get you close to an MMX interpolation or SAD function, or pretty much anything that the p* opcodes can cater to.

I'm going to exercise the brain by learning ASM, I think. I dabbled in 680x0 assembler many years ago (just simple stuff, checksumming etc.), so the concepts shouldn't be too hard to pick up again.

I even went to my local Chapters this afternoon to get a book on x86 assembler, but their computer section, despite occupying probably 900 square feet, didn't have a single title on the subject. What's more they'd evidently had a blind dyslexic chimp recently rearrange the section, so it took me seven times as long as it should have to discover this omission. Grrr ;)

I think ASM is the way to go, though. That and a good book on image processing and maybe *shudder* some relevant math text :eek:

SansGrip
10th October 2002, 22:03
BTW with my new Pentium 4, I just started to make some RGB->YUV with 128-bit MMX :)

I think I'll start with MMX and SSE. This has nothing to do with me owning an AMD XP :D

sh0dan
10th October 2002, 22:05
Take a look at the documents I put up on the "Avisynth CVS tests" site in my sig - there's a link at the top, where I put all my optimization documents. One of the best is the "AMD Optimization Guide" - it contains much valuable information (-h mentioned some of the most important). Read it, and post questions here. It's much easier than us just guessing what you want to do. Or another way could be for you to present some C code you have, which you want optimized - then we can help you reach your goal.

sh0dan
10th October 2002, 22:12
Originally posted by SansGrip
I think I'll start with MMX and SSE. This has nothing to do with me owning an AMD XP :D

... as does probably most of the community. MMX / ISSE will get you a long way, and can be used by far more people.

Do however beware of too-large lookup tables. 65536 entries will stress your cache quite hard. Dividee mentioned some tests he did, where the algorithm was slowed down by many cache misses. Remember, when you read and write data, some area of the cache gets used for that, which means that part of your 64k table (or 256k if you use 4-byte aligned ints, as you're supposed to, to avoid alignment penalties) will get evicted, which means you have a big chance of cache misses, which leads to massive slowdown (memory reads = bad thing!). Beware of your cache!

alexnoe
10th October 2002, 22:15
BTW, what is branch misprediction?

Starting with the Intel Pentium, CPUs have had a buffer where they store the destinations of jumps.
Let's take a loop:
mov ecx, 100
@@1:
...
...
dec ecx
cmp ecx, 0 ; stupid code, but good enough for an example
jnz @@1

Older CPUs first checked whether ecx was 0, and then either jumped or didn't. That costs time - every single time.
With "branch prediction", the CPU gains experience about that jump: it guesses the outcome and keeps executing *before* the comparison is done. From then on, all operations are performed on "shadow registers".
When the comparison is done and the guess was right, the data in the shadow registers is copied into the "normal" ones; otherwise ("branch misprediction") the shadow registers are discarded and the jump is performed correctly. On a PIII this costs about 10 clocks; on a P4, about 20.

To help avoid branch mispredictions, new instructions were introduced with the P6 core (Pentium Pro/Pentium II):
cmp eax, 5
jnb @1
mov ebx, ecx
@1:

can now be written
cmp eax, 5
cmovb ebx, ecx
This avoids the jump, and no branch misprediction can occur.
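From C you can often get the same effect by writing the branch as a simple conditional expression - compilers targeting the P6 can emit cmov for it (a sketch; a, b, c are placeholders):

/* Branchy version - may mispredict: */
if (a < 5)
    b = c;

/* Branch-free candidate - a P6-aware compiler can turn this into cmovb: */
b = (a < 5) ? c : b;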

-h
10th October 2002, 22:34
One of the best is the "AMD Optimization Guide" - it contains much valuable information

That's the one! I couldn't remember which guide I had picked up the most from, but that's it. Very handy.

Do however beware of too-large lookup tables. 65536 entries will stress your cache quite hard.

Yeah it will most likely hurt, but could still help if your input was biased toward a certain range of the table. But now I can't think of many situations where it would actually help, given today's CPU-to-bus multipliers. Sometimes I miss the 65816...

-h