View Full Version : Optimized averaging :)
SansGrip
25th November 2002, 03:20
Ok, now I've learned that the cmov* instructions can actually be useful, it's time to turn my attention to averaging. Again, this is in regular old asm, no MMX for the time being.
Are there any "best practices" for coding something like:
if(abs(this_val - that_val) <= threshold)
{
sum += this_val;
++div;
}
...
avg = sum / div;
Could cmov* also be used for adding to sum and div to avoid jumping? Is there a "most efficient" way of doing an abs?
I know that division is bad, so I want to multiply by a then shift right by b. I know that b is a constant, and that a depends upon the divisor. But what I don't understand is how to calculate the appropriate values for a and b.
If you think I'm asking a lot of questions now, just wait until I start on MMX :D.
dividee
25th November 2002, 07:01
Assuming this_val and that_val are unsigned bytes:
movzx dx, this_val
movzx bx, that_val
sub bx, dx
mov cx, bx
sar cx, 15 // cx is either 0 or -1
xor bx, cx
sub bx, cx // bx = abs(this_val-that_val). Remember 2's complement ?
sub bx, treshold_plus_one
sar bx, 15
and dx, bx
add ax, dx // sum
sub si, bx // div
...
mul word ptr [div_table+si*2] // not sure if this adressing mode works
add ax, 32768 // for better rounding
adc dx, 0
// result in dx
div_table is initialized with:
unsigned short div_table[N];
div_table[0]=some_sensible_value;
div_table[1]=65535;
for (int i=2;i<N;i++) div_table[i] = 65536/i;
This code is quite easy to port to MMX. Jumpless code is usually a requirement for MMX.
SansGrip
25th November 2002, 17:01
sar cx, 15 // cx is either 0 or -1
xor bx, cx
sub bx, cx // bx = abs(this_val-that_val). Remember 2's complement ?
Man, this is clever stuff. Did you learn this kind of thing reading source? Is elegant assembly code passed down from wizards to neophytes in this way? :)
This code is quite easy to port to MMX.
I must admit when I started reading through the MMX instruction list I had a hard time conceptualizing how it would be helpful. Then I got to the pack/unpack instructions...
Thanks, dividee. I'll get there yet ;).
SansGrip
25th November 2002, 17:17
Would this work for abs(this_val - that_val)?
mov bx, that_val
mov ax, this_val
mov cx, bx
sub bx, ax // bx = that_val - this_val
sub ax, cx // ax = this_val - that_val
cmovs ax, bx // ax = ax < 0 ? bx : ax
dividee
25th November 2002, 20:41
Yes it should work! And it's probably faster than my version and easier to understand.
The abs code I wrote above is a 'classic' of asm programming. I saw it first as an exercise in an asm course, just after seeing 2's complement representation : "Write a piece of code which implement the abs function without any jump" or something like that :)
You can use the same trick in MMX:
movq mm0, this_val_4words
movq mm1, that_val_4words
psubw mm0, mm1
movq mm1, mm0
psraw mm1, 15
pxor mm0, mm1
psubw mm0, mm1
(and there is no cmov here to make me look stupid ;) )
SansGrip
25th November 2002, 21:15
Yes it should work!
Woohoo!
You can use the same trick in MMX:
The more MMX code I see the more I like it :).
(and there is no cmov here to make me look stupid ;) )
Absolutely no danger of that, my friend. We should also note that your code will work on pre-cmov machines. I'm not sure when that instruction was introduced -- maybe the Pentium?
DSPguru
27th November 2002, 20:06
Originally posted by dividee
(and there is no cmov here to make me look stupid ;) ) your code is lovely :D. but i must tell you that i'm very surpised to find out that an instruction like 'abs' isn't a member of x86 instrcution set.
i'm asm-ing to Motorola's DSP56k, and it has a very nice ABS instruction :)
i never thought it's a uniqe command :p
SansGrip
27th November 2002, 22:40
DSPguru: i'm very surpised to find out that an instruction like 'abs' isn't a member of x86 instrcution set.
The 80x86 instruction set is, shall we say, "quirky" ;).
There's no single instruction for it in MMX/SSE either. There is a specific instruction to do a SAD in ISSE, though.
Oh, here's a nice version of MMX abs from some Intel doc:
movq mm0, this_val
movq mm1, that_val
movq mm2, mm0
psubb mm0, mm1
psubb mm1, mm2
por mm0, mm1
Rather clever I thought :).
vlad59
28th November 2002, 09:16
For video content, you should always use :
movq mm0, this_val
movq mm1, that_val
movq mm2, mm0
psubusb mm0, mm1
psubusb mm1, mm2
por mm0, mm1
Intel doc is clearer about overflow with psubusb.
sh0dan
28th November 2002, 14:07
I added cycle count for AMD Athlon:
Originally posted by vlad59
movq mm0, this_val ; 2 cycles
movq mm1, that_val ; 2 cycles (memory read stall)
movq mm2, mm0 ; 2 cycles
psubusb mm0, mm1 ; 0 cycles (paired)
psubusb mm1, mm2 ; 2 cycles
por mm0, mm1 ; 2 cycles (dependency penalty)
; -----------------------
; 10 cycles
Using ISSE:
movq mm0, thisval ; 2 cycles
movq mm1, thatval ; 2 cycles
movq mm2, mm0 ; 0 cycles (paired)
pminub mm0,mm1 ; 2 cycles - mm0 = min
pmaxub mm1,mm2 ; 0 cycles - mm1 = max
psubusb mm1,mm0 ; 2 cycles
; -----------------------
8 cycles
You have to do a LOT of these for the two cycles to matter - but just for fun :)
In both cases the first movq mm0 should be moved up to avoid the memory read penalty, and an instruction could also be paired with the last instruction in both cases as long as you don't reference the destination register (mm0 and mm1).
And for a non-mmx absdiff, I found this:
; must be eax/edx - pixel values (0-255) must be in these registers.
sub eax,edx ;Difference one way ( = abs(Luma1-Luma2))
cdq ;Exchange
xor eax,edx
sub eax,edx ; eax=absolute difference
sh0dan
28th November 2002, 17:21
I know that division is bad, so I want to multiply by a then shift right by b. I know that b is a constant, and that a depends upon the divisor. But what I don't understand is how to calculate the appropriate values for a and b.
If you want to do:
y = a / b;
and b is a constant, you could do:
int inv_b = 65536 / b;
Then you can do: y = (a*inv_b)>>16;
The value is 65536, because 1<<16 is 65536. (One shifted up 16 times, or 1 multiplied by 2 16 times).
SansGrip
28th November 2002, 17:35
vlad59: Intel doc is clearer about overflow with psubusb.
That's what I meant. I should've referred to the doc before posting :rolleyes:.
SansGrip
28th November 2002, 18:32
sh0dan: Then you can do: y = (a*inv_b)>>16;
My brain is parsing all this correctly, but there must be a crossed wire somewhere because I don't seem to be able to apply it to my own code :confused:.
I think the main reason I'm confused is because pmul can return either the high part of the low part of the multiplication, and I'm not sure which one I want or, if both, how to combine them without overflowing.
I can't believe I'm not getting something this simple. It's very frustrating :mad:.
vlad59
28th November 2002, 18:51
There is a nice iSSE instruction that does exactly what you want :
y = (a*inv_b)>>16;
: pmulhw
There is also a nice 3Dnow instruction about it but I don't remember exactly as I never used it.
But you should also care about rounding as there is some "eagle eye" Doom9 forum reader that can spot this kind of things. With this formulae the result is always truncated. That's a really bad thing especially if you use feedback for your filter.
SansGrip
28th November 2002, 20:03
vlad59: There is a nice iSSE instruction that does exactly what you want
By Jove I think I might have cracked it. Here's what I'm doing:
Static variable:
__int64 div_lookup[12];
In constructor:
for(int i = 1; i <= 11; ++i) // divisor is in range [1, 11]
{
WORD f = 65535 / i;
for(int j = 0; j < 4; ++j)
((WORD*)&div_lookup[ i ])[j] = f; // Fix this up later
}
In GetFrame, where mm0 = corresponding values from div_lookup:
movq mm2, sum
paddw mm2, count
pmulhuw mm0, mm2
Lo and behold, mm0 now contains the average. I'm adding the count to the sum because I figured it would help with rounding... Does it??
Almost like magic. Thanks Vlad :).
Now all I have to do is include spatial neighbours in the average and ensure the averaging only gets applied to "fluctuating" pixels...
vlad59: But you should also care about rounding as there is some "eagle eye" Doom9 forum reader that can spot this kind of things.
Damn those eagle-eyed Doom9 forum readers ;) :D ;).
SansGrip
28th November 2002, 23:09
SansGrip: Almost like magic.
Well, magic with strange artifacts, especially on scene changes ;). Sigh.
vlad59
29th November 2002, 08:15
I was talking about pmulhuw (but you fixed that yourself ;) ).
About rounding Dividee gave me the formula I use in Convolution3D :
result = a / b that's what we want
let complicate a little
inv_b = 32768 / b
result = (a + b) * 2 * inv_b >> 16
With this you have good rounding :
With MMX/iSSE it's a always easy to read
movq mm2, sum
paddw mm2, count
psllw mm2, 1
pmulhuw mm0, mm2
I have no time for now to explain this formula. I'll edit my post tonight to explain.
Hope this helps.
SansGrip
29th November 2002, 18:39
Vlad59: movq mm2, sum
paddw mm2, count
psllw mm2, 1
pmulhuw mm0, mm2
That's almost exactly what I have, except I'm not doubling (sum + count).
I assume that mm0 contains the correct value with which to multiply given this particular count? At the moment I'm using pextrw and pinsrw to access a lookup table to get this value. I'm not sure if there's a more efficient way.
Vlad59: I have no time for now to explain this formula. I'll edit my post tonight to explain.
I think I understand what's going on, except for the doubling. I really appreciate your help with this :).
Edit: Now I know why you're doubling. When I generate my lookup table I'm doing (65535.0 / b + 0.5), whereas you'd be doing (32768.0 / b). I copied "+ 0.5" from the TemporalSoften source. Perhaps I don't need it. Would your way be more accurate?
dividee
30th November 2002, 01:30
About rounding: an error of 1 on a pixel value might seem harmless, but the problem if you're truncating is that the error is biased. With truncation, the mean error is about 0.5 and the error distribution is centered on -0.5. With rounding to nearest, the mean error is about 0.25, but more importantly the error distribution is centered on 0.
Oddly enough, truncation error is more visible on chroma; you can spot a greenish tint (especially on skin) even without comparing the clips side by side.
An auxiliary problem introduced with the multiply and shift method is that it handles badly common cases if you're not rounding. For instance, let's say you have a solid color region with a pixel value of 25:
you'll get (25*11*5957) >> 16 = 24.997 truncated to 24 (5957=65536/11).
Rounding techniques:
round(a) = floor(a+0.5)
this gives
round(a/b) = (a*(65536/b)+32768)>>16 ==> that's in the piece of code I wrote above
round(a/b) = ((a+(b/2))*(65536/b))
= ((2*a+b)*(32768/b)
so you have to shift by 1 before adding the count:
movq mm2, sum
psllw mm2, 1
paddw mm2, count
pmulhuw mm0, mm2
Multiplying by 2 is easier than halving the count, which would require one more register if you still need the count later, and is more precise since 1/2=0.
The Compare filter in avisynth is useful to check if your rounding is good: the Mean Difference should be close to 0 (on regular material at least). If it's close to 0.5 or -0.5 this is bad.
About the (65536.0 / b + 0.5): it should be 65536, not 65535.
The +0.5 has little effect, but it doesn't hurt. But in any case it's not at all sufficient for proper rounding! You should round the final pixel too.
I assume that mm0 contains the correct value with which to multiply given this particular count? At the moment I'm using pextrw and pinsrw to access a lookup table to get this value. I'm not sure if there's a more efficient way.
In TemporalSoften, I build a 16 bit index by packing four 4 bit counts initialy stored in the words of an MMX register. Of course this means a 512K lookup table (64K QWORD entries). It's big, but I only have to do one lookup instead of 4.
DSPguru
30th November 2002, 17:10
Originally posted by SansGrip
We should also note that your code will work on pre-cmov machines. I'm not sure when that instruction was introduced -- maybe the Pentium? any1 knows when cmov was first introduced ?
SansGrip
30th November 2002, 17:37
dividee: The Compare filter in avisynth is useful to check if your rounding is good: the Mean Difference should be close to 0 (on regular material at least). If it's close to 0.5 or -0.5 this is bad.
Ah, that's a useful piece of information. I've never tried that filter, but will definitely add it to my list of checks for new code. Thanks!
About the (65536.0 / b + 0.5): it should be 65536, not 65535.
I used 65535 because I'm storing the result in words and obviously 65536 would overflow. I did have some concerns about that when I wrote it. I need to run that Compare check.
In TemporalSoften, I build a 16 bit index by packing four 4 bit counts initialy stored in the words of an MMX register. Of course this means a 512K lookup table (64K QWORD entries). It's big, but I only have to do one lookup instead of 4.
I read the TemporalSoften source pretty thoroughly while I was figuring out how to write this, even going through it with paper and pencil to understand it properly. I didn't know it was your doing -- very nice code :).
The reason I didn't "borrow" your lookup code is that I was concerned about the size of the table. But you make a valid point when you say it cuts the lookups from four to one. It also eliminates all those pinsrw and pextrws and the related pairing problems.
I think if Compare shows an incorrect mean difference (and I have a feeling it will) I'll comment out my averaging code and try yours. At the moment I'm trying to shave some cycles off the YUY2 version because it's still not as fast as I'd like (30fps on my system, whereas the YV12 code runs ~100). Perhaps that would help a little.
Thanks for the explanation of rounding -- now I see how important it is :). I'll check all this out and post again in a while...
vlad59
30th November 2002, 19:17
@SansGrip
First sorry for wrong informations I gave to you, I have read the wrong code :( .
Next, to really know if your filter contain rounding error make a script where you call your filter 8 or 9 times in a row and use compare in the next line, that will really show you if your filter has rounding error :
avisource("blablabla")
c3d=Convolution3d(0,8,8,8,8,3,0)
c3d=Convolution3d(c3d,0,8,8,8,8,3,0)
c3d=Convolution3d(c3d,0,8,8,8,8,3,0)
c3d=Convolution3d(c3d,0,8,8,8,8,3,0)
c3d=Convolution3d(c3d,0,8,8,8,8,3,0)
c3d=Convolution3d(c3d,0,8,8,8,8,3,0)
c3d=Convolution3d(c3d,0,8,8,8,8,3,0)
compare(c3d)
SansGrip
30th November 2002, 20:27
dividee: In TemporalSoften, I build a 16 bit index by packing four 4 bit counts initialy stored in the words of an MMX register. Of course this means a 512K lookup table (64K QWORD entries). It's big, but I only have to do one lookup instead of 4.
I tried replacing my averaging code with yours and not only is it more accurate (~0 mean difference as opposed to ~0.1) but it's also a couple of fps faster.
Edit: In TemporalSoften you allow a kernel of up to 15 frames, which would then be your maximum divisor. In Flux, my maximum divisor is 11. With this in mind would it be possible to use a smaller lookup table?
Edit 2: This is just a hunch, but I note that (15 + 1) ^ 4 = 65536, and (11 + 1) ^ 4 = 20736. Could that be my lookup table size for a divisor of 11? :)
Do I have your permission to include your averaging code from TemporalSoften in Flux? I would really appreciate it :).
SansGrip
30th November 2002, 20:29
vlad59: First sorry for wrong informations I gave to you, I have read the wrong code :( .
No worries. I'm surprised anyone can keep anything straight when dealing with something as cryptic as MMX ;).
Next, to really know if your filter contain rounding error make a script where you call your filter 8 or 9 times in a row and use compare in the next line, that will really show you if your filter has rounding error
That's a very good idea. I'm definitely going to use that in future testing.
DSPguru
30th November 2002, 20:34
Originally posted by DSPguru
any1 knows when cmov was first introduced ? according to this (http://x86.ddj.com/secrets/opcodes/cmov.htm), cmov is a p6 command.
what's p6 ? pentium 4?
SansGrip
30th November 2002, 20:44
DSPguru: what's p6 ? pentium 4?
And according to this (http://www.geek.com/procspec/procspec.htm) P6 is Pentium Pro and above :).
vlad59
30th November 2002, 20:49
Cmov was not available with my K6-2-400 but I think all athlon has it.
DSPguru
30th November 2002, 20:49
cool, so i guess i will be using lots of cmovs in future versions of BeSweet. (beta8 already includes some ;)).
dividee
1st December 2002, 00:19
@SansGrip:
All my code is GPL, so as long as you release yours with the same license, you're welcome to use it.
About the lookup table size: theorically what you say is true; the problem is for calculating the index the code would become more complicated and thus slower, but the smaller table could improve cache coherency and compensate. Worth a try if you feel up to it ;)
sh0dan
1st December 2002, 17:18
Originally posted by SansGrip
I assume that mm0 contains the correct value with which to multiply given this particular count? At the moment I'm using pextrw and pinsrw to access a lookup table to get this value. I'm not sure if there's a more efficient way.
You said the bad word! ;)
pextrw and pinsrw SHOULD BE AVOIDED AT ALL COST! They perform extemely slow. On Athlon they usually have a 16+ cycle penalty PER INSTRUCTION! So you can see you can do many other things at the same time using other instructions.
In general I would say you should use movd mm-reg, reg32 combined with pshufw. This can be paired better, and you will get much faster code.
If you move a 16 bit value to the mmx-registers - you even have a word full of 0's you can use at the places where you need them, when doing the pshufw's.
Do however be aware that all reg32 -> mm-reg and all mm-reg -> reg32 are quite slow in AMD's (even the movd used above). So avoid them if possible.
Writing to memory can even be faster in some cases. (movd [eax],mm0 + mov ebx,[eax]), if you allow the processor to use "Load->Store Forwarding" - for this to be allowed, you have to satisfy some rather strange rules. Look in the AMD-Optimization guide, and use the AMD CodeAnalyzer.
vBulletin® v3.8.4, Copyright ©2000-2009, Jelsoft Enterprises Ltd.