PDA

View Full Version : Optimizing conditional code


SansGrip
24th November 2002, 21:44
I'm currently trying to wrestle my brain into the correct configuration for producing properly optimized assembly code, and it's really quite difficult ;).

For the last couple of days I've been attempting to come up with an assembler version (vanilla, no extensions) of the following:


int prev_diff = prev_pel - curr_pel,
next_diff = next_pel - curr_pel;
if((prev_diff < 0 && next_diff < 0) ||
(prev_diff > 0 && next_diff > 0))


The best I've come up with so far is pretty much identical to what my compiler produces, involving four subs and four conditional jumps. I'm convinced there must be a better way, but for the life of me can't think of it (except involving a 16mb lookup table :D).

Can anyone help out a poor C programmer who's wandering around the assembler forest without a map or compass? :)

SansGrip
24th November 2002, 22:52
Here's what I've come up with so far. This verion, I believe, causes the fewest instructions to be executed given that a slight majority of pixels are copied rather than processed. esi = start of row, eax = index into row, pp = start of row in previous frame, np = start of row in next frame.


mov ebx, pp
movzx dx, [ebx + eax] // dx = prev pel
movzx bx, [esi + eax] // bx = curr pel
sub dx, bx // dx = prev diff
jz copy_pel // prev diff == 0
js prev_diff_is_negative // prev diff < 0

mov ebx, np
movzx dx, [ebx + eax] // dx = next pel
movzx bx, [esi + eax] // bx = curr pel
sub dx, bx // dx = next diff
jz copy_pel // prev diff > 0 && next diff == 0
js copy_pel // prev diff > 0 && next diff < 0
jmp process_pel // prev diff > 0 && next diff > 0

prev_diff_is_negative:
mov ebx, np
movzx dx, [ebx + eax] // dx = next pel
movzx bx, [esi + eax] // bx = curr pel
sub dx, bx // dx = next diff
jns copy_pel // prev diff < 0 && next diff >= 0

process_pel:


It still looks inefficient to me. I'm wondering if there's a jxx I missed that would help. Or maybe my expectations are too high ;).

Incidentally, since I'm subtracting one unsigned byte from another, do I need to use bx/dx or can I use bl/dl? I'm still too confused by the various flags to work it out myself.

Edit: Just did a quick test and this code is only 1-2fps faster than the corresponding C code. That's not good :(.

SansGrip
24th November 2002, 23:26
And here's another version:


mov ebx, pp
mov cl, [ebx + eax] // cl = prev pel
sub cl, [esi + eax] // cl = prev diff
mov ebx, np
mov dl, [ebx + eax] // dl = next pel
sub dl, [esi + eax] // dl = next diff

test cl, cl
jge prev_diff_not_negative // prev diff >= 0
test dl, dl
jl process_pel // prev diff < 0 && next diff < 0
prev_diff_not_negative:
test cl, cl
jle copy_pel
test dl, dl
jle copy_pel


Fewer instructions, but almost identical in speed to the above version... :confused:

trbarry
24th November 2002, 23:59
Not sure about "(vanilla, no extensions)" but how about something like:


sub ax,ax
sub bx,bx
mov cl, curr
cmp cl, prev
cmova ax, ONE // = 1
cmovb ax, TWO
cmp cl, next
cmova bx, ONE
cmovb bx, TWO
and ax, bx
jz Skip_It


Don't know if it's the fastest but avoids the conditional branches which confuse prefetch processing.

- Tom

SansGrip
25th November 2002, 00:55
Not sure about "(vanilla, no extensions)"

I'm trying to learn asm for the lowest common denominator first. My intention is to get a main loop in assembly code that works for everyone, then write another loop for SSE-capable processors. My thinking is that wrt possible uses for this knowledge in the future I won't always be able to assume an advanced processor :).

but how about something like:

It took me a couple of minutes to grasp what was going on here, but now I understand it and, frankly, think it's wonderful. This is a very clear example of the difference between "thinking in C" and "thinking in asm". I've been writing C code for about a dozen years now and using an AND in this way would never have occurred to me!

Don't know if it's the fastest but avoids the conditional branches which confuse prefetch processing.

Even if it isn't the fastest, it's definitely the most elegant :).

Thanks Tom -- I still have much to learn (or is that unlearn?). I really appreciate your help.