
View Full Version : Optimized division


SansGrip
16th November 2002, 19:36
Given a numerator in the range [0,2295] and a denominator in the range [1,9], would it be more efficient on x86 architecture to implement the division with multiply-then-shift or with a lookup table, i.e. avg = lookup[numerator][denominator]?
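For concreteness, here is a minimal C++ sketch of the multiply-then-shift approach for exactly this range (the names `fast_div` and `RECIP` are illustrative). Each table entry is ceil(2^16 / d); the result is exact for all numerators up to 2295 because the rounding error (RECIP[d]*d - 2^16) * n stays below 2^16 for every d in [1,9]:

```cpp
#include <cstdint>

// RECIP[d] = ceil(2^16 / d); entry 0 is a placeholder since d starts at 1.
static const uint32_t RECIP[10] = {
    0, 65536, 32768, 21846, 16384, 13108, 10923, 9363, 8192, 7282
};

// n / d as one multiply and one shift; exact for n in [0, 2295], d in [1, 9].
inline uint32_t fast_div(uint32_t n, uint32_t d) {
    return (n * RECIP[d]) >> 16;
}
```

(For a compile-time-constant divisor the compiler already performs this transformation; the small reciprocal table is what lets it work with a runtime divisor in [1,9] without a div instruction.)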

SansGrip
18th November 2002, 01:42
It just occurred to me that I could simply benchmark it :D :rolleyes:.

trbarry
18th November 2002, 16:05
SansGrip -

IMHO the multiply-then-shift methods are faster, at least in assembler. This is especially true when you want to use MMX instructions and operate on 4, 8, or even 16 pixels at a time in a single assembler instruction.

Of course in MMX assembler you can also play games with the pavgb instructions to get various ratios very quickly.

(pavgb(x,y) = (x+y+1) / 2)
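In scalar C++ terms, pavgb on a single byte pair amounts to the following sketch (the function name is illustrative; like the instruction, it rounds the average up):

```cpp
#include <cstdint>

// Scalar equivalent of pavgb on one byte pair: average, rounding halves up.
// The intermediate sum is widened to unsigned so x + y + 1 cannot overflow.
inline uint8_t avg_round_up(uint8_t x, uint8_t y) {
    return static_cast<uint8_t>((static_cast<unsigned>(x) + y + 1) >> 1);
}
```

Composing it is how the "various ratios" fall out: avg_round_up(x, avg_round_up(x, y)) approximates (3x + y) / 4, and so on.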

- Tom

SansGrip
19th November 2002, 17:49
IMHO the multiply then shift methods are faster, at least in assembler.

I benchmarked my lookup table approach against code stolen from TemporalSoften and found little-to-no difference in speed. But this is of course in C++, which probably makes a big difference.

This is especially true when you want to use MMX instructions and operate on 4, 8, or even 16 pixels at a time in a single assembler instruction.

I can see how this would give a tremendous speed boost. On a related note, as far as C++ goes, I've seen some code that fetches four bytes at a time from the frame data, unpacks into ints, processes, repacks, then stores four bytes. Would this really be faster than fetching a byte at a time?
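The four-bytes-at-a-time idea looks roughly like this sketch (`brighten4` and the clamped add are just an illustrative stand-in for whatever per-pixel processing the filter actually does):

```cpp
#include <cstdint>

// Process four packed 8-bit pixels read as a single 32-bit word:
// unpack each byte, add a delta with clamping to [0, 255], then repack.
uint32_t brighten4(uint32_t packed, int delta) {
    uint32_t out = 0;
    for (int i = 0; i < 4; ++i) {
        int v = static_cast<int>((packed >> (8 * i)) & 0xFF) + delta;
        if (v < 0)   v = 0;
        if (v > 255) v = 255;
        out |= static_cast<uint32_t>(v) << (8 * i);
    }
    return out;
}
```

The saving comes mainly from doing one aligned load and one store per four pixels instead of four of each; whether that beats byte-at-a-time access in practice depends on the compiler and the per-pixel work.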

trbarry
20th November 2002, 18:32
Sorry, I'm not a C++ wizard, dunno.

I usually just figure that anything that operates on all the video pixels in a high level language is already too slow. But I am an assembly language bigot, so who knows? ;)

- Tom

SansGrip
21st November 2002, 00:56
I usually just figure that anything that operates on all the video pixels in a high level language is already too slow

I'm beginning to think you're right ;). No matter what I do and how many optimizations I make, I can see when I step through the disassembly that it's never going to be as good as hand-crafted asm.

Bang go my dreams of filter-writing in C# ;) :D.

But I am an assembly language bigot, so who knows? ;)

Maybe I will be in a few weeks. I just ordered a couple of books from Amazon (one on general 80x86 assembly and the other on multimedia extensions), and I'm currently reading an online book written in pre-Pentium days. I'm really looking forward to writing my first inline assembly :).

trbarry
21st November 2002, 03:50
SansGrip -

I think you can make most things go 3-5 times faster by going to assembler and then another similar multiple for video by learning to write optimized mmx/sse/sse2 code.

It takes a bit longer, so it's obviously not worth it for everything, but the low-level loops in video almost always are.

Some of the best references are free on the AMD and Intel sites. I like to read the AMD manuals first as they seem to be easier to read. But check out: (sloppy titles)

Intel Instruction Reference (and the AMD equivalent)
Intel Optimization Ref & P4 Optimization Ref
AMD Multimedia

and all the sample code that both companies offer. They have some brilliant people writing some of their samples because they want to show off the best of their hardware.

- Tom

SansGrip
21st November 2002, 22:19
I think you can make most things go 3-5 times faster by going to assembler

It'll definitely be worth the headaches I'm getting from trying to remember all those addressing modes if I can get that kind of speed boost.

Some of the best references are free on the AMD and Intel sites. I like to read the AMD manuals first as they seem to be easier to read.

Thanks for the info, I'll definitely check them out once I've got a handle on asm in general. I think what I'll do is start rewriting all my filters' main loops in asm, then I'll start mixing in some extensions as I learn them.

Wrt the original question about optimized division, though: you said that multiply-and-shift would be more efficient in asm than a lookup table. After seeing how many cycles a mul requires, that surprises me. Is indirection really that costly?

trbarry
22nd November 2002, 01:40
Wrt the original question about optimized division, though: you said that multiply-and-shift would be more efficient in asm than a lookup table. After seeing how many cycles a mul requires, that surprises me. Is indirection really that costly?

It depends on how large your lookup table is and whether it stays in the top-level cache. And I'm not sure that multiply takes much longer than add anymore; I haven't seen a table of instruction timings for a while.

But I was thinking mostly of video MMX instructions, where you may be multiplying 4 pairs of words at a time, and there is no way to directly use those words as an index or pointer.
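As an illustration of doing the multiply-then-shift on several values at once with SIMD intrinsics rather than a lookup (this sketch uses SSE2 rather than bare MMX; the constant 7282 is ceil(2^16 / 9), so the high halves produced by pmulhuw are exactly the multiply-then-shift quotients):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Divide eight 16-bit values by 9 in one shot: multiply each lane by
// ceil(2^16 / 9) = 7282 and keep the high 16 bits of each unsigned product
// (pmulhuw), i.e. (v * 7282) >> 16 per lane. Exact for inputs up to 2295.
void div8_by_9(const uint16_t* src, uint16_t* dst) {
    const __m128i recip = _mm_set1_epi16(7282);
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    __m128i q = _mm_mulhi_epu16(v, recip);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), q);
}
```

A table lookup has no counterpart at this width: there is no instruction that turns eight packed words into eight independent memory indexes, which is the point above about indexes and pointers.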

- Tom