Assembler optimization


sh0dan
11th March 2003, 17:47
trbarry wrote:

I'm surprised at your movntq results. For me on both P3's and P4's it has always seemed faster as long as things were aligned to at least 8 bytes and I wasn't subsequently reading it back in soon. But maybe it is machine dependent. What hardware were you testing this on?


The movntq was a big surprise for me too. It is definitely faster in direct copying (BitBlt) for instance, but not very useful in routines that actually do some processing.

The data is not read again, and it is 8-byte aligned. But I guess the problem lies in the fact that movntq cannot be deferred. A movq to memory can be held in the data cache and written out later, whereas movntq must be dispatched directly. When the routine also does some processing, the processor can take its time completing that deferred write.

I saw some really big penalties in the AMD Pipeline analysis tool (most movntq's took an average of 60 cycles!) It showed penalties for Data Cache miss, and the Load/Store Queue being full. Using movq's, the average cycle/instruction was about 2-4, with the load-store queue doing ok.

It should be noted that the systems I tested on were an Athlon Tbird 1200 and an Athlon XP 2200+, both with DDR RAM.

I don't know if you have an avisynth compile setup, but you could try the older convert_yv12.cpp (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/avisynth2/avisynth/convert_yv12.cpp?rev=1.8&only_with_tag=MAIN&sortby=date) and do a #define movntq movq and do some comparisons.
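
For readers without an AviSynth build, the same comparison can be approximated outside the filter code. The following is only a sketch under stated assumptions: it uses the SSE2 intrinsics `_mm_stream_si128` (movntdq) and `_mm_store_si128` (movdqa) as 16-byte stand-ins for movntq and movq, the name `copy_row` is hypothetical, and it falls back to `memcpy` on non-x86 targets. It shows where the store flavour is switched, not how convert_yv12.cpp does it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

// Hypothetical helper (not from AviSynth): copies `bytes` bytes from src to dst.
// Assumes both pointers are 16-byte aligned and `bytes` is a multiple of 16.
// streaming == true  -> non-temporal stores (bypass the cache, like movntq)
// streaming == false -> ordinary cached stores (like movq)
void copy_row(std::uint8_t* dst, const std::uint8_t* src, std::size_t bytes, bool streaming) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(bytes % 16 == 0);
#if HAVE_SSE2
    for (std::size_t i = 0; i < bytes; i += 16) {
        __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i));
        if (streaming)
            _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
        else
            _mm_store_si128(reinterpret_cast<__m128i*>(dst + i), v);
    }
    if (streaming)
        _mm_sfence();  // make the non-temporal stores globally visible
#else
    (void)streaming;   // no SSE2 available: plain copy, no store flavour to compare
    std::memcpy(dst, src, bytes);
#endif
}
```

Timing both modes over a frame-sized buffer, once as a pure copy and once with some arithmetic between load and store, would be one way to check whether the effect described above shows up on a given machine.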

trbarry
11th March 2003, 19:40
It may well be hardware dependent. But I'll go one further and suggest that both the AMD and Intel optimization utilities tend to suggest things that are good for their own processors possibly at the expense of others.

And some new hardware instructions seem to be not very optimized at first or even have special features ignored. This probably includes most things that give temporal hints. Then as the instructions become more commonly used they get optimized in subsequent processor releases.

But it's still something to worry about. I'm afraid I don't have time to go play with benchmarks again now, so I guess I'll continue as is. Heck, I haven't even converted most of my memcpy's to BitBlt's yet.

But if it is a trend we could make a case for not using movntq on Athlon machines for those functions that have multiple implementations.

- Tom

sh0dan
13th March 2003, 13:38
I made a section for Assembler optimization help on AviSynth.org, to post whenever something turns up.

AssemblerOptimizing (http://www.avisynth.org/index.php?page=AssemblerOptimizing)

Kurosu
13th March 2003, 14:33
Thanks, I was overwhelmed by the docs to read as a starter for optimization. I hope this will lead to an easy and simple corpus of general ideas. I particularly like the fact that you've given examples!

sh0dan
13th March 2003, 16:16
I made a "beginner" example with some basics in writing assembler code.

Kurosu
13th March 2003, 16:35
movq (as move quadword) is already explained, and movd should make sense, but saying that movd moves 4 bytes (doubleword) of data from memory or the lower 4 bytes of a mmx register to the lower 4 bytes of another mmx register could clarify it even more.

Another question I've never tested (I've only written pure mmx functions like the one in the example, with no trailing C code): if the function also contains C code, should the register values be saved and restored before jumping to the C code? I guess the function call as declared already takes care of that most of the time.

Another thing, this time regarding avisynth frame structure. The source pitch is often mod8 for whatever source, if not mod16/mod32. If you read junk data (for instance 2/4/6 bytes aren't pixel data) but that doesn't have any result on the valid bytes (like for average, thresholding...), then the frame doesn't have to be mod8, right?
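
The padding question can be illustrated with a scalar model. This is a sketch only: `clamp_rows_chunked` is a hypothetical name, the inner 8-iteration loop stands in for one 8-byte MMX operation, and it assumes the pitch is at least the width rounded up to 8 so the extra bytes touched still belong to the frame buffer. Because each byte is processed independently, the junk bytes in the padding never influence the valid pixels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: clamps every byte to [lo, hi], walking each row in
// 8-byte chunks the way an MMX loop would. The last chunk of a row may cover
// padding bytes between `width` and `pitch`.
void clamp_rows_chunked(std::uint8_t* p, int width, int height, int pitch,
                        std::uint8_t lo, std::uint8_t hi) {
    const int padded = (width + 7) & ~7;  // bytes actually touched per row
    assert(padded <= pitch);              // otherwise we would leave the row
    for (int y = 0; y < height; ++y) {
        std::uint8_t* row = p + y * pitch;
        for (int x = 0; x < padded; x += 8) {
            // One "register" worth of bytes; each byte is independent.
            for (int i = 0; i < 8; ++i)
                row[x + i] = std::min(std::max(row[x + i], lo), hi);
        }
    }
}
```

With a width of 13 and a pitch of 16, for example, bytes 13 to 15 of each row get clamped as well, which is harmless. The same reasoning would not hold for operations that mix neighbouring bytes, such as horizontal filters.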

ARDA
13th March 2003, 16:42
shodan
from levels.cpp
__asm {
    mov eax, [height]
    mov ebx, p
    mov ecx, modulo
    movd mm7,[cmax]
    movd mm6,[cmin]
    pshufw mm7,mm7,0
    pshufw mm6,mm6,0
  yloop:
    mov edx,[row_size]
    align 16
  xloop:
    prefetchnta [ebx+256]
    movq mm0,[ebx]
    pminub mm0,mm7
    pmaxub mm0,mm6
    movq [ebx],mm0
    add ebx,8
    sub edx,8
    jnz xloop
    add ebx,ecx;
    dec height  // you forgot to use dec eax in levels.cpp
    jnz yloop
    emms
}
As always, thanks for your great work.
Arda
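
As a reading aid, the loop above can be modelled in scalar C++. This is a sketch under assumptions: `limit_frame` is a hypothetical name, `cmin` and `cmax` are taken to be the per-byte limits, and `modulo` is taken to be pitch minus row_size. The row counter lives in a local variable, which is the scalar counterpart of the `dec eax` that the comment in the listing asks for.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of the MMX loop: clamp every byte of each row to
// [cmin, cmax], then skip `modulo` padding bytes to reach the next row.
void limit_frame(std::uint8_t* p, int row_size, int height, int modulo,
                 std::uint8_t cmin, std::uint8_t cmax) {
    assert(row_size >= 0 && height >= 0 && modulo >= 0);
    // Counter kept in a local (a register in practice) instead of
    // decrementing the `height` variable in memory on every row.
    for (int rows_left = height; rows_left > 0; --rows_left) {
        for (int x = 0; x < row_size; ++x, ++p) {
            std::uint8_t v = std::min(*p, cmax);  // pminub: upper limit
            *p = std::max(v, cmin);               // pmaxub: lower limit
        }
        p += modulo;  // add ebx,ecx: step over the row padding
    }
}
```

Comparing the output of such a reference against the assembler version on a few test frames is a cheap way to catch mistakes in the optimized path.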

sh0dan
13th March 2003, 16:46
@ARDA: Guess what I noticed just now :) Fortunately it only has a slight impact on performance. As I wrote, the example is slightly modified for simplicity (leaving out prefetch, and ditching YUY2 mode).

@Kurosu:
If you feel like clarifying - please do.

I haven't experienced problems with having C-code after an MMX section. Of course you cannot rely on the registers to be maintained, but VC++ completely disables all optimizations if any function contains inline assembler. The convert functions I wrote recently have mixed C/assembler, and they work like a charm. They even contain MMX code within a while loop - without any register preservation. I know there has been talk about it - but I never experienced any problems.

Kurosu
13th March 2003, 16:59
@Shodan
1) I'm not really good at explanations, I fear. But does that mean I can edit your document, then?
2) It was only a question, as I still feel weak in (mmx) asm. (2 months experience, so far). But you have answered it anyway. :)