View Full Version : MxN DCTFilter using MMX asm optimization (developmnent stage)
redfordxx
3rd December 2007, 18:22
Hi,
is somewhere a code for DCT+IDCT 16x16 for MS VC 6.0 to use? I'd like to use it in my avisynth filter...
If it worked for values>255 or float (the former option is better), would be even better.
Second thing, somewhere a code to load quant matrix from a file must exist (or something similar).
Can someone tell me?
Thanx
R*
Fizick
3rd December 2007, 20:52
I tried to find integer 16 dct (MMX)without success.
but fftw library has float dct (any size)
IanB
3rd December 2007, 22:18
Dig very deap into DGDecode (Mpeg2Source) source.
Also XviD source.
Also troll the AMD site.
redfordxx
4th December 2007, 03:06
Dig very deap into DGDecode (Mpeg2Source) source.
Also XviD source.
Wow I thought MPEG2 or XviD work only with 8x8 DCT
IanB
4th December 2007, 04:08
It's not that great a stretch to double the number of points.
You could probably take the MMX code and re-do it in SSE to go from 8 to 16 elements per cycle.
redfordxx
4th December 2007, 09:10
You could probably take the MMX code and re-do it in SSE to go from 8 to 16 elements per cycle.Well, last time I saw the code I remember this part was asm.
Being real beginner in C, I am walking around anything in asm in big distance;-)
IanB
4th December 2007, 13:45
Well if you don't want it fast, then just use the fftw library, it will do 2**n by 2**m, just set n=m=4.
There is also the reference code in DGDecode it should be very easy to increase that to 16x16. If you replace the trig functions it uses with very fast approximations it can be made to walk at a passable speed. Have a look at the library at neuron2.net, I think there is a copy of the paper about it.
redfordxx
5th December 2007, 10:28
Well if you don't want it fast, then just use the fftw library.What's the difference between fast and slow? Rough estimation, I have no clue. 2x, 5x, 10x...?
IanB
5th December 2007, 13:01
At this point a 16x16 DCT is vapourware, a custom SSE2 routine could be fastest, but by how much, there is no way to know. The point is pretty mote, you have admitted you are unwilling to rework the current assembler to do 16x16. So go with what you are confortable with. If the end result is not fast enough, you have fall back positions.
The fftw library is pretty fast for a general purpose fft. Really good implementations are of order N*log(N) so I would expect a top notch 16x16 to be about 5.6 times slower than an 8x8
Fizick
5th December 2007, 19:03
redfordxx,
I was need in forward DCT for blocksize 16x16 (ond other sizes beside 8x8) in MVTools (in special dct mode). And I use fftw currently. You may compare speed yourself :) as well as get example code (under GPL).
IanB,
reworking 8x8 mmx to 16x16 code is not so easy IMHO :)
I considered it but even do not know how to begin.
There are several impementations with various tricks for speed and accuracy (rounding).
BTW, if somebody know 4x4 forward dct (MMX) it would be useful for MVTools
redfordxx
7th December 2007, 11:19
Hi guys,
I started to dig into other sources, I think, there will be success...but some things I just cannot get.
For example, this is my global.h shortened, #included in other files:
#ifndef _GLOBAL_H_
#define _GLOBAL_H_
#include <windows.h>
extern "C" void *_aligned_malloc(size_t size, size_t alignement);
extern "C" void _aligned_free(void *aligned);
#endif
// Ex-DctFilter
// define forward and backward dct functions and pointers, actual code version from Xvid project
#ifndef _FDCT_H_
#define _FDCT_H_
extern "C" {
typedef void (fdctFunc)(short * const block);
typedef fdctFunc* fdctFuncPtr;
extern "C" fdctFuncPtr fdct;
fdctFunc fdct_mmx;
fdctFunc fdct_altivec;
fdctFunc fdct_sse2;
//this line is a problem
fdctFuncPtr pfdct;
#endif
#ifndef _IDCT_H_
#define _IDCT_H_
typedef void (idctFunc)(short * const block);
typedef idctFunc* idctFuncPtr;
extern "C" idctFuncPtr idct;
idctFunc idct_mmx;
idctFunc idct_xmm; // sse verson
idctFunc idct_sse2;
idctFunc idct_altivec;
#define DO_ADJUST(pwa) __asm {
.......blah
}
}
#endif
Some thing I don't understand, for example, how is possible structure like this:#ifndef _FDCT_H_
#define _FDCT_H_
extern "C" {
...
#endif
#ifndef _IDCT_H_
#define _IDCT_H_
....
}
#endifI mean #ifndef #endif mixed with {}. But I can live with that, as long as they work.
But what does not work is
//this line is a problem
fdctFuncPtr pfdct;In DCTFilter, pfdct is defined inside each class. I wanted to make the variable global. But when I build, I got the error, that _pfdct was already defined in other objects. I thought to prevent this, #ifndef is there. All other variables and types are fine. Where is the problem?
gioowe
7th December 2007, 21:53
You cannot declare a global variable in a header file. Header files are just include files that are added into .c files. So if 2 .c files include this header then 2 .c files declare the same variable name and the linker cannot combine both together as it doesn't know which one is right.
You have to declare the global variable in one file and declare an external reference in the header.
extern fdctFuncPtr pfdct;
And in one and only one .c file you add
fdctFuncPtr pfdct;
Why do you need an 16x16 IDCT? By the way: You cannot use the optimized (AAN) 8x8 IDCT of Xvid and upscale it to 16x16. It is optimized and will result in a wrong transform. So simply execute the following equation
http://www.mathworks.com/access/helpdesk/help/toolbox/vipblks/ref/eqn1100535176.gif
from: http://www.mathworks.com/access/helpdesk/help/toolbox/vipblks/index.html?/access/helpdesk/help/toolbox/vipblks/ref/2didct.html
redfordxx
8th December 2007, 00:37
So simply execute the following equation
Hmm, the reason was that when I use FFTW I thought it would be faster than when I code it in C.
Now, I know the FFTW is in float.
As far as I can see, I will need two types of DCT:
BYTE[0,255]==>FDCT==>here_something_to_do==>IDCT==>short==>something==>BYTE[0,255]
or
WORD[0,65535]==>FDCT==>here_something_to_do==>IDCT==>int==>something==>BYTE[0,255]
So, when I do all calculations scaled to integer and precalculate the coefficients, is it worth trying? Or the fftw will be faster even if it is real,float or whatever.
If I am at least roughly correct, using __int32 for calculation will allow me for example scaling coeffs [-1,1]->[-2048,2048] for DCT of 16x16 [0,65535] values. Considering the result will end in BYTE...
About the precalculation: it will be (M*N)^2 coeficients and during the process there will be appropriate number of multiplications and additions both forward and backward and finally shifts for downscaling...
so for 16x16....65536
Just naive question: Is there any faster way to multiply lot of numbers in some array in memory than just going through with for-cycles?
gioowe
8th December 2007, 01:35
Well, you have to do the FDCT in YUV colorspace first. Then these values are normally in range 16..241 (give or take).
If you don't use SIMD instructions then it'll be best to perform integer scaling to 32bit, preferably signed. But consider overflows so don't scale up to high. 2^12 is a good approach. And then you use constants to get rid of all trigonometric functions. On a 32-bit machine 32-bit multiplications are as fast as 16-bit multiplications (indeed they are faster due to skipped 16-bit alignment/scaling issues).
Another important point is loop unrolling. Conditional jumps are nightmares on current processors. If incorrectly predicted they cost up to 42 cycles where you could have done the same amount of add/mul operations. So don't 'if' in loops and perform loop-unrolling. Like
for (int i = 0 ; i < 16 ; ++i)
{
x[0] = y[0] * 2;
x[1] = y[1] * 2;
x[2] = y[2] * 2;
x[3] = y[3] * 2;
} instead of for (int i = 0 ; i < 16 ; ++i)
{
for (int j = 0 ; j < 4 ; ++j) x[j] = y[j] * 2;
}
akupenguin
8th December 2007, 01:37
SSE2 and C optimized int16 and float 4x4, 8x8, and 16x16 DCTS for x86_64: http://akuvian.org/src/dcts-r65.tar.gz
It shouldn't be hard to port it to x86_32, but I haven't done so as I don't have any 32bit computers and only coded it for my own research purposes.
gioowe
8th December 2007, 01:40
Just naive question: Is there any faster way to multiply lot of numbers in some array in memory than just going through with for-cycles?
Yes, SIMD instructions as found in SSE1,2,3,4a. But for that you need to code in assembler. Current compilers are not "intelligent" enough to analyze code and generate efficient SSE instructions. Incorrectly used they can easily decrease performance. With SIMD instructions you can perform multiple operations (like add, mul) at the same time at the cost of having to shuffle/pack data round. Operating values is the easy part, getting your data to fit using that operations is the really hard one.
gioowe
8th December 2007, 01:45
SSE2 and C optimized int16 and float 4x4, 8x8, and 16x16 DCTS for x86_64: http://akuvian.org/src/dcts-r65.tar.gz
It shouldn't be hard to port it to x86_32, but I haven't done so as I don't have any 32bit computers and only coded it for my own research purposes.
It actually might be. First it requires nasm and second it uses 16 xmm registers. 32-bit CPUs only have 8.
akupenguin
8th December 2007, 01:49
Sure, porting requires reallocating registers. That's what I meant by "porting" as opposed to just recompiling.
redfordxx
8th December 2007, 03:32
Well, you have to do the FDCT in YUV colorspace first. Then these values are normally in range 16..241 (give or take).I want to make kinda weighted quantization. I multiply weight clip and original DCT-quantize-IDCT and divide with quantized weightclip only. I am not perfectly sure of the result but I hope;-)
the multiplication of the clips is the reason of [0,65535]
If you don't use SIMD instructions then it'll be best to perform integer scaling to 32bit, preferably signed. But consider overflows so don't scale up to high. 2^12 is a good approach. And then you use constants to get rid of all trigonometric functions.
On a 32-bit machine 32-bit multiplications are as fast as 16-bit multiplications (indeed they are faster due to skipped 16-bit alignment/scaling issues).
Another important point is loop unrolling. Conditional jumps are nightmares on current processors. If incorrectly predicted they cost up to 42 cycles where you could have done the same amount of add/mul operations. So don't 'if' in loops and perform loop-unrolling. Like
for (int i = 0 ; i < 16 ; ++i)
{
x[0] = y[0] * 2;
x[1] = y[1] * 2;
x[2] = y[2] * 2;
x[3] = y[3] * 2;
} instead of for (int i = 0 ; i < 16 ; ++i)
{
for (int j = 0 ; j < 4 ; ++j) x[j] = y[j] * 2;
}
Does it make big difference constants vs. precalculated variables? Because when N,M come as parameters for the filter, only then I can calculate the trigs...
Same for cycles...I can unroll only if I know M,N...if I understood your point.
But maybe I can unroll it like for main part and then I do the rest:
MN=M*N;
cycles=MN>>5;
cyclesMN=cycles<<5;
for (int i = 0 ; i < cyclesMN ; i+=32)
{
x[0+i] = y[0+i] * 2;
x[1+i] = y[1+i] * 2;
x[2+i] = y[2+i] * 2;
...
x[31+i] = y[31+i] * 2;
}
for (int i = cyclesMN ; i < MN ; ++i)
{
x[i] = y[i] * 2;
}
or maybe
for (int i = 0 ; i < cyclesMN)
{
x[i++] = y[i] * 2;
x[i++] = y[i] * 2;
x[i++] = y[i] * 2;
...
x[i++] = y[i] * 2;//32
}
instead the unrolled part. I just don't know how big nightmare is it and how much care of it.
BTW: is the ++ on correct place?
gioowe
8th December 2007, 05:26
Well, you know M and N. They are 16. Otherwise your first version is the better one.
++i is pre-increment
i++ is post-increment
i = 5
a = ++i means a = 6, i = 6
a = i++ means a = 5, i = 6
Don't do x[i++]. This is another addition. (Actually it should be x[i] = y[i++] * 2). Simple table lookup addition is without cost, it is handled by the memory address generator, so do x[i], x[i+1], x[i+2], ..., x[i+31], x += 32. This is only ONE extra addition.
gioowe
8th December 2007, 05:28
/* cycles unused */
cyclesMN = MN & ~31;
;)
Fizick
8th December 2007, 14:56
So, when I do all calculations scaled to integer and precalculate the coefficients, is it worth trying? Or the fftw will be faster even if it is real,float or whatever.
Probably fftw will be faster if you do not use SIMD (SSE,MMX) in your code.
(but use special fast int to float routine instead microsoft, see MVTools again)
redfordxx
8th December 2007, 20:17
Well, you know M and N. They are 16. You know, when I do it, then for all N generally, if the the speed does not increase significantly because of not using constants;-)
redfordxx
9th December 2007, 01:04
Hmm...starting to have something working... slow of course.. I will apply the advices above now, but..
there is surely some tool in VS6 to find out which lines cost me most of the time...can you give direction?
redfordxx
9th December 2007, 01:13
Yeah, one more thing, when I choose optimize for speed in the project release settings, that's all I can do in this case or there are some other switches I should change?
Fizick
9th December 2007, 10:07
VC6 professional (or may be enterprise ?) has profiler tool. (I never use it).
Other profilers are VTune from Intel and CodeAnalyst (AMDCodeAnalyst) from AMD (I used it sometimes).
I can say that it is hard work, you neet big experience for hand optimize, and you do not get many without MMX. :)
http://avisynth.org/mediawiki/Filter_SDK/Assembler_optimizing
redfordxx
9th December 2007, 13:29
http://avisynth.org/mediawiki/Filter_SDK/Assembler_optimizing
Thanks
so I tried for fun co convert inner part of my loops to asm:start1=(int)pTrigs+ij*size2d*4;
start2=(int)pWorkInput;
_asm {
push eax //sum
push ebx //multi
push ecx //adr trig
push edx //adr work
push esi //counter
mov esi, _size2d
mov ecx, start1
mov edx, start2
mov eax, 0
add esi, -4
align 16
LoopToop:
mov ebx, dword ptr [ecx+esi]
imul ebx, dword ptr [edx+esi]
add eax, ebx
add esi, -4
jns LoopToop
mov temp,eax //result
pop esi
pop edx
pop ecx
pop ebx
pop eax
}
I did it just by learning from reading other sources... Hm, it's slower than compiled C at the moment. I assume it's because compliler cannot optimize my asm together with C code. But will improve I hope after incorporating outer loops.
It is just to multiply two int number arrays of size _size2d in memory locations start1 and start2.
Now questions:
- I heard about MMX but afaik there are multiple byte or word operations but I need signed dword operations. Is also available?
- Is there any way to at least roughly spot area in the translated code where is my particular command located...so I can see how it was translated?
- I cannot find how to enable viewing MMX registers in VC6...I see only the normal ones...Advice?
- Where can I read fast which registers have which purpose, so I don't mess something by using wrong reqister.
- I should use push, pop instructions on all registers I use in each _asm block?
- Any recommendation for good online asm language reference? With and index or search tool, so when I want to know what exactly does certain instruction I find it there..
Fizick
9th December 2007, 15:27
1. try use words instead of doublewords
2. listing
3. do not know, probably no way in VC6
4. probably free use any registers eax ebx ecx edx
5 no, compiler do it itself for registers above (almost always, ask IanB for ebx bug ;)) but not esi, edi
I may be wrong (remember who know...) ;)
6 you may got nasm assembler manual. It contains reference for all assembler instructions.
http://nasm.sourceforge.net/
BTW, recenly nasm 2.0 was released with x64 support.
But Appendix B (x86 Assembler instructions reference) was removed in v.2.0.
Download older v0.99.02 doc
Thera are many other articles at inet
redfordxx
9th December 2007, 17:05
1) when I have DCT from [0,65535] I don't think I can scale to words...
3) last year before I reinstalled the comp it was there, and now not (I think so at least)
IanB
9th December 2007, 22:41
@redfordxx,
You are on a hiding to nothing with the ASM above. Modern processors have parallel execution i.e. a core can do 2 things at once if the dependancies in the thread allow it. i.e. add eax, edx can happen at the same time as add ebx, edx but not add eax, ecx because it has to wait for eax to get the answer from the first add. Your code above stalls like this on every instruction.
Also you have lot of entry and exit code that will get recalled many times by the outer loop. And you scan memory backwards, this will destroy any L2 cache prefetch the CPU may attempt.
Modern compilers are pretty smart. Express your algorithm as cleanly as you can, compile it and ask for an assembly listing. Look at the generated code for constructs you can see cause dumb code to be emitted. Express that part of your algorithm differently, see if the compiler does a better job. Loop!
Most dumb code is a result of too many variables active at once, a cleaner re-expression of your algorithm usually can work wonders.
The code above would have been like this int *a=pTrigs+ij*size2d*4;
int *b=pWorkInput;
int sum = 0;
for (int i=0; i<size2d; i++) sum += a[i] * b[i];I doubt you will be able to write better code than the compiler for this example particularly if there is an outer loop to also consider.
Also when using MMX there are compromises. Data is mostly either 8 or 16 bits. There are exception like PMADD which does L32=a16*e16+b16*f16 | R32=c16*g16+d16*h16
Just get your plugin working first, then worry about getting it faster. Once you have working code others can review it and offer suggestions.
redfordxx
9th December 2007, 22:52
OK...I am doing it with words. I "discovered" it is possible when learned more instructions;-)
Now I stuck in following situation:
At the end of the cycle I have in one mm registry 2 signed doublewords. I need to add them to each other, scale down by let's say 10 bits and put the result into short variable/address.
Any advice?
IanB
9th December 2007, 23:10
punpckldq mm1, mm0
paddd mm0, mm1
psrlq mm0, 10+32
movd dword ptr[x], mm0
redfordxx
9th December 2007, 23:12
ask for an assembly listing.
That was my question before already...how?...the only thing it know is to debug assembly window and when it is compiled for release, I don't know how to find the area of the code desired, because it lists whole memory or what.
There are exception like PMADD which does L32=a16*e16+b16*f16 | R32=c16*g16+d16*h16
Hour before I discovered PMADD...thats why I have 2 doublewords...;-)
And you scan memory backwards
Yeah...jns was the surest thing for me;-)
redfordxx
9th December 2007, 23:23
thanks
psrlq mm0, 10+32
and what happens to the sign bit?
Fizick
10th December 2007, 00:38
redfordxx,
As I wrote above, try find "Listing Files" options in VC in listbox on C/C++ tab of project settings ;)
squid_80
10th December 2007, 03:28
and what happens to the sign bit?
If they've been scaled properly the sign won't change.
IanB
10th December 2007, 05:01
Unfortunatly the is no psraq only psra[bwd] variants, so you need to adust my sample code if you need to do signed.
You were warned, MMX has compromises.
Also the sample I gave suffers badly from V pipe stalling, normally I would find other instructions to interleave into the code stream to keep both pipes busy. Well written MMX uses both processing pipes to the max. Poor MMX has every step dependant on the previous and runs at 50% potential speed.
redfordxx
10th December 2007, 10:01
If they've been scaled properly the sign won't change.Can you be more specific?
redfordxx
10th December 2007, 10:30
Also the sample I gave suffers badly from V pipe stalling, normally I would find other instructions to interleave into the code stream to keep both pipes busy. Well written MMX uses both processing pipes to the max. Poor MMX has every step dependant on the previous and runs at 50% potential speed.
Ok so taking care of this (http://avisynth.org/mediawiki/Filter_SDK/Instruction_pairing), right? I will rearange data in memory, so I can do better;-)
This above is the most important, right, but how is it about speed of memory access?
Is
movq mm0,mm1
faster than
movq mm0,[eax]
faster than
movq mm0,[eax+8]
faster than
movq mm0,[eax+ebx]
faster than
movq mm0,[eax+ebx+8]
faster than
add eax,ebx
movq mm0,[eax]?
How many cycles or delay or whatever?
Similarily, is
mov ecx,edx
faster than
push ecx
or
pop ecx
faster than
mov ecx,[eax]
faster than
mov ecx,[eax+4]
faster than
mov ecx,[eax+ebx]
faster than
add eax,ebx
mov ecx,[eax]?
How many cycles or delay or whatever?
IIRC, I saw somewhere, pmadd takes 3 cycles and paddd takes 1?
redfordxx
10th December 2007, 10:43
And you scan memory backwards, this will destroy any L2 cache prefetch the CPU may attempt.I listed the assembly (thanx, Fyzick). For scanning the memory, there are two counters...one increasing(add 4) to scan the memory and one decreasing(dec) to check the end of cycle...so this is the best way for me,I assume...(except I do add 8 or 16, depending on your above post answer...maybe some combination with lea will improve something...?)
Leak
10th December 2007, 10:55
I'd also recommend having a look at the PDFs found on Agner Fog's page (http://www.agner.org/optimize/) about optimizing assembly.
sh0dan
10th December 2007, 12:20
This above is the most important, right, but how is it about speed of memory access?
Actual cycle count varies from processor to processor, just as the pairing capabilities of different processors vary.
To answer your question:
* Moves between registers (of same type) are faster than reads from memory (even cached data).
* There is no penalty for complex address references. [edi*2+4] is same speed as [edi].
AMD's Optimization Guide lists cycle counts, and should give you a general idea of how instructions compare. There are small differences to Intel systems - mainly faster SSE2, but code that is faster on one system is generally faster on both.
90+% of all systems today have SSE2, so I usually do an SSE2 version, and a fall back to C code for the rest.
But feel free to ask.
redfordxx
10th December 2007, 13:30
In the most inner loop I will use only movq, paddd, pmadd. can I pair pmadd in V-pipe with one or more instructions (movq,paddd) in U-pipe?
squid_80
10th December 2007, 15:07
Can you be more specific?
It's a bit hard to explain in text on a forum. But in this case you've got a 16-bit value, scaled up by 10 bits = 26 bits total. So as long as your previous calculations haven't produced a result that has exceeded 25 bits you're not going to lose the most significant bit (hence the sign).
redfordxx
10th December 2007, 15:34
Thanks --- the Agner Fog's and AMD docs I will read in the evening...
Here is my current idea of the most inner loop in my code. What do you think about the pairing?
Assume that the algo and instructions are correct...but there can be change in the order of instructions, addressing method and loop control/decision. (*) For example is there gain in using the AddressCounter somehow also for loop control?
//version pairing ... in one loop 6memory reads and one AddressCounter increase...means minimum 7pairs...+loop control decision at the end
mov eax, AddressSrc
mov ebx, AddressTrigs
mov ecx, NumberOfLoops
mov edx, 0 //AddressCounter
//----- intended pair delimiter
movq mm0, [eax] //U pipe (load 2+2 values0 src)
pxor mm6, mm6 //v pipe (sum0=0)
//-----
movq mm1, [ebx] //U pipe (load 2+2 values00 trig)
pxor mm7, mm7 //v pipe (sum1=0)
LoopLabel:
//-----
movq mm2, [ebx+edx*2+8] //U pipe (load 2+2 values01 trig)
pmaddwd mm1, mm0 //V pipe (multiply src0*trig00 for 1+1 sum0)
//-----
movq mm3, [ebx+edx*2+16] //U pipe (load 2+2 values10 trig)
pmaddwd mm2, mm0 //V pipe (multiply src0*trig01 for 1+1 sum1)
//-----
movq mm0, [eax+edx+8] //U pipe (load 2+2 values1 src)
pmaddwd mm3, mm0 //V pipe (multiply src1*trig10 for 1+1 sum0)
//-----
movq mm4, [ebx+edx*2+24] //U pipe (load 2+2 values11 trig)
paddd mm7, mm2 //V pipe (add to 1+1 sum1)
//-------
add edx, 16 //Increase AddressCounter
paddd mm6, mm1 //U pipe (add to 1+1 sum0)
pmaddwd mm4, mm0 //V pipe (multiply src1*trig11 for 1+1 sum1)
//-------
movq mm0, [eax+edx] //U pipe (load 2+2 values0 src)
paddd mm6, mm3 //V pipe (add to 1+1 sum0)
//-----
movq mm1, [ebx+edx*2] //U pipe (load 2+2 values00 trig)
paddd mm7, mm4 //V pipe (add to 1+1 sum1)
//-----
dec ecx
jnz LoopLabel
//here some downscaling mm6 and mm7 and packing to mm7 to be decided later
mov edx, AddressDst
//-----
movq [edx], mm7 //U pipe (saving all 4 values)
This does 4 cycles from previous page calculating values from 2 loops on last page per one loop...on rearranged data...
It will have one useless pair at the end of last loop (prereading the data to mm0 and mm1 for next loop which won't happen.
[EDIT] (*) I already see...negative memory offset and one increasing counter for addressing and loop control;-)
One worry: If between add,sub,dec,inc and jnz are few mmx instructions, will jnz still work? Like in example:
//here is edx negative
add edx, 16 //Increase AddressCounter
paddd mm6, mm1 //U pipe (add to 1+1 sum0)
pmaddwd mm4, mm0 //V pipe (multiply src1*trig11 for 1+1 sum1)
//-------
movq mm0, [eax+edx] //U pipe (load 2+2 values0 src)
paddd mm6, mm3 //V pipe (add to 1+1 sum0)
//-----
movq mm1, [ebx+edx*2] //U pipe (load 2+2 values00 trig)
paddd mm7, mm4 //V pipe (add to 1+1 sum1)
//-----
jnz LoopLabel
IanB
10th December 2007, 21:20
If between add,sub,dec,inc and jnz are few mmx instructions, will jnz still work? Yes, MMS/SSE intructions generally do not effect the flags. Sprinkling the loop control code in between awkward stall point is a good technique.
redfordxx
10th December 2007, 21:45
It's a bit hard to explain in text on a forum. But in this case you've got a 16-bit value, scaled up by 10 bits = 26 bits total.No...I have 4 normally signed doubleword in two registers. So I see two options:
- add something then shift then subtract something else
- arit shift and then "and" with mask
redfordxx
11th December 2007, 03:27
Wooow....12fps (pure C++) --> 100 fps (inner loop MMX)
(approx)
...and some more to come
redfordxx
11th December 2007, 03:44
How did I solve the downscaling:
Somewhere in the beginning:
unsigned __int64 i128SignMask = 0x8000000080000000;
_asm {
movq mm0, i128SignMask
movq mm1, mm0
psrad mm0, scale
pxor mm1,mm0
movq i128SignMask, mm1
}
And the shifting/downscaling during the processing://here some downscaling mm6 and mm7 and packing to mm7
//-----
movq mm5, i128SignMask
psrad mm6, scale
//-----
mov edx, CurrentDst
pxor mm6, mm5
psrad mm7, scale
//-----
pxor mm6, mm5
//-----
packssdw mm7, mm6
//-----
movq [edx], mm7 //U pipe (saving all 4 values)
redfordxx
11th December 2007, 20:34
* Moves between registers (of same type) are faster than reads from memory (even cached data).So, what is faster?
movq mm0, [eax]
movq mm1, [eax]
ormovq mm1, [eax]
punpcklbw mm0, mm1
pxor mm0, mm0
punpckubw mm1, mm0
don’t worry about pairing, I have other instructions to interleave.
AMD's Optimization Guide lists cycle counts, and should give you a general idea of how instructions compare. Well, there is written latency... I didn't understand the x-y formula...but can I take it as "some unit of time?"
90+% of all systems today have SSE2, so I usually do an SSE2 version, and a fall back to C code for the rest.AthlonXP here;-)
IanB
11th December 2007, 22:20
You probably meantmovq mm0, [eax]
...
movq mm1, mm0On a P4pxor mm0, mm0
movq mm1, [eax]
...
por mm0, mm1is faster, on most other machines the difference is indeterminate (other factors prevail)
You really need to test and measure a stream of code for speed, out of order execution make absolutes difficult to predict. Also don't worry to much about lock step pairing your instruction, instructions all have different latency and execution times, as long as there are at least 2 loosly independant flows through most of your code the chip will keep itself busy.
A convention to map the flows and stall points in your source is to indent one flow by a spacemovd mm0, [eax]
movd mm1, [eax+4]
punpuklbw mm0, mm7
punpuklbw mm1, mm7
pmaddwd mm0, mm6
pmaddwd mm1, mm6
paddd mm0, mm1 ; stall
add eax, 8
packssdw mm0, mm7
movq [eax-8], mm0 ; stall
jmp loop
redfordxx
12th December 2007, 01:39
I better say the idea behind:
I process the bytes from video multiple times so the question is
1- copy to work area unpacked to words and then process (processing 8 bytes require 2 mm reads --- first example)
2- copy to work area as it is and unpack during processing
(processing 8 bytes require 1 mm read but needs unpacking --- second example)
More and more I think number one is better...
About pairing:
Will be 2 3-cycles-long-instructions, paire with 3 2-cycles-long-instructions and all together (if lucky) processed in 6 cycles (if all rules fulfilled)?
And, looking at the Appendix F tables, little confused about some things...but what is important to me about pipes:
If I want to combine two instructions they have to have together two pipes (FSTORE, FMUL or FADD). So at least one instruction has to have more than one pipe possible, eg. FMUL/FADD, or every instruction has to have different pipe, eg. FSTORE for one and FMUL for the other.
If that is correct, two PADDD instructions can be processed simultaneously...which is not what I thought....so....am I correct?
redfordxx
12th December 2007, 11:21
Is there any worse performance when reading 4 bytes aligned memory instead of 8 bytes aligned?
It would be cycle containing some 25-30 pairs of mmx ops with about 20 reads of memory, where 2 of them would be dword aligned. If it will decrease performance, I can rearange data before the cycle…but I don’t know if it is worth it.
Other thing: which more registers than eax-edx can I use? (To control cycles mainly)
Huh, few days ago I said I don’t want anything with asm, and now…;-)
IanB
12th December 2007, 11:32
Whether to unpack into a temp buffer or not depends on the algorithm. The punpuk instruction can tend to stall waiting for the shifter. You probably need to code up both approaches and time them.
As a rule of thumb if you need the data unpacked from the temp buffer 3 times or more then unpacking to the temp buffer will mostly be faster. Unfortuantely the converse is unpredictable, even for a single 1 off access it might be faster or it might not. You really need to measure the speed.
About pairing :- If you are talking about a P3 then you can approximately make that statement. With a P3 the amount of out of order possible is limited enough for us to follow. With any of the later chips the level of out of order processing is astounding, so worrying about what can truely actually be paired become somewhat mote, mearly making the effort to provide 2 streams of code interlaced as best we can is usually enough.
Remember the pipes can be the combined length of both pipes apart in actual execution.
And yes a PADD can happen is either pipe simultaneously.
:Snap:Edit:
If the data is aligned a movd and movq are about the same speed, so the movq moves twice as much data for the same clocks. But if you end up adding extra instructions to unpack the data you may loose that advantage. Count the cycles for both code alternatives. And don't forget to include any overhead in getting data into the right shape for each approach.
And you can use all the registers in your code, just put them back at the end.
redfordxx
12th December 2007, 14:31
Whether to unpack into a temp buffertemp buffer…you mean memory?
If the data is aligned a movd and movq are about the same speed, so the movq moves twice as much data for the same clocks. But if you end up adding extra instructions to unpack the data you may loose that advantage. Count the cycles for both code alternatives. And don't forget to include any overhead in getting data into the right shape for each approach.
To explain better the movq issue (there won't be any movd):
I have data in memory:
aaaa|bbbb||cccc|dddd||eeee|ffff||
First I wanted (process each dword hi and low:
movq mm0, [eax]
…process
pswapd mm0
...process
add eax, 8
;loop cycle here
but pswapd is
a) AMD specific…and maybe other users will use my filter
b) it seems to me that MS VC6 does not accept 3DNow!
So other options are:
1. replace pswapd with mmx code
movq mm0, [eax]
…process
movq mm1, mm0
punpckldq mm0, mm0
punpckhdq mm1, mm0
...process
add eax, 8
;loop cycle here
no memory access but might be slow…it takes some 6 cycles
2. maybe prepare data like this:
ffff|aaaa||bbbb|cccc||dddd|eeee||ffff|0000||
and read it like this
movq mm0, [eax] ; here is the 4-aligned read every second loop
...process
add eax, 4
;loop cycle here
3. maybe prepare data like this:
aaaa|bbbb||bbbb|aaaa||cccc|dddd||dddd|cccc||eeee|ffff||ffff|eeee||
and process like
movq mm0, [eax]
…process
movq mm0, [eax+8]
...process
add eax, 16
;loop cycle here
It seems to me that 2 and 3 could be faster than 1.
2. requires less preparation time than 3, less memory but there are 4-only-aligned movq reads mixed with 8-aligned movq reads on same data.
3. requires double memory of 2, more prep time but there are only 8-aligned movq reads
Sooner in this thread there was commment about conditional jumps and nightmares. When I use the jnz, is it that slow that it is worth unrolling the cycle 2 or 4 times (and therefore risk of uselessly processing zeroes if there is odd number of steps? (which is not probable when speaking about 2D video data…)
Currently, I have the approach to prepare block of data into working area process and put back.
Maybe faster would be prepare full plane, process and put back.
That would be eg. on full HD video 4MB to 5.3MB depending on alignment requirements.
If I will have 2, maybe 3 input clips in later version of this filter it will make 12MB of data. Output will be packed to bytes probably so it makes another 2-3MB.
Is it within "good manners" to require 15MB memory for one filter?
sh0dan
12th December 2007, 15:41
* Pairing.
U/V pipes are for scalar code, and is something from the Pentium age. Now there are usually 3 scalar pipes. There is a
* Timing: Moves from memory have at LEAST a 4 cycle latency (if reading from same cacheline as previous reads), usually much more, something like 12 from cache level 1. MMX multiply has 3, most others have 2, but allocate either the FADD and/or FMULL pipepine, or shifter on Intel.
Since timing is immensely complex, and individual operation timing varies a lot, just try not to make your code too dependent. If so, the processor has a much freedom, and will look ahead for unblocked instructions to execute while a read stall is blocking one unit.
* Temp buffers are only a good idea if you can do it on a limited set of data and keep data in the cache. The resizer and planar converter in 2.6 unpacks one line, that will fit nicely into cache, and converts this line.
If your filter requires more lines (2 or 3), you just unpack the needed numer of lines and unpack one line for each additional line you need as you process downwards. This also makes in-place processing a lot easier, since you can write to the original image and still have the original picture data.
* pswapd can be done by using pshufw mm0,mm0,01001110b
* No it is not good manner to allocate 15MB, unless strictly needed.
redfordxx
12th December 2007, 16:44
Temp buffers are only a good idea if you can do it on a limited set of data and keep data in the cache.Not sure whether I am missing something, but I prepare data not to have it fitting into cache but to have it packed and aligned.
* pswapd can be done by using pshufw mm0,mm0,01001110bOoops, something new…I have to study up. So when I want to produce code (with Visual C++ 6.0) which is good for most machines today and my AthlonXP too, I should study up MMX, MMXExt, SSE…nothing more…?
* No it is not good manner to allocate 15MB, unless strictly needed.OK
sh0dan
12th December 2007, 19:11
Uncached memory reads are very expensive, keep your temp buffer well below 64k (L1 size of Athlon XP), otherwise unaligned reads are very much preferable.
For not, 99% of all processors will have MMX+iSSE (MMXExt), so go for that and have a C fallback for the rest.
A misaligned read isn't that bad, but try to avoid it, if possible. Aligned reads are nice, but not essential. The horizontal resizer has lots of unaligned reads, but it is still very fast.
redfordxx
12th December 2007, 19:55
pshufw mm1,mm1,01001110bcauses
Compiler Error C2400
inline assembler syntax error in 'context'; found 'token'
The given token caused a syntax error within the given context.
Specifying a Pentium instruction can cause this error. Choosing the Pentium option (/G5) causes the compiler to generate instruction sequences optimized for the Pentium, but does not allow instructions specific to the Pentium.
Is it possible that VC++ 6.0 does not know SSE or it is about the options...? I dont understand much of the error message.
sh0dan
13th December 2007, 10:58
VC6 SP5 shouldn't have problems with SSE1 or SSE2. Have you installed SP5?
Maybe you should consider upgrading to VC2005 Express. It's free, and fixes a lot of VC6 problems, plus VC6 is EOL and more than 10 years old.
redfordxx
13th December 2007, 17:50
VC6 SP5 shouldn't have problems with SSE1 or SSE2. Have you installed SP5?
Maybe you should consider upgrading to VC2005 Express. It's free, and fixes a lot of VC6 problems, plus VC6 is EOL and more than 10 years old.
Thanks for reminding.
Is there anything I should fear in the newer version?...AthlonXP 2600+ MS WinXP and C-beginner here ;-)
I checked MS page...there is already 2008 to download... is there any particular reason why you mentioned 2005?
My old code will be accepted smoothly, right?
Fizick
13th December 2007, 19:45
For VCToolkit2003 you need in .Net1.1. You can integrate it in old VC6 shell with some tricks (dirs in path).
For VC2005 you need in .NET2.0 to run,
For VC2008 you need in .NET3.5 (?) to run it. Not so many people tried it (me not). Generally, it is the same as 2005. IMO it is better to wait some time for hotfixes (SP1 ?) and use 2005 SP1 now. :)
But you did not answer about SP5 and processor pack. :)
redfordxx
13th December 2007, 21:13
For VCToolkit2003 you need in .Net1.1. You can integrate it in old VC6 shell with some tricks (dirs in path).
For VC2005 you need in .NET2.0 to run,
For VC2008 you need in .NET3.5 (?) to run it. Not so many people tried it (me not). Generally, it is the same as 2005. IMO it is better to wait some time for hotfixes (SP1 ?) and use 2005 SP1 now. :)
But you did not answer about SP5 and processor pack. :)Where can I find? It is usually in the about box, but there is nothing...or it means that there are no packs installled;-)
OK...I go for 2005, hope I find offline install somewhere...
redfordxx
14th December 2007, 15:50
OK, I am on VC2005... according to AVSTimer it seems to perform slower...could you advice the max optimization settings for the release configuration? (if there is something else than Maximize Speed (/O2)...
foxyshadis
14th December 2007, 19:41
Target Prescott or Core2 to generate appropriate SIMD, enable Link Time Code Generation (or even the profiling). If there's a lot of floating point, maybe change that to fast. Other optimizations give only very minor benefits, like FPO.
redfordxx
14th December 2007, 20:00
Target Prescott or Core2
1) How?
2) I have AthlonXP...doesn't matter?
redfordxx
14th December 2007, 20:07
Now there are usually 3 scalar pipes. There is aDoes it mean, that it is best to prepare not 2 but 3 flows of instructions?
redfordxx
14th December 2007, 20:17
When the code reads last pixel of the plane (srcp+pitch*height-1) using mov, movd, or movq the reading of last bytes will happen outside the allocated memory...may it crash?
Same for writing, but here I see that it probably will...
redfordxx
14th December 2007, 23:51
Thank you all for providing me the helping info...so I thought, maybe I should tell you what I am working on so that you know where is your effort going.
Some might know that I want to succeed to make my SmoothDeblock filter...but that's a distant target.
Now I am warming up with upgrading TBarry's DCTFilter...because it will be the key for the Smoothdeblock anyway.
Changes so far (working and hope bug free):
+any dimension of DCT (well, too big takes really time) --- if you like 1x1;-), 20x20 or 100x1, no problemo
+any dimension of clip, independent on DCT size --- no error messages, the borders just will not be processed
+possibility to shift the DCT blocks
+minor other features
+backward compatible with original DCTFilter (parameters format)
-I didn't bother with YUY2
-as it is general-size and not very optimized, it is about 4x slower than original DCTFilter at 8x8 at the moment --- however, at 8x8 it swithces to the original routine unless specified else
ToDo features:
+load quant matrix from file
+add weighted DCT mode (based on input mask clip)
+add variable quant DCT mode (based on input mask clip)
ToDo optimizations:
+improve core algo, then optimize
+optimize the transfer of data to work area and back
It's working now, and sooner or later I plan to publish it...
Should I start thread in Avs usage or Avs dev forum?
redfordxx
15th December 2007, 02:03
Back to bug you with questions...;-) (sorry)
* I'd like to add the quantization based on matrix. Probably somewhere is already existing GPL code to load the matrix from a file. Can you recommend me where can I get it?
* Is it possible in asm do something like switch?
What I mean: I have a number 0-10 in register eax and the code would be something like
mov ebx, -1
jump to address [EIP+2+3*EAX] ;if 2 would be the lenght of the jump instruction and 3 lenght of the add instruction for example
add ebx,1
add ebx,1
add ebx,1
add ebx,1
add ebx,1
add ebx,1
add ebx,1
add ebx,1
add ebx,1
add ebx,1
add ebx,1
; here the value of EBX would be same as the EAX
This code makes no sense but I hope you know what I mean.
redfordxx
15th December 2007, 04:54
And you can use all the registers in your code, just put them back at the end.Mhm, after some tough moments I realized that using ESP and especially EBP is road to hell.
Any other restrictions?
Dark Shikari
15th December 2007, 04:54
Mhm, after some tough moments I realized that using ESP and especially EBP is road to hell.
Any other restrictions?You should have no problem using ebp if you use --fomit-frame-pointer when compiling.
Fizick
15th December 2007, 09:57
--fomit-frame-pointer is flag for GCC, but for VC it is /Oy
It is defuult with full optimization /Ox
Too many questions. Are most of them obsolete now? :) Anyway, they are useful for future beginner asm developers (like me). May be add to the thread title ("16x16 code to use with assembler optimisation")
'any dimension of DCT' is interesting, but big slowdown is not.
For really fast DCT you must use FAST DCT algo with butterfly and other more complex tricks for power of 2 sizes (4,8,16, 32,) and probably 3,9,27.
redfordxx
15th December 2007, 19:07
--fomit-frame-pointer is flag for GCC, but for VC it is /Oy
It is defuult with full optimization /Ox
Does not help...:-(
Too many questions. Are most of them obsolete now? :) No...
'any dimension of DCT' is interesting, but big slowdown is not.
For really fast DCT you must use FAST DCT algo with butterfly and other more complex tricks for power of 2 sizes (4,8,16, 32,) and probably 3,9,27.well, the thing is, that I was not focusing much on the algo and more on the learning coding.
Instead of the three calculations: DCT-Quantization-IDCT I made it one step together. I precalculated coeficients C_u*C_u*C_v*C_v*Q_uv*COS_iu*COS_xu*COS_jv*COS_yv. However I think, incorporating Q_uv (quant matrix coef) removed the posibility so separe calculation...I didn't do research for other algo options yet.
The rough performance on 720x480 (there can be some influences on my PCSize My algo, FFTW, Int8x8
8x8 65 150 250
16x16 14 14
FFTW is not implemented...I calculate it as 2xtime for the DCT in your DCTFFTW class
sh0dan
15th December 2007, 19:38
Does it mean, that it is best to prepare not 2 but 3 flows of instructions?
For non mmx (scalar) instructions, yes. Scalar optimizations are however IMHO mostly a waste of time, since it mostly doesn't bring much over nicely written C-code.
Fizick
15th December 2007, 19:57
Does not help...:-(
?
__asm
{
mov esi, param1; got params
push ebp ; store ebp
...; any change ebp
pop ebp ; restore ebp and stack
}
foxyshadis
15th December 2007, 22:09
1) How?
2) I have AthlonXP...doesn't matter?
I forgot, that's only with ICL (Intel compiler), sorry. VS only gives you SSE and SSE2. And yeah, you can't target SSE2 and up if you have an XP. :(
A generic switch is like this (http://www.eventhelix.com/RealtimeMantra/Basics/CToAssemblyTranslation3.htm), but a dirt simple switch should be calculable like:
mov eax, -1
mov ebx, 6 ; this being your jump-in value
add ebx, jumper ; inc is single-opcode, no multiply?
jmp ebx
jumper:
inc eax
inc eax
inc eax
inc eax
inc eax
inc eax
inc eax
inc eax
by my assembly is so rusty that might not be legal at all. Of course counting hex bytes can get old quick, it might be much saner to use a lookup table.
Leak
16th December 2007, 01:18
by my assembly is so rusty that might not be legal at all. Of course counting hex bytes can get old quick, it might be much saner to use a lookup table.
That, or use a label at the start of the first and second unrolled loop part and subtract their addresses - if it's a bit more complicated than you'll have to mul by that value anyway...
redfordxx
16th December 2007, 22:25
I did
__asm
{
pushad ; store everything
...; change ebp
;3 instruction after it crushed...still in asm area
popad; restore all
}
redfordxx
16th December 2007, 22:32
use a lookup table.but this would be slow, wouldn't it?
IIRC, in the AMD appendix there is written latency 1 for most of short jumps...but I assume it is only the evalulation of the condition and then the jump takes some time...
...same as memory reads... they announce latency 2 but they take much longer as sh0dan mentioned
IanB
16th December 2007, 23:16
int foo(int abc) {
int def;
__asm {
mov eax, abc ;; abc is probably [ebp+8]
mov def, eax ;; def is probably [ebp-20]
Have you worked out how to get ASM listings from the compiler yet?
Always look at the ASM listing from the compiler, your __asm code must fit in with the code from the compiler. i.e. the whole code must be consistant.
And the magic word you need to reference a code label is offsetadd ebx, offset jumper
To find these magic words look in the ASM listing, write some C that will use the concept you need and see how the compiler does it.
gioowe
16th December 2007, 23:32
but this would be slow, wouldn't it?
IIRC, in the AMD appendix there is written latency 1 for most of short jumps...but I assume it is only the evalulation of the condition and then the jump takes some time...
...same as memory reads... they announce latency 2 but they take much longer as sh0dan mentioned
The condition is calculated before. A jump only checks flags.
A correctly predicted jump or an unconditional jump has a latency of 2, assuming that the next instruction is in the L1 code cache. If it was mispredicted then the CPU has to flush all decoded instructions and start again. This takes about 42 cycles. The AMD processor (I don't care about Intel) has a branch prediction as follows: A conditional branch is assumed as non-taken the first time. The second time (address) it is assumed as the same as last time. All further times (address) it is assumed as the second time.
foxyshadis
17th December 2007, 19:38
And the magic word you need to reference a code label is offsetadd ebx, offset jumper
To find these magic words look in the ASM listing, write some C that will use the concept you need and see how the compiler does it.
Thanks, I was scratching my head over that when I was trying to test and get the code actually working yesterday. :p
redfordxx
17th December 2007, 20:32
Have you worked out how to get ASM listings from the compiler yet?Yeah, almost reading like a book;-)
Of course I am learning from that.
I read also one funny thing:the compiler translated my jnz to some other kind of conditional jump;-)
Always look at the ASM listing from the compiler, your __asm code must fit in with the code from the compiler. i.e. the whole code must be consistant....having consistent code with the code from the compiler...this topic I probably leave to the horse for now...he has bigger head than me;-)
Leak
17th December 2007, 20:41
I read also one funny thing:the compiler translated my jnz to some other kind of conditional jump;-)
Well, jne (not equal) and jnz (not zero) are the same instruction, since for both you just check whether the zero bit in the flags register is set - maybe that's what happened?
np: Yello - Daily Disco (1980-1985: The New Mix In One Go)
redfordxx
17th December 2007, 21:02
@Fizick
I changed the title...d'you like it better?
First I borrowed your Bytes2Float routine and changed it into Bytes2Words...Then I completely replaced it with scalar asm and it was significant speedup...I will add MMX and then I post it...maybe it would be useful for you if it is possible change it back to float...
Fizick
17th December 2007, 23:06
i have also SSE optimized bytes to float routine in Vaguedenoiser ;)
(code written by Kurosu though) :)
But it take very small percent of time.
When (if) you make fast MMX dct 16x16 (16x8, 8x16), I will try add it to MVTools. :)
redfordxx
18th December 2007, 19:19
Hello,
here is new area to explore for me:
Packed integer division on mmx registers... afaik it is not in MMX set, is there any common instruction set with divisions? Or it is only AMD specific when I have AthlonXP? Where should I focus my learning efforts?
sh0dan
18th December 2007, 19:57
redfordxx: What is the C-equivalent of what you want to do?
MMX cannot do division, but you can use inverse multiply to achieve a division if your division is constant.
For example:
y = x / 5;
==
y = x * (256 / 5) / 256
==
y = (x * 51) >> 8
Increase 256 to any power of two to get better precision.
redfordxx
18th December 2007, 21:13
OK, I try it short and fast, coz this hotel internet in Brussels is constantly disconnecting me.
The mul-division is nice, I heard of it but didn't know what is it exactly...will use later...tnx
redfordxx
18th December 2007, 21:16
My case:
a=IDCT(Quantize(DCT(x))
b=IDCT(Quantize(DCT(y))
c=a/b
I have
x=[0,65535]
y=[0,255]
a,b scaled to signed DW
c=[0,255] unsigned saturated
x,y are values from video...so definitely not constant...
So I see two options:
1)do it scalar
2)use other instruction set? There is no common packed integer division?
IanB
19th December 2007, 04:00
unsigned short Reciprocal[65536]; // 65536/i
c=(a*Reciprocal[b])>>16;
Look for the extract word/insert word instructions pinsw/pextw (sp?) and pmulhw
redfordxx
19th December 2007, 08:48
So, iiuc I should prepare some lookuptable for all numbers [0,65535]...
IanB
19th December 2007, 13:16
In the general case, Yes, a 65K table. But if you know your data you can pull some tricks to increase accuracy or reduce the table size.
redfordxx
19th December 2007, 22:18
It seems, that maybe all this insert/extract gymnastics takes so much time, that it is better to do in in normal scalar asm...
Or maybe I will do this part is scalar version for compatibility and then I'll do 3DNowPro version with normal division... however, I am not sure yet, whether 3DNowPro can on xmm do something so nice as PMADDWD:(
Sulik
19th December 2007, 23:42
The most efficient way to achieve this is probably to temporarily convert the data to floating point and use single-precision floating-point SSE to perform the final division on 4 values at a time.
IanB
20th December 2007, 02:41
Look at how many cycles all the division instructions take on all of the processors. In this many cycles you could almost rebuild the pyramids.
Do not do division unless there is no other way around the problem.
Post your existing code for this portion and I will see what can be done.
Sulik
20th December 2007, 03:09
Not true. You should be able to issue a DIVPS instruction operating on 4 values with only ~20 cycle latency on a Core2 duo.
This should end up faster than 4 scalar lookup and avoids trashing L1.
IanB
20th December 2007, 03:37
The code I am thinking about would be 4 streams of 2 fast instructions plus a pmulhuw all up maybe 12 to 16 cycles on a Core2 and will do almost as well on most other CPU's
You also have to include the to and from float conversion as well to use the DIVPS.
redfordxx
23rd December 2007, 04:44
OK, back to switch code and jumps:
jmp ecx is already working...
jg ecx does not...
is there any trick?
squid_80
23rd December 2007, 05:46
jmp [reg] means the target address is in [reg], jg [reg] means the target address is the current address plus the displacement in [reg]. In other words jmp uses absolute addressing and jg uses relative.
IanB
23rd December 2007, 07:13
For the conditional jumps there is only the relative8 and relative32 forms of the instruction.
Jmp has relative8, relative32, reg/mem32, far and far indirect forms of the instruction.
Use jng label01 ;; opposite test
jmp [ecx]
label01:
redfordxx
23rd December 2007, 14:23
I have now something similar
jg label01
startloop
;code
js startloop
label01:
jmp ecx
labelswitch:
;case1
;case2
...
I only thought it might be faster with only one jump... from jg directly to case#
IanB
23rd December 2007, 21:38
Yes, lots of things would be faster if there were single instructions to implement them :D :D :D :D :D
Fizick
23rd December 2007, 22:26
and the best way to make a jump is do not jump :D
redfordxx
26th December 2007, 19:55
This thing slows my code down:add ebx, dst_pit //increase write endFunny, that in similar situation somewhere else in the procedure works fine. Probably is not cached. Can I tell the processor in advance to cache this variable for a moment?
This is some part of code, in the bottom is the problem line...it is copying the bytes from workarea;-)
mov edi, SX //inputpitch (sizex)
mov eax, OutputAddress8 //read
mov ebx, DstPoint //DstPoint...write
mov ecx, OutputLastPieceJump//precalculated jump address
mov esi, negY //reset counter y (-sizey)
LoopOutputY:
add eax, edi //increase read end
mov edx, 8
sub edx, edi //reset counter x
jg OutputLess8
LoopOutputXAlignedTop:
movq mm0, [eax+edx]
movq [ebx+edx], mm0
add edx, 8 //increase x counter
jle LoopOutputXAlignedTop
OutputLess8:
jmp ecx
OutputLastPiece:
mov dl, BYTE PTR [eax+1]
mov BYTE PTR [ebx+1], dl
mov dh, BYTE PTR [eax+2]
mov BYTE PTR [ebx+2], dh
mov dl, BYTE PTR [eax+3]
mov BYTE PTR [ebx+3], dl
mov dh, BYTE PTR [eax+4]
mov BYTE PTR [ebx+4], dh
mov dl, BYTE PTR [eax+5]
mov BYTE PTR [ebx+5], dl
mov dh, BYTE PTR [eax+6]
mov BYTE PTR [ebx+6], dh
mov dl, BYTE PTR [eax+7]
mov BYTE PTR [ebx+7], dl
add ebx, dst_pit //increase write end********!!!!!!!!!!!!!!!!!!!!!!!!!!!
add esi, 1 //increase y counter
jnz LoopOutputY
[EDIT]I tried to put dst_pit to esi and I used variable for the counter which was esi before...didn't help.
Commenting this line more then doubles the speed of the code... so, something is wrong:-(
[EDIT2]Further info:
When I comment the main processing part of the code, the cycle thru the blocks of each frame with CopyFromSrcToWorkArea and CopyFromWorkAreaToDst routines remain. The latter is shown above and contains the add ebx, dst_pit line, the former is similar and contains add eax, src_pit line. A test clip is processed in 53sec. If I comment one of those lines, the time drops to 24sec, if I comment both of them, time is 17sec.
If I replace either src_pit or dst_pit with an imm value the processing time of the clip is between 53 and 24 depending on the value (the bigger, the slower)
:(what will help it?
IanB
26th December 2007, 22:09
Apart from the obvious logic error in your code (The tail end repeat the begining of the line) it is called accessing real memory. Without in +=dst_pit you access the same memory every time, it will be in L1 cache and very fast. Otherwise you are writing to real memory which is slower. I take it that is the point of the exercise, so you are stuck.
Anyway the code looks like a Blit operation. Use the provided env->BitBlt() function, someone already worked very hard to make it as fast as possible.
redfordxx
26th December 2007, 22:46
Apart from the obvious logic error in your code (The tail end repeat the begining of the line)I don't see what you mean.
it is called accessing real memory. Without in +=dst_pit you access the same memory every time, it will be in L1 cache and very fast. Otherwise you are writing to real memory which is slower. I take it that is the point of the exercise, so you are stuck.Well I looked in the TBarrys DCTFilter and it has similar principle, except having it unrolled for 8x8 in mmx. It doesnot experience this kind of problem. I wonder why...
Anyway the code looks like a Blit operation. Use the provided env->BitBlt() function, someone already worked very hard to make it as fast as possible.But I have already the main loop completely in asm. Can I call this funciton from asm?
IanB
26th December 2007, 23:14
I don't see what you mean.add eax, edi //increase read end
mov edx, 8 <- Starts +8 from base
sub edx, edi
...
movq mm0, [eax+edx]
movq [ebx+edx], mm0
...
mov dl, BYTE PTR [eax+1] <- OutputAddress8+SX+1
mov BYTE PTR [ebx+1], dl:confused:
It is hard to analyze fully when you only give snippets here and there. When dealing with the Cache, scale is very important. i.e. are you moving 1000's, 100000's or 10000000's of bytes?
But I have already the main loop completely in asm. Can I call this function from asm?Yes. As always write some C, look at the asm listing to see how to do it.
redfordxx
27th December 2007, 00:27
:confused:
It is hard to analyze fully when you only give snippets here and there. Sure...The eight is okay there it is because <- OutputAddress8+SX+1
<- (OutputAddress-8)+SX+1
<- OutputAddress+SX-7I precalculated lot of things because I was not always sure how bad it would be to have it in the cycle.
When dealing with the Cache, scale is very important. i.e. are you moving 1000's, 100000's or 10000000's of bytes?this procedure moves blocks for DCT... probably will not be often bigger than (SX*Y)=(16*16)
I looked at BitBlt and saw this comment
//move backwards for easier looping and to disable hardware prefetch
what's the point...because I remember the recommendation here sooner in this thread
movntq [ebx+edx], mm0 helped a little...53sec-->48sec
...well the code is not perfectly ready to be shown...need som cleanup, but why not:
void DCTBytes2Bytes(const unsigned char *srcp, int src_pit, unsigned char *dstp, int dst_pit, int rowsize, int FldHeight)
{
int SX=sizex;
int negY=-sizey;
int TrigsAddress=(int)pTrigs;
int AreaToLoopInput=-iWorkInputSize;
int AreaToLoopOutput=-iWorkOutputSize;
int InputAddress16=(int)pWorkInput-16;
int OutputAddress8=(int)pWorkOutput-8;
int InputEndAddress=(int)pWorkInput+iWorkInputSize;
int OutputEndAddress=(int)pWorkOutput+iWorkOutputSize;
int TrigsChunkSize_8=iTrigsChunkSize*8;
int SrcPointY=((int)srcp)+sizex-8;
int DstPointY=((int)dstp)-8;
int SrcPoint=SrcPointY;
int DstPoint=DstPointY;
int SrcPointYBigStep=sizey*src_pit;
int DstPointYBigStep=sizey*dst_pit;
int SrcPointYEnd=SrcPointY+FldHeight*src_pit;
int SrcPointYLineEnd=SrcPointY+rowsize;
int InputLastPieceJump=(((((SX)>>3)+1)<<3)-SX-1)*6;
int OutputLastPieceJump=(((((SX)>>3)+1)<<3)-SX-1)*6;
unsigned __int64 ilScale=iScale;
unsigned __int64 i64SignMask = 0x8000000080000000;
unsigned __int64 i64Full = 0xFFFFFFFFFFFFFFFF;
_asm {
align 16
emms
pushad
//32-bit general registers: EAX is 0, ECX is 1, EDX is 2, EBX is 3, ESP is 4, EBP is 5, ESI is 6, and EDI is 7.
//16-bit general registers: AX is 0, CX is 1, DX is 2, BX is 3, SP is 4, BP is 5, SI is 6, and DI is 7.
//8-bit general registers: AL is 0, CL is 1, DL is 2, BL is 3, AH is 4, CH is 5, DH is 6 and BH is 7
movq mm1, i64SignMask
movq mm0, mm1
psrad mm1, ilScale
pandn mm0,i64Full
pxor mm0,mm1
movq i64SignMask, mm0
mov ecx, OutputLastPieceJump
add ecx, offset OutputLastPiece
mov OutputLastPieceJump,ecx
mov ecx, InputLastPieceJump
add ecx, offset InputLastPiece
mov InputLastPieceJump,ecx
mov edi, SX //inputpitch (sizex)
mov eax, SrcPoint //SrcPoint...read
LabelGlobalYLoop:
LabelGlobalXLoop:
/**/ //prepare data...Source is BYTE array, copy to workarea SHORT
mov ebx, InputAddress16 //write
mov esi, negY //reset counter y (-sizey)
mov ecx, InputLastPieceJump
LoopInputY:
lea ebx, [ebx+2*edi]//increase write end
mov edx, 8
sub edx, edi //reset counter x
jg InputLess8
LoopInputXAlignedTop:
movq mm0, [eax+edx]
movq mm1, mm0
pxor mm2, mm2
punpcklbw mm0, mm2
punpckhbw mm1, mm2
movq [ebx+2*edx], mm0
movq [ebx+2*edx+8], mm1
add edx, 8 //increase x counter
jle LoopInputXAlignedTop
InputLess8:
jmp ecx
InputLastPiece:
mov dl, BYTE PTR [eax+1]
mov BYTE PTR [ebx+2], dl
mov dh, BYTE PTR [eax+2]
mov BYTE PTR [ebx+4], dh
mov dl, BYTE PTR [eax+3]
mov BYTE PTR [ebx+6], dl
mov dh, BYTE PTR [eax+4]
mov BYTE PTR [ebx+8], dh
mov dl, BYTE PTR [eax+5]
mov BYTE PTR [ebx+10], dl
mov dh, BYTE PTR [eax+6]
mov BYTE PTR [ebx+12], dh
mov dl, BYTE PTR [eax+7]
mov BYTE PTR [ebx+14], dl
add eax, src_pit //increase read end
// add eax, 64 //increase read end
add esi, 1 //increase y counter
jnz LoopInputY
/**/
//main process on prepared data @pWorkInput...in one loop 4words (1 read) mul by 8x4 words (8 memory reads) and store in bytes
//to do pairing, mm1+mm2 can be coupled with mm3, or mm3 used for shuffled second part and remove 2x movq where is **1 (dunno what's better gain)
/**/
mov eax, InputEndAddress
mov ebx, TrigsAddress
mov edx, OutputEndAddress
mov edi, AreaToLoopOutput //outer counter
LoopTrigOutput:
add ebx, TrigsChunkSize_8
mov ecx, AreaToLoopInput //inner counter
pxor mm4, mm4 //(sum=0)
pxor mm5, mm5 //(sum=0)
pxor mm6, mm6 //(sum=0)
pxor mm7, mm7 //(sum=0)
movq mm0, [eax+ecx+0] //U pipe
LoopTrigInput:
movq mm1, [ebx+8*ecx+0] //(load 2+2 trig values)
pmaddwd mm1, mm0 //(multiply src*trig for 1+1 sum)
paddd mm4, mm1 //(add to 1+1 sum)
movq mm2, [ebx+8*ecx+8] //(load 2+2 trig values)
pmaddwd mm2, mm0 //(multiply src*trig for 1+1 sum)
paddd mm5, mm2 //(add to 1+1 sum)
movq mm1, [ebx+8*ecx+16] //(load 2+2 trig values)
pmaddwd mm1, mm0 //(multiply src*trig for 1+1 sum)
paddd mm6, mm1 //(add to 1+1 sum)
movq mm2, [ebx+8*ecx+24] //(load 2+2 trig values) **1
pmaddwd mm2, mm0 //(multiply src*trig for 1+1 sum)
paddd mm7, mm2 //(add to 1+1 sum) old block
pshufw mm3,mm0,01001110b
movq mm1, [ebx+8*ecx+32] //(load 2+2 trig values)
pmaddwd mm1, mm3 //(multiply src*trig for 1+1 sum)
paddd mm4, mm1 //(add to 1+1 sum)
movq mm2, [ebx+8*ecx+40] //(load 2+2 trig values)
pmaddwd mm2, mm3 //(multiply src*trig for 1+1 sum)
paddd mm5, mm2 //(add to 1+1 sum)
movq mm1, [ebx+8*ecx+48] //(load 2+2 trig values)
pmaddwd mm1, mm3 //(multiply src*trig for 1+1 sum)
paddd mm6, mm1 //(add to 1+1 sum)
movq mm2, [ebx+8*ecx+56] //(load 2+2 trig values) **1
pmaddwd mm2, mm3 //(multiply src*trig for 1+1 sum) old block
paddd mm7, mm2 //(add to 1+1 sum) old block
movq mm0, [eax+ecx+8] //U pipe
add ecx, 8 //Increase Counter
jnz LoopTrigInput
//here some downscaling mm4 - mm7 and packing to mm4
movq mm1, i64SignMask
movq mm2, ilScale
psrad mm4, mm2
pand mm4, mm1
psrad mm5, mm2
pand mm5, mm1
psrad mm6, mm2
pand mm6, mm1
psrad mm7, mm2
pand mm7, mm1
packssdw mm4, mm5
packssdw mm6, mm7
packuswb mm4, mm6
movq [edx+edi], mm4 //(saving all 4 values)
add edi, 8
jnz LoopTrigOutput
/**/
/**/ //copy data back...From workarea BYTE array, copy to dst BYTE
//mov ebp, dst_pit //dstpitch---crushes
mov edi, SX //inputpitch (sizex)
mov ebx, DstPoint //DstPoint...write
mov eax, OutputAddress8 //read
add ebx, edi //DstPoint+=SX
mov DstPoint, ebx //DstPoint...write
mov ecx, OutputLastPieceJump//precalculated jump address
mov esi, negY //reset counter y (-sizey)
/**/
LoopOutputY:
add eax, edi //increase read end
mov edx, 8
sub edx, edi //reset counter x
jg OutputLess8
LoopOutputXAlignedTop:
movq mm0, [eax+edx]
movntq [ebx+edx], mm0
add edx, 8 //increase x counter
jle LoopOutputXAlignedTop
OutputLess8:
jmp ecx
OutputLastPiece:
mov dl, BYTE PTR [eax+1]
mov BYTE PTR [ebx+1], dl
mov dh, BYTE PTR [eax+2]
mov BYTE PTR [ebx+2], dh
mov dl, BYTE PTR [eax+3]
mov BYTE PTR [ebx+3], dl
mov dh, BYTE PTR [eax+4]
mov BYTE PTR [ebx+4], dh
mov dl, BYTE PTR [eax+5]
mov BYTE PTR [ebx+5], dl
mov dh, BYTE PTR [eax+6]
mov BYTE PTR [ebx+6], dh
mov dl, BYTE PTR [eax+7]
mov BYTE PTR [ebx+7], dl
add ebx, dst_pit //increase write end
// add ebx,400 //increase write end
add esi, 1 //increase y counter
jnz LoopOutputY
/**/
// end loop LabelGlobalXLoop
mov eax, SrcPoint
add eax, edi ;SrcPoint+=SX
cmp eax, SrcPointYLineEnd ;SrcPoint<SrcPointYLineEnd
mov SrcPoint, eax
jl LabelGlobalXLoop
// end loop LabelGlobalYLoop
mov ecx, DstPointY
add ecx, DstPointYBigStep ;DstPointY+=DstPointYBigStep
mov DstPointY, ecx
mov DstPoint, ecx ;DstPoint=DstPointY
mov eax, SrcPointY
mov edx, SrcPointYBigStep
add eax, edx ;SrcPointY+=SrcPointYBigStep
add SrcPointYLineEnd, edx ;SrcPointYLineEnd+=SrcPointYBigStep
cmp eax, SrcPointYEnd ;SrcPointY<SrcPointYEnd
mov SrcPointY, eax
mov SrcPoint, eax ;SrcPoint=SrcPointY
jl LabelGlobalYLoop
popad
}//asm
}
IanB
27th December 2007, 02:07
Your original slow down problem is probably L1 cache saturation, you have memory references all over the shop. Each cache line is 64 bytes long and 64 byte aligned.
movq mm0, [eax+edx]
movq mm1, mm0
pxor mm2, mm2
punpcklbw mm0, mm2
Getting the pairing right can help a lot
movq mm0, [eax+edx]
pxor mm2, mm2 <- But it is constant and should be outside loop anyhow
movq mm1, mm0
punpcklbw mm0, mm2
If you have enough registers internally duplicate the loop. i.e. :-
pxor mm4, mm4
label:
movq mm0, [eax+edx]
movq mm2, [eax+edx+8]
movq mm1, mm0
movq mm3, mm2
punpcklbw mm0, mm4
punpckhbw mm1, mm4
punpcklbw mm2, mm4
punpckhbw mm3, mm4
...
Another example :-
movq mm1, [ebx+8*ecx+0] //(load 2+2 trig values)
pmaddwd mm1, mm0 //(multiply src*trig for 1+1 sum)
paddd mm4, mm1 //(add to 1+1 sum)
movq mm2, [ebx+8*ecx+8] //(load 2+2 trig values)
pmaddwd mm2, mm0 //(multiply src*trig for 1+1 sum)
paddd mm5, mm2 //(add to 1+1 sum)
Don't rely on the out of order execution, do it yourself, like this :-
movq mm1, [ebx+8*ecx+0] //(load 2+2 trig values)
movq mm2, [ebx+8*ecx+8] //(load 2+2 trig values)
pmaddwd mm1, mm0 //(multiply src*trig for 1+1 sum)
pmaddwd mm2, mm0 //(multiply src*trig for 1+1 sum)
paddd mm4, mm1 //(add to 1+1 sum)
paddd mm5, mm2 //(add to 1+1 sum)
How do you know [ebx+1] is zero
InputLastPiece:
mov dl, BYTE PTR [eax+1]
mov BYTE PTR [ebx+2], dl
Try this style instead
movzx edx, BYTE PTR [eax+1]
mov WORD PTR [ebx+1], dx
If the inner loop is looped many times dedicating 7 explicit end cases can be beneficial.
OutputLess8:
jmp ecx ;; Loaded from table of 7 routine addresses
align 16
OutputLastPiece7:
...
align 16
OutputLastPiece4:
mov edx, DWORD PTR [eax+4]
mov DWORD PTR [ebx+4], edx
add ebx, dst_pit //increase write end
add esi, 1 //increase y counter
jnz LoopOutputY
jmp NextOutputY
align 16
OutputLastPiece3:
...
align 16
OutputLastPiece1:
mov dl, BYTE PTR [eax+7]
mov BYTE PTR [ebx+7], dl
add ebx, dst_pit //increase write end
add esi, 1 //increase y counter
jnz LoopOutputY
align 16
NextOutputY:
...More latter ;) gotta run now
redfordxx
27th December 2007, 03:37
Thanx lotYour original slow down problem is probably L1 cache saturation, you have memory references all over the shop.You mean the variables I prepared in the beginning or that in one cycle I do src->(copy)->input->(process)->output->(copy)->dst
Each cache line is 64 bytes long and 64 byte aligned.Hm, and how can I control it, or how should I behave?
But it is constant and should be outside loop anyhow
OK
If you have enough registers internally duplicate the loop. i.e. :But I don't know whether there will be 16+,32+... values and not 8+,24+,40+... values. Or, more jumps?
Don't rely on the out of order execution, do it yourself, like thisYes, I wanted to ask about this one, because there are two pmaddwd next to each other which require FMUL, so there will be bottleneck anyway?
How do you know [ebx+1] is zeroI know;) --- it is not obvious here, but I prepared it in constructor.(saving here 1 cycle maybe)
InputLastPiece:
mov dl, BYTE PTR [eax+1]
mov BYTE PTR [ebx+2], dl
mov dh, BYTE PTR [eax+2]
mov BYTE PTR [ebx+4], dh
can be helpful to alter dh,dl?
If the inner loop is looped many times dedicating 7 explicit end cases can be beneficial.Yes, I will try...
But what to do with my cache problem?
redfordxx
27th December 2007, 04:29
LoopTrigInput:
movq mm1, [ebx+8*ecx+0] //(load 2+2 trig values)
movq mm2, [ebx+8*ecx+8] //(load 2+2 trig values)
pmaddwd mm1, mm0 //(multiply src*trig for 1+1 sum)
pshufw mm3,mm0,01001110b
pmaddwd mm2, mm0 //(multiply src*trig for 1+1 sum)
paddd mm4, mm1 //(add to 1+1 sum)
paddd mm5, mm2 //(add to 1+1 sum)
movq mm1, [ebx+8*ecx+16] //(load 2+2 trig values)
movq mm2, [ebx+8*ecx+24] //(load 2+2 trig values) **1
pmaddwd mm1, mm0 //(multiply src*trig for 1+1 sum)
pmaddwd mm2, mm0 //(multiply src*trig for 1+1 sum)
paddd mm6, mm1 //(add to 1+1 sum)
paddd mm7, mm2 //(add to 1+1 sum) old block
movq mm1, [ebx+8*ecx+32] //(load 2+2 trig values)
movq mm2, [ebx+8*ecx+40] //(load 2+2 trig values)
pmaddwd mm1, mm3 //(multiply src*trig for 1+1 sum)
pmaddwd mm2, mm3 //(multiply src*trig for 1+1 sum)
paddd mm4, mm1 //(add to 1+1 sum)
paddd mm5, mm2 //(add to 1+1 sum)
movq mm1, [ebx+8*ecx+48] //(load 2+2 trig values)
movq mm2, [ebx+8*ecx+56] //(load 2+2 trig values) **1
pmaddwd mm1, mm3 //(multiply src*trig for 1+1 sum)
pmaddwd mm2, mm3 //(multiply src*trig for 1+1 sum) old block
add ecx, 8 //Increase Counter
paddd mm6, mm1 //(add to 1+1 sum)
paddd mm7, mm2 //(add to 1+1 sum) old block
movq mm0, [eax+ecx] //U pipe
jnz LoopTrigInput
Trying reorganize pairs, tried many combinations and cannot measure any improvement (in CPU time in Task Manager)
IanB
27th December 2007, 08:08
Your original slow down problem is probably L1 cache saturation, you have memory references all over the shop.You mean the variables I prepared in the beginning or that in one cycle I do src->(copy)->input->(process)->output->(copy)->dst
It's all together. Going from just enough to 1 line short can be catastrophic. And yes you have over 160 bytes of local variables, this will be 3 cache lines. 16K of L1 cache is 256 lines and they are some-way associative, i.e. not all lines do all addresses.
Each cache line is 64 bytes long and 64 byte aligned.
Hm, and how can I control it, or how should I behave?
Just be aware there is a limit and know how it is allocated. Always be conservative. For stack based variables there is nothing really you can do, the assumption is your variables start and finish partway thru a cache line.
For special buffers when squeezing every last ounce of performance, you can align the buffer mod 64, this avoids the 2 wasted part cache lines at each end.
If you have enough registers internally duplicate the loop. i.e. :
But I don't know whether there will be 16+,32+... values and not 8+,24+,40+... values. Or, more jumps?
Well it has to be a case by case judgement call. Is the stuffing around in the end code worth doing 16 bytes per cycle instead of 8.
Don't rely on the out of order execution, do it yourself, like this
Yes, I wanted to ask about this one, because there are two pmaddwd next to each other which require FMUL, so there will be bottleneck anyway?
I try to avoid single unit stalls if I can, but I don't bust a hump if I can't arrange the code to avoid it. Out of order execution may handle the problem for you, the idea is to just do the best you readily can.
For some unknown reason PMADDWD's are like public transport, they hunt in packs. :D
InputLastPiece:
mov dl, BYTE PTR [eax+1]
mov BYTE PTR [ebx+2], dl
mov dh, BYTE PTR [eax+2]
mov BYTE PTR [ebx+4], dh
can be helpful to alter dh,dl?
For some processors using ah, bh, ch & dh to memory is slower, the movsx & movzx instructions are preferred for byte from memory operations.
If the inner loop is looped many times dedicating 7 explicit end cases can be beneficial.
Yes, I will try...
Also know your data. If the data is 95% probably going to be in 8 byte chunks then the end case code will only be used 5% of the times, so it may not be worth spending lots of time perfecting it. Most video clips are mod 16 width.
But what to do with my cache problem?
First prove it is a cache exhaustion problem, reduce the size of your data slightly. Is there a knee effect in speed? i.e. cropping off a few lines doubles the speed. This may be some entirely different phenominon here, I just made a guess here!
----------------------------
Also I am not sure you are getting good value pre-unpacking the data from 8bits to 16bit units. Your code is obtuse, but it seems to me that each unpacked element is only used once. If this is the case then unpacking on the fly would be no slower and more cache efficient i.e. you could fit twice as much data into the available cache. E.g. in the Horizontal resizer we pre-unpack the 8bit data to 16bit units. For the Point resizer each unit is only used once so there is only penalty. For the Lanczos4 resizer each unit is used 8 times so there is very much gain. Note the resizer core is general purpose.
Also adding offsets in address calculations is practically free, you could get rid of some of your precalculations by just doing [ebx+2*edx-16] instead of using InputAddress16=(int)pWorkInput-16;
redfordxx
27th December 2007, 12:24
First prove it is a cache exhaustion problem, reduce the size of your data slightly. Is there a knee effect in speed? i.e. cropping off a few lines doubles the speed. This may be some entirely different phenominon here, I just made a guess here!
Measurements:Block size: Time:
8x1 21
8x2 94
8x3 94
8x4 94
16x1 13
16x2 58
16x3 58
16x4 58
32x1 11
32x2 23
32x3 23
64x1 11
64x2 24
64x3 23
80x1 12
80x2 22
96x1 12
96x2 12
96x3 13
96x4 13
96x6 15
96x8 14
120x1 13
120x2 16
120x3 16
120x4 18
128x1 22
128x2 24
128x3 23
256x1 24
I commented everything except the last routine which copies data from workarea (X*Y) to dst (pitch*height)
Seems to me that there is problem saving more than one line in dst. This problem vanishes as the line lenght reaches 96. However, probably the caching of the source becomes problem for 128B and more?
redfordxx
27th December 2007, 15:04
Just be aware there is a limit and know how it is allocated. Always be conservative. For stack based variables there is nothing really you can do, the assumption is your variables start and finish partway thru a cache line.
For special buffers when squeezing every last ounce of performance, you can align the buffer mod 64, this avoids the 2 wasted part cache lines at each end.How can I control whether my variables are all in one line?
Or maybe it will be possible to prepare (among others) two sets of variables which will be saved with pushad in the beginning (I decide the 64 aligned place in memory with the stack registry) and read it every cycle with popad (once before "main process on prepared data @pWorkInput", once before "copy data back"). Might be faster than 4 mov's from memory to registers?
Can prefetch instruction help somwhere somehow? If it is for both AMD and Intel...
IanB
27th December 2007, 21:52
Measurements:This make absolutely no sense, please explain the numbers and what you have tested :confused:
How can I control whether my variables are all in one line?Allocate a struct and manually align it. i.e. Y = (X = malloc(sizeof(Y)+63)) & ~64; Y->a =10;... free(X);
But you are wasting your time, you should not need these local variables in the first place, registers are for storing your working variables. Memory is for your buffers.
Can prefetch instruction help somwhere somehow?Prefetching and movntq writes are for dealing with data set bigger than the L2 cache. Generally they are not compatible with data sets that easily fit in the L2 cache. i.e. small data sets perform worse than without them.
In avisynth dealing with 720x576 YV12 images you are looking at about 600K per frame so 2 input and 1 output buffer fit easily in a 2Mb L2 cache. With HiDef you are looking at about 3Mb per frame. Ideally the flow from 1 filter to the next, keeps all the current data in L2 cache.
If it is for both AMD and Intel...The theory is similar for both breeds, just the actual numbers change.
redfordxx
28th December 2007, 03:19
This make absolutely no sense, please explain the numbers and what you have tested
Basically I am testing speed of asm equivalent of this
for(y=0;y<height;y+=blockheight)
for(x=0;x<rowsize;x+=blockwidth)
//here is some processing, which is now omitted, so it does not make big sense
for(j=0;j<blockheight;j++)
for(i=0;i<blockwidth;i++)
dstptr[(y+j)*dstpitch+x+i]=workoutput[j*blockwidth+i];
these numbers 8x1 etc in my previous post are block sizes, i.e. blockwidth*blockheight. So, the results show that if the block is more than one line high, it slows down.
The block will mostly be 8x8 or 16x16...so one line is not really enough:(
redfordxx
28th December 2007, 03:44
Now I am fighting with stupid thing like this:
This works:
const unsigned char* sourcep [32];
for (int i=0;i<length;++i)
sourcep[i] = source[i]->GetReadPtr(plane);
but i need aligned variable lenght something like this:
const unsigned char* sourcep = (const unsigned char*) _aligned_malloc(sizeof(unsigned char*)*length, 64);
for (int i=0;i<length;++i)
sourcep[i] = source[i]->GetReadPtr(plane);
but is does not work...i got message on that third line:
error C2440: '=' : cannot convert from 'const BYTE *' to 'const unsigned char'
More or less I know why (sourcep[i] is already the the target of the pointer), but still not sure how to write it correctly.
squid_80
28th December 2007, 03:48
Looks like sourcep should be a const unsigned char **.
IanB
28th December 2007, 06:44
Okay the equivalent C code helps a lot. Perhaps it should be in your source as comments.
Have you got thismovntq [ebx+edx], mm0in the block you are testing? And you have an AMD beastie. Test with a normal movq.
Do not mix cached and non-temporal memory accesses, specially writes and make absolutly sure that they are 64 bit aligned.
Also as a baseline perhaps you should test the pure C code version.
------------------------
Also a possible hint for your end around code. As long as you have at least 8 bytes total to process just do a single unaligned movq at the end. i.e. offset the movq to match the end of the buffer and process the few overlapping bytes twice.
redfordxx
28th December 2007, 07:11
Have you got thismovntq [ebx+edx], mm0in the block you are testing? And you have an AMD beastie. Test with a normal movq.
I had it before... it was little slower.
Also as a baseline perhaps you should test the pure C code version.Had it before and was slower.
:confused:
Also a possible hint for your end around code. As long as you have at least 8 bytes total to process just do a single unaligned movq at the end. i.e. offset the movq to match the end of the buffer and process the few overlapping bytes twice.Will try...but have to be cautios about the end of row;)
redfordxx
28th December 2007, 07:15
Looks like sourcep should be a const unsigned char **.Thanks...ehm but now I can't access the members of the array easily like this anymoremov ecx, [sourcep+4*eax]where eax walks thru the array.
squid_80
28th December 2007, 08:10
Thanks...ehm but now I can't access the members of the array easily like this anymoremov ecx, [sourcep+4*eax]where eax walks thru the array.
Why not?
IanB
28th December 2007, 13:37
I meant that you should repeat the knee test with both movq and C code to get a comparitive baseline for the 50% slowdown. i.e. you are looking for the speed ratio to be constant.
I assume you are still trying to find why you have a performance knee.
--------------------
Oh and also about the end code stuff. For planar frames Avisynth guarantees that pitch is at least mod 8 (mod 16 on >= SSE2 machines). Use the PLANAR_Y_ALIGNED form of the RowSize(). Just remember that bytes beyond rowsize up to pitch contain uninitialized junk, you are free to overwrite it with good stuff or trash as you see fit.
redfordxx
28th December 2007, 13:52
Oh and also about the end code stuff. For planar frames Avisynth guarantees that pitch is at least mod 8 (mod 16 on >= SSE2 machines). Use the PLANAR_Y_ALIGNED form of the RowSize(). Just remember that bytes beyond rowsize up to pitch contain uninitialized junk, you are free to overwrite it with good stuff or trash as you see fit.But if block width is let's say 4 and rowsize=pitch, then I will write four bytes over...
redfordxx
28th December 2007, 14:00
_declspec(align(64)) const unsigned char* sourcep[64]
__asm{
mov ecx, [sourcep] ;=sourcep[0]
mov ecx, [sourcep+4] ;=sourcep[1]
}
but
const unsigned char** sourcep = (const unsigned char**) _aligned_malloc(sizeof(unsigned char*)*length, 64);
__asm{
mov ecx, [sourcep] ;=sourcep[0]
mov ecx, [sourcep+4] ;=&sourcep+4
}
squid_80
28th December 2007, 15:06
I don't see anything wrong with that. In C code *(sourcep+1) is the same as sourcep[1]. In assembly they both translate to [sourcep+4] (assuming pointers are 4 bytes).
redfordxx
28th December 2007, 15:39
In the second case I cant access directly sourcep[1].
Probably I have to get the pointer first and then the value...two memory reads instead of one. Is there workaround or not?
redfordxx
28th December 2007, 15:42
For special buffers when squeezing every last ounce of performance, you can align the buffer mod 64, this avoids the 2 wasted part cache lines at each end.Maybe I am missing, what exactly you mean by buffer...
squid_80
28th December 2007, 16:16
In the second case I cant access directly sourcep[1].
Sure you can... The same as before, sourcep[1] is [sourcep+4] in assembly.
IanB
28th December 2007, 23:03
@Squid, The 1st case is reallymov ecx, [ebp+sourcep+0]
mov edx, [ebp+sourcep+4]
the 2nd case ismov esi, [ebp+sourcep]
mov ecx, [esi+0]
mov edx, [esi+4]
--------------------------
But if block width is let's say 4 and rowsize=pitch, then I will write four bytes over...
Yes you would, but you can adjust the way you process it so that only the last loop has to deal with the problem.AWIDTH=rowsize;
if (AWIDTH+8-blockwidth > pitch)
AWIDTH-=8-blockwidth;
for(y=0;y<height;y+=blockheight) {
for(x=0;x<AWIDTH;x+=blockwidth)
for(j=0;j<blockheight;j++)
for(i=0;i<blockwidth;i++) {
FAST CODE (overlaps to mod 8)
}
for(x=AWIDTH;x<rowsize;x+=blockwidth)
for(j=0;j<blockheight;j++)
for(i=0;i<blockwidth;i++) {
END CODE (No overlaps!)
}
}
Maybe I am missing, what exactly you mean by buffer...buffer :- chunk of memory, work space, something you declare or malloc, where you read and/or write data to.
The point was in the very special case where you are trying to squeeze every last ounce of cache there is the option to waste some memory and align your buffer to cache lines. This is not normal practise for general purpose filters because you do not have enough control of the world, i.e. you do not know what size image the user will give you, you do not know what block size the user will choose.
For normal use being 4 aligned for cpu registers, 8 aligned for MMX registers and 16 aligned for SSE2 registers is adequate.
IanB
28th December 2007, 23:10
Or even more devious...
for(y=0;y<height-blockheight;y+=blockheight) {
for(x=0;x<rowsize;x+=blockwidth)
for(j=0;j<blockheight;j++)
for(i=0;i<blockwidth;i++) {
FAST CODE (overlaps to mod 8)
}
}
y=height-blockheight;
for(x=0;x<AWIDTH;x+=blockwidth)
for(j=0;j<blockheight;j++)
for(i=0;i<blockwidth;i++) {
FAST CODE (overlaps to mod 8)
}
for(x=AWIDTH;x<rowsize;x+=blockwidth)
for(j=0;j<blockheight;j++)
for(i=0;i<blockwidth;i++) {
END CODE (No overlaps!)
}
redfordxx
30th December 2007, 10:15
Would be maybe helpful, to copy first row of blocks from source to multiple workareas (for faster copying) then process all of them and then copy all of them to dst frame?
If that would be more cache-friendly.
If so which way (copying src to workarea):
1) copy block after block (copy all lines of first block, then second...)
2) go thru frame reading one whole line after each other and copy the pixel to different workareas where they belong...
1) seems to me more cachable for the writing part (in the workarea the pixel would be written aftereach other) but reads in src are random
2) seems to me more cachable for the reading part (going thru the src in line pixel after pixel, but writing is random)
By the way, how is it with the back direction reading? You recomended me to avoid it but there was notice in the BitBlt function that the author used it to avoid HW prefetch...:confused:
IanB
30th December 2007, 14:10
You are at the point where there is much unknown information to guess. It could work or it could exhaust the cache. You will have to test and time it.
Reading backwards applies to a special set of cases, search the AMD technotes for how and when to use it.
Also with using movntq you say your filter is faster, does your measure include a next filter in the chain. You may make your filter faster but the next filter might be slower because it must re-read that frame data just written back into cache, hence the whole script might run slower. Test carefully!
redfordxx
7th November 2011, 22:52
Hi...after years..
I came around and I thougnt I would share what I've done. Just as it is.
So, quick comment. It is based on the TBarrys one. Usage:DctFilter c[fac0]f[fac1]f[fac2]f[fac3]f[fac4]f[fac5]f[fac6]f[fac7]f[fac8]f[fac9]f[fac10]f[fac11]f[fac12]f[fac13]f[fac14]f[fac15]f[mode]s[blockyx]i[blockyy]i[shiftyx]i[shiftyy]i
Mode can be: showF, default, classic, freeDCT
-showF....shows frequency values in spatial domain (maps it on 0-255 range)
-other modes are just the frequency decimation, various kinds (classic is original 8x8 method, freeDCT is [blockyx]x[blockyy] transformation, default chooses automatically...I think
There are 16 factors to be adjusted.
I added dll and sources.
v8 works
v13 seems not to work
Issues and stupid things:
- I didn't rename the filter at that time, so ithas same name as theold one (probably that's why recompiled the old filter to name DCTFiler, so I can have both in AviSynth)
- DCTAddConstand doesnot work and I don't know why
- Newer version doesnot work and I dont know why...
So, maybe someone's interested. ShowF looks interesting...;-)
vBulletin® v3.8.5, Copyright ©2000-2012, Jelsoft Enterprises Ltd.