kassandro
5th September 2004, 23:47
I am writing my first filter that uses floating-point SSE, i.e. 4 single-precision floating-point numbers are processed simultaneously. That means I can process 4 pixels at once. But pixels are essentially byte values or three-byte values. Now, how do I load 4 adjacent bytes as single-precision floating-point numbers into an SSE register? The engineers at Intel seem not to have thought about this problem at all. Unbelievably, I need a whopping 8 instructions to achieve this. I use the following macro:
#define load4bytes(dest_xmm, src_mem, mmx_null, mmx_temp1, mmx_temp2, xmm_temp)\
__asm movd mmx_temp1, src_mem \
__asm punpcklbw mmx_temp1, mmx_null\
__asm movq mmx_temp2, mmx_temp1\
__asm punpcklbw mmx_temp1, mmx_null\
__asm punpckhbw mmx_temp2, mmx_null\
__asm cvtpi2ps dest_xmm, mmx_temp1\
__asm cvtpi2ps xmm_temp, mmx_temp2\
__asm shufps dest_xmm, xmm_temp,0 + (1 << 2) + 0 + (1<<6)
Here dest_xmm is the destination SSE register, src_mem is the memory location from which the 4 bytes are loaded, mmx_temp1 and mmx_temp2 are MMX registers for temporary use, xmm_temp is an SSE register for temporary use, and mmx_null is an MMX register filled with zeroes, which the macro leaves unchanged. The shufps immediate evaluates to 0x44, which selects the two low floats of dest_xmm followed by the two low floats of xmm_temp. If somebody knows how to do it faster, I would like to know.
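For readers who prefer intrinsics to inline assembly, the macro can be sketched roughly as follows. This is only an illustration, not the code used in the filter: the function name load4bytes_sse1 is made up, it assumes a compiler that still exposes the MMX intrinsics (for example GCC or Clang on x86), and it uses movlhps (_mm_movelh_ps) for the final merge, which gives the same result as shufps with immediate 0x44.

```c
#include <mmintrin.h>   /* MMX intrinsics */
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen 4 adjacent bytes to 4 floats using MMX + SSE only. */
__m128 load4bytes_sse1(const uint8_t *src)
{
    int32_t packed;
    memcpy(&packed, src, sizeof packed);              /* alignment-safe 32-bit read */

    const __m64 zero = _mm_setzero_si64();            /* plays the role of mmx_null */
    __m64 words = _mm_unpacklo_pi8(_mm_cvtsi32_si64(packed), zero); /* movd + punpcklbw */
    __m64 lo = _mm_unpacklo_pi8(words, zero);         /* punpcklbw: dwords b0, b1 */
    __m64 hi = _mm_unpackhi_pi8(words, zero);         /* punpckhbw: dwords b2, b3 */

    __m128 a = _mm_cvtpi32_ps(_mm_setzero_ps(), lo);  /* cvtpi2ps: low half = b0, b1 */
    __m128 b = _mm_cvtpi32_ps(_mm_setzero_ps(), hi);  /* cvtpi2ps: low half = b2, b3 */
    __m128 r = _mm_movelh_ps(a, b);                   /* same as shufps a, b, 0x44 */

    _mm_empty();                                      /* emms: leave MMX state clean */
    return r;
}
```

The compiler picks the registers itself here, so the temporaries of the macro become ordinary local variables.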
With SSE2 "only" 4 instructions are necessary for loading. A single instruction that loads 4 bytes as single-precision floating-point numbers into an SSE register would be much easier to implement than cvtpi2ps, which I ultimately have to use twice. It involves a transfer from an MMX register to an SSE register, which doesn't sound fast either. This implementation disaster should also act as a brake on SSE-based DCT/IDCT. Storing 4 single-precision floating-point numbers from an SSE register back to a memory location as bytes isn't much faster either: I need 6 instructions for it, or again 4 instructions if I own a P4. One can truly say that the chip designers at Intel have built a load/store brake into their otherwise quite well-designed SSE instruction unit.
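The post does not spell out the four-instruction SSE2 sequences, so the following is only one plausible reading, written with intrinsics: movd, punpcklbw, punpcklwd, cvtdq2ps for the load, and cvtps2dq, packssdw, packuswb, movd for the store. The function names load4bytes_sse2 and store4bytes_sse2 are invented for this sketch, and the count of four ignores the zeroed register, just as the macro above assumes mmx_null is already set up.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 4 adjacent bytes and widen them to 4 floats. */
__m128 load4bytes_sse2(const uint8_t *src)
{
    int32_t packed;
    memcpy(&packed, src, sizeof packed);        /* alignment-safe 32-bit read */

    const __m128i zero = _mm_setzero_si128();
    __m128i v = _mm_cvtsi32_si128(packed);      /* movd */
    v = _mm_unpacklo_epi8(v, zero);             /* punpcklbw: bytes -> 16-bit words */
    v = _mm_unpacklo_epi16(v, zero);            /* punpcklwd: words -> 32-bit dwords */
    return _mm_cvtepi32_ps(v);                  /* cvtdq2ps */
}

/* Hypothetical helper: round 4 floats, clamp to 0..255, store them as 4 bytes. */
void store4bytes_sse2(uint8_t *dst, __m128 v)
{
    __m128i i = _mm_cvtps_epi32(v);             /* cvtps2dq: round per MXCSR mode */
    i = _mm_packs_epi32(i, i);                  /* packssdw: signed saturation to words */
    i = _mm_packus_epi16(i, i);                 /* packuswb: unsigned saturation to bytes */

    int32_t packed = _mm_cvtsi128_si32(i);      /* movd */
    memcpy(dst, &packed, sizeof packed);
}
```

Everything stays in the XMM registers in this version, so no MMX-to-SSE transfer and no emms are needed. The two saturating packs clamp out-of-range results, so a negative value is stored as 0 and anything above 255 as 255.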