Intel's multimedia blunder


kassandro
5th September 2004, 23:47
I am just writing my first filter that uses floating point SSE, i.e. 4 single precision floating point numbers are processed simultaneously. That means I can process 4 pixels at once. But pixels are essentially byte values or triples of byte values. Now, how do I load 4 adjacent bytes as single precision floating point numbers into an SSE register? The engineers at Intel seem not to have thought about this problem at all. Unbelievable, but I need a whopping 8 instructions to achieve this. I use the following macro:

#define load4bytes(dest_xmm, src_mem, mmx_null, mmx_temp1, mmx_temp2, xmm_temp)\
__asm movd mmx_temp1, src_mem \
__asm punpcklbw mmx_temp1, mmx_null\
__asm movq mmx_temp2, mmx_temp1\
__asm punpcklbw mmx_temp1, mmx_null\
__asm punpckhbw mmx_temp2, mmx_null\
__asm cvtpi2ps dest_xmm, mmx_temp1\
__asm cvtpi2ps xmm_temp, mmx_temp2\
__asm shufps dest_xmm, xmm_temp,0 + (1 << 2) + 0 + (1<<6)

Here dest_xmm is the destination SSE register, src_mem is the memory location from which the 4 bytes are loaded, mmx_temp1 and mmx_temp2 are MMX registers for temporary use, xmm_temp is an SSE register for temporary use and mmx_null is an MMX register filled with zeroes, which is left unchanged by the macro. If somebody knows how to do it faster, I would like to know.
With SSE2 "only" 4 instructions are necessary for loading. A single instruction which loads 4 bytes as single precision floating point numbers into an SSE register would be much easier to implement than cvtpi2ps, which I ultimately have to use twice. It involves a transfer from an MMX register to an SSE register, which doesn't sound fast either. This implementation disaster should also act as a brake on SSE based dct/idct. Storing 4 single precision floating point numbers from an SSE register to memory as bytes isn't much faster either: I need 6 instructions for it, or again 4 instructions if I own a P4 (SSE2). One can truly say that the chip designers at Intel have built a load/store brake into their otherwise quite well designed SSE instruction unit.
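For readers who want to see what the SSE2 variants might look like, here is a minimal sketch using C intrinsics instead of inline asm. The post does not spell out its four SSE2 instructions, so the sequences below (movd, punpcklbw, punpcklwd, cvtdq2ps for the load; cvtps2dq, packssdw, packuswb, movd for the store) are an assumption about what was meant, and the function names load4bytes_sse2 and store4bytes_sse2 are made up for illustration.

```c
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: load 4 adjacent bytes and widen them to 4 floats. */
static __m128 load4bytes_sse2(const uint8_t *src)
{
    int32_t packed;
    const __m128i zero = _mm_setzero_si128();
    __m128i v;

    memcpy(&packed, src, sizeof packed);  /* 4-byte read without alignment assumptions */
    v = _mm_cvtsi32_si128(packed);        /* movd:      4 bytes into the low dword     */
    v = _mm_unpacklo_epi8(v, zero);       /* punpcklbw: bytes -> 16-bit words          */
    v = _mm_unpacklo_epi16(v, zero);      /* punpcklwd: words -> 32-bit dwords         */
    return _mm_cvtepi32_ps(v);            /* cvtdq2ps:  4 x int32 -> 4 x float         */
}

/* Hypothetical helper: convert 4 floats to bytes, saturating to 0..255.
   Rounding follows the current MXCSR mode (round-to-nearest by default).
   Floats outside the int32 range are not handled specially. */
static void store4bytes_sse2(uint8_t *dst, __m128 src)
{
    __m128i v = _mm_cvtps_epi32(src);     /* cvtps2dq: 4 x float -> 4 x int32          */
    int32_t packed;

    v = _mm_packs_epi32(v, v);            /* packssdw: int32 -> int16, signed sat.     */
    v = _mm_packus_epi16(v, v);           /* packuswb: int16 -> uint8, unsigned sat.   */
    packed = _mm_cvtsi128_si32(v);        /* movd:     low dword back to a GPR         */
    memcpy(dst, &packed, sizeof packed);  /* write the 4 result bytes                  */
}
```

Both directions stay entirely in the XMM registers, so no MMX register and no emms is involved, which is the main practical difference from the SSE1 macro above.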

MfA
6th September 2004, 00:52
If you are going to spend so many cycles doing the conversion you might as well use the time for a conversion to linear colourspace. To me it seems a shame to do floating point processing on values that still have gamma correction applied.

Of course this is hardly the only load/store brake, don't forget the ever so slight penalty for unaligned accesses ...

Fizick
6th September 2004, 22:24
Vaguedenoiser has some asm 3dnow and SSE code for byte to float conversion (by Kurosu), but I am not sure that code is any shorter.

kassandro
7th September 2004, 01:14
Originally posted by Fizick
Vaguedenoiser has some asm 3dnow and SSE code for byte to float conversion (by Kurosu), but I am not sure that code is any shorter.
I had a look at float_sse.asm and float_3dne.asm. Yes, Vaguedenoiser shares the same silly problems with me. The macro, CONVONE, takes 4 or 2 adjacent bytes respectively, converts them to float and writes them back to memory in order to access them later. All that nonsense would not be necessary if Intel or AMD had implemented the very simple direct conversion from "unsigned char" to float. The SSE code in float_sse.asm is essentially the same as my macro, but since several 4-byte chunks are converted simultaneously, one can do more instruction pairing. With 3dnow you can process single precision floating point numbers in the MMX registers, but only 2 at a time instead of 4 with SSE. The 3dnow loading macro is very similar to the SSE2 macro, but with only 2 instead of 4 bytes. I am surprised that programmers don't complain loudly about this incredible blunder.

@MfA: actually Intel has done something recently about unaligned access. With SSE3 they have added a new unaligned 16 byte load instruction (lddqu), which should be more efficient if the subsequent 16 bytes are loaded as well. Thus the integer SSE designers, unlike the floating point SSE designers, seem to be aware that the byte is the basic multimedia quantity as far as pictures are concerned.