sse2 fun


-h
11th April 2002, 14:11
Just started working with sse2, and I'd like a critique of some SAD (sum of absolute differences) code. It doesn't really matter if you don't know what a SAD routine is supposed to do; I'm mainly wondering how the pipelining etc. should work. Assume esi is 16-byte aligned (and that the stride is a multiple of 16, so the psadbw memory operands stay aligned too):

sad16_sse2
push esi
push edi
push ebx

mov esi, [esp + 12 + 4] ; cur
mov edi, [esp + 12 + 8] ; ref
mov eax, [esp + 12 + 12] ; stride

mov ebx, eax
mov ecx, eax
shl ebx, 2
add ecx, ebx ; ecx = stride*5
mov edx, ebx
sub ebx, eax ; ebx = stride*3
add edx, ebx ; edx = stride*7

pxor xmm7, xmm7 ; xmm7 = sum = 0

movdqu xmm0, [edi] ; ref 0
movdqu xmm1, [edi+eax] ; ref 1
movdqu xmm2, [edi+eax*2] ; ref 2
movdqu xmm3, [edi+ebx] ; ref 3
psadbw xmm0, [esi] ; diff to cur
psadbw xmm1, [esi+eax]
psadbw xmm2, [esi+eax*2]
psadbw xmm3, [esi+ebx]
paddusw xmm7, xmm0 ; add diffs to sum
paddusw xmm7, xmm1
paddusw xmm7, xmm2
paddusw xmm7, xmm3

movdqu xmm0, [edi+eax*4] ; ref 4
movdqu xmm1, [edi+ecx] ; ref 5
movdqu xmm2, [edi+ebx*2] ; ref 6
movdqu xmm3, [edi+edx] ; ref 7
psadbw xmm0, [esi+eax*4]
psadbw xmm1, [esi+ecx]
psadbw xmm2, [esi+ebx*2]
psadbw xmm3, [esi+edx]
paddusw xmm7, xmm0
paddusw xmm7, xmm1
paddusw xmm7, xmm2
paddusw xmm7, xmm3

add esi, eax ; advance by 8 lines
add edi, eax
add esi, edx
add edi, edx

movdqu xmm0, [edi] ; ref 8
movdqu xmm1, [edi+eax] ; ref 9
movdqu xmm2, [edi+eax*2] ; ref 10
movdqu xmm3, [edi+ebx] ; ref 11
psadbw xmm0, [esi]
psadbw xmm1, [esi+eax]
psadbw xmm2, [esi+eax*2]
psadbw xmm3, [esi+ebx]
paddusw xmm7, xmm0
paddusw xmm7, xmm1
paddusw xmm7, xmm2
paddusw xmm7, xmm3

movdqu xmm0, [edi+eax*4] ; ref 12
movdqu xmm1, [edi+ecx] ; ref 13
movdqu xmm2, [edi+ebx*2] ; ref 14
movdqu xmm3, [edi+edx] ; ref 15
psadbw xmm0, [esi+eax*4]
psadbw xmm1, [esi+ecx]
psadbw xmm2, [esi+ebx*2]
psadbw xmm3, [esi+edx]
paddusw xmm7, xmm0
paddusw xmm7, xmm1
paddusw xmm7, xmm2
paddusw xmm7, xmm3

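; psadbw leaves two partial sums, one in the low word of each
; 64-bit half, so fold the high half onto the low half: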
movdqa xmm6, xmm7 ; copy sum
psrldq xmm6, 8 ; shift right by 8 bytes
paddq xmm7, xmm6 ; get final sum in low dword
movd eax, xmm7 ; return sum

pop ebx
pop edi
pop esi

ret

Now I'm not all that interested in whether it would actually assemble and run or not (probably not; I don't have a P4 to test on atm), but is that the way the reads/psadbw's/paddusw's should be structured? Or would this be better/faster:

movdqu xmm0, [edi] ; ref 0
psadbw xmm0, [esi] ; diff to cur
movdqu xmm1, [edi+eax] ; ref 1
paddusw xmm7, xmm0 ; add diff to sum
psadbw xmm1, [esi+eax]
movdqu xmm2, [edi+eax*2] ; ref 2
paddusw xmm7, xmm1
psadbw xmm2, [esi+eax*2]
movdqu xmm3, [edi+ebx] ; ref 3
paddusw xmm7, xmm2
psadbw xmm3, [esi+ebx]
paddusw xmm7, xmm3

Here I've just interleaved the paddusw's to immediately follow the unaligned reads, in the (uninformed) hope that the CPU will execute each paddusw while the next read is still completing, seeing as the following psadbw has to wait for the unaligned read to finish anyway.

But yeah, this is first time sse2, I'm just after some general ideas.

Edit: corrected some mm6/mm7 -> xmm6/xmm7 mistakes.

-h

trbarry
11th April 2002, 14:56
-h

I also only started writing SSE2 code recently, after I built a P4 system. But I've found I can't always get optimized code just by following the darn manual. Sometimes prefetch helps, sometimes not; I fiddle with other things and run it to see what goes faster. And I think SSE2 instructions may be much more performance-sensitive to data alignment, even the ones that don't crash on bad boundaries.

But generally I try not to access a value immediately after loading it into an xmm register, so I rearrange the code a little, making it more or less unreadable. ;)
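
Roughly, the difference is between:

movdqu xmm0, [edi] ; load
psadbw xmm0, [esi] ; uses xmm0 straight away, so it just sits waiting for the load

and something like:

movdqu xmm0, [edi] ; start the load for row 0
movdqu xmm1, [edi+eax] ; start an independent load for row 1
psadbw xmm0, [esi] ; row 0 has (hopefully) arrived by now
psadbw xmm1, [esi+eax]

(Untested; register and stride names borrowed from your listing above.)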

And do the usual mmx things like not confusing the branch prediction with conditional jumps and stuff.
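
For instance, the old branchless byte-wise |a - b|: two unsigned saturating subtracts and an or instead of any compare-and-jump (just a sketch, with the operands assumed to be in mm0 and mm1):

movq mm2, mm0 ; copy a
psubusb mm0, mm1 ; max(a - b, 0), per byte
psubusb mm1, mm2 ; max(b - a, 0), per byte
por mm0, mm1 ; one of the two is zero per byte, so mm0 = |a - b|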

The following is a nasm SSE2 version of the Xvid sad16 that I've been playing with. It really does work on my P4, and I use it, but it doesn't really give any measurable performance improvement. I haven't released it because some versions of nasm don't seem to assemble all the sse2 instructions correctly. Gruel suggested I put in some nasm conditional-assembly parameter logic to work around this, but I don't know how to do that. And I've been hoping to make it faster somehow.

Remodeling, Pardon our dust. ;)

- Tom




;===========================================================================
;
; uint32_t sad16_sse2(const uint8_t * const cur,
; const uint8_t * const ref,
; const uint32_t stride,
; const uint32_t best_sad);
;
; experimental! Since most data are unaligned it may not be faster to use sse2
; but we'll find out (probably is). trbarry 03/17/2002
;
;===========================================================================

; Macro to accum sum of abs diff's and bump esi
; This one does not assume alignment

%macro sumdiffsse2U 0-1 1
%if %1
prefetchnta [esi+2*ecx]
%endif
movdqu xmm2, [esi+edi] ; ref2
movdqu xmm0, [esi] ; ref
psadbw xmm0, xmm2 ; xmm0 = row SAD of |ref - ref2|
paddusw xmm6, xmm0 ; sum += xmm0
lea esi, [esi+ecx] ; bump ptrs one row
%endmacro

; macro to accum sum of abs diff's and bump esi
; This one does assume 16 byte alignment

%macro sumdiffsse2A 0-1 1
%if %1
prefetchnta [esi+ecx]
%endif
movdqu xmm0, [esi+edi] ; ref2
psadbw xmm0, [esi] ; ref ; this one must be 16 byte aligned
paddusw xmm6, xmm0 ; sum += xmm0
lea esi, [esi+ecx] ; bump ptrs
%endmacro

%macro Accum3moreA 0
add edi, eax ; now offset from 3 more rows
add esi, eax

psadbw xmm0, xmm1 ; dif 0, ...
paddusw xmm6, xmm0 ; sum

movdqu xmm0, [edi] ; 3, 6, 9, 12
movdqa xmm1, [esi]

psadbw xmm2, xmm3 ; dif 1, ...
paddusw xmm6, xmm2 ; sum

movdqu xmm2, [edi+ecx] ; 4, 7, 10, 13
movdqa xmm3, [esi+ecx] ;

psadbw xmm4, xmm5 ; dif 2, ...
paddusw xmm6, xmm4 ; sum

movdqu xmm4, [edi+2*ecx] ; 5, 8, 11, 14
movdqa xmm5, [esi+2*ecx] ;

%endmacro

align 16
cglobal sad16_sse2
sad16_sse2

push esi
push edi
push ebx

mov esi, [esp + 12 + 4] ; cur
mov edi, [esp + 12 + 8] ; ref
mov ecx, [esp + 12 + 12] ; stride
pxor xmm6, xmm6 ; xmm6 = sum = 0

test esi, 0x0000000f ; esi 16 byte aligned already?
jz Aligned_OK ; y
test edi, 0x0000000f ; n, how about edi?
jz Aligned_Almost ; good, use this one instead

Not_Aligned: ; not sure this ever happens anymore

sub edi, esi ; carry as offset
sumdiffsse2U
sumdiffsse2U
sumdiffsse2U
sumdiffsse2U

sumdiffsse2U
sumdiffsse2U
sumdiffsse2U
sumdiffsse2U

sumdiffsse2U
sumdiffsse2U
sumdiffsse2U
sumdiffsse2U

sumdiffsse2U
sumdiffsse2U
sumdiffsse2U 0
sumdiffsse2U 0
jmp Continue

Aligned_Almost:
xchg esi, edi ; parm usage is symmetrical so just swap them

Aligned_OK:

; 0-2
movdqu xmm0, [edi] ; 0 ref2
movdqa xmm1, [esi] ; ref

movdqu xmm2, [edi+ecx] ; 1
movdqa xmm3, [esi+ecx]

movdqu xmm4, [edi+2*ecx] ; 2
movdqa xmm5, [esi+2*ecx] ;

; 3-5
mov eax, ecx
lea eax, [eax+2*ecx] ; 3*ecx

Accum3moreA

; 6-8
Accum3moreA

; 9-11
Accum3moreA

; 12-14
Accum3moreA

; 15
psadbw xmm0, xmm1 ; dif 12
paddusw xmm6, xmm0 ; sum

movdqu xmm0, [edi+eax] ; 15 (eax = 3*ecx; edi is at row 12 here)
movdqa xmm1, [esi+eax]

psadbw xmm2, xmm3 ; dif 13
paddusw xmm6, xmm2 ; sum

psadbw xmm4, xmm5 ; dif 14
paddusw xmm6, xmm4 ; sum

psadbw xmm0, xmm1 ; dif 15
paddusw xmm6, xmm0 ; sum

Continue:
movdqa xmm0, xmm6
punpckhqdq xmm0, xmm0
paddusw xmm6, xmm0
; movd eax, xmm6 ; nasm doesn't generate this properly
movdq2q mm6, xmm6 ; see if this works

movd eax, mm6
pop ebx
pop edi
pop esi
emms

ret

-h
11th April 2002, 15:09
Ah I see, prefetch a line one line before it's actually used. I like it.
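
i.e. the pattern from your aligned macro (pointer names borrowed from it):

prefetchnta [esi+ecx] ; touch the next row now...
movdqu xmm0, [esi+edi] ; ...while this row is loaded and summed
psadbw xmm0, [esi]
paddusw xmm6, xmm0
lea esi, [esi+ecx]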

Well my system at work is a P4 (and Fridays are slow :)), so I'll have a tinker and see what can be done.

I know things like quantization will work quite well with sse2 (dequantization will benefit from even plain sse: pminsw/pmaxsw for the saturation stuff). Hopefully some fun code will get committed by the weekend.
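
For the dequant saturation I mean something along these lines: clamping four coefficients at a time to the [-2048, 2047] range MPEG-4 allows, with the bounds preloaded (just a sketch, register choice arbitrary):

pminsw mm0, mm6 ; mm6 = 2047 in all four words: coeff = min(coeff, 2047)
pmaxsw mm0, mm7 ; mm7 = -2048 in all four words: coeff = max(coeff, -2048)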

-h

trbarry
11th April 2002, 16:10
If you think it is time to start releasing SSE2 code, I've also got an SSE2 Xvid IDCT to release. It was originally contributed by DmitryR for DVD2AVI; I added it there and tested it. But I use it in Xvid too.

- Tom

edit: Huh, Friday? I had forgotten where you were. ;)