#1 | Link |
|
Kilted Yaksman
Join Date: Oct 2001
Location: South Carolina
Posts: 1,303
|
sse2 fun
Just started working with sse2, and I'd like a critique of some SAD code. It doesn't really matter if you don't know what a SAD routine is supposed to do, I'm just wondering how pipelining etc. should work. Assume esi is 16-byte aligned:
Code:
sad16_sse2
    push esi
    push edi
    push ebx

    mov esi, [esp + 12 + 4]     ; cur
    mov edi, [esp + 12 + 8]     ; ref
    mov eax, [esp + 12 + 12]    ; stride

    mov ebx, eax
    mov ecx, eax
    shl ebx, 2                  ; ebx = stride*4
    add ecx, ebx                ; ecx = stride*5
    mov edx, ebx
    sub ebx, eax                ; ebx = stride*3
    add edx, ebx                ; edx = stride*7

    pxor xmm7, xmm7             ; xmm7 = sum = 0

    movdqu xmm0, [edi]          ; ref 0
    movdqu xmm1, [edi+eax]      ; ref 1
    movdqu xmm2, [edi+eax*2]    ; ref 2
    movdqu xmm3, [edi+ebx]      ; ref 3
    psadbw xmm0, [esi]          ; diff to cur
    psadbw xmm1, [esi+eax]
    psadbw xmm2, [esi+eax*2]
    psadbw xmm3, [esi+ebx]
    paddusw xmm7, xmm0          ; add diffs to sum
    paddusw xmm7, xmm1
    paddusw xmm7, xmm2
    paddusw xmm7, xmm3

    movdqu xmm0, [edi+eax*4]    ; ref 4
    movdqu xmm1, [edi+ecx]      ; ref 5
    movdqu xmm2, [edi+ebx*2]    ; ref 6
    movdqu xmm3, [edi+edx]      ; ref 7
    psadbw xmm0, [esi+eax*4]
    psadbw xmm1, [esi+ecx]
    psadbw xmm2, [esi+ebx*2]
    psadbw xmm3, [esi+edx]
    paddusw xmm7, xmm0
    paddusw xmm7, xmm1
    paddusw xmm7, xmm2
    paddusw xmm7, xmm3

    add esi, eax                ; advance by 8 lines (stride + stride*7)
    add edi, eax
    add esi, edx
    add edi, edx

    movdqu xmm0, [edi]          ; ref 8
    movdqu xmm1, [edi+eax]      ; ref 9
    movdqu xmm2, [edi+eax*2]    ; ref 10
    movdqu xmm3, [edi+ebx]      ; ref 11
    psadbw xmm0, [esi]
    psadbw xmm1, [esi+eax]
    psadbw xmm2, [esi+eax*2]
    psadbw xmm3, [esi+ebx]
    paddusw xmm7, xmm0
    paddusw xmm7, xmm1
    paddusw xmm7, xmm2
    paddusw xmm7, xmm3

    movdqu xmm0, [edi+eax*4]    ; ref 12
    movdqu xmm1, [edi+ecx]      ; ref 13
    movdqu xmm2, [edi+ebx*2]    ; ref 14
    movdqu xmm3, [edi+edx]      ; ref 15
    psadbw xmm0, [esi+eax*4]
    psadbw xmm1, [esi+ecx]
    psadbw xmm2, [esi+ebx*2]
    psadbw xmm3, [esi+edx]
    paddusw xmm7, xmm0
    paddusw xmm7, xmm1
    paddusw xmm7, xmm2
    paddusw xmm7, xmm3

    movdqa xmm6, xmm7           ; copy sum
    psrldq xmm6, 8              ; shift right by 8 bytes
    paddq xmm7, xmm6            ; get final sum in low dword
    movd eax, xmm7              ; return sum

    pop ebx
    pop edi
    pop esi
    ret
Code:
    movdqu xmm0, [edi]          ; ref 0
    psadbw xmm0, [esi]          ; diff to cur
    movdqu xmm1, [edi+eax]      ; ref 1
    paddusw xmm7, xmm0          ; add diff to sum
    psadbw xmm1, [esi+eax]
    movdqu xmm2, [edi+eax*2]    ; ref 2
    paddusw xmm7, xmm1
    psadbw xmm2, [esi+eax*2]
    movdqu xmm3, [edi+ebx]      ; ref 3
    paddusw xmm7, xmm2
    psadbw xmm3, [esi+ebx]
    paddusw xmm7, xmm3
But yeah, this is my first time with sse2, I'm just after some general ideas.
Edit: corrected some mm6/mm7 -> xmm6/xmm7 mistakes.
-h
Last edited by -h; 11th April 2002 at 15:15.
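For anyone following along who hasn't met a SAD routine before, here is a minimal scalar sketch in C of what the assembly above is meant to compute: the sum of absolute differences over a 16x16 block. The name sad16_ref is made up for illustration and is not taken from the Xvid source. It also shows why accumulating with paddusw is safe: each psadbw result is at most 8*255 = 2040 per 64-bit half, and 16 rows of that (32640) still fit in 16 bits.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference for a 16x16 SAD (hypothetical helper, for illustration).
   cur and ref point at the top-left pixel of each block; stride is the
   distance in bytes between the starts of consecutive rows. */
uint32_t sad16_ref(const uint8_t *cur, const uint8_t *ref, uint32_t stride)
{
    uint32_t sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += (uint32_t)abs((int)cur[x] - (int)ref[x]);
        cur += stride;   /* step both pointers down one row */
        ref += stride;
    }
    return sum;
}
```

A scalar version like this is mostly useful as a check on the output of the hand-written routine while rearranging instructions.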
#2 | Link |
|
Registered User
Join Date: Oct 2001
Location: Gainesville FL USA
Posts: 2,091
|
-h
I also only started writing SSE2 code recently, after I built a P4 system. But I've found I can't always get optimized code just by following the darn manual. Sometimes prefetch helps, sometimes not; I fiddle with other things and run it to see what goes faster. And I think SSE2 instructions may be much more performance-sensitive to data alignment, even the ones that don't crash on bad boundaries.
But generally I try not to access a value immediately after loading it into an xmm register, so I rearrange code a little, making it more or less unreadable. And do the usual mmx things like not confusing the branch prediction with conditional jumps and stuff.
The following is a nasm SSE2 version of Xvid sad16 that I've been playing with. It really does work on my P4, and I use it, but it doesn't really give any measurable performance improvement. I haven't released it because it seems some versions of nasm don't handle all sse2 instructions correctly. Gruel suggested I put in some nasm conditional compile parm logic to address this and I don't know how to do that. And I've been hoping to make it faster somehow.
Remodeling, Pardon our dust.
- Tom
Code:
;===========================================================================
;
; uint32_t sad16_sse2(const uint8_t * const cur,
;                     const uint8_t * const ref,
;                     const uint32_t stride,
;                     const uint32_t best_sad);
;
; experimental! Since most data are unaligned it may not be faster to use sse2
; but we'll find out (probably is). trbarry 03/17/2002
;
;===========================================================================

; Macro to accum sum of abs diff's and bump esi
; This one does not assume alignment
%macro sumdiffsse2U 0-1 1
%if %1
    prefetchnta [esi+2*ecx]     ; prefetch two rows ahead
%endif
    movdqu xmm2, [esi+edi]      ; ref2
    movdqu xmm0, [esi]          ; ref
    psadbw xmm0, xmm2           ; xmm0 = |ref - cur|
    paddusw xmm6, xmm0          ; sum += xmm0
    lea esi, [esi+ecx]          ; bump ptrs by one row (16 calls cover 16 rows)
%endmacro

; Macro to accum sum of abs diff's and bump esi
; This one does assume 16 byte alignment
%macro sumdiffsse2A 0-1 1
%if %1
    prefetchnta [esi+ecx]
%endif
    movdqu xmm0, [esi+edi]      ; ref
    psadbw xmm0, [esi]          ; ref2 ; this one must be 16 byte aligned
    paddusw xmm6, xmm0          ; sum += xmm0
    lea esi, [esi+ecx]          ; bump ptrs
%endmacro

%macro Accum3moreA 0
    add edi, eax                ; now offset from 3 more rows
    add esi, eax
    psadbw xmm0, xmm1           ; dif 0, ...
    paddusw xmm6, xmm0          ; sum
    movdqu xmm0, [edi]          ; 3, 6, 9, 12
    movdqa xmm1, [esi]
    psadbw xmm2, xmm3           ; dif 1, ...
    paddusw xmm6, xmm2          ; sum
    movdqu xmm2, [edi+ecx]      ; 4, 7, 10, 13
    movdqa xmm3, [esi+ecx]
    psadbw xmm4, xmm5           ; dif 2, ...
    paddusw xmm6, xmm4          ; sum
    movdqu xmm4, [edi+2*ecx]    ; 5, 8, 11, 14
    movdqa xmm5, [esi+2*ecx]
%endmacro

align 16
cglobal sad16_sse2
sad16_sse2
    push esi
    push edi
    push ebx
    mov esi, [esp + 12 + 4]     ; ref
    mov edi, [esp + 12 + 8]     ; cur
    mov ecx, [esp + 12 + 12]    ; stride
    pxor xmm6, xmm6             ; xmm6 = sum = 0
    test esi, 0x0000000f        ; esi 16 byte aligned already?
    jz Aligned_OK               ; y
    test edi, 0x0000000f        ; n, how about edi?
    jz Aligned_Almost           ; good, use this one instead

Not_Aligned:                    ; not sure this ever happens anymore
    sub edi, esi                ; carry as offset
    sumdiffsse2U
    sumdiffsse2U
    sumdiffsse2U
    sumdiffsse2U
    sumdiffsse2U
    sumdiffsse2U
    sumdiffsse2U
    sumdiffsse2U
    sumdiffsse2U
    sumdiffsse2U
    sumdiffsse2U
    sumdiffsse2U
    sumdiffsse2U
    sumdiffsse2U
    sumdiffsse2U 0              ; last two rows: nothing left to prefetch
    sumdiffsse2U 0
    jmp Continue

Aligned_Almost:
    xchg esi, edi               ; parm usage symmetrical so just swap them

Aligned_OK:
    ; 0-2
    movdqu xmm0, [edi]          ; 0 ref2
    movdqa xmm1, [esi]          ; ref
    movdqu xmm2, [edi+ecx]      ; 1
    movdqa xmm3, [esi+ecx]
    movdqu xmm4, [edi+2*ecx]    ; 2
    movdqa xmm5, [esi+2*ecx]
    ; 3-5
    mov eax, ecx
    lea eax, [eax+2*ecx]        ; 3*ecx
    Accum3moreA
    ; 6-8
    Accum3moreA
    ; 9-11
    Accum3moreA
    ; 12-14
    Accum3moreA
    ; 15
    psadbw xmm0, xmm1           ; dif 12
    paddusw xmm6, xmm0          ; sum
    movdqu xmm0, [edi+eax]      ; 15 (edi is at row 12, eax = 3*stride)
    movdqa xmm1, [esi+eax]
    psadbw xmm2, xmm3           ; dif 13
    paddusw xmm6, xmm2          ; sum
    psadbw xmm4, xmm5           ; dif 14
    paddusw xmm6, xmm4          ; sum
    psadbw xmm0, xmm1           ; dif 15
    paddusw xmm6, xmm0          ; sum

Continue:
    movdqa xmm0, xmm6
    punpckhqdq xmm0, xmm0       ; copy high qword sum to low qword
    paddusw xmm6, xmm0
;   movd eax, xmm6              ; nasm does not get generated properly
    movdq2q mm6, xmm6           ; see if this works
    movd eax, mm6
    pop ebx
    pop edi
    pop esi
    emms
    ret
Last edited by trbarry; 11th April 2002 at 16:01.
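The alignment handling in the listing above can be sketched with SSE2 intrinsics in C. This is only a rough sketch under stated assumptions, not the routine itself: the name sad16_sketch is hypothetical, the loop is not unrolled, and unlike the assembly it also checks that the stride is a multiple of 16 before using aligned loads. It falls back to plain C when SSE2 is not available at compile time.

```c
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Hypothetical sketch of a 16x16 SAD with the same alignment trick. */
uint32_t sad16_sketch(const uint8_t *cur, const uint8_t *ref, uint32_t stride)
{
#ifdef HAVE_SSE2
    /* SAD is symmetrical, so if only ref is aligned just swap the two. */
    if (((uintptr_t)cur & 15) != 0 && ((uintptr_t)ref & 15) == 0) {
        const uint8_t *tmp = cur;
        cur = ref;
        ref = tmp;
    }
    /* Aligned loads are only safe if every row of cur stays aligned. */
    int cur_aligned = ((((uintptr_t)cur) | stride) & 15) == 0;

    __m128i sum = _mm_setzero_si128();
    for (int y = 0; y < 16; y++) {
        __m128i r = _mm_loadu_si128((const __m128i *)ref);      /* movdqu */
        __m128i c = cur_aligned
                  ? _mm_load_si128((const __m128i *)cur)        /* movdqa */
                  : _mm_loadu_si128((const __m128i *)cur);      /* movdqu */
        sum = _mm_adds_epu16(sum, _mm_sad_epu8(r, c));          /* psadbw + paddusw */
        cur += stride;
        ref += stride;
    }
    /* Fold the high 64-bit partial sum onto the low one (punpckhqdq + paddusw). */
    sum = _mm_adds_epu16(sum, _mm_unpackhi_epi64(sum, sum));
    return (uint32_t)_mm_cvtsi128_si32(sum);                    /* movd */
#else
    uint32_t total = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            total += (uint32_t)abs((int)cur[x] - (int)ref[x]);
        cur += stride;
        ref += stride;
    }
    return total;
#endif
}
```

Letting the compiler schedule the loads will not match hand-interleaved assembly, but it gives a quick way to compare aligned and unaligned timings on a P4.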
#3 | Link |
|
Kilted Yaksman
Join Date: Oct 2001
Location: South Carolina
Posts: 1,303
|
Ah I see, prefetch a line, 1 line before it's actually used. I like it.
Well my system at work is a P4 (and Fridays are slow), so I'll have a tinker and see what can be done. I know things like quantization will work quite well with sse2 (dequantization will benefit from even normal sse: pminsw/pmaxsw for saturation stuff), hopefully some fun code will get committed by the weekend.
-h
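To illustrate the pminsw/pmaxsw idea for saturation, here is a small hedged sketch in C. It uses the 128-bit SSE2 forms of those instructions (_mm_min_epi16 and _mm_max_epi16) rather than the 64-bit MMX-register forms that plain SSE provides, and falls back to scalar code without SSE2. The function name clamp_coeffs and the choice of limits are assumptions for illustration, not taken from the Xvid dequantizer.

```c
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Clamp n signed 16-bit coefficients to [lo, hi] in place
   (hypothetical helper; the caller picks the limits). */
void clamp_coeffs(int16_t *coeff, int n, int16_t lo, int16_t hi)
{
    int i = 0;
#ifdef HAVE_SSE2
    const __m128i vlo = _mm_set1_epi16(lo);
    const __m128i vhi = _mm_set1_epi16(hi);
    for (; i + 8 <= n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(coeff + i));
        v = _mm_min_epi16(v, vhi);   /* pminsw: cap at the upper limit */
        v = _mm_max_epi16(v, vlo);   /* pmaxsw: cap at the lower limit */
        _mm_storeu_si128((__m128i *)(coeff + i), v);
    }
#endif
    /* Scalar tail (or the whole array when SSE2 is unavailable). */
    for (; i < n; i++) {
        if (coeff[i] > hi) coeff[i] = hi;
        if (coeff[i] < lo) coeff[i] = lo;
    }
}
```

Two instructions per eight coefficients replace a pair of compare-and-branch sequences per coefficient, which is where the saturation step should gain.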
#4 | Link |
|
Registered User
Join Date: Oct 2001
Location: Gainesville FL USA
Posts: 2,091
|
If you think it is time to start releasing SSE2 code, I've also got an SSE2 Xvid IDCT to release. This was originally contributed by DmitryR for DVD2AVI, but added by me and tested there. But I use it in Xvid.
- Tom
edit: Huh, Friday? I had forgotten where you were.