11th April 2002, 15:11   #1
-h
Kilted Yaksman
Join Date: Oct 2001
Location: South Carolina
Posts: 1,303
sse2 fun

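I've just started working with SSE2, and I'd like a critique of some SAD (sum of absolute differences) code. It doesn't really matter if you don't know what a SAD routine is supposed to do; I'm just wondering how pipelining etc. should work. Assume esi is 16-byte aligned and the stride is a multiple of 16, since psadbw with a memory operand faults on an unaligned address: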

Code:
sad16_sse2:
		push esi
		push edi
		push ebx

		mov esi, [esp + 12 + 4]		; cur
		mov edi, [esp + 12 + 8]		; ref
		mov eax, [esp + 12 + 12]	; stride

		mov ebx, eax
		mov ecx, eax
		shl ebx, 2
		add ecx, ebx			; ecx = stride*5
		mov edx, ebx
		sub ebx, eax			; ebx = stride*3
		add edx, ebx			; edx = stride*7
		
		pxor xmm7, xmm7			; xmm7 = sum = 0

		movdqu xmm0, [edi]		; ref 0
		movdqu xmm1, [edi+eax]		; ref 1
		movdqu xmm2, [edi+eax*2]	; ref 2
		movdqu xmm3, [edi+ebx]		; ref 3
		psadbw xmm0, [esi]		; diff to cur
		psadbw xmm1, [esi+eax]
		psadbw xmm2, [esi+eax*2]
		psadbw xmm3, [esi+ebx]
		paddusw xmm7, xmm0		; add diffs to sum
		paddusw xmm7, xmm1
		paddusw xmm7, xmm2
		paddusw xmm7, xmm3

		movdqu xmm0, [edi+eax*4]	; ref 4
		movdqu xmm1, [edi+ecx]		; ref 5
		movdqu xmm2, [edi+ebx*2]	; ref 6
		movdqu xmm3, [edi+edx]		; ref 7
		psadbw xmm0, [esi+eax*4]
		psadbw xmm1, [esi+ecx]
		psadbw xmm2, [esi+ebx*2]
		psadbw xmm3, [esi+edx]
		paddusw xmm7, xmm0
		paddusw xmm7, xmm1
		paddusw xmm7, xmm2
		paddusw xmm7, xmm3

		add esi, eax			; advance by 8 lines
		add edi, eax
		add esi, edx
		add edi, edx

		movdqu xmm0, [edi]		; ref 8
		movdqu xmm1, [edi+eax]		; ref 9
		movdqu xmm2, [edi+eax*2]	; ref 10
		movdqu xmm3, [edi+ebx]		; ref 11
		psadbw xmm0, [esi]
		psadbw xmm1, [esi+eax]
		psadbw xmm2, [esi+eax*2]
		psadbw xmm3, [esi+ebx]
		paddusw xmm7, xmm0
		paddusw xmm7, xmm1
		paddusw xmm7, xmm2
		paddusw xmm7, xmm3

		movdqu xmm0, [edi+eax*4]	; ref 12
		movdqu xmm1, [edi+ecx]		; ref 13
		movdqu xmm2, [edi+ebx*2]	; ref 14
		movdqu xmm3, [edi+edx]		; ref 15
		psadbw xmm0, [esi+eax*4]
		psadbw xmm1, [esi+ecx]
		psadbw xmm2, [esi+ebx*2]
		psadbw xmm3, [esi+edx]
		paddusw xmm7, xmm0
		paddusw xmm7, xmm1
		paddusw xmm7, xmm2
		paddusw xmm7, xmm3

		movdqa xmm6, xmm7		; copy sum
		psrldq xmm6, 8			; shift right by 8 bytes
		paddq xmm7, xmm6		; get final sum in low dword
		movd eax, xmm7			; return sum

		pop ebx
		pop edi
		pop esi

		ret
Now, I'm not all that interested in whether it would assemble or not (probably not; I don't have a P4 to test on at the moment), but is that the way the reads, psadbw's and paddusw's should be structured? Or would this be better/faster:

Code:
		movdqu xmm0, [edi]		; ref 0
		psadbw xmm0, [esi]		; diff to cur
		movdqu xmm1, [edi+eax]		; ref 1
		paddusw xmm7, xmm0		; add diff to sum
		psadbw xmm1, [esi+eax]
		movdqu xmm2, [edi+eax*2]	; ref 2
		paddusw xmm7, xmm1
		psadbw xmm2, [esi+eax*2]
		movdqu xmm3, [edi+ebx]		; ref 3
		paddusw xmm7, xmm2
		psadbw xmm3, [esi+ebx]
		paddusw xmm7, xmm3
Here I've interleaved the instructions so that each paddusw immediately follows the next unaligned read, in the uninformed hope that the paddusw will execute while that read completes, since the following psadbw has to wait for the read to finish anyway.

But yeah, this is my first time with SSE2, so I'm just after some general ideas.
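
For checking either ordering against known-good output, here is a minimal C sketch of what the routine computes. It is not from the thread: `sad16_ref` and `sad16_sse2_c` are hypothetical names, the intrinsics stand in for the hand-written assembly, and both loads are unaligned so the sketch does not depend on the alignment assumption above. On a build without SSE2 it falls back to the scalar loop.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Plain C reference: sum of |cur - ref| over a 16x16 block. */
static uint32_t sad16_ref(const uint8_t *cur, const uint8_t *ref,
                          uint32_t stride)
{
    uint32_t sum = 0;
    assert(stride >= 16);
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += (uint32_t)abs((int)cur[x] - (int)ref[x]);
        cur += stride;
        ref += stride;
    }
    return sum;
}

/* Intrinsics version (hypothetical name): one psadbw per row. */
static uint32_t sad16_sse2_c(const uint8_t *cur, const uint8_t *ref,
                             uint32_t stride)
{
#ifdef HAVE_SSE2
    __m128i sum = _mm_setzero_si128();
    for (int y = 0; y < 16; y++) {
        __m128i r = _mm_loadu_si128((const __m128i *)ref);  /* movdqu */
        __m128i c = _mm_loadu_si128((const __m128i *)cur);  /* movdqu */
        /* psadbw leaves one partial sum in each 64-bit half */
        sum = _mm_adds_epu16(sum, _mm_sad_epu8(r, c));      /* paddusw */
        cur += stride;
        ref += stride;
    }
    /* fold the high half onto the low half: psrldq + paddq */
    sum = _mm_add_epi64(sum, _mm_srli_si128(sum, 8));
    return (uint32_t)_mm_cvtsi128_si32(sum);                /* movd */
#else
    return sad16_ref(cur, ref, stride);  /* no SSE2: scalar fallback */
#endif
}
```

Comparing the two functions on a few test blocks is a cheap way to validate a reordered assembly version before timing it.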

Edit: corrected some mm6/mm7 -> xmm6/xmm7 mistakes.

-h

Last edited by -h; 11th April 2002 at 15:15.
11th April 2002, 15:56   #2
trbarry
Registered User
Join Date: Oct 2001
Location: Gainesville FL USA
Posts: 2,091
-h

I also only started writing SSE2 code recently, after I built a P4 system. But I've found I can't always get optimized code just by following the darn manual. Sometimes prefetch helps and sometimes it doesn't, so I fiddle with things and run the code to see what goes faster. And I think SSE2 instructions may be much more performance-sensitive to data alignment, even the ones that don't crash on bad boundaries.

But generally I try not to access a value immediately after loading it into an xmm register, so I rearrange the code a little, making it more or less unreadable.

And I do the usual MMX things, like not confusing the branch prediction with conditional jumps.

The following is a nasm SSE2 version of the Xvid sad16 that I've been playing with. It really does work on my P4, and I use it, but it doesn't give any measurable performance improvement. I haven't released it because some versions of nasm don't seem to assemble all SSE2 instructions correctly. Gruel suggested I put in some nasm conditional-assembly logic to address this, and I don't know how to do that. And I've been hoping to make it faster somehow.

Remodeling, Pardon our dust.

- Tom



Code:
;===========================================================================
;
; uint32_t sad16_sse2(const uint8_t * const cur,
;					const uint8_t * const ref,
;					const uint32_t stride,
;					const uint32_t best_sad);
;
; experimental!  Since most data are unaligned it may not be faster to use sse2
; but we'll find out (probably is). trbarry 03/17/2002
;
;===========================================================================

; Macro to accum sum of abs diff's and bump esi
; This one does not assume alignment

%macro sumdiffsse2U 0-1 1
%if %1 
		prefetchnta [esi+2*ecx]  
%endif
		movdqu	xmm2, [esi+edi]	; ref2
		movdqu	xmm0, [esi]		; ref
		psadbw	xmm0, xmm2		; xmm0 = |ref - ref2|
		paddusw	xmm6, xmm0		; sum += xmm0
		lea		esi, [esi+ecx]	; bump ptrs by one row
%endmacro

; macro to accum sum of abs diff's and bump esi
; This one does assume 16 byte alignment

%macro sumdiffsse2A 0-1 1 
%if %1	
		prefetchnta [esi+ecx]  
%endif
		movdqu	xmm0, [esi+edi]	; ref
		psadbw	xmm0, [esi]		; ref2			; this one must be 16 byte aligned
		paddusw	xmm6, xmm0		; sum += xmm0
		lea		esi, [esi+ecx]	; bump ptrs
%endmacro

%macro Accum3moreA 0 
		add		edi, eax			; now offset from 3 more rows
		add		esi, eax
		
		psadbw	xmm0, xmm1			; dif 0, ...
		paddusw	xmm6, xmm0			; sum
		
		movdqu	xmm0, [edi]			; 3, 6, 9, 12 
		movdqa	xmm1, [esi]	   

		psadbw  xmm2, xmm3			; dif 1, ...
		paddusw	xmm6, xmm2			; sum

		movdqu	xmm2, [edi+ecx]		; 4, 7, 10, 13
		movdqa	xmm3, [esi+ecx]		;   
		
		psadbw	xmm4, xmm5			; dif 2, ...
		paddusw	xmm6, xmm4			; sum += xmm4

		movdqu	xmm4, [edi+2*ecx]	; 5, 8, 11, 14 
		movdqa	xmm5, [esi+2*ecx]	;   

%endmacro

align 16
cglobal sad16_sse2
sad16_sse2:

		push esi
		push edi
		push ebx

		mov esi, [esp + 12 + 4]		; ref
		mov edi, [esp + 12 + 8]		; cur
		mov ecx, [esp + 12 + 12]	; stride
		pxor xmm6, xmm6				; xmm6 = sum = 0
		
		test	esi, 0x0000000f     ; esi 16 byte aligned already?
		jz		Aligned_OK			; y
		test	edi, 0x0000000f		; n, how about edi?
		jz		Aligned_Almost		; good, use this one instead

Not_Aligned:						; not sure this ever happens anymore

		sub		edi, esi			; carry as offset
		sumdiffsse2U  
		sumdiffsse2U 
		sumdiffsse2U 
		sumdiffsse2U 

		sumdiffsse2U 
		sumdiffsse2U 
		sumdiffsse2U 
		sumdiffsse2U 
		
		sumdiffsse2U 
		sumdiffsse2U 
		sumdiffsse2U 
		sumdiffsse2U 
		
		sumdiffsse2U 
		sumdiffsse2U 
		sumdiffsse2U 0
		sumdiffsse2U 0 
		jmp		Continue

Aligned_Almost:
		xchg	esi, edi			; parm usage symmetrical so just swap them

Aligned_OK:

; 0-2
		movdqu	xmm0, [edi]			; 0 ref2
		movdqa	xmm1, [esi]			;   ref

		movdqu	xmm2, [edi+ecx]		; 1 
		movdqa	xmm3, [esi+ecx]	

		movdqu	xmm4, [edi+2*ecx]	; 2 
		movdqa	xmm5, [esi+2*ecx]	;   

; 3-5
		mov		eax, ecx
		lea		eax, [eax+2*ecx]	; 3*ecx

		Accum3moreA

; 6-8
		Accum3moreA

; 9-11
		Accum3moreA

; 12-14
		Accum3moreA

; 15
		psadbw	xmm0, xmm1			; dif 12
		paddusw	xmm6, xmm0			; sum
		
		movdqu	xmm0, [edi+eax]		; 15 (edi is at row 12, eax = 3*stride)
		movdqa	xmm1, [esi+eax]

		psadbw  xmm2, xmm3			; dif 13
		paddusw	xmm6, xmm2			; sum
		
		psadbw	xmm4, xmm5			; dif 14
		paddusw	xmm6, xmm4			; sum += xmm4

		psadbw	xmm0, xmm1			; dif 15
		paddusw	xmm6, xmm0			; sum

Continue:
		movdqa	xmm0, xmm6
		punpckhqdq  xmm0, xmm0
		paddusw	xmm6, xmm0
;		movd	eax, xmm6			; some nasm versions do not assemble this correctly
		movdq2q	mm6, xmm6			; workaround: copy the low qword to mm6
		
		movd    eax, mm6
		pop		ebx
		pop 	edi
		pop 	esi
		emms

		ret
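
The alignment dispatch at the top of the routine can be sketched in C as follows. `is_aligned16` and `put_aligned_first` are hypothetical helper names, not part of Xvid; the sketch only mirrors the `test`/`jz`/`xchg` logic and relies on SAD being symmetric in its two inputs. Row addresses stay aligned only if the stride is also a multiple of 16, which is assumed here as it is in the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero when p sits on a 16-byte boundary (same check as "test esi, 0xf"). */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/*
 * Hypothetical helper: if either pointer is 16-byte aligned, make sure the
 * aligned one ends up in *a (swapping if needed) and return 1, so the caller
 * can take the aligned path. Return 0 if neither is aligned.
 */
static int put_aligned_first(const uint8_t **a, const uint8_t **b)
{
    assert(a && b);
    if (is_aligned16(*a))
        return 1;                   /* Aligned_OK */
    if (is_aligned16(*b)) {         /* Aligned_Almost: swap the operands */
        const uint8_t *t = *a;
        *a = *b;
        *b = t;
        return 1;
    }
    return 0;                       /* Not_Aligned */
}
```

A caller would run the aligned loop when this returns 1 and the fully unaligned loop otherwise.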

Last edited by trbarry; 11th April 2002 at 16:01.
11th April 2002, 16:09   #3
-h
Kilted Yaksman
Join Date: Oct 2001
Location: South Carolina
Posts: 1,303
Ah I see, prefetch a line, 1 line before it's actually used. I like it.
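
As a rough C illustration of that idea, the sketch below issues a non-temporal prefetch for the next reference row while the current one is being summed. `sad16_prefetch` is a hypothetical name and the intrinsics are a stand-in for the macros in the listing; whether the prefetch helps at all has to be measured, as noted earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_prefetch */
#define HAVE_SSE2 1
#endif

/* 16x16 SAD that prefetches each reference row one row ahead of its use. */
static uint32_t sad16_prefetch(const uint8_t *cur, const uint8_t *ref,
                               uint32_t stride)
{
    assert(stride >= 16);
#ifdef HAVE_SSE2
    __m128i sum = _mm_setzero_si128();
    for (int y = 0; y < 16; y++) {
        if (y < 15)  /* no row follows the last one */
            _mm_prefetch((const char *)(ref + stride), _MM_HINT_NTA);
        __m128i r = _mm_loadu_si128((const __m128i *)ref);
        __m128i c = _mm_loadu_si128((const __m128i *)cur);
        sum = _mm_adds_epu16(sum, _mm_sad_epu8(r, c));
        cur += stride;
        ref += stride;
    }
    /* add the two 64-bit halves to get the final sum */
    sum = _mm_add_epi64(sum, _mm_srli_si128(sum, 8));
    return (uint32_t)_mm_cvtsi128_si32(sum);
#else
    /* no SSE2: plain scalar loop, no prefetch */
    uint32_t total = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            total += (uint32_t)abs((int)cur[x] - (int)ref[x]);
        cur += stride;
        ref += stride;
    }
    return total;
#endif
}
```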

Well, my system at work is a P4 (and Fridays are slow), so I'll have a tinker and see what can be done.

I know things like quantization will work quite well with sse2 (dequantization will benefit from even normal sse - pminsw/pmaxsw for saturation stuff), hopefully some fun code will get committed by the weekend.
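
The saturation step mentioned here can be sketched with the SSE2 (xmm) forms of `pmaxsw`/`pminsw`. `clamp8_s16` is a hypothetical helper, and the [-2048, 2047] range in the usage note below is an assumption about the coefficient range, not something stated in the thread.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Clamp eight signed 16-bit values in place to the range [lo, hi]. */
static void clamp8_s16(int16_t *v, int16_t lo, int16_t hi)
{
    assert(lo <= hi);
#ifdef HAVE_SSE2
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    x = _mm_max_epi16(x, _mm_set1_epi16(lo));  /* pmaxsw: raise to lo */
    x = _mm_min_epi16(x, _mm_set1_epi16(hi));  /* pminsw: cap at hi   */
    _mm_storeu_si128((__m128i *)v, x);
#else
    /* no SSE2: scalar fallback */
    for (int i = 0; i < 8; i++) {
        if (v[i] < lo) v[i] = lo;
        if (v[i] > hi) v[i] = hi;
    }
#endif
}
```

Calling `clamp8_s16(coeff, -2048, 2047)` on each group of eight dequantized coefficients would replace a compare-and-branch per value with two instructions per eight values.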

-h
11th April 2002, 17:10   #4
trbarry
Registered User
Join Date: Oct 2001
Location: Gainesville FL USA
Posts: 2,091
If you think it is time to start releasing SSE2 code, I've also got an SSE2 Xvid IDCT to release. It was originally contributed by DmitryR for DVD2AVI, where I added and tested it, but I also use it in Xvid.

- Tom

edit: Huh, Friday? I had forgotten where you were.
All times are GMT +1.