SSE(2) Float to unsigned 16 bit shorts.


sh0dan
20th August 2009, 10:45
I need a quick way to convert float values to 16 bit unsigned integers. I currently use this:

minps xmm0, [65535.f] // Saturate (Latency: 3 on C2, Throughput: 1)
cvtps2dq xmm0, xmm0 // Convert to dwords (Latency: 3 on C2, Throughput: 1)
movdqa xmm1, xmm0 // Copy (L:1 T:0.5)
pcmpgtd xmm1, [zeroes] // if (xmm1 > 0) xmm1 = ones (L:1 T:1)
pand xmm1,xmm0 // Result in xmm1 in dwords (L:1 T:1)
(then shuffle the low word of each dword in xmm1 together, so the four 16-bit results end up in the lower 64 bits of xmm1)


An SSE 4.1 implementation is much simpler, as it has "packusdw", avoiding everything but the actual conversion.

Does anyone have a more efficient way for SSE2? I don't care that much about latency, but throughput is important.
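
For anyone who prefers intrinsics, the sequence above might look roughly like this in C. This is only a sketch: the function name float4_to_u16_clamp is made up for illustration, and the three-shuffle gather at the end is one possible way to do the "shuffle" step, not necessarily the one used in the original code.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Convert 4 floats to 4 unsigned 16-bit values clamped to [0, 65535].
   Hypothetical helper that mirrors the min / convert / mask / shuffle
   sequence from the post. */
static void float4_to_u16_clamp(const float *src, uint16_t *dst)
{
    assert(src && dst);
    __m128 f = _mm_loadu_ps(src);
    f = _mm_min_ps(f, _mm_set1_ps(65535.0f));               /* minps: clamp high end */
    __m128i v = _mm_cvtps_epi32(f);                         /* cvtps2dq */
    __m128i pos = _mm_cmpgt_epi32(v, _mm_setzero_si128());  /* pcmpgtd: mask of v > 0 */
    v = _mm_and_si128(v, pos);                              /* pand: negatives become 0 */

    /* Gather the low word of each dword into the low 64 bits:
       words (0,1,2,3) -> (0,2,1,3) in each half, then dwords (0,1,2,3) -> (0,2,1,3). */
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(3, 1, 2, 0));
    v = _mm_shufflehi_epi16(v, _MM_SHUFFLE(3, 1, 2, 0));
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 1, 2, 0));
    _mm_storel_epi64((__m128i *)dst, v);                    /* store the 4 results */
}
```

Counting the shuffles, that is eight instructions for four values, which is why a shorter route is worth looking for.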

Dark Shikari
20th August 2009, 11:18
Why not something like this?

cvtps2dq xmm0, xmm0
psubd xmm0, [32768] //Convert to signed
packssdw xmm0, xmm0
paddw xmm0, [32768] //Convert back to unsigned

Total inverse throughput cost on Core 2 Conroe: 1+1+2+1 = 5.

(untested)
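
To see why the bias trick saturates correctly at both ends, here is a scalar model of a single element in plain C. It is only an illustration: sat_s16 and bias_pack_u16 are hypothetical names, and sat_s16 stands in for what packssdw does to each dword.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of packssdw for one element: saturate a signed 32-bit
   value to the signed 16-bit range. */
static int16_t sat_s16(int32_t v)
{
    if (v < -32768) return -32768;
    if (v > 32767) return 32767;
    return (int16_t)v;
}

/* Scalar model of the bias / signed pack / unbias sequence. */
static uint16_t bias_pack_u16(int32_t v)
{
    assert(v >= INT32_MIN + 32768);       /* keep the subtraction from overflowing */
    int16_t packed = sat_s16(v - 32768);  /* subtract the bias, then signed saturation */
    /* Add 32768 back modulo 2^16, as paddw would. */
    return (uint16_t)((uint16_t)packed + 0x8000u);
}
```

Inputs below 0 hit the -32768 floor and come back as 0, inputs above 65535 hit the 32767 ceiling and come back as 65535, and everything in between passes through unchanged.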

akupenguin
20th August 2009, 11:25
cvtps2dq xmm0, xmm0 // 3/1/p1 (latency / inverse throughput / execution unit(s), on Conroe)
cvtps2dq xmm1, xmm1 // 3/1/p1
psubd xmm0, [pd_32768] // 1/.5/p05
psubd xmm1, [pd_32768] // 1/.5/p05
packssdw xmm0, xmm1 // 4/2/p0
pxor xmm0, [pw_32768] // 1/.33/p015
// total latency: 10 cycles
// total inverse throughput: 3 cycles per 8 elements (if you can schedule it right)

untested
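
The same eight-element sequence could be written with SSE2 intrinsics roughly as follows. Treat it as a sketch under a few assumptions: the function name floats_to_u16 is invented here, the loads and stores are unaligned, n is assumed to be a multiple of 8, and the default round-to-nearest mode is in effect for the conversion.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Convert n floats to unsigned 16-bit values with saturation to [0, 65535].
   Hypothetical helper; n must be a multiple of 8. */
static void floats_to_u16(const float *src, uint16_t *dst, size_t n)
{
    assert(n % 8 == 0);
    const __m128i pd_32768 = _mm_set1_epi32(32768);   /* bias in each dword */
    const __m128i pw_32768 = _mm_set1_epi16(-32768);  /* 0x8000 in each word */

    for (size_t i = 0; i < n; i += 8) {
        __m128i a = _mm_cvtps_epi32(_mm_loadu_ps(src + i));      /* cvtps2dq */
        __m128i b = _mm_cvtps_epi32(_mm_loadu_ps(src + i + 4));  /* cvtps2dq */
        a = _mm_sub_epi32(a, pd_32768);                          /* psubd */
        b = _mm_sub_epi32(b, pd_32768);                          /* psubd */
        __m128i w = _mm_packs_epi32(a, b);                       /* packssdw */
        w = _mm_xor_si128(w, pw_32768);                          /* pxor: flip the sign bit */
        _mm_storeu_si128((__m128i *)(dst + i), w);
    }
}
```

Flipping the top bit of each word with pxor gives the same result as adding 32768 modulo 2^16, which is why either instruction works for the final step.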

sh0dan
20th August 2009, 12:24
Thanks a bunch - I'll try it out tonight!

@aku: I assume "pw_32768" is 0x8000

akupenguin
20th August 2009, 12:29
sh0dan wrote: I assume "pw_32768" is 0x8000
yes. (and pw=words, pd=dwords)

sh0dan
21st August 2009, 10:16
Thanks - it worked like a charm! I did use the xor trick - it's just sexy :)

Sometimes you can just stare yourself blind at a problem! :confused: