4th April 2006, 08:07 | #1 | Link |
結城有紀
Join Date: Dec 2003
Location: NJ; OR; Shanghai
Posts: 894
|
[help]SSE optimized function runs slower
I'd like to calculate the distance between two colors in RGB colorspace. I wrote two functions: one uses sqrt, the other uses SSE2 code. In my 2000-frame test, the former takes 39 sec while the optimized code takes 48 sec. In theory the SSE code should run much faster than the ordinary code, so I suspect there is a bottleneck in my code. Please help me find the problem. TIA.
My code:
Code:
    if(sse)
        diff += _sse2_dist((float)(tr - r), (float)(tg - g), (float)(tb - b));
    else
        diff += sqrt((float)(tr - r) * (tr - r) + (tg - g) * (tg - g) + (tb - b) * (tb - b));

    float _sse2_dist(float a, float b, float c)
    {
        __m128 x, s, r;
        _MM_ALIGN16 float flo[4] = {0.0};
        flo[0] = a;
        flo[1] = b;
        flo[2] = c;
        x = _mm_load_ps(flo);
        s = _mm_mul_ps(x, x);
        r = _mm_add_ss(s, _mm_movehl_ps(s, s));
        r = _mm_add_ss(r, _mm_shuffle_ps(r, r, 1));
        r = _mm_sqrt_ps(r);
        _mm_store_ss(flo, r);
        return flo[0];
    }
MeteorRain |
4th April 2006, 09:02 | #2 | Link |
Registered User
Join Date: Jan 2002
Location: San Jose, CA
Posts: 216
|
In this case, most of your time will be spent in the square root and the call overhead (not including store-to-load forwarding issues from storing three 32-bit values immediately before loading them as one packed value). The overhead of the SSE setup is killing any benefit over regular floating point.
The whole point of SIMD is to process data in parallel. You'd be better off re-arranging your data so you have 3 separate R, G, and B planes, so you can then process 4 pixels simultaneously, without having to get around the lack of horizontal operations. |
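To illustrate the planar layout suggested above, here is a minimal sketch, not the code from this thread. It assumes the pixels are already stored as three separate float arrays, and the function name sum_dist_planar and its parameters are made up for the example. Each loop iteration handles 4 pixels, so no horizontal add is needed until the very end.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: sums the Euclidean distance between every pixel and a
   target color (tr, tg, tb). Assumes planar float data in r[], g[], b[]. */
float sum_dist_planar(const float *r, const float *g, const float *b,
                      float tr, float tg, float tb, size_t n)
{
    const __m128 vtr = _mm_set1_ps(tr);
    const __m128 vtg = _mm_set1_ps(tg);
    const __m128 vtb = _mm_set1_ps(tb);
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;

    /* Main loop: 4 pixels per iteration, one pixel per SSE lane. */
    for (; i + 4 <= n; i += 4) {
        __m128 dr = _mm_sub_ps(_mm_loadu_ps(r + i), vtr);
        __m128 dg = _mm_sub_ps(_mm_loadu_ps(g + i), vtg);
        __m128 db = _mm_sub_ps(_mm_loadu_ps(b + i), vtb);
        __m128 d2 = _mm_add_ps(_mm_mul_ps(dr, dr),
                    _mm_add_ps(_mm_mul_ps(dg, dg), _mm_mul_ps(db, db)));
        acc = _mm_add_ps(acc, _mm_sqrt_ps(d2));  /* 4 square roots at once */
    }

    /* Single horizontal sum, done once after the loop. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Scalar tail for the last n % 4 pixels. */
    for (; i < n; ++i) {
        float dr = r[i] - tr, dg = g[i] - tg, db = b[i] - tb;
        float d2 = dr * dr + dg * dg + db * db;
        sum += _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(d2)));
    }
    return sum;
}
```

The unaligned loads (_mm_loadu_ps) are used only so the sketch works on any pointer; with 16-byte aligned planes, _mm_load_ps could be used instead.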
4th April 2006, 15:21 | #4 | Link |
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
|
Step 1: move the "if (sse)" test outside your outermost loop.
Step 2: as Sulik says, try to process entire rows of data, i.e. pass in 2 pointers and a count, have your SSE code pick up 4 pixels at a time, and pipeline the algorithm.
Think very hard about whether you can express your algorithm without doing the SQRTs. If you are just doing comparisons, then the squared values are just as good as the values themselves, i.e. for non-negative A and B, if A > B then A*A > B*B.
Also, when using intrinsics, always ask for an ASM listing from the compiler. You will probably find the compiler is loading and storing the XMM registers between intrinsic calls, which of course defeats the purpose of using SSE instructions. The latest compiler is much better but can still do some very stupid things. Shuffling your code slightly may help unconfuse the compiler. To get the ultimate speed you may need to use assembler. |
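As a rough illustration of the "no SQRT" point above, here is a small sketch under assumptions of my own: the input is a packed RGB row with 3 bytes per pixel, and the names dist2_rgb and count_within_radius are hypothetical. It tests each pixel against a radius by comparing squared distances, so no square root is computed at all.

```c
#include <stddef.h>

/* Squared RGB distance: integer math only, no square root. */
static int dist2_rgb(int r, int g, int b, int tr, int tg, int tb)
{
    int dr = r - tr, dg = g - tg, db = b - tb;
    return dr * dr + dg * dg + db * db;
}

/* Hypothetical helper: counts the pixels of a packed RGB row (3 bytes per
   pixel, an assumed layout) that lie within `radius` of the target color. */
size_t count_within_radius(const unsigned char *rgb, size_t n,
                           int tr, int tg, int tb, int radius)
{
    const int radius2 = radius * radius;  /* square the threshold once */
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        const unsigned char *p = rgb + 3 * i;
        /* dist <= radius is equivalent to dist^2 <= radius^2,
           because both sides are non-negative. */
        if (dist2_rgb(p[0], p[1], p[2], tr, tg, tb) <= radius2)
            ++count;
    }
    return count;
}
```

Note that this only helps when each distance is compared against something. If the distances themselves are accumulated, as in "diff += sqrt(...)", a sum of squared distances is a different quantity from a sum of distances, so the square root cannot simply be dropped there.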
5th April 2006, 01:11 | #5 | Link |
結城有紀
Join Date: Dec 2003
Location: NJ; OR; Shanghai
Posts: 894
|
I rewrote almost the whole code: I removed the point matrix from my old code, read the points directly from the source data, and rearranged the algorithm. The results are the same as before, but the speed has gone up noticeably!
Thanks to everyone above! =============== Yeah, it cost me several hours to debug. :| Last edited by MeteorRain; 5th April 2006 at 01:14. |