MMX, SSE2, Intel c++ etc :)

dragongodz

28th December 2004, 11:55

decided to start a new thread here instead of continuing the discussion about it in the Nics xvid thread, since this is really more of a general development discussion.

basically the discussion was about the Intel C++ compiler not using SSE2 on an Athlon64, even though the Athlon64 has SSE2, when code is built with the QxN or QaxN options. these both cause a check for a real Intel cpu as well as SSE2 capability.

now i have found out a little more about this so here is a post from the Intel C++ forum about it.
http://softwareforums.intel.com/ids/board/message?board.id=16&message.id=2190
so QxW (or, i am assuming, QaxW) should use SSE2 on an Athlon64 then.

another interesting little aside for SSE2 is this post
http://softwareforums.intel.com/ids/board/message?board.id=16&message.id=2209
so SSE2 is not always faster. :)

Joe Fenton

29th December 2004, 02:56

The -QxW switch, for 8.1 compilers, generates much the same code as -QxN, but leaves the responsibility for correct support of SSE2 to the user.

Gee, nice of them to finally allow people to generate code for the AMD64. It only took until 8.1 to do it, and notice that it merely short circuits all tests. They just dumped responsibility for checking the CPU and features onto the programmer. :rolleyes:

Oh well, it's probably better to rely on the programmer than to rely on Intel. :p ;)
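
For readers who want to do that check themselves, here is a minimal sketch in C, assuming GCC or Clang on x86 (the `<cpuid.h>` helper `__get_cpuid` is specific to those compilers). It tests the SSE2 feature bit (CPUID leaf 1, EDX bit 26) and reads the vendor string separately, so the decision can be based on the feature alone. The function names `cpu_has_sse2` and `cpu_vendor` are made up for this illustration and are not part of any compiler or Intel library.

```c
#include <string.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Returns 1 if CPUID reports SSE2 (leaf 1, EDX bit 26), whatever the vendor.
   Returns 0 on non-x86 builds or if leaf 1 is unavailable. */
static int cpu_has_sse2(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((edx >> 26) & 1u);
#else
    return 0;
#endif
}

/* Copies the 12-character vendor string (e.g. "GenuineIntel",
   "AuthenticAMD") into out; out is left empty on non-x86 builds. */
static void cpu_vendor(char out[13])
{
    memset(out, 0, 13);
#if defined(__i386__) || defined(__x86_64__)
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
        /* The vendor string is stored in EBX, EDX, ECX, in that order. */
        memcpy(out + 0, &ebx, 4);
        memcpy(out + 4, &edx, 4);
        memcpy(out + 8, &ecx, 4);
    }
#endif
}
```

A dispatcher that looks only at `cpu_has_sse2()` picks the SSE2 path on any CPU that reports the feature. One that also compares the vendor string against "GenuineIntel" would skip it on an Athlon64, which is the behaviour described above for QxN/QaxN.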

As to the thread on the speed of MMX and SSE2, you'll notice a few things: all tested systems were Intel powered; the new P4 showed no difference, the older P4 showed a slight edge to MMX, and the Centrino gave a big edge to MMX. Given the routine being performed, that is to be expected.

Basically, the SSE2 routine fetched 128 bits from two sources and added them to an accumulated value. The MMX routine fetched the data 64 bits at a time. So the older the system, the more biased the results were towards fetching the data 64 bits at a time. This is a bus limitation rather than a limitation of the execution units doing the addition. It's possible that doing 64 bit fetches with the SSE2 code would have made the SSE2 the exact same speed. Knowing the system's bus architecture can allow you to tailor your code for the best speed possible.
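
To make the two fetch patterns concrete, here is a rough sketch using SSE2 intrinsics. It is not the code from the Intel forum thread, which is not reproduced here. The function names are invented, and unaligned 128-bit loads are used so the sketch is safe on any buffer. Both variants use the same SSE2 instruction for the arithmetic (PSADBW via `_mm_sad_epu8`) and differ only in how the data is fetched.

```c
#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* SAD of 16 bytes using one 128-bit load per source. */
static unsigned sad16_load128(const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* PSADBW leaves one partial sum in each 64-bit half. */
    __m128i s = _mm_sad_epu8(va, vb);
    return (unsigned)_mm_cvtsi128_si32(s) +
           (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(s, 8));
}

/* Same result, but each source is fetched as two 64-bit halves. */
static unsigned sad16_load64(const uint8_t *a, const uint8_t *b)
{
    __m128i a_lo = _mm_loadl_epi64((const __m128i *)a);
    __m128i a_hi = _mm_loadl_epi64((const __m128i *)(a + 8));
    __m128i b_lo = _mm_loadl_epi64((const __m128i *)b);
    __m128i b_hi = _mm_loadl_epi64((const __m128i *)(b + 8));
    /* The 64-bit loads zero the upper half, so only the low sum matters. */
    __m128i s = _mm_add_epi32(_mm_sad_epu8(a_lo, b_lo),
                              _mm_sad_epu8(a_hi, b_hi));
    return (unsigned)_mm_cvtsi128_si32(s);
}

/* SAD of a 16x16 block; use64 selects the 64-bit-load variant. */
unsigned sad16x16(const uint8_t *a, const uint8_t *b, size_t stride, int use64)
{
    unsigned total = 0;
    for (int y = 0; y < 16; y++) {
        total += use64 ? sad16_load64(a, b) : sad16_load128(a, b);
        a += stride;
        b += stride;
    }
    return total;
}
```

Timing `sad16x16(a, b, stride, 0)` against `sad16x16(a, b, stride, 1)` on a given machine would be one way to check whether the load width, and not the SSE2 arithmetic, accounts for the difference.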

The cases where such a situation occurs are very limited. You'll notice the requirements of the test - two blocks of 16 byte aligned data with a specific stride that fit the cache just right. So you may find some tests that seem to indicate that MMX is faster than SSE2, but real-life applications usually can't benefit from this.

dragongodz

29th December 2004, 11:25

Originally posted by Joe Fenton
It's possible that doing 64 bit fetches with the SSE2 code would have made the SSE2 the exact same speed.

maybe and maybe not. let's assume it does though: it's still not faster, and isn't that one of the things Intel touted about SSE2, that it's faster? so as i said, SSE2 is not always faster.

Originally posted by Joe Fenton
So you may find some tests that seem to indicate that MMX is faster than SSE2, but real-life applications usually can't benefit from this.

well, considering that the example is a SAD comparison, it actually does show a function that is used in real-life applications. SAD can be used in avcodec applications (QuEnc, FFDshow etc) and is also used in Xvid. i would call those real-life enough for me. :D

diehardii

29th December 2004, 17:51

An interesting topic. I've just started using the Intel compiler and the IPP and I have to say I'm very impressed. The AMD thing is a bit irritating, but as I'm developing for our lab it won't impact us (Dell is our "approved" vendor). This is something to keep in mind however. Thanks for the heads up.

~Steve

Joe Fenton

30th December 2004, 03:06

Originally posted by dragongodz
maybe and maybe not. let's assume it does though: it's still not faster, and isn't that one of the things Intel touted about SSE2, that it's faster? so as i said, SSE2 is not always faster.

well, considering that the example is a SAD comparison, it actually does show a function that is used in real-life applications. SAD can be used in avcodec applications (QuEnc, FFDshow etc) and is also used in Xvid. i would call those real-life enough for me. :D

Look at the calculation being done - SAD. Yes, it's used in quite a few places, but it's hardly a function you NEED to even use MMX or SSE2 on. I bet that a plain assembly language version wouldn't be too much slower. :D
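
For reference, this is roughly what the non-SIMD baseline looks like, written as a plain C sketch in place of hand-written assembly. The function name and parameters are illustrative only and are not taken from Xvid or libavcodec.

```c
#include <stdint.h>
#include <stddef.h>

/* Plain scalar SAD over a w x h block.
   stride is the distance in bytes between the starts of consecutive rows. */
unsigned sad_scalar(const uint8_t *a, const uint8_t *b,
                    size_t stride, int w, int h)
{
    unsigned total = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int d = (int)a[x] - (int)b[x];
            total += (unsigned)(d < 0 ? -d : d);   /* |a - b| */
        }
        a += stride;
        b += stride;
    }
    return total;
}
```

For a 16x16 macroblock this is 256 subtract/absolute/add steps per call, which is the work the MMX and SSE2 versions process 8 or 16 bytes at a time.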

Where SSE2 shines are those functions that MMX doesn't give you, but accelerate your video/audio/whatever processing. That was my whole point - that thread was rather useless in trying to determine if SSE2 is worth using compared to MMX. It tested one function that really makes no difference in speed. It was like making two assembly language functions that used LSL Rn,1 vs ADD Rn,Rn to multiply an integer by two and arguing over which was faster. :cool:

dragongodz

30th December 2004, 04:21

Originally posted by Joe Fenton
it's hardly a function you NEED to even use MMX or SSE2 on. I bet that a plain assembly language version wouldn't be too much slower

but it would still be slower, just as the SSE2 version was slower. since it's also called a hell of a lot in encoding, i kind of think any speed gained from it is a good thing. :)

Originally posted by Joe Fenton
Where SSE2 shines are those functions that MMX doesn't give you,

of course.

Originally posted by Joe Fenton
that thread was rather useless in trying to determine if SSE2 is worth using compared to MMX. It tested one function that really makes no difference in speed.

no, i posted it to show that SSE2 is not ALWAYS the faster to use. at no time did i say SSE2 should never be used. since there WAS a difference in speed for a real and used function, i thought this would be of interest to people.

Joe Fenton

31st December 2004, 00:17

Originally posted by dragongodz
... i posted it to show that SSE2 is not ALWAYS the faster to use. at no time did i say SSE2 should never be used. since there WAS a difference in speed for a real and used function i thought this would be of interest to people.

I'm saying the difference was the load instructions, not the SSE2 or MMX processing instructions. Like I said earlier, I'd bet that doing 64-bit loads with the SSE2 routine would make it the same speed as the MMX routine on all the systems tested.

Peter Cheat

9th January 2005, 12:18

The added instructions for MMX, MMX2, SSE and SSE2 have very specific applications. Programmers are expected to know which optimisations to use for a specific segment of code. So there will definitely be instances where SSE2, or SSE will reduce performance rather than increase it.

trbarry

9th January 2005, 14:57

Originally posted by Joe Fenton
Look at the calculation being done - SAD. Yes, it's used in quite a few places, but it's hardly a function you NEED to even use MMX or SSE2 on. I bet that a plain assembly language version wouldn't be too much slower.

A few years ago I played with Xvid source a bit, adding SSE2 IDCT & SAD. IDCT was being added by someone else anyway so I tried to optimize SAD. I never could get any significant speedup and never released it.

Also, when I wrote my TomsMoComp deinterlace filter I made no less than 3 different SSE2 versions. I never released any of them, for the same reason. The speedup was just not enough to justify the complexity of having yet another version to maintain. However, this was on an old 1.7 GHz RDRAM P4 and it was likely a bit more memory bus bound than it might be these days.

Generally I tend to use SSE2 only when the memory access pattern is linear, 16 byte aligned, and well behaved. If things are dancing all over the place then I'll likely stick with SSE.

- Tom

hank315

9th January 2005, 17:39

Originally posted by trbarry
A few years ago I played with Xvid source a bit, adding SSE2 IDCT & SAD. IDCT was being added by someone else anyway so I tried to optimize SAD. I never could get any significant speedup and never released it.

The same experience here: using SSE2 for SAD in a motion vector search doesn't bring much over plain MMX code.
I'm using a P4 3.2 GHz, 800 FSB and dual channel memory.
The reason why SSE2 isn't much faster, IMHO, is just the memory misalignment in the motion vector search process: the source MB may be 16 byte aligned, but the candidates will certainly not be aligned.
But I have also used SSE2 to get the SAD of two luminance planes which are both 16 byte aligned; in this case SSE2 runs almost twice as fast as MMX.
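
The motion search case can be sketched as follows, with an aligned load for the source macroblock and an unaligned load for the candidate. This is only an illustration with an invented function name, not hank315's encoder code. It assumes the source pointer is 16 byte aligned and that `src_stride` is a multiple of 16 so every source row stays aligned.

```c
#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* SAD between a 16-byte-aligned 16x16 source macroblock and a candidate
   block located at an arbitrary address in the reference frame. */
unsigned sad16x16_candidate(const uint8_t *src, size_t src_stride,
                            const uint8_t *cand, size_t cand_stride)
{
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < 16; y++) {
        __m128i s = _mm_load_si128((const __m128i *)src);    /* aligned load (MOVDQA) */
        __m128i c = _mm_loadu_si128((const __m128i *)cand);  /* unaligned load (MOVDQU) */
        /* Each 64-bit half of acc accumulates its own partial sum. */
        acc = _mm_add_epi32(acc, _mm_sad_epu8(s, c));
        src += src_stride;
        cand += cand_stride;
    }
    /* Fold the two halves into one total. */
    return (unsigned)_mm_cvtsi128_si32(acc) +
           (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
}
```

In a search loop the candidate pointer moves one pixel at a time, so only about one position in sixteen happens to be aligned and the unaligned load is the common case.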

Joe Fenton

9th January 2005, 21:11

That's why one feature of SSE3 is to speed up unaligned access. It's to make SSE better for motion compensation, where alignment could be anything.

trbarry

11th January 2005, 05:26

It is true that I had to check for 16 byte alignment and handle the cases where one or both were not aligned properly. SSE3 might have made a big difference there but I've no experience with that.
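
A check of that kind might look like the sketch below. It is a guess at the general shape, not Tom's unreleased code, and the names are made up. For brevity it assumes `n` is a multiple of 16 and that the total fits in 32 bits.

```c
#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* True if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* SAD over n bytes of two buffers, taking the aligned-load path only
   when both pointers are 16-byte aligned. */
unsigned sad_plane(const uint8_t *a, const uint8_t *b, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    if (is_aligned16(a) && is_aligned16(b)) {
        for (size_t i = 0; i < n; i += 16) {
            __m128i va = _mm_load_si128((const __m128i *)(a + i));
            __m128i vb = _mm_load_si128((const __m128i *)(b + i));
            acc = _mm_add_epi32(acc, _mm_sad_epu8(va, vb));
        }
    } else {
        /* One or both pointers are misaligned: use unaligned loads. */
        for (size_t i = 0; i < n; i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            acc = _mm_add_epi32(acc, _mm_sad_epu8(va, vb));
        }
    }
    /* Fold the two 64-bit halves into one total. */
    return (unsigned)_mm_cvtsi128_si32(acc) +
           (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
}
```

A fuller version could also treat the "only one pointer misaligned" case separately, or fall back to MMX when nothing is aligned, which is the extra maintenance cost mentioned earlier in the thread.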

- Tom