Question about The Intel SAD Way (tm) [Archive]

lexor

21st September 2007, 01:41

Hey guys, some of the more advanced coders on these boards have probably seen this already, but in preparation for the next Core 2 refresh Intel has posted some code to show off SSE4. While that is of little concern to anyone right now, they are working from SSE2 code to show off the improvement.

As I am stuck with an SSE2 CPU for the foreseeable future, I was wondering if their code is any better for an SSE2 CPU than what x264 currently has (and if it is better, can it legally be used? They did just throw it out there for the purpose of helping others improve).

Here is the Motion Estimation (http://softwarecommunity.intel.com/articles/eng/1246.htm) article (scroll to bottom to see full function code for 4x4, 8x8 and 16x16 cases).

Sergey A. Sablin

21st September 2007, 01:58

how can SSE4 code be better for an SSE2 CPU if it's simply not supported on that platform?

there were already several questions about SSE4 and the new sad instructions - use search on SSE4.

lexor

21st September 2007, 02:01

What the hell are you talking about? Did you just hit quote and not read my post or something?

Sergey A. Sablin

21st September 2007, 02:19

What the hell are you talking about? Did you just hit quote and not read my post or something?

another hit and reply, just as a thanks for your patience - it is just plain old full search: read block, compare to another block.
and no, it is even worse than ESA as it is written using intrinsics, which will translate to stack usage (for the 16x16 case), as even the 64-bit x86-64 architecture has only 16 SSE2 regs.

use search once again - there were several questions about performance of ESA/SEA etc.
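For readers who want the "read block, compare to another block" description spelled out, here is a minimal scalar sketch of an exhaustive (full) search over a square window. It is not the code from Intel's article and not x264's; the function names `sad16x16_c` and `full_search_16x16` are made up for illustration, and the caller is assumed to guarantee that every candidate block lies inside the reference frame.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar sum of absolute differences between two 16x16 blocks.
   Hypothetical helper, not taken from Intel's article or from x264. */
static unsigned sad16x16_c(const uint8_t *cur, int cur_stride,
                           const uint8_t *ref, int ref_stride)
{
    unsigned sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += (unsigned)abs(cur[x] - ref[x]);
        cur += cur_stride;
        ref += ref_stride;
    }
    return sum;
}

/* Exhaustive search: compute the SAD at every offset in [-range, range]
   on both axes and keep the first offset with the smallest SAD.
   No bounds checking: all candidates must lie inside the reference frame. */
static unsigned full_search_16x16(const uint8_t *cur, int cur_stride,
                                  const uint8_t *ref, int ref_stride,
                                  int range, int *best_dx, int *best_dy)
{
    unsigned best = UINT_MAX;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            unsigned sad = sad16x16_c(cur, cur_stride,
                                      ref + dy * ref_stride + dx, ref_stride);
            if (sad < best) {
                best = sad;
                *best_dx = dx;
                *best_dy = dy;
            }
        }
    }
    return best;
}
```

The cost is (2*range + 1)^2 full SADs per block, which is why the thread treats this as the slowest baseline regardless of how the inner SAD is implemented.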

Dark Shikari

21st September 2007, 02:47

SSE4 SAD ESA is completely useless... x264's SEA algorithm is faster IIRC.

Sharktooth

21st September 2007, 02:52

SSE4 is a work in progress... however as it has been already said, they're quite useless for video encoding.
maybe SSE5 will help a bit but Intel won't support it (http://www.tcmagazine.com/comments.php?shownews=16129&catid=6)...

JohnnyMalaria

21st September 2007, 03:19

I'd be very happy if you could perform an 8x8 16-bit transpose with a single instruction. Or 8x8 16-bit multiplications/additions/subtractions. Surely these have applications in many areas. Or 4 x 32-bit *signed* multiplication (I haven't looked hard enough, maybe it exists in 4 or 5).

Oh well. Life stopped at SSE2 for me. Yet to see any benefit of 3, 4 or 5.

As for Intel's stance on SSE5, can you say "Not invented here....."
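To illustrate why a single-instruction transpose would be welcome: with SSE2 an 8x8 transpose of 16-bit values is usually built from three rounds of unpack instructions. The sketch below is one common way to write it with intrinsics; the function name `transpose8x8_16` is hypothetical, and a plain C fallback is used when the compiler does not define `__SSE2__`.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Transpose a row-major 8x8 matrix of 16-bit values.
   src and dst must not overlap. Hypothetical helper for illustration. */
static void transpose8x8_16(const int16_t *src, int16_t *dst)
{
#if defined(__SSE2__)
    __m128i r[8], a[8], b[8];
    for (int i = 0; i < 8; i++)
        r[i] = _mm_loadu_si128((const __m128i *)(src + 8 * i));

    /* Round 1: interleave 16-bit elements of adjacent row pairs. */
    for (int i = 0; i < 4; i++) {
        a[2 * i]     = _mm_unpacklo_epi16(r[2 * i], r[2 * i + 1]);
        a[2 * i + 1] = _mm_unpackhi_epi16(r[2 * i], r[2 * i + 1]);
    }

    /* Round 2: interleave 32-bit pairs, giving four rows per column. */
    b[0] = _mm_unpacklo_epi32(a[0], a[2]);
    b[1] = _mm_unpackhi_epi32(a[0], a[2]);
    b[2] = _mm_unpacklo_epi32(a[1], a[3]);
    b[3] = _mm_unpackhi_epi32(a[1], a[3]);
    b[4] = _mm_unpacklo_epi32(a[4], a[6]);
    b[5] = _mm_unpackhi_epi32(a[4], a[6]);
    b[6] = _mm_unpacklo_epi32(a[5], a[7]);
    b[7] = _mm_unpackhi_epi32(a[5], a[7]);

    /* Round 3: join 64-bit halves so each register holds a full column. */
    for (int i = 0; i < 4; i++) {
        _mm_storeu_si128((__m128i *)(dst + 8 * (2 * i)),
                         _mm_unpacklo_epi64(b[i], b[i + 4]));
        _mm_storeu_si128((__m128i *)(dst + 8 * (2 * i + 1)),
                         _mm_unpackhi_epi64(b[i], b[i + 4]));
    }
#else
    /* Portable fallback when SSE2 is not available at compile time. */
    for (int row = 0; row < 8; row++)
        for (int col = 0; col < 8; col++)
            dst[col * 8 + row] = src[row * 8 + col];
#endif
}
```

That is 24 unpack instructions plus loads and stores for one 8x8 block on the SSE2 path.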

Manao

21st September 2007, 05:56

SSE4 is a work in progress... however as it has been already said, they're quite useless for video encoding.
Erm, no. Quoting akupenguin:
SSE4.1 contains some useful instructions. However, I'm not sure the instruction touted as being designed for video compression (mpsadbw) is among the useful ones.
In particular, the PMOVxxx will speed up computations made on 16 bits (interpolation, explicit weights) and PBLENDxx the deblocking. Of course, nothing earth shattering there, but there hasn't been anything earth shattering since PSADBW...

Sergey A. Sablin

21st September 2007, 06:11

but there hasn't been anything earth shattering since PSADBW...

let's say SIMD in general for signal processing, and yeah, of course psadbw for hybrid video compression.

lexor

21st September 2007, 14:11

Thank you for your input guys. I guess my reasoning was that even though it does a dumb method, it does a lot of it at the same time in each iteration, so possibly a speed up.

another hit and reply, just as a thanks for your patience

use search once again - there were several questions about performance of ESA/SEA etc.

Don't you take that wounded righteousness tone with me. You didn't read my post. You (twice now) tell me to use search, even though no one has discussed this algorithm here before, I know since I keep up with these boards. Tell me how a comparison of any other algorithms to x264 is supposed to help me with evaluating performance of this one algorithm compared to x264's? You see when Bond says "use search", it means the question has been answered before, many times over. If the question hasn't been asked before, searching doesn't really help, does it?

You deserved every word of my retort, and much more.

JohnnyMalaria

21st September 2007, 14:46

how can SSE4 code be better for an SSE2 CPU if it's simply not supported on that platform?

there were already several questions about SSE4 and the new sad instructions - use search on SSE4.

lexor isn't suggesting using SSE4 code on an SSE2 processor. lexor's intent is clear - the Intel article provides SSE2 and SSE4 optimized code to demonstrate the alleged benefits of SSE4. lexor wants to know if the SSE2 code in the article is better than the SSE2 code currently being used. I think bothering to read the article in the context of lexor's question would have made that clear.

Dark Shikari

21st September 2007, 15:11

Don't you take that wounded righteousness tone with me. You didn't read my post. You (twice now) tell me to use search, even though no one has discussed this algorithm here before, I know since I keep up with these boards. Tell me how a comparison of any other algorithms to x264 is supposed to help me with evaluating performance of this one algorithm compared to x264's? You see when Bond says "use search", it means the question has been answered before, many times over. If the question hasn't been asked before, searching doesn't really help, does it?
Yes, it has been asked before. Yes, it's been asked before more than once. I know, because I've seen it answered before. More than once.

Like here (http://forum.doom9.org/showpost.php?p=1044394&postcount=8) and here (http://forum.doom9.org/showthread.php?t=124885&highlight=SSE4) and here (http://forum.doom9.org/showpost.php?p=1011901&postcount=180) and here (http://forum.doom9.org/showpost.php?p=993989&postcount=436) and here (http://forum.doom9.org/showpost.php?p=1043524&postcount=12) and ... :rolleyes:

JohnnyMalaria

21st September 2007, 15:47

The original question has nothing to do with how much better is SSE4 compared to SSE2.

RTFQ - the question is: Is the Intel article's SSE2-optimized code better than the SSE2 code in x264 for motion estimation? Nothing, nothing, nothing to do with SSE4 comparisons.

It's that simple - so why continue to preach to lexor about searching and searching and, oh let's see, searching??!??!?!?!?

Dark Shikari

21st September 2007, 15:51

Perhaps because pretty much everyone misunderstood the question, and the thread sort of became about SSE4... :sly:

To answer the original post, I highly doubt there is any SSE2 code out there that would be more than marginally better. x264 is pretty heavily optimized.

Manao

21st September 2007, 16:09

Intel's code isn't even optimized. It just relies on the compiler to optimize the SAD computation. So it's useless for x264.

Sergey A. Sablin

21st September 2007, 18:59

Don't you take that wounded righteousness tone with me. You didn't read my post.

it's kinda annoying when you keep going on like this, even when I have already answered your question.

pretty much every question that started from SSE4 eventually turned into the question of how well x264's full search is optimized. it has been said several times that SEA is 2-3 times faster than the conventional algorithm (ESA) implemented in SSE2, and faster than SSE4-optimized ESA; how both algorithms work has also been explained several times.

http://forum.doom9.org/showthread.php?p=990797#post990797
http://forum.doom9.org/showthread.php?p=993986#post993986

You (twice now) tell me to use search, even though no one has discussed this algorithm here before, I know since I keep up with these boards. Tell me how a comparison of any other algorithms to x264 is supposed to help me with evaluating performance of this one algorithm compared to x264's?

you don't expect somebody to want to discuss GCD by Euclid separately, do you? the "algorithm" (if one may call it an algorithm) is just brute force - it just can't be faster than _any_ other. There is no need to know assembly to understand this from the code - read block, compare it to every possible block in a search window - what can be slower?
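As background for the ESA versus SEA comparison: the successive elimination idea rests on the bound |sum(cur) - sum(cand)| <= SAD(cur, cand), so any candidate whose block-sum difference is already at least the best SAD found so far can be skipped without changing the result. The sketch below shows only that pruning step. It is not x264's implementation, the names are hypothetical, and it recomputes each candidate's block sum naively, whereas a practical version would obtain those sums incrementally so that the check is much cheaper than a SAD.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of all pixels in a 16x16 block (hypothetical helper). */
static unsigned block_sum16(const uint8_t *p, int stride)
{
    unsigned sum = 0;
    for (int y = 0; y < 16; y++, p += stride)
        for (int x = 0; x < 16; x++)
            sum += p[x];
    return sum;
}

/* Scalar 16x16 sum of absolute differences (hypothetical helper). */
static unsigned block_sad16(const uint8_t *a, int a_stride,
                            const uint8_t *b, int b_stride)
{
    unsigned sum = 0;
    for (int y = 0; y < 16; y++, a += a_stride, b += b_stride)
        for (int x = 0; x < 16; x++)
            sum += (unsigned)abs(a[x] - b[x]);
    return sum;
}

/* Full-window search with successive elimination.
   Returns the same best SAD and offset as a plain exhaustive search;
   *sads_computed reports how many full SADs were actually evaluated.
   No bounds checking: all candidates must lie inside the reference frame. */
static unsigned sea_search_16x16(const uint8_t *cur, int cur_stride,
                                 const uint8_t *ref, int ref_stride,
                                 int range, int *best_dx, int *best_dy,
                                 int *sads_computed)
{
    unsigned cur_sum = block_sum16(cur, cur_stride);
    unsigned best = UINT_MAX;
    *sads_computed = 0;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            const uint8_t *cand = ref + dy * ref_stride + dx;
            unsigned cand_sum = block_sum16(cand, ref_stride);
            unsigned bound = cand_sum > cur_sum ? cand_sum - cur_sum
                                                : cur_sum - cand_sum;
            if (bound >= best)
                continue; /* SAD >= bound >= best, so it cannot improve */
            unsigned sad = block_sad16(cur, cur_stride, cand, ref_stride);
            (*sads_computed)++;
            if (sad < best) {
                best = sad;
                *best_dx = dx;
                *best_dy = dy;
            }
        }
    }
    return best;
}
```

Because only candidates that cannot beat the current best are skipped, the quality is identical to the exhaustive search; the speedup depends on how many candidates the bound rejects.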

JohnnyMalaria

21st September 2007, 20:41

Does anyone here know how to read?

The OP's question has NOTHING TO DO WITH SSE4 so all these cross-references to SSE4 threads are irrelevant.

It is about comparing two SSE2 implementations of a motion estimation algorithm. That's all.

If you are going to bollock people for asking stupid/pointless/already done questions, make sure you have read and properly digested the question.

Evidently, people skim-read the posts, see "SSE2" and "SSE4" in the same paragraph, decide it's a "why is SSE4 better?" post and then flame the poster.

Good grief, Charlie Brown.

Sergey A. Sablin

21st September 2007, 20:55

Does anyone here know how to read?

it seems only few of us.

the question was already answered at least three times in this very thread. is there something not clear yet?

Audionut

21st September 2007, 21:41

It seems there are more mods on this board than people interested in dvd conversion.

If you don't agree with me, use search.

lexor

21st September 2007, 22:39

it seems only few of us.

the question was already answered at least three times in this very thread. is there something not clear yet?

Clearly you are not one of the "few", since you keep talking about and linking to SSE4 discussions.

Furthermore, algorithm and implementation are not the same thing. May I direct you to rev676 which made ESA multi-threaded, it's still ESA but it performs better, but it's still the dumb exhaustive search and according to your last "brilliant" GCD post (that took some deciphering) it could not possibly be better. Alright then... Similarly Intel's SAD while implementing exhaustive computation may still be better than some other implementation of it. This was the only point of question I had. This Intel implementation for this algorithm hasn't been discussed before. How SEA and any other algorithm compares to ESA has nothing to do with this discussion and sheds no light on the discussion.

Sagekilla

21st September 2007, 22:44

Sorry, but your question was already answered numerous times quite a few posts back. No, I'm not talking about SSE4, and yes, I am talking about SSE2-optimized ESA. IIRC, some people (I believe it was Akupenguin and Dark Shikari) developed a variant of the ESA algorithm. It was nearly identical to ESA in terms of quality, but several times faster.

Edit: Also, Sergey was referring to SSE2 in the latest post, NOT SSE4.

lexor

21st September 2007, 22:48

Sorry but, your question was answered already quite a few posts back numerous times.
I know, I already thanked the people who answered my query. What's your point?

No, I'm not talking about SSE4, and yes, I am talking about SSE2-optimized ESA. IIRC, some people (I believe it was Akupenguin and Dark Shikari) developed a variant of the ESA algorithm. It was nearly identical to ESA in terms of quality, but several times faster.
Actually the last graph of Aku's showed the new multi-threaded esa spank almost everything. Was kind of amusing to watch the quick algorithms get hammered :)

Edit: Also, Sergey was referring to SSE2 in the latest post, NOT SSE4.
Not according to his second sentence. Although in his defense he seems to have KPR's eloquence with words, but with significant knowledge about what he's talking about when it comes to coding. Scary.

JohnnyMalaria

22nd September 2007, 00:18

[A]lgorithm and implementation are not the same thing.....This Intel implementation for this algorithm hasn't been discussed before.

The Intel white paper sets out to achieve two things from a technical perspective:

1. Demonstrate the benefits of SSE4 over SSE2
2. Use a level playing field - i.e., the Intel C++ compiler

It also sets out (obviously) to say "buy our latest processors and our compilers to make the most of them".

Leaving the sales pitch aside, a key statement in the white paper is:

The Intel® Compiler 10.0.018 beta was used to build the code. The 'O2' and 'QxS' compiler flags were used. 'QxS' is a new flag for the compiler to generate optimized code specifically for Penryn. The speedups from the SSE2-optimized function to the Intel SSE4-optimized function ranged from 1.6x to 3.8x. In addition to the speedups seen from Intel SSE4, we also observed the speedups from multi-threading (Figure 3-1).

i.e., everything was tested with code automatically generated by a compiler. Now, the Intel C++ compiler is certainly much better suited for automatic generation of SIMD code than, say, the Microsoft C++ compiler. The Intel compiler can be integrated into Microsoft's development suite. Some people will poo-poo using Intel's C++ compiler but if you aren't into hours of manual optimization then it is a great option to have.

My rambling about the Intel C++ compiler does have a point: the code isn't optimized as best as it might be. Intel sell (for Win/Mac) or give away (for non-commercial use on Linux) an entire library of functions for multimedia processing - the Intel Integrated Performance Primitives (IPP).

http://www.intel.com/cd/software/products/asmo-na/eng/perflib/ipp/302910.htm

This library includes a very large number of primitive functions including SAD motion estimation, H264 decoding and encoding, MPEG2, DV etc etc etc. These functions are hand-crafted by Intel programmers with access to the full range of optimization tools - things like scheduling to prevent processor stalls, multithreading, profiling etc. Furthermore, versions of each function exist for every Intel architecture.

So, given that Intel have highly optimized SAD functions, the real question is how do the existing x264 algorithms compare to the IPP ones. Someone on Linux could easily do a direct comparison. Of course, you couldn't use it on Windows unless you buy a license for the library or get an application that uses the library.

This raises one more issue - processor brand. AMD and Intel processors may have the same instruction set but the microarchitectures are different. A routine optimized for an AMD processor with SSE2 may perform worse on the Intel equivalent - and vice versa.

So, any Linux gurus up to comparing the Intel SAD functions against x264's?

Sagekilla

22nd September 2007, 00:19

My point is that if your question was already answered, there is no need to continue bashing other people, nor for anyone else to continue this in this thread. Remember, an argument is an exchange of anger and arrogance, whereas discussions are an exchange of intelligence :) No need for any inflammatory words to be thrown around by anyone here!

Sergey A. Sablin

22nd September 2007, 01:01

So, given that Intel have highly optimized SAD functions, the real question is how do the existing x264 algorithms compare to the IPP ones. Someone on Linux could easily do a direct comparison. Of course, you couldn't use it on Windows unless you buy a license for the library or get an application that uses the library.
there is an evaluation version of IPP, so anyone, even on Win, may test it.
And IPP is a kinda low level SDK - there are many optimized functions, but there are no optimized algorithms like motion estimation. There are many heavily optimized decoder functions, so I'd say it's more suited to decoder implementation, while the encoder side requires a lot of work to reach any good implementation (mpeg-1/2/4, h261/3/4, same for the audio part).

No need to compare x264 to IPP here, one may look at the last MSU comparison of H.264 encoders. Both speed and quality wise.

JohnnyMalaria

22nd September 2007, 01:50

Good point.

Intel do also provide a good set of sample applications:

http://www.intel.com/cd/software/products/asmo-na/eng/219967.htm#va (direct - may not work if you haven't accepted the license agreement)

or from here:

http://www.intel.com/cd/software/products/asmo-na/eng/302910.htm

They include:

Video Encoder
A video encoder application using Intel IPP audio and video coding functions, and image and signal processing functions. It supports H.264, MPEG-4, and MPEG-2.

I did look at these some time ago (I use IPP in my software) but these formats aren't my focus.

akupenguin

22nd September 2007, 08:54

Picking a function at random out of IPP (or not so random, given this discussion):
IPP sad16x16 is essentially identical to x264's. How can you go wrong with psadbw? (plus a check for misalignment, which x264 can omit because x264's data is always aligned.)
IPP sad8x8 uses xmm registers for values that would fit in mmx. This makes it much slower than x264's on k8 and p4, and a little bit slower on core2.
IPP sad4x4 isn't even simd.

I see many functions in IPP that should be simd but aren't. Which makes me wonder if they even did any hand optimization, or whether the whole library is just a naive C implementation put through icc's autovectorizer.
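To make the psadbw remark concrete, here is a rough intrinsics sketch of a 16x16 SAD built on `_mm_sad_epu8` (which compiles to psadbw), next to a scalar version for comparison. It is neither IPP's nor x264's code; the function names are hypothetical. Unaligned loads are used for both blocks so the sketch works on any pointers; an implementation that knows its data is 16-byte aligned could use `_mm_load_si128` instead, which is the kind of alignment handling mentioned above. When `__SSE2__` is not defined the function simply falls back to the scalar loop.

```c
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference: sum of absolute differences over a 16x16 block. */
static unsigned sad16x16_scalar(const uint8_t *cur, int cur_stride,
                                const uint8_t *ref, int ref_stride)
{
    unsigned sum = 0;
    for (int y = 0; y < 16; y++, cur += cur_stride, ref += ref_stride)
        for (int x = 0; x < 16; x++)
            sum += (unsigned)abs(cur[x] - ref[x]);
    return sum;
}

/* psadbw-based version: one instruction handles a whole 16-pixel row. */
static unsigned sad16x16_psadbw(const uint8_t *cur, int cur_stride,
                                const uint8_t *ref, int ref_stride)
{
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < 16; y++) {
        __m128i a = _mm_loadu_si128((const __m128i *)cur);
        __m128i b = _mm_loadu_si128((const __m128i *)ref);
        /* psadbw yields one partial sum per 64-bit half of the register. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
        cur += cur_stride;
        ref += ref_stride;
    }
    /* Add the low and high partial sums to get the block total. */
    return (unsigned)_mm_cvtsi128_si32(acc) +
           (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
#else
    return sad16x16_scalar(cur, cur_stride, ref, ref_stride);
#endif
}
```

The whole block takes 16 psadbw operations plus the final reduction, which leaves little room for one SSE2 implementation to differ from another.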

Sharktooth

22nd September 2007, 13:23

probably the latter...