8th November 2011, 22:30   #2
IanB
Avisynth Developer
 
Quote:
Originally Posted by redfordxx
Hi, I am about to come back to the Average plugin and add some new features. First I have a few questions.

1) Which level of SSE is suitable (high enough to perform, low enough to be compatible, and without overflow)? Basically I need the following operations (appropriately packed):
B*F
W*F
B*B
W*W
B*B*F
W*W*F
Where B = packed bytes, W = packed words, F = one float, or a float scaled to byte or word (this one multiplies all the packed numbers).
Is SSE2 OK, or can there be a benefit from something else?
I generally restrict myself to plain old MMX first time out. This gives a baseline for improvement and will work everywhere. Moving up to ISSE (Integer SSE) gives you the rest of the MMX instructions, like PAVGB, PMULHUW, PSHUFW, etc. Just about everything does ISSE now. Straight SSE gives floating point but no useful integer stuff, and there are lots of annoying instruction holes. SSE2 fills out the lack of integer stuff and adds double support. All new CPUs do SSE2 now. SSE3, SSSE3 and up fill in special instruction holes, but your code must check that the CPU actually has the required instructions.
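
The runtime check mentioned above could be sketched roughly as follows. This is only an illustration, assuming GCC or Clang on x86 (it relies on their `__builtin_cpu_supports` builtin); the `simd_level` enum and both function names are made up for this example, and treating "has SSE" as "has ISSE" is a simplification.

```c
/* Hypothetical code-path levels, ordered from most to least compatible. */
enum simd_level { SIMD_NONE = 0, SIMD_MMX, SIMD_ISSE, SIMD_SSE2, SIMD_SSSE3 };

/* Returns the highest level the running CPU reports.
   Assumes GCC or Clang on x86; anywhere else it falls back to plain C. */
enum simd_level detect_simd_level(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("ssse3")) return SIMD_SSSE3;
    if (__builtin_cpu_supports("sse2"))  return SIMD_SSE2;
    /* Simplification: "has SSE" is taken to mean "has the integer SSE ops". */
    if (__builtin_cpu_supports("sse"))   return SIMD_ISSE;
    if (__builtin_cpu_supports("mmx"))   return SIMD_MMX;
#endif
    return SIMD_NONE;
}

/* Human-readable name for logging which path was selected. */
const char *simd_level_name(enum simd_level level)
{
    static const char *const names[] = { "C", "MMX", "ISSE", "SSE2", "SSSE3" };
    return names[level];
}
```

A filter would call `detect_simd_level()` once at construction time and pick its MMX, ISSE or SSE2 routine from the result, rather than testing on every frame.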

Many CPUs only do SSE2 as pairs of 64-bit ops internally, which can extend the instruction latency badly, leading to pipeline stalls that you cannot get rid of no matter how you shuffle your code. My old 3GHz P4 HT generally runs SSE2 code the same as or slower than the MMX/ISSE code, sometimes a lot slower. My less old Core2 generally runs SSE2 code the same as the best MMX/ISSE code, sometimes a little faster, very occasionally a lot faster. i3 and up start to rock, but they also have reduced latency and improved throughput on MMX, which means you have to work hard to get the SSE2+ code significantly faster. Wander through the x264 code and discussions to see how various things bite and don't work as expected.
Quote:
2) what happens if there is overflow in SIMD?
Generally your choice: either wrapping or saturation, depending on the instructions selected, e.g. PADDB versus PADDUSB.
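
To make the difference concrete, here is a small sketch of both behaviours on unsigned bytes. It assumes a compiler that provides the SSE2 intrinsics in `<emmintrin.h>` (`_mm_add_epi8` maps to PADDB, `_mm_adds_epu8` to PADDUSB); the function names are invented for the example, and the scalar tail loop shows what each instruction does per byte.

```c
#include <stdint.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* dst[i] = a[i] + b[i], wrapping modulo 256 (PADDB behaviour). */
void add_wrap_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi8(va, vb));
    }
#endif
    for (; i < n; ++i)                 /* scalar tail, same semantics */
        dst[i] = (uint8_t)(a[i] + b[i]);
}

/* dst[i] = min(a[i] + b[i], 255), unsigned saturation (PADDUSB behaviour). */
void add_sat_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
#endif
    for (; i < n; ++i) {               /* scalar tail, same semantics */
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = sum > 255 ? 255 : (uint8_t)sum;
    }
}
```

With inputs 200 and 100, the wrapping version stores 44 and the saturating version stores 255; for pixel data the saturating form is usually what you want.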
Quote:
3) Can I use MMX and XMM registers at the same time?
Yes. The MMX<->XMM transfer instructions have some braindeadness on many CPUs, usually extreme latency, sometimes poor throughput as well.
Quote:
4) I read somewhere that reading 16 bytes of data from memory into an xmm register is more than 2x slower than reading 2x 8 bytes from memory. So it would be better to read twice and then combine... is that true?
It depends on the CPU model and memory controller. MOVDQA is always at least equal fastest. MOVDQU can massively suck; two MOVQs may be faster and have lower latency. LDDQU has its problems as well. You have to test your case to know, and then the test only applies to your CPU with your memory.
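
For anyone wanting to run that test, the three load strategies can be written with SSE2 intrinsics roughly like this. It is a sketch only: `copy16` and `load_kind` are names invented here, the instruction named in each comment is what the intrinsic normally compiles to (the compiler is free to choose differently), and the function says nothing about which variant is faster on a given machine.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

enum load_kind { LOAD_ALIGNED, LOAD_UNALIGNED, LOAD_TWO_HALVES };

/* Copies 16 bytes from src to dst using the chosen load strategy.
   LOAD_ALIGNED requires src to be 16-byte aligned. */
void copy16(uint8_t *dst, const uint8_t *src, enum load_kind kind)
{
#if defined(__SSE2__)
    __m128i v;
    switch (kind) {
    case LOAD_ALIGNED:
        v = _mm_load_si128((const __m128i *)src);                  /* MOVDQA */
        break;
    case LOAD_TWO_HALVES: {
        __m128i lo = _mm_loadl_epi64((const __m128i *)src);        /* MOVQ */
        __m128i hi = _mm_loadl_epi64((const __m128i *)(src + 8));  /* MOVQ */
        v = _mm_unpacklo_epi64(lo, hi);                            /* PUNPCKLQDQ */
        break;
    }
    default:
        v = _mm_loadu_si128((const __m128i *)src);                 /* MOVDQU */
        break;
    }
    _mm_storeu_si128((__m128i *)dst, v);
#else
    (void)kind;               /* no SSE2 available: plain copy */
    memcpy(dst, src, 16);
#endif
}
```

Timing each variant in a loop over your real frame data, on the CPUs you care about, is the only way to know which one wins.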
Quote:
5) What is the guaranteed alignment of data in a plane, i.e. the length of a row in the input clip is a multiple of what?
Rowsize is always the width.

Pitch is always at least (width+15)/16*16 (integer division, i.e. the width rounded up to a multiple of 16) for all planes. You can read and write the data in the area between rowsize and pitch, but it must be considered uninitialised (it will be random old frame data).
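
As a worked example of that rule, the sketch below computes the minimum pitch and then processes each row in whole 16-byte blocks, deliberately running past the width into the padding. The function names are invented for illustration, and the inversion is just a placeholder operation; a SIMD loop would step 16 bytes at a time over the same range.

```c
#include <stdint.h>
#include <stddef.h>

/* Smallest pitch the rule above allows for a row of `width` bytes. */
int min_pitch(int width)
{
    return (width + 15) / 16 * 16;   /* integer division rounds down */
}

/* Inverts every byte of a plane, working on whole 16-byte blocks per row.
   The loop may touch padding bytes between width and pitch; that is
   allowed, but their contents are meaningless before and after. */
void invert_plane(uint8_t *plane, int width, int height, int pitch)
{
    int padded = min_pitch(width);   /* never exceeds pitch, by the rule */
    for (int y = 0; y < height; ++y) {
        uint8_t *row = plane + (size_t)y * (size_t)pitch;
        for (int x = 0; x < padded; ++x)
            row[x] = (uint8_t)(255 - row[x]);
    }
}
```

For example, a width of 720 gives a minimum pitch of 720, while a width of 721 gives 736, so no scalar tail loop is needed for the last few pixels of a row.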

Non-aligned crops can make the start address of a row non-aligned, but the original VideoFrameBuffer would have been aligned so you can backstep the row start to regain alignment. All data outside the active picture area must be considered uninitialised.
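
Backstepping a row start could look something like the sketch below. It assumes, as stated above, that the underlying buffer was 16-byte aligned, so the bytes before the cropped row start still belong to the frame buffer; the helper names are made up for this example.

```c
#include <stdint.h>
#include <stddef.h>

/* Rounds a row pointer down to the previous 16-byte boundary. */
const uint8_t *align_down16(const uint8_t *p)
{
    return (const uint8_t *)((uintptr_t)p & ~(uintptr_t)15);
}

/* Number of bytes between the aligned address and the real row start.
   These leading bytes lie outside the active picture and must be
   treated as uninitialised. */
size_t misalignment16(const uint8_t *p)
{
    return (size_t)((uintptr_t)p & 15);
}
```

A row loop would then start its aligned loads at `align_down16(row)` and cover `misalignment16(row) + rowsize` bytes, ignoring whatever results land in the leading bytes.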