Old 8th November 2011, 15:02   #1  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
RedAverage plugin (version 1.4.3 biiiiig bugfix)

EDIT: I am removing the original post, which is quoted in IanB's answer anyway, and replacing it with some info about what has been done.
This plugin started with the idea of improving mg262's original Average. Here I thank him for his work.

This new plugin uses different filter names to avoid name conflicts. Currently there are the following filters:
RAverageM - masked average
RAverageW - weighted average
RMerge - yet another merge filter


common parameters:
y,u,v=3 ...process plane, other numbers disable processing the plane atm for RAverageW and RAverageM. For RMerge it works just like in masktools
bias=0 ...number which is added to finally computed value (just fyi: bias=-0.5 kills rounding the result)
sse=-1 ...limits the processor capabilities. -1=use all (bitwise...16,8,4=SSE4.1,SSSE3,SSE2)...0=no asm, which is veeeeryyyy sloooow
lsb_in=false ...if true, 16bit stacked input is expected (for a more detailed explanation of 16bit clips, see Dither tools)
lsb_out=false ...if true, 16bit stacked output is produced
All parameters are optional, except the input clips. There must be at least one input clip pair.

RAverageW(clip1, weight1, clip2, weight2, ... clipn,weightn,bias,y,u,v,sse, bool lsb_in, bool lsb_out, int mode, int errlevel, bool silent)
computes value of weighted average of clips:
result=c1*w1+c2*w2+...cn*wn+bias
silent=false ... if true, certain error messages (mostly about rounding error) are suppressed
errlevel=0 ...sets the maximum rounding error you are willing to accept as a tradeoff to faster method
mode=-1 ...bitwise 4, 8 for enabling method 4, 8. 0 results in unoptimized C++.
- mode 8: requires SSE2, has higher precision than 4 but depending on hardware, probably will be slower
- mode 4: requires SSSE3, is faster, however has certain limitations in precision (if you don't have silent true, you will be warned) (8bit in only atm, 8 or 16 out)
basically method 4 is fully precise when the weights are nice power-of-two numbers, specifically: w_i=a_i*2^k (a_i and k are signed integers; moreover sum[max(a_i,0)]<128 and sum[min(a_i,0)]>-128)
...in other words, weights must be scalable to signed byte;-)
- mode 0: plain C++, slow, fully precise (I think) and supports fully 8 & 16bit clips in and out.
Concerning 16bit support and conversion when the in and out have different bitdepth, the values are relative to full scale. That introduces multiplication or division by 256 in case of change of bitdepth. So: 8bitvalue*weight*256=16bitvalue.
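
The RAverageW formula above can be sketched for a single 8bit pixel in plain C++. This is only an illustration of the arithmetic as described (weighted sum plus bias, round to nearest), not the plugin's source. The name weighted_pixel and the clamping to 0..255 are assumptions made for the sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of RedAverage: one output pixel of
// result = c1*w1 + c2*w2 + ... + cn*wn + bias, for 8bit input and output.
std::uint8_t weighted_pixel(const std::vector<std::uint8_t>& c,
                            const std::vector<double>& w,
                            double bias = 0.0) {
    assert(c.size() == w.size() && !c.empty());
    double sum = bias;
    for (std::size_t i = 0; i < c.size(); ++i)
        sum += c[i] * w[i];
    // Round to nearest by adding 0.5 and flooring,
    // so bias = -0.5 cancels the rounding as noted above.
    double r = std::floor(sum + 0.5);
    // Assumption: out-of-range results are clamped to the 8bit range.
    if (r < 0.0) r = 0.0;
    if (r > 255.0) r = 255.0;
    return static_cast<std::uint8_t>(r);
}
```

With weights 0.5 and 0.5 this behaves like a rounded average of two pixels. With weight -1 and bias=255 it inverts a pixel.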

RAverageM(clip1, mask1, clip2, mask2, ... clipn,maskn,bias,y,u,v,sse, bool lsb_in, bool lsb_out)
computes value of masked average of clips:
result=c1*m1+c2*m2+...cn*mn+bias

RMerge(clip1, clip2, mask,y,u,v,sse, bool lsb_in, bool lsb_out, int mode)
Merges two clips based on mask, as expected. However, there are some new things. First, again, 16bit support (although not fully optimized).
Then, for 8bit input, there are two modes that decide whether the mask is applied in the standard way or slightly adjusted to correct the problem of the incomplete range of the mask:
mode=255 ...this is standard merge formula: r=c1*(256-m)+c2*m
mode=256 ...this is adjusted merge formula: r=c1*(256-n)+c2*n where n= ( m<=128 ? m : m+(m-128)/128 )
lsb_in=true ...this is 16bit merge formula: r=c1*(65536-m)+c2*m
Then, r is scaled and rounded to results r8 or r16, depending on output bitdepth.
The goal of mode 256 is to allow merge values in full range and if m=255, then r8=c2 ... (...and not r8=(c1+c2*255+128)/256 like usually and in mode 255)
Only lsb_in=lsb_out=false and mode=255 is SSSE3 optimized at this moment. Anything else goes plain C++. But the optimized version performed in my configuration better than mt_merge.
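
To make the difference between the two 8bit modes concrete, here is a scalar sketch of both formulas for one pixel. The function names are made up for the illustration. How the plugin rounds the adjusted mask n internally is not stated above, so keeping n fractional and rounding only the final result is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// mode=255: r = c1*(256-m) + c2*m, then r8 = (r + 128) / 256.
std::uint8_t merge255(int c1, int c2, int m) {
    assert(0 <= m && m <= 255);
    return static_cast<std::uint8_t>((c1 * (256 - m) + c2 * m + 128) / 256);
}

// mode=256: same formula with the adjusted mask n.
// Assumption: n stays fractional and only the final result is rounded.
std::uint8_t merge256(int c1, int c2, int m) {
    assert(0 <= m && m <= 255);
    double n = (m <= 128) ? m : m + (m - 128) / 128.0;
    double r = c1 * (256.0 - n) + c2 * n;
    return static_cast<std::uint8_t>(std::floor(r / 256.0 + 0.5));
}
```

For c1=0, c2=255 and m=255 the standard formula gives 254, while the adjusted one gives 255, i.e. exactly c2.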

Examples:
Code:
RAverageW(c1, 0.5, c2, 0.5)   #same as mt_average
RAverageW(c1, 1, c2, -1, bias=128)   #same as mt_makediff
RAverageW(c1, 1, c2, 1, bias=-128)   #same as mt_adddiff
a2=RAverageW(a1, -1, bias=255)   #same as mt_invert
RAverageW(c1, a1, c2, a2)         #similar to mt_merge
RMerge(c1, c2, m)                     #same as mt_merge
RMerge(c1, c2, m, mode=255)           #same as mt_merge
RMerge(c1, c2, m, mode=256)           #not same as mt_merge
Code:
#restoring blended telecined video
f0=o.SelectEvery(5,0)
f1=o.SelectEvery(5,1)
f2=o.SelectEvery(5,2)
f3=o.SelectEvery(5,3)
f4=o.SelectEvery(5,4)
n3=RAverageW(f1,-0.5,f2,1,f3,1,f4,-0.5)
return Interleave(f0,f1,n3,f4)
Changelog:

1.4.3 serious bugfix, thanx to Bloax for pointing out
1.4.2 bugfix of RMerge SSSE3, switched clips in RMerge to match the mt_merge formula, minor improvements
1.4.1 SSSE3 optimization of RMerge in 8bit
1.4.0 introduction of new filter RMerge
1.3.10 full 16bit support for RAverageW mode 8 (SSE2) and minor speed improvements
1.3.9 added 16bit support for RAverageW, however not always SIMD optimized. Multiple algorithms available, different in speed and precision. Thorough estimation of rounding error at the beginning added, so that the most suitable algo is chosen. Ugly source included.
1.3.5 something I am not afraid to publish, but I am afraid to publish the messy src

Todo's (maybe):
- optimization
- mixed bitdepth for merge masks
- RTotal

I did some speed benchmarking, so why not share it. It is a rough measurement, with the C++ code as the reference speed:
Code:
Filter      method      2 clips   4 clips   8 clips
Average     C++         89%
RAverageW   C++         100%      100%      100%
Average     SIMD        388%      354%      351%
RAverageW   8 - SSE2    1213%     906%      738%
RAverageW   4 - SSSE3   1213%     957%      1176%
I think that methods 4 and 8 give the same results for lower numbers of clips because, on my computer, the memory-to-processor data transfer speed is the limiting factor. I don't know how it is on other machines, but I hope 4 is faster in all cases.

Disclaimer:
These filters are in the development stage and I cannot guarantee there are no bugs, so feel free to use them, test them and comment on them.
If you report a bug, please demonstrate it with a reproducible clip. For instance:
Code:
c=ColorBars(pixel_type="YV12")
m=mt_lutspa(c,expr="x y * 2 256 * *", mode="relative", y=3,u=3,v=3).trim(1,1).Loop(c.framecount,0,0)
RAverageM(m,c,m,c,bias=-10000,lsb_in=false,lsb_out=true)
Moreover, if it's a crash, try to play with the sse parameter first.
Use this thread only for development issues; for "how do I..." questions, use the Average plug-in thread.

I am adding new version's links to the changelog.
Attached Files
File Type: rar RedAverage-1.3.5.rar (39.2 KB, 438 views)

Last edited by redfordxx; 17th January 2012 at 16:41.
Old 8th November 2011, 22:30   #2  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
Quote:
Originally Posted by redfordxx View Post
Hi, I am about to come to Average plugin again and add some new features. First I have few questions

1) Which level of SSE is suitable (high enough to perform, low enough to be compatible and not overflow)? Basically I need the following operations (appropriately packed):
B*F
W*F
B*B
W*W
B*B*F
W*W*F
Where B=packed bytes, W=packed words, F=one float or float scaled to byte or word (this one will multiply all the packed numbers)
Is SSE2 OK, or can there be a benefit from something else?
I generally try to restrict myself to plain old MMX first time out. This gives a baseline for improvement and will work everywhere. Moving up to ISSE gives you the rest of the MMX instructions like PAVGB, PMULHUW, PSHUFW, etc. Just about everything does ISSE now. Straight SSE gives floating point but no useful integer stuff, and there are lots of annoying instruction holes. SSE2 fills out the lack of integer stuff and gives double support. All new CPUs do SSE2 now. SSE3, SSSE3 and up fill in special instruction holes, but your code must check that the CPU actually has the required instructions.

Many CPUs only do SSE2 as pairs of 64bit ops internally, which can extend the instruction latency badly, leading to pipe stalls that you cannot get rid of no matter how you shuffle your code. My old 3GHz P4 HT generally runs SSE2 code the same as or slower than the MMX/ISSE code, sometimes a lot slower. My less old Core2 generally runs SSE2 code the same as the best MMX/ISSE code, sometimes a little faster, very occasionally a lot faster. i3 and up start to rock, but they also have reduced latency and improved throughput on MMX, which makes you work hard to get the SSE2+ code significantly faster. Wander through the x264 code and discussions on how various things bite and don't work as expected.
Quote:
2) what happens if there is overflow in SIMD?
Generally your choice, either wrapping or saturation depending on the instructions selected, e.g. PADDB versus PADDUSB.
Quote:
3) can I uses mmx and xmm register at the same time?
Yes. The MMX<->XMM transfer instructions have some braindeadness on many CPUs, usually extreme latency, sometimes poor throughput as well.
Quote:
4) I somewhere read that reading 16 bytes of data from memory to xmm is more than 2x slower than 2x 8bytes data from memory. So it means it is better to read twice and then combine...is that true?
It depends on the CPU model and memory controller. Movdqa is always at least equal fastest. Movdqu can massively suck, 2 movq's may be faster and have lower latency, and Lddqu has its problems as well. You have to test your case to know, and then the test only applies to your CPU with your memory.
Quote:
5) what is the guaranteed alignment of data in plane...i.e. length of row in the input clip is multiplication of what?
Rowsize is always the width.

Pitch is always at least (width+15)/16*16 for all planes. You can read and write the data in the area between pitch and rowsize. It must be considered uninitialised (it will be random old frame data).

Non-aligned crops can make the start address of a row non-aligned, but the original VideoFrameBuffer would have been aligned so you can backstep the row start to regain alignment. All data outside the active picture area must be considered uninitialised.
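
A minimal sketch of the rowsize/pitch distinction described above, using a plain byte buffer instead of a real Avisynth frame. The helper names min_pitch and invert_plane are hypothetical. The point is only that the row pointer advances by pitch while the inner loop covers rowsize.

```cpp
#include <cassert>
#include <cstdint>

// Smallest pitch consistent with the rule "at least (width+15)/16*16".
int min_pitch(int width) {
    assert(width > 0);
    return (width + 15) / 16 * 16;
}

// Hypothetical helper: inverts only the visible pixels of one plane.
// pitch   = distance in bytes between the starts of consecutive rows
// rowsize = number of visible bytes in each row
void invert_plane(std::uint8_t* p, int pitch, int rowsize, int height) {
    assert(pitch >= rowsize);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < rowsize; ++x)
            p[x] = static_cast<std::uint8_t>(255 - p[x]);
        p += pitch;  // step by pitch, never by rowsize
    }
}
```

The bytes between rowsize and pitch are left untouched here, which is the safe choice when nothing is known about how the frame was cropped.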
Old 9th November 2011, 01:09   #3  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
I am looking through my old code, and this is what I see:
Code:
(env->GetCPUFlags() & CPUF_INTEGER_SSE) != 0

punpcklbw mm2, mm7//low
Testing SSE, running SSE2 and using MMX reg

This is not good, I guess?
Old 9th November 2011, 01:45   #4  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
1) From what you wrote, basically I should not bother too much with the speed issues, since every machine has it differently implemented, correct? Anyway would you recommend some document where is clock cycles or whatever speed specification of the instruction? Maybe to learn whether pxor is faster than movq on mmx...
2) MMX<->XMM still should be faster than from memory on every machine, or not?
3) So since the pitch is mod16, basically it is safe (and good idea???) to run one cycle from 0 to pitch*height, instead of two cycles inside each other for height and width?
4) Saturation means 250+10=255?
5) Should I be interested in MOVNTDQ? I don't know what is the benefit of nontemporal
Old 9th November 2011, 03:00   #5  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
Quote:
Originally Posted by redfordxx View Post
I am looking through my old code, and this is what I see:
Code:
(env->GetCPUFlags() & CPUF_INTEGER_SSE) != 0

punpcklbw mm2, mm7//low
Testing SSE, running SSE2 and using MMX reg
CPUF_INTEGER_SSE is really the extra MMX instructions, not to be confused with CPUF_SSE (although the same CPU models mostly provide both, except the plain Athlon (not the XP/MP)).

Your capability testing should always match your instruction usage.
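
One way to keep the test and the code path in step is to derive the path from both the CPU capabilities and the user's sse parameter in a single place. The sketch below uses the sse bit values from the first post (4=SSE2, 8=SSSE3, -1=use all). The CPU capability constants and the function choose_path are hypothetical and are not Avisynth's CPUF_* values.

```cpp
#include <cassert>

// Hypothetical CPU capability bits for this sketch only.
enum CpuBits { kCpuSse2 = 1, kCpuSsse3 = 2 };

// Bits of the user-facing sse parameter as documented in the first post.
enum SseParam { kUseSse2 = 4, kUseSsse3 = 8 };

enum class Path { PlainCpp, Sse2, Ssse3 };

// A path is chosen only if the CPU has it AND the user allows it.
Path choose_path(int cpu_bits, int sse_param) {
    assert(sse_param >= -1);  // -1 has all bits set, meaning "use everything"
    if ((cpu_bits & kCpuSsse3) && (sse_param & kUseSsse3)) return Path::Ssse3;
    if ((cpu_bits & kCpuSse2) && (sse_param & kUseSse2)) return Path::Sse2;
    return Path::PlainCpp;
}
```

Each branch would then contain only instructions covered by the flag that selected it.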

Quote:
Originally Posted by redfordxx View Post
1) From what you wrote, basically I should not bother too much with the speed issues, since every machine has it differently implemented, correct?
No, I mean you should not assume SSE2 is always faster than MMX, you need to test it. On some hardware SSE2 code might be twice as fast as an MMX version, but it also might be slower.
Quote:
Anyway would you recommend some document where is clock cycles or whatever speed specification of the instruction? Maybe to learn whether pxor is faster than movq on mmx...
Agner Fog.
Quote:
2) MMX<->XMM still should be faster than from memory on every machine, or not?
The killer is latency, not throughput; bad CPUs can stuff around for more than 9 cycles. A movdqa from L1 cache can be all done in 3 cycles. You need to test it.
Quote:
3) So since the pitch is mod16, basically it is safe (and good idea???) to run one cycle from 0 to pitch*height, instead of two cycles inside each other for height and width?
No, cropped frames may result in a tiny active hole in the middle of a huge VideoFrameBuffer. E.g. rowsize might be 16 and pitch 1920.
Quote:
4) Saturation means 250+10=255?
Yes, and 10-250=0 and 32760+100=32767 and -32760-100=-32768, etc
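
The same arithmetic as a scalar C++ model of one lane, for anyone who wants to check values by hand. The function names are made up. They model what one element of PADDUSB, PSUBUSB, PADDSW and the wrapping PADDB computes.

```cpp
#include <cassert>
#include <cstdint>

int clamp_int(int v, int lo, int hi) {
    assert(lo <= hi);
    return v < lo ? lo : (v > hi ? hi : v);
}

// Unsigned byte, saturating (one lane of PADDUSB / PSUBUSB).
std::uint8_t add_sat_u8(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(clamp_int(a + b, 0, 255));
}
std::uint8_t sub_sat_u8(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(clamp_int(a - b, 0, 255));
}

// Signed word, saturating (one lane of PADDSW).
std::int16_t add_sat_s16(std::int16_t a, std::int16_t b) {
    return static_cast<std::int16_t>(clamp_int(a + b, -32768, 32767));
}

// Unsigned byte, wrapping (one lane of PADDB): the result is taken modulo 256.
std::uint8_t add_wrap_u8(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a + b);
}
```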
Quote:
5) Should I be interested in MOVNTDQ? I don't know what is the benefit of nontemporal
In general I have found MOVNT* only good for sequential read once data. On output it may make "your" filter appear faster, but it invariably makes the next filter slower. It is very difficult to predict the effect across a range of machines. You need to test it.

In the HResizer we read 1 row of input, unpack it to a temp array, then FIR the living hell out of that temp array. By reading the source rows with MOVNTQ we avoid flushing the start of the temp array from L1 cache before the first FIR operation. On most machines the temp array and FIR coefficients live in the L1 cache and hence the code screams along. This is an exceptional case and much effort by many people went into getting this optimal.
Old 9th November 2011, 21:32   #6  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
So basically, if I understand it correctly (what latency and reciprocal throughput mean), I made a little breakdown of my code.
What it does is load data from memory to mm0, mm2, interleave it with zeros, multiply, and add to mm4, mm5.
The expressions in the comments mean:
clock for operation: reciprocal throughput / latency --> clock when data ready
Code:
spatial:
    mov       eax, [length4]               // 23: r1.0 / l2    --> 25   (n)
    movq      mm4, mm6                     // 24: r0.3 / l1    --> 25
    movq      mm5, mm6                     // 24: r0.3 / l1    --> 25
temporal:
    mov       ecx, [sourcep+eax-8]         //  1: r1.0 / l2    -->  3
    mov       edx, [sourcep+eax-4]         //  2: r1.0 / l2    -->  4
    movq      mm0, qword ptr [ecx+ebx-8]   //  3: r1.0 / l2    -->  5
    movq      mm2, qword ptr [edx+ebx-8]   //  4: r1.0 / l2    -->  6
    movq      mm1, mm0                     //  5: r0.3 / l1    -->  6   frame
    movq      mm3, mm2                     //  6: r0.3 / l1    -->  7   mask (wait for mm2)
//  pxor      mm1, mm1
//  pxor      mm3, mm3
                                           // mm7=000000000000
    punpcklbw mm0, mm7                     //  7: r1.0 / l1    -->  8   low
    punpcklbw mm2, mm7                     //  8: r1.0 / l1    -->  9   low
    punpckhbw mm1, mm7                     //  9: r1.0 / l1    --> 10   high
    punpckhbw mm3, mm7                     // 10: r1.0 / l1    --> 11   high
//  punpckhbw mm1, mm0                     // high
//  punpckhbw mm3, mm2                     // high
    pmullw    mm2, mm0                     // 11: r1.0 / l3    --> 14   low
    pmullw    mm3, mm1                     // 12: r1.0 / l3    --> 15   high
//  pmulhw    mm3, mm1                     // high
    sub       eax, 8                       // 13:
    paddusw   mm4, mm2                     // 14: r0.5 / l1    --> 15   low
    paddusw   mm5, mm3                     // 15: r0.5 / l1    --> 16   high (wait for mm3)
    jnz       temporal                     // 16: r1.0 / l0    --> 17
    psrlw     mm4, 8                       // 17: r1.0 / l1    --> 18   low  ???
    psrlw     mm5, 8                       // 18: r1.0 / l1    --> 19   high ???
    packuswb  mm4, mm5                     // 19: r1.0 / l1    --> 20
    movq      qword ptr [edi+ebx-8], mm4   // 20: r1.0 / l3    --> 23
    sub       ebx, 8                       // 21:
    jnz       spatial                      // 22:
1) I believe I could save one clock with the change shown in the comments (then I have zeros in the lower bytes of mm1, mm3, so I used pmulhw); however, I ended up with really ugly output.
2) Maybe I can save even more with punpck directly from memory, but I found no documented latency for punpck from memory anywhere... and again I would then have to use pmulhw, and god knows what will happen.
3) Q: Is latency relevant only for the destination argument, or for both? For example, with paddusw mm4, mm2 do I have to wait before the next use of mm4 but can read/write mm2 immediately? Or are both mm registers blocked?
4) Q: Do operations on reg32 and mm registers go in one line or in parallel? I mean, e.g., do sub and pmullw execute at the same time or one after the other?

EDIT: all this can be thrown away, if I misunderstood latency and thruput

Last edited by redfordxx; 10th November 2011 at 00:06.
Old 9th November 2011, 21:34   #7  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
What is the best way to make a conditional jump if eax is a multiple of 2?
Thank you for your time...
R.
Old 9th November 2011, 21:46   #8  |  Link
jmartinr
Registered User
 
 
Join Date: Dec 2007
Location: Enschede, NL
Posts: 276
Quote:
Originally Posted by redfordxx View Post
What is the best way to make a conditional jump if eax is a multiple of 2?
Thank you for your time...
R.
test eax, 1 ?
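
For reference, the same check written in C++. `test eax, 1` ANDs the register with 1 and sets the zero flag when the result is zero, i.e. when the low bit is clear and the value is even. So `jz` is taken for multiples of 2 and `jnz` for odd values. The function names below are just for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors `test eax, 1`: the AND result is zero exactly when the low bit is clear.
bool is_multiple_of_two(std::uint32_t x) {
    return (x & 1u) == 0;
}

// Small self-check showing the expected behaviour.
inline void self_check() {
    assert(is_multiple_of_two(8));
    assert(!is_multiple_of_two(7));
}
```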
__________________
Roelofs Coaching
Old 9th November 2011, 22:34   #9  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
I refuse to try to read stuff that is wider than the screen. Edit your post to collapse the tabs and I might then take an interest.

Thanks, it makes some sense when you can see the whole picture.

Last edited by IanB; 10th November 2011 at 22:49. Reason: Wish was granted
Old 9th November 2011, 23:43   #10  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
And then jz, I suppose (the zero flag is set when the low bit is clear). Thanx


I have one slightly different question, about calling the filter. Atm, this filter requires a variable number of arguments (clip, float, clip, float, clip, float, ...). I want to add more optional named arguments:
y,u,v integers (obviously for plane processing)
bias float
How can I detect what the arguments in fact are? FYI, the code below is how it works now, but do understand that I didn't create it; I took it from somewhere else and modified it.
Code:
struct WeightedClip
{
	PClip clip;
	double weight;
	WeightedClip(PClip _clip, double _weight)
		: clip(_clip), weight(_weight) {}
};

AVSValue __cdecl Create_RAverageW(AVSValue arguments, void *user_data, IScriptEnvironment *env) 
{
	AVSValue array = arguments[0];
	int noarguments = array.ArraySize();
	if(noarguments % 2 != 0)
		env->ThrowError("Average requires an even number of arguments.");

	vector<WeightedClip> clips;
	for (int i=0; i < noarguments; i+=2)
		clips.push_back(WeightedClip(array[i].AsClip(), array[i+1].AsFloat()));

	return new RAverageW(clips[0].clip, clips, env);
}


extern "C" __declspec(dllexport) const char *__stdcall AvisynthPluginInit2(IScriptEnvironment *env) 
{
	env->AddFunction("RAverageW", ".*", Create_RAverageW, 0);
	return "Average plugin";
}
Old 9th November 2011, 23:56   #11  |  Link
Gavino
Avisynth language lover
 
Join Date: Dec 2007
Location: Spain
Posts: 3,375
Quote:
Originally Posted by redfordxx View Post
Atm, this filter requires variable amount of arguments. (Clip, float, clip, float, clip float,...) I want to add more optional named arguments :
y,u,v integers (obviously for plane processing)
bias float
How can I detect what the arguments in fact are? FYI, the code below is how it works now, but do understand that I didn't create it; I took it from somewhere else and modified it.
".*" in the call to AddFunction specifies that it accepts a variable number of arguments (of any type). These are packaged into a single array argument, arguments[0].
You can add further named arguments, e.g. ".*[y]i[u]i[v]i" would add three ints, y, u, v, available as arguments[1], arguments[2] and arguments[3].
I notice your code does not check the type of the variable arguments - it really should, instead of assuming they are correct.
__________________
GScript and GRunT - complex Avisynth scripting made easier
Old 10th November 2011, 00:10   #12  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
Quote:
Originally Posted by IanB View Post
I refuse to try to read stuff that is wider than the screen. Edit your post to collapse the tabs and I might then take an interest.
I fully understand, I am sorry...I collapsed it. Starting from the bottom are the most important things.
Old 10th November 2011, 00:33   #13  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
Quote:
Originally Posted by Gavino View Post
I notice your code does not check the type of the variable arguments - it really should, instead of assuming they are correct.
Do I need to check only the .* ones? How do I check (sorry for my ignorance)?
Old 10th November 2011, 00:55   #14  |  Link
Gavino
Avisynth language lover
 
Join Date: Dec 2007
Location: Spain
Posts: 3,375
Quote:
Originally Posted by redfordxx View Post
Do I need to check only the .* ones? How do I check (sorry for my ignorance)?
Yes, as "." indicates 'any type'; all others have a specific type and this will be checked by the parser before calling your plugin.

You can check by using the IsXXX() functions from avisynth.h.
For example,
Code:
if (array[i].IsClip()) {
  ... do something with array[i].AsClip() ...
}
else
  env->ThrowError("...<a suitable error message>...");
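
Applied to the alternating clip/weight list, the whole check could look roughly like the sketch below. It does not include avisynth.h. Arg is a hypothetical stand-in for AVSValue with just enough behaviour to show the checks, and exceptions stand in for env->ThrowError.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for AVSValue, only for this illustration.
struct Arg {
    enum Kind { Clip, Float, Other };
    Kind kind;
    double number;  // the weight when kind == Float, a clip id when kind == Clip
    bool IsClip() const { return kind == Clip; }
    bool IsFloat() const { return kind == Float; }
};

// Returns (clip id, weight) pairs, or throws if the list is not
// clip, float, clip, float, ...
std::vector<std::pair<int, double>> parse_pairs(const std::vector<Arg>& args) {
    if (args.empty() || args.size() % 2 != 0)
        throw std::invalid_argument("expected clip/weight pairs");
    std::vector<std::pair<int, double>> out;
    for (std::size_t i = 0; i < args.size(); i += 2) {
        if (!args[i].IsClip())
            throw std::invalid_argument("expected a clip");
        if (!args[i + 1].IsFloat())
            throw std::invalid_argument("expected a float weight");
        out.emplace_back(static_cast<int>(args[i].number), args[i + 1].number);
    }
    assert(out.size() * 2 == args.size());
    return out;
}
```

In the real plugin the same loop would run over arguments[0] and report problems through env->ThrowError before any clip is used.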
__________________
GScript and GRunT - complex Avisynth scripting made easier
Old 10th November 2011, 18:42   #15  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
Well, after one day spent figuring out that it does not work because pmulhw is signed, I think I succeeded in getting some speedup, but it is hard to tell because I measure it as CPU Time in Task Manager and it varies by +-10%... I don't know why, maybe the caching situation.
I found cycles.h somewhere and I guess it is something I might need for measuring... but I get these error messages:
1>C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\include\Cycles.h(201): warning C4405: 'ret' : identifier is reserved word
1>C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\include\Cycles.h(463): error C3861: 'elapsed': identifier not found
Anyone has experience with this and may have advice?
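
Not an answer to the cycles.h errors, but as a different, portable way to time code: a sketch with std::chrono from the C++11 standard library (so it needs a compiler that ships it). The helper best_seconds is hypothetical. It runs the code several times and keeps the fastest run, which tends to reduce the run-to-run variation mentioned above.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs f() `runs` times and returns the fastest run in seconds.
template <class F>
double best_seconds(F&& f, int runs) {
    assert(runs > 0);
    double best = -1.0;
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        if (best < 0.0 || s < best) best = s;  // keep the minimum
    }
    return best;
}
```

Usage would be something like best_seconds([&]{ process_frame(); }, 10), where process_frame is whatever routine is being measured.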
Old 11th November 2011, 05:41   #16  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
Quote:
Originally Posted by redfordxx View Post
So basically if I understand it correctly (what latency and reciprocal throughput means) I made a little breakdown of my code.
What it does is load data from memory to mm0, mm2, interleaves with zeros, multiplies and adds to mm4,mm5
The expressions in comment means:
clock for operation: reciprocal throughput / latency --> clock when data ready

.....

1) I believe I could save one clock with the change shown in the comments (then I have zeros in the lower bytes of mm1, mm3, so I used pmulhw); however, I ended up with really ugly output.
To do what I think you want you would need the ISSE instruction pmulhuw (unsigned) instead of the MMX instruction pmulhw (signed). The results may not be bit identical as the lower bits can combine to increment the total result, but some cleverness can avoid this.
Quote:
2)Maybe I can save even more with punpck directly from memory, but I nowhere found documented latency for punpck from memory...and again then I will have to do this pmulhw god knows what will happen
Given you need the same data in 2 places (high & low), this probably won't work out for you here. Also the bytes end up in the wrong half of the word so you need to spend extra instructions repairing this, but again some cleverness can avoid this.
Quote:
3) Q: Latency is relevant only for the destination argument or both? for example with paddusw mm4, mm2 I have to wait for next use of mm4 but can read/write mm2 imediately? Or both mm are blocked?
No, latency is when the result is available. Doing padd mm0, mm7 followed by padd mm1, mm7 incurs no stall. However, if the next instruction needs mm1, i.e. padd mm0, mm1, then you must wait for mm1 to be ready.
Quote:
4) Q: operations on reg32 and mm go in one line or parallel? I mean eg sub and pmullw go in same time or after each other?
Yes, IA32 instructions can be interleaved with MMX/SSE instructions. It is worth noting that MMX/SSE instructions do not affect the flags, so you can move an IA32 test instruction before an MMX/SSE instruction to fill a stall.
Quote:
EDIT: all this can be thrown away, if I misunderstood latency and reciprocal throughput
You seem to have the basic idea, but it is so much more complicated than you are predicting. Not all the execution units have the same hardware, so some instructions will run in series even if the registers involved are completely independent. Also, modern CPUs have very advanced Out Of Order execution, which makes predicting what happens when very difficult. Atom has no OOO, so for that platform it matters a lot.


What does seem apparent is that you are processing memory backwards. This usually kills the hardware prefetch and stuffs performance on modern CPUs. Previously it was used to avoid pushing needed data from the cache with very large data sets.


I had a bash at re-ordering your code, it's possibly not right algorithmically, but it may give you some ideas.
Code:
                          // mm7=0     
spatial:
    mov	eax, [length4]                     // 23: r1.0 / l2    --> 25  
    movq  mm4, mm6                         // 24: r0.3 / l1    --> 25
    movq  mm5, mm6                         // 24: r0.3 / l1    --> 25

temporal:                              
	  mov       ecx, [sourcep+eax-8]       //  1: r1.0 / l2    -->  3
	  mov       edx, [sourcep+eax-4]       //  2: r1.0 / l2    -->  4
	  movq      mm0, qword ptr[ecx+ebx-8]  //  3: r1.0 / l2    -->  5+x
	  movq      mm2, qword ptr[edx+ebx-8]  //  4: r1.0 / l2    -->  6+x
#ifdef ISSE
	  punpckhbw mm1, mm0  // high               : r1.0 / l1    -->     (wait for mm0)
	  punpcklbw mm0, mm7  // low                : r1.0 / l1    -->   
	  punpckhbw mm3, mm2  // high               : r1.0 / l1    -->     (wait for mm2)
	  punpcklbw mm2, mm7  // low                : r1.0 / l1    -->   
	  pmulhuw   mm3, mm1  // high               : r1.0 / l3    -->     ISSE Required!
	  sub       eax, 8    //                    :    
	  pmullw    mm2, mm0  // low                : r1.0 / l3    -->   
	  paddusw   mm5, mm3  // high               : r0.5 / l1    -->   
	  paddusw   mm4, mm2  // low                : r0.5 / l1    -->   
#else
	  movq      mm1, mm0  // frame              : r0.3 / l1    -->     (wait for mm0)
	  punpcklbw mm0, mm7  // low                : r1.0 / l1    -->   
	  movq      mm3, mm2  // mask               : r0.3 / l1    -->     (wait for mm2)
	  punpcklbw mm2, mm7  // low                : r1.0 / l1    -->   
	  punpckhbw mm1, mm7  // high               : r1.0 / l1    -->   
	  pmullw    mm2, mm0  // low                : r1.0 / l3    -->   
	  punpckhbw mm3, mm7  // high               : r1.0 / l1    -->   
	  paddusw   mm4, mm2  // low                : r0.5 / l1    -->   
	  pmullw    mm3, mm1  // high               : r1.0 / l3    -->   
	  sub       eax, 8    //                    :    
	  paddusw   mm5, mm3  // high               : r0.5 / l1    -->   
#endif
	  jnz       temporal  //                  16: r1.0 / l0    --> 17 
                                       
    psrlw     mm4, 8  //low                   17: r1.0 / l1    --> 18
    psrlw     mm5, 8  //high                  18: r1.0 / l1    --> 19
    sub       ebx, 8                       // 19:              --> 20 
    packuswb  mm4, mm5                     // 20: r1.0 / l1    --> 21
    movq      qword ptr[edi+ebx], mm4      // 21: r1.0 / l3    --> 23  (wait for mm4)
    jnz       spatial                      // 22:
Old 11th November 2011, 05:46   #17  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
As others point out using ".*" in your argument list overrides type checking of the arguments and the number of arguments.

So your creator code must validate what it is expecting.
Old 13th November 2011, 10:41   #18  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
Hi,
I would like to do something like this:

Code:
bool	SSEMMXenabled = (env->GetCPUFlags() & CPUF_INTEGER_SSE) !=0;
bool	SSE2enabled = (env->GetCPUFlags() & CPUF_SSE2) !=0;
if (optimisable && SSE2enabled && (sse & 2 ))
{
	#define REGSIZE 16
	#define MOVALL MOVDQA
	#define m0 xmm0
	#define m1 xmm1
	....
	if (y==3) {
		#define PLAN PLANAR_Y
		#include "code.asm"
	}
	if (u==3) {
		#define PLAN PLANAR_U
		#include "code.asm"
	}
	if (v==3) {
		#define PLAN PLANAR_V
		#include "code.asm"
	}
}
else if  (optimisable && SSEMMXenabled && (sse & 1 ))
{
	#define REGSIZE 8
	#define MOVALL MOVQ
	#define m0 mm0
	#define m1 mm1
	....
	if (y==3) {
		#define PLAN PLANAR_Y
		#include "code.asm"
	}
	if (u==3) {
		#define PLAN PLANAR_U
		#include "code.asm"
	}
	if (v==3) {
		#define PLAN PLANAR_V
		#include "code.asm"
	}
}
else
{
	if (y==3) averageplaneW(PLANAR_Y, env);
	if (u==3) averageplaneW(PLANAR_U, env);
	if (v==3) averageplaneW(PLANAR_V, env);
}
when code.asm looks like
Code:
__asm
 {	
	pushad
....
	  emms
	popad
 }
I think my intention is clear from this example. Is that the proper way to do this?
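
One thing to keep in mind with the approach above: the preprocessor does not know about braces, so each #define stays active until the next #define or #undef, and redefining a macro with a different value without #undef usually draws a warning or an error. As a possible alternative, here is a sketch where a template parameter stands in for the register width and a loop over the planes replaces the repeated #define PLAN / #include pairs. The kernel is plain C++ rather than asm, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical kernel: Step stands in for the register width
// (8 for an MMX version, 16 for an SSE2 version).
template <std::size_t Step>
void average_row(const std::uint8_t* a, const std::uint8_t* b,
                 std::uint8_t* dst, std::size_t n) {
    assert(n % Step == 0);
    for (std::size_t x = 0; x < n; x += Step)        // one "register" per iteration
        for (std::size_t k = 0; k < Step; ++k)
            dst[x + k] = static_cast<std::uint8_t>((a[x + k] + b[x + k] + 1) >> 1);
}

// One loop over the planes replaces the three copies of #define PLAN / #include.
template <std::size_t Step>
void average_planes(const std::uint8_t* const a[3], const std::uint8_t* const b[3],
                    std::uint8_t* const dst[3], const std::size_t n[3],
                    const int enabled[3]) {
    for (int p = 0; p < 3; ++p)
        if (enabled[p] == 3)                         // same convention as y,u,v=3
            average_row<Step>(a[p], b[p], dst[p], n[p]);
}
```

The dispatch would then be average_planes<16>(...) in the SSE2 branch and average_planes<8>(...) in the MMX branch, with the compiler generating one copy of the code per width.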
Old 13th November 2011, 10:58   #19  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
My measurements:
I am doing it on 8192x4096x500 blankclips and I look at CPU time in Task Manager. I don't know anything better.
These are my results:
Code:
clips       1     4     8     16
C++         59    137
Original    17    38    77
MMX         9     29    62    109
XMM         10    16    37    59
The table shows CPU time, and the top row is the number of input clips to average. Original is the Average plugin published in the other thread. MMX and XMM are my versions.
Clearly there is an overhead of about 9 sec for the loops through the plane, which is significant, especially with a small number of clips. But I tried.
Moreover, it seems obvious to me that the time increases with the number of clips because the processor does not read 16 clips in parallel efficiently. A caching problem. Is there any way to help? Like telling the machine that the caching should work this way and not that way?
Now I will try to process it using all the XMM and MM registers; maybe it will give a little boost.

Last edited by redfordxx; 13th November 2011 at 11:05.
Old 15th November 2011, 19:10   #20  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
Small question. For pointers and counters, do I have anything else available to use than eax, ebx, ecx, edx, edi, esi?
Because if I am going to do stacked 16 bits I would appreciate more registers, if that is possible.
Also, the compiler always warns me: frame pointer register 'ebx' modified by inline assembly code.
Everything seems to work OK though, so should that bother me?