#1 | Link |
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
|
RedAverage plugin (version 1.4.3 biiiiig bugfix)
EDIT: I am removing the original post, which is quoted in IanB's answer anyway, and replacing it with some info about what has been done.
This plugin started with the idea of improving mg262's original Average. Here I thank him for his work. The new plugin uses different names to avoid name conflicts, and there are currently the following filters:
RAverageM - masked average
RAverageW - weighted average
RMerge - yet another merge filter

Common parameters:
y,u,v=3 ...process plane; other numbers disable processing of the plane atm for RAverageW and RAverageM. For RMerge it works just like in masktools
bias=0 ...number which is added to the finally computed value (just fyi: bias=-0.5 kills rounding of the result)
sse=-1 ...limits the processor capabilities used. -1=use all (bitwise: 16,8,4=SSE4.1,SSSE3,SSE2), 0=no asm, which is veeeeryyyy sloooow
lsb_in=false ...if true, 16bit stacked input is expected (for a more detailed explanation of 16bit clips, look up Dither tools)
lsb_out=false ...if true, 16bit stacked output is produced
All parameters are optional, except the input clips. There must be at least one input clip pair.

RAverageW(clip1, weight1, clip2, weight2, ... clipn,weightn,bias,y,u,v,sse, bool lsb_in, bool lsb_out, int mode, int errlevel, bool silent)
computes the weighted average of clips: result=c1*w1+c2*w2+...cn*wn+bias
silent=false ...if true, certain error messages (mostly about rounding error) are suppressed
errlevel=0 ...sets the maximum rounding error you are willing to accept as a tradeoff for a faster method
mode=-1 ...bitwise 4, 8 for enabling method 4, 8. 0 results in unoptimized C++.
- mode 8: requires SSE2, has higher precision than 4 but, depending on hardware, will probably be slower
- mode 4: requires SSSE3, is faster, however has certain limitations in precision (unless silent is true, you will be warned) (8bit in only atm, 8 or 16 out). Basically method 4 is fully precise when the weights are nice power-of-two numbers, specifically: w_i=a_i*2^k (a_i are integers and k is an integer, all signed... moreover sum[max(a_i,0)]<128 and sum[min(a_i,0)]>-128 ...in other words, the weights must be scalable to signed byte;-)
- mode 0: plain C++, slow, fully precise (I think) and fully supports 8 & 16bit clips in and out.
Concerning 16bit support and conversion when the input and output have different bitdepths, the values are relative to full scale. That introduces multiplication or division by 256 when the bitdepth changes. So: 8bitvalue*weight*256=16bitvalue.

RAverageM(clip1, mask1, clip2, mask2, ... clipn,maskn,bias,y,u,v,sse, bool lsb_in, bool lsb_out)
computes the masked average of clips: result=c1*m1+c2*m2+...cn*mn+bias

RMerge(clip1, clip2, mask,y,u,v,sse, bool lsb_in, bool lsb_out, int mode)
Merges two clips based on a mask, as expected. However, there are some new things. First, again, 16bit support (although not fully optimized). Then there are two modes for 8bit input, which decide whether the mask is applied slightly differently to correct the problem of the incomplete range of the mask:
mode=255 ...this is the standard merge formula: r=c1*(256-m)+c2*m
mode=256 ...this is the adjusted merge formula: r=c1*(256-n)+c2*n where n= ( m<=128 ? m : m+(m-128)/128 )
lsb_in=true ...this is the 16bit merge formula: r=c1*(65536-m)+c2*m
Then r is scaled and rounded to the result r8 or r16, depending on the output bitdepth. The goal of mode 256 is to allow merge values in the full range, so that if m=255, then r8=c2 (and not r8=(c1+c2*255+128)/256 as usual and as in mode 255).
Only lsb_in=lsb_out=false and mode=255 is SSSE3 optimized at this moment.
Anything else goes plain C++. But on my configuration the optimized version performed better than mt_merge.

Examples:
Code:
    RAverageW(c1, 0.5, c2, 0.5) #same as mt_average
    RAverageW(c1, 1, c2, -1, bias=128) #same as mt_makediff
    RAverageW(c1, 1, c2, 1, bias=-128) #same as mt_adddiff
    a2=RAverageW(a1, -1, bias=255) #same as mt_invert
    RAverageM(c1, a1, c2, a2) #similar to mt_merge
    RMerge(c1, c2, m) #same as mt_merge
    RMerge(c1, c2, m, mode=255) #same as mt_merge
    RMerge(c1, c2, m, mode=256) #not same as mt_merge
Code:
    #restoring blended telecined video
    f0=o.SelectEvery(5,0)
    f1=o.SelectEvery(5,1)
    f2=o.SelectEvery(5,2)
    f3=o.SelectEvery(5,3)
    f4=o.SelectEvery(5,4)
    n3=RAverageW(f1,-0.5,f2,1,f3,1,f4,-0.5)
    return Interleave(f0,f1,n3,f4)

1.4.3 serious bugfix, thanx to Bloax for pointing it out
1.4.2 bugfix of RMerge SSSE3, switched clips in RMerge to match the mt_merge formula, minor improvements
1.4.1 SSSE3 optimization of RMerge in 8bit
1.4.0 introduction of the new filter RMerge
1.3.10 full 16bit support for RAverageW mode 8 (SSE2) and minor speed improvements
1.3.9 added 16bit support for RAverageW, however not always SIMD optimized. Multiple algorithms available, differing in speed and precision. Added a thorough estimation of the rounding error at the beginning, so that the most suitable algo is chosen. Ugly source included.
1.3.5 something I am not afraid to publish, but I am afraid to publish the messy src

Todo's (maybe):
- optimization
- mixed bitdepth for merge masks
- RTotal

I did some speed benchmarking, so why not share it. It is a rough measurement, with the C++ code as the reference speed:
Code:
    Filter     method      2 clips  4 clips  8 clips
    Average    C++            89%
    RAverageW  C++           100%     100%     100%
    Average    SIMD          388%     354%     351%
    RAverageW  8 - SSE2     1213%     906%     738%
    RAverageW  4 - SSSE3    1213%     957%    1176%

Disclaimer: These filters are in the development stage and I cannot guarantee that there are no bugs, so feel free to use them, test them and comment on them. If you report a bug, please demonstrate it with a reproducible clip. For instance:
Code:
    c=ColorBars(pixel_type="YV12")
    m=mt_lutspa(c,expr="x y * 2 256 * *", mode="relative", y=3,u=3,v=3).trim(1,1).Loop(c.framecount,0,0)
    RAverageM(m,c,m,c,bias=-10000,lsb_in=false,lsb_out=true)

Use this thread only for development issues; for "how do I..." questions, use the Average plug-in thread.
I am adding links to new versions to the changelog.
Last edited by redfordxx; 17th January 2012 at 16:41.
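For readers who want to check the two 8bit RMerge formulas by hand, here is a minimal scalar sketch in C++. It is not the plugin's code: the names `adjust_mask` and `merge8` are made up for illustration, and the final rounding `(r+128)/256` is taken from the mode 255 expression in the post. One assumption is labeled in the code: the post does not say how the division in `n` is rounded, so the sketch rounds it to the nearest integer, which is what makes m=255 give n=256 and therefore r8=c2.

```cpp
#include <cassert>
#include <cstdint>

// Adjusted mask value n for mode 256.
// ASSUMPTION: (m-128)/128 is rounded to the nearest integer, so that
// m=255 maps to n=256. The post does not state the rounding rule.
inline int adjust_mask(int m) {
    assert(m >= 0 && m <= 255);
    return m <= 128 ? m : m + (m - 128 + 64) / 128;
}

// Scalar 8bit merge of one pixel, following the formulas in the post:
//   mode 255: r = c1*(256-m) + c2*m
//   mode 256: r = c1*(256-n) + c2*n
// followed by scaling and rounding to 8 bits: r8 = (r + 128) / 256.
inline std::uint8_t merge8(std::uint8_t c1, std::uint8_t c2,
                           std::uint8_t m, int mode) {
    assert(mode == 255 || mode == 256);
    const int n = (mode == 256) ? adjust_mask(m) : m;
    const int r = c1 * (256 - n) + c2 * n;
    return static_cast<std::uint8_t>((r + 128) >> 8);
}
```

With c1=0, c2=255 and a full mask m=255, `merge8(..., 255)` gives 254 while `merge8(..., 256)` gives 255, which is the incomplete-range problem mode 256 is meant to correct.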
#2 | Link | |||||
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
|
Quote:
Many CPUs only do SSE2 as pairs of 64-bit ops internally, which can extend the instruction latency badly, leading to pipe stalls that you cannot get rid of no matter how you shuffle your code.
My old 3GHz P4 HT generally runs SSE2 code the same as or slower than the MMX/ISSE code, sometimes a lot slower. My less old Core2 generally runs SSE2 code the same as the best MMX/ISSE code, sometimes a little faster, very occasionally a lot faster. i3 and up start to rock, but they also have reduced latency and improved throughput on MMX, which means you have to work hard to get the SSE2+ code significantly faster.
Wander through the x264 code and discussions on how various things bite and don't work as expected.
Quote:
Quote:
Quote:
Quote:
Pitch is always at least (width+15)/16*16 for all planes. You can read and write the data in the area between pitch and rowsize. It must be considered uninitialised (it will be random old frame data). Non-aligned crops can make the start address of a row non-aligned, but the original VideoFrameBuffer would have been aligned so you can backstep the row start to regain alignment. All data outside the active picture area must be considered uninitialised. |
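The pitch rule and the "backstep to regain alignment" trick can be illustrated with a small C++ sketch. The helper names (`aligned_pitch`, `align_down16`, `backstep_bytes`) are hypothetical and not part of the AviSynth API; the sketch only shows the arithmetic, and `aligned_pitch` returns the minimum the post describes, since the real pitch is only guaranteed to be at least that large.

```cpp
#include <cassert>
#include <cstdint>

// Smallest pitch satisfying the rule quoted above: (width+15)/16*16.
inline int aligned_pitch(int row_size) {
    assert(row_size > 0);
    return (row_size + 15) / 16 * 16;
}

// Number of bytes a row pointer sits past the previous 16-byte boundary
// (non-zero after a non-aligned crop).
inline int backstep_bytes(const std::uint8_t* row) {
    return static_cast<int>(reinterpret_cast<std::uintptr_t>(row) & 15u);
}

// Step a row pointer back to the previous 16-byte boundary. The bytes
// between the returned pointer and `row` lie outside the active picture
// and must be treated as uninitialised.
inline const std::uint8_t* align_down16(const std::uint8_t* row) {
    return row - backstep_bytes(row);
}
```

A loop that starts at `align_down16(row)` processes `backstep_bytes(row)` extra bytes at the front of each row; that is harmless for reading, but the output for those bytes has to be discarded or land in padding.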
|||||
#3 | Link |
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
|
I am looking through my old code and this is what I see:
Code:
    env->GetCPUFlags() & CPUF_INTEGER_SSE) != 0
    punpcklbw mm2, mm7//low
I think this is not good, I guess?
#4 | Link |
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
|
1) From what you wrote, basically I should not bother too much with the speed issues, since every machine implements it differently, correct? Anyway, could you recommend a document that gives clock cycles or some other speed specification for each instruction? Maybe to learn whether pxor is faster than movq on MMX...
2) MMX<->XMM should still be faster than going through memory on every machine, or not?
3) So since the pitch is mod16, basically it is safe (and a good idea???) to run one loop from 0 to pitch*height, instead of two nested loops for height and width?
4) Saturation means 250+10=255?
5) Should I be interested in MOVNTDQ? I don't know what the benefit of nontemporal is.
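On question 4, the difference between a saturating and a wrapping byte add can be shown with a short scalar C++ sketch. The function names are made up for illustration; they mirror per byte what the unsigned saturating add `paddusb` and the plain add `paddb` do.

```cpp
#include <cstdint>

// Unsigned saturating add of two bytes: the result is clamped at 255,
// as paddusb does for each byte lane.
inline std::uint8_t add_sat_u8(std::uint8_t a, std::uint8_t b) {
    const int sum = a + b;
    return static_cast<std::uint8_t>(sum > 255 ? 255 : sum);
}

// Plain wrapping add of two bytes: the result is taken modulo 256,
// as paddb does for each byte lane.
inline std::uint8_t add_wrap_u8(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a + b);
}
```

So with saturation 250+10 gives 255, whereas the wrapping add gives 4.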
#5 | Link | |||||||
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
|
Quote:
Your capability testing should always match your instruction usage. Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
In the HResizer we read 1 row of input, unpack it to a temp array, then FIR the living hell out of that temp array. By reading the source rows with MOVNTQ we avoid flushing the start of the temp array from L1 cache before the first FIR operation. On most machines the temp array and FIR coefficients live in the L1 cache and hence the code screams along. This is an exceptional case and much effort by many people went into getting this optimal.
|||||||
#6 | Link |
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
|
So basically, if I understand correctly what latency and reciprocal throughput mean, I made a little breakdown of my code.
What it does is load data from memory into mm0 and mm2, interleave it with zeros, multiply, and add to mm4 and mm5. The expressions in the comments mean: clock for operation: reciprocal throughput / latency --> clock when data ready
Code:
    spatial:
        mov eax, [length4] //n              // 23: r1.0 / l2 --> 25
        movq mm4, mm6                       // 24: r0.3 / l1 --> 25
        movq mm5, mm6                       // 24: r0.3 / l1 --> 25
    temporal:
        mov ecx, [sourcep+eax-8]            // 1: r1.0 / l2 --> 3
        mov edx, [sourcep+eax-4]            // 2: r1.0 / l2 --> 4
        movq mm0, qword ptr [ecx+ebx-8]     //3: r1.0 / l2 --> 5
        movq mm2, qword ptr [edx+ebx-8]     //4: r1.0 / l2 --> 6
        movq mm1, mm0 //frame               5: r0.3 / l1 --> 6
        movq mm3, mm2 //mask                6: r0.3 / l1 --> 7 (wait for mm2)
        // pxor mm1, mm1
        // pxor mm3, mm3
        // mm7=000000000000
        punpcklbw mm0, mm7//low             7: r1.0 / l1 --> 8
        punpcklbw mm2, mm7//low             8: r1.0 / l1 --> 9
        punpckhbw mm1, mm7//high            9: r1.0 / l1 --> 10
        punpckhbw mm3, mm7//high            10: r1.0 / l1 --> 11
        // punpckhbw mm1, mm0//high
        // punpckhbw mm3, mm2//high
        pmullw mm2, mm0//low                11: r1.0 / l3 --> 14
        pmullw mm3, mm1//high               12: r1.0 / l3 --> 15
        // pmulhw mm3, mm1//high
        sub eax, 8                          //13:
        paddusw mm4, mm2//low               14: r0.5 / l1 --> 15
        paddusw mm5, mm3//high              15: r0.5 / l1 --> 16 (wait for mm3)
        jnz temporal                        //16: r1.0 / l0 --> 17
        psrlw mm4, 8//low                   17: r1.0 / l1 --> 18 ???
        psrlw mm5, 8//high                  18: r1.0 / l1 --> 19 ???
        packuswb mm4, mm5                   // 19: r1.0 / l1 --> 20
        movq qword ptr [edi+ebx-8], mm4     // 20: r1.0 / l3 --> 23
        sub ebx, 8                          // 21:
        jnz spatial                         // 22:
2) Maybe I can save even more with punpck directly from memory, but I found no documented latency for punpck from memory anywhere... and again I would then have to do this pmulhw, and god knows what will happen.
3) Q: Is latency relevant only for the destination argument or for both? For example, with paddusw mm4, mm2 I have to wait before the next use of mm4, but can I read/write mm2 immediately? Or are both mm registers blocked?
4) Q: Do operations on reg32 and mm run in one line or in parallel? I mean, e.g., do sub and pmullw go at the same time or one after the other?
EDIT: all this can be thrown away if I misunderstood latency and throughput.
Last edited by redfordxx; 10th November 2011 at 00:06.
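As a cross-check for the assembly above, here is a scalar C++ sketch of what the loop appears to compute per pixel, read off the instructions: each frame byte is multiplied by its mask byte (`pmullw`, the product always fits in 16 bits), the products are accumulated with unsigned 16bit saturation (`paddusw`), and the sum is shifted right by 8 (`psrlw`, `packuswb`). The names `MaskedClip`, `masked_average_row` and the `init` parameter (standing in for whatever mm6 holds) are assumptions for illustration, not the plugin's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pairing of one frame row with its mask row.
struct MaskedClip {
    const std::uint8_t* frame;
    const std::uint8_t* mask;
};

// Scalar reference for one row:
//   acc = init; acc = sat16(acc + frame*mask) for each clip; out = acc >> 8
void masked_average_row(const std::vector<MaskedClip>& clips,
                        std::uint16_t init,
                        std::uint8_t* dst, std::size_t width) {
    assert(!clips.empty());
    for (std::size_t x = 0; x < width; ++x) {
        std::uint32_t acc = init;
        for (const MaskedClip& c : clips) {
            acc += static_cast<std::uint32_t>(c.frame[x]) * c.mask[x];
            if (acc > 65535u) acc = 65535u;  // unsigned 16bit saturation
        }
        dst[x] = static_cast<std::uint8_t>(acc >> 8);
    }
}
```

Because all terms are non-negative, the saturated sum does not depend on the order in which the clips are visited, so a reference like this can be compared byte for byte against the SIMD output when testing.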
#8 | Link |
Registered User
Join Date: Dec 2007
Location: Enschede, NL
Posts: 302
|
test eax, 1 ?
__________________
Roelofs Coaching |
#9 | Link |
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
|
I refuse to try to read stuff that is wider than the screen. Edit your post to collapse the tabs and I might then take an interest.
Thanks, it makes some sense when you can see the whole picture.
Last edited by IanB; 10th November 2011 at 22:49. Reason: Wish was granted
#10 | Link |
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
|
And then jnz, I suppose. Thanx.
I have one slightly different question, about calling the filter. Atm, this filter takes a variable number of arguments (clip, float, clip, float, clip, float, ...). I want to add more optional named arguments: y, u, v integers (obviously for plane processing) and a float bias. How can I detect what the arguments in fact are? FYI, the code below shows how it works now, but do understand that I didn't create it; I took it from somewhere else and modified it.
Code:
    #include <vector>
    #include "avisynth.h"
    using std::vector;

    // One input clip together with its weight
    struct WeightedClip {
        PClip clip;
        double weight;
        WeightedClip(PClip _clip, double _weight) : clip(_clip), weight(_weight) {}
    };

    AVSValue __cdecl Create_RAverageW(AVSValue arguments, void *user_data, IScriptEnvironment *env) {
        // arguments[0] is the array of unnamed (clip, weight) arguments
        AVSValue array = arguments[0];
        int noarguments = array.ArraySize();
        if (noarguments % 2 != 0)
            env->ThrowError("Average requires an even number of arguments.");
        vector<WeightedClip> clips;
        for (int i = 0; i < noarguments; i += 2)
            clips.push_back(WeightedClip(array[i].AsClip(), array[i+1].AsFloat()));
        return new RAverageW(clips[0].clip, clips, env);
    }

    extern "C" __declspec(dllexport) const char *__stdcall AvisynthPluginInit2(IScriptEnvironment *env) {
        env->AddFunction("RAverageW", ".*", Create_RAverageW, 0);
        return "Average plugin";
    }
#11 | Link | |
Avisynth language lover
Join Date: Dec 2007
Location: Spain
Posts: 3,439
|
Quote:
You can add further named arguments, e.g. ".*[y]i[u]i[v]i" would add three ints, y, u and v, available as arguments[1], arguments[2] and arguments[3]. I notice your code does not check the type of the variable arguments - it really should, instead of assuming they are correct.
|
#14 | Link | |
Avisynth language lover
Join Date: Dec 2007
Location: Spain
Posts: 3,439
|
Quote:
You can check by using the IsXXX() functions from avisynth.h. For example,
Code:
    if (array[i].IsClip()) {
        // ... do something with array[i].AsClip() ...
    }
    else
        env->ThrowError("...<a suitable error message>...");
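Applied to the clip/weight loop from earlier in the thread, the type check could look like the following C++ sketch. To keep it compilable without avisynth.h, `Arg` is a hypothetical stand-in for AVSValue and exceptions stand in for env->ThrowError; only the `IsClip()` and `IsFloat()` names mirror the real AVSValue methods. `parse_weighted_clips` and the integer clip ids are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for AVSValue: a value that is a clip, a float,
// or something else. A clip is represented by an integer id here.
struct Arg {
    enum class Kind { Clip, Float, Other } kind;
    int clip_id;
    double number;
    bool IsClip() const { return kind == Kind::Clip; }
    bool IsFloat() const { return kind == Kind::Float; }
};

// Validate an alternating (clip, weight, clip, weight, ...) list and
// return the pairs; throw instead of assuming the types are correct.
std::vector<std::pair<int, double>>
parse_weighted_clips(const std::vector<Arg>& args) {
    if (args.empty() || args.size() % 2 != 0)
        throw std::invalid_argument("RAverageW: expected clip/weight pairs");
    std::vector<std::pair<int, double>> clips;
    for (std::size_t i = 0; i < args.size(); i += 2) {
        if (!args[i].IsClip())
            throw std::invalid_argument(
                "RAverageW: argument " + std::to_string(i + 1) + " must be a clip");
        if (!args[i + 1].IsFloat())
            throw std::invalid_argument(
                "RAverageW: argument " + std::to_string(i + 2) + " must be a number");
        clips.emplace_back(args[i].clip_id, args[i + 1].number);
    }
    return clips;
}
```

Reporting the 1-based position of the offending argument makes script errors much easier to track down than a generic "invalid arguments" message.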
|
#15 | Link |
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
|
Well, after one day figuring out that it does not work coz pmulhw is signed, I think I succeeded in some speedup, but it is hard to tell because I measure CPU Time in Task Manager and it varies +-10%... I don't know, maybe a caching situation.
I found cycles.h somewhere and I guess it is something I might need for measuring... but I get these error messages:
1>C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\include\Cycles.h(201): warning C4405: 'ret' : identifier is reserved word
1>C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\include\Cycles.h(463): error C3861: 'elapsed': identifier not found
Does anyone have experience with this and maybe some advice?
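One portable way to get less noisy numbers than Task Manager CPU time is to time the work several times with the standard C++ clock and take the median. The sketch below is only a generic illustration under that assumption: `median_seconds` is a made-up helper, it measures wall-clock time rather than CPU cycles, and it is not a replacement for whatever cycles.h was meant to provide.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Run `work` `runs` times and return the median wall-clock duration in
// seconds. The median is less sensitive than the mean to the occasional
// slow run caused by caching or other processes.
double median_seconds(const std::function<void()>& work, int runs) {
    assert(runs > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

The `work` callback would wrap one call of the routine being measured, for example processing a single frame, so that different implementations can be compared on the same input.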
#16 | Link | |||||
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
|
Quote:
Quote:
Quote:
Quote:
Quote:
What does seem apparent is that you are processing memory backwards. This usually kills the hardware prefetch and stuffs performance on modern CPUs. Previously it was used to avoid pushing needed data from the cache with very large data sets. I had a bash at re-ordering your code; it's possibly not right algorithmically, but it may give you some ideas.
Code:
    // mm7=0
    spatial:
        mov eax, [length4]                // 23: r1.0 / l2 --> 25
        movq mm4, mm6                     // 24: r0.3 / l1 --> 25
        movq mm5, mm6                     // 24: r0.3 / l1 --> 25
    temporal:
        mov ecx, [sourcep+eax-8]          // 1: r1.0 / l2 --> 3
        mov edx, [sourcep+eax-4]          // 2: r1.0 / l2 --> 4
        movq mm0, qword ptr[ecx+ebx-8]    // 3: r1.0 / l2 --> 5+x
        movq mm2, qword ptr[edx+ebx-8]    // 4: r1.0 / l2 --> 6+x
    #ifdef ISSE
        punpckhbw mm1, mm0                // high : r1.0 / l1 --> (wait for mm0)
        punpcklbw mm0, mm7                // low : r1.0 / l1 -->
        punpckhbw mm3, mm2                // high : r1.0 / l1 --> (wait for mm2)
        punpcklbw mm2, mm7                // low : r1.0 / l1 -->
        pmulhuw mm3, mm1                  // high : r1.0 / l3 --> ISSE Required!
        sub eax, 8                        // :
        pmullw mm2, mm0                   // low : r1.0 / l3 -->
        paddusw mm5, mm3                  // high : r0.5 / l1 -->
        paddusw mm4, mm2                  // low : r0.5 / l1 -->
    #else
        movq mm1, mm0                     // frame : r0.3 / l1 --> (wait for mm0)
        punpcklbw mm0, mm7                // low : r1.0 / l1 -->
        movq mm3, mm2                     // mask : r0.3 / l1 --> (wait for mm2)
        punpcklbw mm2, mm7                // low : r1.0 / l1 -->
        punpckhbw mm1, mm7                // high : r1.0 / l1 -->
        pmullw mm2, mm0                   // low : r1.0 / l3 -->
        punpckhbw mm3, mm7                // high : r1.0 / l1 -->
        paddusw mm4, mm2                  // low : r0.5 / l1 -->
        pmullw mm3, mm1                   // high : r1.0 / l3 -->
        sub eax, 8                        // :
        paddusw mm5, mm3                  // high : r0.5 / l1 -->
    #endif
        jnz temporal                      // 16: r1.0 / l0 --> 17
        psrlw mm4, 8                      //low 17: r1.0 / l1 --> 18
        psrlw mm5, 8                      //high 18: r1.0 / l1 --> 19
        sub ebx, 8                        // 19: --> 20
        packuswb mm4, mm5                 // 20: r1.0 / l1 --> 21
        movq qword ptr[edi+ebx], mm4      // 21: r1.0 / l3 --> 23 (wait for mm4)
        jnz spatial                       // 22:
|||||
#18 | Link |
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
|
Hi,
I would like to do something like this:
Code:
    bool SSEMMXenabled = (env->GetCPUFlags() & CPUF_INTEGER_SSE) !=0;
    bool SSE2enabled = (env->GetCPUFlags() & CPUF_SSE2) !=0;
    if (optimisable && SSE2enabled && (sse & 2 )) {
        #define REGSIZE 16
        #define MOVALL MOVDQA
        #define m0 xmm0
        #define m1 xmm1
        ....
        if (y==3) {
            #define PLAN PLANAR_Y
            #include "code.asm"
        }
        if (u==3) {
            #define PLAN PLANAR_U
            #include "code.asm"
        }
        if (v==3) {
            #define PLAN PLANAR_V
            #include "code.asm"
        }
    } else if (optimisable && SSEMMXenabled && (sse & 1 )) {
        #define REGSIZE 8
        #define MOVALL MOVQ
        #define m0 mm0
        #define m1 mm1
        ....
        if (y==3) {
            #define PLAN PLANAR_Y
            #include "code.asm"
        }
        if (u==3) {
            #define PLAN PLANAR_U
            #include "code.asm"
        }
        if (v==3) {
            #define PLAN PLANAR_V
            #include "code.asm"
        }
    } else {
        if (y==3) averageplaneW(PLANAR_Y, env);
        if (u==3) averageplaneW(PLANAR_U, env);
        if (v==3) averageplaneW(PLANAR_V, env);
    }
Code:
    __asm {
        pushad
        ....
        emms
        popad
    }
#19 | Link |
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
|
My measurements:
I am running it on 8192x4096x500 blankclips and I look at CPU time in Task Manager. I don't know anything better. These are my results:
Code:
    clips      1    4    8   16
    C++       59  137
    Original  17   38   77
    MMX        9   29   62  109
    XMM       10   16   37   59
Clearly there is an overhead of about 9 sec for the loops through the plane, which is significant, especially with a small number of clips. But I tried. Moreover, it seems obvious to me that the time increases with the number of clips because the processor does not read 16 clips in parallel efficiently. Caching problem. Is there any way to help? Like telling the machine that the caching should be this way and not that way? Now I will try processing with all XMM and MM registers in use; maybe it will boost things a little.
Last edited by redfordxx; 13th November 2011 at 11:05.
#20 | Link |
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
|
Small question. For pointers and counters, do I have anything else available to use than eax, ebx, ecx, edx, edi, esi?
Coz if I am going to do stacked 16bit, I would appreciate more registers, if that is possible. Also, the compiler always warns me: frame pointer register 'ebx' modified by inline assembly code. Everything seems to work OK though, so should that bother me?