11th November 2011, 05:41   #16
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
Quote:
Originally Posted by redfordxx
So basically if I understand it correctly (what latency and reciprocal throughput means) I made a little breakdown of my code.
What it does is load data from memory to mm0, mm2, interleaves with zeros, multiplies and adds to mm4,mm5
The expressions in comment means:
clock for operation: reciprocal throughput / latency --> clock when data ready

.....

1) I believe I could save one clock with the change in comments (then I have zeros in the lower bytes of mm1, mm3, so I used pmulhw), however I ended up with really ugly output.
To do what I think you want, you would need the ISSE instruction pmulhuw (unsigned) instead of the MMX instruction pmulhw (signed). The results may not be bit-identical, as the lower bits can combine to increment the total result, but some cleverness can avoid this.
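As a rough illustration of the signed versus unsigned problem, here is a plain C model of one 16-bit lane of each instruction. It is only a sketch: the function names are mine, not intrinsics, and it assumes the usual two's complement conversion when a value above 32767 is reinterpreted as int16_t.

```c
#include <stdint.h>

/* One 16-bit lane of pmulhw: signed multiply, keep the high 16 bits.
   Assumes two's complement wrap when converting to int16_t. */
static uint16_t pmulhw_lane(uint16_t a, uint16_t b)
{
    int32_t p = (int32_t)(int16_t)a * (int32_t)(int16_t)b;
    return (uint16_t)((uint32_t)p >> 16);
}

/* One 16-bit lane of pmulhuw: unsigned multiply, keep the high 16 bits. */
static uint16_t pmulhuw_lane(uint16_t a, uint16_t b)
{
    uint32_t p = (uint32_t)a * (uint32_t)b;
    return (uint16_t)(p >> 16);
}
```

With the bytes 200 and 3 sitting in the high half of their words and zeros below, pmulhuw_lane(200 << 8, 3 << 8) gives 600, i.e. the plain product, while pmulhw_lane treats 200 << 8 as negative and gives 65368. If the low bytes are not zero they carry into the high half as well, e.g. low bytes of 255 in both operands turn the 600 into 803.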
Quote:
2)Maybe I can save even more with punpck directly from memory, but I nowhere found documented latency for punpck from memory...and again then I will have to do this pmulhw god knows what will happen
Given you need the same data in 2 places (high & low), this probably won't work out for you here. Also the bytes end up in the wrong half of the word so you need to spend extra instructions repairing this, but again some cleverness can avoid this.
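To show what "wrong half of the word" means, here is a small C model of punpcklbw on a 64-bit value. It is a sketch only, the function is a stand-in for the instruction, not a real intrinsic.

```c
#include <stdint.h>

/* Model of punpcklbw dst, src on 64-bit MMX registers: interleave the low
   four bytes of each operand. Bytes from dst land in the low byte of every
   word, bytes from src land in the high byte. */
static uint64_t punpcklbw(uint64_t dst, uint64_t src)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t lo = (dst >> (8 * i)) & 0xFF;   /* byte i of dst */
        uint64_t hi = (src >> (8 * i)) & 0xFF;   /* byte i of src */
        r |= (lo | (hi << 8)) << (16 * i);
    }
    return r;
}
```

With the data in the register and zero as the source, punpcklbw(0x04030201, 0) gives 0x0004000300020001, i.e. clean zero-extended words. With a zeroed register and the data as the (memory) source, punpcklbw(0, 0x04030201) gives 0x0400030002000100, so every byte is pre-shifted left by 8.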
Quote:
3) Q: Is latency relevant only for the destination argument or both? For example, with paddusw mm4, mm2 I have to wait for the next use of mm4 but can read/write mm2 immediately? Or are both mm registers blocked?
No, latency is when the result becomes available. Doing padd mm0, mm7 followed by padd mm1, mm7 incurs no stall. However, if the next instruction needs mm1, e.g. padd mm0, mm1, then you must wait for mm1 to be ready.
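A toy way to play with this is a one-instruction-per-clock, in-order scoreboard in C. This is a deliberately simplified sketch: the Insn struct and schedule function are made up for illustration, throughput limits and multiple execution units are ignored, and real CPUs will not behave this neatly.

```c
enum { NREG = 8 };              /* mm0..mm7 */

typedef struct {
    int dst;                    /* destination register, also read as a source */
    int src;                    /* second source register */
    int latency;                /* clocks until dst is usable again */
} Insn;

/* Issue instructions in order, at most one per clock, stalling until both
   operands are ready. Fills issue[] with the clock each instruction starts
   and returns the clock at which the last result is available. */
static int schedule(const Insn *code, int n, int *issue)
{
    int ready[NREG] = {0};
    int clock = 0, done = 0;
    for (int i = 0; i < n; i++) {
        int t = clock;
        if (ready[code[i].dst] > t) t = ready[code[i].dst];
        if (ready[code[i].src] > t) t = ready[code[i].src];
        issue[i] = t;
        ready[code[i].dst] = t + code[i].latency;
        if (ready[code[i].dst] > done) done = ready[code[i].dst];
        clock = t + 1;
    }
    return done;
}
```

Feeding it pmullw mm1, mm7 (latency 3), padd mm0, mm7 (latency 1), padd mm0, mm1 (latency 1) issues the first two on clocks 0 and 1, because reading mm7 twice costs nothing, but holds the third until clock 3 when mm1 is ready.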
Quote:
4) Q: operations on reg32 and mm go in one line or parallel? I mean eg sub and pmullw go in same time or after each other?
Yes, IA32 instructions can be interleaved with MMX/SSE instructions. It is worth noting that these MMX/SSE instructions do not affect the flags, so you can move an IA32 test instruction ahead of an MMX/SSE instruction to fill a stall.
Quote:
EDIT: all this can be thrown away, if I misunderstood latency and reciprocal throughput
You seem to have the basic idea, but it is so much more complicated than you are predicting. Not all the execution units have the same hardware, so some instructions will run in series even if the registers involved are completely independent. Also, modern CPUs have very advanced Out Of Order execution, which makes predicting what happens when very difficult. Atom has no OOO, so for that platform it matters a lot.


What does seem apparent is that you are processing memory backwards. This usually kills the hardware prefetch and stuffs performance on modern CPUs. Previously it was used to avoid pushing needed data out of the cache with very large data sets.
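One common way to keep the "count down to zero, then jnz" loop shape while still walking memory upwards is to point at the end of the buffers and run a negative index up to zero. A minimal C sketch, with hypothetical function names and a simple per-byte (frame * mask) >> 8 standing in for the real work:

```c
#include <stddef.h>
#include <stdint.h>

/* Backward walk: index counts down to zero, addresses descend. */
static void blend_backward(uint8_t *dst, const uint8_t *frame,
                           const uint8_t *mask, size_t n)
{
    for (size_t i = n; i != 0; i--)
        dst[i - 1] = (uint8_t)((frame[i - 1] * mask[i - 1]) >> 8);
}

/* Forward walk: pointers are biased to the end of each buffer and the
   index runs from -n up to zero, so addresses ascend but the loop still
   terminates on a zero test. */
static void blend_forward(uint8_t *dst, const uint8_t *frame,
                          const uint8_t *mask, size_t n)
{
    uint8_t *d = dst + n;
    const uint8_t *f = frame + n;
    const uint8_t *m = mask + n;
    for (ptrdiff_t i = -(ptrdiff_t)n; i != 0; i++)
        d[i] = (uint8_t)((f[i] * m[i]) >> 8);
}
```

Both produce the same bytes. In assembly the forward version becomes an add on the index register followed by jnz, so the loop control costs the same as the sub/jnz pair.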


I had a bash at re-ordering your code. It's possibly not right algorithmically, but it may give you some ideas.
Code:
                          // mm7=0     
spatial:
    mov	eax, [length4]                     // 23: r1.0 / l2    --> 25  
    movq  mm4, mm6                         // 24: r0.3 / l1    --> 25
    movq  mm5, mm6                         // 24: r0.3 / l1    --> 25

temporal:                              
	  mov       ecx, [sourcep+eax-8]       //  1: r1.0 / l2    -->  3
	  mov       edx, [sourcep+eax-4]       //  2: r1.0 / l2    -->  4
	  movq      mm0, qword ptr[ecx+ebx-8]  //  3: r1.0 / l2    -->  5+x
	  movq      mm2, qword ptr[edx+ebx-8]  //  4: r1.0 / l2    -->  6+x
#ifdef ISSE
	  punpckhbw mm1, mm0  // high               : r1.0 / l1    -->     (wait for mm0)
	  punpcklbw mm0, mm7  // low                : r1.0 / l1    -->   
	  punpckhbw mm3, mm2  // high               : r1.0 / l1    -->     (wait for mm2)
	  punpcklbw mm2, mm7  // low                : r1.0 / l1    -->   
	  pmulhuw   mm3, mm1  // high               : r1.0 / l3    -->     ISSE Required!
	  sub       eax, 8    //                    :    
	  pmullw    mm2, mm0  // low                : r1.0 / l3    -->   
	  paddusw   mm5, mm3  // high               : r0.5 / l1    -->   
	  paddusw   mm4, mm2  // low                : r0.5 / l1    -->   
#else
	  movq      mm1, mm0  // frame              : r0.3 / l1    -->     (wait for mm0)
	  punpcklbw mm0, mm7  // low                : r1.0 / l1    -->   
	  movq      mm3, mm2  // mask               : r0.3 / l1    -->     (wait for mm2)
	  punpcklbw mm2, mm7  // low                : r1.0 / l1    -->   
	  punpckhbw mm1, mm7  // high               : r1.0 / l1    -->   
	  pmullw    mm2, mm0  // low                : r1.0 / l3    -->   
	  punpckhbw mm3, mm7  // high               : r1.0 / l1    -->   
	  paddusw   mm4, mm2  // low                : r0.5 / l1    -->   
	  pmullw    mm3, mm1  // high               : r1.0 / l3    -->   
	  sub       eax, 8    //                    :    
	  paddusw   mm5, mm3  // high               : r0.5 / l1    -->   
#endif
	  jnz       temporal  //                  16: r1.0 / l0    --> 17 
                                       
    psrlw     mm4, 8                       // 17: r1.0 / l1    --> 18  (low)
    psrlw     mm5, 8                       // 18: r1.0 / l1    --> 19  (high)
    sub       ebx, 8                       // 19:              --> 20 
    packuswb  mm4, mm5                     // 20: r1.0 / l1    --> 21
    movq      qword ptr[edi+ebx], mm4      // 21: r1.0 / l3    --> 23  (wait for mm4)
    jnz       spatial                      // 22:
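Since the re-ordering may not be algorithmically right, a scalar reference is handy for checking any asm variant against. The following C is only my reading of what the non-ISSE path computes per byte. The function names, the split into separate frames and masks arrays, and the init parameter (standing in for whatever mm6 holds) are all assumptions, not taken from the original filter.

```c
#include <stddef.h>
#include <stdint.h>

/* paddusw on one lane: unsigned add, saturating at 0xFFFF. */
static uint16_t paddusw_lane(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;
    return (uint16_t)(s > 0xFFFF ? 0xFFFF : s);
}

/* Hypothetical reference:
   out[x] = sat16(init + sum over k of frames[k][x] * masks[k][x]) >> 8 */
static void weighted_sum_ref(uint8_t *out,
                             const uint8_t *const *frames,
                             const uint8_t *const *masks,
                             size_t count, size_t width, uint16_t init)
{
    for (size_t x = 0; x < width; x++) {
        uint16_t acc = init;                         /* movq mm4, mm6 */
        for (size_t k = 0; k < count; k++) {
            /* pmullw: low 16 bits of the product (always fits for bytes) */
            uint16_t prod = (uint16_t)(frames[k][x] * masks[k][x]);
            acc = paddusw_lane(acc, prod);           /* paddusw */
        }
        out[x] = (uint8_t)(acc >> 8);                /* psrlw 8, packuswb */
    }
}
```

Running the asm and this reference over the same random buffers and comparing byte for byte should quickly show whether a variant such as the pmulhuw path has drifted.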