6th June 2009, 23:40   #16
IanB
Avisynth Developer
 
Quote:
Originally Posted by SEt
I'm already on Nehalem and when I change
Code:
movq	xmm7, qword ptr [edi+pitch*0-1]
movq	xmm4, qword ptr [edi+pitch*0+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
to
Code:
movq	xmm7, qword ptr [edi+pitch*1-1]
movq	xmm4, qword ptr [edi+pitch*1+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
I see a 1.4x slowdown in the profiler for the whole function.
Consider the memory address each load references and which cache line it uses. I have colour coded the 3 different memory areas (the rows at pitch*0, pitch*1 and pitch*2). In the fast case only 2 areas are used (pitch*0 and pitch*2). Also, accessing data that is not aligned to 64 bits has a penalty, and there is a very big penalty when an access crosses a cache line (64 byte) boundary. With [edi+pitch*1-1] you may be slipping into the previous cache line (what address is in EDI?)
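To check this for a given EDI and pitch, something like the following rough C sketch may help. It only does the address arithmetic: which 64 byte cache line an address falls in, whether an 8 byte movq load is misaligned, and whether it straddles a line boundary. The helper names (line_of, crosses_line, misaligned) are made up for illustration, and the 32-bit address type is an assumption based on the use of EDI.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* cache line size in bytes, as stated above */

/* Index of the cache line that contains byte address addr. */
static uint32_t line_of(uint32_t addr)
{
    return addr / CACHE_LINE;
}

/* Nonzero if a load of `size` bytes starting at addr touches two cache lines. */
static int crosses_line(uint32_t addr, uint32_t size)
{
    assert(size > 0 && size <= CACHE_LINE);
    return line_of(addr) != line_of(addr + size - 1);
}

/* Nonzero if addr is not a multiple of `size` (size must be a power of two). */
static int misaligned(uint32_t addr, uint32_t size)
{
    return (addr & (size - 1)) != 0;
}
```

For example, with a hypothetical EDI of 0x1000, crosses_line(0x1000 - 1, 8) is nonzero because the 8 bytes at 0x0FFF..0x1006 straddle the boundary at 0x1000, while crosses_line(0x1000 + 1, 8) is zero even though misaligned(0x1000 + 1, 8) is nonzero. Plugging in the real EDI and pitch for each of the four loads shows how many distinct lines each version touches.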