Quote:
Originally Posted by SEt
I'm already on Nehalem and when I change
Code:
movq xmm7, qword ptr [edi+pitch*0-1]
movq xmm4, qword ptr [edi+pitch*0+1]
movq xmm1, qword ptr [edi+pitch*0]
movq xmm2, qword ptr [edi+pitch*2]
to
Code:
movq xmm7, qword ptr [edi+pitch*1-1]
movq xmm4, qword ptr [edi+pitch*1+1]
movq xmm1, qword ptr [edi+pitch*0]
movq xmm2, qword ptr [edi+pitch*2]
I see a 1.4x slowdown in the profiler for the whole function.
Consider the memory address each load references and which cache line it falls in. I have colour coded 3 different memory areas. In the fast case only 2 areas are used. Also, accessing data that is not aligned to 64 bits has a penalty, and there is a very big penalty when an access crosses a cache line (64-byte) boundary. For [edi+pitch*1-1] you may be slipping into the previous cache line (what address is in EDI?).
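To check this for a given EDI and pitch, something like the C sketch below could help. It only does the address arithmetic: which 64-byte line each 8-byte movq load lands in, whether a load straddles two lines, and how many distinct lines the four loads touch in total. The helper names (cache_line_of, crosses_cache_line, count_lines_touched) are made up for illustration, and the 64-byte line size is an assumption that fits Nehalem but should be checked for other CPUs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u /* assumed cache line size in bytes */

/* Index of the cache line that contains the byte at addr. */
static uint64_t cache_line_of(uint64_t addr) {
    return addr / CACHE_LINE_SIZE;
}

/* True if a load of `size` bytes starting at addr touches two cache lines. */
static bool crosses_cache_line(uint64_t addr, size_t size) {
    assert(size > 0 && size <= CACHE_LINE_SIZE);
    return cache_line_of(addr) != cache_line_of(addr + size - 1);
}

/* Number of distinct cache lines touched by n loads of `size` bytes,
   each at base + offsets[i] (base plays the role of EDI).
   Hypothetical helper; handles at most 16 loads. */
static size_t count_lines_touched(uint64_t base, const int64_t *offsets,
                                  size_t n, size_t size) {
    uint64_t seen[32];
    size_t count = 0;
    assert(n <= 16);
    for (size_t i = 0; i < n; i++) {
        uint64_t start = base + (uint64_t)offsets[i];
        /* First and last line covered by this load (equal if no crossing). */
        uint64_t lines[2] = { cache_line_of(start),
                              cache_line_of(start + size - 1) };
        for (int k = 0; k < 2; k++) {
            bool found = false;
            for (size_t j = 0; j < count; j++) {
                if (seen[j] == lines[k]) found = true;
            }
            if (!found) seen[count++] = lines[k];
        }
    }
    return count;
}
```

With made-up values of EDI = 0x1008 and pitch = 1024, the offsets {-1, +1, 0, 2*pitch} of the fast version stay within 2 lines, while {pitch-1, pitch+1, 0, 2*pitch} of the slow version touch 3. If EDI sits exactly on a line boundary, the load at offset -1 straddles two lines. The real numbers depend on the actual EDI and pitch.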