I'm already on Nehalem, and when I change
Code:
movq xmm7, qword ptr [edi+pitch*0-1]
movq xmm4, qword ptr [edi+pitch*0+1]
movq xmm1, qword ptr [edi+pitch*0]
movq xmm2, qword ptr [edi+pitch*2]
to
Code:
movq xmm7, qword ptr [edi+pitch*1-1]
movq xmm4, qword ptr [edi+pitch*1+1]
movq xmm1, qword ptr [edi+pitch*0]
movq xmm2, qword ptr [edi+pitch*2]
I see a 1.4x slowdown in the profiler for the whole function.