Quote:
Originally Posted by SEt
I'm already on Nehalem, and when I change
Code:
movq xmm7, qword ptr [edi+pitch*0-1]
movq xmm4, qword ptr [edi+pitch*0+1]
movq xmm1, qword ptr [edi+pitch*0]
movq xmm2, qword ptr [edi+pitch*2]
to
Code:
movq xmm7, qword ptr [edi+pitch*1-1]
movq xmm4, qword ptr [edi+pitch*1+1]
movq xmm1, qword ptr [edi+pitch*0]
movq xmm2, qword ptr [edi+pitch*2]
I see a 1.4x slowdown in the profiler for the whole function.
I was referring to the pinsrw when I mentioned the cacheline split.
If you're getting such a large slowdown merely by changing those two load offsets, you should try to figure out why. Performance counters might be useful for tracking that down.
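Before reaching for the counters, a quick sanity check is to see whether any of the 8-byte loads can straddle a cache line at all. Below is a minimal sketch in C, not taken from the code under discussion: it assumes a 64-byte line (which is what Nehalem uses), and the helper names `splits_cache_line` and `count_split_loads` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes (64 on Nehalem). */
#define CACHE_LINE 64u

/* Returns 1 if a load of `size` bytes starting at `addr` touches two
   cache lines, 0 if it stays inside one. Hypothetical helper. */
static int splits_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHE_LINE);
    return (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
}

/* Counts how many of `count` consecutive 8-byte loads (movq-sized) at
   base + offset + i*step would split a line. Hypothetical helper;
   `offset` plays the role of the -1 / 0 / +1 displacement. */
static size_t count_split_loads(uintptr_t base, ptrdiff_t offset,
                                size_t step, size_t count)
{
    size_t splits = 0;
    for (size_t i = 0; i < count; ++i) {
        uintptr_t addr = base + (uintptr_t)offset + i * step;
        splits += (size_t)splits_cache_line(addr, 8);
    }
    return splits;
}
```

If you feed it the row base address and the -1 / +1 displacements for both variants and the split counts come out the same, the slowdown is probably coming from somewhere else, and that is where the counters would help.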