Doom9's Forum - View Single Post - aWarpSharp2

IanB · 6th June 2009, 23:40

Quote:

Originally Posted by SEt

I'm already on Nehalem, and when I change

Code:

movq	xmm7, qword ptr [edi+pitch*0-1]
movq	xmm4, qword ptr [edi+pitch*0+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]

to

Code:

movq	xmm7, qword ptr [edi+pitch*1-1]
movq	xmm4, qword ptr [edi+pitch*1+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]

I see a 1.4x slowdown in the profiler for the whole function.

Consider the memory address each load references and which cache line it falls in. I have colour coded the 3 different memory areas, one per row offset (pitch*0, pitch*1 and pitch*2). In the fast case only 2 areas are used (pitch*0 and pitch*2); the slow case touches all 3. Also, accessing data not aligned to 64 bits has a penalty, and there is a very big penalty when an access crosses a cache line (64 byte) boundary. With [edi+pitch*1-1] you may be slipping into the previous cache line (what address is in EDI?)
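To check a particular address by hand, something like the following C sketch may help. It only does the address arithmetic described above (64-byte cache lines, 8-byte movq loads); the function names are made up for illustration and are not taken from aWarpSharp2.

```c
#include <stdint.h>

#define CACHE_LINE 64u  /* cache line size in bytes, as stated in the post */

/* Returns 1 if a load of `size` bytes starting at `addr` touches two
   cache lines, i.e. its first and last byte lie in different lines. */
int crosses_cache_line(uintptr_t addr, unsigned size)
{
    return (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
}

/* Returns 1 if `addr` is a multiple of 8 bytes (64 bits), which is the
   natural alignment for an 8-byte movq load. */
int is_aligned_64bit(uintptr_t addr)
{
    return (addr % 8u) == 0;
}
```

As an example under an assumed value: if EDI happened to hold 0x10000 (64-byte aligned) and pitch were a multiple of 64, then crosses_cache_line(edi + pitch*1 - 1, 8) would return 1, because the load starts on the last byte of the previous line, while the +1 load would stay inside one line but still be misaligned. The real answer depends on what is actually in EDI and on the pitch.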