7th June 2009, 00:12   #17
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
The code reads three lines of one frame while writing one line to another frame in a simple loop. edi is a global counter, incremented by 8 each iteration, that is used as the offset for all sources and for the destination. It's not a cache line split problem: [edi+pitch*1+1] and [edi+pitch*0-1] are fine, but [edi+pitch*1-1] is not. The magnitude of the slowdown is also far too large for that, and on Nehalem such penalties are small. It looks like some kind of cache (address?) conflict, judging by the descriptions of the performance counters that spike:
REPL - Counts the number of lines brought into the L1 data cache.
M_REPL - Counts the number of modified lines brought into the L1 data cache.
M_EVICT - Counts the number of modified lines evicted from the L1 data cache due to replacement.
M_SNOOP_EVICT - Counts the number of modified lines evicted from the L1 data cache due to snoop HITM intervention.

But the code only reads linearly from one location and writes linearly to another in a simple loop.

EDIT: It does indeed seem to be a cache address conflict, as the lower 16 bits of [edi+pitch*1] and [output] are the same. That doesn't give me any idea how to fix it, though, other than caching [edi+pitch*1-1] from the previous iteration (both memory locations are whatever I get from AviSynth).
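To make the check above concrete, here is a minimal C sketch of how one could test whether two pointers agree in their low bits. The helper name low_bits_alias is hypothetical, and the width of 16 bits is simply what was observed here, not a documented constant for any particular CPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true when the two addresses agree in their
   low `bits` bits (bits must be smaller than the pointer width).
   bits = 16 mirrors the observation in this post; the width that actually
   matters on a given CPU is an assumption to verify. */
static bool low_bits_alias(const void *a, const void *b, unsigned bits)
{
    uintptr_t mask = ((uintptr_t)1 << bits) - 1;
    /* XOR leaves a 1 wherever the addresses differ; mask keeps the low bits. */
    return (((uintptr_t)a ^ (uintptr_t)b) & mask) == 0;
}
```

Calling it as low_bits_alias(src + pitch, dst, 16) on the two frame pointers would flag the situation described here, where src, dst and pitch stand in for whatever the host hands over.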
EDIT2: And it seems to be an L2/L3 problem, with a scenario something like this:
Cache lines in L1 are allocated independently, but when the output L1 line is written back to L2 or beyond, the next reference to [edi+pitch*1-1] is mistaken for an access to the same location because of that single -1 byte, which results in the L1 cache line ping-pong hell seen in the counters.
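The workaround from the first EDIT (carrying the -1 byte over from the previous iteration instead of loading from [edi+pitch*1-1]) could look roughly like this in C. This is only a sketch under assumptions: the function name and the edge parameter are made up for illustration, it assumes a little-endian target such as x86 and a width that is a multiple of 8, and it stores the shifted bytes to a buffer only so the result can be inspected. In the real loop the shifted value would stay in a register.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: compute left[x] = row[x - 1] for 8 bytes per step
   without ever loading from row + x - 1. The byte that would come from
   row[x - 1] is taken from the previous iteration's load instead.
   Assumes little-endian byte order; `edge` stands in for row[-1].
   Trailing bytes beyond the last full group of 8 are left untouched. */
static void left_neighbors_by_carry(const uint8_t *row, uint8_t *left,
                                    size_t width, uint8_t edge)
{
    uint64_t prev = (uint64_t)edge << 56;   /* only the top byte is used */
    for (size_t x = 0; x + 8 <= width; x += 8) {
        uint64_t cur;
        memcpy(&cur, row + x, 8);           /* load row[x .. x+7] */
        /* Move every byte up one lane and pull in the last byte of prev. */
        uint64_t shifted = (cur << 8) | (prev >> 56);
        memcpy(left + x, &shifted, 8);
        prev = cur;
    }
}
```

The point of the design is that every load starts at the same offset as the store, so the problematic address never appears in the loop. Whether that actually removes the conflict is untested here.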

Last edited by SEt; 7th June 2009 at 00:48.