memcpy optimization question


huang_ch
24th January 2007, 09:49
I'm trying to learn about optimization, and I want to start with memcpy. One problem I ran into is that if the two pointers are not warmed up before the memcpy, performance suffers hugely. The experiments are below; please help me identify the cause. Thanks.

Hardware: E6600+1G*2 DDR2 667 (oc @ 2.8G+780Mhz Mem)
Software: Windows 2003 x64, VisualStudio 2005

Exp 1:
Test with a simple memcpy(dst, src, len), and calculate the bandwidth with len/time.
len=64KB: bandwidth: ~1000MB/s
len=512MB: bandwidth: ~1000MB/s

Exp 2:
Test with the sequence below, and calculate the bandwidth using only the time of the memcpy.
memset( src, 0, len ); //this is what I call warming the pointer
memset( dst, 0, len );
memcpy( dst, src, len );

len=64KB: bandwidth: ~11000MB/s
len=512MB: bandwidth: ~1700MB/s

So my question is, why does doing a memset on both pointers make the memcpy so much faster? A 512MB buffer obviously cannot fit into any level of cache.
And another question: why do I get such low bandwidth utilization? (I've also tried some SSE/SSE2/MMX optimized memcpy routines, and strangely saw no major performance gain. If I remember correctly, I wrote an MMX-optimized version myself several years ago, and on a Celeron 400 and a K7-500 it gave a significant performance gain compared to the memcpy in Visual Studio 6.0.)
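
For anyone who wants to reproduce the two experiments above, here is a minimal sketch of the timing harness. The helper names time_memcpy and now_seconds are made up for illustration, and the use of malloc and C11 timespec_get is an assumption, since the post does not say how the buffers were allocated or how time was measured. One factor such a harness may expose, offered as a guess rather than something established in this thread, is the cost of the operating system mapping pages the first time they are touched.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Wall-clock time in seconds, using C11 timespec_get. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/*
 * Hypothetical helper (not from the original post): times a single
 * memcpy of len bytes. If warm is nonzero, both buffers are first
 * touched with memset (Exp 2); otherwise they are copied straight
 * after allocation (Exp 1). Returns the elapsed seconds of the
 * memcpy alone, or -1.0 on a zero length or allocation failure.
 */
double time_memcpy(size_t len, int warm)
{
    unsigned char *src, *dst;
    volatile unsigned char sink;
    double t0, t1;

    if (len == 0)
        return -1.0;
    src = malloc(len);
    dst = malloc(len);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    if (warm) {
        memset(src, 0, len);
        memset(dst, 0, len);
    }
    t0 = now_seconds();
    memcpy(dst, src, len);
    t1 = now_seconds();
    sink = dst[len - 1];   /* read the result so the copy is less likely to be optimized away */
    (void)sink;
    free(src);
    free(dst);
    return t1 - t0;
}
```

Bandwidth is then len divided by the returned time. For small lengths such as 64KB the elapsed time may be close to the timer resolution, so treat single measurements with caution.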

codeguru
2nd February 2007, 11:47
So, you are wondering why memcpy is that slow - the answer is simple: it's a copy loop, and that cannot be fast. Maybe you


So the clue is to replace memcpy with a more advanced version that can make use of the cache structure, or at least the chipset.

So a for() {*dest++=*source++} loop is braindead, especially on x64 platforms. But it's portable and does not need to be rewritten if the processor architecture changes from x32 to x64.
for example by for(){(int *)dest++=(int *)source++}
int is 8 bytes long on x64 architecture, and on 32 bit you could use the double type.

Modern CPUs can gain a lot of performance and memory throughput from prefetching.
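
As a rough illustration of software prefetching in a copy loop, here is a minimal sketch. It assumes GCC or Clang for __builtin_prefetch (on other compilers the macro below expands to nothing), and the 64-byte block size and 512-byte prefetch distance are guesses for illustration, not measured values. The name prefetch_memcpy is hypothetical. Whether this helps at all depends on the CPU, since hardware prefetchers may already cover a simple sequential copy.

```c
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
/* Hint that the line at p will be read soon (rw = 0, low temporal locality). */
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

#define BLOCK 64            /* assumed cache line size in bytes */
#define PREFETCH_AHEAD 512  /* assumed prefetch distance in bytes */

/* Copies len bytes in 64-byte blocks, prefetching the source ahead of the copy. */
void *prefetch_memcpy(void *dest, const void *src, size_t len)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    while (len >= BLOCK) {
        /* Only prefetch addresses that are still inside the source buffer. */
        if (len > PREFETCH_AHEAD)
            PREFETCH_READ(s + PREFETCH_AHEAD);
        memcpy(d, s, BLOCK);
        d += BLOCK;
        s += BLOCK;
        len -= BLOCK;
    }
    memcpy(d, s, len);      /* remaining tail of fewer than 64 bytes */
    return dest;
}
```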


If you use the memset only once, you might lose the time spent on the memset and gain it back in the memcpy, but the way to achieve a performance gain is right.

Each processor has a maximum memory segment size that fits in its cache, and you have noticed that the 64k segment was copied about 7 times faster than the 512 MB segment. You might determine this maximum block size yourself; it is processor dependent. IMO this segment size depends on the cache line size of the CPU. On older Pentium CPUs it was 512 bytes, later increased to 2 KB, and if your segment size is a direct multiple of this cache line size you may reach the optimum.

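#include <stdlib.h>
#include <string.h>

/* Copies count bytes from source to dest in 64 KB segments via a temporary buffer. */
void *mymemcpy(void *dest, const void *source, size_t count)
{
    const size_t segsize = 65536;            /* segment size in bytes */
    char *d = dest;
    const char *s = source;
    char *ptemp = malloc(segsize);
    size_t n;

    if (ptemp == NULL)
        return memcpy(dest, source, count);  /* no buffer: fall back to a direct copy */
    memset(ptemp, 0, segsize);               /* your warmup */
    for (; count > 0; count -= n, s += n, d += n)
    {
        n = count < segsize ? count : segsize;   /* last segment may be shorter */
        memcpy(ptemp, s, n);
        memcpy(d, ptemp, n);
    }
    free(ptemp);
    return dest;
}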

You might omit the ptemp buffer and do a direct memcpy(dest,src,size), or use assembler code at this place. A good compiler should optimize copy loops at least so that the processor can predict the addresses to be copied, because the CPU executes several instructions nearly in parallel in a pipeline.

In real-world programming, a memset(dest,0,size); memset(src,0,size); memcpy(dest,src,size);
sequence would never occur; this is just like running coretest on cached hard disks.

So you might get a pointer to some memory that should be copied to a buffer to be processed, and after that it might be written somewhere else.

Here is also an article about memcpy optimization.

http://www.embedded.com/showArticle.jhtml?articleID=19205567

squid_80
2nd February 2007, 14:17
So a for() {*dest++=*source++} loop is braindead, especially on x64 platforms. But it's portable and does not need to be rewritten if the processor architecture changes from x32 to x64.
for example by for(){(int *)dest++=(int *)source++}
int is 8 bytes long on x64 architecture, and on 32 bit you could use the double type.

int is 32 bits on both Linux and Windows on x64 systems. Also, your code casts to an int pointer but never dereferences it.
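
To make the point about dereferencing concrete, here is a minimal sketch of a word-at-a-time copy loop. It uses uint64_t from <stdint.h> so the element size is 8 bytes regardless of how wide int is on the platform. The function name copy_u64 is hypothetical, and taking typed pointers is a simplification: copying arbitrary byte buffers this way would additionally need alignment and tail handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies n 64-bit words. Each pointer is dereferenced, then incremented. */
void copy_u64(uint64_t *dest, const uint64_t *src, size_t n)
{
    while (n--)
        *dest++ = *src++;
}
```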

(And I've tried some SSE/SSE2/MMX optimized memcpy, but strange to see no major performance gain, but if I remember correctly, I've done a MMX optimization myself several years ago, on Celeron 400 and K7-500, both have significant performance gain compare to the memcpy in Visual Studio 6.0)
Probably memcpy in VS6.0 didn't use SIMD instructions, whereas there are intrinsic implementations of memcpy in VS2005 that do.

housekat
2nd February 2007, 20:30
No benefit from MMX/SSE? Did you try something like this:
extern void * __cdecl optimized_memcpy(void* dest, const void* src, size_t s){
_asm {
mov ecx, [s]
mov esi, [src]
mov edi, [dest]
shr ecx, 6 //with mmx: 64 bytes per iteration
jz lower_64//if lower than 64 bytes
loop_64: //MMX transfers multiples of 64bytes
movq mm0, 0[ESI] //read sources
movq mm1, 8[ESI]
movq mm2, 16[ESI]
movq mm3, 24[ESI]
movq mm4, 32[ESI]
movq mm5, 40[ESI]
movq mm6, 48[ESI]
movq mm7, 56[ESI]

movq 0[EDI], mm0 //write destination
movq 8[EDI], mm1
movq 16[EDI], mm2
movq 24[EDI], mm3
movq 32[EDI], mm4
movq 40[EDI], mm5
movq 48[EDI], mm6
movq 56[EDI], mm7

add esi, 64
add edi, 64
dec ecx
jnz loop_64
emms //close mmx operation
lower_64://transfer rest of buffer
mov ebx,esi
sub ebx,src
mov ecx,[s]
sub ecx,ebx
shr ecx, 3 //multiples of 8 bytes
jz lower_8
loop_8:
movq mm0, [esi] //read source
movq [edi], mm0 //write destination
add esi, 8
add edi, 8
dec ecx
jnz loop_8
emms //close mmx operation
lower_8:
mov ebx,esi
sub ebx,src
mov ecx,[s]
sub ecx,ebx
rep movsb
ende:
mov eax, [dest] //return dest
}
}
this code will not cope with overlapped memory...
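
For comparison, here is a minimal sketch in plain C of how a copy can cope with overlapping buffers, in the way the standard memmove does: copy forward when the destination starts below the source, and backward otherwise. The name overlap_safe_copy is hypothetical, and comparing the addresses through uintptr_t assumes a flat address space, which holds on the x86 and x64 targets discussed here.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-wise copy that picks the direction so overlapping regions are handled. */
void *overlap_safe_copy(void *dest, const void *src, size_t len)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if ((uintptr_t)d < (uintptr_t)s) {
        /* Destination is lower: copy front to back. */
        while (len--)
            *d++ = *s++;
    } else if ((uintptr_t)d > (uintptr_t)s) {
        /* Destination is higher: copy back to front. */
        d += len;
        s += len;
        while (len--)
            *--d = *--s;
    }
    return dest;
}
```

In practice memmove already provides this behavior; the sketch only shows the direction check that an optimized routine would need to add.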

squid_80
3rd February 2007, 05:14
No benefit over the existing routines in VS2005, which already use code similar to what you posted.

foxyshadis
3rd February 2007, 06:23
housekat, no prefetch? That's got to cut performance, especially on the slower Intel buses. The VS2005 version is probably similar or even identical to memcpy_amd, which is pretty much as fast as such things can be done.

housekat
3rd February 2007, 15:06
Just a first idea. I'm just astonished about not getting any advantage from using MMX. I didn't know that VS2005 has this optimization. For Windows I'm on the old VS6.0, which has no such optimizations. But it's good news to hear that VS2005 uses more optimized code.

squid_80
3rd February 2007, 15:17
The code you posted is fine if targeting MMX only (which is pretty much a minimum requirement these days), since prefetchxxx is an SSE instruction. Move emms to the end section though, no point doing it twice.

huang_ch
9th February 2007, 06:07
Sorry for the late reply.

So the clue is to replace memcpy with a more advanced version that can make use of the cache structure, or at least the chipset.

Just as squid_80 pointed out, VS2005 already uses some SIMD optimizations. I've also tried the Intel Compiler, and the result is pretty much at the same level. I've also tried them on WinXP x86 instead of Win2003 x64, and the result is pretty much the same.


Each processor has a maximum memory segment size that fits in its cache, and you have noticed that the 64k segment was copied about 7 times faster than the 512 MB segment. You might determine this maximum block size yourself; it is processor dependent. IMO this segment size depends on the cache line size of the CPU. On older Pentium CPUs it was 512 bytes, later increased to 2 KB, and if your segment size is a direct multiple of this cache line size you may reach the optimum.

That explains well why Exp2_64KB is 10x faster than Exp1_64KB, but it doesn't explain why Exp2_512MB is still 1.7x faster than Exp1_512MB. 512MB can't fit into any cache during a sequential read/write, so I still wonder what kind of data is cached by the pre-memset?