Loop unrolling would be dramatically more beneficial than bashing one byte at a time here, so it's no wonder the byte-at-a-time path isn't particularly well optimized for anymore. Hell, just accessing the data through unsigned long pointers (or unsigned long long where you need a guaranteed 64-bit word) would be a great improvement, without having to get into intrinsics/ASM/OpenMP territory.
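To make the word-at-a-time idea concrete, here's a rough sketch in C. The operation (a buffer equality check) and the name `bytes_equal_wordwise` are my own picks for illustration, not anything from the code under discussion. It also loads each word with `memcpy` rather than casting the pointer directly, since a straight cast can run into alignment and strict-aliasing trouble; mainstream compilers typically turn a fixed-size `memcpy` like this into a single load anyway.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration: returns 1 if the first n bytes
 * of a and b are identical, 0 otherwise. */
static int bytes_equal_wordwise(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;

    /* Main loop: compare 8 bytes per iteration instead of 1. */
    while (n >= sizeof(uint64_t)) {
        uint64_t wa, wb;
        memcpy(&wa, pa, sizeof wa);  /* alignment-safe 64-bit load */
        memcpy(&wb, pb, sizeof wb);
        if (wa != wb)
            return 0;
        pa += sizeof wa;
        pb += sizeof wb;
        n  -= sizeof wa;
    }

    /* Tail: handle the remaining 0..7 bytes one at a time. */
    while (n > 0) {
        if (*pa++ != *pb++)
            return 0;
        n--;
    }
    return 1;
}
```

The main loop does one comparison per 8 bytes, and an optimizing compiler is free to unroll it further on top of that. Whether it actually wins in a given case is something to measure rather than assume.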