Okay, the equivalent C code helps a lot. Perhaps it should go in your source as comments.
Have you got this
Code:
movntq [ebx+edx], mm0
in the block you are testing? You are also running on an AMD beastie, so try the test again with a normal movq in its place.
Do not mix cached and non-temporal memory accesses, especially writes, and make absolutely sure that the addresses you access are 64-bit (8-byte) aligned.
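A quick way to confirm the alignment part is to check the pointer before the loop runs. This is only a sketch in C to match your equivalent code; the names is_aligned8 and require_aligned8 are made up for illustration and are not from your source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p sits on an 8-byte (64-bit)
   boundary, 0 otherwise. */
static int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}

/* Hypothetical debug check: call this on the destination pointer
   (ebx+edx in the asm) before entering the store loop. It aborts in
   a debug build if the pointer is misaligned. */
static void require_aligned8(const void *p)
{
    assert(is_aligned8(p));
}
```

If the check fires, either fix the allocation or handle the first few bytes separately until the pointer reaches an 8-byte boundary.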
Also, as a baseline, perhaps you should time the pure C version.
------------------------
Also, a possible hint for your end-of-buffer (tail) code: as long as you have at least 8 bytes in total to process, just do a single unaligned
movq at the end, i.e. offset the movq so that it finishes exactly at the end of the buffer and process the few overlapping bytes twice.
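Here is a rough C sketch of that overlapping-tail idea. I do not know what your loop actually computes, so process8 is a stand-in operation (it just inverts 8 bytes), and the names process8 and process_buffer are invented for the example. It assumes separate source and destination buffers; if you work in place, the trick is only safe when doing the operation twice on the same bytes gives the same result as doing it once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-byte operation: read 8 bytes from src, invert them,
   write them to dst. The memcpy calls stand in for unaligned movq
   loads and stores. */
static void process8(uint8_t *dst, const uint8_t *src)
{
    uint64_t v;
    memcpy(&v, src, sizeof v);
    v = ~v;
    memcpy(dst, &v, sizeof v);
}

/* Process n bytes from src into dst. Requires n >= 8 and
   non-overlapping src and dst buffers. */
static void process_buffer(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i;

    assert(n >= 8);

    /* Main loop: whole 8-byte blocks from the start of the buffer. */
    for (i = 0; i + 8 <= n; i += 8)
        process8(dst + i, src + i);

    /* Tail: if 1 to 7 bytes remain, do one more 8-byte block that
       ends exactly at the end of the buffer. It overlaps the last
       full block, so a few bytes are simply processed twice. */
    if (i < n)
        process8(dst + n - 8, src + n - 8);
}
```

With a 13-byte buffer, for example, the main loop handles bytes 0 to 7 and the tail block handles bytes 5 to 12, so bytes 5 to 7 are done twice and there is no byte-at-a-time cleanup loop.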