Quote:
Originally Posted by SEt
Dark Shikari, i know not everything is optimally written, but i thought better release working version now than super-optimized never. I know that horizontal blur is made for palignr and will look into this when i have time, but i have no idea how to save kittens or why loading correct line gives 1.4x speed drop for the whole function, including those awful pinsrw that should be much more time consuming than just unaligned load from additional memory location.
|
"Should be much more time consuming?"
Does that imply you tested it, and found it to be faster?
If it's faster, I'm going to be inclined to blame cacheline-split. Test on an AMD chip or Nehalem and watch the penalties melt away.