Problem solved. Thanks to Dark Shikari for pushing me to actually look at the performance counters, and to IanB for the idea of what might help.
Placing this load earlier
Code:
mov al, byte ptr [edi+pitch*1-8]
changed nothing by itself, but it gave me the idea that did work: I moved the problem loads ahead of the previous iteration's write. After a few more optimizations the new code runs as fast as the old one, and sometimes even a bit faster. I will post it later, once the other things are done.
IanB, I started using pitch*? instead of registers because the included source had no register for pitch*1.
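
Until the real code is posted, here is a rough C sketch of the reordering described above, to show the shape of the change only. It is not the actual routine: the function names, the averaging "filter", and the 8-byte block size are all made up for illustration, and it assumes pitch >= 8*blocks + 8 so that the neighbour byte is never one of the bytes the loop writes. A C compiler is also free to reorder these memory accesses on its own, so the ordering only really matters in the hand-written assembly version.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical baseline: each iteration loads its neighbour byte at
   p[pitch - 8] right after the previous iteration's 8-byte write. */
static void filter_baseline(uint8_t *row, ptrdiff_t pitch, size_t blocks)
{
    uint8_t *p = row;
    for (size_t i = 0; i < blocks; i++) {
        uint8_t left = p[pitch - 8];          /* the "problem" load */
        uint8_t out[8];
        for (int k = 0; k < 8; k++)
            out[k] = (uint8_t)((p[k] + left + 1) >> 1);
        memcpy(p, out, 8);                    /* 8-byte write */
        p += 8;
    }
}

/* Reordered: the neighbour byte for iteration i+1 is loaded BEFORE the
   write of iteration i. Same result, as long as the load address is
   never written by the loop (assumption: pitch >= 8*blocks + 8). */
static void filter_reordered(uint8_t *row, ptrdiff_t pitch, size_t blocks)
{
    if (blocks == 0)
        return;
    uint8_t *p = row;
    uint8_t left = p[pitch - 8];              /* load for the first block */
    for (size_t i = 0; i < blocks; i++) {
        uint8_t out[8];
        for (int k = 0; k < 8; k++)
            out[k] = (uint8_t)((p[k] + left + 1) >> 1);
        uint8_t next_left = 0;
        if (i + 1 < blocks)
            next_left = p[pitch];             /* == (p + 8)[pitch - 8] */
        memcpy(p, out, 8);                    /* write comes after the load */
        left = next_left;
        p += 8;
    }
}
```

Both functions should give identical output on the same input; the second one just issues the next iteration's load before the current iteration's write.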