So this time I improved the internal 2xY SAD function: I avoided one push & pop and replaced half of the segment register reads with regular ones, which made it 8% faster.
After a good idea I added an optimized version which avoids most reg->mmx moves and relies on a fact that is new in 1.9.3.2: the aligned source block buffer is contiguous (pitch = blockwidth), so only one read is needed for the source block. This gave another 8% gain for MVDegrain3 with block=8 on YUY2; the gain with YV12 is smaller.
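To illustrate the contiguous-buffer idea, here is a minimal sketch in plain scalar C. It is not the actual MMX assembly from the plugin; the function names `sad_2xN` and `sad_2xN_contig_src` are made up for this example. The first function walks both blocks through their own pitch, the second assumes the source block is stored with pitch = blockwidth = 2, so the whole source block is one flat run of bytes (a 2x4 block, for instance, is just 8 consecutive bytes and fits a single 64-bit load).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Generic 2xN SAD: source and reference are each addressed via a pitch.
 * Hypothetical helper, not the plugin's real routine. */
static unsigned sad_2xN(const uint8_t *src, int src_pitch,
                        const uint8_t *ref, int ref_pitch, int height)
{
    assert(height > 0);
    unsigned sum = 0;
    for (int y = 0; y < height; y++) {
        sum += (unsigned)abs(src[0] - ref[0]);
        sum += (unsigned)abs(src[1] - ref[1]);
        src += src_pitch;   /* step to the next source row */
        ref += ref_pitch;   /* step to the next reference row */
    }
    return sum;
}

/* Variant that ASSUMES the source block is contiguous (pitch == 2).
 * The source is read as one flat array of 2*height bytes, so no
 * source pitch has to be carried around or added per row. */
static unsigned sad_2xN_contig_src(const uint8_t *src,
                                   const uint8_t *ref, int ref_pitch,
                                   int height)
{
    assert(height > 0);
    unsigned sum = 0;
    for (int i = 0; i < 2 * height; i += 2) {
        sum += (unsigned)abs(src[i]     - ref[0]);
        sum += (unsigned)abs(src[i + 1] - ref[1]);
        ref += ref_pitch;   /* only the reference still needs a pitch */
    }
    return sum;
}
```

Both functions return the same value whenever the source really is stored with pitch 2; the second one simply has less addressing work to do, which is the kind of saving described above.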
Since the second version is clearly faster, it is now the only one used. The source contains both, though.
You can get it here.
So I suppose that should solve the 4xY block issue.