1) From what you wrote, I gather I shouldn't worry too much about speed, since every machine implements these instructions differently. Is that correct? Still, could you recommend a document that lists clock cycles (or some other speed specification) for each instruction? For example, to learn whether pxor is faster than movq on MMX...
2) MMX<->XMM transfers should still be faster than going through memory on every machine, shouldn't they?
3) Since the pitch is a multiple of 16, is it safe (and a good idea?) to run a single loop from 0 to pitch*height, instead of two nested loops over height and width?
4) Does saturation mean that 250+10=255?
5) Should I be interested in MOVNTDQ? I don't know what the benefit of a non-temporal store is.
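
To make questions 3 and 4 concrete, this is roughly what I have in mind, written as a plain C sketch rather than real MMX/SSE code. The names add_sat_u8 and brighten_flat are made up for illustration, and I am assuming 8-bit samples and that the padding bytes at the end of each row can be overwritten without harm:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned saturating add of two bytes: the result is clamped at 255
   instead of wrapping around, so 250 + 10 gives 255 rather than 4.
   As far as I understand, this is what PADDUSB does for each byte. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Hypothetical helper: brighten the whole image with ONE flat loop over
   pitch * height bytes. The row padding is processed too, which is only
   acceptable if nothing depends on the contents of the padding. */
static void brighten_flat(uint8_t *pixels, size_t pitch, size_t height,
                          uint8_t amount)
{
    /* Assumption from question 3: the pitch is a multiple of 16, so a
       SIMD version could step 16 bytes at a time with no remainder. */
    assert(pitch % 16 == 0);

    size_t total = pitch * height;
    for (size_t i = 0; i < total; ++i)
        pixels[i] = add_sat_u8(pixels[i], amount);
}
```

With two nested loops I would only touch width bytes per row and skip the padding; the flat version trades that extra work for simpler loop control.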