sh0dan
11th March 2003, 17:47
trbarry wrote:
I'm surprised at your movntq results. For me on both P3's and P4's it has always seemed faster as long as things were aligned to at least 8 bytes and I wasn't subsequently reading it back in soon. But maybe it is machine dependent. What hardware were you testing this on?
The movntq was a big surprise for me too. It is definitely faster in direct copying (BitBlt, for instance), but not very useful in routines that actually do some processing.
The data is not read again, and it is 8-byte aligned. But I guess the problem lies in the fact that movntq cannot be deferred. A movq to memory can sit in the data cache and be written out later, whereas movntq must be dispatched directly. With movq, the processor can take its time over the write while the processing carries on.
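To make the two store kinds concrete, here is a minimal sketch using SSE2 intrinsics. It is not the Avisynth code: the original routines use MMX inline assembly (movq/movntq), and I am assuming the 128-bit pair _mm_store_si128 / _mm_stream_si128 as a stand-in. The name copy_block is hypothetical, and on non-SSE2 builds it simply falls back to memcpy.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define COPY_HAVE_SSE2 1
#else
#define COPY_HAVE_SSE2 0
#endif

// Hypothetical helper: copy `bytes` (a multiple of 16) from src to dst.
// dst must be 16-byte aligned. use_streaming picks the non-temporal store,
// which bypasses the cache, instead of the regular cached store.
inline void copy_block(std::uint8_t* dst, const std::uint8_t* src,
                       std::size_t bytes, bool use_streaming) {
#if COPY_HAVE_SSE2
    for (std::size_t i = 0; i < bytes; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        if (use_streaming)
            _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);  // movntdq
        else
            _mm_store_si128(reinterpret_cast<__m128i*>(dst + i), v);   // movdqa
    }
    if (use_streaming)
        _mm_sfence();  // make the streaming stores globally visible
#else
    (void)use_streaming;  // no SSE2: plain copy, no store choice to make
    std::memcpy(dst, src, bytes);
#endif
}
```

Both paths produce identical output; only the way the write reaches memory differs, which is the part that showed up in the pipeline analysis.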
I saw some really big penalties in the AMD Pipeline analysis tool (most movntq's took an average of 60 cycles!). It showed penalties for data cache misses and for the Load/Store Queue being full. Using movq's, the average was about 2-4 cycles per instruction, with the load/store queue doing ok.
It should be noted that the systems I tested on were an Athlon Tbird 1200 and an Athlon XP 2200+, both with DDR RAM.
I don't know if you have an avisynth compile setup, but you could try the older convert_yv12.cpp (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/avisynth2/avisynth/convert_yv12.cpp?rev=1.8&only_with_tag=MAIN&sortby=date) and do a #define movntq movq and do some comparisons.
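If you don't have the Avisynth tree handy, the same kind of comparison can be sketched standalone. This is only an illustration under assumptions: the macro USE_PLAIN_STORES, the STORE128 wrapper, and the add_rows / time_add_rows functions are hypothetical names, and SSE2 intrinsics stand in for the MMX assembly. Building once with -DUSE_PLAIN_STORES and once without plays the role of the #define movntq movq switch.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define ROWS_HAVE_SSE2 1
// Hypothetical build switch, analogous to "#define movntq movq".
#ifdef USE_PLAIN_STORES
#define STORE128(p, v) _mm_store_si128((p), (v))   // regular cached store
#else
#define STORE128(p, v) _mm_stream_si128((p), (v))  // non-temporal store
#endif
#else
#define ROWS_HAVE_SSE2 0
#endif

// A routine that does some processing before the write: saturating add of
// two byte rows. `bytes` must be a multiple of 16, dst 16-byte aligned.
inline void add_rows(std::uint8_t* dst, const std::uint8_t* a,
                     const std::uint8_t* b, std::size_t bytes) {
#if ROWS_HAVE_SSE2
    for (std::size_t i = 0; i < bytes; i += 16) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        STORE128(reinterpret_cast<__m128i*>(dst + i), _mm_adds_epu8(va, vb));
    }
    _mm_sfence();  // harmless for plain stores, needed for streaming ones
#else
    for (std::size_t i = 0; i < bytes; ++i) {
        unsigned s = static_cast<unsigned>(a[i]) + b[i];
        dst[i] = static_cast<std::uint8_t>(s > 255 ? 255 : s);
    }
#endif
}

// Wall-clock time for `reps` passes, in microseconds.
inline double time_add_rows(std::uint8_t* dst, const std::uint8_t* a,
                            const std::uint8_t* b, std::size_t bytes, int reps) {
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) add_rows(dst, a, b, bytes);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}
```

For the numbers to mean anything, the buffers should be a good deal larger than the caches, and the result will likely vary between CPUs, so treat any single figure with caution.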