A note on LUTs


mg262
3rd December 2005, 17:46
I've recently been trying to speed up Didée's LimitedSharpen, and in the course of that I measured the speed of the four mt_lutxy calls made when the function is called with all parameters at their defaults. (I used the script version modified by Socio to use MaskTools 2.0. Edit: in case you're not familiar with the script, it uses supersampling, so the LUTs run on large frames; that's not relevant to the point I'm making below, but it does mean you shouldn't extrapolate from the raw numbers.)

Results from kassandro's AVSTimer

When measured in vacuo, i.e. in a script where all the time is spent in a LUT call or in MPEG2Source:
305 fps, 299 fps, 304 fps, 292 fps

When measured in the actual script:
141 fps, 143 fps, 146 fps, 144 fps

(Results were taken over long periods and I checked that per-period averages were roughly constant. Source was 720x576, and I'm using a P4/2400. For the in vacuo tests I did not save and reuse the actual material passed into each LUT; feel free to repeat the test that way if you think it would make a substantial difference.)

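The reason for the difference is clear: in a long script, the data touched by the other filters evicts the lookup table from the cache between mt_lutxy calls, so each call has to pull the table back in from slower memory.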
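
To make the cache argument concrete, here is a minimal C sketch of a two-input 8-bit lookup table. It is not the MaskTools code; the names build_lutxy, apply_lutxy and example_expr are hypothetical and the expression is just a placeholder. The thing to notice is the size: 256 x 256 one-byte entries is 64 KB per table, read at data-dependent offsets, so anything that runs between two calls has a good chance of pushing parts of it out of the cache.

```c
#include <stddef.h>
#include <stdint.h>

#define LUT_DIM  256                  /* one axis per 8-bit input value */
#define LUT_SIZE (LUT_DIM * LUT_DIM)  /* 65536 one-byte entries = 64 KB */

/* Hypothetical per-pixel expression: absolute difference of x and y. */
uint8_t example_expr(uint8_t x, uint8_t y)
{
    return (uint8_t)(x > y ? x - y : y - x);
}

/* Fill the table once, evaluating expr for every possible (x, y) pair. */
void build_lutxy(uint8_t *lut, uint8_t (*expr)(uint8_t, uint8_t))
{
    for (int x = 0; x < LUT_DIM; ++x)
        for (int y = 0; y < LUT_DIM; ++y)
            lut[x * LUT_DIM + y] = expr((uint8_t)x, (uint8_t)y);
}

/* Apply the table to two planes of n pixels each.
   The table index depends on the pixel values, so the reads are
   scattered across the whole 64 KB rather than sequential. */
void apply_lutxy(const uint8_t *lut, const uint8_t *a, const uint8_t *b,
                 uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = lut[(size_t)a[i] * LUT_DIM + b[i]];
}
```

In a toy script the same table is reused frame after frame with little else competing for the cache, which is the situation the in vacuo numbers describe.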

Conclusion
I'm not trying to measure the actual slowdown factor or argue that one shouldn't use YV12LUTxy/mt_lutxy or generalise to all 2D lookup-table applications. What I want to say is:

When measuring the performance of large lookup tables, measure in a real script rather than a toy one.

Perhaps this is obvious to everyone but me... but until today I had an inaccurate impression of the cost of a LUT based on numbers I have seen quoted in several places, so I thought it was worth saying.

sh0dan
4th December 2005, 01:11
This naturally depends on the size of the LUT.

Also, it could be interesting to see whether loading the entire table into cache just before it is used makes it faster. It might, since that would be a linear read rather than scattered lookups. In practice you only need to read one byte every 64 bytes to pull in each cache line.
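
A rough sketch of that idea in C, assuming a 64-byte cache line as stated above. The function name warm_table is made up for illustration, and whether this actually helps is untested; it would need measuring in a real script.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size in bytes; varies by CPU */

/* Read one byte from each cache line of the table, in address order,
   so that the whole table is pulled into cache with a linear pass.
   The volatile pointer and the returned sum keep the compiler from
   optimising the loads away. */
unsigned warm_table(const uint8_t *table, size_t size)
{
    const volatile uint8_t *p = table;
    unsigned sum = 0;
    for (size_t i = 0; i < size; i += CACHE_LINE)
        sum += p[i];
    return sum;
}
```

For a 64 KB table this is 1024 reads, done immediately before the lookup loop. The return value can be ignored; it exists only to keep the reads alive.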