mg262
3rd December 2005, 17:46
I've recently been trying to speed up Didée's LimitedSharpen, and in the course of that I measured the speed of each of the four mt_lutxy calls made when the function is called with all parameters at their defaults. (I used the script version modified by Socio to use MaskTools 2.0. Edit: in case you're not familiar with the script, it uses supersampling, so the LUTs are applied to large frames; that's not relevant to the point I'm making below, but it does mean you shouldn't extrapolate from the raw numbers.)
Results from kassandro's AVSTimer
When measured in vacuo, i.e. in a script where all the time is spent in a LUT call or in MPEG2Source:
305 fps, 299 fps, 304 fps, 292 fps
When measured in the actual script:
141 fps, 143 fps, 146 fps, 144 fps
(Results were taken over long periods and I checked that per-period averages were roughly constant. Source was 720x576, and I'm using a P4/2400. I did not save the actual material passed into each LUT and reuse it for the in vacuo tests; feel free to repeat the test that way if you think it would make a substantial difference.)
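To make the figures above easier to compare, here is a small sketch that re-expresses them as time per frame. It assumes the four numbers in each row are one reading per mt_lutxy call and simply averages them; the variable and function names are made up for illustration. This is only arithmetic on the quoted numbers, not a slowdown estimate to generalise from.

```python
# Figures quoted above (frames per second, taken as one reading per mt_lutxy call).
in_vacuo_fps = [305, 299, 304, 292]
in_script_fps = [141, 143, 146, 144]

def mean_ms_per_frame(fps_values):
    """Average the fps readings, then convert to milliseconds per frame."""
    mean_fps = sum(fps_values) / len(fps_values)
    return 1000.0 / mean_fps

vacuo_ms = mean_ms_per_frame(in_vacuo_fps)
script_ms = mean_ms_per_frame(in_script_fps)

print(f"in vacuo:  {vacuo_ms:.2f} ms per frame")
print(f"in script: {script_ms:.2f} ms per frame")
```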
The reason for the difference is clear: in a long script, other filters run between mt_lutxy calls and displace the LUT from the cache, so each call has to fetch the table from main memory again.
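To give a feel for how much cache is at stake, here is a rough sketch of a two-input lookup table for 8-bit samples. It assumes one byte per (x, y) pair, which is how I understand the 8-bit case to be laid out; build_lut and apply_lut are hypothetical helper names, not MaskTools functions.

```python
LEVELS = 256  # number of values an 8-bit sample can take

def build_lut(fn):
    """Precompute fn(x, y) for every pair of 8-bit inputs, clamped to 0..255."""
    table = bytearray(LEVELS * LEVELS)
    for x in range(LEVELS):
        for y in range(LEVELS):
            table[x * LEVELS + y] = max(0, min(255, fn(x, y)))
    return table

def apply_lut(table, plane_x, plane_y):
    """Look up one output byte for each pair of input bytes."""
    return bytes(table[x * LEVELS + y] for x, y in zip(plane_x, plane_y))

# Example expression: absolute difference of the two inputs.
lut = build_lut(lambda x, y: abs(x - y))

print(len(lut))  # 65536 bytes, i.e. 64 KiB for a single table
```

Under that assumption each table is 64 KiB, and the lookups jump around it according to the pixel values, so it is worth comparing that size against the cache sizes of your own CPU and against whatever else the script touches between calls.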
Conclusion
I'm not trying to measure the actual slowdown factor, to argue that one shouldn't use YV12LUTxy/mt_lutxy, or to generalise to all 2D lookup-table applications. What I want to say is:
When measuring the performance of large lookup tables, measure in a real script rather than a toy one.
Perhaps this is obvious to everyone but me... but until today I had an inaccurate impression of the cost of a LUT based on numbers I have seen quoted in several places, so I thought it was worth saying.