Quote:
Originally Posted by pinterf
Note: mt_merge has a cplace parameter default "mpeg2" which - with luma = true - is slower than the dumb "mpeg1" choice. Could you try your benchmarks with cplace="mpeg1" ? Regarding the other benchmarks, I'll do them as well, for example why mt_binarize is slower.
EDIT: Overlay multiply (largest speed difference): no wonder, there is no SIMD optimization there at all.
EDIT2: mt_invert and Avisynth Invert are SSE2 only. But there is only a single instruction or two between load and store which usually implies no or little gain.
Actually some years ago I've implemented for example 8 bit binarize functions in AVX2 but I got zero speed gain so I decided that it won't go live yet. Time to test those again on my i7-7700.
|
I looked into the mt_binarize benchmarks, because Expr does more work per pixel and I did not understand why mt_binarize was still the slower one.
What mt_binarize and the Expr-based ex_binarize have in common is that both load and store pixels.
What they do in between:
mt_binarize (16 bit data) has 2 operations:
- integer addition
- comparison.
Expr:
- Converts 16 bit pixels to 32 bit float (size doubled, using two registers instead of one)
- Compares with the limit (float comparison)
- Mask-blends either 0.0f or 65535.0f depending on the result.
- Converts the float data back to 16 bit integers with rounding.
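To make the difference in per-pixel work concrete, here is a minimal scalar sketch of the two paths listed above. It is not the actual MaskTools or Expr code: the function names are hypothetical, the bias in the integer path is my assumption about what the "integer addition" is for (turning an unsigned compare into a signed one, as SSE2 code typically does), and the polarity (above the threshold becomes the maximum value) is assumed as well.

```cpp
#include <cmath>
#include <cstdint>

// Integer path (mt_binarize-like): one add/bias and one compare per pixel.
// Assumption: the bias stands in for the "integer addition" step; subtracting
// 32768 is the same as adding 0x8000 modulo 2^16.
inline std::uint16_t binarize_int(std::uint16_t px, std::uint16_t threshold) {
    const int biased_px = static_cast<int>(px) - 32768;
    const int biased_th = static_cast<int>(threshold) - 32768;
    return biased_px > biased_th ? 65535 : 0;
}

// Float path (Expr-like): widen, compare, blend, round back.
inline std::uint16_t binarize_float(std::uint16_t px, float threshold) {
    const float f = static_cast<float>(px);              // 16 bit -> 32 bit float
    const float r = (f > threshold) ? 65535.0f : 0.0f;   // float compare + mask blend
    return static_cast<std::uint16_t>(std::lrintf(r));   // float -> 16 bit with rounding
}
```

Both functions return the same result for every 16 bit input; the float version just goes through two conversions and twice the register width to get there.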
Well, this difference can be seen in the single-threaded benchmark results.
Interestingly, because it does almost nothing, mt_binarize on its own is so fast that a synthetic benchmark of it is not meaningful, and the same goes for similar filters such as mt_logic. I recommend testing them only as part of a real script (as Dogway did when providing benchmarks for whole scripts).
mt_binarize is a minimal-operation filter, having a memory load + two register operations + memory store.
Clearly it was reaching the memory bottleneck.
mt_binarize with no MT(!) is even a bit quicker than with any Prefetch value. This must be due to ruined caching and task switching/register saving overhead.
Combined with RemoveGrain, mt_binarize with Prefetch(4) was in the same ballpark as mt_binarize with Prefetch(4) and no RemoveGrain!
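As a rough back-of-the-envelope check of the memory-bound remark, the sketch below estimates how many bytes per second a load + store filter moves at a given frame rate. The helper name is hypothetical, and the figures plugged in afterwards are assumptions on my part: a 640x480 luma plane (the ColorBars default size, with only luma processed) at 2 bytes per sample.

```cpp
#include <cstdint>

// Bytes moved per second by a filter that reads every sample once and
// writes every sample once (hence the factor 2).
constexpr std::uint64_t traffic_bytes_per_second(std::uint64_t width,
                                                 std::uint64_t height,
                                                 std::uint64_t bytes_per_sample,
                                                 std::uint64_t fps) {
    return 2 * width * height * bytes_per_sample * fps;
}
```

Under those assumptions, traffic_bytes_per_second(640, 480, 2, 19000) comes out at roughly 23 GB/s for the single-threaded 19000 fps case. Part of that is probably served from cache, since the source frame is small and static, but it shows why adding threads cannot buy much for such a filter.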
Tested on i7-7700, avs+ 3.7.work
Code:
#SetMaxCPU("SSE4.1")
Import("ExTools.avsi")
Colorbars(pixel_type = "YUV420P16")
mt_binarize()
#ex_binarize()
#RemoveGrain(1, -1)
#Prefetch(4) # 8
Data in fps. The values are approximate averages on my system; actual values fluctuate, but the trends are visible.
Code:
Prefetch  mt_binarize  ex_binarize  x64_mt_bin  x64_ex_bin
-         19000        7000         19100       6700
4         16000        16500        15900       16600
8         13000        13900        12600       13900
Paired with a RemoveGrain after mt_/ex_binarize:
Code:
#SetMaxCPU("SSE4.1")
Import("ExTools.avsi")
Colorbars(pixel_type = "YUV420P16")
mt_binarize()
#ex_binarize()
RemoveGrain(1, -1)
#Prefetch(4) # 8
Code:
Prefetch  mt_binarize  ex_binarize  x64_mt_bin  x64_ex_bin   (+ RemoveGrain(1, -1))
-         8800         5000         8500        4500
4         16114        11400        16700       11200