Doom9's Forum - View Single Post

pinterf · 28th May 2021, 09:10

Quote:

Originally Posted by pinterf

Note: mt_merge has a cplace parameter, default "mpeg2", which - with luma = true - is slower than the dumb "mpeg1" choice. Could you try your benchmarks with cplace="mpeg1"? Regarding the other benchmarks, I'll do them as well, for example why mt_binarize is slower.
EDIT: Overlay multiply (largest speed difference): no wonder, there is no SIMD optimization there at all.
EDIT2: mt_invert and Avisynth Invert are SSE2 only. But there is only a single instruction or two between load and store, which usually implies little or no gain.
Actually some years ago I've implemented for example 8 bit binarize functions in AVX2 but I got zero speed gain so I decided that it won't go live yet. Time to test those again on my i7-7700.

I looked into the mt_binarize benchmarks, because the processing itself is more processor-heavy when using Expr, and I did not understand why mt_binarize was still the slower one.

What mt_binarize and the Expr-based ex_binarize have in common is that both read and store pixels.

What they do inside:

mt_binarize (16 bit data) has 2 operations:
- integer addition
- comparison.

Expr:
- Converts 16 bit pixels to 32 bit float (size doubled, using two registers instead of one)
- Compares with the limit (float comparison)
- Mask-blends either 0.0f or 65535.0f depending on the result.
- Converts the float data back to 16 bit integers with rounding.

Well, this difference can be seen in the single-threaded benchmark results.
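To make the two per-pixel pipelines listed above concrete, here is a minimal scalar sketch in Python. It is only an illustration, not the actual masktools2 or Expr code: the function names and the threshold parameter are hypothetical, and reading the "integer addition" step as a 0x8000 bias that allows a signed 16 bit comparison is my assumption, not something stated in the post.

```python
def binarize_int16(pixels, threshold):
    """Integer path: one addition (bias) and one comparison per pixel.

    Assumption: the addition biases unsigned 16 bit values into the signed
    range so that a signed compare gives the unsigned ordering.
    """
    def to_signed(v):
        v = (v + 0x8000) & 0xFFFF          # integer addition with 16 bit wraparound
        return v - 0x10000 if v >= 0x8000 else v

    t = to_signed(threshold)
    return [0xFFFF if to_signed(p) > t else 0 for p in pixels]  # comparison


def binarize_float(pixels, threshold):
    """Expr-like path: convert to float, compare, blend, convert back."""
    limit = float(threshold)
    out = []
    for p in pixels:
        f = float(p)                               # 16 bit -> 32 bit float
        chosen = 65535.0 if f > limit else 0.0     # compare + mask-blend
        out.append(int(round(chosen)))             # float -> 16 bit with rounding
    return out


if __name__ == "__main__":
    sample = [0, 1000, 1001, 32767, 32768, 65535]
    print(binarize_int16(sample, 1000))
    print(binarize_float(sample, 1000))
```

Both functions return the same mask; the float path simply needs more steps per pixel and twice the register space to get there.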

Quite interestingly, since it does almost nothing, mt_binarize alone is so fast that we had better not run synthetic benchmarks on it, or on similar filters in general (like mt_logic). I recommend testing them only when embedded in a real script (as Dogway did when he provided benchmarks for whole scripts).

mt_binarize is a minimal-operation filter: a memory load + two register operations + a memory store.
Clearly it was hitting the memory bottleneck.
mt_binarize with no MT(!) is even a bit quicker than with any Prefetch value. This must be due to ruined caching and task switching/register saving overhead.
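A rough back-of-the-envelope estimate of the data volume fits the memory-bound reading. The sketch below assumes the ColorBars default frame size of 640x480 (the test script does not set a size), counts exactly one read and one write of every frame, and uses the no-Prefetch fps figures from the table further down. The helper names are hypothetical.

```python
def frame_bytes_yuv420p16(width, height):
    """Bytes in one YUV420P16 frame: full-size luma, two half-size chroma planes, 2 bytes per sample."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return (luma + chroma) * 2


def traffic_gb_per_s(width, height, fps):
    """Data moved per second in GB, counting one read and one write per frame."""
    return 2 * frame_bytes_yuv420p16(width, height) * fps / 1e9


if __name__ == "__main__":
    # Assumed frame size 640x480; fps values are the no-Prefetch results.
    for name, fps in (("mt_binarize", 19000), ("ex_binarize", 7000)):
        print(f"{name}: {traffic_gb_per_s(640, 480, fps):.1f} GB/s")
```

Under these assumptions the unthreaded mt_binarize moves about 35 GB/s and ex_binarize about 13 GB/s, so for mt_binarize it is plausible that data movement rather than arithmetic sets the limit.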

With Prefetch(4), mt_binarize combined with RemoveGrain was in the same ballpark as mt_binarize without RemoveGrain!

Tested on i7-7700, avs+ 3.7.work

Code:

#SetMaxCPU("SSE4.1")
Import("ExTools.avsi")
Colorbars(pixel_type = "YUV420P16")
mt_binarize()
#ex_binarize()
#RemoveGrain(1, -1)
#Prefetch(4) # 8

Data in fps. The values are rough averages on my system; actual values fluctuate, but the trends are visible.

Code:

Prefetch  mt_binarize  ex_binarize  x64_mt_bin  x64_ex_bin
-               19000         7000       19100        6700
4               16000        16500       15900       16600
8               13000        13900       12600       13900
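To read the trend more easily, the fps values can be expressed relative to the no-Prefetch run. This small sketch uses only the first two columns of the table above; the dictionary layout and the function name are mine.

```python
# fps values from the table above; None stands for the run without Prefetch.
FPS = {
    "mt_binarize": {None: 19000, 4: 16000, 8: 13000},
    "ex_binarize": {None: 7000, 4: 16500, 8: 13900},
}


def scaling(fps_by_prefetch):
    """Return fps relative to the no-Prefetch baseline for each Prefetch value."""
    base = fps_by_prefetch[None]
    return {p: round(v / base, 2) for p, v in fps_by_prefetch.items() if p is not None}


if __name__ == "__main__":
    for name, values in FPS.items():
        print(name, scaling(values))
```

mt_binarize drops to roughly 0.84x and 0.68x of its unthreaded speed, while ex_binarize gains to roughly 2.4x and 2.0x, which is what one would expect if the former is limited by memory and the latter by computation.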

Paired with a RemoveGrain after mt_/ex_binarize:

Code:

#SetMaxCPU("SSE4.1")
Import("ExTools.avsi")
Colorbars(pixel_type = "YUV420P16")
mt_binarize()
#ex_binarize()
RemoveGrain(1, -1)
#Prefetch(4) # 8

Code:

Each filter + RemoveGrain(1, -1):
Prefetch  mt_binarize  ex_binarize  x64_mt_bin  x64_ex_bin
-                8800         5000        8500        4500
4               16114        11400       16700       11200