Old 19th May 2021, 18:24   #1  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 2,367
Dogway's Filters Packs


GitHub repo.

If you are new to AviSynth, to get things up and running refer to this post.

TIP: For AviSynth+ front-end GUIs such as MeGui or RipBot, follow these suggestions to update to the latest AVS+ version: for MeGui and for RipBot.

----

Like others before me, I decided to create a single thread to list and explain my updated filter packs and AVS+ modernization efforts.

The main goal of the updates is to replace redundant, outdated or slow functions with modern alternatives that often bring more features, such as HBD support (32-bit float included), frame properties and improved performance.
Basic building-block functions like those in masktools2, RgTools and smoothadjust have been replaced with wrappers around the internal Expr() filter. This allows for fluid, easily editable code that others can inspect, debug or branch.
Additionally, higher-level filters have been created or ported, like those in SharpenersPack or GradePack (read below).
For performance reasons many expressions have also seen major refactors. Because of this, and because of modern AVS+ syntax updates, versions earlier than v3.7.3 probably won't work properly, but this is a necessary evil to move things forward and make HBD filtering of HD|UHD sources something that is counted in hours and not days. Special thanks to pinterf for his continuous work on AVS+.
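
As a rough illustration of what such an Expr() wrapper boils down to, here is a minimal sketch of a makediff/adddiff pair written directly with the internal Expr() filter. It is not the actual ExTools source: the test clip and the variable names are placeholders, and the expressions assume integer bit depths, where `range_half` is the mid-grey offset.

```avisynth
# Sketch only: hypothetical stand-ins for mt_makediff()/mt_adddiff() using internal Expr().
src  = ColorBars(width=640, height=480, pixel_type="YV12").ConvertBits(16)
blur = src.Blur(1.0)

# Difference centered on mid-grey; Expr() clamps the result to the valid range
diff = Expr(src, blur, "x y - range_half +")

# Add the stored difference back onto the blurred clip
back = Expr(blur, diff, "x y + range_half -")

return back
```

Because the whole operation is a short expression string, it can be read and modified without touching a plugin's source, which is the point made above about editable code.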


ORIGINAL

ExTools: Wrapper library for Expr() expressions that covers (and expands on) most masktools2 and removegrain functions, including lutspa and convolutions. It also adds Array helper functions that expand on those included internally, and function approximations like 'atan', 'expr'... that are faster than the internal ones. Syntax and arguments are kept, so it's easy to update old scripts to the new counterparts. This pack is required for all the following scripts.
Post about STTWM() (and initial release), Adaptive Threshold, ex_bilateral() (vs Dither_bilateral16() ), ex_shape() GIF. Morphological mask filtering.

Transforms Pack: Divided into three parts: Main, Models and Transfers. Modern technical transform functions for color and tone response, for color managing in AviSynth+. The goal is usability, functionality and accuracy. It works at any bitdepth and supports any luma range, extra color spaces and color models, among them real RGB based HSV, reversible YUV and YCoCg, IPT, OkLab, ICtCp, IPTPQc2 and more. It also includes a SoftLimiter() and building-block matrix functions.
Example converting an ACEScg exr to Rec709-1886 with gamut compression, tonemapping and filmic contrast.
Example converting a Dolby Vision IPTPQc2 (DVp5 or DVp8) clip to Rec709-1886 with gamut compression and tonemapping.

Grade Pack: Look transforms. Includes ex_levels() with native HBD support (same usage as native Levels() ), ex_autolevels(), ex_contrast(), ex_blend(), ex_glow(), ex_posterize(), greyscale_rgb(), FindTemp(), WhitePoint(), Vignette(), Skin_Qualifier(), GamutWarning(), PseudoColor(), GreyWorld(), HSVxHSV() and ex_vibrance(), a saturation and vibrance function.

SMDegrain: Simple MDegrain Mod. An easy to use, foolproof degraining wrapper for MDegrain and company. Initially a small wrapper of a few lines by Caroliano, which I took over and extended with YUY2 support, interlaced support, 16-bit dither support, contrasharpening, prefiltering, debug view, documentation, globals and good-practice code. Later on real.finder took over and adapted it to modern code, adding bugfixes, HBD support and so on. Now I have ported it to ExTools, sanitized the code, removed old AviSynth support, Dither support and YUY2 support, and included some new features like alternative degrainers, recursion, low frequency restoration (here too), DCT flicker, optimized UHD performance, ex_DGDenoise() and ex_BM3D() prefilters, and multi-scale retinex.
Here are some explanations on iterative temporal filtering. In this post, a draft for a SAD sampler. Recommendations for heavy grain prefiltering. 16mm film restoration (+here). Original 2011 SMDegrain thread.

Resizers Pack: Pack of functions involving resizing operations: deep_resize() (and here), a refactored port of nnedi3_resize16() at six times the speed and higher quality; nnedi3resize(), an nnedi3-based arbitrary size scaler; RatioResize(), which can resize by a single factor (percentage, adjust to width/height, to PAR, DAR and so on); PadResize() to crop or pad given input dimensions; PadBorders(), like an advanced AddBorders()+Crop() with the option to mirror, dilate or fill borders; MatteCrop() to automatically fix (crop+resize) movies with random bordered shots; and some utilities like mmod() to crop/pad/auto to mod, and nmod() to mod values with extra features like min value or bankers' rounding.

Masks Pack: Mask and limiter filters. BoxMask(), FlatMask(), LumaMask(), CornerMask() (cheap alternative here) and MotionMask() for masks and ex_limitchange(), ex_limitdif() and Soothe() for limiters.

Scenes Pack: SceneStats() opens the doors to scene based workflows. It writes current (frame) scene range bounds into '_SceneRange' frame properties and current scene change into '_SceneChangePrev', also scene motion into '_SceneMotion', scene details to '_SceneDetails' (a complexity index for average of edges), scene exposure index to '_SceneExposure' and pixel stats into '_SceneStats', on the fly or by offloading it to a file. ReadStats() can load an optionally exported SceneStats() stats file for faster processing at encoding stages. ClipStats() will otherwise load them and convert them to clip global stats, to help you decide better clip-wide constant settings in your filters.
Example for SMDegrain.
Example for FilmGrainPlus.

FilmGrain+: Made from the ground up, an accurate and performant synthetic film grain filter with presets for the most common negative films.

Logo: Easily add static logos or watermarks, with blur, fade in/out, opacity, and blending controls. Eventually also for video based logos.

Stabilization Tools Pack (legacy): Initially a simple mod of Stab() which grew bigger and currently includes various strategies for edge filling. Also includes FilmGateFix() mainly aimed at anime sources.


EX/MIX MODS

Normally there are 2 flavors of each mod:
EX mods are future proof, with ExTools wrappers and minimal dependencies. This can also come in handy when running on Linux/macOS, which very few plugins support.
MIX mods use carefully chosen masktools2 and removegrain functions to maximize speed but come with these and probably other dependencies as well.

QTGMC+: Reference deinterlacer. Ported to ExTools from v3.382 (~40% faster in HBD). Includes ex_vinverse() (now legacy for int bitdepths), ex_bob() and ex_reduceflicker() functions.

LSFplus: Based on LSFmod, also one of the best sharpeners out there. Optimized (+74% with no SS), ported to ExTools and added more features.

GradFun3plus: Port of cretindesalpes' excellent GradFun3 debanding filter to internal AVS+ calls (+66% gain in smode=0)

Sharpeners Pack: Collection of high quality sharpeners optimized and ported to ExTools for HBD support and performance. In total 29 sharpeners, among them: Adaptive Sharpen, ex_unsharp, CASP, NVSharpen, ex_ContraSharpening, SeeSaw, FineSharpPlus, NonlinUSM, ReCon, blah and Plum.

Deblock Pack: Pack containing different deblocking functions, from the famous Deblock_QED() (29% speed gain) to CCD(), SmoothD2c(), SmoothDeblock() (WIP) and feisty2's Oyster (which also includes deringing and more).

Similarity Metrics: Pack containing all the similarity/distance metrics ported by Asd-g to AVS+ from WolframRhodium VapourSynth repo. I collected, sanitized and updated the code for x4 speed gain on GMSD(), x2 on MDSI(), x3 on vsSSIM() and added+refactored BSSIM() from zorr. Also created SVM(), a metric for image sharpness. For more metrics check the cost functions in ex_makediff() in ExTools.

yugefunc: Collection of VapourSynth filters ported to AVS+ and optimized on the way with ExTools and other expression tricks: ex_guidedblur(), ex_ANguidedblur(), XDoG() (WIP), etc

Other: Some other scripts have received the ExTools treatment; FillMissing(), FastLineDarkenPlus(), SPresso(), DeStripe(), etc


ExTools main functions:
Code:
# EXPRESSIONS    
ex_lut()         - Single variable (1 clip)  expressions
ex_lutxy()       - Double variable (2 clips) expressions
ex_lutxyz()      - Triple variable (3 clips) expressions
ex_lutxyza()     - Quadruple variable (4 clips) expressions
ex_makediff()    - Clip based differentiation. Also calculates similarity/residual metrics via cost functions
ex_adddiff()     - Sum clips, specially useful to add back the result of differentiation
ex_makeadddiff() - ex_makediff() and ex_adddiff() in one step
ex_logic()       - Logical operations between 2 clips with logic ops (MIN, MAX, OR, AND, etc)
ex_merge()       - Merging. Performs a linear interpolation between 2 clips based on mask (3rd clip)
ex_clamp()       - Clamps first clip between the maximum of the second clip and the minimum of the third
ex_binarize()    - Performs binary type segmentation or thresholding
ex_athres()      - Adaptive Threshold. Special binary thresholding for uneven brightness images (ie. extracting letters from a shaded area)
ex_invert()      - Invert the clip pixel values
ex_lutspa()      - Relative or absolute pixel-location based expressions
ex_motion()      - Computes a very primitive motion mask akin to MaskTools2's mt_motion()
ex_hysteresis()  - Proof of concept Expr() port of mt_hysteresis(). Uses 'for' loops so very slow
                 
# MORPHOLOGICAL  
ex_expand()      - Morphological dilation/expansion of pixel-value based on structuring element given by the kernel window
ex_inpand()      - Morphological erosion/contraction of pixel-value based on structuring element given by the kernel window
ex_inflate()     - Expansion via outward blurring given structuring element of pixel values of the kernel window
ex_deflate()     - Contraction via inward blurring given structuring element of pixel values of the kernel window
ex_hitormiss()   - Structuring elements based morphological transforms for binary images
ex_edge()        - Gradient magnitude. Edge detection via (partial) local derivatives
ex_luts()        - Moving window relative pixel-location based expressions. A convolution do-it-all filter
ex_shape()       - Helper filter for ex_luts() (and other expression based filters) to fetch kernel-window pixels into a string
                 
# BLURS          
ex_boxblur()     - Discrete local neighborhood blur convolutions
ex_blur()        - Gaussian (or Butterworth) weighted blur convolutions
ex_gaussianblur()- Optimized Gaussian filter for large sigma
ex_kawase()      - Kawase optimized blur filter (still slower than ex_gaussianblur() ). Accepts different strides so good for exponential blur
ex_blur3D()      - Spatio-temporal blur filter
ex_bilateral()   - Bilateral blur filter (respects edges)
ex_smartblur()   - Like Bilateral filter but more performant (mimics Photoshop's Surface Blur)
ex_smooth()      - Savitzky-Golay smoothing filter. Halfway between blur and antialiasing
ex_FluxSmoothT() - Minimum change between a temporal weighted blur and temporal median. Informal port of FluxSmoothT filter via Didée's description
ex_FluxSmoothST()- Spatio-Temporal minimum change between weighted blur and median. Uses ex_FluxSmoothT() and its spatial equivalent ex_MinBlur()
ex_median()      - Median (rank order) based blur filtering. Also includes some alternative mean average algorithms
ex_repair()      - Median (rank order) based repair filter
STTWM()          - Spatio-Temporal Thresholded Weighted Median (STPresso() inspired / not a port)

Last edited by Dogway; 1st November 2023 at 01:24.
Old 19th May 2021, 20:33   #2  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 2,367
Benchmarks for 1080p 16-bit

ExTools

Code:
DGSource("1080psrc.dgi")
ConvertBits(16)

# mt_makediff()                   is  5% slower   than ex_makediff()
# mt_adddiff()                    is  5% slower   than ex_adddiff()
# mt_logic(mode="and")            is  6% slower   than ex_logic(mode="and")
# mt_merge(luma=true,U/V=3)       is  8% faster   than ex_merge(luma=true, UV=3) (5% slower when Y clips/masks)
# mt_clamp()                      is  6% slower   than ex_clamp()
# mt_binarize()                   is 12% slower   than ex_binarize()
# mt_invert()                     is 10% slower   than ex_invert()
#      invert(channels="Y")       is 17% slower   than ex_invert()
# mt_lutspa()                     is  0% slower   than ex_lutspa()
# mt_luts()*                      is 96% slower   than ex_luts() *tested with mt_luts( c, mode="max", pixels=mt_square( 1 ), expr="x y - abs") 
# Overlay(mode="multiply")        is 44% slower   than ex_blend(mode="multiply")
# Overlay_MTools(mode="multiply") is  8% slower   than ex_blend(mode="multiply")
# OverlayPlus(a, mode="multiply") is  1% slower   than ex_blend(mode="multiply")

# mt_expand()                     is  6% faster   than ex_expand()
# mt_inpand()                     is  5% faster   than ex_inpand()
# mt_deflate()                    is 14% faster   than ex_deflate()
# mt_inflate()                    is 14% faster   than ex_inflate()
# mt_edge()*                      is 14% faster   than ex_edge() *but much slower in "free" kernel mode

Prefetch(4)
In any case there are a few exceptions. For example, mt_expand(mode= mt_circle(zero=true, radius=2)) is much slower than ex_expand(2,mode="circle"), and in the same vein ex_expand(2) could be the same speed as or faster than mt_expand().mt_expand().

Code:
# Bilateral Blur
#
# 100% Dither_bilateral16(radius=2, thr=10, flat=1.0,   u=1, v=1) (216fps) # Output is dirtier though
#  77% ex_bilateral(1,dejaggie=false)
#  59% vsTBilateral(diameterY=3, sdevY=4, idevY=4.0,    u=1, v=1)
#  43% TBilateral(3,3,chroma=false)                                        # only supports 8-bits
#  23% bilateral(sigmaSY=1, sigmaRY=0.02, algorithmY=2, u=1, v=1)

Prefetch(6)

# Variable Box Blur
#
# 100% removegrain(20,-1) (485fps)
#  91% ex_boxblur(1,mode="mean",UV=1)
#  90% MiniDeen(radiusY=1, thrY=255, u=1,v=1) # crumbles from rad=3 onwards
#  89% neo_MiniDeen(radiusY=1, thrY=255, u=1,v=1)
#  80% mt_inflate().mt_deflate()              # mean blur approximation
#  70% ex_blur(1.5,n=300,mode="butterworth")
#  69% blur(1.58)
#  67% generalconvolution(matrix="1 1 1 1 1 1 1 1 1",chroma=false)
#  65% Dither_box_filter16(2,U=1,V=1)   # with ConverttoStacked() and ConvertfromStacked()
#  44% mt_convolution("1 1 1","1 1 1",U=1,V=1)
#  13% SpatialSoften(1,30,0)                   # 8-bit YUY2 only, thresholded. Prefetch(8)
#   5% mt_luts(last, "avg", mt_square(1), "y",chroma="-1")

# Variable Gaussian Blur (binomial fitted)
#
# 100% removegrain(12,-1) (486fps)           # technically a binomial weighted mean of [1 2 1]
#  97% GBlur2(sqrt(1)/2. * sqrt(2),chroma=2) #         only in 8-bit. weighted mean of [1 2 1]
#  90% ex_boxblur(1,mode="weighted",UV=1)    # binomial weighted mean
#  88% ablur(1, 1, chroma=1)                 # against ex_boxblur(2,mode="weighted",UV=1)
#  85% BinomialBlur(sqrt(1)*0.707,U=1,V=1)   # only in 8-bit
#  81% vsTCanny(sqrt(1)*0.707,mode=-1,u=1,v=1) # true gaussian blur (fastest for mid size sigma)
#  70% ex_blur(1,mode="binomial",UV=1)       # true gaussian blur
#  68% blur(1.00)                            # weighted mean of [1 2 1]
#  65% generalconvolution(matrix="1 2 1 2 4 2 1 2 1",chroma=false)
#  44% mt_convolution("1 2 1","1 2 1",U=1,V=1)
#  23% GBlur(rad=1,sd=0.9,u=false,v=false)
#  11% FastBlur(sqrt(1)*0.707,gamma=false)
#  11% GaussianBlur(0.53,U=1,V=1)            # only in 8-bit

Prefetch(4)


Script mods

CPU: i7-4790K (Stock Clock)
GPU: GTX 1070

Prefetch(6)
LSFmod.v2.193: 45.4fps
LSFmod.v6.0ex: 60.0fps
LSFplus.v6.0mix: 61.8fps
Code:
LSFplus(preset="slow",strength=200,edgemode=0,soothe=true,ss_x=1.0,ss_y=1.0)

Prefetch(4)
GrainFactoryLite: 57fps (96fps @8-bit)
GrainFactory3mod EX: 68fps (84fps @8-bit) Prefetch(8)
FilmGrain: 52fps (75fps @8-bit) Prefetch(8)
FilmGrain+: 38fps (54fps @8-bit) ('gamma' mode)
FilmGrain+: 37fps (50fps @8-bit) ('log' mode)
Code:
str=1.25
size=1.2
GrainFactory3mod(size=1,g1str=6.0*str,g2str=8.0*str,g3str=5.5*str,g1size=1.20*size,g2size=1.50*size,g3size=1.40*size,g1cstr=0.9,g2cstr=0.9,g3cstr=0.9,temp_avg=1)
or
FilmGrain(size=1.1,str=9,cstr=0.5,coarse=4.0,conv=false)
or
FilmGrainPlus(size=1.5,str=0.8,lo=1,mid=1,hi=1,sharpness=0.8,mode="gamma")
Prefetch(6)
SMDegrain v3.1.2.111s: 3.300fps (slowdown due to Contrasharpening() )
Prefetch(8)
SMDegrain v4.3.0d: 15.1fps
Code:
SMDegrain(tr=2,thSAD=400,contrasharp=true,refinemotion=true)

Note: QTGMC+ v4.0 is the last version comparable to older ones. Later versions use different default core deinterlacers for higher detail preservation.

Prefetch(8) (720x576 clip)
QTGMC 3.382s: 23 fps (8-bit) 10.0 fps (16-bit)
QTGMC+ 4.00p: 22 fps (8-bit) 18.0 fps (16-bit)
Code:
QTGMCp(tr2=3,preset="very slow",Lossless=2,sourcematch=3,sharpness=0.2,MatchEnhance=0.0,MatchPreset="Slow", MatchPreset2="Slow",border=true,threads=4)
QTGMC 3.382s: 54.7 fps (8-bit) 31.5 fps (16-bit)
QTGMC+ 4.00p: 55.4 fps (8-bit) 40.0 fps (16-bit)
Code:
QTGMCp(thsad1=300,blocksize=8,TR0=1,TR1=1,TR2=0,EZKeepGrain=1.0,NoiseDeint="Generate",StabilizeNoise=true,border=true,chromamotion=false,threads=4)
QTGMC 3.382s: 89 fps (8-bit) 42.1 fps (16-bit)
QTGMC+ 4.00: 87 fps (8-bit) 60.0 fps (16-bit)
Code:
QTGMCp(tr2=2,preset="slow",border=false,threads=4)
QTGMC 3.382s: 15.4 fps (8-bit) 12.3 fps (16-bit)
QTGMC+ 4.00p: 13.2 fps (8-bit) 12.3 fps (16-bit)
Code:
QTGMCp(tr2=2,preset="very slow",SVThin=0.5,EZKeepGrain=2.0,NoisePreset="slower",Sharpness=0.7,tuning="DV-SD",border=true,threads=4)
QTGMC 3.382s: 72.6 fps (8-bit) 38 fps (16-bit)
QTGMC+ 4.00p: 73.3 fps (8-bit) 53.5 fps (16-bit)
Code:
QTGMCp(TR2=3,TR0=1,TR1=1, Preset="Slower", InputType=1, sharpness=0)


ex_median(), ex_bilateral() with Prefetch(6)
Code:
100.0% vertical  Prefetch(4)   (440fps)
 97.5% undot     Prefetch(4)
 97.0% undot6    Prefetch(4)
 96.8% cartoon   Prefetch(4)
 94.3% edgeS     Prefetch(4)
 91.1% verticalS Prefetch(4)
 90.5% medianT   Prefetch(4)
 84.5% median
 84.3% SixNN     Prefetch(4)
 82.2% PML
 81.8% edgeC
 80.9% undot3    Prefetch(4)
 80.7% undot2    Prefetch(4)
 80.7% midsum
 79.8% EMF
 77.0% medianT5  Prefetch(4)
 76.7% GaussT5
 76.6% IQM
 75.2% ML3D
 75.0% edgeW
 75.0% winsor
 73.4% trimean
 72.0% edgeCL
 71.4% smart
 70.5% SNN
 70.2% CAM
 69.1% CWM
 67.6% CWM2
 66.5% AWM
 52.6% MMF
 52.2% PWM
 49.3% WMF
 48.2% IQMST
 44.1% ML3Dex
 38.0% bilateral
 36.1% Hybrid
 33.2% STWM
 33.2% kuwahara
 31.1% BDM
 29.0% unblob3D
 28.9% DGM5
 28.1% TL3D
 26.4% DGM3
 26.0% DGM2
 25.9% unblob3
 25.9% DGM1
 25.7% DGM4
 25.5% median5
 24.8% DGM0
 21.5% medianST
 20.2% trimean5
 19.9% median7o
 19.1% smart2
 18.5% AMF
 18.2% IQM5
 17.3% winsor5
 16.8% medianSTS
 16.1% IQMV
 16.1% GaussST5
  8.1% median7   Prefetch(8)
  6.3% smart3    Prefetch(8)
ex_median() comparison chart | source (for toggle comparison)


ex_blur(), ex_blur3D(), ex_boxblur(), ex_smooth(), ex_kawase()
Code:
100.0% rg19 (448fps)
 99.1% bokeh2
 98.7% kawase lin
 97.8% weighted
 97.8% mean
 86.4% kawase2 lin
 86.2% bokeh
 78.3% SNN
 77.0% rg192
 75.9% mean2
 75.7% smooth
 75.4% weighted2
 75.4% blur
 72.8% smartblur
 71.9% smooth2
 71.4% smooth  sharp
 71.4% blur2
 67.0% smooth2 sharp
 64.3% smartblur2
 60.7% trimmed
 60.7% weighted3D
 52.5% mean3D
 37.1% ex_fluxsmoothST
blurs comparison chart | source (for toggle comparison)


ex_edge() with default thresholds and Prefetch(4)
Code:
100.0% mt_sobel (460fps)
 99.3% tritical
 97.0% cartoon
 96.3% hotdog
 95.0% kayyali
 94.6% laplace
 91.1% hprewitt
 89.8% SGDD
 89.1% min/max
 89.1% sobel5
 88.3% roberts
 88.0% max
 87.0% qprewitt
 87.0% LoG
 86.5% TEdge
 85.2% frei-chen
 84.9% kroon
 84.6% prewitt
 84.1% sobel
 84.1% farid
 84.0% pscharr
 83.3% scharr
 81.7% robinson
 79.8% SGDD7
 78.9% DoG       Prefetch(6)
 71.3% Std       Prefetch(6)
 62.2% kirsch    Prefetch(6)
 56.0% DoB       Prefetch(6)
 50.4% farid5    Prefetch(6)
 49.8% SG        Prefetch(6)
 47.4% FDoG      Prefetch(6)
ex_edge() comparison chart



Sharpeners Pack
Code:
360  100 % XSharpenPlus()
355  98.6% CASP(1)
347  96.4% UnsharpMask_HBD(128*n,1,0) Prefetch(4)
339  94.2% DGSharpen2()
317  88.1% ex_unsharp()          Prefetch(4)
310  86.1% DetailSharpen()
272  75.6% NonlinUSM()
250  69.5% FineSharpPlus()
178  49.4% pSharpen()
160  44.4% RSharpen()
150  41.7% LSFplus(preset="LSF")
148  41.1% CASm()
145  40.3% SharpenComplex2()
106  29.4% NVSharpen()           Prefetch(8)
97   26.9% ex_ContraSharpening(a)
79   21.9% SlopeBend()
60   16.7% LSFplus(preset="fast")
50.6 13.1% DelicateSharp()
43.2 12.0% LSFplus(preset="medium")
33    9.2% SSSharpFaster()
27    7.5% LSFplus(preset="slow")
26.5  7.4% SeeSaw(a)
24.9  6.9% ReCon()
21    5.8% MedianSharp()
18.5  5.1% Adaptive_sharpen(1.0) Prefetch(8) (32-bit) 
14.5  4.0% MedSharp()
11.7  3.3% blah()                Prefetch(4)
1.9   0.5% SSSharpEX()           Prefetch(4)
0.22 0.06% RegularSharp()
__________________
i7-4790K@Stock::GTX 1070] AviSynth+ filters and mods on GitHub + Discussion thread

Last edited by Dogway; 15th April 2023 at 11:40.
Old 20th May 2021, 06:15   #3  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,322
Quote:
Originally Posted by Dogway View Post
Dogway's Filters Packs
* Expr() is known to perform worse in 8-bits than masktools2 and sometimes even in 16-bit. This is due to lack of lut based calculations in 8-bit, and lack of AVX2 acceleration for convolutions ("pixel addressing"), so work here is anticipating future performance improvements on avs+.
I'm so glad that finally someone is using pixel addressing. I did it only out of pure curiosity. The acceleration part took me many weeks to implement; it became so complex that I wanted to remove it. Like in game theory: do I lose more by giving it up and dropping two weeks' work, or by working on it further?

I have to mention that no AVX2 code is used when pixel addressing is used in the Expr expression, due to the complexity of the implementation code for AVX 32-byte registers.

Expr-based basic luts in Avisynth: they are on my roadmap, and there are already hints in my source. I'm planning to continue the work on that topic later.

Masktools2 internally uses 64-bit doubles while Expr uses only 32-bit floats.
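
For readers unfamiliar with the feature, pixel addressing lets an expression read neighbouring pixels through relative offsets. The following is a minimal sketch of a 3x3 dilation (neighbourhood maximum) written that way. It is only an illustration under assumptions: the test clip and names are placeholders, and it is not taken from ExTools or masktools2.

```avisynth
# Sketch only: 3x3 dilation (per-pixel neighbourhood maximum) via Expr() pixel addressing.
# x[dx,dy] reads the pixel at a relative offset from the current position.
src = ColorBars(width=640, height=480, pixel_type="YV12").ConvertBits(16)

# Each row reduces three pixels to one maximum; the trailing "max" folds rows together
dil = Expr(src, "x[-1,-1] x[0,-1] max x[1,-1] max " \
              + "x[-1,0] x max x[1,0] max max " \
              + "x[-1,1] x[0,1] max x[1,1] max max")

return dil
```

Expressions of this kind are the ones that, per the note above, currently run without AVX2 code.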
Old 20th May 2021, 07:32   #4  |  Link
kedautinh12
Registered User
 
Join Date: Jan 2018
Posts: 2,163
Thanks, pinterf.
Old 20th May 2021, 08:39   #5  |  Link
real.finder
Registered User
 
Join Date: Jan 2012
Location: Mesopotamia
Posts: 2,589
Quote:
Originally Posted by pinterf
I'm so glad that finally someone is using pixel-addressing.
I did use it before, if this is it: https://github.com/realfinder/AVS-St...sing.avsi#L447, and also to imitate https://web.archive.org/web/20150602.../dotcrawl.html in here.
__________________
See My Avisynth Stuff
Old 20th May 2021, 19:35   #6  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 2,367
pinterf: Thanks to you. Very useful tools, the array implementation too. From what I could see these new features are under-utilized, so I put them to good use in some of the packs, sanitizing every convoluted script I can find.

I know that to some, doing this in AVS scripting might look foolish, but IMO it democratizes the code, makes it more fluid and promotes AVS+ development. That's why, while it might currently underperform in certain situations (lutspa, pixel addressing, etc.), it evens out with the improvements in Expr expressions, as can be seen in the benchmarks. It can only get better, I hope.

Yes, I saw mt_xxpand got AVX2 a year ago, while Expr is on SSSE3 for pixel addressing. I have no programming skills but I can imagine the difficulty, mainly as Expr is much more powerful and hence harder to optimize.
Last edited by Dogway; 20th May 2021 at 23:05.
Old 22nd May 2021, 09:36   #7  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 2,367
So I've been working like mad on ExTools and it's almost reaching v1.0 final. Only ex_edge() is left from my planned ports (and ex_clamp(), which still isn't a 1:1 replica).

Today I added iterated Gaussian and box blur kernel functions as separated 2x1D kernels. They run pretty fast; I'm just wondering if I should add a multiplier/divisor to make the sigma steps more granular.

Later I also want to optimize the other kernels where they are separable and make iterators for them so that radius works properly. Since I already made them for the blurs it should be easier to port. Once done I will switch back to Transforms Pack for v1.0 final.

In the long run I want to fiddle with DotCrawl convolutions, add more edge detection kernels, and make some Unsharps with them.

Quote:
Originally Posted by pinterf
Expr based basic luts in Avisynth: they are on my roadmap...
Thanks for the work, looking forward to it. There are many situations where HBD isn't needed, like frame interpolation or deinterlacing.

Quote:
Originally Posted by pinterf View Post
Masktools2 is using internally 64 bit doubles while Expr is using only 32 bit floats.
Does that make a difference in performance? I haven't run tests without use_expr>0.
Last edited by Dogway; 22nd May 2021 at 09:40.
Old 27th May 2021, 14:21   #8  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 2,367
ExTools v1.0 final is released. It now also supports 32-bit float bitdepth.
It's generally faster than masktools2, except when the Expr() "pixel addressing" feature is used, as in convolutions.
For 8-bit it's still slower than masktools2, but pinterf is currently working on it.

I don't know if "pixel addressing" is ever going to have AVX2 acceleration, but if it does it might be possible for my ports to exceed masktools2 speed, or at least reach it, leaving the masktools2 dependency behind and working just with internal code, as it ideally should be.

Aside from masktools counterparts I also created a few functions like ex_blend(), which replaces Overlay(), ex_undot() for removegrain(1), ex_boxblur() for removegrain(19), and ex_blur() for removegrain(12) and blur().

Therefore my script mods come in two versions divided into folders: EX mods and MIX mods. EX mods are future proof with ExTools wrappers, and MIX mods use masktools2 convolutions and removegrain to maximize speed.

See updated benchmarks in second post.
Old 27th May 2021, 14:36   #9  |  Link
kedautinh12
Registered User
 
Join Date: Jan 2018
Posts: 2,163
Thanks for the hard work, Dogway, but I think you could focus just on the EX mods; for MIX mods we can use real.finder's scripts, which are the same as yours.
Old 27th May 2021, 14:47   #10  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 2,367
As you can see in the benchmarks, many functions are faster than the masktools2 calls in real.finder's mods. Overlay() especially comes to mind, which is uber slow, but I also do some optimizations aside from 1:1 ports. I don't plan to port everything, just my most used scripts, so I take special care with them. I'm also cleaning the code, removing old compatibility support, formatting, and so on.

I might switch ex_merge() back to mt_merge() for MIX mods; something is going on in there, but I haven't had much time to debug.

From now on I will resume TransformsPack to release a v1.0 final soon, focused on SDR color spaces.
Last edited by Dogway; 27th May 2021 at 14:49.
Old 27th May 2021, 20:10   #11  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,322
Note: mt_merge has a cplace parameter, default "mpeg2", which - with luma = true - is slower than the dumb "mpeg1" choice. Could you try your benchmarks with cplace="mpeg1"? Regarding the other benchmarks, I'll run them as well, for example to see why mt_binarize is slower.
EDIT: Overlay multiply (largest speed difference): no wonder, there is no SIMD optimization there at all.
EDIT2: mt_invert and Avisynth Invert are SSE2 only. But there is only a single instruction or two between load and store, which usually implies little or no gain.
Actually, some years ago I implemented, for example, 8-bit binarize functions in AVX2, but I got zero speed gain so I decided that they wouldn't go live yet. Time to test those again on my i7-7700.
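
The cplace comparison suggested above could be set up roughly as follows. This is only a sketch under assumptions: it needs masktools2 loaded, the synthetic clips and their names are placeholders rather than the sources used in the thread's benchmarks, and each variant would be timed in a separate run.

```avisynth
# Sketch only: the same merge with the "mpeg1" chroma placement, to be timed on its own.
a    = ColorBars(width=1920, height=1080, pixel_type="YV12").ConvertBits(16)
b    = a.Invert()
mask = a.Blur(1.0)

# Variant 1 (default cplace, "mpeg2" per the note above):
# mt_merge(a, b, mask, luma=true)

# Variant 2:
mt_merge(a, b, mask, luma=true, cplace="mpeg1")
Prefetch(4)
```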

Last edited by pinterf; 27th May 2021 at 21:28.
Old 28th May 2021, 00:05   #12  |  Link
GMJCZP
Registered User
 
Join Date: Apr 2010
Location: I have a statue in Hakodate, Japan
Posts: 754
Quote:
Originally Posted by Dogway
As you can see in the benchmarks, many functions are faster than the masktools2 calls in real.finder's mods. Overlay() especially comes to mind, which is uber slow, but I also do some optimizations aside from 1:1 ports. I don't plan to port everything, just my most used scripts, so I take special care with them. I'm also cleaning the code, removing old compatibility support, formatting, and so on.
DogWay:

Do you have a function equivalent to Overlay, but optimized?
__________________
By law and justice!

GMJCZP's Arsenal
Old 28th May 2021, 00:57   #13  |  Link
kedautinh12
Registered User
 
Join Date: Jan 2018
Posts: 2,163
Here:
https://github.com/Dogway/Avisynth-S...ools.avsi#L149
Old 28th May 2021, 07:03   #14  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,322
Quote:
Originally Posted by GMJCZP View Post
DogWay:

Do you have a function equivalent to Overlay, but optimized?
Overlay is basically many different filters under a common name.
Looking at the code: only 'blend', 'lighten' and 'darken' are optimized.
When there is a popular and frequently used mode _and_ affects scripts significantly with its slowness, probably I can implement a speedup.
Old 28th May 2021, 07:16   #15  |  Link
Reel.Deel
Registered User
 
Join Date: Mar 2012
Location: Texas
Posts: 1,671
VapourSynth's havsfunc Overlay script (which, if I'm not mistaken, mimics AviSynth's Overlay) uses Expr and MaskedMerge to do the work. It can probably be translated into AviSynth easily; it even includes some additional modes not available in AviSynth's Overlay.
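
To give an idea of how small such a translation can be for a single mode, here is a minimal sketch of a "multiply" blend written with AviSynth+'s internal Expr(). It is an illustration only, not the havsfunc or ex_blend() code: it assumes full-range planar RGB, has no mask or opacity handling, and the clips and names are placeholders.

```avisynth
# Sketch only: a "multiply" blend written with internal Expr() (planar RGB, full range assumed).
base = ColorBars(width=640, height=480).ConvertToPlanarRGB().ConvertBits(16)
over = base.FlipHorizontal()

# Scale the product back into range; the same expression is applied to every plane
Expr(base, over, "x y * range_max /")
```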
Old 28th May 2021, 09:10   #16  |  Link
pinterf
Quote:
Originally Posted by pinterf View Post
Note: mt_merge has a cplace parameter that defaults to "mpeg2", which - with luma = true - is slower than the dumb "mpeg1" choice. Could you try your benchmarks with cplace="mpeg1"? Regarding the other benchmarks, I'll run them as well, for example to see why mt_binarize is slower.
EDIT: Overlay multiply (largest speed difference): no wonder, there is no SIMD optimization there at all.
EDIT2: mt_invert and Avisynth Invert are SSE2 only. But there are only one or two instructions between load and store, which usually implies little or no gain.
Some years ago I actually implemented, for example, 8-bit binarize functions in AVX2, but I got zero speed gain, so I decided not to release them yet. Time to test those again on my i7-7700.
I looked into the mt_binarize benchmarks, because the processing itself is more processor-heavy with Expr and I did not understand why mt_binarize still came out slower.

What mt_binarize and the Expr-based ex_binarize have in common is that both read and store pixels.

What they are doing inside:

mt_binarize (16 bit data) has 2 operations:
- integer addition
- comparison.

Expr:
- Converts 16 bit pixels to 32 bit float (size doubled, using two registers instead of one)
- Compares with the limit (float comparison)
- Mask-blends either 0.0f or 65535.0f depending on the result.
- Converts back float data to 16 bits integer with rounding.
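
To make the two lists above concrete, here is a small Python sketch of both paths on a handful of 16-bit samples. It only mirrors the steps; it says nothing about speed, and Python floats are 64-bit rather than the 32-bit floats Expr uses. The function names, the strictly-greater-than convention and the choice of threshold are assumptions for illustration, and the integer path is simplified to a plain comparison.

```python
def binarize_int(pixels, threshold, peak=65535):
    """Integer path: a single comparison per 16-bit sample."""
    return [peak if p > threshold else 0 for p in pixels]


def binarize_via_float(pixels, threshold, peak=65535):
    """Float path, following the four Expr steps listed above."""
    out = []
    for p in pixels:
        x = float(p)                        # 1. integer -> float
        above = x > float(threshold)        # 2. float comparison with the limit
        y = float(peak) if above else 0.0   # 3. select 0.0 or 65535.0
        out.append(int(y + 0.5))            # 4. float -> integer with rounding
    return out


samples = [0, 1000, 32768, 32769, 65535]
int_result = binarize_int(samples, 32768)
float_result = binarize_via_float(samples, 32768)
```

Both paths give the same output for every sample; the float route just needs two conversions and wider data to get there.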

Well, this difference can be seen in the single-threaded benchmark results.

Quite interestingly, mt_binarize alone does so little work and is so fast that we had better not run synthetic benchmarks on it, or on similar filters in general (like mt_logic). I recommend testing them only embedded in a real script, as Dogway did when he provided benchmarks for whole scripts.

mt_binarize is a minimal-operation filter: a memory load, two register operations and a memory store.
Clearly it was hitting the memory bottleneck.
mt_binarize with no MT(!) is even a bit quicker than with any Prefetch value. This must be due to ruined caching and task switching/register saving overhead.

With Prefetch(4), mt_binarize combined with RemoveGrain was in the same ballpark as mt_binarize without RemoveGrain!

Tested on i7-7700, avs+ 3.7.work

Code:
#SetMaxCPU("SSE4.1")
Import("ExTools.avsi")
Colorbars(pixel_type = "YUV420P16")
mt_binarize()
#ex_binarize()
#RemoveGrain(1, -1)
#Prefetch(4) # 8
Data in fps. The values are approximate averages on my system; actual values fluctuate, but we can see the trends.

Code:
Prefetch mt_binarize ex_binarize x64_mt_bin x64_ex_bin
-           19000         7000       19100       6700
4           16000        16500       15900      16600
8           13000        13900       12600      13900
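
A back-of-the-envelope check of the numbers above, as a Python sketch. It assumes Colorbars at its default 640x480 size, 2 bytes per sample, and that only the luma plane is read and written once per frame (both assumptions, not measured). Only the first two table columns are used.

```python
# Assumed frame geometry: Colorbars default size, 16-bit samples, luma only.
WIDTH, HEIGHT, BYTES_PER_SAMPLE = 640, 480, 2
LUMA_BYTES = WIDTH * HEIGHT * BYTES_PER_SAMPLE

# fps values copied from the table above; None means no Prefetch.
FPS = {
    "mt_binarize": {None: 19000, 4: 16000, 8: 13000},
    "ex_binarize": {None: 7000, 4: 16500, 8: 13900},
}


def traffic_gb_per_s(fps, frame_bytes=LUMA_BYTES):
    """Bytes moved per second for one read plus one write of each frame."""
    return 2 * frame_bytes * fps / 1e9


def scaling(name, threads):
    """Speed relative to the same filter without Prefetch."""
    return FPS[name][threads] / FPS[name][None]


for name in FPS:
    print(name,
          "traffic without Prefetch: %.1f GB/s" % traffic_gb_per_s(FPS[name][None]),
          "Prefetch(4) scaling: %.2fx" % scaling(name, 4))
```

Under these assumptions single-threaded mt_binarize already moves roughly 23 GB/s, which is consistent with a filter limited by memory or cache traffic rather than by arithmetic. With Prefetch(4) it drops to about 0.84x, while ex_binarize gains about 2.36x and ends up at the same ceiling.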
Paired with a RemoveGrain after mt_/ex_binarize:

Code:
#SetMaxCPU("SSE4.1")
Import("ExTools.avsi")
Colorbars(pixel_type = "YUV420P16")
mt_binarize()
#ex_binarize()
RemoveGrain(1, -1)
#Prefetch(4) # 8
Code:
Prefetch mt_binarize ex_binarize x64_mt_bin x64_ex_bin   (all + RemoveGrain(1, -1))
-               8800        5000       8500       4500
4              16114       11400      16700      11200
Old 28th May 2021, 10:40   #17  |  Link
Dogway
Yes, that was a test I had in mind, because in real scripts my functions seem to perform slower, while in synthetic tests they are normally faster. I don't really understand why. One thing I thought, and the reason I asked, was that masktools2 converted to double-precision float, but here it works in 16-bit integer, as with mt_binarize in your example above. I would expect double float to carry a performance penalty, as you explained.

By the way, are those fps in the thousands? I get 530 fps, but I use a somewhat more realistic scenario (not totally synthetic): a 1080p source, loaded with DGSource and processed in 16-bit.

I crafted a small script, more like what happens within filters:

250fps
Code:
ConvertBits(16)
a=ex_binarize(68)
b=a.ex_invert()
ex_logic(a,b,mode="andn")
# removegrain(1,-1)
Prefetch(4)
215fps
Code:
ConvertBits(16)
a=mt_binarize(68)
b=a.mt_invert()
mt_logic(a,b,mode="andn")
# removegrain(1,-1)
Prefetch(4)
EDIT: By the way, I had two more algorithms for binarize, but they were the same speed or slower than the ternary one:
Code:
str = Format("x x {th} scaleb - x {th} scaleb + clip")
str = Format("x x {th} scaleb - * ")
Enabling removegrain (or disabling Prefetch) didn't make much of a difference. I also have to test ex_merge(), but the mask handling got on my nerves a bit, and it was the last thing I did when I was already burned out on ExTools.

For ex_blend I plan to add more blending modes, and likewise more ex_expand shapes, ex_edge modes, unsharp and so on. But I wanted to get the basics done first and improve the project later with fresher eyes.
__________________
i7-4790K@Stock::GTX 1070] AviSynth+ filters and mods on GitHub + Discussion thread

Last edited by Dogway; 28th May 2021 at 10:49.
Old 28th May 2021, 11:11   #18  |  Link
pinterf
I used the Colorbars source filter, that's why it is quicker.

Double (mt_lut) vs. float (Expr): the difference matters only where expressions must be evaluated, that is, in Expr and the mt_lut family.
When there is enough memory and mt_lut really uses a LUT, the slow calculations affect only the creation of the LUT tables.
But for a 16-bit lutxy there is not enough memory for a LUT (we'd end up with an 8 GB table), so masktools uses 'realtime' expression evaluation: it calculates the expression for each frame and for each pixel, in pure C code.
And that is very slow.
Expr evaluates in real time as well, but since it compiles the expression into SSE2/AVX2 machine code (it acts like a small compiler), it is quicker than realtime mt_lut by orders of magnitude.
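
The 8 GB figure follows from simple arithmetic, sketched here in Python. The helper name `lut_bytes` is made up for the example, and it assumes one output sample per input combination, stored in 1 byte for 8-bit and 2 bytes for deeper formats.

```python
def lut_bytes(bits, inputs):
    """Size in bytes of a full lookup table (hypothetical helper).

    bits   -- bit depth of each input clip
    inputs -- number of input clips (1 for lut, 2 for lutxy)
    """
    entries = (1 << bits) ** inputs            # one entry per input combination
    bytes_per_sample = 1 if bits <= 8 else 2   # assumed storage per output sample
    return entries * bytes_per_sample


for bits, inputs in [(8, 1), (8, 2), (16, 1), (16, 2)]:
    print(bits, "bit,", inputs, "input(s):", lut_bytes(bits, inputs), "bytes")
```

An 8-bit lutxy table is only 64 KiB and a 16-bit single-clip table is 128 KiB, but a 16-bit lutxy table needs 2^32 entries of 2 bytes each, which is 8 GiB.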

Usually the non-LUT masktools filters are heavily optimized and use integer arithmetic where the source is 8-16 bits.
Old 28th May 2021, 12:02   #19  |  Link
Dogway
Quote:
Originally Posted by pinterf View Post
Usually the non-LUT masktools filters are heavily optimized and use integer arithmetic where the source is 8-16 bits.
That makes sense now. I thought that applied only to the convolution types. I also noticed that chaining removegrain (or Dither_boxfilter) is very fast, whereas Expr() suffers a lot, probably due to what you explained. So processing in 32-bit float would tell a different story.
Old 28th May 2021, 13:20   #20  |  Link
GMJCZP
Quote:
Originally Posted by pinterf View Post
Overlay is basically many different filters under a common name.
Looking at the code: only 'blend', 'lighten' and 'darken' are optimized.
If a mode is popular and frequently used _and_ its slowness significantly affects scripts, I can probably implement a speedup.
Fortunately I am only interested in the blend mode; I should look into how to make ex_blend equivalent to Overlay.

Thanks Dogway for your contribution.

Last edited by GMJCZP; 28th May 2021 at 13:23.
Tags
avisynth, dogway, filters, hbd, packs

All times are GMT +1.