Old 19th May 2021, 18:24   #1  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 1,656
Dogway's Filters Packs


GitHub repo.


As others have done before, I decided to create a single thread, instead of dozens scattered around, to list and explain the updated filter packs from my avs+ modernization efforts.
I'm currently updating all my main and mod functions to modern avs+ syntax and expressions to add HBD support and reduce the performance hit. My goal is to eventually replace masktools2, removegrain, smoothadjust and some other basic building-block functions with performant Expr()-wrapped internal calls. I currently run several projects in parallel and they work fine at any bit depth (from 8-bit integer to 32-bit float).
Because of the syntax update, versions earlier than AviSynth+ 3.7.1 probably won't work properly, but this is a necessary evil to move things forward and make HBD filtering of HD or UHD sources something that is counted in hours, not days.

* Expr() is known to perform worse than masktools2 in 8-bit, and sometimes even in 16-bit. This is due to the lack of LUT-based calculations in 8-bit and the lack of AVX2 acceleration for convolutions ("pixel addressing"), so the work here anticipates future performance improvements in avs+.

UPDATE (12/12/2021): Thanks to pinterf, LUT calculations are now possible with Expr(). More benchmarking is needed, but 8-bit performance is now closer to the MaskTools2 counterparts, except for convolutions, which are still limited to the SSSE3 instruction set. Most filters, except for TransformsPack and StabilizationTools, are now mature enough for production use. If you find issues or bugs, please report them here or in the repo bug tracker.
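To illustrate why a LUT helps in 8-bit: with only 256 possible input values, an expression can be evaluated once per value and then applied by table lookup instead of once per pixel. A minimal Python sketch of the idea follows. It models the arithmetic only, not how AviSynth+ implements it, and `build_lut`, `apply_lut` and the gamma expression are made-up names for illustration.

```python
def build_lut(expr, bits=8):
    """Evaluate expr once for every possible pixel value (256 entries in 8-bit)."""
    peak = (1 << bits) - 1
    return [min(max(int(round(expr(v))), 0), peak) for v in range(peak + 1)]

def apply_lut(pixels, lut):
    """Per-pixel work is now a single table lookup."""
    return [lut[p] for p in pixels]

# Hypothetical expression: a gamma curve on 8-bit values
gamma = lambda v: 255.0 * (v / 255.0) ** 0.5

lut = build_lut(gamma)
frame = [0, 64, 128, 255]      # stand-in for the pixels of a frame
out = apply_lut(frame, lut)
```

The table costs 256 evaluations no matter how large the frame is, which is why the approach pays off mainly at low bit depths.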


ORIGINAL

ExTools: Wrapper library for Expr() expressions that covers (and expands) most masktools2 and removegrain functions, including lutspa and convolutions. Syntax and arguments are kept so it's easy to update old scripts to the new counterparts. This pack will be required for all the following scripts.

Transforms Pack: Modern color and tone response technical transforms functions for color managing AviSynth+. Goal is usability, functionality and accuracy, works over any bitdepth, supports any luma range, extra color spaces and color models among them real RGB based HSV, reversible YUV and YCoCg, OPP, IPT, OkLab and more.

Grade Pack: Look transforms. Includes ex_levels() with native HBD support (same usage as the native Levels()), ex_contrast(), ex_blend(), ex_glow(), ex_posterize(), greyscale_rgb(), WhitePoint(), Vignette(), Skin_Qualifier(), GamutWarning(), PseudoColor(), HSVxHSV() and ex_vibrance(), a saturation and vibrance function.

SMDegrain: Simple MDegrain. An easy to use, foolproof degraining wrapper for MDegrain and company. Initially a small few-line wrapper by Caroliano that I took over, implementing YUY2 support, interlaced support, 16-bit dither support, contrasharpening, prefiltering, debug view, documentation, globals, and good-practice code. Later on real.finder took over and adapted it to modern code, adding bugfixes, HBD support and so on. Now I have ported it to ExTools, sanitized the code, removed old avisynth, Dither and YUY2 support, and included some new features like alternative degrainers, low frequency restoration, DCT flicker, optimized UHD performance, BM3D prefiltering, and multi-scale retinex.

Resizers Pack: Pack of functions involving resizing operations, like RatioResize() which can resize by a single factor; percentage, adjust to width/height, to PAR, DAR and so on. PadResize() to crop or pad given input dimensions, PadBorders() like an advanced AddBorders() with option to mirror or dilate the borders, MatteCrop() to automatically fix (crop+resize) movies with random bordered shots and some utilities like mmod() to crop/add to mod, and nmod() to mod values with extra features like minimum value or bankers' rounding.

Masks Pack: High level mask filters. BoxMask(), FlatMask() and LumaMask().

Logo: Easily add static logos or watermarks, with blur, fade in/out, opacity, and blending controls. Eventually I will create the same for video-based logos.

Stabilization Tools Pack: Initially a simple mod of Stab() which grew bigger and currently includes various strategies for edge filling. Also includes FilmGateFix(), mainly aimed at anime sources; it needs an overhaul though.


EX/MIX MODS

QTGMC: Reference deinterlacer. Ported to ExTools from v3.382 (~45% faster in HBD)

GrainFactory3mod: One of the best if not the best regraining filter for AviSynth. Ported to ExTools.

LSFmod: Also one of the best sharpeners out there. Optimized (near +20%), ported to ExTools and added more features.

Sharpeners Pack: Collection of high quality sharpeners optimized and ported to ExTools for HBD support and performance: Adaptive Sharpen, FineSharp, SeeSaw, SeeSawMulti, SlopeBend, Contrasharpening, NonlinUSM, DetailSharpen, MedianSharp, SSSharpFaster, SSSharp, ReCon, blah, MedSharp, SharpenComplex2, ex_unsharp, NVSharpen, UnsharpMask_HBD, ex_XSharpen, psharpen, ex_CAS, CASm, and MultiSharpen.

FrameRateConverter: Great frame interpolation filter. Sanitized and ported to ExTools.

Deblock Pack: Pack containing different deblocking functions, from the famous Deblock_QED() (25% speed gain) to SmoothD2c, SmoothDeblock (WIP) and feisty2's Oyster (WIP) (Oyster also includes deringing and more)

Similarity Metrics: Pack containing all the similarity metric functions ported by Asd-g to AVS+ from WolframRhodium VapourSynth repo. I collected, sanitized and updated the code for x4 speed gain on GMSD, x2 on MDSI, x3 on vsSSIM and added+refactored BSSIM from zorr. For more metrics check the cost functions in ex_makediff() in ExTools.

yugefunc: Collection of VapourSynth filters ported to AviSynth+ and optimized on the way with ExTools and other expression tricks. Examples: ex_guidedblur(), ex_ANguidedblur(), SmoothGrad() (WIP), BMAFilter() (WIP), etc

Other: Some other scripts have also received the ExTools treatment, among them FillMissing, FastLineDarkenMOD, SPresso, DeStripe, etc.


ExTools functions:
Code:
# EXPRESSIONS    
ex_lut()         - Write single variable (one  clip)  expressions
ex_lutxy()       - Write double variable (two  clips) expressions
ex_lutxyz()      - Write triple variable (3    clips) expressions
ex_lutxyza()     - Write quadruple variable (4 clips) expressions
ex_makediff()    - Clip based differentiation. Also calculates similarity/residual metrics via cost functions
ex_adddiff()     - Sum clips, especially useful to add back the result of differentiation
ex_makeadddiff() - ex_makediff() and ex_adddiff() in one step
ex_logic()       - Performs logical operations between 2 clips with logic gates (MIN, MAX, OR, AND, etc)
ex_merge()       - Merging. Performs a linear interpolation between 2 clips based on mask (3rd clip)
ex_clamp()       - Clamps first clip between the maximum of the second clip and the minimum of the third
ex_binarize()    - Performs binary type segmentation or thresholding
ex_athres()      - Adaptive Threshold. Special binary thresholding for uneven brightness images (ie. extracting letters from a shaded area)
ex_invert()      - Invert the clip pixel values
ex_lutspa()      - Relative or absolute pixel-location based expressions
ex_motion()      - Computes a very primitive motion mask akin to MaskTools2 mt_motion()
                 
# MORPHOLOGICAL  
ex_expand()      - Morphological dilation or expansion of pixel-value based on structuring element given by the kernel window
ex_inpand()      - Morphological erosion or contraction of pixel-value based on structuring element given by the kernel window
ex_inflate()     - Expansion via outward blurring given structuring element of pixel values of the kernel window
ex_deflate()     - Contraction via inward blurring given structuring element of pixel values of the kernel window
ex_hitormiss()   - Structuring elements based morphological transforms for binary images
ex_edge()        - Gradient magnitude. Edge detection via (partial) local derivatives
ex_corner()      - Harris corner detection filter (WIP)
ex_luts()        - Moving window relative pixel-location based expressions. A convolution do-it-all filter
ex_shape()       - Helper filter for ex_luts() (and other expression based filters) to fetch kernel-window pixels into a string
                 
# BLURS          
ex_boxblur()     - Local neighbourhood discrete distance based blur convolutions
ex_blur()        - Gaussian (or Butterworth) weighted blur convolutions
ex_gaussianblur()- Optimized Gaussian filter for large sigma
ex_kawase()      - Kawase optimized blur filter (still slower than ex_gaussianblur() ). Accepts different strides so good for exponential blur
ex_blur3D()      - Spatio-temporal blur filter
ex_bilateral()   - Bilateral blur filter (respects edges)
ex_smartblur()   - Like Bilateral filter but more performant (mimics Photoshop's Surface Blur)
ex_smooth()      - Savitzky-Golay smoothing filter. Halfway between blur and antialiasing
ex_FluxSmoothT() - Minimum change between a temporal weighted blur and temporal median. Informal port of FluxSmoothT filter via Didée's description
ex_FluxSmoothST()- Spatio-Temporal minimum change between weighted blur and median. Uses ex_FluxSmoothT() and its spatial equivalent ex_MinBlur()
ex_median()      - Median (rank order) based blur filtering. Also includes some alternative mean average algorithms
ex_repair()      - Median (rank order) based repair filter
STTWM()          - Spatio-Temporal Thresholded Weighted Median (STPresso() inspired / not a port)
__________________
[i7-4790K@Stock::GTX 1070] AviSynth+ filters and mods on GitHub

Last edited by Dogway; 14th January 2022 at 03:06.
Old 19th May 2021, 20:33   #2  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 1,656
Benchmarks for 1080p 16-bit

ExTools

Code:
DGSource("1080psrc.dgi")
ConvertBits(16)

# mt_makediff()                   is  5% slower   than ex_makediff()
# mt_adddiff()                    is  5% slower   than ex_adddiff()
# mt_logic(mode="and")            is  6% slower   than ex_logic(mode="and")
# mt_merge(luma=true,U/V=3)       is  8% faster   than ex_merge(luma=true, UV=3) (5% slower when Y clips/masks)
# mt_clamp()                      is  6% slower   than ex_clamp()
# mt_binarize()                   is 12% slower   than ex_binarize()
# mt_invert()                     is 10% slower   than ex_invert()
#      invert(channels="Y")       is 17% slower   than ex_invert()
# mt_lutspa()                     is  0% slower   than ex_lutspa()
# mt_luts()*                      is 96% slower   than ex_luts() *tested with mt_luts( c, mode="max", pixels=mt_square( 1 ), expr="x y - abs") 
# Overlay(mode="multiply")        is 44% slower   than ex_blend(mode="multiply")
# Overlay_MTools(mode="multiply") is  8% slower   than ex_blend(mode="multiply")
# OverlayPlus(a, mode="multiply") is  1% slower   than ex_blend(mode="multiply")

# mt_expand()                     is  6% faster   than ex_expand()
# mt_inpand()                     is  5% faster   than ex_inpand()
# mt_deflate()                    is 14% faster   than ex_deflate()
# mt_inflate()                    is 14% faster   than ex_inflate()
# mt_edge()*                      is 14% faster   than ex_edge() *but much slower in "free" kernel mode

Prefetch(4)
That said, there are a few exceptions: for example, mt_expand(mode=mt_circle(zero=true, radius=2)) is much, much slower than ex_expand(2,mode="circle"), and in the same vein ex_expand(2) can be the same speed as or faster than mt_expand().mt_expand().
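The comparison between a single radius-2 expand and two chained radius-1 expands makes sense because, for a square window, both give the same picture. A small Python sketch of that equivalence follows. It assumes a plain square max filter with the window clipped at the frame borders; the real filters' border handling and the `expand` helper here are assumptions, not the plugins' code.

```python
def expand(img, radius=1):
    """Square-window dilation: each pixel becomes the max of its
    (2r+1)x(2r+1) neighbourhood, with the window clipped at the borders."""
    h, w = len(img), len(img[0])
    return [[max(img[j][i]
                 for j in range(max(0, y - radius), min(h, y + radius + 1))
                 for i in range(max(0, x - radius), min(w, x + radius + 1)))
             for x in range(w)]
            for y in range(h)]

img = [[0, 0, 0, 0, 0, 0],
       [0, 9, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 3],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 7, 0, 0]]

twice = expand(expand(img, 1), 1)   # two chained 3x3 expands
once = expand(img, 2)               # one 5x5 expand
same = (twice == once)              # identical output for square windows
```

The two results match, so any speed difference comes from doing the work in one pass over the frame instead of two. This does not hold for a circle-shaped window, where chaining small windows does not reproduce the larger shape.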

Code:
# Bilateral Blur
#
# 100% Dither_bilateral16(radius=2, thr=10, flat=1.0,   u=1, v=1) (216fps) # Output is dirtier though
#  77% ex_bilateral(1,dejaggie=false)
#  59% vsTBilateral(diameterY=3, sdevY=4, idevY=4.0,    u=1, v=1)
#  43% TBilateral(3,3,chroma=false)                                        # only supports 8-bits
#  23% bilateral(sigmaSY=1, sigmaRY=0.02, algorithmY=2, u=1, v=1)

Prefetch(6)

# Variable Box Blur
#
# 100% removegrain(20,-1) (485fps)
#  91% ex_boxblur(1,mode="mean",UV=1)
#  90% MiniDeen(radiusY=1, thrY=255, u=1,v=1) # crumbles from rad=3 onwards
#  89% neo_MiniDeen(radiusY=1, thrY=255, u=1,v=1)
#  80% mt_inflate().mt_deflate()              # mean blur approximation
#  70% ex_blur(1.5,n=300,mode="butterworth")
#  69% blur(1.58)
#  67% generalconvolution(matrix="1 1 1 1 1 1 1 1 1",chroma=false)
#  65% Dither_box_filter16(2,U=1,V=1)   # with ConverttoStacked() and ConvertfromStacked()
#  44% mt_convolution("1 1 1","1 1 1",U=1,V=1)
#  13% SpatialSoften(1,30,0)                   # 8-bit YUY2 only, thresholded. Prefetch(8)
#   5% mt_luts(last, "avg", mt_square(1), "y",chroma="-1")

# Variable Gaussian Blur (binomial fitted)
#
# 100% removegrain(12,-1) (486fps)           # technically a binomial weighted mean of [1 2 1]
#  97% GBlur2(sqrt(1)/2. * sqrt(2),chroma=2) #         only in 8-bit. weighted mean of [1 2 1]
#  90% ex_boxblur(1,mode="weighted",UV=1)    # binomial weighted mean
#  88% ablur(1, 1, chroma=1)                 # against ex_boxblur(2,mode="weighted",UV=1)
#  85% BinomialBlur(sqrt(1)*0.707,U=1,V=1)   # only in 8-bit
#  81% vsTCanny(sqrt(1)*0.707,mode=-1,u=1,v=1) # true gaussian blur (fastest for mid size sigma)
#  70% ex_blur(1,mode="binomial",UV=1)       # true gaussian blur
#  68% blur(1.00)                            # weighted mean of [1 2 1]
#  65% generalconvolution(matrix="1 2 1 2 4 2 1 2 1",chroma=false)
#  44% mt_convolution("1 2 1","1 2 1",U=1,V=1)
#  23% GBlur(rad=1,sd=0.9,u=false,v=false)
#  11% FastBlur(sqrt(1)*0.707,gamma=false)
#  11% GaussianBlur(0.53,U=1,V=1)            # only in 8-bit

Prefetch(4)


Script mods

CPU: i7-4790K (Stock Clock)
GPU: GTX 1070

Prefetch(6)
LSFmod.v2.193: 45.4fps
LSFmod.v4.1ex: 49.2fps
LSFmod.v4.1mix: 53.1fps
Code:
LSFmod(preset="slow",strength=200,edgemode=0,soothe=true,ss_x=1.0,ss_y=1.0)
Prefetch(4)
GrainFactory3mod: 58fps
GrainFactory3mod mix: 60fps
Code:
str=1.25
size=1.2
GrainFactory3mod(g1str=6.0*str,g2str=8.0*str,g3str=5.5*str,g1size=1.20*size,g2size=1.50*size,g3size=1.40*size,g1cstr=0.9,g2cstr=0.9,g3cstr=0.9,temp_avg=1)
Prefetch(4) - mvtools2 is the bottleneck so the faster the preset the bigger the performance gain
FrameRateConverter 2.0 (only supports 8-bit): 5.6fps
FrameRateConverter 2.2 mix (in 8-bit): 5.1fps (5.7 with luma_rebuild=false)
FrameRateConverter 2.2 mix (in 16-bit): 5.0fps
Code:
RequestLinear(50)
FrameRateConverter(Preset="slow",FrameDouble=true)
Prefetch(6)
SMDegrain v3.1.2.111s: 3.300fps
NotSMDegrain v3.1.2.116s: 3.000fps (slowdown due to Contrasharpening() )
SMDegrain v3.4.0d: 15.1fps
Code:
SMDegrain(tr=2,thSAD=400,contrasharp=true,refinemotion=true)
Prefetch(8) (720x576 clip)
QTGMC 3.382s: 23 fps (8-bit) 10.0 fps (16-bit)
QTGMC 3.71mx: 23 fps (8-bit) 18.5 fps (16-bit)
Code:
QTGMC(tr2=3,preset="very slow",Lossless=2,sourcematch=3,sharpness=0.2,MatchEnhance=0.0,MatchPreset="Slow", MatchPreset2="Slow",border=true,threads=4)
QTGMC 3.382s: 54.7 fps (8-bit) 31.5 fps (16-bit)
QTGMC 3.71mx: 58.0 fps (8-bit) 43.2 fps (16-bit)
Code:
QTGMC(thsad1=300,blocksize=8,TR0=1,TR1=1,TR2=0,EZKeepGrain=1.0,NoiseDeint="Generate",StabilizeNoise=true,border=true,chromamotion=false,threads=4)
QTGMC 3.382s: 89 fps (8-bit) 42.1 fps (16-bit)
QTGMC 3.71mx: 90 fps (8-bit) 61.8 fps (16-bit)
Code:
QTGMC(tr2=2,preset="slow",border=false,threads=4)
QTGMC 3.382s: 15.4 fps (8-bit) 12.3 fps (16-bit)
QTGMC 3.71mx: 13.7 fps (8-bit) 13.0 fps (16-bit)
Code:
QTGMC(tr2=2,preset="very slow",SVThin=0.5,EZKeepGrain=2.0,NoisePreset="slower",Sharpness=0.7,tuning="DV-SD",border=true,threads=4)
QTGMC 3.382s: 72.6 fps (8-bit) 38 fps (16-bit)
QTGMC 3.71mx: 70.4 fps (8-bit) 53.5 fps (16-bit)
Code:
QTGMC(tr2=2,preset="very slow",SVThin=0.5,EZKeepGrain=2.0,NoisePreset="slower",Sharpness=0.7,tuning="DV-SD",border=true,threads=4)
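For reference, the 16-bit gains implied by the QTGMC figures above can be worked out directly. A small Python sketch follows; the helper name `speedup_pct` is made up here, and the fps values are the ones listed above.

```python
def speedup_pct(old_fps, new_fps):
    """Relative speed gain of new over old, in percent."""
    return (new_fps / old_fps - 1.0) * 100.0

# (QTGMC 3.382s fps, QTGMC 3.71mx fps) at 16-bit, in the order listed above
runs_16bit = [(10.0, 18.5), (31.5, 43.2), (42.1, 61.8), (12.3, 13.0), (38.0, 53.5)]

gains = [round(speedup_pct(old, new), 1) for old, new in runs_16bit]
for (old, new), g in zip(runs_16bit, gains):
    print(f"{old:5.1f} -> {new:5.1f} fps: {g:+.1f}%")
```

With these figures the 16-bit gain ranges from about 6% to 85% depending on the settings.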


ex_median(), ex_bilateral() with Prefetch(6)
Code:
100.0% vertical  Prefetch(4)   (440fps)
 97.5% undot     Prefetch(4)
 96.8% cartoon   Prefetch(4)
 94.3% edgeS     Prefetch(4)
 91.1% verticalS Prefetch(4)
 90.5% medianT   Prefetch(4)
 84.5% median
 84.3% SixNN     Prefetch(4)
 82.2% PML
 81.8% edgeC
 80.9% undot3    Prefetch(4)
 80.7% undot2    Prefetch(4)
 80.7% midsum
 79.8% EMF
 77.0% medianT5  Prefetch(4)
 76.7% GaussT5
 76.6% IQM
 75.2% ML3D
 75.0% edgeW
 75.0% winsor
 73.4% trimean
 72.0% edgeCL
 71.4% smart
 70.5% SNN
 70.2% CAM
 69.1% CWM
 67.6% CWM2
 66.5% AWM
 52.6% MMF
 52.2% PWM
 49.3% WMF
 44.1% ML3Dex
 38.0% bilateral
 36.1% Hybrid
 33.2% STWM
 33.2% kuwahara
 31.1% BDM
 29.0% unblob3D
 28.9% DGM5
 28.1% TL3D
 26.4% DGM3
 26.0% DGM2
 25.9% unblob3
 25.9% DGM1
 25.7% DGM4
 25.5% median5
 24.8% DGM0
 21.5% medianST
 20.2% trimean5
 19.9% median7o
 19.1% smart2
 18.5% AMF
 18.2% IQM5
 17.3% winsor5
 16.8% medianSTS
 16.1% GaussST5
  8.1% median7   Prefetch(8)
  6.3% smart3    Prefetch(8)
ex_median() comparison chart | source (for toggle comparison)


ex_blur(), ex_blur3D(), ex_boxblur(), ex_smooth(), ex_kawase()
Code:
100.0% rg19 (448fps)
 99.1% bokeh2
 98.7% kawase lin
 97.8% weighted
 97.8% mean
 86.4% kawase2 lin
 86.2% bokeh
 78.3% SNN
 77.0% rg192
 75.9% mean2
 75.7% smooth
 75.4% weighted2
 75.4% blur
 72.8% smartblur
 71.9% smooth2
 71.4% smooth  sharp
 71.4% blur2
 67.0% smooth2 sharp
 64.3% smartblur2
 60.7% trimmed
 60.7% weighted3D
 52.5% mean3D
 37.1% ex_fluxsmoothST
blurs comparison chart | source (for toggle comparison)


ex_edge() with default thresholds
Code:
100.0% mt_sobel (460fps)
 99.3% tritical
 97.0% cartoon
 96.3% hotdog
 95.0% kayyali
 94.6% laplace
 91.1% hprewitt
 89.8% SGDD
 89.1% min/max
 89.1% sobel5
 88.3% roberts
 88.0% max
 87.0% qprewitt
 87.0% LoG
 86.5% TEdge
 85.2% frei-chen
 84.9% kroon
 84.6% prewitt
 84.1% sobel
 84.1% farid
 84.0% pscharr
 83.3% scharr
 81.7% robinson
 79.8% SGDD7
 78.9% DoG       Prefetch(6)
 71.3% Std       Prefetch(6)
 62.2% kirsch    Prefetch(6)
 56.0% DoB       Prefetch(6)
 50.4% farid5    Prefetch(6)
 49.8% SG        Prefetch(6)
 47.4% FDoG      Prefetch(6)
ex_edge() comparison chart



Sharpeners Pack
Code:
fps   rel.   filter
360   100 %  ex_XSharpen()
360   100 %  ex_CAS(1)
344   95.5%  UnsharpMask_HBD(128*n,1,0) Prefetch(4)
315   87.5%  DetailSharpen()
312   86.7%  ex_unsharp()          Prefetch(4)
273   75.8%  NonlinUSM()
188   52.2%  pSharpen()
148   41.1%  CASm()
143   39.7%  SharpenComplex2()
106   29.4%  NVSharpen()           Prefetch(8)
101   28.1%  FineSharp()
97    26.9%  ex_ContraSharpening(a)
93    25.8%  LSFmod(preset="fast")
74    20.6%  SlopeBend()
55.4  15.4%  SeeSaw(a)
37    10.3%  LSFmod(preset="medium")
33     9.2%  SSSharpFaster()
24.9   6.9%  ReCon()
23.6   6.6%  LSFmod(preset="slow")
21     5.8%  MedianSharp()
18.5   5.1%  Adaptive_sharpen(1.0) Prefetch(8) (32-bit)
14.5   4.0%  MedSharp()
11.7   3.3%  blah()                Prefetch(4)
1.9    0.5%  SSSharpEX()           Prefetch(4)

Last edited by Dogway; 10th January 2022 at 17:46.
Old 20th May 2021, 06:15   #3  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,119
Quote:
Originally Posted by Dogway View Post
Dogway's Filters Packs
* Expr() is known to perform worse than masktools2 in 8-bit, and sometimes even in 16-bit. This is due to the lack of LUT-based calculations in 8-bit and the lack of AVX2 acceleration for convolutions ("pixel addressing"), so the work here anticipates future performance improvements in avs+.
I'm so glad that someone is finally using pixel addressing. I did it purely out of curiosity. The acceleration part took me many weeks to implement; it became so complex that I wanted to remove it. Like a game-theory problem: do I lose more by giving up and dropping two weeks' work, or by working on it further?

I have to mention that no AVX2 code is used when pixel addressing appears in the Expr expression, due to the complexity of implementing it for 32-byte AVX registers.

Expr-based basic LUTs in Avisynth are on my roadmap; there are already hints in my source, and I'm planning to continue work on that topic later.

Masktools2 uses 64-bit doubles internally, while Expr uses only 32-bit floats.
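As a rough illustration of what the 32-bit versus 64-bit difference means for precision (it says nothing about speed), here is a Python sketch that rounds values to 32-bit float with the standard `struct` module. The helper name `to_float32` is made up for this example.

```python
import struct

def to_float32(x):
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Every 16-bit pixel value survives the trip through a 32-bit float unchanged
all_16bit_exact = all(to_float32(float(v)) == float(v) for v in range(65536))

# Above 2**24 a 32-bit float can no longer represent every integer
big = float(2**24 + 1)          # 16777217, exact as a double
lost = to_float32(big) != big   # the 32-bit float rounds it to a neighbour
```

So 16-bit pixel values themselves fit exactly in a 32-bit float, while larger intermediates, such as the product of two 16-bit values, may be rounded where a double would keep them exact.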
Old 20th May 2021, 07:32   #4  |  Link
kedautinh12
Registered User
 
Join Date: Jan 2018
Posts: 930
Thanks, pinterf
Old 20th May 2021, 08:39   #5  |  Link
real.finder
Registered User
 
Join Date: Jan 2012
Location: Mesopotamia
Posts: 2,381
Quote:
Originally Posted by pinterf View Post
I'm so glad that someone is finally using pixel addressing. I did it purely out of curiosity. The acceleration part took me many weeks to implement; it became so complex that I wanted to remove it. Like a game-theory problem: do I lose more by giving up and dropping two weeks' work, or by working on it further?

I have to mention that no AVX2 code is used when pixel addressing appears in the Expr expression, due to the complexity of implementing it for 32-byte AVX registers.

Expr-based basic LUTs in Avisynth are on my roadmap; there are already hints in my source, and I'm planning to continue work on that topic later.

Masktools2 uses 64-bit doubles internally, while Expr uses only 32-bit floats.
I have used it before, if you mean this https://github.com/realfinder/AVS-St...sing.avsi#L447, and also to imitate https://web.archive.org/web/20150602.../dotcrawl.html in here
__________________
See My Avisynth Stuff
Old 20th May 2021, 19:35   #6  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 1,656
pinterf: Thanks to you. Very useful tools, and so is the array implementation. From what I could see these new features are under-utilized, so I put them to good use in some of the packs, sanitizing every convoluted script I can find.

I know that to some, doing this in avs scripting might look foolish, but IMO it democratizes the code, makes it more liquid and promotes avs+ development. While it might currently underperform in certain situations (lutspa, pixel addressing, etc.), it levels out as Expr expressions improve, as can be seen in the benchmarks. It can only get better, I hope.

Yes, I saw that mt_xxpand got AVX2 a year ago, while Expr is still on SSSE3 for pixel addressing. I have no programming skills, but I can imagine why, mainly because Expr is much more powerful and hence harder to optimize.

Last edited by Dogway; 20th May 2021 at 23:05.
Old 22nd May 2021, 09:36   #7  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 1,656
So I've been working like mad on ExTools and it's almost at v1.0 final. Only ex_edge() is left from my planned ports (and ex_clamp, which still isn't a 1:1 replica).

Today I added kernel-iterated gaussian and box blur functions implemented as separable 2x1D kernels. They run pretty fast; I'm just wondering if I should add a multiplier/divisor to make the sigma steps more granular.
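For readers unfamiliar with the 2x1D idea: when a 2D kernel is the outer product of a 1D kernel with itself, one horizontal pass followed by one vertical pass gives the same result as the full 2D convolution, with fewer taps per pixel. A minimal Python sketch under that assumption follows. The helper names and the edge-replication border handling are choices made for this example, not ExTools code.

```python
def clamp(i, n):
    """Edge replication: indices outside the frame reuse the border pixel."""
    return min(max(i, 0), n - 1)

def conv1d_rows(img, k):
    """Horizontal pass."""
    r, h, w = len(k) // 2, len(img), len(img[0])
    return [[sum(k[j + r] * img[y][clamp(x + j, w)] for j in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def conv1d_cols(img, k):
    """Vertical pass."""
    r, h, w = len(k) // 2, len(img), len(img[0])
    return [[sum(k[i + r] * img[clamp(y + i, h)][x] for i in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def conv2d(img, k):
    """Direct 2D convolution with the outer product of k with itself."""
    r, h, w = len(k) // 2, len(img), len(img[0])
    return [[sum(k[i + r] * k[j + r] * img[clamp(y + i, h)][clamp(x + j, w)]
                 for i in range(-r, r + 1) for j in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

img = [[10, 20, 30, 40], [0, 50, 0, 50], [7, 7, 7, 7]]
k = [1, 2, 1]   # binomial weights, left unnormalized to keep integer math

separable = conv1d_cols(conv1d_rows(img, k), k)   # 3 + 3 taps per pixel
direct = conv2d(img, k)                           # 9 taps per pixel
```

Both results are identical. For a radius-r kernel the separable form needs 2(2r+1) taps per pixel instead of (2r+1)^2, so the saving grows with the radius.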

Later I also want to optimize the other kernels where they are separable, and make iterators for them so that radius works properly. Since I already made them for blurs, they should be easier to port. Once done, I'll switch back to Transforms Pack for v1.0 final.

In the long run I want to fiddle with DotCrawl convolutions, add more edge detection kernels, and make some Unsharps with them.

Quote:
Originally Posted by pinterf View Post
Expr based basic luts in Avisynth: they are on my roadmap...
Thanks for the work, looking forward to it. There are many situations where HBD isn't needed, like frame interpolation or deinterlacing.

Quote:
Originally Posted by pinterf View Post
Masktools2 is using internally 64 bit doubles while Expr is using only 32 bit floats.
Does that make a difference in performance? Haven't run tests without use_expr>0

Last edited by Dogway; 22nd May 2021 at 09:40.
Old 27th May 2021, 14:21   #8  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 1,656
ExTools v1.0 final is released. It now also supports 32-bit float.
It's generally faster than masktools2, except when the Expr() "pixel addressing" feature is used, as in convolutions.
In 8-bit it's still slower than masktools2, but pinterf is currently working on it.

I don't know if "pixel addressing" is ever going to get AVX2 acceleration, but if it does it might be possible for my ports to exceed masktools2 speed, or at least match it, leaving the masktools2 dependency behind and working only with internal code, as it ideally should be.

Aside from the masktools counterparts I also created a few functions like ex_blend(), which replaces Overlay(), ex_undot() for removegrain(1), ex_boxblur() for removegrain(19), and ex_blur() for removegrain(12) and blur().

Therefore my script mods come in two versions, split into folders: EX mods and MIX mods. EX mods are future-proof, using ExTools wrappers, and MIX mods use masktools2 convolutions and removegrain to maximize speed.

See updated benchmarks in second post.
Old 27th May 2021, 14:36   #9  |  Link
kedautinh12
Registered User
 
Join Date: Jan 2018
Posts: 930
Thanks for the hard work, Dogway, but I think you could just focus on the EX mods; for MIX mods we can use real.finder's scripts, which do the same as yours
Old 27th May 2021, 14:47   #10  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 1,656
As you can see in the benchmarks, many functions are faster than the masktools2 calls in real.finder's mods. Overlay() especially comes to mind, which is uber slow, but I also do some optimizations beyond 1:1 ports. I don't plan to port everything, just my most used scripts, so I take special care. I'm also cleaning the code, removing old compatibility support, formatting, and so on.

I might switch ex_merge() back to mt_merge() for MIX mods; something is going on in there, but I haven't had much time to debug.

From now on I will resume TransformsPack to release a v1.0 final soon, focused on SDR color spaces.

Last edited by Dogway; 27th May 2021 at 14:49.
Old 27th May 2021, 20:10   #11  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,119
Note: mt_merge has a cplace parameter, default "mpeg2", which - with luma = true - is slower than the dumb "mpeg1" choice. Could you try your benchmarks with cplace="mpeg1"? Regarding the other benchmarks, I'll look into them as well, for example why mt_binarize is slower.
EDIT: Overlay multiply (largest speed difference): no wonder, there is no SIMD optimization there at all.
EDIT2: mt_invert and Avisynth Invert are SSE2 only. But there is only a single instruction or two between load and store, which usually implies little or no gain.
Actually, some years ago I implemented, for example, 8-bit binarize functions in AVX2, but I got zero speed gain so I decided that they wouldn't go live yet. Time to test those again on my i7-7700.

Last edited by pinterf; 27th May 2021 at 21:28.
Old 28th May 2021, 00:05   #12  |  Link
GMJCZP
Registered User
 
Join Date: Apr 2010
Location: I have a statue in Hakodate, Japan
Posts: 737
Quote:
Originally Posted by Dogway View Post
As you can see in the benchmarks, many functions are faster than the masktools2 calls in real.finder's mods. Overlay() especially comes to mind, which is uber slow, but I also do some optimizations beyond 1:1 ports. I don't plan to port everything, just my most used scripts, so I take special care. I'm also cleaning the code, removing old compatibility support, formatting, and so on.
DogWay:

Do you have a function equivalent to Overlay, but optimized?
__________________
By law and justice!

GMJCZP's Arsenal
Old 28th May 2021, 00:57   #13  |  Link
kedautinh12
Registered User
 
Join Date: Jan 2018
Posts: 930
Here:
https://github.com/Dogway/Avisynth-S...ools.avsi#L149
Old 28th May 2021, 07:03   #14  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,119
Quote:
Originally Posted by GMJCZP View Post
DogWay:

Do you have a function equivalent to Overlay, but optimized?
Overlay is basically many different filters under a common name.
Looking at the code: only 'blend', 'lighten' and 'darken' are optimized.
When there is a popular and frequently used mode _and_ affects scripts significantly with its slowness, probably I can implement a speedup.
Old 28th May 2021, 07:16   #15  |  Link
Reel.Deel
Registered User
 
Join Date: Mar 2012
Location: Texas
Posts: 1,332
VapourSynth's havsfunc Overlay script (which, if I'm not mistaken, mimics AviSynth's Overlay) uses Expr and MaskedMerge to do the work. It could probably be translated to AviSynth easily; it even includes some additional modes not available in AviSynth's Overlay.
Old 28th May 2021, 09:10   #16  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,119
Quote:
Originally Posted by pinterf View Post
Note: mt_merge has a cplace parameter, default "mpeg2", which - with luma = true - is slower than the dumb "mpeg1" choice. Could you try your benchmarks with cplace="mpeg1"? Regarding the other benchmarks, I'll look into them as well, for example why mt_binarize is slower.
EDIT: Overlay multiply (largest speed difference): no wonder, there is no SIMD optimization there at all.
EDIT2: mt_invert and Avisynth Invert are SSE2 only. But there is only a single instruction or two between load and store, which usually implies little or no gain.
Actually, some years ago I implemented, for example, 8-bit binarize functions in AVX2, but I got zero speed gain so I decided that they wouldn't go live yet. Time to test those again on my i7-7700.
I was checking the issue with the mt_binarize benchmarks, because the processing itself is more processor-heavy when using Expr and I did not understand why mt_binarize is still slower.

What mt_binarize and the Expr-based ex_binarize have in common is that they read and store pixels.

What they are doing inside:

mt_binarize (16 bit data) has 2 operations:
- integer addition
- comparison.

Expr:
- Converts 16 bit pixels to 32 bit float (size doubled, using two registers instead of one)
- Compares with the limit (float comparison)
- Mask-blends either 0.0f or 65535.0f depending on the result.
- Converts back float data to 16 bits integer with rounding.

Well, this difference can be seen in the single-threaded benchmark results.
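The two paths described above can be sketched per pixel in Python. This is only a model of the steps listed, not the plugins' SIMD code: the function names are made up, the "greater than threshold gives maximum" direction is an assumption, and the integer addition mentioned for mt_binarize is not modeled because its purpose isn't stated here.

```python
def binarize_int16(px, thr):
    """Integer path: stays in 16-bit integers, essentially one comparison.
    (The extra integer addition mentioned above is not modeled.)"""
    return 65535 if px > thr else 0

def binarize_via_float(px, thr):
    """Expr-style path, following the four steps listed above."""
    f = float(px)                          # 1. 16-bit integer -> float
    above = f > float(thr)                 # 2. float comparison with the limit
    blended = 65535.0 if above else 0.0    # 3. mask-blend 0.0f or 65535.0f
    return int(round(blended))             # 4. float -> 16-bit integer, rounded

thr = 32768                                # hypothetical threshold for the demo
samples = [0, 1, 32767, 32768, 32769, 65535]
int_out = [binarize_int16(p, thr) for p in samples]
float_out = [binarize_via_float(p, thr) for p in samples]
```

Both paths produce the same output; the Expr-style one simply does two extra conversions and works on data twice as wide, which is the extra cost being discussed.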

Quite interestingly, since it does almost nothing, mt_binarize alone is so fast that we had better not run any synthetic benchmark on it, or on similar filters in general (like mt_logic). I recommend testing them only when embedded in a real script (as Dogway did when providing benchmarks for whole scripts).

mt_binarize is a minimal-operation filter: a memory load + two register operations + a memory store.
Clearly it was hitting the memory bottleneck.
mt_binarize with no MT(!) is even a bit quicker than with any Prefetch value. This must be due to ruined caching and task switching/register saving overhead.

mt_binarize combined with RemoveGrain was in the same ballpark with Prefetch(4) as without RemoveGrain!

Tested on i7-7700, avs+ 3.7.work

Code:
#SetMaxCPU("SSE4.1")
Import("ExTools.avsi")
Colorbars(pixel_type = "YUV420P16")
mt_binarize()
#ex_binarize()
#RemoveGrain(1, -1)
#Prefetch(4) # 8
Data in fps. The values are approximate averages on my system; actual values fluctuate, but we can see the trends.

Code:
Prefetch mt_binarize ex_binarize x64_mt_bin x64_ex_bin
-           19000         7000       19100       6700
4           16000        16500       15900      16600
8           13000        13900       12600      13900
Paired with a RemoveGrain after mt_/ex_binarize:

Code:
#SetMaxCPU("SSE4.1")
Import("ExTools.avsi")
Colorbars(pixel_type = "YUV420P16")
mt_binarize()
#ex_binarize()
RemoveGrain(1, -1)
#Prefetch(4) # 8
Code:
Prefetch mt_binarize ex_binarize x64_mt_bin x64_ex_bin + RemoveGrain(1, -1)
-           8800          5000       8500       4500
4           16114        11400       16700      11200
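
As a rough sanity check on the memory-bottleneck reading of the first table (about 19000 fps single-threaded for mt_binarize), the data moved per second can be estimated. This is a back-of-the-envelope sketch: it assumes Colorbars was left at its default 640x480 size (the script does not set one) and that the filter reads and writes every frame exactly once. The helper names are made up for the example.

```python
def yuv420_frame_bytes(width, height, bytes_per_sample):
    """Size of one planar YUV 4:2:0 frame, ignoring row padding."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)  # U and V at half size each way
    return (luma + chroma) * bytes_per_sample


def traffic_gb_per_s(fps, frame_bytes):
    """Bytes loaded plus bytes stored per second, in GB/s (10**9 bytes)."""
    return 2 * fps * frame_bytes / 1e9


frame = yuv420_frame_bytes(640, 480, 2)  # YUV420P16: 2 bytes per sample
print(frame, "bytes per frame")
print(round(traffic_gb_per_s(19000, frame), 1), "GB/s of loads and stores")
```

A frame this small may well stay in the CPU cache, so the result describes load/store traffic rather than measured DRAM bandwidth.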
Old 28th May 2021, 10:40   #17  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 1,656
Yes, that was a test I had planned, because in real scripts my functions seem to perform slower, while in synthetic tests they are normally faster. I don't really understand why. One thing I thought, and that's why I asked, was that masktools2 converts to double float, but here it works in 16-bit integer, as with mt_binarize in your example above. I would expect double float to carry a performance penalty, as you explained.

By the way, are those fps in the thousands? I get 530fps, but I use a somewhat more realistic scenario (not totally synthetic): a 1080p source, loaded with DGSource and processed in 16-bit.

I crafted a small script, more like what happens within filters:

250fps
Code:
ConvertBits(16)
a=ex_binarize(68)
b=a.ex_invert()
ex_logic(a,b,mode="andn")
# removegrain(1,-1)
Prefetch(4)
215fps
Code:
ConvertBits(16)
a=mt_binarize(68)
b=a.mt_invert()
mt_logic(a,b,mode="andn")
# removegrain(1,-1)
Prefetch(4)
EDIT: By the way, I had two more algos for binarize, but they were the same speed or slower than the ternary version:
Code:
str = Format("x x {th} scaleb - x {th} scaleb + clip")
str = Format("x x {th} scaleb - * ")
Enabling removegrain (or disabling Prefetch) didn't make much of a difference. I also have to test ex_merge(), but the mask handling got on my nerves a bit, and it was the last thing I did when I was already burned out on ExTools.

For ex_blend, I plan to add more blending modes; the same goes for ex_expand shapes, ex_edge modes, unsharp and so on. But I wanted to get the basics done first and improve the project later with fresher eyes.
__________________
[i7-4790K@Stock::GTX 1070] AviSynth+ filters and mods on GitHub

Last edited by Dogway; 28th May 2021 at 10:49.
Old 28th May 2021, 11:11   #18  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,119
I use the Colorbars source filter; that's why it is quicker.

Double (mt_lut) vs. float (Expr): the difference matters only where expressions must be evaluated, that is, in Expr and the mt_lut family.
When there is enough memory and mt_lut really uses a LUT, the slow calculations affect only the creation of the lut tables.
But for a 16-bit lutxy there is no memory for a lut (we'd end up with an 8GB table), so masktools uses 'realtime' expression evaluation: it calculates the expression for each frame and for each pixel, in pure C code.
And that is very slow.
Expr calculates in realtime as well, but since it compiles the expression into SSE2/AVX2 machine code (it acts like a small compiler), it is quicker than realtime mt_lut by orders of magnitude.
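
The 8GB figure for a 16-bit lutxy table can be checked with a few lines of arithmetic. The sketch below assumes one table entry per combination of input values, stored at the sample width (1 byte for 8-bit, 2 bytes for 16-bit). The helper name is made up, and the actual masktools2 table layout may differ.

```python
def lut_bytes(bits_per_sample, num_inputs):
    """Bytes needed for a full lookup table over num_inputs clips."""
    bytes_per_entry = 1 if bits_per_sample <= 8 else 2  # assumed entry width
    entries = (1 << bits_per_sample) ** num_inputs
    return entries * bytes_per_entry


for bits, inputs, name in [(8, 1, "lut"), (8, 2, "lutxy"),
                           (16, 1, "lut"), (16, 2, "lutxy")]:
    print(f"{bits:2d}-bit {name:5s}: {lut_bytes(bits, inputs):>13,d} bytes")
```

Under these assumptions an 8-bit lutxy table takes 64 KiB and a 16-bit single-input table 128 KiB, while the 16-bit two-input table needs 2^33 bytes (8 GiB), so it cannot be precomputed in practice.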

Usually non-lut masktools filters are optimized heavily and use integer where the source is 8-16 bits.
Old 28th May 2021, 12:02   #19  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 1,656
Quote:
Originally Posted by pinterf
Usually non-lut masktools filters are optimized heavily and use integer where the source is 8-16 bits.
That makes sense now. I thought it was only the convolution types. I also noticed that chaining removegrain (or Dither_boxfilter) is very fast, whereas Expr() suffers a lot, probably due to what you explained. So processing in 32-bit float would tell another story.
__________________
[i7-4790K@Stock::GTX 1070] AviSynth+ filters and mods on GitHub
Old 28th May 2021, 13:20   #20  |  Link
GMJCZP
Registered User
 
Join Date: Apr 2010
Location: I have a statue in Hakodate, Japan
Posts: 737
Quote:
Originally Posted by pinterf
Overlay is basically many different filters under a common name.
Looking at the code: only 'blend', 'lighten' and 'darken' are optimized.
When there is a popular and frequently used mode _and_ affects scripts significantly with its slowness, probably I can implement a speedup.
Fortunately I am only interested in the blend mode; I should check how to match ex_blend with Overlay.

Thanks Dogway for your contribution.
__________________
By law and justice!

GMJCZP's Arsenal

Last edited by GMJCZP; 28th May 2021 at 13:23.
Tags
avs+, dogway, filters, hbd, packs
