19th May 2021, 18:24 | #1 | Link |
Registered User
Join Date: Nov 2009
Posts: 2,367
|
Dogway's Filters Packs
GitHub repo. If you are new to AviSynth, refer to this post to get things up and running.
TIP: For AviSynth+ front-end GUIs such as MeGui/RipBot, follow these suggestions to update to the latest AVS+ version: for MeGui and RipBot.
----
Like others before me, I decided to create one single thread to list and explain my updated filter packs and AVS+ modernization efforts. The main goal of the updates is to replace redundant, outdated or slow functions with modern alternatives, often with more features such as HBD support (32-bit float included), frame properties and improved performance, among others.

Basic building block functions like those in masktools2, RgTools and smoothadjust have been replaced with internal Expr() wrappers; this allows liquid and easily editable code for others to inspect, debug or branch. Additionally, higher level filters have also been created or ported, like those in SharpenersPack or GradePack (read below).

For performance reasons many expressions have also seen major refactors. Because of this and the modern AVS+ syntax updates, any version earlier than v3.7.3 probably won't work properly, but this is a necessary evil to move things forward and make HBD filtering of HD|UHD sources something that is counted in hours and not days. Special thanks to pinterf for his continuous work on AVS+.

ORIGINAL

ExTools: Wrapper library for Expr() expressions that covers (and expands) most masktools2 and removegrain functions, including lutspa and convolutions. It also adds Array helper functions that expand those included internally, and function approximations like 'atan', 'expr'... that are faster than the internal ones. Syntax and arguments are kept, so it's easy to update old scripts to the new counterparts. This pack is required for all the following scripts. Post about STTWM() (and initial release), Adaptive Threshold, ex_bilateral() (vs Dither_bilateral16() ), ex_shape() GIF. Morphological mask filtering.

Transforms Pack: Divided into 3 parts: Main, Models, Transfers.
Modern color and tone response technical transform functions for color managing in AviSynth+. The goal is usability, functionality and accuracy. It works over any bitdepth and supports any luma range, extra color spaces and color models, among them real RGB based HSV, reversible YUV and YCoCg, IPT, OkLab, ICtCp, IPTPQc2 and more. It also includes a SoftLimiter() and building block matrix functions. Example converting an ACEScg exr to Rec709-1886 with gamut compression, tonemapping and filmic contrast. Example converting a Dolby Vision IPTPQc2 (DVp5 or DVp8) clip to Rec709-1886 with gamut compression and tonemapping.

Grade Pack: Look transforms. Includes ex_levels() with native HBD support (same usage as native Levels() ), ex_autolevels(), ex_contrast(), ex_blend(), ex_glow(), ex_posterize(), greyscale_rgb(), FindTemp(), WhitePoint(), Vignette(), Skin_Qualifier(), GamutWarning(), PseudoColor(), GreyWorld(), HSVxHSV() and ex_vibrance(), a saturation and vibrance function.

SMDegrain: Simple MDegrain Mod. Easy to use, foolproof degraining wrapper for MDegrain and company. Initially a small wrapper of a few lines by Caroliano, which I took over and extended with YUY2 support, interlaced support, 16-bit dither support, contrasharpening, prefiltering, debug view, documentation, globals, and good practice code. Later on real.finder took over and adapted it to modern code, added bugfixes, HBD support and so on. Now I have ported it to ExTools, sanitized the code, removed old AviSynth support, Dither support and YUY2 support, and included some new features like alternative degrainers, recursion, low frequency restoration (here too), DCT flicker, optimized UHD performance, ex_DGDenoise() and ex_BM3D() prefilters, and multi-scale retinex. Here are some explanations on iterative temporal filtering. In this post, a draft for a SAD sampler. Recommendations for heavy grain prefiltering. 16mm film restoration (+here). Original 2011 SMDegrain thread.
Resizers Pack: Pack of functions involving resizing operations, like deep_resize() (and here), a refactored port of nnedi3_resize16() at six times the speed and higher quality; nnedi3resize(), an nnedi3 based arbitrary size scaler; RatioResize(), which can resize by a single factor: percentage, adjust to width/height, to PAR, DAR and so on; PadResize() to crop or pad given input dimensions; PadBorders(), like an advanced AddBorders()+Crop() with the option to mirror, dilate or fill borders; MatteCrop() to automatically fix (crop+resize) movies with random bordered shots; and some utilities like mmod() to crop/pad/auto to mod, and nmod() to mod values with extra features like min value or bankers' rounding.

Masks Pack: Mask and limiter filters. BoxMask(), FlatMask(), LumaMask(), CornerMask() (cheap alternative here) and MotionMask() for masks, and ex_limitchange(), ex_limitdif() and Soothe() for limiters.

Scenes Pack: SceneStats() opens the door to scene based workflows. It writes the current (frame) scene range bounds into the '_SceneRange' frame property and the current scene change into '_SceneChangePrev', as well as scene motion into '_SceneMotion', scene details into '_SceneDetails' (a complexity index for average of edges), the scene exposure index into '_SceneExposure' and pixel stats into '_SceneStats', on the fly or by offloading to a file. ReadStats() can load an optionally exported SceneStats() stats file for faster processing at the encoding stage. ClipStats() will otherwise load them and convert them to clip global stats, to help you decide on better clip-wide constant settings for your filters. Example for SMDegrain. Example for FilmGrainPlus.

FilmGrain+: Made from the ground up, an accurate and performant synthetic film grain filter with presets for the most common negative films.

Logo: Easily add static logos or watermarks, with blur, fade in/out, opacity, and blending controls. Eventually also for video based logos.
Stabilization Tools Pack (legacy): Initially a simple mod of Stab() which grew bigger and currently includes various strategies for edge filling. Also includes FilmGateFix(), mainly aimed at anime sources.

EX/MIX MODS

Normally there are 2 flavors of each mod: EX mods are future proof, with ExTools wrappers and minimal dependencies; this can also come in handy when running on Linux/macOS, which very few plugins support. MIX mods use carefully chosen masktools2 and removegrain functions to maximize speed, but come with these and probably other dependencies as well.

QTGMC+: Reference deinterlacer. Ported to ExTools from v3.382 (~40% faster in HBD). Includes ex_vinverse() (now legacy for int bitdepths), ex_bob() and ex_reduceflicker() functions.

LSFplus: Based on LSFmod, also one of the best sharpeners out there. Optimized (+74% with no SS), ported to ExTools, with more features added.

GradFun3plus: Port of cretindesalpes' excellent GradFun3 debanding filter to internal AVS+ calls (+66% gain in smode=0).

Sharpeners Pack: Collection of high quality sharpeners optimized and ported to ExTools for HBD support and performance. In total 29 sharpeners, among them: Adaptive Sharpen, ex_unsharp, CASP, NVSharpen, ex_ContraSharpening, SeeSaw, FineSharpPlus, NonlinUSM, ReCon, blah and Plum.

Deblock Pack: Pack containing different deblocking functions, from the famous Deblock_QED() (29% speed gain) to CCD(), SmoothD2c(), SmoothDeblock() (WIP) and feisty2's Oyster (Oyster also includes deringing and more).

Similarity Metrics: Pack containing all the similarity/distance metrics ported by Asd-g to AVS+ from the WolframRhodium VapourSynth repo. I collected, sanitized and updated the code for a x4 speed gain on GMSD(), x2 on MDSI() and x3 on vsSSIM(), and added and refactored BSSIM() from zorr. I also created SVM(), a metric for image sharpness. For more metrics check the cost functions in ex_makediff() in ExTools.
yugefunc: Collection of VapourSynth filters ported to AVS+ and optimized along the way with ExTools and other expression tricks: ex_guidedblur(), ex_ANguidedblur(), XDoG() (WIP), etc.

Other: Some other scripts have received the ExTools treatment: FillMissing(), FastLineDarkenPlus(), SPresso(), DeStripe(), etc.

ExTools main functions:
Code:
# EXPRESSIONS
ex_lut()          - Single variable (1 clip) expressions
ex_lutxy()        - Double variable (2 clips) expressions
ex_lutxyz()       - Triple variable (3 clips) expressions
ex_lutxyza()      - Quadruple variable (4 clips) expressions
ex_makediff()     - Clip based differentiation. Also calculates similarity/residual metrics via cost functions
ex_adddiff()      - Sum clips, especially useful to add back the result of differentiation
ex_makeadddiff()  - ex_makediff() and ex_adddiff() in one step
ex_logic()        - Logical operations between 2 clips with logic ops (MIN, MAX, OR, AND, etc)
ex_merge()        - Merging. Performs a linear interpolation between 2 clips based on mask (3rd clip)
ex_clamp()        - Clamps first clip between the maximum of the second clip and the minimum of the third
ex_binarize()     - Performs binary type segmentation or thresholding
ex_athres()       - Adaptive Threshold. Special binary thresholding for uneven brightness images (ie. extracting letters from a shaded area)
ex_invert()       - Invert the clip pixel values
ex_lutspa()       - Relative or absolute pixel-location based expressions
ex_motion()       - Computes a very primitive motion mask akin to MaskTools2's mt_motion()
ex_hysteresis()   - Proof of concept Expr() port of mt_hysteresis(). Uses 'for' loops so very slow

# MORPHOLOGICAL
ex_expand()       - Morphological dilation/expansion of pixel-value based on structuring element given by the kernel window
ex_inpand()       - Morphological erosion/contraction of pixel-value based on structuring element given by the kernel window
ex_inflate()      - Expansion via outward blurring given structuring element of pixel values of the kernel window
ex_deflate()      - Contraction via inward blurring given structuring element of pixel values of the kernel window
ex_hitormiss()    - Structuring elements based morphological transforms for binary images
ex_edge()         - Gradient magnitude. Edge detection via (partial) local derivatives
ex_luts()         - Moving window relative pixel-location based expressions. A convolution do-it-all filter
ex_shape()        - Helper filter for ex_luts() (and other expression based filters) to fetch kernel-window pixels into a string

# BLURS
ex_boxblur()      - Discrete local neighborhood blur convolutions
ex_blur()         - Gaussian (or Butterworth) weighted blur convolutions
ex_gaussianblur() - Optimized Gaussian filter for large sigma
ex_kawase()       - Kawase optimized blur filter (still slower than ex_gaussianblur() ). Accepts different strides so good for exponential blur
ex_blur3D()       - Spatio-temporal blur filter
ex_bilateral()    - Bilateral blur filter (respects edges)
ex_smartblur()    - Like Bilateral filter but more performant (mimics Photoshop's Surface Blur)
ex_smooth()       - Savitzky-Golay smoothing filter. Halfway between blur and antialiasing
ex_FluxSmoothT()  - Minimum change between a temporal weighted blur and temporal median. Informal port of FluxSmoothT filter via Didée's description
ex_FluxSmoothST() - Spatio-Temporal minimum change between weighted blur and median. Uses ex_FluxSmoothT() and its spatial equivalent ex_MinBlur()
ex_median()       - Median (rank order) based blur filtering. Also includes some alternative mean average algorithms
ex_repair()       - Median (rank order) based repair filter
STTWM()           - Spatio-Temporal Thresholded Weighted Median (STPresso() inspired / not a port)

Last edited by Dogway; 1st November 2023 at 01:24. |
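The function list describes ex_merge() as a linear interpolation between two clips driven by a mask. As a rough illustration of that idea only (not ExTools code), here is a minimal Python sketch on plain integer samples. The names merge_sample() and merge_row() are hypothetical, and the rounding rule is an assumption; the actual filter may scale or round differently.

```python
def merge_sample(a: int, b: int, mask: int, bits: int = 8) -> int:
    """Blend two integer samples: mask == 0 returns a, mask == peak returns b."""
    peak = (1 << bits) - 1  # 255 for 8-bit, 65535 for 16-bit
    # Weighted sum of both samples; adding peak // 2 rounds to nearest (assumed rule).
    return (a * (peak - mask) + b * mask + peak // 2) // peak


def merge_row(row_a, row_b, row_mask, bits: int = 8):
    """Apply merge_sample() to every pixel of a row."""
    return [merge_sample(a, b, m, bits) for a, b, m in zip(row_a, row_b, row_mask)]


if __name__ == "__main__":
    # Mask 0 keeps the first clip, 255 takes the second, 128 lands in between.
    print(merge_row([10, 10, 10], [200, 200, 200], [0, 128, 255]))  # [10, 105, 200]
```

The same formula works for 16-bit samples by passing bits=16, which is the kind of bitdepth scaling an HBD-aware wrapper has to take care of.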
19th May 2021, 20:33 | #2 | Link |
Registered User
Join Date: Nov 2009
Posts: 2,367
|
Benchmarks for 1080p 16-bit
ExTools Code:
DGSource("1080psrc.dgi")
ConvertBits(16)

# mt_makediff()                   is  5% slower than ex_makediff()
# mt_adddiff()                    is  5% slower than ex_adddiff()
# mt_logic(mode="and")            is  6% slower than ex_logic(mode="and")
# mt_merge(luma=true,U/V=3)       is  8% faster than ex_merge(luma=true, UV=3) (5% slower when Y clips/masks)
# mt_clamp()                      is  6% slower than ex_clamp()
# mt_binarize()                   is 12% slower than ex_binarize()
# mt_invert()                     is 10% slower than ex_invert()
# invert(channels="Y")            is 17% slower than ex_invert()
# mt_lutspa()                     is  0% slower than ex_lutspa()
# mt_luts()*                      is 96% slower than ex_luts()   *tested with mt_luts( c, mode="max", pixels=mt_square( 1 ), expr="x y - abs")
# Overlay(mode="multiply")        is 44% slower than ex_blend(mode="multiply")
# Overlay_MTools(mode="multiply") is  8% slower than ex_blend(mode="multiply")
# OverlayPlus(a, mode="multiply") is  1% slower than ex_blend(mode="multiply")
# mt_expand()                     is  6% faster than ex_expand()
# mt_inpand()                     is  5% faster than ex_inpand()
# mt_deflate()                    is 14% faster than ex_deflate()
# mt_inflate()                    is 14% faster than ex_inflate()
# mt_edge()*                      is 14% faster than ex_edge()   *but much slower in "free" kernel mode

Prefetch(4)
Code:
# Bilateral Blur #
# 100% Dither_bilateral16(radius=2, thr=10, flat=1.0, u=1, v=1) (216fps) # Output is dirtier though
#  77% ex_bilateral(1,dejaggie=false)
#  59% vsTBilateral(diameterY=3, sdevY=4, idevY=4.0, u=1, v=1)
#  43% TBilateral(3,3,chroma=false) # only supports 8-bits
#  23% bilateral(sigmaSY=1, sigmaRY=0.02, algorithmY=2, u=1, v=1) Prefetch(6)

# Variable Box Blur #
# 100% removegrain(20,-1) (485fps)
#  91% ex_boxblur(1,mode="mean",UV=1)
#  90% MiniDeen(radiusY=1, thrY=255, u=1,v=1) # crumbles from rad=3 onwards
#  89% neo_MiniDeen(radiusY=1, thrY=255, u=1,v=1)
#  80% mt_inflate().mt_deflate() # mean blur approximation
#  70% ex_blur(1.5,n=300,mode="butterworth")
#  69% blur(1.58)
#  67% generalconvolution(matrix="1 1 1 1 1 1 1 1 1",chroma=false)
#  65% Dither_box_filter16(2,U=1,V=1) # with ConverttoStacked() and ConvertfromStacked()
#  44% mt_convolution("1 1 1","1 1 1",U=1,V=1)
#  13% SpatialSoften(1,30,0) # 8-bit YUY2 only, thresholded. Prefetch(8)
#   5% mt_luts(last, "avg", mt_square(1), "y",chroma="-1")

# Variable Gaussian Blur (binomial fitted) #
# 100% removegrain(12,-1) (486fps) # technically a binomial weighted mean of [1 2 1]
#  97% GBlur2(sqrt(1)/2. * sqrt(2),chroma=2) # only in 8-bit. weighted mean of [1 2 1]
#  90% ex_boxblur(1,mode="weighted",UV=1) # binomial weighted mean
#  88% ablur(1, 1, chroma=1) # against ex_boxblur(2,mode="weighted",UV=1)
#  85% BinomialBlur(sqrt(1)*0.707,U=1,V=1) # only in 8-bit
#  81% vsTCanny(sqrt(1)*0.707,mode=-1,u=1,v=1) # true gaussian blur (fastest for mid size sigma)
#  70% ex_blur(1,mode="binomial",UV=1) # true gaussian blur
#  68% blur(1.00) # weighted mean of [1 2 1]
#  65% generalconvolution(matrix="1 2 1 2 4 2 1 2 1",chroma=false)
#  44% mt_convolution("1 2 1","1 2 1",U=1,V=1)
#  23% GBlur(rad=1,sd=0.9,u=false,v=false)
#  11% FastBlur(sqrt(1)*0.707,gamma=false)
#  11% GaussianBlur(0.53,U=1,V=1) # only in 8-bit

Prefetch(4)

Script mods
CPU: i7-4790K (Stock Clock)
GPU: GTX 1070

Prefetch(6)
LSFmod.v2.193: 45.4fps
LSFmod.v6.0ex: 60.0fps
LSFplus.v6.0mix: 61.8fps
Code:
LSFplus(preset="slow",strength=200,edgemode=0,soothe=true,ss_x=1.0,ss_y=1.0)

Prefetch(4)
GrainFactoryLite: 57fps (96fps @8-bit)
GrainFactory3mod EX: 68fps (84fps @8-bit)
Prefetch(8)
FilmGrain: 52fps (75fps @8-bit)
Prefetch(8)
FilmGrain+: 38fps (54fps @8-bit) ('gamma' mode)
FilmGrain+: 37fps (50fps @8-bit) ('log' mode)
Code:
str=1.25
size=1.2
GrainFactory3mod(size=1,g1str=6.0*str,g2str=8.0*str,g3str=5.5*str,g1size=1.20*size,g2size=1.50*size,g3size=1.40*size,g1cstr=0.9,g2cstr=0.9,g3cstr=0.9,temp_avg=1)
or
FilmGrain(size=1.1,str=9,cstr=0.5,coarse=4.0,conv=false)
or
FilmGrainPlus(size=1.5,str=0.8,lo=1,mid=1,hi=1,sharpness=0.8,mode="gamma")

SMDegrain v3.1.2.111s: 3.300fps (slowdown due to Contrasharpening() )
Prefetch(8)
SMDegrain v4.3.0d: 15.1fps
Code:
SMDegrain(tr=2,thSAD=400,contrasharp=true,refinemotion=true)

Note: QTGMC+ v4.0 is the last version comparable to older ones. Later versions use different default core deinterlacers for higher detail preservation.

Prefetch(8) (720x576 clip)
QTGMC 3.382s: 23 fps (8-bit) 10.0 fps (16-bit)
QTGMC+ 4.00p: 22 fps (8-bit) 18.0 fps (16-bit)
Code:
QTGMCp(tr2=3,preset="very slow",Lossless=2,sourcematch=3,sharpness=0.2,MatchEnhance=0.0,MatchPreset="Slow", MatchPreset2="Slow",border=true,threads=4)

QTGMC+ 4.00p: 55.4 fps (8-bit) 40.0 fps (16-bit)
Code:
QTGMCp(thsad1=300,blocksize=8,TR0=1,TR1=1,TR2=0,EZKeepGrain=1.0,NoiseDeint="Generate",StabilizeNoise=true,border=true,chromamotion=false,threads=4)

QTGMC+ 4.00: 87 fps (8-bit) 60.0 fps (16-bit)
Code:
QTGMCp(tr2=2,preset="slow",border=false,threads=4)

QTGMC+ 4.00p: 13.2 fps (8-bit) 12.3 fps (16-bit)
Code:
QTGMCp(tr2=2,preset="very slow",SVThin=0.5,EZKeepGrain=2.0,NoisePreset="slower",Sharpness=0.7,tuning="DV-SD",border=true,threads=4)

QTGMC+ 4.00p: 73.3 fps (8-bit) 53.5 fps (16-bit)
Code:
QTGMCp(TR2=3,TR0=1,TR1=1, Preset="Slower", InputType=1, sharpness=0)

ex_median(), ex_bilateral() with Prefetch(6)
Code:
100.0% vertical Prefetch(4) (440fps)
 97.5% undot Prefetch(4)
 97.0% undot6 Prefetch(4)
 96.8% cartoon Prefetch(4)
 94.3% edgeS Prefetch(4)
 91.1% verticalS Prefetch(4)
 90.5% medianT Prefetch(4)
 84.5% median
 84.3% SixNN Prefetch(4)
 82.2% PML
 81.8% edgeC
 80.9% undot3 Prefetch(4)
 80.7% undot2 Prefetch(4)
 80.7% midsum
 79.8% EMF
 77.0% medianT5 Prefetch(4)
 76.7% GaussT5
 76.6% IQM
 75.2% ML3D
 75.0% edgeW
 75.0% winsor
 73.4% trimean
 72.0% edgeCL
 71.4% smart
 70.5% SNN
 70.2% CAM
 69.1% CWM
 67.6% CWM2
 66.5% AWM
 52.6% MMF
 52.2% PWM
 49.3% WMF
 48.2% IQMST
 44.1% ML3Dex
 38.0% bilateral
 36.1% Hybrid
 33.2% STWM
 33.2% kuwahara
 31.1% BDM
 29.0% unblob3D
 28.9% DGM5
 28.1% TL3D
 26.4% DGM3
 26.0% DGM2
 25.9% unblob3
 25.9% DGM1
 25.7% DGM4
 25.5% median5
 24.8% DGM0
 21.5% medianST
 20.2% trimean5
 19.9% median7o
 19.1% smart2
 18.5% AMF
 18.2% IQM5
 17.3% winsor5
 16.8% medianSTS
 16.1% IQMV
 16.1% GaussST5
  8.1% median7 Prefetch(8)
  6.3% smart3 Prefetch(8)
..........

ex_blur(), ex_blur3D(), ex_boxblur(), ex_smooth(), ex_kawase()
Code:
100.0% rg19 (448fps)
 99.1% bokeh2
 98.7% kawase lin
 97.8% weighted
 97.8% mean
 86.4% kawase2 lin
 86.2% bokeh
 78.3% SNN
 77.0% rg192
 75.9% mean2
 75.7% smooth
 75.4% weighted2
 75.4% blur
 72.8% smartblur
 71.9% smooth2
 71.4% smooth sharp
 71.4% blur2
 67.0% smooth2 sharp
 64.3% smartblur2
 60.7% trimmed
 60.7% weighted3D
 52.5% mean3D
 37.1% ex_fluxsmoothST
..........

ex_edge() with default thresholds and Prefetch(4)
Code:
100.0% mt_sobel (460fps)
 99.3% tritical
 97.0% cartoon
 96.3% hotdog
 95.0% kayyali
 94.6% laplace
 91.1% hprewitt
 89.8% SGDD
 89.1% min/max
 89.1% sobel5
 88.3% roberts
 88.0% max
 87.0% qprewitt
 87.0% LoG
 86.5% TEdge
 85.2% frei-chen
 84.9% kroon
 84.6% prewitt
 84.1% sobel
 84.1% farid
 84.0% pscharr
 83.3% scharr
 81.7% robinson
 79.8% SGDD7
 78.9% DoG Prefetch(6)
 71.3% Std Prefetch(6)
 62.2% kirsch Prefetch(6)
 56.0% DoB Prefetch(6)
 50.4% farid5 Prefetch(6)
 49.8% SG Prefetch(6)
 47.4% FDoG Prefetch(6)

Sharpeners Pack
Code:
360    100 %   XSharpenPlus()
355    98.6%   CASP(1)
347    96.4%   UnsharpMask_HBD(128*n,1,0) Prefetch(4)
339    94.2%   DGSharpen2()
317    88.1%   ex_unsharp() Prefetch(4)
310    86.1%   DetailSharpen()
272    75.6%   NonlinUSM()
250    69.5%   FineSharpPlus()
178    49.4%   pSharpen()
160    44.4%   RSharpen()
150    41.7%   LSFplus(preset="LSF")
148    41.1%   CASm()
145    40.3%   SharpenComplex2()
106    29.4%   NVSharpen() Prefetch(8)
 97    26.9%   ex_ContraSharpening(a)
 79    21.9%   SlopeBend()
 60    16.7%   LSFplus(preset="fast")
 50.6  13.1%   DelicateSharp()
 43.2  12.0%   LSFplus(preset="medium")
 33     9.2%   SSSharpFaster()
 27     7.5%   LSFplus(preset="slow")
 26.5   7.4%   SeeSaw(a)
 24.9   6.9%   ReCon()
 21     5.8%   MedianSharp()
 18.5   5.1%   Adaptive_sharpen(1.0) Prefetch(8) (32-bit)
 14.5   4.0%   MedSharp()
 11.7   3.3%   blah() Prefetch(4)
  1.9   0.5%   SSSharpEX() Prefetch(4)
  0.22  0.06%  RegularSharp()
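The percentage column in these tables appears to be each filter's speed relative to the fastest entry, assuming the first column of the sharpeners table is frames per second. A small Python sketch of that arithmetic, using a few rows copied from the table above; relative_speed() is a hypothetical helper name and the one-decimal rounding is an assumption.

```python
# A few (filter, fps) rows taken from the Sharpeners Pack table.
fps_by_filter = {
    "XSharpenPlus()": 360,
    "CASP(1)": 355,
    "UnsharpMask_HBD(128*n,1,0)": 347,
    "DGSharpen2()": 339,
    "ex_unsharp()": 317,
    "DetailSharpen()": 310,
    "NonlinUSM()": 272,
}


def relative_speed(table):
    """Express every fps value as a percentage of the fastest entry."""
    fastest = max(table.values())
    return {name: round(100.0 * fps / fastest, 1) for name, fps in table.items()}


if __name__ == "__main__":
    for name, pct in relative_speed(fps_by_filter).items():
        print(f"{pct:5.1f}%  {name}")
```

Since the posted fps values are themselves rounded, a recomputed percentage can differ from the listed one in the last digit for some rows.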
__________________
i7-4790K@Stock::GTX 1070] AviSynth+ filters and mods on GitHub + Discussion thread Last edited by Dogway; 15th April 2023 at 11:40. |
20th May 2021, 06:15 | #3 | Link | |
Registered User
Join Date: Jan 2014
Posts: 2,322
|
Quote:
I have to mention that no AVX2 code is used when pixel addressing is used in the Expr expression, due to the complexity of the implementation code for AVX 32 byte registers. Expr based basic luts in Avisynth: they are on my roadmap, there are hints in my source already, and I'm planning to continue the work on that topic later. Masktools2 internally uses 64 bit doubles while Expr uses only 32 bit floats. |
|
20th May 2021, 08:39 | #5 | Link | |
Registered User
Join Date: Jan 2012
Location: Mesopotamia
Posts: 2,590
|
Quote:
__________________
See My Avisynth Stuff |
|
20th May 2021, 19:35 | #6 | Link |
Registered User
Join Date: Nov 2009
Posts: 2,367
|
pinterf: Thanks to you. Very useful tools, also the array implementation. I do think these new features are under-utilized as far as I can see, so I put them to good use in some of the packs, sanitizing every convoluted script I can find.
I know that to some, doing this in avs scripting might look foolish, but IMO it democratizes the code, makes it more liquid and promotes avs+ development. That's why, while it might currently underperform in certain situations (lutspa, pixel addressing, etc), it levels out with the improvements in the Expr expressions, as can be seen in the benchmarks. It can only get better, I hope. Yes, I saw mt_xxpand got AVX2 a year ago, while Expr is on SSSE3 for pixel addressing. I have no programming skills but I can imagine why, mainly because Expr is much more powerful and hence harder to optimize.
Last edited by Dogway; 20th May 2021 at 23:05. |
22nd May 2021, 09:36 | #7 | Link |
Registered User
Join Date: Nov 2009
Posts: 2,367
|
So I've been working like mad on ExTools and it's almost reaching v1.0 final. Only ex_edge() is left from my planned ports (and ex_clamp, which still isn't a 1:1 replica).
Today I added kernel iterated gaussian and box blur functions as separated 2x1D kernels. They run pretty fast; I'm just wondering if I should add a multiplier/divisor to make the sigma steps more granular. Later I also want to optimize the other kernels in case they are separable, and make iterators for them so that radius works properly. Since I already made them for the blurs, it should be easier to port. Once done I will switch back to Transforms Pack for v1.0 final. In the long run I want to fiddle with DotCrawl convolutions, add more edge detection kernels, and make some Unsharps with them. Thanks for the work, looking forward to it; there are many situations where HBD isn't needed, like frame interpolation or deinterlacing. Does that make a difference in performance? I haven't run tests without use_expr>0.
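The "separated 2x1D kernels" mentioned above rely on the fact that some 2D kernels factor into a horizontal and a vertical 1D pass. As an illustration only (not ExTools code), this Python sketch checks that two passes of the [1 2 1] binomial kernel give the same result as the 3x3 matrix "1 2 1 2 4 2 1 2 1" that appears in the benchmarks. The function names and the edge-replication border handling are assumptions made for the sketch.

```python
def clamp_index(i, n):
    """Replicate the border pixel when the kernel reaches outside the image."""
    return max(0, min(n - 1, i))


def blur_h(img, k):
    """Horizontal 1D convolution (unnormalized integer sums)."""
    r, h, w = len(k) // 2, len(img), len(img[0])
    return [[sum(k[i + r] * img[y][clamp_index(x + i, w)] for i in range(-r, r + 1))
             for x in range(w)] for y in range(h)]


def blur_v(img, k):
    """Vertical 1D convolution (unnormalized integer sums)."""
    r, h, w = len(k) // 2, len(img), len(img[0])
    return [[sum(k[j + r] * img[clamp_index(y + j, h)][x] for j in range(-r, r + 1))
             for x in range(w)] for y in range(h)]


def blur_2d(img, k2):
    """Full 2D convolution with a square kernel (unnormalized integer sums)."""
    r, h, w = len(k2) // 2, len(img), len(img[0])
    return [[sum(k2[j + r][i + r] * img[clamp_index(y + j, h)][clamp_index(x + i, w)]
                 for j in range(-r, r + 1) for i in range(-r, r + 1))
             for x in range(w)] for y in range(h)]


kernel = [1, 2, 1]
# Outer product of the 1D kernel with itself: [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
kernel2d = [[a * b for b in kernel] for a in kernel]

image = [[0, 0, 0, 0],
         [0, 16, 0, 0],
         [0, 0, 32, 0],
         [0, 0, 0, 0]]

separable = blur_v(blur_h(image, kernel), kernel)  # two cheap 1D passes
full = blur_2d(image, kernel2d)                    # one 3x3 pass

if __name__ == "__main__":
    print(separable == full)
```

For a kernel of width n the two 1D passes need 2n multiplications per pixel instead of n*n, which is why separable blurs scale better as the radius grows.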
Last edited by Dogway; 22nd May 2021 at 09:40. |
27th May 2021, 14:21 | #8 | Link |
Registered User
Join Date: Nov 2009
Posts: 2,367
|
ExTools v1.0 final is released. Now it also supports 32-bit float bitdepth.
It's generally faster than masktools2, except when the Expr() "pixel addressing" feature is used, as in convolutions. For 8-bit it's still slower than masktools2, but pinterf is currently working on it. I don't know if "pixel addressing" is ever going to have AVX2 acceleration, but if it does it might be possible for my ports to exceed masktools2 speed, or at least reach it, leaving the masktools2 dependency behind and working with internal code only, as it ideally should be. Aside from the masktools counterparts I also created a few functions, like ex_blend() which replaces Overlay(), ex_undot() for removegrain(1), ex_boxblur() for removegrain(19) and ex_blur() for removegrain(12) and blur(). Therefore my script mods come in two versions, divided into folders: EX mods and MIX mods. EX mods are future proof with ExTools wrappers, and MIX mods use masktools2 convolutions and removegrain to maximize speed. See the updated benchmarks in the second post.
27th May 2021, 14:47 | #10 | Link |
Registered User
Join Date: Nov 2009
Posts: 2,367
|
As you can see in the benchmarks, many functions are faster than the masktools2 calls in real.finder's mods. Overlay() especially comes to mind, which is uber slow, but I also do some optimizations aside from the 1:1 ports. I don't plan to port everything, just my most used scripts, so I take special care. I'm also cleaning the code, removing old compatibility support, formatting, and so on.
I might switch ex_merge() back to mt_merge() for MIX mods; something is going on in there, but I didn't have much time to debug. From now on I will resume TransformsPack to release a v1.0 final soon, focused on SDR color spaces.
Last edited by Dogway; 27th May 2021 at 14:49. |
27th May 2021, 20:10 | #11 | Link |
Registered User
Join Date: Jan 2014
Posts: 2,322
|
Note: mt_merge has a cplace parameter, default "mpeg2", which (with luma = true) is slower than the dumb "mpeg1" choice. Could you try your benchmarks with cplace="mpeg1"? Regarding the other benchmarks, I'll do them as well, for example to see why mt_binarize is slower.
EDIT: Overlay multiply (largest speed difference): no wonder, there is no SIMD optimization there at all. EDIT2: mt_invert and Avisynth Invert are SSE2 only. But there is only a single instruction or two between load and store, which usually implies little or no gain. Actually, some years ago I implemented, for example, 8 bit binarize functions in AVX2, but I got zero speed gain so I decided that they wouldn't go live yet. Time to test those again on my i7-7700. Last edited by pinterf; 27th May 2021 at 21:28. |
28th May 2021, 00:05 | #12 | Link | |
Registered User
Join Date: Apr 2010
Location: I have a statue in Hakodate, Japan
Posts: 754
|
Quote:
Do you have a function equivalent to Overlay, but optimized? |
|
28th May 2021, 07:03 | #14 | Link | |
Registered User
Join Date: Jan 2014
Posts: 2,322
|
Quote:
Looking at the code: only 'blend', 'lighten' and 'darken' are optimized. If there is a popular, frequently used mode whose slowness significantly affects scripts, I can probably implement a speedup. |
|
28th May 2021, 07:16 | #15 | Link |
Registered User
Join Date: Mar 2012
Location: Texas
Posts: 1,669
|
The Overlay script in VapourSynth's havsfunc (which, if I'm not mistaken, mimics AviSynth's Overlay) uses Expr and MaskedMerge to do the work. It can probably be translated into AviSynth easily; it even includes some additional modes not available in AviSynth's Overlay.
|
28th May 2021, 09:10 | #16 | Link | |
Registered User
Join Date: Jan 2014
Posts: 2,322
|
Quote:
What mt_binarize and the Expr-based ex_binarize have in common is that they read and store pixels. What they are doing inside:
mt_binarize (16 bit data) has 2 operations:
- integer addition
- comparison.
Expr:
- Converts 16 bit pixels to 32 bit float (size doubled, using two registers instead of one)
- Compares with the limit (float comparison)
- Mask-blends either 0.0f or 65535.0f depending on the result.
- Converts the float data back to 16 bit integer with rounding.
Well, this difference can be seen in the single-threaded benchmark results.

Doing almost nothing, quite interestingly mt_binarize alone is so fast that we had better not do any synthetic benchmark on it, or in general on such filters (like mt_logic). I recommend testing them only embedded in a real script (as Dogway did when he provided benchmarks for whole scripts).

mt_binarize is a minimal-operation filter, having a memory load + two register operations + memory store. Clearly it was reaching the memory bottleneck. mt_binarize with no MT(!) is even a bit quicker than with any Prefetch value. This must be due to ruined caching and task switching/register saving overhead. mt_binarize combined with RemoveGrain was in the same ballpark with Prefetch(4) as without RemoveGrain!

Tested on i7-7700, avs+ 3.7.work
Code:
#SetMaxCPU("SSE4.1")
Import("ExTools.avsi")
Colorbars(pixel_type = "YUV420P16")
mt_binarize()
#ex_binarize()
#RemoveGrain(1, -1)
#Prefetch(4) # 8
Code:
Prefetch  mt_binarize  ex_binarize  x64_mt_bin  x64_ex_bin
-         19000        7000         19100       6700
4         16000        16500        15900       16600
8         13000        13900        12600       13900
Code:
#SetMaxCPU("SSE4.1")
Import("ExTools.avsi")
Colorbars(pixel_type = "YUV420P16")
mt_binarize()
#ex_binarize()
RemoveGrain(1, -1)
#Prefetch(4) # 8
Code:
Prefetch  mt_binarize  ex_binarize  x64_mt_bin  x64_ex_bin   + RemoveGrain(1, -1)
-         8800         5000         8500        4500
4         16114        11400        16700       11200
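To make the two code paths described in this post easier to compare, here is a scalar Python model of them. It is only a sketch: binarize_int() and binarize_float() are hypothetical names, the "greater than threshold gives peak" rule is an assumption about the threshold semantics, and Python floats are 64-bit rather than the 32-bit floats Expr uses (16-bit integers are exactly representable in either, so the result is unaffected). It says nothing about speed; it only shows that the float round trip does more work per pixel for the same output.

```python
PEAK_16 = 65535  # largest 16-bit sample value


def binarize_int(pixels, threshold, peak=PEAK_16):
    """Integer path: a single comparison per pixel on the native sample type."""
    return [peak if p > threshold else 0 for p in pixels]


def binarize_float(pixels, threshold, peak=PEAK_16):
    """Float path, following the four steps listed for Expr in the post."""
    out = []
    limit = float(threshold)
    for p in pixels:
        f = float(p)                               # 1. widen the integer sample to float
        above = f > limit                          # 2. float comparison with the limit
        selected = float(peak) if above else 0.0   # 3. select 65535.0 or 0.0
        out.append(int(selected + 0.5))            # 4. convert back to integer with rounding
    return out


if __name__ == "__main__":
    samples = [0, 1000, 17408, 17409, 40000, 65535]
    print(binarize_int(samples, 17408) == binarize_float(samples, 17408))
```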
|
28th May 2021, 10:40 | #17 | Link |
Registered User
Join Date: Nov 2009
Posts: 2,367
|
Yes, that was a test I had planned in my head, because in real scripts my functions seem to perform slower, while in synthetic tests they are normally faster. I don't really understand why. One thing I thought, and that's why I asked, was that masktools2 was converting to double float, but here it is 16-bit integer, like mt_binarize in your example above. I would think double float has a performance penalty, as you explained.
By the way, are those fps in the thousands? I get 530fps, but I use a slightly more real-case scenario (not totally synthetic): a 1080p source, loaded with DGSource and processed in 16-bit. I crafted a small script, more like what happens within filters: 250fps
Code:
ConvertBits(16)
a=ex_binarize(68)
b=a.ex_invert()
ex_logic(a,b,mode="andn")
# removegrain(1,-1)
Prefetch(4)
Code:
ConvertBits(16)
a=mt_binarize(68)
b=a.mt_invert()
mt_logic(a,b,mode="andn")
# removegrain(1,-1)
Prefetch(4)
Code:
str = Format("x x {th} scaleb - x {th} scaleb + clip")
str = Format("x x {th} scaleb - * ")
On ex_blend, I have plans to add more blending modes; the same goes for ex_expand shapes, ex_edge modes, unsharp and so on. But I wanted to get the basics done first and improve the project later with fresher eyes.
Last edited by Dogway; 28th May 2021 at 10:49. |
28th May 2021, 11:11 | #18 | Link |
Registered User
Join Date: Jan 2014
Posts: 2,322
|
I use the Colorbars source filter, that's why it is quicker.
Double (mt_lut) vs Float (Expr): the difference matters only where expressions must be evaluated, that is, in Expr and the mt_lut family. When there is enough memory and mt_lut really uses a LUT, the slow calculations affect only the creation of the lut tables. But for a 16-bit lutxy there is no memory for a lut (we'd end up with an 8GB memory table), so masktools uses 'realtime' expression evaluation: it calculates the expression for each frame and for each pixel, in pure C code, and that is very slow. Expr calculates in realtime as well, but since it compiles the expression into SSE2/AVX2 machine code (it acts like a small compiler) it is quicker than realtime mt_lut by orders of magnitude. Usually non-lut masktools filters are heavily optimized and use integer math where the source is 8-16 bits. |
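The 8GB figure for a 16-bit lutxy table can be checked with a little arithmetic: a lookup table needs one entry per possible input combination, and each entry stores one output sample. The Python sketch below works this out; lut_bytes() is a hypothetical helper, and storing each entry in the smallest whole number of bytes is an assumption about the table layout.

```python
def lut_bytes(bits: int, clips: int) -> int:
    """Size in bytes of a full lookup table for `clips` inputs of `bits` bits each."""
    entries = (1 << bits) ** clips      # every combination of input values
    bytes_per_entry = (bits + 7) // 8   # 1 byte for 8-bit output, 2 bytes for 16-bit
    return entries * bytes_per_entry


if __name__ == "__main__":
    for bits, clips, name in [(8, 1, "8-bit lut"), (8, 2, "8-bit lutxy"),
                              (16, 1, "16-bit lut"), (16, 2, "16-bit lutxy")]:
        print(f"{name:13s} {lut_bytes(bits, clips):>13,d} bytes")
    # The 16-bit two-clip case is 2**32 entries * 2 bytes = 8 GiB.
    print(lut_bytes(16, 2) // 2**30, "GiB")
```

An 8-bit lutxy table is only 64 KiB, which is why precomputed tables work well at 8 bits and stop being practical for two-clip expressions at 16 bits.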
28th May 2021, 12:02 | #19 | Link |
Registered User
Join Date: Nov 2009
Posts: 2,367
|
That makes sense now. I thought it was only the convolution types. I also noticed that chaining removegrain (or Dither_boxfilter) is very fast, whereas Expr() suffers a lot, probably due to what you explained. So processing in 32-bit float would tell another story.
28th May 2021, 13:20 | #20 | Link | |
Registered User
Join Date: Apr 2010
Location: I have a statue in Hakodate, Japan
Posts: 754
|
Quote:
Thanks Dogway for your contribution. Last edited by GMJCZP; 28th May 2021 at 13:23. |
|
Tags |
avisynth, dogway, filters, hbd, packs |
|