Doom9's Forum > Capturing and Editing Video > Avisynth Development

8th July 2022, 21:14   #141
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,041
"I cannot wait until you release it for HBD and blocksize above 8. "

It is only for better speed and lower RAM usage in some use cases. All new quality features should already work with all bit depths and block sizes supported by 2.7.45. The DX12-ME mode cannot support >8 bit input because the current Microsoft DX12 API does not support it. The only supported input format for hardware ME is NV12, which is converted internally from the YV12 AVS format. So to process HBD with hardware ME you need to downconvert the source for MAnalyse to YV12. You can still feed the 16-bit source to MDegrainN using a different super clip.

An example was already shown here, like
Code:
Super = MSuper(levels=1...)
Super8 = ConvertToYV12.MSuper(levels=1...)
Multi_Vector = Super8.MAnalyse(optSearchOption=5, levels=1...)

MDegrainN(Super, Multi_Vector,...)
"the lines and objects in high grain clip did not get the usual dancing/wobblyness that is very annoying."

Is it with the MVLPF options enabled for MDegrainN?

Last edited by DTL; 8th July 2022 at 21:20.
10th July 2022, 19:56   #142
DTL
New version: https://github.com/DTL2020/mvtools/r.../r.2.7.46-a.12

Added single-pass colour overlapped processing to MDegrainN. Fixed a regression where thSADC/thSADC2 were not used in single-pass processing.
Added tweaking param adjSADLPFedmv to MDegrainN to adjust the SAD of MVs that pass the thSAD check after filtering. Float param. Default 1.0 (no correction). Recommended value is about 0.8. The SAD at the filtered MV positions is typically a bit higher than the initial SAD after ME processing (because the ME engine points to the best SAD), so this adjustment allows some boost to the weighting of blocks after interfiltering of MVs.

Added optSearchOption=6 to MAnalyse. In this mode DX12-ME is used only to get MVs from the HW accelerator, and the SAD calculation is performed on the host CPU. The Compute.cso shader is not used. Also, for 8x8 8-bit blocks UseSubShift=1 is available in MAnalyse to use sub-shifting (this allows running with pelrefine=false in MSuper and saves RAM).
It may be faster on some host/accelerator combinations. Also, the shader's SAD calculation for pel=2 and pel=4 is still not completely correct (it is higher than in the original mvtools).

So for onCPU SAD calculation (as the 'reference' quality mode until the shader is completely fixed):
Code:
super=MSuper(mt=false, chroma=true, pel=4, hpad=8, vpad=8, levels=1, pelrefine=false)
multi_vec=MAnalyse (super, multi=true, blksize=8, delta=tr, overlap=0, chroma=true, optSearchOption=6, mt=false, levels=1, UseSubShift=1)
MDegrainN(last,super, multi_vec, tr, thSAD=250, thSAD2=240, mt=false, wpow=4, thSCD1=400, adjSADzeromv=0.5, adjSADcohmv=0.5, thCohMV=16, MVLPFGauss=0.9, thMVLPFCorr=50, adjSADLPFedmv=0.8, UseSubShift=1)
On an i5-9600K CPU with a GTX1060, MAnalyse (SO=6 and USS=1) is still a bit slower with MPEG encoding (about 6 vs 6.7 fps) but produces a slightly smaller file.

The raw performance of MAnalyse with different search and SAD calculation options can be tested with AVSmeter without MDegrain, like
Code:
super=MSuper(mt=false, chroma=true, pel=4, hpad=8, vpad=8, levels=1, pelrefine=false)
multi_vec=MAnalyse (super, blksize=8, chroma=true, optSearchOption=6, mt=false, levels=1, UseSubShift=1)
MStoreVect(multi_vec)
It looks like MStoreVect does not support multi=true. So the resulting fps needs to be divided by tr x 2 to estimate the real processing speed (without MDegrain) for different tr values. This script reports performance in frame pairs per second (src+ref).
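The division by tr x 2 described above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample rate of 120 pairs per second are hypothetical, and it assumes that each output frame needs tr backward plus tr forward searches.

```python
def estimated_fps(pairs_per_second: float, tr: int) -> float:
    """Estimate MAnalyse throughput in output frames per second.

    pairs_per_second is the AVSmeter figure for the single-pair script
    (one src+ref search per frame). With multi=true each output frame
    needs 2*tr searches, so the measured rate is divided by 2*tr.
    """
    if tr < 1:
        raise ValueError("tr must be >= 1")
    return pairs_per_second / (2 * tr)


if __name__ == "__main__":
    # Hypothetical measurement of 120 pairs/s, scaled to several tr values.
    for tr in (3, 6, 12):
        print(f"tr={tr}: about {estimated_fps(120.0, tr):.2f} fps")
```

The estimate ignores MDegrainN itself, so it is an upper bound for the full script.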

The 16-bit subshifting in MDegrainN still does not work completely correctly and is still only a slow C reference.

Last edited by DTL; 10th July 2022 at 20:10.
20th July 2022, 13:45   #143
DTL
Finally, the most long-awaited feature for MDegrainN with hardware acceleration in 2022: a fully internal interpolated overlap mode for MDegrainN. First working release: https://github.com/DTL2020/mvtools/r.../r.2.7.46-a.13

Added interpolated overlap mode to MDegrainN. Only the 'max' overlap mode of blocksize/2 is currently implemented.

New param of MDegrainN: IntOvlp (int).
Values:
0 - standard mode (default).
1 - internally interpolate input MVs to blocksize/2 overlap mode.

Added block size 16x16 for subshifting with an AVX2 implementation. Fixed a bug from the previous release where MDegrainN with chroma=false did not copy the chroma planes to the output.

The new param is int rather than bool because different interpolated overlap modes are planned for testing in the future. Currently it has a more fail-safe design with a SAD re-check for interpolated MVs to decrease the probability of bad blends, but it is slower. It is possible to run a faster interpolation-only mode that also interpolates SAD without a re-check, but it may decrease quality. It is also possible to move MVLPF processing before interpolation to test speed/quality.

On the i5-9600K with a GTX1060 it runs about 50% slower with x264 encoding, but the quality is visibly better. There should be no more blockiness on flames/fogs/fades. Small blockiness on moving objects is also mostly removed.
It may also run with 'onCPU' MAnalyse using a no-overlap MV search to gain some performance without a HW accelerator.
Subshifting may be used in this mode, but it may or may not be faster than 'precalculated' sub planes in MSuper; it looks like this depends on the host CPU. On the i5-9600K with IntOvlp=1 it runs a bit faster without the subshifting feature.

Now about 4 different quality/speed modes are available for overlapping:
1. Old onCPU MAnalyse overlap (a true full overlapping search with up to 4x the number of blocks) - possibly the best 'reference' quality. Slowest mode.
2. Two separate clips, diagonally shifted by half a block, processed with hardware-accelerated MAnalyse (single or dual accelerators should be supported if available for each MAnalyse) and overlapped in AVS using different internal or external filters. Uses 2 sets of really analysed full-frame MVs. May be a bit lower in quality compared with 1. Requires additional scripting and/or plugins. Fizick's BlockOverlap plugin is still C-reference only, so it may be slow. Speed depends on host performance.
3. Hardware MAnalyse (SO=5 or 6) and interpolated overlap in MDegrainN based on a single non-overlapped MV array. Faster, but may be lower in quality compared with 1 and 2. Possibly the most RAM-saving mode (it also supports minimal RAM usage with the UseSubShift option).
4. No-overlap processing with hardware-accelerated MAnalyse and standard MDegrain in non-overlapped mode. Lowest quality - may produce visible blockiness on flames/fogs/fades. Fastest mode.

I hope the speed penalty of interpolated overlap relative to non-overlapped MDegrainN can be decreased in future releases - I have not yet looked with a profiler at what can be optimized further. But the interpolated overlap mode also processes 4x the number of blocks, so the host CPU load is high.

Some sad news - block size 16x16 runs unstably, at least in some test modes with hardware acceleration on my remote test host, and the remote debugger cannot catch the divide-by-zero exception. So it may be an issue in the NVIDIA driver, in Windows, or in this software.

If you like my software - you may donate me or join my team in OZON promo platform to support my growing family with several kids. Write me a private message for details.

Last edited by DTL; 20th July 2022 at 14:44.
21st July 2022, 18:40   #144
mastrboy
Registered User
 
Join Date: Sep 2008
Posts: 365
Quote:
Originally Posted by DTL
Finally the most long awaited feature to MDegrainN with hardware acceleration in 2022 - fully internal MDegrainN interpolated overlap mode : First working release - https://github.com/DTL2020/mvtools/r.../r.2.7.46-a.13
Does it not support YUV420 in 8bit?
I can only get it to work with 10,12,16bit:

Working tests:
Code:
ColorBarsHD().crop(4,0,-4,0)
ConvertToYUV420().ConvertBits(16)
#ConvertToYUV420().ConvertBits(12)
#ConvertToYUV420().ConvertBits(10)
tr = 3
super = MSuper ()
multi_vec = MAnalyse (super, multi=true, delta=tr, blksize=16)
MDegrainN (super, multi_vec, tr, thSAD=400, thSAD2=150, IntOvlp=0)
Not working:
Code:
ColorBarsHD().crop(4,0,-4,0)
ConvertToYUV420().ConvertBits(8)
#ConvertToYUV420()
tr = 3
super = MSuper ()
multi_vec = MAnalyse (super, multi=true, delta=tr, blksize=16)
MDegrainN (super, multi_vec, tr, thSAD=400, thSAD2=150, IntOvlp=0)
Error I get from AVSmeter on 8bit content:
Exception 0xC0000005 [STATUS_ACCESS_VIOLATION]
Module: C:\Program Files (x86)\AviSynth+\plugins64+\mvtools2.dll
Address: 0x00007FFD41996618

Last edited by mastrboy; 21st July 2022 at 18:45. Reason: added error message
21st July 2022, 21:52   #145
DTL
"Does it not support YUV420 in 8bit?"

In practice the only thoroughly tested format is YV12, that is, planar YUV420 in 8 bit, which is what I typically use for my encodes, with a block size of 8x8. If you get a crash with block size 16, try increasing the padding in MSuper to 16 or more. I think it was fixed in some old version (maybe in the pinterf 2.7.45 source), but if it appears again, the first workaround to try is to increase padding.

So a better MSuper for blocksize=16 is
MSuper(hpad=16, vpad=16)

I have even thought of making padding auto-adjust from the block size, but unfortunately the data flows from MSuper to the downstream filters, so MSuper cannot get the block size from MAnalyse (not in any easy way with the current passing of frames through the AVS environment). Padding of 8 is the internal default in MSuper. It can probably be safely increased to 16 or even 32, because current PCs typically have more memory. I will try to do it in the next builds.

It looks like an old mvtools issue, so in some scripts I see padding automatically increased to the block size as a fail-safe. It is easy in a script but may not be possible across separate filters:
Code:
Myblksize = 16
sc=MSuper(hpad=Myblksize, vpad=Myblksize, ...)
MAnalyse(sc, blksize=Myblksize,...)
I know users like a block size of 16 because it is typically faster onCPU (and with overlap the blockiness is not very visible), but I typically use 8x8 because it gives better quality (also, as far as I can see, 16 is unstable in HW modes, at least on my only currently available test hardware setup).

Another known issue is that SSE2 builds may run unstably with bit depth >8 on new CPUs, so the AVX2 build is recommended where possible.
Also, very few frame sizes were tested, so it is recommended to start from the 'standard' sizes of 1920x1080 for FullHD and 3840x2160 for UHD 4K. If HW mode creates several lines of buggy blocks at the bottom of the frame, the current workaround is to pad the frame at the bottom by several block-size lines (I typically use 72 for FullHD with a block size of 8x8).

Padding is required to keep good quality at the frame edges, because the search, SAD check and blend engines cannot operate on partial blocks (a paranoid border check would decrease processing speed over the whole frame), so for correct and best-quality operation a padding of at least the block size is good. Too large a default padding wastes RAM and may decrease speed. But if a 0xC__5 exception occurs and increasing padding to some 'large value' like blocksize x10 solves it, that is a sign that more debugging and adjustment of MV clipping, or some other bugfix, is required.
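The rule of thumb above (padding of at least the block size, never below MSuper's default of 8) can be written as a tiny helper for anyone who generates scripts programmatically. This is a sketch only: `safe_pad` and `msuper_call` are hypothetical names, and the helper simply formats the same MSuper call shown in the post.

```python
DEFAULT_PAD = 8  # MSuper's internal default padding, as stated in the post


def safe_pad(blksize: int, default_pad: int = DEFAULT_PAD) -> int:
    """Return a padding that is at least the block size and at least the default."""
    if blksize < 1:
        raise ValueError("blksize must be positive")
    return max(blksize, default_pad)


def msuper_call(blksize: int) -> str:
    """Format an MSuper call with hpad/vpad derived from the block size."""
    pad = safe_pad(blksize)
    return f"MSuper(hpad={pad}, vpad={pad})"


if __name__ == "__main__":
    for blksize in (4, 8, 16, 32):
        print(blksize, "->", msuper_call(blksize))
```

Going above the block size (for example blocksize x10) is better kept as a debugging aid, since the post notes that oversized padding wastes RAM.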

Last edited by DTL; 21st July 2022 at 22:23.
22nd July 2022, 10:28   #146
takla
Registered User
 
Join Date: May 2018
Posts: 182
@DTL
Great update!

Code:
LWLibavVideoSource("C:\Users\Admin\Downloads\newSAMPLE.mkv")
Crop(0, 280, -0, -280)
BilinearResize(1920, 1080)
ConvertBits(8, dither=1).ConvertToYV12()
EZdenoise(thSAD=300, TR=8, Chroma=true)
Prefetch(12, 48)
Code:
function EZdenoise(clip Input, int "thSAD", int "thSADC", int "TR", int "BLKSize", int "Overlap", int "Pel", bool "Chroma")
{
thSAD = default(thSAD, 150)
thSADC = default(thSADC, thSAD)
TR = default(TR, 3)
BLKSize = default(BLKSize, 8)
Overlap = default(Overlap, 0)
Pel = default(Pel, 1)
Chroma = default(Chroma, false)

Super = Input.MSuper(Pel=Pel, Chroma=Chroma, Levels=1)
Multi_Vector = Super.MAnalyse(Multi=true, Delta=TR, BLKSize=BLKSize, Overlap=Overlap, Chroma=Chroma, Levels=1, optSearchOption=5)

Input.MDegrainN(Super, Multi_Vector, TR, thSAD=thSAD, thSAD2=int(float(thSAD*0.9)), thSADC=thSADC, thSADC2=int(float(thSADC*0.9)), IntOvlp=1)
}
Code:
CPU
v13 - Levels=1, Overlap=BLKSize/2, 8-bit & YV12
time=74.898s
time=75.007s
857.119 KB

GPU
v13 -optSearchOption=5 & IntOvlp=1
time=71.913s
time=73.643s
time=72.687s
826.089 KB

GPU
v13 -optSearchOption=5 & IntOvlp=0
time=76.360s
785.041 KB

GPU
v13 -optSearchOption=6 & IntOvlp=1
time=72.494s
824.445 KB
All tested using the previously posted 4K sample, with this command:

Code:
ffmpeg -y -benchmark -i 01.avs -c:v prores_ks -qscale:v 4 v13.mkv
Hardware used:
AMD Ryzen 3900X
AMD Radeon RX5700

Quality difference to CPU is now much closer and speed is now faster even at 1080P. I'll most likely post an update for EZdenoise in my thread soon, with some instructions.

Last edited by takla; 22nd July 2022 at 10:35.
22nd July 2022, 14:52   #147
DTL
It is more interesting to test at max quality with pel=4, CPU only vs DX12-ME assisted. The default pel=1 is a sort of 'draft' quality only. The same goes for IntOvlp=0 - the fastest but low-quality mode.

It would also be good to test whether a 'large cache' AMD Ryzen is faster or slower with the new interpolated overlap mode of MDegrainN (preferably also with MVLPF enabled, which adds one more SAD re-checking pass and a full reload of the ref frames to the CPU dispatch ports) and with UseSubShift true/false. For both 1080p and 4K with pel=4.

So combinations to test:

MSuper(pelrefine=false, pel=4)
MAnalyse(optSearchOption=5) (optSearchOption=6 requires UseSubShift=1 in this case)
MDegrainN(MVLPFGauss=0.9, thMVLPFCorr=50, adjSADLPFedmv=0.8, UseSubShift=1, IntOvlp=1)

and

MSuper(pelrefine=true, pel=4)
MAnalyse(optSearchOption=5 or 6)
MDegrainN(MVLPFGauss=0.9, thMVLPFCorr=50, adjSADLPFedmv=0.8, UseSubShift=0, IntOvlp=1)

"v13 -optSearchOption=5 & IntOvlp=1
time=71.913s
v13 -optSearchOption=5 & IntOvlp=0
time=76.360s"

That is rather strange - on my 'old' 9-series Intel CPU the (interpolated) overlap mode of MDegrainN is about 2 times slower. Maybe something else limits the speed here, so the results are close, or the much more complex overlap processing in MDegrainN is even faster? Overlap processing in MDegrainN takes at minimum 2 passes over the frame: the first pass accumulates partial weighted blocks (possibly in float, or at least in 16-bit short) and the second pass blends and converts to the output bit depth. Though I typically work with pel=4 only, so it requires either fetching large planes from RAM or computing many sub-sample shifts on the CPU.
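The two passes described above can be illustrated with a one-dimensional Python sketch. It is not the MDegrainN code: the raised-cosine window, the float accumulator and the function names are assumptions chosen to keep the example short, and real processing works on 2D blocks with temporal weights as well.

```python
import math


def window(blksize):
    """Raised-cosine window; with a half-block step, w[i] + w[i + blksize//2] == 1."""
    return [math.sin(math.pi * (i + 0.5) / blksize) ** 2 for i in range(blksize)]


def overlap_blend_1d(blocks, blksize, length):
    """Blend overlapping 1D blocks into an 8-bit row.

    blocks is a list of (start, samples) pairs whose starts are spaced
    blksize // 2 apart.
    """
    win = window(blksize)
    acc = [0.0] * length   # pass 1: wide accumulator of weighted samples
    wsum = [0.0] * length  # pass 1: accumulated window weight per sample
    for start, samples in blocks:
        for i, sample in enumerate(samples):
            acc[start + i] += win[i] * sample
            wsum[start + i] += win[i]
    # Pass 2: normalise, round and clamp to the 8-bit output range.
    return [min(255, max(0, round(a / w))) if w > 0 else 0
            for a, w in zip(acc, wsum)]


if __name__ == "__main__":
    # A dark block followed by a bright one: the overlap area ramps smoothly.
    print(overlap_blend_1d([(0, [0] * 8), (4, [200] * 8)], 8, 12))
```

The second pass is cheap compared with fetching or computing the block data, which matches the observation that the cost sits mostly in the first pass.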

"CPU
v13 - Levels=1,"

When running MAnalyse onCPU it is better to use all levels (so levels=0). It is typically only a bit slower but may catch good long MVs if there is fast movement in the footage.

Last edited by DTL; 22nd July 2022 at 15:53.
22nd July 2022, 18:19   #148
magnetite
Registered User
 
Join Date: May 2010
Posts: 64
So I tried this new update with the onCPU SAD calculation from this post, and it still asks me for the Compute.cso shader file. Is that normal? I thought it was supposed to be CPU only.
26th July 2022, 19:51   #149
DTL
New version: https://github.com/DTL2020/mvtools/r.../r.2.7.46-a.14

Added mode 2 for IntOvlp in MDegrainN: it does not check the real SAD at the interpolated block positions. So it is faster but may be lower in quality.

Fixed a buffer overrun bug in InterpolateOverlap in MDegrainN.

Added AVX2 (8-bit output), SSE2 and SSE4 (>8-bit output) implementations of the second pass (conversion to the output format) in MDegrainN.

Disabled loading of the shader file Compute.cso in optSearchOption=6 mode of MAnalyse.

Added different builds: for Win10 and later with DX12, and for Win7 and others without DX12. Some IntelC++ builds are also available for AVX2 CPUs.

It is possible to move (copy) the overlap interpolation to MAnalyse and also put its mode=1 SAD computation on the accelerator. But as I tested with IntOvlp=2 on my CPU, the speed benefit without the SAD re-check is fairly small (about 12%), and that redesign needs more time.

Current profiling shows that most of the time for overlap processing in MDegrainN with high pel precision is spent fetching ref data from memory (USS=0) or computing sub-shifts (USS=1). On the i5-9600 CPU both are about balanced. But on faster chips and with AVX512 sub-shifting, USS=1 mode may finally become visibly faster, though it depends on cache size and speed and on task size. The overlap blend computation and data conversion/storing are already very fast, so moving the post-overlap 16-bit to 8-bit conversion from C-ref to AVX2 gives almost zero speed gain, at least on my tested config.

IntelC++ SSE2 builds require some syntax redesign and development time, so they are not included in this release. On the i5-9600 the speed decreases in the sequence IntelC AVX2 -> MSVC AVX2 -> MSVC SSE2 as 3.55 -> 3.4 -> 3.2 fps with UseSubShift=1 and IntOvlp=1.

Update: Finally added descriptions of the new options to the documentation. See the updated file https://github.com/DTL2020/mvtools/b.../mvtools2.html . Many limitations of the new options are still not documented, like block size, bit depth and so on (supported yes/no, SIMD accelerated yes/no). It looks like this needs a table.

Last edited by DTL; 27th July 2022 at 15:50.
3rd August 2022, 09:36   #150
DTL
New version: https://github.com/DTL2020/mvtools/r.../r.2.7.46-a.15

Added a diagonal interpolated overlap mode to MDegrainN that processes 2x the number of blocks. IntOvlp=3 uses a SAD re-check and IntOvlp=4 uses interpolated SAD.
Added more error messages for incompatible options passed to MSuper/MAnalyse/MDegrainN.
Updated the documentation with the new options. The updated file is https://github.com/DTL2020/mvtools/b.../mvtools2.html
Added meander scan in the combined luma+chroma overlapped processing - it may give better reuse of cached ref plane data.

IntOvlp=3 is now the typical everyday mode because it is much faster and very close in quality to IntOvlp=1. It is only about 30% slower than no-overlap processing on the i5-9600 with SO=5. It is close or equal to the operation of the old BlockOverlap plugin.
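To see where the "2x" and "4x" block counts come from, here is a small Python sketch that enumerates block origins for the three layouts. The layout at frame borders is an assumption (only blocks that fit fully inside the frame are counted), and the function and mode names are hypothetical, not mvtools identifiers.

```python
def block_origins(width, height, blksize, mode):
    """Return the top-left corners of the blocks for a given overlap layout.

    mode "none":     plain non-overlapped grid
    mode "half":     grid with a blksize/2 step in both directions
    mode "diagonal": plain grid plus a copy shifted by (blksize/2, blksize/2)
    """
    half = blksize // 2

    def grid(offset, step):
        xs = range(offset, width - blksize + 1, step)
        ys = range(offset, height - blksize + 1, step)
        return [(x, y) for y in ys for x in xs]

    if mode == "none":
        return grid(0, blksize)
    if mode == "half":
        return grid(0, half)
    if mode == "diagonal":
        return grid(0, blksize) + grid(half, blksize)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    base = len(block_origins(1920, 1080, 8, "none"))
    for mode in ("none", "half", "diagonal"):
        count = len(block_origins(1920, 1080, 8, mode))
        print(f"{mode}: {count} blocks, ratio {count / base:.2f}")
```

On small frames the ratios fall a little short of 4x and 2x because the shifted grids lose a row and a column at the border.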

In the future it becomes possible to make many combined speed/quality modes:
1. Diagonal overlap search onCPU in MAnalyse.
2. Diagonal overlap is compatible with DX12ME and can double the accelerator load with a 'real' search - this may be used when the host CPU is slow and the accelerator is underloaded.
Some internal flags may be added to the MVs clip to indicate whether it contains diagonally overlapped MV data.
MDegrainN may also accept 2 MV clips from 2 MAnalyse instances, for the original and the diagonally shifted block search data, using any combination of onHWAcc (single or dual) or onCPU search, to balance the load between the host CPU and single or multiple accelerators. As I see, second-hand ex-mining headless cards, possibly equal to a GTX1060 chip, are now available on Aliexpress for about $35..50.
Though the quality difference between 2 really searched MV planes for diagonal overlap mode and an interpolated second MV set needs to be examined - it may be too small. It may only be worthwhile if the host CPU is too slow for MDegrainN and many free accelerator resources are available.

Last edited by DTL; 3rd August 2022 at 09:47.
3rd August 2022, 22:31   #151
anton_foy
Registered User
 
Join Date: Dec 2005
Location: Sweden
Posts: 702
Quote:
Originally Posted by DTL
New version: https://github.com/DTL2020/mvtools/r.../r.2.7.46-a.15
Thanks DTL, very interesting! I may be a little bit under the sun, but I think you mentioned sometime before that a prefiltering/auxiliary clip will not be required with your build. Is this true? In that case, how does it work?
3rd August 2022, 23:08   #152
DTL
" prefiltering/auxilary clip will not be required with your build. "

I have still not found a visible benefit from prefiltering when using 'interfiltering' of MVs inside MDegrainN with MVLPF processing. So it may be a new (partial or complete) replacement for the old prefiltering method. I have not done much testing. The MVLPF is still available in only 2 simple implementations and may be subject to more complex (linear and/or non-linear) development in the future, either as internal processing inside the mvtools binary or as intermediate scripting (using MStoreVect/MRestoreVect and the sample-access methods available to scripts in new AVS+).

So if you use old scripts and do not enable MVLPF processing, prefiltering will most probably still give a benefit. If you enable MVLPF, it is better to run new tests on whether prefiltering is required, and how much.

"In that case how does it work?"

To enable the internal MVLPF you need to change either MVLPFCutoff or MVLPFGauss from its default value. The defaults are the disabled state, for compatibility with old scripts. Only one of the 2 can work at a time. This may not be the best option naming. It may be better to change the MVLPF params to 'MVLPF_Type=None/Sinc(?)/Gauss/...' and 'MVLPF_Param1, MVLPF_Param2,...', so that MVLPF_Type=None clearly means the processing is disabled (or that one of the possible filters is selected). So in theory the options may change in the future.
The best values for these settings may depend more or less on the footage and the user's preference. I currently use MVLPFGauss=0.8 for my encodes of HDTV 1080i documentaries.

Addition:
The old BlockOverlap plugin has an additional 'kernel' control param:
http://avisynth.org.ru/blockoverlap/blockoverlap.html
kernel - blending window form (float, from 0.0 (uniform) to 1.0 (cosine kernel), default=0.5).
Blending mode with kernel=0 is the same as the Avisynth command Overlay with opacity=0.5. In this mode the filter cannot remove all block artifacts, but it halves them; try using some additional deblock filter.
Mode kernel=1.0 effectively smooths blocks, but can produce some dot (circle) artifacts instead.

In the current IntOvlp modes 3 and 4 a full cosine-shaped window is used (equal to kernel=1.0 in the BlockOverlap plugin). If it is not the best for some cases, an additional control param may be added. As far as I can see, the overlap windows in the 2.7.45- versions use only the full cosine window, not an average with a rectangular window, whereas the old BlockOverlap plugin defaults to a mix of 0.5 cosine and 0.5 rectangular. This may also be something for users and/or script developers to test.
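A possible reading of the 'kernel' parameter, sketched in Python: the window is a linear mix of a uniform window (0.5 everywhere, like Overlay with opacity=0.5) and a raised-cosine window. The linear mix and the exact cosine shape are assumptions made for illustration; the BlockOverlap source may build its window differently.

```python
import math


def blend_window(blksize: int, kernel: float) -> list:
    """Blend window for half-block overlap.

    kernel=0.0 gives a uniform window, kernel=1.0 a full cosine window,
    values in between mix the two linearly (an assumed interpretation).
    """
    if not 0.0 <= kernel <= 1.0:
        raise ValueError("kernel must be within 0.0..1.0")
    weights = []
    for i in range(blksize):
        cosine = math.sin(math.pi * (i + 0.5) / blksize) ** 2
        weights.append((1.0 - kernel) * 0.5 + kernel * cosine)
    return weights


if __name__ == "__main__":
    for kernel in (0.0, 0.5, 1.0):
        print(kernel, [round(w, 3) for w in blend_window(8, kernel)])
```

For any kernel value the two windows covering the same sample sum to 1, so the mix changes only how sharply the blend switches from one block to the other.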

Last edited by DTL; 4th August 2022 at 12:07.
9th August 2022, 21:11   #153
DTL
Trying to use the new builds with QTGMC for some processing of interlaced HD in an intermediate progressive form, I found the following issues:
1. The DX12_ME modes return a minor error saying the motion clip is too small. So currently onCPU with optSearchOption=1 is the most advanced new processing mode usable with QTGMC.
2. Using QTGMC's default block size of 16 crashes MAnalyse onCPU with an out-of-frame-buffer memory access - maybe more padding in MSuper or more vector limit checks in MAnalyse would help. Only block size 8 works. Maybe the limiting was damaged in some redesign since version 2.7.45 - this needs more debugging.

Also, as I see, the main processing in QTGMC is based on the MCompensate function. Some of the denoising is based on MDegrainX, so it is easy enough to rewrite for MDegrainN (though too low a tr is typically not effective with MVLPF).
After looking into MCompensate in mvtools 2.7.45 I see it is very outdated, and some redesign is planned:

1. Currently MCompensate is based on hard thresholding by thSAD (the same as the new MDegrainN with wpow=7). So it may be good to give the ref block weighting a smoother rolloff using the same weighting functions as in MDegrain, with the same new wpow param as in MDegrainN.
2. Motion-adaptive adjustment of the weighting depending on MV length and coherency may be added, but it needs testing to see whether it is good for QTGMC.
3. MVLPF for a 2-frame MCompensate is just about not effective at all (or it requires a redesign of MCompensate to request a big enough set of frames from MAnalyse in multi mode, which would load the ME part of mvtools significantly - though maybe caching MV frames for reuse in both deinterlacing and MDegrainN denoising could help decrease the load).
4. Adding sub-sample shifting is the simplest task and will be done quickly.
5. Main addition - to support the new interpolated overlap modes with a non-overlapped search in MAnalyse, the same interpolated overlap processing needs to be added to MCompensate. To avoid copying many parts of the program from MDegrainN to MCompensate, it may be good to make some shared classes/functions. So the architecture of the old and simple MCompensate function needs to be redesigned to match the current MDegrainN (with array-based storage of MV data, not a Fake object that accesses VECTOR data as Block class data).
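The difference between hard thresholding and a smoother rolloff (item 1 above) can be shown with two small Python functions. The rolloff shape (thSAD^p - SAD^p) / (thSAD^p + SAD^p) and the `power` argument are assumptions for illustration only; how wpow maps onto the real weighting functions in MDegrainN is not shown in this thread.

```python
def hard_weight(sad: float, th_sad: float) -> float:
    """Hard threshold: a block is either fully used or fully rejected."""
    return 1.0 if sad < th_sad else 0.0


def soft_weight(sad: float, th_sad: float, power: int = 2) -> float:
    """Smooth rolloff: weight falls gradually from 1 at SAD=0 to 0 at SAD=thSAD."""
    if sad >= th_sad:
        return 0.0
    a = th_sad ** power
    b = sad ** power
    return (a - b) / (a + b)


if __name__ == "__main__":
    th_sad = 400
    for sad in (0, 100, 200, 300, 399, 400):
        print(sad, hard_weight(sad, th_sad), round(soft_weight(sad, th_sad), 3))
```

With the hard threshold a block just below thSAD contributes as much as a perfect match, while the soft version lets it contribute almost nothing, which is the behaviour the proposed MCompensate change is after.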

Last edited by DTL; 9th August 2022 at 21:22.
11th August 2022, 02:40   #154
MysteryX
Soul Architect

Join Date: Apr 2014
Posts: 2,559
MVTools2 with hardware acceleration? Great!

Does this implementation have the potential to work on Linux, or is it Windows-only? DirectX 12 doesn't look very Linux-friendly... Linux uses Vulkan. Is the Vulkan API suitable for this work? It's generally faster for most tasks.
11th August 2022, 03:44   #155
takla
@MysteryX
Read post #129 & #130
11th August 2022, 14:37   #156
DTL
AMD provides ME starting from DX11 and Windows 7 via its custom LiquidVR API. But it may still be Windows-only and not Linux-compatible. Though DX11 may be easier to emulate on Linux if AMD provides all the needed drivers.

In the better case the Linux community needs to define an ME API for applications and provide a DDK for hardware developers (Intel/AMD/NVIDIA) to develop the required driver support for this API. It may also help the UNIX developers of the x264 MPEG encoder to get some help from a hardware ME accelerator too.

Maybe Linux developers could add ME (and MPEG encode/decode) to the Vulkan API and ask driver developers to support it.

Last edited by DTL; 11th August 2022 at 14:44.
24th September 2022, 19:38   #157
DTL
New release: https://github.com/DTL2020/mvtools/r.../r.2.7.46-a.16

Added SuperCurrent param to MAnalyse, clip param. Allows providing a differently processed clip as the current source for the search. May be useful for prefiltering use cases.

Added SearchDirMode param to MAnalyse, int param. 0 - search in the standard direction (current frame to ref frame MVs). 1 - reverse search.

Fixed a bug where the combined luma+chroma processing modes were not selected when thSADC=thSAD (and thSADC2=thSAD2).

Added Multi-Pass Blending (MPB) mode to MDegrainN. New params:
MPBthSub, int (10), threshold for subtracted blocks.
MPBthAdd, int (20), threshold for standard (additively) blended blocks.
MPBNumIt, int (0), number of iterations. 0 - MPB processing mode is not used.
MPB_SPC, float (1.5), multiplier and divider for the weight adjustment at each iteration if the SAD of the current blending result vs the subtracted or ref block is above the threshold.

Currently MPB mode is only supported for 8-bit formats.
Current typical usage params for MDegrainN (with onCPU search as an example):

Code:
tr=10

super=MSuper(last, mt=false, chroma=true, pel=2)
multi_vec=MAnalyse(super, multi=true, blksize=8, delta=tr, search=3, searchparam=2, overlap=0, optSearchOption=1, optPredictorType=0, chroma=false, mt=false)
MDegrainN(last,super, multi_vec, tr, thSAD=150, thSAD2=140, mt=false, wpow=4, thSCD1=400, adjSADzeromv=0.5, adjSADcohmv=0.5, thCohMV=16, 
 MVLPFGauss=0.9, thMVLPFCorr=50, UseSubShift=1, IntOvlp=3, MPBthSub=10, MPBthAdd=20, MPBNumIt=2, MPB_SPC=1.5)
The MPB params are subject of long experiments for best results (zopti optimizer may be highly required). Too much iterations may quickly decrease 'denoising'. Typically enabling MPB require to set thSAD to a bit higher value to keep denoising at static and flat areas good enough. Example - from 110 to 150 with default MPB params and 2 iterations. So each iterations weight adjustment param MPB_SPC is defaulted currently to high enough value of 1.5. Also MPB mode make high CPU computing load (most data is cached but it require many additional SAD computing and some block subtraction from total blending result). So speed penalty from enabling MPB mode is enough even with 2 iterations. May be lower in the future with AVX2/AVX512 SAD and block subtract future functions to develop. Currently uses only SSE2.
Possibly max quality MPB with luma + chroma SAD usage is only supported in combined luma+chroma processing (if thSADC=thSAD, thSADC2=thSAD2 - default mode). Separated planes processing only use luma or current chroma plane SAD so may give worse results.

MPB mode operation:
For each iteration:
1. Calculate subtracted blocks: the current blend result (with current blend weights) minus a single block. This is done for speed; the result should be equal or very close to a partial blend with that single block excluded, but partial blends require more operations than subtraction.
2. Calculate the SAD of the current blend result vs the subtracted blocks array (vector) and vs the input blocks array (src + all refs).
3. Calculate the average SAD for the subtracted and the standard blended blocks arrays.
4. If the current block SAD differs from the average by more than the threshold (MPBthSub or MPBthAdd), adjust its weight: decrease it (for subtracted blocks) or increase it, by the MPB_SPC ratio.

The idea is to create a more equal weight field for all blocks in the blending pool. If a block contributes too badly to the blended average, its weight is decreased, and if a block looks more equal to the blended average, its weight is increased. No MVs are analysed here (yet).
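
One iteration of the steps above can be sketched in Python roughly as follows. This is only an illustration and not the mvtools code: the names sad, blend and mpb_iteration, the representation of blocks as flat lists of pixel values, and the exact comparison directions used in step 4 are all assumptions made to show the idea.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for x, y in zip(a, b))


def blend(blocks, weights):
    """Weighted average of all blocks (src + refs), pixel by pixel."""
    total = sum(weights)
    size = len(blocks[0])
    return [sum(w * b[i] for b, w in zip(blocks, weights)) / total
            for i in range(size)]


def mpb_iteration(blocks, weights, th_sub, th_add, spc):
    """One hypothetical MPB pass; returns the adjusted weights.

    Needs at least two blocks with positive weight, otherwise the
    'blend minus one block' in step 1 is undefined.
    """
    total = sum(weights)
    cur = blend(blocks, weights)

    # Step 1: subtracted blocks = current blend with one block removed,
    # obtained by subtraction instead of a fresh partial blend.
    subtracted = []
    for b, w in zip(blocks, weights):
        rest = total - w
        subtracted.append([(c * total - w * p) / rest for c, p in zip(cur, b)])

    # Step 2: SAD of the current blend vs subtracted and vs input blocks.
    sad_sub = [sad(cur, s) for s in subtracted]
    sad_in = [sad(cur, b) for b in blocks]

    # Step 3: average SAD of each array.
    avg_sub = sum(sad_sub) / len(sad_sub)
    avg_in = sum(sad_in) / len(sad_in)

    # Step 4 (assumed directions): a block whose removal changes the blend
    # much more than average is an outlier, so its weight is divided by spc;
    # a block that is closer to the blend than average gets multiplied by spc.
    new_weights = list(weights)
    for k in range(len(blocks)):
        if sad_sub[k] - avg_sub > th_sub:
            new_weights[k] /= spc
        if avg_in - sad_in[k] > th_add:
            new_weights[k] *= spc
    return new_weights
```

Calling mpb_iteration repeatedly (MPBNumIt times) on its own output would then pull the weight of an outlier block down while raising the weights of blocks that agree with the blend.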

MPBthSub and MPBthAdd are expected to be about thSAD/10, with MPBthAdd about twice MPBthSub (depending on the tr param and the current amount of degraining between input and output), because the SAD of the currently degrained block vs a single subtracted block is significantly lower than the SAD of the degrained block vs a noisy input ref or src.
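
As a starting point for experiments, the rule of thumb above can be written as a tiny helper. The function name mpb_thresholds is hypothetical and the rounding is an arbitrary choice; the result is only an initial guess to be tuned afterwards.

```python
def mpb_thresholds(th_sad):
    """Initial guess: MPBthSub ~ thSAD/10, MPBthAdd ~ 2 * MPBthSub."""
    th_sub = round(th_sad / 10)
    th_add = 2 * th_sub
    return th_sub, th_add
```

For thSAD=150 this gives 15 and 30, the same order of magnitude as the 10 and 20 used in the example call.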

Too high values of MPBthSub or MPBthAdd disable weight adjusting in MPB (for subtracted or incoming blocks respectively). So this may be used to check the result of each part of the processing separately. Setting both MPBthSub and MPBthAdd too high practically disables MPB processing but still slows processing down.

Last edited by DTL; 24th September 2022 at 19:59.
Old 26th September 2022, 14:33   #158  |  Link
anton_foy
Registered User
 
Join Date: Dec 2005
Location: Sweden
Posts: 702
DTL, many thanks for this! So many new features and parameters still to understand.
As I mostly use motion-compensated TemporalSoften rather than MDegrain, my question is if this latest version is suitable for hbd (16bit) using the motion compensation bit? What parameters could you suggest to use and or change in the TemporalSoftenMC (TSMC) script?

Code:
####### modded to adj. smaller mrecalculate blocksize
function TSMC(clip input, int "tradius", int "mthresh", int "lumathresh", int "blocksize", int "rBlock",clip "auxclip", bool "pref", int "Y", int "UV")
{

pref  = Default(pref, true)
Y  = Default(Y,  3)
UV = Default(UV, 2)

t=Defined(tradius)
tradius=t ? tradius : 6
# temporal radius-number of frames analyzed before/after current frame.

m=Defined(mthresh)
mthresh=m ? mthresh : 180
# motion threshold-higher numbers denoise areas with higher motion.
#Anything above this number does not get denoised.

l=Defined(lumathresh)
lumathresh=l ? lumathresh : 255
# luma threshold- Denoise pixels that match in surrounding frames.
#255 is the maximum and default. 0-255 are valid numbers.
#Also adjusts chroma threshold.

b=Defined(blocksize)
blocksize=b ? blocksize : 32
#larger numbers = faster processing times

rBlock = Default(rBlock, 4)

chroma = UV == 3
aux=Defined(auxclip)

    w       = width(input)
    h       = height(input)
    isUHD   = (w > 2599 ||  h > 1499) 
    nw      = round(w/2.0)
    nh      = round(h/2.0)
    inputA  = aux ? auxclip : input
    inputA  = isUHD ? inputA.ConvertBits(8,dither=-1).BilinearResize(nw+nw%2, nh+nh%2) : inputA

    super     = MSuper(input, pel=1, hpad = 0, vpad = 0, chroma=true, mt=true, levels=1)
    superfilt = MSuper(inputA,pel=1, hpad = 0, vpad = 0, chroma=true, mt=true)   # bug can't disable chroma otherwise luma isn't processed
   
    vmulti  = Manalyse(superfilt,multi=true,delta=tradius,temporal=true,truemotion=true,blksize=blocksize,overlap=blocksize/2, mt=true, chroma=true)
    vmulti2 = Mrecalculate(superfilt,vmulti,thsad=mthresh,truemotion=true,tr=tradius,blksize=rblock,overlap=rblock/2, mt=true, chroma=true)
    vmulti2 = isUHD ? vmulti2.MScaleVect() : vmulti2
    mocomp  = Mcompensate(input,super,vmulti2,thsad=mthresh,tr=tradius,center=true,mt=true) # recursion=50 is bugged
    dnmc    = mocomp.temporalsoften(tradius,lumathresh,lumathresh,15,2)
    dec = selectevery(dnmc,tradius * 2 + 1,tradius)
    Y != 3 ? input.mergechroma(dec) : dec }
You mentioned here automatic adjustment according to grain/noise levels.
I tried this with my noise detection feeding ScriptClip, which outputs a dynamic mask, and I am now also experimenting with TemporalSoften dynamically adjusted (tradius and lumathresh) in ScriptClip for TSMC.
Old 26th September 2022, 18:13   #159  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,041
"if this latest version is suitable for hbd (16bit)"

Currently it is not tested for 16-bit. At least with MPB processing enabled the input should be 8-bit (only 8-bit component subtraction functions exist so far), and for output only the 'lsb' mode is compiled (possibly the old AVS 'lsb subplane' format). So even for out16 (with an AVS+ 16-bit plane) it needs to be slightly changed and recompiled.
Also, MPB with a luma blocksize down to 4x4, and so chroma of 1 sample (blocksize 1x1) for YV12, is not tested. It may be better to use the YV24 format with both luma and chroma blocksize of 4x4 (it currently works for YV12 and blocksize 8x8, so chroma is 4x4).

About auto-adjust of main params like thSAD: I am thinking about some simple ways to do it inside MDegrain but have not started yet. The current idea is close to the SCD processing: calculate the average SAD of all blocks in the current frame, or of all valid blocks in the tr scope, and apply some correction multiplier or addition (or some more complex function) to it to compute thSAD and thSAD2 (and thSCD1) internally for each output frame.
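
A minimal Python sketch of that idea, assuming the simplest "multiplier plus addition" form. Nothing like this exists in MDegrain yet; the function name auto_thsad and the params mult, add and thsad2_ratio are hypothetical and only illustrate the arithmetic.

```python
def auto_thsad(block_sads, mult=1.0, add=0.0, thsad2_ratio=0.9):
    """Derive per-frame thSAD/thSAD2 from the average block SAD.

    block_sads: SAD values of all (valid) blocks for the current frame.
    mult, add: hypothetical correction multiplier and offset.
    thsad2_ratio: hypothetical fixed ratio between thSAD2 and thSAD.
    """
    if not block_sads:
        raise ValueError("need at least one block SAD")
    avg = sum(block_sads) / len(block_sads)
    th_sad = avg * mult + add
    th_sad2 = th_sad * thsad2_ratio
    return th_sad, th_sad2
```

A noisier frame has a higher average SAD and so would get proportionally higher thresholds, which is the per-frame adaptation described above.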

"What parameters could you suggest to use and or change in the TemporalSoftenMC (TSMC) script?"

First of all, it does not use MDegrain at all. MCompensate (also used in the widely used QTGMC) is still unchanged from version 2.7.45. Some new features of MDegrain could now be moved to MCompensate too, at least the interpolated overlap modes, and maybe other new additions.
As far as I can see it uses MScaleVect for better speed on UHD. This may not be the best for quality, so you can disable it if better quality is required.
The only thing that can possibly help a bit with speed for this script in the current version is using hardware search for MAnalyse (optSearchOption=5 or 6) with blocksize set to 8 (16 may still be buggy, but may be worth trying for multi-pass search with MRecalculate later).

The quality difference between the degrain approaches
1. Current MDegrainN
2. MCompensate + temporalsoften() + selectevery()
is worth evaluating. If approach 2 is better on some scenes, I think it could be added as one more mode of MDegrainN (maybe as an additional blending mode or a blending param for non-linear blending with lumathresh), so you could skip this script (or replace the MCompensate + temporalsoften + selectevery sequence with a single MDegrainN call) and possibly get better speed with a single MDegrainN.
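
For reference, the frame bookkeeping of approach 2 can be modelled in a few lines of Python. The sketch assumes that MCompensate with tr and center=true outputs 2*tr+1 frames per source frame with the source frame in the middle of each group; the order of the other frames inside a group and the helper names are illustrative assumptions.

```python
def mocomp_layout(num_frames, tradius):
    """Model of the interleaved clip: (source frame, delta) per output frame."""
    return [(n, d)
            for n in range(num_frames)
            for d in range(-tradius, tradius + 1)]


def soften_window(center, tradius):
    """Output-frame indices averaged by temporalsoften around 'center'."""
    return list(range(center - tradius, center + tradius + 1))


def select_every(frames, step, offset):
    """Same frame picking as SelectEvery(step, offset)."""
    return frames[offset::step]
```

With this layout, the temporalsoften window around each center frame covers exactly one group of compensated frames, and select_every(..., tradius * 2 + 1, tradius) keeps only those centers, so the output has one denoised frame per source frame.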

For testing you can try to replace the lines
Code:
mocomp  = Mcompensate(input,super,vmulti2,thsad=mthresh,tr=tradius,center=true,mt=true) # recursion=50 is bugged
    dnmc    = mocomp.temporalsoften(tradius,lumathresh,lumathresh,15,2)
    dec = selectevery(dnmc,tradius * 2 + 1,tradius)
with
Code:
dec=MDegrainN(input,super, vmulti2, tr=tradius, thSAD=mthresh, thSAD2=mthresh-10, mt=false, wpow=4, thSCD1=400, adjSADzeromv=0.5, adjSADcohmv=0.5, thCohMV=16, 
 \ MVLPFGauss=0.9, thMVLPFCorr=50, MPBthSub=10, MPBthAdd=20, MPBNumIt=2, MPB_SPC=1.5)
Though it is not directly compatible with the mthresh param and requires new adjustment. Also, 8-bit input/output formats only.

Also, for quality it is better to use pel=2 or even 4 in MSuper, and to set hpad and vpad to blocksize or larger if crashes happen.

Last edited by DTL; 26th September 2022 at 18:22.
Old 26th September 2022, 23:43   #160  |  Link
anton_foy
Registered User
 
Join Date: Dec 2005
Location: Sweden
Posts: 702
Lovely, I will try this out tomorrow. I did not have a lot of good results with MDegrain, but I did see your tests and was impressed! I will also try MScaleVect with blocksize=8 and MRecalculate, and "optSearchOption=5" for MAnalyse/MCompensate. Thanks!