Doom9's Forum > Capturing and Editing Video > Avisynth Development

Old 1st February 2022, 23:07   #81  |  Link
takla
Registered User
 
Join Date: May 2018
Posts: 184
Quote:
With TR=12 the speed difference may finally reverse to SO=5 is better
DTL: At 4K and TR=6, around 10 to 12 GB of my RAM is used. But at TR=8 I have to cancel the encoding because 99% of my RAM (32 GB) is used. (Possible memory leak? Why does RAM usage increase so much?) So any higher TR at 4K is not possible.

Boulder: there have been plenty of examples in this thread. See post #77.
Old 2nd February 2022, 00:40   #82  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,070
"Why does ram usage increase so much?"

At pel=4, the finest level of the 'super' clip currently takes 16x more RAM than pel=1 (plus the smaller levels), and that is multiplied by the number of AVS+ threads and by the AVS+ cache system. That is probably why the original developers never went down to pel=8, which would need 64x more RAM.

Each 'super' frame for pel=4 at 4K is about 2160 x 17 ≈ 37,000 pixels in height. You can check it yourself: return the 'super' clip and look at its frame size.
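As a rough sanity check of those numbers, here is a back-of-envelope sketch. The assumption is mine (a reading of the description above, not the plugin source): the super frame stacks pel*pel = 16 sub-shifted copies of the finest level plus roughly one more source height worth of coarser pyramid levels, hence the ~17x height.

```python
# Back-of-envelope estimate of 'super' clip frame size for pel=4 at 4K.

def super_height(src_h, pel, extra_levels=1):
    """Approximate stacked 'super' frame height: pel^2 finest-level copies
    plus roughly one source height of coarser pyramid levels."""
    return src_h * (pel * pel + extra_levels)

def frame_bytes_yv12(w, h):
    """YV12 stores 12 bits/pixel: one 8-bit luma plane plus two quarter-size
    chroma planes, i.e. w*h*3/2 bytes."""
    return w * h * 3 // 2

h = super_height(2160, 4)              # ~2160 * 17 = 36720
mb = frame_bytes_yv12(3840, h) / 2**20
print(h, round(mb))                    # roughly 200 MB per super frame
```

Multiplied by the AVS+ thread count and the number of cached frames, a per-frame footprint of this size quickly reaches the tens of gigabytes reported above.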

So maybe ask AVS+ support how to reduce the number of frames AVS+ caches? I read somewhere about a two-parameter Prefetch(), like

Prefetch(N, M)

where one value is the number of threads and the second is the number of cached frames, but I could not find anything about it in the docs. There are also two cache-control values:
SetCacheMode(mode)
Fine-tunes the internal frame caching strategy in AviSynth+.
Available values:
0 or CACHE_FAST_START start up time and size balanced mode (default)
1 or CACHE_OPTIMAL_SIZE slow start up but optimal speed and cache size

Maybe try setting CACHE_OPTIMAL_SIZE?

Also see http://avisynth.nl/index.php/MT_modes_explained - adjusting the MT mode may reduce the number of cached frames.

I run tr=25 with 1080i on a 16 GB system with fairly low RAM usage - maybe about half.

"Possible memory leak?"

A leak typically shows as usage increasing over time. I have run 3-hour transcodes without leakage issues.

A 'fully optimized' mvtools (DX12_ME search, on-shader SAD calculation for all pel modes, and sub-pel shifting inside MDegrainN at processing time) will use 16x less host RAM for pel=4 processing.

Last edited by DTL; 2nd February 2022 at 01:08.
Old 4th February 2022, 19:48   #83  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,070
Quote:
Originally Posted by takla View Post
Code:
ColorBarsHD(1920, 1080)
ConvertToYV12()
tr = 3
super = MSuper (pel=1, levels=1, chroma=false)
multi_vec = MAnalyse (super, multi=true, blksize=8, delta=tr, optSearchOption=5, overlap=0, levels=1, chroma=false)
MDegrainN (super, multi_vec, tr, thSAD=150)
Prefetch(12)
Encoded 600 frames in 17.893s (FFV1)
GPU usage is around 8%. Very low.
It looks like I found where the DirectCompute load is shown: in the Windows 10 Task Manager 'GPU' window, the hardware load graphs can be switched to Compute_0 and Compute_1. All compute-shader load seems to be displayed only in these graphs, though some software may display the summed load of many engines (3D + CUDA + compute + copy + video encode + video decode + ...).
I still have not found what the difference between the Compute_0 and Compute_1 graphs is. While degraining I see roughly equal load in both. But at least it is not the 0..1% load shown in the 3D GPU graph.

With a still in-progress version of sub-pel shifting for pel=2 and pel=4 in the SAD computation, I get about 35..40% load on both the Compute_0 and Compute_1 graphs, on a GTX 1060 with 1920x1080 interlaced processing (about 18 fps output in AVSMeter). Not negligible, but there is still some headroom to add MDegrainN in the future. I also hope that inside the accelerator the shifted (sub-pel motion-compensated) blocks can be reused for both the SAD and the MDegrainN computation.

There was an idea to download the set of sub-shifted blocks to host memory and feed them to MDegrainN, but that loads the memory transfer and is still complex enough. It may be better to add on-CPU sub-shifting to MDegrainN as an intermediate solution before moving all processing to the accelerator.
Old 9th February 2022, 19:48   #84  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,070
A somewhat-working update: https://github.com/DTL2020/mvtools/r.../r.2.7.46-a.08 . All supported pel values are now processed inside the accelerator. I am not sure this is a win for pel 1 and 2 on fast CPUs, but there is still no user-side selection of where to process. To test on-CPU pel 2 and pel 4 processing, the previous version can be used.
The quality of SAD generation for pel 2 and 4 is not well tested yet, and it may differ somewhat from the old on-CPU processing because a different sub-sample shifting kernel is used.
It looks like writing compute shaders in HLSL is not very efficient: the 'compute_X' load is already significant on a GTX 1060 for the relatively simple operation of runtime block shifting with a kernel of about 8 total samples. On the other hand, performance now does not depend on the sub-pel shift value (any float shift is supported at equal speed) and depends only slightly on kernel size (with a half-size kernel of 4 the speed is a bit better, but there is no user-control param yet, only a separate build with an internal compile-time constant). So it is probably most beneficial at pel=4, depending on the balance of CPU and accelerator speed.
Old 9th February 2022, 23:26   #85  |  Link
magnetite
Registered User
 
Join Date: May 2010
Posts: 64
I think I went from 30% load up to 60% with this new build on my GTX 1080 Ti.
Old 10th February 2022, 00:25   #86  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,070
A strategic idea: if sub-sample shifting takes significant time (resources), its result should be reused for degraining, either to decrease host memory read traffic or to save the host CPU from performing the same shift operation a second time.

Possible ways:
1. Finish degraining inside the accelerator (as planned). One limitation already found (at least for the CS 5.1 standard): the HLSL compiler warns that 16384 is the maximum recommended temp array size per thread group. That looks like the 'register file' size limit of one compute core (of some GPU generation)? So in the current version of the shader the thread group was reduced to 4x4 to silence the warning. The compiler allows a larger buffer but warns about degraded performance (it may auto-spill the temp array to main memory?). And each thread currently has only a small buffer, about 3x the block size, to hold the sub-shifted block (H and HV shifted).
But the MDegrainN operation needs to hold a set of 2*tr ref blocks: compute the SADs -> compute weights from the SADs -> normalize the weights and use the blocks in averaging. Unfortunately, additive accumulation of shifted blocks in a single temp buffer is not possible, because to get the weight of any one block in the sum we first need to calculate all the weights and normalize.
So it looks like the sub-shifted blocks cannot be kept in on-chip memory and must be written temporarily to the accelerator's main memory (typically faster than host memory, but usually not by much on mid-range consumer cards). This approach also eats into the accelerator's available memory (both the source frames and the shifted copies must be stored) and limits the maximum possible tr value.

2. Pack the sub-shifted blocks into some framebuffer, download it to host memory, and use it as the source for MDegrainN (instead of a super clip that is 4x or 16x larger for pel 2/4). That would be close to the current output of MCompensate, I think, though it would be a new data stream for mvtools (replacing the 'super' input clip in the MDegrain arguments), and the download from the accelerator plus loading into the host CPU may take some time. It would allow any tr value, because not all source + shifted frames need to be stored in the accelerator's limited memory.
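The normalization constraint described in option 1 can be seen in a toy per-pixel sketch (Python; the linear weight formula and the names are my assumptions for illustration, not the plugin's actual code): every SAD must be known before any single block's final weight exists, so a one-pass additive accumulation cannot work.

```python
# Toy MDegrainN-style weighted averaging for one pixel position.
# All SADs are needed up front: weights are only defined after normalization.

def degrain_pixel(src, refs, sads, thSAD):
    # weight falls off linearly with SAD and is zero at or above thSAD
    w = [max(0.0, 1.0 - s / thSAD) for s in sads] + [1.0]  # src always counts
    vals = refs + [src]
    total = sum(w)                      # normalization needs the full set
    return sum(wi * v for wi, v in zip(w, vals)) / total

# three ref blocks; the 500-SAD block is a bad match and gets weight 0
out = degrain_pixel(100.0, [96.0, 104.0, 150.0], [120.0, 130.0, 500.0], 150.0)
print(out)
```

The bad match contributes nothing, and the result stays close to the source value; but none of the three contributions could have been added to an accumulator before the final `total` was known.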

Also, I am still not sure the HLSL compiler produces the best possible assembly for the convolution; some hand-crafted asm (inline asm, if that is possible in HLSL?) might be faster and use fewer accelerator resources. I need to read more about how the compute units in the accelerator are designed - maybe they are a sort of SIMD dispatch port and can compute FMAs on several floats per clock. I have not yet checked what the current HLSL compiler produces. It can output an asm file, but I need to read about its syntax and how it maps onto the execution units in the shader compute model.
Old 10th April 2022, 20:00   #87  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,070
A first working example of internal low-pass filtering of motion vectors before MDegrainN processing: https://github.com/DTL2020/mvtools/r.../r.2.7.46-a.09

It is not the final filter and not perfect - just a first working example without significant output bugs. Currently only the luma SAD is checked after the new vectors are calculated and compared with thSAD. If a filtered vector's SAD is above thSAD, the original vector from MAnalyse is used.

The initial idea, and the issue of noise (luma + chroma) being converted into spatial (phase) noise when MDegrainN runs with noisy vectors, was described in this post: https://forum.doom9.org/showthread.p...66#post1963966

New control params for MDegrainN: MVLPFCutoff, thMVLPFCorr.

MVLPFCutoff: cut-off frequency of the low-pass filter applied to the motion vector components (dx, dy) along the temporal (tr) axis.
Default 1.0 - additional processing disabled.
Valid range 0.0..1.0. Estimated working range when enabled: 0.05 to 0.5. Values below about 0.01..0.05 probably change nothing, because the internal filter kernel size is currently fixed at 10 taps.

thMVLPFCorr: maximum difference between the original and filtered vector's dx,dy components (either component) for the correction to be kept. If the difference is above this value (not internally scaled to the pel value), the original vector from MAnalyse is used.
Value = 0 (default) disables the correction completely (no LPF-processing effect even with MVLPFCutoff < 1.0). May be useful for fixing bugs on footage with lots of different movement and noise.
Expected good values: pel*(4..10). It is mostly an additional fail-safe limit. If no issues are found it may be set effectively to infinity (e.g. frame_width * pel) to allow very fast movement to be processed. A typical real upper value is about 1.5 * pel * the maximum inter-frame shift of moving subjects.
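The parameter behaviour above can be sketched as a simple temporal filter with the fail-safe fallback. A toy Python sketch follows; the 3-tap kernel, the edge clamping and the function name are illustrative assumptions (the real filter uses a ~10-tap kernel inside MDegrainN):

```python
# Low-pass filter one MV component (dx or dy) along the temporal axis,
# reverting to the original value wherever the correction exceeds thMVLPFCorr.

def lpf_vectors(dx, thMVLPFCorr, kernel=(0.25, 0.5, 0.25)):
    n, half = len(dx), len(kernel) // 2
    out = []
    for i in range(n):
        # convolution with clamped (replicated) edges
        acc = sum(k * dx[min(max(i + j - half, 0), n - 1)]
                  for j, k in enumerate(kernel))
        # fail-safe: keep the original component if the correction is too big
        out.append(acc if abs(acc - dx[i]) <= thMVLPFCorr else dx[i])
    return out

noisy = [10, 12, 9, 11, 60, 10, 12]        # one outlier vector at index 4
print(lpf_vectors(noisy, thMVLPFCorr=50))  # outlier pulled toward neighbours
```

With a small thMVLPFCorr the outlier (and its neighbours) would instead be left untouched, which is exactly the "fail safe" behaviour described above.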

Current production degrain script used for testing (interlaced 1080 source):
Code:
SetFilterMTMode("DEFAULT_MT_MODE", 3)

__source_here___

AddBorders(0,0,0,72)
ConvertToYV12(interlaced=true)
SeparateFields()

tr=15
super=MSuper(last, mt=false, chroma=true, pel=4, hpad=8, vpad=8, levels=1)
multi_vec=MAnalyse (super, multi=true, blksize=8, delta=tr, search=3, searchparam=2, overlap=0, chroma=true, optSearchOption=5, mt=false, levels=1, scaleCSAD=0)
MDegrainN(last,super, multi_vec, tr, thSAD=185, thSAD2=170, mt=false, wpow=4, thSCD1=350, adjSADzeromv=0.5, adjSADcohmv=0.5, thCohMV=16, MVLPFCutoff=0.1, thMVLPFCorr=50)

Weave()
Crop(0,0,0,1080)
Support for overlap processing with the filtered vectors is not implemented yet. It is not complex, but it needs some time.

Also found and fixed a memory leak in MDegrainN - it may have contributed to the issues with the February builds too.

Last edited by DTL; 11th April 2022 at 12:52.
Old 3rd May 2022, 18:05   #88  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,070
One more idea: 'non-simple' degrain scripts typically use a pre-denoised, pre-processed clip as the source for MAnalyse (the super clip for MAnalyse). Sometimes the preprocessing is as simple as low-pass filtering, like a blur. So the idea is to add this simple pre-processing into MAnalyse with the hardware search options, to offload more work to the accelerator. The processing can be done with a compute shader dispatched on the uploaded frames before they are sent to the ME engine. This would free more host resources for MPEG encoding.

Last edited by DTL; 4th May 2022 at 22:16.
Old 4th May 2022, 16:04   #89  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 2,361
That's a good idea. The 'standard', though, is to use MinBlur(), which denoises flat areas more and edge areas less. A 'cheap' alternative similar to MinBlur is the Inter Quartile Median (IQM); that would be easier to implement.
__________________
i7-4790K@Stock::GTX 1070] AviSynth+ filters and mods on GitHub + Discussion thread
Old 4th May 2022, 22:11   #90  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,070
I see the typical 'pre-filter' in QTGMC is

Code:
prefilt = last
w = prefilt.width()
h = prefilt.height()
removegrain(12, 12).gaussresize(w, h, 0, 0, w+0.0001, h+0.0001, p=2).mergeluma(prefilt, 0.1)
That is equal in result (maybe not in speed) to
Code:
Blur(1).gaussresize(w, h, 0, 0, w+0.0001, h+0.0001, p=2).mergeluma(prefilt, 0.1)
Where the combination of Blur() and gaussresize() is two low-pass filters in sequence (they could be merged into a single filter with the combined transfer characteristic).
These are all internal AVS+ processing operators. It may be simplified to SomeLowPassFilter(args).mergeluma(prefilt, 0.1), which means mixing the input plane with weight 0.1 into the low-pass-filtered plane.
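That merging of two cascaded low-pass filters into one can be illustrated by convolving their impulse responses (a toy 1-D Python sketch; the binomial kernels are placeholders, not the actual taps of Blur() or GaussResize()):

```python
# Two low-pass filters in sequence are equivalent to a single filter whose
# kernel is the convolution of the two kernels.

def convolve(a, b):
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

blur = [0.25, 0.5, 0.25]       # simple 3-tap binomial blur kernel
gauss = [0.25, 0.5, 0.25]      # stand-in for the resize kernel
combined = convolve(blur, gauss)
print(combined)                # one wider low-pass kernel, still sums to 1
```

The combined kernel is wider and smoother than either input, which is why the cascade can be replaced by a single filter with the combined transfer characteristic.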

In SMDegrain script I see much more complex pre-filter processing.
Old 6th May 2022, 11:45   #91  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,070
To simulate 'overlap' processing with non-overlapped MAnalyse/MDegrain, I tried making two processing paths with a half-blocksize diagonal shift:
Code:
BkSz=8
BkSz_d2=BkSz/2

AddBorders(BkSz_d2,BkSz_d2,BkSz_d2,BkSz_d2)  

no_sh=last
sh=Crop(BkSz_d2,BkSz_d2,width-BkSz_d2, height-BkSz_d2).AddBorders(0,0,BkSz_d2,BkSz_d2)

tr = 12 # Temporal radius
super_no_sh = MSuper (no_sh, chroma=true, pel=2)
super_sh = MSuper (sh, chroma=true, pel=2)

multi_vec_no_sh = MAnalyse (super_no_sh, multi=true, chroma=true, overlap=0, search=3, searchparam=2, delta=tr, mt=false, optSearchOption=1)
multi_vec_sh = MAnalyse (super_sh, multi=true, chroma=true, overlap=0, search=3, searchparam=2, delta=tr, mt=false, optSearchOption=1)

no_sh=MDegrainN(super_no_sh, multi_vec_no_sh, tr, thSAD=200, thSAD2=190, mt=false)
sh=MDegrainN (sh, super_sh, multi_vec_sh, tr, thSAD=200, thSAD2=190, mt=false)

#back
sh=AddBorders(sh, BkSz_d2,BkSz_d2,0,0).Crop(0,0,width-BkSz_d2, height-BkSz_d2)  

Layer(no_sh, sh, "fast")

Crop(BkSz_d2,BkSz_d2,width-BkSz,height-BkSz)
The result looks much better compared with non-overlapped processing, but still not as smooth as overlap=blocksize/2. The speed is about 2x better compared with overlap=blocksize/2 (CPU-only processing).

Edit: maybe the right solution is not deblocking of the layers but creating a correct blending mask - rhomb-shaped, tiled per block over all blocks of the frame. Will try it with auto-sizing of the mask using AVS+ 'for' loops.
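The rhomb-mask idea can be sketched like this (Python; the diamond shape with a linear Manhattan-distance falloff, and giving the half-block-shifted layer the complement mask so every pixel always receives total weight 1.0, are my assumptions about how such a mask could work):

```python
# Per-block diamond ("rhomb") blending mask: weight 1.0 at the block centre,
# falling linearly with Manhattan distance to 0 at the corners.

def rhomb_mask(bksz):
    """bksz x bksz diamond weight mask for one block."""
    c = (bksz - 1) / 2.0
    return [[max(0.0, 1.0 - (abs(x - c) + abs(y - c)) / (bksz - 1))
             for x in range(bksz)] for y in range(bksz)]

mask = rhomb_mask(8)                              # tile this over all blocks
comp = [[1.0 - v for v in row] for row in mask]   # mask for the shifted layer
print(" ".join("%.2f" % v for v in mask[3]))      # one row near the centre
```

Block centres of one layer then sit over block corners of the other, so each layer dominates exactly where its own blocks are most reliable.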

Last edited by DTL; 6th May 2022 at 17:17.
Old 12th May 2022, 08:43   #92  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,070
New version: https://github.com/DTL2020/mvtools/r.../r.2.7.46-a.10

Added an MVLPFGauss MV low-pass filtering mode to MDegrainN as a single-parameter adjustment. MVLPF is now implemented in all processing modes of MDegrainN (chroma enabled and overlap enabled). Default = 0 (disabled); float param. Expected practical adjustment range 0.5..3.0. Very low values like <0.1 will mostly disable the processing; too-high values like 10 may cause bugs, because the internal convolution kernel is only about 10 samples and too high a sigma makes the kernel effectively rectangular instead of Gaussian. The old two-parameter LPF values MVLPFCutoff and MVLPFSlope still exist, but are shelved for future development because the simple Gaussian kernel already seems to produce good results. The processing speed should not depend on the kernel type. The Gaussian kernel, with no over/undershoot, is expected to give good results; developing another LPF with both a controlled cut-off frequency and a controlled slope, and still no over/undershoot, is more complex and is left for future versions.
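Why a too-high sigma turns the fixed 10-tap kernel rectangular can be seen in a quick sketch (Python; illustrative only, not the plugin's exact kernel code):

```python
# With the kernel length fixed at ~10 taps, a large sigma truncates the
# Gaussian so it degenerates toward a flat (rectangular) window.
import math

def gauss_kernel(sigma, taps=10):
    c = (taps - 1) / 2.0
    k = [math.exp(-((i - c) ** 2) / (2.0 * sigma ** 2)) for i in range(taps)]
    s = sum(k)
    return [v / s for v in k]          # normalize to unity gain

narrow = gauss_kernel(1.0)             # clearly bell-shaped
wide = gauss_kernel(10.0)              # edge tap almost equals the centre tap
print(round(max(narrow) / min(narrow)), round(max(wide) / min(wide), 3))
```

At sigma=1 the centre-to-edge tap ratio is huge (a proper bell), while at sigma=10 it is barely above 1, i.e. a near-rectangular window with the ringing-prone frequency response that implies.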

Added use of the scaleCSAD param defined in MAnalyse to the secondary SAD check after MVLPF processing in MDegrainN.

Fixed a bug in MAnalyse that caused random processing aborts with the error message 'motion vectors clip too short' (in SO=5). It magically worked on non-initialized memory for a long time, but started failing frequently with more than one MAnalyse in the script, or when testing with a single src-ref frame pair and MShow().
Fixed bugs in the shader SAD computation for luma and chroma at pel=2, and for chroma at pel=4.

While experimenting with shifted-layer blending to simulate overlap processing, I found that even small padding of the single layer by about blocksize/2 also makes the output MPEG encoding speed a bit lower. It may be an interaction between the block tessellation of the previously MPEG-compressed source and the block tessellation of the hardware MV search engine. So the current single-layer processing script for a 1080i source is:
Code:
SetFilterMTMode("DEFAULT_MT_MODE", 3)

__source_here___

AddBorders(0,0,0,72)

SeparateFields()

BkSz=8
BkSz_d2=BkSz/2
AddBorders(BkSz_d2,BkSz_d2,BkSz_d2,BkSz_d2)  

tr = 15 # Temporal radius
super = MSuper (chroma=true, pel=4, levels=1)

multi_vec = MAnalyse(super, multi=true, blksize=8, delta=tr, overlap=0, chroma=true, optSearchOption=5, mt=false, levels=1)
MDegrainN(super, multi_vec, tr, thSAD=250, thSAD2=240, mt=false, wpow=4, thSCD1=350, adjSADzeromv=0.5, adjSADcohmv=0.5, thCohMV=16, MVLPFGauss=0.9, thMVLPFCorr=100)

Crop(BkSz_d2,BkSz_d2,width-BkSz,height-BkSz)  
Weave()
Crop(0,0,0,1080)
It looks like the current float-based sub-shifting in the shader (for pel=2 and pel=4) is not good in speed; I need to make an integer-based implementation to see whether it is faster.

A number of tests with pre-filtering the clip for MAnalyse (with 'simple' processing like QTGMC's) still show close to no improvement in output MPEG encoding speed. Maybe the current additional low-pass filtering of the MVs in the time domain before MDegrain works much like low-pass filtering in the spatial domain before MAnalyse.

All the new MDegrainN features also work with the software modes of MAnalyse, so they do not require Win10 + DX12-ME hardware and are applicable to the mvtools from the https://forum.doom9.org/showthread.php?t=173356 thread.

Last edited by DTL; 12th May 2022 at 09:11.
Old 14th May 2022, 05:38   #93  |  Link
takla
Registered User
 
Join Date: May 2018
Posts: 184
Here are my thoughts so far:

I don't see a point in using GPU acceleration at all (unless you somehow manage to make it SIGNIFICANTLY faster than it is now), because the only time the GPU is faster right now is in niche cases like TR>=10 with pel=2 or 4 at 4K. But even in those cases you need over 64 GB of system RAM or you run out of it, as I've pointed out before.

And considering hardware "tiers": within the same price class, a $500 CPU will easily beat a $500 GPU here.

So personally I'd prefer improvements on the CPU side of things, because it makes more sense.
Old 14th May 2022, 08:44   #94  |  Link
tormento
Acid fr0g
 
 
Join Date: May 2002
Location: Italy
Posts: 2,580
Quote:
Originally Posted by takla View Post
I don't see a point in using GPU acceleration at all
Because not everybody owns a 16-core CPU, and whatever resource you can free can be allocated to other filters or to encoding.
__________________
@turment on Telegram
Old 14th May 2022, 12:12   #95  |  Link
takla
Registered User
 
Join Date: May 2018
Posts: 184
Quote:
Originally Posted by tormento View Post
Because not everybody owns a 16 cores CPU and whatever resource you can free, it can be allocated to other filters or encoding.
Good point.
Old 14th May 2022, 15:57   #96  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,070
" the only time GPU is faster right now, is in niche cases like TR=>10 with pel=2 or 4 at 4K. But even in those cases you need over 64GB of system RAM or else you run out of it, as I've pointed out before."

At my current work setup (an oldish i5-9600K CPU + GTX 1060 accelerator), transcoding with x264 at close-to-'placebo' settings gives about 3.5 fps CPU-only and about 5+ fps with the SO=5 option. So with hardware acceleration I can process more footage per work day. Blockiness artifacts without overlap are very rare on that footage - mostly on large fire flames or large smoke. Water looks good.

Also, I have not posted the tests here yet, but the 'star-like' hyperbolic zone-plate sub-sample movement + noise test shows slightly better motion compensation in the V direction (the horizontal part of the hyperbolic zone plate) compared with the internal MAnalyse pel=4 search. I will try to post comparison results next time I am at work.

Interlaced 1080 runs well with 6 threads on a 16 GB Win10 system, taking about 50% of RAM, so I expect 4K to take about 32 GB at 6 threads (with 'typical' AVS+ cache control). But if you run on a massive multi-core CPU with >10 cores it really can overflow 64 GB of RAM. So it may be good to limit the number of threads per AVS instance and leave some cores free for the MPEG encoder.

"will easily beat a $500 GPU here."

I expect that when prices drop after the mining era, older accelerators much cheaper than $500 will be worth installing, 1 or 2 (or more) per host, to free CPU resources for MPEG encoding. With more than one accelerator I hope the 'overlap' simulation can be done either with internal AVS scripting (a masked Overlay() call) or with Fizick's old BlockOverlap plugin, which generates the mask internally and blends two half-blocksize diagonally shifted layers. That plugin is C-only, though, and may not be as fast as the possible AVS+ internal Overlay() filter. The mask for Overlay() can be loaded from an 8x8 BMP file and tiled by script to the required frame size; this also allows any hand-crafted block blending mask made in MSPaint or any other pixel editor. Or maybe there is a scripting-only way to create a clip of the typical 8x8 block size with preset sample values.

Currently I have only about 30% VideoEncoder load, so it may be possible to make a more advanced mod of MAnalyse that sends 2 pairs of frames per command sequence and gets 2 ME output results: a 'special' MAnalyse mode producing an alternative motion-clip format for overlap blending (with block positions different from the 'old' overlap mode of MAnalyse+MDegrain), plus a special MDegrainN mode for the alternative overlap blending. That should be best in speed, but requires more programming work. The quality can already be tested via AVS scripting.

The diagonal shift of the 'shifted' clip copy in MAnalyse, for sending to the ME accelerator, is very easy: just offset the starting read address of the buffer when creating the 'upload' command (and thanks to the typical padding after MSuper, it will not overrun the buffer at the lower-right corner). In scripting it requires an AddBorders+Crop combination, which may produce much more memory bus traffic.

I made an x64 build of Fizick's BlockOverlap plugin for the new AVS+, but have not tested it yet: https://github.com/DTL2020/BlockOverlap

"prefer improvements on the CPU side of things"

That is also planned. Currently in progress: internal (inside the CPU) shifting of blocks for MDegrain at pel > 1. With SO=5 in MAnalyse it can then completely skip creation of the larger sub-shifted planes and reduce memory bus traffic. The latest build has a not-fully-debugged 'tech speed test demo' of this mode: a new MSuper(pelrefine=false) option to disable creation of the pel>1 planes, and MDegrainN(UseSubShift=1) to enable the alternative request of sub-shifted blocks from the Fake* structures. It currently has only an integer AVX2 implementation for 8x8 block size (luma only); AVX512 may be a bit faster. It is not in a production state - it produces distorted output, though it performs the full processing. I hope to get to debugging it soon.

This will also greatly decrease the amount of memory needed for 'super' clips - about 15x less for pel=4. It is not done yet because it requires changing the size of the 'super' clip and checking that nothing in the underlying processing crashes. For now, only the CPU load is disabled in the pelrefine=false MSuper mode, to test the speed benefit, which is much easier. Once this change is finished and the 'super' clip is cropped to only 1x frame size, we get a large memory saving at pel=2 (about 3x) and pel=4 (about 15x) with the hardware ME modes. In theory the 'super' clip could then be mostly eliminated (perhaps storing only the padded 1x frame with levels=1).

Using the same idea of internal scaling of the patch in the CPU register file, I also plan to optimize the MAnalyse search for pel > 1 in the same way. But that requires more complex SIMD programming.

Last edited by DTL; 14th May 2022 at 16:39.
Old 14th May 2022, 19:02   #97  |  Link
tormento
Acid fr0g
 
 
Join Date: May 2002
Location: Italy
Posts: 2,580
Quote:
Originally Posted by DTL View Post
It currenty have only integer AVX2 implementation for block size 8x8 (and luma only) but the AVX512 may be a bit faster.
Please, keep in mind that someone (me) still has AVX CPUs.
Old 15th May 2022, 01:21   #98  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,070
AVX means it does not have fast enough integer operations with the increased-size register file: integer operations are limited to 128-bit SSE2 per op, while AVX2 allows 256-bit integer ops, which is virtually twice as fast. It is better to upgrade to an AVX2 CPU, at least in the 202x years. Intel already promises AVX1024 by the mid-202x.
Old 15th May 2022, 07:25   #99  |  Link
kedautinh12
Registered User
 
Join Date: Jan 2018
Posts: 2,156
Hi, but not everyone has as much money as you.
Old 15th May 2022, 07:36   #100  |  Link
tormento
Acid fr0g
 
 
Join Date: May 2002
Location: Italy
Posts: 2,580
Quote:
Originally Posted by DTL View Post
Intel promises AVX1024 in the mid of 202x already.
Perhaps on Xeons. They are disabling AVX512 in consumer CPUs.