CUDASynth Filters


tebasuna51
10th February 2024, 16:43
It works for me now with the test3 DGDecodeNV_AVX2.dll,
but it is very slow:

AVSMeter 3.0.9.0 (x64), (c) Groucho2004, 2012-2021
AviSynth+ 3.7.3 (r4062, master, x86_64) (3.7.3.0)

Number of frames: 507
Length (hh:mm:ss.ms): 00:00:21.146
Frame width: 1920
Frame height: 1040
Framerate: 23.976 (24000/1001)
Colorspace: YUV420P16

1) Old DGDecodeNV.dll dgsource("test.dgi").DGDenoise()

2) Test3 DGDecodeNV_AVX2.dll dgsource("test.dgi",dn_enable=1)

3) Test3 DGDecodeNV_AVX2.dll dgsource("test.dgi",dn_enable=3)

                                  1)                   2)                   3)
                                  -----------------    -----------------    -----------------
Frames processed:                 507 (0 - 506)        507 (0 - 506)        507 (0 - 506)
FPS (min | max | average):        12.05|23.99|23.49    3.810|4.724|4.656    1.512|9.359|4.579
Process memory usage (max):       393 MiB              598 MiB              627 MiB
Thread count:                     22                   19                   19
CPU usage (average):              8.0%                 8.2%                 8.3%

GPU usage (average):              89%                  98%                  97%
VPU usage (average):              3%                   1%                   1%
GPU memory usage:                 849 MiB              856 MiB              834 MiB
GPU Power Consumption (average):  24.3 W               29.9 W               29.7 W

Time (elapsed):                   00:00:21.585         00:01:48.902         00:01:50.716

tebasuna51
10th February 2024, 16:51
The previous version was faster with dn_enable=1 (it did not work with 2 or 3):

Frames processed: 507 (0 - 506)
FPS (min | max | average): 6.357 | 9.447 | 9.260
Process memory usage (max): 524 MiB
Thread count: 17
CPU usage (average): 8.3%

GPU usage (average): 97%
VPU usage (average): 2%
GPU memory usage: 825 MiB
GPU Power Consumption (average): 29.7 W

Time (elapsed): 00:00:54.754

eac3to_mod
10th February 2024, 16:58
OK, everything is fine. :rolleyes:

Here's the deal.

We had searchw 5/7/9.
Now we have good/better/best.

The good/better/best are scaled higher than searchw (this was discussed at DG forum). That means "good" is already equivalent to searchw = 9. So for accurate comparison to old searchw = 9 you have to choose "good".

And you have to turn off temporal to compare to old.

I hope it is clear. I'll give my numbers shortly.

eac3to_mod
10th February 2024, 17:03
For SSE4.1 no AVX2 or AVX512:

Old with searchw = 9: 475 fps (spatial only)
New with dn=1, quality good: 522 fps (spatial only)(good = searchw 9)
New with dn=3, quality good: 485 fps (spatial and temporal)(good = searchw 9)

For spatial there is a good speed increase. You get temporal denoising for free while still being a little faster. And you have the higher qualities of better and best available that you didn't have before. And don't forget that raw frame serving is 33% faster too in my tests.
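
For readers who want the relative gains rather than raw fps, the SSE4.1 figures above can be turned into percentages. This is only a small arithmetic sketch; the helper name `percent_change` is made up for illustration, and the inputs are the numbers quoted in this post.

```python
def percent_change(new_fps, old_fps):
    """Relative change of new_fps against old_fps, in percent."""
    return (new_fps - old_fps) / old_fps * 100.0

# SSE4.1 figures quoted in the post above.
old_spatial = 475.0   # old DLL, searchw = 9, spatial only
new_spatial = 522.0   # new DLL, dn=1, quality good
new_both = 485.0      # new DLL, dn=3, quality good

print(f"spatial only:       {percent_change(new_spatial, old_spatial):+.1f}%")
print(f"spatial + temporal: {percent_change(new_both, old_spatial):+.1f}%")
```

With these inputs the spatial-only path comes out about 9.9% faster than the old DLL, and spatial plus temporal about 2.1% faster.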

Will give AVX numbers shortly.

eac3to_mod
10th February 2024, 17:18
AVX2:

Old with searchw = 9: 475 fps (spatial only)
New with dn=1, quality good: 524 fps (spatial only)(good = searchw 9)
New with dn=3, quality good: 505 fps (spatial and temporal)(good = searchw 9)

AVX512:

Old with searchw = 9: 475 fps (spatial only)
New with dn=1, quality good: 524 fps (spatial only)(good = searchw 9)
New with dn=3, quality good: 508 fps (spatial and temporal)(good = searchw 9)

So AVX is a big step up. AVX512 is very minor over AVX2.

AVX is used only for the temporal denoising. When I move it to the GPU it will get even better. I did it this way to leverage Avisynth+ frame caching and secondarily because I could do it quickly. I'll have to cache myself to move it to the GPU.

Another thing not to forget is that the more filters in the chain, the more relative speed improvement there will be, as more and more expensive PCIe transfers are avoided. This is why I intend to add more filters, even those supplied by third parties. If not too difficult I'll even port them to CUDA. This case with one filter doesn't showcase the CUDASynth philosophy, but it already shows significant gains. The filter chain with HDR to SDR and denoising starts to get quite impressive, for example.

tormento
10th February 2024, 17:18
Only a cheap NVIDIA GeForce GT 1030 with 2 Gb GDDR5 Video resolution 1920x1040
Doesn't this important progress in CUDA filters deserve its own thread?

Perhaps much better not to confuse people.

tormento
10th February 2024, 17:24
Here's the deal.
Eagerly waiting my queue to finish to try it.

Do dn_strength and dn_cstrength have linear or nonlinear effects? In the second case, how is the "curve" shaped?

When you have time, please consider allowing an arbitrary value for searchw.

eac3to_mod
10th February 2024, 17:46
@teba

Can you please post a screen shot of Info() for your CPU? It will help me to address your issue with AVX512, though AVX512 is not a big improvement over AVX2 in this application. Thank you.

tormento
10th February 2024, 17:47
I could consider another bump to ultra but, honestly, when I tried that I could not perceive any improvement.
You tried with really noisy material?
The technical reason is that CUDA loop unrolling can only be done when the loops have constant limits.
Can I pretend to have understood? :p

About dn_quality, dn_strength and dn_cstrength: what if I want to use different "values" for spatial and temporal denoising?

Do you plan to release a standalone cuda filter, detached from the DGSource video decoding?

Some videos, such as AVC 10 bit or with insane bframes/ref numbers can't be decoded in hardware.

tormento
10th February 2024, 18:06
Decoding only tested and working on i7-2600k, i.e. AVX only.

I will let it run for a whole night on different material to see if there are hiccups.

ChaosKing
10th February 2024, 18:17
Was it "temporal_sample.mkv"? Dropbox says it was 5 years ago... time flies

eac3to_mod
10th February 2024, 18:27
Yeah, that's the one. ;)

tormento
10th February 2024, 19:22
The DGDenoise() in the new DLL already uses the new stuff.
My two cents for the future development of the separate function.

If possible, now that we have 2 filters (and more will come, I hope) that work in GPU memory, let's have a single function that can apply the different filters internally to the same span of frames (x number of frames in a GPU buffer?).

To be clear: let's say we have to convert HDR to SDR and denoise it.

It would be a pity to have to call 2 different functions that move the frame back and forth.

I hope to have explained my idea.

tormento
10th February 2024, 23:12
Filters that are able to run in place do so.
Thanks for your explanation.

:thanks:

DTL
11th February 2024, 07:20
Another option is to move to the Intel compiler, which supports runtime dispatching of code for different architectures. However, it is still possibly a bridge too far for me.

Typically, enabling AVX512 in the C compiler gives very little performance benefit. So you can simply make several builds (SSE2/AVX2/AVX512) and label them for users, so that each user can pick the executable that matches the machine it will run on.

If you really write some parts of the program for a specific SIMD level, you do not need to enable that architecture in the compiler options. The compiler selects it automatically from the program text. All selection of which program parts to use is then done manually, using the CPU flags returned by AVS (so the user can lower the CPU caps flags for debugging, if required, using a script function).

So you build a single executable for SSE2, and it will be compatible with most current chips while still having AVX512 parts where required. The performance penalty is typically very small. Most of the benefit typically comes from using very advanced compilers like LLVM (though that is still not as good as manual SIMD programming).
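
The manual selection described above can be sketched as follows. This is only an illustration of the dispatch logic, not code from any of the filters discussed: the flag names and the `brighten_*` kernels are hypothetical stand-ins for hand-written intrinsics routines, and a real plugin would take the flags from the host rather than from a Python set.

```python
# Hypothetical kernels: each stands in for a hand-written SIMD routine.
# All paths must produce the same result; only the speed would differ.
def brighten_sse2(row):
    return [min(255, v + 1) for v in row]

def brighten_avx2(row):
    return [min(255, v + 1) for v in row]

def brighten_avx512(row):
    return [min(255, v + 1) for v in row]

# Ordered from most to least capable (flag names are assumptions).
KERNELS = [
    ("avx512", brighten_avx512),
    ("avx2", brighten_avx2),
    ("sse2", brighten_sse2),
]

def select_kernel(cpu_flags):
    """Return the most capable kernel whose flag is reported as available."""
    for flag, kernel in KERNELS:
        if flag in cpu_flags:
            return kernel
    raise RuntimeError("no supported instruction set reported")

# A CPU reporting SSE2 and AVX2 gets the AVX2 path.
flags = {"sse2", "avx2"}
kernel = select_kernel(flags)
print(kernel.__name__, kernel([10, 255]))

# Lowering the reported flags (e.g. for debugging) forces the baseline path.
print(select_kernel(flags - {"avx2"}).__name__)
```

Because selection happens once at run time, a single build can serve every CPU, and masking a flag is enough to test a slower path.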



So AVX is a big step up. AVX512 is very minor over AVX2.



If your application is not heavily limited by host RAM performance, AVX512 can provide a 2..4x or greater performance gain if used properly. I have not seen the sources of the current AVX512 implementation in this software. Typically, for AVX512 you need to load a work unit about 4x larger than for AVX2 (32x512-bit vs 16x256-bit data words) into the register file of the AVX512 SIMD unit, and organize the computation so that it uses the new shorter instructions and combines several operations for superscalar processing on several dispatch ports in parallel. Some examples are in Asd-g's AVX512 sources for vsTTempSmooth - https://github.com/Asd-g/AviSynth-vsTTempSmooth/blob/master/src/vsTTempSmooth_AVX512.cpp

Unfortunately, current C compilers are not smart enough to do this job internally (automatic loop unrolling with the work unit size chosen per SIMD architecture, and automatic combining of instructions for superscalar execution), so programmers need to write large blocks of repeated intrinsics and adjust the work unit size for each loop pass manually to get the best performance on the target SIMD architecture.


The temporal denoising has no parameters to tweak. It is a modified temporal median filter rolled quickly because I had to make ChaosKing happy.

There is also some sort of temporal median implementation in AVX2 and AVX512 - https://github.com/Asd-g/AviSynth-vsTTempSmooth/blob/master/src/vsTTempSmooth_pmode1_AVX512.cpp
https://github.com/Asd-g/AviSynth-vsTTempSmooth/blob/master/src/vsTTempSmooth_pmode1_AVX2.cpp

With temporal radius > 2 (1?) it runs several times faster than the old implementations of temporal median (in MedianBlurTemporal()) and uses the hardware gather instructions of AVX2/AVX512 to keep all computation in the SIMD domain, without falling back to a scalar gathering loop (which largely defeats the whole idea of SIMD parallelism).
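
For anyone unsure what a plain temporal median does, here is a minimal scalar sketch. It is not the vsTTempSmooth or MedianBlurTemporal code; the function name, the list-of-lists frame layout and the clamping at the clip ends are all assumptions made for illustration.

```python
from statistics import median_low

def temporal_median(frames, index, radius):
    """Per-pixel median over frames[index - radius .. index + radius].

    frames is a list of frames, each a list of rows of integer samples.
    The window is clamped at the start and end of the clip.
    """
    first = max(0, index - radius)
    last = min(len(frames) - 1, index + radius)
    window = frames[first:last + 1]
    height = len(window[0])
    width = len(window[0][0])
    # median_low always returns an actual sample, so the output stays integer
    # even when clamping leaves an even number of frames in the window.
    return [
        [median_low(f[y][x] for f in window) for x in range(width)]
        for y in range(height)
    ]

# Three 1x2 frames; the middle frame has a noise spike in its second pixel.
clip = [[[10, 10]], [[12, 200]], [[11, 11]]]
print(temporal_median(clip, index=1, radius=1))
```

The spike at 200 is rejected because it is an outlier along the time axis. The SIMD versions linked above do this kind of per-pixel selection on many pixels at once.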

But anyway, it is only a temporal median implementation. For denoising real moving pictures it must be given a motion-compensated frame sequence as input (otherwise the performance/quality will be very poor with a simple 'thresholding skip' method). So if CUDA can provide ME data from the hardware ASIC, and you can use it to build motion-compensated frames, you can use a temporal median as the final output stage (or simple weighted averaging, as in MDegrain). As I read it, NVIDIA also provides an API for CUDA programmers to get ME data from the MPEG encoder ASIC where available. Another implementation is for DX12.
See https://docs.nvidia.com/video-technologies/video-codec-sdk/11.1/nvenc-video-encoder-api-prog-guide/index.html
Motion Estimation Only Mode

NVENC can be used as a hardware accelerator to perform motion search and generate motion vectors and mode information. The resulting motion vectors or mode decisions can be used, for example, in motion compensated filtering

Also some temporal noise reduction in openCV ? Hardware accelerated by NVIDIA ? Or any OpenCV device ? https://docs.nvidia.com/vpi/sample_tnr.html

tebasuna51
11th February 2024, 10:14
@teba

Can you please post a screen shot of Info() for your CPU? It will help me to address your issue with AVX512, though AVX512 is not a big improvement over AVX2 in this application. Thank you.
Here it is, with 8 GB DDR4:

tebasuna51
11th February 2024, 11:44
Can you please change the title to simply "CUDASynth Filters", as there will be more filters than you listed...
Done.
BTW, you are the thread owner, if you want to change it again.

Your CPU does not support AVX512, so you'll have to run the AVX2 DLL.
Of course.

tormento
11th February 2024, 12:03
>8 hours of encoding and no video decoding hiccups at all. :cool:

@eac3to_mod

I suggest sticking to AVX/AVX2 for now, as the AVX architecture is going to change in 6 months with Arrow Lake and its AVX10 instruction set.

Anyway, I doubt that Intel will reintroduce AVX-512 on consumer products until it solves its power consumption problems.

Perhaps the topic title could be a bit misleading. Cudasynth was a 2013 (i.e. dead) project for a VST synthesizer that offloaded processing to CUDA-enabled devices.

Plus, if I am not wrong, it was the title of the DG branch of AVS, and that's not the case here, as we are still using standard AVS+.

Perhaps 'DG CUDA filters'? :)

tormento
11th February 2024, 13:01
Thank you for your interest and contributions.
You are most welcome, and thank you for your hard work in facilitating our hobby (for some) and job (for others).

Julek
11th February 2024, 15:32
For AVX selection the VCL2 lib (https://github.com/vectorclass/version2) can help; you can use instrset_detect() as in this example (https://github.com/HomeOfVapourSynthEvolution/VapourSynth-TCanny/blob/14ac2ceeb59afc7089974d0ae233fe8d0ea183c8/TCanny/TCanny.cpp#L585).

eac3to_mod
12th February 2024, 02:15
@Julek

Thank you for the links.

tormento
12th February 2024, 07:04
1) What is the bitdepth of internal processing? Does it use 8 bits for 8 bit source and 16 bits for >8 bit source?

2) If the former, could we force 16-bit internal processing for more precision, with 8 or 16 bit output?

3) If we resize the video internally, are denoising/HDRtoSDR applied to the resized or the full-size frame?

There is an error in Notes.txt:

dn_cstrengh (default 0.0)

instead of

n_cstrenght (default 0.0)

Beside that, the denoising quality is great! I will try it with some more footage.

I will test DGHDRtoSDR now.

Please consider adding HLG output to DGHDRtoSDR. I know it will be trivial for you.

tormento
12th February 2024, 08:41
Just playing around, I have found something strange.

Look here (https://slow.pics/c/Zhk7nyKj).

First script (crop, resize and SDR conversion)

LoadPlugin("D:\Eseguibili\Media\DGDecNV\DGDecodeNV.dll")
DGSource("M:\In\- 1_58 Il labirinto del fauno 4k\fauno4k.dgi", ct=48,cb=48,cl=0,cr=0, rw=1920,rh=1032, h2s_enable=1,h2s_white=2000)

Second script (first+denoise 0.05)

LoadPlugin("D:\Eseguibili\Media\DGDecNV\DGDecodeNV.dll")
DGSource("M:\In\- 1_58 Il labirinto del fauno 4k\fauno4k.dgi", ct=48,cb=48,cl=0,cr=0, rw=1920,rh=1032, dn_enable=3,dn_strength=0.05,dn_cstrength=0.05,dn_quality="best", h2s_enable=1,h2s_white=2000)

Third script (first+denoise 0.10)

LoadPlugin("D:\Eseguibili\Media\DGDecNV\DGDecodeNV.dll")
DGSource("M:\In\- 1_58 Il labirinto del fauno 4k\fauno4k.dgi", ct=48,cb=48,cl=0,cr=0, rw=1920,rh=1032, dn_enable=3,dn_strength=0.10,dn_cstrength=0.10,dn_quality="best", h2s_enable=1,h2s_white=2000)

I can see 4 things:

1) color artifacts on scene change
2) sort of "duplicate" image superimposing, a bit hard to see on screenshots but really evident in motion
3) when denoised, video brightness is different
4) posterization (look at the face of the girl) but perhaps 0.05 denoise is too high for that scene

dn_enable=1 (spatial) works perfectly
dn_enable=2 (temporal) gives both color artifacts and smearing
dn_enable=3 (spatial+temporal) gives both color artifacts and smearing, exacerbating color artifacts
dn_cstrength=0.0 (no chroma denoising) solves the color artifacts but not smearing
dn_strength=0.00,dn_cstrength=0.05 solves nothing :)

Here (https://pixeldrain.com/u/eeN317Qo) you can find the test clip (on the left there is the download button).

Frames are 83-85.

tormento
12th February 2024, 12:18
If I recall correctly, some time ago DGSource had the ability to force the input to 16-bit depth. Could you introduce it again so we can apply all the internal filters with 16-bit precision?

The resizing applies to the decoded frame from NVDec and before any further filtering.
That is a blessing (speed) and a damnation. Noise, especially the digital/postproduction one, can be single pixel grain that would "smudge" the image when resized. It would be nice (as I said on other board) to be able to apply the resizing when we need. And what about HDRtoSDR, is it applied before or after denoising? I think that sorting out a way to order filters would be nice, the more you add, as it can lead to different results. Take, as example, all the various deinterlacing/decimating ones.
there is DGPQtoHLG, which could be considered for integration
I completely forgot that filter. If integrated it would be a great thing.
Thank you for the clip, which will be useful for my testing.
As always you are the most welcome.

anton_foy
13th February 2024, 08:26
Great to see this coming up!!!
I don't know if this is of any help but:

https://github.com/AN3223/dotfiles/blob/master/.config/mpv/shaders/nlmeans_temporal.glsl

Uses some kind of motion estimation for the temporal denoising.

eac3to_mod
13th February 2024, 12:45
Thank you for the link. I don't grok shaders and it claims to be buggy with no releases, so it won't be of much help.

anton_foy
13th February 2024, 13:24
Yeah I figured as much. Looking forward to the next version with upped temporal denoising quality. The temporal part I think is the most important as it is motion we are dealing with.

Edit: when it comes to spatiotemporal denoisers I think FFT3D renders pretty good results but the temporal part should have been more developed and one side effect of the denoising is banding/blotches issue which for example vaguedenoiser does not suffer from (my favourite spatial denoiser btw.).

DTL
14th February 2024, 11:03
Great to see this coming up!!!
I don't know if this is of any help but:

https://github.com/AN3223/dotfiles/blob/master/.config/mpv/shaders/nlmeans_temporal.glsl

Uses some kind of motion estimation for the temporal denoising.

Estimation is only the beginning of the process. Before temporal denoising, frames must be backward motion compensated (so the frames before and after are motion compensated to the 'current' frame while keeping all the noise/distortions of those frames), and after compensation they are finally passed to some denoise engine (median/statistics based, or simple weighted arithmetic-mean blending).

So a practical temporal denoise pipeline has 3 stages:
Motion Estimation -> Motion Compensation -> Denoise Stage.

For better results the stages may be iterated (multi-generation motion estimation refinement, where the denoised frame from the previous generation is fed back to the motion estimation analyzer as the current or reference frame). The ME stage is the most compute-heavy; MC and the denoise stage are simple enough but require high memory subsystem performance.
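
The three stages can be shown on a toy 1D "frame". This is a deliberately simplified sketch, not how mvtools or any hardware ME works: it searches a single global integer shift by SAD instead of per-block vectors, and all function names are hypothetical.

```python
def sad(a, b):
    """Sum of absolute differences between two rows."""
    return sum(abs(x - y) for x, y in zip(a, b))

def shift(row, d):
    """Move the row content d samples to the right, repeating edge samples."""
    n = len(row)
    return [row[min(max(i - d, 0), n - 1)] for i in range(n)]

def estimate_motion(ref, cur, search=2):
    """Stage 1 (ME): the shift of ref that best matches cur, by minimum SAD."""
    return min(range(-search, search + 1), key=lambda d: sad(shift(ref, d), cur))

def compensate(ref, d):
    """Stage 2 (MC): warp the reference frame onto the current frame."""
    return shift(ref, d)

def denoise(cur, compensated_refs):
    """Stage 3: plain average of the current and the compensated frames."""
    frames = [cur] + compensated_refs
    return [sum(samples) / len(frames) for samples in zip(*frames)]

prev = [0, 0, 9, 0, 0, 0]   # object at position 2
cur = [0, 0, 0, 9, 0, 0]    # object moved one sample to the right

d = estimate_motion(prev, cur)
print("estimated shift:", d)
print("with MC:   ", denoise(cur, [compensate(prev, d)]))
print("without MC:", denoise(cur, [prev]))
```

Averaging without compensation smears the object across two positions, which is the ghosting discussed earlier in the thread; with compensation the object stays sharp and only the noise would be averaged.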

So, for a fully hardware-accelerated temporal denoiser, the ME data is expected to come from the MPEG encoder hardware ASIC (NVENC in the NVIDIA case), while the MC and DN stages are shader-based and dispatched on the general-purpose GPU cores connected to high-bandwidth GPU RAM.

It is possible to implement the ME stage (a simple shader-based MAnalyse) on the general-purpose GPU compute cores, and it may give better quality (the same as on-CPU MAnalyse), but performance may be lower compared with the MPEG hardware ASIC. Also, as I remember, there is already a working ME of simple MAnalyse on CUDA. It is copied in the pinterf repository with CUDA filters, in the KTGMC part: https://github.com/pinterf/AviSynthCUDAFilters/tree/master/KTGMC .

Where https://github.com/pinterf/AviSynthCUDAFilters/blob/master/KTGMC/MV.cpp looks like a simplified MSuper+MAnalyse implementation on CUDA with a real MV search. So for a complete temporal denoiser, only the simple MC+DN parts need to be designed for CUDA too (like a copy of the old simple MDegrainN).


Those of you that were around in the early days of desktop image processing starting with VirtualDub filters may remember a filter called TemporalCleaner by Jim Casaburi. The concept was simple:

---
For each pixel in the current frame, if the difference between the previous frame's corresponding pixel and the current pixel is below a threshold then replace the current pixel by the average of the previous and current pixels, otherwise keep the current pixel.
---
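
The quoted rule is short enough to write down directly. This is a sketch of the rule as stated, not Jim Casaburi's original code: the function name, the list-of-rows frame layout and the integer (floor) averaging are assumptions.

```python
def temporal_clean(prev, cur, threshold):
    """Apply the quoted per-pixel rule to two frames of equal size.

    If |prev - cur| is below the threshold, output the average of the two
    samples; otherwise keep the current sample unchanged.
    """
    return [
        [(p + c) // 2 if abs(p - c) < threshold else c
         for p, c in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(prev, cur)
    ]

prev_frame = [[100, 100]]
cur_frame = [[104, 160]]   # small noise on pixel 0, real change on pixel 1
print(temporal_clean(prev_frame, cur_frame, threshold=10))
```

The first pixel is blended to 102, while the second keeps its value of 160 because the difference exceeds the threshold.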



It looks like the most advanced implementation of these ideas is currently vsTTempSmooth(). But without motion compensation it is still very poor in quality, or can only remove very slight visible noise (though it still visibly helps the MPEG encoder in some use cases).

DTL
14th February 2024, 14:09
vsTTempSmooth is sort of advanced by usage of some weighting from close temporal radius (higher weight) to far samples (optional lower weight). So it attempts to lower possible motion blur without using motion compensation.

If you simply try any method of motion detection and skip denoise in these areas it will cause significantly uneven denoise over the total frame. Only mostly static areas will be denoised. The purpose of motion compensation not only saves from motion blur but also makes denoising of moving areas possible. So it is the key operation for denoise of moving pictures in comparison with static images denoising.

" based on the long-dead CUDA-enabled Avisynth."

Yes - it uses CUDA (GPU) memory management by (some branch of) AVS. So a programmer of a separate filter must rewrite it for its own GPU memory management. As for quality you can check classic mvtools old versions like 2.7.45 or even more old. I think the main algorithms of MVs search (see PlaneofBlocks class) have not changed significantly from the very old mvtools designs and at some point were ported into that KTGMC project.
The main difference is that before Win10 and the main effort by Microsoft to make ME API for hardware ASICS available for end users we only had the software-based ME engines (onCPU or some onGPU implementations like that KTGMC project). And after Win10 we finally have compatible ME API from hardware accelerators at end-users GPU chips. Maybe with CUDA SDK NVIDIA provides some UNIX-compatible ME API too for CUDA-software designers.
So currently we have 3 different open source and public freeware API for most compute-loaded ME processes:
1. onCPU mvtools
2. onGPU CUDA-based KTGMC
3. onGPU hardware ME ASIC accelerated from MPEG encoder via DX12 or CUDA-API.

All 3 methods return a compatible MV format with precision up to qpel. The only limitation of the hardware ASIC is typically a block size of only 8x8 or 16x16 and a colour format of NV12 (YV12). Software implementations can support searches with any block size, bit depth and colour format.

Method 3 makes it possible to use most of the resources available on the end user's machine: the hardware MPEG ASIC (for ME) + GPU general-purpose cores (for MC+DN) + CPU general-purpose cores for other processing, so in theory it can provide maximum performance.

eac3to_mod
14th February 2024, 16:01
Alice in Wonderland.

LigH
14th February 2024, 16:16
:confused:

I may be missing something; but reading such a reply which seems to lack respect for another person's efforts could discourage me from helping any further...

tebasuna51
15th February 2024, 13:40
I accidentally deleted the thread trying to edit my first post. You can either restore it or not I don't really care. It won't be so useful after I delete all my posts. If they want to take over my thread let them have it.

I will only be posting about eac3to from now on.

I restored the thread to preserve users' contributions, but I closed it because the owner doesn't want to continue posting here.