5th January 2011, 23:34   #1
Malcolm
NLMeansCL: GPU based Non Local Means Denoising

Hi,
I would like to introduce a new filter for AviSynth: NLMeansCL.
The filter is my take on the NLMeans algorithm. Tritical already wrote TNLMeans in 2006, which is also an implementation of the NLMeans algorithm. (Thanks for your work, tritical!)
In contrast to tritical's implementation, which is written in C++ and runs on the CPU, my implementation is written in OpenCL and (typically) runs on the GPU.
Note: I will update this post to reflect all changes to the filter. The most recent modifications will be marked in blue.

I was only able to test the filter on my NVIDIA GeForce 9600 GT. Therefore I cannot guarantee that it runs on your GPU, or that it will not crash or kill your PC! (Not that I think it would...) The wrapper around the OpenCL algorithm is written in C#.

Syntax:
NLMeansCL(int A, int Ay, int Az, int S, int Sy, int B, int By, float aa, float h, float hC, int plane, bool debug, string debugpath, string smf, bool cpu, bool buffer, bool sse)

If you want some background information about the NLMeans algorithm, I'd like to point you to the very well written readme.txt that is part of tritical's TNLMeans filter package. He also explains all the parameters of the filter in detail.

The syntax of NLMeansCL:
  • Parameter A: Sets the value for Ax and Ay. If you want to use the same values for Ax and Ay, you only have to specify one parameter. The same applies for S (-> Sx, Sy) and B ( -> Bx, By). You can specify a different value for Ay by using the explicit Ay parameter.
    (default = 4)
  • Parameter Az: Sets the temporal radius. At the moment only 1 and 2 are supported. (default = 0)
  • Parameter S/Sy: default = 2
  • Parameter B/By: default = 1
  • Parameter aa: This parameter is equivalent to the parameter 'a' in TNLMeans. (default = 1.0)
  • Parameter h/hC: h defines the strength of the filter for both the luma and chroma planes. hC complements h: with hC you can set a different filter strength for the color channels U and V. (default = 1.8)
  • Parameter plane: Here you can specify, which color planes should be processed (similar to FFT3DFilter):
    0 - luma (Y),
    1 - chroma U,
    2 - chroma V,
    3 - chroma planes U and V,
    4 - both luma and chroma
    5 - copy all planes. In this mode, all planes are just copied by a very simple OpenCL kernel. In case NLMeansCL does not run on your GPU, this might help to check if a very simple OpenCL kernel executes.
    (default = 0)
  • Parameter debug: Specifies whether a file with debug information and error messages should be written. The debug file is named 'NLMeansCL_debug.txt'. (default = false)
  • Parameter debugpath: Specifies the path to the debug file. (default = C:\Temp\)
  • Parameter smf: This is a purely technical parameter to specify different memory allocation strategies. Possible values are "nan", "ahp", "chp", "achp" and "uhp". (default = ahp)
  • Parameter cpu: Specifies if the filter should be executed on the CPU rather than the GPU. (default = false)
  • Parameter buffer: Specifies if OpenCL buffers are used instead of OpenCL image objects to process the video data. (default = false)
  • Parameter sse: Specifies if the sum of squared differences is used (sse=true) or the sum of absolute differences (sse=false). (default = true)
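
To make the A/Ay, S/Sy, B/By shorthand and the sse switch more concrete, here is a small Python sketch. It is not taken from the filter's source: the helper names (resolve_radius, window_size, patch_distance) are made up for illustration, and it assumes, following the description in tritical's TNLMeans readme, that the values are radii, so a radius r covers 2r+1 pixels.

```python
def resolve_radius(both, y_override=None):
    """Return (x, y) radii. The y radius falls back to the shared value
    unless an explicit override (e.g. Ay, Sy, By) is given."""
    y = both if y_override is None else y_override
    return (both, y)


def window_size(radius_x, radius_y):
    """Assumption: a radius r means a window of 2*r + 1 pixels."""
    return (2 * radius_x + 1, 2 * radius_y + 1)


def patch_distance(patch_a, patch_b, sse=True):
    """Sum of squared differences (sse=True) or
    sum of absolute differences (sse=False) between two patches."""
    total = 0
    for pa, pb in zip(patch_a, patch_b):
        d = pa - pb
        total += d * d if sse else abs(d)
    return total


ax, ay = resolve_radius(4)       # A=4        -> Ax=4, Ay=4
sx, sy = resolve_radius(2, 1)    # S=2, Sy=1  -> Sx=2, Sy=1
print(window_size(ax, ay))       # (9, 9)
print(window_size(sx, sy))       # (5, 3)
print(patch_distance([10, 20, 30], [12, 20, 27], sse=True))   # 13
print(patch_distance([10, 20, 30], [12, 20, 27], sse=False))  # 5
```

The squared variant punishes large single-pixel differences more strongly than the absolute variant, which is the practical difference between sse=true and sse=false.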

Running the filter on CPUs:
To run the filter on the CPU, you have to install the ATI Stream SDK. Furthermore, you have to set two parameters: cpu=true and buffer=true.
The parameter combination cpu=true and buffer=false will fail to execute, since AMD has not yet implemented image support in the CPU version of its OpenCL drivers!
The support for buffers is preliminary. That means I'm undecided whether I will keep or remove it in future versions of the filter. If AMD adds image support to its drivers (either CPU or GPU), there is not much reason to leave it in the filter.
NLMeansCL will take full advantage of multiple CPU cores. There is no need to use MT() or setMTmode()! (That will rather degrade performance.)
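
The rules for the cpu and buffer parameters boil down to a small table. The Python helper below is only a sketch of those rules as described in this post; describe_mode is a hypothetical name and nothing like it is part of the filter.

```python
def describe_mode(cpu=False, buffer=False):
    """Summarise what a cpu/buffer combination means, per the post above."""
    if cpu and not buffer:
        # Stated in the post: the CPU driver has no OpenCL image support yet.
        raise ValueError("cpu=true requires buffer=true")
    device = "CPU (ATI Stream SDK required)" if cpu else "GPU"
    memory = "OpenCL buffers" if buffer else "OpenCL image objects"
    return f"{device}, {memory}"


print(describe_mode())                        # default: GPU with image objects
print(describe_mode(cpu=False, buffer=True))  # GPU with buffers (slower on my card)
print(describe_mode(cpu=True, buffer=True))   # CPU with buffers
```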

When buffers are forced instead of images, the fps on my GeForce drops from 23.05 to 3.90.

Here's how the content of the debug log file will look if the filter initializes correctly:
Code:
NLMeansCL Version 0.3.2
ScriptEnvironment present.
Number of OpenCL Compute Platforms = 2.
Trying OpenCL Compute Platform
  NVIDIA Corporation.
  OpenCL 1.0 CUDA 3.2.1.
Number of OpenCL Devices in Platform = 1.
Trying OpenCL Device GeForce 9600 GT.
  Device available.
  Wrong Device Type (Gpu) requesting Cpu.
Trying OpenCL Compute Platform
  Advanced Micro Devices, Inc..
  OpenCL 1.1 ATI-Stream-v2.3 (451).
Number of OpenCL Devices in Platform = 1.
Trying OpenCL Device Intel(R) Core(TM)2 CPU          4400  @ 2.00GHz.
  Device available.
  Device Type Cpu.
  Device does not support images.
Using Device Intel(R) Core(TM)2 CPU          4400  @ 2.00GHz.
OpenCL Compute Context successfully created.
OpenCL Command Queue successfully created.
OpenCL Program successfully built.
Prog Y Build log: 
Prog UV Build log: 
OpenCL kernels successfully created.

Now, here is what you have to do to get the filter running:
1. The NLMeansCL Filter DLL itself:
Link and attachment at the end of my post.
Put it in your AviSynth plugin folder. And don't rename it!

2. CLOO: A .net library for OpenCL. Needed to run NLMeansCL.
You can download it here: http://sourceforge.net/projects/cloo/
Take the Cloo.dll file from \bin\release inside the zip file and put it in your avisynth plugin folder.

3. AvsFilterNet: A .net library to write Avisynth filters. Needed to run NLMeansCL.
You can download it here: http://avsfilternet.codeplex.com/
Take the AvsFilterNet.dll and put it in your avisynth plugin folder.

That's it.
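
If the filter does not load, a quick check that all three files really sit in the plugin folder can save some time. The Python sketch below is just a convenience idea, not part of the package; missing_files is a made-up helper, and the DLL name NLMeansCL_netautoload.dll is the one that appears in the error message quoted in this thread. The demo uses a temporary folder, so point it at your own plugin folder instead.

```python
import os
import tempfile

# The three files from steps 1 to 3.
REQUIRED = ("NLMeansCL_netautoload.dll", "Cloo.dll", "AvsFilterNet.dll")


def missing_files(plugin_dir):
    """Return the required DLLs that are not present in plugin_dir.
    The comparison ignores case, as Windows file names do."""
    present = {name.lower() for name in os.listdir(plugin_dir)}
    return [name for name in REQUIRED if name.lower() not in present]


# Demo with a throw-away folder that only contains two of the three files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("Cloo.dll", "AvsFilterNet.dll"):
        open(os.path.join(demo_dir, name), "w").close()
    print(missing_files(demo_dir))  # ['NLMeansCL_netautoload.dll']
```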

Performance:
I measured some figures on my system:
CPU: Core2Duo, running at 3.2 GHz
GPU: NVIDIA GeForce 9600 GT, 512 MB GDDR3, 650 MHz core / 900 MHz memory / 1600 MHz shader, not overclocked
I typically get a speed-up by a factor of 18 to 25 compared to TNLMeans.

For example:
Video: 720x576, YV12
Parameter: A=4, S=2, B=1
TNLMeans: 0.98 fps
NLMeansCL: 23.93 fps (cpu=false, buffer=false)
NLMeansCL: 3.90 fps (cpu=false, buffer=true)
NLMeansCL: 1.40 fps (cpu=true, buffer=true)
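
The speed-up factors for this example follow directly from the fps figures above. A few lines of Python, using only the numbers from this post, make the arithmetic explicit:

```python
TNLMEANS_FPS = 0.98  # CPU reference from the example above

nlmeanscl_fps = {
    "cpu=false, buffer=false": 23.93,
    "cpu=false, buffer=true": 3.90,
    "cpu=true, buffer=true": 1.40,
}


def speedup(fps, baseline=TNLMEANS_FPS):
    """How many times faster than the baseline a given fps figure is."""
    return fps / baseline


for mode, fps in nlmeanscl_fps.items():
    print(f"{mode}: {speedup(fps):.1f}x")
# cpu=false, buffer=false: 24.4x
# cpu=false, buffer=true: 4.0x
# cpu=true, buffer=true: 1.4x
```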

The speed factor between NLMeansCL and TNLMeans is similar for different video sizes (e.g. 1920x1080 or 360x288), as well as for different filter parameters (Bx = 0, By = 0).
On my GPU, the implementation does NOT benefit if you use values of 2 or above for B/By! I have some explanations for this behaviour, but it would lead too far to explain them here...
I'm highly interested in performance figures for different GPUs, as well as feedback on whether it runs on different graphics cards.

A typical script to test the performance would be the following. Load the script in Virtualdub and check the 'video rendering rate' in the status window.
Code:
mpeg2source(...)
trim(0, 1)
assumefps(500)
last = last + last + last + last
last = last + last + last + last
last = last + last + last + last
last = last + last + last + last
NLMeansCL(A=4, S=2, B=1, aa=1.0, h=1.8, plane=4)
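
In case the repeated "last = last + ..." lines look odd: trim(0, 1) keeps two frames, and each of the four lines quadruples the clip. The Python lines below only do that arithmetic; benchmark_frames is a made-up name.

```python
def benchmark_frames(trimmed_frames=2, copies_per_line=4, lines=4):
    """Frames in the test clip: every 'last = last + last + last + last'
    line multiplies the current length by four."""
    return trimmed_frames * copies_per_line ** lines


print(benchmark_frames())  # 512
```

So the filter is timed over 512 frames, which is long enough for the slow start-up of the first frame not to dominate the reading.
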

Parameter values:
My findings so far are that the default values for A, S and B work very well! Typically there is no improvement from setting A or S higher. It only gets a lot slower!
IMHO there is no need to change aa to something other than 1.0.
Playing around with h and hC is sufficient.

Problems:
If you have any problems with the filter, especially if it doesn't work at all, please use GPU Caps Viewer (http://www.ozone3d.net/gpu_caps_viewer/) and first check whether the included OpenCL demos run! Then go to the tab named 'Tools' and send me the 'Full XML Export'! There's an extra button for it on the tab. Also, please send me the log file that NLMeansCL creates!
I cannot promise to help you out quickly, since I'm rather busy!

Version 0.4.0 alpha:
This is only a preliminary version (created in January) that implements a temporal version of the algorithm. Currently, it only supports a temporal window of 1 or 2 frames (in both directions). The temporal mode is only implemented for the image-based algorithm, not the buffer-based one. On my PC, the algorithm produces some non-deterministic artefacts in the video that are visible as small blocks of completely black pixels. I assume there is a timing problem / asynchronicity between the shaders and writing the memory back to the host PC. I haven't worked on the algorithm for months now, and this will probably remain the status for the rest of the summer. I also have a version for arbitrary values of Az, but I'm not satisfied with the results (it's too slow and the computed values are incorrect).

TODOs:
- (Better) Temporal mode
- Make NLMeansCL work on AMD graphic cards
- x64 version
- Other color spaces (YUY2, RGB)

Changelog:

Changes from v0.1 to v0.1.1
  • Added some debug information. See above for parameter description.
  • Added parameter to specify memory allocation strategy. Mainly to do technical low level tests.
  • Added mode 5 for parameter 'plane' for debug reasons. See above for parameter description.
  • Changed the calculation of the image areas for processing to prevent misbehaviour at certain video sizes like 1280x720. As a result, some pixels at the video borders might not be processed. (Just do an addborders() in your script to work around this)

Changes from v0.1.1 to v0.1.2
  • Resolved the misbehaviour at certain video sizes, namely where RowSize == PitchSize. Now the processing is exact up to the video boundaries. No need to do addborders() / crop() anymore.

Changes from v0.1.2 to v0.2
  • Added support for execution on CPUs and hopefully on Radeon cards. Added 2 parameters 'cpu' and 'buffer' for that.
  • Fixed a bug where values near 1.0 (like 1.00000001) for parameter h led to an error.

Changes from v0.2 to v0.2.1
  • Changed the device selection strategy. Hope now it will work for Platforms where GPU and CPU devices are mixed (AMD Radeon with ATI Stream SDK installed).

Changes from v0.2.1 to v0.2.2
  • Had to change the device selection strategy again. Hope this works now in all cases.

Changes from v0.2.2 to v0.3
  • Added mode for sum of absolute differences when computing neighborhood similarity (Parameter sse)
  • Performance improvements in cases where not all planes are processed (plane != 4)

Changes from v0.3 to v0.3.1
  • Changed memory allocation strategy from uhp to ahp due to user-reported errors. -> Changed default value for parameter smf to 'ahp'.
  • Fixed kernel name bug

Changes from v0.3.1 to v0.3.2
  • Switched to .Net 4
  • Switched to the latest version of Cloo (v0.9.0)
  • Switched to the latest version of AvsFilterNet (r62998)
  • NLMeansCL now reports errors the normal way (like all other avisynth plugins do)
  • When an error occurs, NLMeansCL does not crash avisynth anymore
  • Minor changes in the OpenCL code


v0.4.0 alpha
  • Implemented preliminary temporal version for Az=1 and Az=2.
  • v0.4.0 alpha is still based on .Net 3.5 as well as the older versions of Cloo (0.8.1) and AvsFilterNet (1.0 beta 2)!

Download latest version:
v0.3.2: http://www.mediafire.com/?q4butkseucz9tin
v0.3.2 sources: http://www.mediafire.com/?l3swlzu2pm3375l
v0.4.0 alpha: http://www.mediafire.com/?9osy86a14u0qxr6

Malcolm

Last edited by Malcolm; 2nd September 2011 at 22:23. Reason: Added Version 0.3.2

6th January 2011, 00:05   #2
TheRyuu
Avisynth is telling me that NLMeansCL_netautoload.dll is not an avisynth 2.5 plugin.

Edit: Am I supposed to load this differently from other plugins, rather than with a simple LoadPlugin("X:\path\to\filter.dll")?

Last edited by TheRyuu; 6th January 2011 at 00:09.

6th January 2011, 00:13   #3
Malcolm
Huh?
You don't have to load it explicitly. AvsFilterNet does that for you if the filename ends with _netautoload.dll and the file resides in the same folder.

If you remove the suffix, you can load it manually with LoadPlugin(). You still need AvsFilterNet.dll (I guess).

Malcolm

6th January 2011, 01:17   #4
mastrboy
interesting filter, we have too few filters which run on the GPU.

Could you post some screenshots comparing NLMeansCL and TNLMeans?

6th January 2011, 01:28   #5
Malcolm
@mastrboy
When called with the same parameters, both filters produce exactly(*) the same result!

(*) The difference that you see on the right is amplified 16 times. It contains only a few individual 'dots'. They arise from minor differences in the mathematical calculation. For performance reasons the calculation in OpenCL is performed with 'relaxed-math' optimization and in single precision (float instead of double).
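
For readers who want to make such a comparison themselves, here is one common way to build an amplified difference image, written as a Python sketch on plain lists of 8-bit values. How the picture in this post was actually produced is not stated; the grey offset of 128 and the name amplified_difference are assumptions for illustration.

```python
def amplified_difference(a, b, gain=16, offset=128):
    """Per-pixel difference of two 8-bit rows, multiplied by `gain`,
    centred on mid-grey and clamped to the 0..255 range."""
    out = []
    for pa, pb in zip(a, b):
        value = offset + gain * (pa - pb)
        out.append(max(0, min(255, value)))
    return out


row_tnlmeans = [100, 101, 102, 103]
row_nlmeanscl = [100, 101, 103, 103]  # one pixel differs by a single step
print(amplified_difference(row_nlmeanscl, row_tnlmeans))  # [128, 128, 144, 128]
```

Identical pixels stay grey, and a difference of a single step shows up as a clearly visible dot.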

6th January 2011, 01:45   #6
mastrboy
That's quite impressive considering the speed increase. I will try it tomorrow together with DGNVIndex and see what I can get out of it...

6th January 2011, 01:48   #7
TheRyuu
Well unless I'm doing it wrong (I just threw all 3 things in the autoload folder for testing it) it caused my graphics drivers to 'crash' and have to recover (when loading in vdub).

Vdub says nothing on crash, avsp is saying some sort of null pointer exception when I try and run the script (and doesn't cause a driver recovery), dunno if that helps.

Running a GTX 570 here with the latest beta drivers (266.35).

6th January 2011, 02:40   #8
GoodzMastaJ
I gave it a try. All three linked dll's in avisynth plugins folder, installed Catalyst 10.12 (APP version that has OpenCL support), and the StreamSDK which has the OpenCL libraries and whatnot. I get the below exception on my Radeon HD 4870. Any ideas how I can determine what actually failed (I know OpenCL on AMD is ?? at best, especially older cards like this one)?

Picture too wide for forum so linked

6th January 2011, 07:29   #9
Hiritsuki
I have waited for this filter for a long, long time.
I am testing it right now. -w-
__________________
My PC

6th January 2011, 08:52   #10
Dogway
The long awaited denoiser!! Thank you!
I get an error previewing in avspmod:
error message

Also, will you implement temporal Az? I think it was something tritical did by himself, but it'd be very welcome.
I have a GeForce 9600M GT card
driver version: 197.16
__________________
i7-4790K@Stock::GTX 1070] AviSynth+ filters and mods on GitHub + Discussion thread

Last edited by Dogway; 6th January 2011 at 08:56.

6th January 2011, 09:25   #11
Hiritsuki
@Dogway
I think a driver update to 260.99 will fix that error.

6th January 2011, 10:34   #12
Dogway
Thanks, it worked. The CL implementation is in the later drivers only.
Some questions:
-Is the default behaviour sse=true (as in tnlmeans)? I like using sse=false for animation sources; it works nicely for large flat colors.
-If you feel like it, could you add some kind of dark protection? If a source has very dark scenes, it completely turns into a muddy sequence (just like tnl or dfttest).
-Do I need to make it MT or something?

benchmark:

Quote:
# 0.40fps
MT("tnlmeans(ax=4,ay=4,az=1,sx=2,sy=2,bx=1,by=1,h=1.8,sse=true)",2,2)
# 0.19fps
NLMeansCL(A=4, S=2, B=1, aa=1.0, h=1.8, plane=4)

Last edited by Dogway; 6th January 2011 at 10:41.

6th January 2011, 10:38   #13
Malcolm
Quote:
Originally Posted by TheRyuu
Well unless I'm doing it wrong (I just threw all 3 things in the autoload folder for testing it) it caused my graphics drivers to 'crash' and have to recover (when loading in vdub).

OK, let me explain: under 'normal' circumstances, Windows recovers your graphics card driver because it crashed. However, Windows also does this if your driver doesn't respond within 2 seconds. (You can change the timeout, as well as the general behaviour of Windows, by editing the registry.)
That means: if you have a very complex OpenCL kernel that computes for more than 2 seconds on one video frame, then Windows will kill and restart your driver!
Since you have a GTX 570, I would assume that it's fast enough. So this shouldn't happen unless you use parameters like A=8, S=6 or so. But since I haven't tested the filter on that GPU, I can only guess!
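
To get a feeling for when the 2 second limit could be reached, here is a very rough back-of-the-envelope estimate in Python. It is a guess, not a measurement: it assumes that the work per frame grows with the area of the A window times the area of the S window times the number of pixels, it ignores B, and it takes the 23.93 fps measured on the 9600 GT (720x576, A=4, S=2) from the first post as the reference. The function names are made up.

```python
WATCHDOG_SECONDS = 2.0  # driver timeout mentioned above


def window_area(radius_x, radius_y):
    """Assumption: a radius r covers 2*r + 1 pixels per axis."""
    return (2 * radius_x + 1) * (2 * radius_y + 1)


def estimated_frame_seconds(a, s, width, height,
                            ref_fps=23.93, ref_a=4, ref_s=2,
                            ref_width=720, ref_height=576):
    """Scale the reference frame time by the assumed relative amount of work."""
    work = window_area(a, a) * window_area(s, s) * width * height
    ref_work = (window_area(ref_a, ref_a) * window_area(ref_s, ref_s)
                * ref_width * ref_height)
    return work / ref_work / ref_fps


for a, s, w, h in [(4, 2, 720, 576), (8, 6, 720, 576), (8, 6, 1920, 1080)]:
    t = estimated_frame_seconds(a, s, w, h)
    verdict = "over" if t > WATCHDOG_SECONDS else "under"
    print(f"A={a} S={s} {w}x{h}: about {t:.2f} s per frame ({verdict} the limit)")
```

Under these assumptions, A=8, S=6 stays under the limit at 720x576 on the reference card but goes well over it at 1920x1080. Real numbers depend on the GPU and on how the work is split into kernel launches, so treat this only as an order-of-magnitude check.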

6th January 2011, 10:45   #14
Malcolm
Quote:
Originally Posted by GoodzMastaJ
Any ideas how I can determine what actually failed (I know OpenCL on AMD is ?? at best, especially older cards like this one)?

You can check whether OpenCL is working on your configuration with this tool: http://www.ozone3d.net/gpu_caps_viewer/ It contains some OpenCL demos. You can also choose on which 'hardware' you would like to execute the demo (if you have more than one GPU, or on the CPU if you have the ATI Stream SDK installed).

I will provide a version of the filter that spits out the real message. What you see in the picture is the general exception mentioned, saying that the filter called env.ThrowError(...)

6th January 2011, 10:57   #15
Malcolm
Quote:
Originally Posted by Dogway
-Is the default behaviour sse=true (as in tnlmeans)? I like using sse=false for animation sources; it works nicely for large flat colors.
-If you feel like it, could you add some kind of dark protection? If a source has very dark scenes, it completely turns into a muddy sequence (just like tnl or dfttest).
-Do I need to make it MT or something?

- Yes, I have only implemented sse. sad should be no problem. I will consider adding this.
- Dark scene protection: actually I would recommend doing that with a little bit of scripting in AviSynth.
- Using MT will not help. NLMeansCL itself is already multithreaded on the GPU by nature. That's where the real work is done. So multithreading the wrapper part, which runs on the CPU, doesn't help to improve performance.
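
The "dark protection by scripting" idea is essentially a luma-dependent blend between the original and the denoised picture. The Python sketch below shows the per-pixel logic on plain lists; it is not an AviSynth script, the thresholds 16 and 48 are arbitrary example values, and protect_dark is a made-up name. In an AviSynth script the same idea would be a luma mask plus a masked merge.

```python
def protect_dark(original, filtered, low=16, high=48):
    """Keep the original pixel at or below `low`, use the filtered pixel
    at or above `high`, and blend linearly in between."""
    out = []
    for src, den in zip(original, filtered):
        if src <= low:
            weight = 0.0
        elif src >= high:
            weight = 1.0
        else:
            weight = (src - low) / (high - low)
        out.append(round(src + weight * (den - src)))
    return out


# dark pixel untouched, mid-dark pixel half blended, bright pixel fully filtered
print(protect_dark([10, 32, 200], [14, 40, 190]))  # [10, 36, 190]
```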

0.19 fps?!? Wow! I cannot imagine how this number comes about. At 720x576? What's your script?

Last edited by Malcolm; 6th January 2011 at 10:59.

6th January 2011, 12:07   #16
Dogway
Thanks for the answers. Something must have been wrong: as there's no temporal mode, I used an image, and that was the speed. Now I tried with a video source and the results were more encouraging:

Quote:
# 0.77fps
MT("tnlmeans(ax=4,ay=4,az=1,sx=2,sy=2,bx=1,by=1,h=1.8,sse=true)",2,2)
# 12.77fps
NLMeansCL(A=4, S=2, B=1, aa=1.0, h=1.8, plane=4)
So maybe it doesn't like images or non-mod16?
The dark protection is not only for whole scenes but for parts of scenes; I will look into that.

6th January 2011, 13:48   #17
Malcolm
Quote:
Originally Posted by Dogway
So maybe it doesn't like images or non mod16?
The dark protection is not only scenes, but part of the scenes, but I will look into that.

  • 12.77 fps is pretty good for a mobile GPU like the 9600M GT, I would say!
  • The filter doesn't care whether it's mod16. Did you process only one frame? The initialization of the OpenCL stuff takes some time. That means processing the first frame is WAY slower than all following frames! Repeat your single frame 512 times and let the video play. What fps do you get after 256 frames?
  • Yes, I understand. It's about dark areas inside the video frame. But at the moment I have to focus on the core functionality of the filter. 'Dark scene protection' is really something that can be added on top of any filter by scripting.
  • On the temporal filter mode: I have implemented this as a prototype, but I'm reluctant to investigate deeper at the moment.
    Reasons are:
    1. It doesn't bring as much benefit as one might think. (Maybe I will correct myself here in the future...)
    2. It slows down the filtering. (Though it's not as slow as I expected.)
    3. Due to the nature of OpenCL / CUDA (that is, to get fast-executing code!), I'd have to write a second specialized kernel besides the existing one to realize Az=1. I'd have to write a third specialized kernel to realize Az=2, ...
      A generalized kernel is possible but would be very slow! (And since it's all about performance...) At the current stage, I'd like to focus on the core kernel itself and work this out first.
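
The point about the slow first frame can be shown with a little arithmetic. The Python sketch below computes the fps only over the frames after a warm-up period; steady_state_fps is a made-up helper and the timings in the demo are invented (3 s for the first frame, 0.05 s for each following one), purely to show how much a single slow frame distorts the overall figure.

```python
def steady_state_fps(frame_end_times, warmup_frames=256):
    """fps over the frames after the warm-up.
    frame_end_times: completion time of each frame in seconds since start."""
    if len(frame_end_times) < warmup_frames + 2:
        raise ValueError("not enough frames after the warm-up")
    elapsed = frame_end_times[-1] - frame_end_times[warmup_frames]
    frames = len(frame_end_times) - 1 - warmup_frames
    return frames / elapsed


# Invented timings for 512 frames: slow first frame, fast afterwards.
times = [3.0 + 0.05 * i for i in range(512)]
overall_fps = len(times) / times[-1]
print(f"overall: {overall_fps:.1f} fps")
print(f"after 256 frames: {steady_state_fps(times):.1f} fps")
```

With only a handful of frames the start-up cost would dominate completely, which is why a single image gives misleading numbers.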

6th January 2011, 14:40   #18
Dogway
It's really strange: if I process my image with the example script from your first post, it runs fine (5.85fps), but with the following script I only get 0.19fps:
Quote:
ImageReader("C:\image.jpg")
setmtmode(2)
mmod(2,2) #final resolution 1000x572
converttoyv12
NLMeansCL(A=4, S=2, B=1, aa=1.0, h=1.8, plane=4)
I'm testing with AVSinfo.exe

I always use az=3, sometimes 6 depending on the source. I think it would benefit still areas by taking advantage of temporal information (noise, codec blocks...), but that's only me. I'm aware this is still in an experimental phase; I just wanted to help a bit. Nice Three Wise Men present! Keep up the good work :P

Last edited by Dogway; 6th January 2011 at 14:46.

6th January 2011, 16:22   #19
naoan
I got this error when trying to test the filter on avspmod

http://i.imgur.com/OITBQ.png

My system is using Windows 7 x64 and GPU AMD Radeon HD4850, checked using GPU-Z and opencl is ticked.

6th January 2011, 16:26   #20
Didée
@ Dogway: Just kill that SetMTmode(2) out of your script.

Simple logic: [SetMTmode(2)] AND [GPU filter] == FAIL
__________________
- We´re at the beginning of the end of mankind´s childhood -

My little flickr gallery. (Yes indeed, I do have hobbies other than digital video!)