5th January 2011, 23:34   #1
NLMeansCL: GPU based Non Local Means Denoising

I would like to introduce a new filter for Avisynth: NLMeansCL.
The filter is my take on the NLMeans algorithm. Tritical already wrote TNLMeans in 2006, which is also an implementation of the NLMeans algorithm. (Thanks for your work, tritical!)
In contrast to tritical's implementation, which is written in C++ and runs on the CPU, my implementation is written in OpenCL and (typically) runs on the GPU.
Note: I will update this post to reflect all changes to the filter. The most recent modifications will be marked in blue.

I was only able to test the filter on my NVIDIA Geforce 9600 GT. Therefore I cannot guarantee that it runs on your GPU, or that it will not crash or even kill your PC! (Not that I think it would...) The wrapper around the OpenCL algorithm is written in C#.

NLMeansCL(int A, int Ay, int Az, int S, int Sy, int B, int By, float aa, float h, float hC, int plane, bool debug, string debugpath, string smf, bool cpu, bool buffer, bool sse)

If you want some background information about the NLMeans algorithm, I'd like to point you to the very well written readme.txt that is part of tritical's TNLMeans filter package. He also explains all the parameters of the filter in detail.

The syntax of NLMeansCL:
  • Parameter A: Sets the value for both Ax and Ay, so you only have to specify one parameter if both should be equal. To use a different value for Ay, set the explicit Ay parameter. The same applies to S (-> Sx, Sy) and B (-> Bx, By).
    (default = 4)
  • Parameter Az: Sets the temporal radius. At the moment only 1 and 2 are supported. (default = 0)
  • Parameter S/Sy: default = 2
  • Parameter B/By: default = 1
  • Parameter aa: This parameter is equivalent to the parameter 'a' in TNLMeans. (default = 1.0)
  • Parameter h/hC: h defines the filter strength for both the luma and the chroma planes. hC is an additional parameter that lets you set a different filter strength for the chroma planes U and V. (default = 1.8)
  • Parameter plane: Specifies which color planes are processed (similar to FFT3DFilter):
    0 - luma (Y),
    1 - chroma U,
    2 - chroma V,
    3 - chroma planes U and V,
    4 - both luma and chroma
    5 - copy all planes. In this mode, all planes are just copied by a very simple OpenCL kernel. In case NLMeansCL does not run on your GPU, this might help to check if a very simple OpenCL kernel executes.
    (default = 0)
  • Parameter debug: Specifies whether a file with debug information and error messages is written. The debug file is named 'NLMeansCL_debug.txt'. (default = false)
  • Parameter debugpath: Specifies the path to the debug file. (default = C:\Temp\)
  • Parameter smf: This is a purely technical parameter to specify different memory allocation strategies. Possible values are "nan", "ahp", "chp", "achp" and "uhp". (default = ahp)
  • Parameter cpu: Specifies if the filter should be executed on the CPU rather than the GPU. (default = false)
  • Parameter buffer: Specifies if OpenCL buffers are used instead of OpenCL image objects to process the video data. (default = false)
  • Parameter sse: Specifies if the sum of squared differences is used (sse=true) or the sum of absolute differences (sse=false). (default = true)
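
To make the roles of A, S, h and sse more concrete, here is a minimal single-pixel sketch in Python. It is a textbook form of non-local means, not the actual OpenCL kernel of NLMeansCL: the function name nlm_pixel, the border clamping and the weight formulas exp(-d / h^2) for squared differences and exp(-d / h) for absolute differences are assumptions made for illustration, and the parameters aa, B and Az are left out entirely.

```python
import math

def nlm_pixel(img, x, y, A=4, S=2, h=1.8, sse=True):
    """Denoise one pixel of a 2D list `img` with a textbook non-local means.

    Illustrative sketch only; not the NLMeansCL kernel.
    """
    height, width = len(img), len(img[0])

    def px(r, c):
        # Clamp coordinates so windows near the border stay inside the image.
        return img[min(max(r, 0), height - 1)][min(max(c, 0), width - 1)]

    def distance(r, c):
        # Compare the (2S+1)x(2S+1) neighborhoods around (y, x) and (r, c).
        total = 0.0
        for dy in range(-S, S + 1):
            for dx in range(-S, S + 1):
                diff = px(y + dy, x + dx) - px(r + dy, c + dx)
                total += diff * diff if sse else abs(diff)
        return total / (2 * S + 1) ** 2

    weight_sum = 0.0
    value_sum = 0.0
    # Search window of radius A: every pixel in it casts a weighted vote.
    for r in range(y - A, y + A + 1):
        for c in range(x - A, x + A + 1):
            d = distance(r, c)
            # Assumed weight formulas; similar neighborhoods get weights near 1.
            w = math.exp(-d / (h * h)) if sse else math.exp(-d / h)
            weight_sum += w
            value_sum += w * px(r, c)
    return value_sum / weight_sum

# Usage: a flat 8x8 image with one outlier pixel in the middle.
image = [[10.0] * 8 for _ in range(8)]
image[3][3] = 20.0
print(nlm_pixel(image, 3, 3))             # squared differences
print(nlm_pixel(image, 3, 3, sse=False))  # absolute differences
```

The cost per pixel grows with (2A+1)^2 * (2S+1)^2, which is in line with larger values of A or S making the filter a lot slower.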

Running the filter on CPUs:
To be able to run the filter on the CPU, you have to install the ATI Stream SDK. Furthermore, you have to set 2 parameters: cpu=true and buffer=true.
The parameter combination cpu=true and buffer=false will fail to execute, since AMD has not yet implemented image support for the CPU version of its OpenCL drivers!
The support for buffers is preliminary. That means I'm undecided whether I will keep or remove it in future versions of the filter. If AMD adds image support to its drivers (either CPU or GPU), there is not much reason to leave it in the filter.
NLMeansCL will take full advantage of multiple CPU cores. There is no need to use MT() or SetMTMode()! (Doing so will rather degrade performance.)

When the filter is forced to use buffers instead of images, the speed on my Geforce drops from 23.05 fps to 3.90 fps.

Here's how the content of the debug log file will look if the filter initializes correctly:
NLMeansCL Version 0.3.2
ScriptEnvironment present.
Number of OpenCL Compute Platforms = 2.
Trying OpenCL Compute Platform
  NVIDIA Corporation.
  OpenCL 1.0 CUDA 3.2.1.
Number of OpenCL Devices in Platform = 1.
Trying OpenCL Device GeForce 9600 GT.
  Device available.
  Wrong Device Type (Gpu) requesting Cpu.
Trying OpenCL Compute Platform
  Advanced Micro Devices, Inc..
  OpenCL 1.1 ATI-Stream-v2.3 (451).
Number of OpenCL Devices in Platform = 1.
Trying OpenCL Device Intel(R) Core(TM)2 CPU          4400  @ 2.00GHz.
  Device available.
  Device Type Cpu.
  Device does not support images.
Using Device Intel(R) Core(TM)2 CPU          4400  @ 2.00GHz.
OpenCL Compute Context successfully created.
OpenCL Command Queue successfully created.
OpenCL Program successfully built.
Prog Y Build log: 
Prog UV Build log: 
OpenCL kernels successfully created.
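
If you want to check quickly which device the filter ended up on, the log above can be summarized with a few lines of Python. This is only a sketch: the function name summarize_debug_log is hypothetical, and it assumes that the log lines start with 'Trying OpenCL Device' and 'Using Device' exactly as in the example above.

```python
def summarize_debug_log(text):
    """Return (devices_tried, device_used) from an NLMeansCL debug log.

    Assumes the line prefixes seen in the example log; other versions
    of the filter may word these lines differently.
    """
    tried, used = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Trying OpenCL Device "):
            tried.append(line[len("Trying OpenCL Device "):].rstrip("."))
        elif line.startswith("Using Device "):
            used = line[len("Using Device "):].rstrip(".")
    return tried, used

sample = """Trying OpenCL Device GeForce 9600 GT.
  Device available.
  Wrong Device Type (Gpu) requesting Cpu.
Trying OpenCL Device Intel(R) Core(TM)2 CPU          4400  @ 2.00GHz.
  Device available.
Using Device Intel(R) Core(TM)2 CPU          4400  @ 2.00GHz.
"""
tried, used = summarize_debug_log(sample)
print("tried:", tried)
print("used: ", used)
```

If no 'Using Device' line is found, the second value stays None, which would suggest that the filter did not get through device selection.
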
Now, here is what you have to do to get the filter running:
1. The NLMeansCL Filter DLL itself:
Link and attachment at the end of my post.
Put it in your Avisynth plugin folder, and don't rename it!

2. CLOO: A .NET library for OpenCL. Needed to run NLMeansCL.
You can download it here: http://sourceforge.net/projects/cloo/
Take the Cloo.dll file from \bin\release inside the zip file and put it in your Avisynth plugin folder.

3. AvsFilterNet: A .NET library to write Avisynth filters. Needed to run NLMeansCL.
You can download it here: http://avsfilternet.codeplex.com/
Take the AvsFilterNet.dll and put it in your Avisynth plugin folder.

That's it.

I measured some performance figures on my system:
CPU: Core2Duo, running at 3.2GHz
GPU: NVIDIA Geforce 9600GT, 512MB GDDR3, 650MHz Core / 900MHz Memory / 1600MHz Shader, not overclocked
I typically get a speedup by a factor of 18 to 25 compared to TNLMeans.

For example:
Video: 720x576, YV12
Parameter: A=4, S=2, B=1
TNLMeans: 0.98 fps
NLMeansCL: 23.93 fps (cpu=false, buffer=false)
NLMeansCL: 3.90 fps (cpu=false, buffer=true)
NLMeansCL: 1.40 fps (cpu=true, buffer=true)
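
For reference, the speed factors implied by these figures can be worked out with a few lines of Python. The script only divides the fps values listed above; the variable names are made up for illustration.

```python
# fps figures taken from the list above (720x576, YV12, A=4, S=2, B=1)
tnlmeans_fps = 0.98
nlmeanscl_fps = {
    "cpu=false, buffer=false": 23.93,
    "cpu=false, buffer=true": 3.90,
    "cpu=true, buffer=true": 1.40,
}

# Speed factor of each NLMeansCL mode relative to TNLMeans.
speedups = {mode: fps / tnlmeans_fps for mode, fps in nlmeanscl_fps.items()}
for mode, factor in speedups.items():
    print(f"{mode}: {factor:.1f}x")
```

This gives roughly 24.4x for the image based GPU path, 4.0x for the buffer based GPU path and 1.4x for the CPU path.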

The speed factor between NLMeansCL and TNLMeans is similar for different video sizes (e.g. 1920x1080 or 360x288), as well as for different filter parameters (e.g. Bx = 0, By = 0).
On my GPU, the implementation does NOT benefit if you use values of 2 or above for B/By! I have some explanations for this behaviour, but it would lead too far to go into them here...
I'm very interested in performance figures for other GPUs, as well as feedback on whether the filter runs on other graphics cards.

A typical script to test the performance would be the following. Load the script in Virtualdub and check the 'video rendering rate' in the status window.
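# Load your own source clip before these lines.
trim(0, 1)                        # keep only frames 0 and 1
last = last + last + last + last  # each of these lines quadruples the clip length
last = last + last + last + last
last = last + last + last + last
last = last + last + last + last
NLMeansCL(A=4, S=2, B=1, aa=1.0, h=1.8, plane=4)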
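
To get a feeling for how long such a test runs, here is a small Python sketch of the arithmetic behind the script above: trim(0, 1) keeps 2 frames, and each of the four splice lines quadruples the clip, which gives 512 frames. The helper names clip_length and seconds_to_render are hypothetical.

```python
def clip_length(start_frames, splice_lines, copies_per_line=4):
    """Frame count after repeatedly splicing a clip with copies of itself."""
    frames = start_frames
    for _ in range(splice_lines):
        frames *= copies_per_line
    return frames

def seconds_to_render(frames, fps):
    """Rough render time, assuming a constant processing speed."""
    return frames / fps

frames = clip_length(start_frames=2, splice_lines=4)
print(frames)  # 512
print(round(seconds_to_render(frames, 23.93), 1), "seconds at 23.93 fps")
print(round(seconds_to_render(frames, 0.98), 1), "seconds at 0.98 fps")
```

Because the same two frames are repeated, the measured rate depends mainly on the filter and very little on decoding the source.
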
Parameter values:
My findings so far are that the default values for A, S and B work very well! Typically there is no improvement from setting A or S higher. It only gets a lot slower!
IMHO there is no need to change aa to something other than 1.0.
Playing around with h and hC is sufficient.

If you have any problems with the filter, especially if it doesn't work at all, please use GPU Caps Viewer (http://www.ozone3d.net/gpu_caps_viewer/) and first check whether the included OpenCL demos run! Then go to the tab named 'Tools' and send me the 'Full XML Export'! There's an extra button for it on the tab. Also, please send me the log file that NLMeansCL creates!
I cannot guarantee that I can help you quickly, since I'm rather busy!

Version 0.4.0 alpha:
This is only a preliminary version (created in January) that implements a temporal version of the algorithm. Currently, it only supports a temporal window of 1 or 2 frames (in both directions). The temporal mode is only implemented for the image based algorithm, not for the buffer based one. On my PC, the algorithm produces some non-deterministic artefacts in the video that are visible as small blocks of completely black pixels. I assume there is a timing / synchronization problem between the shaders and writing the memory out to the host PC. I haven't worked on the algorithm for months now, and this will probably remain the status for the rest of the summer. I also have a version for arbitrary values of Az, but I'm not satisfied with the results (it's too slow and the computed values are incorrect).

To do:
- (Better) Temporal mode
- Make NLMeansCL work on AMD graphic cards
- x64 version
- Other color spaces (YUY2, RGB)


Changes from v0.1 to v0.1.1
  • Added some debug information. See above for parameter description.
  • Added parameter to specify memory allocation strategy. Mainly to do technical low level tests.
  • Added mode 5 for parameter 'plane' for debugging purposes. See above for parameter description.
  • Changed the calculation of the image areas for processing to prevent misbehaviour at certain video sizes like 1280x720. As a result, some pixels at the video borders might not be processed. (Just do an AddBorders() in your script to work around this.)

Changes from v0.1.1 to v0.1.2
  • Resolved the misbehaviour at certain video sizes, namely where RowSize == PitchSize. Processing is now exact up to the video boundaries, so there is no need for AddBorders() / Crop() anymore.
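
As background on RowSize and PitchSize: a frame plane is stored row by row, and each row can be padded, so the distance between two row starts (the pitch) may be larger than the number of visible bytes per row (the row size). The following Python sketch shows generic pitch-aware addressing. It is not the code of the fix described above, and the function name copy_plane is hypothetical.

```python
def copy_plane(src, row_size, pitch, height):
    """Copy the visible bytes of a plane, dropping any per-row padding."""
    out = bytearray()
    for y in range(height):
        start = y * pitch  # rows start every `pitch` bytes, not every `row_size`
        out += src[start:start + row_size]
    return bytes(out)

# Two rows with 6 visible bytes each, stored with a pitch of 8 bytes.
padded = bytes(range(16))
print(list(copy_plane(padded, row_size=6, pitch=8, height=2)))

# Without padding (row_size == pitch) the plane is one contiguous block.
print(copy_plane(padded, row_size=8, pitch=8, height=2) == padded)
```

Code that silently assumes one of the two layouts tends to break on exactly those video sizes where the other layout occurs.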

Changes from v0.1.2 to v0.2
  • Added support for execution on CPUs and hopefully on Radeon cards. Added 2 parameters 'cpu' and 'buffer' for that.
  • Fixed a bug where values near 1.0 (like 1.00000001) for parameter h led to an error.

Changes from v0.2 to v0.2.1
  • Changed the device selection strategy. Hopefully it now works for platforms where GPU and CPU devices are mixed (AMD Radeon with ATI Stream SDK installed).

Changes from v0.2.1 to v0.2.2
  • Had to change the device selection strategy again. Hopefully this now works in all cases.

Changes from v0.2.2 to v0.3
  • Added mode for sum of absolute differences when computing neighborhood similarity (Parameter sse)
  • Performance improvements in cases where not all planes are processed (plane != 4)

Changes from v0.3 to v0.3.1
  • Changed memory allocation strategy from uhp to ahp due to user-reported errors. -> Changed default value for parameter smf to 'ahp'.
  • Fixed kernel name bug

Changes from v0.3.1 to v0.3.2
  • Switched to .Net 4
  • Switched to the latest version of Cloo (v0.9.0)
  • Switched to the latest version of AvsFilterNet (r62998)
  • NLMeansCL now reports errors the normal way (like all other Avisynth plugins do)
  • When an error occurs, NLMeansCL does not crash Avisynth anymore
  • Minor changes in the OpenCL code

v0.4.0 alpha
  • Implemented preliminary temporal version for Az=1 and Az=2.
  • v0.4.0 alpha is still based on .Net 3.5 as well as the older versions of Cloo (0.8.1) and AvsFilterNet (1.0 beta 2)!

Download latest version:
v0.3.2: http://www.mediafire.com/?q4butkseucz9tin
v0.3.2 sources: http://www.mediafire.com/?l3swlzu2pm3375l
v0.4.0 alpha: http://www.mediafire.com/?9osy86a14u0qxr6


Last edited by Malcolm; 2nd September 2011 at 22:23. Reason: Added Version 0.3.2