
 Doom9's Forum NLMeansCL: GPU based Non Local Means Denoising
5th January 2011, 23:34   #1  |  Link
Malcolm
Registered User

Join Date: Sep 2002
Location: Germany
Posts: 352
NLMeansCL: GPU based Non Local Means Denoising
Hi,

I would like to introduce a new filter for Avisynth: NLMeansCL.
The filter is my take on the NLMeans algorithm. Tritical already wrote TNLMeans in 2006, which is also an implementation of the NLMeans algorithm. (Thanks for your work, tritical!)
In contrast to tritical's implementation, which is written in C++ and runs on the CPU, my implementation is written in OpenCL and (typically) runs on the GPU.

Note: I will update this post to reflect all changes to the filter. The most recent modifications will be marked in blue.

I was only able to test the filter on my NVIDIA GeForce 9600 GT. Therefore I cannot guarantee that it runs on your GPU, or that it won't crash or kill your PC! (Not that I think it would...)
The wrapper around the OpenCL algorithm is written in C#.

Syntax:
NLMeansCL(int A, int Ay, int Az, int S, int Sy, int B, int By, float aa, float h, float hC, int plane, bool debug, string debugpath, string smf, bool cpu, bool buffer, bool sse)

If you want some background information about the NLMeans algorithm, I'd like to point you to the very well written readme.txt that is part of tritical's TNLMeans filter package. He also explains all the parameters of the filter in detail.

The syntax of NLMeansCL:
Parameter A: Sets the value for Ax and Ay. If you want to use the same values for Ax and Ay, you only have to specify one parameter. The same applies to S (-> Sx, Sy) and B (-> Bx, By). You can specify a different value for Ay by using the explicit Ay parameter. (default = 4)
Parameter Az: Sets the temporal radius. At the moment only 1 and 2 are supported. (default = 0)
Parameter S/Sy: default = 2
Parameter B/By: default = 1
Parameter aa: This parameter is equivalent to the parameter 'a' in TNLMeans.
(default = 1.0)
Parameter h/hC: h defines the strength of the filter for both luma and chroma planes. hC complements h: with hC you can set a different filter strength for the color channels U and V. (default = 1.8)
Parameter plane: Here you can specify which color planes should be processed (similar to FFT3DFilter):
0 - luma (Y)
1 - chroma U
2 - chroma V
3 - chroma planes U and V
4 - both luma and chroma
5 - copy all planes. In this mode, all planes are just copied by a very simple OpenCL kernel. In case NLMeansCL does not run on your GPU, this might help to check whether a very simple OpenCL kernel executes.
(default = 0)
Parameter debug: Specifies whether a file with debug information and error messages should be written. The debug file is named 'NLMeansCL_debug.txt'. (default = false)
Parameter debugpath: Specifies the path to the debug file. (default = C:\Temp\)
Parameter smf: This is a purely technical parameter to specify different memory allocation strategies. Possible values are "nan", "ahp", "chp", "achp" and "uhp". (default = ahp)
Parameter cpu: Specifies whether the filter should be executed on the CPU rather than the GPU. (default = false)
Parameter buffer: Specifies whether OpenCL buffers are used instead of OpenCL image objects to process the video data. (default = false)
Parameter sse: Specifies whether the sum of squared differences is used (sse=true) or the sum of absolute differences (sse=false). (default = true)

Running the filter on CPUs:
To be able to run the filter on the CPU, you have to install the ATI Stream SDK. Furthermore, you have to set 2 parameters: cpu=true and buffer=true.
The parameter combination cpu=true and buffer=false will fail to execute, since AMD has not yet implemented image support for the CPU version of its OpenCL drivers!
The support for buffers is preliminary. That means I'm undecided whether I will keep or remove it in future versions of the filter.
If AMD adds image support to its drivers (either CPU or GPU), there is not much reason to keep buffer support in the filter.
NLMeansCL will take full advantage of multiple CPU cores. There is no need to use MT() or setMTmode()! (Doing so will rather degrade performance.)
When forced to use buffers instead of images, the fps on my GeForce drops from 23.05 to 3.90.

Here's how the content of the debug log file will look if the filter initializes correctly:
Code:
NLMeansCL Version 0.3.2
ScriptEnvironment present.
Number of OpenCL Compute Platforms = 2.
Trying OpenCL Compute Platform NVIDIA Corporation. OpenCL 1.0 CUDA 3.2.1.
Number of OpenCL Devices in Platform = 1.
Trying OpenCL Device GeForce 9600 GT.
Device available.
Wrong Device Type (Gpu) requesting Cpu.
Trying OpenCL Compute Platform Advanced Micro Devices, Inc.. OpenCL 1.1 ATI-Stream-v2.3 (451).
Number of OpenCL Devices in Platform = 1.
Trying OpenCL Device Intel(R) Core(TM)2 CPU 4400 @ 2.00GHz.
Device available.
Device Type Cpu.
Device does not support images.
Using Device Intel(R) Core(TM)2 CPU 4400 @ 2.00GHz.
OpenCL Compute Context successfully created.
OpenCL Command Queue successfully created.
OpenCL Program successfully built.
Prog Y Build log:
Prog UV Build log:
OpenCL kernels successfully created.

Now, what do you have to do to get the filter running:
1. The NLMeansCL Filter DLL itself: Link and attachment at the end of my post. Put it in your Avisynth plugin folder. And don't rename it!
2. CLOO: A .net library for OpenCL. Needed to run NLMeansCL. You can download it here: http://sourceforge.net/projects/cloo/ Take the Cloo.dll file from \bin\release inside the zip file and put it in your Avisynth plugin folder.
3. AvsFilterNet: A .net library to write Avisynth filters. Needed to run NLMeansCL. You can download it here: http://avsfilternet.codeplex.com/ Take the AvsFilterNet.dll and put it in your Avisynth plugin folder.
That's it.
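A debug log like the one shown earlier in this post can be checked mechanically before sending it in. The following is only a sketch in Python (not part of the filter): it assumes the log contains the exact phrases from the sample above ("NLMeansCL Version", "Using Device", "OpenCL kernels successfully created."), and the function name `summarize_debug_log` is made up for illustration.

```python
import re

# Shortened copy of the sample log from the post, used as demo input.
SAMPLE_LOG = """NLMeansCL Version 0.3.2
ScriptEnvironment present.
Trying OpenCL Device Intel(R) Core(TM)2 CPU 4400 @ 2.00GHz.
Device does not support images.
Using Device Intel(R) Core(TM)2 CPU 4400 @ 2.00GHz.
OpenCL Program successfully built.
OpenCL kernels successfully created.
"""

def summarize_debug_log(text):
    """Return (version, selected device, initialized_ok) for a debug log.

    Hypothetical helper: it only looks for phrases that appear in the
    sample log; other versions of the filter may word things differently.
    """
    version = re.search(r"NLMeansCL Version (\S+)", text)
    # The device name itself contains dots ("2.00GHz"), so stop at a dot
    # that is followed by whitespace or the end of the text.
    device = re.search(r"Using Device (.+?)\.(?:\s|$)", text)
    initialized = "OpenCL kernels successfully created." in text
    return (
        version.group(1) if version else None,
        device.group(1) if device else None,
        initialized,
    )

print(summarize_debug_log(SAMPLE_LOG))
```

If the last element is False, or no device was selected, the lines just before the end of the log are probably the interesting ones to include in a problem report.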
Performance:
I measured some figures on my system:
CPU: Core2Duo, running at 3.2GHz
GPU: NVIDIA GeForce 9600GT, 512MB GDDR3, 650MHz Core / 900MHz Memory / 1600MHz Shader, not overclocked

I typically get a speed improvement of factor 18 to 25 compared to TNLMeans. For example:
Video: 720x576, YV12
Parameter: A=4, S=2, B=1
TNLMeans: 0.98 fps
NLMeansCL: 23.93 fps (cpu=false, buffer=false)
NLMeansCL: 3.90 fps (cpu=false, buffer=true)
NLMeansCL: 1.40 fps (cpu=true, buffer=true)

The speed factor between NLMeansCL and TNLMeans is similar for different video sizes (e.g. 1920x1080 or 360x288), as well as for different filter parameters (Bx = 0, By = 0).
On my GPU, the implementation does NOT benefit if you use values of 2 or above for B/By! I have some explanations for this behaviour, but it would lead too far to go into them here...

I'm highly interested in performance figures for different GPUs, as well as feedback on whether it runs on different graphics cards.
A typical script to test the performance would be the following. Load the script in VirtualDub and check the 'video rendering rate' in the status window.
Code:
mpeg2source(...)
trim(0, 1)
assumefps(500)
last = last + last + last + last
last = last + last + last + last
last = last + last + last + last
last = last + last + last + last
NLMeansCL(A=4, S=2, B=1, aa=1.0, h=1.8, plane=4)

Parameter values:
My findings so far are that the default values for A, S and B work very well! Typically there is no improvement from setting A or S higher. It only gets a lot slower!
IMHO there is no need to change aa to something other than 1.0. Playing around with h and hC is sufficient.

Problems:
If you have any problems with the filter, especially if it doesn't work at all, please use GPU Caps Viewer (http://www.ozone3d.net/gpu_caps_viewer/) and check first whether the included OpenCL demos run! Then go to the tab named 'Tools' and send me the 'Full XML Export'! There's an extra button for it on the tab.
Also, please send me the log file that NLMeansCL creates! I cannot guarantee to help you out quickly, since I'm rather busy!

Version 0.4.0 alpha:
This version is only a preliminary version (created in January) that implements a temporal version of the algorithm. Currently, it only supports a temporal window of 1 or 2 frames (in both directions). The temporal mode is only implemented for the image based algorithm, not the buffer based one.
On my PC, the algorithm produces some non-deterministic artefacts in the video that are visible as small blocks of completely black pixels. I suspect a timing problem (asynchrony) between the shaders and writing the memory back to the host PC. I haven't worked on the algorithm for months now, and this will probably be the status for the rest of the summer.
I also have a version for arbitrary values of Az, but I'm not satisfied with the results (it's too slow and the computed values are incorrect).

TODOs:
- (Better) Temporal mode
- Make NLMeansCL work on AMD graphics cards
- x64 version
- Other color spaces (YUY2, RGB)

Changelog:
Changes from v0.1 to v0.1.1
- Added some debug information. See above for parameter description.
- Added parameter to specify memory allocation strategy. Mainly to do technical low level tests.
- Added mode 5 for parameter 'plane' for debug reasons. See above for parameter description.
- Changed the calculation of the image areas for processing to prevent misbehaviour at certain video sizes like 1280x720. As a result, some pixels at the video borders might not be processed. (Just do an addborders() in your script to work around this)
Changes from v0.1.1 to v0.1.2
- Resolved the misbehaviour at certain video sizes, namely where RowSize == PitchSize. Now the processing is exact up to the video boundaries. No need to do addborders() / crop() anymore.
Changes from v0.1.2 to v0.2
- Added support for execution on CPUs and hopefully on Radeon cards. Added 2 parameters 'cpu' and 'buffer' for that.
- Fixed a bug where values near 1.0 (like 1.00000001) for parameter h led to an error.
Changes from v0.2 to v0.2.1
- Changed the device selection strategy. Hope now it will work for platforms where GPU and CPU devices are mixed (AMD Radeon with ATI Stream SDK installed).
Changes from v0.2.1 to v0.2.2
- Had to change the device selection strategy again. Hope this works now in all cases.
Changes from v0.2.1 to v0.3
- Added mode for sum of absolute differences when computing neighborhood similarity (parameter sse)
- Performance improvements in cases where not all planes are processed (plane != 4)
Changes from v0.3 to v0.3.1
- Changed memory allocation strategy from uhp to ahp due to user-reported errors. -> Changed default value for parameter smf to 'ahp'.
- Fixed kernel name bug
Changes from v0.3.1 to v0.3.2
- Switched to .Net 4
- Switched to the latest version of Cloo (v0.9.0)
- Switched to the latest version of AvsFilterNet (r62998)
- NLMeansCL now reports errors the normal way (like all other Avisynth plugins do)
- When an error occurs, NLMeansCL does not crash Avisynth anymore
- Minor changes in the OpenCL code
v0.4.0 alpha
- Implemented preliminary temporal version for Az=1 and Az=2.
- v0.4.0 alpha is still based on .Net 3.5 as well as the older versions of Cloo (0.8.1) and AvsFilterNet (1.0 beta 2)!

Download latest version:
v0.3.2: http://www.mediafire.com/?q4butkseucz9tin
v0.3.2 sources: http://www.mediafire.com/?l3swlzu2pm3375l
v0.4.0 alpha: http://www.mediafire.com/?9osy86a14u0qxr6

Malcolm

Last edited by Malcolm; 2nd September 2011 at 22:23. Reason: Added Version 0.3.2

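For readers who want to see what kind of computation sits behind the parameters A, S, h and sse, here is a minimal textbook non-local means sketch in plain Python. It is not NLMeansCL's OpenCL kernel and makes several assumptions: A and S are treated as window radii (the TNLMeans convention), the weight is exp(-d / h²) with d the mean patch difference, borders are handled by clamping, and B, aa, Az and the chroma planes are left out entirely.

```python
import math

def nlmeans(img, A=4, S=2, h=1.8, sse=True):
    """Textbook non-local means on a 2-D list of luma values.

    Illustrative only: A is the search radius, S the patch radius, and
    sse switches between squared and absolute differences.
    """
    height, width = len(img), len(img[0])

    def px(y, x):
        # Clamp coordinates so patches near the border stay inside the image.
        return img[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    def patch_distance(y0, x0, y1, x1):
        # Mean squared (or absolute) difference between two patches.
        total = 0.0
        for dy in range(-S, S + 1):
            for dx in range(-S, S + 1):
                diff = px(y0 + dy, x0 + dx) - px(y1 + dy, x1 + dx)
                total += diff * diff if sse else abs(diff)
        return total / (2 * S + 1) ** 2

    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weight_sum = 0.0
            value_sum = 0.0
            # Every position in the search window votes with a weight
            # that shrinks as its patch differs more from the centre patch.
            for sy in range(y - A, y + A + 1):
                for sx in range(x - A, x + A + 1):
                    w = math.exp(-patch_distance(y, x, sy, sx) / (h * h))
                    weight_sum += w
                    value_sum += w * px(sy, sx)
            out[y][x] = value_sum / weight_sum
    return out
```

The four nested loops per pixel are why the CPU version is slow and why the work maps well onto a GPU: every output pixel can be computed independently of all others.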
6th January 2011, 00:05   #2  |  Link
TheRyuu
warpsharpened

Join Date: Feb 2007
Posts: 788
Avisynth is telling me that NLMeansCL_netautoload.dll is not an Avisynth 2.5 plugin.

Edit: Am I supposed to load this differently from other plugins, rather than with a simple LoadPlugin("X:\path\to\filter.dll")?

Last edited by TheRyuu; 6th January 2011 at 00:09.

6th January 2011, 00:13   #3  |  Link
Malcolm
Registered User

Join Date: Sep 2002
Location: Germany
Posts: 352
Huh? You don't have to load it explicitly. AvsFilterNet does that for you if the filename ends with _netautoload.dll and it resides in the same folder.
If you remove the suffix, you can load it manually with LoadPlugin(). You still need the AvsFilterNet.dll (I guess).

Malcolm

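The file layout described here and in the first post can be checked with a few lines of Python. This is a sketch under stated assumptions: the required names (Cloo.dll, AvsFilterNet.dll, and a filter DLL ending in _netautoload.dll) are taken from this thread, the function name `check_plugin_files` is hypothetical, and the check says nothing about driver or .NET problems.

```python
REQUIRED = ("Cloo.dll", "AvsFilterNet.dll")
AUTOLOAD_SUFFIX = "_netautoload.dll"

def check_plugin_files(names):
    """Return a list of problems for the file names found in a plugin folder.

    Hypothetical helper: it only checks the naming rules mentioned in the
    thread, comparing names case-insensitively.
    """
    lower = [name.lower() for name in names]
    problems = ["missing " + req for req in REQUIRED if req.lower() not in lower]
    has_filter = any(
        name.startswith("nlmeanscl") and name.endswith(AUTOLOAD_SUFFIX)
        for name in lower
    )
    if not has_filter:
        problems.append("no NLMeansCL*_netautoload.dll found (renamed or missing?)")
    return problems

print(check_plugin_files(["NLMeansCL_netautoload.dll", "Cloo.dll"]))
```

In practice the list of names would come from something like `os.listdir()` on the Avisynth plugin folder; an empty result means all three files are where AvsFilterNet expects them.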
6th January 2011, 01:17   #4  |  Link
mastrboy
Registered User

Join Date: Sep 2008
Posts: 294
Interesting filter, we have too few filters which run on the GPU.

Could you post some screenshots comparing NLMeansCL and TNLMeans?

6th January 2011, 01:28   #5  |  Link
Malcolm
Registered User

Join Date: Sep 2002
Location: Germany
Posts: 352
@mastrboy
When called with the same parameters, both filters produce exactly(*) the same result!

(*) The difference that you see on the right is 16 times enhanced. It contains only a few individual 'dots'. They arise from minor differences in the mathematical calculation. For performance reasons the calculation in OpenCL is performed with 'relaxed-math' optimization and in single precision (float instead of double).

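How the "16 times enhanced" difference image was produced is not stated in the post. One common way to build such a comparison is to take the absolute per-pixel difference of the two outputs and multiply it by a gain, clipping to the 8-bit range. The sketch below shows that idea in Python; the function name and the use of an absolute difference are assumptions.

```python
def amplified_difference(frame_a, frame_b, gain=16):
    """Per-pixel |a - b| * gain, clipped to 0..255.

    Illustrative only: frames are 2-D lists of 8-bit values, and the
    exact method used for the screenshot in the thread is unknown.
    """
    return [
        [min(255, abs(a - b) * gain) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]

print(amplified_difference([[100, 101], [50, 200]], [[100, 100], [52, 180]]))
```

With a gain of 16, a difference of a single 8-bit step already shows up as a clearly visible dot, which fits the description of "a few individual dots" caused by rounding.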
6th January 2011, 01:45   #6  |  Link
mastrboy
Registered User

Join Date: Sep 2008
Posts: 294
That's quite impressive considering the speed increase. I will try it tomorrow together with DGNVIndex, and see what I can get out of it...

6th January 2011, 01:48   #7  |  Link
TheRyuu
warpsharpened

Join Date: Feb 2007
Posts: 788
Well, unless I'm doing it wrong (I just threw all 3 things in the autoload folder for testing it), it caused my graphics drivers to 'crash' and have to recover (when loading in vdub).

Vdub says nothing on crash, avsp is saying some sort of null pointer exception when I try and run the script (and doesn't cause a driver recovery), dunno if that helps.

Running a GTX 570 here with the latest beta drivers (266.35).

6th January 2011, 02:40   #8  |  Link
GoodzMastaJ
Usered Register

Join Date: Dec 2006
Posts: 9
I gave it a try. All three linked DLLs are in the Avisynth plugins folder, and I installed Catalyst 10.12 (the APP version that has OpenCL support) and the Stream SDK, which has the OpenCL libraries and whatnot. I get the below exception on my Radeon HD 4870. Any ideas how I can determine what actually failed (I know OpenCL on AMD is ?? at best, especially older cards like this one)?

Picture too wide for forum so linked

6th January 2011, 07:29   #9  |  Link
Hiritsuki
Novice of AVS

Join Date: Oct 2009
Posts: 156
I have waited a long, long time for this filter.
I'm testing it right now. -w-
__________________
My PC

6th January 2011, 08:52   #10  |  Link
Dogway
Registered User

Join Date: Nov 2009
Posts: 1,020
The long awaited denoiser!! Thank you!
I get an error when previewing in AvsPmod: error message

Also, will you implement temporal Az? I think it was something tritical did by himself, but it'd be very welcome.

I have a GeForce 9600M GT card, driver version: 197.16

Last edited by Dogway; 6th January 2011 at 08:56.

6th January 2011, 09:25   #11  |  Link
Hiritsuki
Novice of AVS

Join Date: Oct 2009
Posts: 156
@Dogway
I think a driver update to 260.99 will fix that error.
__________________
My PC

6th January 2011, 10:34   #12  |  Link
Dogway
Registered User

Join Date: Nov 2009
Posts: 1,020
Thanks, it worked. The CL implementation is only in the later drivers.
Some questions:
- Is the default behaviour sse=true (as in TNLMeans)? I like using sse=false for animation sources, it works nicely for large flat colors.
- If you feel like it, could you add some kind of dark protection? If a source has very dark scenes, it completely turns into a muddy sequence (just like TNLMeans or dfttest).
- Do I need to make it MT or something?

benchmark:

Quote:
 # 0.40fps
 MT("tnlmeans(ax=4,ay=4,az=1,sx=2,sy=2,bx=1,by=1,h=1.8,sse=true)",2,2)
 # 0.19fps
 NLMeansCL(A=4, S=2, B=1, aa=1.0, h=1.8, plane=4)

Last edited by Dogway; 6th January 2011 at 10:41.

6th January 2011, 10:38   #13  |  Link
Malcolm
Registered User

Join Date: Sep 2002
Location: Germany
Posts: 352
Quote:
 Originally Posted by TheRyuu Well unless I'm doing it wrong (I just threw all 3 things in the autoload folder for testing it) it caused my graphics drivers to 'crash' and have to recover (when loading in vdub).
Ok, let me explain: Under 'normal' circumstances, Windows recovers your graphics card driver because it crashed. However, Windows also does this if your driver doesn't respond within 2 seconds. (You can change the timeout, as well as the general behaviour of Windows, by editing the registry.)
That means: if you have a very complex OpenCL kernel that computes for more than 2 seconds on one video frame, then Windows will kill and restart your driver!
Since you have a GTX 570, I would assume that it's fast enough. So this shouldn't happen unless you use parameters like A=8, S=6 or so. But since I haven't tested the filter on that GPU, I can only guess!
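To get a feeling for why parameters like A=8, S=6 are so much heavier than the defaults, one can count the pixel comparisons per output pixel. The sketch below assumes that A and S are window radii as in TNLMeans (so the windows are (2A+1)² and (2S+1)² pixels) and that a frame is processed in a single kernel launch that must finish within the 2 second limit; neither assumption is confirmed for NLMeansCL's actual kernels, and both function names are made up.

```python
def patch_comparisons_per_pixel(A, S):
    """Search positions times pixels per patch, spatial mode only.

    Assumes A and S are radii, giving (2A+1)^2 search positions and
    (2S+1)^2 pixels per patch.
    """
    return (2 * A + 1) ** 2 * (2 * S + 1) ** 2

def exceeds_watchdog(fps, timeout_s=2.0):
    """True if one frame at the given rate takes longer than the timeout."""
    return 1.0 / fps > timeout_s

default_work = patch_comparisons_per_pixel(4, 2)
heavy_work = patch_comparisons_per_pixel(8, 6)
print(default_work, heavy_work, heavy_work / default_work)
print(exceeds_watchdog(0.4), exceeds_watchdog(23.93))
```

Under these assumptions the heavier setting needs roughly 24 times the work per pixel, so a card that manages 12 fps at the defaults would drop to about half a frame per second, which is right at the 2 second limit.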

6th January 2011, 10:45   #14  |  Link
Malcolm
Registered User

Join Date: Sep 2002
Location: Germany
Posts: 352
Quote:
 Originally Posted by GoodzMastaJ Any ideas how I can determine what actually failed (I know OpenCL on AMD is ?? at best, especially older cards like this one)?
You can check whether OpenCL is working on your configuration with this tool: http://www.ozone3d.net/gpu_caps_viewer/ It contains some OpenCL demos. You can also choose on which 'hardware' you would like to execute the demo (if you have more than one GPU, or on the CPU if you have the ATI Stream SDK installed).

I will provide a version of the filter that spits out the real message. What you see in the picture is the general exception mentioned above, saying that the filter called env.ThrowError(...)

6th January 2011, 10:57   #15  |  Link
Malcolm
Registered User

Join Date: Sep 2002
Location: Germany
Posts: 352
Quote:
 Originally Posted by Dogway
 - Is the default behaviour sse=true (as in TNLMeans)? I like using sse=false for animation sources, it works nicely for large flat colors.
 - If you feel like it, could you add some kind of dark protection? If a source has very dark scenes, it completely turns into a muddy sequence (just like TNLMeans or dfttest).
 - Do I need to make it MT or something?
- Yes, I have only implemented sse. SAD should be no problem. I will consider adding this.
- Dark scene protection: actually, I would recommend doing that with a little bit of scripting in Avisynth.
- Using MT will not help. NLMeansCL itself is already multithreaded on the GPU by nature. That's where the real work is done. So multithreading the wrapper part, which runs on the CPU, doesn't help to improve performance.
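The "dark protection by scripting" idea can be illustrated independently of Avisynth: blend the original back in where the source is dark, so that only brighter areas receive the full denoising. The Python sketch below shows the per-pixel logic; the thresholds lo=16 and hi=48 and the linear ramp are arbitrary assumptions, and in an Avisynth script the same effect would typically be built from a luma mask and a masked merge.

```python
def dark_protect(original, denoised, lo=16, hi=48):
    """Blend original and denoised luma so that dark pixels stay untouched.

    Illustrative only: at luma <= lo the original is kept, at luma >= hi
    the denoised pixel is used, with a linear ramp in between.
    """
    out = []
    for row_o, row_d in zip(original, denoised):
        row = []
        for o, d in zip(row_o, row_d):
            # keep = 1.0 means "use the original pixel", 0.0 "use denoised".
            keep = min(1.0, max(0.0, (hi - o) / (hi - lo)))
            row.append(keep * o + (1.0 - keep) * d)
        out.append(row)
    return out

print(dark_protect([[10, 32, 200]], [[20, 40, 190]]))
```

Because the decision is made per pixel from the source luma, this also covers the case of dark areas inside an otherwise bright frame, not just entirely dark scenes.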

0.19 fps?!? Wow! I cannot imagine how this number comes about. At 720x576? What's your script?

Last edited by Malcolm; 6th January 2011 at 10:59.

6th January 2011, 12:07   #16  |  Link
Dogway
Registered User

Join Date: Nov 2009
Posts: 1,020
Thanks for the answers. Something must have been wrong: as there's no temporal mode, I used a still image, and that was the speed. Now I tried with a video source and the results were more encouraging:

Quote:
 # 0.77fps
 MT("tnlmeans(ax=4,ay=4,az=1,sx=2,sy=2,bx=1,by=1,h=1.8,sse=true)",2,2)
 # 12.77fps
 NLMeansCL(A=4, S=2, B=1, aa=1.0, h=1.8, plane=4)
So maybe it doesn't like images or non-mod16?
The dark protection is not only about whole scenes, but also about parts of scenes. I will look into that, though.
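The speed factors quoted in this thread are simple ratios of the two frame rates. As a small worked example in Python (the function name is made up; the fps values are the ones posted above and in the first post):

```python
def speedup(fps_new, fps_old):
    """How many times faster the new filter runs than the old one."""
    return fps_new / fps_old

# Dogway's video benchmark: NLMeansCL versus MT'd TNLMeans.
print(round(speedup(12.77, 0.77), 1))
# Malcolm's figures from the first post (720x576, A=4, S=2, B=1).
print(round(speedup(23.93, 0.98), 1))
```

Note that the two benchmarks are not directly comparable: the TNLMeans call in the quoted script uses az=1, i.e. a temporal window, while the NLMeansCL call is purely spatial, so part of the gap in the first ratio comes from the extra temporal work.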

6th January 2011, 13:48   #17  |  Link
Malcolm
Registered User

Join Date: Sep 2002
Location: Germany
Posts: 352
Quote:
 Originally Posted by Dogway So maybe it doesn't like images or non mod16? The dark protection is not only scenes, but part of the scenes, but I will look into that.
• 12.77 fps is pretty good for a mobile GPU like the 9600M GT, I would say!
• The filter doesn't care whether it's mod16. Did you process only one frame? The initialization of the OpenCL stuff takes some time. That means processing the first frame is WAY slower than all following frames! Repeat your single frame 512 times and let the video play. What fps do you get after 256 frames?
• Yes, I understand. It's about dark areas inside the video frame. But at the moment I have to focus on the core functionality of the filter. 'Dark scene protection' is really something that can be added on top of any filter by scripting.
• About the temporal filter mode: I have implemented this as a prototype, but I'm reluctant to investigate deeper at the moment.
Reasons are:
1. It doesn't bring as much benefit as one might think. (Maybe I will correct myself here in the future...)
2. It slows down the filtering. (Though it's not as slow as I expected)
3. Due to the nature of OpenCL / CUDA (that is, to get fast executing code!), I'd have to write a second specialized kernel besides the existing one to realize Az=1. I'd have to write a third specialized kernel to realize Az=2, ...
A generalized kernel is possible but would be very slow! (And since it's all about performance...) At the current stage, I'd like to focus on the core kernel itself and work this out first.
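The advice to repeat a frame 512 times and read the fps after 256 frames amounts to excluding the slow first frame (OpenCL initialization) from the measurement. The test script in the first post produces exactly that clip length: trim(0, 1) keeps 2 frames and each of the four `last = last + last + last + last` lines quadruples it, giving 2 × 4⁴ = 512 frames. Below is a sketch in Python of how a steady-state rate could be computed from per-frame completion times; the function name and the idea of logging timestamps are assumptions, since VirtualDub only shows the rate in its status window.

```python
def steady_state_fps(frame_end_times, warmup_frames=256):
    """Frames per second measured after a warm-up period.

    frame_end_times holds the cumulative time in seconds at which each
    frame finished. Illustrative helper, not part of any tool named in
    the thread.
    """
    if len(frame_end_times) <= warmup_frames + 1:
        raise ValueError("need more frames than the warm-up length")
    span = frame_end_times[-1] - frame_end_times[warmup_frames]
    frames = len(frame_end_times) - 1 - warmup_frames
    return frames / span

# Example: the first frame takes 3 s (initialization), the rest 0.1 s each.
times = [3.0 + 0.1 * i for i in range(512)]
overall = len(times) / times[-1]
print(overall, steady_state_fps(times))
```

In this made-up example the overall average is dragged down by the 3 second start-up, while the rate after the warm-up reflects what the filter sustains on a long clip.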

6th January 2011, 14:40   #18  |  Link
Dogway
Registered User

Join Date: Nov 2009
Posts: 1,020
It's really strange: if I process my image with the example script from your first post, it runs fine (5.85fps), but with the following script I only get 0.19fps:
Quote:
 ImageReader("C:\image.jpg")
 setmtmode(2)
 mmod(2,2) #final resolution 1000x572
 converttoyv12
 NLMeansCL(A=4, S=2, B=1, aa=1.0, h=1.8, plane=4)
I'm testing with AVSinfo.exe

I always use az=3, sometimes 6 depending on the source. I think it would benefit still areas by taking advantage of temporal information (noise, codec blocks...), but that's only me. I'm aware this is still in an experimental phase, I just wanted to help a bit. Nice 3 wise present! Keep up the good work :P

Last edited by Dogway; 6th January 2011 at 14:46.

6th January 2011, 16:22   #19  |  Link
naoan
Registered User

Join Date: Oct 2009
Posts: 152
I got this error when trying to test the filter in AvsPmod:
http://i.imgur.com/OITBQ.png

My system is using Windows 7 x64 and an AMD Radeon HD4850 GPU; I checked using GPU-Z and OpenCL is ticked.

6th January 2011, 16:26   #20  |  Link
Didée
Registered User

Join Date: Apr 2002
Location: Germany
Posts: 5,394
@ Dogway:

Just kill that SetMTmode(2) out of your script.

Simple logic: [SetMTmode(2)] AND [GPU filter] == FAIL
__________________
- We´re at the beginning of the end of mankind´s childhood -

My little flickr gallery. (Yes indeed, I do have hobbies other than digital video!)
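Since both Malcolm and Didée advise against combining MT with the GPU filter, a script can be scanned for such calls before benchmarking. The following Python sketch is only an illustration: the function name is made up, comment handling is simplistic (everything after a '#' is ignored, even inside strings), and it looks only for the two calls named in this thread, MT(...) and SetMTMode(...).

```python
import re

MT_PATTERN = re.compile(r"\b(setmtmode|mt)\s*\(", re.IGNORECASE)

# Dogway's script from post #18, used as demo input.
SCRIPT = r'''ImageReader("C:\image.jpg")
setmtmode(2)
mmod(2,2) #final resolution 1000x572
converttoyv12
NLMeansCL(A=4, S=2, B=1, aa=1.0, h=1.8, plane=4)'''

def mt_calls(script):
    """Return the MT-related calls found outside comments in a script."""
    found = []
    for line in script.splitlines():
        code = line.split("#", 1)[0]  # drop everything after a '#'
        found += [m.group(1) for m in MT_PATTERN.finditer(code)]
    return found

print(mt_calls(SCRIPT))
```

An empty list means neither call was found; anything else is a candidate for removal when NLMeansCL is in the script.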
