Deathray - OpenCL GPU accelerated spatial/temporal non-local means de-noising


Jawed
17th January 2011, 22:07
I've created Deathray 1.00, an Avisynth plug-in filter for spatial/temporal non-local means de-noising. It uses OpenCL for GPU acceleration.

I've provided two downloads:

The DLL only:

www.cupidity.f9.co.uk/Deathray_1_00_DLL.zip

The entire source code, including the compiled DLL:

www.cupidity.f9.co.uk/Deathray_1_00.zip

I've quoted here the Deathray readme.txt:

Installation
============

Copy the Deathray.dll to the "plugins" sub-folder of your installation of
Avisynth.


De-installation
===============

Delete the Deathray.dll from the "plugins" sub-folder of your installation of
Avisynth.


Compatibility
=============

The following software configuration is known to work:

- Avisynth 2.5.8
- AMD Stream SDK 2.3
- AMD Catalyst 10.12
- Windows Vista 64-bit

The following hardware configuration is known to work:

- ATI HD 5870 graphics card

Known non-working hardware:

- ATI cards in the 4000 series or earlier
- ATI cards in the 5400 series

Video:

- Deathray is compatible solely with 8-bit planar formatted video. It has
been tested with YV12 format.


Usage
=====

Deathray separates the video into its 3 component planes and processes each
of them independently. This means some parameters come in two flavours: luma
and chroma.

Filtering can be adjusted with the following parameters, with the default
value for each in brackets:

hY (1.0) - strength of de-noising in the luma plane.

Cannot be negative.

If set to 0 Deathray will not process the luma plane.


hUV (1.0) - strength of de-noising in the chroma planes.

Cannot be negative.

If set to 0 Deathray will not process the chroma planes.


tY (0) - temporal radius for the luma plane.

Limited to the range 0 to 64.

When set to 0 only spatial filtering is performed on
the luma plane. When set to 1 filtering uses the prior,
current and next frames for the non-local sampling
and weighting process. Higher values will increase
the range of prior and next frames that are included.

tUV (0) - temporal radius for the chroma planes.

Limited to the range 0 to 64.

When set to 0 only spatial filtering is performed on
the chroma planes. When set to 1 filtering uses the prior,
current and next frames for the non-local sampling
and weighting process. Higher values will increase
the range of prior and next frames that are included.
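
As a rough illustration of the temporal radius, the sketch below works out which frames would be sampled for a given frame. The name TemporalRange and the clamping at the start and end of the clip are assumptions made for illustration; the readme does not say how Deathray treats clip boundaries.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// First and last frame numbers sampled when filtering frame n with temporal
// radius t. Clamping to the clip bounds is an assumption, not taken from the
// Deathray source.
std::pair<int, int> TemporalRange(int n, int t, int frame_count) {
  assert(frame_count > 0 && n >= 0 && n < frame_count);
  t = std::min(std::max(t, 0), 64);  // range documented in the readme
  return std::make_pair(std::max(0, n - t), std::min(frame_count - 1, n + t));
}
```

With t set to 1 the range covers the prior, current and next frames, so away from the clip ends 2*t + 1 frames take part.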

s (1.0) - sigma used to generate the gaussian weights.

Limited to values of at least 0.1.

The kernel implemented by Deathray uses 7x7-pixel
windows centred upon the pixel being filtered.

For a 2-dimensional gaussian kernel sigma should be
approximately 1/3 of the radius of the kernel, or less,
to retain its gaussian nature.

Since a 7x7 window has a radius of 3, values of sigma
greater than 1.0 will tend to bias the kernel towards
a box-weighting. i.e. all pixels in the window will
tend towards being equally weighted. This will tend to
reduce the selectivity of the weighting process and
result in relatively stronger spatial blurring.
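
To see how sigma shifts weight towards the edge of the window, here is a minimal sketch that measures how much of a 7x7 kernel's total weight lands on its outermost ring of 24 pixels. It assumes the textbook isotropic gaussian exp(-(dx^2 + dy^2) / (2 * sigma^2)); Deathray's actual kernel may be defined differently, and OuterRingFraction is a made-up name.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// Fraction of the total weight of a 7x7 gaussian window that falls on the
// outermost ring of 24 pixels. Assumes the textbook isotropic form
// exp(-(dx^2 + dy^2) / (2 * sigma^2)); Deathray's own kernel may differ.
double OuterRingFraction(double sigma) {
  assert(sigma >= 0.1);  // lower limit documented in the readme
  const int radius = 3;
  double total = 0.0;
  double ring = 0.0;
  for (int dy = -radius; dy <= radius; ++dy) {
    for (int dx = -radius; dx <= radius; ++dx) {
      const double w = std::exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma));
      total += w;
      // A pixel is on the outer ring when either offset reaches the radius.
      if (std::abs(dx) == radius || std::abs(dy) == radius) ring += w;
    }
  }
  return ring / total;
}
```

As sigma grows the fraction climbs towards 24/49, the value for a box kernel in which every pixel is weighted equally.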

x (1) - factor to expand sampling.

Limited to values in the range 1 to 14.

By default Deathray spatially samples 49 windows
centred upon the pixel being filtered, in a 7x7
arrangement. x increases the sampling range in
multiples of the kernel radius.

Since the kernel radius is 3, setting x to 2 produces
a sampling area of 13x13, i.e. 169 windows centred
upon the target pixel. Yet higher values of x such as
3 or 4 will result in 19x19 or 25x25 sample windows.
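
The arithmetic above can be written out as a small sketch. The function names are hypothetical; the kernel radius of 3 and the 1 to 14 range for x come from the readme.

```cpp
#include <cassert>

const int kKernelRadius = 3;  // a 7x7 window has a radius of 3

// Side length of the square sampling area for a given expansion factor x.
int SamplingSide(int x) {
  assert(x >= 1 && x <= 14);
  return 2 * kKernelRadius * x + 1;
}

// Number of windows sampled around the target pixel.
int SamplingWindows(int x) {
  const int side = SamplingSide(x);
  return side * side;
}
```

For x = 1 this gives a side of 7 and 49 windows, and for x = 2 a side of 13 and 169 windows, matching the figures quoted above.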

Deathray uses 32x32 tiles to accelerate its processing.
Each tile is equipped with a border of 8 pixels around
all four edges, with pixels copied from neighbouring
tiles, or mirrored from within the tile if the tile
edge corresponds with a frame edge. This apron of 8
extra pixels ensures that the default sampling of
49 windows is correct, allowing pixels near the edge of
the tile to employ 49 sample windows that all have
valid pixels.

When x is set to 2 or more, sampling will "bump" into
the edges of the 48x48 region formed by the 32x32 tile
plus its 8-pixel apron. With strong values of the
de-noising parameters this will create artefacts in
the filtered image. These artefacts are visible as a
grid of vertical and horizontal lines corresponding
with the 32x32 arrangement of the tiles.
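
One way to see why the default fits inside the apron but x = 2 does not is sketched below. It rests on an inference rather than on the Deathray source: a pixel on the tile edge needs the most distant window centre (3 * x pixels away) plus that window's own radius of 3 to stay within the 8-pixel apron. The names are hypothetical.

```cpp
#include <cassert>

// Values quoted in the readme.
const int kWindowRadius = 3;  // 7x7 window
const int kApron = 8;         // border around each 32x32 tile

// Furthest pixel, measured from a pixel on the tile edge, that sampling can
// touch: the most distant window centre plus that window's own radius.
int RequiredApron(int x) {
  assert(x >= 1 && x <= 14);
  return kWindowRadius * x + kWindowRadius;
}

// True when every sampled window lies entirely inside the tile plus apron.
bool FitsInApron(int x) { return RequiredApron(x) <= kApron; }
```

Under that assumption x = 1 needs 6 pixels of apron and fits, while x = 2 needs 9 and overruns the 8 available.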

Avisynth MT
===========

Deathray is not thread safe. This means that only a single instance of
Deathray can be used per Avisynth script. By extension this means that
it is not compatible with any of the multi-threading modes of the
Multi Threaded variant of Avisynth.

Use:

SetMTMode(5)

before a call to Deathray in the Avisynth script, if multi-threading
is active in other parts of the script.


Multiple Scripts Using Deathray
===============================

The graphics driver is thread safe. This means it is possible to have
an arbitrary number of Avisynth scripts calling Deathray running on a
system.

e.g. 2 scripts could be encoding, another could be running in a media player
and another could be previewing individual frames in AvsP or VirtualDub.

Eventually video memory will probably run out, even though it's virtualised.


System Responsiveness
=====================

Currently graphics drivers are unable to confer user-responsiveness
guarantees on OpenCL applications that utilise GPUs. This means if you
are using Deathray on a frame size of 16 million pixels, there will be some
juddering in Windows every ~0.7 seconds (1.5 frames per second on HD 5870)
accompanied by difficulty in typing, etc.


Deathray is BSD licensed.

Zep
18th January 2011, 22:19
awesome. will test this weekend for sure. :P

pokazene_maslo
19th January 2011, 01:29
Is this filter motion compensated?

pirej
19th January 2011, 01:42
Nice one, I'll give it a try.
Is the ATI HD 5750 compatible? I have ATI Stream installed/enabled.


Edit: I guess it's compatible. I loaded the default settings and it works (just for previewing the filtered video; CPU load is only 20%, GPU load 50%). Now I have to try tweaking the settings and see the effect.
Thanks Jawed

Jawed
19th January 2011, 10:33
pokazene_maslo wrote: "Is this filter motion compensated?"
Not explicitly.

The general algorithm searches the entire image for blocks (usually called "windows") that look like the block around the target pixel. The similarity of each sampled block to the target block becomes the weight given to that sampled block's centre pixel.

The filtered pixel is then the weighted average of those centre pixels, i.e. their weighted sum divided by the total weight.

In Deathray the search is restricted to 49 windows around the target pixel. This means if motion is "low", i.e. less than 3 pixels in any direction, motion compensation "arises".
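
A minimal CPU sketch of that restricted search, for a single pixel of a single plane, might look like the following. It is not Deathray's OpenCL code: the weight exp(-MSD / h^2) is the textbook NLM form, the windows are compared without gaussian weighting, edges are clamped rather than mirrored, and the names PixelAt and NlmPixel are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Read a pixel with coordinates clamped to the plane. Clamping is an
// assumption made for brevity; the readme describes mirroring at frame edges.
float PixelAt(const std::vector<float>& plane, int width, int height, int x, int y) {
  x = std::min(std::max(x, 0), width - 1);
  y = std::min(std::max(y, 0), height - 1);
  return plane[static_cast<std::size_t>(y) * width + x];
}

// Spatial non-local means for one pixel: 49 sampled windows of 7x7 pixels.
// h is the de-noising strength (larger values blur more).
float NlmPixel(const std::vector<float>& plane, int width, int height,
               int x, int y, double h) {
  assert(h > 0.0);
  const int r = 3;  // radius of both the 7x7 window and the 7x7 search area
  double weighted_sum = 0.0;
  double weight_total = 0.0;
  for (int sy = -r; sy <= r; ++sy) {
    for (int sx = -r; sx <= r; ++sx) {
      // Sum of squared differences between the target and sampled windows.
      double ssd = 0.0;
      for (int ky = -r; ky <= r; ++ky) {
        for (int kx = -r; kx <= r; ++kx) {
          const double d = PixelAt(plane, width, height, x + kx, y + ky) -
                           PixelAt(plane, width, height, x + sx + kx, y + sy + ky);
          ssd += d * d;
        }
      }
      // Similar windows (small mean squared difference) get weights near 1.
      const double weight = std::exp(-(ssd / 49.0) / (h * h));
      weighted_sum += weight * PixelAt(plane, width, height, x + sx, y + sy);
      weight_total += weight;
    }
  }
  // The window at offset (0, 0) always has weight 1, so the total is never 0.
  return static_cast<float>(weighted_sum / weight_total);
}
```

If the content around a pixel has moved by up to 3 pixels, the best-matching window is still among the 49 sampled, which is how the motion compensation described above "arises".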

Deathray has an option, x, to increase the sampling area.

Intrinsically NLM is a spatial filtering technique based on self-similarity in real world images plus it is geared towards noise rather than artefacts such as JPEG/DCT blocks or interlacing artefacts. See this paper for a summary:

http://hal.archives-ouvertes.fr/docs/00/27/11/47/PDF/ijcvrevised.pdf

One of the problems with NLM (generally as well as in Deathray) is that it isn't doing time-series pixel averaging (what you might do with a series of photographs of a static scene) - the spatial aspect tends to dominate, even with a temporal radius of 5 or even 7.

Ironically, after the grand claims made in the paper linked above, hybrid time-series techniques have been experimented with by some of the same people:

ftp://ftp.math.ucla.edu/pub/camreport/cam09-62.pdf

In this paper you will see reference to something called BM3D, which as far as I can tell is the academics' name for MVTools' MVDegrain (or MVTools2's MDegrain).

The principle of the hybrid approach is to use BM3D where "registration" is achieved (i.e. motion compensation meets a threshold of suitability) and to use NLM where registration fails.

I normally use FizzKiller, which is a variation of MDegrain using a calmed clip for analysis:

http://forum.doom9.org/showthread.php?t=133977

but I'm looking for something faster, so I decided to implement temporal NLM.

I should update the FizzKiller script I posted in that thread (post 23) as I tweaked it a bit. Overall, FizzKiller is awesome.

Jawed
19th January 2011, 10:42
pirej wrote: "Nice one, I'll give it a try. Is the ATI HD 5750 compatible? I have ATI Stream installed/enabled. Edit: I guess it's compatible. I loaded the default settings and it works (just for previewing the filtered video; CPU load is only 20%, GPU load 50%). Now I have to try tweaking the settings and see the effect. Thanks Jawed"
Nice, thanks. Glad to hear it works somewhere else!

I'm working on linear correction, a post-filtering step, to improve detail retention. This improves the result while allowing stronger de-noising, so I will post an updated version of Deathray soon.

I recommend temporal rather than spatial - use 2 or 3 for the radius. I prefer low sigma, i.e. <=1. h varies with material.

With sigma set to ~0.7 you are effectively using a 5x5 kernel - the outer ring of 24 pixels in each 7x7-pixel window is effectively weighted "0" (all 24 of these pixels have a total weighting of 4%). The search area is still 7x7 (i.e. 49 windows), but the smaller kernel is "sharper".

I may implement a native 5x5 variant of Deathray to make it go faster, since I think temporal is more useful than spatial.

I also need to understand how to make arguments for a plug-in optional (this is my first plug-in). Argument handling is very clumsy. Any help on that would be appreciated.

Gavino
19th January 2011, 11:28
Jawed wrote: "I also need to understand how to make arguments for a plug-in optional"
Simply add the argument name in square brackets in the call to env->AddFunction, and supply a default value when you extract the value using AsInt etc.

For an example, see http://avisynth.org/mediawiki/Filter_SDK/Simple_sample_1.3a.

Jawed
19th January 2011, 11:45
Thanks, this is how I set up Deathray:

env->AddFunction("deathray", "c[hY]f[hUV]f[tY]i[tUV]i[s]f[x]i", CreateDeathray, 0);

Then I have:

AVSValue __cdecl CreateDeathray(AVSValue args, void *user_data, IScriptEnvironment *env) {

    // Each optional argument falls back to its default when omitted
    // and is then clamped to its documented range.

    // De-noising strength, luma: cannot be negative.
    double h_Y = args[1].AsFloat(1.);
    if (h_Y < 0.) h_Y = 0.;

    // De-noising strength, chroma: cannot be negative.
    double h_UV = args[2].AsFloat(1.);
    if (h_UV < 0.) h_UV = 0.;

    // Temporal radius, luma: 0 to 64.
    int temporal_radius_Y = args[3].AsInt(0);
    if (temporal_radius_Y < 0) temporal_radius_Y = 0;
    if (temporal_radius_Y > 64) temporal_radius_Y = 64;

    // Temporal radius, chroma: 0 to 64.
    int temporal_radius_UV = args[4].AsInt(0);
    if (temporal_radius_UV < 0) temporal_radius_UV = 0;
    if (temporal_radius_UV > 64) temporal_radius_UV = 64;

    // Sigma for the gaussian weights: at least 0.1.
    double sigma = args[5].AsFloat(1.);
    if (sigma < 0.1) sigma = 0.1;

    // Sampling expansion factor: 1 to 14.
    int sample_expand = args[6].AsInt(1);
    if (sample_expand <= 0) sample_expand = 1;
    if (sample_expand > 14) sample_expand = 14;

    return new deathray(args[0].AsClip(),
                        h_Y,
                        h_UV,
                        temporal_radius_Y,
                        temporal_radius_UV,
                        sigma,
                        sample_expand,
                        env);
}

If I do this:

deathray(hY=2)

it works fine. If, instead, I try this:

deathray(hy=2,1)

It fails. But now I think about it, I think that's not valid Avisynth function call syntax - you can't put an un-named parameter after a named one. Sigh, addled by C++...

Didée
19th January 2011, 11:59
Jawed wrote: "The principle of the hybrid approach is to use BM3D where "registration" is achieved (i.e. motion compensation meets a threshold of suitability) and to use NLM where registration fails."
To the experienced Avisynth users around here, this is kind of boring, isn't it? :rolleyes:

There's a bunch of scripts that do exactly this kind of "hybrid" filtering ... several by *mp4guy, one or two by me ...
Here (http://forum.doom9.org/showthread.php?p=1087049#post1087049) is one with a short explanation of the principle, 3 years old. And I definitely know that I first mentioned that problem/solution years before that already.

Jawed
19th January 2011, 12:28
Didée wrote: "To the experienced Avisynth users around here, this is kind of boring, isn't it? :rolleyes:"
:D Of course. I didn't suggest it was novel, did I?

Didée wrote: "There's a bunch of scripts that do exactly this kind of "hybrid" filtering ... several by *mp4guy, one or two by me ... Here (http://forum.doom9.org/showthread.php?p=1087049#post1087049) is one with a short explanation of the principle, 3 years old. And I definitely know that I first mentioned that problem/solution years before that already."
Yes, I made FizzKiller based on the calm-clip idea.

Has anyone built a hybrid of NLM and MDegrain?

I've been thinking about trying Deathray to generate the calm clip for FizzKiller...

ChaosKing
19th January 2011, 16:45
I get an error: "Error in OpenCl status=1 frame 0"
My graphics card is an Nvidia GTX 260.
Win7 x64

Jawed
19th January 2011, 17:46
That seems to be a compilation error. Or it could be that the code is too complex (which behaves like a compilation error).

I presume you have OpenCL working on your system?

As you can probably tell I don't have any NVidia cards to test with so it's something I can only do remotely with others' help.

If you or others are prepared to "mess about", I can try to diagnose the issue with tailored versions of the DLL (which might not do any filtering, but would verify basic capability).

I'm also planning on a change in architecture (which should improve performance), which has the side-effect of reducing complexity, making the code more likely to work on NVidia. But that's a few days away at least.

I'd be interested in results with NVidia 400 or 500 series as the code's complexity is theoretically less of a problem there.

(The complexity issue is to do with registers. The code uses an extremely high register allocation on ATI, and likely similar on NVidia. NVidia prefers lower register allocations, but 400/500 series should be fine. My planned changes include a reduction in register allocation.)

Did you try Deathray with all default values? i.e. use:

Deathray()

It may also be worth trying

Deathray(hUV=0)

but I'm doubtful that will work if the default version doesn't work.

If you'd like to try some diagnosis, try this version of Deathray:

www.cupidity.f9.co.uk/DeathrayNV110119001.zip

Delete the Deathray DLL that is installed in your plugins folder and put this version of Deathray in there. This version of Deathray merely passes through the frame, with default settings. (It will do something else, not sure what, if you turn on temporal filtering.) Make sure to test with no parameters, please.

I've also increased the detail on the error status message. That might provide some insight.

ChaosKing
19th January 2011, 18:49
Yep, OpenCL is working on my system. The NLMeansCL (http://forum.doom9.org/showthread.php?t=158925) filter works on my system for example ^^

hUV=0 makes no difference

Your debug version of Deathray gives me: "Error in OpenCL status=11 frame=0 and OpenCl status= -30"

Hope these values can help you :)

Jawed
19th January 2011, 20:04
Thanks. That means it is having trouble finding devices. Which is definitely not what I was expecting.

Do you have the AMD OpenCL installed on your computer, in addition to NVidia OpenCL? I'm wondering if it finds the AMD OpenCL first, finds no GPUs there, and then fails. But the status you're getting doesn't seem to correspond with that; there's a different error for that situation.

The OpenCL error is more mysterious, "invalid value"...
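
For reference, a small lookup for a handful of OpenCL status values is sketched below. The numeric values are the ones defined in the standard OpenCL header (cl.h); the function ClStatusName is hypothetical and covers only a few codes. Deathray's own "status" numbers in these messages are internal to the plug-in and are not decoded here.

```cpp
#include <cassert>
#include <cstring>

// Name of a few OpenCL status codes, using the values defined in cl.h.
// Only a small subset is listed; anything else is reported as unknown.
const char* ClStatusName(int status) {
  switch (status) {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -30: return "CL_INVALID_VALUE";
    case -32: return "CL_INVALID_PLATFORM";
    case -33: return "CL_INVALID_DEVICE";
    case -48: return "CL_INVALID_KERNEL";
    default:  return "unknown (not in this table)";
  }
}
```

Printing the name alongside the number in an error message saves a trip to the header when reading reports like the one above.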

If you'd like to help some more, this will tell me which OpenCL call is failing:

www.cupidity.f9.co.uk/DeathrayNV110119002.zip

EDIT: corrected, should be fine now

Didée
19th January 2011, 20:48
With this one, it is

"Error reading source frame 0: Avisynth read error: Deathray: Error in OpenCL status=11 frame 0 and OpenCL status=-30"

Jawed
19th January 2011, 21:08
Thanks, that's really peculiar. It's asking for the number of devices and seemingly responding that asking for the number of devices is invalid.

This is going to be tedious.

OK this test version doesn't ask for the device count (eventually Deathray will support multiple cards :p ), it assumes there's 1 device:

www.cupidity.f9.co.uk/DeathrayNV110119003.zip

Fingers-crossed.

Didée
19th January 2011, 22:19
Different message now:

"Error reading source frame 0: Avisynth read error: Single-frame initialisation failed, status=1"


However, note I'm not running the latest NV driver for my GT240, it's one or two revisions behind.
Reports from people running the most recent drivers could be more interesting.
Or, perhaps the GT240 is simply "too small" ?

ChaosKing
19th January 2011, 22:56
I get exactly the same messages as Didée, both new versions tested.
And no, there is no trace of an AMD driver on my system ^^"

Maybe this information can help?
===================================================
GPU Caps Viewer v1.9.5
http://www.ozone3d.net/gpu_caps_viewer/
===================================================


===================================[ System / CPU ]
- CPU Name: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
- CPU Core Speed: 2833 MHz
- CPU Num Cores: 4
- Family: 6 - Model: 7 - Stepping: 10
- Physical Memory Size: 4095 MB
- Operating System: Windows 7 64-bit build 7600 [No Service Pack]
- DirectX Version: 10.0
- PhysX Version: 9100514


===================================[ Graphics Adapter / GPU ]
- SLI: disabled
- GPUs: 1
- Logical GPUs: 1
- OpenGL Renderer: GeForce GTX 260/PCI/SSE2
- Drivers Renderer: NVIDIA GeForce GTX 260
- DB Renderer: NVIDIA GeForce GTX 260
- Device Description: NVIDIA GeForce GTX 260
- Adapter String: GeForce GTX 260
- Vendor: NVIDIA Corporation
- Vendor ID: 0x10DE
- Device ID: 0x05E2
- Sub device ID: 0x1109
- Sub vendor ID: 0x19DA
- Drivers Version: 8.17.12.6099 (10-16-2010) - nvoglv64
- GPU Codename: GT200
- GPU Unified Shader Processors: 192
- GPU Vertex Shader Processors: 0
- GPU Pixel Shader Processors: 0
- SM / SIMD: 24
- TPC: 8
- TPD (Watts): 182
- Video Memory Size: 896 MB
- Video Memory Type: GDDR3
- Clocks level #0: Core: 300MHz - Memory: 100MHz - Shader: 600MHz
- Clocks level #1: Core: 400MHz - Memory: 300MHz - Shader: 800MHz
- Clocks level #2: Core: 576MHz - Memory: 999MHz - Shader: 1242MHz
- BIOS String: 62.0.61.0.0
- Current Display Mode: 1280x1024 @ 60 Hz - 32 bpp


===================================[ OpenGL GPU Capabilities ]
- OpenGL Version: 3.3.0
- GLSL (OpenGL Shading Language) Version: 3.30 NVIDIA via Cg compiler
- ARB Texture Units: 4
- Vertex Shader Texture Units: 32
- Pixel Shader Texture Units: 32
- Geometry Shader Texture Units: 32
- Max Texture Size: 8192x8192
- Max Anisotropic Filtering Value: X16.0
- Max Point Sprite Size: 63.4
- Max Dynamic Lights: 8
- Max Viewport Size: 8192x8192
- Max Vertex Uniform Components: 4096
- Max Fragment Uniform Components: 2048
- Max Geometry Uniform Components: 2048
- Max Varying Float: 60
- Max Vertex Bindable Uniforms: 12
- Max Fragment Bindable Uniforms: 12
- Max Geometry Bindable Uniforms: 12
- Frame Buffer Objects (FBO) Support:[yes]
- Multiple Render Targets / Max draw buffers: 8
- Pixel Buffer Objects (PBO) Support:[yes]
- S3TC Texture Compression Support:[yes]
- ATI 3Dc Texture Compression Support:[no]
- Texture Rectangle Support:[yes]
- Floating Point Textures Support:[yes]
- MSAA: 2X
- MSAA: 4X
- MSAA: 8X
- MSAA: 16X
- OpenGL Extensions: 221 extensions (GL=199 and WGL=22)
- GL_ARB_blend_func_extended
- GL_ARB_color_buffer_float
- GL_ARB_compatibility
- GL_ARB_copy_buffer
- GL_ARB_debug_output
- GL_ARB_depth_buffer_float
- GL_ARB_depth_clamp
- GL_ARB_depth_texture
- GL_ARB_draw_buffers
- GL_ARB_draw_elements_base_vertex
- GL_ARB_draw_instanced
- GL_ARB_ES2_compatibility
- GL_ARB_explicit_attrib_location
- GL_ARB_fragment_coord_conventions
- GL_ARB_fragment_program
- GL_ARB_fragment_program_shadow
- GL_ARB_fragment_shader
- GL_ARB_framebuffer_object
- GL_ARB_framebuffer_sRGB
- GL_ARB_geometry_shader4
- GL_ARB_get_program_binary
- GL_ARB_half_float_pixel
- GL_ARB_half_float_vertex
- GL_ARB_imaging
- GL_ARB_instanced_arrays
- GL_ARB_map_buffer_range
- GL_ARB_multisample
- GL_ARB_multitexture
- GL_ARB_occlusion_query
- GL_ARB_occlusion_query2
- GL_ARB_pixel_buffer_object
- GL_ARB_point_parameters
- GL_ARB_point_sprite
- GL_ARB_provoking_vertex
- GL_ARB_robustness
- GL_ARB_sampler_objects
- GL_ARB_seamless_cube_map
- GL_ARB_separate_shader_objects
- GL_ARB_shader_bit_encoding
- GL_ARB_shader_objects
- GL_ARB_shading_language_100
- GL_ARB_shadow
- GL_ARB_sync
- GL_ARB_texture_border_clamp
- GL_ARB_texture_buffer_object
- GL_ARB_texture_compression
- GL_ARB_texture_compression_rgtc
- GL_ARB_texture_cube_map
- GL_ARB_texture_env_add
- GL_ARB_texture_env_combine
- GL_ARB_texture_env_crossbar
- GL_ARB_texture_env_dot3
- GL_ARB_texture_float
- GL_ARB_texture_mirrored_repeat
- GL_ARB_texture_multisample
- GL_ARB_texture_non_power_of_two
- GL_ARB_texture_rectangle
- GL_ARB_texture_rg
- GL_ARB_texture_rgb10_a2ui
- GL_ARB_texture_swizzle
- GL_ARB_timer_query
- GL_ARB_transform_feedback2
- GL_ARB_transpose_matrix
- GL_ARB_uniform_buffer_object
- GL_ARB_vertex_array_bgra
- GL_ARB_vertex_array_object
- GL_ARB_vertex_buffer_object
- GL_ARB_vertex_program
- GL_ARB_vertex_shader
- GL_ARB_vertex_type_2_10_10_10_rev
- GL_ARB_viewport_array
- GL_ARB_window_pos
- GL_ATI_draw_buffers
- GL_ATI_texture_float
- GL_ATI_texture_mirror_once
- GL_S3_s3tc
- GL_EXT_texture_env_add
- GL_EXT_abgr
- GL_EXT_bgra
- GL_EXT_bindable_uniform
- GL_EXT_blend_color
- GL_EXT_blend_equation_separate
- GL_EXT_blend_func_separate
- GL_EXT_blend_minmax
- GL_EXT_blend_subtract
- GL_EXT_compiled_vertex_array
- GL_EXT_Cg_shader
- GL_EXT_depth_bounds_test
- GL_EXT_direct_state_access
- GL_EXT_draw_buffers2
- GL_EXT_draw_instanced
- GL_EXT_draw_range_elements
- GL_EXT_fog_coord
- GL_EXT_framebuffer_blit
- GL_EXT_framebuffer_multisample
- GL_EXTX_framebuffer_mixed_formats
- GL_EXT_framebuffer_object
- GL_EXT_framebuffer_sRGB
- GL_EXT_geometry_shader4
- GL_EXT_gpu_program_parameters
- GL_EXT_gpu_shader4
- GL_EXT_multi_draw_arrays
- GL_EXT_packed_depth_stencil
- GL_EXT_packed_float
- GL_EXT_packed_pixels
- GL_EXT_pixel_buffer_object
- GL_EXT_point_parameters
- GL_EXT_provoking_vertex
- GL_EXT_rescale_normal
- GL_EXT_secondary_color
- GL_EXT_separate_shader_objects
- GL_EXT_separate_specular_color
- GL_EXT_shadow_funcs
- GL_EXT_stencil_two_side
- GL_EXT_stencil_wrap
- GL_EXT_texture3D
- GL_EXT_texture_array
- GL_EXT_texture_buffer_object
- GL_EXT_texture_compression_latc
- GL_EXT_texture_compression_rgtc
- GL_EXT_texture_compression_s3tc
- GL_EXT_texture_cube_map
- GL_EXT_texture_edge_clamp
- GL_EXT_texture_env_combine
- GL_EXT_texture_env_dot3
- GL_EXT_texture_filter_anisotropic
- GL_EXT_texture_integer
- GL_EXT_texture_lod
- GL_EXT_texture_lod_bias
- GL_EXT_texture_mirror_clamp
- GL_EXT_texture_object
- GL_EXT_texture_shared_exponent
- GL_EXT_texture_sRGB
- GL_EXT_texture_swizzle
- GL_EXT_timer_query
- GL_EXT_transform_feedback2
- GL_EXT_vertex_array
- GL_EXT_vertex_array_bgra
- GL_IBM_rasterpos_clip
- GL_IBM_texture_mirrored_repeat
- GL_KTX_buffer_region
- GL_NV_blend_square
- GL_NV_conditional_render
- GL_NV_copy_depth_to_color
- GL_NV_copy_image
- GL_NV_depth_buffer_float
- GL_NV_depth_clamp
- GL_NV_explicit_multisample
- GL_NV_fence
- GL_NV_float_buffer
- GL_NV_fog_distance
- GL_NV_fragment_program
- GL_NV_fragment_program_option
- GL_NV_fragment_program2
- GL_NV_framebuffer_multisample_coverage
- GL_NV_geometry_shader4
- GL_NV_gpu_program4
- GL_NV_half_float
- GL_NV_light_max_exponent
- GL_NV_multisample_coverage
- GL_NV_multisample_filter_hint
- GL_NV_occlusion_query
- GL_NV_packed_depth_stencil
- GL_NV_parameter_buffer_object
- GL_NV_parameter_buffer_object2
- GL_NV_pixel_data_range
- GL_NV_point_sprite
- GL_NV_primitive_restart
- GL_NV_register_combiners
- GL_NV_register_combiners2
- GL_NV_shader_buffer_load
- GL_NV_texgen_reflection
- GL_NV_texture_barrier
- GL_NV_texture_compression_vtc
- GL_NV_texture_env_combine4
- GL_NV_texture_expand_normal
- GL_NV_texture_multisample
- GL_NV_texture_rectangle
- GL_NV_texture_shader
- GL_NV_texture_shader2
- GL_NV_texture_shader3
- GL_NV_transform_feedback
- GL_NV_transform_feedback2
- GL_NV_vertex_array_range
- GL_NV_vertex_array_range2
- GL_NV_vertex_buffer_unified_memory
- GL_NV_vertex_program
- GL_NV_vertex_program1_1
- GL_NV_vertex_program2
- GL_NV_vertex_program2_option
- GL_NV_vertex_program3
- GL_NVX_conditional_render
- GL_NVX_gpu_memory_info
- GL_SGIS_generate_mipmap
- GL_SGIS_texture_lod
- GL_SGIX_depth_texture
- GL_SGIX_shadow
- GL_SUN_slice_accum
- GL_WIN_swap_hint
- WGL_EXT_swap_control
- WGL_ARB_buffer_region
- WGL_ARB_create_context
- WGL_ARB_create_context_profile
- WGL_ARB_create_context_robustness
- WGL_ARB_extensions_string
- WGL_ARB_make_current_read
- WGL_ARB_multisample
- WGL_ARB_pbuffer
- WGL_ARB_pixel_format
- WGL_ARB_pixel_format_float
- WGL_ARB_render_texture
- WGL_ATI_pixel_format_float
- WGL_EXT_create_context_es2_profile
- WGL_EXT_extensions_string
- WGL_EXT_framebuffer_sRGB
- WGL_EXT_pixel_format_packed_float
- WGL_NVX_DX_interop
- WGL_NV_float_buffer
- WGL_NV_multisample_coverage
- WGL_NV_render_depth_texture
- WGL_NV_render_texture_rectangle


===================================[ NVIDIA CUDA Capabilities ]
- CUDA Device 0
- Device name: GeForce GTX 260
- Compute Capability: 1.3
- Total Memory: 877 MB
- Shader Clock Rate: 1242 MHz
- Multiprocessors: 24
- Warp Size: 32
- Max Threads Per Block: 512
- Threads Per Block: 512 x 512 x 64
- Grid Size: 65535 x 65535 x 1
- Registers Per Block: 16384
- Texture Alignment: 256 byte
- Total Constant Memory: 64 Kb


===================================[ OpenCL Capabilities ]
- Num OpenCL platforms: 1
- Name: NVIDIA CUDA
- Version: OpenCL 1.0 CUDA 3.2.1
- Profile: FULL_PROFILE
- Vendor: NVIDIA Corporation
- Num devices: 1

- CL_DEVICE_NAME: GeForce GTX 260
- CL_DEVICE_VENDOR: NVIDIA Corporation
- CL_DRIVER_VERSION: 260.99
- CL_DEVICE_PROFILE: FULL_PROFILE
- CL_DEVICE_VERSION: OpenCL 1.0 CUDA
- CL_DEVICE_TYPE: GPU
- CL_DEVICE_VENDOR_ID: 0x10DE
- CL_DEVICE_MAX_COMPUTE_UNITS: 24
- CL_DEVICE_MAX_CLOCK_FREQUENCY: 1242MHz
- CL_NV_DEVICE_COMPUTE_CAPABILITY_MAJOR: 1
- CL_NV_DEVICE_COMPUTE_CAPABILITY_MINOR: 3
- CL_NV_DEVICE_REGISTERS_PER_BLOCK: 16384
- CL_NV_DEVICE_WARP_SIZE: 32
- CL_NV_DEVICE_GPU_OVERLAP: 1
- CL_NV_DEVICE_KERNEL_EXEC_TIMEOUT: 1
- CL_NV_DEVICE_INTEGRATED_MEMORY: 0
- CL_DEVICE_ADDRESS_BITS: 32
- CL_DEVICE_MAX_MEM_ALLOC_SIZE: 224608KB
- CL_DEVICE_GLOBAL_MEM_SIZE: 877MB
- CL_DEVICE_MAX_PARAMETER_SIZE: 4352
- CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: 0 Bytes
- CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: 0KB
- CL_DEVICE_ERROR_CORRECTION_SUPPORT: NO
- CL_DEVICE_LOCAL_MEM_TYPE: Local (scratchpad)
- CL_DEVICE_LOCAL_MEM_SIZE: 16KB
- CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64KB
- CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
- CL_DEVICE_MAX_WORK_ITEM_SIZES: [512 ; 512 ; 64]
- CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
- CL_EXEC_NATIVE_KERNEL: 4751356
- CL_DEVICE_IMAGE_SUPPORT: YES
- CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
- CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
- CL_DEVICE_IMAGE2D_MAX_WIDTH: 4096
- CL_DEVICE_IMAGE2D_MAX_HEIGHT: 32768
- CL_DEVICE_IMAGE3D_MAX_WIDTH: 2048
- CL_DEVICE_IMAGE3D_MAX_HEIGHT: 2048
- CL_DEVICE_IMAGE3D_MAX_DEPTH: 16
- CL_DEVICE_MAX_SAMPLERS: 16
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR: 1
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT: 1
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT: 1
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG: 1
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT: 1
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE: 1
- CL_DEVICE_EXTENSIONS: 16
- Extensions:
- cl_khr_byte_addressable_store
- cl_khr_icd
- cl_khr_gl_sharing
- cl_nv_d3d9_sharing
- cl_nv_d3d10_sharing
- cl_khr_d3d10_sharing
- cl_nv_d3d11_sharing
- cl_nv_compiler_options
- cl_nv_device_attribute_query
- cl_nv_pragma_unroll
-
- cl_khr_global_int32_base_atomics
- cl_khr_global_int32_extended_atomics
- cl_khr_local_int32_base_atomics
- cl_khr_local_int32_extended_atomics
- cl_khr_fp64


===================================[ Misc. ]


===================================[ Related Graphics Drivers ]
- http://www.geeks3d.com/?page_id=752
- http://downloads.guru3d.com/download.php?id=10
- http://www.tweakguides.com/NVFORCE_1.html
- http://www.nvidia.com/object/winxp-2k_archive.html
- http://www.geeks3d.com/?p=65


===================================[ Related Graphics Cards Reviews ]
- http://www.geeks3d.com/?tag=geforce-gtx-260
- http://www.google.us/search?q=NVIDIA+GeForce+GTX+260+review

TheProfileth
19th January 2011, 22:57
Will give this a look see in a bit.
Edit:
I tried the normal version and the 3 fixed versions in AvsPmod.
I got this for the normal version:
Traceback (most recent call last):
File "AvsP.pyo", line 5813, in OnMenuVideoRefresh
File "AvsP.pyo", line 8950, in ShowVideoFrame
File "AvsP.pyo", line 9492, in PaintAVIFrame
File "pyavs.pyo", line 322, in DrawFrame
File "pyavs.pyo", line 306, in _GetFrame
File "avisynth.pyo", line 309, in GetPitch
ValueError: NULL pointer access

I got this for the first fixed
Traceback (most recent call last):
File "AvsP.pyo", line 5813, in OnMenuVideoRefresh
File "AvsP.pyo", line 8950, in ShowVideoFrame
File "AvsP.pyo", line 9492, in PaintAVIFrame
File "pyavs.pyo", line 322, in DrawFrame
File "pyavs.pyo", line 306, in _GetFrame
File "avisynth.pyo", line 309, in GetPitch
ValueError: NULL pointer access

and this for the second and third
Traceback (most recent call last):
File "AvsP.pyo", line 6281, in OnButtonTextKillFocus
File "AvsP.pyo", line 8950, in ShowVideoFrame
File "AvsP.pyo", line 9492, in PaintAVIFrame
File "pyavs.pyo", line 322, in DrawFrame
File "pyavs.pyo", line 306, in _GetFrame
File "avisynth.pyo", line 309, in GetPitch
ValueError: NULL pointer access
Traceback (most recent call last):
File "AvsP.pyo", line 7147, in OnPaintVideoWindow
File "AvsP.pyo", line 9492, in PaintAVIFrame
File "pyavs.pyo", line 322, in DrawFrame
File "pyavs.pyo", line 306, in _GetFrame
File "avisynth.pyo", line 309, in GetPitch
ValueError: NULL pointer access

Really want to test this filter out, so I hope you fix it soon.
Also, I am able to get NLMeansCL to run fine on my computer; I have a GTX 260 and an AMD Phenom quad core.

Jawed
19th January 2011, 23:38
Thanks Didée and ChaosKing. That means it's found the GPU and is now trying to create memory allocations and/or preparing the GPU code (kind of compilation).

From another forum (someone else's application) I've learnt there are (or at least, were) problems with NVidia's support for my use of a feature, which complicates things.

This is my last update for tonight:

www.cupidity.f9.co.uk/DeathrayNV110119004.zip

Yet more detail in error tracking. Though I'm pessimistic overall due to the problem I mentioned above.

Jawed
19th January 2011, 23:46
I really want to test this filter out, so I hope you can fix it soon.
Also, I am able to get nlmeanscl to run fine on my computer. I have a GTX 260 and an AMD Phenom quad core.
Thanks for your report.

I don't know why but AvsP has that trouble when Deathray throws an error message. Playing the script in MPC is fine, reporting the messages that the others have posted.

In my experience, sometimes that message is true, there really was a null pointer access. I just can't tell.

In SingleFrameInit Deathray has tried to create two buffers on the GPU, input and output, for each of the 3 planes - so 6 in total. It does this once when Deathray is loaded and re-uses them for each frame.
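
To put rough numbers on those six buffers, here is a minimal Python sketch. It assumes YV12 (8-bit samples, chroma planes at half width and half height) and no row padding, which may not match what Deathray actually allocates. The function name and the 720x576 frame size are illustrative only, not taken from the Deathray source.

```python
def yv12_plane_buffers(width, height):
    """Return byte sizes of the input and output buffers for each YV12 plane.

    Assumptions (not from Deathray's source): 8-bit samples, chroma
    subsampled 2x2, no pitch padding on any row.
    """
    luma = width * height
    chroma = (width // 2) * (height // 2)
    sizes = {}
    for plane, size in (("Y", luma), ("U", chroma), ("V", chroma)):
        sizes[plane + "_in"] = size   # buffer the frame is uploaded into
        sizes[plane + "_out"] = size  # buffer the filtered plane is read back from
    return sizes


buffers = yv12_plane_buffers(720, 576)
print(len(buffers), "buffers,", sum(buffers.values()), "bytes in total")
```

Because the sizes depend only on the clip dimensions, allocating once at load time and re-using the buffers for every frame is a natural fit.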

Jawed
20th January 2011, 00:43
Oh, I've just noticed from ChaosKing's capabilities dump that GTX260 is OpenCL 1.0.

I will have to research the differences between 1.0 and 1.1 to see if I've used something 1.1-specific that's causing problems for 1.0 cards. The obvious thing, the size of local memory, shouldn't be an issue. I'm allocating ~11KB of local memory out of the 16KB available on OpenCL 1.0 devices.

I'm doubtful this is the source of the problem, but...

Jawed
20th January 2011, 00:58
ARGH

I've just discovered that if the DLL is named Deathray.DLL it's OK, but if the DLL is named like the debug versions I provided earlier, it fails.

SIGH. I forgot that Deathray asks Windows for a handle to "Deathray.dll" as part of compilation. So this has wasted quite a bit of time. Sorry.

Versions 3 or 4 may actually work if the NV110119003/4 suffix is deleted. That's what I get for testing before renaming the file for distribution.

TheProfileth
20th January 2011, 01:19
ooh let me try :D
Edit:
damn, renaming them did not work :(

Jawed
20th January 2011, 12:42
If someone can report the error status numbers from the 4th debug DLL I uploaded, that would be cool, thanks.

ChaosKing
20th January 2011, 14:37
debug v4: Single-Frame initialisation failed, status=6 and OpenCL status= -48
I also updated to the newest Nvidia driver, but it changed nothing.

Jawed
20th January 2011, 15:27
That error happens to correspond with the DLL filename being wrong, although of course it could be the real error.
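
For anyone decoding these numbers: the "OpenCL status" is one of the error constants from the Khronos cl.h header, and -48 is CL_INVALID_KERNEL. A small lookup sketch follows. The table is deliberately partial, describe_status is a hypothetical helper, and Deathray's own "status=6" is internal to the plug-in and not covered here.

```python
# Partial table of OpenCL error constants as defined in the Khronos cl.h header.
OPENCL_STATUS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -11: "CL_BUILD_PROGRAM_FAILURE",
    -44: "CL_INVALID_PROGRAM",
    -45: "CL_INVALID_PROGRAM_EXECUTABLE",
    -46: "CL_INVALID_KERNEL_NAME",
    -48: "CL_INVALID_KERNEL",
}


def describe_status(code):
    """Map a numeric OpenCL status to its symbolic name (hypothetical helper)."""
    return OPENCL_STATUS.get(code, "unknown OpenCL status %d" % code)


print(describe_status(-48))
```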

Please make sure that the DLL is called deathray.dll and that there is only one deathray DLL in the plugins folder.

If that doesn't solve the problem, then I need to make an even simpler piece of code to test the OpenCL compilation.

ChaosKing
20th January 2011, 18:01
I tested both filenames (also deleted other deathray* dlls), same error :/

Jawed
20th January 2011, 18:29
Thanks very much for your patience.

OK, this version of Deathray will produce a file called Deathray.log. This file appears in the same folder as your Avisynth script.

This file contains the output from the OpenCL compiler, if an error occurs during compilation. If no error occurs, the file will have 0 bytes as its file size.

www.cupidity.f9.co.uk/DeathrayNV110120001.zip

If Deathray.log contains some text it would be useful if you can post the text here. Thanks.

ChaosKing
20th January 2011, 19:58
Here's my Debug Log:
<program source>:300:22: error: no matching function for call to 'max'
int2 sample_start = max(target - sample_radius, 3);
^~~
<built-in>:3696:26: note: candidate function
ulong16 __OVERLOADABLE__ max(ulong16, ulong16);
^
<built-in>:3695:25: note: candidate function
ulong8 __OVERLOADABLE__ max(ulong8, ulong8);
^
<built-in>:3694:25: note: candidate function
ulong4 __OVERLOADABLE__ max(ulong4, ulong4);
^
<built-in>:3690:25: note: candidate function
ulong2 __OVERLOADABLE__ max(ulong2, ulong2);
^
<built-in>:3689:25: note: candidate function
long16 __OVERLOADABLE__ max(long16, long16);
^
<built-in>:3688:24: note: candidate function
long8 __OVERLOADABLE__ max(long8, long8);
^
<built-in>:3687:24: note: candidate function
long4 __OVERLOADABLE__ max(long4, long4);
^
<built-in>:3683:24: note: candidate function
long2 __OVERLOADABLE__ max(long2, long2);
^
<built-in>:3682:25: note: candidate function
uint16 __OVERLOADABLE__ max(uint16, uint16);
^
<built-in>:3681:24: note: candidate function
uint8 __OVERLOADABLE__ max(uint8, uint8);
^
<built-in>:3680:24: note: candidate function
uint4 __OVERLOADABLE__ max(uint4, uint4);
^
<built-in>:3676:24: note: candidate function
uint2 __OVERLOADABLE__ max(uint2, uint2);
^
<built-in>:3675:24: note: candidate function
int16 __OVERLOADABLE__ max(int16, int16);
^
<built-in>:3674:23: note: candidate function
int8 __OVERLOADABLE__ max(int8, int8);
^
<built-in>:3673:23: note: candidate function
int4 __OVERLOADABLE__ max(int4, int4);
^
<built-in>:3669:23: note: candidate function
int2 __OVERLOADABLE__ max(int2, int2);
^
<built-in>:3668:27: note: candidate function
ushort16 __OVERLOADABLE__ max(ushort16, ushort16);
^
<built-in>:3667:26: note: candidate function
ushort8 __OVERLOADABLE__ max(ushort8, ushort8);
^
<built-in>:3666:26: note: candidate function
ushort4 __OVERLOADABLE__ max(ushort4, ushort4);
^
<built-in>:3662:26: note: candidate function
ushort2 __OVERLOADABLE__ max(ushort2, ushort2);
^
<built-in>:3661:26: note: candidate function
short16 __OVERLOADABLE__ max(short16, short16);
^
<built-in>:3660:25: note: candidate function
short8 __OVERLOADABLE__ max(short8, short8);
^
<built-in>:3659:25: note: candidate function
short4 __OVERLOADABLE__ max(short4, short4);
^
<built-in>:3655:25: note: candidate function
short2 __OVERLOADABLE__ max(short2, short2);
^
<built-in>:3654:26: note: candidate function
uchar16 __OVERLOADABLE__ max(uchar16, uchar16);
^
<built-in>:3653:25: note: candidate function
uchar8 __OVERLOADABLE__ max(uchar8, uchar8);
^
<built-in>:3652:25: note: candidate function
uchar4 __OVERLOADABLE__ max(uchar4, uchar4);
^
<built-in>:3648:25: note: candidate function
uchar2 __OVERLOADABLE__ max(uchar2, uchar2);
^
<built-in>:3647:25: note: candidate function
char16 __OVERLOADABLE__ max(char16, char16);
^
<built-in>:3646:24: note: candidate function
char8 __OVERLOADABLE__ max(char8, char8);
^
<built-in>:3645:24: note: candidate function
char4 __OVERLOADABLE__ max(char4, char4);
^
<built-in>:3641:24: note: candidate function
char2 __OVERLOADABLE__ max(char2, char2);
^
<built-in>:3640:27: note: candidate function
double16 __OVERLOADABLE__ max(double16, double16);
^
<built-in>:3639:26: note: candidate function
double8 __OVERLOADABLE__ max(double8, double8);
^
<built-in>:3638:26: note: candidate function
double4 __OVERLOADABLE__ max(double4, double4);
^
<built-in>:3634:26: note: candidate function
double2 __OVERLOADABLE__ max(double2, double2);
^
<built-in>:3633:26: note: candidate function
float16 __OVERLOADABLE__ max(float16, float16);
^
<built-in>:3632:25: note: candidate function
float8 __OVERLOADABLE__ max(float8, float8);
^
<built-in>:3631:25: note: candidate function
float4 __OVERLOADABLE__ max(float4, float4);
^
<built-in>:3627:25: note: candidate function
float2 __OVERLOADABLE__ max(float2, float2);
^
<built-in>:3474:27: note: candidate function
double16 __OVERLOADABLE__ max(double16 x, double y) ;
^
<built-in>:3473:27: note: candidate function
double8 __OVERLOADABLE__ max(double8 x, double y) ;
^
<built-in>:3472:27: note: candidate function
double4 __OVERLOADABLE__ max(double4 x, double y) ;
^
<built-in>:3468:27: note: candidate function
double2 __OVERLOADABLE__ max(double2 x, double y) ;
^
<built-in>:3458:26: note: candidate function
float16 __OVERLOADABLE__ max(float16 x, float y) ;
^
<built-in>:3457:26: note: candidate function
float8 __OVERLOADABLE__ max(float8 x, float y) ;
^
<built-in>:3456:26: note: candidate function
float4 __OVERLOADABLE__ max(float4 x, float y) ;
^
<built-in>:3452:26: note: candidate function
float2 __OVERLOADABLE__ max(float2 x, float y) ;
^
<built-in>:3449:24: note: candidate function
ulong __OVERLOADABLE__ max(ulong, ulong);
^
<built-in>:3448:23: note: candidate function
long __OVERLOADABLE__ max(long, long);
^
<built-in>:3447:23: note: candidate function
uint __OVERLOADABLE__ max(uint, uint);
^
<built-in>:3446:22: note: candidate function
int __OVERLOADABLE__ max(int, int);
^
<built-in>:3445:25: note: candidate function
ushort __OVERLOADABLE__ max(ushort, ushort);
^
<built-in>:3444:24: note: candidate function
short __OVERLOADABLE__ max(short, short);
^
<built-in>:3443:24: note: candidate function
uchar __OVERLOADABLE__ max(uchar, uchar);
^
<built-in>:3442:23: note: candidate function
char __OVERLOADABLE__ max(char, char);
^
<built-in>:3441:25: note: candidate function
double __OVERLOADABLE__ max(double, double);
^
<built-in>:3440:24: note: candidate function
float __OVERLOADABLE__ max(float, float);
^
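
With a log this long it helps to strip out the "candidate function" notes and keep only the errors. A minimal Python sketch, assuming the log lines look like the ones above or like the shorter form with no source name and no column (":310: error: ..."). The function name and regular expression are illustrative and have nothing to do with how Deathray writes the file.

```python
import re

# source name (may be empty), line number, optional column, then "error:".
ERROR_LINE = re.compile(
    r"^(?P<source>[^:]*):(?P<line>\d+)(?::(?P<col>\d+))?: error: (?P<msg>.*)$"
)


def compiler_errors(log_text):
    """Return (line_number, message) for every error line in a build log."""
    errors = []
    for raw in log_text.splitlines():
        match = ERROR_LINE.match(raw.strip())
        if match:
            errors.append((int(match.group("line")), match.group("msg")))
    return errors


sample = """<program source>:300:22: error: no matching function for call to 'max'
int2 sample_start = max(target - sample_radius, 3);
<built-in>:3696:26: note: candidate function
:310: error: incompatible type initializing 'int', expected 'float4'
"""
for line_number, message in compiler_errors(sample):
    print(line_number, message)
```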

Jawed
20th January 2011, 21:12
Ooh, that should be a simple fix. Fingers crossed it's the last thing.

First, try this:

www.cupidity.f9.co.uk/DeathrayNV110120002.zip

This contains a fix for the bug above. I think it's my fault: the AMD compiler seems to be mistakenly accepting that line as valid.

If that works, then try this:

www.cupidity.f9.co.uk/DeathrayNV110120003.zip

It will actually filter the video instead of doing nothing to it. It will do spatial or temporal, so all the options are available. It also produces the log file, so let's hope that it's empty.

The filtering is different from version 1.00. This is an experimental linear correction. I don't think the math is correct though, so I'm still working on it. Despite that I prefer it to version 1.00...

TheProfileth
20th January 2011, 21:15
Will test now
Edit:
Woohoo it works!
time to take a look at this thing
Edit2:
Works pretty well, and retains decent details,
http://screenshotcomparison.com/comparison/21410
http://screenshotcomparison.com/comparison/21411
my only issue is that if something is surrounded by black, it sort of gets washed into the black area
the color sort of gets sucked out of those areas too
I wonder if there is a way to give things that border black areas more weight

Jawed
20th January 2011, 21:27
Yay. Thanks guys. Hope it works for you all.

I'll produce a proper version 1.01 tomorrow, probably.

ChaosKing
20th January 2011, 21:36
Yeah it works! :D
I get ~28 fps on my GTX 260
I will play with the filter tomorrow.

Didée
20th January 2011, 21:39
Aaahh, waitaminute ...

With 110120003:

Avisynth read error:
"Single-frame initialisation failed, status=6 and OpenCL status=-48"

log:
:310: error: incompatible type initializing 'int', expected 'float4'
float4 euclidean_distance = 0;
^
:396: error: incompatible type initializing 'int', expected 'float4'
float4 average = 0;
^
:397: error: incompatible type initializing 'int', expected 'float4'
float4 weight = 0;
^

Jawed
20th January 2011, 21:53
Sigh, I'm normally over-zealous with zeroes.

Try this:

www.cupidity.f9.co.uk/DeathrayNV110120004.zip

Jawed
20th January 2011, 21:59
my only issue is that if something is surrounded by black, it sort of gets washed into the black area
the color sort of gets sucked out of those areas too
I wonder if there is a way to give things that border black areas more weight
Try a larger temporal radius, 2 or 3. Also don't be afraid to lower s (sigma) and increase h at the same time.

It's a bit tricky. Version 1.00 is a lot harder to use and inferior. As I say, I'm still working on the linear correction.

Didée
20th January 2011, 22:06
Hooray, that 110120004 finally works for me, too.

Jawed
20th January 2011, 22:14
Great.

Just need to get my brain around the variance of a single sample and get the linear correction working. Then ponder whether I want to attempt a first or second order regression correction. Gulp.

aNToK
21st January 2011, 07:52
Hi, read in the literature that this doesn't work with AMD 4xxx series, but wasn't clear whether that was the regular Radeon 4xxx series or the HD 4xxx series. Did you mean the HD ones as well?

Jawed
21st January 2011, 09:13
Yes, HD 4000 series. It doesn't support a workgroup size of 256. There's a way round this so there's a decent chance I'll make an additional path specifically for that series, when I do other changes I'm planning.

aNToK
21st January 2011, 11:12
I'll get around to upgrading the card one of these days, but since I don't game much I haven't had the need.

Thanks for the reply and the possible future support.

bus_labi
21st January 2011, 21:24
I have a collection of DV home videos, quite a lot of it shot in low-light conditions and thus noisy. Over the last few days I extensively tested FFT3DFilter with various settings to improve the videos.

I found out about Deathray and was eager to test it, because I had read promising things about NLM denoising.

For testing I wrote a simple script that loops 100+ frames through the filter at different settings and outputs the filtered video plus a video of the original minus the filtered image (with contrast enlarged). This gives a very good visual of what the filter is doing without much pixel peeking. If the filter removes only white noise, then only white noise appears in the difference image. If some structure is visible, then either that area is not denoised, or it is denoised more, or some structure degradation is happening. I am testing only the luma channel for now.

Win7 64-bit, ATI Radeon HD57xx with 1 GB RAM, latest versions of Catalyst and Stream SDK installed yesterday. Today I experimented with Deathray for an hour or so and had not a single crash or problem.

Some observations (based on my noisy dark video sample).

At low temporal settings (tY = 1 or 2) there are substantial areas which are not denoised at all. Going up to tY=5 improves this, but there are still areas which are less denoised. FFT3DFilter is better in this respect, as there is much less structure in the noise removed. I tested Deathray up to tY=5, hY=2 and s=1.
I could not see a difference between hY 1.5 and 2. I will go up from there later.
Deathray was removing about the same amount of noise as FFT3DFilter, but quite differently: the noise FFT3DFilter removed was mostly high-contrast pixels, whereas Deathray removed lower-contrast pixels and left the high-contrast noise.
As a result FFT3DFilter produced a smoother picture. The higher contrast gives the effect of seemingly more detail in the Deathray'ed picture, but it is less pleasing to look at because of the high-contrast noise.
Neither filter, at reasonable settings, takes away any structure.
I noticed that FFT3DFilter tends to denoise more in the areas which Deathray tends to avoid, so I tested FFT3DFilter on the Deathray'ed video. This produced the most pleasing result, as FFT3DFilter removed the high-contrast pixels. The two filters seem to complement each other. As of now I would not use Deathray alone because of the high-contrast noise. I did not use extreme settings on either filter, as I prefer some grain to complete smoothness.

I'll upload some screenshots, but static images do not really give a good impression; a moving image is much better and more informative. Here is the simple script I use for testing.


fr = 100 # number of frames to be looped
sigma_start = 0.5 # start value for sigma range
sigma_incr = 0.1 # sigma increment
sigma_finish = 1.0 # end value for sigma range
hY_start = 2.0 # denoise strength luma
hY_finish = 2.0 # denoise strength luma
tY = 5 # temporal radius luma

c = DirectShowSource ("2001 03 229.avi", audio=false).viewfields.crop(0,0,-0,-288).trim (0,fr).grayscale.converttoyv12
gray = BlankClip(c, length=fr, color = $808080)
cf = BlankClip(c, 0)
co = BlankClip(c, length=0)
co = Stackvertical (co,co)

GScript ("""
while (hY_start <= hY_finish)
{
    ssigma = sigma_start
    while (ssigma <= sigma_finish)
    {
        if (FrameCount(cf) > 0) {NOP} else {cr = gray}
        cf = deathray (c, hy = hY_start, huv = 0.0, ty = tY, s = ssigma)
        cd = subtract (c, cf)
        Stackvertical ( \
            cf.Subtitle("sigma = " + String(ssigma) + "\n hY = " + String(hY_start)+ "\n tY = " + String(tY), align=6, lsp=0), \
            cd.histogram(mode="luma").Subtitle("Difference original - filtered", align=9, lsp=0))
        co = co + last
        ssigma = ssigma + sigma_incr
    }
    hY_start = hY_start + 1
}
""")
return (co)
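
The "original minus filtered, contrast enlarged" view that the script builds can be sketched per pixel in plain Python. This is only a model of the idea: the gain of 8 and the mid-grey offset of 128 are assumptions chosen for illustration, not the exact arithmetic of Avisynth's Subtract or Histogram filters.

```python
def amplified_difference(original, filtered, gain=8, offset=128):
    """Per-pixel (original - filtered) * gain, centred on mid-grey, clamped to 0..255.

    gain and offset are illustrative assumptions, not Avisynth's exact values.
    """
    out = []
    for row_o, row_f in zip(original, filtered):
        out.append([
            min(255, max(0, offset + gain * (o - f)))
            for o, f in zip(row_o, row_f)
        ])
    return out


# Tiny made-up luma planes: the filter has smoothed small fluctuations
# and also flattened one bright detail (180 -> 150).
original = [[100, 104, 98], [101, 180, 99]]
filtered = [[100, 101, 100], [100, 150, 100]]
diff = amplified_difference(original, filtered)
for row in diff:
    print(row)
```

Pixels the filter left alone come out mid-grey, removed noise shows as small excursions around grey, and removed structure saturates, which is why structure in the difference image is a warning sign.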

Jawed
22nd January 2011, 11:28
For testing I wrote a simple script that loops 100+ frames through the filter at different settings and outputs the filtered video plus a video of the original minus the filtered image (with contrast enlarged).
Your automated parameters technique is nice.

This gives a very good visual of what the filter is doing without much pixel peeking. If the filter removes only white noise, then only white noise appears in the difference image. If some structure is visible, then either that area is not denoised, or it is denoised more, or some structure degradation is happening. I am testing only the luma channel for now.
I've been doing something very similar, testing with the histogram("luma") technique or with a simple levels call to magnify the subtraction.

Though I've been doing it by hand, single stepping through clips and tweaking parameters. My reference filter, for what it's worth, is FizzKiller.

Have you evaluated motion-compensated MDegrain (from MVTools 2)? It's slow, but extremely good.

Obviously I also make encodes and watch them, too.

Win7 64-bit, ATI Radeon HD57xx with 1 GB RAM, latest versions of Catalyst and Stream SDK installed yesterday. Today I experimented with Deathray for an hour or so and had not a single crash or problem.
There is a way to crash Deathray. I'm surprised your script hasn't caused it to crash, to be honest. I don't understand the reason for the crash and I can't reliably reproduce it. But repeatedly hand-tweaking Deathray parameters and single-frame stepping tends to crash it after a while, sometimes.

Some observations (based on my noisy dark video sample).

At low temporal settings (tY = 1 or 2) there are substantial areas which are not denoised at all. Going up to tY=5 improves this, but there are still areas which are less denoised. FFT3DFilter is better in this respect, as there is much less structure in the noise removed. I tested Deathray up to tY=5, hY=2 and s=1.
I find the "naked" NLM, which is what version 1.00 is, inadequate. This is the original algorithm, first described by Buades, Coll and Morel - with the caveat that the sampling region is limited, instead of being the entire image. (It turns out there's very little to be gained by sampling the entire image). Actually this algorithm is tweaked with a re-weighting for the target pixel that they described later.

It's too weak at settings that are required not to destroy fine detail.
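
For readers new to the algorithm, here is a minimal one-dimensional Python sketch of non-local means with a limited sampling region. It is not Deathray's kernel. The weight function exp(-d / h^2), the patch size, the border clamping and the choice of giving the target pixel the largest neighbour weight are all assumptions based on the commonly described form of the algorithm; the thread does not say which re-weighting Deathray uses.

```python
import math


def nlm_1d(signal, h=10.0, patch_radius=1, sample_radius=3):
    """Non-local means on a 1-D list of samples (illustrative only)."""
    n = len(signal)

    def px(i):
        # Clamp indices so patches near the ends stay inside the signal.
        return signal[min(max(i, 0), n - 1)]

    def patch_distance(a, b):
        # Mean squared difference between the patches centred on a and b.
        total = 0.0
        for k in range(-patch_radius, patch_radius + 1):
            d = px(a + k) - px(b + k)
            total += d * d
        return total / (2 * patch_radius + 1)

    out = []
    for target in range(n):
        lo = max(0, target - sample_radius)
        hi = min(n - 1, target + sample_radius)
        weights, values = [], []
        for s in range(lo, hi + 1):
            if s == target:
                continue
            weights.append(math.exp(-patch_distance(target, s) / (h * h)))
            values.append(signal[s])
        # Re-weight the target with the largest weight found among its neighbours
        # (assumed variant; 1.0 if there are no neighbours at all).
        weights.append(max(weights) if weights else 1.0)
        values.append(signal[target])
        total_w = sum(weights)
        if total_w == 0.0:
            out.append(float(signal[target]))  # every weight underflowed
        else:
            out.append(sum(w * v for w, v in zip(weights, values)) / total_w)
    return out


noisy_step = [10, 12, 9, 11, 50, 52, 49, 51]
print([round(v, 2) for v in nlm_1d(noisy_step, h=2.0)])
```

Samples whose surrounding patch resembles the target's patch get weights near 1 and are averaged in; samples across the step get weights near 0, so the edge survives. Lowering h keeps more detail but also leaves more noise, which is the trade-off described above.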

The most recent "NVidia Debug" version contains a crude averaging with the original image. This is considerably better, but the averaging I'm using is not the averaging that is described here:

http://hal.archives-ouvertes.fr/docs/00/27/11/43/PDF/double_revised.pdf

I'm currently getting my head round that.

The NVidia Debug version (110120004) is preferable to 1.00.

I could not see a difference between hY 1.5 and 2. I will go up from there later.
Ultimately, one has to balance the 3 key parameters, h, sigma and temporal-radius.

The sample-radius factor, x, is also useful, but has to be used with care (turn h up to 1000 and set x=3 to see what I'm talking about). A planned change to the filter incidentally makes x safer, though truthfully with sane values of h and s, x=2 or 3 is quite safe (and slow...).

I think temporal 5 is too strong.

I hope the regression correction I'm working on is worth it. It should require re-evaluation of settings...

Deathray was removing about the same amount of noise as FFT3DFilter, but quite differently: the noise FFT3DFilter removed was mostly high-contrast pixels, whereas Deathray removed lower-contrast pixels and left the high-contrast noise.
As a result FFT3DFilter produced a smoother picture. The higher contrast gives the effect of seemingly more detail in the Deathray'ed picture, but it is less pleasing to look at because of the high-contrast noise.
Neither filter, at reasonable settings, takes away any structure.
I noticed that FFT3DFilter tends to denoise more in the areas which Deathray tends to avoid, so I tested FFT3DFilter on the Deathray'ed video. This produced the most pleasing result, as FFT3DFilter removed the high-contrast pixels. The two filters seem to complement each other. As of now I would not use Deathray alone because of the high-contrast noise. I did not use extreme settings on either filter, as I prefer some grain to complete smoothness.
Make sure you try MVTools 2 techniques. You can also try motion-compensated FFT3DFilter and of course there's FFT3DGPU, which can also be motion-compensated.

Generally I would say don't rush and use lots of test clips.

---

A little sidenote on one aspect of the NVidia "problems". The problem with the max() function turns out to be a difference between OpenCL 1.0 and OpenCL 1.1.
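
The compiler log earlier in the thread shows the shape of it: every integer candidate is max(vector, vector) with matching types, and only the float and double vectors also have a (vector, scalar) form. OpenCL 1.1 added the (vector, scalar) form for the integer min and max, so max(int2, int) compiles under 1.1 but not under 1.0. Below is a small Python model of the 1.0 rule. It assumes the portable fix is to widen the scalar by hand, which in OpenCL C would be spelled (int2)(3); whether that is the exact change made in Deathray is not stated here. The helper names and sample values are made up.

```python
def vec_max(x, y):
    """Component-wise max that, like OpenCL 1.0, insists on two equal-width vectors."""
    if not (isinstance(x, tuple) and isinstance(y, tuple)) or len(x) != len(y):
        raise TypeError("no matching function for call to 'max'")
    return tuple(max(a, b) for a, b in zip(x, y))


def splat(scalar, width):
    """Widen a scalar to a vector, the Python stand-in for (int2)(3) in OpenCL C."""
    return (scalar,) * width


target = (5, 1)          # hypothetical pixel coordinate
sample_radius = (4, 4)   # hypothetical radius, one value per axis
start = tuple(t - r for t, r in zip(target, sample_radius))

# vec_max(start, 3) would raise TypeError, mirroring the OpenCL 1.0 compile error.
sample_start = vec_max(start, splat(3, 2))
print(sample_start)
```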

jclampy
13th December 2011, 23:02
Hi, Would really like to try this filter out but the links are dead.

ChaosKing
13th December 2011, 23:09
Hi, Would really like to try this filter out but the links are dead

DL-Link (http://chaosking.de/wp-content/uploads/avsfilters/Denoisers/Spatio-Temporal_Denoisers/Deathray___(1.00_-_2011-01-17)_gpu.7z)

jclampy
13th December 2011, 23:40
Thank you very much ChaosKing. ;)

Jawed
13th December 2011, 23:55
Does it work? I just made a long post that got lost :(

Jawed
14th December 2011, 00:02
Rather than try to recreate the post I just lost, I'll just briefly link this, which is DeathrayNV110120004.zip:

http://www.mediafire.com/?yyiaqbqp77h236p

This is just the DLL. It is a slightly different algorithm (which I think is better) and it solves the original problem with failure on NVidia.

This version of the DLL always creates a deathray.log file. If there are problems this file should help me help you.

I have a caveat: when I wrote this OpenCL code AMD's and NVidia's compilers were a little bit loose. I think they are stricter now and there's a chance this will not compile with recent drivers.

I didn't fix this issue due to a general lack of interest, so this may be more of an experiment than you wanted. Fingers-crossed!

EDIT: oh yeah, my website is down because my host took it down for going over the 250MB daily limit :(