Deathray - OpenCL GPU accelerated spatial/temporal non-local means de-noising - Page 2

Jawed · 20th January 2011, 00:46

Quote:

Originally Posted by TheProfileth

Really want to test this filter out, so I will hope you fix it soon
also I am able to get nlmeanscl to run fine on my computer, I have a GTX 260 and an AMD Phenom quad core

Thanks for your report.

I don't know why but AvsP has that trouble when Deathray throws an error message. Playing the script in MPC is fine, reporting the messages that the others have posted.

In my experience that message is sometimes true: there really was a null pointer access. In this case I just can't tell.

In SingleFrameInit, Deathray tries to create two buffers on the GPU, input and output, for each of the 3 planes, so 6 in total. It does this once when Deathray is loaded and re-uses them for every frame.

Jawed · 20th January 2011, 01:43

Oh, I've just noticed from ChaosKing's capabilities dump that GTX260 is OpenCL 1.0.

I will have to research the differences between 1.0 and 1.1 to see if I've done something 1.1-specific that's causing problems for 1.0 cards. The obvious thing, the size of local memory, shouldn't be an issue. I'm allocating ~11KB of local memory out of the 16KB available on OpenCL 1.0 devices.

I'm doubtful this is the source of the problem, but...

Jawed · 20th January 2011, 01:58

ARGH

I've just discovered that if the DLL is named Deathray.DLL it's OK, but if the DLL is named like the debug versions I provided earlier, it fails.

SIGH. I forgot that Deathray asks Windows for a handle to "Deathray.dll" as part of compilation. So this has wasted quite a bit of time. Sorry.

Versions 3 or 4 may actually work if the NV110119003/4 suffix is deleted. That's what I get for testing before renaming the file for distribution.

TheProfileth · 20th January 2011, 02:19

ooh let me try

Edit:
damn, renaming them did not work

Jawed · 20th January 2011, 13:42

If someone can report the error status numbers from the 4th debug DLL I uploaded, that would be cool, thanks.

ChaosKing · 20th January 2011, 15:37

debug v4: Single-Frame initialisation failed, status=6 and OpenCL status= -48
I also updated to the newest Nvidia driver, but it changed nothing.

Jawed · 20th January 2011, 16:27

That error happens to correspond with the DLL filename being wrong, although of course it could be the real error.

Please make sure that the DLL is called deathray.dll and that there is only one deathray DLL in the plugins folder.

If that doesn't solve the problem, then I need to make an even simpler piece of code to test the OpenCL compilation.

ChaosKing · 20th January 2011, 19:01

I tested both filenames (also deleted other deathray* dlls), same error :/

Jawed · 20th January 2011, 19:29

Thanks very much for your patience.

OK, this version of Deathray will produce a file called Deathray.log. This file appears in the same folder as your Avisynth script.

This file contains the output from the OpenCL compiler if an error occurs during compilation. If no error occurs, the file will be empty (0 bytes).

www.cupidity.f9.co.uk/DeathrayNV110120001.zip

If Deathray.log contains some text it would be useful if you can post the text here. Thanks.

ChaosKing · 20th January 2011, 20:58

Here's my Debug Log:

Code:

<program source>:300:22: error: no matching function for call to 'max'
        int2 sample_start = max(target - sample_radius, 3);
                            ^~~
<built-in>:3696:26: note: candidate function
ulong16 __OVERLOADABLE__ max(ulong16, ulong16);
                         ^
<built-in>:3695:25: note: candidate function
ulong8 __OVERLOADABLE__ max(ulong8, ulong8);
                        ^
<built-in>:3694:25: note: candidate function
ulong4 __OVERLOADABLE__ max(ulong4, ulong4);
                        ^
<built-in>:3690:25: note: candidate function
ulong2 __OVERLOADABLE__ max(ulong2, ulong2);
                        ^
<built-in>:3689:25: note: candidate function
long16 __OVERLOADABLE__ max(long16, long16);
                        ^
<built-in>:3688:24: note: candidate function
long8 __OVERLOADABLE__ max(long8, long8);
                       ^
<built-in>:3687:24: note: candidate function
long4 __OVERLOADABLE__ max(long4, long4);
                       ^
<built-in>:3683:24: note: candidate function
long2 __OVERLOADABLE__ max(long2, long2);
                       ^
<built-in>:3682:25: note: candidate function
uint16 __OVERLOADABLE__ max(uint16, uint16);
                        ^
<built-in>:3681:24: note: candidate function
uint8 __OVERLOADABLE__ max(uint8, uint8);
                       ^
<built-in>:3680:24: note: candidate function
uint4 __OVERLOADABLE__ max(uint4, uint4);
                       ^
<built-in>:3676:24: note: candidate function
uint2 __OVERLOADABLE__ max(uint2, uint2);
                       ^
<built-in>:3675:24: note: candidate function
int16 __OVERLOADABLE__ max(int16, int16);
                       ^
<built-in>:3674:23: note: candidate function
int8 __OVERLOADABLE__ max(int8, int8);
                      ^
<built-in>:3673:23: note: candidate function
int4 __OVERLOADABLE__ max(int4, int4);
                      ^
<built-in>:3669:23: note: candidate function
int2 __OVERLOADABLE__ max(int2, int2);
                      ^
<built-in>:3668:27: note: candidate function
ushort16 __OVERLOADABLE__ max(ushort16, ushort16);                                               
                          ^
<built-in>:3667:26: note: candidate function
ushort8 __OVERLOADABLE__ max(ushort8, ushort8);                                                  
                         ^
<built-in>:3666:26: note: candidate function
ushort4 __OVERLOADABLE__ max(ushort4, ushort4);                                                  
                         ^
<built-in>:3662:26: note: candidate function
ushort2 __OVERLOADABLE__ max(ushort2, ushort2);
                         ^
<built-in>:3661:26: note: candidate function
short16 __OVERLOADABLE__ max(short16, short16);                                                  
                         ^
<built-in>:3660:25: note: candidate function
short8 __OVERLOADABLE__ max(short8, short8);                                                     
                        ^
<built-in>:3659:25: note: candidate function
short4 __OVERLOADABLE__ max(short4, short4);                                                     
                        ^
<built-in>:3655:25: note: candidate function
short2 __OVERLOADABLE__ max(short2, short2);                                                     
                        ^
<built-in>:3654:26: note: candidate function
uchar16 __OVERLOADABLE__ max(uchar16, uchar16);                                                  
                         ^
<built-in>:3653:25: note: candidate function
uchar8 __OVERLOADABLE__ max(uchar8, uchar8);                                                     
                        ^
<built-in>:3652:25: note: candidate function
uchar4 __OVERLOADABLE__ max(uchar4, uchar4);                                                     
                        ^
<built-in>:3648:25: note: candidate function
uchar2 __OVERLOADABLE__ max(uchar2, uchar2);
                        ^
<built-in>:3647:25: note: candidate function
char16 __OVERLOADABLE__ max(char16, char16);                                                     
                        ^
<built-in>:3646:24: note: candidate function
char8 __OVERLOADABLE__ max(char8, char8);                                                        
                       ^
<built-in>:3645:24: note: candidate function
char4 __OVERLOADABLE__ max(char4, char4);                                                        
                       ^
<built-in>:3641:24: note: candidate function
char2 __OVERLOADABLE__ max(char2, char2);                                                        
                       ^
<built-in>:3640:27: note: candidate function
double16 __OVERLOADABLE__ max(double16, double16);                                               
                          ^
<built-in>:3639:26: note: candidate function
double8 __OVERLOADABLE__ max(double8, double8);                                                  
                         ^
<built-in>:3638:26: note: candidate function
double4 __OVERLOADABLE__ max(double4, double4);                                                  
                         ^
<built-in>:3634:26: note: candidate function
double2 __OVERLOADABLE__ max(double2, double2);                                                  
                         ^
<built-in>:3633:26: note: candidate function
float16 __OVERLOADABLE__ max(float16, float16);                                                  
                         ^
<built-in>:3632:25: note: candidate function
float8 __OVERLOADABLE__ max(float8, float8);                                                     
                        ^
<built-in>:3631:25: note: candidate function
float4 __OVERLOADABLE__ max(float4, float4);                                                     
                        ^
<built-in>:3627:25: note: candidate function
float2 __OVERLOADABLE__ max(float2, float2);                                                     
                        ^
<built-in>:3474:27: note: candidate function
double16 __OVERLOADABLE__ max(double16 x, double y) ;
                          ^
<built-in>:3473:27: note: candidate function
double8  __OVERLOADABLE__ max(double8 x, double y)   ;
                          ^
<built-in>:3472:27: note: candidate function
double4  __OVERLOADABLE__ max(double4 x, double y)   ;
                          ^
<built-in>:3468:27: note: candidate function
double2  __OVERLOADABLE__ max(double2 x, double y)   ;
                          ^
<built-in>:3458:26: note: candidate function
float16 __OVERLOADABLE__ max(float16 x, float y) ;
                         ^
<built-in>:3457:26: note: candidate function
float8  __OVERLOADABLE__ max(float8 x, float y)   ;
                         ^
<built-in>:3456:26: note: candidate function
float4  __OVERLOADABLE__ max(float4 x, float y)   ;
                         ^
<built-in>:3452:26: note: candidate function
float2  __OVERLOADABLE__ max(float2 x, float y)   ;
                         ^
<built-in>:3449:24: note: candidate function
ulong __OVERLOADABLE__ max(ulong, ulong);
                       ^
<built-in>:3448:23: note: candidate function
long __OVERLOADABLE__ max(long, long);
                      ^
<built-in>:3447:23: note: candidate function
uint __OVERLOADABLE__ max(uint, uint);
                      ^
<built-in>:3446:22: note: candidate function
int __OVERLOADABLE__ max(int, int);
                     ^
<built-in>:3445:25: note: candidate function
ushort __OVERLOADABLE__ max(ushort, ushort);
                        ^
<built-in>:3444:24: note: candidate function
short __OVERLOADABLE__ max(short, short);
                       ^
<built-in>:3443:24: note: candidate function
uchar __OVERLOADABLE__ max(uchar, uchar);
                       ^
<built-in>:3442:23: note: candidate function
char __OVERLOADABLE__ max(char, char);
                      ^
<built-in>:3441:25: note: candidate function
double __OVERLOADABLE__ max(double, double);
                        ^
<built-in>:3440:24: note: candidate function
float __OVERLOADABLE__ max(float, float);
                       ^

Jawed · 20th January 2011, 22:12

Ooh, that should be a simple fix. Fingers crossed it's the last thing.

First, try this:

www.cupidity.f9.co.uk/DeathrayNV110120002.zip

This contains a fix for the bug above. I think it's my fault, I think the AMD compiler is mistakenly accepting that line as valid.

If that works, then try this:

www.cupidity.f9.co.uk/DeathrayNV110120003.zip

It will actually filter the video instead of doing nothing to it. It will do spatial or temporal, so all the options are available. It also produces the log file, so let's hope that it's empty.

The filtering is different from version 1.00. This is an experimental linear correction. I don't think the math is correct though, so I'm still working on it. Despite that I prefer it to version 1.00...
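
A note on the `max` error in ChaosKing's log: the candidate list offers (vector, scalar) overloads only for the float and double types, so for integers both arguments must have the same type, and `max(int2, int)` has no match on that compiler. As far as I recall, the (vector, scalar) integer forms of min and max were only added in OpenCL 1.1, which would explain an OpenCL 1.0 driver rejecting a line that a 1.1 compiler accepts; treat that as an assumption. The actual change made in Deathray is not shown in the thread. One plausible fix is to widen the scalar explicitly, `max(target - sample_radius, (int2)(3))`. The sketch below mimics that in plain C: `int2` is a hand-written struct here and all helper names are hypothetical.

```c
#include <assert.h>

/* Plain-C stand-in for OpenCL's built-in int2 (illustration only). */
typedef struct { int x, y; } int2;

/* Equivalent of the OpenCL cast (int2)(s): copy one scalar into both lanes. */
static int2 int2_splat(int s) { int2 r = { s, s }; return r; }

/* Lane-wise a - b. */
static int2 int2_sub(int2 a, int2 b) {
    int2 r = { a.x - b.x, a.y - b.y };
    return r;
}

/* Lane-wise max(int2, int2), the form that appears in the candidate list. */
static int2 int2_max(int2 a, int2 b) {
    int2 r = { a.x > b.x ? a.x : b.x, a.y > b.y ? a.y : b.y };
    return r;
}

/* Clamp the window start so it never falls below 3 in either lane.
   OpenCL C equivalent: max(target - sample_radius, (int2)(3)) */
static int2 sample_start(int2 target, int2 sample_radius) {
    return int2_max(int2_sub(target, sample_radius), int2_splat(3));
}

/* Small self-check: lane x is clamped up to 3, lane y is left alone. */
static void check_sample_start(void) {
    int2 target = { 5, 20 };
    int2 radius = { 4, 4 };
    int2 s = sample_start(target, radius);
    assert(s.x == 3 && s.y == 16);
}
```

Writing both arguments with the same vector type compiles under either rule set, so it is the safer form when the kernel has to build on 1.0 and 1.1 drivers alike.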

TheProfileth · 20th January 2011, 22:15

Will test now
Edit:
Woohoo it works!
time to take a look at this thing
Edit2:
Works pretty well, and retains decent details,
http://screenshotcomparison.com/comparison/21410
http://screenshotcomparison.com/comparison/21411
my only issue is that if something is surrounded by black, the area it sort of gets washed into it
the color sort of gets sucked out of those areas too
I wonder if there is a way to give things that border black areas more weight

Jawed · 20th January 2011, 22:27

Yay. Thanks guys. Hope it works for you all.

I'll produce a proper version 1.01 tomorrow, probably.

ChaosKing · 20th January 2011, 22:36

Yeah it works!

I get ~28 fps on my GTX 260
I will play with the filter tomorrow.

Didée · 20th January 2011, 22:39

Aaahh, waitaminute ...

With 110120003:

Avisynth read error:
"Single-frame initialisation failed, status=6 and OpenCL status=-48"

log:

Code:

:310: error: incompatible type initializing 'int', expected 'float4'
                        float4 euclidean_distance = 0;
                                                    ^
:396: error: incompatible type initializing 'int', expected 'float4'
        float4 average = 0;
                         ^
:397: error: incompatible type initializing 'int', expected 'float4'
        float4 weight = 0;
                        ^

Jawed · 20th January 2011, 22:53

Sigh, I'm normally over-zealous with zeroes.

Try this:

www.cupidity.f9.co.uk/DeathrayNV110120004.zip
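
A note on the three errors in Didée's log: that compiler refuses to initialise a `float4` from the integer literal `0`. The thread does not show the corrected lines, but a form that should be accepted everywhere is an explicit vector literal such as `float4 average = (float4)(0.0f);`. The sketch below shows the same idea in plain C, using the `average` and `weight` names from the log as accumulators for a per-lane weighted mean. The `float4` struct, the helper names and the loop itself are assumptions for illustration, not Deathray's actual kernel.

```c
#include <assert.h>

/* Plain-C stand-in for OpenCL's built-in float4 (illustration only). */
typedef struct { float x, y, z, w; } float4;

/* Equivalent of the OpenCL literal (float4)(s): one scalar in all four lanes. */
static float4 float4_splat(float s) { float4 r = { s, s, s, s }; return r; }

/* Lane-wise acc + a * b. */
static float4 float4_mad(float4 acc, float4 a, float4 b) {
    float4 r = { acc.x + a.x * b.x, acc.y + a.y * b.y,
                 acc.z + a.z * b.z, acc.w + a.w * b.w };
    return r;
}

/* Lane-wise a + b. */
static float4 float4_add(float4 a, float4 b) {
    float4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

/* Lane-wise a / b. */
static float4 float4_div(float4 a, float4 b) {
    float4 r = { a.x / b.x, a.y / b.y, a.z / b.z, a.w / b.w };
    return r;
}

/* Per-lane weighted mean of n samples.
   Assumes n >= 1 and that the weights in each lane sum to a non-zero value. */
static float4 weighted_mean(const float4 *samples, const float4 *weights, int n) {
    float4 average = float4_splat(0.0f);  /* explicit float zero, not "= 0" */
    float4 weight  = float4_splat(0.0f);
    int i;
    for (i = 0; i < n; ++i) {
        average = float4_mad(average, samples[i], weights[i]);
        weight  = float4_add(weight, weights[i]);
    }
    return float4_div(average, weight);
}
```

With two samples `{1, 2, 3, 4}` and `{3, 4, 5, 6}` and unit weights, each lane averages to `{2, 3, 4, 5}`.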

Jawed · 20th January 2011, 22:59

Quote:

Originally Posted by TheProfileth

my only issue is that if something is surrounded by black, the area it sort of gets washed into it
the color sort of gets sucked out of those areas too
I wonder if there is a way to give things that border black areas more weight

Try more temporal, 2 or 3. Also don't be afraid to lower s (sigma) and increase h at the same time.

It's a bit tricky. Version 1.00 is a lot harder to use and inferior. As I say, I'm still working on the linear correction.

Didée · 20th January 2011, 23:06

Hooray, that 110120004 finally works for me, too.

Jawed · 20th January 2011, 23:14

Great.

Just need to get my brain around the variance of a single sample and get the linear correction working. Then ponder whether I want to attempt a first or second order regression correction. Gulp.

aNToK · 21st January 2011, 08:52

Hi, read in the literature that this doesn't work with AMD 4xxx series, but wasn't clear whether that was the regular Radeon 4xxx series or the HD 4xxx series. Did you mean the HD ones as well?
