26th November 2013, 15:08 | #41 | Link |
Registered Developer
Join Date: Sep 2006
Posts: 9,140
|
Great work, SEt. I was planning to look into implementing NNEDI3 with OpenCL/CUDA myself for madVR. I was also considering dropping the prescreener, but I'm not sure – the prescreener might still be effective. I was thinking of splitting the processing into lines, so that one thread processes one image line. This way I hoped to be able to cache the source reads so that I only have to read 4 new source pixels for each new output pixel (if there are enough registers to store the other source pixels in). With this design the prescreener might then allow each thread to finish faster if some pixels in the line don't need full NNEDI3 processing. Well, anyway, I haven't even started yet, so these were just some ideas I'd been playing with in my head. I haven't looked at your OpenCL code yet, but I'll definitely do so when I find some time. And thanks for going with LGPL instead of GPL – that would allow me to reuse your code for madVR, too, if I decide that your implementation idea is better than mine...
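A rough CPU-side sketch of that per-line caching idea, with plain C standing in for the kernel (the window size WND, process_window and all names here are illustrative stand-ins, not from any actual madVR or NNEDI3 code):

```c
#include <stddef.h>

#define WND 8  /* illustrative window width */

/* stand-in for the real per-pixel neural-net evaluation */
static float process_window(const float *w) {
    float s = 0.0f;
    for (int i = 0; i < WND; i++) s += w[i];
    return s / WND;
}

/* One "thread" walks one line: keep the last WND source pixels in
   register-like locals and fetch only one new pixel per output pixel,
   instead of re-reading the whole window every time. */
void filter_line(const float *src, float *dst, size_t n) {
    float win[WND];
    for (int i = 0; i < WND; i++) win[i] = src[i];  /* prime the window */
    for (size_t x = 0; x + WND <= n; x++) {
        dst[x] = process_window(win);
        /* shift the window and fetch a single new pixel */
        for (int i = 0; i < WND - 1; i++) win[i] = win[i + 1];
        if (x + WND < n) win[WND - 1] = src[x + WND];
    }
}
```

On a GPU the shift would be free register renaming rather than real moves; the point is only the read-reuse pattern.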
One thing that bothers me a bit about NNEDI3 is that it sometimes "finds" things to connect in trees, leaves and grass, which makes those areas look a bit artificial, almost fractal-like. So I'm wondering whether it wouldn't be a good idea to write a separate prescreener which categorizes the image into parts with clear edge directions and parts with rather random edge directions (= grass, leaves etc). Thoughts? FWIW, many months ago I asked tritical about implementing NNEDI3 in madVR, even though madVR is closed source, and he allowed it. He seems quite generous with licensing issues, so I don't think you need to worry about that part. I haven't heard from him in a while, though. Not sure if he's still around...
26th November 2013, 15:19 | #42 | Link | |
Oz of the zOo
Join Date: May 2005
Posts: 208
|
6th December 2013, 14:30 | #43 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
New version: added support for all the new planar colorspaces of AviSynth 2.6 (but the plugin still uses the AviSynth 2.5 interfaces and can still be used with AviSynth 2.5). YUY2, RGB24 and center correction are supported by the script functions nnedi3x and nnedi3x_rpow2. No changes on the OpenCL side.
It would be really nice if someone could confirm/correct the center-correction magic in nnedi3x_rpow2 (btw, the original nnedi3 gets it wrong). The script tries to minimize the center shift while satisfying two conditions with no/empty cshift: 1) For non-YV12, chroma must be correctly aligned with no resizing. 2) For YV12, luma is not resized, but chroma is, to align it correctly (the original nnedi3_rpow2 also does this even with an empty cshift). Luma and chroma are resized no more than once (the original nnedi3_rpow2 would resize chroma twice with YV12 and center correction), and the script tries to keep the resize offsets down to subpixel values. For now only the Spline36Resize method is supported. Last edited by SEt; 6th December 2013 at 14:33.
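For anyone wanting to check that center-correction math: each NNEDI3 doubling shifts the grid towards the top/left by half an output pixel, and every further doubling scales earlier shifts by two (in output-pixel units). Under that assumption – my reading of the rpow2 geometry, not something taken from SEt's script – the accumulated shift closes to 0.5*(rfactor-1) output pixels:

```c
/* Accumulate the top/left shift produced by repeated NNEDI3 doubling.
   Assumes each doubling adds 0.5 output pixels of shift, and earlier
   shifts double in output-pixel units with every further doubling.
   This convention is an assumption, not taken from the actual script. */
double rpow2_shift(int doublings) {
    double s = 0.0;
    for (int i = 0; i < doublings; i++)
        s = 2.0 * s + 0.5;  /* previous shift scales up, new 0.5 added */
    return s;               /* closed form: 0.5 * (2^doublings - 1) */
}
```

So a single Spline36Resize with src_left/src_top of -0.5*(rfactor-1) would undo the accumulated shift in one pass, which matches the "resized no more than once" goal above.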
7th December 2013, 13:40 | #44 | Link |
Registered User
Join Date: Sep 2013
Posts: 16
|
Radeon HD7950, 930, I7-4770k@stock
Code:
[General info]
Log file created with: AVSMeter 1.5.7
Avisynth version: AviSynth 2.60, build:Sep 28 2013 [15:09:12]
Active MT Mode: 2

[Clip info]
Number of frames: 1000
Length (hhh:mm:ss.ms): 000:00:41.708
Frame width: 2560
Frame height: 1440
Framerate: 23.976 (24000/1001)
Interlaced: No
Colorspace: YV12

[Runtime info]
Frames processed: 1000 (0 - 999)
FPS (min | max | average): 18.40 | 35.26 | 26.00
CPU usage (average): 1%
Thread count: 13
Physical Memory usage (peak): 597 MB
Virtual Memory usage (peak): 622 MB
Time (elapsed): 000:00:38.461

[Script]
SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl(dh=true, dw=1, nns=2, qual=1)
8th December 2013, 16:13 | #45 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
Minor update: a better default for the dh parameter, and nnedi3x_rpow2 is now completely implemented with all the fancy cases.
nekosama, your result looks too low: compared to the very similar Radeon HD7970, you should be getting around 36 fps average with MT.
9th December 2013, 19:17 | #48 | Link |
Registered User
Join Date: Sep 2013
Posts: 16
|
Nope SEt, I couldn't get higher results on my 7950. I tried with an overclock to 1150 MHz core clock and 1350 MHz memory clock and managed to get these results:
Code:
[General info]
Log file created with: AVSMeter 1.5.7
Avisynth version: AviSynth 2.60, build:Sep 28 2013 [15:09:12]
Active MT Mode: 2

[Clip info]
Number of frames: 1000
Length (hhh:mm:ss.ms): 000:00:41.708
Frame width: 2560
Frame height: 1440
Framerate: 23.976 (24000/1001)
Interlaced: No
Colorspace: YV12

[Runtime info]
Frames processed: 1000 (0 - 999)
FPS (min | max | average): 27.38 | 46.77 | 35.56
CPU usage (average): 0%
Thread count: 13
Physical Memory usage (peak): 599 MB
Virtual Memory usage (peak): 622 MB
Time (elapsed): 000:00:28.118

[Script]
SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl(dh=true, dw=1, nns=2, qual=1)

I even clocked my 7950 to 1170 core just to see a difference, and I managed to get 50 fps max but the same average speed (yeah, a 0.2 fps difference, but that's practically non-existent):
Code:
[General info]
Log file created with: AVSMeter 1.5.7
Avisynth version: AviSynth 2.60, build:Sep 28 2013 [15:09:12]
Active MT Mode: 2

[Clip info]
Number of frames: 10000
Length (hhh:mm:ss.ms): 000:06:57.083
Frame width: 2560
Frame height: 1440
Framerate: 23.976 (24000/1001)
Interlaced: No
Colorspace: YV12

[Runtime info]
Frames processed: 10000 (0 - 9999)
FPS (min | max | average): 25.65 | 50.00 | 35.92
CPU usage (average): 0%
Thread count: 10
Physical Memory usage (peak): 620 MB
Virtual Memory usage (peak): 622 MB
Time (elapsed): 000:04:38.409

[Script]
SetMTMode(2,4)
BlankClip(10000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl(dh=true, dw=1, nns=2, qual=1)
23rd December 2013, 18:35 | #51 | Link |
Registered Developer
Join Date: Sep 2006
Posts: 9,140
|
@SEt,
tried for 2 days to find a faster implementation than yours, but I have to admit that I failed. I tried lots of different approaches, but none of them were faster than yours. So congrats, you seem to have done a very good job! I've been working with AMD's CodeXL. Surprisingly, your kernels only run at 20% occupancy on GCN, due to using too many registers. But after those 2 days of trying for myself, I have to say that AMD's OpenCL compiler pretty much sucks. It wastes registers like crazy, often without any sense. For example, I tried modifying your "float8" logic to "float4" in order to save a few registers, in the hope that this might improve occupancy and performance. But after my changes AMD's OpenCL compiler actually spent *MORE* registers on the code than before, which makes absolutely no sense. Argh... I suppose in a few months/years, when hopefully the OpenCL compiler has matured a bit, there may be hope of improving the kernels further to increase occupancy and improve performance. But for now I've given up on finding a faster/different approach. Still, I've played with your kernel a bit and found a small performance improvement. You're doing: Code:
#if defined(__GPU__) && defined(__AMD__) && __OPENCL_VERSION__ >= 110
    float8 t = (float8)((*(__local float3*)&in[j][0]).s0012, *(__local float4*)&in[j][3]);
#else
    float8 t = (float8)(0, in[j][0], in[j][1], in[j][2], in[j][3], in[j][4], in[j][5], in[j][6]);
#endif
#pragma unroll
for (uint i = 0; i < xdia; i++) {
    t = (float8)(t.s1234, t.s567, in[j][i+7]);
    sum1 += t*w[i];
    sum2 += t*w[i+8];
}
Try replacing that with: Code:
float8 t = *((__local float8*) &in[j][0]);
#pragma unroll
for (uint i = 0; i < xdia - 1; i++) {
    sum1 += t*w[i];
    sum2 += t*w[i + 8];
    t = (float8) (t.s1234, t.s567, in[j][i + 8]);
}
sum1 += t*w[xdia - 1];
sum2 += t*w[xdia - 1 + 8];

One more thing: in all my image upscaling tests I've always preferred 8x4 over 8x6. I've found that 8x4 produces fewer weird artifacts, and my preference for 8x4 had nothing to do with performance – it was based purely on overall image quality. So I would suggest switching to 8x4 (or offering it as an option), at least for image upscaling. I don't know if the situation is different for deinterlacing. Switching to 8x4 of course also gives another nice performance boost. Testing only the "y" kernel on only one color channel, I went from 95fps to about 130fps by making the small code change posted above and by going from 8x6 to 8x4.
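The two loop shapes can be sanity-checked on the CPU with plain float arrays standing in for float8. This is a scalar re-creation for verification only – the scalar weights and the fixed xdia of 8 are simplifications, not the real kernel:

```c
#include <string.h>

#define XDIA 8  /* fixed here for simplicity */

/* shift the 8-wide window left by one and append a new value,
   mimicking t = (float8)(t.s1234, t.s567, newval) */
static void shift_in(float *t, float nv) {
    memmove(t, t + 1, 7 * sizeof(float));
    t[7] = nv;
}

/* original shape: start at (0, in[0..6]), shift BEFORE accumulating */
float variant_original(const float *in, const float *w) {
    float t[8], sum = 0.0f;
    t[0] = 0.0f;
    memcpy(t + 1, in, 7 * sizeof(float));
    for (int i = 0; i < XDIA; i++) {
        shift_in(t, in[i + 7]);
        for (int k = 0; k < 8; k++) sum += t[k] * w[i];
    }
    return sum;
}

/* modified shape: load in[0..7] directly, shift AFTER accumulating,
   with the final term peeled out of the loop */
float variant_modified(const float *in, const float *w) {
    float t[8], sum = 0.0f;
    memcpy(t, in, 8 * sizeof(float));
    for (int i = 0; i < XDIA - 1; i++) {
        for (int k = 0; k < 8; k++) sum += t[k] * w[i];
        shift_in(t, in[i + 8]);
    }
    for (int k = 0; k < 8; k++) sum += t[k] * w[XDIA - 1];
    return sum;
}
```

Both variants accumulate the same in[i..i+7]*w[i] terms in the same order, so they agree exactly; the difference is only the initial load and the dropped first shift.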
23rd December 2013, 19:23 | #52 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
Occupancy doesn't matter if you know what you are doing. The code was optimized for the Cayman architecture, but not to the point of hurting others much. So +5% for GCN from that change is plausible; if I didn't use that obvious implementation, it's likely because it hurt something else, though I don't remember the exact cases. Also, your code won't run on non-AMD cards.
8x4 – maybe; it was designed to be an easy subcase.
23rd December 2013, 20:43 | #53 | Link |
Registered Developer
Join Date: Sep 2006
Posts: 9,140
|
A while ago when testing resampling algorithms with various test images I tried all the options NNEDI3 offered and to my eyes 8x4 produced the overall best results. Anyway, just my 2 cents. Feel free to ignore...
Personally, I think optimizing for the latest generation of GPUs makes more sense than optimizing for older generations. Of course that's only true if the gain on newer generations isn't lower than the cost on older generations. So it's a balancing act, of course. FWIW, I can't see a reason why the original code should be faster on any GPU than the modified code I suggested – my code simply does less work. But after my experience with the AMD OpenCL compiler I've lost trust in what appears logical, so I can't be sure. Having only tested Y resampling before, I've now switched over to X testing. FYI, I've found that simply "reusing" the Y kernel for X (meaning: I copied the Y kernel and just swapped the X/Y coordinates when reading/writing pixels) produces faster results on my GCN GPU than the special X kernel. However, I'm not using OpenCL buffers. Instead I'm using OpenCL image objects (D3D9 interop is based on image objects). Maybe using image objects reduces the strided-memory-access problem? I don't know. I think I read somewhere that GPUs can optimize the cache for strided access when using image objects, so image objects can work better than buffers for 2D data. Not sure if that is of any use to you, but I wanted to mention it, just in case you want to try. If you do try, you may want to use a single-channel image format ("CL_R"). That gives me similar performance to buffers (maybe a tiny bit slower), while CL_RGBA is (naturally) slower than CL_R.
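The coordinate-swap trick amounts to running the same per-line routine with x and y exchanged at every image access. A toy C version (W, H and the two-tap filter are illustrative stand-ins for the real kernel):

```c
#define W 4
#define H 3

static float at(const float *img, int x, int y) { return img[y * W + x]; }

/* horizontal pass: average the left/right neighbours of interior pixels */
void run_rows(const float *src, float *dst) {
    for (int y = 0; y < H; y++)
        for (int x = 1; x < W - 1; x++)
            dst[y * W + x] = (at(src, x - 1, y) + at(src, x + 1, y)) * 0.5f;
}

/* the same routine with x and y swapped at every load -> vertical pass */
void run_cols(const float *src, float *dst) {
    for (int x = 0; x < W; x++)
        for (int y = 1; y < H - 1; y++)
            dst[y * W + x] = (at(src, x, y - 1) + at(src, x, y + 1)) * 0.5f;
}
```

On a buffer, run_cols has a strided access pattern (stride W between the two loads); with image objects the 2D-tiled cache layout can hide that, which is presumably why the swapped Y kernel held up.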
23rd December 2013, 22:12 | #54 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
Optimizing makes the most sense for the hardware you actually have. Your code does 1 more load, or the same work, depending on the compiler. Data shuffles are free.
The code isn't memory-bound – loads/stores shouldn't matter much, though the x kernel does have more work to do.
23rd December 2013, 22:25 | #55 | Link |
Registered Developer
Join Date: Sep 2006
Posts: 9,140
|
Why would my code use one more load? I don't see that. As far as I can see, my code should have the same number of loads, or one less, depending on the compiler. And it should have one less shuffle. Please note that the loop is 1 shorter in my code compared to yours.
23rd December 2013, 22:41 | #56 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
Ah, indeed, 1 shorter – then it's the same and we are looking at a compiler difference. There are no shuffles: the compiler just renames registers. I guess I'll retest such a sequence when I resume my work on the code: there is at least one more feature I want to implement.
Tested: indeed, it's a bit slower on Cayman and roughly the same on NV. Also, NV bluescreened during testing – and you say the AMD OpenCL implementation is bad? Last edited by SEt; 23rd December 2013 at 23:49.
28th December 2013, 22:40 | #57 | Link |
Registered Developer
Join Date: Sep 2006
Posts: 9,140
|
Yeah, I've now tested my changes on Intel, AMD and NVidia. And I have to say, NVidia's OpenCL implementation is by far the worst. Not only is it limited to OpenCL 1.1, but it also has weird effects. E.g. I wondered why you stored 68 weight floats per nnst instead of 66 (2 unused). Now I know why: when using 66 floats, NVidia produces a corrupted image (AMD and Intel don't). Furthermore, using write_imagef() with out-of-range coordinates makes NVidia go bonkers. So I basically had to add a couple of "if"s just to make NVidia play nice. In comparison, AMD's and Intel's OpenCL implementations worked flawlessly right from the start. And the latest AMD driver now also supports D3D9 interop (finally!). I'm still not happy with AMD's OpenCL compiler optimizations, though.
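One plausible explanation for the 68-vs-66 behaviour is alignment: padding each weight row to a multiple of 4 floats (16 bytes) keeps vectorized float4 loads aligned. That reading is a guess, not confirmed by the plugin source; the rounding rule below is just the standard align-up idiom:

```c
/* Round a weight-row stride (in floats) up to a multiple of 4, i.e.
   16-byte alignment for float data. 66 floats pad to 68 - one possible
   reason for the extra 2 floats per nnst; this is an assumption. */
unsigned padded_stride(unsigned n) {
    return (n + 3u) & ~3u;
}
```

The out-of-range write_imagef() issue is typically handled the same way the post describes: guarding each write with an explicit `if (x < width && y < height)` check, which is the portable fix regardless of which implementations tolerate stray writes.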
Just in case you're interested (if you want to benchmark using the y kernel for x when using image objects), here's the code I've currently ended up with. It contains several other changes, though, so it might not be directly useful to you: http://madshi.net/nnedi3ocl.zip JFYI:
> your code won't run on non-AMD cards
It did run just fine on NVidia and Intel GPUs, too.
29th December 2013, 14:26 | #58 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
AMD interops are buggy (at least with OpenGL) and not so great speed-wise. NV works... while you have only 1 card; add a second and it becomes extremely slow.
OK, so it works, but its memory accesses are non-compliant. AMD hardware allows you to break many restrictions, but NV and Intel are less forgiving, as you've seen. It looks like you are using clamp-to-edge padding instead of the original mirroring. Is there much difference? I've faithfully implemented the original behavior without experimenting in this area. Your code has 4 or 8 times more memory accesses than mine when reading and writing image data. I believe the kernels are not memory-bound in my implementation, so it would be interesting to see how much speed impact that has.
29th December 2013, 16:44 | #59 | Link |
Registered Developer
Join Date: Sep 2006
Posts: 9,140
|
I guess I could use CLK_ADDRESS_MIRRORED_REPEAT instead of CLK_ADDRESS_CLAMP. Haven't tried that yet. Will double check if that produces better edges and whether it affects performance. Thanks for the hint.
Yeah, I have to issue 8 separate read_imagef/write_imagef instructions for what you're doing with one float8 assignment. However, my code doesn't seem to execute slower than yours on my GCN card. I'm not sure why – maybe simply because the kernel is not memory-bound. I do notice a (small) slowdown, though, when using 16bit images instead of 8bit.
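For reference, the two addressing modes under discussion behave roughly like this on integer coordinates. This is a scalar model of the idea only – OpenCL's CLK_ADDRESS_MIRRORED_REPEAT additionally requires normalized coordinates, and its exact rounding isn't reproduced here:

```c
/* clamp-to-edge: out-of-range coordinates stick to the border sample */
int addr_clamp(int x, int n) {
    if (x < 0) return 0;
    if (x >= n) return n - 1;
    return x;
}

/* mirrored repeat: reflect with period 2n, duplicating the edge sample
   (-1 -> 0, n -> n-1), i.e. half-sample mirroring */
int addr_mirror(int x, int n) {
    int period = 2 * n;
    int m = ((x % period) + period) % period;  /* wrap into [0, 2n) */
    return m < n ? m : period - 1 - m;         /* reflect upper half */
}
```

Note this style of reflection duplicates the edge sample; whole-sample mirroring (-1 -> 1, n -> n-2) is the other common convention, and the difference between the two is presumably the "not the same type" caveat raised in the reply below.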
29th December 2013, 16:53 | #60 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
It would be interesting to see if CLK_ADDRESS_MIRRORED_REPEAT works: first of all you will have to use normalized coordinates (the question then is accuracy, of course), and even then the repeat won't be the same type of mirroring as in the original nnedi3.