26th November 2013, 15:08 | #41 | Link |
Registered Developer
Join Date: Sep 2006
Posts: 9,140
|
Great work, SEt. I was planning to look into implementing NNEDI3 with OpenCL/CUDA myself for madVR. I was also considering dropping the prescreener, but I'm not sure – the prescreener might still be effective. I was thinking of splitting the processing into lines, so that one thread processes one image line. This way I hoped to be able to cache the source reads so that I only have to read 4 new source pixels for each new output pixel (if there are enough registers to store the other source pixels in). With this design the prescreener might then allow each thread to finish faster if some pixels in the line don't need full NNEDI3 processing. Well, anyway, I haven't even started yet, so these were just some ideas I'd been playing with in my head. I haven't looked at your OpenCL code yet, but I'll definitely do so when I find some time. And thanks for going with LGPL instead of GPL – that would allow me to reuse your code for madVR, too, if I decide that your implementation idea is better than mine...
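A rough CPU-side sketch of that per-line caching idea, with plain C standing in for the kernel (the window size WND, process_window and all names here are illustrative stand-ins, not from any actual madVR or NNEDI3 code):

```c
#include <stddef.h>

#define WND 8  /* illustrative window width */

/* stand-in for the real per-pixel neural-net evaluation */
static float process_window(const float *w) {
    float s = 0.0f;
    for (int i = 0; i < WND; i++) s += w[i];
    return s / WND;
}

/* One "thread" walks one line: keep the last WND source pixels in
   register-like locals and fetch only one new pixel per output pixel,
   instead of re-reading the whole window every time. */
void filter_line(const float *src, float *dst, size_t n) {
    float win[WND];
    for (int i = 0; i < WND; i++) win[i] = src[i];  /* prime the window */
    for (size_t x = 0; x + WND <= n; x++) {
        dst[x] = process_window(win);
        /* shift the window and fetch a single new pixel */
        for (int i = 0; i < WND - 1; i++) win[i] = win[i + 1];
        if (x + WND < n) win[WND - 1] = src[x + WND];
    }
}
```

On a GPU the shift would be free register renaming rather than real moves; the point is only the read-reuse pattern.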
One thing that bothers me a bit about NNEDI3 is that it sometimes "finds" things to connect in trees, leaves and grass, which makes those areas look a bit artificial, almost fractal-like. So I'm wondering whether it wouldn't be a good idea to write a separate prescreener which categorizes the image into parts with clear edge directions and parts with rather random edge directions (= grass, leaves etc). Thoughts? FWIW, many months ago I asked tritical about implementing NNEDI3 in madVR, even though madVR is closed source, and he allowed it. He seems quite generous with licensing issues, so I don't think you need to worry about that part. I haven't heard from him in a while, though. Not sure if he's still around...
26th November 2013, 15:19 | #42 | Link | |
Oz of the zOo
Join Date: May 2005
Posts: 208
|
6th December 2013, 14:30 | #43 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
New version: added support for all the new planar colorspaces of AviSynth 2.6 (but the plugin still uses the AviSynth 2.5 interfaces and can still be used with AviSynth 2.5). YUY2, RGB24 and center correction are supported by the script functions nnedi3x and nnedi3x_rpow2. No changes on the OpenCL side.
It would be really nice if someone could confirm/correct the center-correction magic in nnedi3x_rpow2 (btw, the original nnedi3 gets it wrong). The script tries to minimize the center shift while satisfying two conditions with no/empty cshift: 1) For non-YV12, chroma must be correctly aligned with no resizing. 2) For YV12, luma is not resized, but chroma is, to align it correctly (the original nnedi3_rpow2 also does this even with an empty cshift). Luma and chroma are resized no more than once (the original nnedi3_rpow2 would resize chroma twice with YV12 and center correction), and the script tries to keep the resize offsets down to subpixel values. For now only the Spline36Resize method is supported. Last edited by SEt; 6th December 2013 at 14:33.
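For anyone wanting to check that center-correction math: each NNEDI3 doubling shifts the grid towards the top/left by half an output pixel, and every further doubling scales earlier shifts by two (in output-pixel units). Under that assumption – my reading of the rpow2 geometry, not something taken from SEt's script – the accumulated shift closes to 0.5*(rfactor-1) output pixels:

```c
/* Accumulate the top/left shift produced by repeated NNEDI3 doubling.
   Assumes each doubling adds 0.5 output pixels of shift, and earlier
   shifts double in output-pixel units with every further doubling.
   This convention is an assumption, not taken from the actual script. */
double rpow2_shift(int doublings) {
    double s = 0.0;
    for (int i = 0; i < doublings; i++)
        s = 2.0 * s + 0.5;  /* previous shift scales up, new 0.5 added */
    return s;               /* closed form: 0.5 * (2^doublings - 1) */
}
```

So a single Spline36Resize with src_left/src_top of -0.5*(rfactor-1) would undo the accumulated shift in one pass, which matches the "resized no more than once" goal above.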
7th December 2013, 13:40 | #44 | Link |
Registered User
Join Date: Sep 2013
Posts: 16
|
Radeon HD7950, 930, I7-4770k@stock
Code:
[General info]
Log file created with: AVSMeter 1.5.7
Avisynth version: AviSynth 2.60, build:Sep 28 2013 [15:09:12]
Active MT Mode: 2

[Clip info]
Number of frames: 1000
Length (hhh:mm:ss.ms): 000:00:41.708
Frame width: 2560
Frame height: 1440
Framerate: 23.976 (24000/1001)
Interlaced: No
Colorspace: YV12

[Runtime info]
Frames processed: 1000 (0 - 999)
FPS (min | max | average): 18.40 | 35.26 | 26.00
CPU usage (average): 1%
Thread count: 13
Physical Memory usage (peak): 597 MB
Virtual Memory usage (peak): 622 MB
Time (elapsed): 000:00:38.461

[Script]
SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl(dh=true, dw=1, nns=2, qual=1)
8th December 2013, 16:13 | #45 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
Minor update: a better default for the dh parameter, and nnedi3x_rpow2 is now completely implemented with all the fancy cases.
nekosama, your result looks too low: compared to the very similar Radeon HD7970, you should be getting around 36 fps average with MT.
9th December 2013, 19:17 | #48 | Link |
Registered User
Join Date: Sep 2013
Posts: 16
|
Nope SEt, I couldn't get higher results on my 7950. I tried with an overclock to 1150 MHz core clock and 1350 MHz memory clock and managed to get these results:
Code:
[General info]
Log file created with: AVSMeter 1.5.7
Avisynth version: AviSynth 2.60, build:Sep 28 2013 [15:09:12]
Active MT Mode: 2

[Clip info]
Number of frames: 1000
Length (hhh:mm:ss.ms): 000:00:41.708
Frame width: 2560
Frame height: 1440
Framerate: 23.976 (24000/1001)
Interlaced: No
Colorspace: YV12

[Runtime info]
Frames processed: 1000 (0 - 999)
FPS (min | max | average): 27.38 | 46.77 | 35.56
CPU usage (average): 0%
Thread count: 13
Physical Memory usage (peak): 599 MB
Virtual Memory usage (peak): 622 MB
Time (elapsed): 000:00:28.118

[Script]
SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl(dh=true, dw=1, nns=2, qual=1)

I even clocked my 7950 to 1170 core just to see a difference, and I managed to get 50 fps max but the same average speed (yeah, a 0.2 fps difference, but that's practically non-existent):
Code:
[General info]
Log file created with: AVSMeter 1.5.7
Avisynth version: AviSynth 2.60, build:Sep 28 2013 [15:09:12]
Active MT Mode: 2

[Clip info]
Number of frames: 10000
Length (hhh:mm:ss.ms): 000:06:57.083
Frame width: 2560
Frame height: 1440
Framerate: 23.976 (24000/1001)
Interlaced: No
Colorspace: YV12

[Runtime info]
Frames processed: 10000 (0 - 9999)
FPS (min | max | average): 25.65 | 50.00 | 35.92
CPU usage (average): 0%
Thread count: 10
Physical Memory usage (peak): 620 MB
Virtual Memory usage (peak): 622 MB
Time (elapsed): 000:04:38.409

[Script]
SetMTMode(2,4)
BlankClip(10000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl(dh=true, dw=1, nns=2, qual=1)
23rd December 2013, 18:35 | #51 | Link |
Registered Developer
Join Date: Sep 2006
Posts: 9,140
|
@SEt,
tried for 2 days to find a faster implementation than yours, but I have to admit that I failed. I tried lots of different approaches, but none of them were faster than yours. So congrats, you seem to have done a very good job! I've been working with AMD's CodeXL. Surprisingly, your kernels only run at 20% occupancy on GCN, due to using too many registers. But after those 2 days of trying for myself, I have to say that AMD's OpenCL compiler pretty much sucks. It wastes registers like crazy, often without any sense. For example, I tried modifying your "float8" logic to "float4" in order to save a few registers, in the hope that this might improve occupancy and performance. But after my changes AMD's OpenCL compiler actually spent *MORE* registers on the code than before, which makes absolutely no sense. Argh... I suppose in a few months/years, when hopefully the OpenCL compiler has matured a bit, there may be hope of improving the kernels further to increase occupancy and improve performance. But for now I've given up on finding a faster/different approach. Still, I've played with your kernel a bit and found a small performance improvement. You're doing: Code:
#if defined(__GPU__) && defined(__AMD__) && __OPENCL_VERSION__ >= 110
    float8 t = (float8)((*(__local float3*)&in[j][0]).s0012, *(__local float4*)&in[j][3]);
#else
    float8 t = (float8)(0, in[j][0], in[j][1], in[j][2], in[j][3], in[j][4], in[j][5], in[j][6]);
#endif
#pragma unroll
for (uint i = 0; i < xdia; i++) {
    t = (float8)(t.s1234, t.s567, in[j][i+7]);
    sum1 += t*w[i];
    sum2 += t*w[i+8];
}
Try replacing that with: Code:
float8 t = *((__local float8*) &in[j][0]);
#pragma unroll
for (uint i = 0; i < xdia - 1; i++) {
    sum1 += t*w[i];
    sum2 += t*w[i + 8];
    t = (float8) (t.s1234, t.s567, in[j][i + 8]);
}
sum1 += t*w[xdia - 1];
sum2 += t*w[xdia - 1 + 8];

One more thing: in all my image upscaling tests I've always preferred 8x4 over 8x6. I've found that 8x4 produces fewer weird artifacts, and my preference for 8x4 had nothing to do with performance – it was based purely on overall image quality. So I would suggest switching to 8x4 (or offering it as an option), at least for image upscaling. I don't know if the situation is different for deinterlacing. Switching to 8x4 of course also gives another nice performance boost. Testing only the "y" kernel on only one color channel, I went from 95fps to about 130fps by making the small code change posted above and by going from 8x6 to 8x4.
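The two loop shapes can be sanity-checked on the CPU with plain float arrays standing in for float8. This is a scalar re-creation for verification only – the scalar weights and the fixed xdia of 8 are simplifications, not the real kernel:

```c
#include <string.h>

#define XDIA 8  /* fixed here for simplicity */

/* shift the 8-wide window left by one and append a new value,
   mimicking t = (float8)(t.s1234, t.s567, newval) */
static void shift_in(float *t, float nv) {
    memmove(t, t + 1, 7 * sizeof(float));
    t[7] = nv;
}

/* original shape: start at (0, in[0..6]), shift BEFORE accumulating */
float variant_original(const float *in, const float *w) {
    float t[8], sum = 0.0f;
    t[0] = 0.0f;
    memcpy(t + 1, in, 7 * sizeof(float));
    for (int i = 0; i < XDIA; i++) {
        shift_in(t, in[i + 7]);
        for (int k = 0; k < 8; k++) sum += t[k] * w[i];
    }
    return sum;
}

/* modified shape: load in[0..7] directly, shift AFTER accumulating,
   with the final term peeled out of the loop */
float variant_modified(const float *in, const float *w) {
    float t[8], sum = 0.0f;
    memcpy(t, in, 8 * sizeof(float));
    for (int i = 0; i < XDIA - 1; i++) {
        for (int k = 0; k < 8; k++) sum += t[k] * w[i];
        shift_in(t, in[i + 8]);
    }
    for (int k = 0; k < 8; k++) sum += t[k] * w[XDIA - 1];
    return sum;
}
```

Both variants accumulate the same in[i..i+7]*w[i] terms in the same order, so they agree exactly; the difference is only the initial load and the dropped first shift.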
23rd December 2013, 19:23 | #52 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
Occupancy doesn't matter if you know what you are doing. The code was optimized for the Cayman architecture, but not to the point of hurting others much. So +5% for GCN from that change is plausible; if I didn't use that obvious implementation, it's likely because it hurt something else, though I don't remember the exact cases. Also, your code won't run on non-AMD cards.
8x4 – maybe; it was designed to be an easy subcase.
23rd December 2013, 20:43 | #53 | Link |
Registered Developer
Join Date: Sep 2006
Posts: 9,140
|
A while ago when testing resampling algorithms with various test images I tried all the options NNEDI3 offered and to my eyes 8x4 produced the overall best results. Anyway, just my 2 cents. Feel free to ignore...
Personally, I think optimizing for the latest generation of GPUs makes more sense than optimizing for older generations. Of course that's only true if the gain on newer generations isn't lower than the cost on older generations. So it's a balancing act, of course. FWIW, I can't see a reason why the original code should be faster on any GPU than the modified code I suggested – my code simply does less work. But after my experience with the AMD OpenCL compiler I've lost trust in what appears logical, so I can't be sure. Having only tested Y resampling before, I've now switched over to X testing. FYI, I've found that simply "reusing" the Y kernel for X (meaning: I copied the Y kernel and just swapped the X/Y coordinates when reading/writing pixels) produces faster results on my GCN GPU than the special X kernel. However, I'm not using OpenCL buffers. Instead I'm using OpenCL image objects (D3D9 interop is based on image objects). Maybe using image objects reduces the strided-memory-access problem? I don't know. I think I read somewhere that GPUs can optimize the cache for strided access when using image objects, so image objects can work better than buffers for 2D data. Not sure if that is of any use to you, but I wanted to mention it, just in case you want to try. If you do try, you may want to use a single-channel image format ("CL_R"). That gives me similar performance to buffers (maybe a tiny bit slower), while CL_RGBA is (naturally) slower than CL_R.
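The coordinate-swap trick amounts to running the same per-line routine with x and y exchanged at every image access. A toy C version (W, H and the two-tap filter are illustrative stand-ins for the real kernel):

```c
#define W 4
#define H 3

static float at(const float *img, int x, int y) { return img[y * W + x]; }

/* horizontal pass: average the left/right neighbours of interior pixels */
void run_rows(const float *src, float *dst) {
    for (int y = 0; y < H; y++)
        for (int x = 1; x < W - 1; x++)
            dst[y * W + x] = (at(src, x - 1, y) + at(src, x + 1, y)) * 0.5f;
}

/* the same routine with x and y swapped at every load -> vertical pass */
void run_cols(const float *src, float *dst) {
    for (int x = 0; x < W; x++)
        for (int y = 1; y < H - 1; y++)
            dst[y * W + x] = (at(src, x, y - 1) + at(src, x, y + 1)) * 0.5f;
}
```

On a buffer, run_cols has a strided access pattern (stride W between the two loads); with image objects the 2D-tiled cache layout can hide that, which is presumably why the swapped Y kernel held up.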
23rd December 2013, 22:12 | #54 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
Optimizing makes the most sense for the hardware you actually have. Your code does 1 more load, or the same work, depending on the compiler. Data shuffles are free.
The code isn't memory-bound – loads/stores shouldn't matter much, though the x kernel does have more work to do.
23rd December 2013, 22:25 | #55 | Link |
Registered Developer
Join Date: Sep 2006
Posts: 9,140
|
Why would my code use one more load? I don't see that. As far as I can see, my code should have the same number of loads, or one less, depending on the compiler. And it should have one less shuffle. Please note that the loop is 1 shorter in my code compared to yours.
23rd December 2013, 22:41 | #56 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
Ah, indeed, 1 shorter – then it's the same and we are looking at a compiler difference. There are no shuffles: the compiler just renames registers. I guess I'll retest such a sequence when I resume my work on the code: there is at least one more feature I want to implement.
Tested: indeed, it's a bit slower on Cayman and roughly the same on NV. Also, NV bluescreened during testing – and you say the AMD OpenCL implementation is bad? Last edited by SEt; 23rd December 2013 at 23:49.
28th December 2013, 22:40 | #57 | Link |
Registered Developer
Join Date: Sep 2006
Posts: 9,140
|
Yeah, I've now tested my changes on Intel, AMD and NVidia. And I have to say, NVidia's OpenCL implementation is by far the worst. Not only is it limited to OpenCL 1.1, but it also has weird effects. E.g. I wondered why you stored 68 weight floats per nnst instead of 66 (2 unused). Now I know why: when using 66 floats, NVidia produces a corrupted image (AMD and Intel don't). Furthermore, using write_imagef() with out-of-range coordinates makes NVidia go bonkers. So I basically had to add a couple of "if"s just to make NVidia play nice. In comparison, AMD's and Intel's OpenCL implementations worked flawlessly right from the start. And the latest AMD driver now also supports D3D9 interop (finally!). I'm still not happy with AMD's OpenCL compiler optimizations, though.
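One plausible explanation for the 68-vs-66 behaviour is alignment: padding each weight row to a multiple of 4 floats (16 bytes) keeps vectorized float4 loads aligned. That reading is a guess, not confirmed by the plugin source; the rounding rule below is just the standard align-up idiom:

```c
/* Round a weight-row stride (in floats) up to a multiple of 4, i.e.
   16-byte alignment for float data. 66 floats pad to 68 - one possible
   reason for the extra 2 floats per nnst; this is an assumption. */
unsigned padded_stride(unsigned n) {
    return (n + 3u) & ~3u;
}
```

The out-of-range write_imagef() issue is typically handled the same way the post describes: guarding each write with an explicit `if (x < width && y < height)` check, which is the portable fix regardless of which implementations tolerate stray writes.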
Just in case you're interested (if you want to benchmark using the y kernel for x when using image objects), here's the code I've currently ended up with. It contains several other changes, though, so it might not be directly useful to you: http://madshi.net/nnedi3ocl.zip JFYI:
> your code won't run on non-AMD cards
It did run just fine on NVidia and Intel GPUs, too.
29th December 2013, 14:26 | #58 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
AMD interops are buggy (at least with OpenGL) and not so great speed-wise. NV works... while you have only 1 card; add a second and it becomes extremely slow.
OK, so it works, but its memory accesses are non-compliant. AMD hardware allows you to break many restrictions, but NV and Intel are less forgiving, as you've seen. It looks like you are using clamp-to-edge padding instead of the original mirroring. Is there much difference? I've faithfully implemented the original behavior without experimenting in this area. Your code has 4 or 8 times more memory accesses than mine when reading and writing image data. I believe the kernels are not memory-bound in my implementation, so it would be interesting to see how much speed impact that has.
29th December 2013, 16:44 | #59 | Link |
Registered Developer
Join Date: Sep 2006
Posts: 9,140
|
I guess I could use CLK_ADDRESS_MIRRORED_REPEAT instead of CLK_ADDRESS_CLAMP. Haven't tried that yet. Will double check if that produces better edges and whether it affects performance. Thanks for the hint.
Yeah, I have to issue 8 separate read_imagef/write_imagef instructions for what you're doing with one float8 assignment. However, my code doesn't seem to execute slower than yours on my GCN card. I'm not sure why – maybe simply because the kernel is not memory-bound. I do notice a (small) slowdown, though, when using 16bit images instead of 8bit.
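For reference, the two addressing modes under discussion behave roughly like this on integer coordinates. This is a scalar model of the idea only – OpenCL's CLK_ADDRESS_MIRRORED_REPEAT additionally requires normalized coordinates, and its exact rounding isn't reproduced here:

```c
/* clamp-to-edge: out-of-range coordinates stick to the border sample */
int addr_clamp(int x, int n) {
    if (x < 0) return 0;
    if (x >= n) return n - 1;
    return x;
}

/* mirrored repeat: reflect with period 2n, duplicating the edge sample
   (-1 -> 0, n -> n-1), i.e. half-sample mirroring */
int addr_mirror(int x, int n) {
    int period = 2 * n;
    int m = ((x % period) + period) % period;  /* wrap into [0, 2n) */
    return m < n ? m : period - 1 - m;         /* reflect upper half */
}
```

Note this style of reflection duplicates the edge sample; whole-sample mirroring (-1 -> 1, n -> n-2) is the other common convention, and the difference between the two is presumably the "not the same type" caveat raised in the reply below.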
29th December 2013, 16:53 | #60 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
It would be interesting to see if CLK_ADDRESS_MIRRORED_REPEAT works: first of all you will have to use normalized coordinates (the question then is accuracy, of course), and even then the repeat won't be the same type of mirroring as in the original nnedi3.