nnedi3 - OpenCL rewrite


SEt
18th November 2013, 07:15
It's time to move to modern image processing platforms, i.e. OpenCL. Here is my rewrite of one of the most used and heaviest AviSynth plugins: nnedi3.

Current (2013.12.08-beta) version: https://www.dropbox.com/s/bmemjsu7jqnlk65/nnedi3ocl_20131208.7z

Syntax: nnedi3ocl(int field, bool dh, bool Y, bool U, bool V, int nsize, int nns, int qual, int etype, int dw)
Most parameters are the same as for nnedi3. Changes:

dw - controls scaling in horizontal direction: -1 no scaling; 0 and 1 scale like field 0 and 1 with dh=true, but horizontally. Default: -1.
Default for field is dw.
Default for dh is false when dw=-1 and true otherwise.
Only nsize=0 implemented, other values are silently ignored.
pscrn, threads, opt, fapprox: removed.


nnedi3ocl is an AviSynth 2.5 plugin, but it supports all new planar colorspaces when used with AviSynth 2.6. For YUY2 and RGB24 support, the script function nnedi3x is provided; it also doesn't complain if you feed it the now-removed parameters of the original nnedi3.

Basic 2x image scaling is done by calling nnedi3ocl(dw=1). For advanced scaling with chroma and center correction and support for YUY2 and RGB24 colorspaces, use the provided nnedi3x_rpow2 script function.

MTMode to use: 2.

Major speed note 1.
The original nnedi3 does not process all pixels with its nnedi3 algorithm: first it runs a prescreener that decides whether each pixel should go through nnedi3 or through cubic scaling. This is the main reason nnedi3 works relatively fast at one time and slows to a crawl at another (for example, on grass and leaves).
The prescreener concept works well on the CPU, but it is quite alien to extremely parallel GPU code and was not implemented. This means that nnedi3ocl always works at constant speed: it will be slower than the original nnedi3 with prescreener on simple frames, but faster on complex ones.

TLDR: nnedi3ocl always runs in best-quality mode, so it can be either faster or slower than the original nnedi3.

Do notice that you can combine CPU and OpenCL processing in one script by using both.

Major speed note 2.
The OpenCL code is quite optimized, but memory transfers are not, so much time is lost there. Using a high number of threads with MTMode 2 (even more than the physical threads your CPU has) is the best workaround for now.

Major speed note 3.
Unlike the original nnedi3, where speed mostly depended on image complexity, the speed of nnedi3ocl is a direct result of your hardware speed and the settings used.
Each increase of nns by 1 halves the speed. qual=2 is also around 2x slower than qual=1. This means there is a 32x difference in speed between the fastest and the slowest parameters.
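
As a rough illustration of the scaling rules above, here is a minimal Python sketch. The function name is made up for this example, and it assumes nns ranges over 0..4 and qual over 1..2 as in the original nnedi3 (that range is what yields the 32x spread); real timings will of course deviate from the idealized factors.

```python
def relative_cost(nns: int, qual: int) -> int:
    """Idealized slowdown relative to the fastest setting (nns=0, qual=1).

    Assumes each nns step doubles the cost and qual=2 doubles it again.
    """
    if not 0 <= nns <= 4:
        raise ValueError("nns is assumed to be in 0..4")
    if qual not in (1, 2):
        raise ValueError("qual is assumed to be 1 or 2")
    qual_factor = 2 if qual == 2 else 1
    return (2 ** nns) * qual_factor


fastest = relative_cost(nns=0, qual=1)
slowest = relative_cost(nns=4, qual=2)
benchmark = relative_cost(nns=2, qual=1)  # the "medium" benchmark settings

print("slowest / fastest:", slowest // fastest)
print("benchmark settings vs fastest:", benchmark // fastest)
```

So, under these assumptions, the benchmark settings (nns=2, qual=1) sit at about 4x the cost of the fastest settings and 8x below the slowest ones.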


OpenCL device preferences.
Don't bother with running the code on CPU OpenCL devices – original nnedi3 would be way faster simply due to prescreener.

For GPUs, AMD cards with the GCN architecture are recommended. Nvidia does OK, but has the disadvantage of completely occupying one of your CPU cores during heavy GPU computations. Intel integrated... it works there too!
Theoretical FLOPS should be a good indication of performance as long as you factor in the efficiency of the particular architecture. The benchmark table below provides some useful coefficients for how TFLOPS scale to FPS on cards of different architectures.

With multiple OpenCL platforms, the order of preference is: AMD GPU -> any GPU -> the rest. No manual choice yet.

Multi-GPU setups are not supported yet (todo).


OpenCL part is open source, license is... LGPLv3? Subject to change. Host code isn't open yet; there is nothing interesting there anyway.
A somewhat unknown issue is the nnedi3 nn coefficient data: no idea how it relates to licenses, since it's not code. A preprocessed but conceptually unchanged version of it is currently embedded in the dll. If tritical has any issues with the current situation – I'm ready to listen.


Hacking.
As you can notice, the main OpenCL part is not just open source, but is actually read and compiled at runtime from a separate text file. You can change it, and the changes will be applied on the next restart.

Feel free to poke at the code: sometimes just adding dummy if lines (that are always executed or always skipped, but the compiler can't deduce that) can greatly change the speed in either direction.
Another interesting speed point is the #pragma unroll statements.

As for correctness, I haven't noticed that several checks matter, so they are disabled under the EXTRA_CHECKS define – uncomment the line if you think you see precision-related errors.


Benchmark.
As it would be useful to get speed estimates for different hardware, and speed should differ only insignificantly for non-hardware reasons, let's make a "benchmark" section.
If some modification of nnedi3ocl.cl gives you a better (but still correct) result, it would be interesting to see such numbers too.

The target is 1280x720 YV12 upscaling to 2560x1440 with medium settings.

Columns: FPS with no MTMode, FPS with MTMode(2,4), FPS per theoretical TFLOPS, GPU name, GPU core clock during test (MHz), PCIe version (and width if not x16), CPU, nnedi3ocl version

On version 2013.12.08: (the same as 2013.11.21)
32.89 37.53 13.3 Radeon HD7870, 1100, 2, FX8350
18.33 18.82 6.9 Radeon HD5870, 850, 2, i7-920@4.0
14.68 14.96 6.9 GeForce GTX 660, 1137, 3x8, i7-3770@4.0

On version 2013.11.21: (+12% on Nvidia from 2013.11.18)
20.40 23.93 8.8 Radeon HD6950 (unlocked), 885, 2, i7-930@4.0
12.20 12.93 10.4 GeForce GTX 590 (half of dual card), 608, 2, i7-930@4.0
8.44 8.73 8.0 GeForce GTX 560, 810, ?, ?
5.28 5.48 7.4 GeForce GT 750M, 967, 1.1, i7-4700HQ@2.4
2.47 2.55 2.1 Radeon HD4870, 750?, ?, ?
2.44 2.45 2.3 GeForce GTX 275, 666, 2, i7-920@3.4
1.97 2.00 8.1 Quadro 600, 640, 2, i7-930@4.0
0.91 0.91 2.4 GeForce GT 240, 550, 2, i5-2500K@4.0
1.66 1.71 4.5 Intel HD4600, 1200, -, i7-4770@3.7


On version 2013.11.18:
35.50 48.00 12.7 Radeon HD7970, 925, 3, i7-3930K@3.2
20.40 23.93 8.8 Radeon HD6950 (unlocked), 885, 2, i7-930@4.0
20.29 22.68 4.9 GeForce GTX 780, 1006, 2, i7-3930K@3.8
17.72 22.47 8.3 Radeon HD6950 (unlocked), 885, 1.1 x4, i7-930@4.0
18.43 21.21 8.9 Radeon HD6950, 850, 2, i7-930@4.0
14.24 15.78 6.4 GeForce GTX 660 Ti, 915?, ?, ?
10.89 11.48 9.2 GeForce GTX 590 (half of dual card), 608, 2, i7-930@4.0
9.33 9.60 6.3 GeForce GTX 650 Ti Boost, 1006, 3, i7-4770k@4.3
4.65 4.90 6.6 GeForce GT 750M, 967, 1.1, i7-4700HQ@2.4
2.61 2.72 9.4 GeForce GT 555M (GDDR5), 1506, 2, i7-2670QM@2.2
1.30 1.85 6.9 GeForce GT 430, 700, 2, i7-860@2.8
1.72 1.74 7.1 Quadro 600, 640, 2, i7-930@4.0
1.66 1.71 4.5 Intel HD4600, 1200, -, i7-4770@3.7
1.38 1.38 1.3 GeForce GTX 275, 666, 2, i7-920@3.4

FPS is measured as average FPS in AVSMeter 1.7.2 at the end of this script:
SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl(dw=1, nns=2, qual=1)

or for versions 2013.11.22 and older:
SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl_rpow2(2, nns=2, qual=1)

If your FPS is terrible and you don't want to wait till the end – reduce first number in BlankClip to 100, but don't interrupt the script in the middle.

Some conclusions:

Even severely limiting PCIe bandwidth from 2 x16 to 1.1 x4 (= 8x less bandwidth) for a fairly fast card like the Radeon HD6950 doesn't change overall speed much: it is only 6% slower with MTMode and 13% slower without. So, anything from PCIe 2 x8 upward should not affect the current implementation much.
There is clear correlation between FPS, theoretical FLOPS and architecture of GPUs.
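
The PCIe figures above can be rechecked from the table rows for the Radeon HD6950 (unlocked). This is only a sketch in Python with made-up helper names; the per-lane rates (about 250 MB/s for PCIe 1.x and 500 MB/s for PCIe 2.0) are the nominal ones and are an assumption not stated in this thread.

```python
# Nominal per-lane, per-direction bandwidth in MB/s (assumed, not from the thread).
LANE_MB_PER_S = {"1.1": 250, "2": 500}


def link_bandwidth(version: str, lanes: int) -> int:
    """Nominal link bandwidth in MB/s for a PCIe version and lane count."""
    return LANE_MB_PER_S[version] * lanes


def slowdown_percent(fast_fps: float, slow_fps: float) -> float:
    """How much slower slow_fps is than fast_fps, in percent."""
    return (1.0 - slow_fps / fast_fps) * 100.0


bandwidth_ratio = link_bandwidth("2", 16) / link_bandwidth("1.1", 4)

# HD6950 (unlocked) rows: PCIe 2 x16 vs PCIe 1.1 x4.
with_mt = slowdown_percent(23.93, 22.47)
without_mt = slowdown_percent(20.40, 17.72)

print(f"bandwidth ratio: {bandwidth_ratio:.0f}x")
print(f"slowdown with MTMode: {with_mt:.1f}%")
print(f"slowdown without MTMode: {without_mt:.1f}%")
```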

TurboPascal7
18th November 2013, 07:53
So this is what you were talking about. Fancy. :)

5-15 times faster than the original here, which is a really nice benefit. i7 860 vs gtx760 connected via pci-e 2. Which is compute capability 3.0. Also, noting what version is actually required would be a good idea, imho.

dh=false being the default when it doesn't work yet seems a bit confusing though.

easyfab
18th November 2013, 10:57
Major speed note 2.
The OpenCL code is quite optimized, but memory transfers are not. So, much time is lost there. Using high number of threads with MTMode 2 (even more than physical threads your CPU has) is the best workaround for now.



Will the next generations APU resolve the memory transfer problem, as the memory will be shared between CPU and GPU ?

bcn_246
18th November 2013, 14:27
No cshift yet, ignored. Probably should be done in external script anyway.

Could somebody post such a script?

TIA

SEt
18th November 2013, 14:52
Update: dh=false should work now.
Also added "benchmark" section to first post, as it would be useful for people to get estimates what speed they can expect with different hardware.

TurboPascal7
It's unexpectedly nice that you get good speed on just CC3.0 hardware – I thought register pressure there would be too much. As for minimum required hardware – no idea, probably anything that can run OpenCL 1.1 (maybe even 1.0) will do. Just older hardware would have worse theoretical_FLOPS / real_speed ratio.

easyfab
This problem – yes, but due to their TDP constraints likely it'll still be faster to use powerful GPU on separate card even with all the transfer overhead.

DJATOM
18th November 2013, 18:32
2.61 2.72 GeForce GT 555M (GDDR5), 1506 MHz, 2, i7-2670QM@2.2GHz, 2013.11.18

lansing
18th November 2013, 18:37
9.33 9.60 Nvidia Geforce GTX 650 Ti Boost, 1006MHz, 3, i7-4770k@4.3GHz, 2013.11.18

very slow for my card, and on mt mode, only one thread out of the 8 is working, I suspect the bottleneck on the gpu side.

SEt
18th November 2013, 22:26
Updated first post, added some results of cards I have access to. Also added important clarification about relations of speed and parameters.

There are now enough results to start drawing some conclusions. Unlike what I expected in the beginning, NV cards do relatively well, but their architecture (Compute Capability) does affect the efficiency. NV CC2.0 hardware is slightly more efficient than AMD VLIW4, but CC2.1 and CC3.0 are around 1.5x less efficient (btw, this inefficiency in compute workloads is quite well known for those CCs).

Also, I remembered the nasty habit of NV drivers of consuming one CPU core during GPU computations. It's not normal and AMD drivers don't suffer from that: during the test, CPU load should be insignificant.

wOxxOm
18th November 2013, 23:28
SEt, it'd be nice to have fwidth/fheight gpu-scaling (requires cshift as well), thus the scripts with nnedi3_rpow2(...).downscale_to_final_resolution(...) will run faster, and hopefully by a considerable margin, because of the reduced gpu->cpu copyback data amount.

SEt
19th November 2013, 00:15
wOxxOm, not very likely for near future – there are better places to spend effort of optimizing, cubic level downscalers are computationally cheap. As for copyback, it's slow not because it saturates PCIe bandwidth (do read first post benchmarks) but because it's implemented not quite efficiently.

wOxxOm
19th November 2013, 00:19
SEt, got it. What about combining it with some gpu-assisted degrain then?

SEt
19th November 2013, 00:20
wOxxOm, that's way more likely. No promises though.

PetitDragon
19th November 2013, 02:25
OMG! This is so fxcking great. Thanks SEt.:thanks:

mikeyakame
19th November 2013, 10:19
20.29 22.68 Nvidia Geforce GTX780, 1006Mhz, 2, i7-3930K@3.8Ghz, 2013.11.18


GPU Load Avg. No MT was ~91%.
GPU Load Avg. MT was ~97%.

So not much headroom left for my card, and CPU load was ~8-9% for both tests.

Geforce drivers => 331.58

Terka
19th November 2013, 13:58
SEt, great job! Thank you!

Now implement phase correlation to mvtools under OpenCL and Avisynth users can celebrate.

Mystery Keeper
19th November 2013, 14:15
There are other, more modern motion estimation methods around. MVTools might benefit not just from phase correlation, but from integrating more different algorithms. Also, modern motion estimation uses segmentation. MVTools could give more accurate results with the currently used method combined with segmentation.

SEt
19th November 2013, 14:42
mikeyakame, your result is too low looking at FPS/TFLOPS ratio. I had much better expectations for CC3.5 hardware...

Terka, Mystery Keeper, this work is far from finished and I haven't said I'm taking on an MVTools rewrite... The motion estimators on Intel video cards that were recently exposed through OpenCL did look interesting, as you can get motion estimation basically for free from dedicated hardware, but my cards have nothing like that.

olcifaraga
19th November 2013, 15:09
4.65 4.90 Nvidia Geforce GT750M, 967Mhz, 1.1, i7-4700HQ@2.4GHz, 2013.11.18

Bloax
19th November 2013, 16:54
I get ~0.2 fps upscaling 640x480 to 1280x960 on a Geforce 9800 GT (you can imagine my surprise that it could actually run this!), and I think we can easily conclude what happens on 720p->1440p :p

zero9999
19th November 2013, 17:03
Could somebody post such a script?

TIA

use this mod of nnedi3_resize16 (https://gist.github.com/line0/7547526)

yup
20th November 2013, 06:44
SEt :thanks:
Test for GTX560
GPU 1: NVIDIA GeForce GTX 560
OpenCL 1.1, GeForce GTX 560 compute units:7@1620MHz
FPS (min | max | average): 1.84 | 416268.26 | 7.43
CPU usage (average): 13%
SetMTMode(2,4), version 18 November
yup.

SEt
20th November 2013, 08:18
yup, was it with MT or without? Need both for statistics.

Selur
21st November 2013, 12:45
using a NVIDIA GeForce GTX 660 ti and
LoadPlugin("nnedi3ocl.dll")
SetMTMode(2,8)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl_rpow2(2, nns=2, qual=1)
I got:
AVSMeter 1.7.2 [AVS2.6] by Groucho2004
AviSynth 2.60, build:Sep 28 2013 [15:09:12]
Active MT Mode: 2

Number of frames: 1000
Length (hhh:mm:ss.ms): 000:00:41.708
Frame width: 2560
Frame height: 1440
Framerate: 23.976 (24000/1001)
Interlaced: No
Colorspace: YV12

Frames processed: 1000 (0 - 999)
FPS (min | max | average): 1.93 | 419534.11 | 15.70
CPU usage (average): 13%
Thread count: 22
Physical Memory usage (peak): 569 MB
Virtual Memory usage (peak): 552 MB
Time (elapsed): 000:01:03.714
using:
LoadPlugin("nnedi3ocl.dll")
#SetMTMode(2,8)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl_rpow2(2, nns=2, qual=1)
I got:
Active MT Mode: 0

Number of frames: 1000
Length (hhh:mm:ss.ms): 000:00:41.708
Frame width: 2560
Frame height: 1440
Framerate: 23.976 (24000/1001)
Interlaced: No
Colorspace: YV12

Frames processed: 1000 (0 - 999)
FPS (min | max | average): 11.83 | 14.50 | 14.24
CPU usage (average): 12%
Thread count: 8
Physical Memory usage (peak): 540 MB
Virtual Memory usage (peak): 535 MB
Time (elapsed): 000:01:10.224
using:
LoadPlugin("nnedi3ocl.dll")
SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl_rpow2(2, nns=2, qual=1)
I got:
Active MT Mode: 2

Number of frames: 1000
Length (hhh:mm:ss.ms): 000:00:41.708
Frame width: 2560
Frame height: 1440
Framerate: 23.976 (24000/1001)
Interlaced: No
Colorspace: YV12

Frames processed: 1000 (0 - 999)
FPS (min | max | average): 3.71 | 419534.11 | 15.78
CPU usage (average): 13%
Thread count: 12
Physical Memory usage (peak): 544 MB
Virtual Memory usage (peak): 540 MB
Time (elapsed): 000:01:03.379

Cu Selur

SEt
21st November 2013, 17:18
Given how many people here turned out to use Nvidia cards, I spent some effort optimizing for them. Result is pretty consistent 15% speed boost: https://www.dropbox.com/s/exz8knrygkoznji/nnedi3ocl_update20131121.7z Don't know why I bothered though, given how Nvidia treats OpenCL (really bad, if you didn't know).

Here you can see how quite minor changes can have noticeable impact on the speed.

If you want your results included in the first page table: please provide all the info! The third number (the "efficiency") is computed as your avg FPS with MTMode divided by theoretical TFLOPS at your frequency (so, look up the reference FLOPS for your card, multiply it by your core frequency and divide by the reference core frequency).
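
The efficiency formula above, written out as a minimal Python sketch. The function names are made up for illustration. The worked example uses the Radeon HD7970 row from the first post (48.00 FPS with MTMode at 925 MHz); its reference figure of roughly 3.79 TFLOPS at 925 MHz (2048 stream processors x 2 FLOP per clock) comes from public specs, not from this thread, so treat it as an assumption.

```python
def tflops_at_clock(reference_tflops: float, reference_mhz: float,
                    actual_mhz: float) -> float:
    """Scale the reference theoretical TFLOPS to the clock used in the test."""
    return reference_tflops * actual_mhz / reference_mhz


def efficiency(avg_fps_mt: float, reference_tflops: float,
               reference_mhz: float, actual_mhz: float) -> float:
    """Average FPS with MTMode per theoretical TFLOPS at the tested clock."""
    return avg_fps_mt / tflops_at_clock(reference_tflops, reference_mhz, actual_mhz)


# Radeon HD7970: assumed reference of 2048 * 2 * 0.925 GHz = ~3.79 TFLOPS at 925 MHz.
hd7970_reference_tflops = 2048 * 2 * 925e6 / 1e12
hd7970 = efficiency(48.00, hd7970_reference_tflops, 925, 925)

print(f"HD7970 efficiency: {hd7970:.1f} FPS per TFLOPS")
```

For an overclocked card, pass the actual core clock as actual_mhz so the theoretical TFLOPS are scaled before dividing.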

Sparktank
21st November 2013, 17:29
Was trying to participate, but Nvidia just had an update recently and it seems to give me BSOD after benchmark completes.

Currently sifting through the Nvidia forum for details and to provide input, and then falling back to the previous version, which didn't give me a BSOD with this plugin.
And hopefully will be able to get some results up by the end of the day.

SEt
21st November 2013, 17:41
Yeah, that's Nvidia drivers today. That they are more stable than AMD is pure myth. I also got BSODs recently only from Nvidia ones.

Speed of nnedi3ocl really depends on how wise/stupid OpenCL compiler in driver was, so it's worth trying several driver versions and see if it changes anything.

Groucho2004
21st November 2013, 18:15
0.91, 0.91, 2.35, GeForce GT 240, 550, 2, i5-2500K@4GHz, 2013.11.21
I grabbed the number for FLOPS from here (http://en.wikipedia.org/wiki/GeForce_200_Series), I hope that's the right place.

olcifaraga
21st November 2013, 20:29
5.28 5.48 GeForce GT 750M, 967, 1.1, i7-4700HQ@2.4 2013.11.21

mikeyakame
21st November 2013, 20:51
@SEt

I get about a 1.6% speed increase with the 2013.11.21 build and 331.82 drivers.
I'll check more later, off to work.

lansing
22nd November 2013, 00:14
well mine went from 9.60fps to 10.69fps with mt on with the new build, 11% increase, not bad

Overdrive80
22nd November 2013, 05:16
Hi, when I execute this code:

LoadPlugin("C:\Program Files (x86)\AviSynth 2.5\plugins\nnedi3ocl.dll")
SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl_rpow2(2, nns=2, qual=1)

I get this error message:

http://img6.imageshack.us/img6/9097/b2vm.png

Build used is 2013.11.18. Graphic: AMD Radeon HD 4870.

What am I doing wrong??

Keiyakusha
22nd November 2013, 05:42
Did a quick test on GTX 570 with latest stable drivers, no MT mode and default settings, but there are no speed differences between the old and new version. The difference is always within 0.03 fps plus or minus with around 14.46 in total. I checked 3 times. Maybe later I'll check once more after I'll get some sleep...

BTW I used a real 720p video. BlankClip is up to 80% faster, even though the source filter is capable of providing input at more than 500fps.

yup
22nd November 2013, 07:35
Hi all!
I was away from my workhorse; now testing both versions
GPU 1: NVIDIA GeForce GTX 560
OpenCL 1.1, GeForce GTX 560 compute units:7@1620MHz
18 November cl code
SetMTMode(2,4)
FPS (min | max | average): 1.84 | 416268.26 | 7.43
CPU usage (average): 13%
noMT
FPS (min | max | average): 6.83 | 7.29 | 7.20
CPU usage (average): 13%
20 November cl code
SetMTMode(2,4)
FPS (min | max | average): 2.17 | 416267.00 | 8.73
CPU usage (average): 13%
noMT
FPS (min | max | average): 7.95 | 8.53 | 8.44
CPU usage (average): 14%
The latest version gives a speedup of more than 10%.
yup.

SEt
22nd November 2013, 12:13
Overdrive80, try this version: https://www.dropbox.com/s/oz1xz9k8mxp1nhb/nnedi3ocl_fixocl10.7z Your card only supports OpenCL 1.0, while I used a 1.1 feature. Also note that Radeon HD4xxx cards are not fully OpenCL "capable" (their local memory isn't conformant and is emulated with global memory), so the "efficiency" will be lower than on newer Radeons.

Keiyakusha, your speed is faster than it should be on previous version but slower than it should be on new one, huh...
The problem with real scripts is that even with MTMode Avisynth scheduling is pretty bad and you likely see not 100% GPU load. Try putting source in MTMode 2 and/or increasing number of threads.

Overdrive80
23rd November 2013, 00:00
Ok, thanks SEt. Here go my results:

- Four Threads:

LoadPlugin("C:\Program Files (x86)\AviSynth 2.5\plugins\nnedi3ocl.dll")
SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl_rpow2(2, nns=2, qual=1)


Frames processed: 1000 (0 - 999)
FPS (min | max | average): 0.63 | 325290.06 | 2.55
CPU usage (average): 6%
Thread count: 9
Physical Memory usage (peak): 1123 MB
Virtual Memory usage (peak): 1123 MB
Time (elapsed): 000:06:32.524

- Eight Threads:

LoadPlugin("C:\Program Files (x86)\AviSynth 2.5\plugins\nnedi3ocl.dll")
SetMTMode(2,8)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl_rpow2(2, nns=2, qual=1)

Frames processed: 1000 (0 - 999)
FPS (min | max | average): 0.31 | 325290.06 | 2.52
CPU usage (average): 6%
Thread count: 17
Physical Memory usage (peak): 1169 MB
Virtual Memory usage (peak): 1178 MB
Time (elapsed): 000:06:37.608

- None Threads:

Frames processed: 1000 (0 - 999)
FPS (min | max | average): 1.95 | 2.50 | 2.47
CPU usage (average): 2%
Thread count: 2
Physical Memory usage (peak): 298 MB
Virtual Memory usage (peak): 292 MB
Time (elapsed): 000:06:44.659

Gser
23rd November 2013, 00:22
Anybody tried putting this into QTGMC yet?

Selur
23rd November 2013, 00:32
Anybody tried putting this into QTGMC yet?
"Only nsize=0 implemented, other values silently ignored."
and iirc. at least all presets use nsize 1 and up,...

checked:
# Preset groups: Placebo, Very Slow, Slower, Slow, Medium, Fast, Faster, Very Fast, Super Fast, Ultra Fast, Draft
...
EdiMode = default( EdiMode, Select( pNum, "NNEDI3", "NNEDI3", "NNEDI3", "NNEDI3", "NNEDI3", "NNEDI3", "NNEDI3", "NNEDI3", "NNEDI3", "RepYadif","Bob" ) )
NNSize = default( NNSize, Select( pNum, 1, 1, 1, 1, 5, 5, 4, 4, 4, 4, 4 ) )

-> atm. it's not really that interesting for QTGMC

SEt
23rd November 2013, 01:24
nsize=0 should be better than nsize=4, so shouldn't hurt using 0 instead of it. For other nsize it's effectively quality of connecting horizontal lines, so you can use 0 instead but quality will be worse than expected.

bcn_246
24th November 2013, 20:37
use this mod of nnedi3_resize16 (https://gist.github.com/line0/7547526)

Thanks a million. I assume it's just the nnedi3_resize16_rpow2 part that's been modded for OCL?

zero9999
26th November 2013, 01:36
Thanks a million. I assume it's just the nnedi3_resize16_rpow2 part thats been modded for OCL?

yes, ofc also calls to this function to pass on the gpu parameter.

madshi
26th November 2013, 15:08
Great work, SEt. I was planning to look into implementing NNEDI3 with OpenCL/CUDA myself for madVR. I was also considering dropping the prescreener, but I'm not sure. The prescreener might still be effective. I was thinking of splitting the processing into lines, so that one thread processes one image line. This way I hoped to be able to cache the source reads so that I have to read only 4 new source pixels for each new output pixel (if there are enough registers to store the other source pixels in). With this design maybe the prescreener would then allow each thread to finish faster if there are some pixels in the line which don't need full NNEDI3 processing. Well, anyway. I haven't even started yet, so these were just some ideas I'd been playing with in my head. Haven't looked at your OpenCL code yet, but I'll definitely do when I find some time. And thanks for going with LGPL instead of GPL. That would allow me to reuse your code for madVR, too, if I decide that your implementation idea is better than mine... :)

One thing that bothers me a bit about NNEDI3 is that it sometimes "finds" things to connect in trees, leaves and grass which makes things look a bit artificial, fractal like. So I'm wondering whether it wouldn't be a good idea to write a separate prescreener which categorizes the image into parts which have clear edge directions and other parts with rather random edge directions (= grass, leaves etc). Thoughts?

FWIW, many months ago I had asked tritical about implementing NNEDI3 in madVR, even though madVR is closed source, and he allowed it. So it seems to me he's quite generous with licensing issues, so I don't think you need to worry about that part. Haven't heard from him in a while, though. Not sure if he's still around...

wOxxOm
26th November 2013, 15:19
One thing that bothers me a bit about NNEDI3 is that it sometimes "finds" things to connect in trees, leaves and grass which makes things look a bit artificial, fractal like. So I'm wondering whether it wouldn't be a good idea to write a separate prescreener which categorizes the image into parts which have clear edge directions and other parts with rather random edge directions (= grass, leaves etc). Thoughts?
I noticed it too a long time ago and that's why I always use prescreener and then apply masked AA (sometimes nnedi-AA) where needed. Not sure if the universal content recognition algorithm is possible, but it would be great.

SEt
6th December 2013, 14:30
New version: added support for all new planar colorspaces of AviSynth 2.6 (but plugin still uses AviSynth 2.5 interfaces and still can be used with AviSynth 2.5). YUY2, RGB24 and center correction are supported by script functions nnedi3x and nnedi3x_rpow2. No changes on OpenCL side.

It would be really nice if someone could confirm/correct the center correction magic in nnedi3x_rpow2 (btw, original nnedi3 does it wrong). The script tries to minimize center shift while satisfying two conditions with no/empty cshift:
1) For non-YV12 chroma must be correctly aligned with no resize.
2) For YV12 luma is not resized, but chroma is to be correctly aligned (original nnedi3_rpow2 also does it even with empty cshift).

Luma and chroma are resized no more than once (original nnedi3_rpow2 would resize chroma 2 times with YV12 and center correction), script tries to minimize resize offsets to subpixel values. For now only Spline36Resize method.

nekosama
7th December 2013, 13:40
Radeon HD7950, 930, I7-4770k@stock

[General info]
Log file created with: AVSMeter 1.5.7
Avisynth version: AviSynth 2.60, build:Sep 28 2013 [15:09:12]
Active MT Mode: 2


[Clip info]
Number of frames: 1000
Length (hhh:mm:ss.ms): 000:00:41.708
Frame width: 2560
Frame height: 1440
Framerate: 23.976 (24000/1001)
Interlaced: No
Colorspace: YV12


[Runtime info]
Frames processed: 1000 (0 - 999)
FPS (min | max | average): 18.40 | 35.26 | 26.00
CPU usage (average): 1%
Thread count: 13
Physical Memory usage (peak): 597 MB
Virtual Memory usage (peak): 622 MB
Time (elapsed): 000:00:38.461


[Script]
SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl(dh=true, dw=1, nns=2, qual=1)

https://db.tt/0Sf3myc2

SEt
8th December 2013, 16:13
Minor update: better default for dh parameter and completely implemented nnedi3x_rpow2 with all fancy cases.

nekosama, your result looks too low: comparing to very similar Radeon HD7970 you should be getting around 36 fps average with MT.

Groucho2004
8th December 2013, 16:31
nekosama, your result looks too low: comparing to very similar Radeon HD7970 you should be getting around 36 fps average with MT.
His test is with a 7950, not 7970. Not sure how much difference this makes.

SEt
8th December 2013, 17:24
Of course I've scaled expected fps by their theoretical FLOPS. Radeon HD7970 is getting 48 fps as you can see in summary tables.

nekosama
9th December 2013, 19:17
nope SEt, I couldn't get higher results on my 7950 but I tried with an overclock to 1150 MHz core clock and 1350 MHz memory clock and managed to get these results
[General info]
Log file created with: AVSMeter 1.5.7
Avisynth version: AviSynth 2.60, build:Sep 28 2013 [15:09:12]
Active MT Mode: 2


[Clip info]
Number of frames: 1000
Length (hhh:mm:ss.ms): 000:00:41.708
Frame width: 2560
Frame height: 1440
Framerate: 23.976 (24000/1001)
Interlaced: No
Colorspace: YV12


[Runtime info]
Frames processed: 1000 (0 - 999)
FPS (min | max | average): 27.38 | 46.77 | 35.56
CPU usage (average): 0%
Thread count: 13
Physical Memory usage (peak): 599 MB
Virtual Memory usage (peak): 622 MB
Time (elapsed): 000:00:28.118


[Script]
SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl(dh=true, dw=1, nns=2, qual=1)
just not 36 fps :p

I even clocked my 7950 to 1170 core just to see a difference and I managed to get 50 max fps but the same average speed (yeah, 0.2 fps difference but that's practically non-existent)
[General info]
Log file created with: AVSMeter 1.5.7
Avisynth version: AviSynth 2.60, build:Sep 28 2013 [15:09:12]
Active MT Mode: 2


[Clip info]
Number of frames: 10000
Length (hhh:mm:ss.ms): 000:06:57.083
Frame width: 2560
Frame height: 1440
Framerate: 23.976 (24000/1001)
Interlaced: No
Colorspace: YV12


[Runtime info]
Frames processed: 10000 (0 - 9999)
FPS (min | max | average): 25.65 | 50.00 | 35.92
CPU usage (average): 0%
Thread count: 10
Physical Memory usage (peak): 620 MB
Virtual Memory usage (peak): 622 MB
Time (elapsed): 000:04:38.409


[Script]
SetMTMode(2,4)
BlankClip(10000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl(dh=true, dw=1, nns=2, qual=1)

Groucho2004
9th December 2013, 19:26
@nekosama
try with the latest version of AVSMeter. The version you're using is rather old (although it should not make much difference).

SEt
10th December 2013, 00:06
nekosama, another guess: are you using old drivers? Try with latest beta ones.