Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion.

Posted 18th November 2013, 07:15
nnedi3 - OpenCL rewrite

It's time to move to modern image processing platforms, i.e. OpenCL. Here is my rewrite of one of the most used and heaviest AviSynth plugins: nnedi3.

Current (2013.12.08-beta) version: https://www.dropbox.com/s/bmemjsu7jq...cl_20131208.7z

Syntax: nnedi3ocl(int field, bool dh, bool Y, bool U, bool V, int nsize, int nns, int qual, int etype, int dw)
Most parameters are the same as for nnedi3. Changes:
  • dw - controls scaling in the horizontal direction: -1 means no scaling; 0 and 1 scale like field 0 and 1 with dh=true, but horizontally. Default: -1.
  • Default for field is dw.
  • Default for dh is false when dw=-1 and true otherwise.
  • Only nsize=0 is implemented; other values are silently ignored.
  • pscrn, threads, opt, fapprox: removed.
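
The default rules in the list above can be read as a small resolution function. This is only a Python sketch of how the list reads, not the plugin's host code; the name `resolve_defaults` and the use of `None` for "argument not given" are hypothetical.

```python
def resolve_defaults(field=None, dh=None, dw=-1):
    """Fill in unspecified nnedi3ocl arguments following the rules listed above.

    None stands for "argument not given by the caller" (an assumption of
    this sketch; the real plugin is called from an AviSynth script).
    """
    if field is None:
        field = dw        # "Default for field is dw"
    if dh is None:
        dh = dw != -1     # false when dw=-1, true otherwise
    return {"field": field, "dh": dh, "dw": dw}


print(resolve_defaults())       # nothing specified: no horizontal scaling
print(resolve_defaults(dw=1))   # the basic 2x call, nnedi3ocl(dw=1)
```

Read this way, nnedi3ocl(dw=1) ends up with field=1 and dh=true, which is why a single argument is enough for the basic 2x case.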

nnedi3ocl is an AviSynth 2.5 plugin, but it supports all the new planar colorspaces when used with AviSynth 2.6. For YUY2 and RGB24 support, the script function nnedi3x is provided; it also doesn't complain if you feed it the now-removed parameters of the original nnedi3.

Basic 2x image scaling is done by calling nnedi3ocl(dw=1). For advanced scaling with chroma and center correction, plus support for the YUY2 and RGB24 colorspaces, use the provided nnedi3x_rpow2 script function.

MTMode to use: 2.

Major speed note 1.
The original nnedi3 does not process every pixel with its nnedi3 algorithm: it first runs a prescreener that decides, for each pixel, whether it should go through nnedi3 or through cubic scaling. This is the main reason nnedi3 is relatively fast on some content and slows to a crawl on other content (for example, grass and leaves).
The prescreener concept works well on a CPU, but it is quite alien to extremely parallel GPU code and was not implemented. This means nnedi3ocl always runs at a constant speed: it will be slower than the original nnedi3 with prescreener on simple frames, but faster on complex ones.

TLDR: nnedi3ocl always works in best-quality mode, so it can be either faster or slower than the original nnedi3.

Note that you can combine CPU and OpenCL processing in one script by using both plugins.

Major speed note 2.
The OpenCL code is quite optimized, but the memory transfers are not, so a lot of time is lost there. Using a high number of threads with MTMode 2 (even more than the number of hardware threads your CPU has) is the best workaround for now.

Major speed note 3.
Unlike the original nnedi3, where speed mostly depends on image complexity, the speed of nnedi3ocl is a direct result of your hardware and the settings used.
Each increase of nns by 1 roughly halves the speed, and qual=2 is also around 2x slower than qual=1. With nns going from 0 to 4 (16x) and qual from 1 to 2 (2x), there is a 32x speed difference between the fastest and the slowest settings.
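
The 32x figure can be checked with a few lines of Python. This is a rough sketch that only encodes the rule of thumb stated here (each nns step and the qual=2 step double the work); it assumes nns takes the same values 0 to 4 as in the original nnedi3, and `relative_cost` is a made-up helper name, not part of the plugin.

```python
def relative_cost(nns, qual):
    """Approximate slowdown relative to the fastest setting (nns=0, qual=1).

    Assumes nns in 0..4 and qual in 1..2, as in the original nnedi3.
    """
    if nns not in range(5) or qual not in (1, 2):
        raise ValueError("expected nns in 0..4 and qual in 1..2")
    return 2 ** nns * 2 ** (qual - 1)


for qual in (1, 2):
    print("qual =", qual, [relative_cost(nns, qual) for nns in range(5)])
```

By the same rule, the nns=2, qual=1 setting used in the benchmark sits at about 4x the cost of the fastest setting and 8x below the slowest one.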

OpenCL device preferences.
Don't bother running the code on CPU OpenCL devices: the original nnedi3 would be way faster simply due to the prescreener.

For GPUs, AMD cards with the GCN architecture are recommended. Nvidia does OK, but has the disadvantage of fully occupying one of your CPU cores during heavy GPU computations. Intel integrated... it works there too!
Theoretical FLOPS should be a good indication of performance as long as you factor in the efficiency of the particular architecture. The benchmark table below provides some useful coefficients for how TFLOPS scale to FPS on cards of different architectures.
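
For reference, the "FPS per theoretical TFLOPS" coefficient in the benchmark table can be reproduced roughly like this. The shader counts (1280 for the HD7870, 2048 for the HD7970) come from public spec sheets rather than from this post, and "shaders x 2 FLOP per clock x core clock" is the usual single-precision peak estimate, so treat both as assumptions. The helper names are made up for the sketch.

```python
def theoretical_tflops(shaders, core_clock_mhz):
    """Peak single-precision TFLOPS, assuming 2 FLOP (one FMA) per shader per clock."""
    return shaders * 2 * core_clock_mhz * 1e6 / 1e12


def fps_per_tflops(fps, shaders, core_clock_mhz):
    return fps / theoretical_tflops(shaders, core_clock_mhz)


# MTMode(2,4) FPS and core clocks taken from the benchmark table;
# shader counts are an outside assumption.
print(round(fps_per_tflops(37.53, 1280, 1100), 1))  # Radeon HD7870
print(round(fps_per_tflops(48.00, 2048, 925), 1))   # Radeon HD7970
```

With these inputs the two results round to the 13.3 and 12.7 listed in the table, which suggests the coefficient is based on the MTMode FPS column.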

If there are multiple OpenCL platforms, the order of preference is: AMD GPU -> any GPU -> the rest. There is no manual choice yet.
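
The preference order can be pictured as a simple ranking. The host code is not public, so this Python sketch only illustrates the stated order; the device dictionaries, their field names and the functions `preference_rank` and `pick_device` are all hypothetical.

```python
def preference_rank(device):
    """Lower is better: AMD GPU first, then any other GPU, then the rest."""
    vendor = device["vendor"]
    is_gpu = device["type"] == "GPU"
    is_amd = "AMD" in vendor or "Advanced Micro Devices" in vendor
    if is_gpu and is_amd:
        return 0
    if is_gpu:
        return 1
    return 2


def pick_device(devices):
    # min() keeps the first device among equally ranked ones.
    return min(devices, key=preference_rank)


# Hypothetical device list, for illustration only.
devices = [
    {"name": "CPU device", "vendor": "Intel(R) Corporation", "type": "CPU"},
    {"name": "GeForce GTX 660", "vendor": "NVIDIA Corporation", "type": "GPU"},
    {"name": "Radeon HD7870", "vendor": "Advanced Micro Devices, Inc.", "type": "GPU"},
]
print(pick_device(devices)["name"])
```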

Multi-GPU setups are not supported yet (todo).

The OpenCL part is open source; the license is... LGPLv3? Subject to change. The host code isn't open yet; there is nothing interesting there anyway.
A somewhat unclear issue is the nnedi3 nn coefficient data: I have no idea how licenses apply to it, since it's not code. A preprocessed but conceptually unchanged version of it is currently embedded in the dll. If tritical has any issues with the current situation, I'm ready to listen.

As you may notice, the main OpenCL part is not just open source, but is actually read and compiled at run time from a separate text file. You can change it, and the changes will be applied on the next restart.

Feel free to poke at the code: sometimes just adding dummy if statements (ones that are always taken or always skipped, but the compiler can't deduce that) can greatly change the speed in either direction.
Another interesting point for speed is the #pragma unroll statements.

As for correctness, I haven't seen several of the checks make any difference, so they are disabled behind the EXTRA_CHECKS define; uncomment that line if you think you see precision-related errors.

Since it would be useful to get speed estimates for different hardware, and speed should differ only insignificantly for non-hardware reasons, let's make a "benchmark" section.
If some modification of nnedi3ocl.cl gives you a better (but still correct) result, it would be interesting to see those numbers too.

The target is upscaling 1280x720 YV12 to 2560x1440 with medium settings.

Columns: FPS with no MTMode, FPS with MTMode(2,4), FPS per theoretical TFLOPS, GPU name, GPU core clock during the test (MHz), PCIe version (and width if not x16), CPU. Rows are grouped by nnedi3ocl version.

On version 2013.12.08: (the same as 2013.11.21)
32.89 37.53 13.3  Radeon HD7870, 1100, 2, FX8350
18.33 18.82  6.9  Radeon HD5870, 850, 2, i7-920@4.0
14.68 14.96  6.9  GeForce GTX 660, 1137, 3 x8, i7-3770@4.0
On version 2013.11.21: (+12% on Nvidia from 2013.11.18)
20.40 23.93  8.8  Radeon HD6950 (unlocked), 885, 2, i7-930@4.0
12.20 12.93 10.4  GeForce GTX 590 (half of dual card), 608, 2, i7-930@4.0
 8.44  8.73  8.0  GeForce GTX 560, 810, ?, ?
 5.28  5.48  7.4  GeForce GT 750M, 967, 1.1, i7-4700HQ@2.4
 2.47  2.55  2.1  Radeon HD4870, 750?, ?, ?
 2.44  2.45  2.3  GeForce GTX 275, 666, 2, i7-920@3.4
 1.97  2.00  8.1  Quadro 600, 640, 2, i7-930@4.0
 0.91  0.91  2.4  GeForce GT 240, 550, 2, i5-2500K@4.0
 1.66  1.71  4.5  Intel HD4600, 1200, -, i7-4770@3.7
On version 2013.11.18:
35.50 48.00 12.7  Radeon HD7970, 925, 3, i7-3930K@3.2
20.40 23.93  8.8  Radeon HD6950 (unlocked), 885, 2, i7-930@4.0
20.29 22.68  4.9  GeForce GTX 780, 1006, 2, i7-3930K@3.8
17.72 22.47  8.3  Radeon HD6950 (unlocked), 885, 1.1 x4, i7-930@4.0
18.43 21.21  8.9  Radeon HD6950, 850, 2, i7-930@4.0
14.24 15.78  6.4  GeForce GTX 660 Ti, 915?, ?, ?
10.89 11.48  9.2  GeForce GTX 590 (half of dual card), 608, 2, i7-930@4.0
 9.33  9.60  6.3  GeForce GTX 650 Ti Boost, 1006, 3, i7-4770k@4.3
 4.65  4.90  6.6  GeForce GT 750M,  967, 1.1, i7-4700HQ@2.4
 2.61  2.72  9.4  GeForce GT 555M (GDDR5), 1506, 2, i7-2670QM@2.2
 1.30  1.85  6.9  GeForce GT 430, 700, 2, i7-860@2.8
 1.72  1.74  7.1  Quadro 600, 640, 2, i7-930@4.0
 1.66  1.71  4.5  Intel HD4600, 1200, -, i7-4770@3.7
 1.38  1.38  1.3  GeForce GTX 275, 666, 2, i7-920@3.4
FPS is the average FPS reported by AVSMeter 1.7.2 at the end of this script:
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl(dw=1, nns=2, qual=1)
or for versions 2013.11.22 and older:
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl_rpow2(2, nns=2, qual=1)
If your FPS is terrible and you don't want to wait until the end, reduce the first number in BlankClip to 100, but don't interrupt the script in the middle.

Some conclusions:
  1. Even severely limiting PCIe bandwidth from 2 x16 to 1.1 x4 (8x less bandwidth) for a fairly fast card like the Radeon HD6950 doesn't change the overall speed much: it is only 6% slower with MTMode and 13% slower without. So anything from PCIe 2 x8 up should not affect the current implementation much.
  2. There is a clear correlation between FPS, theoretical FLOPS and GPU architecture.
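
The numbers in the first conclusion can be rechecked with a short Python sketch. The per-lane rates (250 MB/s for PCIe 1.x, 500 MB/s for PCIe 2.0, after 8b/10b encoding) are the standard figures and are not from this post; the FPS values are the Radeon HD6950 (unlocked) rows of the 2013.11.18 table. Function names are made up for the sketch.

```python
# Usable per-lane throughput in MB/s after 8b/10b encoding (assumed standard values).
LANE_MB_S = {"1.1": 250, "2": 500}


def link_bandwidth(version, lanes):
    """Theoretical one-direction link bandwidth in MB/s."""
    return LANE_MB_S[version] * lanes


def slowdown_percent(fast_fps, slow_fps):
    return (fast_fps - slow_fps) / fast_fps * 100


ratio = link_bandwidth("2", 16) / link_bandwidth("1.1", 4)
print("bandwidth ratio:", ratio)

# HD6950 (unlocked): PCIe 2 x16 versus PCIe 1.1 x4
print("with MTMode:    %.1f%%" % slowdown_percent(23.93, 22.47))
print("without MTMode: %.1f%%" % slowdown_percent(20.40, 17.72))
```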

Last edited by SEt; 10th February 2014 at 00:04.