Old 18th November 2013, 07:15   #1  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
nnedi3 - OpenCL rewrite

It's time to move to modern image processing platforms, i.e. OpenCL. Here is my rewrite of one of the most used and heaviest AviSynth plugins: nnedi3.

Current (2013.12.08-beta) version: https://www.dropbox.com/s/bmemjsu7jq...cl_20131208.7z

Syntax: nnedi3ocl(int field, bool dh, bool Y, bool U, bool V, int nsize, int nns, int qual, int etype, int dw)
Most parameters are the same as for nnedi3. Changes:
  • dw - controls scaling in horizontal direction: -1 no scaling; 0 and 1 scale like field 0 and 1 with dh=true, but horizontally. Default: -1.
  • Default for field is dw.
  • Default for dh is false when dw=-1 and true otherwise.
  • Only nsize=0 is implemented; other values are silently ignored.
  • pscrn, threads, opt, fapprox: removed.
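
The default rules above can be summarised in a few lines of Python. This is only a sketch of the behaviour described in the list, not the plugin's actual code; the function name resolve_defaults and the use of None for "parameter not given" are assumptions made for illustration.

```python
def resolve_defaults(field=None, dh=None, dw=-1):
    """Fill in nnedi3ocl defaults as described in the parameter list.

    None stands for "parameter not given" (an assumption of this sketch).
    """
    if field is None:
        field = dw          # default for field is dw
    if dh is None:
        dh = (dw != -1)     # false when dw=-1, true otherwise
    return {"field": field, "dh": dh, "dw": dw}


print(resolve_defaults())      # nothing given: no horizontal scaling
print(resolve_defaults(dw=1))  # corresponds to the call nnedi3ocl(dw=1)
```

With dw=1 and nothing else given, both dh and the horizontal pass end up enabled, which is why a bare nnedi3ocl(dw=1) scales in both directions.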

nnedi3ocl is an AviSynth 2.5 plugin, but it supports all the new planar colorspaces when used with AviSynth 2.6. For YUY2 and RGB24 support, the script function nnedi3x is provided; it also doesn't complain if you pass it the parameters of the original nnedi3 that have since been removed.

Basic 2x image scaling is done by calling nnedi3ocl(dw=1). For advanced scaling with chroma and center correction and support for YUY2 and RGB24 colorspaces, use the provided nnedi3x_rpow2 script function.

MTMode to use: 2.

Major speed note 1.
The original nnedi3 does not process every pixel with its nnedi3 algorithm: first it runs a prescreener that decides whether each pixel should go through nnedi3 or through cubic scaling. This is the main reason nnedi3 runs relatively fast on some material and slows to a crawl on other material (for example, on grass and leaves).
The prescreener concept works well on a CPU, but it is quite alien to extremely parallel GPU code and was not implemented. This means that nnedi3ocl always runs at constant speed: it will be slower than the original nnedi3 with prescreener on simple frames, but faster on complex ones.
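
To make the trade-off concrete, here is a toy cost model in Python. It is only an illustration: the per-pixel costs (1.0 for the neural network path, 0.05 for cubic) are made-up numbers, and both function names are hypothetical. It models a prescreened frame as a mix of cheap and expensive pixels, and a non-prescreened frame as all expensive pixels.

```python
def prescreened_cost(complex_fraction, nn_cost=1.0, cubic_cost=0.05):
    """Relative per-pixel cost when only 'complex' pixels take the nn path.

    nn_cost and cubic_cost are illustrative values, not measurements.
    """
    return complex_fraction * nn_cost + (1.0 - complex_fraction) * cubic_cost


def constant_cost(nn_cost=1.0):
    """Relative per-pixel cost when every pixel takes the nn path."""
    return nn_cost


for fraction in (0.0, 0.25, 1.0):
    print(fraction, prescreened_cost(fraction), constant_cost())
```

In this model the prescreened cost swings with image content while the other stays flat, which matches the behaviour described above. Whether the flat cost wins in practice depends on how much faster the GPU is per pixel.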

TLDR: nnedi3ocl always runs in best quality mode, so it can be either faster or slower than the original nnedi3.

Note that you can combine CPU and OpenCL processing in one script by using both plugins.

Major speed note 2.
The OpenCL code is quite optimized, but the memory transfers are not, so a lot of time is lost there. Using a high number of threads with MTMode 2 (even more than the physical threads your CPU has) is the best workaround for now.

Major speed note 3.
Unlike the original nnedi3, where speed mostly depended on image complexity, the speed of nnedi3ocl is a direct result of your hardware speed and the settings used.
Each increase of nns by 1 halves the speed. qual=2 is also around 2x slower than qual=1. This means that there is a 32x difference in speed between the fastest and the slowest parameters.
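
The 32x figure follows directly from those two rules. A small Python sketch, assuming nns ranges over 0..4 and qual over 1..2 as in the original nnedi3 (the ranges are not restated in this post, and relative_cost is a hypothetical helper):

```python
def relative_cost(nns, qual):
    """Cost relative to the fastest settings (nns=0, qual=1).

    Assumes nns in 0..4 and qual in 1..2, as in the original nnedi3.
    """
    if not 0 <= nns <= 4:
        raise ValueError("nns must be in 0..4")
    if qual not in (1, 2):
        raise ValueError("qual must be 1 or 2")
    # Each nns step doubles the cost; qual=2 doubles it once more.
    return (2 ** nns) * (2 if qual == 2 else 1)


fastest = relative_cost(0, 1)
slowest = relative_cost(4, 2)
print(slowest // fastest)     # ratio between slowest and fastest settings
print(relative_cost(2, 1))    # the benchmark settings (nns=2, qual=1)
```

The "around 2x" for qual is approximate, so treat the result as a rough guide rather than an exact prediction.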


OpenCL device preferences.
Don't bother running the code on CPU OpenCL devices: the original nnedi3 would be way faster simply due to its prescreener.

For GPUs, AMD cards with GCN architecture are recommended. Nvidia does OK, but has the disadvantage of fully occupying one of your CPU cores during heavy GPU computations. Intel integrated... it works there too!
Theoretical FLOPS should be a good indication of performance as long as you factor in the efficiency of the particular architecture. The benchmark table below provides some useful coefficients for how TFLOPS scale to FPS for cards of different architectures.

In case of multiple OpenCL platforms, the order of preference is: AMD GPU -> any GPU -> the rest. No manual choice yet.
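
That order of preference can be pictured as a simple ranking. The Python below is a sketch of the stated rule only, not the plugin's host code (which isn't published); the dictionary layout, the vendor strings and the function names are all assumptions made for illustration.

```python
def preference(device):
    """Lower is better: AMD GPU -> any GPU -> the rest."""
    is_gpu = device["type"] == "GPU"
    vendor = device["vendor"]
    is_amd = "AMD" in vendor or "Advanced Micro Devices" in vendor
    if is_gpu and is_amd:
        return 0
    if is_gpu:
        return 1
    return 2


def pick_device(devices):
    """Return the most preferred device; ties keep the first one listed."""
    return min(devices, key=preference)


devices = [
    {"name": "CPU device", "vendor": "Intel", "type": "CPU"},
    {"name": "GeForce GTX 660", "vendor": "NVIDIA", "type": "GPU"},
    {"name": "Radeon HD7870", "vendor": "Advanced Micro Devices, Inc.", "type": "GPU"},
]
print(pick_device(devices)["name"])
```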

Multi-GPU setups are not supported yet (todo).


OpenCL part is open source, license is... LGPLv3? Subject to change. Host code isn't open yet; there is nothing interesting there anyway.
A somewhat unclear issue is the nnedi3 nn coefficient data: I have no idea how licenses apply to it, since it's not code. A preprocessed but conceptually unchanged version of it is currently embedded in the dll. If tritical has any issues with the current situation, I'm ready to listen.


Hacking.
As you may notice, the main OpenCL part is not just open source, but is actually read and compiled at runtime from a separate text file. You can change it, and the changes will be applied on the next restart.

Feel free to poke at the code: sometimes just adding dummy if lines (that are always taken or always skipped, but the compiler can't deduce that) can greatly change the speed in either direction.
Another interesting speed point is the #pragma unroll statements.

As for correctness, I haven't noticed that several checks matter, so they are disabled under the EXTRA_CHECKS define. Uncomment the line if you think you see precision-related errors.


Benchmark.
As it would be useful to get speed estimates for different hardware, and speed should differ insignificantly due to non-hardware reasons, let's make a "benchmark" section.
If some modification of nnedi3ocl.cl gives you a better (but still correct) result, it would be interesting to see those numbers too.

The target is 1280x720 YV12 upscaling to 2560x1440 with medium settings.

FPS with no MTMode, FPS with MTMode(2,4), FPS per theoretical TFLOPS, GPU name, GPU core clock during test MHz, PCIe version (and width if not x16), CPU, nnedi3ocl version
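
For anyone who wants to compute the third column for their own card, here is a Python sketch. It assumes theoretical TFLOPS = shader count x 2 FLOPs per clock x core clock, and that the column is the MTMode FPS divided by that figure; the post doesn't spell out either, but the published rows appear consistent with it. The shader counts (1280 for the HD7870, 1600 for the HD5870) come from the vendor specifications, not from this thread.

```python
def theoretical_tflops(shaders, core_mhz, flops_per_clock=2):
    """Peak single-precision TFLOPS, assuming 2 FLOPs per shader per clock."""
    return shaders * flops_per_clock * core_mhz * 1e6 / 1e12


def fps_per_tflops(fps, shaders, core_mhz):
    """Benchmark FPS divided by theoretical TFLOPS at the tested clock."""
    return fps / theoretical_tflops(shaders, core_mhz)


# (MTMode FPS, shader count, core MHz) for two rows of the 2013.12.08 table
print(round(fps_per_tflops(37.53, 1280, 1100), 1))  # Radeon HD7870
print(round(fps_per_tflops(18.82, 1600, 850), 1))   # Radeon HD5870
```

Use the clock the card actually ran at during the test, since boosted or overclocked cards would otherwise look more efficient than they are.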

On version 2013.12.08: (the same as 2013.11.21)
Code:
32.89 37.53 13.3  Radeon HD7870, 1100, 2, FX8350
18.33 18.82  6.9  Radeon HD5870, 850, 2, i7-920@4.0
14.68 14.96  6.9  GeForce GTX 660, 1137, 3x8, i7-3770@4.0
On version 2013.11.21: (+12% on Nvidia from 2013.11.18)
Code:
20.40 23.93  8.8  Radeon HD6950 (unlocked), 885, 2, i7-930@4.0
12.20 12.93 10.4  GeForce GTX 590 (half of dual card), 608, 2, i7-930@4.0
 8.44  8.73  8.0  GeForce GTX 560, 810, ?, ?
 5.28  5.48  7.4  GeForce GT 750M, 967, 1.1, i7-4700HQ@2.4
 2.47  2.55  2.1  Radeon HD4870, 750?, ?, ?
 2.44  2.45  2.3  GeForce GTX 275, 666, 2, i7-920@3.4
 1.97  2.00  8.1  Quadro 600, 640, 2, i7-930@4.0
 0.91  0.91  2.4  GeForce GT 240, 550, 2, i5-2500K@4.0
 1.66  1.71  4.5  Intel HD4600, 1200, -, i7-4770@3.7
On version 2013.11.18:
Code:
35.50 48.00 12.7  Radeon HD7970, 925, 3, i7-3930K@3.2
20.40 23.93  8.8  Radeon HD6950 (unlocked), 885, 2, i7-930@4.0
20.29 22.68  4.9  GeForce GTX 780, 1006, 2, i7-3930K@3.8
17.72 22.47  8.3  Radeon HD6950 (unlocked), 885, 1.1 x4, i7-930@4.0
18.43 21.21  8.9  Radeon HD6950, 850, 2, i7-930@4.0
14.24 15.78  6.4  GeForce GTX 660 Ti, 915?, ?, ?
10.89 11.48  9.2  GeForce GTX 590 (half of dual card), 608, 2, i7-930@4.0
 9.33  9.60  6.3  GeForce GTX 650 Ti Boost, 1006, 3, i7-4770k@4.3
 4.65  4.90  6.6  GeForce GT 750M,  967, 1.1, i7-4700HQ@2.4
 2.61  2.72  9.4  GeForce GT 555M (GDDR5), 1506, 2, i7-2670QM@2.2
 1.30  1.85  6.9  GeForce GT 430, 700, 2, i7-860@2.8
 1.72  1.74  7.1  Quadro 600, 640, 2, i7-930@4.0
 1.66  1.71  4.5  Intel HD4600, 1200, -, i7-4770@3.7
 1.38  1.38  1.3  GeForce GTX 275, 666, 2, i7-920@3.4
FPS is measured as average FPS in AVSMeter 1.7.2 at the end of this script:
Code:
SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl(dw=1, nns=2, qual=1)
or for versions 2013.11.22 and older:
Code:
SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl_rpow2(2, nns=2, qual=1)
If your FPS is terrible and you don't want to wait until the end, reduce the first number in BlankClip to 100, but don't interrupt the script in the middle.

Some conclusions:
  1. Even severely limiting PCIe bandwidth from 2 x16 to 1.1 x4 (= 8x less bandwidth) for a fairly fast card like the Radeon HD6950 doesn't change overall speed much: it is only 6% slower with MTMode and 13% slower without. So anything from PCIe 2 x8 upwards should not affect the current implementation much.
  2. There is a clear correlation between FPS, theoretical FLOPS and the architecture of the GPU.
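
The numbers in the first conclusion can be rechecked with a few lines of Python. The per-lane rates (roughly 250 MB/s for PCIe 1.1 and 500 MB/s for PCIe 2, per direction) are the usual nominal figures and are not taken from this thread; the FPS values are the two Radeon HD6950 (unlocked) rows from the 2013.11.18 table. The helper names are hypothetical.

```python
# Approximate usable bandwidth per lane and direction, in MB/s (nominal values)
LANE_MB_S = {"1.1": 250, "2": 500}


def bandwidth_mb_s(version, lanes):
    """Nominal one-direction bandwidth of a PCIe link."""
    return LANE_MB_S[version] * lanes


def slowdown_percent(fast_fps, slow_fps):
    """How much slower slow_fps is than fast_fps, in percent."""
    return (1 - slow_fps / fast_fps) * 100


print(bandwidth_mb_s("2", 16) / bandwidth_mb_s("1.1", 4))  # bandwidth ratio
print(round(slowdown_percent(23.93, 22.47)))  # with MTMode(2,4)
print(round(slowdown_percent(20.40, 17.72)))  # without MTMode
```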

Last edited by SEt; 10th February 2014 at 00:04.
Old 18th November 2013, 07:53   #2  |  Link
TurboPascal7
Registered User
 
Join Date: Jan 2010
Posts: 270
So this is what you were talking about. Fancy.

5-15 times faster than the original here, which is a really nice benefit. That's an i7 860 vs a GTX 760 (compute capability 3.0) connected via PCIe 2. Also, noting what version is actually required would be a good idea, imho.

dh=false being the default when it doesn't work yet seems a bit confusing though.
__________________
Me on GitHub | AviSynth+ - the (dead) future of AviSynth

Last edited by TurboPascal7; 18th November 2013 at 08:54.
Old 18th November 2013, 10:57   #3  |  Link
easyfab
Registered User
 
Join Date: Jan 2002
Posts: 332
Quote:
Originally Posted by SEt View Post
Major speed note 2.
The OpenCL code is quite optimized, but memory transfers are not. So, much time is lost there. Using high number of threads with MTMode 2 (even more than physical threads your CPU has) is the best workaround for now.
Will next-generation APUs resolve the memory transfer problem, since the memory will be shared between CPU and GPU?
Old 18th November 2013, 14:27   #4  |  Link
bcn_246
Registered User
 
Join Date: Nov 2005
Location: UK
Posts: 117
Quote:
Originally Posted by SEt View Post
No cshift yet, ignored. Probably should be done in external script anyway.
Could somebody post such a script?

TIA
Old 18th November 2013, 14:52   #5  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
Update: dh=false should work now.
Also added a "benchmark" section to the first post, as it would be useful for people to get estimates of what speed they can expect with different hardware.

TurboPascal7
It's unexpectedly nice that you get good speed on just CC3.0 hardware; I thought register pressure there would be too much. As for minimum required hardware, no idea: probably anything that can run OpenCL 1.1 (maybe even 1.0) will do. Older hardware would just have a worse theoretical_FLOPS / real_speed ratio.

easyfab
That problem, yes, but due to their TDP constraints it will likely still be faster to use a powerful GPU on a separate card, even with all the transfer overhead.
Old 18th November 2013, 18:32   #6  |  Link
DJATOM
Registered User
 
Join Date: Sep 2010
Location: Ukraine, Bohuslav
Posts: 377
2.61 2.72 GeForce GT 555M (GDDR5), 1506 MHz, 2, i7-2670QM@2.2GHz, 2013.11.18
Old 18th November 2013, 18:37   #7  |  Link
lansing
Registered User
 
Join Date: Sep 2006
Posts: 1,657
Code:
9.33 9.60 Nvidia Geforce GTX 650 Ti Boost, 1006MHz, 3, i7-4770k@4.3GHz,  2013.11.18
Very slow for my card, and in MT mode only one thread out of the 8 is working. I suspect the bottleneck is on the GPU side.
Old 18th November 2013, 22:26   #8  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
Updated the first post and added some results from cards I have access to. Also added an important clarification about the relation between speed and parameters.

There are now enough results to start drawing some conclusions. Unlike what I expected in the beginning, NV cards do relatively well, but their architecture (Compute Capability) does affect the efficiency. NV CC2.0 hardware is slightly more efficient than AMD VLIW4, but CC2.1 and CC3.0 are around 1.5x less efficient (btw, this inefficiency in compute workloads is well known for those CCs).

Also, I remembered the nasty habit of NV drivers of consuming one CPU core during GPU computations. It's not normal and AMD drivers don't suffer from it: during the test, CPU load should be insignificant.
Old 18th November 2013, 23:28   #9  |  Link
wOxxOm
Oz of the zOo
 
Join Date: May 2005
Posts: 208
SEt, it'd be nice to have fwidth/fheight GPU scaling (requires cshift as well), so that scripts with nnedi3_rpow2(...).downscale_to_final_resolution(...) will run faster, hopefully by a considerable margin, because of the reduced amount of GPU->CPU copyback data.
Old 19th November 2013, 00:15   #10  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
wOxxOm, not very likely in the near future: there are better places to spend optimization effort, and cubic-level downscalers are computationally cheap. As for copyback, it's slow not because it saturates PCIe bandwidth (do read the first post's benchmarks) but because it's not implemented very efficiently.
Old 19th November 2013, 00:19   #11  |  Link
wOxxOm
Oz of the zOo
 
Join Date: May 2005
Posts: 208
SEt, got it. What about combining it with some gpu-assisted degrain then?
Old 19th November 2013, 00:20   #12  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
wOxxOm, that's way more likely. No promises though.
Old 19th November 2013, 02:25   #13  |  Link
PetitDragon
Registered User
 
Join Date: Sep 2006
Posts: 81
OMG! This is so fxcking great. Thanks SEt.
Old 19th November 2013, 10:19   #14  |  Link
mikeyakame
lookin for my sanity
 
Join Date: Feb 2007
Location: it all depends on the day and which country comes to mind
Posts: 42
Code:
20.29 22.68 Nvidia Geforce GTX780, 1006Mhz, 2, i7-3930K@3.8Ghz, 2013.11.18
GPU Load Avg. No MT was ~91%.
GPU Load Avg. MT was ~97%.

So not much headroom left for my card, and CPU load was ~8-9% for both tests.

Geforce drivers => 331.58
Old 19th November 2013, 13:58   #15  |  Link
Terka
Registered User
 
Join Date: Jan 2005
Location: cz
Posts: 704
SEt, great job! Thank you!

Now implement phase correlation to mvtools under OpenCL and Avisynth users can celebrate.
Old 19th November 2013, 14:15   #16  |  Link
Mystery Keeper
Beyond Kawaii
 
Join Date: Feb 2008
Location: Russia
Posts: 724
There are other, more modern motion estimation methods around. MVTools might benefit not just from phase correlation, but from integrating more different algorithms. Also, modern motion estimation uses segmentation. MVTools could give more accurate results with the currently used method combined with segmentation.
__________________
...desu!
Old 19th November 2013, 14:42   #17  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
mikeyakame, your result is too low looking at the FPS/TFLOPS ratio. I had much better expectations for CC3.5 hardware...

Terka, Mystery Keeper, this work is far from finished and I haven't said I'm taking on an MVTools rewrite... Though the motion estimators on Intel video cards recently exposed through OpenCL did look interesting, as you can get motion estimation basically for free from dedicated hardware; my cards have nothing like that.
Old 19th November 2013, 15:09   #18  |  Link
olcifaraga
Registered User
 
Join Date: Sep 2012
Posts: 4
Code:
4.65 4.90 Nvidia Geforce GT750M,  967Mhz, 1.1, i7-4700HQ@2.4GHz, 2013.11.18
Old 19th November 2013, 16:54   #19  |  Link
Bloax
The speed of stupid
 
Join Date: Sep 2011
Posts: 317
I get ~0.2 fps upscaling 640x480 to 1280x960 on a Geforce 9800 GT (you can imagine my surprise that it could actually run this!), and I think we can easily conclude what happens on 720p->1440p
Old 19th November 2013, 17:03   #20  |  Link
zero9999
Registered User
 
Join Date: Oct 2011
Posts: 52
Quote:
Originally Posted by bcn_246 View Post
Could somebody post such a script?

TIA
use this mod of nnedi3_resize16