18th November 2013, 07:15 | #1 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
nnedi3 - OpenCL rewrite
It's time to move to modern image processing platforms, i.e. OpenCL. Here is my rewrite of one of the most widely used and heaviest AviSynth plugins: nnedi3.
Current (2013.12.08-beta) version: https://www.dropbox.com/s/bmemjsu7jq...cl_20131208.7z

Syntax: nnedi3ocl(int field, bool dh, bool Y, bool U, bool V, int nsize, int nns, int qual, int etype, int dw)

Most parameters are the same as for nnedi3. Changes:

nnedi3ocl is an AviSynth 2.5 plugin, but it supports all the new planar colorspaces when used with AviSynth 2.6. For YUY2 and RGB24 support the script function nnedi3x is provided; it also doesn't complain if you pass it parameters of the original nnedi3 that have since been removed.

Basic 2x image scaling is done by calling nnedi3ocl(dw=1). For advanced scaling with chroma and center correction and support for the YUY2 and RGB24 colorspaces, use the provided nnedi3x_rpow2 script function.

MTMode to use: 2.

Major speed note 1. The original nnedi3 does not run every pixel through the full nnedi3 algorithm: first it runs a prescreener that decides whether each pixel should go through nnedi3 or through cubic scaling. This is the main reason why nnedi3 is relatively fast on some material and slows to a crawl on other material (for example, grass and leaves). The prescreener concept works well on a CPU, but it is quite alien to extremely parallel GPU code and was not implemented. This means that nnedi3ocl always runs at constant speed: it will be slower than the original nnedi3 with prescreener on simple frames, but faster on complex ones. TLDR: nnedi3ocl always does best quality mode, so it can be both faster and slower than the original nnedi3. Note that you can combine CPU and OpenCL processing in one script by using both.

Major speed note 2. The OpenCL code is quite optimized, but the memory transfers are not, so a lot of time is lost there. Using a high number of threads with MTMode 2 (even more than the number of hardware threads your CPU has) is the best workaround for now.

Major speed note 3. Unlike the original nnedi3, where speed mostly depended on image complexity, the speed of nnedi3ocl is a direct result of your hardware speed and the settings used. Each increase of nns by 1 halves the speed. qual=2 is also around 2x slower than qual=1. This means there is a 32x difference in speed between the fastest and the slowest parameters.

OpenCL device preferences. Don't bother running the code on CPU OpenCL devices: the original nnedi3 would be way faster simply due to the prescreener. For GPUs, AMD cards with the GCN architecture are recommended. Nvidia does OK, but has the disadvantage of fully occupying one of your CPU cores during heavy GPU computations. Intel integrated... it works there too! Theoretical FLOPS should be a good indication of performance as long as you factor in the efficiency of the particular architecture. The table below provides some useful coefficients for how TFLOPS scale to FPS for cards of different architectures. In case of multiple OpenCL platforms the order of preference is: AMD GPU -> any GPU -> the rest. No manual choice yet. Multi-GPU is not supported yet (todo).

The OpenCL part is open source, license is... LGPLv3? Subject to change. The host code isn't open yet; there is nothing interesting there anyway. A somewhat open question is the nnedi3 neural-network coefficient data: I have no idea how licenses apply to it, since it isn't code. A preprocessed but conceptually unchanged version of it is currently embedded in the dll. If tritical has any issues with the current situation, I'm ready to listen.

Hacking. As you can see, the main OpenCL part is not just open source, but is actually read and compiled at run time from a separate text file. You can change it, and the changes will be applied on the next restart. Feel free to poke at the code: sometimes just adding dummy if statements (always taken or always skipped, but in a way the compiler can't deduce) can change the speed greatly in either direction. Another interesting point for speed is the #pragma unroll statements. As for correctness, I haven't seen several checks make any difference, so they are disabled behind the EXTRA_CHECKS define; uncomment that line if you think you see precision-related errors.

Benchmark. Since it would be useful to get speed estimates for different hardware, and speed should vary only insignificantly for non-hardware reasons, let's make a "benchmark" section.
If some modification of nnedi3ocl.cl gives you a better (but still correct) result, it would be interesting to see those numbers too. The target is 1280x720 YV12 upscaled to 2560x1440 with medium settings.

FPS with no MTMode, FPS with MTMode(2,4), FPS per theoretical TFLOPS, GPU name, GPU core clock during test (MHz), PCIe version (and width if not x16), CPU, nnedi3ocl version

On version 2013.12.08 (the same as 2013.11.21):
Code:
32.89  37.53  13.3  Radeon HD7870, 1100, 2, FX8350
18.33  18.82   6.9  Radeon HD5870, 850, 2, i7-920@4.0
14.68  14.96   6.9  GeForce GTX 660, 1137, 3x8, i7-3770@4.0

Code:
20.40  23.93   8.8  Radeon HD6950 (unlocked), 885, 2, i7-930@4.0
12.20  12.93  10.4  GeForce GTX 590 (half of dual card), 608, 2, i7-930@4.0
 8.44   8.73   8.0  GeForce GTX 560, 810, ?, ?
 5.28   5.48   7.4  GeForce GT 750M, 967, 1.1, i7-4700HQ@2.4
 2.47   2.55   2.1  Radeon HD4870, 750?, ?, ?
 2.44   2.45   2.3  GeForce GTX 275, 666, 2, i7-920@3.4
 1.97   2.00   8.1  Quadro 600, 640, 2, i7-930@4.0
 0.91   0.91   2.4  GeForce GT 240, 550, 2, i5-2500K@4.0
 1.66   1.71   4.5  Intel HD4600, 1200, -, i7-4770@3.7

Code:
35.50  48.00  12.7  Radeon HD7970, 925, 3, i7-3930K@3.2
20.40  23.93   8.8  Radeon HD6950 (unlocked), 885, 2, i7-930@4.0
20.29  22.68   4.9  GeForce GTX 780, 1006, 2, i7-3930K@3.8
17.72  22.47   8.3  Radeon HD6950 (unlocked), 885, 1.1 x4, i7-930@4.0
18.43  21.21   8.9  Radeon HD6950, 850, 2, i7-930@4.0
14.24  15.78   6.4  GeForce GTX 660 Ti, 915?, ?, ?
10.89  11.48   9.2  GeForce GTX 590 (half of dual card), 608, 2, i7-930@4.0
 9.33   9.60   6.3  GeForce GTX 650 Ti Boost, 1006, 3, i7-4770k@4.3
 4.65   4.90   6.6  GeForce GT 750M, 967, 1.1, i7-4700HQ@2.4
 2.61   2.72   9.4  GeForce GT 555M (GDDR5), 1506, 2, i7-2670QM@2.2
 1.30   1.85   6.9  GeForce GT 430, 700, 2, i7-860@2.8
 1.72   1.74   7.1  Quadro 600, 640, 2, i7-930@4.0
 1.66   1.71   4.5  Intel HD4600, 1200, -, i7-4770@3.7
 1.38   1.38   1.3  GeForce GTX 275, 666, 2, i7-920@3.4

Code:
SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl(dw=1, nns=2, qual=1)

Code:
SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl_rpow2(2, nns=2, qual=1)

Some conclusions:
Last edited by SEt; 10th February 2014 at 00:04. |
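The 32x figure from "Major speed note 3" in the post above can be worked through as a small sketch. It assumes that nns takes the values 0..4 as in the original nnedi3 (the post does not state the range) and treats the "around 2x" factors as exactly 2x; the function names are hypothetical and are not part of the plugin.

```python
# Sketch of "Major speed note 3": relative cost of nnedi3ocl settings.
# Assumptions (not stated in the post): nns ranges over 0..4 as in the
# original nnedi3, and each "around 2x" factor is treated as exactly 2x.

NNS_VALUES = range(0, 5)   # assumed valid nns values
QUAL_VALUES = (1, 2)       # qual values mentioned in the post


def relative_cost(nns: int, qual: int) -> int:
    """Cost relative to the fastest setting (nns=0, qual=1)."""
    if nns not in NNS_VALUES:
        raise ValueError("nns is assumed to be in 0..4")
    if qual not in QUAL_VALUES:
        raise ValueError("qual is assumed to be 1 or 2")
    # Each nns step doubles the cost; qual=2 doubles it once more.
    return (2 ** nns) * qual


def estimated_fps(measured_fps: float, measured: tuple, target: tuple) -> float:
    """Scale an FPS measured at one (nns, qual) pair to another pair."""
    return measured_fps * relative_cost(*measured) / relative_cost(*target)


if __name__ == "__main__":
    # Slowest setting relative to the fastest one.
    print(relative_cost(4, 2))
    # Rough estimate for the HD7870 benchmark row (nns=2, qual=1)
    # if it were rerun at the slowest settings.
    print(round(estimated_fps(37.53, (2, 1), (4, 2)), 2))
```

Under these assumptions the benchmark setting (nns=2, qual=1) sits 4x above the fastest setting and 8x below the slowest one, so benchmark numbers are only a rough guide for other settings.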
18th November 2013, 07:53 | #2 | Link |
Registered User
Join Date: Jan 2010
Posts: 270
|
So this is what you were talking about. Fancy.
5-15 times faster than the original here, which is a really nice benefit. That's an i7 860 vs a GTX 760 connected via PCIe 2, which is compute capability 3.0. Also, noting what version is actually required would be a good idea, imho. dh=false being the default when it doesn't work yet seems a bit confusing though.
Last edited by TurboPascal7; 18th November 2013 at 08:54. |
18th November 2013, 14:52 | #5 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
Update: dh=false should work now.
Also added a "benchmark" section to the first post, as it would be useful for people to get estimates of what speed they can expect with different hardware.

TurboPascal7, it's unexpectedly nice that you get good speed on just CC3.0 hardware; I thought register pressure there would be too much. As for the minimum required hardware: no idea, probably anything that can run OpenCL 1.1 (maybe even 1.0) will do. Older hardware would just have a worse theoretical_FLOPS / real_speed ratio.

easyfab, this problem: yes, but due to their TDP constraints it will likely still be faster to use a powerful GPU on a separate card, even with all the transfer overhead. |
18th November 2013, 22:26 | #8 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
Updated first post, added some results for cards I have access to. Also added an important clarification about the relation between speed and parameters.
There are now enough results to start drawing some conclusions. Unlike what I expected in the beginning, NV cards do relatively well, but their architecture (Compute Capability) does affect the efficiency. NV CC2.0 hardware is slightly more efficient than AMD VLIW4, but CC2.1 and CC3.0 are around 1.5x less efficient (btw, this inefficiency in compute workloads is well known for those CCs). I also remembered the nasty habit of NV drivers of consuming one CPU core during GPU computations. That's not normal and AMD drivers don't suffer from it: during the test, CPU load should be insignificant. |
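The efficiency comparison above rests on the "FPS per theoretical TFLOPS" column of the benchmark. Here is a minimal sketch of how that ratio can be reproduced, assuming theoretical single-precision throughput is shader count x 2 FLOPs per clock x core clock. The shader counts are taken from public spec sheets and are not stated in this thread, and the function names are hypothetical. For these rows, the MTMode(2,4) FPS figure appears to be the one that reproduces the listed ratios.

```python
# Sketch: reproduce the "FPS per theoretical TFLOPS" benchmark column.
# Assumptions: peak FP32 throughput = shaders * 2 FLOPs/clock * clock,
# and shader counts come from public spec sheets (not from the thread).


def theoretical_tflops(shader_count: int, core_mhz: float) -> float:
    """Peak single-precision TFLOPS for a given shader count and clock."""
    return shader_count * 2 * core_mhz * 1e6 / 1e12


def fps_per_tflops(fps: float, shader_count: int, core_mhz: float) -> float:
    """Efficiency figure: measured FPS divided by theoretical TFLOPS."""
    return fps / theoretical_tflops(shader_count, core_mhz)


# (name, assumed shader count, core clock in MHz, FPS with MTMode(2,4))
CARDS = [
    ("Radeon HD7870", 1280, 1100, 37.53),
    ("Radeon HD7970", 2048, 925, 48.00),
    ("GeForce GTX 780", 2304, 1006, 22.68),
]

if __name__ == "__main__":
    for name, shaders, mhz, fps in CARDS:
        ratio = fps_per_tflops(fps, shaders, mhz)
        print(f"{name}: {ratio:.1f} FPS per TFLOPS")
```

Comparing ratios rather than raw FPS is what makes it possible to separate architectural efficiency from sheer card size.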
18th November 2013, 23:28 | #9 | Link |
Oz of the zOo
Join Date: May 2005
Posts: 208
|
SEt, it'd be nice to have fwidth/fheight gpu-scaling (requires cshift as well), thus the scripts with nnedi3_rpow2(...).downscale_to_final_resolution(...) will run faster, and hopefully by a considerable margin, because of the reduced gpu->cpu copyback data amount.
|
19th November 2013, 00:15 | #10 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
wOxxOm, not very likely in the near future: there are better places to spend optimization effort, and cubic-level downscalers are computationally cheap. As for copyback, it's slow not because it saturates PCIe bandwidth (do read the first post benchmarks) but because it is not implemented very efficiently.
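The bandwidth point can be checked with rough arithmetic. The sketch below assumes YV12 frames at 12 bits per pixel and nominal per-lane PCIe rates (about 250, 500 and 985 MB/s per direction for versions 1.x, 2 and 3); those rates are standard figures and not numbers from this thread, and the function names are hypothetical. The FPS values are the two Radeon HD6950 (unlocked) rows from the first post.

```python
# Sketch: how much PCIe bandwidth the GPU->CPU copyback needs.
# Assumptions: YV12 is 12 bits per pixel; per-lane rates are nominal
# one-direction figures for each PCIe version (not from the thread).

PCIE_MB_PER_LANE = {"1.1": 250, "2": 500, "3": 985}


def yv12_frame_bytes(width: int, height: int) -> int:
    """Size of one YV12 frame: full-res luma plus two quarter-res chroma planes."""
    return width * height * 3 // 2


def link_utilization(width: int, height: int, fps: float,
                     pcie_version: str, lanes: int = 16) -> float:
    """Fraction of the nominal one-direction link rate used by copyback."""
    needed = yv12_frame_bytes(width, height) * fps
    available = PCIE_MB_PER_LANE[pcie_version] * 1e6 * lanes
    return needed / available


if __name__ == "__main__":
    # HD6950 (unlocked) rows: PCIe 2 x16 at 23.93 FPS, PCIe 1.1 x4 at 22.47 FPS.
    for version, lanes, fps in (("2", 16, 23.93), ("1.1", 4, 22.47)):
        used = link_utilization(2560, 1440, fps, version, lanes)
        print(f"PCIe {version} x{lanes}: {used:.1%} of nominal bandwidth")
```

Even on the PCIe 1.1 x4 link the output frames use only a small fraction of the nominal rate, which is consistent with the small FPS difference between the two HD6950 rows.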
|
19th November 2013, 10:19 | #14 | Link |
lookin for my sanity
Join Date: Feb 2007
Location: it all depends on the day and which country comes to mind
Posts: 42
|
Code:
20.29 22.68 Nvidia Geforce GTX780, 1006Mhz, 2, i7-3930K@3.8Ghz, 2013.11.18

GPU Load Avg. MT was ~97%. So not much headroom left for my card, and CPU load was ~8-9% for both tests. Geforce drivers => 331.58 |
19th November 2013, 14:15 | #16 | Link |
Beyond Kawaii
Join Date: Feb 2008
Location: Russia
Posts: 724
|
There are other, more modern motion estimation methods around. MVTools might benefit not just from phase correlation, but from integrating more different algorithms. Also, modern motion estimation uses segmentation. MVTools could give more accurate results by combining its current method with segmentation.
__________________
...desu! |
19th November 2013, 14:42 | #17 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
mikeyakame, your result is too low looking at the FPS/TFLOPS ratio. I had much better expectations for CC3.5 hardware...
Terka, Mystery Keeper, this work is far from finished and I haven't said I'm taking on an MVTools rewrite... The motion estimators on Intel video cards that were recently exposed through OpenCL did look interesting, since you can get motion estimation basically for free from dedicated hardware, but my cards have nothing like that. |
19th November 2013, 17:03 | #20 | Link |
Registered User
Join Date: Oct 2011
Posts: 52
|
|