nnedi3 - OpenCL rewrite

SEt · 18th November 2013, 07:15

It's time to move to modern image processing platforms, i.e. OpenCL. Here is my rewrite of one of the most used and heaviest AviSynth plugins: nnedi3.

Current (2013.12.08-beta) version: https://www.dropbox.com/s/bmemjsu7jq...cl_20131208.7z

Syntax: nnedi3ocl(int field, bool dh, bool Y, bool U, bool V, int nsize, int nns, int qual, int etype, int dw)
Most parameters are the same as for nnedi3. Changes:

dw - controls scaling in horizontal direction: -1 no scaling; 0 and 1 scale like field 0 and 1 with dh=true, but horizontally. Default: -1.
Default for field is dw.
Default for dh is false when dw=-1 and true otherwise.
Only nsize=0 is implemented; other values are silently ignored.
pscrn, threads, opt, fapprox: removed.
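
The default rules for dw, field and dh can be summarised in a few lines of Python. This is only a model of the rules as written above, not the plugin's code; the function name resolve_defaults is made up for illustration.

```python
def resolve_defaults(field=None, dh=None, dw=-1):
    """Model of the documented defaults (hypothetical helper, not plugin code)."""
    if dw not in (-1, 0, 1):
        raise ValueError("dw must be -1, 0 or 1")
    if field is None:
        field = dw            # "Default for field is dw."
    if dh is None:
        dh = dw != -1         # false when dw=-1, true otherwise
    return {"field": field, "dh": dh, "dw": dw}


print(resolve_defaults(dw=1))  # {'field': 1, 'dh': True, 'dw': 1}
print(resolve_defaults())      # {'field': -1, 'dh': False, 'dw': -1}
```

So a bare nnedi3ocl(dw=1) ends up with field=1 and dh=true, while explicitly passed values are left alone.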

nnedi3ocl is an AviSynth 2.5 plugin, but it supports all the new planar colorspaces when used with AviSynth 2.6. For YUY2 and RGB24 support, the script function nnedi3x is provided; it also doesn't complain if you pass it the now-removed parameters of the original nnedi3.

Basic 2x image scaling is done by calling nnedi3ocl(dw=1). For advanced scaling with chroma and center correction and support for YUY2 and RGB24 colorspaces, use the provided nnedi3x_rpow2 script function.

MTMode to use: 2.

Major speed note 1.
The original nnedi3 doesn't process all pixels with its nnedi3 algorithm: it first runs a prescreener that decides whether each pixel should go through nnedi3 or through cubic scaling. This is the main reason nnedi3 runs relatively fast at some times and slows to a crawl at others (for example, on grass and leaves).
The prescreener concept works well on a CPU, but it is quite alien to massively parallel GPU code and was not implemented. This means nnedi3ocl always runs at constant speed: it will be slower than the original nnedi3 with prescreener on simple frames, but faster on complex ones.

TLDR: nnedi3ocl always runs in best-quality mode, so it can be either faster or slower than the original nnedi3.
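
To make the trade-off concrete, here is a toy cost model in Python. All numbers are arbitrary units chosen purely for illustration (they are not measurements), and the function names are hypothetical. It only shows why a prescreened CPU path wins on simple frames and loses on complex ones when the other path has a constant cost.

```python
def cpu_cost_per_pixel(nn_fraction, nn_cost, cubic_cost):
    """Average per-pixel cost when only nn_fraction of pixels go through the network."""
    return nn_fraction * nn_cost + (1.0 - nn_fraction) * cubic_cost


def crossover_fraction(nn_cost, cubic_cost, gpu_cost):
    """Fraction of 'hard' pixels above which the constant-cost path becomes cheaper."""
    return (gpu_cost - cubic_cost) / (nn_cost - cubic_cost)


# Arbitrary illustrative units, NOT measured values.
NN_COST, CUBIC_COST, GPU_COST = 100.0, 1.0, 20.0

for fraction in (0.05, 0.5):
    cpu = cpu_cost_per_pixel(fraction, NN_COST, CUBIC_COST)
    winner = "CPU" if cpu < GPU_COST else "GPU"
    print(f"{fraction:.0%} hard pixels: CPU {cpu:.2f} vs GPU {GPU_COST:.2f} -> {winner}")

print(f"crossover at {crossover_fraction(NN_COST, CUBIC_COST, GPU_COST):.1%} hard pixels")
```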

Note that you can combine CPU and OpenCL processing in one script by using both nnedi3 and nnedi3ocl.

Major speed note 2.
The OpenCL code is quite optimized, but the memory transfers are not, so a lot of time is lost there. Using a high number of threads with MTMode 2 (even more than the number of hardware threads your CPU has) is the best workaround for now.

Major speed note 3.
Unlike the original nnedi3, where speed mostly depends on image complexity, the speed of nnedi3ocl is a direct result of your hardware speed and the settings used.
Each increase of nns by 1 halves the speed, and qual=2 is also around 2x slower than qual=1. This means there is a 32x speed difference between the fastest and the slowest settings.
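
The 32x figure follows from the rule of thumb above if nns runs from 0 to 4 and qual from 1 to 2, as in the original nnedi3 (that range is an assumption here; the post does not list it). A small Python sketch, with a made-up function name:

```python
def relative_cost(nns, qual):
    """Approximate cost relative to the fastest settings (nns=0, qual=1)."""
    if nns not in range(5):
        raise ValueError("nns is assumed to be 0..4, as in the original nnedi3")
    if qual not in (1, 2):
        raise ValueError("qual is assumed to be 1 or 2")
    # Each nns step doubles the cost; qual=2 doubles it once more.
    return (2 ** nns) * (2 if qual == 2 else 1)


print(relative_cost(4, 2))  # 32: slowest vs fastest settings
print(relative_cost(2, 1))  # 4: the benchmark settings (nns=2, qual=1)
```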

OpenCL device preferences.
Don't bother running the code on CPU OpenCL devices: the original nnedi3 would be way faster simply due to the prescreener.

For GPUs, AMD cards with the GCN architecture are recommended. Nvidia does OK, but has the disadvantage of completely occupying one of your CPU cores during heavy GPU computations. Intel integrated... it works there too!
Theoretical FLOPS should be a good indication of performance as long as you factor in the efficiency of the particular architecture. The table below provides some useful coefficients for how TFLOPS scale to FPS for cards of different architectures.

If multiple OpenCL platforms are present, the order of preference is: AMD GPU -> any GPU -> the rest. There is no manual choice yet.
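
The preference order can be pictured as a simple ranking. The host code is not public, so the Python below is only a guess at the logic; the device dictionaries and their "vendor"/"type" fields are hypothetical stand-ins for whatever the OpenCL runtime reports.

```python
def preference_rank(device):
    """Lower rank = preferred: AMD GPU (0), any other GPU (1), the rest (2)."""
    is_gpu = device["type"] == "GPU"
    if is_gpu and "amd" in device["vendor"].lower():
        return 0
    if is_gpu:
        return 1
    return 2


def pick_device(devices):
    """Return the most preferred device; ties keep the first one listed."""
    return min(devices, key=preference_rank)


# Hypothetical device list, for illustration only.
devices = [
    {"name": "CPU device", "vendor": "Intel", "type": "CPU"},
    {"name": "GeForce GTX 660", "vendor": "NVIDIA", "type": "GPU"},
    {"name": "Radeon HD7870", "vendor": "AMD", "type": "GPU"},
]
print(pick_device(devices)["name"])      # Radeon HD7870
print(pick_device(devices[:2])["name"])  # GeForce GTX 660
```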

Multi-GPU setups are not supported yet (todo).

The OpenCL part is open source; the license is... LGPLv3? Subject to change. The host code isn't open yet; there is nothing interesting there anyway.
A somewhat open question is the nnedi3 nn coefficient data: I have no idea how licenses apply to it, since it's not code. A preprocessed but conceptually unchanged version of it is currently embedded in the dll. If tritical has any issues with the current situation, I'm ready to listen.

Hacking.
As you may notice, the main OpenCL part is not just open source: it is actually read from a separate text file and compiled at run time. You can change it, and the changes will be applied on the next restart.

Feel free to poke at the code: sometimes just adding dummy if lines (that are always taken or always skipped, but the compiler can't deduce that) can greatly change the speed in either direction.
Another interesting point for speed is the #pragma unroll statements.

As for correctness, I haven't noticed that several checks matter, so they are disabled under the EXTRA_CHECKS define. Uncomment that line if you think you see precision-related errors.

Benchmark.
As it would be useful to get speed estimates for different hardware, and speed should differ only insignificantly for non-hardware reasons, let's make a "benchmark" section.
If some modification of nnedi3ocl.cl gives you a better (but still correct) result, it would be interesting to see those numbers too.

The target is 1280x720 YV12 upscaled to 2560x1440 with medium settings.

Columns: FPS without MTMode, FPS with MTMode(2,4), FPS per theoretical TFLOPS, GPU name, GPU core clock during the test (MHz), PCIe version (and width if not x16), CPU, nnedi3ocl version (given as the group heading in the tables below).
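
The post does not say exactly how the "FPS per theoretical TFLOPS" column is computed. One reading that reproduces the listed values is: MTMode FPS divided by peak single-precision TFLOPS at the tested clock, with TFLOPS taken as shaders x 2 x clock. The Python sketch below checks that reading against three rows; the shader counts (1280, 2048, 2304) come from the public card specifications, not from this thread.

```python
def theoretical_tflops(shaders, clock_mhz):
    """Peak single-precision TFLOPS, assuming 2 FLOPs per shader per clock."""
    return shaders * 2 * clock_mhz / 1e6


def fps_per_tflops(fps, shaders, clock_mhz):
    return fps / theoretical_tflops(shaders, clock_mhz)


# (name, FPS with MTMode(2,4), shader count, core clock in MHz)
cards = [
    ("Radeon HD7870", 37.53, 1280, 1100),
    ("Radeon HD7970", 48.00, 2048, 925),
    ("GeForce GTX 780", 22.68, 2304, 1006),
]
for name, fps, shaders, clock in cards:
    print(f"{name}: {fps_per_tflops(fps, shaders, clock):.1f}")
# Radeon HD7870: 13.3
# Radeon HD7970: 12.7
# GeForce GTX 780: 4.9
```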

On version 2013.12.08: (the same as 2013.11.21)

Code:

32.89 37.53 13.3  Radeon HD7870, 1100, 2, FX8350
18.33 18.82  6.9  Radeon HD5870, 850, 2, i7-920@4.0
14.68 14.96  6.9  GeForce GTX 660, 1137, 3 x8, i7-3770@4.0

On version 2013.11.21: (+12% on Nvidia from 2013.11.18)

Code:

20.40 23.93  8.8  Radeon HD6950 (unlocked), 885, 2, i7-930@4.0
12.20 12.93 10.4  GeForce GTX 590 (half of dual card), 608, 2, i7-930@4.0
 8.44  8.73  8.0  GeForce GTX 560, 810, ?, ?
 5.28  5.48  7.4  GeForce GT 750M, 967, 1.1, i7-4700HQ@2.4
 2.47  2.55  2.1  Radeon HD4870, 750?, ?, ?
 2.44  2.45  2.3  GeForce GTX 275, 666, 2, i7-920@3.4
 1.97  2.00  8.1  Quadro 600, 640, 2, i7-930@4.0
 0.91  0.91  2.4  GeForce GT 240, 550, 2, i5-2500K@4.0
 1.66  1.71  4.5  Intel HD4600, 1200, -, i7-4770@3.7

On version 2013.11.18:

Code:

35.50 48.00 12.7  Radeon HD7970, 925, 3, i7-3930K@3.2
20.40 23.93  8.8  Radeon HD6950 (unlocked), 885, 2, i7-930@4.0
20.29 22.68  4.9  GeForce GTX 780, 1006, 2, i7-3930K@3.8
17.72 22.47  8.3  Radeon HD6950 (unlocked), 885, 1.1 x4, i7-930@4.0
18.43 21.21  8.9  Radeon HD6950, 850, 2, i7-930@4.0
14.24 15.78  6.4  GeForce GTX 660 Ti, 915?, ?, ?
10.89 11.48  9.2  GeForce GTX 590 (half of dual card), 608, 2, i7-930@4.0
 9.33  9.60  6.3  GeForce GTX 650 Ti Boost, 1006, 3, i7-4770k@4.3
 4.65  4.90  6.6  GeForce GT 750M,  967, 1.1, i7-4700HQ@2.4
 2.61  2.72  9.4  GeForce GT 555M (GDDR5), 1506, 2, i7-2670QM@2.2
 1.30  1.85  6.9  GeForce GT 430, 700, 2, i7-860@2.8
 1.72  1.74  7.1  Quadro 600, 640, 2, i7-930@4.0
 1.66  1.71  4.5  Intel HD4600, 1200, -, i7-4770@3.7
 1.38  1.38  1.3  GeForce GTX 275, 666, 2, i7-920@3.4

FPS is the average FPS reported by AVSMeter 1.7.2 at the end of this script:

Code:

SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl(dw=1, nns=2, qual=1)

or for versions 2013.11.22 and older:

Code:

SetMTMode(2,4)
BlankClip(1000, 1280, 720, "YV12", 24000, 1001, 0)
nnedi3ocl_rpow2(2, nns=2, qual=1)

If your FPS is terrible and you don't want to wait till the end, reduce the first number in BlankClip to 100, but don't interrupt the script in the middle.
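
As a rough guide to how long a run takes, divide the frame count by the FPS. The Python snippet below uses the slowest table entry (0.91 FPS) as its example; the function name is made up.

```python
def benchmark_seconds(frames, fps):
    """Wall-clock time needed to process the whole clip at a given average FPS."""
    return frames / fps


print(round(benchmark_seconds(1000, 0.91)))  # 1099 seconds, about 18 minutes
print(round(benchmark_seconds(100, 0.91)))   # 110 seconds
```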

Some conclusions:

Even severely limiting PCIe bandwidth from 2 x16 to 1.1 x4 (= 8x less bandwidth) for a quite fast card, the Radeon HD6950, doesn't change overall speed much: it is only 6% slower with MTMode and 13% slower without. So anything from PCIe 2 x8 upwards should not affect the current implementation much.
There is a clear correlation between FPS, theoretical FLOPS and GPU architecture.
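
The 8x, 6% and 13% figures can be rechecked from the table rows for the unlocked HD6950. The Python sketch below assumes the nominal per-lane rates of 250 MB/s for PCIe 1.1 and 500 MB/s for PCIe 2 (per direction); function names are made up for illustration.

```python
# Nominal per-lane, per-direction bandwidth in MB/s.
PER_LANE_MB_S = {"1.1": 250, "2": 500}


def link_bandwidth_mb_s(version, lanes):
    return PER_LANE_MB_S[version] * lanes


def slowdown_percent(fps_fast, fps_slow):
    """How much slower the second result is, as a percentage of the first."""
    return (1 - fps_slow / fps_fast) * 100


print(link_bandwidth_mb_s("2", 16) / link_bandwidth_mb_s("1.1", 4))  # 8.0
print(f"{slowdown_percent(23.93, 22.47):.0f}%")  # 6%  (with MTMode)
print(f"{slowdown_percent(20.40, 17.72):.0f}%")  # 13% (without MTMode)
```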

TurboPascal7 · 18th November 2013, 07:53

So this is what you were talking about. Fancy.

5-15 times faster than the original here, which is a really nice benefit. That's an i7 860 vs. a GTX 760 (compute capability 3.0) connected via PCIe 2. Also, noting which version is actually required would be a good idea, imho.

dh=false being the default when it doesn't work yet seems a bit confusing though.

easyfab · 18th November 2013, 10:57

Quote:

Originally Posted by SEt

Major speed note 2.
The OpenCL code is quite optimized, but memory transfers are not. So, much time is lost there. Using high number of threads with MTMode 2 (even more than physical threads your CPU has) is the best workaround for now.

Will next-generation APUs resolve the memory transfer problem, since the memory will be shared between CPU and GPU?

bcn_246 · 18th November 2013, 14:27

Quote:

Originally Posted by SEt

No cshift yet, ignored. Probably should be done in external script anyway.

Could somebody post such a script?

TIA

SEt · 18th November 2013, 14:52

Update: dh=false should work now.
Also added a "benchmark" section to the first post, as it would be useful for people to get an estimate of what speed they can expect with different hardware.

TurboPascal7
It's unexpectedly nice that you get good speed on just CC3.0 hardware; I thought register pressure there would be too high. As for minimum required hardware: no idea, probably anything that can run OpenCL 1.1 (maybe even 1.0) will do. Older hardware will just have a worse theoretical_FLOPS / real_speed ratio.

easyfab
This problem, yes. But due to their TDP constraints, it will likely still be faster to use a powerful GPU on a separate card, even with all the transfer overhead.

DJATOM · 18th November 2013, 18:32

2.61 2.72 GeForce GT 555M (GDDR5), 1506 MHz, 2, i7-2670QM@2.2GHz, 2013.11.18

lansing · 18th November 2013, 18:37

Code:

9.33 9.60 Nvidia Geforce GTX 650 Ti Boost, 1006MHz, 3, i7-4770k@4.3GHz,  2013.11.18

Very slow for my card, and in MT mode only one thread out of the 8 is working. I suspect the bottleneck is on the GPU side.

SEt · 18th November 2013, 22:26

Updated the first post and added some results for cards I have access to. Also added an important clarification about the relation between speed and parameters.

There are now enough results to start drawing some conclusions. Unlike what I expected in the beginning, NV cards do relatively well, but their architecture (Compute Capability) does affect the efficiency. NV CC2.0 hardware is slightly more efficient than AMD VLIW4, but CC2.1 and CC3.0 are around 1.5x less efficient (btw, this inefficiency in compute workloads is quite well known for those CCs).

I also remembered the nasty habit of NV drivers to consume one CPU core during GPU computations. It's not normal, and AMD drivers don't suffer from it: during the test, CPU load should be insignificant.

wOxxOm · 18th November 2013, 23:28

SEt, it'd be nice to have fwidth/fheight gpu-scaling (requires cshift as well), thus the scripts with nnedi3_rpow2(...).downscale_to_final_resolution(...) will run faster, and hopefully by a considerable margin, because of the reduced gpu->cpu copyback data amount.

SEt · 19th November 2013, 00:15

wOxxOm, not very likely in the near future: there are better places to spend optimization effort, and cubic-level downscalers are computationally cheap. As for copyback, it's slow not because it saturates PCIe bandwidth (do read the first post's benchmarks) but because it isn't implemented very efficiently.

wOxxOm · 19th November 2013, 00:19

SEt, got it. What about combining it with some gpu-assisted degrain then?

SEt · 19th November 2013, 00:20

wOxxOm, that's way more likely. No promises though.

PetitDragon · 19th November 2013, 02:25

OMG! This is so fxcking great. Thanks SEt.

mikeyakame · 19th November 2013, 10:19

Code:

20.29 22.68 Nvidia Geforce GTX780, 1006Mhz, 2, i7-3930K@3.8Ghz, 2013.11.18

GPU Load Avg. No MT was ~91%.
GPU Load Avg. MT was ~97%.

So not much headroom left for my card, and CPU load was ~8-9% for both tests.

Geforce drivers => 331.58

Terka · 19th November 2013, 13:58

SEt, great job! Thank you!

Now implement phase correlation to mvtools under OpenCL and Avisynth users can celebrate.

Mystery Keeper · 19th November 2013, 14:15

There are other, more modern motion estimation methods around. MVTools might benefit not just from phase correlation, but from integrating more different algorithms. Also, modern motion estimation uses segmentation. MVTools could give more accurate results with the currently used method combined with segmentation.

SEt · 19th November 2013, 14:42

mikeyakame, your result is too low judging by the FPS/TFLOPS ratio. I had much higher expectations for CC3.5 hardware...

Terka, Mystery Keeper, this work is far from finished and I haven't said I'm taking on an MVTools rewrite... The motion estimators recently exposed through OpenCL on Intel video cards did look interesting, as you can get motion estimation basically for free from dedicated hardware, but my cards have nothing like that.

olcifaraga · 19th November 2013, 15:09

Code:

4.65 4.90 Nvidia Geforce GT750M,  967Mhz, 1.1, i7-4700HQ@2.4GHz, 2013.11.18

Bloax · 19th November 2013, 16:54

I get ~0.2 fps upscaling 640x480 to 1280x960 on a Geforce 9800 GT (you can imagine my surprise that it could actually run this!), and I think we can easily conclude what happens on 720p->1440p

zero9999 · 19th November 2013, 17:03

Quote:

Originally Posted by bcn_246

Could somebody post such a script?

TIA

use this mod of nnedi3_resize16

