nnedi3cl

HolyWu
22nd September 2017, 11:39
https://github.com/HomeOfVapourSynthEvolution/VapourSynth-NNEDI3CL

Here comes another OpenCL variant of the popular filter. As usual, some benchmarks follow FYI, measured with "vspipe -p test.vpy .". My CPU is an E3-1231 v3 and my GPU is a GTX 660. The sample videos used for benchmarking are available here (https://drive.google.com/open?id=0B25JBmrV1-msT2ZlUXVhSjc5dlk). test1 is quite unfriendly to the prescreener, while test2 is very friendly to it.

vpy

import vapoursynth as vs
core = vs.get_core()
core.max_cache_size = 3072

clip = core.lsmas.LibavSMASHSource('test1.mp4')
#clip = core.lsmas.LibavSMASHSource('test2.mp4')
#clip = core.resize.Bicubic(clip, format=vs.YUV420P16)

# deinterlace
#clip = core.nnedi3.nnedi3(clip, field=1)
#clip = core.nnedi3cl.NNEDI3CL(clip, field=1)

# enlarge
#clip = core.std.Transpose(clip).nnedi3.nnedi3(field=1, dh=True, nsize=4, nns=3).std.Transpose().nnedi3.nnedi3(field=1, dh=True, nsize=4, nns=3)
#clip = core.nnedi3cl.NNEDI3CL(clip, field=1, dh=True, dw=True, nsize=4, nns=3)

clip.set_output()


deinterlace-test1

YUV420P8:
nnedi3: 19.66 fps
nnedi3cl: 35.82 fps

YUV420P16:
nnedi3: 13.12 fps
nnedi3cl: 32.68 fps


deinterlace-test2

YUV420P8:
nnedi3: 98.34 fps
nnedi3cl: 89.53 fps

YUV420P16:
nnedi3: 59.34 fps
nnedi3cl: 71.07 fps


enlarge-test1

YUV420P8:
nnedi3: 6.60 fps
nnedi3cl: 7.72 fps

YUV420P16:
nnedi3: 4.99 fps
nnedi3cl: 7.44 fps


enlarge-test2

YUV420P8:
nnedi3: 28.82 fps
nnedi3cl: 34.85 fps

YUV420P16:
nnedi3: 16.48 fps
nnedi3cl: 29.17 fps
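
To make the gaps easier to compare, the ratios can be computed directly from the figures above. A small sketch in plain Python; the dictionary layout and the `speedup` helper are just for illustration, and the numbers are copied from this post.

```python
# fps figures copied from the benchmark above: (nnedi3, nnedi3cl)
results = {
    ("deinterlace", "test1", "YUV420P8"): (19.66, 35.82),
    ("deinterlace", "test1", "YUV420P16"): (13.12, 32.68),
    ("deinterlace", "test2", "YUV420P8"): (98.34, 89.53),
    ("deinterlace", "test2", "YUV420P16"): (59.34, 71.07),
    ("enlarge", "test1", "YUV420P8"): (6.60, 7.72),
    ("enlarge", "test1", "YUV420P16"): (4.99, 7.44),
    ("enlarge", "test2", "YUV420P8"): (28.82, 34.85),
    ("enlarge", "test2", "YUV420P16"): (16.48, 29.17),
}


def speedup(cpu_fps, cl_fps):
    """Ratio > 1 means nnedi3cl is faster than nnedi3."""
    return cl_fps / cpu_fps


for key, (cpu, cl) in results.items():
    print(*key, f"{speedup(cpu, cl):.2f}x")
```

A ratio below 1 (deinterlace-test2 at 8 bits) means the CPU filter was faster in that case.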

Myrsloik
22nd September 2017, 15:38
Interesting... Maybe I'll try similar benchmarks on my computer which has very different hardware.

Btw, why didn't you use the built in resize for the bitdepth conversion?

poisondeathray
22nd September 2017, 16:46
Thanks

Is there an "easy" way to switch between the NNEDI3 and NNEDI3CL versions when testing functions built on them? Maybe some Python magic I'm not aware of, since I'm a Python newbie?

For example, haf.QTGMC presumably calls the nnedi3.nnedi3 version; would I need to find and replace every instance, or something like that?

TheFluff
22nd September 2017, 17:13
plain_nnedi = core.nnedi3.nnedi3
core.nnedi3.nnedi3 = core.nnedi3cl.NNEDI3CL

If these were regular Python functions this would work just fine, but idk about VS' Cython modules - try it and see. It should work though - you can assign to (and overwrite) builtin standard library functions in Python if you want.
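
An alternative to overwriting attributes is to pick the function once and pass it around. A minimal sketch of that idea: `pick_nnedi3` is a hypothetical helper, not part of VapourSynth or havsfunc, and the stand-in core below exists only so the snippet runs without VapourSynth installed. It assumes that looking up a missing plugin raises AttributeError. The two filters do not take exactly the same arguments (NNEDI3CL has dw, as used in the first post), so this only helps for the parameters they share.

```python
from types import SimpleNamespace


def pick_nnedi3(core, use_opencl):
    """Return the filter function to call, falling back to the CPU version.

    Hypothetical helper: it only relies on attribute lookup, so it works
    with any object shaped like core.<namespace>.<function>.
    """
    if use_opencl:
        plugin = getattr(core, "nnedi3cl", None)
        if plugin is not None:
            return plugin.NNEDI3CL
    return core.nnedi3.nnedi3


# Stand-in for a VapourSynth core, for demonstration only.
fake_core = SimpleNamespace(
    nnedi3=SimpleNamespace(nnedi3=lambda clip, **kw: ("cpu", clip, kw)),
    nnedi3cl=SimpleNamespace(NNEDI3CL=lambda clip, **kw: ("opencl", clip, kw)),
)

nnedi = pick_nnedi3(fake_core, use_opencl=True)
print(nnedi("clip", field=1)[0])
```

This does not help with third-party scripts that hard-code core.nnedi3.nnedi3; those still need the attribute override or an edit.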

poisondeathray
22nd September 2017, 17:22
I'm getting some issues with build program failure

Win8.1 x64 . Vapoursynth R39test4


Failed to evaluate the script:
Python exception: NNEDI3CL: Build Program Failure
:1:3129: error: incompatible pointer types passing '__local float *' to parameter of type 'const __local float (*)[95]'
:1:161: note: passing argument to parameter 'input' here
:1:4825: error: incompatible pointer types passing '__local float *' to parameter of type 'const __local float (*)[95]'
:1:161: note: passing argument to parameter 'input' here

error: front end compiler failed build.


On simple script. UT Video 4:2:0. 720x480. Commenting out NNEDI3CL line works ok

import vapoursynth as vs
core = vs.get_core()

clip = core.avisource.AVISource(r'PATH\test.avi', pixel_type="yv12")
#clip = core.nnedi3cl.NNEDI3CL(clip, field=1, dh=True, dw=True)
clip.set_output()

HolyWu
23rd September 2017, 04:19
Btw, why didn't you use the built in resize for the bitdepth conversion?

It's simply more convenient for me to test whether the filter works correctly at different bit depths without having to type a full format constant like YUVxxxPxx. Furthermore, only YUV444 has constants for 32 bits defined, so it complains that module 'vapoursynth' has no attribute 'YUV420PS32' and so on. I bet most users won't even bother to use register_format.

I'm getting some issues with build program failure

Please try https://www.nmm-hd.org/upload/get~gSA4vQbgDC0/NNEDI3CL-r3.7z and see whether it works.

poisondeathray
23rd September 2017, 07:06
Please try https://www.nmm-hd.org/upload/get~gSA4vQbgDC0/NNEDI3CL-r3.7z and see whether it works.

This one works

What was the issue ? / What is different with this build ?

HolyWu
23rd September 2017, 10:11
This one works

What was the issue ? / What is different with this build ?

The issue was type mismatch between the function parameter and the argument.

Updated r3 on GitHub. Binaries are the same as the linked file in my previous post.

aegisofrime
24th September 2017, 05:39
Interesting. Did you manage to measure power draw at the wall? I'm curious if the increase in power consumption (if any) is worth the gains in speed.

DJATOM
30th September 2017, 19:04
I can't get it working on NVIDIA Quadro 600 and Windows Server 2012 R2 - https://pastebin.com/Sya8BVKQ
Is that a driver issue? I'm not familiar with OpenCL.
Upd.: On another server (same OS) on NVIDIA GeForce GTX 550 Ti I get the same errors.

HolyWu
1st October 2017, 13:07
I can't get it working on NVIDIA Quadro 600 and Windows Server 2012 R2 - https://pastebin.com/Sya8BVKQ
Is that a driver issue? I'm not familiar with OpenCL.
Upd.: On another server (same OS) on NVIDIA GeForce GTX 550 Ti I get the same errors.

The filter uses features not present in OpenCL 1.1 and below, hence the minimum requirement is OpenCL 1.2.
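
For anyone wanting to check this up front: the device version string reported by OpenCL has the form "OpenCL <major>.<minor> <vendor-specific information>", so the requirement can be tested by parsing it. A minimal sketch in plain Python; `parse_opencl_version` and `meets_minimum` are made-up helpers, and the sample strings are illustrative, not copied from the cards above.

```python
import re


def parse_opencl_version(version_string):
    """Extract (major, minor) from a CL_DEVICE_VERSION-style string."""
    match = re.match(r"OpenCL (\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version_string, minimum=(1, 2)):
    """True if the device reports at least the given OpenCL version."""
    return parse_opencl_version(version_string) >= minimum


# Illustrative strings only.
for sample in ("OpenCL 1.1 CUDA", "OpenCL 1.2 CUDA", "OpenCL 2.0 AMD-APP (2482.3)"):
    print(sample, "->", "ok" if meets_minimum(sample) else "too old")
```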

DJATOM
1st October 2017, 18:02
Yes, replaced 550 TI with 760 and it works.

Myrsloik
2nd October 2017, 18:43
All tests performed with: vspipe script.vpy . -p

Bitdepth conversion was done with the internal resize.

Threadripper 1950X
Sapphire Fury Tri-X
3200 CL14 RAM

All tests were run with 32 threads (the default), except for enlarge on the CPU, where 16 threads for some reason performed substantially better (7-10 fps difference). The source used was 3000 frames from a typical 1080p TV series episode.

deinterlace1:

YUV420P8:
nnedi3: 352.14 fps
nnedi3cl: 45.68 fps

YUV420P16:
nnedi3: 164.40 fps
nnedi3cl: 40.42 fps

enlarge:

YUV420P8:
nnedi3: 65.22 fps
nnedi3cl: 10.47 fps

YUV420P16:
nnedi3: 32.03 fps
nnedi3cl: 9.90 fps

Mystery Keeper
14th October 2017, 11:41
Tried to replace NNEDI3 with NNEDI3CL in QTGMC.
Got this error:

2017-10-14 13:41:42.021
Failed to evaluate the script:
Python exception: NNEDI3CL: Build Program Failure
:1:528: error: expected identifier or '('
:42:23: note: expanded from here
#define SCALE_ASIZE 0,003472
^
:1:528: error: expected ';' at end of declaration
:42:23: note: expanded from here
#define SCALE_ASIZE 0,003472
^
:1:564: error: expected identifier or '('
:42:23: note: expanded from here
#define SCALE_ASIZE 0,003472
^
:1:564: error: expected ';' at end of declaration
:42:23: note: expanded from here
#define SCALE_ASIZE 0,003472
^
:1:1854: warning: expression result unused


Traceback (most recent call last):
File "src\cython\vapoursynth.pyx", line 1830, in vapoursynth.vpy_evaluateScript (src\cython\vapoursynth.c:36860)
File "D:/video-to-process/Takaradzuka - Phantom 2004/phantom-temp.vpy", line 92, in
deint = haf.QTGMC(weaved, Preset="Placebo", EdiMode="eedi3+nnedi3", ChromaEdi="", EdiQual=2, NNeurons=4, NNSize=3, SubPel=4, SubPelInterp=2, BlockSize=8, Overlap=4, TFF=True, **qtgmcArguments)
File "D:\vapoursynth-plugins\py\havsfunc.py", line 1104, in QTGMC
edi1 = QTGMC_Interpolate(ediInput, InputType, EdiMode, NNSize, NNeurons, EdiQual, EdiMaxD, bobbed, ChromaEdi, TFF)
File "D:\vapoursynth-plugins\py\havsfunc.py", line 1390, in QTGMC_Interpolate
sclip=core.nnedi3cl.NNEDI3CL(Input, field=field, planes=planes, nsize=NNSize, nns=NNeurons, qual=EdiQual))
File "src\cython\vapoursynth.pyx", line 1722, in vapoursynth.Function.__call__ (src\cython\vapoursynth.c:35000)
vapoursynth.Error: NNEDI3CL: Build Program Failure
:1:528: error: expected identifier or '('
:42:23: note: expanded from here
#define SCALE_ASIZE 0,003472
^
:1:528: error: expected ';' at end of declaration
:42:23: note: expanded from here
#define SCALE_ASIZE 0,003472
^
:1:564: error: expected identifier or '('
:42:23: note: expanded from here
#define SCALE_ASIZE 0,003472
^
:1:564: error: expected ';' at end of declaration
:42:23: note: expanded from here
#define SCALE_ASIZE 0,003472
^
:1:1854: warning: expression result unused

HolyWu
19th October 2017, 05:25
Update r4 & r5.


Fix decimal-point character issue in different locales.
Add old & new prescreener.
Use snprintf to convert floating point to string for more precise value, because to_string only writes six decimal digits after the decimal point.
Change filter mode to completely parallel execution.
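
To illustrate the six-digit point: printf-style "%f" formatting, like to_string, keeps six digits after the decimal point, which throws away most of a small constant. A quick sketch in Python, whose % formatting follows the same printf rules; the value 1/288 and the "%.9g" format are assumptions chosen for illustration, not necessarily what the plugin uses.

```python
value = 1 / 288  # illustrative constant, an assumption

short = "%f" % value      # six digits after the decimal point
longer = "%.9g" % value   # nine significant digits

print(short)
print(longer)

# The longer string round-trips much closer to the original value.
print(abs(float(short) - value), abs(float(longer) - value))
```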


The benchmark in the first post is revised.

Tried to replace NNEDI3 with NNEDI3CL in QTGMC.
Got this error:

2017-10-14 13:41:42.021
Failed to evaluate the script:
Python exception: NNEDI3CL: Build Program Failure
:1:528: error: expected identifier or '('
:42:23: note: expanded from here
#define SCALE_ASIZE 0,003472


Thanks for the report. It's caused by the decimal-point character being represented differently in some locales. Please try the latest release again.
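
The failure mode is easy to see on paper: a comma in the value turns one float literal into two tokens for the OpenCL compiler. A small sketch in Python of building such a define in a locale-independent way and sanity-checking it. `make_define` and `is_c_float_literal` are hypothetical helpers, the value 1/288 is an assumption, and this is not the plugin's actual code.

```python
import re

# Simplified pattern for a plain decimal float literal such as 0.003472
C_FLOAT = re.compile(r"^[0-9]+\.[0-9]+(e[+-]?[0-9]+)?f?$", re.IGNORECASE)


def is_c_float_literal(text):
    """True if text looks like a simple C/OpenCL float literal."""
    return C_FLOAT.match(text) is not None


def make_define(name, value):
    """Build a #define line; Python's 'f' format type never consults the locale."""
    literal = format(value, ".9f")
    if not is_c_float_literal(literal):
        raise ValueError(f"bad literal: {literal!r}")
    return f"#define {name} {literal}"


print(is_c_float_literal("0,003472"))  # the value from the error log
print(make_define("SCALE_ASIZE", 1 / 288))
```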

HolyWu
19th October 2017, 05:34
Threadripper 1950X

The result with Threadripper is really superb. I wonder whether it mostly benefits from the advertised neural net prediction and smart prefetch, besides having so many threads available.

HolyWu
22nd October 2017, 18:15
clip = core.ffms2.Source('lena512.bmp').std.Loop(1000)
#clip = core.nnedi3.nnedi3(clip, field=1, pscrn=2)
#clip = core.znedi3.nnedi3(clip, field=1, pscrn=2)



nnedi3: Output 1000 frames in 6.70 seconds (149.24 fps)
znedi3: Output 1000 frames in 71.07 seconds (14.07 fps)


I have a feeling that the code path selection is wrong and it falls back to the pure C functions.

Anyway, I personally have no particular preference between CPU and GPGPU. It's simply provided as an alternative here. Users can choose for themselves which one to use, depending on the speed they get.

HolyWu
23rd October 2017, 14:08
I updated with more support for legacy CPUs. Link. (https://www.dropbox.com/s/4a0dp53yy83q5po/znedi3-v1.7z?dl=0)

It's leaking memory. Use a higher-resolution clip to see the memory grow quickly.

HolyWu
23rd October 2017, 18:21
#clip = core.ffms2.Source('lena512color.tiff').std.Loop(2000)
#clip = core.ffms2.Source('test1.mp4')
#clip = core.ffms2.Source('test2.mp4')

#clip = core.nnedi3.nnedi3(clip, field=1, pscrn=2)
#clip = core.znedi3.nnedi3(clip, field=1, pscrn=2)
#clip = core.nnedi3cl.NNEDI3CL(clip, field=1, pscrn=2)


lena512color.tiff

nnedi3: Output 2000 frames in 14.52 seconds (137.71 fps)
znedi3: Output 2000 frames in 9.08 seconds (220.17 fps)
nnedi3cl: Output 2000 frames in 17.65 seconds (113.28 fps)


test1.mp4

nnedi3: Output 1250 frames in 63.31 seconds (19.75 fps)
znedi3: Output 1250 frames in 40.80 seconds (30.64 fps)
nnedi3cl: Output 1250 frames in 34.97 seconds (35.75 fps)


test2.mp4

nnedi3: Output 2184 frames in 21.82 seconds (100.10 fps)
znedi3: Output 2184 frames in 21.07 seconds (103.66 fps)
nnedi3cl: Output 2184 frames in 23.73 seconds (92.02 fps)


Benchmarking with a static picture is probably not that realistic. And both my CPU and GPU are a bit old. Someone with both CPU and GPU released last year, or even better this year, may give a more representative result.

poisondeathray
23rd October 2017, 18:49
There seem to be some qualitative differences between them too - why would that be ?

ChaosKing
23rd October 2017, 19:10
Source was a 480p DVD

CPU 3570K @ 4.1GHz, GPU RX 480 4GB
clip = core.nnedi3.nnedi3(clip, field=1, pscrn=2) # 262 fps
clip = core.znedi3.nnedi3(clip, field=1, pscrn=2) # 258 fps
clip = core.nnedi3cl.NNEDI3CL(clip, field=1, pscrn=2) # 162 fps

//edit

more tests

clip = core.nnedi3.nnedi3(clip, field=1, pscrn=2, nsize=4, qual=2, nns=4) # 55.7 fps
clip = core.znedi3.nnedi3(clip, field=1, pscrn=2, nsize=4, qual=2, nns=4) # 56.6 fps
clip = core.nnedi3cl.NNEDI3CL(clip, field=1, pscrn=2, nsize=4, qual=2, nns=4) # 47.5 fps

ChaosKing
25th October 2017, 02:19
Now some tests with sample files from HolyWu. I tested nnedi3 nnedi3CL and znedi3.

Rx480 4GB - Driver 17.10.2 (newest)

https://i.imgur.com/xsjFWP8.png
Results for test1.mp4
# FFMS2 only
Output 1250 frames in 6.88 seconds (181.73 fps)

nnedi3(clip, field=1):
# deinterlace nnedi3 8Bit
Output 1250 frames in 61.27 seconds (20.40 fps)
# deinterlace nnedi3 16Bit
Output 1250 frames in 115.32 seconds (10.84 fps)
# deinterlace Znedi3 8Bit
Output 1250 frames in 61.96 seconds (20.17 fps)
# deinterlace Znedi3 16Bit
Output 1250 frames in 146.59 seconds (8.53 fps)

NNEDI3CL(clip, field=1):
# deinterlace nnedi3CL 8Bit
Output 1250 frames in 20,61 seconds (60,64 fps)
# deinterlace nnedi3CL 16Bit
Output 1250 frames in 23,05 seconds (54,22 fps)

std.Transpose(clip).nnedi3.nnedi3(field=1, dh=True, nsize=4, nns=3).std.Transpose().nnedi3.nnedi3(field=1, dh=True, nsize=4, nns=3):
# enlarge nnedi3 8Bit
Output 1250 frames in 184.55 seconds (6.77 fps)
# enlarge nnedi3 16Bit
Output 1250 frames in 300.05 seconds (4.17 fps)
# enlarge Znedi3 8Bit
Output 1250 frames in 165.66 seconds (7.55 fps)
# enlarge Znedi3 16Bit
canceled: very slow, less than 0.4 fps


NNEDI3CL(clip, field=1, dh=True, dw=True, nsize=4, nns=3):
# enlarge nnedi3CL 8Bit
Output 1250 frames in 73,68 seconds (16,97 fps)
# enlarge nnedi3CL 16Bit
Output 1250 frames in 78,48 seconds (15,93 fps)


>> It seems that znedi has some problems with 16bit.


Results for test2.mp4

# FFMS2 only
Output 2184 frames in 5.69 seconds (389.03 fps)

nnedi3(clip, field=1):
# deinterlace nnedi3 8Bit
Output 2184 frames in 25.33 seconds (86.22 fps)
# deinterlace nnedi3 16Bit
Output 2184 frames in 44.31 seconds (49.29 fps)
# deinterlace Znedi3 8Bit
Output 2184 frames in 23.60 seconds (92.55 fps)
# deinterlace Znedi3 16Bit
Output 2184 frames in 195.96 seconds (11.15 fps)

NNEDI3CL(clip, field=1):
# deinterlace nnedi3CL 8Bit
Output 2184 frames in 18,42 seconds (118,56 fps)
# deinterlace nnedi3CL 16Bit
Output 2184 frames in 23,18 seconds (94,23 fps)

NNEDI3CL(clip, field=1, dh=True, dw=True, nsize=4, nns=3):
# enlarge nnedi3CL 8Bit
Output 2184 frames in 36,41 seconds (59,99 fps)
# enlarge nnedi3CL 16Bit
Output 2184 frames in 43,59 seconds (50,10 fps)

Old anime DVD ntsc. It seems that the CL version is much better for higher res.
# FFMS2 only
Output 1000 frames in 0.60 seconds (1674.38 fps)

nnedi3(clip, field=1):
# deinterlace nnedi3 8Bit
Output 1000 frames in 4.35 seconds (229.70 fps)
# deinterlace nnedi3 16Bit
Output 1000 frames in 7.99 seconds (125.19 fps)


NNEDI3CL(clip, field=1):
# deinterlace nnedi3CL 8Bit
Output 1000 frames in 7,70 seconds (129,85 fps)
# deinterlace nnedi3CL 16Bit
Output 1000 frames in 8,13 seconds (123,06 fps)

std.Transpose(clip).nnedi3.nnedi3(field=1, dh=True, nsize=4, nns=3).std.Transpose().nnedi3.nnedi3(field=1, dh=True, nsize=4, nns=3):
# enlarge nnedi3 8Bit
Output 1000 frames in 12.23 seconds (81.74 fps)
# enlarge nnedi3 16Bit
Output 1000 frames in 20.45 seconds (48.89 fps)


NNEDI3CL(clip, field=1, dh=True, dw=True, nsize=4, nns=3):
# enlarge nnedi3CL 8Bit
Output 1000 frames in 15,25 seconds (65,58 fps)
# enlarge nnedi3CL 16Bit
Output 1000 frames in 15,88 seconds (62,95 fps)
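
Some of the lines above use a decimal comma (20,61) and others a decimal point, depending on where they were produced. When collecting results like these into a table, a tolerant parser helps. A minimal sketch: the regular expression is based only on the "Output N frames in S seconds (F fps)" lines quoted in this thread, and `parse_vspipe_line` is a made-up name.

```python
import re

LINE = re.compile(
    r"Output (\d+) frames in (\d+[.,]\d+) seconds \((\d+[.,]\d+) fps\)"
)


def parse_vspipe_line(line):
    """Return (frames, seconds, fps), accepting '.' or ',' as decimal mark."""
    match = LINE.search(line)
    if match is None:
        raise ValueError(f"not a vspipe summary line: {line!r}")
    frames, seconds, fps = match.groups()
    return (
        int(frames),
        float(seconds.replace(",", ".")),
        float(fps.replace(",", ".")),
    )


print(parse_vspipe_line("Output 1250 frames in 20,61 seconds (60,64 fps)"))
print(parse_vspipe_line("Output 1250 frames in 61.27 seconds (20.40 fps)"))
```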

ChaosKing
25th October 2017, 09:44
znedi update

test1.mp4

nnedi3(clip, field=1):
# deinterlace Znedi3 8Bit
Output 1250 frames in 63.23 seconds (19.77 fps)
# deinterlace Znedi3 16Bit
Output 1250 frames in 65.73 seconds (19.02 fps)

std.Transpose(clip).nnedi3.nnedi3(field=1, dh=True, nsize=4, nns=3).std.Transpose().nnedi3.nnedi3(field=1, dh=True, nsize=4, nns=3):
# enlarge Znedi3 8Bit
Output 1250 frames in 169.94 seconds (7.36 fps)
# enlarge Znedi3 16Bit
Output 1250 frames in 179.36 seconds (6.97 fps)

test2.mp4
nnedi3(clip, field=1):
# deinterlace Znedi3 8Bit
Output 2184 frames in 21.79 seconds (100.21 fps)
# deinterlace Znedi3 16Bit
Output 2184 frames in 26.24 seconds (83.22 fps)

std.Transpose(clip).nnedi3.nnedi3(field=1, dh=True, nsize=4, nns=3).std.Transpose().nnedi3.nnedi3(field=1, dh=True, nsize=4, nns=3):
# enlarge Znedi3 8Bit
Output 2184 frames in 79.29 seconds (27.55 fps)
# enlarge Znedi3 16Bit
Output 2184 frames in 105.50 seconds (20.70 fps)

aegisofrime
26th October 2017, 10:28
Woah what's this znedi3? Google only returns me to this thread :confused::confused::confused:

AzraelNewtype
27th October 2017, 08:30
That's probably because it's only in this thread, and only in links in his posts.

aegisofrime
27th October 2017, 16:20
That's probably because it's only in this thread, and only in links in his posts.

I see, I realized later on that Stephen R Savage posts a link to the dll here, heh.

Interesting speed tests, but apparently it's not a drop-in replacement for nnedi3 as the picture gets very messed up. Still, I'm excited for what it may bring in the future!

HolyWu
28th October 2017, 14:29
Please explain in what way the image is "messed up." The interface and output of znedi are both identical to nnedi3.

Um...maybe something like this (https://i.imgur.com/DslOlAQ.png)?

TheFluff
28th October 2017, 16:08
As can clearly be seen, znedi3 is the highest performing and most energy-efficient nnedi available on the market.

When is the IPO?

HolyWu
28th October 2017, 18:00
Update r6.


Change the kernel's work-group size to a better combination.
Correct the offset in new prescreener.
Tweak loop unrolling in the predictor to decrease register spilling.
Store the compiled binary for reuse in the offline cache located in $HOME/.boost_compute on UNIX-like systems and in %APPDATA%/boost_compute on Windows.
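
For anyone who needs to find or clear that cache, the two locations above can be resolved like this. A small sketch: `boost_compute_cache_dir` is a hypothetical helper that only encodes the two paths named in the list, and it takes the environment as a parameter so it can be tried without touching the real one.

```python
import ntpath
import os
import posixpath


def boost_compute_cache_dir(platform_name=os.name, env=os.environ):
    """Return the offline cache directory described above, or None if unset."""
    if platform_name == "nt":
        base = env.get("APPDATA")
        return ntpath.join(base, "boost_compute") if base else None
    base = env.get("HOME")
    return posixpath.join(base, ".boost_compute") if base else None


print(boost_compute_cache_dir())
```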


Here is probably the last benchmark result (sorry, no power measurements) from me on this. The same method as in the first post is used. Since ChaosKing mentioned the resolution, I also prepared 720x480 versions of the same test1 and test2 sample videos.

Deinterlace

test1_480p_YUV420P8
nnedi3: 95.27 fps
znedi3: 151.46 fps
nnedi3cl(r5): 157.66 fps
nnedi3cl(r6): 180.11 fps

test1_480p_YUV420P16
nnedi3: 59.88 fps
znedi3: 150.45 fps
nnedi3cl(r5): 148.92 fps
nnedi3cl(r6): 169.26 fps

test1_1080p_YUV420P8
nnedi3: 19.59 fps
znedi3: 31.00 fps
nnedi3cl(r5): 35.57 fps
nnedi3cl(r6): 39.75 fps

test1_1080p_YUV420P16
nnedi3: 12.89 fps
znedi3: 29.98 fps
nnedi3cl(r5): 32.76 fps
nnedi3cl(r6): 36.29 fps

test2_480p_YUV420P8
nnedi3: 372.69 fps
znedi3: 502.98 fps
nnedi3cl(r5): 263.82 fps
nnedi3cl(r6): 300.23 fps

test2_480p_YUV420P16
nnedi3: 235.39 fps
znedi3: 437.57 fps
nnedi3cl(r5): 240.05 fps
nnedi3cl(r6): 267.86 fps

test2_1080p_YUV420P8
nnedi3: 96.74 fps
znedi3: 112.64 fps
nnedi3cl(r5): 90.28 fps
nnedi3cl(r6): 97.79 fps

test2_1080p_YUV420P16
nnedi3: 58.35 fps
znedi3: 89.06 fps
nnedi3cl(r5): 71.17 fps
nnedi3cl(r6): 75.09 fps


Enlarge

test1_480p_YUV420P8
nnedi3: 22.87 fps
znedi3: 40.99 fps
nnedi3cl(r5): 33.43 fps
nnedi3cl(r6): 46.78 fps

test1_480p_YUV420P16
nnedi3: 18.07 fps
znedi3: 40.39 fps
nnedi3cl(r5): 32.55 fps
nnedi3cl(r6): 45.19 fps

test1_1080p_YUV420P8
nnedi3: 6.85 fps
znedi3: 11.23 fps
nnedi3cl(r5): 7.72 fps
nnedi3cl(r6): 10.61 fps

test1_1080p_YUV420P16
nnedi3: 5.09 fps
znedi3: 10.26 fps
nnedi3cl(r5): 7.44 fps
nnedi3cl(r6): 10.09 fps

test2_480p_YUV420P8
nnedi3: 101.96 fps
znedi3: 138.92 fps
nnedi3cl(r5): 87.02 fps
nnedi3cl(r6): 111.31 fps

test2_480p_YUV420P16
nnedi3: 67.49 fps
znedi3: 118.43 fps
nnedi3cl(r5): 80.80 fps
nnedi3cl(r6): 101.63 fps

test2_1080p_YUV420P8
nnedi3: 29.86 fps
znedi3: 29.47 fps
nnedi3cl(r5): 34.61 fps
nnedi3cl(r6): 41.37 fps

test2_1080p_YUV420P16
nnedi3: 17.28 fps
znedi3: 22.33 fps
nnedi3cl(r5): 28.78 fps
nnedi3cl(r6): 33.89 fps

trip_let
28th October 2017, 20:28
Testing with i7-6700k at 4.0 GHz and GTX 1060 3GB that boosts here to 1900 MHz, default nnedi3 settings with field=1.

1920x1080 YUV420P8 video source:
nnedi3: 182.42 fps
nnedi3cl: 354.41 fps
720x480 YUV420P8 video source:
nnedi3: 460.13 fps
nnedi3cl: 803.98 fps
Well, that's pretty cool and useful, but CPU-only is pretty fast too for me, and in some scripts I'm hammering the GPU already with something else.

Is output identical? I didn't look for that.

edit: wait a sec, pscrn default is 2, not using neural net, and the 1080p source was not interlaced. Maybe I'll revisit later.

lansing
29th October 2017, 01:25
A little suggestion on the comparison format: it would be better if you also stated the core/thread count of your CPU, because not everyone knows every model number by heart.

The energy efficiency comparison is kind of pointless, because who cares about a +/-10W difference? It's not a big deal. Stating CPU and GPU usage % would be more useful info.

monohouse
1st February 2018, 16:00
I get an error with r6:

C:\VapourSynth\core64>vspipe.exe --info W:\test.vpy -
Script evaluation failed:
Python exception: NNEDI3CL: Invalid Program

Traceback (most recent call last):
File "src\cython\vapoursynth.pyx", line 1847, in vapoursynth.vpy_evaluateScrip
t
File "W:\test.vpy", line 21, in <module>
clip = haf.daa(clip, opencl=True)
File "C:\Python36\lib\site-packages\havsfunc.py", line 78, in daa
nn = core.nnedi3cl.NNEDI3CL(c, field=3) if opencl else core.nnedi3.nnedi3(c,
field=3)
File "src\cython\vapoursynth.pyx", line 1739, in vapoursynth.Function.__call__

vapoursynth.Error: NNEDI3CL: Invalid Program

I get an error in the quaternion test of gpucaps. It looks like some kind of problem with floating-point data types, but the rest of the CL tests pass. I'm going to try a different driver.

Never mind, a new version of gpucaps fixed that problem, but this error continues.

HolyWu
1st February 2018, 17:32
Does your GPU support at least OpenCL 1.2? Anyway, I'll also release r7 tomorrow since there is a fix already committed months ago.

monohouse
1st February 2018, 18:11
http://manoa.flnet.org/Pics/CL1.png
http://manoa.flnet.org/Pics/CL2.png

It looks like it's all there, I hope so :)
I know it's not much of a video card, but it should do for now.
It's on steroids: the video memory is almost 2x faster.

HolyWu
2nd February 2018, 11:25
I get an error with r6

Before I make a formal release, see if http://www.mediafire.com/file/w4a6bcr2eltp963/NNEDI3CL-r7_test.7z works for you.

monohouse
2nd February 2018, 13:18
This time it says many things xD : http://manoa.flnet.org/logs/error.txt

HolyWu
2nd February 2018, 15:00
This time it says many things xD : http://manoa.flnet.org/logs/error.txt

It's on purpose so I can know where the compilation fails. Please try http://www.mediafire.com/file/9u353d3l3xdaaev/NNEDI3CL-r7_test2.7z again.

monohouse
2nd February 2018, 15:18
This time I have 2 errors. I tested once with my settings and once with no settings. These are the results with settings:
C:\VapourSynth\core64>vspipe.exe -p W:\test.vpy .
Script evaluation failed:
Python exception: NNEDI3CL: Invalid Image Size

Traceback (most recent call last):
File "src\cython\vapoursynth.pyx", line 1847, in vapoursynth.vpy_evaluateScrip
t
File "W:\test.vpy", line 21, in <module>
clip = haf.daa(clip, opencl=True)
File "C:\Python36\lib\site-packages\havsfunc.py", line 78, in daa
nn = core.nnedi3cl.NNEDI3CL(c, field=3, nsize=3, nns=4, qual=2, pscrn=1) if
opencl else core.znedi3.nnedi3(c, field=3, nsize=3, nns=4, qual=2, pscrn=0, int1
6_prescreener=False, int16_predictor=False, exp=2 )
File "src\cython\vapoursynth.pyx", line 1739, in vapoursynth.Function.__call__

vapoursynth.Error: NNEDI3CL: Invalid Image Size

and this is the result without: http://manoa.flnet.org/logs/error2.txt

is there no way to completely disable prescreen interpolation ?

HolyWu
2nd February 2018, 16:04
http://www.mediafire.com/file/np371gfx0vppzp2/NNEDI3CL-r7_test3.7z

No. Just use the old prescreener if you prefer accuracy over speed. I can't reproduce the "Invalid Image Size" error even when I specify the same parameters. What are the properties of your video clip?

monohouse
2nd February 2018, 16:16
Video
ID : 1
Format : MPEG Video
Format version : Version 2
Format profile : Main@Main
Format settings : CustomMatrix / BVOP
Format settings, BVOP : Yes
Format settings, Matrix : Custom
Format settings, GOP : M=3, N=15
Format settings, picture structure : Frame
Codec ID : V_MPEG2
Codec ID/Info : MPEG 1 or 2 Video
Duration : 43 min 52 s
Bit rate mode : Variable
Bit rate : 4 866 kb/s
Maximum bit rate : 8 456 kb/s
Width : 720 pixels
Height : 480 pixels
Display aspect ratio : 16:9
Frame rate mode : Constant
Frame rate : 29.970 (30000/1001) FPS
Standard : NTSC
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 8 bits
Scan type : Interlaced
Scan order : Top Field First
Compression mode : Lossy
Bits/(Pixel*Frame) : 0.470
Time code of first frame : 00:59:59:00
Time code source : Group of pictures header
GOP, Open/Closed : Open
GOP, Open/Closed of first frame : Closed
Stream size : 1.49 GiB (90%)
Default : Yes
Forced : No

monohouse
2nd February 2018, 16:42
OK, with the default settings it gives no errors, but I think the output is wrong:
http://manoa.flnet.org/logs/original.png
http://manoa.flnet.org/logs/post.png
Or maybe that's because the quality of the default settings is low?
The image size error happens when nsize=3 + nns>1, or nsize=2 + nns>2, or nsize=6 + nns>2.

OK, with higher settings (nsize=2, nns=2, qual=2, pscrn=1) the big artifacts are removed, but some artifacts are still present, and in addition there is a lot of blur.
http://manoa.flnet.org/logs/high.png

HolyWu
3rd February 2018, 11:18
Sigh. I tested with your original.png using both the default and the higher settings, and still couldn't reproduce any artifacts like yours. Can you try setting the core's threads to 1 and see if the problem still exists? Also post a screenshot of NNEDI3CL(info=True).

monohouse
3rd February 2018, 11:31
core's threads to 1 ?
http://manoa.flnet.org/logs/info.png
Wait, I think I have a problem with the hardware: the ATI Tool artifact scanner shows errors :x Sorry, I will fix it first and then try again.
That's strange... it gave an error after 1 minute, but now I've run it for more than 5 minutes and there are no errors :x
I'm now running VMT to see if there is a problem; it passed 15 tests with 0 errors :x
To be sure, I slowed the card down to below its original speed and ran the encoder again; there were still artifacts :x
I am sure the card is not the problem.

HolyWu
3rd February 2018, 17:41
core.num_threads = 1 sets the core's thread to 1.

Your card's 1D Image Max Buffer Size is 65536, which meets the OpenCL spec's minimum value but is too small for some combinations of nsize and nns. That's why you got the Invalid Image Size error. BTW, that value on my old GTX 660 is 134217728 (2^27).
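
The failing combinations reported earlier line up with a rough size estimate. The sketch below is an assumption-laden back-of-the-envelope check, not the plugin's code: it assumes the weight count per set follows the layout of the original nnedi3 (nns * 2 * (xdia * ydia + 1)), that two such sets are stored, and that each weight occupies one element of the 1D image buffer. The nsize and nns tables are the window sizes and neuron counts documented for nnedi3.

```python
# nsize -> (xdia, ydia) and nns -> neuron count, as documented for nnedi3
XDIA = (8, 16, 32, 48, 8, 16, 32)
YDIA = (6, 6, 6, 6, 4, 4, 4)
NNS = (16, 32, 64, 128, 256)


def estimated_elements(nsize, nns, sets=2):
    """Assumed number of buffer elements needed for the predictor weights."""
    per_set = NNS[nns] * 2 * (XDIA[nsize] * YDIA[nsize] + 1)
    return per_set * sets


def fits(nsize, nns, max_buffer_size):
    """True if the assumed weight count fits in the device's 1D image buffer."""
    return estimated_elements(nsize, nns) <= max_buffer_size


LIMIT = 65536  # the card's 1D Image Max Buffer Size mentioned above
for nsize, nns in [(3, 1), (3, 2), (2, 2), (2, 3), (6, 2), (6, 3)]:
    print(nsize, nns, estimated_elements(nsize, nns), fits(nsize, nns, LIMIT))
```

Under these assumptions the estimate crosses 65536 exactly at the reported combinations (nsize=3 with nns>1, nsize=2 or nsize=6 with nns>2), while every combination stays far below 2^27.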

The cause of the artifacts you got is still uncertain. Try http://www.mediafire.com/file/dns5dtg7jssydqp/NNEDI3CL-r7_test4.7z one more time. If the issue still exists then I can't help you. Just use znedi3 instead since it has good AVX/AVX2 optimizations.

monohouse
4th February 2018, 20:08
Yes, the output has artifacts with and without threads = 1.
OK, thanks :)

HolyWu
5th February 2018, 17:16
Update r7.


Improve OpenCL code's compatibility.
Add proper error message when buffer size exceeds device's limit.

monohouse
6th February 2018, 14:32
I have a Fermi 2 card, it's NVIDIA. Do you recommend an NVIDIA card over AMD? I thought NVIDIA's OpenCL sucks.

HolyWu
6th February 2018, 16:36
Do you recommend an NVIDIA card over AMD?

I never said that. How did you draw this conclusion? Also, the Fermi architecture is quite outdated, and some of its models only support OpenCL 1.1.

monohouse
7th February 2018, 16:53
I was thinking that because you mentioned you have an NVIDIA card, so I believed you develop the plugin mostly on that card, and so the plugin might be more compatible with NVIDIA cards :x
It's such a shame Fermi is old :x It's such a fast card, an EVGA Classified, and only 1.1 :x I should probably buy a new AMD card then; it will meet all the requirements.
About znedi: it's not bad, but very, very slow because my processor doesn't have AVX; it's an AMD Phenom II :x
But I found eedi3; it's pretty good, with both clean output and speed (1.4 fps on this crap AMD video card). It uses 50% of the card, so I can run 2 at the same time to use the card's maximum speed.
Yours was still faster (2 fps with 75% of the card used).
It's no problem, because 1.5 fps still gives me 30 times more speed compared to nnedi3 on the CPU using AVS+, even if used with AVX :)
Who would believe such a little video card could be so much faster than a Haswell CPU.

You know what I think? If NVIDIA wanted to do OpenCL well, they could: they are stronger than AMD, with a bigger software department, more money, and larger research. But they don't, because they don't want OpenCL; they want CUDA. They want everyone to move to CUDA so that everyone depends on CUDA and, in the end, on NVIDIA, because CUDA only works on NVIDIA.

cwk
20th April 2018, 00:37
What version of boost is required for make? Ubuntu 16.04 provides 1.58:


$ make
CXX NNEDI3CL/NNEDI3CL.lo
NNEDI3CL/NNEDI3CL.cpp:43:34: fatal error: boost/compute/core.hpp: No such file or directory
compilation terminated.
Makefile:476: recipe for target 'NNEDI3CL/NNEDI3CL.lo' failed
make: *** [NNEDI3CL/NNEDI3CL.lo] Error 1

$ apt list *boost*
Listing... Done
...
libboost1.58-all-dev/xenial-updates 1.58.0+dfsg-5ubuntu3.1 amd64
libboost1.58-dbg/xenial-updates 1.58.0+dfsg-5ubuntu3.1 amd64
libboost1.58-dev/xenial-updates 1.58.0+dfsg-5ubuntu3.1 amd64
libboost1.58-doc/xenial-updates,xenial-updates 1.58.0+dfsg-5ubuntu3.1 all
libboost1.58-tools-dev/xenial-updates 1.58.0+dfsg-5ubuntu3.1 amd64