Old 15th January 2014, 16:27   #1  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 1,640
Port of NNEDI under new v2.6 AVS API

As suggested, I'm opening a dedicated thread for this.

All credit goes of course to Tritical, the original creator of the plugin.
My main reason for doing this was to have an x64 version of the plugin.
The work mainly consisted of moving the inline ASM into external files, making the C code x64 compliant, and porting the ASM functions to x64.
In doing so, the API has also been updated to v2.6, which allows other updates to the plugin, such as support for more input formats.

This port is based on NNEDI3 v0.9.4, so I've decided to use 0.9.4.x for version numbers.

Current version : 0.9.4.51

Sources are here.
Binaries are here.

Change list:
27/05/2018 v0.9.4.51
* Fix bug in asm PlanarFrame YUY2to422.

05/04/2018 v0.9.4.50
+ Optimized CPU placement if SetAffinity=true for prefetch>1 and prefetch<=number of physical cores.
* SetAffinity back to default false.

08/03/2018 v0.9.4.49
* Fix AVX2 path code.
* Fix some potential issue with range modes.
* Change some default value setting.

23/11/2017 v0.9.4.48
* Put back process whole plane by whole plane.
* Minor change in threadpool interface.

23/08/2017 v0.9.4.47
+ Fix possible deadlock on threadpool destructor.

10/08/2017 v0.9.4.46
+ Forgot to add AVX path code on planarframe.

09/08/2017 v0.9.4.45
+ Fix Threadpool.
+ Add AVX path code.
* Revert to original MT multi-planar mode, may improve MT efficiency.

14/06/2017 v0.9.4.44
* Minor fix.

02/06/2017 v0.9.4.43
* Few changes in the threadpool and small fix.

19/05/2017 v0.9.4.42
* Minor change in the threadpool.

10/05/2017 v0.9.4.41
* Fix crash in PlanarFrame for YUYV.

11/01/2017 v0.9.4.40
* Fix bug in x64 AVX2 asm code.

28/03/2017 v0.9.4.39
* Some small optimizations on PlanarFrame asm for YUYV.

20/03/2017 v0.9.4.38
* Some cleanup and small modifications on PlanarFrame.
* Update AVS+ header.

05/03/2017 v0.9.4.37
* Remove the use of asmlib.
* Some little bug fixes.
+ Add an opt intermediate value (4 for AVX).
+ Use of YMM registers in case of AVX2 (or more) CPU, and some little others cleanup/speedup.

24/01/2017 v0.9.4.36
* Set range mode default to 1. Apply range only on last step.
+ Add range mode 4.

20/01/2017 v0.9.4.35
+ Fix crash on x64 version introduced in v0.9.4.34.
+ Fix prescreener issue on flat area with value of 255.

17/01/2017 v0.9.4.34
+ Add range parameter.

07/01/2017 v0.9.4.33

+ Add support for 9..16 bits and float data formats (thanks to vapoursynth port).
+ Add FMA3 and FMA4 functions on some parts (thanks to vapoursynth port).
+ Add sleep and prefetch parameters.
+ Fix bug in YUY2 x64 ASM code.

05/12/2016 v0.9.4.32

+ Update to new avisynth header and add support for RGB32, RGBPlanar and alpha channel on avs+.
+ Add A parameter (for alpha channel) on nnedi3.
* Update asmlib to 2.50
* Use /MD (dynamic link) instead of /MT (static link) for building.

14/10/2016 v0.9.4.31

* Use Mutex instead of CriticalSection on some places and some changes in the threadpool interface.

12/10/2016 v0.9.4.30

* Remove CACHE_DONT_CACHE_ME and small changes in the threadpool interface.

11/10/2016 v0.9.4.29

+ Fix deadlock case in Threadpool interface.

06/10/2016 v0.9.4.28

+ Attempt to fix deadlock with MT of avisynth.

02/09/2016 v0.9.4.27

* Minor fixes and don't use the threadpool if number of threads=1.
+ Add several parameters to control and tune the creation of the threadpool.

30/08/2016 v0.9.4.26

* Update to my threadpool interface.
+ Add a thread parameter for the resampler if use of the MT resamplers.

12/08/2016 v0.9.4.25

* Use Spline36ResizeMT if available.

21/07/2016 v0.9.4.24

* Don't use SetMTMode for now to set MT mode
* Update to new avisynth header.

15/07/2016 v0.9.4.23

* Update to new avisynth header.

30/05/2016 v0.9.4.22

+ Fix for MT version of avisynth.

17/04/2016 v0.9.4.21

* Update to asmlib 3.26.
+ Fix XP build with VS2015.

05/09/2015 v0.9.4.20

* Minor changes, should handle negative pitch properly.

26/08/2015 v0.9.4.19

+ Implement use of asmlib.

25/08/2015 v0.9.4.18

* Fix 4:1:1 chroma shift.
* Modification of the memory transfer functions.

11/08/2015 v0.9.4.17

* Change the order between turnl/r and nnedi3 calls in nnedi3_rpow2 to optimize speed.
* Remove memcpy_amd and use memcpy instead.

10/08/2015 v0.9.4.16

+ Add csresize parameter; chroma shift adjustment according to the resize is enabled by default.
* Fix regression on center adjustment.

09/08/2015 v0.9.4.15

* Change default value of mpeg2 to false, and keep the exact previous behavior in that case (but this doesn't bring back the chroma shift issue ^_^).

09/08/2015 v0.9.4.14

+ Add resize adjustment chroma shift in case of MPEG-2 subsampling.
* Faster RGB24 mode always.
+ Add mpeg2 parameter.

08/08/2015 v0.9.4.13

* Correction of chroma shift, once and for all this time.
* Faster RGB24 mode if FTurn is usable.
* Fix YV411 support.

06/08/2015 v0.9.4.12

* More checks on use of FTurn.
* Fix regression on YUY2 introduced in previous release.

31/07/2015 v0.9.4.11

+ Correction of chroma shift value for 4:2:x color modes.
+ Add YV411 support.

25/05/2015 v0.9.4.10

* Integration of commits coming from Vapoursynth version, thanks to Myrsloik.

10/05/2015 v0.9.4.9

* Bug correction in x64 ASM file, thanks to jackoneill and HolyWu.

13/03/2015 v0.9.4.8

* Update to last AVS+ header files.

26/01/2014 v0.9.4.7

* Little correction in YV24 and Y8 support for nnedi3_rpow2.

17/01/2014 v0.9.4.6

* Little YV16 optimization.

16/01/2014 v0.9.4.5

* Updated YV16 support for nnedi3_rpow2, now fast and direct, not tweaked by going to YUY2.

15/01/2014 v0.9.4.4

* Some few little optimizations.
* Trick YV16 support in nnedi3_rpow2 by working internally in YUY2 mode to speed-up.

14/01/2014 v0.9.4.3

+ Add fturn support.

13/01/2014 v0.9.4.2

+ Add Y8, YV16 and YV24 support.

03/01/2014 v0.9.4.1

* Move out all inline ASM code in external files, update code to build x64 version.
* Update interface to new avisynth 2.6 API.
- Avisynth 2.5.x not supported anymore.

Original version 0.9.4

==================================================================

Multi-threading information

CPU example case : 4 cores with hyper-threading.

If you leave all the multi-threading parameters at their default values, they are set to be "optimal" when you're not using prefetch or when you are under standard avisynth: all the logical CPUs will be used.
If you set SetAffinity to true, the threads will be allocated on the CPUs contiguously. Physical CPU 1 will have threads (0,1), ... physical CPU 4 will have threads (6,7), allowing optimal cache use. Run tests to see what's best for you.

Now, if you are using prefetch in your script, things are different!
If you're using it with the max number of CPUs (8 in our example case), you can still run tests, but I would strongly advise disabling the internal multi-threading by using threads=1. In this case, no threadpool is created, and all the other multi-threading related filter parameters have no effect, even prefetch.
If you're using prefetch in your script with a value lower than your number of CPUs, you may want to try mixing the external and internal multi-threading: set the internal multi-threading to a lower number of threads, and set the prefetch parameter of the filter. This parameter sets the number of internal threadpools created; it is best to match the prefetch value of the script. If you don't set it (leave it at 1) or set a value lower than the prefetch of your script, several instances (or GetFrame calls) will be created, but they will not run efficiently, because each instance (or GetFrame) will spend time waiting for a threadpool to become available when not enough were created.
Unfortunately, as things are now, I have no way of knowing the prefetch value used in the avisynth script at the time I need the information; this is why you have to use the prefetch parameter of the filter.
In our CPU example case, you can have things like:
Code:
filter(...,threads=1)
prefetch(8)
or
Code:
filter(...,threads=2,prefetch=4)
prefetch(4)
or
Code:
filter(...,threads=4,prefetch=2)
prefetch(2)
or even
Code:
filter(...,threads=3,prefetch=4)
prefetch(4)
if you want to boost and go a little over your total CPU number.
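
The arithmetic behind these combinations can be sketched as below. This is only an illustration, assuming the example CPU (4 physical cores with hyper-threading, so 8 logical CPUs); the function names are made up for the sketch and are not part of the plugin.

```cpp
#include <cassert>

// Hypothetical helper (not part of nnedi3): total number of worker threads
// when the script uses prefetch(n) and the filter is called with
// threads=t and prefetch=n, i.e. n internal threadpools of t threads each.
// With threads=1 no threadpool is created, so each of the n instances
// simply counts as one thread.
int totalWorkerThreads(int threadsPerPool, int pools)
{
    assert(threadsPerPool >= 1 && pools >= 1);
    return threadsPerPool * pools;
}

// True when the combination asks for more threads than there are logical CPUs.
bool oversubscribes(int threadsPerPool, int pools, int logicalCpus)
{
    return totalWorkerThreads(threadsPerPool, pools) > logicalCpus;
}
```

With 8 logical CPUs, threads=2 with prefetch=4 and threads=4 with prefetch=2 both give exactly 8 threads, while threads=3 with prefetch=4 gives 12, which is the "a little over" case.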

Also, if your prefetch is not higher than your number of physical cores, you can try setting SetAffinity to true, but in that case you have to set MaxPhysCore to false. The threads of each pool will be placed on the CPUs in steps.
For example, in our case:
Code:
filter(...,threads=2,prefetch=4,SetAffinity=true,MaxPhysCore=false)
prefetch(4)
This will create 4 pools of 2 threads, placed as follows:
pool[0] : threads(0 -> 1) on CPU 1.
pool[1] : threads(0 -> 1) on CPU 2.
pool[2] : threads(0 -> 1) on CPU 3.
pool[3] : threads(0 -> 1) on CPU 4.
Code:
filter(...,threads=4,prefetch=2,SetAffinity=true,MaxPhysCore=false)
prefetch(2)
This will create 2 pools of 4 threads, placed as follows:
pool[0] : threads(0 -> 1) on CPU 1.
pool[0] : threads(2 -> 3) on CPU 2.
pool[1] : threads(0 -> 1) on CPU 3.
pool[1] : threads(2 -> 3) on CPU 4.
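
The two placements listed above can be reproduced with a simple contiguous numbering. The sketch below is an assumption about the pattern, not the plugin's actual code: it only covers the case where threads * prefetch equals the number of logical CPUs, with 2 logical CPUs per physical core, and `physicalCpuFor` is a hypothetical name.

```cpp
#include <cassert>

// Illustrative only: returns the 1-based physical CPU for a given pool and
// thread index, assuming the threads of all pools are numbered contiguously
// and each physical core holds logicalPerCore consecutive threads.
// The real placement logic of the plugin may differ in other configurations.
int physicalCpuFor(int pool, int thread, int threadsPerPool, int logicalPerCore = 2)
{
    assert(pool >= 0 && thread >= 0 && thread < threadsPerPool);
    assert(logicalPerCore >= 1);
    const int logicalIndex = pool * threadsPerPool + thread; // contiguous numbering
    return logicalIndex / logicalPerCore + 1;
}
```

For threads=2, prefetch=4 this puts both threads of pool[3] on CPU 4; for threads=4, prefetch=2 it puts threads 2 and 3 of pool[0] on CPU 2 and threads 0 and 1 of pool[1] on CPU 3, matching the lists above.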

Last edited by jpsdr; 1st June 2018 at 10:01.
Old 15th January 2014, 18:17   #2  |  Link
jpsdr
Despite all my digging in the code, and some tests, YV16 in nnedi3_rpow2 is slow... even YV24 is way faster! And YUY2 is also fast. No idea why, so for now I'll work around it in the YV16 support of nnedi3_rpow2 by converting internally to YUY2 (and back to YV16 on output). It's actually way faster than working directly with YV16.
Old 15th January 2014, 20:38   #3  |  Link
jpsdr
Ok, I've run some tests.
The tests were made using a YV12 640x480 avi file.

The script used is the following:
Code:
SetMemoryMax(64)
AVISource("Test.avi",False,"YV12")
SetPlanarLegacyAlignment(True)
#ConvertToYUY2()
#ConvertToYV16()
#ConvertToYV24()
#ConvertToY8()
nnedi3_rpow2(rfactor=2,cshift="Spline36Resize",fwidth=960,fheight=720,nsize=0,nns=3,qual=2)
It allows me to quickly change the output format for testing.

Results with v0.9.4.3 :
Y8,YV12,YUY2,YV24 : fast.
YV16 : Very slow !

So, I've modified my code to test several cases:
Converting internally YV16->YUY2(process)->YV16 either in nnedi3 or in nnedi3_rpow2.

If YV16->YUY2(process)->YV16 is done in nnedi3, and nnedi3_rpow2 processes YV16 unchanged, the result is slow. First hint: the slowness does not come from the nnedi3 code.

If YV16->YUY2(process)->YV16 is done in nnedi3_rpow2, and nnedi3 processes YV16 unchanged, the result is fast.

The issue is within the YV16 processing code of nnedi3_rpow2. The code is:
Code:
// Each pass of the loop: nnedi3, turn right, nnedi3, turn left.
for (int i=0; i<ct; ++i)
{
	v = new nnedi3(v.AsClip(),i==0?1:0,true,true,true,true,nsize,nns,qual,etype,pscrn,threads,opt,fapprox,env);
	v = env->Invoke(turnRightFunction,v).AsClip();
	// always use field=1 to keep chroma/luma horizontal alignment
	v = new nnedi3(v.AsClip(),1,true,true,true,true,nsize,nns,qual,etype,pscrn,threads,opt,fapprox,env);
	v = env->Invoke(turnLeftFunction,v).AsClip();
}
I first removed both turns: the result was fast.
If I put back either one of the turns, the result is still fast, but it seems a little slower, though not really obviously. If I put back both, the result is very slow.

So here I'm at a point where I don't know what to do to solve this issue, I'm totally stuck!
I need help from someone who masters avisynth a lot better than I do to solve this case in a better way than what I've done for now (doing YV16->YUY2(process)->YV16).
Old 15th January 2014, 21:20   #4  |  Link
TurboPascal7
Registered User
 
Join Date: Jan 2010
Posts: 270
Turning YV16 internally requires full chroma resampling, i.e. a call to the internal resizer (most likely two, but I don't remember how exactly it's implemented). Of course it's slower than not calling the resizer.
You can try processing the planes separately, either by converting to Y8 or by using that internal PlanarFrame thingy. YUY2 is fast because it's doing exactly that, if my memory doesn't lie.
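
Why only YV16 needs the resampling can be seen from the plane sizes alone. The following is a minimal sketch with made-up types (nothing here is AviSynth's API): it compares the chroma plane you get by rotating it with the chroma plane the turned frame needs in the same format.

```cpp
#include <cassert>

struct PlaneSize { int width; int height; };

// Chroma subsampling factors: YV12 = {2,2}, YV16 = {2,1}, YV24 = {1,1}.
struct Subsampling { int horizontal; int vertical; };

// Size of one chroma plane for a frame with the given luma size.
PlaneSize chromaSize(PlaneSize luma, Subsampling s)
{
    assert(luma.width % s.horizontal == 0 && luma.height % s.vertical == 0);
    return { luma.width / s.horizontal, luma.height / s.vertical };
}

// A 90-degree turn swaps the width and height of a plane.
PlaneSize turned(PlaneSize p) { return { p.height, p.width }; }

// True when a plain rotation of the chroma plane does not have the size that
// the turned frame requires in the same format, so chroma must be resampled.
bool turnNeedsChromaResample(PlaneSize luma, Subsampling s)
{
    const PlaneSize rotatedChroma = turned(chromaSize(luma, s));
    const PlaneSize neededChroma  = chromaSize(turned(luma), s);
    return rotatedChroma.width != neededChroma.width
        || rotatedChroma.height != neededChroma.height;
}
```

For a 640x480 YV16 frame the chroma plane is 320x480; rotated it is 480x320, but a 480x640 YV16 frame needs 240x640 chroma. For YV12 and YV24 the two sizes agree, so a plain rotation is enough.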
Old 15th January 2014, 22:47   #5  |  Link
PetitDragon
Registered User
 
Join Date: Sep 2006
Posts: 80
Formidable!
Old 15th January 2014, 23:17   #6  |  Link
jpsdr
Quote:
Originally Posted by TurboPascal7 View Post
Turning YV16 internally requires full chroma resampling
... Ok. YV12 and YV24 are fast because their chroma subsampling is the same horizontally and vertically ("square"), unlike YV16, which requires resampling...
I hadn't seen that indeed. I was wondering why YV24, with more data than YV16, was a lot faster!
I see, thanks. I'll have to do something similar to what is done with YUY2: extract each plane... I'll try that.
Old 16th January 2014, 00:03   #7  |  Link
Gavino
Avisynth language lover
 
Join Date: Dec 2007
Location: Spain
Posts: 3,380
As well as the performance hit, there is also a quality loss if staying in YV16 as TurnRight and TurnLeft are not lossless for YV16 chroma. So extracting chroma to Y8 is better for both speed and quality.
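
To illustrate why a single extracted plane can be turned without loss: on one plane, a turn is only a rearrangement of pixels, so turning right and then left gives back the original exactly. The sketch below uses a made-up `Plane` type and is not AviSynth's implementation of TurnRight/TurnLeft.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A single 8-bit plane (like a Y8 clip), stored row by row without padding.
struct Plane {
    int width;
    int height;
    std::vector<std::uint8_t> pixels; // width * height values
};

// Rotate 90 degrees clockwise: pixels are only moved, never resampled.
Plane turnRight(const Plane& src)
{
    assert(src.pixels.size() == static_cast<std::size_t>(src.width) * src.height);
    Plane dst{ src.height, src.width, std::vector<std::uint8_t>(src.pixels.size()) };
    for (int y = 0; y < src.height; ++y)
        for (int x = 0; x < src.width; ++x)
            dst.pixels[x * dst.width + (src.height - 1 - y)] = src.pixels[y * src.width + x];
    return dst;
}

// Rotate 90 degrees counter-clockwise: the exact inverse of turnRight.
Plane turnLeft(const Plane& src)
{
    assert(src.pixels.size() == static_cast<std::size_t>(src.width) * src.height);
    Plane dst{ src.height, src.width, std::vector<std::uint8_t>(src.pixels.size()) };
    for (int y = 0; y < src.height; ++y)
        for (int x = 0; x < src.width; ++x)
            dst.pixels[(src.width - 1 - x) * dst.width + y] = src.pixels[y * src.width + x];
    return dst;
}
```

Applying `turnLeft(turnRight(p))` to any plane returns the same pixel values, which is not the case when the chroma has to be resampled at each turn.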
Old 16th January 2014, 00:47   #8  |  Link
jpsdr
Updated the YV16 support, now better, and created a stable and final (I hope) release.
Old 16th January 2014, 18:23   #9  |  Link
jpsdr
I have a little question.
For now, I'm doing this:
Code:
AVSValue vy = env->Invoke("ConvertToY8",v).AsClip();
AVSValue vu = env->Invoke("UtoY",v).AsClip();
vu = env->Invoke("ConvertToY8",vu).AsClip();
AVSValue vv = env->Invoke("VtoY",v).AsClip();
vv = env->Invoke("ConvertToY8",vv).AsClip();
If I use the code below instead, will it be faster?
Code:
AVSValue vu = env->Invoke("UtoY",v).AsClip();
vu = env->Invoke("ConvertToY8",vu).AsClip();
AVSValue vv = env->Invoke("VtoY",v).AsClip();
vv = env->Invoke("ConvertToY8",vv).AsClip();
v = env->Invoke("ConvertToY8",v).AsClip();
In that last case, vy doesn't exist anymore, and I'll use v instead.
Asked differently: is there a better way to convert v directly to Y8?
Because converting YV16 data to Y8 is instantaneous. If possible, I want to avoid an unnecessary memory transfer of the Y plane. And I have the feeling (maybe totally wrong) that doing something like "v = v(converted)" will create a transfer. If there was something like "v.convert", I'd have no worries...

Last edited by jpsdr; 16th January 2014 at 18:29.
Old 16th January 2014, 19:18   #10  |  Link
DarkSpace
Registered User
 
Join Date: Oct 2011
Posts: 204
Why not use UToY8() / VToY8() directly instead, if your filter can handle Y8 and you require AVS 2.6 anyway?
The wiki page also mentions that UToY8() is faster than UToY().ConvertToY8().
Old 16th January 2014, 20:41   #11  |  Link
jpsdr
Because I didn't know about them. The HTML doc installed with avisynth is apparently not up to date. I'll change that in the next release, thanks for the information. The question of converting in place (like "v.convert" instead of "v = convert(v)") is still open.
Old 16th January 2014, 21:28   #12  |  Link
DarkSpace
Well, I don't know much about this, but you have
Code:
AVSValue vy = env->Invoke("ConvertToY8",v).AsClip();
AVSValue vu = env->Invoke("UtoY8",v).AsClip();
AVSValue vv = env->Invoke("VtoY8",v).AsClip();
versus
Code:
AVSValue vu = env->Invoke("UtoY8",v).AsClip();
AVSValue vv = env->Invoke("VtoY8",v).AsClip();
v = env->Invoke("ConvertToY8",v).AsClip();
here. Now as I see it (I'm not an AviSynth developer at all and I may be totally wrong, I'm just using logic without knowledge), you have the choice between
1) certainly copying the y plane of the clip in memory, which leads to more memory consumption (because clip v is still assigned) and
2) perhaps copying the y plane of the clip in memory, without increased memory consumption (because clip v is reassigned).
(By the way, if it was me, I'd call the input clip "y" in the second example, so I can call the other two clips "u" and "v".)

I do hope that someone who knows more about this subject will be able to give you a better answer.
Old 16th January 2014, 21:38   #13  |  Link
jpsdr
Actually, the second is at worst identical to the first, or with luck better, but I don't think it can be worse. It's exactly what I intended to do, unless I find something like "v.convert" to replace "v = ...", or a specific syntax for the case where input and output are the same, if that's possible.

Last edited by jpsdr; 16th January 2014 at 21:40.
Old 16th January 2014, 23:45   #14  |  Link
DarkSpace
Quote:
Originally Posted by jpsdr View Post
Actually, the second is at worst identical to the first, or with luck better, but I don't think it can be worse.
Nice, then my deduction was actually correct!
As for a Convert(v, "Y8") function, I have no idea if something like this exists. Maybe someone qualified can answer that...
Old 17th January 2014, 00:47   #15  |  Link
Gavino
ConvertToY8(), for planar input, does not copy the Y plane contents - it simply creates a new frame sharing the original frame buffer (similar to what Crop() does).

Whether you assign the result to a new variable or the same one makes no difference.
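
The idea of a new frame sharing the original buffer can be pictured with the sketch below. The types are purely illustrative and are NOT AviSynth's real classes; the assumption is only that a frame is a small descriptor pointing into a reference-counted buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Illustrative types only, not AviSynth's API.
// One allocation holding the Y, U and V planes back to back.
using FrameBuffer = std::vector<std::uint8_t>;

struct PlaneView {
    std::shared_ptr<FrameBuffer> buffer; // shared with the source frame, never copied
    std::size_t offset;                  // where the plane starts in the buffer
    std::size_t size;                    // number of bytes in the plane

    std::uint8_t* data() const { return buffer->data() + offset; }
};

// "Converting" to a luma-only view: the buffer is reused and only the
// bookkeeping (offset and size) changes, so no pixel is copied.
PlaneView lumaOnlyView(const std::shared_ptr<FrameBuffer>& frame, std::size_t lumaSize)
{
    assert(frame && lumaSize <= frame->size());
    return PlaneView{ frame, 0, lumaSize };
}
```

In this model, storing the returned view in a new variable or overwriting the old one only changes which handle refers to the buffer; the pixel data stays where it is either way.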
Old 17th January 2014, 09:33   #16  |  Link
jpsdr
Ok, thanks for this information.
Old 17th January 2014, 18:05   #17  |  Link
jpsdr
A new version with a small update for the YV16 format. Finally, a good chance of this being the "final" release.
Old 19th January 2014, 07:37   #18  |  Link
hydra3333
Registered User
 
Join Date: Oct 2009
Location: crow-land
Posts: 517
a couple of end-user questions.
- is it compatible with TSP's avisynth MT 2.5.7 32 bit ?
- if not, is there any chance of a version which is compatible ?
Old 19th January 2014, 10:27   #19  |  Link
jpsdr
@hydra3333
No, the use of the new avs v2.6 API (which allows support for the new color formats) makes it incompatible, and there is no chance of a compatible version.
But if it's for use with v2.5.7, just use the original v0.9.4; there is no point in using this version. The algorithm itself is unchanged, so there is no improvement there and the result is exactly the same. There is also no speed improvement for the originally supported color formats.
Old 25th January 2014, 16:47   #20  |  Link
sqrt(9801)
Registered User
 
Join Date: Jun 2013
Posts: 24
Hi.

It seems that there's some kind of horizontal shift (if that's the correct term) when using nnedi3_rpow2 with rfactor>2 on Y8 or YV24 clips.
Not sure whether it has anything to do with the chroma shift issue present in Tritical's latest version.

Tested with the VS2010 build on AviSynth 2.6a5 (Groucho2004's ICL10 build), Win7 x86.