Old 15th January 2014, 16:27   #1  |  Link
jpsdr
Port of NNEDI under new v2.6 AVS API

As suggested, I'm opening a dedicated thread for this.

All credit goes, of course, to Tritical, the original creator of the plugin.
My main purpose in doing this was to have an x64 version of the plugin.
The work consisted mainly of moving the inline ASM out into external files, creating x64-compliant C code, and porting the ASM functions to x64.
While doing this, the API was also updated to v2.6, which allows other updates to the plugin, such as support for more input formats.

This port is based on NNEDI3 v0.9.4, so I've decided to use 0.9.4.x for version numbers.

Current version : 0.9.4.63

Sources are here.
Binaries are here.

Change list:
19/11/2023 v0.9.4.63
+ Update to new AviSynth+ header.

23/02/2023 v0.9.4.62
+ Update to new AviSynth+ header.

20/11/2022 v0.9.4.61
+ Update on threadpool, no user limit (except memory).

06/02/2022 v0.9.4.60
* Fix in threadpool when using prefetch. Add negative prefetch for tuning; read Multithreading.txt or the Multithreading chapter here.

30/06/2021 v0.9.4.59
* Fix in threadpool if you have too many cores.

30/04/2021 v0.9.4.58
+ Add new resampler in nnedi3_rpow2 internal resizers.
+ Update to the new avisynth headers.
* Minor code change for threadpool update.

02/08/2020 v0.9.4.57
+ Add new resamplers in nnedi3_rpow2 internal resizers.

17/07/2020 v0.9.4.56
* Fix issue in 4:2:0 when dh=false and height not mod 4.

05/05/2020 v0.9.4.55
* Fix issue introduced in previous version.

27/04/2020 v0.9.4.54
+ Update to the new avisynth headers.
* Minor code change for threadpool update.
* Some cleanup.

31/05/2019 v0.9.4.53
* Minor code change for threadpool update.

30/05/2019 v0.9.4.52
+ Update in the threadpool, add ThreadLevel parameter.

27/05/2018 v0.9.4.51
* Fix bug in asm PlanarFrame YUY2to422.

05/04/2018 v0.9.4.50
+ Optimized CPU placement if SetAffinity=true for prefetch>1 and prefetch<=number of physical cores.
* SetAffinity back to default false.

08/03/2018 v0.9.4.49
* Fix AVX2 path code.
* Fix some potential issue with range modes.
* Change some default value setting.

23/11/2017 v0.9.4.48
* Put back process whole plane by whole plane.
* Minor change in threadpool interface.

23/08/2017 v0.9.4.47
+ Fix possible deadlock on threadpool destructor.

10/08/2017 v0.9.4.46
+ Forgot to add AVX path code on planarframe.

09/08/2017 v0.9.4.45
+ Fix Threadpool.
+ Add AVX path code.
* Revert to original MT multi-planar mode, may improve MT efficiency.

14/06/2017 v0.9.4.44
* Minor fix.

02/06/2017 v0.9.4.43
* Few changes in the threadpool and small fix.

19/05/2017 v0.9.4.42
* Minor change in the threadpool.

10/05/2017 v0.9.4.41
* Fix crash in PlanarFrame for YUYV.

11/01/2017 v0.9.4.40
* Fix bug in x64 AVX2 asm code.

28/03/2017 v0.9.4.39
* Some small optimizations on PlanarFrame asm for YUYV.

20/03/2017 v0.9.4.38
* Some cleanup and small modifications on PlanarFrame.
* Update AVS+ header.

05/03/2017 v0.9.4.37
* Remove the use of asmlib.
* Some small bug fixes.
+ Add an intermediate opt value (4 for AVX).
+ Use of YMM registers on AVX2 (or later) CPUs, and some other small cleanups/speedups.

24/01/2017 v0.9.4.36
* Set range mode default to 1. Apply range only on last step.
+ Add range mode 4.

20/01/2017 v0.9.4.35
+ Fix crash on x64 version introduced in v0.9.4.34.
+ Fix prescreener issue on flat area with value of 255.

17/01/2017 v0.9.4.34
+ Add range parameter.

07/01/2017 v0.9.4.33

+ Add support for 9..16 bits and float data formats (thanks to vapoursynth port).
+ Add FMA3 and FMA4 functions on some parts (thanks to vapoursynth port).
+ Add sleep and prefetch parameters.
+ Fix bug in YUY2 x64 ASM code.

05/12/2016 v0.9.4.32

+ Update to new avisynth header and add support for RGB32, RGBPlanar and alpha channel on avs+.
+ Add A parameter (for alpha channel) on nnedi3.
* Update asmlib to 2.50
* Use /MD (dynamic link) instead of /MT (static link) for building.

14/10/2016 v0.9.4.31

* Use Mutex instead of CriticalSection on some places and some changes in the threadpool interface.

12/10/2016 v0.9.4.30

* Remove CACHE_DONT_CACHE_ME and small changes in the threadpool interface.

11/10/2016 v0.9.4.29

+ Fix deadlock case in Threadpool interface.

06/10/2016 v0.9.4.28

+ Attempt to fix deadlock with MT of avisynth.

02/09/2016 v0.9.4.27

* Minor fixes and don't use the threadpool if number of threads=1.
+ Add several parameters to control and tune the creation of the threadpool.

30/08/2016 v0.9.4.26

* Update to my threadpool interface.
+ Add a thread parameter for the resampler if use of the MT resamplers.

12/08/2016 v0.9.4.25

* Use Spline36ResizeMT if available.

21/07/2016 v0.9.4.24

* Don't use SetMTMode for now to set MT mode
* Update to new avisynth header.

15/07/2016 v0.9.4.23

* Update to new avisynth header.

30/05/2016 v0.9.4.22

+ Fix for MT version of avisynth.

17/04/2016 v0.9.4.21

* Update to asmlib 3.26.
+ Fix XP build with VS2015.

05/09/2015 v0.9.4.20

* Minor changes, should handle negative pitch properly.

26/08/2015 v0.9.4.19

+ Implement use of asmlib.

25/08/2015 v0.9.4.18

* Fix 4:1:1 chroma shift.
* Modification of the memory transfer functions.

11/08/2015 v0.9.4.17

* Change the order between turnl/r and nnedi3 calls in nnedi3_rpow2 to optimize speed.
* Remove memcpy_amd and use memcpy instead.

10/08/2015 v0.9.4.16

+ Add csresize parameter; chroma shift adjustment according to the resize is enabled by default.
* Fix regression on center adjustment.

09/08/2015 v0.9.4.15

* Change default value of mpeg2 to false, and keep exact previous behavior in that case (but doesn't put back chroma shift issue ^_^).

09/08/2015 v0.9.4.14

+ Add resize adjustment chroma shift in case of MPEG-2 subsampling.
* Faster RGB24 mode always.
+ Add mpeg2 parameter.

08/08/2015 v0.9.4.13

* Correction of chroma shift, once and for all this time.
* Faster RGB24 mode if FTurn is usable.
* Fix YV411 support.

06/08/2015 v0.9.4.12

* More checks on use of FTurn.
* Fix regression on YUY2 introduced in previous release.

31/07/2015 v0.9.4.11

+ Correction of chroma shift value for 4:2:x color modes.
+ Add YV411 support.

25/05/2015 v0.9.4.10

* Integration of commits coming from Vapoursynth version, thanks to Myrsloik.

10/05/2015 v0.9.4.9

* Bug correction in x64 ASM file, thanks to jackoneill and HolyWu.

13/03/2015 v0.9.4.8

* Update to last AVS+ header files.

26/01/2014 v0.9.4.7

* Little correction in YV24 and Y8 support for nnedi3_rpow2.

17/01/2014 v0.9.4.6

* Little YV16 optimization.

16/01/2014 v0.9.4.5

* Updated YV16 support for nnedi3_rpow2, now fast and direct, not tweaked by going to YUY2.

15/01/2014 v0.9.4.4

* A few small optimizations.
* Trick YV16 support in nnedi3_rpow2 by working internally in YUY2 mode to speed things up.

14/01/2014 v0.9.4.3

+ Add fturn support.

13/01/2014 v0.9.4.2

+ Add Y8, YV16 and YV24 support.

03/01/2014 v0.9.4.1

* Move out all inline ASM code in external files, update code to build x64 version.
* Update interface to new avisynth 2.6 API.
- Avisynth 2.5.x not supported anymore.

Original version 0.9.4

==================================================================

Multi-threading information

CPU example case : 4 cores with hyper-threading.

If you leave all the multi-threading parameters at their default values, the filter is set to be "optimal" for the case where you're not using prefetch or are running under standard AviSynth: all the logical CPUs will be used.
If you set SetAffinity to true, the threads are allocated to the CPUs contiguously: physical CPU 1 will have threads (0,1), ... physical CPU 4 will have threads (6,7), allowing optimal cache use. Run tests to see what works best for you.

Now, if you are using prefetch in your script, things are different!
If you're using it with the maximum number of CPUs (8 in our example case), you can still run tests, but I would strongly advise disabling the internal multi-threading by using threads=1. In this case, no threadpool is created, and all the other multi-threading-related filter parameters have no effect, not even prefetch.
If you're using prefetch in your script with a value lower than your CPU count, you may want to try mixing external and internal multi-threading: set the internal multi-threading to a lower number of threads, and set the prefetch parameter of the filter. This parameter sets the number of internal threadpools created; it is best to match the script's prefetch value. If you don't set it (leave it at 1), or set it lower than the prefetch value in your script, several instances (or GetFrame calls) will be created, but they will not run efficiently, because each instance (or GetFrame) will spend time waiting for a threadpool to become available if not enough were created.
Unfortunately, as things stand, I have no way of knowing the prefetch value used in the AviSynth script at the time I need the information; this is why you have to use the prefetch parameter in the filter.
In our example CPU case, you can have things like:
Code:
filter(...,threads=1)
prefetch(8)
or
Code:
filter(...,threads=2,prefetch=4)
prefetch(4)
or
Code:
filter(...,threads=4,prefetch=2)
prefetch(2)
or even
Code:
filter(...,threads=3,prefetch=4)
prefetch(4)
if you want to boost and go a little over your total CPU number.
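As a rough illustration of the combinations above, here is a small Python sketch that adds up the worker threads each setting produces. It assumes (my reading of the examples, not something the filter reports) that the total is simply threads × prefetch, and the helper name thread_budget is made up for this post.

```python
def thread_budget(threads, prefetch, logical_cpus=8):
    """Return (total worker threads, ratio to the logical CPU count).

    Assumption: each of the `prefetch` pools runs `threads` threads,
    so the total is threads * prefetch. With threads=1 no pool is
    created and only the script's prefetch instances run.
    """
    total = threads * prefetch
    return total, total / logical_cpus


# The four combinations from the examples above (4 cores + hyper-threading).
for threads, prefetch in [(1, 8), (2, 4), (4, 2), (3, 4)]:
    total, load = thread_budget(threads, prefetch)
    print(f"threads={threads} prefetch={prefetch}: "
          f"{total} threads, {load:.2f}x the logical CPUs")
```

The last combination is the only one that exceeds the 8 logical CPUs, which is the "go a little over" case.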

Also, if your prefetch is not higher than your number of physical cores, you can try setting SetAffinity to true, but in that case you have to set MaxPhysCore to false. The threads of each pool will then be placed on the CPUs in steps.
For example, in our case:
Code:
filter(...,threads=2,prefetch=4,SetAffinity=true,MaxPhysCore=false)
prefetch(4)
Will create 4 pools of 2 threads, laid out as follows:
pool[0] : threads(0 -> 1) on CPU 1.
pool[1] : threads(0 -> 1) on CPU 2.
pool[2] : threads(0 -> 1) on CPU 3.
pool[3] : threads(0 -> 1) on CPU 4.
Code:
filter(...,threads=4,prefetch=2,SetAffinity=true,MaxPhysCore=false)
prefetch(2)
Will create 2 pools of 4 threads, laid out as follows:
pool[0] : threads(0 -> 1) on CPU 1.
pool[0] : threads(2 -> 3) on CPU 2.
pool[1] : threads(0 -> 1) on CPU 3.
pool[1] : threads(2 -> 3) on CPU 4.
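
The two layouts above can be reproduced with a short Python sketch. It is only a model of the listings, not the plugin's actual placement code: it assumes 2 logical CPUs per physical core, thread counts that are a multiple of 2, and pools packed one after another. The function name pool_layout is hypothetical.

```python
def pool_layout(threads, prefetch, threads_per_core=2):
    """List (pool, first_thread, last_thread, physical_cpu) tuples.

    Assumption: pools are packed contiguously, so thread t of pool p
    has global index p * threads + t, and every group of
    `threads_per_core` consecutive indices shares one physical CPU.
    Physical CPUs are numbered from 1, as in the listings above.
    """
    layout = []
    for pool in range(prefetch):
        for first in range(0, threads, threads_per_core):
            last = min(first + threads_per_core, threads) - 1
            cpu = (pool * threads + first) // threads_per_core + 1
            layout.append((pool, first, last, cpu))
    return layout


for threads, prefetch in [(2, 4), (4, 2)]:
    print(f"threads={threads}, prefetch={prefetch}")
    for pool, first, last, cpu in pool_layout(threads, prefetch):
        print(f"  pool[{pool}] : threads({first} -> {last}) on CPU {cpu}.")
```

For odd thread counts this simple model would make two pools share a physical CPU, so treat it as valid only for the cases shown.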

Negative prefetch
It is now possible to use a negative prefetch value to tune the prefetch parameter to its optimal value. With a negative value, the filter will throw an error if the number is not high enough to avoid waiting when requesting an internal threadpool. For this to work properly, you have to put a negative prefetch on ALL the filters of your script, and also on ALL instances of the same filter.

Example:
Code:
filter(...,threads=2,prefetch=-2)
prefetch(2)
You'll see an error.

But with :
Code:
filter(...,threads=2,prefetch=-3)
prefetch(2)
You'll see no error, so the optimal setting is:
Code:
filter(...,threads=2,prefetch=3)
prefetch(2)
Once you've finished tuning, put back a positive value.
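
The tuning procedure can be summed up as a small search, sketched here in Python. Everything in it is illustrative: loads_without_error is a hypothetical stand-in for "edit the script with this negative value, load it, and see whether the filter throws", and starting the search at the script's prefetch value is an assumption taken from the example above.

```python
def find_optimal_prefetch(loads_without_error, start, limit=64):
    """Return the smallest n >= start for which prefetch=-n gives no error.

    `loads_without_error(value)` is a hypothetical callback: it should
    return True when the script loads cleanly with the filter's
    prefetch set to `value` (a negative number during tuning).
    """
    for n in range(start, limit + 1):
        if loads_without_error(-n):
            return n  # use prefetch=n (positive) in the final script
    raise ValueError(f"no working prefetch value found up to {limit}")


# Stub that mimics the example above: -2 errors out, -3 and beyond load fine.
def example_script(value):
    return abs(value) >= 3


print(find_optimal_prefetch(example_script, start=2))
```

In practice each step is a manual script edit and reload; the loop only shows the order in which to try the values.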

Last edited by jpsdr; 20th November 2023 at 21:56.