Avisynth+ plugin modernization efforts [Archive] - Page 5

tormento

10th May 2017, 15:04

For speed: if a script is using mt_lutxy, it cannot always use fast lookup tables for memory reasons.
Above 12 bits mt_lutxy calculates the expression realtime, pixel-by-pixel, which is slooooow (unlike Expr in VapourSynth).
For the specific bit depths at which realtime expression evaluation kicks in, see masktools2 readme ("feature matrix" section) or the wiki.

Memory? On a modern, x64-enabled computer it shouldn't be a problem anymore. As I know nothing about programming, could you explain more?

Size... that matter has no answer yet.

pinterf

10th May 2017, 15:21

wonkey_monkey

10th May 2017, 21:04

Set realtime=false manually for a 16 bit lutxy. On x64 it can work with plenty of memory. The LUT size is 8 GB, and even the initial LUT calculation takes about a minute, I guess. Beyond a certain clip length it would still be faster, however.

How about calculating each term when required, then putting it in the lut for the next time those x and y values crop up?

Of course you'd need more memory to store whether or not a particular value was already in the lut, unless you can determine an "illegal" value beforehand. The additional checks would make it slightly slower than a full lut, but at least you wouldn't have to wait for the whole table to be generated before you got results.

Myrsloik

10th May 2017, 21:12

How about calculating each term when required, then putting it in the lut for the next time those x and y values crop up?

Of course you'd need more memory to store whether or not a particular value was already in the lut, unless you can determine an "illegal" value beforehand. The additional checks would make it slightly slower than a full lut, but at least you wouldn't have to wait for the whole table to be generated before you got results.

Now we need 8.5GB per lut.
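
The 8 GB and 8.5 GB figures follow from simple arithmetic: a 16-bit lutxy table has 2^16 x 2^16 entries of 2 bytes each (8 GiB), and one "already computed" flag bit per entry adds 2^32 / 8 bytes (0.5 GiB). Below is a minimal C++ sketch of the lazily filled table described above. The class name LazyLut2D, the helper functions and the example expression are all hypothetical; this is not masktools code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Bytes needed for a full two-input LUT: (2^bits)^2 entries of entry_bytes each.
constexpr std::uint64_t lut_bytes(unsigned bits, unsigned entry_bytes) {
    return (std::uint64_t{1} << (2 * bits)) * entry_bytes;
}

// Bytes needed for one "already computed" flag bit per LUT entry.
constexpr std::uint64_t flag_bytes(unsigned bits) {
    return (std::uint64_t{1} << (2 * bits)) / 8;
}

// Hypothetical lazily filled LUT: an entry is computed on first use and
// remembered for the next time the same (x, y) pair crops up.
class LazyLut2D {
public:
    using Expr = std::function<std::uint16_t(std::uint16_t, std::uint16_t)>;

    LazyLut2D(unsigned bits, Expr expr)
        : bits_(bits),
          expr_(std::move(expr)),
          table_(std::size_t{1} << (2 * bits)),
          valid_(std::size_t{1} << (2 * bits), false) {}

    std::uint16_t lookup(std::uint16_t x, std::uint16_t y) {
        // Both inputs must fit into the configured bit depth.
        assert((x >> bits_) == 0 && (y >> bits_) == 0);
        const std::size_t idx = (std::size_t{x} << bits_) | y;
        if (!valid_[idx]) {          // the extra check that a full LUT avoids
            table_[idx] = expr_(x, y);
            valid_[idx] = true;
            ++computed_;
        }
        return table_[idx];
    }

    // Number of entries that had to be calculated so far.
    std::size_t computed() const { return computed_; }

private:
    unsigned bits_;
    Expr expr_;
    std::vector<std::uint16_t> table_;
    std::vector<bool> valid_;        // normally bit-packed: one flag bit per entry
    std::size_t computed_ = 0;
};
```

For bits = 16, lut_bytes(16, 2) + flag_bytes(16) comes to 8.5 GiB, so the sketch is only practical to try at small bit depths such as 8 or 10.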

MysteryX

11th May 2017, 05:51

Now we need 8.5GB per lut.
That's why your computer has 4 memory slots

tormento

11th May 2017, 09:15

Let's forget speed for a moment; there is an explanation for that. Size, i.e. noise reduction, is the culprit here. Why is 16 bit so inefficient?

dlnm

11th May 2017, 11:14

MysteryX

11th May 2017, 15:49

Let's forget speed for a moment; there is an explanation for that. Size, i.e. noise reduction, is the culprit here. Why is 16 bit so inefficient?
If you're using prefilter, I'm pretty sure MinMax isn't doing what it's supposed to do -- which will give weird results. It hasn't been ported to 16-bit. It doesn't crash but treats the data as 8-bit.

That could explain what you're seeing in terms of high bit-rate.

tormento

11th May 2017, 16:10

If you're using prefilter, I'm pretty sure MinMax isn't doing what it's supposed to do -- which will give weird results. It hasn't been ported to 16-bit. It doesn't crash but treats the data as 8-bit.

That could explain what you're seeing in terms of high bit-rate.

I use prefilter=4

Motenai Yoda

11th May 2017, 17:23

I'm pretty sure smdegrain (at least the current one) feeds minblur an 8-bit clip.

real.finder

11th May 2017, 22:28

Let's forget speed for a moment; there is an explanation for that. Size, i.e. noise reduction, is the culprit here. Why is 16 bit so inefficient?

this is what pinterf said back then

There's no reason to get identical results.
For 16 bit input, even if the original clip is 8 bits and its straight 16 bit conversion has zero lsb, the lower resolution subclips in Super are already interpolated and have meaningful lsb parts.
So the vectors after MAnalyze are possibly different from those that would be estimated from a single 8 bit source.
Then the weighting and blending inside MDegrain works with higher precision than for an 8 bit input. That is a difference, too.

and this

Indeed, when I alt-tabbed between the 8 bit and 10+ bit results, the 10+ bit version had possibly found better motion vectors than the 8 bit one: I saw fewer orphaned contour lines and remnants from previous frames.
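
The point about zero lsb in the first quote can be checked with a few lines of C++. This is only a sketch: the rounded two-pixel average stands in for whatever interpolation Super really uses, and the helper names are made up for the illustration.

```cpp
#include <cstdint>

// Straight 8-bit to 16-bit conversion: the value moves into the high byte,
// so the low byte (lsb part) is always zero.
constexpr std::uint16_t to16(std::uint8_t v) {
    return static_cast<std::uint16_t>(v << 8);
}

// Rounded average of two 16-bit samples, used here as a stand-in for the
// interpolation that builds the lower resolution subclips.
constexpr std::uint16_t average16(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>((static_cast<std::uint32_t>(a) + b + 1) >> 1);
}

// Low byte of a 16-bit sample.
constexpr std::uint8_t low_byte(std::uint16_t v) {
    return static_cast<std::uint8_t>(v & 0xFF);
}

// Neighbouring 8-bit values 100 and 101 become 0x6400 and 0x6500; their
// average is 0x6480, a value that no straight 8-bit conversion can produce.
static_assert(low_byte(to16(100)) == 0x00, "converted samples have a zero low byte");
static_assert(average16(to16(100), to16(101)) == 0x6480, "interpolated sample");
static_assert(low_byte(average16(to16(100), to16(101))) == 0x80, "non-zero low byte");
```

So even a clip that started life as 8 bit gives the motion search data that an 8-bit run never sees, which fits the explanation that identical results should not be expected.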

Groucho2004

11th May 2017, 23:23

Sorry if I post in the wrong thread.
May I ask where I can find the x64 version of RemapFrames 0.4.1?
It seems the link in the wiki (http://avisynth.nl/index.php/AviSynth%2B) is not valid anymore.
Thank you.

Yes, link is dead and I couldn't find the file anywhere else. So, I have taken cretindesalpes' last update to the code of RemapFrames from here (https://forum.doom9.org/showthread.php?p=1644971#post1644971), updated it to AVS2.6 interface and built 32 and 64 bit versions. I tested them very briefly, let me know how it goes.

Link is in my signature.

ajp_anton

15th May 2017, 15:56

For speed: if a script is using mt_lutxy, it cannot always use fast lookup tables for memory reasons.
Above 12 bits mt_lutxy calculates the expression realtime, pixel-by-pixel, which is slooooow (unlike Expr in VapourSynth).
For the specific bit depths at which realtime expression evaluation kicks in, see masktools2 readme ("feature matrix" section) or the wiki.

Why is the VapourSynth-version faster?

Set realtime=false manually for a 16 bit lutxy. On x64 it can work with plenty of memory. The LUT size is 8 GB, and even the initial LUT calculation takes about a minute, I guess. Beyond a certain clip length it would still be faster, however.

Don't know if it already is, but couldn't the LUT calculation be quite easily multithreaded?

Myrsloik

15th May 2017, 16:07

Why is the VapourSynth-version faster?

Don't know if it already is, but couldn't the LUT calculation be quite easily multithreaded?

1. Because on x86 it converts the expression to native SSE2 code and does all calculations in floating point. Including some optimizations like pre-calculating constant parts of the expression and other fun stuff.

2. If your LUT has that many values, a LUT is generally a bad idea.

TheFluff

15th May 2017, 22:26

1. Because on x86 it converts the expression to native SSE2 code and does all calculations in floating point. Including some optimizations like pre-calculating constant parts of the expression and other fun stuff.

Or, to explain it with a few more words: writing a runtime RPN expression evaluator in C++ is pretty trivial and a ton of CS undergrad students have been subjected to it as an exercise. It's easy to write but if you want to put it in a video filter it gets really slow since the expression has to be re-evaluated for every pixel value, and the runtime expression evaluator is a pretty hefty bit of code compared to the tiny bits of math that you're actually writing in the RPN expression.

The expr filter in VS isn't like that. The expr filter in VS is (on x86) a fully-fledged, optimizing just-in-time compiler that takes your RPN expression and compiles it to SSE2-optimized native code. When I say "optimizing" I mean it does things like optimize out constant parts of the expression so they don't have to be re-calculated for each pixel, including optimizing out immutable conditionals so you can avoid branching where possible. It also does auto-vectorization, so the compiled code loads, processes and stores four pixels at a time (since XMM registers are 128 bits wide and it works with 32-bit floats internally). In other words, its performance is on the same level as if you had written the equivalent of your RPN expression in C, compiled it as a plugin and used that instead of mt_lut.

8 GB LUTs are almost certainly slow as molasses in comparison. Memory bandwidth isn't free.
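
To make the contrast concrete, here is a sketch of the "CS undergrad" style evaluator: a runtime RPN interpreter that walks the whole token list for every pixel. The only optimization shown is folding constant sub-expressions once, before the per-pixel loop. The Token type and all function names are made up for this illustration; this is not the VapourSynth Expr code, which compiles the expression to native code instead of interpreting it.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One RPN token: a variable (x or y), a constant, or a binary operator.
struct Token {
    enum Kind { X, Y, Const, Add, Sub, Mul, Div } kind;
    float value;  // only meaningful for Const
};

// Apply a binary operator to two operands.
float apply(Token::Kind op, float a, float b) {
    switch (op) {
        case Token::Add: return a + b;
        case Token::Sub: return a - b;
        case Token::Mul: return a * b;
        case Token::Div: return a / b;
        default: assert(!"apply() called with a non-operator"); return 0.0f;
    }
}

// Parse a whitespace-separated RPN string such as "x y + 2 /".
std::vector<Token> parse(const std::string& text) {
    std::vector<Token> out;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        if      (word == "x") out.push_back(Token{Token::X, 0.0f});
        else if (word == "y") out.push_back(Token{Token::Y, 0.0f});
        else if (word == "+") out.push_back(Token{Token::Add, 0.0f});
        else if (word == "-") out.push_back(Token{Token::Sub, 0.0f});
        else if (word == "*") out.push_back(Token{Token::Mul, 0.0f});
        else if (word == "/") out.push_back(Token{Token::Div, 0.0f});
        else out.push_back(Token{Token::Const, std::stof(word)});  // throws on garbage
    }
    return out;
}

// Constant folding: whenever an operator directly follows two constants,
// replace "c1 c2 op" with the single pre-computed constant.
std::vector<Token> fold_constants(const std::vector<Token>& in) {
    std::vector<Token> out;
    for (const Token& t : in) {
        const std::size_t n = out.size();
        const bool is_op = t.kind >= Token::Add;
        if (is_op && n >= 2 && out[n - 1].kind == Token::Const &&
            out[n - 2].kind == Token::Const) {
            const float folded = apply(t.kind, out[n - 2].value, out[n - 1].value);
            out.pop_back();
            out.back() = Token{Token::Const, folded};
        } else {
            out.push_back(t);
        }
    }
    return out;
}

// Interpret the program for one pixel pair. This runs once per pixel, which
// is exactly the overhead a LUT or a JIT compiler avoids.
float evaluate(const std::vector<Token>& prog, float x, float y) {
    std::vector<float> stack;
    for (const Token& t : prog) {
        if (t.kind == Token::X) { stack.push_back(x); continue; }
        if (t.kind == Token::Y) { stack.push_back(y); continue; }
        if (t.kind == Token::Const) { stack.push_back(t.value); continue; }
        if (stack.size() < 2) throw std::runtime_error("stack underflow");
        const float b = stack.back();
        stack.pop_back();
        stack.back() = apply(t.kind, stack.back(), b);
    }
    if (stack.size() != 1) throw std::runtime_error("malformed expression");
    return stack.back();
}
```

For "x y + 256 2 / *" the folding pass shrinks the program from seven tokens to five ("x y + 128 *"), but the interpreter still pays for the token loop and the stack traffic on every single pixel.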

wonkey_monkey

15th May 2017, 22:42

The expr filter in VS isn't like that. The expr filter in VS is (on x86)

Why not on x64 as well?

a fully-fledged, optimizing just-in-time compiler that takes your RPN expression and compiles it to SSE2-optimized native code.

Is it based on some other piece of open-source software?

My rgba_rpn plugin does something similar, but using the x87 FPU. I'm wondering now if I should move to SSE2 instead. It certainly isn't crazy-optimal, although I've done my best, and it can do a lot more than expr can.

It's too complex to warrant vectorization, but if there's any interest/need I'd be willing to look into crafting something similar to VS's expr - I was going to provide it as an alias, anyway, but if people would find it really useful it might be worth writing something more optimal for those specific requirements.

TheFluff

15th May 2017, 23:24

By x86 I meant x86_64 too. VS can be compiled for other archs as well though, but for those there's no JIT compilation.

It uses jitasm (https://github.com/hlide/jitasm) to actually do the compilation but all the code generation/optimization is mainly Myrsloik's and dubhater's work AFAIK.

ajp_anton

17th May 2017, 22:10

What I meant was: why can't the Avisynth version be as fast as VS? Anything preventing the use of the same code?

TheFluff

17th May 2017, 22:25

I... don't think so? It's a fairly simple filter, so feel free to go hog wild (https://github.com/vapoursynth/vapoursynth/blob/master/src/core/exprfilter.cpp)

wonkey_monkey

18th May 2017, 20:41

I'm trying to think about updating my plugin to handle all the new colour spaces. I've been reading this:

https://forum.doom9.org/showpost.php?p=1783714&postcount=2484

as a reference and I'm wondering about using the new stuff like ComponentCount() - what happens if I make use of that in my code, but then someone still running AviSynth 2.6 tries to use it? Will it fail? Is there a "best way" to code for this to maintain compatibility?

MysteryX

18th May 2017, 20:43

wonkey_monkey

18th May 2017, 20:52

Ohhh, I see. That makes much more sense than whatever stupid thing I was thinking of. I've still never really gotten the hang of C++...

:stupid::thanks:

jpsdr

1st June 2017, 11:37

It took me a long time to get around to asking, but here is a stupid question...:eek:

Should I do this:

class XXX : public GenericVideoFilter
{
public:
    XXX(PClip _child, ..., IScriptEnvironment* env);
    ~XXX();
    ....

or this :

class XXX : public GenericVideoFilter
{
public:
    XXX(PClip _child, ..., IScriptEnvironment* env);
    virtual ~XXX();
    ....

It suddenly came to my mind, as I remember that making destructors virtual is very often strongly recommended.
:confused:

Groucho2004

1st June 2017, 13:27

For derived classes you should always use virtual. Have a look at this (https://stackoverflow.com/questions/461203/when-to-use-virtual-destructors) thread.

shekh

1st June 2017, 13:50

No difference. Because the destructor is virtual in the base class, it is also implicitly virtual in XXX.
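
This rule can be checked without any AviSynth headers. In the sketch below, BaseFilter is a hypothetical stand-in for the base class (it is not the real avisynth.h declaration), and the two derived classes match the two variants asked about.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical stand-in for a filter base class with a virtual destructor.
struct BaseFilter {
    virtual ~BaseFilter() {}
};

int g_destroyed = 0;  // counts how many derived destructors have run

// Variant 1: no "virtual" keyword on the destructor.
struct WithoutKeyword : BaseFilter {
    ~WithoutKeyword() { ++g_destroyed; }  // still virtual, inherited from the base
};

// Variant 2: "virtual" spelled out.
struct WithKeyword : BaseFilter {
    virtual ~WithKeyword() { ++g_destroyed; }
};

// The compiler agrees that both destructors are virtual.
static_assert(std::has_virtual_destructor<WithoutKeyword>::value,
              "implicitly virtual");
static_assert(std::has_virtual_destructor<WithKeyword>::value,
              "explicitly virtual");

// Destroy one object of each kind through a base-class pointer and report
// how many derived destructors were called.
int destroy_via_base() {
    g_destroyed = 0;
    std::unique_ptr<BaseFilter> a(new WithoutKeyword);
    std::unique_ptr<BaseFilter> b(new WithKeyword);
    a.reset();
    assert(g_destroyed == 1);  // derived destructor ran without the keyword
    b.reset();
    return g_destroyed;
}
```

destroy_via_base() returns 2, so both variants are cleaned up correctly; writing the keyword (or, in C++11, "override") only documents the intent.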

Groucho2004

1st June 2017, 14:07

Because the destructor is virtual in the base class, it is also implicitly virtual in XXX.
Do you mean "GenericVideoFilter" specifically?

shekh

1st June 2017, 14:37

Do you mean "GenericVideoFilter" specifically?

Yes. GenericVideoFilter here.

Groucho2004

1st June 2017, 15:29

Yes. GenericVideoFilter here.

I see, I understood it more as a general question about destructors.

jpsdr

1st June 2017, 19:10

Thanks for all the information. Even if it was more a specific question about avs plugins, the link was interesting nevertheless. I've noticed that in some plugins the destructor wasn't virtual (nnedi3 for example), and in others it was, so I was wondering which one was right, even though I thought virtual was more likely correct than not.
Finally, it seems it doesn't matter, but I've nevertheless put virtual where it was missing.

real.finder

8th June 2017, 14:12

pinterf

8th June 2017, 14:23

hi pinterf

some plugins need some hot changes

1st is RGTools

RemoveDirtMC_SE(gpu=false,twopass=false) (http://pastebin.com/uNUbMQEh) in YUY2

will give me "Clense wrok with planar only" (Planar=true in Clense is not doing anything unlike the one in removegrain or Repair).

That was easy, basically I'm ready with it, if mimicking RemoveGrain and Repair is enough.

Yes. tp7 has put planar parameter usage back to the other two, but omitted from Clense.

2nd is masktools here (https://forum.doom9.org/showthread.php?p=1808574#post1808574)

I've seen this one, the request is that masktools should handle undocumented chroma="ignore" again?

real.finder

8th June 2017, 14:28

I've seen this one, the request is that masktools should handle undocumented chroma="ignore" again?

yes, I just wrote about this there https://forum.doom9.org/showthread.php?p=1809060#post1809060

MysteryX

8th June 2017, 22:40

Is it possible to do this in native 16-bit yet?

Function Dither_resize16nr (clip src, int width, int height,
\ float "src_left",
\ float "src_top",
\ float "src_width",
\ float "src_height",
\ string "kernel",
\ float "fh",
\ float "fv",
\ int "taps",
\ float "a1",
\ float "a2",
\ float "a3",
\ int "kovrspl",
\ bool "cnorm",
\ bool "center",
\ string "cplace",
\ int "y",
\ int "u",
\ int "v",
\ string "kernelh",
\ string "kernelv",
\ float "totalh",
\ float "totalv",
\ bool "invks",
\ bool "invksh",
\ bool "invksv",
\ int "invkstaps",
\ string "cplaces",
\ string "cplaced",
\ string "csp",
\ bool "noring")
{
noring = Default (noring, true)

Assert (width > 0 && height > 0, "Dither_resize16nr: width and height must be > 0.")

sr_h = Float (width ) / Float (src.width () )
sr_v = Float (height) / Float (src.height ())
sr_up = Dither_max (sr_h, sr_v)
sr_dw = 1.0 / Dither_min (sr_h, sr_v)
sr = Dither_max (sr_up, sr_dw)
Assert (sr >= 1.0)

# Depending on the scale ratio, we may blend or totally disable
# the ringing cancellation
thr = 2.5
nrb = (sr > thr)
nrf = (sr < thr + 1.0 && noring)
nrr = (nrb) ? Dither_min (sr - thr, 1.0) : 1.0
nrv = (nrb) ? Round ((1.0 - nrr) * 255) * $010101 : 0

main = src.Dither_resize16 (width, height,
\ src_left =src_left,
\ src_top =src_top,
\ src_width =src_width,
\ src_height=src_height,
\ kernel =kernel,
\ fh =fh,
\ fv =fv,
\ taps =taps,
\ a1 =a1,
\ a2 =a2,
\ a3 =a3,
\ kovrspl =kovrspl,
\ cnorm =cnorm,
\ center =center,
\ cplace =cplace,
\ y =y,
\ u =u,
\ v =v,
\ kernelh =kernelh,
\ kernelv =kernelv,
\ totalh =totalh,
\ totalv =totalv,
\ invks =invks,
\ invksh =invksh,
\ invksv =invksv,
\ invkstaps =invkstaps,
\ cplaces =cplaces,
\ cplaced =cplaced,
\ csp =csp
\ )

nrng = (nrf) ? src.Dither_resize16 (width, height,
\ src_left =src_left,
\ src_top =src_top,
\ src_width =src_width,
\ src_height=src_height,
\ kernel ="gauss",
\ a1 =100,
\ center =center,
\ cplace =cplace,
\ cplaces =cplaces,
\ cplaced =cplaced,
\ csp =csp,
\ y =y,
\ u =u,
\ v =v
\ ) : main

nrm = (nrb && nrf) ? main.BlankClip (color_yuv=nrv, height=main.Height()/2) : main

# To do: use a simple frame blending instead of Dither_merge16
rgm = 1
rgc = (nrb) ? -1 : 0
rgy = Defined (y) ? ((y == 3) ? rgm : rgc) : rgm
rgu = Defined (u) ? ((u == 3) ? rgm : rgc) : rgm
rgv = Defined (v) ? ((v == 3) ? rgm : rgc) : rgm
rguv = Dither_max (rgu, rgv)
(nrf ) ? main.Dither_repair16 (nrng, rgy, rguv) : main
(nrf && nrb) ? Dither_merge16_8 (main, last, nrm, y=y, u=u, v=v) : last
}
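
To see what a native 16-bit port would have to reproduce, it helps to trace the blend arithmetic of the script above on its own. The C++ sketch below mirrors only the sr/nrb/nrf/nrr/nrv lines; the struct and function names are invented for the illustration, it uses double where AviSynth script floats are single precision, and it does not perform any resizing.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Values derived from the scale ratio, named after the script variables.
struct NoRingParams {
    double sr;   // scale ratio, always >= 1 (up- or downscaling)
    bool   nrb;  // true when the repaired and the plain resize get blended
    bool   nrf;  // true when ringing cancellation is applied at all
    double nrr;  // share of the plain (unrepaired) resize in the blend
    long   nrv;  // mask colour: 8-bit mask level replicated as $YYUUVV
};

NoRingParams noring_params(int src_w, int src_h, int dst_w, int dst_h,
                           bool noring = true) {
    assert(src_w > 0 && src_h > 0 && dst_w > 0 && dst_h > 0);
    const double sr_h  = static_cast<double>(dst_w) / src_w;
    const double sr_v  = static_cast<double>(dst_h) / src_h;
    const double sr_up = std::max(sr_h, sr_v);
    const double sr_dw = 1.0 / std::min(sr_h, sr_v);
    const double thr   = 2.5;  // same threshold as in the script

    NoRingParams p;
    p.sr  = std::max(sr_up, sr_dw);
    p.nrb = p.sr > thr;
    p.nrf = p.sr < thr + 1.0 && noring;
    p.nrr = p.nrb ? std::min(p.sr - thr, 1.0) : 1.0;
    // Round() in the script rounds half away from zero, like std::lround.
    p.nrv = p.nrb ? std::lround((1.0 - p.nrr) * 255) * 0x010101L : 0;
    return p;
}
```

For 640x360 to 1920x1080 the ratio is 3.0, giving nrr = 0.5 and a mask colour of $808080, i.e. roughly a half-and-half blend. Below a ratio of 2.5 no blending mask is built (nrb is false), and from 3.5 upwards nrf is false, so the script returns the plain resize.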

real.finder

8th June 2017, 23:35

Is it possible to do this in native 16-bit yet?

Function Dither_resize16nr (clip src, int width, int height,
>>>

well, there is ResizeX by Desbreko, which I ported some time ago https://forum.doom9.org/showpost.php?p=1782612&postcount=65

it does most of the things that Dither_resize16 does

MysteryX

9th June 2017, 00:05

This function is to add no-ringing to the resize, it's different.

pinterf

9th June 2017, 15:41

New version for RgTools.
I originally touched it because of a request for a working planar parameter in Clense.
But I decided to implement AVX2 support as well, though only in RemoveGrain.

Download RgTools 0.96 (https://github.com/pinterf/RgTools/releases/tag/0.96)

Changes
v0.96 (20170609)
- RemoveGrain: AVX2. Available when Avisynth+ reports AVX2 usability
Can be disabled with new parameter: optAvx2=false
- Clense, ForwardClense, BackwardClense: ignore planar colorspace checking when planar=true. Like in RemoveGrain and Repair.
- Fix: Mode 11 and 13 for 32 bit float colorspaces (which worked like mode 10 and 12)

I'd appreciate it if someone would test how much faster AVX2 is in RemoveGrain than the previous SSE2/SSE4 code. I could test it (all modes in all bit depths) only with the painfully slow SDE emulator. Thanks.

Motenai Yoda

17th June 2017, 10:32

Is it just me, or does dgdecode's deblock/dering not work?
I'm using groucho's dgdecode icl, tp7's deblock and avs+ 2504

videoh

17th June 2017, 12:34

It's you.

real.finder

22nd June 2017, 19:56

any plan to support HBD in MedianBlur?

it is used by the MinBlur function by Didée, which is used in many scripts

TheFluff

23rd June 2017, 00:29

you could backport VS-CTMF (https://forum.doom9.org/showthread.php?t=171213), I guess?

LigH

19th July 2017, 12:23

Would a plugin like DSS2Mod from the Xvid4PSP project (with native LAV Filters API) have a chance to get ported? It would probably take considerable effort, supporting new colorspaces and even audio, so it may not be a serious alternative to FFMS2(000) and LSW anymore.

burfadel

19th July 2017, 13:31

LigH

19th July 2017, 13:35

Well, yes, "The Web Archive" is not a very prominent location... :o

Reel.Deel

19th July 2017, 13:49

Well, yes, "The Web Archive" is not a very prominent location... :o

Do you have a suggestion? Archive.org is one of the few places I trust for longevity.

LigH

19th July 2017, 13:54

Unfortunately, it is also a cemetery of dead projects... so I have to assume there is no more active mirror.

pinterf

14th August 2019, 13:15

New version for RgTools.

The biggest change is that I added "TemporalRepair" from the old RemoveGrainT package. I spent quite a few weeks on it during the spring, reverse engineering pure inline assembly and rewriting it in C and SIMD intrinsics. Of course new planar YUV, RGB and high bit depth colorspaces were all added, but YUY2 was dropped (of course, again).
I have now included builds made with an LLVM 9.0 snapshot build; some modes are quicker, others are slower than the MS builds. Try and decide.

Download RgTools 0.98 (https://github.com/pinterf/RgTools/releases)

Changes
v0.98 (20190814)
- Include "TemporalRepair" filter from old RemoveGrainT package (rewritten C and SIMD intrinsics from pure inline asm)
Add Y8, YV16, YV24 besides YV12, drop YUY2 support.
Add 10-32 bit support for Y, YUV and planar RGB formats
Add int "opt" parameter (mainly for debug: 0=C 1=SSE2 2=SSE4.1) for testing specific code paths
- Codes for different processor targets (SSSE3 and SSE4.1) are now separated and are compiled using function attributes (clang, gcc).
- Other source changes for errorless gcc and clang build
- LLVM support, see howto in RgTools.txt
Note: use at least LLVM 9.0 build 21 June 2019 due to a clang compiler bug (_mm_avg_epu8 related, fixed on April 14 2019); older versions are up to 1/3 slower than the Microsoft build.
See latest snapshot builds at https://llvm.org/builds/
- GCC 8.3 support, CMakeFiles.txt, see howto in RgTools.txt
- RemoveGrain/Repair different code paths for SSE2/SSE4.1/AVX2 instead of SSE2/SSE3/AVX2.
- Add documentation (from old docs, new part: gcc/clang howto)
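
As an illustration of what a debug parameter like "opt" (0=C 1=SSE2 2=SSE4.1) typically does, here is a small sketch of code path selection. Only the three numeric values come from the changelog above. The names CodePath, CpuCaps and select_path, the meaning of -1 as "automatic", and the rule that a request is capped by what the CPU supports are assumptions made for this example, not the RgTools implementation.

```cpp
#include <algorithm>
#include <cassert>

// Code paths numbered like the "opt" values in the changelog.
enum class CodePath { C = 0, SSE2 = 1, SSE41 = 2 };

// Hypothetical CPU capability flags, e.g. filled in from the host's CPU flags.
struct CpuCaps {
    bool sse2;
    bool sse41;
};

// opt = -1 (assumed default): use the best path the CPU supports.
// opt = 0..2: force that path, but never pick one the CPU cannot run.
CodePath select_path(const CpuCaps& cpu, int opt = -1) {
    assert(opt >= -1 && opt <= 2);
    CodePath best = CodePath::C;
    if (cpu.sse2) best = CodePath::SSE2;
    if (cpu.sse2 && cpu.sse41) best = CodePath::SSE41;
    if (opt < 0) return best;
    return static_cast<CodePath>(std::min(opt, static_cast<int>(best)));
}
```

With a scheme like this, opt=0 on any machine exercises the plain C code, which is handy for checking that the SIMD paths produce identical output.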

Groucho2004

14th August 2019, 13:18

Thanks, you da man.

real.finder

14th August 2019, 16:47

New version for RgTools.

:thanks: it seems to work fine with RemoveDirtMC_SE

since this is the Avisynth+ plugin modernization efforts thread

I think there are some basic plug-ins that need an HBD port, such as AddGrainC, MedianBlur2 and dfttest

FranceBB

14th August 2019, 19:46

Thank you, again, Ferenc! :)

:thanks: it seems to work fine with RemoveDirtMC_SE

since this is the Avisynth+ plugin modernization efforts thread

I think there are some basic plug-ins that need an HBD port, such as AddGrainC, MedianBlur2 and dfttest

+1 for DFTTest. It's my favorite filter when it comes to denoising and it's a shame that it only supports 16bit stacked.
Now that f3kdb has been updated to support 16bit planar, all my filter-chain is on native high bit depth... except for DFTTest.
As a side note, if you are actually going to port it, or if someone is gonna port it, please please please just add High Bit Depth but don't remove 16bit stacked support for compatibility reasons.

DJATOM

14th August 2019, 19:54

Yeah, feel free to clone my repo and modernize it. I'm using Vapoursynth nowadays and lost interest in avs stuff.