Avisynth+ [Archive] - Page 84

qyot27

2nd August 2018, 15:01

I do not care at all about high bit depths and hi color.
AviSynth+ was already three years old by the time high bit depth was added. Many of the differences you'd be likely to see a benefit from are actually the older ones that happened right after the fork occurred in 2013. Stuff like improved caching behavior, lower memory usage, autoloading of C plugins, 64-bit support, many filters getting faster because the old SoftWire assembly was removed and replaced by intrinsics that work much better on modern compilers (it also means that support for newer instruction sets like AVX, AVX2, and AVX512 was added much more easily as well), and so on.

I also need ALL of my old AVS plugins to continue working, this includes some VDub filters. Is this a realistic expectation?
2.0 plugins won't work, but 2.5 and 2.6 plugins are fine (with the same 'you do need to update this one' issues that classic AviSynth 2.6 is affected by as well). VDub filters should be okay, as VirtualDubFilter still exists - although it's an external plugin now. I can't personally speak to that part, since I don't use any VDub filters myself, or if I ever did, it was only one or two.

Next question is about multithreading. Is there any speed gain in AVS+ when using old single threaded plugins and just add the "prefetch" command to the end of my scripts?
It depends. It's much more dependent on how the script is written and which MT mode the plugins are marked as (you can either use SetFilterMTMode to set the mode yourself, or use MTmodes.avsi to automatically use the community consensus on the proper modes for the most widely used plugins). If all the plugins you want to use happen to be good with the MT_NICE_FILTER setting, then yes, there should be a boost.

The order of the filters matters more in MT, because the MT_SERIALIZED mode can negate all benefits of MT if such a filter is used late in a script. Anything that uses that mode should occur at the beginning of the script; I'm not sure whether it matters where filters using MT_MULTI_INSTANCE are.

In comparison to the old MT forks (where the only thing to set was a script-wide MT mode and maybe a Distributor call), AviSynth+'s approach is a scalpel, not a hammer. The cost of that is that the process to set things correctly is a tad more complicated, but MTmodes.avsi can take care of the bulk of that for you.

And then the question about AVS+ stability. I am very conservative here, I have not much use for software which gets bug fix updates every couple of weeks. I prefer software which is stable and has an update frequency of no less than one year. From loosely following the related threads I got the impression that AVS+ as well as some important "modernized" plugins (like MaskTools and MVTools) are still work in progress and need frequent bug fix updates.
That's mostly just a different release pattern due to it being more actively maintained and going with a more x264-like setup where people mostly refer to the revision numbers. More frequent builds get released, sure, but if a particular build works for you, it's usually safe to keep using it if you've not hit any bugs. And if you do find a bug, see if it still exists in the latest build - if it doesn't, great, if it does, report it.

There actually was going to be a '0.2' stable release somewhere around the r1830 point, but that never came to be. By now we'd probably be at a 0.3 or 0.4. Considering that, 0.1 was released in December 2013, r1825 (close to where 0.2 would have been) was early-mid 2016 and had MT but not high bit depth, and so by my estimation, 0.3 would likely have been September 2016, roughly around the time high bit depth support had gotten all of the major kinks hammered out and external support for them in FFmpeg and x264 showed up. We'd currently be inside the development cycle for 0.4, but the stuff lately hasn't been quite so dramatic, so the cycle for 0.4 could be longer (it might be when GCC and cross-platform support fully congeal, I dunno).

Stability is something of a different question. Despite new features showing up or getting improved actively (most recently it was the Expr stuff), existing features don't just disappear or change, save for users submitting bug reports and it being addressed much more quickly. From a developer point of view, the API has remained stable for years, because it has to be backward compatible with 2.6 and programs that still use 2.6.

Groucho2004

2nd August 2018, 18:04

Nice write-up qyot27, it should make the decision to upgrade (or not) to AVS+ easier for some folks.

:goodpost:

rco133

2nd August 2018, 21:24

Hi.

I have a question about using AVS+ multithreaded or not.

Maybe there is a simple explanation to the behaviour I see.

The source is 1920x1080 bluray with progressive video. This means that the AVS file is quite simple.

LoadPlugin("d:\dgdecnv\x64 Binaries\DGDecodeNV.dll")
DGSource("test.dgi")
crop(0,140,1920,800)
Spline36Resize(1280,536)

That's it.

I have been doing some tests with AVSMeter 281, and am a bit surprised by the results. The 1080 results have a bit different crop in the AVS file, and the resize command is of course gone.

All the tests have been running for 1 minute and then stopped.

-----

720p with no Prefetch.

Frames processed: 17080 (0 - 17079)
FPS (min | max | average): 76.40 | 369.2 | 282.8
Memory usage (phys | virt): 213 | 369 MiB
Thread count: 27
CPU usage (average): 6%

Time (elapsed): 00:01:00.394

-----

1080p with no Prefetch.

Frames processed: 16940 (0 - 16939)
FPS (min | max | average): 101.1 | 374.6 | 280.2
Memory usage (phys | virt): 210 | 366 MiB
Thread count: 27
CPU usage (average): 4%

Time (elapsed): 00:01:00.465

-----

720p with Prefetch(4).

Frames processed: 15550 (0 - 15549)
FPS (min | max | average): 22.65 | 544.3 | 255.7
Memory usage (phys | virt): 241 | 397 MiB
Thread count: 31
CPU usage (average): 12%

Time (elapsed): 00:01:00.812

-----

1080p with prefetch(4).

Frames processed: 13550 (0 - 13549)
FPS (min | max | average): 22.81 | 412.2 | 222.8
Memory usage (phys | virt): 258 | 414 MiB
Thread count: 31
CPU usage (average): 6%

Time (elapsed): 00:01:00.816

-----

I have also done tests with

SetFilterMTMode("DGSource", 3)
DGSource("test.dgi")
SetFilterMTMode("DEFAULT_MT_MODE", 2)

in the AVS file. But it doesn't really make any difference.

What makes me wonder is the big difference in CPU usage, and also the minimum FPS which drops very low.
The resulting average is quite a bit lower as soon as I use Prefetch in the script.

Is it normal for very simple AVS files like this, that enabling MT actually hurts the performance?

For now I am of course just not using Prefetch in AVS files like this, but it made me wonder.

I have some avsi and DLL files located in the plugins64+ folder, but that shouldn't really matter, should it?

Thanks in advance.

rco133

LigH

2nd August 2018, 21:30

If multithreading doesn't matter, then you have a single threaded bottleneck filter. And it's probably DGSource, because you can't multithread hardware decoding. The one decoder chip on your graphics card is as fast as it is.

And who cries about an average of >200 fps? Multithreading is more useful the slower the filtering is relative to the decoding speed. Cropping and scaling, that's quite "nothing" compared to e.g. deinterlacing and denoising.

manolito

3rd August 2018, 06:41

Give it a try, easy enough to switch back, but I'm guessin' that you will not be doing that, ever.

Thanks for this, but I will probably hang in there with the old standard AVS for another simple reason:
I have several computers running in parallel, the main desktop machine is the old P3 Coppermine under XP SP3, and this one does not support SSE2. AVS+ does support XP, but it chokes on a CPU without SSE2 (just like all the "modernized" plugins).

Yes, I know your answer, but I am still not ready to throw this old computer away, I believe in sustainability, and I won't throw away things which still work. My newer laptops would not have any problems running AVS+, but I have no intention to maintain different AviSynth installations on my several machines.

A big thanks to qyot27 for his detailed explanation. It really makes the decision for upgrading to AVS+ or not a lot easier.

Cheers
manolito

magiblot

3rd August 2018, 16:01

That was a great explanation, qyot27. I believe a similar summarised and easy to understand text should be added to the AviSynth Wiki, in addition to other changes to make AviSynth+ the new recommended build.

When newcomers to AviSynth enter the Wiki, the first link they see is the one under 'Official builds'. Also, searching Google for 'avisynth download' leads to downloads of the 2.6 branch on SourceForge and VideoHelp as well.

jpsdr

3rd August 2018, 19:07

AVS+ does support XP, but it chokes on a CPU without SSE2 (just like all the "modernized" plugins).

A lot of devs, myself included, dropped everything before SSE2. Starting from SSE2 already gives enough different code branches (roughly SSE2, SSE4.1, AVX, AVX2). I don't want to add at least two more (MMX, SSE). Even in my own old plugins, which at the beginning had only MMX, then SSE, then SSE2, and so on, I deleted some old specific code paths to reduce their number; otherwise you end up with a dozen specific optimised code paths in your plugin! :scared:

Atak_Snajpera

3rd August 2018, 19:25

A lot of devs, myself included, dropped everything before SSE2. Starting from SSE2 already gives enough different code branches (roughly SSE2, SSE4.1, AVX, AVX2). I don't want to add at least two more (MMX, SSE). Even in my own old plugins, which at the beginning had only MMX, then SSE, then SSE2, and so on, I deleted some old specific code paths to reduce their number; otherwise you end up with a dozen specific optimised code paths in your plugin! :scared:

These days you just need two versions (SSE2 and AVX2). Rest can be omitted.

qyot27

3rd August 2018, 20:39

It's not even that. AviSynth+ dropped the MMX and some of the SSE paths with the removal of SoftWire, but re-implemented some of them that were more obvious and useful with intrinsics (meaning: on a pre-SSE2 machine, some filters may be a mixed bag of whether they're faster than AviSynth 2.6; some of the other caching and memory improvements may make up some of the lost ground, though). But that's not actually what spurred the current build restrictions. Look at CMakeLists.txt. (https://github.com/pinterf/AviSynthPlus/blob/MT/CMakeLists.txt#L74) All anyone building it needs to do to 're-enable'* pre-SSE2 support is swap the commented out ARCH line for the other one. One of these days I'll figure out how to pass a selectable ARCH option to CMake so that optimizing for any of MSVC's supported levels (or disabling it completely) can be done at configure time instead of having to edit CMakeLists.txt before building it. The issue is that to build it with MSVC 2015 or 2017, you need at least Windows 7 (I think? maybe Win8.1) and an SSE2 CPU - MSVC itself doesn't run on older machines or versions of Windows, although it can still build for them.

*Because it's not really disabled, it's just a compiler optimization switch. You could even turn it completely off by passing ARCH:IA32.

qyot27

4th August 2018, 03:59

Here's a build that supports Pentium III. (http://www.mediafire.com/file/vtd3wzl5ynk422u/avisynth__r2741-g0cb91abf-20180803.7z/file) The revision number is higher because it came from some work I'd done to make it compile a little bit more easily/cleanly with MinGW, so nothing markedly different from the latest build from pinterf (especially since I built it with MSVC 2017).

manolito

4th August 2018, 06:00

Thanks so much, this is so nice of you... :thanks:

Gives me something to play with for the next couple of days. Of course I need to stick with my old MaskTools and MVTools, and this means that I also keep my older QTGMC, LSFMod and SRestore scripts because the "modernized" scripts depend on the latest PinterF versions. I am curious if I will still get some speed gains.

Two questions:
1. I know that AVS+ can autoload C-Plugins. Do I have to make any changes in my autoload folder? Right now I autoload my C-Plugins with the help of an AVSI script which loads the C-Plugins.

2. Of course my P3 Coppermine is single threaded. Would it still make any sense to add the "Prefetch" command to the end of my AVS scripts?

Thanks again
manolito

Groucho2004

4th August 2018, 07:30

Two questions:
1. I know that AVS+ can autoload C-Plugins. Do I have to make any changes in my autoload folder? Right now I autoload my C-Plugins with the help of an AVSI script which loads the C-Plugins.

2. Of course my P3 Coppermine is single threaded. Would it still make any sense to add the "Prefetch" command to the end of my AVS scripts?
1. You simply remove that .avsi from the auto-load directory.

2. No. It would even reduce performance.

manolito

4th August 2018, 09:35

:thanks:

StainlessS

4th August 2018, 10:02

Qyot27, Top man, your efforts are always much appreciated, alas, my PIII's now all long dead.

1), I like to load C plugins via avsi (from separate sub directory of 'Plugins') as I just don't like them in my plugs (maybe in case I revert to Std for some temp reason). [I usually enable on demand just by uncommenting a line in avsi].
{just an option to consider, at least keep a copy of original avsi in a SCRIPT sub directory of your Plugins archive}

Groucho2004

4th August 2018, 18:15

I like to load C plugins via avsi (from separate sub directory of 'Plugins') as I just don't like them in my plugs (maybe in case I revert to Std for some temp reason). [I usually enable on demand just by uncommenting a line in avsi].
Same here, more or less. I only have the plugins I use frequently in the autoload directory (mvtools, masktools, rgtools, f3kdb ...) and load others as needed from another location.

Of course there are people who take auto-loading to a new level:
https://forum.doom9.org/showthread.php?p=1844394#post1844394 :eek:

manolito

5th August 2018, 08:01

Just gave qyot27's Non-SSE2 version from this post
https://forum.doom9.org/showthread.php?p=1847975#post1847975
a whirl on my ancient computer, and I am impressed...

To avoid any installation mistakes I installed the latest pinterf version first, then replaced the files with the ones from the qyot27 build. During installation I opted to uninstall the old AVS version and migrate the old plugins.

Worked on the first attempt. AVStoDVD issued a non-fatal warning about a wrong AviSynth version on startup, but that was it. All my standard scripts worked out of the box. I have 76 plugins in my autoload folder, and so far none of them caused a crash.

BTW is there an easy method to identify ancient 2.0 C-plugins?

Only my old and beloved DVDtoSVCD absolutely refused to work under AVS+. It needs AVS 2.57 anyways, I have to use Groucho's AVS switcher to make it work.

Next I did a few speed comparisons. The specs of this ancient machine are:
Celeron P3 Coppermine @ 1.1 GHz Non-SSE2 Single Core
576 MB system RAM
Win XP SP3 32-bit

The unmodified scripts ran a tiny bit slower than under AVS 2.61 Alpha (42 fps vs. 46 fps). Just for fun I then added Prefetch commands to the end of my scripts, and to my big surprise this was very stable and did not compromise speed. Even using "Prefetch(8)" caused no problems. And remember this is under a single core CPU with very low system RAM. Quite impressive...

So again a big thanks to qyot27 for this build. Even if it has no advantages on my old machine it will definitely make it much easier for me to maintain my AviSynth installations on my various computers because I will only have to deal with one identical installation for all of them.

Cheers
manolito

Groucho2004

5th August 2018, 08:41

BTW is there an easy method to identify ancient 2.0 C-plugins?

"AVSMeter avsinfo"

If you want to specify a custom plugin directory:
"AVSMeter avsinfo -c"

StainlessS

5th August 2018, 08:47

BTW is there an easy method to identify ancient 2.0 C-plugins?
Just AvsMeter as far as I'm aware.

Although you might be able to chop something out of this (don't know how it fares under AVS+, no idea what 'not supported' means in that case).
https://forum.doom9.org/showthread.php?p=1641795#post1641795

C v2.0 plugs I got (maybe not all of them)

avisynth_c.dll # The loader, not C v2.0

AVSCurveFlow.dll
AVSShock.dll
equlines.dll
IBob.dll
SmartDecimate.dll
Transition.dll

EDIT: Damn that Groucho, he's just too fast.

qyot27

5th August 2018, 22:46

Good to know that it worked; when I tried moving it over to my P3 machine to test it myself, I ran into redist problems (and couldn't resolve them because the Windows Installer service on that install is screwed up something fierce), so I was flying blind a little bit.

I did, however, whip up a patch to let the SIMD level be selected during configuration (https://github.com/qyot27/AviSynthPlus/commit/422043aef7a59c6f8554295a25ed1317cc5601af). It seemed to work, since when I told it to optimize for AVX, the .dll would no longer run on this (Apollo Lake-based (https://en.wikipedia.org/wiki/Goldmont), only has up to SSE 4.2) computer.

manolito

6th August 2018, 02:55

jpsdr

6th August 2018, 08:37

It's not impossible that you may encounter a speed loss with the internal core AVS filters. The old version you were using had internal MMX-optimized asm code. In the new AVS+ everything has moved to intrinsics, but I think they need at least SSE2. So it's possible that you now have only the pure C code path instead of MMX-optimized code. But I don't know exactly what qyot27 has done, so it's up to him to confirm or not what I've said.

pinterf

6th August 2018, 09:04

In a non-SSE2 build MMX code is still there since it was also moved from the classic Avisynth or got rewritten to intrinsics.
New stuff (mostly for high bit-depth) was implemented in C only at the beginning, and the SSE2 option gave quite a good optimization for these parts. Since then most of the things have hand-written SSE2/SSE4 optimization, some have AVX2.
I haven't removed the C implementation; all filters and functions have a C version.

jpsdr

6th August 2018, 09:40

In a non-SSE2 build MMX code is still there since it was also moved from the classic Avisynth or got rewritten to intrinsics.

Ah, i was wrong on this part, my mistake.

qyot27

6th August 2018, 20:27

All the MSVC optimization flags (/arch:SSE2 vs. /arch:SSE or /arch:IA32) do is optimize the C portions to emit SIMD instructions in the compiled binary, it's not necessary to match it with the intrinsics; the intrinsics will compile to the proper SIMD-enhanced versions with or without /arch set. GCC largely works the same way with its -march/-mtune/-mCPU flags (except that GCC's handling of branched intrinsics is a complete rat's nest).

wonkey_monkey

6th August 2018, 22:07

I'm on r2506 and always scared of installing a new version. I've noticed that showy, showu, and showv all cause exceptions. Just thought I'd mention it in case it's still a bug. Extracty/u/v all work fine.

StainlessS

6th August 2018, 22:48

and showv all cause exceptions

Confirmed in r2728.
EDIT: Don't be a scaredy-cat David, recent versions of avs+ have had some nice speed improvements and I'm sure that r2728 crashes way faster than the old r2506 that you are using. :)

mkver

7th August 2018, 03:26

Only ancient 2.0 C plugins like InpaintFunc, a delogo I've been using for years, but there are alternatives, so I don't mind.
What exactly do you mean by "make Avisynth+ crash"? Is the mere presence of a plugin's dll in the plugins folder enough to make it crash? (InpaintFunc is btw. a script, not a plugin; it uses AVSInpaint internally.) This is something that I couldn't reproduce, neither with the first version of AVSInpaint (that still relied on AviSynth_C.dll) nor with the second version that is a proper 2.5 C plugin.

But I know that AVSInpaint is indeed a bit buggy. I have already asked pinterf about the different behaviour of AVSInpaint on AVS+ and AVS 2.6 (after all, it might have been a bug in AVS+) and it turned out to be a bug in AVSInpaint (as expected):

if (AlphaFrame) avs_release_video_frame(LogoFrame);
if (LogoFrame) avs_release_video_frame(LogoFrame);

The above part of AVSInpaint.c, which releases LogoFrame twice, is contained in the code path for deblending of a static logo. InpaintFunc makes use of exactly this part of AVSInpaint.c when deblending, therefore it crashes. Thanks again to pinterf for finding this bug.
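As a rough illustration of the problem in those two lines: the second line releases LogoFrame again, while AlphaFrame is never released. The corrected form is not quoted in the thread, so the sketch below only assumes that each frame was meant to be released once. MockFrame, mock_release_frame and release_frames are hypothetical stand-ins, not part of the AviSynth C API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted frame; the real type and
   avs_release_video_frame() come from the AviSynth C API. */
typedef struct MockFrame {
    int refcount;
} MockFrame;

/* Drops one reference. Releasing a frame that has no references left is
   the double-release mistake, so the mock traps it with an assert. */
static void mock_release_frame(MockFrame *f)
{
    assert(f->refcount > 0);
    f->refcount--;
}

/* Assumed intent of the cleanup: each pointer is checked and then
   released exactly once. */
static void release_frames(MockFrame *alpha_frame, MockFrame *logo_frame)
{
    if (alpha_frame) mock_release_frame(alpha_frame);
    if (logo_frame)  mock_release_frame(logo_frame);
}
```

With the original pair of lines, a non-NULL AlphaFrame would make the mock's assert fire on the second release of LogoFrame.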

I managed to compile both x86 and x64 versions of AVSInpaint; I actually did this already in May, but back then I created the necessary x64 AviSynth.lib by using avisynth.def from github and dlltool (part of MinGW64) and "dlltool -l avisynth.lib -d avisynth.def /c/Windows/System32/AviSynth.dll" from the 2664 AVS+ dll and although the resulting plugin worked, I wanted to wait for an officially released lib file (pinterf mentioned that he would add (as he has now) them to his releases) and of course I forgot about them, but your post reminded me of this. Here (https://www.dropbox.com/s/9r3bfkqp9aimidk/AVSInpaint.rar?dl=0) are the builds. Could you confirm that you don't get any more crashes with them and that they work as intended?

I also think to have found a bug in AVS+. capi.h (https://github.com/pinterf/AviSynthPlus/blob/MT/avs_core/include/avs/capi.h#L49) contains these lines:

#ifdef MSVC
#ifndef AVSC_USE_STDCALL
# define AVSC_CC __cdecl
#else
# define AVSC_CC __stdcall
#endif
#else
# define AVSC_CC
#endif

#define AVSC_INLINE static __inline

#ifdef BUILDING_AVSCORE
# define AVSC_EXPORT __declspec(dllexport)
# define AVSC_API(ret, name) EXTERN_C AVSC_EXPORT ret AVSC_CC name
#else
# define AVSC_EXPORT EXTERN_C __declspec(dllexport)
# ifndef AVSC_NO_DECLSPEC
# define AVSC_API(ret, name) EXTERN_C __declspec(dllimport) ret AVSC_CC name
# else
# define AVSC_API(ret, name) typedef ret (AVSC_CC *name##_func)
# endif
#endif

I think the "# define AVSC_CC" should be "# define AVSC_CC __stdcall" (or maybe one shouldn't test for MSVC at all here). Compiling x64 with the present capi.h worked, because there is only one calling convention for x64 Windows (so cdecl and stdcall decorations (https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html) are ignored when the target platform is x86-64). But compiling x86-32 didn't work, because the AVSC_API functions don't get declared as stdcall. After the preprocessing stage they look like this:

__attribute__((dllimport)) int avs_is_yv24(const AVS_VideoInfo * p);

when they should be this:

__attribute__((dllimport)) int __attribute__((__stdcall__)) avs_is_yv24(const AVS_VideoInfo * p);

(This is also what gets produced when one uses an old AviSynth_C.h header file from AVS 2.6.)
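One way to read the suggestion of not testing for MSVC at all is to key the calling convention on the target platform instead of the compiler. The following is only a sketch under that assumption, not a proposed patch to capi.h: SKETCH_CC, SKETCH_API, SKETCH_CS_YV24 and the functions are hypothetical names chosen so they cannot clash with the real header.

```c
#include <assert.h>
#include <stddef.h>

/* Select the convention by platform, so that GCC/MinGW and MSVC agree.
   On x64 Windows the keyword is accepted and ignored; elsewhere the
   macro expands to nothing. */
#ifdef _WIN32
#  define SKETCH_CC __stdcall
#else
#  define SKETCH_CC
#endif

/* Same shape as the typedef branch of AVSC_API: declares <name>_func. */
#define SKETCH_API(ret, name) typedef ret (SKETCH_CC *name##_func)

enum { SKETCH_CS_YV24 = 1 };  /* hypothetical pixel-type constant */

SKETCH_API(int, sketch_is_yv24)(int pixel_type);

/* The definition carries the same convention as the pointer type. */
static int SKETCH_CC sketch_is_yv24_impl(int pixel_type)
{
    return pixel_type == SKETCH_CS_YV24;
}

/* Calls through the typedef'd pointer, as a plugin loading the function
   at run time would. */
static int call_through_pointer(sketch_is_yv24_func fn, int pixel_type)
{
    assert(fn != NULL);
    return fn(pixel_type);
}
```

The point of the sketch is only that declaration, definition and pointer type all pick up the same convention from one macro, whichever compiler is used.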

qyot27

7th August 2018, 04:09

There was a capi-related commit fairly recently messing with the dllexport/dllimport stuff (https://github.com/pinterf/AviSynthPlus/commit/42088344bbbef1faf01fd83adeedfba1d59eaa9e), but there was no description of what it was supposed to fix. It did indeed used to compile just fine with MinGW-w64 regardless of 32-bit* and 64-bit, and I'd addressed a compatibility workaround with calling convention in my branch, but it was incredibly hackish. (https://github.com/qyot27/AviSynthPlus/commit/105bf2629538fe53d32d77afe1c34476a2c3a019)

*32-bit GCC builds would not work with FFmpeg, though, due to 32-bit calling conventions. That is, unless using the regressed HEAD version of the header that allows for building AviSynth+ with GCC. But then 32-bit MSVC builds won't work. That hack above was to mitigate this and default to the program still working with 32-bit MSVC builds.

pinterf

7th August 2018, 08:54

ajp_anton

7th August 2018, 12:10

When going down in bit depth, dither=-1 means round, right? Or is it floor? Because I see a tendency of everything shifting ever so slightly "down".

If you do the following on a YV24 source (repeat a few times to magnify the effect)
convertbits(16)
converttorgb(matrix="rec709")
converttoyuv444(matrix="rec601")
convertbits(8)
convertbits(16)
converttorgb(matrix="rec601")
converttoyuv444(matrix="rec709")
convertbits(8)
, interleave it with the original, and histogram("levels"), you should see that on average, Y,U and V values are all going down. If dither=-1 rounds the values, shouldn't the average stay roughly the same?
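For reference, the difference between flooring and rounding when going from 16 to 8 bits can be checked with a few lines of C. This is a generic sketch using plain shifts, and it makes no claim about how ConvertBits actually implements dither=-1; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit to 8-bit by dropping the low byte: always rounds down (floor). */
static uint8_t to8_truncate(uint16_t v)
{
    return (uint8_t)(v >> 8);
}

/* 16-bit to 8-bit with round-to-nearest: add half an output step, shift,
   and clamp so the top of the range cannot overflow to 256. */
static uint8_t to8_round(uint16_t v)
{
    uint32_t r = ((uint32_t)v + 128u) >> 8;
    return (uint8_t)(r > 255u ? 255u : r);
}

/* Mean signed error, in 8-bit steps, over every possible 16-bit input. */
static double mean_error(uint8_t (*convert)(uint16_t))
{
    double sum = 0.0;
    assert(convert != 0);
    for (uint32_t v = 0; v <= 65535u; ++v)
        sum += (double)convert((uint16_t)v) - (double)v / 256.0;
    return sum / 65536.0;
}
```

Over all inputs, mean_error(to8_truncate) comes out at roughly minus half an 8-bit step, while mean_error(to8_round) stays close to zero. A conversion that floors would therefore accumulate a small downward shift on each round trip, which is the kind of drift described here.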

edit:
dither_bits:
- Has no effect if dither=-1 (off).
- Must be an even number from 2 to bits, inclusive.
- In addition, must be >= (clip.BitsPerComponent-8).
I don't understand why any of these restrictions exist, except maybe for implementation reasons. Not that I need this functionality, just wondering.

mkver

7th August 2018, 17:53

There was a capi-related commit fairly recently messing with the dllexport/dllimport stuff (https://github.com/pinterf/AviSynthPlus/commit/42088344bbbef1faf01fd83adeedfba1d59eaa9e), but there was no description of what it was supposed to fix.
The reason seems to be a difference between GCC and MSVC regarding dllimport: If I undo pinterf's latest change to capi.h, GCC complains if the definition (in AVSInpaint.c) isn't also declared as dllimport. And apparently MSVC wants only the declaration to have the dllimport attribute, otherwise you'd get an error (https://msdn.microsoft.com/en-us/library/62688esh.aspx).
It did indeed used to compile just fine with MinGW-w64 regardless of 32-bit* and 64-bit,
You mean, it worked before pinterf's commit? Because if I use this (https://github.com/pinterf/AviSynthPlus/blob/a616181c63164d3d79f72e4a60a2d1ea40b5bbbc/avs_core/include/avs/capi.h), then the linker doesn't find a lot of symbols, because several functions are not declared as stdcall. Your new version (https://github.com/qyot27/AviSynthPlus/commit/105bf2629538fe53d32d77afe1c34476a2c3a019) meanwhile works fine.

qyot27

8th August 2018, 03:15

You mean, it worked before pinterf's commit? Because if I use this (https://github.com/pinterf/AviSynthPlus/blob/a616181c63164d3d79f72e4a60a2d1ea40b5bbbc/avs_core/include/avs/capi.h), then the linker doesn't find a lot of symbols, because several functions are not declared as stdcall. Your new version (https://github.com/qyot27/AviSynthPlus/commit/105bf2629538fe53d32d77afe1c34476a2c3a019) meanwhile works fine.
I mean that the last time I really checked (a few months ago - April, maybe? Possibly earlier), I could cross-compile AviSynth+ with MinGW-w64/GCC as either 32-bit or 64-bit. I can't speak for plugins, since the only one I really have anything to do with is the FFMS2 C-plugin (the capi.h there merely checks for _WIN32, not MSVC; pretty sure that's an emergency, basal version of the fix I did with the AVSC_WIN32_GCC32 define). I hadn't yet tried to see if that change breaks FFMS2, but I generally had/have a feeling that it might.

Taking a cursory look at AVSInpaint.c, though...ack. That thing needs a cleanup. I can't tell if it's even using the C interface correctly (comparing, as I mentioned above, to the FFMS2 C plugin; that might be a special case, though, and I've not taken any time thus far to see if I can get AssRender building with GCC to verify with it).

manolito

8th August 2018, 14:44

Now I am just curious how big the speed sacrifice using this non-SSE2 version vs. the standard version is. I still assume that the various filter plugins will be the bottleneck and not AviSynth itself...

So far nobody wanted to bite, so I did a couple of benchmarks myself... :cool:

Not representative statistically, I tried to keep it "real world" as much as I could. Interesting results, and I also have a few questions.

Test platform:
Lenovo T530, Core i5-3230M Ivy Bridge with 8 GB RAM. Pretty much middle class today (of course only entry level for Doom9 members). The CPU has 2 physical cores plus 2 virtual (Hyperthreading) cores.

AviSynth versions (32-bit only):
1: AVS 2.61 Alpha VC6 Build
2: AVS+ r2728 pinterf
3: AVS+ r2741 qyot27 Non-SSE2

Source file:
Downloaded HD clip @ 29.97 progressive.

I used 3 different scripts where the first 2 are common everyday scripts, the third one uses DCT=1 with MVTools2 and is way too slow for everyday use.

Script #1:
ConvertToYV12()
DegrainMedian(mode=2)
LSFMod()
Spline36Resize(720,576)
ChangeFPS(25)

Script #2:
ConvertToYV12()
Spline36Resize(720,576)
mx_fps(25) # a modded version of FrameRateConverter by MysteryX

Script #3:
Same as above except using
mx_fps(25, dct=1)

For AVS+ I tested both "Prefetch(2)" and "Prefetch(4)" at the end of the scripts. I also used "SetMTMode.avsi" which is linked at the AVS+ WIKI page.

And here comes my first question:
Do I copy this "SetMTMode.avsi" into the "plugins+" folder or into the "plugins" folder? I tried both and did not notice any difference, so I put it in the "plugins+" folder.

I deliberately did not just measure the results of the scripts in AVSMeter because I wanted to find the overall conversion speeds. My encoder was FFmpeg which uses all 4 cores by default.

Results:

AVS 2.61 Alpha:
Script #1: 30 fps
Script #2: 15 fps
Script #3: 1.0 fps

AVS+ r2728 pinterf:
Script #1: 31 fps (no difference between Prefetch(2) and Prefetch(4))
Script #2: 24 fps (Prefetch(2)) and 28 fps (Prefetch(4))
Script #3: 1.6 fps (Prefetch(2)) and 1.5 fps (Prefetch(4))

AVS+ r2741 qyot27 Non-SSE2
Script #1: 31 fps (no difference between Prefetch(2) and Prefetch(4))
Script #2: 24 fps (Prefetch(2)) and 28 fps (Prefetch(4))
Script #3: 1.6 fps (for both Prefetch(2) and Prefetch(4))

Conclusion:
1. The difference between standard AVS and AVS+ is very obvious, mainly when a complex script like mx_fps (which uses MVTools2 and MaskTools2) gets used.

2. There is almost no difference between the official pinterf version of AVS+ and the Non-SSE2 version by qyot27. On one occasion the qyot27 version is even slightly faster.

Which leads me to my second question:
Could it be that the qyot27 version does use the SSE2 capability of the CPU if the CPU supports it? If not then I would say that using only SIMD versions up to MMX and SSE does not necessarily slow down the conversions, at least not with my tests.

Any thoughts?

Cheers
manolito

qyot27

8th August 2018, 17:00

Could it be that the qyot27 version does use the SSE2 capability of the CPU if the CPU supports it? If not then I would say that using only SIMD versions up to MMX and SSE does not necessarily slow down the conversions, at least not with my tests.
I said almost exactly that when talking about how the /arch flag works.

Essentially, it works like this. Intrinsics or hand-written assembly code use SIMD instructions directly in discrete versions of a particular function (layer_sse4, layer_sse2, layer_avx, etc.). These are then included in a runtime CPU detection dispatcher which allows the program to select the appropriate one based on what the CPU supports. This always gets compiled, and the functions for the different paths are there no matter what. There are SSSE3 and SSE4 functions for some filters, there are AVX/AVX2 versions for others, all based on what gives the greatest boost and what anyone bothered to write.

Most compilers, however, have the ability to optimize the plain C versions during the build process so that the final binary can emit SIMD instructions at any time. MSVC controls this through the /arch: parameter, GCC does it through -march, -mtune, or -m[SIMD] flags used together or alone. These flags make it so that said CPU or SIMD is *required*, because it will use that code even in the parts not covered by the intrinsics or assembly.

So take Mask for example. In AviSynth+, this has multiple versions of the function:
mask_sse2
mask_core_mmx (not sure if this is actually just a dependency of the mask_mmx function below)
mask_mmx
mask_c

mask_sse2 and mask_mmx are written with intrinsics. They will always be compiled to SSE2 and MMX code, respectively. The dispatcher chooses the appropriate one based on what the CPU supports. If there isn't any support for SSE2, it uses MMX. If it supports neither, it uses the plain C version.
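The selection order described above can be sketched in C roughly as follows. This is a minimal illustration, not AviSynth+'s actual dispatcher: the SKETCH_CPU_* flags, the mask_fn type, select_mask and run_mask are hypothetical, and the three mask_* functions are stubs that only return a label instead of processing pixels.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU feature bits; a real build would fill these in from
   CPUID at run time. */
enum { SKETCH_CPU_MMX = 1, SKETCH_CPU_SSE2 = 2 };

typedef const char *(*mask_fn)(void);

/* Stubs for the three code paths named in the post. In the real filter
   the first two are written with intrinsics and the last is plain C. */
static const char *mask_sse2(void) { return "sse2"; }
static const char *mask_mmx(void)  { return "mmx"; }
static const char *mask_c(void)    { return "c"; }

/* Pick the best path the CPU supports: SSE2 first, then MMX, then C. */
static mask_fn select_mask(int cpu_flags)
{
    if (cpu_flags & SKETCH_CPU_SSE2) return mask_sse2;
    if (cpu_flags & SKETCH_CPU_MMX)  return mask_mmx;
    return mask_c;
}

/* Resolve the path for the given flags and call it. */
static const char *run_mask(int cpu_flags)
{
    mask_fn fn = select_mask(cpu_flags);
    assert(fn != NULL);
    return fn();
}
```

All three functions exist in the binary regardless of build flags; only the choice between them happens at run time. The /arch setting affects what the compiler generates inside the plain C path (and all other plain C code), not which path is selected.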

When MSVC has /arch:IA32 set, mask_c (the entire program's plain C code, actually) will be built without any optimizations. Left at its default, though, mask_c (and all the rest of the plain C code) will be optimized by MSVC itself to emit SSE2 instructions when it gets run. Fine for CPUs that have SSE2 already, not fine for ones that don't. This auto-optimization-by-compiler is generally not as thorough or fine-tuned as you'd get from either intrinsics or hand-written assembly, which is why those are still needed to get significant boosts in speed, but for functions which haven't yet had intrinsics or asm written for them, the auto-optimization is the best you can do.

manolito

8th August 2018, 22:00

Thanks, I think I finally got it... :devil:

Most compilers, however, have the ability to optimize the plain C versions during the build process so that the final binary can emit SIMD instructions at any time.

So the difference is only for plain C code which gets either converted to SIMD instructions (if the CPU supports it) or not. There should be no performance hit whatsoever using your version vs. using pinterf's version.

So why doesn't pinterf implement your CPU branching routine?

Cheers
manolito

qyot27

8th August 2018, 23:04

So the difference is only for plain C code which gets either converted to SIMD instructions (if the CPU supports it) or not.
No. Think of it like a plain vanilla/yellow cake. You want chocolate with the cake. The SIMD instructions that come from the dedicated intrinsics functions are like chocolate frosting - it's on top, makes it taste better, but if you got a slice and didn't want any of the frosting, you could scrape it off. This is the option that's 'it gets used if the CPU supports it'.

/arch and optimizing even the C parts with SIMD is making the cake itself a marble, or straight-up chocolate, cake - if you want to avoid the chocolate, you can't. It's in there, whether you like it or not. Whether the CPU supports it or not (and if the CPU doesn't support it, it crashes with an Illegal instruction error).

I did absolutely nothing to branch AviSynth+'s CPU support. The only difference between the typical builds pinterf has been providing and the one I posted a little bit ago is that I switched /arch back to SSE before building it, the way it was on the original AviSynth+ repo before ultim went on hiatus again (as you can see, it was last updated in August 2016, which is why pinterf's repo is the current development hub everyone points to now):
https://github.com/AviSynth/AviSynthPlus/blob/MT/CMakeLists.txt#L48

vs.

https://github.com/pinterf/AviSynthPlus/blob/MT/CMakeLists.txt#L74

All I did in my working branch (https://github.com/qyot27/AviSynthPlus/blob/dss_deps/CMakeLists.txt#L74) was make it so that when I go to build AviSynth+, I don't have to open CMakeLists.txt in Notepad2-mod and change it back to SSE. Instead, I can now pass -DCPU_ARCH=SSE (or -DCPU_ARCH=IA32, -DCPU_ARCH=AVX, or -DCPU_ARCH=AVX2) to the CMake command line, like any other configuration option and avoid having to open source files in text editors first.

As for why it hasn't shown up outside of my branch, A) I just whipped up that patch earlier this week or last week, and B) I've not opened a pull request for the changes on that branch yet.

Groucho2004

8th August 2018, 23:48

/arch and optimizing even the C parts with SIMD is making the cake itself a marble, or straight-up chocolate, cake
Mmmmh, marble cake...
https://s33.postimg.cc/dy0uso7m7/index.png

manolito

9th August 2018, 00:29

/arch and optimizing even the C parts with SIMD is making the cake itself a marble, or straight-up chocolate, cake - if you want to avoid the chocolate, you can't. It's in there, whether you like it or not. Whether the CPU supports it or not (and if the CPU doesn't support it, it crashes with an Illegal instruction error).

So your build makes sure the cake itself does not become a marble cake (avoiding a crash when the CPU does not support it). By switching back /arch to SSE you avoid optimizing C parts with SIMD.

But then why the hell is your build just as fast or even a little faster on my test computer which certainly does have SSE2? :confused:

qyot27

9th August 2018, 01:45

So your build makes sure the cake itself does not become a marble cake (avoiding a crash when the CPU does not support it). By switching back /arch to SSE you avoid optimizing C parts with SIMD.
Roughly. It avoids using SSE2 SIMD. If you tried using that build on something older than a Pentium-III, it would crash. The only way to fully disable it is by using /arch:IA32, but how many people with a Pentium-II, Pentium Pro, or i486/i386 are going to be running at least Windows XP just to be able to run that AviSynth.dll? Much less actually be using it for anything other than academic 'because I can' points?

The point is that /arch specifies the minimum instruction set the CPU supports, and because of that, it allows the compiler to use SIMD at or below that minimum setting when optimizing the C parts of the code during the build process.
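
To make the "minimum instruction set" idea concrete, here is a small sketch in Python. It is only a model: the ARCH_LEVELS list and the can_run helper are invented for the example, and the CPU classifications in the comments are the ones mentioned in this thread.

```python
# The 32-bit /arch levels mentioned in this thread, lowest to highest.
# Each level implies everything before it.
ARCH_LEVELS = ["IA32", "SSE", "SSE2", "AVX", "AVX2"]

def can_run(build_arch, cpu_max_level):
    """True if a binary built with /arch:<build_arch> is safe on a CPU
    whose highest supported level is cpu_max_level."""
    return ARCH_LEVELS.index(build_arch) <= ARCH_LEVELS.index(cpu_max_level)

print(can_run("SSE", "SSE"))    # Pentium III with an /arch:SSE build
print(can_run("SSE2", "SSE"))   # Pentium III with an /arch:SSE2 build
print(can_run("SSE", "AVX"))    # Ivy Bridge with an /arch:SSE build
print(can_run("SSE", "IA32"))   # anything older than a Pentium III
```

When can_run is False, the compiler-generated code can hit an instruction the CPU does not have, which is the Illegal instruction crash. The runtime dispatcher for the intrinsics functions is a separate mechanism and does not protect against that.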

I mean, I could probably throw up a build with all the intrinsics disabled so you'd be forced to use the C versions and see directly how well MSVC optimizes stuff. I think it's just disabling a couple of defines, but I'm not sure.

But then why the hell is your build just as fast or even a little faster on my test computer which certainly does have SSE2? :confused:
MSVC might optimize for MMX/SSE a bit better in spots for a 32-bit build compared to SSE2, but largely it would be because on an Ivy Bridge, you wouldn't be using the C versions of anything much/at all (for either pinterf's build or mine). The C code might still be used in some non-filter areas that linger in the background, possibly. If I had to guess (based on pinterf's comment in CMakeLists.txt), it is the high bit depth stuff where you would see the biggest difference between the two builds. I'm not sure how much of it has intrinsics now, so there may be a higher proportion of it that has to rely on the compiler doing the optimization on plain C code.

manolito

9th August 2018, 02:22

Alright, this answers most of my questions, thanks...

All the high bit depth and high color stuff is not for me. I am just too old for all this UHD / HDR / 8K stuff; my viewing device is a 4:3 CRT TV set with natural colors that I have so far never seen on an LCD. I also refuse to converge computer (for working) and TV (for entertainment) stuff.
So all I am interested in for AVS+ is the speed gain from MT.

Thanks again
manolito

jpsdr

9th August 2018, 08:38

There were plasma screens (Pioneer Kuro) which were very good for color. SED/FED died before being born, but now there is OLED, which I think will provide good color for CRT people, as I was. Like you, I've never liked LCD, but my Pioneer Kuro plasma has given me satisfaction, and the day I have to replace it (because it will happen, the later the better, I hope), I think OLED will satisfy me.

pinterf

9th August 2018, 10:49

If I had to guess (based on pinterf's comment in CMakeLists.txt), it is the high bit depth stuff where you would see the biggest difference between the two builds. I'm not sure how much of it has intrinsics now, so there may be a higher proportion of it that has to rely on the compiler doing the optimization on plain C code.
Yeah, as I wrote, most of the 10+ bit stuff was in pure C; nowadays most of it is optimized with SIMD intrinsics. The 32-bit build can probably go back to the /arch:SSE option, because those who really need speed (in general, and especially for the 10+ bit depth option) are already using the x64 toolchain, I guess.

Atak_Snajpera

9th August 2018, 12:07

I'm just curious: why is Prefetch with a value equal to the number of physical cores faster than with the number of logical processors?

It does not matter whether it is a Xeon 8C/16T or a Ryzen 8C/16T. The result is always the same.

Script
#VideoSource
LoadPlugin("C:\Users\Dave\Documents\Delphi_Projects\RipBot264\_Compiled\Tools\AviSynth plugins\ffms\ffms_latest\x64\ffms2.dll")
video=FFVideoSource("C:\Temp\RipBot264temp\job1\video.mkv",cachefile = "C:\Temp\RipBot264temp\job1\video.mkv.ffindex")
#Deinterlace

#Resize
LoadPlugin("C:\Users\Dave\Documents\Delphi_Projects\RipBot264\_Compiled\Tools\AviSynth plugins\Plugins_JPSDR\Plugins_JPSDR.dll")
video=Spline36ResizeMT(video,1920,1080,SetAffinity=false).Sharpen(0.2)

#Tonemap
Loadplugin("C:\Users\Dave\Documents\Delphi_Projects\RipBot264\_Compiled\Tools\AviSynth plugins\avsresize\avsresize.dll")
Loadplugin("C:\Users\Dave\Documents\Delphi_Projects\RipBot264\_Compiled\Tools\AviSynth plugins\DGTonemap\x64\DGTonemap.dll")
video=z_ConvertFormat(video,pixel_type="RGBPS",colorspace_op="2020ncl:st2084:2020:l=>rgb:linear:2020:l", dither_type="none").DGHable
video=z_ConvertFormat(video,pixel_type="YV12",colorspace_op="rgb:linear:2020:l=>709:709:709:l",dither_type="ordered")

#Prefetch
video=Prefetch(video,X)

#Return
return video

Prefetch(16)
AVSMeter 2.8.1 (x64) - Copyright (c) 2012-2018, Groucho2004
AviSynth+ 0.1 (r2728, MT, x86_64) (0.1.0.0)

Number of frames: 7935
Length (hh:mm:ss.ms): 00:02:12.382
Frame width: 1920
Frame height: 1080
Framerate: 59.940 (60000/1001)
Colorspace: YV12

Frames processed: 7935 (0 - 7934)
FPS (min | max | average): 0.147 | 1000000 | 30.50
Memory usage (phys | virt): 2099 | 2124 MiB
Thread count: 65
CPU usage (average): 53%

Time (elapsed): 00:04:20.122

Prefetch(8)

AVSMeter 2.8.1 (x64) - Copyright (c) 2012-2018, Groucho2004
AviSynth+ 0.1 (r2728, MT, x86_64) (0.1.0.0)

Number of frames: 7935
Length (hh:mm:ss.ms): 00:02:12.382
Frame width: 1920
Frame height: 1080
Framerate: 59.940 (60000/1001)
Colorspace: YV12

Frames processed: 7935 (0 - 7934)
FPS (min | max | average): 0.339 | 944590 | 33.17
Memory usage (phys | virt): 1591 | 1616 MiB
Thread count: 57
CPU usage (average): 53%

Time (elapsed): 00:03:59.189

The same happens with MDegrain2 or QMTC.

Myrsloik

9th August 2018, 13:49

This is a general answer that applies to most things.

You can easily saturate the total memory bandwidth with fewer threads than the number of logical processors. Especially (A)VS, which processes full frames instead of tiles/lines, quickly reaches that level. And once you have more threads than physical cores, each thread gets less cache too... which means each thread is even more likely to have to access RAM and wait even longer for it.

It's possible that the default number of threads should be something like max(physical cores, min(logical threads, 8)) for x86.

manolito

9th August 2018, 16:40

It's possible that the default number of threads should be something like max(physical cores, min(logical threads, 8)) for x86.

Not true for my Core i5-3230M Ivy Bridge with 8 GB RAM. According to this formula I should use Prefetch(2), but my tests (latest AVS+ 32-bit) showed that in most cases Prefetch(4) is significantly faster.

Myrsloik

9th August 2018, 16:42

Not true for my Core i5-3230M Ivy Bridge with 8 GB RAM. According to this formula I should use Prefetch(2), but my tests showed that in most cases Prefetch(4) is significantly faster.

My formula gives 4. I don't see the problem here. The whole thing was pulled out of my butt so no guarantees that it's optimal.

amichaelt

9th August 2018, 16:54

Not true for my Core i5-3230M Ivy Bridge with 8 GB RAM. According to this formula I should use Prefetch(2), but my tests (latest AVS+ 32-bit) showed that in most cases Prefetch(4) is significantly faster.

I don't think you applied the formula correctly.

The formula in your case would work through the following steps:

max(2 physical cores, min(4 logical threads, 8 threads))

Min of 4 and 8 is 4.

max(2 physical cores, 4 logical threads)

Max between 2 and 4 would be 4. So, his back-of-the-envelope formula gave you exactly what you claim was the faster number.
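
The same steps can be written as a tiny Python function. This is just the back-of-the-envelope formula from the quote, not anything AviSynth+ actually does; the function name and the cap parameter are made up for the example.

```python
def default_threads(physical_cores, logical_threads, cap=8):
    """Myrsloik's suggested default: max(physical, min(logical, cap))."""
    return max(physical_cores, min(logical_threads, cap))

# 2C/4T, e.g. the Core i5-3230M discussed above
print(default_threads(2, 4))
# a 6C/12T CPU
print(default_threads(6, 12))
# an 8C/16T CPU
print(default_threads(8, 16))
```

For a 2C/4T CPU this gives 4; for 6C/12T and 8C/16T it gives 8, because the cap of 8 only stops applying once there are more than 8 physical cores.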

Atak_Snajpera

9th August 2018, 17:19

That formula would return 8 for an 8700K/Ryzen 2600 instead of 6. I have disabled 2 cores on my Xeon E5-2690 in the BIOS, and again the test showed that a Prefetch value equal to the number of cores is better.

Prefetch(8)
AVSMeter 2.8.1 (x64) - Copyright (c) 2012-2018, Groucho2004
AviSynth+ 0.1 (r2728, MT, x86_64) (0.1.0.0)

Number of frames: 7935
Length (hh:mm:ss.ms): 00:02:12.382
Frame width: 1920
Frame height: 1080
Framerate: 59.940 (60000/1001)
Colorspace: YV12

Frames processed: 7935 (0 - 7934)
FPS (min | max | average): 0.131 | 944593 | 22.98
Memory usage (phys | virt): 1422 | 1446 MiB
Thread count: 45
CPU usage (average): 60%

Time (elapsed): 00:05:45.265

Prefetch(6)
AVSMeter 2.8.1 (x64) - Copyright (c) 2012-2018, Groucho2004
AviSynth+ 0.1 (r2728, MT, x86_64) (0.1.0.0)

Number of frames: 7935
Length (hh:mm:ss.ms): 00:02:12.382
Frame width: 1920
Frame height: 1080
Framerate: 59.940 (60000/1001)
Colorspace: YV12

Frames processed: 7935 (0 - 7934)
FPS (min | max | average): 0.132 | 708445 | 25.73
Memory usage (phys | virt): 1281 | 1305 MiB
Thread count: 43
CPU usage (average): 60%

Time (elapsed): 00:05:08.385

Atak_Snajpera

9th August 2018, 17:51

Another test, this time with 6 cores disabled, leaving only 2C/4T.
Prefetch(4)
AVSMeter 2.8.1 (x64) - Copyright (c) 2012-2018, Groucho2004
AviSynth+ 0.1 (r2728, MT, x86_64) (0.1.0.0)

Number of frames: 7935
Length (hh:mm:ss.ms): 00:02:12.382
Frame width: 1920
Frame height: 1080
Framerate: 59.940 (60000/1001)
Colorspace: YV12

Frames processed: 7935 (0 - 7934)
FPS (min | max | average): 0.092 | 944596 | 10.70
Memory usage (phys | virt): 784 | 808 MiB
Thread count: 17
CPU usage (average): 89%

Time (elapsed): 00:12:21.689

Prefetch(2)
AVSMeter 2.8.1 (x64) - Copyright (c) 2012-2018, Groucho2004
AviSynth+ 0.1 (r2728, MT, x86_64) (0.1.0.0)

Number of frames: 7935
Length (hh:mm:ss.ms): 00:02:12.382
Frame width: 1920
Frame height: 1080
Framerate: 59.940 (60000/1001)
Colorspace: YV12

Frames processed: 7935 (0 - 7934)
FPS (min | max | average): 3.012 | 314865 | 14.37
Memory usage (phys | virt): 620 | 644 MiB
Thread count: 15
CPU usage (average): 85%

Time (elapsed): 00:09:12.192

I've noticed that a script with a Prefetch value higher than the number of physical cores has a tendency to choke from time to time (see min. fps).
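
For a quick comparison, the average FPS figures from the AVSMeter runs in this thread can be turned into relative gains. A small Python sketch; the labels and the gain_percent helper are just for illustration, the FPS numbers are copied from the logs.

```python
# (configuration, avg FPS with Prefetch = logical threads,
#  avg FPS with Prefetch = physical cores), taken from the AVSMeter logs
results = [
    ("8C/16T", 30.50, 33.17),
    ("6C/12T", 22.98, 25.73),
    ("2C/4T", 10.70, 14.37),
]

def gain_percent(fps_logical, fps_physical):
    """Relative speed-up of the physical-core setting, in percent."""
    return (fps_physical / fps_logical - 1.0) * 100.0

for label, fps_logical, fps_physical in results:
    gain = gain_percent(fps_logical, fps_physical)
    print(f"{label}: Prefetch(physical cores) is {gain:.1f}% faster")
```

That works out to roughly 9%, 12% and 34% in favor of the physical core count for this particular script, so the gap gets bigger as the core count goes down.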