JPSDR Avisynth's plugins pack - Page 7

dREV · 24th November 2020, 03:15

Hi, I wanted to ask which of the NNEDI builds would be best to use on my PC. It's an AMD Ryzen 5 (2nd generation, I think; either 6 or 12 cores, not sure) with 16 GB of RAM, running 64-bit Windows 7 with MeGUI and AviSynth+ 3.6.1 (x86 version).

I tried reading the readme.txt but there isn't much info there. I've been using the folder marked "Release_W7_AVX2" with no issues; I'm not sure about the other ones, though.

I was also going to ask a question about the prefetch, but it seems really complicated. I've been trying to understand it, and more than likely I've been doing it wrong, seeing as I've had it set to "prefetch=1" according to your multithreading.txt file. I'll just use the default from now on, as my fps is a lot higher than when I try the prefetch.

FranceBB · 24th November 2020, 09:45

AFAIK AMD Ryzen supports instruction sets up to AVX2, so you're probably already using the best possible build.
Please note, though, that since it's coded in C++, instructing the compiler to use up to AVX2 doesn't always translate into better speed.
This is because it's the compiler that tries to understand what the programmer is doing and emit the corresponding assembly optimizations using the available instruction sets, so it might not make use of them at all (it can happen) or might even misunderstand and generate slightly slower code (very rare, but it can happen; if you look at other posts here on Doom9 for other plugins, there have been times when some builds were faster than others while it was supposed to be the other way round).
Anyway, as long as everything behaves correctly and logically, you're already using the fastest build.

If you want, though, you can benchmark the various builds with AVSMeter. In your case I would benchmark two builds in particular, Clang W7 AVX2 and W7 AVX2, so that you can see whether Visual Studio or Clang/LLVM produced the faster build.
I generally stick with the Visual Studio ones, but many people say that the Clang ones are faster on their machines, so I guess it's worth giving them a shot.

Boulder · 24th November 2020, 10:49

Clang builds of various plugins have generally been faster on my Zen (1 & 2) systems.

jpsdr · 24th November 2020, 18:06

If you're not using the AviSynth MT part (i.e. Prefetch in your script), there's no need to set prefetch.
If you're using Prefetch in your script, the best would be to have prefetch*threads=CPU.

StainlessS · 25th November 2020, 03:35

Quote:

prefetch*threads=CPU

What exactly might that mean [x*y="AMD Ryzen 5600X", or maybe something else]?

jpsdr · 25th November 2020, 19:46

CPU, core, it's the same...
So, CPU=core number.

StainlessS · 25th November 2020, 20:41

so physical cores.
Thank you.

jpsdr · 26th November 2020, 20:48

I think the optimal value is probably somewhere between the number of physical and logical cores. But this optimal value will probably never be the same from one person to the next...
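
The rule of thumb prefetch*threads=CPU can be written down as a small calculation. The following is only a sketch: `threads_per_filter` is a hypothetical helper, not something shipped with the plugin pack, and whether `cores` should be the physical or the logical count is exactly the tuning question raised above.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not part of the plugin pack): given a core count and
// the Prefetch value used in the script, return a per-filter thread count
// such that prefetch * threads does not exceed the core count.
// The result is never lower than 1, even when prefetch is larger than cores.
int threads_per_filter(int cores, int prefetch) {
    assert(cores >= 1 && prefetch >= 1);
    return std::max(1, cores / prefetch);
}
```

For example, with 12 cores and a Prefetch value of 4 this suggests 3 threads per filter. It is a starting point for benchmarking with AVSMeter, not a guaranteed optimum.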

larisk2 · 28th November 2020, 20:51

I have a video card from ATI. Please recommend a high-quality deinterlacing filter / plugin for AviSynth. I tried different ones, but I didn't like the quality of the result.

jpsdr · 29th November 2020, 11:20

Deinterlacing is not really my thing, so personally I can't really advise (my only use of nnedi3 is nnedi3_rpow for upsampling). Even more so if you've already tested the classic AviSynth ones (nnedi3, QTGMC, ...); I realise that if there are others, I don't know them.

DTL · 30th November 2020, 12:17

It looks like there is a memory corruption bug or buffer overrun somewhere when processing very small buffers: https://forum.doom9.org/showthread.php?t=182108

If a source image of about 1200x720 is downsized by 10x to 120x72 and then upsized by 8..10x, we get buggy blue pixels at the bottom and also unstable corrupted pixels at the bottom (the pattern of corrupted pixels may vary between runs), and the program may also crash with a memory protection error (like illegal writing to...). Of course, processing such small buffers is not a common task, but if the programmer has time it would be good to look for the cause of the bug.

I remember there is an assert somewhere in the resampler that refuses to process buffers that are too small for a too-large 'support' or tap count. Maybe the bug is somewhere close to it: the limits of the assert may be too lax, so the processing engine still runs past the end of the buffers, reads memory holding other data, sometimes attempts to write beyond the reserved page boundary, and finally triggers a hardware memory protection error.
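
A guard of the kind described above could look like the sketch below. It assumes that, when downsizing, the kernel support is stretched by the inverse of the scale factor, so the number of source samples read per output sample grows. The names `fir_filter_size` and `kernel_fits` are hypothetical and are not claimed to match the actual Avisynth source.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Number of source samples one output sample reads along one axis.
// When downsizing (scale < 1) the kernel is stretched by 1 / scale;
// when upsizing the kernel keeps its nominal support.
int fir_filter_size(double support, int src_size, int dst_size) {
    assert(support > 0.0 && src_size > 0 && dst_size > 0);
    const double scale = static_cast<double>(dst_size) / src_size;
    const double step = std::min(scale, 1.0);
    return static_cast<int>(std::ceil(support / step * 2.0));
}

// True when the stretched kernel still fits inside the source axis.
// A resampler would refuse to run (or clamp its reads) when this is false.
bool kernel_fits(double support, int src_size, int dst_size) {
    return fir_filter_size(support, src_size, dst_size) <= src_size;
}
```

With a support of 1 (bilinear-like), shrinking 1200 samples to 120 needs a 20-sample window, which fits easily. A check like this passing does not rule out an overrun elsewhere, which would be consistent with the bug showing up even at sizes the assert accepts.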

jpsdr · 1st December 2020, 18:44

Is it only on ResampleMT, or also on standard resample ?

DTL · 1st December 2020, 23:10

I changed the resizers to the 'standard' GaussResize and BilinearResize, and the result is just as buggy. So the bug is in the main Avisynth resample engine (which is used in ResampleMT too)? I'm posting a bug description with the simplest reproduction script to the main Avisynth+ thread.

real.finder · 2nd December 2020, 13:09

Since SincLin2Resize and SinPowResizeMT were added: are they like NoHalo and LoHalo? If not, can those be added? It seems there are others too (LoBlur and LoJaggy).

edit: there is also JincResize, maybe worth adding too

DTL · 2nd December 2020, 23:02

"SincLin2Resize and SinPowResizeMT were added: are they like NoHalo and LoHalo?"

No. They are small additions to 'linear' signal processing based on sinc and the Nyquist theorem. SincLin2 is simply a workaround for the computational errors of SincResize with the too-small number of taps typically used. They just add a small step towards completing the tools for '1D' linear signal processing. Better 2D image processing requires a step to significantly different '2D math', like EWA/Jinc and others.

"there is also JincResize, maybe worth adding too"

JincResize is from a completely different 'true-2D' family of resizers. It is based on a completely different resampling engine. All ResampleMT resizers, including SincLin2 and SinPow, use one and the same V+H 1D+1D processing engine (resampler), taken from standard Avisynth, with just MT added. SincLin2 and SinPow are just very small kernel-generation functions added on top.

Also, the only JincResize for Avisynth known here is still unstable and buggy and needs more developer resources to be usable. So it is very hard to add it to ResampleMT with all the MT functionality.

real.finder · 3rd December 2020, 08:13

Quote:

Originally Posted by DTL

Also, the only JincResize for Avisynth known here is still unstable and buggy and needs more developer resources to be usable.

even this https://github.com/Asd-g/AviSynth-JincResize ?

DTL · 3rd December 2020, 19:59

Quote:

Originally Posted by real.finder

even this https://github.com/Asd-g/AviSynth-JincResize ?

This one looks more stable. I tested the 0.x versions, and the 1.x versions ported from VapourSynth look more stable. However, it outputs significantly different results with different 'tap' values. It also only works with planar formats. And it looks useful only for upsampling (it does not seem to have a corresponding 'true-2D' downsampling function for production work, like the complementary pair of SinPow (downsample) / Sinc (upsample) resizers).
Maybe a separate thread for this plugin exists on the forum?

Do you think it would gain significant speed from internal multithreading?

jpsdr · 4th December 2020, 18:40

From what I've noticed, the more computation there is using data from a small source area (i.e. fitting in cache), the more you can gain with MT, and the more you can gain by increasing the number of cores.

DTL · 4th December 2020, 20:26

For 1-pass Jinc-family resamplers, I think the direct 2D convolution of a 2D kernel with a 2D image buffer stored line by line may suffer significantly from long-stride memory access and from cache pollution by unused prefetched data. So there may be different schemes for assigning MT tasks to different cores. Maybe even many threads processing different but neighbouring input sample steps of one input buffer area (rather than different areas of the input buffer or different frames of the input sequence), so there will be fewer long-stride memory reads. But syncing the threads may be harder, and the time lost on thread syncing may be significant too. The main idea is to somehow sync the threads that process neighbouring input samples, so that the processed area of the image buffer is cached once and is available to many cores.

For example, with 2 cores:

Code:

template <typename T>
static void resize_plane_c(EWAPixelCoeff* coeff, const void* src_, void* VS_RESTRICT dst_,
    int dst_width, int dst_height, int src_stride, int dst_stride, float peak)
{
    // Assumes one meta entry per output pixel, stored row by row.
    EWAPixelCoeffMeta* meta = coeff->meta;

    const T* srcp = reinterpret_cast<const T*>(src_);
    T* VS_RESTRICT dstp = reinterpret_cast<T*>(dst_);

    // Strides arrive in bytes; convert them to element counts.
    src_stride /= sizeof(T);
    dst_stride /= sizeof(T);

    for (int y = 0; y < dst_height; y++)
    {
// threads sync start point
// core 1 processes the even output columns
        for (int x = 0; x < dst_width; x += 2)
        {
            const EWAPixelCoeffMeta* m = meta + x;
            const T* src_ptr = srcp + m->start_y * static_cast<int64_t>(src_stride) + m->start_x;
            const float* coeff_ptr = coeff->factor + m->coeff_meta;

            float result = 0.f;

            for (int ly = 0; ly < coeff->filter_size; ly++)
            {
                for (int lx = 0; lx < coeff->filter_size; lx++)
                {
                    result += src_ptr[lx] * coeff_ptr[lx];
                }
                coeff_ptr += coeff->coeff_stride;
                src_ptr += src_stride;
            }

            if constexpr (!std::is_same_v<T, float>)
                dstp[x] = static_cast<T>(lrintf(clamp(result, 0.f, peak)));
            else
                dstp[x] = result;
        }
// core 2 processes the odd output columns (very close in x to core 1, so both cores share almost the same src_ptr[lx] memory area)
        for (int x = 1; x < dst_width; x += 2)
        {
            const EWAPixelCoeffMeta* m = meta + x;
            const T* src_ptr = srcp + m->start_y * static_cast<int64_t>(src_stride) + m->start_x;
            const float* coeff_ptr = coeff->factor + m->coeff_meta;

            float result = 0.f;

            for (int ly = 0; ly < coeff->filter_size; ly++)
            {
                for (int lx = 0; lx < coeff->filter_size; lx++)
                {
                    result += src_ptr[lx] * coeff_ptr[lx];
                }
                coeff_ptr += coeff->coeff_stride;
                src_ptr += src_stride;
            }

            if constexpr (!std::is_same_v<T, float>)
                dstp[x] = static_cast<T>(lrintf(clamp(result, 0.f, peak)));
            else
                dstp[x] = result;
        }
// threads end

        // Both loops index meta by x, so advance it one full row here.
        meta += dst_width;
        dstp += dst_stride;
    }
}

We can start by profiling to see whether the current resampler in JincResize is CPU-limited or memory-limited.

Mathematically it looks like this:

The old, fast resampler built into Avisynth, like many other resamplers, appears to perform a 1Dx1D convolution twice, while a 'true-2D' resampler performs a 2Dx2D convolution once. But 2Dx2D requires more MUL+ADD operations, so it is significantly slower.
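
The difference in MUL+ADD counts can be illustrated with a toy example. This is only a sketch under simplifying assumptions: a plain convolution over the "valid" region with a separable kernel (the outer product of a 1D kernel with itself), so both paths produce the same output and only the operation count differs. A real Jinc/EWA kernel is radially symmetric and not separable, which is why it cannot take the two-pass shortcut. All names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Image = std::vector<std::vector<double>>;

struct Result {
    Image out;
    std::size_t macs;  // multiply-add operations counted in the inner loops
};

// Single pass with the n x n kernel formed as the outer product k[j] * k[i].
Result convolve_2d(const Image& src, const std::vector<double>& k) {
    const std::size_t n = k.size();
    assert(n > 0 && src.size() >= n && src[0].size() >= n);
    const std::size_t out_h = src.size() - n + 1;
    const std::size_t out_w = src[0].size() - n + 1;
    Result r{Image(out_h, std::vector<double>(out_w, 0.0)), 0};
    for (std::size_t y = 0; y < out_h; ++y)
        for (std::size_t x = 0; x < out_w; ++x)
            for (std::size_t j = 0; j < n; ++j)
                for (std::size_t i = 0; i < n; ++i) {
                    // A real resampler would precompute the 2D weight.
                    r.out[y][x] += src[y + j][x + i] * (k[j] * k[i]);
                    ++r.macs;
                }
    return r;
}

// Two passes: horizontal over every source row, then vertical.
Result convolve_separable(const Image& src, const std::vector<double>& k) {
    const std::size_t n = k.size();
    assert(n > 0 && src.size() >= n && src[0].size() >= n);
    const std::size_t out_h = src.size() - n + 1;
    const std::size_t out_w = src[0].size() - n + 1;
    std::size_t macs = 0;
    Image tmp(src.size(), std::vector<double>(out_w, 0.0));
    for (std::size_t y = 0; y < src.size(); ++y)
        for (std::size_t x = 0; x < out_w; ++x)
            for (std::size_t i = 0; i < n; ++i) {
                tmp[y][x] += src[y][x + i] * k[i];
                ++macs;
            }
    Image out(out_h, std::vector<double>(out_w, 0.0));
    for (std::size_t y = 0; y < out_h; ++y)
        for (std::size_t x = 0; x < out_w; ++x)
            for (std::size_t j = 0; j < n; ++j) {
                out[y][x] += tmp[y + j][x] * k[j];
                ++macs;
            }
    return Result{out, macs};
}
```

Per output pixel the single-pass version costs about n*n multiply-adds against roughly 2*n for the two-pass version, so the gap widens quickly as the tap count grows.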

jpsdr · 5th December 2020, 09:46

If I do an MT version, it will be like all the others: splitting the image horizontally.
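
As an illustration of that kind of split, here is a minimal sketch that assumes "splitting horizontally" means cutting the frame into horizontal bands of rows, one band per thread. `split_rows` is a hypothetical helper, not code from the plugin pack.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: split `height` rows into contiguous bands, returned
// as [first_row, end_row) pairs. Band sizes differ by at most one row, and
// no more bands are created than there are rows.
std::vector<std::pair<int, int>> split_rows(int height, int threads) {
    assert(height >= 1 && threads >= 1);
    const int bands = threads < height ? threads : height;
    const int base = height / bands;
    const int extra = height % bands;  // the first `extra` bands get one more row
    std::vector<std::pair<int, int>> out;
    int row = 0;
    for (int b = 0; b < bands; ++b) {
        const int rows = base + (b < extra ? 1 : 0);
        out.emplace_back(row, row + rows);
        row += rows;
    }
    return out;
}
```

Each worker thread would then run the resize loop only for its own range of output rows. This keeps every thread's writes disjoint, at the cost of neighbouring bands reading some of the same source rows.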
