24th November 2020, 03:15   #121
dREV
Registered User
Join Date: Jan 2019
Location: Antarctica
Posts: 74
About NNEDI & Prefetch

Hi, I wanted to ask which of the NNEDI builds would be best to use on my PC. It's an AMD Ryzen 5 (2nd generation); I think it has either 6 or 12 cores, not sure, with 16 GB RAM, running Windows 7 64-bit, using MeGUI and AviSynth+ 3.6.1 x86.

I tried reading the readme.txt, but there isn't much info there. I've been using the folder marked "Release_W7_AVX2" with no issues; I'm not sure about the other ones, though.

I was also going to ask a question about the prefetch, but it seems really complicated. I've been trying to understand it, and more than likely I've been doing it wrong, seeing as I've had it set to "prefetch=1" according to your multithreading.txt file. I'll just use the default from now on, as my fps is a lot higher than when I try the prefetch.

24th November 2020, 09:45   #122
FranceBB
Broadcast Encoder
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,903
AFAIK, AMD Ryzen CPUs support instruction sets up to AVX2, so you're probably already using the best possible build.
Please note, though, that since it's written in C++, instructing the compiler to use instructions up to AVX2 doesn't always translate into better speed.
This is because it's the compiler that has to work out what the programmer is doing and generate the corresponding optimized code for the available instruction sets, so it might not make use of them anyway (it can happen), or it might even get it wrong and generate slightly slower code (very rare, but it can happen: if you look at other posts here on Doom9 about other plugins, there have been times when some builds were faster than others when it was supposed to be the other way round).
Anyway, as long as everything behaves as expected, you're already using the fastest build.

If you want, though, you can benchmark the various builds with AVSMeter. In your case I would benchmark two builds in particular, Clang W7 AVX2 and W7 AVX2, so that you can see whether Visual Studio or Clang/LLVM produced the faster build.
I generally stick with the Visual Studio ones, but many people say that the Clang ones are faster on their machines, so I guess it's worth giving them a shot.
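If you want to confirm which SIMD level your CPU actually reports before picking a build, here is a minimal sketch using the GCC/Clang builtin `__builtin_cpu_supports`. This is an assumption about tooling: MSVC (and clang-cl) have no such builtin and would need `__cpuid` from `<intrin.h>` instead, so in that case the sketch simply returns "unknown", as it does on non-x86 targets. The function name `highest_simd_level` is hypothetical.

```cpp
#include <string>

// Returns the highest of a few x86 SIMD levels the running CPU reports.
// GCC/Clang (GNU mode) only: __builtin_cpu_supports is a compiler builtin.
std::string highest_simd_level() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // harmless here; required only before constructors run
    if (__builtin_cpu_supports("avx512f")) return "AVX512";
    if (__builtin_cpu_supports("avx2")) return "AVX2";
    if (__builtin_cpu_supports("avx")) return "AVX";
    if (__builtin_cpu_supports("sse4.1")) return "SSE4.1";
    if (__builtin_cpu_supports("sse2")) return "SSE2";
    return "baseline";
#else
    return "unknown";  // other compilers or non-x86 targets
#endif
}
```

This only tells you what the CPU can run; as said above, which build is actually fastest on a given machine still has to be measured (e.g. with AVSMeter).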

24th November 2020, 10:49   #123
Boulder
Pig on the wing
Join Date: Mar 2002
Location: Finland
Posts: 5,731
Clang builds of various plugins have generally been faster on my Zen (1 & 2) systems.
__________________
And if the band you're in starts playing different tunes
I'll see you on the dark side of the Moon...

24th November 2020, 18:06   #124
jpsdr
Registered User
Join Date: Oct 2002
Location: France
Posts: 2,316
If you're not using the AviSynth MT part (i.e. Prefetch in your script), there's no need to set the prefetch.
If you're using Prefetch in your script, the best would be to have prefetch*threads=CPU.
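As a minimal sketch of that rule of thumb, assuming "CPU" means the core count (as clarified later in the thread) and that the plugin's internal thread count is chosen to fill whatever the script's Prefetch value leaves over. Both helper names are hypothetical, not part of any plugin.

```cpp
#include <algorithm>
#include <thread>

// Rule of thumb prefetch * threads = cores: given the script's Prefetch value,
// pick the plugin's internal thread count (at least 1).
int threads_per_instance(int cores, int prefetch) {
    if (cores < 1) cores = 1;
    if (prefetch < 1) prefetch = 1;  // assumption: no Prefetch means one instance gets all cores
    return std::max(1, cores / prefetch);
}

// std::thread::hardware_concurrency() reports logical processors and may return 0
// when unknown, so it is only a starting point if "cores" means physical cores.
int logical_cores_hint() {
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : static_cast<int>(n);
}
```

For example, with 6 cores and Prefetch(2) this suggests 3 internal threads per instance.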
__________________
My github.

25th November 2020, 03:35   #125
StainlessS
HeartlessS Usurer
Join Date: Dec 2009
Location: Over the rainbow
Posts: 10,980
Quote:
prefetch*threads=CPU
What exactly might that mean [x*y = "AMD Ryzen 5600X", or maybe something else]?
__________________
I sometimes post sober.
StainlessS@MediaFire ::: AND/OR ::: StainlessS@SendSpace

"Some infinities are bigger than other infinities", but how many of them are infinitely bigger ???

25th November 2020, 19:46   #126
jpsdr
Registered User
Join Date: Oct 2002
Location: France
Posts: 2,316
CPU, core, it's the same...
So, CPU = number of cores.

25th November 2020, 20:41   #127
StainlessS
HeartlessS Usurer
Join Date: Dec 2009
Location: Over the rainbow
Posts: 10,980
So, physical cores.
Thank you.

26th November 2020, 20:48   #128
jpsdr
Registered User
Join Date: Oct 2002
Location: France
Posts: 2,316
I think the optimal value is probably somewhere between the number of physical and logical cores. But this optimal value will probably differ from one person to another...
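If the best total lies somewhere between the physical and logical core counts, one way to narrow the benchmarking is to list, for a fixed internal thread count, the Prefetch values whose product with it falls in that range. A minimal sketch; `prefetch_candidates` is a hypothetical helper, and each candidate would still need to be timed (e.g. with AVSMeter).

```cpp
#include <vector>

// For a fixed per-instance thread count `threads`, list the Prefetch values p
// with physical_cores <= p * threads <= logical_cores.
std::vector<int> prefetch_candidates(int physical_cores, int logical_cores, int threads) {
    std::vector<int> out;
    if (threads <= 0) return out;
    const int lo = (physical_cores + threads - 1) / threads;  // ceil(physical / threads)
    const int hi = logical_cores / threads;                   // floor(logical / threads)
    for (int p = (lo < 1 ? 1 : lo); p <= hi; ++p)
        out.push_back(p);
    return out;  // empty if no integer Prefetch lands in the range
}
```

For a 6-core/12-thread CPU and 2 internal threads, this gives Prefetch values 3 to 6.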

28th November 2020, 20:51   #129
larisk2
Registered User
Join Date: Nov 2020
Posts: 7
I have an ATI video card. Please recommend a high-quality deinterlacing filter/plugin for AviSynth. I've tried several, but I didn't like the quality of the results.

29th November 2020, 11:20   #130
jpsdr
Registered User
Join Date: Oct 2002
Location: France
Posts: 2,316
Deinterlacing is not really my area, so personally I can't really advise (my only use of nnedi3 is nnedi3_rpow for upsampling). Even more so if you've already tested AviSynth's classic ones (nnedi3, QTGMC, ...; I realize that if there are others, I don't know them).

30th November 2020, 12:17   #131
DTL
Registered User
Join Date: Jul 2018
Posts: 1,063
It looks like there is a memory corruption bug or buffer overrun somewhere when processing very small buffers: https://forum.doom9.org/showthread.php?t=182108

If a source image of about 1200x720 is downsized by a factor of 10 (to 120x72) and then upsized 8-10x, we get buggy blue pixels at the bottom, as well as unstable corrupted pixels at the bottom (the pattern of corrupted pixels may vary between runs), and the program may also crash with a memory protection error (like illegal writing to...). Of course, processing such small buffers is not a common task, but if a programmer has time, it would be good to find the cause of the bug.

I remember there is an assert somewhere in the resampler that refuses to process buffers that are too small for a large 'support' or number of taps. Maybe the bug is somewhere close to that: the assert's limits may be too permissive, so the processing engine still runs past the end of the buffers, reads memory holding other data, sometimes attempts to write outside the reserved page boundaries, and finally triggers a hardware memory protection error.
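To illustrate the kind of guard being described, here is a minimal sketch of how a separable resampler might compute, for each output sample, where its filter window starts in the source, rejecting a kernel that cannot fit and clamping windows at the edges so they never run past the buffer. This is a hypothetical helper written for illustration, not AviSynth's resampler code, and it does not claim to locate the actual bug.

```cpp
#include <algorithm>
#include <cmath>
#include <stdexcept>
#include <vector>

// For each output sample, the index of the first of `filter_size` source samples
// it reads, kept inside [0, src_size).
std::vector<int> window_starts(int src_size, int dst_size, int filter_size) {
    if (dst_size <= 0 || filter_size <= 0)
        throw std::invalid_argument("sizes must be positive");
    // The check an assert would enforce: the kernel must fit in the source.
    if (filter_size > src_size)
        throw std::invalid_argument("filter support larger than source dimension");

    std::vector<int> starts(dst_size);
    const double scale = static_cast<double>(src_size) / dst_size;
    for (int x = 0; x < dst_size; ++x) {
        // Centre of output sample x, mapped into source coordinates.
        const double center = (x + 0.5) * scale - 0.5;
        const int start = static_cast<int>(std::floor(center)) - filter_size / 2 + 1;
        // Shift edge windows back inside the buffer instead of reading past it.
        starts[x] = std::clamp(start, 0, src_size - filter_size);
    }
    return starts;
}
```

For 72 source rows upsized to 720 with a 6-tap filter, every window stays within rows 0 to 71. Clamping shifts the window at the borders; a real resampler would also recompute or renormalize the edge weights, which this sketch leaves out.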

Last edited by DTL; 30th November 2020 at 12:36.

1st December 2020, 18:44   #132
jpsdr
Registered User
Join Date: Oct 2002
Location: France
Posts: 2,316
Is it only in ResampleMT, or also in the standard resizers?

1st December 2020, 23:10   #133
DTL
Registered User
Join Date: Jul 2018
Posts: 1,063
I changed the resizers to the standard GaussResize and BilinearResize, and the result is just as buggy. So the bug is in the main AviSynth resampling engine (which ResampleMT also uses)? I have posted a bug description with the simplest reproduction script to the main AviSynth+ thread.

Last edited by DTL; 1st December 2020 at 23:41.

2nd December 2020, 13:09   #134
real.finder
Registered User
Join Date: Jan 2012
Location: Mesopotamia
Posts: 2,587
Since SincLin2Resize and SinPowResizeMT were added: are they like NoHalo and LoHalo? If not, can those be added? There also seem to be others (LoBlur and LoJaggy).

Edit: there is also JincResize, which may be worth adding too.
__________________
See My Avisynth Stuff

Last edited by real.finder; 2nd December 2020 at 13:16.

2nd December 2020, 23:02   #135
DTL
Registered User
Join Date: Jul 2018
Posts: 1,063
"SincLin2Resize and SinPowResizeMT were added, are they like NoHalo and LoHalo?"

No. They are small additions to 'linear' signal processing based on sinc and the Nyquist theorem. SincLin2 is simply a workaround for the computational problems SincResize has with the small number of taps typically used. They just add a small step towards a complete toolset for '1D' linear signal processing. Better 2D image processing requires a step to significantly different '2D math', such as EWA/Jinc and others.

"there is also JincResize, maybe worth adding too"

JincResize belongs to a completely different, 'true-2D' family of resizers, based on a completely different resampling engine. All ResampleMT resizers, including SincLin2 and SinPow, use one and the same separable resampler (a V+H, 1D+1D processing engine) taken from standard AviSynth, with just MT added. SincLin2 and SinPow are just very small kernel-generation functions added to it.

Also, the only JincResize for AviSynth known here is still unstable and buggy and needs more developer resources to be usable. So it would be very hard to add it to ResampleMT with all the MT functionality.
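To show what a "kernel-generation function" plugged into a separable 1D resampler computes, here is a minimal sketch. The thread does not give the SincLin2 or SinPow formulas, so it uses the standard Lanczos kernel (sinc windowed by a wider sinc) as a stand-in; `lanczos`, `Weights` and `weights_for` are hypothetical names. The resampler evaluates the kernel at the distances from one output sample's centre to its neighbouring source samples and normalizes the results, once per row pass and once per column pass.

```cpp
#include <cmath>
#include <vector>

constexpr double kPi = 3.14159265358979323846;

double sinc(double x) {
    if (x == 0.0) return 1.0;
    return std::sin(kPi * x) / (kPi * x);
}

// Lanczos kernel with `taps` lobes: sinc(x) * sinc(x / taps) for |x| < taps, else 0.
double lanczos(double x, int taps) {
    if (std::fabs(x) >= taps) return 0.0;
    return sinc(x) * sinc(x / taps);
}

struct Weights {
    int first_index;         // first source sample used
    std::vector<double> w;   // 2 * taps weights, summing to 1
};

// Weights for one output sample whose centre lies at source position `center`.
Weights weights_for(double center, int taps) {
    Weights r{static_cast<int>(std::floor(center)) - taps + 1, std::vector<double>(2 * taps)};
    double sum = 0.0;
    for (int i = 0; i < 2 * taps; ++i) {
        r.w[i] = lanczos(center - (r.first_index + i), taps);
        sum += r.w[i];
    }
    for (double& v : r.w) v /= sum;  // normalize so flat areas stay flat
    return r;
}
```

Swapping `lanczos` for another kernel formula is the only change needed to get a different resizer from the same engine, which is why adding SincLin2 or SinPow is small, whereas a Jinc/EWA resizer needs a different (2D) engine.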

Last edited by DTL; 2nd December 2020 at 23:13.

3rd December 2020, 08:13   #136
real.finder
Registered User
Join Date: Jan 2012
Location: Mesopotamia
Posts: 2,587
Quote:
Originally Posted by DTL
Also, the only JincResize for AviSynth known here is still unstable and buggy and needs more developer resources to be usable.
Even this one? https://github.com/Asd-g/AviSynth-JincResize

3rd December 2020, 19:59   #137
DTL
Registered User
Join Date: Jul 2018
Posts: 1,063
Quote:
Originally Posted by real.finder
Even this one? https://github.com/Asd-g/AviSynth-JincResize
This one looks more stable. I tested the 0.x versions, and the 1.x versions ported from VapourSynth look more stable. However, it gives significantly different results with different 'tap' values. It also only works with planar formats. And it looks useful only for upsampling (it does not seem to have a corresponding 'true-2D' downsampling function for production work, like the complementary SinPow (downsample) / Sinc (upsample) resizer pair).
Maybe there is a separate thread on the forum for this plugin?

Do you think it would gain significantly in speed from internal multithreading?

Last edited by DTL; 3rd December 2020 at 21:13.

4th December 2020, 18:40   #138
jpsdr
Registered User
Join Date: Oct 2002
Location: France
Posts: 2,316
From what I've noticed, the more computation there is on data from a small source area (i.e. one that fits in cache), the more you can gain with MT, and the more you gain from increasing the number of cores.

4th December 2020, 20:26   #139
DTL
Registered User
Join Date: Jul 2018
Posts: 1,063
For one-pass Jinc-family resamplers, I think the direct 2D convolution of a 2D kernel with a 2D, line-by-line sampled image buffer may suffer significantly from long-stride memory access and from cache pollution by unused prefetched data. So different schemes of MT task assignment across cores may be worth trying. Maybe even several threads processing different but neighboring input samples of one input buffer area (not different areas of the input buffer, nor different frames of the input sequence), so there would be fewer long-stride memory reads. But syncing the threads may be harder, and the time lost on thread syncing may be significant too. The main idea is to somehow synchronize the threads processing neighboring input samples, so that the processed image buffer area is cached once and is available to many cores.

For example, with 2 cores:
Code:
// Assumed minimal stand-ins so the listing compiles on its own; the JincResize
// source defines its own versions of these types and of VS_RESTRICT.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <type_traits>

#ifndef VS_RESTRICT
#define VS_RESTRICT __restrict
#endif

struct EWAPixelCoeffMeta { int start_x; int start_y; int coeff_meta; };
struct EWAPixelCoeff {
    float* factor;             // all kernel coefficients
    EWAPixelCoeffMeta* meta;   // one entry per output pixel, row-major
    int filter_size;           // kernel window is filter_size x filter_size
    int coeff_stride;          // distance between kernel rows in factor[]
};

template <typename T>
static void resize_plane_c(EWAPixelCoeff* coeff, const void* src_, void* VS_RESTRICT dst_,
    int dst_width, int dst_height, int src_stride, int dst_stride, float peak)
{
    const EWAPixelCoeffMeta* meta_row = coeff->meta;

    const T* srcp = reinterpret_cast<const T*>(src_);
    T* VS_RESTRICT dstp = reinterpret_cast<T*>(dst_);

    src_stride /= sizeof(T);
    dst_stride /= sizeof(T);

    // 2D EWA sum for output pixel x of the current row. Indexing meta by x
    // (instead of stepping one shared pointer) lets each core take any column.
    auto process_pixel = [&](int x)
    {
        const EWAPixelCoeffMeta* meta = meta_row + x;
        const T* src_ptr = srcp + meta->start_y * static_cast<int64_t>(src_stride) + meta->start_x;
        const float* coeff_ptr = coeff->factor + meta->coeff_meta;

        float result = 0.f;

        for (int ly = 0; ly < coeff->filter_size; ly++)
        {
            for (int lx = 0; lx < coeff->filter_size; lx++)
            {
                result += src_ptr[lx] * coeff_ptr[lx];
            }
            coeff_ptr += coeff->coeff_stride;
            src_ptr += src_stride;
        }

        if constexpr (!std::is_same_v<T, float>)
            dstp[x] = static_cast<T>(std::lrint(std::clamp(result, 0.f, peak)));
        else
            dstp[x] = result;
    };

    for (int y = 0; y < dst_height; y++)
    {
        // threads sync start point
        // core 1: even output columns
        for (int x = 0; x < dst_width; x += 2)
            process_pixel(x);

        // core 2: odd output columns (x-neighbors of core 1's pixels, so both
        // cores read almost the same src_ptr[lx] memory area)
        for (int x = 1; x < dst_width; x += 2)
            process_pixel(x);
        // threads sync end point

        meta_row += dst_width;   // next row of per-pixel metadata
        dstp += dst_stride;
    }
}
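The listing above still runs the two column loops one after another on one thread; the comments only mark where threads would meet. A minimal sketch of the same interleaved idea with real threads follows, assuming C++17 and `std::thread`. `RowBarrier`, `PixelFn` and `render_interleaved` are hypothetical names, `pixel` stands in for the 2D EWA sum, and this is not JincResize code.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Minimal reusable barrier (C++20 has std::barrier; this avoids needing it).
class RowBarrier {
public:
    explicit RowBarrier(int count) : count_(count) {}

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const unsigned long gen = generation_;
        if (++waiting_ == count_) {
            // Last thread to arrive releases everyone for the next row.
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return gen != generation_; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int count_;
    int waiting_ = 0;
    unsigned long generation_ = 0;
};

// Per-pixel callback standing in for the 2D EWA sum of output pixel (x, y).
using PixelFn = std::function<float(int x, int y)>;

// Thread t computes columns t, t + n, t + 2n, ... of every row. The barrier keeps
// all threads on the same output row, so they read neighboring source pixels together.
// `dst` must already hold width * height elements.
void render_interleaved(std::vector<float>& dst, int width, int height,
                        int n_threads, const PixelFn& pixel) {
    if (n_threads < 1) n_threads = 1;
    RowBarrier barrier(n_threads);
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            for (int y = 0; y < height; ++y) {
                for (int x = t; x < width; x += n_threads)
                    dst[static_cast<std::size_t>(y) * width + x] = pixel(x, y);
                barrier.arrive_and_wait();  // per-row sync point
            }
        });
    }
    for (std::thread& w : workers) w.join();
}
```

Every thread hits the barrier once per output row, which is exactly the synchronization cost mentioned above; whether the shared cache lines pay for it would have to be profiled.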
We could start by profiling to see whether the current resampler in JincResize is CPU-limited or memory-limited.

Mathematically, it looks like this:

The old, fast, standard built-in resampler in AviSynth, like many other resamplers, performs a 1D x 1D convolution twice (once per direction), while a 'true-2D' resampler performs a single 2D x 2D convolution. But the 2D convolution requires more MUL+ADD operations, so it is significantly slower.
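A rough worked count of that MUL+ADD difference, as a sketch assuming a fixed number of taps per dimension (true for upsizing; downsizing widens the kernel with the scale factor), a horizontal-then-vertical pass order for the separable case, and a full filter_size x filter_size window per output pixel for the 2D case (the loop shape of the listing above). `count_macs` is a hypothetical helper.

```cpp
#include <cstdint>

struct MacCounts {
    std::uint64_t separable;  // horizontal pass + vertical pass
    std::uint64_t direct_2d;  // one taps x taps sum per output pixel
};

// Multiply-add counts for one plane resized from src_h rows to dst_w x dst_h.
MacCounts count_macs(std::uint64_t src_h, std::uint64_t dst_w,
                     std::uint64_t dst_h, std::uint64_t taps) {
    MacCounts c{};
    // Horizontal pass: dst_w x src_h intermediate samples, `taps` MACs each.
    // Vertical pass: dst_w x dst_h output samples, `taps` MACs each.
    c.separable = dst_w * src_h * taps + dst_w * dst_h * taps;
    c.direct_2d = dst_w * dst_h * taps * taps;
    return c;
}
```

For the 120x72 to 1200x720 upsizing mentioned earlier in the thread with 6 taps, this gives 5,702,400 MACs for the separable path against 31,104,000 for the direct 2D path per plane, roughly 5.5x more work.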

Last edited by DTL; 4th December 2020 at 22:21.

5th December 2020, 09:46   #140
jpsdr
Registered User
Join Date: Oct 2002
Location: France
Posts: 2,316
If I do an MT version, it will be like all the others: splitting the image horizontally.
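Splitting horizontally means each thread gets a contiguous band of output rows. A minimal sketch of how the rows might be divided, assuming roughly equal work per row; `row_bands` is a hypothetical helper, not jpsdr's actual code.

```cpp
#include <utility>
#include <vector>

// Split `height` output rows into `n` contiguous bands [begin, end) whose sizes
// differ by at most one row; band i would be handed to thread i.
std::vector<std::pair<int, int>> row_bands(int height, int n) {
    std::vector<std::pair<int, int>> bands;
    if (n < 1) n = 1;
    for (int i = 0; i < n; ++i)
        bands.emplace_back(i * height / n, (i + 1) * height / n);
    return bands;
}
```

Each band still reads a few source rows beyond its own range because of the filter support, but those reads are read-only, so no locking is needed. Unlike the interleaved scheme above, each thread here works on a different source area, which is the trade-off being discussed.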