Old 12th August 2016, 13:53   #1  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,309
Internally multi-threaded resampling functions

I've made a plugin out of the resampling functions, with internal multi-threading.
I can already hear the "why?"
The answer is "because"...

More seriously, I'll explain my point of view.
For multi-threading in image processing, if you have n cores, you basically have two ways:
Case 1: You process n pictures in parallel.
Case 2: You process n 1/nth parts of the picture in parallel.

The internal core can only offer case 1, because it has no idea how to handle the junction/overlap of the split image, nor how to split it (even if the most common way is a vertical split), nor even whether splitting is possible at all.
Case 2 is only possible internally, because only the filter knows how to split, join, handle overlap, etc.

Personally, I don't like case 1; I think case 2 is way better.
The difference may not be very obvious with a few cores, but it becomes so with a lot of cores.
With case 2, the memory used is always the same whatever the number of cores, while with case 1 it keeps growing. The more cores you have, the less memory each core works on => the better the chance it fits in a lower cache level => the more performance you can get. Forget about that with case 1.

For example, suppose you have a high-end 10-core CPU with HT, giving you 20 logical cores.

A UHD 4K YV12 picture has a size of around 12.5 MB. The whole picture fits in the L3 cache, so the 20 cores will access their data within it. I'll let you guess the result with case 1...

An FHD 1080p YV12 picture has a size of around 3 MB: it fits in L3, but 1/20th of it is about 155 KB, which fits in the L2 cache, so each core will work with data from its own L2 cache. With case 1, the 20 frames being processed don't even fit in the L3 cache.
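
For reference, the figures above simply come from width x height x 1.5 bytes per pixel for 8-bit YV12 (the actual cache sizes depend on your CPU, of course):
Code:
3840 x 2160 x 1.5 = 12,441,600 bytes  (~12.5 MB, one UHD frame)
1920 x 1080 x 1.5 =  3,110,400 bytes  (~3 MB, one FHD frame)
3,110,400 / 20    =    155,520 bytes  (the ~155 KB of one 1/20th slice)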

You may not share this point of view, but for those who do, you can use my multi-threaded version.

This version works on all AVS+ versions, and on all 2.6.x versions.

Current version : 2.3.9

Sources are here.
Binaries are here.

Version history
2.3.9 : Update to new AVS+ headers.
2.3.8 : Add DTL pull-request (update on UserDefined2ResizeMT).
2.3.7 : Update to new AVS+ headers.
2.3.6 : Update on threadpool, no user limit (except memory).
2.3.5 : Fix on threadpool, using the prefetch parameter created a hang. Add negative prefetch for trimming, read Multithreading.txt or the Multithreading chapter here. Fix for RGB packed formats.
2.3.4 : Fix on threadpool.
2.3.3 : Working fix for too small size vs size filter, update avs headers.
2.3.2 : Add UserDefined2ResampleMT function, fix for too small size vs size filter.
2.3.1 : Add SincLin2Resize function.
2.3.0 : Add SinPowResize function.
2.2.3 : Update of the threadpool, update to new avisynth headers.
2.2.2 : Minor code change after threadpool update, fix in the number of threads,
fix to perfectly match avs+ output (V/H resize order was sometimes different).
2.2.0 : Update of the threadpool, add ThreadLevel parameter.
2.1.2 : Update of Matrix Class.
2.1.1 : Optimized CPU placement if SetAffinity=true for prefetch>1, SetAffinity back to default false.
2.1.0 : Merge new core resampler code, filter is MT_NICE.
2.0.3 : Fix a bug in the MTData, add check for inverting matrix in desample.
2.0.2 : Minor update on threadpool.
2.0.1 : Desample now properly handles cropped and/or high-value-shifted originals.
2.0.0 : Added the Desample functions.
1.5.8 : Fix possible deadlock in threadpool, and fix issue of filter "doing nothing".
1.5.7 : Fix in threadpool.
1.5.6 : Minor fix.
1.5.5 : Minor changes on threadpool.
1.5.4 : Minor update on threadpool.
1.5.3 : Update avs header, fix range mode issue.
1.5.2 : Some fixes, add range mode 4, set range mode default to 1, apply range only on last step.
1.5.1 : Change code to allow the merge of the plugins.
1.5.0 : Add range parameter.
1.4.0 : Add sleep and prefetch parameters.
1.3.0 : Update to the new resample core functions. Update to the new avs header and support of all supported video formats. Build with /MD instead of /MT.
1.2.6 : Use Mutex instead of CriticalSection on some places and some changes in the threadpool interface.
1.2.5 : Fix deadlock case in Threadpool interface. Remove CACHE_DONT_CACHE_ME and small changes in the threadpool interface.
1.2.4 : Add several parameters to allow more specific tuning of the threadpool if necessary.
1.2.3 : Minor change.
1.2.2 : Update of the threadpool and minor change.
1.2.1 : Update the threadpool interface, fix the deadlock on init and other small things.
1.2.0 : Update the threadpool interface.
1.1.0 : Use an external thread pool class, but internally, to prevent the thread-creation explosion.
1.0.2 : Test of what finally turned out to be a bad idea...
1.0.1 : Fix Intel compiler warning.
1.0.0 : First release.

==================================================================

Desample

Desample functions added in v2.0.0

DeBilinearResizeMT
DeBicubicResizeMT
DeLanczosResizeMT
DeLanczos4ResizeMT
DeBlackmanResizeMT
DeSpline16ResizeMT
DeSpline36ResizeMT
DeSpline64ResizeMT
DeGaussResizeMT
DeSincResizeMT
DeSinPowResizeMT
DeSincLin2ResizeMT
DeUserDefined2ResampleMT


More information on Desample functions here.
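
As a quick sketch (the sizes below are only an illustration; see the Desample documentation linked above for the exact parameter list and behaviour):
Code:
# Illustration only: recover the 1280x720 original from a clip that was
# bilinear-upscaled to 1920x1080 (check the Desample doc for the exact parameters).
DeBilinearResizeMT(1280, 720)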

==================================================================

The functions inside this plugin are:
PointResizeMT
BilinearResizeMT
BicubicResizeMT
LanczosResizeMT
Lanczos4ResizeMT
BlackmanResizeMT
Spline16ResizeMT
Spline36ResizeMT
Spline64ResizeMT
GaussResizeMT
SincResizeMT
SinPowResizeMT
SincLin2ResizeMT
UserDefined2ResampleMT


The parameters are exactly the same as those of the original resampling functions, and in the same order, so they are totally backward compatible.
For the newly added kernel functions, check the ReadMe file.
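
In other words, in an existing script you can simply swap the function name (pure illustration):
Code:
# before
BilinearResize(1280, 720)
# after: same parameters, same order, internal multi-threading with default settings
BilinearResizeMT(1280, 720)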

Several new parameters are added after all the original parameters:

threads -

Controls how many threads will be used for processing. If set to 0, the number of threads will
be set to the number of detected logical or physical cores, according to the logicalCores parameter.

Default: 0 (int)

logicalCores -

If threads is set to 0, this specifies whether the number of threads will be the number
of logical CPUs (true) or the number of physical cores (false). If your processor doesn't
have hyper-threading or threads<>0, this parameter has no effect.

Default: true (bool)

MaxPhysCore -

If true, the thread distribution will use as many physical cores as possible. If your
processor doesn't have hyper-threading or the SetAffinity parameter is set to false,
this parameter has no effect.

Default: true (bool)

SetAffinity -

If this parameter is set to true, the thread pool will pin each thread to a specific core,
according to the MaxPhysCore parameter. If set to false, placement is left to the OS.

Default: true (bool)

sleep -
If this parameter is set to true, once the filter has finished a frame, the threads of the
threadpool will be suspended (instead of kept running but waiting on an event), and resumed when
the next frame is processed. If set to false, the threads of the threadpool keep running,
waiting for a start event, even between frames.

Default: false (bool)

prefetch -
This parameter allows creating more than one threadpool, to avoid lock/wait on mutually shared
resources if "prefetch" is used in the avs script.
0 : Will automatically be set to the prefetch value used in the script. Well... that's what I wanted
to do, but for now it's not possible for me to get this information when I need it, so, for
now, 0 will result in 1. For now, if you're using "prefetch" in your script, put the same
value in this parameter.

range -
This parameter specifies the range the output video data has to comply with.
Limited range is 16-235 for Y, 16-240 for U/V. Full range is 0-255 for all planes.
The alpha channel is not affected by this parameter; it's always full range.
Values are adjusted according to the bit depth, of course. This parameter has no effect
for float data.
0 : Automatic mode. If the video is YUV, the mode is limited range; if the video is RGB,
the mode is full range; if the video is greyscale (Y/Y8), the mode is Y limited range.
1 : Force full range whatever the video is.
2 : Force limited Y range for greyscale video (Y/Y8), limited range for YUV video,
no effect for RGB video.
3 : Force limited U/V range for greyscale video (Y/Y8), limited range for YUV video,
no effect for RGB video.
4 : Force special camera range (16-255) for greyscale video (Y/Y8) and YUV video,
no effect for RGB video.

Default: 1

ThreadLevel -
This parameter sets the priority level of the threads created for the processing (internal
multithreading). No effect if threads=1.
1 : Idle level.
2 : Lowest level.
3 : Below normal level.
4 : Normal level.
5 : Above normal level.
6 : Highest level.
7 : Time critical level (WARNING !!! use this level at your own risk)

Default: 6

The logicalCores, MaxPhysCore, SetAffinity and sleep parameters specify how the thread pool will be created and handled, allowing everyone to tune it to their own configuration if necessary.

So, the syntax is:
ResampleFunction([original parameters], int threads, bool logicalCores, bool MaxPhysCore, bool SetAffinity, bool sleep, int prefetch, int range)
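
For example (the values below are just an illustration; the source clip is whatever was loaded earlier in the script):
Code:
# Illustration: one thread per physical core, threads pinned to cores,
# limited-range output. All the original BicubicResize parameters keep
# their usual meaning and order.
BicubicResizeMT(1920, 1080, threads=0, logicalCores=false, SetAffinity=true, range=2)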

==================================================================

Multi-threading information

CPU example case : 4 cores with hyper-threading.

If you leave all the multi-threading parameters at their default values, they are set to be "optimal" for when you're not using prefetch, or when you're under standard AviSynth: all the logical CPUs will be used.
If you set SetAffinity to true, the threads will be allocated to the CPUs contiguously: physical CPU 1 will have threads (0,1), ..., physical CPU 4 will have threads (6,7), allowing optimal cache use. Run tests to see what's best for you.

Now, if you are using prefetch in your script, things are different!
If you're using it with the max number of CPUs (8 in our example case), you can still run tests, but I would strongly advise disabling the internal multi-threading by using threads=1. In this case, no threadpool is created, and all the other multi-threading-related filter parameters have no effect, even prefetch.
If you're using prefetch in your script with fewer threads than your CPU count, you may want to try mixing external and internal multi-threading: set the internal multi-threading to a lower number of threads, and set the prefetch parameter of the filter. This parameter sets the number of internal threadpools created; the best is to match the script's prefetch value. If you don't set it (leave it at 1), or set a value lower than the prefetch in your script, several instances (or GetFrame calls) will be created, but they will not run efficiently, because each instance (or GetFrame) will spend time waiting for a threadpool to be available if not enough were created.
Unfortunately, as things are now, I have no way of knowing the prefetch value used in the AviSynth script at the time I need the information; this is why you have to use the prefetch parameter of the filter.
In our example CPU case, you can have things like:
Code:
filter(...,threads=1)
prefetch(8)
or
Code:
filter(...,threads=2,prefetch=4)
prefetch(4)
or
Code:
filter(...,threads=4,prefetch=2)
prefetch(2)
or even
Code:
filter(...,threads=3,prefetch=4)
prefetch(4)
if you want to push it and go a little over your total CPU count.

Also, if your prefetch is not higher than your number of physical cores, you can try setting SetAffinity to true, but in that case you have to set MaxPhysCore to false. The threads of each pool will then be assigned to CPUs in steps.
For example, in our case:
Code:
filter(...,threads=2,prefetch=4,SetAffinity=true,MaxPhysCore=false)
prefetch(4)
Will create 4 pools of 2 threads, with the following layout:
pool[0] : threads(0 -> 1) on CPU 1.
pool[1] : threads(0 -> 1) on CPU 2.
pool[2] : threads(0 -> 1) on CPU 3.
pool[3] : threads(0 -> 1) on CPU 4.
Code:
filter(...,threads=4,prefetch=2,SetAffinity=true,MaxPhysCore=false)
prefetch(2)
Will create 2 pools of 4 threads, with the following layout:
pool[0] : threads(0 -> 1) on CPU 1.
pool[0] : threads(2 -> 3) on CPU 2.
pool[1] : threads(0 -> 1) on CPU 3.
pool[1] : threads(2 -> 3) on CPU 4.

Negative prefetch
The possibility of using a negative prefetch to tune the prefetch parameter to its optimal value has been added. The filter will throw an error if the number is not high enough to avoid waiting when requesting an internal threadpool. For this to work properly, you have to put a negative prefetch on ALL the filters of your script, and also on ALL instances of the same filter.

Example:
Code:
filter(...,threads=2,prefetch=-2)
prefetch(2)
You'll see an error.

But with :
Code:
filter(...,threads=2,prefetch=-3)
prefetch(2)
You'll see no error, so the optimal is :
Code:
filter(...,threads=2,prefetch=3)
prefetch(2)
Once you've tuned it, put back a positive value.

Last edited by jpsdr; 20th November 2023 at 21:54.
jpsdr is offline   Reply With Quote
Old 12th August 2016, 14:38   #2  |  Link
shekh
Registered User
 
Join Date: Mar 2015
Posts: 775
This is interesting. Do you have any actual timings of slice vs frame performance?
I don't know much about the low-level side, but my impression is you're lucky if you're bound by the L1 cache (you're already in the fast camp), and L1 caches are per-core anyway. If there is enough computation, it can easily dominate all the memory bottlenecks.
__________________
VirtualDub2
shekh is offline   Reply With Quote
Old 12th August 2016, 15:07   #3  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,309
L1... 32 KB max... you can only fit a few lines of a picture in it, but small pictures do fit better indeed, even better than in L2 if you're lucky (L2 caches are also per-core).
jpsdr is offline   Reply With Quote
Old 12th August 2016, 19:49   #4  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,309
I need help from a C++ expert: why is the Intel compiler not happy?
Of course, no issue with Visual Studio.

The code is:
Code:
class ResamplingFunction 
/**
  * Pure virtual base class for resampling functions
  */
{
public:
  virtual double f(double x) = 0;
  virtual double support() = 0;

  virtual ResamplingProgram* GetResamplingProgram(int source_size, double crop_start, double crop_size, int target_size, IScriptEnvironment* env);
};


class PointFilter : public ResamplingFunction 
/**
  * Nearest neighbour (point sampler), used in PointResize
 **/
{
public:
  double f(double x);  
  double support() { return 0.0001; }  // 0.0 crashes it.
};

static PClip CreateResize( PClip clip, int target_width, int target_height, int _threads, const AVSValue* args, 
                           ResamplingFunction* f, IScriptEnvironment* env );


return CreateResize( args[0].AsClip(), args[1].AsInt(), args[2].AsInt(),args[7].AsInt(0), &args[3],
                       &PointFilter(), env );
The error message is:
Code:
1>resample.cpp(2458): error : expression must be an lvalue or a function designator
1>                           &PointFilter(), env );
1>                            ^
I also get:
Code:
1>resample.cpp(2472): warning #1563: taking the address of a temporary
1>                           &MitchellNetravaliFilter(args[3].AsDblDef(1./3.), args[4].AsDblDef(1./3.)), env );
1>                           ^
Expert help needed...

Last edited by jpsdr; 12th August 2016 at 19:52.
jpsdr is offline   Reply With Quote
Old 12th August 2016, 20:03   #5  |  Link
feisty2
I'm Siri
 
feisty2's Avatar
 
Join Date: Oct 2012
Location: void
Posts: 2,633
I'm no c++ expert but what's "&PointFilter()"?
I assume that PointFilter is a class name here, so PointFilter() is the constructor?
You can't take the memory address of constructors...
And you didn't overload (), so not a functor either
feisty2 is offline   Reply With Quote
Old 12th August 2016, 20:07   #6  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,309
Ok, for now just VS builds, no Intel, so you can test, torture, whatever you want.

@feisty2: Well, VS doesn't complain, so maybe it's something "acceptable". It's not the first time I've run into an issue where the Intel compiler is less tolerant (or too strict?) than VS.
But this is beyond my skills; I really need a C++ expert.

Last edited by jpsdr; 12th August 2016 at 20:10.
jpsdr is offline   Reply With Quote
Old 12th August 2016, 20:15   #7  |  Link
feisty2
I'm Siri
 
feisty2's Avatar
 
Join Date: Oct 2012
Location: void
Posts: 2,633
Ahhh, got it
&PointFilter::PointFilter is taking the address of the constructor which is invalid

&PointFilter() is taking the address of a TEMPORARY object, the object has the type of PointFilter &&, which is an xvalue, and you can't take the address of that either
feisty2 is offline   Reply With Quote
Old 12th August 2016, 20:25   #8  |  Link
feisty2
I'm Siri
 
feisty2's Avatar
 
Join Date: Oct 2012
Location: void
Posts: 2,633
ResamplingFunction *f -> ResamplingFunction &&f
&PointFilter() -> PointFilter()

will probably work...

Edit: you might be using an obsolete MSVC which does not feature rvalue references, a modern C++ feature, so it didn't complain about anything

Last edited by feisty2; 12th August 2016 at 20:35.
feisty2 is offline   Reply With Quote
Old 12th August 2016, 20:43   #9  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,309
Er... maybe a little for VS2010, but VS2015 Update 3 compiles without even a warning... Too permissive, maybe.
jpsdr is offline   Reply With Quote
Old 12th August 2016, 20:54   #10  |  Link
feisty2
I'm Siri
 
feisty2's Avatar
 
Join Date: Oct 2012
Location: void
Posts: 2,633
Well you should probably not write super confusing stuff like "&PointFilter()" in the future, I mean, who the hell even remembers if & has the higher priority or () does...
Modern c++ is nice, so kiss c++98 goodbye
feisty2 is offline   Reply With Quote
Old 12th August 2016, 20:57   #11  |  Link
shekh
Registered User
 
Join Date: Mar 2015
Posts: 775
&PointFilter() is taking address of a temporary object, which has a const qualifier

I would put this on separate line

PointFilter filter;
return CreateResize( args[0].AsClip(), args[1].AsInt(), args[2].AsInt(),args[7].AsInt(0), &args[3], &filter, env );

but I hope CreateResize does not store that pointer somewhere
__________________
VirtualDub2

Last edited by shekh; 12th August 2016 at 20:59.
shekh is offline   Reply With Quote
Old 12th August 2016, 21:11   #12  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,309
I've figured it out. Some throw an error, some throw a warning. Those that throw a warning are the ones where a constructor is defined; those with an error are the ones with no constructor defined. The default constructor is apparently not enough for the Intel compiler; just adding an empty constructor turns the error into a warning. I'll also try shekh's suggestion, it seems safer to me...

Edit: With shekh's suggestion, there isn't even the warning anymore. I'll update to this, even if maybe it's not really necessary; I don't like warnings if I can avoid them...

Last edited by jpsdr; 12th August 2016 at 21:14.
jpsdr is offline   Reply With Quote
Old 12th August 2016, 21:13   #13  |  Link
feisty2
I'm Siri
 
feisty2's Avatar
 
Join Date: Oct 2012
Location: void
Posts: 2,633
Quote:
Originally Posted by shekh View Post
&PointFilter() is taking address of a temporary object, which has a const qualifier

I would put this on separate line

PointFilter filter;
return CreateResize( args[0].AsClip(), args[1].AsInt(), args[2].AsInt(),args[7].AsInt(0), &args[3], &filter, env );

but I hope CreateResize does not store that pointer somewhere
Nah, you're passing out an address on stack, &filter is a dangling pointer!
The object is meant to be on the stack of CreateResize, not the stack of the function that calls CreateResize!
feisty2 is offline   Reply With Quote
Old 12th August 2016, 22:08   #14  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,309
Anyway, it's working... I've updated the first post with a 1.0.1 version; you can torture it, test it, etc...
jpsdr is offline   Reply With Quote
Old 12th August 2016, 22:38   #15  |  Link
feisty2
I'm Siri
 
feisty2's Avatar
 
Join Date: Oct 2012
Location: void
Posts: 2,633
Fine, I was wrong, delusional and half asleep. The "return" gave me the false impression that it jumped out of the function that called CreateResize and continued in CreateResize, and that filter was therefore freed; but it actually just jumps into CreateResize, never out of the outer function, and goes back to that outer function when CreateResize is done, so filter is never freed.
Hopefully next time I won't reply while I'm taking a nap..
feisty2 is offline   Reply With Quote
Old 13th August 2016, 15:22   #16  |  Link
ultim
AVS+ Dev
 
ultim's Avatar
 
Join Date: Aug 2013
Posts: 359
Bonus points for using a thread pool and not starting new threads each frame. Now, once IScriptEnv2 is finalized, this plugin only needs to use the internal pool of Avs+ to make it even better.

I am also not convinced by the caching arguments, but I can see other ways this kind of threading is helpful. The most obvious example: for large frames (4K or 8K), where the memory needs of frame-based threading become prohibitive for normal users, the lower memory needs of slice-based threading may come to the rescue, which will certainly be faster than swapping memory to disk.
__________________
AviSynth+

Last edited by ultim; 13th August 2016 at 15:35.
ultim is offline   Reply With Quote
Old 13th August 2016, 17:55   #17  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,309
Quote:
Originally Posted by ultim View Post
this plugin only needs to use the internal pool of Avs+ to make it better
I'm not closed to it, if it's possible (I wouldn't mind a link to something that uses it, as an example), and if it doesn't break compatibility with running "the same way" under any AVS+ and any AVS 2.6.x (which would probably require keeping the current code and adding another code path specific to this thread pool, increasing complexity, but, again, why not... it's something I have to look into).
As for the thread pool, I've just used and adapted what Tritical did in nnedi3.
Personally, I'll never allocate/create/etc. on each frame!
You do that once and for all in the constructor. Well, that's of course my personal point of view.
The other thing I'll try, to see if it improves speed, is the "trick" I used in nnedi3 for RGB24. It helped there; I have to check whether it can also help in this case.

Last edited by jpsdr; 13th August 2016 at 17:58.
jpsdr is offline   Reply With Quote
Old 13th August 2016, 21:26   #18  |  Link
ultim
AVS+ Dev
 
ultim's Avatar
 
Join Date: Aug 2013
Posts: 359
Well, it's an unstable interface, which is why there are no real examples of its usage yet. It's also why you should wait a little longer. Once it's deemed ready, I'll announce it and provide descriptions and examples of the most useful features.
__________________
AviSynth+
ultim is offline   Reply With Quote
Old 13th August 2016, 23:16   #19  |  Link
TheFluff
Excessively jovial fellow
 
Join Date: Jun 2004
Location: rude
Posts: 1,100
Stephen R. Savage posted this earlier, but since he loves deleting his own posts I'll repeat what I remember of it:

There's no evidence that slice-based threading is any faster than frame-based threading for a convolution filter like a resizer. Frame-based threading of the VapourSynth internal resizers scaled very close to linearly up to 24 cores in Stephen's tests (24 cores, 23.8x speedup compared to one core). He had some argument that there is no cache advantage to slice-based threading because there's no shared data between lines, or something? I don't remember. But anyway, internal multithreading like this is likely pointless, at least for resizers. Then again, I'm pretty sure avs-mt's frame-based multithreading design is bad, but I don't really have the evidence to back that up.

Quote:
Originally Posted by jpsdr View Post
You may not have this point of view, but for those who share it, you can use my multi-threaded version.
I realize that "optimizing" things based on guesswork, hearsay and fundamental misunderstandings of the underlying technology is a very doom9 thing to do (remember that guy who wrote 3000 lines of asm to try to optimize memcpy even though optimizing memcpy does absolutely nothing in the real world?), but holy shit, seriously. Dude. If you optimize something, you'd better benchmark it to prove that is faster than the thing you wanted to improve on. One algorithm being faster than another isn't an opinion or a point of view. Don't try to rice shit without benchmarks.

Last edited by TheFluff; 14th August 2016 at 00:25.
TheFluff is offline   Reply With Quote
Old 13th August 2016, 23:25   #20  |  Link
Chikuzen
typo lover
 
Chikuzen's Avatar
 
Join Date: May 2009
Posts: 595
benchmark?
I did. http://pastebin.com/ZCNnN5RW.
__________________
my repositories
Chikuzen is offline   Reply With Quote