19th March 2017, 03:31
TheFluff
Excessively jovial fellow
 
Join Date: Jun 2004
Location: rude
Posts: 1,100
Description and comparison of the Avs+ and VS concurrency models

Avisynth and Vapoursynth share a very simple video filter API model. The filter exposes a GetFrame function, which is supposed to return a single frame. When someone (either a client application or a downstream filter) calls GetFrame, the environment supplies some kind of reference to the upstream filter, from which the called filter may request any number of frames it needs to produce the single output frame (except if it is a source filter - those don't have upstream filters).
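
To make the synchronous model concrete, here's a minimal pass-through sketch against the Avisynth C++ plugin API (the Passthrough class name is just for illustration, and the Create/registration boilerplate is left out):

Code:
#include "avisynth.h"

// A do-nothing filter: when asked for frame n, it synchronously pulls frame n
// from the upstream clip ("child") and hands it straight back downstream.
class Passthrough : public GenericVideoFilter {
public:
    Passthrough(PClip _child) : GenericVideoFilter(_child) {}

    PVideoFrame __stdcall GetFrame(int n, IScriptEnvironment* env) override {
        // This call blocks until the upstream filter has produced frame n;
        // a real filter would request whatever frames it needs and process them.
        PVideoFrame src = child->GetFrame(n, env);
        return src;
    }
};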

In an ideal world, when getframe is called, all frames that the filter needs to produce output should be available immediately, and output should be ready just in time for someone to consume it. Filters should not have to wait for input, and output should not be sitting around waiting for someone to consume it - the former increases processing time and the latter memory usage. Also, the number of software worker threads should not be greater than the number of hardware threads, assuming that you can keep them actually working - having more just increases memory usage.

Achieving all of this is obviously impossible in the real world.


Vapoursynth's concurrency model: asynchronous frame requests

VS uses a familiar and conceptually simple callback-based asynchronous I/O model. To support this, there's an asynchronous requestframe function that you call from your filter's getframe function, instead of calling the plain old getframe and waiting for it to return. When frame N is requested from your filter, you call requestframe as many times as you wish, with the frame numbers you need from upstream to produce your output. Each requestframe call returns immediately and you do not actually get any frames back yet, so you will not be able to start processing. Instead, your filter will be called again later, when your input is ready (you get called both every time one requested input frame is ready and when all input you requested is ready, and it is up to you to decide when to start processing - most filters just wait for all input frames to be ready). When a filter requests frames, each request (for a single frame) becomes a "work item" that is put into a queue managed by the VS core. The core also maintains a worker thread pool (defaulting to hardware_concurrency workers) that processes items from this queue, oldest first (FIFO). This is all just standard async callback stuff, so hopefully no surprises so far. If you've ever done something in, say, Node.js, all of this probably sounds pretty familiar.
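
As a rough sketch of what that looks like from a filter's point of view (VapourSynth C API, API version 3; MyData and passGetFrame are illustrative names, and the init/free/create callbacks are left out), the same pass-through logic becomes:

Code:
#include "VapourSynth.h"

// Hypothetical per-instance data: just a reference to the upstream node.
typedef struct {
    VSNodeRef *node;
} MyData;

static const VSFrameRef *VS_CC passGetFrame(int n, int activationReason,
        void **instanceData, void **frameData, VSFrameContext *frameCtx,
        VSCore *core, const VSAPI *vsapi) {
    MyData *d = (MyData *)*instanceData;

    if (activationReason == arInitial) {
        // Request whatever input we need; this returns immediately,
        // we get no frame data back yet.
        vsapi->requestFrameFilter(n, d->node, frameCtx);
    } else if (activationReason == arAllFramesReady) {
        // Called again later, once everything we requested is available.
        const VSFrameRef *src = vsapi->getFrameFilter(n, d->node, frameCtx);
        return src; // a real filter would copy and process here
    }
    return NULL; // not finished yet; the core will call us again
}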

To control how the thread pool interacts with it, a filter chooses between (in practice) three different concurrency modes when it registers itself with the core (a registration sketch follows the three descriptions below):
fmParallel is for thread-safe filters that do not need to modify any internal shared state between requests. Any number of threads may call the filter's getframe function at the same time, and for this to work the filter must only modify state associated with the request for one particular frame (reading shared state is fine). That means things like temporary work buffers need to be allocated and deallocated for each output frame produced - they cannot be allocated once per instance and kept for the lifetime of the filter. On modern hardware this is rarely a problem in practice; malloc is very fast these days.

If you do need something like a permanent work buffer per filter instance, or maybe if you're writing a GPU filter where context swaps are expensive, fmParallelRequests may be a better alternative. In this mode, your filter may still be asked to request the input frames it needs to produce a given output frame from many threads at the same time, but VS will ensure that input is only delivered to it from one thread at a time, and it will wait until the output is ready before delivering more input. In other words, your filter can asynchronously request many input frames in parallel, but it will only be asked to actually process one frame at a time. Hence, maintaining and modifying state per filter instance during processing is safe even without locks or mutexes, so you can have a filter that is effectively single threaded and thread-unsafe in its processing, but is still well-behaved in a multithreaded environment. KNLMeansCL uses this mode, for example.

If that still is too much concurrency for you, you may have to resort to fmUnordered. In this mode, your filter's getframe function will only ever be called from one thread at a time. You can still request input asynchronously, but you cannot be asked to do so while your filter is doing something else (like processing a frame, or determining what frame(s) to request to produce a different output frame). Probably the most common use case for this mode is synchronous (and possibly non-threadsafe) source filters that produce frames immediately when requested and can't handle multiple requests at the same time. FFMS2 uses this mode, for example. It can also be used for filters that examine and modify shared internal state when determining which frames to request from the upstream filter - VDecimate is one example of this.
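
Here is the registration sketch referred to above, reusing the MyData/passGetFrame names from the earlier snippet; passInit and passFree are assumed to be ordinary init/free callbacks and aren't shown:

Code:
// The concurrency mode is the seventh argument to createFilter
// (VapourSynth C API, API version 3).
static void VS_CC passCreate(const VSMap *in, VSMap *out, void *userData,
                             VSCore *core, const VSAPI *vsapi) {
    MyData *d = new MyData();
    d->node = vsapi->propGetNode(in, "clip", 0, 0); // the upstream clip argument

    vsapi->createFilter(in, out, "Pass", passInit, passGetFrame, passFree,
                        fmParallelRequests, // or fmParallel / fmUnordered / fmSerial
                        0,                  // node flags (see nfMakeLinear below)
                        d, core);
}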

Attentive readers may already have noticed that most plain old thread-unsafe Avisynth filters would most likely be compatible with fmParallelRequests, except for the fact that Avisynth's getframe is synchronous, so they would not play nicely with anything else. To avoid this problem, VS has a whitelist of Avisynth filters, together with the frames or frame ranges each of them is expected to request when the downstream filter requests frame N. The list also takes filter parameters into account, so filters with a temporal range parameter are handled correctly. All filters on the list are run with mode fmParallelRequests. Filters that aren't whitelisted, though, fall back to fmSerial, which behaves like old single threaded Avisynth does (getframe effectively becomes synchronous for the filter and all upstream filters).

In real filter chains, a slow filter that only supports fmParallelRequests (or fmUnordered) can easily become a bottleneck, since its processing is single threaded even if everything else around it runs in parallel - although this only really matters in fairly trivial filter chains. If you're bothered by such a bottleneck, you can easily work around it by creating more than one filter instance, making each of them take care of a smaller part of the input clip (for example with selectevery, or by trimming the clip into two halves, or even by cropping) and then merging the resulting clips back together. That way you can get as many concurrent "single threaded" filters as you like.

Filters are encouraged (but not required) to request frames from their upstream filter in order, to play nicely with source filters. However, because there's a central thread pool that keeps track of frame requests, VS filters can register themselves with the flag nfMakeLinear, which makes the thread pool attempt to reschedule frame requests to such filters so that frames are requested in order. FFMS2 does this, for example, and it can have a huge impact on performance (try selectevery(5, 2, 1, 0, 3, 4) after a ffvideosource call in Avs+ and compare it to VS and you'll see what I mean - when I tried that and put KNLMeansCL afterwards it ran at about a quarter of the speed). The thread pool also makes sure the same frame doesn't get requested from the same filter more than once at the same time. Finally, it also keeps track of frame caches, and during processing it "learns" how much cache is actually beneficial, which is why the memory usage of VS processes tends to drop after a little while.
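
For illustration, nfMakeLinear goes into the flags argument of the same createFilter call shown earlier; SourceData, srcInit, srcGetFrame and srcFree are hypothetical names for a source filter's data and callbacks:

Code:
// Registration of a hypothetical synchronous source filter: one getframe
// call at a time (fmUnordered), plus a hint asking the core to reorder
// requests into ascending frame order where possible (nfMakeLinear).
static void VS_CC srcCreate(const VSMap *in, VSMap *out, void *userData,
                            VSCore *core, const VSAPI *vsapi) {
    SourceData *d = new SourceData(); // hypothetical: opens the file, etc.
    vsapi->createFilter(in, out, "Source", srcInit, srcGetFrame, srcFree,
                        fmUnordered, nfMakeLinear, d, core);
}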

Client applications interacting with the VS API can request frames either synchronously or asynchronously. vspipe defaults to requesting as many frames at a time as there are worker threads in the VS core.
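
As a sketch of the client side (VapourSynth C API, API version 3; the node is assumed to come from an already evaluated script, e.g. via the VSScript API):

Code:
// Synchronous vs. asynchronous frame requests from a client application.
static void VS_CC frameDone(void *userData, const VSFrameRef *f, int n,
                            VSNodeRef *node, const char *errorMsg) {
    // Runs on a worker thread when frame n has been produced (or has failed).
    // Completions may arrive out of order; vspipe reorders them itself.
    const VSAPI *vsapi = (const VSAPI *)userData;
    if (f)
        vsapi->freeFrame(f); // a real client would consume the frame first
}

void pullFrames(VSNodeRef *node, const VSAPI *vsapi) {
    // Synchronous: blocks until frame 0 is done.
    char err[256];
    const VSFrameRef *f = vsapi->getFrame(0, node, err, sizeof(err));
    if (f)
        vsapi->freeFrame(f);

    // Asynchronous: queue a batch of requests up front (vspipe queues one
    // per worker thread by default) and get notified via the callback.
    for (int n = 1; n <= 8; n++)
        vsapi->getFrameAsync(n, node, frameDone, (void *)vsapi);
}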


The Avisynth+ concurrency model: buying compatibility with complexity

Avs+ developers valued compatibility with existing filters and existing API users more than anything else, and hence the multithreading had to fit within the existing Avisynth API, which is completely synchronous. Also, it was seen as desirable to be able to parallelize old filters that were not designed with multithreading in mind. On the surface of things, the Avs+ concurrency model seems deceptively simple: multithreading is enabled by creating a prefetcher filter at the end of the script, which spawns a number of worker threads which in turn request frames from upstream filters. Since the client application calling getframe on the prefetcher is a synchronous operation, the prefetcher attempts to predict which frames will be requested in the future, and starts requests for them before they are actually requested.

Avs+ filters have only two practically useful concurrency modes; they can either be MT_NICE_FILTER, which is the equivalent of fmParallel, or they can be MT_MULTI_INSTANCE. In the latter case, Avs+ simply spawns one filter instance per worker thread. There is no equivalent to fmParallelRequests since there are no asynchronous frame requests; instead, the only alternative is MT_SERIALIZED, which essentially forces everything upstream of it to run single threaded.
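
To show what declaring a mode looks like from the plugin side (a sketch against the Avisynth+ C++ headers; MyFilter is a hypothetical filter class), a filter reports its own MT mode by answering the CACHE_GET_MTMODE cache hint:

Code:
// Avisynth+ asks a filter for its MT mode through SetCacheHints; untouched
// legacy filters never answer this hint, so their mode has to be guessed or
// forced from the script instead.
int __stdcall MyFilter::SetCacheHints(int cachehints, int frame_range) {
    if (cachehints == CACHE_GET_MTMODE)
        return MT_NICE_FILTER; // GetFrame is fully thread-safe (cf. fmParallel)
    return 0;                  // no special handling for other hints
}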

MT_MULTI_INSTANCE sounds cool until you realize that each instance comes with its own buffers, allocations and other overhead, and when you have several invocations of the same MT_MULTI_INSTANCE filter in one script you end up with one instance per invocation per prefetcher thread - suddenly your script hangs on open because it's trying to create 12 instances of KNLMeansCL. MT_MULTI_INSTANCE simply isn't a reasonable default, but since there's nothing else, that's what you get.

Then it turns out that the Avs+ threading isn't so simple after all. I made an honest effort at trying to understand how it actually works, but stopped reading the code when it felt like, if I dug any deeper, I'd soon be staring a balrog in the eye. Every filter is wrapped by an instance of a pseudo-filter called MtGuard. Since there's no central tracking of frame requests, the MtGuards and the prefetcher instead seem to use a global, thread-safe (with locking) frame cache that prevents the same frame from getting processed more than once by requests from different threads (no, this is actually wrong, see below). Then there's a huge number of caveats and things that basically don't work, mainly related to how much old garbage Avs+ has inherited. There's a ton of code related to handling env->invoke from various places, for example - remember that in Avisynth, you can invoke from inside getframe. Runtime filters (that is, filters that call getframe from a non-getframe function, with the frame number based on the state of a script variable) basically don't seem to work at all if multithreading is enabled, and there are comments in the source code that seem to indicate that there have been attempts to get them to work, but that it resulted in either heap corruption or deadlocks - which doesn't surprise me at all.

The Avisynth+ threading code is enormously more complex than the VS equivalent. Understanding how frame requests are actually handled and routed through the multi-threaded filter chain is incredibly difficult, and reasoning about performance is likewise all but impossible. The reliance on MT_MULTI_INSTANCE for everything that isn't natively thread-safe causes issues with memory consumption, and the entire thing has a number of problems caused by fundamental design issues that I believe are unlikely to ever get fixed.



tl;dr: if you try to shoehorn multithreading into an API designed to be single threaded and synchronous, you're gonna have a bad time.


Disclaimer: I have not attempted to run Avs+ in a debugger; my understanding of the concurrency model is based only on reading source code (and a few old d9 posts). I have probably misunderstood things. Please correct any errors.

Last edited by TheFluff; 19th March 2017 at 17:00.