Multicore optimization idea: running consecutive filters in different threads
QuaddiMM
18th July 2008, 23:09
From what I know, MT-plugin splits a frame into parts and processes each part in a different thread ("MT"), or uses alternating threads to get consecutive frames ("SetMTMode").
What I'm missing is a way to run consecutive filters on different threads, like a pipeline (as seen in modern CPUs for instruction processing) would do. This would eliminate the overlap problem of 'MT' and the issues with filters that aren't reentrant or don't work with 'SetMTMode' for other reasons.
Of course, such a pipeline might introduce other incompatibilities, but i'd like to give it a try.
A filter implementing the idea would basically do the following:
- The filter runs a worker-thread that does the frame-getting from previous filters in the chain.
- Every time 'GetFrame' is called, a request is sent to the worker-thread, requesting the frame and waiting for it to become available (ideally, the frame is already available).
Now, for it to work as a real pipeline, it is necessary to generate frames before they are requested (in order to be able to answer the request immediately, without processing delay). The simplest way is to assume that frames are requested in linear order (which is actually the case when encoding).
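To make the scheme above concrete, here is a minimal sketch in portable C++ (std::thread rather than the CreateThread call the actual plugin uses). It is not the plugin's source: the class name PipeLineSketch, the Frame typedef (a plain int standing in for PVideoFrame) and the upstream callback are all made up for illustration. It keeps only one prefetched frame, assumes a single downstream caller, and leaves out error handling.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for a video frame; a real filter would hand out PVideoFrame.
using Frame = int;

class PipeLineSketch {
public:
    explicit PipeLineSketch(std::function<Frame(int)> upstream)
        : upstream_(std::move(upstream)), worker_([this] { Run(); }) {}

    ~PipeLineSketch() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_.notify_all();
        worker_.join();
    }

    // Called by the downstream filter; blocks only if frame n is not ready yet.
    Frame GetFrame(int n) {
        assert(n >= 0);
        std::unique_lock<std::mutex> lock(m_);
        if (ready_ != n && busy_ != n) want_ = n;  // ask the worker for frame n
        cv_.notify_all();
        cv_.wait(lock, [&] { return ready_ == n; });
        Frame result = frame_;
        // Linear-access assumption: start fetching n + 1 before it is requested.
        if (busy_ != n + 1) want_ = n + 1;
        cv_.notify_all();
        return result;
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [&] { return stop_ || want_ >= 0; });
            if (stop_) return;
            busy_ = want_;
            want_ = -1;
            const int n = busy_;
            lock.unlock();
            Frame f = upstream_(n);  // every upstream call happens on this one thread
            lock.lock();
            frame_ = f;
            ready_ = n;
            busy_ = -1;
            cv_.notify_all();
        }
    }

    std::function<Frame(int)> upstream_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
    int want_ = -1;   // frame the worker should fetch next, or -1
    int busy_ = -1;   // frame the worker is fetching right now, or -1
    int ready_ = -1;  // frame currently held in frame_, or -1
    Frame frame_ = 0;
    std::thread worker_;  // declared last so the state above exists before Run() starts
};
```

With linear requests, the worker is already producing frame n + 1 while the caller is still busy with frame n, which is where the overlap comes from. A non-linear request simply overrides the pending prefetch, at the cost of one wasted upstream call.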
I wrote a small plugin to test the idea, and it works quite well:
Put "PipeLine" in the script, balancing load between filters below and above the pipeline.
In a quick test I got a speed improvement of ~53%. With better balancing, improvements of up to 100% should be possible (or up to 300% on quad-cores), depending of course on the number of 'PipeLine's used and on the load balancing.
Can you please comment on the idea? Maybe it's already done in MT and I didn't notice? Is it generally a good idea and should it be compatible with avisynth?
While testing I got some random crashes. Also the race-problem (http://forum.doom9.org/showthread.php?t=138391) reappeared (no idea why). So I wonder if there is a better way to do it?
In the attachment is a binary of the filter, along with its source code.
Note when compiling: Due to 'CreateThread' being used, it must be linked with the Multithreading-DLL (not the static library). I'm fine with that, as I prefer the DLL-Runtime anyway.
Gavino
19th July 2008, 10:24
In principle, this seems like a good idea to me.
But I think for it to work, all filters used need to be thread-safe and I don't know if that is widely true. I can see possible problems if you have a non-thread-safe filter repeated in different parts of the filter chain. For example:
x = UnsafeFilter()
y=PipeLine(x).SomeOtherFilter()
x+y
squid_80
19th July 2008, 14:35
I had a similar idea and ran into the same problems. If it was added to the core, I think the ideal place would be in the cache filter since it normally gets added internally after every filter by default.
I saw a similar massive speed improvement but it was unusable due to random crashes. Possibly the vfb management isn't/wasn't thread-safe, I think it's better now than when I tried but IanB would probably know best.
Isn't prefetching for parallelization going to be in Avisynth 2.6 (of course that seems to be on the same release schedule as 3.0)?
PS. there's an easy solution to thread safety ... don't use threads. What are we talking about here? A couple 100 KB per instance of avisynth and the filter, and filter completion times multiple orders of magnitude larger than context switch times ... peanuts. Threading offers no real benefits over multiple processes with shared memory for this particular application.
Mug Funky
20th July 2008, 16:31
if this could be made to work with mvtools, it would be the rocking-est rock that ever rocked.
i have an under-utilised 8-core machine, and HD/2k footage itching to be NR'd.
martino
20th July 2008, 20:28
So with this, could you possibly run a filter, that only works on the current frame (1D filter, if that is a good term to use), split the task into two with SelectEven/Odd and put each instance of the filter under PipeLine and thus let both cores be used on what you'd originally achieve with just one call of the function, on the whole clip?
Gavino
20th July 2008, 20:50
So with this, could you possibly run a filter, that only works on the current frame (1D filter, if that is a good term to use), split the task into two with SelectEven/Odd and put each instance of the filter under PipeLine and thus let both cores be used on what you'd originally achieve with just one call of the function, on the whole clip?
In principle, yes. You'd need Interleave as well of course, to put the result back together again.
In practice, you might run into trouble if any of the upstream filters is not thread-safe (see my post #2). All it would take to screw up is modifying an instance variable in the filter's GetFrame call.
(BTW I'd call it a spatial filter, or perhaps 2D)
QuaddiMM
20th July 2008, 21:09
Thanks for the replies.
@Gavino:
You're right. There are ways to break it with non-threadsafe filters.
Even then it might be useful for simple linear filter chains.
@MfA:
It would be great if something similar will be in AviSynth 2.6. I'm looking forward to it.
About using processes: Processes (can) run concurrently, therefore problems with thread-safety. Actually, on Windows a process is a mere container for one or more threads.
@martino:
You mean something like
src = last
a = src.SelectEven.BilinearResize (640, 480).PipeLine
b = src.SelectOdd.BilinearResize (640, 480).PipeLine
Interleave (a, b)
?
That 'would' be possible, if it would actually work (and not crash).
If it's not too much to ask, could someone please review my code? If there are errors, I'll try to fix them. If it's a problem with avisynth... well.. wait for 2.6.
squid_80
20th July 2008, 21:37
I had a quick look at the code, the only thing I'd comment on is I don't think it's worth implementing a cache since avisynth will already do that for you. It does crash the same as mine with the latest 2.5 beta, I haven't tried 2.6.
I don't see how non-threadsafe filters are a problem as long as getframe calls for individual filters are serialized, which is the point here - a filter doesn't process multiple frames in parallel, instead multiple filters run in parallel processing different frames.
What would be cool is to have code that analyzes which frames are being requested and adapts, rather than assuming linear access.
Gavino
20th July 2008, 21:38
@Gavino:
You're right. There are ways to break it with non-threadsafe filters.
Even then it might be useful for simple linear filter chains.
The problem as I see it is that, since filters are not obliged to be thread-safe, you cannot trust any filter unless it has been shown to be safe (preferably by inspection of its source code). Even if it works on a particular day, you might just have got lucky - Murphy's Law will inevitably strike sooner or later.
Testing can only show the presence of bugs, not their absence. ;)
And that's even more true where multi-threading is concerned.
About using processes: Processes (can) run concurrently, therefore problems with thread-safety. Actually, on Windows a process is a mere container for one or more threads.
With processes a separate instance of the filter will always have its own set of variables, re-entrance and thread-safety are only an issue with shared resources and multithreading. Things can still go wrong in other ways with concurrent programs, but that is neither here nor there.
With multiprocessing every filter can be run in parallel with other instances of itself, unless the developer really tried very hard to break things (for instance by using named win32 objects as a sidechannel for passing data between filters ... not that many filters using sidechannels though, maybe mvtools?).
Gavino
20th July 2008, 22:19
I don't see how non-threadsafe filters are a problem as long as getframe calls for individual filters are serialized, which is the point here - a filter doesn't process multiple frames in parallel, instead multiple filters run in parallel processing different frames.
Yes, but in a setup like
src = AviSource(...).AnyFilter()
a = src.SelectEven.BilinearResize (640, 480).PipeLine
b = src.SelectOdd.BilinearResize (640, 480).PipeLine
Interleave (a, b)
then the single instance of AnyFilter (and that of AviSource) represented by src has its GetFrame called by two different threads, so not serialized.
Ah, :lightbulb: - how about if Pipeline had a companion called Serialize designed to fix cases like this. You would then write
src = AviSource(...).AnyFilter().Serialize()
a = src.SelectEven.BilinearResize (640, 480).PipeLine
b = src.SelectOdd.BilinearResize (640, 480).PipeLine
Interleave (a, b)
It's a crazy idea, Jim, but it might just work... :)
QuaddiMM
20th July 2008, 22:29
I had a quick look at the code, the only thing I'd comment on is I don't think it's worth implementing a cache since avisynth will already do that for you. It does crash the same as mine with the latest 2.5 beta, I haven't tried 2.6.
Yeah - but the avisynth-cache only works for frames that were already returned by GetFrame. This doesn't cover 'prefetched' frames, so there is a cache for them.
EDIT: That's actually wrong. You're right about the cache. I could omit it.
But currently all frames are put in the cache, even those explicitly requested. I did that because it was the easiest way. I'm new to the threading stuff and it confuses me, so I kept it simple.
I don't see how non-threadsafe filters are a problem as long as getframe calls for individual filters are serialized, which is the point here - a filter doesn't process multiple frames in parallel, instead multiple filters run in parallel processing different frames.
That's the point. I thought I avoided thread-safety issues by only requesting one frame at a time. There I was wrong or it's an error in the code.
What would be cool is to have code that analyzes which frames are being requested and adapts, rather than assuming linear access.
That would be very hard to do properly. For encoding, linear prefetching should be enough.
With processes a separate instance of the filter will always have its own set of variables, re-entrance and thread-safety are only an issue with shared resources and multithreading. Things can still go wrong in other ways with concurrent programs, but that is neither here nor there.
With multiprocessing every filter can be run in parallel with other instances of itself, unless the developer really tried very hard to break things (for instance by using named win32 objects as a sidechannel for passing data between filters ... not that many filters using sidechannels though, maybe mvtools?).
Sorry, I got you wrong there. So you mean not just using different processes, but also running different instances of the filters. That may solve some problems. But it's more interesting for MT-plugin as it really runs the same filter twice at the same time.
QuaddiMM
20th July 2008, 22:57
Ah, :lightbulb: - how about if Pipeline had a companion called Serialize designed to fix cases like this. You would then write
src = AviSource(...).AnyFilter().Serialize()
a = src.SelectEven.BilinearResize (640, 480).PipeLine
b = src.SelectOdd.BilinearResize (640, 480).PipeLine
Interleave (a, b)
It's a crazy idea, Jim, but it might just work... :)
Nice idea. It's quite simple to implement.
Something like:
PVideoFrame __stdcall Serialize::GetFrame (int n, IScriptEnvironment* env)
{
    // Let only one thread at a time into the upstream filter chain.
    EnterCriticalSection (&cs);
    PVideoFrame frame = child->GetFrame (n, env);
    LeaveCriticalSection (&cs);
    return frame;
}
That runs fine on its own, but doesn't solve the issues with PipeLine (i tested it).
I suspect (without actually looking at it, so I might be wrong here) that the cause of the problems is the avisynth-cache, which may not be thread safe. If that's the case, there's no way around fixing it.
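One thing worth keeping in mind with a Serialize filter like the one above: if the child's GetFrame throws (which an AviSynth filter may do when it reports an error through the script environment), the critical section would never be left and every later caller would block. The sketch below shows the same idea with a scope-bound lock. It uses standard C++ (std::mutex in place of CRITICAL_SECTION) and made-up names (SerializeSketch, Frame, a std::function standing in for the child clip), so treat it as an illustration rather than drop-in plugin code.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for a video frame; a real filter would return PVideoFrame.
using Frame = int;

class SerializeSketch {
public:
    explicit SerializeSketch(std::function<Frame(int)> child)
        : child_(std::move(child)) {}

    Frame GetFrame(int n) {
        assert(n >= 0);
        // The guard releases the mutex on every exit path, including exceptions.
        std::lock_guard<std::mutex> lock(cs_);
        return child_(n);
    }

private:
    std::function<Frame(int)> child_;
    std::mutex cs_;
};
```

As the post says, this only serializes access to the filters above it; it does nothing about state shared inside the core, such as the cache.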
martino
21st July 2008, 00:49
In principle, yes. You'd need Interleave as well of course, to put the result back together again.
Oh, that's right. I forgot about that.
Thanks for the replies.
@martino:
You mean something like
src = last
a = src.SelectEven.BilinearResize (640, 480).PipeLine
b = src.SelectOdd.BilinearResize (640, 480).PipeLine
Interleave (a, b)
?
That 'would' be possible, if it would actually work (and not crash).
Yup. That was exactly what I was thinking.
*downloads filter and will try some other time
squid_80
21st July 2008, 04:12
Instead of implementing serialization in another filter, why not just put it in pipeline and use it on every filter in the script? It's cheap to implement with a critical section like the code posted (critical sections are a lot faster than the other synchronization primitives).
I'd have to check your code again but I think it's already more or less serialized anyway, since getframe blocks for the "fetch" thread to return if it's already running. So if pipeline were in place immediately after a non-threadsafe filter, wouldn't it ensure only one getframe call went through at a time?
Re: my idea of analyzing the frame order, I think scripts where this would be most useful are going to be the ones doing some sort of framerate adjustment via mvtools and whatever. Typically not linear access. Even adding a selecteven or selectodd would bork that. I never really got into statistics but something like a decaying mode would probably work, even for odd patterns (e.g. frame 0, frame 2, frame 3, frame 5 etc). I'll see if I can come up with some code.
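For what it's worth, here is one possible reading of the "decaying mode" idea, sketched in standard C++. It is only a guess at what such code might look like: the class name StridePredictor and the decay factor of 0.8 are invented for the example. Each observed step between consecutive requests gets a weight, older observations fade by the decay factor, and the prediction is the last frame plus the step with the largest weight.

```cpp
#include <cassert>
#include <map>

// Hypothetical predictor: guesses the next requested frame from past requests.
class StridePredictor {
public:
    explicit StridePredictor(double decay = 0.8) : decay_(decay) {
        assert(decay > 0.0 && decay < 1.0);
    }

    // Record that frame n was just requested.
    void Observe(int n) {
        if (last_ >= 0) {
            // Fade all older observations, then credit the step just seen.
            for (auto& entry : weight_) entry.second *= decay_;
            weight_[n - last_] += 1.0;
        }
        last_ = n;
    }

    // Frame expected next: last frame plus the most heavily weighted step.
    int Predict() const {
        if (last_ < 0) return 0;  // nothing seen yet: assume the clip starts at 0
        int best_stride = 1;      // fall back to linear access
        double best_weight = 0.0;
        for (const auto& entry : weight_) {
            if (entry.second > best_weight) {
                best_weight = entry.second;
                best_stride = entry.first;
            }
        }
        return last_ + best_stride;
    }

private:
    double decay_;
    int last_ = -1;
    std::map<int, double> weight_;  // step size -> decayed count
};
```

A single seek does not throw this off, and a SelectEven-style stride of 2 is picked up after a couple of requests. An alternating pattern such as 0, 2, 3, 5 is not handled by this simple version: it would just pick whichever of the two steps currently weighs more, so something that looks at sequences of steps would be needed there.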
Gavino
21st July 2008, 05:16
Instead of implementing serialization in another filter, why not just put it in pipeline and use it on every filter in the script? ...
I think it's already more or less serialized anyway, since getframe blocks for the "fetch" thread to return if it's already running. So if pipeline were in place immediately after a non-threadsafe filter, wouldn't it ensure only one getframe call went through at a time?
The problem is when you have more than just a linear chain. In my earlier example
src = AviSource(...).AnyFilter().Serialize()
a = src.SelectEven.BilinearResize (640, 480).PipeLine
b = src.SelectOdd.BilinearResize (640, 480).PipeLine
Interleave (a, b)
it seems to me the serialisation has to be done on the common part of the graph, hence immediately after AnyFilter as I described.
Or are you suggesting that the two pipelines would co-operate to enforce serial access at a global level? Hmm, perhaps you're right, I'm not sure now.
squid_80
21st July 2008, 07:35
I meant just do this:
src = AviSource(...).AnyFilter().PipeLine()
a = src.SelectEven.BilinearResize (640, 480).PipeLine()
b = src.SelectOdd.BilinearResize (640, 480).PipeLine()
Interleave (a, b)
since pipeline will already take care of the serialization (I think).
I replaced my avisynth.dll with the one from tsp's last MT avisynth build (http://www.avisynth.org/tsp/MT_07.zip) (since 2.6 isn't quite ready), and it's not crashing anymore. Don't use any setmtmode statements, I think they'll just slow things down or worse cause a deadlock. Simple mpeg2source().resize() scripts perform at double their previous speed.
sh0dan
21st July 2008, 09:12
Well, in the practical world there are some issues, that make this more complex than it seems. I've written a prefetch addon to MT, which I have used for some of my own tests. It works almost completely like yours, but problems arise from complex scripts.
The script above works nicely as an example, but what if you want to use a temporal filter? If you use spatial division, as Mt(""), you must compute an overlap, plus you get penalized for thread synchronization, since you cannot return a frame before both threads have completed, and they very seldom finish in the same amount of time.
Furthermore, most sources heavily prefer linear access, which means you must still access them in order, to avoid a huge seek penalty. You cannot run the script above without synchronization, since you have a 50% chance of 'b' requesting a frame before 'a'. It gets further complicated if we add a "MergeChroma(src)" as the last line: which one will request first if 'a' and 'b' run async and each is a frame ahead of yours?
I'm still experimenting with the pre-fetcher. It works ok, but I'm still not happy enough with it to release it. In the example above, it should be able to replace PipeLine(). I need to get a sort of dynamic cache working before it is usable.
btw, I can only get it to be stable on 2.6.
IanB
22nd July 2008, 02:15
Okay this is all pretty close to an idea I am gestating for properly doing multithreading in avisynth 2.6.
The thought currently goes like this :-
Goal :-to instantly provide the client with the frame it is requesting.
Build a queueable GetFrame request infrastructure.
Build a worker thread infrastructure to service the queue.
Build a thread interlock infrastructure.
Use fibre like concepts to manage and control the number of workers active.
On exiting each Cache::GetFrame queue a request for the "next" frame.
Use a history tracking algorithm to predict "next".
Completing the queued request does not queue the next frame in the current cache instance, child cache instances do enqueue.
Prioritise queue request by cache graph depth.
All caches are interlocked to protect non-thread safe filters.
Caches are given knowledge of child filters "Identity" so it can apply a single lock against all instances of a filter.
New filters can declare their thread safeness through enhanced cache hints interface. i.e. Unsafe, Instance only safe, ..., Fully safe!
New filters can declare their processing cost. i.e. Zero, Bitblt like, light, medium, heavy. (Zero cost filters do not get queued requests).
New filters can declare access order restrictions i.e. strictly linear, linear preferred over step N, random, ...
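The three kinds of declaration in the list above could be pictured roughly as follows. This is only an illustration in standard C++: none of these type or function names exist in the AviSynth API, the default cost and access order for legacy filters are assumptions, and the mapping from safety level to lock scope is one interpretation of the "single lock against all instances" remark.

```cpp
#include <cassert>

// Hypothetical hint values, named after the levels mentioned in the list.
enum class ThreadSafety { Unsafe, InstanceSafe, FullySafe };
enum class Cost { Zero, BitBltLike, Light, Medium, Heavy };
enum class AccessOrder { StrictlyLinear, LinearPreferred, Random };

struct FilterHints {
    ThreadSafety safety = ThreadSafety::Unsafe;  // filters that declare nothing count as unsafe
    Cost cost = Cost::Medium;                    // assumed default, not from the post
    AccessOrder access = AccessOrder::Random;    // assumed default, not from the post
};

// How widely the cache in front of a filter would have to lock.
enum class LockScope { PerFilterIdentity, PerInstance, None };

LockScope RequiredLock(const FilterHints& hints) {
    switch (hints.safety) {
        case ThreadSafety::Unsafe:       return LockScope::PerFilterIdentity;  // one lock for all instances
        case ThreadSafety::InstanceSafe: return LockScope::PerInstance;
        case ThreadSafety::FullySafe:    return LockScope::None;
    }
    assert(false && "unknown ThreadSafety value");
    return LockScope::PerFilterIdentity;  // most conservative fallback
}

// Zero-cost filters do not get queued prefetch requests.
bool ShouldEnqueuePrefetch(const FilterHints& hints) {
    return hints.cost != Cost::Zero;
}
```

A filter that declares nothing ends up with the widest lock and still gets prefetched, which matches the idea that old plugins keep working, just without any extra parallelism inside them.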
So for a simple graph with 1 filter and 1 source the request pattern is like this :-
-- Client X->cache2->filter->cache1->source
X requests cache2 requests filter requests cache1 requests source for frame 0.
source returns frame 0, cache1 enqueues for frame 1 and returns frame 0.
(okay I better trim this notation)
worker1 starts prefetch frame 1 from source.
cache2 enqueues for frame 1 and returns frame 0.
worker2 starts prefetch frame 1 from filter, blocks in cache1.
worker1 completes frame 1. No enqueuing!
worker2 unblocks, cache1 enqueues for frame 2 and returns frame 1.
worker3 starts prefetch frame 2 from source.
X requests frame 1, blocks in cache2.
worker2 completes frame 1. No enqueuing!
X unblocks, cache2 enqueues for frame 2 and returns frame 1.
worker3 completes frame 2. No enqueuing!
worker1 starts prefetch frame 2 from filter, cache1 enqueues for frame 3 and returns frame 2.
worker2 starts prefetch frame 3 from source.
worker2 completes frame 3. No enqueuing!
worker1 completes frame 2. No enqueuing!
Stall! cache2 has frame 2 ready, cache 1 has frame 3 ready, all CPU cores available to X
X requests frame 2, cache2 enqueues for frame 3 and returns frame 2.
worker3 ...
At the stall there is little point in Avisynth proceeding; the goal of instantly providing X with the frame it is requesting is being met.
If X had a bursty request pattern, say requesting 3 frames in quick succession followed by a long pause to process the 3, then simply adding sh0dan's prefetch filter would accommodate that. Of course, for bonus points the cache could measure the inter-frame request times and prefetch frames more aggressively based on minimum, maximum and latency times.
Threads that block on a cache lock get recycled in a fibre-like manner in an attempt to keep the defined number of workers concurrently active, hence the "Build a thread interlock infrastructure".
Most of the code to do this simply involves repackaging TSP's current code and inverting the logic so the distributor is in every cache instead of once only at the very top of the graph. All the confusing SetMTMode calls should no longer be necessary with this model. Legacy filters will be assumed thread-unsafe; wrapper functionality could promote a legacy filter's thread safeness.
Of course by building infrastructures and adding access to them in the API means filter authors can enqueue fragments of their internal processing along with the prefetch logic in a compatible way.
A work in progress!
New filters should also be able to declare how well they work with slice parallelization and how much overlap they should have when used with it ... for pure neighbourhood filters it's generally the most efficient method of parallelization, for others it might screw things up. A text file where you can put the information yourself as a user for older filters would also be nice.
moviefan
22nd July 2008, 17:46
How about this idea: split a video into X parts, where X is the number of cores, let the parts overlap by the number of frames used by e.g. MVTools, process all parts simultaneously and rejoin them, removing the overlap? Of course the parts won't finish at exactly the same time, but if you process 2 hours of material with 4 cores, so each core processes 30 minutes, the delay between the cores should be okay.
Ignore this, if it doesn't make sense or obviously doesn't work, since I don't know much about AviSynth development, but I thought, why not state my idea ;) So please comment on it.
Edit: Oh, I just realized, I maybe suggest the same idea like MfA... Why do you say, filter documentations should inform about the overlap? Of course, you need to know the overlap, but do you need to know it exactly? If you overlap a couple of frames more than needed, it won't hurt, would it?
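The frame arithmetic behind the split-with-overlap idea is simple enough to sketch. The helper below (standard C++, names such as Segment and SplitWithOverlap are made up) computes, for each part, the range of frames to process and the sub-range to keep when rejoining. All ranges are half-open, and the overlap is clipped at the start and end of the clip.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Half-open frame ranges: [begin, end) is processed, [keep_begin, keep_end) is kept.
struct Segment {
    int begin;
    int end;
    int keep_begin;
    int keep_end;
};

std::vector<Segment> SplitWithOverlap(int total_frames, int parts, int overlap) {
    assert(total_frames > 0 && parts > 0 && overlap >= 0);
    std::vector<Segment> segments;
    for (int i = 0; i < parts; ++i) {
        // Use 64-bit intermediates so total_frames * parts cannot overflow.
        const int keep_begin = static_cast<int>(static_cast<long long>(total_frames) * i / parts);
        const int keep_end = static_cast<int>(static_cast<long long>(total_frames) * (i + 1) / parts);
        // Extend each part by the overlap on both sides, clipped to the clip itself.
        const int begin = std::max(0, keep_begin - overlap);
        const int end = std::min(total_frames, keep_end + overlap);
        segments.push_back({begin, end, keep_begin, keep_end});
    }
    return segments;
}
```

For example, 100 frames split into 4 parts with an overlap of 5 gives processed ranges 0-30, 20-55, 45-80 and 70-100, of which 0-25, 25-50, 50-75 and 75-100 are kept. Overlapping by a few frames more than strictly needed only costs a little extra processing per part.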
I was talking about splitting frames, not splitting clips. Splitting clips is also a potential way of parallelization but it will use the caches less efficiently.
PS. and more importantly you can't directly feed an encoder or display ... you need to spool through a lossless file.
moviefan
22nd July 2008, 21:41
Okay, splitting frames... And yes, true, I would have to create a lossless temporary video for encoding, but at least I could use a multicore CPU much more efficiently.
squid_80
23rd July 2008, 10:27
A small benchmark for pipeline() used with avisynth257mt, to show the potential of this method.
Source is a 1920x1088 mpeg2 DVB capture.
Script:
source=mpeg2source("new.d2v")
#logo remove function
#basically 3 overlays and a blur on the logo region
wmremove(source, "tenhd")
#program is 4:3 letterboxed, crop and resize to pal
lanczosresize(768,576,src_left=246,src_top=5,src_width=-248,src_height=-20)
Output from xvid_encraw -cq 2:
4361 frames(100%) encoded, 22.93 fps, Average Bitrate = 1388kbps
Tot: enctime(ms) =190175.00, length(bytes) = 30272265
Avg: enctime(ms) = 43.60, fps = 22.94, length(bytes) = 6939
Revised script with pipeline filters:
source=mpeg2source("new.d2v").pipeline()
wmremove(source, "tenhd").pipeline()
lanczosresize(768,576,src_left=246,src_top=5,src_width=-248,src_height=-20)
Again fed to xvid_encraw -cq 2:
4361 frames(100%) encoded, 37.87 fps, Average Bitrate = 1388kbps
Tot: enctime(ms) =115148.00, length(bytes) = 30272265
Avg: enctime(ms) = 26.40, fps = 37.88, length(bytes) = 6939
Output is identical, but only takes 60% of the original time to produce.
Manao
23rd July 2008, 14:04
New filters should also be able to declare how well they work with slice parallelization and how much overlap they should have when used with it ... for pure neighbourhood filters it's generally the most efficient method of parallelization
No, it's the simplest to implement, but not the most efficient, since it adds synchronization constraints between threads at the end of each frame.
See the difference in threading efficiency with x264 when it went from slice-based to frame-based (for which threading efficiency is almost perfect)
x264 is hardly a pure neighbourhood filter ... x264 is non local with wildly varying cycle counts per pixel.
Manao
23rd July 2008, 17:46
Indeed, it isn't a pure neighborhood filter. So it's harder for it to get a good threading efficiency. Yet it manages so, thanks mostly to its frame-based threading.
Slice-based will be OK for very slow neighborhood filters, but as soon as speed increases, the inherent synchronization cost (on Windows, the smallest timeslice is by default 10ms, which can go down to 1ms if you know what you're doing) must be taken into account. And that, coupled with the fact that avisynth isn't the only process, so that slices that ought to take the same time won't, quickly makes slice-based threading quite inefficient (even though theoretically optimal).
pitch.fr
23rd July 2008, 17:49
could this be used with ConvertToRGB32 ?
this thing kills my CPU in realtime in ffdshow :(
even MT on 4 threads and a C2D doesn't help much..
SledgeHammer_999
23rd August 2008, 19:31
How do I use this filter? Is it compatible with Tdeint?
Here's my script:
LoadPlugin("C:\PROGRA~1\GORDIA~1\DGMPGDec\DGDecode.dll")
LoadPlugin("C:\PROGRA~1\GORDIA~1\AviSynthPlugins\UnDot.dll")
LoadPlugin("C:\PROGRA~1\GORDIA~1\AviSynthPlugins\Tdeint.dll")
LoadPlugin("C:\PROGRA~1\GORDIA~1\AviSynthPlugins\VSFilter.dll")
LoadPlugin("C:\PROGRA~1\GORDIA~1\AviSynthPlugins\avs_pipeline.dll")
# SOURCE
mpeg2source("pathtomovie.d2v")
# CROPPING
crop(0,0,718,572)
TDeint().pipeline()
# DENOISING: choose one combination (or none)
Undot()
# SUBTITLES
VobSub("foldertomovie\VTS_01_0")
# RESIZING
LanczosResize(600,336)
With this script I get this error:
CAVIStreamSynth: System exception - Access Violation at 0x195150b, writing to 0x0
I am using avisynth 2.5.8 RC3
squid_80
23rd August 2008, 19:42
You have to use the MT build of avisynth (http://www.avisynth.org/tsp/MT_07.zip).
SledgeHammer_999
23rd August 2008, 20:08
Thank you. It works. And I definitely see some speed improvement!!!
Will I miss something from 2.5.8(speed-wise)?
Quark.Fusion
1st September 2008, 16:25
Threading issues?
I'm trying to debug my filter chain with pipeline() (as the script sometimes breaks the linear access model) using the WriteFile() filter. The problem is that multiple instances of WriteFile("pos#.log","current_frame") mess up the "current_frame" variable, i.e. the second instance reports the frame number of the first.
Gavino
1st September 2008, 19:22
I'm trying to debug my filter chain with pipeline() (as the script sometimes breaks the linear access model) using the WriteFile() filter. The problem is that multiple instances of WriteFile("pos#.log","current_frame") mess up the "current_frame" variable, i.e. the second instance reports the frame number of the first.
Can you please post your script or the relevant portion of it, as I'd like to see the details of this.
However, multi-threading the run-time filters (of which WriteFile is one) is almost bound to be problematic because there is only one instance of current_frame.
Quark.Fusion
2nd September 2008, 02:38
It's just WriteFile() before and after PipeLine() (separate test script) — I was thinking that maybe current_frame is different in different threads. Still don't understand what causes non-linear access — solved it with SetMemoryMax(1024) and reducing Pipeline buffer to 4 in first process and 2 frames in second (thinking about changing 4 to 5 later).
I will not post complete script now because it is 2-part, very big and total mess (most is commented out or not used) :) need some cleanup…
I was experimenting with a 3-part script, but for some reason it crashes when using both TcpSource() and TcpServer(), or breaks the linear access model when two TcpSource() calls use one TcpServer(), so now I'm encoding the prefilter pass to lossless. The initial problem was fft3dgpu allocating more than 1GB of address space and crashing because of the 2GB limit.
Gavino
2nd September 2008, 09:18
It's just WriteFile() before and after PipeLine() (separate test script) — I was thinking that maybe current_frame is different in different threads.
Ok, thanks.
Since current_frame is just a script variable stored within the Avisynth core, both threads will be reading and writing to the same thing. To make this thread-safe, the core would need to be modified (probably requiring major re-engineering).