A question about the speed of a script


Melanchthon
9th August 2006, 02:28
Sections A, B, and D of a script run at ~1fps and take approximately four-and-a-half hours to encode 14915 frames.

Section C of the above script runs at ~2fps and takes approximately two-and-a-half hours to encode 18626 frames. Presumably it would take about two hours to encode 14915 frames.

Sections A, B, C, and D of the script, therefore, will take about six-and-a-half hours to encode 14915 frames, if by 'six-and-a-half' you mean 'sixty-eight' (based on VDM's estimate during a ten-minute test, by the end of which it had encoded 39 frames).
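
The arithmetic behind these estimates can be checked with a short Python sketch. It assumes the quoted rates are steady averages and that sections chained in one script simply add their per-frame costs; the function names are made up for illustration.

```python
def hours_to_encode(frames, fps):
    """Hours needed to encode `frames` at a steady rate of `fps`."""
    return frames / fps / 3600.0


def chained_fps(*rates):
    """Throughput of stages run back to back, assuming per-frame costs add up."""
    return 1.0 / sum(1.0 / r for r in rates)


FRAMES = 14915

abd_hours = hours_to_encode(FRAMES, 1.0)       # Sections A, B, D at ~1 fps
c_hours = hours_to_encode(FRAMES, 2.0)         # Section C at ~2 fps
expected_fps = chained_fps(1.0, 2.0)           # ideal rate with everything chained
expected_hours = hours_to_encode(FRAMES, expected_fps)

# Observed in the ten-minute test: 39 frames in 600 seconds.
observed_fps = 39 / 600.0
observed_hours = hours_to_encode(FRAMES, observed_fps)
slowdown = expected_fps / observed_fps

print(f"A+B+D alone: {abd_hours:.1f} h, C alone: {c_hours:.1f} h")
print(f"expected together: {expected_hours:.1f} h at {expected_fps:.2f} fps")
print(f"observed together: {observed_hours:.1f} h at {observed_fps:.3f} fps")
print(f"slowdown factor:   {slowdown:.1f}x")
```

Under those assumptions the chained script should need a little over six hours, while the measured 39 frames in ten minutes extrapolates to roughly 64 hours, close to VDM's own 68-hour projection and about a tenfold slowdown.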

It also BSODs my computer if I attempt to run it through DGVfapi, which was my major beef with it until I sat down and worked out just how slow it was. (It tells me the file 'is invalid' if I comment out the offending Section C, and only co-operates with just Section A active, but that's probably a different problem.)

The script itself, then, lest I be chased by a mob of placard-waving smilies:

setmemorymax(256) # a quarter of what I have
[load d2v]
[load filters]

# Section A #

a=last.tdeint(order=0,mode=1)
b=last.tdeint(order=1,mode=1)
repair_ff(a,b,booldup=true)
cdeblend(omode=1,dthresh=70,clip2=last.crop(8,8,-8,-8).bilinearresize(352,288))
tdecimate(mode=7,rate=23.976)

# Section B #

clp1=last
a=clp1.edgemask(thy1=0, thy2=14, thc1=0, thc2=0, type="laplace").expand()
maskedmerge (clp1,clp1.eedi2().bicubicresize(720,480).\
eedi2().bicubicresize(720,480),a.blur(1.58))

clp2=last
b=clp2.dehalo_alphamt2(rx=2.5,ry=2.5,lowsens=80,darkstr=0.4,brightstr=1.3)
maskedmerge(clp2,b,a.expand().expand().expand().inflate().inflate(), u=2, v=2)

clp3=last
mt_lutxy(clp3,clp3.removegrain(4), yexpr="x 125 < x 255 > | x y ?", uexpr="x", vexpr="x")

# Section C #

lq_filter()

# Section D #

mergechroma(blur(1))
tweak(sat=1.15,cont=1.15)
ylevelsg(0,1.15,255,0,255)

clp4= last.lanczosresize(652,488)
soothe (clp4.limitedsharpenfaster(ss_x=2.0,ss_y=2.0,strength=175,\
wide=true,soft=1,overshoot=5,edgemode=1),clp4,50)
crop(8,4,-4,-4)


Okay, so it's complex, but I learned how to use masktools along the way and it produces great results for the source it calls. The thing is, I have no idea why LQ filter and everything else together takes nearly ten times as long as LQ filter and everything else separately.

A practical solution is to run Section A and then run Sections B, C, and D on the resulting Lagarith (this should take maybe 6-7 hours... I used a really short sample). I think I can automate that using Job Control by writing the two scripts up so that the first calls a d2v and the second calls a dummy avi created solely to stop it throwing an error... which by the time the second script runs will be the product of the first script. I haven't tested that yet though. (edit 10/08/06: yeah, it works.)

I can get around this problem; I just want to know why it's happening. I'm using the version of RemoveNoiseMC that came with the folder of required plugins.

foxyshadis
9th August 2006, 04:43
Drop MT 1 entirely and replace it all by MT 2, you'll notice some speedup. (How much, I can't say...) edgemask->mt_edge, maskedmerge->mt_merge, *pand->mt_*pand, *flate->mt_*flate. Don't forget to do it to ylevelsg also.

Other bottlenecks I see:

blur(1.58)->removegrain(20,-1)
mergechroma(blur(1))->removegrain(0,12)

a=clp1.edgemask(thy1=0, thy2=14, thc1=0, thc2=0, type="laplace").expand ->
a=clp1.mt_edge(thy1=0, thy2=14, thc1=0, thc2=0, "laplace")

a.blur(1.58) ->
bilinearresize(m(4,width/1.2),m(4,height/1.2)).mt_lut("x 8 - 40 / .5 ^ 255 * ").mt_inflate().bilinearresize(width,height)
(A divisor of 1.2 is roughly what you have now, which is sharp; increase it to get a fuzzier but faster mask.)

a.expand().expand().expand().inflate().inflate() ->
a.lanczosresize(m4(width/4),m4(height/4)).mt_lut("x 15 - 256 *").mt_expand().lanczosresize(width,height)

(The latter two are assuming expand() is removed from the first line.)
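
The m4() helper used in these replacements is never defined in the thread. A common convention in AviSynth scripts is a small function that rounds a size to the nearest multiple of 4, so that the downscaled mask keeps dimensions the filters will accept. The Python sketch below models that assumed behaviour; the lower bound of 16 is also an assumption.

```python
def m4(x, minimum=16):
    """Round x to the nearest multiple of 4, never going below `minimum`.

    Assumed behaviour only; the thread never shows the real helper.
    """
    if x < minimum:
        return minimum
    return int(x / 4.0 + 0.5) * 4   # round half up, then scale back to pixels


# Mask sizes for the divisors mentioned above, on 720x480 and 652x488 frames.
for width, height, divisor in [(720, 480, 1.2), (720, 480, 4), (652, 488, 1.2)]:
    print(f"{width}x{height} / {divisor}: {m4(width / divisor)}x{m4(height / divisor)}")
```

A larger divisor gives a smaller intermediate mask, which is where the speedup comes from: the lut and expand steps touch far fewer pixels before the mask is resized back up.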

Do you need ss_x=2.0,ss_y=2.0? Usually the difference to 1.5 is minimal, and with Smode=4 even less is needed.

Are you using the two eedi2s for antialiasing? Although it works, it's slow and artifact prone; at the very least you should turn it between operations to get vertical and horizontal. With a tighter mask you might be able to get by with one and a degrainmedian or similar though; when I get home I can test it out.

I suspect a bug in part B: chroma is only treated explicitly once, on a function that doesn't need it, but the lutxy will trash it if not told otherwise. You can probably even combine all three into one faster set, but my brain is worn out now.

Also consider boosting memory as high as you can without swapping, it's quite possible that you might be killing the cache partway through these sections. Tritical's CVS build has much better cache-handling characteristics.

I pretty much habitually write with speed taken into account these days, perhaps I should put together some functions so that clarity isn't sacrificed so much. :p

Melanchthon
10th August 2006, 00:51
New script, major changes in bold:

################
clp1=last
emask=clp1.mt_edge(thy1=0, thy2=14, thc1=0, thc2=0, "laplace").bilinearresize(m4(width/2),m4(height/2)).mt_lut("x 8 - 40 / .5 ^ 255 * ").mt_inflate().bilinearresize(720,480)
mt_merge (clp1,clp1.eedi2().bicubicresize(720,480).eedi2().bicubicresize(720,480),emask)

clp2=last
b=clp2.dehalo_alphamt2(rx=2.5,ry=2.5,lowsens=80,darkstr=0.4,brightstr=1.3)
mt_merge(clp2,b,emask)

clp3=last
mt_lutxy(clp3,clp3.removegrain(4), yexpr="x 125 < x 255 > | x y ?", uexpr="x", vexpr="x")

lq_filter()
removegrain(0,12)

tweak(sat=1.15,cont=1.15)
ylevelsg(0,1.15,255,0,255)

clp4= last.lanczosresize(652,488)
soothe (clp4.limitedsharpenfaster(ss_x=1.5,ss_y=1.5,strength=150,wide=true,soft=1,overshoot=5,edgemode=1),clp4,50)
crop(8,4,-4,-4)
################

I used one edgemask for both the EEDI2 and the dehaloing, as it was sufficient for both.

Quote (foxyshadis):
Are you using the two eedi2s for antialiasing? Although it works, it's slow and artifact prone; at the very least you should turn it between operations to get vertical and horizontal.
Turning the frame between operations effectively nullifies one of them. The only problem I've noticed is the blurring of subtle details, which is what the edgemask is for. I don't want to start losing detail even before I've started cleaning.

It's a problem with TDeint's interpolation, I think. There are some lines it can't reconstruct (example (http://img164.imageshack.us/my.php?image=unfilteredunresizedwz1.png)).

Quote (foxyshadis):
I suspect a bug in part B: chroma is only treated explicitly once, on a function that doesn't need it, but the lutxy will trash it if not told otherwise. You can probably even combine all three into one faster set, but my brain is worn out now.
I can't get the hang of lutxy. The way it's laid out makes my brain go pop, and I don't really understand how uexpr and vexpr work.
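
One way to get a feel for these expressions is to evaluate them by hand for a single pixel. The Python sketch below is a simplified stack evaluator, not MaskTools' own code: it treats comparisons as 1.0 or 0.0 and does no clamping to the 0-255 range, both of which are assumptions made to keep it short.

```python
def eval_rpn(expr, x, y=0.0):
    """Evaluate a MaskTools-style reverse Polish expression for one pixel pair."""
    binary = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,
        "^": lambda a, b: a ** b,
        "<": lambda a, b: float(a < b),
        ">": lambda a, b: float(a > b),
        "==": lambda a, b: float(a == b),
        "|": lambda a, b: float(bool(a) or bool(b)),
        "&": lambda a, b: float(bool(a) and bool(b)),
    }
    stack = []
    for token in expr.split():
        if token == "x":
            stack.append(float(x))
        elif token == "y":
            stack.append(float(y))
        elif token == "abs":
            stack.append(abs(stack.pop()))
        elif token == "?":
            # "cond a b ?" means: a if cond else b
            if_false = stack.pop()
            if_true = stack.pop()
            condition = stack.pop()
            stack.append(if_true if condition else if_false)
        elif token in binary:
            b = stack.pop()
            a = stack.pop()
            stack.append(binary[token](a, b))
        else:
            stack.append(float(token))   # numeric literal such as 125 or .5
    return stack.pop()


EXPR = "x 125 < x 255 > | x y ?"
print(eval_rpn(EXPR, x=100, y=50))   # dark pixel: x is below 125
print(eval_rpn(EXPR, x=200, y=50))   # bright pixel: x is 125 or above
```

Read this way, the yexpr in the script keeps the pixel from the first clip (x) when its luma is below 125 and otherwise takes the pixel from the second clip (y, the removegrain(4) result). The "x 255 >" branch can never fire on 8-bit video. An expression of just "x", as in uexpr and vexpr, is the identity: the output equals the first clip's value.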

Now that I'm using one edgemask, I can at least combine the anti-stairstepping and dehaloing sections. I'll work on that and then run some speed tests. It looks as though whatever issue is causing the tenfold increase in encoding time when the whole script is used hasn't gone away, so I might try using TBilateral instead of RemoveNoiseMC.

Quote (foxyshadis):
Also consider boosting memory as high as you can without swapping, it's quite possible that you might be killing the cache partway through these sections. Tritical's CVS build has much better cache-handling characteristics.

VDM won't load with tritical's build.
http://img283.imageshack.us/img283/1874/errornk3.gif

I used the link with the installer.

foxyshadis
10th August 2006, 03:25
Quote (Melanchthon):
Turning the frame between operations effectively nullifies one of them. The only problem I've noticed is the blurring of subtle details, which is what the edgemask is for. I don't want to start losing detail even before I've started cleaning.
You'll probably want a different edgemask then, it's my feeling that this one is too wide and should hug the edges more. But as I don't have your video that may not be the case. Anyway, eedi2.resize.turnleft.eedi2.turnright.resize doesn't nullify anything - it will be the same speed and should give smoother results. But I'm still curious what you use eedi2 to correct exactly.

Quote (Melanchthon):
I can't get the hang of lutxy. The way it's laid out makes my brain go pop, and I don't really understand how uexpr and vexpr work.
yexpr, uexpr, and vexpr are just expr applied to a single channel. I'll be the first to admit reverse polish is a brain-buster, but you can put mt_polish to use here; mt_lut(expr=mt_polish("(x + 1)/2")) will give you mt_lut(expr="x 1 + 2 /"). That'll ease the transition considerably.
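
The infix-to-postfix conversion described here can be sketched with the classic shunting-yard algorithm. The Python below is an illustration of the idea only, not mt_polish's actual implementation; it handles the binary operators + - * / ^ and parentheses, and does not support unary minus or function calls.

```python
import re

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOCIATIVE = {"^"}


def to_rpn(infix):
    """Convert an infix expression to reverse Polish notation (shunting-yard)."""
    tokens = re.findall(r"\d*\.?\d+|[a-z]+|[-+*/^()]", infix)
    output, operators = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop operators that bind at least as tightly as the incoming one.
            while operators and operators[-1] != "(" and (
                PRECEDENCE[operators[-1]] > PRECEDENCE[tok]
                or (PRECEDENCE[operators[-1]] == PRECEDENCE[tok]
                    and tok not in RIGHT_ASSOCIATIVE)
            ):
                output.append(operators.pop())
            operators.append(tok)
        elif tok == "(":
            operators.append(tok)
        elif tok == ")":
            while operators[-1] != "(":
                output.append(operators.pop())
            operators.pop()              # discard the matching "("
        else:
            output.append(tok)           # number or variable name
    while operators:
        output.append(operators.pop())
    return " ".join(output)


print(to_rpn("(x + 1)/2"))
```

For the example from the post, "(x + 1)/2", this yields the same string foxyshadis quotes: "x 1 + 2 /".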

You should try translating some of Didée's abominations, like this. :p
"x y == x x x y - abs 16 / 1 2 / ^ 16 * "+Str+" * x y - 2 ^ x y - 2 ^ "+Str+" 100 * 25 / + / * x y - x y - abs / * + ?"

Quote (Melanchthon):
It looks as though whatever issue is causing the tenfold increase in encoding time when the whole script is used hasn't gone away, so I might try using TBilateral instead of RemoveNoiseMC.

Yes, there's a good chance that the two eedi2s and removenoisemc suck up most of it. This one is where it helps to have a bunch of different denoisers & showfiveversions and just give them all a toss. mvdegrain is very similar and faster, and I suppose with lq_filter you don't need a very conservative function anyway.

Hmm, and the only thing I know of that fails with cvs is mpeg2dec3, which is ancient anyway. Odd.

Didée
10th August 2006, 06:31
Quote (foxyshadis):
You should try translating some of Didée's abominations, like this. :p
"x y == x x x y - abs 16 / 1 2 / ^ 16 * "+Str+" * x y - 2 ^ x y - 2 ^ "+Str+" 100 * 25 / + / * x y - x y - abs / * + ?"
Translation:
"sub-ranged gamma function, sign-preserving, low-value-damped."

In a word, Peanuts.


Melanchthon, you're probably underestimating the complexity of that filter chain you're using.

lq_filter is an application of MV-search & MV-compensation through MVTools. These alone tend to need plenty of resources. Feeding such a function with something that has already been pre-processed with other time-consuming filters ... doesn't exactly help.
But the big crushdown happens, at the latest, when TheHighlyComplexKnotCreatedSoFar is processed with another temporal filter [here: Soothe], because most probably the frames can't be cached anymore for the temporal filter. So, what will happen is either swapping to disk [slooow], or re-computation of those parts that can't be cached [muhucho slower than slooow].
Compare: You might be able to lift a ball of 30 kg ... but you won't be able to juggle five such balls at once.

Therefore, it is absolutely normal that applying another temporal filter when the resource usage is already dangerously high will bring speed down into ridiculous ranges ... nothing wrong with the script, nothing wrong with any filter or function. It's just that, when you take a chessboard and start placing one rice grain on the first field, two grains on the second, 4 on the third and so on, it's hard to imagine that all the rice on earth won't suffice to complete the procedure. :)

For similar reasons, faster machines & more RAM are only makeshifts, but not a solution. Get a 10GHz machine with 16GB RAM, and the script will run fine.
Then we'll add two small lines to that script, and the monster machine will render much slower than what you're getting now. Making complexity go BOOM is easy, very easy ... ;)
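
The blow-up described in this post can be reproduced with a toy model. The Python sketch below stacks temporal filters that each request a frame plus its neighbours, and counts how often the source has to be produced with and without a cache. It is only an illustration: AviSynth's real cache is more complex, and all names here are invented.

```python
from functools import lru_cache


def build_chain(stages, radius, cached):
    """Stack `stages` temporal filters over a source and count source requests."""
    counter = {"decodes": 0}

    def source(n):
        counter["decodes"] += 1          # stands in for an expensive upstream frame
        return n

    def temporal(upstream):
        def get_frame(n):
            # A temporal filter needs the neighbours of frame n as well as n itself.
            return sum(upstream(n + d) for d in range(-radius, radius + 1))
        return lru_cache(maxsize=None)(get_frame) if cached else get_frame

    node = lru_cache(maxsize=None)(source) if cached else source
    for _ in range(stages):
        node = temporal(node)
    return node, counter


def decodes_for_one_frame(stages, radius, cached):
    """Source frames produced in order to deliver a single output frame."""
    chain, counter = build_chain(stages, radius, cached)
    chain(100)
    return counter["decodes"]


for stages in (1, 2, 3, 4):
    with_cache = decodes_for_one_frame(stages, radius=1, cached=True)
    without_cache = decodes_for_one_frame(stages, radius=1, cached=False)
    print(f"{stages} temporal stage(s): {with_cache} cached vs {without_cache} uncached")
```

With a working cache the cost grows linearly with the number of stacked temporal filters; once frames can no longer be cached it grows as a power of the filter's window size, which is the chessboard effect in miniature.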

foxyshadis
10th August 2006, 12:29
It's the mental reordering that makes luts like that scary, until you've unwound (or created) a few yourself. :p

Oh, and I totally forgot avstimer, silly me. Sprinkle avstimer calls in between all the pieces of your script and watch it go. (A little complicated to figure out, but worth it.) You also need debugview from sysinternals. I'm not sure how effective it'd be at finding mutilated caches, but wherever normally fast functions are horribly slow should give you an idea.

The cache problem is something cvs builds do better than 2.5.6, and tritical's mods even better than cvs, with complex stuff like this; that's why it'd be nice to get one or the other installed.

Melanchthon
11th August 2006, 17:28
Thanks for the replies, guys.

@ Didée

That's the kind of response I was looking for. I knew it was something to do with the combination, as the script was fine if I split it up, but I didn't know enough about AviSynth to work it out.

@foxyshadis

I'll post some screenshots. Most of them are lossy so they don't take forever to load, and I'm satisfied that the quality is good enough for some general comparisons.

The original screen, after the 29.97fps --> 23.976fps conversion:
http://img238.imageshack.us/img238/2703/originaljq4.jpg

The edgemask without any alterations:
http://img58.imageshack.us/img58/1312/emaskfirstju0.gif

The edgemask after being filled out:
http://img71.imageshack.us/img71/8792/emaskfinalpn0.gif

I think it works fine-- the thresholds keep it off the fine details in the lower-left of the screen. Neither dehalo alpha nor eedi2 have much effect on flat areas, so I'm not too worried about the mask being on the wide side.

Couple of extreme examples:

turnright.(eedi2.resize x6.)turnleft
http://img61.imageshack.us/img61/1664/eedi6turnedbv4.jpg

(eedi2.resize x6)
http://img128.imageshack.us/img128/7325/eedi6pp8.jpg

A single eedi2 doesn't look too bad by itself, but after filtering the line of the arm has a 'crinkle-cut' effect. Also, the reason I picked lq_filter was because it kept more detail than everything else I tried. I did switch to TBilateral in the end, and one session of tweaking that and LSF produced this:
http://img119.imageshack.us/img119/7691/filteredandtweakedtbilateralpa8.png

I gave avstimer a go, and it went down to 2fps after the masking and to 1fps after that (encoding, it does sometimes register as 2fps for a split second, but that's as fast as it gets). I'm still trying to get it to work on single sections. I tried using it on Section A, but unless I put in a 'return last' line after avstimer it would go to about 25fps when it should be 8-10 fps.

Now, to try and find out why VFApi thinks the script is invalid...

foxyshadis
11th August 2006, 23:37
Well, avstimer can measure the whole script, but it's really meant to measure the individual chunks of it. You'd put it after every important sequence, with appropriate labels, and it'll tell you that, e.g., creating the edgemask runs at 20 fps, running tbilateral at 40, eedi2 at 5, repairff & cdeblend at 8 fps, etc. That way you can gauge the slowest parts at a glance instead of having to guess.
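
Per-stage figures like these can be turned into a rough overall rate and a breakdown of where the time goes. The Python sketch below uses the illustrative numbers from the post (they are examples, not measurements) and assumes each figure is the stage's own standalone rate, so that per-frame costs simply add; avstimer's actual output may be structured differently.

```python
def combined_fps(stage_fps):
    """Overall rate if every frame passes through each stage in turn."""
    seconds_per_frame = sum(1.0 / fps for fps in stage_fps.values())
    return 1.0 / seconds_per_frame


def time_share(stage_fps):
    """Fraction of total per-frame time spent in each stage."""
    total = sum(1.0 / fps for fps in stage_fps.values())
    return {name: (1.0 / fps) / total for name, fps in stage_fps.items()}


# Illustrative per-stage rates quoted in the post above.
stages = {
    "edgemask": 20.0,
    "tbilateral": 40.0,
    "eedi2": 5.0,
    "repairff + cdeblend": 8.0,
}

overall = combined_fps(stages)
shares = time_share(stages)
slowest = max(shares, key=shares.get)

for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:20s} {share:6.1%}")
print(f"overall: {overall:.2f} fps, slowest stage: {slowest}")
```

On these example numbers the chain would run at about 2.5 fps, with eedi2 accounting for half of the time per frame. If the measured overall rate is far below that kind of estimate, the gap points to lost caching rather than to any single slow filter.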