MMX optimized LeakKernelDeint 1.5.4 - Page 2

kassandro · 26th August 2004, 12:27

Quote:

Originally posted by Leak
What other way than NewVideoFrame is there to create a frame, and why would anybody do something like that? And even then, my output buffer will be 8-byte aligned, so all I'd have to do is special-case the last line so I don't do an out-of-bounds read at the very end; if I happen to process some data from the next line at the line end that's not used it doesn't matter. Or I could just copy non-8-byte-aligned frames into a new frame; that's still faster than falling back to the C++-implementation.

You are right: external filters can only create frames with NewVideoFrame, but the internal filters are not bound by that! For instance, frames generated by crop are usually not aligned, to avoid an unnecessary bitblt. Only if you crop with align=true, which is not the default, do you get a properly aligned frame. You are also correct that it is nearly impossible to get a read access error if you read only a few bytes beyond the allowed range (a memory page is at least 4 KB), but it is simply dirty programming style to do so. If your mind is not sharpened for these kinds of problems, you will sooner or later end up in a mess - at least in larger projects.
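To make the alignment point concrete, here is a minimal C sketch (not taken from any filter's source) of the two checks an MMX routine would want before using 8-byte loads: whether every row starts on an 8-byte boundary, and how many bytes of a row can be read in whole 8-byte chunks without touching memory past the row. The helper names `rows_are_8byte_aligned` and `full_qword_bytes` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if the first row starts on an 8-byte boundary
   and the pitch is a multiple of 8, so every following row does as well. */
static int rows_are_8byte_aligned(const unsigned char *first_row, ptrdiff_t pitch)
{
    assert(first_row != NULL);
    return ((uintptr_t)first_row % 8 == 0) && (pitch % 8 == 0);
}

/* Hypothetical helper: number of leading bytes in a row that can be handled
   with whole 8-byte loads. The remaining (row_size - result) bytes need a
   scalar tail loop instead of one more 8-byte read past the end. */
static size_t full_qword_bytes(size_t row_size)
{
    return row_size & ~(size_t)7;
}
```

For a 716-byte row, `full_qword_bytes(716)` is 712, which leaves a 4-byte tail to handle separately rather than by overreading.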

Quote:

Also, I'm not convinced that the speedup I'll get from going from MMX to SSE/SSE2 (which mostly added floating point stuff I wouldn't use anyway)

SSE contains a lot more than some new floating point instructions. While in SSE you can do only single precision floating point stuff in SSE registers, you can do a lot more in the MMX registers. For instance, the very useful instructions pminub and pmaxub are SSE only and can only be emulated with MMX, and the extremely powerful psadbw instruction cannot even be emulated. Though there are only very few new integer SSE instructions, they are very useful and fill a gap left by MMX. I could have implemented RemoveGrain (not RemoveDirt) for MMX only, because it doesn't use psadbw, but it would have been much slower. The fun starts with SSE and not with MMX.

sh0dan · 26th August 2004, 13:04

A sidenote should be that the MMX extensions kassandro mentions are also referred to as Integer SSE, which is present on 95% of all processors using AviSynth today.

However, I didn't find any obvious places where it would make sense to apply them.

A thing you use a lot:

Code:

mov ebx,043544354h ; 32768*0.526
movd mm2,ebx

G.P. register <-> MMX register transfers are bad (=slow). You should either 1) read it from memory into MMX, or 2) store the GPR in memory and read it from memory with MMX (yes, this is faster - the CPU will be able to do a Store->Load Forward if the memory is aligned).
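For reference, the constant in that snippet can be reproduced with a few lines of C. This is only a sketch of the arithmetic implied by the comment (`32768*0.526`, i.e. the coefficient scaled by 2^15 and rounded), with the 16-bit result repeated in both halves of the 32-bit value; `pack_q15_pair` is a made-up name, not something from the filter's source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: scale a coefficient in [0, 1) by 32768 (2^15),
   round to the nearest integer, and repeat the 16-bit result in both
   halves of a 32-bit value. */
static uint32_t pack_q15_pair(double coeff)
{
    assert(coeff >= 0.0 && coeff < 1.0);
    uint32_t q15 = (uint32_t)(coeff * 32768.0 + 0.5);
    return (q15 << 16) | q15;
}
```

`pack_q15_pair(0.526)` gives `0x43544354`, matching the value loaded into `ebx`. Keeping such a value in a memory constant and loading it with a memory-to-MMX `movd` is option 1 from the post.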

Leak · 26th August 2004, 15:24

Quote:

Originally posted by sh0dan
A thing you use a lot:

Code:

mov ebx,043544354h ; 32768*0.526
movd mm2,ebx

G.P. register <-> MMX register transfers are bad (=slow). You should either 1) read it from memory into MMX, or 2) store the GPR in memory and read it from memory with MMX (yes, this is faster - the CPU will be able to do a Store->Load Forward if the memory is aligned).

Well, that might be because when I started dabbling around with assembler 486s were the top of the line and accessing registers was much less expensive than accessing memory - guess I haven't gotten over that yet...

I guess I'll fix that and do a comparison; are you sure the difference will be really noticeable?

(EDIT: Okay, so my test script went from 338 FPS to 342 FPS with this change; good to have, but still hardly noticeable...)

Then again, I also made the mistake of using MOVNTQ for writing to the target buffer after reading its description in Intel's docs thoroughly, only to discover that it tears the filter's performance to shreds...

np: Komeit - When The Sun Hits (Blue Skied An' Clear comp.)

Leak · 26th August 2004, 15:31

Quote:

Originally posted by kassandro
You are right: external filters can only create frames with NewVideoFrame, but the internal filters are not bound by that! For instance, frames generated by crop are usually not aligned, to avoid an unnecessary bitblt.

Ugh. Didn't think about crop there...

Quote:

Only if you crop with align=true, which is not the default, do you get a properly aligned frame. You are also correct that it is nearly impossible to get a read access error if you read only a few bytes beyond the allowed range (a memory page is at least 4 KB), but it is simply dirty programming style to do so. If your mind is not sharpened for these kinds of problems, you will sooner or later end up in a mess - at least in larger projects.

I know that. As I said, it didn't occur to me that crop would fail the assumption I had made; but still - all I need to special case is the last line (assuming the line is at least 8 bytes long, which would IMHO be a sensible constraint to be checked in the constructor) of each field - still a bit messy, but it takes a lot less code duplication. Overreading from one of the other lines into the next one is totally harmless.

Quote:

SSE contains a lot more than some new floating point instructions. While in SSE you can do only single precision floating point stuff in SSE registers, you can do a lot more in the MMX registers. For instance, the very useful instructions pminub and pmaxub are SSE only and can only be emulated with MMX, and the extremely powerful psadbw instruction cannot even be emulated.

Yeah, it could be emulated, but it'd need a lot of effort... still, I can't really use it as I need the absolute difference for each pixel, not the sum of them - and that's something that's doable with reasonable effort in MMX. If you add some shuffling and unpacking you can emulate PSADBW...
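A rough C model of what is being discussed here, written per byte rather than with real MMX registers: the absolute difference of two unsigned bytes can be built from two saturating subtractions (what `psubusb` does) combined with an OR (`por`), and summing eight of those differences gives the number `psadbw` would produce for one 8-byte group. The function names are invented for this sketch, and it says nothing about how KernelDeint actually implements it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned saturating subtraction of one byte, as psubusb does per lane. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* |a - b|: one of the two saturating differences is always 0, so OR-ing
   them yields the absolute difference. */
static uint8_t abs_diff_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(sub_sat_u8(a, b) | sub_sat_u8(b, a));
}

/* Per-pixel absolute differences for one 8-byte group. */
static void abs_diff_x8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < 8; ++i)
        out[i] = abs_diff_u8(a[i], b[i]);
}

/* Sum of the eight absolute differences, i.e. the value psadbw returns
   for one 8-byte group. */
static unsigned sad_x8(const uint8_t *a, const uint8_t *b)
{
    uint8_t d[8];
    unsigned sum = 0;
    abs_diff_x8(a, b, d);
    for (size_t i = 0; i < 8; ++i)
        sum += d[i];
    return sum;
}
```

The per-pixel result (`abs_diff_x8`) is the part a motion mask needs. The horizontal sum in `sad_x8` is the extra step that takes the shuffling and unpacking mentioned above when done in plain MMX.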

Quote:

Though there are only very few new integer SSE instructions, they are very useful and fill a gap left by MMX. I could have implemented RemoveGrain (not RemoveDirt) for MMX only, because it doesn't use psadbw, but it would have been much slower. The fun starts with SSE and not with MMX.

As I said, I'll do an SSE/SSE2 version as well, but it's further down my to-do list; I started with an MMX version as I wanted it to run on every machine that's capable of running AviSynth.

It's already a lot faster than the last version, and I'm quite happy about that, and to be totally honest I did it to be able to integrate part of it into BlendBob...

np: The Notwist - Trashing Days (Neon Golden)

Leak · 26th August 2004, 23:24

KernelDeint 1.5.1 (with source) (Old version; see first post for newest version)

This version includes the changes sh0dan suggested (which resulted in a small speedup) and doesn't read beyond the end of frames anymore if your pitch doesn't happen to be a multiple of 8.

EDIT: There's one other change I forgot - in 1.5.0 the order parameter was inverted for RGB32 video; is anybody even using it for that?

Still, if your pitch isn't evenly divisible by 8, you'll get a minor drop in speed if the frame is still aligned on an 8 byte boundary (for example if you just crop off something to the right), but you'll get a bigger speed drop if that's not the case as unaligned memory access just plain takes longer (which can happen when you cut off stuff on the left with crop) - in that case, always use "align=true" with crop.

Or, in numbers:

Code:

Testclip                       FPS
=============================  ===
Normal                         332
Crop(0,0,716,480,align=false)  329
Crop(2,0,716,480,align=false)  282
Crop(2,0,716,480,align=true)   325

My testclip is a 720x480 VOB read via MPEG2Source, and the FPS of doing a KernelBob(1,8,linked=false) were measured using AvsTimer.

np: Markus Guentner - Sleep Well (Audio Island)

Bogalvator · 27th August 2004, 00:26

Quote:

Originally posted by Leak
But still - what do you mean by "proper bobbing"?

In KernelBob's current state it is working as a single rate deinterlacer, shifted by a field and then simply reinterleaved.

It seems to me that the aim of a "proper" bobber would be to return each field to its full resolution - which may involve different strategies than those of a deinterlacer that's aiming to give the best results for 25 fps or 29.97 fps progressive output.

I hope that made some sort of sense.

Xesdeeni · 27th August 2004, 13:59

Quote:

Originally posted by Bogalvator
In KernelBob's current state it is working as a single rate deinterlacer, shifted by a field and then simply reinterleaved.

It seems to me that the aim of a "proper" bobber would be to return each field to its full resolution - which may involve different strategies than those of a deinterlacer that's aiming to give the best results for 25 fps or 29.97 fps progressive output.

I hope that made some sort of sense.

As I understand it, bob was created for displaying an interlaced video on a progressive display, normally creating the same number of progressive frames as input fields, i.e. PAL would result in 50 progressive fps, while NTSC would result in 59.94 progressive fps (obviously then using frame duplication for display at 70, 72, 75, 85 etc. Hz).

Deinterlacing is basically the process of creating the progressive frames from the interlaced input. But the term also seems to have evolved into the process that outputs progressive frames at the input frame rate (i.e. PAL would result in 25 progressive fps, while NTSC would result in 29.97 progressive fps), normally to improve video encoding by feeding progressive instead of interlaced frames to the codec.

In my book, technically inverse telecine is deinterlacing as well, but it's so specific to film (and the output for NTSC is not the input frame/field rate) I think it's a class of its own.

So at the risk of being pedantic, I guess I'd say a "proper deinterlacer" would use any technique possible (including inverse telecine, field matching, etc.) to create a progressive output at the input frame rate (25 or 29.97), while "proper bob" would use the same techniques to create a progressive output at the input field rate (50 or 59.94).

BTW, my interest in deinterlacing is for standards conversion, so bob is much more useful to me.

Oh, and just to throw out a little controversy, the better the deinterlacer, the harder the image will be to compress. So saying that the output of a particular deinterlacer is "harder to compress" can mean the quality is better! [To prove this, you'd need perfect deinterlacing. The closest we have is inverse telecine. So if you don't believe me, take a telecined video and IVTC it, and compare compressing this to a video deinterlaced using your favorite deinterlacer. (Be sure to match frame rates.) Barring out-and-out failures of the deinterlacer (combing left in), the IVTC should be at least as difficult to compress, and usually more difficult.]

Xesdeeni

Nicholi · 28th August 2004, 15:21

The shiny packages reference to the chroma artifacts possibly being lessened doesn't seem to hold true for all the things I've tried so far.

Wondering if anyone else has though? And what type of source?
Dealing mostly with anime here, and the 1.4.0 and 1.5.1 images look exactly the same.

Leak · 28th August 2004, 15:54

Quote:

Originally posted by Nicholi
The shiny packages reference to the chroma artifacts possibly being lessened doesn't seem to not hold true for all the things I've tried so far.

Whoa there - could you please use less double negatives? I'm having a hard time figuring out if you get more or less chroma artifacts.

Quote:

Wondering if anyone else has though? And what type of source?
Dealing mostly with anime here, and the 1.4.0 and 1.5.1 images look exactly the same.

Could you please post some images then?

Also, what I've fixed was one kind of chroma artifacts (which produced some faint chroma ghosting when one plane was deinterlaced and the other wasn't), I didn't say it'd fix _ALL_ possible chroma artifacts; if you mean the slight ghosting KernelDeint causes now and then then that's something inherent in the algorithm that can't be avoided.

Try using a threshold of 0 on the images you get artifacts on, if they stay they're not of the kind that my change fixes.

np: Sole - Teepee On A Highway Blues (Selling Live Water)

Nicholi · 28th August 2004, 16:11

Err yeah sorry. Edited accordingly. I suppose we are talking about different things, my mistake. I speak of the usual "ghosted" lines from the deinterlaced movement. Already on thresh=0 also.

I would upload the pics if I could, but alas I have no webhost or such. So it seems what I was looking for is not actually here, heh. And also unavoidably removed...oh well. There is always Sangnom to alternate with on occasion.

Thank you for continuing work on an already great filter however.

KernelDeint is my default choice for deinterlacing, and mayhaps when DG returns he will hopefully implement your hard work.

DDogg · 28th August 2004, 16:56

Nicholi, - For image hosting try this. It is great. You don't even have to register.

Boulder · 6th September 2004, 16:30

There seems to be something wrong with KernelBob.

With this script I get a bad result:
MPEG2Source("c:\temp\captures\startrek.d2v",idct=7)
KernelBob(order=1,sharp=true,threshold=7)
SeparateFields()
SelectEvery(4,1,2)
Weave()

With this script it's OK:
MPEG2Source("c:\temp\captures\startrek.d2v",idct=7)
KernelBob(order=1,sharp=true,threshold=7)
ConverttoYUY2()
SeparateFields()
SelectEvery(4,1,2)
Weave()

Here are the screenshots:

YV12

YUY2

The original material

If I replace KernelBob with a simple Bob(), both scripts give a proper result. I didn't test whether the v1.4.0 with scharfis_brain's function would behave the same way.

The screenshots are not exactly at the same frame so don't pay attention to that. Just pay attention to the amount of combing in the subtitles in the YV12 version.

The funny thing is that this weird behaviour is not seen throughout the whole clip, it's just a small portion in the middle. I also tried order=0 and SelectEvery(4,0,3) but it didn't help.

Boulder · 6th September 2004, 17:31

Just tried it with KernelDeint v1.5.1 and scharfis_brain's function from the Restore24 package, and it gives a correct output without a need to convert to YUY2. So there's something wrong with Leak's implementation of the function.

Code:

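function kernelbob(clip a, int "th", bool "mask")
{	mask=default(mask,false)
	th=default(th,5)
	# map the clip's parity to an order value: 1 if GetParity is true, else 0
	ord = getparity(a) ? 1 : 0
	# first pass: deinterlace the clip as it is
	f=a.kerneldeint(order=ord, sharp=true, twoway=false, threshold=th,map=mask)
	# second pass: drop the first field, re-weave, deinterlace with the opposite order
	e=a.separatefields.trim(1,0).weave.kerneldeint(order=1-ord, sharp=true, twoway=false, threshold=th,map=mask)
	# alternate the frames of both passes and mark the result as frame-based
	interleave(f,e).assumeframebased
}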

Leak · 6th September 2004, 18:15

Quote:

Originally posted by Boulder
There seems to be something wrong with KernelBob.

...

If I replace KernelBob with a simple Bob(), both scripts give a proper result. I didn't test whether the v1.4.0 with scharfis_brain's function would behave the same way.

The screenshots are not exactly at the same frame so don't pay attention to that. Just pay attention to the amount of combing in the subtitles in the YV12 version.

The funny thing is that this weird behaviour is not seen throughout the whole clip, it's just a small portion in the middle. I also tried order=0 and SelectEvery(4,0,3) but it didn't help.

Hmmm... yeah, taking a close look at KernelBob's output it seems something isn't totally right; could you try setting the threshold to 0 and have a look at the b0rked segment again? I have the nagging feeling that I've got an error in my motionmask code, so if there are no artifacts using a threshold of 0 (which turns KernelDeint into something closer to a regular Bob()) I know where to look.

Also, could you cut out a short part of that sequence and upload it somewhere?

I'm pretty sure it's got something to do with the order of the fields getting passed into MotionMask, but I can't put my finger on it...

np: Plaid - Crumax Rins (Spokes)

erratic · 6th September 2004, 18:17

I have noticed that with Leak's KernelBob I have to use SelectEvery(4,0,3) to maintain the field order. If I use SelectEvery(4,1,2) the field order is reversed. This happens with both TFF and BFF sources.

No matter what the source is, with scharfis_brain's kernelbob function I have to use SelectEvery(4,1,2) to get TFF, and SelectEvery(4,0,3) to get BFF.

Leak · 6th September 2004, 18:28

Quote:

Originally posted by erratic
I have noticed that with Leak's KernelBob I have to use SelectEvery(4,0,3) to maintain the field order. If I use SelectEvery(4,1,2) the field order is reversed. This happens with both TFF and BFF sources.

No matter what the source is, with scharfis_brain's kernelbob function I have to use SelectEvery(4,1,2) to get TFF, and SelectEvery(4,0,3) to get BFF.

That might be because I do an AssumeTFF() internally before calling SeparateFields in my filter to get a fixed field order. Maybe this problem also crops up because I'm not doing an AssumeFrameBased() at the end of KernelBob - Boulder, could you try if adding that after KernelDeint helps?

np: Plaid - Assault On Precinct Zero (Double Figure)

Boulder · 6th September 2004, 18:35

OK, I'll do the test and get you a small sample tomorrow - unfortunately I probably won't have the time today.

erratic · 6th September 2004, 18:39

I just ran a short test and with AssumeFrameBased after KernelBob it behaves like scharfis_brain's kernelbob function: SelectEvery(4,1,2) results in TFF, SelectEvery(4,0,3) results in BFF.

EDIT: as far as the field order is concerned, Avisynth's internal Bob() command works like scharfis_brain's kernelbob function.
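The field-order behaviour reported here can be reproduced with a small bookkeeping model in C. It is a sketch under stated assumptions, not AviSynth code: SeparateFields() is taken to emit the two fields of each frame in the clip's parity order, SelectEvery(4,a,b) followed by Weave() is taken to produce a frame whose first field is field `a`, and AssumeFrameBased() is taken to reset the parity to bottom field first (which is how the AviSynth documentation describes it). `first_field_is_top` is a made-up helper.

```c
#include <assert.h>

/* Hypothetical model. clip_is_tff: parity the bobbed clip carries when
   SeparateFields() is applied (1 = top field first, 0 = bottom field first).
   first_index: the first index passed to SelectEvery(4, first_index, ...).
   Returns 1 if the woven result starts with a top field (TFF), 0 for BFF. */
static int first_field_is_top(int clip_is_tff, int first_index)
{
    assert(first_index >= 0 && first_index < 4);
    /* Field i comes from frame i / 2; even i is that frame's first field. */
    int is_first_field_of_frame = (first_index % 2 == 0);
    /* The first field of a frame is top exactly when the clip is TFF. */
    return is_first_field_of_frame == (clip_is_tff != 0);
}
```

Assuming the TFF parity forced inside Leak's KernelBob survives on its output, `first_field_is_top(1, 1)` is 0 and `first_field_is_top(1, 0)` is 1, so SelectEvery(4,0,3) is the TFF choice. After AssumeFrameBased(), modelled as parity 0, `first_field_is_top(0, 1)` is 1 and `first_field_is_top(0, 0)` is 0, which matches SelectEvery(4,1,2) giving TFF and SelectEvery(4,0,3) giving BFF.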

Mug Funky · 6th September 2004, 18:39

that's bizarre, because i haven't encountered this at all and i've been using the leak version since it was released.

maybe i should check my plugin directory for avsi files with kernelbob in them?

Leak · 6th September 2004, 18:55

Quote:

Originally posted by Mug Funky
that's bizarre, because i haven't encountered this at all and i've been using the leak version since it was released.

maybe i should check my plugin directory for avsi files with kernelbob in them?

Nah, it's true. My KernelBob is not doing the AssumeFrameBased() after bobbing, so the frames seem to keep the TFF flag I force on them in my filter when doing another SeparateFields() afterwards - which of course was something I didn't do when testing, as I'm always going for a progressive result and which won't cause havoc until going back to fieldbased processing (which I hardly ever do) afterwards...

Try this: add a call to Info() after KernelBob(), then add an AssumeFrameBased() in between, and compare the parity...

Still, an AssumeFrameBased() after KernelBob() should do the trick until I release the next version; it's just that I currently don't have much time to work on AviSynth filters...

Now I'm just wondering if that was what bit Boulder as well...

np: The Black Dog - Frisbee Skip (Spanners)
