Technical aspects of Uncomb() filter by trbarry


trbarry
2nd May 2003, 05:50
[EDIT: Technical discussion moved from Usage forum]

Donald -

I only look at the current and prev frames. I will choose one field from the curr frame and a matching field from either the curr or prev. There are only 3 possibilities to match if I want to ensure at least one of the fields comes from the current frame:

CurrTop to PrevBottom
CurrTop to CurrBottom
CurrBottom to PrevTop

Since the psadbw instruction calculates the packed sum of absolute differences (on luma bytes only), I just accumulate those sums, sampling about 1/4 of the pixels, for each of the 3 possibilities. I take the pair of fields with the smallest sum over the whole frame.

And I am actually comparing the gaps in the top field, filled by averaging the lines on either side (say, lines 0 & 2), against the odd lines that would sit in those positions (line 1).
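
In rough C++, the per-pair metric amounts to something like the sketch below. This is not the actual Uncomb code: it uses the SSE2 form of psadbw via intrinsics rather than MMX assembly, it assumes the two fields have been separated into their own planes (lines one pitch apart), and the helper name is made up. It also samples every pixel, where the real filter samples only about 1/4 of them and skips the borders.

// Score one pairing: sum |avg(kept line above, kept line below) - candidate|
// over the luma plane. Lower = less combing = better match.
#include <emmintrin.h>  // _mm_sad_epu8 compiles to psadbw
#include <cstdint>

uint64_t FieldMatchScore(const uint8_t* keptField,  // lines of the kept field
                         const uint8_t* candField,  // lines of the candidate field
                         int pitch, int width, int fieldHeight)
{
    uint64_t total = 0;
    for (int y = 0; y + 1 < fieldHeight; ++y) {
        const uint8_t* above = keptField + (size_t)y * pitch;
        const uint8_t* below = above + pitch;
        const uint8_t* cand  = candField + (size_t)y * pitch;  // sits in the gap
        __m128i acc = _mm_setzero_si128();
        for (int x = 0; x + 16 <= width; x += 16) {
            __m128i a = _mm_loadu_si128((const __m128i*)(above + x));
            __m128i b = _mm_loadu_si128((const __m128i*)(below + x));
            __m128i c = _mm_loadu_si128((const __m128i*)(cand  + x));
            __m128i avg = _mm_avg_epu8(a, b);                  // the "gap" estimate
            acc = _mm_add_epi64(acc, _mm_sad_epu8(avg, c));    // packed SAD
        }
        total += (uint32_t)_mm_cvtsi128_si32(acc);                     // low lane
        total += (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));  // high lane
    }
    return total;  // compute this for all 3 pairings; take the smallest
}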

- Tom

Guest
3rd May 2003, 16:57
@Tom

Just out of curiosity, why don't you post some timing numbers for your filter versus Telecide(post=false)? If it is blazingly faster, we could consider modifying Telecide to use your algorithm (or giving the user a choice of algorithms if their outputs differ in important ways). Assuming, of course, that you'd be agreeable to it. :)

Guest
5th May 2003, 00:55
I tested Uncomb() versus Telecide(post=false) on a real-world application:

AVISource("g:\video test files\decomb\test\tango.avi")
Crop(0,8,-0,-0)
#uncomb()
Telecide(post=false)

I fed this into DivX and ran an encode. The clip timings were:

Telecide(post=false) 264 seconds
Uncomb 252 seconds

So Uncomb() sped up this real-world application by 4.5%.

I suppose that the gain is not larger because Telecide subsamples whereas Uncomb does not.

Guest
5th May 2003, 01:38
Now I've been thinking about the theoretical issues...

Originally posted by trbarry
I only look at the current and prev frames. I will choose one field from the curr frame and a matching field from either the curr or prev. There are only 3 possibilities to match if I want to ensure at least one of the fields comes from the current frame:

CurrTop to PrevBottom
CurrTop to CurrBottom
CurrBottom to PrevTop

This is a really interesting departure from what Telecide does, and it may well improve the matching behavior. You are doing 3 compares, as does Telecide. But Telecide always chooses to use the bottom field of the current frame; that is why it has to compare to the next frame as well. It's also the reason Telecide has the 'reverse' option: if there is a field blend in the bottom field, Telecide won't find a good match, and as a hack I let the user match on the top instead of the bottom. That didn't often help, though, because blends aren't typically limited to the top or the bottom field. Your method avoids all that because it can use either the top or the bottom field of the current frame, so it may well improve performance when blended fields are present.
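
For concreteness, the selection step those three candidates imply is just a min-of-three. A tiny sketch (the names are mine, taken from neither filter):

#include <cstdint>

enum Match { CurrTop_PrevBottom, CurrTop_CurrBottom, CurrBottom_PrevTop };

// Given the three combing scores, keep the pairing with the smallest one.
Match ChooseMatch(uint64_t topPrevBot, uint64_t topCurrBot, uint64_t botPrevTop)
{
    Match best = CurrTop_PrevBottom;
    uint64_t bestScore = topPrevBot;
    if (topCurrBot < bestScore) { best = CurrTop_CurrBottom; bestScore = topCurrBot; }
    if (botPrevTop < bestScore) { best = CurrBottom_PrevTop; }
    return best;
}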

Your method is similar in effect to the "desperation mode" of the VirtualDub version of Telecide, which would try to match on the top field if the bottom field failed.

I plan to make an experimental version of Telecide with this new field matching approach. Thank you for contributing this fine new matching strategy.

And I am actually comparing the gaps in the top field, filled by averaging the lines on either side (say, lines 0 & 2), against the odd lines that would sit in those positions (line 1).

That is also very interesting. I am trying to make a Bob filter that compares fields to find inter-field motion, and so it has to adjust for the spatial offset as you are doing here. But I found it worked better to nudge one field up a quarter line and the other down a quarter line. But that would probably be overkill for a matching operation.
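
For what the nudge might look like in code, here is a rough sketch (my construction, not the actual bob code), reading "a quarter line" as a quarter of the line spacing within one field, so that the two opposite nudges together close the half-spacing offset between the field grids:

#include <cstdint>

// Resample one line of a field a quarter of a line spacing downward:
// blend the line 3:1 with its lower neighbor, rounding to nearest.
void NudgeLineDown(const uint8_t* cur, const uint8_t* next,
                   uint8_t* dst, int width)
{
    for (int x = 0; x < width; ++x)
        dst[x] = (uint8_t)((3 * cur[x] + next[x] + 2) >> 2);
}

// The other field is nudged upward with its upper neighbor instead:
// dst[x] = (3 * cur[x] + prev[x] + 2) >> 2. After both nudges the two
// fields sit on the same spatial grid and can be differenced directly.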

Guest
5th May 2003, 03:31
@Tom

Questions:

1. Why do you ignore 1/8 of the width and height on all sides?

2. It looks like you subsample height by four, but fully sample width (except for the 1/8 borders). Is that correct?

3. Why is FieldCopy() in there? :)

Thank you.

trbarry
5th May 2003, 04:25
So Uncomb() sped up this real-world application by 4.5%.

Donald -

Funny, I thought I got MUCH larger differences. Are you doing this in YV12? Maybe I forgot post=false or was using an older Telecide release or something.

But I found it worked better to nudge one field up a quarter line and the other down a quarter line. But that would probably be overkill for a matching operation.

TomsMoComp also does this (1/4 line) when vertical filtering, but as you noted it seemed unnecessary here.

1. Why do you ignore 1/8 of the width and height on all sides?

2. It looks like you subsample height by four, but fully sample width (except for the 1/8 borders). Is that correct?

3. Why is FieldCopy() in there?

1. Because there is often garbage at the edges, and it's faster. But with my HDTV caps 1/8 is too much; I caught a counterexample today where something moved in from the side of an otherwise still picture. I'll change the border to only 8 pixels.

2. I was intending to subsample both height and width by 2. What do you use? (Note that stepping vertically by 4 lines but doing both an even and an odd line at each step works out to subsampling by 2.) Is it your experience that I could get away with more in both dimensions?

3. Dead code. I only figured out how to use the faster BitBlt() this week; I didn't want to be reminded again that I'd stuck with my slower legacy FieldCopy(). ;)

Most of my filters were written at a time when I believed (falsely) that calling C memcpy already invoked a machine-optimized copy function. I'm slowly changing them to BitBlt as they come up for maintenance.
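
For anyone making the same change, the swap is small. A sketch against the AviSynth 2.5 plugin API (the CopyPlane wrapper itself is illustrative):

#include "avisynth.h"

// env->BitBlt dispatches to a CPU-optimized copy and handles differing
// pitches, replacing a hand-rolled per-row memcpy loop.
void CopyPlane(PVideoFrame& dst, const PVideoFrame& src, int plane,
               IScriptEnvironment* env)
{
    env->BitBlt(dst->GetWritePtr(plane), dst->GetPitch(plane),
                src->GetReadPtr(plane),  src->GetPitch(plane),
                src->GetRowSize(plane),  src->GetHeight(plane));
}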

- Tom

Guest
5th May 2003, 04:45
@Tom

I did test in YV12. Perhaps your test had less other processing, so the field matching wasn't swamped the way it was here.

I subsample by 4 vertically. That means I examine pixels for combing only on every fourth line. But the test involves references to the lines above and below. I subsample by 4 horizontally, but like this: do 4, skip 12, do 4, skip 12, ... That is to allow the MMX to have a chunk of 4 to work on. You could probably get away with do 16, skip 16.
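
That sampling pattern, written out (a sketch only; the combing test inside the loop is a placeholder, not Telecide's actual metric):

#include <cstdint>
#include <cstdlib>

uint64_t SampledCombMetric(const uint8_t* plane, int pitch,
                           int width, int height)
{
    uint64_t metric = 0;
    for (int y = 4; y < height - 4; y += 4) {            // every 4th line
        const uint8_t* up   = plane + (size_t)(y - 1) * pitch;
        const uint8_t* cur  = plane + (size_t)y * pitch;
        const uint8_t* down = plane + (size_t)(y + 1) * pitch;
        for (int x = 0; x + 16 <= width; x += 16) {      // do 4, skip 12
            for (int i = 0; i < 4; ++i)
                metric += (uint64_t)abs(2 * cur[x + i] - up[x + i] - down[x + i]);
        }
    }
    return metric;
}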

trbarry
5th May 2003, 06:25
Okay. I'm going to switch vertical subsampling from 2 to 4 for a mild speed increase. I'll probably leave horizontal at 2 since much of the cost is fetching all the memory anyway. I'm guessing this is bound by memory access speed.

And I'll make all those 1/8 borders smaller.

Thanks,

- Tom

sh0dan
5th May 2003, 10:02
@trbarry: An extreme optimization would be to check the picture backwards (bottom up, right to left), since the hardware prefetcher will most likely prefetch the following line when it detects linear access. Just an observation. (But in real life probably a very minor gain.)

trbarry
5th May 2003, 15:21
Sh0dan -

Sorry, I don't understand that. My understanding was that when you fetch a qword it actually pulls a whole cache line into the cache. The line size varies by processor, but it explains why you can often access a number of adjacent bytes with multiple instructions very fast, even when you are out of registers to hold everything.

But how does backwards help?

- Tom

sh0dan
5th May 2003, 15:37
There are two kinds of prefetching in modern processors: software (SSE) prefetch and hardware prefetch. You probably know how the software prefetcher works; the hardware prefetcher works by analyzing your memory access patterns and trying to prefetch the memory you are most likely to use next.

When you request data sequentially (two or three cache lines, for instance), the hardware prefetcher deduces that you are very likely to continue accessing the following cache lines, and therefore issues prefetches for them, so the data is already present in the cache when you request it. (This is a good thing, since a cache miss usually carries a penalty of 30+ cycles.)

So when you process a line, skip one, process a line, the lines in between are very likely to be prefetched anyway, so you will not gain anything there. If you instead process the data backwards and do manual (software) prefetch, only the lines you actually need get prefetched.

Again, this has only a very small impact and is probably not worth the effort. I just mentioned it to support your point that there probably isn't any real gain in skipping pixels.
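
A sketch of the idea anyway, for the record (my code, not from any filter; real code would prefetch every cache line of the row, not just its start):

#include <xmmintrin.h>  // _mm_prefetch
#include <cstdint>

void ProcessSampledLinesBackwards(const uint8_t* plane, int pitch, int height)
{
    for (int y = height - 4; y >= 0; y -= 4) {   // every 4th line, bottom-up
        if (y >= 4)                              // software-prefetch only what we need
            _mm_prefetch((const char*)(plane + (size_t)(y - 4) * pitch),
                         _MM_HINT_NTA);
        const uint8_t* line = plane + (size_t)y * pitch;
        // ... examine 'line' for combing here ...
        (void)line;
    }
}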

(@Donald: This is also why I didn't skip pixels in the Decimate detection code).

ARDA
5th May 2003, 16:21
trbarry-

Shodan's explanation is perfect, but since I've spent the last twenty minutes trying to write my own, here it is.

If I'm not wrong, when you make a backwards loop you avoid the hardware prefetch; that way you can keep in cache the data you've preloaded. What for? To keep that data unaltered for reuse if you need it, while writing memory with movntq and movntdq without penalties. The difference, and the advantage, is in how prefetchnta works on each processor.

From the Intel Pentium 4 and Intel Xeon Processor Optimization Reference Manual, Chapter 6 (Order Number: 248966-04):

prefetchnta: Fetch the data into the second-level cache, minimizing cache pollution.

On the Pentium III, for instance, prefetchnta:
Fetches 32 bytes
Fetches into the 1st-level cache
Does not fetch into the 2nd-level cache

On the Pentium 4, it:
Fetches 128 bytes
Does not fetch into the 1st-level cache
Fetches into 1 way of the 2nd-level cache

I think that on the Pentium 4, since prefetchnta uses just one way of the cache, you have more freedom to use it. But better than my explanation, there is a good example using software preload and a backwards loop instead of prefetch in Steady's BitBlt in AviSynth.

I hope that can be useful

Arda

scmccarthy
5th May 2003, 18:05
Most of my filters were written at a time when I believed (falsely) that calling C memcpy already invoked a machine-optimized copy function. I'm slowly changing them to BitBlt as they come up for maintenance.

Yeah, I am the one that brought it up, though I doubt I would otherwise have looked at the source of this new program.

The reason I noticed it in the first place is that I think of classes in C++ as a programming-language extension, so learning the methods of the AviSynth script environment is like learning a programming language tailored specifically for AviSynth. It reinforces our understanding of the AddFunction method, which we always have to use anyway; otherwise that part of the code is just a template we don't understand. Using BitBlt is straightforward enough if you already understand memcpy, but I think it is a good reminder that all the functions in the AviSynth header are available to us any time we need them.

Stephen

ErMaC
5th May 2003, 19:56
I apologize for not adding anything constructive, but I just wanted to say you guys are just so darned smart. :) When I read these threads it always makes me feel great about AviSynth, knowing we have awesome folks like you working on it, and having high-level discussions in a totally open forum to boot. Kudos to all of you.

WarpEnterprises
6th May 2003, 14:40
OT, but connecting a little to ErMaC: does anybody know WHY BenRG has stopped development, and why he no longer stops by doom9?

Guest
6th May 2003, 16:38
I'll answer that while requesting that the discussion not continue here.

Ben is very busy on other projects right now. He feels that Avisynth is in good hands and can proceed very well without him.

OUTPinged_
19th May 2003, 12:48
@neuron2

It seems that Decimate() takes more machine time than Telecide().

Since Decomb runs well under realtime for 1080i material (in particular 30i to 24p conversion), maybe we could have a "quick and dirty" mode for Decimate() instead?

Telecide seems to be pretty fast: Decimate(5) takes about 4 times the CPU cycles of Telecide(post=false). At least that's what I am getting on HDTV material.

Guest
19th May 2003, 12:56
Did you try Decimate(quality=0)?

Your query is off-topic. Please start a new thread if you wish to continue it. Thank you.

Guest
24th May 2003, 03:28
We were discussing Tom's interesting new filter before the thread went off topic. I'd like to get back to that.

I had posted a clip, chest.avi, that I asserted supported the idea that full comb detection worked better than Tom's field differencing. I have now discovered that the assertion is not valid. That clip has misplaced fields that Telecide is serendipitously able to handle due to its matching strategy: it matches the current bottom to the previous top, current top, or next top, whereas Tom matches only within the current frame or to the previous frame. This allows Telecide to handle this PATHOLOGICAL clip, where Uncomb fails. But the clip is wrong: it has out-of-order fields, probably due to a bad capture card or brain-dead processing. So it is not a valid test.

On valid tests, I now find that the field-differencing approach works better than Telecide's full comb-detection matching. I have made an additional advance over Tom's approach, which I am including in my new-generation Decomb, currently under development. You may read the details in the Journal (blog) on my website.

trbarry
24th May 2003, 05:17
Donald -

Glad to hear it was the clip. As mentioned before, I don't have any intention of handling the really pathological ones in UnComb.

And at the risk of very briefly going back OT, was it much work to set up the blog? I'm looking for a job again so I seriously need to fix my web page, if only to have somewhere to hang a resume. But once I get started I might as well make a nicer one.

- Tom

Guest
24th May 2003, 06:23
Originally posted by trbarry
And at the risk of very briefly going back OT, was it much work to set up the blog?

You're one of my heroes, so I won't strike you for going off topic. :)

Usually a "blog" has an on-line interface that lets you publish new things directly, without editing a web page and uploading it with FTP or whatever. Mine is a "manual blog". I use the blog format but I do it by editing an HTML page and uploading it with FTP. So in that sense, it was not hard to set up. :rolleyes: