Some fundamental motion vector questions [Archive]

markfilipak

29th April 2021, 01:22

Does Super() et al retrieve the source video's motion vectors or do they cook up "home brew" MVs by looking at picture pixels and "searching" for correlations?

Thanks,
Mark.

poisondeathray

29th April 2021, 02:09

Does Super() et al retrieve the source video's motion vectors or do they cook up "home brew" MVs by looking at picture pixels and "searching" for correlations?

It's explained in the avisynth mvtools2 documentation.

The video codec's motion vectors (as defined by the video codec) are not used.

Individual "picture pixels" aren't used; "blocks" and their respective correlations are "searched" and calculated.

Mvtools2 (and thus svpflow) are both block-based optical flow approaches. The motion vectors are estimated by block comparisons (e.g. 16x16, 8x8, etc.) between the current frame and n+1, n-1, n+2, n-2, etc. The shift in blocks is the motion vector. This is why the block size is a very important setting.

blksize (mvtools2)

Size of a block (horizontal). Larger blocks are less sensitive to noise, are faster, but also less accurate.
List of available block sizes (blksize x blksizeV)
64x64, 64x48, 64x32, 64x16
48x64, 48x48, 48x24, 48x12
32x64, 32x32, 32x24, 32x16, 32x8
24x48, 24x24, 24x32, 24x12, 24x6
16x64, 16x32, 16x16, 16x12, 16x8, 16x4, 16x2
12x48, 12x24, 12x16, 12x12, 12x6, 12x3
8x32, 8x16, 8x8, 8x4, 8x2, 8x1
6x24, 6x12, 6x6, 6x3
4x8, 4x4, 4x2
3x6, 3x3
2x4, 2x2

The search settings determine the search pattern used to find the motion vectors; the options are similar to what video codecs use.

markfilipak

29th April 2021, 02:55

Howdy, poisondeathray!
It's explained in the avisynth mvtools2 documentation.
Sorry. I did finally find authoritative documentation for MVTools2 (other than the Chinese site I found earlier... in Chinese of course ;) ) I read it all. It was unclear to me that source MVs were not used.

The video codec's motion vectors (as defined by the video codec) are not used.

Why the heck not?

Frankly, I'm shocked.

If I was doing it, I'd extract the MVs from the source frames into a moving window of 5 frames, [A] [B] [C] [D] [E], in order to interpolate MVs between [B] and [D], i.e. for [BBBCC] [BCCCC] [CCCCD] [CCDDD]. I'd do a 2nd-order (linear MV 'velocity' + non-linear MV 'acceleration') autocorrelation. Jeez, the source's MVs literally point the way. Why aren't they being used?

poisondeathray

29th April 2021, 03:06

Sorry. I did finally find authoritative documentation for MVTools2 (other than the Chinese site I found earlier... in Chinese of course ;) ) I read it all. It was unclear to me that source MVs were not used.

It's pretty clear source (codec) MV's are not used. Otherwise , why would you need to calculate or measure or estimate anything ?

The plugin uses a block-matching method of motion estimation (similar methods are used in MPEG2, MPEG4, etc.). At the analysis stage the plugin divides frames into small blocks and tries to find, for every block in the current frame, the most similar (matching) block in a second frame (previous or next). The relative shift of these blocks is the motion vector. The main measure of block similarity is the sum of absolute differences (SAD) of all pixels of the two blocks compared. SAD is a value which says how good the motion estimation was.
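
As a rough illustration of the block matching described above (a minimal sketch, not MVTools' actual implementation), the Python below does an exhaustive search over a small window and keeps the shift with the lowest SAD. The function names, the tiny 8x8 test frames and the search radius are all made up for the example.

```python
def sad(cur, ref, bx, by, dx, dy, bs):
    """Sum of absolute differences between a block in cur and a shifted block in ref."""
    total = 0
    for y in range(bs):
        for x in range(bs):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def find_motion_vector(cur, ref, bx, by, bs=4, radius=2):
    """Exhaustive search: try every shift within +/-radius, keep the lowest SAD."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Skip candidate blocks that fall outside the reference frame.
            if bx + dx < 0 or bx + dx + bs > w or by + dy < 0 or by + dy + bs > h:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, bs)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best  # (SAD, dx, dy)


def make_frame(w, h, ox, oy):
    """Hypothetical test frame: black background, textured 4x4 object at (ox, oy)."""
    frame = [[0] * w for _ in range(h)]
    for y in range(4):
        for x in range(4):
            frame[oy + y][ox + x] = 100 + 10 * y + x
    return frame


ref = make_frame(8, 8, 1, 2)  # object at (1, 2) in the reference frame
cur = make_frame(8, 8, 2, 2)  # object moved one pixel to the right in the current frame
cost, dx, dy = find_motion_vector(cur, ref, bx=2, by=2)
print(cost, dx, dy)  # the vector points from the current block back to its match in ref
```

A larger block size averages over more pixels (less sensitive to noise) but can straddle object boundaries, which is the trade-off the blksize description refers to.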

Why the heck not?

Frankly, I'm shocked.

If I was doing it, I'd extract the MVs from the source frames into a moving window of 5 frames, [A] [B] [C] [D] [E], in order to interpolate MVs between [B] and [D], i.e. for [BBBCC] [BCCCC] [CCCCD] [CCDDD]. I'd do a 2nd-order (linear MV 'velocity' + non-linear MV 'acceleration') autocorrelation. Jeez, the source's MVs literally point the way. Why aren't they being used?

Another reason is the plugin receives uncompressed data from the decoder. There is no such thing as motion vectors at that stage. It does not access video codec parameters upstream .

If you were to write something capable of accessing codec MV's, it would have to be integrated or have access at the level of the decoder

markfilipak

29th April 2021, 03:26

It's pretty clear source (codec) MV's are not used. Otherwise , why would you need to calculate or measure or estimate anything ?

Let me repeat that: why would you need to calculate or measure or estimate anything ?

I think you made my point.

Of course, some calculating is required to synthesize intermediate frames, but to start by throwing out the source video's motion vectors seems insane to me.

Another reason is the plugin receives uncompressed data from the decoder. There is no such thing as motion vectors at that stage. It does not access video codec parameters upstream .

Screw the decoder. How hard would it be to rewrite the decoders to pass on the original motion vectors? If that can't be done -- I can't see why that would be -- then access the source frames (files) directly. Why not? That's not decoding. That's just knowing how to read the formatted data. It's not rocket science.

If you were to write something capable of accessing codec MV's, it would have to be integrated or have access at the level of the decoder

So? What's the problem?

It's not for me to write that. I'm a hardware design engineer, not a programmer. Oh, sure, I've written some code -- test programs, mostly -- and I did take computer programming in college 45 years ago, but I don't pretend to be a codesmith.

poisondeathray

29th April 2021, 03:34

Screw the decoder. How hard would it be to rewrite the decoders to pass on the original motion vectors? If that can't be done -- I can't see why that would be -- then access the source frames (files) directly. Why not? That's not decoding. That's just knowing how to read the formatted data. It's not rocket science.

So? What's the problem?

It's not for me to write that. I'm a hardware design engineer, not a programmer. Oh, sure, I've written some code -- test programs, mostly -- and I did take computer programming in college 45 years ago, but I don't pretend to be a codesmith.

Not sure how hard, I'm not a programmer either

The way all these programs work (vapoursynth, ffmpeg, avisynth) is that decoded frames (uncompressed data) get passed to some other filter or operation.

Some can pass certain types of metadata, but I've never seen codec MV's being passed in any application. I've seen MV's being visualized - and it looks pretty - but that's not quite the same thing as being able to pass that data in a usable format into a plugin or filter.

I don't know how much you know about lossy encoding , but codec MV's aren't always that accurate either. You can see this when you visualize them. Often there are mistakes. Sometimes it is better to recalculate, if the goal was "quality" or "accuracy"

markfilipak

29th April 2021, 04:12

Not sure how hard, I'm not a programmer either

The way all these programs work (vapoursynth, ffmpeg, avisynth) is that decoded frames (uncompressed data) get passed to some other filter or operation.

Yes, I knew that. All frames are progressive, raw video frames.

Some can pass certain types of metadata, but I've never seen codec MV's being passed in any application.

Well, motion vectors are not metadata, but I know what you mean. I've never seen anything in a processing pipeline other than raw frames. But so what? Does throwing away the source's motion vectors make sense to you? After all, it's those motion vectors that the decoder uses to produce pictures.

I've seen MV's being visualized - and it looks pretty - but that's not quite the same thing as being able to pass that data in a usable format into a plugin or filter.

Well, I agree, but what's the point? "It's not been done before" is hardly a reason not to do it.

I don't know how much you know about lossy encoding , but codec MV's aren't always that accurate either.

How would one know that? Surely you don't mean that decoders make mistakes doing the motion vector operations? Though that's possible of course, how would you know that it made a mistake? And what does that matter?

You can see this when you visualize them. Often there are mistakes. Sometimes it is better to recalculate, if the goal was "quality" or "accuracy"

I'm sorry, my friend, but I just don't buy that. The decoder uses its MVs to make pictures of pixels and then those pictures of pixels are used to produce "home brew" vectors that show that the decoder made mistakes working with the original MVs. I think that's circular illogic.

markfilipak

29th April 2021, 04:21

I've seen MV's being visualized - and it looks pretty ...

How do you know whether you were seeing MVs or "home brew" vectors? I'll bet they were "home brew" vectors. They may have been bogus. If they were created by the same algorithms that are used in interpolation, then they are just something to look at and cannot reasonably be used to judge the goodness of interpolation.

markfilipak

29th April 2021, 04:23

Sometimes it is better to recalculate, if the goal was "quality" or "accuracy"

You can't get more accurate than the source video's actual MVs.

poisondeathray

29th April 2021, 04:26

Does throwing away the source's motion vectors make sense to you? After all, it's those motion vectors that the decoder uses to produce pictures.

That's what happens because that's how all the frameworks operate

But I'm curious if it could be done, and if it would improve results with block based optical flow

Newer optical flow methods (such as some neural nets) make use of depth. Instead of just x,y vectors, there is "z" depth. (But that doesn't mean you couldn't improve results if you were able to access codec MV's.) I'll post an example later if you're interested. This is one category where the newer "AI" flow algorithms tend to perform better.

How would one know that? Surely you don't mean that decoders make mistakes doing the motion vector operations? Though that's possible of course, how would you know that it made a mistake? And what does that matter?

Not the decoder; during encoding - that's when the motion vectors are calculated. You can visualize them with codec analysis tools, such as codec visa. If the premise is that more accurate motion vectors would lead to better results, then it matters.

markfilipak

29th April 2021, 05:13

That's what happens because that's how all the frameworks operate

But I'm curious if it could be done, and if it would improve results with block based optical flow

It would have to improve results. It would totally eliminate errors. The source motion vectors point frame-to-frame with complete accuracy. Cooking up "home brew" vectors will never match that.

Not the decoder; during encoding - that's when the motion vectors are calculated. You can visualize them with codec analysis tools, such as codec visa. If the premise is that more accurate motion vectors would lead to better results, then it matters.

Well, I don't know what "codec visa" is, but, 1, the source MVs make pixels, then 2, MV interpolation uses pixels to make MVs. Those made MVs can't be better than the original MVs. Methinks that whether the original encoder made mistakes is immaterial and I don't understand why you are objecting. Maybe you're not objecting, but you're not supporting my proposition, either.

As I wrote: "I'm shocked." I would think that using the original MVs was a slam dunk.

poisondeathray

29th April 2021, 05:45

You can't get more accurate than the source video's actual MVs.

That's true, in terms of the source video's particular codec's partitions and blocks. So if you reuse those same blocks and motion vectors you essentially have the original video. But that's not necessarily a good idea in this scenario.

The block distribution or sizes used in the encoded source video might not be ideal for the purposes of optical flow. For example, some codecs cannot express certain block sizes, e.g. MPEG2 does not support 2x2 or 12x3 or 64x64. What if a smaller, more detailed block was better? What if a larger block is better? The codec's block distribution and partitions do not necessarily reflect the objects and boundaries of the actual content. Maybe the codec's chosen block size does not match the object boundary as neatly.

The motion of the blocks used by the encoded video does not necessarily match actual content object motion. "Predictive" motion vectors used by an encoder "predict" where the matching blocks are in other frames. That is how lossy long GOP encoding works - encoders search for block matching and redundancies using various algorithms. If you have "off" or not ideal motion vectors in the encoded video, this results in higher residuals and lower quality at a given bitrate. More accurate motion vectors result in lower residuals and higher quality at a given bitrate. Some encoding settings or poor encoders might "miss" blocks because of an abrupt search pattern or fast algorithm settings, blocks that they might have caught with more exhaustive search settings. You can see this when you perform codec and encoder comparisons and look at the motion vectors. Re-using some of the lower quality MV's might not be ideal for interpolation purposes and I can see situations where it would make it worse.

It would have to improve results. It would totally eliminate errors. The source motion vectors point frame-to-frame with complete accuracy. Cooking up "home brew" vectors will never match that.

Possibly, or make it worse

feisty2

29th April 2021, 06:25

before you ask "why can't I do blah blah blah", you need to ask yourself "does it make sense if I do blah blah blah", and the answer is no.

to synthesize an intermediate frame, you need motion vectors from both the preceding frame and the succeeding frame, that's not how motion estimation works for video encoding.
an I frame has no motion vectors at all, a P frame only has motion vectors from its preceding frame, only B frames have bidirectional motion vectors.
you cannot have an encoded sequence consisting only of B frames, that would be impossible to decode (any attempt to decode such a sequence results in circular dependencies).
therefore you cannot reuse encoding-dedicated motion vectors for the purpose of generating intermediate frames.

feisty2

29th April 2021, 06:42

It would have to improve results. It would totally eliminate errors. The source motion vectors point frame-to-frame with complete accuracy. Cooking up "home brew" vectors will never match that.

nonsense, a P frame or B frame still needs to encode the residual after motion compensation, there would be no residual to encode if motion estimation were picture perfect for video encoding.

do you have any idea what you're talking about???

markfilipak

29th April 2021, 07:32

before you ask "why can't I do blah blah blah", you need to ask yourself "does it make sense if I do blah blah blah", and the answer is no.

to synthesize an intermediate frame, you need motion vectors from both the preceding frame and the succeeding frame ...

Both? I can visualize that not being true. I can visualize a single set of MVs from a single source frame helping (see 1, below).

an I frame has no motion vectors at all, a P frame only has motion vectors from its preceding frame, only B frames have bidirectional motion vectors.

I understand.

you cannot have an encoded sequence consisting only of B frames, that would be impossible to decode (any attempt to decode such a sequence results in circular dependencies).
therefore you cannot reuse encoding-dedicated motion vectors for the purpose of generating intermediate frames.

I believe that the output of MV interpolation is pictures of pixels, not B-frames, so vectors are not reused and there's nothing circular.

Let's consider a realistic situation, eh?

Suppose that the sequence is I P P B P P B P P B P P B P P I. I believe that's a fairly typical sequence.

1, Don't the MVs in the 1st P point backwards to a block in I? Wouldn't an I-P intermediate block lie on that path? Even with simple linear interpolation between them, wouldn't that help the interpolation? Wouldn't that help in the "search"?

2, B-frames rely on vectors from both directions in order to produce their pictures of pixels, but couldn't the rendered B-frame (as it exists exiting the decoder) simply be treated as an I-frame for the purposes of synthesis of a B-P intermediate?

Disclaimer: I'm a novice here, but I am an engineer. If I'm wrong about this stuff, kindly enlighten me. You don't need to dumb down the explanation. Links to articles would be fine of course.

I very much appreciate your reply and I look forward to more.

Regards,
Mark.

markfilipak

29th April 2021, 07:44

do you have any idea what you're talking about???

Some.. :)

I appreciate that it can be frustrating to address a novice like me but I've found that doing so can pay dividends vis-a-vis perspective. I've taught electronics and have found that dumb questions sometimes caused me to rethink the subject... to my profit.

feisty2

29th April 2021, 14:47

I see from your earlier posts that you wanna do some framerate up-conversion, all frame interpolation functions (FlowInter, FlowFPS, BlockFPS) from MVTools require a pair of bidirectional vectors. there're indeed 2 functions that take one vector clip rather than a pair (Flow, Compensate) but you cannot use them to interpolate intermediate frames. something like mv.Flow(vectors, time=50) applies a partial compensation like what you described as "linear interpolation along the path", it moves a pixel from the reference frame to the current frame halfway along the vector, and it does not have a frame interpolating effect. I don't have the time to give you a detailed lecture, go read the source code of MVTools if you wanna know how things work precisely.
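
To make the "halfway along the vector" idea concrete, here is a toy sketch (not MVTools code; the function name and the time-in-percent convention are only assumptions that mirror the time=50 example above). It scales a motion vector by time/100 and returns the position the pixel would be moved to.

```python
def partial_position(ref_pos, vector, time=50.0):
    """Move a pixel from its reference-frame position part-way along its vector.

    time is a percentage: 0 leaves the pixel at the reference position,
    100 moves it the full length of the vector.
    """
    scale = time / 100.0
    x, y = ref_pos
    dx, dy = vector
    return (x + dx * scale, y + dy * scale)


# A pixel at (10, 20) in the reference frame whose vector says it moves by (+8, -4).
print(partial_position((10, 20), (8, -4), time=50))   # halfway along the vector
print(partial_position((10, 20), (8, -4), time=100))  # full compensation
```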

if you have B frames between two consecutive P frames
https://www.ramugedia.com/upload/8028/images/optimize/893500FE30A663DE.webp
motion vectors will no longer be estimated from adjacent frames. that sure as hell cannot be used to generate intermediate frames.
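
A small sketch of what "no longer estimated from adjacent frames" means, under the simplifying assumption that each P frame references only the nearest earlier I or P frame in display order (real encoders can use several references, and the function name is made up for illustration). With B frames in between, a P frame's vectors span more than one frame interval.

```python
def p_frame_reference_distances(frame_types):
    """For each P frame (display order), return the distance in frames to the
    nearest earlier I or P frame, assumed here to be its only reference."""
    distances = {}
    last_anchor = None  # index of the most recent I or P frame
    for i, t in enumerate(frame_types):
        if t == "P" and last_anchor is not None:
            distances[i] = i - last_anchor
        if t in ("I", "P"):
            last_anchor = i
    return distances


print(p_frame_reference_distances(list("IPPP")))     # no B frames: adjacent references
print(p_frame_reference_distances(list("IBBPBBP")))  # two B frames between anchors
```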

the problem is not you being a newbie, it's that you don't listen to everybody else, you assume that things work in a certain way and when everyone tells you it's not like that, you don't calibrate your assumptions, instead, you just keep asking the same question over and over again, why don't things work the way I want them to? maybe that's just the mindset of old people, it's certainly something that I cannot understand.

poisondeathray

29th April 2021, 16:49

1, Don't the MVs in the 1st P point backwards to a block in I? Wouldn't an I-P intermediate block lie on that path? Even with simple linear interpolation between them, wouldn't that help the interpolation? Wouldn't that help in the "search"?

2, B-frames rely on vectors from both directions in order to produce their pictures of pixels, but couldn't the rendered B-frame (as it exists exiting the decoder) simply be treated as an I-frame for the purposes of synthesis of a B-P intermediate?

P is a forward predicted picture from I.

I'm guessing that your thinking is - can you at least re-use some of the codec MV data, ie. hypothetically would it help? At least with 1 direction ?

It depends on the "quality" and accuracy of the motion vectors, and the distribution of blocks in reference to the content. It can definitely make things worse

You can visualize motion vectors with expensive tools such as codec visa - but a free way is with -vf codecview using libavfilter/ffmpeg. You can view the forward vectors of P and B frames and the backward vectors of B frames, or each individually.
https://ffmpeg.org/ffmpeg-filters.html#codecview

You can demonstrate that a video can have "bad" motion vectors (they don't track the content well, they "slide" off and miss targets - i.e. poor prediction), but re-encoding it with better search parameters produces better matching motion vectors. Better "home brew" vectors if you will.

So the re-encoded video looks similar, but is technically "worse" in terms of signal to noise. But if you visualize the actual MV's, they are more accurate for the re-calculated vectors. They match the actual content and track objects better.

More accurate motion vectors would theoretically produce better results for optical flow, just as a source encoded with better parameters in the first place would be closer to its original, with smaller residuals. You mentioned "eliminating errors" before; those errors in the source video prediction are from "bad" MV's, which are contributing factors to higher residuals. "Perfect" prediction does not exist.
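
To illustrate the link between vector quality and residuals, here is a toy 1-D sketch (made-up data and function names, not any codec's actual pipeline, and it ignores the lossy quantization a real encoder applies to the residual). The decoder reconstructs a block as prediction plus residual, so even a "wrong" vector decodes to the right pixels; it just costs a bigger residual.

```python
def predict(ref, start, shift, size):
    """Motion-compensated prediction: copy a block from ref at a shifted position."""
    return ref[start + shift:start + shift + size]


def residual(block, prediction):
    """What the encoder must still transmit after prediction (lossless in this toy)."""
    return [b - p for b, p in zip(block, prediction)]


def reconstruct(prediction, res):
    """Decoder side: prediction plus residual."""
    return [p + r for p, r in zip(prediction, res)]


ref = [0, 0, 10, 50, 90, 50, 10, 0, 0, 0]  # reference "frame" (1-D)
cur = [0, 0, 0, 10, 50, 90, 50, 10, 0, 0]  # same content moved right by one sample
block = cur[3:7]                           # the block being encoded

for shift in (-1, 2):                      # -1 is the true motion, 2 is a "bad" vector
    pred = predict(ref, 3, shift, 4)
    res = residual(block, pred)
    energy = sum(abs(r) for r in res)
    assert reconstruct(pred, res) == block  # pixels come out right either way
    print(shift, res, energy)
```

The bad vector still decodes correctly, which is why correct pictures out of the decoder say nothing about whether the vectors track the real motion.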

If you need more demonstration or actual examples, let me know

On a related note - often noise and grain can interfere with the accuracy of motion vectors. You can use a degrained preprocessed clip for the vector calculation - this technique is used in many scripts.

In some professional optical flow applications, you can "guide" the motion estimation with motion tracking - because there are always errors in prediction. It's never perfect. You can use mattes and roto to help delineate object boundaries, because the chosen "block" distribution does not necessarily outline objects perfectly - this helps reduce the object edge artifacts.

StainlessS

29th April 2021, 16:53

maybe that's just the mindset of old people
Impudent young pup!

markfilipak

1st May 2021, 08:56

You can demonstrate that a video can have "bad" motion vectors (they don't track the content well, they "slide" off and miss targets - i.e. poor prediction), but re-encoding it with better search parameters produces better matching motion vectors. Better "home brew" vectors if you will.

You see, that's what I just can't understand. Help me. Tell me what's wrong (beside the fact that I'm old and senile :) ).

The source's MVs guide the decoder when producing pictures of pixels. MVTools et al use the pictures of pixels to produce new MVs. You say that those "home brew" MVs can be better than the source's MVs, but I don't see how that's possible. Tell you what. I'll try -vf codecview (never done that before) so I can see what you're 'talking' about.

markfilipak

1st May 2021, 10:17

ffmpeg -flags2 +export_mvs -i source.mkv -vf codecview=mv=pf codecview=mv=pf.mkv

I see what you're 'talking' about. In a tracking panning shot in which nearly all the vectors should be parallel, some groups are pointing way 'wrong' and are way too long. But obviously, the decoder produces correct pixels. So, for the decoder to produce the picture of pixels I see, it must be fixing up the vectors -- I think the fix-up is using something called "residuals" (?) -- it's been a while since I last read the H264 ITU specification.

Maybe the codec doesn't export the residuals (or whatever they're called). Is that the problem that blocks MVTools et al from directly using the source's vectors?

markfilipak

1st May 2021, 10:56

... go read the source code of MVTools if you wanna know how things work precisely.
I don't know 'C'. And a precise treatise is more than's needed. Just a general presentation would do. A link would be fine.

.---------.
¦.----..-.¦
¦¦    ¦¦ ¦¦
vv    ¦¦ v¦
P1 B2 B3 P4
 ^ ¦¦    ^
 '-''----'

motion vectors will no longer be estimated from adjacent frames.
So? Why is that a show stopper?
It looks to me to be a sum of vectors issue and an issue of working on several frames at the same time. Decoders obviously do it, or am I wrong?