Old 24th July 2015, 09:43   #21  |  Link
jimwei
Registered User
 
Join Date: Jul 2015
Posts: 4
Quote:
Originally Posted by pieter3d View Post
Alt-Ref, Golden and Last are just names, nothing more. Think of them as an enumerated type on top of reference numbers 0, 1, 2.


Any frame can be marked as hidden (not for display). In the case of YouTube VP9 streams, the encoder uses the alt-ref label for the hidden frames, golden for some earlier frame, and last for the immediately preceding frame. That is the kind of structure you can get from the libvpx reference encoder, but other encoders may do it differently.

So is it correct that the alt-ref frame is not for display, and that the H bit (show_frame bit) in the first octet of the payload header in the first partition of the bitstream must be set to 0 by the application? (Actually, I am not sure whether this payload header is generated by the encoder or whether it should be generated by the application.)


And some more questions:

1. The decoder should use the same reference frame to decode an incoming frame as the one used when that frame was encoded, but how does the decoder know which reference frame an incoming frame was predicted from, given that it maintains 3 reference frames?

2. The decoder maintains 3 frame buffers for the reference frames. When and how does the decoder update these buffers, as far as libvpx is concerned? Does it update all three buffers when a key frame arrives? And does it update the alt-ref frame buffer when it receives a frame with the show_frame bit cleared in the first octet of the payload header in the first partition (the alt-ref frame)? When does it update the golden frame?

3. For VP9, how does the rc_target_bitrate setting work? Does it mean that the encoder can generate bitstreams of different quality according to this parameter? But it must have a minimum value; how does the encoder react if the parameter is set lower than that minimum?
jimwei is offline   Reply With Quote
Old 24th July 2015, 16:15   #22  |  Link
pieter3d
Registered User
 
Join Date: Jan 2013
Location: Santa Clara CA
Posts: 114
Quote:
Originally Posted by jimwei View Post
So is it correct that the alt-ref frame is not for display, and that the H bit (show_frame bit) in the first octet of the payload header in the first partition of the bitstream must be set to 0 by the application? (Actually, I am not sure whether this payload header is generated by the encoder or whether it should be generated by the application.)
Again, the usage of alt-ref for hidden frames is purely an encoder choice, one that the libvpx encoder happens to make. It is not enforced in any way by the VP9 specification. Any frame can be made hidden by setting the show_frame bit to 0.
As for generating the frame header, this is typically done in software, by a driver. Since the frame header format is defined by the VP9 specification, a VP9-compliant encoder must be the one that writes it.

Quote:
Originally Posted by jimwei View Post
And some more questions

1. The decoder should use the same reference frame to decode an incoming frame as the one used when that frame was encoded, but how does the decoder know which reference frame an incoming frame was predicted from, given that it maintains 3 reference frames?
Each partition in the frame (each block of 8x8 pixels or larger) gets assigned a reference ID of 0, 1 or 2. These correspond to LAST, GOLDEN, ALTREF. This information is encoded in the compressed bitstream, so that is how a decoder knows which one to use for each block.
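As a rough sketch of what that lookup amounts to (illustrative C, not libvpx code, and the names are made up), the per-block reference ID just indexes the three slots that the frame header activated:
Code:
enum RefSlot { REF_LAST = 0, REF_GOLDEN = 1, REF_ALTREF = 2, REF_SLOTS = 3 };

struct Frame;                       /* decoded picture, details omitted */

struct ActiveRefs {
    struct Frame *slot[REF_SLOTS];  /* filled from the 8-entry pool per the frame header */
};

/* Called per block: 'ref_id' is the 0/1/2 value decoded from the bitstream. */
static struct Frame *block_reference(const struct ActiveRefs *refs, int ref_id)
{
    return refs->slot[ref_id];      /* LAST, GOLDEN or ALTREF for this frame */
}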

Quote:
Originally Posted by jimwei View Post
2. The decoder maintains 3 frame buffers for the reference frames. When and how does the decoder update these buffers, as far as libvpx is concerned? Does it update all three buffers when a key frame arrives? And does it update the alt-ref frame buffer when it receives a frame with the show_frame bit cleared in the first octet of the payload header in the first partition (the alt-ref frame)? When does it update the golden frame?
There are actually 8 buffers, but any single frame may only use 3 of these. Keyframes necessarily update all buffers, because otherwise you would be dependent on frames prior to the keyframe, which defeats the purpose.
An encoder can pick any, all, or none of the 8 references to update with the current frame. There is an octet of bits, refresh_frame_flags, in the frame header, and the encoder uses these to communicate to the decoder which of the 8 buffers should be filled with the new frame.
The choice of which ones to update is free; you can come up with whatever scheme you like. Good compression performance means you need a clever scheme, and it depends heavily on the type of video you are encoding. This kind of thing makes good encoders tricky and complicated to design. Luckily, a decoder doesn't have to worry about it: once the encoder has made the decision about which buffers to update, it simply tells the decoder verbatim.
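A toy example of what such a scheme could look like (purely illustrative policy, not what libvpx does):
Code:
#include <stdint.h>

#define NUM_REF_BUFFERS 8

/* Example policy: a keyframe refreshes everything; an ordinary frame might
 * refresh only the slot this sketch uses as LAST (slot 0). */
static uint8_t choose_refresh_flags(int is_keyframe)
{
    if (is_keyframe)
        return 0xFF;                /* all 8 buffers replaced by the keyframe */
    return 1u << 0;                 /* refresh only pool slot 0 */
}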

Quote:
Originally Posted by jimwei View Post
3. For VP9, how does the rc_target_bitrate setting work? Does it mean that the encoder can generate bitstreams of different quality according to this parameter? But it must have a minimum value; how does the encoder react if the parameter is set lower than that minimum?
The encoder tries to maintain a particular bitrate by adjusting the quantization strength (i.e. how many bits are thrown away in the transform coefficients). At first it makes a best guess, or uses multiple passes, to achieve that. Then there is a feedback mechanism: it sees how many bits were actually produced; if that was not enough, it will reduce quantization for the next frame, and if there were too many, it will increase the quantization. A few other things come into play here too, to keep the bitrate nice and stable and to make sure that the decoded image doesn't noticeably jump around in quality.
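In its simplest possible form the feedback loop is just something like this (a naive sketch, not the actual libvpx rate control):
Code:
/* Nudge the quantizer toward a per-frame bit budget. Real encoders damp,
 * clamp and look at buffer fullness; this only shows the feedback direction. */
static int update_quantizer(int q, int bits_used, int bits_target)
{
    if (bits_used > bits_target && q < 255)
        q++;                        /* overshot the budget: quantize harder */
    else if (bits_used < bits_target && q > 0)
        q--;                        /* undershot: spend more bits next frame */
    return q;
}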
pieter3d is offline   Reply With Quote
Old 5th August 2015, 16:39   #23  |  Link
jimwei
Registered User
 
Join Date: Jul 2015
Posts: 4
Hi pieter3d, can you explain how the decoder knows whether the current frame is a reference frame or not, and which kind of reference frame it is? Thanks in advance.
jimwei is offline   Reply With Quote
Old 5th August 2015, 16:50   #24  |  Link
pieter3d
Registered User
 
Join Date: Jan 2013
Location: Santa Clara CA
Posts: 114
For any given frame, the decoder cannot know how it will be used in the future. However, it does know how the current frame should be handled in the reference pool of 8 buffers. In the current frame's header there is a set of 8 flags, refresh_frame_flags[]. For each flag that is 1, the current picture will be inserted into that reference pool slot after decoding is complete.
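In pseudocode-ish C (illustrative names only, and ignoring reference counting), applying those flags after decode looks like:
Code:
#include <stdint.h>

#define NUM_REF_BUFFERS 8

struct Frame;

static void apply_refresh_flags(struct Frame *ref_pool[NUM_REF_BUFFERS],
                                struct Frame *decoded,
                                uint8_t refresh_frame_flags)
{
    for (int i = 0; i < NUM_REF_BUFFERS; i++) {
        if (refresh_frame_flags & (1u << i))
            ref_pool[i] = decoded;  /* slot i now holds the just-decoded picture */
    }
}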
pieter3d is offline   Reply With Quote
Old 5th August 2015, 17:02   #25  |  Link
jimwei
Registered User
 
Join Date: Jul 2015
Posts: 4
Thanks, that is the case for VP9, but there is no such flag for VP8. And actually, my question is about VP8.
jimwei is offline   Reply With Quote
Old 5th August 2015, 17:20   #26  |  Link
pieter3d
Registered User
 
Join Date: Jan 2013
Location: Santa Clara CA
Posts: 114
For VP8 it is similar; look for the refresh_golden_frame, refresh_alt_ref_frame and copy_buffer_to_arf syntax elements in the header.
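Conceptually the VP8 update step amounts to something like this (field names are illustrative and the copy logic is simplified; the authoritative definitions are in the VP8 frame header, RFC 6386):
Code:
struct Frame;

struct Vp8RefUpdate {
    int refresh_last;       /* replace the LAST buffer with this frame    */
    int refresh_golden;     /* replace the GOLDEN buffer with this frame  */
    int refresh_altref;     /* replace the ALTREF buffer with this frame  */
    int copy_to_golden;     /* 0: none, 1: copy LAST, 2: copy ALTREF      */
    int copy_to_altref;     /* 0: none, 1: copy LAST, 2: copy GOLDEN      */
};

static void vp8_update_refs(struct Frame **last, struct Frame **golden,
                            struct Frame **altref, struct Frame *decoded,
                            const struct Vp8RefUpdate *u)
{
    /* Buffer-to-buffer copies first, then the new frame overwrites buffers. */
    if (u->copy_to_golden == 1)      *golden = *last;
    else if (u->copy_to_golden == 2) *golden = *altref;
    if (u->copy_to_altref == 1)      *altref = *last;
    else if (u->copy_to_altref == 2) *altref = *golden;

    if (u->refresh_last)   *last   = decoded;
    if (u->refresh_golden) *golden = decoded;
    if (u->refresh_altref) *altref = decoded;
}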
pieter3d is offline   Reply With Quote
Old 14th April 2017, 10:39   #27  |  Link
Shevach
Video compressionist
 
Join Date: Jun 2009
Location: Israel
Posts: 126
The lack of start codes in VP9 makes error resynchronization very hard (barely possible). Consequently, the error resilience of HEVC is inherently better than that of VP9.
Shevach is offline   Reply With Quote
Old 15th April 2017, 09:33   #28  |  Link
Shevach
Video compressionist
 
Join Date: Jun 2009
Location: Israel
Posts: 126
I wonder whether there are plans in the MPEG committee to add support for VP9 to the MPEG file format (mp4). How would the stsd box be specified if a VP9 video stream is present in the mdat box?
Generally speaking, encapsulating a VP9 elementary stream into an mp4 container would require fully parsing the stream in order to determine frame boundaries and then populate the stsz and other boxes in the metadata.
Shevach is offline   Reply With Quote
Old 15th April 2017, 10:01   #29  |  Link
sneaker_ger
Registered User
 
Join Date: Dec 2002
Posts: 5,565
There are efforts to do this.
https://www.webmproject.org/vp9/mp4/
sneaker_ger is offline   Reply With Quote
Old 15th April 2017, 10:05   #30  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,346
Frame boundaries are required by all containers that support VP9, like WebM/Matroska. "Raw" elementary streams are not typically used for VP9, i.e. the most raw format it typically goes into is the simple IVF container, which still holds frame sizes/boundaries and timestamps.
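To illustrate how thin that wrapper is, a small sketch that walks the frame boundaries of an IVF file (from memory: a 32-byte file header, then a 12-byte per-frame header carrying the frame size and a 64-bit timestamp):
Code:
#include <stdint.h>
#include <stdio.h>

static uint32_t rd_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

int main(int argc, char **argv)
{
    FILE *f = fopen(argc > 1 ? argv[1] : "input.ivf", "rb");
    uint8_t hdr[32], fh[12];
    if (!f || fread(hdr, 1, 32, f) != 32)
        return 1;                               /* no/short IVF file header */

    while (fread(fh, 1, 12, f) == 12) {
        uint32_t frame_size = rd_le32(fh);      /* payload bytes that follow */
        printf("frame of %u bytes\n", frame_size);
        fseek(f, frame_size, SEEK_CUR);         /* skip the VP9 frame data */
    }
    fclose(f);
    return 0;
}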
__________________
LAV Filters - open source ffmpeg based media splitter and decoders
nevcairiel is offline   Reply With Quote
Old 19th April 2017, 10:19   #31  |  Link
Shevach
Video compressionist
 
Join Date: Jun 2009
Location: Israel
Posts: 126
@pieter3d
you wrote:
"Another glaring flaw is the fact that MV decode requires fully reconstructed neighboring and co-located MV values, which means the entire MV prediction process is required for entropy decode and cannot be decoupled."

I'm afraid I don't understand this point, especially "MV decode requires fully reconstructed neighboring".
As far as I know, in the recent version of the VP9 spec there is a function 'assign_mv' which in turn calls 'read_mv' and uses the results of other functions. However, the process of MV derivation is similar (in spirit) to that of HEVC. Why are 'reconstructed residuals' needed in the MV derivation process? Could you elaborate?
According to the spec, the 'assign_mv' function exploits only surrounding MVs, in a manner similar to AVC/HEVC. Where is the flaw relative to HEVC/AVC here? I don't see it.
Anyway, I'll ask Pieter to elaborate on this point.
Shevach is offline   Reply With Quote
Old 20th April 2017, 16:50   #32  |  Link
Shevach
Video compressionist
 
Join Date: Jun 2009
Location: Israel
Posts: 126
I see three flaws in VP9 which degrade error resilience: lack of start codes, lack of slices, and non-adaptivity of probabilities within a frame.

When arithmetic coding is used, error-detection latency is long (error-detection latency is the distance, in MBs/CTUs/superblocks, between the place where a bitstream error occurs and the place where it is detected). For example, a bit-flip can occur at the start of a frame but be detected only at the end, and as a result the entire frame is corrupted. If the error is detected in the middle of the frame, then the second half can be filled with co-located MBs/CTUs/superblocks.

In HEVC/AVC, a bitstream error is necessarily detected either at the end of a frame, when the next start code is encountered, or when the number of CTUs/MBs exceeds the expected amount (according to the resolution).
In VP9 (due to the lack of start codes) a bitstream error may only be detected in the middle of the next frame, or of the frame after that, in which case two or more frames are corrupted (it's worth mentioning that in HEVC, in the worst case, a single frame is corrupted).

Division into slices is extremely useful for error resilience, since the corrupted area is limited to the size of a slice (notice that a bitstream error is inevitably detected before the start code of the next slice). Consequently, in the worst case a single slice is corrupted and not the whole frame.

In the error-resilience mode of VP9 each frame is coded with a default set of fixed probabilities (or context models, in HEVC/AVC jargon); the VP9 encoder can't take probabilities from the previous frame, since the previous frame may be corrupted before it reaches the decoder. Consequently all frames are coded with fixed probabilities. If the actual statistics are close to the default ones, then everything is fine ('sababa'). However, if not, penalty bits are produced (the penalty can be assessed via the Kullback-Leibler divergence). Consequently, VP9 entropy coding is ineffective in this mode. In HEVC/AVC, even if the default probabilities strongly differ from the actual ones, CABAC quickly adapts itself to the actual statistics and coding approaches optimal.
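To put a rough number on that penalty (back-of-the-envelope): if the true symbol statistics are p but the frame is coded with the fixed default probabilities q, the average overhead per coded symbol is the Kullback-Leibler divergence

D_KL(p \| q) = \sum_i p_i \log_2 (p_i / q_i)   bits per symbol.

For example (illustrative numbers), a binary symbol that takes one value 90% of the time but is coded with a default 50/50 probability wastes about 0.53 bits every time it is coded.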
Shevach is offline   Reply With Quote
Old 26th April 2017, 23:42   #33  |  Link
FancyMouse
Registered User
 
Join Date: Sep 2014
Posts: 12
Quote:
Originally Posted by Shevach View Post
@pieter3d
I'm afraid I don't understand this point, especially "MV decode requires fully reconstructed neighboring".
As far as I know, in the recent version of the VP9 spec there is a function 'assign_mv' which in turn calls 'read_mv' and uses the results of other functions. However, the process of MV derivation is similar (in spirit) to that of HEVC. Why are 'reconstructed residuals' needed in the MV derivation process? Could you elaborate?
According to the spec, the 'assign_mv' function exploits only surrounding MVs, in a manner similar to AVC/HEVC. Where is the flaw relative to HEVC/AVC here? I don't see it.
Disclaimer: I've not read VP9 spec/code.
I believe VP9's nature of "using the code as the spec" is to blame, as a code bug becomes part of the spec as well. Even though the intention is good, a code bug might undermine it. See the OP's first paragraph: it would mean that the reference decoder at the time the bitstream was frozen is the golden standard. They might fix it in a newer document, but then the newer one should be called VP9.1 or something like that, otherwise existing shipped VP9 decoders might break.
FancyMouse is offline   Reply With Quote
Old 2nd July 2017, 16:16   #34  |  Link
Shevach
Video compressionist
 
Join Date: Jun 2009
Location: Israel
Posts: 126
Let me share a white paper "Choosing of the Right Codec: Comparing HEVC & VP9"
https://drive.google.com/file/d/0B7a...pRN2pVZjQ/view

In this paper a qualitative analysis of key features of HEVC and VP9 is provided. I am a co-author of this paper, so feel free to ask me about it.
I know the paper is general and lacks details and figures (numbers saying which feature yields what gain are absent). The original paper contains everything, but the published article is a heavily censored version.
Anyway, I appreciate Beamr Imaging's decision to share even the censored version of the results.
Shevach is offline   Reply With Quote
Old 2nd July 2017, 18:43   #35  |  Link
MasterNobody
Registered User
 
Join Date: Jul 2007
Posts: 552
Shevach
Why does the paper give VP9 an advantage in the Segmentation part? IMHO 8 segments are too coarse compared to AVC/HEVC's fine-grained (per MB/CU or smaller) quantizer values, so it is a disadvantage (as x264 showed, AQ is a very important feature).

Last edited by MasterNobody; 2nd July 2017 at 18:48.
MasterNobody is offline   Reply With Quote
Old 3rd July 2017, 15:02   #36  |  Link
Shevach
Video compressionist
 
Join Date: Jun 2009
Location: Israel
Posts: 126
Quote:
Originally Posted by MasterNobody View Post
Shevach
Why does the paper give VP9 an advantage in the Segmentation part? IMHO 8 segments are too coarse compared to AVC/HEVC's fine-grained (per MB/CU or smaller) quantizer values, so it is a disadvantage (as x264 showed, AQ is a very important feature).
@MasterNobody
The benefit of Segmentation depends on the content (e.g. animation with a large fixed background) and the bitrate (low bitrate).
As for x264 AQ, the quantizer fluctuates within a frame according to "block variance" (or another HVS metric), therefore QP is not a good choice for driving Segmentation.
On the other hand, I am familiar with commercial rate controls where the quantizer is fixed within the entire frame (changed only across frames), or where the quantizer per CTU/MB depends completely on a virtual buffer (strict CBR mode, used in statistical multiplexing).
Other parameters such as the reference frame and loop filter strength are sufficiently well correlated temporally and spatially.
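For readers unfamiliar with the tool, this is roughly what a VP9 segment carries (a conceptual sketch with illustrative names; the exact feature set and syntax are in the VP9 spec): up to 8 segments, each bundling a few overrides that then apply to every block mapped to that segment.
Code:
#define VP9_MAX_SEGMENTS 8

struct SegmentParams {
    int q_delta;     /* quantizer offset for blocks in this segment       */
    int lf_delta;    /* loop-filter strength offset                       */
    int ref_frame;   /* forced reference frame, or -1 if not overridden   */
    int skip;        /* if set, blocks in this segment carry no residual  */
};

/* One table per frame; each block selects an entry via its segment_id. */
struct SegmentParams seg[VP9_MAX_SEGMENTS];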
Shevach is offline   Reply With Quote
Old 3rd July 2017, 18:21   #37  |  Link
MasterNobody
Registered User
 
Join Date: Jul 2007
Posts: 552
That doesn't answer the question of why coarse and limited (only 8 segments) segmentation is better than fine-grained per-CU adaptivity of such parameters (the reference frame can also be changed per MB, and while you can't change loop filter strength directly, it depends on the QP value). I.e. what does it give that AVC/HEVC can't achieve with other, more fine-grained adaptivity?
MasterNobody is offline   Reply With Quote
Old 6th July 2017, 11:49   #38  |  Link
Shevach
Video compressionist
 
Join Date: Jun 2009
Location: Israel
Posts: 126
Quote:
Originally Posted by MasterNobody View Post
That doesn't answer the question of why coarse and limited (only 8 segments) segmentation is better than fine-grained per-CU adaptivity of such parameters (the reference frame can also be changed per MB, and while you can't change loop filter strength directly, it depends on the QP value). I.e. what does it give that AVC/HEVC can't achieve with other, more fine-grained adaptivity?
Segmentation is an optional mode, present in VP9 only. I agree that for typical real video 'fine-grained block adaptivity' is better than Segmentation. However, there are scenarios where Segmentation gives a good gain in coding efficiency. In such scenarios a smart VP9 encoder can exploit Segmentation while HEVC can't.
I'll provide a simple example: let's suppose that in a frame all reference indexes are equal to zero (i.e. all blocks refer to the previous frame). In HEVC the encoder would transmit 1 bin per CU to signal ref_idx=0. Roughly speaking, with CABAC 1 bin compresses down to about 1/6 of a bit. Hence the number of bits dedicated to signaling reference indexes is about #CUs * 1/6.
In VP9 this number is almost zero (there is only a small overhead in the picture header).
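To put rough numbers on it (back-of-the-envelope, assuming 32x32 CUs for a 1080p frame):

#CU = ceil(1920/32) * ceil(1080/32) = 60 * 34 = 2040, so about 2040 * 1/6 ≈ 340 bits ≈ 43 bytes per frame

just to repeat ref_idx=0, which VP9 avoids almost entirely (only the small picture-header overhead remains).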
Shevach is offline   Reply With Quote
Old 26th July 2017, 21:21   #39  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
I started reading the document, and I hope any Dutch person here can appreciate my reference to WC-Eend. I'll make my point by just responding to a single section.

Quote:
Bi-prediction means that two references are used simultaneously to create motion predictions, and this feature is supported both in AVC and HEVC. VP9 doesn’t support this feature (possibly due to IP issues), and instead supports a workaround called “compound prediction”. In this method, a first predicted frame is a hidden, non displayable frame created using bi-directional prediction, and another frame which is essentially a skipped frame “copies” the first frame. The two frames together constitute a ‘superframe’. This still adds some overhead, giving HEVC an advantage in coding efficiency over VP9 when this mode is used.
The above is totally bogus.

The VP9 bitstream can signal up to 3 active references, and each reference is then assigned a sign bias bit (0 or 1). The idea here is similar to h264/hevc frames having two reference lists: l0 and l1. Compound prediction happens between reference frames of different sign bias (but not frames of identical sign bias). This is similar to h264/hevc, where bidirectional prediction happens between a l0+l1 reference (but not two l0s or two l1s). As a result, just like for hevc/h264, each frame can use bidirectional prediction, depending on the reference list setup in the frame header. One pretty fundamental issue here is the limit on the number of active references, which you oddly didn't mention at all in this section. (You did mention it further down, but then failed to acknowledge that the memory issues you mentioned have been addressed in the VP9 levels.) The reference to IP issues is unsubstantiated.

The remainder of the quote talks about invisible/overlay frames, which are frame reordering techniques, i.e. frame-level tools, that have nothing to do with prediction types at the block level. An invisible frame in VP9 is conceptually the same as an out-of-order coded frame (which is coded, but not yet displayed, and thus not yet visible, a.k.a. invisible) in hevc/h264. The signaling is indeed slightly different. In hevc/h264, you would signal an invisible (out-of-order) frame by having the poc be ahead of the next expected poc, which means the decoder needs to delay its output. In vp9, you signal this by marking the frame as invisible. Later on, using reordering based on poc (in hevc/h264) or the direct-reference single-byte packet (in VP9), these not-yet-displayed (a.k.a. invisible) frames are made visible. The suggestion that the 1 byte overhead of this signaling would be significant is crazy. There is also a reference to overlay frames, which are a libvpx-specific thing that is otherwise unrelated to the VP9 bitstream. One could make well-founded points on the overlay frames and ARNR (the two go hand-in-hand), and how they may result in more PSNR gains than visual gains, but you didn't mention this at all.
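For the curious, the mechanism I mean is show_existing_frame in the VP9 uncompressed header; conceptually it behaves like this sketch (simplified, illustrative field names, not the exact bit layout):
Code:
struct Frame;

struct FrameHeader {
    int show_existing_frame;    /* 1: just display a frame already in the pool */
    int frame_to_show_map_idx;  /* which of the 8 reference slots to show      */
    /* ...the rest of the header is only present when the flag is 0... */
};

static struct Frame *handle_frame(const struct FrameHeader *h,
                                  struct Frame *ref_pool[8])
{
    if (h->show_existing_frame)
        return ref_pool[h->frame_to_show_map_idx];  /* no decoding needed */
    /* ...otherwise decode the frame normally and honor show_frame... */
    return 0;
}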
Beelzebubu is offline   Reply With Quote
Old 13th May 2018, 18:52   #40  |  Link
elinzer1
Registered User
 
Join Date: May 2018
Posts: 2
Similar write up for AV1?

Great post. Do you have a similar one for AV1?
elinzer1 is offline   Reply With Quote