How VP9 works, technical details & diagrams [Archive]

pieter3d

8th October 2013, 00:34

Hi all,

After my last post on HEVC (http://forum.doom9.org/showthread.php?t=167081) got some interest, I’ve decided to write one on VP9, Google’s new video coding standard. The bitstream was frozen in July 2013, paving the way for hardware engineers (like myself) to be confident they don’t have to worry about design changes. There is no official spec document (yet...), so the following is all based on how the reference decoder (http://www.webmproject.org/code/) works. As with the HEVC overview, I will gloss over a few low level details, but feel free to ask for any elaborations!

As of today, VP9 supports only YUV 4:2:0, i.e. chroma subsampled by two both horizontally and vertically. The header has provisions for future formats (RGB, alpha) and for less chroma subsampling, but none of these are enabled yet. There is also no support for field coding; VP9 is progressive only.

Picture partitioning:
VP9 divides the picture into 64x64 blocks called superblocks (SB for short). Superblocks are processed in raster order: left to right, top to bottom, the same as in most other codecs. A superblock can be subdivided all the way down to 4x4. The subdivision is done with a recursive quadtree, just like HEVC. Unlike HEVC, however, a subdivision can also be horizontal-only or vertical-only, in which case the recursion stops at that level. Although 4x4 is the smallest “partition”, a lot of information is stored only at 8x8 granularity, in an MI (mode info) unit. As a result, blocks smaller than 8x8 are handled as something of a special case. For example, a pair of 4x8 intra coded blocks is treated like an 8x8 block with two intra modes. Example partitioning of a superblock:
http://i.imgur.com/xZdoWlA.png
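To make the recursion concrete, here is a minimal Python sketch of the partition walk for one superblock. The `choose` callback is a hypothetical stand-in for the partition symbol that a real decoder reads from the bitstream, and the function and type names are made up for illustration; they are not taken from the reference decoder.

```python
from typing import Callable, List, Tuple

Block = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def walk_partitions(x: int, y: int, size: int,
                    choose: Callable[[int, int, int], str]) -> List[Block]:
    """Return the leaf blocks of one square region (size is 64, 32, 16 or 8).

    choose(x, y, size) stands in for the coded partition type and returns
    "none", "horz", "vert" or "split".
    """
    kind = choose(x, y, size)
    half = size // 2
    if kind == "none":
        return [(x, y, size, size)]
    if kind == "horz":  # two wide halves, no further subdivision
        return [(x, y, size, half), (x, y + half, size, half)]
    if kind == "vert":  # two tall halves, no further subdivision
        return [(x, y, half, size), (x + half, y, half, size)]
    # "split": four quadrants, visited left to right, top to bottom
    quads = [(x, y), (x + half, y), (x, y + half), (x + half, y + half)]
    if half < 8:
        # Splitting an 8x8 yields four 4x4 blocks; nothing goes below that.
        return [(qx, qy, half, half) for qx, qy in quads]
    blocks: List[Block] = []
    for qx, qy in quads:
        blocks.extend(walk_partitions(qx, qy, half, choose))
    return blocks
```

Only the "split" case recurses, which is what distinguishes this tree from a plain quadtree: a horizontal or vertical cut ends the subdivision for that region.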
Unlike AVC or HEVC, there is no such thing as a slice. Once the data for a frame begins, you get it all until the frame is complete.
VP9 also supports tiles, where the picture is broken up into a grid of tiles along superblock boundaries. Unlike HEVC, these tiles are always as evenly spaced as possible, and the number of tiles is always a power of two. Tiles must be at least 256 pixels wide and no more than 4096 pixels wide. There can be no more than four tile rows. The tiles are scanned in raster order, and the superblocks within them are scanned in raster order. Thus the ordering of superblocks within the picture depends on the tile structure. Coding dependencies are broken along vertical tile boundaries, which means that two tiles in the same tile row may be decoded at the same time, a great feature for multi-core decoders. Unlike HEVC, coding dependencies are not broken across horizontal tile boundaries. So a frame split into 2x2 tiles can be decoded with 2 threads, but not with 4.
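The width limits and the "as evenly spaced as possible" rule can be sketched in a few lines of Python. The constants follow the 256 and 4096 pixel limits above; the rounding used to place the column boundaries is my recollection of how the reference decoder does it, so treat it as an assumption. All names are illustrative.

```python
from typing import List, Tuple

SB = 64                 # superblock size in pixels
MIN_TILE_WIDTH_SB = 4   # 256 pixels
MAX_TILE_WIDTH_SB = 64  # 4096 pixels


def tile_col_log2_range(frame_width: int) -> Tuple[int, int]:
    """Smallest and largest allowed log2(number of tile columns)."""
    sb_cols = (frame_width + SB - 1) // SB
    min_log2 = 0
    while (MAX_TILE_WIDTH_SB << min_log2) < sb_cols:
        min_log2 += 1  # tiles would be wider than 4096 pixels
    max_log2 = 0
    while (sb_cols >> (max_log2 + 1)) >= MIN_TILE_WIDTH_SB:
        max_log2 += 1  # one more doubling still leaves tiles >= 256 wide
    return min_log2, max_log2


def tile_col_starts(frame_width: int, log2_cols: int) -> List[int]:
    """Column boundaries in superblock units, including the right edge."""
    sb_cols = (frame_width + SB - 1) // SB
    return [(i * sb_cols) >> log2_cols for i in range((1 << log2_cols) + 1)]
```

For a 1920-wide picture this allows 1, 2 or 4 tile columns, and four columns come out 7, 8, 7 and 8 superblocks wide. A width above 4096 gives a minimum of two columns.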
At the start of every tile except the last one, a 32-bit byte count is transmitted, indicating how many bytes are used to code the next tile. This lets a multithreaded decoder skip ahead to the next tile in order to start a particular decoding thread.
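A sketch of how a decoder might use those byte counts to locate every tile before decoding any of them. It assumes the count is stored big-endian and that `data` starts at the first tile's size field; both are assumptions for illustration, as is the function name.

```python
import struct
from typing import List


def split_tiles(data: bytes, num_tiles: int) -> List[bytes]:
    """Slice the compressed frame data into per-tile byte strings."""
    tiles = []
    pos = 0
    for i in range(num_tiles):
        if i < num_tiles - 1:
            # Every tile except the last is preceded by a 32-bit byte count.
            (size,) = struct.unpack_from(">I", data, pos)
            pos += 4
        else:
            # The last tile simply runs to the end of the frame data.
            size = len(data) - pos
        tiles.append(data[pos:pos + size])
        pos += size
    return tiles
```

Each returned slice can then be handed to its own decoding thread, subject to the tile-row dependency described above.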

Bitstream coding:
The VP9 bitstream generated by the reference code from Google is containerized with either IVF or WebM. IVF is extremely simple (http://wiki.multimedia.cx/index.php?title=IVF), and WebM is essentially just a subset of MKV (http://www.webmproject.org/docs/container/). If no container is used at all, then it is impossible to seek to a particular frame without doing a full decode of all preceding frames. This is due to the lack of start codes as seen with AVC/HEVC AnnexB streams.
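To show how little a container needs to provide, here is a minimal IVF reader in Python. It relies on the layout described on the wiki page linked above (a header starting with `DKIF` that carries its own length, then a 4-byte size and an 8-byte timestamp before each frame, all little-endian). The function name is mine, and error handling is kept to a bare minimum.

```python
import struct
from typing import Iterator, Tuple


def read_ivf_frames(buf: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (timestamp, frame_bytes) for each frame in an IVF file."""
    signature, _version, header_len = struct.unpack_from("<4sHH", buf, 0)
    if signature != b"DKIF":
        raise ValueError("not an IVF file")
    pos = header_len  # normally 32
    while pos + 12 <= len(buf):
        size, timestamp = struct.unpack_from("<IQ", buf, pos)
        pos += 12
        yield timestamp, buf[pos:pos + size]
        pos += size
```

The per-frame size field is what makes seeking possible: a reader can hop from frame header to frame header without touching the bool-coded payload.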
All VP9 bitstreams start with a keyframe, where every block is intra coded and the internal state of the decoder is reset. A decoder must start at a keyframe, after which it can decode any number of inter frames, which use previous frames as reference data.
Like its predecessor VP8, VP9 compresses the bitstream using an 8-bit arithmetic coding engine known as the bool-coder. The probability model is fixed for the whole frame; all probabilities are known before decode of frame data begins (does not adapt like AVC/HEVC’s CABAC). Probabilities are one byte each, and there are 1783 of them for keyframes, 1902 for inter frames (by my count). Each probability has a known default value.
These probabilities are stored in what is known as a frame context. The decoder maintains four of these contexts, and the bitstream signals which one to use for the frame decode.
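For readers who have not met a bool-coder before, the following Python sketch shows the idea: each binary symbol splits the current range in proportion to an 8-bit probability of a zero. The split formula is the one I know from the VP8 bool-coder; the rest is a simplified model that uses Python's unbounded integers instead of the byte-wise carry handling of a real implementation, so it illustrates the principle and is not bit-compatible with the reference code.

```python
class BoolEncoder:
    def __init__(self) -> None:
        self.low = 0      # lower bound of the interval (unbounded int)
        self.range = 255  # current interval width, kept in 128..255
        self.shifts = 0   # number of renormalisation shifts so far

    def put(self, bit: int, prob: int) -> None:
        """Encode one bit; prob is P(bit == 0) in 1/256 units (1..255)."""
        split = 1 + (((self.range - 1) * prob) >> 8)
        if bit:
            self.low += split
            self.range -= split
        else:
            self.range = split
        while self.range < 128:  # renormalise
            self.range <<= 1
            self.low <<= 1
            self.shifts += 1

    def finish(self) -> bytes:
        nbits = 8 + self.shifts
        pad = -nbits % 8  # pad with zero bits to a whole byte
        return (self.low << pad).to_bytes((nbits + pad) // 8, "big")


class BoolDecoder:
    def __init__(self, data: bytes) -> None:
        self.data = data
        self.pos = 8  # next bit to shift in
        self.value = data[0]
        self.range = 255

    def _next_bit(self) -> int:
        byte, self.pos = self.pos >> 3, self.pos + 1
        if byte >= len(self.data):
            return 0  # past the end: feed zeros
        return (self.data[byte] >> (7 - ((self.pos - 1) & 7))) & 1

    def get(self, prob: int) -> int:
        """Decode one bit using the same probability the encoder used."""
        split = 1 + (((self.range - 1) * prob) >> 8)
        if self.value >= split:
            bit = 1
            self.value -= split
            self.range -= split
        else:
            bit = 0
            self.range = split
        while self.range < 128:
            self.range <<= 1
            self.value = (self.value << 1) | self._next_bit()
        return bit
```

Note that the decoder must be given the same probability for every symbol as the encoder used, which is why all probabilities have to be known before the frame data is decoded.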
Each frame is coded in three sections as follows:

Uncompressed header, only a dozen or so bytes that contains things like picture size, loop filter strength etc.
Compressed header: Bool-coded section that transmits the probabilities used for the whole frame. They are transmitted as deviation from their default values.
Compressed frame data. This bool-coded data contains the data needed to reconstruct the frame, including block partition sizes, motion vectors, intra modes and transform coefficients.

Note that unlike VP8, there is no data partitioning: all data types are interleaved in super block coding order. This is a design choice that makes life easier for hardware designers.
After the frame is decoded, the probabilities can optionally be adapted: Based on the occurrence counts of each particular symbol during the frame decode, new probabilities are derived that are stored in the frame context buffer and may be used for a future frame.
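A rough sketch of what such a backward adaptation step can look like for a single binary probability: the observed frequency is blended into the old probability, with more weight the more often the symbol was seen. The saturation count and maximum update weight below are illustrative assumptions, not values quoted from the reference decoder.

```python
def adapt_prob(old_prob: int, count0: int, count1: int,
               count_sat: int = 20, max_update: int = 128) -> int:
    """Blend an 8-bit P(0) with the frequency observed during the frame.

    count_sat and max_update are hypothetical tuning values.
    """
    total = count0 + count1
    if total == 0:
        return old_prob  # symbol never seen: keep the old probability
    # Observed probability of a zero, rounded and clipped to 1..255.
    observed = min(255, max(1, (256 * count0 + total // 2) // total))
    # Weight grows with the number of observations, up to max_update/256.
    weight = max_update * min(total, count_sat) // count_sat
    return (old_prob * (256 - weight) + observed * weight + 128) >> 8
```

With these values a symbol that was seen 100 times and was always zero moves a probability of 128 to 192, while a symbol seen only twice barely moves it at all.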

Residual coding:
Unless a block codes (or infers) a skip flag, a residual signal is transmitted for each block. The skip flag applies at 8x8 granularity, so for block splits below 8x8, the skip flag applies to all blocks within the 8x8. Like HEVC, VP9 supports four transform sizes: 32x32, 16x16, 8x8 and 4x4. Like most other coding standards, these transforms are an integer approximation of the DCT. For intra coded blocks either or both the vertical and horizontal transform pass can be DST (discrete sine transform) instead. This has to do with the specific characteristics of the residual signal of intra blocks.
The bitstream codes the transform size used in each block. For example, if a 32x16 block specifies 8x8 transform, the luma residual data consists of a grid of 4x2 8x8’s, and the two 16x8 chroma residuals consist of 2x1 8x8s. If the transform size for luma does not fit in chroma, it is reduced accordingly (e.g. a 16x16 block with 16x16 luma transform uses 8x8 transforms for chroma).
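The two examples above can be checked with a small helper. It assumes 4:2:0 (chroma planes are half the luma size in both directions) and a block of at least 8x8; the function name is made up.

```python
from typing import Tuple


def transform_grids(block_w: int, block_h: int,
                    luma_tx: int) -> Tuple[Tuple[int, int], int, Tuple[int, int]]:
    """Return (luma grid, chroma transform size, chroma grid) for one block.

    Grids are (columns, rows) of transform blocks.
    """
    luma_grid = (block_w // luma_tx, block_h // luma_tx)
    chroma_w, chroma_h = block_w // 2, block_h // 2
    # Shrink the transform if it does not fit inside the chroma block.
    chroma_tx = min(luma_tx, chroma_w, chroma_h)
    chroma_grid = (chroma_w // chroma_tx, chroma_h // chroma_tx)
    return luma_grid, chroma_tx, chroma_grid
```

`transform_grids(32, 16, 8)` gives a 4x2 luma grid and a 2x1 grid of 8x8 chroma transforms, and `transform_grids(16, 16, 16)` drops the chroma transform to 8x8.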
Transform coefficients are scanned starting at the DC position (upper left), and follow a semi-random curved pattern towards the higher frequencies. Transform blocks with mixed DCT/DST use a scan pattern skewed accordingly.
http://i.imgur.com/H41WZMK.png
http://i.imgur.com/RtQID5i.png
This coefficient ordering is not very predictable like the diagonal or zigzag scans of other codecs, so it requires the pattern to be stored as a lookup table (larger silicon area).
Each coefficient is read from the bitstream using the bool-coder and several probabilities. The probabilities required are chosen depending on various parameters such as position in the block, size of the transform block, value of neighboring coefficients etc. The large number of combinations of these parameters is the reason why the bool-coder’s probability model contains so many probabilities.
Inverse quantization in VP9 is very simple; it is just a multiplication by a number that is fixed for the entire frame (i.e. no block-level QP adjustment like HEVC/AVC). There are four of these scaling factors:

Luma DC (first coefficient)
Luma AC (all other coefficients)
Chroma DC (first coefficient)
Chroma AC (all other coefficients)
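In code, the inverse quantization described here boils down to a single multiply per coefficient. The sketch below takes the coefficients of one transform block in scan order (index 0 is DC); the dictionary layout and names are illustrative only.

```python
from typing import Dict, List, Tuple

Scales = Dict[Tuple[str, str], int]  # (plane, "dc" or "ac") -> scale factor


def dequantize(coeffs: List[int], plane: str, scales: Scales) -> List[int]:
    """Scale one transform block; plane is "luma" or "chroma"."""
    dc = scales[(plane, "dc")]
    ac = scales[(plane, "ac")]
    return [c * (dc if i == 0 else ac) for i, c in enumerate(coeffs)]
```

Since the four factors are fixed for the frame, a hardware implementation can load them once per frame rather than tracking a per-block QP.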

VP9 offers a lossless coding mode where all transform blocks are always 4x4, no inverse quantization, and the transform used is always a special one known as a Walsh 4x4. This lossless mode is either on or off for the entire frame.

Intra prediction:
Intra prediction in VP9 is similar to AVC/HEVC intra prediction, and follows the transform block partitions. Thus intra prediction operations are always square. For example, a 16x8 block with 8x8 transforms will result in two 8x8 luma prediction operations.

There are 10 different prediction modes, 8 of which are directional. Like other codecs, intra prediction requires two 1D arrays that contain the reconstructed left and upper pixels of the neighboring blocks. The left array is the same height as the current block’s height, and the above array is twice as long as the current block’s width. However, for intra blocks larger than 4x4, the second half of the above array is simply extended from the last pixel of the first half (notice value 80):
http://i.imgur.com/0n7jgj4.png
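The construction of the above array can be sketched as follows. This ignores picture-edge and availability handling, which a real decoder also has to deal with, and the function name is made up.

```python
from typing import List, Sequence


def above_edge(above_row: Sequence[int], x: int, size: int) -> List[int]:
    """Build the 2*size-long 'above' array for a square block at column x.

    above_row holds the reconstructed pixels of the row just above the block.
    """
    if size > 4:
        # Only the first half is real data; the rest repeats its last pixel.
        first = list(above_row[x:x + size])
        return first + [first[-1]] * size
    # 4x4 blocks read all 2*size pixels from the row above.
    return list(above_row[x:x + 2 * size])
```

With a row of 10, 20, ... 160 and an 8x8 block at x = 0, the result is 10 to 80 followed by eight copies of 80, matching the diagram.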
- End of part 1 -

pieter3d

8th October 2013, 00:38

- Part2 -

Inter prediction:
Inter prediction in VP9 uses 1/8th pel motion compensation, twice the precision of most other standards. In most cases, motion compensation is unidirectional, meaning one motion vector per block, no bi-prediction. However, VP9 does support “compound prediction”, which really is just another word for bi-prediction where there are two motion vectors for each block and the two resulting prediction samples are averaged together. In order to avoid patents on bi-prediction, compound prediction is only enabled in frames that are marked as not-displayable. A frame like this is never output for display, but may be used for reference later. In fact, a later frame may consist of nothing but 64x64 blocks with no residuals and 0,0 motion vectors that point to this non-displayed frame, effectively causing it to be output later using very little data.
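The averaging step of compound prediction is trivial to write down. The round-half-up behaviour in this sketch is an assumption on my part; the text above only says the two predictions are averaged.

```python
from typing import List, Sequence


def compound_predict(pred_a: Sequence[int], pred_b: Sequence[int]) -> List[int]:
    """Average two motion-compensated predictions sample by sample.

    Rounds half up (assumed rounding rule).
    """
    return [(a + b + 1) >> 1 for a, b in zip(pred_a, pred_b)]
```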
Non-displayed frames do cause a bit of a problem when putting the VP9 bitstream in a container, since each “frame” from the container should result in a displayable frame. To resolve this, VP9 introduces the concept of a super-frame. A super-frame is simply one or more non-displayable frames and one displayable frame all strung together as one chunk of data in the container. Thus the decoder still outputs a frame, and the internal references are all updated with the non-displayable frames.
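A sketch of how a container chunk can be split back into its frames. The layout assumed here is my understanding of the superframe index the reference decoder looks for: a trailing marker byte of the form 0b110xxxxx that encodes the frame count and the number of bytes per size field, repeated at the start of the index, with little-endian frame sizes in between. Treat the details as assumptions and check them against the reference code.

```python
from typing import List


def split_superframe(chunk: bytes) -> List[bytes]:
    """Return the individual frames contained in one container chunk."""
    if not chunk:
        return []
    marker = chunk[-1]
    if (marker & 0xE0) == 0xC0:
        frames = (marker & 0x07) + 1
        size_bytes = ((marker >> 3) & 0x03) + 1
        index_size = 2 + size_bytes * frames
        start = len(chunk) - index_size
        # The same marker byte must also open the index.
        if start >= 0 and chunk[start] == marker:
            out, pos = [], 0
            for i in range(frames):
                field = start + 1 + i * size_bytes
                size = int.from_bytes(chunk[field:field + size_bytes], "little")
                out.append(chunk[pos:pos + size])
                pos += size
            return out
    return [chunk]  # no index: the chunk is a single ordinary frame
```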

Back to motion compensation. As mentioned, the motion compensation filter is 1/8th pixel accurate. Additionally, VP9 offers a clever new feature where each block can pick one of 3 motion compensation filters:

Normal 8th pel
Smooth 8th pel, which lightly smoothes or blurs the prediction
Sharp 8th pel, which lightly sharpens the prediction.

The motion vector points to one of three possible reference frames (see reference frame management below) known as Last, Golden, or AltRef. These names are merely names, and don’t really indicate anything else. Last doesn’t have to be the last frame, although typically it is with streams from the reference encoder. The reference frame is applied at 8x8 granularity, so for example two 4x8 blocks, each with their own MV, will always point to the same reference frame.

Motion vector prediction in VP9 is quite involved. Like HEVC, a 2-entry list of predictors is built up using up to 8 of the surrounding blocks that use the same reference picture, followed by a temporal predictor (motion vector from the previous frame in the same location). If this search process does not fill the list, the surrounding blocks are searched again but this time the reference doesn’t have to match. Finally, if the prediction list is still not filled, then 0,0 vectors are inferred.
The block codes one of four motion vector modes:

New MV – Use the first entry of this prediction list and add in a delta MV, transmitted in the bitstream
Nearest MV – use the first entry of this prediction list as is, no delta
Near MV – use the second entry of this prediction list as is, no delta
Zero MV – simply use 0,0 as the MV value.
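Put together, the candidate search and the four modes can be sketched like this. It is a simplification: duplicates are dropped (an assumption), and details such as candidate scan order, sign handling for mismatched references and precision rounding are left out. All names are illustrative.

```python
from typing import List, Optional, Sequence, Tuple

MV = Tuple[int, int]
Candidate = Tuple[int, MV]  # (reference frame id, motion vector)


def build_mv_list(spatial: Sequence[Candidate],
                  temporal: Optional[Candidate], ref: int) -> List[MV]:
    """Build the 2-entry predictor list for a block using reference `ref`."""
    out: List[MV] = []

    def add(mv: MV) -> None:
        if len(out) < 2 and mv not in out:
            out.append(mv)

    for cand_ref, mv in spatial:          # pass 1: same reference only
        if cand_ref == ref:
            add(mv)
    if temporal is not None and temporal[0] == ref:
        add(temporal[1])                  # co-located MV of previous frame
    for _cand_ref, mv in spatial:         # pass 2: any reference
        add(mv)
    while len(out) < 2:                   # still short: infer 0,0
        out.append((0, 0))
    return out


def select_mv(mode: str, preds: List[MV], delta: MV = (0, 0)) -> MV:
    """Apply one of the four motion vector modes."""
    if mode == "NEWMV":
        return (preds[0][0] + delta[0], preds[0][1] + delta[1])
    if mode == "NEARESTMV":
        return preds[0]
    if mode == "NEARMV":
        return preds[1]
    return (0, 0)  # ZEROMV
```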

Reference frame management:
A decoder maintains a pool of 8 reference pictures at all times. Each frame picks 3 of these to use for inter prediction (known as Last, Golden, and AltRef), and afterwards can insert itself into any, all, or none of these 8 slots, evicting whatever frame was there before.
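A minimal model of the reference pool, with frames represented by plain strings. The 8-bit `refresh_frame_flags` mask is the header field discussed later in this thread; mapping bit i to slot i is an assumption, and the class and method names are made up.

```python
from typing import Dict, List, Optional


class ReferencePool:
    def __init__(self) -> None:
        self.slots: List[Optional[str]] = [None] * 8

    def select(self, last: int, golden: int,
               altref: int) -> Dict[str, Optional[str]]:
        """Pick the three slots the current frame predicts from."""
        return {"LAST": self.slots[last],
                "GOLDEN": self.slots[golden],
                "ALTREF": self.slots[altref]}

    def refresh(self, frame: str, refresh_frame_flags: int) -> None:
        """Store the decoded frame in every slot whose flag bit is set."""
        for slot in range(8):
            if (refresh_frame_flags >> slot) & 1:
                self.slots[slot] = frame
```

A keyframe corresponds to `refresh(frame, 0xFF)`, and a frame that nobody will reference again uses a mask of 0.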
A new and exciting feature of VP9 is reference frame scaling. Each new inter frame can actually be coded using a different resolution than the previous frame. When creating inter predictions, the reference data is scaled up or down accordingly. The scaling filters are 16th pel accurate, 8-tap. This feature allows for quick and seamless on-the-fly bitrate adjustment, for example during video conferencing. A very elegant solution, much less complicated than SVC for example.

Loop Filter:
Like AVC and HEVC, VP9 specifies a loop filter that is applied to the whole picture after it has been decoded. It attempts to clean up the blocky artifacts that can occur. The loop filter operates per super block, filtering first the vertical edges, then the horizontal edges of each superblock. The super blocks are processed in raster order, regardless of any tile structure. This is unlike HEVC, where all vertical edges of the frame are filtered before any horizontal edges. There are 4 different filters:

16-wide, 8 pixels on each side of the edge
8-wide, 4 pixels on each side of the edge
4-wide, 2 pixels on each side of the edge
2-wide, 1 pixel on each side of the edge

Each of these filters requires a threshold level to be met, as set by the frame header: The pixels on either side of the edge should be relatively uniform (smooth), and there must be a distinct difference between the brightness on either side of the edge. If the filter condition is met, then the filter applies, smooths the transition. If not, the next smaller filter is attempted. Blocks with transform size 4x4 or 8x8 do not start with the 16-wide filter.
Before:
http://i.imgur.com/AYYJ1Ez.png
After:
http://i.imgur.com/fnwROzs.png

Segmentation:
VP9 offers an optional feature known as segmentation. When enabled, the bitstream codes a segment ID for each block, which is a number between 0 and 7. Each of these eight segments can have any of the following four features enabled:

Skip – blocks belonging to a segment with the skip feature active are automatically assumed to not have a residual signal. Useful for static background.
Alternate quantizer – blocks belonging to a segment with the AltQ feature may use a different inverse quantization scale factor. Useful for regions that require more (or less) detail than the rest of the picture. It could also be used for rate control.
Ref – blocks belonging to a segment that have the Ref feature enabled are assumed to point to a particular reference frame (Last, Golden, or AltRef), as opposed to the bitstream explicitly transmitting the reference frame as usual.
AltLf - blocks belonging to a segment that have the AltLf feature enabled use a different smoothing strength when getting loop-filtered. This can be useful for smooth areas that would otherwise appear too blocky. They can get more smoothing without having to smooth the entire picture more.
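One way to picture the four features is as optional per-segment overrides of values that would otherwise come from the frame header or from the block's own coded syntax. In this sketch the alternate values simply replace the frame-level ones, which is a simplifying assumption; the field and function names are made up.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class Segment:
    skip: bool = False             # Skip: no residual for these blocks
    alt_q: Optional[int] = None    # AltQ: replacement quantizer value
    ref: Optional[str] = None      # Ref: forced "LAST", "GOLDEN" or "ALTREF"
    alt_lf: Optional[int] = None   # AltLf: replacement loop filter strength


def block_settings(seg: Segment, frame_q: int, frame_lf: int,
                   coded_ref: str, coded_skip: bool) -> Dict[str, Union[int, str, bool]]:
    """Resolve the effective settings for a block in segment `seg`."""
    return {
        "q": frame_q if seg.alt_q is None else seg.alt_q,
        "lf": frame_lf if seg.alt_lf is None else seg.alt_lf,
        "ref": coded_ref if seg.ref is None else seg.ref,
        "skip": coded_skip or seg.skip,
    }
```

A decoder would keep eight such `Segment` records per frame and index them with the segment ID coded for each block.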

pieter3d

8th October 2013, 00:39

Phew - hope that is all clear. Questions & comments welcome!

mandarinka

8th October 2013, 13:56

Thanks, nice write-up!
Interesting to see that this time, they basically used H264-like multiple references and bframes - but they had to use some weird kludges to complicate it enough so that they can't be held liable for copying the (patented) original techniques...

pieter3d

18th October 2013, 01:58

A note to implementers, VP9 contains a bug in the loop filter that must be matched. See: https://code.google.com/p/webm/issues/detail?id=642

mandarinka

18th October 2013, 18:08

And here people thought that the same mistakes that were there with VP8 wouldn't happen anymore.
Well, that's what they get with rushed (and sort-of-like-proprietary) one-party development, I guess.

pieter3d

18th October 2013, 18:09

benwaggoner

21st October 2013, 00:33

One big practical difference versus HEVC seems to be the lack of any equivalent to Wavefront Parallel Processing. If I'm parsing that right, that'll make it hard to do performant multithreaded software decoders. This is exacerbated by the lack of B-frames, which allow frame-level parallelism where non-reference frames get decoded in separate threads.

pieter3d

21st October 2013, 00:41

NikosD

21st October 2013, 19:11

Could you compare AVC, HEVC & VP9 in terms of HW decoding complexity for the same resolution, bitrate in common profiles for each format ?

For example, if AVC 1080p HW decoding complexity is 100, how much is for HEVC & VP9 ?

pieter3d

21st October 2013, 19:14

That's a pretty hard question to answer; depending on modes HEVC & VP9 could be harder or easier than AVC. It also depends on how you measure complexity. Memory usage? Silicon area? CPU load?
For HW the silicon area is higher than AVC due to the usage of larger blocks. In terms of decode time in SW, HEVC seems to be on average ~2x that of AVC, and I suspect VP9 is similar or slightly worse.

NikosD

21st October 2013, 20:19

Die area and CPU utilization were in my mind.
So you answered me both.

Thanks.

benwaggoner

21st October 2013, 22:09

Well in HEVC WPP is optional, which means that even if you create a really great multi threaded decoder, you still cannot guarantee performance since a stream might not have it enabled.
Certainly. But since WPP doesn't have nearly the quality overhead of AVC slices, I can imagine that it might become standard in general encodes.

For my professional use, I generally can tune my encodes to devices and vice versa.

In VP9 tiles are mandatory for picture widths greater than 4096.
Well, I think 8K video is probably farther out than VP10 :).

How do tiles compare in practice to Slices/WPP as a parallelization mechanism? Say, 128x128 tiles?

pieter3d

22nd October 2013, 00:52

Certainly. But since WPP doesn't have nearly the quality overhead of AVC slices, I can imagine that it might become standard in general encodes.

For my professional use, I generally can tune my encodes to devices and vice versa.

Yes, when you are the encoder it is a terrific feature. Also when you control the whole pipe (like if you are Netflix), then it is great too. But when you just need to decode any Main Profile clip, you can't rely on it sadly.

Well, I think 8K video is probably farther out than VP10 :).

How do tiles compare in practice to Slices/WPP as a parallelization mechanism? Say, 128x128 tiles?

Tiles have a coding efficiency impact, since you break the coding dependencies along boundaries, and CABAC adaptation is reset at the start of each tile (same argument goes for slices). If you go to small tiles like 128x128, you will see substantial coding loss due to adaptation being reset many times. Note that 128x128 is actually not allowed by HEVC Main Profile, the smallest allowed tile is 256x64.

I have spoken to some of the HEVC engineers at JCT-VC meetings who have said that in some sequences the distortion introduced by tile boundary independence can actually make the tile boundaries visible (a smart encoder could allocate more bits near the boundaries to compensate though).

HEVC tiles have a major advantage that you do not get with WPP: workload balancing. At the start of each picture, you can resize the tiles, making one wider/taller than another, or adding more. This lets you adapt your encoder threads so that they all have close to 100% duty cycle (they all finish approximately the same time). Also you can add more/less tiles as physical threads free up/get allocated elsewhere due to changing load on a machine.

An advantage of tiles over slices is that they can be more square, so you have better spatial correlation than long thin slices. Of course you also avoid coding the slice header more than once. You can also use tiles to quickly make a 4k encoder using 4 threads (hw or sw) of a 1080p encoder without very many modifications.

WPP makes the CABAC engine start each CTB row with the state from the start of the above CTB row, which really means that the adaptation is more spatially relevant. Normally when you finish one row and start the next, the context variables don't receive any special treatment despite the big jump spatially. With WPP, you can actually see some coding gains there, on top of the multithreading advantage. However, you do have to code entry points for each CTB row in the slice header, which can take up a significant chunk of data at low bitrates, or in pictures with lots of skip CUs.

spacesinger

2nd April 2014, 00:07

Yeah. Another glaring flaw is the fact that MV decode requires fully reconstructed neighboring and co-located MV values, which means the entire MV prediction process is required for entropy decode and cannot be decoupled.

I know VP8 does that. But I searched the whole VP9 reference code and couldn't find any code that requires the neighboring final MVs when decoding syntax.

Could you show me in detail? Thanks a lot.

pieter3d

2nd April 2014, 00:15

Look at how it derives the use_hp bit when reading the MV delta from the bitstream. It uses the predictor, which is derived from final MV values.

spacesinger

2nd April 2014, 02:10

Look at how it derives the use_hp bit when reading the MV delta from the bitstream. It uses the predictor, which is derived from final MV values.

Got it finally... it's in the read_mv function call.
I didn't expect it to be hidden so deeply and used for just the one use_hp bit.
Not hardware friendly at all. I am designing an ASIC for VP9 now.
I thought the final MV would affect the context, but it doesn't.

Truly appreciate :D :D
:thanks:

pieter3d

2nd April 2014, 02:22

Just wait until you start implementing the loop filter

jimwei

23rd July 2015, 08:12

Hi Pieter3d, can you help answer the following questions about VP8 and VP9:

1. For both VP9 and VP8, are the alt-ref frame and the golden frame not for display?
2. For VP9, the encoded data is not packed into partitions the way VP8 does it, i.e. there is only one partition, right?
3. For the SVC mode of VP9, is it correct that the application on the encoder side (the sender, in the case of video communication) decides whether to send all the layers or just the base layer to the decoder (the receiver)?
4. How does the encoder get information about the network conditions? Through RTCP feedback messages?

Thanks!

pieter3d

23rd July 2015, 16:35

1. For both VP9 and VP8, are the alt-ref frame and the golden frame not for display?

Alt-Ref, Golden and Last are just names, nothing more. Think of them as an enumerated type on top of reference numbers 0, 1, 2.

Any frame can be marked as hidden (not for display). In the case of YouTube VP9 streams, the encoder uses the alt-ref label for the hidden frames, golden for some earlier frame, and last for the immediately preceding frame. That is the kind of structure you can get from the libvpx reference encoder, but other encoders may do it differently.

2. For VP9, the encoded data is not packed into partitions the way VP8 does it, i.e. there is only one partition, right?

Correct, a single bitstream per frame.

3. For the SVC mode of VP9, is it correct that the application on the encoder side (the sender, in the case of video communication) decides whether to send all the layers or just the base layer to the decoder (the receiver)?

Yes, the point of SVC is to be able to send fewer bits and still reconstruct a smaller version of the image. So the sender sends as many layers as the receiver is able to handle.

4. How does the encoder get information about the network conditions? Through RTCP feedback messages?

Thanks!
That is one way, but it is up to you! The VP9 standard doesn't specify it, so you may do it by any means appropriate to your application.

jimwei

24th July 2015, 09:43

Alt-Ref, Golden and Last are just names, nothing more. Think of them as an enumerated type on top of reference numbers 0, 1, 2.

Any frame can be marked as hidden (not for display). In the case of YouTube VP9 streams, the encoder uses the alt-ref label for the hidden frames, golden for some earlier frame, and last for the immediately preceding frame. That is the kind of structure you can get from the libvpx reference encoder, but other encoders may do it differently.

So is it correct that the alt-ref frame is not for display, and that the H bit (show_frame bit) in the first octet of the payload header in the first partition of the bitstream must be set to 0 by the application? (Actually, I am not sure whether this payload header is generated by the encoder or should be generated by the application.)

And some more questions :)

1. The decoder should decode an incoming frame with the same reference frame that was used when it was encoded, but how does the decoder know which reference frame that is, given that it maintains 3 reference frames?

2. The decoder maintains 3 frame buffers for the reference frames. When and how does the decoder update these buffers, as far as libvpx is concerned? Does it update all three buffers when a key frame arrives? And does it update the alt-ref frame buffer when it receives a frame with the show_frame bit in the first octet of the payload header in the first partition (the alt-ref frame)? When does it update the golden frame?

3. For VP9, how does the rc_target_bitrate setting work? Does it mean that the encoder can generate different bitstreams with different quality according to this parameter? It must have a minimum value, though; how does the encoder react if the parameter is lower than this minimum?

pieter3d

24th July 2015, 16:15

So is it correct that the alt-ref frame is not for display, and that the H bit (show_frame bit) in the first octet of the payload header in the first partition of the bitstream must be set to 0 by the application? (Actually, I am not sure whether this payload header is generated by the encoder or should be generated by the application.)

Again, the usage of alt-ref for hidden frames is purely an encoder choice, one that the libvpx encoder happens to make. It is not enforced in any way by the VP9 specification. Any frame can be made hidden by setting the show_frame bit to 0.
As far as generating the frame header, this is typically done by software, a driver. Since the frame header format is defined by the VP9 specification, a VP9 compliant encoder must be the one that writes it.

And some more questions:)

1. The decoder should decode an incoming frame with the same reference frame that was used when it was encoded, but how does the decoder know which reference frame that is, given that it maintains 3 reference frames?

Each partition in the frame (each block of 8x8 pixels or larger) gets assigned a reference ID of 0, 1 or 2. These correspond to LAST, GOLDEN, ALTREF. This information is encoded in the compressed bitstream, so that is how a decoder knows which one to use for each block.

2. The decoder maintains 3 frame buffers for the reference frames. When and how does the decoder update these buffers, as far as libvpx is concerned? Does it update all three buffers when a key frame arrives? And does it update the alt-ref frame buffer when it receives a frame with the show_frame bit in the first octet of the payload header in the first partition (the alt-ref frame)? When does it update the golden frame?

There are actually 8 buffers, but any single frame may only use 3 of these. Keyframes necessarily update all buffers, because otherwise you would be dependent on frames prior to the keyframe, which defeats the purpose.
An encoder can pick any, all, or none of the 8 references to update with the current frame. There is an octet of bits, refresh_frame_flags in the frame header, and the encoder uses these to communicate to the decoder which of the 8 buffers should be filled with the new frame.
The choice of which ones to update is free, you can come up with whatever scheme you like. Good compression performance means you need a clever scheme, and it depends heavily on the type of video you are encoding. This kind of thing makes good encoders tricky and complicated to design. Luckily though, a decoder doesn't have to worry about it since once the encoder has made the decision about which frames to update, it simply tells the decoder verbatim.

3. For VP9, how does the rc_target_bitrate setting work? Does it mean that the encoder can generate different bitstreams with different quality according to this parameter? It must have a minimum value, though; how does the encoder react if the parameter is lower than this minimum?

The encoder tries to maintain a particular bitrate by adjusting the quantization strength (i.e. how many bits are thrown away in the transform coefficients). At first it makes a best guess, or uses multiple passes, to achieve that. Then there is a feedback mechanism: it sees how many bits were actually encoded, and if that was not enough, it reduces the quantization for the next frame. If there were too many, it increases the quantization. A few other things come into play here too, to make the bitrate nice and stable and to make sure that the decoded image doesn't noticeably jump around in quality.
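The feedback idea can be reduced to a toy controller like the one below. It is only an illustration of the principle; real rate control, libvpx's included, is far more elaborate, and the step size and quantizer limits here are made-up values.

```python
def next_quantizer(q: int, bits_used: int, bits_target: int,
                   step: int = 2, q_min: int = 0, q_max: int = 255) -> int:
    """Nudge the quantizer after each frame (step and limits hypothetical)."""
    if bits_used > bits_target:
        q += step   # too many bits: quantize more coarsely
    elif bits_used < bits_target:
        q -= step   # too few bits: spend more on quality
    return max(q_min, min(q_max, q))
```

The clamp at `q_max` also hints at the answer to the minimum-bitrate question: once the quantizer cannot go any coarser, an encoder that sticks to this scheme will overshoot the target.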

jimwei

5th August 2015, 16:39

Hi pieter3d, can you help show how the decoder knows whether the current frame is a reference frame or not, and which kind of reference frame it is? Thanks in advance.

pieter3d

5th August 2015, 16:50

For any given frame, the decoder cannot know how it will be used in the future. However, it does know how the current frame should be handled in the reference pool of 8 buffers. In the current frame's header there is a set of 8 flags, refresh_frame_flags[]. For each flag that is 1, the current picture will be inserted into that reference pool slot after decode is complete.

jimwei

5th August 2015, 17:02

Thanks, that is the case for VP9, but there is no such flag for VP8. And actually, my question is about VP8. :)

pieter3d

5th August 2015, 17:20

For vp8 it is similar, look for the refresh_golden_frame, refresh_alt_ref_frame and copy_buffer_to_arf syntax elements in the header.

Shevach

14th April 2017, 10:39

The lack of start codes in VP9 makes error resynchronization challenging (barely possible). Consequently, the error resilience of HEVC is inherently better than that of VP9.

Shevach

15th April 2017, 09:33

I wonder if there are plans in the MPEG committee to add support for VP9 to the MPEG file format (MP4). How should the stsd box be specified if a VP9 video stream is present in the mdat box?
Generally speaking encapsulation of VP9 elementary stream into mp4 container would require a full parsing of the stream in order to specify frame boundaries and then to populate stsz and other boxes in metadata.

sneaker_ger

15th April 2017, 10:01

There are efforts to do this.
https://www.webmproject.org/vp9/mp4/

nevcairiel

15th April 2017, 10:05

Frame boundaries are required by all containers that support vp9, like webm/matroska. "raw" elementary streams are not typically used for vp9, ie. the most raw format it typically goes as is the simple ivf container, which still holds frame sizes/boundaries and timestamps.

Shevach

19th April 2017, 10:19

@pieter3d
you wrote:
"Another glaring flaw is the fact that MV decode requires fully reconstructed neighboring and co-located MV values, which means the entire MV prediction process is required for entropy decode and cannot be decoupled."

I'm afraid I don't understand this point, especially "MV decode requires fully reconstructed neighboring".
As far as I can tell from the recent version of the VP9 spec, there is a function 'assign_mv' which in turn calls 'read_mv' and uses the results of other functions. However, the process of MV derivation is similar (in spirit) to that of HEVC. Why are 'reconstructed residuals' needed in the MV derivation process? Could you elaborate?
According to the spec, the 'assign_mv' function exploits only surrounding MVs, in a manner similar to AVC/HEVC. Where is the flaw relative to HEVC/AVC here? I don't see it.
Anyway, i'll ask Pieter to elaborate this point.

Shevach

20th April 2017, 16:50

i see three flaws VP9 which deteriorate error resilience: lack of start-codes, lack of slices and non-adaptivity of probabilities within a frame.

When arithmetic coding is used, error-detection latency is long (error-detection latency is the distance in MBs/CTUs/superblocks between the place where a bitstream error occurs and the place where it is detected). For example, a bit flip can occur at the start of a frame but only be detected at the end, and as a result the entire frame is corrupted. If the error is detected in the middle of the frame, then the second half can be filled with co-located MBs/CTUs/superblocks.

In HEVC/AVC, a bitstream error is necessarily detected either at the end of a frame, when the next start code is sensed, or when the number of CTUs/MBs exceeds the expected amount (according to the resolution).
In VP9 (due to the lack of start codes) a bitstream error may only be detected in the middle of the next frame, or of the frame after that, in which case two or more frames are corrupted (it's worth mentioning that in HEVC, in the worst case, a single frame is corrupted).

Division into slices is extremely useful for error resilience, since the corrupted area is limited to the size of a slice (notice that a bitstream error is inevitably detected before the start code of the next slice). Consequently, in the worst case a single slice is corrupted and not the whole frame.

In the error-resilience mode of VP9, each frame is coded with a default set of fixed probabilities (or context models, in HEVC/AVC jargon). A VP9 encoder can't take probabilities from the previous frame, since the previous frame may be corrupted before it arrives at the decoder. Consequently all frames are coded with fixed probabilities. If the actual probabilities are close to the default ones then OK ('sababa'). If not, penalty bits are produced (the penalty can be assessed via the Kullback–Leibler divergence), and so VP9 entropy coding is inefficient in this mode. In HEVC/AVC, even if the default probabilities differ strongly from the actual ones, CABAC quickly adapts to the actual statistics and the coding approaches optimal.
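
As a rough illustration of that penalty, here is a small Python sketch for a single binary symbol. It assumes an ideal arithmetic coder, and the function names and the probabilities used below are made up for the example, not taken from the VP9 default tables.

```python
import math


def entropy_bits(p: float) -> float:
    """Bits per binary symbol when coding with the true probability p (0 < p < 1)."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def kl_penalty_bits(p_actual: float, q_assumed: float) -> float:
    """Extra bits per symbol paid for coding with q_assumed instead of p_actual.

    This is the Kullback-Leibler divergence D(p || q) for a binary source;
    both arguments must lie strictly between 0 and 1.
    """
    return (p_actual * math.log2(p_actual / q_assumed)
            + (1 - p_actual) * math.log2((1 - p_actual) / (1 - q_assumed)))


# Hypothetical case: the symbol is really 0 with probability 0.9,
# but the fixed default table assumes 0.5.
ideal = entropy_bits(0.9)
penalty = kl_penalty_bits(0.9, 0.5)
print(f"ideal: {ideal:.3f} bits/symbol, penalty: {penalty:.3f} bits/symbol")
```

When the assumed probability matches the actual one the penalty is zero, which is the 'sababa' case above.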

FancyMouse

26th April 2017, 23:42

@pieter3d
I'm afraid I don't understand this point, especially "MV decode requires fully reconstructed neighboring".
As far as I know from the recent version of the VP9 spec, there is a function 'assign_mv' which in turn calls 'read_mv' and uses the results of other functions. However, the process of MV derivation is similar (in spirit) to that of HEVC. Why are 'reconstructed residuals' needed in the MV derivation process? Could you elaborate?
According to the spec, the 'assign_mv' function exploits only surrounding MVs, in a manner similar to AVC/HEVC. Where is the flaw relative to HEVC/AVC here? I don't see it.

Disclaimer: I've not read VP9 spec/code.
I believe VP9's nature of "using code as spec" is to blame, since a code bug becomes part of the spec as well. Even though the intention is good, a code bug might destroy it. See the first paragraph of the OP: it would mean that the reference decoder at the time of the bitstream freeze is the gold standard. They might fix it in a newer document, but then the newer one should be called VP9.1 or something like that, otherwise existing shipped VP9 decoders might break.

Shevach

2nd July 2017, 16:16

Let me share a white paper "Choosing of the Right Codec: Comparing HEVC & VP9"
https://drive.google.com/file/d/0B7a5Sr0jgK2ISTg5NkpRN2pVZjQ/view

This paper provides a qualitative analysis of the key features of HEVC and VP9. I am a co-author of this paper, so feel free to ask me.
I know the paper is general and lacks details and figures (numbers saying which feature gives what gain are absent). The original paper contains everything, but the published article is a strictly censored version.
Anyway, I appreciate Beamr Imaging's decision to share even a censored version of the results.

MasterNobody

2nd July 2017, 18:43

Shevach

3rd July 2017, 15:02

Shevach
Why does the paper give VP9 the advantage in the Segmentation part? IMHO 8 segments are too coarse compared to AVC/HEVC's fine-grained (per MB/CU or smaller) quantizer values, so it is a disadvantage (as x264 showed, AQ is a very important feature).

@MasterNobody
The benefits of Segmentation depend on content (animation with a large fixed background) and bitrate (low bitrate).
As for x264 AQ, the quantizer fluctuates within a frame according to "block variance" (or another HVS metric), therefore QP is not a good choice for Segmentation.
On the other hand, I am familiar with commercial rate controls where the quantizer is fixed within the entire frame (changed only across frames), or where the quantizer per CTU/MB depends completely on a virtual buffer (strict CBR mode, used in statistical multiplexing).
Other parameters, such as the reference frame and the loop filter strength, are sufficiently well correlated temporally and spatially.

MasterNobody

3rd July 2017, 18:21

Shevach

6th July 2017, 11:49

That doesn't answer the question of why coarse and limited (only 8) segmentation is better than fine-grained per-CU adaptivity of such params (the reference frame can also be changed per MB, and while you can't change loop filter strength directly, it depends on the QP value). I.e. what does it give that AVC/HEVC can't do by using other, more fine-grained adaptivity?

Segmentation is just an optional mode in VP9. I agree that for typical real video 'fine grained block adaptivity' is better than Segmentation. However, there are scenarios where Segmentation gives a good gain in coding efficiency. In such scenarios a smart VP9 encoder can exploit Segmentation while HEVC can't.
I'll provide a simple example: let's suppose that in a frame all reference indexes are equal to zero (i.e. all blocks refer to the previous frame). In HEVC an encoder would transmit 1 bin per CU to signal ref_idx=0. Roughly speaking, thanks to CABAC, 1 bin is compressed to about 1/6 of a bit. Hence the number of bits spent on signaling reference indexes is #CUs * 1/6.
In VP9 this number is almost zero (there is a small overhead in the picture header).
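
To put a number on that estimate, here is a back-of-the-envelope Python sketch. The 1920x1080 frame and the uniform 16x16 CU size are assumptions made purely for illustration (real CU sizes vary per block), and the "6 bins per bit" figure is just the rough ratio quoted above.

```python
import math


def count_cus(width: int, height: int, cu_size: int) -> int:
    """Number of CUs if the frame were tiled uniformly with cu_size x cu_size CUs."""
    return math.ceil(width / cu_size) * math.ceil(height / cu_size)


def ref_idx_bits(num_cus: int, bins_per_bit: float = 6.0) -> float:
    """Estimated bits spent on ref_idx with one bin per CU and ~6 bins per bit."""
    return num_cus / bins_per_bit


# Hypothetical 1080p frame tiled with 16x16 CUs.
cus = count_cus(1920, 1080, 16)
bits = ref_idx_bits(cus)
print(f"{cus} CUs -> about {bits:.0f} bits ({bits / 8:.0f} bytes) per frame")
```

Under these assumptions the cost is on the order of a couple of hundred bytes per frame, which is the overhead Segmentation would avoid in this scenario.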

Beelzebubu

26th July 2017, 21:21

I started reading the document, and I hope any Dutch person here can appreciate my reference to WC eend (https://www.youtube.com/watch?v=rqE89Y0voxs). I'll make my point by just responding to a single section.

Bi-prediction means that two references are used simultaneously to create motion predictions, and this feature is supported both in AVC and HEVC. VP9 doesn’t support this feature (possibly due to IP issues), and instead supports a workaround called “compound prediction”. In this method, a first predicted frame is a hidden, non displayable frame created using bi-directional prediction, and another frame which is essentially a skipped frame “copies” the first frame. The two frames together constitute a ‘superframe’. This still adds some overhead, giving HEVC an advantage in coding efficiency over VP9 when this mode is used.

The above is totally bogus.

The VP9 bitstream can signal up to 3 active references, and each reference is then assigned a sign bias bit (0 or 1). The idea here is similar to h264/hevc frames having two reference lists: l0 and l1. Compound prediction happens between reference frames of different sign bias (but not frames of identical sign bias). This is similar to h264/hevc, where bidirectional prediction happens between a l0+l1 reference (but not two l0s or two l1s). As a result, just like for hevc/h264, each frame can use bidirectional prediction, depending on the reference list setup in the frame header. One pretty fundamental issue here is the limit on the number of active references, which you oddly didn't mention at all in this section. (You did mention it further down, but then failed to acknowledge that the memory issues you mentioned have been addressed in the VP9 levels (https://www.webmproject.org/vp9/levels/).) The reference to IP issues is unsubstantiated.
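
The pairing rule described above can be sketched in a few lines of Python. This only models the rule as stated here (a compound pair needs two references with different sign bias); the function name and the sign bias values in the example are made up for illustration, and this is not how libvpx structures the logic.

```python
from itertools import combinations


def compound_pairs(sign_bias: dict) -> list:
    """Return reference pairs that may be combined, i.e. pairs whose sign bias differs."""
    return [(a, b) for a, b in combinations(sign_bias, 2)
            if sign_bias[a] != sign_bias[b]]


# Hypothetical frame header setup: two references with bias 0, one with bias 1.
refs = {"LAST": 0, "GOLDEN": 0, "ALTREF": 1}
print(compound_pairs(refs))

# If every reference has the same sign bias, no compound pair is available.
print(compound_pairs({"LAST": 0, "GOLDEN": 0, "ALTREF": 0}))
```

With the first setup, LAST and GOLDEN can each be paired with ALTREF but not with each other, which mirrors the l0+l1 restriction in h264/hevc.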

The remainder of the quote talks about invisible/overlay frames, which are frame reordering techniques, i.e. frame-level tools, that have nothing to do with prediction types at the block level. An invisible frame in VP9 is conceptually the same as an out-of-order coded frame (which is coded, but not yet displayed, and thus not yet visible, a.k.a. invisible) in hevc/h264. The signaling is indeed slightly different. In hevc/h264, you would signal an invisible (out-of-order) frame by having the poc being ahead of the next expected poc, which means the decoder needs to delay its output. In vp9, you signal this by marking the frame as invisible. Later on, using reordering based on poc (in hevc/h264) or the direct-reference single-byte packet (in VP9), these not-yet-displayed (a.k.a. invisible) frames are made visible. The suggestion that the 1 byte overhead of this signaling would be significant, is crazy. There is also a reference to overlay frames, which are a libvpx-specific thing that are otherwise unrelated to the VP9 bitstream. One could make well-founded points on the overlay frames and ARNR (the two go hand-in-hand), and how they may result in more PSNR gains than visual gains, but you didn't mention this at all.

elinzer1

13th May 2018, 18:52

Great post. Do you have a similar one for AV1?