#1 | pieter3d | 8th October 2013, 00:34
How VP9 works, technical details & diagrams

Hi all,

After my last post on HEVC got some interest, I’ve decided to write one on VP9, Google’s new video coding standard. The bitstream was frozen in July 2013, paving the way for hardware engineers (like myself) to be confident they don’t have to worry about design changes. There is no official spec document (yet...), so the following is all based on how the reference decoder works. As with the HEVC overview, I will gloss over a few low level details, but feel free to ask for any elaborations!

As of today VP9 only supports YUV 4:2:0 (full chroma subsampling). There are provisions in the header for future formats (RGB, alpha) and less chroma subsampling, but as of today it is YUV 4:2:0 only. There is also no support for field coding; VP9 is progressive-only.

Picture partitioning:
VP9 divides the picture into 64x64-sized blocks called super blocks (SB for short). Super blocks are processed in raster order: left to right, top to bottom, the same as in most other codecs. Super blocks can be recursively subdivided, all the way down to 4x4. The subdivision is done with a recursive quadtree just like HEVC, but unlike HEVC, a subdivision can also be horizontal-only or vertical-only; in those cases the subdivision stops. Although 4x4 is the smallest "partition", lots of information is stored only at 8x8 granularity, in an MI (mode info) unit. This causes blocks smaller than 8x8 to be handled as something of a special case; for example, a pair of 4x8 intra coded blocks is treated like an 8x8 block with two intra modes. [Diagram: example partitioning of a super block]
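To make the recursion concrete, here is a minimal sketch of how a decoder might walk the partition tree (hypothetical names, and I'm omitting the handling of blocks that straddle the frame edge):

[code]
/* Hypothetical decoder plumbing; reading a partition symbol and decoding
 * one leaf block are assumed to exist elsewhere. */
typedef struct Decoder Decoder;
typedef enum { PARTITION_NONE, PARTITION_HORZ,
               PARTITION_VERT, PARTITION_SPLIT } Partition;

extern Partition read_partition_symbol(Decoder *d, int x, int y, int size);
extern void decode_block(Decoder *d, int x, int y, int w, int h);

/* Walk the partition tree of one size x size block (64 at the top level).
 * HORZ and VERT splits terminate the recursion; SPLIT recurses. */
void parse_partition(Decoder *d, int x, int y, int size)
{
    Partition p = (size > 4) ? read_partition_symbol(d, x, y, size)
                             : PARTITION_NONE;   /* 4x4 cannot split further */
    switch (p) {
    case PARTITION_NONE:
        decode_block(d, x, y, size, size);
        break;
    case PARTITION_HORZ:                         /* two size x size/2 halves */
        decode_block(d, x, y,            size, size / 2);
        decode_block(d, x, y + size / 2, size, size / 2);
        break;
    case PARTITION_VERT:                         /* two size/2 x size halves */
        decode_block(d, x,            y, size / 2, size);
        decode_block(d, x + size / 2, y, size / 2, size);
        break;
    case PARTITION_SPLIT:                        /* recurse into 4 quadrants */
        parse_partition(d, x,            y,            size / 2);
        parse_partition(d, x + size / 2, y,            size / 2);
        parse_partition(d, x,            y + size / 2, size / 2);
        parse_partition(d, x + size / 2, y + size / 2, size / 2);
        break;
    }
}
[/code]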

Unlike AVC or HEVC, there is no such thing as a slice. Once the data for a frame begins, you get it all until the frame is complete.
VP9 also supports tiles, where the picture is broken up into a grid of tiles along super block boundaries. Unlike HEVC, these tiles are always as evenly spaced as possible, and there is a power-of-two number of them. Tiles must be at least 256 pixels wide and no more than 4096 pixels wide, and there can be no more than four tile rows. The tiles are scanned in raster order, as are the super blocks within them, so the ordering of super blocks within the picture depends on the tile structure. Coding dependencies are broken along vertical tile boundaries, which means that two tiles in the same tile row may be decoded at the same time, a great feature for multi-core decoders. Unlike HEVC, coding dependencies are not broken across horizontal tile boundaries, so a frame split into 2x2 tiles can be decoded with 2 threads, but not with 4.
At the start of every tile except the last, a 32-bit byte count is transmitted, indicating how many bytes are used to code the tile that follows. This lets a multithreaded decoder skip ahead to the start of any tile and assign it to its own decoding thread.
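As a sketch, a dispatcher could carve the frame data into per-tile buffers using those byte counts before any entropy decoding happens (hypothetical names; I believe the reference code stores the counts big-endian, but verify against libvpx):

[code]
#include <stddef.h>
#include <stdint.h>

typedef struct { const uint8_t *data; size_t size; } TileBuf;

/* Split [p, end) into n_tiles buffers. Every tile except the last is
 * preceded by a 32-bit byte count; the last tile runs to the end. */
size_t split_tiles(const uint8_t *p, const uint8_t *end,
                   TileBuf *tiles, size_t n_tiles)
{
    for (size_t i = 0; i + 1 < n_tiles; i++) {
        if (end - p < 4) return i;                      /* malformed stream */
        uint32_t sz = (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
                      (uint32_t)p[2] << 8  | (uint32_t)p[3];
        p += 4;
        if (sz > (size_t)(end - p)) return i;
        tiles[i] = (TileBuf){ p, sz };   /* hand this tile to its own thread */
        p += sz;
    }
    tiles[n_tiles - 1] = (TileBuf){ p, (size_t)(end - p) }; /* no size field */
    return n_tiles;
}
[/code]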

Bitstream coding:
The VP9 bitstream generated by the reference code from Google is containerized with either IVF or WebM. IVF is extremely simple, and WebM is essentially a subset of MKV. If no container is used at all, it is impossible to seek to a particular frame without doing a full decode of all preceding frames, due to the lack of start codes as found in AVC/HEVC Annex B streams.
All VP9 bitstreams start with a keyframe, where every block is intra coded and the internal state of the decoder is reset. A decoder must start at a keyframe, after which it can decode any number of inter frames, which use previous frames as reference data.
Like its predecessor VP8, VP9 compresses the bitstream using an 8-bit arithmetic coding engine known as the bool-coder. The probability model is fixed for the whole frame; all probabilities are known before decode of the frame data begins (it does not adapt mid-frame like AVC/HEVC's CABAC). Probabilities are one byte each, and there are 1783 of them for keyframes and 1902 for inter frames (by my count). Each probability has a known default value.
These probabilities are stored in what is known as a frame context. The decoder maintains four of these contexts, and the bitstream signals which one to use for the frame decode.
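For the curious, a single bool-coder read is tiny. Here is a sketch following the structure of the VP8 decoder in RFC 6386, which VP9 inherits largely unchanged (struct and function names are mine). Each probability is the chance of decoding a 0, in units of 1/256:

[code]
#include <stdint.h>

typedef struct {
    const uint8_t *buf, *end;  /* remaining compressed bytes */
    uint32_t value;            /* current decoding window */
    uint32_t range;            /* live interval size, kept in [128, 255] */
    int bit_count;             /* bits consumed from the low byte */
} BoolDecoder;

/* Decode one bool; prob/256 is the probability that the bit is 0. */
int bool_read(BoolDecoder *d, uint8_t prob)
{
    uint32_t split = 1 + (((d->range - 1) * prob) >> 8);
    uint32_t big_split = split << 8;     /* align with the 16-bit window */
    int bit;

    if (d->value >= big_split) {         /* value lies in the "1" interval */
        bit = 1;
        d->range -= split;
        d->value -= big_split;
    } else {                             /* value lies in the "0" interval */
        bit = 0;
        d->range = split;
    }
    /* Renormalize: double range until it is >= 128, pulling fresh bytes
     * into the window as it drains. */
    while (d->range < 128) {
        d->value <<= 1;
        d->range <<= 1;
        if (++d->bit_count == 8) {
            d->bit_count = 0;
            if (d->buf < d->end)
                d->value |= *d->buf++;
        }
    }
    return bit;
}
[/code]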
Each frame is coded in three sections as follows:
  • Uncompressed header: only a dozen or so bytes, containing things like the picture size, loop filter strength, etc.
  • Compressed header: a bool-coded section that transmits the probabilities used for the whole frame. They are transmitted as deviations from their default values.
  • Compressed frame data: bool-coded data containing everything needed to reconstruct the frame, including block partition sizes, motion vectors, intra modes and transform coefficients.
Note that unlike VP8, there is no data partitioning: all data types are interleaved in super block coding order. This is a design choice that makes life easier for hardware designers.
After the frame is decoded, the probabilities can optionally be adapted: based on the occurrence counts of each particular symbol during the frame decode, new probabilities are derived, stored in the frame context buffer, and possibly used for a future frame.
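Roughly speaking, each adapted probability is a blend of the stored value and the empirical rate observed during the frame. An illustrative sketch only; the exact libvpx update rule and its weights differ:

[code]
#include <stdint.h>

/* count0/count1: how often this symbol decoded to 0/1 during the frame. */
uint8_t adapt_prob(uint8_t pre_prob, unsigned count0, unsigned count1)
{
    unsigned total = count0 + count1;
    if (total == 0)
        return pre_prob;             /* symbol never coded: keep as-is */

    /* empirical probability of a 0, in 1/256 units, clamped to [1, 255] */
    unsigned emp = (255 * count0 + total / 2) / total + 1;
    if (emp > 255) emp = 255;

    /* blend: keep 3/4 of the old value, take 1/4 of the new evidence */
    return (uint8_t)((3 * pre_prob + emp + 2) >> 2);
}
[/code]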

Residual coding:
Unless a block codes (or infers) a skip flag, a residual signal is transmitted for each block. The skip flag applies at 8x8 granularity, so for block splits below 8x8 the skip flag applies to all blocks within the 8x8. Like HEVC, VP9 supports four transform sizes: 32x32, 16x16, 8x8 and 4x4. Like in most other coding standards, these transforms are integer approximations of the DCT. For intra coded blocks, either or both of the vertical and horizontal transform passes can be a DST (discrete sine transform) instead; this has to do with the specific characteristics of the residual signal of intra blocks.
The bitstream codes the transform size used in each block. For example, if a 32x16 block specifies an 8x8 transform, the luma residual data consists of a 4x2 grid of 8x8s, and the two 16x8 chroma residuals each consist of a 2x1 grid of 8x8s. If the luma transform size does not fit in chroma, it is reduced accordingly (e.g. a 16x16 block with a 16x16 luma transform uses 8x8 transforms for chroma), as in the sketch below.
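A sketch of that clamping rule, assuming 4:2:0 so chroma is half the luma size in each dimension (names are mine):

[code]
typedef enum { TX_4X4, TX_8X8, TX_16X16, TX_32X32 } TxSize;

static TxSize largest_tx_fitting(int w, int h)
{
    int m = w < h ? w : h;          /* the limiting dimension, in pixels */
    if (m >= 32) return TX_32X32;
    if (m >= 16) return TX_16X16;
    if (m >= 8)  return TX_8X8;
    return TX_4X4;
}

/* bw, bh: luma block size in pixels. 4:2:0 halves each dimension. */
TxSize chroma_tx_size(TxSize luma_tx, int bw, int bh)
{
    TxSize max_chroma = largest_tx_fitting(bw / 2, bh / 2);
    return luma_tx < max_chroma ? luma_tx : max_chroma;
}
[/code]

For the 16x16 example above: the chroma block is 8x8, so a 16x16 luma transform clamps to TX_8X8.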
Transform coefficients are scanned starting at the DC position (upper left), following a semi-random curved pattern towards the higher frequencies. Transform blocks with mixed DCT/DST use a scan pattern that is skewed accordingly.

Unlike the diagonal or zigzag scans of other codecs, this coefficient ordering cannot easily be computed on the fly, so the pattern has to be stored as a lookup table (larger silicon area).
Each coefficient is read from the bitstream using the bool-coder and several probabilities. The probabilities required are chosen depending on various parameters such as position in the block, size of the transform block, value of neighboring coefficients etc. The large amount of permutations of these parameters is the reason why the bool-coder’s probability model contains so many probabilities.
Inverse quantization in VP9 is very simple: just a multiplication by a number that is fixed for the entire frame (i.e. no block-level QP adjustment like HEVC/AVC). There are four of these scaling factors (a dequantization sketch follows the list):
  • Luma DC (first coefficient)
  • Luma AC (all other coefficients)
  • Chroma DC (first coefficient)
  • Chroma AC (all other coefficients)
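Which makes inverse quantization about as simple as it sounds. A sketch (hypothetical names; the real decoder derives the four factors from quantizer indices in the frame header):

[code]
#include <stdint.h>

/* Dequantize one transform block: coefficient 0 (DC) uses the DC factor,
 * every other coefficient uses the AC factor. Call with the luma or
 * chroma factor pair as appropriate. */
void dequantize_block(int32_t *out, const int16_t *in, int n,
                      int dc_q, int ac_q)
{
    out[0] = in[0] * dc_q;
    for (int i = 1; i < n; i++)
        out[i] = in[i] * ac_q;
}
[/code]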
VP9 offers a lossless coding mode in which all transform blocks are 4x4, there is no inverse quantization, and the transform is always a special one known as the 4x4 Walsh-Hadamard transform. This lossless mode is either on or off for the entire frame.

Intra prediction:
Intra prediction in VP9 is similar to AVC/HEVC intra prediction and follows the transform block partitions, so intra prediction operations are always square. For example, a 16x8 block with 8x8 transforms results in two 8x8 luma prediction operations.

There are 10 different prediction modes, 8 of which are directional. Like other codecs, intra prediction requires two 1D arrays containing the reconstructed left and above pixels of the neighboring blocks. The left array is the same height as the current block, and the above array is twice the current block's width. However, for intra blocks larger than 4x4, the second half of the above array is simply extended from the last pixel of the first half. [Diagram: above array contents; note the repeated last value, 80, in the extension]
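A sketch of assembling that above array for an NxN block with N > 4 (hypothetical names):

[code]
#include <stdint.h>
#include <string.h>

/* above[] receives 2*n entries: n real reconstructed pixels from the row
 * above, then n copies of the last real pixel. */
void build_above_array(uint8_t *above, const uint8_t *recon_above, int n)
{
    memcpy(above, recon_above, n);             /* real neighboring pixels */
    memset(above + n, recon_above[n - 1], n);  /* extend from last pixel */
}
[/code]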

- End of part 1 -

#2 | pieter3d | 8th October 2013, 00:38
- Part 2 -

Inter prediction:
Inter prediction in VP9 uses 1/8th-pel motion compensation, twice the precision of most other standards. In most cases motion compensation is unidirectional: one motion vector per block, no bi-prediction. However, VP9 does support "compound prediction", which is really just another word for bi-prediction: two motion vectors per block, with the two resulting predictions averaged together. In order to avoid patents on bi-prediction, compound prediction is only enabled in frames that are marked as not displayable. Such a frame is never output for display, but may be used for reference later. In fact, a later frame may consist of nothing but 64x64 blocks with no residuals and (0,0) motion vectors pointing at this non-displayed frame, effectively causing it to be output later using very little data.
Non-displayed frames cause a bit of a problem when putting the VP9 bitstream in a container, since each "frame" from the container should result in a displayable frame. To resolve this, VP9 introduces the concept of a super-frame: one or more non-displayable frames and one displayable frame strung together as a single chunk of data in the container. Thus the decoder still outputs one frame per chunk, and the internal references are updated by the non-displayable frames.
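For reference, the super-frame index sits at the very end of the chunk and can be parsed without touching the bool-coder: a marker byte (0b110 in the top three bits, plus the size-field width and frame count), the per-frame sizes in little-endian order, and the same marker byte again. A sketch based on my reading of the format; verify against the bitstream guide:

[code]
#include <stddef.h>
#include <stdint.h>

/* Returns 1 and fills sizes[]/n_frames if buf ends in a superframe index,
 * 0 if the chunk is just a single frame. */
int parse_superframe_index(const uint8_t *buf, size_t len,
                           uint32_t sizes[8], int *n_frames)
{
    if (len == 0) return 0;
    uint8_t marker = buf[len - 1];
    if ((marker & 0xe0) != 0xc0) return 0;        /* not a superframe */

    int frames = (marker & 0x07) + 1;             /* up to 8 frames */
    int mag    = ((marker >> 3) & 0x03) + 1;      /* bytes per size field */
    size_t index_sz = 2 + (size_t)mag * frames;
    if (len < index_sz || buf[len - index_sz] != marker)
        return 0;                                 /* markers must match */

    const uint8_t *p = buf + len - index_sz + 1;
    for (int i = 0; i < frames; i++) {            /* little-endian sizes */
        uint32_t sz = 0;
        for (int b = 0; b < mag; b++)
            sz |= (uint32_t)*p++ << (8 * b);
        sizes[i] = sz;
    }
    *n_frames = frames;
    return 1;
}
[/code]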

Back to motion compensation. As mentioned, the motion compensation filter is 1/8th pixel accurate. Additionally, VP9 offers a clever new feature where each block can pick one of 3 motion compensation filters:
  • Normal 8th pel
  • Smooth 8th pel, which lightly smooths (blurs) the prediction
  • Sharp 8th pel, which lightly sharpens the prediction.
The motion vector points into one of three possible reference frames (see reference frame management below), known as Last, Golden, or AltRef. These are merely names and don't indicate anything else; Last doesn't have to be the last frame, although with streams from the reference encoder it typically is. The reference frame choice applies at 8x8 granularity, so, for example, two 4x8 blocks, each with its own MV, always point to the same reference frame.

Motion vector prediction in VP9 is quite involved. Like HEVC, a 2-entry list of predictors is built up, using up to 8 of the surrounding blocks that use the same reference picture, followed by a temporal predictor (the motion vector at the same location in the previous frame). If this search does not fill the list, the surrounding blocks are searched again, this time without requiring the reference frame to match. Finally, if the list is still not full, (0,0) vectors are inferred.
The block codes one of four motion vector modes (resolved as in the sketch after this list):
  • New MV – use the first entry of the prediction list plus a delta MV transmitted in the bitstream
  • Nearest MV – use the first entry of the prediction list as is, no delta
  • Near MV – use the second entry of the prediction list as is, no delta
  • Zero MV – simply use (0,0) as the MV value.
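Resolving the four modes is then trivial. A sketch with hypothetical types; pred[0] and pred[1] are the two list entries built by the search above:

[code]
#include <stdint.h>

typedef struct { int16_t row, col; } MV;
typedef enum { NEWMV, NEARESTMV, NEARMV, ZEROMV } MvMode;

/* delta is only read from the bitstream for NEWMV. */
MV resolve_mv(MvMode mode, const MV pred[2], MV delta)
{
    switch (mode) {
    case NEWMV:     return (MV){ (int16_t)(pred[0].row + delta.row),
                                 (int16_t)(pred[0].col + delta.col) };
    case NEARESTMV: return pred[0];
    case NEARMV:    return pred[1];
    case ZEROMV:
    default:        return (MV){ 0, 0 };
    }
}
[/code]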

Reference frame management:
A decoder maintains a pool of 8 reference pictures at all times. Each frame picks 3 of these to use for inter prediction (known as Last, Golden, and AltRef) and can afterwards insert itself into any, all, or none of the 8 slots, evicting whatever frame was there before.
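A sketch of the slot update (hypothetical names; the refresh bits and the three slot indices for Last/Golden/AltRef come from the frame header):

[code]
#include <stdint.h>

#define POOL_SIZE 8

typedef struct Frame Frame;
typedef struct { Frame *pool[POOL_SIZE]; } RefState;

/* After a frame is decoded it inserts itself into any subset of the
 * 8 slots: one bit per slot in refresh_mask. */
void refresh_references(RefState *s, Frame *decoded, uint8_t refresh_mask)
{
    for (int i = 0; i < POOL_SIZE; i++)
        if (refresh_mask & (1u << i))
            s->pool[i] = decoded;    /* evict whatever was there before */
}
[/code]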
A new and exciting feature of VP9 is reference frame scaling. Each new inter frame can be coded at a different resolution than the previous frame, and when creating inter predictions the reference data is scaled up or down accordingly. The scaling filters are 8-tap and 1/16th-pel accurate. This feature allows quick, seamless on-the-fly bitrate adjustment, for example during video conferencing. A very elegant solution, and much less complicated than SVC, for example.

Loop Filter:
Like AVC and HEVC, VP9 specifies a loop filter that is applied to the whole picture after it has been decoded, attempting to clean up the blocky artifacts that can occur. The loop filter operates per super block, filtering first the vertical edges, then the horizontal edges of each super block. The super blocks are processed in raster order, regardless of any tile structure. This is unlike HEVC, where all vertical edges of the frame are filtered before any horizontal edges. There are 4 different filters:
  • 16-wide, 8 pixels on each side of the edge
  • 8-wide, 4 pixels on each side of the edge
  • 4-wide, 2 pixels on each side of the edge
  • 2-wide, 1 pixel on each side of the edge
Each of these filters requires threshold conditions to be met, as set by the frame header: the pixels on either side of the edge should be relatively uniform (smooth), and there must be a distinct difference in brightness across the edge. If the conditions are met, the filter is applied, smoothing the transition; if not, the next smaller filter is attempted. Blocks with transform size 4x4 or 8x8 do not start with the 16-wide filter.
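A sketch of that fallback cascade; the flatness test here is entirely made up, and the real conditions and taps are more involved:

[code]
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flatness test: pixels near the edge must be smooth. */
static int edge_is_flat_enough(const uint8_t *p, int w, int thresh)
{
    for (int i = 1; i < w; i++)
        if (abs(p[i] - p[i - 1]) > thresh)
            return 0;
    return 1;
}

/* Try the widest filter allowed for this edge (limited by the transform
 * size), falling back to narrower ones when the test fails. */
int pick_filter_width(const uint8_t *edge, int max_width, int thresh)
{
    for (int w = max_width; w >= 2; w /= 2)
        if (edge_is_flat_enough(edge, w, thresh))
            return w;               /* apply the w-wide filter */
    return 0;                       /* leave this edge unfiltered */
}
[/code]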
[Images: the same frame before and after loop filtering]

Segmentation:
VP9 offers an optional feature known as segmentation. When enabled, the bitstream codes a segment ID for each block, a number between 0 and 7. Each of these eight segments can have any of the following four features enabled:
  • Skip – blocks belonging to a segment with the Skip feature active are assumed to have no residual signal. Useful for static backgrounds.
  • Alternate quantizer (AltQ) – blocks belonging to a segment with the AltQ feature use a different inverse quantization scale factor. Useful for regions that require more (or less) detail than the rest of the picture, or for rate control.
  • Ref – blocks belonging to a segment with the Ref feature enabled are assumed to point to a particular reference frame (Last, Golden, or AltRef), instead of the bitstream explicitly transmitting the reference frame as usual.
  • AltLf – blocks belonging to a segment with the AltLf feature enabled use a different smoothing strength when loop-filtered. This can be useful for smooth areas that would otherwise appear too blocky: they can get more smoothing without the entire picture being smoothed more.

#3 | pieter3d | 8th October 2013, 00:39
Phew - hope that is all clear. Questions & comments welcome!
#4 | mandarinka | 8th October 2013, 13:56
Thanks, nice write-up!
Interesting to see that this time they basically used H.264-like multiple references and B-frames - but they had to use some weird kludges to complicate it enough that they can't be held liable for copying the (patented) original techniques...

#5 | pieter3d | 18th October 2013, 01:58
A note to implementers: VP9 contains a bug in the loop filter that must be matched. See: https://code.google.com/p/webm/issues/detail?id=642

#6 | mandarinka | 18th October 2013, 18:08
And here people thought that the same mistakes that were there with VP8 wouldn't happen anymore.
Well, that's what they get with rushed (and sort-of-like-proprietary) one-party development, I guess.
#7 | pieter3d | 18th October 2013, 18:09
Yeah. Another glaring flaw is the fact that MV decode requires fully reconstructed neighboring and co-located MV values, which means the entire MV prediction process is required for entropy decode and cannot be decoupled.
#8 | benwaggoner (Moderator) | 21st October 2013, 00:33
One big practical difference versus HEVC seems to be the lack of any equivalent to Wavefront Parallel Processing. If I'm parsing that right, it'll make it hard to build performant multithreaded software decoders. That is exacerbated by the lack of B-frames, which would allow frame-level parallelism where non-reference frames get decoded in separate threads.
#9 | pieter3d | 21st October 2013, 00:41
Well, in HEVC WPP is optional, which means that even if you create a really great multithreaded decoder, you still cannot guarantee performance, since a stream might not have it enabled.

In VP9 tiles are mandatory for picture widths greater than 4096.
#10 | NikosD | 21st October 2013, 19:11
Could you compare AVC, HEVC & VP9 in terms of HW decoding complexity at the same resolution and bitrate, in common profiles for each format?

For example, if AVC 1080p HW decoding complexity is 100, how much is it for HEVC & VP9?
#11 | pieter3d | 21st October 2013, 19:14
That's a pretty hard question to answer; depending on the modes used, HEVC & VP9 could be harder or easier than AVC. It also depends on how you measure complexity: memory usage? Silicon area? CPU load?
For HW, the silicon area is higher than AVC due to the use of larger blocks. In terms of SW decode time, HEVC seems to be on average ~2x that of AVC, and I suspect VP9 is similar or slightly worse.
#12 | NikosD | 21st October 2013, 20:19
Die area and CPU utilization were what I had in mind, so you answered both.

Thanks.
#13 | benwaggoner (Moderator) | 21st October 2013, 22:09
Quote (Originally Posted by pieter3d):
"Well, in HEVC WPP is optional, which means that even if you create a really great multithreaded decoder, you still cannot guarantee performance, since a stream might not have it enabled."
Certainly. But since WPP doesn't have nearly the quality overhead of AVC slices, I can imagine it might become standard in general encodes.

For my professional use, I can generally tune my encodes to devices and vice versa.

Quote (Originally Posted by pieter3d):
"In VP9 tiles are mandatory for picture widths greater than 4096."
Well, I think 8K video is probably farther out than VP10.

How do tiles compare in practice to slices/WPP as a parallelization mechanism? Say, 128x128 tiles?
#14 | pieter3d | 22nd October 2013, 00:52
Quote (Originally Posted by benwaggoner):
"Certainly. But since WPP doesn't have nearly the quality overhead of AVC slices, I can imagine it might become standard in general encodes. For my professional use, I can generally tune my encodes to devices and vice versa."
Yes, when you are the encoder it is a terrific feature, and when you control the whole pipe (like Netflix does) it is great too. But when you just need to decode any Main Profile clip, you can't rely on it, sadly.
Quote (Originally Posted by benwaggoner):
"Well, I think 8K video is probably farther out than VP10. How do tiles compare in practice to slices/WPP as a parallelization mechanism? Say, 128x128 tiles?"
Tiles have a coding efficiency impact, since coding dependencies are broken along the boundaries and CABAC adaptation is reset at the start of each tile (the same argument goes for slices). If you go to small tiles like 128x128, you will see substantial coding loss due to the adaptation being reset many times. Note that 128x128 is actually not allowed by HEVC Main Profile; the smallest allowed tile is 256x64.

I have spoken to some of the HEVC engineers at JCT-VC meetings who have said that in some sequences the distortion introduced by tile boundary independence can actually make the tile boundaries visible (a smart encoder could allocate more bits near the boundaries to compensate though).

HEVC tiles have a major advantage that you do not get with WPP: workload balancing. At the start of each picture you can resize the tiles, making one wider/taller than another, or adding more. This lets you adapt your encoder threads so that they all have close to 100% duty cycle (they all finish at approximately the same time). You can also use more or fewer tiles as physical threads free up or get allocated elsewhere due to changing load on a machine.

An advantage of tiles over slices is that they can be more square, so you get better spatial correlation than with long, thin slices. Of course you also avoid coding the slice header more than once. You can also use tiles to quickly turn a 1080p encoder into a 4K encoder using 4 threads (HW or SW) without very many modifications.

WPP makes the CABAC engine start each CTB row with the state saved after the first two CTBs of the row above, which means the adaptation is more spatially relevant. Normally when you finish one row and start the next, the context variables don't receive any special treatment despite the big spatial jump. With WPP you can actually see some coding gains there, on top of the multithreading advantage. However, you do have to code entry points for each CTB row in the slice header, which can take up a significant chunk of data at low bitrates, or in pictures with lots of skip CUs.

#15 | spacesinger | 2nd April 2014, 00:07
Quote (Originally Posted by pieter3d):
"Yeah. Another glaring flaw is the fact that MV decode requires fully reconstructed neighboring and co-located MV values, which means the entire MV prediction process is required for entropy decode and cannot be decoupled."

I know VP8 does that, but I searched the whole VP9 reference code and didn't find any code that requires the neighboring final MVs when decoding syntax.

Could you show me in detail? Thanks a lot.
#16 | pieter3d | 2nd April 2014, 00:15
Look at how it derives the use_hp bit when reading the MV delta from the bitstream. It uses the predictor, which is derived from final MV values.
#17 | spacesinger | 2nd April 2014, 02:10
Quote (Originally Posted by pieter3d):
"Look at how it derives the use_hp bit when reading the MV delta from the bitstream. It uses the predictor, which is derived from final MV values."
Finally got it... in the read_mv function call.
I didn't expect it to be hidden so deep, and used for just one use_hp bit.
Not hardware friendly at all. I am designing an ASIC for VP9 now.
I had thought the final MV would affect the context, but it doesn't.

Truly appreciated.
#18 | pieter3d | 2nd April 2014, 02:22
Just wait until you start implementing the loop filter...
#19 | jimwei | 23rd July 2015, 08:12
Hi pieter3d, can you help answer the following questions about VP8 and VP9:

1. For both VP9 and VP8, are the alt-ref frame and the golden frame not for display?
2. For VP9, the encoded data is not packed into partitions the way VP8 does it, i.e. there is only one partition, right?
3. For VP9's SVC mode, is it correct that the application driving the encoder (the sender, in the case of video communication) decides whether to send all the layers or just the base layer to the decoder (the receiver)?
4. How does the encoder get information about the network situation? Via RTCP feedback messages?

Thanks!
#20 | pieter3d | 23rd July 2015, 16:35
Quote (Originally Posted by jimwei):
"1. For both VP9 and VP8, are the alt-ref frame and the golden frame not for display?"
Alt-Ref, Golden, and Last are just names, nothing more. Think of them as an enumerated type on top of reference numbers 0, 1, 2.

Any frame can be marked as hidden (not for display). In the case of YouTube VP9 streams, the encoder uses the alt-ref label for the hidden frames, golden for some earlier frame, and last for the immediately preceding frame. That is the kind of structure you can get from the libvpx reference encoder, but other encoders may do it differently.

Quote (Originally Posted by jimwei):
"2. For VP9, the encoded data is not packed into partitions the way VP8 does it, i.e. there is only one partition, right?"
Correct, a single bitstream per frame.

Quote (Originally Posted by jimwei):
"3. For VP9's SVC mode, is it correct that the application driving the encoder (the sender, in the case of video communication) decides whether to send all the layers or just the base layer to the decoder (the receiver)?"
Yes, the point of SVC is to be able to send fewer bits and still reconstruct a smaller version of the image. So the sender sends as many layers as the receiver is able to handle.

Quote (Originally Posted by jimwei):
"4. How does the encoder get information about the network situation? Via RTCP feedback messages? Thanks!"
That is one way, but it is up to you! The VP9 standard doesn't specify it, so you may do it by any means appropriate to your application.