How HEVC/H.265 works, technical details & diagrams [Archive]

pieter3d

31st January 2013, 22:49

Hi all,

Now that the HEVC standard is finalized, I'd like to take this opportunity to explain how HEVC coding works in plain-ish English. About me: I am a hardware codec engineer who has been working with HEVC for over a year now, and I have participated in the JCT-VC meetings where HEVC has taken shape. The spec itself, located at http://phenix.int-evry.fr/jct/doc_end_user/current_document.php?id=7243, is rather hard to follow. This post will assume you are reasonably familiar with the coding techniques in H.264/AVC, hereafter referred to as AVC. This overview will gloss over a few details, but feel free to ask for any elaboration.

To start with: HEVC is actually a bit simpler conceptually than AVC, and lots of things in the spec are done to make life easier for the hardware codec designer (like me).

Picture partitioning
Instead of macroblocks, HEVC pictures are divided into so-called coding tree blocks, or CTBs for short, which appear in the picture in raster order. Depending on the stream parameters, they are either 64x64, 32x32 or 16x16. Each CTB can be split recursively in a quad-tree structure, all the way down to 8x8. So for example a 32x32 CTB can consist of three 16x16 and four 8x8 regions. These regions are called coding units, or CUs. CUs are the basic unit of prediction in HEVC. If you have been paying attention you have already inferred that CUs can be 64x64, 32x32, 16x16 or 8x8. The CUs in a CTB are traversed and coded in Z-order. Example ordering in a 64x64 CTB:
http://i.imgur.com/BZo2ZoY.png
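The quad-tree split and Z-order traversal described above can be sketched in a few lines of Python. This is only an illustration: the `split` callback is a hypothetical stand-in for the split flags in the bitstream, and the minimum CU size is hard-coded to 8.

```python
def z_order_cus(x, y, size, split):
    """Yield (x, y, size) for each CU of a CTB in coding (Z) order.

    `split(x, y, size)` is a hypothetical callback standing in for the
    split flags signalled in the bitstream; True means "split into four".
    """
    if size > 8 and split(x, y, size):
        half = size // 2
        # Quadrant order: top-left, top-right, bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                yield from z_order_cus(x + dx, y + dy, half, split)
    else:
        yield (x, y, size)


# Example from the text: a 32x32 CTB made of three 16x16 CUs and four 8x8 CUs.
# Which 16x16 quadrant gets split again is an arbitrary choice here.
def example_split(x, y, size):
    return size == 32 or (size == 16 and (x, y) == (16, 0))


cus = list(z_order_cus(0, 0, 32, example_split))
for cu in cus:
    print(cu)
```

Because the recursion finishes one quadrant completely before moving to the next, the four 8x8 CUs are all coded before the two 16x16 CUs in the bottom half.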
Like in AVC, a sequence of CTBs is called a slice. A picture can be split up into any number of slices, or the whole picture can be just one slice. In turn, each slice is split up into one or more “slice segments”, each in its own NAL unit. Only the first slice segment of a slice contains the full slice header, and the rest of the segments are referred to as dependent slice segments. A dependent slice segment is not decodable on its own; the decoder must have access to the first slice segment of the slice. This splitting of slices exists to allow for low-delay transmission of pictures without the coding efficiency loss of using many full slices. For example, a camera could send out a slice segment of the first CTB row so that the playback device on the other side of the network can begin drawing the picture before the camera is done coding the second CTB row. This can help achieve low-latency video conferencing.
HEVC does not support any interlaced tools (no more MBAFF hooray!). Interlaced video can still be coded, but it must be coded as a sequence of field pictures. No mixing of field and frame pictures.

Residual coding
For each CU, a residual signal is coded. HEVC supports four transform sizes: 4x4, 8x8, 16x16 and 32x32. Like AVC, the transforms are integer transforms based on the DCT. However, the transform used for intra 4x4 is based on the DST (discrete sine transform). There is no Hadamard transform like in AVC. The basis matrix uses coefficients requiring 7 bit storage, so it is quite a bit more precise than AVC. The higher precision and larger sizes of the transforms are among the main reasons HEVC performs so much better than AVC.
The residual signal of a CU consists of one or more transform units, or TUs. The CU is recursively split with the same quad-tree method as the CTB splitting, with the smallest allowable block being of course 4x4, the smallest TU. For example a 16x16 CU could contain three 8x8 TUs and four 4x4 TUs. For each luma TU there is a corresponding chroma TU of one quarter the size, so a 16x16 luma TU comes with two 8x8 chroma TUs. Since there is no 64x64 transform, a 64x64 CU must be split at least once into four 32x32 TUs. The only exception to this is for skipped CUs, when there is no residual signal at all. Note that there is no 2x2 chroma TU size. Since the smallest possible CU is 8x8, there are always at least four 4x4 luma TUs in an 8x8 region, and that region thus consists of four luma 4x4s and two chroma 4x4s (as opposed to eight 2x2s). Like the CUs in a CTB, TUs within a CU are also traversed in Z-order.
If a TU has size 4x4, the encoder has the option to signal a so-called “transform skip” flag, where the transform is simply bypassed altogether, and the transmitted coefficients are really just spatial residual samples. This can help code crisp small text for example.
Inverse quantization is essentially the same as in AVC.
The way a TU’s coefficients are coded in the bitstream is vastly different from AVC. First, the bitstream signals a last xy position, indicating the position of the last coefficient in scan order. Then the decoder, starting at this last position, scans backwards until it reaches position 0,0, known as the DC coefficient. The coefficients are grouped into 4x4 coefficient groups. The coefficients are scanned diagonally (down and left) within each group, and the groups are scanned diagonally as well. For each group, the bitstream signals if it contains any coefficients. If so, it then signals a bit for each of the 16 coefficients in the group to indicate which are non-zero. Then for each of the non-zero coefficients in a group the remainder of the level is signaled. Finally the signs of all the non-zero coefficients in the group are decoded, and the decoder moves on to the next group. HEVC has an optional tool called sign bit hiding. If enabled and there are enough coefficients in the group, one of the sign bits is not coded, but rather inferred. The missing sign is inferred to be equal to the least significant bit of the sum of all the coefficients’ absolute values. This means that when the encoder was coding the coefficient group in question and the inferred sign was not the correct one, it had to adjust one of the coefficients up or down to fix that. The reason this tool works is that sign bits are coded in bypass mode (not compressed) and thus are expensive to code. By not coding some of the sign bits, the savings more than make up for any distortion caused by adjusting one of the coefficients.
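To make sign bit hiding concrete, here is a rough Python sketch of both sides. The function names, the `hidden_idx` parameter (which coefficient loses its sign bit) and the encoder's "bump the largest magnitude" rule are all assumptions for illustration; a real encoder would pick the adjustment by rate-distortion cost.

```python
def hide_sign(levels, hidden_idx):
    """Encoder side (naive): make the parity of sum(|level|) equal the sign
    bit (0 = positive, 1 = negative) of levels[hidden_idx]."""
    levels = list(levels)
    wanted = 1 if levels[hidden_idx] < 0 else 0
    if (sum(abs(v) for v in levels) & 1) != wanted:
        # Illustrative choice only: grow the largest magnitude by one.
        k = max(range(len(levels)), key=lambda i: abs(levels[i]))
        levels[k] += 1 if levels[k] >= 0 else -1
    return levels


def restore_levels(abs_levels, coded_signs, hidden_idx):
    """Decoder side: rebuild signed levels; the hidden sign comes from parity."""
    parity = sum(abs_levels) & 1
    signs = iter(coded_signs)
    out = []
    for i, a in enumerate(abs_levels):
        if a == 0:
            out.append(0)          # zero coefficients carry no sign
            continue
        s = parity if i == hidden_idx else next(signs)
        out.append(-a if s else a)
    return out


adjusted = hide_sign([3, -1, 0, 2], hidden_idx=1)
abs_levels = [abs(v) for v in adjusted]
coded_signs = [1 if v < 0 else 0
               for i, v in enumerate(adjusted) if v != 0 and i != 1]
decoded = restore_levels(abs_levels, coded_signs, hidden_idx=1)
print(adjusted, decoded)
```

In this example the sum of magnitudes is even but the hidden sign is negative, so the encoder has to change one magnitude by one. That change is the distortion being traded against the saved bypass bin.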
Example of the scan process of a 16x16 TU:
http://i.imgur.com/ST3A5g7.png (http://imgur.com/ST3A5g7)
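The two-level diagonal scan can be generated with a short Python sketch. It builds the forward order (DC first); the decoder walks it backwards, which gives the down-and-left movement mentioned above. The function names are made up for this example, and only the diagonal pattern described in this post is covered.

```python
def diag_scan(n):
    """Forward diagonal scan of an n x n block as (x, y) pairs, DC first."""
    order = []
    for d in range(2 * n - 1):
        # Walk each anti-diagonal from its bottom-left end to its top-right end.
        for y in range(min(d, n - 1), -1, -1):
            x = d - y
            if x < n:
                order.append((x, y))
    return order


def tu_scan(tu_size):
    """Forward coefficient order of a TU: 4x4 groups are visited diagonally,
    and the 16 coefficients inside each group are visited diagonally too."""
    groups = diag_scan(tu_size // 4)
    inner = diag_scan(4)
    return [(gx * 4 + x, gy * 4 + y) for gx, gy in groups for x, y in inner]


# The decoder starts at the signalled last position and walks towards DC.
reverse_order = tu_scan(16)[::-1]
print(reverse_order[-3:])
```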

Prediction units
A CU is split using one of eight partition modes. These eight modes have the following mnemonics: 2Nx2N, 2NxN, Nx2N, NxN, 2NxnU, 2NxnD, nLx2N, nRx2N. Here the uppercase N represents half the length of a CU’s side and lowercase n represents one quarter. For a 32x32 CU, N = 16 and n = 8.
http://i.imgur.com/jorqcDc.png (http://imgur.com/jorqcDc)
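For readers who prefer numbers over mnemonics, this small Python sketch turns each partition mode into PU rectangles. The `(x, y, width, height)` tuple layout and the function name are just choices made for this example.

```python
def pu_rects(mode, size):
    """Return the PU rectangles (x, y, w, h) of a size x size CU."""
    N, n = size // 2, size // 4
    table = {
        "2Nx2N": [(0, 0, size, size)],
        "2NxN":  [(0, 0, size, N), (0, N, size, N)],
        "Nx2N":  [(0, 0, N, size), (N, 0, N, size)],
        "NxN":   [(0, 0, N, N), (N, 0, N, N), (0, N, N, N), (N, N, N, N)],
        "2NxnU": [(0, 0, size, n), (0, n, size, size - n)],
        "2NxnD": [(0, 0, size, size - n), (0, size - n, size, n)],
        "nLx2N": [(0, 0, n, size), (n, 0, size - n, size)],
        "nRx2N": [(0, 0, size - n, size), (size - n, 0, n, size)],
    }
    return table[mode]


MODES = ["2Nx2N", "2NxN", "Nx2N", "NxN", "2NxnU", "2NxnD", "nLx2N", "nRx2N"]
for mode in MODES:
    print(mode, pu_rects(mode, 32))
```

For a 32x32 CU, 2NxnU gives a 32x8 PU on top of a 32x24 PU, matching N = 16 and n = 8.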
Thus a CU consists of one, two or four prediction units, or PUs. Note that this division is not recursive. A CU is either inter- or intra-coded, so if a CU is split into two PUs, both of them are inter- or both of them are intra-coded. Intra-coded CUs may only use the partition modes 2Nx2N or NxN, so intra PUs are always square. A CU may also be skipped, which implies inter coding and a partition mode of 2Nx2N. NxN partition mode is only allowed when the CU is the smallest size allowed (8x8 normally). The idea is that if you want four separate predictions in a CU, you might as well just split (if you can) and create four separate CUs. Also, inter CUs are not allowed to be NxN if the CU is 8x8, meaning no 4x4 motion compensation at all. The smallest block sizes are 8x4 and 4x8, and these can never be bidirectional. This was done to minimize worst case memory bandwidth (see section below on motion compensation).

Intra prediction
Intra prediction in a CU follows the TU tree exactly. When an intra CU is coded using the NxN partition mode, the TU tree is forcibly split at least once, ensuring the intra and TU tree match. This means that the intra operation is always 32x32, 16x16, 8x8 or 4x4.
In HEVC, there are, wait for it, 35 different intra modes, as opposed to the 9 in AVC. 33 are directional and there is a DC and Planar mode as well. Like AVC, intra prediction requires two 1D arrays that contain the upper and left neighboring samples, as well as an upper-left sample. The arrays are twice as long as the intra block size, extending below and right of the block. Example for an 8x8 block:
http://i.imgur.com/LwEucP7.png
Depending on the position of the intra prediction block, any number of these neighboring samples may not be available. For example they could be outside the picture, in another slice, or belong to a CU that will be decoded in the future (causality violation). Any samples that are not available are filled in using a well-defined process after which the neighboring arrays are completely full of valid samples. Depending on the block size and intra mode, the neighboring arrays are filtered (smoothed).
The angular prediction process is similar to AVC, just with more modes and a unified algorithm that can handle all block sizes. In addition to the 33 angular modes, there is a DC mode which simply uses a single value for the prediction, and Planar, which does a smooth gradient of the neighbor samples.
Intra mode coding is done by building a 3-entry list of modes. This list is generated using the left and above modes, along with some special derivations of them to come up with 3 unique modes. If the desired mode is in the list, the index is sent, otherwise the mode is sent explicitly.
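Here is one way the 3-entry list could be built, written as a Python sketch. The mode numbering (0 = Planar, 1 = DC, 2..34 = angular, 26 = vertical) and the exact derivation rules are not spelled out in this post; they reflect my reading of the spec and should be treated as assumptions, as should the function names.

```python
PLANAR, DC, VERTICAL = 0, 1, 26   # assumed mode numbering


def mpm_list(left, above):
    """Build the 3-entry candidate list from the left and above modes."""
    if left == above:
        if left < 2:                        # both Planar or both DC
            return [PLANAR, DC, VERTICAL]
        # Angular: the mode itself plus its two angular neighbors, with wrap.
        return [left, 2 + ((left + 29) % 32), 2 + ((left - 2 + 1) % 32)]
    if PLANAR not in (left, above):
        third = PLANAR
    elif DC not in (left, above):
        third = DC
    else:
        third = VERTICAL
    return [left, above, third]


def code_mode(mode, mpm):
    """Return what gets signalled: a list index, or the explicit remaining mode."""
    if mode in mpm:
        return ("mpm_idx", mpm.index(mode))
    # Skip over list entries so the 32 remaining modes fit in 5 bits.
    return ("rem_mode", mode - sum(1 for m in mpm if m < mode))


print(mpm_list(10, 10), code_mode(9, mpm_list(10, 10)))
print(mpm_list(5, 20), code_mode(34, mpm_list(5, 20)))
```

The point of the list is that a mode matching a neighbor costs only a short index, while any other mode needs a full 5-bit code.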

- End of Part 1 -

pieter3d

31st January 2013, 23:44

- Part 2 -

Inter prediction – motion vector prediction
Like AVC, HEVC has two reference lists: L0 and L1. They can hold 16 references each, but the maximum total number of unique pictures is 8. This means that to max out the lists you have to add the same picture more than once. The encoder may choose to do this to be able to predict off the same picture with different weights (weighted prediction).
If you thought AVC had complex motion vector prediction, you haven’t seen anything yet. HEVC uses candidate list indexing. There are two MV prediction modes: Merge and AMVP (advanced motion vector prediction, although the spec doesn’t specifically call it that). The encoder decides between these two modes for each PU and signals it in the bitstream with a flag. Only the AMVP process can result in any desired MV, since it is the only one that codes an MV delta. Each mode builds a list of candidate MVs, and then selects one of them using an index coded in the bitstream.
http://i.imgur.com/XnPEy9h.png
AMVP process: This process is performed once for each MV; so once per L0 or L1 PU, or twice for a bidirectional PU. The bitstream specifies the reference picture to use for each MV. A two-deep candidate list is formed: First, attempt to obtain the left predictor. Prefer A0 over A1, prefer the same list over the opposite list, and prefer a neighbor that points to the same picture over one that doesn’t. If no neighbor points to the same picture, scale the vector to match the picture distance (similar process as AVC temporal direct mode). If all this resulted in a valid candidate, add it to the candidate list. Next, attempt to obtain the upper predictor. Prefer B0 over B1, over B2, prefer a neighbor MV that points to the same picture over one that doesn’t. Neighbor scaling for the upper predictor is only done if it wasn’t done for the left neighbor, ensuring no more than one scaling operation per PU. Add the candidate to the list if one was found. If the list now still contains fewer than 2 candidates, find the temporal candidate (scaled MV according to picture distance), which is co-located with the right bottom of the PU. If that lies outside the CTB row, or outside the picture, or if the co-located PU is intra, try again with the center position. Add the temporal candidate to the list if one was found. If the candidate list is still not full, just add 0,0 vectors until full. Finally, with the transmitted index, select the right candidate and add in the transmitted MV delta.
Phew, now merge mode: The merge process results in a candidate list of up to 5 entries deep, configured in the slice header. Each entry might end up being L0, L1 or bidirectional. First add at most 4 spatial candidates in this order: A1, B1, B0, A0, B2. A candidate cannot be added to the list if it is the same as one of the earlier candidates. Then, if the list still has room, add the temporal candidate, which is found by the same process as in AMVP. Then, if the list still has room, add bidirectional candidates formed by making combinations of the L0 and L1 vectors of the other candidates already in the list. Then finally if the list still isn’t full, add 0,0 MVs with increasing reference indices. The final motion is obtained by picking one of the up-to-5 candidates as signaled in the bitstream.
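The fill order of the merge list can be sketched roughly as follows. This is a simplification under several assumptions: a candidate is a pair `(l0, l1)` where each side is `None` or `(mv_x, mv_y, ref_idx)`, duplicates are detected by plain equality, the combined candidates try every ordered pair, and the zero candidates are always bidirectional. The real derivation has more conditions than this.

```python
from itertools import permutations


def build_merge_list(spatial, temporal, max_cands=5, num_refs=2):
    """spatial: candidates for A1, B1, B0, A0, B2 in that order (None = unavailable).
    temporal: the temporal candidate or None."""
    cands = []
    # 1. At most four spatial candidates; skip unavailable ones and duplicates.
    for c in spatial:
        if c is not None and c not in cands and len(cands) < min(4, max_cands):
            cands.append(c)
    # 2. Temporal candidate, if there is room.
    if temporal is not None and len(cands) < max_cands:
        cands.append(temporal)
    # 3. Combined bidirectional candidates: L0 of one entry with L1 of another.
    originals = list(cands)
    for a, b in permutations(originals, 2):
        if len(cands) >= max_cands:
            break
        if a[0] is not None and b[1] is not None:
            combo = (a[0], b[1])
            if combo not in cands:
                cands.append(combo)
    # 4. Zero vectors with increasing reference index until the list is full.
    ref = 0
    while len(cands) < max_cands:
        zero = (0, 0, min(ref, num_refs - 1))
        cands.append((zero, zero))
        ref += 1
    return cands


spatial = [((1, 0, 0), None),      # A1: L0 only
           ((1, 0, 0), None),      # B1: same motion as A1, so it is dropped
           None,                   # B0: unavailable
           (None, (2, 2, 0)),      # A0: L1 only
           None]                   # B2: unavailable
merge = build_merge_list(spatial, temporal=None)
for i, c in enumerate(merge):
    print(i, c)
```

The decoder and encoder must build exactly the same list, which is why every step is fully deterministic; the bitstream then only carries the index.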
Note that HEVC sub-samples the temporal motion vectors on a 16x16 grid. That means that a decoder only needs to make room for two motion vectors (L0 and L1) per 16x16 region in the picture when it allocates the temporal motion vector buffer. When the decoder calculates the co-located position, it zeroes out the lower 4 bits of the x/y position, snapping the location to a multiple of 16. Regarding which picture is considered the co-located picture, that is signaled in the slice header. This picture must be the same for all slices in a picture, which is a great feature for hardware decoders since it enables the motion vectors to be queued up ahead of time without having to worry about slice boundaries.
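The grid snapping and the resulting buffer size are simple enough to show directly. The function names here are invented for the example, and the buffer count assumes the picture is padded up to a multiple of 16.

```python
def colocated_grid_pos(x, y):
    """Snap a luma position to the 16x16 grid by clearing the low 4 bits."""
    return (x >> 4) << 4, (y >> 4) << 4


def temporal_mv_regions(width, height):
    """Number of 16x16 regions, each storing one L0 and one L1 vector."""
    return ((width + 15) // 16) * ((height + 15) // 16)


print(colocated_grid_pos(37, 70))        # (32, 64)
print(temporal_mv_regions(1920, 1080))   # 8160
```

So a 1080p picture needs storage for 8160 regions, i.e. 16320 motion vectors, regardless of how finely the picture was actually partitioned.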

Inter prediction – motion compensation
Like AVC, HEVC specifies motion vectors in 1/4-pel, but uses an 8-tap filter for luma (all positions), and a 4-tap 1/8-pel filter for chroma. This is up from 6-tap and bilinear (2-tap) in AVC, respectively.
Because of the 8-tap filter, any given NxM sized block will need extra pixels on all sides (3 left/above, 4 right and below) to provide the filter with the data it needs. With small blocks like an 8x4, you really need to read (8+7)x(4+7) = 15x11 pixels. You can see that the more small blocks you have, the more you have to read from memory. That means more access to DRAM, which costs more time and power (battery life!), so this is why HEVC limits the smallest block to be uni-directional and 4x4 is not possible.
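The bandwidth argument is easy to put into numbers. This sketch assumes the worst case where both the x and y components of the motion vector are fractional, so the full 7 extra rows and columns are needed; the function name is made up.

```python
def luma_fetch(w, h, taps=8):
    """Reference samples read to interpolate a w x h luma block."""
    return (w + taps - 1) * (h + taps - 1)


for w, h in [(8, 4), (8, 8), (16, 16), (64, 64)]:
    fetched = luma_fetch(w, h)
    print(f"{w}x{h}: {fetched} samples fetched, "
          f"{fetched / (w * h):.2f}x the block size")

# A hypothetical bidirectional 4x4 block (not allowed in HEVC) would need
# two such fetches for only 16 predicted samples.
print(2 * luma_fetch(4, 4) / 16)
```

An 8x4 block reads 165 samples for 32 predicted ones, while a 64x64 block reads only about 1.23 times its own size. The forbidden bidirectional 4x4 case would read more than 15 samples per predicted sample.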
HEVC supports weighted prediction for both uni- and bi-directional PUs. However, the weights are always explicitly transmitted in the slice header; there is no implicit weighted prediction like in AVC.

Deblocking
Deblocking in HEVC is performed on the 8x8 grid only, unlike AVC which deblocks every 4x4 grid edge. All vertical edges in the picture are deblocked first, followed by all horizontal edges. The actual filter is very similar to AVC, but only boundary strengths 2, 1 and 0 are supported. Because of the 8-pixel separation between edges, edges do not depend on each other, enabling a highly parallelized implementation. In theory you could perform the vertical edge filtering with one thread per 8-pixel column in the picture. Chroma is only deblocked when one of the PUs on either side of a particular edge is intra-coded.

SAO
After deblocking is performed, a second filter optionally processes the picture. This filter is called Sample Adaptive Offset, or SAO. This relatively simple process is done on a per-CTB basis, and operates once on each pixel (so 64*64 + 32*32 + 32*32 = 6144 times in a 64x64 CTB). For each CTB, the bitstream codes a filter type and four offset values, which range from -7..7 (in 8-bit video).
There are two types of filters: Band and Edge.
Band Filter: Divide the sample range into 32 bands. A sample’s band is simply the upper 5 bits of its value. So samples with value 0..7 are in band 0, 8..15 in band 1 and so on. Then a band index is transmitted, along with the four offsets, that identifies four adjacent bands. So if the band index is 4, it means bands 4, 5, 6 and 7. If a pixel falls into one of these bands, add the corresponding offset to it.
Edge Filter: Along with the four offsets, an edge mode is transmitted: 0-degree, 90-degree, 45-degree or 135-degree. Depending on the mode, two adjacent neighbor samples are picked from the 3x3 sample neighborhood around the current sample. 90-degree means use the above and lower samples, 45-degree means upper-right and lower-left and so on. Each of these two neighbors can be less than, greater than or equal to the current sample. Depending on the outcome of these two comparisons, the sample is either unchanged or one of the four offsets is added to it.
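Both SAO filter types fit in a short Python sketch. Assumptions to be aware of: the way the four offsets map to the edge comparison outcomes (local minimum, two "half edge" cases, local maximum) is my own ordering for illustration, picture and CTB border handling is left out, and the result is clipped to the valid sample range.

```python
def sign(v):
    return (v > 0) - (v < 0)


def clip(v, bit_depth):
    return min(max(v, 0), (1 << bit_depth) - 1)


def sao_band(sample, band_index, offsets, bit_depth=8):
    """Band filter: the upper 5 bits of the sample select one of 32 bands."""
    band = sample >> (bit_depth - 5)
    k = band - band_index
    if 0 <= k < 4:                      # one of the four adjacent bands
        sample += offsets[k]
    return clip(sample, bit_depth)


# Neighbor pairs as (dx, dy) for each edge mode.
EDGE_NEIGHBORS = {
    0:   ((-1, 0), (1, 0)),     # left and right
    90:  ((0, -1), (0, 1)),     # above and below
    135: ((-1, -1), (1, 1)),    # upper-left and lower-right
    45:  ((1, -1), (-1, 1)),    # upper-right and lower-left
}


def sao_edge(img, x, y, mode, offsets, bit_depth=8):
    """Edge filter for one interior sample of img (a list of rows)."""
    (ax, ay), (bx, by) = EDGE_NEIGHBORS[mode]
    c = img[y][x]
    s = sign(c - img[y + ay][x + ax]) + sign(c - img[y + by][x + bx])
    # -2: below both neighbors, -1: below one, +1: above one, +2: above both.
    category = {-2: 0, -1: 1, 1: 2, 2: 3}.get(s)
    if category is not None:            # s == 0 leaves the sample unchanged
        c += offsets[category]
    return clip(c, bit_depth)


dip = [[10, 10, 10], [10, 5, 10], [10, 10, 10]]
print(sao_band(35, 4, [1, 2, 3, 4]))            # sample in band 4
print(sao_edge(dip, 1, 1, 0, [3, 2, -2, -3]))   # local minimum gets raised
```

With positive offsets for dips and negative offsets for peaks, the edge filter acts as a small smoothing step along the chosen direction.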
The offsets and filter modes are picked by the encoder in an attempt to make the CTB more closely match the source image. Often you will see in regions that have no motion or simple linear motion (like panning shots), the SAO filter will get turned off for inter pictures, since the “fixing” that the SAO filter did in the intra-picture carries forward through the inter pictures.

Entropy coding
HEVC performs entropy coding using CABAC only; there is no choice between CABAC and CAVLC like in AVC. Yay! The CABAC algorithm is nearly identical to AVC, but with a few minor improvements. There are about half as many context state variables as in AVC, and the initialization process is much simpler. In the design of the syntax (the sequence of values read from the bitstream), great care has been taken to group bypass-coded bins together as much as possible. CABAC decoding is inherently a very serial operation, making fast hardware implementations of CABAC difficult. However it is possible to decode more than one bypass bin at a time, and the bypass-bin grouping ensures hardware decoders can take advantage of this property.

Parallel tools
HEVC has two tools that are specifically designed to enable a multi-threaded decoder to decode a single picture with multiple threads: Tiles and Wavefront.
Tiles: The picture is divided into a rectangular grid of CTBs, up to 20 columns and 22 rows. Each tile contains an integer number of independently decodable CTBs. This means motion vector prediction and intra-prediction are not performed across tile boundaries; it is as if each tile is a separate picture. The only exception to this independence is that the two in-loop filters can filter across the tile boundaries. The slice header contains byte-offsets for each tile, so that multiple decoder threads can seek to the start of their respective tiles right away. A single-threaded decoder would simply process the tiles one by one in raster order. So how does this work with slices and slice segments? To avoid difficult situations, the HEVC spec says that if a tile contains multiple slices, the slices must not contain any CTBs outside that tile. Conversely, if a slice contains multiple tiles, the slice must start with the CTB at the beginning of the first tile and end with the CTB at the end of the last tile. These same rules apply to the dependent slice segments. The tile structure (size and number) can vary from picture to picture, allowing a smart multi-threaded encoder to load balance.
Wavefront: Each CTB row can be decoded by its own thread. After a particular thread has completed the decoding of the second CTB in its row, the entropy decoder’s state is saved and transferred to the row below. Then the thread of the row below can start. This way there are no cross-thread prediction dependency issues. It does however require careful implementation to ensure a lower thread does not advance too far. Inter-thread communication is required.
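A toy model shows what the wavefront schedule looks like. It assumes one thread per CTB row, one time unit per CTB, and that a CTB can only start once the CTB above and to the right of it is finished (for the first column this is exactly the "second CTB of the row above" rule). Real CTBs take varying amounts of time, so this is only the best case.

```python
def wavefront_finish_times(rows, cols):
    """Finish time of every CTB under the toy wavefront model."""
    t = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            left = t[r][c - 1] if c > 0 else 0
            # Wait for the row above to be one CTB ahead (clamped at the
            # last column, where there is no above-right neighbor).
            above = t[r - 1][min(c + 1, cols - 1)] if r > 0 else 0
            t[r][c] = max(left, above) + 1
    return t


times = wavefront_finish_times(4, 8)
for row in times:
    print(row)
print("wavefront:", times[-1][-1], "units, serial:", 4 * 8, "units")
```

Each row starts two time units after the row above it, so the picture finishes in roughly cols + 2 * rows units instead of rows * cols.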

Profiles/Levels:
There are 3 profiles in the spec: Main, Main 10 and Still Picture. For almost all content, Main is the only one that counts. It adds a few notable limitations: bit depth is 8, tiles must be at least 256x64, and tiles and wavefront may not be enabled at the same time. Many levels are specified, from 1 to 6.2. A 6.2 stream could be 8192x4320@120fps. See the spec for details.

Phew, hope you got all that! Questions and comments welcome.

Gabrielgoc

1st February 2013, 00:48

Great! Very good job...Very interesting!!!!!!!!!!!!!

THX!

Gabriel

xooyoozoo

1st February 2013, 00:53

For almost all content, Main is the only one that counts.

Can you expand on that?

For home use, encoding time plays a significantly bigger role than decoding time, and based on the published results, it seems that Main 10 would basically offer a "free" compression boost in terms of the former at least. It's not enough to miss, but when presented with the option, Main 10 would seem like the obvious choice.

Excellent write up, by the way. Thanks for your time!

JEEB

1st February 2013, 01:05

Welcome to Doom9, and congratulations on writing a lump of information on HEVC, as well as on finishing the major decisions on HEVC version 1.

Yes, HEVC seems like a nice simplification compared to AVC looking at the effort at some places taken in order to support both PAFF and MBAFF properly, and so forth :) Also having only one way for entropy coding does make it simpler in a pleasant way.

I did seemingly miss the part where interlacing with both fields being encoded in the same "picture" was made impossible, which is of course a happy limitation for me. Should re-check L1003 on it again :)

Also, as far as reading up on HEVC goes, this IEEE paper (http://iphome.hhi.de/wiegand/assets/pdfs/2012_12_IEEE-HEVC-Overview.pdf) is another very good introduction of various parts of the HEVC standard.

P.S. Are you somewhere on those photos from the Geneva meeting posted on the JCT-VC mailing list? ;)

pieter3d

1st February 2013, 01:27

Can you expand on that?

Main 10 is really for 10-bit source content, which pretty much no one has access to. Also, most HW decoder support will be Main only (initially).

Welcome to Doom9, and
I did seemingly miss the part where interlacing with both fields being encoded in the same "picture" was made impossible, which is of course a happy limitation for me. Should re-check L1003 on it again :)

You can still do that, but the quality will suck because the two fields will be transformed together etc. If you want to encode an interlaced sequence, you have to do it using field pictures. Good riddance in my opinion anyway, we need to move away from the ancient technology that is interlacing.

P.S. Are you somewhere on those photos from the Geneva meeting posted on the JCT-VC mailing list? ;)
Nope, I did not attend the last meeting because I had a scheduling conflict with work. Also this last meeting was not expected to change anything about the fundamental way things worked in HEVC, so as a hardware designer there were no issues I needed to raise.

LoRd_MuldeR

1st February 2013, 01:29

Main 10 is really for 10-bit source content, which pretty much no one has access to. Also, most HW decoder support will be Main only (initially).

With AVC it is known that 10-Bit encoding is beneficial even for 8-Bit sources, because the higher internal precision improves compression efficiency. Won't the same apply to HEVC?

Dark Shikari

1st February 2013, 01:39

It might, it might not; you should probably do some testing first. Some of the gain of 10-bit might be mitigated by the features of HEVC, though I'm not certain how much.

schweinsz

1st February 2013, 04:25

With AVC it is known that 10-Bit encoding is beneficial even for 8-Bit sources, because the higher internal precision improves compression efficiency. Won't the same apply to HEVC?

According to some proposals I read in the JCT-VC, 10-bit coding can give a gain for 8-bit sources, but the gain is smaller than in H.264/AVC.
The 10-bit gain comes from the higher precision of the reference pictures, which gives higher prediction precision. So the gain should not disappear; it just becomes smaller.

xooyoozoo

1st February 2013, 04:28

It might, it might not; you should probably do some testing first. Some of the gain of 10-bit might be mitigated by the features of HEVC, though I'm not certain how much.

L0322 already had the data inputted. I moved them around. (http://i.imgur.com/t776kKU.png)

About a couple of percentage points improvement on average, with the outliers improving tremendously.

Edit: Nebutta and SteamLocomotive are 10bit sources.

pieter3d

1st February 2013, 04:43

L0322 already had the data inputted. I moved them around. (http://i.imgur.com/t776kKU.png)

About a couple of percentage points improvement on average, with the outliers improving tremendously.

Edit: Nebutta and SteamLocomotive are 10bit sources.

Right, so with 8 bit sources there is marginal gain. Although you see decent gains in chroma, it makes almost no difference in overall bitrate since that's such a small portion of the stream.

xooyoozoo

1st February 2013, 04:49

Right so with 8 bit sources almost no gain. Although you see decent gains in chroma, it makes almost no difference in overall bitrate since that's such a small portion of the stream.

As far as the software reference encoder/decoder is concerned though, this ~2% efficiency boost is almost "free".

Would this not also apply to hardware encoder/decoder implementations?

pieter3d

1st February 2013, 05:29

As far as the software reference encoder/decoder is concerned though, this ~2% efficiency boost is almost "free".

Would this not also apply to hardware encoder/decoder implementations?

No, not at all, unfortunately. You have to store all the reference pictures in 10bit, and that means you need to bit pack pixels into data words that are pretty much guaranteed to be a multiple of 8. Often designs will just go to 16 bits per pixel just to make the addressing easy (small easy to verify logic). So that is extra bandwidth and storage right there. Furthermore, the output display device will likely only take 8 bit pictures, so that means the decoder will need to write out 2 pictures: one 8bit for output and one 10bit for future reference. Lots of extra bandwidth.

Also a large part of a hw design's area on silicon comes from small buffers that hold temp data, and those will all have to grow by 25%, which makes the decoder/encoder correspondingly more expensive to fabricate. Silicon area is very expensive.
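To put rough numbers on the storage cost, here is a back-of-the-envelope calculation for one 1080p 4:2:0 picture. The resolution is only an example, and real designs add alignment and padding that this ignores.

```python
def frame_bytes(width, height, bits_per_sample):
    """Bytes for one 4:2:0 picture (chroma adds half the luma sample count)."""
    samples = width * height * 3 // 2
    return samples * bits_per_sample // 8


base = frame_bytes(1920, 1080, 8)
for label, bits in [("8-bit", 8), ("10-bit packed", 10), ("10-bit in 16-bit words", 16)]:
    size = frame_bytes(1920, 1080, bits)
    print(f"{label}: {size} bytes ({size / base:.2f}x)")
```

Tightly packed 10-bit costs 25% more than 8-bit, and the simpler 16-bit layout doubles it. Writing a separate 8-bit output picture comes on top of that.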

vivan

1st February 2013, 12:40

L0322 already had the data inputted. I moved them around. (http://i.imgur.com/t776kKU.png)

What about visual difference?
I don't think objective video quality metrics consider banding a really bad artifact; however, it's very noticeable to the human eye...

Furthermore, the output display device will likely only take 8 bit pictures

8 bit 4:2:0 pictures with the resolution of the video? ;) No, it's the renderer's job to upsample chroma, scale the image, and convert it to RGB. I think HQ h/w renderers should do it with 16-bit precision, so 8 or 10 bit input makes no difference for them...

m3sh

3rd February 2013, 23:34

Thanks pieter3d. You should do another post detailing tiling and WPP :)

pieter3d

3rd February 2013, 23:56

Thanks pieter3d. You should do another post detailing tiling and WPP :)

I did include those tools in my post. Is there anything specific you wanted to know about them?

kieranrk

4th February 2013, 00:35

As far as the software reference encoder/decoder is concerned though, this ~2% efficiency boost is almost "free".

A significant speed decrease is not "free" by anyone's standard.

xooyoozoo

4th February 2013, 02:30

A significant speed decrease is not "free" by anyone's standard.

Just to make sure we're on the same page, those decode/encode times are relative to 100%, which means no change. JCT-VC docs use that format; I kept it as is.

Beyond that, I'm not going to argue over whether 6% falls under the umbrella of 'significant'.

*.mp4 guy

4th February 2013, 06:27

pieter3d

4th February 2013, 08:33

The two things that cause AVC to have such huge problems encoding gradients using 8 bit encoding are the abysmally designed spatial transforms and the deblocking filter. The transform issue looks like it may well have been solved, so it's likely that at rates where the deblocking filter isn't used, 8 bit encoding may perform fine. However, at lower rates where the deblocking filter is used, its inability to reconstruct shallow gradients will likely cause large psychovisual problems that will not be measurable using PSNR. In any case, the larger, more precise transforms should make it easier to preserve visual energy in low complexity areas.

Another thing that will really help is bilinear gradient creation on the intra predictor array sample for large blocks (32x32), which should really help prevent contouring. See http://phenix.int-evry.fr/jct/doc_end_user/documents/11_Shanghai/wg11/JCTVC-K0139-v1.zip

The offset filter, the ability to encode 4x4 blocks without a transform, and the proper use of qpel all present interesting opportunities for encoder-side optimization.

One thing I'm curious about is if the raw image data coding mode allows quantization (reduction of color accuracy). Presumably it does, but it's best not to make assumptions about these things...

There are actually 3 ways to code raw data:
PCM: Can be at any bit-depth that is less than the main bit-depth. For example, 5-bit PCM samples are upshifted by 3 to obtain the final values. PCM is applied to the entire CU, and may not be larger than 32x32. Luma and chroma can have separate bit-depths.
Transform Skip: The tool mentioned above where 4x4 TUs optionally are just not transformed, merely downshifted after the inverse quantization. This choice applies to each TU individually, but only if it is 4x4.
Trans-quant bypass: Applies to an entire CU (luma and chroma), where the coded coefficients are just treated as the final residual signal. This allows 100% lossless coding while still getting to use all the prediction tools (unlike PCM). Because it is for the whole CU, 8x8 is the smallest size it applies to.
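The PCM upshift is a one-liner; the function name and default bit depth below are just for illustration.

```python
def pcm_to_sample(pcm_value, pcm_bit_depth, bit_depth=8):
    """Expand a reduced-precision PCM sample to the main bit depth."""
    return pcm_value << (bit_depth - pcm_bit_depth)


# 5-bit PCM in 8-bit video: shift up by 3, so only multiples of 8 are reachable.
print([pcm_to_sample(v, 5) for v in (0, 1, 16, 31)])   # [0, 8, 128, 248]
```

So reduced-depth PCM does act as a coarse quantizer: with 5 bits the reconstructed values step in increments of 8 and top out at 248.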

The loop filters can be configured to modify PCM samples or leave them as-is. Trans-quant bypass blocks may never be modified by the loop filters, and transform-skip 4x4s have no impact on the loop filter decision whatsoever.

*.mp4 guy

4th February 2013, 09:59

Another thing that will really help is bilinear gradient creation on the intra predictor array sample for large blocks (32x32), which should really help prevent contouring. See http://phenix.int-evry.fr/jct/doc_end_user/documents/11_Shanghai/wg11/JCTVC-K0139-v1.zip

IIRC, vp8 has this prediction mode already, but it doesn't appear to have helped it in this regard at all. To be fair, I've only tested vp8 once, but it looked worse than x264 with all of its psy turned off. Perhaps it will make gradient preservation cheaper, but just like the deblocking filter, I would again be worried about it becoming useless when faced with shallow gradients, which have always been the real problem. Unless the prediction accuracy is better than 8 bits, that is; though of course it won't be.

Of course, the new prediction mode is still a good addition to the spec. In general the spec looks like it is a solid step forward over avc in every category, which is really quite encouraging. However, I still anticipate teething problems out of the gate regarding banding and a few other issues (such as the motion vector prediction inadvertently causing misprediction issues when paired with rd-optimization).

[edit] I just realized that the linked paper actually includes an example of the technique used on a problem sample, which is much more informative than these papers often are. The prediction mode certainly does help, and at the target quality level they have set, it leads to adequate gradient performance. However, I expect it will not help greatly at more common bitrates. Still, it's more useful than I expected.

falocn88

21st February 2013, 16:29

Hi Peter,
Thanks for the information about HEVC. Your explanation is very narrative and easy to understand. I would like to get some more information about residual decoding. Can you suggest any documents or JCT-VC proposals?

Cheers, Keep posting some more topics regarding HEVC.:-)

pieter3d

21st February 2013, 17:37

What exactly did you want to know about it? The current scheme has come about by iterating over literally hundreds of JCT-VC proposals, so there isn't just one document I can point you to.

sirt

23rd February 2013, 20:13

pieter3d,

Thanks for your explanations. Could you tell me when an HEVC encoder will be available to the public? I read that an HEVC encoder implementation project led by a Chinese developer has currently been suspended, so it seems such a project is not yet as strong as x264. Moreover, considering that many people don't even have a Blu-ray player or any H.264 decoder, which means they simply don't use x264 but XviD instead for their personal encodes, when do you think HEVC will definitely take the lead? What about HEVC and HEVC Blu-ray players?

pieter3d

23rd February 2013, 20:15

I know of no 4k or HEVC disc format; likely it will just be online streaming from now on. The only public encoder right now is HM, the reference model. However, it is slow and unoptimized, and doesn't plug well into existing tools. I suspect ffmpeg patches with HEVC support aren't that far off now that the standard is finalized.
I expect to see hardware decoders and encoders show up in consumer devices next year sometime.

hajj_3

23rd February 2013, 22:06

HEVC will probably be used in the possible upcoming 4k bluray format. An encoder isn't really important at the moment, a decoder is. I doubt intel will have hevc decoding in their upcoming haswell chips but you never know.

Yellow_

1st March 2013, 14:04

Hi, i started a thread here: http://forum.doom9.org/showthread.php?t=167312 before finding this one, is the source in the link HEVC?

pieter3d

4th March 2013, 06:06

Not a chance, there aren't any consumer products with hevc ready yet

Stephen R. Savage

8th March 2013, 05:43

What is the status of full resolution chroma in HEVC? The description you gave in the first posts indicates that the standard is currently YV12-only, which is, to say the least, highly disappointing.

pieter3d

8th March 2013, 06:04

What is the status of full resolution chroma in HEVC? The description you gave in the first posts indicates that the standard is currently YV12-only, which is, to say the least, highly disappointing.

Correct, the spec as of today is just 4:2:0, where chroma is downsampled by a factor of two in both the horizontal and vertical directions. However, a range extension profile is in the works for 4:2:2 and 4:4:4. But I would challenge anyone to spot the difference between them when watching an HD video. The human eye is not nearly as sensitive to color as it is to texture. So I wouldn't view the omission of the higher fidelity chroma profiles in this first version as disappointing.

Consumer content is pretty much all 4:2:0, the higher chroma modes are more for post-production and archival. I suppose it would also be desirable in a still picture profile, for largely the same reasons.
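
For a sense of scale, here is a small sketch of how many samples each chroma format carries per picture. The function names are made up for illustration, and it assumes even picture dimensions so the divisions are exact.

```python
# Horizontal and vertical chroma subsampling factors per format.
SUBSAMPLING = {"4:2:0": (2, 2), "4:2:2": (2, 1), "4:4:4": (1, 1)}

def chroma_plane_size(width, height, fmt):
    """Size of ONE chroma plane (Cb or Cr) for a luma plane of width x height."""
    sub_w, sub_h = SUBSAMPLING[fmt]
    return width // sub_w, height // sub_h

def samples_per_picture(width, height, fmt):
    """Luma samples plus the two chroma planes."""
    cw, ch = chroma_plane_size(width, height, fmt)
    return width * height + 2 * cw * ch

for fmt in SUBSAMPLING:
    print(fmt, chroma_plane_size(1920, 1080, fmt), samples_per_picture(1920, 1080, fmt))
```

For 1080p this works out to half as many total samples in 4:2:0 as in 4:4:4, before any compression is applied.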

Poutnik

16th March 2013, 07:47

pieter3d

16th March 2013, 07:54

Back to H265 and pleasing 4:4:4 poster at the same time......

Is there any specific difference to H264/AVC,
how H265 addresses encoding video with solid stable colors, like cartoons and computer graphics video?

Yes, there is a tool called transform_skip that applies optionally to any 4x4 TU (either luma or chroma). You can see the dramatic improvement in quality when images have sharp details like small text here (the powerpoint): http://phenix.int-evry.fr/jct/doc_end_user/documents/9_Geneva/wg11/JCTVC-I0408-v2.zip

At the time this tool was proposed with this document, it was for intra only, but in the final spec that distinction isn't made (it can apply to both inter and intra 4x4s).

Note that coding solid colors is easy, it is the sharp transitions that usually create artifacts which this tool can address.

Guest

16th March 2013, 16:15

The whole chroma subsampling discussion/debate has been moved to a new thread at the OP's request.

http://forum.doom9.org/showthread.php?t=167428

Please continue that line of discussion in the linked thread and use this thread for technical issues specifically related to HEVC.

reuven1984

19th May 2013, 08:28

Hi Peter3d,
I was hoping you can help me with this issue - regarding Transform Unit size restriction:
I'm trying to encode a video with CUs (Coding Units) up to 32x32 pixels, and with TUs (Transform Units) up to 16x16 pixels.
I'm setting MaxCUWidth, MaxCUHeight = 32, MaxPartitionDepth=3, and QuadtreeTULog2MaxSize = 4 (s.t. 2^4 = 16). QuadtreeTUMaxDepthInter,Intra = 3.
The TULog2MaxSize parameter only controls the intra blocks. That is, in intra blocks I get a max TU of 16x16, but in inter blocks, for some reason, I sometimes get bigger TUs; I have a few inter TUs of size 32x32.
How do I prevent this from happening? How can I restrict both intra AND inter blocks to TUs <= 16x16?
thanks

Sulik

19th May 2013, 17:37

If the inter CU is not coded the not-coded TU is as large as the CU - I don't think you can avoid that (makes no sense to split a non-coded TU into smaller TUs)

pieter3d

19th May 2013, 20:08

If the inter CU is not coded the not-coded TU is as large as the CU - I don't think you can avoid that (makes no sense to split a non-coded TU into smaller TUs)

A 32x32 CU can be skipped, in which case the TU doesn't exist. So although there is no split down to 16x16, there are no 32x32 operations to perform.

reuven1984

21st May 2013, 13:29

pieter3d: I also got a few TUs of size 32x32 in non-split blocks, meaning regular CUs (with merge flag = 0).
How can I avoid that?
(Another thing: these TUs had no coefficients - all of their coefficients were zero - but they were still marked as 32x32 TUs.)

Sulik: Did you mean "If the inter TU is not coded..." or "if the inter CU is ..."?

pieter3d

21st May 2013, 17:20

So isn't that good enough? If you have a 32x32 TU that is all 0's, you don't have to perform the large transform. You could always treat it as four 0-TUs that are 16x16

Shevach

28th July 2013, 08:09

pieter3d

31st July 2013, 00:09

Dear experts

Let me share "HEVC Overview" presentation (113 slides) which I prepared, it's located at:
https://app.box.com/s/rxxxzr1a1lnh7709yvih

Notice that the overview has been reviewed by several experts. Moreover, the overview has been discussed within LinkedIn HEVC/H.265 technical group (see the link
http://www.linkedin.com/groups/Detailed-HEVC-Presentation-3724292.S.255803948?qid=887aa617-29ce-4469-bcc5-2722d05b8d65&trk=group_most_popular-0-b-ttl&goback=%2Egmp_3724292).

I brought up many aspects of HEVC implementation and I would like to discuss it in Doom9 forum.
Some comments:

VPS is optional
The variable N in CTB size is usually half CTB size, so that the mnemonics 2Nx2N etc make sense.
May want to mention that intra prediction follows the TU tree
Perhaps mention spatial neighbor scaling for AMVP
Transform can actually be implemented with 28-bit precision
Slide 47, quantization will typically operate on columns because the coeff block is processed first by columns.
Visual artefacts on large transform blocks - this isn't always the case. It is heavily content and bit-rate dependent.
Should mention deblocking processing order: Vertical first, then horizontal. When done in-loop with CTU decode (single pass), this presents challenges and requires access to neighboring slice parameters (tc/beta offsets).
Deblocking boundary strength calculation for bs = 1 doesn't mention the rules regarding whether the MVs point to the same or different pics
May want to mention that for inter boundaries, chroma is never deblocked.
May want to mention SAO has substantial gains when there is no biprediction
SAO filter may also cross tile boundaries, not just deblocker.
MTU matching can still be done with tiles, since the structure may change from pic to pic.
No mention of fractional CTB rules at right and bottom picture boundaries.
No mention of the relationship and rules between tiles and slices
No mention of temporal MV subsampling (16x16 granularity storage)
No mention of the (near) lossless tools transform_skip, transquant_bypass, PCM
No mention of custom quant matrices
No mention of delta QP operation.

Shevach

31st July 2013, 07:53

@Pieter3d

Thanks for your professional response.

According to your profound comments I guess you are Pieter K. Am I right? If so you should remember me from JCT-VC meetings.

Till now only three experts (including you) have reviewed the overview. My purpose is to compile a detailed, complete, free-access presentation on HEVC based on discussions with and feedback from experts, plus my own opinion. Therefore I use the words "prepared by" in the title instead of "author".

Regarding your comments:

1) VPS is optional - agreed

2) ... mnemonics 2Nx2N etc make sense - very confusing, especially for AVC/H.264 guys

3) ... intra prediction follows the TU tree - I tried to explain it (e.g. in slide #35); apparently I need to add more comments.

4) ... spatial neighbor scaling for AMVP -
good point, it's worth also stress the following point:
Unlike AVC/H.264, neighbors with a different prediction direction are taken into consideration. For example, if the current block is forward-only and a neighboring one is backward-only, then the backward MVs of the neighboring block are incorporated in the AMVP process with corresponding scaling.

5) Transform can actually be implemented with 28-bit precision - it's worth adding, with the DR calculation.

6) ... columns because the coeff block is processed first by columns -
correct, unlike AVC/H.264, HEVC defines a column-row order for the transform. This point is mentioned in the section "Transforms and quantization" in the IEEE paper: "HEVC Complexity and Implementation Analysis".

7) Visual artefacts on large transform blocks - this isn't always case ...
do you know any heuristics to determine when the artefacts appear and when not?

maxlovic

24th October 2013, 07:40

Hi all.
I did some research on the efficiency of the HEVC, VP9 and Daala codecs. By the codecs I mean the WebM VP9, HM and Xiph's Daala encoders, not the standards themselves, but nevertheless.
You can find the research here (http://maxsharabayko.blogspot.ru/2013/10/next-generation-video-codecs-hevc-vp9.html).

ekaveera

31st December 2013, 12:16

Hi, nice explanation. Can you also explain how quantization is done in HEVC? I am working on RDOQ in the HM reference code. I have only a vague idea of how exactly Rate-Distortion Optimization is done in HEVC. I would be happy if someone could explain this. Thanks in advance.

pieter3d

1st January 2014, 03:29

Forward quantization is pretty much up to whatever encoder you write. You just have to keep in mind the way a compliant decoder performs inverse quantization:

The QP value for a CU is determined, a number between 0 and 51 for normal 8-bit sequences. Then a scale factor is derived: scale_factor = levelScale[qp%6]<<(qp/6), where levelScale = { 40, 45, 51, 57, 64, 72 }. This creates an exponential relationship between qp and scale_factor. Then essentially the coefficients are multiplied by this value and shifted down by (bitDepth + log2TransformSize - 5). There are a few other details, but the spec is pretty easy to follow for this.
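
The steps above can be sketched roughly as follows. This is a simplified illustration, not the normative process: the function names are made up, and the flat scaling factor m = 16, the rounding term and the clip to 16 bits are my reading of the "other details" in the spec's scaling process, so treat them as assumptions and check the spec before relying on them.

```python
LEVEL_SCALE = [40, 45, 51, 57, 64, 72]

def scale_factor(qp):
    """levelScale[qp % 6] << (qp / 6): doubles every 6 QP steps."""
    return LEVEL_SCALE[qp % 6] << (qp // 6)

def dequantize(level, qp, bit_depth=8, log2_transform_size=2, m=16):
    """Inverse-quantize one coefficient level (simplified sketch).

    m = 16 is assumed to be the flat default when no custom
    quant matrix (scaling list) is in use.
    """
    shift = bit_depth + log2_transform_size - 5
    # Multiply, add half of the divisor for rounding, then shift down.
    value = (level * m * scale_factor(qp) + (1 << (shift - 1))) >> shift
    # Assumed clip of the result to a signed 16-bit range.
    return max(-32768, min(32767, value))

# The exponential relationship: +6 in QP doubles the scale factor.
print(scale_factor(22), scale_factor(28))
print(dequantize(3, 22))
```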

RDO is a different topic though....

ekaveera

1st January 2014, 07:42

Thanks Pieter. Please don't think that I am deviating from the topic. As far as I know, RDOQ must be done for every coding unit and for each mode. That is, for each CU in a particular mode (intra, inter, etc.) we need to find the rate and distortion and compute the cost function J = D + (lambda)R. The mode which yields the least J is selected and is RD-optimal. My question is how the rate, distortion and lambda are to be estimated. I went through the HM reference code but could not follow the code flow or work out which algorithm is being used for RDOQ. This might be a very basic question; I would appreciate it if you could explain or provide a link on this area.

foxyshadis

3rd January 2014, 03:19

Mode decision mostly boils down to "test everything, pick whatever costs the least." Since testing literally everything is stupidly slow, the complication comes from the tons of speedups to pare down the test space: Take a guess on where to start looking, compare mostly via fullpel SAD (or even low-res SAD), then only test subpixel on the decent matches, then only transform the closest candidates, then only attempt to entropy code the lowest energy candidates, then only try trellis quant on the smallest result(s) before settling and moving on to the next block. Each step is a sieve that reduces the problem space for successively slower steps, so you can spend time where it's more important. AVC and HEVC also include a number of predictors for each block, like skip and direct modes, so they need to be tested too; so you can bypass everything above entirely if heuristics tell you one of the predictors is already good enough. Most encoders let you tweak how much they ignore, so you can find your own speed/optimum balance.
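
A toy sketch of the sieve idea, with just two stages: rank all candidates by a cheap cost, then spend the expensive evaluation only on the survivors and pick the one with the lowest J = D + lambda * R. Every name and number here is hypothetical; no real encoder is structured this simply.

```python
def pick_mode(candidates, cheap_cost, full_eval, lam, keep=2):
    """Return (best_candidate, best_cost) after a two-stage sieve."""
    # Stage 1: cheap estimate (e.g. SAD) on everything, keep the best few.
    survivors = sorted(candidates, key=cheap_cost)[:keep]
    best, best_j = None, None
    # Stage 2: expensive distortion + rate measurement on survivors only.
    for cand in survivors:
        distortion, bits = full_eval(cand)
        j = distortion + lam * bits
        if best_j is None or j < best_j:
            best, best_j = cand, j
    return best, best_j

# Made-up per-mode statistics standing in for real measurements.
stats = {
    "skip":  {"sad": 900,  "ssd": 5000, "bits": 1},
    "16x16": {"sad": 400,  "ssd": 1200, "bits": 40},
    "8x8":   {"sad": 350,  "ssd": 900,  "bits": 95},
    "intra": {"sad": 1200, "ssd": 800,  "bits": 300},
}
evaluated = []  # records which modes paid for the expensive stage

def cheap(mode):
    return stats[mode]["sad"]

def full(mode):
    evaluated.append(mode)
    return stats[mode]["ssd"], stats[mode]["bits"]

best_mode, best_cost = pick_mode(list(stats), cheap, full, lam=10)
print(best_mode, best_cost, evaluated)
```

With these numbers only two of the four modes ever reach the expensive stage, which is the whole point: raising `keep` trades speed for a better chance of finding the true minimum.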

Usually intra isn't even considered unless no viable candidate has been found with basic motion estimation first, because it's so much larger.

One reason why the HM 12.1 can be higher PSNR than even x265's placebo mode is because it doesn't use as many shortcuts on its full mode. It doesn't bother to sieve out as much, it just tests everything within a specific range. It may take a year to encode a whole movie, but it will generate a more optimal solution for each individual picture. (Without rate control, adaptive GOP, CU-tree, or forward prediction, it does a much worse job at global optimization. But at least it completes in a year, instead of a millennium.)

ekaveera

24th January 2014, 11:39

Hi, in video coding, why do we mimic the decoder functionality on the encoder side? Also, what is the significance of the rounding offset in the quantization of DCT coefficients? Can anyone explain in detail?

pieter3d

24th January 2014, 18:04

The encoder must generate the same picture that the receiving decoder will, because that is the picture the decoder will use as reference. If the encoder only used the source picture as reference, then differences would accumulate over time and become very noticeable. This is referred to as error drift.
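
A toy illustration of that drift, using 1-D predictive coding rather than anything HEVC-specific. Each sample is predicted from the previous one and only the quantized residual is sent. All names are made up for the example.

```python
def quantize(residual, step):
    """Round-to-nearest scalar quantizer, symmetric around zero."""
    sign = -1 if residual < 0 else 1
    return sign * ((abs(residual) + step // 2) // step)

def encode(samples, step, closed_loop):
    levels, ref = [], 0
    for s in samples:
        level = quantize(s - ref, step)
        levels.append(level)
        if closed_loop:
            ref = ref + level * step  # same reconstruction the decoder makes
        else:
            ref = s                   # source sample: decoder never sees this
    return levels

def decode(levels, step):
    out, ref = [], 0
    for level in levels:
        ref = ref + level * step
        out.append(ref)
    return out

def max_error(samples, step, closed_loop):
    recon = decode(encode(samples, step, closed_loop), step)
    return max(abs(s - r) for s, r in zip(samples, recon))

ramp = [3, 6, 9, 12, 15, 18, 21, 24]
print(max_error(ramp, 10, closed_loop=True))
print(max_error(ramp, 10, closed_loop=False))
```

On this slow ramp the open-loop encoder quantizes every small step to zero, so the decoder never moves and the error keeps growing, while the closed-loop error stays within half a quantizer step.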

DCT coefficients get quantized (precision reduced) in the process of encoding. Since the specification only tells you how to perform inverse quantization, the encoder may perform the forward quantization by any appropriate method. Typically some value is added to the coefficient first before the quantization (usually a division of some kind) to allow for rounding. For example if the scale factor was 10, and a coefficient is 59 before quantization, simply doing 59/10 = 5 seems like a bad result (integer division). So we can add a value: (59+5)/10 = 6. This accounts for the truncation towards 0 during integer division.
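
The same arithmetic as a small sketch. The function name is made up, and handling the sign separately is just one common way to make the offset behave symmetrically around zero; real encoders differ in the details.

```python
def forward_quantize(coeff, step, offset):
    """Quantize by integer division after adding a rounding offset.

    offset = step // 2 gives round-to-nearest; a smaller offset
    biases results toward zero (a wider dead zone).
    """
    sign = -1 if coeff < 0 else 1
    return sign * ((abs(coeff) + offset) // step)

print(forward_quantize(59, 10, 0))   # 5: plain truncation
print(forward_quantize(59, 10, 5))   # 6: rounded to nearest
print(forward_quantize(-59, 10, 5))  # -6: symmetric for negatives
print(forward_quantize(6, 10, 5))    # 1
print(forward_quantize(6, 10, 3))    # 0: smaller offset zeroes small coefficients
```

An offset below half the step trades a little distortion for fewer nonzero coefficients, which is why encoders often pick something smaller than step/2.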

ekaveera

26th January 2014, 08:18

Thanks Pieter. So do you mean to say that 5 is the rounding offset? But we could also choose 2 or 3, right? Basically, how do we decide which rounding offset to choose? In the original HM code, the rounding offset is constant for all DCT coefficients in a block, but I could not understand what mathematical model is used to choose a particular rounding offset. Can you give me any research paper links on this?

LoRd_MuldeR

26th January 2014, 15:34

It's up to the encoder to choose the "quantized" coefficients. So every encoder may use its own algorithm.

For an explanation of x264's "trellis" algorithm (and also a summary of "uniform deadzones"), please see the description here:
http://akuvian.org/src/x264/trellis.txt