How HEVC/H.265 works, technical details & diagrams

pieter3d · 31st January 2013, 22:49

Hi all,

Now that the HEVC standard is finalized, I'd like to take this opportunity to explain how HEVC coding works in plain-ish English. About me: I am a hardware codec engineer who has been working with HEVC for over a year now and I have participated in the JCT-VC meetings where HEVC has taken shape. The spec itself, located at http://phenix.int-evry.fr/jct/doc_en...nt.php?id=7243, is rather hard to follow. This post will assume you are reasonably familiar with the coding techniques in H.264/AVC, hereafter referred to as AVC. This overview will gloss over a few details, but feel free to ask for any elaboration.

To start with: HEVC is actually a bit simpler conceptually than AVC, and lots of things in the spec are done to make life easier for the hardware codec designer (like me).

Picture partitioning
Instead of macroblocks, HEVC pictures are divided into so-called coding tree blocks, or CTBs for short, which appear in the picture in raster order. Depending on the stream parameters, they are either 64x64, 32x32 or 16x16. Each CTB can be split recursively in a quad-tree structure, all the way down to 8x8. So for example a 32x32 CTB can consist of three 16x16 and four 8x8 regions. These regions are called coding units, or CUs. CUs are the basic unit of prediction in HEVC. If you have been paying attention you have already inferred that CUs can be 64x64, 32x32, 16x16 or 8x8. The CUs in a CTB are traversed and coded in Z-order. Example ordering in a 64x64 CTB:
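To make the Z-order traversal concrete, here is a minimal Python sketch. The should_split callback is hypothetical: it stands in for the split flags a real decoder would read from the bitstream. The example reproduces the 32x32 case mentioned above (three 16x16 CUs and four 8x8 CUs).

```python
def z_order_cus(x, y, size, should_split, min_size=8):
    """Yield (x, y, size) for each CU in a CTB, in Z-order.

    should_split(x, y, size) is a hypothetical callback standing in for the
    split flags an actual decoder would read from the bitstream.
    """
    if size > min_size and should_split(x, y, size):
        half = size // 2
        # Z-order: top-left, top-right, bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                yield from z_order_cus(x + dx, y + dy, half,
                                       should_split, min_size)
    else:
        yield (x, y, size)


# Split the 32x32 CTB once, then split only its top-right 16x16 quadrant.
def example_split(x, y, size):
    return size == 32 or (size == 16 and (x, y) == (16, 0))


cus = list(z_order_cus(0, 0, 32, example_split))
for cu in cus:
    print(cu)
```

The top-left 16x16 CU comes first, then the four 8x8 CUs of the top-right quadrant, then the two bottom 16x16 CUs.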

Like in AVC, a sequence of CTBs is called a slice. A picture can be split up into any number of slices, or the whole picture can be just one slice. In turn, each slice is split up into one or more “slice segments”, each in its own NAL unit. Only the first slice segment of a slice contains the full slice header, and the rest of the segments are referred to as dependent slice segments. A dependent slice segment is not decodable on its own; the decoder must have access to the first slice segment of the slice. This splitting of slices exists to allow for low-delay transmission of pictures without the coding efficiency loss of using many full slices. For example, a camera could send out a slice segment of the first CTB row so that the playback device on the other side of the network can begin drawing the picture before the camera is done coding the second CTB row. This can help achieve low-latency video conferencing.
HEVC does not support any interlaced tools (no more MBAFF hooray!). Interlaced video can still be coded, but it must be coded as a sequence of field pictures. No mixing of field and frame pictures.

Residual coding
For each CU, a residual signal is coded. HEVC supports four transform sizes: 4x4, 8x8, 16x16 and 32x32. Like AVC, the transforms are integer transforms based on the DCT. However, the transform used for intra 4x4 luma blocks is based on the DST (discrete sine transform). There is no Hadamard transform like in AVC. The basis matrix uses coefficients requiring 7 bits of storage (plus a sign), so it is quite a bit more precise than AVC. The higher precision and larger sizes of the transforms are among the main reasons HEVC performs so much better than AVC.
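As a rough illustration of the precision claim, the sketch below checks the 4x4 matrices. The matrix values are what the 4x4 DCT and DST cores are believed to be in the spec; treat them as an assumption and verify against the spec before relying on them. The helper names are made up for this example.

```python
# 4x4 core transform matrices, believed to match the HEVC spec
# (assumption: verify the exact values against the spec).
DCT4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]
DST4 = [
    [29,  55,  74,  84],
    [74,  74,   0, -74],
    [84, -29, -74,  55],
    [55, -84,  74, -29],
]


def gram(m):
    """Return m * m^T; a (nearly) scaled identity means near-orthogonal rows."""
    n = len(m)
    return [[sum(m[i][k] * m[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_magnitude_bits(m):
    """Bits needed for the largest coefficient magnitude (sign not counted)."""
    return max(abs(v) for row in m for v in row).bit_length()


for name, m in (("DCT4", DCT4), ("DST4", DST4)):
    print(name, "magnitude bits:", max_magnitude_bits(m))
    for row in gram(m):
        print("  ", row)
```

The diagonal of each product should sit close to 128 * 128 = 16384 and the off-diagonal entries close to zero, which is what lets an integer matrix with small coefficients approximate an orthogonal transform.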
The residual signal of a CU consists of one or more transform units, or TUs. The CU is recursively split with the same quad-tree method as the CTB splitting, with the smallest allowable block being of course 4x4, the smallest TU. For example a 16x16 CU could contain three 8x8 TUs and four 4x4 TUs. For each luma TU there is a corresponding chroma TU of one quarter the size (half the width and half the height), so a 16x16 luma TU comes with two 8x8 chroma TUs. Since there is no 64x64 transform, a 64x64 CU must be split at least once into four 32x32 TUs. The only exception to this is for skipped CUs, when there is no residual signal at all. Note that there is no 2x2 chroma TU size. Since the smallest possible CU is 8x8, 4x4 luma TUs always come in groups of four that cover an 8x8 region, and that region thus consists of four luma 4x4s and two chroma 4x4s (as opposed to eight 2x2s). Like the CUs in a CTB, TUs within a CU are also traversed in Z-order.
If a TU has size 4x4, the encoder has the option to signal a so-called “transform skip” flag, where the transform is simply bypassed altogether, and the transmitted coefficients are really just spatial residual samples. This can help code crisp small text for example.
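The TU rules above can be sketched as a small recursive function. This is only an illustration under stated assumptions: 4:2:0 sampling, a hypothetical should_split callback in place of the real split flags, and chroma blocks simply listed after the luma blocks they belong to (the exact bitstream order is not modeled).

```python
def residual_blocks(x, y, size, should_split):
    """Yield (plane, x, y, size) for each transform block of a CU.

    Assumes 4:2:0 and a CU size of at least 8. should_split(x, y, size) is a
    hypothetical stand-in for the TU split flags. Chroma coordinates are
    given in chroma sample units.
    """
    # No 64x64 transform exists, so a 64x64 node is always split.
    split = size == 64 or should_split(x, y, size)
    if not split:
        yield ("Y", x, y, size)
        for plane in ("Cb", "Cr"):
            yield (plane, x // 2, y // 2, size // 2)
        return
    half = size // 2
    if half == 4:
        # 8x8 region split into four 4x4 luma TUs: chroma is not split to
        # 2x2, it stays as one 4x4 block per plane for the whole region.
        for dy in (0, 4):
            for dx in (0, 4):
                yield ("Y", x + dx, y + dy, 4)
        for plane in ("Cb", "Cr"):
            yield (plane, x // 2, y // 2, 4)
        return
    for dy in (0, half):            # Z-order over the four quadrants
        for dx in (0, half):
            yield from residual_blocks(x + dx, y + dy, half, should_split)


# The 16x16 example from the text: three 8x8 TUs and four 4x4 TUs.
example = list(residual_blocks(
    0, 0, 16, lambda x, y, s: s == 16 or (x, y) == (8, 8)))
for block in example:
    print(block)
```

Note that no chroma block smaller than 4x4 ever appears, and that a 64x64 CU yields four 32x32 luma TUs even when the callback never asks for a split.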
Inverse quantization is essentially the same as in AVC.
The way a TU's coefficients are coded in the bitstream is vastly different from AVC. First, the bitstream signals a last xy position, indicating the position of the last non-zero coefficient in scan order. Then the decoder, starting at this last position, scans backwards until it reaches position 0,0, known as the DC coefficient. The coefficients are grouped into 4x4 coefficient groups. The coefficients are scanned diagonally (down and left) within each group, and the groups are scanned diagonally as well. For each group, the bitstream signals if it contains any non-zero coefficients. If so, it then signals a bit for each of the 16 coefficients in the group to indicate which are non-zero. Then for each of the non-zero coefficients in a group the remainder of the level is signaled. Finally the signs of all the non-zero coefficients in the group are decoded, and the decoder moves on to the next group.
HEVC has an optional tool called sign bit hiding. If enabled and there are enough coefficients in the group, one of the sign bits is not coded, but rather inferred. The missing sign is inferred from the least significant bit of the sum of all the coefficients' absolute values: an even sum means positive, an odd sum means negative. This means that if the encoder is coding a coefficient group and the inferred sign is not the correct one, it has to adjust one of the coefficients up or down to fix that. The reason this tool works is that sign bits are coded in bypass mode (not compressed) and thus are expensive to code. By not coding some of the sign bits, the savings more than make up for any distortion caused by adjusting one of the coefficients.
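A minimal sketch of the sign bit hiding parity rule follows. Two things are assumptions of this sketch rather than statements from the post: that the hidden sign belongs to the first non-zero level in forward scan order, and the very naive way the encoder side repairs the parity (a real encoder would choose the adjustment by rate-distortion cost). The check for "enough coefficients in the group" is not modeled.

```python
def inferred_sign(abs_levels):
    """Sign the decoder infers for the hidden coefficient: +1 or -1.

    Even sum of absolute levels means positive, odd sum means negative.
    """
    return -1 if sum(abs_levels) % 2 else 1


def hide_sign(levels):
    """Naive encoder-side fix-up for one coefficient group.

    levels: signed levels in forward scan order. The sign of the first
    non-zero level is assumed to be the one left out of the bitstream.
    """
    levels = list(levels)
    nonzero = [i for i, v in enumerate(levels) if v != 0]
    if not nonzero:
        return levels               # nothing to hide in an empty group
    wanted = 1 if levels[nonzero[0]] > 0 else -1
    if inferred_sign(abs(v) for v in levels) != wanted:
        # Flip the parity by growing the last non-zero level by one.
        i = nonzero[-1]
        levels[i] += 1 if levels[i] > 0 else -1
    return levels


print(hide_sign([-3, 0, 2, 0, 1]))
print(hide_sign([2, 0, -2]))
```

In the first call the absolute values sum to 6 (even, so the decoder would infer a positive sign), but the hidden coefficient is negative, so one level has to change. In the second call the parity already matches and the group is left alone.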
Example of the scan process of a 16x16 TU:
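The diagonal scan can also be written out in a few lines of Python. This is a sketch of the diagonal scan only, under the assumption that both the groups and the coefficients inside a group follow the same up-right diagonal pattern in forward order (so the decoder, going backwards, moves down and left). Any other scan patterns the spec may use for particular block types are not modeled, and the function names are made up for this example.

```python
def diagonal_scan(n):
    """Forward up-right diagonal scan of an n x n block as (x, y) pairs."""
    order = []
    for d in range(2 * n - 1):                      # one anti-diagonal at a time
        for y in range(min(d, n - 1), -1, -1):      # bottom-left to top-right
            x = d - y
            if x < n:
                order.append((x, y))
    return order


def tu_scan(size):
    """Forward coefficient scan of a size x size TU built from 4x4 groups."""
    groups = diagonal_scan(size // 4)
    inner = diagonal_scan(4)
    return [(gx * 4 + x, gy * 4 + y) for gx, gy in groups for x, y in inner]


def decode_order(size, last_xy):
    """Positions the decoder visits: from the last position back to DC."""
    scan = tu_scan(size)
    return scan[scan.index(last_xy)::-1]


print(diagonal_scan(4))
print(decode_order(16, (1, 1)))
```

For a 16x16 TU, tu_scan(16) visits all 16 positions of the top-left group first, then moves to the group below it, and so on along the group diagonals.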

Prediction units
A CU is split using one of eight partition modes. These eight modes have the following mnemonics: 2Nx2N, 2NxN, Nx2N, NxN, 2NxnU, 2NxnD, nLx2N, nRx2N. Here the uppercase N represents half the length of a CU’s side and lowercase n represents one quarter. For a 32x32 CU, N = 16 and n = 8.
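The eight mnemonics map to PU rectangles as in the following sketch. The function name and the (x, y, width, height) tuple layout are choices made for this example only.

```python
def prediction_units(mode, size):
    """Return the PUs of a size x size CU as (x, y, width, height) tuples."""
    N, n = size // 2, size // 4     # N = half a side, n = a quarter
    layouts = {
        "2Nx2N": [(0, 0, size, size)],
        "2NxN":  [(0, 0, size, N), (0, N, size, N)],
        "Nx2N":  [(0, 0, N, size), (N, 0, N, size)],
        "NxN":   [(0, 0, N, N), (N, 0, N, N), (0, N, N, N), (N, N, N, N)],
        # Asymmetric modes: one PU is a quarter of the CU, the other the rest.
        "2NxnU": [(0, 0, size, n), (0, n, size, size - n)],
        "2NxnD": [(0, 0, size, size - n), (0, size - n, size, n)],
        "nLx2N": [(0, 0, n, size), (n, 0, size - n, size)],
        "nRx2N": [(0, 0, size - n, size), (size - n, 0, n, size)],
    }
    return layouts[mode]


for mode in ("2Nx2N", "2NxN", "Nx2N", "NxN",
             "2NxnU", "2NxnD", "nLx2N", "nRx2N"):
    print(mode, prediction_units(mode, 32))
```

For a 32x32 CU, 2NxnU gives a 32x8 PU on top of a 32x24 PU, and nRx2N gives a 24x32 PU to the left of an 8x32 PU.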

Thus a CU consists of one, two or four prediction units, or PUs. Note that this division is not recursive. A CU is either inter- or intra-coded, so if a CU is split into two PUs, both of them are inter- or both of them are intra-coded. Intra-coded CUs may only use the partition modes 2Nx2N or NxN, so intra PUs are always square. A CU may also be skipped, which implies inter coding and a partition mode of 2Nx2N. NxN partition mode is only allowed when the CU is the smallest size allowed (8x8 normally). The idea is that if you want four separate predictions in a CU, you might as well just split (if you can) and create four separate CUs. Also, inter CUs are not allowed to be NxN if the CU is 8x8, meaning no 4x4 motion compensation at all. The smallest inter block sizes are 8x4 and 4x8, and these can never be bidirectional. This was done to minimize worst case memory bandwidth (see section below on motion compensation).
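The partition rules in this paragraph can be collected into one small check. The sketch below encodes only the rules stated in the text; any further restrictions in the spec (for example on when the asymmetric modes are available) are deliberately left out, and the function name and parameters are hypothetical.

```python
def partition_allowed(mode, cu_size, is_intra, min_cu_size=8):
    """Check a partition mode against the rules stated in the text only."""
    if is_intra:
        # Intra CUs are 2Nx2N, or NxN at the smallest CU size.
        if mode == "2Nx2N":
            return True
        return mode == "NxN" and cu_size == min_cu_size
    if mode == "NxN":
        # Inter NxN needs the smallest CU size, and is ruled out for 8x8
        # CUs because that would mean 4x4 motion compensation.
        return cu_size == min_cu_size and cu_size > 8
    return True


print(partition_allowed("NxN", 8, is_intra=True))     # intra 4x4 PUs
print(partition_allowed("NxN", 8, is_intra=False))    # no 4x4 inter PUs
print(partition_allowed("NxN", 16, is_intra=False, min_cu_size=16))
```

With the usual minimum CU size of 8x8, this means inter NxN is effectively never used; it only becomes possible when the stream sets a larger minimum CU size.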

Intra prediction
Intra prediction in a CU follows the TU tree exactly. When an intra CU is coded using the NxN partition mode, the TU tree is forcibly split at least once, ensuring the intra and TU tree match. This means that the intra operation is always 32x32, 16x16, 8x8 or 4x4.
In HEVC, there are, wait for it, 35 different intra modes, as opposed to the 9 in AVC. 33 are directional and there is a DC and Planar mode as well. Like AVC, intra prediction requires two 1D arrays that contain the upper and left neighboring samples, as well as an upper-left sample. The arrays are twice as long as the intra block size, extending below and right of the block. Example for an 8x8 block:

Depending on the position of the intra prediction block, any number of these neighboring samples may not be available. For example they could be outside the picture, in another slice, or belong to a CU that will be decoded in the future (causality violation). Any samples that are not available are filled in using a well-defined process after which the neighboring arrays are completely full of valid samples. Depending on the block size and intra mode, the neighboring arrays are filtered (smoothed).
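One way to picture the fill-in step is the sketch below. It follows the substitution rule as it is commonly described for HEVC (mid-gray when nothing is available, otherwise copy from the nearest earlier sample along one pass from the bottom-left to the top-right); treat the details as an assumption and check them against the spec. Unavailable samples are represented as None, and the neighbor smoothing filter is not modeled.

```python
def fill_reference_samples(samples, bit_depth=8):
    """Replace unavailable (None) neighbors so the array is fully populated.

    samples: one list running from the bottom-most left neighbor, up the
    left column, through the upper-left corner, then along the top row to
    the right. For an n x n block that is 4 * n + 1 entries.
    """
    if all(s is None for s in samples):
        # Nothing available at all: use the mid-range value.
        return [1 << (bit_depth - 1)] * len(samples)
    out = list(samples)
    if out[0] is None:
        # Seed the start of the pass with the first available sample.
        out[0] = next(s for s in out if s is not None)
    for i in range(1, len(out)):
        if out[i] is None:
            out[i] = out[i - 1]     # copy from the previous position
    return out


print(fill_reference_samples([None, None, 10, None, 20, None]))
print(fill_reference_samples([None] * 5))
```

After this pass every entry holds a valid sample, so the prediction itself never has to care about availability.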
The angular prediction process is similar to AVC, just with more modes and a unified algorithm that can handle all block sizes. In addition to the 33 angular modes, there is a DC mode which simply uses a single value for the prediction, and Planar, which does a smooth gradient of the neighbor samples.
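The two non-angular modes are simple enough to sketch. The code assumes the neighbor arrays are already fully populated, that top[i] and left[i] hold the i-th sample above and to the left of the block, and that the block size n is a power of two. The planar weighting follows the usual description of HEVC planar prediction; any extra edge smoothing the spec applies to DC blocks is left out.

```python
def predict_dc(top, left, n):
    """DC prediction: every sample is the rounded mean of the 2n neighbors."""
    shift = n.bit_length()          # equals log2(2n) for power-of-two n
    dc = (sum(top[:n]) + sum(left[:n]) + n) >> shift
    return [[dc] * n for _ in range(n)]


def predict_planar(top, left, n):
    """Planar prediction: average of a horizontal and a vertical blend.

    top and left must hold at least n + 1 samples: top[n] is the first
    sample to the upper right of the block, left[n] the first one to the
    lower left.
    """
    shift = n.bit_length()
    pred = [[0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            horizontal = (n - 1 - x) * left[y] + (x + 1) * top[n]
            vertical = (n - 1 - y) * top[x] + (y + 1) * left[n]
            pred[y][x] = (horizontal + vertical + n) >> shift
    return pred


top = [10] * 16
left = [30] * 16
print(predict_dc(top, left, 8)[0])
print(predict_planar(top, left, 8)[0])
```

The four planar weights always add up to 2n, which is why a single shift by log2(2n) normalizes the result.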
Intra mode coding is done by building a 3-entry list of modes. This list is generated using the left and above modes, along with some special derivations of them to come up with 3 unique modes. If the desired mode is in the list, the index is sent, otherwise the mode is sent explicitly.
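The list construction can be sketched as follows. The mode numbering (0 = Planar, 1 = DC, 2 to 34 angular, 26 = vertical) and the exact derivation are given here as they are commonly described for HEVC; treat them as assumptions and check against the spec. The signal_mode helper and its return format are made up for this example.

```python
PLANAR, DC, VERTICAL = 0, 1, 26


def most_probable_modes(left_mode, above_mode):
    """Build the 3-entry candidate list from the left and above intra modes.

    Pass None for a neighbor that is unavailable or not intra coded; it is
    then treated as DC.
    """
    a = DC if left_mode is None else left_mode
    b = DC if above_mode is None else above_mode
    if a == b:
        if a < 2:                   # both Planar or both DC
            return [PLANAR, DC, VERTICAL]
        # Angular: the mode itself plus its two nearest angular neighbors,
        # wrapping around within modes 2..34.
        return [a, 2 + ((a + 29) % 32), 2 + ((a - 2 + 1) % 32)]
    if PLANAR not in (a, b):
        third = PLANAR
    elif DC not in (a, b):
        third = DC
    else:
        third = VERTICAL
    return [a, b, third]


def signal_mode(mode, mpm):
    """Return what would be sent: a list index, or the remaining-mode number."""
    if mode in mpm:
        return ("mpm_idx", mpm.index(mode))
    # Skipping the three listed modes leaves 32 values, which fit in 5 bits.
    return ("rem_mode", mode - sum(1 for m in mpm if m < mode))


mpm = most_probable_modes(10, 10)
print(mpm, signal_mode(11, mpm), signal_mode(34, mpm))
```

Because the three list entries are always distinct, exactly 32 modes remain for the explicit case.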

- End of Part 1 -

Last edited by pieter3d; 22nd February 2013 at 01:03. Reason: Spec URL update