Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion.
Old 6th November 2016, 23:31   #1  |  Link
Ataril
Registered User
 
Join Date: Oct 2016
Posts: 8
Deep learning of h264

Hi everyone!
I'm reading about this standard at the moment and have already worked through Richardson's well-known book, but I still have a lot of questions!

For example, where can I read about the very beginning of the encoding process? Let's say I know that the coder receives data already in YUV format, but what is responsible for converting an input RGB file containing ordinary pixels? Is this functionality part of the codec, or how is it implemented otherwise? (Or does it all depend on the codec?)
As far as I understand, even the original ITU-T papers mainly describe the decoder's model in detail.

Another mystery for me is motion estimation. I have seen mentions of the Diamond, Hex, UMH and ESA methods for finding the best matching block in a frame (that is still part of the block-matching algorithm, right?), but I have never found a detailed, comprehensive explanation of the difference between them, or of when one or the other should be used.
I hope I can get some useful information here.
Correct me if I'm wrong somewhere. I will appreciate any help!
Ataril is offline   Reply With Quote
Old 6th November 2016, 23:54   #2  |  Link
LoRd_MuldeR
Software Developer
 
LoRd_MuldeR's Avatar
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 12,828
Quote:
Originally Posted by Ataril View Post
For example, where can I read about the very beginning of the encoding process? Let's say I know that the coder receives data already in YUV format, but what is responsible for converting an input RGB file containing ordinary pixels? Is this functionality part of the codec, or how is it implemented otherwise? (Or does it all depend on the codec?)
Color-space conversion, such as RGB to YUV (actually YCbCr), is not specific to H.264 at all. You will find a lot of information about it, even on Wikipedia:
https://en.wikipedia.org/wiki/YCbCr#YCbCr

Note that compressed video formats usually operate on YCbCr color-space, because it separates "chrominance" (color) information from "luminance" (brightness) information, which helps compression.

See also:
https://en.wikipedia.org/wiki/Chroma_subsampling
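As an aside, the arithmetic behind such a conversion is simple enough to sketch in a few lines. Here is a minimal Python illustration using the BT.601 coefficients for 8-bit "limited range" output — an illustration of the math only, not any particular library's implementation (real converters also handle chroma subsampling and rounding details):

```python
# Minimal sketch: full-range 8-bit RGB -> BT.601 limited-range YCbCr.
# Coefficients per ITU-R BT.601; not taken from any codec's source code.

def rgb_to_ycbcr(r, g, b):
    """Full-range 8-bit RGB -> BT.601 limited-range YCbCr (Y in 16..235)."""
    y  =  16 + ( 65.481 * r + 128.553 * g +  24.966 * b) / 255
    cb = 128 + (-37.797 * r -  74.203 * g + 112.000 * b) / 255
    cr = 128 + (112.000 * r -  93.786 * g -  18.214 * b) / 255
    return round(y), round(cb), round(cr)

# White maps to the top of the luma range with neutral (128) chroma:
print(rgb_to_ycbcr(255, 255, 255))  # (235, 128, 128)
print(rgb_to_ycbcr(0, 0, 0))        # (16, 128, 128)
```

Note how any gray input (R = G = B) produces Cb = Cr = 128, i.e. zero chroma — that separation is exactly what makes chroma subsampling possible.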

Quote:
Originally Posted by Ataril View Post
As far as I understand, even the original ITU-T papers mainly describe the decoder's model in detail.
Video compression standards, such as H.264 or H.265, only describe what a valid bit-stream looks like, and how a compliant decoder handles such a valid bit-stream.

But, how to generate a valid bit-stream that, after decompression, resembles the original input video as closely as possible (under the given bitrate limitations), is totally undefined.

That exercise is left for the encoder developers to figure out.

Quote:
Originally Posted by Ataril View Post
Another mystery for me is motion estimation. I have seen mentions of the Diamond, Hex, UMH and ESA methods for finding the best matching block in a frame (that is still part of the block-matching algorithm, right?), but I have never found a detailed, comprehensive explanation of the difference between them, or of when one or the other should be used.
In order to find the "best" motion vectors, the encoder has to try them out and keep the result that performed best, e.g. in terms of smallest "error".

Now, there are way too many possibilities to try them all (in reasonable time). So, the encoder has to search the space of possible motion vectors in a "smart" way.

Simply put, what the encoder does in practice is to try only a few possibilities (according to some "search pattern") and then refine the most promising candidates.

The names "diamond" (DIA), "hexagonal" (HEX), "uneven multi-hexagon" (UMH) and "exhaustive search" (ESA) refer to such search methods/patterns.
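To make the idea concrete, here is a toy Python sketch of the smallest of those patterns, the diamond search — an illustration of the principle, not x264's actual implementation. It uses SAD (sum of absolute differences) on luma samples as the matching cost; real encoders add larger patterns, sub-pixel refinement, early termination and rate-aware costs:

```python
# Toy diamond-pattern motion search over integer motion vectors.

def sad(ref, cur, bx, by, mvx, mvy, n):
    """Cost of predicting cur's n x n block at (bx, by) from ref at (bx+mvx, by+mvy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + mvy + y][bx + mvx + x])
               for y in range(n) for x in range(n))

def diamond_search(ref, cur, bx, by, n, max_steps=16):
    """Refine around the best vector with a small diamond until it stops moving."""
    best, best_cost = (0, 0), sad(ref, cur, bx, by, 0, 0, n)
    for _ in range(max_steps):
        cx, cy = best
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):  # the diamond pattern
            mvx, mvy = cx + dx, cy + dy
            # skip candidates that would read outside the reference frame
            if not (0 <= bx + mvx and bx + mvx + n <= len(ref[0])
                    and 0 <= by + mvy and by + mvy + n <= len(ref)):
                continue
            cost = sad(ref, cur, bx, by, mvx, mvy, n)
            if cost < best_cost:
                best, best_cost = (mvx, mvy), cost
        if best == (cx, cy):
            break  # no neighbour improved: a local minimum of the pattern
    return best, best_cost

# Toy frames: the current block at (4, 4) is the reference content shifted
# by (+2, +1), so the search should recover the motion vector (2, 1).
ref = [[x * 5 + y * 3 for x in range(16)] for y in range(16)]
cur = [[(x + 2) * 5 + (y + 1) * 3 for x in range(16)] for y in range(16)]
print(diamond_search(ref, cur, 4, 4, 4))  # ((2, 1), 0)
```

UMH and ESA differ mainly in how many candidates they visit: ESA exhaustively tests every vector in the search window, while the pattern-based methods trade a little quality for a large speed-up.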
__________________
There was of course no way of knowing whether you were being watched at any given moment.
How often, or on what system, the Thought Police plugged in on any individual wire was guesswork.



Last edited by LoRd_MuldeR; 7th November 2016 at 00:08.
LoRd_MuldeR is offline   Reply With Quote
Old 7th November 2016, 01:32   #3  |  Link
StainlessS
HeartlessS Usurer
 
StainlessS's Avatar
 
Join Date: Dec 2009
Location: Over the rainbow
Posts: 5,161
Me cant say different to the lord above, Lord Mulder already put you right, he is the man on this.
__________________
I sometimes post sober.
StainlessS@MediaFire ::: AND/OR ::: StainlessS@SendSpace

"Some infinities are bigger than other infinities", but are any of them infinitely bigger ???
StainlessS is offline   Reply With Quote
Old 7th November 2016, 16:08   #4  |  Link
Ataril
Registered User
 
Join Date: Oct 2016
Posts: 8
Quote:
Originally Posted by LoRd_MuldeR View Post
Color-space conversion, such as RGB to YUV (actually YCbCr), is not specific to H.264 at all.
I get it: H.264 itself knows nothing about the RGB color space. But a complete codec (such as FFmpeg, for example) works with any video format, not only YCbCr, am I wrong? It should first convert the input into a recognizable color space, so it needs some conversion functionality in addition to the codec's functionality.

Quote:
But, how to generate a valid bit-stream that, after decompression, resembles the original input video as closely as possible (under the given bitrate limitations), is totally undefined.
Now it's clear to me, thank you for this brief but complete explanation. So to figure out how a coder works, maybe I should explore some existing coder (unfortunately, reading someone else's code is quite difficult).

Quote:
The names "diamond" (DIA), "hexagonal" (HEX), "uneven multi-hexagon" (UMH) and "exhaustive search" (ESA) refer to such search methods/patterns.
Where can I read more about these methods? Do they determine only the search pattern, or the size of the search area too? Or does that also depend on the specific algorithm?

And one more dumb question: since we use the YCbCr color space, when we say a 16x16 macroblock, what does 16x16 mean exactly? Is it pixels of brightness?
And are all distances on the frame measured within this concept?
Ataril is offline   Reply With Quote
Old 7th November 2016, 20:04   #5  |  Link
LoRd_MuldeR
Software Developer
 
LoRd_MuldeR's Avatar
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 12,828
Quote:
Originally Posted by Ataril View Post
I get it: H.264 itself knows nothing about the RGB color space. But a complete codec (such as FFmpeg, for example) works with any video format, not only YCbCr, am I wrong? It should first convert the input into a recognizable color space, so it needs some conversion functionality in addition to the codec's functionality.
Most encoder libraries expect that the application already passes the input frames in the "proper" color format. For lossy encoders that's typically the YCbCr format (most often with 4:2:0 or 4:2:2 subsampling).

But that's really implementation specific! An encoder library could also accept various color formats and take care of the required color-space conversion internally.

FFmpeg is an application that does a whole lot of different things, including decoding/encoding, color-space conversion, resizing and so on. It uses its own and third-party libraries to implement all those features.

x264, for example, does support color-space conversion. But that's implemented in the x264 command-line front-end (via the libswscale library), not in the actual "libx264" encoder library.


Quote:
Originally Posted by Ataril View Post
Now it's clear to me, thank you for this brief but complete explanation. So to figure out how a coder works, maybe I should explore some existing coder (unfortunately, reading someone else's code is quite difficult).
Yes, if you really want to understand how video encoding is actually done in practice, you should probably start looking at "real-world" encoder code, e.g. x264 for H.264/AVC encoding or x265 for H.265/HEVC encoding.

There also exist so-called "reference" encoders for most video formats (e.g. JM for H.264/AVC). But be aware that those are more "proof of concept" encoders, way too slow for real-world usage.


Quote:
Originally Posted by Ataril View Post
Where can I read more about these methods? Do they determine only the search pattern, or the size of the search area too? Or does that also depend on the specific algorithm?
Either in encoder implementations that actually make use of those methods, or in scientific papers that describe them theoretically.

Just a quick Google search:
* http://www.ntu.edu.sg/home/ekkma/1_P...pt.%201997.pdf
* http://www.ee.oulu.fi/mvg/files/pdf/pdf_725.pdf
* http://www.mirlab.org/conference_pap...ers/P90619.pdf


Quote:
Originally Posted by Ataril View Post
And one more dumb question: since we use the YCbCr color space, when we say a 16x16 macroblock, what does 16x16 mean exactly? Is it pixels of brightness?
And are all distances on the frame measured within this concept?
In the YCbCr color-space, there are three separate channels. One channel ("Y") represents brightness/luminance information. And the other two channels ("Cr" and "Cb") represent color/chrominance information.

So you have "brightness" and "color" information, although "brightness" is usually kept at a higher resolution than "color" (aka chroma subsampling).

The "transform blocks", e.g. 8x8 pixels in size, are used to transform the image data from the spatial domain into the frequency domain. For example, a block of 8x8 pixels is transformed into a matrix of 64 frequency coefficients.

I suggest that you start with the simpler JPEG image format before you move on to video, because many fundamental concepts are the same (but easier to follow):
https://en.wikipedia.org/wiki/JPEG#Encoding
Last edited by LoRd_MuldeR; 7th November 2016 at 21:10.
LoRd_MuldeR is offline   Reply With Quote
Old 8th November 2016, 16:30   #6  |  Link
Ataril
Registered User
 
Join Date: Oct 2016
Posts: 8
Thanks again for all the valuable information you've given!

And going further, there are more obscure areas concerning macroblock sizes. As far as I understand, the coder chooses the appropriate size (such as 16x16, 8x8 or smaller) depending on the level of detail in each area of the frame (in order to provide better quality and compression). For a detailed, high-frequency area it is reasonable to use a smaller macroblock size, and vice versa.
But how does it evaluate where the smooth areas of the frame are and where they are not? How large must the differences between these areas be for the coder to decide to use one macroblock size or another? Should it compare values in the brightness matrix, the chroma matrices, or both?
Ataril is offline   Reply With Quote
Old 8th November 2016, 21:16   #7  |  Link
LoRd_MuldeR
Software Developer
 
LoRd_MuldeR's Avatar
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 12,828
Quote:
Originally Posted by Ataril View Post
Thanks again for all the valuable information you've given!

And going further, there are more obscure areas concerning macroblock sizes. As far as I understand, the coder chooses the appropriate size (such as 16x16, 8x8 or smaller) depending on the level of detail in each area of the frame (in order to provide better quality and compression). For a detailed, high-frequency area it is reasonable to use a smaller macroblock size, and vice versa.
But how does it evaluate where the smooth areas of the frame are and where they are not? How large must the differences between these areas be for the coder to decide to use one macroblock size or another? Should it compare values in the brightness matrix, the chroma matrices, or both?
The input frame is in the spatial domain, so each value in an N×N block represents the "brightness" (luminance) or "color" (chrominance) of a pixel/sample. Those "pixel" values are transformed into the frequency domain because, in the frequency domain, the same information can usually be represented with only a few non-zero frequency coefficients. In other words: you still have N×N values (frequency coefficients) after the transform, but most of those values are very close to zero. And most values (coefficients) actually become zero after the quantization stage. Finally, thanks to the entropy coding stage (e.g. via Huffman coding or arithmetic coding), those long sequences of zeros become extremely "cheap" to store, in terms of bit cost.

Example of DCT transform:
http://img.tomshardware.com/us/1999/...part_3/dct.gif
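The effect is easy to demonstrate. Below is a toy Python sketch — a plain textbook floating-point DCT with a crude uniform quantizer, not any codec's integer-approximated transform or quantization tables — showing how a smooth 8x8 gradient block collapses to a handful of non-zero coefficients:

```python
# Toy 2-D DCT-II of an 8x8 block plus uniform quantization.
import math

N = 8

def dct2(block):
    """Plain O(N^4) 2-D DCT-II with orthonormal scaling."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            cu = math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)
            cv = math.sqrt(1 / N) if v == 0 else math.sqrt(2 / N)
            s = 0.0
            for y in range(N):
                for x in range(N):
                    s += (block[y][x]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = cu * cv * s
    return out

# A smooth horizontal gradient: brightness rises from 0 to 112, left to right.
block = [[x * 16 for x in range(N)] for _ in range(N)]
quantized = [[round(c / 16) for c in row] for row in dct2(block)]
nonzero = sum(1 for row in quantized for c in row if c != 0)
print(nonzero)  # only a handful of the 64 quantized coefficients survive
```

All 64 pixel values are non-zero, but after transform and quantization only a few coefficients remain — the rest become runs of zeros that the entropy coder stores almost for free.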

Now, as a "rule of thumb", using larger transform blocks is advantageous in "flat" image regions. Simply put, that's because a very large image area can be covered with a single block that, after the transform to frequency domain, has only a few non-zero coefficients. But that won't work well in "detailed" image regions! A large block would need too many non-zero coefficients to provide a reasonable approximation of the "detailed" area. Smaller transform blocks are advantageous there.

How does the encoder know what transform size to use in a specific image location? Again: the standard does not dictate that! It's up to the encoder developers to figure out such things, using whatever methods/ideas they deem appropriate.

(A typical approach is called "rate-distortion-optimization", aka RDO, which will actually try out many possible decisions and, in the end, keep the decision that resulted in the best "error vs. bit-cost" trade-off)
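As a toy illustration of that idea — the distortion and bit-cost numbers below are invented for the example, not measured — each candidate decision is scored as J = D + λ·R and the smallest J wins:

```python
# Toy rate-distortion decision: pick the mode minimizing J = D + lambda * R.
# A real encoder obtains D and R by actually encoding the block each way.

LAMBDA = 4.0  # Lagrange multiplier: how many distortion units one bit is worth

def rd_decide(candidates, lam=LAMBDA):
    """candidates: (name, distortion, rate_bits) tuples; returns the best name."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])[0]

# A flat region: one big transform block approximates it with very few bits.
flat = [("16x16", 10.0, 20), ("8x8", 8.0, 60), ("4x4", 7.0, 140)]
# A detailed region: the small blocks' lower distortion pays for their bits.
busy = [("16x16", 900.0, 90), ("8x8", 700.0, 150), ("4x4", 120.0, 260)]

print(rd_decide(flat), rd_decide(busy))  # 16x16 4x4
```

The same cost function is reused for every decision the encoder faces — block sizes, motion vectors, prediction modes — which is why RDO dominates encoder design.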
Last edited by LoRd_MuldeR; 8th November 2016 at 21:31.
LoRd_MuldeR is offline   Reply With Quote
Old 13th November 2016, 17:24   #8  |  Link
Ataril
Registered User
 
Join Date: Oct 2016
Posts: 8
Quote:
Originally Posted by LoRd_MuldeR View Post
The input frame is in the spatial domain, so each value in an N×N block represents the "brightness" (luminance) or "color" (chrominance) of a pixel/sample. Those "pixel" values are transformed into the frequency domain because, in the frequency domain, the same information can usually be represented with only a few non-zero frequency coefficients. In other words: you still have N×N values (frequency coefficients) after the transform, but most of those values are very close to zero. And most values (coefficients) actually become zero after the quantization stage. Finally, thanks to the entropy coding stage (e.g. via Huffman coding or arithmetic coding), those long sequences of zeros become extremely "cheap" to store, in terms of bit cost.

Example of DCT transform:
http://img.tomshardware.com/us/1999/...part_3/dct.gif
Yes, I've read about DCT and quantization. What was bothering me is that I didn't really understand how an RGB hex color code is transformed into separate values of brightness and color.
Let's say we have a 16x16 block of pixels.
Brightness is calculated first using the formula Y′ = 0.299 R′ + 0.587 G′ + 0.114 B′ (the sum of the red, green and blue components with the proper weighting coefficients: 0.299 for red, 0.587 for green and 0.114 for blue, according to the CCIR 601 standard). Its range is from 16 (black) to 235 (white), and we get a matrix of 256 values.
For the color blocks we take the color values in the range 0-255 for blue and for red, average them (if we use 4:2:0 or 4:2:2 subsampling) and construct two matrices using the following formulas for each chroma value:
Cb = 0.564 (B - Y)
Cr = 0.713 (R - Y)
Before being displayed on the screen, the picture has to be transformed from the YCbCr color space back into the familiar RGB color space.
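A quick numeric sanity check of those formulas — in their unscaled "analog" form, before the digital offset (+128) and range scaling are applied — can be done in a few lines of Python:

```python
# Worked check of the Y'/Cb/Cr formulas quoted above (unscaled BT.601 form).

def ycbcr(r, g, b):
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y, 0.564 * (b - y), 0.713 * (r - y)

# Neutral gray carries no chroma: Cb and Cr come out (essentially) zero.
y, cb, cr = ycbcr(128, 128, 128)
print(round(y), round(cb, 6), round(cr, 6))

# Pure red: modest luma (only the 0.299 weight) and a strongly positive Cr.
print(tuple(round(v, 2) for v in ycbcr(255, 0, 0)))
```
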

Found the answer here:
https://en.wikipedia.org/wiki/Luma_(video)
And I reread the second chapter of Richardson's book once again; it has quite detailed information.
Quote:
Originally Posted by LoRd_MuldeR View Post
How does the encoder know what transform size to use in a specific image location? Again: The standard does not dictate that! It's up to the encoder developers to figure out such things, using whatever methods/ideas they deem appropriate

(A typical approach is called "rate-distortion-optimization", aka RDO, which will actually try out many possible decisions and, in the end, keep the decision that resulted in the best "error vs. bit-cost" trade-off)
Thanks for this new knowledge; I'm not sure I'd heard about RDO before.


And what about the codecs that are built into mobile devices? How do they operate? As far as I understand, they don't have a chance to evaluate the video before coding, because everything has to be done on the fly (unlike desktop codecs, which can evaluate the video one or more times to decide how best to distribute bitrate among frames). How do they manage unpredictable video? Do they just use some default parameters?

Last edited by Ataril; 13th November 2016 at 17:27.
Ataril is offline   Reply With Quote
Old 13th November 2016, 18:52   #9  |  Link
LoRd_MuldeR
Software Developer
 
LoRd_MuldeR's Avatar
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 12,828
Quote:
Originally Posted by Ataril View Post
Yes, I've read about DCT and quantization. What was bothering me is that I didn't really understand how an RGB hex color code is transformed into separate values of brightness and color.
Let's say we have a 16x16 block of pixels.
Brightness is calculated first using the formula Y′ = 0.299 R′ + 0.587 G′ + 0.114 B′ (the sum of the red, green and blue components with the proper weighting coefficients: 0.299 for red, 0.587 for green and 0.114 for blue, according to the CCIR 601 standard). Its range is from 16 (black) to 235 (white), and we get a matrix of 256 values.
For the color blocks we take the color values in the range 0-255 for blue and for red, average them (if we use 4:2:0 or 4:2:2 subsampling) and construct two matrices using the following formulas for each chroma value:
Cb = 0.564 (B - Y)
Cr = 0.713 (R - Y)
Before being displayed on the screen, the picture has to be transformed from the YCbCr color space back into the familiar RGB color space.

Found the answer here:
https://en.wikipedia.org/wiki/Luma_(video)
And once again reread second chapter in Richardson's book, it has quite detailed information
Again, the input RGB picture is converted to the YCbCr format, if it is not in YCbCr already. Here you see (top to bottom) the original RGB image and how it's split into Y, Cb and Cr channels:
https://upload.wikimedia.org/wikiped...separation.jpg

Next, each N×N block is transformed from the spatial domain into the frequency domain, separately for each channel.

In the spatial domain, each N×N block consists of N×N brightness (luminance) or color (chrominance) values. In the frequency domain, the same information is represented by N×N frequency coefficients.

For example, when using the DCT transform with an 8x8 block size, each pixel block will be represented as a linear combination of the following 64 "patterns" (basis functions):
https://upload.wikimedia.org/wikiped...23/Dctjpeg.png

Think of it like this: Each of the 64 frequency coefficients belongs to one of the 64 patterns shown above. You can interpret the coefficient as the "intensity" of the corresponding pattern:
http://img.tomshardware.com/us/1999/...part_3/dct.gif

In order to reconstruct the original 8x8 pixel block, each pattern is multiplied by its corresponding coefficient (intensity) value, and the results are all added up.
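That "weighted sum of patterns" view can be written down directly. A small Python sketch — toy floating-point math, not a real codec's integer inverse transform:

```python
# Rebuild an 8x8 block as a sum of coefficient-weighted DCT basis patterns.
import math

N = 8

def basis(u, v):
    """The (u, v) DCT basis pattern as an 8x8 array (orthonormal scaling)."""
    cu = math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)
    cv = math.sqrt(1 / N) if v == 0 else math.sqrt(2 / N)
    return [[cu * cv
             * math.cos((2 * x + 1) * u * math.pi / (2 * N))
             * math.cos((2 * y + 1) * v * math.pi / (2 * N))
             for x in range(N)] for y in range(N)]

def reconstruct(coeffs):
    """Inverse transform: sum of coefficient-weighted basis patterns."""
    block = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            if coeffs[u][v] == 0:
                continue  # zeroed-out coefficients contribute nothing
            pat = basis(u, v)
            for y in range(N):
                for x in range(N):
                    block[y][x] += coeffs[u][v] * pat[y][x]
    return block

# With only the DC coefficient set, the result is a perfectly flat block:
coeffs = [[0.0] * N for _ in range(N)]
coeffs[0][0] = 400.0
flat = reconstruct(coeffs)
print(round(flat[0][0], 2))  # 50.0 (= 400 * 1/8, the DC basis value)
```

This is also why quantizing small coefficients to zero is cheap in quality terms: a zeroed pattern simply drops out of the sum.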


Quote:
Originally Posted by Ataril View Post
And what about the codecs that are built into mobile devices? How do they operate? As far as I understand, they don't have a chance to evaluate the video before coding, because everything has to be done on the fly (unlike desktop codecs, which can evaluate the video one or more times to decide how best to distribute bitrate among frames). How do they manage unpredictable video? Do they just use some default parameters?
Basically, "hardware" encoders don't work that differently from "software" encoders; it's just that they are implemented directly in silicon.

However, while software encoders are usually highly tweakable, ranging from "ultra fast" encoding (i.e. "quick and dirty" optimization for maximum throughput) to "very slow" encoding (i.e. thorough optimization for maximum compression efficiency), hardware encoders tend to be targeted more towards "real-time" encoding than towards maximum compression efficiency. And hardware encoders most often don't provide many options, if any, to adjust the encoding behavior.

If you do "real-time" encoding, you cannot use elaborate optimization techniques such as "2-pass" encoding. This applies to both software and hardware encoders. However, hardware encoders are usually bound to "real-time" encoding.
Last edited by LoRd_MuldeR; 13th November 2016 at 19:19.
LoRd_MuldeR is offline   Reply With Quote
Tags
algorithm, codec, h.264, motion estimation
