#1
Registered User
Join Date: Oct 2016
Posts: 8
Learning H.264 in depth
Hi everyone!
I'm reading about this standard at the moment and have already got acquainted with Richardson's well-known book, but I still have a lot of questions. For example, where can I read about the very beginning of the encoding process? Let's say I know that the coder gets data already in YUV format, but what is responsible for converting an input RGB file containing normal pixels? Is this functionality a part of the codec, or how is it implemented otherwise? (Or does it all depend on the codec?) As far as I understand, even the original ITU-T papers mainly describe the decoder's model in detail. Another mystery for me is motion estimation. I have seen mentions of the Diamond, Hex, UMH and ESA methods for searching for the best matching block in the frame (this is still part of the block-matching algorithm, right?), but I have never found a detailed, comprehensive explanation of what the difference between all of them is, or in which case one or the other should be used. Hope I can get some useful information here. Correct me if I'm wrong somewhere. I will appreciate any help!
#2
Software Developer
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,246
Quote:
https://en.wikipedia.org/wiki/YCbCr#YCbCr Note that compressed video formats usually operate on YCbCr color-space, because it separates "chrominance" (color) information from "luminance" (brightness) information, which helps compression. See also: https://en.wikipedia.org/wiki/Chroma_subsampling Quote:
But how to generate a valid bit-stream that, after decompression, resembles the original input video as closely as possible (under the given bitrate limitations) is totally undefined. That exercise is left for the encoder developers to figure out.
Quote:
Now, there are way too many possibilities to try them all (in reasonable time). So, the encoder has to search the space of possible motion vectors in a "smart" way. Simply put, what the encoder does in practice is trying only a few possibilities (according to some "search pattern") and then refining the most promising candidates. The names "diamond" (DIA), "hexagonal" (HEX), "uneven multi-hexagon" (UMH) and "exhaustive search" (ESA) refer to such search methods/patterns.
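To make that a bit more concrete, here is a little toy sketch in Python/NumPy of the "small diamond" search idea: compute the SAD (sum of absolute differences) at the four diamond neighbours of the current best position, move to the best one, and repeat until the centre wins. It is purely illustrative (the function names and parameters are made up, not taken from any real encoder):

Code:
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def small_diamond_search(cur, ref, bx, by, size=16, max_steps=32):
    """Toy small-diamond search: start at motion vector (0, 0), repeatedly move to
    the best of the four diamond neighbours, stop when the centre is already best."""
    block = cur[by:by + size, bx:bx + size]
    best_mv = (0, 0)
    best_cost = sad(block, ref[by:by + size, bx:bx + size])
    for _ in range(max_steps):
        improved = False
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):      # the diamond pattern
            mvx, mvy = best_mv[0] + dx, best_mv[1] + dy
            x, y = bx + mvx, by + mvy
            if 0 <= x <= ref.shape[1] - size and 0 <= y <= ref.shape[0] - size:
                cost = sad(block, ref[y:y + size, x:x + size])
                if cost < best_cost:
                    best_mv, best_cost, improved = (mvx, mvy), cost, True
        if not improved:          # centre of the diamond is already the best candidate
            break
    return best_mv, best_cost

# Tiny usage example: a smooth test image, shifted by (+3, -2) pixels.
yy, xx = np.arange(64), np.arange(64)
ref = (np.sin(yy / 5)[:, None] * np.cos(xx / 7)[None, :] * 127 + 128).astype(np.uint8)
cur = np.roll(ref, shift=(2, -3), axis=(0, 1))     # current frame = reference moved by (+3, -2)
print(small_diamond_search(cur, ref, bx=24, by=24))  # walks to roughly ((3, -2), 0)

Real encoders are far more sophisticated (sub-pixel refinement, motion-vector predictors from neighbouring blocks, early termination, and so on), but the basic "pattern search + refine" principle behind DIA/HEX/UMH is the same.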
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊ Last edited by LoRd_MuldeR; 7th November 2016 at 00:08.

#3
HeartlessS Usurer
Join Date: Dec 2009
Location: Over the rainbow
Posts: 10,876
Me cant say different to the lord above, Lord Mulder already put you right, he is the man on this.
__________________
I sometimes post sober. StainlessS@MediaFire ::: AND/OR ::: StainlessS@SendSpace "Some infinities are bigger than other infinities", but how many of them are infinitely bigger ???

#4
Registered User
Join Date: Oct 2016
Posts: 8
Quote:
Quote:
Quote:
And one more dumb question: since we use the YCbCr color space, when we say a 16x16 macroblock, what does 16x16 mean exactly? It is pixels of brightness (luma samples), right? And are all distances in the frame measured in those terms?
#5
Software Developer
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,246
Quote:
But that's really implementation-specific! An encoder library could also accept various color formats and take care of the required color-space conversion internally. FFmpeg is an application that does a whole lot of different things, including decoding/encoding, color-space conversion, resizing and so on. It uses its own and various third-party libraries to implement all those features. x264, for example, does support color-space conversion. But that's implemented in the x264 command-line front-end (via the libswscale library), not in the actual "libx264" encoder library.
Quote:
There also exist so-called "reference" encoders for most video formats (e.g. JM for H.264/AVC). But be aware that those are more "proof of concept" encoders, way too slow for real-world usage. Quote:
Just a quick Google search:
* http://www.ntu.edu.sg/home/ekkma/1_P...pt.%201997.pdf
* http://www.ee.oulu.fi/mvg/files/pdf/pdf_725.pdf
* http://www.mirlab.org/conference_pap...ers/P90619.pdf
Quote:
So you have "brightness" and "color" information, although "brightness" is usually kept at a higher resolution than "color" (aka chroma subsampling). The "transform blocks", e.g. 8×8 pixels in size, are used to transform the image data from the spatial domain into the frequency domain. For example, a block of 8×8 pixels is transformed into a matrix of 64 frequency coefficients. I suggest that you start with the simpler JPEG image format before you move on to video, because many fundamental concepts are the same (but easier to follow): https://en.wikipedia.org/wiki/JPEG#Encoding
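If you want to play with that, here is a tiny illustrative NumPy snippet (not taken from any real codec) that builds an orthonormal 8×8 DCT-II matrix, as used in JPEG-style transforms, and shows how a "flat" block ends up with a single non-zero coefficient while a "noisy" block needs many:

Code:
import numpy as np

def dct2_matrix(n=8):
    """Orthonormal DCT-II basis matrix (n x n)."""
    i = np.arange(n)[:, None]        # frequency index (rows)
    j = np.arange(n)[None, :]        # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * j + 1) * i / (2 * n))
    m[0, :] /= np.sqrt(2.0)          # DC row gets the extra 1/sqrt(2) factor
    return m

D = dct2_matrix(8)

flat = np.full((8, 8), 100.0)                    # perfectly flat 8x8 block
coeffs_flat = D @ flat @ D.T                     # 2-D DCT: transform rows, then columns
print(np.count_nonzero(np.round(coeffs_flat)))   # 1 -> only the DC coefficient survives

noisy = np.random.default_rng(0).integers(0, 256, (8, 8)).astype(float)
coeffs_noisy = D @ noisy @ D.T
print(np.count_nonzero(np.round(coeffs_noisy)))  # many non-zero coefficients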
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊ Last edited by LoRd_MuldeR; 7th November 2016 at 21:10.

#6
Registered User
Join Date: Oct 2016
Posts: 8
Thanks again for all the valuable information that you gave!
And if we go further, there are more obscure areas concerning macroblock sizes. As far as I understand, the coder chooses the appropriate size (such as 16x16, 8x8 or smaller) depending on the level of detail in that area of the frame (in order to provide better quality and compression of the video). For a detailed, high-frequency area it is reasonable to use a smaller macroblock size, and vice versa. But how does it evaluate where the smooth areas of the frame are and where they are not? How big should the differences between these areas be before the coder decides to use one macroblock size or the other? Should it compare values in the brightness matrix, or chroma, or both?
#7
Software Developer
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,246
Quote:
Example of DCT transform: http://img.tomshardware.com/us/1999/...part_3/dct.gif

Now, as a "rule of thumb", using larger transform blocks is advantageous in "flat" image regions. Simply put, that's because a very large image area can be covered with a single block that, after the transform to frequency domain, has only a few non-zero coefficients. But that won't work well in "detailed" image regions! A large block would need too many non-zero coefficients to provide a reasonable approximation of the "detailed" area. Smaller transform blocks are advantageous there.

How does the encoder know what transform size to use in a specific image location? Again: The standard does not dictate that! It's up to the encoder developers to figure out such things, using whatever methods/ideas they deem appropriate.

(A typical approach is called "rate-distortion-optimization", aka RDO, which will actually try out many possible decisions and, in the end, keep the decision that resulted in the best "error vs. bit-cost" trade-off)
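As a purely illustrative sketch (the distortion/bit numbers and the lambda value below are invented, not measured by any real encoder), the RDO decision boils down to computing the Lagrangian cost J = D + λ·R for each candidate and keeping the cheapest one:

Code:
# Toy rate-distortion optimization: J = D + lambda * R.
def rd_cost(distortion, bits, lam):
    return distortion + lam * bits

lam = 12.0   # lambda; in a real encoder it is derived from the quantizer setting

# Hypothetical measurements for one macroblock: coding it with one 8x8 transform
# versus four 4x4 transforms (distortion = reconstruction error, bits = cost to code).
candidates = {
    "one 8x8 transform":   {"distortion": 5200.0, "bits": 180},
    "four 4x4 transforms": {"distortion": 3100.0, "bits": 410},
}

for name, c in candidates.items():
    print(name, rd_cost(c["distortion"], c["bits"], lam))

best = min(candidates, key=lambda n: rd_cost(candidates[n]["distortion"],
                                             candidates[n]["bits"], lam))
print("RDO picks:", best)   # whichever candidate has the lowest J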
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊ Last edited by LoRd_MuldeR; 8th November 2016 at 21:31.

#8
Registered User
Join Date: Oct 2016
Posts: 8
Quote:
Let's say we have a 16x16 block of pixels. Brightness is calculated first using the formula Y′ = 0.299 R′ + 0.587 G′ + 0.114 B′ (it is the sum of the red, green and blue components with the proper weighting coefficients: 0.299 for red, 0.587 for green and 0.114 for blue, according to the CCIR 601 standard), its range is from 16 (black) to 235 (white), and we get a matrix of 256 values. For the color blocks, obviously, we take the color values in the range 0-255 for blue and for red, average them (if we use 4:2:0 or 4:2:2 subsampling) and construct two matrices using the following formulas for each chroma value:

Cb = 0.564 (B - Y)
Cr = 0.713 (R - Y)

Before being displayed on the screen, the picture should be transformed from the YCbCr color space back into the familiar RGB color space. Found the answer here: https://en.wikipedia.org/wiki/Luma_(video)

And once again I reread the second chapter of Richardson's book; it has quite detailed information.
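Here is a small NumPy sketch I put together to check my understanding (full-range values only, without the 16..235 / 16..240 "limited range" scaling and the +128 chroma offset that real encoders typically apply), converting an RGB image into Y, Cb, Cr planes with 4:2:0 subsampling:

Code:
import numpy as np

def rgb_to_ycbcr_420(rgb):
    """RGB image (H x W x 3, values 0..255, even H and W) -> full-range Y, Cb, Cr
    planes using the BT.601 weights, with the two chroma planes 4:2:0 subsampled
    by averaging each 2x2 group of chroma samples."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.564 * (b - y)
    cr = 0.713 * (r - y)
    h, w = y.shape
    cb = cb.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))   # half resolution in x and y
    cr = cr.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, cb, cr

img = np.random.default_rng(1).integers(0, 256, (16, 16, 3)).astype(float)
y, cb, cr = rgb_to_ycbcr_420(img)
print(y.shape, cb.shape, cr.shape)   # (16, 16) (8, 8) (8, 8)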
Quote:

And what about the codecs that are built into mobile devices? How do they operate? As far as I understand, they don't have a chance to evaluate the video before coding, because it has to be performed on the fly (unlike desktop encoders, which evaluate the video one or more times to decide how best to distribute the bitrate among frames). How do they manage unpredictable video? Are they just using some default parameters?

Last edited by Ataril; 13th November 2016 at 17:27.
#9
Software Developer
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,246
Quote:
https://upload.wikimedia.org/wikiped...separation.jpg

Next, each N×N block is transformed from the spatial domain into the frequency domain, separately for each channel. In the spatial domain, each N×N block consists of N² brightness (luminance) or color (chrominance) values. In the frequency domain, the same information is represented by N² frequency coefficients. For example, when using the DCT transform with 8×8 block size, each pixel block will be represented as a linear combination of the following 64 "patterns" (base functions):

https://upload.wikimedia.org/wikiped...23/Dctjpeg.png

Think of it like this: each of the 64 frequency coefficients belongs to one of the 64 patterns shown above. You can interpret the coefficient as the "intensity" of the corresponding pattern:

http://img.tomshardware.com/us/1999/...part_3/dct.gif

In order to reconstruct the original 8×8 pixel block, each pattern would be multiplied by the corresponding coefficient (intensity) value and then the results are all added up.
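To make that concrete, here is a small illustrative NumPy sketch (not taken from any real codec) that transforms an 8×8 block and then rebuilds it as the weighted sum of the 64 basis patterns:

Code:
import numpy as np

def dct2_matrix(n=8):
    """Orthonormal DCT-II basis matrix, same construction as in the earlier sketch."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * j + 1) * i / (2 * n))
    m[0, :] /= np.sqrt(2.0)
    return m

D = dct2_matrix(8)
block = np.random.default_rng(2).integers(0, 256, (8, 8)).astype(float)
coeffs = D @ block @ D.T                     # forward 2-D DCT: 64 coefficients

# Rebuild the block as a weighted sum of the 64 basis "patterns": pattern (u, v)
# is the outer product of basis row u and basis row v, weighted by coeffs[u, v].
recon = np.zeros((8, 8))
for u in range(8):
    for v in range(8):
        pattern = np.outer(D[u], D[v])       # one of the 64 DCT base images
        recon += coeffs[u, v] * pattern

print(np.allclose(recon, block))             # True: exact reconstruction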
Quote:

However, while software encoders are usually highly tweakable, ranging from "ultra fast" encoding (i.e. "quick and dirty" optimization for maximum throughput) to "very slow" encoding (i.e. "thorough" optimization for maximum compression efficiency), hardware encoders tend to be targeted more towards "real-time" encoding rather than maximum compression efficiency. And hardware encoders most often don't provide many options, if any at all, to adjust the encoding behavior.

If you do "real-time" encoding, you cannot use elaborate optimization techniques, such as "2-pass" encoding. This applies to both software and hardware encoders. However, hardware encoders are usually bound to "real-time" encoding.
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊ Last edited by LoRd_MuldeR; 13th November 2016 at 19:19.
Tags |
algorithm, codec, h.264, motion estimation |