15th October 2011, 09:11   #1
Dust Signs
Registered User
 
Join Date: Jun 2004
Location: Salzburg, Austria
Posts: 219
Performance analysis of Short-term Memory in x264

Dear all,

I just finished my Master's thesis on porting x264 to Short-term Memory and thought that it might be interesting for (some of) you. For the sake of completeness, here is the abstract:

Quote:
Originally Posted by Performance Analysis of Short-Term Memory in a State-of-the-Art H.264 Video Encoder
This thesis describes the process of porting x264, an H.264-compliant state-of-the-art video encoder, to short-term memory, a recent paradigm in memory management, in order to show the ease of use of the latter for heap management in complex applications. Different ways to apply the new paradigm are described and compared to one another and the original implementation. By evaluating each implementation variant’s run time behavior, advantages and disadvantages of short-term memory management over classical memory management are discussed, revealing that the ease of use of those implementation variants which work for all tested parameter combinations comes at the expense of higher peak memory consumption (less than 57% overhead for the default multi-threaded configuration) with a negligible impact on execution time. Furthermore, it is pointed out that a reduction of this overhead is possible through a special implementation variant which only works for some parameter combinations, but reduces the aforementioned overhead to less than 19%. Nonetheless, the adequacy and efficiency of certain x264 implementation variants for single-threaded configurations is shown.
I am very well aware that this will not affect the official x264 code at all, but perhaps it gives some insights into new possibilities to manage the (memory) lifetime of frames in x264. Also, chapter 3 on the inner workings of x264 may be useful for those who are looking for a more detailed explanation of e.g. x264's frame-based multi-threading approach as well as the memory-management-related implementation specifics.
The thesis will be submitted for grading next week - it is currently being printed. Because the University requires that the research and implementation for a Master's degree be, in principle, one's own work, I did not make this public earlier, with one small exception which I had pre-approved by my advisor, Professor Christoph Kirsch. I hope that you find the thesis interesting. Please find the download links for both the thesis and the code (for the sake of completeness) below. Note that the first title page of the thesis is in German due to the submission requirements - the rest of the thesis is in English.

Download links:
Best regards
Dust Signs

This is a cross post of http://doom10.org/index.php?topic=1962.0
__________________
The number you dialed is imaginary. Please turn your phone by 90° and try again

Last edited by Dust Signs; 27th March 2012 at 15:57. Reason: Updated links
15th October 2011, 13:48   #2
Gser
Registered User
 
Join Date: Apr 2008
Posts: 418
Thank you for sharing and may your thesis be a success.
15th October 2011, 18:42   #3
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
When posting code, can you post a git diff instead of (or in addition to) a tarball? It's vastly easier to read.

Some corrections and/or suggestions:

Page 33:
1. x264 only supports up to 10-bit, not 16-bit.
2. Newer versions of x264 support 4:2:2 and 4:4:4.
3. It might be useful to mention the differences between sliced and frame threads (and why both exist), and why frame threads is generally better.

Page 36:
1. There is one instance of x264_t per thread, not per encode. It holds basically everything, including analysis data for the current macroblock and so forth. You are correct that it's small enough to be largely irrelevant in terms of memory management.

Page 38:
1. Maybe you want to mention the hpel data in fdec frames as a large portion? This uses more data than the original pixel data (for those frames).

Page 41:
1. --sync-lookahead is the number of frames used in the sync buffer between lookahead and encoding. --rc-lookahead is the number of lookahead frames.
2. Some of your parameters are missing hyphens.
3. It might be useful to mention some of the meanings of the parameters in your chart, e.g. that b-adapt 1 is a fast heuristic algorithm whereas b-adapt 2 is a Viterbi decision algorithm. This might be relevant because the latter requires dozens of frames to form a path, whereas the former can work on just a couple.
4. You omitted MB-tree.
5. "veryslow", not "very-slow", in the footnote.

Page 50:
1. Floating point operations are not expensive when done on a per-frame basis. Ratecontrol does thousands of them.

Page 55:
1. Foreman is often used for PSNR testing by low-quality papers, but it's not good for performance testing, especially with x264, because of its very small frame size. x264 can only use one thread per couple of macroblock rows of the frame.
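
To put rough numbers on the size argument, here is a minimal sketch in C. It assumes Foreman at CIF resolution (352x288, which is how the sequence is commonly distributed but is not stated in this thread) and reads "a couple of rows" as a hypothetical `rows_per_thread` parameter; neither function is part of x264.

```c
#include <assert.h>

/* Number of 16x16 macroblock rows needed to cover a frame of the given height. */
static int mb_rows(int height)
{
    assert(height > 0);
    return (height + 15) / 16;
}

/* Rough upper bound on useful threads if every thread needs
 * rows_per_thread macroblock rows. rows_per_thread is an assumed
 * parameter for illustration, not an x264 constant. */
static int max_useful_threads(int height, int rows_per_thread)
{
    assert(rows_per_thread > 0);
    int n = mb_rows(height) / rows_per_thread;
    return n > 0 ? n : 1;
}
```

With an assumed value of 2 rows per thread, a 288-line frame has 18 macroblock rows and therefore room for about 9 threads, while a 1080-line frame has 68 rows and room for about 34.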

General:
1. Have you considered taking advantage of this scheme to only allocate data where necessary? This is obviously impossible in x264's typical allocate-once scheme, but might be possible here. For example, in a P-frame, you don't need to allocate h->mb.mv[1].
2. Does your scheme have better or worse cache behavior? Have you tried making measurements of cache misses? Is there any difference?
3. This is one of the best papers on x264 I've ever seen. While that's not saying that much considering their typical quality, congrats.
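
The first general suggestion above (allocating only what a frame type needs) could look roughly like the following C sketch. The `mv_store` struct, the `slice_type` enum and both functions are hypothetical stand-ins, not x264's actual data structures; the real `h->mb.mv` layout is more detailed than one vector per macroblock.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum slice_type { SLICE_P, SLICE_B, SLICE_I };

/* Hypothetical stand-in for motion vector storage:
 * mv[0] is reference list 0, mv[1] is reference list 1. */
struct mv_store {
    int16_t (*mv[2])[2];   /* one (x, y) vector per macroblock, simplified */
};

/* Allocate only the lists the slice type can use. Returns 0 on success. */
static int mv_store_alloc(struct mv_store *s, enum slice_type type, size_t mb_count)
{
    assert(s != NULL && mb_count > 0);
    s->mv[0] = s->mv[1] = NULL;
    if (type == SLICE_I)
        return 0;                          /* intra only: no motion vectors */
    s->mv[0] = calloc(mb_count, sizeof *s->mv[0]);
    if (!s->mv[0])
        return -1;
    if (type == SLICE_B) {                 /* only B-slices use list 1 */
        s->mv[1] = calloc(mb_count, sizeof *s->mv[1]);
        if (!s->mv[1]) {
            free(s->mv[0]);
            s->mv[0] = NULL;
            return -1;
        }
    }
    return 0;
}

static void mv_store_free(struct mv_store *s)
{
    free(s->mv[0]);
    free(s->mv[1]);
    s->mv[0] = s->mv[1] = NULL;
}
```

In an allocate-once scheme the list 1 buffer has to exist for the whole encode; with per-frame lifetimes it would only exist while a B-frame is being encoded.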

Last edited by Dark Shikari; 15th October 2011 at 19:11.
15th October 2011, 18:53   #4
Dust Signs
As I based my code on a downloaded tarball (snapshot) which did not have the git directories included, it will take some time to provide an appropriate diff. I'll create one and put it online as soon as I have it ready.

Dust Signs
15th October 2011, 19:24   #5
Dust Signs
Please find the diff(s) in the link below (only considering source and header files, no scripts). As I don't have a Linux machine at hand right now, I created the diffs "manually" using a temporary TortoiseGit installation. I hope they are correct - I have no way to test compilation or execution on this machine.
The version the diff is based on is http://git.videolan.org/gitweb.cgi?p...cae52b770eeefb. I hope that the diff file is easier to read.

Diff(s)

Dust Signs

Last edited by Dust Signs; 27th March 2012 at 15:57. Reason: Updated links
15th October 2011, 19:30   #6
Dust Signs
@Dark Shikari: Thank you for your corrections and suggestions. Unfortunately the thesis is already submitted (and printed), so it cannot be changed anymore. If I have some time after my defense, I'll fix the incorrect statements and upload a new version.

Quote:
Originally Posted by Dark Shikari
1. Have you considered taking advantage of this scheme to only allocate data where necessary? This is obviously impossible in x264's typical allocate-once scheme, but might be possible here. For example, in a P-frame, you don't need to allocate h->mb.mv[1].
I did not take advantage of this - the allocations are basically the same as in the original code.

Quote:
Originally Posted by Dark Shikari
2. Does your scheme have better or worse cache behavior? Have you tried making measurements of cache misses? Is there any difference?
I have not performed any explicit measurements on cache behavior and misses, so I cannot tell. I suspect that, if there is a difference, the modified version of x264 will not behave as well as the original one, as I did not take this aspect into consideration.

Quote:
Originally Posted by Dark Shikari
3. This is one of the best papers on x264 I've ever seen. While that's not saying that much considering their typical quality, congrats.
I'll take this as a compliment - thank you.

Dust Signs

Last edited by Dust Signs; 15th October 2011 at 19:35. Reason: Added answers to all three questions
20th October 2011, 10:46   #7
akupenguin
x264 developer
 
Join Date: Sep 2004
Posts: 2,392
Page 15:
The diagram on the left is not the conventional non-pyramid B-frame structure (nor is it even an allowed structure: you're claiming to predict P-frames from B-frames that are later in coded order). Conventional is to predict B-frames only from P-frames, not from previous B-frames.

Page 20:
The DPB always acts as a FIFO (except for MMCOs). This is not caused by the use or ordering of L0 and L1.
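
As an illustration of the FIFO behaviour, here is a minimal sliding-window sketch in C. It tracks frame numbers only and deliberately leaves out MMCOs; the `dpb` struct and its functions are hypothetical names for this example, not taken from x264 or the standard.

```c
#include <assert.h>

#define DPB_MAX 16

/* Minimal sliding-window reference buffer: frame numbers only, no MMCOs. */
struct dpb {
    int frames[DPB_MAX];   /* frames[0] is the oldest reference */
    int count;
    int capacity;          /* number of reference frames, 1..DPB_MAX */
};

static void dpb_init(struct dpb *d, int capacity)
{
    assert(capacity >= 1 && capacity <= DPB_MAX);
    d->count = 0;
    d->capacity = capacity;
}

/* Insert a new reference frame; when the buffer is full, the oldest
 * frame is dropped first. Returns the evicted frame number, or -1. */
static int dpb_push(struct dpb *d, int frame_num)
{
    int evicted = -1;
    if (d->count == d->capacity) {
        evicted = d->frames[0];
        for (int i = 1; i < d->count; i++)
            d->frames[i - 1] = d->frames[i];
        d->count--;
    }
    d->frames[d->count++] = frame_num;
    return evicted;
}
```

Eviction depends only on insertion order, which is the point made above: how L0 and L1 are ordered for prediction plays no part in which frame leaves the buffer.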
(And if you're focusing on x264 rather than on the standard, you could skip the part about how P and B have different default reference orders, because x264 ignores the standard's default and makes them the same.)

Page 21:
Motion search area is not usually rectangular, nor any other data-independent shape. The sane methods are hill-climbing searches.
SSD and MSE are the same thing (if you ignore the normalization constant, which you would if you're only comparing to other values of the same metric).
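
The SSD/MSE point can be checked with a few lines of C. This is a plain reference sketch with illustrative function names, not x264's optimized routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of squared differences between two blocks of n pixels. */
static uint64_t ssd(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];
        sum += (uint64_t)(d * d);
    }
    return sum;
}

/* Mean squared error: the same sum divided by the constant block size. */
static double mse(const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(n > 0);
    return (double)ssd(a, b, n) / (double)n;
}
```

Because `n` is the same for every candidate block being compared, dividing by it never changes which candidate scores lowest, so an encoder can skip the division.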

Page 26:
The standard describes CABAC inefficiently. CABAC states actually fit in 7 bits per context, not 16.
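
To see where the 7 bits come from: a CABAC context consists of a probability state index in the range 0..63 (6 bits) and the value of the most probable symbol (1 bit). The C sketch below packs both into one byte; the bit layout (MPS in the lowest bit) and the function names are illustrative choices and not necessarily what x264 uses.

```c
#include <assert.h>
#include <stdint.h>

/* One CABAC context: 6-bit probability state index (0..63)
 * plus 1 bit for the most probable symbol, packed into 7 bits. */
static uint8_t cabac_pack(int state_idx, int mps)
{
    assert(state_idx >= 0 && state_idx < 64);
    assert(mps == 0 || mps == 1);
    return (uint8_t)((state_idx << 1) | mps);
}

/* Extract the probability state index from a packed context. */
static int cabac_state_idx(uint8_t ctx) { return ctx >> 1; }

/* Extract the most probable symbol from a packed context. */
static int cabac_mps(uint8_t ctx)       { return ctx & 1; }
```

The largest packed value is 127, so a context fits in a single byte with one bit to spare.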

Page 28:
Storing the bitstream from RDO is not helpful, not even for speed, and even if you ignore the negligible memory costs. It's faster to not generate the bitstream in the first place, since RDO only cares about the number of bits, not which bits they are.
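
The difference between counting bits and producing them can be sketched with unsigned Exp-Golomb codes, one of the variable-length codes H.264 uses. This is a simplified C illustration with hypothetical names, not x264's RDO code: `ue_size` returns the code length without touching any buffer, while `ue_write` has to emit every bit.

```c
#include <assert.h>
#include <stdint.h>

/* Length in bits of the unsigned Exp-Golomb code ue(v):
 * 2 * floor(log2(v + 1)) + 1. No bits are produced. */
static int ue_size(uint32_t v)
{
    assert(v < UINT32_MAX);
    uint32_t x = v + 1;
    int log2x = 0;
    while (x >>= 1)
        log2x++;
    return 2 * log2x + 1;
}

/* Minimal bit writer used only for comparison; buf must start zeroed. */
struct bitwriter {
    uint8_t buf[16];
    int bits;            /* number of bits written so far */
};

static void put_bit(struct bitwriter *bw, int bit)
{
    assert(bw->bits < (int)sizeof bw->buf * 8);
    if (bit)
        bw->buf[bw->bits >> 3] |= (uint8_t)(0x80 >> (bw->bits & 7));
    bw->bits++;
}

/* Write ue(v): log2x zero bits, then v + 1 in log2x + 1 bits. */
static void ue_write(struct bitwriter *bw, uint32_t v)
{
    int log2x = ue_size(v) / 2;
    uint32_t x = v + 1;
    for (int i = 0; i < log2x; i++)
        put_bit(bw, 0);
    for (int i = log2x; i >= 0; i--)
        put_bit(bw, (int)((x >> i) & 1));
}
```

For mode decision only the return value of `ue_size` matters, so the per-bit work in `ue_write` can be skipped for every candidate that is not selected.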
20th October 2011, 12:21   #8
Dust Signs
Quote:
Originally Posted by akupenguin
Page 15:
The diagram on the left is not the conventional non-pyramid B-frame structure (nor is it even an allowed structure: you're claiming to predict P-frames from B-frames that are later in coded order). Conventional is to predict B-frames only from P-frames, not from previous B-frames.
Some arrows are drawn in the wrong, i.e. opposite, direction. It is correctly described in the text, though. Thanks for noticing.

Quote:
Originally Posted by akupenguin
Page 28:
Storing the bitstream from RDO is not helpful. Not even for speed and ignoring the negligible memory costs. It's faster to not generate the bitstream in the first place, since RDO only cares about the number of bits, not which bits they are.
Although RDO only cares about the number of bits, the encoded bit stream of the optimal mode can be used right away instead of encoding it again later.

Thanks for your comments.

Dust Signs

Last edited by Dust Signs; 20th October 2011 at 12:39.
20th October 2011, 23:14   #9
akupenguin
Quote:
Originally Posted by Dust Signs
Some arrows are drawn in the wrong, i.e. opposite, direction.
Not just direction. If you meant to describe what's commonly used, there shouldn't be any arrows from one B-frame to another, and B-frames shouldn't be listed as DPB entries.
(This is the same frame structure as has been used since MPEG1, where B-frames could not be kept as references. If you're going to skip that tradition and use all of H.264's features, you might as well go all the way to pyramid (which is the other common structure), not some half-way state with B-references but old-fashioned order.)
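
For reference, the conventional non-pyramid structure described above implies a simple reordering from display order to coded order: every run of B-frames is coded after the I- or P-frame that follows it, and B-frames are never used as references. The C sketch below shows only that reordering; the function name and the string encoding of frame types are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* types: frame types ('I', 'P', 'B') in display order, n entries.
 * out_idx: receives the display indices in coded order (n entries).
 * Returns the number of indices written. */
static size_t coded_order(const char *types, size_t n, size_t *out_idx)
{
    size_t out = 0;
    size_t first_b = 0;   /* display index of the first pending B-frame */
    size_t pending = 0;   /* number of B-frames waiting for their anchor */

    assert(types != NULL && out_idx != NULL);
    for (size_t i = 0; i < n; i++) {
        if (types[i] == 'B') {
            if (pending == 0)
                first_b = i;
            pending++;
        } else {
            out_idx[out++] = i;                 /* the I or P anchor goes first */
            for (size_t k = 0; k < pending; k++)
                out_idx[out++] = first_b + k;   /* then the Bs displayed before it */
            pending = 0;
        }
    }
    /* Trailing B-frames without a following anchor are appended as-is;
     * a real encoder would not end a sequence this way. */
    for (size_t k = 0; k < pending; k++)
        out_idx[out++] = first_b + k;
    return out;
}
```

For the display order `IBBPBBP` this yields the coded order 0, 3, 1, 2, 6, 4, 5: each P-frame is coded before the B-frames that are displayed ahead of it.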

Quote:
Although RDO only cares about the number of bits, the encoded bit stream of the optimal mode can be used right away instead of encoding it again later.
I agree that that's a valid factor in performance analysis, but not a dominant one.
What I meant to point out is that the speed gain from not computing a bitstream in any of the RDO candidate modes outweighs the cost of recomputing the one finally selected mode in any situation where there are more than 2 or 3 candidates total. (I don't know the exact value of the threshold, but x264 is far above it.)
And there are also other optimizations that are incompatible with bitstream reuse, such as using trellis quantization for the final encode but deadzone for the candidates (which is a good idea in the medium speeds that might plausibly use RDO on some but not a lot of candidates).

And finally, if you're going to mention bitstream reuse despite my arguments against its efficiency, know that it also requires memory to store the reconstructed pixels and various sideband data (motion vectors etc) of the best mode so far, not just the bitstream.
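
The break-even argument can be written down as a small cost model in C. The model and its unit costs are hypothetical and only illustrate the arithmetic; they are not measurements of x264. With bitstream reuse every candidate is fully written, and without it every candidate is only counted and the winner is written once.

```c
#include <assert.h>

/* Hypothetical cost model (arbitrary units):
 *   count_cost: cost of only counting the bits of one candidate
 *   write_cost: cost of producing the actual bitstream of one candidate */

/* Bitstream reuse: every candidate is written, the winner is kept. */
static double cost_with_reuse(int candidates, double write_cost)
{
    assert(candidates >= 1);
    return candidates * write_cost;
}

/* No reuse: count every candidate, then write only the winner. */
static double cost_without_reuse(int candidates, double count_cost, double write_cost)
{
    assert(candidates >= 1);
    return candidates * count_cost + write_cost;
}
```

Under this model, skipping reuse wins once the number of candidates exceeds `write_cost / (write_cost - count_cost)`. If counting were assumed to cost half as much as writing, the break-even point would be 2 candidates, which is in the range of the "2 or 3" estimate above.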

Last edited by akupenguin; 20th October 2011 at 23:16.
Tags
h.264, memory management, short-term memory, x264
