Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion.
15th October 2011, 09:11 | #1
Registered User
Join Date: Jun 2004
Location: Salzburg, Austria
Posts: 219
Performance analysis of Short-term Memory in x264
Dear all,
I just finished my Master's thesis on porting x264 to Short-term Memory and thought it might be interesting to (some of) you. For the sake of completeness, here is the abstract:
Quote:
The thesis will be submitted for grading next week; it is currently being printed. Since the University requires, in principle, that the research and implementation for a Master's degree be done by the candidate alone, I did not make this public earlier, with one small exception that my advisor, Professor Christoph Kirsch, had approved beforehand. I hope you find the thesis interesting. The download links for both the thesis and the code (for completeness) are below. Note that the first title page of the thesis is in German due to the submission requirements; the rest of the thesis is in English.
Download links:
Best regards
Dust Signs
This is a cross post of http://doom10.org/index.php?topic=1962.0
__________________
The number you dialed is imaginary. Please turn your phone by 90° and try again
Last edited by Dust Signs; 27th March 2012 at 15:57. Reason: Updated links
15th October 2011, 18:42 | #3
x264 developer
Join Date: Sep 2005
Posts: 8,666
When posting code, can you post a git diff instead of (or in addition to) a tarball? It's vastly easier to read.
Some corrections and/or suggestions:
Page 33:
1. x264 only supports up to 10-bit, not 16-bit.
2. Newer versions of x264 support 4:2:2 and 4:4:4.
3. It might be useful to mention the differences between sliced and frame threads (and why both exist), and why frame threads are generally better.
Page 36:
1. There is one instance of x264_t per thread, not per encode. It holds basically everything, including analysis data for the current macroblock and so forth. You are correct that it's small enough to be largely irrelevant in terms of memory management.
Page 38:
1. Maybe you want to mention the hpel data in fdec frames as a large portion? It uses more memory than the original pixel data (for those frames).
Page 41:
1. --sync-lookahead is the number of frames used in the sync buffer between lookahead and encoding. --rc-lookahead is the number of lookahead frames.
2. Some of your parameters are missing hyphens.
3. It might be useful to mention what some of the parameters in your chart mean, e.g. that b-adapt 1 is a fast heuristic algorithm whereas b-adapt 2 is a Viterbi decision algorithm. This might be relevant because the latter requires dozens of frames to form a path, whereas the former can work on just a couple.
4. You omitted MB-tree.
5. "veryslow", not "very-slow", in the footnote.
Page 50:
1. Floating point operations are not expensive when done on a per-frame basis. Ratecontrol does thousands of them.
Page 55:
1. Foreman is often used for PSNR testing by low-quality papers, but it's not good for performance testing, especially with x264, because of its very small size. x264 can only use one thread per couple of macroblock rows of a frame.
General:
1. Have you considered taking advantage of this scheme to only allocate data where necessary? This is obviously impossible in x264's typical allocate-once scheme, but might be possible here. For example, in a P-frame, you don't need to allocate h->mb.mv[1].
2. Does your scheme have better or worse cache behavior? Have you tried measuring cache misses? Is there any difference?
3. This is one of the best papers on x264 I've ever seen. While that's not saying that much considering their typical quality, congrats.
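Dark Shikari's first general suggestion, allocating only the data a frame actually needs, can be sketched as follows: P-slices predict only from reference list 0, so storage for list-1 motion vectors (h->mb.mv[1] in x264) is only needed for B-slices. The types and functions below are hypothetical simplifications for illustration; they are not x264's real structures or allocation code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins; NOT x264's real types. */
typedef enum { SLICE_P, SLICE_B } slice_type_t;

typedef struct {
    /* One motion vector (x, y) per macroblock and reference list, for simplicity.
       H.264 allows up to 16 vectors per macroblock and list, so a real encoder stores more. */
    int16_t (*mv[2])[2];
    int mb_count;
} mb_data_t;

/* Allocate MV storage only for the lists the slice type uses.
   P-slices use list 0 only, so mv[1] stays NULL. Returns 0 on success, -1 on failure. */
static int mb_data_alloc(mb_data_t *mb, int mb_count, slice_type_t type)
{
    assert(mb != NULL && mb_count > 0);
    int lists = (type == SLICE_B) ? 2 : 1;
    mb->mb_count = mb_count;
    mb->mv[0] = NULL;
    mb->mv[1] = NULL;
    for (int i = 0; i < lists; i++) {
        mb->mv[i] = calloc((size_t)mb_count, sizeof(*mb->mv[i]));
        if (mb->mv[i] == NULL) {
            free(mb->mv[0]);   /* free(NULL) is a no-op if list 0 itself failed */
            mb->mv[0] = NULL;
            return -1;
        }
    }
    return 0;
}

/* Release whatever was allocated; safe for both P- and B-slice layouts. */
static void mb_data_free(mb_data_t *mb)
{
    free(mb->mv[0]);
    free(mb->mv[1]);
    mb->mv[0] = NULL;
    mb->mv[1] = NULL;
}
```

Calling mb_data_alloc with SLICE_P leaves mv[1] as NULL, so any code path that touches list 1 has to check the slice type first; that extra check is the trade-off compared with an allocate-once scheme.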
__________________
Follow x264 development progress | akupenguin quotes | x264 git status
ffmpeg and x264-related consulting/coding contracts | Doom10
Last edited by Dark Shikari; 15th October 2011 at 19:11.
15th October 2011, 18:53 | #4
Registered User
Join Date: Jun 2004
Location: Salzburg, Austria
Posts: 219
As I based my code on a downloaded tarball (snapshot) that did not include the git directories, it will take some time to provide a proper diff. I'll create one and put it online as soon as it's ready.
Dust Signs
15th October 2011, 19:24 | #5
Registered User
Join Date: Jun 2004
Location: Salzburg, Austria
Posts: 219
The diff(s) are at the link below (source and header files only, no scripts). As I don't have a Linux machine at hand right now, I created the diffs "manually" using a temporary TortoiseGit installation. I hope they are correct; I have no way to test compilation or execution on this machine.
The version the diff is based on is http://git.videolan.org/gitweb.cgi?p...cae52b770eeefb. I hope the diff file is easier to read.
Diff(s)
Dust Signs
Last edited by Dust Signs; 27th March 2012 at 15:57. Reason: Updated links
15th October 2011, 19:30 | #6
Registered User
Join Date: Jun 2004
Location: Salzburg, Austria
Posts: 219
@Dark Shikari: Thank you for your corrections and suggestions. Unfortunately, the thesis has already been submitted (and printed), so it cannot be changed anymore. If I have some time after my defense, I'll fix the incorrect statements and upload a new version.
Quote:
Quote:
Quote:
Dust Signs
Last edited by Dust Signs; 15th October 2011 at 19:35. Reason: Added answers to all three questions
20th October 2011, 10:46 | #7
x264 developer
Join Date: Sep 2004
Posts: 2,392
Page 15:
The diagram on the left is not the conventional non-pyramid B-frame structure (nor is it even an allowed structure: you're claiming to predict P-frames from B-frames that are later in coded order). The conventional structure predicts B-frames only from P-frames, not from previous B-frames.
Page 20:
The DPB always acts as a FIFO (except for MMCOs). This is not caused by the use or ordering of L0 and L1. (And if you're focusing on x264 rather than on the standard, you could skip the part about how P and B have different default reference orders, because x264 ignores the standard's default and makes them the same.)
Page 21:
The motion search area is usually not rectangular, nor any other data-independent shape. The sane methods are hill-climbing searches.
SSD and MSE are the same thing (if you ignore the normalization constant, which you would if you're only comparing to other values of the same metric).
Page 26:
The standard describes CABAC inefficiently. CABAC states actually fit in 7 bits per context, not 16.
Page 28:
Storing the bitstream from RDO is not helpful, not even for speed, and even if you ignore the negligible memory cost. It's faster not to generate the bitstream in the first place, since RDO only cares about the number of bits, not which bits they are.
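The Page 15 point about coded order can be made concrete. In the conventional (MPEG-1 style) structure, each B-frame is predicted from the I/P anchors on either side of it, so the next anchor has to be coded before the B-frames that precede it in display order, and nothing is predicted from a B-frame. The following is a minimal C sketch under those assumptions; the function name and the one-character-per-frame type string are hypothetical and not taken from x264.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a display-order frame-type string (e.g. "IBBPBBP") into coded order for the
   conventional non-pyramid structure: each I/P anchor is coded before the B-frames that
   precede it in display order, because those B-frames reference it.
   Writes display indices into out[] and returns the number of frames written. */
static size_t coded_order(const char *types, int *out, size_t cap)
{
    int pending[64];          /* B-frames waiting for their next anchor (max 64 in this sketch) */
    size_t npending = 0, n = 0;

    for (int i = 0; types[i] != '\0'; i++) {
        if (types[i] == 'B') {
            assert(npending < sizeof(pending) / sizeof(pending[0]));
            pending[npending++] = i;
            continue;
        }
        /* I or P: code the anchor first, then the B-frames that sit before it. */
        assert(n + 1 + npending <= cap);
        out[n++] = i;
        for (size_t k = 0; k < npending; k++)
            out[n++] = pending[k];
        npending = 0;
    }
    /* Trailing B-frames have no future anchor; a real encoder would turn the last one
       into a P-frame. This sketch simply appends them. */
    assert(n + npending <= cap);
    for (size_t k = 0; k < npending; k++)
        out[n++] = pending[k];
    return n;
}
```

For example, the display order IBBPBBP (frames 0 to 6) comes out in coded order 0, 3, 1, 2, 6, 4, 5. A P-frame predicted from a B-frame that comes later in this order could not be decoded, which is the problem pointed out above.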
20th October 2011, 12:21 | #8
Registered User
Join Date: Jun 2004
Location: Salzburg, Austria
Posts: 219
Quote:
Quote:
Thanks for your comments.
Dust Signs
Last edited by Dust Signs; 20th October 2011 at 12:39.
20th October 2011, 23:14 | #9
x264 developer
Join Date: Sep 2004
Posts: 2,392
Quote:
(This is the same frame structure that has been used since MPEG-1, where B-frames could not be kept as references. If you're going to skip that tradition and use all of H.264's features, you might as well go all the way to pyramid (which is the other common structure), rather than some halfway state with B-references but the old-fashioned order.)
Quote:
What I meant to point out is that the speed gained by not computing a bitstream for any of the RDO candidate modes outweighs the cost of recomputing the one finally selected mode, whenever there are more than 2 or 3 candidates in total. (I don't know the exact value of the threshold, but x264 is far above it.) There are also other optimizations that are incompatible with bitstream reuse, such as using trellis quantization for the final encode but deadzone quantization for the candidates (which is a good idea at the medium speeds that might plausibly use RDO on some, but not many, candidates). And finally, if you're going to mention bitstream reuse despite my arguments against its efficiency, note that it also requires memory to store the reconstructed pixels and various sideband data (motion vectors etc.) of the best mode so far, not just the bitstream.
Last edited by akupenguin; 20th October 2011 at 23:16.
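akupenguin's argument rests on the fact that rate-distortion optimization only needs each candidate's bit count, not its actual bits. A minimal sketch of that idea, assuming the common Lagrangian cost J = D + λ·R: every candidate is scored from a distortion value and an estimated rate, and no bitstream is written until a winner has been chosen. The struct, the function name and the numbers in the example are hypothetical and are not x264's mode-decision code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical candidate description: a distortion value (e.g. SSD) and an estimated
   bit count. The bit count is assumed to come from a bit-counting pass, not from
   writing an actual bitstream. */
typedef struct {
    const char *name;
    unsigned distortion;
    unsigned bits;
} rd_candidate_t;

/* Return the index of the candidate minimizing J = D + lambda * R.
   No bitstream is produced for any candidate; only the winner would be encoded afterwards. */
static size_t rd_pick_best(const rd_candidate_t *c, size_t n, unsigned lambda)
{
    assert(c != NULL && n > 0);
    size_t best = 0;
    unsigned long long best_cost =
        (unsigned long long)c[0].distortion + (unsigned long long)lambda * c[0].bits;
    for (size_t i = 1; i < n; i++) {
        unsigned long long cost =
            (unsigned long long)c[i].distortion + (unsigned long long)lambda * c[i].bits;
        if (cost < best_cost) {   /* strict '<' keeps the earliest candidate on ties */
            best_cost = cost;
            best = i;
        }
    }
    return best;
}
```

With the candidates {900, 1}, {400, 30} and {300, 60} (distortion, bits), λ = 10 selects the second (costs 910, 700, 900), while λ = 1 selects the third (costs 901, 430, 360). Only the selected mode would then be encoded for real, which is the single recomputation akupenguin refers to.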
Tags: h.264, memory management, short-term memory, x264