Maximum Theoretical Interframe Compression
Avenger007
25th January 2008, 02:24
I'll say upfront that I don't know much about the implementation of video encoders. Stray thoughts eventually led me to ask the following question:
What is the maximum theoretical interframe compression (i.e. temporal compression, which reduces final file size) that can be applied to a video file (along with regular intraframe compression), expressed as a percentage relative to current encoders, regardless of how much CPU time or RAM is required?
For example, shows like Lost and 24 cycle through camera angles within the same scene (or room) but encoders usually place an I-frame at camera angle changes without realizing that angle was seen before. I am guessing it's because of a limited number of reference frames, like 16 in x264, but if there is no limit then how much compression can be achieved?
Manao
25th January 2008, 06:59
H264 has the notion of "longterm" references which is meant to fix the problem you are mentioning, even with a limited number of references (if you cycle between two scenes, only three references are needed, two of them being longterm, one in each of the two scenes)
However, it's not used yet because :
- it's complicated to find when to mark a frame as longterm reference
- it prevents seeking
As for the efficiency of such a feature: let's say you cycle between two scenes, and that you can perfectly use long term. The duration of a scene is usually 2 seconds, so longterm will change an I-frame into a P-frame every two seconds. Depending on the bitrate, I-frames are 2 to 4 times bigger than P-frames, so, with a video at 25 fps, without B-frames (in order to make it simple), each 50-frame scene costs the equivalent of 51 to 53 P-frames instead of 50. You end up with a bitrate gain between 51/50 and 53/50, i.e. 2 to 6 percent.
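The arithmetic in the post above can be checked with a short sketch. The function name and its parameters are made up for illustration; sizes are counted in units of one P-frame, with no B-frames, exactly as in the post.

```python
def longterm_bitrate_gain(fps, scene_seconds, i_to_p_ratio):
    """Bits per scene without long-term refs divided by bits with them.

    Sizes are measured in units of one P-frame (hypothetical helper).
    """
    frames_per_scene = round(fps * scene_seconds)
    # Without long-term refs: one I-frame at the cut, the rest are P-frames.
    without_longterm = i_to_p_ratio + (frames_per_scene - 1)
    # With ideal long-term refs: the I-frame at the cut becomes a P-frame.
    with_longterm = frames_per_scene
    return without_longterm / with_longterm


for ratio in (2, 4):
    gain = longterm_bitrate_gain(fps=25, scene_seconds=2, i_to_p_ratio=ratio)
    print(f"I-frame = {ratio} x P-frame: ratio {gain:.2f}")
```

Longer shots or a smaller I-to-P size ratio shrink the gain further, which is why the estimate stays in the low single digits.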
foxyshadis
25th January 2008, 15:16
I think it would be kind of cool to have a patch that keeps the last frame before a scenecut around as a reference, reserving the last two or more reference slots for that (since I guess you'd need to save the current scene's last frame before testing the prior scene's against the next scene). That way you're not trying to pull references from everywhere in the movie at once, just a common case kind of thing, and if the user really wants to have lousy seeking they can have it, as with --keyint. Scenes that are back-and-forth enough to make use of it probably aren't long enough or common enough to make a killer dent in seeking, just a few places where it's longer than usual.
I mean, we have 16 ref slots and barely ever use more than 6-8 in normal encoding, can't hurt to fill the others with something as long as you mind level restrictions.
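A toy model of this idea, purely as a sketch: it is not x264 code, and the function name, the "angle label" input and the oldest-first eviction rule are all assumptions. Each shot is reduced to a camera-angle label, and a cut is assumed to be predictable (P-frame) only if the last frame of an earlier shot from the same angle is still held in a reserved slot.

```python
from collections import OrderedDict


def count_intra_cuts(shots, reserved_slots):
    """Count shots that must start with an I-frame in this toy model.

    shots: sequence of camera-angle labels, one per shot.
    reserved_slots: reference slots reserved for "last frame before a cut".
    """
    held = OrderedDict()  # angle label -> True, oldest entry first
    intra = 0
    for angle in shots:
        if angle not in held:
            intra += 1  # nothing similar is held, so code an I-frame
        # Keep (or refresh) the last frame of this shot for later cuts.
        held[angle] = True
        held.move_to_end(angle)
        while len(held) > reserved_slots:
            held.popitem(last=False)  # evict the oldest held frame
    return intra


print(count_intra_cuts("ABABAB", reserved_slots=0))
print(count_intra_cuts("ABABAB", reserved_slots=2))
```

In this model two reserved slots are enough for a back-and-forth between two angles, and cycling through three angles needs three.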
Avenger007
26th January 2008, 02:00
H264 has the notion of "longterm" references
I didn't know that, thanks for enlightening me ;)
As a result I did a quick search and came across this thread:
wide range of possible reference frames in mpeg-4 avc/h.264? http://forum.doom9.org/showthread.php?t=86532
P.S. That thread is dated December 2004
However, it's not used yet because :
- it's complicated to find when to mark a frame as longterm reference
- it prevents seeking
I figured it would be complicated but I didn't think about the seeking issue. Would it really prevent seeking? It would most likely need a new strategy for seeking.
This is what akupenguin said in the thread above about an idea similar to foxyshadis':
Yes, it's doable. There is a provision in the spec for "long-term references", and I can't imagine what else they intended it for.
It wouldn't even be too slow if you restrict it to scenechanges.
And it would only kill seeking in a 1-pass encode; the second pass would know when to keep long-term refs, and when it's ok to insert seek-points.
So it's on my todo list for x264. But I'm not Ateme, and I do this in my spare time, so no guarantees on when I finish.
Keep in mind that was back in Dec 2004.
The duration of a scene is usually 2 seconds
I was thinking of a much longer duration (I make the distinction between scene and camera angle).
For example, an entire episode of 24 would show inside CTU in several scenes in between other scenes (like in-the-field scenes). I can see a correlation between the CTU scenes wrt background (walls, etc.).
Would an extremely sophisticated encoder, perhaps using Neuro-Fuzzy logic, be able to "see" and learn from such a correlation? It would most likely need several passes (maybe even several passes over an entire season of 24), but would it achieve the required level of awareness to allow for significant compression (>50%)?
I mean, we have 16 ref slots and barely ever use more than 6-8 in normal encoding, can't hurt to fill the others with something as long as you mind level restrictions.
I completely agree with you; how much repetition is there in less than a few seconds of video :rolleyes:
Remember, I'm leaning towards the theoretical side of this discussion.
This is what Gary Sullivan at Mp4-tech said about short term and long term references:
I guess there are two primary differences.
1) Short-term reference pictures are indexed by referring to variables
that are a function of their frame_num value. But frame_num is a modulo
counter that wraps over periodically. Therefore there is a limit on how
long a short-term reference picture can remain in the buffer -- it
cannot remain there after the frame_num value has wrapped all the way
around and crossed over the same value again. In contrast, long-term
reference pictures are referenced by an index that is explicitly
assigned to them by syntax -- their long-term frame index. So a
long-term reference picture can stay in the decoded picture buffer as
long as the encoder wants it to.
2) There is no use of temporal (picture order count) relationships when
referencing long-term reference pictures in the decoding process.
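Point 1 can be illustrated with a small sketch. `log2_max_frame_num_minus4` is the sequence parameter set field that sets the counter width; everything else here (the helper names, the one-step-per-reference-picture counting, the dictionary standing in for long-term indices) is a simplification assumed for illustration, not decoder code.

```python
def max_frame_num(log2_max_frame_num_minus4):
    """MaxFrameNum as derived from the sequence parameter set field."""
    return 1 << (log2_max_frame_num_minus4 + 4)


def short_term_frame_num(reference_index, max_fn):
    """Simplified: frame_num advances by one per reference picture, modulo max_fn."""
    return reference_index % max_fn


max_fn = max_frame_num(0)  # smallest allowed counter: 16 values
old_ref = 3
new_ref = old_ref + max_fn  # the reference picture coded max_fn pictures later

# Both pictures would carry the same frame_num, so the old one cannot stay
# in the buffer as a short-term reference once the counter has wrapped.
collision = short_term_frame_num(old_ref, max_fn) == short_term_frame_num(new_ref, max_fn)

# A long-term picture is addressed by an index the encoder assigns explicitly,
# so it is unaffected by how many pictures have been coded since.
long_term = {0: old_ref}  # long-term frame index -> picture (hypothetical table)
print(max_fn, collision, long_term[0])
```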
For those interested, there is a paper on IEEE entitled
"H.264 Long-Term Reference Selection for Videos with Frequent Camera Transitions" by Ozbek, N. and Tekalp, A.M. at Ege Univ., Izmir
Here's the Abstract:
"Long-term reference prediction is an important feature of the H.264/MPEG-4 AVC standard, which provides a tradeoff between compression gain and computational complexity. In this study, we propose a long-term reference selection method for videos with frequent camera transitions to optimize compression efficiency at shot boundaries without increasing the computational complexity. Experimental results show up to 50% reduction in the number of bits (at the same PSNR) for frames at the border of camera transitions."
foxyshadis
26th January 2008, 05:05
Since this actually has practical possibilities, I'll move it into the AVC forum. More people who might have some actual numbers for you there anyway. :p
akupenguin
26th January 2008, 09:00
I figured it would be complicated but I didn't think about the seeking issue. Would it really prevent seeking? It would most likely need a new strategy for seeking.
Of the current containers, Matroska is the only one that supports the information needed to seek past longterm references. All other containers can only say keyframe or not; they can't specify that one frame references something other than the previous one in coding order. This isn't a problem for normal multiref because each reference frame depends on the previous, so there's still a unique decode order.
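A minimal sketch of why a keyframe flag alone is not enough: the frame numbering and the reference table below are invented for illustration. Frames 0-2 are one shot, frames 3-5 another (starting with a keyframe at 3), and frame 6 cuts back to the first shot by predicting from frame 2 as a long-term reference instead of being coded as a new keyframe.

```python
def frames_needed(target, refs):
    """Return every frame that must be decoded to display `target`.

    refs maps a frame index to the frame indices it predicts from.
    """
    needed, stack = set(), [target]
    while stack:
        frame = stack.pop()
        if frame not in needed:
            needed.add(frame)
            stack.extend(refs.get(frame, ()))
    return sorted(needed)


# Hypothetical stream: 0 and 3 are keyframes, 6 uses frame 2 as a long-term ref.
refs = {0: [], 1: [0], 2: [1], 3: [], 4: [3], 5: [4], 6: [2], 7: [6]}
print(frames_needed(7, refs))
```

A player that only knows "frame 3 is the nearest keyframe" would start decoding there and fail at frame 6, while the frames it actually needs lie before that keyframe.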
I make the distinction between scene and camera angle.
But the codec doesn't.
Would an extremely sophisticated encoder, perhaps using Neuro-Fuzzy logic, be able to "see" and learn from such a correlation? It would most likely need several passes (maybe even several passes over an entire season of 24), but would it achieve the required level of awareness to allow for significant compression (>50%)?
Unless each appearance uses (almost) exactly the same camera angle, encoder sophistication alone wouldn't be enough: you'd need a change in the inter prediction framework from block-based to some sort of 3-D modeling. There are attempts to derive a 3-D model of a scene from a video, but none that I am aware of have attempted to use that in compression. Although if enough scenes show the same background, probably some of them will be similar enough just by chance (especially when TV show budget gives incentive to reuse footage...)
Such a change in prediction method would make more difference than the long term references. For example, overlapped block motion compensation (which is still quite simple) can reduce the residual by ~25% at the cost of 4x the decoding CPU requirement.