Confused by PTS, DTS and CTS in MP4 and FLV

leiming2006 · 2nd January 2018, 16:40

Hello, I have recently been learning about the FLV and MP4 containers and am confused by the timestamps.

Long ago I came across XviD in AVI and heard about packed stream.
It's said that AVI has no B-frame support and requires the decoder to output one frame for each packet of input.
To satisfy this requirement, a hack called packed stream is used to get around the limitation. It places a P-frame and the B-frame that depends on it in the same packet.
Thus the packets are stored in the order [I] [PB] [B] [] [PB] [B] []... so that the decoder can output one frame for each packet of input.

Time flies and it's the age of MP4 now.
I work on FLV streaming at my job, and have learned that MP4 and FLV store DTS and CTS offset in their headers.
The well-known tool FFmpeg ships with ffprobe, which can dump the packet information of a stream, including PTS and DTS.
It regards DTS + CTS Offset as PTS.

I have a number of ideas and would like to confirm whether they are correct.

In the CFR (constant frame rate) case,
I know that PTS is "[frame number] * [frame interval]", where the frame number counts frames in presentation order,
and DTS is "[frame number] * [frame interval]", where the frame number counts frames in decoding order.

For example a bit stream

Code:

I P B B P B B P B B

will be given DTS and CTS (or PTS) like this at 25 FPS (times in milliseconds)

Code:

    I   P   B   B   P   B   B   P   ...
DTS 0   40  80 120 160 200 240 280  ...
CTS 0  120  40  80 240 160 200 360  ...

But CTS >= DTS is required, so the presentation times have to be delayed (here by one frame interval, 40). It then becomes

Code:

    I   P   B   B   P   B   B   P   ...
DTS 0   40  80 120 160 200 240 280  ...
CTS 40 160  80 120 280 200 240 400  ...

The same applies to FLV. (Is that right?)
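
A rough sketch of how the two tables above can be derived, in Python. The function name `cfr_timestamps` is made up for illustration, and it assumes the simple pattern from the example: every I/P frame is displayed after the B-frames that directly follow it in decode order. Times are in milliseconds.

```python
def cfr_timestamps(decode_order, fps=25):
    """Return (dts, pts, cts) lists for frame types given in decode order.

    Assumption: each I/P frame is presented after the run of B-frames
    that directly follows it in decode order.
    """
    interval = 1000 // fps  # 40 ms at 25 FPS
    display_index = [0] * len(decode_order)
    next_slot = 0        # next free position in presentation order
    pending_ref = None   # I/P frame still waiting for its B-frames
    for i, kind in enumerate(decode_order):
        if kind == "B":
            display_index[i] = next_slot
            next_slot += 1
        else:
            if pending_ref is not None:
                display_index[pending_ref] = next_slot
                next_slot += 1
            pending_ref = i
    if pending_ref is not None:
        display_index[pending_ref] = next_slot

    dts = [i * interval for i in range(len(decode_order))]
    pts = [d * interval for d in display_index]
    # Smallest delay that makes every presentation time >= its DTS.
    delay = max((d - p for d, p in zip(dts, pts)), default=0)
    cts = [p + delay for p in pts]
    return dts, pts, cts


dts, pts, cts = cfr_timestamps("I P B B P B B P B B".split())
print(dts[:8])  # DTS row
print(pts[:8])  # CTS row before the delay
print(cts[:8])  # CTS row after the delay
```

The delay is chosen as the largest amount by which any DTS exceeds its undelayed presentation time, which is 40 for this stream.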

But I don't know what they really mean.

In my understanding, DTS is short for decoding timestamp and PTS is short for presentation timestamp.
The former is the time at which the packet should be fed to the decoder, and the latter is the time at which the decoded frame should be shown. (Is that right?)
This seems very important for hardware players, which need to know how to manage their input and output buffers.
If CTS means output and DTS means input (is that right?), the actions sorted by time will be

Code:

   action            buffer             present
  I (DTS: 0)        I
  I (CTS: 40)                           I
  P (DTS: 40)       P                   I
  B (DTS: 80)       P B                 I
  B (CTS: 80)       P                   I B
  B (DTS: 120)      P B                 I B
  B (CTS: 120)      P                   I B B
  P (CTS: 160)                          I B B P
  P (DTS: 160)      P                   I B B P
  B (DTS: 200)      P B                 I B B P
  B (CTS: 200)      P                   I B B P B
  B (DTS: 240)      P B                 I B B P B
  B (CTS: 240)      P                   I B B P B B
  P (CTS: 280)                          I B B P B B P
    ...              ...                   ...
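
The action list above can be reproduced with a small simulation. This is only a sketch of the reasoning in this post, not a model of any real decoder: the function name `timeline` and the tie-breaking rule (same time: earlier sample first, then input before output) are assumptions chosen to match the table.

```python
def timeline(frames):
    """frames: list of (kind, dts, cts) in decode order.

    Returns a list of (action, buffer) pairs, where buffer is the list of
    frame kinds that have been fed in but not yet presented.
    """
    events = []
    for idx, (kind, dts, cts) in enumerate(frames):
        events.append((dts, idx, 0, kind))  # 0 = fed to the decoder
        events.append((cts, idx, 1, kind))  # 1 = presented
    # Sort by time, then by sample index, then input before output.
    events.sort()

    buffered = []  # sample indices currently held
    rows = []
    for time, idx, is_output, kind in events:
        if is_output:
            buffered.remove(idx)
            action = f"{kind} (CTS: {time})"
        else:
            buffered.append(idx)
            action = f"{kind} (DTS: {time})"
        rows.append((action, [frames[i][0] for i in buffered]))
    return rows


frames = [("I", 0, 40), ("P", 40, 160), ("B", 80, 80), ("B", 120, 120),
          ("P", 160, 280), ("B", 200, 200), ("B", 240, 240), ("P", 280, 400)]
for action, buffer in timeline(frames):
    print(f"{action:<14} {' '.join(buffer)}")
```

Under these assumptions a frame's CTS offset is exactly the time between its two events, so a B-frame with offset 0 enters and leaves the buffer at the same instant.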

Then the CTS offset (roughly) means how long the sample (packet/frame) stays in the buffer before it is output.
By this idea, as long as the input order, the output order and the CTS are kept unchanged, modifying the DTS and the CTS offset will not affect the final playback. (Is that right?)

I can think of 2 situations:
1. A hardware player will run out of internal buffer if I feed it an MP4 file with small DTS values at the beginning, like this:

Code:

    I   P   B   B   P   B   B   P   ...
DTS 0   1   2   3   4   5   6  280  ...
CTS 40 160  80 120 280 200 240 400  ...

It may never get the chance to reach the P-frame with DTS=280.

2. I can assign timestamps the way packed stream does, like this:

Code:

    I   P   B   B   P   B   B   P   ...
DTS 0   79  80 120 199 200 240 319  ...
CTS 0  120  40  80 240 160 200 360  ...

and the playback should have no problem.
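
The two timestamp sets above can be checked mechanically against the CTS >= DTS rule mentioned earlier. This is a minimal sketch: the function name `check_timestamps` is made up, and the requirement that DTS be strictly increasing is an assumption of this sketch rather than something taken from a specification.

```python
def check_timestamps(dts, cts):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for i in range(1, len(dts)):
        # Assumption: decode times must strictly increase.
        if dts[i] <= dts[i - 1]:
            problems.append(f"sample {i}: DTS {dts[i]} <= previous DTS {dts[i - 1]}")
    for i, (d, c) in enumerate(zip(dts, cts)):
        # Rule quoted earlier in this post: CTS >= DTS.
        if c < d:
            problems.append(f"sample {i}: CTS {c} < DTS {d}")
    return problems


situation_1 = ([0, 1, 2, 3, 4, 5, 6, 280],
               [40, 160, 80, 120, 280, 200, 240, 400])
situation_2 = ([0, 79, 80, 120, 199, 200, 240, 319],
               [0, 120, 40, 80, 240, 160, 200, 360])

for name, (dts, cts) in (("1", situation_1), ("2", situation_2)):
    print(f"situation {name}:", check_timestamps(dts, cts) or "no problems found")
```

With the numbers as written, situation 1 passes both checks, while the B-frames in situation 2 have CTS < DTS (for example CTS 40 against DTS 80), so that table would need a negative CTS offset or a delayed CTS row to satisfy the rule.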

The trouble comes with VFR (variable frame rate).
I have seen many tools that can import timecodes into a track of an MP4 file.
There are tc4mp4, mp4fpsmod, dtsedit, dtsrepair, lsmash-timelineeditor. And they seem to generate output files with different DTS.
If the idea I came up with is right, all of these outputs are correct.

But I'm not confident in the idea.
I think the concept of DTS and CTS is the same across MP4 and FLV.
In the open-source project flv.js (https://github.com/bilibili/flv.js which is a library for FLV playback in HTML5-compatible browsers),
a video sample's duration is computed as [next frame's dts] - [current frame's dts]. (https://github.com/Bilibili/flv.js/b...emuxer.js#L359 )
I think the duration should be computed from CTS (or PTS) because that is the presentation timestamp. (Is that right?)
The interval at which packets are fed to the decoder should have nothing to do with a sample's duration.
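
To make the difference concrete, here is a sketch that computes durations both ways. The function names and the VFR timestamps are hypothetical, chosen only for illustration; the first function mirrors the "[next dts] - [current dts]" rule described above and is not code taken from flv.js.

```python
def durations_from_dts(dts):
    """Duration of each sample as the gap to the next DTS (last one unknown)."""
    return [b - a for a, b in zip(dts, dts[1:])] + [None]


def durations_from_pts(pts):
    """Duration of each sample as the gap to the next frame in presentation order."""
    order = sorted(range(len(pts)), key=lambda i: pts[i])
    durations = [None] * len(pts)  # the last presented frame stays unknown
    for pos, i in enumerate(order[:-1]):
        durations[i] = pts[order[pos + 1]] - pts[i]
    return durations


# Hypothetical VFR samples in decode order: I P B B
dts = [0, 40, 80, 120]
pts = [40, 160, 80, 100]
print(durations_from_dts(dts))  # [40, 40, 40, None]
print(durations_from_pts(pts))  # [40, None, 20, 60]
```

For CFR input the two results agree up to reordering, which may be why the difference only shows up with VFR streams.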
However, in the MP4 standard (http://standards.iso.org/ittf/Public...96-12_2015.zip ),
it mentions that

Quote:

(from 8.8.12.1)
it is not necessary to sum the sample durations of all
preceding samples in previous fragments to find this value (where the sample durations are the deltas
in the Decoding Time to Sample Box and the sample_durations in the preceding track runs).

and

Quote:

(from H.3.5)
The decoding time DT(i) for sample number i is derived by summing up the sample durations of all the
samples preceding sample i from the Decoding Time to Sample box and, if needed, the Track Fragment
Run boxes referring to any sample preceding sample i.

This indicates that DTS has something to do with duration. (Is that right?)
Thus the DTS is not only the timestamp at which the packet is sent to the decoder, but also affects how long the sample lasts.
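
The relation in the H.3.5 quote can be written out as a running sum. This is a sketch under the reading given in this post (DT is the sum of the durations of all preceding samples, and presentation time is DTS plus the CTS offset); the function names are made up, indices are 0-based here, and the durations and offsets are the CFR example values from earlier.

```python
from itertools import accumulate


def decoding_times(sample_durations):
    """DT(i) = sum of the durations of all samples before sample i."""
    # accumulate(..., initial=0) yields 0, d0, d0+d1, ...; drop the final total.
    return list(accumulate(sample_durations, initial=0))[:-1]


def composition_times(sample_durations, cts_offsets):
    """CT(i) = DT(i) + offset(i), i.e. PTS = DTS + CTS offset."""
    return [dt + off for dt, off in zip(decoding_times(sample_durations), cts_offsets)]


durations = [40] * 8
offsets = [40, 120, 0, 0, 120, 0, 0, 120]
print(decoding_times(durations))              # [0, 40, 80, 120, 160, 200, 240, 280]
print(composition_times(durations, offsets))  # [40, 160, 80, 120, 280, 200, 240, 400]
```

Read this way, the stored durations and the DTS are two views of the same data, so changing one necessarily changes the other.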

Then, there comes a large number of questions like
"what happens if the duration does not match the PTS interval"
"which of the tools that import timecodes into MP4 tracks fits the standard best"
etc.

Nearly all kinds of players are made to tolerate various incorrect streams,
so it's not easy to check whether the idea I described is correct: they are likely to give correct output even on incorrect input.
But some of them, like the flv.js implementation, may produce incorrect A/V sync for incorrect input.

Currently I'm working on live broadcasting of FLV streams, and I think it's important to push a correct stream to the server.
Mobile phones have limited resources and will not produce a CFR stream.
Also, Android's "MediaCodec" API gives no "DTS" for output packets. I must fill it in myself.
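
One possible way to fill in the missing DTS, sketched in Python rather than Android code. It assumes packets arrive in decode order carrying only a presentation timestamp, reuses the sorted presentation times as decode times, and then delays every CTS by the smallest amount that gives CTS >= DTS. The function name `fill_dts` is made up, and this offline version needs the whole list at once; a live stream would need a bounded reorder window instead.

```python
def fill_dts(pts_in_decode_order):
    """Return (dts, cts, delay) for packets that only carry a PTS.

    Assumption: the i-th packet in decode order gets the i-th smallest
    presentation time as its DTS, which keeps DTS increasing even for VFR.
    """
    dts = sorted(pts_in_decode_order)
    # Smallest delay that makes every delayed presentation time >= its DTS.
    delay = max((d - p for d, p in zip(dts, pts_in_decode_order)), default=0)
    cts = [p + delay for p in pts_in_decode_order]
    return dts, cts, delay


# Hypothetical VFR presentation times in decode order: I P B B P B B
pts = [0, 130, 40, 90, 260, 170, 210]
dts, cts, delay = fill_dts(pts)
print(dts)    # [0, 40, 90, 130, 170, 210, 260]
print(cts)    # [50, 180, 90, 140, 310, 220, 260]
print(delay)  # 50
```

Because the video presentation times are shifted by `delay`, the audio timestamps would presumably need the same shift to keep A/V sync.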

Thanks very much.