Old 2nd January 2018, 16:40   #1  |  Link
Registered User
leiming2006's Avatar
Join Date: Mar 2006
Location: Shanghai, China
Posts: 203
Confused by PTS, DTS and CTS in MP4 and FLV

Hello, I have recently been learning about the FLV and MP4 containers, and I am confused by their timestamps.

Long ago I came into contact with XviD in AVI and heard about packed bitstreams.
It is said that AVI has no B-frame support and requires the decoder to output one frame for each packet of input.
To satisfy that requirement, a hack called the packed bitstream is used to break the limitation: it places a B-frame and the P-frame it depends on in the same packet.
Thus the packets are stored in the order [I] [PB] [B] [] [PB] [B] [] ... so that the decoder can give one frame of output for one packet of input.

Time flies, and now it is the age of MP4.
I work on FLV streaming at my job, and I have come to know that MP4 and FLV store DTS and a CTS offset in their headers.
The famous tool FFmpeg ships with ffprobe, which offers a feature to dump the packet information of a stream, reporting PTS and DTS in its output.
It regards DTS + CTS offset as PTS.
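As a minimal sketch of that relationship (units are milliseconds; the dict fields are my own naming, not ffprobe's):

```python
# Each sample stores a DTS plus a CTS offset ("composition time offset").
# PTS is derived as DTS + CTS offset, which is what ffprobe reports.
samples = [
    {"dts": 0,  "cts_offset": 40},   # I
    {"dts": 40, "cts_offset": 120},  # P
    {"dts": 80, "cts_offset": 0},    # B
]

for s in samples:
    s["pts"] = s["dts"] + s["cts_offset"]

print([s["pts"] for s in samples])  # [40, 160, 80]
```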

I have a few ideas that I would like to confirm.

In a CFR (constant frame rate) situation,
I know that PTS is [frame number] * [frame interval], where the frame number is counted in presentation order,
and DTS is [frame number] * [frame interval], where the frame number is counted in decoding order.

For example, at 25 FPS a bitstream will be given DTS and CTS (or PTS) like this:
    I   P   B   B   P   B   B   P   ...
DTS 0   40  80 120 160 200 240 280  ...
CTS 0  120  40  80 240 160 200 360  ...
But CTS >= DTS is required, so one of them must be delayed. It then becomes:
    I   P   B   B   P   B   B   P   ...
DTS 0   40  80 120 160 200 240 280  ...
CTS 40 160  80 120 280 200 240 400  ...
The same applies to FLV. (Is that right?)
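The two tables above can be generated mechanically; here is a sketch (the presentation-order indices are taken from the example, not computed from the bitstream):

```python
# The pattern I P B B P B B P ... at 25 fps (40 ms per frame).
frame_interval = 40
decode_order = ["I", "P", "B", "B", "P", "B", "B", "P"]
present_index = [0, 3, 1, 2, 6, 4, 5, 9]  # each frame's position in display order

dts = [i * frame_interval for i in range(len(decode_order))]
cts = [p * frame_interval for p in present_index]

# Enforce CTS >= DTS by delaying every CTS by the maximum of (DTS - CTS).
delay = max(d - c for d, c in zip(dts, cts))
cts_delayed = [c + delay for c in cts]

print(dts)          # [0, 40, 80, 120, 160, 200, 240, 280]
print(cts_delayed)  # [40, 160, 80, 120, 280, 200, 240, 400]
```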

But I am not sure what they really mean.

In my understanding, DTS is short for decoding timestamp and PTS is short for presentation timestamp.
The former stands for the time at which the packet should be fed to the decoder, and the latter for the time at which the decoded frame should be shown. (Is that right?)
This seems very important for hardware players, which need to control their input and output buffers.
If CTS means output and DTS means input (is that right?), the sorted actions will be:
   action            buffer             present
  I (DTS: 0)        I
  I (CTS: 40)                           I
  P (DTS: 40)       P                   I
  B (DTS: 80)       P B                 I
  B (CTS: 80)       P                   I B
  B (DTS: 120)      P B                 I B
  B (CTS: 120)      P                   I B B
  P (CTS: 160)                          I B B P
  P (DTS: 160)      P                   I B B P
  B (DTS: 200)      P B                 I B B P
  B (CTS: 200)      P                   I B B P B
  B (DTS: 240)      P B                 I B B P B
  B (CTS: 240)      P                   I B B P B B
  P (CTS: 280)                          I B B P B B P
    ...              ...                   ...
Then the CTS offset (roughly) means how long the sample (packet/frame) will stay in the buffer before it is output.
By this idea, as long as the input and output orders are kept unchanged and CTS is unchanged, modifying the DTS and CTS offset will not affect the final playback. (Is that right?)
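My reading of that timeline can be replayed as a toy event simulation (my own model, using the delayed timestamps from the table above; not how any real decoder is specified): each DTS event moves a frame into the decoder buffer, each CTS event moves it out to the display.

```python
# Each frame is (type, dts, cts); merge the input/output events,
# sort by time, and replay them to watch the buffer.
frames = [("I", 0, 40), ("P", 40, 160), ("B", 80, 80),
          ("B", 120, 120), ("P", 160, 280), ("B", 200, 200), ("B", 240, 240)]

events = []
for name, dts, cts in frames:
    events.append((dts, "in", name))
    events.append((cts, "out", name))
# At equal times, process "in" before "out" (a frame with cts == dts
# must enter the buffer before it can leave).
events.sort(key=lambda e: (e[0], e[1] == "out"))

buffer, presented = [], []
for t, kind, name in events:
    if kind == "in":
        buffer.append(name)
    else:
        buffer.remove(name)  # removes the oldest frame of that type
        presented.append(name)

print(presented)  # ['I', 'B', 'B', 'P', 'B', 'B', 'P'] -- display order
```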

I can guess at two situations:
1. A hardware player will run out of internal buffer if I feed it an MP4 file with very small DTS values at the beginning, like this:
    I   P   B   B   P   B   B   P   ...
DTS 0   1   2   3   4   5   6  280  ...
CTS 40 160  80 120 280 200 240 400  ...
It may run out of buffer before it ever reaches the P-frame with DTS = 280.

2. I can assign timestamps like the packed bitstream does, like this:
    I   P   B   B   P   B   B   P   ...
DTS 0   79  80 120 199 200 240 319  ...
CTS 0  120  40  80 240 160 200 360  ...
and playback should have no problem.

The trouble comes with VFR (variable frame rate).
I have seen that many tools have a feature to import timecodes into a track of an MP4 file.
There are tc4mp4, mp4fpsmod, dtsedit, dtsrepair, and lsmash-timelineeditor, and they seem to generate output files with different DTS values.
If the idea I came up with above is right, those outputs are all equally correct.

But I am not confident in this idea.
I think the concepts of DTS and CTS are the same across MP4 and FLV.
In the open-source project flv.js (https://github.com/bilibili/flv.js , a library for FLV playback in HTML5-compatible browsers),
a video sample's duration is given by [next frame's dts] - [current frame's dts]. (https://github.com/Bilibili/flv.js/b...emuxer.js#L359 )
I think the duration should be computed from CTS (or PTS), because that is the presentation timestamp. (Is that right?)
The interval at which packets are fed to the decoder has nothing to do with a sample's duration.
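To make the question concrete, here is a toy comparison with made-up VFR timestamps (decode order I, P1, P2, B; display order I, P1, B, P2), chosen so that the two methods of computing durations disagree:

```python
# Decode-order (dts, pts) pairs; made-up numbers for illustration.
frames = [(0, 20), (30, 50), (60, 120), (90, 90)]

# flv.js-style: duration = next frame's dts - current frame's dts.
dur_from_dts = [b[0] - a[0] for a, b in zip(frames, frames[1:])]

# Presentation-based: sort the pts and take deltas in display order.
pts_sorted = sorted(p for _, p in frames)
dur_from_pts = [b - a for a, b in zip(pts_sorted, pts_sorted[1:])]

print(dur_from_dts)  # [30, 30, 30]
print(dur_from_pts)  # [30, 40, 30]
```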
However, the MP4 standard (http://standards.iso.org/ittf/Public...96-12_2015.zip ) mentions, in H.3.5, that
it is not necessary to sum the sample durations of all preceding samples in previous fragments to find this value (where the sample durations are the deltas in the Decoding Time to Sample Box and the sample_durations in the preceding track runs).
It also says:
The decoding time DT(i) for sample number i is derived by summing up the sample durations of all the samples preceding sample i from the Decoding Time to Sample box and, if needed, the Track Fragment Run boxes referring to any sample preceding sample i.
This indicates that DTS has something to do with duration. (Is that right?)
Thus DTS is not only the timestamp at which the packet is sent to the decoder; it also affects how long the sample lasts.
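The DT(i) rule quoted above can be written out directly (a sketch with my own function name; durations are the stts deltas, in timescale units):

```python
# Per the quoted rule: DT(i) is the sum of the durations of all samples
# preceding sample i, so DTS and sample duration are tied together.
def decode_times(sample_durations):
    dts, t = [], 0
    for d in sample_durations:
        dts.append(t)
        t += d
    return dts

# A VFR track: the deltas are not all equal.
print(decode_times([40, 40, 20, 60, 40]))  # [0, 40, 80, 100, 160]
```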

Then a large number of questions arise, such as:
"What happens if the durations do not match the PTS intervals?"
"Which of the tools that import timecodes into MP4 tracks fits the standard best?"

Since nearly all players are made to be compatible with all sorts of incorrect streams,
it is not easy to check whether my idea is correct - they are likely to give correct output on incorrect input.
But some of them, like the flv.js implementation, may produce incorrect A/V sync for incorrect input.

Currently I am working on live broadcasting of FLV streams, and I think it is important to push a correct stream to the server.
Mobile phones have limited resources and will not produce a CFR stream.
Also, Android's MediaCodec API gives no DTS for output packets; I must fill it in myself.
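For filling in DTS myself, one trick I am considering (my own sketch, not anything the MediaCodec API provides) is to sort the PTS values and shift them back by the reorder delay, assigning the i-th shifted value as the DTS of the i-th frame in decode order:

```python
# Given PTS values in decode order and a known max reorder depth,
# derive a nondecreasing DTS sequence with dts[i] <= pts[i].
def derive_dts(pts_decode_order, reorder_delay_frames):
    ordered = sorted(pts_decode_order)
    # Shift the sorted presentation times back by the reorder delay.
    shift = ordered[reorder_delay_frames] - ordered[0]
    return [p - shift for p in ordered]

# The delayed CTS row from the earlier 25 fps example, reorder depth 1:
pts = [40, 160, 80, 120, 280, 200, 240]
print(derive_dts(pts, 1))  # [0, 40, 80, 120, 160, 200, 240]
```

This reproduces the DTS row of the earlier table, but I have not verified that it holds for every VFR stream.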

Thanks very much.

Last edited by leiming2006; 2nd January 2018 at 18:07.
