Feedback wanted: Dynamic QP adaptation for AI Video Chat [Archive]

Jiangkai

30th November 2025, 14:50

Hi all,

I am working on a project that optimizes real-time video communication where the receiver is an AI (like GPT-4o), not a human.

Since the "viewer" is an AI, the optimization target shifts from human perceptual quality (SSIM/VMAF) to model inference accuracy. We found that we can sacrifice background visual quality significantly without hurting the model's understanding. We developed a method that dynamically adjusts region-based QP according to the current chat context (using a CLIP model to define ROIs and applying delta-QP values in the encoder).
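
To make the delta-QP idea concrete, here is a minimal sketch of turning per-block relevance scores into a delta-QP map. It is an illustration under stated assumptions, not the method from the report: the function name, the linear mapping, and the `max_delta`/`min_delta` defaults are all hypothetical, and the relevance scores are assumed to come from some text-image similarity step (for example a CLIP model) that is not shown. How the map is handed to the encoder is encoder-specific and also left out.

```python
import numpy as np

def delta_qp_map(relevance, max_delta=10, min_delta=-4):
    """Map per-block relevance scores in [0, 1] to integer delta-QP values.

    Hypothetical linear mapping: relevance 0 -> max_delta (coarser
    quantization for background), relevance 1 -> min_delta (finer
    quantization for the region the chat context cares about).
    """
    rel = np.clip(np.asarray(relevance, dtype=np.float64), 0.0, 1.0)
    dqp = max_delta + (min_delta - max_delta) * rel
    return np.rint(dqp).astype(np.int32)

# Example: a 2x2 grid of blocks; the top-right block is the ROI.
scores = [[0.0, 1.0],
          [0.5, 0.5]]
qp_offsets = delta_qp_map(scores)
```

A linear mapping is only the simplest choice; a threshold or a steeper curve may suit a given model better.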

I'm looking for some feedback or critique from the experts here regarding this encoding strategy. Also, I would appreciate any other suggestions or tricks to further reduce latency for this specific AI video chat scenario.

Technical Report: https://arxiv.org/abs/2507.10510
Email: jiangkai.wu@stu.pku.edu.cn

Any thoughts are welcome! Thanks!

GeoffreyA

1st December 2025, 18:12

Would it be possible to transform the frame into a "machine representation," opaque to humans but understandable to the model? It could extract key features of the image, discarding a lot of information humans need.

Z2697

1st December 2025, 20:28

Maybe just send a couple of low resolution images, if the numbers in 2.1 are accurate. (The limitations of the AI "watching the video" are 2 FPS and 602,112 pixels)
Of course, the same ROI approach can be applied to the images as well.

And there's probably still some bitrate can be saved by motion compensation, 2 fps video it is then.

You can even use AI to select most stable frames, or reduce motion blur, if it's from a handheld device.

And, the example video in your DeViBench README (https://github.com/pku-netvideo/DeViBench/blob/4c41b6133aae6ac9394b3bb34f503a43c44afa91/README.md) is a YUV444P8 video with YCbCr matrix (well, not sure which one) mistakenly flagged as GBR.
And again, judging from the "rainbow colored" video artifacts in the left view, it's encoded in actual GBR colorspace, which is not the most efficient colorspace for video encoders optimized for YUV.
Ugh...

Jiangkai

17th December 2025, 04:32

GeoffreyA wrote:
> Would it be possible to transform the frame into a "machine representation," opaque to humans but understandable to the model? It could extract key features of the image, discarding a lot of information humans need.

That is a truly insightful idea—and essentially the ultimate goal of our work!

In reality, LLMs do not process raw images or video directly. Instead, the input is first encoded into visual tokens (or embeddings), which are then concatenated with text tokens for understanding and response generation. So, theoretically, transmitting these visual tokens directly (which corresponds exactly to the "machine representation" you suggested) would be the ideal approach.

However, for LLM video chatting tasks, these visual tokens are floating-point tensors, whose bitrate is often too high to stream efficiently. Furthermore, quantization or compression of these features tends to severely degrade LLM accuracy (I discuss this trade-off in detail in Section 4 of the paper).
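
As a rough back-of-the-envelope illustration of why raw token tensors are expensive to stream, the sketch below computes the uncompressed bitrate of a token stream. Every number in the example call is a made-up placeholder (token count, embedding width, and fp16 storage are assumptions, not figures from the paper); only the arithmetic is the point.

```python
def token_stream_mbps(tokens_per_frame, dim, bytes_per_value, fps):
    """Uncompressed bitrate (Mbit/s) of streaming visual-token tensors."""
    bits_per_second = tokens_per_frame * dim * bytes_per_value * fps * 8
    return bits_per_second / 1e6

# Hypothetical values: 768 tokens per frame, 3584-dim embeddings,
# 2 bytes per value (fp16), 2 frames per second.
example_mbps = token_stream_mbps(768, 3584, 2, 2)
```

With these placeholder values the result is roughly 88 Mbit/s before any compression, far above what a typical low-bitrate video stream needs.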

Therefore, we fell back to transmitting the video stream, but leveraged ROI encoding to achieve the goal of discarding information the LLM doesn't need.

Jiangkai

17th December 2025, 04:36

Z2697 wrote:
> Maybe just send a couple of low resolution images, if the numbers in 2.1 are accurate. (The limitations of the AI "watching the video" are 2 FPS and 602,112 pixels)
> Of course, the same ROI approach can be applied to the images as well.
>
> And there's probably still some bitrate can be saved by motion compensation, 2 fps video it is then.
>
> You can even use AI to select most stable frames, or reduce motion blur, if it's from a handheld device.
>
> And, the example video in your DeViBench README (https://github.com/pku-netvideo/DeViBench/blob/4c41b6133aae6ac9394b3bb34f503a43c44afa91/README.md) is a YUV444P8 video with YCbCr matrix (well, not sure which one) mistakenly flagged as GBR.
> And again, judging from the "rainbow colored" video artifacts in the left view, it's encoded in actual GBR colorspace, which is not the most efficient colorspace for video encoders optimized for YUV.
> Ugh...

Thanks a lot for the thorough analysis and the great suggestions! I really appreciate your insights, especially regarding the encoding settings and transmission strategies. Here are my thoughts on the points you raised:

1. Regarding your suggestion: "Maybe just send a couple of low resolution images ... And there's probably still some bitrate can be saved by motion compensation, 2 fps video it is then."

That's a great point. Yes, we can definitely limit transmission to the low-quality video (or images) that the model actually handles. Since any higher-quality video is inevitably downsampled to the resolution and frame rate the model can process before input anyway, sending more than that is redundant.

Regarding resolution, the situation is quite clear: we should send video frames containing no more than 602,112 pixels.
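
A small sketch of how a sender could pick a target resolution under that pixel cap while keeping the aspect ratio. The function name and the choice to round down to even dimensions (convenient for 4:2:0 encoding) are assumptions for illustration.

```python
import math

MAX_PIXELS = 602_112  # per-frame pixel limit quoted in this thread

def fit_resolution(width, height, max_pixels=MAX_PIXELS, multiple=2):
    """Scale (width, height) down so width * height <= max_pixels.

    Keeps the aspect ratio approximately and rounds each dimension
    down to a multiple of `multiple`. Frames already under the cap
    are only rounded, never upscaled.
    """
    scale = min(1.0, math.sqrt(max_pixels / (width * height)))
    new_w = int(width * scale) // multiple * multiple
    new_h = int(height * scale) // multiple * multiple
    return max(new_w, multiple), max(new_h, multiple)

# Example: a 1080p camera frame and a frame that is already small enough.
w, h = fit_resolution(1920, 1080)
small = fit_resolution(640, 480)
```

Rounding both dimensions down guarantees the product never exceeds the cap, at the cost of a slightly smaller frame than the theoretical maximum.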

Regarding frame rate, there are a few points worth discussing:

First, should we transmit video continuously at a fixed frame rate, or only transmit images when the user asks a question? The latter, typically paired with VAD (Voice Activity Detection) algorithms to trigger transmission upon speech, works well for traditional QA scenarios. However, for "Always-on" assistant scenarios, the model needs to continuously analyze video content to proactively respond when necessary—even without a user prompt. In this context, continuous video transmission at a fixed frame rate is required.

Second, is 2 FPS sufficient? A low frame rate can add latency. At 2 FPS the frame interval is 500 ms, so if a user's question falls midway between two frames, the model must either wait 250 ms for the next frame or use a frame that is already 250 ms old; in the worst case the wait or the staleness approaches the full 500 ms. Either option adds unwanted effective delay. The most straightforward mitigation is simply to increase the frame rate. Another is to align frame transmission with the query timestamp at the sender: the moment a user question is detected, the sender immediately transmits a frame and restarts the 500 ms interval from that point.
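
The query-aligned idea can be sketched as a tiny sender-side scheduler. The class and method names are hypothetical, and the query-detection signal (for example from VAD) is assumed to be supplied by the caller.

```python
class FrameScheduler:
    """Decide when the sender should transmit a frame.

    Frames go out every `interval_ms`. A detected user query forces an
    immediate frame and restarts the interval from that moment.
    """

    def __init__(self, interval_ms=500.0):
        self.interval_ms = interval_ms
        self.next_due_ms = 0.0

    def should_send(self, now_ms, query_detected=False):
        if query_detected or now_ms >= self.next_due_ms:
            # Restart the interval from the current time.
            self.next_due_ms = now_ms + self.interval_ms
            return True
        return False

# Example timeline in milliseconds: a query arrives at t = 300.
sched = FrameScheduler()
timeline = [
    sched.should_send(0),                         # regular frame
    sched.should_send(300),                       # not due yet
    sched.should_send(300, query_detected=True),  # forced by the query
    sched.should_send(700),                       # next frame is due at 800
    sched.should_send(800),                       # regular frame again
]
```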

2. Regarding your suggestion: "You can even use AI to select most stable frames, or reduce motion blur, if it's from a handheld device."

This is an excellent suggestion! I think "selecting the most stable frames" is a very promising direction for improvement.
Building on the previous point: if we adopt the "2 FPS" strategy, the first frame would still need to align with the moment the question is asked (to minimize latency). For subsequent frames, however, instead of strictly sampling at fixed intervals, we could dynamically select the most stable frame within each 500 ms window. This could well improve the model's accuracy. Thanks for the suggestion!
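
One simple way to approximate "most stable frame within a window" is to score each candidate by sharpness and keep the best one. The sketch below uses the variance of a discrete Laplacian as that score. This is a swapped-in heuristic chosen for illustration (motion blur lowers it), not something proposed in the thread or the report, and the function names are hypothetical.

```python
import numpy as np

def sharpness(gray):
    """Variance of a 4-neighbour Laplacian over a grayscale frame."""
    g = np.asarray(gray, dtype=np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def pick_sharpest(frames):
    """Return the index of the sharpest frame in one sampling window."""
    scores = [sharpness(f) for f in frames]
    return int(np.argmax(scores))

# Example window: a flat frame, a high-contrast checkerboard, and a
# lower-contrast copy standing in for a blurred capture.
checker = (np.indices((16, 16)).sum(axis=0) % 2) * 255.0
window = [np.zeros((16, 16)), checker, checker * 0.5]
best = pick_sharpest(window)
```

A real implementation would probably also weigh how far each candidate is from the end of the window, since a sharper but older frame is more stale.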

3. Regarding your observation: "the example video in your DeViBench README is a YUV444P8 video ... it's encoded in actual GBR colorspace, which is not the most efficient colorspace for video encoders optimized for YUV."

Thank you for the detailed inspection and for pointing this out! Based on your feedback, I verified the video with ffprobe and you are absolutely correct. Regarding the cause: the example video in the README was encoded via FFmpeg directly from PNG images (RGB). Since I omitted the -pix_fmt yuv420p flag during that process, the output defaulted to a GBR (4:4:4) stream.
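
For reference, a sketch of the two command lines involved, built as Python argument lists so they can be passed to `subprocess.run`. The file names, frame rate, and encoder choice are placeholders, not the settings actually used for the README video; the flags themselves (`-framerate`, `-pix_fmt`, `-c:v`, and ffprobe's `-show_entries`) are standard FFmpeg options.

```python
def ffmpeg_png_to_yuv420p(pattern, output, encoder="libx265", fps=30):
    """Arguments for encoding a PNG sequence with an explicit 4:2:0 format.

    Without -pix_fmt, FFmpeg may keep an RGB/GBR or 4:4:4 format when
    the chosen encoder supports it.
    """
    return ["ffmpeg", "-framerate", str(fps), "-i", pattern,
            "-c:v", encoder, "-pix_fmt", "yuv420p", output]

def ffprobe_pix_fmt(path):
    """Arguments for printing the pixel format and color space of a file."""
    return ["ffprobe", "-v", "error", "-select_streams", "v:0",
            "-show_entries", "stream=pix_fmt,color_space",
            "-of", "default=noprint_wrappers=1", path]

# Placeholder file names for illustration only.
encode_cmd = ffmpeg_png_to_yuv420p("frames/%04d.png", "demo.mp4")
probe_cmd = ffprobe_pix_fmt("demo.mp4")
```

Running the probe command on the output should report `pix_fmt=yuv420p` once the flag is in place.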

However, please note that this does not affect the conclusions in the paper. The quantitative experiments (Figure 9) were conducted using the Kvazaar encoder taking raw .yuv files (standard yuv420p) as input. Therefore, the actual experimental data is strictly based on standard YUV formats.

I really appreciate you taking the time to analyze the stream—it’s a great catch regarding the demo visualization.