Starange glitch in x265 encoded video [Archive] - Page 2

View Full Version : Starange glitch in x265 encoded video

Pages : 1 [2] 3

rwill

1st July 2025, 05:05

I noticed that the bug occurs at POC=256. If I use trace_header BSF in ffmpeg, POC is reset at zero after 255 (it seems to be only one byte).

Perhaps some weird things happens in the decoder which overwrites some carry flag?

Yeah, must be that one number you can view with ffmpeg and not one of the million numbers you cannot. /s

rwill

1st July 2025, 05:07

There is some sort of limit in x265... is this not enough?
Or is it something with the specific binary? I can't reproduce the Uma Musume bug in my test build of x265 stable branch, and can reproduce using the FFmpeg from AnimMouse.
https://bitbucket.org/multicoreware/x265_git/src/cd4f0d6e9d30766f5e2be89dab2e6809f3b76507/source/encoder/search.cpp#lines-4523:4527

The limits in place, and the way this function you linked is called, are quite ineffective for motion vector clipping.

And you probably cannot reproduce the issue because its one of those "when the stars align" things. You need the same input, same platform, and same configuration to even have a chance to repro.

Z2697

1st July 2025, 11:22

The bug "starts in display order" at POC 256, but in reality it starts at POC 259 in decoding order.

GeoffreyA

3rd July 2025, 19:07

Felipe from Intel would like to know if anyone tested this on the RX 7000 and 9000 series (RDNA 3 and 4)? If someone could check, that would be great.

https://github.com/IGCIT/Intel-GPU-Community-Issue-Tracker-IGCIT/issues/1153#issuecomment-3033059227

Z2697

3rd July 2025, 21:10

I'm planning to buy a RX 9070 XT but that's not gonna happen very soon...
If this is fixable at the driver level that's a good news.

GeoffreyA

3rd July 2025, 22:00

That's a potent GPU.

benwaggoner

6th July 2025, 23:59

The limits in place, and the way this function you linked is called, are quite ineffective for motion vector clipping.

And you probably cannot reproduce the issue because its one of those "when the stars align" things. You need the same input, same platform, and same configuration to even have a chance to repro.
Definitely file it as a bug with MultiCoreWare!

GeoffreyA

8th July 2025, 17:09

Definitely file it as a bug with MultiCoreWare!

That seems to be best.

Intel's analysis says that frame 182 is corrupted, and AMD and software decoding are handling it differently. Any further questions of them before closing the ticket?

https://github.com/IGCIT/Intel-GPU-Community-Issue-Tracker-IGCIT/issues/1153#issuecomment-3049365819

hellgauss

8th July 2025, 18:13

Quite cryptic:
"Based on the analysis this is a decode issue. The first corrupted frame is 182"

Perhaps Z2697 is right and it is both an encoder and decoder issue :)

benwaggoner

8th July 2025, 19:16

Quite cryptic:
"Based on the analysis this is a decode issue. The first corrupted frame is 182"

Perhaps Z2697 is right and it is both an encoder and decoder issue :)
Lots of things are both, yeah. Although the standard really defines conformance of the encoded bitstream, so getting that right is the most important. How a decoder deals with malformed bitstreams is undefined, although it is always welcome when they do so gracefully.

Z2697

10th July 2025, 10:32

7.4.9.9 Motion vector difference semantics
...
The value of MvdLX[ x0 ][ y0 ][ compIdx ] shall be in the range of −215
to 215 − 1, inclusive.

I think this is where the range of MVD is specified, and it is [-2^15, 2^15-1].
The pure text quote is hard to read, so I suggest go read the PDF.

Z2697

10th July 2025, 11:24

Are we going to send a patch to multicoreware?

Z2697

10th July 2025, 11:28

That seems to be best.

Intel's analysis says that frame 182 is corrupted, and AMD and software decoding are handling it differently. Any further questions of them before closing the ticket?

https://github.com/IGCIT/Intel-GPU-Community-Issue-Tracker-IGCIT/issues/1153#issuecomment-3049365819

The bitstream violates the specification, yes.
But if the decoding process is done exactly according to the specification, the result is "not affected" in this case.

GeoffreyA

10th July 2025, 11:55

The bitstream violates the specification, yes.
But if the decoding process is done exactly according to the specification, the result is "not affected".

I see. So their decoding process is not following the specification, exactly?

Are we going to send a patch to multicoreware?

Why not? :)

Z2697

10th July 2025, 11:58

I see. So their decoding process is not following the specification, exactly?

Can't be sure, without the knowledge about their internal design...
But very likely.

https://forum.doom9.org/showthread.php?p=2020241#post2020241
This kind of decoding process looks very like a low-hanging fruit for optimization.

GeoffreyA

10th July 2025, 12:15

Can't be sure, without the knowledge about their internal design...
But very likely.

https://forum.doom9.org/showthread.php?p=2020241#post2020241
This kind of decoding process looks very like a low-hanging fruit for optimization.

Yes, my thought is, optimisation.

I wonder, though, if this is locked down in the hardware, or if it can be changed in the driver. Intel does seem to use a generalised encoding-decoding engine, like a miniature video CPU, so the logic might be in firmware.

rwill

10th July 2025, 13:41

I see. So their decoding process is not following the specification, exactly?

It is according to the specification as it supports the full legal value range.

I wonder why I have missed the range limit in 7.4.9.9 when reading the standard.. too much selective reading caused by social media probably.

Z2697

10th July 2025, 14:43

Of course, the culprit is the encoder encodes illegal values.
And that decoding process is over-engineered.
We can't blame the decoders really.

But the evidence does show that they probably optimized the over-engineered process out.

Z2697

10th July 2025, 15:02

It is according to the specification as it supports the full legal value range.

I wonder why I have missed the range limit in 7.4.9.9 when reading the standard.. too much selective reading caused by social media probably.
Those things are scattered all around like hell...

Can I add you as a co-author in the patch?
Or maybe you want to send the patch?

jpsdr

10th July 2025, 17:24

If you're doing a patch, i'm interested by it to apply asap on my build, not to have to wait a long time it goes to main branch.

rwill

10th July 2025, 19:08

Those things are scattered all around like hell...

Can I add you as a co-author in the patch?
Or maybe you want to send the patch?

I would like to stay out of it for political reasons... ;D

Z2697

10th July 2025, 20:04

My patch is gonna be look like this:

diff --git a/source/encoder/motion.cpp b/source/encoder/motion.cpp
index 86f413c3d..a3fe0619d 100644
--- a/source/encoder/motion.cpp
+++ b/source/encoder/motion.cpp
@@ -1604,6 +1604,17 @@ me_hex2:
bmv = MV(0, 0);
}

+ // to ensure the mvdLX is in the range of [-2^15, 2^15-1]
+ MV clipmin((int32_t) -(1<<15), (int32_t) -(1<<15));
+ MV clipmax((int32_t) (1<<15) - 1, (int32_t) (1<<15) - 1);
+ clipmin += qmvp;
+ clipmax += qmvp;
+ if (!bmv.checkRange(clipmin, clipmax))
+ {
+ bmv = bmv.clipped(clipmin, clipmax);
+ bcost = subpelCompare(ref, bmv, satd) + mvcost(bmv);
+ }
+
x265_emms();
outQMv = bmv;
return bcost;

I chose to do it in motionEstimate because I want to update the cost as well... however it's probably not that important after all.
If the clipping is done in the predInterSearch we can probably save some CPU cycles.
What's your thought?

jpsdr

11th July 2025, 10:43

I don't have the knowledge for technical answer, so just some thought.

If you wanted to update the cost, it's probably because you thought that, even if impact is very small, the result will be more accurate, so globaly it will produce a better result, even if it's minimal.
As i'm looking for "best" result, i don't care saving some CPU cycles if solution is better than another. Even if the "better" is very small, sometimes it's the sum of several very small things which begins to produce a finaly not so much small "better".

Z2697

11th July 2025, 11:24

Uh, it's just my assumption that the cost will be slightly different.
But maybe it will be too small to make any change in the RDO process.
And given that this bug is only occured like, once in a million encodes, as far as I can tell / estimate from the fact that "this is the first time I saw it", I think I'd rather save that CPU resource.
(I didn't actually watched a million x265 encoded videos though ;) )

So here's the predInterSearch version:
diff --git a/source/encoder/search.cpp b/source/encoder/search.cpp
index 0522f52cc..5f3fca5b6 100644
--- a/source/encoder/search.cpp
+++ b/source/encoder/search.cpp
@@ -2543,6 +2543,12 @@ void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChroma
}
}

+ // to ensure the mvdLX is in the range of [-2^15, 2^15-1]
+ MV clipmin((int32_t) -(1<<15) , (int32_t) -(1<<15) );
+ MV clipmax((int32_t) (1<<15) - 1, (int32_t) (1<<15) - 1);
+ bestME[0].mv = bestME[0].mv.clipped(bestME[0].mvp + clipmin, bestME[0].mvp + clipmax);
+ bestME[1].mv = bestME[1].mv.clipped(bestME[1].mvp + clipmin, bestME[1].mvp + clipmax);
+
/* Bi-directional prediction */
MotionData bidir[2];
uint32_t bidirCost = MAX_UINT;
@@ -2648,6 +2654,11 @@ void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChroma
bidirBits = bits0 + bits1 + m_listSelBits[2] - (m_listSelBits[0] + m_listSelBits[1]);
}
}
+ // to ensure the mvdLX is in the range of [-2^15, 2^15-1]
+ MV clipmin((int32_t) -(1<<15) , (int32_t) -(1<<15) );
+ MV clipmax((int32_t) (1<<15) - 1, (int32_t) (1<<15) - 1);
+ bidir[0].mv = bidir[0].mv.clipped(bidir[0].mvp + clipmin, bidir[0].mvp + clipmax);
+ bidir[1].mv = bidir[1].mv.clipped(bidir[1].mvp + clipmin, bidir[1].mvp + clipmax);
}

/* select best option and store into CU */

GeoffreyA

11th July 2025, 12:34

MV is a class with no dynamic allocation?

I think the motionEstimate() version looks cleaner and simpler.

Z2697

11th July 2025, 13:26

MV is a class with no dynamic allocation?

I think the motionEstimate() version looks cleaner and simpler.

What do we need dynamic allocation for, in this case?

2 versions are basically the same thing, in different places.

GeoffreyA

11th July 2025, 14:49

What do we need dynamic allocation for, in this case?

Just asking. I don't know what the definition is.

Z2697

13th July 2025, 09:38

rwill

13th July 2025, 11:31

The patch is sent.
I can't help but wondering, since the values we've seen are only overflowed by 1, is there some value (constant) just off by one, and we can tweak it to make the problem not happening in the first place?

Well we just had 1 case here where it was 1 too large. SquallMX his sample had the MVD at 0x8004, so 5 too large. So its just a coincidence.

rwill

14th July 2025, 07:01

Z2697

14th July 2025, 12:12

Regarding your patch...

The MVP can only be between -2^15 and (2^15)-1.
Clipping the MV to MVP +/- 2^15 kind of limits the range an MV can have.

For example if the MVP X is -2^15 the X maximum is -1. So if MVP ever gets strange the ME will have a hard time to re-sanitize the MV vector field.

Might I suggest a clipping range of +/- 49152 ( 2^15 + 2^14 ) and do this additional step:

void Entropy::codeMvd(const CUData& cu, uint32_t absPartIdx, int list)
...
const int hor = mvd.x & 0xffff;
const int ver = mvd.y & 0xffff;

It should work because the ( MVP + MVD ) does the 16 bit wraparound in the decoder.

This actually makes the things much worse, if I'm not implementing it wrong:

diff --git a/source/encoder/entropy.cpp b/source/encoder/entropy.cpp
index ece8d6bc9..3826c31a7 100644
--- a/source/encoder/entropy.cpp
+++ b/source/encoder/entropy.cpp
@@ -2104,8 +2104,8 @@ void Entropy::codeRefFrmIdx(const CUData& cu, uint32_t absPartIdx, int list)
void Entropy::codeMvd(const CUData& cu, uint32_t absPartIdx, int list)
{
const MV& mvd = cu.m_mvd[list][absPartIdx];
- const int hor = mvd.x;
- const int ver = mvd.y;
+ const int hor = mvd.x & 0xffff;
+ const int ver = mvd.y & 0xffff;

encodeBin(hor != 0 ? 1 : 0, m_contextState[OFF_MV_RES_CTX]);
encodeBin(ver != 0 ? 1 : 0, m_contextState[OFF_MV_RES_CTX]);
diff --git a/source/encoder/search.cpp b/source/encoder/search.cpp
index 2a324700d..295cec52c 100644
--- a/source/encoder/search.cpp
+++ b/source/encoder/search.cpp
@@ -2545,6 +2545,12 @@ void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChroma
}
}

+ // to ensure the mvdLX is in the range of [-2^15, 2^15-1]
+ MV clipmin((int32_t) -(1<<15) - (1<<14) , (int32_t) -(1<<15) - (1<<14) );
+ MV clipmax((int32_t) (1<<15) + (1<<14) - 1, (int32_t) (1<<15) + (1<<14) - 1);
+ bestME[0].mv = bestME[0].mv.clipped(bestME[0].mvp + clipmin, bestME[0].mvp + clipmax);
+ bestME[1].mv = bestME[1].mv.clipped(bestME[1].mvp + clipmin, bestME[1].mvp + clipmax);
+
/* Bi-directional prediction */
MotionData bidir[2];
uint32_t bidirCost = MAX_UINT;
@@ -2650,6 +2656,11 @@ void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChroma
bidirBits = bits0 + bits1 + m_listSelBits[2] - (m_listSelBits[0] + m_listSelBits[1]);
}
}
+ // to ensure the mvdLX is in the range of [-2^15, 2^15-1]
+ MV clipmin((int32_t) -(1<<15) - (1<<14) , (int32_t) -(1<<15) - (1<<14) );
+ MV clipmax((int32_t) (1<<15) + (1<<14) - 1, (int32_t) (1<<15) + (1<<14) - 1);
+ bidir[0].mv = bidir[0].mv.clipped(bidir[0].mvp + clipmin, bidir[0].mvp + clipmax);
+ bidir[1].mv = bidir[1].mv.clipped(bidir[1].mvp + clipmin, bidir[1].mvp + clipmax);
}

/* select best option and store into CU */

I don't understand why...

rwill

14th July 2025, 22:11

This actually makes the things much worse, if I'm not implementing it wrong:

+ const int hor = mvd.x & 0xffff;
+ const int ver = mvd.y & 0xffff;

..

I don't understand why...

That is because I am actually an idiot. The int type of hor and ver is still 32bit, so maybe instead of my suggestion to do "& 0xffff", which will break things for sure, maybe try "const signed short" instead of "const int" .. or something like that. Maybe then it works ...

jpsdr

15th July 2025, 09:11

GeoffreyA

15th July 2025, 11:36

What is the type of mvd.x (or mvd.y) ?
Is it int (meaning 32 bits or 64 bits acording if the OS target is 32 or 64 bits) ?
And exactly what do you want to do with the "& 0xFFFF" ?
You realize in that case +65535 or -1 will produce the same result (if it's int) ?
(Unless mvd.x is only positive... That i don't know).

They are int32_t: https://bitbucket.org/multicoreware/x265_git/src/cd4f0d6e9d30766f5e2be89dab2e6809f3b76507/source/common/mv.h#lines-42

I assume it's to keep the lower 16 bits.

Z2697

15th July 2025, 12:20

I don't see the reason behind this extra step...
Or the extended clipping range...

rwill

15th July 2025, 13:29

If the MVP is -32768 (what we had here in the issue stream), if you clip to int16_t search range around the MVP, you can at most reach MV = -1 but not 0 to 32767.

So the search range has to be extended to something sane like +49152 to reach more positive values around the CU that is coded.

Then regarding coding the MVD, the HEVC standard says:

uLX[ 0 ] = ( mvpLX[ 0 ] + mvdLX[ 0 ] + 2^16 ) % 2^16
mvLX[ 0 ] = ( uLX[ 0 ] >= 2^15 ) ? ( uLX[ 0 ] − 2^16 ) : uLX[ 0 ]

So lets take for example MVP = -32768 and we want to code an MV of 1024, the MVD is 33.792 which, if reduced to signed 16 bit, is -31744.

uLX = ( -32768 + -31744 + 2^16 ) % 2^16 = 1024.

This also works the other way around with MVP 32767 and we want to code an MV of -1024, then the MVD is -33791 which, if reduced to signed 16 bit, is 31745.

uLX = ( 32767 + 31745 + 2^16 ) % 2^16 = 64512
mvLX = 64511 >= 32768 ? ==> 64512 - 2^16 = -1024

That is what I tried to do but I have a hard time reading that x265 code.

jpsdr

15th July 2025, 16:58

If i want to use the first motion.cpp version of the patch, do i also need to expand the clipping values ?

Z2697

15th July 2025, 19:43

Is it safe to just not doing the clipping part?

Z2697

15th July 2025, 20:31

So... since this works, Nvidia and Intel are not doing the decoding process differently, they just did some sanity check first.
So this is a encoder bug 100%.

Sorry Nvidia and Intel :o

rwill

15th July 2025, 22:16

Is it safe to just not doing the clipping part?

MCW should know best but I guess the guys that did the original architecture and implementation are no longer around so my guess is no one knows.

Instead of clipping to (MVP -2^15 / (2^15)-1) maybe clip to absolute (-2^15 / (2^15)-1) around the CU? These should be always reachable... but I would start the motion estimation with that constraint and don't add it as an afterthought after.

Z2697

16th July 2025, 04:43

There is a limit about max mv length in setSearchRange

/* Clip search range to signaled maximum MV length.
* We do not support this VUI field being changed from the default */
const int maxMvLen = (1 << 15) - 1;
mvmin.x = X265_MAX(mvmin.x, -maxMvLen);
mvmin.y = X265_MAX(mvmin.y, -maxMvLen);
mvmax.x = X265_MIN(mvmax.x, maxMvLen);
mvmax.y = X265_MIN(mvmax.y, maxMvLen);

I don't know how reliable is this, because these mvmin mvmax values clearly have been breached in this case (they do work normally most of the time)
POC=930 CTU=73 mvp=-32768,-16141 outmv=1,-1052 mvmin=-903,-263 mvmax=-903,-263

EDIT: Now I think of it, if the MV is exceeding that range, it's a different (and worse?) problem. Let's just fking go! (for now).

jpsdr

16th July 2025, 09:12

I don't understand what is the trouble here.
Is it because it's difficult to find where exactly do it in the code architecture ?
Even if the spec tells exactly what to restrein ?

Edit:
I didn't see an issue opened on the bitbucket.
Maybe you should cancel your patch, and just describe precisaly the issue on a ticket on the bitbucket, if not with a video sample, at least with some results like the previous post.

And i'm asking again, is the first patch you made in motion.cpp not finaly the good place ?

I mean, if spec said -32768/+32767 but for any reason at the place you clip you have to clip to higher values, maybe it means in that case the clipping is not at the right place ?

Z2697

17th July 2025, 09:11

I think the current method, casting the mvd to int16_t at entropy coding stage, is "as painless as it goes".

GeoffreyA

17th July 2025, 11:32

I think the current method, casting the mvd to int16_t at entropy coding stage, is "as painless as it goes".

I don't know encoder design or the x265 code, but wouldn't the "correct" way be to alter the code where the motion vector is calculated (i.e., that code is the culprit), clamping it there, and wouldn't clamping or casting late in the pipeline cause undefined results if earlier stages used the original value?

Z2697

17th July 2025, 13:54

I don't know encoder design or the x265 code, but wouldn't the "correct" way be to alter the code where the motion vector is calculated (i.e., that code is the culprit), clamping it there, and wouldn't clamping or casting late in the pipeline cause undefined results if earlier stages used the original value?

I honestly can't figure out where this bug starts.
(so I just created an issue in bitbucket but still awaits approval)

But talking about the difference made by casting it, there's really none.
The values here are, I'll say they are what will be seen by the decoder, and only the decoder will see them, and as we have discussed about, the decoding process makes correct and identical results.
(yes, there're flags to be set according to the value, and the final thing gets to the bitstream is "abs" and "minus2", but they are nothing to worry about, they are just describing this value)

Here is a reproduced encode and a "fixed" encode, please test.
https://workupload.com/file/d9D48eUYudr

(yes, we have to run that exact github action workflow to get a binary that can trigger the bug by the following command, the x265 parameters are translated from the FFmpeg style from the OP, it's really rare)

ffmpeg-test -ss 20:10 -to 21:00 -threads 1 -y -hide_banner -nostdin -bitexact -i "u.mkv" -map 0:v -profile:v main10 -vf "removegrain=1:1:1,format=yuv420p10le,scale=1280:720:flags=bicublin:param0=0.125:param1=0.4375" -sws_flags accurate_rnd -f yuv4mpegpipe -strict -2 -fps_mode vfr - | x265-ghaction --input - --y4m -D10 -o out.265 -pveryslow --crf 18.6 --psy-rd 0.7 --psy-rdoq 0.8 --aq-mode 3 --aq-strength 0.7 --deblock 1,1 --nr-intra 10 --nr-inter 25 --tu-intra-depth 4 --tu-inter-depth 4 --merange 58 --ref 6 --rc-lookahead 99 -b16 --rd 6 --rskip 2 --rskip-edge-threshold 2 --cbqpoffs 1 --crqpoffs 1 --scenecut 42 --scenecut-bias 5.2 --qcomp 0.59 --tskip --no-tskip-fast --limit-sao --selective-sao 2 --vbv-maxrate 8000 --vbv-bufsize 12000 --frame-threads 1 --pools none --no-wpp --level 40 --high-tier --colorprim 1 --colormatrix 1 --transfer 1 --range limited --chromaloc 0

GeoffreyA

17th July 2025, 14:27

That makes sense. If nothing is accessing the values apart from the decoder, then where to clamp or cast would be driven by convenience.

Would the x264 source provide guidance on how to handle this? Considering that the logic should be the same but different ranges.

Z2697

17th July 2025, 14:48

H.264 doesn't have temporal mv prediction, so probably no.

GeoffreyA

17th July 2025, 15:38

Here is a reproduced encode and a "fixed" encode, please test.
https://workupload.com/file/d9D48eUYudr

(yes, we have to run that exact github action workflow to get a binary that can trigger the bug by the following command, the x265 parameters are translated from the FFmpeg style from the OP, it's really rare)

Tested it, and "out-4.1-casting" shows no artefacts. "out-4.1" does.

Oh, boy, there's anime on everything, even running :)

jpsdr

17th July 2025, 17:44

I think the current method, casting the mvd to int16_t at entropy coding stage, is "as painless as it goes".

So you means, that in entropy.cpp just changing in Entropy::codeMvd:

const int hor = mvd.x;
const int ver = mvd.y;

by

const int16_t hor = (int16_t)mvd.x;
const int16_t ver = (int16_t)mvd.y;

and nothing else, is how you made your casting video ?
And it seems at least for GeoffreyA it made a difference.

jpsdr

18th July 2025, 08:24

Oh, boy, there's anime on everything

You have no idea !!