Old 23rd April 2007, 17:03   #21  |  Link
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,690
It sounds like you're using an option like -ffast-math (GCC example) that causes deviation from the ANSI C standard.
Old 23rd April 2007, 17:36   #22  |  Link
sysKin
Registered User
 
Join Date: Jun 2002
Location: Adelaide, Australia
Posts: 1,167
Quote:
Originally Posted by Dark Shikari View Post
It sounds like you're using an option like -ffast-math (GCC example) that causes deviation from the ANSI C standard.
It's a floating point option, it's completely irrelevant.
__________________
Visit #xvid or #x264 at irc.freenode.net
Old 23rd April 2007, 22:24   #23  |  Link
plugh
A hollow voice says
 
Join Date: Sep 2006
Posts: 269
The difference set in my preceding post, between the two mpeg+default cpuflags encodes (using opt or noopt on that bvop rd module) consists of a ~180 frame sequence of P and B frames, starting at a P frame. It ends at the next I frame. The rest of the 5000 frame clip is identical.

Very curious indeed - somehow opt/noopt of *this* module polluted / affected a P frame?

Ran another series of tests, expanding on the 'diff' axes
constant -> vhq-b=on, vhq=1, 4mv=off, mpeg
variable -> range of cpu flags
10 encodes
Code:
		  /diff\  /same\  /same\  /SAME\
full optimize	mmx	xmm	sse	3dn	3de
		 |	 |	 |	 |	 |
		same	same	same	same	DIFF
		 |	 |	 |	 |	 |
noopt bvop rd	mmx	xmm	sse	3dn	3de
		  \diff/  \same/  \same/  \DIFF/
I really don't know what to make of this. Going from 3dn to 3de switches in a fairly sizable set of asm routines, including dequant_mpeg_inter_3dne. When estimation_rd_based_bvop is compiled with optimizer, I get same results as with xmm, sse, and 3dn. But when it is compiled noopt, this causes results to change. (Or perhaps result was *supposed* to change in the optimized build case, but didn't?)

Well, at least it further narrows things down...

Last edited by plugh; 23rd April 2007 at 23:54.
Old 24th April 2007, 00:44   #24  |  Link
plugh
A hollow voice says
 
Join Date: Sep 2006
Posts: 269
Well, I still don't know what/why, but I can say this particular issue IS _directly_ related to asm routine dequant_mpeg_inter_3dne (in module quantize_mpeg_xmm.asm) and NOT the other 3dne asm routines...

I modified xvid.c and commented out the function pointer assignment for this routine, leaving all others alone. Rebuilt, with bvop rd still 'noopt', reran that corner case, and now output matches.

Really need someone who knows those SIMD instructions to look at that routine... Why, with the routine enabled, do we get different output for the 'opt' and 'noopt' cases?

Last edited by plugh; 24th April 2007 at 00:56.
Old 24th April 2007, 14:36   #25  |  Link
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,690
Quote:
Originally Posted by sysKin View Post
It's a floating point option, it's completely irrelevant.
Wait, xvid doesn't use floating point?
Old 24th April 2007, 15:26   #26  |  Link
plugh
A hollow voice says
 
Join Date: Sep 2006
Posts: 269
Quote:
Originally Posted by Dark Shikari View Post
Wait, xvid doesn't use floating point?
I can't comment in general, but I don't see any in that module at least. I do know there are floats in plugin_2pass2, but that wouldn't impact this.

I remembered I had a VMware virtual machine with W2K and a gcc setup on it (Msys 1.0.10, MinGW 4.1.0, gcc 3.4.4). Added nasm, installed xvid 1.1.2 sources, tried doing a build - success! So now I can poke at the gcc built version and see what it reveals about that module.

FWIW, the canned xvid+gcc build procedure uses the following gcc flags:
-Wall -O2 -fstrength-reduce -finline-functions -freduce-all-givs -ffast-math -fomit-frame-pointer

(note the 'fast-math' referred to earlier)

Guess I need to read up on them...

re-EDIT: Just did a quick one shot encode comparison between gcc build with/without fast-math flag --> one B frame is slightly different in the .pass files. However, the gcc build *with* fast-math and msvc are in agreement on that particular frame.

Last edited by plugh; 24th April 2007 at 17:16.
Old 24th April 2007, 20:43   #27  |  Link
plugh
A hollow voice says
 
Join Date: Sep 2006
Posts: 269
re: opt/noopt and asm routines

Could this be a case of a missing emms/femms someplace?

EDIT: Just tried an experiment, and it looks like the answer is "yes".

Last edited by plugh; 24th April 2007 at 21:53.
Old 25th April 2007, 21:26   #28  |  Link
plugh
A hollow voice says
 
Join Date: Sep 2006
Posts: 269
It has been an extremely tedious process, but I have tracked back to a specific chunk of code that gives different results when compiled with msvc vs gcc.

(When I say tedious I mean it - tracking backwards through the code, identifying where a particular rd mode evaluation for a particular macroblock for a particular frame goes weird)

The chunk of code is
Code:
static __inline uint32_t
d_mv_bits(int x, int y, const VECTOR pred, const uint32_t iFcode, const int qpel)
{
	unsigned int bits;

	x <<= qpel;
	y <<= qpel;

	x -= pred.x;
	bits = (x != 0 ? iFcode:0);
	x = -abs(x);
	x >>= (iFcode - 1);
	bits += r_mvtab[x+63];

	y -= pred.y;
	bits += (y != 0 ? iFcode:0);
	y = -abs(y);
	y >>= (iFcode - 1);
	bits += r_mvtab[y+63];

	return bits;
}
in motion_inlines.h; r_mvtab is defined there as well.

The arguments being passed in this particular case are
x=-64 y=63 pred={x=63,y=15} iFcode=2 qpel=0

msvc produces the value 4128837
gcc produces the value 14
I manually calc it, and I get 26

The call stack is:
ModeDecision_BVOP_RD ->
SearchInterpolate_RD ->
CheckCandidateRDInt ->
the first instance in the following statement
Code:
rd += BITS_MULT * (d_mv_bits(xf, yf, data->predMV, data->iFcode, data->qpel^data->qpel_precision)
		+ d_mv_bits(xb, yb, data->bpredMV, data->iFcode, data->qpel^data->qpel_precision));
WTF!
Old 25th April 2007, 22:24   #29  |  Link
plugh
A hollow voice says
 
Join Date: Sep 2006
Posts: 269
Found another one; different macroblock, different rd mode

The arguments being passed
x=63 y=24 pred={x=-64,y=-20} iFcode=2 qpel=0

msvc computes 4128837 (again)
gcc computes 14 (again)
I manually calculate 26 (again)

Call stack is
ModeDecision_BVOP_RD ->
SearchBF_RD (mode is Forward) ->
CheckCandidateRDBF ->
The following line
Code:
rd += BITS_MULT*(d_mv_bits(x, y, data->predMV, data->iFcode, data->qpel^data->qpel_precision)-2);
Old 25th April 2007, 23:10   #30  |  Link
plugh
A hollow voice says
 
Join Date: Sep 2006
Posts: 269
Bingo - I see it!

-127 integer divide by two is -63

-127 shift right once (sign extended) is -64

It's going off the front of the array (index -64 + 63 = -1)...
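The arithmetic can be checked with a short C sketch. The helper name mvtab_index is made up for illustration; it only mirrors the index computation from d_mv_bits and assumes the compiler implements >> on negative ints as an arithmetic shift (implementation-defined in C, but what msvc and gcc do on x86).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: the r_mvtab index as d_mv_bits computes it
 * (difference, negate the magnitude, shift, add 63). */
static int mvtab_index(int v, int pred, int fcode)
{
    int d = v - pred;

    assert(fcode >= 1);
    d = -abs(d);
    d >>= (fcode - 1);  /* arithmetic shift assumed: rounds toward minus infinity */
    return d + 63;
}
```

With the arguments from the two cases above (x=-64, pred.x=63 and x=63, pred.x=-64, iFcode=2) the difference has magnitude 127, the shift gives -64, and the index comes out as -1. The y components (63 vs 15, 24 vs -20) give indices 39 and 41, which are in range.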

So the next question is -

Is this a bug in the routine, or is "-64" an illegal/out-of-range value for a vector?

Perhaps some asm routine not rounding / range limiting correctly?

Last edited by plugh; 25th April 2007 at 23:25.
Old 26th April 2007, 02:27   #31  |  Link
plugh
A hollow voice says
 
Join Date: Sep 2006
Posts: 269
Well as an experiment, I changed three lines in motion_inlines.h
Code:
	static const int r_mvtab[64] = {
to
	static const int r_mvtab[65] = {12,

	bits += r_mvtab[x+63];
to
	bits += r_mvtab[x+64];

	bits += r_mvtab[y+63];
to
	bits += r_mvtab[y+64];
then created normal msvc and gcc optimized builds and did four encodes - vhq-b=on, vhq=1, 4mv=off, with both h263 and mpeg quant, with both dlls.

Compared the paired output .pass and .avi files, and they are identical now!
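As a sanity check on the changed offset, here is a small sketch (helper names are made up, and it only covers the case seen in this thread: iFcode=2 with vector components and predictors in [-64, 63]). It assumes an arithmetic right shift on negative ints, as above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: index with the workaround's offset of 64,
 * for a table enlarged to 65 entries. */
static int mvtab_index_fixed(int v, int pred, int fcode)
{
    int d = -abs(v - pred);

    assert(fcode >= 1);
    d >>= (fcode - 1);  /* arithmetic shift assumed */
    return d + 64;
}

/* Count component/predictor pairs in [-64, 63] whose index
 * falls outside the 65-entry table, i.e. outside [0, 64]. */
static int count_out_of_range(int fcode)
{
    int v, p, bad = 0;

    for (v = -64; v <= 63; v++) {
        for (p = -64; p <= 63; p++) {
            int idx = mvtab_index_fixed(v, p, fcode);
            if (idx < 0 || idx > 64)
                bad++;
        }
    }
    return bad;
}
```

For iFcode=2 the count is zero: the worst case (difference of magnitude 127) now lands on index 0, the newly added first entry. This says nothing about whether 12 is the right value for that entry, only that the read stays inside the array.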

Not saying the above is a "fix", but it does seem to show that both compilers are generating equivalent functional representations of the source code (unlike the ICL builds). I'll probably run some more comparison series (range of VHQ, range of cpu-flags), but I have much greater confidence that my builds are 'right' now.

I hope someone knowledgeable will chime in and indicate if "-64" is a valid value for a vector component - if it is, then the above *is* a fix. If not, it's just a workaround for some badly behaved code elsewhere (which both msvc and gcc compilers are building as directed). Might be interesting to see if this change improves psnr/ssim/xyzzy...

The other weirdness with the opt vs noopt msvc builds and that one asm routine - I'm not sure what to think about that one. As an experiment, I added a 'femms' instruction to the asm file just before the return, and it magically caused the noopt build to produce the same output as the opt build - not the other way around. Again, I hope someone more knowledgeable will look at that oddity...
Old 26th April 2007, 06:10   #32  |  Link
Manao
Registered User
 
Join Date: Jan 2002
Location: France
Posts: 2,856
A motion vector goes from -2^x to 2^x - 1/2 ( or 1/4 for QPel ), so yes, -64 is valid ( in your case, -64 is -16 integer pixels, and 63 is 15.75 integer pel ).
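The unit conversion behind those numbers can be written out as a tiny sketch. The function name is made up; units_per_pixel is 2 for half-pel and 4 for quarter-pel vectors, and which of the two applies to a given encode is an assumption the caller has to supply.

```c
#include <assert.h>

/* Hypothetical helper: convert a vector component stored in
 * sub-pel units into whole-pixel units. */
static double mv_units_to_pixels(int units, int units_per_pixel)
{
    assert(units_per_pixel == 2 || units_per_pixel == 4);
    return (double)units / (double)units_per_pixel;
}
```

With units_per_pixel = 4 this gives -16 pixels for -64 and 15.75 pixels for 63, matching the figures above; with units_per_pixel = 2 the same stored values would mean -32 and 31.5 pixels.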
Old 26th April 2007, 10:03   #33  |  Link
sysKin
Registered User
 
Join Date: Jun 2002
Location: Adelaide, Australia
Posts: 1,167
Whoa plugh, what great work.

Yes, -64 is valid. So we were nicely reading r_mvtab[-1]? Great, I wonder why memory access analysis tools didn't pick it up.

I suppose I should stick this d_mv_bits() after motion vector writing code and assert that calculated length is the actual bitstream length. This will make us 100% sure nothing else is wrong.

Although, then again, I did have such assertion for a whole macroblock (part of VHQ debugging). I suppose vectors of -64 were never chosen (as they appeared to be horribly costly, 44 kilobits!) and therefore assertion was never hit.
__________________
Visit #xvid or #x264 at irc.freenode.net
Old 26th April 2007, 13:44   #34  |  Link
plugh
A hollow voice says
 
Join Date: Sep 2006
Posts: 269
So that three line change *can* be considered a "fix" for 1.1.2?

Given I'm only working with "HD" encodes (and with 4mv off), perhaps that magnitude of vector is somewhat more likely? i.e. MB displacement across X% of the image literally crosses more pixels? (Don't know what I'm talking about, but it sounds good anyway)

BTW, there was one other thing in the huge volume of debug print data I collected that struck me - I'll pass it on, for whatever it is worth.

In ModeDecisionBVOP_RD, right after the "evaluate cost of all modes" loop, the values for d_rd, f_rd, b_rd, i_rd were frequently the same (with my short test clip). The code is evaluating the modes in increasing SAD order, but in this case should it simply choose the 'first' mode at that cost?

EDIT: Duh - stupid question; you want the one with the lowest SAD. never mind...

Anyway, it happens enough (multiple modes yielding the same rd) in my data that it caught my eye, so I thought I'd mention it. Seemed odd, given radically different code paths.

Examples: - Frame 182, the 4 SADs, evaluation order stuff, x/y of MB, the four RD costs, the chosen cost/mode

182 ds=464 bs=301 fs=464 is=238 bst=238 order=1 2 0 3 num=4 I0 B1 D2 F3 x=41 y=0 d=770 f=786 b=770 i=770 rd=770 mod=1
182 ds=324 bs=306 fs=340 is=216 bst=216 order=1 2 0 3 num=4 I0 B1 D2 F3 x=22 y=1 d=1179 f=1195 b=1179 i=1179 rd=1179 mod=1
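One reading of the two log lines above, sketched in C: walk the modes in evaluation order and keep the lowest cost, using a strict '<' so that on a tie the mode evaluated first (the one with the lowest SAD) is kept. This is not taken from the xvid source. The function name is made up, and the mode numbering (0=direct, 1=interpolate, 2=backward, 3=forward) is an inference from the log, where order=1 2 0 3 matches the increasing SAD order I, B, D, F.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical sketch: return the mode id with the lowest rd cost.
 * order[] lists mode ids in evaluation order; cost[] is indexed by mode id. */
static int pick_mode(const int order[], const int cost[], int num)
{
    int best_mode = -1;
    int best_cost = INT_MAX;
    int i;

    assert(num > 0);
    for (i = 0; i < num; i++) {
        int mode = order[i];
        if (cost[mode] < best_cost) {  /* strict '<': an earlier mode survives a tie */
            best_cost = cost[mode];
            best_mode = mode;
        }
    }
    return best_mode;
}
```

Fed the first log line (d=770, i=770, b=770, f=786, order 1 2 0 3) this returns mode 1 with cost 770, which agrees with the logged rd=770 mod=1; the second line behaves the same way with 1179.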

Last edited by plugh; 26th April 2007 at 15:58.
Old 27th April 2007, 08:31   #35  |  Link
plugh
A hollow voice says
 
Join Date: Sep 2006
Posts: 269
Out of curiosity, I also did a build using ICL 9.0.28 with the above "fix", and compared it to the msvc/gcc builds.

The difference set is now much smaller, however there are still differences. I've poked at it some, and made the following observations.

1) Every so often, the VOP header is a single bit longer than 'usual'. This extra bit is sometimes enough to cause the byte-padded frame to be a single byte longer. The msvc and ICL builds do not do this 'in sync' with each other. Thus, a comparison of the .pass files for ICL vs msvc shows occasional one byte frame length differences. No such difference is observed comparing msvc vs gcc .pass files.

The source of this difference in behaviour is the following routine in encoder.c
Code:
static void
simplify_time(int *inc, int *base)
{
	/* common factor */
	const int s = gcd(*inc, *base);
	*inc  /= s;
	*base /= s;

	if (*base > 65535 || *inc > 65535) {
		int *biggest;
		int *other;
		float div;

		if (*base > *inc) {
			biggest = base;
			other = inc;
		} else {
			biggest = inc;
			other = base;
		}
		
		div = ((float)*biggest)/((float)65535);
		*biggest = (unsigned int)(((float)*biggest)/div);
		*other = (unsigned int)(((float)*other)/div);
	}
}
In my case avisynth was feeding an 'inc' of 41708 and 'base' of 1,000,000. The above code, in attempting to normalize the base to 65535, actually returns 65534 with the ICL build.
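For comparison, here is a sketch of an integer-only variant that cannot drift with the compiler's floating point handling. It is not what xvid does; simplify_time_int and gcd_int are made-up names, the limit of 65535 is copied from the routine above, and both inputs are assumed to be positive.

```c
#include <assert.h>
#include <stdint.h>

/* Greatest common divisor (Euclid); both arguments assumed positive. */
static int gcd_int(int a, int b)
{
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hypothetical integer-only variant of simplify_time(). */
static void simplify_time_int(int *inc, int *base)
{
    const int limit = 65535;
    int s;

    assert(*inc > 0 && *base > 0);
    s = gcd_int(*inc, *base);
    *inc  /= s;
    *base /= s;

    if (*base > limit || *inc > limit) {
        int *biggest = (*base > *inc) ? base : inc;
        int *other   = (*base > *inc) ? inc : base;

        /* Scale the smaller value in 64-bit so the product cannot overflow;
         * the division truncates, like the cast in the original. */
        *other = (int)(((int64_t)*other * limit) / *biggest);
        /* The larger value becomes exactly the limit by construction. */
        *biggest = limit;
    }
}
```

With inc=41708 and base=1000000 the common factor is 4, leaving 10427/250000, and the scaling step then yields inc=2733 and base=65535 regardless of compiler. A very small 'other' value could truncate to 0 here, which this sketch does not guard against.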


2) If I encode a very short sequence of frames (so that I don't encounter that extra bit/byte 'time' thing above), then binary compare the avi files, I consistently show a single byte difference per frame. In my test case, the msvc build will have an 'FF' where the ICL build has an 'FB'. I don't have any tool to parse the avi and tell me where this byte is in the frame (though I would guess it's at the end?)

Again, the msvc and gcc builds show no such difference. I'm suspicious of the "bitstream" code in this case, but will leave that as 'an exercise' for someone else...

Last edited by plugh; 27th April 2007 at 08:46.
Old 27th April 2007, 15:56   #36  |  Link
plugh
A hollow voice says
 
Join Date: Sep 2006
Posts: 269
When I did my initial "fixed" msvc/gcc/icl compare above I collected one other datum, which yielded a quite surprising comparative result:
Code:
time (min:sec) to complete test-clip encode

		msvc	gcc	icl
h263 quant	16:11	16:30	16:39
mpeg quant	17:35	17:52	18:22

dll size	580KB	728KB	808KB
I did NOT expect this.

The only hypothesis I can come up with is that the more compact dll works better with my cache-challenged Duron processor. Guess which one I'll be using for my future encodes.

If anyone wants to experiment, attached is msvc build of v1.1.2 xvidcore with the above arraysize "fix".

EDIT: withdrawn, based upon syskin's post below. Updated build here

Last edited by plugh; 28th April 2007 at 20:04.
Old 28th April 2007, 07:01   #37  |  Link
foxyshadis
ангел смерти
 
Join Date: Nov 2004
Location: Lost
Posts: 9,314
If it helps, avisynth had its own problems with fps and ended up with this function to fix things up:
Code:
// This function uses continued fractions to find the best rational
// approximation that satisfies (denom <= limit).  The algorithm
// is from Wikipedia, Continued Fractions.
//
static void reduce_frac(unsigned &num, unsigned &den, unsigned limit)
{
  unsigned n0 = 0, n1 = 1, n2, nx = num;    // numerators
  unsigned d0 = 1, d1 = 0, d2, dx = den;  // denominators
  unsigned a2, ax, amin;  // integer parts of quotients
  unsigned f1, f2;        // fractional parts of quotients
  int i = 0;  // number of loop iterations

  while (1) { // calculate convergents
    a2 = nx / dx;
    f2 = nx % dx;
    n2 = n0 + n1 * a2;
    d2 = d0 + d1 * a2;

    if (f2 == 0) break;
    if ((i++) && (d2 >= limit)) break;

    n0 = n1; n1 = n2;
    d0 = d1; d1 = d2;
    nx = dx; dx = f1 = f2;
  }
  if (d2 <= limit)
  {
    num = n2; den = d2;  // use last convergent
  }
  else { // (d2 > limit)
    // d2 = d0 + d1 * ax
    // d1 * ax = d2 - d1
    ax = (limit - d0) / d1;  // set d2 = limit and solve for a2

    if ((a2 % 2 == 0) && (d0 * f1 > f2 * d1))
      amin = a2 / 2;  // passed 1/2 a_k admissibility test
    else
      amin = a2 / 2 + 1;

    if (ax < amin) {
      // use previous convergent
      num = n1;
      den = d1;
    }
    else {
      // calculate best semiconvergent
      num   = n0 + n1 * ax;
      den = d0 + d1 * ax;
    }
  }
}
Hopefully Intel would be kinder to this one, as well as giving smaller fractions.
__________________
There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order. ~ Ed Howdershelt
Old 28th April 2007, 10:17   #38  |  Link
celtic_druid
Registered User
 
celtic_druid's Avatar
 
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 2,173
Maybe -Os or -O2 -fno-reorder-blocks -fno-reorder-functions would be faster for Durons? That along with -march=athlon-xp
Old 28th April 2007, 17:27   #39  |  Link
plugh
A hollow voice says
 
Join Date: Sep 2006
Posts: 269
I just used the canned build options from the source kit, as my focus was getting the builds to produce identical output.

I may experiment in that area some, but it would mean re-running encoder output comparisons (a time consuming process) to ensure such tweaks didn't change the results - like that msvc opt/noopt oddity I discuss above...

Last edited by plugh; 28th April 2007 at 17:46.
Old 28th April 2007, 17:37   #40  |  Link
sysKin
Registered User
 
Join Date: Jun 2002
Location: Adelaide, Australia
Posts: 1,167
OK, I committed a fix for the d_mv_bits out-of-bounds memory access bustage.

Unfortunately the fix is not correct. For some negative vectors which land in the range mv_table[64-34]..[64-64], the correct value seems to be 11 not 12.

I added an assertion that fails when incorrectly-estimated vector is coded.

I am not sure if the logic is incorrect in one place, or maybe the entire mv_bits can't be calculated in such "smart", branchless way. Following the code from CodeVector is surely correct but unfortunately measurably slower.

We should just use a LUT.
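A rough sketch of what such a LUT could look like, under stated assumptions: the table is filled once from a slow but trusted cost function and then indexed by the vector difference. All names here are made up, toy_exact_bits is a placeholder and NOT the real MPEG-4 vector code lengths, and a real table would also have to account for iFcode.

```c
#include <assert.h>
#include <stdlib.h>

#define MV_DIFF_MAX 127                   /* largest |difference| for components in [-64, 63] */
#define MV_LUT_SIZE (2 * MV_DIFF_MAX + 1)

static int mv_lut[MV_LUT_SIZE];

/* Fill the table once from a slow but trusted cost function. */
static void mv_lut_init(int (*exact_bits)(int diff))
{
    int d;

    for (d = -MV_DIFF_MAX; d <= MV_DIFF_MAX; d++)
        mv_lut[d + MV_DIFF_MAX] = exact_bits(d);
}

/* Branch-free lookup for the hot path; the assert documents the valid range. */
static int mv_lut_bits(int diff)
{
    assert(diff >= -MV_DIFF_MAX && diff <= MV_DIFF_MAX);
    return mv_lut[diff + MV_DIFF_MAX];
}

/* Placeholder cost function for illustration only. */
static int toy_exact_bits(int diff)
{
    return diff == 0 ? 1 : 2 + abs(diff) / 8;
}
```

The idea is that the estimate cannot disagree with the coder as long as the function passed to mv_lut_init is the same logic the bitstream writer uses, and the offset of MV_DIFF_MAX keeps every legal difference inside the array.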

Anyway, overestimating cost by one in those rare cases (I need to encode over 200 frames to hit the assertion) should have absolutely no effect on quality.
__________________
Visit #xvid or #x264 at irc.freenode.net

Last edited by sysKin; 28th April 2007 at 17:49.