23rd April 2007, 22:24 | #23
A hollow voice says
Join Date: Sep 2006
Posts: 269
The difference set in my preceding post, between the two mpeg+default-cpuflags encodes (using opt or noopt on that bvop rd module), consists of a ~180-frame sequence of P and B frames, starting at a P frame and ending at the next I frame. The rest of the 5000-frame clip is identical.
Very curious indeed - somehow opt/noopt of *this* module polluted/affected a P frame?

Ran another series of tests, expanding on the 'diff' axes:
  constant -> vhq-b=on, vhq=1, 4mv=off, mpeg
  variable -> range of cpu flags
10 encodes
Code:
               /diff\   /same\   /same\   /SAME\
full optimize   mmx      xmm      sse      3dn      3de
                 |        |        |        |        |
                same     same     same     same     DIFF
                 |        |        |        |        |
noopt bvop rd   mmx      xmm      sse      3dn      3de
               \diff/   \same/   \same/   \DIFF/
Well, at least it further narrows things down...

Last edited by plugh; 23rd April 2007 at 23:54.
24th April 2007, 00:44 | #24
A hollow voice says
Join Date: Sep 2006
Posts: 269
Well, I still don't know what/why, but I can say this particular issue IS _directly_ related to the asm routine dequant_mpeg_inter_3dne (in module quantize_mpeg_xmm.asm) and NOT the other 3dne asm routines...
I modified xvid.c and commented out the function-pointer assignment for this routine, leaving all the others alone. Rebuilt, with bvop rd still 'noopt', reran that corner case, and now the output matches.

Really need someone who knows those SIMD instructions to look at that routine... Why, with the routine enabled, do we get different output for the 'opt' and 'noopt' cases?

Last edited by plugh; 24th April 2007 at 00:56.
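To be concrete, the experiment amounts to disabling a single assignment in the cpu-flags setup, roughly like this (a sketch, not a patch; the pointer name dequant_mpeg_inter is my assumption based on the routine's name):
Code:
	/* in xvid.c, inside the 3DNow!-ext capability block */
#if 0	/* experiment: leave the default dequant routine in place */
	dequant_mpeg_inter = dequant_mpeg_inter_3dne;
#endif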
24th April 2007, 15:26 | #26
A hollow voice says
Join Date: Sep 2006
Posts: 269
I can't comment in general, but I don't see any floats in that module, at least. I do know there are floats in plugin_2pass2, but that wouldn't impact this.
I remembered I had a VMware virtual machine with W2K and a gcc setup on it (MSYS 1.0.10, MinGW 4.1.0, gcc 3.4.4). Added nasm, installed the xvid 1.1.2 sources, tried doing a build - success! So now I can poke at the gcc-built version and see what it reveals about that module.

FWIW, the canned xvid+gcc build procedure uses the following gcc flags:
Code:
-Wall -O2 -fstrength-reduce -finline-functions -freduce-all-givs -ffast-math -fomit-frame-pointer
(note the 'fast-math' referred to earlier) Guess I need to read up on them...

re-EDIT: Just did a quick one-shot encode comparison between gcc builds with/without the fast-math flag --> one B frame is slightly different in the .pass files. However, the gcc build *with* fast-math and msvc are in agreement on that particular frame.

Last edited by plugh; 24th April 2007 at 17:16.
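As an aside, the reason -ffast-math can change output at all: it lets the compiler reassociate floating-point arithmetic, and float addition isn't associative. A minimal standalone illustration (my example; nothing to do with xvid's actual float usage):
Code:
#include <stdio.h>

int main(void)
{
	float a = 1e20f, b = -1e20f, c = 1.0f;
	/* mathematically identical, but not in float arithmetic */
	printf("(a+b)+c = %g\n", (a + b) + c);	/* (0) + 1 -> 1 */
	printf("a+(b+c) = %g\n", a + (b + c));	/* c is absorbed into b -> 0 */
	return 0;
}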
25th April 2007, 21:26 | #28
A hollow voice says
Join Date: Sep 2006
Posts: 269
It has been an extremely tedious process, but I have tracked back to a specific chunk of code that gives different results when compiled with msvc vs gcc.
(When I say tedious, I mean it - tracking backwards through the code, identifying where a particular rd mode evaluation for a particular macroblock in a particular frame goes weird.)

The chunk of code is
Code:
static __inline uint32_t
d_mv_bits(int x, int y, const VECTOR pred, const uint32_t iFcode, const int qpel)
{
	unsigned int bits;

	x <<= qpel;
	y <<= qpel;

	x -= pred.x;
	bits = (x != 0 ? iFcode : 0);
	x = -abs(x);
	x >>= (iFcode - 1);
	bits += r_mvtab[x+63];

	y -= pred.y;
	bits += (y != 0 ? iFcode : 0);
	y = -abs(y);
	y >>= (iFcode - 1);
	bits += r_mvtab[y+63];

	return bits;
}
The arguments being passed in this particular case are:
  x=-64  y=63  pred={x=63, y=15}  iFcode=2  qpel=0

msvc produces the value 4128837
gcc produces the value 14
I manually calc it, and I get 26

The call stack is ModeDecision_BVOP_RD -> SearchInterpolate_RD -> CheckCandidateRDInt -> the first instance in the following statement
Code:
rd += BITS_MULT * (d_mv_bits(xf, yf, data->predMV, data->iFcode, data->qpel^data->qpel_precision) +
                   d_mv_bits(xb, yb, data->bpredMV, data->iFcode, data->qpel^data->qpel_precision));
25th April 2007, 22:24 | #29
A hollow voice says
Join Date: Sep 2006
Posts: 269
Found another one: different macroblock, different rd mode.

The arguments being passed:
  x=63  y=24  pred={x=-64, y=-20}  iFcode=2  qpel=0

msvc computes 4128837 (again)
gcc computes 14 (again)
I manually calculate 26 (again)

Call stack is ModeDecision_BVOP_RD -> SearchBF_RD (mode is Forward) -> CheckCandidateRDBF -> the following line
Code:
rd += BITS_MULT * (d_mv_bits(x, y, data->predMV, data->iFcode, data->qpel^data->qpel_precision) - 2);
25th April 2007, 23:10 | #30
A hollow voice says
Join Date: Sep 2006
Posts: 269
Bingo - I see it!
-127 integer divide by two is -63
-127 shift right once (sign extended) is -64

With x = -64, the index x+63 is -1, so the lookup runs off the front of the array...

So the next question is: is this a bug in the routine, or is "-64" an illegal/out-of-range value for a vector? Perhaps some asm routine is not rounding / range-limiting correctly?

Last edited by plugh; 25th April 2007 at 23:25.
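The two behaviours side by side, as a standalone snippet (my illustration, not xvid code; assumes the usual x86 arithmetic right shift for negative ints):
Code:
#include <stdio.h>

int main(void)
{
	int x = -127;
	printf("%d / 2  = %d\n", x, x / 2);	/* -63: C division truncates toward zero */
	printf("%d >> 1 = %d\n", x, x >> 1);	/* -64: arithmetic shift rounds toward -infinity */
	/* d_mv_bits uses the shift, so it indexes r_mvtab[-64 + 63] = r_mvtab[-1] */
	printf("index  = %d\n", (x >> 1) + 63);	/* -1: out of bounds */
	return 0;
}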
26th April 2007, 02:27 | #31
A hollow voice says
Join Date: Sep 2006
Posts: 269
Well, as an experiment, I changed three lines in motion_inlines.h
Code:
static const int r_mvtab[64] = {    to    static const int r_mvtab[65] = {12,
bits += r_mvtab[x+63];              to    bits += r_mvtab[x+64];
bits += r_mvtab[y+63];              to    bits += r_mvtab[y+64];
Compared the paired output .pass and .avi files, and they are identical now!

Not saying the above is a "fix", but it does seem to show that both compilers are generating equivalent functional representations of the source code (unlike the ICL builds). I'll probably run some more comparison series (range of VHQ, range of cpu flags), but I have much greater confidence that my builds are 'right' now.

I hope someone knowledgeable will chime in and indicate whether "-64" is a valid value for a vector component - if it is, then the above *is* a fix. If not, it's just a workaround for some badly behaved code elsewhere (which both the msvc and gcc compilers are building as directed )

Might be interesting to see if this change improves psnr/ssim/xyzzy...

The other weirdness, with the opt vs noopt msvc builds and that one asm routine - I'm not sure what to think about that one. As an experiment, I added a 'femms' instruction to the asm file just before the return, and it magically caused the noopt build to produce the same output as the opt build - not the other way around. Again, I hope someone more knowledgeable will look at that oddity...
26th April 2007, 06:10 | #32
Registered User
Join Date: Jan 2002
Location: France
Posts: 2,856
A motion vector goes from -2^x to 2^x - 1/2 (or 1/4 for QPel), so yes, -64 is valid (in your case, -64 is -16 integer pixels, and 63 is 15.75 integer pel).
26th April 2007, 10:03 | #33
Registered User
Join Date: Jun 2002
Location: Adelaide, Australia
Posts: 1,167
Whoa plugh, what great work.
Yes, -64 is valid. So we were nicely reading r_mvtab[-1]? Great. I wonder why memory access analysis tools didn't pick it up.

I suppose I should stick this d_mv_bits() after the motion vector writing code and assert that the calculated length is the actual bitstream length. That would make us 100% sure nothing else is wrong. Although, then again, I did have such an assertion for a whole macroblock (part of VHQ debugging). I suppose vectors of -64 were never chosen (as they appeared to be horribly costly, 44 kilobits!) and therefore the assertion was never hit.
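Such an assertion could look roughly like this (a sketch only; mv, pred, bs, iFcode come from the surrounding encoder context, and BitstreamPos / CodeVector are my assumed names for the bit-position query and the MV writer):
Code:
/* needs <assert.h>; placed right where the encoder writes a motion vector */
const uint32_t predicted = d_mv_bits(mv.x, mv.y, pred, iFcode, 0 /* qpel */);
const uint32_t start = BitstreamPos(bs);
CodeVector(bs, mv.x - pred.x, iFcode);	/* differential x component */
CodeVector(bs, mv.y - pred.y, iFcode);	/* differential y component */
assert(predicted == BitstreamPos(bs) - start);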
26th April 2007, 13:44 | #34
A hollow voice says
Join Date: Sep 2006
Posts: 269
So that three-line change *can* be considered a "fix" for 1.1.2?

Given I'm only working with "HD" encodes (and with 4mv off), perhaps that magnitude of vector is somewhat more likely? ie MB displacement across X% of the image literally crosses more pixels? (Don't know what I'm talking about, but it sounds good anyway )

BTW, there was one other thing in the huge volume of debug print data I collected that struck me - I'll pass it on, for whatever it is worth. In ModeDecisionBVOP_RD, right after the "evaluate cost of all modes" loop, the values for d_rd, f_rd, b_rd, i_rd were frequently the same (with my short test clip). The code is evaluating the modes in increasing SAD order, but in this case should it simply choose the 'first' mode at that cost? EDIT: Duh - stupid question; you want the one with the lowest SAD. Never mind...

Anyway, it happens enough (multiple modes yielding the same rd) in my data that it caught my eye, so I thought I'd mention it. Seemed odd, given radically different code paths.

Examples - frame 182; the 4 SADs, evaluation order stuff, x/y of MB, the four RD costs, the chosen cost/mode:
Code:
182 ds=464 bs=301 fs=464 is=238 bst=238 order=1 2 0 3 num=4 I0 B1 D2 F3 x=41 y=0 d=770 f=786 b=770 i=770 rd=770 mod=1
182 ds=324 bs=306 fs=340 is=216 bst=216 order=1 2 0 3 num=4 I0 B1 D2 F3 x=22 y=1 d=1179 f=1195 b=1179 i=1179 rd=1179 mod=1

Last edited by plugh; 26th April 2007 at 15:58.
27th April 2007, 08:31 | #35
A hollow voice says
Join Date: Sep 2006
Posts: 269
Out of curiosity, I also did a build using ICL 9.0.28 with the above "fix", and compared it to the msvc/gcc builds.
The difference set is now much smaller; however, there are still differences. I've poked at it some, and made the following observations.

1) Every so often, the VOP header is a single bit longer than 'usual'. This extra bit is sometimes enough to make the byte-padded frame a single byte longer. The msvc and ICL builds do not do this 'in sync' with each other. Thus, a comparison of the .pass files for ICL vs msvc shows occasional one-byte frame length differences. No such difference is observed comparing msvc vs gcc .pass files. The source of this difference in behaviour is the following routine in encoder.c
Code:
static void
simplify_time(int *inc, int *base)
{
	/* common factor */
	const int s = gcd(*inc, *base);
	*inc /= s;
	*base /= s;

	if (*base > 65535 || *inc > 65535) {
		int *biggest;
		int *other;
		float div;

		if (*base > *inc) {
			biggest = base;
			other = inc;
		} else {
			biggest = inc;
			other = base;
		}

		div = ((float)*biggest) / ((float)65535);
		*biggest = (unsigned int)(((float)*biggest) / div);
		*other = (unsigned int)(((float)*other) / div);
	}
}
2) If I encode a very short sequence of frames (so that I don't encounter that extra bit/byte 'time' thing above), then binary-compare the avi files, I consistently see a single byte difference per frame. In my test case, the msvc build will have an 'FF' where the ICL build has an 'FB'. I don't have any tool to parse the avi and tell me where this byte is in the frame (though I would guess it's at the end?). Again, the msvc and gcc builds show no such difference. I'm suspicious of the "bitstream" code in this case, but will leave that as 'an exercise' for someone else...

Last edited by plugh; 27th April 2007 at 08:46.
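Since the one-bit drift comes from float division plus truncation, an integer-only variant would make every compiler agree. A hedged sketch of that idea (mine, not XviD's; the inline gcd stands in for the helper encoder.c already uses):
Code:
#include <stdint.h>

/* stand-in for the gcd helper encoder.c already has */
static int gcd(int a, int b) { while (b) { int t = a % b; a = b; b = t; } return a; }

static void simplify_time_int(int *inc, int *base)
{
	const int s = gcd(*inc, *base);
	*inc /= s;
	*base /= s;

	if (*base > 65535 || *inc > 65535) {
		const int64_t big = (*base > *inc) ? *base : *inc;
		/* 64-bit integer scaling truncates identically on every compiler */
		*inc  = (int)(((int64_t)*inc  * 65535) / big);
		*base = (int)(((int64_t)*base * 65535) / big);
	}
}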
27th April 2007, 15:56 | #36
A hollow voice says
Join Date: Sep 2006
Posts: 269
When I did my initial "fixed" msvc/gcc/icl compare above I collected one other datum, which yielded a quite surprising comparative result:
Code:
time (min:sec) to complete test-clip encode

             msvc    gcc     icl
h263 quant   16:11   16:30   16:39
mpeg quant   17:35   17:52   18:22

dll size     580KB   728KB   808KB
The only hypothesis I can come up with is that the more compact dll works better with my cache-challenged Duron processor. Guess which one I'll be using for my future encodes 

If anyone wants to experiment, attached is an msvc build of v1.1.2 xvidcore with the above array-size "fix".

EDIT: withdrawn, based upon syskin's post below. Updated build here

Last edited by plugh; 28th April 2007 at 20:04.
28th April 2007, 07:01 | #37
Angel of Night
Join Date: Nov 2004
Location: Tangled in the silks
Posts: 9,559
If it helps, avisynth had its own problems with fps and ended up with this function to fix things up:
Code:
// This function uses continued fractions to find the best rational
// approximation that satisfies (denom <= limit). The algorithm
// is from Wikipedia, Continued Fractions.
//
static void reduce_frac(unsigned &num, unsigned &den, unsigned limit)
{
	unsigned n0 = 0, n1 = 1, n2, nx = num;	// numerators
	unsigned d0 = 1, d1 = 0, d2, dx = den;	// denominators
	unsigned a2, ax, amin;	// integer parts of quotients
	unsigned f1, f2;	// fractional parts of quotients
	int i = 0;	// number of loop iterations

	while (1) {	// calculate convergents
		a2 = nx / dx;
		f2 = nx % dx;
		n2 = n0 + n1 * a2;
		d2 = d0 + d1 * a2;

		if (f2 == 0) break;
		if ((i++) && (d2 >= limit)) break;

		n0 = n1; n1 = n2;
		d0 = d1; d1 = d2;
		nx = dx; dx = f1 = f2;
	}

	if (d2 <= limit) {
		num = n2; den = d2;	// use last convergent
	} else {	// (d2 > limit)
		// d2 = d0 + d1 * ax
		// d1 * ax = d2 - d1
		ax = (limit - d0) / d1;	// set d2 = limit and solve for a2

		if ((a2 % 2 == 0) && (d0 * f1 > f2 * d1))
			amin = a2 / 2;	// passed 1/2 a_k admissibility test
		else
			amin = a2 / 2 + 1;

		if (ax < amin) {	// use previous convergent
			num = n1;
			den = d1;
		} else {	// calculate best semiconvergent
			num = n0 + n1 * ax;
			den = d0 + d1 * ax;
		}
	}
}
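If I read the algorithm right, a quick sanity check: called with num=168000, den=7007 (an unreduced 24000/1001) and limit=65535, the loop terminates on the f2 == 0 branch with exactly num=24000, den=1001, since the reduced denominator is already under the limit. The semiconvergent logic only kicks in when the reduced denominator exceeds the limit, which is precisely the case simplify_time mangles with floats.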
28th April 2007, 17:27 | #39
A hollow voice says
Join Date: Sep 2006
Posts: 269
I just used the canned build options from the source kit, as my focus was getting the builds to produce identical output.
I may experiment in that area some, but it would mean re-running encoder output comparisons (a time-consuming process) to ensure such tweaks didn't change the results - like that msvc opt/noopt oddity I discuss above...

Last edited by plugh; 28th April 2007 at 17:46.
28th April 2007, 17:37 | #40
Registered User
Join Date: Jun 2002
Location: Adelaide, Australia
Posts: 1,167
OK, I committed a fix for the d_mv_bits out-of-bounds memory access bustage.
Unfortunately, the fix is not correct. For some negative vectors which land in the range mv_table[64-34]..[64-64], the correct value seems to be 11, not 12. I added an assertion that fails when an incorrectly-estimated vector is coded.

I am not sure if the logic is incorrect in one place, or maybe the entire mv_bits can't be calculated in such a "smart", branchless way. Following the code from CodeVector is surely correct but unfortunately measurably slower. We should just use a LUT.

Anyway, overestimating the cost by one in those rare cases (I need to encode over 200 frames to hit the assertion) should have absolutely no effect on quality.

Last edited by sysKin; 28th April 2007 at 17:49.