Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion. Before you start posting please read the forum rules. By posting to this forum you agree to abide by the rules. |
24th August 2012, 13:17 | #20081 | Link |
李姗倩 Lǐ Shān Qiàn
Join Date: Nov 2002
Posts: 1,340
|
Sample clip
This happened between 5555 and 5572; the revision 5569 by XhmikosR on July 19, 2012 should be the culprit.
This SSE2 optimization from VSFilterMod (i.e. Transform_SSE2) is obviously broken, where they're rotating font-x in the wrong (negative) direction. I just confirmed that VSFilterMod does really rotate font to the opposite direction. This "bug" should be something trivial, where they typed "+" when they should have typed "-" or something like that, when handling sin or cos. The original rotation by Gabest agrees with MPC-HC 1.6.2 and before; with ffdshow, VLC, and probably with libass too (which I cannot test directly since I'm sadly on Windows). The latest version of MPC-HC is practically the only one that is not compatible. See for yourself: font_rotation_x.mkv (~50 kB) MPC rev. 114 ffdshow rev. 4483 VLC v2.0.3 MPC-HC rev. 5555 MPC-HC rev. 5572 (Broken!) If you guys are not interested in \frx at all, will you at least revert to the non-SSE2 version before many typesetters start rotating fonts to the opposite direction, creating a lot of confusion? Like I said, VSFilterMod should be fixed too if someone is still working on it. Thanks again for your wonderful jobs! |
24th August 2012, 16:12 | #20084 | Link | |
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,344
|
Quote:
Not much was added from VSFilterMod, only stuff that seemed harmless at the time, like this SSE2 optimization.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders |
|
25th August 2012, 02:32 | #20085 | Link | |
李姗倩 Lǐ Shān Qiàn
Join Date: Nov 2002
Posts: 1,340
|
Still buggy though better
Quote:
The obvious problem you mentioned has been surely fixed, but as it turned out, something (like the rotation origin, the nature of rotation...) is still wrong, suggesting the SSE2 optimizations in VSFilterMod are not well-tested, not compatible with the unoptimized code, and this problem is more subtle than I first thought. The following images show the +70-degree x-rotation of "Èj" in the {\pos} (position) (0,0) with the {\an7} style (upper-left corner is the origin), that is the {\org} (origin) is (0,0) too. The original letters that are not rotated are shown in pale green. The results of the rotation are shown in red. 1) This is the good one, rendered by old, reliable 1.6.2, compatible with Gabest and with ffdshow, and almost compatible with VLC. 2) The next image shows what we have now, using 1.6.4.5887 (698e56d) taken from http://xhmikosr.1f0.de/mpc-hc/ (thanks to XhmikosR). As you can see, the general direction of the rotation is more or less better now, but this is far from being compatible. See for yourself: font_rotation_x2.mkv (~50 kiB): hardsubbed in blue, softsubbed in red. Also note that, although I said that this was borked in 5569 by XhmikosR, the code was not written by XhmikosR. It was already there, already imported, and conditionally commented out. He/she (XhmikosR) simply enabled it. It's not his/her fault that this SSE2 code is broken. |
|
25th August 2012, 03:23 | #20086 | Link | |
Broadband Junkie
Join Date: Oct 2005
Posts: 1,859
|
I remember these \frx \fry \frz bugs. It's why we disabled the Transform_SSE2 code in xy-VSFilter a long time ago.
Quote:
Here is a better sample which was initially created to track down this bug in xy-VSFilter last year: http://www.mediafire.com/?j9ota09016rt2yb The video is hardsubbed with a reference VSFilter 2.39 \frx \fry \frz rendering. The external script subtitles with opposite color should match up identically through all the rotations. |
|
25th August 2012, 03:58 | #20087 | Link | |
李姗倩 Lǐ Shān Qiàn
Join Date: Nov 2002
Posts: 1,340
|
Quote:
The bug officially came to MPC-HC in the v1.6.3 release like 10 days ago, which I first tried like 2 days ago, when I noticed the incompatibility. So this was a new problem for me. (Btw, xy-VSFilter seems really promising. Like, \fscx/fscy + float is what I've always wanted. I'll look into xy-VSFilter too when I have time.) |
|
25th August 2012, 07:00 | #20088 | Link |
李姗倩 Lǐ Shān Qiàn
Join Date: Nov 2002
Posts: 1,340
|
I looked into the inside of RTS.cpp, and this is my first impression. If you're watching __pointx in Transform_SSE2:
Code:
__pointx = _mm_add_ps(__xx, __yy); // xx = x * caz + y * saz; //#1 __xx = _mm_mul_ps(__pointx, __cay); // x * cay //#2 __pointx = _mm_add_ps(__xx, __zz); // xx = x * cay + z * say //#3 __xx = _mm_mul_ps(__pointx, __say); // x * say //#4 #2 is wrong: you're using __pointx as "x" but it's "xx" since #1. #3 makes __pointx "xx" again. #4 is wrong: you're using __pointx as "x" but it's "xx" since #3. Similarly, for __pointy: Code:
__pointy = _mm_add_ps(__yy, __zz); // y = yy * cax + zz * sax //#5 __yy = _mm_mul_ps(__pointy, __sax); // yy * sax //#6 #6 is wrong. You're using __pointy as "yy" but it's "y" since #5. |
25th August 2012, 08:11 | #20089 | Link |
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,344
|
You're right, i really should've seen that when i documented the calls. I didn't actually change them except flipping the one operand there, just added the docs, but apparently didn't pay enough attention.
No matter, i gave it another try and refactored the whole function to use variable names matching the C ones very closely, no changes of meaning in the variables anymore. The test videos you and cyberbeing posted are working fine now.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders Last edited by nevcairiel; 25th August 2012 at 08:24. |
25th August 2012, 14:38 | #20090 | Link |
Registered User
Join Date: Oct 2010
Location: The Netherlands
Posts: 1,083
|
Well, you chose quite a monster of a function to work on there. (I'll try to explain it mostly to the public that knows at least basic programming, but hasn't seen this function yet.)
I disabled Transform_SSE2() when I was fixing the issues with scaling to the video area on the window area and aspect ratio correction with the frx, fry and frz functions a while ago. I marked Transform_SSE2() with: "TODO: The methods here are very wrong (the C function does work as it should), fix this with decent code, or delete this function." As resolving the list of issues of Transform_SSE2() plus fixing the scaling and aspect ratio bugs seemed to be too much work at the time, I chose to only edit Transform_C(). As the variables are prefixed with "_", I was already wondering who would write such code. That prefix is generally reserved for the global compiler-specific functions, constants and variables. Renaming would indeed be a good start. A quick review of the function shows that: -None of the typecasts between single- and double-precision floating points and 32-bit signed integers used in this function is properly handled with SSE. -"static_cast<LONG>(x + 0.5)" is used to do rounding casts to integer in both the C and the SSE functions. SSE has native rounding casts to do this, which also handles negative values properly. (Vertices stored as integers is one of the most retarded design choices of the subtitle renderer. Using floating-point vertices would eliminate these type casts and provide decent sub-pixel precision.) -No intrinsic to generate a native load data to register is used. "_mm_set_~" is used everywhere, where "_mm_load_~" should have been used. -The basic cos() and sin() functions are not handled with SSE. (I'll ignore the horribly truncated PI in these lines for now.) -4 lines of code feature access to member variables of the __m128 union. That's illegal if you take it strictly, but the compiler will probably bend over backwards and do a lot of register shuffles to output the correct values. -"__m128 __pointz = _mm_set_ps1(0);"; this functionality is provided properly by _mm_setzero_ps(). -_mm_rcp_ps() (reciprocal approximation) is used. That's reasonably okay, but the output precision with only this instruction is low (12 bits). It could use a Newton-Raphson iteration to improve that (to about 22 bits, out of the full 24 bits). Else, a regular floating-point division to get the reciprocal would work fine. The design choice of using single-precision instead of double-precision for this function is okay. (Transform_C() uses doubles.) The lower precision isn't significant. Handling this function with SSE2 vectorized doubles would have been just as easy, though. I don't think that the author of the Transform_SSE2() function had any experience working with SSE intrinsics or ever wrote assembly before. The function is marked with "// speed up ~1.5-1.7x", but I doubt that to be true. Pretty much all the code for the subtitle renderer's SSE2 functions have issues, although most are less severe than this one. Unfortunately, this function illustrates the state the subtitle renderer is in rather well. Refactoring the functions with these kinds of issues isn't easy, but any effort on getting proper, efficient routines in the subtitle renderer is very welcome indeed.
__________________
development folder, containing MPC-HC experimental tester builds, pixel shaders and more: http://www.mediafire.com/?xwsoo403c53hv |
25th August 2012, 15:13 | #20091 | Link |
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,344
|
I just fixed the math in it to match the C math, nothing else.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders |
25th August 2012, 15:27 | #20092 | Link |
Broadband Junkie
Join Date: Oct 2005
Posts: 1,859
|
On that note, xy-VSFilter rewrote the Transform_C code for better performance a couple weeks ago. Though from what I remember hearing, the transform code was never really a bottleneck for VSFilter in the first place.
|
25th August 2012, 16:20 | #20093 | Link | |
李姗倩 Lǐ Shān Qiàn
Join Date: Nov 2002
Posts: 1,340
|
The fix itself is awesome. Thank you very much, Nevcairiel. (I've not yet got a binary, though.)
However... Quote:
Code:
__xx = _mm_mul_ps(__xx, __20000); // xx * 20000 __pointx = _mm_div_ps(__xx,__zz); // x = (xx * 20000) / (zz + 20000) __yy = _mm_mul_ps(__yy, __20000); // yy * 20000 __pointy = _mm_div_ps(__yy, __zz); // y = (yy * 20000) / (zz + 20000); Code:
//__zz = _mm_rcp_ps(__zz); // 1 / (zz + 20000) __m128 z0 = _mm_rcp_ps(__zz); __zz = _mm_sub_ps( _mm_add_ps(z0,z0), _mm_mul_ps(z0,_mm_mul_ps(z0,__zz)) ); |
|
25th August 2012, 16:41 | #20094 | Link |
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,344
|
the easiest solution would obviously be to use _mm_div_ps, but no idea how speed comparison is there.
Considering the C version also uses a real division, its probably not going to be slower then C at least. Edit: I pushed a patch to use _mm_div_ps and another small cleanup.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders Last edited by nevcairiel; 25th August 2012 at 16:47. |
25th August 2012, 17:53 | #20095 | Link | |
Registered User
Join Date: Oct 2010
Location: The Netherlands
Posts: 1,083
|
Quote:
I took a peek at the latency and throughput table (Intel® 64 and IA-32 Architectures Optimization Reference Manual, edition june 2011): Given for a Sandy Bridge model (06_2AH) and an older Merom (06_0FH): divps: 14/14, <21, <16 rcpps: 5/1, 3/1 mulps: 5/1, 4/1 addps: 3/1, 3/1 subps: 3/1, 3/1 I certainly can also look it up for AMD, but I think it won't matter much. Straight divisions are always expensive in both latency and throughput, no matter for integer or floating point. The rcpps and a Newton-Raphson iteration method is mostly a lot faster if the pipeline can be reordered easily. If the µops are crammed together, the pipeline will stall for a little bit. Of course, divps outputs with full 24 bits of precision, and the approximate routine with about 22. That's probably the main reason that there's no rcppd or rcpsd for doubles.
__________________
development folder, containing MPC-HC experimental tester builds, pixel shaders and more: http://www.mediafire.com/?xwsoo403c53hv |
|
26th August 2012, 02:57 | #20096 | Link | ||
Broadband Junkie
Join Date: Oct 2005
Posts: 1,859
|
Quote:
Quote:
He's basically suggesting that xy-VSFilter's Transform_C code be used as a base if any proper Transform_SSE2 function is ever going to be written. If xy-VSFilter's rewritten Transform_C really is faster than the 'fixed' VSFilterMOD Transform_SSE2, that may be a good reason for MPC-HC to just scrap it and start over. |
||
26th August 2012, 07:55 | #20097 | Link |
Registered User
Join Date: Oct 2010
Location: The Netherlands
Posts: 1,083
|
I took a peek at that commit.
PI is still truncated to 3.1415. DSUtil (linked in) has M_PI with enough digits to fill a double. Code:
double scalex = style.fontScaleX/100; Code:
(long)(x + org.x + 0.5);
__________________
development folder, containing MPC-HC experimental tester builds, pixel shaders and more: http://www.mediafire.com/?xwsoo403c53hv |
26th August 2012, 08:51 | #20098 | Link | ||
李姗倩 Lǐ Shān Qiàn
Join Date: Nov 2002
Posts: 1,340
|
Tested 1.6.4.5895 (0235e5b), and the bug was finally fixed. Yay! Since I don't have a binary with the fix AND _mm_rcp_ps, I can't tell how _mm_rcp_ps is bad in a real-world sample.
The SSE2 version of CWord::Transform was originally committed on May 13, 2010: https://code.google.com/p/vsfiltermod/source/detail?r=76 The log message says, "is this good and correct?" So the author was not sure themself if they were doing this right. But hey, everyone has the right to enjoy experimenting It's part of what free-software is all about. Let's stop blaming this function. It was unfortunate that this was just imported to MPC-HC without being tested properly. Similarly, things should be tested properly before imported from xy-VSFilter, even though I'm willing to support xy-VSFilter too. I read somewhere that xy-VSFilter crashed just because you used a negative value between {\p1}...{\p0}, showing it's not very stable yet, though promising. Quote:
Compare the 4th and 5th images, and you'll see the quality gain by going from qpel to 8th-pel is marginal. Probably the algorithm of anti-aliasing itself could be improved, but you wouldn't gain much here by just using (float) or (double). A subtitle renderer has to render things quickly, like 24 times a second. When Subtitler (the parent of VSFilter) was created about 10 years ago by Avery Lee (the same author of VirtualDub), 0.125-pixel accuracy stored as integer was probably an optimal choice - considering the CPUs used back then. And no one is going to say that Avery Lee didn't use float because he didn't know the best solution. That's too disrespectful! Today we have better hardware, and probably someone will create something even better eventually, like libass. But even if the font rendering quality of VSFilter is not perfect, maybe the "limited subpixelness" (8x8 subpixels per pixel) is not the reason. Rendered text looks solid to me, if not perfect. Correct me if I'm wrong. Quote:
It seems better to me not to change 3.1415, even though it's weird. Changing it might break existing ASS scripts that depend on this weirdness. But then again, {\blur} broke the compatibility of {\b} too... |
||
26th August 2012, 08:54 | #20099 | Link |
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,344
|
All i did was to make the SSE2 function behave the same as the C function, i have no real interest in changing it any more.
If someone wants to still optimize the old and rather broken VSFilter, have fun.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders |
26th August 2012, 09:28 | #20100 | Link |
Registered User
Join Date: Jul 2007
Posts: 552
|
While you fixing subtitles renderer may be somebody can look at crash with this sample:
To reproduce bug at least this conditions must be met: 1) Internal subtitles enabled and configured like this: 2) Desktop resolution is 1280x1024 3) Sample played in full screen from beginning to end I checked with MPC-HC 1.6.4.5895 (0235e5b) x86 and it still happen. OS: Windows XP SP3. P.S. Looks like final crash happen not inside subtitle renderer but caused by it (probably write outside of its memory). |
Tags |
dxva, h264, home cinema, media player classic, mpc-hc |
Thread Tools | Search this Thread |
Display Modes | |
|
|