13th May 2003, 08:59 | #21 | Link |
Moderator
Join Date: Oct 2001
Location: England
Posts: 3,285
|
@alx:
Very weird. I've not added anything, only taken bits out of loops: every time a frame was fetched, a memory allocation was done (and leaked), the iDCT was refreshed, and a bunch of variables were set that don't need to be set, etc. I think it's impossible for mine to be slower (unless the Intel compiler is causing it, but it makes it faster on mine), but I'll look into it.
As for dvd2avi_nic, lol, maybe I'll never release it. Mainly because I don't use Comp check, so dvd2avi_nic doesn't have one. But people will want it, so I'd better code one first.
@trbarry:
Add_Block was just one that it mentioned (that I remembered). It's important and may not be able to be improved. The SSE iDCT is there, but not the other SSE2 code. Marc FD removed/ifdef'd it because it was unstable (?), I think, or at least not producing accurate results. If I had an SSE2 computer for development I'd test through each bit and find the bits that worked (i.e. created the same output as the SSEMMX parts). But I don't at present (you got any free time?)
BTW, Marc's post on SSE2: http://forum.doom9.org/showthread.ph...SE2#post207193
-Nic
Last edited by Nic; 13th May 2003 at 09:17. |
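For readers following along, the "allocation taken out of the per-frame loop" change Nic describes might look roughly like this. It is a minimal sketch only: `FrameDecoder`, `decode_frame` and the placeholder "decode" are hypothetical names invented for illustration, not MPEG2Dec3's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decoder stand-in; not the real MPEG2Dec3 class.
class FrameDecoder {
public:
    // The scratch buffer is allocated once, when the decoder is created.
    explicit FrameDecoder(std::size_t frame_bytes) : scratch_(frame_bytes) {}

    // Reuses the same buffer for every frame: no per-frame allocation,
    // and nothing that can leak.
    const std::uint8_t* decode_frame(int n) {
        assert(n >= 0);
        // Placeholder "decode": fill the buffer with a value derived from n.
        for (auto& byte : scratch_) byte = static_cast<std::uint8_t>(n);
        ++frames_decoded_;
        return scratch_.data();
    }

    std::size_t frames_decoded() const { return frames_decoded_; }

private:
    std::vector<std::uint8_t> scratch_;  // released automatically by the destructor
    std::size_t frames_decoded_ = 0;
};
```

Calling `decode_frame` twice returns the same pointer both times, which is the point: the cost of the allocation is paid once instead of once per frame.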
13th May 2003, 12:55 | #22 | Link |
Retired AviSynth Dev ;)
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
|
If memory is no longer aligned, a minor penalty can be expected on Athlon (and other processors). But it seems like a lot: MPEG2 decoding shouldn't take much more than 10-15% of the overall processing time.
Could you repeat the test, just to be sure it isn't something strange like Windows swapping or something?
__________________
Regards, sh0dan // VoxPod |
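One cheap way to rule out the alignment theory is a debug check on the buffers handed to the SIMD routines. The following is a sketch under assumptions: `is_aligned` is a hypothetical helper, and the 16-byte boundary is an assumed target, not something stated in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if `p` sits on a multiple of `alignment` bytes.
// `alignment` must be a non-zero power of two (16 is assumed here).
bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

Dropping something like `assert(is_aligned(buffer, 16));` at the point where frame buffers are created would show straight away whether a refactor changed their alignment.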
13th May 2003, 13:08 | #23 | Link |
Moderator
Join Date: Oct 2001
Location: England
Posts: 3,285
|
@sh0dan: the memory is still aligned. Windows caching makes a big difference on small tests, as I've been finding out. I'm going to try to make mpeg2dec3's disk access more efficient and then stop for now.
-Nic |
13th May 2003, 14:20 | #24 | Link | |
Registered User
Join Date: Oct 2001
Location: Gainesville FL USA
Posts: 2,092
|
Quote:
As Int21h pointed out, the SSE2 code (except for the iDCT) made only marginal improvements, though it made more of a difference on my machine than his for some reason. But some of it was very sensitive to compiler optimization and would crash with some settings and combinations of inlining. That's probably why Marc FD had to turn it off. There is a whole series of timing tests written up for similar DVD2AVI code in the DVD2AVI section, in that huge DVD2AVI Sourceforge thread somewhere.
I've been planning for a while to add a couple more simpler assembler optimizations but haven't quite got to it. These would just need P3's (or less), not P4's.
First, the iDCT code should probably be called via pointer like in Xvid (or my DctFilter), avoiding all the extra logic in AddBlock. You want to do this? (The SSE2 prefetch call is unneeded.)
Second, the assembler now in AddBlock can easily be optimized a bit.
And third, and more important, would probably be asm optimizing the dequant functions. This wouldn't be hard, and -h nags us about this from time to time. I'll try to get to those.
I haven't checked yet to see if there are unneeded data copies in YV12, but maybe you can get those if you can find them. It certainly seems YV12 should be able to copy data fewer times since (IIRC for MPEG2DEC2) for YUY2 there was first a pass to planar 4:2:2 and then a conversion to YUY2. It seems at least one and maybe both of those should be unneeded.
- Tom
Last edited by trbarry; 13th May 2003 at 14:24. |
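Tom's first suggestion, calling the iDCT through a pointer chosen once instead of testing CPU features on every block, can be sketched as below. All names here (`IdctFn`, `CpuFlags`, `select_idct`, `add_block`, the two stand-in iDCTs) are hypothetical and only illustrate the dispatch pattern; they are not the Xvid, DctFilter or MPEG2Dec3 code.

```cpp
#include <cassert>
#include <cstdint>

// Signature shared by all iDCT variants in this sketch (hypothetical).
using IdctFn = void (*)(std::int16_t* block);

// Stand-ins for real implementations; each just tags the block so the
// chosen path is observable.
void idct_reference(std::int16_t* block) { block[0] = 1; }
void idct_ssemmx(std::int16_t* block)    { block[0] = 2; }

// Hypothetical CPU feature flags; a real build would fill these from CPUID.
struct CpuFlags {
    bool ssemmx = false;
};

// Pick the implementation once, e.g. when the decoder is opened.
IdctFn select_idct(const CpuFlags& cpu) {
    return cpu.ssemmx ? idct_ssemmx : idct_reference;
}

// Per-block hot path: one indirect call, no feature tests.
void add_block(IdctFn idct, std::int16_t* block) {
    assert(idct != nullptr);
    idct(block);
}
```

The design choice is simply to move the branching out of the per-block path: `select_idct` runs once, and `add_block` only ever sees the already-selected function.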
|
13th May 2003, 14:42 | #25 | Link |
Moderator
Join Date: Oct 2001
Location: England
Posts: 3,285
|
Trying to get my head around what's safe and what's not to store in the GOPBuffer is tricky. I'm tempted to rewrite that whole bit.
I'll do the iDCT pointer stuff, as that's a good idea indeed. I had a look at the dequant functions ages ago. If you find any speedups or improvements, let me know. Anything you can give to it would be very appreciated, Tom. Please post here if you come up with any improvements.
Cheers,
-Nic |
14th May 2003, 13:51 | #27 | Link | |
Registered User
Join Date: Oct 2001
Location: Gainesville FL USA
Posts: 2,092
|
Quote:
Is 1.04 the best source to start from now? - Tom |
|
14th May 2003, 21:43 | #29 | Link |
Registered User
Join Date: Oct 2001
Location: Gainesville FL USA
Posts: 2,092
|
Coding now.
But I notice that running in the debugger I will get messages about bad heap free's etc. from Virtualdubmod when I exit or try to do a "Save & Refresh". This does not happen with previous versions. There may be a problem with new storage management. - Tom |
15th May 2003, 16:27 | #31 | Link |
Moderator
Join Date: Oct 2001
Location: England
Posts: 3,285
|
"I will get messages about bad heap free's etc. from Virtualdubmod when I exit or try to do a "Save & Refresh". This does not happen with previous versions."
Are you referring to previous versions of MPEG2Dec3 or VDubMod?
"My first simple attempts to optimize the dequant stuff made it about 2% slower".
No luck. I'm sure you'll be able to do something to speed it up though.
Cheers,
-Nic |
15th May 2003, 16:40 | #32 | Link |
Retired AviSynth Dev ;)
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
|
I also get these errors when debugging through VirtualDubMod, probably for about a month or two now. Quite annoying actually, but it made me wonder: how can we even get these unless there is some code somewhere which is compiled in debug mode?
AFAIK these checks are only present in Debug mode, or am I mistaken?
__________________
Regards, sh0dan // VoxPod |
15th May 2003, 19:13 | #33 | Link | |
Registered User
Join Date: Oct 2001
Location: Gainesville FL USA
Posts: 2,092
|
V 1.0.5 temp version for testing only
Quote:
It happens if I compile your 1.0.4 (or my new one) for debug.
Anyway, I did a little more optimization, based upon your changes. I only seem to be able to squeeze out another 1-2% improvement from it, but added to your recent changes that might add up to about 5-6%. And I'm using VS6 without the Intel compiler, so maybe if you compiled and hosted it we might get a tad more. And I'm sure there's still more to do somewhere.
I temporarily put out the source and dll for you or anyone to test at
edit: Removed link to buggy test version.
I changed some assembler code in the GetPic.cpp functions Add_Block(), Decode_MPEG2_Intra_Block(), and Decode_MPEG2_Non_Intra_Block(). The changes will only help machines with SSEMMX. This would include all P3's, P4's, Athlons, Durons, and Celerons > about 550 MHz. Older machines won't notice the difference.
I kind of eyed the SSE2 stuff again but decided it maybe wasn't worthwhile playing with again right now.
- Tom
Last edited by trbarry; 20th May 2003 at 18:16. |
|
15th May 2003, 20:23 | #35 | Link |
Registered User
Join Date: Oct 2001
Location: Gainesville FL USA
Posts: 2,092
|
My version of VS6 is not the one with profiling. I had a trial version of Intel Vtune a year ago when I was fooling with this stuff in DVD2AVI but that has long since expired.
Is there a good free way to profile stuff? - Tom |
15th May 2003, 20:32 | #36 | Link |
Retired AviSynth Dev ;)
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
|
AMD CodeAnalyst is a great tool IMO: it enables function-wise profiling and pipeline analysis (with detailed state/stall information). I don't know if it requires an AMD CPU though.
I'm downloading it now (their server is dog-slow from where I sit).
A very minor thing I noticed while browsing (I know it's nitpicking): two pack instructions don't pair, so you could save a few (two) cycles by doing:
Code:
movq mm0, [ebx+0*16]
movq mm1, [ebx+1*16]
packsswb mm0, [ebx+0*16+8] // pack with SIGNED saturate (unlike old way)
movq mm2, [eax] // get rfp val
movq mm3, [eax+edx] // "
packsswb mm1, [ebx+1*16+8] // pack with SIGNED saturate
In general most routines seem memory-intensive, so either faster RAM or less memory use is probably the only way to get any significant speedups.
__________________
Regards, sh0dan // VoxPod Last edited by sh0dan; 15th May 2003 at 20:43. |
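For anyone not fluent in MMX, here is what each `packsswb` in the snippet above computes, written as a scalar model. This is only an illustration of the instruction's signed-saturation behaviour; `saturate_s8` and `packsswb_model` are hypothetical helper names, and the model says nothing about pairing or timing.

```cpp
#include <cassert>
#include <cstdint>

// Clamp one signed 16-bit value into the signed 8-bit range.
std::int8_t saturate_s8(std::int16_t v) {
    if (v > 127) return 127;
    if (v < -128) return -128;
    return static_cast<std::int8_t>(v);
}

// Scalar model of the MMX form `packsswb mm, mm/m64`:
// the four words of the destination (`lo`) become result bytes 0..3,
// the four words of the source operand (`hi`) become result bytes 4..7.
void packsswb_model(const std::int16_t lo[4], const std::int16_t hi[4],
                    std::int8_t out[8]) {
    assert(lo != nullptr && hi != nullptr && out != nullptr);
    for (int i = 0; i < 4; ++i) {
        out[i] = saturate_s8(lo[i]);
        out[i + 4] = saturate_s8(hi[i]);
    }
}
```

In the asm, `lo` corresponds to the words loaded by `movq` from `[ebx+n*16]` and `hi` to the memory operand at `[ebx+n*16+8]`, so one row of eight 16-bit values ends up as eight saturated bytes.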
15th May 2003, 23:36 | #37 | Link |
Registered User
Join Date: Oct 2001
Location: Gainesville FL USA
Posts: 2,092
|
"But since it is probably quite memory saturated, it probably doesn't matter even a bit. "
Can't hurt. I'll change it. I agree that most of our stuff is memory bound. Nic's probably right that we should next be checking in MPEG2DEC3 to see if there are still any unneeded buffer copies now that we are returning YV12.
My problem with optimizing with VTune was that it got a bit confused by all the inlining used by MPEG2DEC. If I compiled with debug and no inlining, then I could get very clear results, but they no longer matched the usual usage profile, since there are a lot of small routines that, without inlining, spend a good amount of time in linkage.
Let me know if you find out whether the AMD analyzer works only on AMD boxes. I downloaded it over a year ago but then never tried it for some reason (think I forgot about it).
- Tom |
16th May 2003, 11:24 | #39 | Link |
Retired AviSynth Dev ;)
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
|
Some numbers:
Source: SVCD 480x480 (sorry, I currently have no DVD material)
Processor: Athlon 500, non-DDR memory.
AVS2AVI -> XviD "null encoder".
Code:
77.63% mpeg2dec3.dll
11.52% avisynth.dll
Distribution within mpeg2dec3.dll:
Code:
16.13% SSEMMX_IDCT
11.79% CMPEG2Decoder::Copyall
10.53% CMPEG2Decoder::decode_macroblock
4.68% CMPEG2Decoder::motion_compensation
3.84% CMPEG2Decoder::Copyodd
2.66% MC_put_16_mmxext
2.55% CMPEG2Decoder::Show_Bits
__________________
Regards, sh0dan // VoxPod |
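To put those figures on one scale, a per-function share can be converted into a share of the whole run. This sketch assumes the second table is relative to mpeg2dec3.dll's own time; the post does not say so explicitly, and if the profiler already reports percentages of the total run, no conversion is needed. `share_of_total` is a hypothetical helper.

```cpp
#include <cassert>

// Convert a share measured within one module into a share of the whole run.
// Both inputs and the result are percentages.
double share_of_total(double module_pct, double within_module_pct) {
    assert(module_pct >= 0.0 && module_pct <= 100.0);
    assert(within_module_pct >= 0.0 && within_module_pct <= 100.0);
    return module_pct * within_module_pct / 100.0;
}
```

Under that assumption, `share_of_total(77.63, 16.13)` puts SSEMMX_IDCT at roughly an eighth of the total run, and `share_of_total(77.63, 11.79)` puts Copyall at a little over 9%, which is why the buffer copies look like the next thing worth chasing.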
16th May 2003, 12:20 | #40 | Link |
Retired AviSynth Dev ;)
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
|
vfapidec.cpp:
Code:
void CMPEG2Decoder::Copyall(YV12PICT *src, YV12PICT *dst)
{
    AVSenv->BitBlt(dst->y, dst->ypitch, src->y, src->ypitch, src->ypitch, Coded_Picture_Height);
    AVSenv->BitBlt(dst->u, dst->uvpitch, src->u, src->uvpitch, src->uvpitch, Coded_Picture_Height>>1);
    AVSenv->BitBlt(dst->v, dst->uvpitch, src->v, src->uvpitch, src->uvpitch, Coded_Picture_Height>>1);
}
void CMPEG2Decoder::Copyodd(YV12PICT *src, YV12PICT *dst)
{
    AVSenv->BitBlt(dst->y, dst->ypitch*2, src->y, src->ypitch*2, src->ypitch, Coded_Picture_Height>>1);
    AVSenv->BitBlt(dst->u, dst->uvpitch*2, src->u, src->uvpitch*2, src->uvpitch, Coded_Picture_Height>>2);
    AVSenv->BitBlt(dst->v, dst->uvpitch*2, src->v, src->uvpitch*2, src->uvpitch, Coded_Picture_Height>>2);
}
void CMPEG2Decoder::Copyeven(YV12PICT *src, YV12PICT *dst)
{
    AVSenv->BitBlt(dst->y+dst->ypitch, dst->ypitch*2, src->y+src->ypitch, src->ypitch*2, src->ypitch, Coded_Picture_Height>>1);
    AVSenv->BitBlt(dst->u+dst->uvpitch, dst->uvpitch*2, src->u+src->uvpitch, src->uvpitch*2, src->uvpitch, Coded_Picture_Height>>2);
    AVSenv->BitBlt(dst->v+dst->uvpitch, dst->uvpitch*2, src->v+src->uvpitch, src->uvpitch*2, src->uvpitch, Coded_Picture_Height>>2);
}
Code:
PVideoFrame __stdcall MPEG2Source::GetFrame(int n, IScriptEnvironment* env)
{
    m_decoder.AVSenv = env;
    [...]
Code:
class MPEG2DEC_API CMPEG2Decoder
{
    friend class MPEG2Source;
protected:
    IScriptEnvironment* AVSenv;
__________________
Regards, sh0dan // VoxPod |
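The pitch arithmetic in Copyall/Copyodd/Copyeven is easier to follow with a plain C++ model of a pitch-aware copy. This is a sketch, not AviSynth's BitBlt implementation: `copy_plane` and `copy_field` are hypothetical helpers that only mirror the argument order used in the calls above (destination, destination pitch, source, source pitch, row size, height).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy `height` rows of `row_size` bytes between buffers whose rows are
// `dst_pitch` / `src_pitch` bytes apart.
void copy_plane(std::uint8_t* dst, int dst_pitch,
                const std::uint8_t* src, int src_pitch,
                int row_size, int height) {
    assert(row_size >= 0 && row_size <= dst_pitch && row_size <= src_pitch);
    for (int y = 0; y < height; ++y) {
        std::memcpy(dst + y * dst_pitch, src + y * src_pitch,
                    static_cast<std::size_t>(row_size));
    }
}

// Copy one field: start at line `first_line` (0 or 1), double both pitches
// so every other line is skipped, and halve the number of rows.
void copy_field(std::uint8_t* dst, int dst_pitch,
                const std::uint8_t* src, int src_pitch,
                int row_size, int height, int first_line) {
    assert(first_line == 0 || first_line == 1);
    copy_plane(dst + first_line * dst_pitch, dst_pitch * 2,
               src + first_line * src_pitch, src_pitch * 2,
               row_size, height / 2);
}
```

With `first_line = 0` this follows the same pattern as the Copyodd calls (no offset, doubled pitch, half height); with `first_line = 1` it follows Copyeven (one-line offset on both pointers). Copying both fields touches every line exactly once, the same result as one full Copyall.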