Doom9's Forum
21st May 2003, 15:45 | #61 | Link |
Moderator
Join Date: Oct 2001
Location: England
Posts: 3,285
|
I thought as much... exactly what I'm doing. I have to check the AVSEnv in copyall/etc. because the DLL can also be used standalone.
But I was wondering whether this "if" statement will negate any improvement we get from using BitBlt...? The invocation (like that word) could also be used to take the resize parameters from dvd2avi, but I don't think I'll implement that (yet). -Nic |
21st May 2003, 15:51 | #62 | Link |
Retired AviSynth Dev ;)
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
|
Even in worst case it will cost 20-30 cycles (on P4 - less on K7) - but considering the amount that's being copied this is nothing.
Besides, the processor will be able to branch-predict this 100% after 3 runs, since it never changes.
__________________
Regards, sh0dan // VoxPod |
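A minimal sketch of the check being discussed, assuming a hypothetical `Env` struct that stands in for the AviSynth environment (the real one is a C++ interface; the names `Env`, `copy_rows` and `copy_plane` are made up for illustration). The point is only that the branch is taken once per plane, while the copy itself touches every byte:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the host environment. The six arguments are
   modelled on the BitBlt call discussed in the thread. */
typedef struct Env {
    void (*bit_blt)(unsigned char *dst, int dst_pitch,
                    const unsigned char *src, int src_pitch,
                    int row_size, int height);
} Env;

/* Plain row-by-row copy used when no environment is available
   (the "DLL used standalone" case). */
void copy_rows(unsigned char *dst, int dst_pitch,
               const unsigned char *src, int src_pitch,
               int row_size, int height)
{
    for (int y = 0; y < height; y++) {
        memcpy(dst, src, (size_t)row_size);
        dst += dst_pitch;
        src += src_pitch;
    }
}

/* One highly predictable branch per call, then the whole plane is copied. */
void copy_plane(const Env *env,
                unsigned char *dst, int dst_pitch,
                const unsigned char *src, int src_pitch,
                int row_size, int height)
{
    assert(dst != NULL && src != NULL);
    if (env != NULL && env->bit_blt != NULL)
        env->bit_blt(dst, dst_pitch, src, src_pitch, row_size, height);
    else
        copy_rows(dst, dst_pitch, src, src_pitch, row_size, height);
}
```

Since `env` does not change between frames, the branch goes the same way every time, which is the case sh0dan describes as fully predictable.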
21st May 2003, 19:14 | #63 | Link | |
Registered User
Join Date: Oct 2001
Location: Gainesville FL USA
Posts: 2,092
|
Quote:
Could you set a breakpoint and confirm it is even going through the new _SSE versions of those functions on your Athlon? I'll still take a look at the Add_Block. It's possible I really can't do 8-bit arithmetic there without loss of precision but it's probably just a silly bug. - Tom |
|
21st May 2003, 19:46 | #64 | Link |
Retired AviSynth Dev ;)
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
|
I can see no reason at all why Tom's Add_Block should be slower - everything tells me it should be faster. Loop unrolling will have a larger effect on P4 compared to K7, but still it should be faster.
There are fewer instructions in the loop - no branch mispredicts. It might be connected to the non-linear memory access. Is it possible to do a version that accesses memory more linearly (and doesn't look up eax, followed by eax+edx)? It does however seem unavoidable to me. Perhaps doing: Code:
// make rfp qwords 0, 1
prefetchnta [eax+edx*4]
movq mm2, [eax]             // get rfp val
movq mm3, [eax+edx]         // "
movq mm0, [ebx+0*16]
movq mm1, [ebx+1*16]
packsswb mm0, [ebx+0*16+8]  // pack with SIGNED saturate (unlike old way)
packsswb mm1, [ebx+1*16+8]  // pack with SIGNED saturate
__________________
Regards, sh0dan // VoxPod |
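For readers less used to MMX, here is a scalar model of what the `packsswb` lines in the snippet above compute. It is only a sketch: the function names are invented for illustration, and it models the instruction's arithmetic, not its speed. Four 16-bit words from the destination register and four from the source operand are each saturated to the signed 8-bit range and packed into eight bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturate one signed 16-bit word to the signed 8-bit range. */
int8_t sat_s8(int16_t v)
{
    if (v > 127)  return 127;
    if (v < -128) return -128;
    return (int8_t)v;
}

/* Scalar model of MMX "packsswb dst, src": the four words of dst fill the
   low four result bytes, the four words of src fill the high four. */
void packsswb_model(const int16_t dst_words[4], const int16_t src_words[4],
                    int8_t out[8])
{
    assert(dst_words != NULL && src_words != NULL && out != NULL);
    for (int i = 0; i < 4; i++) {
        out[i]     = sat_s8(dst_words[i]);
        out[i + 4] = sat_s8(src_words[i]);
    }
}
```

Any word outside -128..127 is clipped by the pack, so a coefficient such as 300 comes out as 127. That clipping is where precision can be lost if the values really need more than 8 bits.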
21st May 2003, 21:51 | #65 | Link |
Registered User
Join Date: Oct 2001
Location: Gainesville FL USA
Posts: 2,092
|
Sh0dan -
From what Nic said the Add_Block wasn't the slow part. It was the broken part. And I didn't want to add a separate section there for SSEMMX, hence the lack of prefetch. That part is not in a tight loop anyway. But right now I first have to figure out how to even get the right answer. - Tom |
22nd May 2003, 08:57 | #66 | Link |
Moderator
Join Date: Oct 2001
Location: England
Posts: 3,285
|
Yup, Add_Block was causing the errors for me. They were only slight (you could see random blocks appear). If you need a test clip to reproduce it, I'll put one up along with an accompanying d2v file.
As for my minor progress: the crop stuff is working and tested, and the iDCT is now done using a function pointer. I've added two more iDCTs: one is Skal's from his MPEG-4 project (the fastest I've ever come across; he's given me permission to put it into mpeg2dec) and the other is SimpleiDCT from XviD, which is known to have very high precision (although a tad slower, I thought it might be useful?). I've added Sh0dan's suggestion of using BitBlt, and all the external code (i.e. using MPEG2Dec3.dll without AviSynth) seems to be working fine. (I've written a little command-line example to go with the source, i.e. GetPic d2vfile frame output.bmp, for capturing bitmaps.) The speed is now definitely faster on all machines I've tested, but trbarry's intra/non-intra code is still slowing it down on my Athlon? I can't think why; I'll test more. Hope that all sounds OK. -Nic |
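Selecting the iDCT through a function pointer could look roughly like the sketch below. All names here (`idct_fn`, `idct_fast`, `idct_accurate`, `select_idct`, `decode_block`) are hypothetical, and the two iDCT bodies are placeholders that only tag the block so the dispatch can be observed. The actual mpeg2dec source may be organised differently:

```c
#include <assert.h>
#include <stddef.h>

/* An iDCT routine transforms one 8x8 block of 64 coefficients in place. */
typedef void (*idct_fn)(short *block);

/* Placeholder bodies standing in for a fast and a high-precision iDCT. */
void idct_fast(short *block)     { block[0] = 1; }
void idct_accurate(short *block) { block[0] = 2; }

enum idct_kind { IDCT_FAST, IDCT_ACCURATE };

/* Pick the routine once, e.g. when the decoder is opened. */
idct_fn select_idct(enum idct_kind kind)
{
    switch (kind) {
    case IDCT_ACCURATE: return idct_accurate;
    case IDCT_FAST:
    default:            return idct_fast;
    }
}

/* The per-block code only ever calls through the pointer. */
void decode_block(idct_fn idct, short *block)
{
    assert(idct != NULL && block != NULL);
    idct(block);
}
```

The choice is made once rather than per block, so adding another iDCT only means adding one more case to the selector.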
22nd May 2003, 09:56 | #68 | Link |
Moderator
Join Date: Oct 2001
Location: England
Posts: 3,285
|
Use the 1.04 source (first post of this thread) for now. I'll post the new version when it's ready for release, which will hopefully be later today (I don't have it on me now; hopefully Tom/Sh0dan will fix the new Add_Block in that time).
I've got ICL 7.1 as well (well, I've got the evaluation version, until they send the full license). It doesn't make any real difference, about 1-2 fps faster. It also had a problem with some of the 3DNow assembler, if I remember correctly. Cheers, -Nic |
22nd May 2003, 22:54 | #69 | Link | |
Registered User
Join Date: Oct 2001
Location: Gainesville FL USA
Posts: 2,092
|
Quote:
Did you ever check if the following block of code (in Getpic) is even being executed? Mine's the only place in Getpic that checks that cpu.ssemmx flag and I can't test that myself on an Athlon. - Tom Code:
/* decode blocks */
// separate rtn for ssemmx now - trbarry 5/2003
if (cpu.ssemmx)
{
    for (comp=0; comp<block_count; comp++)
    {
        if (coded_block_pattern & (1<<(block_count-1-comp)))
        {
            if (*macroblock_type & MACROBLOCK_INTRA)
                Decode_MPEG2_Intra_Block_SSE(comp, dc_dct_pred);
            else
                Decode_MPEG2_Non_Intra_Block_SSE(comp);
            if (Fault_Flag)
            {
#ifdef PROFILING
                // stop_decMB_timer();
#endif
                return 0; // trigger: go to next slice
            }
        }
    }
}
else |
|
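The bit test in the loop above can be tried in isolation. This is a small sketch with a made-up helper name, `block_is_coded`. It uses the same expression as the posted code, where the highest bit of the pattern belongs to block 0 and bit 0 to the last block:

```c
#include <assert.h>

/* Returns 1 if block `comp` is flagged in the coded block pattern.
   Bit (block_count-1) corresponds to block 0, bit 0 to the last block. */
int block_is_coded(int coded_block_pattern, int block_count, int comp)
{
    assert(comp >= 0 && comp < block_count);
    return (coded_block_pattern & (1 << (block_count - 1 - comp))) != 0;
}
```

For a 4:2:0 macroblock (`block_count` of 6), a pattern of 0x3C flags blocks 0 to 3, the four luma blocks, and leaves the two chroma blocks uncoded. Only the flagged blocks reach the intra or non-intra decode call.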
23rd May 2003, 08:40 | #70 | Link |
Moderator
Join Date: Oct 2001
Location: England
Posts: 3,285
|
Yup, I did check... It does get called (it would do; the Athlon XP has all extended instructions apart from SSE2).
I honestly don't know why it's slower? But it does appear to be (but only slightly). Did you manage to fix add_block? Cheers, -Nic |
23rd May 2003, 09:35 | #71 | Link |
Guest
Posts: n/a
|
[Off-topic] Does anyone know this compiler?
http://www.codeplay.com/vectorc/bench.html [/Off-topic] |
23rd May 2003, 12:28 | #72 | Link |
Retired
Join Date: Jan 2002
Location: Netherlands
Posts: 1,529
|
I tried Codeplay out a while ago, and this is what I think of it:
- Although the demos look really promising, it is specifically made for vectorizing code (2D/3D modelling, rotations like in the demos), and isn't very spectacular on normal stuff.
- I could hardly find any code that compiled on it at all. Almost everything gave an error of some sort.
- The latest version supports C++, but that support is so limited that you're better off using it only for plain C.
- In my opinion it was very expensive: $100 for a neutered version, $800 for the full version.
- The neutered version isn't capable of Athlon optimizations (although at first they said it would be P4 that wouldn't be supported). And since I have an Athlon I very much didn't like that. |
23rd May 2003, 14:59 | #73 | Link |
Retired AviSynth Dev ;)
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
|
vectorc is primarily aimed at floating-point optimizations. MPEG2DEC doesn't really contain any floating-point code, so I very much doubt it will be of much use here.
However we really cannot know this until it is tested.
__________________
Regards, sh0dan // VoxPod |
23rd May 2003, 16:42 | #75 | Link | |
Registered User
Join Date: Oct 2001
Location: Gainesville FL USA
Posts: 2,092
|
Quote:
I found it but haven't corrected it yet. There are 2 sections of MMX code in Add_Block and the first one cannot be done in 8-bit arithmetic without overflowing. But I can still optimize it a bit. Hopefully I'll get a replacement out today. Should I still base it on v 1.04? - Tom |
|
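Tom's actual Add_Block code is not shown in this thread, so the following is only a generic illustration of why 8 bits can be too few when a residual is added to a prediction. It assumes an unsigned 8-bit prediction and a residual that may lie outside -128..127. The helper names are invented:

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an intermediate result to the 0..255 pixel range. */
uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Squeeze a residual into a signed byte, as an 8-bit path would have to. */
int narrow_s8(int v)
{
    return v < -128 ? -128 : (v > 127 ? 127 : v);
}

/* Reference: add in wide arithmetic, clamp once at the end. */
uint8_t add_wide(int pred, int residual)
{
    assert(pred >= 0 && pred <= 255);
    return clamp_u8(pred + residual);
}

/* 8-bit variant: the residual is clipped before the add, losing range. */
uint8_t add_narrow(int pred, int residual)
{
    assert(pred >= 0 && pred <= 255);
    return clamp_u8(pred + narrow_s8(residual));
}
```

With a prediction of 200 and a residual of -180 the wide version gives 20, while the narrowed version gives 72 because the residual was clipped to -128 first. Small residuals such as 20 give the same result either way, which would be consistent with the errors showing up only in occasional blocks.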
23rd May 2003, 17:46 | #76 | Link |
Retired AviSynth Dev ;)
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
|
Uh - conflicts
IMO you should use 1.05.. Maybe you should just check for SSE2 instead - that way it will only run on P4 boxes for now.
__________________
Regards, sh0dan // VoxPod |
23rd May 2003, 18:10 | #77 | Link |
Moderator
Join Date: Oct 2001
Location: England
Posts: 3,285
|
Use any version you like, trbarry, or just post the fixed code snippets and I'll add them to the version I've made up.
I'll just take your code, fit it into my version, test it and release it. And then that will be that for a little while, I think. Cheers, -Nic
PS: I just quickly put up the latest source at http://nic.dnsalias.com/src.zip in case you want to see what I've done so far.
Last edited by Nic; 23rd May 2003 at 18:21. |
23rd May 2003, 18:38 | #78 | Link |
Registered User
Join Date: Oct 2001
Location: Gainesville FL USA
Posts: 2,092
|
Nic -
Oops. Too late to get the new one first. I just posted a fixed Add_Block function snippet at: www.trbarry.com/Add_Block.txt . And maybe give some thought to Sh0dan's comments about SSE2 only for the performance functions. Though it should run faster on P3s too. Maybe the Athlon has a slower bsr instruction (emulated?). I don't often use that but don't see a fast way around it. - Tom |
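For reference, `bsr` (bit scan reverse) returns the index of the highest set bit of its operand. How the decode routines use it is not shown in this thread, so the sketch below only models the instruction itself in portable C, under the made-up name `bit_scan_reverse`:

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of x86 bsr: index of the most significant set bit.
   The instruction leaves its result undefined for 0, so 0 is rejected here. */
int bit_scan_reverse(uint32_t x)
{
    assert(x != 0);
    int index = 0;
    while (x >>= 1)
        index++;
    return index;
}
```

A loop like this is far slower than the single instruction. It is useful as a reference for checking results, not as a replacement in performance code.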
23rd May 2003, 18:47 | #79 | Link | |
Retired AviSynth Dev ;)
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
|
Quote:
__________________
Regards, sh0dan // VoxPod |
|