dvd2avi optimization
hlD2002
10th December 2001, 19:31
hi all
got some ideas, maybe someone reads it and add support ;)
dvd2avi:
++ dvd2avi main core (used from ogo's version dvd2avi 1.83preview)
+- killing ac3dec
+- adding azid.dll support
+- adding lame.dll support
+- adding resizing code (already done by OgO, thx dude ;))
+- adding 2pass support (now it's ur turn prot0vision & NiC ;))
i'm working on killing ac3dec completely from it. noticed that vob2mp3 is now a really good project so i'll contact dspguru, maybe he can add a dll or something so that dvd2avi can do the mp3 by itself
also planning on adding some codes from DVDx by px3...
best regards
you-know-who from you-know-where *grin*
prot0vision
13th December 2001, 06:03
I would love to add the 2-pass code, which is pretty easy the way I did it (similar to VirtualDub, set codec options twice (1stpass/2ndpass), then hit Go), just need to get DVD2Avi w/resize (1.82?) source code.
The bilinear/bicubic resize code that I've seen makes my head hurt when I try to pull it out of other programs like Ogo did. Besides, I'm more familiar with C; C++ is weird :)
prot0vision
DmitryR
13th December 2001, 18:08
I can contribute an SSE2 iDCT. I have now fully rewritten the DCT_8_INV_ROW_1_s macro for SSE2 and got 22 instructions instead of 41. If someone sends me the algorithm of DCT_8_INV_COL_4 or any other small part of DVD2AVI I will rewrite it too (this is my main tool and I have a P4). Note that I have nothing to debug it with, so you will probably find some errors... But in the end we can make it nearly twice as fast.
dragonlz
13th December 2001, 20:50
Is there any chance AMD optimizations will be done? Does anyone have experience with 3DNow instructions?
DmitryR
14th December 2001, 10:47
Originally posted by dragonlz
Is there any chance AMD optimizations will be done? Does anyone have experience with 3DNow instructions?
I can get this experience easily, but I see no reason to: modern AMD processors already support SSE, and those who chose the AMD platform have an opportunity to upgrade. And SSE2 gives much more room for optimization with the new 128-bit registers; the code can really be made twice as short.
dragonlz
14th December 2001, 11:36
You are talking about an SSE2 iDCT, not SSE though, and I doubt AMD will put that in their next processor. It's also very unlikely that people will go for a P4 just because of the various SSE2 optimizations in programs ;)
I guess 3DNow2 optimization would be harder to do. However, if I knew how to write chip-specific instructions I'd try it myself. Whenever people talk about optimizations it seems like Intel processors always get the benefit first (e.g. Divx4.11). That kinda leaves all of us AMDers out in the cold.
Anyways, I guess it would be nice to have an AMD-optimized iDCT, but it's not the most important thing in the world :)
DmitryR
14th December 2001, 13:53
Originally posted by dragonlz
You are talking about an SSE2 iDCT, not SSE though, and I doubt AMD will put that in their next processor. It's also very unlikely that people will go for a P4 just because of the various SSE2 optimizations in programs ;)
I guess 3DNow2 optimization would be harder to do.
I see no reason to write 3dnow2 code, for I am sure that on a modern AMD, SSE and 3dnow2 will run at the same speed. Please correct me, someone, if I am wrong.
And the only thing that makes 3dnow2 programming harder is that the AMD web site is organised much worse than Intel's. I dug into SSE2 in one day with Intel's well-written docs, but I still cannot find something suitable to read about 3dnow2.
hlD2002
14th December 2001, 14:13
ho,
cool... dvd2avi needs optimizations ;), but OgO will release a new version of his hack next week... i think with source... better wait for this version...
i contacted dspguru for vob2mp3... haven't got any response yet... hopefully he'll do it ;=)
best regards
you-know-who from you-know-where
trbarry
14th December 2001, 15:08
I can contribute an SSE2 iDCT. I have now fully rewritten the DCT_8_INV_ROW_1_s macro for SSE2 and got 22 instructions instead of 41. If someone sends me the algorithm of DCT_8_INV_COL_4 or any other small part of DVD2AVI I will rewrite it too (this is my main tool and I have a P4). Note that I have nothing to debug it with, so you will probably find some errors... But in the end we can make it nearly twice as fast.
DmitryR -
I'm just building my P4 system today. I'd love to get the P4 optimized version. How can I help?
I think I'm a pretty decent assembler programmer but I confess that I don't understand the IDCT code.
A preliminary version of the 1.83 source was posted in this thread (http://rilanparty.com/vbb/showthread.php?s=&threadid=10619&pagenumber=1) but I don't know if it contains the routine you wanted.
What is the problem with testing? When I considered doing this I thought I'd just execute both the old and new and compare the results for awhile. It's integer so you should be able to match it exactly, yes? I'd hoped I could convert the asm instructions without really understanding why the algorithm works.
- Tom
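The "run both and compare" idea can be sketched roughly like this in C. It is only a harness under assumptions: the `idct_fn` signature and the three toy `scale_*` routines are made up for illustration and are not DVD2AVI code; a real test would pass the existing MMX routine and the new SSE2 one instead. Test inputs are drawn from [-2048, 2047], the range MPEG-2 saturates coefficients to.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed signature: an in-place integer transform on one 8x8 block. */
typedef void (*idct_fn)(int16_t block[64]);

/* Feed the same random blocks to both routines and count the blocks on
   which any output coefficient differs. 0 means bit-exact agreement. */
int count_idct_mismatches(idct_fn ref, idct_fn candidate,
                          int trials, unsigned seed)
{
    int16_t a[64], b[64];
    int mismatches = 0;

    srand(seed);
    for (int t = 0; t < trials; t++) {
        for (int i = 0; i < 64; i++)
            a[i] = (int16_t)(rand() % 4096 - 2048);
        memcpy(b, a, sizeof a);   /* identical input for both routines */
        ref(a);
        candidate(b);
        if (memcmp(a, b, sizeof a) != 0)
            mismatches++;
    }
    return mismatches;
}

/* Toy stand-ins (hypothetical): two equivalent routines and a broken one. */
void scale_mul(int16_t block[64])
{
    for (int i = 0; i < 64; i++)
        block[i] = (int16_t)(block[i] * 2);
}

void scale_add(int16_t block[64])
{
    for (int i = 0; i < 64; i++)
        block[i] = (int16_t)(block[i] + block[i]);
}

void scale_bad(int16_t block[64])
{
    scale_add(block);
    block[63] ^= 1;   /* deliberate one-bit error in the last coefficient */
}
```

Comparing with `memcmp` catches single-bit differences that a preview window would never show, which is the point of checking integer code this way.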
hlD2002
14th December 2001, 17:48
ho,
stay cool! ogo will release a new version, so wait and optimize that one! i'm working on an audio solution... change that from AC3DEC to AZiD... dspguru answered... so support can be added...
trbarry
14th December 2001, 18:44
stay cool! ogo will release a new version, so wait and optimize that one!
That would be nice, but at least the SSE2 IDCT is pretty much self contained and would be easy to move over later once we had it. I'd personally rather not wait if it was available. And I don't remember any indication Ogo was working on that part.
- Tom
hlD2002
14th December 2001, 18:47
ogo is working on it, believe me :)
so stay cool :)
best regards
you-know-who from you-know-where
DmitryR
14th December 2001, 22:26
I downloaded the 1.83 prerelease and found no changes in the iDCT. The resizing will most likely be the next thing I try to optimize. But I have no time to set up the compiler for it and debug, nor to write the code for CPU type detection. So if anyone wants to do this, let me know and I will send you the source.
hlD2002
14th December 2001, 23:21
ho...
haven't even got the msvc++ right now... will get it next time... but i'm searching for cpu detec. sources... i think someone has written a good routine that we can use ;)
best regards
trbarry
15th December 2001, 00:20
I downloaded the 1.83 prerelease and found no changes in the iDCT. The resizing will most likely be the next thing I try to optimize. But I have no time to set up the compiler for it and debug, nor to write the code for CPU type detection. So if anyone wants to do this, let me know and I will send you the source.
I wrote CPU detect code for DScaler base upon some Intel & AMD samples. I can add it or send it to you if you like.
If you write any SSE2 IDCT code I'd be willing to add it in and test it before passing it on to whoever, if that was convenient. Assuming my source already compiles. I haven't tried that yet.
- Tom
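For readers wondering what such CPU detection looks like, here is a minimal sketch. It is not the DScaler code: it assumes a GCC or Clang toolchain (the `<cpuid.h>` header and `__get_cpuid`) rather than MSVC, and the function names are made up. The bit positions are the standard CPUID leaf 1 EDX feature flags.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */
#endif

/* Return EDX from CPUID leaf 1, or 0 if CPUID is unavailable
   (including when this is compiled for a non-x86 target). */
static unsigned int cpuid1_edx(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return edx;
#endif
    return 0;
}

/* CPUID.1:EDX feature bits: 23 = MMX, 25 = SSE, 26 = SSE2. */
int cpu_has_mmx(void)  { return (int)((cpuid1_edx() >> 23) & 1u); }
int cpu_has_sse(void)  { return (int)((cpuid1_edx() >> 25) & 1u); }
int cpu_has_sse2(void) { return (int)((cpuid1_edx() >> 26) & 1u); }
```

A decoder would call `cpu_has_sse2()` once at startup and pick the IDCT routine from the result. This sketch only reads the CPU flags; it does not check whether the operating system saves the SSE registers, which a complete detection routine would also have to consider.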
VorteX
15th December 2001, 01:36
This is great if we all work together we can make the ultimate backup program :)
hlD2002
15th December 2001, 01:52
lo trbarry,
haven't got ms vc++ and cpu pack2... have to get it but dunno from where...
but, i think the old dvd2avi user interface is really bad... so here's my solution, it isn't perfect but will make working with it much easier... some specific options aren't included...
my question: can you redesign the gui and compile it for guys like me???
still working on some optimizations...
VorteX
15th December 2001, 01:55
sorry picture not working for me
hlD2002
15th December 2001, 02:02
hrmpf... i have zipped it now, maybe working... *grrr*
hlD2002
15th December 2001, 02:12
aahhh forgot to say... dmitry and barry... if you change or add somethin' please make a lil log file like this:
changed lalala.c
- resizing (line xx)
or
added muuuh.c
- interlacing
or something like that... i think a correctly written change_log is a MUST for optimizing later or for other versions of it... maybe we or others can benefit from that...
VorteX
15th December 2001, 02:37
hows this gui look
VorteX
15th December 2001, 02:40
mm try this
hlD2002
15th December 2001, 02:41
please zip the image and add that!
hlD2002
15th December 2001, 02:48
hi,
looks cool, already compiled that??? do you got ms vc++ 6 with the cpu pack 2 ??? please contact me at ein.held@gmx.de , thx.
if you compiled it already, please upload it ;)
okay... dspguru will write a dll of BeSweet for direct conversion for vob -> mp3/mp2 ... so the university is kept ;) (includes azid decoder...)
so if anyone has time next days (i haven't... exams @ school, sorry), just post that you got time and can integrate that dll...
tha revolution has begun :)
VorteX
15th December 2001, 02:51
sorry not compiled just played around with paint shop pro :)
trbarry
15th December 2001, 03:42
Hey, whoa!
As much as the gui looked good, I wasn't volunteering to take over the next release of DVD2AVI.
I'm just an assembler programmer that likes things to go fast. My Gui dev skills, my C++ skills, and my time are all somewhat limited.
But the offer still stands if someone happens to have SSE2 IDCT code just lying around and they needed it selectively added and compiled and tested a bit. Assuming Ogo hasn't already done it. ;)
- Tom
Farok
15th December 2001, 07:57
I should be able to add the 3DNow IDCT which comes from FlaskMpeg 3DNow.
It should be really faster than the MMX one. (I have already added another 3DNow IDCT for myself, but it's slower than the MMX one, though it has higher quality, like miha's.)
I should be ready by the end of next week (hehe :)
hlD2002
15th December 2001, 14:22
lo...
thought around a lil' bit when sleeping... (yes, i'm mad... dreaming of pc's is REALLY bad...). an automatic bitrate calculator is the next thing i'll integrate... can someone gimme good sources for calcs??? thx :)
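The usual arithmetic behind such a calculator is: turn the target file size into bits, subtract what the audio track will use, and divide the rest by the running time. A rough sketch, assuming sizes in MiB (2^20 bytes), bitrates in kbit/s (1000 bits/s), and no allowance for container overhead; the function name is made up:

```c
/* Video bitrate (kbit/s) that fits a movie of duration_s seconds into
   target_mib MiB alongside an audio track of audio_kbps kbit/s.
   Returns 0 if the inputs leave no room for video. */
double video_bitrate_kbps(double target_mib, double duration_s,
                          double audio_kbps)
{
    if (duration_s <= 0.0)
        return 0.0;

    /* total budget in kilobits: MiB -> bytes -> bits -> kbit */
    double total_kbits = target_mib * 1024.0 * 1024.0 * 8.0 / 1000.0;
    double video_kbits = total_kbits - audio_kbps * duration_s;

    if (video_kbits <= 0.0)
        return 0.0;
    return video_kbits / duration_s;
}
```

For a 2-hour movie (7200 s) on one 700 MiB CD with 128 kbit/s audio this gives about 687.6 kbit/s for the video. A real calculator would also subtract AVI overhead, which this sketch ignores.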
DmitryR
15th December 2001, 15:04
Originally posted by hlD2002
thought around a lil' bit when sleeping... (yes, i'm mad... dreaming of pc's is REALLY bad...)
Not so bad sometimes ... Last night I saw a way to reduce the SSE2 iDCT to 21 instructions :) So here it is. The changelog is very simple: added everything below the 'Code for Pentium IV' line, did not change anything else.
hlD2002
15th December 2001, 15:24
hi,
*lol* hmmm, that's good... i hope it gives a lil' speed up... but since i'm on an athlon (c, 1.4ghz) there isn't a hope for me... so, this message goes to anyone... please compile and upload the zip... trbarry... i think a new gui is very important... times have changed, dvd2avi should too... only an idea...
hlD2002
15th December 2001, 15:30
good, please tell me if you make changes and where... i'll do the changelog, here's the first version:
dvd2avi changelog:
feature: Code for Pentium IV
by: DmitryR on 15th December 2001 14:04
file: idctmmx.asm
added: 'Code for Pentium IV' below line 742
:)
hlD2002
15th December 2001, 15:32
ahhhr, forgot: vortex... can you compile the gui???
trbarry
15th December 2001, 16:17
Not so bad sometimes ... Last night I saw a way to reduce the SSE2 iDCT to 21 instructions. So here it is. The changelog is very simple: added everything below the 'Code for Pentium IV' line, did not change anything else.
Nifty. Thanks.
My P4 is still only components sitting around the room in their original cardboard boxes but hopefully my trusty screwdriver will work and it will be a real computer able to compile stuff in a couple days. With a little luck I'll try to bring it up and test then.
- Tom
hlD2002
15th December 2001, 16:23
good luck... :)
VorteX
17th December 2001, 03:03
nope i aint a programmer :)
hlD2002
18th December 2001, 19:33
hi,
again for DmitryR... found something interesting, dunno if you can use it...
/* Copyright Jean-Marc Valin 2001 */
/* This software is released under the version 2.1 of the GNU Lesser General*/
/*Example: for an (out-of-place) 256 transform:
COMPLEX in[256];
COMPLEX out[256];
COMPLEX w[256];
int bits[256];
fft_initCosSinTables(w, bits, 8);
fft(in, out, 8, w, bits);
Note: fft_initCosSinTables only need to be called once for each size
8 is the order of the FFT (2^8=256)
*/
#include <math.h>
typedef struct COMPLEX {
float re;
float im;
} COMPLEX;
void fft_initCosSinTables(COMPLEX *w, int *bits, int M)
{
int i,j;
int tmp;
int size = 1 << M;
for (i=0;i<size;i++)
{
bits[i]=0;
tmp=i;
for (j=0;j<M;j++)
{
bits[i] <<= 1;
bits[i] += tmp&1;
tmp>>=1;
}
}
while (size)
{
int k;
float tmp, p = (2.0 * M_PI) / size;
COMPLEX tmp2;
for (k = 0; k < (size>>1); k++) {
tmp = k * p;
tmp2.re=cos(tmp);
tmp2.im=-sin(tmp);
*w++ = tmp2;
}
size >>= 1;
}
}
inline void bitrev(COMPLEX *in, COMPLEX *out, int *bits, int M)
{
int dummy;
int size = 1 << M;
__asm__ __volatile__ (
"
push %0
push %2
push %3
.loop%=:
push %0
movl (%3), %4
add $4, %3
movl (%3), %0
add $4, %3
movq (%1,%4,8), %%mm0
movq (%1,%0,8), %%mm1
movl (%3), %4
add $4, %3
movl (%3), %0
add $4, %3
movq (%1,%4,8), %%mm2
movq (%1,%0,8), %%mm3
movl (%3), %4
add $4, %3
movl (%3), %0
add $4, %3
movq (%1,%4,8), %%mm4
movq (%1,%0,8), %%mm5
movl (%3), %4
add $4, %3
movl (%3), %0
add $4, %3
movq (%1,%4,8), %%mm6
movq (%1,%0,8), %%mm7
movq %%mm0, (%2)
add $8, %2
movq %%mm1, (%2)
add $8, %2
movq %%mm2, (%2)
add $8, %2
movq %%mm3, (%2)
add $8, %2
movq %%mm4, (%2)
add $8, %2
movq %%mm5, (%2)
add $8, %2
movq %%mm6, (%2)
add $8, %2
movq %%mm7, (%2)
add $8, %2
pop %0
dec %0
jne .loop%=
femms
pop %3
pop %2
pop %0
" : : "q" (size>>3), "r" (in), "r" (out), "r" (bits), "q" (dummy)
: "memory", "st", "st(1)", "st(2)", "st(3)", "st(4)", "st(5)", "st(6)", "st(7)");
}
inline void recurs_fft(COMPLEX *x, int M, COMPLEX *w, int repeat)
{
int dummy1, dummy2, dummy3, dummy4;
int N = 1 << M;
int N2 = N >> 1;
int mask = N-1;
if (M==1)
return;
if (M>11)
{
recurs_fft(x, M-1, w+N2, repeat);
recurs_fft(x+N2, M-1, w+N2, repeat);
} else {
recurs_fft(x, M-1, w+N2, repeat<<1);
}
if (M>2)
{
while (repeat--)
{
__asm__ __volatile__ (
"
.loop%=:
movq (%0), %%mm0
movq 8(%0), %%mm4
pswapd %%mm0, %%mm2
pswapd %%mm4, %%mm6
movq (%1), %%mm1
movq 8(%1), %%mm5
pfmul %%mm1, %%mm0
pfmul %%mm1, %%mm2
pfmul %%mm5, %%mm4
pfmul %%mm5, %%mm6
pfpnacc %%mm2, %%mm0
pfpnacc %%mm6, %%mm4
movq (%2), %%mm3
movq 8(%2), %%mm7
movq %%mm3, %%mm1
movq %%mm7, %%mm5
pfsub %%mm0, %%mm3
pfadd %%mm0, %%mm1
pfsub %%mm4, %%mm7
pfadd %%mm4, %%mm5
movq %%mm1, (%2)
movq %%mm5, 8(%2)
add $16, %1
add $16, %2
movq %%mm3, (%0)
movq %%mm7, 8(%0)
add $16, %0
dec %3
jne .loop%=
"
: "=r" (dummy1), "=r" (dummy2), "=r" (dummy3), "=q" (dummy4) : "0" (x+N2), "1" (w), "2" (x), "3" (N2>>1)
: "memory", "st", "st(1)", "st(2)", "st(3)", "st(4)", "st(5)", "st(6)", "st(7)");
x+=N;
}
} else {
int mask[2]={0x80000000, 0x00000000};
__asm__ __volatile__ (
"
push %0
push %1
movq (%2), %%mm7
.loop%=:
movq (%0), %%mm0
movq 8(%0), %%mm1
movq %%mm0, %%mm4
pfadd %%mm1, %%mm0
movq 16(%0), %%mm2
movq 24(%0), %%mm3
movq %%mm2, %%mm5
pfadd %%mm3, %%mm2
pfsub %%mm1, %%mm4
pfsub %%mm3, %%mm5
movq %%mm0, %%mm1
pfsub %%mm2, %%mm0
pfadd %%mm2, %%mm1
pswapd %%mm5, %%mm5
movq %%mm0, 16(%0)
movq %%mm1, (%0)
pxor %%mm7, %%mm5
movq %%mm4, %%mm0
pfadd %%mm5, %%mm0
pfsub %%mm5, %%mm4
movq %%mm0, 24(%0)
movq %%mm4, 8(%0)
add $32, %0
dec %1
jne .loop%=
pop %1
pop %0
"
: : "r" (x), "q" (repeat), "r" (mask)
: "memory", "st", "st(1)", "st(2)", "st(3)", "st(4)", "st(5)", "st(6)", "st(7)");
}
__asm__ __volatile__ ("femms"
: :
: "memory", "st", "st(1)", "st(2)", "st(3)", "st(4)", "st(5)", "st(6)", "st(7)");
}
void fft(COMPLEX *in, COMPLEX *out, int M, COMPLEX *w, int *bits)
{
bitrev(in, out, bits, M);
recurs_fft(out, M, w, 1);
}
...:) hope it helps...
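Since the bitrev() routine above is 3DNow!/MMX inline asm, a plain C version of the same permutation can be handy for checking it on other machines. This is only a sketch as I read the asm: each index is loaded from the bits table, the element in[index] is fetched and written to out in sequence, i.e. out[i] = in[bits[i]]. The `cpx` type and both function names are made up here to keep the snippet standalone.

```c
typedef struct { float re, im; } cpx;   /* stand-in for COMPLEX above */

/* Build the bit-reversal table for a 2^M point transform:
   bits[i] is i with its low M bits in reverse order. */
void build_bitrev_table(int *bits, int M)
{
    int size = 1 << M;
    for (int i = 0; i < size; i++) {
        int rev = 0, tmp = i;
        for (int j = 0; j < M; j++) {
            rev = (rev << 1) | (tmp & 1);
            tmp >>= 1;
        }
        bits[i] = rev;
    }
}

/* Portable equivalent of the asm copy loop: out[i] = in[bits[i]]. */
void bitrev_portable(const cpx *in, cpx *out, const int *bits, int M)
{
    int size = 1 << M;
    for (int i = 0; i < size; i++)
        out[i] = in[bits[i]];
}
```

The asm version moves eight elements per loop iteration, so it appears to expect a size that is a multiple of 8; the loop here has no such restriction.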
DmitryR
18th December 2001, 20:20
Here is some more SSE2 code. Added prefetches, unrolled some weird asm loops in getpic.c, added an SSE2 IDCT column pass. I hope form_component_prediction will be next: there IS a job to do. I hope that one of you will be able to compile and debug it ;) The SSE2 code is really half the length, and in places 3-4 times shorter.
ChristianHJW
18th December 2001, 22:02
I hope i am able to contribute something to this nasty project :
1. next AMDs will have SSE2. AMD internal code name : 'hammer'
2. Nic already implemented Azid into DVD2AVI1.82 this very weekend. Just send him a private message either here or over at divX.com
trbarry
18th December 2001, 22:07
DmitryR -
Got it. Thanks. I'm still struggling with an orgy of home computer & network rebuilding.
- Tom
Nic
19th December 2001, 10:34
Shhhh Christian! ;-)
I made that version for myself, it just adds automatic Azid & Lame support using the command-line exe's (oh and it has 2pass support)
-Nic
ps
all my respect goes out to mr.Rozhdestvensky, very professional :)
trbarry
27th December 2001, 05:15
DmitryR -
Ok, Xmas over and my machine built, I started working on the SSE2 stuff. But the sse2idct.zip file doesn't appear to have any members in it, at least using the default Win/Me functions.
What's the trick?
- Tom
later edit: Nevermind. It opens properly with WinRAR just not the builtin Win/Me functions.
Nic
27th December 2001, 11:52
Hi Tom,
I've been working on the iDCT of DVD2AVI amongst other things....
I've been meaning to ask Dmitry about his code as it does not compile quite right for me & there are certain anomalies that don't make sense.
Is DmitryR around to comment? Anyway if you want I'll give you a version of DVD2AVI with the SSE2 code in.... but with my Duron I don't know if it works (I'll also compile it using ICL, as that may help your P4 too)
Cheers,
-Nic
ps
A list of weird anomalies (that could be all my own stupid fault):
1) the prefetch0 opcode isn't recognised by VC Compiler (ver 6, SP 5, pro pack 5)
2) in the .c files you look for a IDCT_FLAG but the variable name is iDCT_Flag ??
3) The PREFETCH function in the .asm file requires a parameter (@4 instead of @0)
4) The MASM I got with Pro Pack5 (6.15 I think) didn't like the .asm file. There were two "mov eax, somereference" that it didn't like; simply adding "dword ptr" before the reference allowed it to compile
There were others but cant remember, I adjusted them all so they would work, however of course they compiled ok for Dmitry so could he enlighten me as to what I am doing wrong?
trbarry
27th December 2001, 15:45
Nic, DmitryR -
Thanks.
I think the problem with the prefetch routine is just that it needs lea instructions to load the addresses of the values instead of mov instructions to load them.
I already have Dmitry's IDCT code running successfully without the changes to GetPic or the prefetch calls. I just hard coded the call to the SSE2 IDCT stuff since I know I'm on my brand new P4 :). I'll add the cpu checking for SSE2 later and bring in the other optimized routines one at a time.
But I want to put in testing code that will compare the outputs of the SSE2 & SSE versions. So far all I've done is check that the preview looks the same and that it doesn't crash.
It looks good so far but I haven't checked to see if it speeds things up any.
- Tom
Edit: I also had to ensure the constant table data was 16 byte aligned so it wouldn't crash.
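On the alignment point: SSE2 aligned loads such as movdqa fault when their memory operand is not on a 16-byte boundary, so constant tables read that way have to be aligned explicitly. A small sketch of forcing and checking the alignment, assuming a C11 compiler (`_Alignas`); the table contents and names are placeholders, not the real IDCT constants:

```c
#include <stdint.h>

/* Placeholder table, forced onto a 16-byte boundary so a 128-bit
   aligned load can read it. The values are illustrative only. */
static _Alignas(16) const int16_t example_consts[8] = {
    1, 2, 3, 4, 5, 6, 7, 8
};

/* Accessor so other code (and the check below) can reach the table. */
const int16_t *example_table(void)
{
    return example_consts;
}

/* Nonzero if p sits on a 16-byte boundary. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

An assert on `is_aligned_16()` in a debug build is a cheap way to turn this kind of crash into a clear message.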
Nic
27th December 2001, 15:55
Thanx Tom, do you have an email address I can get you at?
Cheers,
-Nic
trbarry
27th December 2001, 21:19
Nic -
Yes, I'd sent you a PM on that, but it's trbarry@trbarry.com.
But it eventually ends up on MediaOne/RoadRunner/ATT/Comcast/LSMFT?? so it may go south on the 29'th if we are not lucky. :(
I prefer the email but PM me if email seems to get no response after the 29'th.
- Tom
DmitryR
28th December 2001, 08:15
Originally posted by Nic
Is DmitryR around to comment?
[...]
A list of weird anomalies (that could be all my own stupid fault):
1) the prefetch0 opcode isn't recognised by VC Compiler (ver 6, SP 5, pro pack 5)
2) in the .c files you look for a IDCT_FLAG but the variable name is iDCT_Flag ??
3) The PREFETCH function in the .asm file requires a parameter (@4 instead of @0)
4) The MASM I got with Pro Pack5 (6.15 I think) didn't like the .asm file. There were two "mov eax, somereference" that it didn't like; simply adding "dword ptr" before the reference allowed it to compile
Here I am.
1) Sorry, I was blind reading P4 manual. The right command name is prefetcht0.
2) May be :) Just retype it to be right.
3) Yes, @0 seems to be right.
4) I am writing the code in notepad and do not know what assembler will be used to compile the code. For MASM, dword ptr will be the right solution.
And there is one more bug in prefetch: the right code seems to be as follows:
mov eax,[c_pointer_variable]
prefetcht0 [eax]
OR
lea eax,[c_pointer_variable]
prefetcht0 [eax]
Add dword ptr as needed.
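The difference between the two forms, as I read it: the mov version loads the pointer's value and prefetches the data it points to, while the lea version takes the address of the variable itself, which is only what you want when the variable is the data (a table, say). The same distinction in C, as a sketch that assumes GCC or Clang (`__builtin_prefetch`); the function name and the look-ahead distance are made up and untuned:

```c
#include <stddef.h>

/* How many elements ahead to prefetch; an illustrative guess. */
#define PREFETCH_AHEAD 16

/* Sum an array while prefetching ahead of the read position. */
long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Pass the pointer VALUE (the "mov" case): this prefetches
               the array contents. Passing &data would instead prefetch
               the pointer variable itself. Arguments: address,
               0 = read access, 3 = keep in all cache levels. */
            __builtin_prefetch(data + i + PREFETCH_AHEAD, 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

On x86 a read prefetch with locality 3 is typically compiled to prefetcht0, the instruction named above. A prefetch is only a hint, so the result is the same with or without it.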
trbarry
1st January 2002, 23:15
Progress report on SSE2.
Seems to mostly work, with one glitch that still needs to be fixed, and more testing to do. (Dmitry, I'll email you)
But the SSE2 code seems to be able to speed up things anywhere from 2-10% based upon no obvious variables. Should have something more soon.
I'm then also going to try to fit the same code into a version of MPEG2DEC.dll so I can still use Avisynth filters.
Is someone the official keeper of that code?
- Tom
Nic
2nd January 2002, 09:50
ps.
I added Wizard_FL's DeInterlace filter into DVD2AVI, haven't done it well (i.e. it really slows down the encode) & I dont know how good a deinterlacer it is, but it was just a preliminary test, now on to IVTC....(unless trbarry fancies doing it :)
Cheers,
-Nic
trbarry
2nd January 2002, 20:40
Nic -
I've considered adding GreedyHMA functions to DVD2AVI but that would be further out in the future, nothing anyone should wait for.
- Tom
int 21h
11th January 2002, 07:15
Did the SSE2 idct and other optimizations ever get completed? Is there somewhere I can pick up the changed source files to add into mpeg2dec.dll and compile with the Intel C++ compiler?
Nic
11th January 2002, 09:44
I know what you mean Tom, I think time is short for all of us... :)
I might add the TemporalSmoother though over this weekend, as that might be useful :)
Take Care,
-Nic
ps
Would you believe that the Intel Compiler makes it slightly slower on my Duron than with MSVC (which probably makes sense when you think about it :) - I should get MSVC 7 at the end of next month, maybe that will help.....
int 21h
11th January 2002, 12:11
I added the attached code into Mpeg2dec.dll, it compiles fine, but when I try to use the dll in anything (Gordian Knot, Windows Media Player, etc), it crashes. Not sure what the problem is, I've attached the project in case anyone wants to take a look.
trbarry
13th January 2002, 17:40
I added the attached code into Mpeg2dec.dll, it compiles fine, but when I try to use the dll in anything (Gordian Knot, Windows Media Player, etc), it crashes. Not sure what the problem is, I've attached the project in case anyone wants to take a look.
int 21h -
The first version of Dmitry's SSE2 optimizations ran fine except for a possible crash (alignment) in the Add_Block function in GetPic. If you have that version you can probably get most of the SSE2 benefits by temporarily using the old Add_Block code.
Dmitry has sent me some more code which I'll be testing, hopefully today. Sorry I've gotten away from this again for a few days.
- Tom