16th December 2007, 23:16 | #81 | Link |
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
|
Code:
int foo(int abc) {
  int def;
  __asm {
    mov eax, abc ;; abc is probably [ebp+8]
    mov def, eax ;; def is probably [ebp-20]
  }
  return def;
}
Always look at the ASM listing from the compiler; your __asm code must fit in with the code the compiler generates, i.e. the whole code must be consistent. And the magic word you need to reference a code label is offset
Code:
add ebx, offset jumper
16th December 2007, 23:32 | #82 | Link | |
Registered User
Join Date: Jun 2007
Posts: 95
|
Quote:
A correctly predicted jump or an unconditional jump has a latency of 2 cycles, assuming that the next instruction is in the L1 code cache. If it was mispredicted, the CPU has to flush all decoded instructions and start again, which takes about 42 cycles. The AMD processor (I don't care about Intel) predicts branches as follows: a conditional branch is assumed not taken the first time. The second time (at the same address) it is assumed to go the same way as last time. All further times (at the same address) it is assumed to behave as it did the second time. |
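A rough way to see what those numbers mean for a loop is to model the predictor in C. This is only a sketch of one reading of the description above: it implements the first two rules (not taken on first sight, then "same as last time") and charges the 2 and 42 cycle figures quoted in the post. The names `predictor` and `branch_cycles` are made up for the illustration, and real hardware predictors are more elaborate.

```c
/* Cycle figures as quoted in the post; treat them as rough estimates. */
enum { PREDICTED_COST = 2, MISPREDICT_COST = 42 };

/* Hypothetical one-entry predictor for a single branch address. */
typedef struct {
    int seen;        /* 0 until the branch has executed once */
    int last_taken;  /* outcome of the previous execution */
} predictor;

/* First time: assume not taken. Afterwards: assume same as last time. */
static int predict(const predictor *p)
{
    return p->seen ? p->last_taken : 0;
}

static void update(predictor *p, int taken)
{
    p->seen = 1;
    p->last_taken = taken;
}

/* Estimated cycles spent on one branch for a sequence of outcomes
   (1 = taken, 0 = not taken). Optionally reports the miss count. */
static unsigned branch_cycles(const int *outcomes, int n, unsigned *mispredicts)
{
    predictor p = {0, 0};
    unsigned cycles = 0, miss = 0;
    for (int i = 0; i < n; ++i) {
        if (predict(&p) == outcomes[i]) {
            cycles += PREDICTED_COST;
        } else {
            cycles += MISPREDICT_COST;
            ++miss;
        }
        update(&p, outcomes[i]);
    }
    if (mispredicts)
        *mispredicts = miss;
    return cycles;
}
```

Under this model a loop branch that is taken nine times and then falls through misses twice (on entry and on exit), while a branch that alternates taken/not taken misses every time.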
|
17th December 2007, 20:32 | #84 | Link | |
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
|
Yeah, it almost reads like a book;-)
Of course I am learning from that. I also read one funny thing: the compiler translated my jnz to some other kind of conditional jump;-) Quote:
|
|
17th December 2007, 20:41 | #85 | Link | |
ffdshow/AviSynth wrangler
Join Date: Feb 2003
Location: Austria
Posts: 2,441
|
Quote:
np: Yello - Daily Disco (1980-1985: The New Mix In One Go)
__________________
now playing: [artist] - [track] ([album]) |
|
17th December 2007, 21:02 | #86 | Link |
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
|
@Fizick
I changed the title... d'you like it better? First I borrowed your Bytes2Float routine and changed it into Bytes2Words... Then I completely replaced it with scalar asm and it was a significant speedup... I will add MMX and then post it... maybe it would be useful for you, if it is possible to change it back to float... |
17th December 2007, 23:06 | #87 | Link |
AviSynth plugger
Join Date: Nov 2003
Location: Russia
Posts: 2,183
|
I also have an SSE-optimized bytes-to-float routine in Vaguedenoiser
(code written by Kurosu though). But it takes a very small percentage of the time. When (if) you make a fast MMX DCT 16x16 (16x8, 8x16), I will try to add it to MVTools.
__________________
My Avisynth plugins are now at http://avisynth.org.ru and mirror at http://avisynth.nl/users/fizick I usually do not provide a technical support in private messages. |
18th December 2007, 19:19 | #88 | Link |
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
|
Hello,
here is a new area for me to explore: packed integer division on MMX registers... afaik it is not in the MMX set; is there any common instruction set with division? Or is it only AMD-specific, since I have an AthlonXP? Where should I focus my learning efforts? |
18th December 2007, 19:57 | #89 | Link |
Retired AviSynth Dev ;)
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
|
redfordxx: What is the C-equivalent of what you want to do?
MMX cannot do division, but you can use inverse multiply to achieve a division if your division is constant. For example: Code:
y = x / 5;
== y = x * (256 / 5) / 256
== y = (x * 51) >> 8
__________________
Regards, sh0dan // VoxPod |
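To make the inverse-multiply trick concrete, here is a small C sketch. It is an illustration, not code from the thread: `div5_approx` uses the constant 51 from the post, and `div5_exact` is an added variant that rounds the reciprocal up and keeps more fraction bits (52429 = ceil(2^18 / 5)). Both function names are made up for the example.

```c
#include <stdint.h>

/* Constant from the post: 256/5 = 51.2 truncated to 51.
   Because 51/256 is slightly less than 1/5, the result is one too low
   for some inputs (for example every exact multiple of 5). */
static uint8_t div5_approx(uint8_t x)
{
    return (uint8_t)((x * 51u) >> 8);
}

/* Rounded-up reciprocal with 18 fraction bits: ceil(2^18 / 5) = 52429.
   5 * 52429 = 2^18 + 1, so the error stays below one unit for every
   16-bit x and the result equals x / 5 exactly. The 32-bit product
   cannot overflow: 65535 * 52429 < 2^32. */
static uint16_t div5_exact(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 52429u) >> 18);
}
```

Whether the cheaper 8-bit constant is good enough depends on how much error the surrounding filter tolerates; if the result has to match integer division bit for bit, the reciprocal needs to be rounded up and carry enough fraction bits for the input range.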
18th December 2007, 21:13 | #90 | Link |
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
|
OK, I try it short and fast, coz this hotel internet in Brussels is constantly disconnecting me.
The mul-division is nice, I heard of it but didn't know what is it exactly...will use later...tnx |
18th December 2007, 21:16 | #91 | Link |
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
|
My case:
Code:
a=IDCT(Quantize(DCT(x)))
b=IDCT(Quantize(DCT(y)))
c=a/b
I have
x=[0,65535]
y=[0,255]
a,b scaled to signed DW
c=[0,255] unsigned saturated
So I see two options:
1) do it scalar
2) use another instruction set?
Is there no common packed integer division?
Last edited by redfordxx; 18th December 2007 at 21:18. |
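For reference, option 1 (scalar) could look roughly like the C below. This is a sketch under assumptions: `a` and `b` are the signed 32-bit values described above, the result is clamped to [0,255], and the handling of `b == 0` is a guess, since the post does not say what should happen there. The function name `div_sat_u8` is made up.

```c
#include <stdint.h>

/* Scalar reference: c = a / b, saturated to an unsigned byte. */
static uint8_t div_sat_u8(int32_t a, int32_t b)
{
    /* Assumption: treat division by zero as saturation
       (positive numerator -> 255, otherwise 0). */
    if (b == 0)
        return a > 0 ? 255 : 0;

    /* Widen first so INT32_MIN / -1 cannot overflow. */
    int64_t q = (int64_t)a / b;

    if (q < 0)
        return 0;
    if (q > 255)
        return 255;
    return (uint8_t)q;
}
```

A version like this is also handy as a ground truth when checking any packed or table-based replacement.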
19th December 2007, 22:18 | #95 | Link |
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
|
It seems that maybe all this insert/extract gymnastics takes so much time that it is better to do it in normal scalar asm...
Or maybe I will do this part in a scalar version for compatibility and then I'll do a 3DNowPro version with normal division... however, I am not sure yet whether 3DNowPro can do something as nice as PMADDWD on xmm registers |
19th December 2007, 23:42 | #96 | Link |
Registered User
Join Date: Jan 2002
Location: San Jose, CA
Posts: 216
|
The most efficient way to achieve this is probably to temporarily convert the data to floating point and use single-precision floating-point SSE to perform the final division on 4 values at a time.
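A minimal sketch of that idea with SSE2 intrinsics is shown below, assuming four signed 32-bit numerators and denominators and a result saturated to [0,255] as described earlier in the thread. The function name `div4_sat_u8` is made up, and the sketch makes no claim about speed. Note that integers above 2^24 in magnitude are rounded when converted to float, so for very large inputs the result can differ from true integer division.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (includes the SSE ones) */
#include <stdint.h>

/* c[i] = clamp(a[i] / b[i], 0, 255) for four values at once. */
static void div4_sat_u8(const int32_t a[4], const int32_t b[4], uint8_t c[4])
{
    /* Load the integers and convert them to single-precision floats. */
    __m128 fa = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)a));
    __m128 fb = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)b));

    /* One DIVPS performs all four divisions. */
    __m128 q = _mm_div_ps(fa, fb);

    /* Clamp to [0,255]. MAXPS returns its second operand when the first
       is NaN, so 0/0 becomes 0; x/0 gives +/-infinity, which clamps. */
    q = _mm_max_ps(q, _mm_setzero_ps());
    q = _mm_min_ps(q, _mm_set1_ps(255.0f));

    /* Convert back with truncation toward zero, like C integer division. */
    int32_t tmp[4];
    _mm_storeu_si128((__m128i *)tmp, _mm_cvttps_epi32(q));

    for (int i = 0; i < 4; ++i)
        c[i] = (uint8_t)tmp[i];
}
```

In a real routine the final narrowing would more likely stay in registers (pack instructions) rather than go through a temporary array; the array keeps the sketch easy to follow.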
|
20th December 2007, 02:41 | #97 | Link |
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
|
Look at how many cycles all the division instructions take on all of the processors. In this many cycles you could almost rebuild the pyramids.
Do not do division unless there is no other way around the problem. Post your existing code for this portion and I will see what can be done. |
20th December 2007, 03:09 | #98 | Link |
Registered User
Join Date: Jan 2002
Location: San Jose, CA
Posts: 216
|
Not true. You should be able to issue a DIVPS instruction operating on 4 values with only ~20 cycle latency on a Core 2 Duo.
This should end up faster than 4 scalar lookups and avoids trashing L1. |
20th December 2007, 03:37 | #99 | Link |
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
|
The code I am thinking about would be 4 streams of 2 fast instructions plus a pmulhuw, all up maybe 12 to 16 cycles on a Core2, and it will do almost as well on most other CPUs.
You also have to include the to-and-from float conversion to use DIVPS. |
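IanB's actual code is not shown in this part of the thread, so the following is only a scalar C model of the general "look up a reciprocal, then multiply" idea that the lookup and pmulhuw remarks point at. The assumptions are loud ones: numerators are taken to fit in 16 bits and divisors to be in [1,255], and the table uses a 24-bit scale so the result matches integer division exactly. A packed version built on PMULHUW works on 16-bit words and would need a different scaling. The names `recip24`, `init_recip24` and `div_by_table` are made up.

```c
#include <stdint.h>

/* recip24[b] = ceil(2^24 / b) for b in [1,255]; entry 0 stays 0. */
static uint32_t recip24[256];

static void init_recip24(void)
{
    for (uint32_t b = 1; b < 256; ++b)
        recip24[b] = ((1u << 24) + b - 1) / b;
}

/* a / b without a divide instruction at run time.
   Exact for a in [0,65535] and b in [1,255]: the rounding error of the
   table entry, scaled by a, stays below 2^24 (65535 * 254 < 2^24).
   With b == 0 the table entry is 0, so the result is 0. */
static uint16_t div_by_table(uint16_t a, uint8_t b)
{
    return (uint16_t)(((uint64_t)a * recip24[b]) >> 24);
}
```

The table here is 1 KB, which is the L1 footprint the earlier reply is worried about; whether that or the float round trip costs more is exactly the trade-off being argued in these posts.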