MxN DCTFilter using MMX asm optimization (development stage) - Page 5

IanB · 16th December 2007, 23:16

Code:

int foo(int abc) {
  int def;
  __asm {
    mov eax, abc  ;; abc is probably [ebp+8]
    mov def, eax  ;; def is probably [ebp-20]
  }
  return def;
}

Have you worked out how to get ASM listings from the compiler yet?

Always look at the ASM listing from the compiler; your __asm code must fit in with the code the compiler generates, i.e. the whole code must be consistent.

And the magic word you need to reference a code label is offset

Code:

add     ebx, offset jumper

To find these magic words look in the ASM listing, write some C that will use the concept you need and see how the compiler does it.

gioowe · 16th December 2007, 23:32

Quote:

Originally Posted by redfordxx

but this would be slow, wouldn't it?
IIRC, the AMD appendix gives a latency of 1 for most short jumps... but I assume that is only the evaluation of the condition, and then the jump itself takes some time...

...same as memory reads... they announce latency 2 but they take much longer, as sh0dan mentioned

The condition is calculated before. A jump only checks flags.

A correctly predicted jump or an unconditional jump has a latency of 2, assuming that the next instruction is in the L1 code cache. If it was mispredicted, the CPU has to flush all decoded instructions and start again, which takes about 42 cycles. The AMD processor (I don't care about Intel) predicts branches as follows: the first time, a conditional branch is assumed to be not taken. The second time (at the same address), it is assumed to do the same as last time. All further times (at that address), it is assumed to do the same as the second time.

foxyshadis · 17th December 2007, 19:38

Quote:

Originally Posted by IanB

And the magic word you need to reference a code label is offset

Code:

add     ebx, offset jumper

To find these magic words look in the ASM listing, write some C that will use the concept you need and see how the compiler does it.

Thanks, I was scratching my head over that when I was trying to test and get the code actually working yesterday.

redfordxx · 17th December 2007, 20:32

Quote:

Originally Posted by IanB

Have you worked out how to get ASM listings from the compiler yet?

Yeah, almost reading like a book;-)
Of course I am learning from that.

I also read one funny thing: the compiler translated my jnz to some other kind of conditional jump ;-)

Quote:

Always look at the ASM listing from the compiler; your __asm code must fit in with the code the compiler generates, i.e. the whole code must be consistent.

...having code consistent with the code from the compiler... I will probably leave this topic to the horse for now... he has a bigger head than me ;-)

Leak · 17th December 2007, 20:41

Quote:

Originally Posted by redfordxx

I also read one funny thing: the compiler translated my jnz to some other kind of conditional jump ;-)

Well, jne (not equal) and jnz (not zero) are the same instruction, since for both you just check whether the zero bit in the flags register is set - maybe that's what happened?

np: Yello - Daily Disco (1980-1985: The New Mix In One Go)

redfordxx · 17th December 2007, 21:02

@Fizick

I changed the title...d'you like it better?

First I borrowed your Bytes2Float routine and changed it into Bytes2Words... Then I completely replaced it with scalar asm, and that was a significant speedup... I will add MMX and then post it... maybe it would be useful for you, if it is possible to change it back to float...

Fizick · 17th December 2007, 23:06

I also have an SSE-optimized bytes-to-float routine in Vaguedenoiser

(code written by Kurosu, though)

But it takes a very small percentage of the time.

When (if) you make a fast MMX DCT 16x16 (16x8, 8x16), I will try to add it to MVTools.

redfordxx · 18th December 2007, 19:19

Hello,

here is a new area for me to explore:

Packed integer division on MMX registers... afaik it is not in the MMX set. Is there any common instruction set with division? Or is it only AMD-specific, given that I have an AthlonXP? Where should I focus my learning efforts?

sh0dan · 18th December 2007, 19:57

redfordxx: What is the C-equivalent of what you want to do?

MMX cannot do division, but you can use inverse multiply to achieve a division if your division is constant.

For example:

Code:

y = x / 5;
==
y = x * (256 / 5) / 256
==
y = (x * 51) >> 8

Increase 256 to any power of two to get better precision.
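
A small C sketch of the inverse multiply above, for the divisor 5 only. The function names div5_shift8 and div5_shift16 are made up for illustration, and the 16-bit constant 13108 (65536/5 rounded up) is an assumption of this sketch, not something from the post. With the 8-bit shift the result can come out one too low (x = 5 gives 0), while the 16-bit version matches x / 5 for every 8-bit input.

```c
#include <stdint.h>

/* The 8-bit version from the post: 51 is 256/5 truncated. */
uint32_t div5_shift8(uint32_t x)
{
    return (x * 51u) >> 8;
}

/* Assumption of this sketch: 13108 is 65536/5 rounded up.
   Intended for inputs in [0,255]; larger inputs are not checked here. */
uint32_t div5_shift16(uint32_t x)
{
    return (x * 13108u) >> 16;
}
```

In MMX the multiply and shift for the 16-bit case would typically collapse into one high-word multiply, which is why a shift of 16 is the convenient choice.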

redfordxx · 18th December 2007, 21:13

OK, I'll try to keep it short and fast, coz this hotel internet in Brussels is constantly disconnecting me.
The mul-division is nice, I had heard of it but didn't know what it is exactly... will use it later... tnx

redfordxx · 18th December 2007, 21:16

My case:

Code:

a=IDCT(Quantize(DCT(x)))
b=IDCT(Quantize(DCT(y)))
c=a/b

I have
x=[0,65535]
y=[0,255]
a,b scaled to signed DW
c=[0,255] unsigned saturated

x,y are values from video...so definitely not constant...

So I see two options:
1)do it scalar
2)use other instruction set? There is no common packed integer division?

IanB · 19th December 2007, 04:00

Code:

  unsigned short Reciprocal[65536]; // 65536/i

c=(a*Reciprocal[b])>>16;

Look for the extract word/insert word instructions pextrw/pinsrw, and pmulhw
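
A scalar C sketch of the table idea above, just to show the arithmetic before any MMX is involved. The names init_reciprocal and div_by_table are hypothetical, the operands are taken as unsigned 16-bit values for simplicity, and the handling of b = 0 and b = 1 (clamping the table entry to 65535) is an assumption of this sketch, since 65536/1 does not fit in an unsigned short. Because the reciprocal is truncated, the result can be one below the exact quotient (10/5 gives 1 here).

```c
#include <stdint.h>

/* Reciprocal[i] holds 65536/i, as in the post. */
static uint16_t Reciprocal[65536];

void init_reciprocal(void)
{
    Reciprocal[0] = 65535;  /* assumption: division by zero saturates */
    Reciprocal[1] = 65535;  /* 65536/1 does not fit in 16 bits, so clamp */
    for (uint32_t i = 2; i < 65536; ++i)
        Reciprocal[i] = (uint16_t)(65536u / i);
}

/* c = (a * Reciprocal[b]) >> 16, saturated to [0,255]. */
uint8_t div_by_table(uint16_t a, uint16_t b)
{
    uint32_t c = ((uint32_t)a * Reciprocal[b]) >> 16;
    return (uint8_t)(c > 255 ? 255 : c);
}
```

Rounding the table entries up instead of down, or adding a bias before the shift, are the kind of accuracy tricks that depend on knowing the data.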

redfordxx · 19th December 2007, 08:48

So, iiuc I should prepare some lookuptable for all numbers [0,65535]...

IanB · 19th December 2007, 13:16

In the general case, Yes, a 65K table. But if you know your data you can pull some tricks to increase accuracy or reduce the table size.

redfordxx · 19th December 2007, 22:18

It seems that all this insert/extract gymnastics may take so much time that it is better to do it in normal scalar asm...

Or maybe I will do this part in a scalar version for compatibility, and then I'll do a 3DNowPro version with normal division... however, I am not sure yet whether 3DNowPro can do something as nice as PMADDWD on xmm

Sulik · 19th December 2007, 23:42

The most efficient way to achieve this is probably to temporarily convert the data to floating point and use single-precision floating-point SSE to perform the final division on 4 values at a time.
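
A rough C sketch of that approach using compiler intrinsics instead of hand-written asm. The helper name div4_via_float is made up, and the int/float conversions used here (_mm_cvtepi32_ps, _mm_cvttps_epi32) are SSE2, which is an assumption of this sketch; only the division itself (_mm_div_ps, i.e. DIVPS) is plain SSE. Divisors are assumed to be non-zero, and saturation to [0,255] is left out.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (includes the SSE ones) */
#include <stdint.h>

/* out[i] = a[i] / b[i] for four 32-bit integers, truncated toward zero.
   Assumes every b[i] is non-zero. */
void div4_via_float(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    __m128 fa = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)a));
    __m128 fb = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)b));
    __m128 q  = _mm_div_ps(fa, fb);                         /* DIVPS */
    _mm_storeu_si128((__m128i *)out, _mm_cvttps_epi32(q));  /* back to int */
}
```

The two conversions in and the one conversion out are part of the cost, so they belong in any cycle count compared against a table lookup.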

IanB · 20th December 2007, 02:41

Look at how many cycles all the division instructions take on all of the processors. In this many cycles you could almost rebuild the pyramids.

Do not do division unless there is no other way around the problem.

Post your existing code for this portion and I will see what can be done.

Sulik · 20th December 2007, 03:09

Not true. You should be able to issue a DIVPS instruction operating on 4 values with only ~20 cycles of latency on a Core 2 Duo.
This should end up faster than 4 scalar lookups and avoids trashing L1.

IanB · 20th December 2007, 03:37

The code I am thinking about would be 4 streams of 2 fast instructions plus a pmulhuw, all up maybe 12 to 16 cycles on a Core2, and it will do almost as well on most other CPUs.

You also have to include the to and from float conversion as well to use the DIVPS.

redfordxx · 23rd December 2007, 04:44

OK, back to switch code and jumps:
jmp ecx is already working...
jg ecx does not...
is there any trick?
