RedAverage plugin (version 1.4.3 biiiiig bugfix) - Page 2

cretindesalpes · 15th November 2011, 21:05

Quote:

Originally Posted by redfordxx

For pointers and counters, do I have anything else available to use than eax,ebx,ecx,edx,edi,esi?

Short of some wizardry, I don't think so.

Quote:

Also, the compiler always warns me: frame pointer register 'ebx' modified by inline assembly code.
Everything seems to work OK though, so should that bother me?

http://msdn.microsoft.com/en-us/library/k1a8ss06.aspx

Quote:

Some SSE types require eight-byte stack alignment, forcing the compiler to emit dynamic stack-alignment code. To be able to access both the local variables and the function parameters after the alignment, the compiler maintains two frame pointers. If the compiler performs frame pointer omission (FPO), it will use EBP and ESP. If the compiler does not perform FPO, it will use EBX and EBP. To ensure code runs correctly, do not modify EBX in asm code if the function requires dynamic stack alignment as it could modify the frame pointer. Either move the eight-byte aligned types out of the function, or avoid using EBX.

Also: http://forum.doom9.org/showthread.php?t=100374

redfordxx · 16th November 2011, 00:21

Well, sorry, I never learned C or asm anywhere except now for the purpose of Avisynth, so I am a little slow. But after slowly getting the idea, I wonder whether this would be enough to safely use all registers for my purposes, as long as I do not use POP and PUSH?

Code:

int saveesp;
__asm
{	
pushad
mov saveesp, esp
....
mov esp, saveesp
emms
popad
}

EDIT: now I realize ebp seems to be used when I read a variable from memory. So maybe I could use ebp, but every time I reference memory, e.g. mov eax, [variable], the overwritten ebp value will be used for addressing... and I should keep that in mind.

redfordxx · 16th November 2011, 00:38

cretindesalpes, by the way, as you are the 16-bit guy, I would appreciate your opinion on this (or anyone else's):
When I extend the Average plugin to 16 bits, the formula should be

Code:

int(bias+sum(clip_i*mask_i)/65536+0.5)       i=1...n

But I believe doing it like this:

Code:

int(bias+0.5)+sum(int(clip_i*mask_i/65536))       i=1...n

could be easier and faster code, I think. However, there will be some inaccuracy in the lsb. What do you think of it?

IanB · 16th November 2011, 02:47

No, ESP always needs to point at a valid stack except when interrupts are disabled.

No, leaving out the proper rounding causes problems.

Implement as :-

Code:

int K=(bias<<16)+32768;
....
(K + sum(clip_i*mask_i) ) >> 16;       i=1...n
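
To make the difference between the two rounding schemes concrete, here is a minimal C++ sketch. The function names are made up for illustration, and the 64-bit accumulator is an assumption chosen so the sketch cannot overflow; real SIMD code would have to manage the range differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-point weighted average with the rounding constant folded into K.
// clip[i] and mask[i] are 16-bit values, bias is in output units.
// Note: >> on a negative accumulator is an arithmetic shift on common
// compilers, but that is only guaranteed from C++20 on.
int average_rounded(const std::vector<uint16_t>& clip,
                    const std::vector<uint16_t>& mask, int bias)
{
    assert(clip.size() == mask.size());
    int64_t acc = static_cast<int64_t>(bias) * 65536 + 32768;  // K
    for (std::size_t i = 0; i < clip.size(); ++i)
        acc += static_cast<int64_t>(clip[i]) * mask[i];
    return static_cast<int>(acc >> 16);
}

// The "easier" variant: every product is truncated before summing.
int average_truncated(const std::vector<uint16_t>& clip,
                      const std::vector<uint16_t>& mask, int bias)
{
    assert(clip.size() == mask.size());
    int out = bias;  // int(bias + 0.5) for an integer bias
    for (std::size_t i = 0; i < clip.size(); ++i)
        out += static_cast<int>((static_cast<uint32_t>(clip[i]) * mask[i]) >> 16);
    return out;
}
```

With two clips of value 3 and both masks at 32768 (a weight of 0.5 each), `average_rounded` gives 3 while `average_truncated` gives 2, because each truncated term loses its half.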

redfordxx · 16th November 2011, 02:59

Well, I reached one of my benchmarks. One of the functions, called with the following parameters:

Code:

RAverageW(c1,1,c2,-1,bias=128)

This does the same as mt_makediff, and I believe it is a tiny bit faster on xmm. But, of course, you can choose any number of clips and different weights.
There is one thing I am fighting with:
The result is calculated scaled to signed short with scale 32*256, and as soon as the weight is 4, it switches to non-asm code.
I need a variable scale, which I tried with the following, but it does not work:

Code:

if (maxweight<4) {
#define SCALE (256*32)
#define SCALEPOWER (8+5)
} else {
#define SCALE (256/2)
#define SCALEPOWER (8-1)
}

Either this is the wrong approach or I have a bug somewhere. The bug is up to me to find, but please tell me whether this kind of if/else #define is possible.

redfordxx · 16th November 2011, 03:49

Quote:

Originally Posted by IanB

Code:

int K=(bias<<16)+32768;
....
(K + sum(clip_i*mask_i) ) >> 16;       i=1...n

Well, then it would be as slow or even slower.
As far as I know, there are two instructions I can use:
pmaddwd; however, there can still be an error in the last bit. Because this instruction is signed, I have to do

Code:

(K + sum((clip_i>>1)*(mask_i>>1)) ) >> 14;       i=1...n

Or pmuludq, which would be precise but processes a quarter of the data per instruction... and so will be four times slower.

I don't see other options but maybe there are instructions I can't think of.
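
To get a feel for how much precision the halved-operand form loses, here is a small scalar C++ sketch for a single term with zero bias. The function names are invented for illustration, and the rounding constants (32768 for the >>16 form, 8192 for the >>14 form) are assumptions about what K would hold in each case.

```cpp
#include <cassert>
#include <cstdint>

// Exact single term: (32768 + clip * mask) >> 16.
uint32_t term_exact(uint16_t clip, uint16_t mask)
{
    const uint64_t acc = 32768u + static_cast<uint64_t>(clip) * mask;
    return static_cast<uint32_t>(acc >> 16);
}

// Halved operands, as a signed 16-bit multiply would require:
// (8192 + (clip >> 1) * (mask >> 1)) >> 14.
uint32_t term_halved(uint16_t clip, uint16_t mask)
{
    const uint32_t acc = 8192u
        + static_cast<uint32_t>(clip >> 1) * static_cast<uint32_t>(mask >> 1);
    return acc >> 14;
}
```

For clip = mask = 65535 the exact form gives 65534 and the halved form gives 65532, so a single term can already be off by more than one least significant bit.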

IanB · 16th November 2011, 08:20

Quote:

Originally Posted by redfordxx

...
The result is calculated scaled to signed short with scale 32*256, and as soon as the weight is 4, it switches to non-asm code.
I need a variable scale, which I tried with the following, but it does not work:

Code:

if (maxweight<4) {
#define SCALE (256*32)
#define SCALEPOWER (8+5)
} else {
#define SCALE (256/2)
#define SCALEPOWER (8-1)
}

Either this is the wrong approach or I have a bug somewhere. The bug is up to me to find, but please tell me whether this kind of if/else #define is possible.

I take it you actually mean something like this :-

Code:

if (maxweight<4) {
#define SCALE (256*32)
#define SCALEPOWER (8+5)
#include "common_include_code.hpp"
#undef SCALEPOWER
#undef SCALE
} else {
#define SCALE (256/2)
#define SCALEPOWER (8-1)
#include "common_include_code.hpp"
#undef SCALEPOWER
#undef SCALE
}

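You cannot mix program flow with macro substitution. The C preprocessor does all the macro parsing first, then the compiler compiles the resulting source. In your version both pairs of #define lines are processed regardless of the if, so the run-time test has no influence on which SCALE the following code sees. If you ask for an ASM listing with the C code as comments, you can see what you actually compiled.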
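
As an alternative to including the same code twice, C++ templates can give the same effect: the scale stays a compile-time constant inside each instantiation, and an ordinary run-time if picks which instantiation runs. This is only a sketch of that swapped-in technique; `scale_and_round` and `scale_for_weight` are hypothetical names, and the body is a stand-in for the real kernel.

```cpp
#include <cassert>
#include <cstdint>

// ScalePower plays the role of SCALEPOWER and is fixed at compile time
// for each instantiation of the template.
template <int ScalePower>
int scale_and_round(int weighted_sum)
{
    constexpr int kScale = 1 << ScalePower;  // plays the role of SCALE
    return (weighted_sum + kScale / 2) >> ScalePower;
}

// The run-time branch only selects between two precompiled variants.
int scale_for_weight(int maxweight, int weighted_sum)
{
    if (maxweight < 4)
        return scale_and_round<8 + 5>(weighted_sum);
    return scale_and_round<8 - 1>(weighted_sum);
}
```

The compiler generates both variants, just as with the double #include, but the shared code is written once and stays in the same source file.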

Perhaps you need the power of Softwire or similar to generate dynamic assembly on the fly.

IanB · 16th November 2011, 08:34

Quote:

Originally Posted by redfordxx

Quote:

Originally Posted by IanB

Code:

int K=(bias<<16)+32768;
....
(K + sum(clip_i*mask_i) ) >> 16;       i=1...n

Well, then it would be as slow or even slower.
As far as I know, there are two instructions I can use:
pmaddwd; however, there can still be an error in the last bit. Because this instruction is signed, I have to do

Code:

(K + sum((clip_i>>1)*(mask_i>>1)) ) >> 14;       i=1...n

Or pmuludq, which would be precise but processes a quarter of the data per instruction... and so will be four times slower.

I don't see other options but maybe there are instructions I can't think of.

The assumption was from your earlier code, where you zeroed a register then looped about summing the clip_i*mask_i values. The significant hint here was precalculate K and start with K instead of zero.

Really, this guessing in the dark is not helpful. Please post complete code fragments and ask direct questions about that code.

My guess is that your problem is how to do sum(clip_i * mask_i) quickly.

If so, you need code to do DWORD=K, Loop{UWORD=BYTE*BYTE, DWORD+=UWORD}, ScaleAndRound(DWORD)

Am I close with the above pseudo code guess?

jpsdr · 16th November 2011, 09:24

Quote:

Originally Posted by redfordxx

Small question. For pointers and counters, do I have anything else available to use than eax,ebx,ecx,edx,edi,esi?

Under 32 bits, no. Under 64 bits, yes, you have r8 to r15 available.

redfordxx · 16th November 2011, 13:03

Quote:

Originally Posted by IanB

If so you need code to do DWORD=K, Loop{UWORD=BYTE*BYTE, DWORD+=UWORD}, ScaleAndRound(DWORD)

In 16bits I need two unsigned words to multiply. If UWORD means unsigned word and UDWORD means unsigned doubleword:

DWORD=K, Loop{UDWORD=UWORD*UWORD, DWORD+=UDWORD}, ScaleAndRound(UDWORD)

Clip and Mask would be numbers in < 0 ; 65535 > range. So the sum before scaling would be in < 0 ; 256^4 )

However bias can be negative also and has no real boundaries, although the only range which makes sense is ( -65536*n ; 65535 > where n is number of input clips.

I don't have code yet but the first idea with the one bit inaccuracy would be something like

Code:

movd xmm7, dword ptr [bias_scaled]     //here I cannot scale 16 bits up because the range is outside signed word, so I will scale 8 bits up

loop:
...
psrlw   xmm0, 1   //xmm0 contains packed interleaved clip_i and clip_j where j=i+1
psrlw   xmm1, 1   //xmm1 contains packed interleaved mask_i and mask_j (I need these shifts to clear the sign bit because pmaddwd expects signed arguments)
pmaddwd xmm0, xmm1
psrld   xmm0, 6   //extra op to get on the scale of K in xmm7
paddd xmm7, xmm0
...
jnz loop
...
psrad xmm7, 8     //the accumulator holds dwords, so shift dwords
...
pack with unsigned saturation to word

Now I see there are a lot of psr* instructions, which I don't believe are very fast.

redfordxx · 16th November 2011, 13:47

Read EDIT2 first!
Since I have problems with this if {#define}, I am trying another thing, which is to use a variable instead of a constant. But I got really confused, so please correct my error.
These are the definitions:

Code:

int scaleI;
scaleI=13;
#define SCALE 13
__asm{
mov eax, SCALE
mov ebx, scaleI

Now, the following instructions give different results and I don't know why:

Code:

psrad   xmm0, SCALE  //only this one is correct
psrad   xmm0, scaleI
psrad   xmm0, eax
psrad   xmm0, ebx

EDIT: I think the root cause can be found in this asm listing, although what's happening there is beyond my understanding:

Code:

; 75   : 	scalepowerA=13;

	mov	DWORD PTR [edi+16], 13			; 0000000dH

; 76   : 
; 77   : 	__asm
; 78   : 	 {	
; 79   : 	pushad

	pushad

; 82   : 	mov		eax, scalepowerA

	mov	eax, 16					; 00000010H

; 110  : 			  psrad   xmm0, eax 

	psrad	xmm0, eax

; 111  : 			  psrad   xmm1, scalepowerA 

	psrad	xmm1, 16				; 00000010H

I deleted some lines, but nowhere in them was eax or scalepowerA changed or accessed.

EDIT2:
I seem to have figured it out: is it so that inline asm can access only local variables? And when the variable is defined in a *.h file, it is silently replaced with the number 16...

EDIT3:
But again:

Code:

This does not work:
mov	eax, 13
psrad   xmm0, eax
This does:
psrad   xmm0, 13

So, in short:
how can I shift an xmm register by a count held in a variable I create in C++ code? (Hopefully it won't be too slow compared to an immediate value...)

IanB · 17th November 2011, 00:37

Again with the keyhole view. If you want help, post enough that the full context is available to us.

PSRAD has no Reg32 variant. Only Immediate and MMreg versions.

Code:

Opcode Instruction Description
0F E2 /r        PSRAD mm, mm/m64       Shift doublewords in mm right by mm/m64 while shifting in sign bits.
66 0F E2 /r     PSRAD xmm1, xmm2/m128  Shift doubleword in xmm1 right by xmm2 /m128 while shifting in sign bits.
0F 72 /4 ib     PSRAD mm, imm8         Shift doublewords in mm right by imm8 while shifting in sign bits.
66 0F 72 /4 ib  PSRAD xmm1, imm8       Shift doublewords in xmm1 right by imm8 while shifting in sign bits.
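
In other words, a variable shift count has to be moved into an MMX or XMM register first (for example movd xmm7, scaleI followed by psrad xmm0, xmm7). A minimal C++ sketch of the same idea with SSE2 intrinsics follows; `shift_right_var` is a made-up helper name, and unaligned loads and stores are assumed only to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Arithmetic right shift of four 32-bit lanes by a run-time count.
void shift_right_var(const int32_t in[4], int32_t out[4], int count)
{
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    // MOVD: the count lands in the low bits, the rest of the register is zero.
    const __m128i cnt = _mm_cvtsi32_si128(count);
    // PSRAD xmm, xmm: the count is taken from the register, not an immediate.
    const __m128i r = _mm_sra_epi32(v, cnt);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
}
```

The count register can be loaded once outside the loop, so the per-iteration cost should be close to that of the immediate form.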

redfordxx · 17th November 2011, 09:30

Hi, please, how do I change dimensions of the video? I tried:

Code:

PVideoFrame __stdcall RAverageM::GetFrame(int n, IScriptEnvironment *env) {...
vi.width  = dst_width;
vi.height = dst_height;
return WeightedAverageM16(env, vi);
...}


PVideoFrame RAverageM::WeightedAverageM16(IScriptEnvironment* env, VideoInfo vi)
{
	result = env->NewVideoFrame (vi,PitchAlign);

	if (y==3) AveragePlaneM16(PLANAR_Y, env);
	if (u==3) AveragePlaneM16(PLANAR_U, env);
	if (v==3) AveragePlaneM16(PLANAR_V, env);

	return (result);
}

This idea I copied from somewhere, but it causes crashes.

Gavino · 17th November 2011, 10:43

Dimensions should only be changed in the filter's constructor, not in GetFrame(). All frames of a clip are assumed to have the same dimensions.

Youka · 17th November 2011, 18:22

From Filter SDK documentation.

redfordxx · 17th November 2011, 18:26

Thanks guys. The only thing I am not sure about: if I have multiple input clips, how do I check the dimensions of all of them in the constructor? I know how to do it in the GetFrame function...

Gavino · 17th November 2011, 18:30

All input clips are available in the constructor (if you pass them in as arguments), so what's the problem?
The GetFrame() function has no more information available than the constructor has.

SEt · 17th November 2011, 18:54

Writing in assembler means that you know what you are doing, so don't worry about the compiler warning if you are sure your code is correct.
If you really want registers you can use all 8 of them:
1) no problems with 6 you mentioned;
2) after you modify ebp in inline assembler you won't be able to access most of your C/C++ variables by name, so load it last and restore at the end;
3) you can even use esp in extreme cases - just need to save it somewhere and restore at the end, don't worry about interrupts - they'll switch to their own stack.

Also, it's important to understand that not everything has to be put in registers. For example, keeping the counters of outer loops in memory is perfectly fine and won't change your program's speed in any noticeable way.

redfordxx · 17th November 2011, 23:18

Quote:

Originally Posted by Gavino

The GetFrame() function has no more information available than the constructor has.

Just to be clear, so I should call GetFrame in the constructor for every input clip, just to know its size?

cretindesalpes · 17th November 2011, 23:44

No need to call GetFrame:

Code:

PClip p = ...;
const VideoInfo & v = p->GetVideoInfo ();
area = v.width * v.height;
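
Checking that all input clips agree could then look roughly like the sketch below. `VideoInfoStub` and `check_same_size` are stand-ins invented for this example so that it compiles without avisynth.h; in a real filter the width and height would come from GetVideoInfo() on each clip in the constructor, and the error would be reported with env->ThrowError rather than a C++ exception.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Stand-in for the width/height part of the real VideoInfo.
struct VideoInfoStub { int width; int height; };

// Returns the common size, or throws if any clip differs from the first.
VideoInfoStub check_same_size(const std::vector<VideoInfoStub>& infos)
{
    if (infos.empty())
        throw std::invalid_argument("no input clips");
    for (const VideoInfoStub& vi : infos) {
        if (vi.width != infos[0].width || vi.height != infos[0].height)
            throw std::invalid_argument("input clips differ in size");
    }
    return infos[0];
}
```

Because the check runs once at construction time, GetFrame can rely on every frame having the same dimensions.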
