Old 22nd September 2005, 21:04   #1  |  Link
mg262
Clouded
 
 
Join Date: Jul 2003
Location: Cambridge, UK
Posts: 1,148
Inline assembly and ebx (and fast function calls)

So, while trying to speed up filters in VC 2003, I get this warning every time I use ebx in inline assembly:

warning C4731: 'UnmaskedRangeTranslateDifference_P4::_match' : frame pointer register 'ebx' modified by inline assembly code

As I understand it, that means that ebx is being used in the same way as ebp, to (approximately) point to the local variables.

Two questions:

1. Why is ebx needed for this as well as ebp?
2. Is there something sensible to do to get round the warning -- i.e. to flag to the compiler that ebx is not needed in a particular instance? (I am aware that you can suppress warnings by number, but this is prevention rather than cure...)

Thanks!
M.

Last edited by mg262; 7th November 2005 at 15:04.
Old 22nd September 2005, 21:40   #2  |  Link
Sulik
Registered User
 
Join Date: Jan 2002
Location: San Jose, CA
Posts: 216
ebx is used by the compiler to keep track of aligned variables (stack aligned to 8 or 16 bytes).
This commonly happens when local variables are declared with __declspec(align) and/or the __m64/__m128 data types.
Old 23rd September 2005, 01:40   #3  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
Sulik has the right answer. The warning is really a serious error, but because of the credo "assembler programmers know what they are doing" it is classed only as a warning. If you push/pop ebx in your asm code the compiler will forgive you, but this sucks. When doing __asm, always ask for an asm compiler listing and check what stupidity the compiler is committing.

As a hacky workaround when I absolutely have to put aligned variables on the stack, I declare a stack-based char array 7, 15 or 63 bytes bigger than I need, then round its address up with a mask and assign the result to a stack-based pointer of the type I need.
Code:
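// Over-allocate by sizeof(__int64)-1 bytes so an aligned block of N elements always fits.
char dummy[(N+1)*sizeof(__int64)-1];
// Round the address up to a multiple of sizeof(__int64); the (int) cast assumes 32-bit pointers.
__int64 *var1 = (__int64*)(((int)dummy+sizeof(__int64)-1)&(-sizeof(__int64)));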
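The same over-allocate-and-mask idea can be sketched in portable C++. This is only an illustration under assumptions: the names align_up_addr and demo_aligned_scratch are hypothetical, std::uintptr_t replaces the (int) cast so the arithmetic also holds with 64-bit pointers, and the 16-byte alignment and 32-byte size are arbitrary example values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Round an address up to the next multiple of `alignment` (must be a power of two).
inline std::uintptr_t align_up_addr(std::uintptr_t addr, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    return (addr + mask) & ~mask;
}

// Hypothetical demo: carve a 16-byte-aligned, 32-byte region out of an
// oversized local buffer without asking the compiler for an aligned local.
inline bool demo_aligned_scratch()
{
    const std::size_t kAlign  = 16;          // desired alignment in bytes
    const std::size_t kNeeded = 32;          // bytes actually wanted
    unsigned char raw[kNeeded + kAlign - 1]; // worst case skips kAlign-1 bytes

    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw);
    const std::size_t skip =
        static_cast<std::size_t>(align_up_addr(start, kAlign) - start);
    unsigned char* aligned = raw + skip;     // first aligned byte inside raw

    std::memset(aligned, 0, kNeeded);        // the whole region fits inside raw

    return reinterpret_cast<std::uintptr_t>(aligned) % kAlign == 0
        && skip + kNeeded <= sizeof(raw);
}
```

The padding of kAlign - 1 bytes is what guarantees the rounded-up address plus the needed size never runs past the end of the buffer.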
Also for users of the older VC6 compiler, it is very easy to confuse it with regard to ebx register tracking and have it screw the entry/exit prologue code. I haven't seen the .net compiler commit the same sin, but I have never seen this bug logged as fixed.

IanB
Old 23rd September 2005, 09:16   #4  |  Link
mg262
Thank you both. I don't follow why aligned variables are kept separately (improve packing?), but it doesn't in itself matter; the problem is more that I need the extra register. I think there's an option like "omit frame pointer" in the compiler... I'm going to look into it and see if it frees up ebp.
Old 23rd September 2005, 11:42   #5  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
When you declare a variable aligned and it is not static, the compiler cannot allocate it at compile time; it is allocated at run time, so the compiler reserves a register (ebx) as a pointer to it.

Solutions: the one IanB has pointed out; use static if you know your variable will be constant in every instance in which your plugin is called; otherwise, pass the parameter in such a way that you can keep it in a register, and then you can forget about its alignment. And finally, don't use the ebx register.

I forgot: there is a trick I don't like but which works. Declare your variable static and aligned but without any value, pass the parameter to your inline code, and fill the variable within your own code with the values you want at run time; that way you are sure you can have several instances of your plugin.
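A rough portable sketch of that last trick follows. It is an assumption-laden illustration, not code from any plugin: broadcast_threshold and its parameter are hypothetical names, and C++11 alignas stands in for __declspec(align(16)), which is what VC 2003 would need. Note that a static buffer shared like this is not safe if two instances write to it at the same time.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: keeps a 16-byte-aligned block in static storage and
// fills it at run time with a per-call value, so no aligned stack local
// (and hence no extra frame register) is needed.
inline const std::int16_t* broadcast_threshold(std::int16_t threshold)
{
    // Static storage has a fixed, aligned address; it is declared without a value.
    alignas(16) static std::int16_t packed[8];

    // Fill the block at run time with the value passed in by the caller.
    for (std::size_t i = 0; i < 8; ++i)
        packed[i] = threshold;

    return packed; // suitable as the memory operand of an aligned 128-bit load
}
```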

I hope this can be useful
ARDA
Old 23rd September 2005, 12:00   #6  |  Link
mg262
Quote:
I forgot: there is a trick I don't like but which works. Declare your variable static and aligned but without any value, pass the parameter to your inline code, and fill the variable within your own code with the values you want at run time; that way you are sure you can have several instances of your plugin.
Unless they are both trying to run simultaneously? One could try and change the static variable while the other was halfway through using it?

Quote:
I hope this can be useful
It is -- in particular, I didn't know about the static behaviour -- thank you!

I guess one alternative is not to try and use any non-aligned variables in the main part of the code... so I can push/pop ebp and use that in place of ebx.

Edit:

looks like it may not always be safe to rely on the omit frame pointer compiler option:

http://groups.google.com/group/micro...1fdaad1d22b5d4

Last edited by mg262; 23rd September 2005 at 12:06.
Old 23rd September 2005, 12:18   #7  |  Link
ARDA
Quote:
Unless they are both trying to run simultaneously? One could try and change the static variable while the other was halfway through using it?
If I am not wrong, avisynth just delivers a frame to the next filter in the chain once it has finished the one it is processing.

Quote:
I guess one alternative is not to try and use any non-aligned variables in the main part of the code... so I can push/pop ebp and use that in place of ebx.
You are right, but that is not always possible, and passing the variable and managing to keep it in a register is still better: you will avoid memory access.
Anyway, a safe way is to declare your inline assembler function as static void __declspec(naked); look at the inline assembler of Tweak by Dividee as an example.

Regards
ARDA
Old 23rd September 2005, 12:29   #8  |  Link
mg262
Quote:
If I am not wrong, avisynth just delivers a frame to the next filter in the chain once it has finished the one it is processing.
Ah, sorry, I thought you were talking about parallel execution (cf. tsp's multithreading filter). I can't actually execute things in parallel, but where there is no overhead I'd rather make it safe for parallel execution.

At the moment, I'm not accessing any variables from inside the main body of the assembly... but in a long inner loop, one xmm register is permanently tied up to hold a constant (at least, constant per class whose member function is being called) and since movdqa is fast I thought I'd try freeing up that register.

Quote:
static void __declspec(naked)
-- I'd just been reading up on that, but I didn't make the connection to the warning (d'0h). I will try that.

Last edited by mg262; 23rd September 2005 at 12:32.
Old 23rd September 2005, 12:50   #9  |  Link
ARDA
Sincerely, I've not read anything about tsp's multithreading filter; I am not up to date.

movdqa reg,memory is fast but movdqa reg,reg is faster.
If you repeatedly access memory inside your loop, remember that an operation on a register is always faster than one on memory, above all on the Pentium 4 architecture. Obviously, as I don't know your algorithm, this is only a general consideration; you have to measure and benchmark to find the best implementation. There is always a trade-off between the registers you need and can keep free and the performance you want, and it is often not easy. In future 64-bit editions we shall have more registers and this problem will probably be overcome.

With static void __declspec(naked) you will be obliged to push/pop registers yourself, and parameters will be accessed through the stack pointer (esp). That will free not only ebx but also ebp.

With more specific reference to your code, maybe a guru developer could help you better than I can.

Once more
Regards
ARDA
Old 23rd September 2005, 13:16   #10  |  Link
mg262
Quote:
Sincerely, I've not read anything about tsp's multithreading filter; I am not up to date.
He was kind enough to list all the modes and their requirements for me here: http://forum.doom9.org/showthread.ph...362#post711362

Quote:
movdqa reg,memory is fast but movdqa reg,reg is faster.
Both one cycle? The former does have a higher latency... but it also happens in a different execution unit. In this case it's a question of using movdqa versus nop -- but the former frees up another register where it may be needed. I am certainly going to benchmark -- I generally start by writing a fairly simple version with timing code and then try lots of changes to see what helps. In fact, I may as well dump the timing class here since someone who comes across this thread may have some use for it:
Code:
class MeasureCycles
{
	int begincycles;
	int endcycles;
	int scale;
	unsigned int cycles();
public:
	MeasureCycles(int _scale=1000): scale(_scale) {begincycles = cycles();}
	void reset()
	{
		begincycles  = cycles ();		
	}
	int mark()
	{
		endcycles = cycles();
		int cyclesused=endcycles-begincycles;
	//	loga	<<cyclesused <<"\t"<< float (cyclesused)/scale<<"\n" ;
	//	loga.flush ();
		begincycles = endcycles;
		return cyclesused;
	}
	operator int() {return mark();}
};

inline unsigned int MeasureCycles::cycles ()
{
	unsigned int result;
	__asm
	{
		xor ebx,ebx
		rdtsc			// time-stamp counter -> edx:eax
		xor ebx,ebx
		mov	[result], eax	// keep only the low 32 bits
	}
	return result;
}
(I'd typically wrap it around the assembly code -- you declare a MeasureCycles object just beforehand and read its value just afterwards. It's obviously not the only method of timing, but I find it corresponds pretty well to perceptual speed.)
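For readers without an x86 MSVC toolchain, the same reset/mark pattern can be sketched with std::chrono::steady_clock in place of rdtsc. This swaps the technique: it measures nanoseconds rather than cycles, and the names MeasureNanos and time_dummy_work are hypothetical, chosen only for this illustration.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical portable counterpart of the timing class: same reset()/mark()
// shape, but based on a monotonic clock instead of the time-stamp counter.
class MeasureNanos
{
    typedef std::chrono::steady_clock clock;
    clock::time_point begin_;
public:
    MeasureNanos() : begin_(clock::now()) {}

    void reset() { begin_ = clock::now(); }

    // Returns nanoseconds since construction, reset() or the previous mark(),
    // then restarts the interval.
    std::int64_t mark()
    {
        const clock::time_point end = clock::now();
        const std::int64_t used =
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin_).count();
        begin_ = end;
        return used;
    }
};

// Usage sketch: declare the timer just before the code under test and read it
// just afterwards. The loop is a stand-in for the real inner loop.
inline std::int64_t time_dummy_work(int iterations)
{
    MeasureNanos timer;
    volatile int sink = 0;            // volatile keeps the loop from being removed
    for (int i = 0; i < iterations; ++i)
        sink = sink + i;
    return timer.mark();
}
```

A monotonic clock is coarser than a cycle counter, so it suits whole-loop timings better than measurements of a handful of instructions.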

Edit: The static void __declspec(naked) does look straightforward, but in this case the function is inline... or at least is flagged with the inline keyword. I wish the inline assembler had the features of a proper macro assembler! (I'm aware that you can use standard macros, but it's very awkward to use...)

Last edited by mg262; 23rd September 2005 at 13:24.
Old 25th September 2005, 03:06   #11  |  Link
IanB
@mg262,

You and ARDA seem to have it mostly in hand. As I said earlier:
Quote:
always ask for an asm (with source comments) compiler listing and check what stupidity the compiler is committing.
For more __declspec(naked) hints have a look at the VirtualDub source; Avery is a wizard at register scraping, he even scrapes ESP sometimes. Also the CRT library sources (you did install them, right?) get up to some interesting mischief with inline naked routines.

Try variants of this (it's untested off the top of my head)
Code:
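__declspec(naked) inline unsigned int MeasureCycles::cycles ()
{
  __asm
  {
    push ebx    // save ebx on the stack; rdtsc overwrites edx, so edx cannot hold it
    xor ebx,ebx
    rdtsc       // result in edx:eax; the low 32 bits are returned in eax
    pop ebx     // restore ebx
    ret         // naked functions get no compiler-generated epilogue
  }
}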
IanB
Old 25th September 2005, 09:54   #12  |  Link
mg262
Ah! I had read (on another forum) that __declspec(naked) inline was an illegitimate combination. Good to know otherwise. Thank you!

Quote:
Avery is a wizard at register scraping, he even scrapes ESP sometimes.
Isn't that a little dangerous when an interrupt occurs?

Edit:

Looking back at that old code again, I just checked one case, and using movdqa to load a register with a __declspec(align(16)) value used ebp rather than ebx. [ebx is being used for something completely different.] That is in a perfectly normal non-inline non-declspec class member function using VC 2003.

Edit 2:

Neither switching on omit frame pointers nor using a __declspec(naked) function (with no particular initialising code) results in the use of esp rather than ebp. I'm going to keep testing. Edit: I can't find a way to stop it from using ebp.

Last edited by mg262; 25th September 2005 at 12:38.
Old 25th September 2005, 13:50   #13  |  Link
IanB
Quote:
Isn't that a little dangerous when an interrupt occurs?
Yes, "assembler programmers know what they are doing"

When using __declspec(naked) you are expected to write ALL the code in asm for the routine.

When referencing stack local variables from __asm, the compiler only seems to know how to reference them via EBP, and this re-enables frame pointer code mode. It's a cruel world.

To make life easy when I want frame pointerless code, I knock up a framework routine with just the C code I need and reference all the variables I am going to need as "var++" in the middle. I compile with omit frame pointer and ask for an asm listing, swipe all the code the compiler generates and paste it into a __declspec(naked) routine and then add the code I need. All the var++ references I laced in generate samples for the stack offsets so I don't need to calculate them.

IanB
Old 25th September 2005, 14:58   #14  |  Link
mg262
Ian, thank you for all the help -- it really is appreciated.

In this case I think I'm not going to try that trick -- doing it now is one thing, but reading it three months later when I want to make a small change to the 200 or so lines of assembly is another. In this case, it would only save a couple of push/pop/movd instructions... and I'm sure I can find latencies to squish those into!

Warnings aside, ebx really doesn't seem to be being used as a frame pointer with aligned variables (none are static, because I want them to be allocated close to each other) -- so I shall just continue as I have been going and use ebx.

Again, Ian, ARDA, thanks for all the guidance.

Edit: From what I have now seen, I am beginning to suspect that ebx is used with aligned variables in functions that are not class member functions. I haven't made any systematic attempt to check that yet.

Last edited by mg262; 25th September 2005 at 20:47.
Old 7th November 2005, 15:12   #15  |  Link
mg262
I have just found something else for speeding up function calls in the vein of __declspec(naked) ... Ian et al, you've probably seen this, but in case it's useful to someone:

Under the MSVC project properties, look under C/C++ and then under Advanced; the first entry is Calling Convention, and it can be switched to __fastcall, which makes all functions default to taking their first two arguments via registers rather than the stack. You can also add the (Microsoft-specific) __fastcall keyword to individual function declarations.

More here:
http://msdn.microsoft.com/library/de...__fastcall.asp
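As a small sketch of the per-function form: the macro FILTER_FASTCALL and the function sum_bytes below are hypothetical names for illustration. The macro expands to __fastcall only for 32-bit x86 MSVC builds, where the first two arguments that fit in 32 bits travel in ecx and edx, and to nothing elsewhere, so the snippet still compiles with other compilers.

```cpp
#include <cstddef>

// Hypothetical wrapper macro: use the Microsoft keyword only where it applies.
#if defined(_MSC_VER) && defined(_M_IX86)
#  define FILTER_FASTCALL __fastcall
#else
#  define FILTER_FASTCALL
#endif

// With __fastcall on 32-bit MSVC, `src` and `count` are passed in registers
// instead of being pushed on the stack.
unsigned int FILTER_FASTCALL sum_bytes(const unsigned char* src, std::size_t count)
{
    unsigned int total = 0;
    for (std::size_t i = 0; i < count; ++i)
        total += src[i];
    return total;
}
```

Marking individual hot functions this way leaves the project-wide default calling convention untouched, which may matter for functions exported to or imported from other binaries.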

Edit:

Also...
Quote:
The compiler cannot generate an inline function for a function marked with the naked attribute, even if the function is also marked with the __forceinline keyword.
from http://msdn.microsoft.com/library/de..._attribute.asp
__________________
a.k.a. Clouded. Come and help by making sure your favourite AVISynth filters and scripts are listed.

Last edited by mg262; 7th November 2005 at 15:17.
Old 7th November 2005, 15:59   #16  |  Link
ARDA
@MG262

Hi mate,
I've been out for more than a month on business travel, but many times I remembered this subject and, as I knew you would search for better solutions, I was curious. I have read your post and links too quickly. Can you summarize your conclusions and, if you have some tests, what you think in general terms? And if you have used any of these solutions, please show us with an example. This could probably be useful for many new and young developers, but I guess not only for them.

Regards ARDA

Last edited by ARDA; 7th November 2005 at 16:17.
Old 7th November 2005, 19:00   #17  |  Link
Richard Berg
developer wannabe
 
 
Join Date: Nov 2001
Location: Brooklyn, NY
Posts: 1,211
FYI -
Quote:
Also for users of the older VC6 compiler, it is very easy to confuse it with regard to ebx register tracking and have it screw the entry/exit prologue code. I haven't seen the .net compiler commit the same sin, but I have never seen this bug logged as fixed.
VC7 bug #204760, fixed 2/6/2001.
Old 8th November 2005, 19:57   #18  |  Link
mg262
ARDA,

I'm not sure I'm the best person to answer this! In any case, I'm afraid I don't have that much to report... I got distracted onto other things and only just came back to this. Part of the difficulty is that getting to the bottom of this requires sitting down and repeatedly analysing the assembly output of the compiler... which is something I don't want to get into doing right now. I can give you some scattered thoughts, for what they are worth:
  • __fastcall is easy to switch on and (assuming the compiler doesn't do anything stupid) should always help a little.
  • the fewer arguments your time-critical functions take the better (because the first two are passed in registers with __fastcall).
  • There is no guaranteed way to get the function call overhead down to zero... so if it really matters, paste the code in.
  • there is a choice between naked and inline; I would tend to go with inline because the compiler needs to deal with inline functions quite often (because of classes) so you would expect it to perform reasonably well.
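  • inline can be ignored by the compiler; the alternative __forceinline is a much stronger request, though the compiler can still decline in some cases (naked functions, for example, as quoted earlier). I would be careful when using this, because there is no guarantee that the compiler will deal with it efficiently.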
  • At the end of the day, ALWAYS, ALWAYS profile. If you really want to squeeze out speed as much as is possible, you may need to take all the time-critical code and write it in assembly without using functions. But on the other hand I would guess that the speed up relative to a saner approach is on the order of 5%...
I suspect that using the Intel compiler would give better results... but as noted above it appears to be deliberately b0rked for non-Intel processors.
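To make the inline versus __forceinline point concrete, here is a minimal sketch. The macro FILTER_FORCEINLINE and the functions clamp_to_byte and clamp_row are hypothetical; the macro maps to __forceinline on MSVC and to the always_inline attribute on GCC-compatible compilers, and falls back to plain inline elsewhere.

```cpp
#include <cstddef>

// Hypothetical portability macro for a strong inlining request.
#if defined(_MSC_VER)
#  define FILTER_FORCEINLINE __forceinline
#elif defined(__GNUC__)
#  define FILTER_FORCEINLINE inline __attribute__((always_inline))
#else
#  define FILTER_FORCEINLINE inline
#endif

// Small helper of the kind that is worth inlining into a per-pixel loop.
FILTER_FORCEINLINE int clamp_to_byte(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

// The call below should compile to straight-line code inside the loop,
// with no call overhead per element.
inline void clamp_row(int* row, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        row[i] = clamp_to_byte(row[i]);
}
```

Whether forcing the inline actually helps is something only a profile of the real filter can show.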

Sorry not to be able to be more helpful...
Old 8th November 2005, 20:38   #19  |  Link
ARDA
@MG262

You needn't apologize at all. The mere fact that you've been searching for options is a good step, even if only to discard them. I will read carefully and maybe make some tests. Don't know when.
I've seen you are developing some filters. Go ahead, and thanks for your report.

Regards Arda