22nd February 2010, 21:09   #18
JoshyD
Registered User
 
Join Date: Feb 2010
Posts: 84
Quote:
Originally Posted by Fizick
Strange personality discussion.... I am simply waiting the source code and project files of MVTools2_x64 (GPL).
Are 32bit and 64 bit asm code generalized?

The source is finally up, sorry for the delay.

Unfortunately, they're not generalized as a whole. I handled the inline assembler with plain #ifdef's, then grabbed the latest versions of the x264 asm functions that the project uses, which are generalized. The actual asm for the functions in files like bilinear.asm just got overwritten. It's trivial to swap the file for the original and recompile. It's just not exactly elegant.
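To give a rough idea of the #ifdef approach, here is a minimal sketch in C. The macro BUILD_X64 and the function stack_slot_bytes are names made up for this example, not anything from the MVTools source; only the compiler-defined symbols _M_X64 and __x86_64__ are real.

```c
#include <assert.h>

/* BUILD_X64 is a hypothetical name; _M_X64 (MSVC) and __x86_64__
   (GCC/Clang) are the real predefined symbols for 64-bit x86 builds. */
#if defined(_M_X64) || defined(__x86_64__)
#define BUILD_X64 1
#else
#define BUILD_X64 0
#endif

/* Bytes moved by one push/pop on the selected x86 code path. */
static int stack_slot_bytes(void)
{
#if BUILD_X64
    return 8;   /* 64-bit path: rsp-relative, 8-byte slots */
#else
    return 4;   /* 32-bit path: esp-relative, 4-byte slots */
#endif
}
```

The same switch can wrap whole inline asm blocks, which is why it works but gets ugly fast once there are many functions.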

The functions should have been generalized, but I was working fast, without putting a ton of thought into the process. I've been learning as I go, and it seems I always find an example of a sleeker solution after working through the first ugly one that popped into my head.

The main differences are in function calling and register usage. You can get by with a lot less pushing and popping in 64-bit land. Stack allocation between function calls changes as well. As a rule, all arguments are aligned on 8-byte boundaries. Arguments that are passed to the functions via registers still get shadow space on the stack, so at function entry your 5th integer argument will be at [rsp+40], as long as you haven't pushed any registers on the stack yet.
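The [rsp+40] figure is just arithmetic, sketched below in C. It assumes the Microsoft x64 calling convention, that rsp is only moved by pushes (no sub rsp), and that every slot is 8 bytes; win64_arg_offset is a made-up helper name.

```c
#include <assert.h>
#include <stddef.h>

enum {
    SLOT_BYTES        = 8,  /* every argument slot is 8 bytes wide */
    RETURN_ADDR_BYTES = 8   /* return address pushed by the call   */
};

/* Offset from rsp of the n-th argument (1-based) after the callee has
   pushed `pushed` registers. Arguments 1-4 arrive in registers, but
   their shadow slots still occupy the first 32 bytes above the
   return address. */
static size_t win64_arg_offset(unsigned n, unsigned pushed)
{
    assert(n >= 1);
    return RETURN_ADDR_BYTES
         + (size_t)(n - 1) * SLOT_BYTES
         + (size_t)pushed * SLOT_BYTES;
}
```

So win64_arg_offset(5, 0) gives 40, and each register pushed in the prologue moves every argument another 8 bytes further from rsp.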

I wanted to ask some specific questions about the filtering and code copying functions. Was there a reason that they're often limited to the mmx registers? Things like the horizontal Wiener filter are bothering me, because depending on your byte window you're going to get different results from it, or so it would seem. Right now, it has a 4-byte "window" to filter around. My understanding of a Wiener filter is that it's adaptive, so changing its discrete window would change the filter's output altogether.

It's possible to look at 8-byte chunks using the XMM registers (unpacking to words for the arithmetic, 128 bits total, then repacking), but I'm unsure of the effect on overall image quality. Thoughts?
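To make the 8-byte idea concrete, here is a rough sketch in C with SSE2 intrinsics instead of raw asm. The (1,2,1)/4 smoothing kernel and the name smooth8 are made up for illustration; this is not the Wiener code from MVTools, just the unpack, add, repack pattern. The scalar branch is only there so the sketch builds without SSE2.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Hypothetical 3-tap (1,2,1)/4 horizontal smoother over 8 pixels.
   src[-1] and src[8] must be readable. */
static void smooth8(uint8_t *dst, const uint8_t *src)
{
    assert(dst && src);
#ifdef HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    /* Load 8 bytes each for left, centre, right; widen bytes to words. */
    __m128i l = _mm_unpacklo_epi8(
        _mm_loadl_epi64((const __m128i *)(src - 1)), zero);
    __m128i c = _mm_unpacklo_epi8(
        _mm_loadl_epi64((const __m128i *)src), zero);
    __m128i r = _mm_unpacklo_epi8(
        _mm_loadl_epi64((const __m128i *)(src + 1)), zero);
    /* 16-bit arithmetic: l + 2*c + r + 2, then divide by 4. */
    __m128i sum = _mm_add_epi16(_mm_add_epi16(l, r), _mm_slli_epi16(c, 1));
    sum = _mm_srli_epi16(_mm_add_epi16(sum, _mm_set1_epi16(2)), 2);
    /* Repack words to bytes with unsigned saturation, store low 8. */
    _mm_storel_epi64((__m128i *)dst, _mm_packus_epi16(sum, sum));
#else
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t)((src[i - 1] + 2 * src[i] + src[i + 1] + 2) >> 2);
#endif
}
```

For a fixed kernel like this one, each output pixel depends only on its own neighbours, so processing 8 pixels per pass instead of 4 gives the same bytes. Whether that holds for the real filter depends on whether it actually adapts to the window.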

Finally, a lot of the mmx functions don't take advantage of the fact that we have a ton of XMM registers floating around that can also be used for mmx-style arithmetic. XMM0-XMM5 are all volatile across calls, which could prevent some mmx registers from being shuffled around, etc.

When writing assembler, I'm not sure how the CPU's register files are architected to interact with each other, as in whether there's a penalty associated with transferring a qword from an XMM register to an MMX register, and vice versa.

I'm actually a VLSI (very large scale integration) designer by education, so thinking on the machine level is interesting and thought-provoking. I don't know enough about the design of recent x86 cores to generalize about the performance impact of various code paths. Is there any way, other than running a battery of tests, to analyze the clock cycles it takes for an instruction to retire?

I'm going to search around for the answers, but I thought I'd ask anyhow. Sometimes that's the fastest and most concise way to find the info you're after.

Last edited by JoshyD; 22nd February 2010 at 21:45.