Old 28th December 2007, 05:44   #121  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,171
Okay the equivalent C code helps a lot. Perhaps it should be in your source as comments.

Have you got this
Code:
movntq [ebx+edx], mm0
in the block you are testing? And you have an AMD beastie. Test with a normal movq.

Do not mix cached and non-temporal memory accesses, especially writes, and make absolutely sure that they are 64-bit aligned.

Also as a baseline perhaps you should test the pure C code version.

------------------------

Also a possible hint for your end around code. As long as you have at least 8 bytes total to process just do a single unaligned movq at the end. i.e. offset the movq to match the end of the buffer and process the few overlapping bytes twice.
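The overlapping-tail hint above can be sketched in plain C. This is only an illustration under assumptions: `invert8` and `invert_row` are made-up names, the per-pixel operation (byte inversion) is arbitrary, and `memcpy` of 8 bytes stands in for an unaligned movq. It also assumes separate source and destination buffers, so that processing a few bytes twice gives the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-byte kernel: invert 8 bytes at once.
   The memcpy calls stand in for unaligned movq load/store. */
static void invert8(uint8_t *dst, const uint8_t *src) {
    uint64_t v;
    memcpy(&v, src, 8);
    v = ~v;
    memcpy(dst, &v, 8);
}

/* Invert n bytes from src into dst (distinct buffers), n >= 8. */
static void invert_row(uint8_t *dst, const uint8_t *src, size_t n) {
    size_t x;
    assert(n >= 8);
    /* Main loop: whole 8-byte chunks. */
    for (x = 0; x + 8 <= n; x += 8)
        invert8(dst + x, src + x);
    /* Tail: 1..7 bytes remain, so redo one chunk that ends exactly
       at the end of the buffer; the overlap is processed twice. */
    if (x < n)
        invert8(dst + n - 8, src + n - 8);
}
```

Note that this only works when redoing the overlap is harmless. An in-place operation that is not idempotent (inverting in place, for example) would corrupt the overlapping bytes.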
Old 28th December 2007, 06:11   #122  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
Quote:
Originally Posted by IanB View Post
Have you got this
Code:
movntq [ebx+edx], mm0
in the block you are testing? And you have an AMD beastie. Test with a normal movq.
I had it before... it was a little slower.
Quote:
Also as a baseline perhaps you should test the pure C code version.
Had that before too, and it was slower.


Quote:
Also a possible hint for your end around code. As long as you have at least 8 bytes total to process just do a single unaligned movq at the end. i.e. offset the movq to match the end of the buffer and process the few overlapping bytes twice.
Will try... but I have to be cautious about the end of the row.
Old 28th December 2007, 06:15   #123  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
Quote:
Originally Posted by squid_80 View Post
Looks like sourcep should be a const unsigned char **.
Thanks...ehm but now I can't access the members of the array easily like this anymore
Code:
mov     ecx, [sourcep+4*eax]
where eax walks thru the array.
Old 28th December 2007, 07:10   #124  |  Link
squid_80
Registered User
 
Join Date: Dec 2004
Location: Melbourne, AU
Posts: 1,963
Quote:
Originally Posted by redfordxx View Post
Thanks...ehm but now I can't access the members of the array easily like this anymore
Code:
mov     ecx, [sourcep+4*eax]
where eax walks thru the array.
Why not?
Old 28th December 2007, 12:37   #125  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,171
I meant that you should repeat the knee test with both movq and the C code to get a comparative baseline for the 50% slowdown, i.e. you are looking for the speed ratio to be constant.

I assume you are still trying to find why you have a performance knee.

--------------------

Oh and also about the end code stuff. For planar frames Avisynth guarantees that pitch is at least mod 8 (mod 16 on >= SSE2 machines). Use the PLANAR_Y_ALIGNED form of the RowSize(). Just remember that bytes beyond rowsize up to pitch contain uninitialized junk, you are free to overwrite it with good stuff or trash as you see fit.
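A rough sketch of what the pitch guarantee buys you, in plain C. The function name `fill_plane` and its parameters are made up for illustration, and the aligned row size is computed by hand here rather than taken from the Avisynth API. The point is that the 8-byte stores may run past rowsize into the padding, as long as they stay within pitch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill every row of a plane with `value`, 8 bytes per store.
   Assumes pitch is at least rowsize rounded up to a multiple of 8. */
static void fill_plane(uint8_t *p, int rowsize, int pitch, int height,
                       uint8_t value) {
    int aligned = (rowsize + 7) & ~7;            /* rowsize rounded up to mod 8 */
    uint64_t v = 0x0101010101010101ULL * value;  /* value replicated 8 times */
    assert(aligned <= pitch);
    for (int y = 0; y < height; y++) {
        uint8_t *row = p + (size_t)y * (size_t)pitch;
        /* The last store may spill into the padding between rowsize
           and pitch; that area holds junk anyway. */
        for (int x = 0; x < aligned; x += 8)
            memcpy(row + x, &v, 8);
    }
}
```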
Old 28th December 2007, 12:52   #126  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
Quote:
Originally Posted by IanB View Post
Oh and also about the end code stuff. For planar frames Avisynth guarantees that pitch is at least mod 8 (mod 16 on >= SSE2 machines). Use the PLANAR_Y_ALIGNED form of the RowSize(). Just remember that bytes beyond rowsize up to pitch contain uninitialized junk, you are free to overwrite it with good stuff or trash as you see fit.
But if block width is let's say 4 and rowsize=pitch, then I will write four bytes over...
Old 28th December 2007, 13:00   #127  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
Code:
_declspec(align(64)) const unsigned char* sourcep[64];
__asm{
mov     ecx, [sourcep]      ;=sourcep[0]
mov     ecx, [sourcep+4]    ;=sourcep[1]
}
but
Code:
const unsigned char** sourcep = (const unsigned char**) _aligned_malloc(sizeof(unsigned char*)*length, 64);
__asm{
mov     ecx, [sourcep]      ;=sourcep[0]
mov     ecx, [sourcep+4]    ;=&sourcep+4
}
Old 28th December 2007, 14:06   #128  |  Link
squid_80
Registered User
 
Join Date: Dec 2004
Location: Melbourne, AU
Posts: 1,963
I don't see anything wrong with that. In C code *(sourcep+1) is the same as sourcep[1]. In assembly they both translate to [sourcep+4] (assuming pointers are 4 bytes).
Old 28th December 2007, 14:39   #129  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
In the second case I can't access sourcep[1] directly.
I probably have to load the pointer first and then the value: two memory reads instead of one. Is there a workaround or not?
Old 28th December 2007, 14:42   #130  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
Quote:
Originally Posted by IanB View Post
For special buffers when squeezing every last ounce of performance, you can align the buffer mod 64, this avoids the 2 wasted part cache lines at each end.
Maybe I am missing what exactly you mean by buffer...
Old 28th December 2007, 15:16   #131  |  Link
squid_80
Registered User
 
Join Date: Dec 2004
Location: Melbourne, AU
Posts: 1,963
Quote:
Originally Posted by redfordxx View Post
In the second case I cant access directly sourcep[1].
Sure you can... The same as before, sourcep[1] is [sourcep+4] in assembly.
Old 28th December 2007, 22:03   #132  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,171
@Squid, The 1st case is really
Code:
mov ecx, [ebp+sourcep+0]
mov edx, [ebp+sourcep+4]
the 2nd case is
Code:
mov esi, [ebp+sourcep]
mov ecx, [esi+0]
mov edx, [esi+4]
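The same distinction can be seen from the C side. The sketch below is only an illustration (the names `lies_inside` and `where_is_slot_one` are made up): for a true array the element slots live inside the array object itself, so one load suffices, whereas for a pointer variable the slots live wherever the pointer points, so the pointer has to be loaded first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* 1 if `slot` lies inside the `size`-byte object starting at `obj`. */
static int lies_inside(const void *obj, size_t size, const void *slot) {
    uintptr_t o = (uintptr_t)obj, s = (uintptr_t)slot;
    return s >= o && s < o + size;
}

/* Bit 1 set: arr[1] is stored inside arr itself.
   Bit 0 set: ptr[1] is stored inside the variable ptr. */
static int where_is_slot_one(void) {
    const unsigned char *arr[4] = {0};                    /* true array */
    const unsigned char **ptr = malloc(4 * sizeof *ptr);  /* heap block */
    assert(ptr != NULL);
    int in_array = lies_inside(arr, sizeof arr, &arr[1]);
    int in_pointer = lies_inside(&ptr, sizeof ptr, &ptr[1]);
    free(ptr);
    return in_array * 2 + in_pointer;
}
```

`where_is_slot_one()` returns 2: the array's slot 1 sits inside the array, while the pointer variable holds only an address and its slot 1 sits in the separately allocated block.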
--------------------------
Quote:
Originally Posted by redfordxx
But if block width is let's say 4 and rowsize=pitch, then I will write four bytes over...
Yes you would, but you can adjust the way you process it so that only the last loop has to deal with the problem.
Code:
AWIDTH=rowsize;
if (AWIDTH+8-blockwidth > pitch)
  AWIDTH-=8-blockwidth;
for(y=0;y<height;y+=blockheight) {
  for(x=0;x<AWIDTH;x+=blockwidth)
    for(j=0;j<blockheight;j++)
      for(i=0;i<blockwidth;i++) {
        FAST CODE (overlaps to mod 8)
      }
  for(x=AWIDTH;x<rowsize;x+=blockwidth)
    for(j=0;j<blockheight;j++)
      for(i=0;i<blockwidth;i++) {
        END CODE (No overlaps!)
      }
}
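As a hedged variation on the AWIDTH calculation above, the split point between the fast code and the end code can also be found by checking directly whether an 8-byte write starting at x would still fit inside pitch. The function name `fast_path_limit` is made up for this sketch. For block widths that divide both 8 and rowsize it gives the same AWIDTH as the pseudocode.

```c
#include <assert.h>

/* Returns the first block start x at which an 8-byte-wide write
   would run past pitch (or rowsize if that never happens).
   Blocks starting below the returned x can use the fast code;
   blocks from x up to rowsize need the non-overlapping end code. */
static int fast_path_limit(int rowsize, int pitch, int blockwidth) {
    int x = 0;
    assert(blockwidth > 0 && blockwidth <= 8);
    while (x < rowsize && x + 8 <= pitch)
        x += blockwidth;
    return x;
}
```

For example, with rowsize = pitch = 16 and blockwidth = 4 the limit is 12, so only the last block in each row goes through the end code.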
Quote:
Originally Posted by redfordxx
Maybe I am missing, what exactly you mean by buffer...
buffer :- chunk of memory, work space, something you declare or malloc, where you read and/or write data to.

The point was that in the very special case where you are trying to squeeze every last ounce out of the cache, there is the option to waste some memory and align your buffer to cache lines. This is not normal practice for general-purpose filters because you do not have enough control of the world, i.e. you do not know what size image the user will give you, and you do not know what block size the user will choose.

For normal use being 4 aligned for cpu registers, 8 aligned for MMX registers and 16 aligned for SSE2 registers is adequate.
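For illustration, here is one way to request and check such alignments in portable C11. The helper names `alloc_aligned` and `is_aligned` are made up; `aligned_alloc` is the C11 standard call and is an assumption about the toolchain (MSVC, as used elsewhere in this thread, provides `_aligned_malloc` instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate at least `bytes` bytes aligned to `align` (a power of two).
   The size is rounded up because C11 aligned_alloc expects a multiple
   of the alignment. Release the block with free(). */
static void *alloc_aligned(size_t bytes, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t rounded = (bytes + align - 1) & ~(align - 1);
    return aligned_alloc(align, rounded);
}

/* 1 if p is aligned to `align` (a power of two). */
static int is_aligned(const void *p, size_t align) {
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

A work buffer for the cache-line case would then be `alloc_aligned(size, 64)`, while 8 or 16 would cover the MMX and SSE2 cases mentioned above.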
Old 28th December 2007, 22:10   #133  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,171
Or even more devious
Code:
...
for(y=0;y<height-blockheight;y+=blockheight) {
  for(x=0;x<rowsize;x+=blockwidth)
    for(j=0;j<blockheight;j++)
      for(i=0;i<blockwidth;i++) {
        FAST CODE (overlaps to mod 8)
      }
}
y=height-blockheight;
for(x=0;x<AWIDTH;x+=blockwidth)
  for(j=0;j<blockheight;j++)
    for(i=0;i<blockwidth;i++) {
      FAST CODE (overlaps to mod 8)
    }
for(x=AWIDTH;x<rowsize;x+=blockwidth)
  for(j=0;j<blockheight;j++)
    for(i=0;i<blockwidth;i++) {
      END CODE (No overlaps!)
    }
Old 30th December 2007, 09:15   #134  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
Would it maybe be helpful to copy the first row of blocks from the source into multiple work areas (for faster copying), then process all of them, and then copy all of them to the dst frame?

That is, if that would be more cache-friendly.

If so which way (copying src to workarea):
1) copy block after block (copy all lines of first block, then second...)
2) go thru frame reading one whole line after each other and copy the pixel to different workareas where they belong...

1) seems to me more cacheable for the writing part (in the work area the pixels would be written one after another), but the reads in src are random
2) seems to me more cachable for the reading part (going thru the src in line pixel after pixel, but writing is random)


By the way, how is it with reading in the backward direction? You recommended that I avoid it, but there was a note in the BitBlt function that the author used it to avoid HW prefetch...
Old 30th December 2007, 13:10   #135  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,171
You are at the point where there is much unknown information to guess. It could work or it could exhaust the cache. You will have to test and time it.

Reading backwards applies to a special set of cases, search the AMD technotes for how and when to use it.

Also, you say your filter is faster with movntq; does your measurement include a following filter in the chain? You may make your filter faster, but the next filter might be slower because it must pull the frame data you just wrote back into cache, hence the whole script might run slower. Test carefully!
Old 7th November 2011, 21:52   #136  |  Link
redfordxx
Registered User
 
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 791
Sharing

Hi... after years..
I came around and thought I would share what I've done, just as it is.
So, a quick comment: it is based on TBarry's one. Usage:
Code:
DctFilter c[fac0]f[fac1]f[fac2]f[fac3]f[fac4]f[fac5]f[fac6]f[fac7]f[fac8]f[fac9]f[fac10]f[fac11]f[fac12]f[fac13]f[fac14]f[fac15]f[mode]s[blockyx]i[blockyy]i[shiftyx]i[shiftyy]i
Mode can be: showF, default, classic, freeDCT
-showF....shows frequency values in spatial domain (maps it on 0-255 range)
-other modes are just frequency decimation of various kinds (classic is the original 8x8 method, freeDCT is a [blockyx]x[blockyy] transformation, default chooses automatically... I think)
There are 16 factors to be adjusted.

I added dll and sources.
v8 works
v13 seems not to work

Issues and stupid things:
- I didn't rename the filter at that time, so it has the same name as the old one (probably that's why I recompiled the old filter under the name DCTFiler, so I can have both in AviSynth)
- DCTAddConstand does not work and I don't know why
- The newer version does not work and I don't know why...

So, maybe someone's interested. ShowF looks interesting...;-)
Attached Files
File Type: rar DctFilter16.v8.rar (126.0 KB, 56 views)
File Type: rar DctFilter16.v13.rar (126.8 KB, 55 views)
File Type: rar DctFilter16.v8.dll.rar (50.1 KB, 42 views)