Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion. Before you start posting please read the forum rules. By posting to this forum you agree to abide by the rules.
#121
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,173
Okay, the equivalent C code helps a lot. Perhaps it should be in your source as comments.
Have you got this
Code:
    movntq [ebx+edx], mm0
Do not mix cached and non-temporal memory accesses, especially writes, and make absolutely sure that they are 64-bit aligned. Also, as a baseline, perhaps you should test the pure C code version.
------------------------
Also, a possible hint for your end-around code: as long as you have at least 8 bytes in total to process, just do a single unaligned movq at the end, i.e. offset the movq to match the end of the buffer and process the few overlapping bytes twice.
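The end-of-buffer hint can be sketched in plain C. This is only an illustration under assumptions: the byte inversion is a placeholder operation, `invert8` and `invert_buffer` are hypothetical names, and `memcpy` of 8 bytes stands in for an unaligned movq. It also assumes src and dst are separate buffers, since redoing the overlapping bytes is only harmless when the source is left untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Invert 8 bytes from src into dst. The 8-byte memcpy stands in for
   an unaligned movq load/store. */
static void invert8(uint8_t *dst, const uint8_t *src)
{
    uint64_t v;
    memcpy(&v, src, sizeof v);
    v = ~v;
    memcpy(dst, &v, sizeof v);
}

/* Invert n bytes (n >= 8) from src into dst, 8 bytes per step.
   src and dst must not overlap: the tail step recomputes a few bytes,
   which is only safe because src is never modified. */
void invert_buffer(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t x;
    assert(n >= 8);
    for (x = 0; x + 8 <= n; x += 8)
        invert8(dst + x, src + x);
    if (x < n)                              /* 1..7 bytes left over */
        invert8(dst + n - 8, src + n - 8);  /* overlaps the previous step */
}
```

With n = 13 the loop handles bytes 0..7 and the tail step handles bytes 5..12, so bytes 5..7 are simply computed twice.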
#122
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
#125
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,173
I meant that you should repeat the knee test with both movq and C code to get a comparative baseline for the 50% slowdown, i.e. you are looking for the speed ratio to be constant.
I assume you are still trying to find out why you have a performance knee.
--------------------
Oh, and also about the end code stuff. For planar frames Avisynth guarantees that pitch is at least mod 8 (mod 16 on >= SSE2 machines). Use the PLANAR_Y_ALIGNED form of the RowSize(). Just remember that bytes beyond rowsize up to pitch contain uninitialized junk; you are free to overwrite them with good stuff or trash as you see fit.
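A minimal C sketch of what the pitch guarantee buys you, under stated assumptions: `aligned_rowsize` is a hypothetical stand-in for the aligned row size (it is not the Avisynth API), the inversion is a placeholder operation, and the frame is a plain byte buffer. Because pitch is mod 8, every row can be processed in whole 8-byte steps with no end code; the steps just spill into the junk bytes between rowsize and pitch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round rowsize up to the next multiple of 8, but never past pitch.
   Hypothetical helper, standing in for the aligned row size. */
size_t aligned_rowsize(size_t rowsize, size_t pitch)
{
    size_t a = (rowsize + 7) & ~(size_t)7;
    return a <= pitch ? a : pitch;
}

/* Invert a plane in place, 8 bytes per step. Requires pitch % 8 == 0,
   so the padded width is always a whole number of 8-byte steps.
   Bytes from rowsize up to pitch are padding and may be trashed. */
void invert_plane(uint8_t *p, size_t rowsize, size_t pitch, size_t height)
{
    size_t awidth = aligned_rowsize(rowsize, pitch);
    assert(pitch % 8 == 0 && rowsize <= pitch);
    for (size_t y = 0; y < height; y++) {
        uint8_t *row = p + y * pitch;
        for (size_t x = 0; x < awidth; x += 8) {
            uint64_t v;
            memcpy(&v, row + x, 8);   /* stands in for movq */
            v = ~v;
            memcpy(row + x, &v, 8);
        }
    }
}
```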
#126
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
#127
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
Code:
    _declspec(align(64)) const unsigned char* sourcep[64];
    __asm{
        mov ecx, [sourcep]      ;=sourcep[0]
        mov ecx, [sourcep+4]    ;=sourcep[1]
    }
Code:
    const unsigned char** sourcep = (const unsigned char**) _aligned_malloc(sizeof(unsigned char*)*length, 64);
    __asm{
        mov ecx, [sourcep]      ;=sourcep[0]
        mov ecx, [sourcep+4]    ;=&sourcep+4
    }
#129
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
In the second case I can't access sourcep[1] directly.
Probably I have to get the pointer first and then the value... two memory reads instead of one. Is there a workaround or not?
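For readers following along in C rather than inline asm, here is a rough sketch of the two cases in question. All names (`rows_array`, `rows_heap`, `setup_tables` and the two sum functions) are hypothetical. With a true array the base address is fixed, so each element is a single load; with a heap pointer the pointer variable has to be loaded before it can be indexed. One common way to keep that cost down, assumed here, is to load the pointer once into a local and index from that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { ROWS = 4 };

/* Case 1: a real array. Its base address is fixed, so rows_array[i]
   needs only one memory read. */
static const unsigned char *rows_array[ROWS];

/* Case 2: a pointer to heap storage. The pointer itself must be read
   before any element can be reached. */
static const unsigned char **rows_heap;

unsigned sum_via_array(void)
{
    unsigned sum = 0;
    for (size_t i = 0; i < ROWS; i++)
        sum += *rows_array[i];
    return sum;
}

unsigned sum_via_heap(void)
{
    const unsigned char **base = rows_heap;  /* the extra read, paid once */
    unsigned sum = 0;
    for (size_t i = 0; i < ROWS; i++)
        sum += *base[i];
    return sum;
}

/* Point both tables at the same bytes. Returns 0 on success, -1 if the
   allocation fails. The caller releases rows_heap with free(). */
int setup_tables(const unsigned char *data)
{
    assert(data != NULL);
    rows_heap = malloc(ROWS * sizeof *rows_heap);
    if (!rows_heap)
        return -1;
    for (size_t i = 0; i < ROWS; i++)
        rows_array[i] = rows_heap[i] = data + i;
    return 0;
}
```

In asm terms the local `base` corresponds to keeping the pointer in a register for the duration of the loop.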
#132
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,173
@Squid, The 1st case is really
Code:
    mov ecx, [ebp+sourcep+0]
    mov edx, [ebp+sourcep+4]
and the 2nd case has to load the pointer first
Code:
    mov esi, [ebp+sourcep]
    mov ecx, [esi+0]
    mov edx, [esi+4]
Code:
    AWIDTH=rowsize;
    if (AWIDTH+8-blockwidth > pitch) AWIDTH-=8-blockwidth;
    for(y=0;y<height;y+=blockheight) {
        for(x=0;x<AWIDTH;x+=blockwidth)
            for(j=0;j<blockheight;j++)
                for(i=0;i<blockwidth;i++) {
                    FAST CODE (overlaps to mod 8)
                }
        for(x=AWIDTH;x<rowsize;x+=blockwidth)
            for(j=0;j<blockheight;j++)
                for(i=0;i<blockwidth;i++) {
                    END CODE (No overlaps!)
                }
    }
The point was that in the very special case where you are trying to squeeze every last ounce out of the cache, there is the option to waste some memory and align your buffer to cache lines. This is not normal practice for general-purpose filters because you do not have enough control of the world, i.e. you do not know what size image the user will give you, and you do not know what block size the user will choose. For normal use, being 4-aligned for CPU registers, 8-aligned for MMX registers and 16-aligned for SSE2 registers is adequate.
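As a sketch of the "waste some memory and align to cache lines" option: the code below pads each row to a whole number of cache lines and allocates the buffer on a line boundary. It assumes a 64-byte line (check your CPU) and uses C11 `aligned_alloc` in place of the MSVC `_aligned_malloc` seen earlier in the thread; `round_up_line` and `alloc_line_aligned` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum { CACHE_LINE = 64 };   /* assumed line size, not queried from the CPU */

/* Round n up to a multiple of CACHE_LINE. */
size_t round_up_line(size_t n)
{
    return (n + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
}

/* Allocate height rows of rowsize bytes, each row starting on a cache
   line. *pitch_out receives the padded row stride. Returns NULL on
   failure. Release with free(). */
unsigned char *alloc_line_aligned(size_t rowsize, size_t height,
                                  size_t *pitch_out)
{
    size_t pitch = round_up_line(rowsize);
    assert(pitch_out != NULL);
    *pitch_out = pitch;
    /* aligned_alloc wants the size to be a multiple of the alignment;
       pitch * height always is, because pitch is. */
    return aligned_alloc(CACHE_LINE, pitch * height);
}
```

For a 100-byte row this yields a pitch of 128, so 28 bytes per row are the memory being "wasted" for the sake of alignment.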
#133
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,173
Or even more devious
Code:
    ...
    for(y=0;y<height-blockheight;y+=blockheight) {
        for(x=0;x<rowsize;x+=blockwidth)
            for(j=0;j<blockheight;j++)
                for(i=0;i<blockwidth;i++) {
                    FAST CODE (overlaps to mod 8)
                }
    }
    y=height-blockheight;
    for(x=0;x<AWIDTH;x+=blockwidth)
        for(j=0;j<blockheight;j++)
            for(i=0;i<blockwidth;i++) {
                FAST CODE (overlaps to mod 8)
            }
    for(x=AWIDTH;x<rowsize;x+=blockwidth)
        for(j=0;j<blockheight;j++)
            for(i=0;i<blockwidth;i++) {
                END CODE (No overlaps!)
            }
#134
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
Would it maybe help to first copy a row of blocks from the source into multiple work areas (for faster copying), then process all of them, and then copy all of them to the dst frame? That is, if it would be more cache-friendly.
If so, which way should the src be copied to the work areas?
1) Copy block after block (copy all lines of the first block, then the second...)
2) Go through the frame reading one whole line after another and copy each pixel to the work area where it belongs...
1) seems to me more cacheable for the writing part (in the work area the pixels would be written one after another), but the reads in src are random.
2) seems to me more cacheable for the reading part (going through the src line by line, pixel after pixel), but the writing is random.
By the way, how is it with reading in the backward direction? You recommended that I avoid it, but there was a note in the BitBlt function that the author used it to avoid HW prefetch...
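The two copy orders can be written down in C so they can be timed against each other. This is a sketch only: the work-area layout (block b stored as bw*bh contiguous bytes at work + b*bw*bh) and the function names are assumptions, and rowsize is taken to be a multiple of the block width. Both functions fill the work areas identically; only the loop order, and therefore the memory access pattern, differs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Option 1: block after block. Writes to the work areas are sequential,
   reads from src jump by pitch between lines. */
void gather_by_block(unsigned char *work, const unsigned char *src,
                     size_t pitch, size_t rowsize, size_t bw, size_t bh)
{
    size_t nblocks = rowsize / bw;
    assert(rowsize % bw == 0);
    for (size_t b = 0; b < nblocks; b++)
        for (size_t j = 0; j < bh; j++)
            memcpy(work + (b * bh + j) * bw, src + j * pitch + b * bw, bw);
}

/* Option 2: line after line. Reads from src are sequential, writes jump
   from one work area to the next. */
void scatter_by_line(unsigned char *work, const unsigned char *src,
                     size_t pitch, size_t rowsize, size_t bw, size_t bh)
{
    size_t nblocks = rowsize / bw;
    assert(rowsize % bw == 0);
    for (size_t j = 0; j < bh; j++)
        for (size_t b = 0; b < nblocks; b++)
            memcpy(work + (b * bh + j) * bw, src + j * pitch + b * bw, bw);
}
```

Which order wins depends on block size, pitch and cache size, so the sketch makes no claim either way; it only gives two equivalent variants to measure.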
#135
Avisynth Developer
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,173
You are at the point where there are too many unknowns to guess. It could work or it could exhaust the cache. You will have to test and time it.
Reading backwards applies to a special set of cases; search the AMD technotes for how and when to use it. Also, you say your filter is faster using movntq; does your measurement include the next filter in the chain? You may make your filter faster, but the next filter might be slower because it must re-read the frame data you just wrote back into the cache, hence the whole script might run slower. Test carefully!
#136
Registered User
Join Date: Jan 2005
Location: Praha (not that one in Texas)
Posts: 863
Sharing
Hi... after years...
I came around and I thought I would share what I've done. Just as it is. So, a quick comment. It is based on TBarry's one. Usage:
Code:
    DctFilter c[fac0]f[fac1]f[fac2]f[fac3]f[fac4]f[fac5]f[fac6]f[fac7]f[fac8]f[fac9]f[fac10]f[fac11]f[fac12]f[fac13]f[fac14]f[fac15]f[mode]s[blockyx]i[blockyy]i[shiftyx]i[shiftyy]i
- showF... shows frequency values in the spatial domain (maps them onto the 0-255 range)
- other modes are just the frequency decimation, various kinds (classic is the original 8x8 method, freeDCT is a [blockyx]x[blockyy] transformation, default chooses automatically... I think)
There are 16 factors to be adjusted. I added the dll and sources. v8 works; v13 seems not to work.
Issues and stupid things:
- I didn't rename the filter at that time, so it has the same name as the old one (probably that's why I recompiled the old filter under the name DCTFiler, so I can have both in AviSynth)
- DCTAddConstand does not work and I don't know why
- The newer version does not work and I don't know why...
So, maybe someone's interested. ShowF looks interesting... ;-)