23rd May 2009, 22:36   #1
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
aWarpSharp2 – rewrite of aWarpSharp

Current version: 2012.03.28

Previous versions:
2009.06.19
2009.05.24

aWarpSharp by MarcFD is a nice plugin (especially for tasks like halo removal), but it has some bugs and likes to produce green artifacts on the image borders. Other WarpSharp plugins produced worse results for me, so I decided to rewrite the aWarpSharp algorithm with better border handling and optimization for modern CPUs.

Besides the complete aWarpSharp2 algorithm, its parts are also available separately as aSobel, aBlur, aWarp and aWarp4. This way you can do advanced edge-mask filtering (for example with MDegrain) before passing the mask to the warp stage, for a more stable result (a rough sketch follows the usage examples below).

Good usage examples:
Code:
aWarp4(Spline36Resize(width*4, height*4, 0.375, 0.375), aSobel().aBlur(), depth=3)
aWarp4(nnedi3_rpow2(rfactor=2).Spline36Resize(width*4, height*4, 0.25, 0.25), aSobel().aBlur(), depth=3)
aWarp4(nnedi3_rpow2(rfactor=2).nnedi3_rpow2(rfactor=2), aSobel().aBlur(), depth=2)
Note that upsampling for aWarp4 should be left-top aligned, so Spline36Resize(width*4, height*4) or nnedi3_rpow2(rfactor=4) won't produce correct results.
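
For example, a rough sketch of the kind of advanced mask filtering mentioned above; it assumes MVTools2 is loaded, and the source name and parameter values are purely illustrative, not tuned recommendations:
Code:
src  = AviSource("input.avi")             # illustrative source
mask = src.aSobel().aBlur()               # edge mask, as in the examples above
sup  = mask.MSuper()
bv   = sup.MAnalyse(isb=true,  delta=1)   # backward motion vectors
fv   = sup.MAnalyse(isb=false, delta=1)   # forward motion vectors
mask = mask.MDegrain1(sup, bv, fv)        # temporally stabilized edge mask
src.aWarp(mask, depth=3)                  # warp the source with the filtered mask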

For an explanation of the options and the mapping from the values used in MarcFD's aWarpSharp, read the included aWarpSharp.txt.

Binary-patched Toon-v1.0 that uses aWarpSharp2 instead of aWarpSharp: Toon-v1.1

23rd May 2009, 23:29   #2
ChaosKing
Registered User
 
Join Date: Dec 2005
Location: Germany
Posts: 1,795
wow nice...

Just made a quick test.

awarpsharp(154,2,20)#new
awarpsharp(20,2,0.6)#old (marcFD)

[Screenshots: 1. new, 2. old]

As you can see, the border is no longer green. The pictures look very similar and the plugin seems to be about 20% faster on my Pentium D.

Very good job SEt, I waited so long for a bug-free aWarpSharp.

24th May 2009, 01:17   #3
Adub
Fighting spam with a fish
 
Join Date: Sep 2005
Posts: 2,699
You wouldn't happen to have tested the speed, would you, ChaosKing?

I would do it myself, but my rig is currently encoding a Blu-ray and will be doing so for at least the next 17 hours.

2nd June 2009, 03:19   #4
7ekno
Guest
 
Posts: n/a
Thanks SEt !!

I tried it as a drop-in replacement for the original aWarpSharp.dll, but MCTDenoise gives errors about it not supporting some of the parameters passed (dropping the original aWarpSharp.dll back in resolves it)...

It seems to be about 20-30% faster, so well done!!

Tek

2nd June 2009, 22:28   #6
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
All parameters of the original aWarpSharp are supported, but some are renamed; read aWarpSharp.txt for the mapping.

3rd June 2009, 04:04   #7
lansing
Registered User
 
Join Date: Sep 2006
Posts: 1,657
Thanks for the rewrite. I think sticking with the old naming would be more convenient for us.

3rd June 2009, 04:29   #8
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
Code:
			movdqu	xmm2, [esi-1]
				movdqa	xmm3, [esi]
				movdqu	xmm4, [esi+1]
				movdqu	xmm5, [esi+edx-1]
				movdqa	xmm6, [esi+edx]
				movdqu	xmm7, [esi+edx+1]
...
				movdqu	xmm1, [esi+eax-1]
				movdqu	xmm3, [esi+eax+1]
This is what palignr was made for; SSSE3-ifying this with palignr will avoid all the unaligned loads nicely. If you retain loads between loop iterations, you can reduce the number of memory accesses, too.

Code:
				movdqu	xmm6, [esi-6]
				movdqu	xmm0, [esi+6]
				pavgb	xmm6, xmm0
				movdqu	xmm5, [esi-5]
				movdqu	xmm7, [esi+5]
				pavgb	xmm5, xmm7
				movdqu	xmm4, [esi-4]
				movdqu	xmm0, [esi+4]
				pavgb	xmm4, xmm0
				movdqu	xmm3, [esi-3]
				movdqu	xmm7, [esi+3]
				pavgb	xmm3, xmm7
				movdqu	xmm2, [esi-2]
				movdqu	xmm0, [esi+2]
				pavgb	xmm2, xmm0
				movdqu	xmm1, [esi-1]
				movdqu	xmm7, [esi+1]
				pavgb	xmm1, xmm7
				movdqa	xmm0, [esi]
Did someone say "made for palignr"?

Code:
movd	eax, xmm2
				psrldq	xmm2, 4
				pinsrw	xmm3, [eax+esi], 0
				pinsrw	xmm4, [eax+edx], 0
				movd	eax, xmm2
				psrldq	xmm2, 4
				pinsrw	xmm3, [eax+esi+1], 1
				pinsrw	xmm4, [eax+edx+1], 1
				movd	eax, xmm2
				psrldq	xmm2, 4
				pinsrw	xmm3, [eax+esi+2], 2
				pinsrw	xmm4, [eax+edx+2], 2
				movd	eax, xmm2
				pinsrw	xmm3, [eax+esi+3], 3
				pinsrw	xmm4, [eax+edx+3], 3
				movd	eax, xmm7
				psrldq	xmm7, 4
				pinsrw	xmm3, [eax+esi+4], 4
				pinsrw	xmm4, [eax+edx+4], 4
				movd	eax, xmm7
				psrldq	xmm7, 4
				pinsrw	xmm3, [eax+esi+5], 5
				pinsrw	xmm4, [eax+edx+5], 5
				movd	eax, xmm7
				psrldq	xmm7, 4
				pinsrw	xmm3, [eax+esi+6], 6
				pinsrw	xmm4, [eax+edx+6], 6
				movd	eax, xmm7
				pinsrw	xmm3, [eax+esi+7], 7
				pinsrw	xmm4, [eax+edx+7], 7
				mov	eax, [esp]
I'm going to have to start killing kittens if I keep seeing things like this.

Code:
movq	xmm7, qword ptr [edi+ebx-1] // one line above actual position, but it gives 1.4x speedup
How about you figure out why it does?

17th June 2009, 17:00   #9
sh0dan
Retired AviSynth Dev ;)
 
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
Quote:
Originally Posted by Dark Shikari
Code:
movd	eax, xmm2
				psrldq	xmm2, 4
				pinsrw	xmm3, [eax+esi], 0
				pinsrw	xmm4, [eax+edx], 0
				movd	eax, xmm2
				psrldq	xmm2, 4
				pinsrw	xmm3, [eax+esi+1], 1
				pinsrw	xmm4, [eax+edx+1], 1
				movd	eax, xmm2
				psrldq	xmm2, 4
				pinsrw	xmm3, [eax+esi+2], 2
				pinsrw	xmm4, [eax+edx+2], 2
				movd	eax, xmm2
				pinsrw	xmm3, [eax+esi+3], 3
				pinsrw	xmm4, [eax+edx+3], 3
				movd	eax, xmm7
				psrldq	xmm7, 4
				pinsrw	xmm3, [eax+esi+4], 4
				pinsrw	xmm4, [eax+edx+4], 4
				movd	eax, xmm7
				psrldq	xmm7, 4
				pinsrw	xmm3, [eax+esi+5], 5
				pinsrw	xmm4, [eax+edx+5], 5
				movd	eax, xmm7
				psrldq	xmm7, 4
				pinsrw	xmm3, [eax+esi+6], 6
				pinsrw	xmm4, [eax+edx+6], 6
				movd	eax, xmm7
				pinsrw	xmm3, [eax+esi+7], 7
				pinsrw	xmm4, [eax+edx+7], 7
				mov	eax, [esp]
I'm going to have to start killing kittens if I keep seeing things like this.
I completely agree - while this might seem fast, it isn't. Movd r32,xmm has a latency of 6 cycles on Core2, pinsrw has a latency of 4. Both are eons.

Store the content of xmm2 and xmm7 into memory, do lookups in scalar assembler and read them back:
Code:
				movdqa [temp1], xmm2   ; Store all pixels

				; push eax, ebx, ecx on the stack, if you use them already

				xor eax, eax
				xor ebx, ebx
				xor ecx, ecx
	
				mov ax,[temp1]
				mov bx,[eax+esi]
				mov cx,[eax+edi]
				mov [temp2], bx
				mov [temp3], cx

				mov ax,[temp1+2]
				mov bx,[eax+esi]
				mov cx,[eax+edi]
				mov [temp2+2], bx
				mov [temp3+2], cx

				(you get the picture - use a macro for nice code)

				movdqa xmm3, [temp2]		
				movdqa xmm4, [temp3]
This way you only get the performance hit of the cache lookups, and a Store->Load Forward size mismatch penalty. And please, use palignr; it is much faster on Core2.

17th June 2009, 23:42   #10
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
Quote:
Originally Posted by sh0dan
I completely agree - while this might seem fast, it isn't. Movd r32,xmm has a latency of 6 cycles on Core2, pinsrw has a latency of 4. Both are eons.
Your numbers seem too high even for Core2. According to the http://www.agner.org/optimize/ tables, movd has a latency/throughput of 2/0.33 on both Core2 generations, and pinsrw 6/1.5 on 65nm and 2/1 on 45nm. On my Nehalem, Everest measures "MOVD r32, xmm + MOVD xmm, r32" to have latency 4, "PEXTRW + PINSRW r32" latency 1, movd throughput 0.4, and pinsrw throughput 0.66.

As for the speed of this part, it's actually the other way around: at the beginning I thought it would be painfully slow too, but the profiler says it's quite fast. Even changing the movd/psrldq into movdqa/mov [+0/4/8/12] gives me a 10% speed drop for the whole function; changing the pinsrw would likely cost even more. Of course, this is measured on Nehalem and not on Core2, which everyone seems to love for reasons that are beyond me. Modern CPUs taught me to trust only the profiler when optimizing, not what I think is faster, so I'm not going to write something that is "maybe better for Core2" when I have no means of confirming it by testing there.
Quote:
Originally Posted by sh0dan
And for please, use palignr, it is much faster on Core2.
palignr is definitely great for some cases, but guess how much difference I measure between the old unaligned hell of
Code:
movdqu	xmm6, [esi-6]
movdqu	xmm0, [esi+6]
pavgb	xmm6, xmm0
movdqu	xmm5, [esi-5]
movdqu	xmm7, [esi+5]
pavgb	xmm5, xmm7
movdqu	xmm4, [esi-4]
movdqu	xmm0, [esi+4]
pavgb	xmm4, xmm0
movdqu	xmm3, [esi-3]
movdqu	xmm7, [esi+3]
pavgb	xmm3, xmm7
movdqu	xmm2, [esi-2]
movdqu	xmm0, [esi+2]
pavgb	xmm2, xmm0
movdqu	xmm1, [esi-1]
movdqu	xmm7, [esi+1]
pavgb	xmm1, xmm7
movdqa	xmm0, [esi]
pavgb	xmm6, xmm5
pavgb	xmm4, xmm3
pavgb	xmm2, xmm1
pavgb	xmm6, xmm4
pavgb	xmm2, xmm0
pavgb	xmm6, xmm2
pavgb	xmm6, xmm2
movntdq	[esi+edi], xmm6
and new (to be released)
Code:
movdqa	xmm7, [esi+10h]
movdqa	xmm0, xmm6
movdqa	xmm2, xmm7
palignr	xmm0, xmm5, 10
palignr	xmm2, xmm6, 6
pavgb	xmm0, xmm2
movdqa	xmm3, xmm6
movdqa	xmm4, xmm7
palignr	xmm3, xmm5, 11
palignr	xmm4, xmm6, 5
pavgb	xmm3, xmm4
pavgb	xmm0, xmm3
movdqa	xmm1, xmm6
movdqa	xmm2, xmm7
palignr	xmm1, xmm5, 12
palignr	xmm2, xmm6, 4
pavgb	xmm1, xmm2
movdqa	xmm3, xmm6
movdqa	xmm4, xmm7
palignr	xmm3, xmm5, 13
palignr	xmm4, xmm6, 3
pavgb	xmm3, xmm4
pavgb	xmm1, xmm3
pavgb	xmm0, xmm1
movdqa	xmm1, xmm6
movdqa	xmm2, xmm7
palignr	xmm1, xmm5, 14
palignr	xmm2, xmm6, 2
pavgb	xmm1, xmm2
movdqa	xmm3, xmm6
movdqa	xmm4, xmm7
palignr	xmm3, xmm5, 15
palignr	xmm4, xmm6, 1
pavgb	xmm3, xmm4
pavgb	xmm1, xmm3
pavgb	xmm1, xmm6
movdqa	xmm5, xmm6
movdqa	xmm6, xmm7
pavgb	xmm0, xmm1
pavgb	xmm0, xmm1
movntdq	[esi+edi], xmm0
? The second version is only 10% faster (of course, on Nehalem again).

17th June 2009, 23:50   #11
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
Quote:
Originally Posted by SEt
Of course, this is measured on Nehalem and not on Core2, which everyone seems to love for reasons that are beyond me.
Maybe because that's what they own? It will be years before the Nehalem has a higher install base than the Core 2.
Quote:
Originally Posted by SEt
Modern CPUs taught me to trust only the profiler when optimizing, not what I think is faster, so I'm not going to write something that is "maybe better for Core2" when I have no means of confirming it by testing there.
Then ask someone for SSH access, there are billions of Core 2s.

Quote:
Originally Posted by SEt
? The second version is only 10% faster (of course, on Nehalem again).
Try running the first set of code when you're not on a cacheline, and the second set when you are, perhaps?

18th June 2009, 00:23   #12
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
Quote:
Originally Posted by Dark Shikari
Maybe because that's what they own? It will be years before the Nehalem has a higher install base than the Core 2.
I suspect the P4/D install base is still higher than Core2's, but does that mean we should optimize primarily for P4? If I were to care about something else, it would be the Phenoms, and I suspect they behave similarly to Nehalem, not Core2.
Quote:
Originally Posted by Dark Shikari
Then ask someone for SSH access, there are billions of Core 2s.
Good idea, but I'd rather spend my time on more important stuff now, like actually releasing something.
Quote:
Originally Posted by Dark Shikari
Try running the first set of code when you're not on a cacheline, and the second set when you are, perhaps?
I don't quite get your point here. I benchmark the code on real video processing. Also, notice the 'movdqa' in the only memory load of the second variant.

4th June 2009, 16:09   #13
owais
Registered User
 
Join Date: Mar 2009
Posts: 44
Help!! With this new update of the famous plugin I'm getting an image kind of like this.

I used
Code:
aWarpSharp(depth=12,blur=4,thresh=51,chroma=1)
Am I doing something wrong?

The colours are dancing.

With the old plugin I'm getting a normal image, though with green lines.

I used
Code:
aWarpSharp(depth=12,blurlevel=4,thresh=0.2,cm=1)

Edit:

I have found that it's due to chroma=1. Basically I don't know what chroma is, because I'm new to video (I just started in March and have learned a lot).

So far chroma=2 or 3 works well for me, and 4 also; the problem is only with 1. 0 was giving me black and white, hehe.

6th June 2009, 13:10   #14
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
Dark Shikari, I know not everything is optimally written, but I thought it better to release a working version now than a super-optimized one never. I know the horizontal blur is made for palignr and I will look into it when I have time, but I have no idea how to save the kittens, or why loading the correct line gives a 1.4x speed drop for the whole function, including those awful pinsrw that should be much more time-consuming than just an unaligned load from an additional memory location.

owais, have you tried reading all of aWarpSharp.txt? It explains that cm=1 of the original aWarpSharp is chroma=4 in mine, and what the chroma values mean.
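
For illustration, here is a rough mapping of the example discussed above, built only from values posted in this thread (0.2 on the old 0..1 thresh scale corresponds to about 51 on the new 0..255 scale, and cm=1 maps to chroma=4); aWarpSharp.txt remains the authoritative reference:
Code:
aWarpSharp(depth=12, blurlevel=4, thresh=0.2, cm=1)     # original aWarpSharp by MarcFD
aWarpSharp(depth=12, blur=4,      thresh=51,  chroma=4) # roughly equivalent aWarpSharp2 call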

6th June 2009, 15:39   #15
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
Quote:
Originally Posted by SEt
Dark Shikari, I know not everything is optimally written, but I thought it better to release a working version now than a super-optimized one never. I know the horizontal blur is made for palignr and I will look into it when I have time, but I have no idea how to save the kittens, or why loading the correct line gives a 1.4x speed drop for the whole function, including those awful pinsrw that should be much more time-consuming than just an unaligned load from an additional memory location.
"Should be much more time consuming?"

Does that imply you tested it, and found it to be faster?

If it's faster, I'm going to be inclined to blame cacheline-split. Test on an AMD chip or Nehalem and watch the penalties melt away.

6th June 2009, 19:34   #16
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
I'm already on Nehalem, and when I change
Code:
movq	xmm7, qword ptr [edi+pitch*0-1]
movq	xmm4, qword ptr [edi+pitch*0+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
to
Code:
movq	xmm7, qword ptr [edi+pitch*1-1]
movq	xmm4, qword ptr [edi+pitch*1+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
I see a 1.4x slowdown in the profiler for the whole function.

6th June 2009, 19:37   #17
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
Quote:
Originally Posted by SEt
I'm already on Nehalem, and when I change
Code:
movq	xmm7, qword ptr [edi+pitch*0-1]
movq	xmm4, qword ptr [edi+pitch*0+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
to
Code:
movq	xmm7, qword ptr [edi+pitch*1-1]
movq	xmm4, qword ptr [edi+pitch*1+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
I see a 1.4x slowdown in the profiler for the whole function.
I was referring to the pinsrw with regard to the cacheline split.

If you're getting such a large slowdown merely by changing that, you should try to figure out why. Performance counters might be useful for analyzing that.

6th June 2009, 19:56   #18
Fizick
AviSynth plugger
 
Join Date: Nov 2003
Location: Russia
Posts: 2,183
Can I ask why you changed the meaning of the parameters? It is confusing, and not "fully compatible with the original aWarpSharp".
If you prefer new parameters, please use new parameter names (or a new plugin name).

6th June 2009, 22:20   #19
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
I played with performance counters for some time, and here is what I found (I removed all pinsrw for these tests, as they don't change the situation):
the global slowdown is produced by
Code:
movq	xmm7, qword ptr [edi+pitch*1-1]
but not by
Code:
movq	xmm4, qword ptr [edi+pitch*1+1]
It results in a spike of L1D.REPL and huge spikes of L1D.M_REPL, L1D.M_EVICT and L1D.M_SNOOP_EVICT in that area (also ILD_STALL.ANY, but I don't think that's interesting).
I've tried changing the only writing instruction here from movq to movdq2q or movntq, but that changed nothing.


Fizick, I think the situation is similar to MVTools 1-2. It's fully compatible in terms of available functionality, and the effective ranges of the parameters are supersets of the original ones. I know I should probably change the name to aWarpSharp2, but that looks kind of strange next to aSobel, aBlur and aWarp. In truth it's more like a beta release to me, due to the mentioned wrong offsets in Warp and the saturated multiplication by 6 at the end of Sobel, which I don't like at all.

6th June 2009, 23:40   #20
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,167
Quote:
Originally Posted by SEt
I'm already on Nehalem, and when I change
Code:
movq	xmm7, qword ptr [edi+pitch*0-1]
movq	xmm4, qword ptr [edi+pitch*0+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
to
Code:
movq	xmm7, qword ptr [edi+pitch*1-1]
movq	xmm4, qword ptr [edi+pitch*1+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
I see a 1.4x slowdown in the profiler for the whole function.
Consider the memory address each is referencing and which cache line each uses. I have colour-coded 3 different memory areas; in the fast case only 2 areas are used. Also, accessing data not aligned to 64 bits has a penalty, and there is a very big penalty when you cross a cache-line (64-byte) boundary. For example, if edi+pitch happens to be 64-byte aligned, then the 8-byte movq at [edi+pitch*1-1] starts one byte before that boundary and so touches two cache lines. For the [edi+pitch*1-1] you may be slipping into the previous cache line (what address is in EDI?)