Doom9's Forum > Capturing and Editing Video > Avisynth Development

Old 23rd May 2009, 22:36   #1  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
aWarpSharp2 rewrite of aWarpSharp

Current version: 2012.03.28

Previous versions:
2009.06.19
2009.05.24

aWarpSharp by MarcFD is a nice plugin (especially for tasks like halo removal), but it has some bugs and likes to produce green artifacts on image borders. Other WarpSharp plugins produced worse results for me, so I decided to rewrite the aWarpSharp algorithm with better border handling and optimization for modern CPUs.

Besides the complete aWarpSharp2 algorithm, its parts are also available separately as aSobel, aBlur, aWarp and aWarp4. This way you can do advanced edge mask filtering (e.g. with MDegrain) before passing the mask to the warp stage, for a more stable result.

Good usage examples:
Code:
aWarp4(Spline36Resize(width*4, height*4, 0.375, 0.375), aSobel().aBlur(), depth=3)
aWarp4(nnedi3_rpow2(rfactor=2).Spline36Resize(width*4, height*4, 0.25, 0.25), aSobel().aBlur(), depth=3)
aWarp4(nnedi3_rpow2(rfactor=2).nnedi3_rpow2(rfactor=2), aSobel().aBlur(), depth=2)
Note that the upsampling for aWarp4 must be left-top aligned, so plain Spline36Resize(width*4, height*4) or nnedi3_rpow2(rfactor=4) won't produce correct results.

For an explanation of the options, and the mapping from the values used in MarcFD's aWarpSharp, read the included aWarpSharp.txt.

A binary-patched Toon-v1.0 that uses aWarpSharp2 instead of aWarpSharp: Toon-v1.1

Last edited by SEt; 28th March 2012 at 02:07.
Old 23rd May 2009, 23:29   #2  |  Link
ChaosKing
Registered User
 
Join Date: Dec 2005
Location: Germany
Posts: 484
wow nice...

Just made a quick test.

awarpsharp(154,2,20)#new
awarpsharp(20,2,0.6)#old (marcFD)

1. new, 2. old


As you can see, the border is no longer green. The pictures look very similar and the plugin seems to be about 20% faster on my Pentium D.

Very good job SEt, I've waited so long for a bug-free aWarpSharp
__________________
Search and denoise
Old 24th May 2009, 01:17   #3  |  Link
Adub
Fighting spam with a fish
 
Adub's Avatar
 
Join Date: Sep 2005
Posts: 2,685
You wouldn't happen to have tested the speed, would you, ChaosKing?

I would do it myself, but my rig is currently encoding a Blu-ray and will be doing so for at least the next 17 hours.
__________________
FAQs:Bond's AVC/H.264 FAQ
Site:Adubvideo
Old 2nd June 2009, 03:19   #4  |  Link
7ekno
Registered User
 
Join Date: Jul 2007
Posts: 389
Thanks SEt !!

Tried it as a drop-in replacement for the original aWarpSharp.dll, but MCTDenoise gives errors about some of the passed parameters not being supported (and dropping the original aWarpSharp.dll back in resolves it) ...

It seems to be about 20-30% faster, so well done !!

Tek
Old 2nd June 2009, 22:28   #6  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
All parameters of the original aWarpSharp are supported, but some are renamed - read aWarpSharp.txt for the mapping.
Old 3rd June 2009, 04:04   #7  |  Link
lansing
Registered User
 
Join Date: Sep 2006
Posts: 1,001
Thanks for the rewrite, and I think sticking with the old naming would be more convenient for us.
Old 3rd June 2009, 04:29   #8  |  Link
Dark Shikari
x264 developer
 
Dark Shikari's Avatar
 
Join Date: Sep 2005
Posts: 8,689
Code:
movdqu	xmm2, [esi-1]
movdqa	xmm3, [esi]
movdqu	xmm4, [esi+1]
movdqu	xmm5, [esi+edx-1]
movdqa	xmm6, [esi+edx]
movdqu	xmm7, [esi+edx+1]
...
movdqu	xmm1, [esi+eax-1]
movdqu	xmm3, [esi+eax+1]
This is what palignr was made for; SSSE3-ifying this with palignr will avoid all the unaligned loads nicely. If you retain loads between loop iterations, you can reduce the number of memory accesses, too.

Code:
movdqu	xmm6, [esi-6]
movdqu	xmm0, [esi+6]
pavgb	xmm6, xmm0
movdqu	xmm5, [esi-5]
movdqu	xmm7, [esi+5]
pavgb	xmm5, xmm7
movdqu	xmm4, [esi-4]
movdqu	xmm0, [esi+4]
pavgb	xmm4, xmm0
movdqu	xmm3, [esi-3]
movdqu	xmm7, [esi+3]
pavgb	xmm3, xmm7
movdqu	xmm2, [esi-2]
movdqu	xmm0, [esi+2]
pavgb	xmm2, xmm0
movdqu	xmm1, [esi-1]
movdqu	xmm7, [esi+1]
pavgb	xmm1, xmm7
movdqa	xmm0, [esi]
Did someone say made for palignr

Code:
movd	eax, xmm2
psrldq	xmm2, 4
pinsrw	xmm3, [eax+esi], 0
pinsrw	xmm4, [eax+edx], 0
movd	eax, xmm2
psrldq	xmm2, 4
pinsrw	xmm3, [eax+esi+1], 1
pinsrw	xmm4, [eax+edx+1], 1
movd	eax, xmm2
psrldq	xmm2, 4
pinsrw	xmm3, [eax+esi+2], 2
pinsrw	xmm4, [eax+edx+2], 2
movd	eax, xmm2
pinsrw	xmm3, [eax+esi+3], 3
pinsrw	xmm4, [eax+edx+3], 3
movd	eax, xmm7
psrldq	xmm7, 4
pinsrw	xmm3, [eax+esi+4], 4
pinsrw	xmm4, [eax+edx+4], 4
movd	eax, xmm7
psrldq	xmm7, 4
pinsrw	xmm3, [eax+esi+5], 5
pinsrw	xmm4, [eax+edx+5], 5
movd	eax, xmm7
psrldq	xmm7, 4
pinsrw	xmm3, [eax+esi+6], 6
pinsrw	xmm4, [eax+edx+6], 6
movd	eax, xmm7
pinsrw	xmm3, [eax+esi+7], 7
pinsrw	xmm4, [eax+edx+7], 7
mov	eax, [esp]
I'm going to have to start killing kittens if I keep seeing things like this.

Code:
movq	xmm7, qword ptr [edi+ebx-1] // one line above actual position, but it gives 1.4x speedup
How about you figure out why it does?
Old 17th June 2009, 17:00   #9  |  Link
sh0dan
Retired AviSynth Dev ;)
 
sh0dan's Avatar
 
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
Quote:
Originally Posted by Dark Shikari View Post
Code:
movd	eax, xmm2
psrldq	xmm2, 4
pinsrw	xmm3, [eax+esi], 0
pinsrw	xmm4, [eax+edx], 0
movd	eax, xmm2
psrldq	xmm2, 4
pinsrw	xmm3, [eax+esi+1], 1
pinsrw	xmm4, [eax+edx+1], 1
movd	eax, xmm2
psrldq	xmm2, 4
pinsrw	xmm3, [eax+esi+2], 2
pinsrw	xmm4, [eax+edx+2], 2
movd	eax, xmm2
pinsrw	xmm3, [eax+esi+3], 3
pinsrw	xmm4, [eax+edx+3], 3
movd	eax, xmm7
psrldq	xmm7, 4
pinsrw	xmm3, [eax+esi+4], 4
pinsrw	xmm4, [eax+edx+4], 4
movd	eax, xmm7
psrldq	xmm7, 4
pinsrw	xmm3, [eax+esi+5], 5
pinsrw	xmm4, [eax+edx+5], 5
movd	eax, xmm7
psrldq	xmm7, 4
pinsrw	xmm3, [eax+esi+6], 6
pinsrw	xmm4, [eax+edx+6], 6
movd	eax, xmm7
pinsrw	xmm3, [eax+esi+7], 7
pinsrw	xmm4, [eax+edx+7], 7
mov	eax, [esp]
I'm going to have to start killing kittens if I keep seeing things like this.
I completely agree - while this might seem fast, it isn't. Movd r32,xmm has a latency of 6 cycles on Core2, pinsrw has a latency of 4. Both are eons.

Store the contents of xmm2 and xmm7 into memory, do the lookups in scalar assembler and read them back:
Code:
movdqa	[temp1], xmm2   ; Store all pixels

; push eax, ebx, ecx on the stack if you use them already

xor	eax, eax
xor	ebx, ebx
xor	ecx, ecx

mov	ax, [temp1]
mov	bx, [eax+esi]
mov	cx, [eax+edi]
mov	[temp2], bx
mov	[temp3], cx

mov	ax, [temp1+2]
mov	bx, [eax+esi]
mov	cx, [eax+edi]
mov	[temp2+2], bx
mov	[temp3+2], cx

; (you get the picture - use a macro for nice code)

movdqa	xmm3, [temp2]
movdqa	xmm4, [temp3]
This way you only pay for the cache lookups, plus a store-to-load-forward size-mismatch penalty. And please, use palignr - it is much faster on Core2.
__________________
Regards, sh0dan // VoxPod
Old 17th June 2009, 23:42   #10  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
Quote:
Originally Posted by sh0dan View Post
I completely agree - while this might seem fast, it isn't. Movd r32,xmm has a latency of 6 cycles on Core2, pinsrw has a latency of 4. Both are eons.
Your numbers seem too high even for Core2. According to the http://www.agner.org/optimize/ tables, movd has a latency/throughput of 2/0.33 on both Core2 generations, and pinsrw 6/1.5 on 65nm and 2/1 on 45nm. On my Nehalem, Everest measures "MOVD r32, xmm + MOVD xmm, r32" at latency 4, "PEXTRW + PINSRW r32" at latency 1, movd throughput 0.4, pinsrw throughput 0.66.

For the speed of this part it's actually the other way around - at the beginning I thought it was painfully slow too, but the profiler says it's quite fast. Changing even movd/psrldq into movdqa/mov [+0/4/8/12] gives me a 10% speed drop for the whole function; changing pinsrw would likely cost even more. Of course, this is measured on Nehalem, not on the Core2 that everyone seems to love for reasons beyond me. Modern CPUs have taught me to trust only the profiler when optimizing, not what I think should be faster, so I'm not going to write something that is "maybe better for Core2" when I have no means of confirming it by testing there.
Quote:
Originally Posted by sh0dan View Post
And for please, use palignr, it is much faster on Core2.
palignr is definitely great for some cases, but guess how much difference I measure between the old unaligned hell of
Code:
movdqu	xmm6, [esi-6]
movdqu	xmm0, [esi+6]
pavgb	xmm6, xmm0
movdqu	xmm5, [esi-5]
movdqu	xmm7, [esi+5]
pavgb	xmm5, xmm7
movdqu	xmm4, [esi-4]
movdqu	xmm0, [esi+4]
pavgb	xmm4, xmm0
movdqu	xmm3, [esi-3]
movdqu	xmm7, [esi+3]
pavgb	xmm3, xmm7
movdqu	xmm2, [esi-2]
movdqu	xmm0, [esi+2]
pavgb	xmm2, xmm0
movdqu	xmm1, [esi-1]
movdqu	xmm7, [esi+1]
pavgb	xmm1, xmm7
movdqa	xmm0, [esi]
pavgb	xmm6, xmm5
pavgb	xmm4, xmm3
pavgb	xmm2, xmm1
pavgb	xmm6, xmm4
pavgb	xmm2, xmm0
pavgb	xmm6, xmm2
pavgb	xmm6, xmm2
movntdq	[esi+edi], xmm6
and new (to be released)
Code:
movdqa	xmm7, [esi+10h]
movdqa	xmm0, xmm6
movdqa	xmm2, xmm7
palignr	xmm0, xmm5, 10
palignr	xmm2, xmm6, 6
pavgb	xmm0, xmm2
movdqa	xmm3, xmm6
movdqa	xmm4, xmm7
palignr	xmm3, xmm5, 11
palignr	xmm4, xmm6, 5
pavgb	xmm3, xmm4
pavgb	xmm0, xmm3
movdqa	xmm1, xmm6
movdqa	xmm2, xmm7
palignr	xmm1, xmm5, 12
palignr	xmm2, xmm6, 4
pavgb	xmm1, xmm2
movdqa	xmm3, xmm6
movdqa	xmm4, xmm7
palignr	xmm3, xmm5, 13
palignr	xmm4, xmm6, 3
pavgb	xmm3, xmm4
pavgb	xmm1, xmm3
pavgb	xmm0, xmm1
movdqa	xmm1, xmm6
movdqa	xmm2, xmm7
palignr	xmm1, xmm5, 14
palignr	xmm2, xmm6, 2
pavgb	xmm1, xmm2
movdqa	xmm3, xmm6
movdqa	xmm4, xmm7
palignr	xmm3, xmm5, 15
palignr	xmm4, xmm6, 1
pavgb	xmm3, xmm4
pavgb	xmm1, xmm3
pavgb	xmm1, xmm6
movdqa	xmm5, xmm6
movdqa	xmm6, xmm7
pavgb	xmm0, xmm1
pavgb	xmm0, xmm1
movntdq	[esi+edi], xmm0
? The second version is only 10% faster (on Nehalem again, of course).
Old 17th June 2009, 23:50   #11  |  Link
Dark Shikari
x264 developer
 
Dark Shikari's Avatar
 
Join Date: Sep 2005
Posts: 8,689
Quote:
Originally Posted by SEt View Post
Of course it's measured on Nehalem and not on Core2 that everyone seems to love for reason that's beyond me.
Maybe because that's what they own? It will be years before the Nehalem has a higher install base than the Core 2.
Quote:
Originally Posted by SEt View Post
Modern CPUs taught me to believe only profiler when optimizing and not what you think is faster, so i'm not going to write something that "maybe better for Core2" when i have no means of confirming it by testing there.
Then ask someone for SSH access, there are billions of Core 2s.

Quote:
Originally Posted by SEt View Post
? The second version is only 10% faster (of course, on Nehalem again).
Try running the first set of code when you're not on a cacheline, and the second set when you are, perhaps?

Last edited by Dark Shikari; 17th June 2009 at 23:52.
Old 18th June 2009, 00:23   #12  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
Quote:
Originally Posted by Dark Shikari View Post
Maybe because that's what they own? It will be years before the Nehalem has a higher install base than the Core 2.
I suspect there is a higher install base of P4/D than Core2 now, but does that mean we should optimize primarily for P4? If I cared about anything else it would be the Phenoms, and I suspect they behave similarly to Nehalem, not Core2.
Quote:
Originally Posted by Dark Shikari View Post
Then ask someone for SSH access, there are billions of Core 2s.
Good idea, but I'd rather spend my time on more important stuff right now, like actually releasing something.
Quote:
Originally Posted by Dark Shikari View Post
Try running the first set of code when you're not on a cacheline, and the second set when you are, perhaps?
I don't quite get your point here. I benchmark the code on real video processing. Also notice the 'movdqa' in the only memory load of the second variant.
Old 4th June 2009, 16:09   #13  |  Link
owais
Registered User
 
Join Date: Mar 2009
Posts: 44
Help!! With this new updated famous plugin I'm getting an image like this.

i used
Code:
aWarpSharp(depth=12,blur=4,thresh=51,chroma=1)
Am I doing something wrong?

The colours are dancing.





With the old plugin I get a normal image, though with the green lines.


i used
Code:
aWarpSharp(depth=12,blurlevel=4,thresh=0.2,cm=1)


Edited:

I have found that it's due to chroma=1. Basically I don't know what chroma is, since I'm new to video (just started in March and have learned a lot).

For me chroma=2, 3 or 4 works well; the problem is only with 1. 0 was giving me black and white, hehe.

Last edited by owais; 4th June 2009 at 16:35.
Old 6th June 2009, 13:10   #14  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
Dark Shikari, I know not everything is optimally written, but I thought it better to release a working version now than a super-optimized one never. I know that horizontal blur is made for palignr and will look into it when I have time, but I have no idea how to save the kittens, or why loading the correct line gives a 1.4x speed drop for the whole function - including those awful pinsrw, which should be much more time consuming than just an unaligned load from an additional memory location.

owais, have you tried reading all of aWarpSharp.txt? It explains that cm=1 of the original aWarpSharp is chroma=4 in mine, and what the chroma values mean.
Old 6th June 2009, 15:39   #15  |  Link
Dark Shikari
x264 developer
 
Dark Shikari's Avatar
 
Join Date: Sep 2005
Posts: 8,689
Quote:
Originally Posted by SEt View Post
Dark Shikari, i know not everything is optimally written, but i thought better release working version now than super-optimized never. I know that horizontal blur is made for palignr and will look into this when i have time, but i have no idea how to save kittens or why loading correct line gives 1.4x speed drop for the whole function, including those awful pinsrw that should be much more time consuming than just unaligned load from additional memory location.
"Should be much more time consuming?"

Does that imply you tested it, and found it to be faster?

If it's faster, I'm going to be inclined to blame cacheline-split. Test on an AMD chip or Nehalem and watch the penalties melt away.
Old 6th June 2009, 19:34   #16  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
I'm already on Nehalem and when i change
Code:
movq	xmm7, qword ptr [edi+pitch*0-1]
movq	xmm4, qword ptr [edi+pitch*0+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
to
Code:
movq	xmm7, qword ptr [edi+pitch*1-1]
movq	xmm4, qword ptr [edi+pitch*1+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
I see a 1.4x slowdown in the profiler for the whole function.
Old 6th June 2009, 19:37   #17  |  Link
Dark Shikari
x264 developer
 
Dark Shikari's Avatar
 
Join Date: Sep 2005
Posts: 8,689
Quote:
Originally Posted by SEt View Post
I'm already on Nehalem and when i change
Code:
movq	xmm7, qword ptr [edi+pitch*0-1]
movq	xmm4, qword ptr [edi+pitch*0+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
to
Code:
movq	xmm7, qword ptr [edi+pitch*1-1]
movq	xmm4, qword ptr [edi+pitch*1+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
I see a 1.4x slowdown in the profiler for the whole function.
I was referring to the pinsrw with regard to the cacheline split.

If you're getting such a large slowdown merely by changing that, you should try to figure out why. Performance counters might be useful for analyzing that.
Old 6th June 2009, 19:56   #18  |  Link
Fizick
AviSynth plugger
 
Fizick's Avatar
 
Join Date: Nov 2003
Location: Russia
Posts: 2,183
Can I ask why you changed the meaning of the parameters? It is confusing, and not "fully compatible with the original aWarpSharp".
If you prefer new parameters, please use new parameter names (or a new name for the plugin).
__________________
My Avisynth plugins are now at http://avisynth.org.ru and mirror at http://avisynth.nl/users/fizick
I usually do not provide a technical support in private messages.
Old 6th June 2009, 22:20   #19  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
I played with the performance counters for a while and here is what I found (I removed all pinsrw for the tests, as they don't change the situation):
the global slowdown is produced by
Code:
movq	xmm7, qword ptr [edi+pitch*1-1]
but not by
Code:
movq	xmm4, qword ptr [edi+pitch*1+1]
It results in a spike of L1D.REPL and huge spikes of L1D.M_REPL, L1D.M_EVICT and L1D.M_SNOOP_EVICT in that area (also ILD_STALL.ANY, but I don't think that's interesting).
I've tried changing the only writing instruction here from movq to movdq2q and movntq, but that changed nothing.


Fizick, I think the situation is similar to MVTools 1-2. It's fully compatible in terms of available functionality, and the effective ranges of the parameters are supersets of the original ones. I know I should probably change the name to aWarpSharp2, but it looks kind of strange next to aSobel, aBlur and aWarp. In truth it's more like a beta release to me, due to the mentioned wrong offsets in Warp and the saturated multiplication by 6 at the end of Sobel, which I don't like at all.
Old 6th June 2009, 23:40   #20  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
Quote:
Originally Posted by SEt
I'm already on Nehalem and when i change
Code:
movq	xmm7, qword ptr [edi+pitch*0-1]
movq	xmm4, qword ptr [edi+pitch*0+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
to
Code:
movq	xmm7, qword ptr [edi+pitch*1-1]
movq	xmm4, qword ptr [edi+pitch*1+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
I see a 1.4x slowdown in the profiler for the whole function.
Consider the memory address each is referencing and which cache line each uses. I have colour coded 3 different memory areas. In the fast case only 2 areas are used. Also, accessing data not aligned to 64 bits has a penalty, and a very big penalty when you cross a cache line (64 byte) boundary. For [edi+pitch*1-1] you may be slipping into the previous cache line (what address is in EDI?)