Old 23rd May 2009, 22:36   #1  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
aWarpSharp2 rewrite of aWarpSharp

Current version: 2012.03.28

Previous versions:
2009.06.19
2009.05.24

aWarpSharp by MarcFD is a nice plugin (especially for tasks like halo removal), but it has some bugs and tends to produce green artifacts on the image borders. Other WarpSharp plugins produced worse results for me, so I decided to rewrite the aWarpSharp algorithm with better border handling and optimizations for modern CPUs.

Besides the complete algorithm, aWarpSharp2, its stages are also available separately as aSobel, aBlur, aWarp and aWarp4. This way you can apply advanced filtering (like MDegrain) to the edge mask before passing it to the warp stage, to get a more stable result.

Good usage examples:
Code:
aWarp4(Spline36Resize(width*4, height*4, 0.375, 0.375), aSobel().aBlur(), depth=3)
aWarp4(nnedi3_rpow2(rfactor=2).Spline36Resize(width*4, height*4, 0.25, 0.25), aSobel().aBlur(), depth=3)
aWarp4(nnedi3_rpow2(rfactor=2).nnedi3_rpow2(rfactor=2), aSobel().aBlur(), depth=2)
Note that upsampling for aWarp4 should be left-top aligned, so Spline36Resize(width*4, height*4) or nnedi3_rpow2(rfactor=4) won't produce correct results.
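
One way to see where the 0.375 and 0.25 shifts in the examples come from is the small calculation below. It is only a sketch: it assumes the resizer maps pixel centres, so that output pixel i of an Nx upscale samples source position (i + 0.5)/N - 0.5 + src_left, and it assumes the nnedi3_rpow2(rfactor=2) output is already left-top aligned, as the examples imply. The helper name left_top_shift is made up for illustration.

```python
from fractions import Fraction

def left_top_shift(factor):
    """src_left/src_top that turns a centre-aligned Nx upscale into a left-top aligned one.

    Assumed model: output pixel i samples source position
        (i + 0.5) / factor - 0.5 + shift
    and left-top alignment means it should sample i / factor instead.
    Solving for shift gives 1/2 - 1/(2 * factor).
    """
    return Fraction(1, 2) - Fraction(1, 2 * factor)

print(float(left_top_shift(4)))  # shift for a direct 4x Spline36Resize
print(float(left_top_shift(2)))  # shift for a 2x Spline36Resize after nnedi3_rpow2(rfactor=2)
```

Under these assumptions the two calls give 0.375 and 0.25, matching the values used in the examples.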

For an explanation of the options, and for how values used with MarcFD's aWarpSharp map to the new ones, read the included aWarpSharp.txt.

Binary patched Toon-v1.0 to use aWarpSharp2 instead of aWarpSharp: Toon-v1.1

Last edited by SEt; 28th March 2012 at 02:07.
Old 23rd May 2009, 23:29   #2  |  Link
ChaosKing
Registered User
 
Join Date: Dec 2005
Location: Germany
Posts: 328
wow nice...

Just made a quick test.

awarpsharp(154,2,20)#new
awarpsharp(20,2,0.6)#old (marcFD)

1. new, 2. old


As you can see, the border is no longer green. The pictures look very similar and the plugin seems to be about 20% faster on my Pentium D.

Very good job SEt, I waited so long for a bug-free aWarpSharp.
__________________
Search and denoise
Old 24th May 2009, 01:17   #3  |  Link
Adub
Fighting spam with a fish
 
Join Date: Sep 2005
Posts: 2,685
You wouldn't happen to have tested the speed now would you, ChaosKing?

I would do it myself, but my rig is currently encoding a Blu-ray and will be doing so for at least the next 17 hours.
__________________
FAQs:Bond's AVC/H.264 FAQ
Site:Adubvideo
Old 2nd June 2009, 03:19   #4  |  Link
7ekno
Registered User
 
Join Date: Jul 2007
Posts: 389
Thanks SEt !!

Tried it as a drop-in replacement for the original aWarpSharp.dll, but MCTDenoise gives errors because it does not support some of the parameters passed (dropping the original aWarpSharp.dll back in resolves it) ...

It seems to be about 20-30% faster, so well done !!

Tek
Old 2nd June 2009, 22:28   #6  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
All parameters of the original aWarpSharp are supported, but some are renamed - read aWarpSharp.txt for the mapping.
Old 3rd June 2009, 04:04   #7  |  Link
lansing
Registered User
 
Join Date: Sep 2006
Posts: 925
Thanks for the rewrite. I think sticking with the old naming would be more convenient for us.
Old 3rd June 2009, 04:29   #8  |  Link
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,690
Code:
				movdqu	xmm2, [esi-1]
				movdqa	xmm3, [esi]
				movdqu	xmm4, [esi+1]
				movdqu	xmm5, [esi+edx-1]
				movdqa	xmm6, [esi+edx]
				movdqu	xmm7, [esi+edx+1]
...
				movdqu	xmm1, [esi+eax-1]
				movdqu	xmm3, [esi+eax+1]
This is what palignr was made for; SSSE3-ifying this with palignr will avoid all the unaligned loads nicely. If you retain loads between loop iterations, you can reduce the number of memory accesses, too.
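
To illustrate the suggestion, here is a byte-level model of what PALIGNR computes: the two 16-byte operands are concatenated, shifted right by a byte count, and the low 16 bytes are kept. It is a sketch of the instruction's semantics only, not of the plugin's code; the row contents and the choice of three aligned blocks are made up for the demonstration.

```python
def palignr(hi, lo, shift):
    """Model of SSSE3 PALIGNR on 16-byte values.

    Concatenates hi:lo (lo occupies the lower addresses), shifts the
    32-byte value right by `shift` bytes and keeps the low 16 bytes.
    """
    assert len(hi) == 16 and len(lo) == 16 and 0 <= shift <= 16
    return (lo + hi)[shift:shift + 16]

# Hypothetical row: three consecutive aligned 16-byte blocks.
row = bytes(range(48))
prev_blk, cur_blk, next_blk = row[0:16], row[16:32], row[32:48]
esi = 16  # offset of the aligned block being processed

# The unaligned loads [esi-1] and [esi+1] rebuilt from aligned blocks only.
left = palignr(cur_blk, prev_blk, 15)
right = palignr(next_blk, cur_blk, 1)
print(left == row[esi - 1:esi + 15], right == row[esi + 1:esi + 17])
```

If the aligned blocks are kept in registers from one loop iteration to the next, each iteration only needs to load one new block per row, which is the reduction in memory accesses mentioned above.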

Code:
				movdqu	xmm6, [esi-6]
				movdqu	xmm0, [esi+6]
				pavgb	xmm6, xmm0
				movdqu	xmm5, [esi-5]
				movdqu	xmm7, [esi+5]
				pavgb	xmm5, xmm7
				movdqu	xmm4, [esi-4]
				movdqu	xmm0, [esi+4]
				pavgb	xmm4, xmm0
				movdqu	xmm3, [esi-3]
				movdqu	xmm7, [esi+3]
				pavgb	xmm3, xmm7
				movdqu	xmm2, [esi-2]
				movdqu	xmm0, [esi+2]
				pavgb	xmm2, xmm0
				movdqu	xmm1, [esi-1]
				movdqu	xmm7, [esi+1]
				pavgb	xmm1, xmm7
				movdqa	xmm0, [esi]
Did someone say "made for palignr"?

Code:
				movd	eax, xmm2
				psrldq	xmm2, 4
				pinsrw	xmm3, [eax+esi], 0
				pinsrw	xmm4, [eax+edx], 0
				movd	eax, xmm2
				psrldq	xmm2, 4
				pinsrw	xmm3, [eax+esi+1], 1
				pinsrw	xmm4, [eax+edx+1], 1
				movd	eax, xmm2
				psrldq	xmm2, 4
				pinsrw	xmm3, [eax+esi+2], 2
				pinsrw	xmm4, [eax+edx+2], 2
				movd	eax, xmm2
				pinsrw	xmm3, [eax+esi+3], 3
				pinsrw	xmm4, [eax+edx+3], 3
				movd	eax, xmm7
				psrldq	xmm7, 4
				pinsrw	xmm3, [eax+esi+4], 4
				pinsrw	xmm4, [eax+edx+4], 4
				movd	eax, xmm7
				psrldq	xmm7, 4
				pinsrw	xmm3, [eax+esi+5], 5
				pinsrw	xmm4, [eax+edx+5], 5
				movd	eax, xmm7
				psrldq	xmm7, 4
				pinsrw	xmm3, [eax+esi+6], 6
				pinsrw	xmm4, [eax+edx+6], 6
				movd	eax, xmm7
				pinsrw	xmm3, [eax+esi+7], 7
				pinsrw	xmm4, [eax+edx+7], 7
				mov	eax, [esp]
I'm going to have to start killing kittens if I keep seeing things like this.
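
For readers following the listing, this is a rough model of what the pinsrw sequence appears to do. It is an interpretation, not the author's description: it assumes xmm2 and xmm7 together hold eight 32-bit offsets, one per output lane, and that esi and edx point to two source rows. Each pinsrw then fetches a 16-bit word, i.e. two adjacent bytes, at the lane's own offset. The function name and test data are hypothetical.

```python
def gather_pairs(row, offsets):
    """For lane k, fetch the two adjacent bytes starting at row[offsets[k] + k].

    Mirrors `pinsrw xmm3, [eax+esi+k], k` with eax = offsets[k]:
    a 16-bit load covers the byte at that address and the next one.
    """
    return [(row[off + k], row[off + k + 1]) for k, off in enumerate(offsets)]

# Made-up data: one source row and eight per-lane offsets.
row = list(range(100, 132))
offsets = [0, 1, 2, 0, 3, 1, 0, 2]
pairs = gather_pairs(row, offsets)
print(pairs[0], pairs[3], pairs[7])
```

Because every lane reads from a data-dependent address, the eight loads cannot be merged into one vector load, which is why the sequence is so long.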

Code:
movq	xmm7, qword ptr [edi+ebx-1] // one line above actual position, but it gives 1.4x speedup
How about you figure out why it does?
Old 4th June 2009, 16:09   #9  |  Link
owais
Registered User
 
Join Date: Mar 2009
Posts: 44
Help!! With this newly updated, famous plugin I am getting something like this image:

I used
Code:
aWarpSharp(depth=12,blur=4,thresh=51,chroma=1)
Am I doing something wrong?

The colours are dancing.

With the old plugin I am getting a normal picture, but with the green lines.

I used
Code:
aWarpSharp(depth=12,blurlevel=4,thresh=0.2,cm=1)


Edit:

I have found out a little: it is due to chroma=1. Basically I don't know what chroma is, because I am new to video (I just started in March and have learned a lot).

So far chroma=2 or 3 works well for me, and 4 too; the problem is with 1. chroma=0 was giving me a black-and-white picture, hehe.

Last edited by owais; 4th June 2009 at 16:35.
Old 6th June 2009, 13:10   #10  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
Dark Shikari, I know not everything is optimally written, but I thought it better to release a working version now than a super-optimized one never. I know that the horizontal blur is made for palignr and will look into it when I have time, but I have no idea how to save the kittens, or why loading the correct line gives a 1.4x slowdown for the whole function, including those awful pinsrw that should be much more time-consuming than just an unaligned load from an additional memory location.

owais, have you tried reading all of aWarpSharp.txt? It explains that cm=1 of the original aWarpSharp is chroma=4 in mine, and what the chroma values mean.
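
As a rough illustration of the kind of translation involved, the sketch below converts two of the old arguments. It is hedged heavily: the authoritative mapping is in aWarpSharp.txt, which is not quoted in this thread. The thresh scaling by 256 is only an inference that happens to fit the two pairs of calls posted above (0.6 with 154, 0.2 with 51), and the helper name is hypothetical. Only cm=1 to chroma=4 is stated outright.

```python
def map_old_to_new(thresh, cm):
    """Hypothetical helper: translate two old aWarpSharp arguments.

    Assumptions (not taken from aWarpSharp.txt):
      * thresh scales by 256, which fits 0.6 -> 154 and 0.2 -> 51;
      * only cm=1 -> chroma=4 is stated in the thread, so any other
        cm value is rejected rather than guessed.
    """
    if cm != 1:
        raise ValueError("mapping for this cm value is not given in the thread")
    return {"thresh": round(thresh * 256), "chroma": 4}

print(map_old_to_new(0.2, 1))
```

For the call posted by owais, this would suggest keeping thresh=51 but passing chroma=4 instead of chroma=1.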
Old 6th June 2009, 15:39   #11  |  Link
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,690
Quote:
Originally Posted by SEt View Post
Dark Shikari, I know not everything is optimally written, but I thought it better to release a working version now than a super-optimized one never. I know that the horizontal blur is made for palignr and will look into it when I have time, but I have no idea how to save the kittens, or why loading the correct line gives a 1.4x slowdown for the whole function, including those awful pinsrw that should be much more time-consuming than just an unaligned load from an additional memory location.
"Should be much more time consuming?"

Does that imply you tested it, and found it to be faster?

If it's faster, I'm going to be inclined to blame cacheline-split. Test on an AMD chip or Nehalem and watch the penalties melt away.
Old 6th June 2009, 19:34   #12  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
I'm already on Nehalem, and when I change
Code:
movq	xmm7, qword ptr [edi+pitch*0-1]
movq	xmm4, qword ptr [edi+pitch*0+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
to
Code:
movq	xmm7, qword ptr [edi+pitch*1-1]
movq	xmm4, qword ptr [edi+pitch*1+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
I see a 1.4x slowdown in the profiler for the whole function.
Old 6th June 2009, 19:37   #13  |  Link
Dark Shikari
x264 developer
 
Dark Shikari's Avatar
 
Join Date: Sep 2005
Posts: 8,690
Quote:
Originally Posted by SEt View Post
I'm already on Nehalem, and when I change
Code:
movq	xmm7, qword ptr [edi+pitch*0-1]
movq	xmm4, qword ptr [edi+pitch*0+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
to
Code:
movq	xmm7, qword ptr [edi+pitch*1-1]
movq	xmm4, qword ptr [edi+pitch*1+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
I see a 1.4x slowdown in the profiler for the whole function.
I was referring to the pinsrw with regard to the cacheline split.

If you're getting such a large slowdown merely by changing that, you should try to figure out why. Performance counters might be useful for analyzing that.
Old 6th June 2009, 19:56   #14  |  Link
Fizick
AviSynth plugger
 
Join Date: Nov 2003
Location: Russia
Posts: 2,183
Can I ask why you changed the meaning of the parameters? It is confusing, and not "fully compatible with original aWarpSharp".
If you prefer the new parameters, please use new parameter names (or a new name for the plugin).
__________________
My Avisynth plugins are now at http://avisynth.org.ru and mirror at http://avisynth.nl/users/fizick
I usually do not provide a technical support in private messages.
Old 6th June 2009, 22:20   #15  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
I played with performance counters for some time, and here is what I found (I removed all pinsrw for the tests, as they don't change the situation):
The global slowdown is produced by
Code:
movq	xmm7, qword ptr [edi+pitch*1-1]
but not by
Code:
movq	xmm4, qword ptr [edi+pitch*1+1]
It results in a spike of L1D.REPL and huge spikes of L1D.M_REPL, L1D.M_EVICT and L1D.M_SNOOP_EVICT in that area (also ILD_STALL.ANY, but I don't think that's interesting).
I've tried changing the only writing instruction here from movq to movdq2q, movntq, but that changed nothing.


Fizick, I think the situation is similar to MVTools 1 and 2: it's fully compatible in terms of available functionality, and the effective ranges of the parameters are supersets of the original ones. I know I should probably change the name to aWarpSharp2, but that looks kind of strange next to aSobel, aBlur, aWarp. In truth it's more like a beta release to me, due to the mentioned wrong offsets in Warp and the saturated multiplication by 6 at the end of Sobel, which I don't like at all.
Old 6th June 2009, 23:40   #16  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
Quote:
Originally Posted by SEt
I'm already on Nehalem, and when I change
Code:
movq	xmm7, qword ptr [edi+pitch*0-1]
movq	xmm4, qword ptr [edi+pitch*0+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
to
Code:
movq	xmm7, qword ptr [edi+pitch*1-1]
movq	xmm4, qword ptr [edi+pitch*1+1]
movq	xmm1, qword ptr [edi+pitch*0]
movq	xmm2, qword ptr [edi+pitch*2]
I see a 1.4x slowdown in the profiler for the whole function.
Consider the memory address each is referencing and which cache line each uses. I have colour coded 3 different memory areas. In the fast case only 2 areas are used. Also, accessing data not aligned to 64 bits has a penalty, and a very big penalty when you cross a cache line (64 byte) boundary. For [edi+pitch*1-1] you may be slipping into the previous cache line (what address is in EDI?)
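
The cache line argument can be checked with a little arithmetic. The sketch below counts how often an 8-byte load at offsets -1, 0 and +1 straddles a 64-byte boundary while the loop counter advances by 8. The row base address and the row length of 640 bytes are made-up values, and the row start is assumed to be 64-byte aligned.

```python
CACHE_LINE = 64

def crosses_line(addr, size=8):
    """True if a `size`-byte access starting at addr spans two cache lines."""
    return addr // CACHE_LINE != (addr + size - 1) // CACHE_LINE

row_base = 0x1000  # hypothetical, 64-byte aligned start of a row

# Count split loads per offset over one row, stepping 8 bytes per iteration.
splits = {
    delta: sum(crosses_line(row_base + edi + delta) for edi in range(0, 640, 8))
    for delta in (-1, 0, +1)
}
print(splits)
```

Under these assumptions the -1 and +1 offsets each split a cache line ten times per row (once every 64 bytes) and the aligned offset never does, so line splits alone would penalise the -1 and +1 loads equally.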
Old 7th June 2009, 00:12   #17  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
The code reads 3 lines of one frame while writing one line to another frame in a simple loop. edi is a global counter, increased by 8, that is used as the offset for all sources and the destination. It's not a cache line split problem, as [edi+pitch*1+1] and [edi+pitch*0-1] are OK but [edi+pitch*1-1] is not; also, the magnitude of the impact is too big, and on Nehalem such penalties are small. It looks like some kind of cache (address?) conflict, going by the descriptions of the performance counters that produce spikes:
REPL - Counts the number of lines brought into the L1 data cache.
M_REPL - Counts the number of modified lines brought into the L1 data cache.
M_EVICT - Counts the number of modified lines evicted from the L1 data cache due to replacement.
M_SNOOP_EVICT - Counts the number of modified lines evicted from the L1 data cache due to snoop HITM intervention.

But the code linearly reads from one location and linearly writes to another in a simple loop.

EDIT: It does indeed seem to be a cache address conflict, as the lower 16 bits of [edi+pitch*1] and [output] are the same, but that doesn't give me any idea how to fix it besides caching [edi+pitch*1-1] from the previous iteration (as both memory locations are what I get from Avisynth).
EDIT2: And it seems to be an L2-3 problem, with a scenario something like this:
Cache lines in L1 are allocated independently, but when the output L1 line is written to L2+, the next reference to [edi+pitch*1-1] is mistaken for an access to the same location because of that single -1 byte, which results in the L1 cache line ping-pong hell seen in the counters.
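
The hypothesis can be made concrete with a small model. This is only a sketch of the address arithmetic, not a claim about what the CPU actually does: it compares an 8-byte load with the 8-byte store of the previous loop iteration using only the low address bits. The two base addresses are invented so that their lower 16 bits match, as observed for [edi+pitch*1] and [output]; the number of compared bits is a parameter.

```python
def low_bits_overlap(load_addr, load_len, store_addr, store_len, bits=16):
    """True if the two byte ranges share a byte when only the low `bits`
    address bits are compared."""
    mask = (1 << bits) - 1
    load = {(load_addr + i) & mask for i in range(load_len)}
    store = {(store_addr + i) & mask for i in range(store_len)}
    return bool(load & store)

# Invented addresses that differ only above bit 16.
src_row = 0x01234000  # stands in for the source base + pitch*1
output = 0x04574000   # stands in for the destination base
edi = 64              # loop offset, advanced by 8 per iteration

prev_store = output + edi - 8  # the 8 bytes written by the previous iteration
hits = {
    delta: low_bits_overlap(src_row + edi + delta, 8, prev_store, 8)
    for delta in (-1, 0, +1)
}
print(hits)
```

In this model only the -1 load touches a byte that aliases with the previous store, which is consistent with the +1 load being harmless. It would also fit the workaround of carrying the needed byte over from the previous iteration, since that removes the overlapping load entirely.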

Last edited by SEt; 7th June 2009 at 00:48.
Old 7th June 2009, 03:40   #18  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
What does adding this:-
Code:
mov	al, byte ptr [edi+pitch*1-8]
a few lines earlier into your code do?

And is pitch a constant or a register, i.e. are you hiding relevant code from us?
Old 7th June 2009, 11:10   #19  |  Link
Leak
ffdshow/AviSynth wrangler
 
Join Date: Feb 2003
Location: Austria
Posts: 2,441
Quote:
Originally Posted by IanB View Post
And is pitch a constant or a register, i.e. are you hiding relevant code from us?
Well, the code is in the archive he linked to in his first post, so I'd think he'd have a hard time really hiding it from you...

np: Plastikman - I Don't Know (Closer)
__________________
now playing: [artist] - [track] ([album])
Old 7th June 2009, 15:32   #20  |  Link
IanB
Avisynth Developer
 
Join Date: Jan 2003
Location: Melbourne, Australia
Posts: 3,168
@Leak,

I asked because
movq xmm7, qword ptr [edi+pitch*1-1]
does not appear in the code, but
movq xmm7, qword ptr [edi+ebx-1]
does appear in the code.

And when people want help, I like to make sure nothing is clouding the issue.