
Doom9's Forum > Capturing and Editing Video > Avisynth Development

Old 7th June 2009, 18:17   #21  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
Problem solved. Thanks to Dark Shikari for kicking me into actually looking at the performance counters, and to IanB for the idea of what could help.
Placing
Code:
mov	al, byte ptr [edi+pitch*1-8]
earlier changed nothing, but it gave me the idea that worked: I moved the problematic loads before the write of the previous iteration. After a few more optimizations the new code runs as fast as the old one, and sometimes even a bit faster. I will post it later when other things are done.

IanB, I started to use pitch*? instead of registers because the included source had no register for pitch*1.
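The reordering described above can be sketched in plain C. This is only an illustration of the scheduling idea (load for the next iteration issued before the store of the current one), not the actual filter code; the +1 "filter" is a placeholder:

```c
#include <stdint.h>

/* Sketch of the scheduling fix: hoist the next iteration's load above the
 * current iteration's store, so the load is not scheduled behind a write
 * it may conflict with. The +1 is a stand-in for the real per-pixel work. */
static void pipeline(uint8_t *dst, const uint8_t *src, int n)
{
    if (n <= 0)
        return;
    uint8_t cur = src[0];
    for (int i = 0; i < n; i++) {
        /* load for iteration i+1 first... */
        uint8_t next = (i + 1 < n) ? src[i + 1] : 0;
        /* ...then the store for iteration i */
        dst[i] = (uint8_t)(cur + 1);
        cur = next;
    }
}
```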
Old 14th June 2009, 20:11   #22  |  Link
Chainmax
Huh?
 
 
Join Date: Sep 2003
Location: Uruguay
Posts: 3,103
This is awesome: one of my favorite plugins for animated content, finally updated and bug-free! Thank you so much, SEt.

One question: in order to achieve the same effect as aWarpSharp(depth=16,cm=1) in MarcFD's original, one would have to use aWarpSharp(depth=16,chroma=4) in your version, right?
__________________
Read Decomb's readmes and tutorials, the IVTC tutorial and the capture guide in order to learn about combing and how to deal with it.

Last edited by Chainmax; 14th June 2009 at 20:31.
Old 15th June 2009, 10:22   #23  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
Yes, and both are the defaults, by the way.
It's not bug-free yet: as described here, it uses an incorrect position in the edge map. The next version will be correct.

I'm interested in the usefulness of the bm=0/1/2 parameter (the choice of internal blur type for the edge map): is it really needed now that you can use an external blur? There are probably enough blur filters for AviSynth already.
Old 17th June 2009, 17:00   #24  |  Link
sh0dan
Retired AviSynth Dev ;)
 
 
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
Quote:
Originally Posted by Dark Shikari View Post
Code:
movd	eax, xmm2
psrldq	xmm2, 4
pinsrw	xmm3, [eax+esi], 0
pinsrw	xmm4, [eax+edx], 0
movd	eax, xmm2
psrldq	xmm2, 4
pinsrw	xmm3, [eax+esi+1], 1
pinsrw	xmm4, [eax+edx+1], 1
movd	eax, xmm2
psrldq	xmm2, 4
pinsrw	xmm3, [eax+esi+2], 2
pinsrw	xmm4, [eax+edx+2], 2
movd	eax, xmm2
pinsrw	xmm3, [eax+esi+3], 3
pinsrw	xmm4, [eax+edx+3], 3
movd	eax, xmm7
psrldq	xmm7, 4
pinsrw	xmm3, [eax+esi+4], 4
pinsrw	xmm4, [eax+edx+4], 4
movd	eax, xmm7
psrldq	xmm7, 4
pinsrw	xmm3, [eax+esi+5], 5
pinsrw	xmm4, [eax+edx+5], 5
movd	eax, xmm7
psrldq	xmm7, 4
pinsrw	xmm3, [eax+esi+6], 6
pinsrw	xmm4, [eax+edx+6], 6
movd	eax, xmm7
pinsrw	xmm3, [eax+esi+7], 7
pinsrw	xmm4, [eax+edx+7], 7
mov	eax, [esp]
I'm going to have to start killing kittens if I keep seeing things like this.
I completely agree: while this might seem fast, it isn't. movd r32, xmm has a latency of 6 cycles on Core2; pinsrw has a latency of 4. Both are eons.

Store the contents of xmm2 and xmm7 into memory, do the lookups in scalar code, and read the results back:
Code:
movdqa	[temp1], xmm2	; Store all pixel offsets

; push eax, ebx, ecx on the stack if you already use them

xor	ebx, ebx
xor	ecx, ecx

mov	eax, [temp1]	; the offsets are dwords
mov	bx, [eax+esi]
mov	cx, [eax+edi]
mov	[temp2], bx
mov	[temp3], cx

mov	eax, [temp1+4]
mov	bx, [eax+esi]
mov	cx, [eax+edi]
mov	[temp2+2], bx
mov	[temp3+2], cx

(you get the picture: use a macro for nice code)

movdqa	xmm3, [temp2]
movdqa	xmm4, [temp3]
This way you only pay for the cache lookups plus a store-to-load forwarding size-mismatch penalty. And please, use palignr; it is much faster on Core2.
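The same spill/gather/reload pattern, written as plain C for clarity. This is a sketch of the idea, not the actual filter code; the names (temp1, out0/out1, row0/row1) mirror the assembly sketch and are purely illustrative:

```c
#include <stdint.h>
#include <string.h>

/* Spill/gather/reload: instead of extracting each offset from an SSE
 * register one lane at a time, store the whole register of offsets to
 * memory, do the lookups with plain scalar loads, and collect the
 * gathered words in one block to be reloaded as a vector. */
static void gather_words(const int32_t temp1[4],  /* spilled offsets   */
                         const uint8_t *row0,     /* base esi          */
                         const uint8_t *row1,     /* base edi          */
                         uint16_t out0[4],        /* future xmm3 words */
                         uint16_t out1[4])        /* future xmm4 words */
{
    for (int i = 0; i < 4; i++) {
        uint16_t w0, w1;
        memcpy(&w0, row0 + temp1[i], sizeof w0);  /* mov bx, [eax+esi]   */
        memcpy(&w1, row1 + temp1[i], sizeof w1);  /* mov cx, [eax+edi]   */
        out0[i] = w0;                             /* mov [temp2+2*i], bx */
        out1[i] = w1;                             /* mov [temp3+2*i], cx */
    }
    /* out0/out1 would then be loaded back into xmm3/xmm4 with movdqa. */
}
```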
__________________
Regards, sh0dan // VoxPod
Old 17th June 2009, 23:42   #25  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
Quote:
Originally Posted by sh0dan View Post
I completely agree: while this might seem fast, it isn't. movd r32, xmm has a latency of 6 cycles on Core2; pinsrw has a latency of 4. Both are eons.
Your numbers seem too high even for Core2. According to the http://www.agner.org/optimize/ tables, movd has a latency/throughput of 2/0.33 on both Core2 variants, and pinsrw 6/1.5 on 65 nm and 2/1 on 45 nm. On my Nehalem, Everest measures "MOVD r32, xmm + MOVD xmm, r32" at a latency of 4, "PEXTRW + PINSRW r32" at a latency of 1, movd throughput at 0.4, and pinsrw throughput at 0.66.

As for the speed of this part, it's actually the other way around: at first I thought it was painfully slow too, but the profiler says it's quite fast. Even changing movd/psrldq into movdqa/mov [+0/4/8/12] gives me a 10% speed drop for the whole function; changing pinsrw would likely cost even more. Of course, this is measured on Nehalem and not on the Core2 that everyone seems to love for reasons that are beyond me. Modern CPUs have taught me to trust only the profiler when optimizing, not what I think is faster, so I'm not going to write something that is "maybe better for Core2" when I have no means of confirming it by testing there.
Quote:
Originally Posted by sh0dan View Post
And please, use palignr; it is much faster on Core2.
palignr is definitely great in some cases, but guess how much difference I measure between the old unaligned-load hell of
Code:
movdqu	xmm6, [esi-6]
movdqu	xmm0, [esi+6]
pavgb	xmm6, xmm0
movdqu	xmm5, [esi-5]
movdqu	xmm7, [esi+5]
pavgb	xmm5, xmm7
movdqu	xmm4, [esi-4]
movdqu	xmm0, [esi+4]
pavgb	xmm4, xmm0
movdqu	xmm3, [esi-3]
movdqu	xmm7, [esi+3]
pavgb	xmm3, xmm7
movdqu	xmm2, [esi-2]
movdqu	xmm0, [esi+2]
pavgb	xmm2, xmm0
movdqu	xmm1, [esi-1]
movdqu	xmm7, [esi+1]
pavgb	xmm1, xmm7
movdqa	xmm0, [esi]
pavgb	xmm6, xmm5
pavgb	xmm4, xmm3
pavgb	xmm2, xmm1
pavgb	xmm6, xmm4
pavgb	xmm2, xmm0
pavgb	xmm6, xmm2
pavgb	xmm6, xmm2
movntdq	[esi+edi], xmm6
and the new (to-be-released) version
Code:
movdqa	xmm7, [esi+10h]
movdqa	xmm0, xmm6
movdqa	xmm2, xmm7
palignr	xmm0, xmm5, 10
palignr	xmm2, xmm6, 6
pavgb	xmm0, xmm2
movdqa	xmm3, xmm6
movdqa	xmm4, xmm7
palignr	xmm3, xmm5, 11
palignr	xmm4, xmm6, 5
pavgb	xmm3, xmm4
pavgb	xmm0, xmm3
movdqa	xmm1, xmm6
movdqa	xmm2, xmm7
palignr	xmm1, xmm5, 12
palignr	xmm2, xmm6, 4
pavgb	xmm1, xmm2
movdqa	xmm3, xmm6
movdqa	xmm4, xmm7
palignr	xmm3, xmm5, 13
palignr	xmm4, xmm6, 3
pavgb	xmm3, xmm4
pavgb	xmm1, xmm3
pavgb	xmm0, xmm1
movdqa	xmm1, xmm6
movdqa	xmm2, xmm7
palignr	xmm1, xmm5, 14
palignr	xmm2, xmm6, 2
pavgb	xmm1, xmm2
movdqa	xmm3, xmm6
movdqa	xmm4, xmm7
palignr	xmm3, xmm5, 15
palignr	xmm4, xmm6, 1
pavgb	xmm3, xmm4
pavgb	xmm1, xmm3
pavgb	xmm1, xmm6
movdqa	xmm5, xmm6
movdqa	xmm6, xmm7
pavgb	xmm0, xmm1
pavgb	xmm0, xmm1
movntdq	[esi+edi], xmm0
? The second version is only 10% faster (of course, on Nehalem again).
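For reference, here is a scalar reading of what both variants compute per pixel: a cascade of pavgb (byte average with rounding) operations over the +/-1..+/-6 neighbours and the centre, with the final pavgb applied twice so the inner, centre-weighted half counts more. The tree structure below is my interpretation of the first code block, and the function names are illustrative, not the actual filter code:

```c
#include <stdint.h>

/* pavgb: byte average with rounding up, as the SSE instruction computes it */
static unsigned avgb(unsigned a, unsigned b) { return (a + b + 1) >> 1; }

/* Scalar sketch of the blur cascade; p points at the centre pixel and
 * must have at least 6 valid pixels on each side. */
static uint8_t blur_cascade(const uint8_t *p)
{
    unsigned s[7];
    for (int k = 1; k <= 6; k++)
        s[k] = avgb(p[-k], p[k]);            /* pavgb of the +/-k pair */
    unsigned outer = avgb(avgb(s[6], s[5]), avgb(s[4], s[3]));
    unsigned inner = avgb(avgb(s[2], s[1]), p[0]);
    unsigned t = avgb(outer, inner);         /* pavgb xmm6, xmm2 */
    return (uint8_t)avgb(t, inner);          /* repeated pavgb biases the
                                                result toward the inner half */
}
```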
Old 17th June 2009, 23:50   #26  |  Link
Dark Shikari
x264 developer
 
 
Join Date: Sep 2005
Posts: 8,690
Quote:
Originally Posted by SEt View Post
Of course, this is measured on Nehalem and not on the Core2 that everyone seems to love for reasons that are beyond me.
Maybe because that's what they own? It will be years before Nehalem has a larger install base than Core 2.
Quote:
Originally Posted by SEt View Post
Modern CPUs have taught me to trust only the profiler when optimizing, not what I think is faster, so I'm not going to write something that is "maybe better for Core2" when I have no means of confirming it by testing there.
Then ask someone for SSH access; there are billions of Core 2s.

Quote:
Originally Posted by SEt View Post
? The second version is only 10% faster (of course, on Nehalem again).
Try running the first set of code when you're not on a cacheline, and the second set when you are, perhaps?

Last edited by Dark Shikari; 17th June 2009 at 23:52.
Old 18th June 2009, 00:23   #27  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
Quote:
Originally Posted by Dark Shikari View Post
Maybe because that's what they own? It will be years before the Nehalem has a higher install base than the Core 2.
I suspect the install base for P4/D is still higher than for Core2, but does that mean we should optimize primarily for P4? If I were to care about something else, it would be the Phenoms, and I suspect they behave similarly to Nehalem, not Core2.
Quote:
Originally Posted by Dark Shikari View Post
Then ask someone for SSH access, there are billions of Core 2s.
Good idea, but I'd rather spend my time on more important stuff right now, like actually releasing something.
Quote:
Originally Posted by Dark Shikari View Post
Try running the first set of code when you're not on a cacheline, and the second set when you are, perhaps?
I don't quite get your point here. I benchmark the code on real video processing. Also note the movdqa in the only memory load of the second variant.
Old 18th June 2009, 00:37   #28  |  Link
Dark Shikari
x264 developer
 
 
Join Date: Sep 2005
Posts: 8,690
Quote:
Originally Posted by SEt View Post
I suspect the install base for P4/D is still higher than for Core2, but does that mean we should optimize primarily for P4? If I were to care about something else, it would be the Phenoms, and I suspect they behave similarly to Nehalem, not Core2.
People who care about video processing speed are not using Pentium 4s.
Quote:
Originally Posted by SEt View Post
Good idea, but I'd rather spend my time on more important stuff right now, like actually releasing something.
Why is it that I can release code optimized for multiple modern CPUs while you can't?
Quote:
Originally Posted by SEt View Post
I don't quite get your point here. I benchmark the code on real video processing. Also note the movdqa in the only memory load of the second variant.
And? x264 performs "real video processing" and has separate code paths for when the loads fall across a cacheline.
Old 18th June 2009, 00:52   #29  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
Quote:
Originally Posted by Dark Shikari View Post
People who care about video processing speed are not using Pentium 4s.
People who really care about video processing speed would use Nehalem.
Quote:
Originally Posted by Dark Shikari View Post
Why is it that I can release code optimized for multiple modern CPUs while you can't?
It's a matter of preference. I think optimization is important, but it comes after features. Instead of even a 10% overall speed-up, I'd love to see an option in the next x264 to be 2x slower but produce 5% better quality or smaller file size.
Quote:
Originally Posted by Dark Shikari View Post
And? x264 performs "real video processing" and has separate code paths for when the loads fall across a cacheline.
How can dq loads aligned to 16 bytes cross a cacheline?
Old 18th June 2009, 01:20   #30  |  Link
Dark Shikari
x264 developer
 
 
Join Date: Sep 2005
Posts: 8,690
Quote:
Originally Posted by SEt View Post
How can dq loads aligned to 16 bytes cross a cacheline?
You said that your palignr-based code is only 10% faster.

What if it were 70% faster when crossing a cacheline, and 10% slower otherwise? That could average out to about 10% faster overall.

In that case, you'd want to use the unaligned code when not crossing a cacheline, and the palignr code when you are.
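A split/no-split dispatch like the one described here needs a cheap test for whether a given load straddles a line boundary. Assuming the 64-byte line size of Core2 and Nehalem, the check is plain address arithmetic (a sketch, not code from either project):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64  /* line size on Core2 and Nehalem */

/* Does a load of `size` bytes starting at `addr` cross a cacheline
 * boundary? A dispatcher would use this to choose between the
 * unaligned-load path and the palignr path. */
static bool crosses_cacheline(uintptr_t addr, size_t size)
{
    return (addr % CACHELINE) + size > CACHELINE;
}
```

Note that this also answers the question above: a 16-byte load at a 16-byte-aligned address never crosses a line, since 64 is a multiple of 16; only the movdqu loads can split.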
Old 18th June 2009, 02:01   #31  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
OK, now I understand you. I profiled that case, and it turned out that the old and new code run at exactly the same speed when not crossing a cacheline. So those 10% are what palignr wins on cacheline splits.
Old 18th June 2009, 02:11   #32  |  Link
Dark Shikari
x264 developer
 
 
Join Date: Sep 2005
Posts: 8,690
Quote:
Originally Posted by SEt View Post
OK, now I understand you. I profiled that case, and it turned out that the old and new code run at exactly the same speed when not crossing a cacheline. So those 10% are what palignr wins on cacheline splits.
By the way, the reason it's only 10% faster on Nehalem is that Nehalem has only a 2-cycle penalty for cacheline splits.

Core 2 Penryn has a ~12-14-cycle penalty, which probably means the SSSE3 code will be massively faster on Penryn; I'd guess at least a 50% benefit.
Old 18th June 2009, 02:36   #33  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
I know that much about my Nehalem, but I'd love to read something like http://www.agner.org/optimize/ about its microarchitecture and instruction timings (regrettably, the information there only goes up to Core2 for now).
Old 18th June 2009, 02:47   #34  |  Link
Dark Shikari
x264 developer
 
 
Join Date: Sep 2005
Posts: 8,690
Quote:
Originally Posted by SEt View Post
I know that much about my Nehalem, but I'd love to read something like http://www.agner.org/optimize/ about its microarchitecture and instruction timings (regrettably, the information there only goes up to Core2 for now).
Intel's latest optimization guide has Nehalem documentation, but not timings.

Mubench has mostly-accurate Nehalem timings.
Old 18th June 2009, 14:56   #35  |  Link
Gokumon
Guest
 
Posts: n/a
Quote:
Originally Posted by Dark Shikari View Post
Why is it that I can release code optimized for multiple modern CPUs while you can't?
Because you get paid to work on x264 while SEt doesn't get paid to work on his aWarpSharp rewrite?
Old 19th June 2009, 21:15   #36  |  Link
SEt
Registered User
 
Join Date: Aug 2007
Posts: 374
Finally, an update:
  • renamed the main filter from aWarpSharp to aWarpSharp2 to reduce confusion with the original aWarpSharp
  • fixed wrong offsets in Warp
  • added a new blur type: produces better quality, but is around 2.5x slower
  • blur is more precise around frame borders if SSSE3 is available
  • some optimizations, mostly noticeable on Core2
  • removed support for undocumented parameters of the original aWarpSharp
Old 19th June 2009, 21:18   #37  |  Link
Dark Shikari
x264 developer
 
 
Join Date: Sep 2005
Posts: 8,690
Quote:
Originally Posted by Gokumon View Post
Because you get paid to work on x264
I do? That's news to me, because I haven't gotten a check.
Old 19th June 2009, 23:41   #38  |  Link
LoRd_MuldeR
Software Developer
 
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 12,906
Quote:
Originally Posted by Dark Shikari View Post
I do? That's news to me, because I haven't gotten a check.
What about your (previous) engagement at Avail Media?
__________________
There was of course no way of knowing whether you were being watched at any given moment.
How often, or on what system, the Thought Police plugged in on any individual wire was guesswork.


Old 19th June 2009, 23:43   #39  |  Link
Dark Shikari
x264 developer
 
 
Join Date: Sep 2005
Posts: 8,690
Quote:
Originally Posted by LoRd_MuldeR View Post
What about your (previous) engagement at Avail Media?
As it happens, summer internships end at the end of the summer.
Old 22nd June 2009, 20:17   #40  |  Link
Gokumon
Guest
 
Posts: n/a
Quote:
Originally Posted by Dark Shikari View Post
As it happens, summer internships end at the end of the summer.
Okay, from your posts it seemed to be an actual full-time position, not an internship. Let's put it this way instead: some people don't necessarily have as much time and effort to spend writing assembly for multiple CPUs as others do.