Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion. Before you start posting please read the forum rules. By posting to this forum you agree to abide by the rules. |
7th June 2009, 18:17 | #21 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
Problem solved. Thanks to Dark Shikari for kicking me into actually looking at the performance counters, and to IanB for the idea of what could help.
The fix was placing this load earlier: Code:
mov al, byte ptr [edi+pitch*1-8]
IanB, I started to use pitch*? instead of registers because the included source had no register for pitch*1. |
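As a C-level illustration of the pitch-scaled addressing above, a frame plane is a row-major byte array with `pitch` bytes per row, so `[edi + pitch*1 - 8]` is "one row below the current pointer, 8 bytes back". This is a sketch; the names are illustrative, not from the plugin source:

```c
#include <assert.h>
#include <stdint.h>

/* C view of the asm's [edi + pitch*1 - 8] addressing: indexing a
   row-major plane by (row, col) with a byte stride (pitch). */
static inline uint8_t pixel_at(const uint8_t *plane, int pitch,
                               int row, int col)
{
    return plane[row * pitch + col];
}
```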
14th June 2009, 20:11 | #22 | Link |
Huh?
Join Date: Sep 2003
Location: Uruguay
Posts: 3,103
|
This is awesome - one of my favorite plugins to use on animated content, finally updated and bug-free! Thank you so much, SEt.
One question: in order to achieve the same effect as aWarpSharp(depth=16,cm=1) in MarcFD's original, one would have to use aWarpSharp(depth=16,chroma=4) in your version, right?
__________________
Read Decomb's readmes and tutorials, the IVTC tutorial and the capture guide in order to learn about combing and how to deal with it. Last edited by Chainmax; 14th June 2009 at 20:31. |
15th June 2009, 10:22 | #23 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
Yes, and both are the defaults, btw.
It's not bug-free yet - as described here, it uses an incorrect position in the edge map. The next version will be correct. I'm interested in the usefulness of the bm=0/1/2 parameter (the choice of internal blur type for the edge map) - is it really needed now that you can use an external one? There are probably enough blurs for AviSynth already. |
17th June 2009, 17:00 | #24 | Link | |
Retired AviSynth Dev ;)
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
|
Quote:
Store the contents of xmm2 and xmm7 in memory, do the lookups in scalar assembly and read the results back: Code:
movdqa [temp1], xmm2   ; store all pixels
; push eax, ebx, ecx on the stack, if you use them already
xor    eax, eax
xor    ebx, ebx
xor    ecx, ecx
mov    ax, [temp1]
mov    bx, [eax+esi]
mov    cx, [eax+edi]
mov    [temp2], bx
mov    [temp3], cx
mov    ax, [temp1+2]
mov    bx, [eax+esi]
mov    cx, [eax+edi]
mov    [temp2+2], bx
mov    [temp3+2], cx
; (you get the picture - use a macro, for nice code)
movdqa xmm3, [temp2]
movdqa xmm4, [temp3]
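The spill/lookup/reload pattern above can be modeled in scalar C. This is a simplified sketch, not the plugin's code: the buffer and table names follow the asm snippet, the loop stands in for the suggested macro, and byte tables are used where the asm loads 16-bit values:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the spill/lookup/reload idea: the 8 word lanes of xmm2 go to
   memory (temp1), each lane indexes two tables (the esi and edi pointers
   in the asm), and the results are packed into temp2/temp3 so they can
   be reloaded into xmm3/xmm4 with two aligned vector loads. */
static void gather_lanes(const uint16_t temp1[8],
                         const uint8_t *esi_tab, const uint8_t *edi_tab,
                         uint16_t temp2[8], uint16_t temp3[8])
{
    for (int i = 0; i < 8; i++) {
        uint16_t idx = temp1[i];   /* mov ax, [temp1 + 2*i] */
        temp2[i] = esi_tab[idx];   /* mov bx, [eax+esi]; mov [temp2+2*i], bx */
        temp3[i] = edi_tab[idx];   /* mov cx, [eax+edi]; mov [temp3+2*i], cx */
    }
}
```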
__________________
Regards, sh0dan // VoxPod |
|
17th June 2009, 23:42 | #25 | Link | |
Registered User
Join Date: Aug 2007
Posts: 374
|
Quote:
For the speed of this part it's actually the other way around - at the beginning I thought it would be painfully slow too, but the profiler says it's quite fast. Changing even movd/psrldq into movdqa/mov [+0/4/8/12] gives me a 10% speed drop for the whole function. Changing pinsrw would likely cost even more. Of course, that's measured on Nehalem and not on the Core 2 that everyone seems to love for a reason that's beyond me. Modern CPUs have taught me to trust only the profiler when optimizing, not what you think is faster, so I'm not going to write something that "may be better for Core 2" when I have no means of confirming it by testing there. palignr is definitely great for some cases, but guess how much difference I measure between the old unaligned hell of Code:
movdqu xmm6, [esi-6]
movdqu xmm0, [esi+6]
pavgb  xmm6, xmm0
movdqu xmm5, [esi-5]
movdqu xmm7, [esi+5]
pavgb  xmm5, xmm7
movdqu xmm4, [esi-4]
movdqu xmm0, [esi+4]
pavgb  xmm4, xmm0
movdqu xmm3, [esi-3]
movdqu xmm7, [esi+3]
pavgb  xmm3, xmm7
movdqu xmm2, [esi-2]
movdqu xmm0, [esi+2]
pavgb  xmm2, xmm0
movdqu xmm1, [esi-1]
movdqu xmm7, [esi+1]
pavgb  xmm1, xmm7
movdqa xmm0, [esi]
pavgb  xmm6, xmm5
pavgb  xmm4, xmm3
pavgb  xmm2, xmm1
pavgb  xmm6, xmm4
pavgb  xmm2, xmm0
pavgb  xmm6, xmm2
pavgb  xmm6, xmm2
movntdq [esi+edi], xmm6
and this: Code:
movdqa  xmm7, [esi+10h]
movdqa  xmm0, xmm6
movdqa  xmm2, xmm7
palignr xmm0, xmm5, 10
palignr xmm2, xmm6, 6
pavgb   xmm0, xmm2
movdqa  xmm3, xmm6
movdqa  xmm4, xmm7
palignr xmm3, xmm5, 11
palignr xmm4, xmm6, 5
pavgb   xmm3, xmm4
pavgb   xmm0, xmm3
movdqa  xmm1, xmm6
movdqa  xmm2, xmm7
palignr xmm1, xmm5, 12
palignr xmm2, xmm6, 4
pavgb   xmm1, xmm2
movdqa  xmm3, xmm6
movdqa  xmm4, xmm7
palignr xmm3, xmm5, 13
palignr xmm4, xmm6, 3
pavgb   xmm3, xmm4
pavgb   xmm1, xmm3
pavgb   xmm0, xmm1
movdqa  xmm1, xmm6
movdqa  xmm2, xmm7
palignr xmm1, xmm5, 14
palignr xmm2, xmm6, 2
pavgb   xmm1, xmm2
movdqa  xmm3, xmm6
movdqa  xmm4, xmm7
palignr xmm3, xmm5, 15
palignr xmm4, xmm6, 1
pavgb   xmm3, xmm4
pavgb   xmm1, xmm3
pavgb   xmm1, xmm6
movdqa  xmm5, xmm6
movdqa  xmm6, xmm7
pavgb   xmm0, xmm1
pavgb   xmm0, xmm1
movntdq [esi+edi], xmm0 |
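A scalar model of palignr shows why the second version needs no movdqu at all: `palignr dst, src, n` yields 16 bytes taken from the concatenation [src | dst] starting at byte n, so with xmm5 = [esi-16] and xmm6 = [esi], `palignr xmm6, xmm5, 10` reproduces the unaligned window [esi-6] from aligned data only. A sketch of that semantics (not the plugin's code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* palignr dst, src, n: concatenate [src | dst] (src in the low half),
   shift right by n bytes, keep the low 16 bytes. */
static void palignr_model(const uint8_t dst[16], const uint8_t src[16],
                          int n, uint8_t out[16])
{
    uint8_t cat[32];
    memcpy(cat, src, 16);      /* low 16 bytes  */
    memcpy(cat + 16, dst, 16); /* high 16 bytes */
    memcpy(out, cat + n, 16);
}
```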
|
17th June 2009, 23:50 | #26 | Link | ||
x264 developer
Join Date: Sep 2005
Posts: 8,666
|
Quote:
Quote:
Try running the first set of code when you're not on a cacheline, and the second set when you are, perhaps?
__________________
Follow x264 development progress | akupenguin quotes | x264 git status ffmpeg and x264-related consulting/coding contracts | Doom10 Last edited by Dark Shikari; 17th June 2009 at 23:52. |
||
18th June 2009, 00:23 | #27 | Link | ||
Registered User
Join Date: Aug 2007
Posts: 374
|
Quote:
Quote:
I don't quite get your point here. I benchmark the code on real video processing. Also, notice the movdqa in the only memory load of the second variant. |
||
18th June 2009, 00:37 | #28 | Link | ||
x264 developer
Join Date: Sep 2005
Posts: 8,666
|
Quote:
Quote:
|
||
18th June 2009, 00:52 | #29 | Link | ||
Registered User
Join Date: Aug 2007
Posts: 374
|
Quote:
Quote:
How can dq loads aligned to 16 bytes cross a cacheline? |
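For reference, the address arithmetic behind this exchange can be sketched quickly: a 16-byte load straddles a 64-byte line only when its first and last byte land in different 64-byte blocks, which requires a start address that is not 16-byte aligned (as with the movdqu reads in the older version). A minimal check:

```c
#include <assert.h>
#include <stdint.h>

/* A 16-byte load starting at addr crosses a 64-byte cacheline iff the
   64-byte block of its first byte differs from that of its last byte. */
static int crosses_cacheline(uintptr_t addr)
{
    return (addr / 64) != ((addr + 15) / 64);
}
```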
||
18th June 2009, 01:20 | #30 | Link |
x264 developer
Join Date: Sep 2005
Posts: 8,666
|
You said that your palignr-based code is only 10% faster.
What if it was 70% faster when on a cacheline, and 10% slower otherwise? This would average out to 10% faster overall. In that case, you'd want to use the unaligned code when not on a cacheline, and the palignr code when on a cacheline. |
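Treating those percentages as throughput multipliers, the arithmetic can be made concrete. The 25% split below is an assumed figure, chosen only to show one mix under which a 1.70x case and a 0.90x case blend to the observed 1.10x:

```c
#include <assert.h>

/* Blend two per-case speedups by the fraction of work in the fast case. */
static double blended_speedup(double frac_fast, double fast, double slow)
{
    return frac_fast * fast + (1.0 - frac_fast) * slow;
}
```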
18th June 2009, 02:11 | #32 | Link | |
x264 developer
Join Date: Sep 2005
Posts: 8,666
|
Quote:
The Core 2 Penryn has a ~12-14-cycle penalty, which will probably mean that SSSE3 code will go massively faster on Penryn; I'd guess at least 50% benefit. |
|
18th June 2009, 02:36 | #33 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
I know that much about my Nehalem. But I would love to read something like http://www.agner.org/optimize/ about its microarchitecture and instruction timings (regrettably, the information there only goes up to Core 2 for now).
|
18th June 2009, 02:47 | #34 | Link | |
x264 developer
Join Date: Sep 2005
Posts: 8,666
|
Quote:
Mubench has mostly-accurate Nehalem timings. |
|
19th June 2009, 21:15 | #36 | Link |
Registered User
Join Date: Aug 2007
Posts: 374
|
Finally an update:
|
19th June 2009, 23:41 | #38 | Link |
Software Developer
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
|
What about your (previous) engagement at Avail Media?
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊ |
22nd June 2009, 20:17 | #40 | Link |
Guest
Posts: n/a
|
Okay, from what I read of your posts it seemed to be an actual full-time position, not an internship. Let's put it this way instead, then: some people don't necessarily have as much time and effort to spend on writing assembly for multiple CPUs as other people may.
|