Old 15th January 2017, 11:07   #21  |  Link
feisty2
I'm Siri
 
Join Date: Oct 2012
Location: void
Posts: 2,633
guess I can actually optimize my floating point MVTools now with AVX and FMA, but I don't really want to...
Old 15th January 2017, 11:13   #22  |  Link
Mystery Keeper
Beyond Kawaii
 
Join Date: Feb 2008
Location: Russia
Posts: 724
Quote:
Originally Posted by feisty2 View Post
guess I can actually optimize my floating point MVTools now with AVX and FMA, but I don't really want to...
Pretty please with a cherry on the top?
__________________
...desu!
Old 15th January 2017, 11:34   #23  |  Link
Myrsloik
Professional Code Monkey
 
Join Date: Jun 2003
Location: Kinnarps Chair
Posts: 2,548
Quote:
Originally Posted by feisty2 View Post
r3:
AVX and FMA3 optimizations, will probably crash if your CPU is a predecessor of the Haswell microarchitecture (just use r2 in that case)
How much faster is it?
__________________
VapourSynth - proving that scripting languages and video processing isn't dead yet
Old 15th January 2017, 13:14   #24  |  Link
feisty2
I'm Siri
 
Join Date: Oct 2012
Location: void
Posts: 2,633
Quote:
Originally Posted by Myrsloik View Post
How much faster is it?
r2 runs at 476.65fps at 1920x1080
r3 runs at 489.65fps at 1920x1080

makes no sense!!!
shouldn't AVX and FMA be at least 8x faster than x87???

I think there's probably something wrong with my compiler (VS2017), can you please compile the source code with your compiler and do a test as well?

to switch avx/fma back to c++

Line 225: FixFadesPrepare_AVX(); -> FixFadesPrepare();

Line 231: FixFadesMode0_AVX_FMA(); -> FixFadesMode0();

Line 234: FixFadesMode1_AVX_FMA(); -> FixFadesMode1();

Line 237: FixFadesMode2_AVX_FMA(); -> FixFadesMode2();
Old 15th January 2017, 13:16   #25  |  Link
feisty2
I'm Siri
 
Join Date: Oct 2012
Location: void
Posts: 2,633
Quote:
Originally Posted by Mystery Keeper View Post
Pretty please with a cherry on the top?
someday, I'll add that to my to-do list...
Old 15th January 2017, 15:59   #26  |  Link
cork_OS
Registered User
 
Join Date: Mar 2016
Posts: 160
Quote:
Originally Posted by feisty2 View Post
shouldn't AVX and FMA be at least 8x faster than x87???
1. It's limited by FPU IPC (real number and width of add/mult/etc. units).
2. Why x87? Are you using 80 bits of precision?
__________________
I'm infected with poor sources.
Old 15th January 2017, 16:08   #27  |  Link
Myrsloik
Professional Code Monkey
 
Join Date: Jun 2003
Location: Kinnarps Chair
Posts: 2,548
Quote:
Originally Posted by feisty2 View Post
r2 runs at 476.65fps at 1920x1080
r3 runs at 489.65fps at 1920x1080

makes no sense!!!
shouldn't AVX and FMA be at least 8x faster than x87???

I think there's probably something wrong with my compiler (VS2017), can you please compile the source code with your compiler and do a test as well?

to switch avx/fma back to c++

Line 225: FixFadesPrepare_AVX(); -> FixFadesPrepare();

Line 231: FixFadesMode0_AVX_FMA(); -> FixFadesMode0();

Line 234: FixFadesMode1_AVX_FMA(); -> FixFadesMode1();

Line 237: FixFadesMode2_AVX_FMA(); -> FixFadesMode2();
YOU ARE NOT USING X87!!! You compiled it as x64 code and the ABI (more or less) requires it to use sse2 instructions to implement it. Obviously at least the scalar float versions. It's even possible that it managed to auto vectorize like half of this code since most of it is just mindless read and sum. Look at the generated code instead of asking us about what you, YOURSELF, told the compiler to do.

Your assumption still wouldn't be true about x87 vs avx. For simple algorithms you run into memory bw limitations long before you see the glory of sse (avx is even more rare to matter). Modern cpus are just too good.
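To put a rough number on the memory bandwidth point, here is a back-of-the-envelope sketch. It assumes one 32-bit float plane per frame, which is only an assumption for illustration (the thread does not state the clip format), and `plane_read_bytes_per_second` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes that must be read per second to stream one plane
// of the given size at the given frame rate. Writes and extra planes are ignored.
double plane_read_bytes_per_second(std::uint32_t width, std::uint32_t height,
                                   std::uint32_t bytes_per_sample, double fps) {
    assert(fps >= 0.0);
    const double bytes_per_frame =
        static_cast<double>(width) * height * bytes_per_sample;
    return bytes_per_frame * fps;
}
```

Under that assumption a 1920x1080 float plane is 8,294,400 bytes, so the 476.65 fps reported for r2 works out to roughly 3.95 GB/s of reads for that single plane, before counting writes or any further planes.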
__________________
VapourSynth - proving that scripting languages and video processing isn't dead yet
Old 15th January 2017, 16:53   #28  |  Link
feisty2
I'm Siri
 
Join Date: Oct 2012
Location: void
Posts: 2,633
Quote:
Originally Posted by Myrsloik View Post
YOU ARE NOT USING X87!!! You compiled it as x64 code and the ABI (more or less) requires it to use sse2 instructions to implement it. Obviously at least the scalar float versions. It's even possible that it managed to auto vectorize like half of this code since most of it is just mindless read and sum. Look at the generated code instead of asking us about what you, YOURSELF, told the compiler to do.
the "Look at the generated code" part is a bit too hard for me tho... Staring at thousands of lines of generated assembly is far beyond my programming skill since I'm not a professionally trained programmer...
Quote:
Originally Posted by Myrsloik View Post
Your assumption still wouldn't be true about x87 vs avx. For simple algorithms you run into memory bw limitations long before you see the glory of sse (avx is even more rare to matter). Modern cpus are just too good.
which means it's pretty much pointless to manually optimize simple plugins?
Old 15th January 2017, 16:58   #29  |  Link
Myrsloik
Professional Code Monkey
 
Join Date: Jun 2003
Location: Kinnarps Chair
Posts: 2,548
Quote:
Originally Posted by feisty2 View Post
the "Look at the generated code" part is a bit too hard for me tho... Staring at thousands of lines of generated assembly is far beyond my programming skill since I'm not a professionally trained programmer...

which means it's pretty much pointless to manually optimize simple plugins?
It's not thousands of lines. The interesting part is about 50 instructions at most. How to do shit properly: set a breakpoint in one of your inner loops, and when it's hit simply open the disassembly window (Debug > Windows > Disassembly). Look at like 10 instructions and see what it picked. Repeat for all critical loops. Done.

And yes, you're wasting perfectly good internet space by trying to optimize such simple things. Oh, and if you can't read simple disassembled stuff you shouldn't be writing optimizations like these in the first place.
__________________
VapourSynth - proving that scripting languages and video processing isn't dead yet
Old 18th January 2017, 23:57   #30  |  Link
MonoS
Registered User
 
Join Date: Aug 2012
Posts: 203
Hi feisty, if you want some help with your optimization task, feel free to ask me directly (send me a PM and we can choose a more direct means of communication).

As Myrsloik said, your code was probably just being auto-vectorized by the compiler. GCC, for example, will try to vectorize with the best instruction set available on your machine if you compile with -O3 and -march=native. MSVC probably does something similar.
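As a rough illustration of the kind of loop compilers tend to vectorize on their own, consider the sketch below. `blend_line` is a hypothetical function, not code from the plugin. Building it with something like `g++ -O3 -march=native -S` and reading the loop body shows which instructions were picked. The example is deliberately elementwise: GCC usually does not vectorize a plain float sum reduction unless reassociation is allowed (for example with -ffast-math).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical elementwise loop: each output depends only on the inputs at
// the same index, so the compiler is free to process several floats at once.
void blend_line(float* dst, const float* a, const float* b,
                float weight, std::size_t n) {
    assert(dst != nullptr && a != nullptr && b != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = a[i] * weight + b[i];
    }
}
```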

First things first: I'll base all my statements on the Intel intrinsics reference you can find here https://software.intel.com/sites/lan...trinsicsGuide/ (Agner's sheets would be better but are not as easy to use). I'll also use the official Intel optimization manual http://www.intel.com/content/www/us/...on-manual.html.

There are some things you could write better just looking at ProcessLine_AVX_FMA.
First, you should avoid all those stores. Without an assembly listing it's a bit difficult to know, but they are probably using most of the load/store ports, making it harder for the processor to load data (the load ports are shared with the store ones). You are also clogging the front end with useless instructions to decode.
Divisions are a NO on any kind of core. On Haswell they have 21 cycles of latency (and you can issue another one only after 13 cycles). You can turn a division into a multiplication by the reciprocal of the value; multiplications are WAAAAAY cheaper than divisions (5 cycles latency, 0.5 cycles reciprocal throughput). Even better, you can predivide YMMField and YMMReference and make a single call to fmadd (_mm256_fmadd_ps).
I'd also suggest removing the _mm256_set_ps at the start and replacing it with a _mm256_set1_ps.
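A minimal sketch of the reciprocal-plus-FMA idea, assuming an x86 build. The kernel name, its parameters and the computation itself are hypothetical stand-ins, not the plugin's ProcessLine_AVX_FMA. The division is done once outside the loop and the constants are broadcast with _mm256_set1_ps.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

// GCC and Clang need a per-function target attribute (or -mavx -mfma on the
// command line) before AVX/FMA intrinsics may be used; MSVC needs neither.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_TARGET_AVX_FMA __attribute__((target("avx,fma")))
#else
#define SKETCH_TARGET_AVX_FMA
#endif

// Hypothetical kernel: dst[i] = src[i] / divisor + offset, with the division
// replaced by one reciprocal computed outside the loop and one FMA per vector.
SKETCH_TARGET_AVX_FMA
void scale_and_offset_avx_fma(float* dst, const float* src, std::size_t n,
                              float divisor, float offset) {
    assert(divisor != 0.0f);
    const float inv = 1.0f / divisor;          // the only division
    const __m256 vinv = _mm256_set1_ps(inv);   // broadcast one value to 8 lanes
    const __m256 voff = _mm256_set1_ps(offset);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m256 v = _mm256_loadu_ps(src + i);
        // v * inv + offset in a single instruction
        _mm256_storeu_ps(dst + i, _mm256_fmadd_ps(v, vinv, voff));
    }
    for (; i < n; ++i) {                       // scalar tail for n % 8 leftovers
        dst[i] = src[i] * inv + offset;
    }
}
```

Multiplying by a reciprocal can differ from a true division in the last bit, so this trade is only appropriate where that rounding difference is acceptable. The function must only be called on a CPU that actually supports AVX and FMA.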

If assembly listings are hard for you to reason about, try IACA (also from Intel) https://software.intel.com/en-us/art...-code-analyzer . This tool will give you a better understanding of where your code is failing; it helped me a lot to understand where I was doing things wrong with my optimizations in mvtools.

There are other places where you could write better SIMD code, but for now I'll stop.

I strongly disagree with Myrsloik about this being a waste of time. I find SIMD optimization very fun to program, and simple code teaches the basics of SIMD programming. And yes, sometimes compiler-generated assembly is better than handmade code, but people improve over time.

For a tutorial from an "expert", I MUST suggest you watch some of the early episodes of Handmade Hero in which @cmuratori explains how to SIMD-optimize math-heavy code. You can find the playlist here https://www.youtube.com/playlist?lis...7Vysr1nJcX_BW_ ; watch episodes 112 to 121 (337 is a bonus, because I'm pretty far behind on the series).

With this I'm going to sleep, after recovering ALL THIS TEXT from RAM because my browser froze just before I finished this message. I hope I have been of some help.
Old 27th January 2017, 16:16   #31  |  Link
feisty2
I'm Siri
 
Join Date: Oct 2012
Location: void
Posts: 2,633
r4 runs at 626.89 fps at 1920x1080, 150.24 fps faster than the compiler generated code (no idea if optimized or not)
@MonoS
you were right that _mm256_div_ps was the bottleneck there, I removed it and got a noticeable performance boost, thx.

could anyone show me how to do that "opt" parameter thing? like call the avx function if it's supported by the CPU, and call the C++ function if not.. and please tell me I don't have to write asm for it
Old 27th January 2017, 18:51   #32  |  Link
TheFluff
Excessively jovial fellow
 
Join Date: Jun 2004
Location: rude
Posts: 1,100
You need to write asm to identify the CPU, but it's trivial. Just steal from the VS source tree:

cpu.asm
cpufeatures.c
cpufeatures.h
Old 27th January 2017, 18:53   #33  |  Link
Myrsloik
Professional Code Monkey
 
Join Date: Jun 2003
Location: Kinnarps Chair
Posts: 2,548
Quote:
Originally Posted by TheFluff View Post
You need to write asm to identify the CPU. From the VS source tree:

cpu.asm
cpufeatures.c
cpufeatures.h
Just copy those files into your own project. Done.
__________________
VapourSynth - proving that scripting languages and video processing isn't dead yet
Old 28th January 2017, 18:54   #34  |  Link
feisty2
I'm Siri
 
Join Date: Oct 2012
Location: void
Posts: 2,633
r5:
new parameter "opt"
opt: call the fastest possible functions if opt=True, else call the C++ functions.
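A minimal sketch of how an "opt" switch like this can be wired up, assuming the kernel is chosen once through a function pointer. All names here are hypothetical, and both kernels are plain C++ stand-ins so the sketch runs on any CPU.

```cpp
#include <cassert>
#include <cstddef>

using LineFn = void (*)(float* dst, const float* src, std::size_t n);

// Stand-in for the plain C++ kernel (hypothetical name and body).
void process_line_cpp(float* dst, const float* src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i] * 0.5f;
}

// Stand-in for the optimized kernel. A real build would use AVX/FMA
// intrinsics here; this version just computes the same result.
void process_line_fast(float* dst, const float* src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i] * 0.5f;
}

// Pick the kernel once (for example at filter creation), not per pixel.
// The fast path is used only if the user asked for it AND the CPU supports it.
LineFn select_line_fn(bool opt, bool cpu_has_avx_fma) {
    return (opt && cpu_has_avx_fma) ? process_line_fast : process_line_cpp;
}
```

Storing the returned pointer in the filter's instance data keeps the per-frame code free of feature checks.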

Quote:
Originally Posted by Myrsloik View Post
Just copy those files into your own project. Done.
Quote:
Originally Posted by TheFluff View Post
You need to write asm to identify the CPU, but it's trivial. Just steal from the VS source tree:

cpu.asm
cpufeatures.c
cpufeatures.h
Tried to use them but got stuck at "cpu.asm", apparently it requires nasm or yasm and these 2 are real pain in the ass, I wasted hours trying to make them work on VS2017 and all I got was millions of errors popping out relentlessly...

I even thought about translating "cpu.asm" to masm, and obviously I would have to translate "x86inc.asm" along with it, and that's a big NO.

so I merged those 3 into one file, "cpufeatures.hpp", and wrote my own version of CPUFeatures, with absolutely no trace of (literal) asm.

these 2 functions:
Code:
vs_cpu_cpuid()
vs_cpu_xgetbv()
have been integrated into the compiler,
Code:
Visual Studio 2017:
vs_cpu_cpuid() -> __cpuid()
GCC:
vs_cpu_cpuid() -> __get_cpuid()

vs_cpu_xgetbv() -> _xgetbv() //defined in immintrin.h
so I think there's no need to write literal asm for them.
and I canceled the "getCPUFeatures()" function, it could be simply integrated into the constructor since I'm using a C++ header.

so instead of
Code:
CPUFeatures CPU;
getCPUFeatures(&CPU);
if (CPU.fma3)
	xxx
it's now cleaner like
Code:
auto CPU = CPUFeatures();
if (CPU.fma3)
	xxx
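For reference, a minimal sketch of what such a check involves. It is not the contents of cpufeatures.hpp; the function and constant names are hypothetical. The bit positions are the documented ones: CPUID leaf 1 ECX bit 12 (FMA), bit 27 (OSXSAVE), bit 28 (AVX), and XCR0 bits 1 and 2 for XMM/YMM state. On GCC and Clang the sketch falls back to a one-line inline asm for XGETBV, on the assumption that no `_xgetbv` intrinsic is available there.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#include <immintrin.h>
#define SKETCH_X86 1
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_X86 1
#endif

constexpr std::uint32_t kFma     = 1u << 12;  // CPUID.1:ECX
constexpr std::uint32_t kOsxsave = 1u << 27;  // CPUID.1:ECX
constexpr std::uint32_t kAvx     = 1u << 28;  // CPUID.1:ECX
constexpr std::uint64_t kXcr0SseAvx = 0x6;    // XMM (bit 1) and YMM (bit 2) state

// Pure decoding step: AVX+FMA are usable only if the CPU reports them AND
// the OS saves the YMM registers on context switches.
bool avx_fma_usable(std::uint32_t leaf1_ecx, std::uint64_t xcr0) {
    const std::uint32_t need = kFma | kOsxsave | kAvx;
    return (leaf1_ecx & need) == need && (xcr0 & kXcr0SseAvx) == kXcr0SseAvx;
}

// Hypothetical replacement for a getCPUFeatures()-style check.
bool detect_avx_fma() {
#if defined(SKETCH_X86)
    std::uint32_t ecx = 0;
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);
    ecx = static_cast<std::uint32_t>(regs[2]);
#else
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (!__get_cpuid(1, &a, &b, &c, &d)) return false;
    ecx = c;
#endif
    if (!(ecx & kOsxsave)) return false;  // XGETBV would fault without OSXSAVE
#if defined(_MSC_VER)
    const std::uint64_t xcr0 = _xgetbv(0);
#else
    unsigned int lo = 0, hi = 0;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    const std::uint64_t xcr0 = (static_cast<std::uint64_t>(hi) << 32) | lo;
#endif
    return avx_fma_usable(ecx, xcr0);
#else
    return false;  // not an x86 build
#endif
}
```

Keeping the bit decoding in a separate pure function makes it testable without depending on the machine the test runs on.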
Old 29th January 2017, 02:02   #35  |  Link
Are_
Registered User
 
Join Date: Jun 2012
Location: Ibiza, Spain
Posts: 321
Right now it looks like it's not compiling anymore with GCC.
Old 29th January 2017, 07:35   #36  |  Link
feisty2
I'm Siri
 
Join Date: Oct 2012
Location: void
Posts: 2,633
Quote:
Originally Posted by Are_ View Post
Right now it looks like it's not compiling anymore with GCC.
to make it work with GCC

cpufeatures.hpp:
line 1: #include <intrin.h> -> #include <cpuid.h>
line 6: return static_cast<int32_t>(val); -> return static_cast<uint32_t>(val);
line 28: __cpuid(Registers, 1); -> __get_cpuid(1, &eax, &ebx, &ecx, &edx);
line 43:__cpuid(Registers, 7); -> __get_cpuid(7, &eax, &ebx, &ecx, &edx);
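Instead of editing lines by hand, the compiler difference can be hidden behind one wrapper. The sketch below is an assumption about how that could look, not the actual cpufeatures.hpp; `query_cpuid` is a hypothetical name. It passes the subleaf explicitly because CPUID leaf 7 expects ECX = 0 as input, which, as far as I know, plain `__get_cpuid` in older GCC headers does not set.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define SKETCH_CPUID_MSVC 1
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_CPUID_GNU 1
#endif

// Hypothetical wrapper for basic CPUID leaves: fills regs with EAX, EBX, ECX,
// EDX for the given leaf/subleaf. Returns false if CPUID is unavailable or
// the leaf is beyond the highest basic leaf the CPU reports.
bool query_cpuid(std::uint32_t leaf, std::uint32_t subleaf,
                 std::uint32_t regs[4]) {
    assert(regs != nullptr);
#if defined(SKETCH_CPUID_MSVC)
    int r[4] = {0, 0, 0, 0};
    __cpuid(r, 0);  // EAX = highest supported basic leaf
    if (leaf > static_cast<std::uint32_t>(r[0])) return false;
    __cpuidex(r, static_cast<int>(leaf), static_cast<int>(subleaf));
    for (int i = 0; i < 4; ++i) regs[i] = static_cast<std::uint32_t>(r[i]);
    return true;
#elif defined(SKETCH_CPUID_GNU)
    const unsigned int max_leaf = __get_cpuid_max(0, nullptr);
    if (max_leaf == 0 || leaf > max_leaf) return false;
    unsigned int a = 0, b = 0, c = 0, d = 0;
    __cpuid_count(leaf, subleaf, a, b, c, d);  // sets ECX = subleaf before CPUID
    regs[0] = a; regs[1] = b; regs[2] = c; regs[3] = d;
    return true;
#else
    (void)leaf; (void)subleaf;
    return false;  // not an x86 build
#endif
}
```

With this, the two call sites become `query_cpuid(1, 0, regs)` and `query_cpuid(7, 0, regs)` on both compilers.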
Old 29th January 2017, 07:58   #37  |  Link
feisty2
I'm Siri
 
Join Date: Oct 2012
Location: void
Posts: 2,633
Quote:
Originally Posted by Are_ View Post
Right now it looks like it's not compiling anymore with GCC.
or just use this version of cpufeatures.hpp
Old 29th January 2017, 13:03   #38  |  Link
MonoS
Registered User
 
Join Date: Aug 2012
Posts: 203
Glad to be of help
Old 29th January 2017, 15:03   #39  |  Link
sl1pkn07
Pajas Mentales...
 
Join Date: Dec 2004
Location: Spanishtán
Posts: 496
Quote:
Originally Posted by feisty2 View Post
or just use this version of cpufeatures.hpp
nope

https://sl1pkn07.wtf/paste/view/b0cc57fb
__________________
[AUR] Vapoursynth Stuff
[AUR] Avisynth Stuff
Old 29th January 2017, 16:37   #40  |  Link
feisty2
I'm Siri
 
Join Date: Oct 2012
Location: void
Posts: 2,633
Quote:
Originally Posted by sl1pkn07 View Post
working now?

I'm sure I fixed everything except the "xgetbv" part, that function has been integrated into immintrin.h in Visual Studio 2017 but apparently not in GCC...
You'll have to compile cpu.asm yourself with yasm or nasm and I can't help you with that... I wouldn't have written my own Visual Studio version of CPUFeatures if I could handle them..