Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion. Before you start posting please read the forum rules. By posting to this forum you agree to abide by the rules. |
15th January 2017, 11:34 | #23 | Link |
Professional Code Monkey
Join Date: Jun 2003
Location: Kinnarps Chair
Posts: 2,555
|
How much faster is it?
__________________
VapourSynth - proving that scripting languages and video processing isn't dead yet |
15th January 2017, 13:14 | #24 | Link |
I'm Siri
Join Date: Oct 2012
Location: void
Posts: 2,633
|
r2 runs at 476.65fps at 1920x1080
r3 runs at 489.65fps at 1920x1080
makes no sense!!! shouldn't AVX and FMA be at least 8x faster than x87??? I think there's probably something wrong with my compiler (VS2017), can you please compile the source code with your compiler and run a test as well?
to switch AVX/FMA back to C++:
Line 225: FixFadesPrepare_AVX(); -> FixFadesPrepare();
Line 231: FixFadesMode0_AVX_FMA(); -> FixFadesMode0();
Line 234: FixFadesMode1_AVX_FMA(); -> FixFadesMode1();
Line 237: FixFadesMode2_AVX_FMA(); -> FixFadesMode2(); |
15th January 2017, 16:08 | #27 | Link | |
Professional Code Monkey
Join Date: Jun 2003
Location: Kinnarps Chair
Posts: 2,555
|
Quote:
Your assumption still wouldn't be true about x87 vs AVX. For simple algorithms you run into memory bandwidth limitations long before you see the glory of SSE (and it's even rarer for AVX to matter). Modern CPUs are just too good.
|
15th January 2017, 16:53 | #28 | Link | |
I'm Siri
Join Date: Oct 2012
Location: void
Posts: 2,633
|
Quote:
which means it's pretty much pointless to manually optimize simple plugins? |
|
15th January 2017, 16:58 | #29 | Link | |
Professional Code Monkey
Join Date: Jun 2003
Location: Kinnarps Chair
Posts: 2,555
|
Quote:
And yes, you're wasting perfectly good internet space by trying to optimize such simple things. Oh, and if you can't read simple disassembled stuff you shouldn't be writing optimizations like these in the first place.
|
18th January 2017, 23:57 | #30 | Link |
Registered User
Join Date: Aug 2012
Posts: 203
|
Ohi feisty, if you want some help with your optimization task, feel free to ask me directly (send me a PM and we can choose a more direct means of communication).
As Myrsloik said, your code was probably just being autovectorized by the compiler. GCC, for example, will try to vectorize using the best instruction set you have available if you compile with -O3 and -march=native. VC probably does the same.
First things first: I'll base all my statements on the Intel reference you can find here https://software.intel.com/sites/lan...trinsicsGuide/ (Agner's sheets would be better but are not as easy to use), and I'll also use the official Intel optimization guide http://www.intel.com/content/www/us/...on-manual.html.
There are some things you could write better just looking at ProcessLine_AVX_FMA.
First, you should avoid all those stores. Without an assembly listing it's a bit difficult to know, but they are probably using most of the load/store ports, making it more difficult for the processor to load data (the load ports are shared with the store ones), and you are also clogging the frontend with useless instructions to decode.
Divisions are a NO on any kind of core: on Haswell they have 21 cycles of latency (and you can issue another one after 13 cycles). You can turn the division into a multiplication by taking the reciprocal of the value; multiplications are WAAAAAY cheaper than divisions (5 cycles latency, 0.5 throughput). Even better, you can pre-divide YMMField and YMMReference and make a single call to fmadd.
I'd also suggest removing the _mm256_set_ps at the start and replacing it with a _mm256_set1_ps.
If assembly listings are hard for you to reason about, try IACA (also from Intel) https://software.intel.com/en-us/art...-code-analyzer . This software will give you a better understanding of where your code is failing; it helped me a lot to understand where I was doing things wrong with my optimization in mvtools.
There are other places where you could write better SIMD code, but for now I'll stop.
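The division advice above can be sketched roughly as follows. This is not the FixFades code: the per-pixel formula `(src * gain + field) / reference` and the names `ScaleLine_Div`, `ScaleLine_Fast` and `ScaleLine_AVX_FMA` are made up for illustration. The idea is to compute the reciprocal once, fold it into both constants, broadcast them with `_mm256_set1_ps`, and leave a single `_mm256_fmadd_ps` in the loop. The AVX path is only enabled for GCC/Clang on x86 here (via a target attribute and `__builtin_cpu_supports`) so that the sketch still builds and runs elsewhere.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_AVX_PATH 1
#endif

// Baseline: one division per pixel. The formula is hypothetical.
void ScaleLine_Div(const float* src, float* dst, std::size_t n,
                   float gain, float field, float reference) {
    assert(reference != 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = (src[i] * gain + field) / reference;
}

#ifdef SKETCH_AVX_PATH
// Vector loop: no division, one fused multiply-add per 8 floats.
__attribute__((target("avx,fma")))
static void ScaleLine_AVX_FMA(const float* src, float* dst, std::size_t n,
                              float scale, float offset) {
    const __m256 vscale  = _mm256_set1_ps(scale);   // broadcast, not _mm256_set_ps
    const __m256 voffset = _mm256_set1_ps(offset);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m256 x = _mm256_loadu_ps(src + i);
        _mm256_storeu_ps(dst + i, _mm256_fmadd_ps(x, vscale, voffset));
    }
    for (; i < n; ++i)                               // scalar tail
        dst[i] = src[i] * scale + offset;
}
#endif

// Same result as ScaleLine_Div up to rounding, with the division hoisted out.
void ScaleLine_Fast(const float* src, float* dst, std::size_t n,
                    float gain, float field, float reference) {
    assert(reference != 0.0f);
    const float inv    = 1.0f / reference;  // the only division, once per call
    const float scale  = gain * inv;        // pre-divided multiplier
    const float offset = field * inv;       // pre-divided addend
#ifdef SKETCH_AVX_PATH
    if (__builtin_cpu_supports("avx") && __builtin_cpu_supports("fma")) {
        ScaleLine_AVX_FMA(src, dst, n, scale, offset);
        return;
    }
#endif
    for (std::size_t i = 0; i < n; ++i)     // plain C++ fallback
        dst[i] = src[i] * scale + offset;
}
```

Multiplying by a reciprocal is not bit-identical to dividing in general, so the two functions can differ in the last bit for arbitrary inputs; whether that matters depends on the filter.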
I strongly disagree with Myrsloik about this being a waste of time. I find SIMD optimization very fun to program, and simple code teaches the basics of SIMD programming. Yes, sometimes compiler-generated assembly is better than handmade assembly, but people improve over time. For a tutorial from an "expert", I MUST suggest you watch some of the early episodes of Handmade Hero, in which @cmuratori explains how to SIMD-optimize math-heavy code. You can find the playlist here https://www.youtube.com/playlist?lis...7Vysr1nJcX_BW_ (watch episodes 112 to 121; 337 is a bonus because I'm pretty far behind in the series). With this I'm going to sleep, after having recovered ALL THIS TEXT from RAM because my browser froze just before I finished this message. I hope I've been of some help. |
27th January 2017, 16:16 | #31 | Link |
I'm Siri
Join Date: Oct 2012
Location: void
Posts: 2,633
|
r4 runs at 626.89 fps at 1920x1080, 150.24 fps faster than the compiler-generated code (no idea if it was optimized or not)
@MonoS you were right that _mm256_div_ps was the bottleneck there. I removed it and got a noticeable performance boost, thx. Could anyone show me how to do that "opt" parameter thing? Like calling the AVX function if it's supported by the CPU and the C++ function if not.. and please tell me I don't have to write asm for it |
27th January 2017, 18:51 | #32 | Link |
Excessively jovial fellow
Join Date: Jun 2004
Location: rude
Posts: 1,100
|
You need to write asm to identify the CPU, but it's trivial. Just steal from the VS source tree:
cpu.asm cpufeatures.c cpufeatures.h |
27th January 2017, 18:53 | #33 | Link | |
Professional Code Monkey
Join Date: Jun 2003
Location: Kinnarps Chair
Posts: 2,555
|
Quote:
|
28th January 2017, 18:54 | #34 | Link | |
I'm Siri
Join Date: Oct 2012
Location: void
Posts: 2,633
|
r5:
new parameter "opt"
opt: call the fastest possible functions if opt=True, else call the C++ functions.
Quote:
I even thought about translating "cpu.asm" to MASM, and obviously I would have to translate "x86inc.asm" along with it, and that's a big NO. So I merged those 3 into one file, "cpufeatures.hpp", and wrote my own version of CPUFeatures, with absolutely no trace of (literal) asm.
These 2 functions:
Code:
vs_cpu_cpuid()
vs_cpu_xgetbv()
Code:
Visual Studio 2017: vs_cpu_cpuid() -> __cpuid()
GCC: vs_cpu_cpuid() -> __get_cpuid()
vs_cpu_xgetbv() -> _xgetbv() //defined in immintrin.h
And I dropped the "getCPUFeatures()" function; it could simply be integrated into the constructor since I'm using a C++ header.
So instead of
Code:
CPUFeatures CPU;
getCPUFeatures(&CPU);
if (CPU.fma3) xxx
Code:
auto CPU = CPUFeatures();
if (CPU.fma3) xxx |
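A rough sketch of how an "opt" switch like the one described above could pick a function at runtime. Everything here is hypothetical: `CPUFeaturesSketch` is a stand-in for the thread's CPUFeatures (it uses the GCC/Clang builtin `__builtin_cpu_supports` rather than cpufeatures.hpp, and reports nothing on other compilers), and both kernels are plain C++ placeholders so the example runs anywhere.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for CPUFeatures: detection happens in the constructor.
struct CPUFeaturesSketch {
    bool avx = false;
    bool fma3 = false;
    CPUFeaturesSketch() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
        // Assumption: compiler builtin used instead of the thread's header.
        avx  = __builtin_cpu_supports("avx") != 0;
        fma3 = __builtin_cpu_supports("fma") != 0;
#endif
    }
};

using LineFn = void (*)(const float* src, float* dst, std::size_t n);

// Plain C++ kernel (placeholder arithmetic).
void ProcessLine_CPP(const float* src, float* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i] * 0.5f;
}

// Placeholder for the hand-written kernel; a real build would use intrinsics.
void ProcessLine_AVX_FMA_Stub(const float* src, float* dst, std::size_t n) {
    ProcessLine_CPP(src, dst, n);
}

// opt=false always takes the C++ path; opt=true takes the fast path
// only when the CPU reports both AVX and FMA3.
LineFn SelectProcessLine(bool opt, const CPUFeaturesSketch& cpu) {
    if (opt && cpu.avx && cpu.fma3) return ProcessLine_AVX_FMA_Stub;
    return ProcessLine_CPP;
}

// Detect once, select once per frame, then call through the pointer per row.
void ProcessFrame(bool opt, const float* src, float* dst,
                  std::size_t width, std::size_t height) {
    static const CPUFeaturesSketch cpu;
    const LineFn fn = SelectProcessLine(opt, cpu);
    assert(fn != nullptr);
    for (std::size_t y = 0; y < height; ++y)
        fn(src + y * width, dst + y * width, width);
}
```

Selecting the pointer outside the row loop keeps the feature check out of the hot path, so the only per-row cost of the switch is one indirect call.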
|
29th January 2017, 07:35 | #36 | Link |
I'm Siri
Join Date: Oct 2012
Location: void
Posts: 2,633
|
to make it work with GCC
cpufeatures.hpp:
line 1: #include <intrin.h> -> #include <cpuid.h>
line 6: return static_cast<int32_t>(val); -> return static_cast<uint32_t>(val);
line 28: __cpuid(Registers, 1); -> __get_cpuid(1, &eax, &ebx, &ecx, &edx);
line 43: __cpuid(Registers, 7); -> __get_cpuid(7, &eax, &ebx, &ecx, &edx); |
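Since the two compilers spell the CPUID call differently, one option is to hide the difference behind a small wrapper instead of editing individual lines. The sketch below is only an illustration: `CpuidRegs`, `QueryCpuid`, `TestBit` and the `CpuReports...` helpers are invented names, not part of cpufeatures.hpp. CPUID leaf 7 is indexed by a subleaf in ECX, so the wrapper passes the subleaf explicitly (`__cpuidex` on MSVC, `__cpuid_count` from cpuid.h on GCC/Clang). On non-x86 targets it returns zeros.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define SKETCH_X86_MSVC 1
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_X86_GNU 1
#endif

struct CpuidRegs {
    std::uint32_t eax, ebx, ecx, edx;
};

// Query one CPUID leaf/subleaf. Returns all zeros if the leaf is above the
// maximum the CPU reports, or if this is not an x86 build.
inline CpuidRegs QueryCpuid(std::uint32_t leaf, std::uint32_t subleaf = 0) {
    CpuidRegs r = {0, 0, 0, 0};
    const std::uint32_t base = leaf & 0x80000000u;  // basic or extended range
#if defined(SKETCH_X86_MSVC)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, static_cast<int>(base));          // regs[0] = highest leaf
    if (static_cast<std::uint32_t>(regs[0]) < leaf) return r;
    __cpuidex(regs, static_cast<int>(leaf), static_cast<int>(subleaf));
    r.eax = static_cast<std::uint32_t>(regs[0]);
    r.ebx = static_cast<std::uint32_t>(regs[1]);
    r.ecx = static_cast<std::uint32_t>(regs[2]);
    r.edx = static_cast<std::uint32_t>(regs[3]);
#elif defined(SKETCH_X86_GNU)
    if (__get_cpuid_max(base, nullptr) < leaf) return r;
    __cpuid_count(leaf, subleaf, r.eax, r.ebx, r.ecx, r.edx);
#else
    (void)base;
    (void)subleaf;
#endif
    return r;
}

// Test a single bit of a 32-bit register value.
inline bool TestBit(std::uint32_t reg, unsigned index) {
    assert(index < 32);
    return ((reg >> index) & 1u) != 0;
}

// Raw CPU capability bits only; OS support for YMM state is a separate check.
inline bool CpuReportsFMA3() { return TestBit(QueryCpuid(1).ecx, 12); }
inline bool CpuReportsAVX()  { return TestBit(QueryCpuid(1).ecx, 28); }
inline bool CpuReportsAVX2() { return TestBit(QueryCpuid(7, 0).ebx, 5); }
```

These helpers only say what the CPU advertises. Before actually calling AVX code, the OS also has to have enabled saving of the YMM registers, which is what the xgetbv check is for.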
29th January 2017, 07:58 | #37 | Link |
I'm Siri
Join Date: Oct 2012
Location: void
Posts: 2,633
|
or just use this version of cpufeatures.hpp
|
29th January 2017, 15:03 | #39 | Link |
Pajas Mentales...
Join Date: Dec 2004
Location: Spanishtán
Posts: 496
|
|
29th January 2017, 16:37 | #40 | Link | |
I'm Siri
Join Date: Oct 2012
Location: void
Posts: 2,633
|
Quote:
I'm sure I fixed everything except the "xgetbv" part. That function has been integrated into immintrin.h in Visual Studio 2017 but apparently not in GCC... You'll have to compile cpu.asm yourself with yasm or nasm, and I can't help you with that... I wouldn't have written my own Visual Studio version of CPUFeatures if I could handle them.. |
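For the missing "xgetbv" piece on GCC, one possible workaround that avoids a separate .asm file is a one-instruction inline asm wrapper. This is an assumption on my part, not something from the posted cpufeatures.hpp, and the names `OsUsesXsave`, `XgetbvInline` and `OsSavesYmm` are invented. XGETBV may only be executed when CPUID leaf 1 reports OSXSAVE (ECX bit 27), so the sketch checks that first. On non-GCC or non-x86 builds it simply reports no support.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_GNU_X86 1
#endif

// True when CPUID leaf 1 reports OSXSAVE (ECX bit 27), meaning the OS has
// enabled XSAVE and XGETBV can be executed without faulting.
inline bool OsUsesXsave() {
#ifdef SKETCH_GNU_X86
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    return ((ecx >> 27) & 1u) != 0;
#else
    return false;
#endif
}

// Read extended control register `index` with a raw XGETBV instruction.
// Precondition: OsUsesXsave() is true.
inline std::uint64_t XgetbvInline(std::uint32_t index) {
    assert(OsUsesXsave());
#ifdef SKETCH_GNU_X86
    std::uint32_t lo = 0, hi = 0;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(index));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
#else
    (void)index;
    return 0;
#endif
}

// AVX is only usable when XCR0 has bit 1 (XMM state) and bit 2 (YMM state) set.
inline bool OsSavesYmm() {
    if (!OsUsesXsave()) return false;
    return (XgetbvInline(0) & 0x6u) == 0x6u;
}
```

A feature struct would then typically report AVX or FMA3 only when both the CPUID bit and `OsSavesYmm()` are true.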
|
|
|