New filter: Fix Telecined Fades - Page 2

feisty2 · 15th January 2017, 11:07

guess I can actually optimize my floating point MVTools now with AVX and FMA, but I don't really want to...

Mystery Keeper · 15th January 2017, 11:13

Quote:

Originally Posted by feisty2

guess I can actually optimize my floating point MVTools now with AVX and FMA, but I don't really want to...

Pretty please with a cherry on the top?

Myrsloik · 15th January 2017, 11:34

Quote:

Originally Posted by feisty2

r3:
AVX and FMA3 optimizations, will probably crash if your CPU is a predecessor of the Haswell microarchitecture (just use r2 in that case)

How much faster is it?

feisty2 · 15th January 2017, 13:14

Quote:

Originally Posted by Myrsloik

How much faster is it?

r2 runs at 476.65fps at 1920x1080
r3 runs at 489.65fps at 1920x1080

makes no sense!!!
shouldn't AVX and FMA be at least 8x faster than x87???

I think there's probably something wrong with my compiler (VS2017), can you please compile the source code with your compiler and do a test as well?

to switch avx/fma back to c++

Line 225: FixFadesPrepare_AVX(); -> FixFadesPrepare();

Line 231: FixFadesMode0_AVX_FMA(); -> FixFadesMode0();

Line 234: FixFadesMode1_AVX_FMA(); -> FixFadesMode1();

Line 237: FixFadesMode2_AVX_FMA(); -> FixFadesMode2();

feisty2 · 15th January 2017, 13:16

Quote:

Originally Posted by Mystery Keeper

Pretty please with a cherry on the top?

someday, I'll add that to my to-do list...

cork_OS · 15th January 2017, 15:59

Quote:

Originally Posted by feisty2

shouldn't AVX and FMA be at least 8x faster than x87???

1. It's limited by FPU IPC (real number and width of add/mult/etc. units).
2. Why x87? Are you using 80 bits of precision?

Myrsloik · 15th January 2017, 16:08

Quote:

Originally Posted by feisty2

r2 runs at 476.65fps at 1920x1080
r3 runs at 489.65fps at 1920x1080

makes no sense!!!
shouldn't AVX and FMA be at least 8x faster than x87???

I think there's probably something wrong with my compiler (VS2017), can you please compile the source code with your compiler and do a test as well?

to switch avx/fma back to c++

Line 225: FixFadesPrepare_AVX(); -> FixFadesPrepare();

Line 231: FixFadesMode0_AVX_FMA(); -> FixFadesMode0();

Line 234: FixFadesMode1_AVX_FMA(); -> FixFadesMode1();

Line 237: FixFadesMode2_AVX_FMA(); -> FixFadesMode2();

YOU ARE NOT USING X87!!! You compiled it as x64 code and the ABI (more or less) requires it to use sse2 instructions to implement it. Obviously at least the scalar float versions. It's even possible that it managed to auto vectorize like half of this code since most of it is just mindless read and sum. Look at the generated code instead of asking us about what you, YOURSELF, told the compiler to do.

Your assumption still wouldn't be true about x87 vs avx. For simple algorithms you run into memory bw limitations long before you see the glory of sse (avx is even more rare to matter). Modern cpus are just too good.

feisty2 · 15th January 2017, 16:53

Quote:

Originally Posted by Myrsloik

YOU ARE NOT USING X87!!! You compiled it as x64 code and the ABI (more or less) requires it to use sse2 instructions to implement it. Obviously at least the scalar float versions. It's even possible that it managed to auto vectorize like half of this code since most of it is just mindless read and sum. Look at the generated code instead of asking us about what you, YOURSELF, told the compiler to do.

the "Look at the generated code" part is a bit too hard for me though... Staring at thousands of lines of generated assembly is far beyond my programming skill, since I'm not a professionally trained programmer...

Quote:

Originally Posted by Myrsloik

Your assumption still wouldn't be true about x87 vs avx. For simple algorithms you run into memory bw limitations long before you see the glory of sse (avx is even more rare to matter). Modern cpus are just too good.

which means it's pretty much pointless to manually optimize simple plugins?

Myrsloik · 15th January 2017, 16:58

Quote:

Originally Posted by feisty2

the "Look at the generated code" part is a bit too hard for me though... Staring at thousands of lines of generated assembly is far beyond my programming skill, since I'm not a professionally trained programmer...

which means it's pretty much pointless to manually optimize simple plugins?

It's not thousands of lines. The interesting part is about 50 instructions at most. How to do shit properly: set a breakpoint in one of your inner loops, and when it's hit simply open the Disassembly window (Debug > Windows > Disassembly). Look at like 10 instructions and see what it picked. Repeat for all critical loops. Done.

And yes, you're wasting perfectly good internet space by trying to optimize such simple things. Oh, and if you can't read simple disassembled stuff you shouldn't be writing optimizations like these in the first place.

MonoS · 18th January 2017, 23:57

Hi feisty, if you want some help with your optimization task, feel free to ask me directly (send me a PM and we can choose a more direct means of communication).

As Myrsloik said, your code was probably just being auto-vectorized by the compiler. On GCC, for example, if you compile with -O3 and -march=native, it will try to vectorize using the best instruction set you have available. VC probably does the same.

First things first: I'll base all my statements on the Intel reference you can find here https://software.intel.com/sites/lan...trinsicsGuide/ (Agner Fog's tables would be better, but they are not as easy to use). I'll also use the official Intel optimization guide (http://www.intel.com/content/www/us/...on-manual.html).

There are some things you could write better, just looking at ProcessLine_AVX_FMA.
First, you should avoid all those stores. Without an assembly listing it's a bit difficult to know, but they are probably using most of the load/store ports, making it harder for the processor to load data (the load ports are shared with the store ones), and you are also clogging the front end with useless instructions to decode.
Division is a NO on any kind of core: on Haswell it has 21 cycles of latency (and you can issue another one only after 13 cycles). You can turn it into a multiplication by the reciprocal of the value; multiplication is WAAAAAY cheaper than division (5 cycles latency, 0.5 throughput). Even better, you can pre-divide YMMField and YMMReference and make a single fmadd call.
I'd also suggest replacing the _mm256_set_ps at the start with a _mm256_set1_ps.
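To make the division advice concrete, here is a minimal scalar sketch of the same rewrite. It is not the plugin's code: the function names and the single `divisor` argument are made up for illustration. In AVX terms the hoisted value would be broadcast once with `_mm256_set1_ps` and applied with `_mm256_mul_ps` (or folded into an FMA), but the scalar form shows the idea without needing AVX hardware.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Baseline: one division per sample, the pattern flagged as expensive above.
// (Hypothetical helper, not taken from the plugin.)
std::vector<float> scale_by_division(const std::vector<float>& line, float divisor) {
    assert(divisor != 0.0f);
    std::vector<float> out(line.size());
    for (std::size_t i = 0; i < line.size(); ++i)
        out[i] = line[i] / divisor;
    return out;
}

// Rewrite: divide once outside the loop, then only multiply inside it.
std::vector<float> scale_by_reciprocal(const std::vector<float>& line, float divisor) {
    assert(divisor != 0.0f);
    const float reciprocal = 1.0f / divisor;  // the only division left
    std::vector<float> out(line.size());
    for (std::size_t i = 0; i < line.size(); ++i)
        out[i] = line[i] * reciprocal;
    return out;
}
```

The two versions are not guaranteed to be bit-identical: multiplying by a rounded reciprocal can differ from a true division in the last bit, which is usually acceptable for video filtering but worth knowing about.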

If assembly listings are hard for you to reason about, I'd try IACA (also from Intel) https://software.intel.com/en-us/art...-code-analyzer . This software will give you a better understanding of where your code is failing; it helped me a lot to understand where I was doing things wrong with my optimizations in MVTools.

There are other places where you could write better SIMD code, but I'll stop for now.

I strongly disagree with Myrsloik about this being a waste of time. I find SIMD optimization very fun to program, and simple code teaches the basics of SIMD programming. And yes, sometimes compiler-generated assembly is better than handmade assembly, but people improve over time.

For a tutorial from an "expert", I MUST suggest you watch some of the early episodes of Handmade Hero, in which @cmuratori explains how to SIMD-optimize math-heavy code. You can find the playlist here https://www.youtube.com/playlist?lis...7Vysr1nJcX_BW_ ; watch episodes 112 to 121 (337 is a bonus, because I'm pretty far behind on the series).

With this I'm going to sleep, after having recovered ALL THIS TEXT from RAM because my browser froze just before I finished this message. I hope I've been of some help.

feisty2 · 27th January 2017, 16:16

r4 runs at 626.89 fps at 1920x1080, 150.24 fps faster than the compiler generated code (no idea if optimized or not)
@MonoS
you were right that _mm256_div_ps was the bottleneck there, I removed it and got a noticeable performance boost, thx.

could anyone show me how to do that "opt" parameter thing? like calling the AVX function if it's supported by the CPU, and calling the C++ function if not... and please tell me I don't have to write asm for it

TheFluff · 27th January 2017, 18:51

You need to write asm to identify the CPU, but it's trivial. Just steal from the VS source tree:

cpu.asm
cpufeatures.c
cpufeatures.h

Myrsloik · 27th January 2017, 18:53

Quote:

Originally Posted by TheFluff

You need to write asm to identify the CPU. From the VS source tree:

cpu.asm
cpufeatures.c
cpufeatures.h

Just copy those files into your own project. Done.

feisty2 · 28th January 2017, 18:54

r5:
new parameter "opt"
opt: call the fastest possible functions if opt=True, else call the C++ functions.

Quote:

Originally Posted by Myrsloik

Just copy those files into your own project. Done.

Quote:

Originally Posted by TheFluff

You need to write asm to identify the CPU, but it's trivial. Just steal from the VS source tree:

cpu.asm
cpufeatures.c
cpufeatures.h

Tried to use them but got stuck at "cpu.asm". Apparently it requires nasm or yasm, and these 2 are a real pain in the ass: I wasted hours trying to make them work on VS2017 and all I got was millions of errors popping up relentlessly...

I even thought about translating "cpu.asm" to masm, and obviously I would have to translate "x86inc.asm" along with it, and that's a big NO.

so I merged those 3 into one file, "cpufeatures.hpp", and wrote my own version of CPUFeatures, with absolutely no trace of (literal) asm.

these 2 functions:

Code:

vs_cpu_cpuid()
vs_cpu_xgetbv()

have been integrated into the compiler,

Code:

Visual Studio 2017:
vs_cpu_cpuid() -> __cpuid()
GCC:
vs_cpu_cpuid() -> __get_cpuid()

vs_cpu_xgetbv() -> _xgetbv() //defined in immintrin.h

so I think there's no need to write literal asm for them.
and I removed the "getCPUFeatures()" function; it can simply be folded into the constructor since I'm using a C++ header.

so instead of

Code:

CPUFeatures CPU;
getCPUFeatures(&CPU);
if (CPU.fma3)
	xxx

it's now cleaner like

Code:

auto CPU = CPUFeatures();
if (CPU.fma3)
	xxx
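For readers wondering how the `opt` switch and a feature check like `CPU.fma3` might fit together, here is a rough sketch of runtime dispatch. It makes several assumptions that are not from the thread: it targets GCC or Clang on x86, it uses the compiler builtin `__builtin_cpu_supports` in place of the `CPUFeatures` class, and the kernel names and signature (`sum_line_cpp`, `sum_line_avx`, `select_kernel`) are invented for illustration. On other toolchains it simply falls back to the C++ path.

```cpp
#include <cassert>
#include <cstddef>

// The AVX path is only compiled for GCC/Clang on x86 (an assumption of this
// sketch); every other toolchain gets the plain C++ path only.
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_HAS_AVX_PATH 1
#else
#define SKETCH_HAS_AVX_PATH 0
#endif

// Hypothetical kernel signature, not the plugin's real one.
using LineKernel = float (*)(const float* data, std::size_t count);

// Portable fallback: sums one line of samples.
float sum_line_cpp(const float* data, std::size_t count) {
    assert(data != nullptr || count == 0);
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) sum += data[i];
    return sum;
}

#if SKETCH_HAS_AVX_PATH
// Same job, eight floats per step. The target attribute lets this single
// function use AVX without building the whole file with -mavx.
__attribute__((target("avx")))
float sum_line_avx(const float* data, std::size_t count) {
    assert(data != nullptr || count == 0);
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= count; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (float lane : lanes) sum += lane;
    for (; i < count; ++i) sum += data[i];  // leftover samples
    return sum;
}
#endif

// Pick the kernel once; opt=false always forces the C++ path.
LineKernel select_kernel(bool opt) {
    (void)opt;  // unused when no AVX path is compiled in
#if SKETCH_HAS_AVX_PATH
    __builtin_cpu_init();
    if (opt && __builtin_cpu_supports("avx")) return sum_line_avx;
#endif
    return sum_line_cpp;
}
```

Selecting the function pointer once, when the filter is created, keeps the feature check out of the per-frame code. The vector path adds samples in a different order than the scalar loop, so results on arbitrary data may differ slightly in the last bits.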

Are_ · 29th January 2017, 02:02

Right now it looks like it's not compiling anymore with GCC.

feisty2 · 29th January 2017, 07:35

Quote:

Originally Posted by Are_

Right now it looks like it's not compiling anymore with GCC.

to make it work with GCC

cpufeatures.hpp:
line 1: #include <intrin.h> -> #include <cpuid.h>
line 6: return static_cast<int32_t>(val); -> return static_cast<uint32_t>(val);
line 28: __cpuid(Registers, 1); -> __get_cpuid(1, &eax, &ebx, &ecx, &edx);
line 43: __cpuid(Registers, 7); -> __get_cpuid(7, &eax, &ebx, &ecx, &edx);

feisty2 · 29th January 2017, 07:58

Quote:

Originally Posted by Are_

Right now it looks like it's not compiling anymore with GCC.

or just use this version of cpufeatures.hpp

MonoS · 29th January 2017, 13:03

Glad to be of help

sl1pkn07 · 29th January 2017, 15:03

Quote:

Originally Posted by feisty2

or just use this version of cpufeatures.hpp

nope

https://sl1pkn07.wtf/paste/view/b0cc57fb

feisty2 · 29th January 2017, 16:37

Quote:

Originally Posted by sl1pkn07

nope

https://sl1pkn07.wtf/paste/view/b0cc57fb

working now?

I'm sure I fixed everything except the "xgetbv" part, that function has been integrated into immintrin.h in Visual Studio 2017 but apparently not in GCC...
You'll have to compile cpu.asm yourself with yasm or nasm and I can't help you with that... I wouldn't have written my own Visual Studio version of CPUFeatures if I could handle them..
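For GCC toolchains where `_xgetbv` is not available, one possible workaround is a one-line inline-asm wrapper instead of a separate yasm/nasm build. The sketch below is an assumption-laden illustration, not the plugin's `cpufeatures.hpp`: the names `read_xcr0`, `avx_usable`, `fma3_usable` and `detect_features` are invented, and it only covers leaf 1 of CPUID. The bit positions are the documented ones: in CPUID leaf 1, ECX bit 12 is FMA, bit 27 is OSXSAVE and bit 28 is AVX, and XCR0 bits 1 and 2 indicate that the OS saves XMM and YMM state.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_X86_GNU 1
#else
#define SKETCH_X86_GNU 0
#endif

constexpr std::uint32_t kFmaBit = 1u << 12;      // CPUID.1:ECX FMA
constexpr std::uint32_t kOsxsaveBit = 1u << 27;  // CPUID.1:ECX OSXSAVE
constexpr std::uint32_t kAvxBit = 1u << 28;      // CPUID.1:ECX AVX
constexpr std::uint64_t kXcr0AvxState = 0x6;     // XMM and YMM state enabled

// Pure bit tests, kept separate so they can be checked without real hardware.
bool avx_usable(std::uint32_t leaf1_ecx, std::uint64_t xcr0) {
    const bool cpu_ok = (leaf1_ecx & kAvxBit) != 0 && (leaf1_ecx & kOsxsaveBit) != 0;
    return cpu_ok && (xcr0 & kXcr0AvxState) == kXcr0AvxState;
}

bool fma3_usable(std::uint32_t leaf1_ecx, std::uint64_t xcr0) {
    return avx_usable(leaf1_ecx, xcr0) && (leaf1_ecx & kFmaBit) != 0;
}

#if SKETCH_X86_GNU
// Hypothetical stand-in for _xgetbv(0): reads XCR0 via inline asm.
std::uint64_t read_xcr0() {
    std::uint32_t lo = 0, hi = 0;
    __asm__ __volatile__("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
#endif

struct DetectedFeatures {
    bool avx = false;
    bool fma3 = false;
};

DetectedFeatures detect_features() {
    DetectedFeatures f;
#if SKETCH_X86_GNU
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        // XGETBV may only be executed when OSXSAVE is set.
        const std::uint64_t xcr0 = (ecx & kOsxsaveBit) ? read_xcr0() : 0;
        f.avx = avx_usable(ecx, xcr0);
        f.fma3 = fma3_usable(ecx, xcr0);
    }
#endif
    assert(!f.fma3 || f.avx);  // FMA3 is never reported without AVX here
    return f;
}
```

This does use a line of literal asm, which the header in this thread avoids on Visual Studio, but it needs no external assembler.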
