Old 9th November 2018, 17:46   #21  |  Link
ErazorTT
Registered User
 
Join Date: Mar 2003
Location: Germany
Posts: 85
So the generic options are based on the compiler vectorization and optimization.

Have you tried increasing the stack alignment with --with-incoming-stack-boundary? As suggested here: https://forum.doom9.org/showthread.p...80#post1857180
Old 10th November 2018, 00:10   #22  |  Link
Wolfberry
Helenium(Easter)
 
Join Date: Aug 2017
Location: Hsinchu, Taiwan
Posts: 99
Quote:
Originally Posted by Groucho2004 View Post
There are a number of guidelines here about building fftw.
The official guideline for building FFTW on Windows is outdated; I consider BUILD-MINGW32, BUILD-MINGW64 and PKGBUILD better references.

Quote:
On win32, some versions of gcc assume that the stack is 16-byte aligned, but code compiled with other compilers may only guarantee a 4-byte alignment, resulting in mysterious segfaults.
As the quote says, --with-incoming-stack-boundary=2 applies only to win32 (x86), not x64. On an x64 build the compiler rejects it:
Code:
configure:15780: checking whether C compiler accepts -mincoming-stack-boundary=2
configure:15795: gcc -c -mincoming-stack-boundary=2  conftest.c >&5
cc1.exe: error: -mincoming-stack-boundary=2 is not between 3 and 12
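
To illustrate the point, here is a minimal Python sketch that assembles configure arguments per target and adds --with-incoming-stack-boundary=2 only for 32-bit x86. The function name and the "x86"/"x64" labels are made up for this example; the switches are FFTW configure options, but this particular selection is an assumption and not the set used by BUILD-MINGW32 or BUILD-MINGW64.

```python
def fftw_configure_args(arch):
    """Return illustrative ./configure arguments for a Windows FFTW build.

    `arch` is a hypothetical label: "x86" (win32) or "x64".
    """
    if arch not in ("x86", "x64"):
        raise ValueError("unknown arch: %r" % (arch,))

    # Assumed baseline: shared library with SSE2 codelets.
    args = ["--enable-shared", "--enable-sse2"]

    if arch == "x86":
        # Only 32-bit x86 needs this: callers built with other compilers may
        # guarantee just 4-byte (2**2) stack alignment.
        args.append("--with-incoming-stack-boundary=2")
    return args


if __name__ == "__main__":
    for arch in ("x86", "x64"):
        print(arch, " ".join(fftw_configure_args(arch)))
```

For "x64" the stack-boundary switch is simply left out, which avoids the compiler error shown in the log above.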
__________________
Monochrome Anomaly
Old 9th December 2018, 01:56   #23  |  Link
FranceBB
Broadcast Encoder
 
Join Date: Nov 2013
Location: Germany
Posts: 647
Thanks for the new build!
__________________
Broadcast Encoder
Avisynth memes: 1 - 2 - 3
Videotek - Audacity XP
Old 28th January 2019, 10:32   #24  |  Link
almosely
Registered User
 
Join Date: Dec 2006
Location: Germany
Posts: 43
Hi,

Currently I am using AviSynth 2.6.0 MT (SEt) (x86) and almost always FFT3DGPU 0.8.2 (with my GeForce GTX 660 Ti); my CPU is an Intel Core i5-3470S (Features: MMX, SSE, SSE-2, SSE-3, SSSE-3, SSE4.1, AVX, DEP, VMX, SMX, SMEP, EM64T, EIST, TM1, TM2, Turbo, AES-NI, RDRAND).

I want to save energy, so I am going to switch to AviSynth+ r1576 (x64) and FFT3DFilter 2.5 (x64); the latter needs FFTW. At the moment I have fftw3.dll 3.3.4 (x86) (original FFTW build). I tested it against the original 3.3.5 build, but 3.3.4 is faster (tested in my old environment, AviSynth 2.6.0 x86 MT). The 3.3.7 build (x86, SSE) from Lord_Mulder sadly does not work in my environment. Another 3.3.7 build linked here (from GMJCZP) is apparently an AVX2 build. The next link (from GMJCZP) to File Dropper is offline. Another one (from Groucho2004) is offline too. For the link to another 3.3.7 version (from Midzuki), I don't know which build it is. And the last link (from Wolfberry), to a 3.3.8 build, I cannot find (and it is AVX2, which my CPU does not have, and I don't know what "codelets" means either).

So ... I would be very happy if someone could compile a 3.3.8 version (with SSE2 and AVX; I think that's all my CPU can do, or rather all that FFTW can use on my CPU?) as x86 and x64 builds! That would be great :-) I tried to install MinGW, but could not figure out how to build it on my own (so I uninstalled it).

Last edited by almosely; 28th January 2019 at 11:07.
Old 28th January 2019, 11:49   #25  |  Link
Groucho2004
►◄
 
Join Date: Mar 2006
Location: A wretched hive of scum and villainy
Posts: 4,394
Quote:
Originally Posted by almosely View Post
Hi,

Currently I am using AviSynth 2.6.0 MT (SEt) (x86) and almost always FFT3DGPU 0.8.2 (with my GeForce GTX 660 Ti); my CPU is an Intel Core i5-3470S (Features: MMX, SSE, SSE-2, SSE-3, SSSE-3, SSE4.1, AVX, DEP, VMX, SMX, SMEP, EM64T, EIST, TM1, TM2, Turbo, AES-NI, RDRAND).

I want to save energy, so I am going to switch to AviSynth+ r1576 (x64) and FFT3DFilter 2.5 (x64)
A couple of questions/remarks:

What makes you think that you'll save energy switching to the CPU variant of that filter? Did you measure power consumption and, more importantly, task energy (power consumption * time)?

Why do you want to use the very old AVS+ r1576? It does not even support multi-threading. The recent builds of AVS+ can be found here.
__________________
Groucho's Avisynth Stuff
Old 28th January 2019, 12:10   #26  |  Link
StainlessS
HeartlessS Usurer
 
Join Date: Dec 2009
Location: Over the rainbow
Posts: 7,087
Quote:
Why do you want to use the very old AVS+ r1576?
Probably because Ultim is still in charge of the Avs+ thread, and that is the latest version posted on the first page.
Pinterf is apparently considering opening his own avs+ thread, where this prob should no longer exist.
__________________
I sometimes post sober.
StainlessS@MediaFire ::: AND/OR ::: StainlessS@SendSpace

"Some infinities are bigger than other infinities", but how many of them are infinitely bigger ???
Old 28th January 2019, 12:19   #27  |  Link
Groucho2004
►◄
 
Join Date: Mar 2006
Location: A wretched hive of scum and villainy
Posts: 4,394
Quote:
Originally Posted by StainlessS View Post
Pinterf is apparently considering opening his own avs+ thread, where this prob should no longer exist.
Sounds good.
__________________
Groucho's Avisynth Stuff
Old 28th January 2019, 12:29   #28  |  Link
almosely
Registered User
 
Join Date: Dec 2006
Location: Germany
Posts: 43
Oh, thanks for the fast replies and the hint for the newer versions of AVS+! Of course I am going with the newer ones then :-)

I did not measure the consumption with a device, but I know the TDPs of my CPU and GPU and watch the consumption in HWInfo. The GTX uses 15% of 150 W in idle mode, another 15% when just playing a video through its engine and another 10% for FFT3DGPU. That is 60 W in total for the GTX 660 Ti, plus 31 W from the CPU (6.5 W idle plus 24.5 W while encoding). I switched to QuickSync as the video engine and used FFT3DFilter (same parameters as FFT3DGPU: block size 32 and bt=3), encoded a short clip, and got 24 fps at a power consumption of 56 W for CPU (encoding) and GPU (idle) together. The same clip using the GTX video engine and FFT3DGPU (utilizing the GTX) ran at 29 fps and drew 91 W. So the pure CPU encode takes about 20% longer, which corresponds to 67 W over the same run time, against 91 W when I use my GPU in addition. Sadly, the GTX uses 24 W even in idle, grrr.
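
The arithmetic above can be sketched in a few lines of Python. This is only an estimate under the same assumptions as the post (power read from HWInfo, GPU power taken as a share of a 150 W TDP); the function names and the clip length are made up for illustration.

```python
def gpu_power_watts(tdp_watts, percent_shares):
    """GPU power as the sum of percentage shares of the TDP."""
    return tdp_watts * sum(percent_shares) / 100.0


def task_energy_wh(power_watts, frames, fps):
    """Task energy = power * time, in watt-hours."""
    seconds = frames / fps
    return power_watts * seconds / 3600.0


# Figures from the post: idle 15%, video engine 15%, FFT3DGPU 10% of 150 W.
gpu_watts = gpu_power_watts(150, [15, 15, 10])
cpu_watts = 6.5 + 24.5  # idle plus encoding

# Hypothetical clip length: one hour of video at 24 fps.
frames = 24 * 3600

gpu_path_wh = task_energy_wh(91, frames, 29)  # CUVID + FFT3DGPU
cpu_path_wh = task_energy_wh(56, frames, 24)  # QuickSync + FFT3DFilter

if __name__ == "__main__":
    print("GPU power: %.1f W, CPU power: %.1f W" % (gpu_watts, cpu_watts))
    print("GPU path: %.1f Wh, CPU path: %.1f Wh" % (gpu_path_wh, cpu_path_wh))
    print("CPU path uses %.0f %% of the GPU path's energy"
          % (100.0 * cpu_path_wh / gpu_path_wh))
```

The ratio of the two energies is the same as 67 W against 91 W in the post; only the unit differs, because the slower run is multiplied by its longer run time.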

Last edited by almosely; 28th January 2019 at 12:50.
Old 28th January 2019, 12:38   #29  |  Link
StainlessS
HeartlessS Usurer
 
Join Date: Dec 2009
Location: Over the rainbow
Posts: 7,087
Quality-wise, the CPU version is probably the better bet; methinks the GPU version of FFT3DFilter is now quite old and works in (I think) less precise fixed-point calculations.
The CPU version has been updated many times since the GPU version (leastwise, that's what I seem to remember, maybe I'm wrong on that).
__________________
I sometimes post sober.
StainlessS@MediaFire ::: AND/OR ::: StainlessS@SendSpace

"Some infinities are bigger than other infinities", but how many of them are infinitely bigger ???
Old 28th January 2019, 12:42   #30  |  Link
almosely
Registered User
 
Join Date: Dec 2006
Location: Germany
Posts: 43
At first sight I observed a big difference when comparing FFT3DFilter against FFT3DGPU in AvsPmod (histogram "luma" activated). But so far I have not compared them in depth; I just checked that everything works so I could test the consumption. Does AvsPmod still work with AVS+? Or is there a better tool like AvsPmod by now? I use it every time before I start an encode.

-edit-

Btw, my computer is running approx. 12 hours a day, almost non-stop, used as a NAS and workstation. So the idle consumption is there anyway. The consumption for the encoding alone (task energy) is therefore 31 W (31 W CPU + 0 W GPU) against 61 W (25 W CPU + 36 W GPU): about 50% less for the CPU path, or about twice as much for the GPU path. That's a huge difference.

Last edited by almosely; 28th January 2019 at 13:02.
Old 28th January 2019, 12:54   #31  |  Link
Groucho2004
►◄
 
Join Date: Mar 2006
Location: A wretched hive of scum and villainy
Posts: 4,394
Quote:
Originally Posted by almosely View Post
I did not measure the consumption with a device, but I know the TDPs of my CPU and GPU and watch the consumption in HWInfo. The GTX uses 15% of 150 W in idle mode, another 15% when just playing a video through its engine and another 10% for FFT3DGPU. That is 60 W in total for the GTX 660 Ti, plus 31 W from the CPU (6.5 W idle plus 24.5 W while encoding). I switched to QuickSync as the video engine and used FFT3DFilter (same parameters as FFT3DGPU: block size 32 and bt=3), encoded a short clip, and got 24 fps at a power consumption of 56 W for CPU (encoding) and GPU (idle) together. The same clip using the GTX video engine and FFT3DGPU (utilizing the GTX) ran at 29 fps and drew 90 W. So the pure CPU encode takes about 20% longer, which corresponds to 67 W over the same run time, against 90 W when I use my GPU in addition. Sadly, the GTX uses 24 W even in idle, grrr.
OK, so you did (kind of) compare. Fair enough. Stainless is correct that the GPU version is a bit old and that the CPU version is probably better in terms of quality. As for the FFTW DLLs, I suggest using these (for now).
__________________
Groucho's Avisynth Stuff
Old 28th January 2019, 13:24   #32  |  Link
almosely
Registered User
 
Join Date: Dec 2006
Location: Germany
Posts: 43
Thanks for the hint; I already downloaded them an hour ago (these are the ones from Midzuki), but have not tested them yet. Even if this 3.3.7 version works, I don't know whether it is optimised for SSE2 and AVX; probably not. For now, it would be okay. But I think with SSE2 and AVX my encodes would finish earlier and so save more energy, especially when the computer is running for encoding only (sometimes at night) and shuts down afterwards. And of course, sometimes faster would just be nice. Also, maybe I could use the sigma2, sigma3 and sigma4 parameters of FFT3DFilter, which FFT3DGPU does not provide; sometimes the noise/grain varies a lot, and that could help, while keeping the same encoding speed as with a non-SSE2/AVX build. Maybe someone will compile a new one, I hope so.

Last edited by almosely; 28th January 2019 at 13:49.
Old 28th January 2019, 13:32   #33  |  Link
Wolfberry
Helenium(Easter)
 
Join Date: Aug 2017
Location: Hsinchu, Taiwan
Posts: 99
I can compile 3.3.8 with ICC 19.0 in a few days; once it is done, you can run some tests to see whether it is any faster.
__________________
Monochrome Anomaly
Old 28th January 2019, 13:40   #34  |  Link
almosely
Registered User
 
Join Date: Dec 2006
Location: Germany
Posts: 43
3.3.8 with SSE2 and AVX (not AVX2) in x86 and x64, please; that would be great! :-) (x86 for safety and to test against AVS 2.6.0.) A few days is fine; I also need some time to set up AVS+ and read up on it (and have some other urgent jobs to do). Thank you in advance! :-) I will run several tests, of course.

- edit, because I forgot to mention this in my earlier post -

Quote:
Originally Posted by almosely View Post
I did not measure the consumption with a device, but I know the TDPs of my CPU and GPU and watch the consumption in HWInfo. The GTX uses 15% of 150 W in idle mode, another 15% when just playing a video through its engine and another 10% for FFT3DGPU. That is 60 W in total for the GTX 660 Ti, plus 31 W from the CPU (6.5 W idle plus 24.5 W while encoding). I switched to QuickSync as the video engine and used FFT3DFilter (same parameters as FFT3DGPU: block size 32 and bt=3), encoded a short clip, and got 24 fps at a power consumption of 56 W for CPU (encoding) and GPU (idle) together. The same clip using the GTX video engine and FFT3DGPU (utilizing the GTX) ran at 29 fps and drew 91 W. So the pure CPU encode takes about 20% longer, which corresponds to 67 W over the same run time, against 91 W when I use my GPU in addition. Sadly, the GTX uses 24 W even in idle, grrr.
Using QuickSync + FFT3DGPU still consumes 90 W, although the CUVID engine is not used; so FFT3DGPU accounts for more than 10% of the TDP, and this combination is no alternative to the previous setup.

HWInfo shows watts for the "CPU Package" but only a percentage for the "Total GPU Power" of the GTX.

The default block sizes are 32x32 for FFT3DGPU and 48x48 for FFT3DFilter. When I tested FFT3DGPU vs FFT3DFilter, 48x48 was significantly slower than 32x32, and the same goes for bt=5 compared with bt=3; so I compared the two filters with block size 32x32 and bt=3 (even though quality would probably be better if FFT3DFilter used bt=5, which FFT3DGPU cannot do; even bt=4 is buggy there).

Oh, and getting QuickSync to work while the GTX 660 Ti is still in use was very tricky; so stupid and annoying of Microsoft, grrr! After many setbacks and various discoveries (e.g. that the HD Graphics is not capable of OpenCL at all, even though Intel says it is), the final solution is to make the GTX the first initialized GPU (in the BIOS), enable "Multi-Monitor" for the iGPU (HD Graphics 2500) (in the BIOS), and let Windows 7 "think" that there is another display, working in "extended screen" mode and connected via "VGA" to the HD Graphics (even though no device is actually connected). I put that "display" at the top left corner of my real display, so that I can still use window-edge snapping and don't slide off the screen with the mouse or windows at any display border. That way I can use QuickSync and CUVID (and OpenCL and every other feature of the GTX) in parallel, at the same time. So far that setup works well; I hope installing and playing games won't be a problem. But it would be such a waste of potential and energy not to use QuickSync when it is available (1 W with QS vs 25 W with CUVID for pure, "simple" video decoding is a joke/an effrontery!).

Last edited by almosely; 28th January 2019 at 22:23.
Old 29th January 2019, 14:10   #35  |  Link
Wolfberry
Helenium(Easter)
 
Join Date: Aug 2017
Location: Hsinchu, Taiwan
Posts: 99
FFTW 3.3.8 built with ICC 19.0 now in my signature.

Have fun
__________________
Monochrome Anomaly
Old 29th January 2019, 14:30   #36  |  Link
almosely
Registered User
 
Join Date: Dec 2006
Location: Germany
Posts: 43
Coooooool :-) Thank you very much!!!

Just downloaded every generated version. I will test later, today or in 1-2 days, and report my results.

Question: When building FFTW, can you choose only one single SIMD option, or can several be combined, e.g. SSE2+AVX? ;-)
Old 29th January 2019, 14:39   #37  |  Link
ChaosKing
Registered User
 
Join Date: Dec 2005
Location: Germany
Posts: 982
thx, I get 2-4 fps more now on a Ryzen xD
Will add it to my vs portable fatpack.
__________________
AVSRepoGUI // VSRepoGUI - Package Manager for AviSynth // VapourSynth
VapourSynth Portable FATPACK || VapourSynth Database || https://github.com/avisynth-repository
Old 29th January 2019, 14:52   #38  |  Link
almosely
Registered User
 
Join Date: Dec 2006
Location: Germany
Posts: 43
Quote:
Originally Posted by http://www.fftw.org/release-notes.html
FFTW 3.3.5
Jul 31, 2016

- New SIMD support:
-- Power8 VSX instructions in single and double precision. To use, add --enable-vsx to configure.
-- Support for AVX2 (256-bit FMA instructions). To use, add --enable-avx2 to configure.
-- Experimental support for AVX512 and KCVI. (--enable-avx512, --enable-kcvi) This code is expected to work but the FFTW maintainers do not have hardware to test it.
-- Support for AVX128/FMA (for some AMD machines) (--enable-avx128-fma)
-- Double precision Neon SIMD for aarch64. This code is expected to work but the FFTW maintainers do not have hardware to test it.
-- generic SIMD support using gcc vector intrinsics.
- Add fftw_make_planner_thread_safe() API.
- fix #18 (disable float128 for CUDACC).
- fix #19: missing Fortran interface for fftwq_alloc_real.
- fix #21 (don't use float128 on Portland compilers, which pretend to be gcc).
- fix: Avoid segfaults due to double free in MPI transpose.
- Special note for distribution maintainers: Although FFTW supports a zillion SIMD instruction sets, enabling them all at the same time is a bad idea, because it increases the planning time for minimal gain. We recommend that general-purpose x86 distributions only enable SSE2 and perhaps AVX. Users who care about the last ounce of performance should recompile FFTW themselves.
So it is possible ;-)
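
Following the maintainers' recommendation in the release notes, a small Python sketch for picking the SIMD switches that match a CPU's feature list. The helper names are invented for this example, and the feature list is a shortened copy of the one posted for the i5-3470S; --enable-sse2, --enable-avx and --enable-avx2 are the configure switches named in the FFTW documentation and release notes.

```python
import re

# FFTW configure switches for x86 SIMD, keyed by normalized feature name.
SIMD_FLAGS = {
    "sse2": "--enable-sse2",
    "avx": "--enable-avx",
    "avx2": "--enable-avx2",
}


def normalize(feature):
    """Turn spellings such as 'SSE-2' into 'sse2'."""
    return re.sub(r"[^a-z0-9.]", "", feature.lower())


def simd_flags(features, wanted=("sse2", "avx")):
    """Return the switches from `wanted` that the CPU actually supports."""
    have = {normalize(f) for f in features}
    return [SIMD_FLAGS[name] for name in wanted if name in have]


# Shortened feature list of the Intel Core i5-3470S from this thread.
I5_3470S = ["MMX", "SSE", "SSE-2", "SSE-3", "SSSE-3", "SSE4.1", "AVX", "AES-NI"]

if __name__ == "__main__":
    print(simd_flags(I5_3470S))
    # Asking for AVX2 as well changes nothing, because the CPU lacks it.
    print(simd_flags(I5_3470S, wanted=("sse2", "avx", "avx2")))
```

Both calls return the SSE2 and AVX switches only, which matches the "SSE2 and perhaps AVX" advice for general-purpose builds.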
Old 29th January 2019, 17:46   #39  |  Link
almosely
Registered User
 
Join Date: Dec 2006
Location: Germany
Posts: 43
Quote:
Originally Posted by Intel Developer Zone
AVX implies support for all extensions up to SSE4.1
https://software.intel.com/en-us/for...s/topic/591956

Yeah! :-) So no additional build with SSE2 + AVX is needed.

For now I am going to test a little bit with AviSynth 2.6.0.
Old 29th January 2019, 22:26   #40  |  Link
almosely
Registered User
 
Join Date: Dec 2006
Location: Germany
Posts: 43
So, I did some tests with AviSynth 2.6.0 (x86), this time with another clip (the fastest results came from multithreading=off, ncpu=1 (fft3dfilter) and no RequestLinear(); everything else was slower; block size 32, bt=3).

FFTW 3.3.8 AVX is 4% faster than FFTW 3.3.4 (I guess that is an AVX build too, given the file size of 2259 KB), at least with this setup :-) Thank you! And I am very curious to compare that with AVS+ (x64)!

26.88 fps CUVID
26.54 fps CUVID + FFT3DGPU 0.8.2 (sigma 1.0)
26.41 fps CUVID + FFT3DGPU 0.8.2 (sigma 1.0, sharpen 0.16) (GPU Power 59 W)
26.14 fps CUVID + FFT3DGPU 0.8.4 (sigma 1.0, sharpen 0.16) (GPU Power 59 W)

26.90 fps QuickSync

FFTW 3.3.4 (AVX)
------------------
22.10 fps QuickSync + FFT3dFilter 2.5 (sigma 1.0)
21.81 fps QuickSync + FFT3dFilter 2.5 (sigma 1.0, sharpen 0.16)

FFTW 3.3.8 SSE2
-----------------
22.84 fps QuickSync + FFT3dFilter 2.5 (sigma 1.0)
22.31 fps QuickSync + FFT3dFilter 2.5 (sigma 1.0, sharpen 0.16)

FFTW 3.3.8 AVX
----------------
22.99 fps QuickSync + FFT3dFilter 2.5 (sigma 1.0)
22.39 fps QuickSync + FFT3dFilter 2.5 (sigma 1.0, sharpen 0.16)

Speed Comparison (columns: relative to CUVID + FFT3DGPU / FFTW 3.3.4 (AVX) / FFTW 3.3.8 AVX)
------------------
100.00 % 120.09 % 115.44 % 26.54 fps CUVID + FFT3DGPU 0.8.2 (sigma 1.0)
083.27 % 100.00 % 096.13 % 22.10 fps QuickSync + FFT3dFilter 2.5 (sigma 1.0) FFTW 3.3.4 (AVX)
086.06 % 103.35 % 099.35 % 22.84 fps QuickSync + FFT3dFilter 2.5 (sigma 1.0) FFTW 3.3.8 SSE2
086.62 % 104.03 % 100.00 % 22.99 fps QuickSync + FFT3dFilter 2.5 (sigma 1.0) FFTW 3.3.8 AVX
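
For anyone who wants to redo the comparison with their own numbers, here is a minimal Python sketch that recomputes the percentages from the measured fps values. The function name and the shortened setup labels are only for illustration; the fps values are the ones listed above.

```python
def relative_speeds(fps_by_setup, references):
    """Return {setup: [percent of each reference's fps]}, rounded to 2 places."""
    return {
        setup: [round(100.0 * fps / fps_by_setup[ref], 2) for ref in references]
        for setup, fps in fps_by_setup.items()
    }


# Measured fps (sigma 1.0) from the lists above.
FPS = {
    "CUVID + FFT3DGPU 0.8.2": 26.54,
    "QuickSync + FFTW 3.3.4 (AVX)": 22.10,
    "QuickSync + FFTW 3.3.8 SSE2": 22.84,
    "QuickSync + FFTW 3.3.8 AVX": 22.99,
}

# The three reference columns used in the table.
REFERENCES = [
    "CUVID + FFT3DGPU 0.8.2",
    "QuickSync + FFTW 3.3.4 (AVX)",
    "QuickSync + FFTW 3.3.8 AVX",
]

table = relative_speeds(FPS, REFERENCES)

if __name__ == "__main__":
    for setup, row in table.items():
        print("  ".join("%6.2f %%" % p for p in row), " ", setup)
```

Each row is one setup's fps divided by the fps of the three reference setups, so the diagonal entries come out as 100 %.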

Consumption Comparison: CPU+GPU total (workload only)
------------------------------------------------------
100 % 90 Watt (100 % 60 Watt) 60 min 26.54 fps CUVID + FFT3DGPU 0.8.2 (sigma 1.0)
-26 % 67 Watt (-48 % 31 Watt) 72 min 22.10 fps QuickSync + FFT3dFilter 2.5 (sigma 1.0) FFTW 3.3.4 (AVX)
-29 % 64 Watt (-50 % 30 Watt) 69 min 22.99 fps QuickSync + FFT3dFilter 2.5 (sigma 1.0) FFTW 3.3.8 AVX

- edit -

I did some tests with AviSynth+ 0.1.0 r2772 MT (x64) with the same clip and script, this time with the x64 filter plugins of course, compared with the earlier QuickSync results from AVS 2.6.0 ...

098.96 % 26.62 fps QuickSync

FFTW 3.3.8 AVX
----------------
099.91 % 22.97 fps QuickSync + FFT3dFilter 2.5 (sigma 1.0)
101.25 % 22.67 fps QuickSync + FFT3dFilter 2.5 (sigma 1.0, sharpen 0.16)

... so, a little bit slower, sadly, except for the last one. Maybe it will be better with MT enabled.

BUT: I can't use FFT3dFilter - it destroys the luma :-( I posted about this in the corresponding thread:

https://forum.doom9.org/showpost.php...0&postcount=53

- edit -

Tested QuickSync + FFT3dFilter 2.5 (sigma 1.0, sharpen 0.16) in MT mode: every trial was slower (2, 4, 6 threads, with and without RequestLinear; 19-21 fps).

Last edited by almosely; 1st February 2019 at 16:21.
Tags
fftw, fftw3.dll
