Old 24th November 2021, 19:16   #741  |  Link
tormento
Acid fr0g
 
 
Join Date: May 2002
Location: Italy
Posts: 2,542
Quote:
Originally Posted by DTL View Post
Here are test builds for users of scripts that do not set these params in MAnalyse
None of those are working.

I can't even get a nice error, they simply produce a 0 size file.
__________________
@turment on Telegram
Old 24th November 2021, 19:47   #742  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,041
Oh, there is at least one error with the hard-coded setting of optPredictorType: it looks like PT=0 is used everywhere. Sorry, I did not test the output after the build. I will try to find out what is wrong.

"simply produce a 0 size file."

Maybe it simply crashes at startup. Can you look in the system Event Viewer or the Dr. Watson log for crash records? It is strange: the build should use only SSE instructions up to SSE4.1 and was checked on an Intel Core 2 Duo E7500. I will enable a check for the required SSE version in the next test build.

Edit: Found the source of one error: Visual Studio opened files for editing from a different copy of the project, so all the .dlls were built from identical sources. I will rebuild now. That still does not explain why processing produces no result.

Last edited by DTL; 24th November 2021 at 20:19.
Old 24th November 2021, 21:28   #743  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,041
Here is a new, fixed test build, checked to run on a Core 2 Duo E7500: https://drive.google.com/file/d/1ry3...ew?usp=sharing

It has both SO0 and SO1 builds for all PT values.
The PT2 versions were removed because they cannot run correctly without limiting levels to about 2..3. PT4 requires lowering thSAD by a factor of about 1.5 compared with the other builds.

On the old E7500 CPU, optSearchOption=1 with some of the new SSE optimizations enabled runs slower. It looks like not all old SSE CPUs run the SSE versions of the functions faster than the C versions, so this definitely cannot be enabled unconditionally in the final release.
Old 26th November 2021, 01:08   #744  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,041
Issue found with the latest test build: a crash with block size 16x16 and ALIGN_SOURCEBLOCK = 1 ('asb1' in the file names, aligned copy disabled) at the default padding of 8. If the padding is increased to 16 (in MSuper), the crash does not happen. So users of scripts with default MSuper params (hpad=vpad=8) will get a crash when using the faster 'asb1' builds with block size 16x16. It looks like the x264 SSE2 and SSSE3 16x16 SAD functions were not tested with the aligned copy of the source block disabled.
I hope to find a workaround for this issue. The current user-side workaround is to increase hpad and vpad to about the block size or larger if the crash occurs.
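
The user-side workaround above can be sketched as a small helper. This is only an illustration in Python, not part of MVTools: the names `safe_pad` and `msuper_call` are hypothetical, and the rule "padding at least equal to the block size" is an assumption taken from the observation that padding 16 avoids the crash with 16x16 blocks.

```python
def safe_pad(blksize: int, pad: int = 8) -> int:
    """Return a padding value that is at least the block size.

    Hypothetical helper: encodes the workaround "raise hpad/vpad to about
    the block size or larger". 8 is the default padding quoted in the post.
    """
    return max(pad, blksize)


def msuper_call(blksize: int, pad: int = 8) -> str:
    """Build the text of an MSuper() call with the adjusted hpad/vpad."""
    p = safe_pad(blksize, pad)
    return f"MSuper(hpad={p}, vpad={p})"


print(msuper_call(16))  # MSuper(hpad=16, vpad=16)
print(msuper_call(8))   # MSuper(hpad=8, vpad=8)
```

A padding that is already larger than the block size is left unchanged, so the helper never reduces what a script has set.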
Old 30th November 2021, 16:14   #745  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 2,352
I got a BSOD with latest official build, trying to return a scaled MV clip... I panicked. Lost my dev version script of SMDegrain (filled with zeros), luckily it wasn't much, only a few commented expressions and notes.

This was more or less the trigger. I was trying to check bv1 clip dimensions to debug an issue I was having.
Environment:
i7-4790K
Win7-SP1 x64
AVS+ test29 x64
no avstp.dll in plugin path


Code:
setmemorymax(2048)
DGSource(bluray source)
ConvertBits(16)
w=width()
h=height()
bicubicresize(w*2,h*2)

pref8 = ConvertBits(8, dither=-1)
pref8 = pref8.BilinearResize(w, h) 
pref8 = pref8.ConvertToYUV420(false,"","MPEG1","spline16","top_left").ConvertBits(16)
pref8 = pref8.ex_Luma_Rebuild(S0=3.0,c=0.0625,uv=3,tv_range=true,fulls=false).ConvertBits(8, dither=-1)

super_search = MSuper(pref8, pel=1, chroma=true, hpad=0, vpad=0, sharp=1, rfilter=4, mt=true)
Recalculate  = MSuper(pref8, pel=1, chroma=true, hpad=0, vpad=0, sharp=1, rfilter=4, mt=true,levels=1)
bv1 = super_search.MAnalyse(isb = true, delta = 1, overlap=8, blksize= 16, search=4, chroma=true, truemotion=false, divide=0, dct=0, searchparam=2, pelsearch=1, temporal=false, trymany=false, scaleCSAD=1, mt=true)
bv1 = MRecalculate(Recalculate, bv1,  overlap=4,blksize=8, thSAD=200, chroma=true, truemotion=false, divide=0, dct=0, scaleCSAD=1, mt=true)
bv1 = bv1.MScaleVect()

bv1

# without prefetch, in avspmod
__________________
i7-4790K@Stock::GTX 1070] AviSynth+ filters and mods on GitHub + Discussion thread

Last edited by Dogway; 30th November 2021 at 16:30.
Old 30th November 2021, 16:55   #746  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,309
Quote:
Originally Posted by Dogway View Post
I got a BSOD with latest official build, trying to return a scaled MV clip... I panicked. Lost my dev version script of SMDegrain (filled with zeros), luckily it wasn't much, only a few commented expressions and notes.

I've changed the source filter to a ColorbarsHD().
bv1 is a 172444 x 1 sized RGB32 clip. Works for me from avsmeter64 and in 64 bit avspmod as well.
Old 30th November 2021, 17:30   #747  |  Link
Dogway
Registered User
 
Join Date: Nov 2009
Posts: 2,352
Oh well, thanks for testing, I didn't feel brave enough to reproduce it. I guess the very wide clip did something to my RAM; I was also running low on disk space, so that could be a factor. I thought bv1 was similar to the MSuper clip. Now I will try to debug without returning MV clips, lesson learned.
Old 3rd December 2021, 12:31   #748  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,041
Small but important update based on pinterf's sources from 9 November 2021: https://drive.google.com/file/d/1EEY...ew?usp=sharing . It should run stably with block size 16.

Added a check of predictor coordinates to skip re-checking a predictor that has already been checked. This should bring optPredictorType=0 (all predictors, the old default) close to PT1 in speed while still keeping all possible predictors.
In real footage many of the predictors are equal (among Zero, Global, Median and the 4 neighbours, plus Temporal if enabled), so keeping track of already checked predictors saves some calls to the single SAD() function, which is not SIMD-friendly and is hard to optimize.
The speed is content-dependent, so completely static sources like ColorBars() will run faster. It is better to test speed on real footage with varied motion. Also included are some very small SSE4.1 optimizations, in a separate file with SO=1 hard-coded inside, for users of old scripts.
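
The idea of skipping already checked predictors can be sketched in a few lines of Python. This is a minimal illustration, not the MVTools implementation: `best_predictor` and `toy_sad` are hypothetical names, and the toy cost function only stands in for a real block SAD.

```python
from typing import Callable, Iterable, Optional, Tuple

Vector = Tuple[int, int]


def best_predictor(
    predictors: Iterable[Vector],
    sad: Callable[[Vector], int],
) -> Tuple[Optional[Vector], Optional[int], int]:
    """Return (best vector, its cost, number of SAD calls made).

    Predictors whose coordinates were already checked are skipped,
    so each distinct vector costs exactly one SAD call.
    """
    seen = set()
    best_vec, best_cost, calls = None, None, 0
    for vec in predictors:
        if vec in seen:  # same coordinates as an earlier predictor
            continue
        seen.add(vec)
        cost = sad(vec)
        calls += 1
        if best_cost is None or cost < best_cost:
            best_vec, best_cost = vec, cost
    return best_vec, best_cost, calls


def toy_sad(vec: Vector) -> int:
    """Toy cost standing in for a real block SAD (assumption)."""
    return abs(vec[0] - 1) + abs(vec[1])


# Seven candidates (e.g. Zero, Global, Median and 4 neighbours), several equal.
candidates = [(0, 0), (0, 0), (1, 0), (0, 0), (1, 0), (0, 0), (2, -1)]
vec, cost, calls = best_predictor(candidates, toy_sad)
print(vec, cost, calls)  # (1, 0) 0 3
```

With seven candidates but only three distinct vectors, four SAD calls are saved. On footage with varied motion the candidates coincide less often, which is why the gain is content-dependent.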

Last edited by DTL; 3rd December 2021 at 12:44.
Old 3rd December 2021, 15:45   #749  |  Link
kedautinh12
Registered User
 
Join Date: Jan 2018
Posts: 2,153
Any chance of an x86 version?
Old 3rd December 2021, 17:31   #750  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,041
It builds: https://drive.google.com/file/d/1B6E...ew?usp=sharing

but it has not been properly tested for correct operation.
Old 7th December 2021, 18:20   #751  |  Link
tormento
Acid fr0g
 
 
Join Date: May 2002
Location: Italy
Posts: 2,542
Quote:
Originally Posted by DTL View Post
Small important update based on pinterf sources from 9 November 2021
Tested SSE41 builds thoroughly, both standard and SO1.

I am now using the SO1 version, instead of stable one, because of its speed and good results.

If you want to try an SSE42 build, my CPU supports it and perhaps we can get a little speed bump.

Last edited by tormento; 7th December 2021 at 22:57.
Old 7th December 2021, 23:20   #752  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,041
It looks like I later found a bug in the build from December 3: it may skip some valuable predictors and reduce degraining quality. Here is a hopefully bug-fixed build (both x64 and x86):
https://drive.google.com/file/d/1kMc...ew?usp=sharing

"SSE42 build, my CPU supports it and perhaps we can get a little speed bump."

Unfortunately SSE4.2 does not add anything significant. The only way to boost performance with all predictors and refining at all levels is an AVX2 or, better, an AVX512-capable chip.

For old CPUs the only option is to try 'logical optimizations' like the PT=4 mode, with purely interpolated prediction at level 0. It may give lower degraining quality, but it is the fastest possible mode. It is also planned to convert the InterpolatePrediction() function to SIMD (of a low family, like plain SSE), which may add some speed on SSE-level chips. But that is still a lower priority: I am currently developing multi-block search for AVX2 and AVX512 and am interested in the difference between 4/8-block AVX2 processing and 16/32-block AVX512 processing, for block size 8x8.
Today the 4-block 8x8 sp1 AVX2 function looks like it has gone from a pure technical speed test to something that works for degraining.

Addition: PT=4 does require re-adjusting the thSAD value in MDegrain (lower it by a factor of about 1.5 from the 'standard' value, because it outputs the SAD from level 1, which is typically lower). Using the 'standard' thSAD value may cause too much detail blurring, as with any thSAD setting that is too high.
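
As a rough worked example of that adjustment, here is a small Python sketch. The function name `thsad_for_pt4` is hypothetical and the input values are arbitrary examples, not recommended settings; the factor 1.5 is the approximate figure given above.

```python
def thsad_for_pt4(standard_thsad: int, factor: float = 1.5) -> int:
    """Scale a 'standard' thSAD down by about 1.5x for PT=4 (approximate rule)."""
    return round(standard_thsad / factor)


print(thsad_for_pt4(400))  # 267
print(thsad_for_pt4(300))  # 200
```

Since the factor is only approximate, the result is a starting point to tune from rather than an exact value.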

Last edited by DTL; 8th December 2021 at 00:57.
Old 8th December 2021, 04:49   #753  |  Link
tormento
Acid fr0g
 
 
Join Date: May 2002
Location: Italy
Posts: 2,542
Quote:
Originally Posted by DTL View Post
Hope bugfixed build
Can you release for SSE41 too?

Thanks.
Old 8th December 2021, 08:17   #754  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,041
I hope all possible SSE optimizations are enabled; I just did not put that in the file name. The only other option is to add an Intel compiler build for a specific chip family. Do you have Sandy Bridge? It may be a bit faster.
Old 8th December 2021, 08:19   #755  |  Link
tormento
Acid fr0g
 
 
Join Date: May 2002
Location: Italy
Posts: 2,542
Quote:
Originally Posted by DTL View Post
you have Sandy Bridge ? It may be a bit faster.
Yep, good old i7-2600k. Best Intel CPU ever
Old 8th December 2021, 08:25   #756  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,041
I think the best home chip right now is something like the i5-11400. But it looks like it needs about 200 W of unlocked power and a matching cooler to run AVX512 at good performance. Run at the rated 65 W TDP limit, it looks like it will throttle itself to a much lower performance level.
Old 8th December 2021, 08:32   #757  |  Link
tormento
Acid fr0g
 
 
Join Date: May 2002
Location: Italy
Posts: 2,542
Quote:
Originally Posted by DTL View Post
I think the best home chip right now is something like the i5-11400. But it looks like it needs about 200 W of unlocked power and a matching cooler to run AVX512 at good performance. Run at the rated 65 W TDP limit, it looks like it will throttle itself to a much lower performance level.

Alder Lake is a nice beast, unfortunately you have to disable E-Cores to have AVX-512 back.
Old 8th December 2021, 11:48   #758  |  Link
FranceBB
Broadcast Encoder
 
FranceBB's Avatar
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,883
Quote:
Originally Posted by DTL View Post
The only way to boost performance with all predictors and all levels refining - either AVX2 or better AVX512 capable chip.
AVX512?
Bring it on for the next stable release, Sky servers will thank you for the AVX512 build speed-up!

Old 8th December 2021, 12:43   #759  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,041
Quote:
Originally Posted by tormento View Post
Alder Lake is a nice beast, unfortunately you have to disable E-Cores to have AVX-512 back.
As I understand it, if the Windows task scheduler is not too bad it can load both the P and E cores, so the E cores help too. But I am not sure how a thread will detect whether it can use the AVX512 version of a function or not. Also, Alder Lake may be priced much higher than lower-end Rocket Lake chips like the 11400.

As for DDR5 vs DDR4, I am not sure it makes much difference. As I see it, with a typical latency of about 50 ns the real random-access byte-read speed is about 20 MBytes/s, while linear transfer is typically > 50 GBytes/s nowadays. The gap is about 2500 times. Unfortunately, SDRAM latency has improved only about 2 times in about 2 decades.
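
The latency-versus-bandwidth figures above follow from simple arithmetic, sketched here in Python. The assumption (labeled in the comments) is that each random access delivers one useful byte and that accesses are fully serialized at the quoted 50 ns latency.

```python
latency_s = 50e-9          # ~50 ns per random access (figure from the post)
linear_bytes_per_s = 50e9  # ~50 GBytes/s linear transfer (figure from the post)

# Assumption: one useful byte per serialized access, so accesses/s == bytes/s.
random_bytes_per_s = 1 / latency_s
gap = linear_bytes_per_s / random_bytes_per_s

print(f"{random_bytes_per_s / 1e6:.0f} MBytes/s")  # 20 MBytes/s
print(f"{gap:.0f}x")                               # 2500x
```

Overlapping accesses or using more of each fetched cache line would narrow the gap, so this is a worst-case estimate for purely dependent byte reads.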

Can you run a speed test of 64x16 vs 16x64 block processing? On the old Core 2 Duo E7500 I got about a 60% speed difference, but on the i5-9600 and i5-11500 much less (it looks like the latest Intel chips have better-tuned hardware prefetchers and really do have about 10 times more cache).
I am thinking about redesigning the MDegrainN memory access pattern for faster memory access, but that needs time, and data on whether it would significantly help newer CPUs.

"AVX512?"

Yes, it has a 4 times larger register file and allows processing 4 times more blocks in a single search op (if the vector coherency domain is large enough, which is frequently the case). But something still looks bad with consumer-level AVX512 Intel chips: testers report large power overshoot when the CPU is loaded with computation and power is not limited at the motherboard power supply. So it either overheats (with the small, funny boxed cooler) and auto-throttles, or overloads the motherboard power supply and crashes/BSODs/etc. or even burns the motherboard. I personally had a motherboard really burn out back in the Pentium II days: it had 2 slots, and 1 of the 2 burned out one night.
So it looks like 14 nm Intel chips cannot sustain AVX512 processing even at nominal frequency and start to throttle themselves, so performance on consumer-level AVX512 chips may still be limited. Otherwise a very good (water?) cooler is required, plus a special motherboard with a power limit far above the rated chip TDP (like 3 times larger). I wonder how server-class Intel chips with > 10 cores of AVX512 work at full speed for years.
I hope newer 7 nm Intel chips will be less power-hungry in AVX512 processing. But that is still in the future.
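
The "4 times larger register file" figure at the start of this reply can be checked with simple arithmetic. This Python sketch assumes 64-bit mode, where AVX2 exposes 16 YMM registers of 256 bits and AVX512 exposes 32 ZMM registers of 512 bits; the helper name is illustrative only.

```python
def register_file_bytes(num_regs: int, bits_per_reg: int) -> int:
    """Total architectural vector register space in bytes."""
    return num_regs * bits_per_reg // 8


avx2 = register_file_bytes(16, 256)    # 16 YMM registers (64-bit mode)
avx512 = register_file_bytes(32, 512)  # 32 ZMM registers (64-bit mode)

print(avx2, avx512, avx512 // avx2)  # 512 2048 4
```

The factor of 4 comes from twice as many registers, each twice as wide.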

" will thank you for the AVX512 build speed-up!"

Unfortunately, creating 'massive multi-block' versions of the search functions takes a lot of time for checking. The 'very simple' 4-block sp1 AVX2 function took a noticeable part of a day to check all 4 blocks x 8 positions per block = 32 test points. For AVX512 up to 32 blocks are planned: 32x8=256 test points. Otherwise, special test software has to be built to automate the testing. And for level>0 the sp2 versions are required, which have 24..25 search positions per block: that is 32x25=800 points to test for a full check. The performance of new hardware quickly outpaces the user's capacity to write programs for it.
I hope an AVX512 16/32-block 'tech demo' of SearchOption=4 will soon be available to check the possible AVX512 speedup.

Last edited by DTL; 8th December 2021 at 13:17.
Old 8th December 2021, 13:27   #760  |  Link
FranceBB
Broadcast Encoder
 
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,883
Quote:
Originally Posted by DTL View Post
I wonder how server-class intel chips with > 10 cores of AVX512 work at full speed for years.
Dunno, but they just do and the clock doesn't go down. On the other hand, we're talking about CPUs with a much lower clock than the consumer versions. In my case the CPU has 56c/112th with a base clock of 2.20GHz. Whenever I use AVX2, it goes up to 2.50GHz even at 100% usage; however, if I try to do the same with AVX512, it goes down to 2.20GHz, which, again, ain't bad 'cause that's the base/standard CPU clock frequency.


Quote:
Originally Posted by DTL View Post
I hope newer 7 nm Intel chips will be less power-hungry in AVX512 processing. But that is still in the future.
Perhaps. Fingers crossed, though.

Quote:
Originally Posted by DTL View Post

Unfortunately, creating 'massive multi-block' versions of the search functions takes a lot of time for checking. The 'very simple' 4-block sp1 AVX2 function took a noticeable part of a day to check all 4 blocks x 8 positions per block = 32 test points. For AVX512 up to 32 blocks are planned: 32x8=256 test points. Otherwise, special test software has to be built to automate the testing. And for level>0 the sp2 versions are required, which have 24..25 search positions per block: that is 32x25=800 points to test for a full check. The performance of new hardware quickly outpaces the user's capacity to write programs for it.
Ah, yeah, right, I see...

Quote:
Originally Posted by DTL View Post
I hope an AVX512 16/32-block 'tech demo' of SearchOption=4 will soon be available to check the possible AVX512 speedup.
Well, fingers crossed again, then.