Old 9th January 2022, 14:08   #41  |  Link
ChaosKing
Registered User
 
Join Date: Dec 2005
Location: Germany
Posts: 1,751
Quote:
Originally Posted by DTL View Post
" an error (0xC000374)"

It is with optSearchOption=5 ?
Yes, all DX mvtools runs were tested with optSearchOption=5; for the non-DX build I just removed it because it does not know the optSearchOption parameter. => I tested "pinterf mvtools" vs "DX mvtools"


EDIT:

with your mvtools build and optSearchOption=2
I get:
PHP Code:
no prefetch()  45 fps
prefetch(6)   220 fps
prefetch(8)   248 fps
prefetch(12)  275 fps
That's quite the improvement!
__________________
AVSRepoGUI // VSRepoGUI - Package Manager for AviSynth // VapourSynth
VapourSynth Portable FATPACK || VapourSynth Database || https://github.com/avisynth-repository

Last edited by ChaosKing; 9th January 2022 at 14:18.
ChaosKing is offline   Reply With Quote
Old 9th January 2022, 14:47   #42  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 880
"275 fps "

275/177 = 1.55, so not a huge gain. For slightly faster processing you can also set optPredictorType=1, though it may lower quality a bit. The default is optPredictorType=0, which should be closest in quality to v2.7.45. Raising the predictor type to 2, 3 or 4 may degrade quality even more - those are experimental modes. The fastest is optPredictorType=4 (it requires lowering the thSAD values in MDegrain).
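For example, something like this (values are only illustrative; I assume optPredictorType is an MAnalyse parameter in this build):
Code:
ColorBarsHD(1920, 1080).ConvertToYV12()   # stand-in source clip
super = MSuper(chroma=true, pel=2)
# optPredictorType=1 trades a little quality for some extra speed (0 is the default)
vec = MAnalyse(super, multi=true, delta=6, blksize=8, optSearchOption=2, optPredictorType=1)
# with optPredictorType=4 (fastest, experimental) thSAD should be lowered further
MDegrainN(super, vec, 6, thSAD=250, thSAD2=240)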

From moving to HWAcc I expect about 5x better speed, at least for Intel-based CPU hosts with small enough cache sizes. On fast CPUs from the end of the 2010s, mvtools on the CPU is limited mostly by host memory speed. Host memory bandwidth is about 50 GB/s nowadays (roughly 2 channels of DDR4), while top hardware accelerators reach about 1 TB/s - about 10..20 times faster.

Last edited by DTL; 9th January 2022 at 14:56.
DTL is offline   Reply With Quote
Old 10th January 2022, 12:41   #43  |  Link
tormento
Acid fr0g
 
tormento's Avatar
 
Join Date: May 2002
Location: Italy
Posts: 2,359
Quote:
Originally Posted by ChaosKing View Post
That's quite the improvement!
I can't understand why there is so little interest in this thread inside the doom9 community. MVTools is one of the slowest and most used filters around, and any help would be useful.

DTL is the only one developing this branch, with almost no help at all.
__________________
@turment on Telegram
tormento is offline   Reply With Quote
Old 10th January 2022, 13:17   #44  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 880
It looks like the community is almost absent nowadays. And the development process is getting more difficult: today I more or less finished the compute shader pipeline for SAD calculation on the HWAcc, and found that remote debugging apparently does not support shader debugging. So development of shader-based processing will probably be slower, with the only option being to download the finished result from the shader and analyse it.
Most of the freeware developers were active about a decade ago.

A working tech demo of a completely DX12-based MAnalyse (ME motion search + compute shader for SAD calculation) - https://drive.google.com/file/d/1cNs...ew?usp=sharing . Based on this commit - https://github.com/DTL2020/mvtools/c...47cba16727e838

It outputs a 'standard' MVs pseudo-file (with SAD values), so it should be compatible with the other client filters.

Only SO=5 is supported (SO=6 is still not finished in this build). Only overlap=0; chroma is used in the ME engine, but the shader is not yet finished for adding chroma SAD. Only 8x8 block size.
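So a script for this tech demo should look something like this (not tested with these exact values; thSAD is only illustrative):
Code:
ColorBarsHD(1920, 1080).ConvertToYV12()   # stand-in source, must be YV12
super = MSuper(chroma=true, pel=1, hpad=8, vpad=8)
# only blksize=8, overlap=0 and optSearchOption=5 are usable in this build
vec = MAnalyse(super, multi=true, delta=2, blksize=8, overlap=0, chroma=true, optSearchOption=5)
MDegrainN(super, vec, 2, thSAD=300, thSAD2=290)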

In raw mvtools performance it is slower than the i5-9600K (with SO=2), but with x264 encoding it works faster, because using the HWAcc for MAnalyse saves some CPU time (about 30%) for the MPEG encoding.

Based on testing of the GTX1060 ME-engine performance: the raw performance looks to be about 700..800 pairs of ref+current frames per second at 1080p frame size and qpel precision. So it can significantly outperform the i5-9600K only in half-frame-size mode, which limits the block size to 16x16 only. Still, moving as much of the denoising as possible to the HWAcc mostly frees the host CPU for MPEG encoding, so total transcoding with denoising will be about twice as fast (or limited by the execution speed of the x264 encoder on the CPU at the current settings).

Last edited by DTL; 11th January 2022 at 14:29.
DTL is offline   Reply With Quote
Old 11th January 2022, 13:59   #45  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 880
The natural massively parallel processing on the HWAcc makes things much easier:

The SAD compute shader for the HWAcc is as small and simple as this:
Code:
RWTexture2D<int> OutputTexture : register(u0);        // per-block SAD output
Texture2D<int> CurrentTexture_Y : register(t0);       // current frame, luma
Texture2D<int2> CurrentTexture_UV : register(t1);     // current frame, interleaved chroma (not used yet)
Texture2D<int> ReferenceTexture_Y : register(t2);     // reference frame, luma
Texture2D<int2> ReferenceTexture_UV : register(t3);   // reference frame, interleaved chroma (not used yet)
Texture2D<int2> ResolvedMVsTexture : register(t4);    // MVs resolved by the DX12 ME engine

[numthreads(8, 8, 1)] // thread-group size (assumed); each thread computes the SAD of one 8x8 block
void main(uint3 DTid : SV_DispatchThreadID)
{
	int3 i3Coord;

	i3Coord.x = DTid.x;
	i3Coord.y = DTid.y;
	i3Coord.z = 0;

	int iBlockSize = 8;

	int2 i2MV = ResolvedMVsTexture.Load(i3Coord);
	i2MV.r = i2MV.r >> 2; // full frame search qpel/4
	i2MV.g = i2MV.g >> 2;

	int iYsrc;
	int iYref;

	int iSAD = 0;

	for (int x = 0; x < iBlockSize; x++)
	{
		for (int y = 0; y < iBlockSize; y++)
		{
			i3Coord.x = DTid.x * iBlockSize + x;
			i3Coord.y = DTid.y * iBlockSize + y;
			iYsrc = CurrentTexture_Y.Load(i3Coord).r;

			i3Coord.x = DTid.x * iBlockSize + x + i2MV.r;
			i3Coord.y = DTid.y * iBlockSize + y + i2MV.g;

			iYref = ReferenceTexture_Y.Load(i3Coord).r;

			iSAD += abs(iYsrc - iYref);
		}
	}

	OutputTexture[DTid.xy] = iSAD;
}
(There is still no chroma, but adding chroma is also simple and will come later.) And it runs in the massively parallel architecture and software environment of the HWAcc, designed by professional programmers. No large hand-written parallel/SIMD program is required. The execution speed is very nice - I see almost zero load on the GPU graph.

It also does not suffer from 'out of buffer' memory exceptions caused by the 'invalid' vectors sometimes produced by the ME engine - the sampler simply returns zeros for out-of-frame coordinates.

The last core compute shader still required, for MDegrainN, is close to this one.

By contrast, C++ multi-block SAD computation on a general-purpose CPU with a SIMD co-processor is thousands of lines of code and still suffers greatly from low memory speed.

Most AVS filters could be moved to compute shaders and gain a lot of performance, but that requires adding DX-resource management to the AVS core. For now it will exist only inside the 'mvtools environment'.

To help users, a helper filter pack could be designed - something like 'toDX12' to upload the AVS pipeline data to HWAcc memory and generate pointers to the resources, and 'fromDX12' to download back to host CPU memory and continue the AVS pipeline processing - together with the ability to compile and load compute shaders for execution. This could be done on the user side at filter load time, before execution, so no special development environment would be required (though some debugging facility would be useful).
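Purely as an illustration of the idea (the filters below do not exist - the names and the whole pipeline are hypothetical):
Code:
LSMASHVideoSource("clip.mp4")   # any normal AVS source (placeholder file name)
toDX12()                        # hypothetical: upload frames to accelerator memory
GPUDenoise()                    # hypothetical shader-based filter working on GPU resources
fromDX12()                      # hypothetical: download back to host memory
# ... continue the normal CPU-based AVS pipeline here ...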

Last edited by DTL; 11th January 2022 at 14:23.
DTL is offline   Reply With Quote
Old 11th January 2022, 14:32   #46  |  Link
ChaosKing
Registered User
 
Join Date: Dec 2005
Location: Germany
Posts: 1,751
Maybe you can also get some inspiration from the "cudaSynth" concept here: http://rationalqm.us/board/viewtopic.php?f=14&t=671
__________________
AVSRepoGUI // VSRepoGUI - Package Manager for AviSynth // VapourSynth
VapourSynth Portable FATPACK || VapourSynth Database || https://github.com/avisynth-repository

Last edited by ChaosKing; 11th January 2022 at 14:41.
ChaosKing is offline   Reply With Quote
Old 11th January 2022, 15:15   #47  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 880
I think CUDA is a more manufacturer-specific API and less universal compared with the Windows DX API. I expect DX API support not only from NVIDIA but also from Intel and AMD (and maybe more HWAcc makers). So using the API native to Windows should be easier to support and, I hope, to develop for.

It is the game-developer APIs that drive the market for HWAcc hardware (for end-user/home entertainment use). The CUDA market may be more limited and more oriented to professional usage (not to low-cost home PCs).

Unfortunately, it looks like Intel with its QSV hardware accelerator may be too slow in driver development for DX12 ME, or maybe the old QSV hardware does not support the required output by design. But an accelerator built into the host CPU, as with Intel, is also bandwidth-limited by the very poor host RAM performance compared with a good discrete accelerator board.

The rather outdated Intel architecture of end-user PCs is now divided into separate high-performance islands - inside the CPU caches and inside the fast RAM of the HWAcc board. The host RAM is large but serves only as a very slow cache of data (for HDD and SSD).

Last edited by DTL; 11th January 2022 at 15:22.
DTL is offline   Reply With Quote
Old 14th January 2022, 13:32   #48  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 880
Quote:
Originally Posted by pinterf View Post
Had a look at NVidia site;
https://docs.nvidia.com/video-techno...tion-only-mode
DX12 and NvEnc direct ME mode cannot work at the same time.
If it is about "Motion-estimation (ME) only mode is not supported if DirectX 12 device is used."

It may mean that if a DX12 device is used in the application (or in the whole system?), it takes the ME engine's resources for DX12-ME clients, so the engine cannot be shared with NVIDIA-SDK-based applications at the same time.

The current sources show how the DX12-ME server runs on the same DX12 device that provides the compute shader execution (which is enough for the other tasks of MAnalyse + MDegrain processing). And the CS execution works with the same resources in the same format as used by the ME engine - that saves additional converting/copying/uploading. The resources loaded for ME processing are, after resolving the MVs, simply assigned as Shader Resource View inputs to the CS.

Also " For full-pel precision, the client must ignore two LSBs of the motion vector. For sub-pel precision, the two LSBs of the motion vector represent fractional part of the motion vector."

This possibly means that even with the NVIDIA API it is not possible to switch the ME engine from qpel to full-pel search mode for better speed. It always works in qpel mode and the user must discard one or two LSBs to get the full-pel MV value (for example, a raw qpel value of 9 becomes 9 >> 2 = 2 full pels), but that does not add any speed.

I currently think that if the ME engine from NVENC is not very fast and leaves most of the HWAcc resources free, it may be possible to implement the 'standard' MAnalyse processing in compute shaders, because at each level the block processing is mostly independent (only interconnected through FetchPredictors - fetching the MVs of some surrounding already-processed blocks as predictors), and perhaps even that could be added with some ordered (?) processing inside a thread group. Or that additional predictor gathering could be skipped for somewhat lower quality. That would be something in between the optPredictorsType=1 and 2 modes - using the zero, interpolated and global predictors is possible with fully independent block processing.

So users of 'complex' mvtools-based scripts could run different MAnalyse calls on different hardware execution units and get more speed.

Last edited by DTL; 14th January 2022 at 13:35.
DTL is offline   Reply With Quote
Old 20th January 2022, 23:19   #49  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 880
New build - https://github.com/DTL2020/mvtools/r.../r.2.7.46-a.06

It is somewhat faster at resource uploading, and chroma is finally included in the SAD calculation. With chroma=true it is now visibly faster on the GTX1060 compared with CPU (i5-9600K) processing, because optSearchOption=2 (and the higher CPU-based options) does not have chroma SAD processing at the moment.
Also scaleCSAD works and significantly changes the output MPEG speed. (Setting it to -1 makes the MPEG speed lower - either because of reduced noise or because of some softening too; -2 is not tested yet.) The adjustment steps look too coarse as the integers 0, -1, -2, and it may be good to add something finer (trying to keep compatibility - maybe use float?). With execution on the accelerator, the speed should not be visibly lower even with fine float adjustment, float multiplication and conversion to integer.

It looks like scaleCSAD should be fine-tuned together with the thSAD values in MDegrain (and the other thresholds, such as scene change).

In some quick tests it looks like, with a FullHD frame and 8x8 block size, there is an issue with the lowest 3 rows of blocks - it may cause visible error blends with a 'large' tr like 25. To fix it if it happens (temporarily?), lower thSCD1 below the default 400 (to something like 350..320). It may be some hidden bug in a data transfer somewhere - like freeing some buffer too early. Though it appears rarely enough, only near some scene changes.
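For example (values only illustrative; I assume here that scaleCSAD is set on MAnalyse, like the existing chromaSADscale parameter, and thSCD1 on MDegrainN):
Code:
ColorBarsHD(1920, 1080).ConvertToYV12()   # stand-in source
super = MSuper(chroma=true, pel=1, hpad=8, vpad=8, levels=1)
# scaleCSAD=-1 changes the chroma SAD scaling; steps are coarse integers for now
vec = MAnalyse(super, multi=true, delta=25, blksize=8, overlap=0, chroma=true, optSearchOption=5, scaleCSAD=-1)
# lowering thSCD1 from the default 400 works around the lowest-rows blend issue
MDegrainN(super, vec, 25, thSAD=300, thSAD2=290, thSCD1=350)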

Last edited by DTL; 20th January 2022 at 23:23.
DTL is offline   Reply With Quote
Old 21st January 2022, 14:01   #50  |  Link
tormento
Acid fr0g
 
tormento's Avatar
 
Join Date: May 2002
Location: Italy
Posts: 2,359
Quote:
Originally Posted by DTL View Post
New build
Will try ASAP.

Is HBD a long way to go?
__________________
@turment on Telegram
tormento is offline   Reply With Quote
Old 21st January 2022, 15:16   #51  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 880
"Is HBD a long way to go?"

It is not supported by DX12_ME for the ME work. NV12 is currently the only input format supported by the ME engine. Its bit depth equals the 8-bit YV12 format (the difference is only the interleaving of the UV planes). So the current requirement for the SO=5 mode of MAnalyse is YV12 input only. It is converted to NV12 in the upload procedure - the Y plane is used directly as the first subresource slice and the interleaved UV as the second subresource slice. In theory it is possible to upload higher-bit-depth formats and convert them to NV12 with a compute shader, but that is not a first priority. Better to use ConvertBits(8) or ConvertToYV12() for the MAnalyse source. Using only 8 bits may limit the processing quality of HDR sources (in dark areas), so it may be advisable to convert HDR to 8-bit SDR before feeding MAnalyse.

To get a 'more accelerated' HBD path, either a compute-shader-based MAnalyse (the SearchMVs() function of PlaneOfBlocks) or a compute-shader-based MDegrain is required. That is all planned, but for some time in the future. I hope a compute-shader-based MAnalyse will be faster than the dedicated ME engine from the encoder. Currently the 'GPU load' system stats show almost zero load (even with the SAD compute shader in use), so there are still free compute resources.

The only way to increase the speed of the encoder's ME engine is possibly to overclock the accelerator (or to measure its speed on different chips).
Maybe chips with more 'encoder units' will also be able to run more ME threads at the same time. Currently the DX12 environment does not appear to limit the number of ME threads at all (for the same process or for different processes), so it is hard to find the actual performance limit of the current hardware. One possible method is to run MAnalyse + MDegrain with AVSMeter, get close to 100% Video Encode load and check the output fps and the tr value. The ME engine's performance is about fps*(2*tr) pairs of frames per second.
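For example, a bare benchmark script like this run through AVSMeter gives a rough estimate (all values are only illustrative):
Code:
# if AVSMeter reports ~30 fps here, the ME engine handles about 30*(2*10) = 600 ref+current pairs per second
ColorBarsHD(1920, 1080).ConvertToYV12()
tr = 10
super = MSuper(chroma=true, pel=1, levels=1)
vec = MAnalyse(super, multi=true, delta=tr, blksize=8, overlap=0, optSearchOption=5)
MDegrainN(super, vec, tr, thSAD=300, thSAD2=290)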

MDegrainN already supports HBD input/output for the CPU processing. So it looks like two 'super' clips need to be created - one for MAnalyse (8-bit) and one for MDegrain (>8-bit).
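A sketch of that two-super-clip approach (untested; it assumes MDegrainN accepts vectors searched on the 8-bit clip together with a >8-bit super; the source line is only a placeholder):
Code:
tr = 6
src16 = LSMASHVideoSource("hdr_clip.mp4").ConvertBits(16)   # placeholder high-bit-depth source
src8 = src16.ConvertBits(8)       # 8-bit copy only for the DX12 ME engine (YV12/NV12 requirement)
sup8 = MSuper(src8, chroma=true, pel=1, levels=1)
sup16 = MSuper(src16, chroma=true, pel=1, levels=1)
vec = MAnalyse(sup8, multi=true, delta=tr, blksize=8, overlap=0, optSearchOption=5)
src16.MDegrainN(sup16, vec, tr, thSAD=300, thSAD2=290)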

The addition of a compute-shader-based MDegrain is in progress. It starts with uploading all the required resources ((2 x tr + 1) frames) to the HWAcc. It will also make the MAnalyse speed the best possible (only 1 newly uploaded frame per output frame of MDegrain). Currently MAnalyse still uses a partial optimization - one src frame is remembered and not re-uploaded for all the src+ref pairs requested by MDegrain for each output frame. So with full optimization the upload traffic will be (2 x tr) times smaller (also leaving more host memory bandwidth for other tasks). Even just skipping the copy of the Y plane into the intermediate NV12 structure before upload already gives a visible speedup.

The expected limiting factor for MT Avisynth is that the onboard memory of the accelerator is too small to hold the total frame pool of all MAnalyse+MDegrain threads. With the current architecture the frame pool cannot be shared across different AVS threads. So it will possibly not be able to run the script with one thread per CPU core (but that leaves more free cores for the MPEG encoder). With the ME engine shared between all threads, there may be no need to run many threads to reach the maximum possible speed. So Prefetch(N) should be balanced between accelerator memory load and Video Encode load: if Video Encode is close to 100% load, there is no need to add more threads.

The current major limitation of the DX12_ME engine is that the overlap processing mode (for MAnalyse) does not look possible (while keeping the speed in a good range). Some imitation of overlap is possible by sending 4 shifted planes to process, but that makes it 4 times slower. So the only way to get less blocky output from MDegrain is the new no-overlap blending mode (still work in progress) with a smooth transition of the blending weight between blocks. It may run somewhat slower on the CPU but should run well with compute-shader-based processing. Though with the 'half-frame' SO=6 mode the frame size is 4 times smaller, so it may be processed 4 times faster by the ME engine and allow some imitation of overlap mode at good speed - that needs testing too. And the SO=6 mode currently limits the usable block size to 16x16 only.
DTL is offline   Reply With Quote
Old 21st January 2022, 20:00   #52  |  Link
magnetite
Registered User
 
Join Date: May 2010
Posts: 64
I kept getting this unhandled C++ exception error while trying to run the new version. However, running it with MeGUI's manual mode actually gave me a specific error message, instead of some generic unhandled C++ exception:

Avisynth script error:
Unhandled error: dimzon_avs_init_2

Code:
LoadPlugin("C:\MeGUI 64-bit\tools\lsmash\LSMASHSource.dll")
LSMASHVideoSource("E:\Mobile Video\MVI_0564.mp4")
ConvertToYV12()
tr = 2 # Temporal radius
super = MSuper (mt=true, chroma=true,pel=1, hpad=8, vpad=8)
multi_vec = MAnalyse (super, multi=true, blksize=8, delta=tr, chroma=true, overlap=0, mt=true, optSearchOption=5, levels=1)
MDegrainN (super, multi_vec, tr, thSAD=300, thSAD2=290, mt=true, wpow=4)
Prefetch(8)
Removing the optSearchOption=5 and levels=1 fixes it.

Last edited by magnetite; 21st January 2022 at 20:17.
magnetite is offline   Reply With Quote
Old 21st January 2022, 20:05   #53  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 880
Does it still throw the error in single-threaded mode, i.e. without Prefetch(8) at the end of the script (or with it commented out)?

I had some unhandled C++ exceptions inside Avisynth when trying to use a graphics lib from the DX12 toolkit in multithreaded mode, but it is not used any more.

Also try setting mt=false for MSuper (and also for MAnalyse!) - maybe it somehow conflicts with something (if you also have avstp.dll somewhere and avstp MT is really activated). With SO=5 the processing in MAnalyse should not reach the avstp MT slicing now, but it might still somehow make a difference.

As for MDegrainN with avstp MT - I am not sure how it works; I do not use it and have not tested it. First, it is recommended to set mt=false in all mvtools filters.
Also, if you use AVS+ MT with Prefetch(N) at the end, it is usually best to disable any avstp MT in mvtools, because it will make things much slower. Use either avstp MT or AVS+ MT, or carefully tune the avstp MT threading together with AVS+ MT if avstp MT really is faster in your case.

If avstp MT is activated in your environment, that script tries to create 8 + (number of cores?) threads. avstp MT in mvtools is now a 'shadow of the poor past' and can only be recommended, with care, if AVS+ MT is not possible.

Also, you may move levels=1 into MSuper:
super = MSuper(chroma=true, pel=1, hpad=8, vpad=8, levels=1)

MAnalyse with SO=5 does not use more than one level now, so this may make processing a bit faster (no need to generate the lower-resolution levels in MSuper). MDegrainN does not use the smaller levels either.
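So with those changes the script from post #52 would become something like this (untested):
Code:
LoadPlugin("C:\MeGUI 64-bit\tools\lsmash\LSMASHSource.dll")
LSMASHVideoSource("E:\Mobile Video\MVI_0564.mp4")
ConvertToYV12()
tr = 2 # Temporal radius
super = MSuper(mt=false, chroma=true, pel=1, hpad=8, vpad=8, levels=1)
multi_vec = MAnalyse(super, multi=true, blksize=8, delta=tr, chroma=true, overlap=0, mt=false, optSearchOption=5)
MDegrainN(super, multi_vec, tr, thSAD=300, thSAD2=290, mt=false, wpow=4)
# Prefetch(8)   # keep commented out for the single-thread test first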

Last edited by DTL; 21st January 2022 at 20:24.
DTL is offline   Reply With Quote
Old 21st January 2022, 20:26   #54  |  Link
magnetite
Registered User
 
Join Date: May 2010
Posts: 64
I don't use avstp, just Avisynth+'s MT modes. With MT completely disabled and no prefetch I get the same error. I have to wonder what 'dimzon' is.

Last edited by magnetite; 21st January 2022 at 20:35.
magnetite is offline   Reply With Quote
Old 21st January 2022, 20:51   #55  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 880
"I don't use avstp."

But your script shows mt=true in every mvtools call. That is not required for AVS+ MT; it only switches the avstp-based internal MT on and off.

"Have to wonder what Dimzon is."

Yes, it is a strange error. The 'dimzon' token is not found in the AVS+ 3.7 sources. Can you run AVSMeter from the command line? I test with AVSMeter and VirtualDub x64 (and also by encoding with the x264 command line - x264 reading the src.avs script file directly).

All files from the archive are now required - Compute.cso must be located in the same folder as mvtools2.dll. Its loading routine (some helper function from the DX12 toolkit) most probably searches for the file only in the same path as mvtools2.dll. It is a compiled shader file. I do not know how to pack (link?) the resources into a single executable, if that is even possible. So all new versions with compute shaders will come with a set of .cso files - something like SAD.cso, MDegrain.cso, maybe MSearch.cso and others. I am not even sure it is good and possible to pack different compute shaders into a single file. In theory the reader loads a byte stream from a file, so it would be possible to read known start-end parts from a single binary file, but that is additional work and a source of errors.

"Removing the optSearchOption=5"

That disables all the DX12-based hardware-accelerated processing and falls back to the 'standard' mvtools 2.7.45 processing.

Last edited by DTL; 21st January 2022 at 21:06.
DTL is offline   Reply With Quote
Old 21st January 2022, 21:16   #56  |  Link
magnetite
Registered User
 
Join Date: May 2010
Posts: 64
Here's a link to the combined error messages. The last DLL that worked for me was on 12/31/2021. The CSO file is in the same folder as the mvtools2 DLL.

Last edited by magnetite; 21st January 2022 at 21:36.
magnetite is offline   Reply With Quote
Old 21st January 2022, 21:39   #57  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 880
"LSMASHVideoSource("E:\Mobile Video\MVI_0564.mp4")"

What is the frame size? Can you run with a FullHD internal source like

ColorBarsHD(1920,1080) ?

Currently it looks like the height must be evenly divisible by the block size, so for FullHD progressive it is 1920x1080, and for interlaced, pad to 1920x1088 to get 1920x544 after SeparateFields(), and so on.
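For interlaced material the padding could look like this (untested sketch; the padding rows must be cropped off again afterwards):
Code:
ColorBarsHD(1920, 1080).ConvertToYV12()   # stand-in for an interlaced 1920x1080 source
AssumeTFF()
AddBorders(0, 0, 0, 8)   # 1920x1080 -> 1920x1088, so fields are 1920x544 (divisible by 8)
SeparateFields()
# ... MSuper / MAnalyse(optSearchOption=5) / MDegrainN per field here ...
Weave()
Crop(0, 0, 0, -8)        # remove the padding rows again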

"link to the combined error messages."

Evaluate: Unhandled C++ exception - that is the AVS+ error message. I see you are not using the 'final' 3.7.2 AVS+ build yet.

I see the same error message on pinterf's github: https://github.com/pinterf/AviSynthPlus/issues/42 . Maybe not all of the more detailed messages pointing to the location of the problem are still displayed correctly by Avisynth?

Last edited by DTL; 21st January 2022 at 21:59.
DTL is offline   Reply With Quote
Old 21st January 2022, 23:07   #58  |  Link
magnetite
Registered User
 
Join Date: May 2010
Posts: 64
The frame size of that video source was 1920x1080 progressive.

I still get the same error message with the official Avisynth+ build from Github. Is that the final build you're referring to?

With ColorBarsHD as the source, same error message pops up.

Last edited by magnetite; 21st January 2022 at 23:14.
magnetite is offline   Reply With Quote
Old 21st January 2022, 23:31   #59  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 880
"with the official Avisynth+ build from Github."

Yes - I have also set up the latest 'release' 3.7.1 on my only working system with a GTX1060. I am thinking of creating a debug build; it also has the DirectX debug layer enabled - it can be installed as part of Windows 10 on the user side and may show more detailed errors. It may be something around structured exception handling - maybe it is disabled in the release build, so the exception is not caught in the underlying layers and gets passed to the AVS+ host.

Here is a debug build - https://drive.google.com/file/d/1yEN...ew?usp=sharing . I also added exception handling around the loading of the Compute.cso file. It contains builds with 2 types of C++ exception handling - EHa and EHsc in the VS2019 settings - maybe that helps.

*updated link to archive with debug MS .dlls included*

Last edited by DTL; 22nd January 2022 at 10:50.
DTL is offline   Reply With Quote
Old 23rd January 2022, 15:00   #60  |  Link
takla
Registered User
 
Join Date: May 2018
Posts: 151
DTL
I get the following error: "MAnalyse: Can not load file Compute.cso ReadData"
Is there any way to manually specify the file path for it? I tried mvtools2_EHa.dll and mvtools2_EHsc.dll but both give me this error.
takla is offline   Reply With Quote