Old 15th December 2021, 14:53   #21  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,064
With a multi-card PCIe setup it is possible to build a 'degraining farm' in a single host, and it may be much cheaper than distributed processing with multiple CPU-based workers and an IP interconnect.

I see the cheapest used Maxwell-based 745 cards from about $50. Though that requires a motherboard with multiple PCIe (x16?) slots, or maybe there are special 'PCIe risers' that connect an x16 card to an x1 slot at lower speed?

The MDegrainN motion search task naturally parallelizes into 2×tr src+ref searches, so a simple tr=12 degraining process can load up to 24 DX12-ME workers (NVIDIA promises to execute more than one ME task per chip on some chips).
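
To make the 2×tr count concrete, here is a minimal Python sketch that lists the src+ref pairs for one output frame. The function name `src_ref_pairs` is hypothetical (it is not part of mvtools), and the sketch assumes one backward and one forward reference per distance and ignores clip boundaries.

```python
def src_ref_pairs(n, tr):
    """Return the (src, ref) frame-number pairs searched for output frame n.

    Assumption: one backward and one forward reference per distance 1..tr,
    with no clamping at the start or end of the clip.
    """
    pairs = []
    for d in range(1, tr + 1):
        pairs.append((n, n - d))  # backward reference at distance d
        pairs.append((n, n + d))  # forward reference at distance d
    return pairs


pairs = src_ref_pairs(n=100, tr=12)
print(len(pairs))  # number of independent searches for this frame: 2 * tr
```

Each pair is independent of the others, which is why they could in principle be handed to separate ME workers.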

The MV data usage in MDegrainN has already been fixed to some degree and should limit the speed less, I hope.

Another interesting question: what is the maximum vector length the NVIDIA ME engine can search? Does it have any limits? Currently the API has no settings for it. For typical degraining work, though, very short vectors of plus or minus several samples are usually enough, because degraining works on slow-motion areas: with high-speed motion the noise is less visible, and motion picture content typically has no large high-speed areas.

Last edited by DTL; 15th December 2021 at 15:15.
Old 15th December 2021, 15:09   #22  |  Link
ReinerSchweinlin
Registered User
 
Join Date: Oct 2001
Posts: 454
Maybe a helpful addition:

While "Turing" refers to the GPU generation, the included encoders are not necessarily the same.

A 1650 has the Volta encoder, which cannot produce B-frames with H.265.

The 1660 has the B-frame HEVC capability, but only NVEncC is able to produce B-frames; the NVIDIA-provided encoder (used in A's Video Converter, for example) is NOT using B-frames.

ffmpeg with hardware-accelerated encoding also does not support B-frames at the moment.

All cards with "RTX" in their name work fine with B-frame HEVC encoding.

Some of the Quadro and OEM cards are also missing the newer encoder (while being labeled Turing or Pascal)...
Old 17th December 2021, 15:04   #23  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,064
Development with remote debugging via LAN has started, but it is not going very fast because I am rarely at work, where the GTX1060 card is, and I am not in great shape after a long time without sleep. Currently the DX12-ME init is partially finished in MAnalyse; see this commit: https://github.com/DTL2020/mvtools/c...c8404e1ea79769

If someone could provide a remote debugging environment, it might speed up the process. It requires a public IP connection for the debugger and FTP access to a folder for uploading dev builds of the .dll and maybe other required debug .dlls. It does not require installing much on the system: just unpack an archive to a folder with a known path from the root of a drive letter and run the remote debugger monitor application. The debug session is protected with the login/password of a local system user with 'remote debugging' access.

Microsoft does not recommend debugging over the internet because it may be slow, suffer packet loss and so on, but it is possibly a workable solution.

I still do not know whether the GTX745 card supports DX12-ME, and I have not yet found one cheap enough to get to my place. Unfortunately it looks like no software emulation of this processing is available in DX12. I have not searched very hard, though.

Last edited by DTL; 17th December 2021 at 15:07.
Old 18th December 2021, 12:40   #24  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,577
Quote:
Originally Posted by DTL
If someone could provide a remote debugging environment, it might speed up the process.
Perhaps you can try contacting videoh. He has a lot of experience with NVIDIA cards and CUDA.
__________________
@turment on Telegram
Old 18th December 2021, 13:41   #25  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,064
Sad news: I tested a GTX 745 card on Win10 19041 with the November 2021 driver and it does not provide the HW-ME function. So it is again unknown which is the cheapest card (from Maxwell?) that can provide the required service. I may need to e-mail NVIDIA support about this.

Hehe, the article about hardware ME acceleration for external software clients https://developer.nvidia.com/blog/an...ical-flow-sdk/ is dated Feb 2019, so 2+ years have passed already. I have posted a question to the NVIDIA developer support forums about selecting a DX12-ME capable card.

Last edited by DTL; 18th December 2021 at 14:14.
Old 22nd December 2021, 18:00   #26  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,064
Possibly more sad news: the GTX750 was tested and it looks like it is not supported either, though it is listed as second-generation Maxwell.

I have mostly finished the DX12_ME interface part of the program, but I still need a lot of help from Microsoft developers to get it working, or maybe any working sample of a DX12 video encoder (or motion estimator). The current program gets a no-error HR state from most functions, but the final mapping of the resulting resource buffer to read the MVs in the plugin returns 'general error - D3D device disconnected' with the error reason 'application have errors and need debug'. It is very unclear where the error may be. Also, the resource passed to the motion estimator logical device still cannot be switched to the required 'motion estimator read source' state after writing the source and ref data.
Old 27th December 2021, 16:21   #27  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,064
With the debug layer enabled in DX12 the error messages are much better. Current progress: possibly close to a real-processing tech sample for a speed test. Conversion of the input to NV12 format and correct reading of the output are still required. Also, there is currently a huge memory leak somewhere inside the GPU, so it fills 3 GB of memory after 500 FullHD frames. I need to find out how to free the resources. The ME performance of the GTX1060 looks comparable to an i5-9600K, or maybe 2..3 times better. The number of concurrent ME tasks looks unlimited, but performance may degrade once the Encoder reaches 100% load.
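
For reference, the YV12 to NV12 conversion mentioned above only rearranges chroma: YV12 stores the Y plane, then V, then U, while NV12 stores the Y plane followed by a single plane of interleaved U,V byte pairs. Below is a minimal Python sketch of that rearrangement. It assumes a tightly packed 8-bit buffer with even dimensions and no row pitch or padding, which is a simplification (real Avisynth planes and D3D12 textures have their own pitches), and `yv12_to_nv12` is a hypothetical helper name, not the plugin's code.

```python
def yv12_to_nv12(buf, width, height):
    """Convert a packed 8-bit YV12 frame (Y, V, U planes) to NV12 (Y, interleaved UV).

    Assumption: no row padding, even width and height.
    """
    y_size = width * height
    c_size = (width // 2) * (height // 2)
    if len(buf) != y_size + 2 * c_size:
        raise ValueError("buffer size does not match a packed YV12 frame")

    y = buf[:y_size]
    v = buf[y_size:y_size + c_size]   # YV12 stores V before U
    u = buf[y_size + c_size:]

    uv = bytearray(2 * c_size)
    uv[0::2] = u                      # NV12 chroma pairs start with U
    uv[1::2] = v
    return bytes(y) + bytes(uv)


# Tiny 4x2 frame: 8 luma bytes, 2 V bytes, 2 U bytes.
frame = bytes(range(8)) + b"\x10\x11" + b"\x20\x21"
print(yv12_to_nv12(frame, 4, 2).hex())
```

The luma plane is copied unchanged, so the cost of the conversion is mostly the chroma interleave.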

Some thinking is needed on how to balance parallel ME tasks across the MDegrainN threads. Currently each MDegrainN thread scans src-ref pairs with MAnalyse one by one (2×tr pairs in total per output frame), so it looks like the Avisynth interfacing between MDegrainN and MAnalyse needs rethinking to allow requesting several src-ref pairs in parallel, for load balancing between MDegrainN CPU processing and DX12-ME hardware acceleration. Without balancing, the hardware accelerator is currently not fully loaded.

Maybe the only easy solution is to use several 'super' clips in one MDegrainN and request ME searches from several MAnalyse objects in parallel? Unfortunately the current interface between MDegrain and MAnalyse via the Avisynth layer looks like it rules out some easy ways of ME multithreading.
Old 29th December 2021, 13:00   #28  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,064
Some more info on NVIDIA: https://www.nvidia.com/en-us/geforce...casting-guide/
One thing that is great about NVENC on the GeForce RTX 20 and 30-series and GeForce GTX 1650 Super and up is that all GPUs have the same NVENC with the same performance and quality, from the RTX 2060 to the RTX 3090.
NVENC can do up to 8K30, so the only way to overload it is to do 2x4K60 streams.

It is not clear how many current+ref pairs it processes for ME in MPEG encoding per output frame (possibly from 1 to several), but even with 1 pair at 8K 30 fps that means about 16x more throughput at FullHD resolution, i.e. 480 pairs per second.
MAnalyse/MDegrain needs to process 2×tr pairs per output frame.
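
The estimate above can be written out as a small Python sketch. It rests on two assumptions that are not confirmed anywhere: that the encoder does exactly one ME pair per encoded frame, and that ME throughput scales linearly with pixel count. The function names are hypothetical.

```python
FULLHD_PIXELS = 1920 * 1080
UHD_8K_PIXELS = 7680 * 4320


def pairs_per_second(rated_pixels, rated_fps, target_pixels):
    """Scale a rated frame rate to another frame size, assuming linear scaling with pixels."""
    return rated_pixels * rated_fps / target_pixels


def degrain_fps_ceiling(pair_rate, tr):
    """Upper bound on output fps when every output frame needs 2 * tr ME pairs."""
    return pair_rate / (2 * tr)


rate = pairs_per_second(UHD_8K_PIXELS, 30, FULLHD_PIXELS)
print(rate)                           # FullHD pairs per second under these assumptions
print(degrain_fps_ceiling(rate, 12))  # ME-limited ceiling for tr=12
```

Under these assumptions tr=12 would be ME-limited to roughly 20 output frames per second at FullHD; if the encoder actually processes several pairs per frame, the ceiling rises accordingly.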

Last edited by DTL; 29th December 2021 at 13:02.
Old 30th December 2021, 11:53   #29  |  Link
ReinerSchweinlin
Registered User
 
Join Date: Oct 2001
Posts: 454
Quote:
Originally Posted by DTL
.....
One thing that is great about NVENC on the GeForce RTX 20 and 30-series and GeForce GTX 1650 Super and up is that all GPUs have the same NVENC with the same performance and quality, from the RTX 2060 to the RTX 3090....

not quite

http://forum.doom9.org/showthread.ph...35#post1959335
Old 30th December 2021, 16:01   #30  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,064
That is only about B-frames for MPEG. I hope the performance of the NVENC ME engine does not degrade much on low-cost chips. One idea is to use the current sources to create a small tech test of the raw ME performance of current hardware, reporting the result in pairs of frames processed per second. Currently I have no idea how to run MAnalyse in a standalone processing mode, without MDegrain as the frame-requesting engine, and in 'multi=true' mode. And MDegrain, even in overlap=0 mode, is also a significant part of the slow processing now.
With MShow() as the data sink it will possibly request only 1 pair of src+ref frames.

And a script that returns the MAnalyse output directly gives an error like 'no video clip created for output'.
Old 30th December 2021, 16:15   #31  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,314
Had a look at the NVIDIA site:
https://docs.nvidia.com/video-techno...tion-only-mode
DX12 and NvEnc direct ME mode cannot work at the same time.
Old 30th December 2021, 16:24   #32  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,064
Those may be limits of the old (or current) manufacturer-only API. In the current Microsoft Windows API the hardware motion estimator does not appear to have such limitations, and data movement to and from the ME engine in NVENC is also performed via DX12 resources, starting from the creation of a D3D12 device.

Or maybe NVIDIA has several generations of ME modes (it mentions the 'old ME' and the 'new Optical Flow'), and some (the old ones?) can work via DirectX12 without limitations, but perhaps at lower quality?

Anyway, ME via DX12 works through the DX12 API (Windows SDK), not through the NVIDIA SDK, and is described as part of the Windows Media Foundation API. So it should be NVIDIA-independent (HW acceleration for DX12 is promised from Intel and AMD too). That is a good advantage over an NVIDIA-SDK-only approach: it is not tied to a single HW manufacturer, and it promises to be much more standard and longer-lived than single-manufacturer support.
I expect the whole old DirectShow API still works perfectly in the latest Windows, which demonstrates support lasting for decades.

Working with ME via the NVIDIA SDK may provide more settings/modes/params. For example, DX12-ME currently only supports NV12 as the input format.

Last edited by DTL; 30th December 2021 at 16:34.
Old 31st December 2021, 14:25   #33  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,064
Finally, a working test build of MVtools with the DX12_ME search mode for MAnalyse.

https://github.com/DTL2020/mvtools/r...46-dx12_me.a01

Enabling the DX12_ME search: optSearchOption=5 (and levels=1). Using levels > 1 or the default will possibly overwrite the saved vectors with the interpolated prediction from level 1. It copies the MVs from the DX12_ME output to the vectors structure of plane 0 and performs the SAD calculation for MDegrainN weighting. Chroma looks to be supported in both the DX12_ME search and SAD. Currently it looks like pel=1 only; pel 2 and 4 are possible, but in future builds.

Only block sizes 8x8 and 16x16 are supported. Windows 10 build 19041 or newer looks like the minimum requirement. When the DX12_ME hardware is in use, Task Manager -> GPU -> Video Encoder shows the load graph and percentage.

The test script is roughly:
Code:
LoadPlugin("mvtools2.dll")
LoadPlugin("ffms2.dll")

FFMpegSource2("src.mxf")

ConvertToYV12()

tr = 12 # Temporal radius
super = MSuper (mt=false, chroma=true,pel=1, hpad=8, vpad=8)
multi_vec = MAnalyse (super, multi=true, blksize=8, delta=tr, chroma=false, overlap=0, mt=false, optSearchOption=5, levels=1)
MDegrainN (super, multi_vec, tr, thSAD=300, thSAD2=290, mt=false, wpow=4)

Prefetch(6) # set prefetch number to number of host CPU cores
Hopefully it is fully compatible with MDegrainN only (it may crash the other MDegrains and other filters, because DX12_ME sometimes returns vectors pointing outside the frame borders). I need to find out why they get past the ClipMV() limiting in MAnalyse.
So the performance of the MDegrainN blending engine is a bit limited in this build by the additional clipping of block positions to the valid region.
The input format is currently limited to YV12 only, but this may be relaxed to all Mvtools inputs because frames are currently read from the loaded structures of MAnalyse.
Frame size tested: 1920x1080. Unfortunately interlaced with separated fields (1920x540 frame size) still returns an error from the DX subsystem; this needs fixing in later builds.

The speed on an i5-9600K with a GTX1060 is a bit faster than the official 2.7.45 build and a bit slower than SO=2. It looks like the overlapped modes of MDegrainN are supported (they run, but quality is not tested).

Speed may be better with the SAD calculation also done inside the HW accelerator using DirectX resources (textures). That would completely avoid the memory reads for the SAD calculation in MAnalyse.

Last edited by DTL; 11th January 2022 at 14:31.
Old 7th January 2022, 01:09   #34  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,577
Quote:
Originally Posted by DTL
Finally some working testbuild of MVtools with DX12_ME search mode for MAnalyse.
Tried and not working for me. Nvidia 1060 3GB.

Access violation from AVS.
Old 7th January 2022, 20:28   #35  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,064
What was the script, frame size, frame format and error message?

I have now found a vertical size limitation: it looks like an integer number of blocks is required, so 8x8 blocks work with a 1080-high frame (1080/8=135) but not after SeparateFields() to 540 height (540/8=67.5). Solution: pad the vertical size with AddBorders(). For example, to process interlaced 1080, pad to 1088: 1088/2/8=68.
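
The padding arithmetic above generalizes as in the following Python sketch. The helper name `padded_height` is hypothetical, and the sketch assumes the only constraint is a whole number of blocks per frame, or per field when the clip is processed as separated fields.

```python
def padded_height(height, blksize, interlaced=False):
    """Smallest height >= height that holds a whole number of blocks.

    For interlaced material each field has half the frame height,
    so the frame height must be a multiple of 2 * blksize.
    """
    step = 2 * blksize if interlaced else blksize
    return -(-height // step) * step  # ceiling division, then back to rows


for h, il in [(1080, False), (1080, True), (800, False)]:
    target = padded_height(h, 8, interlaced=il)
    print(h, il, target, target - h)  # original, interlaced?, padded, rows to add
```

The last column is the number of rows to add, for example as the bottom argument of AddBorders(left, top, right, bottom).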
Old 7th January 2022, 23:50   #36  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,577
Quote:
Originally Posted by DTL
What was the script, frame size, frame format and error message?
Really simple SMDegrain call:

SetMemoryMax()
SetFilterMTMode("DEFAULT_MT_MODE", 2)
LoadPlugin("D:\Eseguibili\Media\DGDecNV\DGDecodeNV.dll")
DGSource("F:\In\3_00 wolf of Wall Street, The\wolf.dgi",ct=140,cb=140,cl=0,cr=0)
ConvertBits(16)
SMDegrain (tr=3, thSAD=300, refinemotion=true, contrasharp=false, PreFilter=5, plane=4, chroma=true)
fmtc_bitdepth (bits=8,dmode=8)
Prefetch(6)
(tried without Prefetch too)

1920*1080 cropped to 1920*800 by DGSource.
Old 8th January 2022, 12:12   #37  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,064
1. MAnalyse with DX12_ME only supports 8-bit input (and currently only the YV12 format). I am not sure whether SMDegrain converts internally to 8 bit for MAnalyse or not.
2. To activate the DX12_ME search mode in that build you need to pass optSearchOption=5 to MAnalyse (and set levels=1 so the DX12_ME result is not overwritten with level 1 search data in the level 0 MVs array).

If you use SMDegrain without passing the new options, then again a special build is required with hardcoded options SO=5 and levels=1.

Because the ME engine in the GTX1060 is not many times faster than an i5-9600K, I am currently trying to make a Compute Shader SAD search that skips SearchMVs from GroupofPlanes completely (this saves one set of memory read operations) and produces MAnalyse output fully compatible with all other client filters of mvtools. I also have an idea for making MDegrain HW-accelerated without DX12 resource management in the AVS core: do all the required resource allocation in MAnalyse and pass pointers to the resources the way mvtools filters already exchange data, via the pseudo-audio stream. So in the best-case future all MAnalyse+MDegrain processing could be done with the frames loaded into HW accelerator memory once.

The current sources also already implement the idea of half-sized ME for better speed. The DX12_ME engine can currently only work with qpel precision, which is more than the typical fastest pel=1 processing needs. But with block size 8x8 it allows searching on half-sized data (level 1 from MSuper) and scattering the received MVs to level 0 with 16x16 blocks, truncating the precision by half (from qpel to half pel). This is currently controlled with optSearchOption=6. The speed limit is now the SAD calculation in MAnalyse, so the next step, compute-shader processing in the HW accelerator, is required to get the SAD values too.

Last edited by DTL; 8th January 2022 at 12:24.
Old 9th January 2022, 07:24   #38  |  Link
magnetite
Registered User
 
Join Date: May 2010
Posts: 64
I tried this out on my GTX 1080 Ti and initially it threw an error (0xC000374). So I lowered the temporal radius to 2 instead of 12, and it was able to run. Performance was around 150 FPS.

While using the CPU only, performance was around 125 FPS on my i7 6700K.

Source was a DVD.

Last edited by magnetite; 9th January 2022 at 07:50.
Old 9th January 2022, 13:20   #39  |  Link
ChaosKing
Registered User
 
Join Date: Dec 2005
Location: Germany
Posts: 1,795
I also made some quick tests:

Source ntsc DVD
CPU Ryzen 3600, GPU 3070 TI

Test script from here https://forum.doom9.org/showpost.php...1&postcount=33

But without wpow=4 (mvtools complained, so I removed it)
+ removed optSearchOption in the non-DX test

PHP Code:
original mvtools
without prefetch: 28 fps
with prefetch(6): 130 fps
with prefetch(8): 155 fps
with prefetch(12): 177 fps

DX12 mvtools
without prefetch: 13 fps
with prefetch(6): 86 fps
with prefetch(8): 94 fps
with prefetch(12): 97 fps
__________________
AVSRepoGUI // VSRepoGUI - Package Manager for AviSynth // VapourSynth
VapourSynth Portable FATPACK || VapourSynth Database

Last edited by ChaosKing; 9th January 2022 at 13:23.
Old 9th January 2022, 13:41   #40  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,064
" an error (0xC000374)"

This error code looks like something about heap corruption. I still have no idea where it could come from. Most DirectX API calls have their HR return value checked, and a non-S_OK result writes an error message naming the attempted API call. It is possible to enable the DirectX debug layer, which gives better messages about what was detected going wrong, but I need to find a way to pick up the messages from something like the 'debug output stream' in Visual Studio and add them to the Avisynth environment error output. So there may be special 'DX debug' builds of the plugin with possibly better error collection from users.

"lowered the temporal radius to 2 instead of 12, and it was able to run"

Is that the maximum value that works without throwing the error, or just some low value that works?

"DX12 mvtools
without prefetch: 13 fps"

Is that with optSearchOption=5? The other modes may be somewhat slower than 2.7.45 in MDegrain because of the 'anti-bug' check for invalid vectors added in that test build. In the latest sources it looks like I found where bad vectors could be passed, and this check has been removed.

"ntsc DVD
CPU Ryzen 3600"

It has 32 MB of L3 cache, so at a small frame size (and low tr) it may be faster to process on the CPU than to send data to the HW accelerator for motion search and read it back. For that config and frame size the HW processing may become faster only after moving 'all' processing to the DX12 pipeline, both the motion search and the block blending in MDegrain. That is still some unknown future.
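
As a rough illustration of the cache argument, the Python sketch below compares the raw 8-bit YV12 frame data touched per output frame (the source frame plus 2×tr references) with a 32 MiB cache. It is only a lower bound under loud assumptions: 720x480 for the NTSC DVD, no MSuper padding or extra levels, and no MV, SAD or output buffers. The function names are hypothetical.

```python
L3_BYTES = 32 * 2**20  # 32 MiB of L3 cache


def yv12_frame_bytes(width, height):
    """Bytes in one packed 8-bit YV12 frame: full-size luma plus two quarter-size chroma planes."""
    return width * height * 3 // 2


def working_set_bytes(width, height, tr):
    """Raw frame data for one output frame: the source plus 2 * tr reference frames."""
    return (2 * tr + 1) * yv12_frame_bytes(width, height)


for name, (w, h) in {"NTSC DVD": (720, 480), "FullHD": (1920, 1080)}.items():
    ws = working_set_bytes(w, h, tr=12)
    print(name, ws, "fits in L3" if ws <= L3_BYTES else "exceeds L3")
```

Under these assumptions the DVD-sized working set at tr=12 stays well inside the cache while the FullHD one does not, which is consistent with the CPU path doing relatively better on small frames.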

"with prefetch(12): 177 fps"

In that test build the fastest 'on-CPU' optSearchOption should be 2. It should be faster than v2.7.45. Unfortunately SO=3 is still not finished debugging and mostly works only on static colorbars, as a tech demo of multi-block MAnalyse processing.

Last edited by DTL; 9th January 2022 at 13:55.