New DirectX 12 Video APIs - Page 2

DTL · 15th December 2021, 14:53

With a multi-card PCIe setup it is possible to build a 'degraining farm' in a single host, and it may be much cheaper than distributed processing with multiple CPU-based workers connected over IP.

I see the cheapest used Maxwell-based 745 cards from about $50. It does require a motherboard with several PCIe (x16 ?) slots, though, or maybe special 'PCIe risers' are available to connect an x16 card to an x1 slot at lower speed ?

The MDegrainN motion search task naturally parallelizes into 2*tr src+ref searches, so a simple tr=12 degraining process can load up to 24 DX12-ME workers (NVIDIA promises to execute more than 1 ME task per chip on some chips).
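To illustrate that count, here is a minimal Python sketch. `me_pairs` is a hypothetical helper, not part of mvtools, and it assumes one backward and one forward reference per delta, ignoring clip boundaries:

```python
def me_pairs(frame, tr):
    """List the (src, ref) frame pairs searched for one output frame.

    Assumption: one backward and one forward reference for each
    delta = 1..tr, which gives 2*tr pairs in total.
    Clip start/end handling is ignored here.
    """
    pairs = []
    for delta in range(1, tr + 1):
        pairs.append((frame, frame - delta))  # backward reference
        pairs.append((frame, frame + delta))  # forward reference
    return pairs


pairs = me_pairs(frame=100, tr=12)
print(len(pairs))  # 24 independent searches
```

Each pair depends only on its own two frames, so in principle every one of them could go to a separate ME worker.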

The MV data handling in MDegrainN is already fixed to some extent, so I hope it will limit the speed less.

Another interesting question - what is the maximum vector length the NVIDIA ME engine searches ? Does it have any limit ? Currently the API has no setting for it. For typical degraining work, though, very short vectors of plus or minus a few samples are usually enough, because degraining works on slow-motion areas; with fast motion the noise is less visible, and motion picture content typically has no large fast-moving areas.

ReinerSchweinlin · 15th December 2021, 15:09

Maybe helpful addition:

While "Turing" refers to the GPU generation, the included encoders are not necessarily the same.

A 1650 has the Volta encoder, which cannot produce B-frames with H.265.

The 1660 has the HEVC B-frame capability, but it only produces B-frames with NVEncC; the NVIDIA-provided encoder (used in A's Video Converter, for example) does NOT use B-frames.

ffmpeg with hardware-accelerated encoding also does not support B-frames at the moment.

All cards with "RTX" in their name work fine with B-frame HEVC encoding.

Some of the Quadro and OEM cards are also missing the newer encoder (while being labeled Turing or Pascal)...

DTL · 17th December 2021, 15:04

Development with remote debugging over LAN has started, but it is not going fast because I am rarely at work, where the GTX1060 card is, and I am not in great shape after a long time without sleep. DX12-ME init is currently partially finished in MAnalyse - see this commit https://github.com/DTL2020/mvtools/c...c8404e1ea79769

If someone can provide a remote debugging environment, it may speed up the process. It requires a public IP connection for the debugger and FTP access to a folder for uploading dev builds of the .dll and possibly other required debug .dlls. It does not require installing much on the system - only unpacking an archive to a folder with a known path from the root of a drive letter and running the remote debugger monitor application. The debug session is protected by the login/password of a local system user with 'remote debugging' access.

Microsoft does not recommend debugging over the internet because it may be slow, suffer packet loss and more, but it is possibly a workable solution.

I still do not know whether the GTX745 card supports DX12-ME, and I have not been able to get one cheaply enough where I live. Unfortunately, it looks like no software emulation of this processing is available in DX12. I have not searched very hard, though.

tormento · 18th December 2021, 12:40

Quote:

Originally Posted by DTL

If someone can provide a remote debugging environment, it may speed up the process.

Perhaps you can try to contact videoh. He has a lot of experience with NVIDIA cards and CUDA.

DTL · 18th December 2021, 13:41

Sad news - I tested a GTX 745 card on Win10 19041 with the November 2021 driver and it does not provide the HW-ME function. So it is again unknown which is the cheapest card (from Maxwell ?) that can provide the required service. Maybe I need to e-mail NVIDIA support about this.

Hehe - the article about hardware ME acceleration for external software clients https://developer.nvidia.com/blog/an...ical-flow-sdk/ is dated Feb 2019 - more than 2 years have passed already. I have posted a question on the NVIDIA developer support forums about selecting a DX12-ME capable card.

DTL · 22nd December 2021, 18:00

Possibly more sad news - the GTX750 was tested and it looks like it is not supported either, though it is listed as second-generation Maxwell.

I have mostly finished the DX12_ME interface part of the program, but I still need a lot of help from Microsoft developers to get it working, or maybe any working sample of a DX12 video encoder (or motion estimator). Currently most functions return a no-error HRESULT, but the final mapping of the resulting resource buffer to read the MVs in the plugin returns 'general error - D3D device disconnected' with the error reason 'application have errors and need debug'. It is very unclear where the error may be. Also, the resource for the motion estimator logical device still cannot be switched to the required state 'motion estimator read source' after writing the source and ref data.

DTL · 27th December 2021, 16:21

With the debug layer enabled in DX12, the error messages are much better. Current progress: possibly close to a real processing tech sample for a speed test. Conversion of the input to NV12 format and correct reading of the output are still required. There is also currently a huge memory leak somewhere inside the GPU, so it fills 3 GB of memory after 500 FullHD frames. I need to find out how to free the resources. The ME performance of the GTX1060 looks comparable to the i5-9600K, or maybe 2..3 times better. The number of concurrent ME tasks looks unlimited, but performance will probably degrade after 100% Encoder load.

Some thinking is needed on how to balance parallel ME tasks per MDegrainN thread. Currently each MDegrainN thread scans src-ref pairs with MAnalyse one by one (2*tr pairs in total per output frame), so the Avisynth interfacing between MDegrainN and MAnalyse needs rethinking to allow requesting several src-ref pairs in parallel, for load balancing between MDegrainN CPU processing and DX12-ME hardware acceleration. Without balancing, the hardware accelerator is currently not fully loaded.

Maybe the only easy solution is to use several 'super' clips in one MDegrainN and request the ME search from several MAnalyse objects in parallel ? Unfortunately the current interfacing between MDegrain and MAnalyse via the Avisynth layer seems to rule out some easy ways of ME multithreading.

DTL · 29th December 2021, 13:00

Some more info on NVIDIA: https://www.nvidia.com/en-us/geforce...casting-guide/
One thing that is great about NVENC on the GeForce RTX 20 and 30-series and GeForce GTX 1650 Super and up is that all GPUs have the same NVENC with the same performance and quality, from the RTX 2060 to the RTX 3090.
NVENC can do up to 8K30, so the only way to overload it is to do 2x4K60 streams.

It is not clear how many current+ref pairs it processes for ME in MPEG encoding for each output frame (possibly from 1 to several), but even with 1 pair at 8K 30 fps it means about 16x more at FullHD resolution, i.e. 480 pairs per second.
MAnalyse/MDegrain needs to process 2*tr pairs for each output frame.
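The arithmetic behind that estimate, as a rough Python sketch. The function name is made up for illustration, and it assumes ME cost scales linearly with pixel count and that the encoder searches exactly one current+ref pair per encoded frame, neither of which is confirmed:

```python
def fullhd_pairs_per_second(width=7680, height=4320, fps=30):
    """Scale an encoder pixel rate to 1920x1080 ME pairs per second.

    Assumptions: ME cost scales linearly with pixel count, and the
    encoder searches exactly one current+ref pair per encoded frame.
    """
    pixel_rate = width * height * fps
    return pixel_rate / (1920 * 1080)


pairs_per_s = fullhd_pairs_per_second()  # 8K30 -> 480.0
tr = 12
out_fps = pairs_per_s / (2 * tr)         # 24 pairs per output frame -> 20.0
print(pairs_per_s, out_fps)
```

So under those assumptions a fully loaded NVENC would feed roughly 20 FullHD output frames per second at tr=12, before counting any transfer or SAD overhead.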

ReinerSchweinlin · 30th December 2021, 11:53

Quote:

Originally Posted by DTL

.....
One thing that is great about NVENC on the GeForce RTX 20 and 30-series and GeForce GTX 1650 Super and up is that all GPUs have the same NVENC with the same performance and quality, from the RTX 2060 to the RTX 3090....

not quite

http://forum.doom9.org/showthread.ph...35#post1959335

DTL · 30th December 2021, 16:01

That is only about B-frames for MPEG. I hope the performance of the NVENC ME engine does not degrade greatly on low-cost chips. One idea is to use the current sources to create a small tech test of the raw ME performance of current hardware, with the result output as pairs of processed frames per second. Currently I have no idea how to run MAnalyse in a solo processing mode, without MDegrain as the frame-request engine, and in 'multi=true' mode. And MDegrain, even in overlap=0 mode, is now also a significant part of the slow processing.
With MShow() as the data sink it will possibly request only 1 pair of src+ref frames.

And a script with MAnalyse as the output returns an error like 'no video clip created for output'.

pinterf · 30th December 2021, 16:15

Had a look at NVidia site;
https://docs.nvidia.com/video-techno...tion-only-mode
DX12 and NvEnc direct ME mode cannot work at the same time.

DTL · 30th December 2021, 16:24

Those may be limits of the old (or current) manufacturer-only API. In the current Microsoft Windows API the hardware motion estimator does not appear to have such limitations, and data movement to and from the ME engine in NVENC is performed via DX12 resources, starting from creating the D3D12 device.

Or maybe NVIDIA has several generations of ME modes (it mentions 'old ME' and 'new Optical Flow') and some (the old ones ?) can work via DirectX 12 without limitations. But maybe with lower quality ?

Anyway, ME via DX12 works through the DX12 API (Windows SDK), not through the NVIDIA SDK, and it is described as part of the Windows Media Foundation API. So it should be NVIDIA-independent (HW acceleration for DX12 is promised from Intel and AMD too). That is a good advantage over an NVIDIA-SDK-only approach - it is not tied to a single HW manufacturer. It promises to be much more standard and to have longer-lived support than a single manufacturer's SDK.
I hope that in the latest Windows all of the old DirectShow API still works perfectly. That would demonstrate support lasting for decades.

Maybe working with ME via the NVIDIA SDK provides more settings/modes/params. For example, DX12-ME currently supports only NV12 format as input.

DTL · 31st December 2021, 14:25

Finally some working testbuild of MVtools with DX12_ME search mode for MAnalyse.

https://github.com/DTL2020/mvtools/r...46-dx12_me.a01

Enabling DX12_ME search: optSearchOption=5 (and levels=1). Using levels > 1 or the default will possibly overwrite the saved vectors with interpolated predictions from level 1. It copies the MVs from the DX12_ME output to the vector structure of plane 0 and performs the SAD calculation for MDegrainN weighting. Chroma looks to be supported in both the DX12_ME search and SAD. Currently it looks like pel=1 only. pel 2 and 4 are possible, but in future builds.

The only supported block sizes are 8x8 and 16x16. Windows 10 build 19041 or newer looks like the minimum requirement. When the DX12_ME hardware is in use, the resource monitor TaskManager->GPU->Video Encoder shows the load graph and %.

The test script is roughly:

Code:

LoadPlugin("mvtools2.dll")
LoadPlugin("ffms2.dll")

FFMpegSource2("src.mxf")

ConvertToYV12()

tr = 12 # Temporal radius
super = MSuper (mt=false, chroma=true,pel=1, hpad=8, vpad=8)
multi_vec = MAnalyse (super, multi=true, blksize=8, delta=tr, chroma=false, overlap=0, mt=false, optSearchOption=5, levels=1)
MDegrainN (super, multi_vec, tr, thSAD=300, thSAD2=290, mt=false, wpow=4)

Prefetch(6) # set prefetch number to number of host CPU cores

It is hoped to be completely compatible only with MDegrainN (it may crash other MDegrains and other filters, because DX12_ME sometimes returns vectors pointing outside the frame borders). I need to find out why they pass the ClipMV() limiting in MAnalyse.
So the performance of the MDegrainN blending engine is a bit limited in this build by additional clipping of the block position to the valid region.
The input format is currently limited to YV12 only, but this may be relaxed to all MVtools inputs because frames are currently read from the loaded structures of MAnalyse.
Frame size tested - 1920x1080. Unfortunately, interlaced input with separated fields (1920x540 frame size) still returns an error from the DX subsystem - to be fixed in later builds.
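For readers wondering what clipping the block position to the valid region means, a minimal Python sketch follows. It is not the mvtools code; `clamp_block_pos` is a hypothetical name, and it assumes the valid region is the frame plus the hpad/vpad padding from the MSuper call above:

```python
def clamp_block_pos(x, y, blk, width, height, hpad=8, vpad=8):
    """Clamp a block's top-left corner so the whole block stays readable.

    Assumption: the valid region is the frame plus hpad/vpad pixels
    of padding on each side, as set in the MSuper call.
    """
    x = max(-hpad, min(x, width + hpad - blk))
    y = max(-vpad, min(y, height + vpad - blk))
    return x, y


# Block at (1912, 0) displaced by an out-of-range vector (+40, -30).
print(clamp_block_pos(1912 + 40, 0 - 30, blk=8, width=1920, height=1080))
# (1920, -8)
```

A check like this on every block costs a few comparisons per vector, which matches the small slowdown mentioned for this build.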

The speed on an i5-9600K with a GTX1060 is a bit faster than the official 2.7.45 build and a bit slower than SO=2. It looks like the overlapped modes of MDegrainN are supported (they run, but are not tested for quality).

Speed may be better with the SAD calculation also done inside the HW accelerator, processing DirectX resources (textures). That would completely avoid the memory read in the MAnalyse SAD calculation.

tormento · 7th January 2022, 01:09

Quote:

Originally Posted by DTL

Finally some working testbuild of MVtools with DX12_ME search mode for MAnalyse.

Tried and not working for me. Nvidia 1060 3GB.

Access violation from AVS.

DTL · 7th January 2022, 20:28

What was the script, frame size, frame format and error message ?

Currently found V-size limitation: an integer number of blocks appears to be required, so an 8x8 block works with a 1080-high frame (1080/8=135) and does not work with SeparateFields to 540 height (540/8=67.5). Solution - pad the vertical size with AddBorders(). For example, to process interlaced 1080, pad to 1088: 1088/2/8=68.
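The padding rule can be worked out for other sizes with a few lines of Python. This is only a sketch of the arithmetic; `pad_for_blocks` is a hypothetical helper, and whole blocks per field is assumed to be the only constraint:

```python
def pad_for_blocks(height, blksize=8, fields=1):
    """Return the smallest padded height with whole blocks per field.

    fields=2 models SeparateFields(): each field gets height/2 lines,
    so the full frame height must be a multiple of blksize * fields.
    """
    step = blksize * fields
    return -(-height // step) * step  # round up to a multiple of step


print(pad_for_blocks(1080))            # 1080, progressive already fits
print(pad_for_blocks(1080, fields=2))  # 1088, 544 lines = 68 blocks per field
```

For the interlaced 1080 case that means 8 extra lines, for example AddBorders(0, 0, 0, 8) before SeparateFields(), cropped off again after processing.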

tormento · 7th January 2022, 23:50

Quote:

Originally Posted by DTL

What was the script, frame size, frame format and error message?

Really simple SMDegrain call:

SetMemoryMax()
SetFilterMTMode("DEFAULT_MT_MODE", 2)
LoadPlugin("D:\Eseguibili\Media\DGDecNV\DGDecodeNV.dll")
DGSource("F:\In\3_00 wolf of Wall Street, The\wolf.dgi",ct=140,cb=140,cl=0,cr=0)
ConvertBits(16)
SMDegrain (tr=3, thSAD=300, refinemotion=true, contrasharp=false, PreFilter=5, plane=4, chroma=true)
fmtc_bitdepth (bits=8,dmode=8)
Prefetch(6) # tried without Prefetch too

1920*1080 cropped to 1920*800 by DGSource.

DTL · 8th January 2022, 12:12

1. MAnalyse with DX12_ME supports only 8-bit input (and currently only YV12 format). I am not sure whether SMDegrain converts to 8 bit internally for MAnalyse or not.
2. To activate the DX12_ME search mode in that build you need to pass optSearchOption=5 to MAnalyse (and set levels=1 so that the DX12_ME result in the level 0 MVs array is not overwritten with level 1 search data).

If you use SMDegrain without passing the new options, a special build with hardcoded SO=5 and levels=1 is required.

Because the ME engine in the GTX1060 is not many times faster than the i5-9600K, I am currently trying to write a Compute Shader SAD search to skip SearchMVs from GroupofPlanes completely (it saves 1 set of memory read operations); it will produce MAnalyse results fully compatible with all other client filters of mvtools. I also have an idea for making MDegrain HWAcc-based without DX12 resource management in the AVS core - do all the required resource allocation in MAnalyse and pass pointers to the resources the way mvtools filters already communicate, via the pseudo-audio stream. So in the best future all MAnalyse+MDegrain processing may be done with frames loaded into HWAcc memory only once.

The current sources also already implement the idea of half-sized ME for better speed. The DX12_ME engine can currently work only with qpel precision, which is more than the typical fastest pel=1 processing needs. But with block size 8x8 it allows searching on half-sized data (level 1 from MSuper) and scattering the received MVs to level 0 with 16x16 blocks, truncating the precision by half (from qpel to half pel). This is currently selected with optSearchOption=6. The speed limit is now the SAD calculation in MAnalyse - so the next step, compute-shader processing in HWAcc, is required to get the SAD values too.
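The unit conversion behind the half-sized search can be sketched in Python. This is my reading of the description, not code from the build: `scale_mv_to_level0` is a hypothetical name, and truncation toward zero is an assumption. A vector of q quarter-pels on the half-size plane equals q/2 full pixels on the full-size plane, so dropping the lowest bit gives integer-pel level 0 vectors for pel=1:

```python
def scale_mv_to_level0(mv_qpel_half):
    """Convert a qpel vector found on the half-size plane to level 0.

    q quarter-pels on the half-size plane = q/2 full pixels on the
    full-size plane, so truncating qpel to half-pel on the half-size
    plane yields integer-pel units on level 0 (pel=1).
    Assumption: truncation is toward zero; the real build may differ.
    """
    dx, dy = mv_qpel_half
    return int(dx / 2), int(dy / 2)


print(scale_mv_to_level0((7, -5)))  # (3, -2)
```

An 8x8 block on the half-size plane covers the same picture area as a 16x16 block on level 0, so under this reading the block grid maps one to one and only the vector units change.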

magnetite · 9th January 2022, 07:24

I tried this out on my GTX 1080 Ti and initially it threw an error (0xC000374). So I lowered the temporal radius to 2 instead of 12, and it was able to run. Performance was around 150 FPS.

While using the CPU only, performance was around 125 FPS on my i7 6700K.

Source was a DVD.

ChaosKing · 9th January 2022, 13:20

I also made some quick tests:

Source ntsc DVD
CPU Ryzen 3600, GPU 3070 TI

Test script from here https://forum.doom9.org/showpost.php...1&postcount=33

But without wpow=4 (mvtools complained, so I removed it)
+ removed Searchoption in non DX test

Code:

original mvtools
without prefetch: 28 fps
with prefetch(6): 130 fps
with prefetch(8): 155 fps
with prefetch(12): 177 fps

DX12 mvtools
without prefetch: 13 fps
with prefetch(6): 86 fps
with prefetch(8): 94 fps
with prefetch(12): 97 fps

DTL · 9th January 2022, 13:41

" an error (0xC000374)"

This error code looks like heap corruption (STATUS_HEAP_CORRUPTION is 0xC0000374). I still have no idea where it can come from. Most of the DirectX API calls check the returned HRESULT and, if it is not S_OK, write an error message naming the API call that was tried. It is possible to enable the DirectX Debug layer, which gives better messages about what is going wrong, but I need to find a way to pick up the messages from something like the 'debug output stream' in Visual Studio and add them to the Avisynth environment error output. So there may be special 'DX debug' builds of the plugin with possibly better error collection from users.

"lowered the temporal radius to 2 instead of 12, and it was able to run"

Is that the maximum value that runs without throwing the error, or just some low value that works ?

"DX12 mvtools
without prefetch: 13 fps"

Is that with optSearchOption=5 ? The other modes may be somewhat slower than 2.7.45 in MDegrain because of the 'anti-bug' check for invalid vectors added in that testbuild. In the latest sources it looks like I have found where bad vectors could be passed, and this check has been removed.

"ntsc DVD
CPU Ryzen 3600"

It has 32 MB of L3 cache, so with a small frame size (and low tr) it may be faster to process on the CPU than to send data to the HW accelerator for motion search and read it back. For that config and frame size, HW processing may become faster only after moving 'all' processing to the DX12 pipeline - both motion search and block blending in MDegrain. That is still some unknown future.

"with prefetch(12): 177 fps"

In that testbuild the fastest working 'on-CPU' optSearchOption should be 2. It should be faster than v.2.7.45. Unfortunately SO=3 is still not fully debugged and works only on static colorbars, mostly as a tech demo of multi-block MAnalyse processing.
