Old 6th April 2024, 12:11   #261  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,159
" Intel Compiler. AFAIK it produces the fastest builds, even for AMD."

I tried it in the early 2020s and it gave a small but visible benefit over the Visual Studio compiler. But as we see with Asd-g's builds, the development of LLVM compilers has progressed well and they now make the best binaries. Maybe someday I will also try to install LLVM on my development system.

The current best-quality degrain script uses a first search at 16x16 block size with pel=1 and recalculation to 8x8 block size at pel=2 before MDegrainN (at a not-too-slow speed):
Code:
# assumes mvtools2.dll is loaded and 'last' holds the source clip
my_tr=12
my_AMDiffSAD=0
my_thSADA_a=1.2

my_intOvlp=3
my_ovlp=0
my_blksize=16

# super clips for the two search passes (pel=1 for the first search, pel=2 for recalculation/degrain)
super_p2=MSuper(last, mt=false, pel=2, hpad=32, vpad=32)
super_p1=MSuper(last, mt=false, pel=1, hpad=32, vpad=32)

# first pass: multi-frame search at pel=1, 16x16 blocks, with AreaMode refinement
vec_am1o4=MAnalyse(super_p1, multi=true, blksize=my_blksize, delta=my_tr, search=3, searchparam=2, truemotion=true, pnew=0, global=true, overlap=my_ovlp, chroma=true,\
 optSearchOption=1, optPredictorType=0, mt=false, AreaMode=1, AMoffset=4, AMdiffSAD=my_AMDiffSAD)

# second pass: recalculate the vectors to 8x8 blocks at pel=2
multi_vec_mrec_am1o4am2recall=MRecalculate(super_p2, vec_am1o4, thSAD=20, blksize=8, search=3, searchparam=4, truemotion=true, pnew=0, chroma=true, overlap=my_ovlp, \
AreaMode=2, AMoffset=0, tr=my_tr)

# temporal degrain using the recalculated vectors
MDegrainN(last, super_p2, multi_vec_mrec_am1o4am2recall, my_tr, thSADA_a=my_thSADA_a, thSADA_b=50, mt=false, wpow=4, adjSADzeromv=0.8, adjSADcohmv=0.8,\
thCohMV=16, MVLPFGauss=0.9, thMVLPFCorr=50, adjSADLPFedmv=0.9, IntOvlp=my_intOvlp)
The thSAD param for MRecalculate is a significant quality/performance balancing value (see the sketch after this list):

1. Low or zero thSAD: performs a refining search for all newly recalculated (interpolated) MVs. Slowest mode, but may give the best quality.
2. Default of 200: depending on noise level and motion, causes more or fewer refining searches for blocks with SAD > thSAD.
3. Very high value (around thSCD1): only interpolates MVs and re-checks SAD for the interpolated MV. Fastest mode, but may give lower quality (depending on the noise level/profile and more).
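A minimal sketch of the three regimes, reusing the clips from the script above (the thSAD values are only illustrative):
Code:
# variants of the MRecalculate call above; thSAD values are illustrative
rec_best = MRecalculate(super_p2, vec_am1o4, thSAD=0,   blksize=8, tr=my_tr) # refine every interpolated MV: slowest, best quality
rec_def  = MRecalculate(super_p2, vec_am1o4, thSAD=200, blksize=8, tr=my_tr) # default: refine only blocks with SAD > 200
rec_fast = MRecalculate(super_p2, vec_am1o4, thSAD=400, blksize=8, tr=my_tr) # around thSCD1: interpolate and re-check SAD only, fastest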

About low AreaMode settings (starting from 1 'layer'): it looks like somewhat better quality is obtained with a non-zero AMoffset value, and a good value is expected to be around blocksize/4. For higher AreaMode settings, the best AMstep/AMoffset values as a function of block size are still the subject of many tests.
For a block size of 16x16, AMoffset=0 and AreaMode=1 cause only an additional search at +-1 around the block's center position, which is only 1/16 of the block size. With AMoffset=4, more surrounding samples are used in a lower number of new search calculations, and this looks to create more quality benefit per CPU cycle spent.
The current equation for the offset from the block's center position for each AreaMode step/'layer' (i) is int iOffset = (iAMstep * i + iAMoffset + 1); and 4 new MAnalyse searches are performed at 4 search positions of +-iOffset (diagonal, not sides). So for AreaMode=1, AMoffset=0, AMstep=1, the 4 new searches are performed with a +-1 offset (in the current 'pel' scale, which differs for each level and pel setting). It may not be effective (for areas with complex motion) to make additional searches with iOffset > blocksize, or even blocksize/2.
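As a worked illustration of that equation (assuming AMstep=1 as above and that the 'layer' index i runs from 0 to AreaMode-1):
Code:
# iOffset = (iAMstep * i + iAMoffset + 1), with 4 diagonal searches per layer
#   AreaMode=1, AMoffset=0 -> layer i=0: iOffset = 1
#   AreaMode=1, AMoffset=4 -> layer i=0: iOffset = 5   (~blocksize/4 for 16x16 blocks)
#   AreaMode=2, AMoffset=4 -> layers i=0,1: iOffset = 5 and 6 (8 extra searches total)
vec_am = MAnalyse(super_p1, multi=true, blksize=16, delta=my_tr, optSearchOption=1, mt=false, AreaMode=1, AMoffset=4)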

Also, the AreaMode algorithm may have some 'natural' limit on MV quality improvement as the number of new search positions grows. Currently the internal (unchecked) limit of the search-positions vector is 100 (expected to be too slow for anyone to ever use), so users can experiment with AreaMode offsets much larger than the block size.

Currently the compute complexity scales linearly with the AreaMode setting (the number of new search positions is 4 times the AreaMode setting). It is possible to add a check of all other possible integer positions around the center within a given radius (including side positions and all intermediate ones), but that would make the compute complexity up to the square of the AreaMode setting. Maybe in some future version at least a sides-check option will be added (it only increases complexity by a linear factor of 2). Maybe as an additional AMflags param like:
AMflags 1 - diagonal offsets
AMflags 2 - sides offsets
AMflags 4 - all offsets in the area defined by AreaMode+AMstep+AMoffset.
So a user could select AMflags 1+2=3 for diagonal+side offsets, for example. For 'big' block sizes like 16x16 or 32x32, using AreaMode=1, AMoffset=blocksize/4, AMflags=3 may give a better performance/quality balance. For block size 8x8, AMflags=3 is equal to all possible positions with AreaMode=1 (4 diagonal and 4 sides), and so on.
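A hypothetical sketch of the proposed flag (AMflags is only an idea at this point of the post, so the exact name and accepted values are assumptions):
Code:
# hypothetical: AMflags=1+2=3 would request both diagonal and side offsets
vec = MAnalyse(super_p1, multi=true, blksize=16, delta=my_tr, optSearchOption=1, mt=false, \
 AreaMode=1, AMoffset=4, AMflags=3)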

Last edited by DTL; 6th April 2024 at 12:18.
DTL is offline   Reply With Quote
Old 13th April 2024, 03:58   #262  |  Link
TDS
Formally known as .......
 
TDS's Avatar
 
Join Date: Sep 2021
Location: Down Under.
Posts: 1,041
I have a question...

In the case of encoding using Distributed Encoding (which can use different CPUs), some are AVX only, some are AVX2, and one has AVX512...

Which "variety" of mvtools should be used? (I'm not yet able to test this for myself; just curious....)
__________________
Long term RipBot264 user.

RipBot264 modded builds..
*new* x264 & x265 addon packs..
TDS is offline   Reply With Quote
Old 13th April 2024, 11:02   #263  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,159
You do not describe how the encoding is distributed: different movies to different hosts, cut/interleaved parts of a single movie to different hosts, or one of many other possible methods.

For example, I am now thinking of distributing the MAnalyse analysis for AreaMode with different offsets to different hosts, if possible, and aggregating the several MV frames in a single decision filter like MAvg().

For best performance you can install different builds (AVX, AVX2, AVX512) on different hosts, but in some cases the computed results may differ slightly. To get results as equal as possible, it is recommended to install a single (lowest SIMD family) build on all hosts and also use SetMaxCPU() to limit all execution to the same lowest SIMD functions.
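A minimal sketch of that approach (the exact feature string accepted by AviSynth+'s SetMaxCPU() is an assumption here):
Code:
# put this at the top of the script on every host so all of them take the same SIMD code paths
SetMaxCPU("SSE2") # assumed feature-string form; caps execution at SSE2-level functions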
DTL is offline   Reply With Quote
Old 13th April 2024, 12:00   #264  |  Link
TDS
Formally known as .......
 
TDS's Avatar
 
Join Date: Sep 2021
Location: Down Under.
Posts: 1,041
Quote:
Originally Posted by DTL View Post
You do not describe how the encoding is distributed: different movies to different hosts, cut/interleaved parts of a single movie to different hosts, or one of many other possible methods.

For example, I am now thinking of distributing the MAnalyse analysis for AreaMode with different offsets to different hosts, if possible, and aggregating the several MV frames in a single decision filter like MAvg().

For best performance you can install different builds (AVX, AVX2, AVX512) on different hosts, but in some cases the computed results may differ slightly. To get results as equal as possible, it is recommended to install a single (lowest SIMD family) build on all hosts and also use SetMaxCPU() to limit all execution to the same lowest SIMD functions.
Hi DTL, glad you asked.

I use RipBot264 (see sig), and it has a unique feature that, AFAIK, no other encoding app has.

StaxRip comes close with its chunk encoding, but it can only use the PC it's running on.

RipBot264 can encode "chunks": the main server PC prepares them, and the clients (up to 15 other PCs on a LAN) encode them, speeding up the process enormously. When all the chunks are encoded, the client PCs idle while the main server combines the chunks and muxes everything into the completed encoded video/audio file. Then, if there is another job in the queue, it starts processing and compiling chunks again.

So, to my question: say the server/main PC is AVX2 or AVX512, so you use the AVX2 build of your mvtools; however, some of the PCs in the encoding "farm" are only AVX or lower. Would a script using mvtools (SMDegrain) work on the AVX PCs??

For example, if I use the x265 AVX-512 command line option on my Ryzen 7950X, the others will automatically disable that option and proceed.

Would mvtools do something similar??

I'm quite interested in SetMaxCPU(); that might be a good option.
__________________
Long term RipBot264 user.

RipBot264 modded builds..
*new* x264 & x265 addon packs..

Last edited by TDS; 13th April 2024 at 12:04.
TDS is offline   Reply With Quote
Old 13th April 2024, 14:42   #265  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,159
"some of the PC's in the encoding "farm" are only AVX, or lower, would a script using mvtools (SMDegrain) work on the AVX PC's ??"

The script will work if you do not set options with AVX2 or higher requirements (optSearchOption=2 and above). If the CPU is not supported, it should throw an error about AVX2 (or AVX512) not being present and exit. Also, you need to install AVX or lower builds of mvtools2.dll on the AVX hosts.
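So, as a sketch, a farm-safe MAnalyse call keeps to the onCPU search path (parameter values here are illustrative):
Code:
# safe on AVX-only clients: optSearchOption stays at 1 (onCPU search),
# so no AVX2/AVX512-only code path is requested
vec = MAnalyse(super, multi=true, blksize=16, delta=6, search=3, searchparam=2, optSearchOption=1, mt=false)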

There are 2 different types of SIMD optimizations in the current mvtools2 builds:
1. Automatic, by the compiler. These builds can only run on the target SIMD architecture (or higher).
2. Manual functions with selectors and an architecture check. These are controlled by options (and SetMaxCPU()) and can throw an error if the options are not compatible with the CPU features present on the host.

Typically the performance benefit of the automatic compiler optimizations is not big, so as a safe option you can install the SSE2 build on all hosts. After all script testing, you can try installing the AVX/AVX2/AVX512 builds on the hosts that support them and check the performance difference.
DTL is offline   Reply With Quote
Old 13th April 2024, 14:59   #266  |  Link
TDS
Formally known as .......
 
TDS's Avatar
 
Join Date: Sep 2021
Location: Down Under.
Posts: 1,041
Quote:
Originally Posted by DTL View Post
Typically the performance benefit of the automatic compiler optimizations is not big, so as a safe option you can install the SSE2 build on all hosts. After all script testing, you can try installing the AVX/AVX2/AVX512 builds on the hosts that support them and check the performance difference.
I think the lowest CPU would be AVX, so if I use the AVX compile on the main server PC, that should be suitable for any PC in the "farm"??

However, if I don't use the AVX CPUs very often, I could use the AVX2 compile... it should work either way.

FYI, I don't need to install it on the clients, as they are "commanded" by the settings on the main encoding server.

I will do some tests, in the coming days, and see what happens.

Thanks for the feedback.
__________________
Long term RipBot264 user.

RipBot264 modded builds..
*new* x264 & x265 addon packs..
TDS is offline   Reply With Quote
Old 13th April 2024, 15:30   #267  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,159
"I think the lowest CPU would be AVX, so if I use the AVX compile, on the main server PC, then that should be suitable for any in the "farm" ??"

Yes.

"I don't need to install it on the clients, as they are "commanded" by the settings on the main encoding server."

Does it copy all required .dlls to the clients? If that cannot be disabled, you can only use the 'lowest' build on the server and all clients.
DTL is offline   Reply With Quote
Old 13th April 2024, 15:40   #268  |  Link
TDS
Formally known as .......
 
TDS's Avatar
 
Join Date: Sep 2021
Location: Down Under.
Posts: 1,041
Quote:
Originally Posted by DTL View Post

Does it copy all required .dlls to the clients? If that cannot be disabled, you can only use the 'lowest' build on the server and all clients.
That's not exactly how it does what it does, but it works.

I will test a few different compiles, and let you know how it behaves.
__________________
Long term RipBot264 user.

RipBot264 modded builds..
*new* x264 & x265 addon packs..
TDS is offline   Reply With Quote
Old 15th April 2024, 16:12   #269  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,159
New release: https://github.com/DTL2020/mvtools/r.../r.2.7.46-a.29

New params for MRecalculate:

SuperCurrent (clip): same as for MAnalyse.

AMthVSMang (float), default = 10.0f (disabled). Threshold of the Vectors Stability Metric for skipping the current-level MV and using the previous-level hierarchy predictor. Usable range 0.0f .. 1.0f: 0.0f skips all MVs, 1.0f does not skip any. Auto-normalized to AMpoints for any combination of AM params, so it is expected to be independent of the AM params (AreaMode and others).

AMflags (integer), default = 1. Any combination of AM position direction flags per AMstep: 1 = diagonal positions, 2 = side positions. Currently valid values are 1 (4 diagonal positions only), 2 (4 side positions only) and 3 (4 diagonal and 4 side positions - 8 positions total, about 2x slower).

AMavg (integer), default = 0. The type of averaging operation applied to the MVs after the area-gathering searches in the checked area around the current block:
0 - mode (median?) of dx and dy separately.
1 - mean of dx and dy.
2 - mode (median?) of the MVs' angle difference.
3 - mode (median?) of the MVs' difference vector length.
Types 0, 2 and 3 use a performance optimization: about half of the search positions are skipped if the MVs gathered so far equal the input to AreaMode (the original block MVs from the initial MAnalyse search). Type 1 always searches all positions defined by the AM options, so it may be up to 2 times slower.
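A hedged sketch of how the new MRecalculate params might be combined, reusing the clips from the script in post #261 (values are illustrative, not a recommendation):
Code:
# illustrative only: 8 positions per layer (AMflags=3), angle-difference averaging (AMavg=2),
# and a mid-range stability threshold (AMthVSMang=0.5, usable range 0.0 .. 1.0)
rec = MRecalculate(super_p2, vec_am1o4, thSAD=20, blksize=8, tr=my_tr, \
 AreaMode=2, AMflags=3, AMavg=2, AMthVSMang=0.5)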

New params for MAnalyse:

AMavg (same as for MRecalculate).

AMpt (integer), default = 0. Same as optPredictorType; the predictor type used in the AM searches.
AMst (integer), default = 3. Same as search; the search type used in the AM searches.
AMsp (integer), default = 2. Same as searchparam; the search param (radius) used in the AM searches.
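And a matching sketch for MAnalyse with the new AM-search controls (AMpt/AMst/AMsp are simply shown at their documented defaults, so the values are illustrative):
Code:
# AMpt/AMst/AMsp at their documented defaults; AMavg=2 selects angle-difference averaging
vec = MAnalyse(super_p1, multi=true, blksize=16, delta=my_tr, optSearchOption=1, mt=false, \
 AreaMode=1, AMoffset=4, AMavg=2, AMpt=0, AMst=3, AMsp=2)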


Added a performance optimization for the SAD-only DMFlags case (the most typical use case).
Fixed optPredictorType=1 for MAnalyse: added a check of the median predictor of the current level (now the zero, global, hierarchy and median predictors are checked before refining).

The next idea for performance/quality balance is to enable a user-configurable ratio of already-found equal MVs for fast AM-skip processing. Measured in the 0.0 to 1.0f range, this ratio is currently fixed at 0.5 for AMavg 0, 2 and 3, but it may be extended to much lower than 0.5. This is a motion-adaptive AM search mode: for stable enough blocks the number of additional AM searches may be decreased significantly without quality loss, while for non-stable areas the AM searches may be expanded to the full number defined by the settings (possibly several times slower). This level of adaptive search is only possible with per-block AreaMode in onCPU MAnalyse.

In the next versions a new param like AMfsRatio is expected, with a range of 0.0f to 1.0f and a default of 0.5f (the same as the current AMavg 0, 2, 3 optimization). If set below 0.5f (or below 1.0f for AMavg=1), it will cause an earlier call (during AM search-position checking) to the MV equality comparison function, and if all MVs found so far equal the input (the block center from the standard MAnalyse search), it will stop new AM searches and return (the output MV assigned as the input).

The newly added averaging types for AM may give different MV results and different resulting denoising on different types of content (different noise profiles, roughly), and the choice is partly personal preference. In some tests with scanned film footage, AMavg=2 preserves more detail, AMavg=0 gives a slightly lower MPEG output bitrate, and AMavg=1 provides significantly better denoising (and lower MPEG output bitrate) but may also cause more detail loss.

I made some tests of denoising quality versus the Topaz Artemis model: the Topaz AI generally produces much more stable object views across a sequence of frames (even with fast motion and transformed objects) but can lose more detail in some frames in low-contrast and low-brightness areas. mvtools, as a linear temporal denoiser (also with low enough thSADA_a values around 1..2), leaves more noise in fast-motion areas and complex transforms, but that means those areas receive less processing and stay closer to the original input.

Last edited by DTL; 15th April 2024 at 19:26.
DTL is offline   Reply With Quote
Old 16th April 2024, 14:06   #270  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,159
Good news on mining cards: prices are really nice, around $30..$50 per board for something like a p104-100. And a patched driver promises that the NVENC feature can be made to work: https://github.com/dartraiden/NVIDIA-patcher .

A p104-100 board in the best case can be equal to a GTX 1070, with 2 NVENC cores working at 1200+ fps combined performance for H.264 FHD encoding, so ME performance is expected to be at a similar level. But cheap mining boards really carry no warranty that NVENC is present on the chip in a working state, so each board requires testing. They are typically limited to 4x PCIe lanes, but for AreaMode with on-board frame shifting this should not be significantly limiting. The 256-bit RAM bus promises good performance for texture copying for sample-shifted resources on board.

I am trying to ask p104-100 board sellers to test whether DX12 ME fully works with the patched drivers.

Also, as I found from old (2016?) NVIDIA developer papers, the ME-only mode looks like it has worked for a long time via the NVIDIA SDK (a vendor-only API), and it may also accept input MVs as hints for refining. So it is possible to make a higher-quality hierarchical search, closer to onCPU MAnalyse, by providing the same downsized frames for the high levels of the search and returning interpolated MVs back to the ME search engine as hints for the lower levels.

Some users on GitHub report that the MV input-hints feature for ME via DX12 does not seem to work. It looks like it was not implemented in the vendor's DX12 drivers, even though it is announced in the Microsoft DX12 API.

This is from https://forums.developer.nvidia.com/...encoding/64456 :

Quote:
You must accept that NVidia is incapable to describe correctly own products. I lost hundreds of $$$ for NVidia official incorrect or intentionally unpublished information.
https://developer.nvidia.com/video-e...support-matrix is wrong again. Line "GeForce GTX 1060 - 1070 Ti" should be "GeForce GTX some 1060 / 1070 / 1070 Ti / 1080" eg. GP104 or should be split to two lines if "GTX some 1060 / 1070 / 1070 Ti" have GP104 but only one hwenc (see next point).
But be warned. Some crippled (with HW faults from chip factory) chips have less CUDA (some faulted SMs are disabled) and sometimes one of two hwenc is faulted too (compare Quadro P4000/P5000 - see https://devtalk.nvidia.com/default/topic/1036615/ ). So, there may be the same problem with GTX 1070/1070Ti (GP104-200-A1/GP104-300-A1 vs. GP104-400-A1). If https://devtalk.nvidia.com/default/t...ed-comparison/ is correct (eg. tested correctly on real card) GTX 1070 has only one hwenc enabled.
Check also https://en.wikipedia.org/wiki/List_o...ocessing_units for chip marking and how it is crippled. For example check GTX 1060 - it uses 4 different chips. Normal chip (GP106-400-A1), crippled chips (GP106-300-A1 / GP106-350-K3-A1) and super crippled chip (GP104-140-A1 (only 9 from 20 SMs and 192 of 256 bit memory bus width are working)). ... Life with NVidia is like a box of chocolates, you never know what you're gonna get ! :-)

Last edited by DTL; 16th April 2024 at 19:21.
DTL is offline   Reply With Quote
Old 17th April 2024, 11:09   #271  |  Link
takla
Registered User
 
Join Date: May 2018
Posts: 184
@DTL
https://github.com/ROCm/ROCm/releases/tag/rocm-6.1.0

Quote:
rocDecode, a new ROCm component that provides high-performance video decode support for
AMD GPUs. With rocDecode, you can decode compressed video streams while keeping the resulting
YUV frames in video memory. With decoded frames in video memory, you can run video
post-processing using ROCm HIP, avoiding unnecessary data copies via the PCIe bus.
Would be nice to have MvTools in HIP
takla is offline   Reply With Quote
Old 17th April 2024, 12:21   #272  |  Link
tormento
Acid fr0g
 
tormento's Avatar
 
Join Date: May 2002
Location: Italy
Posts: 2,613
Quote:
Originally Posted by takla View Post
@DTLWould be nice to have MvTools in HIP
If any rewrite occurs for GPU, I would prefer CUDA, OpenCL or Vulkan.

CUDA is the fastest, while OpenCL and Vulkan are compatible with almost anything modern.

AMD is prone to presenting interesting projects, abandoning them and never offering proper support.
__________________
@turment on Telegram
tormento is offline   Reply With Quote
Old 17th April 2024, 12:40   #273  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,159
A more realistic addition to the current MAnalyse is an optional MV search via the NVIDIA SDK (not using DirectX at all?). I see the ME-only mode of NVENC noted in old NVIDIA developer documents from around the mid-2010s. And the NVENC SDK/API (encode mode?) can run from DirectX 9, so it is expected to work from Win7 without DX12-on-Win7 hacks.

Also, the NVIDIA SDK in C may be compatible with UNIX/Linux too? Then this mode of MAnalyse would be compatible with both Windows and UNIX (like today's no-DX12 builds). But in addition to the MV search alone, in the best case some computing is also required (like SAD compute shaders), and for the new AreaMode, on-board resource/texture shifting to save CPU time and bus transfers. I expect NVIDIA NVENC SDK resources can also somehow be processed with CUDA or other APIs (without re-uploading the resources).

Also, it looks like NVIDIA does not like to provide all possible features, such as MV re-use (as hints), via the Microsoft DX API (or is too lazy to implement all the required driver features for Windows for a rarely used API). So adding an NVIDIA SDK/API-only path to MAnalyse would have some benefits, but it would be completely NVIDIA-only in hardware. That may also not be great for open-source developers.

"AMD is prone to present interesting projects, abandon them and never offer proper support."

Microsoft with Windows still looks like the one working global force able to push many hardware vendors into providing commonly used hardware features (like general-purpose computing and motion estimation) via a single API. But it looks like that is DirectX only now, and even DirectX (ME), with its small number of promised features, is not completely implemented in some drivers and/or for some products.

Last edited by DTL; 17th April 2024 at 12:45.
DTL is offline   Reply With Quote
Old 17th April 2024, 16:54   #274  |  Link
takla
Registered User
 
Join Date: May 2018
Posts: 184
Quote:
Originally Posted by tormento View Post
If any rewrite occur for GPU, I will prefer CUDA, OpenCL or Vulkan.
OpenCL is superseded by Vulkan.

Also
takla is offline   Reply With Quote
Old 17th April 2024, 17:25   #275  |  Link
tormento
Acid fr0g
 
tormento's Avatar
 
Join Date: May 2002
Location: Italy
Posts: 2,613
Quote:
Originally Posted by DTL View Post
More real addition to current MAnalyse is optional MVs search via NVIDIA SDK
Are you talking about this?
__________________
@turment on Telegram
tormento is offline   Reply With Quote
Old 17th April 2024, 23:42   #276  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,159
About this: https://developer.nvidia.com/video-codec-sdk

Currently implemented in mvtools2:
DirectX Video

Platform Windows

Benefits
Low Level Control
Native DirectX and Windows Integration
Easy for DirectX developers
Multi-Vendor

Native API interface - D3D11 (Decode only) and D3D12

So it is Windows only, and only Win10 with DX12 and higher, but it supports any (currently both) manufacturers of DX12 boards - NVIDIA and AMD, and possibly some day Intel DX12 too.

Possible other API:

NVIDIA Video Codec SDK
Platform Windows and Linux

Benefits
High Level Control
Native Integration in custom pipelines
Useful for users with less knowledge of Vulkan and Direct X
Easy for C, C++ developers
Nvidia Proprietary API
Comprehensive feature set

Native API interface
D3D9, D3D10, D3D11, D3D12 (Encode only) CUDA (Encode and decode)

So it is Linux and Windows compatible (happier open-source developers) and Windows 7 compatible via DX9 (happier Win7 users). More features may also be available. But the main disadvantage: NVIDIA hardware only.

I made a test build for checking the possible peak ME performance of the multi-position search for AreaMode: https://drive.google.com/file/d/1gJs...ew?usp=sharing

It has 3 .dlls:
x1me - standard 1 call to the motion estimation engine per input pair of frames
x5me - 5 sequential calls to the motion estimation engine per input pair of frames
x9me - 9 sequential calls to the motion estimation engine per input pair of frames

AreaMode=1 (AMflags=1) will use 5 searches per pair of frames and AreaMode=2 (AMflags=1) will use 9 searches.

Test script:
Code:
LoadPlugin("mvtools2_x9me.dll")

ColorBars(1920,1080, pixel_type="YV12")
#ColorBars(3840,2160, pixel_type="YV12")

#super = MSuper(mt=false, pel=4, pelrefine=true, chroma=true, levels=0) # onCPU compare
super = MSuper(mt=false, pel=4, pelrefine=false, chroma=true, levels=1) # onHWA
#forward_vec1 = MAnalyse(super, isb = false, delta = 1, search=3, chroma=true, optSearchOption=1, levels=0) # onCPU compare
forward_vec1 = MAnalyse(super, blksize=8, isb = false, delta = 1, chroma=true, optSearchOption=5, levels=1) # onHWA
MStoreVect(forward_vec1)

Prefetch(...)
With a Gigabyte GTX 1060 card (GPU clock max 1911 MHz) and a 1920x1080 frame:

x1me - 500 fps, 74% Video Encoder load. Performance is somewhat limited by data-transfer overheads: bus load ~30%, GPU chip power only ~35 W. Expected raw ME frame-pairs-per-second performance: 500/0.74 ≈ 675 fpps.

x5me - 150 fps, 100% Video Encoder load, bus load ~8%. Raw ME performance: 150x5 = 750 fpps.

x9me - 84 fps, 100% Video Encoder load, bus load ~5%. Raw ME performance: 84x9 = 756 fpps.

So a single NVENC on a Pascal chip, as expected, provides ME performance of about 750 fpps. Full GTX 1070 and 1080 boards with both NVENCs healthy are expected to reach about 1500 fpps raw ME performance.

The i5-9600K is about 6 times slower (at the same pel=4 quality).

Changing the block size from 8x8 to 16x16 does not change anything; it looks like NVENC performance is limited only by the number of samples, not by the number of blocks (per frame or per second). With a 3840x2160 frame it is about 4x lower, scaling linearly.

Expected performance for AreaMode=1, AMflags=1 at 1920x1080 frame size and tr=12:
tr=12 needs 24 frame pairs per output frame, and AreaMode=1 with AMflags=1 needs 4 additional frame-pair searches (x5 total), so one MDegrainN output frame requires 24*5 = 120 frame-pair searches with these settings. The 750 fpps raw ME performance of a single Pascal NVENC divided by 120 gives roughly 6 fps maximum total denoise performance.

With GTX 1070..1080 boards (p104..p102 mining editions too?) the performance is expected to be twice as high.

Table 3 of the document https://developer.download.nvidia.co...09-001_v08.pdf also lists about 648 fps for Pascal NVENC for H.264 encoding in fast mode with an FHD frame.

Another promising averaging function planned for AMavg is the geometric median - https://en.wikipedia.org/wiki/Geometric_median . It is not easy to compute, so it may add a further performance penalty for onCPU MAnalyse, but offloading it to a compute shader for onHWA processing is planned for the future.
A valuable property of the geometric (2D) median for noisy sources:
"The geometric median has a breakdown point of 0.5.[13] That is, up to half of the sample data may be arbitrarily corrupted, and the median of the samples will still provide a robust estimator for the location of the uncorrupted data."
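For reference, a minimal sketch of the definition (the iterative method named in the comment is the usual textbook choice, not necessarily what a future build would use):
Code:
# geometric median of the gathered MVs v_k = (dx_k, dy_k):
#   gm = argmin over m of  sum_k sqrt( (dx_k - m_x)^2 + (dy_k - m_y)^2 )
# no closed form in general; usually computed iteratively (e.g. Weiszfeld's algorithm),
# which is why it is costlier than the per-component mode/mean used by AMavg 0..3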

Last edited by DTL; 18th April 2024 at 10:06.
DTL is offline   Reply With Quote
Old 18th April 2024, 10:17   #277  |  Link
tormento
Acid fr0g
 
tormento's Avatar
 
Join Date: May 2002
Location: Italy
Posts: 2,613
Quote:
Originally Posted by DTL View Post
About this
Have a look at the link I provided. It seems to have everything ready.
Quote:
Originally Posted by DTL View Post
Full-blood GTX1070 and 1080 boards with Dual NVENC healthy expected about 1500 fpps ME RAW performance.
I suppose you are aware of NVEnc unlocking patches.
Quote:
Originally Posted by DTL View Post
p104..p102 miners editions too
There are patches for them too.
__________________
@turment on Telegram
tormento is offline   Reply With Quote
Old 18th April 2024, 11:18   #278  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,159
"It seems to have everything ready."

That promo does not list any denoising application, so it may only work well with new cameras in good light with low noise at the input. Same as RIFE, it also fails with noisy sources. Also, I do not have a Turing board to check it.

"I suppose you are aware of NVEnc unlocking patches."

I did not get any out-of-licence or other error. It is simply limited to about the expected maximum MPEG encoder rate at the lowest (fastest) settings, so I think these are natural limits of the NVENC hardware. Also, the licence limits MPEG encoding sessions to 2 for end-user drivers; here we have 1 D3D12 device used by 1 process, and not even in MPEG encoding mode. Maybe this API is used so rarely that NVIDIA did not think about putting any limits on it. I think it does not even use any MPEG-LA-licensed algorithms.
DTL is offline   Reply With Quote
Old 18th April 2024, 11:54   #279  |  Link
TDS
Formally known as .......
 
TDS's Avatar
 
Join Date: Sep 2021
Location: Down Under.
Posts: 1,041
Quote:
Originally Posted by DTL

Does it copy all required .dlls to the clients? If that cannot be disabled, you can only use the 'lowest' build on the server and all clients.
Quote:
That's not exactly how it does what it does, but it works.

I will test a few different compiles, and let you know how it behaves.
Hi DTL, well, I've finally got around to doing some tests, and most of your current builds work for me (on the AVX2 CPUs). But, as I tried to explain about the Distributed Encoding I use, the clients that only have AVX CPUs do not start encoding, no matter which compile I tried, when using an SMDegrain script that calls mvtools2.dll.

Not sure what to do now.
__________________
Long term RipBot264 user.

RipBot264 modded builds..
*new* x264 & x265 addon packs..
TDS is offline   Reply With Quote
Old 18th April 2024, 12:02   #280  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,159
It is safer to use the _e.XX builds for SMDegrain. The latest is https://github.com/DTL2020/mvtools/r.../r.2.7.46-e.03 . If that also does not work on AVX, the only way is to fall back to pinterf's 2.7.45 builds.
DTL is offline   Reply With Quote