Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion. Before you start posting please read the forum rules. By posting to this forum you agree to abide by the rules. Domains: forum.doom9.org / forum.doom9.net / forum.doom9.se
#3601
Registered User
Join Date: Jan 2014
Posts: 2,527
Quote:
The fix will be included in the next rstdoc update.
#3602
Registered User
Join Date: Dec 2008
Posts: 2,400
I'd like to clarify the GetFrame method.
Older VSFilter builds didn't throw exceptions in GetFrame; they simply didn't work. The newer xy-VSFilter calls ThrowError when it doesn't support the video format, which causes many FFmpeg-based applications to crash. Example script. Quote:
Is this an application issue or an xy-VSFilter issue?
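To make the behavioral difference concrete, here is a toy model (plain Python, not the real VSFilter/DirectShow code; format names and messages are illustrative) of the two GetFrame behaviors described above: silent pass-through on an unsupported format versus raising an error that an unprepared host won't survive.

```python
# Toy model of the two GetFrame behaviors described in the post.
# SUPPORTED, the format strings and messages are all illustrative.

SUPPORTED = {"YV12", "YUY2"}

def old_getframe(fmt, frame):
    if fmt not in SUPPORTED:
        return frame          # silent pass-through: "simply didn't work"
    return frame + "+subs"

def new_getframe(fmt, frame):
    if fmt not in SUPPORTED:
        # ThrowError-like behavior: propagate an error to the host
        raise RuntimeError(f"unsupported video format: {fmt}")
    return frame + "+subs"

out_old = old_getframe("RGBPS", "frame0")  # no subs, no error
try:
    new_getframe("RGBPS", "frame0")
    crashed = False
except RuntimeError:
    crashed = True            # a host that doesn't catch this will crash

print(out_old, crashed)
```

Whether the host should catch this, or the filter should degrade gracefully, is exactly the application-vs-filter question posed above.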
__________________
MPC-BE 1.8.9 and Nightly builds | VideoRenderer | ImageSource | ScriptSource | BassAudioSource
#3604
Registered User
Join Date: Dec 2008
Posts: 2,400
Quote:
But this does not work with avformat_open_input. Added: Quote:
__________________
MPC-BE 1.8.9 and Nightly builds | VideoRenderer | ImageSource | ScriptSource | BassAudioSource

Last edited by v0lt; 1st February 2026 at 08:41.
#3605
Registered User
Join Date: Jan 2014
Posts: 2,527
New build: Avisynth r4483
https://github.com/pinterf/AviSynthP...3.7.6pre-r4483 Code:
20260203 3.7.5.r4483 (pre 3.7.6)
--------------------------------
* rst documentation update: RGBAdjust https://avisynthplus.readthedocs.io/...rs/adjust.html
* rst documentation update: ColorYUV https://avisynthplus.readthedocs.io/.../coloryuv.html
* optimization: add AVX2 TurnLeft/TurnRight/Turn180 (R/L: 1.5-3x speed)
* optimization: ConvertBits AVX2 integer->float
* optimization: ConvertToPlanarRGB(A): YUV->RGB add AVX2 (2-3x speed)
* optimization: ConvertToPlanarRGB(A): YUV->RGB 16 bit: a quicker way (1.5x)
* Fix: C version of 32-bit ConvertToPlanarRGB YUV->RGB to not clamp output RGB values
* ConvertToPlanarRGB(A): add bits parameter to alter target bit depth
* ConvertToPlanarRGB(A): YUV->RGB full range output: optimized in-process when bits=32; other cases call ConvertBits internally
* Fix: packed RGB conversions altering the bit depth (e.g. rgb32->ConvertToRGB64()) always worked in full range
* Add more AVX512 resampler code (WIP)
* Add more AVX512_BASE code paths (resamplers)
* Build: add _avx512b.cpp/hpp pattern in CMake to detect source to compile with base (F,CD,BW,DQ,VL) flags. AVX512_BASE itself is set only when AVX512_FAST is found. On pre-Ice Lake (older AVX512) systems you can enable it with SetMaxCPU("avx512base+") to get the optimized AVX512_BASE functions.
* Build: add new architecture: z/Architecture
#3607
Registered User
Join Date: Jul 2018
Posts: 1,478
It looks like memory management (and filtergraph filter interconnection?) for CUDA-based plugins (there are no CUDA internal filters currently?). Back in the old days, when the number of developers was larger, there was an attempt to make a CUDA-computing AviSynth with script-based filter interconnection inside the CUDA accelerator, and even a mix of on-CPU and on-CUDA filters. But that development appears to have stopped long ago. Now all filters run on the CPU, and if a filter needs external acceleration it does all the memory management itself. This causes some performance loss for a single filter, and more for a filter chain, but it is simpler to support with the current limited number of developers.

There is an idea for a special extended filter-interconnection interface for on-accelerator filters, like GetFrameToBuffer(buffer_description), so that accelerator-based filters can ask for a frame to be stored directly into an allocated upload buffer in the global virtual memory address space. Currently AI/NN filters like RIFE and avs-mlrt can only get frames through the standard GetFrame() method and must make an additional copy into buffers allocated for uploading to the accelerator. With the largest format, RGBPS, this puts a heavy load on the memory subsystem (though while AI/NN filters are not very fast, it may not be very visible). As I understand it, the idea of the CUDA-based filtergraph was to avoid uploading/downloading frame resources to/from host RAM when several CUDA-based filters are connected in a graph inside an external accelerator. The current AVS filter interconnection (typically via a software Cache() for frame-based MT) requires writing frames to host RAM cache buffers and reading them back (though some DO_NOT_CACHE_ME mode exists, and filters can attempt a direct connection via GetFrame()?).
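A minimal sketch of the idea, assuming a hypothetical GetFrameToBuffer(buffer) entry point (this is NOT a real AviSynth API; all names and sizes here are illustrative): the standard GetFrame() path makes the caller copy the filter's frame into an upload buffer, while the proposed path lets the filter write straight into that buffer, saving one full-frame pass on the host side.

```python
# Hypothetical sketch -- GetFrameToBuffer() is NOT a real AviSynth API,
# it is the interface idea from the post above. Toy frame size.

FRAME_SIZE = 64 * 64 * 3  # planar RGB float samples (toy size)

class SourceFilter:
    def get_frame(self, n):
        # Standard path: the filter allocates and returns its own frame.
        return [float(n)] * FRAME_SIZE

    def get_frame_to_buffer(self, n, buffer):
        # Proposed path: render directly into the caller's upload buffer.
        for i in range(len(buffer)):
            buffer[i] = float(n)

def upload_via_getframe(filt, n, upload_buffer):
    frame = filt.get_frame(n)   # filter's own allocation
    upload_buffer[:] = frame    # extra full-frame copy into the upload buffer
    return 2                    # frame data touched twice on the host

def upload_via_getframetobuffer(filt, n, upload_buffer):
    filt.get_frame_to_buffer(n, upload_buffer)  # written once, in place
    return 1                    # frame data touched once on the host

buf = [0.0] * FRAME_SIZE
passes_a = upload_via_getframe(SourceFilter(), 5, buf)
passes_b = upload_via_getframetobuffer(SourceFilter(), 5, buf)
print(passes_a, passes_b)
```

With a real RGBPS frame (width * height * 3 floats) the saved pass is a multi-megabyte memcpy per frame, which is the memory-subsystem load the post refers to.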
#3609
Registered User
Join Date: Jul 2018
Posts: 1,478
CUDA-based filters may interact with other CUDA-based filters without downloading input/output frames to host RAM.
An important update for use with ML/AI/NN filters: ConvertToPlanarRGB(A) now has a bits parameter to change the target bit depth. It means that instead of the long and slow sequence

ConvertToPlanarRGB()
ConvertBits(32)

before RIFE/avs-mlrt filters, after r4483 a single (and faster) filter can be used:

ConvertToPlanarRGB(bits=32)

It is expected to be faster than avsresize's Z_ConvertFormat(pixel_type="RGBPS") when converting from YUV. Also, uint8..16-to-float32 precision may become a bit better, because the new function uses a direct integer-to-float32 conversion without an intermediate integer stage. It may be close in precision to the possible (but slower) sequence ConvertBits(32) ConvertToPlanarRGB().

Also waiting for a new test release from pinterf: it will finally have ColorBarsHD fixed for any integer precision and for float, and the matrix/dematrix part fixed again for better YUV<->RGB conversions, to test the precision of the new functions. Currently ColorBarsHD uses only an 8-bit internal table with bit-depth upscaling and cannot be used as a good source generator for precise matrix testing at different bit depths. The precision of 10-bit YUV to 8-bit RGB conversion is expected to be very close or equal to avsresize in the new test release.

Last edited by DTL; 6th February 2026 at 16:01.
#3610
Registered User
Join Date: Jan 2014
Posts: 2,527
Quote:
However, at 16 bit, a possible +/-1 lsb error can be considered "good enough". When using full range, it's effectively the same as working in 32-bit float; for complex transformations (like full-range bit-depth conversion), my version uses float calculations internally as well.

Avisynth hasn't previously had chained (fused) filter options where a matrix was involved, so this bits= parameter is a first. It immediately became clear that while optimizing the simplest conversion was easy, the code bloats exponentially when you try to optimize every specific sub-case. Luckily, the more complex cases needed float internally anyway and could be unified.

Combining the bit-depth conversion is not only much faster but also more accurate than simply adding ConvertBits after the YUV-RGB conversion, so it was a very good and useful idea from DTL.
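A toy numeric illustration of the fused-vs-chained accuracy point (this is a simplified stand-in, NOT AviSynth's actual conversion code; the gain value is just the limited-to-full-range luma scale): quantizing to 8-bit RGB first and widening to float afterwards bakes the intermediate rounding error in, while the fused float path has no such error.

```python
# Compare: quantize-then-widen ("ConvertBits after") vs fused float output.
# Gain is the limited->full range 8-bit luma scale; illustrative only.
import random

random.seed(0)
ys = [random.randint(16, 235) for _ in range(100_000)]  # limited-range luma
gain = 255.0 / 219.0

reference = [(y - 16) * gain / 255.0 for y in ys]        # exact float result

# Path A: convert to 8-bit RGB first, then widen to float afterwards
path_a = [min(255, max(0, round((y - 16) * gain))) / 255.0 for y in ys]

# Path B: fused - do the matrix math in float and output float directly
path_b = [(y - 16) * gain / 255.0 for y in ys]

err_a = max(abs(a - r) for a, r in zip(path_a, reference))
err_b = max(abs(b - r) for b, r in zip(path_b, reference))
print(err_a, err_b)  # err_b is exactly 0; err_a is up to half an 8-bit step
```

The real conversion involves a 3x3 matrix rather than a single gain, but the rounding argument is the same per channel.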
#3611
Acid fr0g
Join Date: May 2002
Location: Italy
Posts: 3,075
Quote:
SetFilterMTMode("DEFAULT_MT_MODE", 2)
LoadPlugin("D:\Eseguibili\Media\DGDecNV\DGDecodeNV.dll")
Import("D:\Eseguibili\Media\StaxRip\Apps\Plugins\AVS\DehaloAlpha\Dehalo_alpha.avsi")
Import("D:\Eseguibili\Media\StaxRip\Apps\Plugins\AVS\Dither\mt_xxpand_multi.avsi")
Import("D:\Eseguibili\Media\StaxRip\Apps\Plugins\AVS\FineDehalo\FineDehalo.avsi")
DGSource("M:\In\The promised neverland S1 ~Dynit\01-1.dgi")
z_ConvertFormat(resample_filter="Spline64", pixel_type="yuv420p16")
DeBilinearResizeMT(1280, 720, threads=2, prefetch=2, accuracy=2)
z_ConvertFormat(resample_filter="Spline64", pixel_type="yuv444ps")
BM3D_CUDA(sigma=12, radius=4, chroma=true, block_step=6, bm_range=12, ps_range=6, fast=false)
BM3D_VAggregate(radius=4)
z_ConvertFormat(resample_filter="spline64", dither_type="error_diffusion", pixel_type="yuv420p16")
FineDehalo(rx=2, ry=2, thmi=80, thma=128, thlimi=50, thlima=100, darkstr=0.3, brightstr=1.0, showmask=0, contra=0.0, excl=true)
libplacebo_Deband(radius=12, iterations=4, temporal=false, planes=[3,3,3])
fmtc_bitdepth(bits=10, dmode=8)
Prefetch(2,6)

How could I change it to have some performance increase, according to the last commits?
__________________
@turment on Telegram
#3612
Registered User
Join Date: Jul 2018
Posts: 1,478
Currently it looks like no. In the future some 420<->444 conversion performance increase is expected in the UV 2x upsize/downsize, but it is not yet ported to the AVS+ core, so no test release exists. Also, it is an internal Resize() optimization and requires no script changes.

For testing, it may be worth trying 14-bit integers instead of 16-bit. The 16-bit format was found to perform poorly in many AVS filters, and it is processed by different functions than 10-14 bit. But if you use external filters, it may depend on their implementations. The precision may not differ greatly between 14 and 16 bits.

Last edited by DTL; 6th February 2026 at 16:23.
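A quick back-of-the-envelope check of the 14-bit vs 16-bit precision remark (plain arithmetic, nothing AviSynth-specific): the worst-case quantization error at each bit depth, on a 0..1 scale.

```python
# Worst-case quantization error (half a step on a 0..1 scale) per bit depth.
def max_quant_error(bits):
    levels = (1 << bits) - 1
    return 0.5 / levels

e14 = max_quant_error(14)
e16 = max_quant_error(16)
print(e14, e16)  # 14 bit is ~4x coarser, but both are far below an 8-bit step
```

Both errors are orders of magnitude below the 8-bit step of a typical final output, which is why dropping from 16 to 14 bits is usually invisible while the processing can be faster.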
#3613
Broadcast Encoder
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 3,377
Quote:
TextSub("whatever.ass") would literally pass the frames through without returning any error or overlaying any subtitles.

Quote:
Quote:
Also, when it comes to distributed farms, as with FFAStrans running on-prem, the servers it runs on may not all be exactly the same: they may not have the very same dedicated GPU with the same drivers, and the same goes for the CPU cores, instruction sets (assembly optimizations), etc. So writing the scripts and letting the filters "figure it out" automatically on the CPU is actually very helpful. Even in a cloud environment, in which you typically have 1 machine = 1 job with a workflow running end to end, you only really have CPU-only EC2 instances like the c6i.4xlarge and c6i.2xlarge that we're using, as only the ones without a GPU would scale up and down by being created dynamically according to the number of jobs. Sure, one could set it up in a similar way with GPU-powered EC2 instances, but the likelihood of them not being created because there aren't any resources available in the region rises significantly, making the CPU-only option far more appealing.

TL;DR: I would rather pick something slower that works and is available all the time than something slightly faster that might not be available.

Quote:
#3614
Broadcast Encoder
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 3,377
Quote:
Tested and working on Windows XP Professional x86.

Windows 7 builds (x64-win7-19.44.35221-17.14): tested and working on Windows Server 2008 R2 x64.
#3615
Registered User
Join Date: Jul 2018
Posts: 1,478
"How could I change it to have some performance increase, according to the last commits?"
What I found with some total-encoding performance testing (using x264.exe as the MPEG encoder and RIFE as a call to accelerator processing): the performance may depend not on the best-performing single filter but on some magic combinations of filters in a graph.

I see you use avsresize for all format conversions. In my test it is not best for total performance (it is equal to other, not-best AVS filter sequences). So if you have time for long testing, you can also try single filters, or even sequences, of Convert(matrix) + ConvertBits() and see whether total encoding performance is visibly different.

In my test case with RIFE, the fastest conversion of YV12 to planar RGBPS before the RIFE filter is

ConvertToPlanarRGB()
ConvertBits(32)

All other ways (the single ConvertToPlanarRGB(bits=32), Z_ConvertFormat(pixel_type="RGBPS"), and the reverse order ConvertBits(32).ConvertToPlanarRGB()) are slower. This was tested with r4483 (and sources up to 09 Feb). But all this may change in any new release if there are changes in AVS caching, fetching of new frames from the global frame buffer, etc. The total encoding fps difference between the 'lucky fast' filter sequence and 'about all other slow' ones is about 22-23 vs 14-15 fps.

Full test script was Code:
LoadPlugin("avsresize.dll")
LoadPlugin("RIFE.dll")
ColorBars(800,800,pixel_type="YUV444P8")
#ConvertToPlanarRGB(bits=32)
#ConvertBits(32)
ConvertToPlanarRGB()
ConvertBits(32)
#Z_ConvertFormat(pixel_type="RGBPS")
RIFE(denoise_tr=1)
ConvertBits(8)
ConvertToYV12()
Prefetch(2)
x264.exe --profile high --crf 18 --preset "placebo" --merange 50 --psnr -o out.264 rife_test.avs

[Image: typical average-performance GPU load graph and x264 fps]
[Image: anomalous best-performance Convert filter sequence]

Maybe it is some complex issue with Windows threading, GPU drivers, etc.

Another idea: RIFE in denoise mode (tr=1) requests the -1 and +1 frames around the current one to output an interpolated frame, a bit like MDegrain1(). This may cause a significant performance difference in AVS caching, and the sequence ConvertToPlanarRGB() ConvertBits(32) may get luckier results? But these filters are simple, and the sequence ConvertBits(32) ConvertToPlanarRGB() is expected to have the same cache service-time performance (not counting the larger float data size after ConvertBits(32)).

Last edited by DTL; Today at 06:46.
#3616
Formerly davidh*****
Join Date: Jan 2004
Posts: 2,815
Presumably RIFE has to upload multiple frames to the GPU for each output frame. How efficiently does it do so?
Edit: I see you mentioned that. Off the top of my head, to be compatible with prefetch(2) it can't be keeping track of internal state between frames, so it must be uploading 3 frames every time. If it could keep track of what has previously been uploaded, and ran single-threaded, could it eliminate two of those uploads?

Last edited by wonkey_monkey; Today at 14:27.
#3617
Registered User
Join Date: Jul 2018
Posts: 1,478
RIFE as applied to AVS is still a very simple engine: 2 frames in, 1 frame out. It always uploads 2 frames to get 1 output frame (plus a float time parameter to interpolate between 0.0, the first frame, and 1.0, the second frame). This allows it to run in random-access mode without additional programming, but it causes 2x extra upload traffic when frames are requested sequentially. It may not know how to remap the last uploaded frame to be the first and upload only 1 new frame. This is one more possible optimization issue to open for the RIFE AVS plugin (if the underlying NN engine supports resource remapping).

In theory, any temporal-radius filter with sequential frame requests at the output (normal encode mode in 1 thread) can upload only 1 new frame and remap all tr frames by shifting them by 1 and dropping the first (oldest) frame, if accelerator memory is enough to hold all tr frames at once. But with current massively multithreaded CPUs, running in single-threaded mode may not give the best performance for the accelerator. Also, with random output-frame access, the frame-source engine needs to upload all required frames before outputting the requested frame; this is additional programming to remember the uploaded frame numbers. Also, each thread must either have its own frame pool (in very small and expensive accelerator memory) or even some global memory manager interconnected with the AVS frame cache to run >1-threaded filters if the frame numbers in different threads overlap. This was part of the idea of on-CUDA filter processing.

Because the current underlying RIFE engine always works with 2 frames (and cannot remember surrounding frames to improve quality), optimizing sequential frame access by remapping the last frame to the first and uploading only 1 new frame may not give a significant performance boost, even if it is possible with the currently used Vulkan API and RIFE engine. Though it is subject to testing if possible. A much better performance boost may come with larger-tr engines (but we currently do not have any?).

The current AVS groups-of-frames multithreading model requires a constant refresh of the filter with a new (non-sequential) output frame after the current group of frames in the current thread is fully processed and the thread gets a new work unit for the next group of frames (possibly also in some random order). This means the filter in the thread needs to re-upload all frames to the accelerator for the currently requested output frame at the start of each frame group, or attempt to remap frames from a global pool of uploaded frames for the current clip (if available and managed). This requires even more programming.
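A small sketch of the sliding-window idea under the stated assumptions (sequential output, single thread, a device pool of 2*tr+1 frames); the names and counting are illustrative, not the RIFE plugin's actual code. It counts real uploads for the naive "re-upload everything per output frame" strategy against a pool that uploads only the frame entering the window.

```python
# Illustrative upload-counting model for a temporal-radius filter.
# "Device handles" are strings; only the upload counts matter here.
from collections import OrderedDict

class UploadWindow:
    def __init__(self, capacity):
        self.pool = OrderedDict()   # frame number -> "device handle"
        self.capacity = capacity
        self.uploads = 0

    def request(self, frame_numbers):
        handles = []
        for n in frame_numbers:
            if n not in self.pool:
                self.uploads += 1                  # real upload to the device
                self.pool[n] = f"dev_frame_{n}"
                if len(self.pool) > self.capacity:
                    self.pool.popitem(last=False)  # evict the oldest frame
            handles.append(self.pool[n])
        return handles

tr = 1
naive_uploads = 0
window = UploadWindow(capacity=2 * tr + 1)
for out_frame in range(1, 100):            # sequential output frames 1..99
    needed = range(out_frame - tr, out_frame + tr + 1)
    naive_uploads += 2 * tr + 1            # naive: re-upload all every time
    window.request(needed)

print(naive_uploads, window.uploads)       # 297 vs 101 uploads
```

As the post notes, this only works cleanly for sequential single-threaded access; with random access or overlapping thread work units, the pool would need per-thread copies or a shared manager, which is the extra programming described above.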