Doom9's Forum > Capturing and Editing Video > Avisynth Development

Old 29th January 2026, 13:40   #3601  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,527
Quote:
Originally Posted by DTL View Post
It looks like the documentation for ConvertToPlanarRGB is missing the chromaresample params:
https://avisynthplus.readthedocs.io/...s/convert.html
RGB planar

ConvertToPlanarRGB(clip, [ string matrix, bool interlaced,
string ChromaInPlacement,
string chromaresample ] )
ConvertToPlanarRGBA(clip, [ string matrix, bool interlaced,
string ChromaInPlacement,
string chromaresample,
float param1, float param2, float param3 ] )

But the source for ConvertToPlanarRGB also lists param1, param2, param3.
Thanks, obviously I missed it in a copy-paste session. Also, http://avisynth.nl/index.php/Convert is O.K.
The next rstdoc update will include the fix.
Old 31st January 2026, 18:31   #3602  |  Link
v0lt
Registered User
 
Join Date: Dec 2008
Posts: 2,400
I'd like to clarify the GetFrame method.
Older VSFilters didn't throw exceptions in GetFrame; they simply didn't work. The newer xy-VSFilter calls ThrowError if it doesn't support the video format, which causes many FFmpeg-based applications to crash.
Example script.
Quote:
LoadPlugin("c:\temp\XySubFilter\VSFilter.dll")

Colorbars(1280, 720)
ConvertToPlanarRGB()

TextSub("c:\temp\test_subtitles.srt")
In my ScriptSourceFilter, the call to PClip::GetFrame is inside a try-catch block, but that doesn't help (or am I doing something wrong?).

Is this an application issue or an xy-VSFilter issue?
Old 31st January 2026, 23:37   #3603  |  Link
wonkey_monkey
Formerly davidh*****
 
wonkey_monkey's Avatar
 
Join Date: Jan 2004
Posts: 2,815
AvisynthError isn't derived from std::exception, so your code won't catch it (I think).

(Not sure why xy-VSFilter isn't checking the format in its constructor...)
__________________
My AviSynth filters / I'm the Doctor
Old 1st February 2026, 06:45   #3604  |  Link
v0lt
Registered User
 
Join Date: Dec 2008
Posts: 2,400
Quote:
Originally Posted by wonkey_monkey View Post
AvisynthError isn't derived from std::exception, so your code won't catch it (I think).
Thanks. I forgot about that again. I fixed it.

But this does not work with avformat_open_input.

Added:
Quote:
Originally Posted by wonkey_monkey View Post
(Not sure why xy-VSFilter isn't checking the format in its constructor...)
Thanks. It works!

Last edited by v0lt; 1st February 2026 at 08:41.
Old 3rd February 2026, 09:15   #3605  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,527
New build: Avisynth r4483
https://github.com/pinterf/AviSynthP...3.7.6pre-r4483

Code:
20260203 3.7.5.r4483 (pre 3.7.6)
--------------------------------
* rst documentation update: RGBAdjust https://avisynthplus.readthedocs.io/...rs/adjust.html
* rst documentation update: ColorYUV https://avisynthplus.readthedocs.io/.../coloryuv.html
* optimization: add AVX2 TurnLeft/TurnRight/Turn180 (R/L: 1.5-3x speed).
* optimization: ConvertBits AVX2 integer->float
* optimization: ConvertToPlanarRGB(A): YUV->RGB add AVX2 (2-3x speed)
* optimization: ConvertToPlanarRGB(A): YUV->RGB 16 bit: a quicker way (1.5x)
* Fix: C version of 32-bit ConvertToPlanarRGB YUV->RGB to not clamp output RGB values.
* ConvertToPlanarRGB(A): add bits parameter to alter target bit-depth.
* ConvertToPlanarRGB(A): from YUV->RGB full range output: optimized in-process when bits=32, other cases call ConvertBits internally.
* Fix: Packed RGB conversions altering the bit-depth (e.g. rgb32->ConvertToRGB64()) always worked in full range.
* Add more AVX512 resampler code. (WIP)
* Add more AVX512_BASE code paths (Resamplers)
* Build: add _avx512b.cpp/hpp pattern in CMake to detect source to compile with base (F,CD,BW,DQ,VL) flags.
  However, AVX512_BASE itself is set only when AVX512_FAST is found.
  For pre-Ice Lake (older AVX512) systems you can enable it with SetMaxCPU("avx512base+") and get the optimized AVX512_BASE functions.
* Build: add new architecture z/Architecture
Old 4th February 2026, 06:39   #3606  |  Link
Kurt.noise
Registered User
 
Join Date: Nov 2022
Location: Aix en Provence, France
Posts: 163
Hi,

I didn't look closely, but I've seen CUDA parameters in the compilation settings. What are they for, exactly?
Old 4th February 2026, 08:11   #3607  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,478
It looks like it's for memory management (and filtergraph filter interconnection?) for CUDA-based plugins (there are no internal CUDA filters currently?). It looks like back in the old days, when the number of programmers was larger, there was an attempt to make a CUDA-computing AviSynth with script-based filter interconnection inside the CUDA accelerator and/or even a mix of onCPU and onCUDA filters. But that development seems to have stopped a long time ago. Now all filters run onCPU, and if a filter needs external acceleration it does all memory management internally. This causes some performance loss for a single filter, and more for a filter chain, but it is simpler to support with the current limited number of developers.

There is an idea for a special extended filter-interconnection interface for onAccelerator filters, something like GetFrameToBuffer(buffer_description), so that accelerator-based filters can ask for a frame to be stored directly into an allocated upload buffer in the global virtual memory address space. Currently AI/NN filters like RIFE and avs-mlrt can only get frames via the standard GetFrame() method and must make an additional copy into allocated buffers for uploading to the accelerator. With the largest format, RGBPS, this puts a great load on the memory subsystem (though while AI/NN filters are not very fast, it may not be very visible).

As I understand it, the idea of the CUDA-based filtergraph was to avoid uploading/downloading frame resources to/from host RAM when several CUDA-based filters are connected in a graph inside an external accelerator. The current AVS filter interconnection (typically via the software Cache() for frame-based MT) requires writing frames to host RAM cache buffers and reading them back (though some DO_NOT_CACHE_ME mode exists, and filters can attempt a direct connection via GetFrame()?).
Old 4th February 2026, 11:29   #3608  |  Link
tormento
Acid fr0g
 
tormento's Avatar
 
Join Date: May 2002
Location: Italy
Posts: 3,075
Quote:
Originally Posted by DTL View Post
As I understand it, the idea of the CUDA-based filtergraph was to avoid uploading/downloading frame resources to/from host RAM
So, could there be an implementation for my beloved BM3DCUDA, where the temporal part is CPU-bound?
__________________
@turment on Telegram
Old 6th February 2026, 08:51   #3609  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,478
CUDA-based filters may interact with other CUDA-based filters without downloading input/output frames to host RAM.

Important update for use with ML/AI/NN filters -
ConvertToPlanarRGB(A): add bits parameter to alter target bit-depth.

It means that instead of the long and slow sequence

ConvertToPlanarRGB()
ConvertBits(32)

before RIFE/avs-mlrt filters, after r4483 a single (and faster) filter can be used:

ConvertToPlanarRGB(bits=32)

It is finally expected to be faster than avsresize's Z_ConvertFormat(pixel_type="RGBPS") when converting from YUV.

Also, uint8..16 to float32 precision may become a bit better, because the new function uses a direct int32-to-float32 conversion without an intermediate integer stage.
It may be close in precision to the possible (but also slower) sequence

ConvertBits(32)
ConvertToPlanarRGB()

Also waiting for a new test release from pinterf: it will have a finally fixed ColorBarsHD at any integer precision and in float, and an again-fixed matrix/dematrix part for better YUV<->RGB conversions, to test the precision of the new functions. Currently ColorBarsHD uses only an 8-bit internal table with bit-depth upscaling, so it cannot be used as a good source generator for precise matrix testing at different bit depths.
Precision of the 10-bit YUV to 8-bit RGB conversion is expected to be very close or equal to avsresize in the new test release.

Last edited by DTL; 6th February 2026 at 16:01.
Old 6th February 2026, 13:03   #3610  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,527
Quote:
Originally Posted by DTL View Post
Also waiting for a new test release from pinterf: it will have a finally fixed ColorBarsHD at any integer precision and in float, and an again-fixed matrix/dematrix part for better YUV<->RGB conversions, to test the precision of the new functions. Currently ColorBarsHD uses only an 8-bit internal table with bit-depth upscaling, so it cannot be used as a good source generator for precise matrix testing at different bit depths.
Precision of the 10-bit YUV to 8-bit RGB conversion is expected to be very close or equal to avsresize in the new test release.
Yep, returning to this matrix conversion area during development caused a "bit" more work, testing, and experimentation than I had initially expected. I can't quite compete with the accuracy of avsresize, as it performs all matrix operations internally in 32 bit float.

However, at 16-bit, a possible +/-1 lsb error can be considered as "good enough".

When using full range, it's effectively the same as working in 32 bit float; for complex transformations (like full range bit depth conversion), my version uses float calculations internally as well.

Avisynth hasn't previously had chained (fused) filter options where a matrix was involved, so this bits= parameter is a first. It immediately became clear that while optimizing the simplest conversion was easy, the code bloats exponentially when you try to optimize every specific sub-case. Luckily, the more complex cases needed float internally anyway, so they could be unified.

Combining the bit depth conversion is not only much faster but also more accurate than simply adding ConvertBits after the YUV-RGB conversion, so it was a very good and useful idea from DTL.
Old 6th February 2026, 13:18   #3611  |  Link
tormento
Acid fr0g
 
tormento's Avatar
 
Join Date: May 2002
Location: Italy
Posts: 3,075
Quote:
Originally Posted by DTL View Post
CUDA-based filters may interact with other CUDA-based filters without downloading input/output frames to host RAM.
My usual script is something like:

SetFilterMTMode("DEFAULT_MT_MODE", 2)
LoadPlugin("D:\Eseguibili\Media\DGDecNV\DGDecodeNV.dll")
Import("D:\Eseguibili\Media\StaxRip\Apps\Plugins\AVS\DehaloAlpha\Dehalo_alpha.avsi")
Import("D:\Eseguibili\Media\StaxRip\Apps\Plugins\AVS\Dither\mt_xxpand_multi.avsi")
Import("D:\Eseguibili\Media\StaxRip\Apps\Plugins\AVS\FineDehalo\FineDehalo.avsi")
DGSource("M:\In\The promised neverland S1 ~Dynit\01-1.dgi")
z_ConvertFormat(resample_filter="Spline64", pixel_type="yuv420p16")
DeBilinearResizeMT(1280, 720, threads=2, prefetch=2, accuracy=2)
z_ConvertFormat(resample_filter="Spline64", pixel_type="yuv444ps")
BM3D_CUDA(sigma=12, radius=4, chroma=true, block_step=6, bm_range=12, ps_range=6, fast=false)
BM3D_VAggregate(radius=4)
z_ConvertFormat(resample_filter="spline64",dither_type="error_diffusion",pixel_type="yuv420p16")
FineDehalo(rx=2, ry=2, thmi=80, thma=128, thlimi=50, thlima=100, darkstr=0.3, brightstr=1.0, showmask=0, contra=0.0, excl=true)
libplacebo_Deband(radius=12, iterations=4, temporal=false, planes=[3,3,3])
fmtc_bitdepth (bits=10,dmode=8)
Prefetch(2,6)


How could I change it to have some performance increase, according to the last commits?
__________________
@turment on Telegram
Old 6th February 2026, 16:17   #3612  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,478
Currently it looks like no. In the future some 420<->444 conversion performance increase is expected in the UV 2x upsize/downsize, but it is not yet ported to the AVS+ core, so no test release exists yet. Also, it is an internal Resize() optimization and would not require script changes.

As a test, it may be worth trying 14-bit integers instead of 16-bit. It was found that the 16-bit format was not great for performance in many AVS filters and is processed by different functions than 10-14 bit. But if you use external filters, it may depend on their implementations. The precision may not differ greatly between 14 and 16 bits.

Last edited by DTL; 6th February 2026 at 16:23.
Old 8th February 2026, 23:24   #3613  |  Link
FranceBB
Broadcast Encoder
 
FranceBB's Avatar
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 3,377
Quote:
Originally Posted by v0lt View Post
Older VSFilters didn't throw exceptions in GetFrame; they simply didn't work.
I remember it well. Sometimes I forgot to convert things at the end and I ended up wasting encoding hours only to find out that the output wasn't hardsubbed.

TextSub("whatever.ass") would literally pass the frames through without returning any error nor overlaying any subtitles.




Quote:
Originally Posted by DTL View Post
there was an attempt to make a CUDA-computing AviSynth with script-based filter interconnection inside the CUDA accelerator and/or even a mix of onCPU and onCUDA filters.
Yes, however it's always better to avoid going back and forth between CPU and GPU (so onCPU and onCUDA) because it means copying a lot of data and it might not be worth it. One classic example of this is Cube() and DGCube() from Donald Graft. Although - on paper - DGCube() makes the entire tetrahedral interpolation and application of the 65x65x65 LUT on the GPU and it's significantly computationally faster than doing it on the CPU with AVX512, copying huge frames from RAM (DDR) to VRAM (GDDR) via the motherboard lanes to perform the process and then back causes a significant slowdown and only really makes DGCube() a fraction faster than Cube(), thus losing almost all its advantage. We're talking about 16bit RGB Planar UHD frames here for normal content (or potentially 6K for post production stuff shot in log).

Quote:
Originally Posted by DTL View Post
Now all filters run onCPU, and if a filter needs external acceleration it does all memory management internally. This causes some performance loss for a single filter, and more for a filter chain, but it is simpler to support with the current limited number of developers.
To be fair, this is very easy to support, and it's actually very simple from the user perspective as well: there's no need to mess with MT modes or Prefetch(), since the threadpool is created internally by the filter itself. From a user perspective, all you have to do is write the script and it will run everywhere.

Also, when it comes to distributed farms, as with FFAStrans running on-prem, the servers may not all be exactly the same: they won't necessarily have the very same dedicated GPU with the same drivers, and the same goes for CPU cores, instruction sets (assembly optimizations), etc. So writing the scripts and letting the filters "figure it out" automatically on the CPU is actually very helpful.

Even in a cloud environment, in which you typically have 1 machine = 1 job with a workflow running end to end, you really only have CPU-only EC2 instances, like the c6i.4xlarge and c6i.2xlarge we're using, since only the ones without a GPU scale up and down by being created dynamically according to the number of jobs. Sure, one could set it up in a similar way with GPU-powered EC2 instances, but the likelihood of them not being created because there aren't resources available in the region rises significantly, making the CPU-only option way more appealing. TL;DR: I would rather pick something slower that works and is available all the time than something slightly faster that might not be available.

Quote:
Originally Posted by DTL View Post
Also waiting for a new test release from pinterf: it will have a finally fixed ColorBarsHD at any integer precision and in float, and an again-fixed matrix/dematrix part for better YUV<->RGB conversions, to test the precision of the new functions. Currently ColorBarsHD uses only an 8-bit internal table with bit-depth upscaling, so it cannot be used as a good source generator for precise matrix testing at different bit depths.
Precision of the 10-bit YUV to 8-bit RGB conversion is expected to be very close or equal to avsresize in the new test release.
That's actually gonna be very good as I routinely use Avisynth's colorbars to test stuff, so thank you for looking into this guys.
Old Yesterday, 10:13   #3614  |  Link
FranceBB
Broadcast Encoder
 
FranceBB's Avatar
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 3,377
Quote:
Originally Posted by pinterf View Post
Windows XP Builds (x86-xp):

Tested and working on Windows XP Professional x86


Windows 7 Builds (x64-win7-19.44.35221-17.14):

Tested and working on Windows Server 2008 R2 x64
Old Yesterday, 17:59   #3615  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,478
"How could I change it to have some performance increase, according to the last commits?"

What I found with some total encoding performance testing (using x264.exe as the MPEG encoder and RIFE as a call to accelerator processing): performance may depend not on the best-performing single filter but on some magic combination of filters in the graph.
I see you use avsresize for all format conversions. In my tests it is not the best for total performance (it is about equal to the other, not-best AVS filter sequences).
So if you have time for long testing, you can also test single filters, or even sequences of AVS Convert(matrix) + ConvertBits() filters, and see whether the total encoding performance is visibly different.

In my (test) case with RIFE:

The fastest conversion of YV12 to planar RGBPS before the RIFE filter is
ConvertToPlanarRGB()
ConvertBits(32)

All other ways (the single ConvertToPlanarRGB(bits=32), Z_ConvertFormat(pixel_type="RGBPS"), and the reverse order ConvertBits(32).ConvertToPlanarRGB()) are slower. This was tested with r4483 (and sources up to 09 Feb). But all of this may change in any new release if there are changes in AVS caching, the fetching of new frames from the global frame buffer, etc.

The total encoding fps difference between the 'lucky fast' filter sequence and 'about all the other slow' ones is about 22..23 vs 14..15 fps.

The full test script was:
Code:
LoadPlugin("avsresize.dll")
LoadPlugin("RIFE.dll")

ColorBars(800,800,pixel_type="YUV444P8")

#ConvertToPlanarRGB(bits=32)
#ConvertBits(32)
ConvertToPlanarRGB()
ConvertBits(32)
#Z_ConvertFormat(pixel_type="RGBPS")

RIFE(denoise_tr=1)

ConvertBits(8)
ConvertToYV12()
Prefetch(2)
And x264 encoding is
x264.exe --profile high --crf 18 --preset "placebo" --merange 50 --psnr -o out.264 rife_test.avs

Typical average-performance GPU load graph and x264 fps: [image not included]

Anomalous best-performance Convert filter sequence: [image not included]
Maybe it's some complex issue with Windows threading, GPU drivers, etc.

Another idea: RIFE in denoise mode (tr=1) requests the -1 and +1 frames around the current one to output the interpolated frame, a bit like MDegrain1(). This may cause a significant performance difference in AVS caching, and the sequence
ConvertToPlanarRGB()
ConvertBits(32)
may have the luckier result?
But these filters are simple, and the sequence
ConvertBits(32)
ConvertToPlanarRGB()
is expected to have the same cache service-time performance (not counting the larger float data size after ConvertBits(32)).

Last edited by DTL; Today at 06:46.
Old Today, 14:24   #3616  |  Link
wonkey_monkey
Formerly davidh*****
 
wonkey_monkey's Avatar
 
Join Date: Jan 2004
Posts: 2,815
Presumably RIFE has to upload multiple frames to the GPU for each output frame. How efficiently does it do so?

Edit: I see you mentioned that. Off the top of my head, to be compatible with Prefetch(2) it can't be keeping track of internal state between frames, so it must be uploading 3 frames every time. If it could keep track of what's previously been loaded, and ran single-threaded, could it eliminate two of those uploads?
__________________
My AviSynth filters / I'm the Doctor

Last edited by wonkey_monkey; Today at 14:27.
Old Today, 18:37   #3617  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,478
RIFE as applied to AVS is still a very simple 2-frames-in, 1-frame-out engine. It always uploads 2 frames to get 1 output frame (plus a time parameter, a float from 0.0 (first frame) to 1.0 (second frame), telling it where to interpolate). This allows it to run in random-access mode without additional programming, but it causes 2x extra upload traffic when working with sequential frame requests. It may not know how to remap the last uploaded frame to be the first and upload only 1 new frame. That is one more issue to open for possible optimization in the RIFE AVS plugin (if the underlying NN engine supports resource remapping).

In theory, any temporal-radius filter with sequential frame requests at its output (normal encode mode in 1 thread) could upload only 1 new frame and remap all tr frames by shifting them by 1 and dropping the first (oldest) frame, provided the accelerator has enough memory to hold all tr frames at once. But with current massively multithreaded CPUs, running in single-threaded mode may not extract the best performance from the accelerator. Also, with random output-frame access, the frame source engine needs to upload all frames required for the requested output frame; that means additional programming to remember the uploaded frame numbers. And each thread must either have its own frame pool (in very small and expensive accelerator memory) or even some global memory manager interconnected with the AVS frame cache, to run filters with more than 1 thread when the frame numbers in different threads overlap. This was part of the idea behind onCUDA filter processing.

Because the current underlying RIFE engine always works with 2 frames (and cannot remember surrounding frames to improve quality), optimizing sequential frame access by remapping the last frame to the first and uploading only 1 new frame may not give a significant performance boost, even if it is possible with the currently used Vulkan API and RIFE engine. Though it is worth testing if possible. A much bigger performance boost may come with larger-tr engines (but we currently do not have any?).

The current AVS groups-of-frames multithreading model forces a filter refresh at each new (non-sequential) output frame: once the current group of frames in a thread is fully processed, the thread gets a new work unit for the next group of frames (possibly in some random order too). This means the filter in that thread needs to re-upload all frames to the accelerator for the currently requested output frame at the start of each frame group, or attempt to remap frames from a global uploaded-frames pool for the current clip (if available and managed). This requires even more programming.