Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion.

Before you start posting please read the forum rules. By posting to this forum you agree to abide by the rules.

Domains: forum.doom9.org / forum.doom9.net / forum.doom9.se

 

Old 2nd November 2024, 05:19   #121  |  Link
kedautinh12
Registered User
 
Join Date: Jan 2018
Posts: 2,170
Quote:
Originally Posted by FranceBB View Post
Does it also happen with fp16=true?
I also have 8GB GDDR5 in my NVIDIA Quadro P4000 and I had to add that to compensate 'cause otherwise it would perform calculations in 32bit float and run out of memory.
Yes, 32-bit float is very demanding: even a GPU with 8 GB of VRAM runs out of memory on 480p video.
kedautinh12 is offline   Reply With Quote
Old 2nd November 2024, 06:15   #122  |  Link
Arx1meD
Registered User
 
Arx1meD's Avatar
 
Join Date: Feb 2021
Posts: 143
Quote:
Originally Posted by FranceBB View Post
Does it also happen with fp16=true?
Yes, errors also appear with fp16=true.
I also noticed that when models are converted in the chaiNNer program with fp32 selected, everything works fine (for ESRGAN-architecture models), while the same model converted with fp16 selected produces errors.
Arx1meD is offline   Reply With Quote
Old 17th February 2025, 17:54   #123  |  Link
Stereodude
Registered User
 
Join Date: Dec 2002
Location: Region 0
Posts: 1,481
FWIW, mlrt_ort using CUDA is faster than mlrt_ncnn for running an ONNX model (Dotzilla) on an RTX 3070 Ti with a DVD source. I think it was 24fps vs. 18fps.

Last edited by Stereodude; 17th February 2025 at 18:22.
Stereodude is offline   Reply With Quote
Old 15th September 2025, 15:13   #124  |  Link
takla
Registered User
 
Join Date: May 2018
Posts: 240
Posting this example script here so I don't forget to set the colorspace stuff again...

Also, CUDA is way slower for me, barely reaching 1/3 of the speed I get with DML.
This has also been shown by others to be the case.

With hardware-accelerated GPU scheduling enabled, GPU usage often gets stuck at 80%.
If you run into this issue, you can disable it in the Windows settings under "System > Display > Graphics > Advanced graphics settings".
Edit: Someone else ran into the same issue.

Code:
# open the source (the poster's local file)
LWLibavVideoSource("C:\Users\ULTRA\Downloads\ffmpeg\INPUT.mkv").Prefetch(0)
# limited-range YUV BT.709 -> full-range planar RGB float, which the model expects
z_ConvertFormat(pixel_type="RGBPS", colorspace_op="709:709:709:limited=>rgb:709:709:full", dither_type="error_diffusion", cpu_type="avx2", use_props=0).Prefetch(0)
# run the 2x Compact model through onnxruntime with the DirectML provider
mlrt_ort(network_path="C:\Program Files (x86)\AviSynth+\plugins64+\mlrt_ort_rt\models\2x_Ani4Kv2_G6i2_Compact_107500_fp32.onnx", builtin=False, provider="DML", fp16=True, num_streams=1).Prefetch(2)
# full-range RGB back to limited-range YUV 4:2:0 8-bit for encoding
z_ConvertFormat(pixel_type="YUV420P8", colorspace_op="rgb:709:709:full=>709:709:709:limited", dither_type="error_diffusion", cpu_type="avx2", use_props=0).Prefetch(0)

Last edited by takla; 26th September 2025 at 15:43.
takla is offline   Reply With Quote
Old 5th February 2026, 23:51   #125  |  Link
takla
Registered User
 
Join Date: May 2018
Posts: 240
Exciting news:
ONNX is coming to FFMPEG
takla is offline   Reply With Quote
Old 6th February 2026, 09:01   #126  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,491
A question to NN/mlrt filters developers:

Currently RIFE uses an internal float32 range remap by x255.0f (0..1 to 0..255), while all the other (3) filters from avs-mlrt upload data from AVS to the accelerator directly (in its 0..1 range).
Is this a special feature required by RIFE models/processing? Can we feed RIFE the direct AVS 0..1 float range data without issues? Or were RIFE models trained specifically on old 0..255 PC BMP data, so that they cannot process the 0..1 data range at the same quality?

The idea is to add a new filter-interconnection mode to AVS, something like GetFrameToBuffer(), so a filter can write its data directly into the accelerator's allocated upload buffer instead of having the CPU copy the large RGBPS data from the AVS cache into the upload buffer.

Last edited by DTL; 6th February 2026 at 09:17.
DTL is offline   Reply With Quote
Old 6th February 2026, 11:12   #127  |  Link
WolframRhodium
Registered User
 
Join Date: Jan 2016
Posts: 172
Quote:
Originally Posted by DTL View Post
Currently RIFE uses internal float32 range remap to x255.0f (0..1 to 0..255)
What do you mean by that? It uses 0..1 float range.

Quote:
Originally Posted by DTL View Post
The idea is to add new filters interconnection mode to AVS like GetFrameToBuffer() to direct data write into allocated upload buffer to accelerator instead of large RGBPS data copy from AVS cache to allocated upload buffer by CPU.
I am not a fan of this. We can hide the latency of the data transfer if it is shorter than the computation time, without changing the existing software stack. (Of course, this could be necessary for lightweight models, but I don't think it will be significant in the near future.)
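The latency hiding described here amounts to double buffering: while frame i is being computed on, frame i+1 is already being uploaded. A minimal standalone sketch of that overlap, with hypothetical upload/compute stand-ins (not actual mlrt code):

```cpp
#include <future>
#include <numeric>
#include <vector>

// Stand-in for a host->device copy (in a real plugin this would be an
// asynchronous transfer to the accelerator's upload buffer).
std::vector<float> upload(const std::vector<float>& src) { return src; }

// Stand-in for the on-device inference; here just a sum over the "frame".
float compute(const std::vector<float>& buf) {
    return std::accumulate(buf.begin(), buf.end(), 0.0f);
}

// Double-buffered pipeline: the upload of frame i+1 runs on another thread
// while frame i is being computed, so transfer latency is hidden as long as
// the upload is shorter than the compute.
std::vector<float> run_pipeline(const std::vector<std::vector<float>>& frames) {
    std::vector<float> results;
    if (frames.empty()) return results;
    auto pending = std::async(std::launch::async, upload, std::cref(frames[0]));
    for (size_t i = 0; i < frames.size(); ++i) {
        std::vector<float> dev = pending.get();      // wait for upload of frame i
        if (i + 1 < frames.size())                   // kick off upload of frame i+1
            pending = std::async(std::launch::async, upload, std::cref(frames[i + 1]));
        results.push_back(compute(dev));             // overlaps with the next upload
    }
    return results;
}
```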
WolframRhodium is offline   Reply With Quote
Old 6th February 2026, 11:49   #128  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,491
Quote:
Originally Posted by WolframRhodium View Post
What do you mean by that? It uses 0..1 float range.
At its input it accepts the 'standard' AVS float32 range of about 0..1, but when copying into the buffer allocated by the accelerator library it performs an additional range remap to 0..255 - https://github.com/Asd-g/AviSynthPlu.../rife.cpp#L379
in0R[w * y + x] = src0R[src_stride * y + x] * 255.0f;
in0G[w * y + x] = src0G[src_stride * y + x] * 255.0f;
in0B[w * y + x] = src0B[src_stride * y + x] * 255.0f;
in1R[w * y + x] = src1R[src_stride * y + x] * 255.0f;
in1G[w * y + x] = src1G[src_stride * y + x] * 255.0f;
in1B[w * y + x] = src1B[src_stride * y + x] * 255.0f;


All the other (3) avs-mlrt plugins simply call avs_bitblt (a copy without range remap) to fill the upload buffer.


The question: is this an obligatory feature of RIFE models, or some old redundant operation that can be removed (so that we could write directly into the upload buffer with the 'standard' AVS float32 range mapping of about 0..1)?

"We can hide the latency of data transfer if it is shorter than the computation time, "

The idea is to free CPU and host memory-bus resources. While the AVS filters are running, we are also doing MPEG encoding with x26x software, which is also very slow. If we free some CPU + memory-bus resources, the total AVS + MPEG (on-CPU) encoding pipeline should see some performance boost. Copying the largest format, RGBPS, is very bad for cache and memory-bus load.
So the plan is to test whether we can simply make uncached stores at the output of the new ConvertToPlanarRGB(bits=32) filter in the AVS+ core, directly into the accelerator's upload buffer.
This would be a special type of filter without CPU-core readback: an adapter filter that feeds data to an external accelerator (which reads it from host RAM via DMA?).

The updated plugin is expected to look like:
Code:
    ncnn::Mat in0;
    ncnn::Mat in1;
    in0.create(w, h, channels, sizeof(float), 1); // allocate upload buffers to accelerator
    in1.create(w, h, channels, sizeof(float), 1);
    float* in0R{ in0.channel(0) };
    float* in0G{ in0.channel(1) };
    float* in0B{ in0.channel(2) };
    float* in1R{ in1.channel(0) };
    float* in1G{ in1.channel(1) };
    float* in1B{ in1.channel(2) };

    child->GetFrameToBuffer(frame_num, buffer_description(in0R, ...)); // uncached write from the CPU core into the upload buffers, DMA'd to the accelerator in the large RGBPS format

    ncnn::VkCompute cmd(vkdev);

Last edited by DTL; 6th February 2026 at 12:02.
DTL is offline   Reply With Quote
Old 6th February 2026, 13:08   #129  |  Link
StvG
Registered User
 
Join Date: Jul 2018
Posts: 595
Quote:
Originally Posted by DTL View Post
At input it accepts 'standard' AVS float32 range of about 0..1 but at making copy into allocated by accelerator library buffer it makes additional range remap to 0..255 - https://github.com/Asd-g/AviSynthPlu.../rife.cpp#L379...
@DTL, AviSynthPlus-RIFE - https://github.com/Asd-g/AviSynthPlu...n/src/rife.cpp is a completely different plugin from avs-mlrt - https://github.com/Asd-g/avs-mlrt/bl...mlrt_RIFE.avsi
StvG is offline   Reply With Quote
Old 6th February 2026, 14:19   #130  |  Link
WolframRhodium
Registered User
 
Join Date: Jan 2016
Posts: 172
Quote:
Originally Posted by DTL View Post
The idea is to free CPU and host memory bus resources.
There are tons of similar improvements that could make a massive difference, e.g. fusing pointwise operations like MakeDiff() with resize. And PCIe 5.0 is fast.
WolframRhodium is offline   Reply With Quote
Old 7th February 2026, 14:38   #131  |  Link
tormento
Acid fr0g
 
tormento's Avatar
 
Join Date: May 2002
Location: Italy
Posts: 3,086
Quote:
Originally Posted by WolframRhodium View Post
And PCIe 5.0 is fast.
Remember that not everybody has the money to upgrade to a modern system.
__________________
@turment on Telegram
tormento is offline   Reply With Quote
Old 7th February 2026, 16:42   #132  |  Link
WolframRhodium
Registered User
 
Join Date: Jan 2016
Posts: 172
Quote:
Originally Posted by tormento View Post
Remember that somebody hasn't the money to upgrade to a modern system.
Then, similarly, the GPU processing time will be longer than the data transfer time.
WolframRhodium is offline   Reply With Quote
Old 7th February 2026, 16:44   #133  |  Link
tormento
Acid fr0g
 
tormento's Avatar
 
Join Date: May 2002
Location: Italy
Posts: 3,086
Quote:
Originally Posted by WolframRhodium View Post
Then similarly the gpu processing time will be longer than the data transfer time.
Did you use all the possible tricks in the newer builds of AVS+ to decrease the overhead of GPU-to-CPU transfers when the temporal part is processed?
__________________
@turment on Telegram
tormento is offline   Reply With Quote
Old 7th February 2026, 23:26   #134  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,491
If the host CPU load is lower, the user can set higher-quality MPEG encoding params and get more quality in the same encoding time (limited by accelerator performance).

Currently there are no significant optimizations for data transfer from host memory to GPU in the latest AVS+ builds. The only optimization in the r4483 build is a faster YUV to RGBPS transform with a single YUV data load on the CPU, instead of a separate ConvertBits(32) load/store stage.

Users can now call a single ConvertToPlanarRGB(bits=32), which is faster than the old sequence of
ConvertToPlanarRGB()
ConvertBits(32)
and may also be a bit higher in precision (in the case of an integer YUV dematrix before ConvertBits(32)).
DTL is offline   Reply With Quote
Old 8th February 2026, 01:30   #135  |  Link
WolframRhodium
Registered User
 
Join Date: Jan 2016
Posts: 172
Quote:
Originally Posted by tormento View Post
Did you use all the possible tricks on newer builds of AVS+ to decrease the overhead from GPU to CPU transfers to have the temporal part processed?
I do. I use an operating system trick (pinned host memory), a driver trick (CUDA graphs) and an x86-specific trick (write-combined memory, for CPU to GPU only). More tricks could be available for newer GPUs.
WolframRhodium is offline   Reply With Quote