Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion. Before you start posting please read the forum rules. By posting to this forum you agree to abide by the rules. Domains: forum.doom9.org / forum.doom9.net / forum.doom9.se |
|
|
#122 | Link |
|
Registered User
Join Date: Feb 2021
Posts: 143
|
Yes. Errors also appear when fp16=true.
I also noticed that when models are converted in the chaiNNer program with fp32 selected, everything works fine (for ESRGAN-architecture models), while errors appear for the same model converted with fp16 selected. |
|
|
|
|
|
#123 | Link |
|
Registered User
Join Date: Dec 2002
Location: Region 0
Posts: 1,481
|
FWIW, mlrt_ort using CUDA is faster than mlrt_ncnn at running an ONNX model (Dotzilla) on an RTX 3070 Ti with a DVD source. I think it was 24fps vs. 18fps.
Last edited by Stereodude; 17th February 2025 at 18:22. |
|
|
|
|
|
#124 | Link |
|
Registered User
Join Date: May 2018
Posts: 240
|
Posting this example script here so I don't forget to set the colorspace stuff again...
Also, CUDA is way slower for me, barely reaching 1/3 of the speed I get with DML; others have reported the same. With Hardware-accelerated GPU scheduling enabled, GPU usage often gets stuck at 80%. If you run into this issue, you can disable it in Windows settings under System > Display > Graphics > Advanced graphics settings. Edit: someone else ran into the same issue. Code:
LWLibavVideoSource("C:\Users\ULTRA\Downloads\ffmpeg\INPUT.mkv").Prefetch(0)
z_ConvertFormat(pixel_type="RGBPS", colorspace_op="709:709:709:limited=>rgb:709:709:full", dither_type="error_diffusion", cpu_type="avx2", use_props=0).Prefetch(0)
mlrt_ort(network_path="C:\Program Files (x86)\AviSynth+\plugins64+\mlrt_ort_rt\models\2x_Ani4Kv2_G6i2_Compact_107500_fp32.onnx", builtin=False, provider="DML", fp16=True, num_streams=1).Prefetch(2)
z_ConvertFormat(pixel_type="YUV420P8", colorspace_op="rgb:709:709:full=>709:709:709:limited", dither_type="error_diffusion", cpu_type="avx2", use_props=0).Prefetch(0)
Last edited by takla; 26th September 2025 at 15:43. |
|
|
|
|
|
#125 | Link |
|
Registered User
Join Date: May 2018
Posts: 240
|
Exciting news:
ONNX is coming to FFmpeg |
|
|
|
|
|
#126 | Link |
|
Registered User
Join Date: Jul 2018
Posts: 1,491
|
A question for the NN/mlrt filter developers:
Currently RIFE uses an internal float32 range remap by 255.0f (0..1 to 0..255), while the other three filters from avs-mlrt upload data directly from AVS to the accelerator (in its native 0..1 range). Is this a special feature required by RIFE models/processing? Can we feed RIFE direct AVS 0..1 float data without issues? Or were RIFE models specifically trained on 0..255 legacy PC BMP data, so they cannot process the 0..1 data range with the same quality? The idea is to add a new filter-interconnection mode to AVS, like GetFrameToBuffer(), to write data directly into the accelerator's allocated upload buffer instead of having the CPU copy the large RGBPS data from the AVS cache into that buffer. Last edited by DTL; 6th February 2026 at 09:17. |
|
|
|
|
|
#127 | Link | |
|
Registered User
Join Date: Jan 2016
Posts: 172
|
Quote:
I am not a fan of this. We can hide the latency of the data transfer if it is shorter than the computation time, without changing the existing software stack. (Of course, this could be necessary for lightweight models, but I don't think it will be significant in the near future.) |
|
|
|
|
|
|
#128 | Link | |
|
Registered User
Join Date: Jul 2018
Posts: 1,491
|
Quote:
Code:
in0R[w * y + x] = src0R[src_stride * y + x] * 255.0f;
in0G[w * y + x] = src0G[src_stride * y + x] * 255.0f;
in0B[w * y + x] = src0B[src_stride * y + x] * 255.0f;
in1R[w * y + x] = src1R[src_stride * y + x] * 255.0f;
in1G[w * y + x] = src1G[src_stride * y + x] * 255.0f;
in1B[w * y + x] = src1B[src_stride * y + x] * 255.0f;
All three other avs-mlrt plugins fill the upload buffer with a simple avs_bitblt call (a copy without range remapping). The question: is this an obligatory feature of RIFE models, or an old redundant operation that can be removed (so that we can write directly to the upload buffer using the 'standard' AVS float32 range of about 0..1)?

"We can hide the latency of data transfer if it is shorter than the computation time"

The idea is to free CPU and host-memory-bus resources. While running the AVS filter processing we are also doing MPEG encoding with x26x software, which is also very slow. If we free some CPU and memory-bus resources, we expect a performance boost for the combined AVS + MPEG (on-CPU) encoding workload. Copying the large RGBPS format is very bad for cache and memory-bus load. So the plan is to test whether a new ConvertToPlanarRGB(bits=32) filter in the AVS+ core can simply issue uncached stores at its output, writing directly into the accelerator's upload buffer. This is a special type of filter with no CPU-core readback during processing: a filter-adapter that feeds data to an external accelerator (read from host RAM via DMA?). The updated plugin is expected to look like: Code:
ncnn::Mat in0;
ncnn::Mat in1;
in0.create(w, h, channels, sizeof(float), 1); // allocate upload buffers to accelerator
in1.create(w, h, channels, sizeof(float), 1);
float* in0R{ in0.channel(0) };
float* in0G{ in0.channel(1) };
float* in0B{ in0.channel(2) };
float* in1R{ in1.channel(0) };
float* in1G{ in1.channel(1) };
float* in1B{ in1.channel(2) };
child->GetFrameToBuffer(frame_num, buffer_description(in0R,...)); // uncached write from CPU core to upload buffers for DMA to accelerator, in large RGBPS format
ncnn::VkCompute cmd(vkdev);
Last edited by DTL; 6th February 2026 at 12:02. |
|
|
|
|
|
|
#129 | Link | |
|
Registered User
Join Date: Jul 2018
Posts: 595
|
Quote:
|
|
|
|
|
|
|
#134 | Link |
|
Registered User
Join Date: Jul 2018
Posts: 1,491
|
If the host CPU load is lower, the user can set higher-quality MPEG encoding parameters and get more quality in the same encoding time (limited by accelerator performance).
Currently there are no significant optimizations for data transfer from host memory to GPU in the latest AVS+ builds. The only optimization in the r4483 build is a faster YUV to RGBPS transform with a single YUV data load on the CPU instead of a separate ConvertBits(32) load/store stage. Users can now call a single ConvertToPlanarRGB(bits=32); it is faster than the old sequence of ConvertToPlanarRGB() followed by ConvertBits(32), and may also be slightly higher in precision (since the old path performed an integer YUV dematrix before ConvertBits(32)). |
|
|
|