Go Back   Doom9's Forum > Capturing and Editing Video > Avisynth Development

Old 29th January 2023, 21:02   #41  |  Link
Registered User
Join Date: Feb 2016
Location: Nonsense land
Posts: 331
I'm sorry to say this, but I must interrupt the discussion: I'm going to disappear for about a month. It's too complicated to explain why. (If someone wants to take over, please do.)
I come from nonsense land. I usually post under the effect of alcohol and I don't think before writing, so don't take it personally; I didn't mean it.
Ceppo is offline   Reply With Quote
Old 29th January 2023, 21:46   #42  |  Link
Registered User
Join Date: Jul 2018
Posts: 987
It looks like Intel SDE, even the latest 9.14.0, only supports full debugging integration with VS2017. As the readme says, there is no support for VS2019 yet.

So when running under VS2019 as a debugged application, it runs AVSMeter and the plugin but does not provide module information to Visual Studio, so VS cannot load symbols and cannot break on breakpoints or crashes.

But you can still run the emulation standalone to check the application's output and that it executes without crashing. If it does crash, SDE reports only the address and crash type.
DTL is offline   Reply With Quote
Old 30th January 2023, 03:25   #43  |  Link
Registered User
Join Date: Jan 2018
Posts: 1,898
Here is CUDA code; I hope you can understand it.
kedautinh12 is offline   Reply With Quote
Old 12th March 2023, 02:31   #44  |  Link
Registered User
Join Date: Jul 2018
Posts: 987
An interesting way of SIMD processing of rows without special handling for the last columns when the row width is not an integer multiple of the SIMD 'workunit' size: https://github.com/Asd-g/AviSynth-vs...mooth_SSE2.cpp (the same applies to AVX2 and AVX512).

It looks like with a not-too-old AVS core (?) the pitch of a row is guaranteed to be an integer multiple of the alignment size, so no long epilogue is required to process an unknown count of residual samples at the end of each row. But it may be good to put an assert, or even a direct check at plugin init (in the class constructor), on the pitch of the buffers to be processed. If the pitch is not evenly divisible by the alignment size required for SIMD processing, it is better to throw a stop error with a description.
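A minimal sketch of such a check (the names and the vector width are my assumptions, not from the plugin; in a real AviSynth filter this would call env->ThrowError() instead of throwing a C++ exception):

```cpp
// Hypothetical sketch: reject buffers whose pitch is not a multiple of the
// SIMD workunit size, so the tail-free row loop is safe to use.
#include <stdexcept>

constexpr int VECTOR_BYTES = 32; // AVX2 dataword width; an assumption here

// In a real AviSynth filter this would be env->ThrowError() in the
// constructor or GetFrame(); a plain exception stands in for it here.
inline void check_pitch(int pitch)
{
    if (pitch % VECTOR_BYTES != 0)
        throw std::runtime_error(
            "pitch is not a multiple of the SIMD alignment size");
}
```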

Last edited by DTL; 23rd March 2023 at 13:12.
DTL is offline   Reply With Quote
Old 1st May 2023, 10:30   #45  |  Link
Registered User
Join Date: Jul 2018
Posts: 987
A long read about superscalar programming for SIMD:

Many newer CPU chips (roughly since the late 1990s) have some limited capability for superscalar computing. It means that in some cases more computation can be performed in the same time (clock count). In chip design this is done with several dispatch ports capable of executing the same instruction. So the total compute performance of a CPU is
Number_of_Cores x SIMD_datawidth x Superscalarity_factor

The number of cores and the max SIMD dataword width are clearly visible from the CPU hardware config and SIMD family (64/128/256/512 bit). The superscalarity factor depends on the chip design, the instruction, and the number of dispatch ports capable of dispatching that compute instruction. For some groups of instructions the superscalarity factor may be 2 or more and is noted directly in the CPU specs, like the 2 FMA units in some Xeons.

Generally, for a given program the superscalarity factor is >1, and for some instructions on some chips it may reach 3. It can be found in the CPU documentation in the list of CPI per instruction (if CPI < 1, the instruction executes in 1 clock tick and 2 or more dispatch ports are available; so a CPI of 0.5 means 2 dispatch ports and a CPI of 0.33 means 3 dispatch ports).

The required conditions for superscalar computing:
1. The data being computed must not have dependencies.
2. The data should most probably be located in the register file (reading memory, even the L1D cache, is too slow).
3. There must be 2 or more free dispatch ports supporting this compute instruction.

Example of a program amenable to superscalar computing:

Not possible (data dependent):

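A minimal illustration of the two cases above (my own sketch; the function and variable names are made up):

```cpp
// Superscalar-friendly: the four adds are mutually independent, so the
// decoder may route several of them to free dispatch ports per clock.
inline void add4_independent(const int* a, const int* b, int* c)
{
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}

// Not possible (data dependent): every add consumes the previous result,
// so the adds form a chain and must execute one after another.
inline int sum4_dependent(const int* a)
{
    int s = 0;
    s = s + a[0];
    s = s + a[1];
    s = s + a[2];
    s = s + a[3];
    return s;
}
```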
So in SIMD programming, to benefit from superscalarity it is good to group big workunits of data (several SIMD datawords) and, if they are independent, group several compute instructions to process that data. The instruction decode unit of the CPU may then detect it as a superscalar-ready part of the program and route the commands to several free, supporting dispatch ports.

Example of a loop that is not very superscalar friendly:

for (int i = 0; i < N; i++)
    store(dst + i, result);

It uses SIMD but processes only one SIMD dataword per loop iteration. If the program designer is very lucky with the compiler, it may unroll this loop into a more superscalar-friendly form. But that depends on the compiler.

A more explicitly superscalar way of programming is:

for (int i = 0; i < N; i += 2)
{
    store(dst + i, result1);
    store(dst + i + 1, result2);
}

It uses a superscalarity factor of 2 if the compute instruction is supported on 2 or more dispatch ports. There are also fewer bus direction switches on loads and stores of data. It is expected that, as CPU design progresses, the superscalarity factor for more and more instructions may increase (maybe to 4 and more), so it may be recommended to design SIMD programs that support up to 4 or more dispatch ports in the same computation (depending on available space in the register file and more).

The C program text for superscalar computing is not very nice, with lots of repeating blocks; maybe it can be compacted into a more compact form with language tools.
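To make the idea concrete, here is a small sketch of my own (scalar C++ for brevity; with AVX2 each accumulator would be a __m256): breaking one dependency chain into four independent accumulators, the same idea as feeding several dispatch ports.

```cpp
// One chain: every add depends on the previous result, so the adds
// cannot be dispatched in parallel.
#include <cstddef>

float sum_serial(const float* src, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += src[i];
    return acc;
}

// Four independent chains: the adds in one loop iteration have no mutual
// dependency, so several of them may dispatch in the same clock.
float sum_unrolled4(const float* src, size_t n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += src[i];
        a1 += src[i + 1];
        a2 += src[i + 2];
        a3 += src[i + 3];
    }
    for (; i < n; ++i) // scalar tail for the residual samples
        a0 += src[i];
    return (a0 + a1) + (a2 + a3);
}
```

This is the repeating-blocks style the paragraph above complains about: correct, but verbose.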

Last edited by DTL; 1st May 2023 at 10:35.
DTL is offline   Reply With Quote
Old 13th May 2023, 03:11   #46  |  Link
Registered User
Join Date: Jul 2018
Posts: 987
A sample of a simple colour-space conversion plugin (YV12 to RGB32 decoding) using AVX2 optimization for both memory transfer and SIMD computing:

It can do either RGB planar (commented out) or RGB32 interleaved store.
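For reference, a scalar per-pixel version of this kind of conversion could look like the sketch below (my own code, not taken from the plugin; assumes BT.601 limited range and the classic 16.16 fixed-point coefficients):

```cpp
// Hypothetical scalar reference for YV12 -> RGB32: one limited-range
// BT.601 YUV sample to one packed RGB32 pixel (B, G, R, A byte order).
#include <algorithm>
#include <cstdint>

static inline uint8_t clamp8(int v)
{
    return (uint8_t)std::min(255, std::max(0, v));
}

inline uint32_t yuv_to_rgb32(uint8_t y, uint8_t u, uint8_t v)
{
    const int c = ((int)y - 16) * 76309; // 255/219 in 16.16 fixed point
    const int d = (int)u - 128;
    const int e = (int)v - 128;

    // +32768 rounds the 16.16 fixed-point result before the shift
    const uint8_t r = clamp8((c + 104597 * e + 32768) >> 16);
    const uint8_t g = clamp8((c - 25675 * d - 53279 * e + 32768) >> 16);
    const uint8_t b = clamp8((c + 132201 * d + 32768) >> 16);

    // RGB32 interleaved store: B, G, R, A in memory (little-endian dword)
    return (uint32_t)b | ((uint32_t)g << 8) | ((uint32_t)r << 16)
         | (0xFFu << 24);
}
```

The AVX2 version processes many pixels per dataword with the same arithmetic.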

Also, an addition about high-performance computing programming:

It looks like the dispatch ports of a core do not support the full range of instructions directly, but are designed as a sort of FPGA, with reloading of the compute configuration to support all required instructions.

So when the instruction decoder sees some new instruction, it performs:

1. Find a free dispatch port supporting this instruction.
2. Check whether the port is configured to dispatch it.
3. If the port is not configured, load the FPGA configuration (takes several clock ticks).
4. Route the data and instruction code to the port for dispatch.

So an instruction has 2 performance parameters: Latency and Throughput. Latency applies when it is the first such instruction in a sequence and no ready-configured dispatch port is available, so the first result is ready only after Latency clock ticks. If there are several equal instructions in a sequence, they can be pipelined to the ready dispatch port at the Throughput rate. So it may be good to arrange many equal instructions in large sets to run at the Throughput performance level. Good compilers should do this work from intrinsics and VCL-based C programming if enough data to compute is prepared.
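A small sketch of "many equal instructions in large sets" (my own illustration; names are made up): after the first multiply pays its Latency, the remaining independent multiplies in each group can issue back-to-back at the Throughput rate.

```cpp
// Hypothetical sketch: group equal, mutually independent multiplies so
// they pipeline at Throughput instead of each paying full Latency.
#include <cstddef>

void scale(const float* src, float* dst, size_t n, float k)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // one "set" of equal, independent multiply instructions
        dst[i]     = src[i]     * k;
        dst[i + 1] = src[i + 1] * k;
        dst[i + 2] = src[i + 2] * k;
        dst[i + 3] = src[i + 3] * k;
    }
    for (; i < n; ++i) // scalar tail
        dst[i] = src[i] * k;
}
```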

Last edited by DTL; 13th May 2023 at 03:22.
DTL is offline   Reply With Quote
