Old 28th January 2023, 21:41   #21  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,067
"some functions have not a _m256i, _m128i one"

What are the examples? If there is no single instruction or macro for an operation, you build a sequence of instructions as a workaround (and you can define your own macro for it, so the sequence is not repeated all over the program text).

The instruction sets of the SSE and AVX eras are really small compared with the large family of AVX512 instruction sets, so for some operations you need to design a sequence of older instructions (and it will execute more slowly).
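For example (my own illustration, not something from this thread): SSE2 has no unsigned 16-bit maximum instruction (_mm_max_epu16 only appeared with SSE4.1), but it can be emulated with a two-instruction sequence wrapped in a macro:
Code:
// hypothetical helper, assuming an SSE2-only target: max(a, b) for unsigned 16-bit lanes.
// a - b saturates to 0 wherever b >= a, so adding b back yields max(a, b).
#define MY_MAX_EPU16(a, b) _mm_adds_epu16(_mm_subs_epu16((a), (b)), (b))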

Also, for 128-bit and 256-bit integer instructions you need to enable the SSE2 and AVX2 instruction sets in the search/display filters of the Intrinsics Guide at https://www.laruence.com/sse .

It is really part of a good program: if cpuFlags does not show AVX2 or SSE2, you most probably cannot run the 16-bit and 8-bit integer SIMD functions and have to run only the slow C reference. (You could also design a 64-bit MMX integer path if you have the time.) Check the
Code:
CPUID Flags:
entry of each instruction you use to save the user from an invalid-instruction crash. We still have users with AVX-only chips without AVX2, so they can run only the SSE2 version of the integer processing functions.

To keep things a bit shorter I typically check only for AVX2: if it is present we can run the integer and float AVX paths. If not, check for SSE2; if that is not present either, run the C reference. There are also SSE3/SSSE3/SSE4.1/SSE4.2 levels, but that is too many versions to build unless you want that degree of support for very old chips.

Unfortunately AVX512 also has lots of family members, and it is best practice to check for the presence of every subset actually used.


I think current software can be designed around the subset of AVX512 covered by Zen 4; maybe most of today's desktop Intel/AMD chips will support it.

Last edited by DTL; 28th January 2023 at 22:03.
Old 28th January 2023, 21:56   #22  |  Link
Ceppo
Registered User
 
Join Date: Feb 2016
Location: Nonsense land
Posts: 339
OK, I will try to make a workaround as you said. For now the code is this; I removed the template since, as you said, it was not needed in this case.

Code:
#include <windows.h>
#include <avisynth.h>
#include <immintrin.h>

void (*CoreFilterPtr)(const unsigned char*, unsigned char*, int, int, int, int, int, int);

void Invert(const unsigned char* _srcp, unsigned char* _dstp, int src_pitch, int dst_pitch, int height, int row_size, int bits, int threads)
{   
    if (bits == 32)
    {
        float* dstp = reinterpret_cast<float*>(_dstp);
        const float* srcp = reinterpret_cast<const float*>(_srcp);
        #pragma omp parallel for num_threads(threads)
        for (auto y = 0; y < height; y++)
        {
            float* local_dstp = dstp + y * dst_pitch;
            const float* local_srcp = srcp + y * src_pitch;

            for (auto x = 0; x < row_size; x++)
            {
                local_dstp[x] = (float)(1.0f - local_srcp[x]);
            }
        }
    }
    else if (bits == 16 || bits == 14 || bits == 12 || bits == 10)
    {
        uint16_t max_pixel = (1 << bits) - 1;
        uint16_t* dstp = reinterpret_cast<uint16_t*>(_dstp);
        const uint16_t* srcp = reinterpret_cast<const uint16_t*>(_srcp);

        #pragma omp parallel for num_threads(threads)
        for (auto y = 0; y < height; y++)
        {
            uint16_t* local_dstp = dstp + y * dst_pitch;
            const uint16_t* local_srcp = srcp + y * src_pitch;

            for (auto x = 0; x < row_size; x++)
            {
                local_dstp[x] = (uint16_t)(local_srcp[x] ^ max_pixel);
            }
        }
    }
    else
    {
        uint8_t* dstp = reinterpret_cast<uint8_t*>(_dstp);
        const uint8_t* srcp = reinterpret_cast<const uint8_t*>(_srcp);

        #pragma omp parallel for num_threads(threads)
        for (auto y = 0; y < height; y++)
        {
            uint8_t* local_dstp = dstp + y * dst_pitch;
            const uint8_t* local_srcp = srcp + y * src_pitch;

            for (auto x = 0; x < row_size; x++)
            {
                local_dstp[x] = (uint8_t)(local_srcp[x] ^ 255);
            }
        }
    }
}

void Invert_AVX512(const unsigned char* _srcp, unsigned char* _dstp, int src_pitch, int dst_pitch, int height, int row_size, int bits, int threads)
{
    if (bits == 32)
    {
        float* dstp = reinterpret_cast<float*>(_dstp);
        const float* srcp = reinterpret_cast<const float*>(_srcp);

        #pragma omp parallel for num_threads(threads)
        for (auto y = 0; y < height; y++)
        {
            float* local_dstp = (float*)(dstp + y * dst_pitch);
            float* local_srcp = (float*)(srcp + y * src_pitch);

            __m512 vector_max_512 = _mm512_set1_ps(1.0f);
            auto row_size_mod64 = row_size - (row_size % 64);

            for (auto column = 0; column < row_size_mod64; column += 64)
            {
                __m512 vector_src_00 = _mm512_loadu_ps(local_srcp);
                __m512 vector_src_16 = _mm512_loadu_ps(local_srcp + 16);
                __m512 vector_src_32 = _mm512_loadu_ps(local_srcp + 32);
                __m512 vector_src_48 = _mm512_loadu_ps(local_srcp + 48);

                vector_src_00 = _mm512_sub_ps(vector_max_512, vector_src_00);
                vector_src_16 = _mm512_sub_ps(vector_max_512, vector_src_16);
                vector_src_32 = _mm512_sub_ps(vector_max_512, vector_src_32);
                vector_src_48 = _mm512_sub_ps(vector_max_512, vector_src_48);

                _mm512_storeu_ps(local_dstp, vector_src_00);
                _mm512_storeu_ps(local_dstp + 16, vector_src_16);
                _mm512_storeu_ps(local_dstp + 32, vector_src_32);
                _mm512_storeu_ps(local_dstp + 48, vector_src_48);

                local_srcp += 64;
                local_dstp += 64;
            }
            for (auto column = row_size_mod64; column < row_size; column++)
            {
                *local_dstp = (float)(1.0f - *local_srcp);
                local_dstp++;
                local_srcp++;
            }
        }
    }
    else if(bits == 16 || bits == 14 || bits == 12 || bits == 10)
    {
        uint16_t max_pixel = (1 << bits) - 1;
        uint16_t* dstp = reinterpret_cast<uint16_t*>(_dstp);
        const uint16_t* srcp = reinterpret_cast<const uint16_t*>(_srcp);

        #pragma omp parallel for num_threads(threads)
        for (auto y = 0; y < height; y++)
        {
            uint16_t* local_dstp = (uint16_t*)(dstp + y * dst_pitch);
            uint16_t* local_srcp = (uint16_t*)(srcp + y * src_pitch);

            __m512i vector_max_512 = _mm512_set1_epi16(max_pixel);
            auto row_size_mod64 = row_size - (row_size % 64);

            for (auto column = 0; column < row_size_mod64; column += 64)
            {
                __m512i vector_src_00 = _mm512_loadu_si512(local_srcp);
                __m512i vector_src_32 = _mm512_loadu_si512(local_srcp + 32);

                vector_src_00 = _mm512_subs_epu16(vector_max_512, vector_src_00);
                vector_src_32 = _mm512_subs_epu16(vector_max_512, vector_src_32);

                _mm512_storeu_si512(local_dstp, vector_src_00);
                _mm512_storeu_si512(local_dstp + 32, vector_src_32);

                local_srcp += 64;
                local_dstp += 64;
            }
            for (auto column = row_size_mod64; column < row_size; column++)
            {
                *local_dstp = (uint16_t)(*local_srcp ^ max_pixel);
                local_dstp++;
                local_srcp++;
            }
        }
    }
    else
    {
        uint8_t* dstp = reinterpret_cast<uint8_t*>(_dstp);
        const uint8_t* srcp = reinterpret_cast<const uint8_t*>(_srcp);

        #pragma omp parallel for num_threads(threads)
        for (auto y = 0; y < height; y++)
        {
            uint8_t* local_dstp = (uint8_t*)(dstp + y * dst_pitch);
            uint8_t* local_srcp = (uint8_t*)(srcp + y * src_pitch);

            __m512i vector_max_512 = _mm512_set1_epi8(255);
            auto row_size_mod64 = row_size - (row_size % 64);

            for (auto column = 0; column < row_size_mod64; column += 64)
            {
                __m512i vector_src_00 = _mm512_loadu_si512(local_srcp);

                vector_src_00 = _mm512_subs_epu8(vector_max_512, vector_src_00);

                _mm512_storeu_si512(local_dstp, vector_src_00);

                local_srcp += 64;
                local_dstp += 64;
            }
            for (auto column = row_size_mod64; column < row_size; column++)
            {
                *local_dstp = (uint8_t)(*local_srcp ^ 255);
                local_dstp++;
                local_srcp++;
            }
        }
    }
}
Old 28th January 2023, 22:36   #23  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,067
For integers:
Code:
            for (auto column = 0; column < row_size_mod64; column += 64)
            {
                __m512i vector_src_00 = _mm512_loadu_si512(local_srcp);
                __m512i vector_src_32 = _mm512_loadu_si512(local_srcp + 32);
If the program compiles and runs OK, do not reduce the usage of the register file, because that can visibly reduce performance. The unaligned load instruction has a significant startup (latency) penalty but good enough throughput: about 8 clock ticks of latency on Ice Lake and only 0.5 CPI throughput. That means that after the 8-clock delay, at least 2 dispatch ports for unaligned loads from memory (cache) can deliver 2 datawords of 512 bits per cycle. So the more load instructions you put in a sequence, the more of the startup latency you hide.

So for best performance it is recommended to use as much of the register file as possible: grab and process more data per loop pass. You can load up to 31 512-bit dataholders without overloading the register file of an AVX512 chip in x64 mode; the 32nd register is left for the first operand of your subtraction. It is many lines of text, but it may be visibly faster in execution.

Like:
Code:
// process 'big' frames here
auto row_size_mod992 = row_size - (row_size % 992);

            for (auto column = 0; column < row_size_mod992; column += 992)
            {
// load 31 'registers' of 512bits in x64 build or 15 in x86 build
                __m512i vector_src_00 = _mm512_loadu_si512(local_srcp);
                __m512i vector_src_32 = _mm512_loadu_si512(local_srcp + 32);
...
__m512i vector_src_(992-32) = _mm512_loadu_si512(local_srcp + ..);

// make 31 subtractions of 512bits
                vector_src_00 = _mm512_subs_epu16(vector_max_512, vector_src_00);
                vector_src_32 = _mm512_subs_epu16(vector_max_512, vector_src_32);
...
                vector_src_(992-32) = _mm512_subs_epu16(vector_max_512, vector_src_(992-32));

// store 31 datawords of 512bit
                _mm512_storeu_si512(local_dstp, vector_src_00);
                _mm512_storeu_si512(local_dstp + 32, vector_src_32);
...

                local_srcp += 992;
                local_dstp += 992;
}
// now, if the frame width is < 992 and/or too many residual columns are left, do smaller SIMD processing
auto row_size_mod128 = row_size - (row_size % 128);

            for (auto column = row_size_mod992; column < row_size_mod128; column += 128)
            {
// load 4 'registers' of 512bits
                __m512i vector_src_00 = _mm512_loadu_si512(local_srcp);
                __m512i vector_src_32 = _mm512_loadu_si512(local_srcp + 32);
...

// make 4 subtractions of 512bits
                vector_src_00 = _mm512_subs_epu16(vector_max_512, vector_src_00);
                vector_src_32 = _mm512_subs_epu16(vector_max_512, vector_src_32);
...
             
// store 4 datawords of 512bit
                _mm512_storeu_si512(local_dstp, vector_src_00);
                _mm512_storeu_si512(local_dstp + 32, vector_src_32);
...

                local_srcp += 128;
                local_dstp += 128;
}
// process the remaining groups of at least 32 columns with a single 512-bit op per pass
auto row_size_mod32 = row_size - (row_size % 32);
            for (auto column = row_size_mod128; column < row_size_mod32; column += 32)
            {
// load 1 'register' of 512bits
                __m512i vector_src_00 = _mm512_loadu_si512(local_srcp);
                vector_src_00 = _mm512_subs_epu16(vector_max_512, vector_src_00);
                _mm512_storeu_si512(local_dstp, vector_src_00);

                local_srcp += 32;
                local_dstp += 32;
}
// finally process residual columns as scalar C...
So a frame 3840 samples wide is processed with 3 passes of 992-sample loads, followed by 6 passes of 128-sample loads, and so on - not with 60 passes of a 64-samples-per-pass loop.

Last edited by DTL; 28th January 2023 at 22:48.
Old 28th January 2023, 22:49   #24  |  Link
Ceppo
Registered User
 
Join Date: Feb 2016
Location: Nonsense land
Posts: 339
OK, so my 128/256 mode was better than the 64 mode. But we have 720/640/320-width sources, so should I stick to 128 and 256, or do you advise going even wider?

Now I'm going to show you something ugly. For AVX, I couldn't come up with a workaround without using pure C++ code. So I thought, "let's convert to float and take half the pixels" - does this make sense (will it work / speed up the process)? Also, I'm not sure if (float*)(local_srcp) will give me half the number of pixels, or whether something evil (I'm the champion at this) will happen.

Code:
    else if (bits == 16 || bits == 14 || bits == 12 || bits == 10)
    {
        uint16_t max_pixel = (1 << bits) - 1;
        uint16_t* dstp = reinterpret_cast<uint16_t*>(_dstp);
        const uint16_t* srcp = reinterpret_cast<const uint16_t*>(_srcp);

        #pragma omp parallel for num_threads(threads)
        for (auto y = 0; y < height; y++)
        {
            uint16_t* local_dstp = (uint16_t*)(dstp + y * dst_pitch);
            uint16_t* local_srcp = (uint16_t*)(srcp + y * src_pitch);

            __m256 vector_max_256 = _mm256_set1_ps((float)max_pixel);
            auto row_size_mod16 = row_size - (row_size % 16);

            for (auto column = 0; column < row_size_mod16; column += 16)
            {
                //IF I CAST TO FLOAT* DO I GET HALF THE PIXELS OR SOMETHING UGLY?
                //MAKES SENSE TO CONVERT 16bit TO 32bit FOR THE LINEAR ALGEBRA?
                __m256 vector_src_00 = _mm256_load_ps((float*)(local_srcp));
                __m256 vector_src_08 = _mm256_load_ps((float*)(local_srcp + 8));

                vector_src_00 = _mm256_sub_ps(vector_max_256, vector_src_00);
                vector_src_08 = _mm256_sub_ps(vector_max_256, vector_src_08);

                _mm256_storeu_ps((float*)(local_dstp), vector_src_00);
                _mm256_storeu_ps((float*)(local_dstp + 8), vector_src_08);

                local_srcp += 16;
                local_dstp += 16;
            }
            for (auto column = row_size_mod16; column < row_size; column++)
            {
                *local_dstp = (uint16_t)(*local_srcp ^ max_pixel);
                local_dstp++;
                local_srcp++;
            }
        }
    }
I might be making no sense, but I'm not a programmer. (This is my excuse :P)

Last edited by Ceppo; 28th January 2023 at 22:52.
Old 28th January 2023, 22:59   #25  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,067
"do you advice even more?"

Most of the slowdown happens on something like UHDTV-sized frames. As you can see, even a small 4K frame width cannot fit into the 2 KByte register file of AVX512 in a single load sequence (32 registers x 64 bytes = 2048 bytes, while a single 3840-sample row of 32-bit floats is already 15360 bytes). And users may try 8K and more.

"For AVX, I couldn't came up with a work around without using pure C++ code."

For an AVX-only chip you can go down to SSE2 and still use 128-bit SIMD. 256-bit AVX with 32-bit floats would process the same amount of data per pass - only 8 floats per 'register'.
With SSE2 and 128-bit datawords you can load the same 8 16-bit samples per 'register' but save a lot of time on converting to float and back.

For SSE2 integers up to 16 bits use
_mm_loadu_si128()
_mm_subs_epu16() for 16-bit and _mm_subs_epu8() for 8-bit
_mm_storeu_si128()
and the __m128i type.
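A minimal sketch of that SSE2 path (my own illustration, reusing local_srcp, local_dstp and max_pixel from your 16-bit branch and omitting the scalar tail):
Code:
__m128i vector_max_128 = _mm_set1_epi16((short)max_pixel);
auto row_size_mod8 = row_size - (row_size % 8);

for (auto column = 0; column < row_size_mod8; column += 8)
{
    // load 8 uint16 samples, invert as max - src, store 8 samples
    __m128i vector_src = _mm_loadu_si128((const __m128i*)local_srcp);
    vector_src = _mm_subs_epu16(vector_max_128, vector_src);
    _mm_storeu_si128((__m128i*)local_dstp, vector_src);

    local_srcp += 8;
    local_dstp += 8;
}
// residual columns go to the scalar C loop as before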

Code:
__m256 vector_src_00 = _mm256_load_ps((float*)(local_srcp));
You cannot convert short integers to float at load time. You can only load the shorts as integers, widen them from 16-bit to 32-bit, and then convert integer to float with _mm256_cvtepi32_ps() (slow).
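A rough sketch of that sequence (my own illustration; note that the widening instruction _mm256_cvtepu16_epi32 is itself AVX2, so an AVX-only chip would have to widen with two SSE4.1 _mm_cvtepu16_epi32 calls instead):
Code:
// load 8 uint16 samples, widen to 32-bit integers, then convert to float
__m128i src_u16 = _mm_loadu_si128((const __m128i*)local_srcp);
__m256i src_i32 = _mm256_cvtepu16_epi32(src_u16);   // widen (AVX2 instruction)
__m256  src_f32 = _mm256_cvtepi32_ps(src_i32);      // int32 -> float (the slow part)
__m256  inv_f32 = _mm256_sub_ps(_mm256_set1_ps((float)max_pixel), src_f32);
// going back needs _mm256_cvtps_epi32 plus packing down to 16 bits again,
// which is why staying in 16-bit integer SIMD is faster here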

Last edited by DTL; 28th January 2023 at 23:13.
Old 29th January 2023, 00:18   #26  |  Link
Ceppo
Registered User
 
Join Date: Feb 2016
Location: Nonsense land
Posts: 339
DTL, thanks for all your help! All this new knowledge is getting me very excited! To get the best out of your help I'm going to re-read everything carefully and take notes like a good student, so that I can at least explain myself correctly and in more detail (and hopefully better understand all your input, because you might have noticed from my answers that I often fail to understand). I hope to fix my ignorance so that we can communicate better.

UPDATE: I have an Intel(R) Core(TM) i7-6500U CPU, which supports only up to SSE4.1, SSE4.2 and AVX2, so I can't debug the other optimizations. I'm still willing to write the AVX512 version if someone checks whether the DLL works as intended.

Last edited by Ceppo; 29th January 2023 at 01:01.
Old 29th January 2023, 01:37   #27  |  Link
kedautinh12
Registered User
 
Join Date: Jan 2018
Posts: 2,156
I think DTL already has an AVX-512 CPU.
Old 29th January 2023, 02:45   #28  |  Link
TDS
Formally known as .......
 
 
Join Date: Sep 2021
Location: Down Under.
Posts: 995
Quote:
Originally Posted by kedautinh12 View Post
I think DTL already has an AVX-512 CPU.
So do I, but I probably wouldn't be too much help
__________________
Long term RipBot264 user.

RipBot264 modded builds..
Old 29th January 2023, 05:20   #29  |  Link
Ceppo
Registered User
 
Join Date: Feb 2016
Location: Nonsense land
Posts: 339
Fully working HBD InvertNeg with SSE4 and AVX2 optimization

https://pastebin.com/enHEmuq9
Old 29th January 2023, 05:43   #30  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,067
Quote:
Originally Posted by Ceppo View Post
UPDATE: I have an Intel(R) Core(TM) i7-6500U CPU, which supports only up to SSE4.1, SSE4.2 and AVX2, so I can't debug the other optimizations. I'm still willing to write the AVX512 version if someone checks whether the DLL works as intended.
For Visual Studio versions up to 2017 you can install the Intel SDE and a Visual Studio add-on:
https://www.google.com/url?sa=t&rct=...ZxROFaLjARUV9U

Maybe you can also configure an SDE run of the debug build with newer versions of Visual Studio (not tried yet). And if Intel still provides SDE versions from 2022, it may integrate more simply with newer versions of Visual Studio without installing the add-on.

So you can develop and debug on a CPU without AVX512.

A simple test can be done with the standalone SDE install, to check whether the executable runs OK.

Quote:
Originally Posted by kedautinh12 View Post
I think DTL already has an AVX-512 CPU.
I have VS2017 with the SDE add-on installed on an SSE2 CPU, and it is enough for development. For speed tests I have access to AVX512 CPUs at work.

Last edited by DTL; 29th January 2023 at 05:52.
Old 29th January 2023, 06:40   #31  |  Link
StainlessS
HeartlessS Usurer
 
 
Join Date: Dec 2009
Location: Over the rainbow
Posts: 10,980
Thanks guys, I'm too stupid to understand most of it, and that is probably a good reason for this to be stickied,
so I [and others, maybe] can come back to it at some future date and try figure it all out.
(Not just me, but loadsa guys. [EDIT: 'Loadsa' = 'Loads of' = "lots of"]), and I'm guessin that I aint the only stupid guy here.
I'd rather hope that there are other stupids here, I dont wanna [EDIT: want to] be unique.

Thanks both of you, we are lovin' this.
Especial thanks to DTL, an artist at work, thank you.
__________________
I sometimes post sober.
StainlessS@MediaFire ::: AND/OR ::: StainlessS@SendSpace

"Some infinities are bigger than other infinities", but how many of them are infinitely bigger ???

Last edited by StainlessS; 29th January 2023 at 07:04.
Old 29th January 2023, 07:15   #32  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,067
Quote:
Originally Posted by Ceppo View Post
Fully working HBD InvertNeg with SSE4 and AVX2 optimization

https://pastebin.com/enHEmuq9
It is very good to add at least a 'single-register SIMD' intermediate stage after the 'massive full-registerfile SIMD' stage and before the pure C scalar ending, as described in https://forum.doom9.org/showthread.p...79#post1981979 .
In this version, frame widths below 112/224/448 will not be processed by the SIMD part at all. Also, a frame width of (448*2)-1, for example, will take a significant penalty: the first 448 columns are processed fast with a single SIMD pass, and the last 447 are slow scalar C with 447 loop spins.
For the smoothest performance over arbitrary frame widths it is good to have several SIMD stages, like 'full register file / half / maybe 1/4 / single dataword'. Invert, with its very simple subtraction, is a rather rare case - in most other processing the register file is used much more heavily for different variables and constants, so there is not as much space for different-size loads/stores.

Your SSE4 function does not look like it uses any instructions above SSE2, so you can lower the CPU requirement to SSE2 and use it on a wider range of CPUs (those having SSE2 but not SSE4).

You do not need to double-initialize the pointers now, like
Code:
        float* dstp = reinterpret_cast<float*>(_dstp);
        const float* srcp = reinterpret_cast<const float*>(_srcp);
 
        #pragma omp parallel for num_threads(threads)
        for (auto y = 0; y < height; y++)
        {
            float* local_dstp = (float*)(dstp + y * dst_pitch);
            float* local_srcp = (float*)(srcp + y * src_pitch);
you can initialize the per-thread local pointers directly from the function arguments, like
Code:
//        float* dstp = reinterpret_cast<float*>(_dstp);
//        const float* srcp = reinterpret_cast<const float*>(_srcp);
 
        #pragma omp parallel for num_threads(threads)
        for (auto y = 0; y < height; y++)
        {
            float* local_dstp = (float*)(reinterpret_cast<float*>(_dstp) + y * dst_pitch);
            float* local_srcp = (float*)(reinterpret_cast<float*>(_srcp) + y * src_pitch);
and save some lines of text.

Code:
if (!vi.IsY()) env->ThrowError("InvertNeg: Only Y8 input, sorry!");
A more correct error message would say Y-only: Y8 means Y-only with 8-bit samples specifically.

You can auto-detect the available CPU features with env->GetCPUFlags(), and if the user does not provide a cpu parameter (e.g. 0 = default), use the best available SIMD. For debugging or other reasons the user may lower the maximum CPU features with the SetMaxCPU() script command, and env->GetCPUFlags() will then return the flag set with the upper (or all) flags disabled.
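A minimal sketch of such a dispatch (my own illustration; Invert_AVX2 and Invert_SSE2 are placeholder names for your other paths, and the CPUF_* constants come from the AviSynth+ headers):
Code:
// hypothetical dispatch in the filter constructor; 'opt' is a user parameter, 0 = auto
const int cpuFlags = env->GetCPUFlags();

if (opt == 0)
{
    if (cpuFlags & CPUF_AVX512F)
        CoreFilterPtr = Invert_AVX512;
    else if (cpuFlags & CPUF_AVX2)
        CoreFilterPtr = Invert_AVX2;   // placeholder name
    else if (cpuFlags & CPUF_SSE2)
        CoreFilterPtr = Invert_SSE2;   // placeholder name
    else
        CoreFilterPtr = Invert;        // C reference
}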

Code:
 if (threads < 1) env->ThrowError("InvertNeg: threads must be >= 1!");
0 could mean 'auto' or 'all available', so only values < 0 would be invalid.
For a '0 = auto' implementation, here is an example from https://github.com/Asd-g/AviSynth-Ji...esize.cpp#L583 :
Code:
#include <thread>
    const int thr = std::thread::hardware_concurrency();
    if (threads_ == 0)
        threads_ = thr;
    else if (threads_ < 0 || threads_ > thr)
    {
        const std::string msg = ": threads must be between 0.." + std::to_string(thr) + ".";
        env->ThrowError(msg.c_str());
    }

Last edited by DTL; 29th January 2023 at 07:59.
Old 29th January 2023, 07:44   #33  |  Link
Reel.Deel
Registered User
 
Join Date: Mar 2012
Location: Texas
Posts: 1,666
Quote:
Originally Posted by StainlessS View Post
Thanks guys, I'm too stupid to understand most of it, and that is probably a good reason for this to be stickied,
so I [and others, maybe] can come back to it at some future date and try figure it all out.
(Not just me, but loadsa guys. [EDIT: 'Loadsa' = 'Loads of' = "lots of"]), and I'm guessin that I aint the only stupid guy here.
I'd rather hope that there are other stupids here, I dont wanna [EDIT: want to] be unique.

Thanks both of you, we are lovin' this.
Especial thanks to DTL, an artist at work, thank you.
I added a link to this page on the InvertNeg wiki page: http://avisynth.nl/index.php/Filter_...nd_HDB_Support

Don't worry StainlessS, you're not alone. I'm right there beside you, I'm probably even a few more levels of stupid
Old 29th January 2023, 15:56   #34  |  Link
Ceppo
Registered User
 
Join Date: Feb 2016
Location: Nonsense land
Posts: 339
DTL, I'm going to update it as soon as possible, and figure out this emulation stuff. BTW, I have an NVIDIA GeForce GTX 950M - can we do a GPU InvertNeg, for example, as a reference?
Old 29th January 2023, 15:59   #35  |  Link
kedautinh12
Registered User
 
Join Date: Jan 2018
Posts: 2,156
NVIDIA needs a CUDA version for speed.
Old 29th January 2023, 16:28   #36  |  Link
Reel.Deel
Registered User
 
Join Date: Mar 2012
Location: Texas
Posts: 1,666
Quote:
Originally Posted by Ceppo View Post
... BTW, I have an NVIDIA GeForce GTX 950M - can we do a GPU InvertNeg, for example, as a reference?
Here's InvertNegCUDA by nekopanda: https://github.com/nekopanda/AviSynthPlusCUDASample

BUT, it's an example of Nekopanda's interface which is not compatible with current AviSynth+. They used to be compatible some versions ago but I don't know why they're not anymore. Here's a bit more info: https://github.com/AviSynth/AviSynthPlus/issues/296
Old 29th January 2023, 17:22   #37  |  Link
Ceppo
Registered User
 
Join Date: Feb 2016
Location: Nonsense land
Posts: 339
Well, FFT3DGPU works on my graphics card. So, as far as I understand, it's a matter of how you make the GPU communicate with AviSynth+... I will look into the FFT3DGPU code to see if I can figure something out...

UPDATE: NOPE, I don't understand pinterf's code; he's too much of a PROgrammer for me.

Last edited by Ceppo; 29th January 2023 at 17:25.
Old 29th January 2023, 17:25   #38  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,067
Quote:
Originally Posted by Ceppo View Post
I have an NVIDIA GeForce GTX 950M - can we do a GPU InvertNeg, for example, as a reference?
I only have experience with DirectX 12 and compute shaders for processing. The compute shader itself for inverting is very simple, but the DX12 resource init and the data upload/download to/from the hardware accelerator are not very simple.

You can look into MAnalyse: https://github.com/DTL2020/mvtools/b.../MVAnalyse.cpp

The DX12 init is in the function https://github.com/DTL2020/mvtools/b...lyse.cpp#L1600 (2400 - 1600 = about 800 lines of C; for 2 textures and 1 shader it may be about 400..500 lines).

And GetFrame() usage at https://github.com/DTL2020/mvtools/b...alyse.cpp#L812

For an Invert example you need only 2 texture resources (for upload and download) and 1 compute shader resource. The shader is a simple C-like function run for each sample: https://github.com/DTL2020/mvtools/b...s/Compute.hlsl

All the headers required for DX12 are located in the DX12_ME ifdef, so they are easy to see in https://github.com/DTL2020/mvtools/b...es/MVAnalyse.h

Like:
Code:
#if defined _WIN32 && defined DX12_ME

#include <initguid.h>
#include <d3d12.h>
#include <dxgi1_6.h>
#include <D3Dcompiler.h>
#include <DirectXMath.h>
#include "d3dx12.h"
#include "d3d12video.h"
#include "DirectXHelpers.h"
#include "ReadData.h"
#include "DescriptorHeap.h"

#include <string>
#include <wrl.h>
#include <shellapi.h>

using Microsoft::WRL::ComPtr;
using namespace DirectX;

#endif
for includes.

Last edited by DTL; 29th January 2023 at 17:32.
Old 29th January 2023, 20:00   #39  |  Link
Ceppo
Registered User
 
Join Date: Feb 2016
Location: Nonsense land
Posts: 339
I wrote the code for the AVX512F version, with the progressive reduction of the number of registers; however, when I debug I get "illegal instruction". I don't understand whether I installed the Intel emulator incorrectly, whether it is not compatible with MSVS 2022, or whether the code is bugged. Here is the code:
https://pastebin.com/3hXP4Q9x
Old 29th January 2023, 20:06   #40  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,067
SDE typically installs as a standalone tool, and it may have an add-on for Visual Studio. After installing the add-on you configure it with the path to the SDE installation (or maybe the add-on tries to do that automatically). Once the SDE add-on is installed in VS, an "SDE Debugger" debug startup becomes available (a new item in the dropdown list next to Local Windows Debugger, Remote Windows Debugger and the others).
If you install standalone SDE and do not have the add-on for VS2022, you can try to start the debug executable via the SDE command line.

At minimum, SDE should be able to run the compiled executable from the command line with all possible emulations enabled (the default?) without an 'illegal instruction' error. Once it runs OK as a standalone process, try to launch it from the VS IDE in debug mode.

Code:
            auto n = 8;
            auto row_size_rst = row_size % (n*30);
            auto row_size_mod = row_size - row_size_rst;
 
            __m512 vector_max = _mm512_set1_ps(1.0f);
 
            for (auto column = 0; column < row_size_mod; column += (n * 30))
            {
                __m512 vector_src_00 = _mm512_loadu_ps(local_srcp + n * 0);
                __m512 vector_src_01 = _mm512_loadu_ps(local_srcp + n * 1);
For AVX512, n should be 16 - a 512-bit dataword holds 16 floats of 32 bits each.

Also
Code:
            auto row_size_rst = row_size % (n*30);
            for (auto column = 0; column < row_size_mod; column += (n * 30))
the last load - __m512 vector_src_30 = _mm512_loadu_ps(local_srcp + n * 30); - is the 31st loaded dataword, counting from 0.
If you load 31 datawords per pass, then the loop step should be 31 and row_size_rst = row_size % (n*31);. The same correction (add 1) applies to all the other loop steps, to the row_size_rst and _mod calculations, and to the advancing of the local pointers.

So the correct version for 4 datawords per pass should be
Code:
            row_size_mod = row_size_rst - (row_size_rst % (n * 4));
            row_size_rst = row_size_rst % (n * 4);
 
            for (auto column = 0; column < row_size_mod; column += (n * 4))
            {
                __m512 vector_src_00 = _mm512_loadu_ps(local_srcp + n * 0);
                __m512 vector_src_01 = _mm512_loadu_ps(local_srcp + n * 1);
                __m512 vector_src_02 = _mm512_loadu_ps(local_srcp + n * 2);
                __m512 vector_src_03 = _mm512_loadu_ps(local_srcp + n * 3);
 
                vector_src_00 = _mm512_sub_ps(vector_max, vector_src_00);
                vector_src_01 = _mm512_sub_ps(vector_max, vector_src_01);
                vector_src_02 = _mm512_sub_ps(vector_max, vector_src_02);
                vector_src_03 = _mm512_sub_ps(vector_max, vector_src_03);
 
                _mm512_storeu_ps(local_dstp + n * 0, vector_src_00);
                _mm512_storeu_ps(local_dstp + n * 1, vector_src_01);
                _mm512_storeu_ps(local_dstp + n * 2, vector_src_02);
                _mm512_storeu_ps(local_dstp + n * 3, vector_src_03);
 
                local_srcp += (n * 4);
                local_dstp += (n * 4);
            }
"with the progressive reduction of the number of registers"

I think it is better to skip some 'intermediate' steps but make the last step a single SIMD dataword. With your current last AVX512 SIMD step of 4 datawords (each of 16 samples), you leave up to 63 trailing samples to scalar C processing. Some more perfectionist SIMD programmers even add finer-granularity SIMD steps below the 'main dataword width' when available, like:
Code:
// for AVX512 and float32 samples
// last full-dataword step = 1 and 16 samples per pass
// now use AVX256 (guaranteed supported by AVX512 chip) and process 8 samples per pass
// now use SSE128 (guaranteed supported by AVX512 chip) and process 4 samples per pass
// now process up to 3 residual samples with C-scalar program
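A sketch of what that cascade could look like for the float32 case (my own illustration, assuming 'column' is declared before the loops so it carries over from the 16-sample stage, with local_srcp/local_dstp already advanced accordingly):
Code:
// AVX stage: 8 floats per pass (any AVX512 chip also supports AVX and SSE)
__m256 vector_max_256 = _mm256_set1_ps(1.0f);
for (; column + 8 <= row_size; column += 8)
{
    __m256 v = _mm256_loadu_ps(local_srcp);
    _mm256_storeu_ps(local_dstp, _mm256_sub_ps(vector_max_256, v));
    local_srcp += 8;
    local_dstp += 8;
}

// SSE stage: 4 floats per pass
__m128 vector_max_128 = _mm_set1_ps(1.0f);
for (; column + 4 <= row_size; column += 4)
{
    __m128 v = _mm_loadu_ps(local_srcp);
    _mm_storeu_ps(local_dstp, _mm_sub_ps(vector_max_128, v));
    local_srcp += 4;
    local_dstp += 4;
}

// scalar C tail: at most 3 samples left
for (; column < row_size; column++)
{
    *local_dstp++ = 1.0f - *local_srcp++;
}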

Last edited by DTL; 29th January 2023 at 20:36.