Doom9's Forum > Capturing and Editing Video > Avisynth Usage

Old 27th November 2021, 23:53   #21  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,068
Not so bad for AVX512 on desktop CPUs - Intel Rocket Lake chips (Q1 2021) finally have it. Cnews did some testing of x265 AVX512 performance - https://www.google.com/amp/s/www.hwc...-help/%3famp=1
The speedup is still quite small, but it looks like AVX512 development in x265 was slow because very few end users had AVX512-capable chips. I hope to test AVX512 versions of some functions in MAnalyse and MDegrainN in mvtools soon on an i5-11500.
AVX512 has both scatter and gather instructions, while AVX2 has only gather. And MDegrainN currently has very poor memory performance - maybe scatter/gather for block stores/loads will help.

Last edited by DTL; 27th November 2021 at 23:56.
Old 28th November 2021, 00:40   #22  |  Link
orion44
None
 
 
Join Date: Jul 2007
Location: The Background
Posts: 307
I read somewhere that modern optimizing C/C++ compilers are able to beat hand-coded assembly in something like 99% of cases.

Is this true in the area of video codec and video filter development?
Old 28th November 2021, 09:49   #23  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,068
The speedup is not really about compiler-generated vs hand-written assembly - it is mostly about compiler vs human "vectorization". Since Intel keeps pursuing the idea of a "universal logical core + built-in large vector co-processor", moving-picture data processing has to be offloaded to that large vector co-processor, and that requires rearranging data structures and computations in a SIMD-friendly way. Compilers are still very poor at producing "auto-vectorized" results: too much intelligence is required for the significant program rewrite needed to map the work onto large SIMD execution units. Vectorization may also sometimes change the numerical result, which optimizing compilers are forbidden to do. So in today's work in progress for mvtools, the new SIMD processing is kept separate from the old C version to preserve exact output compatibility with old versions.

Also, about non-temporal stores: they require both write-combining of stores up to the cache-line size (typically 64 bytes) and runtime knowledge of whether streaming will actually help total execution time, which depends on the task size and the cache size of the current CPU. Profile-guided optimization with the Intel compiler can possibly help with this problem somehow, but it may help for the developer's task and CPU while not helping on the user's side with a different task and CPU.

Also, as that testing shows, the AVX512 vector unit is not only 4 times larger in register-file size and twice as wide in bus width compared with AVX2, but also very visibly power-hungry. So two different branches of general-purpose CPUs have now emerged: Intel more or less continues to grow the size and performance of the large vector co-processor next to each general-purpose core, while AMD increases the number of general-purpose cores with AVX2 at most. So software now needs to be optimized for a low thread count with AVX512 vector processing on Intel, and for a high thread count with AVX2 on AMD. AMD looks like the winner on the desktop, because it is much harder to rewrite software for AVX512 than it is to increase the thread count.

Last edited by DTL; 28th November 2021 at 10:12.
Old 3rd December 2021, 02:22   #24  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,068
Heh - the magic of the AVX512 era: SAD of an 8x8 8-bit block:

Code:
// pSrc/pRef are const uint8_t*; each zmm lane must load one whole
// 8-byte row, so the pointers are cast to 64-bit - a plain *pSrc
// would fetch only a single pixel.
__m512i zmm_src = _mm512_set_epi64(
    *(const int64_t*)(pSrc + nSrcPitch * 7), *(const int64_t*)(pSrc + nSrcPitch * 6),
    *(const int64_t*)(pSrc + nSrcPitch * 5), *(const int64_t*)(pSrc + nSrcPitch * 4),
    *(const int64_t*)(pSrc + nSrcPitch * 3), *(const int64_t*)(pSrc + nSrcPitch * 2),
    *(const int64_t*)(pSrc + nSrcPitch * 1), *(const int64_t*)(pSrc + nSrcPitch * 0));

__m512i zmm_ref = _mm512_set_epi64(
    *(const int64_t*)(pRef + nRefPitch * 7), *(const int64_t*)(pRef + nRefPitch * 6),
    *(const int64_t*)(pRef + nRefPitch * 5), *(const int64_t*)(pRef + nRefPitch * 4),
    *(const int64_t*)(pRef + nRefPitch * 3), *(const int64_t*)(pRef + nRefPitch * 2),
    *(const int64_t*)(pRef + nRefPitch * 1), *(const int64_t*)(pRef + nRefPitch * 0));

// _mm512_sad_epu8 yields eight per-row 64-bit SADs; reduce_add sums them.
return _mm512_reduce_add_epi64(_mm512_sad_epu8(zmm_src, zmm_ref));
It can even fit on one line: reduce_add(sad(set(src), set(ref))); - three 'macro' intrinsics and one 'real' instruction. How the macros expand is compiler-dependent and may well improve with future compiler versions and target chips.
In the poor old scalar times this was 64 subtractions + 64 abs + dozens of additions.
Old 13th January 2022, 00:49   #25  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,068
For really high-speed motion-picture data processing, using the SIMD co-processor in the host CPU already started to look outdated back in the 2010s. The better way now is compute shaders executed by a separate HW accelerator in the system (typically with dedicated memory of much higher speed - from GDDR to HBM). That model is much more 2D-processing-oriented and saves the user from lots of manual handling of sample scan order and memory management, and from mapping 2D processing onto SIMD co-processor capabilities. It is also naturally massively multithreaded inside a 'frame' for free - no need to deal with MT in the OS. So the shader developer only implements the actual data processing and is freed from the overhead typical of a multithreaded program on a general-purpose CPU.
In hardware it has been supported since roughly DirectX 10.x and is more advanced in DX12.
APIs exist in both DirectX and OpenGL; the OpenGL path is possibly cross-platform compatible with Linux too. Some description for OpenGL: https://www.khronos.org/opengl/wiki/Compute_Shader
So for interfacing Avisynth with compute-shader processing, what is currently required is resource upload to the HW accelerator, result download, and setup of a 'compute pipeline' (more or less smaller in complexity than a full 3D rendering pipeline) - https://docs.microsoft.com/en-us/win...ith-directx-12 . A 'universal' CS-plugin SDK with a 'Hello ComputeShader' sample could be created to start with.
The more operations executed on the HW accelerator, the less the overhead of uploading/downloading data between host memory and the accelerator. The biggest benefit is expected for memory-bound operations like massive multi-frame denoisers.
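To give a flavour of the model the Khronos page describes, here is a minimal GLSL compute-shader sketch (not runnable without a GL context and dispatch code): one shader invocation per pixel, with the binding points, the r8 image format, and the 16x16 local size all being arbitrary choices for illustration.

```glsl
// Hypothetical sketch: invert the luma plane of an 8-bit image.
#version 430
layout(local_size_x = 16, local_size_y = 16) in;
layout(binding = 0, r8) readonly  uniform image2D srcImage;
layout(binding = 1, r8) writeonly uniform image2D dstImage;

void main() {
    // One invocation per pixel - the massive multithreading is free,
    // no OS-level thread management involved.
    ivec2 p = ivec2(gl_GlobalInvocationID.xy);
    if (p.x >= imageSize(dstImage).x || p.y >= imageSize(dstImage).y)
        return; // guard for dispatch sizes not divisible by 16
    vec4 v = imageLoad(srcImage, p);
    imageStore(dstImage, p, vec4(1.0 - v.r));
}
```

The host side would upload the frame, bind both images, call glDispatchCompute with the frame size divided by the local size (rounded up), and read the result back.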