Development of BitBlt_SSE2_avs and memcpySSE2_avs

Although I have only tested BitBlt_SSE2_avs and memcpySSE2_avs within a few plugins of my own,
I guess they could be used in any other project that needs them. Warning! They are still under heavy
development; consider this a draft, with no guarantee of being free of bugs.
This new bitblt has not been tested as a substitute for the internal one in avisynth, so
I can't guarantee full compatibility.

Bitblt is used very frequently in avisynth to copy a clip's planes from one buffer to another. The difference
from a traditional memcpy library is that bitblt copies a sequence of rows, line by line,
avoiding copying the modulo (pitch - row_size).
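
As a reference for what such a routine does, a minimal scalar sketch of a pitch-aware plane copy could
look like this (plain C++, not the optimized code in this project):

#include <string.h>

// Reference plane copy: copies row_size bytes per line and skips the
// padding (pitch - row_size) that each buffer may have after every row.
void BitBlt_reference(void *dstp, int dst_pitch, const void *srcp, int src_pitch,
                      int row_size, int height)
{
    unsigned char *dst = (unsigned char *)dstp;
    const unsigned char *src = (const unsigned char *)srcp;

    if (dst_pitch == src_pitch && src_pitch == row_size) {
        // No padding on either side: the plane is contiguous in memory,
        // so a single memcpy of the whole block is enough.
        memcpy(dst, src, (size_t)row_size * height);
        return;
    }
    for (int y = 0; y < height; y++) {
        memcpy(dst, src, (size_t)row_size);
        dst += dst_pitch;
        src += src_pitch;
    }
}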

This version actually selects between two ways of copying, depending on the conditions: the whole plane
at once, like memcpy, or line by line, like the internal avisynth bitblt. The choice is almost the same
as in avisynth, except that in some cases we take another path to profit from memcpy's better performance.

C++ prototypes:

extern "C" {
void BitBlt_SSE2_avs(void *dstp, int dst_pitch, const void *srcp, int src_pitch,
                     int row_size, int height);

void memcpySSE2_avs(void *dest, const void *src, size_t count);
}
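
For illustration, this is roughly how a plugin could call the routine to copy the luma plane of a frame
(a minimal sketch against the standard avisynth plugin interface, assuming avisynth.h is included and the
prototypes above are declared; the helper name is hypothetical):

// Hypothetical helper inside a plugin: copy plane Y from src to dst
// using the external routine instead of env->BitBlt().
void CopyLumaPlane(PVideoFrame &dst, PVideoFrame &src)
{
    BitBlt_SSE2_avs(dst->GetWritePtr(PLANAR_Y), dst->GetPitch(PLANAR_Y),
                    src->GetReadPtr(PLANAR_Y), src->GetPitch(PLANAR_Y),
                    src->GetRowSize(PLANAR_Y), src->GetHeight(PLANAR_Y));
}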

They make use of SSE2, SSE3, SSSE3 and AVX instructions when possible and appropriate.

One of the most difficult aspects of a bitblt routine (or any other memcpy) is the cache bypass;
in other words, how to decide when to use a temporal (tpa) versus a non-temporal (nta) store.

In the avisynth context there is another challenge: previous conditions in the chain of filters and memory usage
can lead to variations in the effective buffer size which bitblt cannot know about and which are probably unpredictable.
For example, if you bitblt a chroma plane in YV12 but you are working with both chromas and the luma as well,
calculating the whole size of the frame in memory would require adding the luma, chroma U and chroma V sizes,
but we cannot pass such parameters to bitblt without breaking backward compatibility.

To work around that limitation, I provide several external methods which give bitblt a hint,
or adjust the cache bypass values, on whether it should use temporal or non-temporal stores.

Another special feature of this implementation is the automatic estimation of the optimal percentage of cache
that should be used for temporal stores. According to my tests, the smaller the cache, the lower the percentage
we should use.

That required a lot of empirical testing because I was not able to derive a generic formula. Some cases are
estimated and might need fine-tuning. You can see the approximations used in TableByPass (near the bottom)
of bitbltsse2_avs.asm. Besides that, I must warn that with caches of 1 MB or less this limit is very critical
under avisynth; it was impossible to establish an exact value, so once more it is a trade-off.

I must also warn that all these cache bypass values are only valid under the avisynth environment.
Under other environments it is almost always advisable to use 50% of the highest level cache for memcpy,
except on some new ultra-low-power laptops (this needs research).

Finally, if you want to force or help bitblt's decision on how to store data, you can hint it with:

DoNta_Oneline();    Internally sets a value different from -1; returns nothing, it just
                    sets a variable to hint that nta instructions should be used.
                    It forces bitblt to copy one line with a non-temporal store.
                    It can also be used to hint memcpySSE2_avs for the same purpose.
                    When count is 8 or less, the copy will not use nta; the external
                    request will be ignored.

DoNta_RowLoop();    Internally sets a value different from -1; returns nothing, it just
                    sets a variable to hint that nta instructions should be used.
                    It forces bitblt to do a row-by-row bitblt with non-temporal stores.

Pass_sizeframe(int);    Passes the sum of the clip's plane sizes when appropriate and
                        necessary.
                        It tells bitblt the real frame size that is loaded in memory.
                        It can also be used to inform memcpySSE2_avs of the real frame
                        size in some conditions.

Set_SizeByPass(int ByPassTpaOrNta, int BypassPreLdNta);

                    Passes values for the two bypass thresholds:
                    1st) to select between temporal and non-temporal stores;
                    2nd) to decide whether or not to preload the source block
                    before a non-temporal block store.
                    The second one could be dropped once we have gathered enough
                    information from several machines to fix those values.
                    It seems that on machines with caches bigger than 1 MB, block
                    preload does not help at all. If you use it, set it to -1 if you
                    don't want block techniques to be applied.
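
As a hypothetical example of the YV12 case mentioned earlier (the declarations below are my assumptions
based on the descriptions above, with void return types; whether Pass_sizeframe has to be repeated before
every blit is not specified here, this only shows the idea):

// Assumed C declarations for the hint functions described above.
extern "C" {
void DoNta_Oneline();
void DoNta_RowLoop();
void Pass_sizeframe(int size);
void Set_SizeByPass(int ByPassTpaOrNta, int BypassPreLdNta);
}

// Hypothetical helper: copy a whole YV12 frame, first hinting bitblt with
// the real total frame size (luma + chroma U + chroma V).
void CopyYV12(unsigned char *dstY, const unsigned char *srcY, int dst_pitchY, int src_pitchY,
              int row_sizeY, int heightY,
              unsigned char *dstU, unsigned char *dstV,
              const unsigned char *srcU, const unsigned char *srcV,
              int dst_pitchUV, int src_pitchUV, int row_sizeUV, int heightUV)
{
    // Total bytes the frame really occupies, not just the plane being copied.
    int frame_size = row_sizeY * heightY + 2 * row_sizeUV * heightUV;
    Pass_sizeframe(frame_size);

    BitBlt_SSE2_avs(dstY, dst_pitchY, srcY, src_pitchY, row_sizeY, heightY);
    BitBlt_SSE2_avs(dstU, dst_pitchUV, srcU, src_pitchUV, row_sizeUV, heightUV);
    BitBlt_SSE2_avs(dstV, dst_pitchUV, srcV, src_pitchUV, row_sizeUV, heightUV);
}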

You can also let bitblt decide on its own (with normal parameters), but you should keep in mind that, in some cases,
it might use the wrong strategy and lead to sub-optimal performance.

Another aspect worth reviewing is the number of branches.
This implementation has three main branches:

a) When you call it with height=1, which is really similar to memcpy. This branch
can also be chosen internally when src_pitch == dst_pitch == row_size.
This branch in turn has two sub-branches, a temporal write and a non-temporal write to memory.
Also, on some (old) machines, when the write is non-temporal and count is bigger than
the largest available cache, a backward row-loop copy is done because it is faster.

b) When the total frame size is below the cache bypass threshold chosen for the cpu on which it
is running, a temporal row loop will be used to copy the frame.

c) When the total frame size is bigger than the cache bypass threshold chosen for the cpu on which it
is running, a non-temporal row loop will be used to copy the frame.
Within this option there is also a branch on a second bypass value,
which decides whether or not to do a src block preload before the nta block store:
over that value it does a source preload before the nta block store, below it
it doesn't use block techniques. On the newest cpus, with large caches bigger than 1 MB and better
hardware prediction (and automatic prefetching), block preload is not necessary.
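
In rough outline, the selection between the three main branches described above could be sketched like this
(an illustrative C++ sketch of the decision only, not the actual asm; the threshold parameter names are
invented for clarity):

#include <cstddef>

enum CopyStrategy {
    CONTIGUOUS,            // branch a): essentially memcpy of the whole block
    TEMPORAL_ROW_LOOP,     // branch b): row loop with temporal stores
    NTA_ROW_LOOP,          // branch c): row loop with non-temporal stores
    NTA_ROW_LOOP_PRELOAD   // branch c) with a src block preload before the nta store
};

CopyStrategy ChooseStrategy(int dst_pitch, int src_pitch, int row_size, int height,
                            size_t total_frame_size,
                            size_t bypass_tpa_nta, size_t bypass_preload)
{
    if (height == 1 || (dst_pitch == src_pitch && src_pitch == row_size))
        return CONTIGUOUS;
    if (total_frame_size <= bypass_tpa_nta)
        return TEMPORAL_ROW_LOOP;
    if (total_frame_size > bypass_preload)
        return NTA_ROW_LOOP_PRELOAD;
    return NTA_ROW_LOOP;
}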

Besides that, every branch has two sub-branches:

1) when both pointers are 16-byte or 32-byte (AVX) aligned
2) when only dst is 16-byte or 32-byte aligned and the source is misaligned

The problem is when the source is not 16-byte or 32-byte (AVX) aligned. Again we have to know
which cpu it is running on and, accordingly, we use different methods to read unaligned data,
which are:

a) old sse2-capable cpus, where we use two partial qword reads for each oword.

b) some intel and amd sse3-capable machines, where we use lddqu to load unaligned data.

c) offset (aligned) reads of the unaligned data, plus shifts, plus por to combine.

d) offset (aligned) reads of the unaligned data, plus palignr (ssse3 cpus) to shift and combine.

e) reading the unaligned data with movdqu or movups (some newer cpus), because on some
new cpus that is, luckily, nowadays the fastest way to load misaligned data.

f) 256-bit AVX read and write instructions, vmovups etc., are also included.
I was finally able to test them and the performance increase is great.
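
For instance, method e) combined with a non-temporal store corresponds roughly to the following intrinsics
sketch (a simplified illustration, not the asm in this project; it assumes dst is 16-byte aligned and
row_size is a multiple of 16):

#include <emmintrin.h>   // SSE2 intrinsics

// Copy one row whose source may be misaligned: unaligned loads (movdqu)
// plus non-temporal stores (movntdq) to a 16-byte aligned destination.
static void copy_row_unaligned_src_nta(unsigned char *dst, const unsigned char *src,
                                       int row_size)
{
    for (int x = 0; x < row_size; x += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + x)); // movdqu
        _mm_stream_si128((__m128i *)(dst + x), v);               // movntdq
    }
    _mm_sfence();  // make the non-temporal stores visible to other agents
}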

(Some new cpus have REP MOVSB etc. fully optimized, but unluckily not yet for misaligned data.)
My small haswell laptop, which has ERMSB (Enhanced REP MOVSB), shows excellent performance with it, but only
for sizes below the cache bypass value, i.e. with temporal stores; for non-temporal stores the routines
with vmovntps are faster.

Why has all of that changed so much? Well, you have to ask the hardware engineers.
It is beyond my capabilities.

All these branches plus the cpu identification have a price: when you blit a small buffer (below approx. 16 KB),
the overhead offsets some of the optimization speed-up; but nowadays, with HD clips, I think this is a good
trade-off. In fact, there are plenty of trade-offs in all this code, trying to make it as compatible
and universal as possible for many architectures.

However, and for that reason as well, I have also included:

memcpySSE2_avs(void *dest, const void *src, size_t count);

It is almost fully compatible with memcpy, except that it doesn't check for overlap.
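
In other words (a hypothetical wrapper for illustration), it is only safe when the two ranges cannot
overlap; if they might, fall back to memmove:

#include <string.h>

extern "C" void memcpySSE2_avs(void *dest, const void *src, size_t count);

// Use memcpySSE2_avs only for non-overlapping ranges; it does not handle
// overlap the way memmove does.
void copy_no_overlap(void *dest, const void *src, size_t count)
{
    const char *d = (const char *)dest;
    const char *s = (const char *)src;
    if (d < s + count && s < d + count)
        memmove(dest, src, count);      // ranges overlap: take the safe path
    else
        memcpySSE2_avs(dest, src, count);
}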

Warning! This memcpySSE2_avs was made to work under avisynth; in other environments it could be slower
than other implementations, as a consequence of the cache bypass values. In fact, it was only tested
within a few plugins of my own, never as a substitute for the internal bitblt in avisynth.

As I've said in my previous post, none of these branches and options could have been done without
Agner Fog's libraries, code and papers.
This project includes four files from Agner Fog's libraries, cachesize32.asm, cputype32.asm,
instrset32.asm and unalignedfaster32.asm, plus some slightly modified subroutines from memcpy32.asm.
You can find them at http://www.agner.org/optimize/asmlib.zip

There are many parts of this code that have never been tested in real life, so your help tracking bugs
would be greatly appreciated.

My maybe-TODO list:
Track and eliminate bugs.
Look for inconsistencies or misconceptions regarding the performance on different cpus;
in other words, in most cases, adjust the bypass values.
Delete some checks that are probably redundant and excessive.
Reduce the cpu id routines.
Redesign and start everything again from scratch.
Let's see when I feel like doing it. Probably never.

Too long a post, sorry!
As always, I hope you find something useful. Arda.
