Go Back   Doom9's Forum > Capturing and Editing Video > Avisynth Development

Old 2nd September 2015, 18:44   #1  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
Development of BitBlt_SSE2_avs and memcpySSE2_avs

Although I've only tested BitBlt_SSE2_avs and memcpySSE2_avs within a few plugins of my own use,
I guess they could be used by any other projects in need. Warning! It is still under heavy development;
consider it a draft, with no guarantee of being free of bugs.
This new bitblt has not been tested as a substitute for the internal one in avisynth, so
I can't guarantee full compatibility.

Bitblt is used very frequently in avisynth to copy a clip's planes from one buffer to another. The difference
from a traditional memcpy library is that bitblt does the copy as a sequence of rows, line by line,
avoiding copying the padding between rows (pitch - row_size).

This version actually selects between two ways of copying, depending on the conditions: the whole plane at once,
like memcpy, or line by line, like the internal avisynth bitblt. The choice is almost the same as in avisynth, except
that in some cases we take another route to profit from memcpy's better performance.
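The two strategies described above can be sketched in plain C++ (a reference version for illustration only, nothing like the actual hand-tuned SSE2 code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reference of the two copy strategies: when both pitches equal the row
// size the plane is contiguous in memory, so one memcpy over the whole
// block works; otherwise fall back to a per-row loop that skips the
// (pitch - row_size) padding after each row.
static void bitblt_reference(void* dstp, int dst_pitch,
                             const void* srcp, int src_pitch,
                             int row_size, int height)
{
    if (src_pitch == row_size && dst_pitch == row_size) {
        // Contiguous case: a single big memcpy, like the height==1 branch.
        std::memcpy(dstp, srcp, static_cast<size_t>(row_size) * height);
        return;
    }
    auto* d = static_cast<uint8_t*>(dstp);
    auto* s = static_cast<const uint8_t*>(srcp);
    for (int y = 0; y < height; ++y) {   // row loop: copy row_size bytes,
        std::memcpy(d, s, row_size);     // then advance by the full pitch
        d += dst_pitch;
        s += src_pitch;
    }
}
```

The real implementation replaces both memcpy calls with dispatched SSE2/AVX routines; the branch condition is the same.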

C++ prototype:
extern "C"

void BitBlt_SSE2_avs (void * dstp, int dst_pitch, const void * srcp, int src_pitch,
int row_size, int height);

void memcpySSE2_avs (void * dest, const void * src, size_t count);

They make use of SSE2, SSE3, SSSE3 and AVX instructions when possible and appropriate.

One of the most difficult aspects of a bitblt routine (or any other memcpy) is the cache bypass;
in other words, how to decide when to use a temporal store (tpa) versus a non-temporal store (nta).

In the avisynth context there is another challenge: previous conditions in the chain of filters and memory usage
can lead to variations in the effective buffer size, which bitblt cannot know about and which are probably unpredictable.
For example, if you bitblt a chroma plane in YV12 but you are working with both chromas and the luma as well, then in order
to calculate the whole frame size in memory we would need to add the luma, chroma U and chroma V sizes, but we cannot pass
such parameters to bitblt without breaking backward compatibility.
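For illustration, the YV12 arithmetic that the paragraph refers to looks like this (function name is mine; pitch padding is ignored):

```cpp
#include <cassert>
#include <cstddef>

// YV12 subsamples chroma 2x in both directions, so the full frame
// occupies luma + 2 * (luma / 4) = 1.5 * luma bytes. This is the sum
// bitblt would need but cannot receive through its classic signature.
static size_t yv12_frame_bytes(int width, int height)
{
    size_t luma   = static_cast<size_t>(width) * height;
    size_t chroma = (static_cast<size_t>(width) / 2) * (height / 2); // per plane
    return luma + 2 * chroma;   // Y + U + V
}
```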

To work around that limitation, I provide several external methods which give bitblt a hint,
or adjust the cache bypass values, on whether it should use temporal or non-temporal stores.

Another special feature of this implementation is the automatic estimation of the optimal percentage of cache
that should be used for temporal stores. According to my tests, the smaller the cache, the lower the percentage
we should use.

That required a lot of empirical testing because I was not able to derive a generic formula. Some cases are
estimated and might need fine-tuning. You can see the approximations used in TableByPass (near the bottom)
of bitbltsse2_avs.asm. Besides that, I must warn that with caches of 1 MB or less this limit is very critical
under avisynth; it was impossible to establish an exact value, so once more it is a trade-off.
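The shape of such a rule can be sketched as follows. The fractions below are invented for illustration only; the real, empirically tuned values live in TableByPass in bitbltsse2_avs.asm:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of a cache-size-dependent bypass threshold:
// copies smaller than the threshold use temporal stores, larger ones
// use non-temporal stores. Smaller caches get a smaller temporal share,
// matching the trend the author reports. The fractions are made up.
static size_t bypass_threshold(size_t cache_bytes)
{
    if (cache_bytes <= (1u << 20))      // <= 1 MB: very conservative
        return cache_bytes / 4;
    if (cache_bytes <= (4u << 20))      // mid-size caches
        return cache_bytes / 3;
    return cache_bytes / 2;             // large caches: the usual 50% rule
}
```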

I must also warn that all these bypass-cache values are only valid under the avisynth environment.
Under other environments it is almost always advisable to use 50% of the highest-level cache for memcpy,
except on some new ultra-low-power laptops (needs research).

Finally, if you want to force bitblt, or help it decide, how to store data, you can hint it with:

DoNta_Oneline(); Internally sets a variable to any value different from -1 and returns nothing;
it just sets a hint to use nta instructions.
It forces bitblt to copy one line with a non-temporal store.
It can also be used to hint memcpySSE2_avs for the same purpose.
When count is less than or equal to 8 it will not be copied with nta; the external
request will be ignored.

DoNta_RowLoop(); Internally sets a variable to any value different from -1 and returns nothing;
it just sets a hint to use nta instructions.
It forces bitblt to do a row-loop bitblt with non-temporal stores.

Pass_sizeframe(int); Passes the sum of the clip's plane sizes when appropriate and
necessary.
It tells bitblt the real frame size loaded in memory.
It can also be used to inform memcpySSE2_avs of the real frame size
in some conditions.

Set_SizeByPass (int ByPassTpaOrNta, int BypassPreLdNta);

Passes values for the two bypasses:
1st) to select between temporal and non-temporal stores;
2nd) to decide whether to preload the src block before a
non-temporal block store.
The 2nd could be dropped once we have gathered enough
information from several machines to fix those values.
It seems that on machines with caches bigger than 1 MB,
block preload does not help at all. If you use it, set it
to -1 if you don't want block techniques to be applied.
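A guess at how such a one-shot hint mechanism might work (this is an assumption about the design, not the actual internals of the asm code):

```cpp
#include <cassert>
#include <cstddef>

// Sketch of a one-shot hint variable: the setter stores any value
// different from -1, and the copy routine consumes and resets it, so
// each hint applies to exactly one call. Names are mine.
static int g_nta_hint = -1;   // -1 means "no hint pending"

static void DoNta_Hint_sketch() { g_nta_hint = 1; }

static bool consume_nta_hint(size_t count)
{
    bool use_nta = (g_nta_hint != -1) && (count > 8); // tiny copies ignore the hint
    g_nta_hint = -1;                                  // one-shot: reset after use
    return use_nta;
}
```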

You can also let bitblt decide (with normal parameters) but you should keep in mind that, in some cases,
it might use the wrong strategy and lead to sub-optimal performance.

Another aspect worth reviewing is the quantity of branches.
This implementation has three main branches:

a) When you call it with height=1, which is really similar to memcpy. This branch
can also be chosen internally when srcpitch==dstpitch==rowsize.
This branch in turn has two branches, a temporal and a non-temporal write to memory.
Also, on some (old) machines, when the write is non-temporal and count is bigger than
the largest cache available, a backward row-loop copy is done because it is faster.

b) When the total frame size is below the bypass-cache value chosen for the cpu it
is running on, a temporal row loop is used to copy the frame.

c) When the total frame size is bigger than the bypass-cache value chosen for the cpu it
is running on, a non-temporal row loop is used to copy the frame.
Within this option there is a further branch with another bypass value,
to decide whether to do a src block preload before the nta store or not.
Over that value it does a source preload before the nta block store; below that value
it doesn't use block techniques. On the newest cpus, with large caches bigger than 1 MB and better
hardware prediction (and automatic prefetching), block preloads are not necessary.

Besides that, every branch has two branches:

1) when both pointers are 16-byte or 32-byte (AVX) aligned
2) when only dst is 16-byte or 32-byte aligned and the source is misaligned

The problem is when the source is not 16-byte or 32-byte (AVX) aligned. Again we have to know which cpu
it is running on and, accordingly, use different methods to read the unaligned data,
which are:

a) old sse2-capable cpus, where we use two partial qword reads for each oword.

b) some intel and amd sse3-capable machines, where we use lddqu to load the unaligned data

c) an offset (aligned) read of the unaligned data, plus shifts, plus por to combine

d) an offset (aligned) read of the unaligned data, plus palignr (ssse3 cpus) to shift and combine

e) reading the unaligned data with movdqu or movups, because on some newer cpus this is,
luckily, finally the fastest way to load misaligned data.

f) 256-bit AVX read and write instructions, vmovups etc., are also included.
I could finally test them, and the increase in performance is great.
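Strategy (c) can be simulated in scalar code to show the idea: instead of one unaligned load, do two aligned loads around the target and stitch them together with shifts and OR, the same trick the SSE2 path would do with psrldq/pslldq/por (a sketch, using 64-bit words instead of 128-bit registers):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Load 8 bytes from a possibly misaligned address using only aligned
// 8-byte reads: read the aligned word containing the start and the one
// after it, then combine the two halves (little-endian layout assumed).
static uint64_t load_unaligned_via_aligned(const uint8_t* p)
{
    uintptr_t addr = reinterpret_cast<uintptr_t>(p);
    unsigned off = addr & 7;                       // misalignment in bytes
    const uint64_t* base = reinterpret_cast<const uint64_t*>(addr - off);
    if (off == 0)
        return base[0];                            // already aligned
    // high bytes of word0 go low, low bytes of word1 go high
    return (base[0] >> (8 * off)) | (base[1] << (8 * (8 - off)));
}
```

Note the second aligned read may touch up to 7 bytes past the requested range, which is why the real routines handle buffer tails separately.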

(Some new cpus have REP MOVSB etc. fully optimized, but unluckily still not for misaligned data.)
My small haswell laptop, which has ERMSB (Enhanced REP MOVSB), shows excellent performance with it, but only
for sizes below the bypass cache, i.e. with temporal stores; with non-temporal stores the routines
with vmovntps are faster.

Why has all that changed so much? Well, you have to ask the hardware engineers.
It is beyond my capabilities.

All these branches plus the cpu identification have a price: when you blit a small amount (below 16 KB approx.)
the overhead offsets some of the optimization speed-up; but nowadays, with HD clips, I think this is a good
trade-off. In fact there are plenty of trade-offs in all this code, which tries to be as compatible
and universal as possible across many architectures.

However, and for that reason as well, I have also included:

memcpySSE2_avs(void *dest, const void *src, size_t count);

It is almost fully compatible with memcpy, except it doesn't check overlap.
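Why "doesn't check overlap" matters can be shown with standard code: a plain forward copy corrupts data when dst overlaps the source ahead of it, which is exactly the case memmove handles and memcpy-style routines (including this one) do not:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Naive forward byte copy, the behaviour a memcpy-style routine may
// exhibit on overlapping buffers: once dst catches up with src, it
// starts re-reading bytes it already overwrote.
static void forward_copy(uint8_t* dst, const uint8_t* src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];   // reads already-overwritten bytes if regions overlap
}
```

With overlapping buffers you must use memmove (or copy backwards); with distinct buffers, as in a plane blit, the distinction is irrelevant.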

Warning! This memcpySSE2_avs was made to work under avisynth; in another environment it could be slower
than other implementations, as a consequence of the bypass-cache values. In fact it was only tested
within a few plugins of my own use, never as a substitute for the internal bitblt in avisynth.

As I've said in my previous post, none of these branches and options could have been done without
Agner Fog's libraries, code and papers.
This project includes four files from Agner Fog's libraries (cachesize32.asm, cputype32.asm,
instrset32.asm, unalignedfaster32.asm) and some slightly modified subroutines from memcpy32.asm.
You can find them at http://www.agner.org/optimize/asmlib.zip

There are many parts of this code that have never been tested in real life, so your help tracking down bugs
would be greatly appreciated.

My maybe-TODO list:
Track down and eliminate bugs.
Look for inconsistencies or misconceptions regarding performance on different cpus;
in other words, in most cases adjust the bypass values.
Delete some checks that are probably redundant and excessive.
Reduce the cpu id routines.
Redesign and start again from scratch.
Let's see when I feel like doing it. Probably never.

Too long a post, sorry!
As always, I hope you find something useful. Arda.

Last edited by ARDA; 2nd September 2015 at 18:52. Reason: typo
ARDA is offline   Reply With Quote
Old 2nd September 2015, 19:20   #2  |  Link
TheFluff
Excessively jovial fellow
 
Join Date: Jun 2004
Location: rude
Posts: 1,100
I guess I'm kind of an asshole for saying this outright, but since my gentle nudges in the fvertical thread hasn't discouraged you, here's some harsh words: this is completely pointless. You're writing thousands of lines of assembler to do what simple rep movb does as fast or insignificantly slower in basically all practical cases. Furthermore, if you can enforce 32 byte alignment and use more than one thread (like Vapoursynth does) you will basically always be bottlenecked by memory bandwidth on any sort of modern system. Basically, just use memcpy and don't try to ~optimize~ such a basic function. Someone else who is much smarter than you and had a lot more time on their hands has already done it.
TheFluff is offline   Reply With Quote
Old 2nd September 2015, 19:27   #3  |  Link
Elegant
Registered User
 
Join Date: Jul 2014
Posts: 55
He's allowed to have fun experimenting though. I mean that's how I do most of my work these days.
Elegant is offline   Reply With Quote
Old 2nd September 2015, 22:16   #4  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
Quote:
Originally Posted by Thefluff

I guess I'm kind of an asshole for saying this outright, but since my gentle nudges in the
fvertical thread hasn't discouraged you, here's some harsh words: this is completely pointless.
You're writing thousands of lines of assembler to do what simple rep movb does as fast or insignificantly
slower in basically all practical cases. Furthermore, if you can enforce 32 byte alignment and
use more than one thread (like Vapoursynth does) you will basically always be bottlenecked by
memory bandwidth on any sort of modern system. Basically, just use memcpy and don't try to
~optimize~ such a basic function. Someone else who is much smarter than you and had a lot more
time on their hands has already done it.

Quote:
Originally Posted by Thefluff
I guess we have to explain the joke since it apparently went completely over
your head
. Myrsloik hasn't looked at your code at all, he wrote his own flipvertical implementation
using Vapoursynth's bitblt, which looks like this:

Quote:
Originally Posted by Thefluff
It's probably faster than Avisynth 2.6's implementation and insignificantly
slower than yours.
If you were sane, you would have benchmarked that, but I don't think you have.
Quote:
Originally Posted by Thefluff
that's genuinely horrifying


In my almost fourteen years on the Doom9 forum I had never seen this kind of almost insulting post,
except once, and that one is not worth mentioning. I had never seen a member whose purpose was to
discourage other people's work, with only partial knowledge of the subject, and without
reading my posts or my code carefully. I could answer this post with academic links and
the work of developers far more important than me, but I won't... I have no free time to waste.
Maybe in the future, with another tone and another interlocutor, we can have a deep and constructive
debate about many points in this project.
From now on, I feel free not to answer these kinds of posts. I hope this kind of behaviour
is the exception; otherwise it means my mistake was to post my work on this forum.

My apologies if I have offended anyone
Thanks ARDA
ARDA is offline   Reply With Quote
Old 2nd September 2015, 23:03   #5  |  Link
TheFluff
Excessively jovial fellow
 
Join Date: Jun 2004
Location: rude
Posts: 1,100
my job here is to tell people they're doing dumb things, and while i like to think i'm pretty good at it, there's a lot of dumb to go around on this forum and i don't take the time to call out everything. your particular brand of dumb (trying to "optimize" common library functions that are already heavily optimized and have an insignificant performance cost in practice) just happened to be especially offensive since you're doing your best to be directly contrary to good programming practices.

now, how about you post an actual benchmark of your bitblt (don't measure anything else, just the bitblt, preferably in cycles per byte or some similar unit) instead so you can actually prove me wrong instead of just hurfing a durf about how insulted you are that someone dare question your fancy memcpy?

Last edited by TheFluff; 2nd September 2015 at 23:15.
TheFluff is offline   Reply With Quote
Old 3rd September 2015, 00:33   #6  |  Link
pbristow
Registered User
 
pbristow's Avatar
 
Join Date: Jun 2009
Location: UK
Posts: 263
Quote:
Originally Posted by TheFluff View Post
my job here is to tell people they're doing dumb things
Really? You're *employed* to do that? Who by?
pbristow is offline   Reply With Quote
Old 3rd September 2015, 01:34   #7  |  Link
AzraelNewtype
Registered User
 
AzraelNewtype's Avatar
 
Join Date: Oct 2007
Posts: 135
Quote:
Originally Posted by pbristow View Post
Really? You're *employed* to do that? Who by?
Me. His service is very useful to me.
AzraelNewtype is offline   Reply With Quote
Old 3rd September 2015, 03:30   #8  |  Link
Reel.Deel
Registered User
 
Join Date: Mar 2012
Location: Texas
Posts: 1,666
I usually don't argue with people on the internet since it's all a waste of energy, but sometimes there are those who deserve it...

[*REMOVED*]

Seriously TheFluff, what's your deal? I see you're always bitching/being negative about things that have nothing to do with you and do not affect you in any way, shape, or form. If it doesn't concern you, why not just STFU and let it be? It's their time and they can do with it what they please. I think it's cute that you actually think your opinion really matters LOL. You should be at an age where you're wise enough to realize that it doesn't, especially on the internet.

"tl;dr > anime makes you stupid" - you should practice what you preach. Take care sweetheart

Last edited by sh0dan; 4th September 2015 at 10:14. Reason: Removed links and insults
Reel.Deel is offline   Reply With Quote
Old 3rd September 2015, 05:54   #9  |  Link
jmac698
Registered User
 
Join Date: Jan 2006
Posts: 1,867
I don't know much about Intel assembly, but I'm just curious if this idea would help: ignore copying the ends which are out of alignment, just copy the main parts that are in alignment, then go back and copy the leftover bits. I've used such ideas before on other cpus and it was much faster.

Arda, would any of what you've learned help with just source reading? For example average of all frames or average of all pixels in each frame. People have uses for these functions. Some other ideas: stacked 16 bit to 16 bit words, yuv2planar, swap u/v

Last edited by jmac698; 3rd September 2015 at 06:02.
jmac698 is offline   Reply With Quote
Old 3rd September 2015, 09:47   #10  |  Link
sh0dan
Retired AviSynth Dev ;)
 
sh0dan's Avatar
 
Join Date: Nov 2001
Location: Dark Side of the Moon
Posts: 3,480
Quote:
Originally Posted by TheFluff View Post
my job here is to tell people they're doing dumb things [...]
No, it is not your job. I am a moderator; that is *my* job.

Please keep a respectful tone in your conversations. I don't mind you disagreeing on some points (and you might even be right, but I don't see any benchmarks from you either), but use *facts* to convince people, not personal insults.

The tone you are using is one of the reasons that people don't want to do free open source work, if they constantly are being attacked for what they are doing.
__________________
Regards, sh0dan // VoxPod
sh0dan is offline   Reply With Quote
Old 3rd September 2015, 10:00   #11  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
Quote:
Originally Posted by jmac698
I don't know much about Intel assembly, but I'm just curious if this idea would help; ignore
copying the ends which are out of alignment, just copy the main parts that are in alignment,
then go back and copy the leftover bits. I've used such ideas before on other cpus's and it
was much faster.
In fact your proposal is very close to what is done. If the beginning buffer pointers
are misaligned, we first copy a few bytes until we reach the first aligned address of the destination.
Afterwards, if by chance both pointers are aligned, we use a routine with aligned instructions
for both load and store; if only the destination is aligned, we use a routine where we load the src with
unaligned instructions, varying according to the architecture of the machine it is running on
(that is one of the important strategies of this code), and always store with aligned instructions.
If there is still a tail, we copy those few bytes, too, with the instructions most
appropriate for that machine. If you read the first post of this thread again, I think you will
find most of the answers, but always feel free to ask whenever and whatever you want.

The second question I don't completely understand; I will study it carefully and will
answer you in a few days. Be patient please! Little free time.

I hope you find something useful
ARDA
ARDA is offline   Reply With Quote
Old 3rd September 2015, 13:11   #12  |  Link
jmac698
Registered User
 
Join Date: Jan 2006
Posts: 1,867
@arda
-Could you write a fast plugin to average all frames?
-To average all pixels on a frame?
-To swap U and V channels?
-To copy src,src2 to dst,dst+1, src++, src2++, dst+=2

Using your experience, it could be faster than existing codes.
jmac698 is offline   Reply With Quote
Old 3rd September 2015, 14:15   #13  |  Link
Myrsloik
Professional Code Monkey
 
Myrsloik's Avatar
 
Join Date: Jun 2003
Location: Kinnarps Chair
Posts: 2,555
Quote:
Originally Posted by jmac698 View Post
@arda
-Could you write a fast plugin to average all frames?
-To average all pixels on a frame?
-To swap U and V channels?
-To copy src,src2 to dst,dst+1, src++, src2++, dst+=2

Using your experience, it could be faster than existing codes.
I can't take this anymore. Usually I do the discrete trollings but NO NO NO. THE ANSWER IS NO!

-Could you write a fast plugin to average all frames?
No, not faster than what currently exists. This is an extremely memory speed bound operation.

-To average all pixels on a frame?
This code is already optimal in Avs+. PSADBW cannot be beaten. FACT

-To swap U and V channels?
This is already a pointer swap for planar formats - NO SPEED INCREASE EVAR

-To copy src,src2 to dst,dst+1, src++, src2++, dst+=2
Zero copy with SubFrame(Planar) if I understand your doodle correctly - SO NO GAINS FOR U

If it's too slow buy faster memory. Or a faster cpu (for the faster memory support, not for the other useless parts) as well if your current one can't support it.

Objectively true speedup tips.
__________________
VapourSynth - proving that scripting languages and video processing isn't dead yet
Myrsloik is offline   Reply With Quote
Old 3rd September 2015, 15:56   #14  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
Quote:
Originally Posted by jmac698
@arda
-Could you write a fast plugin to average all frames?
-To average all pixels on a frame?
-To swap U and V channels?
-To copy src,src2 to dst,dst+1, src++, src2++, dst+=2

Using your experience, it could be faster than existing codes.
Myrsloik has more experience than me with the internals of every part
of avisynth, and of course still more with vapoursynth. I think you should trust
his opinion, unless IanB appears and has a different one.

Thanks all, ARDA
ARDA is offline   Reply With Quote
Old 3rd September 2015, 16:00   #15  |  Link
jmac698
Registered User
 
Join Date: Jan 2006
Posts: 1,867
ok
The fastest frame average I know of is RedAverage
http://forum.doom9.org/showthread.ph...11#post1537311

My "doodle" refers to converting stacked MSB/LSB 16-bit video to normal 16-bit words for each pixel. The point of having it actually in words is to use vector instructions to do further fast operations on those words, for example brightness or contrast adjustment; unless, of course, it's just as fast to do an add on two halves of a word scattered far apart in memory
jmac698 is offline   Reply With Quote
Old 3rd September 2015, 23:23   #16  |  Link
colours
Registered User
 
colours's Avatar
 
Join Date: Mar 2014
Posts: 308
Quote:
Originally Posted by jmac698 View Post
ok
The fastest frame average I know of is RedAverage
http://forum.doom9.org/showthread.ph...11#post1537311
Have I ever mentioned that the plugin crashes when the input clip(s) are unaligned? And that seemingly every function other than RAverageW is practically untested and produces different results between all eight combinations of C++ vs SSE, 8-bit input vs 16-bit input, and 8-bit output vs 16-bit output?

I do use RAverageW occasionally because it's still sorta useful despite its shortcomings, but it's not really a good example of writing fast and good code. More like bad but fast, if you ask me.

Quote:
Originally Posted by jmac698 View Post
My "doodle" refers to converting stacked MSB/LSB 16bit video to normal 16bit words for each pixel, the point of it actually being in words is to use vectored instructions to further do fast operations on those words, for example for brightness or contrast adjustment, unless of course it's just as fast to do an add on two halves of a word scattered far apart in memory
The stack16 format is a historical mistake, but it's too late to fix that. However, the copying involved in interleaving the MSB/LSB before further processing, while a lot slower than even an unoptimised blit, is still pretty fast, and if this is a bottleneck in your script then your script clearly isn't doing a lot of anything at all.

Quote:
Originally Posted by sh0dan View Post
The tone you are using is one of the reasons that people don't want to do free open source work, if they constantly are being attacked for what they are doing.
The only reason they're constantly attacked is that they're usually also the ones to be open about what they're doing. People who write proprietary software like to think of their code as being TRADE SECRET and avoid discussing it altogether, and it's pretty hard to criticise that because no one but the developers knows anything about it.

Quote:
Originally Posted by Reel.Deel View Post
I usually don't argue with people on the internet since it's all a waste of energy but sometimes there are those who deserve it...

The only reason TheFluff picks on people over the internet is because he can't do so in real life. Looking like this and sounding like this I'm sure he's the one that gets picked on. Now that I know what he looks like, there's no way I can take this clown serious, especially face to face.
Doxing is not cool. Neither are personal attacks. Please don't do this.
__________________
Say no to AviSynth 2.5.8 and DirectShowSource!

Last edited by colours; 3rd September 2015 at 23:28. Reason: >reading comprehension
colours is offline   Reply With Quote
Old 5th September 2015, 15:15   #17  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
ERMSB (Enhanced REP MOVSB/STOSB) is sometimes proclaimed as the final hardware solution for moving
data, so that we will not need memcpy functions anymore, but unluckily that is still not true nowadays.
Here is some Intel reference material and my own notes about it; I hope it becomes true in the future.
The net is full of information and discussions on this subject; search if you feel like it (boring).
Quote:
Originally Posted by INTEL PDF

3.7.7 Enhanced REP MOVSB and STOSB operation (ERMSB)
Beginning with processors based on Intel microarchitecture code named Ivy Bridge,
REP string operation using MOVSB and STOSB can provide both flexible and highperformance
REP string operations for software in common situations like memory
copy and set operations. Processors that provide enhanced MOVSB/STOSB operations
are enumerated by the CPUID feature flag: CPUID: (EAX=7H, ECX=0H):EBX.ERMSB[bit 9] = 1.

3.7.7.1 Memcpy Considerations
The interface for the standard library function memcpy introduces several factors
(e.g. length, alignment of the source buffer and destination) that interact with
microarchitecture to determine the performance characteristics of the implementation
of the library function. Two of the common approaches to implement memcpy
are driven from small code size vs. maximum throughput. The former generally uses
REP MOVSD+B (see Section 3.7.6), while the latter uses SIMD instruction sets and
has to deal with additional data alignment restrictions.
For processors supporting enhanced REP MOVSB/STOSB, implementing memcpy
with REP MOVSB will provide even more compact benefits in code size and better
throughput than using the combination of REP MOVSD+B. For processors based on
Intel microarchitecture code named Ivy Bridge, implementing memcpy using ERMSB
might not reach the same level of throughput as using 256-bit or 128-bit AVX alternatives,
depending on length and alignment factors.
To continue reading this article, follow this link (chapter 3.7.7):
http://www.intel.com/content/dam/doc...ion-manual.pdf

Some small notes from my side: I have only experimented with rep movsb (ERMSB) on my avx2 laptop,
so I don't yet have a whole picture of its performance. On my laptop it performs well,
very close to what is obtained with ymm vectors, but only below the bypass cache, when a temporal
store (tpa) is used; with non-temporal stores it is always around 20% slower, in both cases with
both pointers 32-byte aligned. With misaligned pointers, ERMSB performance is still worse. Let's see
if we can gather more information, mainly from the newest machines.
If anyone wants to do some tests, here is a small and quick routine, used in some tests,
that can be included in any c++ code. Please take into account that it will only work from Ivy Bridge onwards (not sure, google it)
Code:

 int enhrepsup()
{
     int enhrep = 0;
    _asm{
        xor ecx,ecx
        mov eax,7
        cpuid               //structured extended feature leaf 7
        and ebx,200h        //bit 9 = ERMSB
        jz noexistenhrep
        mov [enhrep],1
align 16
noexistenhrep:
        sub eax,eax
        cpuid               //leaf 0, leaves registers in a defined state
    };
    return enhrep;
}

I hope you find this useful
ARDA

Last edited by ARDA; 5th September 2015 at 21:56.
ARDA is offline   Reply With Quote
Old 29th September 2015, 16:48   #18  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
BitBlitTest is a simple plugin that in fact does nothing except copy each frame to another
memory buffer and return it. It was made only to test and compare the performance of the bitblts
and the compiler's memcpy libraries.

To simplify its development and use, it only works with Y8 (just one plane) and sse2 or up.
It allows you to play with external hints that help the new bitblt decide which strategy to follow
for better performance.
It is intended as a tool to test different routines under the many conditions that can vary a lot
depending on the cpu it is running on, and on the script as well.
It is useful for developers who want to research and benchmark these routines, but also for
all volunteers (PLEASE) to help finish the bypass tables.
Code:

     Syntax:
     bool IntBBlt     default false    ; if true, forces use of the avisynth internal bitblt
     bool UseCompiler default false    ; if true, forces use of the compiler's memcpy library
     bool UseAgFog    default false    ; if true, forces use of Agner Fog's memcpy library
     bool UseEnRpMsb  default false    ; if true, forces use of rep movsb/movsd instructions

     bool NtaLine default false     ; force the new bitblt to do a non-temporal store in memcpy
     bool NtaRow  default false     ; force the new bitblt to do a non-temporal store in a row copy
     int  SzeFame  ; pass the whole frame size to the new bitblt; not useful for now in this plugin
     int  TpaorNta ; pass the bypass-cache value that selects between temporal and non-temporal stores
     int  PrldNta  ; pass the bypass value that, with non-temporal stores, selects whether to block-preload
                   ; the src or not; it is mainly useful on old cpus with small caches
                   ; PrldNta=-1 will skip the preload; appropriate for newer cpus
     If you set IntBBlt=true, UseAgFog=true or UseCompiler=true, the rest of the parameters are ignored


Speaking specifically about my new bitblt's performance: in my experience the increase can
vary between 5 and 35%, depending on the cpu it is running on; but most of the tests have been done with
this plugin and another of my own use, never as a substitute for the internal bitblt, for now.

I use avstimer and debugview to measure performance
Code:

#LanczosResize(1920,1088)         # use it to force a linear memcpy or any other mod64 resolution
#separatefields                   # use it to force a row loop copy, both lines are just as examples
AvsTimer(frames=1000, name="ANYONE",type=3, frequency=x?, total=false, quiet=true)# use your cpu frequency
# Put here your filter to benchmark
#BitBlitTest(IntBBlt=true)        # this option test internal bitblt
#BitBlitTest()                    # this option tests my new bitblt
#BitBlitTest(UseCompiler=true)    # this option tests a bitblt using memcpy library of the compiler
#BitBlitTest(UseAgFog= true)      # this option tests a bitblt using Agner Fog's memcpy library
#BitBlitTest(UseEnRpMsb= true)    # this option tests a bitblt using ERMSB, rep movsb/movsd instructions
AvsTimer(frames=1500 ,name="ANYONE",type=3, frequency=x?, difference=1, total=false)# use your cpu frequency

If you prefer QueryPerformance instead of rdtsc(type=3) for measuring, you should set type=0


With this specific plugin I also use avsmeter, because I don't need a writable src; any form of source is appropriate.
You can use it like this:
Code:

#ColorBars(width=3840, height=2160, pixel_type="YV12").ConvertToY8().KillAudio().AssumeFPS(25, 1).Trim(0, 49999)
#any resolution and form of source you like
#BitBlitTest(IntBBlt=true)      # this option test internal avisynth bitblt
#BitBlitTest()                  # this option tests my new bitblt
#BitBlitTest(UseCompiler=true)  # this option tests a bitblt using memcpy library of the compiler
#BitBlitTest(UseAgFog=true)     # this option tests a bitblt using Agner Fog's memcpy library
#BitBlitTest(UseEnRpMsb=true)   # this option tests a bitblt using ERMSB, rep movsb/movsd instructions

Download here, only dll, sources soon
Version 1.000 BitBlitTest.dll

I hope you find something useful
ARDA
ARDA is offline   Reply With Quote
Old 29th September 2015, 16:48   #19  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
A little more about this tool plugin: it is really simple and useless for anything other
than benchmarking the internal bitblt against my version, without having to integrate mine into avisynth
and switch between versions, and also avoiding the limitation of fvertical, which only uses
a line-per-line copy (a row loop copy).
With this test plugin the new bitblt has more options to do the copy and in many cases
will use memcpySSE2 or similar routines. Also, and no less important, it lets you test
the new bitblt against itself by varying some parameters, and against the memcpy routines of the compiler you are using (UseCompiler).

IntBBlt=true The stable bitblt in avisynth is around ten years old, so a comparison
is not totally fair, but in spite of that, when a non-temporal store must be applied
it has excellent performance. On old CPUs, the ones where the SSE2 set first appeared, it was
quite difficult to beat. So if anyone still has an old SSE2 Pentium 4 or
an old AMD, some benchmarks would be very useful to tune those options in my
new bitblt.

UseCompiler results should vary a lot depending on the compiler and the parameters you compile with; therefore
this option is only for developers who want to compile their own versions of the tool.
Why did I add this option? Mainly because it is another way to discover whether the bypass values
are correct, but also, of course, to see whether we can always get better performance than the compilers with their
different dispatches. The version I released was compiled with Visual Studio 2008 and without any special CPU
targeting, so it can run on any SSE2 machine or newer.
[B]The most critical[/B] performance comparison between my memcpySSE2 and the compiler's versions is with counts
below 32kb, almost always with temporal stores, because of the big overhead added by my CPU analysis.
To be fair, there is also a condition (frame size bigger than the cache bypass) where the compiler's memcpy library
will always be slower, no matter what you do. When a line-per-line (row loop) copy is done, the memcpy library can
never know that the real frame size is bigger than the row being copied. Why? Because memcpy is
an old C library from a time when cache bypasses to select between temporal and non-temporal stores were not necessary.
In this case my new bitblt will always win, because the external SzeFame parameter helps it select
between temporal and non-temporal stores.
Last but not least, I have added [B]UseCompiler[/B] with the aim of getting help from other developers
who have different compilers (Intel, for instance), so we can compare with all of them too, in the hope of
avoiding more pollution of this thread.
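The SzeFame argument above can be modeled in a few lines of C: the copy routine picks a non-temporal store when the *total* frame size, not the single row currently being copied, exceeds a cache-derived bypass threshold — exactly the information a plain memcpy never has inside a row loop. A hedged sketch (the threshold value is illustrative, loosely taken from the TableByPass listed further down):

```c
#include <stddef.h>

/* Illustrative model of the temporal vs non-temporal decision.
 * bypass is the TpaOrNta threshold derived from the largest cache
 * (see TableByPass); size_frame is the whole frame, passed in
 * externally (SzeFame) so a row-loop copy can still pick a
 * non-temporal store even though each individual row is small. */
enum store_kind { STORE_TPA, STORE_NTA };

static enum store_kind pick_store(size_t size_frame, size_t bypass)
{
    /* A frame bigger than the bypass threshold would evict useful
     * cache lines, so stream it with non-temporal stores instead. */
    return size_frame > bypass ? STORE_NTA : STORE_TPA;
}
```

A memcpy called once per 1920-byte row would always land on the temporal side; only the externally supplied frame size lets the row loop choose nta.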

UseAgFog My new bitblt is based on many of Agner Fog's libraries, so I have been using this
option to check whether I introduced a bug or any misconception.
Over time I started modifying some cache bypass values according to my experience, to be able
to make the comparisons fairer.
In most cases this option is a little slower than my new bitblt, except at some very small
counts where it can reach the same performance.

UseEnRpMsb This option was included to test machines which have ERMSB (Enhanced REP MOVSB); it can
be used on any machine, but on those without the feature the results will be unpredictable. Actually, many compilers just make
use of these instructions and, on new machines, get good performance within a certain range of
counts.
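For reference, an ERMSB-style copy is simply `rep movsb` with the count in ECX/RCX. A minimal GCC/Clang inline-assembly sketch of that instruction pattern (x86/x86-64 only; this shows the instruction the UseEnRpMsb option exercises, not the plugin's actual dispatch or alignment handling):

```c
#include <stddef.h>

/* Minimal rep-movsb copy, the pattern that ERMSB accelerates.
 * GCC/Clang extended asm, x86/x86-64 only; on CPUs without the
 * ERMSB feature flag it still works, just possibly slowly. */
static void copy_rep_movsb(void *dst, const void *src, size_t count)
{
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(count)
                     : : "memory");
}
```

Whether this beats an SSE2/AVX loop depends heavily on the CPU generation and the count, which is exactly what the option is meant to measure.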

#BitBlitTest() # this option tests my new bitblt
Without any parameters this option tests my new bitblt, but with some parameters
you can modify its behaviour. Let's see how to use them.

How to use them, and what for? The main goal of this tool plugin is to find the appropriate
bypass values for the CPU you are testing.
How do you do that? In the table below you can see the default values in the new bitblt according to the size
of the biggest cache available in your CPU, among other things. This table is part of BitBlt_SSE2_avs; you can find it
at the end of this file.

You will see an indication of the percentage of this largest cache; these are always approximations. Test
performance near the border, once a little below the bypass value and once a little over: if you notice a big
spike, it is because the bypass value is wrong.

Int TpaOrNta: with this option you can vary the bypass cache value used to choose
between a temporal or non-temporal store. If you include a value for this parameter, you
also have to include one for PrldNta.
Int PrldNta: if it is set to -1, it indicates block copy techniques will not be applied. You can modify it
and introduce a value, but I must warn you that in all the tests I've done this is only useful on old CPUs whose
largest cache is 1MB or smaller.
On AVX machines the PrldNta block techniques will never be applied, because they are useless there.
This is due to better hardware prefetching (and larger caches), and because when the new bitblt does a non-temporal (nta) store
it always does nta, including during alignment and for the tails as well; by always avoiding loading any dst line
into cache, there will never be victimization of cache lines.

NtaLine and NtaRow both force bitblt to do an nta store whatever the case. If you cannot
tell what bitblt was doing originally, just test twice, with one of the Nta options and without it; if there
is no difference, it is because bitblt was already doing an nta store. But you may say: I want
to force it to do a temporal (tpa) store. Well, just increase the value of TpaOrNta until you find a difference
in performance; if there is no difference, it is because it was already doing tpa.

SzeFame is almost useless, or redundant, in this plugin, because bitblt already has the parameters to calculate
the frame size when only one line (memcpy) is done, except when a row loop is chosen; in that case it is useful
to pass SzeFame. Anyway, if you do not include it, BitBlitTest will pass it for you when it may be necessary.

Last but not least, you may ask: how can I force the new bitblt to do a linear memcpy or a row loop, so
I can measure different situations? There are many ways, but here are two simple examples:
Apply a resize, any one, to a resolution where the row is mod64, for instance BilinearResize(1920,1088), immediately
before BitBlitTest; this will force a linear memcpy.
Apply SeparateFields() immediately before BitBlitTest; this will force a row loop copy (see the script with AvsTimer above).
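To see why SeparateFields() forces the row loop: in a simplified model, each field keeps the frame's row size but halves the height and strides through the original buffer with double the pitch, so pitch != row_size and a linear copy is impossible. A small C illustration (the field-view arithmetic is an assumption for illustration, not the real SeparateFields() internals):

```c
#include <stdbool.h>

/* Simplified model of an avisynth plane and of viewing one field of it:
 * one field keeps row_size, halves the height, and doubles the stride. */
struct plane { int pitch, row_size, height; };

static struct plane separate_field(struct plane frame)
{
    struct plane field = frame;
    field.pitch  = frame.pitch * 2;   /* skip the other field's lines */
    field.height = frame.height / 2;
    return field;
}

/* A linear memcpy is only possible when pitch == row_size; avisynth
 * aligns pitches, so a mod64 row size typically gives pitch == row. */
static bool is_linear_copy(struct plane p)
{
    return p.pitch == p.row_size;
}
```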
Code:

;.................................................................................................
align 16
TableByPass:

;   TpaOrNta   PreLdNta     LgstCche TpaOrNta PreLdNta
LgstChe256k:
  DD  4196,     32768;       256kb .   xx%   12.5%  can't test these old cpus, haven't access to any.
                                                   ; in memcpy_amd the values are 64*1024 and 197*1024
                                                   ; for old cpus with 256 or 512 kb L2 caches
;.......................................................................................................
LgstChe512k:                                       ; but in some cases with a 128kb L1. Anyway
  DD  8192,     65536;       512kb .   xxx%  12.5%   this seems too high to me; needs research and benchmarks
;.......................................................................................................
LgstChe1Mega:
  DD  10240,    262144;      1024kb.  ?(1)%  25%    already tested and OK in two AMDs with 1 MB L2 cache
                                                   ; avisynth gobbles up 1038336 bytes of the largest cache?
                                                   ; when there is only 1 MB L2 and no L3, as in my old AMD
;.......................................................................................................
LgstChe1M512k:
  DD  52224 ,   786432;      1536kb.  3.2%   50% need research and benchmarks
;.......................................................................................................
LgstChe2Mega:
  DD  314368,   1310720;     2048kb   15%    65% need research and benchmarks
;.......................................................................................................
LgstChe3Mega:
  DD  838656,  -1      ;     3072kb.  27%    NOPRELOAD. need research and benchmarks
;.......................................................................................................
LgstChe4Mega:
  DD  1362944,  -1     ;     4096kb.  31%    NOPRELOAD. need research and benchmarks
;.......................................................................................................
LgstChe6Mega:
  DD 2411520 ,  -1;          6144kb.  39%    NOPRELOAD. need research and benchmarks
;.......................................................................................................
LgstChe8Mega:
  DD  3460096,  -1;          8192kb.  41%    NOPRELOAD. need research and benchmarks
;.......................................................................................................
LgstChe12Mega:
  DD  5557248,  -1;          12288kb. 44%    NOPRELOAD. need research and benchmarks
;.......................................................................................................
LgstChe16Mega:
  DD  7654400,  -1;          16384kb. 45%    NOPRELOAD. need research and benchmarks
;.......................................................................................................
LgstChe18Mega:
  DD  8702976,  -1;          18432kb  46%    NOPRELOAD. need research and benchmarks
;.......................................................................................................
LgstChe24Mega:
  DD  11848704, -1;          24576kb  47%    NOPRELOAD. need research and benchmarks
;.......................................................................................................
LgstChe32Mega:
  DD  16043008, -1;          32768kb  48%    NOPRELOAD. need research and benchmarks
;.......................................................................................................


Source and DLL:
Version 1.000 BitBlttest.7z

I hope you find this useful.
ARDA

Last edited by ARDA; 29th September 2015 at 21:20.
Old 29th September 2015, 21:15   #20  |  Link
ChaosKing
Registered User
 
Join Date: Dec 2005
Location: Germany
Posts: 1,795
Just for fun

Code:
AVSMeter 2.1.2 (x86)
AviSynth+ 0.1 (r1576, x86) (2.6.0.5)
CPU i5-3570K @ 4 GHz


ColorBars(width=3840, height=2160, pixel_type="YV12").ConvertToY8().KillAudio().AssumeFPS(25, 1).Trim(0, 49999) # 1680393 fps :D
#BitBlitTest(IntBBlt=true)      # 1211 avg fps
#BitBlitTest()                  # 1306
#BitBlitTest(UseCompiler=true)  # 906.0 (00:00:55.186)
#BitBlitTest(UseAgFog=true)     # 1353 (00:00:36.954)
#BitBlitTest(UseEnRpMsb=true)   # 1312

When I add LanczosResize(1920,1088) after ColorBars, I get 107.x fps with every BitBlit version
__________________
AVSRepoGUI // VSRepoGUI - Package Manager for AviSynth // VapourSynth
VapourSynth Portable FATPACK || VapourSynth Database

Last edited by ChaosKing; 29th September 2015 at 21:24.