Old 24th January 2007, 00:07   #1  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
An approach to an open source fast memset

At first I was just studying greyscale code. I started out looking for assembler improvements
and ended up thinking about how to write a kind of memset for modern architectures that takes advantage
of SSE2 instructions etc.
Finally I arrived at code that is already usable and fast enough in this plugin, GreyYV12.

After IanB's post in this thread http://forum.doom9.org/showthread.php?t=121066
I have some doubts about whether there is a better way to improve the performance of this plugin.
Meanwhile one purpose can be achieved: some steps towards developing alternative code for memset.

Warning! This code (memset) is not compatible with the standard C++ library; besides that, it has
only been tested in this plugin, so be careful if you want to use it under other conditions.

My TODO list:
Find bugs.
Should we test clflush support for SSE2 CPUs or newer?
Add a special case in memset_SSE2 for when the frame is smaller than the L1 cache size
and unaligned.
Redesign the library to make it compatible.
Benchmark against the fast memset of the Intel compiler.

For all of this I ask, please, for opinions, contributions, bug reports, suggestions, tests etc.

I hope you find this useful
ARDA

Version 1.2.1
updated:
Source and dll http://www.iespana.es/Ardaversions/GREY1_21.7z

Last edited by ARDA; 23rd June 2007 at 21:10. Reason: version change
Old 24th January 2007, 01:03   #2  |  Link
Fizick
AviSynth plugger
 
Join Date: Nov 2003
Location: Russia
Posts: 2,183
memset MMX is in VagueDenoiser (simply for complicity)
__________________
My Avisynth plugins are now at http://avisynth.org.ru and mirror at http://avisynth.nl/users/fizick
I usually do not provide a technical support in private messages.
Old 24th January 2007, 01:08   #3  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
Thanks for your complicity. I remember having looked at the VagueDenoiser sources a long time ago; we always get some feedback from other developers, even when we are not completely aware of it.
This is the first time I have addressed you directly; my respect for all your contributions. I would appreciate your opinions.

Thanks ARDA
Old 25th January 2007, 19:00   #4  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
Quote:
Originally Posted by IanB
GreyScale is nice and simple to analyse but is not a good practical target to optimise;
even in C++ code it is fast, so your return on investment is small.
i.e. if greyscale takes 5 ms per frame (200 fps) and you optimise it to be 10 times faster,
0.5 ms per frame (2000 fps), but it is part of a filter chain that takes 100 ms per frame (10 fps),
then the improvement is only 4.5 ms, giving 95.5 ms per frame (10.47 fps).
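
The arithmetic in the quote above can be checked with a few lines of C++. This is only an illustrative sketch; the function names `chain_ms_after` and `fps` are made up for this example and do not come from the plugin.

```cpp
#include <cassert>
#include <cmath>

// Frame time of a whole filter chain after one filter in it is sped up.
// chain_ms:  total time per frame before the optimisation
// filter_ms: time per frame spent in the filter being optimised
// speedup:   how many times faster that filter becomes
double chain_ms_after(double chain_ms, double filter_ms, double speedup) {
    assert(speedup > 0.0 && filter_ms <= chain_ms);
    return chain_ms - filter_ms + filter_ms / speedup;
}

// Convert milliseconds per frame into frames per second.
double fps(double ms_per_frame) {
    return 1000.0 / ms_per_frame;
}
```

With the numbers from the quote, `chain_ms_after(100.0, 5.0, 10.0)` gives 95.5 ms per frame, and `fps` of that is about 10.47.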
This new way of doing greyscale is probably not too important in itself; this thread is
mainly about memset:
Quote:
Originally Posted by ARDA
At first I was just studying greyscale code. I started out looking for assembler improvements
and ended up thinking about how to write a kind of memset for modern architectures that takes advantage
of SSE2 instructions etc.
Finally I arrived at code that is already usable and fast enough in this plugin, GreyYV12.

Meanwhile one purpose can be achieved: some steps towards developing alternative code for memset.
Besides that, the fact that greyscale under YV12 was so fast with simple C++ code
encouraged me to take up the challenge.
Quote:
Originally Posted by IanB
Spend your time optimizing the filter that takes 50+ms per frame!
Thanks for trusting my skills and knowledge! I know you are pointing in the right direction.
Maybe in the near future I'll take on more complex tasks; for now I am sharing what I can
manage given my abilities and time.

Coming back to developing code:
Quote:
Originally Posted by IanB
A better implementation would use IsWriteable so the two outcomes become :-
1. Blit just the Luma plane, then memset both chroma planes.
or
2. Keep the Luma plane and memset both chroma planes.
The saving is the blits of the 2 chroma planes in case 1.
As always, you have pointed out a good way to develop this plugin.

I have tested and confirmed your predictions. Soon I'll post some benchmark results.
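
The two outcomes IanB describes can be sketched as follows. This is not the plugin's code and does not use the AviSynth API: `Yv12Frame`, `grey_in_place` and `grey_copy` are hypothetical stand-ins, the planes are stored without pitch padding, and plain `std::memset`/`std::memcpy` take the place of the fast routines discussed in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a YV12 frame (not the AviSynth frame type).
struct Yv12Frame {
    int width, height;                  // luma size; chroma is width/2 x height/2
    std::vector<std::uint8_t> y, u, v;  // planes stored without padding
    Yv12Frame(int w, int h)
        : width(w), height(h),
          y(std::size_t(w) * h),
          u(std::size_t(w / 2) * (h / 2)),
          v(std::size_t(w / 2) * (h / 2)) {
        assert(w > 0 && h > 0 && w % 2 == 0 && h % 2 == 0);
    }
};

// Neutral chroma value for 8-bit YUV.
const std::uint8_t kGreyChroma = 128;

// Case 2: the frame is writable, so luma stays where it is and only the
// two chroma planes are filled.
void grey_in_place(Yv12Frame& f) {
    std::memset(f.u.data(), kGreyChroma, f.u.size());
    std::memset(f.v.data(), kGreyChroma, f.v.size());
}

// Case 1: the source is not writable, so only luma is copied (blitted) into
// a new frame and its chroma is filled; the source chroma is never read.
Yv12Frame grey_copy(const Yv12Frame& src) {
    Yv12Frame dst(src.width, src.height);
    std::memcpy(dst.y.data(), src.y.data(), src.y.size());
    grey_in_place(dst);
    return dst;
}
```

In both cases the chroma planes are only written, never copied, which is where the saving comes from.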

Finally, I want to point out the real objective of this thread: An approach to an open source fast memset. The ISSE and SSE2 code shows an example of how to use write combining and how to decide when to use it.
Another goal would be to finish this code so as to have a fully compatible memset library for AviSynth.
Is that useful? I don't know. Please give us your opinion about that!
Just to mention, memset is used 95 times in AviSynth, but I confess I don't know their real weight.
The fast memset of the Intel compiler already exists, but I wanted open source code with similar performance
or better. Why not?
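
As a rough illustration of what a write-combining fill looks like, here is a minimal sketch using SSE2 intrinsics rather than the plugin's assembly. The name `fill_bytes`, the `non_temporal` flag and the 64-byte cut-off are assumptions made for this example; on builds without SSE2 it simply falls back to `std::memset`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define FILL_HAVE_SSE2 1
#endif

// Fill n bytes at dst with value. When non_temporal is true (and SSE2 is
// available) the 16-byte aligned middle part is written with streaming
// stores (movntdq), which go through the write-combining buffers instead
// of pulling the destination into the cache.
void fill_bytes(void* dst, int value, std::size_t n, bool non_temporal) {
    assert(dst != nullptr || n == 0);
#ifdef FILL_HAVE_SSE2
    if (non_temporal && n >= 64) {           // 64 is an arbitrary cut-off
        std::uint8_t* p = static_cast<std::uint8_t*>(dst);
        // Head: ordinary stores until p is 16-byte aligned.
        std::size_t head = (16 - (reinterpret_cast<std::uintptr_t>(p) & 15)) & 15;
        std::memset(p, value, head);
        p += head;
        n -= head;
        const __m128i v = _mm_set1_epi8(static_cast<char>(value));
        for (; n >= 16; p += 16, n -= 16)
            _mm_stream_si128(reinterpret_cast<__m128i*>(p), v);
        _mm_sfence();                        // order the streaming stores
        std::memset(p, value, n);            // tail: fewer than 16 bytes
        return;
    }
#endif
    std::memset(dst, value, n);
}
```

Whether `non_temporal` should be set is a separate decision that depends on cache size and on how soon the data is read again.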

Summarizing my points:
Find bugs
Improve code
A discussion about write combining techniques and when they should be applied
How to make that decision
Make it compatible with the standard library

I don't want to limit the discussion to these subjects, but I would appreciate comments along
these lines.

Thanks ARDA

Version 1.2.1
updated:
Source and dll http://www.iespana.es/Ardaversions/GREY1_21.7z

Last edited by ARDA; 23rd June 2007 at 21:11. Reason: version update
Old 14th February 2007, 23:33   #5  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
Important improvements in the SSE2 version of memset.
I've run tests on two machines: a Pentium 4 at 3073 MHz and an AMD Turion ML 37 at 2000 MHz.
Comparisons were made against fast_memset_intel from the W_CC_C_9.1.028.exe compiler version.
On the Pentium 4, performance is at least the same, with some gains at big sizes.
On the ML 37 the gains are significant, above all at small sizes, because the Intel code is
better tuned for the Pentium 4.
In any case, I had to arrive at a solution which works well on both of my machines;
I hope it will have similar performance on many machines.

Suggestions ?

Version 1.2.1
updated:
Source and dll http://www.iespana.es/Ardaversions/GREY1_21.7z

Thanks ARDA

Last edited by ARDA; 23rd June 2007 at 21:12. Reason: version change
Old 11th March 2007, 02:19   #6  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
All the assembly code was moved to NASM and unified, because
I couldn't find an easy way to make a jump table with inline functions.

This jump table is used just for small sizes, which is the most common
scenario for memset.
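
For readers who have not seen the technique, the idea of a jump table for small sizes can be approximated in C++ with a table of function pointers. This is only a sketch of the concept, not a translation of the NASM code; `fill_fixed`, `kSmallFill`, `fill_small_first` and the limit of 16 bytes are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fill exactly N bytes. With N known at compile time the compiler can turn
// the loop into a few plain stores, like one target of an asm jump table.
template <std::size_t N>
void fill_fixed(std::uint8_t* p, std::uint8_t v) {
    for (std::size_t i = 0; i < N; ++i) p[i] = v;
}

typedef void (*FillFn)(std::uint8_t*, std::uint8_t);

// One entry per size 0..15; indexing by size replaces a chain of compares.
static const FillFn kSmallFill[16] = {
    fill_fixed<0>,  fill_fixed<1>,  fill_fixed<2>,  fill_fixed<3>,
    fill_fixed<4>,  fill_fixed<5>,  fill_fixed<6>,  fill_fixed<7>,
    fill_fixed<8>,  fill_fixed<9>,  fill_fixed<10>, fill_fixed<11>,
    fill_fixed<12>, fill_fixed<13>, fill_fixed<14>, fill_fixed<15>
};

// Small sizes are dispatched through the table; everything else goes to the
// general routine (std::memset stands in for it here).
void fill_small_first(void* dst, int value, std::size_t n) {
    assert(dst != nullptr);
    if (n < 16) {
        kSmallFill[n](static_cast<std::uint8_t*>(dst),
                      static_cast<std::uint8_t>(value));
        return;
    }
    std::memset(dst, value, n);
}
```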

CPU detection has also been included in memset_avs.asm to avoid too many branches
in chromaoff.cpp.
In spite of that, I have still left L2 cache size detection in the GreyYV12 constructor,
because for the purpose of this filter that is the most appropriate place.
For code fully compatible with the standard library I should include it in the
assembly code.

For now I will be updating cache detection for newer machines and correcting any
bugs or design misconceptions that are suggested.
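
One common way to query the L2 cache size on x86 is CPUID extended leaf 0x80000006, where bits 31..16 of ECX hold the size in KB. The sketch below uses that leaf through GCC/Clang's `<cpuid.h>`; it is an assumption that this is adequate for a given CPU, and it is not necessarily the method used in the plugin. The function names are hypothetical, and on other compilers or architectures detection simply reports 0.

```cpp
#include <cassert>
#include <cstddef>
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define L2_HAVE_GNU_CPUID 1
#endif

// Decode the L2 size in bytes from ECX of CPUID leaf 0x80000006
// (bits 31..16 = size in KB).
std::size_t l2_bytes_from_ecx(unsigned ecx) {
    return static_cast<std::size_t>(ecx >> 16) * 1024;
}

// Returns the L2 cache size in bytes, or 0 when it cannot be determined.
std::size_t detect_l2_bytes() {
#ifdef L2_HAVE_GNU_CPUID
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid returns 0 if the requested leaf is not supported.
    if (__get_cpuid(0x80000006u, &eax, &ebx, &ecx, &edx))
        return l2_bytes_from_ecx(ecx);
#endif
    return 0;
}

// Detected size, or a caller-supplied fallback when detection fails.
std::size_t l2_bytes_or(std::size_t fallback) {
    assert(fallback > 0);
    const std::size_t n = detect_l2_bytes();
    return n != 0 ? n : fallback;
}
```

Falling back to a default is a design choice of this sketch; a filter could just as well refuse to run when the size is unknown.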

As the code has changed a lot, I have started giving it a version number and
uploading the whole code.

Version 1.2.1
updated:
Source and dll http://www.iespana.es/Ardaversions/GREY1_21.7z
Thanks ARDA

Last edited by ARDA; 23rd June 2007 at 21:12. Reason: version update
Old 21st March 2007, 23:40   #7  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
Changelog

Version 1.1

Some updates to L2 cache size detection for new machines
(still working on that).

Small changes to be able to take advantage of non-temporal write
instructions when called after SeparateFields.

A small change for better performance on old ISSE machines.

Version 1.2.1
updated:
Source and dll http://www.iespana.es/Ardaversions/GREY1_21.7z

Last edited by ARDA; 23rd June 2007 at 21:13. Reason: typo
Old 23rd March 2007, 03:00   #8  |  Link
video_magic
Registered User
 
Join Date: Jan 2005
Posts: 368
Hello, I'm a fairly ordinary user of Avisynth, not a coder. I just wondered whether the work you are doing will eventually be integrated into the Avisynth program to make it faster or use fewer resources?
Or is it going to be a separate plugin (for something other than the main program)?

I use Avisynth quite often, and will probably be using it more soon for a large VHS-to-DVD backup project, so I am curious whether this project is meant to optimize it and might translate into a large number of saved hours over dozens of encodes, mainly via spline resize, addborders and changefps into the HCEnc encoder. Just wondering as a user of the program, but thanks for your time, it's appreciated.
__________________
Thankyou!, I am grateful for any help
Old 23rd March 2007, 13:07   #9  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
Quote:
Originally Posted by video_magic
I just wondered if this work you are doing is to eventually be integrated into the Avisynth
program to make it faster, or to use less resources?
I don't know. It could be integrated, when finished (quite soon), into any code that needs memset.
The weight of memset in AviSynth's overall performance is probably not very important; it can increase
performance in filters where it is used intensively.
Quote:
Originally Posted by video_magic
Or is it going to be a separate plugin (for something else other than the main program)?
It is already a separate plugin for converting your clip to greyscale, but that is just the excuse
to test memset development. Usage: just put GreyYV12.dll into your AviSynth plugins folder
and write GreyYV12() in your script.
Quote:
Originally Posted by video_magic
this project is to optimize it and might translate into a large number of saved hours
over dozens of encodes.
No, unluckily it is not so easy; in a decoding, resizing, filtering and encoding process, the overall speed has more important bottlenecks.
From this example, because of its simplicity, one can take an idea that is widespread nowadays: how to use non-temporal writes by analyzing the relation between the L2 cache size and the amount of data you are processing. In this plugin some particular conditions of AviSynth are taken into account as well (whether the source is writable or not, SeparateFields).
This main concept can be applied in many filters and codes.
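
The relation between L2 cache size and the amount of data can be turned into a simple rule of thumb. The sketch below is only one possible heuristic: the accounting in `bytes_touched_yv12`, the "half of L2" threshold and both function names are assumptions made for illustration, not the logic the plugin actually uses, and SeparateFields is not modelled.

```cpp
#include <cassert>
#include <cstddef>

// Bytes that pass through the cache when greying one YV12 frame
// (assumed accounting, ignoring pitch padding).
std::size_t bytes_touched_yv12(std::size_t width, std::size_t height, bool writable) {
    assert(width % 2 == 0 && height % 2 == 0);
    const std::size_t luma = width * height;
    const std::size_t chroma = 2 * (width / 2) * (height / 2);
    // Writable source: only the two chroma planes are filled.
    // Non-writable source: luma is also read once and written once.
    return writable ? chroma : 2 * luma + chroma;
}

// Assumed rule: use non-temporal stores once the data touched per frame
// exceeds half of the L2 cache; with an unknown cache size (0), do not.
bool use_non_temporal(std::size_t bytes_touched, std::size_t l2_bytes) {
    return l2_bytes != 0 && bytes_touched > l2_bytes / 2;
}
```

For a 720x576 frame with a 512 KB L2 cache, this rule keeps ordinary stores for a writable source (207360 bytes touched) and switches to non-temporal stores for a non-writable one (1036800 bytes touched).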
It is also used in this other small project http://forum.doom9.org/showthread.php?t=121066
Maybe in the future we will see many parts of the chain with this kind of optimization.
Quote:
Originally Posted by video_magic
Just wondered as a user of the program but thanks for your time it's appreciated
Thanks for your words, and my apologies if I have disappointed you;
but these are my two cents (maybe just one) for now.

ARDA
Old 23rd March 2007, 22:01   #10  |  Link
video_magic
Registered User
 
Join Date: Jan 2005
Posts: 368
Thanks for the explanations & for taking the time - best of regards for the project
__________________
Thankyou!, I am grateful for any help
Old 30th April 2007, 22:31   #11  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
Changelog Version 1.2

It no longer includes Camel - CPU Identifying Tool, Copyright (C) 2002, Iain Chesworth.
New L2 cache size detection code, updated for some new machines.
An error check has been added to test this new code.
Added use of 128-bit xmm registers for old SSE-capable Athlon XP and Pentium III.
Several small changes to optimize performance.

At this stage I can say that memset is almost finished; some fine tuning can still be
done. This code depends heavily on correct L2 cache size detection. The updated version
includes new code for this; that is why I need your help.
If anyone gets "L2 cache size not detected; impossible to continue. Use greyscale()."
please report it here. I need the processor type and L2 cache size.

Version 1.2.1
updated:
Source and dll http://www.iespana.es/Ardaversions/GREY1_21.7z

Thanks ARDA

Last edited by ARDA; 24th June 2007 at 08:14.
Old 23rd June 2007, 21:15   #12  |  Link
ARDA
Registered User
 
Join Date: Nov 2001
Posts: 291
Changelog Version 1.2.1

Added YUY2 support; SSE2 and ISSE.
Updated L2 cache detection for some new machines.
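
In YUY2 the bytes are packed as Y0 U Y1 V, so greying a row means forcing every odd byte to 128 while leaving the luma bytes alone. The sketch below shows that idea with SSE2 intrinsics and a scalar fallback; it is an illustration only, `grey_yuy2_row` is a hypothetical name, and it uses ordinary (cached) unaligned loads and stores rather than anything tuned like the plugin's assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define GREY_HAVE_SSE2 1
#endif

// Grey one YUY2 row in place. row_bytes is width * 2.
void grey_yuy2_row(std::uint8_t* row, std::size_t row_bytes) {
    assert(row_bytes % 4 == 0);   // YUY2 stores two pixels in four bytes
    std::size_t i = 0;
#ifdef GREY_HAVE_SSE2
    // In each little-endian 16-bit word the low byte is luma, the high byte chroma.
    const __m128i luma_mask = _mm_set1_epi16(0x00FF);
    const __m128i grey = _mm_set1_epi16(static_cast<short>(0x8000));
    for (; i + 16 <= row_bytes; i += 16) {
        __m128i* p = reinterpret_cast<__m128i*>(row + i);
        __m128i px = _mm_loadu_si128(p);
        px = _mm_or_si128(_mm_and_si128(px, luma_mask), grey);  // keep Y, set U/V = 128
        _mm_storeu_si128(p, px);
    }
#endif
    // Scalar tail (and full fallback without SSE2): overwrite the chroma bytes.
    for (; i < row_bytes; i += 2) row[i + 1] = 128;
}
```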



Version 1.2.1
updated:
Source and dll http://www.iespana.es/Ardaversions/GREY1_21.7z

Thanks ARDA