Go Back   Doom9's Forum > Capturing and Editing Video > Avisynth Development

Old 2nd December 2010, 01:39   #1  |  Link
Prettz
easily bamboozled user
 
Join Date: Sep 2002
Location: Atlanta
Posts: 373
New 64-bit FluxSmooth with SSE2 and SSSE3

A straight 64-bit port of FluxSmooth 1.1a (and what I started from) can be found here: http://forum.doom9.org/showthread.ph...28#post1425528

I've made new 64-bit versions of FluxSmooth for YV12 using SSE2 and SSSE3. The SSE2 version is optimized specifically for Athlon 64 and the SSSE3 version specifically for the 65nm Intel Core 2 (Conroe/Kentsfield). None of AMD's current chips support SSSE3, annoyingly.

I haven't done the YUY2 version yet; I'll be starting on that next. For now, these .dll's contain the original C++ and MMX versions for YUY2. Although FluxSmooth's documentation said it was SSE optimized, it was actually written in MMX, so these new versions involve a total rewrite, which is why it's taking so long.

I'm posting this now because I'd like some feedback and help testing. I've gotten all the obvious bugs, but there's bound to be something that needs more testing to uncover. Being new to writing Avisynth plugins, I don't have any good methods of testing filters. I'd also really like to see some speed numbers for the Athlon 64.

About the SSE2 version... There was a problem in moving FluxSmooth's method of doing the average to SSE. The MMX routine uses a lookup table that's a whopping 512KB. It's impossible to use this method in 128-bit -- it would require a 64GB table. For the SSE2 version I had to use floating-point code to take the average. It's not that much of a speed loss though, because with the MMX version's 512KB table, virtually every access would be a cache miss (and a lot of them an L2 cache miss). The SSSE3 version is able to avoid this thanks to the new instructions, and it uses the same average calculation as the C++ and MMX versions.

There are enough changes in this release that I think it should constitute a new version, v1.2, of FluxSmooth:
  • FluxSmooth's documentation never mentioned that the MMX version actually skips over not just the edge pixels but the first 4 and last 4 pixels of every row. So, previously, FluxSmooth's only optimized version returned different results from the reference C++ code. The new SSE2 and SSSE3 versions only skip the edge pixels; they smooth all the same pixels that the C++ version does.
  • The temporal-only versions of the original FluxSmooth skipped over the same pixels that the ST versions did, although there was no reason to do this. I've modified the C++ code to process all pixels on each frame, and I made (highly-optimized) standalone versions of the SSE2 and SSSE3 for temporal-only that also process all pixels. The MMX code processes the top and bottom row but continues to skip the first and last 4 pixels of each row.
  • Because the SSE2 version uses floating-point for the average, its results are occasionally off by 1 from the pixel values the other versions give (due to rounding). This isn't that big of a deal, though, because FluxSmooth's regular average calculation is itself off by 1 from the true average every once in a while. However, I still felt that this was worth mentioning.
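
To make the off-by-1 point concrete, here is a minimal sketch that compares three ways of averaging pixel sums: an exact rounded division, a fixed-point reciprocal multiply, and a single-precision float reciprocal. This is not FluxSmooth's actual code; the function names, the 16.16 reciprocal and the sample counts are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>

// Exact average of `sum` over `n` samples, rounded half up.
int avg_exact(int sum, int n) { return (2 * sum + n) / (2 * n); }

// Fixed-point: multiply by a rounded 16.16 reciprocal instead of dividing.
// (Hypothetical scheme, only meant to show where rounding error creeps in.)
int avg_fixed(int sum, int n) {
    const int recip = (65536 + n / 2) / n;   // round(65536 / n)
    return (sum * recip + 32768) >> 16;
}

// Floating-point: multiply by a single-precision reciprocal.
int avg_float(int sum, int n) {
    const float recip = 1.0f / static_cast<float>(n);
    return static_cast<int>(static_cast<float>(sum) * recip + 0.5f);
}

struct RoundingReport {
    int  max_fixed_err;     // largest |fixed - exact| seen
    int  max_float_err;     // largest |float - exact| seen
    long fixed_mismatches;  // how often fixed differs from exact
    long float_mismatches;  // how often float differs from exact
};

// Try every 8-bit pixel sum for 1..max_n samples.
RoundingReport compare_averages(int max_n) {
    assert(max_n >= 1 && max_n <= 16);       // keeps sum * recip inside int range
    RoundingReport r = {0, 0, 0, 0};
    for (int n = 1; n <= max_n; ++n) {
        for (int sum = 0; sum <= 255 * n; ++sum) {
            const int e  = avg_exact(sum, n);
            const int dx = std::abs(avg_fixed(sum, n) - e);
            const int df = std::abs(avg_float(sum, n) - e);
            r.max_fixed_err = std::max(r.max_fixed_err, dx);
            r.max_float_err = std::max(r.max_float_err, df);
            if (dx != 0) ++r.fixed_mismatches;
            if (df != 0) ++r.float_mismatches;
        }
    }
    return r;
}
```

Calling `compare_averages(11)` covers up to 11 samples per pixel. Both approximations stay within 1 of the exact result over that range, because their error before rounding is far smaller than 1; they just land on the other side of a rounding boundary now and then.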

Now, on to the speed gains. I've only done some very brief speed testing with avs2avi64. I'd really love to get some feedback on the speed from other users with other hardware. I've got a Core 2 Quad Q6600 (65nm), and I tested running at stock clock speed, 2.4GHz. I tested on an Xvid avi to get a more realistic scenario, so FluxSmooth doesn't have the CPU cache all to itself.

Xvid (no b-frames, no Qpel)
720 x 480
46580 frames
232MB

Empty:
1:15.3 (618.97 fps)
1:15.0 (621.03 fps)

C:
4:49.0 (161.17 fps)
4:51.0 (160.05 fps)

MMX:
4:23.5 (176.76 fps)
4:25.3 (175.59 fps)

SSE2:
3:23.5 (228.87 fps)
3:23.8 (228.58 fps)

SSSE3:
2:41.8 (287.96 fps)
2:44.3 (283.58 fps)

If you're wondering what contribution the FP code makes to the SSE2 version's time (I was), I also made an SSSE3 version that uses the same FP code for the average, but tuned for Core 2 instead of Athlon 64. Looks like the FP code makes up a substantial portion of the time:

SSSE3 w/ FP:
3:00.3 (258.40 fps)
3:00.8 (257.69 fps)

For testing I've made a version of the plugin named FluxSmoothTest.dll that includes all of the different filter versions for YV12, and an extra parameter to choose which to use. The parameter is an integer called "opt": 0 = C code, 1 = MMX, 2 = SSE2, 3 = SSSE3. If your CPU doesn't support the instruction set you chose, it defaults back to the C code. It does not fall back to the next-best optimized version; this way you'll know for sure which version is being run.
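
The selection rule described above could be sketched roughly like this. It is only an illustration of the policy (unsupported choice means plain C, never the next-best version); the enum, the capability bits and `select_impl` are hypothetical names, not taken from the plugin source.

```cpp
#include <cassert>

// Values match the documented "opt" parameter: 0 = C, 1 = MMX, 2 = SSE2, 3 = SSSE3.
enum class Impl { C = 0, MMX = 1, SSE2 = 2, SSSE3 = 3 };

// Hypothetical capability bits, as some CPU detection routine might report them.
constexpr unsigned HAS_MMX   = 1u << 0;
constexpr unsigned HAS_SSE2  = 1u << 1;
constexpr unsigned HAS_SSSE3 = 1u << 2;

// Honour the requested version only if the CPU supports it; otherwise use
// the C code directly, so the user always knows which version ran.
Impl select_impl(int opt, unsigned caps) {
    assert(opt >= 0 && opt <= 3);
    switch (opt) {
        case 1:  return (caps & HAS_MMX)   ? Impl::MMX   : Impl::C;
        case 2:  return (caps & HAS_SSE2)  ? Impl::SSE2  : Impl::C;
        case 3:  return (caps & HAS_SSSE3) ? Impl::SSSE3 : Impl::C;
        default: return Impl::C;
    }
}
```

For example, `select_impl(3, HAS_MMX | HAS_SSE2)` yields `Impl::C` rather than `Impl::SSE2`, which is the behaviour described for the test dll.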

I liked the way RemoveGrain did its dlls, so I did the same here. There's a dll that contains only the SSE2 code and another with only the SSSE3 code, so they're much smaller. They throw an error if your CPU doesn't support the required instructions. That won't be an issue with the SSE2 because all 64-bit x86 chips have it. If people really want it, I can make a dll with all optimized versions that chooses the best one your CPU supports.
Attached Files
File Type: 7z FluxSmooth SSE DLLs.7z (53.1 KB, 1521 views)
File Type: 7z FluxSmooth_x64_code.7z (38.0 KB, 516 views)
Old 2nd December 2010, 08:18   #2  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,537
Thank you! Waiting for attachment approval to test them.
__________________
@turment on Telegram
Old 2nd December 2010, 14:46   #3  |  Link
TheFluff
Excessively jovial fellow
 
Join Date: Jun 2004
Location: rude
Posts: 1,100
Quote:
Originally Posted by Prettz View Post
I liked the way RemoveGrain did its dlls, so I did the same here. There's a dll that contains only the SSE2 code and another with only the SSSE3 code, so they're much smaller. They throw an error if your CPU doesn't support the required instructions. That won't be an issue with the SSE2 because all 64-bit x86 chips have it. If people really want it, I can make a dll with all optimized versions that chooses the best one your CPU supports.
I never understood why people seem to think this is a good idea. As long as it doesn't have to be cross-platform, detection of CPU capabilities is trivial and reliable, so why dump the effort on the user who is likely to get it wrong anyway? I find the space saving argument to be highly irrelevant in this day and age (whoa, you might save almost a whole megabyte!).
Old 2nd December 2010, 18:34   #4  |  Link
Limit
Registered User
 
Join Date: Oct 2005
Location: .DE
Posts: 15
Thank you for your efforts.


Quote:
Originally Posted by Prettz View Post
None of AMD's current chips support SSSE3, annoyingly.
All current AMD chips support SSE3.
Old 2nd December 2010, 18:44   #5  |  Link
Hagbard23
23sKiDdOo!
 
Join Date: May 2010
Location: Germany
Posts: 182
Quote:
I never understood why people seem to think this is a good idea. As long as it doesn't have to be cross-platform, detection of CPU capabilities is trivial and reliable, so why dump the effort on the user who is likely to get it wrong anyway? I find the space saving argument to be highly irrelevant in this day and age (whoa, you might save almost a whole megabyte!).
I don't think that is a good idea... I liked the RemoveGrain way: single dlls for every single CPU optimization. Prettz, I appreciate that you followed this approach.

and BTW: Thanx for your efforts...
Old 2nd December 2010, 20:27   #6  |  Link
Motenai Yoda
Registered User
 
Join Date: Jan 2010
Posts: 709
Quote:
Originally Posted by Limit View Post
Thank you for your efforts.

All current AMD chips support SSE3.
SSSE3 and SSE3 aren't the same
Old 2nd December 2010, 22:59   #7  |  Link
Prettz
easily bamboozled user
 
Join Date: Sep 2002
Location: Atlanta
Posts: 373
Quote:
Originally Posted by TheFluff View Post
I never understood why people seem to think this is a good idea. As long as it doesn't have to be cross-platform, detection of CPU capabilities is trivial and reliable, so why dump the effort on the user who is likely to get it wrong anyway? I find the space saving argument to be highly irrelevant in this day and age (whoa, you might save almost a whole megabyte!).
I like it because you know for sure which optimization is being used. And so you know that none of your time is being wasted with a slower version without you knowing it. The SSE-only dlls don't even contain the C++ version, so you know what's being used (and you know that I didn't screw up the detection code).

Quote:
Originally Posted by Motenai Yoda View Post
SSSE3 and SSE3 aren't the same
Yeah, it's like Intel went out of its way to make it confusing. I made the error message for the detection code in the SSSE3-only dll say it "requires SSSE3 (not just SSE3)".

SSE3 is almost all floating-point instructions. SSSE3 (Supplemental SSE3) is all integer instructions.
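
The two extensions are also reported through separate CPUID feature bits, which is why a chip can have one without the other. Below is a minimal sketch of a check for both, assuming an x86 build with either MSVC or GCC/Clang; `CpuCaps`, `caps_from_cpuid` and `detect_caps` are names made up for this example and are not from the plugin.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

enum CpuCaps : unsigned {
    CAP_SSE2  = 1u << 0,
    CAP_SSE3  = 1u << 1,
    CAP_SSSE3 = 1u << 2
};

// Decode the feature bits returned by CPUID leaf 1.
//   EDX bit 26 = SSE2, ECX bit 0 = SSE3, ECX bit 9 = SSSE3
unsigned caps_from_cpuid(std::uint32_t ecx, std::uint32_t edx) {
    unsigned caps = 0;
    if (edx & (1u << 26)) caps |= CAP_SSE2;
    if (ecx & (1u << 0))  caps |= CAP_SSE3;
    if (ecx & (1u << 9))  caps |= CAP_SSSE3;
    return caps;
}

// Query the running CPU. On non-x86 builds this reports no capabilities.
unsigned detect_caps() {
    std::uint32_t ecx = 0, edx = 0;
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);
    ecx = static_cast<std::uint32_t>(regs[2]);
    edx = static_cast<std::uint32_t>(regs[3]);
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid(1, &a, &b, &c, &d)) { ecx = c; edx = d; }
#endif
    return caps_from_cpuid(ecx, edx);
}

// True only when SSSE3 itself is present; SSE3 alone is not enough.
bool has_ssse3(unsigned caps) { return (caps & CAP_SSSE3) != 0; }
```

A CPU that reports SSE3 but not SSSE3 (ECX bit 0 set, bit 9 clear) makes `has_ssse3` return false, which is the case the "requires SSSE3 (not just SSE3)" error message is meant for.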
Old 2nd December 2010, 23:35   #8  |  Link
Hagbard23
23sKiDdOo!
 
Join Date: May 2010
Location: Germany
Posts: 182
Could one of the (s)mods finally approve the attachment? It can't be that one has to wait 2 days for approval, and this is not the first time it has happened. What's wrong? Looking at the online times of our mods shows that several of them were online during these two days...

please unlock the attachment soon...
Old 2nd December 2010, 23:40   #9  |  Link
cretindesalpes
͡҉҉ ̵̡̢̛̗̘̙̜̝̞̟̠͇̊̋̌̍̎̏̿̿
 
Join Date: Feb 2009
Location: No support in PM
Posts: 712
Quote:
Originally Posted by Prettz View Post
you know that none of your time is being wasted
I often move my plugins directory to different computers, with different CPUs, or share it with other people to have a common, up to date base to make our scripts run correctly. Obviously, checking each dll version and substituting the files is a big PITA. Therefore I end up using the lowest common denominator, SSE2 or even plain C++.

A good solution would be autodetection with an optional parameter to override it, like for example the "opt" parameter in most of tritical's plug-ins.
__________________
dither 1.28.1 for AviSynth | avstp 1.0.4 for AviSynth development | fmtconv r30 for Vapoursynth & Avs+ | trimx264opt segmented encoding
Old 3rd December 2010, 00:36   #10  |  Link
ncatt
Registered User
 
Join Date: Oct 2006
Posts: 10
Hi Prettz, thank you for the great job! Can you put the files (temporarily) in another host, please? Mediafire maybe?
Old 3rd December 2010, 04:30   #11  |  Link
Prettz
easily bamboozled user
 
Join Date: Sep 2002
Location: Atlanta
Posts: 373
The attachments have been approved now. I'll compile a dll that autodetects which optimization to use once we know there are no bugs in the YV12 code. For now there's no need; it needs testing first and foremost.

For a lot of testing I used a script like this:
Code:
LoadPlugin("C:\Program Files (x86)\Avisynth 2.5\plugins64\FluxSmoothTest.dll")

s = AviSource("E:\metropolis\test3.avi")

fc = s.FluxSmoothST(12,10,0)
f  = s.FluxSmoothST(12,10,2)

cu = fc.UToY()
cv = fc.VToY()

fu = f.UToY()
fv = f.VToY()

y = Overlay(fc, f, mode="Difference", pc_range=true).ColorYUV(autogain=true,cont_u=1024,cont_v=1024)
u = Overlay(cu, fu, mode="Difference", pc_range=true).ColorYUV(autogain=true)
v = Overlay(cv, fv, mode="Difference", pc_range=true).ColorYUV(autogain=true)

StackHorizontal(y, StackVertical(u, v))
But this isn't that great. I still needed to step through manually in virtualdub with my eyes glued to the screen. I'm sure some filter developer out there has some kind of batch tool to comb through a video and print out differences to a window or a file. Trying to tell if the SSE2 version is correct or not is inherently difficult because even when it's working perfectly many pixels on each frame will still be off by 1 from the C version.
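
One way to avoid eyeballing every frame would be to compare the two outputs numerically with a tolerance, so that the expected off-by-1 pixels are counted separately from real errors. The following is only a sketch of that idea for a single 8-bit plane; `DiffStats` and `compare_planes` are hypothetical names, and getting the frame pointers and pitches out of Avisynth is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

struct DiffStats {
    std::size_t exact;       // pixels that match exactly
    std::size_t within_tol;  // pixels that differ, but by no more than the tolerance
    std::size_t beyond_tol;  // pixels that differ by more than the tolerance
    int         max_diff;    // largest absolute difference seen
};

// Compare two 8-bit planes row by row. The pitches are in bytes and may be
// larger than the width, as is usual for video frame buffers.
DiffStats compare_planes(const std::uint8_t* a, int pitch_a,
                         const std::uint8_t* b, int pitch_b,
                         int width, int height, int tolerance) {
    assert(a != nullptr && b != nullptr);
    assert(width >= 0 && height >= 0 && tolerance >= 0);
    DiffStats s = {0, 0, 0, 0};
    for (int y = 0; y < height; ++y) {
        const std::uint8_t* row_a = a + static_cast<std::ptrdiff_t>(y) * pitch_a;
        const std::uint8_t* row_b = b + static_cast<std::ptrdiff_t>(y) * pitch_b;
        for (int x = 0; x < width; ++x) {
            const int d = std::abs(static_cast<int>(row_a[x]) - static_cast<int>(row_b[x]));
            if (d == 0)              ++s.exact;
            else if (d <= tolerance) ++s.within_tol;
            else                     ++s.beyond_tol;
            if (d > s.max_diff) s.max_diff = d;
        }
    }
    return s;
}
```

With `tolerance = 1`, a frame where `beyond_tol` is nonzero is worth logging for a closer look, while frames that only have `within_tol` differences are consistent with the rounding behaviour described for the SSE2 version.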

One other thing I forgot to mention: the YUY2 code is completely untouched. The "opt" parameter in the test dll doesn't affect it; it'll autodetect MMX just like always.
Old 3rd December 2010, 21:51   #12  |  Link
Mr VacBob
Registered User
 
Join Date: Feb 2005
Posts: 140
Quote:
Originally Posted by Prettz View Post
I like it because you know for sure which optimization is being used. And so you know that none of your time is being wasted with a slower version without you knowing it.
How would you not know it? You'd have to accidentally install the wrong CPU. CPU detection can fail if you implement it wrong, but hopefully it wouldn't get past you into a released binary.
Old 4th December 2010, 00:16   #13  |  Link
kemuri-_9
Compiling Encoder
 
Join Date: Jan 2007
Posts: 1,348
Avisynth already offers a way to get the CPU capabilities without needing to write your own detection algorithm: the GetCPUFlags() function on the IScriptEnvironment class. The flags are also defined in avisynth.h.
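
As a rough sketch of how those flags could drive autodetection with an optional override: the function below takes a flags value and a requested "opt" and returns the version to run. The constants are local stand-ins named after the CPUF_* flags in avisynth.h; their numeric values are written from memory and should be checked against your copy of the header. `choose_opt` and the -1 = auto convention are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>

// Local stand-ins for the CPUF_* flags from avisynth.h.
// ASSUMPTION: values recalled from the header; verify before relying on them.
constexpr long kCPUF_MMX   = 0x04;
constexpr long kCPUF_SSE2  = 0x20;
constexpr long kCPUF_SSSE3 = 0x200;

// Same numbering as the "opt" parameter, plus -1 meaning "pick for me".
enum Opt { OPT_AUTO = -1, OPT_C = 0, OPT_MMX = 1, OPT_SSE2 = 2, OPT_SSSE3 = 3 };

// Best version the CPU can run, judged only by the flags passed in.
int best_supported(long cpu_flags) {
    if (cpu_flags & kCPUF_SSSE3) return OPT_SSSE3;
    if (cpu_flags & kCPUF_SSE2)  return OPT_SSE2;
    if (cpu_flags & kCPUF_MMX)   return OPT_MMX;
    return OPT_C;
}

// Autodetect by default; an explicit request is honoured as long as the CPU
// supports it, and is otherwise lowered to the best supported version.
int choose_opt(long cpu_flags, int requested) {
    assert(requested >= OPT_AUTO && requested <= OPT_SSSE3);
    const int best = best_supported(cpu_flags);
    if (requested == OPT_AUTO) return best;
    return std::min(requested, best);
}
```

Inside a plugin, `cpu_flags` would come from `env->GetCPUFlags()`. Note that this lowers an unsupported request to the next-best version, which is a different policy from the test dll's fall back to C.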
__________________
custom x264 builds & patches | F@H | My Specs
Old 20th January 2012, 23:28   #14  |  Link
ryrynz
Registered User
 
Join Date: Mar 2009
Posts: 3,645
Would love to see this continued and updated with a 32bit version.
Old 25th April 2012, 22:06   #15  |  Link
TheProfileth
Leader of Dual-Duality
 
Join Date: Aug 2010
Location: America
Posts: 134
BTW is there a reason why nobody has ported the updated version to 32bit yet? Are some of the new key factors limited to 64bit functionality? Or is this more along the lines that the original creator has disappeared and everyone who could actually make this happen is busy or working on something else?
__________________
I'm Mr.Fixit and I feel good, fixin all the sources in the neighborhood
My New filter is in the works, and will be out soon
Old 1st May 2012, 03:20   #16  |  Link
ryrynz
Registered User
 
Join Date: Mar 2009
Posts: 3,645
I think that's it exactly. Considering how useful and popular this filter is I'm a little surprised nobody has jumped on it, though it'll happen eventually.
Old 16th December 2013, 23:07   #17  |  Link
jackoneill
unsigned int
 
Join Date: Oct 2012
Location: 🇪🇺
Posts: 760
At a glance, the new code makes use of the extra general-purpose and xmm registers available only in 64 bit mode. Modifying it to use only the registers available in 32 bit mode is probably non-trivial, if it's possible at all.
__________________
Buy me a "coffee" and/or hire me to write code!
Old 26th February 2019, 13:27   #18  |  Link
silverwing
Registered User
 
Join Date: Nov 2017
Location: Russia, Nizhny Novgorod
Posts: 25
I modified the code a bit (for Avisynth+ 64bit).
Added
Code:
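static IScriptEnvironment * AVSenvironment;

// Linkage table pointer used by the Avisynth+ plugin interface; set in AvisynthPluginInit3.
const AVS_Linkage * AVS_linkage = nullptr;

extern "C" __declspec (dllexport) const char * __stdcall AvisynthPluginInit3 (IScriptEnvironment * env, const AVS_Linkage * const vectors)
{
    AVS_linkage = vectors;   // store the linkage table before using the Avisynth API
    AVSenvironment = env;
    // ... (rest of the plugin's init code, not shown in the post)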
Compiled for SSE2 and SSSE3 using MS VS 2015 and ICL 2016 (Multi-threaded DLL with TBB).
If anyone is interested, I can post it on the forum (with source code and MS VS project).

Checked - it works.
Old 1st March 2019, 12:27   #19  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,537
Quote:
Originally Posted by silverwing View Post
If anyone is interested. I can post it on the forum (with source code and MS VS project).
Of course we are ALL interested.
__________________
@turment on Telegram
Old 26th March 2019, 19:41   #20  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,537
Quote:
Originally Posted by silverwing View Post
If anyone is interested. I can post it on the forum (with source code and MS VS project).
Would you?
__________________
@turment on Telegram