Old 10th March 2010, 01:35   #81  |  Link
JoshyD
Registered User
 
Join Date: Feb 2010
Posts: 84
@turbojet

Would you try the latest release on the first page to see if that cleared up any of the problems?

If that doesn't, would you try the following two procedures:
1. Only resize on the vertical axis
2. Only resize on the horizontal axis

Hopefully, only one of those will crash the program, if at all.

I'm guessing DirectShowSource is giving you a yv12 stream?

Last edited by JoshyD; 10th March 2010 at 01:38.
Old 10th March 2010, 02:11   #82  |  Link
turbojet
Registered User
 
Join Date: May 2008
Posts: 1,840
With same 1920x1080 source
LanczosResize(1280,720) - crash
LanczosResize(1920,720) - crash
LanczosResize(1280,1080) - works

yes yv12
Old 10th March 2010, 03:43   #83  |  Link
JoshyD
Registered User
 
Join Date: Feb 2010
Posts: 84
@turbojet
I can't seem to recreate the crash. I specifically wrote the vertical resize functions to check for memory alignment before executing. I hope this isn't processor specific, that would be a bummer. Checking for instruction support, your Athlon II should have all the goods to do the resize correctly. Any chance you could snip a few frames (100?) of the source and post it somewhere so I can investigate further?
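
For readers following along, the kind of alignment check described here can be sketched roughly as below. This is only an illustration, not the actual resizer code: the names `is_aligned` and `pick_row_path` are made up, and 16 bytes is assumed because that is what aligned 128-bit SSE loads and stores require.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if ptr sits on an `alignment`-byte boundary.
// `alignment` must be a power of two (16 for aligned 128-bit SSE access).
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// Hypothetical code paths a resizer could choose between.
enum class RowPath { AlignedSSE, UnalignedFallback };

// Take the aligned path only if both base pointers and both pitches are
// multiples of 16, so that the start of every row stays aligned.
inline RowPath pick_row_path(const std::uint8_t* src, int src_pitch,
                             std::uint8_t* dst, int dst_pitch) {
    const bool ok = is_aligned(src, 16) && is_aligned(dst, 16) &&
                    (src_pitch % 16 == 0) && (dst_pitch % 16 == 0);
    return ok ? RowPath::AlignedSSE : RowPath::UnalignedFallback;
}
```

Checking the pitch as well as the base pointer matters because an aligned first row does not guarantee aligned later rows if the pitch is not itself a multiple of 16.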

@Stephen
That cache bug is annoying . . . but interesting. It seems that it also exists in SEt's 32bit build of avisynth 2.5.8 as well, perhaps something got all strange when the MT mode was hacked to be supported?

Last edited by JoshyD; 10th March 2010 at 04:19.
Old 10th March 2010, 06:31   #84  |  Link
turbojet
Registered User
 
Join Date: May 2008
Posts: 1,840
Every source crashes instantly with horizontal resize, so I don't think a sample would help, and I'm afraid it's a CPU instruction issue. One thing that might help: x264 doesn't use SSE3 on this CPU. This is what it uses: MMX2 SSE2Fast FastShuffle SSEMisalign LZCNT. Also, a few months ago I was testing icc x264 builds and found some that worked and some that didn't. I believe the ones that didn't work were compiled with -QaxSSE3, but I'm not 100% sure on that.

Last edited by turbojet; 10th March 2010 at 06:34.
Old 10th March 2010, 14:16   #85  |  Link
kemuri-_9
Compiling Encoder
 
Join Date: Jan 2007
Posts: 1,348
Quote:
Originally Posted by turbojet View Post
Every source crashes instantly with horizontal resize, so I don't think a sample would help, and I'm afraid it's a CPU instruction issue. One thing that might help: x264 doesn't use SSE3 on this CPU. This is what it uses: MMX2 SSE2Fast FastShuffle SSEMisalign LZCNT. Also, a few months ago I was testing icc x264 builds and found some that worked and some that didn't. I believe the ones that didn't work were compiled with -QaxSSE3, but I'm not 100% sure on that.
What CPU do you have again (codename preferred)?

SSE3 is barely used in x264; the majority of the SSE3-related asm actually uses SSSE3.
That being said, x264 does have some SSE3 asm, but it only uses it for CPUs that flag Cacheline64, which is only set on Intel processors.

tl;dr
AMD processors never see SSE3 getting used in x264 even if they have it.
Old 10th March 2010, 17:11   #86  |  Link
Stephen R. Savage
Registered User
 
Join Date: Nov 2009
Posts: 327
Quote:
Originally Posted by JoshyD View Post
@turbojet
I can't seem to recreate the crash. I specifically wrote the vertical resize functions to check for memory alignment before executing. I hope this isn't processor specific, that would be a bummer. Checking for instruction support, your Athlon II should have all the goods to do the resize correctly. Any chance you could snip a few frames (100?) of the source and post it somewhere so I can investigate further?

@Stephen
That cache bug is annoying . . . but interesting. It seems that it also exists in SEt's 32bit build of avisynth 2.5.8 as well, perhaps something got all strange when the MT mode was hacked to be supported?
Here is AAA.avsi in full:

Code:
function AAA(clip input, int "type", bool "mask", bool "chroma")
{
	ox = width(input)
	oy = height(input)

	type = default(type, 1)
	mask = default(mask, true)
	chroma = default(chroma, false)

	gscale = chroma ? input : Greyscale(input)

	aa = type >= 2 ? nnedi2_rpow2(gscale, rfactor=2) :
		\ type == 1 ? TurnRight(gscale).EEDI2(field=1).TurnLeft().EEDI2(field=1) :
		\ PointResize(gscale, ox * 2, oy * 2).TurnRight().SangNom().TurnLeft().SangNom()

	edge = mt_didee(aa).Spline36Resize(ox, oy, -0.5, -0.5, 2 * ox, 2 * oy)
	ds = Spline36Resize(aa, ox, oy, -0.5, -0.5, 2 * ox, 2 * oy)
	maskmerge = mask ? mt_merge(input, ds, edge, U=1, V=1) : ds

	return chroma ? MergeChroma(maskmerge, ds) : MergeChroma(maskmerge, input)
}

function mt_didee(clip input)
{
	mask = mt_logic(mt_edge(input, "5 10 5 0 0 0 -5 -10 -5 4", 0, 255, 0, 255),
		\ mt_edge(input, "5 0 -5 10 0 -10 5 0 -5 4", 0, 255, 0, 255), "max").Greyscale().
		\ Levels(0, 0.8, 128, 0, 255, false)
	return mask
}
I just call it as
Quote:
DirectShowSource("source.avi")
AAA()
adding "chroma=true" makes the cache bug even worse. I don't use MT myself, so this cache bug is quite a bummer. Hope you can find it, even if it's not a bug in your own code. Perhaps it will even turn out to be an intractable design flaw (argh).

Also, many older AMDs do not support SSE3, but only up to SSE2 (slowly).

Last edited by Stephen R. Savage; 10th March 2010 at 17:52.
Old 10th March 2010, 17:32   #87  |  Link
Wilbert
Moderator
 
Join Date: Nov 2001
Location: Netherlands
Posts: 6,364
Quote:
Also, many older AMDs do not support SSE3, but only up to SSE2 (slowly).
My old Athlon XP doesn't support SSE2, but only iSSE (or perhaps SSE dunno).
Old 10th March 2010, 17:48   #88  |  Link
Stephen R. Savage
Registered User
 
Join Date: Nov 2009
Posts: 327
According to Wikipedia, the oldest AMD64 CPU (Opteron, 130nm) supports SSE2. However, turbojet's Athlon II appears to support SSE3, so who knows...
Old 10th March 2010, 18:32   #89  |  Link
JoshyD
Registered User
 
Join Date: Feb 2010
Posts: 84
@Stephen
You're quite correct, which is why I was confused. I don't think his processor agrees with loading the values of some of the arithmetic functions straight from memory. For now, I'm going to band-aid the code paths to use MMX if a non-compatible CPU turns up. I had totally forgotten that the Athlon 64s only supported SSE2. I was happily thinking that much of the feature checking before function execution was going to go away because x64 processors generally have the latest and greatest when it comes to SIMD instructions. Looks like that's not the case. I've got an old Athlon 64 I rarely use, so I guess I'll be turning it on to run test vectors before any future code gets loose.

It's weird because he can use the horizontal resize functions, which make heavy use of 128-bit memory transfers to set up their workspaces. Those transfers use movdqa and movdqu, which are SSE2 instructions, and they are most certainly used when resizing YV12 or YUY2 horizontally.

The cache issue remains open, and I'm able to perfectly recreate it using any source. Looking intensively over a diff with the core Avisynth files hasn't turned up anything of interest yet. This test case seems to be the only one that really highlights the problem. Running other filters, internal and external, with and without a SetMTMode command doesn't give such a stark contrast in performance. I'm wondering a) why this filter combination is a showstopper and b) what script environment variables aren't set unless you allow a MT mode to be set. The largest difference I could find is:
Code:
if ((env->GetMTMode(false) > 0) && (env->GetMTMode(false) < 5)) {
  filter_graph = new CacheMT1(new Distributor(filter_graph, env), env);
}
else {
  filter_graph = Cache::Create_Cache(AVSValue(filter_graph), 0, env).AsClip();
}
This occurs on script instantiation. I'll keep looking. On a sidenote, building and using the current 2.6 allows MT mode to be set, but the performance is abysmal. All non-MT functionality is perfect though.

@kemuri-_9
Thanks for the heads up on the x264 methodology of choosing instructions based on cacheline size. This makes me wonder if turbojet's CPU will return "true" when asked if it supports SSE3. If that's the case, then I'll have to add in code (yank it from the x264 cpuid functions) that checks for cache line size, and goes about its business appropriately.

Looking at the 2.6 CVS, it dynamically assembles almost identical code to what I have written for the vertical resizers. It does so based upon a check for SSE3 and SSSE3. If it's the case that Athlon II's and their brethren say they have SSE3, but fail when executing these SSE ops, the main code branch may hit a similar snag.
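
To make the SSE3 question concrete, here is a rough sketch of decoding the CPUID leaf 1 feature bits and choosing a code path with a fallback. The bit positions are the documented ones (EDX bit 23 MMX, bit 25 SSE, bit 26 SSE2; ECX bit 0 SSE3, bit 9 SSSE3). Everything else, including the names `CpuFeatures`, `decode_leaf1` and `choose_resize_path`, is hypothetical and is not how Avisynth or x264 is actually structured. The register values are passed in rather than read with a cpuid intrinsic so the logic can be checked on any machine.

```cpp
#include <cstdint>

// Feature flags decoded from CPUID leaf 1 (EAX = 1).
struct CpuFeatures {
    bool mmx;
    bool sse;
    bool sse2;
    bool sse3;
    bool ssse3;
};

// Decode the raw ECX/EDX register values returned by CPUID leaf 1.
inline CpuFeatures decode_leaf1(std::uint32_t ecx, std::uint32_t edx) {
    CpuFeatures f;
    f.mmx   = ((edx >> 23) & 1u) != 0;
    f.sse   = ((edx >> 25) & 1u) != 0;
    f.sse2  = ((edx >> 26) & 1u) != 0;
    f.sse3  = ((ecx >> 0) & 1u) != 0;
    f.ssse3 = ((ecx >> 9) & 1u) != 0;
    return f;
}

// Hypothetical set of resizer implementations, best first.
enum class ResizePath { SSSE3, SSE3, SSE2, MMX, PlainC };

// Pick the most capable path the CPU reports, falling back step by step.
inline ResizePath choose_resize_path(const CpuFeatures& f) {
    if (f.ssse3) return ResizePath::SSSE3;
    if (f.sse3)  return ResizePath::SSE3;
    if (f.sse2)  return ResizePath::SSE2;
    if (f.mmx)   return ResizePath::MMX;
    return ResizePath::PlainC;
}
```

A check like this only tells you what the CPU reports. If a processor sets the SSE3 bit and the SSE3 path still crashes, the flag test is not the culprit and the code path itself needs tracing.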

@turbojet:
Would you mind running CPU-z and telling me what feature flags your CPU reports? If it says SSE3, then dang, more code to write.


EDIT:

Caching bug squashed. Turns out, EEDI2 identifies itself as never wanting a cache, so when your script asks for a frame generated by EEDI2, it goes all the way back and generates it again.
Code:
  if (h_policy == CACHE_NOTHING) { // don't want a cache. Typically filters that only ever seek forward.
    __asm mov rbx,rbx  // Hack! prevent compiler from trusting ebx contents across call
    return childGetFrame(n, env);
  }
That childGetFrame executes more times than needed when your AAA script is run. This is the cause of the huge slowdown. Can anyone think of an instance where we would universally never want to save the previously generated frames? Check the main page for an update. It also *tries* to address the problems with older processors, but I don't think it's all the way there yet.

Last edited by JoshyD; 10th March 2010 at 20:34.
Old 10th March 2010, 23:15   #90  |  Link
turbojet
Registered User
 
Join Date: May 2008
Posts: 1,840
It's an Athlon II 620. CPU instructions are: MMX(+) 3DNow(+) SSE(1,2,3,4A) x86-x64, AMD-V, which I'm pretty sure is identical to Phenom II. At least Linux /proc/cpuinfo has identical flags.

Avisynth 2.6 Alpha 2 works fine here, and if a fallback is needed, wouldn't it be better to drop back to SSE(2) instead? IIRC that's as far as the 2.6 branch goes, and it's a significant speedup.
Old 11th March 2010, 00:04   #91  |  Link
Stephen R. Savage
Registered User
 
Join Date: Nov 2009
Posts: 327
Updated results for AAA (200 frames):

32-bit: 1.85 fps
64-bit: 1.90 fps
Relative Performance: 102.7%

A bit disappointing, since I recalled EEDI2 being 10% faster and masktools2 similarly faster. How did you fix the caching bug? Did you enforce a cache for all filters? How does the original Avisynth code handle this?

Also, here's to hoping that you manage to get TIVTC ported one day, as it's one of the Greatest Filters Of All Time™.
Old 11th March 2010, 03:45   #92  |  Link
JoshyD
Registered User
 
Join Date: Feb 2010
Posts: 84
@Stephen
The quick fix was to just *allow* a filter to be cached. EEDI2 was requesting not to be cached at all for some reason or another. Before, if the policy was to not cache a filter, it would go to the filter's get frame function and return that, and no cache-related code would be executed. By disallowing the insta-return and allowing it to also check / insert the frame into the cache, AAA starts finding the frames EEDI2 previously generated, and uses those, rather than making EEDI2 do the work all over again.
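
As a toy model of the behaviour described above (not the real Avisynth Cache class), the sketch below wraps a frame generator in a small cache and counts how often the expensive child call runs. `Frame`, `ToyCache` and the `honor_no_cache` switch are invented for illustration. With the switch on, a filter that asks for no caching is re-run on every request, which is the slowdown seen with EEDI2; with it off, repeated requests are served from the cache.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

using Frame = int;  // stand-in for a real video frame

class ToyCache {
public:
    ToyCache(std::function<Frame(int)> child, bool child_wants_no_cache,
             bool honor_no_cache)
        : child_(std::move(child)),
          child_wants_no_cache_(child_wants_no_cache),
          honor_no_cache_(honor_no_cache) {
        assert(child_ && "child generator must be callable");
    }

    Frame GetFrame(int n) {
        // Old behaviour: return straight from the child, never touching the cache.
        if (child_wants_no_cache_ && honor_no_cache_) {
            ++child_calls_;
            return child_(n);
        }
        // Fixed behaviour: look the frame up first, generate only on a miss.
        auto it = frames_.find(n);
        if (it != frames_.end()) return it->second;
        ++child_calls_;
        Frame f = child_(n);
        frames_.emplace(n, f);  // unbounded here; a real cache would limit memory
        return f;
    }

    int child_calls() const { return child_calls_; }

private:
    std::function<Frame(int)> child_;
    bool child_wants_no_cache_;
    bool honor_no_cache_;
    std::map<int, Frame> frames_;
    int child_calls_ = 0;
};
```

Requesting frames 0 to 3 three times over costs twelve child calls when the no-cache request is honored and four when it is ignored, at the price of holding those four frames in memory.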

I'm not sure what the difference is between SEt's build and the standard build that creates the slowdown. The MTModes are hacked into the main program, so there are some incongruities in the code. I think the internal caching mechanism is being polished little by little for 2.6, so this will be cleared up in the end. For now, the fix uses a little extra memory in certain corner cases. In general, a cache, regardless of whether or not the filter was written to only seek forward, isn't a bad idea. You can write a script to access the source in any pattern you like, really. If the accesses repeat, the script will be faster.

Also, a comment on the speed differences in AAA, over 3 runs of 250 frames, I have:
32-bit: 3.28fps
64-bit: 3.69fps
Relative performance: 112.5%

Did you grab the latest build of EEDI2? It used to use the same code for memory copying as Avisynth, which I wrote to work on processors with 128bit registers, instead of 64. The memory copy alone adds a little speed bump.

TIVTC is a beast because of the hodge podge of code it consists of. It also intermingles inline asm with compiler intrinsics as a means of using the XMM registers in some cases. It can get a little ugly. If I can figure out what's causing turbojet's crash, I'll re-examine TIVTC.

@turbojet:
The latest grab of the 2.6CVS has vertical resize code that checks for SSE3 support and then uses the same combination of instructions I do. I wish I had a similar machine to test on, so I could trace during run time. When I said fall back to MMX, I meant the mmx registers. They're half the size of the SSE registers, but technically, iSSE instructions use them. I'd drop back to that code in the case of an incompatible processor. The conundrum is why is your SSE1-4A supporting processor balking at the code? I also wonder what x264 is using to red flag your CPU to restrict it to SSE2, perhaps the vendor ID?

Do the older versions of the DLL let you resize? I hadn't changed ANY of the resize code at that point. I guess that's a good sanity check point. Try this one; the fail-safe is this build of the source. If that doesn't work, there's something deeper behind the problem.

Has anyone been able to get this to run correctly on AMD CPUs?

Last edited by JoshyD; 11th March 2010 at 08:32.
Old 11th March 2010, 04:48   #93  |  Link
kemuri-_9
Compiling Encoder
 
Join Date: Jan 2007
Posts: 1,348
Quote:
Originally Posted by JoshyD View Post
I also wonder what x264 is using to red flag your CPU to restrict it to SSE2, perhaps the vendor ID?
look at common/cpu.c and common/x86/cpu-a.asm
Old 11th March 2010, 15:43   #94  |  Link
osgZach
Registered User
 
Join Date: Feb 2009
Location: USA
Posts: 676
Does anyone have 32 vs 64 bit performance numbers for TempGaussMC_beta2 ? (assuming it runs.. Stephen mentioned TGMC but not which version he's using, I don't think)

Last edited by osgZach; 11th March 2010 at 16:17. Reason: spelling error
Old 11th March 2010, 18:36   #95  |  Link
Stephen R. Savage
Registered User
 
Join Date: Nov 2009
Posts: 327
@JoshyD: I'm using the copy of EEDI2 from the first page, which is dated to 2/19/2010. If there is another version, I am not aware of it.

Last edited by Stephen R. Savage; 11th March 2010 at 19:54.
Old 11th March 2010, 19:19   #96  |  Link
osgZach
Registered User
 
Join Date: Feb 2009
Location: USA
Posts: 676
I was referring more to the AVS(i) script itself. But I'm guessing we're talking about the same thing really.

Either way. I average about 2fps give or take some fractional ups and downs. Takes about 4h:45m to do a 22m:48s clip

Core 2 Duo @ 3.2GHz

What kind of fps do you see on your machine?

And while I have your attention.. any way to get it entered either into the Fieldhint(blah..()) or replace that line entirely, during Yatta AVS generation ?

Sucks to have to manually swap it out on 50 different project files on average
But I suppose nothing a find/replace script can't fix. Still learning bits and pieces about YATTA, but since writing my Find30fps tool it's been so much less trouble.

Last edited by osgZach; 11th March 2010 at 19:23.
Old 11th March 2010, 19:59   #97  |  Link
JoshyD
Registered User
 
Join Date: Feb 2010
Posts: 84
TempGaussMC beta 2 is what Stephen and I were running. Single threaded, it enjoys a healthy ~20% speed increase for me; multiple threads speed up the process by a larger margin. Here are some sample numbers for a 4000 frame run of a dvsd (720x480, 29.97 fps, interlaced bff) source through the script using avs2avi, no compression, outputting to null:

32bit Avisynth: 5.06fps
64bit Avisynth: 6.07fps
Relative speed: 119.96%

Threading the script with SetMTMode(2,8):
32bit Avisynth: 14.47fps
64bit Avisynth: 18.05fps
Relative speed: 124.74%

Tests were run with a Core i5-750 (4 cores, no HT) @ 3.71GHz. The eight thread creation request keeps all 4 cores constantly churning at 100%. Requesting fewer threads reduces total CPU utilization, with 4 threads giving ~50% total usage. Your performance will vary based upon your system setup, obviously.

The versions of Avisynth were both based on SEt's 2.5.8 build with multithreading enabled, for as much of an apples to apples comparison as possible.

Generally, it seems safe to assume ~15-20% performance increase, dependent upon which plugins you want to use.
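
The "relative speed" figures quoted in this thread are simply the 64-bit rate divided by the 32-bit rate, expressed as a percentage. A minimal sketch of that arithmetic, with hypothetical function names:

```cpp
#include <cassert>
#include <cmath>

// Speed of the 64-bit build as a percentage of the 32-bit build.
inline double relative_speed_percent(double fps32, double fps64) {
    assert(fps32 > 0.0);
    return 100.0 * fps64 / fps32;
}

// Round to two decimal places for display.
inline double round2(double value) {
    return std::round(value * 100.0) / 100.0;
}
```

With the single-threaded numbers above, `round2(relative_speed_percent(5.06, 6.07))` gives 119.96, and the threaded pair 14.47 and 18.05 gives 124.74.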

Vertical Sharpen was ported specifically because of its use in TGMC beta 2. The same goes for RemoveGrain and Repair. MaskTools2 and MVTools2 were ported because they're so darned useful. EEDI2 was ported to fill the void of a good 64bit deinterlacer.

The 32 bit versions of these plugins were compiled by myself, with some tweaks here and there as I went. I've been "rolling my own" versions of these for a while now, the 64bit port was an extension of this hobby.
Old 11th March 2010, 20:10   #98  |  Link
osgZach
Registered User
 
Join Date: Feb 2009
Location: USA
Posts: 676
Thanks for your response.

Those are certainly some nice numbers. I wasn't even aware you could run it multi-threaded either. Frankly, the setup process for MT related stuff scared me away; I was afraid I would boink something.
Right now I exclusively use TempGaussMC_beta2, the only other filter being TDecimate, and any other stuff YATTA deems necessary in the generated AVS. I am encoding to HuffYV12 and then filtering later. So hey, maybe I'll get something like 5 or 6 fps? LOL


Perhaps I will go back to the initial post and see if I can follow the steps to do all of this..

Although as far as MT goes.. I only have 2 cores, so I wouldn't expect huge numbers like you got (but really impressive), but I think the x64 single threaded boost might be pretty big as well.

Is there a chance this will all be available via a one-click installer at some point?

Last edited by osgZach; 11th March 2010 at 20:12.
Old 11th March 2010, 20:26   #99  |  Link
Adub
Fighting spam with a fish
 
Join Date: Sep 2005
Posts: 2,699
So, are we currently not able to use TIVTC and most of Tritical's plugins with the 64 bit version of Avisynth? Or is that just because they haven't been converted yet?

If so, I request that we convert as many of Tritical's plugins as possible. Specifically TIVTC and Colormatrix, as I think that those are some of the most often used plugins.
Old 11th March 2010, 20:34   #100  |  Link
osgZach
Registered User
 
Join Date: Feb 2009
Location: USA
Posts: 676
@ Adub

Quote:
Originally Posted by JoshyD View Post
@Stephen


TIVTC is a beast because of the hodge podge of code it consists of. It also intermingles inline asm with compiler intrinsics as a means of using the XMM registers in some cases. It can get a little ugly. If I can figure out what's causing turbojet's crash, I'll re-examine TIVTC.


Hopefully something will happen in the future though. This is a great project, I've been waiting for a long time.. So it is great to see the progress we have already.

I wish I had the skills to contribute.. I barely know what little Python I use as it is...