2nd March 2010, 19:14 | #61 | Link |
Registered User
Join Date: May 2008
Posts: 1,840
|
I was able to get directshowsource working. Loadplugin was pointing to a 32 bit dll.
Also, I made a batch installer that checks for an x64 OS, checks whether Avisynth x86 is installed, copies the files to the Avisynth plugins directory, and adds the registry entries. It's a rough installer, but it takes only one click and gets the job done. Feel free to edit it at will and/or include it with the binaries if you want. It's OK to use the Avisynth x86 plugins directory for x64 DLL autoloading, but the x64 DLLs need to be named differently from the x86 DLLs so that the x86 DLLs can still load. Last edited by turbojet; 9th March 2010 at 23:00. |
3rd March 2010, 00:50 | #62 | Link |
Registered User
Join Date: Nov 2009
Posts: 327
|
@JoshyD: Thank you. I have updated the benchmarks to reflect the more accurate avs2avi methodology. avs2avi allows the accurate measurement of even very fast filters.
@turbojet: Good job figuring out the autoload thing. I thought it would be something like that. |
3rd March 2010, 02:45 | #63 | Link |
Registered User
Join Date: Feb 2010
Posts: 84
|
@Turbojet: Thanks for the nifty little script, I'll go ahead and stick it in the archive with future updates of Avisynth64.
@Stephen: How many threads were you allowing during those benchmarks? I'm guessing you just ran it through the normal cache, so I'm just double checking. Also, are there any other pieces of code that you'd like to see work with the project? I took a look at TIVTC, and it's a bit of a beast to convert. Are there any other day-to-day plugins that would be beneficial?
I'm thinking I might optimize some of the hot spots where a single thread gets "stuck". I think there are some speed gains hidden in removing the core program's reliance on the MMX register set. All 64-bit processors have 16 XMM registers, 6 of which (XMM0 through XMM5 under the Windows x64 calling convention) are volatile across function calls, meaning we can do whatever we want with them. The BitBlt functions move data in 64-bit chunks when we could be doing it in 128-bit chunks. At the very least, I want to get the memory copy functions using the XMM registers instead of the MMX ones. |
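To illustrate the 128-bit copy idea from the post above, here is a minimal sketch using SSE2 intrinsics. It is not Avisynth's BitBlt: the function name `copy_plane_sse2` and its parameters are hypothetical, and it falls back to a plain `memcpy` when SSE2 is not available at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics (128-bit XMM loads and stores)
#define HAVE_SSE2_COPY 1
#else
#define HAVE_SSE2_COPY 0
#endif

// Hypothetical helper, NOT the real BitBlt: copies `height` rows of
// `row_size` bytes between two buffers that each have their own pitch.
void copy_plane_sse2(std::uint8_t* dst, std::ptrdiff_t dst_pitch,
                     const std::uint8_t* src, std::ptrdiff_t src_pitch,
                     std::size_t row_size, std::size_t height) {
    assert(dst != nullptr && src != nullptr);
    for (std::size_t y = 0; y < height; ++y) {
        std::size_t x = 0;
#if HAVE_SSE2_COPY
        // Move 16 bytes per iteration through an XMM register.
        // The unaligned load/store forms are valid for any address.
        for (; x + 16 <= row_size; x += 16) {
            __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + x));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + x), v);
        }
#endif
        // Copy the remaining tail bytes (or the whole row without SSE2).
        std::memcpy(dst + x, src + x, row_size - x);
        src += src_pitch;
        dst += dst_pitch;
    }
}
```

A real implementation would likely add an aligned fast path (`_mm_load_si128` / `_mm_store_si128`) when both pointers and pitches are multiples of 16, which is the same aligned/unaligned split the resizer code later in the thread makes.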
3rd March 2010, 04:09 | #64 | Link |
Registered User
Join Date: Nov 2009
Posts: 327
|
All tests were strictly single-threaded. Neither MT() nor SetMTMode() was used, and only single-threaded filters were used.
If you can't get TIVTC, then perhaps: Stuff I use frequently (descending order of priority):
Stuff I have but don't really use much (unsorted):
Stuff I use that is closed source:
Last edited by Stephen R. Savage; 3rd March 2010 at 06:07. |
6th March 2010, 14:27 | #66 | Link | |
Compiling Encoder
Join Date: Jan 2007
Posts: 1,348
|
Quote:
If the plugin uses inline asm, then it will need to be rewritten, since MSVC does not support inline asm when targeting x64 (and most plugins use inline asm instead of an assembler such as yasm/nasm). If it only uses C/C++, then it can generally just be recompiled for x86_64, although this depends on how many assumptions the code makes and whether it breaks when the sizes of some of the basic variable types change. If you want Avisynth x86 and x86_64 to intermix, the available option is to use TCPDeliver to deliver frames over socket connections between them. Last edited by kemuri-_9; 6th March 2010 at 14:33. |
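As a small illustration of the kind of size assumption that breaks on recompilation, consider testing a pointer's alignment. This is a sketch with hypothetical helper names, not code from any particular plugin: casting a pointer to `int` silently drops the upper half of a 64-bit address, while `std::uintptr_t` is wide enough on both targets.

```cpp
#include <cassert>
#include <cstdint>

// Portable alignment test: uintptr_t can hold any object pointer,
// so the low four bits are read from the full address.
bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// True only on targets where a pointer still fits in an int,
// which is the assumption a lot of 32-bit era code bakes in.
bool pointer_fits_in_int() {
    return sizeof(void*) <= sizeof(int);
}

// Small usage example (hypothetical): checks a 16-byte aligned buffer.
void alignment_examples() {
    alignas(16) static unsigned char buffer[32];
    assert(is_aligned16(buffer));        // start of an aligned buffer
    assert(!is_aligned16(buffer + 1));   // one byte in is misaligned
    assert(is_aligned16(buffer + 16));   // next 16-byte boundary
}
```

On Win64, `pointer_fits_in_int()` returns false, which is why code written as `(int)ptr` needs to be changed to something like `(INT_PTR)ptr` or `uintptr_t` before it can be trusted in a 64-bit build.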
|
6th March 2010, 19:36 | #67 | Link |
Registered User
Join Date: Nov 2009
Posts: 327
|
By the way, I am reporting that the version of FFT3DFilter on the first post of this thread does not work. It returns "could not load plugin fft3dfilter.dll". I have tried with fftw3.dll in system32, the application directory, and in another global/path directory.
Also, instead of making FFT3DFilter link against fftw3.dll, why not use the default name libfftw3f-3.dll, so that the DLL can be shared with dfttest (if you ever port it). |
6th March 2010, 20:54 | #68 | Link | |
Registered User
Join Date: Feb 2010
Posts: 84
|
Quote:
I had it up and running on my system... I'll get back to my development machine and check it out. Also, in other fun news, I've ported the vertical resizers to use SSE registers instead of MMX, and taken them out of inline assembly. I'm almost done moving the default inline assembly horizontal routines into their own assembly functions. You'd be surprised how much speed you can gain by just not using inline assembler and going with straight-up assembly instead. After that's all done, I think the slowest internal Avisynth function will be BitBlt, and then it's on to the plugins for more optimizations. |
|
6th March 2010, 21:38 | #70 | Link |
Registered User
Join Date: Nov 2009
Posts: 327
|
JoshyD, the MD5 of my fftw3.dll matches the one from the main page (Win64 version, of course).
Also, I believe the resizer functions were optimized for SSE2 in the 2.6 branch. Hopefully your optimizations won't be for nothing and can be generalized to the CVS version. |
7th March 2010, 14:52 | #71 | Link |
Registered User
Join Date: Dec 2004
Location: Melbourne, AU
Posts: 1,963
|
Softwire was alive a few years ago, when 64-bit support was added: http://cvs.gna.org/cvsweb/softwire/?cvsroot=softwire
For benchmarking, that is what the "play" command of avsutil is for. |
7th March 2010, 19:33 | #72 | Link |
Registered User
Join Date: Feb 2010
Posts: 84
|
@Stephen
The optimizations shouldn't all be lost. The only difference is that mine aren't dynamically generated like they are in the new branch. Currently, I'm implementing them with function pointers to different resizers, chosen by the FIR filter size, which in turn depends on the filter. For example, Lanczos4Resize has a FIR filter size of 8. It's possible to generate larger FIR filters, but those are rare, and I'm still kicking around ideas on how best to handle those cases. The trade-off here is that the actual code is larger, but since it's generated at compile time, there's no one-time cost of function generation during filter instantiation. Will it really make a difference? I'm not really sure right now. To give an idea of the difference in execution, I do something along these lines: Code:
switch (plane) {
  case 2: // take V plane
    cur = resampling_patternUV;
    fir_filter_size = *cur++;
    src_pitch = src->GetPitch(PLANAR_V);
    dst_pitch = dst->GetPitch(PLANAR_V);
    xloops = ((src->GetRowSize(PLANAR_V_ALIGNED)+15) / 16)*16; // Round to multiple of 16
    dstp = dst->GetWritePtr(PLANAR_V);
    srcp = src->GetReadPtr(PLANAR_V);
    y = dst->GetHeight(PLANAR_V);
    yOfs2 = this->yOfsUV;
    (((INT_PTR)srcp&15) || (src_pitch &15)) ?
      ua_proc_uvplane(srcp, dstp, src_pitch, dst_pitch, y, xloops, yOfs2, cur) :
      a_proc_uvplane(srcp, dstp, src_pitch, dst_pitch, y, xloops, yOfs2, cur);
    break;
  case 1: // U Plane
    cur = resampling_patternUV;
    fir_filter_size = *cur++;
    dstp = dst->GetWritePtr(PLANAR_U);
    srcp = src->GetReadPtr(PLANAR_U);
    y = dst->GetHeight(PLANAR_U);
    src_pitch = src->GetPitch(PLANAR_U);
    dst_pitch = dst->GetPitch(PLANAR_U);
    xloops = ((src->GetRowSize(PLANAR_U_ALIGNED)+15) / 16)*16; // Round to multiple of 16
    yOfs2 = this->yOfsUV;
    plane--; // skip case 0
    (((INT_PTR)srcp&15) || (src_pitch &15)) ?
      ua_proc_uvplane(srcp, dstp, src_pitch, dst_pitch, y, xloops, yOfs2, cur) :
      a_proc_uvplane(srcp, dstp, src_pitch, dst_pitch, y, xloops, yOfs2, cur);
    break;
  case 3: // Y plane for planar
    break;
  case 0: // Default for interleaved
    (((INT_PTR)srcp&15) || (src_pitch &15)) ?
      ua_proc_yplane(srcp, dstp, src_pitch, dst_pitch, y, xloops, yOfs2, cur) :
      a_proc_yplane(srcp, dstp, src_pitch, dst_pitch, y, xloops, yOfs2, cur);
    break;
  default:
    break;
}
Code:
switch (plane) {
  case 2: // take V plane
    src_pitch = src->GetPitch(PLANAR_V);
    dst_pitch = dst->GetPitch(PLANAR_V);
    dstp = dst->GetWritePtr(PLANAR_V);
    srcp = src->GetReadPtr(PLANAR_V);
    y = dst->GetHeight(PLANAR_V);
    yOfs2 = this->yOfsUV;
    (((int)srcp&15) || (src_pitch &15)) ? assemblerUV.Call() : assemblerUV_aligned.Call();
    break;
  case 1: // U Plane
    dstp = dst->GetWritePtr(PLANAR_U);
    srcp = src->GetReadPtr(PLANAR_U);
    y = dst->GetHeight(PLANAR_U);
    src_pitch = src->GetPitch(PLANAR_U);
    dst_pitch = dst->GetPitch(PLANAR_U);
    yOfs2 = this->yOfsUV;
    plane--; // skip case 0
    (((int)srcp&15) || (src_pitch &15)) ? assemblerUV.Call() : assemblerUV_aligned.Call();
    break;
  case 3: // Y plane for planar
  case 0: // Default for interleaved
    (((int)srcp&15) || (src_pitch &15)) ? assemblerY.Call() : assemblerY_aligned.Call();
    break;
}
I checked my FFT3DFilter, and it still works. I think I may have been building it funny: I noticed some Intel compiler specific include directories in my configuration, so I removed them and rebuilt. Would you humor me and try this build?
@squid_80: Softwire does indeed have 64-bit support, but it's apparently lacking. The main team of Avisynth guys has been updating 32-bit Softwire as they go, I've noticed. I guess I just feel more comfortable writing hard-coded assembly than dealing with Softwire, which will probably come back and bite me later. We'll see. I can see myself breaking down and starting to implement the same things with Softwire64; it wouldn't be much of a stretch. |
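The function-pointer approach described in this post (pick a resizer once, based on the FIR filter size, instead of generating code at instantiation) could look roughly like the sketch below. All names here (`FirKernel`, `fir_fixed`, `fir_generic`, `select_kernel`) are hypothetical and the kernels are plain C++ rather than the assembly routines JoshyD mentions, so treat it as an illustration of the dispatch pattern only.

```cpp
#include <cassert>

// Hypothetical signature: computes one output sample from `taps` inputs.
using FirKernel = int (*)(const unsigned char* src, const short* coeff, int taps);

// Fixed-size kernel: the tap count is a compile-time constant, so the
// compiler can unroll the loop. This is the "larger code, no runtime
// generation cost" side of the trade-off.
template <int Taps>
int fir_fixed(const unsigned char* src, const short* coeff, int /*taps*/) {
    int acc = 0;
    for (int i = 0; i < Taps; ++i) acc += src[i] * coeff[i];
    return acc;
}

// Generic fallback for the rare, larger filter sizes.
int fir_generic(const unsigned char* src, const short* coeff, int taps) {
    int acc = 0;
    for (int i = 0; i < taps; ++i) acc += src[i] * coeff[i];
    return acc;
}

// Called once when the filter is constructed, not once per pixel.
FirKernel select_kernel(int fir_filter_size) {
    assert(fir_filter_size > 0);
    switch (fir_filter_size) {
        case 2: return &fir_fixed<2>;
        case 4: return &fir_fixed<4>;
        case 6: return &fir_fixed<6>;
        case 8: return &fir_fixed<8>;  // e.g. Lanczos4Resize, per the post
        default: return &fir_generic;
    }
}
```

The selected pointer would be stored in the filter object and called from the per-plane `switch`, which keeps the hot loop free of size checks.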
7th March 2010, 19:57 | #73 | Link | |
Registered User
Join Date: Nov 2009
Posts: 327
|
Oh, figured it out:
Quote:
Edit: It loads and works fine after I installed the MSVC 2008 SP1 runtimes and a googled copy of libmmd.dll. Hopefully you can resolve the dependencies on your end as well (static linking?). Last edited by Stephen R. Savage; 7th March 2010 at 20:13. |
|
7th March 2010, 20:28 | #74 | Link | |
Fighting spam with a fish
Join Date: Sep 2005
Posts: 2,699
|
Quote:
Since I'm a bleeding-edge kind of guy, I thought, let's go 64-bit! But I wanted to be sure that I could still get the most out of my current encoding chains. I currently don't have a particular filter in mind, but then I haven't gone through all of my plugins to check which ones are compatible and which are not. Judging by how much support is being added for 64-bit, I think I will go ahead and make the switch in the next few weeks. I'll be back with plugin suggestions soon after that, I'm sure! And thank you so much for your hard work! It is greatly appreciated right now with the big switch to 64-bit operating systems and processors. |
|
9th March 2010, 21:07 | #76 | Link |
Registered User
Join Date: Feb 2010
Posts: 84
|
New build is up on the main page, please download it and let me know if I've missed any oddities in my test cases.
Changes are listed on the front page; there are some nice speed gains to be had in the main DLL. Running single-threaded TempGaussMC beta2 through avs2avi with a null output, using an old home movie recorded in DVSD (720x480, 29.97fps, interlaced, bff) as the source:
Avisynth32: 5.05fps
Avisynth64: 6.09fps
Hooray, a whole extra frame every second! It may not seem like much, but for a long encode (each run took ~40mins, give or take depending on the version used), an extra frame every second can shave a significant amount off your total encode time. Add some good ol' SetMTMode(2) into the mix, and you've got a larger gap (same test):
Avisynth32: 13.79fps
Avisynth64: 17.66fps
Still not setting any speed records, but with a slow script (that produces great results) I'll take my speed gains, however minor they may be. Depending on the script, I'd say speed increases are in the 15% to 20% range on average. Last edited by JoshyD; 9th March 2010 at 21:18. |
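The arithmetic behind "an extra frame every second" can be sketched as follows. The helper names are hypothetical; the only assumption is that encode time for a fixed clip is the frame count divided by the frame rate.

```cpp
#include <cassert>
#include <cmath>

// Relative speed of the new build: 1.20 means 20% more frames per second.
double relative_speed(double fps_old, double fps_new) {
    assert(fps_old > 0.0 && std::isfinite(fps_new));
    return fps_new / fps_old;
}

// Fraction of encode time saved for the same number of frames.
// time = frames / fps, so time_new / time_old = fps_old / fps_new.
double time_saved_fraction(double fps_old, double fps_new) {
    assert(fps_new > 0.0 && std::isfinite(fps_old));
    return 1.0 - fps_old / fps_new;
}
```

With the figures quoted above, `relative_speed(5.05, 6.09)` is about 1.21 and `time_saved_fraction(5.05, 6.09)` is about 0.17, so a run of roughly 40 minutes would finish around 7 minutes sooner. For the SetMTMode(2) pair (13.79 and 17.66 fps), the saved fraction is about 0.22.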
9th March 2010, 22:27 | #77 | Link |
Registered User
Join Date: Nov 2009
Posts: 327
|
Quick tests:
Decode 640x480 Xvid (MS MPEG-4 Decoder, 500 frames)
32-bit: 113.27 fps
64-bit: 118.20 fps
Relative Speed: 104.4%
Spline36Resize 16x Enlarge (500 frames)
32-bit: 18.06 fps
64-bit: 31.61 fps
Relative Speed: 175.0%
TempGaussMC beta 2 (EEDI2, 100 fields)
32-bit: 3.42 fps
64-bit: 3.72 fps
Relative Speed: 108.9%
MDeGrain3 (200 frames)
32-bit: 5.96 fps
64-bit: 6.54 fps
Relative Speed: 109.7%
AAA (100 frames)
32-bit: 2.17 fps
64-bit: 2.20 fps (MTMode = 1)
Relative Speed: 101.3%
All tests were run three times and averaged. I am not seeing the performance increases in TGMC that you are, JoshyD. However, resize performance is definitely up!
The caching bug related to AAA is still not fixed. I suspect this bug is sapping performance out of other scripted filters (read: TGMC) as well. Note that the SetMTMode hack is suboptimal, as it costs performance when the caching bug is not in play (e.g. a simple source+resize script). Last edited by Stephen R. Savage; 9th March 2010 at 22:29. |
9th March 2010, 22:59 | #78 | Link |
Registered User
Join Date: May 2008
Posts: 1,840
|
Thanks for the new version; however, it crashes on an Athlon II. Here are the Windows error codes.
Code:
Problem signature:
Problem Event Name: APPCRASH
Application Name: x264_x64.exe
Application Version: 0.0.0.0
Application Timestamp: 4b8c1206
Fault Module Name: avisynth.DLL
Fault Module Version: 2.5.8.5
Fault Module Timestamp: 4b9681a6
Exception Code: c0000005
Exception Offset: 0000000000005528
OS Version: 6.1.7600.2.0.0.256.1
Locale ID: 1033
Additional Information 1: 9eab
Additional Information 2: 9eabb149e34b0e02564736c484278831
Additional Information 3: 6bcd
Additional Information 4: 6bcdaf28393e1989487185c90748dcec
Also, since all the 64-bit filters I could find are named identically to their 32-bit counterparts, I changed the install script to use a plugins64 directory. You can grab it here.
Another thing: as far as I know, there's only one Haali Media Splitter build that's x64 and handles VC-1, and it's very difficult to find, but I uploaded it here. It's up to you whether you want to post it in the original post in case people aren't able to use HMS x64 or report issues with VC-1 (from the latest official release).
Lastly, about filters: I heavily use TIVTC, so it's unfortunate that it's not easy to convert. The other ones I use every once in a while are:
Decomb v4 (more effective IVTC than v5 but nowhere near TIVTC; squid's is v5 only)
LeakKernelDeint (fast, simple, sharp deinterlacer)
RePAL (very effective for handling PAL sources that were blended for NTSC DVDs)
Last edited by turbojet; 9th March 2010 at 23:10. |
10th March 2010, 00:14 | #79 | Link |
Registered User
Join Date: Feb 2010
Posts: 84
|
@Stephen: Let TGMC (and all the filters, really) have a few thousand frames to work with. 100 frames doesn't give the caching mechanisms in Avisynth (or the computer as a whole) time to get filled and ready to go. All that data has to be pulled in closer and closer to the processor before any computationally intensive algorithms can really shine. If they don't have the needed data close at hand, you're going to be limited by memory latency rather than by compute. I've been running tests with TGMC and my own personal builds to collect usage data in the program. That being said, using a larger sample can sometimes accentuate the differences between the two versions. For example, a 2002 frame sample produces slightly larger differences than if I just let it run through the first few hundred frames.
Avisynth 32: 5.54fps
Avisynth 64: 6.60fps
Relative speed: 119%
The caching bug is annoying. Can you post the exact script and file for AAA that you're working with?
@turbojet: That's a memory access error. Can you post the script that you were running when it occurred? A short sample of the clip you were using would be useful as well; I need to get the pitch, height, width, etc. from it. I haven't coded any instructions that are incompatible with your processor; however, I may very well have mucked up my aligned memory accesses. I think it probably occurred in the horizontal resize function, but I can't be certain.
The DLL linked here is ridiculous in size because of the compiler that generated it and the options I allowed. Intel's C++ compiler will generate a specific code path for every Intel processor from the P4 onward. At runtime, the code runs CPUID on your processor, and if you're lucky enough to have VendorID = 'GenuineIntel', you'll get a special set of code optimized for your particular processor and its idiosyncrasies. The extra code for those processors, in addition to some statically linked code from OpenMP, balloons the size of the DLL. Don't worry about AMD processors though: the base code path is for any processor that has SSE3 or newer. While Intel's compiler won't auto-vectorize using SSE4 instructions for AMD processors, it will still give them the benefit of data operations that can be performed with SSE3 and any older set of SIMD instructions. Last edited by JoshyD; 10th March 2010 at 00:24. |
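For readers unfamiliar with the vendor check described above, here is a minimal sketch of reading the CPUID vendor string. It only shows the check itself; how Intel's compiler actually dispatches between code paths is not shown in this thread. The function names are hypothetical, and the `<cpuid.h>` path assumes GCC or Clang on x86 (on other compilers `query_vendor` simply returns an empty string).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>  // GCC/Clang wrapper around the CPUID instruction
#define HAVE_CPUID_H 1
#else
#define HAVE_CPUID_H 0
#endif

// CPUID leaf 0 returns the 12-byte vendor string in EBX, EDX, ECX
// (in that order), lowest byte of each register first.
std::string vendor_from_regs(std::uint32_t ebx, std::uint32_t edx, std::uint32_t ecx) {
    const std::uint32_t regs[3] = {ebx, edx, ecx};
    std::string vendor;
    for (std::uint32_t reg : regs)
        for (int i = 0; i < 4; ++i)
            vendor.push_back(static_cast<char>((reg >> (8 * i)) & 0xFFu));
    return vendor;
}

// Returns the vendor string of the current CPU, or "" if it cannot be read.
std::string query_vendor() {
#if HAVE_CPUID_H
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return vendor_from_regs(ebx, edx, ecx);
#endif
    return std::string();
}

// Hypothetical stand-in for the vendor test described in the post.
bool takes_vendor_specific_path(const std::string& vendor) {
    return vendor == "GenuineIntel";
}
```

On an AMD processor such as the Athlon II mentioned earlier, the vendor string is "AuthenticAMD", so a check like this would select the baseline path.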
10th March 2010, 00:51 | #80 | Link |
Registered User
Join Date: May 2008
Posts: 1,840
|
Source is a 1920x1080 m2ts, script is: DirectShowsource().AssumeFPS(24000,1001).LanczosResize(1280,720)
It's definitely an issue with resizing: if I don't resize, it works without crashing. I tried the bilinear, point, and bicubic resizers, and all of them crashed. |