20th June 2008, 06:11 | #742 | Link |
Registered User
Join Date: Aug 2006
Posts: 77
|
I have included every exported function. I have listed only the 16x16 variants, though, because most functions exist only as 16xY variants and results for different block sizes are not comparable. For example, the SSSE3 cache split code exists for 16x16 but, as far as I can tell, has no 8x8 or 8x16 variants, so I set the cache64_mmxext variants for the color planes. I used all functions which produce a single SAD and are visible (cglobal), in all defined variants.
8x8 blocks have the disadvantage that their chroma blocks are 4x8 or 4x4, sizes which exist only in the default functions of MVTools and in the plain mmxext function. Everything SHOULD work right now, and as far as possible the most applicable x264 function is chosen (if any exists). I haven't modified the x264 assembler files, because I still can't predict whether that would degrade performance (at best I can test it on my own machines), and I wanted to make an update possible where only the new files have to be transferred and recompiled.
__________________
GA-P35-DS3R, Core2Quad Q9300@3GHz, 4.0GB/800 MHz DDR2, 2x250GB SATA HD, Geforce 6800 |
20th June 2008, 09:22 | #743 | Link |
Registered User
Join Date: Aug 2006
Posts: 77
|
If anyone is curious....
My modified version can be acquired here: mvtools-CLS-F.zip
Please mind that it is still beta.
Beware that chroma for YV12 is half the block size and for YUY2 half the width, so only 16x16 and 16x8 have a cache-line-dependent function for both chroma and luma. As far as possible, the same cache line optimization is used for both. If a function doesn't exist for the requested block size, the default ISSE function is used.
Used functions:
SadXxY_iSSE (with X and Y being any of 16, 8 and 4)
x264_pixel_sad_16x16_sse2
x264_pixel_sad_16x8_sse2
x264_pixel_sad_16x16_sse3
x264_pixel_sad_16x8_sse3
x264_pixel_sad_16x16_cache64_sse2
x264_pixel_sad_16x8_cache64_sse2
x264_pixel_sad_16x16_cache64_ssse3
x264_pixel_sad_16x8_cache64_ssse3
x264_pixel_sad_16x16_cache32_mmxext
x264_pixel_sad_16x8_cache32_mmxext
x264_pixel_sad_16x16_cache64_mmxext
x264_pixel_sad_16x8_cache64_mmxext
x264_pixel_sad_8x16_cache32_mmxext
x264_pixel_sad_8x8_cache32_mmxext
x264_pixel_sad_8x4_cache32_mmxext
x264_pixel_sad_8x16_cache64_mmxext
x264_pixel_sad_8x8_cache64_mmxext
x264_pixel_sad_8x4_cache64_mmxext
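The chroma sizing and the fallback rule described above can be sketched in C roughly as follows. This is only an illustration, not the MVTools source: the types, the function names chroma_block, find_sad and pick_sad, and the lookup-by-name approach are hypothetical. Only the cache64_mmxext names are taken from the list in the post.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for this sketch; MVTools itself is organised differently. */
typedef enum { FMT_YV12, FMT_YUY2 } pixel_format;
typedef struct { int w, h; } block_size;
typedef struct { int w, h; const char *name; } sad_entry;

/* YV12 chroma is halved in both directions, YUY2 chroma only in width. */
static block_size chroma_block(block_size luma, pixel_format fmt)
{
    block_size c;
    assert(luma.w % 2 == 0 && luma.h % 2 == 0);
    c.w = luma.w / 2;
    c.h = (fmt == FMT_YV12) ? luma.h / 2 : luma.h;
    return c;
}

/* The cache64_mmxext family, copied from the function list in the post. */
static const sad_entry cache64_mmxext[] = {
    {16, 16, "x264_pixel_sad_16x16_cache64_mmxext"},
    {16,  8, "x264_pixel_sad_16x8_cache64_mmxext"},
    { 8, 16, "x264_pixel_sad_8x16_cache64_mmxext"},
    { 8,  8, "x264_pixel_sad_8x8_cache64_mmxext"},
    { 8,  4, "x264_pixel_sad_8x4_cache64_mmxext"},
};

/* Returns the table entry for this block size, or NULL if none exists. */
static const sad_entry *find_sad(block_size b)
{
    size_t n = sizeof cache64_mmxext / sizeof cache64_mmxext[0];
    for (size_t i = 0; i < n; i++)
        if (cache64_mmxext[i].w == b.w && cache64_mmxext[i].h == b.h)
            return &cache64_mmxext[i];
    return NULL;
}

/* Falls back to the default function when the size is not covered. */
static const char *pick_sad(block_size b)
{
    const sad_entry *e = find_sad(b);
    return e ? e->name : "SadXxY_iSSE (default)";
}
```

With 16x16 luma blocks the YV12 chroma blocks are 8x8 and still find an optimized entry, while 8x8 luma blocks give 4x4 (YV12) or 4x8 (YUY2) chroma and end up on the default path.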
20th June 2008, 22:41 | #744 | Link |
AviSynth plugger
Join Date: Nov 2003
Location: Russia
Posts: 2,183
|
TSchniede,
thanks for the contribution (from Dark Shikari)! Please increment the version number to 1.9.5 or greater.
__________________
My Avisynth plugins are now at http://avisynth.org.ru and mirror at http://avisynth.nl/users/fizick I usually do not provide a technical support in private messages. |
21st June 2008, 14:04 | #745 | Link |
Registered User
Join Date: Aug 2006
Posts: 77
|
Ok,
So here is the somewhat cleaned-up version: mvtools_1.9.5
Differences to MVTools 1.9.3:
added parameter sadx264 to MVAnalyse
modified the old SAD functions to work with the same interface as the ones of x264, which in turn made changes to PlaneOFBlocks necessary
MVPlane now has a longer array of planepointers: the upper half is used to free the memory, and both the pitch and the planepointer (after adding padding) are aligned to the alignment constant defined in MVInterface (64 now)
minor: DebugPrint can now be disabled in MVinterface (stdio outside #ifdef area)
disabled annoying security warnings
The information about which SAD function should be used is transported via additional flags. For simpler function pointer definitions, the copycode, Luma and Variance functions now all follow the XxY naming scheme, which also means the respective files are modified, and Mvincrease and MVcompensate got modified as well. The documentation in English is up to date too.
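The alignment scheme mentioned in the changelog can be sketched in C roughly as follows. This is not the MVTools code: the plane struct, align_up, plane_alloc, plane_free and PLANE_ALIGN are hypothetical names, and keeping a separate raw pointer for free() is just one simple way to mirror the idea of a second set of pointers used to free the memory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PLANE_ALIGN 64  /* value from the post; the macro name is made up */

typedef struct {
    uint8_t *visible;  /* first visible pixel, aligned to PLANE_ALIGN */
    void    *raw;      /* unaligned pointer that must go back to free() */
    int      pitch;    /* row stride in bytes, a multiple of PLANE_ALIGN */
} plane;

/* Round n up to the next multiple of a (a must be a power of two). */
static int align_up(int n, int a)
{
    assert(a > 0 && (a & (a - 1)) == 0);
    return (n + a - 1) & ~(a - 1);
}

/* Allocate a padded plane so that the pixel after the padding is aligned. */
static int plane_alloc(plane *p, int width, int height, int hpad, int vpad)
{
    int rows = height + 2 * vpad;
    p->pitch = align_up(width + 2 * hpad, PLANE_ALIGN);
    p->raw = malloc((size_t)p->pitch * (size_t)rows + PLANE_ALIGN - 1);
    if (!p->raw)
        return -1;
    /* Align the address that follows the left padding of the first row. */
    uintptr_t first = (uintptr_t)p->raw + (uintptr_t)hpad;
    first = (first + PLANE_ALIGN - 1) & ~(uintptr_t)(PLANE_ALIGN - 1);
    /* Skip the top padding rows; the pitch keeps every row aligned. */
    p->visible = (uint8_t *)first + (size_t)vpad * (size_t)p->pitch;
    return 0;
}

static void plane_free(plane *p)
{
    free(p->raw);
    p->raw = NULL;
    p->visible = NULL;
}
```

Because the pitch is a multiple of the alignment, the start of every visible row lands on a 64-byte boundary once the first one does.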
21st June 2008, 15:19 | #747 | Link |
AviSynth plugger
Join Date: Nov 2003
Location: Russia
Posts: 2,183
|
TSchniede,
OK, I will try to merge my betas with yours.
21st June 2008, 18:06 | #749 | Link |
AviSynth plugger
Join Date: Nov 2003
Location: Russia
Posts: 2,183
|
Released public beta 1.9.5.1 - Merged all changes by TSchniede.
Almost not tested.
22nd June 2008, 17:23 | #751 | Link |
AviSynth plugger
Join Date: Nov 2003
Location: Russia
Posts: 2,183
|
One person tells me that some sadx264 modes (=7) do not work with overlap.
22nd June 2008, 20:51 | #752 | Link |
Pig on the wing
Join Date: Mar 2002
Location: Finland
Posts: 5,731
|
A quick test:
5000 frames of a simple MPEG2Source + MVDegrain2 script on my Q6750:
Code:
MPEG2Source("path\clip.d2v")
Denoise()

function denoise(clip c)
{
vbw1=MVAnalyse(c,isb=true,truemotion=true,delta=1,pel=2,chroma=false,blksize=8,idx=1,overlap=4,sadx264=3)
vfw1=MVAnalyse(c,isb=false,truemotion=true,delta=1,pel=2,chroma=false,blksize=8,idx=1,overlap=4,sadx264=3)
vbw2=MVAnalyse(c,isb=true,truemotion=true,delta=2,pel=2,chroma=false,blksize=8,idx=1,overlap=4,sadx264=3)
vfw2=MVAnalyse(c,isb=false,truemotion=true,delta=2,pel=2,chroma=false,blksize=8,idx=1,overlap=4,sadx264=3)
return MVDegrain2(c,vbw1,vfw1,vbw2,vfw2,thSAD=400,idx=1)
}

sadx264=1 : 0:20:14
sadx264=3 : 0:19:07

A nice improvement, I'd say. Too bad the higher modes are not available for blocksize 8.
__________________
And if the band you're in starts playing different tunes I'll see you on the dark side of the Moon... |
23rd June 2008, 00:12 | #753 | Link | |
Registered User
Join Date: Aug 2006
Posts: 77
|
Quote:
An additional version for an overlap of 8 on 16-pixel-wide blocks might work with the SSE2 / SSSE3 workaround, as the accesses would only alternate between aligned and a known misalignment.
|
23rd June 2008, 00:20 | #754 | Link | |
x264 developer
Join Date: Sep 2005
Posts: 8,666
|
Quote:
|
|
23rd June 2008, 19:43 | #756 | Link |
Pig on the wing
Join Date: Mar 2002
Location: Finland
Posts: 5,731
|
Oh, you'll have to explain this in layman's terms... I'm merely a simple end user myself.
24th June 2008, 01:03 | #758 | Link | |
Registered User
Join Date: Aug 2006
Posts: 77
|
Quote:
So for each line of a 16-pixel-wide block, the SAD can be calculated with one SSE2 instruction, or with two MMX instructions. This means SSE2 is best for 16xY blocks and MMX is best for 8xY blocks. Other sizes take more time to get the data into the instruction. On Intel chips, alignment in memory is very important, so using two aligned memory accesses instead of one unaligned access is faster, hence the speedup from using the cache-optimized SAD functions imported from x264.
It is possible to load two lines of an 8-pixel-wide block into one xmm register, but then both operands of the SAD have to be registers, i.e. load two lines into register 1, load the other two lines into register 2, then do the SAD, which is 5 instructions, as opposed to loading the first line (16 pixels) and then doing the SAD against memory, which is 2 instructions. In most cases the additional overhead of copying the data in and then getting the output makes SSE2-style SAD calculation slower than MMX for 8-pixel-wide SAD functions. An additional advantage of MMX is that it is far easier to use on unaligned data.
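For readers who prefer code, here is a minimal C sketch of the "one SAD instruction per 16-pixel line" idea, using the SSE2 intrinsic _mm_sad_epu8 (the psadbw instruction). It is not the x264 assembler: sad_ref and sad_16xh are hypothetical names, unaligned loads are used for simplicity (so it shows none of the cache line tricks), and it falls back to plain C when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_SKETCH 1
#endif

/* Plain C reference: sum of absolute differences over a w x h block. */
static unsigned sad_ref(const uint8_t *src, int src_pitch,
                        const uint8_t *ref, int ref_pitch, int w, int h)
{
    unsigned sum = 0;
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            sum += (unsigned)abs(src[y * src_pitch + x] - ref[y * ref_pitch + x]);
    return sum;
}

/* 16-pixel-wide SAD: one psadbw per line when SSE2 is available. */
static unsigned sad_16xh(const uint8_t *src, int src_pitch,
                         const uint8_t *ref, int ref_pitch, int h)
{
#ifdef HAVE_SSE2_SKETCH
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < h; y++) {
        /* Unaligned loads; aligned data could use _mm_load_si128 instead. */
        __m128i a = _mm_loadu_si128((const __m128i *)(src + y * src_pitch));
        __m128i b = _mm_loadu_si128((const __m128i *)(ref + y * ref_pitch));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
    }
    /* psadbw leaves one partial sum in each 64-bit half of the register. */
    return (unsigned)_mm_cvtsi128_si32(acc) +
           (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
#else
    return sad_ref(src, src_pitch, ref, ref_pitch, 16, h);
#endif
}
```

An 8-pixel-wide block would fill only half of an xmm register per line, which is where the extra shuffling described in the post comes from.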
|
24th June 2008, 15:28 | #759 | Link |
Registered User
Join Date: Aug 2006
Posts: 77
|
I have created a version which copies the source block to an aligned buffer area before doing the SAD calculation; because this buffer is reused a few times, the overhead is small. The cache-optimized functions get sped up (if they worked at all) with unaligned source blocks.
I still have to do a bit of testing and optimizing. I am curious, though: why does the buffer block for the DCT have a minimum width of 16 (the pitch)?
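The copy-then-compare idea can be sketched in C like this. It is a simplified illustration rather than the actual change: src_cache, cache_fill and cache_sad are hypothetical names, the fixed buffer pitch of 16 is an assumption of this sketch, and the SAD itself is plain C instead of one of the x264 functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLK_MAX   16  /* largest block edge handled by this sketch */
#define BUF_ALIGN 64  /* assumed cache line size */

typedef struct {
    /* Oversized so that an aligned BLK_MAX x BLK_MAX window always fits. */
    uint8_t  storage[BLK_MAX * BLK_MAX + BUF_ALIGN - 1];
    uint8_t *blk;  /* aligned start inside storage; stale if the struct is copied */
    int      w, h;
} src_cache;

/* Copy a possibly unaligned source block into the aligned buffer once. */
static void cache_fill(src_cache *c, const uint8_t *src, int src_pitch,
                       int w, int h)
{
    assert(w > 0 && w <= BLK_MAX && h > 0 && h <= BLK_MAX);
    uintptr_t p = ((uintptr_t)c->storage + BUF_ALIGN - 1)
                  & ~(uintptr_t)(BUF_ALIGN - 1);
    c->blk = (uint8_t *)p;
    c->w = w;
    c->h = h;
    for (int y = 0; y < h; y++)
        memcpy(c->blk + y * BLK_MAX, src + y * src_pitch, (size_t)w);
}

/* SAD of the cached source block against one candidate reference block. */
static unsigned cache_sad(const src_cache *c, const uint8_t *ref, int ref_pitch)
{
    unsigned sum = 0;
    for (int y = 0; y < c->h; y++)
        for (int x = 0; x < c->w; x++)
            sum += (unsigned)abs(c->blk[y * BLK_MAX + x] - ref[y * ref_pitch + x]);
    return sum;
}
```

The copy is paid once per source block, while cache_sad is called for every candidate position in the search, which is why the overhead stays small.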