MVTools - Page 38

Dark Shikari · 20th June 2008, 03:53

That's pretty impressive: up to a 15% speed boost!

How about trying x264's 8x8 cacheline split variants?

TSchniede · 20th June 2008, 06:11

I have included every exported function. (I have only listed the 16x16 variants, though, because most exist only as 16xY variants and results for different block sizes are not comparable.) For example, for 16x16 with the SSSE3 cache-split code there are no 8x8 or 8x16 variants, as far as I can tell, so I set the cache64_mmxext variants for the color planes. I used all functions which produce a single SAD and are visible (cglobal), in all defined variants.
8x8 blocks have the disadvantage that chroma is 4x8 or 4x4, which only exists in the default functions of MVTools and in the plain mmxext function. Everything SHOULD work right now, and as far as possible the most applicable x264 function is chosen (if any exists).
I haven't modified the x264 assembler files because I still can't predict whether that would degrade performance (at best I can test it on my machines), and I wanted to make an update possible where only the new files have to be transferred and recompiled.

TSchniede · 20th June 2008, 09:22

If anyone is curious....

My modified version can be acquired here: mvtools-CLS-F.zip

Please keep in mind that it is still a beta.

Used functions:
Beware that chroma for YV12 is half the block size and YUY2 has half width, so only 16x16 and 16x8 have a cache-line-dependent function for both chroma & luma.
As far as possible the same cache line optimization is used.
If a function doesn't exist for the requested block size, the default ISSE function is used.

SadXxY_iSSE, with X and Y being any of 16, 8 & 4

x264_pixel_sad_16x16_sse2
x264_pixel_sad_16x8_sse2
x264_pixel_sad_16x16_sse3
x264_pixel_sad_16x8_sse3
x264_pixel_sad_16x16_cache64_sse2
x264_pixel_sad_16x8_cache64_sse2
x264_pixel_sad_16x16_cache64_ssse3
x264_pixel_sad_16x8_cache64_ssse3

x264_pixel_sad_16x16_cache32_mmxext
x264_pixel_sad_16x8_cache32_mmxext
x264_pixel_sad_16x16_cache64_mmxext
x264_pixel_sad_16x8_cache64_mmxext
x264_pixel_sad_8x16_cache32_mmxext
x264_pixel_sad_8x8_cache32_mmxext
x264_pixel_sad_8x4_cache32_mmxext
x264_pixel_sad_8x16_cache64_mmxext
x264_pixel_sad_8x8_cache64_mmxext
x264_pixel_sad_8x4_cache64_mmxext

Fizick · 20th June 2008, 22:41

TSchniede,
thanks for the contribution (from Dark Shikari)!
Please increment the version number to 1.9.5 or greater.

TSchniede · 21st June 2008, 14:04

Ok,

So here is the somewhat cleaned version.

mvtools_1.9.5

Differences from MVTools 1.9.3:
Added the parameter sadx264 to MVAnalyse.
Modified the old SAD functions to work with the same interface as the x264 ones, which in turn made changes to PlaneOFBlocks necessary. MVPlane now has a longer array of planepointers: the upper half is used to free the memory, and both the pitch and the planepointer (after adding padding) are aligned to the alignment constant defined in MVInterface (64 now).
Minor: DebugPrint can now be disabled in MVinterface (stdio is outside the #ifdef area). Disabled annoying security warnings.
The information about which SAD function should be used is passed via additional flags. For simpler function pointer definitions, the copycode, Luma and Variance functions now all follow the XxY naming scheme, which also means the respective files are modified, and Mvincrease and MVcompensate were modified as well. The documentation in English is up to date too.

Dark Shikari · 21st June 2008, 15:09

You should probably just port the CPU detection code too while you're at it so that people don't have to select the SAD to use manually.

Fizick · 21st June 2008, 15:19

TSchniede,
OK, I will try to merge my betas with yours.

TSchniede · 21st June 2008, 16:11

I'll try to port the CPU detection code; a manual override probably won't hurt, though.
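To make the "auto-detect with manual override" idea concrete, here is a rough sketch in C. The capability flags, the function name pick_sad_16x16 and the priority order are made up for illustration; they are not the real x264 or MVTools detection code. Only the returned routine names come from the list posted earlier in the thread (Sad16x16_iSSE follows the SadXxY_iSSE pattern).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical capability flags; NOT the real x264 or MVTools constants. */
enum {
    CPU_MMXEXT  = 1 << 0,
    CPU_SSE2    = 1 << 1,
    CPU_SSE3    = 1 << 2,
    CPU_SSSE3   = 1 << 3,
    CPU_CACHE64 = 1 << 4   /* CPU has 64-byte cache lines */
};

/* Returns the name of the 16x16 SAD routine to use.
 * override_name == NULL or "auto" means "decide from the CPU flags";
 * anything else is the manual override and is returned unchanged.
 * The priority order below is an assumption made for this sketch. */
static const char *pick_sad_16x16(unsigned cpu, const char *override_name)
{
    if (override_name != NULL && strcmp(override_name, "auto") != 0) {
        assert(override_name[0] != '\0');   /* an empty override is a caller bug */
        return override_name;
    }
    if ((cpu & CPU_SSSE3) && (cpu & CPU_CACHE64))
        return "x264_pixel_sad_16x16_cache64_ssse3";
    if ((cpu & CPU_SSE2) && (cpu & CPU_CACHE64))
        return "x264_pixel_sad_16x16_cache64_sse2";
    if (cpu & CPU_SSE3)
        return "x264_pixel_sad_16x16_sse3";
    if (cpu & CPU_SSE2)
        return "x264_pixel_sad_16x16_sse2";
    return "Sad16x16_iSSE";                 /* default MVTools routine */
}
```

A real port would return function pointers instead of names and would need one such table per block size, falling back to the default function where no x264 variant exists.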

Fizick · 21st June 2008, 18:06

Released public beta 1.9.5.1 - Merged all changes by TSchniede.
Almost not tested.

superuser · 21st June 2008, 22:34

^ thnxs will soon give it a try :thumbup:

Fizick · 22nd June 2008, 17:23

Someone told me that some sadx264 modes (=7) do not work with overlap.

Boulder · 22nd June 2008, 20:51

A quick test:

5000 frames of a simple MPEG2Source + MVDegrain2 script on my Q6750:

Code:

MPEG2Source("path\clip.d2v")
Denoise()

function denoise(clip c)
{
vbw1=MVAnalyse(c,isb=true,truemotion=true,delta=1,pel=2,chroma=false,blksize=8,idx=1,overlap=4,sadx264=3)
vfw1=MVAnalyse(c,isb=false,truemotion=true,delta=1,pel=2,chroma=false,blksize=8,idx=1,overlap=4,sadx264=3)
vbw2=MVAnalyse(c,isb=true,truemotion=true,delta=2,pel=2,chroma=false,blksize=8,idx=1,overlap=4,sadx264=3)
vfw2=MVAnalyse(c,isb=false,truemotion=true,delta=2,pel=2,chroma=false,blksize=8,idx=1,overlap=4,sadx264=3)
return MVDegrain2(c,vbw1,vfw1,vbw2,vfw2,thSAD=400,idx=1)
}

sadx264=0 : 0:20:19
sadx264=1 : 0:20:14
sadx264=3 : 0:19:07

A nice improvement, I'd say

Too bad the higher modes are not available for blocksize 8.
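For anyone who wants the relative numbers: assuming the timings above are h:mm:ss wall-clock times for the same 5000 frames, sadx264=1 saves about 0.4% and sadx264=3 about 5.9% compared with sadx264=0. A small C sketch of that arithmetic (the helper names are made up):

```c
#include <assert.h>

/* Convert an h:mm:ss timing to seconds. */
static int hms_to_seconds(int h, int m, int s)
{
    return (h * 60 + m) * 60 + s;
}

/* Percentage of run time saved relative to the baseline run. */
static double time_saved_percent(int base_seconds, int new_seconds)
{
    assert(base_seconds > 0);
    return 100.0 * (base_seconds - new_seconds) / base_seconds;
}
```

For example, time_saved_percent(hms_to_seconds(0, 20, 19), hms_to_seconds(0, 19, 7)) compares 1219 s against 1147 s, i.e. 72 s saved.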

TSchniede · 23rd June 2008, 00:12

Quote:

Originally Posted by Fizick

Someone told me that some sadx264 modes (=7) do not work with overlap.

This is unfortunately true. The reason is simple: overlap creates non-aligned source blocks, which means that both the source and the reference accesses are unaligned. This would not be a real problem if the reason for the workarounds didn't exist in the first place. Only the LDDQU and MMX workarounds can be used on unaligned source blocks. I don't expect a real performance gain if data access is unaligned for both. An overlap of 8 on 16-wide blocks obviously works with MMX without serious performance loss.

An additional version for an overlap of 8 on 16-wide blocks might work with the SSE2 / SSSE3 workaround, as it would only alternate between aligned access and a known misalignment.
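A tiny C sketch of why overlap breaks alignment. It assumes that block k starts at x = k * (blksize - overlap) and that each plane row starts on an aligned address; both are assumptions for illustration, not taken from the MVTools source, and the helper names are made up.

```c
#include <assert.h>

/* Horizontal start position of block number `index` (assumed layout). */
static int block_start_x(int index, int blksize, int overlap)
{
    assert(overlap >= 0 && overlap < blksize);
    return index * (blksize - overlap);
}

/* Byte offset of position x from the previous `align`-byte boundary. */
static int misalignment(int x, int align)
{
    assert(align > 0);
    return x % align;
}
```

Under these assumptions, blksize=16 with overlap=8 gives offsets 0, 8, 0, 8, ... relative to a 16-byte boundary (aligned alternating with one known misalignment), while blksize=8 with overlap=4 gives 0, 4, 8, 12, so most source blocks are not even 8-byte aligned.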

Dark Shikari · 23rd June 2008, 00:20

Quote:

Originally Posted by TSchniede

This is unfortunately true. The reason is simple: overlap creates non-aligned source blocks, which means that both the source and the reference accesses are unaligned. This would not be a real problem if the reason for the workarounds didn't exist in the first place. Only the LDDQU and MMX workarounds can be used on unaligned source blocks. I don't expect a real performance gain if data access is unaligned for both. An overlap of 8 on 16-wide blocks obviously works with MMX without serious performance loss.

An additional version for an overlap of 8 on 16-wide blocks might work with the SSE2 / SSSE3 workaround, as it would only alternate between aligned access and a known misalignment.

Why not load the source pixels for each block into an aligned buffer before doing the motion search on that block? I suspect that would save time even without x264's assembly (though only for width16 blocks and SSE code, of course).

akupenguin · 23rd June 2008, 19:31

Quote:

Originally Posted by Boulder

Too bad the higher modes are not available for blocksize 8.

Too bad we don't use width 16 registers to process width 8 blocks?

Boulder · 23rd June 2008, 19:43

Quote:

Originally Posted by akupenguin

Too bad we don't use width 16 registers to process width 8 blocks?

Oh, you'll have to explain this in layman's terms... I'm merely a simple end user myself.

akupenguin · 23rd June 2008, 21:42

mmx can sad a 8byte block in 1 cycle. xmm can sad a 16byte block in 1 cycle. xmm doesn't help if your blocks are only 8 bytes.

TSchniede · 24th June 2008, 01:03

Quote:

Originally Posted by akupenguin

mmx can sad a 8byte block in 1 cycle. xmm can sad a 16byte block in 1 cycle. xmm doesn't help if your blocks are only 8 bytes.

The MMX/SSE versions of the (spatial) SAD calculations use the fact that MMX has a special instruction (PSADBW) that calculates the SAD of the bytes in an MMX register, which is 64 bits wide => 8 bytes, or in layman's terms 8 luma pixels. SSE2 allows the same instruction to work on XMM registers, which have 128 bits.
So for each line of a 16-pixel-wide block, the SAD can be calculated with SSE2 in 1 instruction, or in 2 with MMX.

Which means SSE2 is best for 16xY blocks and MMX is best for 8xY blocks. Other sizes take more time to get the data into the instruction.
On Intel chips alignment in memory is very important, so using 2 aligned memory accesses instead of one unaligned access is faster; hence the speedup from using the cache-optimized SAD functions imported from x264.

It is possible to load two lines into one XMM register, but then both operands of the SAD have to be registers, i.e. load 2 lines into register 1, load the other 2 lines into register 2, then do the SAD, which is 5 instructions, as opposed to loading the first line (16 pixels) and then doing the SAD against memory, which is 2 instructions.
In most cases the additional overhead of copying the data in and then getting the output makes SSE2-style SAD calculations slower than MMX for 8-pixel-wide SAD functions.
An additional advantage of MMX is that it is far easier to use on unaligned data.
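For readers who prefer code, here is a rough sketch of the two cases using SSE2 intrinsics in C. It is only an illustration, not the x264 assembly: the function names are made up, it uses unaligned loads for simplicity, and it falls back to plain C when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Plain C reference: SAD of a width x height block. */
static unsigned sad_scalar(const uint8_t *src, int src_pitch,
                           const uint8_t *ref, int ref_pitch,
                           int width, int height)
{
    unsigned sum = 0;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            sum += (unsigned)abs(src[y * src_pitch + x] - ref[y * ref_pitch + x]);
    return sum;
}

/* 16-pixel-wide block: one PSADBW (_mm_sad_epu8) covers a whole row. */
static unsigned sad_16xh(const uint8_t *src, int src_pitch,
                         const uint8_t *ref, int ref_pitch, int height)
{
    assert(height > 0);
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < height; y++) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + y * src_pitch));
        __m128i r = _mm_loadu_si128((const __m128i *)(ref + y * ref_pitch));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(s, r));
    }
    /* PSADBW leaves one partial sum in each 64-bit half; add the halves. */
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return (unsigned)_mm_cvtsi128_si32(acc);
#else
    return sad_scalar(src, src_pitch, ref, ref_pitch, 16, height);
#endif
}

/* 8-pixel-wide block: two rows are packed into one XMM register, which
 * costs extra loads and a shuffle for both operands before the PSADBW. */
static unsigned sad_8xh_packed(const uint8_t *src, int src_pitch,
                               const uint8_t *ref, int ref_pitch, int height)
{
    assert(height > 0 && height % 2 == 0);
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < height; y += 2) {
        __m128i s = _mm_unpacklo_epi64(
            _mm_loadl_epi64((const __m128i *)(src + y * src_pitch)),
            _mm_loadl_epi64((const __m128i *)(src + (y + 1) * src_pitch)));
        __m128i r = _mm_unpacklo_epi64(
            _mm_loadl_epi64((const __m128i *)(ref + y * ref_pitch)),
            _mm_loadl_epi64((const __m128i *)(ref + (y + 1) * ref_pitch)));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(s, r));
    }
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return (unsigned)_mm_cvtsi128_si32(acc);
#else
    return sad_scalar(src, src_pitch, ref, ref_pitch, 8, height);
#endif
}
```

The 16-wide loop needs two loads and one PSADBW per row, whereas the packed 8-wide loop needs four loads and two unpacks per pair of rows, which is the overhead that tends to make MMX the better fit for 8-wide blocks.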

TSchniede · 24th June 2008, 15:28

I have created a version which copies the source block to an aligned buffer area before doing the SAD calculation; because the buffer is reused a few times, the overhead is small. With unaligned source blocks, the cache-optimized functions get a speedup (if they worked at all before).

I still have to do a bit of testing and optimizing.
I am curious, though: why does the buffer block for DCT have a minimum width of 16 (the pitch)?
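The copy step could look roughly like the following C sketch. This is not the actual MVTools code: the struct, the names, the fixed buffer pitch of 16 and the use of 64 as the alignment (borrowed from the alignment constant mentioned in the 1.9.5 change list) are assumptions for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

#define ALIGN_BYTES 64   /* assumed; mirrors the "64 now" alignment constant */
#define MAX_BLK     16   /* largest block width/height handled by this sketch */

/* Hypothetical aligned scratch buffer holding one source block. */
typedef struct {
    alignas(ALIGN_BYTES) uint8_t pixels[MAX_BLK * MAX_BLK];
    int pitch;           /* bytes per row inside the buffer */
} AlignedBlock;

/* Copy a width x height block from a (possibly unaligned) position in the
 * frame into the aligned buffer, once per block, before the motion search. */
static void copy_block_aligned(AlignedBlock *dst, const uint8_t *src,
                               int src_pitch, int width, int height)
{
    assert(width > 0 && width <= MAX_BLK);
    assert(height > 0 && height <= MAX_BLK);
    assert((uintptr_t)dst->pixels % ALIGN_BYTES == 0);
    dst->pitch = MAX_BLK;
    for (int y = 0; y < height; y++)
        memcpy(dst->pixels + y * dst->pitch, src + (size_t)y * src_pitch,
               (size_t)width);
}
```

Because every candidate position in the search compares against the same source block, the copy is paid once per block and the source operand of each SAD is then always aligned; only the reference access stays unaligned.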

yup · 24th June 2008, 17:04

Hi all!
Simple question: what is the default value of searchparam for search=3?
yup.

