Old 20th June 2008, 03:53   #741  |  Link
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
That's pretty impressive: up to a 15% speed boost!

How about trying x264's 8x8 cacheline split variants?
Old 20th June 2008, 06:11   #742  |  Link
TSchniede
Registered User
 
Join Date: Aug 2006
Posts: 77
I have included every exported function. (I have only listed the 16x16 variants, though, because most functions exist only as 16xY variants and timings for different block sizes are not comparable.) For example, for 16x16 with the SSSE3 cache-split code there are no 8x8 or 8x16 variants, as far as I can tell, so I set the cache64_mmxext variants for the color planes. I used all functions that produce a single SAD and are visible (cglobal), in all defined variants.
8x8 blocks have the disadvantage that chroma is 4x8 or 4x4, which only exists in the default functions of MVTools and in the plain mmxext functions. Everything SHOULD work right now, and as far as possible the most applicable x264 function is chosen (if any exists).
I haven't modified the x264 assembler files because I still can't predict whether that would degrade performance (at best I can test it on my own machines), and I wanted to make an update possible where only the new files have to be transferred and recompiled.
__________________
GA-P35-DS3R, Core2Quad Q9300@3GHz, 4.0GB/800 MHz DDR2, 2x250GB SATA HD, Geforce 6800
Old 20th June 2008, 09:22   #743  |  Link
TSchniede
Registered User
 
Join Date: Aug 2006
Posts: 77
If anyone is curious...

My modified version can be downloaded here: mvtools-CLS-F.zip

Please keep in mind that it is still beta.

Functions used:
Beware that chroma for YV12 is half the block size and chroma for YUY2 is half the width, so only 16x16 and 16x8 have a cache-line-dependent function for both chroma and luma.
As far as possible, the same cache line optimization is used for both.
If a function doesn't exist for the requested block size, the default iSSE function is used.

SadXxY_iSSE, with X and Y each being 16, 8 or 4

x264_pixel_sad_16x16_sse2
x264_pixel_sad_16x8_sse2
x264_pixel_sad_16x16_sse3
x264_pixel_sad_16x8_sse3
x264_pixel_sad_16x16_cache64_sse2
x264_pixel_sad_16x8_cache64_sse2
x264_pixel_sad_16x16_cache64_ssse3
x264_pixel_sad_16x8_cache64_ssse3

x264_pixel_sad_16x16_cache32_mmxext
x264_pixel_sad_16x8_cache32_mmxext
x264_pixel_sad_16x16_cache64_mmxext
x264_pixel_sad_16x8_cache64_mmxext
x264_pixel_sad_8x16_cache32_mmxext
x264_pixel_sad_8x8_cache32_mmxext
x264_pixel_sad_8x4_cache32_mmxext
x264_pixel_sad_8x16_cache64_mmxext
x264_pixel_sad_8x8_cache64_mmxext
x264_pixel_sad_8x4_cache64_mmxext
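
The fallback rule described above can be sketched roughly as follows. This is only an illustration, not the actual MVTools code: the table layout and the helper name find_x264_sad are made up, and only the cache64_mmxext names from the list are included. A NULL result stands for "no x264 routine for this block size, use the default SadXxY_iSSE function".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry: a block size and the name of the x264 routine.
   The names come from the list above; the struct itself is an assumption. */
typedef struct {
    int width, height;
    const char *name;
} SadEntry;

static const SadEntry cache64_mmxext[] = {
    {16, 16, "x264_pixel_sad_16x16_cache64_mmxext"},
    {16,  8, "x264_pixel_sad_16x8_cache64_mmxext"},
    { 8, 16, "x264_pixel_sad_8x16_cache64_mmxext"},
    { 8,  8, "x264_pixel_sad_8x8_cache64_mmxext"},
    { 8,  4, "x264_pixel_sad_8x4_cache64_mmxext"},
};

/* Return the x264 routine name for this block size, or NULL when no such
   routine exists and the caller should fall back to the default function. */
const char *find_x264_sad(int width, int height)
{
    assert(width > 0 && height > 0);
    size_t count = sizeof cache64_mmxext / sizeof cache64_mmxext[0];
    for (size_t i = 0; i < count; i++) {
        if (cache64_mmxext[i].width == width &&
            cache64_mmxext[i].height == height)
            return cache64_mmxext[i].name;
    }
    return NULL;
}
```

With 8x8 luma blocks, YV12 chroma is 4x4, so the lookup for chroma returns NULL and the default function is used, which matches the caveat about chroma above.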
Old 20th June 2008, 22:41   #744  |  Link
Fizick
AviSynth plugger
 
Join Date: Nov 2003
Location: Russia
Posts: 2,183
TSchniede,
thanks for the contribution (from Dark Shikari)!
Please increment the version number to 1.9.5 or greater.
__________________
My Avisynth plugins are now at http://avisynth.org.ru and mirror at http://avisynth.nl/users/fizick
I usually do not provide a technical support in private messages.
Old 21st June 2008, 14:04   #745  |  Link
TSchniede
Registered User
 
Join Date: Aug 2006
Posts: 77
OK,

here is the somewhat cleaned-up version:

mvtools_1.9.5

Differences from MVTools 1.9.3:
Added the parameter sadx264 to MVAnalyse.
Modified the old SAD functions to use the same interface as the x264 ones, which in turn made changes to PlaneOfBlocks necessary. MVPlane now has a longer array of plane pointers: the upper half is used to free the memory, and both the pitch and the plane pointer (after adding padding) are aligned to the alignment constant defined in MVInterface (now 64).
Minor: DebugPrint can now be disabled in MVInterface (stdio is outside the #ifdef area). Disabled annoying security warnings.
The information about which SAD function to use is passed via additional flags. For simpler function pointer definitions, the copy code and the Luma and Variance functions now all follow the XxY naming scheme, which also means the respective files were modified, and MVIncrease and MVCompensate were modified as well. The English documentation is up to date too.
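
For readers wondering what "pitch and plane pointer aligned after padding" means in practice, here is a minimal sketch under my own assumptions. The names ALIGN_PLANES, AlignedPlane and the two functions are hypothetical and not taken from the MVTools source; the idea is only that the raw malloc pointer is kept for freeing while the pointer to the first visible pixel is the aligned one.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGN_PLANES 64  /* hypothetical name for the alignment constant */

typedef struct {
    uint8_t *raw;     /* pointer returned by malloc, kept only for free() */
    uint8_t *pixels;  /* first visible pixel (after padding), 64-byte aligned */
    size_t   pitch;   /* bytes per row, a multiple of ALIGN_PLANES */
} AlignedPlane;

/* Round value up to the next multiple of alignment. */
static uintptr_t align_up(uintptr_t value, uintptr_t alignment)
{
    return (value + alignment - 1) / alignment * alignment;
}

/* Allocate a plane with hpad/vpad padding pixels on each side.
   Returns 0 on success, -1 if the allocation fails. */
int aligned_plane_alloc(AlignedPlane *p, size_t width, size_t height,
                        size_t hpad, size_t vpad)
{
    assert(p != NULL);
    size_t pitch = (size_t)align_up(width + 2 * hpad, ALIGN_PLANES);
    size_t rows = height + 2 * vpad;
    uint8_t *raw = (uint8_t *)malloc(pitch * rows + ALIGN_PLANES);
    if (raw == NULL)
        return -1;
    /* Place the row start so that (row start + hpad) is aligned. Because the
       pitch is a multiple of the alignment, every visible row start is too. */
    uintptr_t first = align_up((uintptr_t)raw + hpad, ALIGN_PLANES);
    p->raw = raw;
    p->pitch = pitch;
    p->pixels = (uint8_t *)first + vpad * pitch;
    return 0;
}

void aligned_plane_free(AlignedPlane *p)
{
    assert(p != NULL);
    free(p->raw);
    p->raw = p->pixels = NULL;
}
```

Keeping both pointers in one struct plays the same role as the "upper half of the array is used to free the memory" arrangement described above.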
Old 21st June 2008, 15:09   #746  |  Link
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
You should probably just port the CPU detection code too while you're at it so that people don't have to select the SAD to use manually.
Old 21st June 2008, 15:19   #747  |  Link
Fizick
AviSynth plugger
 
Join Date: Nov 2003
Location: Russia
Posts: 2,183
TSchniede,
OK, I will try to merge my betas with yours.
Old 21st June 2008, 16:11   #748  |  Link
TSchniede
Registered User
 
Join Date: Aug 2006
Posts: 77
I'll try to port the CPU detection code; a manual override probably won't hurt, though.
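
A rough sketch of automatic selection with a manual override. Everything here is an assumption for illustration: the flag names, the SadKind values and the preference order are invented and do not reflect the real sadx264 numbering or the x264 CPU detection code.

```c
#include <assert.h>

/* Hypothetical CPU capability bits (not the real x264 flag values). */
enum {
    CPU_MMXEXT       = 1,
    CPU_SSE2         = 2,
    CPU_SSE3         = 4,
    CPU_SSSE3        = 8,
    CPU_CACHELINE_64 = 16
};

/* Hypothetical SAD families; SAD_AUTO means "let the plugin decide". */
typedef enum {
    SAD_AUTO = -1,
    SAD_ISSE,
    SAD_MMXEXT_CACHE64,
    SAD_SSE2,
    SAD_SSE3,
    SAD_SSE2_CACHE64,
    SAD_SSSE3_CACHE64
} SadKind;

/* A manual override always wins; otherwise pick the most specialized family
   the CPU supports. The preference order is an assumption, not a benchmark. */
SadKind choose_sad(unsigned cpu, SadKind override)
{
    assert(override >= SAD_AUTO && override <= SAD_SSSE3_CACHE64);
    if (override != SAD_AUTO)
        return override;

    int cache64 = (cpu & CPU_CACHELINE_64) != 0;
    if (cache64 && (cpu & CPU_SSSE3))  return SAD_SSSE3_CACHE64;
    if (cache64 && (cpu & CPU_SSE2))   return SAD_SSE2_CACHE64;
    if (cpu & CPU_SSE3)                return SAD_SSE3;
    if (cpu & CPU_SSE2)                return SAD_SSE2;
    if (cache64 && (cpu & CPU_MMXEXT)) return SAD_MMXEXT_CACHE64;
    return SAD_ISSE;
}
```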
Old 21st June 2008, 18:06   #749  |  Link
Fizick
AviSynth plugger
 
Join Date: Nov 2003
Location: Russia
Posts: 2,183
Released public beta 1.9.5.1, with all changes by TSchniede merged.
Almost untested.
Old 21st June 2008, 22:34   #750  |  Link
superuser
Registered User
 
Join Date: Sep 2006
Posts: 84
^ Thanks, I will give it a try soon.
Old 22nd June 2008, 17:23   #751  |  Link
Fizick
AviSynth plugger
 
Join Date: Nov 2003
Location: Russia
Posts: 2,183
Someone tells me that some sadx264 modes (=7) do not work with overlap.
Old 22nd June 2008, 20:51   #752  |  Link
Boulder
Pig on the wing
 
Join Date: Mar 2002
Location: Finland
Posts: 5,718
A quick test:

5000 frames of a simple MPEG2Source + MVDegrain2 script on my Q6750:

Code:
MPEG2Source("path\clip.d2v")
Denoise()

function denoise(clip c)
{
vbw1=MVAnalyse(c,isb=true,truemotion=true,delta=1,pel=2,chroma=false,blksize=8,idx=1,overlap=4,sadx264=3)
vfw1=MVAnalyse(c,isb=false,truemotion=true,delta=1,pel=2,chroma=false,blksize=8,idx=1,overlap=4,sadx264=3)
vbw2=MVAnalyse(c,isb=true,truemotion=true,delta=2,pel=2,chroma=false,blksize=8,idx=1,overlap=4,sadx264=3)
vfw2=MVAnalyse(c,isb=false,truemotion=true,delta=2,pel=2,chroma=false,blksize=8,idx=1,overlap=4,sadx264=3)
return MVDegrain2(c,vbw1,vfw1,vbw2,vfw2,thSAD=400,idx=1)
}
sadx264=0 : 0:20:19
sadx264=1 : 0:20:14
sadx264=3 : 0:19:07

A nice improvement, I'd say. Too bad the higher modes are not available for blocksize 8.
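
To put the timings above into numbers, here is a small sketch of the arithmetic. The helper names are made up, and the result is in tenths of a percent so that integer math is enough.

```c
#include <assert.h>

/* Convert an h:mm:ss running time to seconds. */
int to_seconds(int hours, int minutes, int seconds)
{
    return (hours * 60 + minutes) * 60 + seconds;
}

/* Time saved relative to the baseline, in tenths of a percent (rounded down). */
int saving_permille(int baseline_s, int new_s)
{
    assert(baseline_s > 0);
    return (baseline_s - new_s) * 1000 / baseline_s;
}
```

For the runs above, sadx264=0 is 1219 s, sadx264=1 is 1214 s and sadx264=3 is 1147 s, so sadx264=3 saves about 5.9% of the running time of this script and sadx264=1 about 0.4%.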
__________________
And if the band you're in starts playing different tunes
I'll see you on the dark side of the Moon...
Old 23rd June 2008, 00:12   #753  |  Link
TSchniede
Registered User
 
Join Date: Aug 2006
Posts: 77
Quote:
Originally Posted by Fizick View Post
Someone tells me that some sadx264 modes (=7) do not work with overlap.
This is unfortunately true. The reason is simple: overlap creates non-aligned source blocks, which means that both the source and the reference accesses are unaligned. This would not be a real problem if the reason for the workarounds didn't exist in the first place. Only the LDDQU and MMX workarounds can be used on unaligned source blocks. I don't expect a real performance gain if data access is unaligned for both. An overlap of 8 on 16-wide blocks obviously works with MMX without serious performance loss.

An additional version for an overlap of 8 on 16-wide blocks might work with the SSE2/SSSE3 workaround, as the source would only alternate between aligned and a known misalignment.
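
A small sketch of where the misalignment comes from, assuming 8-bit luma (one byte per pixel), a row that itself starts on an aligned address, and blocks that advance by blksize minus overlap. The function names are made up for illustration.

```c
#include <assert.h>

/* Horizontal position of block number index within a row. */
int block_x(int index, int blksize, int overlap)
{
    assert(blksize > 0 && overlap >= 0 && overlap < blksize);
    return index * (blksize - overlap);
}

/* Offset of that block from the previous alignment boundary, in bytes. */
int misalignment(int index, int blksize, int overlap, int alignment)
{
    assert(alignment > 0);
    return block_x(index, blksize, overlap) % alignment;
}
```

With blksize=16 and overlap=8 the offsets relative to 16 bytes alternate 0, 8, 0, 8, which is the "aligned and known misalignment" case. With blksize=8 and overlap=4, as in the test script above, they cycle through 0, 4, 8, 12.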
Old 23rd June 2008, 00:20   #754  |  Link
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
Quote:
Originally Posted by TSchniede View Post
This is unfortunately true. The reason is simple: overlap creates non-aligned source blocks, which means that both the source and the reference accesses are unaligned. This would not be a real problem if the reason for the workarounds didn't exist in the first place. Only the LDDQU and MMX workarounds can be used on unaligned source blocks. I don't expect a real performance gain if data access is unaligned for both. An overlap of 8 on 16-wide blocks obviously works with MMX without serious performance loss.

An additional version for an overlap of 8 on 16-wide blocks might work with the SSE2/SSSE3 workaround, as the source would only alternate between aligned and a known misalignment.
Why not load the source pixels for each block into an aligned buffer before doing the motion search on that block? I suspect that would save time even without x264's assembly (though only for width16 blocks and SSE code, of course).
Old 23rd June 2008, 19:31   #755  |  Link
akupenguin
x264 developer
 
Join Date: Sep 2004
Posts: 2,392
Quote:
Originally Posted by Boulder View Post
Too bad the higher modes are not available for blocksize 8.
Too bad we don't use width 16 registers to process width 8 blocks?
Old 23rd June 2008, 19:43   #756  |  Link
Boulder
Pig on the wing
 
Join Date: Mar 2002
Location: Finland
Posts: 5,718
Quote:
Originally Posted by akupenguin View Post
Too bad we don't use width 16 registers to process width 8 blocks?
Oh, you'll have to explain this in layman's terms. I'm merely a simple end user myself.
Old 23rd June 2008, 21:42   #757  |  Link
akupenguin
x264 developer
 
Join Date: Sep 2004
Posts: 2,392
mmx can sad an 8-byte block in 1 cycle. xmm can sad a 16-byte block in 1 cycle. xmm doesn't help if your blocks are only 8 bytes.
Old 24th June 2008, 01:03   #758  |  Link
TSchniede
Registered User
 
Join Date: Aug 2006
Posts: 77
Quote:
Originally Posted by akupenguin View Post
mmx can sad an 8-byte block in 1 cycle. xmm can sad a 16-byte block in 1 cycle. xmm doesn't help if your blocks are only 8 bytes.
The MMX/SSE versions of the (spatial) SAD calculations rely on a special instruction (PSADBW) that calculates the SAD of the bytes in an MMX register. Such a register is 64 bits wide => 8 bytes or, in layman's terms, 8 luma pixels. SSE2 allows the same instruction to work on XMM registers, which are 128 bits wide.
So for each line of a 16-pixel-wide block, the SAD can be calculated with 1 instruction using SSE2, or with 2 using MMX.
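
The counting argument can be sketched in plain C, with a scalar loop standing in for one PSADBW. This is not how the assembly is written; sad_bytes, sad_block and the step counter are made up to show that a 16-wide row needs two 8-byte steps or one 16-byte step. (On real hardware the XMM form leaves two partial sums that still have to be added at the end.)

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar stand-in for one PSADBW: SAD over n consecutive bytes. */
static unsigned sad_bytes(const uint8_t *a, const uint8_t *b, int n)
{
    unsigned sum = 0;
    for (int i = 0; i < n; i++)
        sum += (unsigned)abs(a[i] - b[i]);
    return sum;
}

/* SAD of a width x height block, processed chunk bytes at a time
   (8 for an MMX register, 16 for an XMM register).
   *steps receives the number of chunk-sized SAD steps that were needed. */
unsigned sad_block(const uint8_t *src, int src_pitch,
                   const uint8_t *ref, int ref_pitch,
                   int width, int height, int chunk, int *steps)
{
    assert(chunk > 0 && width % chunk == 0);
    unsigned total = 0;
    *steps = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x += chunk) {
            total += sad_bytes(src + y * src_pitch + x,
                               ref + y * ref_pitch + x, chunk);
            (*steps)++;
        }
    }
    return total;
}
```

For a 16x16 block this takes 32 steps with chunk=8 and 16 steps with chunk=16, and both give the same SAD.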

That means SSE2 is best for 16xY blocks and MMX is best for 8xY blocks. Other sizes take more time to get the data into the instruction.
On Intel chips alignment in memory is very important, so using 2 aligned memory accesses instead of one unaligned access is faster; hence the speedup from the cache-optimized SAD functions imported from x264.
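
A scalar sketch of the "two aligned accesses instead of one unaligned access" idea, as I understand it. The function names are made up; the real routines do the combining with shift or PALIGNR-style instructions rather than a byte loop.

```c
#include <assert.h>
#include <stdint.h>

/* Does a load of size bytes starting at addr cross a cache line boundary? */
int crosses_cacheline(uintptr_t addr, unsigned size, unsigned line)
{
    assert(line > 0 && size <= line);
    return (addr % line) + size > line;
}

/* Read 16 bytes from an arbitrary address p using only the two 16-byte
   aligned chunks that contain them, then stitch the pieces together. */
void load16_split(const uint8_t *p, uint8_t out[16])
{
    uintptr_t shift = (uintptr_t)p & 15;  /* distance from the aligned chunk */
    const uint8_t *lo = p - shift;        /* aligned chunk containing p */
    const uint8_t *hi = lo + 16;          /* next aligned chunk */
    for (unsigned i = 0; i < 16; i++)
        out[i] = (i + shift < 16) ? lo[i + shift] : hi[i + shift - 16];
}
```

A 16-byte load at offset 60 of a 64-byte line crosses into the next line, while one at offset 48 does not, which is why only some block positions pay the penalty.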

It is possible to load two lines into one XMM register, but then both operands of the SAD have to be registers, i.e. load 2 lines into register 1, load the other 2 lines into register 2, then do the SAD, which is 5 instructions, as opposed to loading the first line (16 pixels) and then doing the SAD against memory, which is 2 instructions.
In most cases the additional overhead of copying in the data and then extracting the output makes SSE2-style SAD calculations slower than MMX for 8-pixel-wide SAD functions.
An additional advantage of MMX is that it is far easier to use on unaligned data.
Old 24th June 2008, 15:28   #759  |  Link
TSchniede
Registered User
 
Join Date: Aug 2006
Posts: 77
I have created a version which copies the source block to an aligned buffer area before doing the SAD calculation; because the buffer is reused a few times, the overhead is small. The cache-optimized functions get a speedup with unaligned source blocks (where they worked at all).
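
Roughly, the idea looks like this. It is a simplified sketch and not the code from the plugin: the names are made up, the SAD is plain C, and the "search" just tries a row of horizontal candidates so that the reuse of the copied block is visible.

```c
#include <assert.h>
#include <limits.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLK 16  /* block width and height; also the pitch of the local buffer */

/* Plain C SAD, standing in for a routine that wants an aligned first operand. */
static unsigned sad16x16(const uint8_t *a, int a_pitch,
                         const uint8_t *b, int b_pitch)
{
    unsigned sum = 0;
    for (int y = 0; y < BLK; y++)
        for (int x = 0; x < BLK; x++)
            sum += (unsigned)abs(a[y * a_pitch + x] - b[y * b_pitch + x]);
    return sum;
}

/* Copy the (possibly unaligned) source block once, then compare every
   horizontal candidate 0..range-1 in the reference against the aligned copy.
   Returns the candidate offset with the lowest SAD. */
int best_offset(const uint8_t *src, int src_pitch,
                const uint8_t *ref, int ref_pitch, int range)
{
    assert(range > 0);
    alignas(16) uint8_t block[BLK * BLK];
    for (int y = 0; y < BLK; y++)
        memcpy(block + y * BLK, src + y * src_pitch, BLK);

    unsigned best = UINT_MAX;
    int best_dx = 0;
    for (int dx = 0; dx < range; dx++) {
        unsigned sad = sad16x16(block, BLK, ref + dx, ref_pitch);
        if (sad < best) {
            best = sad;
            best_dx = dx;
        }
    }
    return best_dx;
}
```

The copy costs 16 small row copies per block, while the block is then compared against many candidates, so the cost is spread over the whole search.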

I still have to do a bit of testing and optimizing.
I am curious, though: why does the buffer block for DCT have a minimum width of 16 (the pitch)?
Old 24th June 2008, 17:04   #760  |  Link
yup
Registered User
 
Join Date: Feb 2003
Location: Russia, Moscow
Posts: 854
Hi all!
A simple question: what is the default value of searchparam for search=3?
yup.