MVTools - Page 39

TSchniede · 24th June 2008, 18:27

Quote:

Originally Posted by yup

Hi all!
Simple question: what is the default value of searchparam for search=3?
yup.

The default for searchparam is 2 for all search types.

TSchniede · 24th June 2008, 18:41

This is the new version with buffered source block.
It is still based on 1.9.5.0

The only changes are a new constant in MVInterface and in PlaneOfBlocks (all code is controlled by the constant).

My tests show a slight variation in speed (<1%) if source blocks are aligned anyway (no overlap), as the overhead and the better locality almost cancel each other out. On overlapped blocks I measured up to a 10% performance increase.
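As an illustration of what a "buffered source block" means, here is a minimal sketch and not the actual MVTools code: the block is copied out of the strided frame plane into a small contiguous buffer, so every later cost call sees the same constant stride. The function name `copy_block` and its parameters are hypothetical.

```python
def copy_block(plane, stride, x, y, width, height):
    """Copy a width x height block out of a strided plane.

    Returns a contiguous buffer whose stride is always `width`,
    regardless of the stride or alignment of the source plane.
    (Hypothetical helper, for illustration only.)
    """
    buf = bytearray(width * height)
    for row in range(height):
        src = (y + row) * stride + x          # start of this row in the plane
        buf[row * width:(row + 1) * width] = plane[src:src + width]
    return buf


# A 16-pixel-wide, 4-row plane holding the values 0..63.
plane = bytes(range(64))
block = copy_block(plane, stride=16, x=2, y=1, width=4, height=2)
print(list(block))  # [18, 19, 20, 21, 34, 35, 36, 37]
```

The copy is the overhead mentioned above; the gain comes from every subsequent comparison reading a compact block with a known layout.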

Dark Shikari · 24th June 2008, 20:13

If you aren't already, have you tried using x264's mc.copy for creating the aligned source blocks from the source data? It's blazingly fast.

Also, note that when you're using an aligned source block you can probably take great advantage of the constant stride in the various assembly functions.

Another idea: DCT is slow as hell, and x264's SATD is quite fast. How about replacing the "dct" option with a SATD option instead, borrowing x264's SATD code? And while we're taking assembly from x264, you could try using x264's 6-tap upscaling filter for hpel; it's extremely fast.

Boulder · 24th June 2008, 20:29

It seems that the new version is somewhat slower on my E6750.

With blocksize 8 and overlap 4:

The first version
x264_sad=3 : 5.7 fps

New version
x264_sad=3 : 5.3 fps

With blocksize 8 and no overlapping:

The first version
x264_sad=3 : 18.4 fps

The new version
x264_sad=3 : 16.8 fps

TSchniede · 24th June 2008, 21:12

Dark Shikari, I was looking into those things, but they are a bit more complicated than importing the sad functions, so that will take some time. Though I wasn't really looking for a dct replacement yet.

Boulder, interesting - on my Q9300 and on my Pentium M the performance is quite good. Quite a chunk of the additional overhead comes from copying, so speeding that up should help. I tried to switch between direct source block references and buffered ones based on the alignment, but that was even slower, as the overhead for that is definitely bigger than always buffering. If you are comparing Fizick's merge with my version, different compiler / Win-API versions can make a difference too (and I have no idea yet what additional tweaks were introduced).

Boulder · 24th June 2008, 21:16

Yes, it's Fizick's merge that I tested. I could run the same tests on your first build tomorrow to verify if the difference still exists.

TSchniede · 25th June 2008, 04:29

Quote:

Originally Posted by Dark Shikari

If you aren't already, have you tried using x264's mc.copy for creating the aligned source blocks from the source data? It's blazingly fast.

Also, note that when you're using an aligned source block you can probably take great advantage of the constant stride in the various assembly functions.

Another idea: DCT is slow as hell, and x264's SATD is quite fast. How about replacing the "dct" option with a SATD option instead, borrowing x264's SATD code? And while we're taking assembly from x264, you could try using x264's 6-tap upscaling filter for hpel; it's extremely fast.

I tried calling SSD and SATD. As expected, they are slow compared to plain SAD (150% for SSD, 250% for SATD) when used as replacements for SAD on both luma and chroma, but blazingly fast compared to the current dct, which needs 6x the time of SATD for luma alone. The results are not really comparable this way, though. The values have to be at least scaled to match the default assumptions and probably weighted with spatial SAD. Nevertheless it promises to be a faster alternative. I haven't been able to verify the correctness of my implementation, as it obviously isn't equivalent to any current option. It "works" in the sense that it doesn't crash and does "something".

I looked into mc-copy too, but it is far simpler than the other functions (as most optimizations are of no advantage in such a simple algorithm) and uses a different interface, so I think the best option is a reimplementation.

Soon most assembler functions will come from x264.

I am working on an optimized 4xY SAD function which takes advantage of the special source block properties too.
Right now it is faster to work on an upscaled clip with 8x8 blocks than on 4x4 blocks with pel=2.

The hpel filter is something I have to try first as a standalone Avisynth filter to make sure it will work.

Dark Shikari · 25th June 2008, 06:40

Have you tried using sad_x3/sad_x4? They're quite a bit faster than doing one SAD at a time.

SATD is a drop-in replacement for SAD (it's used as such in x264 too, for --me tesa). It doesn't need scaling, and there's a satd_x3/satd_x4 in pixel.c just to "fake" the multiple-SAD call so that it can be a drop-in replacement.
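For readers unfamiliar with the two metrics, a minimal pure-Python sketch follows. It is not x264's or MVTools' code: the function names are hypothetical, blocks are flat lists of 16 pixels (4x4), and halving the Hadamard sum is assumed here as the normalisation. `cost_x4` only mimics the idea of an "x4" wrapper that evaluates several candidates in one call.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for p, q in zip(a, b))


def hadamard4(v):
    """1-D 4-point Hadamard transform (unnormalised butterflies)."""
    s01, d01 = v[0] + v[1], v[0] - v[1]
    s23, d23 = v[2] + v[3], v[2] - v[3]
    return [s01 + s23, d01 + d23, s01 - s23, d01 - d23]


def satd4x4(a, b):
    """Sum of absolute Hadamard-transformed differences of a 4x4 block."""
    diff = [p - q for p, q in zip(a, b)]
    rows = [hadamard4(diff[r * 4:r * 4 + 4]) for r in range(4)]
    cols = [hadamard4([rows[r][c] for r in range(4)]) for c in range(4)]
    # Halving is an assumed normalisation, not a claim about any library.
    return sum(abs(x) for col in cols for x in col) // 2


def cost_x4(cost, src, candidates):
    """Evaluate one source block against several candidates in one call."""
    return [cost(src, cand) for cand in candidates]


src = [10] * 16
ref = [14] + [10] * 15            # a single pixel differs by 4

print(sad(src, ref))              # 4
print(satd4x4(src, ref))          # 32
print(cost_x4(sad, src, [src, ref, src, ref]))  # [0, 4, 0, 4]
```

Both functions share the same signature, which is what makes one a drop-in replacement for the other. The single-pixel example also shows why the transformed metric reacts more strongly to isolated noise than plain SAD does.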

Terka · 25th June 2008, 09:07

Hi guys,
it's good to hear you are improving the MVTools speed. Would it be possible to port them to run on the GPU, like fft3dgpu?
Maybe this way the speedup gain will be greater?

Boulder · 25th June 2008, 16:12

Quote:

Originally Posted by Boulder

It seems that the new version is somewhat slower on my E6750.

With blocksize 8 and overlap 4:

The first version
x264_sad=3 : 5.7 fps

New version
x264_sad=3 : 5.3 fps

With blocksize 8 and no overlapping:

The first version
x264_sad=3 : 18.4 fps

The new version
x264_sad=3 : 16.8 fps

I tested the first build and here are the results:

blksize 8, overlap 4 : 5.2 fps
blksize 8, overlap 0 : 17.1 fps

Apparently Fizick's official 1.9.5.1 build is a tad bit faster.

Fizick · 25th June 2008, 18:59

IMO, SSD or SATD is not useful.
But the Hadamard (if I spelled it correctly) transform is an interesting, faster alternative to DCT.
I do not remember where I saw it, MPlayer or x264.

TSchniede · 25th June 2008, 19:47

Quote:

Originally Posted by Terka

Hi guys,
it's good to hear you are improving the MVTools speed. Would it be possible to port them to run on the GPU, like fft3dgpu?
Maybe this way the speedup gain will be greater?

Right now we are talking about making MVAnalyse faster. Unfortunately the algorithm is highly linear and you can't split the frame into smaller chunks without sacrificing quality. Even single-threaded, the memory footprint is huge. So I really doubt that moving to a (relatively) memory-constrained, low-clock-rate platform with many cores will help. A basic 8x8, overlap=0 MVDegrain3 on a PAL clip runs at nearly real time with SetMTMode(2,4) on my Q9300 anyway, even without the latest tweaks. And we already use SSE to work on several pixels at once, so there is little left that could be done better in parallel. I have no real knowledge of how well GPUs respond to huge amounts of conditional code and synchronization, but it doesn't seem really plausible.

TSchniede · 25th June 2008, 19:56

Quote:

Originally Posted by Fizick

IMO, SSD or SATD is not useful.
But the Hadamard (if I spelled it correctly) transform is an interesting, faster alternative to DCT.
I do not remember where I saw it, MPlayer or x264.

You are unfortunately right. In the current code SATD seems to work inversely to SAD, meaning the best "SAD" occurs in the worst-case scenario: the scene change. I am currently investigating how Hadamard is supposed to work. I can't say whether something is working as expected or is useful if I don't know what it should do in the first place.

Dark Shikari · 25th June 2008, 21:08

Quote:

Originally Posted by Fizick

IMO, SSD or SATD is not useful.
But the Hadamard (if I spelled it correctly) transform is an interesting, faster alternative to DCT.

SATD is the Hadamard transform and is better than SAD because it doesn't fail miserably in the case of fades. It's a far faster alternative to the DCT.

TSchniede · 26th June 2008, 00:05

Quote:

Originally Posted by Dark Shikari

SATD is the Hadamard transform and is better than SAD because it doesn't fail miserably in the case of fades. It's a far faster alternative to the DCT.

I think I have found my previous error. I had forgotten a debug instruction AND SATD triggered a lot of scene changes. It is definitely far more sensitive to noise than SAD, so it is somewhat "sharper". SSD is even more extreme in that respect.

On a blurred clip the picture is completely reversed: SATD is definitely superior to SAD.
On a grainy source (addgrain(20)), the blurred SATD is virtually identical to SAD without blur(1). On fades only SATD produces decent quality. I suppose better prefiltering would work even better. Besides, it is a lot faster than the default dct.

Terranigma · 26th June 2008, 00:18

Have a compile with "SATD" that we can test?

TSchniede · 26th June 2008, 01:52

Quote:

Originally Posted by Terranigma

Have a compile with "SATD" that we can test?

You can get my current version here.
It is still based on 1.9.3. I think it's time to get Fizick's version and update my Win API.

There are two ways to access the new functions:
sadx264: 8-12
dct: 5-10

For a description, see the documentation.

I have only tested the basic functionality so far, and I haven't thoroughly looked for potential performance problems with dct mode. There are definitely some minor parts which will need some work if this is going to stay.

The changes are in SADFunctions.h and PlaneOfBlocks (and of course MVInterface & MVAnalyse); pixel*.asm files were added.

Terranigma · 26th June 2008, 03:26

Thanks TSchniede.
Since you guys are borrowing code from x264, how difficult would it be to port over the hexagon and multi hexagon search algorithms?

TSchniede · 26th June 2008, 04:05

Quote:

Originally Posted by Terranigma

Since you guys are borrowing code from x264, how difficult would it be to port over the hexagon and multi hexagon search algorithms?

I don't know. A quick glance over the source wasn't that informative.

I don't know the algorithm yet, so I can't say whether it would be useful in the first place. It might be possible to add a hexagonal search function to MVTools alongside the others. There seems to be some similarity to the logarithmic search. But it would be more of a reimplementation anyway. The other functions were small assembler functions where the possible interfaces were very limited. I have only adapted the MVTools interface for calling them (where necessary), as they are more complex than the default functions.
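For orientation, the general idea of a hexagon-based search can be sketched as follows. This is a simplified illustration written from the generic description of the technique, not x264's implementation and not anything in MVTools; `hex_search`, `cost` and `max_steps` are hypothetical names. The search repeatedly moves to the best of six points around the current centre, stops when the centre itself is best, and then refines over the eight immediate neighbours.

```python
# Six points of the large hexagon and the eight neighbours used for refinement.
HEX = [(-2, 0), (-1, -2), (1, -2), (2, 0), (1, 2), (-1, 2)]
SQUARE = [(-1, -1), (0, -1), (1, -1), (-1, 0), (1, 0), (-1, 1), (0, 1), (1, 1)]


def hex_search(cost, start=(0, 0), max_steps=16):
    """Return (best_vector, best_cost) for a callable cost((dx, dy))."""
    best, best_cost = start, cost(start)
    for _ in range(max_steps):
        center = best
        for dx, dy in HEX:
            cand = (center[0] + dx, center[1] + dy)
            c = cost(cand)
            if c < best_cost:
                best, best_cost = cand, c
        if best == center:        # no hexagon point beat the centre: stop
            break
    center = best
    for dx, dy in SQUARE:         # final one-pixel refinement
        cand = (center[0] + dx, center[1] + dy)
        c = cost(cand)
        if c < best_cost:
            best, best_cost = cand, c
    return best, best_cost


# Toy cost surface with its minimum at the vector (5, -3).
def toy_cost(mv):
    return abs(mv[0] - 5) + abs(mv[1] + 3)


print(hex_search(toy_cost))  # ((5, -3), 0)
```

In a real motion search `cost` would be a SAD or SATD between the source block and the reference block displaced by the candidate vector, and candidates outside the allowed search range would have to be rejected.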

Manao · 26th June 2008, 05:58

Quote:

SATD is the Hadamard transform and is better than SAD because it doesn't fail miserably in the case of fades

I disagree, it still fails miserably on fades. Fades make motion vectors go crazy because of DC change, and SATD is as sensitive to that as SAD.
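The DC argument can be checked on a toy 4x4 block. The sketch below is an illustration only, with hypothetical function names and an assumed halving normalisation; the `skip_dc` variant is not an option of MVTools or x264, it is included only to show that the whole cost of a pure brightness shift sits in the DC coefficient.

```python
def sad(a, b):
    """Sum of absolute differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def hadamard4(v):
    """1-D 4-point Hadamard transform (unnormalised butterflies)."""
    s01, d01 = v[0] + v[1], v[0] - v[1]
    s23, d23 = v[2] + v[3], v[2] - v[3]
    return [s01 + s23, d01 + d23, s01 - s23, d01 - d23]


def satd4x4(a, b, skip_dc=False):
    """4x4 SATD; skip_dc drops the DC coefficient (hypothetical variant)."""
    diff = [p - q for p, q in zip(a, b)]
    rows = [hadamard4(diff[r * 4:r * 4 + 4]) for r in range(4)]
    cols = [hadamard4([rows[r][c] for r in range(4)]) for c in range(4)]
    coeffs = [x for col in cols for x in col]   # coeffs[0] is the DC term
    if skip_dc:
        coeffs = coeffs[1:]
    return sum(abs(x) for x in coeffs) // 2     # assumed normalisation


block = list(range(16))
faded = [p + 4 for p in block]    # same content, brightness raised by 4

print(sad(block, faded))                    # 64
print(satd4x4(block, faded))                # 32
print(satd4x4(block, faded, skip_dc=True))  # 0
```

Both SAD and SATD grow linearly with the brightness shift even though the block content is unchanged, which is the sensitivity to DC change described above.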
