24th June 2008, 18:41 | #762 | Link |
Registered User
Join Date: Aug 2006
Posts: 77
|
This is the new version with buffered source blocks.
It is still based on 1.9.5.0. The only changes are a new constant in MVInterface and in PlaneOfBlocks (all of the new code is controlled by the constant). My tests show only a slight variation in speed (<1%) if the source blocks are aligned anyway (no overlap), as the overhead and the better locality almost cancel each other out. On overlapped blocks I measured up to a 10% performance increase.
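A minimal sketch of the buffered source block idea, assuming a hypothetical helper `copy_block` (the actual MVTools code and its constant are not shown in this post): the block is copied once out of the frame plane into a small buffer, so that every later SAD call sees a contiguous block with a constant stride.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not MVTools code: copy a W x H block out of a frame
// plane (row pitch src_pitch) into a contiguous buffer whose stride is the
// compile-time constant W.
template <int W, int H>
void copy_block(std::uint8_t* dst, const std::uint8_t* src, int src_pitch)
{
    assert(src_pitch >= W);  // a plane row must be at least one block wide
    for (int y = 0; y < H; ++y) {
        std::memcpy(dst + y * W, src + y * src_pitch, W);
    }
}
```

A caller could declare the buffer as `alignas(16) std::uint8_t buf[8 * 8];` and fill it once per block with `copy_block<8, 8>(buf, plane + y * pitch + x, pitch)`. The copy is the overhead mentioned above; the alignment and constant stride are the gain, which matters most for overlapped blocks that do not start on an aligned address.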
__________________
GA-P35-DS3R, Core2Quad Q9300@3GHz, 4.0GB/800 MHz DDR2, 2x250GB SATA HD, Geforce 6800
Last edited by TSchniede; 2nd July 2008 at 00:59. |
24th June 2008, 20:13 | #763 | Link |
x264 developer
Join Date: Sep 2005
Posts: 8,666
|
If you aren't already, have you tried using x264's mc.copy for creating the aligned source blocks from the source data? It's blazingly fast.
Also, note that when you're using an aligned source block you can probably take great advantage of the constant stride in the various assembly functions.
Another idea: DCT is slow as hell, and x264's SATD is quite fast. How about replacing the "dct" option with a SATD option instead, borrowing x264's SATD code? And while we're taking assembly from x264, you could try using x264's 6-tap upscaling filter for hpel; it's extremely fast.
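For readers who have not met SATD before, here is a minimal scalar sketch of a 4x4 SATD (sum of absolute Hadamard-transformed differences). It is not x264's code; the function name `satd4x4` and the final halving are assumptions made for illustration, and a real implementation would use SIMD.

```cpp
#include <cstdint>
#include <cstdlib>

// Illustrative 4x4 SATD: take the pixel differences, apply a 4-point Hadamard
// transform to the rows and then to the columns, and sum the absolute values
// of the resulting coefficients.
int satd4x4(const std::uint8_t* a, int a_pitch,
            const std::uint8_t* b, int b_pitch)
{
    int d[4][4];
    for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4; ++x)
            d[y][x] = a[y * a_pitch + x] - b[y * b_pitch + x];

    // Horizontal pass: Hadamard butterflies on every row.
    for (int y = 0; y < 4; ++y) {
        int s0 = d[y][0] + d[y][1], s1 = d[y][0] - d[y][1];
        int s2 = d[y][2] + d[y][3], s3 = d[y][2] - d[y][3];
        d[y][0] = s0 + s2; d[y][1] = s1 + s3;
        d[y][2] = s0 - s2; d[y][3] = s1 - s3;
    }

    // Vertical pass on every column, accumulating absolute coefficients.
    int sum = 0;
    for (int x = 0; x < 4; ++x) {
        int s0 = d[0][x] + d[1][x], s1 = d[0][x] - d[1][x];
        int s2 = d[2][x] + d[3][x], s3 = d[2][x] - d[3][x];
        sum += std::abs(s0 + s2) + std::abs(s1 + s3)
             + std::abs(s0 - s2) + std::abs(s1 - s3);
    }
    return sum / 2;  // assumed normalization; conventions differ
}
```

With this normalization a uniform brightness offset of 1 over the whole block scores 8 (plain SAD would give 16), because the entire difference lands in the single DC coefficient.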
__________________
Follow x264 development progress | akupenguin quotes | x264 git status ffmpeg and x264-related consulting/coding contracts | Doom10
Last edited by Dark Shikari; 24th June 2008 at 20:17. |
24th June 2008, 20:29 | #764 | Link |
Pig on the wing
Join Date: Mar 2002
Location: Finland
Posts: 5,731
|
It seems that the new version is somewhat slower on my E6750.
With blocksize 8 and overlap 4:
The first version, x264_sad=3: 5.7 fps
The new version, x264_sad=3: 5.3 fps
With blocksize 8 and no overlapping:
The first version, x264_sad=3: 18.4 fps
The new version, x264_sad=3: 16.8 fps
__________________
And if the band you're in starts playing different tunes I'll see you on the dark side of the Moon... |
24th June 2008, 21:12 | #765 | Link |
Registered User
Join Date: Aug 2006
Posts: 77
|
Dark Shikari, I was looking into those things, but they are a bit more complicated than importing the SAD functions, so that will take some time. I wasn't really looking for a DCT replacement yet, though.
Boulder, interesting - on my Q9300 and on my Pentium M the performance is quite good. Quite a chunk of the additional overhead comes from the copying, so speeding that up should help. I tried switching between direct source block references and buffered blocks based on the alignment, but that was even slower, as the overhead of the check is definitely bigger than that of always buffering. If you are comparing Fizick's merge with my version, different compiler / Win-API versions can make a difference too (and I have no idea yet what additional tweaks were introduced).
24th June 2008, 21:16 | #766 | Link |
Pig on the wing
Join Date: Mar 2002
Location: Finland
Posts: 5,731
|
Yes, it's Fizick's merge that I tested. I could run the same tests on your first build tomorrow to verify if the difference still exists.
25th June 2008, 04:29 | #767 | Link | |
Registered User
Join Date: Aug 2006
Posts: 77
|
Quote:
I looked into mc-copy too, but it is far simpler than the other functions (as most optimizations are of no advantage in such a simple algorithm) and uses a different interface, so I think the best option is a reimplementation. Soon most assembler functions will come from x264. I am also working on an optimized 4xY SAD function which takes advantage of the special source block properties. Right now it is faster to work on an upscaled clip with 8x8 blocks than on 4x4 blocks with pel=2. The hpel filter is something I have to try first as a standalone AviSynth filter to make sure it will work.
25th June 2008, 06:40 | #768 | Link |
x264 developer
Join Date: Sep 2005
Posts: 8,666
|
Have you tried using sad_x3/sad_x4? They're quite a bit faster than doing one SAD at a time.
SATD is a drop-in replacement for SAD (it's used as such in x264 too, for --me tesa). It doesn't need scaling and there's a satd_x3/satd_x4 in pixel.c just to "fake" the multiple-SAD-call to allow it to be a drop-in replacement. |
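To illustrate why a multi-candidate SAD call can beat four single calls, here is a scalar sketch under stated assumptions: the function name `sad8x8_x4`, the fixed 8x8 block size and the source stride of 8 are hypothetical, and x264's real routines are hand-written assembly. The point is only that each source pixel is loaded once and reused for all four candidates.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical scalar model of a "SAD x4" call: one 8x8 source block stored
// contiguously (stride 8) is compared against four candidate reference
// blocks that share a single pitch. scores[k] receives the SAD for refs[k].
void sad8x8_x4(const std::uint8_t* src,
               const std::uint8_t* ref0, const std::uint8_t* ref1,
               const std::uint8_t* ref2, const std::uint8_t* ref3,
               int ref_pitch, int scores[4])
{
    const std::uint8_t* refs[4] = {ref0, ref1, ref2, ref3};
    for (int k = 0; k < 4; ++k) scores[k] = 0;

    for (int y = 0; y < 8; ++y) {
        for (int x = 0; x < 8; ++x) {
            const int s = src[y * 8 + x];  // loaded once, used four times
            for (int k = 0; k < 4; ++k)
                scores[k] += std::abs(s - refs[k][y * ref_pitch + x]);
        }
    }
}
```

A SATD variant with the same signature could simply call a single-block SATD four times, which is the kind of "fake" multi-call described above.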
25th June 2008, 16:12 | #770 | Link | |
Pig on the wing
Join Date: Mar 2002
Location: Finland
Posts: 5,731
|
Quote:
blksize 8, overlap 4: 5.2 fps
blksize 8, overlap 0: 17.1 fps
Apparently Fizick's official 1.9.5.1 build is a tad faster.
25th June 2008, 18:59 | #771 | Link |
AviSynth plugger
Join Date: Nov 2003
Location: Russia
Posts: 2,183
|
IMO, SSD or SATD is not useful.
But the Hadamard transform (if I spelled it correctly) is an interesting, faster alternative to DCT. I do not remember where I saw it, MPlayer or x264.
__________________
My Avisynth plugins are now at http://avisynth.org.ru and mirror at http://avisynth.nl/users/fizick I usually do not provide a technical support in private messages. |
25th June 2008, 19:47 | #772 | Link |
Registered User
Join Date: Aug 2006
Posts: 77
|
Right now we are talking about making MVAnalyse faster. Unfortunately the algorithm is highly linear, and you can't split the frame into smaller chunks without sacrificing quality. Even single-threaded, the memory footprint is huge. So I really doubt that moving to a (relatively) memory-constrained, low-clock-rate platform with many cores will help. A basic 8x8, overlap=0 MVDegrain3 on a PAL clip runs in nearly real time with SetMTMode(2,4) on my Q9300 anyway, even without the latest tweaks. And we already use SSE to work on several pixels at once, so there is little that can still be done better in parallel. I have no real knowledge of how well GPUs respond to huge amounts of conditional code and synchronization, but it doesn't seem really plausible.
25th June 2008, 19:56 | #773 | Link |
Registered User
Join Date: Aug 2006
Posts: 77
|
You are unfortunately right. In the current code SATD seems to work inversely to SAD, meaning the best "SAD" occurs in the worst-case scenario, the scene change. I am currently investigating how Hadamard is supposed to work. I can't say whether something is working as expected or is useful if I don't know what it should do in the first place.
26th June 2008, 00:05 | #775 | Link | |
Registered User
Join Date: Aug 2006
Posts: 77
|
Quote:
On a blurred clip the picture is completely reversed - SATD is definitely superior to SAD. On a grainy source (addgrain(20)) the blurred SATD is virtually identical to SAD without blur(1). On fades only SATD produces decent quality. I suppose better prefiltering would work even better. Besides, it is a lot faster than the default dct.
26th June 2008, 01:52 | #777 | Link |
Registered User
Join Date: Aug 2006
Posts: 77
|
You can get my current version here.
It is still based on 1.9.3. I think it's time to get Fizick's version and update my Win API. There are two ways to access the new functions:
sadx264: 8-12
dct: 5-10
For a description see the documentation. So far I have only tested the basic functionality, and I haven't thoroughly looked for potential performance problems with dct mode. There are definitely some minor parts which will need some work if this is going to stay. The changes are in SADFunctions.h and PlaneOfBlocks (and of course MVInterface & MVAnalyse); pixel*.asm files were added.
26th June 2008, 04:05 | #779 | Link | |
Registered User
Join Date: Aug 2006
Posts: 77
|
Quote:
I don't know the algorithm yet, so I can't say if it would be useful in the first place. It might be possible to add a hexagonal search function to MVTools alongside the others. There seems to be some similarity to the logarithmic search. But it would be more of a reimplementation anyway. The other functions were small assembler functions where the possible interfaces were very limited. I have only adapted MVTools' interface for calling them (where necessary), as they are more complex than the default functions.
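For reference, a generic sketch of a hexagon-based search as it is commonly described: a coarse six-point hexagon is walked towards lower cost until its center is the best point, followed by one small refinement pass. This is neither x264's nor MVTools' code; `MV`, `hex_search`, the `cost` callback and the `max_iter` limit are all hypothetical names chosen for illustration.

```cpp
#include <functional>

struct MV { int x, y; };

// Generic hexagon search sketch. `cost` is a hypothetical callback returning
// the block-matching cost (SAD, SATD, ...) of a candidate motion vector.
MV hex_search(MV start, const std::function<int(MV)>& cost, int max_iter = 16)
{
    static const MV hex[6] = {{-2, 0}, {-1, -2}, {1, -2},
                              {2, 0},  {1, 2},   {-1, 2}};
    static const MV square[8] = {{-1, -1}, {0, -1}, {1, -1}, {-1, 0},
                                 {1, 0},   {-1, 1}, {0, 1},  {1, 1}};

    MV best = start;
    int best_cost = cost(best);

    // Coarse stage: move the hexagon until its center is the best point.
    for (int i = 0; i < max_iter; ++i) {
        const MV center = best;
        for (const MV& o : hex) {
            const MV c{center.x + o.x, center.y + o.y};
            const int v = cost(c);
            if (v < best_cost) { best_cost = v; best = c; }
        }
        if (best.x == center.x && best.y == center.y) break;
    }

    // Fine stage: check the eight direct neighbours once.
    const MV center = best;
    for (const MV& o : square) {
        const MV c{center.x + o.x, center.y + o.y};
        const int v = cost(c);
        if (v < best_cost) { best_cost = v; best = c; }
    }
    return best;
}
```

The similarity to a logarithmic search is that both walk a coarse pattern towards lower cost before refining; the hexagon keeps a fixed step size and checks only six points per iteration.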
26th June 2008, 05:58 | #780 | Link | |
Registered User
Join Date: Jan 2002
Location: France
Posts: 2,856
|
Quote: