28th June 2008, 17:47 | #801 | Link | |
Registered User
Join Date: Aug 2006
Posts: 77
|
Quote:
As MfA replied here, a port to the GPU will only yield a performance gain if either the GPU is ridiculously faster than the CPU or a completely different approach to motion estimation is used. MVTools estimates motion by interpolating the estimated motion from a reduced version of the frame (the hierarchical part) and from the direct neighbor blocks, the up and left blocks, which have already been calculated at that point. This means each block depends in part on every other block in the whole frame, because this is repeated down to a single line of blocks. So the only useful parallelism is processing several frames at once, but you will never be able to do 50+ frames like that because it needs too much RAM (try SetMTMode(2,20) and you will always need far more than 2GB of memory... then imagine using 100+ threads...)
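A minimal sketch of the neighbor dependency described above, assuming each block takes predictors only from its left and upper neighbors (the hierarchical levels are ignored, and `transitive_dependencies` is a hypothetical helper, not an MVTools function). It collects every block that has to be finished before a given block can be searched:

```python
def transitive_dependencies(x, y):
    """Return the set of blocks that block (x, y) depends on, directly or
    indirectly, assuming only left and upper neighbours act as predictors."""
    seen = set()
    stack = [(x, y)]
    while stack:
        bx, by = stack.pop()
        # Direct predictors: the block to the left and the block above.
        for nx, ny in ((bx - 1, by), (bx, by - 1)):
            if nx >= 0 and ny >= 0 and (nx, ny) not in seen:
                seen.add((nx, ny))
                stack.append((nx, ny))
    return seen


print(len(transitive_dependencies(3, 2)))  # prints 11
```

Block (3, 2) already depends on all 11 blocks above and to the left of it, and the count grows with the position in the frame, which is why little independent work is left inside a single frame.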
__________________
GA-P35-DS3R, Core2Quad Q9300@3GHz, 4.0GB/800 MHz DDR2, 2x250GB SATA HD, Geforce 6800 |
|
28th June 2008, 20:06 | #803 | Link |
AviSynth plugger
Join Date: Nov 2003
Location: Russia
Posts: 2,183
|
Terranigma,
just as TSchniede answered instead of Manao, I may answer instead of TSchniede: No.
1. Why rename degrain (or have two similar functions)?
2. Mvdenoise was faster in the past, when it used compensated frames in place. Now it is obsolete and may simply be removed.
3. TSchniede is a great coder. Do not hinder him from optimizing the code.
__________________
My Avisynth plugins are now at http://avisynth.org.ru and mirror at http://avisynth.nl/users/fizick I usually do not provide a technical support in private messages. |
28th June 2008, 20:44 | #805 | Link |
AviSynth plugger
Join Date: Nov 2003
Location: Russia
Posts: 2,183
|
Terranigma
Since you ask: no, it will not be removed, for compatibility with old scripts. TSchniede, is the SSE2 and overlap problem solved (e.g. by switching to the original MVTools mode)?
__________________
My Avisynth plugins are now at http://avisynth.org.ru and mirror at http://avisynth.nl/users/fizick I usually do not provide a technical support in private messages. |
28th June 2008, 23:34 | #806 | Link | |
Registered User
Join Date: Oct 2007
Posts: 713
|
Quote:
My question still has not been answered. Can it be ported? |
|
29th June 2008, 00:05 | #809 | Link | |
Registered User
Join Date: Aug 2006
Posts: 77
|
Quote:
One thing still bugs me: 4xY blocks are ~7% slower than 8xY blocks, at least on my Intel systems. I think the cache split issue might be the cause here. It only affects 4xY (2xY) blocks; 8xY blocks don't benefit from upsizing to 16xY blocks (that would slow them down by 14%).
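To illustrate the cache split idea: a load is "split" when it straddles two cache lines. The sketch below assumes 64-byte cache lines (as on Core 2) and uses a hypothetical helper name; it is not taken from MVTools and does not prove that splits cause the 7% difference.

```python
CACHE_LINE_BYTES = 64  # assumption: line size of the Core 2 L1 data cache


def crosses_cache_line(address, size, line_bytes=CACHE_LINE_BYTES):
    """Return True if a load of `size` bytes starting at `address`
    spans two cache lines."""
    first_line = address // line_bytes
    last_line = (address + size - 1) // line_bytes
    return first_line != last_line


print(crosses_cache_line(60, 8))  # prints True
print(crosses_cache_line(56, 8))  # prints False
```

An 8-byte load at offset 60 touches bytes 60..67 and therefore two lines, while the same load at offset 56 stays inside one line.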
__________________
GA-P35-DS3R, Core2Quad Q9300@3GHz, 4.0GB/800 MHz DDR2, 2x250GB SATA HD, Geforce 6800 |
|
29th June 2008, 00:19 | #810 | Link | |
Qualitas Opus Operis
Join Date: Feb 2008
Posts: 45
|
Quote:
Although the transistor counts are of comparable magnitude, GPUs tend to have more transistors than CPUs because of the manufacturing process. CPU - 0.731 billion transistors http://www.anandtech.com/cpuchipsets...spx?i=3102&p=2 GPU - 1.4 billion transistors http://enthusiast.hardocp.com/articl...50aHVzaWFzdA== Both are relatively new, real products to be released this year. |
|
29th June 2008, 00:20 | #811 | Link | ||
x264 developer
Join Date: Sep 2005
Posts: 8,666
|
Quote:
Quote:
Also note that a very large amount of space on modern CPUs is taken up by cache, which graphics chips have very little of (which can be an absolute nightmare when trying to program them to do anything efficiently).
__________________
Follow x264 development progress | akupenguin quotes | x264 git status ffmpeg and x264-related consulting/coding contracts | Doom10 Last edited by Dark Shikari; 29th June 2008 at 00:23. |
||
29th June 2008, 00:33 | #812 | Link | |
Registered User
Join Date: Aug 2006
Posts: 77
|
Quote:
Take the newest AMD GPU: it has ~2x the transistor count of Penryn, but as I stated before it is completely different from my CPU (see sig). The RV770 has 800 "cores" running at 750MHz, compared to my 4 cores at 3GHz. Yet for an algorithm like the one MVTools uses, almost only the 750MHz vs. 3GHz clock difference counts. Even though the GPU's memory bandwidth is higher, we will run out of memory AND memory bandwidth long before we compute 16 frames in parallel, and that is the point where both should be about equal in performance. Not to mention that the architecture of even the new GPU cores is not optimized for complex conditional code. Either the waste of the parallel brute-force approach or the branch misprediction penalty must be enormous (and I am not certain the huge number of "cores" could work with branches in the first place; IMHO that is restricted to the shader part, which is only 4 "cores"). And that doesn't include the performance loss due to memory latency. A good deal of the CPU core is dedicated to logic and buffers whose only purpose is to prevent or reduce that performance loss, which CAN be far higher than 99%! Something like blur() or sharpen, on the other hand, will run ridiculously fast (only local data is needed, so every pixel can be computed in parallel).
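The 16-frame break-even figure follows from the clock ratio alone. A rough sketch of that arithmetic, under the crude assumptions that each frame is processed serially by one core or GPU thread and that both do the same work per clock cycle (`break_even_frames` is a hypothetical helper):

```python
def break_even_frames(cpu_cores, cpu_clock_mhz, gpu_clock_mhz):
    """Frames the GPU must process concurrently to match the CPU,
    assuming one serial frame per core/thread and equal work per clock."""
    return cpu_cores * cpu_clock_mhz / gpu_clock_mhz


# 4 CPU cores at 3GHz versus GPU "cores" at 750MHz, as in the post.
print(break_even_frames(cpu_cores=4, cpu_clock_mhz=3000, gpu_clock_mhz=750))  # prints 16.0
```

Each GPU thread runs at a quarter of the CPU clock, so four of them are needed per CPU core, and 4 x 4 = 16 frames must be in flight before the GPU merely draws level.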
__________________
GA-P35-DS3R, Core2Quad Q9300@3GHz, 4.0GB/800 MHz DDR2, 2x250GB SATA HD, Geforce 6800 |
|
29th June 2008, 00:33 | #813 | Link | |
Qualitas Opus Operis
Join Date: Feb 2008
Posts: 45
|
Quote:
However, the octo-core Nehalem is going to be released in late 2009, so don't compare it with the 9800GTX+ that is being released now. |
|
29th June 2008, 10:23 | #814 | Link |
Registered User
Join Date: Aug 2007
Location: Italy
Posts: 286
|
That said, I think it's a total waste of time to start programming a GPU-optimized MVTools when in only a few months technology improvements will make it obsolete (= slower than the CPU-optimized one)... the port might very well not be completed by that time.
|
30th June 2008, 03:57 | #815 | Link | |
Pig on the wing
Join Date: Mar 2002
Location: Finland
Posts: 5,718
|
Quote:
blksize 8, overlap 0, x264_sad=3 : 18.4 fps (Fizick) vs. 17.9 fps (v1.9.5.5).
__________________
And if the band you're in starts playing different tunes I'll see you on the dark side of the Moon... |
|
2nd July 2008, 01:06 | #818 | Link |
Registered User
Join Date: Aug 2006
Posts: 77
|
I did some profiling and improved MVAnalyse a bit.
I added Overlap_2xY_mmx (virtually no difference, mostly for symmetry). I optimized my Sad2x2_iSSE_T a bit (very minor improvement).
The things that did bring an improvement were:
CheckMV2: removed all checks except IsVectorOK(), as they are virtually never used and create a lot of work themselves.
I made the LumaSAD prioritize the standard spatial SAD.
These two things brought up to 10% improvement (quite uniform on Pentium M) and 5-10% on Core2. Smaller block sizes benefit more. DCT (I tried dct=7) halves the benefit.
Further improvement should be more difficult, as the very frequent GetAbsolutePointer is now one of the most time-consuming operations, together with GetRefBlock and of course the obvious CheckMV2 / CheckMV. On 2x2 blocks I found the jump penalty of the function call sometimes larger than the SAD calculation itself.
Anyway, here is the link to the source & dll. Boulder, until Fizick and I both use the same compiler (version & options), the performance of our builds will always differ.
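For reference, the SAD (sum of absolute differences) metric that these routines compute can be sketched in plain Python. This is only an illustration of the metric on a 2x2 block with made-up pixel values; `sad` is a hypothetical helper, not the Sad2x2_iSSE_T routine itself.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks,
    each given as a list of rows of pixel values."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


current = [[10, 20], [30, 40]]    # made-up 2x2 block from the current frame
reference = [[12, 18], [30, 45]]  # made-up 2x2 block from the reference frame
print(sad(current, reference))  # prints 9
```

A 2x2 block needs only four subtractions, which makes it plausible that the overhead of calling the function can outweigh the calculation.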
__________________
GA-P35-DS3R, Core2Quad Q9300@3GHz, 4.0GB/800 MHz DDR2, 2x250GB SATA HD, Geforce 6800 |
2nd July 2008, 18:08 | #819 | Link |
AviSynth plugger
Join Date: Nov 2003
Location: Russia
Posts: 2,183
|
I tried to rebuild v1.9.5.5, but pixel-a.asm does not compile with NASM: error in x264_pixel_ads_mvs - short jge jump is out of range.
It should be replaced with a jl, jmp pair. Code:
.loopi0:
    add     esi, 8
    cmp     esi, edi
    jl      .loopi      ; fix short jump for nasm (Fizick)
    jmp     .end
.loopi:
UPDATE: jge NEAR .end works fine too.
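The error arises because a short conditional jump on x86 encodes its target as a signed 8-bit displacement, so the target must lie within -128..+127 bytes of the end of the jump instruction. A minimal sketch of that range check (the helper name and the example displacements are made up for illustration):

```python
def fits_short_jump(displacement):
    """Return True if a displacement in bytes, measured from the end of the
    jump instruction, fits the signed 8-bit field of a short jump."""
    return -128 <= displacement <= 127


print(fits_short_jump(100))  # prints True
print(fits_short_jump(200))  # prints False
```

When the check fails, a near jump with a wider displacement is needed, which is what `jge NEAR .end` requests.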
__________________
My Avisynth plugins are now at http://avisynth.org.ru and mirror at http://avisynth.nl/users/fizick I usually do not provide a technical support in private messages. Last edited by Fizick; 2nd July 2008 at 18:36. |
2nd July 2008, 18:10 | #820 | Link |
Registered User
Join Date: Jan 2002
Location: France
Posts: 2,856
|
You should use yasm, it automatically computes proper jump sizes (nasm should too, but is (was?) buggy)
__________________
|