Old 28th June 2008, 17:47   #801  |  Link
TSchniede
Registered User
 
Join Date: Aug 2006
Posts: 77
Quote:
Originally Posted by Undead Sega View Post
To Manao,

MVTools is an excellent piece of work! I use it for motion compensation deinterlacing and the results look great!

But may I ask, what are the possibilities of having MVTools ported to the GPU?
As I replied to Terka here and MfA replied here, a port to the GPU will only bring a performance gain if either the GPU is ridiculously faster than the CPU or a completely different approach to motion estimation is used.
MVTools estimates the motion of each block by interpolating from the motion already estimated on a reduced version of the frame (the hierarchical part) and from the direct neighbor blocks: the up and left blocks, which have already been calculated at this point. This means each block depends in part on every other block in the whole frame, because this is repeated down to a single line of blocks.
So the only useful parallel part is doing several frames in parallel, but you will never be able to do 50+ frames like that; it needs too much RAM (try SetMTMode(2,20) and you will always need far more than 2GB of memory... then imagine using 100+ threads...)
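
To picture the dependency chain described above, here is a rough Python sketch. It is not MVTools code: `median_vector` and `predict_vectors` are made-up names, the component-wise median is only one plausible way to combine candidates, and a real search would refine each predictor with SAD tests instead of storing it directly. The point is only that a block cannot be started before its left and upper neighbors are finished.

```python
from statistics import median_low


def median_vector(candidates):
    """Component-wise (lower) median of a list of (dx, dy) vectors."""
    xs = [v[0] for v in candidates]
    ys = [v[1] for v in candidates]
    return (median_low(xs), median_low(ys))


def predict_vectors(coarse, cols, rows):
    """Scan the blocks in raster order and build one predictor per block.

    Assumption: coarse[y // 2][x // 2] is the vector found on the
    half-size level, already scaled to the current level.
    """
    field = [[None] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            candidates = [coarse[y // 2][x // 2]]    # hierarchical part
            if x > 0:
                candidates.append(field[y][x - 1])   # left block, already done
            if y > 0:
                candidates.append(field[y - 1][x])   # upper block, already done
            # Illustration only: a real search would refine this predictor.
            field[y][x] = median_vector(candidates)
    return field
```

Because `field[y][x]` reads `field[y][x - 1]` and `field[y - 1][x]`, the result for the last block indirectly depends on every block scanned before it.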
__________________
GA-P35-DS3R, Core2Quad Q9300@3GHz, 4.0GB/800 MHz DDR2, 2x250GB SATA HD, Geforce 6800
Old 28th June 2008, 19:02   #802  |  Link
Terranigma
*Space Reserved*
 
Join Date: May 2006
Posts: 953
TSchniede, any chance you could perhaps update MVDenoise with important features such as overlap support, allow use of index clip, limit, thSADC, & pelclip? =P
Old 28th June 2008, 20:06   #803  |  Link
Fizick
AviSynth plugger
 
Join Date: Nov 2003
Location: Russia
Posts: 2,183
Terranigma,
just as TSchniede answered instead of Manao, I may answer instead of TSchniede: No.
1. Why rename degrain (or have two similar functions)?
2. MVDenoise was faster in the past, when it used compensated frames in place. Now it is obsolete and may simply be removed.
3. TSchniede is a great coder. Do not keep him from optimizing the code.
__________________
My Avisynth plugins are now at http://avisynth.org.ru and mirror at http://avisynth.nl/users/fizick
I usually do not provide a technical support in private messages.
Old 28th June 2008, 20:11   #804  |  Link
Terranigma
*Space Reserved*
 
Join Date: May 2006
Posts: 953
Quote:
Originally Posted by Fizick View Post
MVDenoise was faster in the past, when it used compensated frames in place. Now it is obsolete and may simply be removed.
So, are you going to remove it?
Old 28th June 2008, 20:44   #805  |  Link
Fizick
AviSynth plugger
 
Join Date: Nov 2003
Location: Russia
Posts: 2,183
Terranigma,
since you ask: no, it will not be removed, for compatibility with old scripts.

TSchniede,
is the SSE2 and overlap problem solved (e.g. by switching to the original MVTools mode)?
Old 28th June 2008, 23:34   #806  |  Link
Undead Sega
Registered User
 
Join Date: Oct 2007
Posts: 713
Quote:
a port to the GPU will only get any performance gain if either the GPU is ridiculously faster than the CPU or if a completely different approach to motion estimation is used.
Well, I don't want to use the graphics features of a GPU, but to use the GPU as a processor, because of the many more transistors it contains. The filter could then take advantage of that for speed instead of leaving all the handling to the CPU. As already shown, the processor needs to be faster/more powerful to get a higher fps, so it is definitely worth a try on a GPU.

My question has still not been answered: can it be ported?
Old 28th June 2008, 23:39   #807  |  Link
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
Quote:
Originally Posted by Undead Sega View Post
but use the GPU as a processor, due to the many more transistors it contains
GPUs don't generally contain larger numbers of transistors than modern CPUs.
Old 28th June 2008, 23:52   #808  |  Link
Undead Sega
Registered User
 
Join Date: Oct 2007
Posts: 713
I'm sure they do; that's why some consider GPU rendering over CPU rendering, and there was even a chart showing the FPS differences (I can't find it at the moment). And think of the GeForce 8 series.
Old 29th June 2008, 00:05   #809  |  Link
TSchniede
Registered User
 
Join Date: Aug 2006
Posts: 77
Quote:
Originally Posted by Fizick View Post
TSchniede,
is the SSE2 and overlap problem solved (e.g. by switching to the original MVTools mode)?
Yes, it should be solved since 1.9.5.2. I used the solution Dark Shikari suggested: I buffer the source block in an aligned area before calling PseudoEPZSearch (see the ALIGN_SOURCEBLOCK regions), so each of the SAD calls uses the same source data. In most cases the cost is compensated by the better cache performance, and in the case of the new 2xY SAD only one load is needed instead of 2 or 4.
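
The buffering idea can be sketched roughly as follows. This is a simplified Python illustration, not MVTools code: `copy_block`, `sad` and `best_candidate` are hypothetical names and frames are plain lists of rows. Memory alignment itself, which is the point of the real change, cannot be shown in Python; the sketch only shows the source block being copied once and then reused for every SAD call.

```python
def copy_block(frame, x, y, w, h):
    """Copy the w x h block at (x, y) into one flat, contiguous list."""
    return [frame[y + j][x + i] for j in range(h) for i in range(w)]


def sad(src_block, ref, x, y, w, h):
    """Sum of absolute differences between the buffered source block
    and the w x h block of the reference frame at (x, y)."""
    total = 0
    for j in range(h):
        for i in range(w):
            total += abs(src_block[j * w + i] - ref[y + j][x + i])
    return total


def best_candidate(src, ref, x, y, w, h, candidates):
    """Return (sad, (dx, dy)) for the best of the candidate vectors."""
    block = copy_block(src, x, y, w, h)  # buffered once (hypothetical helper)
    # Every candidate is compared against the same buffered data.
    return min((sad(block, ref, x + dx, y + dy, w, h), (dx, dy))
               for dx, dy in candidates)
```

For example, if `ref` is `src` shifted one pixel to the right, `best_candidate(src, ref, 1, 1, 2, 2, [(0, 0), (1, 0)])` picks the vector `(1, 0)` with a SAD of 0.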

One thing still bugs me: 4xY blocks are ~7% slower than 8xY blocks, at least on my Intel systems. I think the cache split issue might be the cause here. It only affects 4xY (2xY) blocks; 8xY blocks don't benefit from upsizing to 16xY blocks (that would slow them down by 14%).
Old 29th June 2008, 00:19   #810  |  Link
kandrey89
Qualitas Opus Operis
 
Join Date: Feb 2008
Posts: 45
Quote:
Originally Posted by Dark Shikari View Post
GPUs don't generally contain larger numbers of transistors than modern CPUs.
That's not exactly true.

Although the transistor counts are comparable, GPUs tend to have more transistors than CPUs because of the manufacturing process.

CPU - 0.731 billion transistors
http://www.anandtech.com/cpuchipsets...spx?i=3102&p=2

GPU - 1.4 billion transistors
http://enthusiast.hardocp.com/articl...50aHVzaWFzdA==

Both are relatively new and real products to be released this year.
Old 29th June 2008, 00:20   #811  |  Link
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
Quote:
Originally Posted by Undead Sega View Post
I'm sure they do; that's why some consider GPU rendering over CPU rendering, and there was even a chart showing the FPS differences (I can't find it at the moment). And think of the GeForce 8 series.
You don't need more transistors to be faster at a specific task. The number of transistors is just a function of die size and the process used.
Quote:
Originally Posted by kandrey89 View Post
CPU - 0.731 billion transistors
http://www.anandtech.com/cpuchipsets...spx?i=3102&p=2

GPU - 1.4 billion transistors
http://enthusiast.hardocp.com/articl...50aHVzaWFzdA==

Both are relatively new and real products to be released this year.
A better comparison would probably be the octo-core Nehalem, as that would be an example of a "top of the line" equivalent to nVidia's offering. Being 8 cores, it would have twice as many transistors as the 4-core version.

Also note that a very large amount of space on modern CPUs is taken up by cache, which graphics chips have very little of (which can be an absolute nightmare when trying to program them to do anything efficiently).

Last edited by Dark Shikari; 29th June 2008 at 00:23.
Old 29th June 2008, 00:33   #812  |  Link
TSchniede
Registered User
 
Join Date: Aug 2006
Posts: 77
Quote:
Originally Posted by Undead Sega View Post
I'm sure they do; that's why some consider GPU rendering over CPU rendering, and there was even a chart showing the FPS differences (I can't find it at the moment). And think of the GeForce 8 series.
This is a classic comparison of two different things.
Take the newest AMD GPU: it has ~2x the transistor count of Penryn, but as I stated before it is completely different from my CPU (see sig).
The RV770 has 800 "cores" running at 750MHz, compared to my 4 cores at 3GHz. Yet for an algorithm like the one MVTools uses, almost the only thing that counts is 750MHz vs. 3GHz. Even though the memory bandwidth is higher, we will run out of memory AND memory bandwidth long before we compute 16 frames in parallel, and that is the point where both should be about equal in performance. Not to mention that the architecture of even the new GPU cores is not optimized for complex conditional code. Either the waste of the parallel brute-force approach or the branch misprediction penalty must be enormous (and I am not certain the huge number of "cores" could work with branches in the first place; IMHO that is restricted to the shader part, which is only 4 "cores"). And that doesn't include the performance loss due to memory latency. A good deal of the CPU core is dedicated to logic / buffers whose only purpose is to prevent / reduce that performance loss, which CAN be far higher than 99%!
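
The "16 frames" figure follows from simple arithmetic, under the simplifying assumptions that one GPU "core" does the same work per clock as one CPU core and that the search inside a frame is fully serial. A rough Python check (the function name is invented for illustration):

```python
def frames_to_break_even(cpu_cores, cpu_mhz, gpu_mhz):
    """Frames the GPU must process in parallel to match the CPU.

    Assumptions: one serial search per frame, equal work per clock
    on both chips, and no memory or branching penalties.
    """
    cpu_throughput = cpu_cores * cpu_mhz  # all CPU cores busy, one frame each
    return cpu_throughput / gpu_mhz       # each GPU "core" runs one frame


# 4 cores at 3GHz against "cores" at 750MHz
print(frames_to_break_even(4, 3000, 750))
```

With the numbers from the post this gives 16: each CPU core is 4x faster per frame, and there are 4 of them.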

Something like blur() or sharpen on the other hand will work ridiculously fast (only local data is needed, so every pixel can be computed in parallel)
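
For contrast, a horizontal 3-tap box blur shows why such filters parallelize so well: every output pixel is computed from the input only, never from another output pixel. This is illustrative Python, not the implementation of Avisynth's Blur(); `blur_pixel` and `blur_row` are made-up names and the clamped edge handling is an assumption.

```python
def blur_pixel(row, x):
    """Average of a pixel and its two horizontal neighbors (edges clamped)."""
    left = row[max(x - 1, 0)]
    right = row[min(x + 1, len(row) - 1)]
    return (left + row[x] + right) // 3


def blur_row(row):
    # No iteration reads the result of another one, so the pixels could
    # be computed in any order, or all at once on parallel hardware.
    return [blur_pixel(row, x) for x in range(len(row))]
```

Compare this with motion estimation, where each block has to wait for its left and upper neighbors.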
Old 29th June 2008, 00:33   #813  |  Link
kandrey89
Qualitas Opus Operis
 
Join Date: Feb 2008
Posts: 45
Quote:
Originally Posted by Dark Shikari View Post
You don't need more transistors to be faster at a specific task. The number of transistors is just a function of die size and the process used. A better comparison would probably be the octo-core Nehalem, as that would be an example of a "top of the line" equivalent to nVidia's offering. Being 8 cores, it would have twice as many transistors as the 4-core version.

Also note that a very large amount of space on modern CPUs is taken up by cache, which graphics chips have very little of (which can be an absolute nightmare when trying to program them to do anything efficiently).
True, then there is a discussion about floating point on GPUs and integers on CPUs.
However, Nehalem Octo-Core is going to be released in late 2009, so don't compare that with the now releasing 9800GTX+
Old 29th June 2008, 10:23   #814  |  Link
talen9
Registered User
 
Join Date: Aug 2007
Location: Italy
Posts: 286
Quote:
Originally Posted by kandrey89 View Post
True, then there is a discussion about floating point on GPUs and integers on CPUs.
However, Nehalem Octo-Core is going to be released in late 2009, so don't compare that with the now releasing 9800GTX+
That said, I think it's a total waste of time to start programming a GPU-optimized MVTools when, in only a few months, technology improvements will make it obsolete (= slower than the CPU-optimized one) ... the port could very well not be completed by then.
Old 30th June 2008, 03:57   #815  |  Link
Boulder
Pig on the wing
 
Join Date: Mar 2002
Location: Finland
Posts: 5,718
Quote:
Originally Posted by Boulder View Post
I tested the first build and here are the results:

blksize 8, overlap 4 : 5.2 fps
blksize 8, overlap 0 : 17.1 fps

Apparently Fizick's official 1.9.5.1 build is a tad bit faster.
Fizick's build is still faster than v1.9.5.5 on my computer.

blksize 8, overlap 0, x264_sad=3 : 18.4 fps (Fizick) vs. 17.9 fps (v1.9.5.5).
__________________
And if the band you're in starts playing different tunes
I'll see you on the dark side of the Moon...
Old 30th June 2008, 04:03   #816  |  Link
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
Quote:
Originally Posted by Boulder View Post
Fizick's build is still faster than v1.9.5.5 on my computer.

blksize 8, overlap 0, x264_sad=3 : 18.4 fps (Fizick) vs. 17.9 fps (v1.9.5.5).
What CPU are you using?
Old 30th June 2008, 06:41   #817  |  Link
Boulder
Pig on the wing
 
Join Date: Mar 2002
Location: Finland
Posts: 5,718
Quote:
Originally Posted by Dark Shikari View Post
What CPU are you using?
Intel Core2Duo, E6750.
Old 2nd July 2008, 01:06   #818  |  Link
TSchniede
Registered User
 
Join Date: Aug 2006
Posts: 77
I did some profiling and improved MVAnalyse a bit.
I added Overlap_2xY_mmx (virtually no difference, mostly for symmetry) and optimized my Sad2x2_iSSE_T a bit (very minor improvement).
The things that did bring an improvement were: in CheckMV2 I removed all checks except IsVectorOK(), as they are virtually never used and create a lot of work themselves, and I made the LumaSAD prioritize the standard spatial SAD. These two things brought up to 10% improvement (quite uniform on Pentium M), 5-10% on Core2. Smaller block sizes benefit more. DCT (I tried dct=7) halves the benefit.
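
To make the change easier to picture, here is a rough Python sketch of a candidate check that keeps only a validity test before the SAD comparison. It is not the MVTools source: the names merely echo IsVectorOK and CheckMV2, and the signatures, the bounds tuple and the cost callback are all assumptions made for illustration.

```python
def is_vector_ok(vx, vy, bounds):
    """True if the vector stays inside the allowed search window."""
    min_x, min_y, max_x, max_y = bounds
    return min_x <= vx <= max_x and min_y <= vy <= max_y


def check_mv(best, candidate, cost_of, bounds):
    """Return the better of `best` and `candidate`.

    best      : (cost, (vx, vy)) found so far
    candidate : (vx, vy) to test
    cost_of   : callable returning the SAD for a vector (hypothetical)
    """
    vx, vy = candidate
    if not is_vector_ok(vx, vy, bounds):
        return best                # the only check left before the SAD
    cost = cost_of(vx, vy)
    return (cost, candidate) if cost < best[0] else best
```

The fewer tests that run before `cost_of`, the less fixed overhead each candidate carries, which matters most for small blocks where the SAD itself is cheap.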

Further improvement should be more difficult, as the very frequent GetAbsolutePointer is now one of the most time consuming operations together with GetRefBlock and of course the obvious CheckMV2 / CheckMV. On 2x2 blocks I found the jump penalty of the function sometimes larger than the SAD calculation itself.

Anyway here is the link to the Source & dll.

Boulder, until both Fizick and I use the same compiler (version & options), the performance of our builds will always differ.
Old 2nd July 2008, 18:08   #819  |  Link
Fizick
AviSynth plugger
 
Join Date: Nov 2003
Location: Russia
Posts: 2,183
I tried to rebuild v1.9.5.5, but pixel-a.asm does not assemble with NASM: error in x264_pixel_ads_mvs - short jge jump is out of range.
The jge should be replaced with a jl, jmp pair:
Code:
.loopi0:
    add     esi, 8
    cmp     esi, edi
    jl .loopi ; fix short jump for nasm (Fizick)
    jmp .end
.loopi:

UPDATE:
jge NEAR .end works fine too.
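
Some background on the error: a short conditional jump on x86 stores its target as a signed 8-bit displacement measured from the end of the instruction, so the target must lie within -128..+127 bytes. The NEAR form uses a wider displacement and reaches further. A tiny Python check of that range (the function name is invented for this note):

```python
def fits_short_jump(displacement):
    """True if a byte displacement fits the signed 8-bit field of a short jump."""
    return -128 <= displacement <= 127


print(fits_short_jump(127))  # still reachable with a short jump
print(fits_short_jump(128))  # one byte too far, needs the NEAR form
```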

Last edited by Fizick; 2nd July 2008 at 18:36.
Old 2nd July 2008, 18:10   #820  |  Link
Manao
Registered User
 
Join Date: Jan 2002
Location: France
Posts: 2,856
You should use yasm; it automatically computes proper jump sizes (nasm should too, but is (was?) buggy).
All times are GMT +1. The time now is 00:20.

