DirectX version of ML3Dex: need someone with a PCI-e card to test it
tsp
16th January 2005, 23:30
I made a DirectX 9 version of ML3Dex from the medianblur filter (see this (http://forum.doom9.org/showthread.php?s=&threadid=84636) thread for the CPU version). I need someone with a PCI-Express card to test the speed of the filter compared to the CPU version. To run it you will need at least an NVIDIA GeForce FX 5xxx or an ATI X600 (maybe also X300) and the latest MVTools.
At the moment, on an Athlon XP 2400 MHz with an NVIDIA nForce2 chipset and a GeForce 6800 GT (350 MHz, memory 700 MHz), the filter runs at about half the speed of the CPU version. I think most of the slowdown is caused by the slow readback from the GPU over the AGP bus.
The DirectX code was created using the latest BrookGPU (http://graphics.stanford.edu/projects/brookgpu/) CVS source.
you can get the GPU version here (http://www.tsp.person.dk/ML3DexGPU.zip)
and the cpu version here (http://www.tsp.person.dk/medianblur084.zip)
syntax:
ml3dexGPU(clip,bool mc, int Y, int U, int V,bool markscenechange, int blksize, int pel, int lambda, int thSCD1, int thSCD2,bool spfull,bool bsfull)
for the GPU version and ml3dex for the cpu version.
good values for cartoons are ml3dex(blksize=4,searchparam=60,lambda=0,thSCD1=1000)
mc: Use motion compensation. This is achieved using Manao's MVTools. It has only been tested with versions 0.9.8.1 and 0.9.8.4.
Get it in this thread: http://forum.doom9.org/showthread.php?s=&threadid=84770&perpage=20&pagenumber=1 or here http://manao4.free.fr/MVTools-v0.9.8.1.zip
Don't use version 0.9.8.0; it causes an error. Default is false.
markscenechange: If set to true, a red square is drawn in the upper left corner when MVTools detects a scene change.
If this happens too often, the filter can't remove scratches and larger defects as efficiently. Increase thSCD1 or thSCD2. Default is false.
blksize, pel, lambda, thSCD1, thSCD2, searchparam: Parameters to control MVTools. For the meaning of each parameter see the MVTools documentation.
Defaults: blksize 8, pel 1, lambda 1000, thSCD1 200, thSCD2 200, searchparam 1
Y, U, V: Control which planes the filter is applied to. If set to 3 the plane is filtered, if 2 the plane is copied from the source,
if 1 the plane is ignored, and from 0 to -255 the plane is assigned the absolute value. Defaults: Y 3, U 2, V 2
mc: Use motion compensation. This is achieved using Manao's MVTools. It has only been tested with versions 0.9.8.1 and 0.9.8.4.
Get it in this thread: http://forum.doom9.org/showthread.php?s=&threadid=84770&perpage=20&pagenumber=1 or here http://manao4.free.fr/MVTools-v0.9.8.1.zip
Don't use version 0.9.8.0; it causes an error. Default is true.
blksize, pel, lambda, thSCD1, thSCD2: Parameters to control MVTools. For the meaning of each parameter see the MVTools documentation.
Defaults: blksize 4, pel 1, lambda 1000, thSCD1 200, thSCD2 175
bsfull, spfull: If set to true, the block size (bsfull) and/or searchparam (spfull) are applied to both the frames before and the frames after the current frame.
If set to false, they are only applied to the frames before the current frame. This speeds up the motion compensation but decreases the possible noise reduction.
Defaults are false, false.
bill_baroud
17th January 2005, 10:32
I don't have a PCI-e graphics card (FX5900...) but I want to try it ;)
Will you publish the source code (if it's not included)? I'm really curious to see how you read data back from the GPU.
Edit: source code is included, thanks.
bill_baroud
18th January 2005, 09:00
Hello, I did some tests with my FX5900XT at different frequencies (GPU/RAM):
200/600 > 3-4fps
400/800 > 6fps
480/800 > 7fps
I couldn't test the CPU version: ml3dex(MC=false) always threw an error (bad arguments), I don't know why... perhaps because I didn't have MVTools?
But still, it looks more like it's the code that runs on the GPU that is somehow "unoptimized" (no offense), as performance grows almost linearly with the increase in frequency.
Did you try a profiling tool for the GPU, like NVPerfHUD? It could help to see where the bottleneck is.
tsp
18th January 2005, 18:56
The problem with NVPerfHUD is that it really doesn't work well when rendering to textures (i.e. it doesn't work at all). I downloaded NVShaderPerf and ran it on the pixel shader. It gave this output:
Target: GeForce 6800 GT (NV40-GT) :: Unified Compiler: v66.93
Cycles: 683.68 :: R Regs Used: 26 :: R Regs Max Index (0 based): 25
Pixel throughput (assuming 1 cycle texture lookup) 8.20 MP/s
Converted to AviSynth pixels (when only working on the luma plane) that's about 4 * 8.2 MP/s = 32.8 MP/s (4x because 4 AviSynth pixels are packed into 1 RGBA pixel), or about 80 fps (720x576).
On my computer (with a GeForce 6800 GT that is amputated memory-wise) I get about 22-26 fps (720x576).
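The conversion from NVShaderPerf's pixel throughput to an expected frame rate can be sketched as follows. This is only an illustration of the arithmetic in the post, not code from the filter; the function name `estimated_fps` and its parameters are made up for this example, and it assumes that shader throughput is the only limit (no upload or readback cost).

```cpp
#include <cassert>

// Frame rate implied by a shader's texel throughput.
// texels_per_second: RGBA texels the shader can produce per second
//                    (the "pixel throughput" line in the tool output)
// pixels_per_texel:  8-bit video pixels packed into each texel (4 in the post)
double estimated_fps(double texels_per_second, int pixels_per_texel,
                     int width, int height) {
    assert(pixels_per_texel > 0 && width > 0 && height > 0);
    const double video_pixels_per_second = texels_per_second * pixels_per_texel;
    return video_pixels_per_second / (static_cast<double>(width) * height);
}
```

With 8.2 MP/s, 4 pixels per texel and a 720x576 luma plane this gives roughly 79 fps, which matches the "about 80 fps" estimate; the measured 22-26 fps suggests that something other than the shader is the limit on that machine.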
The performance analysis for the NV30 series looks like this:
-------------------- NV38 --------------------
Target: GeForceFX 5950 Ultra (NV38) :: Unified Compiler: v66.93
Cycles: 1610 :: # R Registers: 32
Pixel throughput (assuming 1 cycle texture lookup) 1.18 MP/s
-------------------- NV36 --------------------
Target: GeForceFX 5700 Ultra (NV36) :: Unified Compiler: v66.93
Cycles: 1610 :: # R Registers: 32
Pixel throughput (assuming 1 cycle texture lookup) 590.06 KP/s
-------------------- NV35 --------------------
Target: GeForceFX 5900 Ultra (NV35) :: Unified Compiler: v66.93
Cycles: 1610 :: # R Registers: 32
Pixel throughput (assuming 1 cycle texture lookup) 1.12 MP/s
-------------------- NV34 --------------------
Target: GeForceFX 5200 Ultra (NV34) :: Unified Compiler: v66.93
Cycles: 1549 :: # R Registers: 32
Pixel throughput (assuming 1 cycle texture lookup) 516.46 KP/s
-------------------- NV31 --------------------
Target: GeForceFX 5600 Ultra (NV31) :: Unified Compiler: v66.93
Cycles: 1549 :: # R Registers: 32
Pixel throughput (assuming 1 cycle texture lookup) 516.46 KP/s
-------------------- NV30 --------------------
Target: GeForceFX 5800 Ultra (NV30) :: Unified Compiler: v66.93
Cycles: 1549 :: # R Registers: 32
Pixel throughput (assuming 1 cycle texture lookup) 1.29 MP/s
This is a maximum of about 22 fps for an FX5900XT.
I made a new version of ML3DexGPU (same link as the first) with 2 new options:
bool Disablecache and int pixelshader
Disablecache forces the filter to upload the current, previous and next frames to GPU memory for every frame, decreasing performance if AGP bandwidth bound.
pixelshader specifies which pixel shader to use:
if set to 1, the pixel shader just copies the current frame;
if set to 2, the average of the previous and next frames is used;
if set to anything else, the default ML3Dex shader is used.
When using pixelshader 1 and disablecache=true I get about 30 fps, and with disablecache=false about 40-42 fps (compared to about 26 fps with the ML3Dex pixel shader).
I don't think it's possible to optimize the code (the pixel shader, that is; Brook takes care of all the other aspects) on the GPU side much more (that is, finding the median of 3, 7, 7, 11, 11 and 3 values).
The CPU version runs at about 33 fps on my computer (Athlon XP 2400 MHz).
Also, what exactly does the error message look like when you try to run the CPU version? I tried not loading MVTools and it didn't generate any errors, but you could try downloading MVTools to see if it works.
bill_baroud
19th January 2005, 15:53
Originally posted by tsp
The problem with NVPerfHUD is that it really doesn't work well when rendering to textures (i.e. it doesn't work at all). I downloaded NVShaderPerf and ran it on the pixel shader. It gave this output:
Target: GeForce 6800 GT (NV40-GT)
Pixel throughput (assuming 1 cycle texture lookup) 8.20 MP/s
Target: GeForceFX 5900 Ultra (NV35)
Pixel throughput (assuming 1 cycle texture lookup) 1.12 MP/s
This is a maximum of about 22 fps for an FX5900XT.
Wow, that's quite a difference between the two generations... I wonder why the NV30 performs a little better though.
I made a new version of ML3DexGPU (same link as the first) with 2 new options:
bool Disablecache and int pixelshader
Disablecache forces the filter to upload the current, previous and next frames to GPU memory for every frame, decreasing performance if AGP bandwidth bound.
Hmm, AGP upload should be quite fast, no? It's the download that is slow.
When using pixelshader 1 and disablecache=true I get about 30 fps, and with disablecache=false about 40-42 fps (compared to about 26 fps with the ML3Dex pixel shader).
I don't think it's possible to optimize the code (the pixel shader, that is; Brook takes care of all the other aspects) on the GPU side much more (that is, finding the median of 3, 7, 7, 11, 11 and 3 values).
That's about 12-16 MB/s... I heard somewhere that AGP downloading was around 50 MB/s, so it's more the GPU that's lacking here, no?
Well, a test with a PCI-Express card would definitely be interesting, but I suppose that a median filter needs too many conditional operations for a GPU.
Perhaps setting up your own OpenGL/D3D9 context instead of using BrookGPU could be a little faster, but I don't really see why.
(http://download.developer.nvidia.com/developer/SDK/Individual_Samples/DEMOS/OpenGL/simple_pbuffer.zip ?)
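The 12-16 MB/s figure appears to correspond to reading back one 720x576 8-bit luma plane per frame at roughly 30 to 40 fps. A small sketch of that arithmetic, assuming one byte per pixel and luma only (the function name is hypothetical and not part of the filter):

```cpp
#include <cassert>

// Bytes per second that must come back from the GPU when one 8-bit plane
// of width x height is read back for every frame.
double readback_bytes_per_second(int width, int height, double fps) {
    assert(width > 0 && height > 0 && fps >= 0.0);
    return static_cast<double>(width) * height * fps;
}
```

At 30 fps this is about 12.4 MB/s and at 40 fps about 16.6 MB/s, both well below the 50 MB/s readback rate mentioned above.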
The CPU version runs at about 33 fps on my computer (Athlon XP 2400 MHz).
So that's 26 fps vs 33 fps now (not half?), it's quite an achievement.
Also, what exactly does the error message look like when you try to run the CPU version? I tried not loading MVTools and it didn't generate any errors, but you could try downloading MVTools to see if it works.
The standard AviSynth error when you use wrong arguments/parameters.
I tried many ways of using the filter and different parameters, with no luck.
Antitorgo
19th January 2005, 21:51
This all looks interesting and is along the lines of what I am writing AviShader for. I think you might be seeing a performance hit just because of BrookGPU (which from what I have heard is notoriously slow). For example, a simple 3x3 median blur is easily done at > 30FPS on my 2-year old Radeon 9700 Pro in AviShader.
Perhaps someone is interested in porting some of this to a pixel shader for AviShader? I'd take a swing at it, but am trying to make improvements to the general AviShader architecture right now (like supporting packed formats so that texture read/writes aren't as slow).
tsp
19th January 2005, 23:26
Antitorgo:
In the newest CVS source of BrookGPU it's possible to use 1-byte chars instead of 4-byte floats. This improves speed a lot. Also, the code I use to pack 4 pixels into one looks like this:
[code]
::brook::stream* FrametoStream(const unsigned char *srcp, int width, int height, int pitch, unsigned char *inputf){
    // inputf = buffer used to build temporary data before it is uploaded to the GPU
    int xsize = width;       // MOD4 assumed
    int insizex = xsize + 8; // 4 padding bytes (one RGBA texel) on each side of a row
    // creates an empty R8G8B8A8 texture of height: height and width: insizex/4
    ::brook::stream* retstream = new ::brook::stream(::brook::getStreamType((fixed4 *)0), height, insizex / 4, -1);
    int i, j, k;
    for (i = 0; i < height; i++)
    {
        for (k = 0; k < 4; k++)
            for (j = 0; j < xsize / 4; j++)
            {
                // interleave pixels, i.e. row: ABCD EFGH IJKL MNOP -> AEIM BFJN CGKO DHLP, allowing SIMD
                inputf[i * insizex + 4 + j * 4 + k] = srcp[i * pitch + j + k * xsize / 4];
            }
        for (k = 0; k < 3; k++)
        {
            // copies the first 3 bytes of the last texel (DHL) to the front:
            // AEIM BFJN CGKO DHLP -> 0(zero)DHL AEIM BFJN CGKO DHLP
            inputf[i * insizex + k + 1] = inputf[i * insizex + (insizex - 8) + k];
            // copies the last 3 bytes of the first texel (EIM) to the back:
            // 0DHL AEIM BFJN CGKO DHLP -> 0DHL AEIM BFJN CGKO DHLP EIM0
            inputf[i * insizex + insizex - 4 + k] = inputf[i * insizex + k + 5];
        }
        // first byte = original first pixel: 0DHL ... EIM0 -> ADHL AEIM BFJN CGKO DHLP EIM0
        inputf[i * insizex] = inputf[i * insizex + 4];
        // last byte of the row (index insizex - 1) = original last pixel: ... EIM0 -> ... EIMP
        inputf[i * insizex + insizex - 1] = inputf[i * insizex + insizex - 5];
    }
    streamRead(*retstream, inputf); // upload data
    return retstream;               // return pointer to the created stream/texture
}
[/code]
and the code to unpack the texture:
[code]
streamWrite(o_img, outputf); // get texture o_img back into memory outputf
for (i = 0; i < outsizey; i++)
{
    for (k = 0; k < 4; k++)
        for (j = 0; j < outsizex / 4; j++)
        {
            // de-interleave: byte k of texel j goes back to pixel j of quarter k
            output[(i + 1) * outrowsize + j + k * outsizex / 4] = outputf[i * outsizex + j * 4 + k];
        }
}
[/code]
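To see the interleaving on its own, here is a minimal stand-alone sketch of the pack/unpack step for a single row. It leaves out Brook, the pitch handling and the padding texels of the code above, and the names `pack_row` and `unpack_row` are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack one row of 8-bit pixels into 4-byte texels: texel j holds pixel j of
// each quarter of the row (ABCD EFGH IJKL MNOP -> AEIM BFJN CGKO DHLP).
std::vector<std::uint8_t> pack_row(const std::vector<std::uint8_t>& row) {
    assert(row.size() % 4 == 0); // width must be a multiple of 4
    const std::size_t quarter = row.size() / 4;
    std::vector<std::uint8_t> packed(row.size());
    for (std::size_t j = 0; j < quarter; ++j)
        for (std::size_t k = 0; k < 4; ++k)
            packed[j * 4 + k] = row[j + k * quarter];
    return packed;
}

// Inverse of pack_row: byte k of texel j goes back to pixel j of quarter k.
std::vector<std::uint8_t> unpack_row(const std::vector<std::uint8_t>& packed) {
    assert(packed.size() % 4 == 0);
    const std::size_t quarter = packed.size() / 4;
    std::vector<std::uint8_t> row(packed.size());
    for (std::size_t j = 0; j < quarter; ++j)
        for (std::size_t k = 0; k < 4; ++k)
            row[j + k * quarter] = packed[j * 4 + k];
    return row;
}
```

With this layout the horizontal neighbours of the four pixels in a texel sit in the neighbouring texels, so one RGBA operation in the shader processes four video pixels at once.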
I use BrookGPU because it's easier than learning DirectX from scratch. At least until you release some working source code for AviSynth :p
If you want to look at the pixel shader code, just look in the file kernels.br
bill_baroud:
Hmm, AGP upload should be quite fast, no? It's the download that is slow.
You're right about that. It must be the extra packing (by the CPU) of the two images that causes most of the slowdown then. Also, strangely enough, the CPU utilization is about 95-98% when running the code. Using CodeAnalyst from AMD, most of the cycles are spent in nv4_disp (41%) and only about 11% in ml3dexgpu, out of a total of 61% (I think it was so low because the data collection was running).
That's about 12-16 MB/s... I heard somewhere that AGP downloading was around 50 MB/s, so it's more the GPU that's lacking here, no?
Well, a test with a PCI-Express card would definitely be interesting, but I suppose that a median filter needs too many conditional operations for a GPU.
The code doesn't use a single conditional operation, mainly so that the CPU version can be MMX'ed easily. This does create a rather long program, however.
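As an illustration of how a median can be computed without conditionals, a median of three values can be written with min/max only. This is a generic sketch, not the code from kernels.br; the larger 7- and 11-value medians used by ML3Dex are not shown, and the helper `median3_planes` is a hypothetical example of applying it per pixel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Median of three values using only min/max, no if/else.
std::uint8_t median3(std::uint8_t a, std::uint8_t b, std::uint8_t c) {
    const std::uint8_t lo = std::min(a, b);
    const std::uint8_t hi = std::max(a, b);
    return std::max(lo, std::min(hi, c));
}

// Per-pixel median of three equally sized planes (hypothetical helper).
void median3_planes(const std::uint8_t* p0, const std::uint8_t* p1,
                    const std::uint8_t* p2, std::uint8_t* dst,
                    std::size_t count) {
    assert(p0 && p1 && p2 && dst);
    for (std::size_t i = 0; i < count; ++i)
        dst[i] = median3(p0[i], p1[i], p2[i]);
}
```

Because min and max map to single instructions both in MMX (for unsigned bytes) and in pixel shaders, this form avoids branches at the cost of more instructions, which fits the long program mentioned above.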
The standard AviSynth error when you use wrong arguments/parameters.
I tried many ways of using the filter and different parameters, with no luck.
I got the same error when I tested the extracted dll. I compiled and uploaded the zip file again(same link) so it should work now.
bill_baroud
20th January 2005, 13:07
I got the same error when I tested the extracted dll. I compiled and uploaded the zip file again(same link) so it should work now.
Well, I installed MVTools and it worked :)
Speed with your new version of the GPU filter (the test file is 720x576 MJPEG to HuffYUV):
DisableCache does almost nothing for me (?)
PixelShader 1 > about 30 fps
PixelShader 2 > 25 fps IIRC, or almost the same as 1 (it was late... err, early today ;))
Normal mode > 6 fps
CPU filter > 25 fps (AMD XP 2200 MHz)
@Antitorgo: I didn't have the time to really look at your AviShader, but IIRC you just display the result of a video filtered with some shaders, no? In that case you are not limited by the slow AGP download, right? (Forget this if I didn't understand.)
[edit] My bad, I missed the end of your post. It is an AviSynth filter ;)
tsp
20th January 2005, 16:51
DisableCache does almost nothing for me (?)
I think most of the speed decrease I get by disabling the cache is because of the increased CPU utilization and maybe because of the bandwidth limitations (AGP).
In your case I think ml3dex is severely GPU power limited (compare the great speed increase with modes 1 and 2), so the extra CPU cycles don't matter much with the cache disabled.
Antitorgo
2nd February 2005, 00:34
Some tips you might want to try to speed up things: (I know you already did some of these) Plus, I'm not sure how much this would apply to Brook, but take what you will...
a) The GPU runs its shaders asynchronously, and the default behavior when you try to read back from a render target is to block the CPU (which is totally lame). So you might want to try doing something else useful on the CPU before trying to read the results back. In AviShader, I'm copying the next frame into a texture and uploading it to the graphics card. I was actually able to lower CPU utilization by putting a sleep() call in before reading the contents of the render target, but since I'm doing a general purpose app with no control over shader run-lengths, doing a sleep call can/will lower the FPS too. For a fixed shader, you could probably get some decent timings and adjust the sleep() variable.
b) Working in a YV12 or YUY2 space in the shader is a bitch (due to packing/unpacking and instruction count limits), and can easily saturate the instruction count. However, if you can do it, it is well worth it.
c) Using textures as lookup tables is usually really fast. On ATi cards, they can do 1 texture lookup and 1 ALU op per cycle per pipeline. Plus, the memory subsystem on graphics cards is blazingly fast, so you can do some large lookup tables w/o as much of a cache-miss penalty as you'd see on a CPU. (Plus, IIRC the pixel shader pipelines typically work on a 2x2 block of pixels, so locality in x-y dimensions is less of an issue.)
d) A render target read after a render target read is fast, so if you can, you might look into rendering several frames at a time, then reading them all back at once (but watch out for memory thrashing and other badness).
e) A separable filter is usually better than a 1-pass filter (e.g. 6 texture lookups/adds in 2 passes vs. 9 texture lookups/adds in 1 pass).
I'm sure I'll think of more, but it is a start...
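Tip e) can be illustrated on the CPU with a 3x3 box sum, chosen here because it is exactly separable (a median generally is not). This is a sketch with made-up names, not AviShader or ML3Dex code: the single-pass version reads 9 samples per output pixel, the two-pass version reads 3 + 3.

```cpp
#include <cassert>
#include <vector>

using Image = std::vector<std::vector<int>>;

// Clamp a coordinate to [0, n - 1] so edge pixels are replicated.
static int clamp_index(int v, int n) { return v < 0 ? 0 : (v >= n ? n - 1 : v); }

// Single pass: every output pixel reads all 9 neighbours.
Image box3x3_single_pass(const Image& src) {
    assert(!src.empty() && !src[0].empty());
    const int h = static_cast<int>(src.size());
    const int w = static_cast<int>(src[0].size());
    Image dst(h, std::vector<int>(w, 0));
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    dst[y][x] += src[clamp_index(y + dy, h)][clamp_index(x + dx, w)];
    return dst;
}

// Two passes: 3 horizontal reads into tmp, then 3 vertical reads from tmp.
Image box3x3_two_pass(const Image& src) {
    assert(!src.empty() && !src[0].empty());
    const int h = static_cast<int>(src.size());
    const int w = static_cast<int>(src[0].size());
    Image tmp(h, std::vector<int>(w, 0));
    Image dst(h, std::vector<int>(w, 0));
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            for (int dx = -1; dx <= 1; ++dx)
                tmp[y][x] += src[y][clamp_index(x + dx, w)];
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            for (int dy = -1; dy <= 1; ++dy)
                dst[y][x] += tmp[clamp_index(y + dy, h)][x];
    return dst;
}
```

Both functions produce the same result; on a GPU the two-pass form trades fewer texture lookups for an extra render pass, so whether it wins depends on the hardware and the filter size.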
tsp
3rd February 2005, 23:12
Originally posted by Antitorgo
Some tips you might want to try to speed up things: (I know you already did some of these) Plus, I'm not sure how much this would apply to Brook, but take what you will...
a) The GPU runs its shaders asynchronously, and the default behavior when you try to read back from a render target is to block the CPU (which is totally lame). So you might want to try doing something else useful on the CPU before trying to read the results back. In AviShader, I'm copying the next frame into a texture and uploading it to the graphics card. I was actually able to lower CPU utilization by putting a sleep() call in before reading the contents of the render target, but since I'm doing a general purpose app with no control over shader run-lengths, doing a sleep call can/will lower the FPS too. For a fixed shader, you could probably get some decent timings and adjust the sleep() variable.
I just noticed a similar post in the Brook forum. Really useful. Now I just have to figure out what to do. Maybe the solution is to render ahead a couple of frames and use the time to prepare the next frames (packing/unpacking etc.) as you suggested. This will of course waste some time if the script is non-linear, but I think most users use AviSynth for linear editing.
b) Working in a YV12 or YUY2 space in the shader is a bitch (due to packing/unpacking and instruction count limits), and can easily saturate the instruction count. However, if you can do it, it is well worth it.
Well, in the current version I do pack the YV12 plane before uploading.
This works well when the shader is only accessing pixels in the neighbourhood of the current pixel.
c) Using textures as lookup tables is usually really fast. On ATi cards, they can do 1 texture lookup and 1 ALU op per cycle per pipeline. Plus, the memory subsystem on graphics cards is blazingly fast, so you can do some large lookup tables w/o as much of a cache-miss penalty as you'd see on a CPU. (Plus, IIRC the pixel shader pipelines typically work on a 2x2 block of pixels, so locality in x-y dimensions is less of an issue.)
At the moment I'm working on an FFT filter, and this uses a couple of LUTs. I'm still debugging it, so I don't know how fast it will be.
d) A render target read after a render target read is fast, so if you can, you might look into rendering several frames at a time, then reading them all back at once (but watch out for memory thrashing and other badness).
e) A separable filter is usually better than a 1-pass filter (e.g. 6 texture lookups/adds in 2 passes vs. 9 texture lookups/adds in 1 pass).
I'm sure I'll think of more, but it is a start...
boombastic
8th June 2006, 13:19
So does this filter work on an AGP card? Is it sufficiently stable to use for a whole movie backup?
Thanks for your work tsp!!!
It has only been tested on an agp card. But don't expect this gpu version to be faster than the cpu version as it was my first try to create a gpu filter.
devaster
9th June 2006, 18:01
tsp : try Cg compiler by nvidia ...