Cuda Challenge for x264 ... ?
Sagittaire
4th May 2008, 19:56
http://www.nvidia.fr/object/cuda_contest_april2008_fr.html
http://www.nvidia.fr/content/EMEAI/CUDA/cuda_challenge_terms_fr.html
Dark Shikari
4th May 2008, 21:07
Avail Media is already working on CUDA for x264 ;)
Yoshiyuki Blade
4th May 2008, 22:58
I can't wait to see how that turns out. I hope my aging 8800GTX will have even more lasting value than it has already given me.
I hope my aging 8800GTX
I hate you so much right now...
/me looks at his 6600gt with its poor crackling fan and sighs
Shinigami-Sama
6th May 2008, 08:00
I hate you so much right now...
/me looks at his 6600gt with its poor crackling fan and sighs
I got all of you beat!
5200fx
first gen dx9!
It will be nice to see this nice-looking stuff make it into x264.
It's definitely growing fast these days; kinda sad that big commercial companies still can't even compete.
Inventive Software
6th May 2008, 10:47
Yeah, Shinigami, except it cheated with DX9 and fell back to DX8.1 when it felt like it! :p
DarkZell666
6th May 2008, 15:04
Too bad, I switched from a Geforce 8600 GTS to a Radeon HD 3870 and I'm regretting I chose ATI, not only because of its poor drivers and performance, but now even more so because of CUDA :D *Looks at his HD 3870 with an evil grin :devil: mwahahahaha*
lucassp
6th May 2008, 16:33
I think an ATi fanboy will make the same thing using CTM...it's just a matter of time.
IIRC, CTM has been discontinued and is no longer officially supported.
What is CUDA? I can't read French!
(And don't say Google/Wikipedia, doom9 is my bible)
What is CUDA? I can't read French!
(And don't say Google/Wikipedia, doom9 is my bible)
http://www.nvidia.com/object/cuda_learn.html
Shinigami-Sama
7th May 2008, 06:07
http://www.nvidia.com/object/cuda_learn.html
So basically C/C++ CPU-GPU interaction/offloading for parallel computing, no?
So basically C/C++ CPU-GPU interaction/offloading for parallel computing, no?
Yeah, from the example code they have on their site it seems they just provide an abstraction layer to programmers that unifies CPU and GPU. So you don't need to learn OpenGL or Direct3D, or worry about switching between CPU and GPU targets for your code; you just write what appears to be normal C++ with some new coding conventions (function and variable names), and the compiler does the rest. And if it does it well enough, you profit :)
I think Sh (http://libsh.org/), commercialized by RapidMind (http://www.rapidmind.net/) (same developers), is the cross-platform, CPU- and GPU-independent version of this (it even supports the PS3's Cell + GPU combo). Since last I checked RapidMind still offered their stuff for free for non-commercial use, maybe it will be a better code investment than CUDA? Especially since the Sh devs have been at it longer than Nvidia.
RapidMind is higher level ... with abstraction comes loss of performance, more severe with these kinds of architectures than on a desktop processor, which bristles with technology to make bad code run fast. I don't think you can freely download it anymore.
BlackSharkfr
7th May 2008, 22:48
I was curious, so I read through a CUDA introduction tutorial they have on their website. It's well written and easy to understand even for programming noobs like me.
I'd really like to see this technology being used in x264.
Apparently they use the GPU as a giant SIMD co-processor, so it should be very useful for video, since you often have to execute the same algorithm on a huge number of pixels and/or macroblocks (especially in HD) simultaneously. But I don't know much about the x264 code, so I can't really say which parts of the encoding process would benefit the most from the GPU.
Using the CUDA API (C/C++) seems simple, but in order to get the computations done "10/20/30/40/50/100+ times faster" (which is what they claim), you need massively parallelizable code without divergent branching, and then you have to avoid bottlenecks caused by memory latency and PCI-Express bandwidth.
Highly irregular branching patterns (skip modes) and bit manipulation (quantization/entropy coding) don't suit present GPUs. IMO the only really good application at the moment is full search ME algorithms; in the end, though, accelerated full search is still slow even if it's faster than on the CPU. Because it has to do everything at single precision floating point, it won't usually have 10x+ the performance of modern processors when you can use 8 or 16 bit on the CPU, BTW. In pure arithmetic at the usual precisions in x264 it probably ties with a quad core ... it does have a lot more bandwidth though.
Dark Shikari
7th May 2008, 23:33
Highly irregular branching patterns (skip modes) and bit manipulation (quantization/entropy coding) don't suit present GPUs. IMO the only really good application at the moment is full search ME algorithms; in the end, though, accelerated full search is still slow even if it's faster than on the CPU.
Actually, basically everything can be reasonably done on the GPU except CABAC (which could be done, it just couldn't be parallelized).
x264 CUDA will implement a fullpel and subpel ME algorithm initially; later on we could do something like RDO with a bit-cost approximation instead of CABAC.
Because it has to do everything at single precision floating point
Wrong, CUDA supports integer math.
It supports integer math, but it doesn't support SIMD ... it's basically just not doing the renormalization.
Dark Shikari
7th May 2008, 23:36
It supports integer math, but it doesn't support SIMD ... it's basically just not doing the renormalization.
Yes it does support SIMD. CUDA on an 8800GTX, for example, has 16 stream processors, each of which can perform 8 of the same arithmetic operation at the same time.
Seriously? Hmm, didn't notice that before ... I stand corrected.
PS. are you quite certain it can operate on uchar4 at 4 times the throughput as fp? (I don't have a g80 unfortunately.) I would have thought they would have made a bigger deal out of that if they could do it.
PPS. I think I worded it poorly the first time. I meant they can't do SIMD inside a "thread" (operations on vectors are simply iterated over the components). Everything from 8-bit to 24-bit operations runs on the same arithmetic units and at the same throughput, whereas a CPU will get a significant increase in throughput of arithmetic ops when using lower precision.
akupenguin
8th May 2008, 00:55
PS. are you quite certain it can operate on uchar4 at 4 times the throughput as fp? (I don't have a g80 unfortunately.) I would have thought they would have made a bigger deal out of that if they could do it.
PPS. I think I worded it poorly the first time. I meant they can't do SIMD inside a "thread" (operations on vectors are simply iterated over the components). Everything from 8-bit to 24-bit operations runs on the same arithmetic units and at the same throughput, whereas a CPU will get a significant increase in throughput of arithmetic ops when using lower precision.
Right. CUDA doesn't have vector registers. It operates on multiple 32bit scalars at once, and masking off some of the bits doesn't make it faster.
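To put rough numbers on the precision point above: a fixed-width CPU SIMD register holds more elements as the element size shrinks, while a 32-bit scalar unit handles one value per operation regardless of how many bits are actually used. A minimal sketch, assuming a 128-bit SSE2-style register (the register width is an assumption for illustration; the thread does not name an instruction set):

```python
# Elements processed per operation at different integer widths.
# ASSUMPTION: 128-bit SIMD register (SSE2-style); not stated in the thread.
SIMD_REGISTER_BITS = 128
SCALAR_ALU_BITS = 32  # one 32-bit scalar per GPU thread, per the post above


def simd_lanes(element_bits: int) -> int:
    """Number of elements that fit in one SIMD register."""
    return SIMD_REGISTER_BITS // element_bits


def scalar_lanes(element_bits: int) -> int:
    """A scalar ALU handles one element per op; narrower data does not help."""
    if element_bits > SCALAR_ALU_BITS:
        raise ValueError("element wider than the scalar ALU")
    return 1


for bits in (8, 16, 32):
    print(f"{bits:2d}-bit: SIMD lanes = {simd_lanes(bits):2d}, "
          f"scalar lanes = {scalar_lanes(bits)}")
```

Under that assumption, dropping from 32-bit to 8-bit data quadruples the elements per CPU instruction, while the per-thread scalar count stays at one.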
deekey777
8th May 2008, 01:10
IIRC, CTM has been discontinued and is no longer officially supported.
It's CAL now.
http://ati.amd.com/technology/streamcomputing/sdkdwnld.html
metaxaos
19th May 2008, 13:21
Dark Shikari
Could you give an approximate date for the first public release of CUDA x264?
And then, will it just be the next version of the present x264 .exe with GPU acceleration and the same interface and usage keys, or will it be some backward-incompatible fork?
Dark Shikari
19th May 2008, 15:06
Dark Shikari
Could you give an approximate date for the first public release of CUDA x264?
And then, will it just be the next version of the present x264 .exe with GPU acceleration and the same interface and usage keys, or will it be some backward-incompatible fork?
I definitely can't, considering that Avail still isn't actually sure whether they're going to do CUDA, FPGA work, or some bizarre combination... :p
audyovydeo
19th May 2008, 17:19
I definitely can't, considering that Avail still isn't actually sure whether they're going to do CUDA, FPGA work, or some bizarre combination... :p
What I can't figure out is the relationship between Avail Media (out to make money) and x264 (a (currently) free H.264 encoder).
These guys (http://www.rapihd.com/) will probably get there first, independently of x264.
cheers
a/v
Dark Shikari
19th May 2008, 17:25
What I can't figure out is the relationship between Avail Media (out to make money) and x264 (a (currently) free H.264 encoder).
cheers
a/v
Avail is responsible for a very large portion of x264 code and uses it for live SD and HD broadcast of television (IPTV, etc). Akupenguin used to work for them.
audyovydeo
19th May 2008, 17:28
Avail is responsible for a very large portion of x264 code and uses it for live SD and HD broadcast of television (IPTV, etc).
Hello there.
interesting. But I don't often see a .com spending €€€ on a project, only to cascade it to x264 at no charge, for the good of humanity.
I'll keep watching this space ...
a/v
Dark Shikari
19th May 2008, 17:32
Hello there.
interesting. But I don't often see a .com spending €€€ on a project, only to cascade it to x264 at no charge, for the good of humanity.
Then you haven't seen a lot of modern companies ;)
Look at Facebook for example; they open-source a whole lot of their work and they will be open-sourcing their Flash video encoding scripts (which automatically source videos using mplayer/ffmpeg, automatically compensate for VFR/etc, and then re-encode video/audio with x264 and remux) when they're done with them.
Avail of course is even bigger than that; basically every improvement to x264 they've made has been open-sourced. Since they operate as a service business (handling broadcast to make money), there is no loss to them in releasing those kinds of software improvements; if anything, there is the benefit they gain from community involvement.
(P.S: I'm posting this from Avail Media ;) )
audyovydeo
19th May 2008, 17:47
Thanks for the subliminal message ;-)
cheers
a/v
tre31
25th May 2008, 12:52
IIRC, CTM has been discontinued and is no longer officially supported.
AMD now uses CAL (which includes amd-brookplus-sdk-v1.00.0-alpha.msi and amd-cal-sdk-v0.90.0-alpha.msi; not sure if they are the latest versions, but that's what's inside the CAL SDK), not CTM.
So hopefully someone with the knowledge will do something with it. I myself don't have the graphics, mathematics, or encoding knowledge to do it, but I downloaded the kit to see the capabilities firsthand, and I must say ATI's GPGPU products are quite good as well (tested on an HD 2600 XT).
RapidHD demonstrated their products today at Nvidia Tech Day. It can encode a 720p HD file at 2x the speed of real time using a 9600GT, which is about 8 times faster than a 3GHz quad core according to them. I can't wait to see what can be done on the new GTX200 series.
The best part of this is that you can do the encoding on the GPU while you keep surfing, chatting on IM, listening to music, watching YouTube etc. with NO speed lost at all!!
Dark Shikari
28th May 2008, 17:41
RapidHD demonstrated their products today at Nvidia Tech Day. It can encode a 720p HD file at 2x the speed of real time using a 9600GT, which is about 8 times faster than a 3GHz quad core according to them.
8 times faster than a quadcore? What encoder are they comparing to, the JM? ;)
On lowest settings, x264 encodes 720p ~3.8 times faster than realtime on a quadcore... and that's on 32-bit.
the_corona
28th May 2008, 18:18
RapidHD demonstrated their products today at Nvidia Tech Day. It can encode a 720p HD file at 2x the speed of real time using a 9600GT, which is about 8 times faster than a 3GHz quad core according to them. I can't wait to see what can be done on the new GTX200 series.
The best part of this is that you can do the encoding on the GPU while you keep surfing, chatting on IM, listening to music, watching YouTube etc. with NO speed lost at all!!
There's a video of it here (http://www.youtube.com/watch?v=8C_Pj1Ep4nw), it's just stupid marketing though....
They show some 720p to "iTunes format" at about 150 fps and then a 1440 × 1080 export from Premiere at 46 fps.
g_aleph_r
28th May 2008, 20:02
... and that's on 32-bit.
Is it faster on 64bit?
Dark Shikari
28th May 2008, 20:26
Is it faster on 64bit?
10-15% or so.
But isn't 64-bit not fully supported yet? I thought you guys said it could use some work?
crypto
28th May 2008, 20:47
..On lowest settings, x264 encodes 720p ~3.8 times faster than realtime on a quadcore... and that's on 32-bit.
Hi, No way. I am getting 30 fps max in the first pass, 20 fps in second on a Q6600. Can you suggest faster settings?
%TOOLS%\x264.807 --pass 1 --bitrate %BITRATE% --bframes 3 --partitions none --subme 1 --me dia --no-cabac --analyse none --threads auto --level 4.0 --progress --no-psnr --no-ssim -o NUL pass1.avs
Dark Shikari
28th May 2008, 20:53
Hi, No way. I am getting 30 fps max in the first pass, 20 fps in second on a Q6600. Can you suggest faster settings?
1. Update your x264 and use one of the latest fprofiled builds (e.g. from Jarod)
2. I did the test on a single core of a Core 2 Duo, using --thread-input to make sure I was ignoring the cost of decoding. My CPU was 2GHz and one core, so I scaled up the number by a factor of 6 to account for 4x cores and 1.5x processor speed, making the assumption that threading scales perfectly (which is, according to tests, a pretty accurate assumption, but in reality it's probably a few % worse).
--qp 25 --partitions none --subme 1 --me dia --no-cabac --analyse none --threads auto --no-dct-decimate --progress --no-psnr --no-ssim --merange 6 --aq-mode 0
is a terrible series of settings but it should be pretty fast :p
Make sure you're not decoding-bottlenecked. Also note that I measured "realtime" as "24fps."
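The extrapolation in this post can be reconstructed with a few lines of arithmetic. The sketch below only uses numbers quoted in the thread (24 fps as "realtime", a factor of 6 for 4 cores times 1.5x clock, and the "~3.8 times realtime" claim); the function names are hypothetical, and the single-core figure it derives is implied by those numbers rather than stated by the poster.

```python
# Reconstructing the "~3.8x realtime on a quadcore" estimate.
REALTIME_FPS = 24.0     # the post defines realtime as 24 fps
CORE_FACTOR = 4         # one measured core -> four cores
CLOCK_FACTOR = 1.5      # 2 GHz measured -> 3 GHz target
QUOTED_MULTIPLE = 3.8   # "~3.8 times faster than realtime"


def scale_factor(cores: int, clock_ratio: float) -> float:
    """Combined scale-up, assuming perfect threading and clock scaling."""
    return cores * clock_ratio


def implied_single_core_fps(multiple: float, realtime_fps: float,
                            factor: float) -> float:
    """Single-core frame rate that the scaled-up claim implies."""
    return multiple * realtime_fps / factor


factor = scale_factor(CORE_FACTOR, CLOCK_FACTOR)
quad_fps = QUOTED_MULTIPLE * REALTIME_FPS
single_fps = implied_single_core_fps(QUOTED_MULTIPLE, REALTIME_FPS, factor)

print(f"scale factor:            {factor:.1f}x")
print(f"extrapolated quad core:  {quad_fps:.1f} fps")
print(f"implied single 2GHz core: {single_fps:.1f} fps")
```

So the claim corresponds to roughly 91 fps extrapolated for the quad core, from about 15 fps measured on one 2GHz core, if scaling were perfect.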
crypto
28th May 2008, 20:57
Great, thanks for the clues.
But isn't 64-bit not fully supported yet? I thought you guys said it could use some work?
It needs work only to support 64-bit Windows. Otherwise x264 is pretty well optimized for x86-64 and it can be used on 64-bit Linux and *BSD, at least. I've been using it for a few years already without larger problems.
10-15% or so.
Is this only on AMD processors, or on Intel ones? It's really hard to get good i386/amd64 comparisons, but one test on AMD processors showed an average 20% speed increase (Linux) while the last test I could find on the Core 2 architecture (Windows) showed 64bits versions of the same programs often lagging behind the 32bits versions (and Intel even admitted that only in Nehalem will their processors offer on-par 64bits optimisations compared to the 32bits mode). Are there good x264 benchmarks out there comparing its performance on different architectures? (We already know it trashes most other h264 codecs ;) )
To come back to the thread, I'm really happy to see x264 development already planning support for CUDA, when most commercial programs (except CS4) will still not make use of it.
Are there any plans to use AMD's CAL, since it sounds less straightforward than CUDA?
CAL is less straightforward ... but on the other hand brook+ is more straightforward.
akupenguin
30th May 2008, 14:14
Is this only on AMD processors, or on Intel ones?
They are quite similar. I make no claims for how general this is, but for my favorite x264 setting, Conroe gains 12% speed in 64bit while K8 gains 11%.
It's really hard to get good i386/amd64 comparisons
You must not be trying very hard. It's as simple as: compile for x86_64, compile for x86_32, run both.
Ok, it becomes somewhat harder for interactive programs with no batch mode, but that doesn't describe anything related to compression.
Intel even admitted that only in Nehalem will their processors offer on-par 64bits optimisations compared to the 32bits mode.
So what? x86_64 isn't faster due to any instruction taking fewer cycles. It's faster because it has more registers. (There are a few other minor reasons, but they either don't make very much difference, or they're consequences of more registers.) If Intel gave x86_64 more optimization attention than x86_32, that would just be gravy on top of the existing 12% speedup.
It needs work only to support 64-bit Windows. Otherwise x264 is pretty well optimized for x86-64 and it can be used on 64-bit Linux and *BSD, at least. I've been using it for a few years already without larger problems.
Thanks for clearing that up. Now I will have to look into it.
You must not be trying very hard. It's as simple as: compile for x86_64, compile for x86_32, run both.
Obviously, you also need an x86_64 compatible processor, which I still don't have (and even when I get one, I won't be able to compare a Core 2 to a Phenom, since I'm just planning to get one processor ;) ).
And I prefer to know in advance what kind of performance I can expect. Since I'm going to use Linux in 64bits, I'd like to know if getting a Core 2 will end up with worse results than a Phenom (I could care less for games) in that mode.
This recent test shows 64 bits Vista often quite slower than 32 bits Vista - I just don't know if it's because of Vista, or if it's because of the processor they used.
http://www.extremetech.com/article2/0,2845,2280881,00.asp
Since nobody in the journalistic world seems to care about Phenom or about 64bits Linux, it's hard to get an idea of Core 2 / Phenom respective performance in 64bits Linux (and yes, I did spend a lot of time on Google, but it seems nobody ever had the idea to compare recent Intel & AMD processors in 64bits Linux).
Where I am, a Phenom 9550 is slightly cheaper than a Q6600, while having far better motherboards & IGP. However, I can't figure out what the difference is going to be in 64 bits Linux, since all the comparisons are done using 32 bits Vista.
Since nobody in the journalistic world seems to care about Phenom or about 64bits Linux, it's hard to get an idea of Core 2 / Phenom respective performance in 64bits Linux (and yes, I did spend a lot of time on Google, but it seems nobody ever had the idea to compare recent Intel & AMD processors in 64bits Linux).
Where I am, a Phenom 9550 is slightly cheaper than a Q6600, while having far better motherboards & IGP. However, I can't figure out what the difference is going to be in 64 bits Linux, since all the comparisons are done using 32 bits Vista.
I think that x264 performance is a pretty good indicator that the speed gain is about the same in both architectures. Programs that use a lot of pointers (some of the object-oriented stuff, perhaps) need more memory and may be slower because the pointers are 64-bit.
If you're going to run Linux and the machine is intended for video-related work, the Radeon IGP is not that tempting compared to NVIDIA. I think that there are still tearing/vsync problems with the latest proprietary ATI drivers and open-source drivers aren't any better. NVIDIA's proprietary drivers have far fewer problems for now.
7oby
21st June 2008, 11:56
Avail Media is already working on CUDA for x264 ;)
Dark Shikari,
could you elaborate somewhat on this issue?
You were experimenting with CUDA and ME/MC in December 2007:
http://forums.nvidia.com/lofiversion/index.php?t53172.html
I don't even expect a CUDA solution to be faster than the CPU from the beginning. I see it more as an incremental approach. At the beginning only, let's say, ME is ported to CUDA, and the communication overhead of bouncing data between CPU and GPU eats up all potential performance gains. But once there is sufficient computation going on in CUDA, one can shuffle the algorithm around, e.g. by introducing pipelining like the Toshiba Cell guys did.
What are/were the most challenging things regarding CUDA:
. technical issues like insufficient performance and debugging support of CUDA?
. infant CUDA libraries full of errors?
. the overall x264 execution design being completely different from a design appropriate for CUDA. E.g. you had to go back to a single-threaded x264 execution model (= extremely slow) in order to serialize tasks and allow execution of ME on CUDA, and based on that, extreme difficulties in parallelizing tasks.
Now that you work for Avail Media: What does that mean for CUDA and x264?
LoRd_MuldeR
21st June 2008, 14:16
@7oby: Dark Shikari has already said that Avail Media is now going for FPGA's instead of CUDA ...
7oby
21st June 2008, 14:34
@7oby: Dark Shikari has already said that Avail Media is now going for FPGA's instead of CUDA ...
Thanks for that information. I'm not a regular reader of all boards here.
Yet another FPGA solution:
http://focus.ti.com.cn/cn/lit/wp/spry103/spry103.pdf
A month ago we saw that RapiHD is already pretty far down that road:
http://www.youtube.com/watch?v=8C_Pj1Ep4nw
Is there any source (thread, blog, svn branch alpha source code?) regarding current CUDA development for x264?
I do understand some of the difficulties of porting H.264 encoding to the GPU and I'm very interested in the current status or the experiences made so far.
The only good use for the GPU would be a motion estimation pre-pass IMO. You get away from all the R/D optimization and tight coupling with the rest of the codec which makes GPU acceleration such a headache (if you want to do it at the same quality).
PS. which is not to say the GPU isn't useful for high speed low quality transcoding, I just don't think an accelerated x264 will do it ... it's less about acceleration and more about porting.
the_corona
24th June 2008, 11:41
Just stumbled upon this bit longer article about "Elemental's GPU Accelerated H.264 Encoder".
http://www.anandtech.com/video/showdoc.aspx?i=3339
Maybe it's of interest to some, although it seems x264 has decided against GPU (or do I interpret the responses incorrectly?). I couldn't really figure out what FPGAs are (does any consumer have them?)
cogman
24th June 2008, 15:22
Just stumbled upon this bit longer article about "Elemental's GPU Accelerated H.264 Encoder".
http://www.anandtech.com/video/showdoc.aspx?i=3339
Maybe it's of interest to some, although it seems x264 has decided against GPU (or do I interpret the responses incorrectly?). I couldn't really figure out what FPGAs are (does any consumer have them?)
It's not that they are against it, it's just that it would be hard to make something good that doesn't only work for people with Nvidia GPUs. I think that OpenCL would be much more promising for x264.
An FPGA is basically reprogrammable hardware. They aren't extremely expensive, but I imagine most consumers don't have them (though they have access to them). They are great for training computer engineers, as you just have to make the schematic and then plug in the board.
Dark Shikari
24th June 2008, 15:53
An update for those who care; if the numbers we have now are correct, the FPGA currently being designed can do about 6 billion 16x16 SADs per second using a 64x64 exhaustive motion search on one reference frame.
No GPU can even come close to that order of magnitude :devil:
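For readers unfamiliar with the terms, here is a small pure-Python sketch of exhaustive (full-search) motion estimation with the SAD metric, followed by a back-of-envelope reading of the 6 billion SADs per second figure. Everything beyond that figure is an assumption made for illustration: the tiny 4x4 block and ±4 search window (so the demo runs instantly), the reading of "64x64 search" as 4096 candidate positions per macroblock, and the 1280x720 frame size. It is not x264's or the FPGA's implementation.

```python
# Exhaustive motion search with SAD, shrunk to toy sizes for illustration.

def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
    return total


def full_search(cur, ref, cx, cy, size, search_range):
    """Test every displacement in [-range, range); return (sad, dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range):
        for dx in range(-search_range, search_range):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate block falls outside the frame
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


def make_frame(width, height, fx, fy, size):
    """Black frame with one textured size x size patch at (fx, fy)."""
    frame = [[0] * width for _ in range(height)]
    for y in range(size):
        for x in range(size):
            frame[fy + y][fx + x] = 10 + size * y + x
    return frame


BLOCK = 4
ref = make_frame(16, 16, 6, 5, BLOCK)   # patch at (6, 5) in the reference
cur = make_frame(16, 16, 4, 4, BLOCK)   # same patch at (4, 4) in the current frame
best = full_search(cur, ref, 4, 4, BLOCK, 4)
print("best (sad, dx, dy):", best)

# Back-of-envelope throughput, ASSUMING 4096 candidates per macroblock
# and 720p frames (both assumptions, not stated in the post).
FPGA_SADS_PER_SECOND = 6e9
SEARCH_POSITIONS = 64 * 64
MACROBLOCKS_720P = (1280 // 16) * (720 // 16)
sads_per_frame = SEARCH_POSITIONS * MACROBLOCKS_720P
frames_per_second = FPGA_SADS_PER_SECOND / sads_per_frame
print(f"{sads_per_frame} SADs per frame -> {frames_per_second:.0f} frames/s")
```

Under those assumptions the quoted rate would cover a full 64x64 search for every macroblock of a 720p frame at roughly 400 frames per second on one reference frame.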
Inventive Software
24th June 2008, 15:54
Compare that with a conventional CPU. ;)
akupenguin
24th June 2008, 16:39
x264's ESA takes about 8 cycles per 16x16 SAD. On an 8core 3GHz box, that's 3 billion SADs per second. Of course the FPGA is much cheaper than such a beefy CPU. Otoh, ESA isn't really what you want, it's just what's easy to implement on a FPGA.
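The CPU figure follows directly from the numbers in the post: cores times clock rate divided by cycles per SAD. A minimal sketch of that arithmetic, using only values quoted in the thread (the function name is hypothetical):

```python
# SAD throughput of a CPU from cycles-per-SAD, per the post above.
CYCLES_PER_SAD = 8          # "about 8 cycles per 16x16 SAD"
CORES = 8                   # "8core 3GHz box"
CLOCK_HZ = 3.0e9
FPGA_SADS_PER_SECOND = 6e9  # figure quoted earlier in the thread


def sads_per_second(cores: int, clock_hz: float, cycles_per_sad: float) -> float:
    """Aggregate SAD rate, assuming all cores run the search in parallel."""
    return cores * clock_hz / cycles_per_sad


cpu_rate = sads_per_second(CORES, CLOCK_HZ, CYCLES_PER_SAD)
fpga_vs_cpu = FPGA_SADS_PER_SECOND / cpu_rate

print(f"CPU:  {cpu_rate:.1e} SADs/s")
print(f"FPGA: {FPGA_SADS_PER_SECOND:.1e} SADs/s ({fpga_vs_cpu:.1f}x the CPU)")
```

On these figures the FPGA's raw rate is about twice that of the 8-core box, which is the comparison the post is making.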
cogman
24th June 2008, 17:44
An update for those who care; if the numbers we have now are correct, the FPGA currently being designed can do about 6 billion 16x16 SADs per second using a 64x64 exhaustive motion search on one reference frame.
No GPU can even come close to that order of magnitude :devil:
What kind of bandwidth will that thing need? Could you put it on a USB stick, or would it need something like a PCI express 1x slot (or 16x)
Dark Shikari
24th June 2008, 17:47
What kind of bandwidth will that thing need? Could you put it on a USB stick, or would it need something like a PCI express 1x slot (or 16x)
PCI-Express. 1x is probably sufficient.
The price of a board would probably be on the order of $200-$400.
nekrosoft13
24th June 2008, 18:11
http://www.techarp.com/editorials/img/0823_PhysX_05.png
this is the performance we might expect from well written Cuda app.
I wonder how my GTX 280 does ;)
Dark Shikari
24th June 2008, 18:18
http://www.techarp.com/editorials/img/0823_PhysX_05.png
this is the performance we might expect from well written Cuda app.
I wonder how my GTX 280 does ;)
I can make great charts if I completely make up numbers too. ;)
All comparisons I have seen of this sort are utter bullshit that probably is doing something on the order of comparing the graphics card to the JM encoder, because their speeds for CPU encoding are usually off by at least a factor of 16 if not more. It's quite easy to beat the competition if you lie about their encoding speed.
Gabriel_Bouvigne
24th June 2008, 18:19
Come on, we don't even know what the resulting video looks like...
cogman
24th June 2008, 18:26
I can make great charts if I completely make up numbers too. ;)
All comparisons I have seen of this sort are utter bullshit that probably is doing something on the order of comparing the graphics card to the JM encoder, because their speeds for CPU encoding are usually off by at least a factor of 16 if not more. It's quite easy to beat the competition if you lie about their encoding speed.
It's not lying, it's marketing! But yeah, your point is completely valid. We have no idea what the results look like or what settings were used for the encodes. Heck, we don't even know if they were encoding to the H.264 standard or cutting corners.
Dark Shikari
24th June 2008, 18:34
Also, you notice in their graph that a 3Ghz quadcore is barely more than twice as fast as a 1.2Ghz dualcore--meaning the encoder they tested with was singlethreaded :rolleyes:
Here's a slightly more accurate graph using x264 numbers (assuming HD is defined as 720p, which appears to be what they're going for):
http://i31.tinypic.com/2nvs9dh.png
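The single-threaded inference above rests on a simple ratio. A minimal sketch, assuming performance scales linearly with clock and core count and ignoring any microarchitectural difference between the two chips (both simplifications are assumptions; the function names are hypothetical):

```python
# Expected speedup of a 3 GHz quad core over a 1.2 GHz dual core.

def ideal_speedup(clock_a_ghz, cores_a, clock_b_ghz, cores_b):
    """Speedup of B over A if the encoder used every core perfectly."""
    return (clock_b_ghz * cores_b) / (clock_a_ghz * cores_a)


def single_thread_speedup(clock_a_ghz, clock_b_ghz):
    """Speedup of B over A if the encoder only ever uses one core."""
    return clock_b_ghz / clock_a_ghz


multi = ideal_speedup(1.2, 2, 3.0, 4)
single = single_thread_speedup(1.2, 3.0)

print(f"multithreaded encoder:  about {multi:.1f}x")
print(f"singlethreaded encoder: about {single:.1f}x")
```

A gap of "barely more than twice" sits much closer to the 2.5x single-threaded case than to the 5x a well-threaded encoder would show.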
Also, you notice in their graph that a 3Ghz quadcore is barely more than twice as fast as a 1.2Ghz dualcore--meaning the encoder they tested with was singlethreaded :rolleyes:
Here's a slightly more accurate graph using x264 numbers (assuming HD is defined as 720p, which appears to be what they're going for):
http://i31.tinypic.com/2nvs9dh.png
I have a 3.4GHz quad core and I can't even get 110 FPS on the first pass on 720p. More realistic rates are 80 on first pass and 25 on second using
x264.exe --pass 1 --bitrate #### --stats "some.stats" --threads auto --keyint 240 --min-keyint 24 --bframes 3 --b-pyramid --me dia --subme 1 --partitions none --progress --no-psnr --no-ssim --output NUL "some.avs"
x264.exe --pass 2 --bitrate #### --stats "some.stats" --threads auto --keyint 240 --min-keyint 24 --ref 3 --bframes 3 --b-pyramid --bime --weightb --subme 6 --trellis 1 --8x8dct --progress --no-psnr --no-ssim --output "some.mkv" "some.avs"
with an avs doing a basic straight feed. i.e. no resize, no decimate etc... from a HDTV 720p source and all 4 cores maxed out on both passes.
Dark Shikari
24th June 2008, 20:36
I have a 3.4GHz quad core and I can't even get 110 FPS on the first pass on 720p.
Yes, because you're actually using half-decent settings ;)
If you skimp much more on your settings and drop to baseline profile, you can do better.
Actually, if you're willing to completely trash your settings (Baseline, dia, low merange, subme1, no dct decimate, no partitions, no scenecut, no deblocking, no AQ, constant quantizer), you can get about 56 FPS, singlethreaded, on a 3Ghz Core 2 on 64-bit Linux (and probably slightly higher on a Penryn). Assuming perfect scaling, which is expected with no B-frames or scenecut, you could reach about 240FPS or more on a quadcore.
Fun fact: Xvid only gets 54FPS on fastest settings on one core of that machine.
With four threads, that completely trashes the 9800GTX. We have no idea what quality the GPU encoder produces, of course; I'm guessing it's awful, but we won't know for sure until they stop posting bullcrap benchmarks and post actual streams.
An update for those who care; if the numbers we have now are correct, the FPGA currently being designed can do about 6 billion 16x16 SADs per second using a 64x64 exhaustive motion search on one reference frame.
Is that with SEA?
Dark Shikari
24th June 2008, 21:03
Is that with SEA?
No, SEA is not practical to implement on an FPGA as far as I know. It's just raw ESA.
Hmm, it has been said that on large FFTs the new AMD Firestream could get 170 GFLOPs throughput (which is outrageously fast when compared to CUFFT, nearly an order of magnitude faster than 8800s). If that's really true you could do a 128x128 fast full SSD search about as fast as with the FPGA full SAD search.
Of course initial ME is only half the battle (if that). RDO mode optimization and MV refinement are just as much performance killers ... and slightly harder to implement on FPGAs or GPUs.
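Some background on why an FFT helps with an SSD search: the squared difference expands into two energy terms plus a cross term, and the cross term evaluated over all displacements is a correlation, which is the part an FFT can compute quickly. The sketch below only checks the algebraic identity on a toy 1-D example; it does not implement the FFT search itself, and the sample values are made up for illustration.

```python
# SSD expanded as: sum(a^2) + sum(b^2) - 2 * sum(a*b).
# The sum(a*b) term, taken over every displacement, is a correlation.

def ssd_direct(a, b):
    """Sum of squared differences, computed element by element."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ssd_expanded(a, b):
    """Same value via the energy terms and the cross (correlation) term."""
    energy_a = sum(x * x for x in a)
    energy_b = sum(y * y for y in b)
    cross = sum(x * y for x, y in zip(a, b))
    return energy_a + energy_b - 2 * cross


block_a = [1, 2, 3, 4]   # illustrative pixel values
block_b = [4, 3, 2, 1]

print(ssd_direct(block_a, block_b), ssd_expanded(block_a, block_b))
```

SAD has no such expansion because of the absolute value, which is why the FFT route applies to SSD rather than SAD.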
Dark Shikari
24th June 2008, 21:35
Hmm, it has been said that on large FFTs the new AMD Firestream could get 170 GFLOPs throughput (which is outrageously fast when compared to CUFFT, nearly an order of magnitude faster than 8800s). If that's really true you could do a 128x128 fast full SSD search about as fast as with the FPGA full SAD search.
SSD is a worse motion search metric than SAD.
Of course initial ME is only half the battle (if that). RDO mode optimization and MV refinement are just as much performance killers ... and slightly harder to implement on FPGAs or GPUs.
"Slightly" is the understatement of the century.
akupenguin
24th June 2008, 22:06
Hmm, it has been said that on large FFTs the new AMD Firestream could get 170 GFLOPs throughput (which is outrageously fast when compared to CUFFT, nearly an order of magnitude faster than 8800s).
... and still slower than a plain old CPU. Remember where I said that a decent 8core does 3 billion SAD-equivalents per second? (SEA, so not all of those are real, but you're not planning to implement SEA on GPU either) 1 SAD is 768 arithmetic ops, so a brute force implementation would need 2.3 TFLOPS to match that CPU.
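The 2.3 TFLOPS figure is the product of the two numbers in the post. In the sketch below, the split of 768 into 256 pixels times 3 operations (subtract, absolute value, accumulate) is my reading of where that number comes from, not something the post states; the 170 GFLOPS value is the Firestream claim quoted just above.

```python
# Arithmetic throughput needed to brute-force the CPU's SAD rate.
PIXELS_PER_SAD = 16 * 16
OPS_PER_PIXEL = 3            # ASSUMPTION: subtract, abs, accumulate
CPU_SADS_PER_SECOND = 3e9    # "3 billion SAD-equivalents per second"
FIRESTREAM_FLOPS = 170e9     # quoted FFT throughput claim

ops_per_sad = PIXELS_PER_SAD * OPS_PER_PIXEL
required_flops = ops_per_sad * CPU_SADS_PER_SECOND
shortfall = required_flops / FIRESTREAM_FLOPS

print(f"ops per 16x16 SAD: {ops_per_sad}")
print(f"required:          {required_flops / 1e12:.2f} TFLOPS")
print(f"vs 170 GFLOPS:     {shortfall:.1f}x short")
```

That puts the brute-force requirement at about 2.3 TFLOPS, more than thirteen times the quoted 170 GFLOPS.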
Only most of the time :) It's fixed time complexity as opposed to SEA. My point was to compare it to the FPGA though, not the CPU.
slavickas
25th June 2008, 19:56
I can make great charts if I completely make up numbers too. ;)
All comparisons I have seen of this sort are utter bullshit that probably is doing something on the order of comparing the graphics card to the JM encoder, because their speeds for CPU encoding are usually off by at least a factor of 16 if not more. It's quite easy to beat the competition if you lie about their encoding speed.
I think they compare against QuickTime (or is it SlowTime?); at least in the YouTube video from the GT200 presentation they talked about QuickTime.
Snowknight26
27th June 2008, 03:06
http://www.guru3d.com/news/download-ati-avivo-xcode-pack-for-hd4800-series/
Apparently it uses the GPU, even though I've been led to believe that Avivo does it with the CPU.