Cuda Challenge for x264 ... ?


Sagittaire
4th May 2008, 19:56
http://www.nvidia.fr/object/cuda_contest_april2008_fr.html
http://www.nvidia.fr/content/EMEAI/CUDA/cuda_challenge_terms_fr.html

Dark Shikari
4th May 2008, 21:07
Avail Media is already working on CUDA for x264 ;)

Yoshiyuki Blade
4th May 2008, 22:58
I can't wait to see how that turns out. I hope my aging 8800GTX will have even more lasting value than it has already given me.

lexor
4th May 2008, 23:49
I hope my aging 8800GTX
I hate you so much right now...

/me looks at his 6600gt with its poor crackling fan and sighs

Shinigami-Sama
6th May 2008, 08:00
I hate you so much right now...

/me looks at his 6600gt with its poor crackling fan and sighs

I got all of you beat!
5200fx
first gen dx9!

It will be nice to see this nice-looking stuff make it into x264.
It's definitely growing fast these days; kinda sad that big commercial companies still can't even compete.

Inventive Software
6th May 2008, 10:47
Yeah, Shinigami, except it cheated with DX9 and fell back to DX8.1 when it felt like it! :p

DarkZell666
6th May 2008, 15:04
Too bad, I switched from a Geforce 8600 GTS to a Radeon HD 3870 and I'm regretting that I chose ATI, not only because of its poor drivers and performance, but now even more so because of CUDA :D *Looks at his HD 3870 with an evil grin :devil: mwahahahaha*

lucassp
6th May 2008, 16:33
I think an ATi fanboy will make the same thing using CTM...it's just a matter of time.

Sulik
7th May 2008, 03:40
IIRC, CTM has been discontinued and is no longer officially supported.

bob0r
7th May 2008, 04:17
What is CUDA? I can't read French!
(And don't say Google/Wikipedia, doom9 is my bible)

lexor
7th May 2008, 04:24
What is CUDA? I can't read French!
(And don't say Google/Wikipedia, doom9 is my bible)

http://www.nvidia.com/object/cuda_learn.html

Shinigami-Sama
7th May 2008, 06:07
http://www.nvidia.com/object/cuda_learn.html

So basically C/C++ CPU-GPU interaction/offloading for parallel computing, no?

lexor
7th May 2008, 18:33
So basically C/C++ CPU-GPU interaction/offloading for parallel computing, no?

Yeah, from the example code they have on their site it seems they just provide an abstraction layer to programmers that unifies CPU and GPU. So you don't need to learn OpenGL or Direct3D or worry about switching between CPU and GPU targets for your code; you just write what appears to be normal C++ with some new coding conventions (function and variable names), and the compiler does the rest. And if it does it well enough, you get profit :)

I think Sh (http://libsh.org/), commercialized by RapidMind (http://www.rapidmind.net/) (same developers), is the cross-platform as well as CPU- and GPU-independent version of this (it even supports the PS3's Cell + GPU combo). Since, last I checked, RapidMind still offered their stuff for free for non-commercial use, maybe it will be a better code investment than CUDA? Especially since the Sh devs have been at it longer than Nvidia.

MfA
7th May 2008, 22:28
Rapidmind is higher level ... with abstraction comes loss of performance, more severe with these kinds of architectures than on desktop processors, which bristle with technology to make bad code run fast. I don't think you can freely download it anymore.

BlackSharkfr
7th May 2008, 22:48
I was curious, so I read through a CUDA introduction tutorial they have on their website. It's well written and easy to understand even for programming noobs like me.
I'd really like to see this technology being used in x264.

Apparently they use the GPU as a giant SIMD co-processor, so it should be very useful for video, since you often have to execute the same algorithm on a huge number of pixels and/or macroblocks (especially in HD) simultaneously, but I don't know much about the x264 code, so I can't really say which parts of the encoding process would benefit the most from the GPU.

Using the CUDA API (C/C++) seems simple, but in order to get the computations done "10/20/30/40/50/100+ times faster" (which is what they claim), you need to have massively parallelizable code without divergent branching, and then avoid bottlenecks caused by memory latency and the PCI-Express bandwidth.
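
To make the "same algorithm on every macroblock" point concrete, here is a minimal Python sketch of a brute-force block motion search. It is not x264 or CUDA code; the function names, the 4x4 block size and the tiny search range are all assumptions chosen to keep the example small. The thing to notice is that each block's search reads only the two frames and never depends on another block's result, which is the property a data-parallel device needs.

```python
# Illustrative sketch only: names, block size and search range are hypothetical.

def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )

def full_search(cur, ref, bx, by, n, r):
    """Try every displacement in [-r, r] x [-r, r] that stays inside the frame.
    Returns (best_sad, dx, dy)."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            inside = (0 <= by + dy and by + dy + n <= h and
                      0 <= bx + dx and bx + dx + n <= w)
            if inside:
                cand = (sad(cur, ref, bx, by, dx, dy, n), dx, dy)
                if best is None or cand < best:
                    best = cand
    return best

def motion_search(cur, ref, n=4, r=2):
    """Run the search for every block. No block depends on any other,
    so all of them could run at the same time."""
    h, w = len(cur), len(cur[0])
    return {
        (bx, by): full_search(cur, ref, bx, by, n, r)
        for by in range(0, h - n + 1, n)
        for bx in range(0, w - n + 1, n)
    }

# Tiny demo: `cur` is `ref` shifted by one pixel in x and y (with wrap-around).
ref = [[3 * x * x + 5 * y * y + x * y for x in range(8)] for y in range(8)]
cur = [[ref[(y + 1) % 8][(x + 1) % 8] for x in range(8)] for y in range(8)]
print(motion_search(cur, ref)[(0, 0)])  # (0, 1, 1): zero SAD at displacement (1, 1)
```

The branching and memory-latency caveats above still apply: this sketch has a fixed amount of work per candidate, which is the easy case.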

MfA
7th May 2008, 23:29
Highly irregular branching patterns (skip modes) and bit manipulation (quantization/entropy coding) don't suit present GPUs. IMO the only really good application at the moment is full search ME algorithms; in the end, though, accelerated full search is still slow even if it's faster than on the CPU. Because it has to do everything at single precision floating point it won't usually have 10x+ the performance of modern processors when you can use 8 or 16 bit on the CPU BTW. In pure arithmetic at usual precisions in x264 it probably ties with a quad core ... it does have a lot more bandwidth though.

Dark Shikari
7th May 2008, 23:33
Highly irregular branching patterns (skip modes) and bit manipulation (quantization/entropy coding) don't suit present GPUs. IMO the only really good application at the moment is full search ME algorithms; in the end, though, accelerated full search is still slow even if it's faster than on the CPU.

Actually, basically everything can be reasonably done on the GPU except CABAC (which could be done, it just couldn't be parallelized).

x264 CUDA will implement a fullpel and subpel ME algorithm initially; later on we could do something like RDO with a bit-cost approximation instead of CABAC.

Because it has to do everything at single precision floating point

Wrong, CUDA supports integer math.

MfA
7th May 2008, 23:34
It supports integer math, but it doesn't support SIMD ... it's basically just not doing the renormalization.

Dark Shikari
7th May 2008, 23:36
It supports integer math, but it doesn't support SIMD ... it's basically just not doing the renormalization.

Yes it does support SIMD. CUDA on an 8800GTX, for example, has 16 stream processors, each of which can perform 8 of the same arithmetic operation at the same time.

MfA
7th May 2008, 23:40
Seriously? Hmm, didn't notice that before ... I stand corrected.

PS. are you quite certain it can operate on uchar4 at 4 times the throughput as fp? (I don't have a g80 unfortunately.) I would have thought they would have made a bigger deal out of that if they could do it.

PPS. I think I worded it poorly the first time. I meant they can't do SIMD inside a "thread" (operations on vectors are simply iterated over the components); everything from 8-bit to 24-bit operations runs on the same arithmetic units and at the same throughput, whereas a CPU will get a significant increase in throughput of arithmetic ops when using lower precision.

akupenguin
8th May 2008, 00:55
PS. are you quite certain it can operate on uchar4 at 4 times the throughput as fp? (I don't have a g80 unfortunately.) I would have thought they would have made a bigger deal out of that if they could do it.

PPS. I think I worded it poorly the first time. I meant they can't do SIMD inside a "thread" (operations on vectors are simply iterated over the components); everything from 8-bit to 24-bit operations runs on the same arithmetic units and at the same throughput, whereas a CPU will get a significant increase in throughput of arithmetic ops when using lower precision.

Right. CUDA doesn't have vector registers. It operates on multiple 32bit scalars at once, and masking off some of the bits doesn't make it faster.

deekey777
8th May 2008, 01:10
IIRC, CTM has been discontinued and is no longer officially supported.

It's CAL now.

http://ati.amd.com/technology/streamcomputing/sdkdwnld.html

metaxaos
19th May 2008, 13:21
Dark Shikari
Could you give an approximate date for the first public release of CUDA x264?
And will it just be the next version of the present x264 .exe with GPU acceleration and the same interface and command-line switches, or will it be some backward-incompatible fork?

Dark Shikari
19th May 2008, 15:06
Dark Shikari
Could you give an approximate date for the first public release of CUDA x264?
And will it just be the next version of the present x264 .exe with GPU acceleration and the same interface and command-line switches, or will it be some backward-incompatible fork?

I definitely can't, considering that Avail still isn't actually sure whether they're going to do CUDA, FPGA work, or some bizarre combination... :p

audyovydeo
19th May 2008, 17:19
I definitely can't, considering that Avail still isn't actually sure whether they're going to do CUDA, FPGA work, or some bizarre combination... :p

What I can't figure out is the relationship between Avail Media (out to make money) and x264 (a (currently) free H.264 encoder).

These guys (http://www.rapihd.com/) will probably get there first, independently of x264.

cheers
a/v

Dark Shikari
19th May 2008, 17:25
What I can't figure out is the relationship between Avail Media (out to make money) and x264 (a (currently) free H.264 encoder).

cheers
a/v

Avail is responsible for a very large portion of x264 code and uses it for live SD and HD broadcast of television (IPTV, etc). Akupenguin used to work for them.

audyovydeo
19th May 2008, 17:28
Avail is responsible for a very large portion of x264 code and uses it for live SD and HD broadcast of television (IPTV, etc).

Hello there.

interesting. But I don't often see a .com spending €€€ on a project, only to cascade it to x264 at no charge, for the good of humanity.

I'll keep watching this space ...
a/v

Dark Shikari
19th May 2008, 17:32
Hello there.

interesting. But I don't often see a .com spending €€€ on a project, only to cascade it to x264 at no charge, for the good of humanity.

Then you haven't seen a lot of modern companies ;)

Look at Facebook for example; they open-source a whole lot of their work and they will be open-sourcing their Flash video encoding scripts (which automatically source videos using mplayer/ffmpeg, automatically compensate for VFR/etc, and then re-encode video/audio with x264 and remux) when they're done with them.

Avail of course is even bigger than that; basically every improvement to x264 they've made has been open-sourced; since they operate as a service business (handling broadcast to make money), there is no loss to them to release those kind of software improvements; if anything, there is the benefit they gain from community involvement.

(P.S: I'm posting this from Avail Media ;) )

audyovydeo
19th May 2008, 17:47
Thanks for the subliminal message ;-)

cheers
a/v

tre31
25th May 2008, 12:52
IIRC, CTM has been discontinued and is no longer officially supported.

AMD now uses CAL (which includes amd-brookplus-sdk-v1.00.0-alpha.msi and amd-cal-sdk-v0.90.0-alpha.msi - not sure if they are the latest versions, but that's what's inside the CAL SDK), not CTM.

So hopefully someone with the knowledge will do something with it. I myself don't have the gfx or mathematics or knowledge of encoding techniques to do it, but I downloaded the kit to see the capabilities firsthand - and I must say ATI's GPGPU products are quite good as well (tested on an HD2600XT).

iwod
28th May 2008, 17:39
RapiHD demonstrated their products today at the Nvidia Tech day. It can encode a 720p HD file at 2x the speed of real time using a 9600GT, which is about 8 times faster than a 3GHz quad core according to them. I can't wait to see what can be done on the new GTX200 series.

The best part of this is that you can do the encoding on the GPU while you keep surfing, chatting on IM, listening to music, watching YouTube etc. with NO speed loss at all!!

Dark Shikari
28th May 2008, 17:41
RapiHD demonstrated their products today at the Nvidia Tech day. It can encode a 720p HD file at 2x the speed of real time using a 9600GT, which is about 8 times faster than a 3GHz quad core according to them.

8 times faster than a quadcore? What encoder are they comparing to, the JM? ;)

On lowest settings, x264 encodes 720p ~3.8 times faster than realtime on a quadcore... and that's on 32-bit.

the_corona
28th May 2008, 18:18
RapiHD demonstrated their products today at the Nvidia Tech day. It can encode a 720p HD file at 2x the speed of real time using a 9600GT, which is about 8 times faster than a 3GHz quad core according to them. I can't wait to see what can be done on the new GTX200 series.

The best part of this is that you can do the encoding on the GPU while you keep surfing, chatting on IM, listening to music, watching YouTube etc. with NO speed loss at all!!

There's a video of it here (http://www.youtube.com/watch?v=8C_Pj1Ep4nw), it's just stupid marketing though....

They show some 720p to "iTunes format" at about 150 fps, and then a 1440 × 1080 export from Premiere at 46 fps.

g_aleph_r
28th May 2008, 20:02
... and that's on 32-bit.

Is it faster on 64bit?

Dark Shikari
28th May 2008, 20:26
Is it faster on 64bit?

10-15% or so.

Adub
28th May 2008, 20:42
But isn't 64-bit not fully supported yet? I thought you guys said it could use some work?

crypto
28th May 2008, 20:47
..On lowest settings, x264 encodes 720p ~3.8 times faster than realtime on a quadcore... and that's on 32-bit.

Hi, No way. I am getting 30 fps max in the first pass, 20 fps in second on a Q6600. Can you suggest faster settings?


%TOOLS%\x264.807 --pass 1 --bitrate %BITRATE% --bframes 3 --partitions none --subme 1 --me dia --no-cabac --analyse none --threads auto --level 4.0 --progress --no-psnr --no-ssim -o NUL pass1.avs

Dark Shikari
28th May 2008, 20:53
Hi, No way. I am getting 30 fps max in the first pass, 20 fps in second on a Q6600. Can you suggest faster settings?
1. Update your x264 and use one of the latest fprofiled builds (e.g. from Jarod)
2. I did the test on a single core of a Core 2 Duo, using --thread-input to make sure I was ignoring the cost of decoding. My CPU was 2 GHz and one core, so I scaled up the number by a factor of 6 to account for 4x cores and 1.5x processor speed, making the assumption that threading scaled perfectly (which is, according to tests, a pretty accurate assumption, but in reality it's probably a few % worse).

--qp 25 --partitions none --subme 1 --me dia --no-cabac --analyse none --threads auto --no-dct-decimate --progress --no-psnr --no-ssim --merange 6 --aq-mode 0

is a terrible series of settings but it should be pretty fast :p

Make sure you're not decoding-bottlenecked. Also note that I measured "realtime" as "24fps."
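
The scaling argument above can be written out as a few lines of Python. This only restates the arithmetic from the post; the 3 GHz target clock is inferred from the "1.5x processor speed" factor, and the single-core figure is back-computed from the "~3.8 times faster than realtime" claim rather than a number anyone measured here.

```python
REALTIME_FPS = 24          # "realtime" as defined in the post
CORES = 4                  # target quadcore
CLOCK_RATIO = 3.0 / 2.0    # assumed 3 GHz target vs. the 2 GHz test CPU

def scaled_fps(single_core_fps, cores=CORES, clock_ratio=CLOCK_RATIO):
    """Extrapolate a single-core measurement, assuming perfect scaling
    with both core count and clock speed."""
    return single_core_fps * cores * clock_ratio

# Work backwards from the claim to the single-core speed it implies.
quad_fps = 3.8 * REALTIME_FPS
implied_single_core_fps = quad_fps / (CORES * CLOCK_RATIO)

print(round(quad_fps, 1), round(implied_single_core_fps, 1))
```

If threading really scales a few percent worse than perfectly, the quadcore figure drops by the same few percent.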

crypto
28th May 2008, 20:57
Great, thanks for the clues.

nm
28th May 2008, 22:13
But isn't 64-bit not fully supported yet? I thought you guys said it could use some work?
It needs work only to support 64-bit Windows. Otherwise x264 is pretty well optimized for x86-64 and it can be used on 64-bit Linux and *BSD, at least. I've been using it for a few years already without larger problems.

t3g
30th May 2008, 11:21
10-15% or so.

Is this only on AMD processors, or on Intel ones? It's really hard to get good i386/amd64 comparisons, but one test on AMD processors showed an average 20% speed increase (Linux), while the last test I could find on the Core 2 architecture (Windows) showed 64-bit versions of the same programs often lagging behind the 32-bit versions (and Intel even admitted that only in Nehalem will their processors offer 64-bit optimisations on par with the 32-bit mode). Are there good x264 benchmarks out there comparing its performance on different architectures? (We already know it trashes most other H.264 codecs ;) )

To come back to the thread, I'm really happy to see x264 development already planning support for CUDA, when most commercial programs (except CS4) will still not make use of it.

Are there any plans to use AMD's CAL, since it sounds less straightforward than CUDA?

MfA
30th May 2008, 13:17
CAL is less straightforward ... but on the other hand brook+ is more straightforward.

akupenguin
30th May 2008, 14:14
Is this only on AMD processors, or on Intel ones?
They are quite similar. I make no claims for how general this is, but for my favorite x264 setting, Conroe gains 12% speed in 64bit while K8 gains 11%.
It's really hard to get good i386/amd64 comparisons
You must not be trying very hard. It's as simple as: compile for x86_64, compile for x86_32, run both.
Ok, it becomes somewhat harder for interactive programs with no batch mode, but that doesn't describe anything related to compression.
Intel even admitted that only in Nehalem will their processors offer on-par 64bits optimisations compared to the 32bits mode.
So what? x86_64 isn't faster due to any instruction taking fewer cycles. It's faster because it has more registers. (There are a few other minor reasons, but they either don't make very much difference, or they're consequences of more registers.) If Intel gave x86_64 more optimization attention than x86_32, that would just be gravy on top of the existing 12% speedup.

Adub
30th May 2008, 19:15
It needs work only to support 64-bit Windows. Otherwise x264 is pretty well optimized for x86-64 and it can be used on 64-bit Linux and *BSD, at least. I've been using it for a few years already without larger problems.

Thanks for clearing that up. Now I will have to look into it.

t3g
30th May 2008, 19:34
You must not be trying very hard. It's as simple as: compile for x86_64, compile for x86_32, run both.

Obviously, you also need an x86_64 compatible processor, which I still don't have (and even when I'll get one, I won't be able to compare a Core 2 to a Phenom, since I'm just planning to get one processor ;)

And I prefer to know in advance what kind of performance I can expect. Since I'm going to use Linux in 64bits, I'd like to know if getting a Core 2 will end up with worse results than a Phenom (I could care less for games) in that mode.

This recent test shows 64 bits Vista often quite slower than 32 bits Vista - I just don't know if it's because of Vista, or if it's because of the processor they used.
http://www.extremetech.com/article2/0,2845,2280881,00.asp

Since nobody in the journalistic world seems to care about Phenom or about 64bits Linux, it's hard to get an idea of Core 2 / Phenom respective performance in 64bits Linux (and yes, I did spend a lot of time on Google, but it seems nobody ever had the idea to compare recent Intel & AMD processors in 64bits Linux).

Where I am, the Phenom 9550 is slightly cheaper than a Q6600, while having far better motherboards & IGP. However, I can't figure out what the difference is going to be in 64-bit Linux, since all the comparisons are done using 32-bit Vista.

nm
30th May 2008, 21:48
Since nobody in the journalistic world seems to care about Phenom or about 64bits Linux, it's hard to get an idea of Core 2 / Phenom respective performance in 64bits Linux (and yes, I did spend a lot of time on Google, but it seems nobody ever had the idea to compare recent Intel & AMD processors in 64bits Linux).

Where I am, the Phenom 9550 is slightly cheaper than a Q6600, while having far better motherboards & IGP. However, I can't figure out what the difference is going to be in 64-bit Linux, since all the comparisons are done using 32-bit Vista.
I think that x264 performance is a pretty good indicator that the speed gain is about the same in both architectures. Programs that use a lot of pointers (some of the object-oriented stuff, perhaps) need more memory and may be slower because the pointers are 64-bit.

If you're going to run Linux and the machine is intended for video-related work, the Radeon IGP is not that tempting compared to NVIDIA. I think that there are still tearing/vsync problems with the latest proprietary ATI drivers, and the open-source drivers aren't any better. NVIDIA's proprietary drivers have far fewer problems for now.

7oby
21st June 2008, 11:56
Avail Media is already working on CUDA for x264 ;)

Dark Shikari,

could you elaborate somewhat on this issue?

You were experimenting with CUDA and ME/MC in December 2007:
http://forums.nvidia.com/lofiversion/index.php?t53172.html

I don't even expect a CUDA solution to be faster than the CPU from the beginning. I see it more as an incremental approach. At the beginning, let's say only ME is ported to CUDA, and the communication overhead of bouncing data between CPU and GPU eats up all potential performance gains. But once there is sufficient computation going on in CUDA, one can shuffle the algorithm around, e.g. by introducing pipelining like the Toshiba Cell guys did.

What are/were the most challenging things regarding CUDA:
. technical issues like insufficient performance and debugging support of CUDA?
. infant CUDA libraries full of errors?
. the overall x264 execution design being completely different from a design appropriate for CUDA? E.g. you had to go back to a single-threaded x264 execution model (= extremely slow) in order to serialize tasks and allow execution of ME on CUDA, and based on that had extreme difficulties in parallelizing tasks.

Now that you work for Avail Media: What does that mean for CUDA and x264?

LoRd_MuldeR
21st June 2008, 14:16
@7oby: Dark Shikari has already said that Avail Media is now going for FPGAs instead of CUDA ...

7oby
21st June 2008, 14:34
@7oby: Dark Shikari has already said that Avail Media is now going for FPGAs instead of CUDA ...

Thanks for that information. I'm not a regular reader of all boards here.

Yet another FPGA solution:
http://focus.ti.com.cn/cn/lit/wp/spry103/spry103.pdf

A month ago we saw that RapiHD is already pretty far down that road:
http://www.youtube.com/watch?v=8C_Pj1Ep4nw

Is there any source (thread, blog, svn branch alpha source code?) regarding current CUDA development for x264?

I do understand some of the difficulties of porting H.264 encoding to the GPU and I'm very interested in the current status or the experiences made so far.

MfA
22nd June 2008, 07:16
The only good use for the GPU would be a motion estimation pre-pass IMO. You get away from all the R/D optimization and tight coupling with the rest of the codec which makes GPU acceleration such a headache (if you want to do it at the same quality).

PS. which is not to say the GPU isn't useful for high speed low quality transcoding, I just don't think an accelerated x264 will do it ... it's less about acceleration and more about porting.

the_corona
24th June 2008, 11:41
Just stumbled upon this somewhat longer article about "Elemental's GPU Accelerated H.264 Encoder".

http://www.anandtech.com/video/showdoc.aspx?i=3339

Maybe it's of interest to some, although it seems x264 has decided against GPU (or do I interpret the responses incorrectly?). I couldn't really figure out what FPGAs are (does any consumer have them?)

cogman
24th June 2008, 15:22
Just stumbled upon this somewhat longer article about "Elemental's GPU Accelerated H.264 Encoder".

http://www.anandtech.com/video/showdoc.aspx?i=3339

Maybe it's of interest to some, although it seems x264 has decided against GPU (or do I interpret the responses incorrectly?). I couldn't really figure out what FPGAs are (does any consumer have them?)

It's not that they are against it, it's just that it would be hard to make something good that doesn't only work for people with NVIDIA GPUs. I think that OpenCL would be much more promising for x264.

An FPGA is basically reprogrammable hardware. They aren't extremely expensive, but I imagine most consumers don't have them (though they have access to them). They are great for training computer engineers, as you just have to make the schematic and then plug in the board.

Dark Shikari
24th June 2008, 15:53
An update for those who care; if the numbers we have now are correct, the FPGA currently being designed can do about 6 billion 16x16 SADs per second using a 64x64 exhaustive motion search on one reference frame.

No GPU can even come close to that order of magnitude :devil:
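
For a sense of scale, the quoted rate can be turned into frames per second with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes one 16x16 SAD per candidate position per macroblock, a single reference frame, and frame dimensions padded up to whole macroblocks. None of these details are stated for the actual FPGA design.

```python
SADS_PER_SECOND = 6e9      # figure quoted in the post
SEARCH_W = SEARCH_H = 64   # 64x64 exhaustive search window
MB = 16                    # macroblock size in pixels

def ceil_div(a, b):
    return -(-a // b)

def frames_per_second(width, height):
    """Frames per second sustainable if every macroblock is compared
    against every position in the search window."""
    macroblocks = ceil_div(width, MB) * ceil_div(height, MB)
    sads_per_frame = macroblocks * SEARCH_W * SEARCH_H
    return SADS_PER_SECOND / sads_per_frame

print(frames_per_second(1280, 720))   # 720p
print(frames_per_second(1920, 1080))  # 1080p
```

Under those assumptions 720p comes out at roughly 400 frames per second for the motion search stage alone; the rest of the encoder is not included.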

Inventive Software
24th June 2008, 15:54
Compare that with a conventional CPU. ;)

akupenguin
24th June 2008, 16:39
x264's ESA takes about 8 cycles per 16x16 SAD. On an 8-core 3GHz box, that's 3 billion SADs per second. Of course the FPGA is much cheaper than such a beefy CPU. Otoh, ESA isn't really what you want; it's just what's easy to implement on an FPGA.
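
The CPU figure follows directly from the numbers in the post. A short Python check, assuming the cores scale perfectly and taking the FPGA rate from the earlier post:

```python
CYCLES_PER_SAD = 8        # x264 ESA cost per 16x16 SAD, as quoted
CORES = 8
CLOCK_HZ = 3e9            # 3 GHz
FPGA_SADS_PER_SECOND = 6e9

cpu_sads_per_second = CORES * CLOCK_HZ / CYCLES_PER_SAD
fpga_advantage = FPGA_SADS_PER_SECOND / cpu_sads_per_second

print(cpu_sads_per_second, fpga_advantage)
```

So on these numbers the FPGA is about twice as fast as the 8-core box, not orders of magnitude faster.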

cogman
24th June 2008, 17:44
An update for those who care; if the numbers we have now are correct, the FPGA currently being designed can do about 6 billion 16x16 SADs per second using a 64x64 exhaustive motion search on one reference frame.

No GPU can even come close to that order of magnitude :devil:

What kind of bandwidth will that thing need? Could you put it on a USB stick, or would it need something like a PCI express 1x slot (or 16x)?

Dark Shikari
24th June 2008, 17:47
What kind of bandwidth will that thing need? Could you put it on a USB stick, or would it need something like a PCI express 1x slot (or 16x)?

PCI-Express. 1x is probably sufficient.

The price of a board would probably be on the order of $200-$400.

nekrosoft13
24th June 2008, 18:11
http://www.techarp.com/editorials/img/0823_PhysX_05.png

This is the performance we might expect from a well-written CUDA app.

I wonder how my GTX 280 does ;)

Dark Shikari
24th June 2008, 18:18
http://www.techarp.com/editorials/img/0823_PhysX_05.png

This is the performance we might expect from a well-written CUDA app.

I wonder how my GTX 280 does ;)

I can make great charts if I completely make up numbers too. ;)

All comparisons I have seen of this sort are utter bullshit, probably doing something on the order of comparing the graphics card to the JM encoder, because their speeds for CPU encoding are usually off by at least a factor of 16 if not more. It's quite easy to beat the competition if you lie about their encoding speed.

Gabriel_Bouvigne
24th June 2008, 18:19
Come on, we don't even know what the resulting video looks like...

cogman
24th June 2008, 18:26
I can make great charts if I completely make up numbers too. ;)

All comparisons I have seen of this sort are utter bullshit, probably doing something on the order of comparing the graphics card to the JM encoder, because their speeds for CPU encoding are usually off by at least a factor of 16 if not more. It's quite easy to beat the competition if you lie about their encoding speed.

It's not lying, it's marketing! But yeah, your point is completely valid. We have no idea what the results look like or what settings were used for the encodes. Heck, we don't even know if they were encoding to the H.264 standard or cutting corners.

Dark Shikari
24th June 2008, 18:34
Also, you notice in their graph that a 3Ghz quadcore is barely more than twice as fast as a 1.2Ghz dualcore--meaning the encoder they tested with was singlethreaded :rolleyes:

Here's a slightly more accurate graph using x264 numbers (assuming HD is defined as 720p, which appears to be what they're going for):

http://i31.tinypic.com/2nvs9dh.png
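
The single-threaded inference can be checked with simple ratios. A small Python sketch, assuming speed scales linearly with clock and, for a threaded encoder, with core count (ignoring any per-clock differences between the two CPUs in the chart):

```python
quad_clock_ghz, quad_cores = 3.0, 4
dual_clock_ghz, dual_cores = 1.2, 2

# If only one core is used, the speedup is just the clock ratio.
single_threaded_speedup = quad_clock_ghz / dual_clock_ghz

# If all cores are used, the core ratio multiplies on top of that.
multi_threaded_speedup = single_threaded_speedup * quad_cores / dual_cores

print(round(single_threaded_speedup, 2), round(multi_threaded_speedup, 2))
```

"Barely more than twice as fast" sits close to the clock-only ratio and far from the threaded one, which is what points to a single-threaded encoder.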

Zep
24th June 2008, 20:31
Also, you notice in their graph that a 3Ghz quadcore is barely more than twice as fast as a 1.2Ghz dualcore--meaning the encoder they tested with was singlethreaded :rolleyes:

Here's a slightly more accurate graph using x264 numbers (assuming HD is defined as 720p, which appears to be what they're going for):

http://i31.tinypic.com/2nvs9dh.png

I have a 3.4GHz quad core and I can't even get 110 FPS on the first pass on 720p. More realistic rates are 80 on first pass and 25 on second using

x264.exe --pass 1 --bitrate #### --stats "some.stats" --threads auto --keyint 240 --min-keyint 24 --bframes 3 --b-pyramid --me dia --subme 1 --partitions none --progress --no-psnr --no-ssim --output NUL "some.avs"

x264.exe --pass 2 --bitrate #### --stats "some.stats" --threads auto --keyint 240 --min-keyint 24 --ref 3 --bframes 3 --b-pyramid --bime --weightb --subme 6 --trellis 1 --8x8dct --progress --no-psnr --no-ssim --output "some.mkv" "some.avs"



with an avs doing a basic straight feed. i.e. no resize, no decimate etc... from a HDTV 720p source and all 4 cores maxed out on both passes.

Dark Shikari
24th June 2008, 20:36
I have a 3.4GHz quad core and I can't even get 110 FPS on the first pass on 720p.

Yes, because you're actually using half-decent settings ;)

If you skimp much more on your settings and drop to baseline profile, you can do better.

Actually, if you're willing to completely trash your settings (Baseline, dia, low merange, subme1, no dct decimate, no partitions, no scenecut, no deblocking, no AQ, constant quantizer), you can get about 56 FPS, singlethreaded, on a 3Ghz Core 2 on 64-bit Linux (and probably slightly higher on a Penryn). Assuming perfect scaling, which is expected with no B-frames or scenecut, you could reach about 240FPS or more on a quadcore.

Fun fact: Xvid only gets 54FPS on fastest settings on one core of that machine.

With four threads, that completely trashes the 9800GTX. We have no idea what quality the GPU encoder produces, of course; I'm guessing it's awful, but we won't know for sure until they stop posting bullcrap benchmarks and post actual streams.

MfA
24th June 2008, 20:59
An update for those who care; if the numbers we have now are correct, the FPGA currently being designed can do about 6 billion 16x16 SADs per second using a 64x64 exhaustive motion search on one reference frame.
Is that with SEA?

Dark Shikari
24th June 2008, 21:03
Is that with SEA?

No, SEA is not practical to implement on an FPGA as far as I know. It's just raw ESA.

MfA
24th June 2008, 21:28
Hmm, it has been said that on large FFTs the new AMD Firestream could get 170 GFLOPs throughput (which is outrageously fast when compared to CUFFT, nearly an order of magnitude faster than 8800s). If that's really true you could do a 128x128 fast full SSD search about as fast as with the FPGA full SAD search.

Of course initial ME is only half the battle (if that). RDO mode optimization and MV refinement are just as much performance killers ... and slightly harder to implement on FPGAs or GPUs.

Dark Shikari
24th June 2008, 21:35
Hmm, it has been said that on large FFTs the new AMD Firestream could get 170 GFLOPs throughput (which is outrageously fast when compared to CUFFT, nearly an order of magnitude faster than 8800s). If that's really true you could do a 128x128 fast full SSD search about as fast as with the FPGA full SAD search.

SSD is a worse motion search metric than SAD.

Of course initial ME is only half the battle (if that). RDO mode optimization and MV refinement are just as much performance killers ... and slightly harder to implement on FPGAs or GPUs.

"Slightly" is the understatement of the century.

akupenguin
24th June 2008, 22:06
Hmm, it has been said that on large FFTs the new AMD Firestream could get 170 GFLOPs throughput (which is outrageously fast when compared to CUFFT, nearly an order of magnitude faster than 8800s).
... and still slower than a plain old CPU. Remember where I said that a decent 8core does 3 billion SAD-equivalents per second? (SEA, so not all of those are real, but you're not planning to implement SEA on GPU either) 1 SAD is 768 arithmetic ops, so a brute force implementation would need 2.3 TFLOPS to match that CPU.
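
The 2.3 TFLOPS figure can be reproduced in Python. The breakdown of 768 ops into three operations per pixel (subtract, absolute value, accumulate) is an assumption about how the count was reached; the post only gives the total.

```python
PIXELS_PER_SAD = 16 * 16
OPS_PER_PIXEL = 3              # assumed: subtract, absolute value, accumulate
CPU_SADS_PER_SECOND = 3e9      # 8-core figure quoted above
FIRESTREAM_FLOPS = 170e9       # FFT throughput quoted for the Firestream

ops_per_sad = PIXELS_PER_SAD * OPS_PER_PIXEL
required_flops = CPU_SADS_PER_SECOND * ops_per_sad
shortfall = required_flops / FIRESTREAM_FLOPS

print(ops_per_sad, required_flops / 1e12, round(shortfall, 1))
```

On these numbers a brute-force GPU implementation would need more than ten times the quoted Firestream throughput to match the CPU's SAD-equivalent rate.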

MfA
24th June 2008, 22:16
Only most of the time :) It's fixed time complexity as opposed to SEA. My point was to compare it to the FPGA though, not the CPU.

slavickas
25th June 2008, 19:56
I can make great charts if I completely make up numbers too. ;)

All comparisons I have seen of this sort are utter bullshit, probably doing something on the order of comparing the graphics card to the JM encoder, because their speeds for CPU encoding are usually off by at least a factor of 16 if not more. It's quite easy to beat the competition if you lie about their encoding speed.

I think they compare against QuickTime (or is it SlowTime?); at least in the YouTube video from the GT200 presentation they talked about QuickTime.

Snowknight26
27th June 2008, 03:06
http://www.guru3d.com/news/download-ati-avivo-xcode-pack-for-hd4800-series/

Apparently it uses the GPU, even though I've been led to believe that Avivo does it with the CPU.

d0ORk
12th February 2010, 10:25
Any news on the Cuda Support for x264?

Blue_MiSfit
12th February 2010, 10:51
If there was, it would be all over the boards :)

So, no.

It doesn't matter though.

~MiSfit

LoRd_MuldeR
12th February 2010, 13:53
Any news on the Cuda Support for x264?

The facts remain the same: Despite all the marketing blabber, CUDA isn't the perfect platform to do video encoding.
All those "CUDA H.264 encoders" that are available on the market sacrifice a whole lot of quality in order to reach fast encoding speed.
If CUDA really was that great for video encoding, we would have seen at least one competitive product. But so far they all have been disappointing!
The upcoming "Fermi" generation has some nice improvements, but it isn't available yet. We'll see if it makes CUDA encoding more attractive...

aegisofrime
12th February 2010, 17:29
The facts remain the same: Despite all the marketing blabber, CUDA isn't the perfect platform to do video encoding.
All those "CUDA H.264 encoders" that are available on the market sacrifice a whole lot of quality in order to reach fast encoding speed.
If CUDA really was that great for video encoding, we would have seen at least one competitive product. But so far they all have been disappointing!
The upcoming "Fermi" generation has some nice improvements, but it isn't available yet. We'll see if it makes CUDA encoding more attractive...

Both Fermi and the Radeon 58xx series feature DX11 and OpenCL. Does Fermi have any advantages with regards to GPU encoding?

LoRd_MuldeR
12th February 2010, 17:37
Both Fermi and the Radeon 58xx series feature DX11 and OpenCL. Does Fermi have any advantages with regards to GPU encoding?

The interfaces (APIs) they support aren't that important.

OpenCL basically is CUDA, or at least heavily inspired by CUDA. They just renamed a few things and changed the API calls a bit here and there ;)

While I have no idea about the "ComputeShaders" of DX11, I assume they aren't much different. That's because the capabilities and limitations of the underlying GPU hardware are the same.

Fermi has some huge advantages. Memory accesses to the "global" GPU memory are now cached - for reading and writing. Before only the read-only "texture" memory was cached.

However one fundamental limitation of GPGPU processing remains: The GPU can only access the GPU memory. So all input data that is processed on the GPU needs to go through the slow PCIe bus first.

Also results must go the same way back. Hence moving only a small function to the GPU is useless, even if it is 100x faster there. The delay for moving the data would simply be too long!

That's exactly the reason why you cannot take x264 as-is, move a few functions to the GPU and expect speed-up. Instead complete algorithms would have to be re-implemented on the GPU.

In some cases you even need to invent completely new algorithms, because your existing algorithms simply don't scale well on the GPU ...
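
The transfer argument above can be made concrete with a toy cost model. Every number in it (bus throughput, latency, CPU time, data size) is an illustrative assumption, not a measurement of any real system.

```python
# Toy cost model for offloading one small function to a GPU.
# All numbers are illustrative assumptions, not measurements.

def offload_time(cpu_seconds, gpu_speedup, bytes_moved, bus_bytes_per_second, latency_seconds):
    """Time for the offloaded version: kernel time plus the round trip over the bus."""
    kernel = cpu_seconds / gpu_speedup
    transfer = bytes_moved / bus_bytes_per_second + latency_seconds
    return kernel + transfer

# Hypothetical small function: 50 microseconds on the CPU, 100x faster on the GPU,
# but it needs one 1080p luma plane uploaded and a small result downloaded.
cpu_seconds = 50e-6
bytes_moved = 1920 * 1080 + 4096
bus = 4e9          # assumed effective PCIe throughput in bytes/s
latency = 20e-6    # assumed fixed cost per round trip

gpu_total = offload_time(cpu_seconds, 100, bytes_moved, bus, latency)
print(f"CPU only: {cpu_seconds * 1e6:.1f} us")
print(f"GPU including transfer: {gpu_total * 1e6:.1f} us")
```

With these made-up figures the kernel itself is negligible and the copy dominates, so the "100x faster" function ends up slower overall. The balance only shifts when enough work stays on the GPU per byte transferred.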

hajj_3
12th February 2010, 20:16
It would be cool if the x264 team could manage to have some of x264's functions run on a Fermi GPU. I'm sure a lot of us would appreciate it, especially as the low-end NVIDIA cards come out at around £40 at launch and drop to about £30 after a few months.

LoRd_MuldeR
12th February 2010, 20:41
It would be cool if the x264 team could manage to have some of x264's functions run on a Fermi GPU. I'm sure a lot of us would appreciate it, especially as the low-end NVIDIA cards come out at around £40 at launch and drop to about £30 after a few months.

As explained in the previous post, you cannot simply pick individual functions and move them to the GPU :rolleyes:

Furthermore it's not guaranteed at all that the GPGPU improvements of the 'Fermi' generation will actually be relevant for x264.

Last but not least, you would have to expect significantly lower encoding performance from those cheap "low end" graphics cards!

That's different from decoding, where a dedicated decoder chip (which is identical on all cards) does the job...

CruNcher
12th February 2010, 21:48
You shouldn't expect such work from the x264 team, but I'm sure we're going to see some Fermi-enhanced stuff from NVIDIA-sponsored Elemental Technologies :) They practically also worked on those encoding enhancements with NVIDIA.

Dark Shikari
12th February 2010, 22:14
For reference, there is a company working on a custom proprietary codec (sponsored by ATI) designed specifically for GPUs, i.e. making compression sacrifices in the specification to make it more amenable to GPU parallelization. They can currently get about 120fps @ 4K resolution on a top-end ATI card, which is about 4-6 times faster than x264 on ultrafast on a top-end Core i7. This is a pretty reasonable performance boost to expect from such an "ideal situation" in which the spec itself can be modified to suit the GPU.

hajj_3
12th February 2010, 22:44
Will the spec be open sourced so that you can see its code and copy the good bits into the x264 code?

LoRd_MuldeR
12th February 2010, 23:07
Will the spec be open sourced so that you can see its code and copy the good bits into the x264 code?

1) I highly doubt they will release the source code of their encoder under an OpenSource license. They probably prefer selling a commercial encoder software ;)

2) However they may make the specs for their new video format available (but not necessarily for free), so others can implement their own encoders or decoders for that format.

3) Even if they did make their code/specs public, it would be completely irrelevant for x264, because x264 is an H.264 encoder, not an encoder for ATI's video format!

popper
13th February 2010, 01:08
OC as Bridgeman the AMD executive in charge of linux code and docs reminds us "IIRC the Evergreen family (HD54xx and up) includes a few Sum of Absolute Differences shader instruction variants so one obvious task would be using those instructions to speed up motion estimation... details in the Evergreen shader instruction doc on our Stream site."

And yet no one capable seems interested enough to make any proof-of-concept code available to test these GPU SAD speeds and see whether that could be a usable option for offloading part of the work.
http://developer.amd.com/gpu/ATIStreamSDK/assets/AMD_Evergreen-Family_ISA_Instructions_and_Microcode.pdf

http://developer.amd.com/gpu/ATIStreamSDK/pages/Documentation.aspx
http://forums.amd.com/devforum/messageview.cfm?catid=203&threadid=124677&enterthread=y&startpage=1
"There is also a detailed document describing the shader instruction set. Look for the "AMD Evergreen Family ISA Microcode and Instructions" document at:

http://developer.amd.com/gpu/ATIStreamSDK/pages/Documentation.aspx"

aegisofrime
13th February 2010, 10:36
For what it's worth, nVidia bundles a CUDA encoding plugin for Adobe Premiere and After Effects with their Quadro GPUs. You would think that logically, since they are targeting this bundle at professionals, it should, at least, *not suck so hard*.

Incidentally it's made by Elemental, the same people behind Badaboom, so it remains to be seen if it's any better than Badaboom.

nVidia ad for the plugin:

http://www.youtube.com/watch?v=BZkK9HoxUvo

Unfortunately the Internet is chock full of news reports about this piece of software, but not much in the way of reviews...

edison
14th February 2010, 10:14
For what it's worth, nVidia bundles a CUDA encoding plugin for Adobe Premiere and After Effects with their Quadro GPUs. You would think that logically, since they are targeting this bundle at professionals, it should, at least, *not suck so hard*.

Incidentally it's made by Elemental, the same people behind Badaboom, so it remains to be seen if it's any better than Badaboom.

nVidia ad for the plugin:

http://www.youtube.com/watch?v=BZkK9HoxUvo

Unfortunately the Internet is chock full of news reports about this piece of software, but not much in the way of reviews...


here is a Chinese review on it:

http://www.pcinlife.com/article/graphics/2009-08-05/1249460047d833_1.html

julius666
14th February 2010, 10:53
The GPU can only access the GPU memory. So all input data that is processed on the GPU needs to go through the slow PCIe bus first.

Also results must go the same way back. Hence moving only a small function to the GPU is useless, even if it is 100x faster there. The delay for moving the data would simply be too long!

That's exactly the reason why you cannot take x264 as-is, move a few functions to the GPU and expect speed-up. Instead complete algorithms would have to be re-implemented on the GPU.

And what about Intel's new Clarkdale architecture with the integrated GPU? The PCIe bus can't be the bottleneck in that case. And in the future (almost) all CPU will carry a GPU, so it's probably worth the effort.

LoRd_MuldeR
14th February 2010, 12:20
And what about Intel's new Clarkdale architecture with the integrated GPU? The PCIe bus can't be the bottleneck in that case. And in the future (almost) all CPU will carry a GPU, so it's probably worth the effort.

Those "on board" GPU's avoid the PCIe bottleneck, indeed. But those are "low end" GPU's. You can't expect any noteworthy encoding performance from them.

Even the cheapest PCIe graphics card will outperform those "on board" chips easily! And Intel's "on board" chips are even much weaker than NVidia's "on board" chips.

Furthermore I'm not aware of any efforts to support OpenCL or DirectX ComputeShaders by Intel...

Again decoding performance is a different topic, because most "on board" chips have dedicated hardware for BluRay (H.264/VC1) decoding now.

aegisofrime
14th February 2010, 12:24
And what about Intel's new Clarkdale architecture with the integrated GPU? The PCIe bus can't be the bottleneck in that case. And in the future (almost) all CPU will carry a GPU, so it's probably worth the effort.

Those GPUs have only replaced the previously rubbish Intel IGPs with a slightly less rubbish Intel HD Graphics. In fact, why Intel would move the IGP off the motherboard onto the CPU package is beyond me, since it doesn't have GPGPU motivations, unlike AMD Fusion (I think)

Deinorius
14th February 2010, 13:57
In fact, why Intel would move the IGP off the motherboard onto the CPU package is beyond me

That's easy. It makes production cheaper, power consumption goes down as you could already see with Lynnfield. Maybe even graphics performance gets a boost like general performance itself because of the memory-controller/iGPU directly on the package (lower latencies).

And, of course Intel can dominate the chipset market for their own cpus but that's more a nice side effect.

aegisofrime
14th February 2010, 14:13
That's easy. It makes production cheaper, power consumption goes down as you could already see with Lynnfield. Maybe even graphics performance gets a boost like general performance itself because of the memory-controller/iGPU directly on the package (lower latencies).

And, of course Intel can dominate the chipset market for their own cpus but that's more a nice side effect.

Thanks for the explanation. That actually makes a lot of sense, from the standpoint of a budget user actually :D

I just hope that Fusion is something different. AMD bought ATI precisely for Fusion, and if it's just another Clarkdale...

Deinorius
14th February 2010, 14:23
Fusion is much the same as Clarkdale, just more developed. In combination with something like Optimus you can use an NVIDIA card for anything you need, and when you don't need it, you get the low power consumption that Clarkdale delivers.

MfA
15th February 2010, 01:12
and yet no one capable seems that interested in making any proof of concept code available to test this GPU SAD speeds to see if that could be a useable option for any future part offload options to date.
The version of the SDK with the SAD instruction actually exposed is only a couple days old ... also there are not a lot of people who are comfortable with CAL IL programming.

Just for reference this is the instruction (for 4x4 SAD):

Instruction: SAD4

Syntax: sad4, src0, src1, src2

Description: sad8(src0, src1) forms the sum of absolute differences, treating src0 and src1 as a vector of eight-bit unsigned integers. This is a special instruction for multimedia video.
dst.xyzw = sad8(src0.x, src1.x) + sad8(src0.y, src1.y) + sad8(src0.z, src1.z) + sad8(src0.w, src1.w) + r2.x
The 32-bit result is replicated to all four vector output slots. Valid for Evergreen GPUs and later.
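
To make the quoted semantics easier to follow, here is a small software model of the instruction as described. It is an illustration only, not AMD code; the helper names `pack`, `sad8` and `sad4` and the wrap-around to 32 bits are assumptions for the sketch.

```python
# Software model of the SAD4 semantics quoted above (illustration only).
# Each 32-bit component is treated as four unsigned 8-bit pixels; the byte
# order inside the word does not matter as long as both operands match.

def sad8(a, b):
    """Sum of absolute differences over the four bytes of two 32-bit words."""
    return sum(abs(((a >> s) & 0xFF) - ((b >> s) & 0xFF)) for s in (0, 8, 16, 24))

def sad4(src0, src1, acc=0):
    """src0/src1 are 4-tuples (x, y, z, w) of 32-bit words, i.e. 16 pixels or
    one 4x4 block. acc models the accumulator term (r2.x in the quoted text).
    Wrapping to 32 bits is an assumption of this sketch."""
    return (sum(sad8(a, b) for a, b in zip(src0, src1)) + acc) & 0xFFFFFFFF

def pack(p0, p1, p2, p3):
    """Pack four 8-bit pixels into one 32-bit word."""
    return p0 | (p1 << 8) | (p2 << 16) | (p3 << 24)

# Four identical rows per block: per-row difference is 2 + 2 + 0 + 5 = 9.
cur = tuple(pack(10, 20, 30, 40) for _ in range(4))
ref = tuple(pack(12, 18, 30, 45) for _ in range(4))
print(sad4(cur, ref))
print(sad4(cur, ref, acc=100))
```

The accumulator input means larger block SADs can be built by chaining several such 4x4 operations.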

Just to put a number to the madness: purely looking at the SAD instruction, that means a 5870 can hit a peak of 320*850 MHz = 272 GigaSAD/s (4x4). For comparison, an i7 at 3 GHz peaks at 4*2*3 GHz = 24 GigaSAD/s (4x4 using mpsadbw).
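
The peak figures can be reproduced as follows. The reading of the factors is an assumption: 320 is taken as the number of SAD4 operations the 5870 can issue per clock, and 4*2 as four CPU cores each producing two 4x4 SADs per clock.

```python
# Reproducing the peak figures from the post. The meaning attached to each
# factor is an assumption, the post only gives the products.
gpu_sads_per_clock = 320      # assumed: SAD4 issues per clock on the 5870
gpu_clock_ghz = 0.850
cpu_cores = 4                 # assumed: quad-core i7
cpu_sads_per_core_clock = 2   # assumed: 4x4 SADs per core per clock via mpsadbw
cpu_clock_ghz = 3.0

gpu_peak = gpu_sads_per_clock * gpu_clock_ghz                  # GigaSAD/s
cpu_peak = cpu_cores * cpu_sads_per_core_clock * cpu_clock_ghz # GigaSAD/s

print("GPU peak GigaSAD/s:", gpu_peak)
print("CPU peak GigaSAD/s:", cpu_peak)
print("ratio:", gpu_peak / cpu_peak)
```

These are instruction-issue peaks only; they ignore memory traffic and everything else an encoder has to do.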

jakor
16th February 2010, 05:37
For reference, there is a company working on a custom proprietary codec (sponsored by ATI) designed specifically for GPUs, i.e. making compression sacrifices in the specification to make it more amenable to GPU parallelization. They can currently get about 120fps @ 4K resolution on a top-end ATI card, which is about 4-6 times faster than x264 on ultrafast on a top-end Core i7. This is a pretty reasonable performance boost to expect from such an "ideal situation" in which the spec itself can be modified to suit the GPU.

How do they deliver this amount of raw data to the processing unit ?
it is 4,000 * 2,000 * 1.5 (case of YV12) * 120 = 1,440,000,000 bytes per sec.
also for x264 case 5 times lower it is close to SATA limit of 3 Gbps...
Are these calculations correct ?
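
The calculation can be checked with a few lines, using the post's round numbers (4000x2000 pixels, 1.5 bytes per pixel for YV12, 120 fps):

```python
# Raw bandwidth of uncompressed 4K YV12 at 120 fps, using the post's round numbers.
width, height = 4000, 2000
bytes_per_pixel = 1.5   # YV12: full-resolution luma plus two quarter-resolution chroma planes
fps = 120

bytes_per_second = width * height * bytes_per_pixel * fps
gbit_per_second = bytes_per_second * 8 / 1e9
x264_case_gbit = gbit_per_second / 5    # the "5 times lower" x264 case

print("bytes per second:", int(bytes_per_second))
print("Gbit/s:", gbit_per_second)
print("Gbit/s at one fifth of the rate:", x264_case_gbit)
```

The byte count matches the post, and one fifth of it is indeed in the neighbourhood of a 3 Gbps SATA link.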

mariush
16th February 2010, 05:44
Well, I'd imagine they probably have a custom ATI card with some HDMI input like the Blackmagic cards... or they'd just get 2 gigabit network cards and team them up so they'd have 2gbps input from network.... with 8-16 GB of DDR3 memory you'd have enough to cache...

Now I don't know the throughput, uploading to the video card may be slower but as far as I know the PCI Express slots have tons of bandwidth...of course, if they have hdmi input port straight on the cards it's moot point.

Just guesses, I'm no expert...

jakor
16th February 2010, 06:28
Well, I'd imagine they probably have a custom ATI card with some HDMI input like the Blackmagic cards... or they'd just get 2 gigabit network cards and team them up so they'd have 2gbps input from network.... with 8-16 GB of DDR3 memory you'd have enough to cache...

Now I don't know the throughput, uploading to the video card may be slower but as far as I know the PCI Express slots have tons of bandwidth...of course, if they have hdmi input port straight on the cards it's moot point.

Just guesses, I'm no expert...

1,440,000,000 bytes per sec is about 11.5 Gbps (over 14 Gbps with 8b/10b line coding). HDMI upper limit is just 10.2 (judging from wiki). Anyway - what's on the other end of the HDMI or network cable ? which device is capable of producing this kind of data ?
or maybe those guys just generate some textures onboard...

Disabled
16th February 2010, 10:54
Or they only reencode bitstreamed videos. Ie upload an h264 to the card and get a reencoded file back.

Dark Shikari
16th February 2010, 11:14
Or they only reencode bitstreamed videos. Ie upload an h264 to the card and get a reencoded file back.

Not possible, the card can't decode 4K at 120fps.

ExSport
17th February 2010, 01:05
Maybe a noob question, but is it possible to use CUDA for the decoding part so x264/mencoder/ffmpeg can save some cycles on decoding and use them for the encoding part?
Are there some theoretical/practical limitations?
Would this give some performance gain, or would it be unnoticeable?
Many thanks!
P.S.
I did some testing with MEncoder from Sherpya and CoreAvc and for some movies speedup was about 25%, for other files no difference or slower...
But compression was done to MPEG2, not x264.
Original file was h264 from Blu-Ray.

LoRd_MuldeR
17th February 2010, 01:22
Maybe a noob question, but is it possible to use CUDA for the decoding part so x264/mencoder/ffmpeg can save some cycles on decoding and use them for the encoding part?

You don't need CUDA to decode H.264 in hardware, because all up-to-date graphics cards contain dedicated decoding hardware for H.264, VC-1 and MPEG-2. And there are many solutions available to use your graphics card's hardware decoder.

DXVA is playback only, so it's not an option for encoding tasks. However DirectShowSource+CoreAVC or DGAVCIndexNV can be used to decode the source in hardware and feed it into x264.

But don't get confused: "CUDA decoding" in CoreAVC/DGAVCIndexNV does NOT mean they implemented an H.264 decoder in CUDA. They simply use the "CUDA Video API" to access the graphics card's VP2 decoder chip. While "real" CUDA kernels run on the actual GPU, the hardware H.264/VC-1/MPEG-2 decoder is separate/specialized hardware that doesn't do anything else but decoding video...

ExSport
17th February 2010, 22:58
Maybe you don't understand me:)
My question is whether the decoding part can be done via CUDA or DXVA, with the decoded frames then fed to the mpeg2/x264 encoders, whether that is possible and whether it would be faster.
I suppose the encoder needs the frame decoded before it can encode it, so why not use CUDA or DXVA for that instead of the slower "decoder" built into the mpeg2/x264 encoders?
Is this technique possible?
As I already said, I tested mencoder with the CoreAVC driver loaded, and some encodes were faster, some slower.
I suppose that is because when MEncoder used its internal decoder the CPU was maxed out, but with the CoreAVC+MEncoder combination CPU usage was between 15-70%.
So in the best case encoding was 25% faster with CoreAVC (CUDA enabled) for a full movie (CPU usage about 70% on average), but sometimes the speed was the same or slower when CPU usage was lower. I don't know why the differences are so big (why the CPU is not maxed out when CUDA+MEncoder is used but is at 100% with the internal decoder)...
Many thanks for answer

Snowknight26
17th February 2010, 23:13
CPU usage doesn't imply speed, so estimates are invalid.

RunningSkittle
17th February 2010, 23:16
...decoding part can be done via CUDA or DXVA and then decoded frames [fed] to mpeg2/x264 encoders so if it will be faster and if it is possible....

Yes, it's possible and can already be done via Avisynth or mplayer (with patches); however, AFAIK there is no standard way to accomplish this across different platforms. Fortunately x264 accepts input from Avisynth on Windows and yuv4mpeg for piping from mplayer :)
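
For reference, the yuv4mpeg (YUV4MPEG2) stream carried over such a pipe is a very simple format: one text header line, then a "FRAME" marker before each raw frame. Below is a minimal writer sketch. The function name `write_y4m` is made up for the example, and it assumes 4:2:0 frames with even dimensions, progressive scan and square pixels.

```python
# Minimal YUV4MPEG2 (y4m) writer, illustrating the stream format only.
# Assumptions: 4:2:0 chroma, even width/height, progressive, square pixels.
import io

def write_y4m(out, width, height, fps_num, fps_den, frames):
    """frames: iterable of bytes objects, each one 4:2:0 frame (Y plane, then U, then V)."""
    frame_size = width * height * 3 // 2
    header = f"YUV4MPEG2 W{width} H{height} F{fps_num}:{fps_den} Ip A1:1 C420jpeg\n"
    out.write(header.encode("ascii"))
    for frame in frames:
        if len(frame) != frame_size:
            raise ValueError("frame has the wrong size for 4:2:0")
        out.write(b"FRAME\n")
        out.write(frame)

# Two mid-grey 16x16 frames written to an in-memory buffer instead of a real pipe.
buf = io.BytesIO()
grey = bytes([128]) * (16 * 16 * 3 // 2)
write_y4m(buf, 16, 16, 25, 1, [grey, grey])
print(buf.getvalue()[:41])
```

In the setups described here the decoder side (mplayer or Avisynth) produces the frames and x264 consumes them; the exact command lines are not given in the thread, so none are shown.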

LoRd_MuldeR
17th February 2010, 23:19
Maybe you don't understand me:)

I think I did. But it seems you didn't understand the answer ;)

My question is whether the decoding part can be done via CUDA or DXVA, with the decoded frames then fed to the mpeg2/x264 encoders, whether that is possible and whether it would be faster.

As I already said, decoding a H.264 source in "hardware" and then sending it to the (software) encoder is possible indeed. And there already are solutions for that!

But DXVA cannot do it. DXVA is playback only. DXVA decoders are coupled to a DXVA-enabled renderer. Once the encoded bit-stream is sent to the renderer by the DXVA decoder, the frames will be decoded in hardware and then outputted directly to the screen. The software cannot get the decoded frames back. That's why DXVA isn't suitable for re-encoding tasks.

Furthermore there is no need to implement an H.264 decoder in CUDA, because there already is dedicated H.264 decoder hardware on your graphics card. CUDA allows you to run general purpose computations on the GPU. But implementing an H.264 decoder on the GPU (via CUDA) will never be as fast/efficient as using a dedicated decoder chip. And that chip is present on any halfway up-to-date graphics card!

There are several ways to access the H.264 decoder hardware on your graphics card in a way that allows feeding the decoded frames into x264 (or a similar encoder). These include at least DirectShowSource+CoreAVC and DGAVCIndexNV. To make this clear again: Neither CoreAVC nor DGAVCIndexNV implements an H.264 decoder in CUDA. Instead they use the "CUDA Video API" (CUVID) to access the dedicated H.264 decoder chip on the graphics card. In contrast to DXVA, CUVID has the advantage that you can get the decoded frames back and process them in software, such as a software encoder...

CPU usage doesn't imply speed

Very true.

jakor
18th February 2010, 00:16
But DXVA cannot do it. DXVA is playback only. DXVA decoders are coupled to a DXVA-enabled renderer. Once the encoded bit-stream is sent to the renderer by the DXVA decoder, the frames will be decoded in hardware and then outputted directly to the screen. The software cannot get the decoded frames back. That's why DXVA isn't suitable for re-encoding tasks.

actually - this is not correct. By implementing a custom renderer it is possible to retrieve raw data from decoded frames from DXVA. True, that with CUDA interface it is a little bit easier to implement, but DXVA architecture looks to me like more robust ;-) and not dependent on NVIDIA.

ExSport
18th February 2010, 00:36
Thanks for answers:)
I know that CPU usage doesn't imply speed, but maybe that's because the CPU wasn't always fully used with CoreAVC+MEncoder. In those cases MEncoder alone at 100% CPU usage was more efficient, but when the combination used about 70% CPU on average, encoding was about 25% faster.
So lower CPU usage and faster encoding:-) But unfortunately for some movies the usage was lower, so MEncoder alone was more efficient.
About AviSynth: I know I can use it, but I tried not to, to get rid of the codecs mess. MEncoder alone (with CoreAVC loaded) has the advantage that no installed codec is needed, no configuration etc.
Now I understand that DXVA is useless for me, because I need to do realtime encoding of h264 to MPEG2/x264, and now I know it is not possible with DXVA but it is with CUDA.:rolleyes:
So the only solution for me is Sherpya's MEncoder+CoreAVC with CUDA enabled. The bad thing is that the results are not stable: sometimes the speedup is 25%, sometimes it is slower. I thought that when decoding of a 1080p file is done in HW, there would be more cycles left for the encoding process in MEncoder. That is true, but only partly, not for all files :mad:
My concern is PS3MediaServer. It is a DLNA server, and on a slower PC or when several HD files have to be transcoded (MPEG2), every cycle for the encoding part is a win. Whether realtime encoding runs at 21fps or 26fps decides whether realtime streaming to any renderer is possible, not only to a PS3 with a DLNA server:cool:
So again many thanks for the useful info.
Now I know that implementing the experimental ffdshow with DXVA is not the right way for the multithreaded MEncoder which I use.

kypec
18th February 2010, 09:08
About AviSynth: I know I can use it, but I tried not to, to get rid of the codecs mess. MEncoder alone (with CoreAVC loaded) has the advantage that no installed codec is needed, no configuration etc.
Avisynth has nothing to do with installation of codec-packs. There's absolutely no need to install anything but DGAVCDecodeNV (very low-cost payware) if you want to use GPU accelerated video decoding in your avisynth scripts (provided you have nVidia card with integrated VP2+ chip of course).

ExSport
18th February 2010, 15:49
To my knowledge, when you want to use AviSynth "encoding" with MEncoder, you need codecs installed in the system, because the codecs are used for the decoding part and the result is then fed to MEncoder.
But without AviSynth, with MEncoder alone, I can use the integrated decoders with no "background" influence or mess in the codec configuration. I suppose that when AC3filter is configured for STEREO only, you can't then encode the original 5.1 audio. Also, missing codecs will terminate the encoding process with AviSynth, but not with MEncoder alone and all its integrated decoders/encoders.
Also DGAVCDecodeNV is not multiplatform, so it is not very usable with the Java-based PS3MediaServer, which works on OSX, WIN, Linux etc.
Anyway many :thanks:

LoRd_MuldeR
18th February 2010, 20:12
actually - this is not correct. By implementing a custom renderer it is possible to retrieve raw data from decoded frames from DXVA

Well, that sounds interesting. But so far I have not seen any project that implemented a custom renderer for DXVA to get the decoded frames back into main memory.

So is this just some hypothetical idea or has this actually been proven to work? Any project names?

True, that with CUDA interface it is a little bit easier to implement, but DXVA architecture looks to me like more robust ;-) and not dependent on NVIDIA.

DXVA is extremely picky with what streams it accepts (regarding levels/profiles), CUVID is not! CoreAVC with "CUDA Decoding" even handles 1080p encoded at 50 MBit/s with all x264 settings maxed out.

So I'd say CUVID is much more "robust" than DXVA. But yes, CUVID has the major drawback that it's a proprietary interface available only on NVidia hardware...

jakor
19th February 2010, 00:33
I've made a small project - decoded video stream on DXVA hardware and got frames back and dumped them to HDD.

DXVA is extremely picky with what streams it accepts (regarding levels/profiles), CUVID is not! CoreAVC with "CUDA Decoding" even handles 1080p encoded at 50 MBit/s with all x264 settings maxed out.

So I'd say CUVID is much more "robust" than DXVA.

Having said beforehand that it is still the same dedicated VP2 chip, how does that make sense?
1.5 years ago I implemented both DXVA and CUDA H.264 decoders. Yes, programming with the CUDA interface was much more fun and more controllable, while with DXVA it was a pain in the ass to figure out which flags mean what in its input structure. But with both running, DXVA ran more smoothly, while the CUDA stuff had crashes here and there. Maybe it was raw drivers (it was early adoption) and now it works as solidly as DXVA, but a proprietary thing is not good and cannot be relied upon: one day they release a driver version without CUDA support, and what do we do? Disabling DXVA support would be a tougher thing to do (from a marketing and licensing point of view).
However, ATI liked to exclude a lot of DXVA-supported streams with newer Catalysts, so it cannot be trusted either ;-)