H.264 encoding on Cell


bkman
22nd March 2006, 09:04
Considering the news that the first mass-market device featuring the Cell processor (the Sony PlayStation 3) has been announced to come with a 60GB hard drive preinstalled with Linux, and that SDKs with Cell simulators have been out for a while now, I'm wondering whether any developers have considered starting an H.264 encoder specifically for the PS3 platform?

Assuming that the algorithms involved in H.264 encoding can be split across 7 very potent streaming vector processors, we could possibly have a much better-performing encoder than on any mainstream PC. HD encoding may well be possible in realtime!

Has anyone put any thought into this?

Deinorius
22nd March 2006, 09:48
I don't know much about programming, but I do know that you have to write the source specifically for the Cell chip and optimize it for that.

On the other hand, and this is the bigger problem: there are no PS3s on sale. So how would you test it? And there won't be that many people who own a PS3 at the start. No way. Wait for the release and see what performance it gets.

bkman
22nd March 2006, 09:58
On the other hand, and this is the bigger problem: there are no PS3s on sale. So how would you test it?

As I said, IBM SDKs with Cell BE simulators have been out for a while. Have a look. (http://www-128.ibm.com/developerworks/power/library/pa-cellstartsim/)

Deinorius
22nd March 2006, 10:11
Well, I am sceptical about this, but let's wait for comments from the pros.
In the end, you need someone who would actually do this. Don't forget that such an x264 build has to use 7 co-CPUs. The PS3 Cell CPU will use 7 of 8. The 8th will be deactivated so that more chips are usable (think of L2 cache on CPUs or pixel shaders on GPUs).

el divx
22nd March 2006, 16:03
Well, I am sceptical about this, but let's wait for comments from the pros.
In the end, you need someone who would actually do this. Don't forget that such an x264 build has to use 7 co-CPUs. The PS3 Cell CPU will use 7 of 8. The 8th will be deactivated so that more chips are usable (think of L2 cache on CPUs or pixel shaders on GPUs).
The 8th core isn't used because it usually coordinates the other 7, handles time scheduling and resources, and generally acts more like the "head" of a "team".

Hellworm
22nd March 2006, 16:43
you have to write the source specifically for the Cell chip and optimize it for that.

Really? I think if Linux runs on the Cell, there is or will be some C compiler.
So the only thing that needs to be done is to compile x264 for Cell. It's already multithreaded, so it can make use of the 7/8 cores.

bratao
22nd March 2006, 20:16
But Cell is different from today's multiprocessors.
Each core does a specific function.
You can't split the work into 7 threads like on today's multiprocessors.

emmel
22nd March 2006, 20:44
It is asymmetric, true, but as there will be tools like MPI, it should not be totally strange to program in parallel.

Anyways, encoding existing media files into H.264 in parallel is close to trivial, as you can always split the file into N parts and apply N copies of the scalar code, one to each part. Encoding/decoding live in parallel is another story.
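
As a rough illustration of the split-into-N-parts idea above, here is a minimal sketch that only computes the frame ranges each encoder copy would receive. The function name `split_frames` is made up for this example, and it assumes the file can be cut at arbitrary frame numbers; a real setup would presumably want each part to start at a keyframe.

```python
def split_frames(total_frames, parts):
    """Return a list of (start, end) frame ranges, end exclusive.

    The ranges are contiguous, cover every frame exactly once, and
    differ in length by at most one frame.
    """
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(total_frames, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # The first `extra` parts each take one leftover frame.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges
```

Each range could then be handed to an independent encoder instance, with the resulting streams joined afterwards.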

sysKin
23rd March 2006, 13:40
So the only thing that needs to be done is to compile x264 for Cell. It's already multithreaded, so it can make use of the 7/8 cores.
Oh no this is not going to work ;)
First, the execution cores on Cell can't really execute separate threads. They all have to work on one thread. What you'd have to do is "emulate" multithreaded x264 on the execution cores... which I suppose is possible, but definitely hard. A single execution core has 256KB of internal memory and you need to manage this memory yourself: you have to hold the macroblock you are encoding (this is only 384 bytes), but also the parts of all reference frames within range, which needs motion-range-squared times (3/2) times the number of reference frames. Times four if you want to keep halfpixel planes (probably better not). And then you have to handle the extra evil data required for intra prediction. All of that needs to fit in 256KB at any time.

In short, you need a completely new ME which is written only for Cell.
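
To put numbers on the memory budget described above, here is a small sketch of the arithmetic. It follows the formula as stated in the post (range squared, times 3/2 for 4:2:0 chroma, times the number of references, times four with halfpixel planes). The names and the `reserved` parameter (standing in for code, stack and intra-prediction data) are assumptions for illustration, and `window_side` treats the "motion range" as the side of a square search window in pixels.

```python
LOCAL_STORE_BYTES = 256 * 1024          # per-core local memory, from the post
MACROBLOCK_BYTES = 16 * 16 + 2 * 8 * 8  # 16x16 luma + two 8x8 chroma = 384


def reference_window_bytes(window_side, num_refs, keep_hpel=False):
    """Bytes needed for the reference data, using the post's formula."""
    per_ref = window_side * window_side * 3 // 2  # luma plus 4:2:0 chroma
    total = per_ref * num_refs
    if keep_hpel:
        total *= 4  # "times four" for halfpixel planes
    return total


def fits_in_local_store(window_side, num_refs, keep_hpel=False, reserved=0):
    """True if macroblock + reference data + reserved bytes fit in 256KB."""
    needed = (MACROBLOCK_BYTES
              + reference_window_bytes(window_side, num_refs, keep_hpel)
              + reserved)
    return needed <= LOCAL_STORE_BYTES
```

With a 128-pixel window and 4 references this comes to 98304 bytes for the reference data, which fits; keeping halfpixel planes raises it to 393216 bytes, which does not.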

The Core execution units aren't very fast at integer stuff either - in fact they are plain slow, compared to our current CPUs. It's just that there are eight of them.

[edit] But actually, I bet it would be better not to emulate multiple threads, but to pipeline the encoding process from one unit to another, with some units working in parallel. One does SAD-based ME, another does refinement, then two more try RDO (one for inter, one for intra. Or more for inter), then finally the main PowerPC thingy encodes the bitstream. Either way, you need to rewrite most of the encoder to do that.
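
The pipelined layout suggested above can be sketched with ordinary threads and queues. This is only an illustration of the structure, not Cell code: `run_pipeline` is a made-up helper, each stage is an arbitrary Python callable standing in for a step such as coarse ME or refinement, and the queues stand in for whatever transfer mechanism would move data between units.

```python
import queue
import threading


def run_pipeline(items, stages):
    """Pass each item through `stages` in order, one worker thread per stage.

    Results come back in input order because every stage is a single
    worker reading from a FIFO queue.
    """
    sentinel = object()
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(stage, inbox, outbox):
        while True:
            item = inbox.get()
            if item is sentinel:
                outbox.put(sentinel)  # tell the next stage to stop too
                return
            outbox.put(stage(item))

    threads = [
        threading.Thread(target=worker, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(sentinel)

    results = []
    while True:
        item = queues[-1].get()
        if item is sentinel:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

While one stage is working on macroblock N, the stage before it can already be working on macroblock N+1, which is the point of the pipeline.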

bkman
23rd March 2006, 13:54
The Core execution units aren't very fast at integer stuff either - in fact they are plain slow, compared to our current CPUs. It's just that there are eight of them.

They are fast depending on whether you can vectorise your integer calculations, and how much branching the code does. It does require a different approach to coding to get good performance on the Cell :)

Parallelising at the algorithm level is probably necessary, like handing different macroblocks to each SPE for separate processing... or however it can be done with H.264. Forgive me, I don't know much about video encoding specifically. :o

sysKin
23rd March 2006, 14:22
They are fast depending on whether you can vectorise your integer calculations, and how much branching the code does. It does require a different approach to coding to get good performance on the Cell :)
Well, motion estimation usually consists of a relatively short set of SIMD operations, such as 16x16-sized sum of absolute differences (with one unaligned pointer and one aligned), followed by several branches and conditional memory moves. Rinse and repeat, over and over.
Unfortunately most of h264 encoding time is RDO, which has a similar SIMD part (some FIR, then SIMD difference, then change from 8-bit to 16-bit precision, then a transform) followed by a much larger non-SIMD part (which includes weird almost-random memory accesses, RLE-like encoding [branches!], and arithmetic encoding).

I have no clue if Cell can do that at least remotely efficiently.
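
For readers who have not met it, the 16x16 sum of absolute differences mentioned above looks roughly like this in scalar form. It is a plain-Python sketch for clarity only; frames are assumed to be lists of rows of integer pixel values, and the function name and argument order are invented for the example. A real encoder does the inner loops with SIMD instructions.

```python
def sad_16x16(cur, ref, cx, cy, rx, ry):
    """Sum of absolute differences between two 16x16 blocks.

    (cx, cy) is the top-left corner of the block in the current frame,
    (rx, ry) the top-left corner of the candidate block in the reference.
    """
    total = 0
    for y in range(16):
        cur_row = cur[cy + y]
        ref_row = ref[ry + y]
        for x in range(16):
            total += abs(cur_row[cx + x] - ref_row[rx + x])
    return total
```

The current block usually sits on a macroblock boundary while the candidate position in the reference can be anywhere, which is where the "one unaligned pointer and one aligned" remark comes from.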

bkman
23rd March 2006, 14:41
Lots of independent SIMD operation blocks that can fit their memory access within a 256k work store (with prefetching DMA streaming data in and out) is about an ideal situation for the Cell SPEs. Integer SIMD ops are just as fast as floating point from what I can gather. Branches are costly, but if you hint them early enough the loss is not so bad. So it sounds like the ME part could work fairly well.

By your description of the RDO, it would probably work best on the PowerPC-based PE, although you will have to be careful with instruction scheduling as the processor lacks OoOE and has fairly weak branch-prediction.

At least that is what I can gather from my reading on Cell coding (a lot of it is over my head) ;) Your best bet is to read some IBM docs on the subject and try out some of the code with available tools. I think that this technology has a lot of potential if it is harnessed properly.

lexor
23rd March 2006, 14:47
First off, even if the devs write a decoder/encoder for PS3, it won't run until it's modchipped; this is true of every console. And I doubt devs want to put up with that nonsense. I think if someone wanted to port x264 to Cell, they would be much better off using one of the Blade servers (from either IBM or Microstar or something).

Oh, and for hardware specs, the reason only 7 out of 8 SPEs are active is to improve yields, so IBM can still put a Cell into a PS3 even if one of the SPEs is broken (this improves yields dramatically). So for those who are interested: you can't unlock that 8th SPE. Well, you could, but it's broken and won't work.

Later, when IBM and Co. get production going with high yields, PS3 will have all 8 SPEs active, since they would be able to throw away Cell chips with 1 broken SPE without jeopardising the quantity produced.

sysKin
23rd March 2006, 14:50
Lots of independent SIMD operation blocks that can fit their memory access within a 256k work store (with prefetching DMA streaming data in and out) is about an ideal situation for the Cell SPEs.
Oh no, they are all dependent. One 16x16-sized SAD needs to be followed by branches and memory moves before the target of the next SAD is known (well OK, not always, but if you try several SADs in parallel then there's even more logic later, which sorts it all out). All such SADs use pretty much the same memory, so I suppose splitting them across many SPEs wouldn't be such a good idea (their local memories would be clones).
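
The dependency described above can be seen in a toy search loop. This sketch uses a greedy small-diamond pattern, which is an assumption picked for illustration and not necessarily what any particular encoder does; `cost` is any callable that maps a motion vector to a number, for example a SAD at that offset.

```python
def diamond_search(cost, start=(0, 0), max_steps=16):
    """Greedy small-diamond search over integer motion vectors.

    Returns (best_vector, best_cost).
    """
    best = start
    best_cost = cost(best)
    for _ in range(max_steps):
        cx, cy = best
        candidates = [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]
        # These four costs do not depend on each other and could be
        # computed side by side...
        scored = [(cost(mv), mv) for mv in candidates]
        step_cost, step_mv = min(scored)
        # ...but where the next four candidates lie is only known after
        # this comparison, so the steps themselves are sequential.
        if step_cost >= best_cost:
            break
        best, best_cost = step_mv, step_cost
    return best, best_cost
```

All candidates in a step also read nearly the same reference pixels, which matches the point about local memories ending up as clones.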

bkman
23rd March 2006, 14:52
First off, even if the devs write a decoder/encoder for PS3, it won't run until it's modchipped; this is true of every console. And I doubt devs want to put up with that nonsense.

That's not necessarily true, I think. As I said in the OP, Sony seem set on providing a Linux distro on PS3 HDDs, which would facilitate development tremendously. Now that is assuming that they do enable the running of unsigned code on this Linux distro, but I think that is likely considering that they previously released a Linux homebrew kit for the PS2.

lexor
23rd March 2006, 14:56
And for those who are really lazy, you can just wait for Tier IV of IBM's Octopiler (http://domino.research.ibm.com/comm/research_projects.nsf/pages/cellcompiler.index.html), then you can just get the x264 sources, hit the compile button and it'll give you code that runs on Cell and all its SPEs.

bkman
23rd March 2006, 14:58
Oh no, they are all dependent. One 16x16-sized SAD needs to be followed by branches and memory moves before the target of the next SAD is known (well OK, not always, but if you try several SADs in parallel then there's even more logic later, which sorts it all out). All such SADs use pretty much the same memory, so I suppose splitting them across many SPEs wouldn't be such a good idea (their local memories would be clones).

Hmm, well in that case it is less than ideal. Perhaps the algorithm could be re-imagined to maximise independent parallelism. For instance, is computing a single SAD in several consecutive frames on one SPE possible? I dunno, I'm really reaching here without a good understanding of h264. :o

bkman
23rd March 2006, 15:00
Lexor: I dunno, something tells me that this Octopiler is still some good years off from being really useful :D

lexor
23rd March 2006, 15:06
Lexor: I dunno, something tells me that this Octopiler is still some good years off from being really useful :D
It IS. It's at Tier I right now.

Someone in the know said that if you can find someone who can write code using it, "He'll be worth Marlon Brando's weight in diamond-studded platinum"

sysKin
23rd March 2006, 15:46
Someone in the know said that if you can find someone who can write code using it, "He'll be worth Marlon Brando's weight in diamond-studded platinum"

Exactly, and this is because it takes much more than hitting a compile button to actually use Cell ;)

Having said that, it shouldn't be THAT hard... just split your stuff into smaller sub-programs that run independently and handle their own memory, and I'd say you're mostly done. It shouldn't be *much* more difficult than rewriting a program from scratch, which is something most programmers can do.

The main problem is that C is really a bad language for everything that doesn't run on a general-purpose microprocessor. It doesn't handle SIMD and it doesn't handle any parallelism, which are exactly the two things Cell is based on.

lexor
23rd March 2006, 18:29
The main problem is that C is really a bad language for everything that doesn't run on a general-purpose microprocessor. It doesn't handle SIMD and it doesn't handle any parallelism, which are exactly the two things Cell is based on.
that is why God invented MPI :)

oh and that complexity issue, it's because right now Octopiler is in Tier I, and it's at Tier IV that you will be able to load single threaded code, hit a button and be done :)

bkman
1st January 2007, 18:07
So, now that the PS3 is out with Linux freely available for it, has anyone given this any more thought?

There has been a wealth of Cell-programming resources lately, like these presentations: http://www.cs.utk.edu/%7Edongarra/cell2006/

A PS3 would make a good node in a distributed encoding cluster at the least, I think.

Sharktooth
2nd January 2007, 02:34
AFAIK the PS3 processing power available for linux and "other" OSes is very limited.

Blue_MiSfit
2nd January 2007, 03:49
I think that the PS3 WOULD have tremendous potential as an encoding machine / Linux box - if it had more RAM.

H.264 encoding needs lots of RAM - at least x264 on my Windows box does.

Sharktooth
2nd January 2007, 03:51
For "additional" OSes only 2 cores of the Cell are available...

bkman
2nd January 2007, 04:05
Pretty sure you have access to 6 SPEs in Linux (same as official developers); it's just the RSX access that is limited. You also have about 196MB of RAM to work with, which should be plenty for x264 encoding if any heavy Avisynth work is done on a host machine.

akupenguin
2nd January 2007, 05:33
x264 480p ref=4 bframes=3 single-threaded on Linux takes 36MB virtual / 27MB physical RAM. That can easily be reduced by a factor of 3, at the cost of some extra SIMD computation, if you don't store hpel planes.
So no, I don't think the total amount of RAM is an issue; we only need to think about the 256KB per SPE.
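
The "factor of 3" above can be checked with a little arithmetic. This sketch assumes 8-bit 4:2:0 frames and assumes that storing hpel planes means keeping three extra interpolated luma planes next to the full-pel one (four luma planes in total, chroma stored once). The 720x480 frame size is also an assumption for "480p", and `frame_bytes` is a name made up for the example.

```python
def frame_bytes(width, height, keep_hpel=True):
    """Approximate bytes for one stored reference frame."""
    luma = width * height
    chroma = luma // 2                   # two quarter-size chroma planes
    luma_planes = 4 if keep_hpel else 1  # full-pel plus three half-pel planes
    return luma * luma_planes + chroma


with_hpel = frame_bytes(720, 480, keep_hpel=True)      # 1555200 bytes
without_hpel = frame_bytes(720, 480, keep_hpel=False)  # 518400 bytes
```

Under these assumptions the ratio is (4 + 0.5) / (1 + 0.5) = 3 for any frame size, which lines up with the reduction quoted in the post.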

Adub
2nd January 2007, 07:04
Cool! Now I am actually considering buying a PS3. I can dump the contents of Blu-ray discs with it, fold proteins via F@H, and now encode via x264! Sweet!

Well, almost encode with x264. This is a really cool thread.

Sharktooth
2nd January 2007, 14:54
Pretty sure you have access to 6 SPEs in Linux (same as official developers); it's just the RSX access that is limited. You also have about 196MB of RAM to work with, which should be plenty for x264 encoding if any heavy Avisynth work is done on a host machine.

Definitely not. The Linux kernel is able to use the primary Cell core (the one that looks like a dual 64-bit PowerPC processor) but not the SPUs. So you will end up with a dual-core-like architecture and nothing more unless you have programs specifically written with support for SPUs...

bkman
2nd January 2007, 15:39
Definitely not. The Linux kernel is able to use the primary Cell core (the one that looks like a dual 64-bit PowerPC processor) but not the SPUs. So you will end up with a dual-core-like architecture and nothing more unless you have programs specifically written with support for SPUs...

Of course you'll need to use the SPE C compiler and link the SPE modules in explicitly, but you can make use of the SPEs from within Linux. I've seen this corroborated by several people with PS3s, though I don't have one myself yet.

qyqgpower
2nd January 2007, 17:48
6 SPEs are available to the user in PS3's Linux.
BTW, under XMB, they managed to decode a Full HD H.264 clip at up to 50Mbps with only 3 SPEs.

foxyshadis
2nd January 2007, 19:32
Sharktooth didn't say they weren't available, but they currently have to be treated as custom coprocessors; either by hand-writing asm code to interface, or more sanely by using the PS3 dev kit, but it's not something that can be dropped in on a whim. Getting the threads to interface well and execute without swapping code and data constantly doesn't look easy. Sort of the same situation with shaders on PC gpus.