Threading and latency in x264

scjohnson · 12th September 2008, 21:20

I am interested in more fully understanding the current x264 threading mechanism and expected roadmap. I was using an old (2 1/2 years) version of x264 in a multi-CPU environment (>>1). In those days, threading was handled by slicing the frame. My understanding is that the current algorithm threads by frames instead. Unfortunately, if your goal is streaming compression, you worry about the latencies this approach introduces.

Naively, if I could run streaming compression on 15 processors in real time, that would introduce a 1/2 second latency right off the top in a 30fps environment (for example). Is my naive view of the x264 approach correct? Any thoughts on the best way to tackle this (other than waiting for faster processors)?

I did some searching on this and other forums, but only found this as the most recent comment. My apologies if I've missed this being discussed at length here or elsewhere.

Thanks.
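
The estimate in the post above can be checked with a few lines of Python. This is only a sketch of the naive model described there: it assumes one frame in flight per thread and that the encoder holds all of them before emitting the first. The function name is made up for illustration and is not part of x264.

```python
def frame_threading_latency_s(frames_in_flight: int, fps: float) -> float:
    """Seconds of delay if `frames_in_flight` frames are held concurrently.

    Assumption (from the naive model in the post): one frame per thread,
    so the delay is simply that many frame periods.
    """
    if frames_in_flight < 0 or fps <= 0:
        raise ValueError("need frames_in_flight >= 0 and fps > 0")
    return frames_in_flight / fps


if __name__ == "__main__":
    # 15 processors at 30 fps, the case asked about in the post.
    print(frame_threading_latency_s(15, 30))
    # Two or three threads, as a comparison point.
    print(frame_threading_latency_s(3, 30))
```

Under this model the delay scales linearly with the number of concurrent frames, so halving the thread count halves the added latency.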

fields_g · 12th September 2008, 23:45

There have been vast changes in speed and quality over the last 2.5 years. It is true that x264 is frame based now. Ok... let me get some clarification from you.
1) Is 30fps what you are able to achieve right now with the old version?
2) 15 (processors) is not a very "computer-like" number. What is your setup?
3) What command line parameters are you using?

I might not be the one to be able to give you a good answer, but with the above answers, the community will have a little more info to help you out.

Dark Shikari · 13th September 2008, 01:02

Yes, the current threading method has such latency, which could be problematic for applications that need extremely low latency, such as videoconferencing. However, x264 doesn't need more than one thread to do realtime SD encoding, so you can still get that realtime with near-zero latency. HD encoding can be done with just two or three threads on a top-end CPU, for just a couple frames of latency.

scjohnson · 15th September 2008, 16:55

Thanks for the prompt responses.

What I'm interested in is real-time 1080p/30 encoding. For HD applications I've primarily seen parallel encoding, but to do this real time requires a number of processors. I'm in a low-power environment where you wouldn't choose to use a top-end CPU, but opt instead for lower clock rate and more processors.

I haven't performed a port of the most recent x264 code base to this kind of platform, but my results using someone else's port of the x264 code from 2006 required about 16 processors for real-time on an HD stream. If the current code is approximately as fast, but uses frame-based threading this would introduce a 1/2 second delay (unacceptable in some environments).

That's running on 720p/30 with qp=24, resulting in a 5Mb/s output for my data -- about as low a resolution as I can stand. For 1080p, it's worse of course. As you may have guessed, I'm not using any standard OTS platform and porting the most recent x264 to it will take some effort. I'm trying to evaluate if it's worth it.

If the current code is significantly faster, perhaps a reduction from the 1/2 to something below 1/4 would work.

I was also curious to understand why the choice was made to be frame-based rather than slice-based threaded. (Ease of programming? Something more subtle?)

I'm not sure I've clarified my questions at all, but do appreciate the thoughts.

Thanks again.

Dark Shikari · 15th September 2008, 17:58

Quote:

Originally Posted by scjohnson

Thanks for the prompt responses.

What I'm interested in is real-time 1080p/30 encoding. For HD applications I've primarily seen parallel encoding, but to do this real time requires a number of processors. I'm in a low-power environment where you wouldn't choose to use a top-end CPU, but opt instead for lower clock rate and more processors.

I haven't performed a port of the most recent x264 code base to this kind of platform, but my results using someone else's port of the x264 code from 2006 required about 16 processors for real-time on an HD stream. If the current code is approximately as fast, but uses frame-based threading this would introduce a 1/2 second delay (unacceptable in some environments).

You shouldn't need nearly such a system to do that. I've clocked x264 as running 1080p24 in realtime on a single core CPU (well, one core of a top-end Penryn) with absolute max speed settings. Avail Media does realtime 1080i30 with multiref and RDO with 8 cores; only about 4-6 are used on x264.

Quote:

Originally Posted by scjohnson

I was also curious to understand why the choice was made to be frame-based rather than slice-based threaded. (Ease of programming? Something more subtle?)

Slice-based threading is intolerably inefficient; it caps out very quickly in terms of overall performance increase and does not effectively utilize large numbers of cores.

Inventive Software · 15th September 2008, 17:58

The "thread-pool" patch (search for it) might be of benefit in that case, but you'd need fast, minimal search settings, and I'm assuming a CRF mode, to do 1080p30 real-time.

Manao · 15th September 2008, 18:24

Quote:

Slice-based threading is intolerably inefficient; it caps out very quickly in terms of overall performance increase and does not effectively utilize large numbers of cores.

No, not in a realtime & low latency environment. Slice-based threading is inefficient for offline encoding, but it has the same worst case scenario encoding time as frame-based, which is what matters here.

Put another way, if you adapt subme settings according to CPU load, frame-based will give a better average subme quality than slice-based, but the worst case will be the same. Since low latency prevents you from buffering enough frames to adapt subme, you are forced to use a constant subme setting. This means that, in realtime, you gain nothing from frame-based over slice-based (except a slightly better coding efficiency for frame-based).

Dark Shikari · 15th September 2008, 18:27

Quote:

Originally Posted by Manao

Slice-based threading is inefficient for offline encoding, but it has the same worst case scenario encoding time as frame-based, which is what matters here.

Of course not. With slice-based, if a frame has dramatically differing encoding times per slice, you will spend the vast majority of your time waiting on the last slice to finish. With frame-based, no such problem exists; there's only a problem if an entire frame takes a huge amount longer than other frames.

From the benchmarks I have of realtime encoding with slice-based threading, the maximum performance increase of threads capped out at about 200%, a pathetic value. Frame-based can achieve more than double or triple that. And what really matters is what happens in practice, not some theoretical situation that doesn't actually exist.

Also, real time does not inherently imply "low latency" at all. And if you need speedcontrol, the current speedcontrol patch probably works just fine with 30 or even 15 frames of buffer. Furthermore, since it acts on a thread level rather than frame level, it can re-use the buffer that already exists for threads to use; i.e. it shouldn't add any new delay.
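
The "waiting on the last slice" argument can be illustrated with a toy model. This is a minimal sketch, not a measurement: the per-slice times are invented, slice threading is modelled as one thread per slice with a barrier at the end of each frame, and frame threading is modelled as whole frames handed greedily to the least-loaded thread. It ignores inter-frame dependencies and the extra latency that frame threading costs.

```python
def serial_time(frames):
    """Total work: every slice of every frame on a single thread."""
    return sum(sum(slices) for slices in frames)


def slice_threaded_time(frames):
    """One thread per slice; each frame ends when its slowest slice ends."""
    return sum(max(slices) for slices in frames)


def frame_threaded_time(frames, threads):
    """Whole frames assigned greedily to the least-loaded thread.

    Idealized: no dependency between frames is modelled.
    """
    loads = [0] * threads
    for slices in frames:
        idx = loads.index(min(loads))
        loads[idx] += sum(slices)
    return max(loads)


if __name__ == "__main__":
    # Hypothetical per-slice costs (arbitrary units): one slow slice per frame.
    frames = [[1, 1, 1, 5], [5, 1, 1, 1], [1, 5, 1, 1], [1, 1, 5, 1]]
    base = serial_time(frames)
    print("slice speedup:", base / slice_threaded_time(frames))
    print("frame speedup:", base / frame_threaded_time(frames, 4))
```

With these made-up numbers the slice model reaches 1.6x on four threads while the frame model reaches 4x. How unbalanced real slices are is exactly what the two posters disagree about, so the numbers show the mechanism only.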

scjohnson · 16th September 2008, 14:45

Quote:

Originally Posted by Dark Shikari

From the benchmarks I have of realtime encoding with slice-based threading, the maximum performance increase of threads capped out at about 200%, a pathetic value. Frame-based can achieve more than double or triple that. And what really matters is what happens in practice, not some theoretical situation that doesn't actually exist.

My benchmarks using an x264 port from almost three years ago with slice-based threading on multiple cores scale quite well up to the first 9 threads (88%) and cap out around 25 threads on HD video, where I have a 16x improvement over 1 thread.

I don't consider that pathetic.
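
For reference, the figures in this post can be turned into parallel efficiency (speedup divided by thread count). This is a small sketch; reading the "88%" as efficiency at 9 threads is an assumption, since the post does not say exactly what the percentage measures.

```python
def parallel_efficiency(speedup: float, threads: int) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return speedup / threads


if __name__ == "__main__":
    # Stated in the post: 16x improvement at about 25 threads.
    print(parallel_efficiency(16, 25))
    # Assumption: 88% efficiency at 9 threads implies this speedup.
    print(0.88 * 9)
```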

Sagekilla · 16th September 2008, 15:38

Would it theoretically be possible to have a hybrid frame-slice based threading? Something like: Frame 0 goes to thread group (0,1,2,3) and Frame 1 goes to thread group (4,5,6,7).

I'm guessing it's very difficult because of temporal dependencies but I was curious.
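
The grouping in the post above could be sketched like this. It is purely illustrative: the round-robin assignment of frames to groups is an assumption, the function name is hypothetical, and nothing here models the temporal dependencies the post mentions.

```python
def thread_group(frame_index: int, group_size: int, num_groups: int):
    """Thread ids that would encode `frame_index` as slices.

    Assumption: frames are dealt to groups round-robin, and each group
    owns a contiguous block of `group_size` thread ids.
    """
    group = frame_index % num_groups
    start = group * group_size
    return tuple(range(start, start + group_size))


if __name__ == "__main__":
    for frame in range(4):
        print(frame, thread_group(frame, group_size=4, num_groups=2))
```

In this scheme the number of frames in flight equals the number of groups rather than the number of threads, which is where any latency benefit would come from.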

Manao · 15th September 2008, 18:42

I didn't say realtime == low latency. I said in a realtime & low latency environment, which is his case.

You propose a 15 frame buffer for speed control, which isn't low latency, so you admit low latency forces a constant subme.

And, the worst case scenario, in both cases, is the whole frame taking a huge amount of time. Since subme is constant due to low latency, and since the worst case scenario has the same speed for both slice & frame based, you end up with the same subme for both slice & frame based threading. And that is achieved at the same CPU usage (since both do the same amount of work).

Dark Shikari · 15th September 2008, 18:46

Quote:

Originally Posted by Manao

I didn't say realtime == low latency. I said in a realtime & low latency environment, which is his case.

You propose a 15 frame buffer for speed control, which isn't low latency, so you admit low latency forces a constant subme.

What do you define as low latency--any number less than a number I use in my post? I never said speedcontrol wouldn't work at a lower buffer size, I gave a suggestion. Also, I do consider 15 frames to be low latency. 300 frames is high latency, which is what Avail Media uses for television broadcast. I cannot imagine a case in which lower than 15 frames is absolutely necessary except perhaps for videoconferencing.

Quote:

Originally Posted by Manao

And, the worst case scenario, in both cases, is the whole frame taking a huge amount of time.

No, it isn't, because the time that the frame takes is going to be roughly proportional to its size in bits. Therefore, any case in which you consistently get many frames that take way too long, you will have already violated VBV anyways. That is the only possible case in which frame-based threading could reach your rather hypothetical "worst-case scenario." Slice-based threading, on the other hand, can consistently fail even if VBV never has any problems.

scjohnson · 15th September 2008, 19:43

Thanks again for all the great insight.

Quote:

Originally Posted by Dark Shikari

[...] I do consider 15 frames to be low latency. 300 frames is high latency, which is what Avail Media uses for television broadcast. I cannot imagine a case in which lower than 15 frames is absolutely necessary except perhaps for videoconferencing. [...]

That's a big one ... and, one might argue, one of the largest poorly tapped markets in the real-time encoding field.

For telepresence applications, 15 frames is not low latency. You'd never stand for 1/2 second delay on your cell phone and that's why many telepresence applications are very awkward for the general user.

Low latency is 100 ms, which means you need to hold <3 frames in a buffer @30fps.

For a television broadcast, a 10 second delay is no big deal and there's where the threading choice of x264 makes perfect sense. However, if we want x264 to be applicable for other problems, it needs to approach a single frame latency, which naively lends itself to slice-based threading.

Manao · 15th September 2008, 19:05

Quote:

No, it isn't, because the time that the frame takes is going to be roughly proportional to its size in bits

No, the time is something like A × (macroblock count) × (complexity) + B × (bitrate), with a clearly non-negligible first part. Example: foreman, -8 -q 20 -m 6 -b 3 gives 22 fps at 1 Mbps; -8 -q 40 -m 6 -b 3 gives 44 fps at 60 kbps. So bitrate at q 20 accounts for about half the encoding time on foreman.
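
The "half the encoding time" figure can be reproduced from the two foreman measurements. This is a rough sketch under a simplifying assumption: the A × (macroblock count) × (complexity) term is treated as one constant that is the same at both quantizers, which is only approximate. The function name is made up for illustration.

```python
def fit_two_point(fps_a, kbps_a, fps_b, kbps_b):
    """Solve time_per_frame = fixed + per_kbps * bitrate from two runs.

    Assumption: `fixed` stands in for A * macroblocks * complexity and is
    taken to be identical in both runs.
    """
    t_a, t_b = 1.0 / fps_a, 1.0 / fps_b
    per_kbps = (t_a - t_b) / (kbps_a - kbps_b)
    fixed = t_a - per_kbps * kbps_a
    return fixed, per_kbps


if __name__ == "__main__":
    # From the post: 22 fps at 1 Mbps (q 20), 44 fps at 60 kbps (q 40).
    fixed, per_kbps = fit_two_point(22, 1000, 44, 60)
    share = per_kbps * 1000 / (1.0 / 22)
    print(f"fixed part: {fixed * 1000:.1f} ms/frame")
    print(f"bitrate share at q 20: {share:.0%}")
```

Under that assumption the bitrate-dependent term comes out at a little over half of the per-frame time at q 20, in line with the estimate in the post.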

fields_g · 15th September 2008, 20:04

I like this thread even more! I'm involved in videoconferencing.

In the entire system, latency comes from many places: drivers, buffers, encoding, distance, routing, decoding etc. Latency is inescapable! Theoretically, even talking to someone face-to-face has latency (Distance apart/speed of sound). In technology, you need to determine the overall acceptable latency of a system, then budget for each part of the process. Drivers, buffering, encoding, transmission, distance, queuing, decoding, etc. all add latency.

My definition of "Realtime" encoding means being able to indefinitely sustain an encoding rate equal to or surpassing the input frame rate. Not including buffering, encoding realtime at 30 fps can still add up to 33 ms to the conversation.

I believe that 150ms is generally considered the maximum one way latency for a 2-way voice conversation. This can adjust depending on the format of the conversation and the personalities of the people on each end. This should be approximately the same for videoconferencing.

I guess an important question is: Is the application 2-way videoconferencing? If not, the budget can be expanded greatly. Either way, we need to know how much latency we can afford for encoding/buffering. This will dictate the performance/processor, # of processors, and encoding options that are needed.
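
The budgeting idea above can be sketched in a few lines. The 150 ms total is the figure from the post; every per-stage number below is a made-up placeholder, not a measurement, and the function name is hypothetical.

```python
def encoder_allowance(total_budget_ms: int, other_stages_ms: dict, fps: int):
    """Return (ms left for encoding/buffering, whole frames that fit)."""
    remaining_ms = total_budget_ms - sum(other_stages_ms.values())
    if remaining_ms < 0:
        raise ValueError("other stages already exceed the budget")
    whole_frames = remaining_ms * fps // 1000
    return remaining_ms, whole_frames


if __name__ == "__main__":
    # Hypothetical one-way stage costs in milliseconds.
    stages = {"capture": 10, "network": 40, "decode": 15, "display": 15}
    print(encoder_allowance(150, stages, fps=30))
```

With these placeholder numbers, 70 ms remain for the encoder, which is two whole frames at 30 fps. That is why the number of frames an encoder holds matters so much for two-way use.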

Shinigami-Sama · 15th September 2008, 21:54

if you're doing videoconferencing you should be able to do SD in a single thread and if you're doing it in HD you should give your head a shake...

BlackSharkfr · 16th September 2008, 07:52

Quote:

Originally Posted by Shinigami-Sama

if you're doing videoconferencing you should be able to do SD in a single thread and if you're doing it in HD you should give your head a shake...

I think scjohnson's product makes sense.
HD videoconferencing should be available soon.

People already have FullHD camcorders, FullHD TVs, although individuals don't have the required bandwidth to transfer realtime FullHD streams yet, many businesses can afford it.
I can't wait to have a fiber connection at home...

fields_g · 16th September 2008, 11:58

Quote:

Originally Posted by BlackSharkfr

I think scjohnson's product makes sense.
HD videoconferencing should be available soon.

People already have FullHD camcorders, FullHD TVs, although individuals don't have the required bandwidth to transfer realtime FullHD streams yet, many businesses can afford it.
I can't wait to have a fiber connection at home...

HD videoconferencing has been on the market for quite a while now. It relies on H.264 to get quality at 720p resolutions. Lifesize was a startup and the first company to debut HD "telepresence", followed by the other two (already established), Tandberg and Polycom. Cisco came later, but pushed the envelope to include 1080p.

All these companies rely on hardware to encode their streams. I no longer have access to these machines, and when I did, I didn't try too hard to get an H.264 stream to analyze. I'm sure it would be interesting though.

Many of these companies lock much of their hardware so that it can't set up calls faster than 1.5-2 Mbit/s unless you purchase key codes for greater bitrates. This is unfortunate, since we can all imagine what the quality of such an image encoded by a hasty low latency encoder would look like.

foxyshadis · 16th September 2008, 09:44

If you need HD videoconferencing, you pay for the CPU needed to minimize latency. Do note that there's a slice-based patch floating around, which combines both threading methods - I have no way of even testing the latency benefit of combining both, but who knows, it might work.

Dark Shikari · 16th September 2008, 09:46

Quote:

Originally Posted by foxyshadis

If you need HD videoconferencing, you pay for the CPU needed to minimize latency. Do note that there's a slice-based patch floating around, which combines both threading methods - I have no way of even testing the latency benefit of combining both, but who knows, it might work.

The slices patch doesn't use slice-based threading; it uses slices in frame-based threading. It only affects syntax elements, not the threading model.
