Old 12th September 2008, 21:20   #1  |  Link
scjohnson
Registered User
 
Join Date: Aug 2008
Posts: 4
Threading and latency in x264

I am interested in more fully understanding the current x264 threading mechanism and expected roadmap. I was using an old (2 1/2 years) version of x264 in a multi-CPU environment (>>1). In those days, threading was handled by slicing the frame. My understanding is that the current algorithm threads by frames instead. Unfortunately, if your goal is streaming compression, you worry about the latencies this approach introduces.

Naively, if I could run a streaming compression on 15 processors in real time, that would introduce a 1/2 second latency right off the top in a 30fps environment (for example). Is my naive view of the x264 approach correct? Any thoughts on the best way to tackle this (other than waiting for faster processors)?
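The half-second figure above can be checked with a quick calculation. This is only a minimal sketch of the naive estimate in the post, assuming each frame thread keeps one frame in flight and ignoring lookahead, B-frame reordering and network buffering; the function name is made up for illustration.

```python
def naive_frame_threading_latency_s(frame_threads, fps):
    """Naive added delay when each thread holds one frame in flight.

    Assumption: roughly one frame period of delay per frame thread.
    This is the poster's back-of-envelope model, not a measurement.
    """
    if frame_threads < 1 or fps <= 0:
        raise ValueError("need at least one thread and a positive frame rate")
    return frame_threads / fps


# 15 frame threads at 30 fps
print(naive_frame_threading_latency_s(15, 30))  # 0.5
```

Under the same assumption, 16 threads at 30 fps would come to a little over half a second.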

I did some searching on this and other forums, but only found this as the most recent comment. My apologies if I've missed this being discussed at length here or elsewhere.

Thanks.
Old 12th September 2008, 23:45   #2  |  Link
fields_g
x264... Brilliant!
 
Join Date: Mar 2005
Location: Rockville, MD
Posts: 167
There have been vast changes in speed and quality over the last 2.5 years. It is true that x264 is frame-based now. OK... let me get some clarification from you.
1) Is 30fps what you are able to achieve right now with the old version?
2) 15 (processors) is not a very "computer-like" number. What is your setup?
3) What command line parameters are you using?

I might not be the one to be able to give you a good answer, but with the above answers, the community will have a little more info to help you out.
Old 13th September 2008, 01:02   #3  |  Link
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
Yes, the current threading method has such latency, which could be problematic for applications that need extremely low latency, such as videoconferencing. However, x264 doesn't need more than one thread to do realtime SD encoding, so you can still get that realtime with near-zero latency. HD encoding can be done with just two or three threads on a top-end CPU, for just a couple frames of latency.
Old 15th September 2008, 16:55   #4  |  Link
scjohnson
Registered User
 
Join Date: Aug 2008
Posts: 4
Thanks for the prompt responses.

What I'm interested in is real-time 1080p/30 encoding. For HD applications I've primarily seen parallel encoding, but to do this real time requires a number of processors. I'm in a low-power environment where you wouldn't choose to use a top-end CPU, but opt instead for lower clock rate and more processors.

I haven't performed a port of the most recent x264 code base to this kind of platform, but my results using someone else's port of the x264 code from 2006 required about 16 processors for real-time on an HD stream. If the current code is approximately as fast, but uses frame-based threading this would introduce a 1/2 second delay (unacceptable in some environments).

That's running on a 720p/30 stream with qp=24, resulting in a 5Mb/s output for my data -- about as low a resolution as I can stand. For 1080p, it's worse of course. As you may have guessed, I'm not using any standard OTS platform, and porting the most recent x264 to it will take some effort. I'm trying to evaluate if it's worth it.

If the current code is significantly faster, perhaps a reduction from the 1/2 to something below 1/4 would work.

I was also curious to understand why the choice was made to be frame-based rather than slice-based threaded. (Ease of programming? Something more subtle?)

I'm not sure I've clarified my questions at all, but do appreciate the thoughts.

Thanks again.
Old 15th September 2008, 17:58   #5  |  Link
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
Quote:
Originally Posted by scjohnson View Post
Thanks for the prompt responses.

What I'm interested in is real-time 1080p/30 encoding. For HD applications I've primarily seen parallel encoding, but to do this real time requires a number of processors. I'm in a low-power environment where you wouldn't choose to use a top-end CPU, but opt instead for lower clock rate and more processors.

I haven't performed a port of the most recent x264 code base to this kind of platform, but my results using someone else's port of the x264 code from 2006 required about 16 processors for real-time on an HD stream. If the current code is approximately as fast, but uses frame-based threading this would introduce a 1/2 second delay (unacceptable in some environments).
You shouldn't need anything close to such a system to do that. I've clocked x264 running 1080p24 in realtime on a single-core CPU (well, one core of a top-end Penryn) with absolute max speed settings. Avail Media does realtime 1080i30 with multiref and RDO with 8 cores; only about 4-6 are used on x264.
Quote:
Originally Posted by scjohnson View Post
I was also curious to understand why the choice was made to be frame-based rather than slice-based threaded. (Ease of programming? Something more subtle?)
Slice-based threading is intolerably inefficient; it caps out very quickly in terms of overall performance increase and does not effectively utilize large numbers of cores.
Old 15th September 2008, 17:58   #6  |  Link
Inventive Software
Turkey Machine
 
Join Date: Jan 2005
Location: Lowestoft, UK (but visit lots of places with bribes [beer])
Posts: 1,953
The "thread-pool" patch (search for it) might be of benefit in that case, but you'd need fast, minimal search settings, and I'm assuming a CRF mode, to do 1080p30 real-time.
__________________
On Discworld it is clearly recognized that million-to-one chances happen 9 times out of 10. If the hero did not overcome huge odds, what would be the point? Terry Pratchett - The Science Of Discworld
Old 15th September 2008, 18:24   #7  |  Link
Manao
Registered User
 
Join Date: Jan 2002
Location: France
Posts: 2,856
Quote:
Slice-based threading is intolerably inefficient; it caps out very quickly in terms of overall performance increase and does not effectively utilize large numbers of cores.
No, not in a realtime & low latency environment. Slice-based threading is inefficient for offline encoding, but it has the same worst-case encoding time as frame-based threading, which is what matters here.

Put another way, if you adapt subme settings according to CPU load, frame-based threading will give a better average subme quality than slice-based, but the worst case will be the same. Since low latency prevents you from buffering enough frames to adapt subme, you are forced to use a constant subme setting. This means that, in realtime, you gain nothing from frame-based over slice-based (except a slightly better coding efficiency for frame-based).
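One way to read the worst-case argument is with a toy queue model. The sketch below assumes a single encoder fed one frame per frame period, encoding in order with no frame dropping; the per-frame times are hypothetical and the function name is invented for illustration.

```python
def worst_latency_ms(encode_times_ms, fps):
    """Worst capture-to-done delay for one encoder fed in real time.

    Frames arrive one per frame period and are encoded in order, so a
    slow frame also delays the frames queued behind it.
    """
    period = 1000.0 / fps
    done = 0.0
    worst = 0.0
    for i, t in enumerate(encode_times_ms):
        arrival = i * period
        done = max(done, arrival) + t  # wait for the frame, then encode it
        worst = max(worst, done - arrival)
    return worst


# Hypothetical per-frame times in ms. The average (28 ms) fits inside a
# 33.3 ms frame period, but the single 60 ms frame sets the worst delay.
times = [20, 20, 60, 20, 20]
print(round(worst_latency_ms(times, 30), 1))
```

With no buffer to absorb the slow frame, the settings have to be chosen for that frame rather than for the average, which is the point about a constant subme.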
Old 15th September 2008, 18:27   #8  |  Link
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
Quote:
Originally Posted by Manao View Post
Slice-based threading is inefficient for offline encoding, but it has the same worst-case encoding time as frame-based threading, which is what matters here.
Of course not. With slice-based, if a frame has dramatically differing encoding times per slice, you will spend the vast majority of your time waiting on the last slice to finish. With frame-based, no such problem exists; there's only a problem if an entire frame takes a huge amount longer than other frames.
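The "waiting on the last slice" effect can be sketched with a few lines of arithmetic. This is an illustration only: it assumes one thread per slice, ignores synchronization overhead and any compression loss from slicing, and the per-slice times are made up.

```python
def slice_threading_speedup(slice_times_ms):
    """Speedup over one thread when each slice of a frame gets its own thread.

    The frame is finished only when its slowest slice is finished, so the
    other threads sit idle for the remainder of that time.
    """
    return sum(slice_times_ms) / max(slice_times_ms)


# Hypothetical per-slice encode times in ms for one frame split four ways.
balanced = [10, 10, 10, 10]
uneven = [4, 6, 10, 20]

print(slice_threading_speedup(balanced))  # 4.0
print(slice_threading_speedup(uneven))    # 2.0
```

Both frames contain the same 40 ms of total work; only the distribution across slices differs.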

From the benchmarks I have of realtime encoding with slice-based threading, the maximum performance increase of threads capped out at about 200%, a pathetic value. Frame-based can achieve more than double or triple that. And what really matters is what happens in practice, not some theoretical situation that doesn't actually exist.

Also, real time does not inherently imply "low latency" at all. And if you need speedcontrol, the current speedcontrol patch probably works just fine with 30 or even 15 frames of buffer. Furthermore, since it acts on a thread level rather than frame level, it can re-use the buffer that already exists for threads to use; i.e. it shouldn't add any new delay.

Last edited by Dark Shikari; 15th September 2008 at 18:33.
Old 16th September 2008, 14:45   #9  |  Link
scjohnson
Registered User
 
Join Date: Aug 2008
Posts: 4
Quote:
Originally Posted by Dark Shikari View Post
From the benchmarks I have of realtime encoding with slice-based threading, the maximum performance increase of threads capped out at about 200%, a pathetic value. Frame-based can achieve more than double or triple that. And what really matters is what happens in practice, not some theoretical situation that doesn't actually exist.
My benchmarks using an x264 port from almost three years ago with slice-based threading on multiple cores scale quite well up to the first 9 threads (88%) and cap out around 25 threads on HD video, where I have a 16x improvement over 1 thread.

I don't consider that pathetic.
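For comparing scaling figures like these, a common yardstick is parallel efficiency, meaning speedup divided by thread count. A minimal sketch, with a made-up function name:

```python
def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear scaling actually achieved."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return speedup / threads


# 16x improvement on 25 threads, as reported above
print(parallel_efficiency(16, 25))  # 0.64
```

If the 88% quoted for 9 threads is read as efficiency (an assumption; the post does not say), that would correspond to roughly a 7.9x speedup at that point.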
Old 16th September 2008, 15:38   #10  |  Link
Sagekilla
x264aholic
 
Join Date: Jul 2007
Location: New York
Posts: 1,752
Would it theoretically be possible to have a hybrid frame-slice based threading? Something like: Frame 0 goes to thread group (0,1,2,3) and Frame 1 goes to thread group (4,5,6,7).

I'm guessing it's very difficult because of temporal dependencies but I was curious.
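The mapping described here can be sketched as a round-robin assignment of frames to thread groups. This is purely an illustration of the idea in the post, not anything x264 is known to do; the function and parameter names are hypothetical, and it says nothing about the temporal dependencies between frames.

```python
def thread_group(frame_index, group_size, num_groups):
    """Thread IDs that would share the slices of a given frame.

    Frames are dealt round-robin to groups; the threads inside a group
    would each take one slice of that frame.
    """
    group = frame_index % num_groups
    first = group * group_size
    return list(range(first, first + group_size))


for frame in range(3):
    print(frame, thread_group(frame, group_size=4, num_groups=2))
```

Frame 0 maps to threads 0-3, frame 1 to threads 4-7, and frame 2 wraps back to threads 0-3.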
__________________
You can't call your encoding speed slow until you start measuring in seconds per frame.
Old 15th September 2008, 18:42   #11  |  Link
Manao
Registered User
 
Join Date: Jan 2002
Location: France
Posts: 2,856
I didn't say realtime == low latency. I said in a realtime & low latency environment, which is his case.

You propose a 15 frame buffer for speed control, which isn't low latency, so you admit low latency forces a constant subme.

And the worst-case scenario, in both cases, is the whole frame taking a huge amount of time. Since subme is constant due to low latency, and since the worst-case scenario has the same speed for both slice- and frame-based threading, you end up with the same subme for both. And that is achieved at the same CPU usage (since both do the same amount of work).
Old 15th September 2008, 18:46   #12  |  Link
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
Quote:
Originally Posted by Manao View Post
I didn't say realtime == low latency. I said in a realtime & low latency environment, which is his case.

You propose a 15 frame buffer for speed control, which isn't low latency, so you admit low latency forces a constant subme.
What do you define as low latency--any number less than a number I use in my post? I never said speedcontrol wouldn't work at a lower buffer size, I gave a suggestion. Also, I do consider 15 frames to be low latency. 300 frames is high latency, which is what Avail Media uses for television broadcast. I cannot imagine a case in which lower than 15 frames is absolutely necessary except perhaps for videoconferencing.
Quote:
Originally Posted by Manao View Post
And the worst-case scenario, in both cases, is the whole frame taking a huge amount of time.
No, it isn't, because the time that the frame takes is going to be roughly proportional to its size in bits. Therefore, any case in which you consistently get many frames that take way too long, you will have already violated VBV anyways. That is the only possible case in which frame-based threading could reach your rather hypothetical "worst-case scenario." Slice-based threading, on the other hand, can consistently fail even if VBV never has any problems.

Last edited by Dark Shikari; 15th September 2008 at 18:49.
Old 15th September 2008, 19:43   #13  |  Link
scjohnson
Registered User
 
Join Date: Aug 2008
Posts: 4
Thanks again for all the great insight.

Quote:
Originally Posted by Dark Shikari View Post
[...] I do consider 15 frames to be low latency. 300 frames is high latency, which is what Avail Media uses for television broadcast. I cannot imagine a case in which lower than 15 frames is absolutely necessary except perhaps for videoconferencing. [...]
That's a big one ... and, one might argue, one of the largest poorly tapped markets in the real-time encoding field.

For telepresence applications, 15 frames is not low latency. You'd never stand for a 1/2 second delay on your cell phone, and that's why many telepresence applications are very awkward for the general user.

Low latency is 100 ms, which means you need to hold <3 frames in a buffer @30fps.
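The frame count follows directly from the frame period. A minimal sketch of the conversion, using integer arithmetic and a made-up function name:

```python
def frames_in_budget(budget_ms, fps):
    """Whole frame periods that fit inside a latency budget."""
    return budget_ms * fps // 1000


print(frames_in_budget(100, 30))  # 3
print(frames_in_budget(500, 30))  # 15
```

Three full frames would use up the entire 100 ms, which is why the buffer has to stay below that once anything else in the chain is counted.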

For a television broadcast, a 10 second delay is no big deal, and that's where the threading choice of x264 makes perfect sense. However, if we want x264 to be applicable to other problems, it needs to approach a single-frame latency, which naively lends itself to slice-based threading.
Old 15th September 2008, 19:05   #14  |  Link
Manao
Registered User
 
Join Date: Jan 2002
Location: France
Posts: 2,856
Quote:
No, it isn't, because the time that the frame takes is going to be roughly proportional to its size in bits
No, the time is something like A x (macroblock count) x (complexity) + B x (bitrate), with a clearly non-negligible first part (example: foreman, -8 -q 20 -m 6 -b 3: 22 fps for 1 Mbps; -8 -q 40 -m 6 -b 3: 44 fps for 60 kbps; so bitrate at q 20 accounts for half the encoding time on foreman)
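The two foreman measurements are enough to solve this two-term model for its coefficients. The sketch below assumes the first term (macroblock count x complexity) is the same at both quantizers and that 1 Mbps means 1000 kbps; both are assumptions made for the arithmetic, and the function name is invented.

```python
def fit_time_model(fps_hi, kbps_hi, fps_lo, kbps_lo):
    """Solve t = fixed + per_kbps * bitrate from two (fps, bitrate) points.

    Returns (fixed_ms, per_kbps_ms), both per frame.
    """
    t_hi = 1000.0 / fps_hi  # ms per frame at the higher bitrate
    t_lo = 1000.0 / fps_lo  # ms per frame at the lower bitrate
    per_kbps = (t_hi - t_lo) / (kbps_hi - kbps_lo)
    fixed = t_hi - per_kbps * kbps_hi
    return fixed, per_kbps


# The two measurements quoted above: 22 fps at 1 Mbps, 44 fps at 60 kbps.
fixed, per_kbps = fit_time_model(22, 1000, 44, 60)
share = per_kbps * 1000 / (1000.0 / 22)  # bitrate term's share at q 20
print(round(fixed, 1), round(share, 2))
```

Under these assumptions the fixed part comes to roughly 21 ms per frame and the bitrate-dependent part to a little over half of the encoding time at q 20, in line with the "half" quoted in the post.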
Old 15th September 2008, 20:04   #15  |  Link
fields_g
x264... Brilliant!
 
Join Date: Mar 2005
Location: Rockville, MD
Posts: 167
I like this thread even more! I'm involved in videoconferencing.

In the entire system, latency comes from many places: drivers, buffers, encoding, distance, routing, decoding etc. Latency is inescapable! Theoretically, even talking to someone face-to-face has latency (Distance apart/speed of sound). In technology, you need to determine the overall acceptable latency of a system, then budget for each part of the process. Drivers, buffering, encoding, transmission, distance, queuing, decoding, etc. all add latency.

My definition of "realtime" encoding is being able to indefinitely sustain an encoding rate equal to or surpassing the input frame rate. Not including buffering, encoding in realtime at 30 fps can still add up to 33ms to the conversation.

I believe that 150ms is generally considered the maximum one way latency for a 2-way voice conversation. This can adjust depending on the format of the conversation and the personalities of the people on each end. This should be approximately the same for videoconferencing.

I guess an important question is: is the application 2-way videoconferencing? If not, the budget can be expanded greatly. Either way, we need to know how much latency we can afford for encoding/buffering. This will dictate the performance per processor, the number of processors, and the encoding options that are needed.
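The budgeting step can be sketched as simple subtraction: take the overall one-way budget, subtract what the other stages are expected to cost, and see what is left for encoding and buffering. The stage names and numbers below are placeholders chosen for illustration, not measurements from any real system.

```python
def encode_budget_ms(total_budget_ms, other_stages_ms):
    """Latency left for encoding and buffering after the other stages."""
    return total_budget_ms - sum(other_stages_ms.values())


# Placeholder numbers only, not measurements from any real system.
other_stages = {"capture": 20, "network": 40, "decode": 15, "display": 15}
left = encode_budget_ms(150, other_stages)
print(left, "ms left, about", left * 30 // 1000, "frames at 30 fps")
```

With these placeholder values, 60 ms remain, which covers only one whole frame period at 30 fps.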
Old 15th September 2008, 21:54   #16  |  Link
Shinigami-Sama
Solaris: burnt by the Sun
 
Join Date: Oct 2004
Location: /etc/default/moo
Posts: 1,923
If you're doing videoconferencing you should be able to do SD in a single thread, and if you're doing it in HD you should give your head a shake...
__________________
Quote:
Originally Posted by benjust View Post
interlacing and telecining should have been but a memory long ago.. unfortunately still just another bizarre weapon in the industries war on image quality.
Old 16th September 2008, 07:52   #17  |  Link
BlackSharkfr
Registered User
 
Join Date: Dec 2005
Posts: 133
Quote:
Originally Posted by Shinigami-Sama View Post
If you're doing videoconferencing you should be able to do SD in a single thread, and if you're doing it in HD you should give your head a shake...
I think scjohnson's product makes sense.
HD videoconferencing should be available soon.

People already have FullHD camcorders and FullHD TVs. Although individuals don't yet have the bandwidth required to transfer realtime FullHD streams, many businesses can afford it.
I can't wait to have a fiber connection at home...
Old 16th September 2008, 11:58   #18  |  Link
fields_g
x264... Brilliant!
 
Join Date: Mar 2005
Location: Rockville, MD
Posts: 167
Quote:
Originally Posted by BlackSharkfr View Post
I think scjohnson's product makes sense.
HD videoconferencing should be available soon.

People already have FullHD camcorders and FullHD TVs. Although individuals don't yet have the bandwidth required to transfer realtime FullHD streams, many businesses can afford it.
I can't wait to have a fiber connection at home...
HD videoconferencing has been on the market for quite a while now. It relies on H.264 to get quality at 720p resolutions. Lifesize, a startup, was the first company to debut HD "telepresence", followed by the other two (already established), Tandberg and Polycom. Cisco came later, but pushed the envelope to include 1080p.

All these companies rely on hardware to encode their streams. I no longer have access to these machines, and when I did, I didn't try too hard to get an H.264 stream to analyze. I'm sure it would be interesting though.

Many of these companies lock much of their hardware so that it can't set up calls faster than 1.5-2 Mbit/s unless you purchase key codes for greater bitrates. This is unfortunate, since we can all imagine what the quality of such an image encoded by a hasty low-latency encoder would look like.
Old 16th September 2008, 09:44   #19  |  Link
foxyshadis
Angel of Night
 
Join Date: Nov 2004
Location: Tangled in the silks
Posts: 9,559
If you need HD videoconferencing, you pay for the CPU needed to minimize latency. Do note that there's a slice-based patch floating around, which combines both threading methods - I have no way of even testing the latency benefit of combining both, but who knows, it might work.
Old 16th September 2008, 09:46   #20  |  Link
Dark Shikari
x264 developer
 
Join Date: Sep 2005
Posts: 8,666
Quote:
Originally Posted by foxyshadis View Post
If you need HD videoconferencing, you pay for the CPU needed to minimize latency. Do note that there's a slice-based patch floating around, which combines both threading methods - I have no way of even testing the latency benefit of combining both, but who knows, it might work.
The slices patch doesn't use slice-based threading; it uses slices in frame-based threading. It only affects syntax elements, not the threading model.
Tags
latency, parallel, threading

