Question on cores*3/2 formula


roozhou
27th November 2008, 17:33
Many people have told me that on multi-core CPUs, cores*3/2 threads gives x264 the best performance. But can anyone tell me why the best thread count is not simply the number of physical cores? It seems clear that more threads means more time wasted on synchronization and lower quality.

I did a quick test, and it seems that cores*3/2 threads are only needed with --b-adapt 1.

Intel E2160 Dual 1.8G,
1440x1080 mpeg2
encoding settings: --crf 20 --ref 4 --subme 6 --me umh

w/o b frames,
--threads 2: 100% CPU load
--threads 3: 100% CPU load

-b 3 --b-adapt 1,
--threads 2: ~70% CPU load
--threads 3: 100% CPU load

-b 3 --b-adapt 2,
--threads 2: ~70% CPU load
--threads 3: ~70% CPU load

Irakli
27th November 2008, 17:47
I would like to comment on your test. In my opinion, using CPU utilization as a speed test is almost always useless. If you measure encoding speed (in fps) instead of CPU utilization, 3/2*cores will be faster.

RunningSkittle
27th November 2008, 17:48
the new b-frame decision is not multithreaded.

nurbs
27th November 2008, 18:02
IIRC frame type decision isn't threaded at all. It is just most noticeable with --b-adapt 2.

LoRd_MuldeR
27th November 2008, 18:18
I would like to comment on your test. In my opinion, using CPU utilization as a speed test is almost always useless. If you measure encoding speed (in fps) instead of CPU utilization, 3/2*cores will be faster.

If CPU load did matter, we could simply add a few dummy threads that do useless calculations to fill all the CPU cores all the time and then we could claim that we just implemented perfect multi-threading :p

roozhou
27th November 2008, 18:33
I would like to comment on your test. In my opinion, using CPU utilization as a speed test is almost always useless. If you measure encoding speed (in fps) instead of CPU utilization, 3/2*cores will be faster.

Of course same cpu utilization means the same fps. I think it should be common sense so it is not mentioned in my test.

I just want to know WHY w/ b-adapt 1 two threads cannot use up 2 cores but three threads can.

LoRd_MuldeR
27th November 2008, 18:37
Of course same cpu utilization means the same fps

That's not so obvious. You can't assume that "same cpu utilization means the same fps" unless you have proven it :rolleyes:

Many multi-threaded applications don't achieve double throughput by using double CPU time. There usually is some overhead...

I just want to know WHY w/ b-adapt 1 two threads cannot use up 2 cores but three threads can.

Frame type decision is not multi-threaded yet.

With "--b-adapt 1" you usually won't notice, because it's fast! But "--b-adapt 2" is significantly slower (and significantly better).

roozhou
27th November 2008, 18:46
Frame type decision is not multi-threaded yet.

With "--b-adapt 1" you usually won't notice, because it's fast! But "--b-adapt 2" is significantly slower (and significantly better).

You didn't answer my question. I need an answer on why with b-adapt 1 two threads cannot use up 2 physical cores. Does it mean b-adapt 1 has a bad multithread implementation?

nurbs
27th November 2008, 18:50
Does it mean b-adapt 1 has a bad multithread implementation?
You have been told several times that it (frametype decision) is not multithreaded at all.

LoRd_MuldeR
27th November 2008, 18:52
frame type decision isn't threaded at all

You didn't answer my question. I need answer on why with b-adapt 1 two threads cannot use up 2 physical cores. Does it mean b-adapt 1 has a bad multithread implementation?

Even using 100 threads won't help, if 99 of these threads have to wait for the result of a specific calculation that is done in one single thread...
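
This is essentially Amdahl's law: the serial part of the work caps the speedup no matter how many threads are added. A minimal Python sketch, assuming a made-up serial share of 30% (an illustrative number, not a measurement of x264's frame type decision):

```python
def amdahl_speedup(serial_fraction: float, threads: int) -> float:
    """Upper bound on speedup when serial_fraction of the work runs on one thread."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    # 0.30 is a hypothetical serial share, chosen only for illustration.
    for n in (1, 2, 3, 100):
        print(f"{n:3d} threads -> at most {amdahl_speedup(0.30, n):.2f}x")
```

With any non-zero serial share the bound approaches 1/serial_fraction, so going from 3 threads to 100 buys very little.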

kemuri-_9
27th November 2008, 19:16
You didn't answer my question. I need answer on why with b-adapt 1 two threads cannot use up 2 physical cores. Does it mean b-adapt 1 has a bad multithread implementation?

this is usually due to Hyperthreading, and scheduling done on the machine.

having 3 / 2 * cores as threads allows HT to be used (if i understand the concept right) which keeps the cpus running better, since the pipelines don't have as many gaps in work.

roozhou
27th November 2008, 19:19
Even using 100 threads won't help, if 99 of these threads have to wait for the result of a specific calculation that is done in one single thread...

We are talking about x264 and this situation should not happen in x264. If b-adapt is not multi-threaded, can we put it into the input thread?

LoRd_MuldeR
27th November 2008, 19:22
We are talking about x264 and this situation should not happen in x264.

If you think you know it better than the x264 developers, feel free to submit a patch :p

If b-adapt is not multi-threaded, can we put it into the input thread?

Moving an implementation that is not multi-threaded from one thread to another one doesn't make it multi-threaded :rolleyes:

roozhou
27th November 2008, 19:22
this is usually due to Hyperthreading, and scheduling done on the machine.

having 3 / 2 * cores as threads allows HT to be used (if i understand the concept right) which keeps the cpus running better, since the pipelines don't have as many gaps in work.

HT is off on my machine. And CPU is not "sleeping" during pipeline stalls, it is still working!

kemuri-_9
27th November 2008, 19:35
then the only answer i have left is the OS's process scheduler.

with a 1:1 matching of threads to cores,
one of x264 threads will often be swapped out for other processes.
having it to where x264 threads can't always be on the cpus.

with a 3:2 matching of threads to cores,
there's x264 threads that are 'building' up on their priority to be processed next, since they're in the ready state.
so when the scheduler swaps off an x264 thread,
it is likely to swap it out with one of the other x264 threads that are ready,
giving other processes a further hassle to get processing time.

LoRd_MuldeR
27th November 2008, 19:37
Hyperthreading means that there are two parallel pipelines in the very same core. Also most registers exist twice. However all the execution units still exist only once per core! Therefore HT doesn't allow "true" parallel execution - only one pipeline can be "active" at a time. But HT helps to keep the execution units busy. When one pipeline has to wait (for example because a cache miss has happened and it needs to wait for the RAM), the other pipeline can proceed immediately. Without HT the execution units would be idle during this time.

Hyper-Threading works by duplicating certain sections of the processor—those that store the architectural state—but not duplicating the main execution resources. This allows a Hyper-Threading equipped processor to pretend to be two "logical" processors to the host operating system, allowing the operating system to schedule two threads or processes simultaneously. When execution resources in a non-Hyper-Threading capable processor are not used by the current task, and especially when the processor is stalled, a Hyper-Threading equipped processor may use those execution resources to execute another scheduled task. (The processor may stall due to a cache miss, branch misprediction, or data dependency.) Except for its performance implications, this innovation is transparent to operating systems and programs. All that is required to take advantage of Hyper-Threading is symmetric multiprocessing (SMP) support in the operating system, as the logical processors appear as standard separate processors.


Also: n threads would only be sufficient to keep n cores busy all the time if none of these threads ever had to synchronize with another thread. Obviously in "real life" applications that isn't always the case. Hence it may be necessary to use more threads than cores/processors to squeeze out the maximum performance. I think the 3/2 factor for x264 was found by testing...

kemuri-_9
27th November 2008, 19:41
ha ha ha, i figured i didn't really understand the HT concept. thanks for clearing that up.

Shinigami-Sama
27th November 2008, 19:52
ha ha ha, i figured i didn't really understand the HT concept. thanks for clearing that up.

its pretty funky yeah

maybe x264 could detect physical cores and do cores * 1.5, then count logical cores and just add logical * 0.5 to that, to not bump into excessive threading overhead on HT boxes?

roozhou
28th November 2008, 03:17
then the only answer i have left is the OS's process scheduler.

with a 1:1 matching of threads to cores,
one of x264 threads will often be swapped out for other processes.
having it to where x264 threads can't always be on the cpus.

with a 3:2 matching of threads to cores,
there's x264 threads that are 'building' up on their priority to be processed next, since they're in the ready state.
so when the scheduler swaps off an x264 thread,
it is likely to swap it out with one of the other x264 threads that are ready,
giving other processes a further hassle to get processing time.

There is no time-consuming process other than x264 running. And Windows is not so stupid as to swap x264 out for the system idle process.

Again, please note that the need for cores*3/2 only shows up with x264 b-adapt 1. I have never seen such a thing in other programs before. I am quite sure this is caused by x264's bad implementation, not MS Windows.

roozhou
28th November 2008, 03:22
Moving an implementation that is not multi-threaded from one thread to another one doesn't make it multi-threaded :rolleyes:

Wrong if it does not depend on the results of the stages that follow it. That's why --thread-input gives better multi-threading.

LoRd_MuldeR
28th November 2008, 03:35
Wrong if it does not depend on the results of the stages that follow it. That's why --thread-input gives better multi-threading.

Don't you get it? The frame type decision currently is not multi-threaded at all. If other threads require the result of the frame type decision, these threads will need to wait for the one thread that is currently running the frame type decision. Moving the frame type decision from one thread to another one doesn't help at all, why should it? Also I don't think that moving the frame type decision to the input thread would be possible easily nor does it make any sense. If at all, the algorithm would need to be modified/extended in a way that it can be spread over several threads. And - according to the developers - this would require "a thousand-or-two line patch". However there seem to be some ideas on how it could be implemented at least...

http://forum.doom9.org/showpost.php?p=1215437&postcount=55

Dark Shikari
28th November 2008, 04:34
Cores*3/2 is because of internal algorithmic reasons, not OS scheduling or anything of the sort.

As you said, it's not as necessary without B-frames.

Manao
28th November 2008, 09:01
Even if there were no B-frame decision, you would still need more threads than CPUs to get the best speed. Encoding a frame doesn't take a constant time. So if you have two frames, the second references the first, and the second encodes faster than the first, then it must wait for the first to complete. If you have as many threads as CPUs and some threads are waiting, then some CPUs are idling, so you aren't using all your processing power. Hence, you need more threads than CPUs in order to use all the CPU cycles.

roozhou
28th November 2008, 09:41
Even if there were no bframe decision, you would still need to have more threads than CPU in order to get the best speed. Encoding a frame doesn't take a constant time. So if you have two frames, if the second one references the first, and if the second one encodes faster than the first one, then it must wait for the first one to complete. If you have as many threads as CPUs, and if some threads are waiting, then some CPU are idling, so you aren't using all your processing power. Hence, you need more threads than CPUs in order to use all the CPU cycles.

We should give the input thread a lower priority, keeping the input buffer as empty as possible. When the second thread needs to wait on first, the input thread gets a chance to fill the input frame buffer.

If other threads require the result of the frame type decision, these threads will need to wait for the one thread that is currently running the frame type decision.

Wrong again, LoRd_MuldeR! I am talking about frame type decision waiting on the results of other threads.

Dark Shikari
28th November 2008, 10:37
Wrong again, LoRd_MuldeR! I am talking about frame type decision waiting on the results of other threads.

That can't happen, because the frametype decision runs before all threads. How can what is run first become dependent on the results of other threads?

Manao
28th November 2008, 20:08
We should give the input thread a lower priority, keeping the input buffer as empty as possible. When the second thread needs to wait on first, the input thread gets a chance to fill the input frame buffer.

Ok, you have found a use for the input thread, provided that you have only 1 thread waiting, and that it has to wait less time than it takes for the input thread to do its computation. These two conditions rarely hold, thus the need to use more threads than CPUs.

roozhou
29th November 2008, 14:19
That can't happen, because the frametype decision runs before all threads. How can what is run first become dependent on the results of other threads?

Sounds good! So moving frametype decision to input-thread would be possible.


Ok, you have found a use for the input thread, provided that you have only 1 thread waiting, and that it has to wait less time than it takes for the input thread to do its computation. These two conditions rarely hold, thus the need to use more threads than CPUs.

That's right. So a larger input buffer should help.

LoRd_MuldeR
29th November 2008, 14:54
That can't happen, because the frametype decision runs before all threads. How can what is run first become dependent on the results of other threads?
Sounds good! So moving frametype decision to input-thread would be possible.

No, it's the explanation why it would be useless to move frametype decision :rolleyes:

roozhou
29th November 2008, 15:21
No, it's the explanation why it would be useless to move frametype decision :rolleyes:
I am thinking about the following sequence. Tell me what is wrong.

D -- decoding (input)
F -- frametype decision
E -- encoding (all other threads)


D1F1 D2F2 D3F3 D4F4
---- E1-- E2-- E3-- E4--

LoRd_MuldeR
29th November 2008, 15:27
I am thinking about the following sequence. Tell me what is wrong.

D -- decoding (input)
F -- frametype decision
E -- encoding (all other threads)


D1F1 D2F2 D3F3 D4F4
---- E1-- E2-- E3-- E4--

I think that is basically how it is implemented currently.

The problem is that you can't spawn another encoding thread, until the frame type decision (lookahead) has finished. And currently lookahead does not use several threads ...

roozhou
29th November 2008, 16:06
I think that is basically how it is implemented currently.

The problem is that you can't spawn another encoding thread, until the frame type decision (lookahead) has finished. And currently lookahead does not use several threads ...

If this is the current implementation, b-adapt 2 should not affect multi-threading. Unfortunately it does.

LoRd_MuldeR
29th November 2008, 16:11
If this is the current implementation, b-adapt 2 should not affect multi-threading. Unfortunately it does.

Obviously you don't want to understand. Either this or your knowledge of software development and of how threading works is too limited to understand it.

It has been explained more than enough, so I won't repeat it. Re-read this thread and try to understand what has been said...

roozhou
29th November 2008, 17:19
Obviously you don't want to understand. Either this or your knowledge of software development and of how threading works is too limited to understand it.

It has been explained more than enough, so I won't repeat it. Re-read this thread and try to understand what has been said...

AFAIK there are two ways to implement multi-threading. The first is parallel processing (slices) and the other is pipelining (the current x264 implementation).

Now let's treat x264 as a CPU and assume that encoding one frame = executing one instruction. So decoding (input) = prefetch, frametype decision = decoding, encoding = ALU + writeback + whatever. The slower frametype decision works (but still faster than encoding), the lesser effect it will have on multi-threading.

The question is b-adapt 2 is slower than b-adapt 1, but b-adapt 1 gives perfect multi-threading and b-adapt 2 doesn't. Any reasons?

LoRd_MuldeR
29th November 2008, 17:28
You still don't get it. As far as I understand, it works like that: x264 will fetch the next input frame. Then it will run frametype decision (lookahead), which is not threaded. When this part is done, it will spawn an encoder thread to encode the current frame. This encoder thread can keep on running while x264 proceeds to the next frame. So x264 will immediately fetch the next input frame and again run frametype decision (lookahead), then spawn another encoding thread. And so on. Now if the lookahead part takes very long, which is the case with --b-adapt 2, some of the running encoder threads may be finished before the lookahead is done and ready to spawn a new encoder thread. Hence there will be some idle CPU time. This problem simply doesn't show up with --b-adapt 1 because it is FAST.
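
The loop described above can be sketched as a small simulation. This is a toy model, not x264 code: it assumes constant lookahead and encode times and a free core for every running thread, and the function and parameter names are made up for illustration.

```python
def simulate(frames: int, lookahead_ms: float, encode_ms: float, max_encoders: int):
    """Return (total_time_ms, average_busy_threads) for a serial-lookahead loop."""
    if frames < 1 or max_encoders < 1:
        raise ValueError("frames and max_encoders must be at least 1")
    now = 0.0            # clock of the single-threaded lookahead loop
    finish_times = []    # finish times of encoder threads still running
    for _ in range(frames):
        now += lookahead_ms  # frame type decision runs serially
        # Forget encoders that finished while lookahead was running.
        finish_times = [t for t in finish_times if t > now]
        if len(finish_times) >= max_encoders:
            # No free slot: the loop blocks until the earliest encoder is done.
            now = min(finish_times)
            finish_times = [t for t in finish_times if t > now]
        finish_times.append(now + encode_ms)  # spawn an encoder for this frame
    total = max(finish_times)
    busy = frames * (lookahead_ms + encode_ms)  # thread-milliseconds of real work
    return total, busy / total


if __name__ == "__main__":
    for lookahead in (1, 5, 10):
        total, avg = simulate(1000, lookahead, 10, max_encoders=4)
        print(f"lookahead {lookahead:2d} ms: {total:.0f} ms total, {avg:.2f} threads busy on average")
```

In this model, a 5 ms lookahead with 10 ms encodes keeps only about three threads busy on average, however large max_encoders is made.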

ajp_anton
29th November 2008, 17:41
I am thinking about the following sequence. Tell me what is wrong.

D -- decoding (input)
F -- frametype decision
E -- encoding (all other threads)


D1F1 D2F2 D3F3 D4F4
---- E1-- E2-- E3-- E4--

If E is faster than D+F, there will be a time when F is running alone.

roozhou
29th November 2008, 17:54
You still don't get it. As far as I understand, it works like that: x264 will fetch the next input frame. Then it will run frametype decision (lookahead), which is not threaded. When this part is done, it will spawn an encoder thread to encode the current frame. This encoder thread can keep on running while x264 proceeds to the next frame. So x264 will immediately fetch the next input frame and again run frametype decision (lookahead), then spawn another encoding thread. And so on. Now if the lookahead part takes very long, which is the case with --b-adapt 2, some of the running encoder threads may be finished before the lookahead is done and ready to spawn a new encoder thread. Hence there will be some idle CPU time. This problem simply doesn't show up with --b-adapt 1 because it is FAST.

You mean b-adapt 2 is slower than encoding? That makes sense. But when i maxed encoding settings(e.g. subme 9, merange 64), it still gives me ~70% CPU load.

Sagekilla
29th November 2008, 17:58
roozhou: there is no such thing as pipeline based multithreading. There's pipelining, which is the process of doing several stages of a execution unit's pipeline at once (instruction fetch, decode, execution, memory access, etc). Multithreading is just duplicating all your execution units so you can have n threads feeding instructions, etc into the pipeline.

What x264 does is have the lookahead run on all decoded frames continuously, and the data received from that is used in the multithreaded (frame based) portion of encoding. lookahead only uses one thread, but if it were properly MT'ed it could feed the rest of the functions without starving them for data.

@roozhou's last post (11:54): Sort of. b-adapt 2 -is- much slower than b-adapt 1. If you have a sufficiently fast CPU, you can use absurdly slow settings and still not max out the CPU. The other possibility is that you ran into a decoding bottleneck.

LoRd_MuldeR
29th November 2008, 18:05
You mean b-adapt 2 is slower than encoding?

Not necessarily. It just isn't fast enough to be finished, before some of the encoder threads (that had been started at an earlier time!) are finished. Hence you are in a state where the lookahead loop is not ready to spawn a new thread yet -and- there are not enough running encoder threads left to keep all CPU cores busy.

roozhou
29th November 2008, 18:14
What x264 does is have the lookahead run on all decoded frames continously, and the data received from that is used in the multithreaded (frame based) portion of encoding. lookahead only uses one thread, but if it were properly MT'ed it could feed the rest of the functions without starving them for data.



It doesn't matter how many threads lookahead uses if it runs faster than encoding. And if it is slower than encoding either speeding up lookahead or getting it MT'ed will solve the problem.

LoRd_MuldeR
29th November 2008, 18:17
speeding up lookahead or getting it MT'ed will solve the problem.

This was told to you at the very beginning of this thread, but you insisted that moving the single-threaded lookahead to another thread would solve the problem :rolleyes:

Also keep in mind that encoding-time per frame varies from frame to frame...

roozhou
29th November 2008, 18:18
Not necessarily. It just isn't fast enough to be finished, before some of the encoder threads (that had been started at an earlier time!) are finished. Hence you are in a state where the lookahead loop is not ready to spawn a new thread yet -and- there are not enough running encoder threads left to keep all CPU cores busy.

This is the same as "frametype decision is slower than encoder".

LoRd_MuldeR
29th November 2008, 18:24
This is the same as "frametype decision is slower than encoder".

No, it isn't! Because the encoder threads run concurrently (concurrently to other encoder threads -and- concurrently to lookahead).

Hence the encoder threads that will finish first had been started at an earlier time than the current iteration of lookahead. They have some advance.

Lookahead must be ready before some of these "old" encoder threads are done, or we get idle CPU time.

Obviously being only faster than the previously created encoder thread won't be sufficient.


[EXAMPLE]

Assumption: Lookahead takes 5 ms, encode thread takes ~10 ms to process one frame, number of threads set to 4.

So every 5 ms a new encode thread can be spawned. Each encode thread takes 10 ms to finish (in reality they wouldn't take constant times to complete).

http://img366.imageshack.us/img366/2789/threadseq7.th.png (http://img366.imageshack.us/img366/2789/threadseq7.png)

[Blue = Lookahead Loop, Red = Encode Thread, Gray = Idle]

Obviously in this example lookahead is too slow to keep 4 encoder threads running concurrently, although lookahead is faster than encoding (5 ms -vs- 10 ms) :p

Conclusion: Lookahead faster than encoding is not sufficient. Even lookahead taking 1/2 time of encode is still too slow...
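
The numbers in this example can be checked with a line of arithmetic. Under the same idealized assumptions (constant times, one encoder spawned per lookahead pass), an encoder started every L ms and running for E ms means at most ceil(E / L) encoders overlap. A small Python sketch with a hypothetical function name:

```python
import math


def max_overlapping_encoders(lookahead_ms: float, encode_ms: float, thread_limit: int) -> int:
    """Most encoder threads that can run at once when one is spawned per lookahead pass."""
    if lookahead_ms <= 0 or encode_ms <= 0 or thread_limit < 1:
        raise ValueError("times must be positive and thread_limit at least 1")
    return min(thread_limit, math.ceil(encode_ms / lookahead_ms))


if __name__ == "__main__":
    encoders = max_overlapping_encoders(lookahead_ms=5, encode_ms=10, thread_limit=4)
    # Add 1 for the lookahead loop itself, which is busy the whole time.
    print("encoders running at once:", encoders)
    print("busy threads including lookahead:", encoders + 1)
```

For the 5 ms / 10 ms case this gives two encoders plus the lookahead loop, i.e. three busy threads, one short of what a quad core needs.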

roozhou
30th November 2008, 12:59
@LoRd_MuldeR

Your graph only makes sense with 5 physical cores. When we have only 3 cores there will be no "gaps".

According to your assumption, lookahead that takes longer than 1/(cores-1) of the encoding time per frame results in idle CPU time. In other words, encoding that takes at least (cores-1) times as long as lookahead results in perfect multi-threading. But in fact b-adapt 2 can never fill my dual core to 100% CPU load. No matter whether the overall speed is 20fps or 2fps, it always gives me 65%~70% CPU load.

LoRd_MuldeR
30th November 2008, 13:15
As the graph shows clearly, there are never more than three threads working concurrently.

Hence we won't be able to fully utilize a Quadcore. Obviously throwing even more threads at it wouldn't help (wouldn't be possible).
In this example lookahead is too slow and becomes the bottleneck. Which proves your theory - lookahead only needs to be faster than encode - wrong!
I chose the values so that lookahead takes only 1/2 the time of encode, and it's still far too slow to avoid becoming the bottleneck.

If you still refuse to understand, I'm out of ideas :rolleyes:

roozhou
30th November 2008, 16:32
As the graph shows clearly, there are never more than three threads working concurrently.

Hence we won't be able to fully utilize a Quadcore. Obviously throwing even more threads at it wouldn't help (wouldn't be possible).
In this example lookahead is too slow and becomes the bottleneck. Which proves your theory - lookahead only needs to be faster than encode - wrong!
I select the values so that lookahead takes only 1/2 time of encode and it's still way too slow to not become the bottleneck.

If you still refuse to understand, I'm out of ideas :rolleyes:

You didn't read my post carefully, did you?

In your graph if lookahead only takes 1/4 time of encoding, it will be ok. I was not refusing to understand, I was drawing a conclusion from your graph.

We have n cores and n-1 encoding thread.

Let
L = time to run lookahead on one frame.
E = time to encode one frame.

Lookahead becomes bottleneck only when L > E / (n-1).
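
The condition above, written out as a check. This is only the rule of thumb stated in this post, under its assumption of n cores, n-1 encoding threads and constant per-frame times; the function name is made up:

```python
def lookahead_is_bottleneck(lookahead_ms: float, encode_ms: float, cores: int) -> bool:
    """True when L > E / (n - 1), i.e. lookahead cannot keep n - 1 encoders fed."""
    if cores < 2:
        raise ValueError("the rule assumes at least 2 cores")
    return lookahead_ms > encode_ms / (cores - 1)


if __name__ == "__main__":
    # Values from the 5 ms / 10 ms example earlier in the thread.
    for cores in (2, 3, 4):
        print(cores, "cores:", lookahead_is_bottleneck(5, 10, cores))
```

With 5 ms lookahead and 10 ms encode the check flags a quad core but not a 3-core machine, and a lookahead of 1/4 the encode time passes on a quad core.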

Sagekilla
30th November 2008, 17:34
If you have n cores and give x264 n threads, there are n threads for encoding, not n-1, I believe. Lookahead is in its own thread regardless of x264's main encoding thread. Otherwise, by that logic, if you have a single-core CPU or specify --threads 1, there would be no threads for encoding.

roozhou
1st December 2008, 09:03
Well, that logic applies on multi-cores. And n threads or n-1 threads or even 2n threads don't make any difference.

LoRd_MuldeR
1st December 2008, 16:51
Still the point is: If lookahead is too slow, you can raise the number of threads to whatever n you want.

You won't be able to ever spawn n threads running at the same time, because before you spawned the n-th thread some of the previously spawned threads have already finished (terminated).

As long as lookahead is slow (and --b-adapt 2 is slow, but good) you probably won't be able to get 100% CPU load on your Quadcore in the first pass.

Accept that or submit a patch to make lookahead faster (ideas how to make lookahead multi-threaded have been mentioned).

roozhou
1st December 2008, 17:23
Still the point is: If lookahead is too slow, you can raise the number of threads to whatever n you want.

You won't be able to ever spawn n threads running at the same time, because before you spawned the n-th thread some of the previously spawned threads have already finished (terminated).

As long as lookahead is slow (and --b-adapt 2 is slow, but good) you probably won't be able to get 100% CPU load on your Quadcore in the first pass.

Accept that or submit a patch to make lookahead faster (ideas how to make lookahead multi-threaded have been mentioned).

Can you provide an insane setting slow enough to make b-adapt 2 using 100% CPU on my Dual Core? If there is not, you cannot prove your logic.

Snowknight26
1st December 2008, 17:25
--subme 9 with other relatively slow settings. 100% on my Q6600.

LoRd_MuldeR
1st December 2008, 17:27
Can you provide an insane setting slow enough to makes b-adapt 2 using 100% CPU on my Dual Core? If there is not, you cannot prove your logic.

I don't need to prove anything! That's because I never claimed that you can get 100% load on your CPU with "--b-adapt 2" during the first pass.

In fact I explained to you why the opposite is the case! And still you are not willing to understand the obvious...


This thread has become useless, because the same questions that have been answered already are asked over and over again :rolleyes:

Dark Shikari
1st December 2008, 17:45
Can you provide an insane setting slow enough to make b-adapt 2 using 100% CPU on my Dual Core? If there is not, you cannot prove your logic.

--me tesa --merange 64 --ref 16 :rolleyes:

Fishman0919
1st December 2008, 17:52
Can you provide an insane setting slow enough to make b-adapt 2 using 100% CPU on my Dual Core? If there is not, you cannot prove your logic.




--pass 1 --bitrate 3750 --stats "d:\temp\job1.stats" --level 4.0 --sar 1:1 --aud --vbv-bufsize 20000 --vbv-maxrate 20000 --filter 0,0 --ref 3 --mixed-refs --bframes 3 --b-pyramid --aq-mode 2 --aq-strength 1.0 --direct auto --b-adapt 2 --keyint 24 --min-keyint 2 --subme 9 --trellis 2 --partitions all --8x8dct --me tesa --merange 64 --no-fast-pskip --threads auto --thread-input --progress --no-psnr --no-ssim --output "d:\temp\video.264"

--pass 2 --bitrate 3750 --stats "d:\temp\job1.stats" --level 4.0 --sar 1:1 --aud --vbv-bufsize 20000 --vbv-maxrate 20000 --filter 0,0 --ref 3 --mixed-refs --bframes 3 --b-pyramid --aq-mode 2 --aq-strength 1.0 --keyint 24 --min-keyint 2 --weightb --direct auto --subme 9 --trellis 2 --partitions all --8x8dct --me tesa --merange 64 --no-fast-pskip --threads auto --thread-input --progress --no-psnr --no-ssim --output "d:\temp\video.264"

Seems to peg my Dual and Quad core at 100% cpu all the time with r1046

roozhou
1st December 2008, 18:47
--pass 1 --bitrate 3750 --stats "C:\temp\job1.stats" --level 4.0 --sar 1:1 --aud --vbv-bufsize 20000 --vbv-maxrate 20000 --filter 0,0 --ref 3 --mixed-refs --bframes 3 --b-pyramid --aq-mode 2 --aq-strength 1.0 --direct auto --b-adapt 2 --keyint 24 --min-keyint 2 --subme 9 --trellis 2 --partitions all --8x8dct --me tesa --merange 64 --no-fast-pskip --threads auto --thread-input --progress --no-psnr --no-ssim --output "D:\temp\video.264"

--pass 2 --bitrate 3750 --stats "C:\temp\job1.stats" --level 4.0 --sar 1:1 --aud --vbv-bufsize 20000 --vbv-maxrate 20000 --filter 0,0 --ref 3 --mixed-refs --bframes 3 --b-pyramid --aq-mode 2 --aq-strength 1.0 --keyint 24 --min-keyint 2 --weightb --direct auto --subme 9 --trellis 2 --partitions all --8x8dct --me tesa --merange 64 --no-fast-pskip --threads auto --thread-input --progress --no-psnr --no-ssim --output "D:\temp\video.264"

Seems to peg my Dual and Quad core at 100% cpu all the time with r1046

Thanks. It works but giving me ~0.2fps on 1080P.

LoRd_MuldeR
1st December 2008, 19:01
--me tesa --merange 64

Thanks. It works but giving me ~0.2fps on 1080P.

What a surprise :D

Fishman0919
1st December 2008, 19:02
Thanks. It works but giving me ~0.2fps on 1080P.

No one said it was going to be fast. :rolleyes:


"C:\x264\x264.exe" "C:\temp\Movie.avs" --pass 1 --bitrate 3750 --stats "C:\temp\Movie.stats" --level 4.0 --sar 1:1 --aud --vbv-bufsize 16500 --vbv-maxrate 17500 --filter 0,0 --ref 3 --mixed-refs --bframes 3 --b-pyramid --aq-mode 2 --aq-strength 1.0 --direct auto --b-adapt 1 --keyint 24 --min-keyint 2 --subme 3 --trellis 2 --partitions all --me hex --no-fast-pskip --threads auto --thread-input --progress --no-psnr --no-ssim --output "C:\temp\video.264"

"C:\x264\x264.exe" "C:\temp\Movie.avs" --pass 2 --bitrate 3750 --stats "C:\temp\Movie.stats" --level 4.0 --sar 1:1 --aud --vbv-bufsize 16500 --vbv-maxrate 17500 --filter 0,0 --ref 3 --mixed-refs --bframes 3 --b-pyramid --aq-mode 2 --aq-strength 1.0 --keyint 24 --min-keyint 2 --weightb --direct auto --subme 7 --trellis 2 --partitions all --8x8dct --me umh --merange 24 --no-fast-pskip --threads auto --thread-input --progress --no-psnr --no-ssim --output "C:\temp\video.264"

Still pegs both my Dual and Quad core

Sagekilla
2nd December 2008, 04:21
@roozhou: 100% CPU usage does not imply fast encoding. You can completely max out x264 using --me tesa --merange 512 --subme 9 --ref 16 --b-adapt 2 --bframes 16 and it'll spit out 0.1 fps @ 100% CPU.

Likewise, you can use settings so fast that you'll get 100 fps but only 50% CPU usage because of a decoding bottleneck.

100% CPU != Fast encode!

kemuri-_9
2nd December 2008, 06:09
It's become obvious that he's looking for the 'efficient' encode:
that is that everything is in a perfectly ideal situation,
where there's no particular bottleneck in any one part of the encode to slow the overall rate down.

obviously complaining that --b-adapt 2 is a bottleneck on non super extreme settings.

On those super extreme settings things are more balanced and can actually use 100% cpu with 1:1 cpu:thread ratio.

This is not the case when you back down from those settings to more sane ones though, and that's where the problem lies.

and from what Dark Shikari & akupenguin have said,
to fix --b-adapt 2 to fit more into the ideal situation, the code rewrite would be fairly non trivial.

so obviously, solutions to the situation are
A. don't use --b-adapt 2
B. help fix it
C. shut up and wait for it to get fixed

</annoyed rant>

roozhou
2nd December 2008, 11:39
@roozhou: 100% CPU usage does not imply fast encoding. You can completely max out x264 using --me tesa --merange 512 --subme 9 --ref 16 --b-adapt 2 --bframes 16 and it'll spit out 0.1 fps @ 100% CPU.

Likewise, you can use settings so fast that you'll get 100 fps but only 50% CPU usage because of a decoding bottleneck.

100% CPU != Fast encode!

You don't have to tell me such things that everyone here has already known.

Actually i was pointing out how slow b-adapt 2 is.

On my Duo Core.

--crf 24 --ref 3 --mixed-refs --bframes 3 --b-pyramid --keyint 240 --min-keyint 6 --weightb --direct auto --subme 8 --trellis 2 --partitions all --8x8dct --me umh --merange 24 --no-fast-pskip --threads 4 --thread-input --progress --no-psnr --no-ssim -o NUL --b-adapt 2

gives me 1.55fps and 100% CPU load.


--crf 24 --ref 3 --mixed-refs --bframes 3 --b-pyramid --keyint 240 --min-keyint 6 --weightb --direct auto --subme 8 --trellis 2 --partitions all --8x8dct --me umh --merange 24 --no-fast-pskip --threads 3 --thread-input --progress --no-psnr --no-ssim -o NUL --b-adapt 2

gives me 1.38fps and 85%~90% CPU load.

It proves 3*cores/2 threads wrong and "threads auto" should be more intelligent.
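
The two results above, compared by fps rather than CPU load. A small Python sketch using only the figures quoted in this post; the helper name is made up:

```python
def relative_gain(fps_new: float, fps_old: float) -> float:
    """Fractional speed gain of fps_new over fps_old (0.10 means 10% faster)."""
    if fps_old <= 0:
        raise ValueError("fps_old must be positive")
    return fps_new / fps_old - 1.0


if __name__ == "__main__":
    # --threads 4 gave 1.55 fps, --threads 3 gave 1.38 fps on the same dual core.
    gain = relative_gain(1.55, 1.38)
    print(f"--threads 4 is about {gain:.0%} faster than --threads 3")
```

That is a gain of roughly 12% from one extra thread, on one source and one machine.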

nurbs
2nd December 2008, 12:23
You don't have to tell me such things that everyone here has already known.

Actually i was pointing out how slow b-adapt 2 is.

Which also falls under "such things that everyone here has already known" :)

Shinigami-Sama
2nd December 2008, 19:21
You don't have to tell me such things that everyone here has already known.

Actually i was pointing out how slow b-adapt 2 is.


pot meet kettle?

but I do agree threads auto could use a little more brains but a quick trip to #x264 will set people straight...

Sagekilla
3rd December 2008, 03:14
1.5 * cores is just a roundabout "One size fits most." ANY time you want optimal speed or quality you should tweak your settings to your source and to your hardware. This can be time consuming so it's easier to use an automatic detection that will work for 99% of all cases. Not all CPUs are the same, so some may do better with 3 threads and others may do better with 2 or 4 threads. The devil is in the details..
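
For reference, the rule of thumb itself is a one-liner. A sketch in Python, assuming integer division as the rounding rule; it illustrates the formula discussed in this thread and is not taken from x264's source:

```python
import os


def rule_of_thumb_threads(cores: int) -> int:
    """cores * 3 / 2, rounded down, but never fewer than 1 thread."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return max(1, cores * 3 // 2)


if __name__ == "__main__":
    # os.cpu_count() reports logical processors and may return None.
    detected = os.cpu_count() or 1
    print(f"{detected} logical CPUs -> {rule_of_thumb_threads(detected)} threads")
```

Treat the result as a starting point and compare fps at neighbouring thread counts on your own source and hardware.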