Encoding with Threadripper - low CPU usage


blublub
24th September 2018, 20:28
Hi

I have a new Threadripper system with a 2950X CPU (16 cores/32 threads).

I noticed that when I encode a 1080p source with x265, my CPU usage is between 55% and 65% - so not even close to saturating my CPU. With x264, all cores max out at 99%.

The Stilt once wrote that this is because x265 relies on AVX, which is a weak point of the Zen architecture.

When I encode one job with x265 at preset medium, I get around 20 fps, and CPU load is, as mentioned, quite low.

Contrary to The Stilt's comment, I can run 2 encodes simultaneously at a combined 26-30 fps. The catch is that if AVX were the bottleneck, the second job should choke on the already maxed-out AVX load, and my total with 2 jobs should not exceed the 20 fps I get with a single job.

So from my limited perspective, AVX can't be the reason x265 doesn't max out my 16-core machine - there must be another reason.
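To put a number on that argument, here is a minimal Python sketch using only the figures from this post (20 fps for one job, 26-30 fps combined for two). The function name is made up for illustration.

```python
def throughput_gain(single_fps: float, combined_fps: float) -> float:
    """Relative gain of parallel jobs over one job (0.3 means +30%)."""
    return combined_fps / single_fps - 1.0


single = 20.0  # fps with one x265 job, as reported above
for combined in (26.0, 30.0):  # combined fps with two jobs, as reported above
    gain = throughput_gain(single, combined)
    print(f"{combined:.0f} fps combined: {gain:+.0%} over a single job")
```

If a single job already saturated some shared resource, the gain would be expected to sit near zero instead.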

With some experimenting I found the options pme and pmode, which bring CPU load up to at least 80-85%. Sadly, in my benchmarks they did not decrease encoding time.

Is there any option that lets me make use of that many cores, either for higher quality or for higher speed, without slowing the encode down?

Edit: typos and semantics

Forteen88
25th September 2018, 07:50
I have a new AMD Threadripper system with a 2950x 16c/32threads CPU.

Is there any option that lets me make use of that many cores, either for higher quality or for higher speed, without slowing the encode down?

I think most users increase x265's --threads so that 100% of the CPU is used, but increasing --threads makes image quality worse. In x265, though, more threads don't harm image quality as much as in x264.

blublub
25th September 2018, 08:44
I think most users increase x265's --threads so that 100% of the CPU is used, but increasing --threads makes image quality worse. In x265, though, more threads don't harm image quality as much as in x264.

Ok, but how do I do this?

I tried changing numa-pools from 32 to 64 (32 was the automatic setting), but that didn't change anything.

Forteen88
25th September 2018, 09:25
Ok, but how do I do this?
I tried changing numa-pools from 32 to 64 (32 was the automatic setting), but that didn't change anything.

Oh, excuse me, I thought that threads in x265 worked like in x264, where you only need to set --threads to a high number.

This helps a little, though:
"--merange can have a negative impact on frame parallelism. If the range is too large, more rows of CTU lag must be added to ensure those pixels are available in the reference frames."
https://x265.readthedocs.io/en/default/threading.html

x265's default --merange is too high for this; it's set for UHD-resolution encodes:
https://forum.doom9.org/showthread.php?p=1655243#post1655243

microchip8
25th September 2018, 09:49
Have you tried --pme and --pmode?

Atak_Snajpera
25th September 2018, 13:30
I noticed that when I encode a 1080p source with x265, my CPU usage is between 55% and 65% - so not even close to saturating my CPU.

The default CTU size in x265 is 64, so
1080 / 64 = ~17 CTU rows, i.e. ~17 threads (~53% CPU usage on your 32 threads)

Solution: reduce CU size to 32

--ctu 32
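The arithmetic above can be sketched in Python. This is only a rough model: it assumes one worker per CTU row within a single frame and ignores frame threads, row dependencies and everything else the encoder does, so treat the percentages as an estimate, not as what x265 will actually report. The function names are made up for illustration.

```python
import math


def ctu_rows(height: int, ctu: int) -> int:
    """Number of CTU rows in a frame (the last row may be partial)."""
    return math.ceil(height / ctu)


def row_limited_usage(height: int, ctu: int, logical_cpus: int) -> float:
    """Rough share of logical CPUs that one frame's rows could keep busy."""
    return min(1.0, ctu_rows(height, ctu) / logical_cpus)


# 1080p on a 16-core/32-thread CPU, as in this thread
for ctu in (64, 32):
    rows = ctu_rows(1080, ctu)
    usage = row_limited_usage(1080, ctu, 32)
    print(f"--ctu {ctu}: {rows} rows, up to ~{usage:.0%} of 32 threads")
```

With --ctu 64 this gives 17 rows (about 53%), and with --ctu 32 it gives 34 rows, which is enough to cover all 32 threads in this simple model.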

tuanden0
25th September 2018, 17:29
Default CU size in x265 is 64 so
1080p / 64 = ~17 threads (~53% cpu usage)

Solution: reduce CU size to 32

--ctu 32

I'm using a Ryzen 7 2700 and had the same issue as the OP. I tried reducing the CTU size; now my CPU is fully loaded and the speed went up somewhat.
Is there any harm to quality if I reduce the CTU size?

Thank you!

Atak_Snajpera
25th September 2018, 17:53
64 is not that useful below UHD so quality will be practically the same.

benwaggoner
26th September 2018, 00:41
I think most users increase x265's --threads so that 100% of the CPU is used, but increasing --threads makes image quality worse. In x265, though, more threads don't harm image quality as much as in x264.

Using >1 --frame-threads isn't as bad as it was in prior versions of x265, but -F 2 is about the max I'd use for high-quality encoding.

I'm doubtful that x264 is more parallelized than x265. It's likely that x264 is hitting 100% CPU because it is launching a crazy number of process threads or something. It is easier to parallelize H.264 for high quality since there are a lot more tools in HEVC that can have cascading effects on quality.

blublub
26th September 2018, 06:43
Default CU size in x265 is 64 so
1080p / 64 = ~17 threads (~53% cpu usage)

Solution: reduce CU size to 32

--ctu 32

That did it!

Now I just need to figure out how to retain more detail in x265 so that it looks equivalent to x264 in Blu-ray encodes at CRF 17/18.

EDIT:
OK, with frame-threads set to 2, CPU load goes back down to 75%, but that's still better than 60%. Is there any disadvantage to reducing the CTU size even more?

blublub
26th September 2018, 06:44
Using >1 --frame-threads isn't as bad as it was in prior versions of x265, but -F 2 is about the max I'd use for high-quality encoding.

I'm doubtful that x264 is more parallelized than x265. It's likely that x264 is hitting 100% CPU because it is launching a crazy number of process threads or something. It is easier to parallelize H.264 for high quality since there are a lot more tools in HEVC that can have cascading effects on quality.

Interesting because x265 defaults to "frame-threads=5" on my system!?

blublub
26th September 2018, 06:55
have you tried --pme and --pmode ?

Just tried that again.

With pme and pmode, CPU usage goes up but encoding slows down - not really a benefit...

Forteen88
26th September 2018, 16:10
blublub, did you change --merange as I wrote? I don't know if it fixes your problem much, though.
If you encode at 1080p, try changing it to --merange 32.

blublub
26th September 2018, 18:24
blublub, did you change --merange as I wrote? I don't know if it fixes your problem much, though.
If you encode at 1080p, try changing it to --merange 32.

It increases CPU load by about 10%.

Question:
Does it reduce quality?

If I reduce CTU to 32, what would be the correct merange according to this:

"
--merange <integer>

Motion search range. Default 57

The default is derived from the default CTU size (64) minus the luma interpolation half-length (4) minus maximum subpel distance (2) minus one extra pixel just in case the hex search method is used. If the search range were any larger than this, another CTU row of latency would be required for reference frames.

Range of values: an integer from 0 to 32768
"
Would it be 25?

Forteen88
26th September 2018, 20:45
It increases CPU load by about 10%.
Question: Does it reduce quality?

If --merange works like in x264, --merange 32 is quite a lot, since x264's default --merange is 16! If you encode at a higher resolution, like UHD, the x265 default is good, but you can set a lower merange at lower resolutions.
I've read here at Doom9 that setting merange too high in x264 can actually worsen image quality.
I don't know the correct merange setting that you asked about, though.

blublub
26th September 2018, 20:48
Mhh interesting

So theoretically setting it to 20 - which is within some x264 presets - should be good enough.

benwaggoner
27th September 2018, 19:14
Interesting because x265 defaults to "frame-threads=5" on my system!?
The default is based on the number of cores in your system. And it's often higher than is useful for higher resolutions or slower presets.

The quality hit is smaller than it used to be, and may be only in edge cases. You might try lowering it to the lowest value that doesn't negatively impact speed. There is some overhead in having multiple threads running at once, so using a too-high value can actually make things a little slower.
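One way to apply that advice is to benchmark a short clip at several --frame-threads values and keep the lowest one that is still within a small margin of the fastest run. A minimal Python sketch, assuming you have already measured the fps yourself; the numbers below are placeholders made up for illustration, not real benchmark results, and the 2% tolerance is an arbitrary choice.

```python
def lowest_good_frame_threads(fps_by_threads: dict, tolerance: float = 0.02) -> int:
    """Lowest frame-threads value whose fps is within `tolerance` of the best run."""
    best_fps = max(fps_by_threads.values())
    good = [n for n, fps in fps_by_threads.items()
            if fps >= best_fps * (1.0 - tolerance)]
    return min(good)


# PLACEHOLDER measurements (frame-threads -> fps); replace with your own runs
measured = {1: 14.0, 2: 19.5, 3: 19.8, 4: 19.7, 5: 19.2}
print(lowest_good_frame_threads(measured))
```

With these placeholder values the sketch picks 2, because going higher gains less than the tolerance.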

benwaggoner
27th September 2018, 19:20
It increases CPU load by about 10%.

Question:
Does it reduce quality?

If I reduce CTU to 32 what would be the correct me range according to this:

"
--merange <integer>

Motion search range. Default 57

The default is derived from the default CTU size (64) minus the luma interpolation half-length (4) minus maximum subpel distance (2) minus one extra pixel just in case the hex search method is used. If the search range were any larger than this, another CTU row of latency would be required for reference frames.

Range of values: an integer from 0 to 32768
"
Would it be 25?
I am not sure that merange operates the same in x264 and x265. Also, as HEVC has intra-frame prediction, it can potentially make better use of merange than H.264 can.

But trying lower values is certainly a reasonable test.


But yes, your math would be correct; 25 for hex motion search (default for superfast-medium), and 26 otherwise (thus ultrafast and slower-placebo).

That plus CTU 32 would increase the potential parallelization of WPP.
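The derivation quoted from the documentation can be written out as a small Python sketch. The constants come straight from the quoted text (luma interpolation half-length 4, maximum subpel distance 2, one extra pixel for hex search); the function name is made up, and this only reproduces the arithmetic, not anything x265 computes internally.

```python
def max_merange_without_extra_row(ctu: int, hex_search: bool = True) -> int:
    """Largest merange that avoids another CTU row of latency, per the quoted docs."""
    luma_interp_half_length = 4
    max_subpel_distance = 2
    hex_extra_pixel = 1 if hex_search else 0
    return ctu - luma_interp_half_length - max_subpel_distance - hex_extra_pixel


print(max_merange_without_extra_row(64))                    # default CTU, hex search
print(max_merange_without_extra_row(32))                    # --ctu 32, hex search
print(max_merange_without_extra_row(32, hex_search=False))  # --ctu 32, other search
```

This gives 57 for the default CTU size of 64, and 25 or 26 for CTU 32, matching the numbers above.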

benwaggoner
27th September 2018, 19:23
Just tried that again.

With pme and pmode, CPU usage goes up but encoding slows down - not really a benefit...
Don't try them together! So many threads :).

--pmode is the only one useful for SD-ish resolutions, and only then when there are a LOT of cores available. --pme is only helpful with LOTS of cores and low resolutions. I've never found a real-world case where it actually sped anything up.

I have found examples where --pmode helped.

blublub
27th September 2018, 19:35
What do you consider low-res? I am only looking at 1080p here.

blublub
27th September 2018, 22:08
Don't try them together! So many threads :).

--pmode is the only one useful for SD-ish resolutions, and only then when there are a LOT of cores available. --pme is only helpful with LOTS of cores and low resolutions. I've never found a real-world case where it actually sped anything up.

I have found examples where --pmode helped.


Ok, I dropped pmode and pme now.

I'm currently testing:
no-sao:ctu=32:merange=25:me=3:subme=4:rd=4:qg-size=16:deblock=-1,-1:no-strong-intra-smoothing:frame-threads=2

And the quality looks very good. However, I still have to find a good way to compare the encodes frame by frame; AvsPmod is out of sync...
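For keeping track of test settings like these, a colon-separated option string can be split into a dictionary and compared between runs. A minimal Python sketch, assuming key=value pairs separated by ":" and a "no-" prefix for switched-off options; it also assumes the deblock entry was meant as deblock=-1,-1. The function name is made up, and this is not x265's own parser.

```python
def parse_param_string(params: str) -> dict:
    """Split 'a=1:no-b:c' into {'a': '1', 'b': False, 'c': True}."""
    options = {}
    for item in params.split(":"):
        if not item:
            continue  # tolerate stray separators
        key, sep, value = item.partition("=")
        if sep:
            options[key] = value          # key=value entry, kept as a string
        elif key.startswith("no-"):
            options[key[3:]] = False      # "no-sao" -> sao switched off
        else:
            options[key] = True           # bare flag switched on
    return options


settings = parse_param_string(
    "no-sao:ctu=32:merange=25:me=3:subme=4:rd=4:qg-size=16:"
    "deblock=-1,-1:no-strong-intra-smoothing:frame-threads=2"
)
for name, value in settings.items():
    print(f"{name} = {value}")
```

Note that a value containing ":" itself would be split incorrectly by this simple approach.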