View Full Version : x265 on Ryzen 7950x not using all CPU resources


MonoS
29th December 2023, 18:35
Hi, I have a system with a Ryzen 7950X. A few days ago I started a 4K encode with the following settings:

vspipe "test.vpy" - -c y4m | x265 --crf 17 --preset veryslow --master-display "G(8500,39850)B(6550,2300)R(35400,14600)WP(15635,16450)L(10000000,1)"
--hme --hme-range 16,32,48 --hme-search dia,umh,star --deblock -1:1 --sao --cbqpoffs -1 --crqpoffs -1 --min-keyint 1 --keyint 1440 --rskip 0
--no-early-skip --rd-refine --aq-mode 4 --colormatrix 9 --transfer 16 --colorprim 9 --selective-sao 2 --sao-non-deblock --limit-sao --subme 5
--qg-size 8 --tu-intra-depth 4 --tu-inter-depth 4 --rc-lookahead 60 --y4m --hdr10 --hdr10-opt --psy-rd 2 --psy-rdoq 4 --aq-strength 0.6 --asm avx512
--no-rect --no-amp - "test.hevc"


y4m [info]: 3840x2076 fps 24000/1001 i420p10 frames 0 - 277022 of 277023
x265 [info]: Using preset veryslow & tune none
raw [info]: output file: output.hevc
x265 [info]: HEVC encoder version 3.5+97-ga456c6e73+3-g87155154d
x265 [info]: build info [Windows][clang 14.0.4][64 bit] Kyouko 10bit+8bit+12bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512
x265 [warning]: Turning on repeat-headers for HDR compatibility
x265 [info]: Main 10 profile, Level-5 (Main tier)
x265 [info]: Thread pool created using 32 threads
x265 [info]: Slices : 1
x265 [info]: frame threads / pool features : 6 / wpp(33 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 4 inter / 4 intra
x265 [info]: HME L0,1,2 / range / subpel / merge : dia, umh, star / 48 / 5 / 5
x265 [info]: Keyframe min / max / scenecut / bias : 1 / 1440 / 40 / 5.00
x265 [info]: Cb/Cr QP Offset : -1 / -1
x265 [info]: Lookahead / bframes / badapt : 60 / 8 / 2
x265 [info]: b-pyramid / weightp / weightb : 1 / 1 / 1
x265 [info]: References / ref-limit cu / depth : 5 / off / off
x265 [info]: AQ: mode / str / qg-size / cu-tree : 4 / 0.6 / 8 / 1
x265 [info]: Rate Control / qCompress : CRF-17.0 / 0.60
x265 [info]: tools: rd=6 psy-rd=2.00 rdoq=2 psy-rdoq=4.00 rd-refine signhide
x265 [info]: tools: tmvp b-intra strong-intra-smoothing deblock(tC=-1:B=1)
x265 [info]: tools: sao-non-deblock selective-sao


The CPU utilization reaches 100% only some of the time, with almost a quarter of the time spent in a lower utilization range.
https://i.ibb.co/HKZspRv/overral.png (https://ibb.co/HKZspRv)
https://i.ibb.co/MCqBKvv/per-core.png (https://ibb.co/MCqBKvv)

I've tried a lot of different settings. In x265 CPU utilization issue.zip (https://www.mediafire.com/file/10zxvkdj1gv2jj3/x265+CPU+utilization+issue.zip/file) you can find a few files:

x265_bench.CSV and ffmpeg_x265_bench.CSV: raw CSVs generated by HWInfo during the encodes, with a 1s refresh interval
prove.txt: information about the different tests I ran, with start and end times (to cross-reference with the HWInfo CSVs), the command line used, the speed, and the resulting encode statistics
x265_bench_finale.ods: worksheet whose first sheet holds the cleaned data from the CSVs; the other sheets hold the statistics for some of the tests. Cell C11 is the average CPU utilization during the encode, and columns C and D of the graph are respectively the "Core Usage average" and the "5s average of the core usage average"
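
For anyone who wants to reproduce the two worksheet figures without the spreadsheet, here is a minimal Python sketch of an overall average and a 5-sample trailing average over 1s readings. The sample values are made up for illustration and are not taken from the CSVs, and the worksheet may compute its 5s average differently (for example centered rather than trailing).

```python
from statistics import mean

def moving_average(samples, window=5):
    """Trailing moving average; the first window-1 outputs use fewer samples."""
    out = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1): i + 1]
        out.append(mean(chunk))
    return out

# Hypothetical 1 s "core usage average" readings in percent (not real data)
usage = [100.0, 98.0, 60.0, 55.0, 97.0, 100.0, 99.0]

overall = mean(usage)                        # counterpart of cell C11
smoothed = moving_average(usage, window=5)   # counterpart of column D

print(f"average utilization: {overall:.1f}%")
print("5s trailing average:", [round(v, 1) for v in smoothed])
```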


For the input I'm using a simple VS script just for indexing and cropping. By itself it runs at 160fps, so I'm sure it is not the bottleneck; just to be sure, you'll also find a test with FFmpeg as the input pipe.

What could be the issue? Is there a particular setting that hinders parallelization, or does x265 have trouble parallelizing across that many threads?
I've noticed that veryslow by itself is capable of saturating my setup, with a 96% average usage, but when rect and amp are disabled, usage drops to 82%, as in all the other tests I've measured.

rwill
29th December 2023, 19:04
I think that's just the other frame threads waiting for the highest-level P or I frame to complete. There is nothing to worry about.

RanmaCanada
29th December 2023, 20:09
This is pretty normal.

Rumbah
29th December 2023, 23:45
You could try to split your file in two and encode those parts in parallel.
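
A rough sketch of how the frame ranges for such a split could be worked out, using the 277023-frame count from the y4m line in the first post. The helper name is made up, and how the ranges are handed to the script or encoder, and how the parts are joined afterwards, is left open here.

```python
def split_ranges(total_frames, parts):
    """Return (start, count) pairs covering total_frames in contiguous chunks."""
    base, extra = divmod(total_frames, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks take one additional frame each
        count = base + (1 if i < extra else 0)
        ranges.append((start, count))
        start += count
    return ranges

# Frame count reported in the first post's log
ranges = split_ranges(277023, 2)
for start, count in ranges:
    print(f"start frame {start}, {count} frames")
```

Each part then runs as its own encoder instance, so each gets its own lookahead and rate control state; whether that matters at the cut point depends on the content.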

benwaggoner
2nd January 2024, 03:55
If you want to use all your cores, try --pmode. That can speed up encoding lower resolutions on many-core processors quite a bit. But it is less efficient in terms of watts/pixel, and can slow things down if you were already hitting 50% utilization sometimes.

--pme is the even more extreme threads-for-fps tradeoff than --pmode, and is almost always counterproductive in my experience. Maybe if encoding 360p on 64 cores or something?

MonoS
4th January 2024, 00:20
If you want to use all your cores, try --pmode. That can speed up encoding lower resolutions on many-core processors quite a bit. But it is less efficient in terms of watts/pixel, and can slow things down if you were already hitting 50% utilization sometimes.

--pme is the even more extreme threads-for-fps tradeoff than --pmode, and is almost always counterproductive in my experience. Maybe if encoding 360p on 64 cores or something?

This is what I tried. If you download the zip I attached, you'll see that pmode+pme (I can try enabling only pmode, if you think it may help) does improve performance quite substantially against the std settings, by about 25%, but it does not change the CPU utilization, which goes from 83% to 81%.

Asmodian
4th January 2024, 19:55
I would suggest using only --pmode with a 7950x.

does improve performance quite substantially against the std settings, about 25%

This is a huge performance change! The CPU is obviously doing more work/time.

, but does not change the CPU utilization which goes from 83% to 81%

I wonder if there is an issue with the measurement of CPU utilization? The split architecture is weird still. Power draw would probably be a more accurate indication of how busy the CPU really is.

If all you care about is maximizing CPU utilization then use placebo settings! ;)

Atak_Snajpera
5th January 2024, 00:48
--ctu 32

guest
5th January 2024, 02:33
What app are you using?

This is the x265 command I generally use for 1080p & especially for 4K...

--level 6.2 --profile main10 --hdr10 --output-depth 10 --ctu 64 --high-tier --repeat-headers --vbv-bufsize 800000 --vbv-maxrate 800000 --asm avx512

All the other info is displayed in a different window once the process starts, and most of it depends on the video files & x265 itself, AFAIK.

I have a 7950X & i9-13900KF, and except for the avx512, (13900KF does not support 512) it seems to work well.

--ctu, -s <64|32|16>
Maximum CU size (width and height). The larger the maximum CU size, the more efficiently x265 can encode flat areas of the picture, giving large reductions in bitrate. However, this comes at a loss of parallelism with fewer rows of CUs that can be encoded in parallel, and less frame parallelism as well. Because of this the faster presets use a CU size of 32. Default: 64

https://x265.readthedocs.io/en/master/cli.html

Boulder
5th January 2024, 06:45
CTU 32 multithreads much better than 64 because the frame is split into more parts. I have not seen any real benefits out of CTU 64 compared to 32 concerning detail retention. Or anything else, to be exact.
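
To put numbers on "more parts": the log in the first post reports wpp(33 rows) for a 2076-pixel-high frame at CTU 64, which is consistent with the row count being the frame height divided by the CTU size, rounded up. A small Python sketch under that assumption (the function name is just for illustration):

```python
def ctu_rows(height, ctu_size):
    """Number of CTU rows, assuming rows = ceil(height / ctu_size)."""
    return -(-height // ctu_size)  # integer ceiling division

# 2076 is the frame height from the first post's log
rows = {ctu: ctu_rows(2076, ctu) for ctu in (64, 32, 16)}
for ctu, n in rows.items():
    print(f"--ctu {ctu}: {n} rows")
```

Roughly twice as many rows are available to WPP at CTU 32, though more rows alone do not guarantee higher throughput.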

guest
5th January 2024, 07:22
CTU 32 multithreads much better than 64 because the frame is split into more parts. I have not seen any real benefits out of CTU 64 compared to 32 concerning detail retention. Or anything else, to be exact.

Each to their own, I guess.

Have read that 4K does benefit from 64.

Interesting the default is 64.

Boulder
5th January 2024, 13:25
Each to their own, I guess.

Have read that 4K does benefit from 64.

Interesting the default is 64.
The default is 64 even for SD :devil: It should at least be adaptive based on the input resolution, but no..

What I've noticed with CTU 64 is that there are sometimes issues with flat areas with some noise, they may start exhibiting the floating noise artifact much more easily than with CTU 32. The increase in compression efficiency is also something which really does not apply at least when using CRF. I have yet to see any drastic filesize reductions even with material containing a lot of flat areas where the bigger size should excel. And there might be other uncovered issues than the problem when CTU 64 is combined with --limit-tu 0 and --rskip 2 (https://forum.doom9.org/showthread.php?p=1919347#post1919347)

benwaggoner
9th January 2024, 20:09
The default is 64 even for SD :devil: It should at least be adaptive based on the input resolution, but no.
Yeah, there are several settings that should be frame size adaptive, also including --frame-threads (higher with more cores, lower with more rows). --qg-size also should be adaptive

What I've noticed with CTU 64 is that there are sometimes issues with flat areas with some noise, they may start exhibiting the floating noise artifact much more easily than with CTU 32. The increase in compression efficiency is also something which really does not apply at least when using CRF. I have yet to see any drastic filesize reductions even with material containing a lot of flat areas where the bigger size should excel. And there might be other uncovered issues than the problem when CTU 64 is combined with --limit-tu 0 and --rskip 2 (https://forum.doom9.org/showthread.php?p=1919347#post1919347)
I'm 100% on --ctu 32 being the appropriate default for under 1080p.

Losko
10th January 2024, 17:51
For my encodings (always <= 1080p) I usually set:
--ctu 64 for anime content
--ctu 32 for everything else (including Pixar movies)

benwaggoner
11th January 2024, 19:20
For my encodings (always <= 1080p) I use to set:
--ctu 64 for anime content
--ctu 32 for everything else (including Pixar movies)
So, pretty much discrete tone versus continuous tone, then?

Which makes intuitive sense.

Have you found any issues with grainy anime using --ctu 64?

Losko
12th January 2024, 00:21
So, pretty much discrete tone versus continuous tone, then?

Which makes intuitive sense.
Do not overestimate my mastering of x265 (pretty average), as those above only come after years of lurking doom9 :D .

Have you found any issues with grainy anime using --ctu 64?
I managed to encode only a few anime movies with very light grain (most of them are pretty clean), and on those with --ctu 64 I just noticed grain becoming a light (barely noticeable) artifact: it was not removed, and it was not kept. But note the target bitrate was very low (somewhere around 1Mbps).

MonoS
13th January 2024, 23:36
Today I ran additional tests with all of your suggestions. You can find the updated files here: x265 CPU utilization issue 2.zip (https://www.mediafire.com/file/4hwarmhhvrrdfnl/x265_CPU_utilization_issue_2.zip/file)

First, I've added two columns to the initial data: as Asmodian suggested, the CPU power draw and its percentage of the maximum.
Second, I've rerun my current configuration (std pmode+pme) to have consistent results. As a matter of fact the encode time changed slightly; maybe some process was hogging resources the first time.

Now, on to the results.
As benwaggoner suggested, I tried pmode alone: it slightly increased performance, by 4s or 0.01fps, and decreased CPU utilization from 80.7% to 79.3%.
I then tried both lowering and raising frame threads, as rwill suggested. Lowering it to 3 decreased performance to 1.22fps, from a baseline of 1.29fps, so I didn't bother to plot it on the worksheet, but raising it to 9 improved performance to 1.32fps and utilization by 0.7%.
To finish, I also tried CTU 32, as many of you suggested. It tanked both performance and utilization, lowering them to 1.04fps and 60.6%, and the bitrate rose by about 0.67Mbps, from 19968.66kbps to 20635.28kbps. In theory it raised performance: if the CPU were fully utilized it would encode at about 1.74fps, compared to an estimated 1.62fps for "std pmode+pme fthreads 9" at 100% utilization.
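
The 100%-utilization estimates above come from dividing the measured speed by the utilization fraction, which assumes speed scales linearly with utilization. A minimal Python sketch using the rounded figures from this post; the 81.4% for frame threads 9 is an assumption (80.7% plus the 0.7% gain), and the function name is illustrative.

```python
def fps_at_full_load(fps, utilization_pct):
    """Linear extrapolation of encode speed to 100% CPU utilization."""
    return fps / (utilization_pct / 100.0)

ctu32 = fps_at_full_load(1.04, 60.6)      # CTU 32 run
fthreads9 = fps_at_full_load(1.32, 81.4)  # assumed 80.7% + 0.7%

print(f"CTU 32 at full load:     {ctu32:.2f} fps")
print(f"fthreads 9 at full load: {fthreads9:.2f} fps")
```

With these rounded inputs the CTU 32 estimate comes out near 1.72fps rather than 1.74fps; the small gap is presumably down to the worksheet using unrounded values.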

As soon as I can, I'll run some additional tests with only pmode, with different frame thread counts and CTU 32, so let me know if you want me to test something or have some insight into what could be happening.

Atak_Snajpera
14th January 2024, 17:42
https://i.ibb.co/HKZspRv/overral.png
Are you sure that bottleneck is not in your script?
It looks like encoder is waiting for frames from script.

MonoS
14th January 2024, 17:57
https://i.ibb.co/HKZspRv/overral.png
Are you sure that bottleneck is not in your script?
It looks like encoder is waiting for frames from script.

As I wrote in my first post:
For the input I'm using a simple VS script just for indexing and cropping. By itself it runs at 160fps, so I'm sure it is not the bottleneck; just to be sure, you'll also find a test with FFmpeg as the input pipe.

If you want, I can try something different. I thought about testing a BlankClip script, but it would not be a proper test in my opinion.

Atak_Snajpera
14th January 2024, 22:56
Do you have the same problem in AviSynth?

benwaggoner
23rd January 2024, 23:40
https://i.ibb.co/HKZspRv/overral.png
Are you sure that bottleneck is not in your script?
It looks like encoder is waiting for frames from script.
This looks like a very standard x265 high-core-count encode to me. I've never seen it stay at a steady 100%. I think the valleys in performance correspond with P-frame frequency. Since non-reference b-frames can be encoded in parallel, some sort of variance like that would make sense.