Old 29th December 2023, 18:35   #1  |  Link
MonoS
Registered User
 
Join Date: Aug 2012
Posts: 203
x265 on Ryzen 7950x not using all CPU resources

Hi, I have a system with a Ryzen 7950x. In the past few days I've started a 4K encode with the following settings:
Code:
vspipe "test.vpy" - -c y4m | x265 --crf 17 --preset veryslow --master-display "G(8500,39850)B(6550,2300)R(35400,14600)WP(15635,16450)L(10000000,1)"
--hme --hme-range 16,32,48 --hme-search dia,umh,star --deblock -1:1 --sao --cbqpoffs -1 --crqpoffs -1 --min-keyint 1 --keyint 1440 --rskip 0
--no-early-skip --rd-refine --aq-mode 4 --colormatrix 9 --transfer 16 --colorprim 9 --selective-sao 2 --sao-non-deblock --limit-sao --subme 5
--qg-size 8 --tu-intra-depth 4 --tu-inter-depth 4 --rc-lookahead 60 --y4m --hdr10 --hdr10-opt --psy-rd 2 --psy-rdoq 4 --aq-strength 0.6 --asm avx512
--no-rect --no-amp - "test.hevc"
Code:
y4m  [info]: 3840x2076 fps 24000/1001 i420p10 frames 0 - 277022 of 277023
x265 [info]: Using preset veryslow & tune none
raw  [info]: output file: output.hevc
x265 [info]: HEVC encoder version 3.5+97-ga456c6e73+3-g87155154d
x265 [info]: build info [Windows][clang 14.0.4][64 bit] Kyouko 10bit+8bit+12bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512
x265 [warning]: Turning on repeat-headers for HDR compatibility
x265 [info]: Main 10 profile, Level-5 (Main tier)
x265 [info]: Thread pool created using 32 threads
x265 [info]: Slices                              : 1
x265 [info]: frame threads / pool features       : 6 / wpp(33 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 4 inter / 4 intra
x265 [info]: HME L0,1,2 / range / subpel / merge : dia, umh, star / 48 / 5 / 5
x265 [info]: Keyframe min / max / scenecut / bias  : 1 / 1440 / 40 / 5.00
x265 [info]: Cb/Cr QP Offset                     : -1 / -1
x265 [info]: Lookahead / bframes / badapt        : 60 / 8 / 2
x265 [info]: b-pyramid / weightp / weightb       : 1 / 1 / 1
x265 [info]: References / ref-limit  cu / depth  : 5 / off / off
x265 [info]: AQ: mode / str / qg-size / cu-tree  : 4 / 0.6 / 8 / 1
x265 [info]: Rate Control / qCompress            : CRF-17.0 / 0.60
x265 [info]: tools: rd=6 psy-rd=2.00 rdoq=2 psy-rdoq=4.00 rd-refine signhide
x265 [info]: tools: tmvp b-intra strong-intra-smoothing deblock(tC=-1:B=1)
x265 [info]: tools: sao-non-deblock selective-sao
CPU utilization reaches 100% only some of the time, with almost a quarter of the time spent in a lower utilization range.



I've tried a lot of different settings; in x265 CPU utilization issue.zip you can find a few files:
  • x265_bench.CSV and ffmpeg_x265_bench.CSV: raw CSVs generated by HWInfo during the encodes, with a 1 s refresh interval
  • prove.txt: information about the different tests I've made, with start and end times (to cross-reference with HWInfo's CSVs), the command line used, the speed, and the resulting encode statistics
  • x265_bench_finale.ods: worksheet where the first sheet holds the cleaned data from the CSVs and the other sheets hold the statistics for some of the tests; cell C11 is the average CPU utilization during the encode, and columns C and D of the graph are the "Core Usage average" and the "5 s average of the core usage average" respectively

For the input I'm using a simple VS script that just does indexing and cropping; by itself it runs at 160 fps, so I'm sure it is not the bottleneck. Just to be sure, you'll also find a test with FFmpeg feeding the input pipe.
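
In case anyone wants to reproduce the source-only speed test, something like the following should do it (just a sketch, assuming a Windows shell where NUL discards the output):
Code:
REM benchmark the VapourSynth script alone, discarding the y4m output
vspipe "test.vpy" NUL -c y4m --progress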

What could be the issue? Is there a particular setting that hinders parallelization, or does x265 have some problem parallelizing across that many threads?
I've noticed that veryslow by itself is capable of saturating my setup, with a 96% average usage, but when disabling rect and amp the usage drops to 82%, like in all the other tests I've measured.
Old 29th December 2023, 19:04   #2  |  Link
rwill
Registered User
 
Join Date: Dec 2013
Location: Berlin, Germany
Posts: 378
I think that's just the other frame threads waiting for the highest-level P or I frame to complete. There is nothing to worry about.
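
If you want to experiment with that anyway, the frame-thread count can be set explicitly instead of the auto-detected 6, e.g. (a sketch; the rest of the options from the first post are omitted here for brevity):
Code:
REM override the automatic frame-thread count (6 in your log) with an explicit value
vspipe "test.vpy" - -c y4m | x265 --y4m --crf 17 --preset veryslow --frame-threads 9 - "test.hevc"
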
Old 29th December 2023, 20:09   #3  |  Link
RanmaCanada
Registered User
 
Join Date: May 2009
Posts: 333
This is pretty normal.
Old 29th December 2023, 23:45   #4  |  Link
Rumbah
Registered User
 
Join Date: Mar 2003
Posts: 481
You could try to split your file in two and encode those parts in parallel.
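
A rough sketch of that idea, assuming vspipe's -s/-e frame-range options, running the two jobs in separate shells and joining the halves afterwards (e.g. with mkvmerge); the split point is just the midpoint of the 277023 frames:
Code:
REM encode the two halves in parallel, then concatenate the resulting streams
vspipe "test.vpy" - -c y4m -s 0 -e 138511 | x265 --y4m --crf 17 --preset veryslow - "part1.hevc"
vspipe "test.vpy" - -c y4m -s 138512 -e 277022 | x265 --y4m --crf 17 --preset veryslow - "part2.hevc"
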
Old 2nd January 2024, 03:55   #5  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,824
If you want to use all your cores, try --pmode. That can speed up encoding lower resolutions on many-core processors quite a bit. But it is less efficient in terms of watts/pixel, and can slow things down if you were already hitting 50% utilization sometimes.

--pme is an even more extreme threads-for-fps tradeoff than --pmode, and is almost always counterproductive in my experience. Maybe if encoding 360p on 64 cores or something?
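
On the command from the first post that would just mean adding one switch, e.g. (a sketch; all the other options from post #1 are left out here for brevity):
Code:
REM add --pmode (parallel mode decision); keep the rest of the options as before
vspipe "test.vpy" - -c y4m | x265 --y4m --crf 17 --preset veryslow --pmode - "test.hevc"
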
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 4th January 2024, 00:20   #6  |  Link
MonoS
Registered User
 
Join Date: Aug 2012
Posts: 203
Quote:
Originally Posted by benwaggoner View Post
If you want to use all your cores, try --pmode. That can speed up encoding lower resolutions on many-core processors quite a bit. But it is less efficient in terms of watts/pixel, and can slow things down if you were already hitting 50% utilization sometimes.

--pme is an even more extreme threads-for-fps tradeoff than --pmode, and is almost always counterproductive in my experience. Maybe if encoding 360p on 64 cores or something?
This is what I tried. If you download the zip I've attached, you'll see that pmode+pme (I can try enabling only pmode if you think it may be of help) does improve performance quite substantially against the std settings, about 25%, but does not change the CPU utilization, which goes from 83% to 81%.
Old 4th January 2024, 19:55   #7  |  Link
Asmodian
Registered User
 
Join Date: Feb 2002
Location: San Jose, California
Posts: 4,419
I would suggest using only --pmode with a 7950x.

Quote:
Originally Posted by MonoS View Post
does improve performance quite substantially against the std settings, about 25%
This is a huge performance change! The CPU is obviously doing more work per unit of time.

Quote:
Originally Posted by MonoS View Post
but does not change the CPU utilization, which goes from 83% to 81%
I wonder if there is an issue with the measurement of CPU utilization? The split architecture is still a bit weird. Power draw would probably be a more accurate indication of how busy the CPU really is.

If all you care about is maximizing CPU utilization then use placebo settings!
__________________
madVR options explained
Old 5th January 2024, 00:48   #8  |  Link
Atak_Snajpera
RipBot264 author
 
 
Join Date: May 2006
Location: Poland
Posts: 7,870
--ctu 32
Old 5th January 2024, 02:33   #9  |  Link
TDS
Formally known as .......
 
 
Join Date: Sep 2021
Location: Down Under.
Posts: 1,082
What app are you using?

This is the x265 command I generally use for 1080p & especially for 4K...

Code:
--level 6.2 --profile main10 --hdr10 --output-depth 10 --ctu 64 --high-tier --repeat-headers --vbv-bufsize 800000 --vbv-maxrate 800000 --asm avx512
All the other info is displayed in a different window once the process starts, and most of it depends on the video files & x265 itself, AFAIK.

I have a 7950X & an i9-13900KF, and except for --asm avx512 (the 13900KF does not support AVX-512), it seems to work well.

Quote:
--ctu, -s <64|32|16>
Maximum CU size (width and height). The larger the maximum CU size, the more efficiently x265 can encode flat areas of the picture, giving large reductions in bitrate. However, this comes at a loss of parallelism with fewer rows of CUs that can be encoded in parallel, and less frame parallelism as well. Because of this the faster presets use a CU size of 32. Default: 64
https://x265.readthedocs.io/en/master/cli.html
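
If you want to see the tradeoff on your own material, a quick A/B sketch on a short slice (the 2000-frame count is just an example, everything else left identical) could be:
Code:
REM same 2000-frame slice, only the CTU size differs; compare fps and bitrate afterwards
vspipe "test.vpy" - -c y4m | x265 --y4m --frames 2000 --crf 17 --preset veryslow --ctu 64 - "ctu64.hevc"
vspipe "test.vpy" - -c y4m | x265 --y4m --frames 2000 --crf 17 --preset veryslow --ctu 32 - "ctu32.hevc"
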
__________________
Long term RipBot264 user.

RipBot264 modded builds..
*new* x264 & x265 addon packs..
Old 5th January 2024, 06:45   #10  |  Link
Boulder
Pig on the wing
 
 
Join Date: Mar 2002
Location: Finland
Posts: 5,765
CTU 32 multithreads much better than 64 because the frame is split into more parts. I have not seen any real benefits out of CTU 64 compared to 32 concerning detail retention. Or anything else, to be exact.
__________________
And if the band you're in starts playing different tunes
I'll see you on the dark side of the Moon...
Old 5th January 2024, 07:22   #11  |  Link
TDS
Formally known as .......
 
 
Join Date: Sep 2021
Location: Down Under.
Posts: 1,082
Quote:
Originally Posted by Boulder View Post
CTU 32 multithreads much better than 64 because the frame is split into more parts. I have not seen any real benefits out of CTU 64 compared to 32 concerning detail retention. Or anything else, to be exact.
Each to their own, I guess.

I've read that 4K does benefit from 64.

Interesting that the default is 64.
__________________
Long term RipBot264 user.

RipBot264 modded builds..
*new* x264 & x265 addon packs..
Old 5th January 2024, 13:25   #12  |  Link
Boulder
Pig on the wing
 
 
Join Date: Mar 2002
Location: Finland
Posts: 5,765
Quote:
Originally Posted by TDS View Post
Each to their own, I guess.

Have read that 4K does benefit from 64.

Interesting the default is 64.
The default is 64 even for SD. It should at least be adaptive based on the input resolution, but no...

What I've noticed with CTU 64 is that there are sometimes issues in flat areas with some noise: they may start exhibiting the floating-noise artifact much more easily than with CTU 32. The increase in compression efficiency is also something that really does not show up, at least when using CRF; I have yet to see any drastic filesize reductions, even with material containing a lot of flat areas where the bigger CTU size should excel. And there might be other undiscovered issues besides the known problem when CTU 64 is combined with --limit-tu 0 and --rskip 2 (https://forum.doom9.org/showthread.p...47#post1919347).
__________________
And if the band you're in starts playing different tunes
I'll see you on the dark side of the Moon...

Last edited by Boulder; 5th January 2024 at 13:27.
Old 9th January 2024, 20:09   #13  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,824
Quote:
Originally Posted by Boulder View Post
The default is 64 even for SD. It should at least be adaptive based on the input resolution, but no...
Yeah, there are several settings that should be frame-size adaptive, also including --frame-threads (higher with more cores, lower with more rows). --qg-size should also be adaptive.

Quote:
What I've noticed with CTU 64 is that there are sometimes issues in flat areas with some noise: they may start exhibiting the floating-noise artifact much more easily than with CTU 32. The increase in compression efficiency is also something that really does not show up, at least when using CRF; I have yet to see any drastic filesize reductions, even with material containing a lot of flat areas where the bigger CTU size should excel. And there might be other undiscovered issues besides the known problem when CTU 64 is combined with --limit-tu 0 and --rskip 2 (https://forum.doom9.org/showthread.p...47#post1919347).
I'm 100% on --ctu 32 being the appropriate default for under 1080p.
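
Until the defaults become adaptive, it's easy enough to do the adaptation in whatever batch script launches x265. A hypothetical Windows cmd sketch (the 1920 threshold and the --qg-size pairing are just illustrative, not anything x265 does on its own):
Code:
REM hypothetical sketch: choose CTU and quantization-group size from the source width
set WIDTH=3840
if %WIDTH% GTR 1920 (set SIZEOPTS=--ctu 64 --qg-size 16) else (set SIZEOPTS=--ctu 32 --qg-size 8)
vspipe "test.vpy" - -c y4m | x265 --y4m --crf 17 --preset veryslow %SIZEOPTS% - "out.hevc"
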
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 10th January 2024, 17:51   #14  |  Link
Losko
Registered User
 
Join Date: Dec 2010
Posts: 65
For my encodings (always <= 1080p) I usually set:
--ctu 64 for anime content
--ctu 32 for everything else (including Pixar movies)
Old 11th January 2024, 19:20   #15  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,824
Quote:
Originally Posted by Losko View Post
For my encodings (always <= 1080p) I usually set:
--ctu 64 for anime content
--ctu 32 for everything else (including Pixar movies)
So, pretty much discrete tone versus continuous tone, then?

Which makes intuitive sense.

Have you found any issues with grainy anime using --ctu 64?
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 12th January 2024, 00:21   #16  |  Link
Losko
Registered User
 
Join Date: Dec 2010
Posts: 65
Quote:
Originally Posted by benwaggoner View Post
So, pretty much discrete tone versus continuous tone, then?

Which makes intuitive sense.
Don't overestimate my mastery of x265 (it's pretty average); the settings above only come from years of lurking on Doom9.

Quote:
Originally Posted by benwaggoner View Post
Have you found any issues with grainy anime using --ctu 64?
I've only managed to encode a few anime movies with very light grain (most of them are pretty clean), and on those, with --ctu 64, I just noticed the grain turning into a light (barely noticeable) artifact: it was neither removed nor kept. But note the target bitrate was very low (somewhere around 1 Mbps).

Last edited by Losko; 12th January 2024 at 09:41.
Old 13th January 2024, 23:36   #17  |  Link
MonoS
Registered User
 
Join Date: Aug 2012
Posts: 203
Today I've run additional tests with all of your suggestions; you can find the updated files here: x265 CPU utilization issue 2.zip

First, I've added two columns to the initial data: as Asmodian suggested, the CPU power draw and its percentage of the maximum.
Second, I've rerun my current configuration (std pmode+pme) to get consistent results; as a matter of fact the encode time changed slightly, so maybe I had some process hogging resources.

Now to the results.
As benwaggoner suggested, using just --pmode slightly increased performance, by 4 s or 0.01 fps, and decreased CPU utilization from 80.7% to 79.3%.
I then tried both lowering and raising frame threads as rwill suggested: lowering it to 3 decreased performance to 1.22 fps from a baseline of 1.29, so I didn't bother to plot it on the worksheet, but raising it to 9 improved performance to 1.32 fps and utilization by 0.7%.
Finally, I also tried CTU 32 as a lot of you suggested, and it tanked both performance and utilization, lowering them to 1.04 fps and 60.6%; the bitrate also went up by roughly 0.7 Mbps, from 19968.66 kbps to 20635.28 kbps. In theory it would actually be faster: if the CPU were fully utilized it would encode at about 1.74 fps (compared to an estimated 1.62 fps for "std pmode+pme fthreads 9" at 100% utilization).

As soon as I can, I'll do some additional tests with only --pmode, different frame-thread counts, and CTU 32, so let me know if you want me to test something or have some insight into what could be happening.
Old 14th January 2024, 17:42   #18  |  Link
Atak_Snajpera
RipBot264 author
 
 
Join Date: May 2006
Location: Poland
Posts: 7,870

Are you sure the bottleneck is not in your script?
It looks like the encoder is waiting for frames from the script.
Old 14th January 2024, 17:57   #19  |  Link
MonoS
Registered User
 
Join Date: Aug 2012
Posts: 203
Quote:
Originally Posted by Atak_Snajpera View Post

Are you sure the bottleneck is not in your script?
It looks like the encoder is waiting for frames from the script.
As I wrote in my first post:
Quote:
For the input I'm using a simple VS script that just does indexing and cropping; by itself it runs at 160 fps, so I'm sure it is not the bottleneck. Just to be sure, you'll also find a test with FFmpeg feeding the input pipe.
If you want, I can try something different; I thought about testing a BlankClip script, but it would not be a proper test in my opinion.
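
For what it's worth, a synthetic source that bypasses VapourSynth entirely could also be piped straight from ffmpeg, something like this sketch (-strict -1 is needed for 10-bit y4m output; an all-black clip compresses trivially, so it would only exercise the pipe and the encoder's threading, not realistic content):
Code:
REM feed two minutes of black 4K 10-bit frames straight from ffmpeg into x265
ffmpeg -f lavfi -i color=black:s=3840x2076:r=24000/1001 -t 120 -pix_fmt yuv420p10le -strict -1 -f yuv4mpegpipe - | x265 --y4m --crf 17 --preset veryslow - "synthetic.hevc"
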
Old 23rd January 2024, 23:40   #20  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,824
Quote:
Originally Posted by Atak_Snajpera View Post

Are you sure that bottleneck is not in your script?
It looks like encoder is waiting for frames from script.
This looks like a very standard x265 high-core-count encode to me. I've never seen it stay at a steady 100%. I think the valleys in performance correspond with the P-frame frequency. Since non-reference B-frames can be encoded in parallel, some sort of variance like that would make sense.
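
One way to check that would be x265's per-frame CSV log, correlating the slow stretches in the HWInfo data with the frame types, e.g. (a sketch; --csv-log-level 2 adds per-frame performance statistics to the CSV, other options from post #1 omitted):
Code:
REM write one CSV row per frame (type, QP, bits, timing) alongside the encode
vspipe "test.vpy" - -c y4m | x265 --y4m --crf 17 --preset veryslow --csv "frames.csv" --csv-log-level 2 - "test.hevc"
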
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book