View Full Version : Which processor to encode x265 4K ?
DTL
8th February 2023, 20:19
Good evening.
This is the first test with Ryzen 9 7950X preset "Slow", I hope that with this upgrade the coding time will be more acceptable.
The screenshot shows it is not even using the AVX512 functions of x265.
It looks like 'auto' asm still only goes up to AVX2, and AVX512 still needs to be enabled manually.
Also, to get the most performance out of a given piece of software on different architectures, it is better to use builds optimized for each architecture (all producing identical output). Otherwise you are mostly testing how different architectures execute one single build, which may not be optimal for any of them.
About AVX512 capable chips and x265 usage:
1. It may be better to use an executable built for the AVX512 architecture. It may also be better to use the Intel C compiler, which Intel has been tuning for AVX512 for 10+ years, even for AMD AVX512-compatible chips. It also offers whole-project (multi-file) interprocedural optimization.
When the compiler builds for a specific architecture, it can use additional features such as the larger register file for parameter storage and special instructions. The disadvantage is that the executable can only run on the target (or a higher compatible) chip and will crash with 'illegal instruction' on lower architectures. So users of AVX512 chips may be better off finding, requesting from the developers, or building themselves an AVX512-targeted build.
So 'universal' run-anywhere builds of x265 (targeting SSE2 and up) are not optimized for AVX512 by the compiler, and may have degraded performance.
2. In the 'old' GitHub sources from 2020 (https://github.com/videolan/x265), the optional handcrafted AVX512 code paths are not enabled by the default options. They must be activated with the command line option --asm avx512 (it would also be good to find a list of all the other SIMD string options somewhere). It may be the same for the newer versions at https://bitbucket.org/multicoreware/x265_git/downloads/?tab=downloads.
benwaggoner
10th February 2023, 22:47
The screenshot shows it is not even using the AVX512 functions of x265.
It looks like 'auto' asm still only goes up to AVX2, and AVX512 still needs to be enabled manually.
This is because using AVX512 causes a net performance decrease in almost all x265 use cases.
Also, to get the most performance out of a given piece of software on different architectures, it is better to use builds optimized for each architecture (all producing identical output). Otherwise you are mostly testing how different architectures execute one single build, which may not be optimal for any of them.
The nice thing about x265 and open source in general is that we can all compile it optimally ourselves. I think it's most useful to compare processors using the compiler and settings most optimal for that processor. Profile-guided optimizations are fair too in my opinion.
About AVX512 capable chips and x265 usage:
1. It may be better to use an executable built for the AVX512 architecture. It may also be better to use the Intel C compiler, which Intel has been tuning for AVX512 for 10+ years, even for AMD AVX512-compatible chips. It also offers whole-project (multi-file) interprocedural optimization.
Yep, if testing AVX512 everything that can be optimized for AVX512 should be.
When the compiler builds for a specific architecture, it can use additional features such as the larger register file for parameter storage and special instructions. The disadvantage is that the executable can only run on the target (or a higher compatible) chip and will crash with 'illegal instruction' on lower architectures. So users of AVX512 chips may be better off finding, requesting from the developers, or building themselves an AVX512-targeted build.
So 'universal' run-anywhere builds of x265 (targeting SSE2 and up) are not optimized for AVX512 by the compiler, and may have degraded performance.
ASM instructions not compatible with the current hardware don't get used, with everything up to AVX2 being used by default if available. You are 100% right about one-size-fits-all binaries not being optimal for anyone.
2. In the 'old' GitHub sources from 2020 (https://github.com/videolan/x265), the optional handcrafted AVX512 code paths are not enabled by the default options. They must be activated with the command line option --asm avx512 (it would also be good to find a list of all the other SIMD string options somewhere). It may be the same for the newer versions at https://bitbucket.org/multicoreware/x265_git/downloads/?tab=downloads.
All the other ones are automatic by default. Per https://x265.readthedocs.io/en/master/cli.html#performance-options:
--asm <integer:false:string>, --no-asm
x265 will use all detected CPU SIMD architectures by default. You can disable all assembly by using --no-asm or you can specify a comma separated list of SIMD architectures to use, matching these strings: MMX2, SSE, SSE2, SSE3, SSSE3, SSE4, SSE4.1, SSE4.2, AVX, XOP, FMA4, AVX2, FMA3
Some higher architectures imply lower ones being present, this is handled implicitly.
One may also directly supply the CPU capability bitmap as an integer.
Note that by specifying this option you are overriding x265’s CPU detection and it is possible to do this wrong. You can cause encoder crashes by specifying SIMD architectures which are not supported on your CPU.
Default: auto-detected SIMD architectures
DTL
11th February 2023, 01:28
"This is because using AVX512 causes a net performance decrease in almost all x265 use cases."
I ran a quick test with my own build (Visual Studio 2019, sources from GitHub): on an Intel i5-11600, enabling AVX512 via the command line option gives about a 4.8% speedup over 'auto' SIMD for Full HD encoding with the placebo preset. But this may be one of the rare Intel chips without a frequency drop under AVX512 load (low core count, only 6 cores).
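A comparison like this one can be set up as two otherwise identical command lines that differ only in the --asm option. The following is a minimal sketch, not the poster's actual test: the file names are hypothetical, and it only builds the argument lists and computes the relative gain, leaving the timing of the real runs to whatever tool you prefer.

```python
import shlex


def x265_command(source, output, preset="placebo", asm=None):
    """Build an x265 argument list; asm=None leaves SIMD detection on auto."""
    args = ["x265", "--input", source, "--preset", preset]
    if asm is not None:
        # This overrides x265's own CPU detection, so naming an
        # unsupported instruction set here can crash the encoder.
        args += ["--asm", asm]
    args += ["--output", output]
    return args


def percent_gain(fps_base, fps_test):
    """Throughput change of the test run relative to the baseline, in percent."""
    return (fps_test - fps_base) / fps_base * 100.0


# Hypothetical file names, for illustration only.
auto_cmd = x265_command("clip.y4m", "auto.hevc")
avx512_cmd = x265_command("clip.y4m", "avx512.hevc", asm="avx512")
print(shlex.join(auto_cmd))
print(shlex.join(avx512_cmd))
```

Feeding the fps figures reported by the two runs into percent_gain gives a number comparable to the 4.8% quoted above.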
Emulgator
11th February 2023, 20:16
An i9-11900K sees an AVX-512 benefit here too; I guess I can attribute that to the production node of Rocket Lake.
x265 2.8 introduced --asm avx512. On the 11900K: speed 0.35 fps -> 0.7 fps, yay!
(Please do not nail these numbers of mine to your door; they are just from remembering progress bars.)
These 14nm chips seem to be able to pull through that workload without too much thermal penalty (as their predecessors did).
Until AVX-512 got fixed/tucked/fused away in the following families by Intel themselves, as I heard...
benwaggoner
12th February 2023, 04:28
Exciting news! Too bad the latest Intel consumer chips dropped AVX512. Comparing with CPU-specific and profile-driven optimizations would be neat to see. And now that we have quite a bit of ARM SIMD and some good ARM performance CPUs in Macs and Graviton, we can compare architectures too.
ReinerSchweinlin
13th February 2023, 10:34
I wonder how one of the later x64-compatible Xeon Phi chips would perform with x265. The cores themselves were weaker Atom cores, but if I recall correctly, they already had a bunch of AVX512 features on board and up to 64 cores per "CPU", often with 4 CPUs in one server (and 4x hyperthreading), giving an insane 256 threads per CPU, 1024 for one complete server... If something like RipBot distributed the encoding over the cores, cleverly avoiding too much bottlenecking from RAM swapping... Might be interesting :)
I know that the cores themselves are way behind what modern cores can do, but if "the rest" of the core is enough to feed the AVX512 unit efficiently, maybe the number of cores makes up for the lack of frequency and cache?
benwaggoner
13th February 2023, 21:35
I wonder how one of the later x64-compatible Xeon Phi chips would perform with x265. The cores themselves were weaker Atom cores, but if I recall correctly, they already had a bunch of AVX512 features on board and up to 64 cores per "CPU", often with 4 CPUs in one server (and 4x hyperthreading), giving an insane 256 threads per CPU, 1024 for one complete server... If something like RipBot distributed the encoding over the cores, cleverly avoiding too much bottlenecking from RAM swapping... Might be interesting :)
I know that the cores themselves are way behind what modern cores can do, but if "the rest" of the core is enough to feed the AVX512 unit efficiently, maybe the number of cores makes up for the lack of frequency and cache?
Modern video coding is a pretty punishing mix of single-threaded arithmetic processing (CABAC), lots of low-latency branchy mode decisions, and SIMD processing. It needs a pretty balanced and beefy CPU to do quickly; this is a big reason why CUDA-style GPU acceleration has largely vanished with modern codecs.
I expect the Atom cores would bottleneck on CABAC pretty hard. Some kind of hybrid NUMA approach with some performance cores and slower-but-SIMD cores could work. The key to a hybrid model like that would be low-latency access to shared L3 cache. That kind of mixed-core architecture is becoming pretty standard.
wyliec2
14th February 2023, 18:26
I consider it "fair" to use --pmode if it is only used when it increases throughput in a given configuration, and turned off when it doesn't. As --pmode doesn't decrease quality (and can theoretically increase it a bit).
I'm new to this forum but a long-time Handbrake user. Most of the HEVC/X265 info appears to be applicable.
I generally use a 5950X (16 core/32 thread) for encoding and find that using H.265 10-bit on 4K sources winds up with 90+% CPU utilization at Slow-Slower-Very Slow presets.
On BD sources, CPU utilization is more in the 40-50% range.
I have tried applying pmode in Handbrake (pmode=1) and found it to only be beneficial with BD encodes at Very Slow preset.
In testing with three movies at RF 19, Very Slow, pmode=1 reduces encode time 15-20% and increases CPU utilization roughly 20%.
The output files with pmode=1 are generally slightly smaller than an encode with same settings and no options.
One movie, somewhat grainy, showed a significant size reduction - 4247 MB with no options and 3474 MB with the pmode=1 option. I repeated this encode and confirmed the results. FWIW - No option took 13:46 (hours:minutes) to encode and pmode=1 took 10:31 to encode.
Based on what I'd read here, I wasn't expecting significant size change from pmode....wondering about any thoughts from experts..??
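For reference, the relative changes behind the figures in this post can be worked out with a few lines of arithmetic. This is only a sketch using the numbers quoted above (4247 MB vs 3474 MB, 13:46 vs 10:31); the helper names are made up for illustration.

```python
def percent_reduction(before, after):
    """How much smaller `after` is than `before`, in percent."""
    return (before - after) / before * 100.0


def to_minutes(hhmm):
    """Convert an 'hours:minutes' string such as '13:46' to minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


size_cut = percent_reduction(4247, 3474)
time_cut = percent_reduction(to_minutes("13:46"), to_minutes("10:31"))
print(f"size: {size_cut:.1f}% smaller, time: {time_cut:.1f}% shorter")
# prints "size: 18.2% smaller, time: 23.6% shorter"
```

A smaller file at the same RF value does not by itself show better compression efficiency; that would need a quality comparison of the two outputs as well.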
tormento
16th February 2023, 16:32
It would be nice to see newer AMD AVX2 vs AVX512 performances when encoding.
ReinerSchweinlin
17th February 2023, 16:07
Modern video coding is a pretty punishing mix of single-threaded arithmetic processing (CABAC), lots of low-latency branchy mode decisions, and SIMD processing. It needs a pretty balanced and beefy CPU to do quickly; this is a big reason why CUDA-style GPU acceleration has largely vanished with modern codecs.
I expect the Atom cores would bottleneck on CABAC pretty hard. Some kind of hybrid NUMA approach with some performance cores and slower-but-SIMD cores could work. The key to a hybrid model like that would be low-latency access to shared L3 cache. That kind of mixed-core architecture is becoming pretty standard.
Thanks for your assessment.
The Atom cores on these later Xeon Phis are indeed pretty weak in integer and floating point. They do have a 16 GB internal cache, though, which is accessible by all the cores. Maybe that helps.
But I just checked: the widely available Knights Landing chips only offer a very small subset of the AVX512 instructions; only the latest ones have a few more modern subsets on board. Is there any resource for x265 that lists which AVX512 subsets it uses?
benwaggoner
17th February 2023, 19:36
Thanks for your assessment.
The Atom cores on these later Xeon Phis are indeed pretty weak in integer and floating point. They do have a 16 GB internal cache, though, which is accessible by all the cores. Maybe that helps.
16 GB? That can fit a lot of frames.
But I just checked: the widely available Knights Landing chips only offer a very small subset of the AVX512 instructions; only the latest ones have a few more modern subsets on board. Is there any resource for x265 that lists which AVX512 subsets it uses?
The source code would be the definitive resource. There may be a higher level doc somewhere, but I couldn't find one with a quick search. But "a very small subset" is likely not compatible.
The lack of strong single-threaded perf would be the big bottleneck anyway.
Although, I just recalled that WPP might allow some CABAC parallelization: nominally 1 thread per 64 pixels of height, although probably only 2x better given overhead. WPP certainly allows for decoder parallelization. Even so, an Atom core is many times slower for CABAC-like operations than a modern Xeon core, so that's already factored into comparisons.
Modern video encoding is stressful in pretty much every way, so Amdahl's Law prevents any big improvement in one area from helping all that much.
As I've mentioned before, some years back Intel discovered that x265 pushed Xeon thermals hotter than the theoretical worst case of Intel's own internal thermal test tool.
The flip side of this is that encoding benefits some from most improvements; when a new processor says it's "X%-Y%" faster, encoding is always close to the higher Y% value. We get to spend orders of magnitude more MIPS/pixel today than when I started doing compression.
Circa 1996, it took about 80 minutes to encode 1 minute of 320x240p15 on my then rocket-fast PowerMac 8100/80 workstation. I was able to charge $80/minute for a tape-to-file conversion with a $20/min surcharge for VHS (mainly to encourage the client to find the Beta SP master).
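Two bits of arithmetic from this post can be sketched in a few lines: the Amdahl's Law limit on overall speedup, and the nominal number of WPP rows at one thread per 64 pixels of height. The 40% / 2x split below is a made-up illustration, not a measured x265 profile, and the 64-pixel row height is taken from the post.

```python
import math


def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the runtime gets `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def wpp_rows(height, ctu_size=64):
    """Number of 64-pixel rows, the nominal upper bound on WPP threads per frame."""
    return math.ceil(height / ctu_size)


# Hypothetical split: 40% of encode time is SIMD work that becomes 2x faster.
print(round(amdahl_speedup(0.4, 2.0), 2))  # 1.25
print(wpp_rows(2160), wpp_rows(1080))      # 34 17
```

Even a doubling of SIMD throughput yields only a 1.25x overall gain under that assumed split, which is the point being made about improvements in a single area.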
ReinerSchweinlin
18th February 2023, 13:37
16 GB? That can fit a lot of frames.
Yes, 16 GB. But it's not a "normal" L3 cache; it's referred to as "remote L2 cache". Its bandwidth is higher than the 8-lane DDR4 access, but not as fast as a modern L3 cache. It can be configured to act as a normal transparent cache (like an L3 cache), but can also be accessed with a separate driver (or in a hybrid mode). Too bad there are no motherboards in Europe for these Xeons. I know that it's probably not really worth it, but for a small amount of money I'd satisfy my curiosity and get one :)
The source code would be the definitive resource. There may be a higher level doc somewhere, but I couldn't find one with a quick search. But "a very small subset" is likely not compatible.
Thanks for checking, though. I am not deep enough into all this to simply look up the source code and get my answer.
On a side note: when I was tinkering with CPU feature sets yesterday on a 1950X, I found odd performance differences between runs; in some of them, turning AVX2 off seemed to speed things up... It seems there is some potential in individually tweaked builds tailored to a CPU (of course not worth it if one wants to distribute them publicly, but tweaking a personal encoding server this way would be fun), so I will probably have to learn to compile stuff like this properly after all...
ok, back to topic...
The lack of strong single-threaded perf would be the big bottleneck anyway.
I think so, too.... These atom cores really are weak... Even a core2duo has more ooompf per core :)
Although, I just recalled that WPP might allow some CABAC parallelization: nominally 1 thread per 64 pixels of height, although probably only 2x better given overhead. WPP certainly allows for decoder parallelization. Even so, an Atom core is many times slower for CABAC-like operations than a modern Xeon core, so that's already factored into comparisons.
I remember quality penalties from too much parallelization - is it worth thinking about it or are we talking a few percent difference in efficiency here?
Modern video encoding is stressful in pretty much every way, so Amdahl's Law prevents any big improvement in one area from helping all that much.
Maybe at this point it's worth mentioning that getting a Xeon Phi is of course purely for academic research and interest, tinkering with old stuff, etc... For anyone reading along: simply getting a modern desktop CPU is a much better idea :)
As I've mentioned before, some years back Intel discovered that x265 pushed Xeon thermals hotter than the theoretical worst case of Intel's own internal thermal test tool.
Haha... Whenever something like this happened back in my days at university, people resorted to introducing the "correction factor": simply multiply the whole equation by something that sounds reasonable (I hope Intel does better, and to be fair, this was one user group of... not so scientific members...).
Circa 1996, it took about 80 minutes to encode 1 minute of 320x240p15 on my then rocket-fast PowerMac 8100/80 workstation. I was able to charge $80/minute for a tape-to-file conversion with a $20/min surcharge for VHS (mainly to encourage the client to find the Beta SP master).
AH, I remember these machines, I did some service on them back then... Good times... I still remember some encoding adventures - good old Abit BP6 with two P3 celeron Tualatin CPUs was able to do realtime MPEG2 for SVCD Encoding..
Edit:
Phoronix has some CPU-Infos which might be interesting:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 87
model name : Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz
stepping : 1
microcode : 0x1b0
cpu MHz : 1168.239
cache size : 1024 KB
physical id : 0
siblings : 256
core id : 0
cpu cores : 64
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ring3mwait cpuid_fault epb pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt dtherm ida arat pln pts
bugs : cpu_meltdown spectre_v1 spectre_v2 mds msbds_only
bogomips : 2600.01
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 4
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 87
Model name: Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz
Stepping: 1
CPU MHz: 1192.466
CPU max MHz: 1500.0000
CPU min MHz: 1000.0000
BogoMIPS: 2600.01
L1d cache: 2 MiB
L1i cache: 2 MiB
L2 cache: 32 MiB
NUMA node0 CPU(s): 0-255
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT mitigated
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Not affected
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ring3mwait cpuid_fault epb pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt dtherm ida arat pln pts
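To see which AVX-512 subsets a chip reports, the flags line can be filtered directly. The sketch below uses an abridged copy of the Xeon Phi 7210 flags from the dump above. The "required" set is an assumption modelled on x264-style CPU detection (F, DQ, CD, BW, VL); nothing in this thread confirms what x265 actually checks, so verify it against the x265 source before relying on it.

```python
# Assumption, not confirmed by the thread: the subsets an "avx512" code path
# would need, mirroring x264-style detection. Check the x265 source to be sure.
ASSUMED_REQUIRED = {"avx512f", "avx512dq", "avx512cd", "avx512bw", "avx512vl"}


def avx512_subsets(flags_line):
    """Return the set of avx512* feature flags present in a cpuinfo flags line."""
    return {flag for flag in flags_line.split() if flag.startswith("avx512")}


def missing_subsets(flags_line, required=ASSUMED_REQUIRED):
    """Return the assumed-required subsets that the flags line does not list."""
    return sorted(required - avx512_subsets(flags_line))


# Abridged from the Xeon Phi 7210 dump above.
phi_flags = "avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt"
print(sorted(avx512_subsets(phi_flags)))
# ['avx512cd', 'avx512er', 'avx512f', 'avx512pf']
print(missing_subsets(phi_flags))
# ['avx512bw', 'avx512dq', 'avx512vl']
```

On Linux the same functions can be pointed at the "flags" line of /proc/cpuinfo for the machine you actually encode on.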
wyliec2
18th February 2023, 18:32
I consider it "fair" to use --pmode if it is only used when it increases throughput in a given configuration, and turned off when it doesn't. As --pmode doesn't decrease quality (and can theoretically increase it a bit).
I have been experimenting with pmode on my 5950X platform.
I typically encode in Slow, Slower or Very Slow.
For all 4K encodes, the CPU will run at 90+% utilization and pmode causes encodes to take longer.
For BD encodes at Slow or Slower, the encodes take longer with pmode.
For BD encodes at Very Slow, pmode does reduce encode time - in one example an encode took 13 hours at Very Slow and 10 hours at Very Slow with pmode. It also seems to increase CPU utilization around 20% (from mid 40% to mid 60%).
I've only tested on 3 files and while two showed slightly smaller output file size, one showed a significant output size reduction (4247 MB without pmode and 3474 MB with pmode).
Nothing in the documentation or in what I've read here led me to expect this result... wondering if there are any thoughts/comments on it?
DMD
18th February 2023, 23:07
Slower is the fastest preset where some of HEVC's more modern features kick in, like relatively deep TU recursion, weighted b-frame prediction, and B-intra encoding. It's the setting I start with by default, and iterate from. It has somewhat reduced parallelism (lookahead-slices 1 instead of 4), so might not be as optimal for benchmarking with many cores available.
Apples-to-apples comparisons can't rely on just presets, however. The number of frame threads can have a big impact on perf and a smaller impact on quality, and the default number of frame threads is based on how many cores are available. Thus comparing two processors with different core counts can see the processor with more cores running with more frame threads, improving encoding speed but potentially reducing quality. So not quite apples-to-apples.
Benchmarking is hard to do in a broadly applicable way, because there are so many encoding scenarios that can impact relative performance. Comparing at slow with default frame threads is certainly a scenario that will matter to plenty of people. For me, comparing with --preset slower --frame-threads 1 would have the most relevance. Benchmarking for realtime encoding would be very different, as predictable worst-case encoding time becomes essential. Plenty of benchmarks just compare with stock default settings.
I see you are comparing with --pmode (makes good sense if you have a lot of cores relative to frame size, but can slow things down if there aren't enough cores) and --pme (which is a net negative unless you have a whole lot of cores encoding sub-HD resolutions).
I consider it "fair" to use --pmode if it is only used when it increases throughput in a given configuration, and turned off when it doesn't. As --pmode doesn't decrease quality (and can theoretically increase it a bit).
The same can apply to using --pme selectively, although the cores needed to make it a net positive are a lot higher. But for 480p with 64 cores or something, it probably would help. I personally rarely test with more than 18/36 available for any given encoder instance. Although with all the ARM patches, Graviton2/3 with 64 cores deserves some benchmarking as well.
After your comprehensive answer, I apologize for this inexperienced question.
My CPU (Ryzen 7950) has 16C/32T and I encode 4K HDR files with x265. I have disabled "pmode"; should I also remove "pme" from my script to avoid long encoding times? And how else could I improve my script?
Thank you very much
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--crf 16 --preset slower --output-depth 10 --profile main10 --level-idc 5.1 --rd-refine --vbv-bufsize 100000 --vbv-maxrate 100000 --hme-search umh,umh,star --hme --min-keyint 1 --keyint 24 --no-open-gop --pme --master-display "G(8500,39850)B(6550,2300)R(35400,14600)WP(15635,16450)L(10000000,1)" --colorprim bt2020 --colormatrix bt2020nc --transfer smpte2084 --range limited --max-cll "1000,400" --sar 1:1 --no-info --repeat-headers --aud --hrd --uhd-bd
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
guest
22nd February 2023, 04:17
After your comprehensive answer, I apologize for this inexperienced question.
My CPU (Ryzen 7950) has 16C/32T and I encode 4K HDR files with x265. I have disabled "pmode"; should I also remove "pme" from my script to avoid long encoding times? And how else could I improve my script?
Thank you very much
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--crf 16 --preset slower --output-depth 10 --profile main10 --level-idc 5.1 --rd-refine --vbv-bufsize 100000 --vbv-maxrate 100000 --hme-search umh,umh,star --hme --min-keyint 1 --keyint 24 --no-open-gop --pme --master-display "G(8500,39850)B(6550,2300)R(35400,14600)WP(15635,16450)L(10000000,1)" --colorprim bt2020 --colormatrix bt2020nc --transfer smpte2084 --range limited --max-cll "1000,400" --sar 1:1 --no-info --repeat-headers --aud --hrd --uhd-bd
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hello yet again, DMD,
Like I said in the Staxrip thread, I use RipBot264, but the Pauly Dunne builds, which have so much more to offer, than the standard one...but I digress.
Several "power" users have commented on how the 16-core Ryzens "fall off a cliff" when encoding certain x265 videos, but "we" have come up with a "fix", applied through the encoder's commands, that really gets them to do the job they're supposed to do, along with custom x265 commands.
I am VERY happy with the way my 3950X, 5950X & the 7950X are performing, as well as the interloper, the 13900KF :)
I must admit that my 5950X was being bested by the 5900X at almost everything, but I changed some basic BIOS settings and it's working better than ever before :D
DMD
22nd February 2023, 08:29
Hello yet again, DMD,
Like I said in the Staxrip thread, I use RipBot264, but the Pauly Dunne builds, which have so much more to offer, than the standard one...but I digress.
Several "power" users have commented on how the 16-core Ryzens "fall off a cliff" when encoding certain x265 videos, but "we" have come up with a "fix", applied through the encoder's commands, that really gets them to do the job they're supposed to do, along with custom x265 commands.
I am VERY happy with the way my 3950X, 5950X & the 7950X are performing, as well as the interloper, the 13900KF :)
I must admit that my 5950X was being bested by the 5900X at almost everything, but I changed some basic BIOS settings and it's working better than ever before :D
I didn't know that, and I'm very surprised that the Ryzens have this problem with x265 encoding.
But I am also very happy that a solution has been found to make them work at their maximum performance.
I don't know how to apply the "fix" and would be happy to learn how to do it.
As for my bios ( ASUS ROG Strix X670E-F Gaming WiFi) I only performed optimization for RAM and fast boot.
Thank you very much
DTL
23rd February 2023, 12:17
Taking into account that my CPU (Ryzen 7950) how could I improve my script?
With an AMD 7xxx you can try forcing AVX512 usage with --asm avx512. If it does not cause thermal clock throttling (https://www.hwcooling.net/en/intel-avx-512-tested-in-x265-how-to-enable-it-and-does-it-help/), you may get some performance benefit.
DMD
23rd February 2023, 18:50
With an AMD 7xxx you can try forcing AVX512 usage with --asm avx512. If it does not cause thermal clock throttling (https://www.hwcooling.net/en/intel-avx-512-tested-in-x265-how-to-enable-it-and-does-it-help/), you may get some performance benefit.
Many thanks for the suggestion.
ReinerSchweinlin
27th February 2023, 10:53
I didn't know that, and I'm very surprised that the Ryzens have this problem with x265 encoding.
But I am also very happy that a solution has been found to make them work at their maximum performance.
I don't know how to apply the "fix" and would be happy to learn how to do it.
As for my bios ( ASUS ROG Strix X670E-F Gaming WiFi) I only performed optimization for RAM and fast boot.
Thank you very much
Since I also have a 1950X lying around, I'd be happy to read more about that "fix". Could someone point me in the right direction with a link, please?
Boulder
27th February 2023, 11:50
It would be interesting to hear, since I have a 5950X and have zero issues getting the CPU to work at an 80-100% usage level when encoding with x265.
benwaggoner
4th March 2023, 23:16
After your comprehensive answer, I apologize for this inexperienced question.
My CPU (Ryzen 7950) has 16C/32T and I encode 4K HDR files with x265. I have disabled "pmode"; should I also remove "pme" from my script to avoid long encoding times? And how else could I improve my script?
You definitely want to have --pme off; I've never seen it boost throughput with anything above 480p. You should get a >2x speed improvement turning it off.
--pmode has a much bigger chance to be helpful, I'd say it's likely useful above 20 threads for 4K if using --frame-threads 1. The more modes being evaluated, the more parallelization for --pmode to take advantage of.
Using only a single frame thread can improve quality, but limits parallelization a lot, and combining it with --pmode can get some of that perf back if you have enough cores.
Looking at the rest of your command line:
--rd-refine doesn't do anything in a single pass, which your encode is.
Is --no-open-gop still required for BD compatibility with x265? (Open GOPs are certainly supported by the BD format itself.) With 24 frame GOPs, open GOP can provide some real benefit. If you're stuck with --no-open-gop, you could try --radl 2 to get some of the same benefit.
I don't know that --hme has proven to be that helpful. You should try with it off to see if it provides any benefit with your content.
If there is much grain in the source --rd 4 can both improve quality and throughput.
--crf 16 --preset slower --output-depth 10 --profile main10 --level-idc 5.1 --rd-refine --vbv-bufsize 100000 --vbv-maxrate 100000 --hme-search umh,umh,star --hme --min-keyint 1 --keyint 24 --no-open-gop --pme --master-display "G(8500,39850)B(6550,2300)R(35400,14600)WP(15635,16450)L(10000000,1)" --colorprim bt2020 --colormatrix bt2020nc --transfer smpte2084 --range limited --max-cll "1000,400" --sar 1:1 --no-info --repeat-headers --aud --hrd --uhd-bd
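One way to try these suggestions without retyping the whole script is to edit the option string programmatically. This is a minimal sketch and not a recommended final command line: it removes --pme as advised, and drops --hme with its --hme-search value purely as the "try with it off" experiment; everything else is left untouched. The helper name drop_options is made up for illustration.

```python
import shlex

original = (
    '--crf 16 --preset slower --output-depth 10 --profile main10 --level-idc 5.1 '
    '--rd-refine --vbv-bufsize 100000 --vbv-maxrate 100000 '
    '--hme-search umh,umh,star --hme --min-keyint 1 --keyint 24 --no-open-gop --pme '
    '--master-display "G(8500,39850)B(6550,2300)R(35400,14600)WP(15635,16450)L(10000000,1)" '
    '--colorprim bt2020 --colormatrix bt2020nc --transfer smpte2084 --range limited '
    '--max-cll "1000,400" --sar 1:1 --no-info --repeat-headers --aud --hrd --uhd-bd'
)


def drop_options(args, flags=(), flags_with_value=()):
    """Return args without the given switches and value-taking options."""
    kept = []
    skip_next = False
    for arg in args:
        if skip_next:
            # This is the value belonging to an option removed just before it.
            skip_next = False
            continue
        if arg in flags_with_value:
            skip_next = True
            continue
        if arg in flags:
            continue
        kept.append(arg)
    return kept


args = shlex.split(original)
trial = drop_options(args, flags={"--pme", "--hme"}, flags_with_value={"--hme-search"})
print(shlex.join(trial))
```

Encoding the same clip with the original and the trial options, and comparing both speed and output, shows whether the removed options were helping on your content.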
Boulder
5th March 2023, 13:47
--rd-refine doesn't do anything in a single pass, which your encode is.
It does work on CRF encodes, it doesn't need a stats file or anything. I've been trying to figure out what it actually does or what the use case is but I have no clue.
You definitely want to have --pme off; I've never seen it boost throughput with anything above 480p. You should get a >2x speed improvement turning it off.
--pmode has a much bigger chance to be helpful, I'd say it's likely useful above 20 threads for 4K if using --frame-threads 1. The more modes being evaluated, the more parallelization for --pmode to take advantage of.
Using only a single frame thread can improve quality, but limits parallelization a lot, and combining it with --pmode can get some of that perf back if you have enough cores.
Looking at the rest of your command line:
--rd-refine doesn't do anything in a single pass, which your encode is.
Is --no-open-gop still required for BD compatibility with x265? (Open GOPs are certainly supported by the BD format itself.) With 24 frame GOPs, open GOP can provide some real benefit. If you're stuck with --no-open-gop, you could try --radl 2 to get some of the same benefit.
I don't know that --hme has proven to be that helpful. You should try with it off to see if it provides any benefit with your content.
If there is much grain in the source --rd 4 can both improve quality and throughput.
Thank you very much for the advice, I will do some tests for a better result.
Using StaxRip I had a chance to do some tests with "number of parallel processes" and "Chunks", and I noticed that by setting the maximum value (16) for both parallel processes and chunks, I got higher processing speed, but also missing video frames.
In my personal configuration with a setting of 3-3 I was able to get a slight speed increase without any side effects, using the commands I included in the previous post.
benwaggoner
6th March 2023, 06:22
It does work on CRF encodes, it doesn't need a stats file or anything. I've been trying to figure out what it actually does or what the use case is but I have no clue.
You're right; was thinking of a different parameter.
--rd-refine, --no-rd-refine
For each analysed CU, calculate R-D cost on the best partition mode for a range of QP values, to find the optimal rounding effect. Default disabled.
Only effective at RD levels 5 and 6
It should offer a slight overall compression efficiency improvement.
Boulder
18th March 2023, 15:07
--pmode has a much bigger chance to be helpful, I'd say it's likely useful above 20 threads for 4K if using --frame-threads 1. The more modes being evaluated, the more parallelization for --pmode to take advantage of.
Using only a single frame thread can improve quality, but limits parallelization a lot, and combining it with --pmode can get some of that perf back if you have enough cores.
Just for fun, I tested frame-threads from 5 (the default for my CPU) to 1 and then with pmode on my 5950X (16C/32T). The effect of pmode on the compression efficiency is much bigger than I anticipated. The speed increase was weird because my CPU usage is already around 90-100% when encoding with the default frame-threads and no pmode.
I ran this test on a 720p encode, normal setup and settings for my 1080p->720p encodes to the media library. I do use some uncommon parameters like --no-limit-modes and --rskip 0 which probably affect the results compared to standard presets.
I seriously need to test the 4K encodes as well.
F 5 - 5718.31 kbps - 7.11 fps
F 4 - 5713.13 kbps - 6.93 fps
F 3 - 5708.73 kbps - 6.74 fps
F 2 - 5715.94 kbps - 6.23 fps (odd that the size went up..)
F 1 - 5695.93 kbps - 4.50 fps
F 1 + pmode - 5490.78 kbps - 5.88 fps
F 2 + pmode - 5521.68 kbps - 7.43 fps
F 3 + pmode - 5515.12 kbps - 7.72 fps
F 4 + pmode - 5519.36 kbps - 7.83 fps
F 5 + pmode - 5521.10 kbps - 8.01 fps
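The effect of pmode at each frame-threads setting can be read off these numbers with a short script. This is only a sketch over the figures posted above; the dictionary layout and function names are made up for illustration.

```python
# frame_threads -> (bitrate in kbps, speed in fps), as posted above
plain = {5: (5718.31, 7.11), 4: (5713.13, 6.93), 3: (5708.73, 6.74),
         2: (5715.94, 6.23), 1: (5695.93, 4.50)}
pmode = {1: (5490.78, 5.88), 2: (5521.68, 7.43), 3: (5515.12, 7.72),
         4: (5519.36, 7.83), 5: (5521.10, 8.01)}


def percent_change(base, new):
    """Change from `base` to `new`, in percent of `base`."""
    return (new - base) / base * 100.0


def pmode_effect(frame_threads):
    """Return (bitrate change %, speed change %) from enabling pmode."""
    base_kbps, base_fps = plain[frame_threads]
    new_kbps, new_fps = pmode[frame_threads]
    return percent_change(base_kbps, new_kbps), percent_change(base_fps, new_fps)


for ft in sorted(plain):
    kbps_delta, fps_delta = pmode_effect(ft)
    print(f"F {ft}: bitrate {kbps_delta:+.2f}%, speed {fps_delta:+.1f}%")
```

A lower bitrate at the same rate-control setting only counts as better compression efficiency if quality is unchanged, which these figures alone do not show.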