X265 slow encoding
TEB
23rd November 2019, 16:25
Hi, I've been doing a few encodes lately on my lab encoder and I'm getting abysmal encoding speeds, apparently because ffmpeg+x265 is not using all the CPU capacity:
HW: Dual Intel Silver 4118 (20cores + 20HT cores), RHEL 7.7
Source file is a ProRes 422 HQ file
ffmpeg is the latest git pull from yesterday
Result: encoding at 1fps speed 0.04x
Machine has a load of 2 (40cores..??), almost all cpu cores are idle
./ffmpeg -loglevel verbose -i file_p25.mov -strict -1 -vf format=yuv420p10 -codec:v libx265 -x265-params keyint=100:min-keyint=100:no-open-gop=1 -level 4.1 -preset veryslow -crf 16 -profile:v main10 -y xtemp3_P6slow_nolimit_max.ts
Anyone know whats going on?
Atak_Snajpera
23rd November 2019, 16:32
Add --ctu 16 or just use RipBot264 in distributed encoding mode
TEB
23rd November 2019, 17:13
thx! It seems that by moving from veryslow to normal, i get a much better utilization of the cores..
Greenhorn
24th November 2019, 00:50
If you want to encode at the higher presets, you can also try enabling --pmode for a large boost to utilization. (It'll actually decrease performance at the lower presets.)
TEB
19th December 2019, 08:40
I tried the following:
./ffmpeg -loglevel verbose -i ARCHIVE.mov -strict -1 -vf format=yuv420p10 -codec:v libx265 -x265-params keyint=100:min-keyint=100:no-open-gop=1:pmode=1 -level 4.1 -preset veryslow -crf 16 -profile:v main10 -y test-veryslow.ts
Source = Prores HQ422 10bit movie trailer
With and without pmode (assuming the syntax is correct) I'm getting 0.2x (ca. 14% system usage)
From the log:
x265 [info]: HEVC encoder version 3.2+2-82a66ce12955
x265 [info]: build info [Linux][GCC 6.3.0][64 bit] 10bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
x265 [info]: Main 10 profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 64 threads
x265 [info]: Thread pool created using 64 threads
x265 [info]: Slices : 1
x265 [info]: frame threads / pool features : 5 / wpp(17 rows)+pmode
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 3 inter / 3 intra
x265 [info]: ME / range / subpel / merge : star / 57 / 4 / 5
x265 [info]: Keyframe min / max / scenecut / bias: 100 / 100 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt : 40 / 8 / 2
x265 [info]: b-pyramid / weightp / weightb : 1 / 1 / 1
x265 [info]: References / ref-limit cu / depth : 5 / off / off
x265 [info]: AQ: mode / str / qg-size / cu-tree : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress : CRF-16.0 / 0.60
x265 [info]: tools: rect amp rd=6 psy-rd=2.00 rdoq=2 psy-rdoq=1.00 rskip
x265 [info]: tools: signhide tmvp b-intra strong-intra-smoothing deblock sao
kuchikirukia
19th December 2019, 13:51
Have you tried not using ffmpeg? ffmpeg is useful because it can do anything, but it doesn't do anything well.
Using MeGUI on Windows (calling the x265 binary) I have no problem hitting 100% load on my 8-thread i7 4790 at --preset veryslow on a 1080p Blu-ray.
Though interestingly I barely exceed 1FPS in 10 bit. Have processors come so far that my 3.8GHz 4/8 Haswell can nearly be equaled by two cores of a 3GHz Xeon?
Passmark shows that Xeon as being reasonably behind my i7 single-threaded. (70% of the speed)
foxyshadis
21st December 2019, 10:43
It might be memory access speed, plus the Xeon has absolutely massive internal caches. x265 is extremely sensitive to access speed, even more so than x264 was back when it was still considered slow.
excellentswordfight
21st December 2019, 20:53
Have you tried not using ffmpeg? ffmpeg is useful because it can do anything, but it doesn't do anything well.
Using MeGUI on Windows (calling the x265 binary) I have no problem hitting 100% load my 8-threaded i7 4790 at --preset veryslow on a 1080p Blu-ray.
Though interestingly I barely exceed 1FPS in 10 bit. Have processors come so far that my 3.8GHz 4/8 Haswell can nearly be equaled by two cores of a 3GHz Xeon?
Passmark shows that Xeon as being reasonably behind my i7 single-threaded. (70% of the speed)
Wow, that's a somewhat different setup from TS's... You are talking about saturating 8 threads and TS 40, which is quite a big difference. And you don't get much more than 8 threads saturated with veryslow with such a big CTU size for 1080p, and since his Xeon will run below 3 GHz it's not that weird that you get similar speeds, since he can't use the thread advantage!
And tbh veryslow is literally very slow, to the point where it's almost unusable (especially after the latest preset changes). I would say that 'slower' is the lowest "usable" preset atm.
For reference, this is what I get with a Xeon Gold 6126 (12C/24T):
--preset veryslow: 0.8 fps (25-40% utilization)
--preset slower --ctu 32 --merange 26: 3 fps (100% utilization)
Almost a 4x speed increase.
Edit: Also keep in mind that the source has a large effect on speed; you cannot do a direct comparison without using the same files.
kuchikirukia
22nd December 2019, 08:17
I'm talking about saturating 8 threads vs his 2. You showed ~8 threaded too.
While it doesn't look like he's going to see anything close to a 4x speedup if he fixes his threadedness issue, if it's a reasonable gain it may turn out a difference between running one to two encodes on each CPU vs five on each.
TEB
22nd December 2019, 09:43
Mind explaining what --ctu 32 and --merange 26 means?
TEB
22nd December 2019, 09:44
I'm talking about saturating 8 threads vs his 2. You showed ~8 threaded too.
While it doesn't look like he's going to see anything close to a 4x speedup if he fixes his threadedness issue, if it's a reasonable gain it may turn out a difference between running one to two encodes on each CPU vs five on each.
Not sure where 2 came in as i have a 128 core cpu ? ;) Or am i missing something?
excellentswordfight
22nd December 2019, 17:29
Mind explaining what --ctu 32 and --merange 26 means?
--ctu specifies the maximum CU size. The default value is rather large and is mostly beneficial for high-res (UHD) material, and it has a large effect on parallelism at lower resolutions. It can be reduced for greater parallelism without any big effect on compression. I usually leave it at 64 for 1080p and go for 32 at 720p and below, but if you are looking at using more threads while still doing a single encode, this is one of the key parameters.
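A rough way to see why a smaller CTU helps threading: wavefront parallelism (WPP) works per CTU row, so the number of rows caps how many rows can be in flight in one frame. The x265 logs in this thread report wpp(17 rows) at the default CTU of 64 and wpp(34 rows) with ctu=32, which matches a 1080-line frame. The sketch below assumes that 1080 height; wpp_rows is just an illustrative helper, not an x265 function.

```python
import math


def wpp_rows(frame_height, ctu_size):
    """Number of CTU rows in a frame; each row is one unit of WPP work."""
    return math.ceil(frame_height / ctu_size)


# Assumed 1080-line source, as implied by the "17 rows" / "34 rows" log lines.
for ctu in (64, 32, 16):
    print(f"ctu={ctu}: {wpp_rows(1080, ctu)} rows")
```

More rows only raises the ceiling; how many of them actually run at once still depends on the dependencies between rows and frames.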
--merange sets the motion search range. The default value (57) is calculated from the default CTU size of 64. The doc explains it rather well:
The default is derived from the default CTU size (64) minus the luma interpolation half-length (4) minus maximum subpel distance (2) minus one extra pixel just in case the hex search method is used. If the search range were any larger than this, another CTU row of latency would be required for reference frames.
All presets slower than medium use star search, so the extra pixel for hex is not needed, and the same logic with a CTU size of 32 gets you 32 - 4 - 2 = 26.
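The arithmetic from the doc quote can be written out as a small sketch. The constants come straight from the quoted text; the function name and the "hex"/"star" switch are only for illustration.

```python
LUMA_HALF_LENGTH = 4      # luma interpolation half-length, per the doc quote
MAX_SUBPEL_DISTANCE = 2   # maximum subpel distance, per the doc quote


def max_merange(ctu_size, me="star"):
    """Largest search range that avoids an extra CTU row of latency."""
    extra_pixel = 1 if me == "hex" else 0  # the "just in case" pixel for hex
    return ctu_size - LUMA_HALF_LENGTH - MAX_SUBPEL_DISTANCE - extra_pixel


print(max_merange(64, "hex"))   # the documented default derivation
print(max_merange(32, "star"))  # the value used with --ctu 32 in this thread
```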
kuchikirukia
23rd December 2019, 05:30
Not sure where 2 came in as i have a 128 core cpu ? ;) Or am i missing something?
Machine has a load of 2 (40cores..??), almost all cpu cores are idle
A load of 2 means you'd be at 100% load on a dual-core, when the x265 Windows binary will max my 8 threads. (load of 8)
While the veryslow preset doesn't scale out to the 40 threads of your system, it should be able to do 8, and my guess as to why you can't hit that would be an issue with your ffmpeg build.
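To put the load figures in this thread on one scale, load average can be divided by the number of logical CPUs. This is only an approximation (on Linux the load average also counts tasks waiting on I/O), and the helper name is made up for the example. The inputs are the numbers reported in the thread: a load of 2 on 40 threads, and later 21 and 97 on a 128-thread machine.

```python
def utilization_from_load(load_average, logical_cpus):
    """Approximate CPU utilization in percent implied by a load average."""
    return min(100.0 * load_average / logical_cpus, 100.0)


for load, cpus in ((2, 40), (21, 128), (97, 128)):
    print(f"load {load} on {cpus} threads ~ {utilization_from_load(load, cpus):.1f}%")
```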
TEB
23rd December 2019, 13:59
Thx for the insight !!
./ffmpeg -loglevel verbose -i ARCHIVE.mov -strict -1 -vf format=yuv420p10 -codec:v libx265 -x265-params keyint=100:min-keyint=100:no-open-gop=1:pmode=1:ctu=32:merange:26 -level 4.1 -preset veryslow -crf 16 -profile:v main10 -y test-veryslow.ts
Doesn't seem to trigger the change though:
x265 [info]: HEVC encoder version 3.2+2-82a66ce12955
x265 [info]: build info [Linux][GCC 6.3.0][64 bit] 10bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
x265 [info]: Main 10 profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 64 threads
x265 [info]: Thread pool created using 64 threads
x265 [info]: Slices : 1
x265 [info]: frame threads / pool features : 5 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 3 inter / 3 intra
x265 [info]: ME / range / subpel / merge : star / 57 / 4 / 5
x265 [info]: Keyframe min / max / scenecut / bias: 23 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt : 40 / 8 / 2
x265 [info]: b-pyramid / weightp / weightb : 1 / 1 / 1
x265 [info]: References / ref-limit cu / depth : 5 / off / off
x265 [info]: AQ: mode / str / qg-size / cu-tree : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress : CRF-16.0 / 0.60
x265 [info]: tools: rect amp rd=6 psy-rd=2.00 rdoq=2 psy-rdoq=1.00 rskip
x265 [info]: tools: signhide tmvp b-intra strong-intra-smoothing deblock sao
[mpegts @ 0x793f4c0] service 1 using PCR in pid=256, pcr_period=83ms
[mpegts @ 0x793f4c0] muxrate VBR, sdt every 500 ms, pat/pmt every 100 ms
foxyshadis
23rd December 2019, 23:13
You probably meant to put merange=26, not merange:26.
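For what it's worth, the log in the post above shows Keyframe min / max back at 23 / 250 and no +pmode, whereas the earlier pmode run showed 100 / 100 and +pmode. That suggests the one malformed pair caused the whole -x265-params string to be dropped, not just merange. A minimal sketch of a pre-flight check, assuming every pair has to be written as key=value; find_malformed_pairs is a hypothetical helper, not something ffmpeg or x265 ships.

```python
def find_malformed_pairs(x265_params):
    """Return the ':'-separated tokens that are not of the form key=value."""
    bad = []
    for token in x265_params.split(":"):
        key, sep, value = token.partition("=")
        if not (key and sep and value):
            bad.append(token)
    return bad


broken = "keyint=100:min-keyint=100:no-open-gop=1:pmode=1:ctu=32:merange:26"
fixed = "keyint=100:min-keyint=100:no-open-gop=1:pmode=1:ctu=32:merange=26"
print(find_malformed_pairs(broken))
print(find_malformed_pairs(fixed))
```

Checking the "Coding QT" and "ME / range" lines of the x265 log, as done in this thread, remains the real confirmation that the options were applied.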
excellentswordfight
24th December 2019, 09:51
I'm talking about saturating 8 threads vs his 2. You showed ~8 threaded too.
While it doesn't look like he's going to see anything close to a 4x speedup if he fixes his threadedness issue, if it's a reasonable gain it may turn out a difference between running one to two encodes on each CPU vs five on each.
A load of 2 means you'd be at 100% load on a dual-core, when the x265 Windows binary will max my 8 threads. (load of 8)
While the veryslow preset doesn't scale out to the 40 threads of your system, it should be able to do 8, and my guess as to why you can't hit that would be an issue with your ffmpeg build.
Well, that's either a typo or not really the case, because the speed he is seeing is in line with using more like 8-12 threads (which is also normal for 1080p on default settings); it's in line with both your Haswell CPU and my Xeon Gold. So there's no real reason to dwell on that.
TEB
24th December 2019, 14:18
You probably meant to put merange=26, not merange:26.
Jepp, corrected now!
TEB
24th December 2019, 14:23
UPDATE:
x265 [info]: HEVC encoder version 3.2+2-82a66ce12955
x265 [info]: build info [Linux][GCC 6.3.0][64 bit] 10bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
x265 [info]: Main 10 profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 64 threads
x265 [info]: Thread pool created using 64 threads
x265 [info]: Slices : 1
x265 [info]: frame threads / pool features : 5 / wpp(34 rows)+pmode
x265 [info]: Coding QT: max CU size, min CU size : 32 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 3 inter / 3 intra
x265 [info]: ME / range / subpel / merge : star / 26 / 4 / 5
x265 [info]: Keyframe min / max / scenecut / bias: 100 / 100 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt : 40 / 8 / 2
x265 [info]: b-pyramid / weightp / weightb : 1 / 1 / 1
x265 [info]: References / ref-limit cu / depth : 5 / off / off
x265 [info]: AQ: mode / str / qg-size / cu-tree : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress : CRF-16.0 / 0.60
x265 [info]: tools: rect amp rd=6 psy-rd=2.00 rdoq=2 psy-rdoq=1.00 rskip
x265 [info]: tools: signhide tmvp b-intra strong-intra-smoothing deblock sao
./ffmpeg -loglevel verbose -i ARCHIVE.mov -strict -1 -vf format=yuv420p10 -codec:v libx265 -x265-params keyint=100:min-keyint=100:no-open-gop=1:pmode=1:ctu=32:merange=26 -level 4.1 -preset veryslow -crf 16 -profile:v main10 -y test-veryslow.ts
System Load: 5min avg 21
FPS encoding: 7fps
A load of 21 on a 128cored cpu is a tad low ;) Any more tips to improve it and not move to lower quality profiles?
TEST1:
I tested medium preset for the fun of it, but i still had like 25ish load and ca 44 fps..
So in other words, higher framerate but the load isnt all that great..
TEST2:
I spawned 4 encoding instances like the one above in veryslow mode and I got a load of ca. 97.
microchip8
24th December 2019, 14:40
Try with pme=1 (parallel motion estimation), but I doubt it'll suddenly saturate all threads. I think you're pushing the limits of x265's threading itself.
TEB
25th December 2019, 00:32
Tried, same result
I also tried from another source (HEVC), same result..
excellentswordfight
25th December 2019, 13:47
A load of 21 on a 128cored cpu is a tad low ;) Any more tips to improve it and not move to lower quality profiles?
TEST1:
I tested medium preset for the fun of it, but i still had like 25ish load and ca 44 fps..
So in other words, higher framerate but the load isnt all that great..
What makes you say that? Utilizing 20+ threads for a very complex GOP codec is actually very impressive, and imo a high thread usage! If you need encodes done faster and have CPU power to spare, you really should do chunk encoding. That is the only way to get these heavily compressed GOP codecs to scale enough for very high core count environments. You are already using fixed IDR placement with closed GOPs, so you won't sacrifice anything by doing it! You can then just do, say, 10 s chunks, give each chunk 8 threads, and max out the whole CPU on a single video.
And as I said above, if you have an issue with speed, keep in mind that veryslow is indeed very slow; you should investigate using slower or slow instead if you need things done faster without sacrificing that much compression (medium is imo a bit too much of a sacrifice for me). Keep in mind that the slower preset in use now is more or less the old veryslow, and the new veryslow is imo close to something like a placebo preset.
TEB
25th December 2019, 19:45
You're completely right! My assumptions about the practical parallelizability of a single encode were "somewhat" grand.
Any tips on how to do chunk encoding? Can ffmpeg do that? Or other tools in the open-source Linux world?
br TE
stax76
25th December 2019, 20:39
RipBot (Windows) is the only tool I'm aware of, and for StaxRip it's under consideration for future versions. Building this is not terribly difficult I believe, maybe 50-100 lines of Python code.
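As a very rough starting point for such a script, the two helpers below only build the ffmpeg argument list for one chunk and the text of a concat list; they do not run anything. The encoder settings are copied from the commands earlier in this thread, pools=8 is used to cap the x265 thread pool per chunk, and the chunk_NNNN.ts naming, the -an (audio handled separately), and the 25 fps default are assumptions. Whether seeking with -ss lands exactly on the intended frame depends on the source, so this is untested against real files.

```python
def chunk_command(src, index, start_frame, frame_count, fps=25.0, threads=8):
    """Build (not run) an ffmpeg command that encodes one chunk with libx265."""
    x265_params = (
        "keyint=100:min-keyint=100:no-open-gop=1:"
        f"ctu=32:merange=26:pools={threads}"
    )
    return [
        "ffmpeg", "-ss", f"{start_frame / fps:.3f}", "-i", src,
        "-frames:v", str(frame_count), "-an",
        "-vf", "format=yuv420p10",
        "-codec:v", "libx265", "-preset", "veryslow", "-crf", "16",
        "-x265-params", x265_params,
        "-y", f"chunk_{index:04d}.ts",
    ]


def concat_list(chunk_count):
    """Text for a list file usable with: ffmpeg -f concat -safe 0 -i list.txt -c copy out.ts"""
    return "".join(f"file 'chunk_{i:04d}.ts'\n" for i in range(chunk_count))


print(" ".join(chunk_command("ARCHIVE.mov", 1, 300, 300)))
print(concat_list(2), end="")
```

The commands could then be handed to a process pool sized to the number of parallel jobs the machine can take.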
RanmaCanada
25th December 2019, 22:38
MediaCoder can also do chunking, though it has nowhere near the options RipBot does.
benwaggoner
4th January 2020, 00:06
Is this a dual socket system? If so, using --pools "+,-" will pin the encode to socket 0. By improving caching, it's easier to get full utilization. The most efficient thing I've found is to just have one parallel encode per socket. For 1080p and higher, I've always been able to saturate all the cores on a socket in veryslow with --pmode. UHD doesn't even need --pmode.
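For a multi-socket box, the one-encode-per-socket idea can be sketched by generating one --pools string per NUMA node, with "+" for the node to use and "-" for the ones to skip. The helper name is made up; the "+,-" form is the one given in the post above.

```python
def pools_for_node(node, node_count):
    """--pools value that uses all cores on one NUMA node and none on the others."""
    return ",".join("+" if i == node else "-" for i in range(node_count))


# Dual-socket example: one encode pinned to each socket.
for node in range(2):
    print(f'encode {node}: --pools "{pools_for_node(node, 2)}"')
```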
TEB
6th January 2020, 08:34
Single socket Epyc 7742, 64 physical cores = 128 SMT cores
I had to fill it with 6-7 encoding jobs to fully saturate all SMT cores
microchip8
6th January 2020, 17:48
As I said in a previous post, x265's threading is roughly based (with some modifications) on that of x264. The higher core/thread count, the less it will be able to saturate those cores/threads you throw at it. So running multiple encodes seems to be the only thing that will fully utilize the processor(s). Even --pmode and --pme won't do it alone
Forteen88
6th January 2020, 19:19
Did you test it with a high preset, like --preset veryslow, or --preset slower?
microchip8
6th January 2020, 20:15
Yes, including my own settings I use for encoding. I find the presets to be non-optimal
For example, on the higher presets they use --merange of 57 even when star is set as --me. It should be set to 58 since --me hex is not used
SAO is turned on by default. Many, including me, disable it as it blurs too much
rdoq-level is set to 2 even when it should be set to 1 as 1 is more effective for psy-rd/psy-rdoq
--amp is enabled for the higher presets while providing less than 1% in compression efficiency but a huge slowdown
--ctu is set to 64, which is beneficial for 4K and up. I do mostly 1080p encodes atm with a ctu of 32
it is a wonder to me why they use --limit-tu 4 for the slower preset. No other preset limits the TU, including the faster ones and the even slower ones
Keep in mind that these presets have not been revised for a very long time while x265 has since then gained more compression/quality optimizations
(PS: I also use uniform AQ as opposed to AQ 2 or 3. The latter two suck up bitrate while providing not much visual improvement compared to 1, and I've tested this on a few well-calibrated TVs, glued to the screens to analyze it visually)
Blue_MiSfit
6th January 2020, 22:27
SAO may produce a result that looks blurrier in single frame comparison, but through exhaustive testing of 4K SDR and HDR10 encoding I determined it's worth keeping on unless you're doing very high bitrates. The overall improvement in motion is absolutely worth it.
SAO is great at reducing distracting artifacts that would pull your attention away from the natural focal point of the content. The slight reduction in sharpness is a great tradeoff unless you really have the bit budget for full transparency (like 25+ Mbps for 2160p).
microchip8
6th January 2020, 22:52
And so does deblocking at its default strength or slightly lower (-1,-1). Using them both sounds like a bad idea, though I haven't tested yet
Cary Knoop
6th January 2020, 23:05
I determined it's worth keeping on unless you're doing very high bitrates. ... The slight reduction in sharpness is a great tradeoff unless you really have the bit budget for full transparency (like 25+ Mbps for 2160p).
I would strongly disagree on calling 25+ Mbps UHD HEVC a "very high bitrate". I would say 25 Mbps is medium to low.
benwaggoner
6th January 2020, 23:14
Single socket Epyc 7742, 64 physical cores = 128 SMT cores
I had to fill it with 6-7 encoding jobs to fully saturate all SMT cores
Wow, that is going to be a challenge! I think 8K encoding with --preset veryslow --pmode might come close to saturating that, but I doubt anything short of that would.
Have you tried --pmode --pme? PME might speed things up 50% at 4x the utilization.
Boulder
7th January 2020, 05:40
rdoq-level is set to 2 even when it should be set to 1 as 1 is more effective for psy-rd/psy-rdoq
(PS: I also use uniform AQ as opposed to AQ 2 or 3. The latter two suck up bitrate while providing not much visual improvement compared to 1, and I've tested this on a few well-calibrated TVs, glued to the screens to analyze it visually)
What kind of values do you have for psy-rd and psy-rdoq with rdoq-level 1? What about aq-strength?
it is a wonder to me why they use --limit-tu 4 for the slower preset. No other preset limits the TU, including the faster ones and the even slower ones
The faster ones don't have any depth to limit I think. --preset veryslow is already leaning towards placebo.
microchip8
7th January 2020, 06:44
For psy-rd I use 4.0 and 20.0 for psy-rdoq. Default aq-strength of 1.0 with aq-mode 1
I was getting banding on a mostly clean source with a little mosquito noise (Blade Runner 2049 starting from 6 minutes). The closet in this low-light scene looked ugly so I increased psy-rdoq to 15 at first, then to 20 and these issues went away. The higher psy-rdoq is, the more non-zero coefficients are added and the closer it matches the energy of the source and according to docs, the more artifacts you'll get. Thing is, I was not able to spot any artifacts at all (no ringing and/or blocking or other distortions). Docs even say that the defaults are at the very low end and it can be beneficial to use larger values. I didn't test higher than 20 (it can go up to 50 for psy-rdoq)
Boulder
7th January 2020, 06:49
Thank you, I need to do some testing myself. Did you check how the average bitrate was affected compared to the preset default for psy options?
microchip8
7th January 2020, 06:57
It was a bit higher but I suspect it was due to using a qcomp of 0.73 which has more effect on bitrate than psy-rdoq
My personal aim when encoding is to be as close as possible to the source. I don't do any filtering here so everything must be preserved including noise/grain
According to docs, higher values of psy-rdoq can double the bitrate but I didn't notice such doubling. Blade Runner 2049 encoded at an average of 3.7 Mbps while using low values for psy-rd/psy-rdoq and a default qcomp of 0.6, it encoded at an average of 2.5 Mbps (and didn't look as good as the one with high psy-rd/psy-rdoq and qcomp of 0.73). Also I use a CRF of 21 here
These are my current custom settings I use
ref=4:hme=0:me=star:merange=26:subme=5:bframes=8:rd=4:rd-refine=0:qcomp=0.73:fades=1:strong-intra-smoothing=1:ctu=32:qg-size=32:sao=0:selective-sao=0:cu-lossless=0:cutree=1:tu-inter-depth=4:tu-intra-depth=4:rskip=1:max-merge=4:rc-lookahead=100:aq-mode=1:aq-strength=1.0:rdoq-level=1:psy-rdoq=20.0:psy-rd=4.0:limit-modes=0:limit-refs=0:limit-tu=0:deblock=-1,-1:weightb=1:weightp=1:rect=1:amp=0:wpp=1:pmode=0:pme=0:b-intra=1:b-adapt=2:b-pyramid=1:tskip-fast=0:fast-intra=0:early-skip=0:splitrd-skip=0:min-keyint=24:keyint=240:transfer=bt709:colorprim=bt709:colormatrix=bt709
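When A/B testing against a long settings string like this, it can help to change only the keys under test and leave everything else untouched. A small sketch, using a shortened excerpt of the string above and the SAO variant discussed in this thread; the helper names are illustrative only.

```python
def parse_params(params):
    """Split 'k=v:k=v' into a dict of strings, keeping the original order."""
    return dict(pair.split("=", 1) for pair in params.split(":"))


def with_overrides(params, overrides):
    """Return the params string with selected keys replaced or appended."""
    merged = parse_params(params)
    merged.update(overrides)
    return ":".join(f"{key}={value}" for key, value in merged.items())


# Shortened excerpt of the posted settings, same key order.
base = "merange=26:qcomp=0.73:ctu=32:sao=0:selective-sao=0:psy-rdoq=20.0:psy-rd=4.0:deblock=-1,-1"
print(with_overrides(base, {"sao": "1", "selective-sao": "2"}))
```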
Boulder
7th January 2020, 07:13
Yeah, qcomp will definitely bump the bitrate. I personally use 0.7 all the time. I apply very slight motion compensated denoising and also try to keep grain as much as possible. I don't know if --selective-sao 2 would be better than disabling it completely, the "problem" with x265 blurring things seems to be in the handling of b-frames.
microchip8
7th January 2020, 07:20
I haven't done much testing with SAO, but you have two things in x265 that can/will blur stuff: the deblock filter and SAO. I'm wary of enabling them both as you'll get a double effect. At the moment I'm very happy with just using deblock=-1,-1 and no SAO. Maybe I'll do some testing with SAO in the future (for I and P frames only)
benwaggoner
8th January 2020, 00:51
High QP will blur stuff too, and using SAO and deblock will allow for lower QPs. HEVC was very much designed around having deblock on to provide optimal compression efficiency. SAO as of x265 3.x will almost always improve quality at moderate-low bitrates.
Although it turns out SAO is only useful on I and P frames, hence --selective-sao 2 being the new default suggestion for better speed without meaningful quality difference.
microchip8
8th January 2020, 03:13
I did a quick test with selective-sao for I and P slices and my default of -1,-1 for deblock. It made the picture softer, and some noise was removed as well, which was noticeable to me. Not happy with that as I want to keep it as close as possible to the source. Using only the deblock filter without SAO seems to *me* the better option
I don't get high QPs here. I encode at CRF 21 with a qcomp of 0.73 and high values for psy-rd & psy-rdoq. Average QPs for B frames are 23 while for P are 20 and for I 19
benwaggoner
8th January 2020, 03:45
I did a quick test with selective-sao for I and P slices and my default of -1,-1 for deblock. It made the picture softer, and some noise was removed as well, which was noticeable to me. Not happy with that as I want to keep it as close as possible to the source. Using only the deblock filter without SAO seems to *me* the better option
I don't get high QPs here. I encode at CRF 21 with a qcomp of 0.73 and high values for psy-rd & psy-rdoq. Average QPs for B frames are 23 while for P are 20 and for I 19
You can't really compare these settings at the same CRF, since the actual file sizes will be different. I recommend testing in 2-pass VBR to compare parameters, so you're comparing quality at the same ABR. Once you figure out the optimum quality @ efficiency parameters, then you test to find what CRF gives you the desired quality.
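The 2-pass comparison described above can be sketched as a small Python helper that builds the ffmpeg command lines for each candidate parameter set at one shared bitrate. This is only a sketch: the clip name, the 5000 kbps target, the log file names and the two candidate parameter strings are placeholders rather than anything posted in this thread, and it assumes an ffmpeg build with libx265 where the pass number is passed through -x265-params.

```python
def two_pass_commands(src, dst, bitrate_kbps, x265_params, stats_file):
    """Build the pass-1 and pass-2 ffmpeg argument lists for one parameter set."""
    def one_pass(pass_number, output_args):
        # A separate stats file per candidate keeps the runs from overwriting each other.
        params = f"pass={pass_number}:stats={stats_file}"
        if x265_params:
            params += ":" + x265_params
        return ["ffmpeg", "-y", "-i", src, "-an",  # -an: video-only test encode
                "-c:v", "libx265", "-b:v", f"{bitrate_kbps}k",
                "-x265-params", params] + output_args

    # Pass 1 only gathers statistics, so its output is discarded (Unix null device).
    first = one_pass(1, ["-f", "null", "/dev/null"])
    second = one_pass(2, [dst])
    return first, second


# Hypothetical candidates: same target bitrate, only the filter settings differ.
candidates = {
    "sao_ip": "sao=1:selective-sao=2:deblock=-1,-1",
    "no_sao": "sao=0:deblock=-1,-1",
}
commands = {
    name: two_pass_commands("clip.mkv", f"{name}.mkv", 5000, params, f"{name}.log")
    for name, params in candidates.items()
}

for name, (first, second) in commands.items():
    print(name)
    print("  " + " ".join(first))
    print("  " + " ".join(second))
```

Because both outputs land at roughly the same average bitrate, any visible difference comes from the parameters rather than from file size; the CRF value can be picked afterwards.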
Boulder
8th January 2020, 05:56
Changing the default psy settings did produce a substantially higher bitrate at the same CRF. I did a 2-pass test and it looked worse at the same bitrate so I'm sticking with the default for the time being. With that test clip I used, raising psy-rdoq just from 1.0 to 2.0 made the bitrate demand rise quite a lot. Very noisy clip, which probably is a tough one for the psy things.
microchip8
8th January 2020, 06:36
Yes, this is expected, especially when there's a lot of noise. But it was the only thing that made the banding on the closet disappear at the beginning of Blade Runner 2049. Since then I've done other encodes, one of which required 11 Mbps (CRF 21 + qcomp 0.73), but it was a very noisy film (Rogue One: A Star Wars Story). Solo: A Star Wars Story also had a bit of noise but it only required 5.5 Mbps. Both were done with psy-rd 4.0 and psy-rdoq 20.0
blublub
20th January 2020, 20:33
Single socket Epyc 7742, 64 physical cores = 128 SMT cores
I had to fill it with 6-7 encoding jobs to fully saturate all SMT cores
With Rome you should set pools=64 and frame threads to 4 or 6. Depending on how the 7742 is presented to the OS you might also set it as 2 NUMA nodes with a pool size of 64 each, which will give you 128 threads. Depending on the source material and the preset it will speed up the encode considerably - best is probably preset slow
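As a rough illustration of the pool suggestion above, the following Python sketch assembles the threading part of an -x265-params string. It assumes that x265's pools option takes a comma-separated list of thread counts, one per NUMA node, and that frame-threads sets the number of frames encoded concurrently; the helper name is made up for this example, and the values are the ones from the post, not tested recommendations.

```python
def threading_params(pool_sizes, frame_threads):
    """Join per-NUMA-node pool sizes and a frame-thread count into x265 params."""
    pools = ",".join(str(size) for size in pool_sizes)
    return f"pools={pools}:frame-threads={frame_threads}"


# One pool of 64 threads, as in the first suggestion.
single_node = threading_params([64], 4)

# Two NUMA nodes with a pool of 64 each (assumes the OS exposes the CPU that way).
dual_node = threading_params([64, 64], 4)

print(single_node)
print(dual_node)
```

The resulting string can be appended to whatever else is passed via -x265-params, separated by a colon.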
Forteen88
21st January 2020, 08:28
With Rome you should set pools=64 and frame threads to 4 or 6. Depending on how the 7742 is presented to the OS you might also set it as 2 NUMA nodes with a pool size of 64 each, which will give you 128 threads. Depending on the source material and the preset it will speed up the encode considerably - best is probably preset slow
Many skilled video people here recommend preset slower over preset slow, since slower gives noticeably better picture quality.
blublub
21st January 2020, 20:54
Many skilled video people here recommend preset slower over preset slow, since slower gives noticeably better picture quality.
Why does slower produce better results than medium when using the same CRF - isn't the purpose of CRF to provide/describe the output quality?
When I encode with medium I get like 18fps or so, on slow it is like 7 or 8 - so anyone encoding with a 8c CPU is going to get like 3fps - that will take a day for one movie, are ppl really doing this?
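For a sense of scale, the encode time can be worked out from the frame count. This is a back-of-the-envelope Python sketch only: the 2-hour runtime and 24 fps source rate are assumptions, and 7.5 fps stands in for "7 or 8".

```python
def encode_hours(runtime_minutes, source_fps, encode_fps):
    """Hours needed to encode a movie at a given encoder throughput."""
    total_frames = runtime_minutes * 60 * source_fps
    return total_frames / encode_fps / 3600


# Assumed movie: 120 minutes at 24 fps, i.e. 172800 frames.
for label, fps in [("medium", 18), ("slow", 7.5), ("8-core guess", 3)]:
    print(f"{label}: {encode_hours(120, 24, fps):.1f} h")
```

Under those assumptions, 3 fps works out to 16 hours for one movie, so "a day" is in the right ballpark for longer films.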
microchip8
21st January 2020, 21:29
Why does slower produce better results than medium when using the same CRF - isn't the purpose of CRF to provide/describe the output quality?
When I encode with medium I get like 18fps or so, on slow it is like 7 or 8 - so anyone encoding with a 8c CPU is going to get like 3fps - that will take a day for one movie, are ppl really doing this?
CRF is not a quality metric but a rate factor. It tries to maintain roughly the same factor for the whole video. It is, though, loosely tied to quality.
I've made my own "preset" that combines features from both slow and slower, but does not have the same speed penalty as slower (e.g., I don't use --amp like slower does, and I use a CTU of 32, a qcomp of 0.7 and a few other minor tweaks). Depending on the movie, on my Ryzen 7 3700X I get between 3.5 and 5 fps with my preset on 1080p encodes. The slower preset will be even...ehm... slower ;)
blublub
21st January 2020, 21:40
CRF is not a quality metric but a rate factor. It tries to maintain roughly the same factor for the whole video. It is, though, loosely tied to quality.
I've made my own "preset" that combines features from both slow and slower, but does not have the same speed penalty as slower (e.g., I don't use --amp like slower does, and I use a CTU of 32, a qcomp of 0.7 and a few other minor tweaks). Depending on the movie, on my Ryzen 7 3700X I get between 3.5 and 5 fps with my preset on 1080p encodes. The slower preset will be even...ehm... slower ;)
Ok, thx. I obviously understood that wrong.
Could you post your settings? I use "main/medium" preset with some higher ones and I would like to compare - for UHD HDR encode I get 14fps on the new 3960x.
microchip8
21st January 2020, 21:48
Ok, thx. I obviously understood that wrong.
Could you post your settings? I use "main/medium" preset with some higher ones and I would like to compare - for UHD HDR encode I get 14fps on the new 3960x.
These are my current settings for 1080p. I don't encode 2160p yet... These are ffmpeg settings (the -x265-params string) but they are easily translatable to the x265 ones
ref=4:hme=0:me=star:merange=26:subme=5:bframes=8:rd=4:rd-refine=0:qcomp=0.7:fades=1:strong-intra-smoothing=1:ctu=32:qg-size=32:sao=0:selective-sao=0:cu-lossless=0:cutree=1:tu-inter-depth=4:tu-intra-depth=4:max-merge=4:rskip=1:rc-lookahead=100:aq-mode=1:aq-strength=1.0:rdoq-level=1:psy-rd=3.2:psy-rdoq=15.0:limit-modes=0:limit-refs=0:limit-tu=0:deblock=-1,-1:weightb=1:weightp=1:rect=1:amp=0:wpp=1:pmode=0:pme=0:b-intra=1:b-adapt=2:b-pyramid=1:tskip-fast=0:fast-intra=0:early-skip=0:splitrd-skip=0:min-keyint=24:keyint=240:transfer=bt709:colorprim=bt709:colormatrix=bt709
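The translation from an -x265-params string to x265 command-line options mentioned above could look roughly like this Python sketch. It assumes on/off options are written as --name and --no-name on the x265 command line while everything else is --name value; the set of on/off options below is a small hand-picked subset for illustration and would need to be checked against the x265 documentation for the version in use. The sample string is a shortened excerpt of the settings above.

```python
def to_x265_cli(params, boolean_opts):
    """Convert 'key=value:key=value' into a list of x265 command-line arguments."""
    args = []
    for item in params.split(":"):
        key, _, value = item.partition("=")
        if key in boolean_opts:
            # On/off switches take no value on the command line.
            args.append(f"--{key}" if value != "0" else f"--no-{key}")
        else:
            # Everything else is passed through unchanged as '--key value'.
            args += [f"--{key}", value]
    return args


# Assumed subset of on/off options; not a complete list.
BOOLEAN_OPTS = {"sao", "cutree", "amp", "rect", "wpp", "pmode", "pme",
                "weightb", "weightp"}

sample = "ref=4:me=star:bframes=8:qcomp=0.7:ctu=32:sao=0:cutree=1:amp=0:rect=1"
cli_args = to_x265_cli(sample, BOOLEAN_OPTS)
print(" ".join(cli_args))
```

Options whose values contain punctuation (deblock, for example) are passed through as-is here, so they are worth double-checking by hand.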