high core count cpu for x265 encoding good idea?
sonyzz
26th January 2020, 02:57
As I already have an X570 motherboard and RAM, bought about 4 months ago, I'm planning to get either the 3900X or the 3950X from AMD, both of which are high core count CPUs. How many cores can x265 (HEVC) encoding actually use? I know the program I use can max out an 8-core CPU (tested with a friend's Ryzen 7 3700X), which speeds up video encoding.
Another question: why do people in other topics say "Note that quality diminishes as the number of cores/threads used to encode a video file increases. Once that number reaches the double digits, the quality decrease starts to become noticeable."? Why should quality decrease if the encode simply runs twice as fast because of more cores, on the same preset and the same settings? Are people suggesting that a high core count CPU cannot be used with x265 to produce good encoding results? So I'm in a BIG dilemma here :( Also, please simplify the answers if possible :) Thanks
Atak_Snajpera
26th January 2020, 13:33
x265 using default settings won't saturate 3950x while encoding 1920x800 (2.4:1 aspect ratio) video.
However if you use Distributed Encoding in ripbot264 then you can fully utilize even dual AMD Epyc 2 (256 threads total) without any problems.
https://i.postimg.cc/Vk7yzLZN/client.png
sonyzz
26th January 2020, 16:23
But is it really true, as people say, that encoding quality decreases as the core/thread count increases? And if it does, is the decrease drastic, or is it a small loss with each added core, so that, say, 4 cores would mean a 5% loss and 8 cores a 10% loss? I was about to get at least a 3900X to make my encoding faster and to have spare resources while an encode is running, for example so I can encode and game at the same time. My 4c/8t CPU lets me encode and use the PC without problems (it feels a bit less responsive when encoding with the CPU at 99%), and I can also play games like LoL with some ping or FPS drops, but it's not a good experience, and AAA games are unplayable when encoding... :)
Atak_Snajpera
26th January 2020, 16:51
That issue was totally overblown. x264 has a hard thread limit (24 threads if I remember correctly) to prevent that quality degradation. In x265, the large default CTU size (64) also works as a limiting factor for thread utilization.
excellentswordfight
26th January 2020, 19:04
I wouldn't worry too much about quality degradation from threading as long as you stay more or less within the default values.
For 1080p video with preset slow you will see pretty good scaling up to 8c/16t; if you lower the CTU size to 32 or encode 2160p, you will probably be good up to around 16c/32t. Beyond that point, chunked encoding is pretty much a must to keep scaling across threads.
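A rough way to see why CTU size and resolution bound thread scaling: x265's wavefront parallel processing (WPP) works on rows of CTUs in parallel, so the number of CTU rows in a frame is an upper bound on how many WPP threads one frame can keep busy. The sketch below only counts rows; the helper name `ctu_rows` is made up for illustration, and real-world scaling will be lower because each row has to trail the row above it.

```python
import math


def ctu_rows(height: int, ctu_size: int = 64) -> int:
    """Number of CTU rows in a frame (upper bound on WPP row parallelism)."""
    return math.ceil(height / ctu_size)


# Resolutions and CTU sizes mentioned in this thread.
for height, ctu in [(800, 64), (1080, 64), (1080, 32), (2160, 64)]:
    print(f"height {height}, CTU {ctu}: {ctu_rows(height, ctu)} rows")
```

Halving the CTU size or doubling the frame height doubles the row count, which matches the observation that CTU 32 or 2160p keeps more threads busy.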
blublub
26th January 2020, 22:38
Hi
I have a TR3960x and I have learned the following:
Quality decreases in x265 if you use more frame-threads; the auto setting is 6 for high core count CPUs.
For a UHD encode I can reach about 80-90% CPU usage with pools=64, which makes it use 64 threads instead of the auto 48.
Especially for 1080p it makes sense to either run 2 encodes at the same time or distribute the encoding via RipBot to use all CPU cores.
kolak
27th January 2020, 00:58
Could ripbot264 be re-written to use ffmpeg as source/trimming/encoder, instead of avisynth?
Atak_Snajpera
27th January 2020, 10:53
Ffmpeg does not support frame-accurate seeking. Basically, you can't tell ffmpeg to seek to a specific frame. On the other hand, lsmash plus avisynth is much more reliable and flexible. An ffmpeg-only method would be a huge downgrade in every aspect.
kolak
27th January 2020, 23:57
If you let it decode the source then it will be frame accurate, similar to waiting for an index file. For I-frame-based codecs, -ss before -i is also frame accurate.
You can also seek this way:
-ss 8 -i xxxx -ss 4.08, which seeks to the 8th second as a rough seek and then 4.08 seconds on from that point as a precise seek. This way you should be able to seek very quickly and precisely in any format, if I understand it correctly.
avs/vs source filters are their weakest point and not 100% reliable at all. I had many problems and had to choose the correct one based on the source format (e.g. problems with transport streams or MXF files). They are actually based on ffmpeg, and sometimes there are problems even after indexing.
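The two-stage seek described above (a rough -ss before -i, then a precise -ss after it) can be sketched as a small argument builder. This is only an illustration: the function name `two_stage_seek_args` and the file name are hypothetical, and the returned list is just the start of a command, so output options still have to be appended by the caller.

```python
def two_stage_seek_args(src: str, coarse_s, fine_s) -> list[str]:
    """Build the start of an ffmpeg command with a rough seek before -i
    and a precise seek after it, in the order given in the post."""
    return [
        "ffmpeg",
        "-ss", str(coarse_s),  # input option: fast, rough seek
        "-i", src,
        "-ss", str(fine_s),    # output option: decode and discard up to this offset
    ]


# Example from the post: rough seek to 8 s, then 4.08 s further on.
print(" ".join(two_stage_seek_args("input.ts", 8, 4.08)))
```

Passing the list to `subprocess.run` (after adding an output) avoids shell quoting issues with file names.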
Atak_Snajpera
28th January 2020, 00:20
That's the problem. Seeking based on time is not precise enough. Trust me, sooner or later you will end up with chunks starting from an incorrect frame. I remember that MediaCoder tried this method and failed miserably. Check its forum and the posts related to distributed encoding.
kolak
28th January 2020, 00:23
The only problem I found is sometimes with fractional frame rates (I assume due to rounding; it would need figuring out how ffmpeg does it). For non-fractional frame rates I never had an issue. E.g. 0.04 is 1 frame for a 25p-based source. Quite simple.
If you are happy to wait for the decode time, then you have very precise access with the select filter. Then you can use frame numbers.
Besides, for distributed encoding we either pick "nice/round" numbers or scene changes. There is really no need for places like 00:10:12:11 (if it's not a scene change).
With ffmpeg you can jump to a rough place, then from this point find the first clear scene change and get the time for it. For the second part you jump to the same rough place and add the time of the scene change + 1 frame (this should actually "force" ffmpeg to "use the same math" for both chunks' time points).
Actually this could work very well.
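The frame-to-time arithmetic above (0.04 s per frame at 25p) can be done with exact fractions, so that rounding only happens once, when the timestamp is printed. This is a minimal sketch; `frame_to_seconds` and `format_seconds` are illustrative names, and it says nothing about how ffmpeg itself rounds timestamps.

```python
from fractions import Fraction


def frame_to_seconds(frame: int, fps: Fraction) -> Fraction:
    """Exact start time of a frame, in seconds, for a constant frame rate."""
    return Fraction(frame) / fps


def format_seconds(t: Fraction, digits: int = 6) -> str:
    """Round to a fixed number of decimals only at the very end."""
    return f"{float(t):.{digits}f}"


pal = Fraction(25)             # 25p, "non-fractional"
ntsc = Fraction(24000, 1001)   # 23.976p, "fractional"

print(format_seconds(frame_to_seconds(1, pal)))    # one frame at 25p
print(format_seconds(frame_to_seconds(1, ntsc)))   # one frame at 23.976p
```

With 25p every frame lands on a short decimal, while 24000/1001 never does, which is consistent with rounding trouble showing up only on fractional rates.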
Atak_Snajpera
28th January 2020, 00:32
I'm happy with l-smash (100% frame accurate) + avisynth, so going back to an ffmpeg-only method would be like opening Pandora's box. Besides, if it ain't broken, don't fix it.
kolak
28th January 2020, 00:40
If you think about it, it's a 5% code change, or rather a second version. It only changes what you execute; the whole main engine would stay the same.
I'm not a programmer, so I won't do it, but I realised it might not be so difficult, even with just Python (quite easy for just 1 machine).
Atak_Snajpera
28th January 2020, 00:45
But that constant seeking without an index would significantly delay encoding for later chunks.
The first chunk would start immediately, but what about the 100th? Also keep in mind that everything goes via the network and multiple servers would do the same task simultaneously.
kolak
28th January 2020, 00:52
Yes, that could be an issue, but -ss seeking should be enough.
I've tested it and it works even with ts files with h264 (which are a pain to seek).
Atak_Snajpera
28th January 2020, 00:56
Ha! Try with interlaced vc-1...
Btw. Why are you so anti-avisynth?
kolak
28th January 2020, 01:03
Not anti avs at all :)
I need as simple and reliable a solution as possible. My workflows require 99.99% reliability.
I used avs for massive jobs (e.g. 3000 hours of very heavy processing) and it did work, but ffmpeg is "easier" and multi-platform as well.
Interlaced vc-1 - not interested in it at all. It should never have existed :)
I deal with broadcast+intermediate formats, so XDCAM MXF, ProRes, DNxHD, AVC-I MXF etc. (mainly MXF and MOV).
Atak_Snajpera
28th January 2020, 01:10
But I need to cover all those formats. Another problem: without avisynth you would not have access to mdegrain, qtgmc, hdrtosdr tonemapping or a multithreaded resizer. I guess you also don't do any filtering like denoising or deinterlacing...
kolak
28th January 2020, 01:53
Deinterlacing for sure. De-noising not really.
I actually use vs for deinterlacing. It only deinterlaces interlaced frames and uses QTGMC. The same can be done with ffmpeg very easily, just with less good deinterlacing, although the recent new filter (bwdif = yadif + w3fdif) is good enough for broadcast and of course much faster than QTGMC.
MeteorRain
28th January 2020, 18:48
For a high core count CPU I'd reduce frame threads to 1 and pools to 12, and do chunked encoding (half-half) or parallel encoding (2 at a time).
I believe fewer threads = less waste on threading for x265, so I always use fewer threads and run more processes.
Besides, I already have the software infrastructure to do chunked encoding, and mux them later. Encoded GOPs can be joined later, so chunked encoding is basically zero cost to me.
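The setup described above (one frame thread, a 12-thread pool, and the source split into chunks that run as separate processes) could be sketched like this. It is an assumption-laden illustration, not anyone's actual tooling: the helper names, file names and frame count are made up, the source is assumed to be a .y4m file that the x265 CLI can read directly, and a real splitter would place chunk boundaries on scene changes rather than at even frame counts.

```python
def split_frames(total_frames: int, chunks: int) -> list[tuple[int, int]]:
    """Return (start_frame, frame_count) pairs that cover total_frames."""
    base, extra = divmod(total_frames, chunks)
    plan, start = [], 0
    for i in range(chunks):
        count = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        plan.append((start, count))
        start += count
    return plan


def x265_chunk_cmd(src: str, out: str, start: int, count: int,
                   frame_threads: int = 1, pools: str = "12") -> list[str]:
    """x265 CLI arguments for one chunk, using --seek/--frames to select it."""
    return [
        "x265", "--input", src,
        "--seek", str(start), "--frames", str(count),
        "--frame-threads", str(frame_threads), "--pools", pools,
        "--output", out,
    ]


# "Half-half": two chunks, each meant to run as its own x265 process.
for n, (start, count) in enumerate(split_frames(1001, 2)):
    print(" ".join(x265_chunk_cmd("movie.y4m", f"chunk{n}.hevc", start, count)))
```

Each command can then be started in parallel, and the resulting streams joined afterwards as described in the post.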
blublub
28th January 2020, 20:15
Higher frame-threads will degrade quality; however, the x265 default is 6 for HCC CPUs, so anything below that won't hurt quality significantly.
From my understanding the parameter "pools" has no impact on quality. By default it is set to the maximum number of available logical (Hyperthreading) cores.
benwaggoner
29th January 2020, 17:06
Well, setting --pools to limit x265 to a single NUMA socket will reduce the threads available, and so may reduce frame-threads. And it generally doesn't slow things down that much; in my testing, the overhead of using multiple sockets seems to eat up the speed benefit pretty significantly. Even at 8K, using the second socket on a 2x18/36 Xeon system is maybe 20% faster than a single socket, and I generally just set up a different encode for each socket (à la --pools "+,-" and "-,+").
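The per-socket --pools strings mentioned above follow a simple pattern: one comma-separated entry per NUMA node, "+" to use all threads on that node and "-" to use none. A small sketch of generating them (the helper name `pools_for_node` is hypothetical):

```python
def pools_for_node(node: int, num_nodes: int) -> str:
    """--pools value that enables every thread on one NUMA node and none elsewhere."""
    if not 0 <= node < num_nodes:
        raise ValueError("node index out of range")
    return ",".join("+" if i == node else "-" for i in range(num_nodes))


# One encode per socket on a two-socket machine.
for node in range(2):
    print(f'x265 --pools "{pools_for_node(node, 2)}" ...')
```

On a single-node system this degenerates to "+", so it only matters when the OS exposes more than one NUMA node.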
blublub
29th January 2020, 17:16
Hi
I only have 1 socket in my TR 3960x.
I currently don't know if x265 detects it as 1 or 2 Numa nodes.
However, increasing pools from the default/auto 48 to 64 does help with CPU utilization and speed. It's not pinned at 100%, but it stays at 85 to 95%, which is better than the 75% average with the auto setting.
In the end it all comes down to:
Scaling isn't optimal above 12c CPUs.
If one just encodes one job, a 16c/32t CPU such as the 3950x is the best bang for the buck, as the higher cost of a 24c CPU isn't worth it at the moment.
A 24c CPU makes sense if one encodes more than one job in parallel or uses distributed encoding.
Atak_Snajpera
29th January 2020, 18:01
Most likely your 3960x is seen as a single NUMA node (run EncodingServer.exe and look for the line [SYSTEM] NUMA Nodes = 1). Only the 3990x will be divided into two virtual NUMA nodes by the OS.
MeteorRain
29th January 2020, 19:25
higher frame-threads will degrade quality, however x265 default is 6 for HCC CPUs so anything below won't hurt quality significantly.
From my understanding the parameter "pools" has no impact on quality. It is by default set to max number of available Hyperthreading cores
Just so you know, I wasn't talking about loss of quality, but about wasted computing resources. A single-threaded application works more efficiently than a multi-threaded one in terms of work per computing resource (i.e. two single-threaded processes are more efficient than one 2-thread process). So I like to balance the thread count and the process count, so as to introduce neither too much trouble nor too much waste.
In my personal use case I found it a good balance to run 4-6 threads for x265 and another 2-4 threads for AviSynth.
MeteorRain
29th January 2020, 19:33
Also don't forget that half of the CPU threads are SMT. So if you wisely use half of the CPU threads (i.e. 16 threads, one on each of 16 physical cores) you'll see 50% CPU utilization, but underneath you are already using 80% of your CPU capacity. Pushing it to 100% CPU utilization will get you at most 25% more speed than at 50% utilization, before any extra threading loss.
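The 25% figure follows directly from the 80% estimate: if one thread per physical core already delivers 80% of the CPU's capacity, the remaining headroom is 100/80 - 1. A tiny sketch of that arithmetic, with the 80% treated as the rough estimate from the post rather than a measured value (`smt_headroom` is an illustrative name):

```python
from fractions import Fraction


def smt_headroom(capacity_at_one_thread_per_core: Fraction) -> Fraction:
    """Best-case extra speed from also loading the SMT threads."""
    return 1 / capacity_at_one_thread_per_core - 1


headroom = smt_headroom(Fraction(80, 100))
print(f"at most {float(headroom):.0%} faster")
```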
sonyzz
2nd February 2020, 17:02
My current settings for 1080p video on a 4c/8t i7 CPU look like this:
wpp / ctu=32 / min-cu-size=8 / max-tu-size=32 / tu-intra-depth=2 / tu-inter-depth=2 / me=3 / subme=5 / merange=57 / rect / no-amp / max-merge=3 / temporal-mvp / no-early-skip / rskip / rdpenalty=0 / no-tskip / no-tskip-fast / strong-intra-smoothing / no-lossless / no-cu-lossless / no-constrained-intra / no-fast-intra / open-gop / no-temporal-layers / interlace=0 / keyint=250 / min-keyint=23 / scenecut=40 / rc-lookahead=30 / lookahead-slices=4 / bframes=8 / bframe-bias=0 / b-adapt=2 / ref=4 / limit-refs=2 / limit-modes / weightp / weightb / aq-mode=3 / qg-size=32 / aq-strength=0.80 / cbqpoffs=0 / crqpoffs=0 / rd=4 / psy-rd=0.70 / rdoq-level=2 / psy-rdoq=1.00 / log2-max-poc-lsb=8 / limit-tu=0 / no-rd-refine / signhide / deblock=1:1 / no-sao / no-sao-non-deblock / b-pyramid / cutree / no-intra-refresh / rc=crf / crf=22.0 / qcomp=0.60 / qpmin=0 / qpmax=69 / qpstep=4 / ipratio=1.40 / pbratio=1.30