Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion. Before you start posting please read the forum rules. By posting to this forum you agree to abide by the rules. |
13th March 2019, 19:43 | #1 | Link |
Registered User
Join Date: Jan 2015
Posts: 118
|
HEVC CPU load with Threadripper - "threading" it right?!
Hi
it has been posted a couple of times that HEVC encoding doesn't fully utilize high-core-count CPUs at 1080p and even 2160p. Today I did some testing and was really surprised by the outcome:

x265 "defaults": numa-pools=32, frame-threads=6, CPU load 60-75%, 5.9 FPS
x265 settings I had used for the last few months to improve quality: numa-pools=32, frame-threads=2, CPU load 65%, 6.2 FPS

Since the CPU was far from 100% utilized, I thought it was time to RTFM again. After reading it I set numa-pools=1, since I only have a single-socket CPU.

x265 "numa 1": numa-pools=1, frame-threads=1, CPU load 2-3%, 0.3 FPS
x265 "numa 48": numa-pools=48, frame-threads=1, CPU load 88%, 7.3 FPS
x265 "numa 48 - II": numa-pools=48, frame-threads=2, CPU load 100%, 7.8 FPS
x265 "numa 48 - III": numa-pools=48, frame-threads=3, CPU load 100%, 8.2 FPS

So increasing numa-pools does help with CPU utilization and speed. With numa-pools=48 and frame-threads=2 the load was 100%. Increasing numa-pools beyond 48 did not increase CPU load or FPS any further for me.

With the default numa-pools of 32, raising frame-threads above 1 only seems to speed up the encode up to a value of 2: the encode with frame-threads=6 is a tad slower than the one with 2. But frame-threads=2 or 3 shows a real benefit once numa-pools is raised to 48, since the maximum was 8.2 FPS with frame-threads=3 and numa-pools=48. Increasing frame-threads further to 5 did not result in higher FPS.

So the question is: is there any disadvantage to using a high numa-pools value, the way there is quality degradation with higher frame-threads values?

cheers
Last edited by blublub; 14th March 2019 at 13:21. Reason: Added more results and re-ran 1st batch. Removed erroneous result with 9 fps.
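To make the figures in this post easier to compare, here is a small Python sketch that tabulates the reported FPS values and works out the speed-up relative to the x265 defaults. The numbers are copied from the post (with decimal points); the labels and the helper name `speedup_table` are only illustrative.

```python
# FPS figures as reported in the post.
# Tuple layout: (label, numa-pools, frame-threads, fps)
RESULTS = [
    ("defaults", 32, 6, 5.9),
    ("previous settings", 32, 2, 6.2),
    ("numa 1", 1, 1, 0.3),
    ("numa 48", 48, 1, 7.3),
    ("numa 48 - II", 48, 2, 7.8),
    ("numa 48 - III", 48, 3, 8.2),
]


def speedup_table(results, baseline_label="defaults"):
    """Return (label, fps, speed-up vs. baseline) rows, rounded to 2 decimals."""
    baseline_fps = next(fps for label, _, _, fps in results if label == baseline_label)
    return [(label, fps, round(fps / baseline_fps, 2)) for label, _, _, fps in results]


if __name__ == "__main__":
    for label, fps, ratio in speedup_table(RESULTS):
        print(f"{label:<18} {fps:>4.1f} fps  x{ratio:.2f}")
```

Read this way, the best run (numa-pools=48, frame-threads=3) is a bit under 1.4 times the speed of the defaults on this one source, so the gain is real but modest.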
13th March 2019, 20:22 | #2 | Link |
Registered User
Join Date: Dec 2002
Posts: 5,565
|
Which Threadripper do you have? What OS?
Have you tried something like --numa-pools "12,12,12,12"? --pmode? --pme? A reduced max CTU size?
I guess with that many threads it might be time to switch to running parallel instances on different parts of a movie (like RipBot implements).
Last edited by sneaker_ger; 13th March 2019 at 20:29.
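For anyone who wants to try these switches systematically, here is a minimal Python sketch that assembles an x265 command line from the options mentioned in this post. It only builds the argument list and does not run the encoder. The file names and the helper `build_x265_args` are made up for illustration, and the option spellings should be checked against `x265 --help` for your build.

```python
import shlex


def build_x265_args(infile, outfile, pools=None, frame_threads=None,
                    pmode=False, pme=False, ctu=None):
    """Assemble an x265 argument list; options left at None/False are omitted."""
    args = ["x265", "--input", infile, "--output", outfile]
    if pools is not None:
        # e.g. "48" or a per-node list such as "12,12,12,12"
        args += ["--numa-pools", str(pools)]
    if frame_threads is not None:
        args += ["--frame-threads", str(frame_threads)]
    if pmode:
        args.append("--pmode")  # parallel mode analysis
    if pme:
        args.append("--pme")    # parallel motion estimation
    if ctu is not None:
        args += ["--ctu", str(ctu)]
    return args


if __name__ == "__main__":
    # "source.y4m" and "out.hevc" are placeholder file names.
    cmd = build_x265_args("source.y4m", "out.hevc",
                          pools="12,12,12,12", frame_threads=2, pmode=True, ctu=32)
    print(shlex.join(cmd))
```

Keeping the options as function arguments makes it easy to loop over a few combinations and time each one, which is essentially what the first post did by hand.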
13th March 2019, 20:43 | #3 | Link | |
Registered User
Join Date: Jan 2015
Posts: 118
|
Quote:
I already used CTU=32 and qg-size=16 in all my tests and all previous encodes. I tried pme and pmode, but I observed neither more FPS nor a noticeably higher CPU load. That's why I started fiddling with numa-pools. What does --numa-pools "12,12,12,12" do?
EDIT: found it in the PDF. But I still have no idea which setting is a wise choice ;-)
EDIT2: --numa-pools "12,12,12,12" only results in a CPU load of around 50%.
Last edited by blublub; 13th March 2019 at 20:57.
|
13th March 2019, 22:09 | #5 | Link | |
Registered User
Join Date: Jan 2015
Posts: 118
|
Quote:
The defaults I posted in my 1st post are the x265 defaults for those two options ;-)
Last edited by blublub; 13th March 2019 at 22:52.
|
13th March 2019, 22:59 | #6 | Link |
Registered User
Join Date: Jan 2015
Posts: 118
|
Ok after a little more testing:
numa-pools = 36 or 38 does max out the CPU with frame-threads=2. With frame-threads=1 it pretty much takes numa-pools=48. When I lower numa-pools and enable "pme", the CPU load does go up by about 5-8%, but speed seems to drop by a lot, so it apparently hurts encoding speed.
Is anyone using high-core-count Xeons, maybe a 3175X? I am interested in what other users do to max out the CPU. I just have a hard time believing that encoding multiple jobs at the same time is the only solution.
27th April 2019, 23:12 | #7 | Link | |
Registered User
Join Date: Feb 2008
Posts: 145
|
Quote:
Windows or Linux? I have a dual 10-core V2 Xeon machine (20 threads) running Linux on which I ran into this issue. Since I also have several other Linux machines at home with spare cores, I began researching how best to distribute the load. The end result is a Docker Swarm and Redis Queue application with some custom code. I had the concept and built a bash script PoC, then worked with a talented friend to build the rest on my hardware. It's working (across multiple machines), but we're still tweaking and bug stomping right now. We expect to release the code publicly fairly soon, since aside from RipBot I've seen nothing public like it and was pretty frustrated. Frankly, we're hoping others will help us improve it and that this is a good head start. It's not flashy.

RipBot works well on Windows and can use AviSynth filters too, which is very nice, but it won't interleave jobs. I use it on my Windows boxes and when I need filters; I expect I'll be trying containers on them in the future to use the custom code described above. If what you want is no fuss, no muss on Windows, RipBot is pretty kickass with only a few annoyances, IMO. It gives good status feedback and won't kill user performance, so it can be run even on somewhat heavily utilized machines (my headless surveillance machine runs it, for instance). RipBot can also leverage video hardware if that's of interest. It gets around the core utilization issue by running multiple encoders in parallel, btw.

P.S. This machine and another will get the new Ryzen 16-core CPUs when they are released; both will be used for encoding.
Last edited by BLKMGK; 27th April 2019 at 23:14.
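As a rough illustration of the "multiple encoders in parallel" idea, here is a Python sketch that splits a clip into contiguous frame ranges, one per encoder instance. It is not how RipBot or the tool described above works internally; the helper name `plan_chunks` is made up, and mapping each range to x265's `--seek` and `--frames` options is just one way it could be used.

```python
def plan_chunks(total_frames, chunk_count):
    """Split frames [0, total_frames) into contiguous (start, length) ranges."""
    if total_frames <= 0 or chunk_count <= 0:
        raise ValueError("total_frames and chunk_count must be positive")
    chunk_count = min(chunk_count, total_frames)  # never create empty chunks
    base, extra = divmod(total_frames, chunk_count)
    chunks = []
    start = 0
    for index in range(chunk_count):
        # The first `extra` chunks take one additional frame each.
        length = base + (1 if index < extra else 0)
        chunks.append((start, length))
        start += length
    return chunks


if __name__ == "__main__":
    # Hypothetical 10,000-frame clip split across 4 encoder instances.
    for start, length in plan_chunks(10_000, 4):
        print(f"--seek {start} --frames {length}")
```

A real tool would normally also cut at scene changes and join the encoded chunks afterwards; this sketch only covers the frame bookkeeping.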
|
28th April 2019, 00:34 | #8 | Link | |
Broadcast Encoder
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,904
|
Quote:
Anyway, if you want, and if I have some spare time, I'll try playing with that a bit more using your settings.
|
30th April 2019, 07:44 | #11 | Link | |
Registered User
Join Date: Jul 2015
Posts: 706
|
Quote:
Code:
{"n_threads", "Set number of threads to be used when computing vmaf.", OFFSET(n_threads), AV_OPT_TYPE_INT, {.i64=0}, 0, UINT_MAX, FLAGS},
For x265 you can change the value passed for the thread count, but what for? In x265 this is only for the JSON charts. The SVT codecs already have working built-in VMAF metrics.
Code:
double x265_calculate_vmaf_framelevelscore(x265_param *param, x265_vmaf_framedata *vmafframedata)
{
    double score;
    int (*read_frame)(float *reference_data, float *distorted_data, float *temp_data, int stride, void *s);
    if (vmafframedata->internalBitDepth == 8)
    {
        read_frame = read_frame_8bit;
        if (vmafframedata->internalCsp == X265_CSP_I420)
            compute_vmaf(&score, vcd_yuv420p->format, vmafframedata->width, vmafframedata->height,
                         read_frame, vmafframedata, vcd_yuv420p->model_path, vcd_yuv420p->log_path,
                         vcd_yuv420p->log_fmt, vcd_yuv420p->disable_clip, vcd_yuv420p->disable_avx,
                         vcd_yuv420p->enable_transform, vcd_yuv420p->phone_model, vcd_yuv420p->psnr,
                         vcd_yuv420p->ssim, vcd_yuv420p->ms_ssim, vcd_yuv420p->pool,
                         param->frameNumThreads, /* thread count handed to libvmaf */
                         vcd_yuv420p->subsample, vcd_yuv420p->enable_conf_interval);
    }
    return score;
}
Last edited by Jamaika; 30th April 2019 at 07:57.
|
1st May 2019, 18:51 | #12 | Link |
Moderator
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,770
|
Also, frame-threading artifacts are typically around the GOP boundary, and with a long GOP that can get buried in the overall VMAF. Comparing the minimum frame VMAF or the lowest 1% of frame VMAF values can find regressions much more effectively.
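A minimal Python sketch of that comparison, assuming you already have a list of per-frame VMAF scores (for example parsed from a VMAF log). The helper name `vmaf_summary` and the sample numbers are made up for illustration.

```python
def vmaf_summary(frame_scores, worst_fraction=0.01):
    """Summarize per-frame VMAF: overall mean, minimum, and mean of the worst frames."""
    if not frame_scores:
        raise ValueError("frame_scores must not be empty")
    ordered = sorted(frame_scores)
    # Always look at least at the single worst frame.
    worst_count = max(1, int(len(ordered) * worst_fraction))
    worst = ordered[:worst_count]
    return {
        "mean": sum(ordered) / len(ordered),
        "min": ordered[0],
        "worst_mean": sum(worst) / len(worst),
    }


if __name__ == "__main__":
    # Synthetic example: 99 good frames and one bad frame.
    scores = [95.0] * 99 + [60.0]
    print(vmaf_summary(scores))
```

In the synthetic example the single bad frame barely moves the mean, while the minimum and the worst-1% mean drop all the way to the bad frame's score, which is why those statistics catch localized regressions.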
|
1st May 2019, 20:20 | #13 | Link | |
Herr
Join Date: Apr 2009
Location: North Europe
Posts: 556
|
Quote:
@Jamaika: I'd rather use your VMAF build of x265 that you released not long ago.
Last edited by Forteen88; 3rd May 2019 at 15:24.
|
3rd May 2019, 03:22 | #14 | Link |
The cult of personality
Join Date: May 2013
Location: Planet Vegeta
Posts: 155
|
Someone suggested getting these for encoding x264/x265, being a cheap and effective build:
https://www.aliexpress.com/item/Buy-...886314322.html
https://www.aliexpress.com/item/Inte...831123192.html
So it is a dual Xeon rig. What do you think? After some digging of my own, just for fun, I found this:
https://www.techspot.com/review/1218...-pc/page7.html
Looking forward to your opinions. I want to know what affordable build or setup one can use for encoding. Right now I am using my dedicated server for this, a lousy Atom 2750 that is not suited to encoding.
3rd May 2019, 04:48 | #15 | Link | |
Registered User
Join Date: May 2009
Posts: 331
|
Quote:
I'd strongly suggest you wait till Zen2 comes out before you make your purchase. You will either be able to get a new system that will destroy the dual Xeon and sip power, or someone's old stuff because they wanted to upgrade. Should be anywhere from 1-2 months at this point. |
|
3rd May 2019, 08:11 | #16 | Link |
Registered User
Join Date: Feb 2003
Location: Palmcoast of Norway
Posts: 363
|
Same issue here:
Code:
/ffmpeg -loglevel verbose -i 35543_1080p_25.mov -vf scale=960:540 -pix_fmt yuv420p10le -codec:v libx265 -x265-params keyint=100:min-keyint=200:no-open-gop=1:nal-hrd=VBR:force-cfr -level 5.1 -profile:v main10 -preset slow -crf 23 -maxrate 3M -bufsize 3M -force_key_frames "expr:eq(mod(n,100),0)" -c:a:0 aac -ac:a:0 2 -ab:a:0 128k -y foo_crf23_3.mp4
OS: RHEL 7.3 x64
CPU: 2x Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, so 40 "cores"
ffmpeg 4.1.3 static
I can never get over 10% usage with one encode job. Any tips? I've tried ffmpeg threads = 20, but it didn't help much. The .mov file is local.
3rd May 2019, 08:34 | #17 | Link | |
Registered User
Join Date: Dec 2002
Posts: 5,565
|
That's quite low. Is this for streaming via slow Internet/wifi?
Quote:
Code:
ffmpeg -i 35543_1080p_25.mov -benchmark -f null -
Last edited by sneaker_ger; 3rd May 2019 at 08:36.
|
3rd May 2019, 09:03 | #18 | Link | |
Lost my old account :(
Join Date: Jul 2017
Posts: 325
|
Quote:
IMO the multithreaded scaling of x265 is already very impressive. I'm not sure why people think there is an issue just because it doesn't scale infinitely. This is still a complex GOP-based codec, so unless you do chunk encoding there will never be infinite scaling.
Last edited by excellentswordfight; 3rd May 2019 at 09:10.
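One generic way to picture why scaling flattens out is Amdahl's law: if only a fraction of the work can run in parallel, the serial remainder caps the speed-up no matter how many threads are added. The Python sketch below uses a parallel fraction of 0.9 purely as an illustrative assumption; it is not a measured property of x265.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: ideal speed-up for a given parallel fraction and worker count."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


if __name__ == "__main__":
    # 0.9 is an assumed value for illustration only.
    for workers in (1, 8, 16, 32, 64):
        print(f"{workers:>2} workers: x{amdahl_speedup(0.9, workers):.2f}")
```

With that assumed fraction the speed-up can never exceed 10x, which is the general shape of the diminishing returns people see when adding threads to a single encode; chunk encoding sidesteps it by making the work almost fully parallel.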
|