Old 13th March 2019, 18:43   #1  |  Link
blublub
Registered User
 
Join Date: Jan 2015
Posts: 70
HEVC CPU load with Threadripper - "threading" it right?!

Hi

It has been posted a couple of times that HEVC encoding doesn't fully utilize high-core-count CPUs at 1080p or even 2160p.

Today I did some testing and I was really surprised at the outcome:

x265 "defaults":
numa-pools=32
frame threads=6

CPU load: 60-75%
Encoding at FPS = 5,9


x265 settings I had used for the last months to improve quality:

numa-pools=32
frame threads=2

CPU load: 65%
Encoding at FPS = 6,2


Since the CPU was far from 100% utilized, I thought it was time to RTFM again. After reading it I set numa-pools=1, since I only have a single-socket CPU.


x265 "numa 1":
numa-pools=1
frame threads=1

CPU load = 2-3%
Encoding at FPS = 0,3


x265 "numa 48":
numa-pools=48
frame threads=1

CPU load = 88%
Encoding at FPS = 7,3


x265 "numa 48 - II":
numa-pools=48
frame threads=2

CPU load = 100%
Encoding at FPS = 7,8


x265 "numa 48 - III":
numa-pools=48
frame threads=3

CPU load = 100%
Encoding at FPS = 8,2


So increasing numa-pools does help with CPU utilization and speed. With numa-pools=48 and frame-threads=2 the load reached 100%.
Increasing numa-pools beyond 48 did not raise CPU load or FPS any further for me.

Also, with the default numa-pools of 32, raising frame-threads above 1 only seems to speed up the encode up to a value of 2: the run with frame-threads=6 is a tad slower than the one with 2.
A frame-threads value of 2 or 3 shows a real benefit once numa-pools is raised to 48: the maximum was 8,2 FPS with frame-threads=3 and numa-pools=48.
Increasing frame-threads further to 5 did not result in higher FPS.
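
To make the comparison easier to read, here is a minimal Python sketch that tabulates the runs listed above and computes each one's speedup over the defaults. The figures are copied from this post (decimal commas written as points); the `Run` class, the "previous settings" label and the helper name are only illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    label: str
    pools: int
    frame_threads: int
    fps: float


# Figures copied from the runs listed in this post.
RUNS = [
    Run("defaults", 32, 6, 5.9),
    Run("previous settings", 32, 2, 6.2),
    Run("numa 1", 1, 1, 0.3),
    Run("numa 48", 48, 1, 7.3),
    Run("numa 48 - II", 48, 2, 7.8),
    Run("numa 48 - III", 48, 3, 8.2),
]


def speedup_vs(baseline: Run, run: Run) -> float:
    """Ratio of encoding speed relative to the baseline run."""
    return run.fps / baseline.fps


baseline = RUNS[0]
fastest = max(RUNS, key=lambda r: r.fps)

for run in RUNS:
    print(f"{run.label:<18} pools={run.pools:<3} F={run.frame_threads} "
          f"{run.fps:>4.1f} fps  x{speedup_vs(baseline, run):.2f}")
```

On these numbers the fastest run (numa-pools=48, frame-threads=3) is roughly 1.4 times the speed of the defaults. That is a single measurement per setting, so small differences between runs should be read with care.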

So the question is: is there any disadvantage to using a high numa-pools value, in the way that higher frame-threads values degrade quality?

cheers

Last edited by blublub; 14th March 2019 at 12:21. Reason: Added more results and re-ran 1st batch. Removed erroneous result with 9fps.
Old 13th March 2019, 19:22   #2  |  Link
sneaker_ger
Registered User
 
Join Date: Dec 2002
Posts: 5,476
Which Threadripper do you have? What OS?

Have you tried something like --numa-pools "12,12,12,12" ? --pmode? --pme? Reduced CTU max size?


I guess with that many threads it might be time to switch to running parallel instances running on different parts of a movie (like RipBot implements).
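
A rough sketch of the parallel-instances idea, in Python: split the frame range into contiguous chunks and give each x265 instance its own chunk via `--seek` and `--frames`. This is not a description of how RipBot does it. The file names are placeholders, the even split is an assumption (a real tool would more likely cut at scene changes), and the resulting streams still have to be joined afterwards.

```python
from typing import List, Tuple


def split_frames(total_frames: int, parts: int) -> List[Tuple[int, int]]:
    """Return (first_frame, frame_count) pairs that cover every frame once."""
    if total_frames <= 0 or parts <= 0:
        raise ValueError("total_frames and parts must be positive")
    parts = min(parts, total_frames)
    base, extra = divmod(total_frames, parts)
    chunks = []
    first = 0
    for index in range(parts):
        # The first `extra` chunks take one additional frame each.
        count = base + (1 if index < extra else 0)
        chunks.append((first, count))
        first += count
    return chunks


def chunk_commands(source: str, total_frames: int, parts: int) -> List[List[str]]:
    """Build one x265 argument list per chunk (output names are hypothetical)."""
    commands = []
    for index, (first, count) in enumerate(split_frames(total_frames, parts)):
        commands.append([
            "x265", "--input", source,
            "--seek", str(first),
            "--frames", str(count),
            "--output", f"chunk_{index:03d}.hevc",
        ])
    return commands


if __name__ == "__main__":
    for command in chunk_commands("movie.y4m", 1000, 3):
        print(" ".join(command))
```

Each instance can then run with few threads of its own, so the total load comes from the number of instances rather than from one encoder's internal threading.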

Last edited by sneaker_ger; 13th March 2019 at 19:29.
Old 13th March 2019, 19:43   #3  |  Link
blublub
Registered User
 
Join Date: Jan 2015
Posts: 70
Quote:
Originally Posted by sneaker_ger View Post
Which Threadripper do you have? What OS?

Have you tried something like --numa-pools "12,12,12,12" ? --pmode? --pme? Reduced CTU max size?


I guess with that many threads it might be time to switch to running parallel instances running on different parts of a movie (like RipBot implements).
Hi I have the 16c 2950x and use Win 10 Prof.

I already used CTU=32 and qg-size=16 in all my tests and all previous encodes.
I tried pme and pmode, but I observed neither more FPS nor a noticeably higher CPU load.
That's why I started fiddling with numa-pools.

What does --numa-pools "12,12,12,12" do?

EDIT: found it in the PDF. But I still have no idea which setting is a wise choice ;-)

EDIT2: --numa-pools "12,12,12,12" only results in a CPU load of around 50%.
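
For repeatable tests it helps to generate the command line instead of retyping it. The following is a minimal Python sketch that assembles an x265 argument list from the options discussed in this thread (`--numa-pools`, `--frame-threads`, `--ctu`, `--qg-size`, `--pmode`, `--pme`). The function name and the input/output file names are placeholders, and options left unset fall back to x265's own defaults.

```python
from typing import List, Optional


def x265_args(source: str, output: str,
              pools: Optional[str] = None,
              frame_threads: Optional[int] = None,
              ctu: Optional[int] = None,
              qg_size: Optional[int] = None,
              pmode: bool = False,
              pme: bool = False) -> List[str]:
    """Assemble an x265 argument list; only explicitly set options are added."""
    args = ["x265", "--input", source]
    if pools is not None:
        # A plain count ("48") or a per-node list ("12,12,12,12").
        args += ["--numa-pools", pools]
    if frame_threads is not None:
        args += ["--frame-threads", str(frame_threads)]
    if ctu is not None:
        args += ["--ctu", str(ctu)]
    if qg_size is not None:
        args += ["--qg-size", str(qg_size)]
    if pmode:
        args.append("--pmode")
    if pme:
        args.append("--pme")
    args += ["--output", output]
    return args


if __name__ == "__main__":
    # The fastest combination reported in the first post.
    print(" ".join(x265_args("in.y4m", "out.hevc", pools="48",
                             frame_threads=3, ctu=32, qg_size=16)))
```

Passing the list to `subprocess.run` avoids shell quoting problems with values such as "12,12,12,12".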

Last edited by blublub; 13th March 2019 at 19:57.
Old 13th March 2019, 20:36   #4  |  Link
Selur
Registered User
 
 
Join Date: Oct 2001
Location: Germany
Posts: 5,867
@blublub: out of curiosity: What CPU usage do you get when not specifying numa-pools and frame threads? (not sure whether the 'defaults' referred to this or if you explicitly specified the values)
__________________
Hybrid here in the forum, homepage
Old 13th March 2019, 21:09   #5  |  Link
blublub
Registered User
 
Join Date: Jan 2015
Posts: 70
Quote:
Originally Posted by Selur View Post
@blublub: out of curiosity: What CPU usage do you get when not specifying numa-pools and frame threads? (not sure whether the 'defaults' referred to this or if you explicitly specified the values)

The defaults I posted in my 1st post are the x265 defaults for those 2 options ;-)

Last edited by blublub; 13th March 2019 at 21:52.
Old 13th March 2019, 21:59   #6  |  Link
blublub
Registered User
 
Join Date: Jan 2015
Posts: 70
Ok after a little more testing:

numa-pools = 36 or 38 maxes out the CPU with frame-threads=2. With frame-threads=1 it pretty much takes numa-pools=48.
When I lower numa-pools and enable "pme", the CPU load goes up by about 5-8%, but speed drops a lot, so pme seems to hurt encoding speed.

Is anyone using high-core-count Xeons, maybe a 3175X? I am interested in what other users do to max out the CPU. I just have a hard time believing that encoding multiple jobs at the same time is the only solution.
Old 27th April 2019, 23:12   #7  |  Link
BLKMGK
Registered User
 
Join Date: Feb 2008
Posts: 136
Quote:
Originally Posted by blublub View Post
Ok after a little more testing:

numa-pools = 36 or 38 maxes out the CPU with frame-threads=2. With frame-threads=1 it pretty much takes numa-pools=48.
When I lower numa-pools and enable "pme", the CPU load goes up by about 5-8%, but speed drops a lot, so pme seems to hurt encoding speed.

Is anyone using high-core-count Xeons, maybe a 3175X? I am interested in what other users do to max out the CPU. I just have a hard time believing that encoding multiple jobs at the same time is the only solution.

Windows or Linux?

I have a dual 10-core V2 Xeon machine (20 threads) running Linux on which I ran into this issue. Since I also have several other Linux machines at home with spare cores, I began researching how best to distribute the load. The end result is a Docker Swarm and Redis Queue application with some custom code. I had the concept, built a bash script PoC, and then worked with a talented friend to build the rest on my hardware. It’s working (across multiple machines), but we’re still tweaking and bug stomping right now. We expect to release the code publicly fairly soon since, aside from RipBot, I’ve seen nothing public like it and was pretty frustrated. Frankly, we’re hoping others will help us improve it and that this is a good head start. It’s not flashy.

RipBot works well on Windows and can use AviSynth filters too, which is very nice, but it won’t interleave jobs. I use it on my Windows boxes and when I need filters; I expect I’ll be trying containers on them in the future to use the custom code described above. If what you want is no fuss, no muss on Windows, RipBot is pretty kickass with only a few annoyances IMO. It gives good status feedback and won’t kill user performance, so it can be run even on somewhat heavily utilized machines (my headless surveillance machine runs it, for instance). RipBot can also leverage video hardware if that’s of interest. It gets around the core utilization issue by running multiple encoders in parallel, btw.

P.S. This machine and another will get the new Ryzen 16core CPUs when released, both will be leveraged for encoding.

Last edited by BLKMGK; 27th April 2019 at 23:14.
Old 28th April 2019, 00:34   #8  |  Link
FranceBB
Broadcast Encoder
 
 
Join Date: Nov 2013
Location: Germany
Posts: 650
Quote:
Originally Posted by blublub View Post
Ok after a little more testing:

Is anyone using high-core-count Xeons, maybe a 3175X? I am interested in what other users do to max out the CPU. I just have a hard time believing that encoding multiple jobs at the same time is the only solution.
Intel Xeon 28c/56th over here, without AVX-512; anyway, I never managed to max it out. Please note that it's a dual-socket system and 28c/56th is the sum of the two CPUs. This is basically why I usually run multiple files at the same time. For instance, when we have to encode MPEG-2 files for broadcast usage, I run up to 56 encodes at the same time, because MPEG-2 encoders were meant to be used on single-core CPUs. Anyway, one of the reasons why I never managed to max it out with x265 is that I'm using AviSynth on 4K 12-bit content which is brought up to 16-bit for post-processing; this kind of file is handled very slowly by AviSynth, and going through a pipe before reaching x265 doesn't help.
Anyway, if you want, and if I have spare time, I'll try to play with that a little more using your settings.
__________________
Broadcast Encoder
Avisynth memes: 1 - 2 - 3
Videotek - Audacity XP
Old 28th April 2019, 10:17   #9  |  Link
Forteen88
Herr
 
Join Date: Apr 2009
Location: North Europe
Posts: 366
Quote:
Originally Posted by blublub View Post
So the question is: is there any disadvantage to using a high numa-pools value, in the way that higher frame-threads values degrade quality?
Since no one answers that question, you should do a VMAF or SSIM-test to see if it generates different values.
Old 30th April 2019, 05:14   #10  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 2,951
Quote:
Originally Posted by Forteen88 View Post
Since no one answers that question, you should do a VMAF or SSIM-test to see if it generates different values.
There can be quality regressions when using frame threading. Although that can be turned off directly via -F 1.
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 30th April 2019, 07:44   #11  |  Link
Jamaika
Registered User
 
Join Date: Jul 2015
Posts: 607
Quote:
Originally Posted by Forteen88 View Post
Since no one answers that question, you should do a VMAF or SSIM-test to see if it generates different values.
With ffmpeg/x265 and VMAF, the number of threads defaults to zero:
{"n_threads", "Set number of threads to be used when computing vmaf.", OFFSET(n_threads), AV_OPT_TYPE_INT, {.i64=0}, 0, UINT_MAX, FLAGS},

For x265 you can change this by passing a value for the threads, but what for? In x265 this only produces JSON charts. The SVT codecs already have built-in, working VMAF metrics.

double x265_calculate_vmaf_framelevelscore(x265_param *param, x265_vmaf_framedata *vmafframedata)
{
    double score;
    int (*read_frame)(float *reference_data, float *distorted_data, float *temp_data,
                      int stride, void *s);
    if (vmafframedata->internalBitDepth == 8)
    {
        read_frame = read_frame_8bit;
        if (vmafframedata->internalCsp == X265_CSP_I420)
            compute_vmaf(&score, vcd_yuv420p->format, vmafframedata->width, vmafframedata->height,
                         read_frame, vmafframedata, vcd_yuv420p->model_path, vcd_yuv420p->log_path,
                         vcd_yuv420p->log_fmt, vcd_yuv420p->disable_clip, vcd_yuv420p->disable_avx,
                         vcd_yuv420p->enable_transform, vcd_yuv420p->phone_model, vcd_yuv420p->psnr,
                         vcd_yuv420p->ssim, vcd_yuv420p->ms_ssim, vcd_yuv420p->pool,
                         param->frameNumThreads, vcd_yuv420p->subsample,
                         vcd_yuv420p->enable_conf_interval);
    }

    return score;
}

Last edited by Jamaika; 30th April 2019 at 07:57.
Old 1st May 2019, 18:51   #12  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 2,951
Quote:
Originally Posted by benwaggoner View Post
There can be quality regressions when using frame threading. Although that can be turned off directly via -F 1.
Also, frame-threading artifacts are typically around the GOP boundary, and with a long GOP that can get buried in the overall VMAF. Comparing the minimum frame VMAF or the lowest 1% of frame VMAF values can find regressions much more effectively.
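
A small Python sketch of that idea, assuming the per-frame VMAF scores have already been extracted into a list (the layout of the log they come from is not assumed here). The function name and the synthetic scores in the example are made up for illustration.

```python
from statistics import mean
from typing import Dict, Sequence


def worst_frames_summary(scores: Sequence[float], percent: int = 1) -> Dict[str, float]:
    """Summarize per-frame scores by overall mean, minimum, and low-tail mean.

    The low tail is the worst `percent` percent of frames, rounded up so
    that at least one frame is always included.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    tail_count = max(1, (len(ordered) * percent + 99) // 100)
    return {
        "mean": mean(ordered),
        "min": ordered[0],
        "low_tail_mean": mean(ordered[:tail_count]),
    }


if __name__ == "__main__":
    # Synthetic example: 198 good frames and two bad ones.
    example = [95.0] * 198 + [60.0, 70.0]
    print(worst_frames_summary(example))
```

In the synthetic example the overall mean stays close to 95 while the minimum and the low-tail mean drop to 60 and 65, which is the kind of regression an average alone would hide.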
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 1st May 2019, 20:20   #13  |  Link
Forteen88
Herr
 
Join Date: Apr 2009
Location: North Europe
Posts: 366
Quote:
Originally Posted by benwaggoner View Post
There can be quality regressions when using frame threading. Although that can be turned off directly via -F 1.
Thanks, but I've already read that you've written that before here at Doom9. I was just wondering, like the first poster did, if there is any disadvantage (quality degradation) with using a high number of numa-pools.

@Jamaika. I'll rather use your VMAF-compile of x265 that you released not long ago

Last edited by Forteen88; 3rd May 2019 at 15:24.
Old 3rd May 2019, 03:22   #14  |  Link
~ VEGETA ~
The cult of personality
 
 
Join Date: May 2013
Location: Planet Vegeta
Posts: 121
Someone suggested getting these for encoding x264/x265, being a cheap and effective build:

https://www.aliexpress.com/item/Buy-...886314322.html
https://www.aliexpress.com/item/Inte...831123192.html

So it is dual xeon rig, what do you think?

After some digging on my own, just for fun, I found this:

https://www.techspot.com/review/1218...-pc/page7.html

Looking forward to your opinions. I wanna know what affordable build or set up one can use to encode. I am now using my dedicated server for this, it is a lousy ATOM 2750 which is not suitable for encoding.
Old 3rd May 2019, 04:48   #15  |  Link
RanmaCanada
Registered User
 
Join Date: May 2009
Posts: 95
Quote:
Originally Posted by ~ VEGETA ~ View Post
Someone suggested getting these for encoding x264/x265, being a cheap and effective build:

https://www.aliexpress.com/item/Buy-...886314322.html
https://www.aliexpress.com/item/Inte...831123192.html

So it is dual xeon rig, what do you think?

After some digging on my own, just for fun, I found this:

https://www.techspot.com/review/1218...-pc/page7.html

Looking forward to your opinions. I wanna know what affordable build or set up one can use to encode. I am now using my dedicated server for this, it is a lousy ATOM 2750 which is not suitable for encoding.
Not worth the money. I recently moved from an E5-2670 to a Ryzen 2700 and I literally almost doubled my frame rate while cutting my power usage by a good third (150 watts while encoding vs 230 watts). If I overclock the 2700 to 2700x speeds, I do double my performance from my old Xeon at the same power usage. The sheer lack of AVX 2.0 makes them horrible for x265. There is a thread by Sagitarre? that properly benches processors for x265. If you check the thread you'll see that all AVX processors are beaten by similarly spec'd AVX 2.0 processors. Even my old 8 core was beaten easily by a quad core with AVX 2.0.

I'd strongly suggest you wait till Zen2 comes out before you make your purchase. You will either be able to get a new system that will destroy the dual Xeon and sip power, or someone's old stuff because they wanted to upgrade. Should be anywhere from 1-2 months at this point.
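
The power figures above can be turned into a rough efficiency comparison. This Python sketch uses the 230 W and 150 W numbers from this post; the speedup of 2.0 is an assumed upper bound, since the post only says the frame rate "almost doubled".

```python
def power_saving(old_watts: float, new_watts: float) -> float:
    """Fraction of power saved when moving from the old to the new system."""
    return 1.0 - new_watts / old_watts


def perf_per_watt_gain(speedup: float, old_watts: float, new_watts: float) -> float:
    """How many times more frames per watt the new system delivers."""
    return speedup * old_watts / new_watts


saving = power_saving(230.0, 150.0)
gain = perf_per_watt_gain(2.0, 230.0, 150.0)

print(f"power saving: {saving:.1%}")
print(f"frames per watt: x{gain:.2f} (upper bound, assumes a full 2x speedup)")
```

So the drop from 230 W to 150 W is about 35%, in line with "a good third", and at the stock setting the frames-per-watt gain would be at most about three times.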
Old 3rd May 2019, 08:11   #16  |  Link
TEB
Registered User
 
Join Date: Feb 2003
Location: Palmcoast of Norway
Posts: 325
Same issue here:

/ffmpeg -loglevel verbose -i 35543_1080p_25.mov -vf scale=960:540 -pix_fmt yuv420p10le -codec:v libx265 -x265-params keyint=100:min-keyint=200:no-open-gop=1:nal-hrd=VBR:force-cfr -level 5.1 -profile:v main10 -preset slow -crf 23 -maxrate 3M -bufsize 3M -force_key_frames "expr:eq(mod(n,100),0)" -c:a:0 aac -ac:a:0 2 -ab:a:0 128k -y foo_crf23_3.mp4


OS: RHEL 7.3 x64
CPU : Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz x 2
So 40 "cores"
ffmpeg 4.1.3 static

Can never get over 10% usage with 1 encode job.
Any tips? I've tried ffmpeg threads = 20, but it didn't help much.
The MOV file is local.
Old 3rd May 2019, 08:34   #17  |  Link
sneaker_ger
Registered User
 
Join Date: Dec 2002
Posts: 5,476
Quote:
Originally Posted by TEB View Post
keyint=100:min-keyint=200


Quote:
Originally Posted by TEB View Post
-maxrate 3M -bufsize 3M
That's quite low. Is this for streaming via slow Internet/wifi?

Quote:
Originally Posted by TEB View Post
Can never get over 10% usage with 1 encode job.
Any tips? I've tried ffmpeg threads = 20, but it didn't help much.
The MOV file is local.
How is CPU usage if you remove -vf scale? How fast is simple decoding of source?
Code:
ffmpeg -i 35543_1080p_25.mov -benchmark -f null -
I guess the low resolution makes it especially difficult.

Last edited by sneaker_ger; 3rd May 2019 at 08:36.
Old 3rd May 2019, 09:03   #18  |  Link
excellentswordfight
Lost my old account :(
 
Join Date: Jul 2017
Posts: 108
Quote:
Originally Posted by TEB View Post
Same issue here:

/ffmpeg -loglevel verbose -i 35543_1080p_25.mov -vf scale=960:540 -pix_fmt yuv420p10le -codec:v libx265 -x265-params keyint=100:min-keyint=200:no-open-gop=1:nal-hrd=VBR:force-cfr -level 5.1 -profile:v main10 -preset slow -crf 23 -maxrate 3M -bufsize 3M -force_key_frames "expr:eq(mod(n,100),0)" -c:a:0 aac -ac:a:0 2 -ab:a:0 128k -y foo_crf23_3.mp4


OS: RHEL 7.3 x64
CPU : Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz x 2
So 40 "cores"
ffmpeg 4.1.3 static

Can never get over 10% usage with 1 encode job.
Any tips? I've tried ffmpeg threads = 20, but it didn't help much.
The MOV file is local.
540p is rather low, so you might not get great utilization, but 10% sounds a bit low either way. With --ctu 32 --merange 26 I have no issue saturating 24 threads, and without it I can saturate around 16 threads for 1080p video.

IMO the multithread scaling of x265 is already very impressive; not sure why people think there is an issue just because it doesn't scale infinitely. This is still a complex GOP-based codec, so unless you do chunk encoding there will never be infinite scaling.
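
As a rough illustration of why resolution and CTU size matter here: x265's wavefront processing works on CTU rows, so the number of rows in a frame is a loose ceiling on how many threads can work on one frame at a time. The Python sketch below only counts rows; how many threads x265 actually keeps busy depends on more than this, so treat it as an estimate under that assumption.

```python
def ctu_rows(height: int, ctu: int) -> int:
    """Number of CTU rows needed to cover a frame of the given height."""
    if height <= 0 or ctu <= 0:
        raise ValueError("height and ctu must be positive")
    return -(-height // ctu)  # ceiling division without floats


if __name__ == "__main__":
    for height in (540, 1080, 2160):
        for ctu in (64, 32):
            print(f"{height}p, CTU {ctu}: {ctu_rows(height, ctu)} rows")
```

A 540p frame has 9 rows at CTU 64 and 17 at CTU 32, against 17 and 34 for 1080p, which fits the observation that low resolutions are hard to spread over many threads and that --ctu 32 helps.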

Last edited by excellentswordfight; 3rd May 2019 at 09:10.
Old 3rd May 2019, 09:25   #19  |  Link
TEB
Registered User
 
Join Date: Feb 2003
Location: Palmcoast of Norway
Posts: 325
@sneaker

25p x 4 sec gop = 100

Middle profile of a long stack/ladder

bench: 14.4x realtime
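
The arithmetic behind these numbers, as a short Python sketch. The 25 fps, 4 second and 14.4x figures come from this post and the keyint values from the command line earlier in the thread; the function names are illustrative. The keyint check only flags the contradiction of a minimum interval above the maximum and says nothing about how x265 resolves it.

```python
import math


def gop_frames(fps: float, seconds: float) -> int:
    """Keyframe interval in frames for a GOP of the given duration."""
    return round(fps * seconds)


def decode_fps(realtime_factor: float, fps: float) -> float:
    """Frames per second delivered by a decoder running at N x realtime."""
    return realtime_factor * fps


def keyint_consistent(keyint: int, min_keyint: int) -> bool:
    """True when the minimum keyframe interval does not exceed the maximum."""
    return min_keyint <= keyint


if __name__ == "__main__":
    print(gop_frames(25, 4))            # frames per 4 second GOP at 25p
    print(decode_fps(14.4, 25))         # frames per second the decoder delivers
    print(keyint_consistent(100, 200))  # the pair used in the command line
```

Decoding at 14.4x realtime on 25 fps material delivers about 360 frames per second, so the source decode on its own is unlikely to be what limits the encode.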
Old 3rd May 2019, 09:47   #20  |  Link
TEB
Registered User
 
Join Date: Feb 2003
Location: Palmcoast of Norway
Posts: 325
This is just irritating

40 "cores", with 2 ffmpeg instances,

https://imgur.com/a/SHlMrJ8