Old 13th February 2024, 18:26   #1  |  Link
FranceBB
Broadcast Encoder
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,904
x264-10bit UHD and Numa Nodes support?

Hi there guys,
I'm currently using x264-10bit with an AVS Script.avs to encode UHD BT2020 HLG 50p 4:2:2 500 Mbit/s 10bit planar files like so:

Quote:
x264-10b.exe "AVS Script.avs" --preset medium --profile high422 --level 5.2 --keyint 1 --no-cabac --slices 8 --bitrate 500000 --vbv-maxrate 500000 --vbv-bufsize 100000 --deblock -1:-1 --overscan show --colormatrix bt2020nc --range tv --log-level info --thread-input --transfer arib-std-b67 --colorprim bt2020 --videoformat component --nal-hrd cbr --aud --output-csp i422 --output-depth 10 --output "raw_video.h264"

ffmpeg.exe -hide_banner -i "raw_video.h264" -i "AVS Script.avs" -map 0:0 -c:v copy -map 1:1 -c:a pcm_s24le -ar 48000 -map_metadata -1 -f mxf "pre-final_output_UHD.mxf"

bmxtranswrap.exe -p -y 10:00:00:00 -o "final_output.mxf" "pre-final_output_UHD.mxf"

pause
The input sources are generally 23.976p 4:4:4 12bit DNxHQX files, and the AVS script does little more than the 4% speed-up to 25p (with pitch adjustment), frame duplication to 50p and conversion to 4:2:2 planar, all with 16bit precision. Nothing fancy.
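
For reference, the "4%" speed-up follows directly from the two frame rates. This is a minimal Python sketch, assuming "23.976p" means exactly 24000/1001 fps (the post doesn't state the rational rate); it isn't part of the actual AVS script.

```python
from fractions import Fraction

# Assumption: "23.976p" is the NTSC-style rate 24000/1001 fps.
SOURCE_FPS = Fraction(24000, 1001)
TARGET_FPS = Fraction(25, 1)

# Playing every source frame at 25 fps speeds the programme up by this factor.
speedup = TARGET_FPS / SOURCE_FPS


def sped_up_duration(seconds):
    """Running time in seconds after the speed-up, for a source of `seconds` length."""
    return float(seconds / speedup)


if __name__ == "__main__":
    print(f"speed-up factor: {float(speedup):.5f}")
    print(f"a 60 min source runs {sped_up_duration(3600) / 60:.2f} min at 25p")
```

The audio has to be sped up by the same factor, which is why a pitch adjustment is needed afterwards.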

What I'm interested in is the CPU usage: whether I could get any gains with the "big guns", and why my two dual socket configurations behave so differently.


First configuration:

CPU 0: Intel Xeon E5-2640 v4 2.40GHz 10c/20th (AVX2 max)
CPU 1: Intel Xeon E5-2640 v4 2.40GHz 10c/20th (AVX2 max)
RAM: 64 GB DDR4
OS: Windows 10 Enterprise x64

This configuration reaches a speed of 26fps and x264 saturates all cores and all threads, so there isn't anything to optimize here.



Second configuration:

CPU 0: Intel Xeon Gold 6238R 2.20GHz 28c/56th (AVX512 max)
CPU 1: Intel Xeon Gold 6238R 2.20GHz 28c/56th (AVX512 max)
RAM: 128 GB DDR4
OS: Windows Server 2019 Standard x64

This configuration reaches a speed of 32.9fps, only slightly faster than the other configuration, and x264 only saturates the cores and threads of CPU 0 instead of using both CPUs.




In other words, the reason why it's 26fps vs 32.9fps is because it's as if the 20c/40th was competing against a single 28c/56th CPU instead of a 56c/112th one...
On top of that, despite having AVX512, it's only using up to AVX2 'cause x264 has AVX512 assembly optimization only for the 8bit version but not for the 10bit version, sadly (or at least that's what the command line output from the prompt says).
What I don't understand is why this happens.
I mean, up until now I thought only x265 was NUMA-aware and therefore able to use both CPUs in a dual socket configuration. That matches what is happening on the more powerful 56c/112th configuration, but the 20c/40th one is also a dual socket configuration and there x264 is using both CPUs at 100%, so... what's going on here? And most importantly, is there anything I can do in this regard?
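
To put the comparison above in numbers, here is a rough Python sketch using only the figures quoted in this post. The "busy threads" value for the second machine assumes that exactly one 28c/56th CPU is loaded, as described; it is not a measurement.

```python
# Figures taken from the post; busy_threads for the second machine is an
# assumption (one fully loaded CPU out of two), not a measured value.
configs = {
    "2x Xeon E5-2640 v4": {"fps": 26.0, "busy_threads": 40, "total_threads": 40},
    "2x Xeon Gold 6238R": {"fps": 32.9, "busy_threads": 56, "total_threads": 112},
}


def fps_per_busy_thread(fps, busy_threads):
    """Average encoding speed contributed by each busy logical processor."""
    return fps / busy_threads


def utilisation(busy_threads, total_threads):
    """Fraction of the machine's logical processors that are doing work."""
    return busy_threads / total_threads


if __name__ == "__main__":
    for name, c in configs.items():
        per_thread = fps_per_busy_thread(c["fps"], c["busy_threads"])
        used = utilisation(c["busy_threads"], c["total_threads"])
        print(f"{name}: {per_thread:.4f} fps per busy thread, {used:.0%} of threads used")
```

Per busy thread the two machines are in the same ballpark, so the gap is mostly about how many logical processors x264 ends up using.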


The x264 build I'm using is c164_r3107_a8b68eb from the 17th of July 2023, so it's fairly up to date, in case you're wondering.
Avisynth is also updated as it's 3.7.3 stable, Ferenc's build of course.

Last edited by FranceBB; 22nd March 2024 at 23:36.
Old 13th February 2024, 18:42   #2  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,316
The number of threads in x264 is limited by the vertical size: if I remember correctly, the smallest piece it splits a frame into is around 40 lines.
You didn't specify the size of your video, but if it's 2160 lines, then 2160/40 = 54, so this is the maximum number of threads you can expect.
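
The estimate above can be written as a one-line calculation. This is only a sketch of the arithmetic in this post: the 40-line minimum is quoted from memory here and has not been checked against the x264 source, and `max_threads_by_height` is a made-up helper name, not an x264 function.

```python
def max_threads_by_height(height, min_lines_per_thread=40):
    """Upper bound on useful threads if each one needs at least
    `min_lines_per_thread` rows of the frame (assumed value, see above)."""
    return max(1, height // min_lines_per_thread)


if __name__ == "__main__":
    for height in (1080, 2160):
        print(f"{height} lines -> at most {max_threads_by_height(height)} threads")
```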
__________________
My github.
Old 13th February 2024, 18:53   #3  |  Link
FranceBB
Broadcast Encoder
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,904
Quote:
Originally Posted by jpsdr View Post
The number of threads in x264 is limited by the vertical size: if I remember correctly, the smallest piece it splits a frame into is around 40 lines.
You didn't specify the size of your video, but if it's 2160 lines, then 2160/40 = 54, so this is the maximum number of threads you can expect.
It's 3840x2160, so yeah, 2160/40 = 54 threads, which is about right, and indeed if I count the "boxes" in Task Manager that's the exact number.

Thank you Jean Philippe, as always!
Old 15th February 2024, 00:34   #4  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,770
I'm curious what your scenario for 2160p 10-bit H.264 is. Pretty much any device that can decode that can also decode HEVC AFAIK, which suggests there are devices I don't know about.
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 15th February 2024, 01:27   #5  |  Link
FranceBB
Broadcast Encoder
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,904
Quote:
Originally Posted by benwaggoner View Post
I'm curious what your scenario for 2160p 10-bit H.264 is. Pretty much any device that can decode that can also decode HEVC AFAIK, which suggests there are devices I don't know about.
The target is a device called "Versio" provided by a company called "Imagine".

It only accepts MPEG-2 8bit and H.264 10bit and it plays them back via SDI for playout.
It's part of the "Versio Integrated Playout" platform for UHD TX, used on linear channels by plenty of TV broadcasters out there.


Last edited by FranceBB; 15th February 2024 at 01:31.
Old 15th February 2024, 01:54   #6  |  Link
Blue_MiSfit
Derek Prestegard IRL
 
Join Date: Nov 2003
Location: Los Angeles
Posts: 5,989
Split and stitch
__________________
These are all my personal statements, not those of my employer :)
Old 15th February 2024, 20:40   #7  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,770
Quote:
Originally Posted by FranceBB View Post
The target is a device called "Versio" provided by a company called "Imagine".

It only accepts MPEG-2 8bit and H.264 10bit and it plays them back via SDI for playout.
It's part of the "Versio Integrated Playout" platform for UHD TX, used on linear channels by plenty of TV broadcasters out there.
Ah, a line-of-business product; makes sense.

As far as optimization goes, you've already implemented all the clever ideas I thought of. Using a faster preset if it gets you sufficient quality within the bitrate maybe?

Doing split and stitch with two x264 instances each pinned to a single NUMA node would offer quite a bit better throughput. The NU part of the MA introduces a fair amount of overhead, as you're seeing from your fps going up less than your thread utilization.
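
A minimal Python sketch of what such a split-and-stitch setup could look like is below. It only builds the command lines and doesn't run anything. It assumes the cmd.exe `start /NODE` switch for NUMA placement and x264's `--seek` and `--frames` options for selecting a frame range; the frame count, the part file names and the shortened option string are hypothetical placeholders, and it assumes node numbers 0 and 1 map to the two sockets. Cutting at any frame is safe for prediction here only because `--keyint 1` makes every frame an IDR. Whether a plain binary concatenation of the parts stays acceptable to the playout device with `--nal-hrd cbr` is untested.

```python
def split_frames(total_frames, parts):
    """Return (seek, frames) pairs that cover total_frames contiguously."""
    base, extra = divmod(total_frames, parts)
    ranges, start = [], 0
    for i in range(parts):
        count = base + (1 if i < extra else 0)  # spread the remainder over the first parts
        ranges.append((start, count))
        start += count
    return ranges


def build_command(node, seek, frames, script, out, encoder_opts):
    """Build one cmd.exe line that starts x264 with a preferred NUMA node."""
    return (
        f'start "" /NODE {node} x264-10b.exe "{script}" '
        f'--seek {seek} --frames {frames} {encoder_opts} --output "{out}"'
    )


if __name__ == "__main__":
    total = 150_000  # hypothetical frame count of the 50p script
    opts = "--preset medium --keyint 1"  # stand-in for the full option set in the first post
    for node, (seek, frames) in enumerate(split_frames(total, 2)):
        print(build_command(node, seek, frames, "AVS Script.avs", f"part{node}.h264", opts))
    # Stitch the raw Annex B streams once both encodes have finished.
    print("copy /b part0.h264 + part1.h264 raw_video.h264")
```

Each instance opens the AviSynth script on its own, so the script gets evaluated twice and memory use roughly doubles.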

I doubt AVX512 would benefit you that much; x264 has fewer opportunities for >256 bit SIMD than x265.
Old 18th February 2024, 18:49   #8  |  Link
FranceBB
Broadcast Encoder
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,904
Quote:
Originally Posted by benwaggoner View Post
Using a faster preset if it gets you sufficient quality within the bitrate maybe?
Ironically, I can't really go faster than medium 'cause I need proper compression in the TX Ready file (i.e. the MXF I deliver to Versio before it gets played out and hardware encoded to H.265 at 25 Mbit/s for consumers).
I used to think that 500 Mbit/s was plenty for UHD in H.264 and that it wouldn't matter anyway, but I actually found out that x264 wants to stay at around 800 Mbit/s in order to be "happy" and, if left uncapped, it would overshoot most of the time (hence the buffer size = bitrate / fps constraint). Crazy, I know, but it is what it is. Even so, it's still way more compressed than the original DNX running at a whopping 1.3 Gbit/s.
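
The numbers above can be checked with a little arithmetic. The sketch below is plain Python using the values from the command line in the first post (x264 takes `--bitrate` in kbit/s and `--vbv-bufsize` in kbit) and the 1.3 Gbit/s figure quoted here. By this arithmetic bitrate / fps is 10000 kbit, while the command passes `--vbv-bufsize 100000`, i.e. about ten frames' worth; the thread doesn't say which figure is intended.

```python
BITRATE_KBPS = 500_000      # --bitrate / --vbv-maxrate from the first post
FPS = 50                    # output frame rate (50p)
VBV_BUFSIZE_KBIT = 100_000  # --vbv-bufsize from the first post
DNX_KBPS = 1_300_000        # "1.3 Gbit/s" source figure quoted above

# Average budget per frame at constant bitrate (all frames are intra here).
per_frame_kbit = BITRATE_KBPS / FPS

# How many average-sized frames fit in the configured VBV buffer.
frames_in_buffer = VBV_BUFSIZE_KBIT / per_frame_kbit

# How much smaller the delivery file is than the DNx source.
reduction_vs_dnx = DNX_KBPS / BITRATE_KBPS

if __name__ == "__main__":
    print(f"per-frame budget: {per_frame_kbit:.0f} kbit")
    print(f"VBV buffer holds about {frames_in_buffer:.0f} average frames")
    print(f"source is {reduction_vs_dnx:.1f}x the delivery bitrate")
```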
Old 19th February 2024, 20:40   #9  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,770
Quote:
Originally Posted by FranceBB View Post
Ironically, I can't really go faster than medium 'cause I need proper compression in the TX Ready file (i.e. the MXF I deliver to Versio before it gets played out and hardware encoded to H.265 at 25 Mbit/s for consumers).
I used to think that 500 Mbit/s was plenty for UHD in H.264 and that it wouldn't matter anyway, but I actually found out that x264 wants to stay at around 800 Mbit/s in order to be "happy" and, if left uncapped, it would overshoot most of the time (hence the buffer size = bitrate / fps constraint). Crazy, I know, but it is what it is. Even so, it's still way more compressed than the original DNX running at a whopping 1.3 Gbit/s.
Yeah, DNxHD is a fast codec, but IDR only isn't efficient.
Old 21st March 2024, 14:55   #10  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,070
You can try starting more encoding processes and limiting the number of threads per process. It may be no good to start lots of threads with too small a work unit for each thread, because thread synchronization also adds significant overhead.
Old 22nd March 2024, 13:09   #11  |  Link
huhn
Registered User
 
Join Date: Oct 2012
Posts: 7,925
Time to necro a bit.

Can this Versio Imagine box handle the soft telecine flag --pulldown double?

It may just double the frame rate if you ask it to output progressive, with no deinterlacing artefacts. Or it may ruin the image.