FranceBB
13th February 2024, 18:26
Hi there guys,
I'm currently using x264-10bit with an AVS Script.avs to encode UHD BT2020 HLG 50p 4:2:2 500 Mbit/s 10bit planar files like so:
x264-10b.exe "AVS Script.avs" --preset medium --profile high422 --level 5.2 --keyint 1 --no-cabac --slices 8 --bitrate 500000 --vbv-maxrate 500000 --vbv-bufsize 100000 --deblock -1:-1 --overscan show --colormatrix bt2020nc --range tv --log-level info --thread-input --transfer arib-std-b67 --colorprim bt2020 --videoformat component --nal-hrd cbr --aud --output-csp i422 --output-depth 10 --output "raw_video.h264"
ffmpeg.exe -hide_banner -i "raw_video.h264" -i "AVS Script.avs" -map 0:0 -c:v copy -map 1:1 -c:a pcm_s24le -ar 48000 -map_metadata -1 -f mxf "pre-final_output_UHD.mxf"
bmxtranswrap.exe -p -y 10:00:00:00 -o "final_output.mxf" "pre-final_output_UHD.mxf"
pause
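To put the rate-control numbers from the command line above in perspective, here is a small back-of-the-envelope sketch. It assumes x264's usual units (kbit/s for --bitrate and --vbv-maxrate, kbit for --vbv-bufsize, with 1 kbit = 1000 bits) and a constant 50 fps; the variable names are just for illustration.

```python
# Values copied from the x264 command line above.
bitrate_kbps = 500_000      # --bitrate / --vbv-maxrate, in kbit/s
vbv_bufsize_kbit = 100_000  # --vbv-bufsize, in kbit
fps = 50                    # 50p output

# With --keyint 1 every frame is a keyframe, so the average budget per frame
# is simply the bitrate divided by the frame rate.
kbit_per_frame = bitrate_kbps / fps
mbyte_per_frame = kbit_per_frame * 1000 / 8 / 1_000_000

# How much video the VBV buffer holds at the target rate.
buffer_seconds = vbv_bufsize_kbit / bitrate_kbps
buffer_frames = buffer_seconds * fps

print(f"{kbit_per_frame:.0f} kbit per frame ({mbyte_per_frame:.2f} MB)")
print(f"VBV buffer: {buffer_seconds:.1f} s, about {buffer_frames:.0f} frames")
```

So each all-intra frame gets roughly 10 Mbit on average, and the buffer covers about a fifth of a second of video.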
The input sources are generally 23.976p 4:4:4 12bit DNxHQX files, and the AVS Script doesn't do much more than the 4% speed-up + pitch adjustment to 25p, frame duplication to 50p and conversion to 4:2:2 planar, all with 16bit precision. Nothing fancy.
What I'm interested in is the CPU usage: whether I could get any gains with the "big guns", and why my two dual socket configurations behave so differently.
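For reference, the speed-up factor can be worked out as below. This is only a sketch: it assumes the nominal 23.976 fps is exactly 24000/1001, and the one-hour duration is a made-up example value.

```python
import math
from fractions import Fraction

source_fps = Fraction(24000, 1001)  # assumption: "23.976p" means 24000/1001
target_fps = Fraction(25)

# Ratio by which playback is sped up when going to 25p.
speedup = target_fps / source_fps
percent_faster = float(speedup - 1) * 100

# Running time of a hypothetical one-hour source after the speed-up.
source_seconds = 3600
target_seconds = float(source_seconds / speedup)

# Pitch rise (in semitones) the speed-up would cause if the audio
# were not pitch-corrected.
semitones = 12 * math.log2(float(speedup))

print(f"speed-up factor: {float(speedup):.5f} ({percent_faster:.2f}% faster)")
print(f"1 h of source becomes {target_seconds:.1f} s")
print(f"uncorrected pitch shift: {semitones:.2f} semitones")
```

Under that assumption the factor is 1001/960, so slightly more than the round 4% quoted above.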
First configuration:
CPU 0: Intel Xeon E5-2640 v4 2.40GHz 10c/20th (AVX2 max)
CPU 1: Intel Xeon E5-2640 v4 2.40GHz 10c/20th (AVX2 max)
RAM: 64 GB DDR4
OS: Windows 10 Enterprise x64
This configuration reaches a speed of 26fps and x264 saturates all cores and all threads, so there isn't anything to optimize here:
https://i.imgur.com/vCFsLZy.png
Second configuration:
CPU 0: Intel Xeon Gold 6238R 2.20GHz 28c/56th (AVX512 max)
CPU 1: Intel Xeon Gold 6238R 2.20GHz 28c/56th (AVX512 max)
RAM: 128 GB DDR4
OS: Windows Server 2019 Standard x64
This configuration reaches a speed of 32.9fps, only slightly faster than the other configuration, and x264 only saturates the cores and threads of CPU 0 instead of using both CPUs:
https://i.imgur.com/hy0C0bf.png
In other words, it's 26fps vs 32.9fps because it's as if the 20c/40th configuration was competing against a single 28c/56th CPU instead of the full 56c/112th one...
On top of that, despite having AVX512, it's only using up to AVX2 'cause x264 has AVX512 assembly optimizations only for the 8bit version and not for the 10bit version, sadly (or at least that's what the command line output says).
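A quick sanity check of that reading, using only the figures from this post. The sketch assumes, very roughly, that throughput scales linearly with the number of hardware threads and ignores the clock and architecture differences between the two CPU generations.

```python
# Measured speeds and thread counts from the two configurations above.
fps_old, threads_old = 26.0, 40           # 2x Xeon E5-2640 v4
fps_new = 32.9                            # 2x Xeon Gold 6238R
threads_one_socket, threads_both = 56, 112

measured_gain = fps_new / fps_old

# Naive expectation if speed scaled linearly with thread count (an assumption).
expected_one_socket = threads_one_socket / threads_old
expected_both_sockets = threads_both / threads_old

print(f"measured gain:              {measured_gain:.2f}x")
print(f"expected with one socket:   {expected_one_socket:.2f}x")
print(f"expected with both sockets: {expected_both_sockets:.2f}x")
```

The measured gain sits much closer to the one-socket estimate than to the two-socket one, which is consistent with only CPU 0 being busy.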
What I don't understand is why this happens.
I mean, up until now I thought only x265 was NUMA-aware and therefore able to use both CPUs in a dual socket configuration. That matches what is happening on the more powerful 56c/112th configuration; however, the 20c/40th one is also a dual socket configuration and there x264 is using both CPUs at 100%, so... what's going on here? And most importantly, is there anything I can do in this regard?
The x264 build I'm using is c164_r3107_a8b68eb from the 17th of July 2023, so it's fairly up to date, in case you're wondering.
Avisynth is also up to date as it's 3.7.3 stable, Ferenc's build of course.
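One thing that might be relevant here, and it is an assumption rather than something confirmed in this post: Windows organizes logical processors into processor groups of at most 64, and on Windows versions of this generation a process that isn't group-aware is scheduled within a single group by default. The sketch below is a simplified model of that packing (whole sockets only, hypothetical function name); the real assignment done by Windows may differ.

```python
MAX_PER_GROUP = 64  # Windows limit on logical processors per processor group

def processor_groups(threads_per_socket, sockets):
    """Pack whole sockets into groups of at most MAX_PER_GROUP logical processors.

    Simplified model: assumes one NUMA node per socket and
    threads_per_socket <= MAX_PER_GROUP.
    """
    groups = []
    for _ in range(sockets):
        if groups and groups[-1] + threads_per_socket <= MAX_PER_GROUP:
            groups[-1] += threads_per_socket   # socket fits into the current group
        else:
            groups.append(threads_per_socket)  # start a new group
    return groups

print(processor_groups(20, 2))  # 2x E5-2640 v4 (20 threads per socket)
print(processor_groups(56, 2))  # 2x Gold 6238R (56 threads per socket)
```

In this model the 40 threads of the first configuration fit into one group, while the 112 threads of the second are split into two groups of 56, which would match one socket being saturated and the other sitting idle.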