FranceBB
8th June 2025, 19:48
Hi there,
up until a few years ago, if someone had asked me about encoding with x264 on an ARM CPU, I would have given them a funny look. I always thought ARM CPUs belonged in mobile devices like smartphones, since their main purpose was to be extremely power efficient and last a long time on battery. In other words, I didn't see them ever becoming a thing on desktop computers, let alone in a server running in a datacenter. Yet ARM-powered laptops have become a thing, and more and more people use ARM CPUs as their daily drivers, be it via Qualcomm CPUs on Windows and Linux or Apple M-series CPUs on macOS. Software support outside the mobile space has improved too, and this recently came to include frameservers like AviSynth and VapourSynth, decoders like libav, encoders like x264 and of course FFmpeg, so I thought: it's time for a comparison.
In particular, when it comes to x264, there are hand-written assembly optimizations for both x86_64 and ARM64: SSE, SSE2, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX512 and FMA on the x86 side (https://code.videolan.org/search?search=AVX512&nav_source=navbar&project_id=536&group_id=9&search_code=true&repository_ref=master) and NEON on the ARM side (https://code.videolan.org/search?search=NEON&nav_source=navbar&project_id=536&group_id=9&search_code=true&repository_ref=master). The CPU flag table from the x264 source lists them:
const x264_cpu_name_t x264_cpu_names[] =
{
#if ARCH_X86 || ARCH_X86_64
//  {"MMX", X264_CPU_MMX}, // we don't support asm on mmx1 cpus anymore
#define MMX2 X264_CPU_MMX|X264_CPU_MMX2
    {"MMX2", MMX2},
    {"MMXEXT", MMX2},
    {"SSE", MMX2|X264_CPU_SSE},
#define SSE2 MMX2|X264_CPU_SSE|X264_CPU_SSE2
    {"SSE2Slow", SSE2|X264_CPU_SSE2_IS_SLOW},
    {"SSE2", SSE2},
    {"SSE2Fast", SSE2|X264_CPU_SSE2_IS_FAST},
    {"LZCNT", SSE2|X264_CPU_LZCNT},
    {"SSE3", SSE2|X264_CPU_SSE3},
    {"SSSE3", SSE2|X264_CPU_SSE3|X264_CPU_SSSE3},
    {"SSE4.1", SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4},
    {"SSE4", SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4},
    {"SSE4.2", SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4|X264_CPU_SSE42},
#define AVX SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4|X264_CPU_SSE42|X264_CPU_AVX
    {"AVX", AVX},
    {"XOP", AVX|X264_CPU_XOP},
    {"FMA4", AVX|X264_CPU_FMA4},
    {"FMA3", AVX|X264_CPU_FMA3},
    {"BMI1", AVX|X264_CPU_LZCNT|X264_CPU_BMI1},
    {"BMI2", AVX|X264_CPU_LZCNT|X264_CPU_BMI1|X264_CPU_BMI2},
#define AVX2 AVX|X264_CPU_FMA3|X264_CPU_LZCNT|X264_CPU_BMI1|X264_CPU_BMI2|X264_CPU_AVX2
    {"AVX2", AVX2},
    {"AVX512", AVX2|X264_CPU_AVX512},
#undef AVX2
#undef AVX
#undef SSE2
#undef MMX2
    {"Cache32", X264_CPU_CACHELINE_32},
    {"Cache64", X264_CPU_CACHELINE_64},
    {"SlowAtom", X264_CPU_SLOW_ATOM},
    {"SlowPshufb", X264_CPU_SLOW_PSHUFB},
    {"SlowPalignr", X264_CPU_SLOW_PALIGNR},
    {"SlowShuffle", X264_CPU_SLOW_SHUFFLE},
    {"UnalignedStack", X264_CPU_STACK_MOD4},
#elif ARCH_PPC
    {"Altivec", X264_CPU_ALTIVEC},
#elif ARCH_ARM
    {"ARMv6", X264_CPU_ARMV6},
    {"NEON", X264_CPU_NEON},
    {"FastNeonMRC", X264_CPU_FAST_NEON_MRC},
#elif ARCH_AARCH64
    {"ARMv8", X264_CPU_ARMV8},
    {"NEON", X264_CPU_NEON},
    {"DotProd", X264_CPU_DOTPROD},
    {"I8MM", X264_CPU_I8MM},
    {"SVE", X264_CPU_SVE},
    {"SVE2", X264_CPU_SVE2},
#elif ARCH_MIPS
    {"MSA", X264_CPU_MSA},
#elif ARCH_LOONGARCH
    {"LSX", X264_CPU_LSX},
    {"LASX", X264_CPU_LASX},
#endif
    {"", 0},
};
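One way to confirm which of these code paths an encode actually used is to read the "using cpu capabilities" line that x264 logs at startup, which uses the names from the table above. What follows is only a minimal Python sketch, assuming the ffmpeg log has been saved as text. The sample log lines (address, flag list) are illustrative and were not captured from the benchmark machines, and cpu_capabilities is a hypothetical helper name.

```python
import re

# Illustrative log excerpt (NOT taken from the benchmark runs).
SAMPLE_LOG = """\
[libx264 @ 0x5581c0] using SAR=1/1
[libx264 @ 0x5581c0] using cpu capabilities: ARMv8 NEON
[libx264 @ 0x5581c0] profile High, level 4.1, 4:2:0, 8-bit
"""


def cpu_capabilities(log_text):
    """Return the flag names from the capabilities line, or [] if none."""
    match = re.search(r"using cpu capabilities:\s*(.+)", log_text)
    if match is None:
        return []
    flags = match.group(1).split()
    # x264 reports "none!" when no SIMD extension is in use.
    if flags == ["none!"]:
        return []
    return flags


caps = cpu_capabilities(SAMPLE_LOG)
print(caps)
print("NEON active:", "NEON" in caps)
```

Running the same check on the logs from both machines would show at a glance whether, say, AVX2 or AVX512 was in use on one side and only NEON on the other.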
To make this comparison fair and avoid a "David vs Goliath" benchmark, I've picked two EC2 instances that are identical in terms of vCPU count and RAM, in particular:
x86_64
c6i.2xlarge 8 vCPUs 16GB RAM
ARM 64
c6g.2xlarge 8 vCPUs 16GB RAM
In other words, we have two virtual machines: the x86 one runs on an Intel Xeon Platinum 8375C (Ice Lake) host, while the ARM 64 one runs on a Graviton 2, which uses ARMv8 Neoverse-N1 cores.
For the test, Linux was used, specifically Ubuntu 24.04 running FFmpeg 6.1.1 Stable. Each EC2 instance had 2TB of attached storage to work on, so the benchmark essentially consisted of:
1) Spinning up the EC2 instance
2) Transferring a mezzanine file from an S3 bucket to the 2TB attached storage of the EC2 instance
3) Triggering the encode to create the final output files
4) Delivering those files back to S3
5) Shutting down the EC2 instance
The power-up and power-down times, as well as the file transfer times, were then subtracted from the total job time in order to end up with only the actual computation time.
A total of 7 sources were used, and in all cases the input file was a standard XDCAM-50 file with four audio tracks: DolbyE Italian, DolbyE Original, PCM Stereo Italian and PCM Stereo Original. In particular:
Video:
FULL HD 1920x1080 MPEG-2 High 4:2:2 Profile, Level High 50 Mbit/s yv16 25i TFF BT709 SDR
Audio:
Track1 DolbyE 5.1 44800Hz 20bit Italian
Track2 DolbyE 5.1 44800Hz 20bit Original
Track3 PCM 2.0 48000Hz 24bit Italian
Track4 PCM 2.0 48000Hz 24bit Original
The 44800Hz in the DolbyE tracks refers to the internal sampling rate of that stream at 25fps (1792 samples per frame * 25 frames per second = 44800 Hz), which is always resampled to 48000Hz when played back on a hardware decoder.
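The arithmetic behind the 44800Hz figure can be checked in a few lines. This is just a sketch using the numbers quoted above (1792 samples per frame, 25fps, 48000Hz playback); the variable names are made up for illustration.

```python
from fractions import Fraction

SAMPLES_PER_FRAME = 1792  # DolbyE samples per video frame, as quoted above
FRAME_RATE = 25           # frames per second of the source
PLAYBACK_RATE = 48000     # Hz, rate after resampling

# Internal sampling rate reported for the DolbyE stream.
internal_rate = SAMPLES_PER_FRAME * FRAME_RATE

# Exact resampling ratio needed to reach the playback rate.
resample_ratio = Fraction(PLAYBACK_RATE, internal_rate)

# Samples per video frame once resampled to the playback rate.
samples_per_frame_at_playback = PLAYBACK_RATE // FRAME_RATE

print(internal_rate)
print(resample_ratio)
print(samples_per_frame_at_playback)
```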
The encoding job consisted of 6 steps:
Step 1: Encoding the video
FULL HD H.264 Profile High Level 4.1 Ref 4 CRF 25 4:2:0 Limited TV Range 8bit planar BT709 SDR
Step 2: Encoding the audio in AAC
Track1 AAC 5.1 550 kbit/s 48000Hz Italian
Track2 AAC 5.1 550 kbit/s 48000Hz Original
Track3 AAC 2.0 384 kbit/s 48000Hz Italian
Track4 AAC 2.0 384 kbit/s 48000Hz Original
Step 3: Encoding the audio in Opus as a proxy
Track1 Proxy: Opus Mono 64 kbit/s Italian
Track2 Proxy: Opus Mono 64 kbit/s Original
Step 4: Encoding the video in H.264 as a proxy with watermark and muxing the already encoded audio
SD H.264 Profile High Level 4.1 Ref 4 CRF 25 4:2:0 Limited TV Range 8bit planar BT709 SDR
Step 5: Muxing the FULL HD video and the 5.1 AAC audio in MP4
Step 6: Extracting a low resolution thumbnail from the middle of the video and encoding it in JPEG
The command lines used were as follows:
#BT709
#Video only
ffmpeg -i $inputSpec:myInput -map 0:v -c:v libx264 -profile:v high -level:v 4.1 -refs 4 -crf 25 -ignore_chapters 1 -ignore_unknown -write_tmcd 0 -movflags faststart -vf "sidedata=delete,metadata=delete,bwdif=mode=0:parity=0:deint=0,scale=w=1920:h=1080:flags=lanczos:sws_dither=ed,format=yuv420p,setfield=prog,setsar=1:1,fps=25" -x264opts "opencl:keyint=25:force_cfr=1:deblock=-1,-1:aud=1:overscan=show:colorprim=bt709:fullrange=off:transfer=bt709:colormatrix=bt709" -color_primaries bt709 -color_trc bt709 -colorspace bt709 -color_range tv -field_order progressive -brand mp42 -max_muxing_queue_size 700 -map_metadata -1 -metadata creation_time=now -an -f mp4 -y $jobOutputFolder:Video_Only.mp4
#CH.1-2 DolbyE 5.1 - CH.3-4 DolbyE 5.1 - CH.5-6 stereo - CH.7-8 stereo audio track
#Extract DolbyE track 1 and 2
ffmpeg -i $inputSpec:myInput -map 0:1 -acodec copy -f u8 -y $jobOutputFolder:stream1.u8
ffmpeg -i $inputSpec:myInput -map 0:2 -acodec copy -f u8 -y $jobOutputFolder:stream2.u8
#Encoding stereo track 3 and 4
ffmpeg -i $inputSpec:myInput -map 0:3 -c:a aac -b:a 384k -ar 48000 -y $jobOutputFolder:myOutputCh56.m4a
ffmpeg -i $inputSpec:myInput -map 0:4 -c:a aac -b:a 384k -ar 48000 -y $jobOutputFolder:myOutputCh78.m4a
#Extract each channel of DolbyE 5.1 ITA and DolbyE 5.1 ORI
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.0:0.0.0 -y $jobOutputFolder:ITA_FL.wav
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.1:0.0.0 -y $jobOutputFolder:ITA_FR.wav
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.2:0.0.0 -y $jobOutputFolder:ITA_CC.wav
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.3:0.0.0 -y $jobOutputFolder:ITA_LFE.wav
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.4:0.0.0 -y $jobOutputFolder:ITA_SL.wav
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.5:0.0.0 -y $jobOutputFolder:ITA_SR.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.0:0.0.0 -y $jobOutputFolder:ORI_FL.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.1:0.0.0 -y $jobOutputFolder:ORI_FR.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.2:0.0.0 -y $jobOutputFolder:ORI_CC.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.3:0.0.0 -y $jobOutputFolder:ORI_LFE.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.4:0.0.0 -y $jobOutputFolder:ORI_SL.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.5:0.0.0 -y $jobOutputFolder:ORI_SR.wav
#Audio only
ffmpeg -i $jobOutputFolder:ITA_FL.wav -i $jobOutputFolder:ITA_FR.wav -i $jobOutputFolder:ITA_CC.wav -i $jobOutputFolder:ITA_LFE.wav -i $jobOutputFolder:ITA_SL.wav -i $jobOutputFolder:ITA_SR.wav -filter_complex "[0:a][1:a][2:a][3:a][4:a][5:a]join=inputs=6:channel_layout=5.1:map=0.0-FL|1.0-FR|2.0-FC|3.0-LFE|4.0-BL|5.0-BR[a]" -map "[a]" -c:a aac -b:a 550k -ar 48000 -y $jobOutputFolder:myOutputCh12.m4a
ffmpeg -i $jobOutputFolder:ORI_FL.wav -i $jobOutputFolder:ORI_FR.wav -i $jobOutputFolder:ORI_CC.wav -i $jobOutputFolder:ORI_LFE.wav -i $jobOutputFolder:ORI_SL.wav -i $jobOutputFolder:ORI_SR.wav -filter_complex "[0:a][1:a][2:a][3:a][4:a][5:a]join=inputs=6:channel_layout=5.1:map=0.0-FL|1.0-FR|2.0-FC|3.0-LFE|4.0-BL|5.0-BR[a]" -map "[a]" -c:a aac -b:a 550k -ar 48000 -y $jobOutputFolder:myOutputCh34.m4a
#Audio for proxy
ffmpeg -i $jobOutputFolder:myOutputCh12.m4a -ac 1 -c:a libopus -b:a 64k -y $jobOutputFolder:myOutputCh12_Transcribe.ogg
ffmpeg -i $jobOutputFolder:myOutputCh34.m4a -ac 1 -c:a libopus -b:a 64k -y $jobOutputFolder:myOutputCh34_Transcribe.ogg
#Muxed mono audio with TC & Watermark
ffmpeg -i $jobOutputFolder:Video_Only.mp4 -i $jobOutputFolder:myOutputCh12.m4a -i $jobOutputFolder:myOutputCh34.m4a -map 0:0 -map 1:0 -map 2:0 -vf "fps=25,scale=w=1024:h=576:flags=lanczos:sws_dither=ed,setfield=prog,setsar=1:1","drawtext=\timecode='10\:00\:00\:00':timecode_rate=25:x=(w-tw)/2:y=h-(1*lh):fontcolor=white@1:fontsize=25:box=1:boxcolor=black@0.6","drawtext=\text='Internal Use Only':x=(w-text_w)/2:y=(h-text_h)/2:fontcolor=white@0.1:fontsize=125:line_spacing=100" -c:v libx264 -profile:v high -level:v 4.1 -refs 4 -pix_fmt yuv420p -crf 25 -x264opts "opencl:keyint=25:force_cfr=1:deblock=-1,-1:aud=1:overscan=show:colorprim=bt709:fullrange=off:transfer=bt709:colormatrix=bt709" -color_primaries bt709 -color_trc bt709 -colorspace bt709 -color_range tv -field_order progressive -brand mp42 -max_muxing_queue_size 700 -map_metadata -1 -metadata creation_time=now -ignore_chapters 1 -ignore_unknown -write_tmcd 0 -movflags faststart -c:a copy -f mp4 -y $jobOutputFolder:Subtitling_Proxy.mp4
#Muxed audio and video
ffmpeg -i $jobOutputFolder:Video_Only.mp4 -i $jobOutputFolder:myOutputCh12.m4a -i $jobOutputFolder:myOutputCh34.m4a -map 0:v -map 1:a -map 2:a -c:v copy -c:a copy -f mp4 -y $jobOutputFolder:my_Muxed_Output.mp4
#Thumbnail
ffmpeg -ss 01:02:36.280 -i $jobOutputFolder:Video_Only.mp4 -vf "thumbnail=300,scale=w=240:h=136,setsar=1:1" -sws_flags lanczos -frames:v 1 -y $jobOutputFolder:thumb.jpg
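Since the thumbnail is taken from the middle of the video, the seek position can be derived from the source duration. The sketch below is only an illustration: it assumes durations are written as HH:MM:SS:FF timecode at 25fps (non-drop-frame), and the helper names are hypothetical. For a duration of 02:05:12:16 it gives a value close to the -ss shown in the thumbnail command.

```python
def timecode_to_frames(timecode, fps=25):
    """Convert an HH:MM:SS:FF timecode into a frame count."""
    hours, minutes, seconds, frames = (int(p) for p in timecode.split(":"))
    return (hours * 3600 + minutes * 60 + seconds) * fps + frames


def frames_to_seek(frames, fps=25):
    """Format a frame count as HH:MM:SS.mmm, as accepted by ffmpeg -ss."""
    total_ms = frames * 1000 // fps
    total_seconds, millis = divmod(total_ms, 1000)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def midpoint_seek(duration_timecode, fps=25):
    """Seek position for the middle frame of a clip of the given duration."""
    return frames_to_seek(timecode_to_frames(duration_timecode, fps) // 2, fps)


print(midpoint_seek("02:05:12:16"))
```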
One last thing to keep in mind: the ARM instance is 20% cheaper to run than the x86 one, which means that, in theory, if the ARM CPU were as fast as the x86 one, it would save 20% of the cost.
Spoiler alert: this didn't happen.
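To put the 20% figure in perspective, here is a rough sketch of the break-even point. It assumes cost scales linearly with encoding time and ignores storage and transfer charges; the time ratios in the loop are illustrative values, not measurements, and relative_cost is a hypothetical helper.

```python
ARM_DISCOUNT = 0.20  # ARM is 20% cheaper to run, as stated above


def relative_cost(time_ratio, discount=ARM_DISCOUNT):
    """ARM cost divided by x86 cost for a given ARM/x86 encoding time ratio."""
    return (1.0 - discount) * time_ratio


# ARM only breaks even if it needs no more than this multiple of the x86 time.
break_even_ratio = 1.0 / (1.0 - ARM_DISCOUNT)
print(f"break-even time ratio: {break_even_ratio:.2f}")

for ratio in (1.00, 1.25, 1.50):  # illustrative ratios only
    change = (relative_cost(ratio) - 1.0) * 100.0
    print(f"time ratio {ratio:.2f} -> cost change {change:+.1f}%")
```

Under these assumptions, the ARM instance can take up to 25% longer before it stops being the cheaper option.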
Benchmark results:
Movie 1:
Title: Nope
Duration: 02:05:12:16
c6i.2xlarge x86 Encoding Duration: 2h 24m 6s
c6g.2xlarge ARM Encoding Duration: 3h 25m 43s
x86 cost: $7.50
ARM cost: $8.55
Result: ARM was 42.77% slower and 14.07% more expensive
Movie 2:
Title: Novocaine
Duration: 01:45:19:02
c6i.2xlarge x86 Encoding Duration: 1h 51m 47s
c6g.2xlarge ARM Encoding Duration: 2h 48m 10s
x86 cost: $6.30
ARM cost: $7.58
Result: ARM was 50.44% slower and 20.38% more expensive
Movie 3:
Title: Absolutely anything
Duration: 01:22:16:21
c6i.2xlarge x86 Encoding Duration: 1h 30m 19s
c6g.2xlarge ARM Encoding Duration: 2h 11m 10s
x86 cost: $4.94
ARM cost: $5.74
Result: ARM was 45.23% slower and 16.26% more expensive
Movie 4:
Title: Catch me if you can
Duration: 02:15:11:07
c6i.2xlarge x86 Encoding Duration: 3h 4m 15s
c6g.2xlarge ARM Encoding Duration: 4h 15m 25s
x86 cost: $8.11
ARM cost: $8.99
Result: ARM was 38.62% slower and 10.89% more expensive
Movie 5:
Title: Me before you
Duration: 01:45:57:00
c6i.2xlarge x86 Encoding Duration: 1h 51m 27s
c6g.2xlarge ARM Encoding Duration: 2h 43m 48s
x86 cost: $6.69
ARM cost: $7.87
Result: ARM was 47.08% slower and 17.72% more expensive
Movie 6:
Title: The lucky one
Duration: 01:36:54:00
c6i.2xlarge x86 Encoding Duration: 1h 56m 43s
c6g.2xlarge ARM Encoding Duration: 2h 46m 20s
x86 cost: $7.00
ARM cost: $7.97
Result: ARM was 42.51% slower and 13.98% more expensive
Movie 7:
Title: Shattered
Duration: 01:30:59:24
c6i.2xlarge x86 Encoding Duration: 1h 49m 22s
c6g.2xlarge ARM Encoding Duration: 2h 40m 6s
x86 cost: $6.56
ARM cost: $7.68
Result: ARM was 46.39% slower and 17.22% more expensive
In other words, on average, the ARM CPU was 44.72% slower than the equivalent x86 CPU. Once cost is factored in, the ARM instance is 20% cheaper to run, but the much longer encoding time makes it 15.78% more expensive in real terms.
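For anyone who wants to re-derive the percentages, here is a small sketch that recomputes them from the durations and costs listed above. The helper name to_seconds is made up for illustration. Because the listed durations are rounded to the second and the costs to the cent, the recomputed values can differ slightly from the quoted ones.

```python
from statistics import mean


def to_seconds(duration):
    """Parse a duration such as '2h 24m 6s' into seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(part[:-1]) * units[part[-1]] for part in duration.split())


# (title, x86 time, ARM time, x86 cost in $, ARM cost in $) from the results
RESULTS = [
    ("Nope", "2h 24m 6s", "3h 25m 43s", 7.50, 8.55),
    ("Novocaine", "1h 51m 47s", "2h 48m 10s", 6.30, 7.58),
    ("Absolutely anything", "1h 30m 19s", "2h 11m 10s", 4.94, 5.74),
    ("Catch me if you can", "3h 4m 15s", "4h 15m 25s", 8.11, 8.99),
    ("Me before you", "1h 51m 27s", "2h 43m 48s", 6.69, 7.87),
    ("The lucky one", "1h 56m 43s", "2h 46m 20s", 7.00, 7.97),
    ("Shattered", "1h 49m 22s", "2h 40m 6s", 6.56, 7.68),
]

slowdowns = []    # extra ARM encoding time, in percent of the x86 time
extra_costs = []  # extra ARM cost, in percent of the x86 cost
for title, x86_time, arm_time, x86_cost, arm_cost in RESULTS:
    slowdown = (to_seconds(arm_time) / to_seconds(x86_time) - 1.0) * 100.0
    extra_cost = (arm_cost / x86_cost - 1.0) * 100.0
    slowdowns.append(slowdown)
    extra_costs.append(extra_cost)
    print(f"{title}: {slowdown:.2f}% slower, {extra_cost:.2f}% more expensive")

print(f"average slowdown: {mean(slowdowns):.2f}%")
print(f"average extra cost: {mean(extra_costs):.2f}%")
```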