
View Full Version : x264 - x86_64 vs ARM 64 The ultimate encoding battle



FranceBB
8th June 2025, 19:48
Hi there,
up until a few years ago, if someone had asked me about encoding with x264 on an ARM CPU, I would have given them a funny look: I always thought of ARM CPUs as something for mobile devices like smartphones, where the main goal is to be extremely power efficient and last a long time on battery. In other words, I didn't see them ever becoming a thing on desktop computers, let alone in servers running in a datacenter. Yet ARM-powered laptops have become a thing, and more and more people use ARM CPUs as their daily drivers, be it via the Qualcomm CPUs on Windows and Linux or the Apple M-series CPUs on macOS. Software support outside the mobile space got better too, and recently that has come to include frameservers like Avisynth and VapourSynth, decoders like libav, encoders like x264 and of course FFmpeg, so I thought: it's time for a comparison.

In particular, when it comes to x264, there is hand-written assembly for both x86_64 and ARM 64: we have SSE, SSE2, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX512 and FMA on the x86 (https://code.videolan.org/search?search=AVX512&nav_source=navbar&project_id=536&group_id=9&search_code=true&repository_ref=master) side and NEON on the ARM side (https://code.videolan.org/search?search=NEON&nav_source=navbar&project_id=536&group_id=9&search_code=true&repository_ref=master).


const x264_cpu_name_t x264_cpu_names[] =
{
#if ARCH_X86 || ARCH_X86_64
//  {"MMX", X264_CPU_MMX}, // we don't support asm on mmx1 cpus anymore
#define MMX2 X264_CPU_MMX|X264_CPU_MMX2
    {"MMX2", MMX2},
    {"MMXEXT", MMX2},
    {"SSE", MMX2|X264_CPU_SSE},
#define SSE2 MMX2|X264_CPU_SSE|X264_CPU_SSE2
    {"SSE2Slow", SSE2|X264_CPU_SSE2_IS_SLOW},
    {"SSE2", SSE2},
    {"SSE2Fast", SSE2|X264_CPU_SSE2_IS_FAST},
    {"LZCNT", SSE2|X264_CPU_LZCNT},
    {"SSE3", SSE2|X264_CPU_SSE3},
    {"SSSE3", SSE2|X264_CPU_SSE3|X264_CPU_SSSE3},
    {"SSE4.1", SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4},
    {"SSE4", SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4},
    {"SSE4.2", SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4|X264_CPU_SSE42},
#define AVX SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4|X264_CPU_SSE42|X264_CPU_AVX
    {"AVX", AVX},
    {"XOP", AVX|X264_CPU_XOP},
    {"FMA4", AVX|X264_CPU_FMA4},
    {"FMA3", AVX|X264_CPU_FMA3},
    {"BMI1", AVX|X264_CPU_LZCNT|X264_CPU_BMI1},
    {"BMI2", AVX|X264_CPU_LZCNT|X264_CPU_BMI1|X264_CPU_BMI2},
#define AVX2 AVX|X264_CPU_FMA3|X264_CPU_LZCNT|X264_CPU_BMI1|X264_CPU_BMI2|X264_CPU_AVX2
    {"AVX2", AVX2},
    {"AVX512", AVX2|X264_CPU_AVX512},
#undef AVX2
#undef AVX
#undef SSE2
#undef MMX2
    {"Cache32", X264_CPU_CACHELINE_32},
    {"Cache64", X264_CPU_CACHELINE_64},
    {"SlowAtom", X264_CPU_SLOW_ATOM},
    {"SlowPshufb", X264_CPU_SLOW_PSHUFB},
    {"SlowPalignr", X264_CPU_SLOW_PALIGNR},
    {"SlowShuffle", X264_CPU_SLOW_SHUFFLE},
    {"UnalignedStack", X264_CPU_STACK_MOD4},
#elif ARCH_PPC
    {"Altivec", X264_CPU_ALTIVEC},
#elif ARCH_ARM
    {"ARMv6", X264_CPU_ARMV6},
    {"NEON", X264_CPU_NEON},
    {"FastNeonMRC", X264_CPU_FAST_NEON_MRC},
#elif ARCH_AARCH64
    {"ARMv8", X264_CPU_ARMV8},
    {"NEON", X264_CPU_NEON},
    {"DotProd", X264_CPU_DOTPROD},
    {"I8MM", X264_CPU_I8MM},
    {"SVE", X264_CPU_SVE},
    {"SVE2", X264_CPU_SVE2},
#elif ARCH_MIPS
    {"MSA", X264_CPU_MSA},
#elif ARCH_LOONGARCH
    {"LSX", X264_CPU_LSX},
    {"LASX", X264_CPU_LASX},
#endif
    {"", 0},
};



To make this comparison fair and avoid a "David vs Goliath" benchmark, I've picked two EC2 instances which are identical in terms of cores/threads and RAM, in particular:

x86_64
c6i.2xlarge 8c/8th 16GB RAM

ARM 64
c6g.2xlarge 8c/8th 16GB RAM

In other words, we have two virtual machines: the x86 one is hosted on an Intel Xeon Platinum 8375C (Ice Lake), while the ARM 64 one is hosted on a Graviton 2, which uses ARMv8 Neoverse-N1 cores.

For the test, Linux was used, specifically Ubuntu 24.04 running FFmpeg 6.1.1 stable. Each EC2 had 2TB of attached storage to work on, so the benchmark essentially consisted of:

1) Spinning up the EC2
2) Transferring a mezzanine file from an S3 bucket to the 2TB attached storage of the EC2
3) Triggering the encode to create the final output files
4) Delivering those files back to S3
5) Shutting down the EC2

The power up / power down times and the file transfer times were then subtracted from the total job time, so that only the actual computation time is left.

A total of 7 sources were used. In all cases the input file was a standard XDCAM-50 file with four audio tracks: DolbyE Italian, DolbyE Original, PCM Stereo Italian and PCM Stereo Original. In particular:

Video:
FULL HD 1920x1080 MPEG-2 4:2:2 Profile @ High Level 50 Mbit/s yv16 25i TFF BT709 SDR

Audio:
Track1 DolbyE 5.1 44800Hz 20bit Italian
Track2 DolbyE 5.1 44800Hz 20bit Original
Track3 PCM 2.0 48000Hz 24bit Italian
Track4 PCM 2.0 48000Hz 24bit Original

The 44800Hz in the DolbyE tracks refers to the internal sampling rate of that stream at 25fps (1792 samples per frame * 25 frames per second = 44800 Hertz), which is always resampled to 48000Hz when played back on a hardware decoder.
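The arithmetic behind that figure, as a short Python sketch. The 1792 samples per frame and the 25fps frame rate are taken from the paragraph above; the resampling ratio is simply the reduced fraction 48000/44800 and says nothing about how a particular decoder implements the resampling:

```python
from fractions import Fraction

SAMPLES_PER_FRAME = 1792   # DolbyE samples per video frame, as stated above
FRAME_RATE = 25            # frames per second
OUTPUT_RATE = 48000        # Hz, the rate the stream is resampled to

# Internal sampling rate of the DolbyE stream.
internal_rate = SAMPLES_PER_FRAME * FRAME_RATE

# Exact ratio between the output rate and the internal rate.
resample_ratio = Fraction(OUTPUT_RATE, internal_rate)

if __name__ == "__main__":
    print(f"internal rate: {internal_rate} Hz")
    print(f"resample ratio: {resample_ratio} ({float(resample_ratio):.4f})")
```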


The encoding job consisted of 6 steps:

Step 1: Encoding the video
FULL HD H.264 Profile High Level 4.1 Ref 4 CRF 25 4:2:0 Limited TV Range 8bit planar BT709 SDR

Step 2: Encoding the audio in AAC
Track1 AAC 5.1 550 kbit/s 48000Hz Italian
Track2 AAC 5.1 550 kbit/s 48000Hz Original
Track3 AAC 2.0 384 kbit/s 48000Hz Italian
Track4 AAC 2.0 384 kbit/s 48000Hz Original

Step 3: Encoding the audio in Opus as a proxy
Track1 Proxy: Opus Mono 64 kbit/s Italian
Track2 Proxy: Opus Mono 64 kbit/s Original

Step 4: Encoding the video in H.264 as a proxy with watermark + mux the already encoded audio
SD H.264 Profile High Level 4.1 Ref 4 CRF 25 4:2:0 Limited TV Range 8bit planar BT709 SDR

Step 5: Muxing the FULL HD video and the 5.1 AAC audio in MP4

Step 6: Extracting a low resolution thumbnail from the middle of the video and encoding it in JPEG

The command lines used were as follows:

#BT709
#Video only
ffmpeg -i $inputSpec:myInput -map 0:v -c:v libx264 -profile:v high -level:v 4.1 -refs 4 -crf 25 -ignore_chapters 1 -ignore_unknown -write_tmcd 0 -movflags faststart -vf "sidedata=delete,metadata=delete,bwdif=mode=0:parity=0:deint=0,scale=w=1920:h=1080:flags=lanczos:sws_dither=ed,format=yuv420p,setfield=prog,setsar=1:1,fps=25" -x264opts "opencl:keyint=25:force_cfr=1:deblock=-1,-1:aud=1:overscan=show:colorprim=bt709:fullrange=off:transfer=bt709:colormatrix=bt709" -color_primaries bt709 -color_trc bt709 -colorspace bt709 -color_range tv -field_order progressive -brand mp42 -max_muxing_queue_size 700 -map_metadata -1 -metadata creation_time=now -an -f mp4 -y $jobOutputFolder:Video_Only.mp4

#CH.1-2 DolbyE 5.1 - CH.3-4 DolbyE 5.1 - CH.5-6 stereo - CH.7-8 stereo audio track
#Extract DolbyE track 1 and 2
ffmpeg -i $inputSpec:myInput -map 0:1 -acodec copy -f u8 -y $jobOutputFolder:stream1.u8
ffmpeg -i $inputSpec:myInput -map 0:2 -acodec copy -f u8 -y $jobOutputFolder:stream2.u8
#Encoding stereo track 3 and 4
ffmpeg -i $inputSpec:myInput -map 0:3 -c:a aac -b:a 384k -ar 48000 -y $jobOutputFolder:myOutputCh56.m4a
ffmpeg -i $inputSpec:myInput -map 0:4 -c:a aac -b:a 384k -ar 48000 -y $jobOutputFolder:myOutputCh78.m4a
#Extract each channel of DolbyE 5.1 ITA and DolbyE 5.1 ORI
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.0:0.0.0 -y $jobOutputFolder:ITA_FL.wav
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.1:0.0.0 -y $jobOutputFolder:ITA_FR.wav
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.2:0.0.0 -y $jobOutputFolder:ITA_CC.wav
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.3:0.0.0 -y $jobOutputFolder:ITA_LFE.wav
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.4:0.0.0 -y $jobOutputFolder:ITA_SL.wav
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.5:0.0.0 -y $jobOutputFolder:ITA_SR.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.0:0.0.0 -y $jobOutputFolder:ORI_FL.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.1:0.0.0 -y $jobOutputFolder:ORI_FR.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.2:0.0.0 -y $jobOutputFolder:ORI_CC.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.3:0.0.0 -y $jobOutputFolder:ORI_LFE.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.4:0.0.0 -y $jobOutputFolder:ORI_SL.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.5:0.0.0 -y $jobOutputFolder:ORI_SR.wav
#Audio only
ffmpeg -i $jobOutputFolder:ITA_FL.wav -i $jobOutputFolder:ITA_FR.wav -i $jobOutputFolder:ITA_CC.wav -i $jobOutputFolder:ITA_LFE.wav -i $jobOutputFolder:ITA_SL.wav -i $jobOutputFolder:ITA_SR.wav -filter_complex "[0:a][1:a][2:a][3:a][4:a][5:a]join=inputs=6:channel_layout=5.1:map=0.0-FL|1.0-FR|2.0-FC|3.0-LFE|4.0-BL|5.0-BR[a]" -map "[a]" -c:a aac -b:a 550k -ar 48000 -y $jobOutputFolder:myOutputCh12.m4a
ffmpeg -i $jobOutputFolder:ORI_FL.wav -i $jobOutputFolder:ORI_FR.wav -i $jobOutputFolder:ORI_CC.wav -i $jobOutputFolder:ORI_LFE.wav -i $jobOutputFolder:ORI_SL.wav -i $jobOutputFolder:ORI_SR.wav -filter_complex "[0:a][1:a][2:a][3:a][4:a][5:a]join=inputs=6:channel_layout=5.1:map=0.0-FL|1.0-FR|2.0-FC|3.0-LFE|4.0-BL|5.0-BR[a]" -map "[a]" -c:a aac -b:a 550k -ar 48000 -y $jobOutputFolder:myOutputCh34.m4a

#Audio for proxy
ffmpeg -i $jobOutputFolder:myOutputCh12.m4a -ac 1 -c:a libopus -b:a 64k -y $jobOutputFolder:myOutputCh12_Transcribe.ogg
ffmpeg -i $jobOutputFolder:myOutputCh34.m4a -ac 1 -c:a libopus -b:a 64k -y $jobOutputFolder:myOutputCh34_Transcribe.ogg

#Muxed mono audio with TC & Watermark
ffmpeg -i $jobOutputFolder:Video_Only.mp4 -i $jobOutputFolder:myOutputCh12.m4a -i $jobOutputFolder:myOutputCh34.m4a -map 0:0 -map 1:0 -map 2:0 -vf "fps=25,scale=w=1024:h=576:flags=lanczos:sws_dither=ed,setfield=prog,setsar=1:1","drawtext=\timecode='10\:00\:00\:00':timecode_rate=25:x=(w-tw)/2:y=h-(1*lh):fontcolor=white@1:fontsize=25:box=1:boxcolor=black@0.6","drawtext=\text='Internal Use Only':x=(w-text_w)/2:y=(h-text_h)/2:fontcolor=white@0.1:fontsize=125:line_spacing=100" -c:v libx264 -profile:v high -level:v 4.1 -refs 4 -pix_fmt yuv420p -crf 25 -x264opts "opencl:keyint=25:force_cfr=1:deblock=-1,-1:aud=1:overscan=show:colorprim=bt709:fullrange=off:transfer=bt709:colormatrix=bt709" -color_primaries bt709 -color_trc bt709 -colorspace bt709 -color_range tv -field_order progressive -brand mp42 -max_muxing_queue_size 700 -map_metadata -1 -metadata creation_time=now -ignore_chapters 1 -ignore_unknown -write_tmcd 0 -movflags faststart -c:a copy -f mp4 -y $jobOutputFolder:Subtitling_Proxy.mp4

#Muxed audio and video
ffmpeg -i $jobOutputFolder:Video_Only.mp4 -i $jobOutputFolder:myOutputCh12.m4a -i $jobOutputFolder:myOutputCh34.m4a -map 0:v -map 1:a -map 2:a -c:v copy -c:a copy -f mp4 -y $jobOutputFolder:my_Muxed_Output.mp4


#Thumbnail
ffmpeg -ss 01:02:36.280 -i $jobOutputFolder:Video_Only.mp4 -vf "thumbnail=300,scale=w=240:h=136,setsar=1:1" -sws_flags lanczos -frames:v 1 -y $jobOutputFolder:thumb.jpg


One last note to keep in mind: the ARM instance is 20% cheaper to run than the x86 one, which means that, in theory, if the ARM CPU were as fast as the x86 one, it would save 20% of the cost.

Spoiler alert: this didn't happen.


Benchmark results:

Movie 1:
Title: Nope
Duration: 02:05:12:16
c6i.2xlarge x86 Encoding Duration: 2h 24m 6s
c6g.2xlarge ARM Encoding Duration: 3h 25m 43s
x86 cost: $7.50
ARM cost: $8.55

Result: ARM was 42.77% slower and 14.07% more expensive


Movie 2:
Title: Novocaine
Duration: 01:45:19:02
c6i.2xlarge x86 Encoding Duration: 1h 51m 47s
c6g.2xlarge ARM Encoding Duration: 2h 48m 10s
x86 cost: $6.30
ARM cost: $7.58

Result: ARM was 50.44% slower - 20.38% more expensive


Movie 3:
Title: Absolutely anything
Duration: 01:22:16:21
c6i.2xlarge x86 Encoding Duration: 1h 30m 19s
c6g.2xlarge ARM Encoding Duration: 2h 11m 10s
x86 cost: $4.94
ARM cost: $5.74

Result: ARM was 45.23% slower - 16.26% more expensive



Movie 4:
Title: Catch me if you can
Duration: 02:15:11:07
c6i.2xlarge x86 Encoding Duration: 3h 4m 15s
c6g.2xlarge ARM Encoding Duration: 4h 15m 25s
x86 cost: $8.11
ARM cost: $8.99

Result: ARM was 38.62% slower - 10.89% more expensive



Movie 5:
Title: Me before you
Duration: 01:45:57:00
c6i.2xlarge x86 Encoding Duration: 1h 51m 27s
c6g.2xlarge ARM Encoding Duration: 2h 43m 48s
x86 cost: $6.69
ARM cost: $7.87

Result: ARM was 47.08% slower - 17.72% more expensive



Movie 6:
Title: The lucky one
Duration: 01:36:54:00
c6i.2xlarge x86 Encoding Duration: 1h 56m 43s
c6g.2xlarge ARM Encoding Duration: 2h 46m 20s
x86 cost: $7.00
ARM cost: $7.97

Result: ARM was 42.51% slower - 13.98% more expensive




Movie 7:
Title: Shattered
Duration: 01:30:59:24
c6i.2xlarge x86 Encoding Duration: 1h 49m 22s
c6g.2xlarge ARM Encoding Duration: 2h 40m 6s
x86 cost: $6.56
ARM cost: $7.68

Result: ARM was 46.39% slower - 17.22% more expensive
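
The "slower" percentages above follow from the two encoding durations of each title. A minimal Python sketch that recomputes them, assuming the slowdown is defined as (ARM time - x86 time) / x86 time; the durations are copied from the results above and the helper names are only illustrative:

```python
from statistics import mean

# (title, x86 duration, ARM duration), durations as (hours, minutes, seconds)
RESULTS = [
    ("Nope",                (2, 24, 6),  (3, 25, 43)),
    ("Novocaine",           (1, 51, 47), (2, 48, 10)),
    ("Absolutely anything", (1, 30, 19), (2, 11, 10)),
    ("Catch me if you can", (3, 4, 15),  (4, 15, 25)),
    ("Me before you",       (1, 51, 27), (2, 43, 48)),
    ("The lucky one",       (1, 56, 43), (2, 46, 20)),
    ("Shattered",           (1, 49, 22), (2, 40, 6)),
]


def to_seconds(hms):
    """Convert an (hours, minutes, seconds) tuple to seconds."""
    h, m, s = hms
    return h * 3600 + m * 60 + s


def slowdown_pct(x86, arm):
    """Extra time the ARM run needed, as a percentage of the x86 time."""
    return (to_seconds(arm) - to_seconds(x86)) / to_seconds(x86) * 100


slowdowns = {title: slowdown_pct(x86, arm) for title, x86, arm in RESULTS}

if __name__ == "__main__":
    for title, pct in slowdowns.items():
        print(f"{title:<22} {pct:6.2f}% slower")
    print(f"{'average':<22} {mean(slowdowns.values()):6.2f}% slower")
```

The recomputed figures should land close to the ones listed above; any small gap would point to rounding or to slightly different raw timings.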



In other words, on average, using an ARM CPU resulted in a 44.72% slowdown compared to the equivalent x86 CPU and, once cost is factored in, despite the instance being 20% cheaper to run, the much longer encode time makes it 15.78% more expensive in real terms.
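
The two averages are tied together by the price difference: with the ARM instance priced at 0.8x the x86 one, a job that takes 1.4472x as long costs 1.4472 * 0.8 = 1.158x as much, and the break-even point is a time ratio of 1 / 0.8 = 1.25. A minimal Python sketch, assuming cost scales linearly with run time and nothing else (the function names are illustrative):

```python
def relative_cost(time_ratio: float, price_ratio: float = 0.8) -> float:
    """Cost of the ARM job relative to the x86 job.

    time_ratio  -- ARM encode time divided by x86 encode time
    price_ratio -- ARM price per unit of time divided by the x86 price
                   (0.8 is the "20% cheaper" figure; linear billing is assumed)
    """
    return time_ratio * price_ratio


def break_even_time_ratio(price_ratio: float = 0.8) -> float:
    """Time ratio at which both jobs cost exactly the same."""
    return 1.0 / price_ratio


if __name__ == "__main__":
    print(f"break-even: ARM may take up to {break_even_time_ratio():.2f}x as long")
    extra = (relative_cost(1.4472) - 1) * 100
    print(f"44.72% slower -> {extra:.2f}% more expensive")
```

So anything more than 25% slower ends up costing more despite the lower price.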

excellentswordfight
9th June 2025, 09:18
Very cool test, thank you for sharing. It's very interesting to see that Graviton2 has worse performance per dollar, but given that its cores are derived from Cortex-A76, which is very far from cutting-edge ARM performance, it would be interesting to see how Graviton4 instances perform.

Does anyone know if there is any significant difference between x264 and x265 when it comes to ARM and NEON optimization? I've seen quite a bit of work for x265, but I don't follow x264 development that much anymore.

Z2697
9th June 2025, 10:01
A test on a newer generation of Graviton would be great!
And maybe even a Mac?

rwill
9th June 2025, 10:57
Looks like a Decoder and I/O benchmark to me. Not really x264 specific.

Z2697
9th June 2025, 16:45
Looks like a Decoder and I/O benchmark to me. Not really x264 specific.

I'd assume a 50Mbps MPEG2 source won't be a bottleneck?

rwill
10th June 2025, 05:46
I'd assume a 50Mbps MPEG2 source won't be a bottleneck?

No, but isn't the default preset of x264 'medium'? I don't really know what that script did the whole time for a couple of hours, but it surely was not H.264 encoding. Also the numbers are all over the place and do not check out.

Z2697
10th June 2025, 06:19
No, but isn't the default preset of x264 'medium'? I do not know really what that script did the whole time for a couple of hours but sure enough it was not H.264 encoding. Also the numbers are all over the place and do not check out.

He does the encoding twice for each title, on a cloud instance with low-frequency virtual cores.
Yes, maybe it still doesn't add up perfectly, but it makes some sense I guess.

rwill
10th June 2025, 08:22
He does the encoding twice for each title, in a low frequency virtual cores cloud instance.
Yes, maybe it still don't add up perfectly, but makes some sense I guess.

Yes, I have seen that he encodes at FHD and at something close to qHD. I can read ffmpeg scripts. Thank you.

An 8 core Graviton 2 instance should clock at around 2.5GHz and should be well faster than my 16GB Raspberry Pi at x264 if the majority of the time is spent on x264 encoding. But according to FranceBB's numbers it hardly is.

Z2697
10th June 2025, 13:12
Yes I have seen that he encodes at FHD and something close to QHD. I can read ffmpeg scripts. Thank you.

An 8 core Graviton 2 instance should clock at around 2.5Ghz and should be well faster than my 16Gb Raspberry PI at x264 if the majority of time is spend x264 encoding. But according to FranceBB numbers it is hardly.

Don't forget that it will almost certainly have to share CPU resources with other instances running on the same host.
But yeah, I won't disagree that this test includes a lot of noise, more than the title (an x264 encoding battle) suggests.

What's the x264 medium speed of your Raspberry Pi?

rwill
10th June 2025, 15:02
Don't forget that it will almost certainly have to share the CPU resource with other instances running on the same host.
But yeah, I'm not to disagree that this test includes too much noise, more than the title suggests - a x264 encoding battle.

What's the x264 medium speed of your Raspberry PI?

It's a Raspberry Pi 5, so hardly more advanced than Graviton 2.

For some Hollywood Movie intro pan over some detailed desert + person with some light grain:

x264 --input-res 1920x1080 --input-depth 8 --fps 24 --crf 24 -o trash.264 snip_1920x1080_8b.yuv
-->
encoded 120 frames, 21.73 fps, 2398.65 kb/s

and for the smaller resolution

x264 --input-res 1024x576 --input-depth 8 --fps 24 --crf 24 -o trash.264 snip_1024x576_8b.yuv
-->
encoded 120 frames, 69.77 fps, 862.13 kb/s

*edit*
x264 is:
x264 0.164.3095 baee400
built on Apr 12 2023, gcc: 12.2.0
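
To put those two frame rates next to the EC2 timings, a back-of-the-envelope Python sketch. It assumes the clip's fps would hold for a whole 25fps feature and ignores decoding, filtering, audio and muxing, so it is only a rough lower bound; the timecode is the duration of Movie 1 from the first post and the function names are illustrative:

```python
def timecode_to_frames(tc: str, fps: int = 25) -> int:
    """Convert an HH:MM:SS:FF timecode into a frame count."""
    hh, mm, ss, ff = (int(part) for part in tc.split(":"))
    return (hh * 3600 + mm * 60 + ss) * fps + ff


def encode_seconds(frames: int, encode_fps: float) -> float:
    """Seconds needed to encode `frames` at a constant `encode_fps`."""
    return frames / encode_fps


frames = timecode_to_frames("02:05:12:16")   # Movie 1 ("Nope")
full_hd = encode_seconds(frames, 21.73)      # 1920x1080 encode
proxy = encode_seconds(frames, 69.77)        # 1024x576 encode
total_hours = (full_hd + proxy) / 3600

if __name__ == "__main__":
    print(f"{frames} frames")
    print(f"estimated x264 time for both encodes: {total_hours:.2f} h")
```

On those assumptions the estimate comes out a little over 3 hours for the two video encodes of that title.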

GeoffreyA
11th June 2025, 19:30
Thanks for this test, FranceBB. It would be interesting if we could get a fair test between x86 and Apple's CPUs, which, as far as I'm aware, are the best ARM implementation.

j7n
12th June 2025, 18:36
Is the "cost" chosen arbitrarily by the provider of the virtual machine, or does it reflect the price of the computer plus electricity costs? Do you actually get charged by time used, $8 for three hours or so? That would get unsustainable real quick.

FranceBB
12th June 2025, 22:33
Is the "cost" chosen arbitrarily by the provider of the virtual machine, or does it reflect the price of the computer plus electricity costs?

It's set by the provider of the virtual machines. In this case, the tests were run via SDVI on Amazon's infrastructure, so it includes the cost of the Elastic Block Storage.


That would get unsustainable real quick.

Yep... But the "benefit" of the cloud running open source software is the scalability. This is the current situation in Prod (it's 10PM on a Thursday) for Avisynth for instance:

https://i.imgur.com/Im8MKud.png

Basically, when you deploy something, a "golden" AMI is created. You can think of the AMI as an .ovf file containing the virtual machine and its configuration. Once deployed, you get an elastic farm that starts with 1 instance which is shut down. In the case of FFAStrans (FFmpeg Avisynth Transcoder), for instance, when a job comes in, this EC2 spins up, the file is transferred from the S3 bucket to the 2TB attached storage, the rest_service starts, a POST is sent to import the workflow, another POST is sent to trigger the workflow, and a series of GETs follow to poll the status. Once the job is over and the file is encoded, it gets transferred from the 2TB attached storage back to the S3 bucket and the EC2 is shut down. Clearly, if more jobs come in and more EC2s are required, those are created dynamically and - as you can see from the screenshot - I currently have 87 EC2s created, of which 7 are running and executing jobs at this very moment, and up to 620 can be created (I can increase this limit if needed). Each EC2 has a "grace period" of 1 week: if it's not used and stays shut down for more than 1 week, it gets deleted automatically and is recreated later if needed.

Obviously this is all fun and games 'cause we're only talking about machines with CPU and RAM, but if we were to include dedicated resources like a GPU then you could end up in a situation in which your machine won't be created for a long time 'cause there may not be any availability in the region so you have to wait until some other AWS customer finishes using it. With CPU only machines, however, this never happens.


The advantage of open source software over closed source software you buy licenses for is that with the latter you can only deploy as many instances as the licenses you bought. Suppose you bought 10 licenses from company A: you can deploy 10 provisioned instances which will always be there, powering up and down without you having to wait for them to be created, but you can't scale to, let's say, an 11th one, 'cause you don't have a license for it. Some vendors offer their own cloud service instead, so rather than buying licenses you use it as a sort of "pay-as-you-go", which does let you scale; however, every job you trigger costs a bit more than it otherwise would, 'cause they're also charging for the license.


This is more general, though. As far as encoding is concerned, unless you need something incredibly specific like, I don't know, the Dolby Media Encoder to encode DAMF in AC4, or FAB to mux .stl subtitles in an .mxf container as a 436m track carrying Teletext subtitles, or some other peculiar use case not covered by open source software, you're better off with open source. That's why I'm proud not just to be using Avisynth, but also of the fact that the company I work for contributes directly to x265 as one of the Multicoreware partners (and has done for a very long time).

excellentswordfight
13th June 2025, 09:34
Thanks for this test, FranceBB. It would be interesting if we could get a fair test between x86 and Apple's CPUs, which, as far as I'm aware, are the best ARM implementation.
Phoronix has a decent comparison that includes the M4 (standard) and AMD Strix Point (the best and most efficient x86 models available in a somewhat similar power range). My guesstimate is that the M4 Pro has about the same x265 performance as the HX 370, so it looks like Zen5 and M4 in general have about the same performance for x265. But as you can see, Strix Point has 2x the performance/W of its desktop counterparts, so it's very hard to extrapolate this to more performance-focused designs (M4 Max in the Mac Studio, or server CPUs).

https://phoronix.com/benchmark/result/amd-zen-5-vs-intel-arrow-lake-vs-apple-m4-mac-mini/x265-bosphorus-4k-1.svgz
https://phoronix.com/benchmark/result/amd-zen-5-vs-intel-arrow-lake-vs-apple-m4-mac-mini/x265-bosphorus-4k.svgz

Ritsuka
14th June 2025, 05:56
x265 got so many Neon optimisations after 3.6 that the benchmark is meaningless. The latest x265 master branch should be ~50% or more faster on Arm than 3.6.

GeoffreyA
14th June 2025, 12:42
Phoronix has a decent test as they have a test with M4 (standard) and AMD Strix point (the best and most efficient x86 models available that are in a somewhat similar power range). My guesstimate is that M4 Pro has about the same x265 performance as HX 370. So I think it looks like Zen5 and M4 in general has about the same performance for x265. But as you can see, Strix Point has 2x the performance/w over its desktop parts, so its very hard to extrapolate this with more performance focused designs (M4 Max in Mac Studio or Server CPUs).

https://phoronix.com/benchmark/result/amd-zen-5-vs-intel-arrow-lake-vs-apple-m4-mac-mini/x265-bosphorus-4k-1.svgz
https://phoronix.com/benchmark/result/amd-zen-5-vs-intel-arrow-lake-vs-apple-m4-mac-mini/x265-bosphorus-4k.svgz

Thanks, excellentswordfight. I wonder if the x86 branch's having more SIMD implemented is leading to a weaker showing for the M4.

benwaggoner
19th June 2025, 17:28
No, but isn't the default preset of x264 'medium'? I do not know really what that script did the whole time for a couple of hours but sure enough it was not H.264 encoding. Also the numbers are all over the place and do not check out.
This could be as much an ffmpeg perf test as an x264 one.

I suggest feeding a .y4m or .yuv file directly to x264 as the source, off SSD storage.

--preset veryslow would probably be more relevant these days as well. Even that is very fast on modern hardware.

benwaggoner
19th June 2025, 17:34
Very cool test, thank you for sharing that. But given that Graviton2 has cores derived from Cortex-A76, which as very far away from cutting edge ARM-performance, although very interesting to see that they have worse performance/dollar, it would be interesting to see from a performance standpoint how Graviton4 instances perform.

Does anyone know if there is any significant difference between x264 and x265 when it comes to ARM and NEON optimization? Ive seen quite a bit for x265, but I dont follow x264 development that much anymore.
x265 definitely shows substantial performance improvements on Graviton4 versus Graviton2, including a lot of SVE and SVE2, not just NEON.

Testing should be done from the main branch, as a lot of ARM optimizations have been checked in since the last official release. The year-on-year performance improvements on x265 ARM have been amazing.

I know some of the same people have done a bunch of work in x264 as well, but I've not been tracking that as closely.

excellentswordfight
19th June 2025, 20:32
This could be as much a ffmpeg perf test as x264.

I suggest using a .y4m or .yuv file as source directly to x264, off of SSD storage

--preset veryslow would probably be more relevant these days as well. Even that is very fast on modern hardware.
I think both methods have merit. I agree that this method isn't really an x264 benchmark per se, but at the end of the day it comes down to the time and cost of the actual workflow in use (including rewrapping, processing, audio etc).

If I find some time I can try to make a more isolated test with both x264 and x265 comparing m8g.2xlarge (8C Graviton4) vs m7a.2xlarge (8C 4th gen EPYC).

edit.
Couldn't help myself... Did a test with two Ubuntu 24.04 EC2 instances, using the official x264-r3222-b35605a binaries from VideoLAN. As I didn't have much time I had to transfer a compressed source, a 25Mbps AVC encode of Tears of Steel.

Commandline:

ffmpeg -loglevel quiet -i tos_sample.1080p.x264.mp4 -an -f yuv4mpegpipe -strict -1 - | x264 --demuxer y4m --preset veryslow --profile high --level 4.1 --crf 17 --rc-lookahead 96 --keyint 240 --min-keyint 24 --vbv-bufsize 78125 --vbv-maxrate 62500 --colorprim bt709 --transfer bt709 --colormatrix bt709 - -o /dev/null

m7a.2xlarge instance (x86-64):
x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512
18.05 fps

m8g.2xlarge (AArch64)
x264 [info]: using cpu capabilities: ARMv8 NEON DotProd I8MM SVE SVE2
15.37 fps

edit2.

Built x265 4.1+171-cd40fe75d with gcc (GCC 13.3.0)/cmake. I'm a rookie at compiling, so don't ask me about tuning and tweaking :P

Commandline:

ffmpeg -loglevel quiet -i tos_sample.1080p.x264.mp4 -an -f yuv4mpegpipe -strict -1 - | ./x265 --y4m --preset slow --profile main10 --level-idc 41 --crf 17 --keyint 240 --min-keyint 24 --rc-lookahead 96 --vbv-bufsize 50000 --vbv-maxrate 50000 --colorprim bt709 --transfer bt709 --colormatrix bt709 - -o /dev/nul

m7a.2xlarge instance (x86-64):
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
9.16 fps
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512
9.45 fps

m8g.2xlarge (AArch64)
x265 [info]: using cpu capabilities: NEON SVE SVE2 Neon_DotProd Neon_I8MM SVE2_BitPerm
7.45 fps
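
Expressed as ratios, the numbers above work out as follows. A small Python sketch using only the fps figures from this post (for x265 on the m7a it takes the run with AVX512 enabled); the dictionary layout and function name are just for illustration:

```python
# Frames per second reported above, per encoder and instance type.
FPS = {
    "x264": {"m7a.2xlarge": 18.05, "m8g.2xlarge": 15.37},
    "x265": {"m7a.2xlarge": 9.45, "m8g.2xlarge": 7.45},  # m7a: AVX512 run
}


def arm_relative(encoder: str) -> float:
    """Graviton4 (m8g) throughput as a fraction of the EPYC (m7a) throughput."""
    fps = FPS[encoder]
    return fps["m8g.2xlarge"] / fps["m7a.2xlarge"]


if __name__ == "__main__":
    for encoder in FPS:
        print(f"{encoder}: m8g reaches {arm_relative(encoder):.1%} of m7a")
```

In this run the m8g instance reaches roughly 85% of the m7a's x264 throughput and roughly 79% of its x265 throughput, keeping in mind that the two encoders were run with different presets.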

Z2697
20th June 2025, 05:17
So actually x265 has worse ARM performance (relative to x86) than x264?
x264 10bit doesn't have AVX512 optimization so it will be "more better" for ARM, perhaps? (What's the situation with 10bit on the ARM side?)

(compile time tuning and tweaking doesn't matter much, since most of the work is done by handwritten assembly)

Ritsuka
20th June 2025, 10:39
x265 Arm optimizations are quite recent and still a work in progress. x264 has had a fairly complete set for a long time. It could be improved further by using Neon dotprod and i8mm instructions (and SVE/SVE2), and there are some PRs on the x264 website, but it started from a much better state than x265.

FranceBB
20th June 2025, 21:56
x264 10bit doesn't have AVX512

It's very sad 'cause x264 8bit does, but while I use the 8bit version for distribution stuff, I use the 10bit version to create the XAVC Intra Class 300 UHD files and those would benefit a lot from AVX512... :(
Anyway, I didn't feel like torturing an ARM CPU with a 10bit UHD encode. For reference, I'm using the 16c/16th 32GB of RAM x86_64 for those as it would take like 48h for an 8c/8th 16GB of RAM x86_64 to process those movies, especially if there's a PQ to HLG conversion to do...

excellentswordfight
23rd June 2025, 08:14
Anyway, I didn't feel like torturing an ARM CPU with a 10bit UHD encode.
Just FYI, a Graviton4 EC2 instance has higher x265 performance than the third gen Xeon SP used in your threadstart, which is quite slow by today's standards.


m7a.2xlarge (fourth gen EPYC):
9.45 fps

m8g.2xlarge (Graviton4)
7.45 fps

c6i.2xlarge (third gen Xeon SP)
4.55 fps

FranceBB
23rd June 2025, 20:19
m7a.2xlarge (fourth gen EPYC):
9.45 fps

m8g.2xlarge (Graviton4)
7.45 fps

c6i.2xlarge (third gen Xeon SP)
4.55 fps

Oh... Wow...
That's... not what I was expecting... O_O''

Z2697
23rd June 2025, 21:42
I'm not familiar with AWS, but at a glance the "M" series has 2x the memory of the "C" series. I know encoding on its own is not that memory hungry, but could that have some influence? (and maybe there are other differences?)

FranceBB
23rd June 2025, 21:55
Uhmmm you're right, looks like the m.x have 32GB of RAM instead of 16GB. That might have played a role.
In that case, the comparison should have been made with m7i.2xlarge which also has 32GB and it's based on 4th Generation Intel Xeon Scalable.

Blue_MiSfit
24th June 2025, 07:19
Anyway, I didn't feel like torturing an ARM CPU with a 10bit UHD encode. For reference, I'm using the 16c/16th 32GB of RAM x86_64 for those as it would take like 48h for an 8c/8th 16GB of RAM x86_64 to process those movies, especially if there's a PQ to HLG conversion to do...

That's why we use chunked encoding :)

Scale out to a swarm of the smallest instances that can fit the pipeline in memory and run single threaded, then stitch on an IO intensive instance like i4i or i8g!
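
A minimal sketch of the splitting step, in Python. It assumes fixed-length chunks rounded up to a whole number of GOPs, so that the stitched output keeps a regular keyframe cadence; the function name, chunk size and GOP length are illustrative, and real pipelines often cut at scene changes instead:

```python
def plan_chunks(total_frames: int, chunk_frames: int, gop: int) -> list[tuple[int, int]]:
    """Split [0, total_frames) into half-open frame ranges with GOP-aligned starts."""
    if total_frames < 0 or chunk_frames <= 0 or gop <= 0:
        raise ValueError("total_frames must be >= 0, chunk_frames and gop > 0")
    # Round the requested chunk length up to a whole number of GOPs.
    step = -(-chunk_frames // gop) * gop
    return [
        (start, min(start + step, total_frames))
        for start in range(0, total_frames, step)
    ]


if __name__ == "__main__":
    # Hypothetical job: 1010 frames, ~240-frame chunks, GOP of 25 frames.
    for first, end in plan_chunks(1010, 240, 25):
        print(f"encode frames {first}..{end - 1} on its own worker")
```

Each range can then be handed to a separate single-threaded encode, and the resulting segments concatenated in order on the stitching instance.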

excellentswordfight
24th June 2025, 09:06
m7a.2xlarge (fourth gen EPYC):
9.45 fps

m8g.2xlarge (Graviton4)
7.45 fps

c6i.2xlarge (third gen Xeon SP)
4.55 fps
Oh... Wow...
That's... not what I was expecting... O_O''
I'm not familiar with AWS, but just from a glance "M" series have 2x the memory than "C" series, I know encoding on its own is not that memory hungry, but could there be some influences? (and maybe other differences?)
Uhmmm you're right, looks like the m.x have 32GB of RAM instead of 16GB. That might have played a role.
In that case, the comparison should have been made with m7i.2xlarge which also has 32GB and it's based on 4th Generation Intel Xeon Scalable.
In these results it's pretty much irrelevant: total memory usage was about 8GB. I chose c6i.2xlarge because it was used in the threadstart, just to show that, if that was meant as some reference for "performance", ARM is also very much a performance platform at this point. Not the fastest, though, yet...

m7i.2xlarge would for sure do better, but that would be down to the CPU, as those are Sapphire Rapids based. I would guesstimate that it would be close to ARM/Graviton4, as Sapphire Rapids was quite a performance leap if I remember correctly; Ice Lake will also see some of the smallest gains from AVX512 because of throttling. In general I think you can say that Graviton4 should be in the same performance range as 4th/5th gen Xeon and 3rd gen EPYC for this benchmark, so it's still about two generations behind (this is not the case for ALL loads, mind you).

edit.
Seems like my estimates based on x264/x265 are in line with "general" performance. 4th gen Xeon is slightly behind Graviton4 when looking at average performance.
https://phoronix.com/benchmark/result/graviton4-vs-graviton3-vs-graviton2--amd-epyc,-intel-xeon-aws/geometric-mean-of-all-test-results-result-composite-gvgvgaeixa.svgz

benwaggoner
24th June 2025, 17:03
It's very sad 'cause x264 8bit does, but while I use the 8bit version for distribution stuff, I use the 10bit version to create the XAVC Intra Class 300 UHD files and those would benefit a lot from AVX512... :(
Anyway, I didn't feel like torturing an ARM CPU with a 10bit UHD encode. For reference, I'm using the 16c/16th 32GB of RAM x86_64 for those as it would take like 48h for an 8c/8th 16GB of RAM x86_64 to process those movies, especially if there's a PQ to HLG conversion to do...
Yeah, I'd expect 10-bit to get even more AVX512 benefit than 8-bit. But 10-bit x264 is used a lot less than 10-bit x265. Benefits could also be less for intra-only.

benwaggoner
24th June 2025, 17:12
Oh... Wow...
That's... not what I was expecting... O_O''
x265 has been getting a lot of good ARM optimization the last couple of years. Throughput has more than doubled on my MacBook M1 Pro.

benwaggoner
24th June 2025, 17:15
Are we doing comparisons with the same number of cores, or monitoring how many cores are being utilized? It's not uncommon to see a given job use fewer cores than are available on the larger instances.

So comparing at the same core count and listing utilization % would be helpful context.

And heck, doing a single-threaded encode to see how per-process efficiency looks can be interesting, if not a real-world use case these days.

excellentswordfight
24th June 2025, 17:36
Are we doing comparisons with the same number of cores, or monitoring how many cores are being utilized? It's not uncommon to see a given job use fewer cores than are available on the larger instances.

So comparing at the same core count and listing utilization % would be helpful context.

And heck, doing a single-threaded encode to see how per-process efficiency looks can be interesting, if not a real-world use case these days.
All the numbers posted by me are, as stated, at 8T, which is low enough at 1080p to be fully utilized.

And I would rather test that as well, as there can be cross-thread performance differences between architectures that will not be seen in single-threaded instances.

Personally I don't do single-threaded encodings, even when doing chunked encoding.
x265 has been getting a lot of good ARM optimization the last couple of years. Throughput has more than doubled on my MacBook M1 Pro.
Yes, but it is interesting that Graviton4 performed relatively better in x264 than x265. But I suspect it might be because x265 has more/better AVX/SIMD optimization. If I remember correctly I saw a bigger performance difference in x265 when I got a "modern" CPU some years back.

Either way, as relative performance in both x264 and x265 is in line with other software, optimization seems to be at least at some sort of expected level.

rwill
24th June 2025, 18:31
Yes, but it is interesting that Graviton4 performed relatively better in x264 than x265. But I suspect it might be because x265 has more/better AVX/SIMD optimization. If I remember correctly I saw a bigger performance difference in x265 when I got a "modern" CPU some years back.

My guess is that's because NEON still has 16-byte registers, while AVX2/AVX512 have 32 or 64. H.264 macroblocks don't go wider than 16 bytes.
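
A quick back-of-the-envelope sketch of that point, assuming 8-bit samples (one byte per pixel). The register widths are architectural facts; `rows_per_register` is a hypothetical helper name.

```python
# SIMD register widths in bytes.
REGISTER_BYTES = {"NEON": 16, "AVX2": 32, "AVX-512": 64}


def rows_per_register(register_bytes, block_width, bytes_per_sample=1):
    """Full block rows that fit in one register (0 if a single row is too wide)."""
    row_bytes = block_width * bytes_per_sample
    return register_bytes // row_bytes


for name, width in REGISTER_BYTES.items():
    h264_rows = rows_per_register(width, 16)  # 16-pixel-wide H.264 macroblock
    hevc_rows = rows_per_register(width, 64)  # 64-pixel-wide HEVC CTU row
    print(f"{name}: {h264_rows} macroblock row(s), {hevc_rows} CTU row(s)")
```

A macroblock row fills a NEON register exactly, whereas a 32- or 64-byte register only gets filled by packing two or four rows together. Those rows are not contiguous in memory, so that usually costs extra loads or shuffles.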

rwill
24th June 2025, 18:33
x265 has been getting a lot of good ARM optimization the last couple of years. Throughput has more than doubled on my MacBook M1 Pro.

I saw, Amazon seems to employ really determined people. I would have gone with intrinsics instead of assembler though.

benwaggoner
25th June 2025, 19:18
I saw, Amazon seems to employ really determined people. I would have gone with intrinsics instead of assembler though.
Yeah, AWS people have done a lot of work. ARM themselves have been putting in lots of commits as well this year.

If you follow the commits: https://bitbucket.org/multicoreware/x265_git/commits/branch/master, you'll see AArch64 (ARM) has accounted for about half the total in 2025.

There's been dozens of ARM perf patches added since x265 4.1, so a main branch build should have significantly better performance than the current official release.

benwaggoner
25th June 2025, 19:19
My guess is that's because NEON still has 16-byte registers, while AVX2/AVX512 have 32 or 64. H.264 macroblocks don't go wider than 16 bytes.
FWIW, the current optimization seems to be much more focused on the more recent SVE and SVE2 SIMD instructions, not plain NEON.

But yeah, your intuition is sound: modern processors show a bigger gain in x265 than in x264 because x264 has smaller sequences of numbers that can be usefully calculated at the same time. Even original flavor AVX didn't do much to speed up x264. Dark Shikari had a good blog post about why back in the day.

rwill
26th June 2025, 05:54
If you follow the commits: https://bitbucket.org/multicoreware/x265_git/commits/branch/master, you'll see AArch64 (ARM) has accounted for about half the total in 2025.

Looks like I have not kept track closely enough; the code IS actually partially using intrinsics. I had this stored differently in my mind because so many *.S files showed up and so few *.cpp.

Ritsuka
26th June 2025, 13:03
Yup, the timeline is, more or less:

- A bunch of optimized Arm ASM from Huawei in early 2020, quite small speedup (15%);
- Intrinsics contributed by Apple initially to HandBrake, which were merged years later and after some stupid replies by MCW (like "Hey that means Apple are shipping x265 in macOS and they don't pay us", no they aren't bundling it in macOS);
- Some additional ASM contributed by Amazon;
- Contribution by Arm, initially mostly ASM, lately intrinsics.

Z2697
26th June 2025, 16:55
Yup, the timeline is, more or less:

- A bunch of optimized Arm ASM from Huawei in early 2020, quite small speedup (15%);
- Intrinsics contributed by Apple initially to HandBrake, which were merged years later and after some stupid replies by MCW (like "Hey that means Apple are shipping x265 in macOS and they don't pay us", no they aren't bundling it in macOS);
- Some additional ASM contributed by Amazon;
- Contribution by Arm, initially mostly ASM, lately intrinsics.

Contributing x265 optimization code to HandBrake is also stupid, I think :p

Z2697
26th June 2025, 16:58
The FFmpeg tutorial says intrinsics are 10-15% slower than assembly, what do you think?
https://github.com/FFmpeg/asm-lessons/blob/79051e270d0bfdfa4ad52c34f20055596df75487/lesson_01/index.md?plain=1#L27C221-L27C396

rwill
26th June 2025, 19:28
The FFmpeg tutorial also states that intrinsic supporters would disagree, which I hereby do.

benwaggoner
26th June 2025, 20:00
The FFmpeg tutorial says intrinsics are 10-15% slower than assembly, what do you think?
https://github.com/FFmpeg/asm-lessons/blob/79051e270d0bfdfa4ad52c34f20055596df75487/lesson_01/index.md?plain=1#L27C221-L27C396
Assembly is always best if you can have a big sustained optimization team that can update it with new architecture revisions.

But intrinsics are still much faster than just C++, faster to develop, and more flexible over the long term.

So, given a fixed number of optimization engineer years, focusing on intrinsics can provide the best net speedup.

And once they are there, assembly can still be done later for promising hot spots.

DTL
24th October 2025, 22:19
The FFmpeg tutorial says intrinsics are 10-15% slower than assembly, what do you think?
https://github.com/FFmpeg/asm-lessons/blob/79051e270d0bfdfa4ad52c34f20055596df75487/lesson_01/index.md?plain=1#L27C221-L27C396

It may date from the old days of poor compilers. Now that the power (AI) of compilers keeps growing, the complexity of SIMD keeps increasing and the number of remaining developers keeps shrinking, compiled intrinsics may be no slower, and they give the same source text a longer lifetime (it can be recompiled for future CPUs without being rewritten). For example, an AVX2 intrinsics-based function that runs out of registers on an AVX2 chip will perform worse there, and will perform better on an AVX512 chip after simply being recompiled for the AVX512 architecture.

Also, the performance of a binary built from intrinsics depends greatly on the compiler used. Nowadays LLVM-based compilers understand the design idea of the source author better and produce faster binaries. For a developer with limited brain power it is still simpler to give the AI compiler a sequence of 'hints' about how something can be implemented in machine code, and the compiler may give a better result than that developer could reach even after years of manual optimization (with trial and error). The compiler can also deliver its result in minutes or less for the SIMD generations currently on the market. A human developer may deliver a better result after years, but by then it will be for an already outdated architecture. The slow human loses to the faster robot.

About MPEG encoders in general and x264 with SIMD CPU acceleration: the slowest part is not the easily parallelizable processes like SAD computation but the logical threads of motion search and the logical decisions about each block. This part of generic MPEG encoding is much harder to map onto the fairly simple SIMD architecture of current CPUs (because the branching depends very heavily on the results computed along some algorithm tree), while SIMD is designed for very large brute-force computation, ideally with zero conditional branching. Only small parts of x264 can (possibly) be moved to SIMD (SIMD-level fine-grained multithreading) by unrolling loops that have very simple branching or no conditional branching at all. The conditional complexity of MPEG encoding rises very steeply as we move toward the root of the frame encoding, and current SIMD CPU engines cannot handle it. All this resulted in very low SIMD performance gains throughout x264's development, with any SIMD from 128-bit SSE to AVX512.

benwaggoner
28th October 2025, 16:55
It may date from the old days of poor compilers. Now that the power (AI) of compilers keeps growing and the complexity of SIMD keeps increasing...
Do we have practical examples of AI materially accelerating SIMD optimization or doing a materially better job of generating SIMD assembly now? It's an obvious but tricky extension, and I've not been tracking the compiler world carefully in the last few years.

About MPEG encoders in general and x264 with SIMD CPU acceleration: the slowest part is not the easily parallelizable processes like SAD computation but the logical threads of motion search and the logical decisions about each block. This part of generic MPEG encoding is much harder to map onto the fairly simple SIMD architecture of current CPUs (because the branching depends very heavily on the results computed along some algorithm tree), while SIMD is designed for very large brute-force computation, ideally with zero conditional branching. Only small parts of x264 can (possibly) be moved to SIMD (SIMD-level fine-grained multithreading) by unrolling loops that have very simple branching or no conditional branching at all. The conditional complexity of MPEG encoding rises very steeply as we move toward the root of the frame encoding, and current SIMD CPU engines cannot handle it. All this resulted in very low SIMD performance gains throughout x264's development, with any SIMD from 128-bit SSE to AVX512.
I think the story is more complex than just that. x265 has gotten a lot more benefit from SIMD and parallelization than x264. One big reason is it supports up to 32x32 blocks for encoding, which is a whole lot more pixels to process at once making 256 and 512 bits at a time a lot more useful. Wavefront Parallel Processing, which gives an additional intraframe encoder thread per 64 pixels of height of the video helps a bunch as well.
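
For a sense of scale, a small sketch of both effects. It assumes a 64x64 CTU size for the WPP row count and compares a 32x32 block against H.264's 8x8 transform; `wpp_rows` and `pixels` are hypothetical helper names.

```python
import math


def wpp_rows(height, ctu_size=64):
    """Number of CTU rows in a frame, the upper bound on WPP row threads."""
    return math.ceil(height / ctu_size)


def pixels(block_size):
    """Pixels in a square block with the given edge length."""
    return block_size * block_size


for label, height in [("1080p", 1080), ("2160p", 2160)]:
    print(f"{label}: up to {wpp_rows(height)} WPP rows")

# A 32x32 block holds 16 times as many pixels as an 8x8 block.
print(pixels(32) // pixels(8))
```

So a 1080p frame has up to 17 rows to spread across threads and a 2160p frame up to 34, on top of the wider blocks feeding the SIMD units.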

DTL
29th October 2025, 12:19
Do we have practical examples of AI materially accelerating SIMD optimization or doing a materially better job of generating SIMD assembly now? It's an obvious but tricky extension, and I've not been tracking the compiler world carefully in the last few years.



See example at resampler - https://forum.doom9.org/showthread.php?p=2023248#post2023248

Microsoft's standard C compiler in VS2022 fails to group instructions to hide latency, even though they are grouped directly in the intrinsics-based source. The LLVM-based compiler understands the design idea behind the intrinsics and groups them as expected in the binary, which as expected produces a faster binary.

Over the last several years the LLVM compiler has produced more or less faster binaries from just about any source. I expect LLVM is more or less AI-based internally, as it shows a better understanding of the design ideas in C-based text (C-based SIMD programs implemented with intrinsics).

"x265 has gotten a lot more benefit from SIMD and parallelization than x264."

It may simply be because the internal complexity of x265 is much larger, and some parts without very complex logical branching got better performance from SIMD. But total x265 performance (in fps) is still significantly lower than x264's. A better example of a more SIMD-friendly MPEG encoder design would be x265 with all possible SIMD optimizations enabled running faster than x264 in scalar C-only code (or x264 with only simple SIMD optimizations like block SAD computation).

Designers of CPUs for the general end-user market, like Intel, understand that MPEG encoding cannot benefit well from SIMD and have started making chips with many simple cores and little SIMD, because MPEG encoding benefits more from running many threads on simple scalar cores with little SIMD support. So the more E-cores in an Intel CPU, the better the software MPEG encoder performance. Sad for the current architecture of general-purpose SIMD units.

The real help expected for software MPEG encoders in the 202x years is the use of motion estimation hardware accelerators in end-user PCs. But it looks like most developers of open-source, multi-platform MPEG encoders such as x264 do not like this idea, and we still do not have this optional support in software implementations.

benwaggoner
29th October 2025, 23:20
"x265 has gotten a lot more benefit from SIMD and parallelization than x264."

It may simply be because the internal complexity of x265 is much larger, and some parts without very complex logical branching got better performance from SIMD. But total x265 performance (in fps) is still significantly lower than x264's. A better example of a more SIMD-friendly MPEG encoder design would be x265 with all possible SIMD optimizations enabled running faster than x264 in scalar C-only code (or x264 with only simple SIMD optimizations like block SAD computation).
No, a big part of it is that HEVC is more SIMD friendly due to having bigger chunks of data it can process at the same time. While H.264 maxes out at 8x8 blocks and 16x16 macro blocks, HEVC can do up to 32x32 in 64x64 units. That is 16x more pixels at a time than H.264 gets. I was deep in the weeds about this stuff with MultiCoreWare back in the day.

Designers of CPUs for the general end-user market, like Intel, understand that MPEG encoding cannot benefit well from SIMD and have started making chips with many simple cores and little SIMD, because MPEG encoding benefits more from running many threads on simple scalar cores with little SIMD support. So the more E-cores in an Intel CPU, the better the software MPEG encoder performance. Sad for the current architecture of general-purpose SIMD units.
I don't know why you think people don't believe MPEG codecs take good advantage of SIMD. An AVX2 build of x265 outperforms a no-SIMD build by a pretty decent multiple. I think 7x on Zen 5 with AVX512 also enabled.

Wavefront parallel processing also helps in getting more threads per frame, but that's orthogonal.

The real help expected for software MPEG encoders in the 202x years is the use of motion estimation hardware accelerators in end-user PCs. But it looks like most developers of open-source, multi-platform MPEG encoders such as x264 do not like this idea, and we still do not have this optional support in software implementations.
People tried, but in the end it just didn't wind up offering quality improvements or much of a performance improvement. Like 20-30% faster at most. The overhead of going to/from the GPU was enough that it was practically faster to just do the coarse analysis on the same chip, with the same data in the L3 cache, as the refinement.
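
A toy model of why the round trip matters. All timings below are hypothetical per-frame values picked purely for illustration, and `offload_speedup` is not taken from any real encoder.

```python
def offload_speedup(cpu_ms, gpu_ms, transfer_ms):
    """Speedup of GPU offload over staying on the CPU, round trip included."""
    return cpu_ms / (gpu_ms + transfer_ms)


# Hypothetical per-frame timings in milliseconds.
cpu_ms = 10.0  # coarse analysis kept on the CPU
gpu_ms = 2.0   # the same analysis on the GPU, excluding transfers
for transfer_ms in (1.0, 6.0, 9.0):
    speedup = offload_speedup(cpu_ms, gpu_ms, transfer_ms)
    print(f"transfer {transfer_ms} ms -> {speedup:.2f}x")
```

Even with a GPU kernel five times faster than the CPU path, a 6 ms round trip leaves only a 1.25x gain, and a 9 ms round trip makes the offload a net loss.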

DTL
30th October 2025, 08:50
"No, a big part of it is that HEVC is more SIMD friendly due to having bigger chunks of data it can process at the same time. While H.264 maxes out at 8x8 blocks and 16x16 macro blocks, HEVC can do up to 32x32 in 64x64 units. That is 16x more pixels at a time than H.264 gets."

I understand some parts of the MPEG encoder may benefit more from a larger amount of data to process, such as block size in samples. But it is still not a fully SIMD-friendly way of computing.

A typical SIMD-enhanced program for processing a frame of blocks is designed as

FOR (each block in a frame) DO (some processing including SIMD parts). Computing time is Total_number_of_blocks x time_per_loop_spin.

This gives some performance benefit if the low-level loops can be unrolled for more SIMD-style processing, like SAD computation in motion search, or even a bit more. But it still uses lots of scalar logic. And while the scalar logic is running, all SIMD dispatch units stay inactive, and total performance is far from the raw performance (in FLOPs/IOPs) of constantly running SIMD.

A better SIMD-friendly (up to fully SIMD-friendly) program for processing a frame of blocks is designed as

FOR (each group of N blocks in a frame) DO (process N blocks in a SIMD program, one instruction per N blocks, without falling back to the scalar path). Computing time is (Total_number_of_blocks / N) x time_per_loop_spin.

This can run close to the full performance of a CPU's SIMD dispatch units. But because the full processing of each block in an MPEG encoder is too complex, beyond what the SIMD architecture of today's general-purpose CPU cores supports, it is close to impossible. Every pause in SIMD processing and fallback to scalar per-block logic has a great performance impact on a SIMD program. This is what is called the SIMD non-friendliness of MPEG encoding. The logical complexity of modern MPEG encoding for each block can only be supported by a set of fully logical compute units like those found in data compute accelerators (GPUs).
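
The two loop structures above can be sketched as a simple cost model. It makes the idealized assumption that one loop iteration costs the same in both designs, and it adds an Amdahl's-law helper for the case where only part of the per-block work runs in SIMD. All function names and numbers are hypothetical.

```python
import math


def per_block_time(num_blocks, time_per_iteration):
    """One loop iteration per block: time grows with the block count."""
    return num_blocks * time_per_iteration


def batched_time(num_blocks, batch_size, time_per_iteration):
    """One loop iteration per batch of `batch_size` blocks held in SIMD lanes."""
    return math.ceil(num_blocks / batch_size) * time_per_iteration


def amdahl_speedup(simd_fraction, simd_gain):
    """Overall speedup when only `simd_fraction` of the work gets `simd_gain`x faster."""
    return 1.0 / ((1.0 - simd_fraction) + simd_fraction / simd_gain)


blocks = 120 * 68  # 16x16 macroblocks in a 1920x1088 frame
print(per_block_time(blocks, 1.0))   # one time unit per block
print(batched_time(blocks, 8, 1.0))  # eight blocks per iteration
print(amdahl_speedup(0.5, 8.0))      # half of the work stays scalar
```

With half of the per-block work stuck in scalar logic, an 8x faster SIMD part gives well under a 2x overall speedup, which is the scalar-fallback penalty described above.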

And the next sad fact in 2025: we still do not have an x264 encoder implementation for massively multicore accelerators like end-user GPUs with universal compute cores (and also much faster memory). It looks like open-source programmers do not like to support external compute accelerators either, even through general-purpose compute APIs like OpenCL, DirectCompute, Vulkan, CUDA, etc.

rwill
30th October 2025, 16:02
Complaining that x264 does not run on GPUs, is it the year 2008 again?

FranceBB
30th October 2025, 20:18
Complaining that x264 does not run on GPUs, is it the year 2008 again?

To be fair, I would also like to have the OpenCL lookahead implementation available in the 10bit version of x264 instead of being limited to the 8bit one.

That and AVX512 which are also only available for 8bit x264. :(

benwaggoner
30th October 2025, 20:55
"No, a big part of it is that HEVC is more SIMD friendly due to having bigger chunks of data it can process at the same time. While H.264 maxes out at 8x8 blocks and 16x16 macro blocks, HEVC can do up to 32x32 in 64x64 units. That is 16x more pixels at a time than H.264 gets."

I understand some parts of the MPEG encoder may benefit more from a larger amount of data to process, such as block size in samples. But it is still not a fully SIMD-friendly way of computing.
It may not be the optimal way. But we can see how turning on more advanced SIMD in binaries increases throughput progressively. It has maxed out at AVX2 for a while, but modern architectures like Zen5 are finally making even AVX512 a useful net positive.

And the next sad fact in 2025: we still do not have an x264 encoder implementation for massively multicore accelerators like end-user GPUs with universal compute cores (and also much faster memory). It looks like open-source programmers do not like to support external compute accelerators either, even through general-purpose compute APIs like OpenCL, DirectCompute, Vulkan, CUDA, etc.
One big challenge, which is also a challenge with GPU acceleration, is that more advanced codecs get more advanced tools and mode alternatives, which exponentially grow the possible ways to do something that could be optimal. So there's lots of branchy code required between the parallelizable operations. The overhead of latency and memory bandwidth between CPU and GPU winds up making it uncompetitive versus being able to mix branching and SIMD operations applied to the same data in the same L3 cache