
View Full Version : x264 - x86_64 vs ARM 64 The ultimate encoding battle



rwill
30th October 2025, 22:09
To be fair, I would also like to have the OpenCL lookahead implementation available in the 10bit version of x264 instead of being limited to the 8bit one.

That and AVX512 which are also only available for 8bit x264. :(

You really shouldn't be using the OpenCL lookahead implementation as the speed gain, if any, comes at a harsh penalty on lookahead analysis quality.

FranceBB
1st November 2025, 00:51
You really shouldn't be using the OpenCL lookahead implementation as the speed gain, if any, comes at a harsh penalty on lookahead analysis quality.

Ah... I had no idea, I've been using it regularly since at least 2015 :scared:

Blue_MiSfit
1st November 2025, 07:30
You really shouldn't be using the OpenCL lookahead implementation as the speed gain, if any, comes at a harsh penalty on lookahead analysis quality.

Why is that, exactly? (the penalty)

rwill
1st November 2025, 08:47
Why is that, exactly? (the penalty)

Mostly, the lowres OpenCL motion estimation algorithm is very limited compared to the flexible software implementation, leading to suboptimal analysis scores.

In most cases this has no real effect on the resulting stream. But in certain corner cases x264 is able to make better decisions when using the analysis data of the software implementation compared to the ones from OpenCL.

DTL
1st November 2025, 16:19
Complaining that x264 does not run on GPUs, is it the year 2008 again?

From 2008 to 2025 one would expect GPUs to have made more or less significant progress in general purpose computing, and someday they will be ready to execute a software implementation of H.264 encoding.

I understand that data exchange between logical processors in a GPU is also not very easy, but I hope they try to make things better in the next chip versions.

Though we have fewer and fewer human open source developers, and the number of GPU manufacturers is very low in comparison with general purpose CPUs able to run a C-based program. Maybe some AI robot will even finally help translate the C-based x264 sources into a GPU-accelerated solution? We have seen great investment in AI robots in recent years. And conversion of an already working C solution does not look very complex.

rwill
1st November 2025, 17:50
This is turning more and more into another thread where blind people philosophize or argue about colors.

I wish we could get MasterNobody's take on this to cut it short.

benwaggoner
10th November 2025, 19:33
From 2008 to 2025 one would expect GPUs to have made more or less significant progress in general purpose computing, and someday they will be ready to execute a software implementation of H.264 encoding.

I understand that data exchange between logical processors in a GPU is also not very easy, but I hope they try to make things better in the next chip versions.

The question of CPU versus programmable GPU versus fixed-function GPU encoding really gets down to the relative improvements of each. Certainly GPUs are getting better for this kind of application, but CPUs also get year-on-year (or every other year) improvements. It's insane how much more powerful a good x264 CPU is today than when the software was first becoming popular (2004?). It was, what, mainly single core x86-32 with SSE3 SIMD if you had the latest and greatest?

MasterNobody
15th November 2025, 11:08
This is turning more and more into another thread where blind people philosophize or argue about colors.

I wish we could get MasterNobody's take on this to cut it short.

I can only quote what I wrote more than 4 years ago, when I proposed removing OpenCL support from x264 (https://code.videolan.org/videolan/x264/-/merge_requests/66#note_264628) altogether:
It has not been maintained for more than 7 years (last change was in 2014). And not because the code is so perfect that there are no bugs and there is nothing to improve (in fact, it was never even in production ready state), but simply because nobody cares, since in fact almost no one uses it (and its use has never been recommended).
Its presence prevents changes of general lookahead code because it should be changed in sync with OpenCL lookahead code which nobody wants to touch. And so, it harms development of core functionality.
It is not cool, it is just a gimmick. It was just an experiment (Proof of Concept) on a hot topic at that time, which, IMHO, failed (well, maybe except as a marketing tool). If you are not satisfied with the encoding speed of x264 on CPU-only then it makes more sense to look towards fully hardware accelerated encoding i.e. QuickSync / NVENC / AMF / VCE / etc. As the history of the last 10 years shows, all hybrid codecs have failed and are long forgotten, as they were usually just a temporary solution until the development of acceptable fully hardware accelerated codecs.

And my opinion on this hasn't changed at all over this time. But there are always fanatics who rave about GPU support, but no one willing to support or improve this code has appeared over these more than 10 years.

jpsdr
15th November 2025, 12:14
I remember reading a looong time ago that OpenCL in x264 was not good for quality. This is why in my releases (also a long time ago) I didn't include OpenCL for a while, because the build failed. One day the build with OpenCL worked, and as there was some demand for it I included it in my releases, but I've never used it personally on my encodes.
I thought it was common knowledge by now that OpenCL in x264 wasn't good and was bad for quality. I didn't know that, unfortunately, people were using it without knowing its downside.

Jamaika
15th November 2025, 13:07
I'm adding the latest OpenCL drivers without going into detail. Cuda also has OpenCL drivers.

I don't know much about this, but OpenCL has OpenCV and the latest Whisper add-ons. They only work with Adreno kernels and don't work with Windows. I don't know, maybe OpenCL is for smartphones. The second problem is the drivers that Windows doesn't add.
https://github.com/KhronosGroup/OpenCL-ICD-Loader/commit/634ef470035f3fadf46ee48fa91886f155f788f5
https://github.com/KhronosGroup/OpenCL-Headers/commit/6137cfbbc7938cd43069d45c622022572fb87113

Edit: OpenCL-Based Design of an FPGA Accelerator for H.266/VVC Transform and Quantization

Blue_MiSfit
19th November 2025, 06:35
1. It has not been maintained for more than 7 years (last change was in 2014). And not because the code is so perfect that there are no bugs and there is nothing to improve (in fact, it was never even in production ready state), but simply because nobody cares, since in fact almost no one uses it (and its use has never been recommended).

2. Its presence prevents changes of general lookahead code because it should be changed in sync with OpenCL lookahead code which nobody wants to touch. And so, it harms development of core functionality.

3. It is not cool, it is just a gimmick. It was just an experiment (Proof of Concept) on a hot topic at that time, which, IMHO, failed (well, maybe except as a marketing tool). If you are not satisfied with the encoding speed of x264 on CPU-only then it makes more sense to look towards fully hardware accelerated encoding i.e. QuickSync / NVENC / AMF / VCE / etc. As the history of the last 10 years shows, all hybrid codecs have failed and are long forgotten, as they were usually just a temporary solution until the development of acceptable fully hardware accelerated codecs.


Presuming this is all factual (and I have little reason to doubt it) I can get behind this. I remember a very cool video of our friend Dark_Shikari presenting this (associated with Telestream?), but sadly I cannot find it.

It WAS a cool idea, especially in the context of Vantage Lightspeed servers with GPUs. But... if it's really not useful anymore, why keep it around, especially if it hamstrings development of the regular lookahead? CPUs are so fast now!

DTL
20th November 2025, 11:11
but CPUs also get year-on-year (or every other year) improvements.

CPU evolution has been nearly dead since the mid-201x. All we got over 0.1 of a century is a few more general purpose cores and some more cache. AVX does close to nothing to help MPEG encoding because it can only speed up very small parts of the algorithm. So if 80+% of the algorithm is scalar, ever greater speedup of the SIMD-friendly 20% (or less) can add only very little total performance.

Meanwhile, the number of good general purpose compute cores and the RAM performance of dedicated data compute accelerators, sold as field-installable boards for the very old x86/x64 architecture, increase much faster. If we compare the NVIDIA RTX PRO 6000 with a typical (top) desktop CPU of 2025, it is dozens, hundreds or thousands of times higher in parameters and hardware resources. The site says "The NVIDIA RTX PRO™ 6000 Blackwell Workstation Edition is the most powerful desktop GPU ever created, redefining performance and capability for professionals." 24000+ CUDA cores vs 24 general purpose cores in a CPU. 1.7 TB/s memory bandwidth vs 100 GB/s for a CPU.

I expect that when the general purpose compute cores of data compute accelerators reach a level sufficient to run the x264 algorithm, we will finally get a significant performance gain.

"there are always fanatics who rave about GPU support, but no one willing to support or improve this code has appeared over these more than 10 years."

There are simply fewer and fewer open source freeware developers left in all of civilization. As I see it, we have lost almost all developers of other open source moving picture processing software like AviSynth by the end of 2025. That is why we see the last attempt of a dying civilization to invest everything into hardware AI robots, to buy some more time or make the degradation a bit smoother. And it may already be NN/AI robots who will make a working port of the x264 software to external data compute accelerators, I think.

nevcairiel
20th November 2025, 13:54
So if 80+% of the algorithm is scalar, ever greater speedup of the SIMD-friendly 20% (or less) can add only very little total performance.


I expect that when the general purpose compute cores of data compute accelerators reach a level sufficient to run the x264 algorithm, we will finally get a significant performance gain.

But this is the exact same problem that makes GPGPU encoding prohibitive. The literal only thing GPUs have going for them is _massive_ parallelism, as you say yourself, 24000+ cores. Only an algorithm that can run massively in parallel will benefit from that. Any single individual core running scalar code is not very fast.
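
The point made here is essentially Amdahl's law: the serial part of a workload caps the overall speedup. A minimal Python sketch, taking the 80% scalar / 20% parallel-friendly split quoted above at face value; the function name `amdahl_speedup` and the sample factors are illustrative assumptions, not measurements of x264:

```python
def amdahl_speedup(parallel_fraction: float, factor: float) -> float:
    """Overall speedup when only `parallel_fraction` of the runtime
    is accelerated by `factor` and the rest stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / factor)


if __name__ == "__main__":
    # Hypothetical split from the discussion: 20% of the runtime is
    # SIMD/parallel friendly, 80% is scalar.
    for factor in (2, 8, 24_000):
        gain = amdahl_speedup(0.2, factor)
        print(f"parallel part {factor}x faster -> overall {gain:.4f}x")
```

With 80% of the runtime serial, the overall speedup can never exceed 1 / 0.8 = 1.25x, however many cores work on the remaining 20%.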

DTL
20th November 2025, 14:44
MPEG-4 AVC is still very simple: it does not use a large database of textures across the whole footage and typically works on a very small group of frames, around 10..100. So a massive multicore chip can simply encode in parallel lots of small groups of frames from a typical runtime of 1.5 hours at 25 fps, about 135000 frames. Current x264 can already do multithreading, I hope not very badly (after about 0.2 of a century of development for multi-core CPUs). So it may simply be a matter of expanding multithreading from the 10..20 threads of current host CPUs to thousands of threads. 135000 frames of standard runtime divided into groups of 100 frames gives 1350 possible threads. With HFR footage, even a bit more.
The main limitations are the RAM size needed to store uncompressed input frames and the RAM bandwidth to the chip. But that is already about 10x better in comparison with poor-people dual-channel DDR5. On-chip caches may help too.
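
The arithmetic in this post can be sketched in Python. This is only a rough illustration: the 100-frame chunk size and the 1.5 hours at 25 fps come from the post, while the 1920x1080 8-bit 4:2:0 frame format and the helper names are assumptions made for the example. Fixed-size chunks are also a simplification, since every independently encoded chunk would have to start with a keyframe.

```python
def split_into_chunks(total_frames: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, covering all frames."""
    if total_frames < 0 or chunk_size <= 0:
        raise ValueError("total_frames must be >= 0 and chunk_size > 0")
    return [
        (start, min(start + chunk_size, total_frames))
        for start in range(0, total_frames, chunk_size)
    ]


def raw_frame_bytes(width: int, height: int) -> int:
    """Size of one uncompressed 8-bit 4:2:0 frame (assumed format):
    a full-resolution luma plane plus two quarter-resolution chroma planes."""
    return width * height * 3 // 2


if __name__ == "__main__":
    total_frames = 90 * 60 * 25  # 1.5 hours at 25 fps
    chunks = split_into_chunks(total_frames, 100)
    print(f"{total_frames} frames -> {len(chunks)} independent chunks")

    # Assumed 1920x1080 source; the post does not state a resolution.
    total_bytes = total_frames * raw_frame_bytes(1920, 1080)
    print(f"uncompressed input: {total_bytes / 1e9:.1f} GB")
```

Under these assumptions the footage splits into 1350 chunks, and holding the entire uncompressed input at once would take roughly 420 GB, which is why the RAM size matters.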

rwill
20th November 2025, 19:27
Well, do it!

And don't forget to report back.

DTL
21st November 2025, 09:46
First I plan to do something about performance optimization of the old AVS+ filtergraph for the current x86 architecture with its very slow host RAM. First tests show that optimizing the work-unit size for the L1D cache size gives a performance boost of several times over the old frame-based architecture in a long enough filterchain: https://forum.doom9.org/showthread.php?p=2024941#post2024941 .

And while I am busy with this (for years?) we still have very great investments by very rich people in AI robot tools. In some time it may be possible to simply order a robot: 'take the x264 C sources and target an external multicore data compute accelerator for x86, with an API and a Windows-compatible binary'. Because human developers may be completely gone by that time.