x265 encoding on arm64
wqcr
3rd September 2017, 19:01
Hi,
I'm on aarch64/arm64/armv8l and trying to figure out how to transcode video (mpeg1, mpeg2, avc) into h265 with opus audio.
The h265 encoder built into FFmpeg on debian squeezy does not currently support multicore encoding (surprisingly, h264 does). On top of that, ffmpeg's opus encoder appears to be broken in the current version.
So I'd need to use ffmpeg to decode the video and feed it directly to the x265 encoder, then decode the audio and create an opus file with opusenc, and finally mux the two streams.
Can someone advise me how to do this?
I'd have grabbed Handbrake a long time ago, but it's also not available for arm64 yet.
Thanks
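The three steps described above (decode video into x265, decode audio into opusenc, then mux) could be sketched roughly as follows. This is only an illustration under assumptions: it presumes the `ffmpeg`, `x265` and `opusenc` command-line tools are installed, and the intermediate file names, the preset and the bitrates are placeholders, not recommendations.

```python
import subprocess

def video_cmds(src, out_hevc, preset="slower", kbps=400):
    """Return (decoder argv, encoder argv) for the video leg."""
    # ffmpeg decodes and writes YUV4MPEG2 frames to stdout
    decode = ["ffmpeg", "-i", src, "-an", "-pix_fmt", "yuv420p",
              "-f", "yuv4mpegpipe", "-"]
    # x265 reads the Y4M stream from stdin ("-")
    encode = ["x265", "--y4m", "--input", "-", "--preset", preset,
              "--bitrate", str(kbps), "--output", out_hevc]
    return decode, encode

def audio_cmds(src, out_opus, kbps=64):
    """Return (decoder argv, encoder argv) for the audio leg."""
    decode = ["ffmpeg", "-i", src, "-vn", "-f", "wav", "-"]
    encode = ["opusenc", "--bitrate", str(kbps), "-", out_opus]
    return decode, encode

def mux_cmd(hevc, opus, out):
    """Return the argv that copies both streams into one container."""
    return ["ffmpeg", "-i", hevc, "-i", opus, "-c", "copy", out]

def pipe(decode, encode):
    """Run `decode | encode` and wait for both processes."""
    producer = subprocess.Popen(decode, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(encode, stdin=producer.stdout)
    producer.stdout.close()  # so the decoder is signalled if the encoder exits
    consumer.wait()
    producer.wait()
    if producer.returncode or consumer.returncode:
        raise RuntimeError("pipeline failed")

def transcode(src, out):
    """Placeholder file names; adjust to taste."""
    pipe(*video_cmds(src, "video.hevc"))
    pipe(*audio_cmds(src, "audio.opus"))
    subprocess.run(mux_cmd("video.hevc", "audio.opus", out), check=True)
```

Calling `transcode("input.mpg", "output.mkv")` would run the three stages in order. One caveat: a raw HEVC stream carries no container timestamps, so the muxing step may need to be told the frame rate explicitly.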
birdie
3rd September 2017, 23:03
You don't want to do that. Seriously.
RanmaCanada
3rd September 2017, 23:33
Do you have years to wait? Seriously, you do not want to do x265 on arm.
wqcr
4th September 2017, 06:51
Putting aside for a moment the fact that I asked for something else, care to explain why?
Ely
4th September 2017, 12:28
x265 on arm64, even with NEON, is slower by several orders of magnitude than x86.
Your best bet is to have a SoC with a HEVC hardware encoder and use that.
To actually answer your question: you need to build FFmpeg linked with libx265 and libopus. This way, you can encode streams with commands roughly like this:
ffmpeg -i <input> -c:v libx265 -c:a libopus <output>
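Whether a given FFmpeg build was actually linked against libx265 and libopus can be checked from the encoder list printed by `ffmpeg -encoders`. A minimal sketch, assuming `ffmpeg` is on the PATH; the helper names are made up for illustration:

```python
import subprocess

def listed_encoders(encoders_text):
    """Collect encoder names from the text printed by `ffmpeg -encoders`.

    Each table row looks like "<flags> <name> <description>", so the name
    is the second whitespace-separated field. Legend rows such as
    " V..... = Video" are skipped because their second field is "=".
    """
    names = set()
    for line in encoders_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] != "=":
            names.add(fields[1])
    return names

def missing_encoders(encoders_text, wanted=("libx265", "libopus")):
    """Return the wanted encoders that the build does not list."""
    have = listed_encoders(encoders_text)
    return [name for name in wanted if name not in have]

def query_ffmpeg():
    """Run ffmpeg and return its encoder table as text."""
    result = subprocess.run(["ffmpeg", "-hide_banner", "-encoders"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

`missing_encoders(query_ffmpeg())` returns an empty list when both encoders are available; anything it does return has to be enabled when building FFmpeg.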
wqcr
4th September 2017, 13:12
I don't know, really....
x265 on arm64 is running 0.35-0.45fps on 'slower' preset for 352x272 MPEG-2 video
When encoding the same file on i7-3820QM downclocked to 2.2GHz I get 3.4fps
But this is most likely due to the fact that x265 on arm64 is not multithreaded:
http://i.imgur.com/B2iuN9t.png
http://i.imgur.com/RPUp522.png
With multithreading as efficient as on x86-64, the performance would be on par, or at least much closer, and certainly not orders of magnitude slower.
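Putting the two measurements above side by side is plain arithmetic on the quoted figures, with nothing measured anew:

```python
import math

def speedup(fast_fps, slow_fps):
    """How many times faster the first encoder runs than the second."""
    return fast_fps / slow_fps

ARM_FPS = (0.35, 0.45)  # reported range on arm64
X86_FPS = 3.4           # i7-3820QM downclocked to 2.2GHz

for fps in ARM_FPS:
    ratio = speedup(X86_FPS, fps)
    # log10 of the ratio expresses the gap in orders of magnitude
    print(f"{fps} fps: x86 is {ratio:.1f}x faster "
          f"({math.log10(ratio):.2f} orders of magnitude)")
```

On these numbers the gap comes out between roughly 7.6x and 9.7x, so a little under one order of magnitude for this particular file and preset.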
birdie
5th September 2017, 00:27
It will be magnitudes slower because mobile SoCs are just not suitable for such intensive workloads. I'm not even sure your SoC is not already throttling when using a single thread.
wqcr
5th September 2017, 06:03
There are a lot of assumptions in that statement.
Have you actually tried it before assuming all ARM SoCs are just too weak to handle this?
I did. Not only does the SoC not throttle after days and days of 100% load (all 8 cores), its performance is also equivalent to an i5-3340M HT at 3.2GHz (Geekbench3 and opus decoding speed).
It doesn't have any difficulty with x264, even multithreaded. By all means it's not only suitable for x265 encoding, it should even be preferred due to its vastly superior power efficiency, several magnitudes better than even the latest generation of Kaby Lake.
The only drawback comes from the current implementation of x265, which on arm64 supposedly neither uses NEON nor is multithreaded.
Still, with manual parallelization, 2.5+fps is possible, which isn't that far from the above result measured on the i7 quad.
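One way to read "manual parallelization" is running several independent encoder processes side by side, one per clip. The post does not say how it was actually done, so the following is only a sketch under that assumption; the file names, worker count and encoder settings are placeholders, and it presumes an FFmpeg build with working libx265 and libopus.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def encode_cmd(src, dst):
    """Build one ffmpeg invocation for a single clip."""
    return ["ffmpeg", "-i", src, "-c:v", "libx265", "-preset", "slower",
            "-b:v", "400k", "-c:a", "libopus", "-b:a", "64k", dst]

def encode_all(clips, workers=4, run=subprocess.run):
    """Encode every (src, dst) pair, at most `workers` at a time.

    Threads are sufficient here because each one only waits for its
    child process; the heavy lifting happens inside ffmpeg itself.
    `run` is injectable so the scheduling can be tried without ffmpeg.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run, encode_cmd(src, dst))
                   for src, dst in clips]
        return [future.result() for future in futures]
```

With eight cores, something like `encode_all(clips, workers=8)` would keep one encoder per core busy as long as there are at least eight clips waiting.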
littlepox
5th September 2017, 09:06
Given x265 is still at its feature expanding & quality enhancing stage, it makes little sense to burn resources and optimize it for arm devices.
x265_Project
9th September 2017, 08:04
Not multithreaded? x265 is cross-platform C/C++ code, and it can be compiled for several different microprocessor architectures (x86, ARM, PowerPC). x265 is always multi-threaded. That's in the software architecture (the thread pools feature, frame parallelism, Wavefront Parallel Processing, etc.). It can't possibly run single-threaded unless you explicitly make it do that with the --pools option.
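For reference, the threading features named above correspond to the x265 options `--pools`, `--frame-threads` and `--wpp`. A small sketch that assembles them into the colon-separated string FFmpeg passes through with `-x265-params`; the helper name and the example values are illustrative only:

```python
def x265_thread_params(pools=None, frame_threads=None, wpp=None):
    """Build an -x265-params string; None leaves x265's default untouched."""
    parts = []
    if pools is not None:
        parts.append(f"pools={pools}")                  # e.g. 8 or "none"
    if frame_threads is not None:
        parts.append(f"frame-threads={frame_threads}")  # frame parallelism
    if wpp is not None:
        parts.append(f"wpp={1 if wpp else 0}")          # Wavefront Parallel Processing
    return ":".join(parts)

# Deliberately restrict threading, to compare against a default run:
restricted = x265_thread_params(pools="none", frame_threads=1)
print(restricted)
```

Encoding the same clip once with the defaults and once with the restricted string (for example `ffmpeg -i in.mkv -c:v libx265 -x265-params pools=none:frame-threads=1 out.mkv`) is a simple way to check whether a given build is really using more than one core.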
We have some limited ARM Neon optimization (x265\source\common\arm), but this is not anywhere near as complete as our x86 SIMD optimization. We've had discussions with various people at various times about doing a full optimization effort, but as of today this hasn't bubbled up to the top of the priority list for our customers or our strategic hardware partners. Of course, x265 is open source, and contributions are always welcomed.
mandarinka
10th September 2017, 23:29
What SoC/CPU are you using? "arm64" could mean anything from an awfully broad spectrum of slow to reasonably fast chips. Cortex-A53 is quite different from a higher-end out-of-order core or even the architectures Apple implemented. Clocks matter, number of cores too, etc.
wqcr
30th September 2017, 11:32
Encode finished in just under 78 hours - 7 h264 clips, 41 minutes long each, at 352x272 - encoder preset "slower", bitrate-based at 400kbps, 1-pass, audio 64k opus.
I'm quite satisfied with the result. CPU throttled a little with all its cores loaded, but only 10-20% under extreme conditions (35°C ambient).
2-pass encoding seems to be broken though; even with the correct params, log files were not created.
If you really want to know, this is the system used for encoding:
Redmi Note 4 Global version
CPU - Snapdragon 625, A53 octa-core at 2.02GHz
RAM - 4GB LPDDR3-1600
Setup - Rooted AOSGP X 2.11 on Android 7.1.1, Debian Stretch running through chroot, using hotspot mode in conjunction with sshd and vnc server to operate the machine remotely.
Typical power consumption - 3.7W on full load
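As a back-of-the-envelope check on the figures in this post: the energy estimate below assumes the phone drew the quoted 3.7W for the whole run, which is a simplification given the throttling mentioned above.

```python
def energy_wh(watts, hours):
    """Energy used at a constant power draw, in watt-hours."""
    return watts * hours

def slowdown(video_minutes, encode_hours):
    """How many times longer the encode took than the footage lasts."""
    return encode_hours * 60 / video_minutes

VIDEO_MINUTES = 7 * 41   # seven clips of 41 minutes each
ENCODE_HOURS = 78        # "just under 78 hours", so an upper bound

print(f"energy: about {energy_wh(3.7, ENCODE_HOURS):.1f} Wh")
print(f"speed: about {slowdown(VIDEO_MINUTES, ENCODE_HOURS):.1f}x slower than real time")
```

That is a bit under 0.3 kWh for the whole batch, at roughly one sixteenth of real-time speed.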
Blue_MiSfit
2nd October 2017, 06:15
Neat! Thanks for sharing your experience :)
wqcr
16th October 2017, 14:14
Another encode finished, this time a 2-pass encode of a PAL source.
Preset slower, fast 1-pass, v-bitrate 400, a-bitrate 64 (opus)
7 clips (each 7 minutes long) were finished in 22 hours.
This time I used Slimrom, which by default disables any thermal throttling, so the CPU was at 2016 MHz all the time. Tcase was just under 74°C, the phone's outer case was no more than 48°C. Power consumption jumped to 4W, so a beefier 5V/2A source had to be used.
Still, I'm again very satisfied with the result, even though x265 hasn't used any arm64 optimizations.
Performance is more or less directly comparable to a similarly clocked C2Q, except that it would run at 12 times the power consumption of this little SoC.
Motenai Yoda
17th October 2017, 15:01
7 x 7 min each = 49 min
If those are PAL, usually 720x576 at 25fps, that works out to
about 385kPx/s, which looks a bit too slow to me for a 2.0GHz C2Q. Also, it's useless to compare the efficiency of a 10-year-old CPU with a new one.
My rpi3 reaches 200kPx/s (1.2GHz armv6 + neon; maybe I have to compile it with more appropriate flags).
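The throughput figure above can be reproduced as follows. The 720x576 at 25fps frame format is the assumption stated in the post, not something confirmed by the original poster, and the calculation counts output pixels only, ignoring the extra work of the first pass.

```python
def pixel_rate(width, height, fps, video_seconds, encode_seconds):
    """Average number of video pixels encoded per second of wall time."""
    total_pixels = width * height * fps * video_seconds
    return total_pixels / encode_seconds

rate = pixel_rate(
    width=720, height=576, fps=25,   # assumed PAL format
    video_seconds=7 * 7 * 60,        # 7 clips x 7 minutes
    encode_seconds=22 * 3600,        # 22 hours of encoding
)
print(f"about {rate / 1000:.0f} kPx/s")
```

This lands at roughly 385 kPx/s, matching the estimate in the post.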
hajj_3
12th November 2017, 19:03
Would be nice if someone with one of these Qualcomm 2400 arm chips could do some x264 and x265 encode tests: https://blog.cloudflare.com/arm-takes-wing/
ReinerSchweinlin
30th December 2020, 15:50
Since I stumbled across some mentions of x265 on the M1 around the web over the past few days, I wanted to see if there are any in-depth benchmarks of the M1 Handbrake/ffmpeg HEVC variants.
Found some interesting stuff:
https://www.reddit.com/r/hardware/comments/k27c6j/wife_got_a_new_macbook_air_m1_and_i_benched_it/
https://www.youtube.com/watch?v=iGVKEcaTLhw&feature=youtu.be
Does anyone have some more comparisons?
Looking at the very low power a Mac Mini draws, and given the price, the M1 chip looks like a very interesting option for x265 HEVC encoding if you don't need "the fastest around" but solid performance for desktop/hobby use cases. Thinking of powering this with solar :)
nakTT
30th December 2020, 18:08
Tried it on my Raspberry Pi 4B (but in my case with a 32-bit OS, armhf) using Handbrake 1.2.2, and I can tell you that it's awfully slow, even for a core (ARM Cortex-A72) that is supposedly competitive with at least Intel Atom and the like in many other workloads. I think it's mainly down to the lack of optimization, unlike what x86 CPUs get. What do you think?
RanmaCanada
30th December 2020, 22:16
ARM is too slow. This is only an option if time means nothing to you. You can get better results with a 35watt APU from AMD. Heck the current laptop lineup from AMD destroys this and they literally sip power.
Blue_MiSfit
31st December 2020, 00:14
Maybe too slow for you, but ARM is rapidly becoming more and more prevalent as ARM chips can occupy an interesting quadrant on the power / speed curve. I imagine with thorough assembly optimization a modern ARM server CPU could outperform a modern x86_64 CPU in terms of efficiency.
If this wasn't the case we probably wouldn't see AWS, Apple, and Microsoft all investing in their own ARM silicon.
Granted, HEVC compression is a very specific use case :)
nakTT
31st December 2020, 03:58
I'm sure everyone would agree on the potential that the ARM design has. But currently the performance is indeed poor. I believe it is mainly due to the lack of optimization for this architecture.
And of course, I'm speaking as someone who would love to see the rise of ARM bring some serious competition to an area that has always been dominated by x86. :)
Ritsuka
31st December 2020, 09:45
HandBrake has got a patch with additional neon optimisations that has not been merged in x265 yet: https://github.com/HandBrake/HandBrake/blob/master/contrib/x265/A04-darwin-neon-support-for-arm64.patch
Any help testing it on a non-Apple CPU, or getting it merged into x265, is welcome.
benwaggoner
6th January 2021, 02:20
There have been a ton of ARM optimizations added to x265 over the years. It might not be as optimized as for x64, but there is plenty there.
It's likely more that ARM systems just haven't been intended for the sort of sustained, highly parallel workloads that x64 systems have long been built for. Apple's M1 certainly is promising for what ARM could potentially do, but it still has a much, much smaller number of cores than is available from Intel and AMD.
I'm not sure how AVX? compares with NEON for SIMD, but I know Intel has viewed SW encoding as a material workflow in their designs.