
View Full Version : x265 HEVC Encoder



benwaggoner
10th July 2022, 03:18
I've been avoiding doing actual compiling of code for 25 years, but now I've got a modified source version of x265 I need to compile for Windows 64-bit. I did the stuff in the ReadMe (yasm, etc).

I'm trying to run \x265_v3_5_mod\build\vc15-x86_64\make-solutions.bat, but I keep getting a "no visual studio 15" error, and I'm unsure which Visual Studio I need to have installed. VS 2017 15.9 didn't work. Do I use "Build Tools for Visual Studio 2017 (version 15.0)" or something else/additionally?

As a former Microsoft employee, I wish to decry how tightly coupled Visual Studio is to Windows. I shouldn't have to reboot this many times ;)!

DJATOM
10th July 2022, 10:46
AFAIK you need a link with a specially altered environment (look for it in the MSVC Programs group; I'm on Linux on my work PC and don't want to boot my Ryzen PC with Win10 just to check the exact link name). From that console you can generate the solution files. Also, after some modification you can even use MSVC 2019: https://github.com/DJATOM/x265-aMod/tree/aMod-3.5/build/vc16-x86_64

benwaggoner
11th July 2022, 05:10
AFAIK you need a link with special altered environment (look for that in MSVC Programs group
That sounds promising. Alas, I do not know what you mean by MSVC Programs group or what to look for there.

Does anyone have the link for that?

Sorry for such basic questions. The last time I really compiled something like this outside of some autobuild environment was before Visual Studio 97 was even released...

qyot27
11th July 2022, 18:13
Trying to remember off the top of my head because I'm booted into Ubuntu at the moment:

Start Menu->Programs->Visual Studio->Tools->Visual Studio Command Line Tools [x86|x64|arm64]

Basically, it's just the Start Menu shortcut to the relevant vcvarsall(%arch%).bat that loads up the compiler environment in cmd.exe. I *think* those are supposed to be automatically installed no matter what configuration options you've chosen for VS, but otherwise you may have to re-run the VS updater and select them.

In general, I wouldn't bother with the build scripts provided by x265, because those are by nature too highly coupled to a specific build environment configuration. If you know how to use CMake directly (and have notes on exactly which x265[mod]-specific configuration options you want), it's much more straightforward, since you can select exactly what you know you're working with (and you don't have to load up a GUI).

Boulder
12th July 2022, 12:45
Wasn't it the case that you have to edit the .bat files manually to support a specific VS version? I have Visual Studio 2019 Community installed on my PC, and it's apparently "Visual Studio 16 2019" for CMake.

benwaggoner
12th July 2022, 18:51
If anyone can give me an overview of the best way to go from a source directory to a Windows 64-bit binary, I'll owe you one! I don't mind going the GCC route if that's easier.

Boulder
12th July 2022, 20:38
I don't know if it helps, but I just cloned the repo and edited the make-solutions.bat in the build\vc15-x86_64 folder to this:

cmake -G "Visual Studio 16 2019" ..\..\source && cmake-gui ..\..\source

Then I ran make-solutions in the console, and it proceeded as described in the wiki. You might need to point to the path where NASM is and enable assembly in the configure step (assembly is disabled if CMake cannot find NASM). The resulting .sln file can be opened in Visual Studio 2019.

LigH
22nd July 2022, 12:06
@benwaggoner:

I enjoy using the media-autobuild suite in most cases. You can configure it down to building only what you need. Doesn't have to be a whole ffmpeg, just a separate x265.exe is fine too.

Blue_MiSfit
23rd July 2022, 19:21
Another vote for MABS. It's incredibly useful (when it's not broken, which is often lol)

Barough
23rd July 2022, 23:27
@benwaggoner

MABS gets my vote also

Selur
24th July 2022, 16:13
Yup, using MABS too when building Windows tools for Hybrid. :)

BuccoBruce
27th July 2022, 16:22
If anyone can give me an overview of the best way to go from a source directory to a Windows 64-bit binary, I'll owe you one! I don't mind going the GCC route if that's easier.

I don't know if it helps, but I just cloned the repo and edited the make-solutions.bat in the build\vc15-x86_64 folder to this:

cmake -G "Visual Studio 16 2019" ..\..\source && cmake-gui ..\..\source

Then ran make-solutions in the console and it will proceed like it is instructed in the wiki. You might need to point to the path where NASM is and enable assembly in the configure part (assembly is disabled if CMake cannot find NASM). The resulting .sln file can be opened in Visual Studio 2019.

Just a note about MABS, it will build with GCC and purely with GCC, unless you tell it to build with clang. Not an issue for most, and probably preferred if you're going to be using the libraries to link against anything else built with GCC. I still prefer it for building non-free ffmpeg (with ffmpeg's AAC, MP2, MP3, opus, and vorbis implementations disabled outright) to make an MPV build that can handle USAC and that de/encodes opus using libopus. Trying to build all those requirements separately with MSVC myself would take a few years off my life, and take forever.

For what it's worth, I've found on some machines that VS builds of x265.exe (and x264, aomenc, SVT-AV1, to name a few) perform a bit better in some cases, but only negligibly. It does seem to add up on Intel CPUs without AVX2 though. VS also lets you use profile-guided optimization (PGO) if you choose to and, in my case, disable things like Spectre slowdow...I mean mitigations, but only because I don't know how to pass that through to GCC.

I can confirm Boulder's edit worked for compiling under VS2019. I add -A x64 out of habit, so cmake -G "Visual Studio 16 2019" -A x64 ..\..\source && cmake-gui ..\..\source. You would presumably edit it to read "Visual Studio 17 2022" if you're using 2022.

Make sure you start a "x64 Native Tools Command Prompt for VS 2019" to run things from, and make sure NASM is in your PATH or you'll end up with no optimized assembly code.
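
The Windows steps described in the posts above could be sketched as a small helper script. This is only an illustration in Python, not something shipped with x265: the function names are hypothetical, the generator string and relative source path are the ones quoted in the posts, and `cmake --build` is used here as an assumed alternative to opening the .sln in Visual Studio.

```python
import shutil


def cmake_configure_args(generator="Visual Studio 16 2019", source_dir=r"..\..\source"):
    """Return the configure command quoted in the thread as an argv list."""
    # -A x64 selects the 64-bit platform for Visual Studio generators.
    return ["cmake", "-G", generator, "-A", "x64", source_dir]


def cmake_build_args(config="Release"):
    """Return a command that builds the generated solution without the IDE."""
    return ["cmake", "--build", ".", "--config", config]


def nasm_available():
    """True if a nasm executable is on PATH (needed for the optimized assembly)."""
    return shutil.which("nasm") is not None


if __name__ == "__main__":
    if not nasm_available():
        print("warning: nasm not found on PATH; the build would lack assembly")
    # Pass these lists to subprocess.run() from the x64 Native Tools prompt,
    # with the current directory set to build\vc15-x86_64.
    print(cmake_configure_args())
    print(cmake_build_args())
```

For VS 2022, `cmake_configure_args("Visual Studio 17 2022")` would be the presumed equivalent, as noted above.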

BuccoBruce
27th July 2022, 16:49
TL;DR Is there some magic bullet for muxing an HEVC elementary stream with Open GOP using mp4box and getting Media Foundation to decode it properly?

I'm running into some weird issues with Open GOP HEVC + Media Foundation decoding in an MP4 container. Muxing with ffmpeg -movflags faststart+negative_cts_offsets seems to work fine most of the time. It complains about a lack of timestamps in the raw .hevc stream and outputs VFR, so -i has to be preceded with e.g. -r 60000/1001 to get around that, and I have to -loglevel error -stats or it will just quickly fill the console, ad infinitum, with:
[mp4 @ 000001da260000c0] Timestamps are unset in a packet for stream 0.
This is deprecated and will stop working in the future.
Fix your code to set the timestamps properly
[mp4 @ 000001da260000c0] pts has no valueB time=00:00:00.00 bitrate=N/A speed= 0x
Last message repeated 103 times
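
For reference, the ffmpeg invocation described above could be assembled like this. It is a sketch in Python with a hypothetical helper name and placeholder file names; the ffmpeg options themselves are the ones mentioned in the post.

```python
def ffmpeg_remux_args(src, dst, fps="60000/1001", faststart=True, negative_cts=True):
    """Build an ffmpeg argv that stream-copies a raw .hevc file into MP4."""
    # Quiet the repeated timestamp warnings but keep the progress line.
    args = ["ffmpeg", "-loglevel", "error", "-stats"]
    # -r has to come before -i so it applies to the timestamp-less input.
    args += ["-r", fps, "-i", src, "-c", "copy"]
    flags = []
    if faststart:
        flags.append("faststart")
    if negative_cts:
        flags.append("negative_cts_offsets")
    if flags:
        args += ["-movflags", "+".join(flags)]
    args.append(dst)
    return args


if __name__ == "__main__":
    # "input.hevc" and "output.mp4" are placeholder names.
    print(" ".join(ffmpeg_remux_args("input.hevc", "output.mp4")))
```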


5760x2880, 5408x2704, 4800x2400, 4096x2048, 3840x1920, 3000x1500
Level 6 Main at the max, lower for smaller resolutions
Issue persists even with L5/Main 3000x1500 50 fps video, or 2160x2160 30 fps
8/10 bit doesn't matter
GOP length doesn't seem to matter, tried 60/600 (10 second rule), 30/300 (half), 25/250 (default)
Ref/b-frame count doesn't seem to matter, all within the limits of Level 6 or well below anyways
Thought it might be VBV limited CRF acting up and overflowing the DPB using the default Level 6 VBV, so I tried lowering the VBV, and lower bitrate ABR with a much longer RC Lookahead, issue persisted
CRF encodes ended up being fine anyways, since the issue seems to be limited to MP4Box+Media Foundation...
Disabling Open GOP magically fixes it most of the time.
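
The "10 second rule" GOP pair in the list above (minimum interval of roughly one second of frames, maximum of ten) can be derived from the frame rate. A minimal sketch, assuming that reading of the rule; the helper name is hypothetical:

```python
from fractions import Fraction


def gop_args(fps, max_seconds=10):
    """Return --min-keyint/--keyint for about 1 s and max_seconds of frames."""
    # Fraction accepts "60000/1001", "50" or an int, so NTSC rates stay exact.
    frames_per_second = round(Fraction(fps))
    return [
        "--min-keyint", str(frames_per_second),
        "--keyint", str(frames_per_second * max_seconds),
    ]


if __name__ == "__main__":
    print(gop_args("60000/1001"))
```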


Is it some kind of IDR signaling issue? Disabling Open GOP alone seemingly resolves all of the issues, but I would like to use Open GOP since these are static camera shots. Is it something really dumb like -inter 500 being too small? I guess that would make me really dumb. It's starting to seem more like a GPAC/mp4box issue, or more likely super-duper Dunning-Kruger PEBKAC, but I don't know enough about HEVC bitstream output and signaling (PPS/SPS/VUI) to know any better, so I wasted my time messing with encoder parameters.

MP4Box output is mostly unplayable, it doesn't seek, and it plays choppy, almost like what you'd expect to see when the decoder drops a temporal enhancement layer and plays back at half FPS. Using --forcesync with mp4box didn't help either. I tried MP4Box with an added --negctts but that just outputs "Arg negctts set but not used" in the console - whereas using negative_cts_offsets in ffmpeg seems to fix things?! Either way, the issue only seems to be with Media Foundation playback. Using an MKV and/or decoding with LAV works fine, as does decoding in software or using MPV or even ffplay.

---

What about any of the x265 bitstream options, could they help? Based on the documentation, --repeat-headers seems like it's only useful for trying to seek within the elementary stream output before muxing it. --aud? --eos? --hrd? I thought --idr-recovery-sei might help, but enabling it along with --repeat-headers seemingly made things worse. mp4box's output when trying to mux a stream made with those two options makes avidemux crash immediately, and makes ffmpeg (MPV) have serious issues playing the file too. This seems to be regardless of the parameters I tried with mp4box: -inter 0 to force a flat mp4, letting it do the default -inter 500, and trying with and without --forcesync for both options.

I am pretty sure I tried --nosei, but that'd just be throwing away the extra stuff I asked x265 to write, and then I'd have to waste my time remembering how to re-signal bt709/limited. Trying to remux any of that mp4box output using ffmpeg results in a file that is entirely unplayable in anything, it skips back and forth randomly, you get intermittently decoded blocks, etc. It's also the only result that could technically allow posting a screenshot, since it's NSFW video...it's VR pr0n alright?

I can mux directly from the raw .hevc stream if I use those two x265 options with ffmpeg but only with no other parameters, just forcing the FPS with -r 60000/1001 to prevent erroneous VFR output, and -c copy. I haven't tried -movflags faststart, and using negative_cts_offsets seemingly breaks these files too. I also have yet to try putting ffmpeg's output back through mp4box. I am streaming these from a NAS and would prefer to have the MOOV atom at the beginning of the file, so a working flat mp4 is only "half fixed".

What's even weirder is I can take a working mp4 from elsewhere at the same resolution, frame rate, and bitrate, and with seemingly identical x265 settings visible in the SEI, and remux it all I want with mp4box. It doesn't break playback under Media Foundation. The only difference is the version tag for x265 reading 0.0 - these working files were also seemingly muxed with ffmpeg (Lavf58.12.100) or even encoded directly with it using -c:v libx265. Looking at that file, it looks like the only options they passed to x265 were --bitrate 30000 --output-depth 10 --colormatrix=2 --colorprim=2 --transfer=2 --videoformat=5. Everything else is --preset medium defaults.

I've just been using --preset medium with some slower options selectively enabled, and some that I thought would help lower bitrate when I thought that was the issue.

--bitrate 20000 --output-depth 10 --level-idc 6 --no-high-tier
--rect --amp --tskip --tskip-fast --b-intra --limit-modes
--vbv-bufsize 30000 --vbv-maxrate 40000
--analyze-src-pics --rc-lookahead 120 --min-keyint 60 --keyint 600
--fades --video-signal-type-preset BT709_YCC
--opt-qp-pps --opt-ref-list-length-pps --opt-cu-delta-qp
--limit-sao --selective-sao 1 --sao-non-deblock

Plus either +,- or -,+ for pools on a dual socket system, and I've obviously tried with/without --repeat-headers --idr-recovery-sei . Adding/removing any of --b-intra --fades --analyze-src-pics --opt-qp-pps --opt-ref-list-length-pps --opt-cu-delta-qp didn't make a difference either - I'm just including the command I tried with the most options for completeness.

Boulder
29th July 2022, 12:23
A note for MABS users who build for Zen 2/3: add -march=znver2 or -march=znver3 to custom_profile in the local64\etc directory. It gives slightly better performance on those chips; I think I measured it at 3-4% faster when I tested on my 3900X.

LigH
30th July 2022, 20:06
New upload: x265 3.5+39-a599806d3 (https://www.mediafire.com/file/s6srzkosonu5yai/x265_3.5+39-a599806d3.7z/file)

[Windows][GCC 12.1.0][32/32XP/64 bit] 8bit+10bit+12bit

LeXXuz
30th July 2022, 20:42
I was wondering if someone could build me a Zen3 optimized and Zen2 optimized Windows version for my 5950x and 3950x CPUs. That would be much appreciated. :o :thanks:

RanmaCanada
1st August 2022, 15:34
I was wondering if someone could build me a Zen3 optimized and Zen2 optimized Windows version for my 5950x and 3950x CPUs. That would be much appreciated. :o :thanks:

Pretty sure DJATOM (https://github.com/DJATOM/x265-aMod) has the best. Yes it's an older build, but x265 has been in maintenance mode for well over a year now.

LeXXuz
1st August 2022, 19:20
Pretty sure DJATOM (https://github.com/DJATOM/x265-aMod) has the best. Yes it's an older build, but x265 has been in maintenance mode for well over a year now.

I know, that's why I'd like a current build for comparison. :)

benwaggoner
1st August 2022, 19:23
TL;DR Is there some magic bullet for muxing an HEVC elementary stream with Open GOP using mp4box and getting Media Foundation to decode it properly?
I've had some .hevc files that don't play properly when muxed in mp4box, but do when muxed in ffmpeg. ffmpeg complains enormously about missing PTS data, but seems to fix it fine.

They were all Closed GOP, though, so potentially unrelated to your issue.

BuccoBruce
1st August 2022, 22:19
I've had some .hevc files that don't play properly when muxed in mp4box, but do when muxed in ffmpeg. ffmpeg complains enormously about missing PTS data, but seems to fix it fine.

They were all Closed GOP, though, so potentially unrelated to your issue.

Might still be related, I re-encoded so many files I might have forgotten if something other than Open GOP was also causing it. Guess I'll stick to ffmpeg for HEVC and just use mp4box for AVC+HLS.

LeXXuz
2nd August 2022, 06:13
Is SAO still an issue for high quality encodes with current builds, or can it safely be activated now?

microchip8
2nd August 2022, 07:23
Is SAO still an issue for high quality encodes with current builds, or can it safely be activated now?

it's still an issue

LeXXuz
24th August 2022, 11:58
I'm tinkering around with my profiles to gain more speed out of my encodes. The significant rise in electricity cost here in Germany made that decision necessary. :(

I have a question regarding the --limit-refs parameter. As there is a huge speed difference between modes 1 and 3, and I was told to use mode 1 for better quality, I have now also tested mode 2, which none of the presets seem to use by default.

I got a decent performance increase with mode 2 over mode 1 and tested this with quite a few examples. Can't say I've seen any notable differences in quality so far.

I read the docs about the different modes, but in all honesty I don't really understand what's written there or how it may affect quality.

I always do high bitrate encodes with the "slower" preset as a base and CRF values of 18 or even below. Is there any good reason NOT to use mode 2 over 1 for better performance? :o

benwaggoner
24th August 2022, 18:42
To test more subtle features like this, I strongly recommend using a 2-pass --bitrate encode instead of CRF. It's hard to disentangle impacts on quality when bitrate is also varying. 1-pass CBR can also work, and is faster.
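
A sketch of that A/B approach: build matching two-pass command lines that differ only in the option under test, so bitrate stays fixed and only quality varies. This is an illustration in Python; the helper name, file names and the 8000 kbps target are hypothetical, while --pass, --stats, --bitrate, --input and --output are regular x265 CLI options.

```python
import os


def two_pass_commands(src, dst, bitrate_kbps, extra=(), stats="x265_2pass.log"):
    """Return (first_pass, second_pass) argv lists for a 2-pass --bitrate encode."""
    base = ["x265", "--input", src, "--bitrate", str(bitrate_kbps),
            "--stats", stats, *extra]
    # The first pass only needs to write the stats file, so discard its video.
    first = base + ["--pass", "1", "--output", os.devnull]
    second = base + ["--pass", "2", "--output", dst]
    return first, second


if __name__ == "__main__":
    # Compare --limit-refs 1 against 2 at the same (hypothetical) bitrate.
    for mode in (1, 2):
        first, second = two_pass_commands(
            "clip.y4m", f"limit_refs_{mode}.hevc", 8000,
            extra=["--preset", "slower", "--limit-refs", str(mode)],
            stats=f"limit_refs_{mode}.log",
        )
        print(" ".join(first))
        print(" ".join(second))
```

The two outputs can then be compared visually or with a metric of choice, since their file sizes should be nearly identical.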

benwaggoner
24th August 2022, 19:03
I'm tinkering around with my profiles to gain more speed out of my encodes. The significant rise in electricity cost here in Germany made that decision necessary. :(
If you're looking for ways to reduce joules/pixel, --frame-threads 1 can really help. The overhead of frame threading can really reduce power efficiency, and doesn't always have that big of a speed boost depending on how many cores you have and the resolution you're encoding at.
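
Joules/pixel is simply average power draw divided by pixel throughput, so it can be estimated from a wattmeter reading and the fps that x265 reports. A minimal sketch; the wattage and fps figures below are made-up placeholders, not measurements from this thread:

```python
def joules_per_pixel(avg_watts, fps, width, height):
    """Energy per encoded pixel: watts / (frames per second * pixels per frame)."""
    if fps <= 0 or width <= 0 or height <= 0:
        raise ValueError("fps, width and height must be positive")
    return avg_watts / (fps * width * height)


if __name__ == "__main__":
    # Hypothetical 1080p readings for two settings of --frame-threads.
    default_threads = joules_per_pixel(avg_watts=140.0, fps=6.0, width=1920, height=1080)
    one_thread = joules_per_pixel(avg_watts=120.0, fps=5.5, width=1920, height=1080)
    print(f"default frame threads: {default_threads:.3e} J/pixel")
    print(f"--frame-threads 1:     {one_thread:.3e} J/pixel")
```

With these invented numbers the slower run still comes out ahead per pixel; whether that holds on a given machine has to be measured.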

If you use SAO, --selective-sao 2 saves a bit without material quality impact.

If you can share your current command line, we might have other suggestions.

In general, the --preset options are pretty well tuned for a typical range of content and scenarios as of x265 3.0. They don't include any features added in 3.1 or later, which is why no --selective-sao, --rskip 2, etcetera, even though those really should be the defaults.

LeXXuz
24th August 2022, 23:07
Thank you for those suggestions. :thanks:

Right now I recode 1080p content

I use these settings:

--preset slower --crf 17.00 --qpfile "E:\WORK\chp.qpf"
--repeat-headers --input-depth 16 --output-depth 10 --dither
--ctu 32 --limit-refs 2 --psy-rdoq 5 --selective-sao 0 --no-sao
--colorprim bt709 --transfer bt709 --colormatrix bt709

CPUs used are Ryzen 5950x and 3950x.

benwaggoner
25th August 2022, 01:49
--preset slower is already one of the better-balanced presets. Swapping individual parameters from slower for the ones from slow will speed things up, but all of them have quality impacts too.

There's no point to using --selective-sao if you're already using --no-sao.

I always like to set --profile and --level-idc so I'll get warnings if I violate the requirements. In your case that looks like --profile main10 --level-idc 4.0 or 4.1.
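
To see where 4.0 versus 4.1 comes from, here is a rough lookup based on the luma picture size and luma sample rate limits from the HEVC level table. It is a simplified sketch: it ignores bitrate/CPB limits, tiers, DPB size and the per-dimension limits, so treat the result as a lower bound rather than what x265 would necessarily pick.

```python
# (level, max luma picture size in samples, max luma sample rate in samples/s)
LEVEL_LIMITS = [
    ("3.0", 552_960, 16_588_800),
    ("3.1", 983_040, 33_177_600),
    ("4.0", 2_228_224, 66_846_720),
    ("4.1", 2_228_224, 133_693_440),
    ("5.0", 8_912_896, 267_386_880),
    ("5.1", 8_912_896, 534_773_760),
    ("5.2", 8_912_896, 1_069_547_520),
    ("6.0", 35_651_584, 1_069_547_520),
    ("6.1", 35_651_584, 2_139_095_040),
    ("6.2", 35_651_584, 4_278_190_080),
]


def min_level(width, height, fps):
    """Lowest listed level whose picture-size and sample-rate limits both fit."""
    picture_size = width * height
    sample_rate = picture_size * fps
    for level, max_size, max_rate in LEVEL_LIMITS:
        if picture_size <= max_size and sample_rate <= max_rate:
            return level
    return None  # exceeds every level in the table


if __name__ == "__main__":
    for w, h, fps in [(1920, 1080, 24), (1920, 1080, 60)]:
        print(f"{w}x{h} @ {fps} fps -> level {min_level(w, h, fps)}")
```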

Using --psy-rdoq 5 without raising --psy-rd as well is an uncommon configuration, but should work.

I'd use --rskip 2 to replace the default --rskip 1 because it's a better quality mode. I've not directly compared the speed. Higher --rskip-edge-threshold values are faster, but can reduce quality. I tend to use 2-3 in my stuff, but I'm more biased towards quality/efficiency than your use case.

What CPU are you running on?

The biggest thing to improve pixels/joule without any quality loss would be --frame-threads 1. Lower values can actually improve quality.

You can learn a lot from doing a --csv-log-level 2 and looking at the frame level data. For example, if there aren't a lot of TUs smaller than 8x8 you could reduce --tu-intra-depth and --tu-inter-depth by 1. Recursing all the way down is mostly helpful with content that has sharp details, like text and cel animation.

If you have a lot of RAM, increasing --rc-lookahead can improve quality when VBV-limited quite a lot without much negative speed impact.

LeXXuz
25th August 2022, 09:38
There's no point to using --selective-sao if you're already using --no-sao.
I was uncertain if I have to set it to 0 as well when I don't want to have SAO at all. I will remove that parameter.


I always like to set --profile and --level-idc so I'll get warnings if I violate the requirements. In your case that looks like --profile main10 --level-idc 4.0 or 4.1.
Again, I was unsure if I should let x265 decide on its own or put these in manually. Never thought about the violation warnings though which is a very good point. Will add these again.


Using --psy-rdoq 5 without raising --psy as well is an uncommon configuration, but should work.
Well, that is a longer story; it's the most subtle approach I've found so far to fight banding with the quite clean source material I have. The slower preset already uses --psy-rd 2. Raising that any higher added too much static noise to flat areas for my taste.
It's barely visible on 4k, but visible on 1080p and almost terrible on SD.
Without raising at least --psy-rdoq a little, x265 tends to produce banding in certain flat areas. And sadly my living room TV is very susceptible to that and tends to intensify even the slightest banding compared to my other TVs. So this is somewhat a personal compromise.


I'd use --rskip 2 to replace the default --rskip 1 because it's a better quality mode. I've not directly compared the speed. Higher --rskip-edge-threshold values are faster, but can reduce quality. I tend to use 2-3 in my stuff, but I'm more biased towards quality/efficiency than your use case.
I'll add --rskip 2 to my script.


What CPU are you running on?
AMD Ryzen 5950x and 3950x. Both with 16 cores/32 threads


The biggest thing to improve pixels/joule without any quality loss would be --frame-threads 1. Lower values can actually improve quality.
Doesn't that decrease speed a lot as it reduces parallel processing? Or am I mistaken here?



If you have a lot of RAM, increasing --rc-lookahead can improve quality when VBV-limited quite a lot without much negative speed impact.
The machines have 64GB. I think the default is 40 for the --slower preset? How much should I raise that?

Thanks again for your valued input Ben. :)

vpupkind
25th August 2022, 20:25
rc_lookahead -- at least 1s worth of frames
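
As a quick worked example of that rule of thumb: take the larger of the current --rc-lookahead value and one second of frames. The cap of 250 is my assumption of the upper bound the x265 docs give for --rc-lookahead, and the helper name is hypothetical.

```python
import math
from fractions import Fraction

X265_MAX_LOOKAHEAD = 250  # assumed documented upper bound for --rc-lookahead


def one_second_lookahead(fps, current):
    """Raise the lookahead to at least one second of frames, capped at the maximum."""
    # Fraction keeps rates such as "24000/1001" exact before rounding up.
    needed = math.ceil(Fraction(fps))
    return min(max(current, needed), X265_MAX_LOOKAHEAD)


if __name__ == "__main__":
    # 40 is the value mentioned above for the slower preset.
    print(one_second_lookahead("24000/1001", 40))
    print(one_second_lookahead("60000/1001", 40))
```

For 24 fps film content the preset value of 40 already exceeds one second, so the rule only changes anything at higher frame rates.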

Immaculate
25th August 2022, 23:13
Thanks for the tips. --rskip 2 seems to improve grain "motion" quite a bit in some cases.

It's a shame that you have to fiddle with x265 to get an acceptable quality, when a simple --tune film --preset veryslow produces good results with x264. Of course, clean material isn't an issue, it's just that x264 looks better with noise/grain - out of the box.

FranceBB
28th August 2022, 20:24
A friend of mine bought a brand new Apple MacBook Pro that he wanted to use, so after mocking him a bit, I said: "you know what, I could run some benchmarks to see how ARM chips are gonna do in encoding" and I expected poor results, but... it looks like ARM chips are somehow closing the gap on x86 and I still can't figure out how nor why.

Anyway, here's the benchmark, enjoy.


Test Platform 1:

Apple MacBook Pro 2021
CPU: Apple M1 Max 10c/10th (8 performance + 2 energy) ARM
GPU: Apple M1 Max 32-core GPU
RAM: 64 GB LPDDR5 (unified memory)
OS: Mac OS Monterey 12.5.1
Year: 2021


Test Platform 2:
Asus N552VX
CPU: Intel i7 6700HQ 4c/8th x86_64
GPU: NVIDIA GTX 950M 4GB GDDR5 (640CUDA Cores)
RAM: 64 GB DDR4
OS: Fedora Linux 36
Year: 2016

Test Platform 3:
CPU: Intel i7 10750H 6c/12th x86_64
GPU: NVIDIA RTX 2060 6GB GDDR6 (1920 CUDA Cores)
RAM: 16 GB DDR4
OS: Windows 10 Enterprise x64
Year: 2020


Test number 1: Video Decoding
Software used MPV

Source file: DCP (Digital Cinema Package)
Codec: MJPEG 2000
Profile: D-Cinema 4K (All Intra)
Bitrate: 124 Mbit/s
Resolution: 4096x1716 (2.40 LB)
Framerate: 24p
Colorspace: XYZ
Sampling: 4:4:4
Range: Full PC Range
Bit Depth: 12bit


Apple M1 Max 10c/10th ARM
CPU Usage: 100%
Total Frames of the clip: 480 frames
Number of dropped frames: 141

Intel i7 6700HQ 4c/8th 3.5GHz
CPU Usage: 100%
Total Frames of the clip: 480 frames
Number of dropped frames: 138


Test number 2: Video Encoding
Software used: x265 (with FFMpeg decoding)


Source file: Studio Masterfile
Codec: MJPEG 2000
Profile: D-Cinema 4K (All Intra)
Bitrate: 124 Mbit/s
Resolution: 4096x1716 (2.40 LB)
Framerate: 24p
Colorspace: RGB
Sampling: 4:4:4
Range: Full PC Range
Bit Depth: 12bit


Encoding target: H.265 HEVC 4:4:4 12bit YUV
Preset used: --preset placebo --crf 28

Apple M1 Max 10c/10th ARM
CPU Usage: 100%
Threadpool created using 10 threads
Using cpu capabilities: NEON
Speed: 0.6fps

https://i.imgur.com/H2ZuYyV.png

Intel i7 6700HQ 4c/8th 3.5GHz
CPU Usage: 100%
Threadpool created using 8 threads
Using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
Speed: 0.2fps

https://i.imgur.com/UKJ1tfv.png

Intel i7 10750H 6c/12th 2.6GHz
CPU Usage: 100%
Threadpool created using 12 threads
Using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
Speed: 0.4fps

https://i.imgur.com/bvpXqtK.png


To my surprise, even though the decoding benchmark was poor on both ends, when it came to encoding the Apple M1 Max 10c/10th outperformed both the old i7 4c/8th and the newer i7 6c/12th, holding a steady 0.6fps. The i7 6700HQ eventually reached 97°C and throttled from 3.5GHz down to 2.5GHz, dropping to 0.2fps; the i7 10750H reached around the same temperature and fell from a turbo boost of 4.8GHz to 2.60GHz, hence 0.4fps...
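
Put as ratios, using only the fps figures reported above (the helper name is just for illustration):

```python
RESULTS_FPS = {
    "Apple M1 Max": 0.6,
    "Intel i7 6700HQ": 0.2,
    "Intel i7 10750H": 0.4,
}


def speedups(results, baseline):
    """Encoding speed of every platform relative to the chosen baseline."""
    base = results[baseline]
    # Round to two decimals to hide floating-point noise in the division.
    return {name: round(fps / base, 2) for name, fps in results.items()}


if __name__ == "__main__":
    for name, ratio in speedups(RESULTS_FPS, "Intel i7 6700HQ").items():
        print(f"{name}: {ratio}x")
```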

So... looks like modern mobile ARM chips are indeed just slightly faster than 2016 mobile x86 CPUs, who would have thought that...
I've always been someone who thought x86 was gonna be faster, no matter what, due to the intrinsics / instruction set, but looks like I was wrong.

Still, given that ARM chips are actually meant to perform computationally light tasks like browsing the web, reading emails, sending messages in a chat etc, I don't think they'll ever make their way into a desktop, nor should they be used for encoding (and neither should laptops in general).
Still, for a laptop, this is quite remarkable, given that their own architecture is supposed to be against them... O_O

rwill
28th August 2022, 20:40
<removed>

FranceBB
28th August 2022, 20:42
Ah, the daily Alternative Facts.



It should say "NEON" for cpu capabilities when running on ARM. You did it wrong.

Damn. Let me repeat the test and update it...


EDIT: Updated the other post with the new benchmark using brew install homebrew-ffmpeg --with-neon (source here (https://www.osxexperts.net/FFmpeg51ARM.zip)), the M1 won. O_O Why? How?! Daaaaaaaaaaaamn

rwill
28th August 2022, 20:51
If you look here:

https://bitbucket.org/multicoreware/x265_git/src/master/source/common/cpu.cpp

You somehow have to get it to compile with X265_ARCH_ARM64.

The other way is to offer an AMD64 binary to OSX and let it recompile the x86/SSE/AVX machine code to ARM/NEON at runtime. This is sub-optimal but still faster than x265 reporting cpu capabilities as "none"

I am actually cautiously expecting ~0.7 to ~0.9 fps on an M1 then, if you got ~0.2 before.

FranceBB
28th August 2022, 21:10
Yep, I've got it right now.
Thanks, I had no idea that I had to download the NEON build.
I've got 0.6fps steady now, with the fan going like crazy, while my poor i7 succumbed to 0.2fps and my slightly newer i7 went to 0.4fps... :|

rwill
29th August 2022, 04:24
Yep, I've got it right now.
Thanks, I had no idea that I had to download the NEON build.
I've got 0.6fps steady now, with the fan going like crazy, while my poor i7 succumbed to 0.2fps and my slightly newer i7 went to 0.4fps... :|

Well, I think your 'newer' Intel Core i7-10750H is still on a 14nm process node, so my guess is that it is hitting thermal or power limits. You could use HWiNFO64 or similar monitoring software to find out. Intel kind of relied on the trick of allowing excessive power consumption for a short burst before throttling very hard, to win in benchmarks.

The M1 is a good chip, and Apple designing their stuff around it helps to deploy it optimally. In the mobile space anyway.

Amazon spent some time/money on optimizing x265 for ARM/NEON, I guess they run it on their Graviton AWS instances. So x265 was already optimized for a somewhat similar ARM platform.

Ritsuka
29th August 2022, 06:19
I think homebrew is missing the latest NEON optimizations available in the x265 master branch (or maybe not? it's not clear to me if it's using the 3.5 tarball or the git master branch). There is an additional patch to tune the thread pools for the M1 on https://github.com/HandBrake/HandBrake/blob/master/contrib/x265/A03-threads-pool-adjustments.patch

FranceBB
29th August 2022, 07:03
#if MACOS && X265_ARCH_ARM64
p->frameNumThreads = 8;

So if I got it right, it's gonna create a threadpool using the 8 real cores and not throw the 2 energy-saving cores in, probably 'cause otherwise calculations would be divided equally, and therefore the powerful cores would have to wait for the energy-saving ones to finish before syncing the results of the split multithreaded operations.

I think homebrew is missing the latest NEON optimizations available in the x265 master branch

Yeah, if I understood that piece of code right, it definitely is, 'cause it created a threadpool with 10 threads, thus using both the 8 performance cores and the 2 energy-saving cores.
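
A toy model of that reasoning, assuming work is split into equal static shares, one per core. The relative speed of 0.5 for the efficiency cores is a made-up number, and a real encoder's thread pool may hand out work dynamically, so this only illustrates the argument and is not how x265 actually schedules.

```python
def equal_split_time(work, core_speeds):
    """Wall time when `work` is divided equally and everyone waits for the slowest core."""
    share = work / len(core_speeds)
    return max(share / speed for speed in core_speeds)


if __name__ == "__main__":
    performance = [1.0] * 8  # 8 performance cores, speed normalized to 1.0
    efficiency = [0.5] * 2   # 2 efficiency cores at a hypothetical half speed
    print("8P + 2E:", equal_split_time(1.0, performance + efficiency))
    print("8P only:", equal_split_time(1.0, performance))
```

Under these assumptions the 10-thread split finishes later than the 8-thread one, because the two slow shares dominate the wall time.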

nevcairiel
29th August 2022, 11:03
Both your x86 machines are old and mid-range at best, while the MacBook is using a brand new 10-core high-end CPU. It's no surprise it wins. Apple has also been contributing ARM improvements to a variety of multimedia projects to help accelerate them, as historically most development was focused on x86 only.

excellentswordfight
29th August 2022, 11:14
A friend of mine bought a brand new Apple MacBook Pro that he wanted to use, so after mocking him a bit, I said: "you know what, I could run some benchmarks to see how ARM chips are gonna do in encoding" and I expected poor results, but... it looks like ARM chips are somehow closing the gap on x86 and I still can't figure out how nor why.

Well, it's not surprising. ARM64 is a much more modern ISA than x86, developed for modern computing, whereas x86 has tons of legacy drawbacks. And Apple has created a pretty remarkable architecture around that, and at release they had the highest single-threaded performance available (https://images.anandtech.com/graphs/graph16252/119160.png) (Geekbench 5 is actually a very decent benchmark for general performance nowadays). That's imo incredible given the difference in power draw/frequency; in terms of perf/W it's a slaughter. So when you can scale that up to something like an M1 Ultra, which has more transistors than a 64C Epyc and an Nvidia A100 combined (!), well, there is a lot of performance potential there.

The biggest reason why software rendering/encoding performance has been a bit lacking is the heavy x86 SIMD optimization of current software; this is an even greater issue as Rosetta doesn't support emulation of AVX. We will just have to wait until software starts to get (good) NEON optimization before we can see the actual capability of these designs in these kinds of loads.

Still, given that ARM are actually meant to perform computationally low tasks like browsing the web, reading emails, sending messages in a chat etc
It doesn't sound like you have followed the development of ARM at all; this is not the case. There is a reason why Nvidia wanted ARM, and why there is so much development in the datacenter space when it comes to ARM designs (ARM's Neoverse, Qualcomm buying Nuvia, etc.)...

Blue_MiSfit
29th August 2022, 19:57
Apple's hardware is remarkable, especially when you look at the performance per watt.

I'm also astonished at how well it performs even on non-native code via Rosetta 2 emulation. On the workstation I'm a Windows guy until the bitter end, but wow this is impressive.

The current generation of ARM servers is also quite good, but from what I've seen they're not on the same level as Apple for things like x265. More NEON optimizations in this and other similar apps will be a boon to media workloads.

TL;DR -- ARM is an absolute monster when the software is sufficiently optimized.

rwill
30th August 2022, 04:48
TL;DR -- ARM is an absolute monster when the software is sufficiently optimized.

Well ARM is just the instruction set here, most IP cores ARM offers remain low power.

There is nothing really stopping AMD or Intel from bolting some ARM or RISC-V instruction decoder in front of their CPU designs to get high performance CPUs without the x86 instruction set. Ok, maybe they need to do some touch-ups here and there, but still...

Both companies have experience with designing CPUs that perform rather well for a given power envelope. They have most CPU building blocks ready. AMD even had some ARM Opteron project a couple of years ago but that was canceled I think.

Have you tried running x265 on a Raspberry Pi 4 yet?

benwaggoner
30th August 2022, 18:22
Thanks for the tips. --rskip 2 seems to improve grain "motion" quite a bit in some cases.
That's what it does, lowering --rskip-edge-threshold to 2-3 can further improve grain quality, although at a cost of some speed.

It's a shame that you have to fiddle with x265 to get an acceptable quality, when a simple --tune film --preset veryslow produces good results with x264. Of course, clean material isn't an issue, it's just that x264 looks better with noise/grain - out of the box.
Yeah, x265 hasn't had its presets refactored since 3.0, despite some new features added since then that should be defaults.

As a community, we could probably define some revised presets, à la "slower except --selective-sao 2 and --rskip 2 --rskip-edge-threshold 3"
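As a rough sketch of what such a revised preset would boil down to on the command line: the function name and file names below are placeholders, and the values are just the ones floated in this thread, not tested recommendations. The threshold option is spelled --rskip-edge-threshold, as in the quoted reply above.

```python
def revised_slower_args(source: str, output: str) -> list[str]:
    """Build an x265 argument list for the 'slower plus extras' idea (illustrative only)."""
    return [
        "x265",
        "--preset", "slower",            # start from the stock preset
        "--selective-sao", "2",          # override suggested in the thread
        "--rskip", "2",                  # edge-based recursion skip
        "--rskip-edge-threshold", "3",   # lower threshold, costs some speed
        "--input", source,
        "--output", output,
    ]


# Print the command so it can be inspected or pasted into a shell.
print(" ".join(revised_slower_args("in.y4m", "out.hevc")))
```

Keeping the overrides in one list makes it easy to compare them against the stock preset and to pass them to subprocess.run() without shell quoting issues.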

LigH
10th September 2022, 11:29
New upload: x265 3.5+39-3ca6a8197 (https://www.mediafire.com/file/794hz6035k00xdf/x265_3.5+39-3ca6a8197.7z/file)

[Windows][GCC 12.2.0][32/32XP/64 bit] 8bit+10bit+12bit

The git revision hash may be the only obvious change since my last upload; and the GCC version.

tormento
14th September 2022, 11:12
x265 3.5+39-3ca6a8197
Did you try to encode something with it?

I have tried x64 with StaxRip and it simply doesn't work, not throwing any error.

tormento
14th September 2022, 20:20
New upload
With

D:\Eseguibili\Media\StaxRip\Apps\FrameServer\VapourSynth\vspipe.exe "F:\In\Attacco dei giganti S2\01_temp\01.vpy" - --y4m | D:\Eseguibili\Media\StaxRip\Apps\Encoders\x265\x265.exe --frames 34816 --crf 22 --output-depth 10 --aq-mode 5 --fades --pools 10 --colorprim bt709 --colormatrix bt709 --transfer bt709 --range limited --qpfile "F:\In\Attacco dei giganti S2\01.qp" --y4m --output "F:\In\Attacco dei giganti S2\01_temp\01_out.hevc" -

I get

Error: fwrite() call failed when writing frame: 5, plane: 0, errno: 32

Barough
14th September 2022, 22:34
x265 v3.5+40-1ea20502b
Built on September 13, 2022, GCC 12.2.0

https://www.mediafire.com/file/kfmz9m3q4r4fc6l/

x265 Note :

The commit value on the binary is wrong due to something upstream at MulticoreWare.

Correct commit number for this release is
9311783

LigH
15th September 2022, 00:21
Did you try to encode something with it?

I just tested encoding a few physical Y4M files, that worked.

rwill
15th September 2022, 02:05
With

D:\Eseguibili\Media\StaxRip\Apps\FrameServer\VapourSynth\vspipe.exe "F:\In\Attacco dei giganti S2\01_temp\01.vpy" - --y4m | D:\Eseguibili\Media\StaxRip\Apps\Encoders\x265\x265.exe --frames 34816 --crf 22 --output-depth 10 --aq-mode 5 --fades --pools 10 --colorprim bt709 --colormatrix bt709 --transfer bt709 --range limited --qpfile "F:\In\Attacco dei giganti S2\01.qp" --y4m --output "F:\In\Attacco dei giganti S2\01_temp\01_out.hevc" -

I get

Error: fwrite() call failed when writing frame: 5, plane: 0, errno: 32

Check again, you should get something like


x265 [error]: Aq-Mode is out of range
x265 [error]: x265_encoder_open() failed for Enc,
x265 [error]: Failure generating stream headers in x265
Segmentation fault


Errno 32 is "Broken pipe" (EPIPE): x265 dies, so vspipe.exe can no longer push data downstream.
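The same failure can be reproduced without either program. This is a minimal sketch assuming a POSIX system; the function name is made up, and a bare os.pipe() stands in for the vspipe-to-x265 pipe. Once the reading side is gone, the writer gets EPIPE on its next write.

```python
import errno
import os


def write_after_reader_exit() -> int:
    """Return the errno seen when writing to a pipe whose read end is closed."""
    read_end, write_end = os.pipe()
    os.close(read_end)  # the "encoder" side of the pipe goes away
    try:
        os.write(write_end, b"frame data")  # the "frame server" keeps writing
    except BrokenPipeError as exc:
        return exc.errno
    finally:
        os.close(write_end)
    return 0


code = write_after_reader_exit()
print(code == errno.EPIPE, os.strerror(code))
```

So the fwrite() error is only a symptom: the message worth reading is whatever the downstream process printed before it exited.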

tormento
15th September 2022, 09:36
Aq-Mode is out of range
Is there an updated x265 with aq-mode 5, besides the Patman ones?