Naive Question: Supercharge x264

Easy · 30th November 2022, 16:15

This question may be naive, but I would still like a professional answer to it if possible.

The question is: why not optimize the x264 code with x265 code?
So here is the naive part: x265 is known, for example, for using more and bigger partitions than x264, such as 32x32 and 64x64. Why not use something like a 'group of partitions' in x264, for example sixteen 8x8 blocks to get 32x32?
Using more fake CTU parts in macroblocks is only an example; it may not be useful in the real world, but it is at least something that can be tested. There may also be DCT optimizations that could be ported.
Also, should it be technically possible to use CABAC and CAVLC simultaneously to reduce decoding spikes?
I know that one cannot touch the decoder side of things, as the output should stay compatible with existing decoders. So not too many 'features' can be ported without also changing how they are decoded.

An answer, especially from people who have looked at the source code, would be very helpful. I don't know C++ or asm, but I know what a function looks like and some rough porting theory. So maybe this could be my weekend hobby for the next year or so.

mastrboy · 30th November 2022, 16:34

If the goal of this exercise is to optimize CPU usage or speed up the encode process, you can try https://github.com/master-of-zen/Av1an

Easy · 30th November 2022, 17:36

Quote:

Originally Posted by mastrboy

If the goal of this exercise is to optimize CPU usage or speed up the encode process, you can try https://github.com/master-of-zen/Av1an

Thank you for sharing. This is one aspect of what I have in mind, and this repo sounds amazing. Still, I would also like to see quality improvements (of course the goal is to improve things).
Av1an looks like half the work, but it does not cover everything I have in mind.

To clarify further, I want to improve x264 with speed and quality options, initially with improved code from newer codecs like x265.

benwaggoner · 1st December 2022, 01:27

Quote:

Originally Posted by Easy

This question may be naive, but I would still like a professional answer to it if possible.

The question is: why not optimize the x264 code with x265 code?
So here is the naive part: x265 is known, for example, for using more and bigger partitions than x264, such as 32x32 and 64x64. Why not use something like a 'group of partitions' in x264, for example sixteen 8x8 blocks to get 32x32?
Using more fake CTU parts in macroblocks is only an example; it may not be useful in the real world, but it is at least something that can be tested. There may also be DCT optimizations that could be ported.
Also, should it be technically possible to use CABAC and CAVLC simultaneously to reduce decoding spikes?
I know that one cannot touch the decoder side of things, as the output should stay compatible with existing decoders. So not too many 'features' can be ported without also changing how they are decoded.

An answer, especially from people who have looked at the source code, would be very helpful. I don't know C++ or asm, but I know what a function looks like and some rough porting theory. So maybe this could be my weekend hobby for the next year or so.

Interesting ideas. I'm not sure either whether your examples would provide practical benefit, but you're thinking about the problem in a good, creative way.

MultiCoreWare licensed all of x264, and x265 started out largely by mixing x264's performance and psychovisual optimization work with the HEVC reference encoder. After a few years, they started going in all-new directions not based on either code base. But the deal was that they would contribute everything back to x264 so relevant features could be added. Not much actually got backported, though. Probably because x265 development really got into gear as x264 was sliding into more of a maintenance mode.

There's lower-hanging fruit in adding features to x264 that I know MCW contributed in x264-compatible ways, like x265's very awesome csv logging. I don't know how easy it would be to find old MCW contributions that never made it into the main branch, but if you can, I bet there's a lot of fun stuff that could be made functional with a couple weeks' work.

Assembly stuff would be a whole lot harder, as assembly is a lot harder to work with in general. And H.264 itself just doesn't have the same opportunities for really big SIMD functions to speed things up. There are only 4x4 and 8x8 blocks, while HEVC can go from 4x4 to 32x32, including 4x16 and many other power-of-two variations. x265 gets bigger chunks of data to work with at once, and has enormously more mode decisions to make. In a 64x64 block of pixels, x265 has many dozens (hundreds) of different ways it can break the block down into prediction and transform units, especially if --amp and --rect are used. x264 only needs to figure out whether to use the 4x4 or the 8x8 transform for each 16x16 macroblock. And each of those TUs can use one of 33 (IIRC) angular prediction modes instead of H.264's eight directional ones, plus whole new mode types like lossless and transform skip.
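
To put a rough number on how fast those choices multiply, here is a minimal Python sketch. It only counts the ways a square block can be recursively quad-split, assuming a 64x64 CTU and a smallest block size of 8x8, and it ignores --rect/--amp shapes, transform splits and prediction modes entirely. The function name is made up for illustration; this is not x265 code.

```python
def count_quadtree_layouts(size, min_size=8):
    """Count the distinct ways to tile a size x size block by recursive quad-splits."""
    if size <= min_size:
        return 1  # cannot split further: exactly one layout
    # Either keep the block whole (1 way), or split it into four quadrants
    # that are each laid out independently of the others.
    sub = count_quadtree_layouts(size // 2, min_size)
    return 1 + sub ** 4


for size in (16, 32, 64):
    print(f"{size}x{size}: {count_quadtree_layouts(size)} split layouts")
```

Under those assumptions the split structure alone already gives 2, 17 and 83522 layouts for 16x16, 32x32 and 64x64. A real encoder does not evaluate them all, but it shows why the search space is so much larger than a single 4x4-versus-8x8 choice per macroblock.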

A big share of x265 assembly optimization is around mode selection, and most of that simply isn't applicable to x264.

I imagine there are some cases that are algorithmically identical to x264 where x265 simply has better-optimized implementations that could be backported. Or algorithms where the x264-relevant subset could be kept and the x265-only parts stripped out.

benwaggoner · 1st December 2022, 01:34

Big picture though, there are fundamental, inescapable reasons why x265 is able to use more CPU and more threads than x264 can for a video of the same resolution. WPP threading isn't possible in x264, as H.264 doesn't have WPP. The flip side is that the greater simplicity of x264 means it's already a whole lot faster than x265. Which also means that x264 is already a lot closer to optimal performance than x265 is even today.
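
As a rough illustration of what WPP buys, here is a small Python sketch of a wavefront schedule. It assumes every CTU takes one time unit, unlimited threads, and the usual WPP dependency that a CTU waits for its left neighbour and for the CTU above and to the right. The 17 x 30 grid is simply 1920x1080 divided into 64x64 CTUs; the function name is hypothetical and none of this is taken from x265's scheduler.

```python
from collections import Counter


def wpp_schedule(rows, cols):
    """Return finish[r][c]: the time step in which CTU (r, c) completes."""
    finish = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Dependency 1: the CTU to the left in the same row.
            left = finish[r][c - 1] if c > 0 else 0
            # Dependency 2: the CTU above and to the right (clamped at the row end).
            above_right = finish[r - 1][min(c + 1, cols - 1)] if r > 0 else 0
            finish[r][c] = max(left, above_right) + 1
    return finish


rows, cols = 17, 30  # 1920x1080 in 64x64 CTUs (assumed frame size)
finish = wpp_schedule(rows, cols)
total_steps = finish[-1][-1]
peak_parallel = max(Counter(t for row in finish for t in row).values())

print("serial steps:  ", rows * cols)
print("wavefront steps:", total_steps)
print("peak CTUs in flight:", peak_parallel)
```

With these idealized numbers the frame finishes in 62 steps instead of 510, with up to 15 CTUs in flight at once inside a single frame. That within-frame row parallelism is the part H.264 has no equivalent syntax for.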

x265's improved scenecut and frame type decision algorithms would be much more likely to be straightforward to backport, as the frame types, and the reasons to pick one or the other, are much the same in both. Stuff like --mcstf and --hist-scenecut also would be quite relevant to x264.
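
To show why that kind of logic is codec-independent, here is a toy Python sketch of histogram-based scene-cut detection. It is explicitly not x265's --hist-scenecut algorithm: the luma-only histogram, the 16 bins, the 0.5 threshold and all function names are assumptions made for illustration. Frames are just flat lists of 8-bit luma samples.

```python
def luma_histogram(frame, bins=16):
    """Normalized histogram of 8-bit luma samples (frame is a flat list of ints)."""
    hist = [0] * bins
    for y in frame:
        hist[y * bins // 256] += 1
    total = len(frame)
    return [h / total for h in hist]


def histogram_distance(a, b):
    """Half the sum of absolute differences: 0.0 (identical) to 1.0 (disjoint)."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


def find_scenecuts(frames, threshold=0.5, bins=16):
    """Return indices of frames whose histogram differs sharply from the previous one."""
    if not frames:
        return []
    cuts = []
    prev = luma_histogram(frames[0], bins)
    for i in range(1, len(frames)):
        cur = luma_histogram(frames[i], bins)
        if histogram_distance(prev, cur) > threshold:
            cuts.append(i)
        prev = cur
    return cuts


# Tiny synthetic "video": two dark frames followed by two bright ones.
dark = [20, 25, 30, 22] * 16
dark2 = [21, 26, 29, 23] * 16
bright = [200, 210, 220, 205] * 16
print(find_scenecuts([dark, dark2, bright, bright]))
```

Nothing here depends on the bitstream format, only on the source frames, which is why decisions of this kind look portable between encoders.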

There may well be bigger opportunities to improve quality/compression efficiency of x264 than encoding speed in backporting from x265.

FranceBB · 5th December 2022, 16:29

In terms of speed, I would actually love it if someone wrote AVX512 assembly for the 10bit version of x264.
We already have it for the 8bit version, which is really welcome for distribution encodes (especially given that we do lots of them), but any mezzanine file / TX Master is gonna be 10bit; think about Intra Classes for Full HD and UHD files. Especially with UHD files, I feel like AVX512 would speed up XAVC encodes a lot in x264 compared to AVX2 only.

DTL · 12th February 2023, 15:49

Quote:

Originally Posted by benwaggoner

And H.264 itself just doesn't have the same opportunities for really big SIMD functions to speed things up. There are only 4x4 and 8x8 blocks,

The whole idea of SIMD is to process more data with a single stream of instructions.
Using 4x4 partitions in a 16x16 macroblock means x264 has to process 16 very small blocks (a 4x4 array) of 4x4 samples each. As far as I can see, current versions process those 16 blocks of 4x4 in 16 loop passes, scanning the 4x4 array of blocks one by one.

So the way to make it more SIMD-friendly is to process several blocks of a single macroblock in a single SIMD function, say 2, 4 or 8 blocks of 4x4 per SIMD loop pass. That would reduce the total CPU cycles needed for all 16 4x4 blocks and improve performance.
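
The arithmetic behind that can be sketched in a few lines of Python. This is only a packing estimate under stated assumptions: samples are stored in 8-bit or 16-bit containers (the latter being what 10-bit content needs), one register holds only raw samples, and no room is reserved for the wider intermediate values a real transform or SATD needs. The function names are hypothetical.

```python
def blocks_per_register(register_bits, sample_bits, block_w=4, block_h=4):
    """How many whole blocks of raw samples fit into one SIMD register."""
    block_bits = block_w * block_h * sample_bits
    return register_bits // block_bits


def loop_passes(num_blocks, per_pass):
    """Passes needed for num_blocks when per_pass blocks are handled together.

    A block larger than one register (per_pass == 0) still counts as one pass
    per block here, although it would occupy more than one register.
    """
    return -(-num_blocks // max(per_pass, 1))  # ceiling division


for name, bits in (("SSE2", 128), ("AVX2", 256), ("AVX-512", 512)):
    for sample_bits in (8, 16):
        n = blocks_per_register(bits, sample_bits)
        print(f"{name:8s} {sample_bits:2d}-bit samples: "
              f"{n} block(s) per register, {loop_passes(16, n)} passes per macroblock")
```

On this estimate a 512-bit register holds four 4x4 blocks of 8-bit samples but only two when the samples are stored as 16-bit, so the 16 blocks of a macroblock take 4 or 8 passes instead of 16.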

The current situation at the beginning of 2023, a year in the dying of this great tech civilization, is really as follows:
1. The simpler a program or algorithm is, the easier it is to program for the really complex 'internal parallelism' or 'wide vectorization' needed by the wider and wider SIMD units in today's CPUs and in the future ones still being promised. Since the complexity of MPEG encoders typically decreases with lower versions, the best candidate for AVX512 would be something like MPEG-1. Programmers typically like simple programs: they are easier to design, debug and support.

2. As the civilization is dying from degradation, the available programming resources become smaller and smaller. We have already passed the phase when it was popular among the general public to have a home PC, to understand it and to be able to write programs for it. The education of the current young generation also becomes poorer and poorer. I think zero or very few schools teach all children how to optimize computer programs for today's AVX2/AVX512 chips. We possibly will never again in this civilization have as many well-educated freeware and open-source programmers, with still good enough genome quality, as we had in the 199x..200x years. Unfortunately, hardware manufacturers could only provide SIMD at about the MMX/SSE(2) level in those decades. Those programmers are mostly lost nowadays (at least as far as I can see from open-source projects; maybe they went to very well-paid jobs).

3. The quality of MPEG-1 is poor, so when selecting which version of MPEG encoder to bring to current CPUs we definitely need to select something higher in quality.

4. As was already noted in the https://forum.doom9.org/showthread.p...35#post1982735 thread, there is a clearly visible 'saturation effect' in the quality of MPEG versions for everyday media viewing by the general public at about MPEG-4 ASP (or maybe somewhere between ASP and AVC).

5. As was noted in this thread, HEVC and later MPEG codecs are much more complex algorithmically, so they are harder to understand, to design for the AVX2/AVX512 architecture, to implement, to debug and to support.

So, as follows from 1..5, H.264 is about the most probable candidate (next to MPEG-4 ASP, i.e. the freeware and open-source (?) Xvid project) for spending some of the remaining programming resources on making it run better on current hardware (a sort of 'Make x264 Great Again'). And if it really shows some notable improvements, maybe the x265 developers will take those implementations into x265 and more complex projects.

MPEG-2 is even simpler but provides too low quality for today's general public use, so it loses the competition for the remaining programming resources. So developers need some balance between the minimum MPEG quality acceptable to the general public and the maximum encoder complexity that can be implemented. Overly complex MPEG encoders with possibly slightly better compression may get stuck as C reference implementations with very low encoding performance. Or the work unit of their SIMD parts will use less and less of the available SIMD processing capability, and the performance gain from adding more AVX to an implementation with a very small work unit will be close to invisible (as we see today when enabling AVX512 in x26*).
A fast MPEG encoder on AVX2/AVX512 and possible future chips should have somewhat limited complexity (which defines the maximum possible compression). That will improve its chances of staying alive in the future, when many experimental and still slow codecs are left in the past. The example of Xvid on today's internet shows this process. Being fast and delivering acceptable quality is the key to staying alive.

So the relative simplicity of x264's algorithms really helps it attract programmers' attention to make it a better everyday workhorse for media compression compared with x265 and later encoders. For example, there is already an old idea to try adding support for accepting motion estimation results from hardware accelerators via the current Microsoft DX12 API (independent of each hardware vendor's own API). It may also save some time in IPB compression modes.

Most MPEG codecs, from MPEG-1 up to maybe H.266 and beyond, are still based on small blocks internally, and the number of blocks in a frame to encode is much higher than the 'nominal work unit' of AVX512 and later SIMD units, so any of these MPEG encoders can be brought to AVX512 with a full load on the dispatch ports.
