Parallel encoding to speed up AV1 encoders or x265
Boulder
14th September 2023, 13:22
I recently got into looking at AV1 as a codec option as my new media player (Homatics Box R 4K Plus) supports AV1 decoding in hardware. I've experimented a little bit with the 'lavish' mod of aomenc (based on aomenc-psy) and noticed that aomenc is quite slow since it doesn't utilize multithreading too much.
There is a great tool for splitting the encode into chunks for parallel encoding called av1an, but it works only with files that can be opened with VapourSynth. As I use Avisynth for processing the videos, this was a problem.
I had wanted to test ChatGPT's capabilities of creating usable code for a while so this was a good exercise. My finding was that the thing can do some basic stuff, but anything more complex requires human work. Nevertheless, even I was able to hack together a working PoC in a couple of evenings and ChatGPT definitely helped there. You just need to be very patient and know when to write some part on your own. PyCharm's debugging tools became quite useful this time :D
If you are interested, I put the result here: https://github.com/Boulder08/chunknorris. Feel free to make any changes needed; currently it's very basic and, for example, does not have any built-in scene change detection (I'm planning on adding one). I've used StainlessS's excellent tool for creating a QP file, so it was the quickest option to use in this exercise. You can find the tool here: https://forum.doom9.org/showthread.php?t=171624, there should be a more recent version than the one in the first post somewhere in the thread. It can create a .qp.txt file for use in x264/x265, and that file works with this script as is.
I'm probably going to do some minor development as my time allows but feel free to tell me if there are any problems or things that might be useful to have.
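A minimal sketch of the chunking idea, assuming the scene changes are already available as a list of frame numbers (for example read from a QP file). The function name, the minimum chunk length and the frame numbers are illustrative only and are not taken from chunknorris.

```python
def build_chunks(scene_changes, total_frames, min_len=24):
    """Turn scene-change frame numbers into (start, end) chunks, end inclusive.

    A scene change closer than min_len frames to the previously kept cut
    is skipped, so only the last chunk can end up shorter than min_len.
    """
    cuts = [0]
    for frame in sorted(set(scene_changes)):
        if 0 < frame < total_frames and frame - cuts[-1] >= min_len:
            cuts.append(frame)
    # Each chunk ends on the frame just before the next cut.
    ends = [cut - 1 for cut in cuts[1:]] + [total_frames - 1]
    return list(zip(cuts, ends))


# Hypothetical scene changes for a 500-frame clip.
chunks = build_chunks([0, 10, 120, 130, 400], total_frames=500)
print(chunks)
```

Each chunk could then be handed to its own encoder process, and the results concatenated in order afterwards.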
Jamaika
14th September 2023, 18:14
libavm
* In encoding and decoding, AV1 allows an input image frame be partitioned into separate vertical tile columns, which can be encoded or decoded independently. This enables easy implementation of parallel encoding and decoding. The parameter for this control describes the number of tile columns (in log2 units), which has a valid range of [0, 6]:
0 = 1 tile column
1 = 2 tile columns
2 = 4 tile columns
.....
n = 2**n tile columns
* By default, the value is 0, i.e. one single column tile for entire image.
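The log2 mapping quoted above can be written out as a small helper. This is only an illustration of the arithmetic; the function name is made up and it is not part of any encoder API.

```python
def tile_columns(log2_value):
    """Convert the log2 tile-column parameter (0..6) to a column count."""
    if not 0 <= log2_value <= 6:
        raise ValueError("tile column parameter must be in [0, 6]")
    return 2 ** log2_value


for n in range(7):
    print(n, "->", tile_columns(n), "tile column(s)")
```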
Boulder
14th September 2023, 18:42
libavm
* In encoding and decoding, AV1 allows an input image frame be partitioned into separate vertical tile columns, which can be encoded or decoded independently. This enables easy implementation of parallel encoding and decoding. The parameter for this control describes the number of tile columns (in log2 units), which has a valid range of [0, 6]:
0 = 1 tile column
1 = 2 tile columns
2 = 4 tile columns
.....
n = 2**n tile columns
* By default, the value is 0, i.e. one single column tile for entire image.
Yes, there are tiles but they don't come without a price.
benwaggoner
14th September 2023, 19:54
Yes, there are tiles but they don't come without a price.
Although chunked encoding can have its own challenges, particularly as chunks get smaller. Like maintaining VBV compliance across chunk boundaries, and potentially adding extra intra frames at chunk boundaries.
Aren't some amount of tiles required for decoder compliance at higher resolutions?
Boulder
15th September 2023, 07:57
Although chunked encoding can have its own challenges, particularly as chunks get smaller. Like maintaining VBV compliance across chunk boundaries, and potentially adding extra intra frames at chunk boundaries.
My use cases fortunately don't have problems with VBV, as the results are for my personal use only and the bitrates are definitely low enough to ignore any VBV-related things. But yes, there are also caveats with this method.
Aren't some amount of tiles required for decoder compliance at higher resolutions?
That's a good question, I could not find anything based on a quick search. Then again, with higher resolutions, having more tiles (I'd expect that there's always one tile anyway) would not have the efficiency penalty that splitting a 720p or 1080p source into multiple tiles might see.
Boulder
18th September 2023, 08:21
I've done some more work on this, now you can use ffmpeg for detecting the scene changes as the first phase of the script. I'll also add an option to use a different script for the detection (so you can either just load the source, or do some downscaling to speed things up).
You can also apply a film grain table on the fly -- Film Grain Synthesis seems to be the thing about AV1 so it's very important in my opinion. I need to come up with a practical solution of creating the table based on the source so that could also be done in the same processing chain without too much manual interaction.
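For the ffmpeg-based scene change detection mentioned above, one possible approach (not necessarily what the script does) is to run ffmpeg's `select` and `showinfo` filters and read the `pts_time` values from the log. The threshold of 0.3, the file name and the sample log lines below are assumptions for illustration.

```python
import re

# One way to make ffmpeg report scene changes; 0.3 is an arbitrary example threshold.
FFMPEG_CMD = [
    "ffmpeg", "-i", "input.mkv",
    "-vf", "select='gt(scene,0.3)',showinfo",
    "-f", "null", "-",
]

PTS_TIME = re.compile(r"pts_time:\s*([0-9.]+)")


def scene_frames(showinfo_log, fps):
    """Extract scene-change frame numbers from showinfo log text."""
    frames = []
    for line in showinfo_log.splitlines():
        if "showinfo" not in line:
            continue
        match = PTS_TIME.search(line)
        if match:
            # Convert the timestamp in seconds to the nearest frame number.
            frames.append(round(float(match.group(1)) * fps))
    return frames


# Hand-written lines shaped like showinfo output, not a real log.
sample_log = (
    "[Parsed_showinfo_1 @ 0x0] n:0 pts:4000 pts_time:4 duration:40\n"
    "[Parsed_showinfo_1 @ 0x0] n:1 pts:10400 pts_time:10.4 duration:40\n"
)
print(scene_frames(sample_log, fps=25))
```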
Boulder
21st September 2023, 15:54
Another update, now you can create the Film Grain Synthesis grain table (semi-)automatically.
The next thing to do is to implement a better method for automatic scene change detection, ffmpeg is quite flaky with non-stable sources.
benwaggoner
22nd September 2023, 22:56
Also, SVT-AV1 isn't really all that great an encoder, particularly for Film Grain Synthesis. I don't know of a great free AV1 encoder, but commercial ones are getting quite good. But aomenc and SVT-AV1 aren't likely to outperform x265 much for quality @ perf. The Visionular Aurora encoder is the one I've played with the most, and I'm pleased by the compression efficiency it can deliver in reasonable encoding times.
Boulder
23rd September 2023, 08:43
I think the aomenc-lavish fork is not that bad, it has some psy tunings and psy related fixes compared to the original. Of course, AV1 encoders, like other modern encoders, tend to be biased towards minimizing bitrate, so they require a bit of tuning to keep details. In general, all encoder related denoising must be disabled and AQ mode set to 0. I don't understand what has been screwed up with AQ, but it just doesn't work. But the big thing is FGS in my opinion, it can make a huge difference.
Aomenc without any sort of chunked encoding like av1an or this script is definitely a waste of time compared to x265.
Boulder
23rd September 2023, 09:01
Most of the big stuff is done now, I just added the capability to use PySceneDetect for scene change detection.
Selur
23rd September 2023, 11:50
Out of curiosity: When using Avisynth anyway, why not use SCXvid ?
Boulder
23rd September 2023, 15:15
Out of curiosity: When using Avisynth anyway, why not use SCXvid ?
Probably because I didn't know about it :D
Maybe I'll implement it as an option as well, shouldn't be too hard now that things are as functions.
Boulder
24th September 2023, 16:55
SCXviD added now as well.
Beelzebubu
28th September 2023, 12:22
Aren't some amount of tiles required for decoder compliance at higher resolutions?
Yes, but that only comes in above 4K. Max tile area is 4096 * 2304, so 4K can in theory still be a single tile.
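A quick check of that limit, as a sketch. The maximum tile area is the figure given above; the maximum tile width of 4096 is the AV1 specification's separate per-tile width limit. The function name is made up.

```python
MAX_TILE_WIDTH = 4096          # AV1 limit on the width of one tile
MAX_TILE_AREA = 4096 * 2304    # AV1 limit on the area of one tile


def fits_single_tile(width, height):
    """Return True if a frame of this size may be coded as one tile."""
    return width <= MAX_TILE_WIDTH and width * height <= MAX_TILE_AREA


for size in [(1920, 1080), (3840, 2160), (7680, 4320)]:
    print(size, fits_single_tile(*size))
```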
Boulder
28th September 2023, 13:55
I added some tile parameters in the presets if anyone wants to use them. 1080p comes with --tile-columns=1, and presets meant for resolutions above that come with --tile-columns=1 --tile-rows=1. I didn't notice these causing any significant drop in compression efficiency, and they should help with decoding. The actual encoding speed might not be affected that much, I've been running some tests with 1080p content and four parallel encodes with 16 threads already saturate the CPU quite nicely (5950X 16c/32t).
Now I need to do some more work on the film grain synthesis part to make it work properly, then all the basic stuff is there.
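The preset choice described above could be expressed roughly like this. The two flag sets come from the post; selecting by frame width, the width thresholds and the "no tile flags below 1080p" branch are assumptions made for the example, not the script's actual logic.

```python
def tile_flags(width):
    """Pick aomenc tile flags by frame width (thresholds are assumptions)."""
    if width > 1920:
        # Presets for resolutions above 1080p.
        return ["--tile-columns=1", "--tile-rows=1"]
    if width > 1280:
        # 1080p-class sources.
        return ["--tile-columns=1"]
    # Assumed: smaller sources get no tile flags at all.
    return []


print(tile_flags(1920))
print(tile_flags(3840))
```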
Boulder
28th September 2023, 20:05
The film grain table creation was reworked and, based on my tests, it works as expected. I now suggest picking a short, representative frame range (100-200 frames will do) without scene changes for the analysis. The script will take the longest section it finds in the generated table and use it for the final encode. The grain table file format is somewhat odd, but this approach seems to work fine. Running a diff on the complete source is definitely unnecessary, you just need to find a good scene to get the data from.
All the big stuff should be done now.
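Picking the longest section from a grain table might look something like the sketch below. It assumes the text layout in which each section starts with a line `E <start> <end> ...` followed by its parameter lines; the sample content is made up and only shaped like that layout, and this is not the script's own code.

```python
def longest_grain_section(table_text):
    """Return (header_line, lines_of_longest_section) from a grain table.

    Assumes a section starts with 'E <start> <end> ...' and that every
    following line up to the next 'E' line belongs to that section.
    """
    header = None
    sections = []
    for line in table_text.splitlines():
        if line.startswith("E "):
            sections.append([line])
        elif sections:
            sections[-1].append(line)
        elif line.strip():
            header = line
    if not sections:
        raise ValueError("no grain sections found")

    def duration(section):
        fields = section[0].split()
        return int(fields[2]) - int(fields[1])

    return header, max(sections, key=duration)


# Made-up table content, only shaped like the assumed format.
sample = "\n".join([
    "filmgrn1",
    "E 0 1000 1 1 1",
    "\tp first",
    "E 1000 5000 1 2 1",
    "\tp second",
    "E 5000 5500 1 3 1",
    "\tp third",
])
header, best = longest_grain_section(sample)
print(header)
print(best[0])
```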
benwaggoner
4th October 2023, 18:51
Be warned that some early SoCs don't properly implement FGS, so you'll definitely want to test it on your intended hardware before making a lot of content using it.
And yeah, tiles>1 can help parallelism, but if you're already saturated, it won't matter. Potentially you could turn down frame threading with more tiles, which might provide better quality @ perf. Frame threading is HARD to do without some quality degradation, in any codec.
Boulder
6th October 2023, 08:25
I think at least aomenc already has frame threading disabled by default, it only does row-based multithreading unless you change the parameters.
While developing the tool and testing AV1, I've found it to be a very capable replacement for HEVC. Without film grain synthesis, it does smooth the image way too much to my liking, but with FGS on top, the results are very good indeed. Much better than x265, and no issues with red colors which tend to be a problem with x265 for some reason and FGS is basically free of charge when it comes to bitrate. Aomenc is only slightly slower than x265 when you utilize these chunked encoding methods, it would be too slow when running just a single encode. You just need to make sure you are using the lavish mod, and there is a very recent patch with luma bias parameters which improve the handling of dark areas a lot. The AV1 Discord channel has pre-built binaries available.
I've also found out that it is usually a better idea to use a luma-only grain table. A table with full chroma information tends to start oversaturating the image and the chroma noise itself can be more distracting than the more grain like appearance of luma noise. I'll probably do some development on this to apply Tweak in the grain table creation part, that way the user can choose the amount of chroma to include in the result.
benwaggoner
6th October 2023, 22:56
I think at least aomenc already has frame threading disabled by default, it only does row-based multithreading unless you change the parameters.
Yeah, frame threading is hard, and can cause unpredictable problems in a few frames here and there that require somewhat complex evaluation of per-frame metrics to predict. I always try to use a single frame thread with x265 as well.
While developing the tool and testing AV1, I've found it to be a very capable replacement for HEVC. Without film grain synthesis, it does smooth the image way too much to my liking, but with FGS on top, the results are very good indeed.
And my impression of the "smoothing" is that it is mainly a limitation of AQ modes in existing encoders, not intrinsic to AV1 as a technology. Commercial proprietary AV1 encoders are doing a lot better in that regard.
Much better than x265, and no issues with red colors which tend to be a problem with x265 for some reason
Are you doing HDR? x265 really benefits from setting --hdr10-opt and lowering the chroma QPs by at least one.
and FGS is basically free of charge when it comes to bitrate.
If you have a good implementation on the encoding side and the decoding side. I'm not aware of anyone using FGS at scale by default. The SVT-AV1 implementation in particular can result in some really distracting grain patterns instead of a true random one.
The whole "remove grain while parameterizing it" domain has been an active R&D area for a few decades now, and it's hard to perfect across the many edge cases in real-world content. ML is certainly helping a lot, quickly, but it's not a solved problem by any means.
But the biggest challenge is inconsistent support in early HW implementations.
That said, grain and noise is the biggest bit suck in encoding, and doesn't get much better as we improve efficiency of encoding the actual signal. Good grain removal and parametrization followed by good FGS would help cut bitrate of grainy titles by >50% in some cases. And since it is entirely out-of-loop of the AV1 decoder itself, the metadata and synthesis could support any codec. Getting a "good enough" end-to-end FGS chain would be revolutionary.
Aomenc is only slightly slower than x265 when you utilize these chunked encoding methods, it would be too slow when running just a single encode. You just need to make sure you are using the lavish mod, and there is a very recent patch with luma bias parameters which improve the handling of dark areas a lot. The AV1 Discord channel has pre-built binaries available.
Does aomenc have a lapped rate control model like x265 for split-and-stitch? That allows for frames before and after the chunk to be analyzed and discarded to get a more accurate VBV state to reduce quality fluctuations at chunk boundaries.
I've also found out that it is usually a better idea to use a luma-only grain table. A table with full chroma information tends to start oversaturating the image and the chroma noise itself can be more distracting than the more grain like appearance of luma noise. I'll probably do some development on this to apply Tweak in the grain table creation part, that way the user can choose the amount of chroma to include in the result.
Good tip. While color film does have some chromatic grain, it's generally not super saturated. Is there a way to reduce grain saturation without eliminating it outright?
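The lapped idea for split-and-stitch mentioned above can be illustrated with plain frame arithmetic: each chunk is processed with some extra frames on both sides, and only the middle part is kept. This is a generic sketch with made-up names and a made-up lap of 30 frames; it does not reproduce x265's actual rate control.

```python
def lapped_range(start, end, total_frames, lap=30):
    """Return (encode_start, encode_end, skip_head, skip_tail) for one chunk.

    start and end are the first and last frame to keep (inclusive). The
    encoder would process encode_start..encode_end, and the extra frames
    at both ends would be discarded afterwards.
    """
    encode_start = max(0, start - lap)
    encode_end = min(total_frames - 1, end + lap)
    return encode_start, encode_end, start - encode_start, encode_end - end


print(lapped_range(100, 199, total_frames=1000))
print(lapped_range(0, 99, total_frames=1000))
```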
Boulder
7th October 2023, 09:18
And my impression of the "smoothing" is that it is mainly a limitation of AQ modes in existing encoders, not intrinsic to AV1 as a technology. Commercial proprietary AV1 encoders are doing a lot better in that regard.
With aomenc, it's at least currently recommended to switch the regular AQ-mode off as it just blurs the image even more. --deltaq-mode (1 for SDR, 5 for HDR) and --enable-chroma-deltaq=1 are what I recommend.
Are you doing HDR? x265 really benefits from setting --hdr10-opt and lowering the chroma QPs by at least one.
Both SDR and HDR. This is an old issue with x265, it just doesn't handle the chroma planes that well. --cbqpoffs -3 --crqpoffs -6 are my go-to settings in x265 but actually the offsets don't fix the issue. I've just left them there since the increase in bitrate is rather small and I think x264 uses -3 for both by default so I've kind of followed that path.
--hdr10-opt only adjusts QP per average luma level, but yes, it's always enabled with HDR sources.
If you have a good implementation on the encoding side and the decoding side. I'm not aware of anyone using FGS at scale by default. The SVT-AV1 implementation in particular can result in some really distracting grain patterns instead of a true random one.
I haven't noticed any ill effects from using the grain table produced by grav1synth. Then again, I'm using a fork of aomenc, which is the reference encoder. I'm not sure whether the encoder does anything with the table when you feed it as a parameter, or whether it just muxes it into the container.
The whole "remove grain while parameterizing it" domain has been an active R&D area for a few decades now, and it's hard to perfect across the many edge cases in real-world content. ML is certainly helping a lot, quickly, but it's not a solved problem by any means.
But the biggest challenge is inconsistent support in early HW implementations.
That's why I'm only thinking about this in a "selfish" way and creating solutions which work on my devices :D Well, the box I have (and its derivatives like the Dune or Nokia) is quite a common one because of its capability of outputting DoVi in Kodi, so maybe someone else will benefit from it too. And playback on the PC works fine.
That said, grain and noise is the biggest bit suck in encoding, and doesn't get much better as we improve efficiency of encoding the actual signal. Good grain removal and parametrization followed by good FGS would help cut bitrate of grainy titles by >50% in some cases. And since it is entirely out-of-loop of the AV1 decoder itself, the metadata and synthesis could support any codec. Getting a "good enough" end-to-end FGS chain would be revolutionary.
I think the grav1synth approach is quite smart (and simple) but requires some manual labour. You just need to diff the final encoding result against the original one and it comes up with a grain table approximating the difference. The grain table file has start and end timestamps for each set of grain so in theory, you could also adjust it in case there are sections which are full on CGI and some with a lot of film grain.
I've just taken a shortcut and ask the user to either provide a table or analyze a short range of the source to create a table with just one section of grain. I've also collected some tables and put them on GitHub for starters and will add more along the way. After all, it's not the title that matters but the appearance of the grain.
Does aomenc have a lapped rate control model like x265 for split-and-stitch? That allows for frames before and after the chunk to be analyzed and discarded to get a more accurate VBV state to reduce quality fluctuations at chunk boundaries.
I think not. At least I don't see any advanced things like these there.
Good tip. While color film does have some chromatic grain, it's generally not super saturated. Is there a way to reduce grain saturation without eliminating it outright?
The only way is to either alter the grain table (it's just a text file), or alter the video while running the analysis. So if you feed the source without chroma information to the analysis, there will be no chroma grain in the table. That's why I was thinking of allowing the user to choose a saturation value between 0-1 to either tone down the chroma noise substantially or to remove it completely.
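As an illustration of the saturation idea, scaling a chroma sample towards its neutral value with a factor between 0 and 1 could look like this. It is plain per-sample arithmetic with a made-up function name, not the Avisynth processing the script uses.

```python
def scale_chroma(value, saturation, bit_depth=8):
    """Scale one chroma sample around the neutral (grey) value.

    saturation=1.0 leaves the sample untouched, 0.0 removes all chroma.
    """
    neutral = 1 << (bit_depth - 1)       # 128 for 8-bit video
    peak = (1 << bit_depth) - 1          # 255 for 8-bit video
    scaled = neutral + (value - neutral) * saturation
    return max(0, min(peak, round(scaled)))


print(scale_chroma(200, 1.0))
print(scale_chroma(200, 0.5))
print(scale_chroma(200, 0.0))
```

With a factor of 0 every chroma sample becomes neutral, so an analysis of such a clip would find no chroma grain.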
Here's a short sample of how AV1 with FGS (@ q 12) and HEVC (@ CRF 18) compare in a rather bright, static scene. The bitrate difference between the two is huge, but for the entire movie, the AV1 encode is around 7,5% bigger. I didn't re-encode the credits yet to lower the bitrate, they are way too big in the AV1 encode. I'm also using settings which give quite a lot of bits to dark areas compared to the default. In general, HEVC is just much blurrier since FGS adds to sharpness.
AV1: https://drive.google.com/file/d/1cB1G9uZMnIuTRNivrKI5I-Bn3rVuF7nn/view?usp=drive_link
HEVC: https://drive.google.com/file/d/1mXQ0_-ASYpwOrZtd_Fafw6vONxKDWupf/view?usp=drive_link
Lossless: https://drive.google.com/file/d/1qOaH-MImN7LH-fwfR1X7_qHZj9y_G3Y2/view?usp=drive_link
EDIT: got around to re-encoding the credits (a little less than 40000 frames :eek:), and now the complete encode is a little smaller than the HEVC counterpart. Very happy with this, but I need to continue testing different sources.
Boulder
8th October 2023, 14:48
I've now pushed the latest version to the repo. Quite a few small tweaks there, namely the ability to tweak the saturation of the FGS analysis clip and doing tone mapping of HDR sources (using DGHDRtoSDR) and adjustable downscaling in the scene change detection phase for faster processing and increased accuracy. The progress bar also works better now and gives a more accurate estimate of remaining time.
I've put several grain table files in the repo as well if anyone finds them useful.
Blue_MiSfit
9th October 2023, 00:31
Hmm... comparing those encodes against the reference in the free version of MSU VQMT I see a slight hue shift towards magenta in the AV1 version but not in the HEVC version... Not sure what could be the cause of that.
Boulder
9th October 2023, 05:14
Hmm... comparing those encodes against the reference in the free version of MSU VQMT I see a slight hue shift towards magenta in the AV1 version but not in the HEVC version... Not sure what could be the cause of that.
Yes, I have noticed the same thing myself. I've been trying to find out the reason and I think I've caught it.. if you feed 8-bit input to aomenc (and encode at 10 bits as is recommended), the result gets this shift. If you convert the source to 10 bits before inputting it, the colors look pretty much the same in the encode as in the source.
I need to test this with different cases and try to replicate it with the vanilla aomenc so I can see if the problem's there or in the lavish fork.
Boulder
9th October 2023, 20:13
The hue shift issue occurs when using FGS and 8-bit input, works fine if the source is converted to 10 bits or an FGS table is not used. I didn't check if it applies to 8-bit encodes, as there's really no use doing those.
I've opened a case in the aomedia issue tracker, let's see if anything happens. For the time being, I need to add the conversion step to my tool. I still think that a B/W film grain table is better than one with chroma data included :p
benwaggoner
11th October 2023, 22:40
The hue shift issue occurs when using FGS and 8-bit input, works fine if the source is converted to 10 bits or an FGS table is not used. I didn't check if it applies to 8-bit encodes, as there's really no use doing those.
I've opened a case in the aomedia issue tracker, let's see if anything happens. For the time being, I need to add the conversion step to my tool. I still think that a B/W film grain table is better than one with chroma data included :p
Hmm, perhaps they're doing some sort of 4x scale factor for chroma instead of using bicubic interpolation or something. A straight multiple would round down chroma value a bit. Inverse dithering with interpolation is a largely untapped approach. Basically it should be like a bicubic, but in depth instead of area.
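To make the "straight multiple" point concrete, here is the arithmetic for two simple ways of expanding an 8-bit sample to 10 bits. This only shows how the numbers differ; it says nothing about what aomenc actually does internally, and the function names are made up.

```python
def shift_8_to_10(value):
    """Expand by a straight multiple of 4 (left shift by two bits)."""
    return value << 2


def rescale_8_to_10(value):
    """Expand so that the 8-bit peak (255) maps to the 10-bit peak (1023)."""
    return round(value * 1023 / 255)


for v in (16, 128, 235, 255):
    print(v, shift_8_to_10(v), rescale_8_to_10(v))
```

The multiple by 4 never reaches the 10-bit peak (255 becomes 1020 instead of 1023), while it does map the limited-range levels 16 and 235 exactly to 64 and 940.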
Boulder
28th November 2023, 07:51
I've developed the tool further along the way and the latest change was to add the support for encoding end credits using different Q and --cpu-used.
I think the next bigger step will be adding support for SVT-AV1, there is a huge performance boost incoming and also very interesting development regarding aq-mode 2 and low contrast sources which are often a problem for encoders.
Boulder
5th March 2024, 11:32
A lot more development has been done lately, now the tool supports svt-av1, rav1e and x265. In addition to that, you can create Dolby Vision compatible encodes using svt-av1-psy (https://github.com/gianni-rosato/svt-av1-psy).
More things coming soon, I've been moving the encoder parameters to a preset file and should be able to push that change to the repo during this week.
Boulder
10th March 2024, 19:49
I was finally able to finish the changes regarding moving settings to a separate preset file.
DoVi for x265 is still pending though, I think I'll get that working next.
Boulder
11th March 2024, 20:40
DoVi for x265 works now, so I think it's time to let the script rest a while unless something interesting pops up. The forthcoming SVT-AV1 v2.0 will include a huge quality boost when the variance boost feature from the psy mod is included in mainline, so definitely check it out even if you don't use this script :)
benwaggoner
12th March 2024, 01:40
DoVi for x265 works now, so I think it's time to let the script rest a while unless something interesting pops up. The forthcoming SVT-AV1 v2.0 will include a huge quality boost when the variance boost feature from the psy mod is included in mainline, so definitely check it out even if you don't use this script :)
In what way is DoVi not supported in x265? Optimization for Y'CtCp? I know x265 certainly has been used for commercial DoVi Profile 5 encoding. And it has full-fledged support for Profile 8.1 (which is just HDR-10 plus metadata from a sidecar file).
Boulder
12th March 2024, 05:48
In what way is DoVi not supported in x265? Optimization for Y'CtCp? I know x265 certainly has been used for commercial DoVi Profile 5 encoding. And it has full-fledged support for Profile 8.1 (which is just HDR-10 plus metadata from a sidecar file).
You can input an RPU file and create DoVi Profile 8.x encodes so yes, dynamic metadata like HDR10+. SVT-AV1(-psy) supports Profile 10 the same way, thanks to quietvoid's patch. As far as I know, there are no internal optimizations other than what PQ sources may have in all cases.
What the script simply does is that it splits the RPU based on the chunks so each chunk to encode gets proper metadata. mkvmerge must be used to concatenate the chunks as it does not remove the data while muxing.
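The concatenation step with mkvmerge can be sketched as follows. mkvmerge appends files when they are joined with `+` on the command line; the helper name and the file names here are made up, and the script itself may build its command differently.

```python
def mkvmerge_concat_cmd(chunk_files, output_file):
    """Build an mkvmerge command that appends the chunks in order."""
    if not chunk_files:
        raise ValueError("need at least one chunk file")
    cmd = ["mkvmerge", "-o", output_file, chunk_files[0]]
    for chunk in chunk_files[1:]:
        # A standalone '+' tells mkvmerge to append the next file.
        cmd += ["+", chunk]
    return cmd


cmd = mkvmerge_concat_cmd(["chunk_0000.mkv", "chunk_0001.mkv", "chunk_0002.mkv"], "final.mkv")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` if mkvmerge is on the PATH.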
BlueSwordM
15th June 2024, 19:07
You can input an RPU file and create DoVi Profile 8.x encodes so yes, dynamic metadata like HDR10+. SVT-AV1(-psy) supports Profile 10 the same way, thanks to quietvoid's patch. As far as I know, there are no internal optimizations other than what PQ sources may have in all cases.
What the script simply does is that it splits the RPU based on the chunks so each chunk to encode gets proper metadata. mkvmerge must be used to concatenate the chunks as it does not remove the data while muxing.
HDR10+ should also work with svt-av1-psy now, but I haven't tested it on Windows, only Linux.
Boulder
25th July 2024, 14:33
Some more development done based on the SSIMULACRA2-based auto-boost feature by trixoniisama (https://github.com/trixoniisama/auto-boost-algorithm). Works on SVT-AV1 and x265.
LigH
3rd October 2024, 22:49
The media-autobuild suite now supports building Av1an (https://github.com/master-of-zen/Av1an). But there seems to be an issue, it produces no console output for me.
Z2697
5th October 2024, 18:03
I don't really understand why av1an chose Rust though. A scripting language is enough and maybe more suitable. Considering its VapourSynth requirement, Python would be an almost perfect choice.
Z2697
9th October 2024, 15:57
Here's my weird approach. The main goal is to not wait for the scene change detection to complete but to start encoding "on the fly". For quite a long time, and for the foreseeable future before x266 is ready, I have almost exclusively used x265 for any "serious business" encoding (i.e. probably using some heavy VapourSynth script) and AV1 for some simple transcoding, so speed is very important.
https://github.com/Mr-Z-2697/ideal-chainsaw/blob/main/segsaio.py
Maybe you can find it helpful. Not necessarily the "on the fly" part, but the scene change detection should be much faster, and perhaps better.
The mvtools method used by default runs at 600+ FPS on my Ryzen 7950X.
Alternatively the deep learning method (https://github.com/Mr-Z-2697/z-vsPyScripts/blob/main/XcQ/scene_detect.py) can be used, the speed of the current default model and execution provider (DirectML) is 400+ FPS on an RTX 4090 (not fully utilized, only draws ~150W).
If your graphics card is reasonably new and has free resources during encoding, offloading scene change detection to it can free up some CPU resources for the encoder.
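The "on the fly" idea, where encoding starts before scene detection has finished, can be sketched as a producer and a consumer connected by a queue. This is a generic illustration with stand-in functions and does not reproduce how segsaio.py is implemented.

```python
import queue
import threading


def detect_scenes(cuts, total_frames, out_queue):
    """Stand-in detector: emits each chunk as soon as its end is known."""
    start = 0
    for cut in cuts:
        out_queue.put((start, cut - 1))
        start = cut
    out_queue.put((start, total_frames - 1))
    out_queue.put(None)  # sentinel: no more chunks


def encode_worker(in_queue, done):
    """Stand-in encoder: takes chunks from the queue until the sentinel."""
    while True:
        chunk = in_queue.get()
        if chunk is None:
            break
        done.append(chunk)  # a real worker would launch the encoder here


def run(cuts, total_frames):
    chunk_queue = queue.Queue()
    done = []
    worker = threading.Thread(target=encode_worker, args=(chunk_queue, done))
    worker.start()                      # encoding side is ready immediately
    detect_scenes(cuts, total_frames, chunk_queue)
    worker.join()
    return done


print(run([100, 250], total_frames=300))
```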