x264 development [Archive] - Page 43


FLX90

16th January 2019, 21:20

I'm currently 2-pass transcoding a Blu-ray and it takes about 26 hours on my 5th-gen i7.
CPU load: 100 %

Is it bad for the transcode if I do some web surfing during the process?
Sometimes my browser freezes for less than a second ...
Would that harm my transcodes?

Groucho2004

16th January 2019, 21:39

Is it bad for the transcode if I do some web surfing during the process?
Sometimes my browser freezes for less than a second ...
Would that harm my transcodes?
No.
If you are planning to use your computer while encoding you should even lower the priority of the process to low/idle. This way your browsing (or whatever else you're doing) will not 'freeze' and the encoder only gets the CPU cycles that are available. Of course that only works when you have sufficient memory installed.

FLX90

16th January 2019, 21:52

Sounds great.
Thank you for your answer.

I asked because more than 10 years ago, I transcoded my then library with iTunes to mp3. I was gaming and doing other CPU intense things during the transcoding process.

After that, I noticed for 10 % of my 10,000 mp3s an audible click noise for less than a millisecond.
Thought that was because of the high CPU load.

Groucho2004

16th January 2019, 22:15

I asked because more than 10 years ago, I transcoded my then library with iTunes to mp3. I was gaming and doing other CPU intense things during the transcoding process.

After that, I noticed for 10 % of my 10,000 mp3s an audible click noise for less than a millisecond.
Thought that was because of the high CPU load.
No idea, that should not happen. Does iTunes by any chance use the GPU for transcoding?

FLX90

17th January 2019, 07:15

I really don’t know ...

LigH

17th January 2019, 08:23

I would rather suspect misinterpreted ID3 tags as source of distortions.

Back to x264?

FLX90

17th January 2019, 08:45

Yes, that could possibly be.
Thanks for clarification.

sneaker_ger

21st January 2019, 14:23

qcomp 0.8 is part of --tune grain so it's not anything insane. We usually don't recommend users to fiddle with those parameters because they don't really understand them and are better off simply using preset+tune than choosing "wrong" values.

From what I understand of the lower values, it distributes more bitrate to more visible areas and uses less bitrate in less visible areas. But when I did some video comparisons, the lower qcomp appeared to take freely from a bunch of areas: parts of the fingers, hair or background were missing. I guess the bitrate for those areas is put elsewhere? I could maybe see the benefit of this in constant rate factor, but it doesn't seem to be as needed in two-pass.
Yes, it's about redistributing bits. With higher qcomp the redistribution is less aggressive. This redistribution affects 2pass and crf in the same way.
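A simplified way to picture this, assuming a first-order model in which the quantizer scale q_i of frame i follows its (blurred) complexity c_i raised to the power 1 - qcomp, and the bits b_i fall roughly inversely with the quantizer scale. Both are simplifications for illustration, not a description of the full rate control:

```latex
q_i \propto c_i^{\,1-\mathrm{qcomp}}, \qquad
b_i \approx \frac{c_i}{q_i} \propto c_i^{\,\mathrm{qcomp}}
```

Under this model, qcomp = 1 keeps the quantizer flat so bits simply follow complexity, while qcomp = 0 pushes every frame toward the same size. The default 0.6 and the 0.8 of --tune grain sit in between, with 0.8 moving fewer bits away from complex frames.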

benwaggoner

21st January 2019, 19:47

No idea, that should not happen. Does iTunes by any chance use the GPU for transcoding?
There definitely were not GPU accelerated MP3 encoders in iTunes >10 years ago!

benwaggoner

21st January 2019, 19:49

Emulgator

22nd January 2019, 10:58

Maybe interesting:
https://forum.doom9.org/showthread.php?t=175723
Never mind, I typed into the wrong thread here, too many windows open...

dev84

9th February 2019, 22:14

Why is there no official 10-bit binary of r2935?

sneaker_ger

9th February 2019, 22:27

8 bit and 10 bit are now combined into a single binary. Use --output-depth 10

dev84

9th February 2019, 22:41

:( I'm using StaxRIP 1.7 but it doesn't work with "--output-bitdepth 10"; I get "failed with exit code: -1 (0xFFFFFFFF)"
:(

sneaker_ger

9th February 2019, 23:06

Sorry, --output-depth, not --output-bitdepth

dev84

10th February 2019, 00:05

Thanks

FranceBB

10th February 2019, 13:24

8 bit and 10 bit are now combined into a single binary.

Oh, just like x265. Nice.

utack

15th February 2019, 02:40

Does anyone have insight into how x264 uses the NEON instruction set?
I just upgraded from a mobile device with 8 symmetric Cortex-A53 cores to one with 4 slower A53 + 4 faster A72 cores that has more or less the same multicore performance for a majority of tasks.
However, I noticed that x264 got more than a 2x speedup on this device.
Does the Cortex-A72 come with some special vector magic that it uses?

Blue_MiSfit

19th February 2019, 22:09

IIRC there is a lot of hand-tuned NEON assembly in x264 :)

Boulder

6th March 2019, 19:32

I've been experimenting with going back to x264 for some of my BD re-encodes where the source is rather soft or noisy/grainy. I'm having a hard time figuring out when one would need to tweak psy-trellis. I understand that psy-rd is mainly responsible for keeping the grain, or at least fooling the viewer into believing that there is detail there, but I don't know what psy-trellis adjusts. Can anyone explain any use cases?

jpsdr

28th March 2019, 21:00

Unless I've missed it in the code, it seems that there is no check on the value of --output-depth; you can put whatever you want! I didn't see any check on i_bitdepth, and I also didn't see anything in validate_parameters in the encoder.c file...
Don't you think you should add a check for allowed values? And the help could maybe be more specific, instead of "just int value" :confused:

MasterNobody

29th March 2019, 00:09

jpsdr

29th March 2019, 11:15

Thanks, i had missed it, it was odd that i didn't find any check, it makes more sense indeed like this... :D

LigH

18th January 2022, 07:26

benwaggoner

22nd January 2022, 01:22

There is a (quite demanding) thread in the VideoHelp forum (https://forum.videohelp.com/threads/404454-Where-is-documentation-for-the-x264-encoder) asking for a comprehensive official documentation of x264, similar to ReadTheDocs for x265. Does any exist which I missed? The VideoLAN homepage of x264 seems not to mention any, and the docs directory in the repo is hardly worth mentioning. So I guess the best source of knowledge appears to be the fullhelp combined with searching back in the developer mailinglist for discussions about every parameter...
I am pretty confident such a thing doesn't exist, although it should. x265.readthedocs.io is, by a huge margin, the best documentation for an encoder that's existed in the last 20 years. The old Terran Interactive manuals around 1995-2000 were the closest, although there were a lot fewer parameters to document.

FranceBB

28th February 2022, 14:31

Ok, so, I have a task for you, guys: optimizing the quality of a command line x264 encode, but this time it is for professional use, so we have some constraints.

This is the current Command Line:

x264-10b.exe "\\mibcisilonsc\avisynth\Scambio\FILM\4CS00091.avs" --preset medium --profile high422 --level 5.2 --keyint 1 --no-cabac --slices 8 --bitrate 500000 --vbv-maxrate 500000 --vbv-bufsize 500000 --deblock -1:-1 --overscan show --colormatrix bt2020nc --range tv --log-level info --thread-input --transfer arib-std-b67 --colorprim bt2020 --videoformat component --nal-hrd cbr --aud --output-csp i422 --output-depth 10 --output "I:\Scambio\FILM\raw_video.h264"

What you see in the command line so far is mandatory; the constraints are:

- Profile High 422 10bit
- Level 5.2
- keyint 1 (i.e All Intra)
- no-cabac
- slices 8
- 500 Mbit/s constant bitrate
- aud (access unit delimiter NAL at the start of each slice access unit)

Input files are UHD HDR PQ 12bit 4:4:4 files in Apple ProRes, 23.976p at 1502 Mbit/s, 25p at 1600 Mbit/s, 29.970p at 1990 Mbit/s, 50p at 3318 Mbit/s or 60p at 3981 Mbit/s. They are indexed, brought to 16bit, frame-rate converted to 50p in all cases, LUT converted to HLG and downscaled in chroma to 4:2:2 with Avisynth.
The resulting 16bit 4:2:2 UHD HDR HLG stream is then delivered to x264 and dithered down by x264 itself to 10bit planar with the Sierra-2-4A error diffusion and encoded as above.
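To put the 500 Mbit/s constraint in perspective, here is a small arithmetic sketch. It assumes that "UHD" means 3840x2160 (the post does not state the resolution) and the helper names are made up for illustration:

```c
#include <stdint.h>

/* Bits available per frame at a constant bitrate (in bits per second). */
int64_t bits_per_frame(int64_t bitrate_bps, int64_t fps)
{
    return bitrate_bps / fps;
}

/* Raw size in bits of one 4:2:2 frame. With Cb and Cr at half horizontal
 * resolution, every pixel carries on average two samples (one luma plus
 * one chroma). */
int64_t raw_422_frame_bits(int64_t width, int64_t height, int64_t bit_depth)
{
    return width * height * 2 * bit_depth;
}
```

At 50p this gives 10,000,000 bits (1.25 MB) per intra frame against 165,888,000 bits for a raw 10-bit 4:2:2 frame at the assumed resolution, so roughly a 16.6:1 compression ratio.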

A few questions for you:

1) Would it make sense to optimize x264 for quality and use a slower preset like --preset veryslow at such a high bitrate?

2) Would it make sense to perform a two pass encode given that it's CBR and All Intra and at such a high bitrate?

3) Do you think it would make sense to mess with --aq-mode and such at such a high bitrate?

Keep in mind that the end user will never ever see those files. Those files are sent to the playout, in which a hardware playback port plays them and delivers the signal through an SDI cable. That signal is then re-encoded live by a hardware H.265 encoder which resizes the chroma and encodes the final UHD 4:2:0 25Mbit/s 10bit satellite feed that the user receives live in .ts.

rwill

28th February 2022, 18:10

For the use case 500Mbit/s is not high, just average. If you care about the quality of your intermediate file use high quality settings.

The H.265 hardware encoder at the end of the chain will botch the quality anyway though.

I don't think the x264 RC uses multipass statistics effectively for CBR btw.

Emulgator

1st March 2022, 21:26

1. Although I always use it myself, I guess veryslow is not needed here.
2. No 2-pass; I don't see a Blu-ray-style restriction on bitrate distribution as the main problem here.
3. I guess not.
I guess you want to make sure that enough bits are spent in any case.
What I did in such cases was simply to limit the qp decisions and test that on the most complex scene.
Release any bitrate restrictions, then decrease qpmax until you see it driving the bitrate up.
Maybe you will see that going below --qpmax 30 definitely forces the bitrate upwards.
Then step back, meaning increase qpmax again by a few ticks (--qpmax 36?)
until you feel safe that there is enough headroom.
Now you shouldn't get overly low rate decisions anymore, so hopefully you are safe from harm.

stax76

2nd March 2022, 08:17

There is this:

http://www.chaneru.com/Roku/HLS/X264_Settings.htm

Used in staxrip for the context (right-click) help.

DTL

12th February 2023, 13:06

Moved from the 'getting latest' thread: about getting better performance from MPEG encoders (including all x26x projects) on 'big' and 'large' architectures like AVX2 and AVX512 by redesigning the program from single-block processing to multi-block processing.
An I-frames-only example:
I got the C sources from https://github.com/ShiftMediaProject/x264/tree/master/encoder and built them with Visual Studio 2017. Other versions (including jpsdr's) do not seem to be compatible with MSVS.
After some profiling I see that significant time is spent in the intra_satd_x9_4x4() function, and I tracked its call stack up to the loop that walks through all macroblocks:
It is the loop in the encoder.c file: https://github.com/ShiftMediaProject/x264/blob/be0cb5426d4b9fecaf2e4b058466322a43f17241/encoder/encoder.c#L2812 . It walks through all macroblocks of the frame one by one, by rows and columns (advancing the MB number with mb_xy = i_mb_x + i_mb_y * h->mb.i_mb_width; where i_mb_x is the current x position in the MB array and i_mb_y is the current y position).
So the practical 'workunit' size for each pass of this 'macro loop' is only 1 macroblock. If the macroblock is 16x16 (say 10-bit processing with 16-bit values per sample), the CPU core executing this thread has a workunit of only 16x16x(2 bytes per sample) = 512 bytes. That is too little for a CPU capable of processing workunits of up to kilobytes.
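The indexing and the workunit arithmetic above can be written out as a minimal sketch. The function names are hypothetical and only the luma plane is counted, as in the 512-byte figure:

```c
#include <stddef.h>

/* Raster-scan macroblock index, mirroring
 * mb_xy = i_mb_x + i_mb_y * h->mb.i_mb_width from the loop above. */
int mb_index(int mb_x, int mb_y, int mb_width)
{
    return mb_x + mb_y * mb_width;
}

/* Bytes of sample data in one square macroblock of a single plane. */
size_t mb_workunit_bytes(int mb_size, int bytes_per_sample)
{
    return (size_t)mb_size * (size_t)mb_size * (size_t)bytes_per_sample;
}
```

For a 16x16 macroblock with 2 bytes per sample this returns 512, and with 1 byte per sample (8-bit builds) it returns 256.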

So to make this part of the encoder faster we need to redesign this 'macro loop' and all the functions called downstream to process several macroblocks in a single pass. But this is not a very easy task, and if the macroblocks in a 'group pass' are not all processed in the same way it needs some more branching (like a fallback to single-macroblock processing for a macroblock whose processing differs from the others).

The final 'macro loop' advancing at https://github.com/ShiftMediaProject/x264/blob/be0cb5426d4b9fecaf2e4b058466322a43f17241/encoder/encoder.c#L3068 will then not be the simple +1 advancing
i_mb_x++; (for progressive encode mode)
but
i_mb_x+=num_macroblocks_per_pass;

But redesigning the program for this simple 'internal parallelism to use SIMD' may take a lot of time.
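A minimal sketch of what such a grouped row walk could look like, with a single-macroblock fallback for the tail of the row. This is purely illustrative: the struct and function are hypothetical and not part of x264, and the real loop also has to handle per-macroblock decisions that differ inside a group.

```c
/* Counts of passes taken over one macroblock row (hypothetical type). */
typedef struct {
    int grouped_passes; /* passes that handled mbs_per_pass macroblocks */
    int single_passes;  /* fallback passes that handled one macroblock */
} row_walk_stats;

/* Walk one row of mb_width macroblocks in groups of mbs_per_pass,
 * then finish the remainder one macroblock at a time. */
row_walk_stats walk_row_grouped(int mb_width, int mbs_per_pass)
{
    row_walk_stats s = { 0, 0 };
    int i_mb_x = 0;

    if (mbs_per_pass < 1)
        mbs_per_pass = 1; /* guard against a non-advancing loop */

    while (i_mb_x + mbs_per_pass <= mb_width) {
        /* a multi-macroblock analysis/encode call would go here */
        s.grouped_passes++;
        i_mb_x += mbs_per_pass;
    }
    while (i_mb_x < mb_width) {
        /* single-macroblock fallback for the tail of the row */
        s.single_passes++;
        i_mb_x++;
    }
    return s;
}
```

A 3840-pixel-wide frame has 240 macroblocks per row, so with 4 macroblocks per pass the row takes 60 grouped passes and no fallback passes.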

An example closer to something fixable in practice: when processing a 16x16 macroblock with partitions down to 4x4, it splits the macroblock into 4x4 blocks of 4x4 samples and checks some predictors for each 4x4 block. This is the much smaller loop at https://github.com/ShiftMediaProject/x264/blob/be0cb5426d4b9fecaf2e4b058466322a43f17241/encoder/analyse.c#L924

for( ; *predict_mode >= 0; predict_mode++ )
{
    int i_satd;
    int i_mode = *predict_mode;

    if( h->mb.b_lossless )
        x264_predict_lossless_4x4( h, p_dst_by, 0, idx, i_mode );
    else
        h->predict_4x4[i_mode]( p_dst_by );

    i_satd = h->pixf.mbcmp[PIXEL_4x4]( p_src_by, FENC_STRIDE, p_dst_by, FDEC_STRIDE );
    if( i_pred_mode == x264_mb_pred_mode4x4_fix(i_mode) )
    {
        i_satd -= lambda * 3;
        if( i_satd <= 0 )
        {
            i_best = i_satd;
            a->i_predict4x4[idx] = i_mode;
            break;
        }
    }

    COPY2_IF_LT( i_best, i_satd, a->i_predict4x4[idx], i_mode );
}

where h->pixf.mbcmp[PIXEL_4x4]( p_src_by, FENC_STRIDE, p_dst_by, FDEC_STRIDE ); is a call to the single-block SATD (or SAD, depending on options?) of two 4x4 blocks (typically an assembly function for each SIMD family). The number of loop iterations is typically the number of non-negative members in the vector pointed to by predict_mode (about 3 or 4).

When running on very old architectures like SSE(2), the two 4x4 16-bit blocks for the SATD calculation take 64 bytes to load. On x86 SSE2, with only 8 128-bit (16-byte) SIMD registers and a register file of 128 bytes in total, that takes about half of the register file, and there would be close to no space left for intermediate values if we tried to load 2 pairs of blocks. So this implementation is 'internally limited' to the SSE2 32-bit build target architecture. It is optimal for speed on that architecture because at each iteration it can break on the condition i_satd <= 0, skip some predictors and save some time.

On larger architectures it is possible to compute more SATDs of 4x4 16-bit block pairs in a single SIMD pass. So this program block may be rearranged to compute more SATDs per pass using a new multi-block SATD SIMD function, and the loop may be changed to process groups of predictors (typically in a single pass when using up to 4 predictors). After the single SIMD function call, the vector of SATD values is analysed and the minimal i_satd selected (this could also be attempted with the SIMD horizontal-minimum intrinsic _mm_minpos_epu16() from the SSE 4.1 set if the SATD is not greater than a 16-bit unsigned value; unfortunately there is no 32-bit version of this nice instruction even in the AVX512 set). But this new program block needs to be guarded by an 'architecture' if() block, like only for AVX2 and x64 or larger, and it makes the total program text bigger and harder to understand (and debug and support and so on).
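To make the proposed control flow concrete, here is a portable scalar C sketch of "score all candidate predictions in one call, then pick the minimum". It is only a stand-in for the multi-block SIMD function: all function names are hypothetical, the SATD is a plain Hadamard-based version with the sum halved, and the lambda bias and early break of the original loop are deliberately left out.

```c
#include <stdint.h>
#include <stdlib.h>

/* 4x4 SATD: sum of absolute 2D Hadamard coefficients of the difference
 * block, halved. Scalar reference code, not x264's implementation. */
int satd_4x4(const uint16_t *src, int src_stride,
             const uint16_t *pred, int pred_stride)
{
    int t[4][4];
    int sum = 0;

    for (int y = 0; y < 4; y++) {
        /* horizontal 4-point Hadamard on one row of differences */
        int d0 = src[y * src_stride + 0] - pred[y * pred_stride + 0];
        int d1 = src[y * src_stride + 1] - pred[y * pred_stride + 1];
        int d2 = src[y * src_stride + 2] - pred[y * pred_stride + 2];
        int d3 = src[y * src_stride + 3] - pred[y * pred_stride + 3];
        int a0 = d0 + d1, a1 = d0 - d1, a2 = d2 + d3, a3 = d2 - d3;
        t[y][0] = a0 + a2;
        t[y][1] = a1 + a3;
        t[y][2] = a0 - a2;
        t[y][3] = a1 - a3;
    }
    for (int x = 0; x < 4; x++) {
        /* vertical 4-point Hadamard on one column, accumulate |coeff| */
        int a0 = t[0][x] + t[1][x], a1 = t[0][x] - t[1][x];
        int a2 = t[2][x] + t[3][x], a3 = t[2][x] - t[3][x];
        sum += abs(a0 + a2) + abs(a1 + a3) + abs(a0 - a2) + abs(a1 - a3);
    }
    return sum >> 1;
}

/* Hypothetical batched interface: score n candidate predictions at once.
 * A SIMD version would compute several of these in parallel. */
void satd_4x4_multi(const uint16_t *src, int src_stride,
                    const uint16_t *const *preds, int pred_stride,
                    int n, int *out)
{
    for (int i = 0; i < n; i++)
        out[i] = satd_4x4(src, src_stride, preds[i], pred_stride);
}

/* Index of the smallest cost (first one wins on ties). */
int index_of_min(const int *v, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (v[i] < v[best])
            best = i;
    return best;
}
```

The trade-off described above is visible here: the batched call always evaluates every candidate, so the saving from breaking out early on i_satd <= 0 is lost and has to be won back by the wider SIMD pass.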

DTL

28th March 2023, 13:00

Based on the current state of development:
- The activities of the MPEG coder and of the preprocessing temporal denoiser are sufficiently collaborative in both motion vector search and usage.
- Current mvtools can use a motion vector stream either from the system's hardware MPEG encoding accelerators, via the now-standard DX12-ME API available since some generation of Win10, or from a full on-CPU search. The format of the motion vector stream is about equal for the hardware MPEG encoder and mvtools. Precision may be down to qpel (currently the only precision supported by the hardware API).
- Multi-generation motion search for naturally noisy sources shows a significant improvement in the quality of motion vectors already at the second generation (examples of execution structure - https://forum.doom9.org/showthread.php?p=1984152#post1984152 )

It may be interesting to reuse refined motion vectors for pre-MPEG denoising in the x264 encoder (also offloading some of the MV search to system hardware accelerators if present). Maybe someone with good knowledge of motion vector usage in x264 can make a fork and run some quick tests for performance/quality?

Selur

29th March 2023, 03:15

Why has the resolution been limited to 16384 since https://code.videolan.org/videolan/x264/-/commit/7923c5818b50a3d8816eed222a7c43b418a73b36 ?
One can't encode something like 24800x90 anymore, which should work fine with level 5.2 at 30fps and works fine with older versions (see the discussion over at videohelp (https://forum.videohelp.com/threads/409093-file-size-limit-MP4-or-MOV)).

Cu Selur

nevcairiel

29th March 2023, 09:21

One can't encode something like, 24800x90 anymore, which should work fine with level 5.2

This is not accurate. Such a dimension is not supported in any defined level.

From the specification:

f) PicWidthInMbs <= Sqrt( MaxFS * 8 )
g) FrameHeightInMbs <= Sqrt( MaxFS * 8 )

The highest MaxFS, at Level 6(.1/.2), is 139264 macroblocks.
A macroblock is 16x16 pixels, just for reference.

This makes the maximum for any single dimension sqrt(139264 * 8) = 1055 MBs, or 16880 pixels. That is pretty close to the limit chosen, which is a neat power of 2 close to this (or 1024 MBs).
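The constraint can be checked without a square root by comparing squares. A minimal sketch, with a hypothetical function name and with MaxFS passed in by the caller:

```c
/* Returns 1 if a frame dimension of `pixels` satisfies
 * DimInMbs <= Sqrt( MaxFS * 8 ), tested as DimInMbs^2 <= MaxFS * 8.
 * The dimension is rounded up to whole 16-pixel macroblocks. */
int dimension_allowed(long long pixels, long long max_fs)
{
    long long mbs = (pixels + 15) / 16;
    return mbs * mbs <= max_fs * 8;
}
```

With MaxFS = 139264, a width of 24800 pixels (1550 MBs) fails the check, 16880 pixels (1055 MBs) is the largest dimension that passes, and 16384 pixels (1024 MBs) sits comfortably inside it.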

Selur

29th March 2023, 17:40

Argh, I totally forgot about PicWidthInMbs and FrameHeightInMbs, which makes my calculation irrelevant since it only took MaxMBPS into account :(
Thanks for clearing that up.

MasterNobody

29th March 2023, 21:22

This change was made for security reasons, and because the limit has to be somewhere, it was made high enough to be reasonable and not overflow some intermediate SAD/SSD cost calculations for row of MBs in 32-bit integers (especially for 10-bit output). In special cases, you can always compile x264 without this limitation and use it at your own risk without any warranty.

PoeBear

3rd May 2023, 16:39

Do any of the custom x264 builds have aq-bias-strength enabled? It's been useful for spreading crf bitrate out in x265, and I was curious if it existed in x264, and came across these patches when searching, but no binaries. Would it be as useful in x264?

https://gist.github.com/noizuy/83ba825d796e86cab1f67de333f90e0d
https://gist.github.com/noizuy/588446e288164fd4792c9406bc0b1416

jpsdr

17th May 2023, 17:48

Made a new build with the aq-bias-strength patch, check on my github.