x265 HEVC Encoder [Archive] - Page 130

Boulder

19th October 2018, 20:37

I finished my short tests today, and these are the settings that I found to suit my requirements.

1080p:

--deblock -3:-3 --no-strong-intra-smoothing --merange 44 --no-sao --qcomp 0.75 --aq-mode 3 --aq-strength 0.8 --ctu 32
--max-tu-size 32 --qg-size 8 --tu-inter-depth 4 --tu-intra-depth 4 --limit-tu 4 --limit-refs 3 --max-merge 3 --rd-refine
--ref 6 --bframes 10 --crf 20.5

720p:
--deblock -3:-3 --no-strong-intra-smoothing --merange 38 --no-sao --qcomp 0.8 --aq-mode 3 --aq-strength 0.8 --ctu 16
--max-tu-size 16 --qg-size 8 --tu-inter-depth 3 --tu-intra-depth 3 --limit-tu 4 --limit-refs 3 --max-merge 3 --rd-refine
--ref 6 --bframes 10 --crf 19.5

Very few changes between the two, but the CTU/TU size was an obvious change based on the comparison I made. Upon playback, the video looks very good in both resolutions on my 65" LCD even when watched from 1 meter or so. As a test source, I used the first episode of Black Sails as it's generally considered a very high quality Blu-ray release.

I did try finding out differences between deblock 1:1 and -3:-3 in motion but couldn't tell. If someone had ABX'd me, I probably wouldn't have been able to say which one is which. Mind you, the scene I used contained quite a lot of motion and some fast cuts so maybe they were just not visible there. Anyway, I chose a low value as I think it will retain detail and sharpness better. I watch everything from about 3-3.5 meters anyway, so any small blocking won't be that visible.

Edit: both encodes using the preset "veryslow".
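The two command lines above differ in only a handful of options. A minimal Python sketch that lists them: the dictionaries simply transcribe the settings from this post, and the helper name `diff_options` is made up for illustration.

```python
# Options transcribed from the 1080p and 720p command lines in the post above.
# Flags that take no value (e.g. --no-sao) are stored as True.
SETTINGS_1080P = {
    "deblock": "-3:-3", "no-strong-intra-smoothing": True, "merange": 44,
    "no-sao": True, "qcomp": 0.75, "aq-mode": 3, "aq-strength": 0.8,
    "ctu": 32, "max-tu-size": 32, "qg-size": 8, "tu-inter-depth": 4,
    "tu-intra-depth": 4, "limit-tu": 4, "limit-refs": 3, "max-merge": 3,
    "rd-refine": True, "ref": 6, "bframes": 10, "crf": 20.5,
}

# The 720p line reuses everything except the values overridden here.
SETTINGS_720P = {
    **SETTINGS_1080P,
    "merange": 38, "qcomp": 0.8, "ctu": 16, "max-tu-size": 16,
    "tu-inter-depth": 3, "tu-intra-depth": 3, "crf": 19.5,
}


def diff_options(a, b):
    """Return {option: (value_in_a, value_in_b)} for options whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


for name, (v1080, v720) in diff_options(SETTINGS_1080P, SETTINGS_720P).items():
    print(f"--{name}: 1080p={v1080}, 720p={v720}")
```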

Dclose

20th October 2018, 00:30

Did you see any issues with using 0:0?

I did a 4k down to 1080 encode the other day, with an accidental 0:0 deblock setting. It looks good in a "slick," "photoshopped" way, but it definitely loses a lot of detail, at least at CRF 23.

That's too much deblocking for me. Though I do put db at 0:0 when increasing CRF and dropping resolution a lot.

DotJun

20th October 2018, 05:53

--asm sse4

Main asm levels are:

--asm no

--asm sse2

--asm ssse3

--asm sse4

--asm avx2

--asm avx512

--asm avx512 works a bit differently: it only enables the possibility of using AVX-512 in the auto-detection of CPU capabilities. To force use of AVX-512 code (without checking), please use

--asm avx,avx512

Thanks, I’ll put up three test runs of default, sse4 and avx512 to see what changes there are in FPS and efficiency.

Will the medium or fast preset be ok to use, or does it have to be slower or higher?
Will it be ok to use “asm sse4” or do I have to specify sse4.2 like my log file shows?

Boulder

20th October 2018, 10:42

--qg-size 16, tested values from 64 to 8, 16 looked best (in terms of distortion in small details again) when compared frame-by-frame. Tested with a 720p encode, so I'll need to check that also with 1080p later.
I would expect it would look better, but at some cost to efficiency due to the signaling overhead. --opt-cu-delta-qp might help that some.
I was wondering about this parameter. Does it actually shift bits inside the frame so that the distribution is closer to the average - so that some areas with fine detail could suffer at the cost of flat areas looking better? Or the other way around as flat areas could be compressed using a lower QP to make them look equally good?

benwaggoner

20th October 2018, 22:39

I was wondering about this parameter. Does it actually shift bits inside the frame so that the distribution is closer to the average - so that some areas with fine detail could suffer at the cost of flat areas looking better? Or the other way around as flat areas could be compressed using a lower QP to make them look equally good?

AFAIK it works by reducing signaling overhead without changing actual pixels. There are several bitstream options that do things like that. This is the one likely to have the most impact, as WP signaling can take place MANY times per frame. The others are per frame or per GOP.


DotJun

21st October 2018, 07:23

I did a few more test runs with the same parameters I stated before using 50k frames test clip. Here are the results.

Normal Preset:
sse4 2.19fps
avx2 2.89fps
512 3.18fps

They all had the exact same kbps of 23463.

Slower Preset:
sse4 0.71fps
avx2 0.89fps
512 0.95fps

All of them ended up with the exact same kbps of 22868.

My previous test from the other day seems to be a failure since I didn't use the correct switches for --asm.
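To put the fps figures above in relative terms, here is a small Python sketch. The numbers are copied from this post; reading the "512" rows as the avx512 runs is an assumption, and `speedup_percent` is just an illustrative helper.

```python
# Encode speeds (fps) reported above, keyed by --asm level.
# Assumption: the "512" rows are the avx512 runs.
RESULTS = {
    "normal": {"sse4": 2.19, "avx2": 2.89, "avx512": 3.18},
    "slower": {"sse4": 0.71, "avx2": 0.89, "avx512": 0.95},
}


def speedup_percent(baseline_fps, fps):
    """Percentage by which `fps` exceeds `baseline_fps`."""
    return (fps / baseline_fps - 1.0) * 100.0


for preset, fps in RESULTS.items():
    for level in ("avx2", "avx512"):
        gain = speedup_percent(fps["sse4"], fps[level])
        print(f"{preset}: {level} is {gain:.0f}% faster than sse4")
    step = speedup_percent(fps["avx2"], fps["avx512"])
    print(f"{preset}: avx512 is {step:.0f}% faster than avx2")
```

Since the bitrate was identical within each preset, the asm level only changed speed here, not the output size.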

Boulder

21st October 2018, 13:21

AFAIK it works by reducing signaling overhead without changing actual pixels.

There is a quite clear difference in the frame, it was quite obvious in this flat area (zoomed in though). The filesize difference was about 5% (the optimized version was smaller).

Without the optimization:
http://thumbs2.imagebam.com/6f/13/25/6e88661006880184.jpg (http://www.imagebam.com/image/6e88661006880184)

Optimization enabled:
http://thumbs2.imagebam.com/9f/c3/8d/85be161006880194.jpg (http://www.imagebam.com/image/85be161006880194)

In higher detailed area, the optimized encode looked better. I think I need to find a scene with some still background like sky, and see which one looks better in motion. The banding in flat areas can be quite eye-catching once you notice it.

benwaggoner

21st October 2018, 18:54

There is a quite clear difference in the frame, it was quite obvious in this flat area (zoomed in though). The filesize difference was about 5% (the optimized version was smaller).

In higher detailed area, the optimized encode looked better. I think I need to find a scene with some still background like sky, and see which one looks better in motion. The banding in flat areas can be quite eye-catching once you notice it.
Wow, awesome info! Thanks! I will iterate on this further.

Was there any measurable perf difference?

Boulder

21st October 2018, 19:37

Wow, awesome info! Thanks! I will iterate on this further.

Was there any measurable perf difference?

I just ran a longer encode to measure, and the difference is quite high. 4.95 fps for the normal encode and 5.22 fps for the one with optimization enabled. The normal encode is about 8% bigger.

I also checked how the optimization works in normal playback, and my eyes didn't like the result. The flat areas suffered a bit too much, there was a short scene with a nice, slightly noisy but flat coloured background which was lit by some flickering candlelight. The normal encode was slightly better looking there, there was not as much swimming blocks effect as there was with the optimized version.

benwaggoner

21st October 2018, 20:35

I just ran a longer encode to measure, and the difference is quite high. 4.95 fps for the normal encode and 5.22 fps for the one with optimization enabled. The normal encode is about 8% bigger.
An 8% file size reduction for a 5.5% speed increase would be an incredible optimization. An optimization that can give a 1% reduction for a 5% speed increase is a big deal.

I also checked how the optimization works in normal playback, and my eyes didn't like the result. The flat areas suffered a bit too much, there was a short scene with a nice, slightly noisy but flat coloured background which was lit by some flickering candlelight. The normal encode was slightly better looking there, there was not as much swimming blocks effect as there was with the optimized version.
...but only if that 8% reduction doesn't impact quality, alas.

It would be interesting to see what the difference was in a stream analyzer. Or just looking at the log-level 2 csv files.
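The percentages being discussed can be checked with a few lines of Python. This is only arithmetic on the figures Boulder quoted; the exact 1.08 size ratio is an assumption standing in for "about 8% bigger".

```python
# Figures from Boulder's post: fps of the normal and the optimized encode.
normal_fps, optimized_fps = 4.95, 5.22
# Assumption: "the normal encode is about 8% bigger" taken as exactly 1.08x.
normal_size_ratio = 1.08

# Speed gain of the optimized encode relative to the normal one.
speed_gain = optimized_fps / normal_fps - 1.0

# "Normal is 8% bigger" is not the same as "optimized is 8% smaller":
# the reduction is measured against the larger (normal) file.
size_reduction = 1.0 - 1.0 / normal_size_ratio

print(f"speed gain:     {speed_gain * 100:.1f}%")
print(f"size reduction: {size_reduction * 100:.1f}%")
```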

Boulder

22nd October 2018, 04:07

I ran the encodes again to produce the logs, if you want to have a look: https://drive.google.com/open?id=1EjBUkFQjPreh4NsiZJnsPbY8qyl6tFMA . The flat part scene I mentioned appears several times, for example frames 0-117 contain that one.

Boulder

27th October 2018, 12:30

iwod

27th October 2018, 14:52

Is there a release note for v2.9?

LigH

27th October 2018, 22:01

I never saw any...

LoRd_MuldeR

27th October 2018, 22:44

Is there a release note for v2.9?

https://bitbucket.org/multicoreware/x265/src/f9681d731f2e56c2ca185cec10daece5939bee07/doc/reST/releasenotes.rst?at=stable&fileviewer=file-view-default

benwaggoner

28th October 2018, 01:35

Wolfberry

28th October 2018, 08:02

It is kind of weird that the v2.9 release notes are only available in the stable (https://x265.readthedocs.io/en/stable/releasenotes.html) version, but not default (https://x265.readthedocs.io/en/default/releasenotes.html) or even latest (https://x265.readthedocs.io/en/latest/releasenotes.html).

alex1399

28th October 2018, 09:25

It seems that even zeranoe ffmpeg stays at x265 2.8 for a long time.

By the way, I'm not a fan of hard-coded dithering. Dithering should be handled as part of creative editing before the encoding process or left to the rendering after the decoding process.

FranceBB

29th October 2018, 11:11

This is probably a silly question, but here goes anyway: if I use --hdr-opt, do I need to feed the encoder with 10-bit data or is 16-bit data as good if the source is a standard UHD with HDR? I always process things in 16-bit domain and let the encoder dither down to 10 bits.

I would let x265.exe do the dithering, 'cause other dithering options like the Floyd-Steinberg error diffusion may have a nicer look, but they could increase the bitrate required by x265. The built-in dithering filter in x265 is supposed to dither everything down to the target bit depth without introducing banding. Blocks and macro blocks dithered by x265 are more likely to be recognised during motion compensation by x265 than the ones dithered using a third-party dithering method, therefore compression should be better.

In a nutshell, let x265 do the dithering and always pipe to it the highest bit depth you have, unless you like a specific dithering method and you have enough bitrate.

benwaggoner

29th October 2018, 18:31

I would let x265.exe do the dithering, 'cause other dithering options like the Floyd-Steinberg error diffusion may have a nicer look, but they could increase the bitrate required by x265. The built-in dithering filter in x265 is supposed to dither everything down to the target bit depth without introducing banding. Blocks and macro blocks dithered by x265 are more likely to be recognised during motion compensation by x265 than the ones dithered using a third-party dithering method, therefore compression should be better.

In a nutshell, let x265 do the dithering and always pipe to it the highest bit depth you have, unless you like a specific dithering method and you have enough bitrate.
Note there are two dithering modes in x265. The basic one if you don’t specify anything, and the more advanced one if you use --dither.
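To illustrate the general idea of the two approaches to bit-depth reduction, here is a toy Python sketch comparing plain truncation with a one-dimensional error diffusion on a single row of 16-bit samples going to 10 bits. It is not x265's actual filter; the function names and the row-only diffusion are assumptions made for illustration.

```python
def truncate_row(row16, shift=6):
    """Drop the low bits: 16-bit samples become 10-bit by plain truncation."""
    return [v >> shift for v in row16]


def diffuse_row(row16, shift=6):
    """Round each sample and carry the rounding error to the next sample in the row."""
    max_out = (1 << (16 - shift)) - 1   # largest 10-bit code when shift is 6
    half = 1 << (shift - 1)             # rounding offset
    out, err = [], 0
    for v in row16:
        target = v + err
        q = min(max((target + half) >> shift, 0), max_out)
        # Keep the carried error bounded, even when the output is clipped.
        err = max(-half, min(half, target - (q << shift)))
        out.append(q)
    return out


# A flat 16-bit area that sits exactly between the 10-bit codes 1 and 2.
flat = [96] * 8
print(truncate_row(flat))   # every sample lands on the same code
print(diffuse_row(flat))    # codes alternate, so the average level is preserved
```

Truncation turns the flat area into a single, slightly too dark code, which is how banding starts; diffusion trades that for a fine pattern whose average matches the source.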

benwaggoner

31st October 2018, 00:08

RainyDog

31st October 2018, 11:29

Say, does anyone have any data showing potential benefits of using some of the "beyond placebo" settings?

For example

--subme 6 or 7 instead of 5
--me sea instead of star
--bframes 16 instead of 8
--ref 6+ instead of 5
--tskip
--cu-lossless

I've found some value in very low bitrates or unusual content from all but the higher subme and me, which I've not tried.
--me sea in particular cuts speed enormously.

Tskip and cu-lossless can help with anime, small text, screen captures, and other stuff with sharp detailed edges.

Is --subme tied to RDO in the way that it is in x264?

In x264, --subme 6 upwards have increased levels of RDO. But seems that RDO is its own separate thing in x265 with RD levels 1-6 and rd-refine being a separate switch too.

benwaggoner

31st October 2018, 17:08

Is --subme tied to RDO in the way that it is in x264?

In x264, --subme 6 upwards have increased levels of RDO. But seems that RDO is its own separate thing in x265 with RD levels 1-6 and rd-refine being a separate switch too.
Yeah, subme does have some impact on rate control, but --rd is where the action is at.

From x265.readthedocs.io

Amount of subpel refinement to perform. The higher the number the more subpel iterations and steps are performed. Default 2

At --subme values larger than 2, chroma residual cost is included in all subpel refinement steps and chroma residual is included in all motion estimation decisions (selecting the best reference picture in each list, and choosing between merge, uni-directional motion and bi-directional motion). The ‘slow’ preset is the first preset to enable the use of chroma residual.

TomV

31st October 2018, 20:39

Say, does anyone have any data showing potential benefits of using some of the "beyond placebo" settings?

For example

--subme 6 or 7 instead of 5
--me sea instead of star
--bframes 16 instead of 8
--ref 6+ instead of 5
--tskip
--cu-lossless

I've found some value in very low bitrates or unusual content from all but the higher subme and me, which I've not tried.
--me sea in particular cuts speed enormously.

Tskip and cu-lossless can help with anime, small text, screen captures, and other stuff with sharp detailed edges.
When I was tuning x265's presets, I tried all of these options that go beyond placebo, to see what should be included in placebo. You and anyone else are welcome to try them again, but I found it was easy to massively increase encode times, but impossible to get any meaningful improvement in efficiency.

LigH

31st October 2018, 23:28

What idiom would you give a configuration "beyond placebo"? Possibly "insane", like LAME MP3...

mandarinka

1st November 2018, 02:02

What idiom would you give a configuration "beyond placebo"? Possibly "insane", like LAME MP3...
ultra placebo. Or very placebo since that would match with very slow.

benwaggoner

1st November 2018, 05:10

When I was tuning x265's presets, I tried all of these options that go beyond placebo, to see what should be included in placebo. You and anyone else are welcome to try them again, but I found it was easy to massively increase encode times, but impossible to get any meaningful improvement in efficiency.
Indeed. “--me full --cu-lossless” will tank speed unbelievably without even trivial quality improvements for typical content and use cases.

I have seen visible and measurable value from ref 6, bframes 16, and tskip at <150 Kbps. Of course, using a lot of MIPS/pixel isn’t nearly so painful at low frame sizes and bitrates.

RainyDog

1st November 2018, 10:27

Yeah, subme does have some impact on rate control, but --rd is where the action is at.

From x265.readthedocs.io

Well yeah, that's why I asked. As --subme in the x265 docs doesn't specifically mention RDO.

But in x264 it's :-

subme

Default: 6

Set the subpixel estimation complexity. Higher numbers are better. Levels 1-5 simply control the subpixel refinement strength. Level 6 enables RDO for mode decision, and level 8 enables RDO for motion vectors and intra prediction modes. RDO levels are significantly slower than the previous levels.

1. QPel SAD 1 iteration
2. QPel SATD 2 iterations
3. HPel on MB then QPel
4. Always QPel
5. Multi QPel + bime
6. RD on I/P frames
7. RD on all frames
8. RD refinement on I/P frames
9. RD refinement on all frames
10. QP-RD (requires --trellis=2, --aq-mode > 0)

So it seems to me that --subme in x265 is as per --subme 1-5 in x264.

And the RD stuff of x264 --subme 6-11 is branched off into its own thing in x265 with selectable RDO levels, rd-refine as a separate option etc.

LigH

2nd November 2018, 12:59

x265 2.9+4-471726d3a046 (https://www.mediafire.com/file/0w5wivvwhcdb1y7/x265_2.9%2B4-471726d3a046.7z)

fixes: rowStat computation in const-vbv; memory reset size in dynamic-refine; linking issue on non x86 platform

Dclose

4th November 2018, 19:33

Say, does anyone have any data showing potential benefits of using some of the "beyond placebo" settings?

For example
--subme 6 or 7 instead of 5
I don't know what data you're looking for besides "yes, it looks better." I wrote the following some months back in this thread:

1) imo, if you care about things that move, (and picture quality in general), you have to use sub-motion pixel subme 7. 5 is good, and is as low as I ever set that even on files I'm trying to finish fast, but 5 is easily visually inferior to 7 imo. 7 of course takes longer to encode though.

I'm usually around CRF 22-23 (with nearly all quality settings turned on). It probably has lesser effect at CRF 18.

RainyDog

11th November 2018, 19:18

For 2-pass encodes, I normally use a custom faster 1st pass command line which is the same as my slow 2nd pass just with RDO level and subme turned down to level 2, --me dia, --early-skip and --fast-intra.

But I've been testing using identical command lines for both passes and using --multi-pass-opt-analysis instead which speeds up the 2nd pass considerably to the point where a complete 2-pass encode is almost the same speed as my usual approach.

Which should technically yield the higher quality final result? Is there any potential harm to using --multi-pass-opt-analysis?

Majorlag

13th November 2018, 17:52

Which should technically yield the higher quality final result? Is there any potential harm to using --multi-pass-opt-analysis?
Don't forget to also add --multi-pass-opt-rps --multi-pass-opt-distortion to your command line.
I understand that if you're NOT turning down --rd, --me and other settings then it should produce better results, since it will spend more time on those settings in the first pass. The --multi-pass options are great at reusing the values obtained in the first pass to increase the speed of the second pass.
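A rough Python sketch of how the two identical-settings passes described above might be assembled. The three --multi-pass-opt-* options are the ones named in this thread, and --pass, --stats, --bitrate, --input and --output are regular x265 options; the file names, the bitrate, the stats file name and the helper `two_pass_commands` are hypothetical, and whether this combination behaves well on a given build is untested here.

```python
import os
import shlex

# The multi-pass options mentioned in the posts above.
MULTIPASS_FLAGS = [
    "--multi-pass-opt-analysis",
    "--multi-pass-opt-distortion",
    "--multi-pass-opt-rps",
]


def two_pass_commands(source, output, bitrate_kbps, common_args,
                      stats_file="x265_2pass.log"):
    """Build pass-1 and pass-2 argument lists that share every quality setting."""
    base = (["x265", "--input", source,
             "--bitrate", str(bitrate_kbps), "--stats", stats_file]
            + list(common_args) + MULTIPASS_FLAGS)
    first = base + ["--pass", "1", "--output", os.devnull]   # discard pass-1 video
    second = base + ["--pass", "2", "--output", output]
    return first, second


# Hypothetical file names and settings, for illustration only.
first, second = two_pass_commands("source.y4m", "out.hevc", 4000,
                                  ["--preset", "slow"])
print(shlex.join(first))
print(shlex.join(second))
```

The point of building both lists from one `base` is that the passes cannot drift apart, which is the "identical command lines" approach RainyDog describes, as opposed to a hand-tuned faster first pass.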

atrin

15th November 2018, 04:09

Hi,

I have some sample tables for lambda2 and I generated a table of lambda based on it. This is the address of my sample https://mailman.videolan.org/pipermail/x265-devel/2017-March/010936.html
The second table is not related with its formula (lambda2 = 0.038 * pow(0.234, QP))

Is there any document or information that explains the lambda and lambda2 tables and how they are related?

Many thanks

Jamaika

18th November 2018, 13:17

Problems with the VMAF metric.

After many hours, I managed to set up the VMAF items. The new addition even recalculates something.

Read input model (libsvm) at ./vmaf_rb_v0.6.2/vmaf_rb_v0.6.2.pkl.model ...
…
Initialize storage arrays...
Extract atom features...
frame: 0, adm: 0.986, adm_num: 792.386, adm_den: 803.249, adm_num_scale0: 102.293, adm_den_scale0: 105.547, adm_num_scale1: 148.311, adm_den_scale1: 151.745, adm_num_scale2: 230.474, adm_den_scale2: 232.816, adm_num_scale3: 311.307, adm_den_scale3: 313.141, motion: 0.000, motion2: 0.000, vif_num_scale0: 3201540.000, vif_den_scale0: 4265149.500, vif_num_scale1: 915769.438, vif_den_scale1: 973015.313, vif_num_scale2: 237178.438, vif_den_scale2: 244942.781, vif_num_scale3: 62793.887, vif_den_scale3: 64002.453, vif: 0.796,
Generate final features (including derived atom features)...
Normalize features, SVM regression, denormalize score, clip...
frame: 0, adm2: 0.986477, adm_scale0: 0.969174, adm_scale1: 0.977376, adm_scale2: 0.989942, adm_scale3: 0.994142, motion: 0.000000, vif_scale0: 0.750628, vif_scale1: 0.941167, vif_scale2: 0.968301, vif_scale3: 0.981117, vif: 0.796321, motion2: 0.000000,
Exec FPS: 2.742952
VMAF score (mean) = 100.000000
x265 [info]: frame I: 1, Avg QP:23.64 kb/s: 5696.20
x265 [info]: frame P: 2, Avg QP:29.09 kb/s: 874.10
x265 [info]: frame B: 7, Avg QP:35.33 kb/s: 204.03
x265 [info]: Weighted P-Frames: Y:0.0% UV:0.0%
x265 [info]: Weighted B-Frames: Y:0.0% UV:0.0%
x265 [info]: consecutive B-frames: 33.3% 0.0% 0.0% 33.3% 33.3% 0.0% 0.0% 0.0% 0.0%

However, how to use it? So many ads on the forum.

static const x265_vmaf_commondata vcd_yuv420p[] = { { (char *)"yuv420p", (char *)"./vmaf_rb_v0.6.2/vmaf_rb_v0.6.2.pkl", (char *)"vmaf_yuv%04d.json", (char *)"json", 0, 1, 1, 0, 0, 0, 0, (char *)"mean", 0, 3, 1 } };

The first two items are obvious. They concern the chroma subsampling format and the VMAF model version. Because I chose version 0.6.2, the last item, 'enable_conf_interval', must be included.
Then come the items for recording the VMAF metrics in JSON or XML files. These are the next two positions from the left. Here are the problems. First of all, I don't know why the program doesn't save all parameters in one file. Secondly, I can't force the program to save the JSON/XML files one after another, numbered by processed frame. (vmaf_yuv%04d)
{
  "version":"1.3.7",
  "params":{
    "model":"",
    "scaledWidth":1920,
    "scaledHeight":1080,
    "subsample":3
  },
  "metrics":[
    "adm2",
    "bagging",
    "ci95_high",
    "ci95_low",
    "motion2",
    "stddev",
    "vif_scale0",
    "vif_scale1",
    "vif_scale2",
    "vif_scale3",
    "vmaf"
  ],
  "frames":[
    {
      "frameNum":0,
      "metrics":{
        "adm2":0.98648,
        "bagging":99.62585,
        "ci95_high":100.0,
        "ci95_low":98.12069,
        "motion2":0.0,
        "stddev":0.7394500000000001,
        "vif_scale0":0.75063,
        "vif_scale1":0.94117,
        "vif_scale2":0.9683,
        "vif_scale3":0.98112,
        "vmaf":100.0
      }
    }
  ]
}
Next, what are the items 'disable_clip' and 'enable_transform' for?
The next four items phone_model, psnr, ssim, ms_ssim should be turned off.
Choice of data processing method
Choosing the number of cores. In my case, zero.
Problem with the color n_subsample parameter. For BPG, once there are three for YUV, once there should be one for the alpha color. The instruction is five.

Ok, I created x265 files with VMAF and without:
- The X265 VMAF codec doesn't work with FFmpeg.
av_interleaved_write_frame(): Broken pipe
No more output streams to write to, finishing.
Error writing trailer of pipe:: Broken pipe
ffmpeg.exe -loglevel verbose -i Untitled.mp4 -an -f yuv4mpegpipe -vf scale=1920:1080:in_color_matrix=bt709:in_range=limited:out_color_matrix=bt709:out_range=limited,format=yuv420p -strict -1 - |
x265_081012bit_hdr_vmaf.exe --y4m --input-csp i420 --input-depth 8 --output-depth 8 --preset veryslow --crf 28 --fps 25.000 --keyint 50 --info --no-open-gop
--colormatrix bt709 --colorprim bt709 --transfer bt709 --limit-ref 0 --range limited --recon 111.yuv --output 111.h265 -
- I don't know what the 'recon' function is for with VMAF.
In the description:
-r/--recon <filename> Reconstructed raw image YUV or Y4M output file name

- Strange: x265 with VMAF itself works, but it isn't clear whether the codec should produce an output file or not.
Assuming it does, this file does not differ in content from the recon file. In addition, these files don't differ from x265 files encoded without VMAF. I have no idea what the recon file is, or what it is for.

LigH

19th November 2018, 09:54

The "recon" feature writes a YUV or Y4M raw video file that contains the reconstructed video which has been decoded right after encoding it, so you can compare the compression results with the original source (assuming it was a YUV or Y4M file too) without calling an additional decoder. It is available independently of VMAF functions linked into x265 – which may still be possible only under Linux, I believe; are you sure your Windows build contains any VMAF comparison code? The build script source\CMakeLists.txt clearly contains the check in an "if(UNIX)" block.
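As one example of comparing a recon file with the original source, here is a minimal Python sketch that computes a per-frame PSNR over all planes of two raw 8-bit 4:2:0 files. The function names are made up for illustration, and it assumes headerless .yuv files of identical resolution (a Y4M file has headers and would need a real parser).

```python
import math


def frame_size_420(width, height):
    """Bytes per 8-bit 4:2:0 frame: one luma plane plus two quarter-size chroma planes."""
    return width * height * 3 // 2


def psnr_8bit(reference, reconstructed):
    """PSNR in dB between two equally sized byte buffers of 8-bit samples."""
    if len(reference) != len(reconstructed) or not reference:
        raise ValueError("buffers must be non-empty and of equal length")
    sse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed))
    if sse == 0:
        return math.inf  # identical buffers
    mse = sse / len(reference)
    return 10.0 * math.log10(255.0 ** 2 / mse)


def per_frame_psnr(source_path, recon_path, width, height):
    """Yield one PSNR value per frame until either file runs out of full frames."""
    size = frame_size_420(width, height)
    with open(source_path, "rb") as src, open(recon_path, "rb") as rec:
        while True:
            a, b = src.read(size), rec.read(size)
            if len(a) < size or len(b) < size:
                break
            yield psnr_8bit(a, b)


# Example call (file names are placeholders):
# for i, value in enumerate(per_frame_psnr("source.yuv", "111.yuv", 1920, 1080)):
#     print(i, round(value, 2))
```

This pure-Python loop is slow on 1080p frames; it is meant to show what the recon output makes possible, not to replace a proper metric tool.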

Jamaika

19th November 2018, 11:55

... which may still be possible only under Linux, I believe; are you sure your Windows build contains any VMAF comparison code? The build script source\CMakeLists.txt clearly contains the check in an "if(UNIX)" block.
I created a version for Windows 2.9+8. I changed almost nothing. The codec only lacks 'threads' for VMAF, as I wrote earlier.
https://www.sendspace.com/file/r90y0d
Probably it can also be created in MSVC.

Barough

19th November 2018, 18:40

x265 v2.9+8-27d8424c799d (http://www.mediafire.com/file/bs49cg9rcjik169/) (32 & 64-bit 8/10/12bit Multilib Windows Binaries)

https://bitbucket.org/multicoreware/x265/commits/branch/stable

21st November 2018, 11:54

(64-bit GCC 8.2.0 8+10+12bit multilib / ICC 19.0 8/10/12 cli+shared)

ICC binaries not working in my Win10 (missing dll's).

Natty

22nd November 2018, 00:58

hi, i would like to know how qcomp works in simple language, and its impact on bitrate when it's lowered or increased from its default value :thanks:

LigH

22nd November 2018, 09:11

--qcomp <float> (https://x265.readthedocs.io/en/default/cli.html?highlight=qcomp#cmdoption-qcomp)
qComp sets the quantizer curve compression factor. It weights the frame quantizer based on the complexity of residual (measured by lookahead). Its value must be between 0.5 and 1.0. Default value is 0.6. Increasing it to 1.0 will effectively generate CQP.

The default value 0.6 is a balance between a constant quantizer (regardless of the video content) and the complexity of the video content (degree of details and amount of motion) providing chances to spare bitrate by increasing the quantizer slightly in scenes where it may be sufficient to preserve enough quality with little noticeable loss.

IIRC, if you could decrease it to 0.0, the encoder would try its best to keep a constant bitrate (CBR), which would cause a very varying amount of quality loss (I might be wrong here, for x265, though). Increasing it to 1.0 instead would cause a constant quantization which would not take advantage of the possible ways to spare bitrate in scenes where convenient quality preservation could already be achieved with less bitrate, at a coarser quantization than the target.

You may increase this value a little (e.g. towards 0.8) when you notice that there is too much loss of precision in areas with very little detail, e.g. darkness and smooth ramps, especially in cases when your target bitrate is rather low. On the other hand, there may be other (psycho-visual) options to let the encoder not spare too much bitrate.
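The two extremes described above can be illustrated with a toy model in Python. It follows the way x264-style rate control is commonly described (frame quantizer scale proportional to complexity raised to the power 1 - qcomp) and assumes that bits scale as complexity divided by quantizer scale; it is a simplification, not x265's actual code, and the function names are invented. A qcomp of 0.0 is outside the range x265 accepts and appears only to show the limit.

```python
def relative_qscale(complexity, qcomp):
    """Toy model: quantizer scale grows as complexity ** (1 - qcomp)."""
    return complexity ** (1.0 - qcomp)


def relative_bits(complexity, qcomp):
    """Toy model assumption: bits spent scale as complexity / quantizer scale."""
    return complexity / relative_qscale(complexity, qcomp)


# Compare a simple frame (complexity 1) with one four times as complex.
for qcomp in (0.0, 0.6, 1.0):
    q_ratio = relative_qscale(4.0, qcomp) / relative_qscale(1.0, qcomp)
    b_ratio = relative_bits(4.0, qcomp) / relative_bits(1.0, qcomp)
    print(f"qcomp={qcomp}: quantizer x{q_ratio:.2f}, bits x{b_ratio:.2f}")
```

In this model qcomp 1.0 leaves the quantizer untouched while bits follow complexity (the CQP-like end), and qcomp 0.0 holds bits constant by pushing the quantizer up on complex frames (the CBR-like end); the default of 0.6 sits in between.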

22nd November 2018, 12:15

@Ma Test version available. If any of these works, some benchmarks will be appreciated.

x265_MT.exe works, thanks! I will make some test with '--no-asm' option to compare only C++ compilers.

22nd November 2018, 21:04

Test platform: Win10 64-bit home, i7 8700 + be quiet pure rock, 16 GB RAM DDR4 @ 3866
Command line (only 8-bit encoding):
x265 --no-asm --crf 20 ../Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m w.hevc

Results in fps (encoding speed, mean value from 2 runs):
8.98 fps -- ICC AVX2
8.41 fps -- GCC 9.0 AVX2 ucrt
7.36 fps -- GCC 8.2 AVX2
7.34 fps -- GCC 8.2 AVX2 ucrt
7.12 fps -- GCC 7.3 AVX2 ucrt
7.11 fps -- GCC 6.5 AVX2 ucrt
6.30 fps -- GCC 5.5 AVX2 ucrt
5.93 fps -- GCC 8.2 generic Barough build
5.57 fps -- GCC 4.9.4 AVX2 ucrt
5.06 fps -- VS 2017 AVX2
4.89 fps -- VS 2015 AVX2
4.72 fps -- GCC 4.8.5 AVX2 ucrt

ucrt means Universal CRT (it is a replacement for msvcrt.dll)
Results with asm were 29 to 30 fps for all contenders (full results in screen.txt).

ICC 19 is the clear winner, with GCC 9 in second place. VS 2017/2015 without asm are really slow (but with asm they are good/the best).
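For a sense of scale, a short Python sketch that works out the spread between the compilers and the remaining gap to the asm builds. The fps values are copied from the table above; using 29.0 as the asm figure takes the lower end of the quoted "29 up to 30 fps" and is an assumption.

```python
# --no-asm results (fps) copied from the table above.
NO_ASM_FPS = {
    "ICC AVX2": 8.98,
    "GCC 9.0 AVX2 ucrt": 8.41,
    "GCC 8.2 AVX2": 7.36,
    "GCC 8.2 AVX2 ucrt": 7.34,
    "GCC 7.3 AVX2 ucrt": 7.12,
    "GCC 6.5 AVX2 ucrt": 7.11,
    "GCC 5.5 AVX2 ucrt": 6.30,
    "GCC 8.2 generic Barough build": 5.93,
    "GCC 4.9.4 AVX2 ucrt": 5.57,
    "VS 2017 AVX2": 5.06,
    "VS 2015 AVX2": 4.89,
    "GCC 4.8.5 AVX2 ucrt": 4.72,
}
ASM_FPS_LOW = 29.0  # assumption: lower end of the quoted asm range

fastest = max(NO_ASM_FPS, key=NO_ASM_FPS.get)
slowest = min(NO_ASM_FPS, key=NO_ASM_FPS.get)
spread = NO_ASM_FPS[fastest] / NO_ASM_FPS[slowest]

print(f"{fastest} is {spread:.2f}x as fast as {slowest} without asm")
print(f"asm is at least {ASM_FPS_LOW / NO_ASM_FPS[fastest]:.1f}x the best no-asm build")
print(f"asm is at least {ASM_FPS_LOW / NO_ASM_FPS[slowest]:.1f}x the worst no-asm build")
```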

poisondeathray

22nd November 2018, 23:55

Thanks Ma for those tests. Wow, that's a large % variation in speed

FranceBB

23rd November 2018, 05:59

Test platform: Win10 64-bit home, i7 8700 + be quiet pure rock, 16 GB RAM DDR4 @ 3866
Command line (only 8-bit encoding):
x265 --no-asm --crf 20 ../Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m w.hevc

Results in fps (encoding speed, mean value from 2 runs):
8.98 fps -- ICC AVX2
8.41 fps -- GCC 9.0 AVX2 ucrt
7.36 fps -- GCC 8.2 AVX2
7.34 fps -- GCC 8.2 AVX2 ucrt
7.12 fps -- GCC 7.3 AVX2 ucrt
7.11 fps -- GCC 6.5 AVX2 ucrt
6.30 fps -- GCC 5.5 AVX2 ucrt
5.93 fps -- GCC 8.2 generic Barough build
5.57 fps -- GCC 4.9.4 AVX2 ucrt
5.06 fps -- VS 2017 AVX2
4.89 fps -- VS 2015 AVX2
4.72 fps -- GCC 4.8.5 AVX2 ucrt

Very interesting.
I knew that Intel Parallel Studio (and its compiler) was good, but what surprises me is that GCC has become better and better.
Visual Studio used to be good for AVX2, while GCC used to be better for SSE2/SSSE3/SSE4.1, but perhaps things have changed and GCC now totally outperforms Visual Studio.

nevcairiel

23rd November 2018, 09:15

Do keep in mind that these tests are absolutely disjunct from reality. No one is going to run something like x265 without ASM, so for any real-world use these numbers are meaningless.

FranceBB

26th November 2018, 07:17

Do keep in mind that these tests are absolutely disjunct from reality. No one is going to run something like x265 without ASM, so for any real-world use these numbers are meaningless.

Well, of course.
Still, in an ideal world, compilers would be able to produce optimized assembly code as fast as manually-written intrinsics, so there's no need to manually write them.

Unfortunately, that's still a utopia.

Anyway, for 10/12bit x265 on x86 32bit systems (for which there aren't manually written intrinsics available and builds rely on compiler optimization only), compilers "speed tests" are kinda useful. ^_^

nevcairiel

26th November 2018, 10:38

Still, in an ideal world, compilers would be able to produce optimized assembly code as fast as manually-written intrinsics, so there's no need to manually write them.

Unfortunately, that's still a utopia.

And it will always remain nothing but a dream. A compiler does not have enough information about the restrictions and requirements of the algorithm to perform the same sort of optimization a developer can do when manually writing ASM - especially with advanced SIMD.

PS:
Anyone that runs on a 32-bit system deserves what they get. Upgrade already, stop wasting developers' and your own time. A simple change from 32-bit to 64-bit on the same hardware will yield a massive speedup already. And if you're encoding with x265 on hardware that's not even 64-bit compatible, then you should *really* upgrade.

benwaggoner

26th November 2018, 18:42

And it will always remain nothing but a dream. A compiler does not have enough information about the restrictions and requirements of the algorithm to perform the same sort of optimization a developer can do when manually writing ASM - especially with advanced SIMD.
Yeah, QFT++. Autovectorization has been a dream for decades, and it’s never gotten anywhere near what hand assembly can do. Same with autoparallelization. A good compiler and the right code can maybe get 2-3x faster. Intel’s whole Itanium debacle was premised on, and failed because of, very over optimistic assumptions about compilers being able to do this stuff.

Apple switching to Intel was such a boon to the industry because it eliminated the need for media apps to have to implement on both SSEx and AltiVec. Because you couldn’t even port between them; often the whole algorithm had to be refactored to get decent performance.

As it is, x265 is probably some of the most advanced and complex SIMD code on the planet, with a lot of complex threading to boot. It’s likely about the worst case for a speed gap between compiler-generated SIMD and hand-coded SIMD.

Anyone that runs on a 32-bit system deserves what they get. Upgrade already, stop wasting developers' and your own time. A simple change from 32-bit to 64-bit on the same hardware will yield a massive speedup already. And if you're encoding with x265 on hardware that's not even 64-bit compatible, then you should *really* upgrade.
Are people actually still doing this? I can’t imagine how slow x265 must be on pre x64 hardware. I don’t think I’ve had a machine NOT running 64-bit since Windows 7 launched, and everything I was running when Win 7 launched was already 64-bit capable.

The joules per pixel on a pre-x64 system has to be a couple of orders of magnitude worse than the latest Intel and AMD processors deliver. An upgrade would pay for itself quickly in lowered electricity & cooling costs alone!

qyot27

26th November 2018, 20:24

Are people actually still doing this? I can’t imagine how slow x265 must be on pre x64 hardware. I don’t think I’ve had a machine NOT running 64-bit since Windows 7 launched, and everything I was running when Win 7 launched was already 64-bit capable.
It's been a long, long time since I ever tried, although I do keep a full set of 32-bit build instructions for all the pieces in my FFmpeg/mpv build guide. For posterity, mostly.

My guess would be that you could get decent encode times with x265 on a PIII only by encoding at most 480p under --preset ultrafast at 8bit (since the >8bit ASM has to be explicitly turned off to build for 32-bit). And possibly not even then, as I'm pretty sure we'd still be looking at maybe 5fps. At that point you're dealing with strictly academic 'because I can' types of things, and you'd absolutely get better framerates (at a more precise preset) by just using x264 instead.

The joules per pixel on a pre-x64 system has to be a couple of orders of magnitude worse than the latest Intel and AMD processors deliver. An upgrade would pay for itself quickly in lowered electricity & cooling costs alone!
Exactly. My Coppermine system was only a main system until 2015, and it only held out that long because of not having an income up until then (it is still alive, though, but now serves as a file archive).

Inexpensive mini-PCs have really filled the gap here, and get vastly better performance than an ancient system like that would get. Even with the power draw restraints and lack of AVX 1 or 2, Bay Trail-T (and now Apollo Lake, since I inadvertently fried the other one) could run circles around the Coppermine while being dead silent because the power consumption is so low they don't even need a cooling fan. Plus access to better SIMD - up to SSE4.2, plus the AES stuff - and multithreading. Running 64-bit OSes can be a bit of a task - the Bay Trail-T era would normally only ship with 32-bit versions of Windows on tiny eMMC storage (and since they come with 32-bit UEFI, you can't use 64-bit Windows, although 64-bit Linux distros loaded from flash drives or external USB hard drives are an option), but by now 64-bit Windows installs seem common, along with allowing for putting secondary SSDs into the system.

I've done 4K->4K and 4K->1080p transcodes on the Apollo Lake at about 5fps (ultrafast, 10bit, preserving HDR, crf 18), and at least Apollo Lake has a 10bit HEVC decoder in the GPU. Had I known the exact way to get that enabled in mpv at the time, I wouldn't have even bothered trying to transcode. But it let me get working figures, and I suppose 1080p 10bit HEVC would be less of a burden on the GPU than 4K would, so there's that.

aegisofrime

27th November 2018, 17:24

x265 v2.9+8-27d8424c799d (https://drive.google.com/open?id=1xZQABtoaSFgGu11YstmHKYLzO3elemlC)
(64-bit GCC 8.2.0 8+10+12bit multilib / ICC 19.0 8/10/12 cli+shared)

Apologies, I'm only seeing a v2.9+9 build on that, and that's built with GCC 8.2. I can't find an ICC build in that Google Drive, or am I blind? :(