x265 HEVC Encoder [Archive] - Page 80

pingfr

17th July 2016, 14:13

x265 already uses all sorts of enhanced instruction sets from SSE to AVX2.

I am fully aware of the use of SSE, AVX and AVX2.

Intel's next generation, Kaby Lake (Xeon only), which is due sometime in late 2016/Q1 2017, is supposed to bring AVX512 to the table. I wonder how long it will take until x265 can take advantage of that instruction set? ;)

Note that carry-less multiplication has only very specific uses (mostly in crypto), but if it were to solve a specific problem more efficiently I'm sure they would use it.

To be honest with you, I wasn't even sure what that instruction set does until late last night, when I upgraded my Filezilla client and saw the rather extensive list of instruction sets it benefits from on my Macbook Air 11" (Mid 2013, i7 4650U).

So yeah, there you have it.

Edit: My Filezilla seems to use SSE SSE2 SSSE3 SSE4.1 SSE4.2 AVX AVX2 AES PCLMULQDQ RDRND BMI2 BMI2. (no typo from me, it states BMI2 twice).

lvqcl

17th July 2016, 15:48

until late last night, when I upgraded my Filezilla client and saw the rather extensive list of instruction sets it benefits from on my Macbook Air 11" (Mid 2013, i7 4650U).

So yeah, there you have it.

Edit: My Filezilla seems to use SSE SSE2 SSSE3 SSE4.1 SSE4.2 AVX AVX2 AES PCLMULQDQ RDRND BMI2 BMI2. (no typo from me, it states BMI2 twice).

Filezilla recognizes them, but that doesn't mean that it benefits from them.

littlepox

17th July 2016, 15:55

--tune film Update:

--crf 18 --pbratio 1.2 --cbqpoffs -3 --crqpoffs -3 --no-sao --subme 3 --b-intra --no-amp --weightb --aq-mode 3 --aq-strength 0.9 --rd 4 --psy-rd 2.0 --psy-rdoq 3.0 --rdoq-level 2 --rc-lookahead 80 --qcomp 0.65 --no-strong-intra-smoothing

Remove --ctu 32 --max-tu-size 16; without these two we see some weird artifacts reduced in high-motion, flat areas.
Decrease --rd 5 to 4 since there were too many complaints about the speed. Seriously, you guys should try NVEncC/QSVEncC by rigaya, which encodes H.265 at lightning speed with acceptable quality.

pingfr

17th July 2016, 16:00

Filezilla recognizes them, but that doesn't mean that it benefits from them.

Considering Filezilla handles SFTP, FTP over SSH and FTP over TLS, it would make sense that it recognizes and benefits from PCLMULQDQ.

pingfr

17th July 2016, 16:03

Decrease --rd 5 to 4 since there were too many complaints about the speed.

Do you have any kind of statistics regarding the encoding speed delta between an --rd 5 and an --rd 4 encode of an identical sample?

What's the difference, and what trade-off does it imply in terms of the resulting subjective visual quality and final file size?

PS: I'm drunk as fuck, hope I make sense.

littlepox

17th July 2016, 16:09

Do you have any kind of statistics regarding the encoding speed delta between an --rd 5 and an --rd 4 encode of an identical sample?

What's the difference, and what trade-off does it imply in terms of the resulting subjective visual quality and final file size?

PS: I'm drunk as fuck, hope I make sense.

Speed increases about 1/4, or even higher.
Quality decrease is hard to quantify with limited samples and observations, but still fairly acceptable; untrained eyes are probably not going to spot any significant difference.
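
To put "about 1/4 faster" in terms of wall-clock time, here is a rough back-of-the-envelope sketch. The 1.25x factor comes from the estimate above; the 10-hour baseline is purely illustrative and not a measurement from this thread.

```python
def time_after_speedup(baseline_hours: float, speedup: float) -> float:
    """Encode time once the frame rate is multiplied by `speedup`."""
    return baseline_hours / speedup

baseline = 10.0                                # hypothetical --rd 5 encode time, in hours
faster = time_after_speedup(baseline, 1.25)    # --rd 4 at "about 1/4" more fps
saving = 1 - faster / baseline                 # fraction of wall-clock time saved

print(f"{faster:.1f} h instead of {baseline:.1f} h ({saving:.0%} less time)")
```

In other words, a 25% gain in fps shortens the encode by 20%, not 25%.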

NikosD

17th July 2016, 16:24

Seriously, you guys should try NVEncC by rigaya, which encodes H.265 at lightning speed with acceptable quality.

And for Skylake owners I would suggest QSVEncC by rigaya.

Unfortunately I'm not one of them but judging from the speed and quality of my Haswell's H.264 HW encoding, Skylake's H.265 HW encoding should be faster and with better quality than Nvidia's.

Nvidia's H.265 HW encoder lacks b frame support too.

JohnLai

17th July 2016, 16:24

Seriously, you guys should try NVEncC by rigaya, which encodes H.265 at lightning speed with acceptable quality.

I would suggest using Intel Skylake QSV HEVC instead. QSV HEVC encoding has more options (e.g., QSV ICQ rather than NVENC CQP rate control). Did I mention that Skylake QSV has B-frame support, while Nvidia GPUs, yeah, Pascal included, don't even support HEVC B-frame encoding? Oh, and only Pascal will get SAO support, Maxwell will not, and the maximum CU size is still limited to 32x32 on Pascal (hardware limitation).

How do I know? Because I asked Nvidia and got replies.

NikosD

17th July 2016, 16:27

I posted just one sec before you.

What about speed comparison between those two HW H.265 encoders ?

And file size ?

JohnLai

17th July 2016, 16:32

I posted just one sec before you.

What about speed comparison between those two HW H.265 encoders ?

And file size ?

T_T......Don't have Skylake.....Can't test nor compare.
BTW, NikosD, you've got Skylake, right? Can you post a 1-minute 1920x1080 QSV HEVC encoded sample? I wanna check something in the bitstream.

And we are getting off-topic now.....PM?

NikosD

17th July 2016, 16:33

No, I wrote that in my previous post.

I only have Haswell.

JohnLai

17th July 2016, 16:42

No, I wrote that in my previous post.

I only have Haswell.

I see....cause I wanna verify the types of Intra + Inter PU sizes, SAO and 64x64 CU size support for QSV HEVC encodes.

NVENC HEVC shows:

Intra PU sizes
4x4
8x8
16x16
32x32

Inter PU sizes
8x8
8x16
8x32
16x8
16x16
16x32
24x32
32x8
32x16
32x24
32x32

Meanwhile, x265 (Very Fast preset) shows:
Intra PU sizes
4x4
8x8
16x16
32x32

Inter PU sizes
8x8
16x16
32x32
64x64

I wonder why NVENC and x265 have different statistics for Inter PU sizes. Makes me wonder about QSV HEVC too....

Edit:
Now, some extra statistics from NVENC and the x265 very fast preset. Ignore the trailing values; I just wanna show which HEVC features are being used by the hardware and software encoders:

NVENC HEVC

0x00000058 Slice I, IDR_W_RADL 0 (60510)
nal_unit_header
forbidden_zero_bit 0
nal_unit_type 19
nuh_layer_id 0
nuh_temporal_id_plus1 1
first_slice_segment_in_pic_flag 1
no_output_of_prior_pics_flag 0
slice_pic_parameter_set_id 0
slice_type 2
slice_qp_delta -9
slice_loop_filter_across_slices_enabled_flag 1

0x0000ecb6 Slice P, TRAIL_R 1 (390)
nal_unit_header
forbidden_zero_bit 0
nal_unit_type 1
nuh_layer_id 0
nuh_temporal_id_plus1 1
first_slice_segment_in_pic_flag 1
slice_pic_parameter_set_id 0
slice_type 1
slice_pic_order_cnt_lsb 1
short_term_ref_pic_set_sps_flag 0
inter_ref_pic_set_prediction_flag 0
num_negative_pics 1
num_positive_pics 0
delta_poc_s0_minus1[0] 0
used_by_curr_pic_s0_flag[0] 1
num_ref_idx_active_override_flag 0
cabac_init_flag 0
five_minus_max_num_merge_cand 0
slice_qp_delta -5
slice_loop_filter_across_slices_enabled_flag 1

x265 Very Fast preset with --aq-strength 0.6 --bframes 16 --rc-lookahead 21 --ref 5 --psy-rd 0.3

0x0000046a Slice I, IDR_W_RADL 0 (90185)
nal_unit_header
forbidden_zero_bit 0
nal_unit_type 19
nuh_layer_id 0
nuh_temporal_id_plus1 1
first_slice_segment_in_pic_flag 1
no_output_of_prior_pics_flag 0
slice_pic_parameter_set_id 0
slice_type 2
slice_sao_luma_flag 1
slice_sao_chroma_flag 1
slice_qp_delta -3
slice_loop_filter_across_slices_enabled_flag 1
num_entry_point_offsets 16
offset_len_minus1 12
entry_point_offset_minus1[0] 6288
entry_point_offset_minus1[1] 5988
entry_point_offset_minus1[2] 5103
entry_point_offset_minus1[3] 5581
entry_point_offset_minus1[4] 6711
entry_point_offset_minus1[5] 4489
entry_point_offset_minus1[6] 3397
entry_point_offset_minus1[7] 3618
entry_point_offset_minus1[8] 3997
entry_point_offset_minus1[9] 4412
entry_point_offset_minus1[10] 5210
entry_point_offset_minus1[11] 6472
entry_point_offset_minus1[12] 6287
entry_point_offset_minus1[13] 5020
entry_point_offset_minus1[14] 5821
entry_point_offset_minus1[15] 5984

0x000164b3 Slice P, TRAIL_R 1 (213)
nal_unit_header
forbidden_zero_bit 0
nal_unit_type 1
nuh_layer_id 0
nuh_temporal_id_plus1 1
first_slice_segment_in_pic_flag 1
slice_pic_parameter_set_id 0
slice_type 1
slice_pic_order_cnt_lsb 17
short_term_ref_pic_set_sps_flag 0
num_negative_pics 1
num_positive_pics 0
delta_poc_s0_minus1[0] 16
used_by_curr_pic_s0_flag[0] 1
slice_temporal_mvp_enabled_flag 1
slice_sao_luma_flag 1
slice_sao_chroma_flag 1
num_ref_idx_active_override_flag 0
luma_log2_weight_denom 7
delta_chroma_log2_weight_denom -1
luma_weight_l0_flag[0] 0
chroma_weight_l0_flag[0] 0
five_minus_max_num_merge_cand 3
slice_qp_delta -3
slice_loop_filter_across_slices_enabled_flag 1
num_entry_point_offsets 16
offset_len_minus1 4
entry_point_offset_minus1[0] 18
entry_point_offset_minus1[1] 11
entry_point_offset_minus1[2] 12
entry_point_offset_minus1[3] 11
entry_point_offset_minus1[4] 14
entry_point_offset_minus1[5] 9
entry_point_offset_minus1[6] 8
entry_point_offset_minus1[7] 7
entry_point_offset_minus1[8] 6
entry_point_offset_minus1[9] 4
entry_point_offset_minus1[10] 8
entry_point_offset_minus1[11] 11
entry_point_offset_minus1[12] 11
entry_point_offset_minus1[13] 8
entry_point_offset_minus1[14] 9
entry_point_offset_minus1[15] 14

0x00016588 Slice B, TRAIL_R 2 (125)
nal_unit_header
forbidden_zero_bit 0
nal_unit_type 1
nuh_layer_id 0
nuh_temporal_id_plus1 1
first_slice_segment_in_pic_flag 1
slice_pic_parameter_set_id 0
slice_type 0
slice_pic_order_cnt_lsb 9
short_term_ref_pic_set_sps_flag 0
num_negative_pics 1
num_positive_pics 1
delta_poc_s0_minus1[0] 8
used_by_curr_pic_s0_flag[0] 1
delta_poc_s1_minus1[1] 7
used_by_curr_pic_s1_flag[1] 1
slice_temporal_mvp_enabled_flag 1
slice_sao_luma_flag 1
slice_sao_chroma_flag 1
num_ref_idx_active_override_flag 0
mvd_l1_zero_flag 0
collocated_from_l0_flag 0
five_minus_max_num_merge_cand 3
slice_qp_delta -2
slice_loop_filter_across_slices_enabled_flag 1
num_entry_point_offsets 16
offset_len_minus1 3
entry_point_offset_minus1[0] 10
entry_point_offset_minus1[1] 6
entry_point_offset_minus1[2] 5
entry_point_offset_minus1[3] 6
entry_point_offset_minus1[4] 6
entry_point_offset_minus1[5] 5
entry_point_offset_minus1[6] 4
entry_point_offset_minus1[7] 4
entry_point_offset_minus1[8] 3
entry_point_offset_minus1[9] 3
entry_point_offset_minus1[10] 4
entry_point_offset_minus1[11] 4
entry_point_offset_minus1[12] 4
entry_point_offset_minus1[13] 5
entry_point_offset_minus1[14] 4
entry_point_offset_minus1[15] 6

See how many features are missing from NVENC compared to x265?
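
A comparison like this can be automated by diffing the syntax element names in the two dumps. The following is only a sketch: it assumes the analyser output has the simple "name value" layout shown above, and the two strings are abbreviated excerpts of the P-slice headers in this post, not complete dumps.

```python
import re

# Abbreviated excerpts of the two P-slice headers listed above.
NVENC_P = """
slice_type 1
num_ref_idx_active_override_flag 0
cabac_init_flag 0
five_minus_max_num_merge_cand 0
slice_qp_delta -5
"""

X265_P = """
slice_type 1
slice_temporal_mvp_enabled_flag 1
slice_sao_luma_flag 1
slice_sao_chroma_flag 1
num_ref_idx_active_override_flag 0
luma_log2_weight_denom 7
five_minus_max_num_merge_cand 3
slice_qp_delta -3
num_entry_point_offsets 16
entry_point_offset_minus1[0] 18
"""

def field_names(dump):
    """Return the set of syntax element names in a 'name value' dump."""
    names = set()
    for line in dump.strip().splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip section headers such as 'nal_unit_header'
            names.add(re.sub(r"\[\d+\]$", "", parts[0]))  # drop the array index
    return names

only_x265 = sorted(field_names(X265_P) - field_names(NVENC_P))
only_nvenc = sorted(field_names(NVENC_P) - field_names(X265_P))
print("only in x265: ", only_x265)
print("only in NVENC:", only_nvenc)
```

The elements that show up only on the x265 side are conditional in the HEVC slice header syntax: the SAO flags, slice_temporal_mvp_enabled_flag, the prediction weight table and the entry point offsets are only written when the corresponding tool is enabled in the parameter sets, so their absence suggests the tool is switched off in that stream.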

....now if only someone can post QSV HEVC sample....

x265_Project

17th July 2016, 17:21

Intel's next generation, Kaby Lake (Xeon only), which is due sometime in late 2016/Q1 2017, is supposed to bring AVX512 to the table. I wonder how long it will take until x265 can take advantage of that instruction set? ;)
My understanding is that AVX512 will only be available in Skylake Xeons (code named Purley). Support for AVX512 in x265 depends on when we have access to working samples, and enough funding to cover the optimization effort. Of course, contributions are also welcomed.

If I understand correctly, Kaby Lake is a consumer (Core i7, i5, i3, etc.) refresh of Skylake, with more powerful graphics and Quicksync hardware. It won't support AVX512. From news reports, Purley Xeons that support AVX512 will be fairly massive, using a new LGA 3647 socket.
http://www.tomshardware.com/news/intel-xeon-skylake-purley-cpu,31980.html

littlepox

17th July 2016, 17:36

pingfr

17th July 2016, 17:38

My understanding is that AVX512 will only be available in Skylake Xeons (code named Purley).

Yuppers.

Want bleeding top-end features? Pay top notch dollars for 'em or get shafted.

<rant>
This is what happens when large companies such as Intel get monopoly.
</rant>

pingfr

17th July 2016, 17:42

I'm not so concerned about speed yet. For x264/x265, speed, quality and bit-rate are always linked together, having one improved automatically improves the other two:

Previously we mostly used --preset slower to guarantee the quality/bit-rate, but since our tests suggest x265 has improved a lot, after some tuning we decided to adopt --preset slow. Now we can encode faster, smaller and better (in quality) than before.

Given the current situation of x265, the focus should still be the quality optimization and tuning exploration, at least my team and I find it most useful and efficient in practice.

A typical movie is usually shot at 23fps to 25fps, I think it's safe to say all of us would be satisfied with x265 the day we manage to encode a movie at the 25fps rate with either --preset slower/veryslow/placebo with a 25% to 50% compression ratio over a x264 counterpart encode.

Yeah okay I think I'm dreaming awake here.. :)

littlepox

17th July 2016, 17:48

A typical movie is usually shot at 23fps to 25fps, I think it's safe to say all of us would be satisfied with x265 the day we manage to encode a movie at the 25fps rate with either --preset slower/veryslow/placebo with a 25% to 50% compression ratio over a x264 counterpart encode.

Yeah okay I think I'm dreaming awake here.. :)

Try XviD or Easy Real Producer (RV10 encoder). I think the ultra settings can easily do 20+fps with i7 CPUs, and they are superior to their MPEG2 counterparts, with a 20+% better compression ratio.

But are you willing to use them seriously nowadays?:p

x265_Project

17th July 2016, 17:49

I see....cause I wanna verify the types of Intra + Inter PU sizes, SAO and 64x64 CU size support for QSV HEVC encodes.

NVENC HEVC shows:

Intra PU sizes
4x4
8x8
16x16
32x32

Inter PU sizes
8x8
8x16
8x32
16x8
16x16
16x32
24x32
32x8
32x16
32x24
32x32

Meanwhile, x265 (Very Fast preset) shows:
Intra PU sizes
4x4
8x8
16x16
32x32

Inter PU sizes
8x8
16x16
32x32
64x64

I wonder why NVENC and x265 have different statistics for Inter PU sizes. Makes me wonder about QSV HEVC too....

CU, TU and PU sizes are controlled by several x265 parameters.
--ctu specifies the maximum CU size (the CTU size). Default is 64x64. Making --ctu smaller limits the range of CU sizes that x265 will evaluate and encode, allowing encoding to go faster, but decreasing encoding efficiency.
--min-cu-size specifies the minimum CU size. Default is 8x8. Making --min-cu-size larger limits the range of CU sizes that x265 will evaluate and encode, allowing encoding to go faster, but decreasing encoding efficiency.
--rect allows x265 to analyze and encode rectangular partitions (Nx2N or 2NxN - 16x32, 32x16, 8x16, 16x8, etc.) Default is --no-rect, meaning that x265 will only evaluate square partitions.
--amp allows x265 to analyze and encode asymmetric partitions (2N×nU, 2N×nD, nL×2N, and nR×2N - 32x8, 32x24, 8x32, 24x32, etc.)
--tu-inter-depth determines how far down the quad tree x265 will analyze and encode transform units (TUs) in inter-coded CUs. A higher value will allow smaller transform units to be evaluated and encoded, providing for higher encoding efficiency, but taking longer to evaluate.
--tu-intra-depth determines how far down the quad tree x265 will analyze and encode transform units (TUs) in intra-coded CUs. A higher value will allow smaller transform units to be evaluated and encoded, providing for higher encoding efficiency, but taking longer to evaluate.

--rect and --amp are off for --preset veryfast. If you analyze the results of x265 --preset veryslow or placebo, you'll see the full range of partition shapes and sizes.
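
To see how --rect and --amp account for the extra sizes in the NVENC list, here is a small sketch that enumerates the inter partition shapes of a single square CU. The function name is made up for illustration, and the model is deliberately simplified: it ignores x265's RD-level-dependent restrictions and assumes AMP only applies to CUs of 16x16 and larger.

```python
def inter_pu_shapes(cu, rect=False, amp=False):
    """Distinct (width, height) inter PU shapes for one square CU of size `cu`.

    Simplified model, not x265's actual mode decision logic.
    """
    shapes = {(cu, cu)}                                # 2Nx2N
    if rect:
        shapes |= {(cu, cu // 2), (cu // 2, cu)}       # 2NxN, Nx2N
    if amp and cu >= 16:                               # assumption: no AMP on 8x8 CUs
        quarter = cu // 4
        shapes |= {(cu, quarter), (cu, cu - quarter),  # 2NxnU, 2NxnD
                   (quarter, cu), (cu - quarter, cu)}  # nLx2N, nRx2N
    return sorted(shapes)

# Square-only analysis, as with --preset veryfast:
for cu in (8, 16, 32, 64):
    print(cu, inter_pu_shapes(cu))

# With both options on, a 32x32 CU yields the rectangular and asymmetric shapes:
print(inter_pu_shapes(32, rect=True, amp=True))
```

With both options off, only 8x8, 16x16, 32x32 and 64x64 appear, which matches the veryfast statistics quoted above; turning them on for a 32x32 CU adds 32x16, 16x32, 32x8, 32x24, 8x32 and 24x32.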

For more information on how HEVC blocks are partitioned, see http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.352.1947&rep=rep1&type=pdf

littlepox

17th July 2016, 17:53

CU, TU and PU sizes are controlled by several x265 parameters.
--ctu specifies the maximum CU size (the CTU size). Default is 64x64. Making --ctu smaller limits the range of CU sizes that x265 will evaluate and encode, allowing encoding to go faster, but decreasing encoding efficiency.
--min-cu-size specifies the minimum CU size. Default is 8x8. Making --min-cu-size larger limits the range of CU sizes that x265 will evaluate and encode, allowing encoding to go faster, but decreasing encoding efficiency.
--rect allows x265 to analyze and encode rectangular partitions (Nx2N or 2NxN - 16x32, 32x16, 8x16, 16x8, etc.) Default is --no-rect, meaning that x265 will only evaluate square partitions.
--amp allows x265 to analyze and encode asymmetric partitions (2N×nU, 2N×nD, nL×2N, and nR×2N - 32x8, 32x24, 8x32, 24x32, etc.)
--tu-inter-depth determines how far down the quad tree x265 will analyze and encode transform units (TUs) in inter-coded CUs. A higher value will allow smaller transform units to be evaluated and encoded, providing for higher encoding efficiency, but taking longer to evaluate.
--tu-intra-depth determines how far down the quad tree x265 will analyze and encode transform units (TUs) in intra-coded CUs. A higher value will allow smaller transform units to be evaluated and encoded, providing for higher encoding efficiency, but taking longer to evaluate.

--rect and --amp are off for --preset veryfast. If you analyze the results of x265 --preset veryslow or placebo, you'll see the full range of partition shapes and sizes.

Speaking of these, our tests recently turned up some weird artifacts when limiting --ctu 32 --max-tu-size 16; they are unlikely to be caused by any "encoding efficiency decrease".

We're testing this further and will probably report an issue once we are able to pin the problem down.

burfadel

17th July 2016, 17:54

AVX-512 refers to several groups of instructions. AVX-512 will supposedly be available on Cannonlake, but only some groups. Zen or Zen+ may introduce other features.

x265_Project

17th July 2016, 18:04

Yuppers.
Want bleeding top-end features? Pay top notch dollars for 'em or get shafted.

Implementing support for a new Single Instruction Multiple Data (SIMD) instruction set isn't just a "feature", as it is in software. It isn't trivial to design and implement a new SIMD instruction set. It involves building logic units with a whole new architecture that can operate on data words that are twice as wide as before (512 bits instead of 256 bits), in a single clock cycle! This is hard stuff, man. It takes a lot of silicon, and a lot of power to do twice as much work per clock cycle as you were able to do in the previous generation.
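
As a rough illustration of what "twice as wide" means for a video encoder, the sketch below counts how many pixel values fit in one register. The register widths are those of SSE, AVX2 and AVX-512; the assumption that 8-bit samples occupy 8 bits and 10-bit samples are stored in 16 bits is only for illustration and says nothing about how x265's kernels are actually written.

```python
def lanes(register_bits, element_bits):
    """Number of elements of `element_bits` that fit in one SIMD register."""
    return register_bits // element_bits

for name, bits in (("SSE", 128), ("AVX2", 256), ("AVX-512", 512)):
    print(f"{name:8s} {lanes(bits, 8):3d} x 8-bit samples, "
          f"{lanes(bits, 16):3d} x 16-bit samples")
```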

JohnLai

17th July 2016, 18:33

CU, TU and PU sizes are controlled by several x265 parameters.
--ctu specifies the maximum CU size (the CTU size). Default is 64x64. Making --ctu smaller limits the range of CU sizes that x265 will evaluate and encode, allowing encoding to go faster, but decreasing encoding efficiency.
--min-cu-size specifies the minimum CU size. Default is 8x8. Making --min-cu-size larger limits the range of CU sizes that x265 will evaluate and encode, allowing encoding to go faster, but decreasing encoding efficiency.
--rect allows x265 to analyze and encode rectangular partitions (Nx2N or 2NxN - 16x32, 32x16, 8x16, 16x8, etc.) Default is --no-rect, meaning that x265 will only evaluate square partitions.
--amp allows x265 to analyze and encode asymmetric partitions (2N×nU, 2N×nD, nL×2N, and nR×2N - 32x8, 32x24, 8x32, 24x32, etc.)
--tu-inter-depth determines how far down the quad tree x265 will analyze and encode transform units (TUs) in inter-coded CUs. A higher value will allow smaller transform units to be evaluated and encoded, providing for higher encoding efficiency, but taking longer to evaluate.
--tu-intra-depth determines how far down the quad tree x265 will analyze and encode transform units (TUs) in intra-coded CUs. A higher value will allow smaller transform units to be evaluated and encoded, providing for higher encoding efficiency, but taking longer to evaluate.

--rect and --amp are off for --preset veryfast. If you analyze the results of x265 --preset veryslow or placebo, you'll see the full range of partition shapes and sizes.

Oh....I see...using preset very slow (or --rect --amp) does show all the shapes and sizes. It is really slow though....1 fps...ouch. I guess there is a trade-off between speed, quality and efficiency.

Interestingly....in a simple comparison with NVENC (all other things being equal), even without SAO, --rect, --amp and B-frames being used, x265 still outperforms NVENC in terms of quality per bit in CQP mode (except for encoding speed XD)

Khun_Doug

17th July 2016, 21:41

XVID nowadays, with HD source?? Just say no. Really.

So on this matter of speed, I have been testing some of the various suggestions for the as yet unimplemented --tune film recommendations. I'm using an i7 3.4 GHz with RAID data space and an SSD for the OS, and an AMD HD 7850 GPU. Generally I have never seen this machine be slow. And it will be replaced with something faster sometime this year, and retired to be a Kodi media center box.

I tried encoding a 4:32 color HD sample. In X264 using CRF 20, preset slow and tune film, the encode was 9 minutes. Using X265 2.0+4 64 bit 10 bit, this same clip using CRF 20 and preset slow took 45 minutes. When I added some of the suggested tweaks the encode jumped to more than 80 minutes. I watched the encoder report fps of ~ 1.5.
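
As a sanity check, the encode times above can be converted into average frames per second. This is only a sketch: the post does not state the clip's frame rate, so 24000/1001 (23.976) fps is an assumption, and the function name is made up for illustration.

```python
from fractions import Fraction

CLIP_SECONDS = 4 * 60 + 32      # the 4:32 sample
FPS = Fraction(24000, 1001)     # assumed source frame rate (23.976 fps), not stated in the post

def encode_fps(encode_minutes):
    """Average encoding speed in frames per second for the sample clip."""
    frames = CLIP_SECONDS * FPS
    return float(frames / (encode_minutes * 60))

for label, minutes in (("x264 slow", 9), ("x265 slow", 45), ("x265 slow + tweaks", 80)):
    print(f"{label:20s} {encode_fps(minutes):5.2f} fps")
```

Under that assumption, the three encodes work out to roughly 12 fps, 2.4 fps and 1.4 fps, which is in the same ballpark as the ~1.5 fps the encoder reported for the tweaked run.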

I can safely say that, visually, the X264 and X265 (without the extra tweaks) at CRF 20 looked so nearly identical that the difference was trivial, to my eyes. I tried using 2 pass encodes and noticed the same thing. But the encode speeds were so slow using X265 that I estimated encoding the full HD source would take roughly 48 hours.

If it matters what tools I am using, I have been staying with Hybrid for the X265 tests since it allows me to use whatever .exe I choose. Handbrake is my favorite for X264 HD encodes, but doesn't have the latest X265 libraries.

herbert

17th July 2016, 22:41

mandarinka

17th July 2016, 23:01

https://mailman.videolan.org/pipermail/x265-devel/2016-July/010532.html

--qpmin and --qpmax are coming it seems. Thanks for that, MCW, I used the latter option a lot with x264 (and requested it for x265). Not sure when I'll get to try how useful it is for high quality encodes these days though, it will be tough (and less time than in the past for that...). Too bad that libavcodec doesn't support QP visualizations for HEVC like with H.264 I think. The free stream analyser IIRC also doesn't show block QPs.

benwaggoner

17th July 2016, 23:09

A typical movie is usually shot at 23fps to 25fps, I think it's safe to say all of us would be satisfied with x265 the day we manage to encode a movie at the 25fps rate with either --preset slower/veryslow/placebo with a 25% to 50% compression ratio over a x264 counterpart encode.
Film==24p. It might get slowed down to 23.976 for NTSC or sped up to 25p for PAL, but 99.99% of all "filmed" content is shot 24p.
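
For reference, the exact rates behind those numbers can be worked out with rational arithmetic. This is a small sketch using only the standard library; the variable names are arbitrary.

```python
from fractions import Fraction

film = Fraction(24)                    # native film rate, 24p
ntsc = film * Fraction(1000, 1001)     # NTSC slow-down: 24000/1001
pal = Fraction(25)                     # PAL speed-up target

print(f"NTSC rate: {ntsc} = {float(ntsc):.3f} fps")
print(f"PAL speed-up: {float(pal / film - 1):.2%} faster playback")
```

So "23.976" is really 24000/1001 fps, a 0.1% slow-down, while the PAL transfer plays the film 1/24 (about 4.17%) faster.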

Are you talking 1080p24 above? I believe --preset slower in real time is certainly achievable on a modern multicore CPU. It might require a dual-socket system and AVX2 instructions.

pingfr

17th July 2016, 23:44

Are you talking 1080p24 above? I believe slower in realtime is certainly achievable on a modern multicore CPU. It might require a dual socket system and AVX2 instructions.

Welp, I'm on a quad-socket system (with AVX but without AVX2) and God no, no matter what preset I use, I can see two things happening:

- Average speed is about 15fps, nowhere near real-time encoding.
- The cores/threads are under-utilized by a single running instance of x265.

And yes, source content is 1080p23.976, preset used is either slower or veryslow using littlepox's homebrewed "--tune film" parameters and x265 is 1.9+200'ish.

With these settings I can neither saturate the 60 available cores nor encode in real time.

However, as I either get rid of littlepox's tweaks or start using faster presets such as medium, I see faster encoding and can watch core utilization/load go up dramatically. Real time is most likely doable at the veryfast/ultrafast presets... at the cost of quality.

In that case, the resulting encode looks far worse though.

If it was just me, I'd stress that further NUMA optimization (or a full rewrite and a different approach) would be a necessity to fully harness the potential power of x265, but that's just me.

microchip8

18th July 2016, 00:36

@pingfr

Have you tried using --pme and --pmode?

benwaggoner

18th July 2016, 02:50

Welp, I'm on a quad-socket system (with AVX but without AVX2) and God no, no matter what preset I use, I can see two things happening:

- Average speed is about 15fps, nowhere near real-time encoding.
- The cores/threads are under-utilized by a single running instance of x265.
Quad-socket is definitely sub-optimal for a single 1080p stream. Dual-sockets will easily saturate the threads and offer better per-core performance, so net throughput is quite a bit better. Some of the newer 8-core single socket enthusiast Broadwell parts might actually be the optimal for throughput at 1080p.

As was suggested --pmode will almost certainly help in your use case, and perhaps even --pme with that many threads. Even when encoding on my quad-core Haswell laptop, I need to use --pmode to get fast performance for sub-SD frame sizes at the higher presets.

If it was just me, I'd stress that further NUMA optimization (or a full rewrite and a different approach) would be a necessity to fully harness the potential power of x265, but that's just me.
x265 is way better NUMA optimized than x264, and has a lot of features to manage NUMA well, especially when doing parallel encoding.

It's just that a lot of parallelization options wind up having a quality impact, so they're off in the higher presets.

littlepox

18th July 2016, 03:59

Thanks again, littlepox.

Depending on the chosen speed preset the following options get enabled or disabled:

--no-rskip --limit-modes --rect

Do you have any data as to the usefulness of the above options yet?

--limit-modes is enabled by default at slower or above;
--no-rskip is intolerably slow with little improvement.

littlepox

18th July 2016, 06:25

--tune film Update:

--crf 18 --ctu 32 --pbratio 1.2 --cbqpoffs -3 --crqpoffs -3 --no-sao --subme 3 --b-intra --no-amp --weightb --aq-mode 3 --aq-strength 0.9 --rd 4 --psy-rd 2.5 --psy-rdoq 4.0 --rdoq-level 2 --rc-lookahead 80 --qcomp 0.65 --no-strong-intra-smoothing --limit-modes

Reintroduce --ctu 32, since further tests suggest the buggy option is --max-tu-size 16, while --ctu 32 is innocent.
Increase psy from 2.0:3.0 to 2.5:4.0, which should further reduce blurriness, enhancing visual quality. (Bit-rate should increase as well, but it is worth it.)
Add --limit-modes in case someone enables --rect at preset medium (which I strongly advise AGAINST; --rect should only be used at preset slower or above).

At the same time, here we go for a --tune animation:

--crf 18 --ctu 32 --ref 4 --bframes 6 --pbratio 1.2 --cbqpoffs -3 --crqpoffs -3 --no-sao --subme 3 --b-intra --no-amp --weightb --aq-mode 3 --aq-strength 0.8 --rd 4 --psy-rd 1.8 --psy-rdoq 1.5 --rdoq-level 2 --rc-lookahead 80 --qcomp 0.65 --no-strong-intra-smoothing --limit-modes
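
Since the two option strings are long and nearly identical, a quick way to see what actually differs is to parse them and compare. This is just a sketch; the helper name and the "<absent>" marker are made up for illustration, and the parser assumes every value follows its option as a separate token.

```python
import shlex

FILM = ("--crf 18 --ctu 32 --pbratio 1.2 --cbqpoffs -3 --crqpoffs -3 --no-sao "
        "--subme 3 --b-intra --no-amp --weightb --aq-mode 3 --aq-strength 0.9 "
        "--rd 4 --psy-rd 2.5 --psy-rdoq 4.0 --rdoq-level 2 --rc-lookahead 80 "
        "--qcomp 0.65 --no-strong-intra-smoothing --limit-modes")

ANIMATION = ("--crf 18 --ctu 32 --ref 4 --bframes 6 --pbratio 1.2 --cbqpoffs -3 "
             "--crqpoffs -3 --no-sao --subme 3 --b-intra --no-amp --weightb "
             "--aq-mode 3 --aq-strength 0.8 --rd 4 --psy-rd 1.8 --psy-rdoq 1.5 "
             "--rdoq-level 2 --rc-lookahead 80 --qcomp 0.65 "
             "--no-strong-intra-smoothing --limit-modes")

def parse_opts(cmdline):
    """Map each --option to its value, or to None for a bare switch."""
    tokens = shlex.split(cmdline)
    opts = {}
    i = 0
    while i < len(tokens):
        # A following token is a value unless it is itself a --option
        # (this keeps negative numbers such as -3 as values).
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
        opts[tokens[i]] = tokens[i + 1] if has_value else None
        i += 2 if has_value else 1
    return opts

film, anim = parse_opts(FILM), parse_opts(ANIMATION)
diff = {
    name: (film.get(name, "<absent>"), anim.get(name, "<absent>"))
    for name in sorted(set(film) | set(anim))
    if film.get(name, "<absent>") != anim.get(name, "<absent>")
}
for name, (f, a) in diff.items():
    print(f"{name:14s} film={f:9s} animation={a}")
```

The animation set only adds --ref 4 and --bframes 6 and lowers --aq-strength, --psy-rd and --psy-rdoq; everything else is shared with the film set.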

mandarinka

18th July 2016, 12:39

What sorts of artifacts do you get from --max-tu-size 16? I encoded some stuff with may/june builds so maybe I could look for them.

Barough

18th July 2016, 16:26

x265-v2.0+5-98a948623fdc (http://www91.zippyshare.com/v/MZaPppbS/file.html) (MSYS/MinGW, GCC 5.4.0, 32 & 64bit 8/10/12bit multilib EXEs)

benwaggoner

19th July 2016, 00:05

--tune film Update:

--crf 18 --ctu 32 --pbratio 1.2 --cbqpoffs -3 --crqpoffs -3 --no-sao --subme 3 --b-intra --no-amp --weightb --aq-mode 3 --aq-strength 0.9 --rd 4 --psy-rd 2.5 --psy-rdoq 4.0 --rdoq-level 2 --rc-lookahead 80 --qcomp 0.65 --no-strong-intra-smoothing --limit-modes

Reintroduce --ctu 32, since further tests suggest the buggy option is --max-tu-size 16, while --ctu 32 is innocent.
Increase psy from 2.0:3.0 to 2.5:4.0, which should further reduce blurriness, enhancing visual quality. (Bit-rate should increase as well, but it is worth it.)
Add --limit-modes in case someone enables --rect at preset medium (which I strongly advise AGAINST; --rect should only be used at preset slower or above).

For film, what about --ctu 64 --rdpenalty 1 --qg-size 32? That'll still allow those big 32x32 intra blocks, but bias strongly against them so they only get used when they really pay off, and still give you adaptive quant at the 32x32 level.

As for --limit-modes, that only does something when --amp or --rect are on anyway, which is to make them a lot faster with a tiny reduction in efficiency.

Also, note there's an issue in --aq-mode 3 currently which causes it to use WAY more bits in CRF mode than --aq-mode 2.

benwaggoner

19th July 2016, 00:06

What sorts of artifacts do you get from --max-tu-size 16? I encoded some stuff with may/june builds so maybe I could look for them.
I wouldn't expect artifacts so much as reduced compression efficiency.

K.i.N.G

19th July 2016, 08:55

Wow, that '--no-sao' switch drastically increased detail (grain) retention for me!

Can anyone more knowledgeable on here tell me what it does? Google results get kinda too technical for me (I don't understand the terminology).

RainyDog

19th July 2016, 08:58

Also, note there's an issue in --aq-mode 3 currently which causes it to use WAY more bits in CRF mode than --aq-mode 2.

Hasn't --aq-mode 3 always been like this even with x264?

RainyDog

19th July 2016, 09:04

gamebox

19th July 2016, 10:35

Barough

19th July 2016, 10:43

x265-v2.0+8-669dc9bfe7eb (http://www70.zippyshare.com/v/BEulNFhn/file.html) (MSYS/MinGW, GCC 5.4.0, 32 & 64bit 8/10/12bit multilib EXEs)

K.i.N.G

19th July 2016, 12:30

@K.i.N.G: SAO is an encoding step that tries to compensate against some typical encoding artifacts seen in most MPEG compression technologies to date (h265 as well), by altering the image after it is compressed. It changes the way mostly edges look, trying to "iron" (smooth) some defects introduced by compression - the ones that are "expected" to be present in those areas. It does a very good job, but sadly seems to eliminate a lot of details by error too, as it can not define compression defects in an image "precisely" (it would take a lot of data to describe them), so it only has some "general impression" or "expectation" about the way they should look.

A small question about x265 "blurriness" most people complain about - has anyone tried pre-filtering the video before sending it to x265, by applying some sharpening to counter-effect the issue? To me, problem seems motion estimation related - defining further steps in sub-pixel motion precision over previous technologies increased efficiency but introduced many calculation steps that ended up smoothing the pixels too much. That's just an amateur guess, I might be wrong. The issue of x265 sharpness seems to influence "detail clarity" as well as (or even more than) "edge contrast" - it seems to need something like HDR filter in preprocessing as well to improve detail "amplitudes" by actually "over-accenting" them before compression takes place.

Hey thanks for the very clear explanation. :)

I don't know about other people, but to my eyes, without SAO it still looks a lot better than x264 at the same bitrate (around 4500 kb/s) and I don't notice any prominent artifacts (far fewer than in the x264 encode, for sure). At least, on my laptop screen it does... I'll be testing it later on my TV.

To me it looks like the SAO switch could use some more dev-tweaking and/or should only be used at very low bitrates... It is way too aggressive right now, imho.

I'm using a slightly tuned 'slower' preset, so it is a lot slower than x264 (it's encoding 1080p at 1.34fps right now).

I'll be testing some more and try to get it faster. If things keep looking this promising, this will probably make me switch over to x265.

I'm using version 1.9+200-6098ba3e0cf16b11 (with staxrip).
I'd like to test v2.0, but I'm still waiting for staxrip to update and have an optimized UI for v2.0.

Barough

19th July 2016, 13:44

Regarding --aq-mode 3, I have to say that it can sometimes do more 'harm' than good. On some of my darker test videos I have noticed, on and off, a kind of 'ghosting' at the edges of dark and very dark areas. With --aq-mode 2 there has been no 'ghosting' about 60% of the time.

Preset : Medium
CRF : 19-22
Custom command lines used :
--me 3 --ref 4 --aq-mode 2 --aq-strength 2 --rdoq-level 1 --psy-rdoq 4 --rc-lookahead 80
--me 3 --ref 4 --ctu 32 --no-sao --aq-mode 3 --aq-strength 2 --rdoq-level 1 --psy-rdoq 4 --rc-lookahead 80

gamebox

19th July 2016, 15:37

@ K.I.N.G. You're welcome :)

You will not notice compression artifacts in x265 as often and as easily as in x264 partially because x265 encodes video differently. Artifacts in 64x64 and 32x32 image blocks are different than those in x264, slightly "dissolved" and sometimes covering greater areas.

Apart from that, your bitrate can be considered "generous", and not many artifacts should appear. I typically encode at bitrates closer to the limits of "acceptable" image quality and "sane" quants - namely 3 Mbps for 1080p - and I could clearly see benefits from SAO. SAO will probably be very useful for broadcasting and streaming, as it will "mask" disturbing artifacts during high-motion scenes, when limits on data throughput force the encoder to temporarily use extreme levels of compression.
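
One rough way to put the two bitrates mentioned here on a common scale is bits per pixel. A minimal Python sketch, assuming 1920x1080 frames and a frame rate of 24 fps (the frame rate is not stated in the thread, so it is only a placeholder); the function name is made up for illustration:

```python
def bits_per_pixel(bitrate_bps, width=1920, height=1080, fps=24.0):
    """Average number of coded bits spent per pixel per frame."""
    return bitrate_bps / (width * height * fps)

# Bitrates quoted in the thread; the 24 fps frame rate is an assumption.
tight = bits_per_pixel(3_000_000)      # "3 Mbps for 1080p"
generous = bits_per_pixel(4_500_000)   # "around 4500 kb/s"

print(f"3.0 Mbps: {tight:.3f} bpp")
print(f"4.5 Mbps: {generous:.3f} bpp")
print(f"ratio: {generous / tight:.2f}x")
```

Whatever frame rate is assumed, the 4500 kb/s encode gets 1.5 times as many bits per pixel, which fits the observation that it shows fewer artifacts for SAO to hide.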

littlepox

19th July 2016, 16:08

OK, here is a good demo of x265's --max-tu-size 16 bug:

https://www.dropbox.com/sh/slftjn7meozs1f3/AADokCEEhMm_aW4IEE-V6P3ba?dl=0

source.mkv is the source with 99 frames; denoted as source.
dft_20.hevc is encoded with --preset slow --no-sao --crf 20; denoted as A
dft_19.5_maxtu16.hevc is encoded with --preset slow --no-sao --crf 19.5 --max-tu-size 16; denoted as B

The size difference between them is 0.6%.

In many frames (e.g. frames 17, 21, 63, 83...) you see unpleasant patterns in the flat areas of B, like worms on the wall, especially noticeable where the grain is lost and the area is clean. In A the situation is much better.

This is A (part of frame 83):
http://img.2222.moe/images/2016/07/19/A.png

This is B (part of frame 83):
http://img.2222.moe/images/2016/07/19/B.png

I asked my friend to file an issue at https://bitbucket.org/multicoreware/x265/issues?status=new&status=open (done)

littlepox

19th July 2016, 16:39

For film, what about --ctu 64 --rdpenalty 1 --qg-size 32? That'll still allow those big 32x32 intra blocks, but bias strongly against them so they only get used when they really pay off, and still give you adaptive quant at the 32x32 level.

After clearing the --max-tu-size bug, we tested various possibilities such as --ctu 32 and --ctu 64 with --rdpenalty 0/1/2.
We found no difference between them, either quality-wise or size-wise, at least with our test settings.
The only difference is that --ctu 32 is faster and parallelizes better; that's why it has been reintroduced.

Also, note there's an issue in --aq-mode 3 currently which causes it to use WAY more bits in CRF mode than --aq-mode 2.
That's why we suggest adjusting aq-strength by -0.2 for --aq-mode 3 and by +0.2 for --aq-mode 2. With the reduced strength, the bitrate increase is acceptable and worth it.
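
The suggested adjustment can be written down as a small helper. This is only a sketch: it assumes the offsets are applied to x265's default --aq-strength of 1.0 (the post does not say which base value is meant), and the function and table names are made up for illustration.

```python
# Offsets suggested in the post; the base value of 1.0 is an assumption.
AQ_STRENGTH_OFFSET = {2: +0.2, 3: -0.2}

def suggested_aq_strength(aq_mode, base=1.0):
    """Return the base strength shifted by the per-mode offset (0 for other modes)."""
    return round(base + AQ_STRENGTH_OFFSET.get(aq_mode, 0.0), 2)

def aq_options(aq_mode, base=1.0):
    """Build the matching x265 command-line options as a list of strings."""
    strength = suggested_aq_strength(aq_mode, base)
    return ["--aq-mode", str(aq_mode), "--aq-strength", str(strength)]

print(" ".join(aq_options(2)))
print(" ".join(aq_options(3)))
```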

Regarding --aq-mode 3, I have to say that it can sometimes do more 'harm' than good. On some of my darker test videos I have noticed, on and off, a kind of 'ghosting' at the edges of dark and very dark areas. With --aq-mode 2 there has been no 'ghosting' about 60% of the time.

Preset : Medium
CRF : 19-22
Custom command lines used :
--me 3 --ref 4 --aq-mode 2 --aq-strength 2 --rdoq-level 1 --psy-rdoq 4 --rc-lookahead 80
--me 3 --ref 4 --ctu 32 --no-sao --aq-mode 3 --aq-strength 2 --rdoq-level 1 --psy-rdoq 4 --rc-lookahead 80

Having aq-strength outside [0.5, 1.5] is probably a bad idea; aq-strength=2 for mode=3 is WAY too large.
Also, you need to control your variables. I don't think it's reasonable to use different SAO settings for the test.
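
A controlled comparison along these lines could be set up by generating both command lines from one shared option list, so that only the option under test changes. A sketch in Python, reusing options quoted in this thread; the file names, the CRF value of 20 and the aq-strength of 1.0 are placeholders, and the range check simply encodes the [0.5, 1.5] rule of thumb above:

```python
# Options held constant for both encodes (taken from the command lines quoted above).
SHARED_OPTIONS = [
    "--preset", "medium", "--crf", "20",
    "--me", "3", "--ref", "4", "--no-sao",
    "--rdoq-level", "1", "--psy-rdoq", "4", "--rc-lookahead", "80",
]

def build_command(source, output, aq_mode, aq_strength=1.0):
    """Build an x265 command line; only aq_mode/aq_strength and the output vary."""
    if not 0.5 <= aq_strength <= 1.5:
        raise ValueError(f"aq-strength {aq_strength} is outside [0.5, 1.5]")
    return [
        "x265", "--input", source, "--output", output,
        *SHARED_OPTIONS,
        "--aq-mode", str(aq_mode), "--aq-strength", str(aq_strength),
    ]

def differing_positions(cmd_a, cmd_b):
    """Indices at which two equally long command lines differ."""
    return [i for i, (a, b) in enumerate(zip(cmd_a, cmd_b)) if a != b]

# Hypothetical file names; the source must be in a format x265 can read.
cmd_a = build_command("source.y4m", "aq2.hevc", aq_mode=2)
cmd_b = build_command("source.y4m", "aq3.hevc", aq_mode=3)

for cmd in (cmd_a, cmd_b):
    print(" ".join(cmd))
```

Checking `differing_positions(cmd_a, cmd_b)` before starting a long encode is a cheap way to confirm that nothing except the output name and the setting under test has changed.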

Barough

19th July 2016, 17:42

Having aq-strength outside [0.5, 1.5] is probably a bad idea; aq-strength=2 for mode=3 is WAY too large.
Also, you need to control your variables. I don't think it's reasonable to use different SAO settings for the test.

Still doing tests here. I have lowered the aq-strength to between 1.2 and 1.5. Will do a full encode of a dark movie tonight. Will see what my bookshelf has to offer.

Also waiting for a mate who works with video to get back to me. He's not that into HEVC yet, as I understand it, but his knowledge of video encoding is good.

littlepox

19th July 2016, 17:46

Still doing tests here. I have lowered the aq-strength to between 1.2 and 1.5. Will do a full encode of a dark movie tonight. Will see what my bookshelf has to offer.

Also waiting for a mate who works with video to get back to me. He's not that into HEVC yet, as I understand it, but his knowledge of video encoding is good.

I think 1.2~1.5 is still too high for --aq-mode 3. Personally, I'd use 0.9.

divxmaster

19th July 2016, 23:11

@K.i.N.G: SAO is an encoding step that tries to compensate against some typical encoding artifacts seen in most MPEG compression technologies to date (h265 as well), by altering the image after it is compressed. It changes the way mostly edges look, trying to "iron" (smooth) some defects introduced by compression - the ones that are "expected" to be present in those areas. It does a very good job, but sadly seems to eliminate a lot of details by error too, as it can not define compression defects in an image "precisely" (it would take a lot of data to describe them), so it only has some "general impression" or "expectation" about the way they should look.

A small question about x265 "blurriness" most people complain about - has anyone tried pre-filtering the video before sending it to x265, by applying some sharpening to counter-effect the issue? To me, problem seems motion estimation related - defining further steps in sub-pixel motion precision over previous technologies increased efficiency but introduced many calculation steps that ended up smoothing the pixels too much. That's just an amateur guess, I might be wrong. The issue of x265 sharpness seems to influence "detail clarity" as well as (or even more than) "edge contrast" - it seems to need something like HDR filter in preprocessing as well to improve detail "amplitudes" by actually "over-accenting" them before compression takes place.

@gamebox, I have tested sharpening before processing, and it makes a huge difference. Ironically, it comes from QTGMC in VapourSynth (I have just noticed that a few select CGI scenes in later seasons of DS9 are interlaced... ick!).
The side effect of the QTGMC deinterlacing is a strong sharpen, and it also seems to enhance colour. This increases bitrate by 25%, so I raised CRF from 21 to 22, but it still looks way better than no QTGMC/sharpen and has a slightly lower bitrate (5%) as a bonus.
DS9 specifically also really benefits from degraining: 'vid = haf.SMDegrain(vid, tr=3,thSAD=300,RefineMotion=True,contrasharp=True,pel=2)'
Params are --preset slow --rskip --crf 22 --no-sao --aq-mode 2 --rdoq-level 2 --psy-rd 2 --psy-rdoq 1.1 --qg-size 32 --tu-intra-depth 3 --merange 27 --weightb --early-skip --fast-intra --b-intra --tskip --tskip-fast --ref 6 --bframes 6 --max-merge 5 --min-keyint 23 --keyint 288 --deblock -1:-1 --no-open-gop
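
The bitrate figures in this post imply a rough size for the effect of one CRF step. A back-of-the-envelope sketch in Python, assuming the 5% saving is measured against the unfiltered CRF 21 encode (the post does not say so explicitly); all variable names are illustrative:

```python
# Bitrates normalized to the unfiltered encode at CRF 21.
baseline = 1.00
sharpened_crf21 = baseline * 1.25   # QTGMC/sharpen raises bitrate by 25%
sharpened_crf22 = baseline * 0.95   # assumed: 5% below the unfiltered baseline

# Bitrate factor attributable to going from CRF 21 to CRF 22 on the sharpened source.
crf_step_factor = sharpened_crf22 / sharpened_crf21

print(f"bitrate factor for CRF 21 -> 22: {crf_step_factor:.2f}")
print(f"reduction from one CRF step: {(1 - crf_step_factor) * 100:.0f}%")
```

Under that reading, raising CRF by one cut the bitrate by roughly a quarter on this particular source; other material may behave quite differently.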

nakTT

20th July 2016, 02:50

x265-v2.0+8-669dc9bfe7eb (http://www70.zippyshare.com/v/BEulNFhn/file.html) (MSYS/MinGW, GCC 5.4.0, 32 & 64bit 8/10/12bit multilib EXEs)
Hi,

Is this the same as the one that can be found on the webpage below?

https://builds.x265.eu

Barough

20th July 2016, 02:55

Hi,

Is this the same as the one that can be found on the webpage below?

https://builds.x265.eu

No. These are 'my' own compiles, made with the media-autobuild suite.
