View Full Version : x265 HEVC Encoder


jd17
11th July 2017, 07:37
Thanks for your answers!
It is pretty useless for conversion of "usual video" without a high bit depth already in the original video source. If you don't know what "color primaries" mean, don't attempt to use it.
Don't try to use it if you aren't encoding High Dynamic Range content.

One click on my link guys... ;)

I did encode a 10bit HDR video:
I encoded the "Samsung HDR Wonderland" demo, once with --hdr-opt and once without.

Source MediaInfo:
Video
ID : 1
Format : HEVC
Format/Info : High Efficiency Video Coding
Format profile : Main 10@L5.1@High
Codec ID : V_MPEGH/ISO/HEVC
Duration : 2 min 41 s
Bit rate : 45.7 Mb/s
Width : 3 840 pixels
Height : 2 160 pixels
Display aspect ratio : 16:9
Frame rate mode : Constant
Frame rate : 23.976 (24000/1001) FPS
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 10 bits
Bits/(Pixel*Frame) : 0.230
Stream size : 878 MiB (100%)
Writing library : ATEME Titan File 3.7.3 (4.7.3.1002)
Default : Yes
Forced : No
Color range : Limited
Color primaries : BT.2020
Transfer characteristics : SMPTE ST 2084
Matrix coefficients : BT.2020 non-constant
Mastering display color primaries : R: x=0.680000 y=0.320000, G: x=0.265000 y=0.690000, B: x=0.150000 y=0.060000, White point: x=0.312700 y=0.329000
Mastering display luminance : min: 0.0500 cd/m2, max: 1000.0000 cd/m2

My encode (HandBrake 1.0.7, x265 2.4):
CRF17, medium, no tune
CL custom: --no-sao --uhd-bd --hdr-opt --hrd --master-display "G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,500)"
(I would normally add --max-cll too, but the source does not include that information...)
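
For anyone wondering where the numbers in that --master-display string come from: the x265 documentation gives chromaticity coordinates in units of 0.00002 and luminance in units of 0.0001 cd/m2, in the order G, B, R, WP, L(max,min). A minimal Python sketch of the conversion from the MediaInfo values follows; the function name `master_display` is only illustrative and not part of any tool.

```python
def master_display(primaries, white_point, max_nits, min_nits):
    """Build an x265 --master-display string from CIE xy values and cd/m2 luminance."""
    def xy(name, point):
        x, y = point
        # Chromaticity is written in units of 0.00002, i.e. multiplied by 50000.
        return f"{name}({round(x * 50000)},{round(y * 50000)})"

    parts = [xy(channel, primaries[channel]) for channel in ("G", "B", "R")]
    parts.append(xy("WP", white_point))
    # Luminance is written in units of 0.0001 cd/m2, maximum first.
    parts.append(f"L({round(max_nits * 10000)},{round(min_nits * 10000)})")
    return "".join(parts)


# Values as reported by MediaInfo for the source file.
source_primaries = {"R": (0.680, 0.320), "G": (0.265, 0.690), "B": (0.150, 0.060)}
print(master_display(source_primaries, (0.3127, 0.3290), 1000.0, 0.05))
```

With the source values this reproduces the string used in the command line above.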

MediaInfo:
Video
ID : 1
Format : HEVC
Format/Info : High Efficiency Video Coding
Format profile : Main 10@L5.1@High
Codec ID : V_MPEGH/ISO/HEVC
Duration : 2 min 41 s
Bit rate : 22.2 Mb/s
Width : 3 840 pixels
Height : 2 160 pixels
Display aspect ratio : 16:9
Frame rate mode : Constant
Frame rate : 23.976 (24000/1001) FPS
Color space : YUV
Chroma subsampling : 4:2:0 (Type 2)
Bit depth : 10 bits
Bits/(Pixel*Frame) : 0.112
Stream size : 427 MiB (98%)
Writing library : x265 2.4+13-26963e98fa64:[Windows][MSVC 1910][64 bit] 10bit
Encoding settings : cpuid=1173503 / frame-threads=2 / numa-pools=4 / wpp / no-pmode / no-pme / no-psnr / no-ssim / log-level=2 / input-csp=1 /
input-res=3840x2160 / interlace=0 / total-frames=0 / level-idc=51 / high-tier=1 / uhd-bd=1 / ref=3 / no-allow-non-conformance / repeat-headers / annexb / aud / hrd /
info / hash=0 / no-temporal-layers / no-open-gop / min-keyint=1 / keyint=24 / bframes=4 / b-adapt=2 / b-pyramid / bframe-bias=0 / rc-lookahead=20 / lookahead-slices=8 /
scenecut=40 / no-intra-refresh / ctu=64 / min-cu-size=8 / no-rect / no-amp / max-tu-size=32 / tu-inter-depth=1 / tu-intra-depth=1 / limit-tu=0 / rdoq-level=0 /
dynamic-rd=0.00 / no-ssim-rd / signhide / no-tskip / nr-intra=0 / nr-inter=0 / no-constrained-intra / strong-intra-smoothing / max-merge=2 / limit-refs=3 /
no-limit-modes / me=1 / subme=2 / merange=57 / temporal-mvp / weightp / no-weightb / no-analyze-src-pics / deblock=0:0 / no-sao / no-sao-non-deblock / rd=3 /
no-early-skip / rskip / no-fast-intra / no-tskip-fast / no-cu-lossless / no-b-intra / rdpenalty=0 / psy-rd=2.00 / psy-rdoq=0.00 / no-rd-refine / analysis-mode=0 /
no-lossless / cbqpoffs=0 / crqpoffs=0 / rc=crf / crf=17.0 / qcomp=0.60 / qpstep=4 / stats-write=0 / stats-read=0 / vbv-maxrate=160000 / vbv-bufsize=160000 /
vbv-init=0.9 / crf-max=0.0 / crf-min=0.0 / ipratio=1.40 / pbratio=1.30 / aq-mode=1 / aq-strength=1.00 / cutree / zone-count=0 / no-strict-cbr / qg-size=32 /
no-rc-grain / qpmax=69 / qpmin=0 / no-const-vbv / sar=1 / overscan=0 / videoformat=5 / range=0 / colorprim=9 / transfer=16 / colormatrix=9 / chromaloc=1 /
chromaloc-top=2 / chromaloc-bottom=2 / display-window=0 / master-display=G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,500) / max-cll=0,0 /
min-luma=0 / max-luma=1023 / log2-max-poc-lsb=8 / vui-timing-info / vui-hrd-info / slices=1 / opt-qp-pps / opt-ref-list-length-pps / no-multi-pass-opt-rps /
scenecut-bias=0.05 / no-opt-cu-delta-qp / no-aq-motion / hdr / hdr-opt / no-dhdr10-opt / refine-level=5 / no-limit-sao
Default : Yes
Forced : No
Color range : Limited
Color primaries : BT.2020
Transfer characteristics : SMPTE ST 2084
Matrix coefficients : BT.2020 non-constant
Mastering display color primaries : R: x=0.680000 y=0.320000, G: x=0.265000 y=0.690000, B: x=0.150000 y=0.060000, White point: x=0.312700 y=0.329000
Mastering display luminance : min: 0.0500 cd/m2, max: 1000.0000 cd/m2

The encode without --hdr-opt essentially just resulted in a higher bitrate, everything else is identical:

Bit rate : 25.7 Mb/s
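
As a rough cross-check of those figures, the Bits/(Pixel*Frame) values MediaInfo prints can be recomputed from bitrate, resolution and frame rate, and the difference between the two encodes expressed as a percentage. This is only a small arithmetic sketch in Python; a lower bitrate at the same CRF does not by itself say anything about visible quality.

```python
def bits_per_pixel_frame(bitrate_mbps, width, height, fps):
    """Average number of bits spent per pixel per frame."""
    return bitrate_mbps * 1e6 / (width * height * fps)


FPS = 24000 / 1001  # 23.976 fps
encodes = (("source", 45.7), ("with --hdr-opt", 22.2), ("without --hdr-opt", 25.7))
for label, mbps in encodes:
    value = bits_per_pixel_frame(mbps, 3840, 2160, FPS)
    print(f"{label}: {value:.3f} bits/(pixel*frame)")

# Relative bitrate reduction of the --hdr-opt encode versus the one without it.
saving = (25.7 - 22.2) / 25.7
print(f"bitrate reduction with --hdr-opt: {saving:.1%}")
```

The first two results match the 0.230 and 0.112 shown by MediaInfo, and the reduction works out to about 13.6%.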


Also, would you be so kind as to "sanity check" those parameters?
Would you change anything for UHD HDR encoding? Is anything essential missing?

I think you also suggested before:
--chromaloc 2
--keyint 48
?

I try to read up on the parameters as much as I can, but everything UHD HDR is not really well documented... So I welcome any help and explanations! :)


--hdr-opt implements these recommendations. It's like a special type of adaptive quantization, which looks at the brightness level of each block of video. It improves encoding quality and efficiency for HDR content.

Does that mean there will be no visible degradation when I use --hdr-opt? :)

LigH
12th July 2017, 15:41
x265 2.4+99-3160e1a0cc5f (https://www.mediafire.com/file/6pddrk60ckzz3es/x265_2.4%2B99-3160e1a0cc5f.7z) (merge stable+default)

"Allocate frame threads based on available pool threads" ... + a Mac fix and preps for v2.5 milestone.

pingfr
12th July 2017, 15:53
x265 2.4+99-3160e1a0cc5f (https://www.mediafire.com/file/6pddrk60ckzz3es/x265_2.4%2B99-3160e1a0cc5f.7z) (merge stable+default)

"Allocate frame threads based on available pool threads" ... + a Mac fix and preps for v2.5 milestone.

The big pool thread patch merged-in a few days ago alone is 2.5 milestone worthy IMHO. ;)

LigH
12th July 2017, 17:12
Another little patch is about to be committed, and we will probably have it by the weekend. Optimistic guess.

pradeeprama
13th July 2017, 18:03
x265 version 2.5 has been released. It includes improvements to grain handling and an improved CSV logging feature, which is now built into the library.

Version 2.5 can be downloaded from here (md5: 192e54fa3068b594aa44ab2b703f071d). Full documentation is available at http://x265.readthedocs.io/en/stable/

Release Notes for Version 2.5
=======================

Encoder enhancements
--------------------------------
1. Improved grain handling with --tune grain option by throttling VBV operations to limit QP jumps.
2. Frame threads are now decided based on number of threads specified in the --pools, as opposed to the number of hardware threads available. The mapping was also adjusted to improve quality of the encodes with minimal impact to performance.
3. CSV logging feature (enabled by --csv) is now part of the library; it was previously part of the x265 application. Applications that integrate libx265 can now extract frame level statistics for their encodes by exercising this option in the library.
4. Globals that track min and max CU sizes, number of slices, and other parameters have now been moved into instance-specific variables. Consequently, applications that invoke multiple instances of x265 library are no longer restricted to use the same settings for these parameter options across the multiple instances.
x265 can now generate a separate library that exports the HDR10+ parsing API. Other libraries that wish to use this API may do so by linking against this library. Enable ENABLE_HDR10_PLUS in CMake options and build to generate this library.
5. SEA motion search receives a 10% performance boost from AVX2 optimization of its kernels.
6. The CSV log is now more elaborate with additional fields such as PU statistics, average-min-max luma and chroma values, etc. Refer to documentation of --csv for details of all fields.
7. x86inc.asm cleaned-up for improved instruction handling.

API changes
-----------------
1. New API x265_encoder_ctu_info() introduced to specify suggested partition sizes for various CTUs in a frame. To be used in conjunction with --ctu-info to react to the specified partitions appropriately.
2. Rate-control statistics passed through the x265_picture object for an incoming frame are now used by the encoder.
3. Options to scale, reuse, and refine analysis for incoming analysis shared through the x265_analysis_data field in x265_picture for runs that use --analysis-reuse-mode load; use options --scale, --refine-mv, --refine-inter, and --refine-intra to explore.
4. VBV now has a deterministic mode. Use --const-vbv to exercise.

Bug fixes
-------------
1. Several fixes for HDR10+ parsing code including incompatibility with user-specific SEI, removal of warnings, linking issues in linux, etc.
2. SEI messages for HDR10 repeated every keyint when HDR options (--hdr-opt, --master-display) specified.

Happy compressing!

LigH
13th July 2017, 21:39
Realistic guess, then! :D

x265 2.5+2-18fa144d453e (https://www.mediafire.com/file/x3r8zbukc5wsgwl/x265_2.5%2B2-18fa144d453e.7z) (merge with stable; new v2.5 milestone)

SEI payload message writing fixed

pingfr
13th July 2017, 23:11
Hah, I knew it! :devil:

Barough
15th July 2017, 17:58
x265 v2.5+3-3f6841d271e3 (http://ge.tt/5o9xpnl2) (GCC 7.1.0, 32 & 64-bit 8/10/12bit Multilib Windows Binaries)

x265 [info]: HEVC encoder version 2.5+3-3f6841d271e3
x265 [info]: build info [Windows][GCC 7.1.0][32/64 bit] 8bit+10bit+12bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2


https://bitbucket.org/multicoreware/x265/commits/branch/default

bxyhxyh
17th July 2017, 05:24
Would developers update todo list? Last update of that is 2016.01.04.

littlepox
17th July 2017, 05:32
Would developers update todo list? Last update of that is 2016.01.04.

I prefer they don't. There is the joke that for developers, todo = something I should do but I don't want to do, so I put it down just as a show.

Barough
22nd July 2017, 17:26
x265 v2.5+4-01a981f509ea (http://ge.tt/3hOyLtl2) (GCC 7.1.0, 32 & 64-bit 8/10/12bit Multilib Windows Binaries)

x265 [info]: HEVC encoder version 2.5+4-01a981f509ea
x265 [info]: build info [Windows][GCC 7.1.0][32/64 bit] 8bit+10bit+12bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2


https://bitbucket.org/multicoreware/x265/commits/branch/default

Midzuki
24th July 2017, 09:07
x265.exe 2.5+6-d11482e5fedb

https://forum.videohelp.com/threads/357754-%5BHEVC%5D-x265-EXE-mingw-builds?p=2492072&viewfull=1#post2492072

LigH
24th July 2017, 09:51
x265_2.5+6-d11482e5fedb (https://www.mediafire.com/file/efm6tv94lm5loz4/x265_2.5%2B6-d11482e5fedb.7z) (merge with stable)

fixes two memory leaks (threading, HDR10+), improves encoder reconfiguration, and allows forced output flushing:

--force-flush <integer> Force the encoder to flush frames. Default 0
0 - flush the encoder only when all the input pictures are over.
1 - flush all the frames even when the input is not over. Slicetype decision may change with this option.
2 - flush the slicetype decided frames only.

I guess this is mainly interesting for scenarios with changing parameters where a quick response is required?

jlpsvk
24th July 2017, 11:51
Any news about AVX-512 instructions support? :) Tomorrow my new lovely i7-7820X will arrive, so a testing volunteer is here. :D

NikosD
24th July 2017, 14:28
I'm really sorry to inform you that there is no real AVX512 support inside your CPU, according to reviews and specs (?)

Intel fused the two FMA AVX2 units into one FMA AVX512, so it's like the support of AVX2 by Zen core which has only AVX128 units.

Don't forget also that AVX512 clocks are a lot lower than AVX2 clocks and sometimes even lower than the base clock.

Also, optimizations for AVX512 and x264/x265 will be minimal regarding performance.

Threadripper CPU with a lot of cores and a very high clock, could be far more interesting.

Atak_Snajpera
24th July 2017, 14:43
Any news about AVX-512 instructions support? :) Tomorrow my new lovely i7-7820X will arrive, so a testing volunteer is here. :D

Slightly more expensive Threadripper@4GHz 1920x (12C/24T) will probably destroy i7-7820x in video encoding. It is odd that you didn't want to wait two weeks for Threadrippers. Intel is now very bad in price to performance ratio.

LigH
24th July 2017, 14:55
Intel fused the two FMA AVX2 units into one FMA AVX512, so it's like the support of AVX2 by Zen core which has only AVX128 units.

Don't forget also that AVX512 clocks are a lot lower than AVX2 clocks and sometimes even lower than the base clock.

That reminds me of the behaviour on AMD Phenom-II CPU's which are more or less capable of executing SSE3 instructions, but x264 and x265 refuse to enable them because their implementation is so slow (and possibly even incomplete?), thus the fastest instruction set for those old engines is "SSE2Fast". :o

So I would not be surprised if x265 may enable AVX512 instructions only on CPU's where their execution will be a benefit, for the same reason. :sly:

jlpsvk
25th July 2017, 09:57
That reminds me of the behaviour on AMD Phenom-II CPU's which are more or less capable of executing SSE3 instructions, but x264 and x265 refuse to enable them because their implementation is so slow (and possibly even incomplete?), thus the fastest instruction set for those old engines is "SSE2Fast". :o

So I would not be surprised if x265 may enable AVX512 instructions only on CPU's where their execution will be a benefit, for the same reason. :sly:

That's why I chose the i7-7820X. :)

Skylake-X should support these AVX512 instructions (in bold):
AVX-512-F: F for Foundation
AVX-512-BW: Support for 512-bit Word support
AVX-512-CD: Conflict Detect (loop vectorization with possible conflicts)
AVX-512-DQ: More instructions for double/quad math operations
AVX-512-ER: Exponential and Reciprocal
AVX-512-IFMA: Integer Fused Multiply Add with 52-bit precision
AVX-512-PF: Prefetch Instructions
AVX-512-VBMI: Vector Byte Manipulation Instructions
AVX-512-VL: Foundation plus <512-bit vector length support
AVX-512-4VNNIW: Vector Neural Network Instructions Word (variable precision)
AVX-512-4FMAPS: Fused Multiply Accumulation Packed Single precision
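
On Linux, one way to see which of these subsets a CPU actually reports is the flags line of /proc/cpuinfo. The sketch below maps the names from the list to the flag names the Linux kernel uses; the sample flags string is made up for illustration and was not captured from a real machine. Keep in mind this only shows whether the instructions are supported, not how fast the execution units behind them are.

```python
# AVX-512 subsets from the list above, mapped to Linux /proc/cpuinfo flag names.
AVX512_FLAGS = {
    "AVX-512-F": "avx512f",
    "AVX-512-BW": "avx512bw",
    "AVX-512-CD": "avx512cd",
    "AVX-512-DQ": "avx512dq",
    "AVX-512-ER": "avx512er",
    "AVX-512-IFMA": "avx512ifma",
    "AVX-512-PF": "avx512pf",
    "AVX-512-VBMI": "avx512vbmi",
    "AVX-512-VL": "avx512vl",
    "AVX-512-4VNNIW": "avx512_4vnniw",
    "AVX-512-4FMAPS": "avx512_4fmaps",
}


def avx512_subsets(flags_line):
    """Return the AVX-512 subsets whose flag appears in a cpuinfo flags string."""
    present = set(flags_line.split())
    return [name for name, flag in AVX512_FLAGS.items() if flag in present]


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the first 'flags' line of /proc/cpuinfo (Linux only), or ''."""
    with open(path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return line.split(":", 1)[1]
    return ""


# Hypothetical flags excerpt, used here instead of reading the real file.
sample = "fpu sse2 avx avx2 avx512f avx512dq avx512cd avx512bw avx512vl"
print(avx512_subsets(sample))
```

On a real Linux box, `avx512_subsets(read_cpu_flags())` would give the list for that machine.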

Atak_Snajpera
25th July 2017, 11:37
That's why I chose the i7-7820X. :)

Skylake-X should support these AVX512 instructions (in bold):
AVX-512-F: F for Foundation
AVX-512-BW: Support for 512-bit Word support
AVX-512-CD: Conflict Detect (loop vectorization with possible conflicts)
AVX-512-DQ: More instructions for double/quad math operations
AVX-512-ER: Exponential and Reciprocal
AVX-512-IFMA: Integer Fused Multiply Add with 52-bit precision
AVX-512-PF: Prefetch Instructions
AVX-512-VBMI: Vector Byte Manipulation Instructions
AVX-512-VL: Foundation plus <512-bit vector length support
AVX-512-4VNNIW: Vector Neural Network Instructions Word (variable precision)
AVX-512-4FMAPS: Fused Multiply Accumulation Packed Single precision

Did you know that...
1) AVX-512 instructions "generate" much more heat. Hence the negative AVX offset introduced by Intel.
2) Speed-up in x265 will most likely be much lower than SSEx.x vs AVX2.

Do not expect miracles in practice.

NikosD
25th July 2017, 11:44
That's why I chose the i7-7820X. :)

Skylake-X should support these AVX512 instructions (in bold):


Sorry to bother you again, but you clearly didn't understand my reply and certainly not Ligh's too.

It's not the support of instructions that matters but the implementation.
That's what I told you and that's exactly the same thing that Ligh told you.

Your Skylake-X supports AVX512 but not in a fast way because Intel enables a real AVX512 FMA unit only on 10 core and above.

Your CPU has a half speed implementation or slower.

But, probably your reply shows us why you chose to buy that CPU in the first place.

nevcairiel
25th July 2017, 12:06
x264 already got some AVX512 improvements (although its not complete yet, i've been told). You can use it already today to judge improvements. On a 7900X it does result in a real improvement, but as NikosD said, the 7900X has a second separate full 512-bit unit, which the 7800 and 7820 do not have.

The only "downside" of AVX512 is that the CPUs clock down when it's in use due to the heat generation; however, Skylake can change its clock much faster than previous platforms, so at least it won't be terrible. x265 already experienced issues with downclocks when they first worked on AVX2, which also downclocks on server CPUs, so hopefully they'll account for that and only use it when there is a real and tangible improvement to be had.

NikosD
25th July 2017, 12:14
Do we know how much is the real difference of x264-AVX512 using a 7900X compared to x264-AVX2 version on the same CPU ?

I'm pretty sure that a Threadripper 16C/32T at the same price as the 7900X will eat Skylake-X for breakfast on x264, even though it has only a fast 128-bit FMA AVX implementation and not an AVX512 one, of course.

sneaker_ger
25th July 2017, 12:24
x264 results are not that impressive.
2017-06-19 13:55:46 < BugMaster|work> I mean overall speed up vs no AVX512 on same CPU
2017-06-19 13:56:09 < Gramner> 5-10% vs avx2 on veryfast
2017-06-19 13:59:00 < BugMaster|work> and for veryslow it similar or should be faster?
2017-06-19 14:00:41 < Gramner> it goes down to +-0 at veryslow currently.

nevcairiel
25th July 2017, 15:07
10% overall is pretty good from some improved SIMD functions. But like I said, its not done yet. There is more functions to optimize. When Gramner gets to those, he didn't say.

Atak_Snajpera
25th July 2017, 15:21
10% overall is pretty good from some improved SIMD functions. But like I said, its not done yet. There is more functions to optimize. When Gramner gets to those, he didn't say.

NOT OVERALL! Do not bend facts! 10% max is only in veryfast preset. If you have a 10+ core CPU then you most likely aim for the veryslow preset for max quality. I doubt that you can get more than a few percent extra speedup in those slow modes.

nevcairiel
25th July 2017, 15:31
NOT OVERALL! Do not bend facts! 10% max is only in veryfast preset.

I never stated the opposite, clearly anyone is capable of reading one post upwards, so keep your pants on.

Its overall 10% faster in that preset, and thats still a significant speedup. These presets are still quite useful for live encoding for streaming, when the really slow ones are still too slow for realtime (and gaming at the same time, for example).

Atak_Snajpera
25th July 2017, 15:33
It clearly says between 5 and 10%. So on average you get less than 10%. So extra ~7% more in useless veryfast preset is just a placebo for me.
2017-06-19 13:56:09 < Gramner> 5-10% vs avx2 on veryfast

These presets are still quite useful for live encoding for streaming, when the really slow ones are still too slow for realtime (and gaming at the same time, for example).
Who on earth buys very expensive x299 AVX-512 cpu for streaming games? For streaming cheap Ryzen 7 1700@3.8GHz is enough.

sneaker_ger
25th July 2017, 15:44
10% overall is pretty good from some improved SIMD functions. But like I said, its not done yet. There is more functions to optimize. When Gramner gets to those, he didn't say.
I don't disagree. I actually expected you to answer when I wrote my post. 5% to 10% for free is nothing to turn up your nose at.
But doom9 folks tend to go for "veryslow or go home!"...

burfadel
25th July 2017, 16:10
I don't disagree. I actually expected you to answer when I wrote my post. 5% to 10% for free is nothing to turn up your nose at.
But doom9 folks tend to go for "veryslow or go home!"...

It's not 5 to 10 percent for free, you paid extra for the CPU to get that extra speed. Not only that, the extra speed is only applicable to the faster, slower processors thusly. No doubt any improvement for some functions is offset by the associated downclock. It's why a virtualised GPU based onboard SPU (Supplementary Processing Unit) replacing a large part of the ALU and FPU functions could possibly be of advantage here. Sounds pretty good for something I just made up, right? :D

Anyways, it's highly probable that Threadripper will still outperform Skylake-X even without AVX2 functions, and it certainly beats it on price. Considering it costs less and is faster, that 5 to 10 percent for 'free', as you put it, in effect costs whatever the performance vs outlay cost difference is percentage-wise. So definitely NOT a free 'advantage'!

Atak_Snajpera
25th July 2017, 16:18
Also do not forget guys that intel's "NOT GLUED CORES TOGETHER" technology has serious problems with base clock when you add more and more cores.
Intel® Xeon® Platinum 8153 Processor (16c/32t) has base clock at only 2 GHz!
http://ark.intel.com/products/series/125191/Intel-Xeon-Scalable-Processors

ThreadRipper@4GHz 1950x will destroy Skylake-X even without almighty AVX-512.

IgorC
25th July 2017, 17:27
Xeon 8153 16 cores 2.0/2.8 GHz has lower TDP (125W).

6142M 16 cores (150W, 2.6/3.7 GHz) is closer to future AMD 16/32 CPU (180W). http://ark.intel.com/products/120488/Intel-Xeon-Gold-6142M-Processor-22M-Cache-2_60-GHz

Core i9-7960X (165W) will be apparently at 3.0/4.x GHz https://en.wikipedia.org/wiki/List_of_Intel_Core_i9_microprocessors

Atak_Snajpera
25th July 2017, 17:46
Xeon 8153 16 cores 2.0/2.8 GHz has lower TDP (125W).

6142M 16 cores (150W, 2.6/3.7 GHz) is closer to future AMD 16/32 CPU (180W). http://ark.intel.com/products/120488/Intel-Xeon-Gold-6142M-Processor-22M-Cache-2_60-GHz

Core i9-7960X (165W) will be apparently at 3.0/4.x GHz https://en.wikipedia.org/wiki/List_of_Intel_Core_i9_microprocessors

Taken from Intel
http://i.cubeupload.com/pRbWC0.png

Core i9-7960X won't hit 4GHz on ALL cores only on single one.
Furthermore extra heat caused by AVX-512 will force CPU to work at base frequency during video encoding.

x265_Project
25th July 2017, 18:30
Guys - all this speculation about the benefits we'll be able to obtain from AVX-512 instructions is premature. We don't know the answer today. We have to do the work to find out the answer. As with AVX2 instructions, you're using more silicon (wider execution units), which generate more heat than the standard execution units (nothing is free!), and so there are some thermal clock management issues that we have to be aware of as we implement AVX-512 optimization.

Forget about turbo frequencies when you're saturating a processor by running x265. x265 pushes the processor to its thermal limits, and when it is running hot it won't be going into turbo mode. Anyhow, getting twice the work done per instruction (AVX-512 vs AVX2) is beneficial, even if the clock frequency temporarily slows when we implement those instructions.

I also don't think it's useful to compare any high level statistics like core count, TDP, etc. between Intel and AMD in order to estimate performance. There are just too many other differences in CPU architecture between Purley and Zen/Epyc. The bottom line is that either way, it's all good news for x265 users... more cores and much more performance per dollar than what we have been getting until now.
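
To put the "twice the work per instruction versus a lower clock" trade-off into numbers, here is a rough Amdahl-style back-of-the-envelope sketch in Python. Every number in it is a made-up assumption (the share of runtime spent in vectorized kernels, a 2x kernel speed-up, a 10% clock drop applied to the whole run), so it illustrates the shape of the trade-off and is not a prediction for x265.

```python
def net_speedup(vector_fraction, kernel_speedup, clock_ratio):
    """Amdahl-style estimate of the overall speed-up.

    vector_fraction: share of runtime spent in kernels that get wider SIMD
    kernel_speedup:  per-clock speed-up of those kernels (2.0 = twice as fast)
    clock_ratio:     clock under the wider SIMD load divided by the clock before
    """
    per_clock = 1.0 / ((1.0 - vector_fraction) + vector_fraction / kernel_speedup)
    # Simplification: the reduced clock is assumed to apply to the whole run.
    return per_clock * clock_ratio


# Hypothetical inputs only; none of these are measurements.
for vector_fraction in (0.1, 0.3, 0.5):
    for clock_ratio in (1.0, 0.9):
        result = net_speedup(vector_fraction, 2.0, clock_ratio)
        print(f"fraction={vector_fraction:.1f} clock={clock_ratio:.1f} -> {result:.2f}x")
```

Under these assumptions the wider units only pay off when a large enough share of the runtime sits in the optimized kernels; with a small share, a clock drop can turn the net result into a slowdown.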

sneaker_ger
25th July 2017, 19:10
Is there any good explanation for the way x265 applies reference limits (https://forum.doom9.org/showthread.php?p=1747979#post1747979)? I don't understand how it makes any sense. Why numPocTotalCurr must always be <= 8?
/* The value of NumPocTotalCurr shall be less than or equal to 8 */
int numPocTotalCurr = param.maxNumReferences + vps.numReorderPics;
if (numPocTotalCurr > 8)
{
    x265_log(&param, X265_LOG_WARNING, "level %s detected, but NumPocTotalCurr (total references) is non-compliant\n", levels[i].name);
    vps.ptl.profileIdc = Profile::NONE;
    vps.ptl.levelIdc = Level::NONE;
    vps.ptl.tierFlag = Level::MAIN;
    x265_log(&param, X265_LOG_INFO, "NONE profile, Level-NONE (Main tier)\n");
    return;
}
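
To make the effect of that check concrete, here is a small Python sketch of the same arithmetic. The limit of 8 and the sum of references plus reorder pictures come from the snippet above; how `num_reorder_pics` is derived (2 with B-pyramid and more than one B-frame, otherwise 1 if any B-frames are used) is my assumption about how x265 sets vps.numReorderPics and should be checked against the source. The function names are illustrative only.

```python
MAX_NUM_POC_TOTAL_CURR = 8  # limit quoted in the x265 comment above


def num_reorder_pics(bframes, b_pyramid):
    """Assumed derivation of vps.numReorderPics (verify against the x265 source)."""
    if b_pyramid and bframes > 1:
        return 2
    return 1 if bframes > 0 else 0


def keeps_level(max_refs, bframes, b_pyramid):
    """True if refs + reorder pictures stay within the limit checked by x265."""
    total = max_refs + num_reorder_pics(bframes, b_pyramid)
    return total <= MAX_NUM_POC_TOTAL_CURR


# ref=3, bframes=4, b-pyramid: 3 + 2 = 5, within the limit.
print(keeps_level(3, 4, True))
# ref=7 with the same B-frame settings: 7 + 2 = 9, the level would be dropped.
print(keeps_level(7, 4, True))
```

Under that assumption, --ref 6 is the most you could use together with B-pyramid before the encoder falls back to "NONE profile, Level-NONE".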

x265_Project
25th July 2017, 22:10
Is there any good explanation for the way x265 applies reference limits (https://forum.doom9.org/showthread.php?p=1747979#post1747979)? I don't understand how it makes any sense. Why numPocTotalCurr must always be <= 8?

We can take a look at this. It's probably best to ask such a question on our developer mailing list. Personally (as the GM and head of product management), I don't want x265 to have any arbitrary limitations. It should support the full HEVC specification, at least, for support profiles. So if it's legal to have more than 8 reference frames in a particular profile and level, and if the user wants to do this, x265 should support it. That's not always a great idea (there may be decoder limitations that you'll run into, and there are seriously diminishing returns beyond about 3 reference frames), but our default performance presets limit --refs, so you have to manually specify --refs if you want more.

sneaker_ger
25th July 2017, 22:14
We can take a look at this.
Don't get me wrong. I'm not trying to suggest it's a bug. I'm only trying to understand it.

jlpsvk
26th July 2017, 00:32
First stress tests...

CPU: i7-7820X at stock 3.6GHz, all values default on MOBO (MSI X299 RAIDER)
Cooler: Fractal Design Kelvin S36
Case: NZXT H440

All cores running at 4.00GHz automatically, temps around 62 degrees Celsius.

burfadel
26th July 2017, 00:40
Is that at full load or idle?

jlpsvk
26th July 2017, 01:02
Full load.

Sagittaire
26th July 2017, 14:47
Guys - all this speculation about the benefits we'll be able to obtain from AVX-512 instructions is premature. We don't know the answer today. We have to do the work to find out the answer. As with AVX2 instructions, you're using more silicon (wider execution units), which generate more heat than the standard execution units (nothing is free!), and so there are some thermal clock management issues that we have to be aware of as we implement AVX-512 optimization.

Forget about turbo frequencies when you're saturating a processor by running x265. x265 pushes the processor to its thermal limits, and when it is running hot it won't be going into turbo mode. Anyhow, getting twice the work done per instruction (AVX-512 vs AVX2) is beneficial, even if the clock frequency temporarily slows when we implement those instructions.

I also don't think it's useful to compare any high level statistics like core count, TDP, etc. between Intel and AMD in order to estimate performance. There are just too many other differences in CPU architecture between Purley and Zen/Epyc. The bottom line is that either way, it's all good news for x265 users... more cores and much more performance per dollar than what we have been getting until now.

Some useful information:

1) Currently x265 is unable to use 16C/32T at 100% for 1080p encoding. The practical limit for 1080p is more like 8C/16T or 10C/20T.

2) You can expect to saturate 16C/32T only with a 2160p source at real full resolution (no black borders).


Considering that:

1) 16C/32T is useless for 1080p encoding. At this time, something like the R7 1700 is by far the best solution for high-speed encoding at low cost.

2) It will be really difficult to saturate 16C/32T at 100% even with a 2160p source. I even think that it will be useful to deactivate SMT or HT to get a 16C/16T configuration, particularly for a 1080p source. For this reason, I don't think that the thermal limit will be a big problem here.

Sagittaire
26th July 2017, 15:00
First stress tests...

CPU: i7-7820X at stock 3.6GHz, all values default on MOBO (MSI X299 RAIDER)
Cooler: Fractal Design Kelvin S36
Case: NZXT H440

All cores running at 4.00GHz automatically, temps around 62 degrees Celsius.

try this test:
https://forum.doom9.org/showthread.php?p=1799988#post1799988

and report your result thx

Atak_Snajpera
26th July 2017, 16:18
1) Currently x265 is unable to use 16C/32T at 100% for 1080p encoding. The practical limit for 1080p is more like 8C/16T or 10C/20T.
With medium preset it is even less than that. I never have full cpu usage on my E5-2690 with default preset (my average is ~80%). In this case I think that 6C/12T is max.

Considering that:

1) 16C/32T is useless for 1080p encoding. At this time, something like the R7 1700 is by far the best solution for high-speed encoding at low cost.
Your conclusion is wrong because you are ignoring the fact that you can run multiple instances of x265 to saturate those extra cores.

benwaggoner
26th July 2017, 19:23
Your conclusion is wrong because you are ignoring the fact that you can run multiple instances of x265 to saturate those extra cores.
And there is always --pmode. And even --pme.

--preset veryslow --lookahead-slices 4 --pmode --tskip --cu-lossless --tu-inter 4 --tu-intra 4 would likely saturate 16 physical cores on a single socket.

Atak_Snajpera
26th July 2017, 19:50
And there is always --pmode. And even --pme.

--preset veryslow --lookahead-slices 4 --pmode --tskip --cu-lossless --tu-inter 4 --tu-intra 4 would likely saturate 16 physical cores on a single socket.

Nope...
--crf 20 --preset veryslow --lookahead-slices 4 --pmode --tskip --cu-lossless --tu-inter 4 --tu-intra 4

Script (files in ramdisk)
LoadPlugin("C:\Users\Dave\Documents\Delphi_Projects\RipBot264\_Compiled\Tools\AviSynth plugins\RawSource\RawSource.dll")
video1=RawSource("E:\_Video_Samples\y4m\crowd_run_1080p50.y4m")
video2=RawSource("E:\_Video_Samples\y4m\park_joy_1080p50.y4m")
video3=RawSource("E:\_Video_Samples\y4m\ducks_take_off_1080p50.y4m")
video4=RawSource("E:\_Video_Samples\y4m\in_to_tree_1080p50.y4m")
video5=RawSource("E:\_Video_Samples\y4m\old_town_cross_1080p50.y4m")

return video1+video2+video3+video4+video5


http://i.cubeupload.com/M7ljR6.png

benwaggoner
26th July 2017, 20:00
Nope...

Huh. Well, thanks for testing!

Is this with hyperthreading on? We might not see 100% saturation of logical cores. 75% isn't bad with HT.

Maybe try --pme? That should be able to saturate anything.

Sagittaire
26th July 2017, 20:04
And there is always --pmode. And even --pme.


That doesn't work correctly. CPU load is higher ... but with less speed !!???

--preset veryslow --lookahead-slices 4 --pmode --tskip --cu-lossless --tu-inter 4 --tu-intra 4 would likely saturate 16 physical cores on a single socket.

x265 works correctly with default settings on 8C/16T with 1080p (roughly 80% load with the slow preset and slower). But you can't expect that with 16C/32T for a 1080p source.

The only solution, as Atak_Snajpera says, is multi-session encoding. Anyway, x265 is slow, and multi-session encoding will be a good solution only when the sessions work on the same source (with an advanced GUI that encodes several partitions of it at the same time).

Atak_Snajpera
26th July 2017, 20:10
Huh. Well, thanks for testing!

Is this with hyperthreading on? We might not see 100% saturation of logical cores. 75% isn't bad with HT.

Maybe try --pme? That should be able to saturate anything.

Nothing helps. Trust me, I've done many tests in the past with similar suggestions. The only working solution is to spawn more x265 encoder instances.
According to my tests ThreadRipper 1950x 16C/32T will require 3 instances to get constant 100% usage.

LigH
26th July 2017, 23:37
How many times did we already try to explain that 100% parallelization in modern efficient video encoding is a miracle? There are always parts of the compression algorithm which have to wait for other previous parts to finish before they can continue. This is not yet so obvious in x264 for AVC encoding, but much more in x265 where you have a lot more dependencies among parts of the HEVC algorithm.

Running several instances of encoders on a subset of the cores each works a lot better, because they work independently of each other, thus parallelize better.
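
A minimal sketch of that approach in Python, assuming a single y4m source that is split into contiguous frame ranges, one per encoder instance. It only builds the command lines and does not run them. `--input`, `--seek`, `--frames` and `--output` are regular x265 CLI options; the file names, the frame count of 2500 and the choice of 3 instances are placeholders. Each part is encoded independently (own keyframe at the start, own rate control), and joining the parts afterwards is a separate step not shown here.

```python
def split_frames(total_frames, instances):
    """Split [0, total_frames) into contiguous (start, count) segments."""
    base, extra = divmod(total_frames, instances)
    segments, start = [], 0
    for index in range(instances):
        # The first `extra` segments take one additional frame each.
        count = base + (1 if index < extra else 0)
        segments.append((start, count))
        start += count
    return segments


def x265_commands(source, total_frames, instances, extra_args=()):
    """Build one x265 command line (as an argument list) per segment."""
    commands = []
    for index, (start, count) in enumerate(split_frames(total_frames, instances)):
        commands.append([
            "x265", "--input", source,
            "--seek", str(start), "--frames", str(count),
            *extra_args,
            "--output", f"part{index}.hevc",  # placeholder output name
        ])
    return commands


# Placeholder source and frame count; three instances as suggested for 16C/32T.
for command in x265_commands("input.y4m", 2500, 3, ("--preset", "slow", "--crf", "20")):
    print(" ".join(command))
```

The argument lists could be handed to `subprocess.Popen` to start the instances side by side; whether pinning each one to a subset of cores with `--pools` helps would need testing.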

jlpsvk
27th July 2017, 01:12
try this test:
https://forum.doom9.org/showthread.php?p=1799988#post1799988

and report your result thx

|---------------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| CPU                 | x264    | x265    | LAVC    | auto    | MMX2    | SSE     | SSE2    | SSE3    | SSE4    | AVX     | AVX2    | All     |
|---------------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Core i7-7820X       | 24.95   | 4.74    | 136     | 3.67    | 1.13    | 1.14    | 1.82    | 1.97    | 2.97    | 2.97    | 3.57    | N/A     |



Temps under 70 degrees, all cores were running at 100% LOAD and at 4.0GHz.
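
For a quick read of those numbers, here is a short Python sketch that expresses each instruction-set column relative to the MMX2 result. It assumes the per-instruction-set columns are speeds where higher means faster, which the table itself does not state.

```python
# Values copied from the Core i7-7820X row of the table above.
results = {
    "MMX2": 1.13, "SSE": 1.14, "SSE2": 1.82, "SSE3": 1.97,
    "SSE4": 2.97, "AVX": 2.97, "AVX2": 3.57, "auto": 3.67,
}

baseline = results["MMX2"]
for name, value in results.items():
    # Ratio relative to the MMX2 column (assumed: higher = faster).
    print(f"{name:5s} {value:5.2f}  {value / baseline:4.2f}x")

# Gain from AVX2 over SSE4 under the same assumption.
avx2_gain = results["AVX2"] / results["SSE4"] - 1
print(f"AVX2 over SSE4: {avx2_gain:.1%}")
```

Read that way, AVX alone adds nothing over SSE4 on this CPU, and AVX2 adds roughly 20%.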

zub35
30th July 2017, 14:02
This test example shows artifacts from 64x64 blocks. See scene No. 3.

raw_1080p30.y4m [534MB] https://cloud.mail.ru/public/D74t/4LPtEm2WG
files x264,x265 [34MB] https://cloud.mail.ru/public/DdcN/HGKoYV5cU

x265 --bitrate 16000/8000 --preset placebo --merange 57 --subme 7 --psy-rd 4 --psy-rdoq 10 --pass 1/2
x264 --bitrate 16000/8000 --preset veryslow --level 4.1 --slow-firstpass --pass 1/2

http://i94.fastpic.ru/big/2017/0729/f6/2a7a431b28f96ccad5319198c86b78f6.png