x265 HEVC Encoder [Archive] - Page 111

x265_Project

31st July 2017, 17:52

This test example shows 64x64 block artifacts (scene No. 3).

raw_1080p30.y4m [534MB] https://cloud.mail.ru/public/D74t/4LPtEm2WG
files x264,x265 [34MB] https://cloud.mail.ru/public/DdcN/HGKoYV5cU

x265 --bitrate 16000/8000 --preset placebo --merange 57 --subme 7 --psy-rd 4 --psy-rdoq 10 --pass 1/2
x264 --bitrate 16000/8000 --preset veryslow --level 4.1 --slow-firstpass --pass 1/2

http://i94.fastpic.ru/big/2017/0729/f6/2a7a431b28f96ccad5319198c86b78f6.png

Try re-encoding with default placebo settings, and let us know what you find.

Compression artifacts will occur if you set the psy-rd strength and/or psy-rdoq strength too high. Psy-rd biases mode decision towards candidates that have a similar level of detail (variance) to the source block. A little psy-rd is good, but too much psy-rd strength will cause x265 to choose candidates that aren't the best match visually (blocks that have higher or matching energy levels, even though they are not the best candidate with the lowest Rate-Distortion cost). When psy-rd is too high, you'll typically see motion inaccuracy (for example, blocks that should be portraying smooth motion that instead appear to be staying in one place too long). When psy-rdoq is too high you'll just get compression artifacts.
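To make the trade-off above a bit more concrete, here is a minimal toy sketch in Python. It is not x265's actual psy-rd formula: the cost function, the use of variance as the "energy" measure, the candidate blocks and the strength values are all assumptions made up for illustration, and the strength scale has nothing to do with real --psy-rd values. It only shows how a large enough psy weight can make the decision flip from the lowest-distortion candidate to one that merely matches the source's level of detail.

```python
from statistics import pvariance

def ssd(a, b):
    """Sum of squared differences between two equally sized pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def toy_cost(source, recon, bits, lam, psy_strength):
    """Toy cost: distortion + lambda * bits + psy penalty on the variance gap.
    Illustrative stand-in only, not x265's real psy-rd formula."""
    energy_gap = abs(pvariance(source) - pvariance(recon))
    return ssd(source, recon) + lam * bits + psy_strength * energy_gap

def pick(source, candidates, lam, psy_strength):
    """Return the name of the candidate with the lowest toy cost."""
    return min(
        candidates,
        key=lambda name: toy_cost(source, candidates[name][0],
                                  candidates[name][1], lam, psy_strength),
    )

# Hypothetical 8-pixel source block with alternating detail.
source = [10, 20] * 4
candidates = {
    "smooth": ([15] * 8, 10),       # flat block: low distortion, no detail
    "shifted": ([20, 10] * 4, 10),  # same detail level, wrong position
}

without_psy = pick(source, candidates, lam=1.0, psy_strength=0.0)
with_heavy_psy = pick(source, candidates, lam=1.0, psy_strength=30.0)
print(without_psy, with_heavy_psy)  # smooth shifted
```

With no psy weight the flat candidate wins on plain distortion; with a heavy weight the candidate that has the right amount of detail in the wrong place wins, which is the kind of mismatch described above.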

excellentswordfight

1st August 2017, 09:57

Are there any plans for full HLG support? The arib-std-b67 transfer curve is implemented, but the DVB spec states that this should be flagged in the "alternative_transfer_characteristics" SEI message, with bt2020-10 flagged in the normal VUI; this way it is supposed to be more backwards compatible with non-HLG equipment.

http://www.etsi.org/deliver/etsi_ts/101100_101199/101154/02.03.01_60/ts_101154v020301p.pdf

FranceBB

1st August 2017, 17:25

I gotta send a sample to a TV which requires constant bitrate (20 Mbit/s) HEVC 10bit 50fps .ts 'cause they are moving from .mxf XDCAM 1080i MPEG-2 25 Mbit/s interlaced contents to 4K HEVC 50fps progressive.
I used:

@avs4x265.exe --preset medium --level 5.1 --tune fastdecode --profile main10 --bitrate 19500 --vbv-maxrate 19500 --vbv-bufsize 19500 --strict-cbr --ref 3 --deblock -1:-1 --overscan show --colormatrix bt709 --range limited --transfer bt709 --colorprim bt709 --videoformat component --no-open-gop --fps 50 -o "raw_video.hevc" "AVS Script.avs"

and also with:

@avs4x265.exe --preset ultrafast --level 5.1 --tune fastdecode --profile main10 --bitrate 19500 --vbv-maxrate 19500 --vbv-bufsize 19500 --strict-cbr --ref 3 --deblock -1:-1 --overscan show --colormatrix bt709 --range limited --transfer bt709 --colorprim bt709 --videoformat component --no-open-gop --fps 50 -o "raw_video.hevc" "AVS Script.avs"

My AVS:

ColorBars(width = 3840, height = 2160, pixel_type = "yv12")

ConvertFPS(50)
ResampleAudio(48000)

Normalize(0.89, show=false)

trim(0, 1500)
barsnote=last

BlankClip(width=3840, height=2160, pixel_type="YV12", fps=50000, fps_denominator=1000, audio_rate=48000, channels=2, sample_type="float", color=$000000, length=1500)

TextSub("Clock.ass")
clock=last

FFmpegSource2("sampleProRes.mov", fpsnum=50000, fpsden=1000, atrack=-1)

ResampleAudio(48000)

Normalize(0.89, show=false)

sample=last

barsnote++clock++sample

But I got a variable bitrate video (about 16 Mb/s going up and down). I'm using the latest version of x265 compiled by LigH (x64). I need a constant bitrate video (or at least a way to mux it pretending that the stream is CBR).

(as to the 8bit processing done by Avisynth, that's because it's just a sample/test; I'm gonna use the 10bit "hack" in FFMpegSource2 to output 10bit, do some post-processing - light denoise and debanding - and then feed x265 directly with the 10bit stream thanks to Dither Tool in the real encode)

I cannot upload publicly the encoded sample, due to copyright rights, but I might send a very little part of it via PM if it's absolutely necessary to solve this "issue", but I'm pretty sure I'm just missing some settings in my encode line. :)

sneaker_ger

1st August 2017, 17:42

x265 log? How did you determine the result is not CBR?

FranceBB

1st August 2017, 17:50

Well, after I encoded the .hevc file with x265 and the AC3 audio file with ffmpeg, I muxed them in .ts using ffmpeg and I got my .ts file, but Media Info says:

General
ID : 1 (0x1)
Complete name : G:\Sample.ts
Format : MPEG-TS
File size : 370 MiB
Duration : 2 min 58 s
Overall bit rate mode : Variable (instead of Constant)
Overall bit rate : 17.2 Mb/s (instead of 20 Mb/s)

Video
ID : 256 (0x100)
Menu ID : 1 (0x1)
Format : HEVC
Format/Info : High Efficiency Video Coding
Format profile : Main 10@L5.1@Main
Codec ID : 36
Duration : 3 min 0 s
Bit rate : 16.0 Mb/s (instead of 19.5 Mb/s)
Width : 3 840 pixels
Height : 2 160 pixels
Display aspect ratio : 16:9
Frame rate : 50.000 FPS
Standard : Component
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 10 bits
Bits/(Pixel*Frame) : 0.038
Stream size : 343 MiB (93%)
Writing library : x265 2.5+6-d11482e5fedb:[Windows][GCC 7.1.0][64 bit] 10bit
Encoding settings : cpuid=1050111 / frame-threads=1 / numa-pools=2 / wpp / no-pmode / no-pme / no-psnr / no-ssim / log-level=2 / input-csp=1 / input-res=3840x2160 / interlace=0 / total-frames=9000 / level-idc=51 / high-tier=1 / uhd-bd=0 / ref=1 / no-allow-non-conformance / no-repeat-headers / annexb / no-aud / no-hrd / info / hash=0 / no-temporal-layers / no-open-gop / min-keyint=25 / keyint=250 / bframes=3 / b-adapt=0 / b-pyramid / bframe-bias=0 / rc-lookahead=5 / lookahead-slices=8 / scenecut=0 / no-intra-refresh / ctu=32 / min-cu-size=16 / no-rect / no-amp / max-tu-size=32 / tu-inter-depth=1 / tu-intra-depth=1 / limit-tu=0 / rdoq-level=0 / dynamic-rd=0.00 / no-ssim-rd / no-signhide / no-tskip / nr-intra=0 / nr-inter=0 / no-constrained-intra / strong-intra-smoothing / max-merge=2 / limit-refs=0 / no-limit-modes / me=0 / subme=0 / merange=57 / temporal-mvp / no-weightp / no-weightb / no-analyze-src-pics / deblock=-1:-1 / no-sao / no-sao-non-deblock / rd=2 / early-skip / rskip / fast-intra / no-tskip-fast / no-cu-lossless / no-b-intra / rdpenalty=0 / psy-rd=2.00 / psy-rdoq=0.00 / no-rd-refine / analysis-reuse-mode=0 / no-lossless / cbqpoffs=0 / crqpoffs=0 / rc=cbr / bitrate=19500 / qcomp=0.60 / qpstep=4 / stats-write=0 / stats-read=0 / vbv-maxrate=19500 / vbv-bufsize=19500 / vbv-init=0.9 / ipratio=1.40 / pbratio=1.00 / aq-mode=1 / aq-strength=0.00 / cutree / zone-count=0 / strict-cbr / qg-size=32 / no-rc-grain / qpmax=69 / qpmin=0 / no-const-vbv / sar=0 / overscan=1 / overscan-crop=0 / videoformat=0 / range=0 / colorprim=1 / transfer=1 / colormatrix=1 / chromaloc=0 / display-window=0 / max-cll=0,0 / min-luma=0 / max-luma=1023 / log2-max-poc-lsb=8 / vui-timing-info / vui-hrd-info / slices=1 / opt-qp-pps / opt-ref-list-length-pps / no-multi-pass-opt-rps / scenecut-bias=0.05 / no-opt-cu-delta-qp / no-aq-motion / no-hdr / no-hdr-opt / no-dhdr10-opt / analysis-reuse-level=5 / scale-factor=0 / refine-intra=0 / refine-inter=0 / refine-mv=0 / no-limit-sao / ctu-info=0
Color range : Limited
Color primaries : BT.709
Transfer characteristics : BT.709
Matrix coefficients : BT.709

Audio
ID : 21 (0x15)
Menu ID : 1 (0x1)
Format : AC-3
Format/Info : Audio Coding 3
Format settings, Endianness : Big
Codec ID : 129
Duration : 3 min 0 s
Bit rate mode : Constant
Bit rate : 384 kb/s
Channel(s) : 2 channels
Channel positions : Front: L R
Sampling rate : 48.0 kHz
Frame rate : 31.250 FPS (1536 spf)
Bit depth : 16 bits
Compression mode : Lossy
Stream size : 8.24 MiB (2%)
Language : English
Service kind : Complete Main

Menu
ID : 4096 (0x1000)
Menu ID : 1 (0x1)
Duration : 2 min 58 s
List : 256 (0x100) (HEVC) / 21 (0x15) (AC-3, English)
Language : / English
Service name : Service01
Service provider : FFmpeg
Service type : digital television
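As a sanity check on the numbers in that report, the reported video bit rate and the Bits/(Pixel*Frame) figure can be recomputed from the stream size, frame count and resolution. This is only a small Python sketch; the function names are made up for illustration, and it assumes MediaInfo's MiB means 1024*1024 bytes.

```python
def bits_per_pixel_frame(bitrate_bps, width, height, fps):
    """Average number of bits spent per pixel per frame."""
    return bitrate_bps / (width * height * fps)

def average_bitrate_bps(stream_mib, seconds):
    """Average bitrate from a stream size in MiB and a duration in seconds."""
    return stream_mib * 1024 * 1024 * 8 / seconds

duration_s = 9000 / 50  # total-frames / frame rate from the report
measured_bps = average_bitrate_bps(343, duration_s)
bpp_measured = bits_per_pixel_frame(16.0e6, 3840, 2160, 50)
bpp_target = bits_per_pixel_frame(19.5e6, 3840, 2160, 50)

print(f"average video bitrate: {measured_bps / 1e6:.2f} Mb/s")
print(f"bits/(pixel*frame) at 16.0 Mb/s: {bpp_measured:.4f}")
print(f"bits/(pixel*frame) at 19.5 Mb/s: {bpp_target:.4f}")
```

The 343 MiB stream over 180 seconds works out to just under 16 Mb/s, so the report is internally consistent: the file really does average well below the 19.5 Mb/s target.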

LigH

1st August 2017, 21:06

littlepox

2nd August 2017, 11:31

But:

rc=cbr / bitrate=19500 ... vbv-maxrate=19500 / vbv-bufsize=19500

That's the point which counts ... but CBR bitrate distribution mode is no guarantee for a really really constant bitrate. If the encoder needs less bitrate to achieve optimal quality, a multiplexer would have to stuff the video stream with empty junk if a pseudo-constant bitrate is expected. And I doubt any playback device is so picky about it. There are more important limits than a bitrate. Like, reference frames, consecutive B frames, B frame pyramid, CABAC, Profile@Level marker...
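A rough way to picture the stuffing mentioned above is a toy leaky-bucket model in Python. This is a simplified sketch and not x265's VBV implementation: the function name, the frame sizes and the update order are assumptions for illustration. Bits arrive at exactly the CBR rate, each frame removes its own size from the buffer, and whatever would overflow the buffer has to be replaced by filler if the stream is to stay truly constant.

```python
def filler_bits_for_true_cbr(frame_bits, rate_bps, fps, buf_bits, init=0.9):
    """Toy leaky bucket: return the filler bits needed so the buffer never
    overflows while bits keep arriving at exactly rate_bps."""
    fullness = init * buf_bits
    arriving = rate_bps / fps  # bits delivered during one frame interval
    filler = 0.0
    for bits in frame_bits:
        fullness += arriving - bits
        if fullness < 0:
            raise ValueError("underflow: frame larger than the buffered bits")
        if fullness > buf_bits:
            # The encoder produced too few bits; pad to keep the rate constant.
            filler += fullness - buf_bits
            fullness = buf_bits
    return filler

RATE = 19_500_000  # bitrate / vbv-maxrate from the log, in bits per second
BUF = 19_500_000   # vbv-bufsize from the log, in bits

# One second of very easy content (hypothetical 100 kbit frames at 50 fps).
easy = filler_bits_for_true_cbr([100_000] * 50, RATE, 50, BUF)
# One second of frames that use exactly their share of the rate.
exact = filler_bits_for_true_cbr([390_000] * 50, RATE, 50, BUF)
print(f"easy content needs {easy / 1e6:.2f} Mbit of filler, exact needs {exact:.0f}")
```

In this toy model the easy second needs about 12.55 Mbit of padding, while frames that use their full share need none, which is why simple content like colour bars pulls the measured average down.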

I have seen such users so keen on CBR for the following reasons:

1. My broadcast/streaming service provider says no more than 20Mbps
2. I should encode with a maximum bitrate of 20Mbps (or to be safe, 19Mbps)
3. in order to maximize the quality under this constraint, make the bitrate constant as 19Mbps
4. Why the fxxk I cannot find a CBR mode? are you serious in telling me to use x264 instead of my favourite one-key encoder?

LigH

2nd August 2017, 14:43

a) just a casual question to the x265 developers, no accusation intended: When there are no commits for several days ... are you usually hitting a complex issue all together, which takes some time to get investigated, or is it more likely that you have appointments outside the development office (e.g. fairs or customer meetings, maybe it's just vacation season, and hopefully no natural disasters)?

b) a question not only to the x265 developers, but mainly: what is the intended difference between "make clean" and "make clean-generated", or more specifically, in which situations should one use either of them?

excellentswordfight

2nd August 2017, 15:14

I have seen such users so keen on CBR for the following reasons:

1. My broadcast/streaming service provider says no more than 20Mbps
2. I should encode with a maximum bitrate of 20Mbps (or to be safe, 19Mbps)
3. in order to maximize the quality under this constraint, make the bitrate constant as 19Mbps
4. Why the fxxk I cannot find a CBR mode? are you serious in telling me to use x264 instead of my favourite one-key encoder?
I have been doing something similar, where already-encoded offline files were inserted into the live TS stream, so they needed to match those settings. If it's the same use case here, I find it kind of weird that they are concerned about the bitrate but not the GOP structure, ref frames etc. So if these are not correct, it needs to be re-encoded anyway.

For my case I used ffmpeg/x265 with these settings and they ended up close to "true" CBR. I saw no cases with a significant drop in bitrate.

-c:v libx265 -pix_fmt yuv420p10le -preset slow -x265-params level=51:bitrate=24000:vbv-maxrate=24000:vbv-bufsize=24000:keyint=48:no-scenecut=1:b-adapt=0:bframes=8:rc-lookahead=48:strict-cbr=1:colorprim="bt709":transfer="bt709":colormatrix="bt709":range="limited"

I didn't look that closely at your script, but it looked like you inserted bars into it, right? This could explain it: they could be encoded at a much lower bitrate even though a higher bitrate is specified (I have at least seen this with ABR in the past). But it should be padded (muxrate) anyway when you repack it into a TS stream, so it shouldn't matter, right?
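For a rough idea of how much padding that means, here is a small Python sketch. It assumes a 20 Mb/s target mux rate and takes the 17.2 Mb/s overall rate from the MediaInfo report above as the payload; the function name is made up for illustration. The only hard fact used is that an MPEG-TS packet is 188 bytes, and padding is done with null packets of that size.

```python
TS_PACKET_BITS = 188 * 8  # an MPEG-TS packet is always 188 bytes

def null_packets_per_second(muxrate_bps, payload_bps):
    """Null packets per second needed to pad payload_bps up to muxrate_bps."""
    if payload_bps > muxrate_bps:
        raise ValueError("payload exceeds the mux rate")
    return (muxrate_bps - payload_bps) / TS_PACKET_BITS

padding = null_packets_per_second(20e6, 17.2e6)
print(f"about {padding:.0f} null packets per second")
```

So the muxer would be inserting roughly 1860 null packets per second, about 14% of the transport stream, to turn the 17.2 Mb/s file into a constant 20 Mb/s one.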

FranceBB

2nd August 2017, 15:47

@excellentswordfight... yes, I used a closed GOP 'cause open GOP is not supported by their specs and I kept --ref 3 just to be sure that the reference frames won't be a problem. As to the bars-note, I had to insert 30sec bars-note to show them channels mapping, audio level (about -24 LUFS more or less, 'cause according to the law commercial breaks can't be louder than the program itself, that has to be normalised) and that both luma and chroma of the program are in range (bars 75%, luma 0.7) followed by the clock with the name of program, video specs, TC-In, TC-Out and the countdown, lasting for 30 sec in order to have the program starting at 00:01:00.00.

So... yes, probably it's just because of bars-note and the clock/countdown that the encoder lowered the bitrate, which shouldn't be a big deal if it's just for that. Anyway, I sent them a sample; I hope they'll accept it.

benwaggoner

2nd August 2017, 19:28

I gotta send a sample to a TV which requires constant bitrate (20 Mbit/s) HEVC 10bit 50fps .ts 'cause they are moving from .mxf XDCAM 1080i MPEG-2 25 Mbit/s interlaced contents to 4K HEVC 50fps progressive.
But I got a variable bitrate video (about 16 Mb/s going up and down). I'm using the latest version of x265 compiled by LigH (x64). I need a constant bitrate video (or at least a way to mux it pretending that the stream is CBR).

Try adding --strict-cbr.

However, if there isn't much motion in the content, there are only so many ways to spend the bits. Adding a bit of random noise and using --tune grain is a good way to boost bitrate.

FranceBB

3rd August 2017, 01:08

@benwaggoner I used that parameter (--strict-cbr) already. Anyway, excellentswordfight was right: it was just a few kbit/s at the very beginning due to the bar-note, but the program itself was constant at 19.5 Mbit/s, in fact they just called me to tell me that they accepted the sample.

Gentlemen, it's been a pleasure. XD
No, seriously, thank you, everyone. :)

qyot27

3rd August 2017, 03:14

b) a question not only to the x265 developers, but mainly: what is the intended difference between "make clean" and "make clean-generated", or more specifically, in which situations should one use either of them?
My gut reaction is that it's the same as the difference between 'make clean' and 'make distclean' in autoconf and the custom build systems used by x264, FFmpeg, etc. 'make clean' scrubs the object files and other at-compile-time stuff away, while 'make distclean' cleans both that and the build process structures generated by running configure.

At least in those systems, distclean should be used whenever a change to the build system occurs, while clean is sufficient if only the actual project code changed.

WhatZit

5th August 2017, 09:03

When there are no commits for several days ... are you usually hitting a complex issue all together, which takes some time to get investigated, or is it more likely that you have appointments outside the development office

Remember that there is much more to MulticoreWare's development of x265 than just the visible public open source. They have their proprietary API-based UHDKit, which you definitely WON'T see commits for.

Also, it's entirely possible that MulticoreWare's "sideline projects" (I'm sure calling them that will make 'em angry) might require some collaborative poaching of x265 developers. Their recently developed LipSync technology is a perfect example of this, requiring an amalgam of both video analysis and artificial intelligence.

LigH

5th August 2017, 19:52

@ WhatZit: :cool: I wasn't aware of additional projects with similar complexity. Of course, that's a quite probable reason.

@ qyot27: My impression was the opposite, "make clean-generated" only removing CMake generated and similar auxiliary files to force rebuilding from the end of the building sequence for maybe only sub-targets with updates. But it's not easy for me to understand make files without a real clue about their structure, syntax, dependencies ...

x265_Project

6th August 2017, 04:40

x265_Project

6th August 2017, 04:41

Remember that there is much more to MulticoreWare's development of x265 than just the visible public open source. They have their proprietary API-based UHDKit, which you definitely WON'T see commits for.

Also, it's entirely possible that MulticoreWare's "sideline projects" (I'm sure calling them that will make 'em angry) might require some collaborative poaching of x265 developers. Their recently developed LipSync technology is a perfect example of this, requiring an amalgam of both video analysis and artificial intelligence.

We have 4 business units at MulticoreWare. I run the video business. Our LipSync product was developed by our Machine Learning business unit. They didn't poach any of our video developers. :)

Tom

LigH

10th August 2017, 08:49

x265 2.5+8-eea2afb81ef2 (https://www.mediafire.com/file/h9ebz1g509r3e1w/x265_2.5%2B8-eea2afb81ef2.7z)

New modes for --refine-intra (http://x265.readthedocs.io/en/default/cli.html#cmdoption-refine-intra) and --refine-inter (http://x265.readthedocs.io/en/default/cli.html#cmdoption-refine-inter)

Atak_Snajpera

10th August 2017, 15:03

AMD Threadripper 1950X in x265
https://cubeupload.com/im/bKv6yQ.png

Source -> https://youtu.be/TJiP1bKxLkU?t=3m43s

mastrboy

10th August 2017, 17:28

AMD Threadripper 1950X in x265
https://cubeupload.com/im/bKv6yQ.png

Source -> https://youtu.be/TJiP1bKxLkU?t=3m43s

So AMD is still way behind Intel in FPU performance per core, or will there be future x265 optimizations for Ryzen/Threadripper that could close the gap?

Sagittaire

10th August 2017, 17:57

So AMD is still way behind Intel in FPU performance per core, or will there be future x265 optimizations for Ryzen/Threadripper that could close the gap?

It's intensive multi-session encoding with x265: in this case memory bandwidth limits the speed of Ryzen/Threadripper. In a simple 4K encode the 1950X is faster than the 7900X (~20%).

Atak_Snajpera

10th August 2017, 18:34

It's intensive multi-session encoding with x265: in this case memory bandwidth limits the speed of Ryzen/Threadripper. In a simple 4K encode the 1950X is faster than the 7900X (~20%).

Nah. It is just 2xFMAC256 magic vs 2xFMAC128 in Zen. ThreadRipper's Memory Bandwidth is pretty good.

Source -> https://youtu.be/G9JR_v-4BaQ?t=2m2s

Barough

10th August 2017, 19:37

x265 v2.5+9-fdf39a97ecb8 (http://ge.tt/3UT4x7m2) (GCC 7.1.0, 32 & 64-bit 8/10/12bit Multilib Windows Binaries)

x265 [info]: HEVC encoder version x265 v2.5+9-fdf39a97ecb8
x265 [info]: build info [Windows][GCC 7.1.0][32/64 bit] 8bit+10bit+12bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2

https://bitbucket.org/multicoreware/x265/commits/branch/default

Sagittaire

10th August 2017, 20:34

Nah. It is just 2xFMAC256 magic vs 2xFMAC128 in Zen. ThreadRipper's Memory Bandwidth is pretty good.

Source -> https://youtu.be/G9JR_v-4BaQ?t=2m2s

Ryzen uses a slow data fabric for L3 cache communication between the CCX modules. You also get much higher latency than on Intel for DDR4 and L3. Using multiple instances is not a good idea on AMD.

If you do a 4K encode with x265, you don't saturate the memory controller, and in that case the 1950X@stock will produce a better result than the 7900X@stock.

You have the same problem with the R7 1800X and i7-6900K. In all 1080p tests, the R7 1800X@stock and i7-6900K@stock are on par for x265 encoding, but not in the x265 FHD benchmark.

If possible, reduce the number of instances (2 or perhaps 3 1080p instances will be enough, and you will see that the relative speed is much higher for AMD).

Atak_Snajpera

10th August 2017, 20:38

Ryzen uses a slow data fabric for L3 cache communication between the CCX modules. You also get much higher latency than on Intel for DDR4 and L3. Using multiple instances is not a good idea on AMD.

If you do a 4K encode with x265, you don't saturate the memory controller, and in that case the 1950X will produce a much better result than the 7900X.

You have the same problem with the R7 1800X and i7-6900K. In all 1080p tests, the R7 1800X and i7-6900K are on par for x265 encoding, but not in the x265 FHD benchmark.

Similar results
http://pclab.pl/art75073-14.html
http://www.benchmark.pl/testy_i_recenzje/amd-ryzen-threadripper-1950x-i-1920x-test/strona/28404.html

You are expecting too much from 2xFMAC128 vs 2xFMAC256.
ThreadRipper was designed to use as many processes as possible.
See benchmarks on youtube. Gaming + streaming to youtube + twitch + encoding something in Adobe Premiere.
REMEMBER! You have to use multiple x265 encoders to fully saturate all cores with very common 1080p resolution. So in practice there is no escape from that.

Sagittaire

10th August 2017, 21:12

Similar results
http://pclab.pl/art75073-14.html
http://www.benchmark.pl/testy_i_recenzje/amd-ryzen-threadripper-1950x-i-1920x-test/strona/28404.html

You are expecting too much from 2xFMAC128 vs 2xFMAC256.
ThreadRipper was designed to use as many processes as possible.
See benchmarks on youtube. Gaming + streaming to youtube + twitch + encoding something in Adobe Premiere.
REMEMBER! You have to use multiple x265 encoders to fully saturate all cores with very common 1080p resolution. So in practice there is no escape from that.

Not really. In the pclab test the 1950X produces a 7% better result than the 7900X, and in your test the 7900X is 2% better.

Moreover, I don't like the HandBrake test because this GUI uses heavy filters (AviSynth?) and doesn't feed the stream directly to the encoder. I prefer a direct benchmark with a high-speed ffmpeg frameserver (less than 5% CPU load for stream decoding).

Try your benchmark with fewer instances (just enough to keep CPU load at 100%) and you will see that the speed is higher. Perhaps higher for the Intel CPU too.

Atak_Snajpera

10th August 2017, 21:34

Not really. In the pclab test the 1950X produces a 7% better result than the 7900X, and in your test the 7900X is 2% better.

Moreover, I don't like the HandBrake test because this GUI uses heavy filters (AviSynth?) and doesn't feed the stream directly to the encoder. I prefer a direct benchmark with a high-speed ffmpeg frameserver (less than 5% CPU load for stream decoding).

Try your benchmark with fewer instances (just enough to keep CPU load at 100%) and you will see that the speed is higher. Perhaps higher for the Intel CPU too.
I have already done that on my E5-2690 in distributed encoding mode. 5 x265/x264 encoders vs 1 x265/x264. Difference in encoding time was in margin of error.

Sagittaire

10th August 2017, 23:06

I have already done that on my E5-2690 in distributed encoding mode. 5 x265/x264 encoders vs 1 x265/x264. Difference in encoding time was in margin of error.

1) Well, I read before that your E5-2690 8C/16T only reaches 70-75% CPU load in 1080p x265 encoding.

2) In this condition, why use 5 encoding instances if 1 is enough?

adsun701

11th August 2017, 01:26

Atak_Snajpera

11th August 2017, 13:35

It looks like the 1950X in default Creator Mode (2 dies active) sits between 3.3 and 3.4 GHz, while the i9 7900X runs at a constant 4 GHz.
Source -> https://youtu.be/Fr1ZlUu8v_Q?t=9m8s

Scaling in my benchmark is good.
Ryzen 7 1700 @ 3.7GHz (OC) = 25.5 fps
Threadripper 1950x @ 3.4GHz = 43.6 fps
Threadripper 1950x @ 3.7GHz = 47.4 fps (estimated)

Scaling factor = ~1.9x
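The estimate above can be reproduced with a few lines of Python. The sketch assumes that encoding speed scales linearly with clock frequency, which is only an approximation; the helper name is made up for illustration.

```python
def scale_by_clock(fps, from_ghz, to_ghz):
    """Extrapolate fps to another clock, assuming linear scaling with frequency."""
    return fps * to_ghz / from_ghz

ryzen_1700_fps = 25.5  # 8 cores @ 3.7 GHz, measured
tr_1950x_fps = 43.6    # 16 cores @ 3.4 GHz, measured

estimated = scale_by_clock(tr_1950x_fps, 3.4, 3.7)
scaling = estimated / ryzen_1700_fps
print(f"{estimated:.1f} fps, {scaling:.2f}x")  # 47.4 fps, 1.86x
```

Doubling the core count at the same clock gives about 1.86x, which rounds to the ~1.9x quoted.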

microchip8

11th August 2017, 21:02

Hi here. I just created a patch that enables support for SMPTE ST 428,
SMPTE RP 431, and SMPTE EG 432 primaries. It also enables support for
SMPTE ST 2085, ICtCp, and both chroma-derived non-constant and
constant luminance matrices. They are all included in the latest spec.

Here's the link.
https://gist.github.com/Adsun701/472ec93957289f057e0c90599ec4bb9a

You'd better post those patches to the x265 mailing list, not here.

x265_Project

12th August 2017, 18:45

Hi here. I just created a patch that enables support for SMPTE ST 428,
SMPTE RP 431, and SMPTE EG 432 primaries. It also enables support for
SMPTE ST 2085, ICtCp, and both chroma-derived non-constant and
constant luminance matrices. They are all included in the latest spec.

Here's the link.
https://gist.github.com/Adsun701/472ec93957289f057e0c90599ec4bb9a

Thanks! We received your email (sent to x265contributions at multicorewareinc dot com), along with your signed Contributor License Agreement. We'll review your patch ASAP.

Tom

NikosD

14th August 2017, 15:05

So AMD is still way behind Intel in FPU performance per core, or will there be future x265 optimizations for Ryzen/Threadripper that could close the gap?

Nah. It is just 2xFMAC256 magic vs 2xFMAC128 in Zen.

You are expecting too much from 2xFMAC128 vs 2xFMAC256.

x265 has nothing to do with the FPU or the FMACs or floating point performance in general.

It's a pure integer app using AVX2 integers not FMA3 or FADD or FMUL or any floating point in general.

If someone gets a Monsterripper or Killerofskyalakex 16C/32T 1950X try both modes.

UMA and NUMA using x265.

UMA should be faster, but who knows.

Also make sure you saturate all 32 threads.

Atak_Snajpera

14th August 2017, 18:01

It's a pure integer app using AVX2 integers not FMA3 or FADD or FMUL or any floating point in general.
Are you 100% sure that FMACs are not being used in integer calculations as well? Haswell is noticeably faster in x265 than Sandy/IvyBridge (clock vs clock).
Looking at architecture I don't see anything special except new FMACs
http://www.anandtech.com/show/6355/intels-haswell-architecture/8

If someone gets a Monsterripper or Killerofskyalakex 16C/32T 1950X try both modes.
Chipzilla 16C/32T will destroy ThreadRipper 1950x in x265 by 1.6x factor.

NikosD

14th August 2017, 20:20

Are you 100% sure that FMACs are not being used in integer calculations as well?

Haswell is noticeably faster in x265 than Sandy/IvyBridge (clock vs clock).

Looking at architecture I don't see anything special except new FMACs
http://www.anandtech.com/show/6355/intels-haswell-architecture/8

You seem to confuse vector SIMD integer instruction set with vector SIMD floating point instruction set.

Haswell and above have AVX2 instruction set which enables 256 bit vector SIMD integer instructions leveraged by x265

Sandy & Ivy have only AVX which is for floating point (mainly).
So, no speedup for those processors.

Of course AVX2 has FMA3 too, which doubles the floating point throughput compared to AVX but that's a different story irrelevant to x265.

Chipzilla 16C/32T will destroy ThreadRipper 1950x in x265 by 1.6x factor.

If that becomes a reality - 60% faster than 1950X - prepare yourself to use liquid nitrogen to freeze that CPU coming directly from hell, especially if Intel is still using that mustard between the CPU and heat spreader.

And you will need around 500W for that performance.

Atak_Snajpera

15th August 2017, 12:12

Haswell and above have AVX2 instruction set which enables 256 bit vector SIMD integer instructions leveraged by x265
What specific unit in CPU is responsible for calculating AVX2 instructions? My common sense tells me that FMAC does that. After all old SSE2 can also work on integers
https://en.wikipedia.org/wiki/SSE2

Zen has 2xFMAC128 while Intel since haswell has got 2xFMAC256. x265 benchmarks clearly show AMD 16C/32T = Intel 10C/20T. I see clear correlation here.

NikosD

15th August 2017, 18:27

What specific unit in CPU is responsible for calculating AVX2 instructions? My common sense tells me that FMAC does that. After all old SSE2 can also work on integers
https://en.wikipedia.org/wiki/SSE2

Zen has 2xFMAC128 while Intel since haswell has got 2xFMAC256. x265 benchmarks clearly show AMD 16C/32T = Intel 10C/20T. I see clear correlation here.

OMG! You really are a stubborn b@st@rd ! :)

The execution units leveraged by AVX2 instruction set are 256 bit SIMD integer for ADD, MUL, SHIFT.

Integer DIV remains 128 bit.

It's the last time I'm telling you that FMACs and floating point numbers have nothing to do with integers and x265 application.

Read here about all execution units of Haswell vs Sandybridge.

http://www.realworldtech.com/haswell-cpu/4/

Atak_Snajpera

15th August 2017, 19:11

Ok smart ass so explain us why zen architecture sucks so much in x265...
http://www.linleygroup.com/mpr/article.php?id=11666

NikosD

15th August 2017, 19:39

Ok smart ass so explain us why zen architecture sucks so much in x265...
http://www.linleygroup.com/mpr/article.php?id=11666

In case you didn't see it, I put a :) in my first sentence just to be polite with your tremendous ignorance regarding CPU architectures and ego (those two usually come together)

But now, after your reply, I can't be polite anymore.

Your comments made me laugh like no tomorrow regarding FMACs and x265, so keep on posting your thoughts after reading CPU architecture articles you don't understand.

It's so funny!

Thank you!

Asmodian

16th August 2017, 01:58

Ok smart ass so explain us why zen architecture sucks so much in x265...

Maybe it is due to its much lower cache bandwidth or much higher cache/memory latency? There are major differences there. Teasing out the differences in the ALUs that might impact x265 is beyond me, so if anyone can help I would appreciate it.

Skylake-X:
Data-Cache Accesses: 2x 32B read + 2x 32B write
L2 Read Bandwidth: 64B

Zen:
Data-Cache Accesses: 2x 16B read + 1x 16B write
L2 Read Bandwidth: 32B

Does the massive L2 of Skylake-X help x265 at all?

Balthazar2k4

17th August 2017, 18:27

Ok smart ass so explain us why zen architecture sucks so much in x265...
http://www.linleygroup.com/mpr/article.php?id=11666

I am running a 1950x with a 3.9 GHz OC across all 16 cores and, frankly, I am impressed. I have to run two encodes simultaneously to saturate the system, and using the medium preset with a CRF of 19 on 1080p material I am seeing ~25fps on both encodes. Will the 16C Intel counterpart beat the 1950x? Most likely. That said, the Intel part is $700 more, and I would therefore expect it to be superior.

This is my first AMD system in 15+ years and I can say unequivocally that I am very happy with it.

LigH

18th August 2017, 15:15

x265 2.5+11-d58761d8db4a (https://www.mediafire.com/file/g0pg09gtnb6fg82/x265_2.5%2B11-d58761d8db4a.7z)

supports some new SMPTE-ST/RP/EG colorimetry options and a new split RD skip command* (documented only in full help):

--[no-]splitrd-skip Enable skipping split RD analysis when sum of split CU rdCost larger than none split CU rdCost for Intra CU. Default disabled

--colorprim <string> Specify color primaries from undef, bt709, bt470m, bt470bg, smpte170m,
smpte240m, film, bt2020, smpte-st-428, smpte-rp-431, smpte-eg-432. Default undef

--colormatrix <string> Specify color matrix setting from undef, bt709, fcc, bt470bg, smpte170m,
smpte240m, GBR, YCgCo, bt2020nc, bt2020c, smpte-st-2085, chroma-nc, chroma-c, ictcp. Default undef

* If I understood the patch comment in the mailing list (https://mailman.videolan.org/pipermail/x265-devel/2017-August/011237.html) correctly, it should speed up intra split cost calculation a little while possibly preserving identical output.

burfadel

19th August 2017, 03:55

Yes, the splitrd-skip option looks interesting; I wouldn't be surprised if it is enabled by default in the future. I guess that comes down to user reports, or maybe they're waiting on the possibility of it being extended to inter CUs?

Stephen R. Savage

20th August 2017, 03:12

In case you didn't see it, I put a :) in my first sentence just to be polite with your tremendous ignorance regarding CPU architectures and ego (those two usually come together)

But now, after your reply, I can't be polite anymore.

Your comments made me laugh like no tomorrow regarding FMACs and x265, so keep on posting your thoughts after reading CPU architecture articles you don't understand.

It's so funny!

Thank you!
If you were smart instead of merely a smart-ass, you would know that the FPU (and FMAC) are also responsible for executing integer SIMD instructions. Likewise, you would know that Intel can retire dual 256-bit (512-bit in Skylake) multiply-accumulate of 16-bit integers using the exact same execution ports as floating-point FMA. In fact, why do you think the upcoming Cannonlake will have 52-bit integer FMA instructions (hint (https://en.wikipedia.org/wiki/File:IEEE_754_Double_Floating_Point_Format.svg))?

NikosD

20th August 2017, 05:52

If you were smart instead of merely a smart-ass, you would know that the FPU (and FMAC) are also responsible for executing integer SIMD instructions. Likewise, you would know that Intel can retire dual 256-bit (512-bit in Skylake) multiply-accumulate of 16-bit integers using the exact same execution ports as floating-point FMA. In fact, why do you think the upcoming Cannonlake will have 52-bit integer FMA instructions (hint (https://en.wikipedia.org/wiki/File:IEEE_754_Double_Floating_Point_Format.svg))?
Oh my, oh my (!)

What a smart ass.

What a dump ass

What an asshole.

Port 0 and 1 can dispatch SIMD integer and FMA for floating point, but not all hardware capable of SIMD integer can do SIMD floating point too.

For example port 5 can do SIMD integer but not FMA for floating point in Haswell/Broadwell/Skylake/Kabylake architecture.

Integer FMA is something very new to Intel's CPU architecture and part of AVX-512 instruction set only.

It's called AVX512-IFMA and has 52 bit precision.

Haswell/Broadwell/Skylake/Kabylake do not support AVX-512 and don't support integer FMA of course.

Skylake-X (Skylake-SP core) added an FMA 512 bit unit in Port 5 (10 core and above) but for floating point only, as it supports a limited part of the huge AVX-512 family of instructions set variants, but not AVX512-IFMA.

There are 12 levels of AVX-512 actually.

Cannonlake will be the first mainstream CPU of Intel to support integer FMA.

So, again.

What a smart ass, a dump ass and an asshole.

You should be banned from doom9 for ever.