x265 HEVC Encoder [Archive] - Page 129

LigH

10th October 2018, 20:20

@katzenjoghurt:

For the core functions of the x265 encoder (especially those which are most often called in tight loops), some of the following sentences may be true, depending on performance gain, development progress in different bit depths, etc.:

There is basic C/C++ code. It depends on the compiler options which instruction set is used. If your CPU supports it, x265 can use this code. If not, it will crash due to unsupported instructions.
There is hand-optimized assembler code with MMX/SSE2 optimization. If your CPU supports it, x265 can use this code. If not, x265 should use simpler code.
There is hand-optimized assembler code with SSSE3/SSE4 optimization. If your CPU supports it, x265 can use this code. If not, x265 should use simpler code.
There is hand-optimized assembler code with AVX optimization. If your CPU supports it, x265 can use this code. If not, x265 should use simpler code.
There is hand-optimized assembler code with AVX2 optimization. If your CPU supports it and you enable it explicitly, x265 can use this code.

Well, for any x86-64 CPU today, SSE2 should be the minimum sensible supported instruction set. But already there are small differences. I remember the Athlon 64 (AMD K8) family being a threshold of providing an SSE2 implementation which is considered relatively "fast", but despite supporting SSE3 in specs, x264 and x265 will refuse to use it.

Usually all these code variants are present in a binary of (lib)x265, except they are excluded during compilation (e.g. you may disable all assembler code paths; but why would you want that?).

katzenjoghurt

10th October 2018, 22:35

Np. There are 2.9 builds there under "x265 binaries for Win64/32 — stable branch"!

OMG! You are right.
I ignored the right table as I always pick my versions from the left side.
Thanks again!

Usually all these code variants are present in a binary of (lib)x265, except they are excluded during compilation (e.g. you may disable all assembler code paths; but why would you want that?).

Thanks, LigH!
I understand this as: Unless a build mentions something else all optimizations are enabled. (?)
I just was looking for some way to verify.
Some context: What led me to the question was a current version of StaxRip which contained an x265.exe and a "x265 AVX2" zip file - and I couldn't tell if the zip was just some oversight or if the "normal" x265.exe is a version without AVX2 optimizations.
I will just replace it and be done with it... though... I was wondering how to make sure an unknown x265.exe is indeed the "right" version for my machine.

LigH

10th October 2018, 23:40

Just a hint:

x265.exe --no-asm --version
x265 : HEVC encoder version 2.9+1-169e76b6bbcc
x265 [info]: build info [Windows][GCC 8.2.0][64 bit] 8bit+10bit+12bit
x265 [info]: using cpu capabilities: none!

All assembler optimizations forbidden = only basic C/C++ code used.

But if all assembler optimizations are enabled (and all of them are usually linked in the encoder), it only means they are available in case your CPU supports them (which is detected at runtime). It doesn't mean all of them are used on every hardware. If you don't limit them with the --asm mask parameter, x265 detects what your CPU supports while starting, and selects the code paths with the optimal speed supported by your specific CPU.

x265.exe --version
x265 [info]: HEVC encoder version 2.9+1-169e76b6bbcc
x265 [info]: build info [Windows][GCC 8.2.0][64 bit] 8bit+10bit+12bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT

This x265.exe contains code paths for time-critical assembler routines for MMX+SSE2, SSSE3+SSE4, AVX, and even AVX2. But it runs on an AMD Phenom-II, so it is limited to MMX+SSE2 by the CPU auto-detection.

The very same x265.exe can use AVX or even AVX2 code if you copy it onto a PC with a CPU that supports AVX or even AVX2 and run it there.

{EDIT}If your CPU even supports AVX512, and you insist on using AVX512 instructions, then you need to enable them with the additional parameter --asm avx512 on your command line, because it is a bit risky and does not always provide better performance, especially when your CPU gets temperature-throttled. And it will crash if your CPU does not support AVX512.{/EDIT}

So what are these special executables provided on a few sites? In addition to multiple code paths for time-critical assembler routines, the non-critical C/C++ routines are also optimized for a modern instruction set, which limits their compatibility; these builds will not even start on older CPUs.

If I used an x265.exe which was built with C/C++ compiler optimizations for AVX (that is probably what you read about for special binaries), it would crash right at the start when run on an AMD Athlon/Phenom which doesn't support AVX, because it would already use AVX instructions during initialization, before the encoding even starts. But this is not a time-critical part. There is no serious need to speed up code which runs only once or a few times. – (Jedi mind powers) "This is not the build you are looking for."

Rather generic builds, like mine or Barough's or Midzuki's, are fine for a large range of PC's; builds for x86-64 are probably optimized at least for SSE2 in the code generated by the C/C++ compiler, which is the minimal widely supported instruction set of AMD64 compatible CPU's. And the selection of highly optimized assembler routines for the really time-critical parts is done in the encoder at runtime.
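One way to check which code paths a given x265.exe actually selects on a machine is to read the "using cpu capabilities" line it prints at startup. The following is only a rough sketch in Python: it parses a captured log like the ones quoted above; the function names are made up for illustration, and it does not run x265 itself.

```python
import re

# Sample log text copied from the output quoted above (AMD Phenom II run).
SAMPLE_LOG = """\
x265 [info]: HEVC encoder version 2.9+1-169e76b6bbcc
x265 [info]: build info [Windows][GCC 8.2.0][64 bit] 8bit+10bit+12bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT
"""


def parse_cpu_capabilities(log_text):
    """Return the capability names from the 'using cpu capabilities' line.

    Returns None if the line is missing, and an empty list if x265
    reported 'none!' (as it does with --no-asm).
    """
    match = re.search(r"using cpu capabilities:\s*(.*)", log_text)
    if match is None:
        return None
    caps = match.group(1).strip()
    if caps.startswith("none"):
        return []
    return caps.split()


def uses(log_text, name):
    """True if the named capability (e.g. 'AVX2') was selected at runtime."""
    caps = parse_cpu_capabilities(log_text)
    return caps is not None and name in caps


if __name__ == "__main__":
    print(parse_cpu_capabilities(SAMPLE_LOG))
    print("AVX2 in use:", uses(SAMPLE_LOG, "AVX2"))
```

If AVX2 is missing from that list on a CPU that supports it, the limitation would come from the --asm option or the build, not from the auto-detection.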

benwaggoner

11th October 2018, 01:19

If your CPU even supports AVX2, and you insist in using AVX2 instructions, then you need to enable it with an additional parameter --asm avx2 in your command line because it is a bit risky and does not always provide better performance, especially not when your CPU gets temperature throttled. And it will crash if your CPU does not support AVX2.
I thought that AVX2 would be used automatically, but AVX512 would only be activated via --asm (which appears to be undocumented in x265.readthedocs.io (https://x265.readthedocs.io/en/default/cli.html#performance-options))
Do I have that wrong?

AVX-512 is only useful with slower UHD resolutions, so it makes sense for it to require an opt in.

Rather generic builds, like mine or Barough's or Midzuki's, are fine for a large range of PC's; builds for x86-64 are probably optimized at least for SSE2 in the code generated by the C/C++ compiler, which is the minimal widely supported instruction set of AMD64 compatible CPU's. And the selection of highly optimized assembler routines for the really time-critical parts is done in the encoder at runtime.
Do we have any ballpark sense for how much platform-specific compilation can help encoding performance? I've heard some speculation about ~5% but that was a while ago before the current-gen AMD and Intel processors were out.

LigH

11th October 2018, 08:16

Oops, my mistake ... yes, AVX2 is automatic, only AVX512 is manual.

Atak_Snajpera

11th October 2018, 11:55

AVX-512 is only useful with slower UHD resolutions, so it makes sense for it to require an opt in.
I would like to see how useful is AVX-512 on 28 core xeon ;) I'm expecting negative speed-up ;)

benwaggoner

12th October 2018, 01:15

I would like to see how useful is AVX-512 on 28 core xeon ;) I'm expecting negative speed-up ;)
If you aren't doing Main10 UHD with a slower+ preset, you are almost certainly correct.

That said, an updated microarchitecture could potentially make AVX-512 be more generally useful. AVX2 became a lot more useful with (IIRC) Skylake's microarchitectural change which reduced thermal throttling doing AVX2, really improving throughput.

qyot27

12th October 2018, 01:57

That said, an updated microarchitecture could potentially make AVX-512 be more generally useful. AVX2 became a lot more useful with (IIRC) Skylake's microarchitectural change which reduced thermal throttling doing AVX2, really improving throughput.
So, expect a reasonably-mature AVX-512 ca. Sapphire Rapids?

benwaggoner

12th October 2018, 03:47

So, expect a reasonably-mature AVX-512 ca. Sapphire Rapids?
Plausibly. But x265 may be the most CPU stressful real software in the world, hitting the cores, caches, and SIMD super hard at once. Hopefully Intel is benchmarking x265 during development!

It is hard to predict the optimal performance tuning of a given CPU without actually having it, as theoretical improvements don’t always work as expected.

I’m curious if anyone has benchmarked performance improvements from arch-specific and profile-driven builds.

x265 is also pretty stressful for compilers.

alex1399

12th October 2018, 04:04

x265 reported an error that the file "F:\x265" was not found (or maybe that the file could not be opened, if I recall correctly) when I used --analysis-reuse-file F:\x265 during the second-pass encode.

Great, now I can't reproduce this error anymore.
It has been five days and Zeranoe's ffmpeg still has not released a new version with x265 2.9.

Atak_Snajpera

12th October 2018, 09:58

Hopefully Intel is benchmarking x265 during development!
Nah... They will ask Principled Technology to do "proper" benchmarks ;) They are very good at disabling cores before testing...

LigH

12th October 2018, 10:22

Fresh build (https://www.mediafire.com/file/tz7w5f8qgenb28e/ffmpeg_N-92161-gf6d48b618a.7z/file) by media-autobuild_suite, GPL v3, Zeranoe-like selection.

ffmpeg version N-92161-gf6d48b618a Copyright (c) 2000-2018 the FFmpeg developers
built with gcc {7.3.0|8.2.0} (Rev3, Built by MSYS2 project)
configuration: --disable-autodetect --enable-amf --enable-bzlib --enable-cuda --enable-cuvid --enable-d3d11va --enable-dxva2 --enable-iconv --enable-lzma --enable-nvenc --enable-zlib --enable-sdl2 --disable-debug --enable-ffnvcodec --enable-nvdec --enable-libmp3lame --enable-libopus --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-fontconfig --enable-libass --enable-libbluray --enable-libfreetype --enable-libmfx --enable-libmysofa --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvo-amrwbenc --enable-libwavpack --enable-libwebp --enable-libxml2 --enable-libzimg --enable-libshine --enable-gpl --enable-avisynth --enable-libxvid --enable-libaom --enable-version3 --enable-mbedtls --extra-cflags=-DLIBTWOLAME_STATIC --extra-libs=-lstdc++ --extra-cflags=-DLIBXML_STATIC --extra-libs=-liconv
libavutil 56. 19.101 / 56. 19.101
libavcodec 58. 33.100 / 58. 33.100
libavformat 58. 18.104 / 58. 18.104
libavdevice 58. 4.105 / 58. 4.105
libavfilter 7. 33.101 / 7. 33.101
libswscale 5. 2.100 / 5. 2.100
libswresample 3. 2.100 / 3. 2.100
libpostproc 55. 2.100 / 55. 2.100

Forteen88

12th October 2018, 18:56

x265 2.9+2 released now!
http://www.msystem.waw.pl/x265/

LigH

12th October 2018, 21:59

^ ffmpeg contains v2.9+2.

DotJun

14th October 2018, 15:10

Is there a downside to enabling avx512 on an intel X chip?


LigH

14th October 2018, 17:12

As far as I remember from previous discussions...

Most of all: Temperature throttling. AVX512 can be a heavy burden.

Furthermore, switching the CPU into and out of AVX modes can be quite time consuming, which has to be considered in the optimization efforts, and it can make it less efficient for lower resolutions.

Please try to read back, and I believe a thread about AVX512 and AMD Ryzen got even separated from this generic x265 encoder thread.

StvG

15th October 2018, 04:51

Tested binaries download from here (http://msystem.waw.pl/x265/).
input.mkv - hevc (Main 10), yuv420p10le(tv), 3840x1606
AVX2 clock speed = AVX512 clock speed

ffmpeg -i input.mkv -f yuv4mpegpipe -strict -1 - | .\resources\x265-10b.exe --y4m - --ctu 32 -o .\OUTPUT.mkv

x265 [info]: HEVC encoder version 2.8+74-fd517ae68f93
x265 [info]: build info [Windows][GCC 8.2.0][64 bit] 10bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2

encoded 498 frames in 51.34s (9.70 fps), 5044.61 kb/s, Avg QP:31.42

x265 [info]: HEVC encoder version 2.8+74-fd517ae68f93
x265 [info]: build info [Windows][MSVC 1915][64 bit] 10bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2

encoded 498 frames in 51.80s (9.61 fps), 5044.61 kb/s, Avg QP:31.42
ffmpeg -i input.mkv -f yuv4mpegpipe -strict -1 - | .\resources\x265-10b.exe --y4m - --ctu 32 -o .\OUTPUT.mkv

x265 [info]: HEVC encoder version 2.9+2-7e978ed93d60
x265 [info]: build info [Windows][GCC 8.2.0][64 bit] 10bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2

encoded 498 frames in 52.82s (9.43 fps), 5044.61 kb/s, Avg QP:31.42

x265 [info]: HEVC encoder version 2.9+2-7e978ed93d60
x265 [info]: build info [Windows][MSVC 1915][64 bit] 10bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2

encoded 498 frames in 51.55s (9.66 fps), 5044.61 kb/s, Avg QP:31.42
ffmpeg -i input.mkv -f yuv4mpegpipe -strict -1 - | .\resources\x265-10b.exe --y4m - --ctu 32 -o .\OUTPUT.mkv

VS 2017 Generic compilation ("none")

encoded 498 frames in 51.49s (9.67 fps), 5044.61 kb/s, Avg QP:31.42

VS 2017 AVX2 compilation ("AVX2")

encoded 498 frames in 52.27s (9.53 fps), 5044.61 kb/s, Avg QP:31.42
ffmpeg -i input.mkv -f yuv4mpegpipe -strict -1 - | .\resources\x265-10b.exe --y4m - --ctu 32 (--asm avx512) -o .\OUTPUT.mkv

x265 [info]: HEVC encoder version 2.9+2-7e978ed93d60
x265 [info]: build info [Windows][MSVC 1915][64 bit] 10bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2

encoded 498 frames in 52.05s (9.57 fps), 5044.61 kb/s, Avg QP:31.42

x265 [info]: HEVC encoder version 2.9+2-7e978ed93d60
x265 [info]: build info [Windows][MSVC 1915][64 bit] 10bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512

encoded 498 frames in 50.79s (9.80 fps), 5044.61 kb/s, Avg QP:31.42
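To put numbers like these side by side, the summary lines can be parsed and compared. This is a minimal sketch, assuming the "encoded N frames in Ts (F fps)" format shown above; the helper names are hypothetical. Note that the AVX2-only runs above already spread from 9.43 to 9.70 fps, so a gain of a couple of percent on a 498-frame clip may be within run-to-run noise.

```python
import re

# Matches the x265 summary line, e.g.
# "encoded 498 frames in 51.34s (9.70 fps), 5044.61 kb/s, Avg QP:31.42"
ENCODED_RE = re.compile(r"encoded (\d+) frames in ([\d.]+)s \(([\d.]+) fps\)")

# The last two results above: same MSVC build without and with --asm avx512.
AVX2_LINE = "encoded 498 frames in 52.05s (9.57 fps), 5044.61 kb/s, Avg QP:31.42"
AVX512_LINE = "encoded 498 frames in 50.79s (9.80 fps), 5044.61 kb/s, Avg QP:31.42"


def parse_summary(line):
    """Return (frames, seconds, fps) from an x265 summary line."""
    match = ENCODED_RE.search(line)
    if match is None:
        raise ValueError("not an x265 summary line: %r" % line)
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


def speedup_percent(baseline_fps, candidate_fps):
    """Relative fps change of candidate over baseline, in percent."""
    return (candidate_fps / baseline_fps - 1.0) * 100.0


if __name__ == "__main__":
    _, _, fps_avx2 = parse_summary(AVX2_LINE)
    _, _, fps_avx512 = parse_summary(AVX512_LINE)
    print("AVX-512 vs AVX2: %+.1f%%" % speedup_percent(fps_avx2, fps_avx512))
```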

DotJun

15th October 2018, 07:21

LigH

15th October 2018, 07:30

So it appears to be efficient on your specific CPU model.

Atak_Snajpera

15th October 2018, 12:16

I tried a short test clip with avx512 enabled and disabled on a 4K source using the slower preset. FPS went up to 1.37 from 0.84 when I enabled 512.

Encoded clip looks good, no obvious errors that is. File size is roughly the same, but clip length and crf might have something to do with the tiny difference between the two.

64bit x265 on an intel 7820x. Temps are roughly equal to when 512 is disabled. Load is mostly at 100% on all cores with the occasional dip down to 87% every minute or so.

You should encode a whole movie (130k frames) instead of an ultra-short clip with a few hundred frames.
The longer you encode, the more heat your CPU will produce, and hence a more aggressive AVX negative offset will be activated.

Boulder

15th October 2018, 12:32

RieGo

15th October 2018, 14:16

Clare

15th October 2018, 19:01

Pushing Encoding Quality and Speed with x265 (https://youtu.be/7YT2KJwt4KI)
Massively Parallel Encoding (https://youtu.be/MHvpEVCvajk)

from Mile-High Video Workshop videos http://mile-high.video/files/mhv2018/

benwaggoner

15th October 2018, 19:27

x265 2.9+2 released now!
http://www.msystem.waw.pl/x265/
Anyone know what the new patch actually does:

rc: Fix rowStat computation in const-vbv (https://bitbucket.org/multicoreware/x265/commits/7e978ed93d6086973f87f607645339642ebb6ed0)

It looks like it might fix a serious issue in a given RC mode, but it isn't actually self-documenting.

Jamaika

16th October 2018, 06:32

Something seems to me that these aren't the only x265 bugs.
Github is still at 2.8. There will probably be some fixes to implement version 2.9.

Boulder

16th October 2018, 06:43

Has anybody made any recent tests with different CTU and TU sizes? I made a quick test yesterday on a 720p encode, and max CTU (and TU) 16 turned out to produce the smallest file but also looked best compared to the original frame. At CTU max 64, the frame was clearly more blurry in places where there were more small details such as hair etc.

Tests with 4K and 1080p encodes also showed the same behaviour. It's quite strange as it's said that the big thing in HEVC is that it can use larger CTUs than AVC to increase efficiency and that the bigger the CTU is, the less bitrate should be required. It doesn't seem to be like this at least in CRF mode. Too bad it's not allowed to use 16x16 CTUs with 1080p or 4K encodes if they are to be compliant.

LigH

16th October 2018, 07:57

Github is still at 2.8. There will probably be some fixes to implement version 2.9.

Bitbucket recently provided tag v2.9; but the "tip" is currently in the "stable" branch, not in "default".

Still, there seems to be a lack of communication recently. I reported compiler warnings of GCC 8.x already 2 times, and nobody replied until today.

SmilingWolf

16th October 2018, 09:22

Still, there seems to be a lack of communication recently. I reported compiler warnings of GCC 8.x already 2 times, and nobody replied until today.

Don't get me wrong, I have nothing but the conspiracy theories in my head, BUT, seeing as MulticoreWare sells en/decoding products, I wouldn't be surprised to see some announcement around the turn of the year about their work on some of the new codecs, be it AV1 or VVC.
But that's probably just me wishfully hoping for a sane AV1 encoder with proper performance, multithreading, profiles and documentation...

benwaggoner

16th October 2018, 16:10

Tests with 4K and 1080p encodes also showed the same behaviour. It's quite strange as it's said that the big thing in HEVC is that it can use larger CTUs than AVC to increase efficiency and that the bigger the CTU is, the less bitrate should be required. It doesn't seem to be like this at least in CRF mode. Too bad it's not allowed to use 16x16 CTUs with 1080p or 4K encodes if they are to be compliant.
It would be helpful to see the command line used, or at least bitrate/CRF.

Larger CTUs are definitely helpful at lower bitrates. Comparing at high perceptual quality can bring in other subtler differences between modes.

I've even been able to see a slight improvement at sub-SD resolutions using CTU 64 at very low bitrates.

Boulder

16th October 2018, 17:02

Source was 4K, downsampled to 1080p. These are the basic parameters that I've set:
--input-depth 16
--dither
--profile main10
--min-keyint 5
--keyint 480
--merange 44
--splitrd-skip
--preset veryslow
--rc-lookahead 60
--deblock -2:-2
--no-strong-intra-smoothing
--no-sao
--qcomp 0.8
--aq-mode 3
--aq-strength 0.8
--ctu 64
--max-tu-size 32
--rdpenalty 1
--qg-size 16
--tu-inter-depth 4
--tu-intra-depth 4
--limit-tu 4
--limit-refs 3
--max-merge 2
--rd-refine
--ref 6
--bframes 10
--crf 19
I've then also tested ctu 32 with max-tu-size 32 or 16.

In the 1080p encode, the bitrate is as follows:

CTU 64 / TU 32 - 15078 kbps
CTU 32 / TU 32 - 14608 kbps
CTU 32 / TU 16 - 14472 kbps

Of these, CTU 32 / TU 32 resembles the original the most. It's interesting that setting TU 16 also causes distortion in the same areas as CTU 64 / TU 32. I checked areas like eyes, hair etc. which have easily some sort of distortion because there are many fine lines and things that can be compared quite easily.

I've just started testing what CRF value is visually enough for 1080p so the final bitrate will probably be lower than what I got from my tests. I'd estimate CRF 20-21 would be the final value.
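For reference, the bitrate differences between the three runs can be expressed relative to one of them. A small sketch, using only the figures listed above and taking CTU 32 / TU 32 as the baseline (an arbitrary choice for illustration; the function names are hypothetical):

```python
# Bitrates (kbps) from the 1080p test encodes listed above.
BITRATES_KBPS = {
    "CTU 64 / TU 32": 15078,
    "CTU 32 / TU 32": 14608,
    "CTU 32 / TU 16": 14472,
}


def relative_change_percent(baseline, value):
    """Change of value relative to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


def compare_to(baseline_label, bitrates):
    """Map each label to its bitrate change vs. the baseline, rounded to 0.1%."""
    base = bitrates[baseline_label]
    return {
        label: round(relative_change_percent(base, kbps), 1)
        for label, kbps in bitrates.items()
    }


if __name__ == "__main__":
    for label, pct in compare_to("CTU 32 / TU 32", BITRATES_KBPS).items():
        print("%s: %+.1f%%" % (label, pct))
```

So at the same CRF, the spread between the three settings is only a few percent; the visual differences described above are not reflected in a large bitrate change.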

benwaggoner

16th October 2018, 17:11

Source was 4K, downsampled to 1080p. These are the basic parameters that I've set...

I've then also tested ctu 32 with max-tu-size 32 or 16.

In the 1080p encode, the bitrate is as follows:

CTU 64 / TU 32 - 15078 kbps
CTU 32 / TU 32 - 14608 kbps
CTU 32 / TU 16 - 14472 kbps

Of these, CTU 32 / TU 32 resembles the original the most. It's interesting that setting TU 16 also causes distortion in the same areas as CTU 64 / TU 32. I checked areas like eyes, hair etc. which have easily some sort of distortion because there are many fine lines and things that can be compared quite easily.

I've just started testing what CRF value is visually enough for 1080p so the final bitrate will probably be lower than what I got from my tests. I'd estimate CRF 20-21 would be the final value.
Target CRF will also depend on other parameters, if you are still adjusting those.

Overall, that is a very idiosyncratic set of options. Nothing looks wrong (and I'd love to hear how you picked some of those!), but it's definitely outside of any combination of settings that MCW would be doing psychovisual optimization for.

I'm curious about why you chose these particular settings:

--merange 44
--splitrd-skip
--max-merge 2
--deblock -2:-2
--rdpenalty 1
--qg-size 16
--bframes 10

Is this for encoding anime or some other kind of synthetic or mixed synthetic/natural image encoding?

Boulder

16th October 2018, 17:28

benwaggoner

16th October 2018, 17:52

My sources are just regular movies or TV series. I'll tune CRF as the last item once I've got all the rest in place.

--merange 44, basically lowering it from the default 57 which I understand is meant for 4K. For 720p, I've used 38 all the time. I think this is remains of littlepox's set of "tune film" parameters.
It's not THAT frame size dependent, but 44 is probably fine.
--splitrd-skip, in my old notes, I didn't find it cause any ill effects. Do you have any specific information why it's a "bad idea"?
No reason it would be a bad idea; I just haven't seen it used before. Generally parameters that have a reliable quality/speed tradeoff are in a preset. But it isn't listed as experimental...

Anyone from MultiCoreWare care to weigh in? What's the tradeoff? Is this something that should get added to the faster presets, or default to On but off in --preset placebo?
--max-merge 2, values 3-4 tested and it caused blur. One of the things in x265 that is different from x264 - the slower presets don't mean similar quality at lower final bitrate
Yeah, with the many more tools available in HEVC, the differences between presets are greater. Odd that it causes blur; this should be a speed/quality tradeoff. You should file an issue here with repro details: https://bitbucket.org/multicoreware/x265/issues?status=new&status=open
--deblock -2:-2, no need for intense deblocking according to my tests
Well, to improve compression efficiency. Did you see any issues with using 0:0?
--rdpenalty 1, trying to favour smaller blocks.
Have you seen any experimental validation of it helping in x265 2.4+? I found places where it was helpful in older versions, but not recently.
--qg-size 16, tested values from 64 to 8, 16 looked best (in terms of distortion in small details again) when compared frame-by-frame. Tested with a 720p encode, so I'll need to check that also with 1080p later.
I would expect it would look better, but at some cost to efficiency due to the signaling overhead. --opt-cu-delta-qp might help that some. What sorts of bitrates are you targeting/getting?
--bframes 10, some video utilizes a lot of B-frames for some reason. Not a big slowdown so I've kept it at that all the time.
Why 10 specifically? I'd use either 8 (most tested, as it's in the slower+ presets) or 16 (maximum allowed).

Boulder

16th October 2018, 18:09

I think the general problem is that the presets and tunings are really old. I'm quite sure things are very much different now compared to what they were when most of the parameters and settings were come up with. Also, we are still missing the most common tuning which is --tune film. I would very much like to see what the creators of the encoder think of proper settings as they should know the internal workings best.

I recall someone else also complaining about max-merge for smoothing things if the value is too big. I'll need to retest and file a bug report if it's still reproducible. It's been some time since I tested it.

Will also retest deblock 0:0. When I tested it earlier, I did find it smooth things slightly so I went one notch down from the standard -1:-1 of x264's --tune film.

--rdpenalty 1 is something I have not tested recently.

--qg-size 16 does cause the bitrate to jump compared to 32 or 64, but it's worth it in my opinion.
I had tested it with a 720p encode with quite a big filesize difference:
qg-size 64 : 3147,41 kbps
qg-size 32 : 3648,01 kbps
qg-size 16 : 3954,23 kbps
It is something I need to separately retest for 1080p. The bitrates fluctuate a lot, I'd say between 2.5 - 6 Mbps for 720p encodes. I have no specific target so I just use CRF 19.

10 B-frames because for some quite noisy sources, I noticed that 8 consecutive B-frames were being used 5-10% of the time. Setting ten usually meant that the longest sequence was used 1-2% of the time.

jlpsvk

16th October 2018, 20:46

Is there a downside to enabling avx512 on an intel X chip?


Heat. :D

benwaggoner

17th October 2018, 02:09

Heat. :D
Would it be hotter? Or would it just be slower due to thermal throttling.

I wouldn't be surprised if Intel's next big microarchitecture revision makes AVX512 useful in cases where it isn't today. We saw the same thing with Skylake and AVX2.

Wolfberry

17th October 2018, 03:52

No reason it would be a bad idea, I just haven't seen it used before. Generally parameters that have a reliable quality/speed tradeoff are in a preset. But it isn't listed as experimental...
Anyone from multicoreware care to weigh in? What's the tradeoff? Is this something that should get added to the faster presets, or default to on but off in --preset placebo?

In fact, this skip is not a fast skip algorithm.
As the sum of split cost is larger than none split CU's best cost (both rdcost of sub-cu and none split CU are without split flag cost), which means splitting into 4 parts at this depth of cu is a worse case compared with none split CU. So that, the remain N * 1/4 parts of CU analysis is useless.

If I understood the patch comment in the mailing list correctly, it should speed up intra split cost calculation a little while possibly preserving identical output.

splitrd-skip sounds like a small speedup with no trade off, but still disabled by default (according to the doc)

Atak_Snajpera

17th October 2018, 11:10

Would it be hotter? Or would it just be slower due to thermal throttling.

I wouldn't be surprised if Intel's next big microarchitecture revision makes AVX512 useful in cases where it isn't today. We saw the same thing with Skylake and AVX2.

This will require 10nm process. 14nm++++++++++++ has reached its thermal limits. Upcoming 8C/16T CPUs already are very hot at 5GHz.

RieGo

17th October 2018, 15:51

Would it be hotter? Or would it just be slower due to thermal throttling.

in my personal experience avx512 isn't generating much more heat than avx2 - on x265
so as long as you are only using it on x265 and use a safe thermal throttling setting u should be fine imho :)

in case i am wrong and anyone else has different results please correct me :)

Boulder

18th October 2018, 07:08

I recall someone else also complaining about max-merge for smoothing things if the value is too big. I'll need to retest and file a bug report if it's still reproducible. It's been some time since I tested it.

Will also retest deblock 0:0. When I tested it earlier, I did find it smooth things slightly so I went one notch down from the standard -1:-1 of x264's --tune film.
I've done some testing on a 1080p encode, basically I've set CRF 21 and then tested one parameter at a time by comparing still frames. I know it's not the optimal way but it is very hard to notice the differences in motion. What I've tried to compare are areas which show distortion quite well, such as eyes and their surroundings, hair etc. I've also tried picking frames from fast motion and almost still scenes.

I'll post my results as soon as I finish the 720p tests. From what I can tell is that 1080p requires slightly different parameters, but of course it could be that things have changed so much under the hood that my set of parameters have been obsolete all the time :)

What I already found strange is that deblock doesn't really affect bitrate. For example deblock 6:6 ended up around the same size as deblock 0:0.

Forteen88

18th October 2018, 08:07

What I already found strange is that deblock doesn't really affect bitrate. For example deblock 6:6 ended up around the same size as deblock 0:0.
But did the avg QP, or visual quality remain the same?

Boulder

18th October 2018, 15:58

But did the avg QP, or visual quality remain the same?

The average QP remained pretty much the same, differences were around 0.01-0.03 units between 1:1 - -3:-3. Based on my tests, the higher values did soften the image more at least in places I checked, so maybe the bits were allocated elsewhere in the image.

benwaggoner

18th October 2018, 16:40

in my personal experience avx512 isn't generating much more heat than avx2 - on x265
so as long as you are only using it on x265 and use a safe thermal throttling setting u should be fine imho :)
That’s what I’d expect; TDP is TDP, and thermal throttling should kick in regardless of where the heat is coming from.

Picojoules per pixel is the more relevant metric here. Heat produced is equal to power draw. Better coolers can get the heat away from the CPU better, of course, but watts to the CPU is going to be the same as watts of heat to dissipate.

benwaggoner

18th October 2018, 16:45

I'll post my results as soon as I finish the 720p tests. From what I can tell is that 1080p requires slightly different parameters, but of course it could be that things have changed so much under the hood that my set of parameters have been obsolete all the time :)
Pretty much any settings from before x265 2.4 are invalid now due to the new lambda tables. Lots of other things have improved, but that was probably the biggest change.

What I already found strange is that deblock doesn't really affect bitrate. For example deblock 6:6 ended up around the same size as deblock 0:0.
Lots of codecs’ internal decisions are done ignoring in-loop deblocking to improve speed. So that’s not terribly surprising. And if this is CRF, it is a delta from QP based on spatial complexity, so it’s not surprising that it would change.

It IS surprising that you wouldn’t see bitrate OR QP change, even with a significant loss of detail. Not that using 6:6 is something anyone is likely to do in practice. But reduced detail should result in lower bitrate and/or QP. With lowered detail (including grain/gain noise), prediction should be more efficient...

Maybe log your repro for this as an issue for MCW?

benwaggoner

18th October 2018, 16:48

The average QP remained pretty much the same, differences were around 0.01-0.03 units between 1:1 - -3:-3. Based on my tests, the higher values did soften the image more at least in places I checked, so maybe the bits were allocated elsewhere in the image.
Hmmm. You should look at it in motion then; maybe the improvements are there.

Playing back at 1/4 speed is generally okay to see temporal artifacts better while still keeping temporal coherence. You can still detect discontinuities between IDR/i/P/B/b that way.

Testing at a higher CRF, like maybe 28, can also make differences a lot more obvious. Working at a quality where most things look pretty good can make differences much harder to detect.

nevcairiel

18th October 2018, 17:24

Would it be hotter? Or would it just be slower due to thermal throttling.

I wouldn't be surprised if Intel's next big microarchitecture revision makes AVX512 useful in cases where it isn't today. We saw the same thing with Skylake and AVX2.

The real problem with AVX512 is the strong downclock on stock CPUs, and the high density of the AVX512 instructions, causing a high power draw in a small die area.

With overclockable CPUs, you can control the AVX512 offset and make it useful for x265, but on Xeons or the like, you probably don't get that level of control, and the downclock during AVX512 might offset the advantages it offers.

Additionally, x265 is a pretty "light" AVX512 load. If you change the offset for a lower downclock and then run a strong AVX512 load (like pure math, i.e. FFTs, for a prolonged time), your system may become unstable, due to the extreme density of all the heat and power draw. So it's really hard to balance.

DotJun

19th October 2018, 07:47

You should encode a whole movie (130k frames) instead of an ultra-short clip with a few hundred frames.
The longer you encode, the more heat your CPU will produce, and hence a more aggressive AVX negative offset will be activated.
I encoded a full length movie and temps did not go over 80c which I think is ok for my chip? I should have stated that this computer is in a climate controlled area set to 70F.

My chip is OC'd to 4.5ghz with a -4 offset so that it drops to 4.1 when using avx. I guess I should have tried to compare encoding speed with avx on and off instead of just avx-512 enabled and disabled. I have no throttling issues and the load is a pretty consistent 100% with the occasional 1 second dip down to 87% every 30 seconds or so.

What is the command to disable avx entirely?

afaik avx512 enabled and disabled should produce exact same output. at least on my tests it did. can you share your command line?

This is what I use for 4k source since there is no Film preset:
--preset slower --crf 17 --profile main10 --me 3 --subme 5 --psy-rd 1.5 --psy-rdoq 5.0 --rdoq-level 1 --qcomp 0.8 --deblock -1:-1 --no-sao --repeat-headers --hdr-opt --range limited --colorprim 9 --transfer 16 --colormatrix 9 --master-display "G(13250, 34500)B(7500, 3000)R(34000, 16000)WP(15635, 16450)L(10000000, 1)"
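As a back-of-the-envelope check on the clock offset mentioned above (4.5 GHz dropping to 4.1 GHz with a -4 AVX offset), the downclock sets a break-even point: AVX code has to be at least that much faster per clock than the non-AVX path before it pays off. A rough sketch, assuming one offset step equals 100 MHz (which matches the 4.5 to 4.1 GHz figures but is an assumption here); the function names are made up for illustration:

```python
def avx_clock_ghz(base_ghz, offset_bins, bin_mhz=100.0):
    """Clock under AVX load, assuming each offset bin is bin_mhz wide.

    The 100 MHz bin width is an assumption; it is consistent with the
    4.5 GHz -> 4.1 GHz example with an offset of 4.
    """
    return base_ghz - offset_bins * bin_mhz / 1000.0


def break_even_speedup(base_ghz, avx_ghz):
    """Per-clock speedup the AVX path needs just to match the base clock."""
    return base_ghz / avx_ghz


if __name__ == "__main__":
    avx_ghz = avx_clock_ghz(4.5, 4)
    factor = break_even_speedup(4.5, avx_ghz)
    print("AVX clock: %.1f GHz" % avx_ghz)
    print("Break-even per-clock speedup: %.1f%%" % ((factor - 1.0) * 100.0))
```

This only models the clock ratio; it ignores the cost of switching into and out of the AVX clock state, which was mentioned earlier in the thread as a separate factor.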

Atak_Snajpera

19th October 2018, 12:21

I encoded a full length movie and temps did not go over 80c which I think is ok for my chip? I should have stated that this computer is in a climate controlled area set to 70F.
It is funny how you in USA use two different scales for temperature ;)

What is the command to disable avx entirely?
Bad idea! AVX2 is very useful in x265.

19th October 2018, 13:53

What is the command to disable avx entirely?

--asm sse4

Main asm levels are:
--asm no
--asm sse2
--asm ssse3
--asm sse4
--asm avx2
--asm avx512

--asm avx512 works a bit differently: it only enables the possibility of using AVX-512 in the auto-detection of CPU capabilities. To force the AVX-512 code paths (without checking), please use
--asm avx,avx512
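For scripted encodes, the levels listed above can be wrapped in a small helper that builds the argument list. This is only a sketch: it uses just the --asm values named in this post, the helper name and its force_avx512 switch are hypothetical, and it only builds arguments without launching x265.

```python
# The --asm levels listed in the post above.
ASM_LEVELS = ("no", "sse2", "ssse3", "sse4", "avx2", "avx512")


def x265_asm_args(level, force_avx512=False):
    """Build the --asm arguments for an x265 command line.

    level        -- one of ASM_LEVELS
    force_avx512 -- use the 'avx,avx512' form described above, which forces
                    the AVX-512 code paths instead of merely allowing them
                    in the CPU auto-detection
    """
    if level not in ASM_LEVELS:
        raise ValueError("unknown asm level: %r" % level)
    if force_avx512:
        if level != "avx512":
            raise ValueError("force_avx512 only makes sense with level 'avx512'")
        return ["--asm", "avx,avx512"]
    return ["--asm", level]


if __name__ == "__main__":
    # Disable AVX entirely, as asked above:
    print(" ".join(["x265.exe"] + x265_asm_args("sse4")))
    # Allow AVX-512 if the CPU is detected to support it:
    print(" ".join(["x265.exe"] + x265_asm_args("avx512")))
```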

excellentswordfight

19th October 2018, 19:36

Not sure if anyone who works on x265 is still active in this thread, but I've seen some discussion regarding --ctu and --merange, both in this thread in the past and recently in a few other threads.

From what I’ve read here and from what I’ve experienced in my testing, using a CTU size of 64 is overkill for 1080p and below. I see both a speed increase and better multithreading when lowering it together with merange, with no apparent loss in compression. I also saw some posts way back in this thread that suggested that the default CTU size should be based on resolution. Was this left unimplemented for any specific reason, or is it just that no one has committed a patch for it?

I also have a question for merange, it says in the docs that the value of 57 is based on CTU-size and search method, but the value is set to 57 for all presets even though CTU-size and search method vary. How come?