Go Back   Doom9's Forum > Video Encoding > High Efficiency Video Coding (HEVC)

Old 12th April 2018, 05:09   #6001  |  Link
foxyshadis
ангел смерти
 
Join Date: Nov 2004
Location: Lost
Posts: 9,277
I thought having Kaby Lake meant I had them, but nope, servers only. I have one customer with a brand spanking new Skylake-X server that I can remote into, so I should be able to get benchmarks tomorrow.
__________________
There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order. ~ Ed Howdershelt
Old 12th April 2018, 06:37   #6002  |  Link
Asmodian
Registered User
 
Join Date: Feb 2002
Location: San Jose, California
Posts: 3,075
AVX-512 is faster!

I did some benchmarks using LigH's build x265 2.7+332-593e63cda903 (Win64) above. I used the same build for the AVX2 tests, simply without the "--asm avx512" option.
i9-7900X @ 4.5 GHz all cores, 3.0 GHz mesh/cache, DDR4 4000-17-18-18-41-1T. No AVX2 or AVX-512 multiplier offsets. Max 92 degC package CPU temperature during both veryslow encodes. The faster modes did not saturate all 20 threads.

The source is 1920x1080 8-bit gradient MagicYUV 4:2:0 on an NVMe SSD, encoding to another NVMe SSD. I used the first 1000 frames from Firefly episode 9, which I had already denoised (SMDegrain) and had on my drive.

avs2pipemod.exe -y4mp=1:1 "fireflyshort.avs" | x265_AVX512.exe --input - --y4m -o "D:\temp\fireflyshort.mkv" --asm avx512 --preset veryslow --crf 18.5 --output-depth 10
x265 [info]: HEVC encoder version 2.7+332-593e63cda903
x265 [info]: build info [Windows][GCC 7.3.0][64 bit] 10bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512
x265 [info]: Main 10 profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 20 threads

veryslow:
AVX512: encoded 1000 frames in 303.44s (3.30 fps), 4037.41 kb/s, Avg QP:20.64
AVX2: encoded 1000 frames in 335.83s (2.98 fps), 4037.41 kb/s, Avg QP:20.64
medium:
AVX512: encoded 1000 frames in 28.41s (35.20 fps), 3183.67 kb/s, Avg QP:20.46
AVX2: encoded 1000 frames in 30.71s (32.57 fps), 3183.67 kb/s, Avg QP:20.46
veryfast:
AVX512: encoded 1000 frames in 15.47s (64.64 fps), 2769.26 kb/s, Avg QP:20.89
AVX2: encoded 1000 frames in 16.89s (59.20 fps), 2769.26 kb/s, Avg QP:20.89
ultrafast:
AVX512: encoded 1000 frames in 6.86s (145.77 fps), 1398.46 kb/s, Avg QP:25.00
AVX2: encoded 1000 frames in 7.22s (138.41 fps), 1398.46 kb/s, Avg QP:25.00
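
For anyone who wants percentages rather than raw timings, the summary lines above can be parsed and compared. This is a minimal Python sketch, assuming the "encoded N frames in Xs (Y fps)" layout shown in these logs; the names RESULTS, parse_summary and speedup are made up for illustration and are not part of x265.

```python
import re

# Summary lines copied from the results above, keyed by (preset, instruction set).
RESULTS = {
    ("veryslow", "AVX512"): "encoded 1000 frames in 303.44s (3.30 fps), 4037.41 kb/s, Avg QP:20.64",
    ("veryslow", "AVX2"): "encoded 1000 frames in 335.83s (2.98 fps), 4037.41 kb/s, Avg QP:20.64",
    ("medium", "AVX512"): "encoded 1000 frames in 28.41s (35.20 fps), 3183.67 kb/s, Avg QP:20.46",
    ("medium", "AVX2"): "encoded 1000 frames in 30.71s (32.57 fps), 3183.67 kb/s, Avg QP:20.46",
    ("veryfast", "AVX512"): "encoded 1000 frames in 15.47s (64.64 fps), 2769.26 kb/s, Avg QP:20.89",
    ("veryfast", "AVX2"): "encoded 1000 frames in 16.89s (59.20 fps), 2769.26 kb/s, Avg QP:20.89",
    ("ultrafast", "AVX512"): "encoded 1000 frames in 6.86s (145.77 fps), 1398.46 kb/s, Avg QP:25.00",
    ("ultrafast", "AVX2"): "encoded 1000 frames in 7.22s (138.41 fps), 1398.46 kb/s, Avg QP:25.00",
}

# Assumed layout of the summary line; only the frame count, time and fps are captured.
SUMMARY = re.compile(r"encoded (\d+) frames in ([\d.]+)s \(([\d.]+) fps\)")


def parse_summary(line):
    """Return (frames, seconds, fps) from one summary line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not a summary line: {line!r}")
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


def speedup(preset):
    """Wall-clock speedup of the AVX-512 run over the AVX2 run for one preset."""
    _, seconds_avx512, _ = parse_summary(RESULTS[(preset, "AVX512")])
    _, seconds_avx2, _ = parse_summary(RESULTS[(preset, "AVX2")])
    return seconds_avx2 / seconds_avx512


for preset in ("veryslow", "medium", "veryfast", "ultrafast"):
    print(f"{preset:>9}: {speedup(preset) - 1:+.1%}")
```

On these numbers the AVX-512 runs come out roughly 5% (ultrafast) to 11% (veryslow) faster, with identical bitrate and QP.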

Thanks to everyone who works on x265 and thanks for the regular builds LigH.
__________________
madVR options explained

Last edited by Asmodian; 12th April 2018 at 07:12.
Old 12th April 2018, 10:41   #6003  |  Link
excellentswordfight
Lost my old account :(
 
Join Date: Jul 2017
Posts: 28
Slower here.

Using LigH's build on a Dell 2U rack server with a Xeon Gold 6126 (12c/24t). CPU utilization dropped by about 10% (for both 1080p and 2160p) and clock speed dropped from 2.9 GHz to 2.4 GHz. I'm guessing that the gains from AVX-512 didn't outweigh the drop in clock speed and utilization.

Tears of Steel source (10-bit UHD Blu-ray compatible x265 source for the 2160p test, 8-bit Blu-ray compatible x264 source for 1080p)

2160p with avx512: 80-90% CPU usage, 2.28 fps
Code:
--asm avx512 --preset slow --profile main10 --level-idc 51 --crf 22

2160p: 100% CPU usage, 2.36 fps
Code:
--preset slow --profile main10 --level-idc 51 --crf 22

1080p with avx512: 45-55% CPU usage, 6.54 fps
Code:
--asm avx512 --preset slow --profile main10 --level-idc 41 --crf 18

1080p: 55-65% CPU usage, 7.14 fps
Code:
--preset slow --profile main10 --level-idc 41 --crf 18

Last edited by excellentswordfight; 12th April 2018 at 11:58.
Old 12th April 2018, 11:10   #6004  |  Link
WhatZit
Registered User
 
Join Date: Aug 2016
Posts: 54
Quote:
Originally Posted by excellentswordfight View Post
I'm guessing that the gains for AVX512 didn't outweigh the drop in clockspeed and utilization.
Yep, a Catch-22 also discovered by Cloudflare after some cryptography assessments: https://blog.cloudflare.com/on-the-d...uency-scaling/
Old 12th April 2018, 12:47   #6005  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 9,242
Asmodian runs without an AVX-512 offset, which would instantly crash his system if a strong AVX-512 workload ran, so clearly it's faster with some "light" AVX-512 usage. Usually you need at least a -10 offset or so to keep it stable under strong AVX-512 load (or boost voltages substantially, for more heat). Non-OCed Xeon CPUs probably downclock quite substantially.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders
Old 12th April 2018, 14:18   #6006  |  Link
Barough
Registered User
 
Join Date: Feb 2007
Location: Sweden
Posts: 280
x265 v2.7+337-54ff74d2b635 (GCC 7.3.0, 32 & 64-bit 8/10/12bit Multilib Windows Binaries)

Code:
https://bitbucket.org/multicoreware/x265/commits/branch/default
Old 12th April 2018, 14:26   #6007  |  Link
burfadel
Registered User
 
Join Date: Aug 2006
Posts: 2,235
Probably best to utilise AVX-512 where it gives the best gains without triggering thermal throttling. The good thing, at least, is that with 307 separate patches this can be whittled down. If a function is frequently used and gives only a small gain, the encode may actually be faster with it omitted, due to the throttling the patch causes. Even if throttling isn't triggered on a particular rig, the temperature difference should be taken into account to cover typical situations.
Old 12th April 2018, 17:34   #6008  |  Link
Stephen R. Savage
Registered User
 
Join Date: Nov 2009
Posts: 332
I tested Barough's build in comment #6006 on an i9-7900X at Intel POR settings.

Code:
Processor: i9-7900X
OS: Windows 10 1709
Memory: 4x DDR4-3200 CL 16

Extended processor information:
Turbo frequency: 4.0 / 3.6 / 3.3 GHz
PL1: 140 W
PL2: 168 W
PL1tau: 1 s

Input: 1280x720 4:4:4 10-bit, 2160 frames
Encoder settings: --preset veryslow --output-depth 10 --crf 18

AVX2: --asm avx2
Speed: 2.36 fps
Power: 92 W
Efficiency: 0.026 fps/W

AVX-512: --asm avx512
Speed: 2.45 fps
Power: 87 W
Efficiency: 0.028 fps/W
The memory is non-JEDEC (JEDEC would be DDR4-2666 CL 17), but that should not matter for video encoding. After many months(?) of work, AVX-512 has managed to deliver a meager 3.8% speedup. The efficiency story is slightly better, at almost 10% more frames/W, but since the power never came close to TDP, efficiency is of little importance.
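
Both percentages can be re-derived from the speed and power readings listed above. A quick Python sketch follows; the variable and function names are only illustrative.

```python
# Readings from the i9-7900X run above: speed in fps, package power in watts.
avx2 = {"fps": 2.36, "watts": 92.0}
avx512 = {"fps": 2.45, "watts": 87.0}


def fps_per_watt(run):
    """Frames per second delivered per watt of package power."""
    return run["fps"] / run["watts"]


# Relative gains of the AVX-512 run over the AVX2 run.
speed_gain = avx512["fps"] / avx2["fps"] - 1
efficiency_gain = fps_per_watt(avx512) / fps_per_watt(avx2) - 1

print(f"speedup:    {speed_gain:.1%}")
print(f"efficiency: {efficiency_gain:.1%}")
```

Dividing the rounded fps/W values (0.028 / 0.026) would give only about 8%, so the "almost 10%" figure comes from the unrounded ratio.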

Quote:
Originally Posted by burfadel View Post
Probably best to utilise AVX-512 where it gives the best gains without triggering thermal throttling. The good thing, at least, is that with 307 separate patches this can be whittled down. If a function is frequently used and gives only a small gain, the encode may actually be faster with it omitted, due to the throttling the patch causes. Even if throttling isn't triggered on a particular rig, the temperature difference should be taken into account to cover typical situations.
It's actually the opposite. They need to pack all the AVX-512 functions so that they run back-to-back, or else the whole application will be frequency-throttled, yet gain no benefit. It takes on the order of 1 ms to change AVX states on the processor.

Last edited by Stephen R. Savage; 12th April 2018 at 23:27.
Old 12th April 2018, 17:52   #6009  |  Link
LigH
German doom9/Gleitz SuMo
 
Join Date: Oct 2001
Location: Germany, rural Altmark
Posts: 5,312
x265 2.7+337-54ff74d2b635
  • Merge with default; prep for v3.0
  • Support for HLG-graded content and pic_struct
  • Fix conditions for single-sei NAL
  • Fix 32 bit build error (means: AVX-512 support is only included in x86-64 architecture target)
(VMAF support to report per frame and aggregate VMAF score — unfortunately not yet? available for Windows builds)

New CLI parameters:

Code:
   --atc-sei <integer>           Emit the alternative transfer characteristics SEI message where the integer is the preferred transfer characteristics. Default disabled
   --pic-struct <integer>        Set the picture structure and emits it in the picture timing SEI message. Values in the range 0..12. See D.3.3 of the HEVC spec. for a detailed explanation.
__________________

German Gleitz board
MediaFire: x264 | x265 | VPx | AOM | Xvid
Old 12th April 2018, 17:57   #6010  |  Link
Asmodian
Registered User
 
Join Date: Feb 2002
Location: San Jose, California
Posts: 3,075
Quote:
Originally Posted by nevcairiel View Post
Asmodian runs without an AVX-512 offset, which would instantly crash his system if a strong AVX-512 workload ran, so clearly it's faster with some "light" AVX-512 usage. Usually you need at least a -10 offset or so to keep it stable under strong AVX-512 load (or boost voltages substantially, for more heat). Non-OCed Xeon CPUs probably downclock quite substantially.
I had downclocked from my normal max clocks when running without an AVX offset.

I also ran some tests at my normal OC settings with -2, -4 multiplier offsets. 4.8 GHz max core, 4.6 GHz AVX2, 4.4 GHz AVX-512.

AVX512: encoded 1000 frames in 310.46s (3.22 fps), 4037.41 kb/s, Avg QP:20.64
AVX2: encoded 1000 frames in 335.85s (2.98 fps), 4037.41 kb/s, Avg QP:20.64

It would probably still melt with a heavy AVX-512 load but it also wasn't completely maxed. AVX-512 ran cooler than AVX2 at these settings. I am not sure why my AVX2 run only had the same speed as the previous 4.5 GHz encode, maybe a latency penalty due to the core changing states.

This is a binned, delidded, and water cooled CPU... other systems may have different results.

Edit: If I run Prime95 (p95v294b8) with AVX-512 at 4.5 GHz I do get thermal throttling.

Last edited by Asmodian; 12th April 2018 at 19:09.
Old 12th April 2018, 20:31   #6011  |  Link
Stephen R. Savage
Registered User
 
Join Date: Nov 2009
Posts: 332
Quote:
Originally Posted by nevcairiel View Post
Asmodian runs without an AVX-512 offset, which would instantly crash his system if a strong AVX-512 workload ran, so clearly it's faster with some "light" AVX-512 usage. Usually you need at least a -10 offset or so to keep it stable under strong AVX-512 load (or boost voltages substantially, for more heat). Non-OCed Xeon CPUs probably downclock quite substantially.
Anandtech has the Xeon Scalable turbo frequencies for each AVX state: https://www.anandtech.com/show/11544...f-the-decade/8

At a glance, it's clear that most Xeon Gold/Platinum SKUs have a 20% AVX-512 vs AVX penalty, which is far more than the x265 "optimization" achieves. The high frequency client parts (Core-X/Xeon-W) have a lower penalty of only 10%, where the new x265 just barely comes out ahead. Another 10% frequency loss would reverse the gains, which matches up with what excellentswordfight found.
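
The break-even argument can be put into numbers with a very crude model. The Python sketch below assumes that fps scales linearly with core clock and that thread utilization stays the same, neither of which is guaranteed; the 0.90 and 0.80 clock ratios are the rough penalties quoted above, and the function names are hypothetical.

```python
def net_speedup(per_clock_gain, freq_ratio):
    """Net throughput ratio, assuming fps scales linearly with core clock."""
    return per_clock_gain * freq_ratio


def break_even_freq_ratio(per_clock_gain):
    """Lowest AVX-512/AVX2 clock ratio at which AVX-512 is still not slower."""
    return 1.0 / per_clock_gain


observed_net = 2.45 / 2.36  # i9-7900X result from post #6008
client_ratio = 0.90         # roughly 10% AVX-512 clock penalty (Core-X / Xeon-W)
server_ratio = 0.80         # roughly 20% penalty (most Xeon Gold/Platinum SKUs)

# Per-clock gain implied by the client result under the linear-scaling assumption.
per_clock_gain = observed_net / client_ratio

print(f"implied per-clock gain:    {per_clock_gain:.3f}x")
print(f"break-even clock ratio:    {break_even_freq_ratio(per_clock_gain):.3f}")
print(f"predicted net at 20% drop: {net_speedup(per_clock_gain, server_ratio):.3f}x")

# Measured on the Xeon Gold 6126 earlier in the thread (AVX-512 fps / AVX2 fps).
print(f"measured 2160p: {2.28 / 2.36:.3f}x")
print(f"measured 1080p: {6.54 / 7.14:.3f}x")
```

Under those assumptions a 20% clock penalty predicts a net loss of about 8%, which is in the same range as the 3% to 8% slowdown measured on the Xeon.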
Old 12th April 2018, 20:57   #6012  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 9,242
Quote:
Originally Posted by Asmodian View Post
Edit: If I run Prime95 (p95v294b8) with AVX-512 at 4.5 GHz I do get thermal throttling.
Try with LinX/Linpack and see your system die. Prime95 does not fully use AVX512 yet (only trial factoring, not full FFTs)
Old 12th April 2018, 21:06   #6013  |  Link
Stephen R. Savage
Registered User
 
Join Date: Nov 2009
Posts: 332
Quote:
Originally Posted by nevcairiel View Post
Try with LinX/Linpack and see your system die. Prime95 does not fully use AVX512 yet (only trial factoring, not full FFTs)
It's actually not so bad at higher frequencies, because each 100 MHz increment saves a lot more power, compared to 2.5 GHz server SKUs. i9-7900X can reach 4.1-4.2 GHz AVX-512 frequency with an aftermarket cooling solution.
Old 12th April 2018, 21:37   #6014  |  Link
jlpsvk
Registered User
 
Join Date: Dec 2014
Posts: 162
Quote:
Originally Posted by Kavitha View Post
x265 has static levels of refinement (--refine-inter <level> / --refine-intra <level>) which can be used with --analysis-reuse-level 10.
Quality improves as the level of refinement increases. This quality increase results from additional computation, which increases the overall encoding time.
For a better quality-speed trade-off, dynamic refinement was introduced, where the encoder dynamically switches between different inter refine levels.
This exploits the fact that not all CUs need to be encoded at the same level for better performance/quality.
Considering the complexity of the video content and the analysis information from the first pass, the encoder can intelligently decide the optimal level of refinement for each CU.
Intra frames are usually encoded with the best quality, as they are used as references by subsequent frames. Hence errors introduced in intra frames by reusing analysis data can propagate to frames that use these intra frames as references.
To minimize the chances of error propagation, --refine-intra 4 (the level with the best quality) restricts reusing analysis data for intra frames and forces the encoder to perform full intra analysis in the second pass.
This is why the x265 documentation suggests using dynamic refinement along with --refine-intra 4, and this setting is expected to give better quality than other refine-intra levels for some videos.
Any recommended quality-oriented settings for 4K HDR encoding? With CRF at e.g. 17?
Old 12th April 2018, 21:37   #6015  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 9,242
Quote:
Originally Posted by Stephen R. Savage View Post
It's actually not so bad at higher frequencies, because each 100 MHz increment saves a lot more power, compared to 2.5 GHz server SKUs. i9-7900X can reach 4.1-4.2 GHz AVX-512 frequency with an aftermarket cooling solution.
You can reach that if you boost the power you give the CPU, but unfortunately that also boosts the power outside of AVX512 mode, making your CPU overall less efficient. The integrated voltage controller has no option to increase the core voltage only in AVX512 mode, unfortunately.

But this is probably going a bit off-topic for X265.
I would've thought the X265 people already learned the down-clocking lesson with AVX2 though, where they experienced the same effect - fancy instructions that made the overall encode slower, especially on server systems, due to clock changes.
Old 12th April 2018, 21:44   #6016  |  Link
mandarinka
Registered User
 
Join Date: Jan 2007
Posts: 666
https://forums.anandtech.com/threads...#post-39149633

Quote:
RZN vs. CFL vs. SKL-X in X265 2.5+31:

RZN: w/ AVX2 = 100.00%, w/o AVX2 = 105.21%
CFL: w/ AVX2 = 130.61%, w/o AVX2 = 101.13%
SKL-X: w/ AVX2 = 135.47%, w/o AVX2 = 105.21%

Ryzen's performance without AVX2 is impressive, but it is sad to see that there is still a penalty (like on Excavator) when running 256-bit code.
I wish somebody would adjust the CPU detection code to disable AVX2 on Zen. Easy performance gain just from that simple change: Zen gets 5.2% faster by disabling AVX2.
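
To make the per-architecture effect explicit, the relative scores quoted above can be turned into "gain from enabling AVX2" figures. A small Python sketch, using only the quoted numbers; the dictionary and function names are illustrative.

```python
# Relative x265 scores from the quoted post: (with AVX2, without AVX2), in percent.
SCORES = {
    "RZN": (100.00, 105.21),
    "CFL": (130.61, 101.13),
    "SKL-X": (135.47, 105.21),
}


def avx2_gain(with_avx2, without_avx2):
    """Relative change in speed from enabling AVX2 (negative means slower)."""
    return with_avx2 / without_avx2 - 1


for arch, (with_avx2, without_avx2) in SCORES.items():
    print(f"{arch:>5}: {avx2_gain(with_avx2, without_avx2):+.1%} from enabling AVX2")

# The same Ryzen numbers seen from the other side: gain from disabling AVX2.
rzn_with, rzn_without = SCORES["RZN"]
print(f"  RZN: {rzn_without / rzn_with - 1:+.1%} from disabling AVX2")
```

The two Intel parts gain close to 29% from AVX2, while Ryzen loses about 5% with it enabled, which is the same 5.2% figure expressed relative to the AVX2 run.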
Old 12th April 2018, 21:51   #6017  |  Link
jlpsvk
Registered User
 
Join Date: Dec 2014
Posts: 162
Is it just me? The "using cpu capabilities" line in the new x265 isn't listing AVX-512. i7-7820X
Old 12th April 2018, 22:56   #6018  |  Link
Asmodian
Registered User
 
Join Date: Feb 2002
Location: San Jose, California
Posts: 3,075
Don't forget the "--asm avx512"; it isn't enabled by default. That seems like a good choice if it is slower on most systems due to the multiplier offsets for AVX-512.
Old 12th April 2018, 23:01   #6019  |  Link
jlpsvk
Registered User
 
Join Date: Dec 2014
Posts: 162
@Asmodian

aaaaah... forgot it..
Old 12th April 2018, 23:24   #6020  |  Link
Stephen R. Savage
Registered User
 
Join Date: Nov 2009
Posts: 332
Quote:
Originally Posted by nevcairiel View Post
But this is probably going a bit off-topic for X265.
I would've thought the X265 people already learned the down-clocking lesson with AVX2 though, where they experienced the same effect - fancy instructions that made the overall encode slower, especially on server systems, due to clock changes.
To be fair, the AVX2 speedup is still larger than the frequency penalty, so it made sense. AVX-512 was always going to be a stretch, seeing as the incremental speedup would necessarily be less than AVX2 (30%). Looking through the Bitbucket commit log, a lot of the AVX-512 kernels are getting poor scaling from the AVX2 version. For example, the iDCT, a fundamental operation, didn't see any improvement from AVX-512.

Code:
https://bitbucket.org/multicoreware/x265/commits/b7149d1068997ee2a92dd3d48a848d1d65698e82

[x265-avx512]x86: AVX512 idct16x16
AVX2 Performance    :    11.67x
AVX512 Performance  :    12.80x
AVX-512 is probably too wide (64 bytes, 32 words) for the primitives in HEVC (16x16 and 32x32 blocks). There was a similar limit to x264's scaling from SSE to AVX, which yielded only 10%, and only in 10-bit mode.
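
The width argument and the idct16x16 numbers can be checked with a little arithmetic. The Python sketch below assumes 16-bit coefficients (the "words" mentioned above) and assumes both commit-log figures are speedups over the same baseline; the names are illustrative only.

```python
# Vector register sizes in bytes.
REGISTER_BYTES = {"AVX2": 32, "AVX-512": 64}
COEFF_BYTES = 2  # assumption: 16-bit coefficients


def lanes(isa):
    """Number of 16-bit words that fit in one vector register."""
    return REGISTER_BYTES[isa] // COEFF_BYTES


def rows_per_register(isa, block_width):
    """How many rows of a block_width-wide block one register holds."""
    return lanes(isa) / block_width


for isa in REGISTER_BYTES:
    for width in (16, 32):
        print(f"{isa:>7}, {width}x{width} block: "
              f"{rows_per_register(isa, width):g} row(s) per register")

# idct16x16 figures from the commit log quoted above.
avx2_speedup, avx512_speedup = 11.67, 12.80
incremental = avx512_speedup / avx2_speedup - 1
print(f"AVX-512 over AVX2 for idct16x16: {incremental:+.1%}")
```

A 16-coefficient row already fills an AVX2 register exactly, so AVX-512 has to pack two rows per register to use its full width, and the reported kernel gains under 10% over AVX2 instead of the ideal 2x.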