Old 21st September 2021, 17:39   #81  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,329
Rumors have it that Zen4 might have AVX512, which would be an odd situation as Intel has pulled it from consumer Alder Lake, if AMD brings it to consumer desktop.
But those rumors might as well be totally off, as typical consumers would only see a small benefit ... but of course AMD uses the same cores for the entire lineup, so who knows.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders
Old 21st September 2021, 18:20   #82  |  Link
shootah
Registered User
 
Join Date: Nov 2012
Posts: 2
Quote:
Originally Posted by excellentswordfight View Post
Not sure about the speed as I rarely use preset fast, nor own a 5800X, but I would guesstimate that it would be in the 50fps+ range.
Great, 50 fps is a huge improvement for me!
Old 8th February 2023, 08:52   #83  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,000
Quote:
Originally Posted by nevcairiel View Post
Rumors have it that Zen4 might have AVX512, which would be an odd situation as Intel has pulled it from consumer Alder Lake, if AMD brings it to consumer desktop.
But those rumors might as well be totally off, as typical consumers would only see a small benefit ... but of course AMD uses the same cores for the entire lineup, so who knows.
At the end of 2022 AMD started supporting a good set of AVX512 instructions in the Zen4 7xxx chips. It may not run at the full rate (512 bits per dispatch), though, but only in 256-bit halves (so instructions per cycle are about half of what you would expect). But it most probably supports the full-sized 2 kByte register file for AVX512, and even separate register files for integer and float (so 2x2 kBytes per core). That also helps performance if the program is manually designed to use such a 'large for a desktop CPU' register file, or if the C compiler is configured for the AVX512 architecture and can use the additional register space.

Maybe the next generation of AMD chips will dispatch AVX512 instructions at 'full rate', which could help performance more.

Nowadays more desktop developers with AMD 7xxx chips can write and test AVX512 software without using Intel SDE, which only fully supports up to Visual Studio 2017 (old enough).
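
To put numbers on the register file sizes and the half-rate dispatch mentioned above, here is a minimal Python sketch. The register counts and widths (16 ymm, 32 zmm in 64-bit mode) are architectural; the "passes" model of a 256-bit datapath is a simplification assumed for illustration, not a description of AMD's actual pipeline.

```python
# Architectural vector register file sizes in 64-bit mode.
# The datapath "passes" model is an assumed simplification.

def register_file_bytes(num_registers: int, register_bits: int) -> int:
    """Total bytes one thread can address as vector registers."""
    return num_registers * register_bits // 8

def passes_per_instruction(instruction_bits: int, datapath_bits: int) -> int:
    """How many trips through the datapath one instruction needs."""
    return -(-instruction_bits // datapath_bits)  # ceiling division

avx2_bytes = register_file_bytes(16, 256)    # 16 ymm registers
avx512_bytes = register_file_bytes(32, 512)  # 32 zmm registers

print(avx2_bytes, avx512_bytes, avx512_bytes // avx2_bytes)  # 512 2048 4
print(passes_per_instruction(512, 256))  # 2: 512-bit op on a 256-bit datapath
print(passes_per_instruction(512, 512))  # 1: hypothetical full-width datapath
```

Even with two passes per instruction, the 4x larger register file is still there for code that is written to use it.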
Old 8th February 2023, 19:01   #84  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,652
Even if IPC of AVX512 is half that of AVX2 for many operations, that will still reduce instruction bandwidth which might have some benefit. There's also the permute stuff added in AVX512 which seems like it could simplify some algorithms and optimization if well utilized. I'm not sure how deep the x265 AVX512 optimization goes in that direction.
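
As a rough illustration of the instruction bandwidth point, this Python sketch counts how many full-width vector instructions are needed to touch every byte of a block once. The 64x64 block of 8-bit samples is an assumed example size, and the count ignores everything except vector width.

```python
def vector_instructions(total_bytes: int, vector_bits: int) -> int:
    """Instructions needed to touch every byte once with full-width vectors."""
    bytes_per_vector = vector_bits // 8
    return -(-total_bytes // bytes_per_vector)  # ceiling division

block_bytes = 64 * 64  # one 64x64 block of 8-bit samples (assumed example)

avx2_ops = vector_instructions(block_bytes, 256)
avx512_ops = vector_instructions(block_bytes, 512)
print(avx2_ops, avx512_ops)  # 128 64
```

Half as many instructions have to be fetched and decoded for the same data, even if each one takes twice as long to execute.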

MCW dialed back on further AVX512 work after it was discovered that AVX512's thermal downclocking made it a net negative for perf in most scenarios.

Getting a good Zen4 tuned profile driven optimized binary to test with would reveal much. I'd expect to see some increased benefit with 4K encoding and hopefully 1080p with slower presets. AVX512 on Xeon to date has really only shown net improvements at 4K veryslow, and even then <20%.

Anyone know about AVX512 improvements in the 2023 Xeons?

For AWS EC2 cloud encoding, Graviton2 already offers a lot better throughput/$ than c5 Xeon instances. For cloud at least, x86-64 isn't the only game in town anymore.
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 8th February 2023, 20:33   #85  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,000
One useful feature of the 4x larger register file of AVX512 on x64, compared with AVX2, is the ability to process more blocks per pass in motion estimation, or larger blocks, or a larger search radius, without reloading the register file from cache (a reload costs about 5 or more clock ticks even from the closest L1D cache and can ruin performance several times over). So more manually optimized functions from the x265 developers for the AVX512 architecture may show a bigger performance boost in the future.
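
A back-of-the-envelope Python sketch of that idea: how many candidate blocks can stay resident in the register file next to one source block. The 16x16 8-bit block is an assumed example, and the model ignores the registers a real kernel needs for accumulators and temporaries, so real numbers would be lower.

```python
def resident_candidates(register_file_bytes: int, block_w: int, block_h: int,
                        bytes_per_sample: int = 1) -> int:
    """Candidate blocks that fit beside one resident source block."""
    block_bytes = block_w * block_h * bytes_per_sample
    total_blocks = register_file_bytes // block_bytes
    return max(total_blocks - 1, 0)  # one slot is taken by the source block

results = {}
for name, size in (("AVX2", 512), ("AVX512", 2048)):
    results[name] = resident_candidates(size, 16, 16)
    print(name, results[name])
# AVX2 1
# AVX512 7
```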

"Getting a good Zen4 tuned profile driven optimized binary to test with would reveal much. "

Are there any architecture-optimized builds of x265 (for AVX512 on Intel families or for AVX512 on AMD Zen4) available to download and test?

I quickly tried to make builds, but the only fast result was with MSVC and 2020-dated sources from GitHub (and its /arch:AVX512 build runs even slightly slower on an i5-11600 than the /arch:AVX2 build, but MSVC 16 is not the best optimizing compiler for Intel's AVX512 architecture). With newer sources, either the CMake configure step fails to generate the solution or the .obj files do not link with the Intel C 19.1 compiler. I need more time to solve the build errors. It would be nice if the x265 community already provided AVX512-optimized builds for end users.
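
Before comparing an AVX512 build against an AVX2 build, it helps to confirm which AVX512 subsets the test machine actually reports. A small Python sketch, assuming a Linux-style /proc/cpuinfo with a "flags" line; the helper name avx512_flags and the sample text are made up for illustration.

```python
def avx512_flags(cpuinfo_text: str) -> list:
    """Return the sorted AVX512 feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return sorted(f for f in value.split() if f.startswith("avx512"))
    return []  # no flags line found

# Made-up sample; on Linux, pass open("/proc/cpuinfo").read() instead.
sample = "model name\t: example CPU\nflags\t\t: fpu sse2 avx2 avx512f avx512bw avx512vl\n"
print(avx512_flags(sample))  # ['avx512bw', 'avx512f', 'avx512vl']
```

An empty list means the build would not be able to use its AVX512 code paths on that machine at all.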

"about AVX512 improvements in the 2023 Xeons?"

As I see it, Xeons have a somewhat better optimized memory (cache and RAM) controller for multithreading (multi-core work) compared with desktop chips for poor people. So Xeons may also benefit more from wide bus transfers and wide-word computing. As Intel has nowadays dropped AVX512 support in desktop chips again, it looks like, with the poor memory controller of cheap (sub-$1000) chips, it is still a very unbalanced setup of a too-fast computing core and too-slow memory, and not worth marketing in the dying desktop market.

Last edited by DTL; 8th February 2023 at 20:48.
Old 9th February 2023, 05:48   #86  |  Link
Blue_MiSfit
Derek Prestegard IRL
 
Join Date: Nov 2003
Location: Los Angeles
Posts: 5,980
Quote:
Originally Posted by benwaggoner View Post

For AWS EC2 cloud encoding, Graviton2 already offers a lot better throughput/$ than c5 Xeon instances. For cloud at least, x86-64 isn't the only game in town anymore.
Is this the case for x264 and x265 encoding? I know there have been lots of ARM SIMD optimizations but I'd be kind of surprised if those surpass the level of x86_64 optimization. Does it just not matter and you come out ahead?

We already use Graviton for our database workloads and are moving some of our JVM workloads there as well, but I was under the impression that standard x264/x265 work was still more cost efficient on AMD instances in EC2 ~generally~ speaking
__________________
These are all my personal statements, not those of my employer :)
Old 9th February 2023, 13:57   #87  |  Link
Boulder
Pig on the wing
 
Join Date: Mar 2002
Location: Finland
Posts: 5,673
Quote:
Originally Posted by benwaggoner View Post
Getting a good Zen4 tuned profile driven optimized binary to test with would reveal much. I'd expect to see some increased benefit with 4K encoding and hopefully 1080p with slower presets. AVX512 on Xeon to date has really only shown net improvements at 4K veryslow, and even then <20%.
Once Media Autobuild Suite goes GCC 13, that should be available. I don't know how much you can expect out of some compiler made optimizations though. On a Zen 3, the difference is a couple of % (using 'znver3' as target) against a generic build.
__________________
And if the band you're in starts playing different tunes
I'll see you on the dark side of the Moon...
Old 9th February 2023, 18:36   #88  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,000
Google shows an article from Intel:
https://www.intel.com/content/dam/de...265-avx512.pdf

Accelerating x265 with Intel® Advanced Vector Extensions 512

The diagrams show a really not very nice gain. It looks like the x265 programmers need significantly new processing approaches to use the 4x larger register file and 2x larger data word per instruction of AVX512 over AVX2 (up to about 4x*2x=8x hardware performance boost; superscalar execution also allows several AVX512 instructions to run on different dispatch ports, if the instruction is supported on several ports and no data dependency exists) to get at least a 2x performance gain instead of just a few %. At least on full-blooded Xeons with many memory channels and a better cache controller compared with poor people's desktop chips.

One of the reasons the AVX512 versions gain so little over AVX2: in the AVX2 era, programmers designed the work unit to fit the 512-byte register file of AVX2 CPUs. So simply using AVX512 instructions with the same work unit size may in theory double performance, but the rate at which work units are reloaded from caches and main RAM stays high. So it is just as memory-bound as the AVX2 design.
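
The memory-bound argument can be sketched with a toy roofline-style model in Python, where a kernel takes as long as the slower of its memory traffic and its compute. Every number here (16 bytes per cycle of memory bandwidth, one vector op per cycle, a 4x cut in reload traffic from a larger work unit) is an assumption picked for illustration, and perfect overlap of memory and compute is assumed too.

```python
def kernel_cycles(bytes_moved: int, mem_bytes_per_cycle: int,
                  vector_ops: int, ops_per_cycle: int) -> float:
    """Cycles for a kernel limited by the slower of memory and compute."""
    memory_cycles = bytes_moved / mem_bytes_per_cycle
    compute_cycles = vector_ops / ops_per_cycle
    return max(memory_cycles, compute_cycles)

MEM_BYTES_PER_CYCLE = 16  # assumed
OPS_PER_CYCLE = 1         # assumed

# Same traffic, half the vector ops: still limited by memory.
avx2 = kernel_cycles(4096, MEM_BYTES_PER_CYCLE, 128, OPS_PER_CYCLE)
avx512_same_unit = kernel_cycles(4096, MEM_BYTES_PER_CYCLE, 64, OPS_PER_CYCLE)
# Larger work unit kept in registers: assumed 4x less reload traffic.
avx512_big_unit = kernel_cycles(1024, MEM_BYTES_PER_CYCLE, 64, OPS_PER_CYCLE)

print(avx2 / avx512_same_unit)  # 1.0
print(avx2 / avx512_big_unit)   # 4.0
```

In this toy model the wider instructions alone change nothing; the gain only shows up once the reload traffic drops.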

Moving to the AVX512 architecture also requires increasing the work unit to about 4x the size, and such software may take a very high performance penalty if it tries to use that work unit size on older architectures (AVX2) with a smaller register file (the compiler will accept an intrinsics-based program that overuses the register file, but it will fill the output binary with reloads from cache, which drops performance significantly). Also, as I see it, x265 uses external hand-coded assembly. So new AVX512 functions using the larger work unit size need to be hand-written, and this would be a significant separate part of x265 that only performs well on AVX512 chips (and is mostly unusably slow on older ones).

As AVX512 was very rare on desktops before the AMD 7xxx end-user chips, there were too few reasons (and possibly no investors for a separate AVX512 redesign branch of x265) to put the typically limited developer resources of a freeware open-source project into making a separate version of x265 for rich people's servers. And now that we have more AVX512 chips on desktops, overall developer activity in open source looks almost dead, so there is simply no one left to do one more redesign of x265 for the new architecture offered in desktop chips. Maybe we will see some progress in the 202x years if AMD keeps AVX512 in cheap desktop chips for at least several years.

Last edited by DTL; 10th February 2023 at 12:01.
Old 10th February 2023, 22:51   #89  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,652
Quote:
Originally Posted by DTL View Post
One of the reasons the AVX512 versions gain so little over AVX2: in the AVX2 era, programmers designed the work unit to fit the 512-byte register file of AVX2 CPUs. So simply using AVX512 instructions with the same work unit size may in theory double performance, but the rate at which work units are reloaded from caches and main RAM stays high. So it is just as memory-bound as the AVX2 design.
Very interesting! Does anyone know if the existing AVX512 in x265 was done with this in mind?
Old 11th February 2023, 00:44   #90  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,000
Maybe programmers can complain that complex modern video coding algorithms, with long processing chains (switching between many, many small functions) and frame splitting granularity down to very small blocks (like 4x4 or 2x2 or even a single sample), cannot be arranged into 'any large work units'. So no significant benefit from 'large SIMD architectures' is possible. We will continue to pass something like a 2x2 8-bit block, a 4-byte work unit, through hundreds of functions and many loop iterations on hardware capable of handling work units of up to 2048 bytes with instant access (zero memory load time penalty) to the dispatch ports from the 'register file'. So they may ask for a very large number of full logical processing cores, to process very small work units on very many cores and get a better total frame compression time. But the current hardware manufacturing industry can only provide chips with a small number of full logical cores and an increasing 'work unit' size, moving from the 512-byte register file of AVX(2) to the 2048 bytes of AVX512 (and maybe larger in the next promised AVX1024 in about this year, 2023 - https://appuals.com/granite-rapids-cpu-showcase/ Intel 5th Gen Xeon with AVX1024/FMA3). So it is a task for programmers to adapt software to execution hardware with a small number of cores and an increasing 'work unit' size for each processing thread.
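
For a sense of scale, this Python sketch works out how much of the hardware a 4-byte work unit (a 2x2 block of 8-bit samples) actually occupies. It is plain arithmetic on the sizes quoted above, not a measurement of x265.

```python
def fraction_used(workunit_bytes: int, capacity_bytes: int) -> float:
    """Share of a register (or register file) occupied by one work unit."""
    return workunit_bytes / capacity_bytes

workunit = 2 * 2 * 1  # 2x2 block of 8-bit samples = 4 bytes

of_zmm_register = fraction_used(workunit, 512 // 8)  # one 64-byte zmm register
of_register_file = fraction_used(workunit, 2048)     # whole AVX512 register file

print(of_zmm_register)   # 0.0625
print(of_register_file)  # 0.001953125
```

So a single 2x2 block fills about 6% of one zmm register and about 0.2% of the full register file.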

Last edited by DTL; 11th February 2023 at 00:54.
All times are GMT +1.