
New CPU - Intel or AMD?



Nico8583
2nd February 2021, 17:52
If you can get any of those new Ryzens, definitely grab one. Encoding is cheaper since it's more computing power per watt. Otherwise a good option might be to buy a second-hand 3900X on the cheap. The extra threads do give a nice boost.

CTU 64 is in my opinion a bad idea. There is something definitely wrong there because --limit-tu 0 --ctu 64 --rskip 2 is just totally broken. I don't trust it at all because of that observation.
https://forum.doom9.org/showthread.php?p=1919347#post1919347
What are the default values? ctu=64, limit-tu=0, and rskip?

benwaggoner
2nd February 2021, 18:11
CTU 64 is in my opinion a bad idea. There is something definitely wrong there because --limit-tu 0 --ctu 64 --rskip 2 is just totally broken. I don't trust it at all because of that observation.
https://forum.doom9.org/showthread.php?p=1919347#post1919347
I use ctu 64 regularly without any issues. In the early days I did see some reproducible quality issues with ctu 64 sometimes, but it's been years.

Boulder
2nd February 2021, 19:15
I use ctu 64 regularly without any issues. In the early days I did see some reproducible quality issues with ctu 64 sometimes, but it's been years.

That one is definitely reproducible with that sample clip of mine, and it's not that old. It actually appeared when they implemented the new rskip options some time ago.

Boulder
2nd February 2021, 19:16
What are the default values? ctu=64, limit-tu=0, and rskip?

The defaults depend on the preset: --limit-tu is 4 or 0, --ctu is 64, and --rskip is 1.
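
Since these defaults shift with the preset, one way to avoid surprises is to pin the options you care about explicitly on the command line. A minimal Python sketch, assuming an `x265` binary on the PATH and hypothetical file names; `build_x265_args` is a hypothetical helper, while `--input`, `--output`, `--preset`, `--crf`, `--ctu`, `--limit-tu` and `--rskip` are real x265 options:

```python
def build_x265_args(src, dst, preset="slow", crf=18.0,
                    ctu=None, limit_tu=None, rskip=None):
    """Build an x265 command line, pinning CTU/limit-tu/rskip only when given.

    Options left as None fall back to whatever the chosen preset sets.
    """
    args = ["x265", "--input", src, "--output", dst,
            "--preset", preset, "--crf", str(crf)]
    if ctu is not None:
        args += ["--ctu", str(ctu)]            # 16, 32 or 64
    if limit_tu is not None:
        args += ["--limit-tu", str(limit_tu)]  # 0 disables the TU limit
    if rskip is not None:
        args += ["--rskip", str(rskip)]        # recursion-skip mode 0, 1 or 2
    return args

if __name__ == "__main__":
    cmd = build_x265_args("input.y4m", "output.hevc", ctu=32, limit_tu=0, rskip=1)
    print(" ".join(cmd))
    # To actually encode: import subprocess; subprocess.run(cmd, check=True)
```

Leaving a parameter as `None` keeps the preset's value, so the same helper works whether you want to override one option or none.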

Nico8583
2nd February 2021, 20:11
The defaults depend on the preset: --limit-tu is 4 or 0, --ctu is 64, and --rskip is 1.
Thanks, I'll look at the preset values to compare.

Nico8583
2nd February 2021, 20:15
One last question, to be sure: if I use a 3700X (8 cores / 16 threads) with x265's default settings, just --preset slow (as an example), it will produce an encoded file. If I use a 3900X (12 cores / 24 threads) with the same default settings (--preset slow), will it produce a different encoded file because the core count is higher?
Now, if I use a 5800X (8 cores / 16 threads), will the output file be exactly the same as the 3700X one, since both have 8 cores and 16 threads?

DJATOM
2nd February 2021, 21:01
I'm not sure the results are deterministic at all... But yes, the bitrate should be the same in all cases.

apophis906
2nd February 2021, 22:51
One last question, to be sure: if I use a 3700X (8 cores / 16 threads) with x265's default settings, just --preset slow (as an example), it will produce an encoded file. If I use a 3900X (12 cores / 24 threads) with the same default settings (--preset slow), will it produce a different encoded file because the core count is higher?
Now, if I use a 5800X (8 cores / 16 threads), will the output file be exactly the same as the 3700X one, since both have 8 cores and 16 threads?

The files will be the same as long as the settings are the same. I have tested on an i3-2700M, an i7-6700, and a 3700X. The same test file with the same settings produced exactly the same output in bitrate and size. The difference is in the time it takes to encode.

Nico8583
3rd February 2021, 05:49
Thank you. Did you change the --frame-threads and --lookahead-threads values in your settings?

apophis906
3rd February 2021, 07:22
Thank you. Did you change the --frame-threads and --lookahead-threads values in your settings?

No, I left them on auto. So the i3 has 2 frame threads and a 4-thread pool, the i7 has 3 frame threads and an 8-thread pool, and of course the 3700X has 4 frame threads and a 16-thread pool. From my testing, the output looks the same with more cores. The only time I have ever seen a different output was when I tried setting it to 1 frame thread; 2 and up give the same results. So I would expect a 16-core CPU to produce the same output as an 8-core one with the same settings.
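
The automatic frame-thread counts reported above line up with a simple threshold rule on the logical CPU count. The sketch below only reproduces those three data points (4, 8 and 16 threads); it is an assumption about how x265's auto-detection behaves, not its actual source, and the real heuristic may pick more frame threads on larger machines. `auto_frame_threads` is a hypothetical name:

```python
def auto_frame_threads(logical_cpus: int) -> int:
    """Approximate x265's automatic --frame-threads choice (assumed thresholds).

    Matches the values reported in this thread: 4 threads -> 2,
    8 threads -> 3, 16 threads -> 4. Larger CPUs are capped at 4 here,
    which may not match x265 itself.
    """
    if logical_cpus >= 16:
        return 4
    if logical_cpus >= 8:
        return 3
    if logical_cpus >= 4:
        return 2
    return 1

for name, threads in [("i3 (4 threads)", 4), ("i7-6700", 8),
                      ("3700X", 16), ("5800X", 16)]:
    print(f"{name}: {auto_frame_threads(threads)} frame threads, "
          f"{threads}-thread pool")
```

Under this rule a 3700X and a 5800X, both with 16 threads, get identical automatic settings.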

excellentswordfight
3rd February 2021, 13:05
Just to add some numbers regarding utilization with threading options left at default:

--main10 --preset slow --crf 18 on a 24C/48T EPYC Rome:

1080p Bluray source:

With CTU 64 & Merange 57: 25-35%
With CTU 32 & Merange 26: 50-70%

2160p UHD Bluray source:

With CTU 64 & Merange 57: 50-95% (24 threads pinned at 100%, the other 24 very unstable).

1080p with CTU 32, and UHD at the defaults, should saturate up to 16C/32T well for single-instance encoding. The 5950X is likely the fastest single-instance encoding processor you can get for these cases at the moment (I would guesstimate that it beats the 3960X and whatever Intel has to offer, including all Xeons).
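
One way to see why CTU size and resolution matter so much here: with wavefront parallel processing (on by default in x265), each CTU row is a separate job, and a row has to stay a couple of CTUs behind the row above it. The rough Python estimate below counts CTU rows and the maximum number of CTUs that can be in flight within one frame. It deliberately ignores frame threads, lookahead and motion-search range (which also limit how far frames can overlap), so treat it as an illustration of the trend, not a model of x265's scheduler. `wavefront_parallelism` is a hypothetical helper:

```python
import math

def wavefront_parallelism(width: int, height: int, ctu: int) -> tuple[int, int, int]:
    """Return (ctu_rows, ctu_cols, max_concurrent_ctus) for one frame.

    Assumes each row must stay at least 2 CTUs behind the row above it,
    so the widest diagonal covers about ceil(cols / 2) rows at once.
    """
    rows = math.ceil(height / ctu)
    cols = math.ceil(width / ctu)
    return rows, cols, min(rows, math.ceil(cols / 2))

for label, (w, h), ctu in [("1080p CTU 64", (1920, 1080), 64),
                           ("1080p CTU 32", (1920, 1080), 32),
                           ("2160p CTU 64", (3840, 2160), 64)]:
    rows, cols, par = wavefront_parallelism(w, h, ctu)
    print(f"{label}: {rows} rows x {cols} cols, "
          f"up to {par} CTUs in parallel per frame")
```

By this estimate, 1080p at CTU 32 and 2160p at CTU 64 offer about the same per-frame parallelism (30 CTUs), roughly twice that of 1080p at CTU 64 (15), which fits the utilization figures above.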

Atak_Snajpera
3rd February 2021, 15:47
CPU usage should be even higher if you do some heavy filtering like denoising in AviSynth (MDegrain + prefetch 24).

benwaggoner
3rd February 2021, 23:30
CPU usage should be even higher if you do some heavy filtering like denoising in AviSynth (MDegrain + prefetch 24).
But how many threads can those algorithms usefully use at once? Temporal denoising can be hard to parallelize because it is an interframe analysis process. But it is possible if each frame does its own forward and backward comparisons without reusing other analysis of the same frames. That can double the total CPU work, but offers a lot better parallelism.
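
A toy version of that trade-off: each output frame reads its own window of neighboring source frames and shares nothing with other output frames, so frames can be processed in parallel, at the cost of loading each source frame several times. This is a plain Python sketch in which frames are stood in by numbers and a simple average stands in for real motion-compensated denoising (it is not how MDegrain works); `denoise_all`, `load_frame` and `denoise_frame` are hypothetical:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def denoise_all(num_frames: int, radius: int, workers: int = 4):
    """Average each frame with its +/-radius neighbors, one independent job per frame."""
    loads = 0
    lock = threading.Lock()

    def load_frame(n: int) -> float:
        nonlocal loads
        with lock:            # count every (re)load of a source frame
            loads += 1
        return float(n)       # stand-in for decoding frame n

    def denoise_frame(n: int) -> float:
        lo, hi = max(0, n - radius), min(num_frames - 1, n + radius)
        window = [load_frame(i) for i in range(lo, hi + 1)]  # nothing shared between jobs
        return sum(window) / len(window)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        out = list(pool.map(denoise_frame, range(num_frames)))
    return out, loads

out, loads = denoise_all(num_frames=10, radius=1)
print(loads)  # 28: most source frames are loaded three times
```

The redundant loads are the "extra total CPU"; the payoff is that no job ever waits for another.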

DJATOM
4th February 2021, 00:05
Modern video processing frameworks rely on caches, so there is no need to compute anything twice: they just pick the cached frame and reuse it, or fall back to full processing on a cache miss. With a large amount of RAM, MDegrain is fast enough.
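
The caching approach can be sketched the same way: source frames go through an LRU cache, so a frame requested by several neighboring output frames is only produced once, as long as frames are requested in order and the cache holds at least the filter window (2 * radius + 1 frames). This uses Python's `functools.lru_cache` as a stand-in for a framework's frame cache; `run_cached` and `get_frame` are hypothetical, and frames are again plain numbers:

```python
from functools import lru_cache

def run_cached(num_frames: int, radius: int, cache_size: int = 16):
    """Temporal average over +/-radius frames, fetching sources through an LRU cache."""
    produced = []  # records each source frame actually computed (cache misses)

    @lru_cache(maxsize=cache_size)
    def get_frame(n: int) -> float:
        produced.append(n)
        return float(n)  # stand-in for decoding/filtering frame n

    out = []
    for n in range(num_frames):
        lo, hi = max(0, n - radius), min(num_frames - 1, n + radius)
        window = [get_frame(i) for i in range(lo, hi + 1)]  # repeats are cache hits
        out.append(sum(window) / len(window))
    return out, len(produced)

out, misses = run_cached(num_frames=10, radius=1)
print(misses)  # 10: every source frame is computed exactly once
```

The price is memory: the cache has to keep whole frames around, which is why plenty of RAM helps.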

Atak_Snajpera
4th February 2021, 21:55
But how many threads can those algorithms usefully use at once? Temporal denoising can be hard to parallelize because it is an interframe analysis process. But it is possible if each frame does its own forward and backward comparisons without reusing other analysis of the same frames. That can double the total CPU work, but offers a lot better parallelism.

MDegrain works very well with prefetch equal to the number of physical cores.

Nico8583
18th August 2021, 13:36
Hi,
Has anyone already compared the 3900X and 5900X in real-world conditions (x265 at 1080p and 4K)? With the same settings, what performance difference should I expect between them?
Thanks!

RanmaCanada
18th August 2021, 17:36
Hi,
Has anyone already compared the 3900X and 5900X in real-world conditions (x265 at 1080p and 4K)? With the same settings, what performance difference should I expect between them?
Thanks!

Just look at the new benchmark thread from Sagitarre.

https://forum.doom9.org/showthread.php?t=174393&page=4

The 3900X is barely ahead of the 5800X.

Nico8583
27th August 2021, 14:13
Just look at the new benchmark thread from Sagitarre.

https://forum.doom9.org/showthread.php?t=174393&page=4

The 3900X is barely ahead of the 5800X.
Thanks ;)

tonemapped
3rd September 2021, 07:31
I've just ordered an AMD 5900X and an RTX 3080 for encoding (NVENC still has a few use cases outside of streaming). The reason for the 5900X (or even going for a 3000-series or any other 5000-series) is power savings, less heat, and better value for money with multithreaded workloads such as encoding. As it stands, I would recommend AMD (I believe that's the first time I've written that since I purchased an Athlon 64 X2 as a child!).

mindphasar
12th September 2021, 14:13
Are 11th-gen Intel processors with the integrated UHD Graphics 750 GPU good for x265 encoding, or is it better to buy an AMD processor without integrated graphics?

DJATOM
12th September 2021, 21:04
I assume AMD will be better (HW encoding usually performs worse than SW), and Intel's 8 big + 8 small cores might be slower than AMD's 16 full-power cores. So just wait for the actual release and check speed comparisons before buying.

benwaggoner
13th September 2021, 17:42
Are 11th-gen Intel processors with the integrated UHD Graphics 750 GPU good for x265 encoding, or is it better to buy an AMD processor without integrated graphics?
There's no downside to having an integrated GPU, but there's no upside either for x265 itself. Depending on workflow, the GPU can be used to accelerate source decode and potentially some preprocessing operations. And there's the potential to use a HW first pass and then reuse the initial analysis in x265 for a moderate speed boost, but it's a hassle to configure, and the speedup is maybe 25% if done without quality loss.

I don't know anyone doing that in practice for file-to-file encoding.

microchip8
13th September 2021, 17:50
Are 11th-gen Intel processors with the integrated UHD Graphics 750 GPU good for x265 encoding, or is it better to buy an AMD processor without integrated graphics?

Intel's latest-generation CPUs are power-hungry monsters; keep that in mind as well. I'd opt for AMD at the moment.

Balling
16th September 2021, 09:47
You can root the MINIX OS in Intel's chipset, get access to the microcode, and modify it. So Intel, of course. What a dumb question! The picosecond-precision digital analyser inside alone is already equivalent to millions of dollars.

shootah
20th September 2021, 19:13
I'm also looking for a new CPU; a Ryzen 7 5800X would be my maximum budget. Is this the best I can buy for that money?
What fps should I expect from a 5800X encoding x265 from a 1080p AVC source using CRF and preset fast?

ChaosKing
20th September 2021, 19:39
There's no downside to having an integrated GPU, but there's no upside either for x265 itself.

All Ryzen CPUs with an integrated GPU have a smaller L3 cache, so there's a small downside :D

excellentswordfight
20th September 2021, 19:42
I'm also looking for a new CPU; a Ryzen 7 5800X would be my maximum budget. Is this the best I can buy for that money?
What fps should I expect from a 5800X encoding x265 from a 1080p AVC source using CRF and preset fast?
TechPowerUp has a decent x265 test which you can use for some speed comparisons between models.

https://tpucdn.com/review/intel-core-i7-11700kf/images/encode-h265.png

I'm not sure about the speed, as I rarely use preset fast and don't own a 5800X, but I would guesstimate it would be in the 50 fps+ range.

FranceBB
21st September 2021, 00:34
Out of curiosity, still no AVX512 in the new AMD CPUs, right?

Asmodian
21st September 2021, 00:47
Right, no AVX512 in any of the announced, or even rumored, AMD CPUs.

However, Zen3 CPUs encode with x265 very quickly, so the lack of AVX512 is not really a minus for x265 encoding.
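
For anyone wanting to check whether a given machine actually exposes AVX512 before trying an AVX512-enabled build: on Linux the kernel lists CPU feature flags in `/proc/cpuinfo`, and `avx512f` (Foundation) is the subset every other AVX-512 extension depends on. A small Python sketch, Linux-only by assumption; `cpu_flags` and `has_avx512` are hypothetical helpers:

```python
from pathlib import Path

def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # e.g. ARM kernels report 'Features' instead of 'flags'

def has_avx512(cpuinfo_text: str) -> bool:
    # avx512f (Foundation) is required by every other AVX-512 subset.
    return "avx512f" in cpu_flags(cpuinfo_text)

if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only
    if path.exists():
        text = path.read_text()
        print("AVX2:", "avx2" in cpu_flags(text), "AVX512F:", has_avx512(text))
```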

butterw2
21st September 2021, 15:59
Intel's 11th-gen desktop iGPU has a good hardware HEVC encoder (in addition to an AV1 hardware decoder).

AVX512 isn't reported as being available on 12th-gen Intel desktop CPUs (releasing Q4 2021).

nevcairiel
21st September 2021, 17:39
Rumor has it that Zen4 might have AVX512. If AMD brings it to consumer desktops, that would be an odd situation, as Intel has pulled it from consumer Alder Lake.
But those rumors might as well be totally off, as typical consumers would only see a small benefit ... but of course AMD uses the same cores for the entire lineup, so who knows.

shootah
21st September 2021, 18:20
I'm not sure about the speed, as I rarely use preset fast and don't own a 5800X, but I would guesstimate it would be in the 50 fps+ range.

Great, 50 fps is a huge improvement for me!

DTL
8th February 2023, 08:52
Rumor has it that Zen4 might have AVX512. If AMD brings it to consumer desktops, that would be an odd situation, as Intel has pulled it from consumer Alder Lake.
But those rumors might as well be totally off, as typical consumers would only see a small benefit ... but of course AMD uses the same cores for the entire lineup, so who knows.

In late 2022, AMD started supporting a good set of AVX512 instructions in its Zen4 7xxx chips, though maybe not at the full rate (512 bits per dispatch): they are processed as 256-bit halves, so instructions per cycle are about half of what you might expect. But it most probably supports the full-sized 2 KB AVX512 register file, and even separate register files for integer and floating point (so 2x2 KB per core). That also helps performance if a program is manually designed to use such a "large for a desktop CPU" register file, or if the C compiler is configured for the AVX512 architecture and can use the additional register space.

Maybe in the next generation of chips AMD will support AVX512 instruction dispatch at the "full rate", which could help performance further.

Nowadays, more desktop developers with AMD 7xxx chips can write and test AVX512 software without using Intel SDE, which only fully supports up to Visual Studio 2017 (quite old).

benwaggoner
8th February 2023, 19:01
Even if the IPC of AVX512 is half that of AVX2 for many operations, it will still reduce instruction bandwidth, which might have some benefit. There are also the permute instructions added in AVX512, which seem like they could simplify some algorithms and optimizations if well utilized. I'm not sure how deep the x265 AVX512 optimization goes in that direction.
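
A back-of-the-envelope version of that argument, under a deliberately simplified model (fully vectorizable work, no memory stalls, a fixed number of vector instructions retired per cycle): halving the per-instruction rate of 512-bit ops cancels their 2x width, so cycles stay the same while the instruction count halves. The rates and block size are illustrative assumptions, not measured Zen 4 or Xeon figures; `simd_cost` is hypothetical:

```python
import math

def simd_cost(total_bytes: int, vector_bits: int, instr_per_cycle: float):
    """Return (instructions, cycles) to process total_bytes with one vector op per chunk."""
    bytes_per_instr = vector_bits // 8
    instructions = math.ceil(total_bytes / bytes_per_instr)
    return instructions, instructions / instr_per_cycle

work = 64 * 64 * 2  # e.g. one 64x64 block of 16-bit samples = 8192 bytes
avx2 = simd_cost(work, 256, instr_per_cycle=2.0)
avx512 = simd_cost(work, 512, instr_per_cycle=1.0)  # assumed half rate
print(avx2, avx512)  # (256, 128.0) (128, 128.0)
```

In this model any gain comes from the freed-up front-end capacity (fetch, decode, scheduling) being available to the surrounding code, not from the vector work itself.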

MCW dialed back on further AVX512 work after it was discovered that AVX512's thermal downclocking made it a net negative for performance in most scenarios.

Getting a good Zen4-tuned, profile-driven optimized binary to test with would reveal much. I'd expect to see some increased benefit with 4K encoding and hopefully 1080p with slower presets. AVX512 on Xeon to date has really only shown net improvements at 4K veryslow, and even then <20%.

Anyone know about AVX512 improvements in the 2023 Xeons?

For AWS EC2 cloud encoding, Graviton2 already offers a lot better throughput/$ than c5 Xeon instances. For cloud at least, x86-64 isn't the only game in town anymore.
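
Comparing instance types on throughput per dollar is simple arithmetic once you have measured encoding speed on each: frames per hour divided by the hourly price. The Python sketch below shows the calculation; the instance names, fps values and prices are placeholders, not real benchmark results or AWS pricing, so substitute your own measurements. `frames_per_dollar` is a hypothetical helper:

```python
def frames_per_dollar(fps: float, price_per_hour: float) -> float:
    """Frames encoded per dollar of instance time."""
    return fps * 3600 / price_per_hour

# Placeholder numbers only; replace with your own measurements and current prices.
candidates = {
    "instance-a (hypothetical)": (20.0, 1.00),
    "instance-b (hypothetical)": (16.0, 0.70),
}
for name, (fps, price) in candidates.items():
    print(f"{name}: {frames_per_dollar(fps, price):,.0f} frames/$")
```

With these made-up numbers the slower instance still wins, because its price drops faster than its speed.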

DTL
8th February 2023, 20:33
One useful feature of AVX512's 4x larger register file on x64, compared with AVX2, is the ability to process more blocks per pass in motion estimation, or larger blocks, or a larger search radius, without reloading the register file from cache (a load costs about 5 or more clock ticks even from the closest L1D cache, and that can ruin performance several times over). So more hand-optimized AVX512 functions from the x265 developers may show a bigger performance boost in the future.
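
The register-file figures used in this discussion follow from the architectural register counts: AVX2 has 16 YMM registers of 256 bits, while AVX-512 has 32 ZMM registers of 512 bits (plus 8 mask registers, ignored here). The Python sketch below computes those totals and how many small blocks could, at most, be held in registers at once. It treats the whole register file as available for block data, which real code never achieves because some registers hold constants and accumulators; `register_file_bytes` and `blocks_resident` are hypothetical helpers:

```python
def register_file_bytes(num_regs: int, reg_bits: int) -> int:
    """Architectural vector register file size in bytes."""
    return num_regs * reg_bits // 8

def blocks_resident(rf_bytes: int, block_w: int, block_h: int,
                    bytes_per_sample: int) -> int:
    """Upper bound on whole blocks that fit in the register file at once."""
    return rf_bytes // (block_w * block_h * bytes_per_sample)

AVX2 = register_file_bytes(16, 256)    # 16 YMM registers -> 512 bytes
AVX512 = register_file_bytes(32, 512)  # 32 ZMM registers -> 2048 bytes
print(AVX512 // AVX2)                  # 4: the "4x larger register file"
for name, rf in [("AVX2", AVX2), ("AVX-512", AVX512)]:
    # 8x8 blocks of 16-bit samples (128 bytes each), e.g. residuals
    print(name, blocks_resident(rf, 8, 8, 2))  # AVX2 4, then AVX-512 16
```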

"Getting a good Zen4-tuned, profile-driven optimized binary to test with would reveal much."

Are there any architecture-optimized builds of x265 (for AVX512 Intel families, or for AVX512 AMD Zen4) available to download and test?

I tried to make builds quickly, but the only quick success was with MSVC and 2020-dated sources from GitHub (and its /arch:AVX512 build on an i5-11600 runs even slightly slower than /arch:AVX2, but MSVC 16 is not the best optimizing compiler for Intel's AVX512 architecture). With newer sources, either the CMake configure step does not produce a solution, or the .obj files do not link with the Intel C 19.1 compiler. I need more time to solve the build errors. It would be nice if the x265 community already provided AVX512-optimized builds for end users.

"about AVX512 improvements in the 2023 Xeons?"

As I see it, Xeons have better-optimized memory (cache and RAM) controllers for multithreading (multiple cores) compared with desktop chips for poor people. So Xeons may also benefit more from wide bus transfers and wide-word computing. Since Intel has again dropped AVX512 support in its desktop chips, it looks like, with the poor memory controllers in cheap (sub-$1000) chips, it is still a very unbalanced setup of too-fast compute cores and too-slow memory, and not worth marketing in the dying desktop market.

Blue_MiSfit
9th February 2023, 05:48
For AWS EC2 cloud encoding, Graviton2 already offers a lot better throughput/$ than c5 Xeon instances. For cloud at least, x86-64 isn't the only game in town anymore.

Is this the case for x264 and x265 encoding? I know there have been lots of ARM SIMD optimizations, but I'd be kind of surprised if those surpass the level of x86_64 optimization. Does it just not matter and you come out ahead?

We already use Graviton for our database workloads and are moving some of our JVM workloads there as well, but I was under the impression that standard x264/x265 work was still more cost efficient on AMD instances in EC2 ~generally~ speaking :)

Boulder
9th February 2023, 13:57
Getting a good Zen4-tuned, profile-driven optimized binary to test with would reveal much. I'd expect to see some increased benefit with 4K encoding and hopefully 1080p with slower presets. AVX512 on Xeon to date has really only shown net improvements at 4K veryslow, and even then <20%.

Once Media Autobuild Suite moves to GCC 13, that should be available. I don't know how much you can expect from compiler-made optimizations, though. On a Zen 3, the difference is a couple of percent (using 'znver3' as the target) compared with a generic build.

DTL
9th February 2023, 18:36
Google shows an article from Intel:
https://www.intel.com/content/dam/develop/external/us/en/documents/mcw-intel-x265-avx512.pdf

Accelerating x265 with Intel® Advanced Vector Extensions 512

The diagrams really don't show a very nice gain. It looks like the x265 programmers need significantly new processing approaches to use AVX512's 4x larger register file and 2x larger data word per instruction over AVX2 (up to roughly 4x*2x=8x hardware performance; superscalar execution also allows several AVX512 instructions to run on different dispatch ports, if the instruction is supported on several ports and there is no data dependency) in order to get at least a 2x performance gain instead of just a few percent. At least on full-blooded Xeons with many memory channels and better cache controllers compared with poor people's desktop chips.

One of the reasons for the small benefit of AVX512 versions over AVX2: in the AVX2 era, programmers designed each work unit to fit the 512-byte register file of AVX2 CPUs. So simply using AVX512 instructions with the same work-unit size may in theory double performance, but the work-unit reload rate from caches and main RAM is still high, so the code remains as memory-bound as the AVX2 design.

Moving to the AVX512 architecture also requires increasing the work-unit size about 4x, and such software may take a very high performance penalty if it tries to use this work-unit size on older architectures (AVX2) with a smaller register file (the compiler will accept an intrinsics-based program that over-uses the register file, but will fill the output binary with reloads from cache, which drops performance significantly). Also, as I see it, x265 uses external hand-coded assembly, so new AVX512 functions would need to be hand-written using the larger work-unit size, and this would be a significant separate part of x265 that performs well only on AVX512 chips (and is mostly unusably slow on older ones). Before AMD's 7xxx consumer chips, AVX512 was very rare on desktops, so there were too few reasons (and perhaps no investors to redesign a separate AVX512 branch of x265) to put the typically limited developer resources of a freeware open-source project into a separate version of x265 for rich people's servers. And now that we have more AVX512 chips on desktops, overall developer activity in open source looks almost dead, so simply no one is left to do one more redesign of x265 for the new architecture offered in desktop chips. Maybe we will see some progress later this decade if AMD keeps AVX512 in cheap desktop chips for at least several years.

benwaggoner
10th February 2023, 22:51
One of the reasons for the small benefit of AVX512 versions over AVX2: in the AVX2 era, programmers designed each work unit to fit the 512-byte register file of AVX2 CPUs. So simply using AVX512 instructions with the same work-unit size may in theory double performance, but the work-unit reload rate from caches and main RAM is still high, so the code remains as memory-bound as the AVX2 design.
Very interesting! Does anyone know if the existing AVX512 code in x265 was written with this in mind?

DTL
11th February 2023, 00:44
Programmers may object that complex, modern video coding algorithms with long processing chains (switching between many, many small functions) and frame partitioning down to very small blocks (like 4x4, 2x2, or even a single sample) cannot be arranged into "any large work units", so no significant benefit from "large SIMD architectures" is possible. We will keep passing something like a 2x2 8-bit block (a 4-byte work unit) through hundreds of functions and many loop iterations, on hardware capable of handling work units of up to 2048 bytes with instant access (no memory-load penalty) from the register file to the dispatch ports. So they may ask for a very large number of full logical cores to process these very small work units and improve total frame-compression time. But the current hardware industry can only provide chips with a small number of full logical cores, while increasing the "work unit" size, going from the 512-byte register file of AVX2 to the 2048 bytes of AVX512 (and maybe larger with the promised AVX1024, expected around 2023: https://appuals.com/granite-rapids-cpu-showcase/ Intel 5th Gen Xeon with AVX1024/FMA3). So it is a task for programmers to adapt software to hardware with a small number of cores and an increasing "work unit" size per processing thread.