
View Full Version : x265 benchmark on a 7975WX Threadripper


jpsdr
9th June 2025, 19:39
Hello, I've finally finished building my rig and ran some tests.
CPU: 7975WX (AVX512 & HT enabled in the BIOS) => 32 physical cores (64 logical).
Motherboard: ASUS Pro WS WRX90E-SAGE SE
Memory: CORSAIR CMA128GX5M8B5600C40 (Ver 5.43.01), 8x 16GB

I've set up a Windows 10/Windows 11 dual boot.
The PC is totally offline and a lot of services are disabled, so it's doing "almost" nothing else and "almost" no time is wasted.

Used x265 4.1.0.104, built with LLVM 20.1.4.
Zen4 is built with: -Ofast -march=znver4
Zen4s is built with: -Ofast -msse2 -mavx -mavx2 -mfma -mtune=znver4
(So Zen4s is tuned for Zen4 but built without using AVX512 instructions.)

I ran a lot of different tests on a small 128-frame 4K 10-bit file.
The command line used is (avg bitrate is 40000):

SET E_SRC=%8%1.avs
SET E_DST=%5%1.hevc
SET CHAPTERS=%8%7
SET STAT_FILE=%8%1.stats
SET LOG_FILE_1=%8%1_log_1.txt
SET LOG_FILE_2=%8%1_log_2.txt
SET LOG_FILE_3=%8%1_log_3.txt
SET BITRATE=%2
SET TUNING=%6
SET MCLL=%3
SET MDISPLAY=%4

x265_x64 --asm avx512 --preset slower --vbv-maxrate 90000 --vbv-bufsize 70000 --bitrate %BITRATE% --stats %STAT_FILE% --level 5.1 --profile main10 --high-tier --level-idc 51 --hist-scenecut --fades --aq-mode 4 --aq-auto 6 --weightb --rc-lookahead 72 --tskip --tskip-fast --no-rect --me hex --subme 2 --b-intra --no-sao --deblock -1,-1 --psy-rd 2.5 --psy-rdoq 4 --multi-pass-opt-analysis --multi-pass-opt-distortion --video-signal-type-preset BT2100_PQ_YCC -D 10 --max-cll %MCLL% --master-display %MDISPLAY% --hdr10-opt --qpfile %CHAPTERS% --input %E_SRC% --pass 1 -o NUL 2> %LOG_FILE_1%
x265_x64 --asm avx512 --preset slower --vbv-maxrate 90000 --vbv-bufsize 70000 --bitrate %BITRATE% --stats %STAT_FILE% --level 5.1 --profile main10 --high-tier --level-idc 51 --hist-scenecut --fades --aq-mode 4 --aq-auto 6 --weightb --rc-lookahead 72 --tskip --tskip-fast --rect --no-amp --me umh --subme 3 --b-intra --no-sao --deblock -1,-1 --psy-rd 2.5 --psy-rdoq 4 --multi-pass-opt-analysis --multi-pass-opt-distortion --video-signal-type-preset BT2100_PQ_YCC -D 10 --max-cll %MCLL% --master-display %MDISPLAY% --hdr10-opt --qpfile %CHAPTERS% --input %E_SRC% --pass 3 -o NUL 2> %LOG_FILE_2%
x265_x64 --asm avx512 --preset slower --vbv-maxrate 90000 --vbv-bufsize 70000 --bitrate %BITRATE% --stats %STAT_FILE% --level 5.1 --profile main10 --high-tier --level-idc 51 --hist-scenecut --fades --aq-mode 4 --aq-auto 6 --weightb --rc-lookahead 72 --tskip --tskip-fast --rect --amp --b-intra --no-sao --deblock -1,-1 --psy-rd 2.5 --psy-rdoq 4 --scenecut-aware-qp 3 --multi-pass-opt-analysis --multi-pass-opt-distortion --video-signal-type-preset BT2100_PQ_YCC -D 10 --max-cll %MCLL% --master-display %MDISPLAY% --hdr10-opt --qpfile %CHAPTERS% --input %E_SRC% --pass 2 -o %E_DST% 2> %LOG_FILE_3%


Results are sometimes... unexpected.
Of course, every encode is made from the same file.

Zen4
Windows 10
Pass 1: encoded 128 frames in 45.09s (2.84 fps)
Pass 2: encoded 128 frames in 73.77s (1.74 fps)
Pass 3: encoded 128 frames in 62.94s (2.03 fps)
Windows 11
Pass 1: encoded 128 frames in 43.15s (2.97 fps)
Pass 2: encoded 128 frames in 111.11s (1.15 fps)
Pass 3: encoded 128 frames in 93.05s (1.38 fps)

Zen4s
Windows 10
Pass 1: encoded 128 frames in 44.51s (2.88 fps)
Pass 2: encoded 128 frames in 73.16s (1.75 fps)
Pass 3: encoded 128 frames in 62.09s (2.06 fps)
Windows 11
Pass 1: encoded 128 frames in 43.24s (2.96 fps)
Pass 2: encoded 128 frames in 111.24s (1.15 fps)
Pass 3: encoded 128 frames in 93.54s (1.37 fps)

Results:
- Zen4 & Zen4s have (almost) identical results.
- First pass is a little faster on Windows 11, but Windows 11 is significantly slower on Pass 2 & 3 ! :confused:
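
For anyone wanting to tabulate these timings, here is a minimal Python sketch. It assumes the summary lines have the form quoted above ("Pass N: encoded ... frames in ...s (... fps)"); the "Pass N:" prefix is how the lines are labelled in this post, not necessarily what the x265 log itself contains, and the function names are made up for illustration.

```python
import re

# Matches summary lines in the form quoted in this thread, e.g.
# "Pass 1: encoded 128 frames in 45.09s (2.84 fps)".
LINE_RE = re.compile(
    r"Pass (\d+): encoded (\d+) frames in ([\d.]+)s \(([\d.]+) fps\)"
)


def parse_results(text):
    """Return a list of (pass_label, frames, seconds, fps) tuples."""
    rows = []
    for m in LINE_RE.finditer(text):
        rows.append(
            (int(m.group(1)), int(m.group(2)),
             float(m.group(3)), float(m.group(4)))
        )
    return rows


def total_seconds(rows):
    """Sum the encode time of all passes, rounded to 1/100 s."""
    return round(sum(r[2] for r in rows), 2)


# Zen4 build, Windows 10, copied from the results above.
win10_zen4 = """
Pass 1: encoded 128 frames in 45.09s (2.84 fps)
Pass 2: encoded 128 frames in 73.77s (1.74 fps)
Pass 3: encoded 128 frames in 62.94s (2.03 fps)
"""

rows = parse_results(win10_zen4)
print(total_seconds(rows))
```

The total for the three passes is what the later "2 encodes in parallel" comparison is based on.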

Zen4s without --asm avx512.
Windows 10
Pass 1: encoded 128 frames in 48.72s (2.63 fps) [AVX512 +9.5%]
Pass 2: encoded 128 frames in 107.07s (1.20 fps) [AVX512 +45.8%]
Pass 3: encoded 128 frames in 96.89s (1.32 fps) [AVX512 +56.1%]
Windows 11
Pass 1: encoded 128 frames in 54.68s (2.34 fps) [AVX512 +26.5%]
Pass 2: encoded 128 frames in 111.77s (1.15 fps) [AVX512 +0.0%]
Pass 3: encoded 128 frames in 94.69s (1.35 fps) [AVX512 +1.5%]

Results:
For Windows 10 the difference is large, but Windows 11... :confused:
It's as if Windows 11 is not using AVX512 on Pass 2 & Pass 3 :eek:
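
The bracketed AVX512 gains above can be reproduced from the reported fps values. A minimal sketch, assuming the gain is the fps ratio with/without --asm avx512 (this matches all six figures; the function name is only illustrative):

```python
def speedup_percent(fps_with, fps_without):
    """Relative gain in percent of fps_with over fps_without."""
    return round((fps_with / fps_without - 1.0) * 100.0, 1)


# (fps with --asm avx512, fps without), Zen4s build, from the posts above.
win10 = {"Pass 1": (2.88, 2.63), "Pass 2": (1.75, 1.20), "Pass 3": (2.06, 1.32)}
win11 = {"Pass 1": (2.96, 2.34), "Pass 2": (1.15, 1.15), "Pass 3": (1.37, 1.35)}

for name, table in (("Windows 10", win10), ("Windows 11", win11)):
    for label, (with_avx512, without_avx512) in table.items():
        gain = speedup_percent(with_avx512, without_avx512)
        print(f"{name} {label}: AVX512 {gain:+.1f}%")
```

Because the fps values are only reported to two decimals, the last digit of each percentage is sensitive to rounding.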

I noticed in Task Manager that even though x265 creates 64 threads (it's reported in the log), the total CPU usage was under 50%.
So I tried adding --pools 32, but kept --asm avx512.

Zen4s
Windows 10
Pass 1: encoded 128 frames in 44.25s (2.89 fps)
Pass 2: encoded 128 frames in 92.07s (1.39 fps)
Pass 3: encoded 128 frames in 76.46s (1.67 fps)

Results: a little slower (expected), but not by much.

So... I said to myself: as I have a lot of memory, maybe starting 2 encodes at the same time with --pools 32 could be interesting.
The encodes are made from 2 identical files on 2 different HDDs.

From the first test:
1 file at full speed (64 threads):
Windows 10: 181.80s => 2 encodes take 363.60s
Windows 11: 248.02s => 2 encodes take 496.04s

Now, there are 2 encodes in parallel, with --pools 32 & --asm avx512.
Windows 10
File 1:
Pass 1: encoded 128 frames in 50.92s (2.51 fps)
Pass 2: encoded 128 frames in 114.36s (1.12 fps)
Pass 3: encoded 128 frames in 94.96s (1.35 fps)
=> Total of 260.18s
File 2:
Pass 1: encoded 128 frames in 51.39s (2.49 fps)
Pass 2: encoded 128 frames in 84.04s (1.52 fps)
Pass 3: encoded 128 frames in 68.29s (1.87 fps)
=> Total of 203.73s
=> 2 files encoded in 260.18s instead of 363.60s => -28%.
But at some point one file got slower; the load was not equal between the files.
Windows 11
File 1:
Pass 1: encoded 128 frames in 49.73s (2.57 fps)
Pass 2: encoded 128 frames in 116.22s (1.10 fps)
Pass 3: encoded 128 frames in 96.31s (1.33 fps)
=> Total of 262.26s
File 2:
Pass 1: encoded 128 frames in 49.66s (2.58 fps)
Pass 2: encoded 128 frames in 115.14s (1.11 fps)
Pass 3: encoded 128 frames in 96.03s (1.33 fps)
=> Total of 260.83s
=> 2 files encoded in 262.26s instead of 496.04s => -47%.
The % gain is better than on Windows 10 and the load is equal, but the final result is nevertheless the same as with Windows 10.
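
The -28% and -47% figures follow from comparing the wall time of the parallel run (the slower of the two files) with running the single full-speed encode twice. A minimal sketch of that arithmetic, using the totals as posted above (the function name is only illustrative):

```python
def parallel_saving_percent(single_total, parallel_totals):
    """Time saved by encoding files in parallel instead of one after another.

    single_total: total time of one full-speed (64 threads) encode.
    parallel_totals: total time of each encode when run concurrently.
    """
    sequential = single_total * len(parallel_totals)
    wall_time = max(parallel_totals)  # the batch ends when the slowest file ends
    return round((1.0 - wall_time / sequential) * 100.0, 1)


win10 = parallel_saving_percent(181.80, [260.18, 203.73])
win11 = parallel_saving_percent(248.02, [262.26, 260.83])
print(f"Windows 10: -{win10}%")
print(f"Windows 11: -{win11}%")
```

Note that the larger percentage on Windows 11 comes from its slower single-encode baseline, not from a faster parallel run.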

If this slowdown of Pass 2 & Pass 3 between Windows 10 & Windows 11 could be explained and solved, my guess is that Windows 11 would be better than Windows 10, but for now, it's not the case.

Z2697
10th June 2025, 06:27
VBV is non-deterministic.
Auto-AQ is non-deterministic.
The test results are subject to a high margin of error.

jpsdr
10th June 2025, 08:42
Non-deterministic just means that running the exact same encode twice will not produce the exact same output file; it doesn't mean the encoding time will change drastically. As I don't care about the output file, just the encoding time, I don't think this remark is relevant, and I don't agree with it.

But... as I also think the test results are relevant, this evening when I'm back home I'll run the exact same test 4 times (on both Windows 10 & Windows 11) and see if there is a significant difference in encoding time between runs.
If there is, I was wrong; if not, you were wrong.

rwill
10th June 2025, 08:56
Non-deterministic just means that running the exact same encode twice will not produce the exact same output file; it doesn't mean the encoding time will change drastically.

Given your very short input file and x265's bitrate control implementation I am not so sure about this.

jpsdr
10th June 2025, 12:05
When I'm back home I'll see, by launching the same test several times, whether the time changes.
If not, it means the tests are relevant. If yes, I'll do the same test but duplicate the clip 10 times in the avs script, creating 1280 frames and making the encoding time between 10 and 20 minutes... and redo some of the tests (in that case, probably not so many).
I must say, for the sake of saving time, that I hope the result will be that the time doesn't change between tests... :D

jpsdr
10th June 2025, 18:46
Back home, here are the results of the encoding time consistency test.

Windows 10
#1
Pass 1: encoded 128 frames in 44.16s (2.90 fps)
Pass 2: encoded 128 frames in 72.95s (1.75 fps)
Pass 3: encoded 128 frames in 61.92s (2.07 fps)
#2
Pass 1: encoded 128 frames in 44.20s (2.90 fps)
Pass 2: encoded 128 frames in 72.52s (1.76 fps)
Pass 3: encoded 128 frames in 62.15s (2.06 fps)
#3
Pass 1: encoded 128 frames in 44.25s (2.89 fps)
Pass 2: encoded 128 frames in 72.82s (1.76 fps)
Pass 3: encoded 128 frames in 62.12s (2.06 fps)
#4
Pass 1: encoded 128 frames in 44.43s (2.88 fps)
Pass 2: encoded 128 frames in 72.64s (1.76 fps)
Pass 3: encoded 128 frames in 62.10s (2.06 fps)
Results
Pass 1: varies from 44.16s to 44.43s => 0.6%
Pass 2: varies from 72.52s to 72.95s => 0.6%
Pass 3: varies from 61.92s to 62.15s => 0.4%

Windows 11
#1
Pass 1: encoded 128 frames in 43.13s (2.97 fps)
Pass 2: encoded 128 frames in 111.58s (1.15 fps)
Pass 3: encoded 128 frames in 93.35s (1.37 fps)
#2
Pass 1: encoded 128 frames in 44.20s (2.90 fps)
Pass 2: encoded 128 frames in 111.47s (1.15 fps)
Pass 3: encoded 128 frames in 93.70s (1.37 fps)
#3
Pass 1: encoded 128 frames in 43.11s (2.97 fps)
Pass 2: encoded 128 frames in 111.42s (1.15 fps)
Pass 3: encoded 128 frames in 93.64s (1.37 fps)
#4
Pass 1: encoded 128 frames in 43.10s (2.97 fps)
Pass 2: encoded 128 frames in 111.52s (1.15 fps)
Pass 3: encoded 128 frames in 93.88s (1.36 fps)
Results
Pass 1: varies from 43.10s to 44.20s => 2.6%
Pass 2: varies from 111.42s to 111.58s => 0.1%
Pass 3: varies from 93.35s to 93.88s => 0.6%

Obviously my results are a lot of things, but NOT subject to a high margin of error!
This confirms my statement in post #3.
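
The run-to-run spreads quoted above can be checked with a few lines of Python. This is a minimal sketch that assumes the spread is (slowest / fastest - 1), which reproduces all six percentages; the names are only illustrative.

```python
def spread_percent(times):
    """Spread between the slowest and fastest run, relative to the fastest."""
    return round((max(times) / min(times) - 1.0) * 100.0, 1)


# Encode times in seconds for runs #1 to #4, copied from the lists above.
win10 = {
    "Pass 1": [44.16, 44.20, 44.25, 44.43],
    "Pass 2": [72.95, 72.52, 72.82, 72.64],
    "Pass 3": [61.92, 62.15, 62.12, 62.10],
}
win11 = {
    "Pass 1": [43.13, 44.20, 43.11, 43.10],
    "Pass 2": [111.58, 111.47, 111.42, 111.52],
    "Pass 3": [93.35, 93.70, 93.64, 93.88],
}

for name, table in (("Windows 10", win10), ("Windows 11", win11)):
    for label, times in table.items():
        print(f"{name} {label}: {spread_percent(times)}%")
```

Four runs give only a rough idea of the variance, but the spread is far smaller than the Windows 10 vs Windows 11 gap on Pass 2 and Pass 3.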

Nevertheless, I'll try just the case of 32 threads/2 encodes at the same time, looping my small file 10 times in the avs script (so 1280 frames), with only one run each on Windows 10 and Windows 11, to see if the CPU load balance is better on a larger file.

Z2697
10th June 2025, 19:16
Then, maybe the hypervisor? Windows 11 is very stubborn about getting that enabled, and other things... for security (not sure about that).
I mean it's really strange to me. I think Windows 11 should still be very similar to Windows 10 (down in the kernel); without the online bloat running, how can it perform so differently?

jpsdr
10th June 2025, 21:47
@Z2697
What's odd is that Windows 11 is a little faster on Pass 1, but a lot slower only on Pass 2 & Pass 3. And according to the speed test without AVX512, it looks like AVX512 is disabled on Windows 11 just for Pass 2 & Pass 3. But it makes no sense... Why would AVX512 be disabled on Pass 2 & Pass 3 on Windows 11 and not on Windows 10... :confused:

=================================

Otherwise, here is the test of a 1280-frame file: AVX512, 32 threads, 2 files encoded at the same time.

Windows 10
File 1
Pass 1: encoded 1280 frames in 489.71s (2.61 fps)
Pass 2: encoded 1280 frames in 881.18s (1.45 fps)
Pass 3: encoded 1280 frames in 709.66s (1.80 fps)
=> Total of 2080.55s
File 2
Pass 1: encoded 1280 frames in 490.34s (2.61 fps)
Pass 2: encoded 1280 frames in 883.15s (1.45 fps)
Pass 3: encoded 1280 frames in 709.87s (1.80 fps)
=> Total of 2083.36s
Result: as I suspected, for this specific test the small file is not a reliable check of the CPU load balance; it's too short. With a bigger file, the result shows an excellent CPU balance, with almost identical times for each file, giving a total of 2083.36s to encode 2 files.
If I make a quick computation from the speed of a single encode:
Pass 1 -> 2.89fps => 442.91s
Pass 2 -> 1.75fps => 731.43s
Pass 3 -> 2.06fps => 621.36s
=> Total of 1795.70s -> 3591.40s for 2 files
Encoding time: -42%
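
The quick computation above can be written out as a short Python sketch. It assumes, as the post does, that a single 64-thread encode of the 1280-frame file would run at the fps measured on the 128-frame file; the variable names are only illustrative.

```python
FRAMES = 1280

# fps of a single 64-thread encode on Windows 10, as used in the estimate above.
single_fps = {"Pass 1": 2.89, "Pass 2": 1.75, "Pass 3": 2.06}

# Estimated time per pass, rounded to 1/100 s as in the post.
per_pass = {label: round(FRAMES / fps, 2) for label, fps in single_fps.items()}
one_file = round(sum(per_pass.values()), 2)
two_files_sequential = round(2 * one_file, 2)

# Measured wall time for 2 parallel encodes (the slower of the two files).
two_files_parallel = 2083.36

saving = round((1.0 - two_files_parallel / two_files_sequential) * 100.0, 1)

print(per_pass)
print(one_file, two_files_sequential)
print(f"Encoding time: -{saving}%")
```

The sequential figure is an extrapolation, not a measurement, so the -42% should be read as an estimate.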

Windows 11
File 1
Pass 1: encoded 1280 frames in 437.46s (2.93 fps)
Pass 2: encoded 1280 frames in 1065.31s (1.20 fps)
Pass 3: encoded 1280 frames in 938.72s (1.36 fps)
=> Total of 2441.49s
File 2
Pass 1: encoded 1280 frames in 578.49s (2.21 fps)
Pass 2: encoded 1280 frames in 1028.20s (1.24 fps)
Pass 3: encoded 1280 frames in 897.66s (1.43 fps)
=> Total of 2504.35s
Result: CPU balance is good, but not as good as on Windows 10, and the time is longer.

Winner : Windows 10
At least for now...

For the record, I have a 'tuned' install of Windows 11 with a lot of crap disabled by default; I also disabled a lot of things I'm not using, like the firewall, Defender and a lot of network services, as the PC is totally offline.
Also for the record, I ran Windows Update on both of them when I installed the OSes last weekend, before taking them totally offline, so they are, normally, "up to date".

@Z2697
I've checked on my Windows 11: Hyper-V is totally disabled in the Windows features dialogue.
The same is true on my Windows 10, after checking.

jpsdr
11th June 2025, 08:44
I just thought this morning of a test I'll do this evening when I'm back home.
I've tested with an LLVM build; I'll test with a GCC build.
Shouldn't change things, in theory, but at this point...

Z2697
11th June 2025, 16:18
Some security settings will enable the hypervisor regardless of the checkboxes in the features dialogue.
In fact, I can't find a sane way to disable the hypervisor in Windows 11 24H2 (of course, disabling SVM in the BIOS does the trick).
You can run msinfo32 to check if the hypervisor is running (it will say a hypervisor has been detected in the bottom row).

Although the virtualization should be pretty efficient, there may still be some edge cases.

jpsdr
11th June 2025, 19:00
So...
I've deactivated SVM in the BIOS.
I've followed all the guides to disable Hyper-V in Windows.
Result: encode speed is a little faster, but it doesn't change the fact that Pass 2 & 3 are a lot slower in Windows 11 than in Windows 10...

Also, no speed difference between LLVM and GCC builds.

Also, if you know how to permanently disable Defender in Windows 11, I'll take it!!!
I seem to have managed it under Windows 10, but Windows 11 is... :devil:

jpsdr
11th June 2025, 19:22
For now, the only clue i see is this result:
Windows 11
Pass 1: encoded 128 frames in 54.68s (2.34 fps) [AVX512 +26.5%]
Pass 2: encoded 128 frames in 111.77s (1.15 fps) [AVX512 +0.0%]
Pass 3: encoded 128 frames in 94.69s (1.35 fps) [AVX512 +1.5%]

I don't know how it could be possible, but "Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth".
So, however improbable it may be, it's not impossible: for now the only conclusion I have is that something in the code prevents the use of the AVX512 path specifically under Windows 11 and not Windows 10, linked to one of the settings I use in Pass 2 and Pass 3.

I don't know if people from MulticoreWare read this forum and could possibly provide their insights.

jpsdr
12th June 2025, 08:56
I have a little time, so I'm posting the results with Hyper-V deactivated:

Windows 10
Pass 1: encoded 128 frames in 43.33s (2.95 fps) => +4.0%
Pass 2: encoded 128 frames in 70.98s (1.80 fps) => +3.9%
Pass 3: encoded 128 frames in 60.47s (2.12 fps) => +4.1%

The speed increase is very stable; it's small, but every bit helps. The value is just high enough not to be considered "noise".

Windows 11
Pass 1: encoded 128 frames in 42.84s (2.99 fps) => +0.7%
Pass 2: encoded 128 frames in 110.45s (1.16 fps) => +0.6%
Pass 3: encoded 128 frames in 92.55s (1.38 fps) => +0.5%

The speed increase is also very stable, but... so small that it could be "noise".

jpsdr
12th June 2025, 13:56
@Z2697
Haven't tried them yet, but interesting:
https://winaerotweaker.com/
https://github.com/ionuttbara/windows-defender-remover
Also found this (still not tested):
https://github.com/TairikuOokami/Windows/blob/main/Microsoft%20Defender%20Disable.bat

jpsdr
13th June 2025, 11:29
The GCC mcf version is supposed to have a threading model optimised for Windows 10, so I've tested the GCC mcf build vs the LLVM build.

AVX512, 64 threads.

Windows 10 - LLVM
Pass 1: encoded 128 frames in 43.30s (2.96 fps)
Pass 2: encoded 128 frames in 70.82s (1.81 fps)
Pass 3: encoded 128 frames in 60.53s (2.11 fps)
Windows 10 - GCC mcf
Pass 1: encoded 128 frames in 44.20s (2.90 fps) => -2.0%
Pass 2: encoded 128 frames in 72.88s (1.76 fps) => -2.8%
Pass 3: encoded 128 frames in 61.88s (2.07 fps) => -2.2%

LLVM wins.

Windows 11 - LLVM
Pass 1: encoded 128 frames in 42.80s (2.99 fps)
Pass 2: encoded 128 frames in 111.23s (1.15 fps)
Pass 3: encoded 128 frames in 93.47s (1.37 fps)
Windows 11 - GCC mcf
Pass 1: encoded 128 frames in 43.97s (2.91 fps) => -2.7%
Pass 2: encoded 128 frames in 112.88s (1.13 fps) => -1.5%
Pass 3: encoded 128 frames in 95.16s (1.35 fps) => -1.8%

LLVM wins.

Around 2% is not a big deal, but as I said, every bit helps.

Boulder
13th June 2025, 15:15
You'll want to try znver2 instead of znver4 and enable AVX512 separately if that's possible. Znver3 and znver4 are both broken in LLVM and actually produce slower binaries than znver2.

Z2697
13th June 2025, 16:13
You'll want to try znver2 instead of znver4 and enable AVX512 separately if that's possible. Znver3 and znver4 are both broken in LLVM and actually produce slower binaries than znver2.

While that's true, the CPU arch flags in compiler makes almost no difference in x265.

Z2697
13th June 2025, 16:26
Can you try Linux?

jpsdr
13th June 2025, 20:24
No, can't try Linux.

I'll try znver2 with AVX512 compile options enabled when I have time, but even if broken, LLVM is still faster than GCC.

RanmaCanada
14th June 2025, 00:43
I am sure many would like to see you run the advanced benchmark sagittare created..

https://forum.doom9.org/showthread.php?t=185855

Z2697
14th June 2025, 06:54
No, can't try Linux.

I'll try znver2 with AVX512 compile options enabled when I have time, but even if broken, LLVM is still faster than GCC.

Do you use gcc with mcfgthread exclusively, or have you tested "original" gcc?

jpsdr
14th June 2025, 11:31
I am sure many would like to see you run the advanced benchmark sagittare created..

https://forum.doom9.org/showthread.php?t=185855

I can't download it; it asks for an email/password I don't have... :(

Edit:
Finally, I've been able to get it.
Results in sagittare's thread soon...

jpsdr
14th June 2025, 12:11
Do you use gcc with mcfgthread exclusively, or have you tested "original" gcc?

I don't know if it's the "original" for you, but here is the result for a build with version 13.1.0 (20230426 version from http://msystem.waw.pl/x265/) and zen4 arch. AVX512 & 64 threads.

Windows 10
Pass 1: encoded 128 frames in 44.99s (2.85 fps) => -3.6%
Pass 2: encoded 128 frames in 73.95s (1.73 fps) => -4.2%
Pass 3: encoded 128 frames in 62.98s (2.03 fps) => -3.9%
Windows 11
Pass 1: encoded 128 frames in 44.49s (2.88 fps) => -3.8%
Pass 2: encoded 128 frames in 113.44s (1.13 fps) => -1.9%
Pass 3: encoded 128 frames in 95.81s (1.34 fps) => -2.4%

Well, mcf is indeed a little better than "original", so as LLVM is better than mcf, it's also better than "original".

I was told that Zen4 is broken under LLVM; as of which version?
And is it certain that this is still the case and hasn't been fixed yet?
Because even if Zen4 LLVM is broken, it still performs better than Zen4 GCC, both "original" and mcf.

jpsdr
14th June 2025, 13:04
Results of sagittare's benchmark posted in his thread.

Windows 10 wins all.

jpsdr
22nd June 2025, 13:23
You'll want to try znver2 instead of znver4 and enable AVX512 separately if that's possible. Znver3 and znver4 are both broken in LLVM and actually produce slower binaries than znver2.

Finally been able to check the znver2 and znver4 builds; both produce the exact same result, so it seems that if it was broken, it's now fixed.

Z2697
22nd June 2025, 13:57
It's not fixed, it just doesn't matter to x265.
Same reason I doubt your GCC vs LLVM result.

Stereodude
22nd June 2025, 20:30
Is this a Zen4 thing specifically or will Zen5 have the same issue?

Z2697
23rd June 2025, 08:58
Is this a Zen4 thing specifically or will Zen5 have the same issue?

I think it's an AVX-512 thing, combined with bad tuning for the specific uArch making it slightly worse.
So far in Clang 20.1.7 the znver5 tuning is as bad as znver4.

jpsdr
23rd June 2025, 11:44
It's not fixed, it just doesn't matter to x265.
Same reason I doubt your GCC vs LLVM result.
I didn't fake the results... :D
I'm not a... (don't know how to translate it properly) "pro" LLVM guy. I'm just looking for the fastest; I don't care if it's GCC or LLVM.
Same for Windows: I don't care if it's Windows 10 or 11, I'm looking for the fastest. (But as I hate the Windows 10 interface less than the Windows 11 one, in this case I'm happy with the result.)
In my case, it's Windows 10 with LLVM.
Maybe, as you said, it doesn't matter for the x265 code, as I get the same speed (within +/- 0.01fps, so it's noise) with the Zen2 and Zen4 -march build options.

To be specific, i compared these 3 LLVM builds:
-Ofast -march=znver2 -msse2 -mavx -mavx2 -mfma
-Ofast -march=znver2 -msse2 -mavx -mavx2 -mfma -mtune=znver4
-Ofast -march=znver4

All produced .exe files of different sizes (from 18.2MB to 18.6MB, if I remember correctly), but the speed ratio didn't follow the size ratio... ;)

Stereodude
25th June 2025, 02:23
I think it's an AVX-512 thing, combined with bad tuning for the specific uArch making it slightly worse.
So far in Clang 20.1.7 the znver5 tuning is as bad as znver4.
Zen5 has a significant improvement over Zen4 in terms of CPU architecture when running AVX-512 instructions. It's disappointing to read that apparently x265 can't take advantage of it (only in Windows 11?).

benwaggoner
25th June 2025, 19:40
Zen5 has a significant improvement over Zen4 in terms of CPU architecture when running AVX-512 instructions. It's disappointing to read that apparently x265 can't take advantage of it (only in Windows 11?).
Perhaps x265 needs some Zen5 bespoke optimizations? Or perhaps a compiler issue?

It can take a while to get stuff fully optimized for a new architecture, and it often takes funding or code contributions from the architecture's vendor. Intel has done quite a lot to get x264 and x265 optimized for their new architectures over the years; I'm not aware of AMD doing the same, but I am sure they are welcome to.

Z2697
25th June 2025, 20:34
Zen5 has a significant improvement over Zen4 in terms of CPU architecture when running AVX-512 instructions. It's disappointing to read that apparently x265 can't take advantage of it (only in Windows 11?).

I'm not sure what you are talking about, but the compiler optimization has nothing to do with x265's (and most other well optimized encoders and decoders') assembly/intrinsic optimizations.
That's what I mean it doesn't matter to x265.

jpsdr
26th June 2025, 09:19
Updated to a new BIOS and a new driver.
Played with some settings in the BIOS, mostly (but not only) the NUMA settings. In the end, the best result came from keeping things on "auto", the way it was for all my tests until now, so no change.
Also, playing with the NUMA settings didn't change the Windows 11 slowdown for pass 2 & pass 3, and neither did the AMD driver update (I didn't have a lot of hope for that last one).

excellentswordfight
26th June 2025, 09:54
Updated to a new BIOS and a new driver.
Played with some settings in the BIOS, mostly (but not only) the NUMA settings. In the end, the best result came from keeping things on "auto", the way it was for all my tests until now, so no change.
Also, playing with the NUMA settings didn't change the Windows 11 slowdown for pass 2 & pass 3, and neither did the AMD driver update (I didn't have a lot of hope for that last one).
Wasn't it only the first generations of EPYC/TR that exposed the chiplets as several NUMA nodes, i.e. behaved like a multi-socket system? This is not the case anymore, so in practice it's a single-NUMA system. I would suspect that the issue is with the OS scheduler, which is responsible for spreading the load across the CCDs in an optimal way (Windows has a known track record of not doing so, and of doing so inconsistently across versions/patches). This is why there have been several cases where disabling CCDs would increase performance, for example in games, and why the 16C Ryzen models have had worse performance than the 8C models in some loads.

I would also try a test on Linux; it will probably handle this better, and in general it has better performance on high-core-count systems. As someone who follows a lot of benchmarking, and does quite a lot myself, I've read enough about issues with the Windows scheduler that I would probably avoid Windows for systems like this if maximizing performance is that important.

jpsdr
26th June 2025, 19:51
One of the things I tried while playing with the BIOS was activating an option that creates one NUMA node per CCD, resulting in 4 NUMA nodes. x265 was allocating 16 threads on each node, so theoretically the load was spread across the nodes, but the result was a little slower.

I don't intend to use Linux, so sorry: I've already spent a LOT of hours on all of these tests, so I won't spend several more hours on something I won't use.
Now I've finished my tests and found the optimum for my case, and I'm beginning to use my PC for what I built it for.

Stereodude
28th June 2025, 01:37
I'm not sure what you are talking about, but the compiler optimization has nothing to do with x265's (and most other well optimized encoders and decoders') assembly/intrinsic optimizations.
That's what I mean it doesn't matter to x265.
I'm not sure what you're talking about either. I didn't even mention the compiler or assembler.

Isn't he observing that Windows 10 sees a performance boost from AVX512 on a Zen4 CPU on pass 2 and pass 3 with x265 and Windows 11 doesn't? Presumably that would mean a Zen5 CPU would likewise not see a boost from AVX512 instructions in Windows 11.

Why this is happening I don't have any idea. I wouldn't expect the OS to be changing op codes, but I deal with microcontrollers with a while(1) loop, not anything like a modern OS. Since we're going to be strongly pushed to Windows 11 in a few months by Microsoft, seeing it apparently doesn't utilize newer AMD hardware well with x265 is discouraging.

Stereodude
28th June 2025, 01:40
Now I've finished my tests and found the optimum for my case, and I'm beginning to use my PC for what I built it for.
Does this mean you're going to use Windows 10, or did you find another solution?

jpsdr
28th June 2025, 10:27
Windows 10.

benwaggoner
1st July 2025, 02:53
Isn't he observing that Windows 10 sees a performance boost from AVX512 on a Zen4 CPU on pass 2 and pass 3 with x265 and Windows 11 doesn't? Presumably that would mean a Zen5 CPU would likewise not see a boost from AVX512 instructions in Windows 11.
I wouldn't make that assumption not knowing the root cause. I can imagine fixing it would have been a big priority for AMD to help Microsoft with as Zen5 shipped.