x265 HEVC Encoder [Archive] - Page 65

LoRd_MuldeR

6th February 2016, 11:50

It seems v1.9 is less efficient especially when used in dual NUMAs(CPUs) system compared to v1.8. We deployed the same parameters (change in the presets incorporated so that every single parameter is the same), the speed (fps) has dropped by 15%.

Any idea why this happens?

x265 -D 10 --preset slower --crf 16.0 --ctu 32 --max-tu-size 16 --tu-intra-depth 3 --tu-inter-depth 3 --rdpenalty 2 --me 3 --subme 5 --merange 44 --b-intra --no-rect --no-amp --ref 5 --weightb --keyint 360 --min-keyint 1 --bframes 10 --aq-mode 1 --aq-strength 1.1 --rd 5 --psy-rd 0.8 --psy-rdoq 4.0 --rdoq-level 1 --no-sao --no-open-gop --rc-lookahead 80 --scenecut 40 --max-merge 4 --qcomp 0.80 --no-strong-intra-smoothing --input-depth 10 --deblock -2:-2 --qg-size 16 --vbv-bufsize 28000 --vbv-maxrate 25000 --limit-refs 0 --no-limit-modes

x265 v1.9 has added new features and re-tuned the presets accordingly. This should give you an improved quality/speed ratio throughout the presets, but you can not assume that the absolute speed (or the absolute quality) is exactly the same as before for each preset. So, I think your test may be kind of misleading. In order to get "meaningful" results, you must either choose settings that give the same speed in v1.8 and v1.9 and then compare the resulting quality (at the same bitrate!), or you must choose settings that give the same quality (at the same bitrate!) in v1.8 and v1.9 and then compare the resulting speed. This is more difficult, yes. But, at the moment, you might be comparing apples and oranges...

sneaker_ger

6th February 2016, 11:58

He said he accounted for the preset changes.

Maybe related to that:
x265 version 1.9

- Multi-socket machines now use a single pool of threads that can work cross-socket.
http://x265.org/x265-version-1-9/

1.7 doc:
Default “”, one thread is allocated per detected hardware thread (logical CPU cores) and one thread pool per NUMA node.
http://x265.readthedocs.org/en/1.7/cli.html

current doc:
Default “”, one pool is created across all available NUMA nodes, with one thread allocated per detected hardware thread (logical CPU cores).
http://x265.readthedocs.org/en/default/cli.html

MasterNobody

6th February 2016, 12:08

Question in this thread is OK, answer is not.

The abs2 function does the following:
(x, y) |-> { (abs(x), abs(y))    if x >= 0
           { (abs(x), abs(y+1))  if x < 0

It is crucial for x265 (and x264) because it is part of the SATD computation.
You can compile and run the test sample (http://pastie.org/10710948) yourself; the result will be:
(-1, 0) -> 0xffffffffffffffff -> 0x0000000000000001

Update
Or you can compile and run the test sample (http://pastebin.com/9EEd2nk9) yourself; the result will be:
( -2, -2) -> 0xfffffffdfffffffe -> 0x0000000200000002 -> ( 2, 2)
( -2, -1) -> 0xfffffffefffffffe -> 0x0000000100000002 -> ( 2, 1)
( -2, 0) -> 0xfffffffffffffffe -> 0x0000000000000002 -> ( 2, 0)
( -2, 1) -> 0x00000000fffffffe -> 0x0000000100000002 -> ( 2, 1)
( -2, 2) -> 0x00000001fffffffe -> 0x0000000200000002 -> ( 2, 2)
( -1, -2) -> 0xfffffffdffffffff -> 0x0000000200000001 -> ( 1, 2)
( -1, -1) -> 0xfffffffeffffffff -> 0x0000000100000001 -> ( 1, 1)
( -1, 0) -> 0xffffffffffffffff -> 0x0000000000000001 -> ( 1, 0)
( -1, 1) -> 0x00000000ffffffff -> 0x0000000100000001 -> ( 1, 1)
( -1, 2) -> 0x00000001ffffffff -> 0x0000000200000001 -> ( 1, 2)
( 0, -2) -> 0xfffffffe00000000 -> 0x0000000200000000 -> ( 0, 2)
( 0, -1) -> 0xffffffff00000000 -> 0x0000000100000000 -> ( 0, 1)
( 0, 0) -> 000000000000000000 -> 000000000000000000 -> ( 0, 0)
( 0, 1) -> 0x0000000100000000 -> 0x0000000100000000 -> ( 0, 1)
( 0, 2) -> 0x0000000200000000 -> 0x0000000200000000 -> ( 0, 2)
( 1, -2) -> 0xfffffffe00000001 -> 0x0000000200000001 -> ( 1, 2)
( 1, -1) -> 0xffffffff00000001 -> 0x0000000100000001 -> ( 1, 1)
( 1, 0) -> 0x0000000000000001 -> 0x0000000000000001 -> ( 1, 0)
( 1, 1) -> 0x0000000100000001 -> 0x0000000100000001 -> ( 1, 1)
( 1, 2) -> 0x0000000200000001 -> 0x0000000200000001 -> ( 1, 2)
( 2, -2) -> 0xfffffffe00000002 -> 0x0000000200000002 -> ( 2, 2)
( 2, -1) -> 0xffffffff00000002 -> 0x0000000100000002 -> ( 2, 1)
( 2, 0) -> 0x0000000000000002 -> 0x0000000000000002 -> ( 2, 0)
( 2, 1) -> 0x0000000100000002 -> 0x0000000100000002 -> ( 2, 1)
( 2, 2) -> 0x0000000200000002 -> 0x0000000200000002 -> ( 2, 2)
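
For readers without a C compiler at hand, the rows above can be reproduced with a minimal Python sketch. It assumes the layout implied by the 64-bit values in the table (sum_t is 32 bits, sum2_t is 64 bits, BITS_PER_SUM = 32). The helpers pack and unpack are made up for this illustration, and abs2 is modelled on the x264/x265 helper of the same name, so treat it as a sketch rather than a verbatim copy of the source.

```python
BITS_PER_SUM = 32                           # assumption: sum_t is 32 bits wide
SUM_MASK = (1 << BITS_PER_SUM) - 1          # all ones in one half
SUM2_MASK = (1 << (2 * BITS_PER_SUM)) - 1   # emulates wrap-around of a 64-bit sum2_t


def pack(x, y):
    """Sign-extend x to sum2_t and add y shifted into the upper half."""
    return ((x & SUM2_MASK) + ((y & SUM2_MASK) << BITS_PER_SUM)) & SUM2_MASK


def abs2(a):
    """Absolute value of both packed halves at once."""
    # Pick the sign bit of each half and spread it into a full-width mask.
    s = ((a >> (BITS_PER_SUM - 1)) & ((1 << BITS_PER_SUM) + 1)) * SUM_MASK
    # (a + s) ^ s negates each half whose sign bit was set. The carry out of a
    # negative low half is what cancels the borrow left by sign extension.
    return ((a + s) & SUM2_MASK) ^ s


def unpack(a):
    """Split a packed value into (low half, high half)."""
    return a & SUM_MASK, a >> BITS_PER_SUM


for x, y in [(-1, 0), (-2, 1), (0, -2)]:
    packed = pack(x, y)
    result = abs2(packed)
    print(f"({x:2d}, {y:2d}) -> {packed:#018x} -> {result:#018x} -> {unpack(result)}")
```

Running the loop over all pairs in -2..2 instead of the three shown gives (abs(x), abs(y)) for every row of the table.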

6th February 2016, 13:12

You can compile and run test sample (http://pastie.org/10710948) yourself and result will be:
(-1, 0) -> 0xffffffffffffffff -> 0x0000000000000001

Thanks for this sample. If you change line 29 in your sample from a sign-extended cast to a zero-extended cast:
sum2_t sum2 = (sum2_t)(sum_t)x + ((sum2_t)y << BITS_PER_SUM);

you get the case I was considering.

The question is: which case applies in the x264/x265 code? Line 200 of common/pixel.cpp (x265) is:
a0 = (pix1[0] - pix2[0]) + ((sum2_t)(pix1[4] - pix2[4]) << BITS_PER_SUM);
which is a zero-extended cast.

It then goes through the HADAMARD4 macro twice. I don't yet know whether that is OK or not -- I will look into it.

MasterNobody

6th February 2016, 14:07

The problem is: what case is in x264/x265 code? Line 200 from file common/pixel.cpp (x265) is:
a0 = (pix1[0] - pix2[0]) + ((sum2_t)(pix1[4] - pix2[4]) << BITS_PER_SUM);
which is zero-extended cast.
No. It is the same cast as in my code, int -> sum2_t, which is sign-extended (because the source type is a signed int). We get this int because both pix1[0] and pix2[0] are promoted from the pixel type to (signed) int before the subtraction.

And yes, your added cast to sum_t will break it, because you first convert the value to sum_t and only then zero-extend it (because sum_t is unsigned) to sum2_t.
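
The difference between the two casts can be checked numerically. The sketch below emulates the C conversions with bit masks, assuming a 32-bit int and sum_t and a 64-bit sum2_t; the function names are hypothetical and exist only for this illustration.

```python
BITS_PER_SUM = 32                           # assumption: sum_t is 32 bits wide
SUM_MASK = (1 << BITS_PER_SUM) - 1
SUM2_MASK = (1 << (2 * BITS_PER_SUM)) - 1


def int_to_sum2_sign_extended(v):
    """(sum2_t)v for a signed int v: a negative v fills the upper half with ones."""
    return v & SUM2_MASK


def int_to_sum2_zero_extended(v):
    """(sum2_t)(sum_t)v: wrap to unsigned sum_t first, so the upper half stays zero."""
    return v & SUM_MASK


def pack(lo, hi, convert):
    """lo + (hi << BITS_PER_SUM) in sum2_t arithmetic, using the chosen cast for lo."""
    return (convert(lo) + ((hi & SUM2_MASK) << BITS_PER_SUM)) & SUM2_MASK


signed = pack(-1, 0, int_to_sum2_sign_extended)
zeroed = pack(-1, 0, int_to_sum2_zero_extended)
print(f"sign-extended: {signed:#018x}")   # upper half holds 0 - 1 (a borrow)
print(f"zero-extended: {zeroed:#018x}")   # upper half holds 0 (no borrow)
```

For non-negative differences both variants produce the same packed value; they only diverge when the low difference is negative, which is exactly the case debated here.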

6th February 2016, 18:18

No. It is the same cast as in my code int -> sum2_t which is sign extended (because cast from signed int). And we get this int because we promote both pix1[0] and pix2[0] to (signed) int from pixel type.

Thanks for the explanation. I wrongly assumed that (pix1[0] - pix2[0]) is an unsigned int.

pradeeprama

7th February 2016, 05:27

It seems v1.9 is less efficient especially when used in dual NUMAs(CPUs) system compared to v1.8. We deployed the same parameters (change in the presets incorporated so that every single parameter is the same), the speed (fps) has dropped by 15%.

Any idea why this happens?

x265 -D 10 --preset slower --crf 16.0 --ctu 32 --max-tu-size 16 --tu-intra-depth 3 --tu-inter-depth 3 --rdpenalty 2 --me 3 --subme 5 --merange 44 --b-intra --no-rect --no-amp --ref 5 --weightb --keyint 360 --min-keyint 1 --bframes 10 --aq-mode 1 --aq-strength 1.1 --rd 5 --psy-rd 0.8 --psy-rdoq 4.0 --rdoq-level 1 --no-sao --no-open-gop --rc-lookahead 80 --scenecut 40 --max-merge 4 --qcomp 0.80 --no-strong-intra-smoothing --input-depth 10 --deblock -2:-2 --qg-size 16 --vbv-bufsize 28000 --vbv-maxrate 25000 --limit-refs 0 --no-limit-modes

Is this on a server or on a desktop? If it is a multi-socketed server, can you try to explicitly specify --pools N,N where N is the # threads per socket if you're running on a dual-socket machine. In some faster presets with some videos, we've seen that explicitly setting the # pools and threads per pool helps performance; we launch a single pool with combined threads by default.

Also, while your command line calls preset slower, you seem to be turning off pretty much all features that are available in the slower presets to make it efficient (rect, amp, sao, reduced rdoq level). Perhaps you want to move to use a faster default preset that is more representative of the command line you're using? It looks like you're looking at something faster than ultrafast actually!
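
As a rough illustration of the --pools N,N suggestion above, the following Python sketch builds an explicit per-node value. The helper name and the thread counts are assumptions made for this example; only the --pools N,N syntax itself comes from the post.

```python
def pools_argument(threads_per_node):
    """Return a --pools value with one entry per NUMA node, e.g. "24,24"."""
    if not threads_per_node:
        raise ValueError("need at least one NUMA node")
    if any(n <= 0 for n in threads_per_node):
        raise ValueError("thread counts must be positive")
    return ",".join(str(n) for n in threads_per_node)


# Hypothetical dual-socket machine with 24 hardware threads per socket.
args = ["x265", "--pools", pools_argument([24, 24])]
print(" ".join(args))
```

The remaining encoder options would be appended to the same argument list unchanged.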

x265_Project

7th February 2016, 06:51

FYI - Pradeeprama is a manager on the x265 development team. Pradeep's team is responsible for testing, platform and performance optimization.

littlepox

7th February 2016, 08:38

Is this on a server or on a desktop? If it is a multi-socketed server, can you try to explicitly specify --pools N,N where N is the # threads per socket if you're running on a dual-socket machine. In some faster presets with some videos, we've seen that explicitly setting the # pools and threads per pool helps performance; we launch a single pool with combined threads by default.

Also, while your command line calls preset slower, you seem to be turning off pretty much all features that are available in the slower presets to make it efficient (rect, amp, sao, reduced rdoq level). Perhaps you want to move to use a faster default preset that is more representative of the command line you're using? It looks like you're looking at something faster than ultrafast actually!

The speed reduction is seen BOTH on the desktop (4790K) and on the server (dual E5 2683v3).
It is much worse on the server: sometimes x265 only uses one NUMA node while the other sleeps. We have tested that --pools "N,N" solves this issue, but the result is still slower than v1.8+2.

BTW, I'm not targeting ultrafast; I'm using --ref 5 --me star --subme 5 --rd 5. The parameters are heavily modified to suit high-quality 1080p anime encoding. rect and amp are disabled because they contribute little and were (previously) too slow to bear (we have recently been considering bringing them back with --limit-refs and --limit-modes, but we spotted the speed reduction at the beginning of that test). sao and rdoq are tweaked for visual quality reasons.

chenm001

7th February 2016, 16:16

Thanks for explanation. I wrongly assumed that (pix1[0] - pix2[0]) is unsigned int.

http://en.cppreference.com/w/cpp/language/implicit_cast

(pix1[0] - pix2[0]) ---> (uint16_t - uint16_t) ---> (int32 - int32)

The result is finally converted to sum2_t, because the second operand is sum2_t.

x265_Project

7th February 2016, 17:26

FYI - chenm001 is an architect on the x265 development team.

8th February 2016, 00:04

It seems v1.9 is less efficient especially when used in dual NUMAs(CPUs) system compared to v1.8. We deployed the same parameters (change in the presets incorporated so that every single parameter is the same), the speed (fps) has dropped by 15%.

I've checked speed of encoding on i5 3450S CPU (AVX level). The speed drop is from 7.36 to 6.73 fps -- about 9%. But the result of encoding is different -- you can check bitrate, PSNR and SSIM (in attached screen.txt). It is slower but the quality is better.
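
The quoted percentage is easy to check. A small Python sketch of the arithmetic, using only the two frame rates reported in this post (the function name is made up for the example):

```python
def relative_drop(old_fps, new_fps):
    """Fractional slowdown going from old_fps to new_fps."""
    return (old_fps - new_fps) / old_fps


drop = relative_drop(7.36, 6.73)
print(f"speed drop: {drop:.1%}")
```

This gives roughly 8.6%, consistent with the "about 9%" stated above.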

littlepox

8th February 2016, 04:33

I've checked speed of encoding on i5 3450S CPU (AVX level). The speed drop is from 7.36 to 6.73 fps -- about 9%. But the result of encoding is different -- you can check bitrate, PSNR and SSIM (in attached screen.txt). It is slower but the quality is better.

Thanks for verifying this for me.
This explanation makes sense if they have improved the logic but the code is not yet as optimized as in v1.8.

We are still left with the multi-NUMA issue to worry about.

nandaku2

8th February 2016, 06:57

Thanks for verifying this for me.
This explanation makes sense, if they have improved the logic but code not as optimized as v1.8.

We are still left the issue for multi-NUMA system to worry.

Please turn on limit-refs. You will retain most of the improved quality, and greatly improve speed.

LigH

8th February 2016, 10:29

I contacted a person who has access to a dual Xeon, I hope to get a reply soon...

pradeeprama

8th February 2016, 11:07

The speed reduction is seen BOTH in desktop(4790K) and server(E5 2683v3 dual).
Much worse on the server, some times x265 only consumes one NUMA node, with another sleeping...It has been tested that with --pools "N,N" the issue can be solved, but still slower than v1.8+2

This is strange. On our dual-Xeon E5-2699v3, I see nearly a 50% & 100% improvement in 4K encode speed at the default veryslow and ultrafast presets, respectively. On the 4790K, the improvement is a good 40-50%. These numbers are averaged across typical open-source videos. I wonder if it is something specific in your video, or settings that is causing such a big dip. Can you share your video with us?

Also, when you say that "some times x265 consumes only one NUMA node while the other is sleeping", is that in the middle of the execution or all the time? There was some initial problem with the implementation on Windows platform that was causing us to use only one NUMA node, but this was later fixed and I haven't seen any issues since then.

BTW, I'm not targeting ultrafast, I'm using --ref 5 --me star --subme 5 --rd 5. The parameters are heavily modified to fit high-quality 1080p anime encoding. rect and amp are disabled for being little contributing and (previously) too slow to bear(we are considering to bring them back recently with --limit-refs and --limit-mode, but at the beginning of the test we spotted the reduction on speed). sao and rdoq are tweaked due to visual quality adjustment.

Thanks for the clarification here - I missed some of your other parameters. Limit-refs and modes will considerably help you out as you have identified already

littlepox

8th February 2016, 12:38

This is strange. On our dual-Xeon E5-2699v3, I see nearly a 50% & 100% improvement in 4K encode speed at the default veryslow and ultrafast presets, respectively. On the 4790K, the improvement is a good 40-50%. These numbers are averaged across typical open-source videos. I wonder if it is something specific in your video, or settings that is causing such a big dip. Can you share your video with us?

uhmmm... I don't think I'm supposed to share them, since we are trying to backup some anime BDs. But the source should not be the key point. If you wish to test, find some fine-quality 1080p Japanese anime clips, or other lightly noisy 1080p film clips, those should do.

You observe the efficiency increase because you changed the default preset settings (ref, limit-refs, limit-modes), but here I have overridden every one of them, so you effectively do NOT benefit from the change; this forces a comparison with exactly the same parameters.

As tested by Ma (http://forum.doom9.org/showthread.php?p=1756487#post1756487), it could be that the logic is different, so that more computation is needed to reach a better quality, and the new code has not been heavily optimized yet. If that's the case, it is fine.

Also, when you say that "some times x265 consumes only one NUMA node while the other is sleeping", is that in the middle of the execution or all the time? There was some initial problem with the implementation on Windows platform that was causing us to use only one NUMA node, but this was later fixed and I haven't seen any issues since then.

This happens randomly (sometimes it utilizes both NUMA nodes, other times only one), but if you are unlucky, the idle NUMA node sleeps all the time.

We are testing this on Win10. I'm asking my friend to run a few tests on Win7; hopefully I can update this post later.
Confirmed, this is seen on Windows 7 as well.

We are using the builds from http://www.msystem.waw.pl/x265/ , Stable branch VS 2015 clean builds, AVX2

LigH

8th February 2016, 19:55

I contacted a person who has access to a dual Xeon, I hope to get a reply soon...

May take a little longer, he would have to build for MacOS X.

9th February 2016, 00:25

Thanks for verifying this for me.
This explanation makes sense, if they have improved the logic but code not as optimized as v1.8.

We are still left with the issue for multi-NUMA systems to worry.

I can't help with multi-NUMA systems directly (I don't have any), but you could provide more detailed info.

I've compiled the stable 1.8 & 1.9 versions with the '-DDETAILED_CU_STATS=ON' option:
www.msystem.waw.pl/x265/x265-1.8+2-1f0d4de-stable_vs2015-AVX2-detailed.7z
www.msystem.waw.pl/x265/x265-1.9+2-ee38630-stable_vs2015-AVX2-detailed.7z

You could use these versions with the additional option '--log-level debug' on a multi-NUMA system (encoding the same short sample) and then copy the console window to the clipboard (Right Click -> Select All -> ENTER). I think the detailed info will make it much easier for the x265 team to find the source of this problem.

littlepox

9th February 2016, 15:33

I can't help with multi-NUMA systems directly (I don't have any), but you could provide more detailed info.

I've compiled stable 1.8 & 1.9 version with '-DDETAILED_CU_STATS=ON' option:
www.msystem.waw.pl/x265/x265-1.8+2-1f0d4de-stable_vs2015-AVX2-detailed.7z
www.msystem.waw.pl/x265/x265-1.9+2-ee38630-stable_vs2015-AVX2-detailed.7z

You could use these versions with additional option '--log-level debug' on multi-NUMA system (encoding the same short sample) and then copy console window to clipboard (Right Click -> Select All -> ENTER). I think that it will be much easier for x265 team to find the source of this problem with detailed info.

Thank you for all this. Here are the test results (.csv with logging information):

http://1drv.ms/20m04GC

1.8_avx2 and 1.9_avx2 were tested on a Core i7 4790K.
The speed reduction is about 5% (really small in this test...).

no_pools and pools_++ were tested on the dual E5v3 (12C24T * 2).
no_pools is the case that uses only one NUMA node, and pools_++ means we used --pools "+,+" to force it to work properly.

The speed difference is about 2x.

Hope the above test result shall help the x265 team.

9th February 2016, 22:01

no_pools and pools_++ are tested with dual E5v3 (12C24T * 2)
no_pools is of the case that uses only one NUMA node, and pools_++ means we used --pools "+,+" to force it work properly.

the speed is about 2x difference.

Weird. Maybe it is Windows-specific behavior. I assume that x265 1.8 without any pools option works similarly to ver. 1.9 with the '--pools +,+' option?

littlepox

10th February 2016, 02:38

Weird. Maybe it is only Windows specific behavior. I assume that x265 1.8 without specify any pools option works similar to ver. 1.9 with '--pools +,+' option?

Yes. v1.8 works quite OK without specifying --pools.

pingfr

10th February 2016, 07:55

I must say I have a similar experience with a 5960X system (8c/16t) where I see x265.exe not even using 60% of the CPU "raw computing power".

Is this an intended behaviour or is there a way to force x265 aggressively on the 5960X hoping to achieve more fps crunching?

Edit: More details should they matter:

My x265 version:
x265 [info]: HEVC encoder version 1.9+9-8e093e85b9ab
x265 [info]: build info [Windows][GCC 5.3.0][64 bit] 8bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX AVX2 FMA3 LZCNT BMI2

My command line:
avs4x265.exe -P x265.exe --psy-rd 2 --tune grain --crf 17 --preset veryslow --output %~n1.hevc %1

10th February 2016, 21:31

Yes. v1.8 works quite OK without specifying --pools.

I decided to find suspect commits that could be the cause of the problem. I found only one: 10983 -- https://bitbucket.org/multicoreware/x265/commits/e1adac00dce8e5641cbe9aec3d50a72261c308d9

It is my old enemy and this time I've killed it immediately. I've compiled a special build of ver. 1.9+2 stable -- without this one commit. Could you test this special build on multi-NUMA system?
www.msystem.waw.pl/x265/x265-1.9+2-ee38630-stable-10983_vs2015-AVX2.7z

nevcairiel

10th February 2016, 22:36

One guess would be that it's designed around libNUMA and the Windows version of that code is just missing some smarter logic.
More specifically, from a quick look at the linked commit, it requires a build targeted at Windows 7 or newer at compile time to even make use of the special Windows logic (so no XP-compatible builds). It would probably be smarter to load those functions dynamically when available, so builds are more generally useful.

pingfr

10th February 2016, 23:22

@Ma: Tested the 1.9+2 stable you posted a few minutes ago, saw no noticeable changes on my end but then again my CPU despite being 8c/16t is a single physical CPU system.

10th February 2016, 23:26

I must say I have a similar experience with a 5960X system (8c/16t) where I see x265.exe not even using 60% of the CPU "raw computing power".

It could be normal -- you have 8 cores and x265 runs 16 threads on your CPU. All threads want 100% of the computing power all the time, so the result should be about 50% (it is a bit better thanks to the CPU's multithreading).

I'm curious what happens when you add the '--pools 8' option. Will it be 8 threads on 8 cores at about 100% computing power, or 8 threads on 4 cores at about 60%?

pingfr

10th February 2016, 23:54

@Ma: with a --pools 8, it's even worse, the cores are used at 40% of their capacity only.

11th February 2016, 00:05

@Ma: with a --pools 8, it's even worse, the cores are used at 40% of their capacity only.

Do you refer to "detailed CU info" like in this example:
encoded 504 frames in 26.41s (19.09 fps), 4083.78 kb/s, Avg QP:37.82
x265 [info]: CU: %41.14 time spent in motion estimation, averaging 8.288 CU inter modes per CTU
x265 [info]: CU: Skipped motion searches per depth %8.78 %11.86 %19.99 %0.00
x265 [info]: CU: %02.60 time spent in intra analysis, averaging 5.693 Intra PUs per CTU
x265 [info]: CU: Skipped intra CUs at depth %-nan(ind) %63.49 %120.95
x265 [info]: CU: %22.33 time spent in inter RDO, measuring 19.088 inter/merge predictions per CTU
x265 [info]: CU: %05.56 time spent in intra RDO, measuring 16.963 intra predictions per CTU
x265 [info]: CU: %02.40 time spent in loop filters, average 0.381 ms per call
x265 [info]: CU: %00.12 time spent in weight analysis, average 0.926 ms per call
x265 [info]: CU: %15.49 time spent in slicetypeDecide (avg 96.802ms) and prelookahead (avg 2.980ms)
x265 [info]: CU: %10.36 time spent in other tasks
x265 [info]: CU: Intra RDO time per depth %00.00 %34.78 %18.75 %46.48
x265 [info]: CU: Intra RDO calls per depth %00.00 %03.17 %07.70 %89.13
x265 [info]: CU: Inter RDO time per depth %41.67 %32.64 %14.59 %11.09
x265 [info]: CU: Inter RDO calls per depth %09.58 %19.54 %25.49 %45.39
x265 [info]: CU: 120960 64X64 CTUs compressed in 95.965 seconds, 1260.460 CTUs per worker-second
x265 [info]: CU: 3.634 average worker utilization, %90.85 of theoretical maximum utilization
which on my i5 3450S CPU (4c/4t) is 90.85% or you observe "Task Manager" and you see about 40% on all 16 logical cores?
-----------------
I've tested the '--pools 2' option on my 4c/4t CPU. Win 7 schedules these 2 threads across all 4 cores. So I assume that Windows is able to optimize threads within one CPU but probably can't or doesn't want to move threads between CPU nodes.
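
The summary lines in the log above are consistent with each other, which helps when comparing the 90.85% figure with a Task Manager reading. The Python sketch below redoes the arithmetic from the printed numbers. The interpretation (worker-seconds divided by wall-clock time, and a pool of 4 threads on the 4c/4t CPU) is an assumption inferred from the figures, not something taken from the x265 source.

```python
frames = 504
elapsed_s = 26.41      # wall-clock encode time from the first log line
ctus = 120960
worker_s = 95.965      # "compressed in 95.965 seconds", assumed to be summed over workers
pool_threads = 4       # assumption: one worker per logical core on the i5 3450S

fps = frames / elapsed_s
ctus_per_worker_s = ctus / worker_s
avg_utilization = worker_s / elapsed_s              # busy workers on average
pct_of_max = 100.0 * avg_utilization / pool_threads

print(f"{fps:.2f} fps")
print(f"{ctus_per_worker_s:.3f} CTUs per worker-second")
print(f"{avg_utilization:.3f} average worker utilization")
print(f"{pct_of_max:.2f}% of theoretical maximum")
```

The results agree with the log to within the rounding of the printed inputs, so the utilization percentage appears to be relative to the worker threads in the pool rather than to whatever the operating system reports per logical core.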

pingfr

11th February 2016, 00:33

@Ma: I'm running Windows 7, using Core Temp gadget on the side which shows me the CPU usage and the temperature for each core.

During x265 usage with --preset veryslow, I see my temperatures rising as high as 79°C but the load per core is only roughly 60% each.

Thus I'm wondering if there's a parameter to force x265 to saturate cores to their full extent and therefore crunch fps faster maybe?

pradeeprama

11th February 2016, 05:21

I decided to find suspect commits that could be the cause of the problem. I found only one: 10983 -- https://bitbucket.org/multicoreware/x265/commits/e1adac00dce8e5641cbe9aec3d50a72261c308d9

It is my old enemy and this time I've killed it immediately. I've compiled a special build of ver. 1.9+2 stable -- without this one commit. Could you test this special build on multi-NUMA system?
www.msystem.waw.pl/x265/x265-1.9+2-ee38630-stable-10983_vs2015-AVX2.7z

Thanks for the breakdown and the extensive testing. Even on the Linux box, I've noticed that for some clips, explicitly specifying a --pools option gave much higher performance than leaving it out. However, through extensive testing on several machines, we found that not specifying --pools was by and large better, and hence we decided to change the default behavior. The difference was primarily governed by the balance between work per thread and communication overhead.

If specifying the --pools option gives you better performance, you may include it in your command line; the thread pools will then behave as they did in versions up to 1.8.

pradeeprama

11th February 2016, 05:26

@Ma: with a --pools 8, it's even worse, the cores are used at 40% of their capacity only.

Specifying --pools 8 in the command line for x265 on an 8c16t machine will result in using only 4 cores; each core will have 2 threads.

11th February 2016, 07:58

pradeeprama

11th February 2016, 08:55

Now I'm lost. In ver. 1.9 source code I tried to find difference between not using pools option and '--pools +,+' option and it looks like there is no difference.

In littlepox data it is clear that the speed is only 50% without pools option.

@pradeeprama could you point what is different with '--pools +,+' option in source code?

@Ma, you are right - not using the --pools in the command line should be identical to using --pools +,+ for a system with 2 nodes; both should create one pool with # worker threads = sum of available logical cpus across both nodes.

@Littlepox - I'd misunderstood your problem earlier, my apologies. I saw your logs again and you can see that in both your runs, x265 creates one pool with 48 threads. Is there some RAM bottleneck in the system when you run for the first time because of which when you don't give --pools in the command line, you're working out of the disk and not the RAM? If you run the identical command line 3-4 times, do you consistently see a much lower FPS?

pingfr

11th February 2016, 08:57

Could any of you tell me which --pools option(s) should I use on a 8c/16t machine with 32GB of RAM?

Would be appreciated.

pradeeprama

11th February 2016, 09:01

Could any of you tell me which --pools option(s) should I use on a 8c/16t machine with 32GB of RAM?

Would be appreciated.

By default x265 should be able to find all 16 threads and use them for encoding; look at the console log for identifying how many threads were created. You shouldn't have to specify anything explicit.

pingfr

11th February 2016, 09:08

pradeeprama

11th February 2016, 09:34

@pradeeprama: I see the console reporting that it is using 16 threads indeed, but my concern is that, looking closer, the 8 cores are barely loaded, with the load never exceeding 60% on each core.

Is there a way to set the process to be more aggressively crunching fps?

I see that you're using the slower preset. Maybe you want to consider a quicker preset if you want higher fps? Of course, it is an efficiency trade-off though.

pingfr

11th February 2016, 09:39

@pradeeprama: I am using that preset because I am doing Blu-Ray source "archival". I am actually using the --preset veryslow at the moment. Picture quality is my top notch priority over anything else.

Off-topic question while you're here: I'm passing the --psy-rd 2 parameter, my question is: would passing --psy-rd 3 or --psy-rd 4 have any effect on perceptible quality when used with the veryslow preset?

littlepox

11th February 2016, 10:05

@Ma

My friend tested your new compile and it's functioning very well.

@pradeeprama

I do NOT think there are any bottlenecks. Nevertheless, I'm asking my friend to redo the test a few more times; hopefully I can update this post later.

Update: It is confirmed that the loss of speed and the "sleeping" NUMA node occur with HIGH probability: we see it with a >50% chance on the native v1.9+2 build WITHOUT --pools +,+.

Boulder

11th February 2016, 10:49

@pradeeprama: are there any expected changes in rd-refine in the near future? When I posted a bug report at BitBucket, one comment mentioned that there are some enhancements to do as you were not yet happy with the output. I've put my x265 encodes on hold as rd-refine at least makes smaller files, and warrants further testing. I've not seen any test reports by anyone so far.

11th February 2016, 10:57

@Ma

My friend tested your new compile and it's functioning very well.

Thanks for this info. [...]

littlepox

11th February 2016, 11:15

Thanks for this info. Now we can try to shoot directly to the enemy.
I've compiled ver. 1.9+2 with one small change:
diff -r ee38630033b7 source/encoder/encoder.cpp
--- a/source/encoder/encoder.cpp Fri Feb 05 10:49:41 2016 +0530
+++ b/source/encoder/encoder.cpp Thu Feb 11 10:24:39 2016 +0100
@@ -102,6 +102,7 @@
}

bool allowPools = !p->numaPools || strcmp(p->numaPools, "none");
+ Sleep(500);

// Trim the thread pool if --wpp, --pme, and --pmode are disabled
if (!p->bEnableWavefront && !p->bDistributeModeAnalysis && !p->bDistributeMotionEstimation && !p->lookaheadSlices)

Could you test this new build?
www.msystem.waw.pl/x265/x265-1.9+2-ee38630-stable-sleep_vs2015-AVX2.7z

With this version, without specifying --pools, the speed reduction and the sleeping NUMA node are still there. All 48 threads are squeezed into one NUMA node, and the other does no computation.

log says "x265 [info]: Thread pool 0 using 48 threads on numa nodes 0,1"

however, with --pools, it works fine, two NUMA nodes share the work equally, and the log changed to:

x265 [info]: Thread pool 0 using 24 threads on numa nodes 0
x265 [info]: Thread pool 1 using 24 threads on numa nodes 1

littlepox

11th February 2016, 11:27

It seems a bit messy right now, I shall re-state the problem:

x265 1.8 works fine, the log says:

x265 [info]: Thread pool 0 using 24 threads on numa nodes 0
x265 [info]: Thread pool 1 using 24 threads on numa nodes 1

WITH PLAIN x265 1.9 build:

specifying --pools 24,24 all 48 threads are allocated into two thread pools, and the log says:
x265 [info]: Thread pool 0 using 24 threads on numa nodes 0
x265 [info]: Thread pool 1 using 24 threads on numa nodes 1

NOT specifying --pools 24,24 all 48 threads are allocated into one thread pool(reduction of speed), and the log says:
x265 [info]: Thread pool 0 using 48 threads on numa nodes 0,1

With MA's build: (http://forum.doom9.org/showthread.php?p=1756612#post1756612)

specifying --pools 24,24 all 48 threads are allocated into two thread pools, and the log says:
x265 [info]: Thread pool 0 using 24 threads on numa nodes 0
x265 [info]: Thread pool 1 using 24 threads on numa nodes 1

HOWEVER, the previous csv file says "x265 [info]: Thread pool 0 using 48 threads on numa nodes 0,1". The reason is unknown (it may be a wrong copy-paste, or different behavior); we have re-tested a few times here, so whatever is reported in this post should be correct.

NOT specifying --pools 24,24 all 48 threads are allocated into one thread pool(reduction of speed), and the log says:
x265 [info]: Thread pool 0 using 48 threads on numa nodes 0,1

In summary for x265 1.9:

with --pools 24,24:
x265 [info]: Thread pool 0 using 24 threads on numa nodes 0
x265 [info]: Thread pool 1 using 24 threads on numa nodes 1
http://img.2222.moe/images/2016/02/11/fixed.png

without --pools 24,24:
x265 [info]: Thread pool 0 using 48 threads on numa nodes 0,1
BUT numa node 1 is not utilized:
http://img.2222.moe/images/2016/02/11/wrong.png

BTW, since my friend uses such a powerful dual-socket computer, the encoding, even with --pools +,+, takes only ~50% of total CPU power. So it may be that x265/Windows decides to put all the work on one NUMA node and leave the other uninvolved. The problem is that this strategy reduces speed.
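
The "Thread pool" log lines in the summary above can be checked mechanically, which is handy when repeating the test many times. This is a minimal Python sketch; the line format is taken from the logs quoted in this post, while the function names are hypothetical and the "single pool spanning several nodes" check simply flags the layout reported here as the slow case.

```python
import re

POOL_LINE = re.compile(
    r"Thread pool (\d+) using (\d+) threads on numa nodes (\d+(?:,\d+)*)"
)


def parse_pools(log_text):
    """Return a list of (pool_index, thread_count, [node, ...]) tuples."""
    pools = []
    for match in POOL_LINE.finditer(log_text):
        nodes = [int(n) for n in match.group(3).split(",")]
        pools.append((int(match.group(1)), int(match.group(2)), nodes))
    return pools


def spans_multiple_nodes(pools):
    """True if any single pool is spread over more than one NUMA node."""
    return any(len(nodes) > 1 for _, _, nodes in pools)


with_pools = """x265 [info]: Thread pool 0 using 24 threads on numa nodes 0
x265 [info]: Thread pool 1 using 24 threads on numa nodes 1"""
without_pools = "x265 [info]: Thread pool 0 using 48 threads on numa nodes 0,1"

for name, log in [("with --pools 24,24", with_pools), ("without --pools", without_pools)]:
    pools = parse_pools(log)
    print(name, pools, "cross-node pool:", spans_multiple_nodes(pools))
```

Note that the log only shows how the pools were created; whether the second node actually does work still has to be confirmed in Task Manager, as in the screenshots above.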

pingfr

11th February 2016, 12:15

BTW, since my friend uses such a powerful dual-socket computer, the encoding, even with --pools +,+, takes only ~50% of total CPU power. so It may be possible that x265/windows decide to allocate the work into one NUMA node and leave the other not involved. but the problem is that this strategy reduces speed.

And despite my 5960X being single socket computer, I have exactly the same issue here (sorry to interfere in your posts).

My 8 cores are never used at more than 60% of their CPU power.

It would make sense that owners of "High-end machines" would want to harness the full power of their systems for x265 encoding and this is clearly not the case here.

11th February 2016, 12:22

For this version, without specifying --pools, there is still the problem of speed reduction and an asleep NUMA node. all 48 threads are squeezed into 1 NUMA node, and the other not taking up computation.

log says "x265 [info]: Thread pool 0 using 48 threads on numa nodes 0,1"

however, with --pools, it works fine, two NUMA nodes share the work equally, and the log changed to:

x265 [info]: Thread pool 0 using 24 threads on numa nodes 0
x265 [info]: Thread pool 1 using 24 threads on numa nodes 1

The second output is for '--pools 24,24' option or '--pools +,+' option? I try to understand what's going on with '--pools +,+' vs. without pools encoding.

littlepox

11th February 2016, 12:40

The second output is for '--pools 24,24' option or '--pools +,+' option? I try to understand what's going on with '--pools +,+' vs. without pools encoding.

Epic fail... He found that he had been testing with --pools 24,24, except for the run with the csv output.

I'm asking him to retest everything with --pools +,+.

For now you can assume that --pools 24,24 gives you the desired speed.

Retested with --pools +,+: the behavior is similar to running without --pools. The speed in BOTH cases is highly volatile, and all threads are squeezed into one NUMA node, which is significantly slower than --pools 24,24.

LigH

11th February 2016, 13:46

My 8 cores are never used at more than 60% of their CPU power.

I believe this has been an issue for longer already, and the reason is the rather high level of dependencies between the threads, which is already known to be more a problem for HEVC than for AVC. It used to be a recommendation already months ago to rather run two conversions in parallel than to hope for a 100% utilized single conversion with many cores.

At least so I remember.

Issues with the balance of NUMA pools on multi-socket systems are technically a really different matter.

11th February 2016, 15:16

epic fail...He found he was testing with pools 24,24 except for the one with csv output.

I'm requesting him to retest everything with pools ++.

Currently you can assume pools 24 24 gives u desired speed.

retesting with pools ++, the behavior is similar to without. the speed in BOTH cases are highly volatile, and all threads are squeezed into one NUMA nodes, significantly slower than pools 2424

OK, so the csv output only showed an unusual speedup for the '--pools +,+' option. I think this is the case nevcairiel mentioned -- Linux is smarter than Windows at thread scheduling, so we need separate logic for Linux (unchanged) and for Windows (no pools option and '--pools +,+' should act like '--pools 24,24' on your system).

littlepox

11th February 2016, 15:38

@Ma

Thanks for the explanation. Now it seems we can just use the v1.9 build you compiled without the commit, or manually add --pools "N,N" to the command line. The rest should be simple: as you said, use separate logic on Windows and Linux.

@pingfr

In our experience, on an E5v3 12C24T, encoding 1080p with x265 generally shows >95% CPU usage, so your case is strange. A few possible reasons:

1. How do you feed the source? x265 can only encode at the rate it is given frames; if you are piping them in at a low speed (typically with complex AviSynth scripts), this is to be expected.
2. What resolution are you working with? If it is <=720p, this could explain it, as there is less work to distribute.