Asmodian
24th November 2017, 04:48
If latency matters a lot for x265 (I don't know whether it does), then you want to keep most of the work among the 4 cores inside each CCX.
I don't know how you would tell physical cores from logical threads in software, though, let alone which CCX each core belongs to.
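On Linux at least, one possible way to get at this is through sysfs: each logical CPU has a `topology/thread_siblings_list` file listing its SMT siblings and a `cache/index3/shared_cpu_list` file listing the CPUs that share that cache. The sketch below groups logical CPUs by those lists. It assumes `index3` is the L3 on the machine and that one L3 group corresponds to one CCX on Zen; the helper names (`parse_cpu_list`, `group_by`, `read_sysfs`) are made up for illustration, and other operating systems would need a different mechanism.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_by(shared_lists):
    """Collapse per-CPU sharing lists into distinct groups.

    shared_lists maps logical CPU id -> CPU list string as read from sysfs.
    Returns a sorted list of tuples, one tuple of CPU ids per group.
    """
    groups = {tuple(parse_cpu_list(s)) for s in shared_lists.values()}
    return sorted(groups)


def read_sysfs(relative_path, root="/sys/devices/system/cpu"):
    """Read one file per logical CPU, skipping CPUs that lack it."""
    out = {}
    for cpu_dir in Path(root).glob("cpu[0-9]*"):
        f = cpu_dir / relative_path
        if f.exists():
            out[int(cpu_dir.name[3:])] = f.read_text()
    return out


if __name__ == "__main__":
    # Each group of SMT siblings is one physical core.
    print("cores:", group_by(read_sysfs("topology/thread_siblings_list")))
    # Assumption: cache/index3 is the L3, so each group here is one CCX on Zen.
    print("L3 groups:", group_by(read_sysfs("cache/index3/shared_cpu_list")))
```

With the groups in hand, a thread pool could in principle be pinned so that threads sharing data stay inside one L3 group; whether x265 benefits from that is the open question above.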
This is very interesting for future CPU designs from both companies. Moving to smaller modular dies and architectures is more efficient (in cost and materials), so more layers of NUMA awareness might be very useful. Instead of two NUMA nodes for the 1950X, you could have four "NUMA level 1" nodes, one for each CCX, then two "NUMA level 2" nodes, one for each die. A dual-socket system would have two "NUMA level 3" nodes, one for each CPU.
This way the OS and/or x265 could schedule threads that share data on the lowest NUMA level available, or put independent threads as far apart as possible. Does anyone know if something like this is already possible or planned? It seems key for future performance, given the wide interest in multiple dies and modular architectures.
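To make the level idea concrete, here is a toy model of it. This is only a sketch of the proposal, not an existing OS or x265 API: the layout is hard-coded (the defaults mimic a 1950X with 2 dies, 2 CCXs per die and 4 cores per CCX), and `build_topology`, `lowest_shared_level` and `closest_cores` are hypothetical names.

```python
from itertools import product


def build_topology(sockets=1, dies=2, ccxs=2, cores=4):
    """Return {core_id: (socket, die, ccx)} for a hypothetical layout."""
    topo = {}
    layout = product(range(sockets), range(dies), range(ccxs), range(cores))
    for core_id, (s, d, c, _) in enumerate(layout):
        topo[core_id] = (s, d, c)
    return topo


def lowest_shared_level(topo, a, b):
    """Lowest "NUMA level" node containing both cores, or None if none does.

    1 = same CCX, 2 = same die, 3 = same socket.
    """
    pa, pb = topo[a], topo[b]
    if pa == pb:
        return 1
    if pa[:2] == pb[:2]:
        return 2
    if pa[0] == pb[0]:
        return 3
    return None


def closest_cores(topo, anchor, count):
    """Pick the `count` cores nearest to `anchor`, lowest level first."""
    def key(core):
        level = lowest_shared_level(topo, anchor, core)
        # Cores with no common node (different sockets) sort last.
        return (4 if level is None else level, core)

    others = sorted((c for c in topo if c != anchor), key=key)
    return others[:count]


if __name__ == "__main__":
    topo = build_topology()
    print(lowest_shared_level(topo, 0, 3))   # both in the first CCX
    print(lowest_shared_level(topo, 0, 8))   # different dies, same socket
    print(closest_cores(topo, 0, 3))         # the rest of core 0's CCX
```

Placing independent threads "as far away as possible" would be the same sort reversed. A real scheduler would also have to weigh load, cache pressure and memory placement, which this model ignores.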