Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion. Before you start posting please read the forum rules. By posting to this forum you agree to abide by the rules. |
18th May 2023, 11:13 | #2541 | Link |
Registered User
Join Date: Nov 2009
Posts: 2,367
|
Well, lately I've only been developing, not really encoding/rendering. Even then, I switched to the Redshift render engine, which is GPU based. And for the next 2 years I'll be doing some apps in Dart.
I will wait until the market relaxes a bit, since today's prices are inflated, and then I'll move to the next Xeon, even if it's the bottom tier. I'm not really digging this performance/efficiency core system, with no AVX512 on top of that. I'm also looking forward to AVX-1024, AMX2, TSX, DDR6 and PCI Gen6, and an RTX5000 with path tracing for the GPU with some meaningful RAM+CUDA count.
__________________
i7-4790K@Stock::GTX 1070] AviSynth+ filters and mods on GitHub + Discussion thread |
18th May 2023, 11:20 | #2542 | Link |
Registered User
Join Date: Jul 2018
Posts: 1,207
|
"DDR6"
Once Xeon Max, with at least 64 GB of integrated HBM, starts shipping in 2023, it looks like DDR SDRAM is becoming a thing of the past for high-performance computing. I hope Intel (or maybe AMD) can offer an HBM-based CPU compute platform for the poor and dying home PC market within a few years. Some hope still exists.
The overall compute platform architecture has also finally been redesigned again: instead of slow host RAM with a fast CPU and non-x86/64-compatible compute accelerator(s) attached via the PCIe bus, Intel is finally moving to a high-performance, factory-assembled CPU+RAM module with 20x faster RAM. It is compatible with all x86/64 software and can use standard peripherals, as well as graphics/DCA PCIe adapters if needed. It could also be designed for budget motherboards with a low pin-count LGA CPU socket and 2ch DDR SDRAM if expansion is required.
Last edited by DTL; 18th May 2023 at 11:25. |
18th May 2023, 11:29 | #2543 | Link | |
Guest
Posts: n/a
|
Quote:
However, the 7950X does have AVX512 support, and ALL cores do the job equally (kind of). In my experience, the 7950X is just a little better than a 13900 (except maybe the 13900KS), when encoding. But things just keep getting too expensive |
|
18th May 2023, 11:29 | #2544 | Link |
Registered User
Join Date: Jul 2018
Posts: 1,207
|
The processing may simply be RAM-speed limited, because AVS keeps most of its data in main host RAM.
A simple 6-core Xeon with 6ch DDR4 may run at 1800+ fps, while a 2ch DDR4/DDR5 based CPU with any number of cores may be limited to about 1000 fps - https://forum.doom9.org/showthread.p...62#post1987162
Last edited by DTL; 18th May 2023 at 11:34. |
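A rough back-of-envelope sketch of why the number of RAM channels can cap throughput. It assumes the theoretical peak DDR bandwidth (channels x transfer rate x 8 bytes) and a hypothetical number of full-frame RAM passes per output frame. The DDR4-2933 speed, the 1080p 8-bit 4:2:0 frame size and the pass count are illustrative assumptions, not figures from the linked post.

```python
# Back-of-envelope model: throughput ceiling when every frame must cross the RAM bus.
# All concrete numbers below are illustrative assumptions, not measurements.

def peak_bandwidth_bytes(channels: int, mega_transfers_per_s: float) -> float:
    """Theoretical peak: each 64-bit DDR channel moves 8 bytes per transfer."""
    return channels * mega_transfers_per_s * 1e6 * 8


def frame_bytes_yuv420_8bit(width: int, height: int) -> float:
    """8-bit 4:2:0 stores 1.5 bytes per pixel (full-res luma plus two quarter-res chroma planes)."""
    return width * height * 1.5


def fps_ceiling(channels, mega_transfers_per_s, width, height, ram_passes):
    """Upper bound on fps if each output frame costs `ram_passes` full-frame RAM transfers."""
    bytes_per_output_frame = frame_bytes_yuv420_8bit(width, height) * ram_passes
    return peak_bandwidth_bytes(channels, mega_transfers_per_s) / bytes_per_output_frame


if __name__ == "__main__":
    # Hypothetical workload: 1080p, 20 full-frame reads/writes per output frame.
    for channels in (2, 6):
        ceiling = fps_ceiling(channels, 2933, 1920, 1080, 20)
        print(channels, "channels:", round(ceiling), "fps ceiling")
```

In this simplified model the ceiling scales linearly with the channel count. Real scripts will land below it, and the gap between platforms also depends on caches, core count and access patterns.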
18th May 2023, 11:31 | #2545 | Link | |
Guest
Posts: n/a
|
Quote:
|
|
18th May 2023, 11:40 | #2546 | Link |
Registered User
Join Date: Jul 2018
Posts: 1,207
|
"It's a shame that the 13th Gen Intel's dropped AVX512 support"
Intel optimizes for the general-public (really shrinking and dying) home PC market against the standard performance tests, to stay comparable with AMD products of the same season. Since general-public software gets close to no benefit from AVX512 and fast RAM, marketing products will continue to fight for the best fps in Microsoft Word. So AVX512 and HBM RAM may never reach general-public CPUs again. For Intel it is better to add E-cores with plain AVX2 (2 AVX2 cores can already saturate a simple 2ch DDR4/DDR5 RAM configuration, so there is no need to add 1 AVX512 core with about 4x more performance). That way it stays comparable in MS Word, 7zip and all the other standard marketing tests, to make money each season. If you look at the yearly home PC tests, Intel and AMD are only within about +-10..30% of each other in performance. It is simply about making money every year, not about making high-performance compute platforms for smartphone users. Real HPC in 2023 is datacenter/enterprise, and that is Xeon Max now.
AVS can start as many threads as the user requests with Prefetch(), and they will be displayed in Task Manager as busy CPU cores. But each thread is also a chain of filters, and each filter is surrounded by a large software cache located in host RAM. So busy CPU cores typically spin in idle loops waiting for RAM. This is not an idle state at the thread level, so the OS Task Manager does not show the core as idle. It can be found in the hardware performance counters, in VTune for example.
Last edited by DTL; 18th May 2023 at 11:51. |
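The difference between what a scheduler-based monitor reports and what the core actually computes can be sketched with a toy model. The class name, its fields and the sample numbers are hypothetical illustrations; real stall times would come from hardware performance counters, not from this code.

```python
from dataclasses import dataclass


@dataclass
class CoreInterval:
    """Toy accounting of one core over an observation window (all values in seconds)."""
    wall_s: float     # length of the observation window
    compute_s: float  # time spent doing useful arithmetic
    stall_s: float    # time the thread holds the core while waiting on RAM

    def reported_load(self) -> float:
        """Scheduler view: the thread occupies the core whether it computes or stalls."""
        return (self.compute_s + self.stall_s) / self.wall_s

    def useful_load(self) -> float:
        """Fraction of the window spent on actual computation."""
        return self.compute_s / self.wall_s


if __name__ == "__main__":
    # Hypothetical memory-bound filter chain: mostly waiting on RAM.
    core = CoreInterval(wall_s=1.0, compute_s=0.25, stall_s=0.75)
    print("reported:", core.reported_load(), "useful:", core.useful_load())
```

The point of the sketch is only that a core can look fully busy in a task monitor while most of that time is stall time.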
18th May 2023, 11:52 | #2547 | Link | |
Registered User
Join Date: Jan 2018
Posts: 2,168
|
Quote:
|
|
18th May 2023, 12:03 | #2548 | Link | |
Pig on the wing
Join Date: Mar 2002
Location: Finland
Posts: 5,803
|
Quote:
__________________
And if the band you're in starts playing different tunes I'll see you on the dark side of the Moon... |
|
18th May 2023, 12:12 | #2550 | Link | |
Registered User
Join Date: Jan 2018
Posts: 2,168
|
Quote:
https://forum.doom9.org/member.php?u=244197 |
|
18th May 2023, 15:25 | #2551 | Link | ||
21 years and counting...
Join Date: Oct 2002
Location: Germany
Posts: 716
|
Quote:
I have not encountered this under Win11 so far. Quite the contrary: other tools report CPU usage while Task Manager does not. That happens when starting scripts with heavy GPU load and memory usage. HWI reports quite high CPU usage while the Win11 TM doesn't.
Quote:
Well, if you guys think that was way too slow, I'm happy to test with different settings for comparison. Last edited by LeXXuz; 18th May 2023 at 15:35. |
||
18th May 2023, 16:19 | #2552 | Link |
Pig on the wing
Join Date: Mar 2002
Location: Finland
Posts: 5,803
|
I've tested different combinations and that produced the best performance over multiple kinds of sources with threads=32. I mostly use the same framework script for every source and just tune the parameters so the load on CPU is pretty much the same anyway. It's particularly important to test it by running the encoder with your normal settings since it will affect CPU scheduling greatly.
18th May 2023, 19:12 | #2553 | Link |
Registered User
Join Date: Jul 2018
Posts: 1,207
|
"Would you mind linking a source for that?"
You can run any software profiler and look at the timestamps around load-from-memory operations (in the disassembly view), for example with the freeware AMD uProf. Loading from memory is not a compute operation: the CPU core simply stalls because it has no data to continue computing. But the thread is still taking CPU time, so Task Manager will show it as core load (it does not show the real computing load, only the % of time the thread occupies the core). A thread waiting for data cannot free the CPU core's resources, because thread switching is a very costly operation. The only exception is 'hyperthreading' mode, where 2 threads share the resources of 1 core: a thread stalled waiting for memory may be switched to the other thread if that one has data ready to compute. So 'hyperthreading' sometimes really does give some total performance benefit. The more advanced Intel VTune will also show you lots of performance counters for cache misses and so on. VTune can also give some hints about how memory-bound the application is (after analysing the memory performance counters).
An indirect way to estimate the real compute load is to look at CPU power usage: any real computing involves switching, and in CMOS that takes power. Parts stalled in idle do not draw power and do not produce heat in CMOS, so the CPU puts parts into an idle zero-power state if there is nothing to compute. So the more power the CPU draws and the more heat it produces, the more real useful computing is being performed, or at least some more or less useful shuffling of data between the many caches (L1, L2, (L3, L4)). AVX2 and AVX512 compute units running under good load draw lots of power and produce lots of heat (so the CPU clock is typically throttled to prevent overheating). CPU temperature can be used to estimate how well software is optimized for computing on the current chip: with fixed cooling, the CPU temperature rises as the software does more compute switching in a given time. That does not mean it is useful computing, but at least it is not stalling while waiting for something.
"I have not encountered this under Win11 so far"
I have not seen Win11 yet; maybe Microsoft redesigned the performance measurement tools in that version, provided all supported CPU vendors offer the required hardware performance counters. In the old days the OS could only measure thread time on a core (the OS knows the start and end time of a thread on a core and can measure 'physical time' using the common RDTSC instruction).
Last edited by DTL; 18th May 2023 at 19:41. |
18th May 2023, 19:52 | #2554 | Link | |
21 years and counting...
Join Date: Oct 2002
Location: Germany
Posts: 716
|
Quote:
And between Prefetch 2,4,8,16, and 18...32 there's a quite good linear scale from 2 to 16, with almost double the prefetch = double the performance. After that the gradient drops with increasing prefetch but performance still increases and tops out with ~32. So I see no reason not to use 32 threads on a 16C/32T CPU. I'll post a speed test between 'untouched', UHDHalf=true and UHDHalf=false with the 7950x soon for another comparison. Last edited by LeXXuz; 18th May 2023 at 19:55. |
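One way to put numbers on "linear up to 16, flattening towards 32" is to compute the parallel efficiency of each Prefetch setting relative to the smallest one tested. This is a minimal sketch; the function name is hypothetical and the fps values in the usage part are made-up placeholders, not the results from these tests.

```python
from typing import Dict


def scaling_efficiency(fps_by_threads: Dict[int, float]) -> Dict[int, float]:
    """Efficiency of each thread count relative to the smallest one measured.

    1.0 means perfectly linear scaling; lower values mean diminishing returns.
    """
    base_threads = min(fps_by_threads)
    base_fps = fps_by_threads[base_threads]
    result = {}
    for threads in sorted(fps_by_threads):
        speedup = fps_by_threads[threads] / base_fps  # measured gain
        ideal = threads / base_threads                # gain if scaling were linear
        result[threads] = speedup / ideal
    return result


if __name__ == "__main__":
    # Placeholder fps values for illustration only; substitute your own measurements.
    measured = {2: 10.0, 4: 20.0, 8: 30.0}
    for threads, eff in scaling_efficiency(measured).items():
        print(f"Prefetch({threads}): efficiency {eff:.2f}")
```

As long as the efficiency of the highest setting still yields more total fps than the previous one, using all logical threads remains worthwhile even when efficiency drops.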
|
18th May 2023, 20:12 | #2555 | Link |
Registered User
Join Date: Nov 2009
Posts: 2,367
|
@LeXXuz: I forgot to ask, are you using DGSource() or a CPU-bound loader? It makes a substantial difference.
I think Xeon Max is enterprise tier. I'm looking forward to Diamond Rapids w3 or w5, whatever delivers 8c/16t at least. And the thing with AVX512 is that not all versions are created equal: Intel now says they are going to bring it to the mainstream line, but does that include AVX512-VNNI?
18th May 2023, 21:50 | #2556 | Link |
Registered User
Join Date: Jul 2018
Posts: 1,207
|
I hope Xeon Max (already renamed within a few months from 'Sapphire Rapids HBM') is only one of many new family names in the future. Maybe the end-user chip will be named i7-15xxx or higher numbers. Or i9-15xxx, i11-xxxx and so on.
" 8c/16t at least."
Hyperthreading is typically good only if it is a close to free addition, and only after all other hardware resources are maxed out (the maximum possible number of RAM channels, the maximum possible number of cache ways and so on). If a same-priced chip has fewer cores without HT but offers more DDR RAM channels, it may be faster with well-optimized software. Software that is not very well optimized may also benefit somewhat from more cache ways (Xeons may have 12..24 ways and low-end chips 8..12 ways at L2/L3), if the processed datasets fit in the L2/L3 caches.
"AVX512-VNNI?"
The only real benefit from NN that I know of is perhaps deinterlacing, to fight the specific aliasing of separated fields. And interlaced content is becoming a thing of the past now. I do not know of any other processing with a real benefit from NN on CPU. Maybe, if civilization does not die too quickly, we will see 4:2:0 to 4:4:4 decoding of the same high quality as deinterlacing, but it may be visible to die-hard perfectionists only.
The general benefit of AVX512 is a register file 4x larger than AVX2's, 2x wider dispatch ports and lots of new faster instructions for simple 8/16-bit integer operations, which typically exist in the older F/BW/DQ/VL/VBMI subsets. The 202x years were expected to be the season of most AVX512 subsets in all CPU chips on the market, but something is not going as nicely as expected. Software is also still very poorly optimized even for AVX2, and the number of programmers who understand SIMD looks to be shrinking fast too. So even if we had AVX512 everywhere, there is already close to nobody who knows how to program it. With the 512-byte AVX2 register file it is already not very easy to keep the data placement in mind, and AVX512's is 4x larger. Only really great (typically young) brains can handcraft nice, fast software for a 2048-byte data array with complex computing (not a simple 2x expansion of old, poorly designed AVX2 software).
Or, with real humans degrading and dying out, we need some NN-robots to design nice AVX512 programs, which maybe do not exist yet.
Last edited by DTL; 18th May 2023 at 22:12. |
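The 512-byte and 2048-byte register file sizes mentioned above follow from the architectural register counts in 64-bit mode: 16 YMM registers of 256 bits for AVX2 and 32 ZMM registers of 512 bits for AVX512. A short worked calculation (the function name is only for illustration, and the AVX512 mask registers are left out):

```python
def register_file_bytes(num_registers: int, register_bits: int) -> int:
    """Total architectural vector register storage in bytes."""
    return num_registers * register_bits // 8


# x86-64 mode: AVX2 exposes 16 YMM registers (256-bit each).
AVX2_BYTES = register_file_bytes(16, 256)    # 16 * 32 bytes = 512

# AVX512 exposes 32 ZMM registers (512-bit each); mask registers k0-k7 not counted.
AVX512_BYTES = register_file_bytes(32, 512)  # 32 * 64 bytes = 2048

if __name__ == "__main__":
    print(AVX2_BYTES, AVX512_BYTES, AVX512_BYTES // AVX2_BYTES)
```

The 4x factor comes from doubling both the register count and the register width.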
19th May 2023, 00:23 | #2557 | Link |
Registered User
Join Date: Nov 2009
Posts: 2,367
|
I think we need general NN support at CPU level as well since as we know the future is AI based and frees us up from those time-consuming/low-reward tasks.
My only real concern with future hardware is whether they are going back to the efficiency path, as lately everything consumes and heats up way beyond what I consider stable.
By the way, I updated TransformsPack with a new gamut compression function ported from Jed Smith:
Code:
ConvertFormat(cs_in="ACEScg",cs_out="709",EOTFi="",GC=true)
TM_Hable(mode="Dark",filmic=false)
CCTF("1886",false, tv_out=false)
ConvertBits(8,dither=1)
[Images: 709 - 1886 | 709 - 1886 (+Gamut Compression)]
Since probably all these samples are scene referred (radiance based) they look bland, lacking the filmic punch. You can try setting 'filmic' to true for either Dark or Bright 'mode'. But I prefer the following.
Here's an example using LMT_DCP() (DCP Tone Curve OOTF):
Code:
TM_Hable(mode="Dark",filmic=false)
LMT_DCP()
CCTF("1886",false,tv_out=false)
And here using LMT_EMoR() instead, a typical Camera Response Function fit. It's an OOTF+EOTFi, so we suppress the 1886 inverse EOTF:
Code:
TM_Hable(mode="Dark",filmic=false)
LMTi_EMoR(tv_range=false)
EDIT: Same as above with LMT_EMoR() but with updated GamutCompression with non-linear integration of negative values.
[Image: 709 - EMoR OOTF/EOTFi (+Gamut Compression Updated)]
Last edited by Dogway; 24th May 2023 at 05:10. |
19th May 2023, 01:04 | #2558 | Link | |
21 years and counting...
Join Date: Oct 2002
Location: Germany
Posts: 716
|
Quote:
Otherwise it wouldn't be a fair comparison between both systems. The GT 710 is fine and fast enough for frameserving MPEG-2 and AVC. And the Intel ARC is great for OpenCL filters but doesn't support CUDA of course. I usually don't do UHD content, for various reasons. This one was just for testing out of curiosity.
Anyhow, here is the same content encoded with a Ryzen 7950x system:
Again, top-left to bottom-right:
tl: x265 medium preset CRF20, UHDHalf=true, Prefetch(16)
tm: x265 medium preset CRF20, UHDHalf=false, Prefetch(16)
tr: x265 medium preset CRF20, no filtering in Avisynth
bl: x265 medium preset CRF20, UHDHalf=true, Prefetch(32)
bm: x265 medium preset CRF20, UHDHalf=false, Prefetch(32)
br: x265 medium preset CRF20 AVX512, no filtering in Avisynth
UHDhalf=false benefits a little from Prefetch 32 over 16, while UHDhalf=true is about the same. I already noticed this with more demanding scripts for 1080p too.
Lastly, a comparison between default AVX2 and allowing AVX512. The difference is negligible, as I expected. But mileage may vary with different content. Anyhow, I'm fine with these numbers.
Last edited by LeXXuz; 19th May 2023 at 01:06. |
|
19th May 2023, 10:11 | #2559 | Link |
Registered User
Join Date: Nov 2009
Posts: 2,367
|
Thanks for the tests. The 7950x is a monster CPU, so I think the numbers are fine. UHDHalf=true gives almost a 63% performance increase, which I think is fine given that people would resort to more convoluted scripts and filtering.
I don't know which filters make use of AVX512 in AviSynth, so I can't tell. My interest in the instruction set is more for 3D DCC and games (if I ever happen to find time for that).
19th May 2023, 16:33 | #2560 | Link |
Registered User
Join Date: Dec 2005
Location: Sweden
Posts: 716
|
My UHD footage is very sharp/finely detailed, I guess, and when using temporalsoften mode with uhdhalf=true it is considerably blurrier compared to uhdhalf=false. The difference is much larger than in LeXXuz's comparison examples. Is there a way to regain the detail/sharpness through pel or something, and yet keep some of the speed gain from mscalevector?
|
Tags |
avisynth, dogway, filters, hbd, packs |