Vapoursynth CPU saturation [Archive]

asarian

10th January 2022, 08:02

(moved out of main VS thread)

Is R57 broken somehow? At first I blamed Windows 11's new terminal for the very slow throughput (slow pipe transfer?), but no: even using ffmpeg with -vapoursynth, the process is extremely slow, with CPU usage at only about 25%. Both QTGMC and MCTemporalDenoise seem to grind to a near halt, all on my new i9 12900K. This used to be blisteringly fast, even on my old 6700K.

Here's what I do (see below). It's almost as if multi-threading is broken for these two functions (it isn't, but it appears to work exceedingly inefficiently). This is 4K material, btw.

import vapoursynth as vs
import havsfunc as haf

core = vs.core
core.max_cache_size = 65535  # frame cache limit, in megabytes

# Load the DGDecNV index, cropping 44 lines from the top and bottom
vid = core.dgdecodenv.DGSource(r'c:\jobs\am.dgi', ct=44, cb=44, cl=0, cr=0)

# InputType=1: progressive input, so QTGMC is used for cleanup rather than deinterlacing
vid = haf.QTGMC(vid, InputType=1, Preset="Very Slow", TR2=3, EdiQual=2, EZDenoise=0.5, NoisePreset="Slower", TFF=True, Denoiser="KNLMeansCL")
vid = haf.MCTemporalDenoise(vid, settings="very low", stabilize=True)
vid = core.neo_f3kdb.Deband(vid, preset="veryhigh", dither_algo=2)

# Put the 44 cropped lines back as borders
vid = core.std.AddBorders(clip=vid, left=0, right=0, top=44, bottom=44)

vid.set_output()

The E-cores are hardly used (some are even marked as 'parked'). But even the P-cores hardly see any action. See:

CPU saturation (https://imgur.com/EzfKrtZ)

N.B. I had the same issue on my previous i7 11700K, btw.

P.S. Does it matter that my plugins folder has 148 plugins in it? (All are recent 64-bit VapourSynth plugins that someone posted here.)

Myrsloik

10th January 2022, 12:32

Run vspipe with the --filter-time option and post the output here. You don't need to run the whole script, just a bit.

Autoloading every dll on the planet is a bad idea. I don't approve of this.

ChaosKing

10th January 2022, 13:44

Autoloading every dll on the planet is a bad idea. I don't approve of this.

I only have 220 :p
It should only slow down the initial load, right? (my NVMe is fast!!!111)

asarian

10th January 2022, 20:21

Run vspipe with the --filter-time option and post the output here. You don't need to run the whole script, just a bit.

Thanks for your useful reply. :thanks:

Didn't even know you could use this timing function. Here are the results (of 'VSPipe --filter-time -c y4m "f:\jobs\test.vpy" NUL'). Still dismal at 1.49 fps.

Output 7215 frames in 4836.29 seconds (1.49 fps)
Filtername Filter mode Time (%) Time (s)
DFTTest parallel 236.04 11415.38
Degrain3 parallel 153.59 7428.12
Analyse parallel 140.13 6776.88
Analyse parallel 139.92 6766.79
Analyse parallel 135.95 6574.87
Analyse parallel 134.19 6489.61
Analyse parallel 133.42 6452.38
Analyse parallel 131.75 6371.89
Degrain1 parallel 76.81 3714.79
Degrain1 parallel 76.38 3694.15
Analyse parallel 59.43 2874.38
Analyse parallel 59.07 2856.85
Degrain1 parallel 42.95 2077.16
Super parallel 42.45 2053.03
Compensate parallel 37.70 1823.22
Compensate parallel 37.16 1797.00
KNLMeansCL parreq 36.76 1777.75
TemporalSoften2 parallel 34.00 1644.15
Super parallel 31.56 1526.36
Super parallel 30.51 1475.59
Super parallel 25.62 1239.13
Compensate parallel 24.76 1197.31
Compensate parallel 24.47 1183.26
Compensate parallel 24.47 1183.21
Compensate parallel 24.18 1169.34
TemporalSoften2 parallel 22.21 1073.95
resample parallel 18.60 899.78
TTempSmooth parallel 18.35 887.22
resample parallel 17.68 854.87
Deband parallel 15.32 740.92
Super parallel 12.12 585.95
Expr parallel 9.82 474.90
Expr parallel 8.71 421.06
Expr parallel 8.30 401.20
Expr parallel 7.56 365.69
Expr parallel 7.47 361.42
Point parallel 7.30 352.99
Expr parallel 7.16 346.42
Expr parallel 7.13 344.99
Expr parallel 7.06 341.45
Expr parallel 7.01 339.19
Expr parallel 6.79 328.22
Expr parallel 6.78 327.71
Merge parallel 6.73 325.45
Expr parallel 6.73 325.34
Expr parallel 6.69 323.71
Merge parallel 6.36 307.75
MakeDiff parallel 6.29 303.96
Merge parallel 6.26 302.53
Merge parallel 6.23 301.09
MergeDiff parallel 6.21 300.25
Merge parallel 6.19 299.49
Convolution parallel 6.18 298.65
MergeDiff parallel 6.10 295.05
MakeDiff parallel 6.02 291.36
Convolution parallel 5.97 288.95
MaskedMerge parallel 5.85 282.99
Expr parallel 5.84 282.41
Inflate parallel 5.70 275.67
Deflate parallel 5.61 271.28
Merge parallel 5.51 266.49
Inflate parallel 5.36 259.19
Deflate parallel 5.28 255.32
Expr parallel 5.25 254.09
DGSource unordered 5.18 250.73
Expr parallel 4.98 241.04
Minimum parallel 4.63 224.13
Maximum parallel 4.63 223.72
Minimum parallel 4.60 222.57
Minimum parallel 4.59 221.93
Maximum parallel 4.56 220.74
Maximum parallel 4.56 220.34
Minimum parallel 4.55 220.16
Minimum parallel 4.51 217.99
Minimum parallel 4.51 217.94
Super parallel 4.49 217.20
Maximum parallel 4.46 215.60
Minimum parallel 4.41 213.16
Maximum parallel 4.37 211.56
Maximum parallel 4.32 208.95
Maximum parallel 4.29 207.35
Minimum parallel 4.27 206.58
Maximum parallel 4.22 204.18
MakeDiff parallel 4.20 203.26
MakeDiff parallel 4.19 202.86
Minimum parallel 4.16 201.38
Maximum parallel 4.13 199.86
Crop parallel 4.13 199.63
MakeDiff parallel 4.06 196.45
Median parallel 3.98 192.53
MakeDiff parallel 3.96 191.64
Convolution parallel 3.95 191.13
Inflate parallel 3.95 190.91
Convolution parallel 3.94 190.69
MakeDiff parallel 3.93 190.28
Convolution parallel 3.93 189.94
bitdepth parallel 3.89 188.22
Expr parallel 3.86 186.72
bitdepth parallel 3.85 186.14
Convolution parallel 3.68 177.99
Convolution parallel 3.53 170.95
Convolution parallel 3.50 169.11
Lut parallel 3.46 167.12
AddBorders parallel 3.32 160.49
PlaneStats parallel 2.84 137.48
Interleave parallel 0.04 1.89
SetFieldBased parallel 0.01 0.44
SCDetect parallel 0.00 0.22
SelectEvery parallel 0.00 0.13
ShufflePlanes parallel 0.00 0.04
Trim parallel 0.00 0.01
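
A listing this long is easier to read when the rows are grouped by filter name. The sketch below is only an illustration and not part of vspipe: it assumes each data row has exactly four whitespace-separated fields (name, mode, percent, seconds), as in the output above, and skips everything else. The function name summarize_filter_times is made up for this example. Note that the Time (%) column appears to be Time (s) divided by the 4836.29 s wall-clock time (11415.38 / 4836.29 is about 2.36), which is why values above 100 show up for filters running on several threads at once.

```python
from collections import defaultdict


def summarize_filter_times(report):
    """Group --filter-time rows by filter name.

    Returns a list of (name, instance_count, total_seconds), slowest first.
    Assumes data rows have four fields: name, mode, percent, seconds.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for line in report.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # header, summary line, or blank line
        name, _mode, _percent, seconds = parts
        try:
            secs = float(seconds)
        except ValueError:
            continue  # not a data row after all
        totals[name][0] += 1
        totals[name][1] += secs
    rows = [(name, count, secs) for name, (count, secs) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# A few rows copied from the output above
sample = """\
Filtername Filter mode Time (%) Time (s)
DFTTest parallel 236.04 11415.38
Analyse parallel 140.13 6776.88
Analyse parallel 139.92 6766.79
KNLMeansCL parreq 36.76 1777.75
"""

rows = summarize_filter_times(sample)
for name, count, secs in rows:
    print(f"{name:12s} x{count:<3d} {secs:10.2f} s")
```

Grouped this way, the eight Analyse instances in the full listing would be counted together, which makes it clearer that motion estimation, rather than any single filter, accounts for most of the accumulated time.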

Autoloading every dll on the planet is a bad idea. I don't approve of this.

Nor do I. :) But they're all in the plugins dir (and I think that means autoload), and it's extremely hard to figure out the DLL dependencies for QTGMC and MCTemporalDenoise, as the dependencies listed for each tend to be older than the mega VapourSynth filter bundle compiled by someone on this site. Maybe I can use Process Monitor or something similar to determine which DLLs are actually in use.
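
One rough way to narrow down the dependencies without a process monitor is to scan the script source for calls of the form core.<namespace>.<function>(...). The sketch below is a plain text search, not a VapourSynth feature, and the helper name plugin_calls is made up for this example. It only finds calls written literally in that form, so it will miss anything reached through an alias or another module, and a namespace is not the same thing as a DLL filename.

```python
import re

# Matches e.g. "core.mv.Analyse(" and captures ("mv", "Analyse").
# Whitespace before the parenthesis is allowed, as in the script above.
_CALL = re.compile(r"\bcore\.([A-Za-z_]\w*)\.([A-Za-z_]\w*)\s*\(")


def plugin_calls(source):
    """Map each plugin namespace to the sorted function names called on it."""
    found = {}
    for namespace, function in _CALL.findall(source):
        found.setdefault(namespace, set()).add(function)
    return {ns: sorted(funcs) for ns, funcs in sorted(found.items())}


# Abbreviated lines modelled on the script in the opening post
sample = (
    "vid = core.dgdecodenv.DGSource (src)\n"
    "vid = core.neo_f3kdb.Deband (vid, preset='veryhigh')\n"
    "vid = core.std.AddBorders (clip=vid, top=44, bottom=44)\n"
)

calls = plugin_calls(sample)
for namespace, functions in calls.items():
    print(namespace, functions)
```

Running the same function over the text of havsfunc.py (for instance, read from the path in havsfunc.__file__) should give a first list of namespaces to check against the plugins folder, though it will include namespaces used by functions other than QTGMC and MCTemporalDenoise.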

Myrsloik

12th January 2022, 12:04

The first thing I'd try is removing all GPU filters like KNLMeansCL to see whether those are bottlenecking things. Their resource usage doesn't show up in the filter times.

asarian

15th January 2022, 16:34

Just bought a new RTX 3080 Ti. Obviously, the CPU is taxed more heavily when KNLMeansCL is not used. What is surprising, however, is that without KNLMeansCL the process almost seems to go faster (I will need a longer test to be certain). KNLMeansCL only accounts for about 25-57% GPU load, though, so I'm not sure what the bottleneck really is.

It may also simply be a memory issue. I also bought G.Skill memory nearly twice as fast as what I had, and that makes the entire process run nearly twice as fast too (currently 4266 MHz).

Myrsloik

16th January 2022, 13:23

Are you processing very high resolutions? That's usually when memory bandwidth becomes a limiting factor.

For example, a Threadripper 1950X can be considerably faster than a shiny new 5950X due to the extra memory channels.
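
To give a feel for the amount of data involved, here is a back-of-the-envelope calculation of the size of one frame. It is only a sketch under assumptions: a 3840x2160 frame, 4:2:0 chroma subsampling, and high-bit-depth samples stored in 2 bytes each. The function name frame_bytes is made up for this example.

```python
def frame_bytes(width, height, bytes_per_sample=2, sub_w=2, sub_h=2):
    """Size in bytes of one planar YUV frame.

    sub_w and sub_h are the horizontal and vertical chroma subsampling
    divisors (2 and 2 for 4:2:0). Assumes 2 bytes per sample by default.
    """
    luma = width * height
    chroma = 2 * (width // sub_w) * (height // sub_h)  # U and V planes
    return (luma + chroma) * bytes_per_sample


uhd = frame_bytes(3840, 2160)
hd = frame_bytes(1920, 1080)

print(f"4K frame:    {uhd:,} bytes ({uhd / 2**20:.1f} MiB)")
print(f"1080p frame: {hd:,} bytes ({hd / 2**20:.1f} MiB)")
print(f"ratio:       {uhd / hd:.0f}x")
```

Under these assumptions a single 4K frame is roughly 24.9 MB, four times the size of a 1080p frame, and every filter in a long chain has to read and write at least one such frame per output frame.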

asarian

17th January 2022, 02:35

Are you processing very high resolutions? That's usually when memory bandwidth becomes a limiting factor.

Just regular 10-bit 4K material. Not super-high.

Weird thing is, though, that when I split the 4K job into 4 parts (each 1080p plus a reasonable overscan), I get full saturation on all cores.** You'd think QTGMC and the like, repeatedly working on a full 4K frame, just can't produce enough throughput for x265, right? But that would sooner make the CPU work in overdrive, rather than sit lazily at 40% or less.

** Sometimes even with overscan, grainy sources still won't be seamless afterwards.
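
For reference, the geometry of such a split can be sketched as follows. This is not the exact procedure used above, just an illustration: the function name quadrant_crops, the 3840x2160 frame size, and the 32-pixel overscan are all assumptions. Each quadrant is widened by the overscan on its inner edges and clamped at the frame border, so that the overlap can be cropped away again before the parts are stitched back together.

```python
def quadrant_crops(width, height, overscan):
    """Return four (left, top, crop_width, crop_height) boxes.

    The boxes cover the four quadrants of a width x height frame, each
    extended by `overscan` pixels and clamped to the frame edges.
    Order: top-left, top-right, bottom-left, bottom-right.
    """
    half_w, half_h = width // 2, height // 2
    boxes = []
    for row in range(2):
        for col in range(2):
            left = max(col * half_w - overscan, 0)
            top = max(row * half_h - overscan, 0)
            right = min((col + 1) * half_w + overscan, width)
            bottom = min((row + 1) * half_h + overscan, height)
            boxes.append((left, top, right - left, bottom - top))
    return boxes


boxes = quadrant_crops(3840, 2160, overscan=32)
for box in boxes:
    print(box)
```

With 4:2:0 material the offsets and sizes need to stay even, which holds here because both the half dimensions and the overscan are even. Each box could then be handed to a crop filter such as std.CropAbs in four separate scripts.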

asarian

26th January 2022, 04:38

Well, the matter is resolved. :) Looks like it was the E-cores after all. I found a very useful option in the BIOS that lets me disable the E-cores by pressing ScrollLock while in Windows 11 (it doesn't actually disable them, it just marks them all as 'parked'). Now I get blisteringly fast, sustained 100% CPU saturation on all P-cores again.

Even though the E-cores had appeared to be hardly used at all until then, they were nonetheless the source of the (significant) hold-up.