Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion.

Doom9's Forum > Capturing and Editing Video > VapourSynth
10th January 2022, 08:02   #1
asarian
Registered User
Join Date: May 2005
Posts: 1,462
VapourSynth CPU saturation

(moved out of main VS thread)

Is R57 broken somehow? At first I blamed Windows 11's new terminal for the very slow throughput (slow pipe transfer?), but no: even using ffmpeg with -f vapoursynth, the process is extremely slow, with the CPU at only about 25%. Both QTGMC and MCTemporalDenoise seem to grind to a near halt, all on my new i9 12900K. This used to go blisteringly fast, even on my old 6700K.

Here's what I do (see below). It's almost as if multi-threading is broken for these two functions (it isn't, but it appears to work exceedingly inefficiently). This is 4K material, btw.

Code:
import vapoursynth as vs
import havsfunc as haf

core = vs.core
core.max_cache_size = 65535  # frame cache limit in MB (about 64 GB)

# Decode with DGDecNV, cropping 44 px from the top and bottom
vid = core.dgdecodenv.DGSource(r'c:\jobs\am.dgi', ct=44, cb=44, cl=0, cr=0)

# QTGMC on progressive input (InputType=1), with KNLMeansCL (GPU) as the denoiser
vid = haf.QTGMC(vid, InputType=1, Preset="Very Slow", TR2=3, EdiQual=2, EZDenoise=0.5, NoisePreset="Slower", TFF=True, Denoiser="KNLMeansCL")
vid = haf.MCTemporalDenoise(vid, settings="very low", stabilize=True)
vid = core.neo_f3kdb.Deband(vid, preset="veryhigh", dither_algo=2)
# Pad back to the original height with 44 px borders top and bottom
vid = core.std.AddBorders(clip=vid, left=0, right=0, top=44, bottom=44)

vid.set_output()

E-cores are hardly used (some are even marked as 'parked'). But even the P-cores hardly see any action. See:

[Screenshot: CPU saturation]

N.B. I had the same issue on my previous i7 11700K, btw.

P.S. Does it matter that my plugins folder has 148 plugins in it? (They're all recent 64-bit VapourSynth plugins that someone posted here.)
__________________
Gorgeous, delicious, deculture!
10th January 2022, 12:32   #2
Myrsloik
Professional Code Monkey
Join Date: Jun 2003
Location: Kinnarps Chair
Posts: 2,548
Quote:
Originally Posted by asarian
Is R57 broken somehow? [...]

P.S. Does it matter that my plugins folder has 148 plugins in it? [...]
Run vspipe with the --filter-time option and post the output here. You don't need to run the whole script, just a bit of it.
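For reference, a profiling run over only part of the clip could be scripted roughly like this. It is a sketch, not something posted in the thread: it assumes `vspipe` is on the PATH and uses vspipe's `--start`/`--end` frame-range options so that only the first 500 frames are rendered. The frame range is arbitrary, and the script path is the one used later in the thread.

```python
import os
import subprocess


def filter_time_cmd(script, start=0, end=499, outfile=os.devnull):
    """Build a vspipe command that profiles frames start..end (inclusive)."""
    return [
        "vspipe",
        "--filter-time",         # print per-filter timings when done
        "--start", str(start),   # first frame to render
        "--end", str(end),       # last frame to render (inclusive)
        script,
        outfile,                 # os.devnull is "nul" on Windows (NUL in the thread)
    ]


def run_profile(script, **kwargs):
    """Run the profiling command, capturing both output streams as text."""
    cmd = filter_time_cmd(script, **kwargs)
    return subprocess.run(cmd, capture_output=True, text=True, check=True)


if __name__ == "__main__":
    print(" ".join(filter_time_cmd(r"f:\jobs\test.vpy")))
```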

Autoloading every dll on the planet is a bad idea. I don't approve of this.
__________________
VapourSynth - proving that scripting languages and video processing isn't dead yet

Last edited by Myrsloik; 10th January 2022 at 12:38.
10th January 2022, 13:44   #3
ChaosKing
Registered User
Join Date: Dec 2005
Location: Germany
Posts: 1,795
Quote:
Originally Posted by Myrsloik
Autoloading every dll on the planet is a bad idea. I don't approve of this.
I only have 220.
It should only slow down the initial load, right? (My NVMe is fast!!!111)
__________________
AVSRepoGUI // VSRepoGUI - Package Manager for AviSynth // VapourSynth
VapourSynth Portable FATPACK || VapourSynth Database

Last edited by ChaosKing; 10th January 2022 at 13:47.
10th January 2022, 20:21   #4
asarian
Registered User
Join Date: May 2005
Posts: 1,462
Quote:
Originally Posted by Myrsloik
Run vspipe with the --filter-time option and post the output here. You don't need to run the whole script, just a bit of it.
Thanks for your useful reply.

I didn't even know you could use this timing function. Here are the results (of 'VSPipe --filter-time -c y4m "f:\jobs\test.vpy" NUL'). Still a dismal 1.49 fps.

Code:
Output 7215 frames in 4836.29 seconds (1.49 fps)
Filtername           Filter mode   Time (%)   Time (s)
DFTTest              parallel       236.04   11415.38
Degrain3             parallel       153.59    7428.12
Analyse              parallel       140.13    6776.88
Analyse              parallel       139.92    6766.79
Analyse              parallel       135.95    6574.87
Analyse              parallel       134.19    6489.61
Analyse              parallel       133.42    6452.38
Analyse              parallel       131.75    6371.89
Degrain1             parallel        76.81    3714.79
Degrain1             parallel        76.38    3694.15
Analyse              parallel        59.43    2874.38
Analyse              parallel        59.07    2856.85
Degrain1             parallel        42.95    2077.16
Super                parallel        42.45    2053.03
Compensate           parallel        37.70    1823.22
Compensate           parallel        37.16    1797.00
KNLMeansCL           parreq          36.76    1777.75
TemporalSoften2      parallel        34.00    1644.15
Super                parallel        31.56    1526.36
Super                parallel        30.51    1475.59
Super                parallel        25.62    1239.13
Compensate           parallel        24.76    1197.31
Compensate           parallel        24.47    1183.26
Compensate           parallel        24.47    1183.21
Compensate           parallel        24.18    1169.34
TemporalSoften2      parallel        22.21    1073.95
resample             parallel        18.60     899.78
TTempSmooth          parallel        18.35     887.22
resample             parallel        17.68     854.87
Deband               parallel        15.32     740.92
Super                parallel        12.12     585.95
Expr                 parallel         9.82     474.90
Expr                 parallel         8.71     421.06
Expr                 parallel         8.30     401.20
Expr                 parallel         7.56     365.69
Expr                 parallel         7.47     361.42
Point                parallel         7.30     352.99
Expr                 parallel         7.16     346.42
Expr                 parallel         7.13     344.99
Expr                 parallel         7.06     341.45
Expr                 parallel         7.01     339.19
Expr                 parallel         6.79     328.22
Expr                 parallel         6.78     327.71
Merge                parallel         6.73     325.45
Expr                 parallel         6.73     325.34
Expr                 parallel         6.69     323.71
Merge                parallel         6.36     307.75
MakeDiff             parallel         6.29     303.96
Merge                parallel         6.26     302.53
Merge                parallel         6.23     301.09
MergeDiff            parallel         6.21     300.25
Merge                parallel         6.19     299.49
Convolution          parallel         6.18     298.65
MergeDiff            parallel         6.10     295.05
MakeDiff             parallel         6.02     291.36
Convolution          parallel         5.97     288.95
MaskedMerge          parallel         5.85     282.99
Expr                 parallel         5.84     282.41
Inflate              parallel         5.70     275.67
Deflate              parallel         5.61     271.28
Merge                parallel         5.51     266.49
Inflate              parallel         5.36     259.19
Deflate              parallel         5.28     255.32
Expr                 parallel         5.25     254.09
DGSource             unordered        5.18     250.73
Expr                 parallel         4.98     241.04
Minimum              parallel         4.63     224.13
Maximum              parallel         4.63     223.72
Minimum              parallel         4.60     222.57
Minimum              parallel         4.59     221.93
Maximum              parallel         4.56     220.74
Maximum              parallel         4.56     220.34
Minimum              parallel         4.55     220.16
Minimum              parallel         4.51     217.99
Minimum              parallel         4.51     217.94
Super                parallel         4.49     217.20
Maximum              parallel         4.46     215.60
Minimum              parallel         4.41     213.16
Maximum              parallel         4.37     211.56
Maximum              parallel         4.32     208.95
Maximum              parallel         4.29     207.35
Minimum              parallel         4.27     206.58
Maximum              parallel         4.22     204.18
MakeDiff             parallel         4.20     203.26
MakeDiff             parallel         4.19     202.86
Minimum              parallel         4.16     201.38
Maximum              parallel         4.13     199.86
Crop                 parallel         4.13     199.63
MakeDiff             parallel         4.06     196.45
Median               parallel         3.98     192.53
MakeDiff             parallel         3.96     191.64
Convolution          parallel         3.95     191.13
Inflate              parallel         3.95     190.91
Convolution          parallel         3.94     190.69
MakeDiff             parallel         3.93     190.28
Convolution          parallel         3.93     189.94
bitdepth             parallel         3.89     188.22
Expr                 parallel         3.86     186.72
bitdepth             parallel         3.85     186.14
Convolution          parallel         3.68     177.99
Convolution          parallel         3.53     170.95
Convolution          parallel         3.50     169.11
Lut                  parallel         3.46     167.12
AddBorders           parallel         3.32     160.49
PlaneStats           parallel         2.84     137.48
Interleave           parallel         0.04       1.89
SetFieldBased        parallel         0.01       0.44
SCDetect             parallel         0.00       0.22
SelectEvery          parallel         0.00       0.13
ShufflePlanes        parallel         0.00       0.04
Trim                 parallel         0.00       0.01
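The per-instance rows above are easier to read when summed by filter name. The Time (%) column appears to be each filter's accumulated time relative to the 4836.29 s wall-clock time (11415.38 / 4836.29 ≈ 2.36, i.e. 236.04% for DFTTest), which is why multithreaded filters exceed 100%. Summed this way, the eight Analyse rows come to about 45,164 s (≈ 934%), roughly four times DFTTest. A small parser, sketched with only a few rows pasted in as sample input; the regular expressions assume exactly the column layout shown above. The "threads busy" line is a rough heuristic of my own: time a filter spends waiting (e.g. on the GPU) would be counted too.

```python
import re
from collections import defaultdict

# A few rows from the --filter-time output above; paste the full report here.
REPORT = """\
Output 7215 frames in 4836.29 seconds (1.49 fps)
Filtername           Filter mode   Time (%)   Time (s)
DFTTest              parallel       236.04   11415.38
Degrain3             parallel       153.59    7428.12
Analyse              parallel       140.13    6776.88
Analyse              parallel       139.92    6766.79
KNLMeansCL           parreq          36.76    1777.75
"""

SUMMARY = re.compile(r"Output (\d+) frames in ([\d.]+) seconds")
ROW = re.compile(r"^(\S+)\s+(\S+)\s+([\d.]+)\s+([\d.]+)$")


def summarize(text):
    """Return (wall_seconds, {filter_name: summed_seconds})."""
    wall, totals = None, defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if m := SUMMARY.match(line):
            wall = float(m.group(2))
        elif m := ROW.match(line):
            totals[m.group(1)] += float(m.group(4))  # Time (s) column
    return wall, dict(totals)


wall, totals = summarize(REPORT)
for name, secs in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    # Relative to wall-clock time, like the Time (%) column
    print(f"{name:12} {secs:10.2f} s {100 * secs / wall:8.2f} %")
# Rough heuristic: average number of threads inside filter code at once
print(f"~{sum(totals.values()) / wall:.1f} threads busy on average")
```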
Quote:
Autoloading every dll on the planet is a bad idea. I don't approve of this.
Nor do I. But they're all in the plugins directory (and I think that means they get autoloaded), and it's extremely hard to figure out the DLL dependencies of QTGMC and MCTemporalDenoise, as the dependencies listed for them tend to be older than the mega VapourSynth filter bundle compiled by someone on this site. Maybe I can use Process Monitor or something similar to determine which DLLs are actually in use.
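One way to narrow the list down without a process monitor is to scan the scripts themselves for `core.<namespace>.<Function>` calls. This is a rough heuristic, not an official tool: it only catches calls written as `core.xxx.Yyy` (not method-style calls such as `clip.mv.Super`), and when run over havsfunc.py it lists every namespace the module mentions, including ones used only by other functions, so it over-approximates. The built-in set (`std`, `resize`, `text`) is, to my knowledge, what the core provides without a plugin DLL. Once the namespaces are known, the matching DLLs could be moved to a separate folder and loaded explicitly with `core.std.LoadPlugin(path=...)` instead of autoloading everything.

```python
import re
import sys
from pathlib import Path

# "core.<namespace>.<Function>" as written in VapourSynth scripts
CORE_CALL = re.compile(r"\bcore\.([A-Za-z_]\w*)\.[A-Za-z_]\w*")

# Namespaces provided by VapourSynth itself (no plugin DLL needed)
BUILTIN = {"std", "resize", "text"}


def plugin_namespaces(source):
    """Return plugin namespaces referenced as core.<ns>.<func> in source."""
    return {m.group(1) for m in CORE_CALL.finditer(source)} - BUILTIN


if __name__ == "__main__":
    # Usage (hypothetical file name): python find_plugins.py test.vpy havsfunc.py
    found = set()
    for name in sys.argv[1:]:
        found |= plugin_namespaces(Path(name).read_text(encoding="utf-8", errors="replace"))
    print("\n".join(sorted(found)))
```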

Last edited by asarian; 10th January 2022 at 22:01. Reason: Updated data
12th January 2022, 12:04   #5
Myrsloik
Professional Code Monkey
Join Date: Jun 2003
Location: Kinnarps Chair
Posts: 2,548
The first thing I'd try is removing all GPU filters like KNLMeansCL to see whether they're bottlenecking things. Their resource usage doesn't show up in the filter times.
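A quick A/B comparison along these lines could be scripted. This sketch assumes the .vpy is edited to read a `denoiser` variable, which vspipe's `-a key=value` option sets in the script's globals, and to pass it to QTGMC's `Denoiser=` (with `"dfttest"` as a CPU alternative to KNLMeansCL). The 300-frame range is arbitrary, and the fps is read from vspipe's closing "Output N frames in S seconds (F fps)" line, as seen in the report in post #4.

```python
import os
import re
import subprocess

# Closing summary line that vspipe prints after rendering
SUMMARY = re.compile(r"Output (\d+) frames in ([\d.]+) seconds \(([\d.]+) fps\)")


def fps_from_log(text):
    """Extract the fps figure from vspipe's summary line, or None."""
    m = SUMMARY.search(text)
    return float(m.group(3)) if m else None


def ab_commands(script, variants, end=299):
    # Assumes the .vpy does something like:
    #   denoiser = globals().get("denoiser", "KNLMeansCL")
    #   vid = haf.QTGMC(vid, ..., Denoiser=denoiser)
    return {
        v: ["vspipe", "-a", f"denoiser={v}", "--end", str(end), script, os.devnull]
        for v in variants
    }


def run_ab(script, variants=("KNLMeansCL", "dfttest")):
    """Render the same frames once per denoiser and return {variant: fps}."""
    results = {}
    for variant, cmd in ab_commands(script, variants).items():
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[variant] = fps_from_log(proc.stdout + proc.stderr)
    return results


if __name__ == "__main__":
    print(fps_from_log("Output 7215 frames in 4836.29 seconds (1.49 fps)"))
```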
15th January 2022, 16:34   #6
asarian
Registered User
Join Date: May 2005
Posts: 1,462
Quote:
Originally Posted by Myrsloik
The first thing I'd try is removing all GPU filters like KNLMeansCL to see whether they're bottlenecking things. Their resource usage doesn't show up in the filter times.
I just bought a new RTX 3080 Ti. Obviously, the CPU is taxed more heavily when KNLMeansCL isn't used. What is surprising, however, is that without KNLMeansCL the process almost seems to go faster (I'll need a longer test to be certain). KNLMeansCL only puts about 25-57% load on the GPU, though, so I'm not sure what the bottleneck really is.

It may also simply be a memory issue. I also bought G.Skill memory nearly twice as fast as what I had, and that makes the entire process run nearly twice as fast too (currently 4266 MHz).
16th January 2022, 13:23   #7
Myrsloik
Professional Code Monkey
Join Date: Jun 2003
Location: Kinnarps Chair
Posts: 2,548
Quote:
Originally Posted by asarian
[...] It may also simply be a memory issue. I also bought G.Skill memory nearly twice as fast as what I had, and that makes the entire process run nearly twice as fast too [...]
Are you processing very high resolutions? That's usually when memory bandwidth becomes a limiting factor.

For example, a Threadripper 1950X can be considerably faster than a shiny new 5950X due to its extra memory channels.
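Some rough arithmetic shows why resolution matters here. VapourSynth stores 10-bit samples in 16-bit words, so assuming the 10-bit 4K material in this thread is 3840×2160 4:2:0 (the exact format isn't stated), every full frame is about 24.9 MB, four times a 1080p frame; with the 44 px cropped at the top and bottom the processed frame is 3840×2072, about 23.9 MB. Every filter stage in a chain like QTGMC's reads and writes frames of that size. The helper below just does that sum.

```python
def frame_bytes_420(width, height, bytes_per_sample=2):
    """Bytes in one planar 4:2:0 frame: full-size luma plus two
    half-width, half-height chroma planes. 2 bytes/sample covers
    VapourSynth's 9- to 16-bit integer formats, including 10-bit."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return (luma + chroma) * bytes_per_sample


if __name__ == "__main__":
    for label, (w, h) in {
        "UHD": (3840, 2160),
        "UHD cropped 44+44": (3840, 2072),
        "1080p": (1920, 1080),
    }.items():
        print(f"{label:18} {frame_bytes_420(w, h) / 1e6:6.1f} MB")
```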
17th January 2022, 02:35   #8
asarian
Registered User
Join Date: May 2005
Posts: 1,462
Quote:
Originally Posted by Myrsloik
Are you processing very high resolutions? That's usually when memory bandwidth becomes a limiting factor.

Just regular 10-bit 4K material. Not super-high.

The weird thing, though, is that when I split the 4K job into four parts (each 1080p plus a reasonable overscan), I get full saturation on all cores.** You'd think QTGMC and the like, repeatedly working on full 4K frames, just aren't fast enough to produce enough throughput for x265, right? But that would rather make the CPU work in overdrive, instead of sitting lazily at 40% or less.

** Sometimes even with overscan, grainy sources still won't be seamless afterwards.
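The split described above can be expressed as crop amounts for `core.std.Crop(left=..., right=..., top=..., bottom=...)`. A sketch, assuming a 2×2 grid on a 3840×2160 frame that divides evenly and a hypothetical 64 px overscan (even values keep 4:2:0 chroma aligned): `outer` is the crop for each tile before filtering, and `inner` is the crop that removes the overscan afterwards, so the four results could be reassembled with `std.StackHorizontal`/`std.StackVertical`. The tiles meet with a hard cut, which is exactly where the seams mentioned in the footnote can show up; blending the overlap would be a further step.

```python
def tile_crops(width, height, cols=2, rows=2, overscan=64):
    """Return crop settings for a cols x rows split with overscan.

    Assumes width and height divide evenly by cols and rows. Each entry has
    'outer' (crop applied to the full frame before filtering) and 'inner'
    (crop applied to the filtered tile to drop the overscan), both as dicts
    with left/right/top/bottom keys.
    """
    tw, th = width // cols, height // rows
    tiles = []
    for r in range(rows):
        for c in range(cols):
            # Tile region plus overscan, clamped to the frame edges
            x0 = max(c * tw - overscan, 0)
            y0 = max(r * th - overscan, 0)
            x1 = min((c + 1) * tw + overscan, width)
            y1 = min((r + 1) * th + overscan, height)
            outer = dict(left=x0, right=width - x1, top=y0, bottom=height - y1)
            inner = dict(left=c * tw - x0, right=x1 - (c + 1) * tw,
                         top=r * th - y0, bottom=y1 - (r + 1) * th)
            tiles.append({"row": r, "col": c, "outer": outer, "inner": inner})
    return tiles


if __name__ == "__main__":
    for t in tile_crops(3840, 2160):
        print(t["row"], t["col"], t["outer"], t["inner"])
```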
26th January 2022, 04:38   #9
asarian
Registered User
Join Date: May 2005
Posts: 1,462
Well, the matter is resolved. Looks like it was the E-cores, after all. I found a very useful BIOS option that lets you disable the E-cores by pressing Scroll Lock while in Windows 11 (it doesn't actually disable them, it just marks them all as 'parked'). Now I get a blisteringly fast, sustained 100% CPU saturation on all P-cores again.

Even though until now the E-cores appeared to be hardly used at all, they were nonetheless the source of the (significant) hold-up.
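A software alternative to parking the E-cores would be to restrict the encoding processes to the P-cores; this is untested in the thread. The sketch assumes the usual i9-12900K layout in which logical processors 0-15 are the eight hyper-threaded P-cores and 16-23 the E-cores (check Task Manager before relying on it). It builds the hex mask for cmd's `start /affinity`, and on Linux the same CPU list could be applied with `os.sched_setaffinity`. Setting VapourSynth's `core.num_threads` to 16 would additionally keep its thread pool at the P-core thread count. `encode.bat` is a hypothetical placeholder.

```python
import os


def p_core_cpus(p_cores=8, smt=True):
    """Logical CPU indices of the P-cores, assuming they are enumerated first."""
    return list(range(p_cores * (2 if smt else 1)))


def affinity_mask(cpus):
    """Bit mask with one bit set per logical CPU index."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def windows_start_cmd(command, cpus):
    # cmd.exe's "start /affinity" takes the mask as a hexadecimal number
    return f"start /affinity {affinity_mask(cpus):X} {command}"


def pin_current_process(cpus):
    """Restrict this process (and children it starts later) to cpus, Linux only."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cpus)


if __name__ == "__main__":
    # encode.bat is a hypothetical batch file running vspipe | x265
    print(windows_start_cmd("encode.bat", p_core_cpus()))
```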

Last edited by asarian; 26th January 2022 at 05:22.

Tags
speed qtgmc

All times are GMT +1.