7th August 2018, 12:10 | #4181 | Link |
Registered User
Join Date: Aug 2006
Location: Stockholm/Helsinki
Posts: 805
|
When going down in bit depth, dither=-1 means round, right? Or is it floor? Because I see a tendency of everything shifting ever so slightly "down".
If you do the following on a YV24 source (repeat a few times to magnify the effect)
Code:
convertbits(16)
converttorgb(matrix="rec709")
converttoyuv444(matrix="rec601")
convertbits(8)
convertbits(16)
converttorgb(matrix="rec601")
converttoyuv444(matrix="rec709")
convertbits(8)
edit: dither_bits:
- Has no effect if dither=-1 (off).
- Must be an even number from 2 to bits, inclusive.
- In addition, must be >= (clip.BitsPerComponent-8).
I don't understand why any of these restrictions exist, except maybe for implementation reasons. Not that I need this functionality, just wondering.
Last edited by ajp_anton; 7th August 2018 at 12:19. |
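The round-versus-floor question can be illustrated with a small Python sketch. It does not claim to reproduce what ConvertBits does internally: the 8-to-16-bit step is assumed to be a plain 8-bit shift, and the matrix conversions are replaced by a symmetric error of up to 100 16-bit steps. All function names are hypothetical.

```python
# Sketch: floor vs. round when reducing 16-bit samples back to 8 bits.
# Assumptions (not taken from AviSynth+): plain bit shifts for the depth
# change, and a symmetric error standing in for the matrix conversions.

def to_16bit(v8):
    """Scale an 8-bit value to 16 bits with a plain shift."""
    return v8 << 8

def down_floor(v16):
    """Reduce to 8 bits by truncation (floor)."""
    return v16 >> 8

def down_round(v16):
    """Reduce to 8 bits by rounding to nearest, clamped to 255."""
    return min((v16 + 128) >> 8, 255)

def mean_shift(reduce, max_error=100):
    """Average change in 8-bit value after one up/down round trip."""
    total = 0
    count = 0
    for v8 in range(16, 236):
        for err in range(-max_error, max_error + 1):
            total += reduce(to_16bit(v8) + err) - v8
            count += 1
    return total / count

floor_shift = mean_shift(down_floor)
round_shift = mean_shift(down_round)
print(f"floor: {floor_shift:+.4f}  round: {round_shift:+.4f}")
```

Under these assumptions truncation loses roughly half a code value per round trip, because every negative error drops the sample to the next lower value, while rounding leaves the mean unchanged. A drift that grows with each repetition would therefore be consistent with floor-like behaviour somewhere in the chain, though it does not identify which step is responsible.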
7th August 2018, 17:53 | #4182 | Link | ||
Registered User
Join Date: May 2016
Posts: 197
|
Quote:
Quote:
|
||
8th August 2018, 03:15 | #4183 | Link | |
...?
Join Date: Nov 2005
Location: Florida
Posts: 1,420
|
Quote:
Taking a cursory look at AVSInpaint.c, though...ack. That thing needs a cleanup. I can't tell if it's even using the C interface correctly (comparing, as I mentioned above, to the FFMS2 C plugin; that might be a special case, though, and I've not taken any time thus far to see if I can get AssRender building with GCC to verify with it). Last edited by qyot27; 8th August 2018 at 03:18. |
|
8th August 2018, 14:44 | #4184 | Link | |
Registered User
Join Date: Sep 2003
Location: Berlin, Germany
Posts: 3,079
|
Quote:
Not representative statistically, I tried to keep it "real world" as much as I could. Interesting results, and I also have a few questions.
Test platform: Lenovo T530, Core i5-3230M Ivy Bridge with 8 GB RAM. Pretty much middle class today (of course only entry level for Doom9 members). The CPU has 2 physical cores plus 2 virtual (Hyperthreading) cores.
AviSynth versions (32-bit only):
1: AVS 2.61 Alpha VC6 Build
2: AVS+ r2728 pinterf
3: AVS+ r2741 qyot27 Non-SSE2
Source file: Downloaded HD clip @ 29.97 progressive.
I used 3 different scripts. The first 2 are common everyday scripts; the third one uses DCT=1 with MVTools2 and is way too slow for everyday use.
Script #1:
ConvertToYV12()
DegrainMedian(mode=2)
LSFMod()
Spline36Resize(720,576)
ChangeFPS(25)
Script #2:
ConvertToYV12()
Spline36Resize(720,576)
mx_fps(25) # a modded version of FrameRateConverter by MysteryX
Script #3:
Same as above except using mx_fps(25, dct=1)
For AVS+ I tested both "Prefetch(2)" and "Prefetch(4)" at the end of the scripts. I also used "SetMTMode.avsi" which is linked at the AVS+ WIKI page. And here comes my first question: Do I copy this "SetMTMode.avsi" into the "plugins+" folder or into the "plugins" folder? I tried both and did not notice any difference, so I put it in the "plugins+" folder.
I deliberately did not just measure the results of the scripts in AVSMeter because I wanted to find the overall conversion speeds. My encoder was FFmpeg which uses all 4 cores by default.
Results:
Code:
AVS 2.61 Alpha:
Script #1: 30 fps
Script #2: 15 fps
Script #3: 1.0 fps
Code:
AVS+ r2728 pinterf:
Script #1: 31 fps (no difference between Prefetch(2) and Prefetch(4))
Script #2: 24 fps (Prefetch(2)) and 28 fps (Prefetch(4))
Script #3: 1.6 fps (Prefetch(2)) and 1.5 fps (Prefetch(4))
Code:
AVS+ r2741 qyot27 Non-SSE2
Script #1: 31 fps (no difference between Prefetch(2) and Prefetch(4))
Script #2: 24 fps (Prefetch(2)) and 28 fps (Prefetch(4))
Script #3: 1.6 fps (for both Prefetch(2) and Prefetch(4))
Conclusion:
1. The difference between standard AVS and AVS+ is very obvious, mainly when a complex script like mx_fps (which uses MVTools2 and MaskTools2) gets used.
2. There is almost no difference between the official pinterf version of AVS+ and the Non-SSE2 version by qyot27. On one occasion the qyot27 version is even slightly faster.
Which leads me to my second question: Could it be that the qyot27 version does use the SSE2 capability of the CPU if the CPU supports it? If not then I would say that using only SIMD versions up to MMX and SSE does not necessarily slow down the conversions, at least not with my tests.
Any thoughts?
Cheers
manolito
Last edited by manolito; 8th August 2018 at 21:50. |
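To put numbers on "very obvious", the reported fps figures can be turned into speedup factors relative to AVS 2.61. The sketch below uses only the figures from the post; the dictionary layout and the function name are made up for illustration.

```python
# Speedup of AVS+ r2728 over AVS 2.61 Alpha, from the fps figures above.
baseline = {"Script #1": 30.0, "Script #2": 15.0, "Script #3": 1.0}

# (Prefetch(2) fps, Prefetch(4) fps) for AVS+ r2728 pinterf
avs_plus = {
    "Script #1": (31.0, 31.0),
    "Script #2": (24.0, 28.0),
    "Script #3": (1.6, 1.5),
}

def speedups(base, candidate):
    """Return fps ratios (candidate / base), rounded to two decimals."""
    return {
        name: tuple(round(fps / base[name], 2) for fps in pair)
        for name, pair in candidate.items()
    }

result = speedups(baseline, avs_plus)
for name, (p2, p4) in result.items():
    print(f"{name}: x{p2} with Prefetch(2), x{p4} with Prefetch(4)")
```

Script #1 barely changes, while the MVTools2-based scripts gain roughly 50 to 90 percent, which matches the first conclusion. With a single run per configuration these ratios are only indicative.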
|
8th August 2018, 17:00 | #4185 | Link | |
...?
Join Date: Nov 2005
Location: Florida
Posts: 1,420
|
Quote:
Essentially, it works like this.
Intrinsics or hand-written assembly code use SIMD instructions directly in discrete versions of a particular function (layer_sse4, layer_sse2, layer_avx, etc.). These are then included in a runtime CPU detection dispatcher which allows the program to select the appropriate one based on what the CPU supports. This always gets compiled, and the functions for the different paths are there no matter what. There are SSSE3 and SSE4 functions for some filters and AVX/AVX2 versions for others, all depending on what gives the greatest boost and what anyone bothered to write.
Most compilers, however, have the ability to optimize the plain C versions during the build process so that the final binary may use SIMD instructions anywhere. MSVC controls this through the /arch: parameter, GCC does it through -march, -mtune, or -m[SIMD] flags used together or alone. These flags make it so that said CPU or SIMD is *required*, because it will use that code even in the parts not covered by the intrinsics or assembly.
So take Mask for example. In AviSynth+, this has multiple versions of the function:
Code:
mask_sse2
mask_core_mmx (not sure if this is actually just a dependency of the mask_mmx function below)
mask_mmx
mask_c
When MSVC has /arch:IA32 set, mask_c (the entire program's plain C code, actually) will be built without any SIMD optimizations. Left at its default, though, mask_c (and all the rest of the plain C code) will be optimized by MSVC itself to use SSE2 instructions. Fine for CPUs that have SSE2 already, not fine for ones that don't.
This auto-optimization-by-compiler is generally not as thorough or fine-tuned as you'd get from either intrinsics or hand-written assembly, which is why those are still needed to get significant boosts in speed, but for functions which haven't yet had intrinsics or asm written for them, the auto-optimization is the best you can do. |
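The runtime dispatcher described above can be sketched in a few lines of Python. This is only a model of the selection logic, not AviSynth+ code: the function bodies are placeholders, and the feature names and the `select_mask` helper are hypothetical.

```python
# Sketch of runtime CPU dispatch: several versions of one function exist,
# and the best one the CPU supports is picked at run time.
# Placeholder bodies; real versions would use intrinsics or assembly.

def mask_c(frame):
    return "plain C path"

def mask_mmx(frame):
    return "MMX path"

def mask_sse2(frame):
    return "SSE2 path"

# Ordered from most to least preferred.
MASK_VARIANTS = [("sse2", mask_sse2), ("mmx", mask_mmx)]

def select_mask(cpu_features):
    """Return the best mask implementation for the given CPU feature set."""
    for feature, func in MASK_VARIANTS:
        if feature in cpu_features:
            return func
    return mask_c  # fallback that is always available

old_cpu = {"mmx", "sse"}            # e.g. a CPU without SSE2
new_cpu = {"mmx", "sse", "sse2"}
print(select_mask(old_cpu)(None))
print(select_mask(new_cpu)(None))
```

What the sketch cannot show is the compiler side: with an /arch setting above what the CPU supports, even the fallback (and the dispatcher itself) may contain instructions the CPU cannot run, which is why the dispatcher alone does not make a build safe for older CPUs.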
|
8th August 2018, 22:00 | #4186 | Link | |
Registered User
Join Date: Sep 2003
Location: Berlin, Germany
Posts: 3,079
|
Thanks, I think I finally got it...
Quote:
So why doesn't pinterf implement your CPU branching routine? Cheers manolito |
|
8th August 2018, 23:04 | #4187 | Link | |
...?
Join Date: Nov 2005
Location: Florida
Posts: 1,420
|
Quote:
/arch and optimizing even the C parts with SIMD is making the cake itself a marble, or straight-up chocolate, cake - if you want to avoid the chocolate, you can't. It's in there, whether you like it or not. Whether the CPU supports it or not (and if the CPU doesn't support it, it crashes with an Illegal instruction error).
I did absolutely nothing to branch AviSynth+'s CPU support. The only difference between the typical builds pinterf has been providing and the one I posted a little bit ago is that I switched /arch back to SSE before building it, the way it was on the original AviSynth+ repo before ultim went on hiatus again (as you can see, it was last updated in August 2016, which is why pinterf's repo is the current development hub everyone points to now):
https://github.com/AviSynth/AviSynth...eLists.txt#L48
vs.
https://github.com/pinterf/AviSynthP...eLists.txt#L74
All I did in my working branch (https://github.com/qyot27/AviSynthPl...eLists.txt#L74) was make it so that when I go to build AviSynth+, I don't have to open CMakeLists.txt in Notepad2-mod and change it back to SSE. Instead, I can now pass -DCPU_ARCH=SSE (or -DCPU_ARCH=IA32, -DCPU_ARCH=AVX, or -DCPU_ARCH=AVX2) to the CMake command line, like any other configuration option, and avoid having to open source files in text editors first.
As for why it hasn't shown up outside of my branch, A) I just whipped up that patch earlier this week or last week, and B) I've not opened a pull request for the changes on that branch yet.
Last edited by qyot27; 8th August 2018 at 23:08. |
|
8th August 2018, 23:48 | #4188 | Link | |
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
|
Quote:
__________________
Groucho's Avisynth Stuff |
|
9th August 2018, 00:29 | #4189 | Link | |
Registered User
Join Date: Sep 2003
Location: Berlin, Germany
Posts: 3,079
|
Quote:
But then why the hell is your build just as fast or even a little faster on my test computer which certainly does have SSE2? |
|
9th August 2018, 01:45 | #4190 | Link | ||
...?
Join Date: Nov 2005
Location: Florida
Posts: 1,420
|
Quote:
The point is that /arch specifies the minimum instruction set the CPU supports, and because of that, it allows the compiler to use SIMD at or below that minimum setting when optimizing the C parts of the code during the build process. I mean, I could probably throw up a build with all the intrinsics disabled so you'd be forced to use the C versions and see directly how well MSVC optimizes stuff. I think it's just disabling a couple of defines, but I'm not sure. Quote:
|
||
9th August 2018, 02:22 | #4191 | Link |
Registered User
Join Date: Sep 2003
Location: Berlin, Germany
Posts: 3,079
|
Alright, this answers most of my questions, thanks...
All the high bit depth and high color stuff is not for me (I am just too old for all this UHD / HDR / 8K stuff, my viewing device is a 4:3 CRT TV set with natural colors I so far have never seen on an LCD. And I also refuse to converge computer (for working) and TV (for entertainment) stuff). So all I am interested in for AVS+ is the speed gain from MT.
Thanks again
manolito |
9th August 2018, 08:38 | #4192 | Link |
Registered User
Join Date: Oct 2002
Location: France
Posts: 2,316
|
There were plasma screens (Pioneer Kuro) which were very good for color. SED/FED died before being born, but now there is OLED, which I think will provide good color for CRT people like I was. Like you, I've never liked LCD, but my Pioneer Kuro plasma has given me satisfaction, and the day I have to replace it (because it will happen, as late as possible I hope), I think OLED will satisfy me.
__________________
My github. |
9th August 2018, 10:49 | #4193 | Link | |
Registered User
Join Date: Jan 2014
Posts: 2,314
|
Quote:
|
|
9th August 2018, 12:07 | #4194 | Link |
RipBot264 author
Join Date: May 2006
Location: Poland
Posts: 7,815
|
I'm just curious why Prefetch with a value equal to the number of physical cores is faster than with the number of logical processors?
It does not matter if it is Xeon 8C/16T or Ryzen 8C/16T. Result is always the same. Script Code:
#VideoSource
LoadPlugin("C:\Users\Dave\Documents\Delphi_Projects\RipBot264\_Compiled\Tools\AviSynth plugins\ffms\ffms_latest\x64\ffms2.dll")
video=FFVideoSource("C:\Temp\RipBot264temp\job1\video.mkv",cachefile = "C:\Temp\RipBot264temp\job1\video.mkv.ffindex")
#Deinterlace
#Resize
LoadPlugin("C:\Users\Dave\Documents\Delphi_Projects\RipBot264\_Compiled\Tools\AviSynth plugins\Plugins_JPSDR\Plugins_JPSDR.dll")
video=Spline36ResizeMT(video,1920,1080,SetAffinity=false).Sharpen(0.2)
#Tonemap
Loadplugin("C:\Users\Dave\Documents\Delphi_Projects\RipBot264\_Compiled\Tools\AviSynth plugins\avsresize\avsresize.dll")
Loadplugin("C:\Users\Dave\Documents\Delphi_Projects\RipBot264\_Compiled\Tools\AviSynth plugins\DGTonemap\x64\DGTonemap.dll")
video=z_ConvertFormat(video,pixel_type="RGBPS",colorspace_op="2020ncl:st2084:2020:l=>rgb:linear:2020:l", dither_type="none").DGHable
video=z_ConvertFormat(video,pixel_type="YV12",colorspace_op="rgb:linear:2020:l=>709:709:709:l",dither_type="ordered")
#Prefetch
video=Prefetch(video,X)
#Return
return video
Code:
AVSMeter 2.8.1 (x64) - Copyright (c) 2012-2018, Groucho2004
AviSynth+ 0.1 (r2728, MT, x86_64) (0.1.0.0)
Number of frames: 7935
Length (hh:mm:ss.ms): 00:02:12.382
Frame width: 1920
Frame height: 1080
Framerate: 59.940 (60000/1001)
Colorspace: YV12
Frames processed: 7935 (0 - 7934)
FPS (min | max | average): 0.147 | 1000000 | 30.50
Memory usage (phys | virt): 2099 | 2124 MiB
Thread count: 65
CPU usage (average): 53%
Time (elapsed): 00:04:20.122
Prefetch(8)
Code:
AVSMeter 2.8.1 (x64) - Copyright (c) 2012-2018, Groucho2004
AviSynth+ 0.1 (r2728, MT, x86_64) (0.1.0.0)
Number of frames: 7935
Length (hh:mm:ss.ms): 00:02:12.382
Frame width: 1920
Frame height: 1080
Framerate: 59.940 (60000/1001)
Colorspace: YV12
Frames processed: 7935 (0 - 7934)
FPS (min | max | average): 0.339 | 944590 | 33.17
Memory usage (phys | virt): 1591 | 1616 MiB
Thread count: 57
CPU usage (average): 53%
Time (elapsed): 00:03:59.189
__________________
Windows 7 Image Updater - SkyLake\KabyLake\CoffeLake\Ryzen Threadripper |
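For comparing runs like the two above, the average FPS can be recomputed from the frame count and the elapsed time. The Python sketch below does that for the two logs in this post; the arithmetic agrees with the reported averages, though that is an observation about these numbers rather than a statement about how AVSMeter computes them. The function names are hypothetical.

```python
# Recompute average fps from "Frames processed" and "Time (elapsed)".

def elapsed_seconds(text):
    """Convert an 'hh:mm:ss.ms' string to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def average_fps(frames, elapsed):
    """Frames divided by elapsed wall-clock time."""
    return frames / elapsed_seconds(elapsed)

first_run = average_fps(7935, "00:04:20.122")
second_run = average_fps(7935, "00:03:59.189")
gain = second_run / first_run - 1.0

print(f"first run:  {first_run:.2f} fps")
print(f"second run: {second_run:.2f} fps")
print(f"difference: {gain:.1%}")
```

The second run comes out a little under 9 percent faster than the first, so the effect is real but modest for this script.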
9th August 2018, 13:49 | #4195 | Link |
Professional Code Monkey
Join Date: Jun 2003
Location: Kinnarps Chair
Posts: 2,555
|
This is a general answer that applies to most things.
You can easily saturate the total memory bandwidth with fewer threads than the number of logical processors. (A)VS in particular, which processes full frames instead of tiles/lines, quickly reaches that level. And once you have more threads than physical cores, each thread also has less cache available, which means it is even more likely to have to access RAM and wait on it. It's possible that the default number of threads should be something like max(physical cores, min(logical threads, 8)) for x86.
__________________
VapourSynth - proving that scripting languages and video processing isn't dead yet |
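The suggested default can be written out as a small Python function to see what it gives for a few core counts. The function and parameter names are hypothetical; the formula itself is the rough guess from the post above, not a tested recommendation.

```python
# max(physical cores, min(logical threads, 8)), as suggested above.

def default_thread_count(physical_cores, logical_threads, cap=8):
    """Use all logical threads up to `cap`, but never fewer than the
    number of physical cores."""
    return max(physical_cores, min(logical_threads, cap))

for cores, threads in [(2, 4), (6, 12), (8, 16), (16, 32)]:
    print(f"{cores}C/{threads}T -> {default_thread_count(cores, threads)}")
```

For a 2C/4T CPU this gives 4, for 6C/12T and 8C/16T it gives 8, and above 8 physical cores it simply returns the physical core count.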
9th August 2018, 16:40 | #4196 | Link |
Registered User
Join Date: Sep 2003
Location: Berlin, Germany
Posts: 3,079
|
Not true for my Core i5-3230M Ivy Bridge with 8 GB RAM. According to this formula I should use Prefetch(2), but my tests (latest AVS+ 32-bit) showed that in most cases Prefetch(4) is significantly faster.
Last edited by manolito; 9th August 2018 at 16:42. |
9th August 2018, 16:42 | #4197 | Link |
Professional Code Monkey
Join Date: Jun 2003
Location: Kinnarps Chair
Posts: 2,555
|
My formula gives 4. I don't see the problem here. The whole thing was pulled out of my butt so no guarantees that it's optimal.
__________________
VapourSynth - proving that scripting languages and video processing isn't dead yet |
9th August 2018, 16:54 | #4198 | Link | |
Guest
Posts: n/a
|
Quote:
The formula in your case would work through the following steps:
max(2 physical cores, min(4 logical threads, 8 threads))
Min of 4 and 8 is 4.
max(2 physical cores, 4 logical threads)
Max between 2 and 4 would be 4.
So, his back-of-the-envelope formula gave you exactly what you claim was the faster number. |
|
9th August 2018, 17:19 | #4199 | Link |
RipBot264 author
Join Date: May 2006
Location: Poland
Posts: 7,815
|
That formula would return 8 for 8700k/Ryzen 2600 instead of 6. I have disabled 2 cores on my Xeon E5-2690 in BIOS and again the test showed that Prefetch equal to the number of cores is better.
Prefetch(8)
Code:
AVSMeter 2.8.1 (x64) - Copyright (c) 2012-2018, Groucho2004
AviSynth+ 0.1 (r2728, MT, x86_64) (0.1.0.0)
Number of frames: 7935
Length (hh:mm:ss.ms): 00:02:12.382
Frame width: 1920
Frame height: 1080
Framerate: 59.940 (60000/1001)
Colorspace: YV12
Frames processed: 7935 (0 - 7934)
FPS (min | max | average): 0.131 | 944593 | 22.98
Memory usage (phys | virt): 1422 | 1446 MiB
Thread count: 45
CPU usage (average): 60%
Time (elapsed): 00:05:45.265
Code:
AVSMeter 2.8.1 (x64) - Copyright (c) 2012-2018, Groucho2004
AviSynth+ 0.1 (r2728, MT, x86_64) (0.1.0.0)
Number of frames: 7935
Length (hh:mm:ss.ms): 00:02:12.382
Frame width: 1920
Frame height: 1080
Framerate: 59.940 (60000/1001)
Colorspace: YV12
Frames processed: 7935 (0 - 7934)
FPS (min | max | average): 0.132 | 708445 | 25.73
Memory usage (phys | virt): 1281 | 1305 MiB
Thread count: 43
CPU usage (average): 60%
Time (elapsed): 00:05:08.385
__________________
Windows 7 Image Updater - SkyLake\KabyLake\CoffeLake\Ryzen Threadripper Last edited by Atak_Snajpera; 9th August 2018 at 17:22. |
9th August 2018, 17:51 | #4200 | Link |
RipBot264 author
Join Date: May 2006
Location: Poland
Posts: 7,815
|
Another test, but this time with 6 cores disabled, leaving only 2C/4T.
Prefetch(4)
Code:
AVSMeter 2.8.1 (x64) - Copyright (c) 2012-2018, Groucho2004
AviSynth+ 0.1 (r2728, MT, x86_64) (0.1.0.0)
Number of frames: 7935
Length (hh:mm:ss.ms): 00:02:12.382
Frame width: 1920
Frame height: 1080
Framerate: 59.940 (60000/1001)
Colorspace: YV12
Frames processed: 7935 (0 - 7934)
FPS (min | max | average): 0.092 | 944596 | 10.70
Memory usage (phys | virt): 784 | 808 MiB
Thread count: 17
CPU usage (average): 89%
Time (elapsed): 00:12:21.689
Code:
AVSMeter 2.8.1 (x64) - Copyright (c) 2012-2018, Groucho2004
AviSynth+ 0.1 (r2728, MT, x86_64) (0.1.0.0)
Number of frames: 7935
Length (hh:mm:ss.ms): 00:02:12.382
Frame width: 1920
Frame height: 1080
Framerate: 59.940 (60000/1001)
Colorspace: YV12
Frames processed: 7935 (0 - 7934)
FPS (min | max | average): 3.012 | 314865 | 14.37
Memory usage (phys | virt): 620 | 644 MiB
Thread count: 15
CPU usage (average): 85%
Time (elapsed): 00:09:12.192
__________________
Windows 7 Image Updater - SkyLake\KabyLake\CoffeLake\Ryzen Threadripper |
|
|