25th August 2015, 16:09 | #1 | Link |
Registered User
Join Date: Nov 2001
Posts: 291
|
FlipVertical and New BitBlt
This is an update and a different development of flipvertical; at least one part of it can work in place and can also use AVX instructions on newer machines. To avoid name conflicts you must call it as Fvertical().

If you have my old flips.dll in your AviSynth plugin folder, you should replace it with this new Vertical.dll.

Actually, this update was an excuse to develop a new bitblt, something I started many years ago that has been sleeping on my disks, moving from one machine to another, for at least the last eight years. As I never manage to finish this project, I am releasing it now the way it is.

This new bitblt has only been tested in this plugin and in others for my own use. It has never been tested deeply as a substitute for the internal one, only in a few tests, so for now I cannot guarantee full compatibility or that it is free of bugs. It uses SSE2, SSSE3 and AVX instructions, depending on the machine on which it is running.

This project includes four files from Agner Fog's libraries (cachesize32.asm, cputype32.asm, instrset32.asm, unalignedfaster32.asm) and some slightly modified subroutines from memcpy32.asm. You can find them in http://www.agner.org/optimize/asmlib.zip
All of Agner Fog's original sources are also included in this file.

Version 1.0 Fvertcal.7z
Version 1.0 Fvertcal.zip
Version 1.01 Fvertcal.7z
Version 1.002 Fvertcal.dll
Version 1.003 Fvertcal.dll
Version 1.004 Fvertcal.dll
Version 1005 Fvertical.dll

I hope this can be useful.
ARDA
Last edited by ARDA; 15th September 2015 at 01:46. Reason: update version |
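For readers wondering what "work in place" means for a vertical flip, here is a minimal sketch in Python. It is not the plugin's code (that is SSE2/AVX assembler); the function names and the tiny 4x4 test frame are made up for illustration. The copying version needs a second buffer, while the in-place version only swaps row y with row height-1-y inside the same buffer.

```python
def flip_vertical_copy(src: bytes, pitch: int, height: int) -> bytearray:
    """Flip by copying every row into a new buffer (needs a second frame)."""
    dst = bytearray(pitch * height)
    for y in range(height):
        dst_y = height - 1 - y
        dst[dst_y * pitch:(dst_y + 1) * pitch] = src[y * pitch:(y + 1) * pitch]
    return dst


def flip_vertical_in_place(frame: bytearray, pitch: int, height: int) -> None:
    """Flip by swapping row y with row height-1-y inside the same buffer."""
    top, bottom = 0, height - 1
    while top < bottom:
        t0, b0 = top * pitch, bottom * pitch
        tmp = bytes(frame[t0:t0 + pitch])             # save the top row
        frame[t0:t0 + pitch] = frame[b0:b0 + pitch]   # bottom row -> top
        frame[b0:b0 + pitch] = tmp                    # saved top row -> bottom
        top += 1
        bottom -= 1


# Hypothetical 4x4 one-byte-per-pixel frame, one letter per row.
frame = bytearray(b"aaaabbbbccccdddd")
flipped = flip_vertical_copy(bytes(frame), pitch=4, height=4)
flip_vertical_in_place(frame, pitch=4, height=4)
print(frame == flipped)
```

The in-place variant touches each row once and allocates no new frame, which is presumably where the gain comes from when the source frame is writable.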
25th August 2015, 16:32 | #2 | Link |
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
|
Quick test on my i5-2500K (Sandy Bridge):
Script:
Code:
blankclip(length = 1000, width = 5000, height = 3000, color=$005B8B).killaudio().assumefps(50, 1)
#flipvertical()
#fvertical()
Code:
[Runtime info]
Frames processed: 1000 (0 - 999)
FPS (min | max | average): 124.8 | 148.5 | 141.8
Memory usage (phys | virt): 121 | 120 MB
Thread count: 1
CPU usage (average): 24%
Time (elapsed): 00:00:07.050
Code:
[Runtime info]
Frames processed: 1000 (0 - 999)
FPS (min | max | average): 121.3 | 141.3 | 135.9
Memory usage (phys | virt): 121 | 120 MB
Thread count: 1
CPU usage (average): 23%
Time (elapsed): 00:00:07.360 |
25th August 2015, 16:42 | #3 | Link |
Registered User
Join Date: Nov 2001
Posts: 291
|
If you have any doubt about the performance of any filter, I propose the following script:

MPEG2Source("your source") # or any source you like; always use the same one to get slightly more accurate benchmarks
# and always test the same frames each time. 9000 frames is a good quantity for this script.
#TemporalSoften(4,8,8,15,2) # use this line to force a non-writable src and test whether a new video frame
# is created by your filter or not. It is just an example.
AvsTimer(frames=1000, name="ANYONE",type=3, frequency=x?, total=false, quiet=true)# use your cpu frequency
# Put here your filter to benchmark
#flipvertical()
#fvertical()
AvsTimer(frames=1500 ,name="ANYONE",type=3, frequency=x?, difference=1, total=false)# use your cpu frequency

Open the script in VirtualDub, set direct stream copy, and set an initial frame and an end frame.
Open DebugView (google it) and set a filter highlight in DebugView, in this example *ANYONE*.
Go back to VirtualDub and run the video analysis pass.
You will see the results in the DebugView window every 1500 frames.

If anyone knows of a more accurate benchmarking method and wants to propose it, please post here to discuss it.
I hope this can be useful.
ARDA |
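The general idea behind this kind of benchmark (time only the filter under test, over a fixed number of frames, and report frames per second) can be sketched outside AviSynth as well. The Python below is only an illustration of that idea, not a description of how AvsTimer works internally; `measure_fps`, `dummy_filter` and the warm-up count are hypothetical names and values chosen for the example.

```python
import time


def measure_fps(process_frame, warmup: int, frames: int) -> float:
    """Return frames per second for `frames` calls, after `warmup` untimed calls."""
    for n in range(warmup):
        process_frame(n)                      # not timed: lets caches settle
    start = time.perf_counter()               # high-resolution wall clock
    for n in range(warmup, warmup + frames):
        process_frame(n)
    elapsed = time.perf_counter() - start
    return frames / elapsed


def dummy_filter(n: int) -> bytes:
    # Stand-in workload: reverse a 64 KiB buffer (a crude "flip").
    return bytes(65536)[::-1]


fps = measure_fps(dummy_filter, warmup=10, frames=100)
print(f"ANYONE = {fps:.0f} fps")
```

Running the same measurement several times and comparing the spread gives a feeling for how much of a difference between two filters is real and how much is noise.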
25th August 2015, 16:46 | #4 | Link |
Registered User
Join Date: Nov 2001
Posts: 291
|
@Groucho2004
The variation that your benchmark shows is too small, and your test is measuring blankclip as well, and the effect it has on memory. Please try the method I propose and tell me what results you get. Thanks ARDA |
25th August 2015, 16:51 | #5 | Link | |
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
|
Quote:
I used "blankclip" instead of a "real" source because it's extremely fast and does not add any overhead. |
|
25th August 2015, 17:27 | #7 | Link |
Registered User
Join Date: Nov 2001
Posts: 291
|
On my Haswell laptop (Intel family 6, model 45h).
Test done with the method described above, with a clip size of 5000 x 3000 (Y8). I will be doing more tests, but please give me some time.

With AvsTimer, Fvertical():
VirtualDub.exe [91497] ANYONE = 302 fps
VirtualDub.exe [92997] ANYONE = 308 fps
VirtualDub.exe [94497] ANYONE = 309 fps
VirtualDub.exe [95997] ANYONE = 306 fps
VirtualDub.exe [97497] ANYONE = 289 fps
VirtualDub.exe [98997] ANYONE = 305 fps

With AvsTimer, Flipvertical():
VirtualDub.exe [91499] ANYONE = 263 fps
VirtualDub.exe [92999] ANYONE = 268 fps
VirtualDub.exe [94499] ANYONE = 270 fps
VirtualDub.exe [95999] ANYONE = 270 fps
VirtualDub.exe [97499] ANYONE = 270 fps
VirtualDub.exe [98999] ANYONE = 268 fps

Thanks
ARDA
Last edited by ARDA; 25th August 2015 at 18:05. |
25th August 2015, 18:04 | #8 | Link | |
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
|
Quote:
AvsTimer obviously uses the rdtsc instruction to measure time based on the CPU time stamp but this is highly unreliable, especially with modern CPUs. |
|
25th August 2015, 23:05 | #9 | Link | ||
Registered User
Join Date: Nov 2001
Posts: 291
|
Quote:
Quote:
Yes, in the method I proposed, type=3 defines that rdtsc will be used; if you prefer QueryPerformanceCounter, use type=0. You will get different results, but almost surely with the same distance between them.

Anyway, for more accurate tests we would have to set real-time priority and set a thread affinity in the source plugin code, but that is not real life, and I think it wouldn't run under Windows XP (not sure). There is a lot of academic discussion all over the net about which is the most accurate method to benchmark code; I wouldn't like this thread to become about that subject.

I did not do the benchmarks with the example in RGB32 because it would take too much time, and at this resolution (5000x3000) the new bitblt is also applied with non-temporal stores. On my machine and with a Y8 clip, here are the results:
Code:
Source= 5000x3000 (Y8)
AvsTimer(frames=1000, name="ANYONE",type=0, frequency=1700, total=false, quiet=true)
fvertical()
AvsTimer(frames=1500 ,name="ANYONE",type=0, frequency=1700, difference=1, total=false)

Use type=0 QueryPerformanceCounter
VirtualDub.exe AvsTimer 0.8.1
VirtualDub.exe AvsTimer 0.8.1
VirtualDub.exe [91499] ANYONE = 306 fps
VirtualDub.exe [92999] ANYONE = 305 fps
VirtualDub.exe [94499] ANYONE = 305 fps
VirtualDub.exe [95999] ANYONE = 305 fps
VirtualDub.exe [97499] ANYONE = 310 fps
VirtualDub.exe [98999] ANYONE = 310 fps

Use type=2 GetThreadTimes
VirtualDub.exe [91499] ANYONE = 305 fps
VirtualDub.exe [92999] ANYONE = 317 fps
VirtualDub.exe [94499] ANYONE = 311 fps
VirtualDub.exe [95999] ANYONE = 323 fps
VirtualDub.exe [97499] ANYONE = 331 fps
VirtualDub.exe [98999] ANYONE = 313 fps

Use type=3 RDTSC
VirtualDub.exe [91497] ANYONE = 302 fps
VirtualDub.exe [92997] ANYONE = 308 fps
VirtualDub.exe [94497] ANYONE = 309 fps
VirtualDub.exe [95997] ANYONE = 306 fps
VirtualDub.exe [97497] ANYONE = 289 fps
VirtualDub.exe [98997] ANYONE = 305 fps
************************************************************************************
Code:
Source= 5000x3000 (Y8)
AvsTimer(frames=1000, name="ANYONE",type=0, frequency=1700, total=false, quiet=true)
flipvertical()
AvsTimer(frames=1500 ,name="ANYONE",type=0, frequency=1700, difference=1, total=false)

Use type=0 QueryPerformanceCounter
VirtualDub.exe [91499] ANYONE = 262 fps
VirtualDub.exe [92999] ANYONE = 267 fps
VirtualDub.exe [94499] ANYONE = 267 fps
VirtualDub.exe [95999] ANYONE = 266 fps
VirtualDub.exe [97499] ANYONE = 267 fps
VirtualDub.exe [98999] ANYONE = 267 fps

Use type=2 GetThreadTimes
VirtualDub.exe [91499] ANYONE = 261 fps
VirtualDub.exe [92999] ANYONE = 275 fps
VirtualDub.exe [94499] ANYONE = 274 fps
VirtualDub.exe [95999] ANYONE = 268 fps
VirtualDub.exe [97499] ANYONE = 264 fps
VirtualDub.exe [98999] ANYONE = 284 fps

Use type=3 RDTSC
VirtualDub.exe [91499] ANYONE = 263 fps
VirtualDub.exe [92999] ANYONE = 268 fps
VirtualDub.exe [94499] ANYONE = 270 fps
VirtualDub.exe [95999] ANYONE = 270 fps
VirtualDub.exe [97499] ANYONE = 270 fps
VirtualDub.exe [98999] ANYONE = 268 fps

All the above tests show a performance increase of around 18% for flip vertical.
Under the conditions of a 5000*3000 Y8 clip, the plugin uses the new bitblt with non-temporal stores, at least on my machine, in which the largest cache is an L3 of 3 MB.
Under other conditions the difference can reach 30% or more; maybe in a few days, if I have time, I will publish more tests.
Thanks
ARDA |
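The remark about non-temporal stores and the 3 MB L3 comes down to a size comparison: a copy that is larger than the cache gains nothing from being cached, so it may be cheaper to write around the cache. The sketch below shows only that decision. The rule "use non-temporal stores when the block is larger than the largest cache" is an assumption made for illustration; the real threshold in BitBlt_SSE2_avs.asm may be a different fraction of the cache size. `frame_bytes`, `pick_copy_path` and the 640x480 comparison size are hypothetical.

```python
def frame_bytes(width: int, height: int, bytes_per_pixel: int = 1) -> int:
    """Size of one plane; Y8 is one byte per pixel (pitch padding ignored)."""
    return width * height * bytes_per_pixel


def pick_copy_path(size_bytes: int, largest_cache_bytes: int) -> str:
    # Assumed rule, for illustration only: bypass the cache once the
    # block no longer fits in the largest cache level.
    return "non-temporal" if size_bytes > largest_cache_bytes else "cached"


L3_BYTES = 3 * 1024 * 1024  # the 3 MB L3 mentioned in the post

for width, height in [(5000, 3000), (640, 480)]:
    size = frame_bytes(width, height)
    print(f"{width}x{height} Y8: {size} bytes -> {pick_copy_path(size, L3_BYTES)}")
```

A 5000x3000 Y8 plane is 15,000,000 bytes, about five times the 3 MB cache, which is consistent with the post saying the non-temporal path is taken at that resolution.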
||
26th August 2015, 16:53 | #13 | Link | |
Registered User
Join Date: Nov 2001
Posts: 291
|
Quote:
If your projects include something related to the new bitblt or memcpy in AviSynth, I encourage you to include my new source and improve it if you find something wrong. In any case, it is a good idea to read all the manuals on Agner Fog's page, mainly the one on assembly optimization. Thanks ARDA |
|
26th August 2015, 18:29 | #14 | Link | |
Registered User
Join Date: Oct 2002
Location: France
Posts: 2,382
|
Quote:
Another problem is that your code seems to be 32-bit only (I tried to take a look to find this new bitblt...). So, unless there is something I'm better able to understand, I think for now I'll stay with asmlib and the already built libraries. |
|
26th August 2015, 22:11 | #15 | Link | |
Registered User
Join Date: Nov 2001
Posts: 291
|
Quote:
Yes, this whole project is 32-bit only. The new bitblt is all in the BitBlt_SSE2_avs.asm file, which is in the zip (see first post). If you want a bitblt for 64 bits, don't expect it soon from my side. The day we have a 64-bit AviSynth that is stable, reliable and faster than the 32-bit one, maybe I will think about it. If I remember correctly, the assembler code in the asmlib project is all in yasm/nasm syntax; it would be a good idea to start looking at it without fear if you intend to take advantage of it. What a wonderful world open source is! Thanks ARDA |
|
27th August 2015, 13:12 | #17 | Link |
Registered User
Join Date: Nov 2001
Posts: 291
|
Code:
Source= 1920x1080 (Y8)
AvsTimer(frames=1000, name="ANYONE",type=0, frequency=1700, total=false, quiet=true)
flipvertical()
AvsTimer(frames=1500 ,name="ANYONE",type=0, frequency=1700, difference=1, total=false)

Use type=0 QueryPerformanceCounter
VirtualDub.exe [91498] ANYONE = 2163 fps
VirtualDub.exe [92998] ANYONE = 2263 fps
VirtualDub.exe [94498] ANYONE = 2283 fps
VirtualDub.exe [95998] ANYONE = 2298 fps
VirtualDub.exe [97498] ANYONE = 2301 fps
VirtualDub.exe [98998] ANYONE = 2305 fps

Use type=2 GetThreadTimes
VirtualDub.exe [91498] ANYONE = 2595 fps
VirtualDub.exe [92998] ANYONE = 2667 fps
VirtualDub.exe [94498] ANYONE = 2400 fps
VirtualDub.exe [95998] ANYONE = 2233 fps
VirtualDub.exe [97498] ANYONE = 2909 fps
VirtualDub.exe [98998] ANYONE = 2182 fps

Use type=3 RDTSC
VirtualDub.exe [91499] ANYONE = 2047 fps
VirtualDub.exe [92999] ANYONE = 2272 fps
VirtualDub.exe [94499] ANYONE = 2267 fps
VirtualDub.exe [95999] ANYONE = 2259 fps
VirtualDub.exe [97499] ANYONE = 2265 fps
VirtualDub.exe [98999] ANYONE = 2237 fps
Code:
Source= 1920x1080 (Y8)
AvsTimer(frames=1000, name="ANYONE",type=0, frequency=1700, total=false, quiet=true)
fvertical()
AvsTimer(frames=1500 ,name="ANYONE",type=0, frequency=1700, difference=1, total=false)

Use type=0 QueryPerformanceCounter
VirtualDub.exe [91498] ANYONE = 4825 fps
VirtualDub.exe [92998] ANYONE = 4801 fps
VirtualDub.exe [94498] ANYONE = 4804 fps
VirtualDub.exe [95998] ANYONE = 4819 fps
VirtualDub.exe [97498] ANYONE = 4870 fps
VirtualDub.exe [98998] ANYONE = 4853 fps

Use type=2 GetThreadTimes
VirtualDub.exe [91498] ANYONE = 6000 fps
VirtualDub.exe [92998] ANYONE = 5333 fps
VirtualDub.exe [94498] ANYONE = 5053 fps
VirtualDub.exe [95998] ANYONE = 3840 fps
VirtualDub.exe [97498] ANYONE = 5647 fps
VirtualDub.exe [98998] ANYONE = 6400 fps

Use type=3 RDTSC
VirtualDub.exe [91498] ANYONE = 4830 fps
VirtualDub.exe [92998] ANYONE = 4823 fps
VirtualDub.exe [94498] ANYONE = 4817 fps
VirtualDub.exe [95998] ANYONE = 4829 fps
VirtualDub.exe [97498] ANYONE = 4851 fps
VirtualDub.exe [98998] ANYONE = 4843 fps

These tests were done to try the different timing methods available in AvsTimer.
They show that fvertical is around 100% faster. In fact, we should say that the new fvertical working in place is faster than the internal AviSynth bitblt, so it is not a fair comparison, but this was one of the objectives of this plugin: to get better performance for flip vertical when possible.
More tests soon.
Thanks
ARDA |
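For anyone who wants to check the "around 100%" figure, the arithmetic is just the ratio of the average frame rates. The sketch below uses the six type=0 (QueryPerformanceCounter) readings posted above for 1920x1080; the function name `speedup_percent` is made up for the example, and averaging the six readings is only one reasonable way to summarise them.

```python
from statistics import mean

# type=0 readings from the 1920x1080 (Y8) results in this post
flipvertical_fps = [2163, 2263, 2283, 2298, 2301, 2305]
fvertical_fps = [4825, 4801, 4804, 4819, 4870, 4853]


def speedup_percent(baseline, candidate) -> float:
    """How much faster `candidate` is than `baseline`, in percent."""
    return (mean(candidate) / mean(baseline) - 1.0) * 100.0


pct = speedup_percent(flipvertical_fps, fvertical_fps)
print(f"fvertical is about {pct:.0f}% faster than flipvertical")
```

With these numbers the averages are roughly 4829 fps against 2269 fps, i.e. a little over twice as fast.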
27th August 2015, 20:42 | #18 | Link |
Registered User
Join Date: Nov 2001
Posts: 291
|
Code:
Source= 720x576 (Y8)
AvsTimer(frames=1000, name="ANYONE",type=0, frequency=1700, total=false, quiet=true)
fvertical()
AvsTimer(frames=1500 ,name="ANYONE",type=0, frequency=1700, difference=1, total=false)

Use type=0 QueryPerformanceCounter
VirtualDub.exe [91498] ANYONE = 30549 fps
VirtualDub.exe [92998] ANYONE = 30837 fps
VirtualDub.exe [94498] ANYONE = 30766 fps
VirtualDub.exe [95998] ANYONE = 30768 fps
VirtualDub.exe [97498] ANYONE = 30690 fps
VirtualDub.exe [98998] ANYONE = 30673 fps

Use type=2 GetThreadTimes
VirtualDub.exe [91498] ANYONE = 24000 fps
VirtualDub.exe [92998] ANYONE = 24000 fps
VirtualDub.exe [94498] ANYONE = 24000 fps
VirtualDub.exe [95998] ANYONE = 16000 fps
VirtualDub.exe [97498] ANYONE = 32000 fps
VirtualDub.exe [98998] ANYONE = 48000 fps

Use type=3 RDTSC
VirtualDub.exe [91498] ANYONE = 30725 fps
VirtualDub.exe [92998] ANYONE = 30835 fps
VirtualDub.exe [94498] ANYONE = 30936 fps
VirtualDub.exe [95998] ANYONE = 30952 fps
VirtualDub.exe [97498] ANYONE = 30488 fps
VirtualDub.exe [98998] ANYONE = 30806 fps
Code:
Source= 720x576 (Y8)
AvsTimer(frames=1000, name="ANYONE",type=0, frequency=1700, total=false, quiet=true)
flipvertical()
AvsTimer(frames=1500 ,name="ANYONE",type=0, frequency=1700, difference=1, total=false)

Use type=0 QueryPerformanceCounter
VirtualDub.exe [91497] ANYONE = 9094 fps
VirtualDub.exe [92997] ANYONE = 12647 fps
VirtualDub.exe [94497] ANYONE = 12826 fps
VirtualDub.exe [95997] ANYONE = 13012 fps
VirtualDub.exe [97497] ANYONE = 12892 fps
VirtualDub.exe [98997] ANYONE = 12935 fps

Use type=2 GetThreadTimes
VirtualDub.exe [91498] ANYONE = 16000 fps
VirtualDub.exe [92998] ANYONE = 8727 fps
VirtualDub.exe [94498] ANYONE = 24000 fps
VirtualDub.exe [95998] ANYONE = 8727 fps
VirtualDub.exe [97498] ANYONE = 12000 fps
VirtualDub.exe [98998] ANYONE = 10667 fps

Use type=3 RDTSC
VirtualDub.exe [91498] ANYONE = 12403 fps
VirtualDub.exe [92998] ANYONE = 13077 fps
VirtualDub.exe [94498] ANYONE = 12927 fps
VirtualDub.exe [95998] ANYONE = 12930 fps
VirtualDub.exe [97498] ANYONE = 12985 fps
VirtualDub.exe [98998] ANYONE = 12993 fps

This resolution shows a performance increase of around 100%; the conditions are almost the same as in the previous tests. |
27th August 2015, 21:17 | #19 | Link |
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
|
OK, I tested it properly now. In order to measure the very short time for each call of *vertical() I just call it several times. "colorbars()" is extremely fast and does not influence the results.
Here's the script:
Code:
colorbars(width = 1920, height = 1080, pixel_type = "yv12").killaudio().assumefps(25, 1).trim(0, 4999)
#test_flipvertical()
#test_fvertical()

function test_flipvertical(clip c)
{
    last = c
    flipvertical().flipvertical().flipvertical().flipvertical().flipvertical()
    flipvertical().flipvertical().flipvertical().flipvertical().flipvertical()
    return last
}

function test_fvertical(clip c)
{
    last = c
    fvertical().fvertical().fvertical().fvertical().fvertical()
    fvertical().fvertical().fvertical().fvertical().fvertical()
    return last
}
Code:
Frames processed: 5000 (0 - 4999)
FPS (min | max | average): 195.1 | 201.1 | 199.1
Memory usage (phys | virt): 15 | 14 MB
Thread count: 1
CPU usage (average): 25%
Time (elapsed): 00:00:25.109
Code:
Frames processed: 5000 (0 - 4999)
FPS (min | max | average): 687.9 | 736.4 | 730.7
Memory usage (phys | virt): 12 | 12 MB
Thread count: 1
CPU usage (average): 25%
Time (elapsed): 00:00:06.843 |
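Since each test function chains ten calls, an approximate per-call cost can be derived from the average frame rate. The sketch below does that arithmetic. It assumes the source cost is negligible, as stated in the post, and it does not assume which result block belongs to which filter, because the post does not label them; `per_call_ms` is a hypothetical helper name.

```python
def per_call_ms(average_fps: float, calls_per_frame: int) -> float:
    """Approximate time of one filter call in milliseconds,
    assuming everything else in the script costs nothing."""
    frame_ms = 1000.0 / average_fps
    return frame_ms / calls_per_frame


first_block = per_call_ms(199.1, 10)    # first result block in the post
second_block = per_call_ms(730.7, 10)   # second result block in the post
ratio = first_block / second_block

print(f"first block:  {first_block:.3f} ms per call")
print(f"second block: {second_block:.3f} ms per call")
print(f"ratio: {ratio:.2f}x")
```

That works out to roughly 0.50 ms against 0.14 ms per call on a 1920x1080 YV12 frame, a ratio of about 3.7.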
27th August 2015, 22:46 | #20 | Link |
Excessively jovial fellow
Join Date: Jun 2004
Location: rude
Posts: 1,100
|
now benchmark it against a bitblt that simply uses a memcpy from a modern runtime instead
I suspect the only reason that this is possible to optimize is that Avisynth's ~optimized~ bitblt is an ancient piece of garbage written for P4's and ancient Athlons, which doesn't really produce great results on modern CPU's. Agner Fog's memcpy implementation was - by his own benchmarks - only barely faster than Microsoft's back in 2008. Replacing Avisynth's bitblt with a wrapper around memcpy and compiling with a modern runtime (haha, who am I kidding, this is Avisynth) would probably speed it up a lot. Last edited by TheFluff; 27th August 2015 at 22:57. |
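To make the "wrapper around memcpy" idea concrete, here is a rough sketch of what such a bitblt does: copy `height` rows of `row_size` bytes between two buffers that may have different pitches, and collapse to a single large copy when both buffers are tightly packed. It is written in Python with slice copies standing in for memcpy, so it says nothing about speed; the parameter names follow the usual dst/pitch/src/pitch/row_size/height layout, and the offset arguments are an addition needed because Python has no raw pointers.

```python
def bitblt(dst: bytearray, dst_off: int, dst_pitch: int,
           src: bytes, src_off: int, src_pitch: int,
           row_size: int, height: int) -> None:
    """Copy a rectangle of bytes row by row (slice copy stands in for memcpy)."""
    if row_size <= 0 or height <= 0:
        return
    if dst_pitch == src_pitch == row_size:
        # No padding between rows on either side: one big copy is enough.
        total = row_size * height
        dst[dst_off:dst_off + total] = src[src_off:src_off + total]
        return
    for y in range(height):
        s = src_off + y * src_pitch
        d = dst_off + y * dst_pitch
        dst[d:d + row_size] = src[s:s + row_size]


# Source rows are 4 useful bytes followed by 2 bytes of padding (pitch 6).
src = b"abcd..efgh.."
dst = bytearray(8)  # tightly packed destination (pitch 4)
bitblt(dst, 0, 4, src, 0, 6, row_size=4, height=2)
print(dst)
```

The single-copy shortcut matters because one large copy gives the underlying memcpy the best chance to use its wide-vector and non-temporal paths.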