Old 12th October 2019, 13:49   #1  |  Link
hajj_3
Registered User
 
Join Date: Mar 2004
Posts: 1,011
dav1d accelerated AV1 decoder

dav1d 0.5.0 'Asiatic Cheetah'

https://code.videolan.org/videolan/dav1d/-/releases

The fast and small AV1 decoder, codename 'Asiatic Cheetah'. It supports all the AV1 features and all bitdepths.

0.5.0 brings large improvements in speed on SSSE3 CPUs (up to 40% speedup), new speed improvements on AVX-2 (4-7%), ARM64 (up to 10%) and ARM32. It introduces some VSX, SSE2 and SSE4 optimizations.
0.5.0 fixes some minor issues, can export ITU T.35 metadata and improves the player example.

Last edited by foxyshadis; 3rd December 2020 at 12:34. Reason: add general releases link
Old 27th October 2019, 15:58   #2  |  Link
Mr_Khyron
Member
 
 
Join Date: Nov 2002
Posts: 147
https://code.videolan.org/videolan/d...releases#0.5.1
Quote:
This is a minor update of the 0.5.0 version of dav1d, the fast and small AV1 decoder, codename 'Asiatic Cheetah'.

0.5.1 brings improvements in speed for SSE2 CPUs (up to 50% speedup), and ARMv7 CPUs (up to 41% speedup).

It also fixes minor issues and brings minor speed improvements for other architectures.

http://download.opencontent.netflix....ix=AV1/Sparks/
Netflix posted new AV1 samples with and without film grain in 540p, 1080p and 2160p
Old 27th October 2019, 19:44   #3  |  Link
Nintendo Maniac 64
Registered User
 
 
Join Date: Nov 2009
Location: Northeast Ohio
Posts: 444
Quote:
Originally Posted by IgorC View Post
I wonder why dav1d developers have dedicated time to optimize for SSE2. Isn't SSSE3 already old enough? AMD caught up and implemented SSSE3 in 2011.
AMD's non-DDR4 processors with SSSE3 were... underwhelming, to say the least (the only exception being their Atom-competitor chips such as the Jaguar cores used in consoles, and it certainly wasn't absolute performance that made them exceptions, far from it).

A good number of people saw no reason to upgrade from their SSE3-at-max AMD CPUs, whether that was a Phenom II (especially the 6-core models) or the first-gen "Llano" laptop APUs, all the more so because upgrading either one required a different motherboard (AM3+ and FM2). Heck, when the first-gen AM3+ FX CPU reviews landed, it was even common for people to upgrade to the Phenom II X6 instead!



Quote:
Originally Posted by IgorC View Post
P.S. A few years ago I tested a 10-year-old laptop with a Pentium T4200 (SSSE3), which now sits unused. It could barely play YouTube VP9 720p videos and still dropped some frames.
VP9 decoding in browsers was woeful back then. I was able to run YouTube's 1080p30 VP9 encodes on a 2.0GHz first-gen Core 2 Duo (2MB L2 cache) and their 1080p60 VP9 encodes on a 2.4GHz second-gen Core 2 Duo (3MB L2 cache) if I ran the video stream through MPC-HC/LAVfilters, but the results were terrible in the browser. This was because the browsers at the time all used libvpx while MPC-HC/LAVfilters used ffvp9 (which, for quite a while, actually ran just as terribly if your CPU didn't support SSSE3 and/or you were using 32-bit MPC-HC/LAVfilters; this is no longer the case).

For reference, your Pentium is exactly the same architecture as a second-gen Core 2 Duo but has 1MB of L2 cache, so I would expect its IPC to be similar to a first-gen 2MB L2 Core 2 Duo.
__________________
____HTPC____  | __Desktop PC__
2.93GHz Xeon x3470 (4c/8t Nehalem) | 4.6GHz Pentium G3258 (2c/2t Haswell)
Radeon HD5870  | Intel iGPU      
2x2GB+2x1GB DDR3-1333 | 4x4GB DDR3-1600       

Win7 x64

Last edited by Nintendo Maniac 64; 28th October 2019 at 00:05.
Old 28th October 2019, 03:00   #4  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 82
Quote:
Originally Posted by IgorC View Post
I wonder why dav1d developers have dedicated time to optimize for SSE2.
SSSE3 is done. There was a comment by Steve Robertson (YouTube) at Video@Scale this year that 10% of their userbase on x86 has no SSSE3. So we're trying to explore whether this is meaningful.

1080p on SSE2 is not our goal. The goal is to have a baseline support so ~5 years (or even earlier?) from now, AV1 can be the baseline, not H.264. We don't know for sure, but this may imply some basic need for SSE2 support. So we're exploring what is possible and how much work it'd be.
Old 28th October 2019, 10:29   #5  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
Quote:
Originally Posted by Beelzebubu View Post
SSSE3 is done
According to the dAV1d team, the latest version, 0.5.0, is extremely fast.
It is many times faster than libaom, even using SSSE3.
Based on the benchmarks below, what really surprises me is that, depending on content and CPU implementation, SSSE3 code running on 128-bit registers can be as fast as AVX2 code running on 256-bit registers!
How is this even possible?
I'm starting to believe that your AVX2 assembly optimizations could be optimized further.
BTW, any plans for AVX-512 in the near future?
Would there be any benefit from it?

https://i.postimg.cc/7h1nkFss/dav1d-0-5-x86-s.png
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1)
HEVC decoding benchmarks
H.264 DXVA Benchmarks for all
Old 28th October 2019, 11:17   #6  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
OK, so I took a look at the single-threaded performance and I see a 20% gain for AVX2 over SSSE3.

It is really amazing that the remaining non-optimized parts of the algorithm can account for around 80% of the performance (!)

Does that mean that all these months of writing optimized AVX2 assembly really contribute only 20%?

I would really like to hear what the dAV1d team or other developers of software AV1 decoders say about that.

Do we really have an 80% non-optimizable algorithm here?

Looks like another instance of the Pareto principle to me.

https://i.postimg.cc/3rt91v4z/1-0o-W...a9-BSb3-SQ.png
Old 28th October 2019, 12:57   #7  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
Quote:
Originally Posted by Beelzebubu View Post
On what system (chipset)?
I took it from "you"
http://www.jbkempf.com/blog/post/201...elease-fastest
Old 28th October 2019, 13:50   #8  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 82
Quote:
Originally Posted by NikosD View Post
A few things are going on there:
  • YMM (e.g. AVX2) functions are never exactly 2x as fast as XMM (e.g. SSSE3) functions, even in theoretical conditions;
  • YMM upper lane use will cause CPU downclocking (but not on modern AMD CPUs, I'm told);
  • certain code in SIMD functions does not use YMM upper lanes (effectively), usually because the block size is too small (width=4-8), but sometimes because we don't want a function-pointer-call overhead (multisymbol coding);
  • and obviously, a lot of code is not SIMD'ed at all, it's 50%-50% between SIMD and non-SIMD at best.

Together, that means the speedup is well below half of half, so 20% is not entirely unreasonable. Sucks a bit, but you can't beat reality.
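The reasoning in the list above is essentially an Amdahl's-law estimate. A minimal Python sketch, assuming (for illustration only) that 50% of single-threaded decode time is spent in SIMD functions and that AVX2 makes those functions 1.5x or 2x faster than SSSE3; the function name and the kernel speedups are hypothetical, not measured dav1d figures:

```python
def overall_speedup(simd_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only `simd_fraction`
    of the runtime becomes `kernel_speedup` times faster."""
    if not 0.0 <= simd_fraction <= 1.0:
        raise ValueError("simd_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    # The non-SIMD part is unchanged; the SIMD part shrinks by kernel_speedup.
    return 1.0 / ((1.0 - simd_fraction) + simd_fraction / kernel_speedup)


if __name__ == "__main__":
    # Assumed inputs: half the time in SIMD code, kernels 1.5x and 2x faster.
    for kernel in (1.5, 2.0):
        total = overall_speedup(0.5, kernel)
        print(f"kernels {kernel:.1f}x faster -> decoder {total:.2f}x faster")
```

Under these assumptions, even an ideal 2x on every SIMD function tops out at about 1.33x overall, and a 1.5x kernel gain gives 1.2x, which is in line with the ~20% seen in the single-threaded chart.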
Old 29th October 2019, 01:23   #9  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 82
Quote:
Originally Posted by NikosD View Post
TBH, I remembered ffvp9 to be one of the best optimized decoders ever and I thought it was due to AVX2 and not SSSE3 optimizations.
There's a reasonable amount, but it's sort of the inverse of dav1d: we really did go all out in dav1d, doing everything-and-the-kitchen-sink in AVX2, and then we did SSSE3 later, doing most of it, but not quite everything. For ffvp9, it was the other way around: we did everything-and-more in SSSE3, and then did a couple of things (some MC, some inverse transforms) in AVX2, but the smaller inverse transforms and MC, as well as the loopfilters and most intra predictors, were never done. So it's fairly incomplete.

Quote:
Originally Posted by NikosD View Post
it seems that all decoders are doomed in the SSEx vs AVX2 battle.
That's a little negative. But yes, you won't get a 2x (or even 1.5x) speedup. 1.2x is nothing bad, though. And this is straight Haswell; newer chipsets (Zen2, Skylake) will get more, as will encoders.
Old 29th October 2019, 13:28   #10  |  Link
excellentswordfight
Lost my old account :(
 
Join Date: Jul 2017
Posts: 170
Quote:
Originally Posted by nevcairiel View Post
dav1d's AVX2 is fine. If you want to properly compare SSSE3 vs AVX2, then you need to look at single-threaded benchmarks. Multi-threading is often limited in scaling, where such differences can "hide".
But you should also not expect twice the performance from AVX2, since once you optimize everything possible with SSSE3/AVX2, the remaining parts that cannot be optimized so easily will impact the performance the most.
What dav1d version does the stable 0.74.1 LAV filter use?

Tried to play that 2160p60 sample from Netflix with 0.74.1 and mpc-be on an i7-7500U; 4 threads at 100% load at 3.2GHz, could barely open the file, and after 30s it started playing at 2-10fps (downscaled to 1080p). Not that I was expecting any smooth playback, but is this "normal" performance? HEVC 10-bit software decoding is about 3x faster on the same setup.

Last edited by excellentswordfight; 29th October 2019 at 14:26.
Old 29th October 2019, 14:30   #11  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,044
Quote:
Originally Posted by excellentswordfight View Post
What dav1d version does the stable 0.74.1 LAV filter use?
0.2.1, the newest available at the time. You can use a nightly version which would come with 0.5.1, the newest available right now.
That won't necessarily guarantee that 2160p60 will play on a mobile U-series CPU, but it has the best chance.

Just be careful not to pick the 10-bit variant of the Netflix Chimera video. 10-bit is not optimized at all yet, and it's not representative of real-world content yet. YouTube, for example, only delivers 8-bit AV1 so far.
And since there is no 8-bit 2160p variant of Chimera, that's your answer.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders

Last edited by nevcairiel; 29th October 2019 at 14:35.
Old 29th October 2019, 20:40   #12  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
I'm about to start a few benchmarks using various versions of dAV1d regarding SSSE3 and AVX2 progress on Core2Duo, Haswell, Skylake and Coffee Lake Refresh CPUs.

Is there a link to 1080p and 4K 8-bit AV1 sample videos to test with?
Old 30th October 2019, 11:04   #13  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
dAV1d benchmarks: 0.5.1 vs 0.2.1 on Core2Duo, Haswell, Skylake, Coffee Lake R

OK, here we are.

Test Systems:
Skylake Core i5 6500 (TDP 65W) - Win 10 v1809 (17763.805) - 8GB DDR4-2133 MHz (1 DIMM - Single Channel)
All-core-turbo 3.3GHz

Haswell Core i3 4170 (TDP 54W) - Win 10 v1903 (18362.449) - 16GB DDR3-1600 MHz (Dual Channel 2x8GB)
Fixed 3.7GHz clock

Coffee Lake Refresh Core i3 9100F (TDP 65W) - Win 10 v1903 (18362.449) - 16GB DDR4-2400 MHz (Dual Channel 2x8GB)
All-core-turbo 4.0GHz

Merom Core2Duo T7600 (TDP 34W) - Win 10 v1809 (17763.805) - 4GB DDR2-667 MHz (Dual Channel Interleaved 2x2GB)
Fixed 2.33GHz clock


SW Tools:
DXVA Checker v4.2.1

LAV filters 0.74.1 (dAV1d 0.2.1 - 12/03/2019)
LAV filters 0.74.1-29 (dAV1d 0.5.1 - 26/10/2019)

During the whole benchmarking procedure, the Core i5 6500 never dropped its turbo clock of 3.3 GHz speed and Core i3 9100F never dropped its turbo clock of 4.0 GHz speed either.

Core i5 6500:
Max TDP for 1080p ~33W
Max TDP for 4K ~36W

Core i3 4170
Max TDP for 4K ~35W

Core i3 9100F
Max TDP for 4K ~54W

Core2Duo T7600
No tool can read Power Consumption


All video samples below are 8bit.

Chimera 1080p24fps sample is from Netflix
Dua Lipa 1080p25fps sample is from Youtube
Holi Festival 4K25fps sample is from Elecard (thanks @HolyWu)
Summer Nature 4K25fps sample is from Elecard

The numbers below represent FramesPerSecond (FPS) expressed as minimum/average/maximum.

1080p

Chimera ~6.6Mbps

Core i5 6500 86/134/290 CPU 87% -0.5.1
Core i5 6500 77/127/273 CPU 91% -0.2.1

Core2Duo T7600 10/19/94 CPU 72% -0.5.1
Core2Duo T7600 8/17/100 CPU 87% -0.2.1


Dua Lipa ~2.2Mbps

Core i5 6500 120/186/251 CPU 87% -0.5.1
Core i5 6500 112/186/255 CPU 91% -0.2.1

Core2Duo T7600 7/18/70 CPU 65% -0.5.1
Core2Duo T7600 7/18/69 CPU 84% -0.2.1



4K

Holi Festival ~14Mbps

Core i5 6500 34/43/61 CPU 94% -0.5.1
Core i5 6500 30/40/60 CPU 95% -0.2.1


Summer Nature ~23Mbps

Core i3 9100F 45/60/82 CPU 91% -0.5.1

Core i5 6500 32/43/57 CPU 93% -0.5.1
Core i5 6500 26/37/50 CPU 91% -0.2.1

Core i3 4170 21/30/46 CPU 92% -0.5.1
Core i3 4170 16/27/41 CPU 90% -0.2.1



Comments:

0) Sorry guys... dAV1d 0.5.1 has a serious CPU utilization problem on my laptop Core2Duo, essentially wiping out any gain from the SSSE3 optimizations.
Dua Lipa has 0% gain over 0.2.1 and Chimera has only 11% on average.
The situation is a disaster for the SSSE3 optimizations.

1) Seven months after the 0.2.1 release, I would say that the dAV1d team certainly was not busy doing AVX2 optimizations.
It looks like 0.5.1 is only 0% - 8% faster than 0.2.1 on Skylake, apart from the last 4K clip, which gets a nice 16% gain.

2) The Skylake Core i5 6500 is certainly not capable of decoding anything more than 4K30fps AV1 at up to ~20Mbps without dropping frames, even with the latest version.

3) The Coffee Lake R Core i3 9100F is closer to 4K60fps, but its minimum frame rate is still well below 60fps.

4) 0.5.1 lowered CPU utilization a little on Skylake (but enormously on Core2Duo) compared to 0.2.1, eating up some of the latest version's performance gains on Skylake.

The only time CPU utilization increased compared to 0.2.1, the gain was a good 16%.

5) The Core i3 9100F vs Core i5 6500 results show that CFL-R is ~15% faster than its clock advantage alone would explain, probably due to its much faster memory configuration.
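The percentages in these comments can be re-derived from the average-FPS column of the tables above. A minimal Python sketch, using only the averages listed in this post; the helper names are made up for illustration, and the clock speeds are the all-core turbo values given under Test Systems:

```python
# Average FPS values copied from the min/avg/max figures in this post.
# Layout: (cpu, clip) -> (avg FPS with 0.2.1, avg FPS with 0.5.1)
AVG_FPS = {
    ("Core i5 6500", "Chimera 1080p"): (127, 134),
    ("Core i5 6500", "Dua Lipa 1080p"): (186, 186),
    ("Core i5 6500", "Holi Festival 4K"): (40, 43),
    ("Core i5 6500", "Summer Nature 4K"): (37, 43),
    ("Core i3 4170", "Summer Nature 4K"): (27, 30),
    ("Core2Duo T7600", "Chimera 1080p"): (17, 19),
    ("Core2Duo T7600", "Dua Lipa 1080p"): (18, 18),
}


def gain_percent(old_fps: float, new_fps: float) -> float:
    """Relative FPS gain of the new version over the old one, in percent."""
    return (new_fps / old_fps - 1.0) * 100.0


def per_clock_ratio(fps_a: float, ghz_a: float, fps_b: float, ghz_b: float) -> float:
    """How much faster CPU A is than CPU B after dividing out the clock ratio."""
    return (fps_a / fps_b) / (ghz_a / ghz_b)


if __name__ == "__main__":
    for (cpu, clip), (old, new) in AVG_FPS.items():
        print(f"{cpu:15s} {clip:18s} {gain_percent(old, new):5.1f}%")
    # Summer Nature 4K on 0.5.1: i3 9100F (60 FPS, 4.0 GHz) vs i5 6500 (43 FPS, 3.3 GHz)
    print(f"per-clock ratio: {per_clock_ratio(60, 4.0, 43, 3.3):.2f}")
```

For Summer Nature on 0.5.1 the per-clock ratio comes out at about 1.15, which is where the ~15% in comment 5 comes from. The averages are rounded to whole frames, so the derived percentages are only accurate to a point or so.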

Overall, the results comparing 0.2.1 vs 0.5.1 were bad for both SIMD instruction sets, SSSE3 and AVX2.

Over the last 7 months I see no progress according to my benchmarks, and I really wonder where all those huge gains the dAV1d team reported for 0.5.1 vs 0.2.1 came from.

Is there a difference when using a desktop Core2Duo?

Really looking forward to your tests and feedback.

Last edited by NikosD; 31st October 2019 at 04:30.
Old 2nd November 2019, 06:20   #14  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
@Beelzebubu
@nevcairiel

Guys, I posted a huge benchmark report on the dAV1d decoder's progress between versions 0.2.1 and 0.5.1, i.e. over the last seven months, and I have seen no replies or reactions from you since.

Can you confirm or refute my findings with results of your own?

I have seen a lot of huge numbers about dAV1d's progress from the dAV1d team in the official release notes, which I couldn't confirm, but in here you are very quiet.

Waiting for your feedback!
Old 2nd November 2019, 10:10   #15  |  Link
SmilingWolf
I am maddo saientisto!
 
 
Join Date: Aug 2018
Posts: 103
TLDR:
Win7 64bits, i7-4770k, 3.40GHz (stock), improvement between 0.2.1 and 0.5.1 using only SSSE3 accelerated routines, single thread:
Chimera: 33.2%
Dua Lipa: 34.9%

Included are some AVX2 tests too, because yes.

Code:
# time ./dav1d-0.2.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null

real    5m27,012s
user    0m0,000s
sys     0m0,000s

# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null

real    3m38,449s
user    0m0,000s
sys     0m0,000s

# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask avx2 -o /dev/null

real    3m5,282s
user    0m0,000s
sys     0m0,000s

# time ./dav1d-0.2.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null

real    2m42,726s
user    0m0,000s
sys     0m0,000s

# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null

real    1m45,987s
user    0m0,000s
sys     0m0,000s

# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask avx2 -o /dev/null

real    1m22,243s
user    0m0,000s
sys     0m0,000s
Overall, the results comparing 0.2.1 vs 0.5.1 were good for both SIMD instruction sets, SSSE3 and AVX2.

Over the last 7 months I see progress according to my benchmarks, and I really don't have to wonder where all those huge gains the dAV1d team reported for 0.5.1 vs 0.2.1 came from.
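For anyone who wants to re-check the TLDR percentages, here is a minimal Python sketch that parses the `real` lines from the `time` output above (with the comma decimal separator this shell prints) and computes the improvement as a reduction in wall-clock time. The helper names are illustrative and not part of any dav1d tooling:

```python
import re

# Matches e.g. "5m27,012s" or "5m27.012s" as printed by the shell's `time`.
_TIME_RE = re.compile(r"^(\d+)m(\d+)[,.](\d+)s$")


def parse_real_time(text: str) -> float:
    """Convert a `time` value such as '5m27,012s' to seconds."""
    match = _TIME_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {text!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")


def time_reduction_percent(old_seconds: float, new_seconds: float) -> float:
    """Share of the old wall-clock time that the new version saves, in percent."""
    return (old_seconds - new_seconds) / old_seconds * 100.0


def throughput_gain_percent(old_seconds: float, new_seconds: float) -> float:
    """Same comparison expressed as a frames-per-second gain, in percent."""
    return (old_seconds / new_seconds - 1.0) * 100.0


if __name__ == "__main__":
    # Single-threaded SSSE3 runs from the listing above: (0.2.1, 0.5.1).
    runs = {
        "Chimera": ("5m27,012s", "3m38,449s"),
        "Dua Lipa": ("2m42,726s", "1m45,987s"),
    }
    for clip, (old_text, new_text) in runs.items():
        old, new = parse_real_time(old_text), parse_real_time(new_text)
        print(f"{clip}: {time_reduction_percent(old, new):.1f}% less time, "
              f"{throughput_gain_percent(old, new):.1f}% more throughput")
```

The 33.2% and 34.9% in the TLDR are time reductions; expressed as throughput, the same Chimera runs correspond to roughly a 50% gain, so it matters which of the two definitions a benchmark quotes.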

Last edited by SmilingWolf; 2nd November 2019 at 10:14.
Old 2nd November 2019, 11:31   #16  |  Link
SmilingWolf
I am maddo saientisto!
 
 
Join Date: Aug 2018
Posts: 103
FFmpeg says it's going to use 4 frame threads and 3 tile threads to decode the files, so I'll be using those numbers.

Chimera: 34%
Dua Lipa: 29.7%

Code:
# time ./dav1d-0.2.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null

real    1m40,657s
user    0m0,000s
sys     0m0,000s

# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null

real    1m6,398s
user    0m0,000s
sys     0m0,000s

# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask avx2 -o /dev/null

real    0m54,087s
user    0m0,000s
sys     0m0,000s

# time ./dav1d-0.2.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null

real    0m46,972s
user    0m0,000s
sys     0m0,000s

# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null

real    0m33,041s
user    0m0,000s
sys     0m0,000s

# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask avx2 -o /dev/null

real    0m25,912s
user    0m0,000s
sys     0m0,000s
Old 2nd November 2019, 12:11   #17  |  Link
SmilingWolf
I am maddo saientisto!
 
 
Join Date: Aug 2018
Posts: 103
The tool is from the dav1d project, found here: https://code.videolan.org/videolan/d...e/master/tools

You can either compile it yourself (MABS can do that) or use my copy: https://mega.nz/#!op5gGSTD!JPyhq1IqJ...g8SyR2HbSTs7gk

Finding the best frame/tile threads numbers is a bit tricky. Fiddling can improve performance, and I made some quick tests that brought Dua Lipa down to 29 seconds on 0.5.1+SSSE3 using --framethreads 6 --tilethreads 2, but I did not want to post them because in the end almost no media player is going to let you stray from FFmpeg, and therefore LAVFilters, defaults.
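The fiddling can be automated. A rough Python sketch that builds the same command lines used earlier in this thread for a grid of frame/tile thread counts and times each one; the binary path, the input file and the grid values are placeholders, and the flags are simply the ones shown in my earlier posts for these dav1d builds:

```python
import itertools
import subprocess
import time
from typing import Callable, Iterable, List, Tuple


def build_command(binary: str, input_path: str, frame_threads: int,
                  tile_threads: int, cpumask: str = "ssse3") -> List[str]:
    """Build a dav1d command line using the flags shown in this thread."""
    return [binary, "-q", "-i", input_path, "--muxer", "yuv4mpeg2",
            "--framethreads", str(frame_threads),
            "--tilethreads", str(tile_threads),
            "--cpumask", cpumask, "-o", "/dev/null"]


def time_command(cmd: List[str]) -> float:
    """Run the command once and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def sweep(binary: str, input_path: str, frame_counts: Iterable[int],
          tile_counts: Iterable[int], cpumask: str = "ssse3",
          runner: Callable[[List[str]], float] = time_command
          ) -> List[Tuple[float, int, int]]:
    """Time every (frame, tile) combination; fastest result first."""
    results = []
    for frames, tiles in itertools.product(frame_counts, tile_counts):
        elapsed = runner(build_command(binary, input_path, frames, tiles, cpumask))
        results.append((elapsed, frames, tiles))
    return sorted(results)


if __name__ == "__main__":
    # Dry run: only print the command lines that a real sweep would execute.
    for frames, tiles in itertools.product((4, 6), (2, 3)):
        print(" ".join(build_command("./dav1d-0.5.1.exe", "Dua_Lipa.ivf", frames, tiles)))
```

Calling `sweep("./dav1d-0.5.1.exe", "Dua_Lipa.ivf", (4, 6), (2, 3))` would actually run the decodes. A single run per setting is noisy, so repeating each combination a few times and keeping the best time would be a sensible extension.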

Last edited by SmilingWolf; 2nd November 2019 at 12:13.
Old 2nd November 2019, 12:33   #18  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
Quote:
Originally Posted by SmilingWolf View Post
The tool is from the dav1d project, found here: https://code.videolan.org/videolan/d...e/master/tools

You can either compile it yourself (MABS can do that) or use my copy: https://mega.nz/#!op5gGSTD!JPyhq1IqJ...g8SyR2HbSTs7gk

Finding the best frame/tile threads numbers is a bit tricky. Fiddling can improve performance, and I made some quick tests that brought Dua Lipa down to 29 seconds on 0.5.1+SSSE3 using --framethreads 6 --tilethreads 2, but I did not want to post them because in the end almost no media player is going to let you stray from FFmpeg, and therefore LAVFilters, defaults.
Then, Houston, we have a problem.
We have seriously contradictory results for dAV1d's multi-threaded performance in ffmpeg vs LAV filters, according to your tests and mine.

It could be your builds vs nevcairiel's builds, the dAV1d setup in LAV vs ffmpeg, or the hyperthreaded nature of the 4770K.

If you don't want to raise the thread count to reach 100% CPU utilization, you could disable hyperthreading in the BIOS and run the tests again with 4 threads.

Also, you could run a LAV filters benchmark using GraphStudioNext or DXVA Checker on your Core i7 as-is, with hyperthreading ON, and see how that goes.
Old 2nd November 2019, 12:59   #19  |  Link
SmilingWolf
I am maddo saientisto!
 
 
Join Date: Aug 2018
Posts: 103
I can't use DXVA to check performance since the dav1d library inside the FFmpeg library inside the LAVfilters library would default to using AVX2, unless you know of a way to target a specific instruction set from within DXVA, or perhaps using an environment variable. OTOH, if you just want me to check how dav1d multithreading works in different versions of LAVFilters/ffmpeg, I can test that. But it won't be a 0.2.1 vs 0.5.1 SSSE3 benchmark anymore.

I'm not too sure why you think I'm not using all cores of my CPU. With the settings above, 4 frame 3 tile threads, I get peaks of 80% CPU usage, and an eyeballed average of around 70%.

Anyway, again just because, here is my best 0.5.1+SSSE3 Dua Lipa result so far:
Code:
# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 6 --tilethreads 3 --cpumask ssse3 -o /dev/null

real    0m28,737s
user    0m0,000s
sys     0m0,000s
Peaks at 92% CPU, hovers at around 85% average.

Last edited by SmilingWolf; 2nd November 2019 at 13:09.
Old 2nd November 2019, 14:00   #20  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
Quote:
Originally Posted by SmilingWolf View Post
I can't use DXVA to check performance since the dav1d library inside the FFmpeg library inside the LAVfilters library would default to using AVX2, unless you know of a way to target a specific instruction set from within DXVA, or perhaps using an environment variable. OTOH, if you just want me to check how dav1d multithreading works in different versions of LAVFilters/ffmpeg, I can test that. But it won't be a 0.2.1 vs 0.5.1 SSSE3 benchmark anymore.
I couldn't find a way to test specific instruction sets either, which is why I used 0.2.1 vs 0.5.1 on different hardware.
You could check SSSE3 using a Core2Duo or an AMD processor, and AVX2 on anything from Haswell onwards or a Ryzen 3000.
If you only have the i7 4770K, just check AVX2.

Quote:
Originally Posted by SmilingWolf View Post
I'm not too sure why you think I'm not using all cores of my CPU. With the settings above, 4 frame 3 tile threads, I get peaks of 80% CPU usage, and an eyeballed average of around 70%.
Because your CPU is capable of 8 threads.
I'm waiting for your AVX2 LAV filters results for 0.2.1 vs 0.5.1, preferably in the form of DXVA Checker min/avg/max plus the average CPU utilization it reports.