dav1d accelerated AV1 decoder [Archive]

Nintendo Maniac 64

27th October 2019, 19:44

I wonder why dav1d developers have dedicated time to optimize for SSE2. Isn't SSSE3 already old enough? AMD caught up and implemented SSSE3 in 2011.

AMD's non-DDR4 processors with SSSE3 were...underwhelming to say the least (the only exception being their Atom-competitor chips such as the Jaguar cores used in consoles, but it certainly wasn't their absolute performance that made them exceptions, far from it in fact).

A good amount of people saw no reason to upgrade from their SSE3-at-max AMD CPUs regardless of whether that was a Phenom II (especially those using the 6 core) or the first-gen "Llano" laptop APUs, worse still because of requiring different motherboards for either (AM3+ and FM2). Heck, when the first gen AM3+ FX CPU reviews landed, it was even common for people to instead upgrade to the Phenom II X6!

P.S. A few years ago I tested a 10-year-old laptop with a Pentium T4200 (SSSE3), which now sits unused. It could barely play YouTube VP9 720p videos and still dropped some frames.

VP9 decoding in browsers was woeful back then. I was able to run YouTube's 1080p30 VP9 encodes on 2.0GHz first-gen Core 2 Duo (2MB L2 cache) and their 1080p60 VP9 encodes on a 2.4GHz second-gen Core 2 Duo (3MB L2 cache) if I ran the video stream through MPC-HC/LAVfilters, but the results were terrible in the browser. This was because the browsers at the time all used libvpx while MPC-HC/LAVfilters used ffvp9 (which actually for quite a while ran just as terribly if your CPU didn't support SSSE3 and/or you were using 32bit MPC-HC/LAVfilters, this however isn't the case anymore)

For reference your Pentium is the same exact architecture as a second-gen Core 2 Duo but has 1MB of L2 cache, so I would expect its IPC to be similar to a first-gen 2MB L2 Core 2 Duo.

Beelzebubu

28th October 2019, 03:00

I wonder why dav1d developers have dedicated time to optimize for SSE2.

SSSE3 is done. There was a comment by Steve Robertson (Youtube) at Video@Scale this year that 10% of their userbase on x86 has no SSSE3. So we're trying to explore whether this is meaningful.

1080p on SSE2 is not our goal. The goal is to have a baseline support so ~5 years (or even earlier?) from now, AV1 can be the baseline, not H.264. We don't know for sure, but this may imply some basic need for SSE2 support. So we're exploring what is possible and how much work it'd be.

NikosD

28th October 2019, 10:29

SSSE3 is done

According to the dAV1d team, the latest version 0.5.0 is extremely fast.
Many times faster than libaom, even using SSSE3.
Based on the benchmarks below, what really surprises me is that depending on content and CPU implementation, SSSE3 code running on 128bit registers can be as fast as AVX2 code running on 256bit registers!
How is this even possible?
I'm starting to believe that your AVX2 assembly optimizations could be optimized further.
BTW, any plans for AVX-512 in the near future?
Is there any benefit to this?

https://i.postimg.cc/7h1nkFss/dav1d-0-5-x86-s.png

NikosD

28th October 2019, 11:17

Ok...So, I took a look at the single threaded performance and I see a 20% gain of AVX2 compared to SSSE3.

It is really amazing that the remaining non-optimized parts of the algorithm can account for around 80% of the performance (!)

Does that mean that all these months of writing optimized AVX2 assembly are really contributing only 20%?

I would really like to hear what the dAV1d team or other developers of software AV1 decoding say about that.

Do we really have an 80% non optimizable algorithm here ?

Looks like another implementation of Pareto law to me.

https://i.postimg.cc/3rt91v4z/1-0o-Wq-YLo1-A2x-Ho-Pa9-BSb3-SQ.png

NikosD

28th October 2019, 12:57

On what system (chipset)?

I took it from "you"
http://www.jbkempf.com/blog/post/2019/dav1d-0.5.0-release-fastest

Beelzebubu

28th October 2019, 13:50

I took it from "you"
http://www.jbkempf.com/blog/post/2019/dav1d-0.5.0-release-fastest

Few things going on there:

YMM (e.g. AVX2) functions are never exactly 2x as fast as XMM (e.g. SSSE3) functions, even in theoretical conditions;
YMM upper lane use will cause CPU downclocking (but not on modern AMD CPUs, I'm being told);
certain code in SIMD functions does not use YMM upper lanes (effectively), usually because the block size is too small (width=4-8), but sometimes because we don't want a function-pointer-call overhead (multisymbol coding);
and obviously, a lot of code is not SIMD'ed at all, it's 50%-50% between SIMD and non-SIMD at best.

Together, that means the speedup is well below half of half, so 20% is not entirely unreasonable. Sucks a bit, but you can't beat reality.
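
The arithmetic behind that estimate can be sketched with Amdahl's law. This is only an illustration: the 50% SIMD share comes from the post above, while the 1.5x per-kernel AVX2-over-SSSE3 speedup is an assumed figure, not a measurement.

```python
def overall_speedup(simd_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only simd_fraction of the
    runtime becomes kernel_speedup times faster."""
    if not 0.0 <= simd_fraction <= 1.0:
        raise ValueError("simd_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - simd_fraction) + simd_fraction / kernel_speedup)


# Assumption: 50% of decode time is SIMD, AVX2 kernels run 1.5x faster than SSSE3.
print(round(overall_speedup(0.5, 1.5), 2))  # -> 1.2
# Even an ideal 2x kernel speedup is capped well below 2x overall.
print(round(overall_speedup(0.5, 2.0), 2))  # -> 1.33
```

Under these assumptions a roughly 20% overall gain is what the formula predicts, and downclocking or functions that cannot use the upper lanes would push the per-kernel figure lower still.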

Beelzebubu

29th October 2019, 01:23

TBH, I remembered ffvp9 to be one of the best optimized decoders ever and I thought it was due to AVX2 and not SSSE3 optimizations.

There's a reasonable (http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavcodec/x86/vp9dsp_init.c;h=837cce850819c1697132e23b97afe5b4ea9da2f9;hb=HEAD#l388) amount, but it's sort of the inverse of dav1d: we really did go all out in dav1d, doing everything-and-the-kitchen-sink in AVX2, and then we did SSSE3 later, doing most of it, but not quite everything. For ffvp9, it was the other way around: we did everything-and-more in SSSE3, and then did a couple of things (some MC, some inverse transforms) in AVX2, but the smaller inverse transforms and MC, as well as the loopfilters and most intra predictors, were never done. So it's fairly incomplete.

it seems that all decoders are doomed in the SSEx vs AVX2 battle.

That's a little negative. But yes, you won't get a 2x (or even 1.5x) speedup. 1.2x is nothing bad, though. And this is straight Haswell, newer chipsets (Zen2, Skylake) will get more, as will encoders.

excellentswordfight

29th October 2019, 13:28

dav1d's AVX2 is fine. If you want to properly compare SSSE3 vs AVX2, then you need to look at Single Threaded benchmarks. Multi-Threading is often limited in scaling, where such differences can "hide".
But you should also not expect twice the performance from AVX2, since once you optimize everything possible with SSSE3/AVX2, the remaining parts that cannot be optimized so easily will impact the performance the most.

What dav1d version does the stable 0.74.1 LAV filter use?

Tried to play that 2160p60 sample from Netflix with 0.74.1 and mpc-be on an i7-7500U; 4 threads at 100% load at 3.2GHz, could barely open the file, after 30s it started playing at 2-10fps (downscaled to 1080p). Not that I was expecting any smooth playback, but is this "normal" performance? HEVC 10bit sw decoding is about 3x faster on the same setup.

nevcairiel

29th October 2019, 14:30

What dav1d version does the stable 0.74.1 LAV filter use?

0.2.1, the newest available at the time. You can use a nightly version (https://files.1f0.de/lavf/nightly/) which would come with 0.5.1, the newest available right now.
That won't necessarily guarantee that 2160p60 will play on a mobile U-series CPU, but it has the best chance.

Just be careful not to pick the 10-bit variant of the Netflix Chimera video. 10-bit is not optimized at all yet, and it's not representative of real-world content yet. YouTube for example only delivers AV1 8-bit so far.
And since there is no 8-bit 2160p variant of Chimera, that's your answer.

NikosD

29th October 2019, 20:40

I'm about to start a few benchmarks using various versions of dAV1d regarding SSSE3 and AVX2 progress on Core2Duo, Haswell, Skylake and Coffee Lake Refresh CPUs.

Is there a link with 1080p and 4K AV1 8bit sample videos to test ?

NikosD

30th October 2019, 11:04

OK, here we are.

Test Systems:
Skylake Core i5 6500 (TDP 65W) - Win 10 v1809 (17763.805) - 8GB DDR4-2133 MHz (1 DIMM - Single Channel)
All-core-turbo 3.3GHz

Haswell Core i3 4170 (TDP 54W) - Win 10 v1903 (18362.449) - 16GB DDR3-1600 MHz (Dual Channel 2x8GB)
Fixed 3.7GHz clock

Coffee Lake Refresh Core i3 9100F (TDP 65W) - Win 10 v1903 (18362.449) - 16GB DDR4-2400 MHz (Dual Channel 2x8GB)
All-core-turbo 4.0GHz

Merom Core2Duo T7600 (TDP 34W) - Win 10 v1809 (17763.805) - 4GB DDR2-667 MHz (Dual Channel Interleaved 2x2GB)
Fixed 2.33GHz clock

SW Tools:
DXVA Checker v4.2.1

LAV filters 0.74.1 (dAV1d 0.2.1 - 12/03/2019)
LAV filters 0.74.1-29 (dAV1d 0.5.1 - 26/10/2019)

During the whole benchmarking procedure, the Core i5 6500 never dropped its turbo clock of 3.3 GHz speed and Core i3 9100F never dropped its turbo clock of 4.0 GHz speed either.

Core i5 6500:
Max TDP for 1080p ~33W
Max TDP for 4K ~36W

Core i3 4170
Max TDP for 4K ~35W

Core i3 9100F
Max TDP for 4K ~54W

Core2Duo T7600
No tool can read Power Consumption

All video samples below are 8bit.

Chimera 1080p24fps sample is from Netflix
Dua Lipa 1080p25fps sample is from Youtube
Holi Festival 4K25fps sample is from Elecard (thanks @HolyWu)
Summer Nature 4K25fps sample is from Elecard

The numbers below represent FramesPerSecond (FPS) expressed as minimum/average/maximum.

1080p

Chimera ~6.6Mbps

Core i5 6500 86/134/290 CPU 87% -0.5.1
Core i5 6500 77/127/273 CPU 91% -0.2.1

Core2Duo T7600 10/19/94 CPU 72% -0.5.1
Core2Duo T7600 8/17/100 CPU 87% -0.2.1

Dua Lipa ~2.2Mbps

Core i5 6500 120/186/251 CPU 87% -0.5.1
Core i5 6500 112/186/255 CPU 91% -0.2.1

Core2Duo T7600 7/18/70 CPU 65% -0.5.1
Core2Duo T7600 7/18/69 CPU 84% -0.2.1

4K

Holi Festival ~14Mbps

Core i5 6500 34/43/61 CPU 94% -0.5.1
Core i5 6500 30/40/60 CPU 95% -0.2.1

Summer Nature ~23Mbps

Core i3 9100F 45/60/82 CPU 91% -0.5.1

Core i5 6500 32/43/57 CPU 93% -0.5.1
Core i5 6500 26/37/50 CPU 91% -0.2.1

Core i3 4170 21/30/46 CPU 92% -0.5.1
Core i3 4170 16/27/41 CPU 90% -0.2.1

Comments:

0) Sorry guys...dAV1d 0.5.1 has a serious CPU utilization problem on my laptop Core2Duo, essentially wiping out any gain from the SSSE3 optimizations.
Dua Lipa has 0% gain over 0.2.1 and Chimera has only 11% on average.
The situation is a disaster for SSSE3 optimizations.

1) Seven months after the 0.2.1 release, I would say that the dAV1d team certainly was not busy doing AVX2 optimizations.
It looks like 0.5.1 is only 0% - 8% faster than 0.2.1 on Skylake, apart from the last 4K clip, which gets a nice 16% gain.

2) Skylake Core i5 6500 is certainly not capable of decoding anything more than 4K30fps for AV1 up to ~20Mbps without dropping frames, even with the latest version.

3) Coffee Lake R Core i3 9100F is closer to 4K60fps, but still minimum frame rate is well below 60fps.

4) 0.5.1 dropped CPU utilization a little for Skylake (but enormously for Core2Duo) compared to 0.2.1, eating some of the performance optimizations of latest version for Skylake.

The only time that CPU utilization increased - compared to 0.2.1 - the gain was a good 16%.

5) Core i3 9100F vs Core i5 6500 results show that CFL-R is ~15% faster than its clock advantage alone would suggest, probably due to its much faster memory configuration.
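
The percentages in these comments can be re-derived from the average FPS figures in the tables above. The following is a minimal sketch; the helper names are made up for illustration, and the clock speeds are the all-core turbo values listed under Test Systems.

```python
def percent_gain(new_fps: float, old_fps: float) -> float:
    """Relative throughput gain of new over old, in percent."""
    return (new_fps / old_fps - 1.0) * 100.0


def per_clock_ratio(fps_a: float, ghz_a: float, fps_b: float, ghz_b: float) -> float:
    """How much faster system A is than system B after normalising by clock."""
    return (fps_a / ghz_a) / (fps_b / ghz_b)


# Average FPS, 0.5.1 vs 0.2.1, taken from the tables above.
print(round(percent_gain(134, 127), 1))  # Chimera, Core i5 6500       -> 5.5
print(round(percent_gain(43, 37), 1))    # Summer Nature, Core i5 6500 -> 16.2
print(round(percent_gain(19, 17), 1))    # Chimera, Core2Duo T7600     -> 11.8
print(round(percent_gain(18, 18), 1))    # Dua Lipa, Core2Duo T7600    -> 0.0

# Summer Nature on 0.5.1: Core i3 9100F (4.0 GHz) vs Core i5 6500 (3.3 GHz).
print(round(per_clock_ratio(60, 4.0, 43, 3.3), 2))  # -> 1.15
```

Note that these are throughput gains (FPS ratios), which is a different measure from the runtime reductions reported with `time` later in the thread.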

Overall the results comparing 0.2.1 vs 0.5.1 were bad for both instruction sets of SIMD optimizations - SSSE3 and AVX2.

Over the last 7 months I see no progress according to my benchmarks, and I really wonder where all those huge gains reported by the dAV1d team for 0.5.1 vs 0.2.1 came from.

Is there a difference when using a desktop Core2Duo?

Really looking forward to your tests and feedback.

NikosD

2nd November 2019, 06:20

SmilingWolf

2nd November 2019, 10:10

TLDR:
Win7 64bits, i7-4770k, 3.40GHz (stock), improvement between 0.2.1 and 0.5.1 using only SSSE3 accelerated routines, single thread:
Chimera: 33.2%
Dua Lipa: 34.9%

Included are some AVX2 tests too, because yes.

# time ./dav1d-0.2.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null

real 5m27,012s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null

real 3m38,449s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask avx2 -o /dev/null

real 3m5,282s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.2.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null

real 2m42,726s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null

real 1m45,987s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask avx2 -o /dev/null

real 1m22,243s
user 0m0,000s
sys 0m0,000s

Overall the results comparing 0.2.1 vs 0.5.1 were good for both instruction sets of SIMD optimizations - SSSE3 and AVX2.

The last 7 months I see progress according to my benchmarks and I really don't have to wonder where all those huge numbers of gain came from dAV1d team regarding 0.5.1 version vs 0.2.1
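
For anyone re-checking the TLDR figures, they follow from the `real` times as a runtime reduction. A minimal sketch, assuming the comma-decimal format printed above (the function names are hypothetical):

```python
import re


def parse_real(text: str) -> float:
    """Convert a shell `time` value such as '5m27,012s' into seconds.
    Accepts either ',' or '.' as the decimal separator."""
    match = re.fullmatch(r"(\d+)m(\d+)[.,](\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {text!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")


def time_reduction_percent(old: str, new: str) -> float:
    """Percentage of runtime saved going from old to new."""
    return (1.0 - parse_real(new) / parse_real(old)) * 100.0


# Single-threaded SSSE3 runs, 0.2.1 vs 0.5.1, from the output above.
print(round(time_reduction_percent("5m27,012s", "3m38,449s"), 1))  # Chimera  -> 33.2
print(round(time_reduction_percent("2m42,726s", "1m45,987s"), 1))  # Dua Lipa -> 34.9
```

A 33.2% shorter runtime corresponds to roughly 1.5x the throughput, so these numbers are not directly comparable with FPS gains quoted as percentages.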

SmilingWolf

2nd November 2019, 11:31

FFmpeg says it's going to use 4 frame threads and 3 tile threads to decode the files, so I'll be using those numbers.

Chimera: 34%
Dua Lipa: 29.7%

# time ./dav1d-0.2.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null

real 1m40,657s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null

real 1m6,398s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask avx2 -o /dev/null

real 0m54,087s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.2.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null

real 0m46,972s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null

real 0m33,041s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask avx2 -o /dev/null

real 0m25,912s
user 0m0,000s
sys 0m0,000s

SmilingWolf

2nd November 2019, 12:11

NikosD

2nd November 2019, 12:33

The tool is from the dav1d project, found here: https://code.videolan.org/videolan/dav1d/tree/master/tools

You can either compile it yourself (MABS can do that) or use my copy: https://mega.nz/#!op5gGSTD!JPyhq1IqJc8-aUksVl81YzHBl8sXMg8SyR2HbSTs7gk

Finding the best frame/tile threads numbers is a bit tricky. Fiddling can improve performance, and I made some quick tests that brought Dua Lipa down to 29 seconds on 0.5.1+SSSE3 using --framethreads 6 --tilethreads 2, but I did not want to post them because in the end almost no media player is going to let you stray from FFmpeg, and therefore LAVFilters, defaults.

Then Houston, we have a problem, because we have seriously contradicting results between the multi-threaded performance of ffmpeg and LAV filters regarding dAV1d, according to your tests and mine.

Could be your compilations vs nevcairiel's compilations, could be the setup of LAV vs ffmpeg for dAV1d or the hyperthreading nature of 4770K.

If you don't want to raise the thread count in order to reach 100% CPU utilization, you could disable hyperthreading in the BIOS and rerun the tests with 4 threads.

Also, you could run LAV filters benchmark using GraphStudioNext or DXVA Checker on your Core i7 as is with hyperthreading ON and see how that's going.

SmilingWolf

2nd November 2019, 12:59

I can't use DXVA to check performance since the dav1d library inside the FFmpeg library inside the LAVfilters library would default to using AVX2, unless you know of a way to target a specific instruction set from within DXVA, or perhaps using an environment variable. OTOH, if you just want me to check how dav1d multithreading works in different versions of LAVFilters/ffmpeg, I can test that. But it won't be a 0.2.1 vs 0.5.1 SSSE3 benchmark anymore.

I'm not too sure why you think I'm not using all cores of my CPU. With the settings above, 4 frame 3 tile threads, I get peaks of 80% CPU usage, and an eyeballed average of around 70%.

Anyway, again just because, here is my best 0.5.1+SSSE3 Dua Lipa result so far:

# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 6 --tilethreads 3 --cpumask ssse3 -o /dev/null

real 0m28,737s
user 0m0,000s
sys 0m0,000s

Peaks at 92% CPU, hovers at around 85% average.

NikosD

2nd November 2019, 14:00

I can't use DXVA to check performance since the dav1d library inside the FFmpeg library inside the LAVfilters library would default to using AVX2, unless you know of a way to target a specific instruction set from within DXVA, or perhaps using an environment variable. OTOH, if you just want me to check how dav1d multithreading works in different versions of LAVFilters/ffmpeg, I can test that. But it won't be a 0.2.1 vs 0.5.1 SSSE3 benchmark anymore.

I couldn't find a way to test specific instruction sets either, that's why I used 0.2.1 vs 0.5.1 on different hardware.
You could check SSSE3 using a Core2Duo or an AMD processor and AVX2 on any Haswell onwards or Ryzen 3000.
If you only have the i7 4770K, just check AVX2.

I'm not too sure why you think I'm not using all cores of my CPU. With the settings above, 4 frame 3 tile threads, I get peaks of 80% CPU usage, and an eyeballed average of around 70%.

Because your CPU is capable of 8 threads.
Waiting for your AVX2 LAV filters results 0.2.1 vs 0.5.1, preferably in the form of DXVA Checker min/avg/max and the average CPU utilization reported by DXVA Checker.

Beelzebubu

3rd November 2019, 01:04

@Beelzebubu
@nevcairiel

Guys, I posted a huge benchmark report regarding dAV1d decoder progress between 0.2.1 vs 0.5.1 versions, meaning for the last seven months and I see no replies or reactions from you since.

Can you confirm or reject my findings with yours, showing different things ?

I have seen a lot of huge numbers regarding dAV1d progress from the dAV1d team in the official release notes - which I couldn't confirm - but in here you are very quiet.

Waiting for your feedback!

Just to add to SmilingWolf's comments, I agree you and I have diverging results and I've been discussing with various people as to what could be the cause. I don't immediately have a solution or explanation, but I haven't forgotten about it either.

To be clear, we don't just do command-line interface tests. We test this in end-user applications such as VLC and Chrome/Firefox also, and we see the same performance improvements there that we also see in "dav1d" the commandline tool.

NikosD

3rd November 2019, 02:59

... I don't immediately have a solution or explanation, but I haven't forgotten about it either.

To be clear, we don't just do command-line interface tests. We test this in end-user applications such as VLC and Chrome/Firefox also, and we see the same performance improvements there that we also see in "dav1d" the commandline tool.

Ok, but SmilingWolf and you have tested different things than me.
Firstly, he posted single threaded performance difference and I posted multi-threaded performance difference, besides the obvious difference of the implementation.
VLC is a popular media player - no doubt about it - but here we mostly prefer other players (MPC-HC / MPC-BE / MPV.NET etc)
I don't think there is any other way to find out what is going on than to reproduce the tests yourself.
Is it possible to test the two versions of LAV's implementation I posted above?
Also, are the huge performance gains posted in various release notes of dAV1d for single-threaded or multi-threaded performance?
Thanks!

NikosD

3rd November 2019, 08:18

Conveniently forgetting about my two posts dedicated to multi threaded performance aren't we?

Conveniently forgetting about my word "firstly", as you initially posted single-threaded performance only, while I was asking you to confirm or reject my multi-threaded results, which I posted first, regarding this issue.
After my comment you posted multi-threaded results, not using the same tools and with a different threading configuration.
Anyway, the point here is to understand what's going on and not, once again, to play with words or intentions.
You could try to delete the config file of DXVA Checker and uninstall and reinstall everything.
I'm still waiting for an answer on whether the publicly reported gains between versions of dAV1d referred to single-threaded or multi-threaded performance.
BTW, how do you benchmark dAV1d with the two executables you posted here?
There is no internal command in these.

NikosD

3rd November 2019, 18:38

There isn't a DXVA Checker report yet, afternoon spent trying to make it work notwithstanding, but as I said, CPU utilization goes between 70% and 90% with the two sequences used.

The main issue of dAV1d progress between 0.2.1 and 0.5.1 is not CPU utilization.
The drop in CPU utilization on Skylake was only 2%, although on Core2Duo the drop was huge.
The main issue of dAV1d is the loss of any single-thread gain in real-world multi-thread decoding, for whatever internal reason.
In the end, the end user doesn't know and doesn't care why the Dua Lipa video has exactly the same decoding speed with dAV1d 0.2.1 and 0.5.1 on two different CPU architectures and instruction sets (Skylake using AVX2 / Core2Duo using SSSE3)
We are the ones still trying to find out why this is happening and under what circumstances.

SmilingWolf

3rd November 2019, 19:27

Oh but you seemed so worried about how much dav1d was using all my cores just one day ago.

But here, have a Chimera run:
LAVFilters 0.74.1-29:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 8929
FPS: 170,234 [103-349]
CPU Usage: -
GPU Usage: 0 [0-1] %
GPU Video Engine Usage: 0 [0-0] %

LAVFilters 0.74.1:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 8929
FPS: 139,201 [77-306]
CPU Usage: -
GPU Usage: 0 [0-1] %
GPU Video Engine Usage: 0 [0-0] %

And Dua Lipa:
LAVFilters 0.74.1-29:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 5615
FPS: 260,815 [183-335]
CPU Usage: -
GPU Usage: 0 [0-1] %
GPU Video Engine Usage: 0 [0-0] %

LAVFilters 0.74.1:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 5615
FPS: 248,936 [137-328]
CPU Usage: -
GPU Usage: 0 [0-0] %
GPU Video Engine Usage: 0 [0-0] %

NikosD

4th November 2019, 08:55

Oh but you seemed so worried about how much dav1d was using all my cores just one day ago.

My worries were, and still are, the same:
the very low gain in real-world multi-thread performance between versions 0.2.1 and 0.5.1 of the dAV1d decoder, as measured by me using the above systems and tools, compared to what the dAV1d team advertised and publicly reported regarding SSSE3 and AVX2 optimizations.
All the other comments by me, express my agony to explain by any means that huge difference.
Your results confirm mine in an absolute way regarding the Dua Lipa video, but there is a small light at the end of the tunnel regarding Chimera (regardless of the name of the video)
I think @nevcairiel could explain this better and test LAV filters' dAV1d implementation.

NikosD

4th November 2019, 13:01

@nevcairiel
@Beelzebubu
@SmilingWolf

A few more interesting notes regarding LAV filters.

LAV filters v0.74.1 allows you to set Thread = 1 but it actually uses 50% something CPU utilization, which means 2 cores = 2 threads for AV1 (using dAV1d)

But for all the other codecs, it uses only 1 thread as it should, based on the selection.

LAV filters v0.74.1-29 doesn't even allow you to set Thread = 1 because if you set it to 1, it doesn't enumerate in DXVA Checker when trying to decode AV1 files, while it can be used for all the other codecs using only 1 thread.

So, there is definitely something different regarding dAV1d integration in LAV filters, compared to all the other codecs.

In LAV filters 0.74.1-29, when setting Thread = 4, it has exactly the same performance as Auto for my Core i5 6500.

sneaker_ger

4th November 2019, 13:43

LAV filters v0.74.1 allows you to set Thread = 1 but it actually uses 50% something CPU utilization, which means 2 cores = 2 threads for AV1 (using dAV1d)
I think in LAV dav1d tilethreads are hard-coded to 2.
https://github.com/Nevcairiel/LAVFilters/blob/7f4e6dd6a45e88cbcbf9ec8a0173a247550cfc12/decoder/LAVVideo/decoders/avcodec.cpp#L370

dav1d has 2 thread number settings but LAV only exposes 1 to the user so that's just how it is. I guess nev thinks this is good enough for playback.

SmilingWolf

4th November 2019, 21:00

All the other comments by me, express my agony to explain by any means that huge difference.
And explaining by any means would be nice IF you actually bothered to follow up with a systematic approach to prove your hypothesis.
This would imply removing all the fluff, going down to the most basic level and running tests upward from there:
- use the dav1d util, single threaded, on IVF files to reduce to the minimum the amount of non-concerned code that is executed, like container parsing
-- Does it not show gains? Then you're right, there haven't been improvements
-- Does it show gains? Then you're wrong, look elsewhere
- use the dav1d util, with multiple threads.
-- Does it eat the gains? Then the problem is multithreading overhead.
-- Does it show the same gains, like I have measured? Then the problem is not single vs multithreaded performance
- use ffmpeg+dav1d, always on IVF files, singlethreaded.
-- Does it eat the gains? Then your problem is in the dav1d+ffmpeg integration
-- Does it show gains? Then look elsewhere
- use ffmpeg+dav1d, always on IVF files, multithreaded.
-- Does it eat the gains? Then your problem is in the dav1d+ffmpeg integration and the way multithreading interacts in either or both tools. I had this happen with ffmpeg+libvmaf, where one of the two would simply hang waiting for data that would never come
-- Does it show gains? Then look elsewhere
[I'm not going to write the whole thing again for AV1 files inside MKV containers but, well, if that's all you have left to look at, why not]
-- Did ffmpeg+dav1d integrate well? Then look at LAVFilters
Etc. etc. etc.
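
The first two stages of that procedure (dav1d util, single-threaded then multi-threaded) could be scripted roughly as below. This is only a sketch: it reuses the binary names, clip names and command-line flags that appear earlier in this thread, the helper names are hypothetical, and nothing is executed unless `run_timed` is called explicitly.

```python
import itertools
import subprocess
import time


def build_command(binary, ivf, frame_threads, tile_threads, cpumask):
    """Assemble a dav1d command line using the flags shown in this thread."""
    return [binary, "-q", "-i", ivf, "--muxer", "yuv4mpeg2",
            "--framethreads", str(frame_threads),
            "--tilethreads", str(tile_threads),
            "--cpumask", cpumask, "-o", "/dev/null"]


def run_timed(command):
    """Run one decode and return the wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


def test_matrix(binaries, clips, thread_configs, cpumasks):
    """Yield every combination, so each stage differs in one variable only."""
    for binary, clip, (frames, tiles), mask in itertools.product(
            binaries, clips, thread_configs, cpumasks):
        yield build_command(binary, clip, frames, tiles, mask)


BINARIES = ["./dav1d-0.2.1.exe", "./dav1d-0.5.1.exe"]
CLIPS = ["Chimera-AV1-8bit-1920x1080-6736kbps.ivf", "Dua_Lipa.ivf"]
THREAD_CONFIGS = [(1, 1), (4, 3)]  # single-threaded first, then multi-threaded
CPUMASKS = ["ssse3"]

for command in test_matrix(BINARIES, CLIPS, THREAD_CONFIGS, CPUMASKS):
    print(" ".join(command))  # replace with run_timed(command) to measure
```

Logging the `run_timed` result next to each command would give the comparison table the procedure asks for; the ffmpeg and LAVFilters stages would need their own command builders.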

compared to the advertised and publicly reported by dAV1d team regarding SSSE3 and AVX2 optimizations.
You're forgetting a whole host of volunteers who followed and helped during development: https://code.videolan.org/videolan/dav1d/issues/15

Your results confirm mine in an absolute way regarding the Dua Lipa video, but there is a small light at the end of the tunnel regarding Chimera (regardless of the name of the video)
The only thing my Dua Lipa results confirm is that most routines that would be weighing down decode performance for this particular encode had already been optimized in AVX2 by the time 0.2.1 was released. AVX2 optimization, I would like to remind you, was considered almost complete by the time 0.2.0 was released: https://code.videolan.org/videolan/dav1d/blob/bb160f09/NEWS#L96
Hell, if I have time I might even build every single tag leading to 0.2.1 to pinpoint the exact release that brought us to today's performance.

LAV filters v0.74.1-29 doesn't even allow you to set Thread = 1 because if you set it to 1, it doesn't enumerate in DXVA Checker when trying to decode AV1 files, while it can be used for all the other codecs using only 1 thread.

So, there is definitely something different regarding dAV1d integration in LAV filters, compared to all the other codecs.
Congrats, this might be your first correct conjecture in this whole ordeal.

Based on the line of code highlighted by sneaker_ger I'd say having the number of tile threads implicitly set to 2 makes dav1d bail on this line: https://code.videolan.org/videolan/dav1d/blob/b9d4630c/src/lib.c#L84
The problem, however, is not in LAVFilters, but in FFmpeg's formula for frame distribution between the two modes, starting here: https://github.com/FFmpeg/FFmpeg/blob/a34d062/libavcodec/libdav1d.c#L137
Just in case, the formula is: frame_threads = threads / tile_threads, with all numbers involved being integers. For 2 tile_threads, this solves to 0 frame_threads, as shown here: https://godbolt.org/z/cz65UY, which is below the minimum of 1 frame thread required by dav1d.
This is indeed a bug. The easiest fix would be to cast threads to float before doing the division, as shown here: https://godbolt.org/z/wBLYoo, to avoid having dav1d bail. Threads distribution will still be higher than selected, but at least it'll work.
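
The rounding problem described above can be shown in a few lines. This is a Python stand-in for the C integer arithmetic, not FFmpeg's actual code; the function names are hypothetical and the `max(1, ...)` clamp is an extra safety assumption.

```python
import math


def frame_threads_truncating(threads: int, tile_threads: int) -> int:
    """Mimics C integer division for non-negative values: the result is
    truncated, so 1 / 2 becomes 0."""
    return threads // tile_threads


def frame_threads_rounded_up(threads: int, tile_threads: int) -> int:
    """Divide as floating point and round up, never returning less than 1
    (the clamp is an assumption added for this sketch)."""
    return max(1, math.ceil(threads / tile_threads))


# Thread = 1 with tile threads fixed at 2, as in the LAV Filters case above.
print(frame_threads_truncating(1, 2))  # -> 0, below dav1d's minimum of 1
print(frame_threads_rounded_up(1, 2))  # -> 1

# With 4 threads both variants agree.
print(frame_threads_truncating(4, 2), frame_threads_rounded_up(4, 2))  # -> 2 2
```

With the rounded-up variant, 1 frame thread times 2 tile threads still exceeds the single thread requested, which matches the remark that the thread distribution will be higher than selected.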

NikosD

4th November 2019, 22:23

And explaining by any means would be nice IF you actually bothered to follow up with a systematic approach to prove your hypothesis...Then look at LAVFilters
Etc. etc. etc.

I really like your analytical thinking, but we would need two lifetimes to check all of this.
And if this whole procedure was so clear to you, why didn't you do it?
We have to suggest things that are feasible in the real world, not just crazy detailed procedures.

The only thing my Dua Lipa results confirm is that most routines that would be weighing down decode performance for this particular encode had already been optimized in AVX2 by the time 0.2.1 was released. AVX2 optimization, I would like to remind you, was considered almost complete by the time 0.2.0 was released: https://code.videolan.org/videolan/dav1d/blob/bb160f09/NEWS#L96
Hell, if I have time I might even build every single tag leading to 0.2.1 to pinpoint the exact release that brought us to today's performance.

Unfortunately there are two issues here.
Firstly, I have already said that, according to my tests, not a lot has been added to the AVX2 optimizations in 7 months, although if we followed every release note from 0.2.1 up to 0.5.1 we should see a lot more AVX2 gain than 5%.
The second, more important issue is that according to my tests using LAV filters with a Core2Duo in multi-thread mode, there is no difference in the SSSE3 optimizations between 0.2.1 and 0.5.1, which is really bad given the release notes.

Congrats, this might be your first correct conjecture in this whole ordeal.
Based on the line of code highlighted by sneaker_ger I'd say having the number of tile threads implicitly set to 2 makes dav1d bail on this line: https://code.videolan.org/videolan/dav1d/blob/b9d4630c/src/lib.c#L84...This is indeed a bug. The easiest fix would be to cast threads to float before doing the division, as shown here: https://godbolt.org/z/wBLYoo, to avoid having dav1d bail. Threads distribution will still be higher than selected, but at least it'll work.

I'm here to point out bugs, to discover bugs, or even to make developers think that something going wrong could be a bug, so I'm happy that I discovered one.
But I'm certainly not here to fix it, as I'm not a developer.
Still, the way I understand the bug and the fix you presented, I'm not sure it's going to recover the multi-thread "loss", or whatever other reason there is for 0.2.1 being so close to 0.5.1 using LAV for both AVX2 and SSSE3 according to my tests.
So, are we still looking for answers, or is the case closed after fixing the bug?

SmilingWolf

4th November 2019, 22:43

Well, that procedure is the only certain way to find the source of the slowdown. It shouldn't take more than one afternoon to run those tests, especially with some scripting and logging thrown in the mix.

And the reason I didn't follow my own procedure is that I can't reproduce your results, and have nothing to diagnose. I'm seeing between 4% (Dua Lipa) and 18% (Chimera) improvements in AVX2, and above 30% in SSSE3.
That's far above anything you're seeing on your computers, and more or less in line with what was announced:
- 0.3.0: http://www.jbkempf.com/blog/post/2019/dav1d-0.3-release%3A-even-faster%21 - "a gain of 15%-25% on SSSE3 processors; and even a 5% gain on AVX-2 processors"
- 0.5.0: http://www.jbkempf.com/blog/post/2019/dav1d-0.5.0-release-fastest - "a gain of 22%-40% on SSSE3 processors; and another gain of 4-7% on AVX-2 processors"
- 0.5.1: http://www.jbkempf.com/blog/post/2019/dav1d-0.5.1 - posted for completeness' sake only, there's no mention of SSSE3 or AVX2 speedups

Is there a specific figure you were expecting?

Still, the way I understand the bug and the fix presented by you, I'm not sure if it's going to recover the multi-thread "loss" or whatever other reason exists that 0.2.1 is so close to 0.5.1 using LAV for both AVX2 and SSSE3 according to my tests.
So, are we still looking for answers or case closed after fixing the bug ?
That's correct, no case closed yet.

From where I'm standing, the problem is that you are the only one with access to those troublesome systems.
If you want me to help by compiling different versions of dav1d or ffmpeg for Windows, I'm game, but that's as far as I can go from here.

nevcairiel

4th November 2019, 22:46

Based on the line of code highlighted by sneaker_ger I'd say having the number of tile threads implicitly set to 2 makes dav1d bail on this line: https://code.videolan.org/videolan/dav1d/blob/b9d4630c/src/lib.c#L84
The problem, however, is not in LAVFilters, but in FFmpeg's formula for frame distribution between the two modes, starting here: https://github.com/FFmpeg/FFmpeg/blob/a34d062/libavcodec/libdav1d.c#L137

LAV Filters was actually meant to avoid the calculation logic in FFmpeg entirely, but since I last looked at it, it was changed again (previously it directly took framethreads = threads). So I've adjusted how LAV configures ffmpeg-dav1d, and it should never use their calculations - and it'll now also disable all threading if you set it to 1.
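
For anyone wondering what the truncation mentioned in the quote looks like in practice, here is a minimal sketch. It assumes a hypothetical formula of the form frame_threads = ceil(threads / tile_threads) and assumes that a result of 0 is what makes dav1d bail; the function names are made up for illustration, and the real FFmpeg code at the link above may differ.

```python
import math

def frame_threads_truncating(threads, tile_threads):
    # Hypothetical formula. "//" mimics C integer division: the quotient
    # is truncated to 0 before ceil() ever sees the fractional part.
    return math.ceil(threads // tile_threads)

def frame_threads_float(threads, tile_threads):
    # Same hypothetical formula, but dividing as floats, which is what
    # casting `threads` to float before the division would do in C.
    return math.ceil(threads / tile_threads)

threads, tile_threads = 1, 2  # one thread requested, tile threads implicitly 2
print(frame_threads_truncating(threads, tile_threads))  # 0 frame threads
print(frame_threads_float(threads, tile_threads))       # 1 frame thread
```

With the float division the total (1 frame thread x 2 tile threads) is still higher than the single thread that was requested, which matches the "higher than selected, but at least it'll work" remark.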

NikosD

5th November 2019, 09:17

And the reason I didn't follow my own procedure is that I can't reproduce your results, and have nothing to diagnose. I'm seeing between 4% (Dua Lipa) and 18% (Chimera) improvements in AVX2, and above 30% in SSSE3.
Using what tools to achieve those figures and in what mode, single-thread or multi-thread ?
That's far above anything you're seeing on your computers, and more or less in line with what was announced:
- 0.3.0: http://www.jbkempf.com/blog/post/2019/dav1d-0.3-release%3A-even-faster%21 - "a gain of 15%-25% on SSSE3 processors; and even a 5% gain on AVX-2 processors"
- 0.5.0: http://www.jbkempf.com/blog/post/2019/dav1d-0.5.0-release-fastest - "a gain of 22%-40% on SSSE3 processors; and another gain of 4-7% on AVX-2 processors"
- 0.5.1: http://www.jbkempf.com/blog/post/2019/dav1d-0.5.1 - posted for completeness' sake only, there's no mention of SSSE3 or AVX2 speedups

Is there a specific figure you were expecting?
You got it all wrong here.
To be more scientifically accurate, allow me to correct you according to the publicly available release notes:

- 0.2.2 :
SSSE3 +10% of 0.2.1
AVX2 +5% of 0.2.1

- 0.3.0 :
SSSE3 +12% of 0.2.2
AVX2 +5% of 0.2.2

- 0.5.0 :
SSSE3 +40% of 0.3.0
AVX2 +(4-7%), for my calculations I take 5% on average of 0.3.0

So, if you do the math correctly we are expecting a gain between 0.2.1 and 0.5.1 versions as follows:

SSSE3 ~72%
AVX2 ~16%
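
For reference, the arithmetic behind those two totals can be sketched as follows. It assumes each release-note percentage is relative to the immediately preceding release, so the gains multiply rather than add; the helper name is made up for illustration.

```python
from math import prod

def compound_gain(step_gains_percent):
    """Total gain in percent after applying each per-release gain in sequence."""
    factor = prod(1 + g / 100 for g in step_gains_percent)
    return (factor - 1) * 100

# Per-release gains as listed above: 0.2.2, 0.3.0, 0.5.0
ssse3_total = compound_gain([10, 12, 40])
avx2_total = compound_gain([5, 5, 5])  # 5% taken as the average of 4-7% for 0.5.0

print(f"SSSE3: {ssse3_total:.1f}%")
print(f"AVX2:  {avx2_total:.1f}%")
```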

Even your troublesome calculations, as you mixed single-thread mode with multi-thread mode and dAV1d executables with lower than expected number of threads and LAV filters without managing to run DXVA Checker properly, couldn't reach those figures.
From where I'm standing, the problem is that you are the only one with access to those troublesome systems.
From where I'm standing I'm the only one with four and not two video samples measured (for both 1080p and 4K), with proper measurements using LAV filters in multi-thread mode and correct DXVA Checker results.
TBH, I'm the only one who even noticed the issue of false reporting the gains between versions, at least using LAV filters in multi-thread mode and as I proved just above, you also confirmed my claims even using dAV1d executables and without wanting to.

I'm not sure what is your connection with dAV1d team, but you are certainly not offering a good job as their unofficial "lawyer"

I'm still waiting for an answer from you or any other member of dAV1d team regarding that 16% gain of AVX2 and 72% gain of SSSE3 between 0.2.1 and 0.5.1 reported in the release notes, is it for single-thread or multi-thread mode ?

NikosD

5th November 2019, 14:19

Comparisons between LAV 0.74.1 and later nightly versions are flawed since the threading strategy changed in FFmpeg, which resulted in 0.74.1 using more frame threads than the later nightlies, making 0.74.1 artificially faster. As such, all your results are invalidated.
This is why you should use as little software as possible to do benchmarking (i.e. go as close to the core as possible), as you never know what changes might interfere with your conclusions.
So... it seems that the inconsistency of LAV filters between the threading management of 0.74.1 (0.2.1 dAV1d) and 0.74.1-29 (0.5.1 dAV1d) caused a lot of trouble for benchmarking.

Also, your decision to reject single-thread decoding for 0.74.1 and 0.74.1-29, didn't allow me and still doesn't allow me to test this kind of performance gain (single-thread)

But, as I said before, the end user using a Media Player couldn't care less for single-thread performance/gain.

It's the real-world multi-thread decoding that does matter.

I've also once again changed the thread distribution in 0.74.1-30 from last night, and while it's going to use more threads again now, similar to the old logic, it's not going to be identical to 0.74.1 in all cases (because I added more tile threads on high core-count CPUs)
OK, let's move on to new benchmarks using multi-thread performance of 0.74.1-30.

1080p

Chimera ~6.6Mbps

Core i5 6500 95/144/285 CPU 92% -0.5.1 (LAV 0.74.1-30)
Core i5 6500 86/134/290 CPU 87% -0.5.1 (LAV 0.74.1-29)
Core i5 6500 77/127/273 CPU 91% -0.2.1 (LAV 0.74.1)

Core2Duo T7600 12/22/103 CPU 87% -0.5.1 (LAV 0.74.1-30)
Core2Duo T7600 10/19/94 CPU 72% -0.5.1 (LAV 0.74.1-29)
Core2Duo T7600 8/17/100 CPU 87% -0.2.1 (LAV 0.74.1)

Dua Lipa ~2.2Mbps

Core i5 6500 135/194/255 CPU 91% -0.5.1 (LAV 0.74.1-30)
Core i5 6500 120/186/251 CPU 87% -0.5.1 (LAV 0.74.1-29)
Core i5 6500 112/186/255 CPU 91% -0.2.1 (LAV 0.74.1)

Core2Duo T7600 11/22/62 CPU 84% -0.5.1 (LAV 0.74.1-30)
Core2Duo T7600 7/18/70 CPU 65% -0.5.1 (LAV 0.74.1-29)
Core2Duo T7600 7/18/69 CPU 84% -0.2.1 (LAV 0.74.1)

4K

Holi Festival ~14Mbps

Core i5 6500 34/43/62 CPU 94% -0.5.1 (LAV 0.74.1-30)
Core i5 6500 34/43/61 CPU 94% -0.5.1 (LAV 0.74.1-29)
Core i5 6500 30/40/60 CPU 95% -0.2.1 (LAV 0.74.1)

Summer Nature ~23Mbps

Core i5 6500 31/42/55 CPU 92% -0.5.1 (LAV 0.74.1-30)
Core i5 6500 32/43/57 CPU 93% -0.5.1 (LAV 0.74.1-29)
Core i5 6500 26/37/50 CPU 91% -0.2.1 (LAV 0.74.1)

Comments:

1) Unfortunately not a lot changed regarding AVX2 optimizations in general.

For 4K clips the decoding performance didn't change at all and there is also a slight regression for Summer Nature

But for 1080p we have a gain of 13% for Chimera and 4% for Dua Lipa comparing 0.2.1 vs 0.5.1, still far away from 16% of expected gain according to release notes.

2) I'm now 100% sure that dAV1d team should be a lot more cautious regarding publicly reported gains of their versions in release notes.

IMO, they should always include real-world multi-thread gains on multiple content and resolutions (at least 1080p and 4K)

3) SSSE3 optimizations give 22% and 29% gain for 0.5.1 vs 0.2.1 on the Core2Duo CPU, which of course is far from the expected 72%, but a lot better than with the previous, badly configured LAV filters 0.74.1-29.

4) LAV filters 0.74.1-30 and 0.74.1 have the same CPU utilization, so the bug of LAV 0.74.1-29 has been fixed and we can finally compare apples to apples.
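
The percentages in the comments can be reproduced from the tables with a few lines. This is only a sketch: it assumes the middle number of each min/avg/max-looking triple is the average FPS, and it compares 0.2.1 (LAV 0.74.1) against 0.5.1 (LAV 0.74.1-30) for the 1080p clips only.

```python
def gain_percent(old_fps, new_fps):
    """Relative FPS gain of the new build over the old one, in percent."""
    return (new_fps / old_fps - 1) * 100

# (clip, CPU): (assumed average FPS on 0.2.1, assumed average FPS on 0.5.1)
results = {
    ("Chimera", "Core i5 6500"): (127, 144),
    ("Dua Lipa", "Core i5 6500"): (186, 194),
    ("Chimera", "Core2Duo T7600"): (17, 22),
    ("Dua Lipa", "Core2Duo T7600"): (18, 22),
}

gains = {key: gain_percent(old, new) for key, (old, new) in results.items()}

for (clip, cpu), gain in gains.items():
    print(f"{clip:9s} {cpu:15s} {gain:5.1f}%")
```

Note that with averages as small as 17-22 FPS, a one-frame rounding difference already moves the result by several percentage points.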

clsid

5th November 2019, 15:48

You can find the exact benchmark results from Ewout in the individual MRs. There is a link to a spreadsheet with all test results and system spec. Example:
https://code.videolan.org/videolan/dav1d/merge_requests/792

SmilingWolf

5th November 2019, 18:34

Using what tools to achieve those figures and in what mode, single-thread or multi-thread ?
You got it all wrong here.
To be more scientifically accurate, allow me to correct you according to the publicly available release notes:

- 0.2.2 :
SSSE3 +10% of 0.2.1
AVX2 +5% of 0.2.1

- 0.3.0 :
SSSE3 +12% of 0.2.2
AVX2 +5% of 0.2.2

- 0.5.0 :
SSSE3 +40% of 0.3.0
AVX2 +(4-7%), for my calculations I take 5% on average of 0.3.0

So, if you do the math correctly we are expecting a gain between 0.2.1 and 0.5.1 versions as follows:

SSSE3 ~72%
AVX2 ~16%

No, once again it's you who got it all wrong: https://code.videolan.org/videolan/dav1d/compare/0.2.2...0.3.0
A grand total of 4 commits between 0.2.2 and 0.3.0, with a stability fix, some docs updates, and no performance related commits whatsoever.

And if you had bothered to read the resources I linked to, you would have seen the numbers refer to 0.3.0 vs 0.2.1, as shown by the image on JBKempf's blog:
http://www.jbkempf.com/blog/public/VideoLAN/dav1d/0.3_SSSE3.png

So if YOU do the math correctly, you get:
- 0.3.0: http://www.jbkempf.com/blog/post/201...even-faster%21 - "a gain of 15%-25% on SSSE3 processors; and even a 5% gain on AVX-2 processors"
- 0.5.0: http://www.jbkempf.com/blog/post/201...elease-fastest - "a gain of 22%-40% on SSSE3 processors; and another gain of 4-7% on AVX-2 processors"
So, for SSSE3, max: 75%, min: 40% if you consider the numbers in the TLDR, or 37% if you consider the lowest range given within the 0.3.0 blogpost.
And for AVX2: max: 12,4%, min: 9,2%

I have already shown that, with SSSE3, I can get a 29% improvement in "FFmpeg multithread" mode on Dua Lipa, and 38,8% if playing some more extensively with the thread settings.
29% figure: http://forum.doom9.org/showthread.php?p=1889274#post1889274, 0.2.1 SSSE3 = 46,972s, 0.5.1 SSSE3 = 33,041s
38,8% figure: http://forum.doom9.org/showthread.php?p=1889289#post1889289, 0.5.1 SSSE3 with "nonstandard" thread settings: 28,737s
Admittedly close to the low end of the promised speedups, but definitely within the given range.

The 4% and 18% figures come from this post: http://forum.doom9.org/showthread.php?p=1889442#post1889442
Your beloved DXVA checker, LAVFilters 0.74.1 vs 0.74.1-29, AVX2, default multithreading, basically same conditions as you:
Chimera average FPS: 139,201 -> 170,234 = 18,2% slowdown when going from the most recent to the older, or 22% speedup when doing the opposite
Dua Lipa average FPS: 248,936 -> 260,815 = 4.6% slowdown when going from the most recent to the older, or 4.8% speedup when doing the opposite
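
Since the same pair of measurements gives a different percentage depending on which way you look at it, here is a small sketch of the three conversions used in this post. The function names are made up for illustration; the inputs are the figures quoted above (decoding times in seconds, average FPS from DXVA Checker).

```python
def speedup_percent(old_fps, new_fps):
    """FPS gain when going from the older build to the newer one."""
    return (new_fps / old_fps - 1) * 100

def slowdown_percent(old_fps, new_fps):
    """FPS loss when going from the newer build back to the older one."""
    return (1 - old_fps / new_fps) * 100

def time_saved_percent(old_seconds, new_seconds):
    """Reduction in decoding time for a fixed clip."""
    return (1 - new_seconds / old_seconds) * 100

# Dua Lipa, SSSE3, decoding time in seconds
default_threads = time_saved_percent(46.972, 33.041)
custom_threads = time_saved_percent(46.972, 28.737)
print(f"{default_threads:.1f}% / {custom_threads:.1f}% less decoding time")

# AVX2, average FPS, LAVFilters 0.74.1 -> 0.74.1-29
for clip, old, new in [("Chimera", 139.201, 170.234), ("Dua Lipa", 248.936, 260.815)]:
    print(f"{clip}: {slowdown_percent(old, new):.1f}% slowdown, "
          f"{speedup_percent(old, new):.1f}% speedup")
```

The slowdown figure is always the smaller of the two because it is taken relative to the larger (newer) number.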

Moreover, you keep yelling at a whole bunch of clouds (https://i.kym-cdn.com/news_feeds/icons/mobile/000/019/234/3ad.jpg): it has been shown that a bunch of different projects have undergone a bunch of changes that make both your and my DXVA Checker measurements completely unreliable to find out about dav1d improvements or lack thereof, yet you insist.

Meanwhile, all explanations (but your own), offers of help and alternative, more reliable solutions have been met with utter hostility. At this point, all resources are exhausted. You're right. dav1d is crap, the developers are incompetent, and you can live in your happy world where you can be mad at something.

OR you could start doing as suggested, and MAYBE we'll find out exactly where the problem lies, and possibly fix it.

NikosD

6th November 2019, 11:15

No, once again it's you who got it all wrong: https://code.videolan.org/videolan/dav1d/compare/0.2.2...0.3.0
A grand total of 4 commits between 0.2.2 and 0.3.0, with a stability fix, some docs updates, and no performance related commits whatsoever.

And if you had bothered to read the resources I linked to, you would have seen the numbers refer to 0.3.0 vs 0.2.1, as shown by the image on JBKempf's blog
I really like names like Jean-Baptiste or Jesus from Nazareth, but I like more to read the official release notes than specific blogs: https://code.videolan.org/videolan/dav1d/-/releases

So, what do we have here ?
0.2.2 brings large improvements in speed on ARM64 and SSSE3 (more than 10% speed increase) and even manages to gain around 5% on the already fast AVX-2 implementation.
10% for SSSE3 and 5% for AVX2 using 0.2.2 compared to previous version aka 0.2.1
0.3.0 brings large improvements in speed on ARM64 (15% speedup) and SSSE3 (more than 12% fps increase) and even manages to gain around 5% on the already fast AVX-2 implementation.
Another 12% for SSSE3 and 5% for AVX2 using 0.3.0 compared to previous version aka 0.2.2
0.5.0 brings large improvements in speed on SSSE3 CPU (up to 40% speedup), new speed improvements on AVX-2 (for 4-7%) and ARM64 (up to 10%) and ARM32. It introduces some VSX, SSE2 and SSE4 optimizations.
Another 40% for SSSE3 and 4-7% for AVX2 using 0.5.0 compared to previous version aka 0.3.0.

Once again, please do the math.

It's ~72% from 0.2.1 to 0.5.1 regarding SSSE3 optimizations and ~16% for AVX2, according to the official, publicly released notes.
I have already shown that, with SSSE3, I can get a 29% improvement in "FFmpeg multithread" mode on Dua Lipa, and 38,8% if playing some more extensively with the thread settings.

Admittedly close to the low end of the promised speedups, but definitely within the given range.
I have updated my previous post regarding benchmarks and I get 22% and 29% for Chimera and Dua Lipa using my Core2Duo, still too far away from 72%

Moreover, you keep yelling at a whole bunch of clouds (https://i.kym-cdn.com/news_feeds/icons/mobile/000/019/234/3ad.jpg): it has been shown that a bunch of different projects have undergone a bunch of changes that make both your and my DXVA Checker measurements completely unreliable to find out about dav1d improvements or lack thereof, yet you insist.

Meanwhile, all explanations (but your own), offers of help and alternative, more reliable solutions have been met with utter hostility. At this point, all resources are exhausted. You're right. dav1d is crap, the developers are incompetent, and you can live in your happy world where you can be mad at something.
SmilingWolf with a Big Mouth, I could easily add.
You can find the exact benchmark results from Ewout in the individual MRs. There is a link to a spreadsheet with all test results and system spec. Example:
https://code.videolan.org/videolan/dav1d/merge_requests/792
Interesting specs... 2 x Xeon with AVX2, DDR4 etc. = 2x14 cores = 28 cores with hyperthreading for testing SSSE3.

He has an average gain of ~23% which is in the range of my 22% to 29% gain, but I don't understand how the build versions used by him are connected to final versions (0.2.1, 0.2.2 etc)

But his results made me struggle to understand what is really going on with SSSE3 and propose something different.

My first Haswell processor was a Pentium with artificially disabled AVX/AVX2 instructions.

So, I remembered late yesterday night and confirmed with my 2013 (!) benchmark results that my 128bit SIMD (SSEx) benchmarks running on Pentium Haswell, were a lot faster at the same clock than my desktop Core2Duo E7300, unusually faster and not justified by the architecture differences.
It was like running 128bit instructions on 256bit registers and I say that because of the huge difference.

My suggestion:
@Beelzebubu, dAV1d team, x265/x264 fans, @doom9 and everyone else running benchmarks on different SIMD optimizations.

If you want to benchmark specific 128bit SIMD optimizations and your target group is not only Pentiums/ Celerons with disabled AVX/AVX2 sets, but legacy hardware with SSEx only SIMD, then I suggest to run the tests on REAL SSEx-only (128bit only) capable hardware (e.g Core2Duo, Core2Quad or Core iX first generation) and not an emulation like running 128bit SSEx code with artificially disabled 256bit SIMD optimizations, but on a lot faster DDR4 and 256bit register capable CPU like 2 x Xeon (!)

I think you are going to be surprised by the results and these results could explain some performance difference.

Beelzebubu

6th November 2019, 15:30

My suggestion:
@Beelzebubu, dAV1d team [..]

If you want to benchmark specific 128bit SIMD optimizations and your target group is not only Pentiums/ Celerons with disabled AVX/AVX2 sets, but legacy hardware with SSEx only SIMD, then I suggest to run the tests on REAL SSEx-only (128bit only) capable hardware (e.g Core2Duo, Core2Quad or Core iX first generation) and not an emulation like running 128bit SSEx code with artificially disabled 256bit SIMD optimizations, but on a lot faster DDR4 and 256bit register capable CPU like 2 x Xeon (!)

That's a fair request, we can look into doing that.

marcomsousa

8th November 2019, 13:14

AOMedia Research Symposium 2019 Videos
https://www.youtube.com/playlist?list=PL97T7zfqOOF3YKvniyywewtWKpxXky8iI

Adding some titles

Youtube - https://www.youtube.com/watch?v=dqpEcNB6ltw
Facebook - https://www.youtube.com/watch?v=fgztbt6HLs4
Netflix - https://www.youtube.com/watch?v=M1vwnI0vbMI
Dav1d and Eve-AV1 - https://www.youtube.com/watch?v=Jy_89NcVpk4
SVT-AV1 - https://www.youtube.com/watch?v=zXvoBVZmkHs
AV1 in RT - https://www.youtube.com/watch?v=Uf90zOw6rcE
AV1 in RT in WebRTC - https://www.youtube.com/watch?v=McmR8MhjbQk
Deep Neural Network Based Frame Reconstruction For AV2 - https://www.youtube.com/watch?v=QIoJfY9IIH0
Deep Learning - https://www.youtube.com/watch?v=WJd9qF4OceI
Lesson learnt from WebP - https://www.youtube.com/watch?v=zogxpP2lm-o
AV1: Nits, Nitpicks and Shortcomings [Things we should fix for AV2] Mozilla - https://www.youtube.com/watch?v=Paf8JcO682Y

soresu

20th November 2019, 17:24

stadia "works" with pretty much every device. you don't need a chromecast a phone can do it so can a web browser on the PC. there is missing support for iOS and such but what ever.

It would drain battery tout suite, but dav1d 0.5.1 is more than fast enough to decode 1080p60 on any iPhone or iPad from the last 2-3 years, perhaps even 4K (though 4K60 seems doubtful).

Even the less impressive Cortex big cores on Snapdragon could handle 1080p60.

Obviously lacking ASIC decoder support is not ideal, but at least it is some support rather than nothing.

Here's hoping future improvements to the GPU code in dav1d will make decoding even more efficient for pre-ASIC devices than the initial GSoC 2019 efforts.

utack

3rd December 2019, 22:47

Beating a dead horse at this point maybe, but dav1d having a milestone for better PPC support and no word about adding basic 10bit support seems extremely odd
Netflix has signalled they are only interested in 10bit content, Youtube started encoding 10bit for their new higher resolution videos as well.
It should clearly be a priority over armv7 and PPC assembly, and imho all the "make it fast" milestones are not reached yet.

Blue_MiSfit

4th December 2019, 02:29

I also hope that the dav1d team focuses more on 10 bit soon. I think social media / user generated content is a huge use case for AV1, and most of that content is 8 bit for now.

benwaggoner

4th December 2019, 23:06

I also hope that the dav1d team focuses more on 10 bit soon. I think social media / user generated content is a huge use case for AV1, and most of that content is 8 bit for now.
Do we have any evidence that 8-bit sources encode better in 10-bit than 8-bit in AV1? While that was true for H.264, it was much less so for HEVC, and I don't see why AV1 would have any regressions versus HEVC in that regard.

SW encoding and decoding of >8-bit content is always at least 25% slower, and can be more depending on the bottlenecks. And there's really not much point in doing >8-bit unless the source or display controller can do more than that. Most social media is consumed on phones and computers, for which very few end-to-end >8-bit pipelines exist. And with really high ppi, dithering is nigh invisible.

10-bit is much more valuable on living room screens, which are much larger and have native >8-bit support.

I think everything is going to go half float linear light for internal processing next decade, to make tone mapping, particularly of mixed color space content, way easier and better.


hajj_3

5th December 2019, 01:56

Dav1d v0.5.2 'Asiatic Cheetah' changelog:

ARM32 optimizations for loopfilter, ipred_dc|h|v
Add section-5 raw OBU demuxer
Improve the speed by reducing the L2 cache collisions
Fix minor issues, including compilation on some OSes

soresu

5th December 2019, 12:58

I also hope that the dav1d team focuses more on 10 bit soon. I think social media / user generated content is a huge use case for AV1, and most of that content is 8 bit for now.

It's a shame Google pursued gAV1 rather than shunting engineer time to dav1d, they could work much faster on all fronts with some serious money behind them.

I get the whole competition is good angle, but it doesn't seem to have made much of a difference, other than prompting dav1d to shore up their ARM32 priorities - from which it seems dav1d are soundly ahead on all fronts again.

soresu

5th December 2019, 13:00

dapperdan

7th February 2020, 09:02

This announcement makes no sense: I don't know a single mobile SoC which supports AV1 HW decoding acceleration right now which means the poor users who will watch AV1 content will decimate their battery life.

They note that they're using Dav1d and supporting development of 10-bit assembly speedups, so that's what will be used until hardware arrives.

Currently it's opt in, you need to flick the pre-existing switch that says you want to save data when watching on mobile. I assume most of the time this just reduced file size and therefore quality, but now they have the extra option of switching codec as well.

It would be nice if they'd run the numbers and publish them, but against the backdrop of mobile streaming and a phone screen fully on, I'm not sure the lack of hardware decode will be that noticeable. I think Instagram were already shipping a VP9 software decoder on Android I don't think they were even using the fast ffmpeg decoder, just libaom.

soresu

7th February 2020, 17:21

I think Instagram were already shipping a VP9 software decoder on Android I don't think they were even using the fast ffmpeg decoder, just libaom.

Ouch, that's just being bloody minded to user battery life that is.

Here (https://code.videolan.org/videolan/dav1d/issues/215)'s the gitlab issue link for dav1d NEON, 4 opts for 16bpc already merged, 2 more lined up it seems.

Nintendo Maniac 64

8th February 2020, 23:39

dav1d v0.5.0 performance on Ubuntu Linux 20.04 LTS snapshot with the new 64core/128thread Threadripper 3990X: https://www.phoronix.com/scan.php?page=article&item=3990x-threadripper-linux&num=2

MoSal

10th February 2020, 19:32

Ouch, that's just being bloody minded to user battery life that is.

Here (https://code.videolan.org/videolan/dav1d/issues/215)'s the gitlab issue link for dav1d NEON, 4 opts for 16bpc already merged, 2 more lined up it seems.

Packaging differences. Compiling from sources and using "ninja -vC release install" puts the header(s) under <somewhere>/include/libvmaf/<headers>.h
Arch's package maintainer probably follows in the footsteps of the Debian package (https://debian.pkgs.org/10/multimedia-main-amd64/libvmaf-dev_1.3.14-dmo3_amd64.deb.html), that leaves the header under /usr/include/libvmaf.h

Arch's package maintainer is not (https://git.archlinux.org/svntogit/community.git/plain/trunk/PKGBUILD?h=packages/vmaf) following anyone's footsteps other than upstream. It's just that the Makefile (https://github.com/Netflix/vmaf/blob/e434247f8bf8f5ec36ba87062656583c073394fd/Makefile) from v1.3.15 predates the changes that made the new Makefile (https://github.com/Netflix/vmaf/blob/82a86e040371f2ca8665d6c210e3d1d2d608a636/Makefile) just a wrapper around meson/ninja.

Anyway, the last time (~18 months ago) I tried linking libvmaf to ffmpeg, I got runtime crashes after I managed to find a version that compiles. So I figured the library is not really reliable API/ABI wise for external linkage use-cases. So I opted to just script around the provided executable vmafossexec.