dav1d accelerated AV1 decoder [Archive] - Page 3

hajj_3

22nd December 2020, 00:07

To the contrary, 4k TVs are widely popular, and they all have 10 bit HDR & WCG panels.

I'll agree that the vast majority of PC / Mac systems do not have HDR / 10 bit displays ;)

hardly anyone with an x86 desktop/laptop is connected to a tv therefore it isn't a priority for netflix.

soresu

24th December 2020, 21:12

hardly anyone with an x86 desktop/laptop is connected to a tv therefore it isn't a priority for netflix.

Speak for yourself, not everyone has perfect vision and can use a sub 30 inch monitor to read with.

I'm using a 40 inch HDTV and even with that I still need to scale up the text/dpi just to read without getting a splitting headache.

I'm just hoping that when I finally upgrade to a UHDTV that all the relevant software I use has dynamically scaling UI now as many didn't when I first started noticing my vision loss 9 years ago.

soresu

24th December 2020, 21:21

To the contrary, 4k TVs are widely popular, and they all have 10 bit HDR & WCG panels.

I'll agree that the vast majority of PC / Mac systems do not have HDR / 10 bit displays ;)

Oddly not all 4K TVs are either 10 bit or HDR capable, something that took some time to figure out as my dad kept getting a 'downscaling' message on his 2015 Panasonic 4k TV when he was playing a 4K bluray.

As it turned out his TV had been made before the HDR part of the 4K UHD Bluray standard had been set in stone, so it basically just does 2160p resolution, no 10 bpc (native or 8 bpc + FRC) or HDR to be had.

It does have a Displayport connector though as well as the standard HDMI, which is a considerable oddity for a TV model.

As 4K TV's were being made well before my dad bought his I can only guess that before the full 4k UHD standard was hammered out it was somewhat like the pre 'Full HD' phase of HDTV's that had 720p and other such sub 1080p resolutions that poor people were duped into en masse.

Blue_MiSfit

24th December 2020, 23:29

Sure, there's some outliers. 2015-2016 is when things really solidified. 4K UHD TVs before 2015 were extremely rare and expensive. I'd say your dad's is in that edge territory :)

foxyshadis

27th December 2020, 01:31

hardly anyone with an x86 desktop/laptop is connected to a tv therefore it isn't a priority for netflix.

I wouldn't be surprised if they'd love to flip a switch and have HDR on all PCs, if it wasn't for gpu drivers that can't HDR their way out of a wet paper bag, and create new bugs every version since they don't use video as a regression test case at all. It's a bit of a catch-22, since gpu driver teams only put any effort into anything other than games when it becomes critical mass and they can't ignore it anymore.

soresu

27th December 2020, 02:17

Sure, there's some outliers. 2015-2016 is when things really solidified. 4K UHD TVs before 2015 were extremely rare and expensive. I'd say your dad's is in that edge territory :)

Or just plain unreliable.

The TV broke last year, less than 6 months before the 5 year warranty ran out - thankfully the new motherboard inside not only works fine but boots the TV 3-4 times faster from standby than the old one.

I'm pretty sure that it was also a 2014 model on very reduced price sale to clear stock, so not quite so expensive - I'm hoping to get that myself for a Samsung Q80T 55 inch to finally upgrade from my basic HDTV I'm using for a monitor.

soresu

27th December 2020, 20:27

dav1d is finally getting some x86 10 bpc AVX2 SIMD by porting some recent work for rav1e, no idea what sort of gains it will get yet but I'll leave the gitlab issue links in case the info gets posted there.

https://code.videolan.org/videolan/dav1d/-/merge_requests/1110

https://code.videolan.org/videolan/dav1d/-/merge_requests/1111

soresu

3rd January 2021, 19:22

Mostly more ARM32 NEON assembly.

At this point 1080p should be pretty viable on more recent streaming devices like Fire TV and Chromecast.*

*that's Chromecast 4 with A55 cores mind you.

Spyros

11th February 2021, 19:14

Blue_MiSfit

12th February 2021, 04:47

Excellent news!

benwaggoner

12th February 2021, 21:44

Some 10-bit assembly from rav1e was merged in dav1d, seemingly resulting in 25-30%+ better performance in AVX2 systems

https://i.redd.it/3u1jjjg5hgg61.png

Source + more details (https://old.reddit.com/r/AV1/comments/lg3x15/more_rav1e_hbd_assembly_ported_to_dav1d/)
What's the system these results are being measured on?

And is anyone keeping track of how PC power utilization is impacted by using SW AV1 versus a HW decoder? Back in the Silverlight days, I saw ~20 watts extra on a beefy laptop with H.264 SW decode versus HW decode. I imagine the gap is lower now, but it could be a pretty significant net environmental impact if 10M people are watching YouTube at an extra 10 watts each. Everything that can push that down is helpful.
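
As a back-of-the-envelope check of that scenario, here is a minimal sketch. Both inputs (10M concurrent viewers, 10 extra watts each) are the hypothetical figures from this post, not measurements.

```python
# Rough estimate only: both inputs are hypothetical figures, not measured data.
viewers = 10_000_000          # concurrent viewers (assumed)
extra_watts_per_viewer = 10   # extra draw of SW decode versus HW decode (assumed)


def extra_power_megawatts(viewer_count: int, watts_each: float) -> float:
    """Total extra electrical power in megawatts for the whole audience."""
    return viewer_count * watts_each / 1_000_000


total_mw = extra_power_megawatts(viewers, extra_watts_per_viewer)
print(f"{total_mw:.0f} MW extra while everyone is watching")
```

Halving the per-viewer overhead halves the total, which is why even small decoder speedups matter at that scale.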

unlord

13th February 2021, 01:09

What's the system these results are being measured on?

Author of the AVX2 patches (and graph) here. This experiment was conducted on a 3970x with a single core running at 3.7GHz. I can provide the sequences if you'd like to repeat the test on another system.

benwaggoner

13th February 2021, 02:57

Author of the AVX2 patches (and graph) here. This experiment was conducted on a 3970x with a single core running at 3.7GHz. I can provide the sequences if you'd like to repeat the test on another system.
Thank you.

Why single-core for testing, just curious? There aren't any AVX2 CPUs without at least 4 cores IIRC.

unlord

13th February 2021, 03:22

Why single-core for testing, just curious? There aren't any AVX2 CPUs without at least 4 cores IIRC.

Conducting a multi-threaded decoder comparison across implementations requires controlling for more variables. It is outdated now, but here is a comprehensive multi-threaded configuration study I ran in May 2020 comparing just dav1d to libgav1:

https://docs.google.com/spreadsheets/d/19byTEMMVuyOpqqF59eT1mwAi-W1Fhhtcqj1_4js9jSo

Note the Thread Configurations table.

hajj_3

15th February 2021, 22:15

dav1d v0.8.2:

- ARM32 optimizations for ipred and itx in 10/12bits, completing the 10b/12b work on ARM64 and ARM32
- Give the post-filters their own threads
- ARM64: rewrite the wiener functions
- Speed up coefficient decoding, 0.5%-3% global decoding gain
- x86: rewrite the SGR AVX2 asm
- x86: improve msac speed on SSE2+ machines
- ARM32: improve speed of ipred and warp
- ARM64: improve speed of ipred, cdef_dir, cdef_filter, warp_motion and itx16
- ARM32/64: improve speed of looprestoration
- Add seeking, pausing to the player
- Update the player for rendering of 10b/12b
- Misc speed improvements and fixes on all platforms
- Add a xxh3 muxer in the dav1d application

benwaggoner

17th February 2021, 01:49

Conducting a multi-threaded decoder comparison across implementations requires controlling for more variables. It is outdated now, but here is a comprehensive multi-threaded configuration study I ran in May 2020 comparing just dav1d to libgav1:

https://docs.google.com/spreadsheets/d/19byTEMMVuyOpqqF59eT1mwAi-W1Fhhtcqj1_4js9jSo

Note the Thread Configurations table.
Makes sense, and thank you!

Do you have any rough estimate for the gap in perf of a 10-bit and an 8-bit decode with equally optimized decoders?

savage747

17th February 2021, 09:42

There aren't any AVX2 CPUs without at least 4 cores IIRC.

My notebook with an i5 4300U would like to make its existence known ;-)

(2 Cores, 4 Threads, AVX2, 1.9 to 2.9 GHz)

Intel *really* loved selling dual-core mobile CPUs with similar nametags as quad-core desktop CPUs...

LigH

17th February 2021, 13:28

Hyper-Hyper threading FTW :sly:

benwaggoner

17th February 2021, 19:43

My notebook with an i5 4300U would like to make its existence known ;-)

(2 Cores, 4 Threads, AVX2, 1.9 to 2.9 GHz)

Intel *really* loved selling dual-core mobile CPUs with similar nametags as quad-core desktop CPUs...
I stand corrected!

There weren't any single core AVX2 processors at least, right?

savage747

17th February 2021, 21:01

There weren't any single core AVX2 processors at least, right?

Well, not really, as far as I know.

The closest to a single-core processor with AVX2 I know of would be the AMD A6-9500 for the current AM4 platform or the AMD A6-7480 for the old FM2+ platform. Those are marketed as two cores and two threads, but they're using AMD's ill-fated Bulldozer module architecture - so there is only one "module" (and thus only one FPU/vector unit to process AVX instructions) with two integer cores sharing the FPU/vector unit and parts of the frontend and memory system.

soresu

15th May 2021, 23:23

Massive speed up for 10 bpc content on AVX2 CPUs landed in the dav1d master recently:

https://code.videolan.org/videolan/dav1d/-/merge_requests/1195

As well as another separate commit for AVX2 mc.emu_edge on 10 bpc content:

https://code.videolan.org/videolan/dav1d/-/merge_requests/1196

The work for the main (huge) commit was sponsored by Facebook and Netflix according to the merge request.

It will be part of the dav1d 0.9 release which is likely to land pretty soon - this will also include numerous NEON asm for film grain synthesis on 8 bpc content, and the beginnings of the same for 10+ bpc content.

This release will render most 4K 10 bpc content pretty playable on many 8 core AVX2 capable CPU's, so even those that bought AMD Renoir and Cezanne based NUCs should still manage pretty well even without the AV1 ASIC coming for Van Gogh and Rembrandt APUs onward.

soresu

16th May 2021, 23:29

dav1d 0.9 (Golden Eagle) was officially released:

https://code.videolan.org/videolan/dav1d/-/releases/0.9.0

x86 (64bit) AVX2 implementation of most 10b/12b functions, which should provide a large boost for high-bitdepth decoding on modern x86 computers and servers.
ARM64 neon implementation of FilmGrain (4:2:0/4:2:2/4:4:4 8bit)
New API to signal events happening during the decoding process

benwaggoner

17th May 2021, 21:49

quietvoid

17th May 2021, 22:05

Is there any perf data yet on real-world decode fps after these changes?

https://www.phoronix.com/scan.php?page=news_item&px=AVX2-dav1d-0.9-Benchmarks

The changes also increased rav1e's encoding speed by 3x (with AVX2), since they share much of the ASM.

Beelzebubu

18th May 2021, 12:37

https://www.phoronix.com/scan.php?page=news_item&px=AVX2-dav1d-0.9-Benchmarks

The changes also increased rav1e's encoding speed by 3x (with AVX2), since they share much of the ASM.

The phoronix numbers look pretty good. Keep in mind that the Chimera clip they used has filmgrain in the 10bit version but not in the 8bit, and they enabled filmgrain application in the decoder, so it's not a perfect comparison.

In most software-decoder players, I would expect the film grain to be added in the GPU directly. dav1d contains an example (dav1dplay) that demonstrates how to do this (using libplacebo), and you will get some speed-up from this. To emulate this using the dav1d binary, use --filmgrain=0 (or in ffmpeg: -filmgrain 0). In gav1, you'd use --post_filter_mask 0xf. This is especially important because 10-bit filmgrain has no Neon SIMD optimizations yet (but 8-bit Neon/SSSE3/AVX2 and 10-bit AVX2 is present). So keep this in mind when running comparisons.
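
For anyone scripting such a comparison, a minimal sketch of building the two command lines is below. The only options taken from this post are --filmgrain=0 (dav1d) and -filmgrain 0 (ffmpeg). The file names are placeholders, and the remaining arguments (-i, -o, -f null -) are the usual CLI conventions as far as I know, so check them against your builds.

```python
import shlex
from typing import List


def dav1d_cmd(src: str, out: str, apply_grain: bool) -> List[str]:
    """Build a dav1d CLI invocation; --filmgrain=0 skips grain synthesis."""
    cmd = ["dav1d", "-i", src, "-o", out]
    if not apply_grain:
        cmd.append("--filmgrain=0")
    return cmd


def ffmpeg_cmd(src: str, apply_grain: bool) -> List[str]:
    """Build an ffmpeg decode-only invocation that discards the output."""
    cmd = ["ffmpeg"]
    if not apply_grain:
        # Decoder (input) option, so it is placed before -i.
        cmd += ["-filmgrain", "0"]
    cmd += ["-i", src, "-f", "null", "-"]
    return cmd


if __name__ == "__main__":
    # "clip.ivf" and "clip.mkv" are hypothetical file names.
    for grain in (True, False):
        print(shlex.join(dav1d_cmd("clip.ivf", "/dev/null", grain)))
        print(shlex.join(ffmpeg_cmd("clip.mkv", grain)))
```

Running each variant with and without grain shows how much of a 10-bit versus 8-bit gap is really film grain cost rather than bit depth.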

LigH

18th May 2021, 21:24

dav1d 0.9.0-0 (g8636b4f / 2021-05-16) (https://www.mediafire.com/file/2ewbyzu9l668wca/dav1d_0.9.0-0-g8636b4f.7z/file) (MSYS2 / MinGW, GCC 10.3.0)

I guess the "patches after release" increment is wrong...

soresu

19th May 2021, 07:06

This is especially important because 10-bit filmgrain has no Neon SIMD optimizations yet (but 8-bit Neon/SSSE3/AVX2 and 10-bit AVX2 is present). So keep this in mind when running comparisons.

Even the big 10 bpc AVX2 dump only covers film grain for 420 content, but as that covers probably everything currently on Youtube or any other commercial source using AV1 significantly it should be fine for most people.

There is also the beginnings of 10 bpc NEON optimisations though which was added just before 0.9 - so I would expect either a 0.9.1/0.9.2 to cover it all before too long since it was only a couple of months from the first NEON 8 bpc FG patch to the last.

benwaggoner

24th May 2021, 18:38

benwaggoner

24th May 2021, 18:42

Even the big 10 bpc AVX2 dump only covers film grain for 420 content, but as that covers probably everything currently on Youtube or any other commercial source using AV1 significantly it should be fine for most people.
Yeah, I don't see any reason why content distribution beyond 4:2:0 10-bit would happen in the 2020s. Especially where software decode may be required, as 444 is about half the speed without any quality advantage in >99% of content.

There is also the beginnings of 10 bpc NEON optimisations though which was added just before 0.9 - so I would expect either a 0.9.1/0.9.2 to cover it all before too long since it was only a couple of months from the first NEON 8 bpc FG patch to the last.
I would anticipate that 8-bit will be standard for where SW decode is needed, and 10-bit for HW, unless the perf overhead of 10-bit drops to <25% over 8-bit.

hajj_3

20th July 2021, 09:30

Dav1d 0.9.1 changelog:

- 10/12b SSSE3 optimizations for mc (avg, w_avg, mask, w_mask, emu_edge),
prep/put_bilin, prep/put_8tap, ipred (dc/h/v, paeth, smooth, pal, filter), wiener,
sgr (10b), warp8x8, deblock, film_grain, cfl_ac/pred for 32bit and 64bit x86 processors
- Film grain NEON for fguv 10/12b, fgy/fguv 8b and fgy/fguv 10/12 arm32
- Fixes for filmgrain on ARM
- itx 4x4 for SSE4
- Misc improvements on SSE2, SSE4

benwaggoner

20th July 2021, 21:38

Spyros

2nd August 2021, 14:01

Optimizations coverage

With 0.9.1, we've done most of the optimizations for 8/10/12bit on the following platforms:

Desktop CPUs with AVX2 (64bit)
Desktop CPUs with SSSE3 in 32bit
Desktop CPUs with SSSE3 in 64bit
ARM CPU in 32bit, ARMv7
ARM CPU in 64bit, ARMv8

There are still some minor optimizations left to do, but they won't change much the overall performance of the decoder.
For example, intra z1/z2/z3, 12bit SGR or 12bit itxfm are not done, but their usefulness is debatable :)

Please note also, that some optimizations were done for SSE4 and not SSSE3.

Assembly size

The portion of code in dav1d written in assembly is now reaching 140000 lines of code.

This code is composed of:

90000 lines for x86 (AVX2, SSSE3-32, SSSE3-64);
50000 lines for ARM (32bit and 64bit).

This is very large for handwritten assembly, and for comparison, this is more assembly than what there is in FFmpeg (for all codecs).

And yes, this code is faster than what the compilers can generate by themselves. :)

Source: jbkempf.com (http://www.jbkempf.com/blog/post/2021/dav1d-0.9.1-a-ton-of-asm)

benwaggoner

2nd August 2021, 20:19

That's some impressive optimization work there!

Looks like about 2x faster on Ryzen 5 and 3x faster on recent Intel.

Do we have a comparison between 8-bit and 10-bit decode performance?

Beelzebubu

6th August 2021, 13:54

Do we have a comparison between 8-bit and 10-bit decode performance?

From memory, on the starting scene (past frame 120) of Chimera, same content, same tools, same bitrate, same resolution, I saw a ~20% drop on a Haswell laptop, single-threaded. Overall, this will depend on content complexity or bitrate: high-complexity content or low-quantizer/high-bitrate encodes tend to have a smaller drop than low-complexity content or high-quantizer/low-bitrate encodes. If you want more accurate numbers, you'll have to give some insight in what you're looking for in terms of bitrate/complexity/quantizer/resolution/etc. - and probably also type of device.

(Reason: things like coefficient decoding are basically identical between 8bit and 10bit, but things like prediction are slower because they require twice the memory. Therefore, overall slowdown depends on ratio between things that require double the memory (like prediction) and things that don't (like coef decoding). Because overall memory usage is cumulative between threads, you'll see a slightly larger drop-off with more threads.)
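
The ratio argument above can be sketched as a simple two-bucket model. The fractions and the 1.5x stage slowdown below are made-up illustrative values, not profiling data from dav1d.

```python
def fps_drop(sensitive_fraction: float, stage_slowdown: float) -> float:
    """Fractional fps drop when going from 8-bit to 10-bit.

    sensitive_fraction: share of 8-bit decode time spent in stages whose cost
        grows with pixel size (prediction, filters, ...).
    stage_slowdown: how much slower those stages run at 10-bit (1.0 = same).
    """
    # Normalise 8-bit decode time to 1.0; only the sensitive share is scaled.
    time_10bit = (1.0 - sensitive_fraction) + sensitive_fraction * stage_slowdown
    return 1.0 - 1.0 / time_10bit


# Hypothetical splits: coefficient-heavy versus prediction-heavy streams.
for label, fraction in [("high-bitrate encode", 0.4), ("low-bitrate encode", 0.8)]:
    print(f"{label}: {fps_drop(fraction, 1.5) * 100:.1f}% fewer fps")
```

The more of the decode time that goes to coefficient decoding (which is unchanged at 10-bit), the smaller the overall drop, matching the high-bitrate versus low-bitrate observation.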

soresu

14th August 2021, 21:41

That's some impressive optimization work there!

Looks like about 2x faster on Ryzen 5 and 3x faster on recent Intel.

Do we have a comparison between 8-bit and 10-bit decode performance?

The SSSE3 path is currently missing the CDEF filter for 10/12 bit, which is in the merge request line-up on the dav1d gitlab. As CDEF is a major compute hog for AV1, that will give the next release (1.0.0) another step up, and some optimisations made for that particular commit will have equivalent additions to the current AVX2 CDEF filter asm.

As well as that they just landed what I think is the last of the 10 bit film grain asm (gen_grain) for ARM64 NEON.

All told 1.0.0 should be pretty much every significant SIMD path fairly well optimised for 8 and 10 bpc content minus AVX512.

lvqcl

4th September 2021, 14:05

dav1d 0.9.2 (https://code.videolan.org/videolan/dav1d/-/releases/0.9.2)

0.9.2 is a small update of dav1d on the 0.9.x branch, focusing on adding SIMD on numerous small cases:

x86: SSE4 optimizations of inverse transforms for 10bit for all sizes
x86: mc.resize optimizations with AVX2/SSSE3 for 10/12b
x86: SSSE3 optimizations for cdef_filter in 10/12b and mc_w_mask_422/444 in 8b
ARM NEON optimizations for FilmGrain Gen_grain functions
Optimizations for splat_mv in SSE2/AVX2 and NEON
x86: SGR improvements for SSSE3 CPUs
x86: AVX2 optimizations for cfl_ac

This mostly concludes SIMD for SSSE3 (32+64), AVX2 and NEON (32+64). The rest are scaled-related and z1/z2/z3 and should not bring significant improvements in speed for most cases.

hajj_3

1st March 2022, 12:23

Changes for 1.0.0 'Peregrine falcon':
-------------------------------------

1.0.0 is a major release of dav1d, adding important features and bug fixes.

It notably changes, in an important way, the way threading works, by adding
an automatic thread management.

It also adds support for AVX-512 acceleration, and adds speedups to existing x86
code (from SSE2 to AVX2).

1.0.0 adds new grain API to ease acceleration on the GPU.

Finally, 1.0.0 fixes numerous small bugs that were reported since the beginning
of the project to have a proper release.

nevcairiel

1st March 2022, 13:00

1.0.0 has not been released yet. Keep your pants on :p

Spyros

18th March 2022, 19:07

dav1d 1.0.0 was released today. (Tag (https://code.videolan.org/videolan/dav1d/-/tags/1.0.0))

Changes for 1.0.0 'Peregrine falcon':
-------------------------------------

1.0.0 is a major release of dav1d, adding important features and bug fixes.

It notably changes, in an important way, the way threading works, by adding
an automatic thread management.

It also adds support for AVX-512 acceleration, and adds speedups to existing x86
code (from SSE2 to AVX2).

1.0.0 adds new grain API to ease acceleration on the GPU, and adds an API call
to get information of which frame failed to decode, in error cases.

Finally, 1.0.0 fixes numerous small bugs that were reported since the beginning
of the project to have a proper release.

.''.
.''. . *''* :_\/_: .
:_\/_: _\(/_ .:.*_\/_* : /\ : .'.:.'.
.''.: /\ : ./)\ ':'* /\ * : '..'. -=:o:=-
:_\/_:'.:::. ' *''* * '.\'/.' _\(/_'.':'.'
: /\ : ::::: *_\/_* -= o =- /)\ ' *
'..' ':::' * /\ * .'/.\'. '
* *..* :
* :
* 1.0.0

Source: NEWS (https://code.videolan.org/videolan/dav1d/-/blob/master/NEWS)

benwaggoner

23rd March 2022, 03:54

dav1d 1.0.0 was released today. (Tag (https://code.videolan.org/videolan/dav1d/-/tags/1.0.0))
Do we know how much speedup AVX512 provided? We've not seen it to be particularly useful in encoder performance, so it'd be interesting if it helps more on the decode side.

lvqcl

23rd March 2022, 18:41

8-bit video: SSE4.1 vs AVX2 vs AVX-512 (on 8C/16T Rocket Lake) - https://code.videolan.org/videolan/dav1d/-/merge_requests/1301

Beelzebubu

23rd March 2022, 18:56

Do we know how much speedup AVX512 provided? We've not seen it to be particularly useful in encoder performance, so it'd be interesting if it helps more on the decode side.

dav1d uses the icelake subset (AWS: m6i/c6i, or: F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VPOPCNTDQ, BITALG, VNNI, VPCLMULQDQ, GFNI, VAES), not skylake subset (AWS: m5*/c5*, or: F, CD, VL, DQ, BW). Icelake's performance of AVX512 instructions is in general much better than Skylake's, but the wider instruction subset also allows for certain additional code optimizations.
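
To check whether a given Linux machine exposes that whole subset, a minimal sketch follows. The flag spellings are the /proc/cpuinfo names as I understand them (an assumption worth verifying on your kernel), and this is not how dav1d itself detects CPU features.

```python
# Assumed Linux /proc/cpuinfo spellings for the Icelake-level subset listed above.
ICELAKE_AVX512_FLAGS = {
    "avx512f", "avx512cd", "avx512vl", "avx512dq", "avx512bw",
    "avx512ifma", "avx512vbmi", "avx512_vbmi2", "avx512_vpopcntdq",
    "avx512_bitalg", "avx512_vnni", "vpclmulqdq", "gfni", "vaes",
}


def missing_icelake_flags(cpu_flags):
    """Return the sorted list of required flags absent from cpu_flags."""
    return sorted(ICELAKE_AVX512_FLAGS - set(cpu_flags))


def read_linux_cpu_flags(path="/proc/cpuinfo"):
    """Read the flag set of the first CPU listed in /proc/cpuinfo."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


if __name__ == "__main__":
    try:
        missing = missing_icelake_flags(read_linux_cpu_flags())
    except OSError:
        print("no /proc/cpuinfo here (not Linux?)")
    else:
        print("full subset present" if not missing else f"missing: {missing}")
```

A Skylake-X style CPU (F, CD, VL, DQ, BW only) would report the other nine flags as missing.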

Extreme example of the latter: 8-bit film grain (https://code.videolan.org/videolan/dav1d/-/merge_requests/1374) is more than 3x as fast with AVX512 compared to AVX2.

benwaggoner

23rd March 2022, 20:05

Wow, those are some very impressive speedups with AVX512! The new instructions are making at least as much of a difference as the "AVX2, but 2x wider" instructions.

Of course, Icelake CPUs don't have that much market share yet, but these kinds of speedups are quite promising in the long term for software decoding.

Beelzebubu

24th March 2022, 12:21

Of course, Icelake CPUs don't have that much market share yet, but these kinds of speedups are quite promising in the long term for software decoding.

... and software encoding!

Spyros

15th February 2023, 19:53

dav1d 1.1.0 was released yesterday. (Tag (https://code.videolan.org/videolan/dav1d/-/tags/1.1.0))

Changes for 1.1.0 'Arctic Peregrine Falcon':
-------------------------------------------

1.1.0 is an important release of dav1d, fixing numerous bugs, and adding SIMD

New function dav1d_get_frame_delay to query the decoder frame delay
Numerous fixes for strict conformity to the specs and samples
NEON and AVX-512 misc fixes and improvements
Partial AVX2 12bpc transform implementations
AVX-512 high bit-depth cdef_filter, loopfilter, itx
NEON z1/z3 optimization for 8bpc
SSSE3 z1 optimization for 8bpc

"From VideoLAN with love"

hajj_3

3rd May 2023, 07:43

Changes for 1.2.0 'Arctic Peregrine Falcon':
-------------------------------------------

- Improvements on attachments of props and T.35 entries on output pictures
- NEON z1/z3 high bit-depth optimizations and improvements for 8bpc
- SSSE3 z2/z3 8bpc and SSSE3 z1/z3 high bit-depth optimizations
- refmvs.save_tmvs optimizations in SSSE3/AVX2/AVX-512
- AVX-512 optimizations for high bit-depth itx (16x64, 32x64, 64x16, 64x32, 64x64)
- AVX2 optimizations for 12bpc for 16x32, 32x16, 32x32 itx

hajj_3

4th June 2023, 21:57

Changes for 1.2.1 'Arctic Peregrine Falcon':
-------------------------------------------

- Fix a threading race on task_thread.init_done
- NEON z2 8bpc and high bit-depth optimizations
- SSSE3 z2 high bit-depth optimizations
- Fix a desynced luma/chroma planes issue with Film Grain
- Reduce memory consumption
- Improve dav1d_parse_sequence_header() speed
- OBU: Improve header parsing and fix potential overflows
- OBU: Improve ITU-T T.35 parsing speed
- Misc buildsystems, CI and headers fixes

Barough

5th October 2023, 20:54

Changes for 1.3.0 'Tundra Peregrine Falcon (Calidus)':
------------------------------------------------------

1.3.0 is a medium release of dav1d, focus on new APIs and memory usage reduction.

- Reduce memory usage in numerous places
- ABI break in Dav1dSequenceHeader, Dav1dFrameHeader, Dav1dContentLightLevel structures
- new API function to check the API version: dav1d_version_api()
- Rewrite of the SGR functions for ARM64 to be faster
- NEON implementation of save_tmvs for ARM32 and ARM64
- x86 palette DSP for pal_idx_finish function
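
For anyone wondering what to do with the value from dav1d_version_api(), a minimal sketch of unpacking it is below. It assumes the packed 0x00XXYYZZ layout (major, minor, patch) that I believe the public header documents, so check it against the dav1d.h you build with. The sample value is made up.

```python
def unpack_api_version(packed: int):
    """Split a packed 0x00XXYYZZ version into (major, minor, patch)."""
    major = (packed >> 16) & 0xFF
    minor = (packed >> 8) & 0xFF
    patch = packed & 0xFF
    return major, minor, patch


# Hypothetical packed value, standing in for what the library call would return.
sample = 0x00070102
print("API version %d.%d.%d" % unpack_api_version(sample))
```

Since this release has an ABI break in several structures, comparing the major number at runtime is a cheap way to catch a mismatched library.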

Barough

5th October 2023, 21:02

dav1d v1.3.0-3-g47107e3
Built on October 05, 2023, GCC 13.2.0

https://code.videolan.org/videolan/dav1d

DL :
dav1d v1.3.0-3-g47107e3 (https://www.mediafire.com/file/yduyam6uwu6oi57/dav1d-1.3.0-3-g47107e3_Win_GCC132.7z/file)

hajj_3

14th February 2024, 17:21

Changes for 1.4.0 'Road Runner':
------------------------------------------------------

1.4.0 is a medium release of dav1d, focusing on new architecture support and optimizations

- AVX-512 optimizations for z1, z2, z3 in 8bit and high-bit depth
- New architecture supported: loongarch
- Loongarch optimizations for 8bit
- New architecture supported: RISC-V
- RISC-V optimizations for itx
- Misc improvements in threading and in reducing binary size
- Fix potential integer overflow with extremely large frame sizes