Old 15th May 2021, 23:23   #121  |  Link
soresu
Registered User
 
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 190
Massive speed up for 10 bpc content on AVX2 CPUs landed in the dav1d master recently:

https://code.videolan.org/videolan/d..._requests/1195

As well as another separate commit for AVX2 mc.emu_edge on 10 bpc content:

https://code.videolan.org/videolan/d..._requests/1196

The work for the main (huge) commit was sponsored by Facebook and Netflix according to the merge request.

It will be part of the dav1d 0.9 release, which is likely to land pretty soon. That release will also include a good deal of NEON asm for film grain synthesis on 8 bpc content, and the beginnings of the same for 10+ bpc content.

This release will make most 4K 10 bpc content pretty playable on many 8-core AVX2-capable CPUs, so even those who bought AMD Renoir and Cezanne based NUCs should still manage pretty well without the AV1 ASIC coming for Van Gogh and Rembrandt APUs onward.
Old 16th May 2021, 23:29   #122  |  Link
soresu
Registered User
 
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 190
dav1d 0.9 (Golden Eagle) was officially released:

https://code.videolan.org/videolan/d...releases/0.9.0
  • x86 (64bit) AVX2 implementation of most 10b/12b functions, which should provide a large boost for high-bitdepth decoding on modern x86 computers and servers.
  • ARM64 neon implementation of FilmGrain (4:2:0/4:2:2/4:4:4 8bit)
  • New API to signal events happening during the decoding process
Old 17th May 2021, 21:49   #123  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 3,863
Quote:
Originally Posted by soresu
dav1d 0.9 (Golden Eagle) was officially released:

https://code.videolan.org/videolan/d...releases/0.9.0
  • x86 (64bit) AVX2 implementation of most 10b/12b functions, which should provide a large boost for high-bitdepth decoding on modern x86 computers and servers.
  • ARM64 neon implementation of FilmGrain (4:2:0/4:2:2/4:4:4 8bit)
  • New API to signal events happening during the decoding process
Is there any perf data yet on real-world decode fps after these changes?
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 17th May 2021, 22:05   #124  |  Link
quietvoid
Registered User
 
Join Date: Jan 2019
Posts: 297
Quote:
Originally Posted by benwaggoner View Post
Is there any perf data yet on real-world decode fps after these changes?
https://www.phoronix.com/scan.php?pa...0.9-Benchmarks

The changes also increased rav1e's encoding speed by 3x (with AVX2), since they share much of the ASM.
__________________
LG OLED C8 | madVR 870 DPL HDR curve

Last edited by quietvoid; 17th May 2021 at 23:09.
Old 18th May 2021, 12:37   #125  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 88
Quote:
Originally Posted by quietvoid View Post
https://www.phoronix.com/scan.php?pa...0.9-Benchmarks

The changes also increased rav1e's encoding speed by 3x (with AVX2), since they share much of the ASM.
The Phoronix numbers look pretty good. Keep in mind that the Chimera clip they used has film grain in the 10-bit version but not in the 8-bit one, and they enabled film grain application in the decoder, so it's not a perfect comparison.

In most software-decoder players, I would expect the film grain to be added on the GPU directly. dav1d contains an example (dav1dplay) that demonstrates how to do this (using libplacebo), and you will get some speed-up from this. To emulate this using the dav1d binary, use --filmgrain=0 (or in ffmpeg: -filmgrain 0). In gav1, you'd use --post_filter_mask 0xf. This is especially important because 10-bit film grain has no Neon SIMD optimizations yet (but 8-bit Neon/SSSE3/AVX2 and 10-bit AVX2 are present). So keep this in mind when running comparisons.
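
A rough sketch of how such a comparison could be scripted in Python, assuming dav1d and ffmpeg are on the PATH. Only --filmgrain=0 and -filmgrain 0 come from the post above. The helper names, the clip name, and the remaining flags (-i, -o, --muxer null for dav1d; -c:v libdav1d and -f null for ffmpeg) are assumptions about a typical decode-only run, so check them against your own builds before trusting any numbers.

```python
import shlex


def dav1d_cmd(clip, film_grain=True):
    """Build a decode-only dav1d command line (hypothetical helper).

    Output is discarded; /dev/null assumes a Unix-like system.
    """
    cmd = ["dav1d", "-i", clip, "-o", "/dev/null", "--muxer", "null"]
    if not film_grain:
        # Flag quoted in the post: skip film grain application in the decoder.
        cmd.append("--filmgrain=0")
    return cmd


def ffmpeg_cmd(clip, film_grain=True):
    """Build a decode-only ffmpeg command line using libdav1d (hypothetical helper)."""
    cmd = ["ffmpeg", "-c:v", "libdav1d"]
    if not film_grain:
        # Decoder option quoted in the post; it has to come before -i.
        cmd += ["-filmgrain", "0"]
    cmd += ["-i", clip, "-f", "null", "-"]
    return cmd


if __name__ == "__main__":
    # "chimera_10bit.ivf" is a placeholder file name.
    for build in (dav1d_cmd, ffmpeg_cmd):
        print(shlex.join(build("chimera_10bit.ivf", film_grain=False)))
```

Running each clip once with film_grain=True and once with film_grain=False separates the cost of grain synthesis from the rest of the decode, which is the distinction the 8-bit vs 10-bit Chimera comparison blurs.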
Old 18th May 2021, 21:24   #126  |  Link
LigH
German doom9/Gleitz SuMo
 
Join Date: Oct 2001
Location: Germany, rural Altmark
Posts: 6,435
dav1d 0.9.0-0 (g8636b4f / 2021-05-16) (MSYS2 / MinGW, GCC 10.3.0)

I guess the "patches after release" increment is wrong...
__________________

New German Gleitz board
MediaFire: x264 | x265 | VPx | AOM | Xvid
Old 19th May 2021, 07:06   #127  |  Link
soresu
Registered User
 
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 190
Quote:
Originally Posted by Beelzebubu View Post
This is especially important because 10-bit filmgrain has no Neon SIMD optimizations yet (but 8-bit Neon/SSSE3/AVX2 and 10-bit AVX2 is present). So keep this in mind when running comparisons.
Even the big 10 bpc AVX2 dump only covers film grain for 4:2:0 content, but since that probably covers everything currently on YouTube or any other commercial source making significant use of AV1, it should be fine for most people.

There are also the beginnings of 10 bpc NEON optimisations, which were added just before 0.9, so I would expect a 0.9.1 or 0.9.2 to cover it all before too long, since it was only a couple of months from the first NEON 8 bpc FG patch to the last.

Last edited by soresu; 19th May 2021 at 07:08.
Old 24th May 2021, 18:38   #128  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 3,863
Quote:
Originally Posted by quietvoid View Post
https://www.phoronix.com/scan.php?pa...0.9-Benchmarks

The changes also increased rav1e's encoding speed by 3x (with AVX2), since they share much of the ASM.
3x? That's pretty amazing!

That's for 10-bit specifically, correct?
Old 24th May 2021, 18:42   #129  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 3,863
Quote:
Originally Posted by soresu View Post
Even the big 10 bpc AVX2 dump only covers film grain for 420 content, but as that covers probably everything currently on Youtube or any other commercial source using AV1 significantly it should be fine for most people.
Yeah, I don't see any reason why content distribution beyond 4:2:0 10-bit would happen in the 2020s, especially where software decode may be required, as 4:4:4 decodes at about half the speed with no quality advantage in >99% of content.
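
One way to see where "about half the speed" could come from is raw sample count: 4:4:4 carries twice as many samples per pixel as 4:2:0. The sketch below is back-of-envelope arithmetic only, not a measurement, and real decode cost does not scale purely with sample count. The function name is made up for illustration.

```python
from fractions import Fraction


def samples_per_pixel(j, a, b):
    """Average samples per pixel for J:a:b chroma subsampling.

    The notation describes a block J pixels wide and 2 rows high:
    2*J luma samples, plus (a + b) samples in each of the two chroma planes.
    """
    luma = 2 * j
    chroma = 2 * (a + b)
    return Fraction(luma + chroma, luma)


if __name__ == "__main__":
    for name, fmt in (("4:2:0", (4, 2, 0)), ("4:2:2", (4, 2, 2)), ("4:4:4", (4, 4, 4))):
        print(name, float(samples_per_pixel(*fmt)))
```

By this count 4:2:0 has 1.5 samples per pixel, 4:2:2 has 2, and 4:4:4 has 3, so a 4:4:4 stream has exactly twice the data of a 4:2:0 one at the same resolution.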

Quote:
There is also the beginnings of 10 bpc NEON optimisations though which was added just before 0.9 - so I would expect either a 0.9.1/0.9.2 to cover it all before too long since it was only a couple of months from the first NEON 8 bpc FG patch to the last.
I would anticipate that 8-bit will be standard wherever SW decode is needed, and 10-bit for HW decode, unless the perf overhead of 10-bit drops to <25% over 8-bit.
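
For anyone wanting to check their own numbers against that 25% figure, a minimal sketch follows. It assumes "overhead" means extra decode time per frame relative to 8-bit, which is one reading of the post rather than something it states, and the function name and fps values are hypothetical.

```python
def bitdepth_overhead(fps_8bit, fps_10bit):
    """Extra per-frame decode time of 10-bit relative to 8-bit, as a fraction.

    Time per frame is 1/fps, so (t10 - t8) / t8 simplifies to fps8 / fps10 - 1.
    """
    if fps_8bit <= 0 or fps_10bit <= 0:
        raise ValueError("fps values must be positive")
    return fps_8bit / fps_10bit - 1.0


if __name__ == "__main__":
    # Made-up example figures, not benchmark results.
    print(f"{bitdepth_overhead(100.0, 80.0):.0%}")
```

With these made-up figures, 100 fps at 8-bit against 80 fps at 10-bit works out to a 25% overhead, right at the threshold mentioned above.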
Old 20th July 2021, 09:30   #130  |  Link
hajj_3
Registered User
 
Join Date: Mar 2004
Posts: 1,027
dav1d 0.9.1 changelog:

- 10/12b SSSE3 optimizations for mc (avg, w_avg, mask, w_mask, emu_edge),
  prep/put_bilin, prep/put_8tap, ipred (dc/h/v, paeth, smooth, pal, filter), wiener,
  sgr (10b), warp8x8, deblock, film_grain, cfl_ac/pred for 32bit and 64bit x86 processors
- Film grain NEON for fguv 10/12b, fgy/fguv 8b and fgy/fguv 10/12 arm32
- Fixes for filmgrain on ARM
- itx 4x4 for SSE4
- Misc improvements on SSE2, SSE4
Old 20th July 2021, 21:38   #131  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 3,863
Quote:
Originally Posted by hajj_3 View Post
Dav1d 0.9.1 changelog:

- 10/12b SSSE3 optimizations for mc (avg, w_avg, mask, w_mask, emu_edge),
prep/put_bilin, prep/put_8tap, ipred (dc/h/v, paeth, smooth, pal, filter), wiener,
sgr (10b), warp8x8, deblock, film_grain, cfl_ac/pred for 32bit and 64bit x86 processors
- Film grain NEON for fguv 10/12b, fgy/fguv 8b and fgy/fguv 10/12 arm32
- Fixes for filmgrain on ARM
- itx 4x4 for SSE4
- Misc improvements on SSE2, SSE4
Sounds like good stuff.

Any updates on net 10-bit decode performance improvements?