Old 15th May 2021, 23:23   #121  |  Link
soresu
Registered User
 
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 191
Massive speed up for 10 bpc content on AVX2 CPUs landed in the dav1d master recently:

https://code.videolan.org/videolan/d..._requests/1195

As well as another separate commit for AVX2 mc.emu_edge on 10 bpc content:

https://code.videolan.org/videolan/d..._requests/1196

The work for the main (huge) commit was sponsored by Facebook and Netflix according to the merge request.

It will be part of the dav1d 0.9 release, which is likely to land pretty soon - this will also include a lot of NEON asm for film grain synthesis on 8 bpc content, and the beginnings of the same for 10+ bpc content.

This release will render most 4K 10 bpc content pretty playable on many 8-core AVX2-capable CPUs, so even those who bought AMD Renoir- and Cezanne-based NUCs should still manage pretty well, even without the AV1 ASIC coming for Van Gogh and Rembrandt APUs onward.
Old 16th May 2021, 23:29   #122  |  Link
soresu
Registered User
 
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 191
dav1d 0.9 (Golden Eagle) was officially released:

https://code.videolan.org/videolan/d...releases/0.9.0
  • x86 (64bit) AVX2 implementation of most 10b/12b functions, which should provide a large boost for high-bitdepth decoding on modern x86 computers and servers.
  • ARM64 neon implementation of FilmGrain (4:2:0/4:2:2/4:4:4 8bit)
  • New API to signal events happening during the decoding process
Old 17th May 2021, 21:49   #123  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 3,923
Quote:
Originally Posted by soresu View Post
dav1d 0.9 (Golden Eagle) was officially released:

https://code.videolan.org/videolan/d...releases/0.9.0
  • x86 (64bit) AVX2 implementation of most 10b/12b functions, which should provide a large boost for high-bitdepth decoding on modern x86 computers and servers.
  • ARM64 neon implementation of FilmGrain (4:2:0/4:2:2/4:4:4 8bit)
  • New API to signal events happening during the decoding process
Is there any perf data yet on real-world decode fps after these changes?
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 17th May 2021, 22:05   #124  |  Link
quietvoid
Registered User
 
Join Date: Jan 2019
Posts: 320
Quote:
Originally Posted by benwaggoner View Post
Is there any perf data yet on real-world decode fps after these changes?
https://www.phoronix.com/scan.php?pa...0.9-Benchmarks

The changes also increased rav1e's encoding speed by 3x (with AVX2), since they share much of the ASM.
__________________
LG OLED C8 | madVR 870 DPL HDR curve

Last edited by quietvoid; 17th May 2021 at 23:09.
Old 18th May 2021, 12:37   #125  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 89
Quote:
Originally Posted by quietvoid View Post
https://www.phoronix.com/scan.php?pa...0.9-Benchmarks

The changes also increased rav1e's encoding speed by 3x (with AVX2), since they share much of the ASM.
The Phoronix numbers look pretty good. Keep in mind that the Chimera clip they used has film grain in the 10-bit version but not in the 8-bit one, and they enabled film grain application in the decoder, so it's not a perfect comparison.

In most software-decoder players, I would expect the film grain to be added on the GPU directly. dav1d contains an example (dav1dplay) that demonstrates how to do this (using libplacebo), and you will get some speed-up from this. To emulate this using the dav1d binary, use --filmgrain=0 (or in ffmpeg: -filmgrain 0). In gav1, you'd use --post_filter_mask 0xf. This is especially important because 10-bit film grain has no Neon SIMD optimizations yet (but 8-bit Neon/SSSE3/AVX2 and 10-bit AVX2 are present). So keep this in mind when running comparisons.
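
For anyone scripting such a comparison, here is a minimal sketch of how the two command lines could be assembled so that film grain is consistently switched off. Only --filmgrain=0 and -filmgrain 0 come from the post above; the helper names, the sample file name, and the remaining arguments (-o /dev/null with --muxer null for dav1d, -c:v libdav1d and -f null - for ffmpeg) are my assumptions about a typical discard-the-output decode run and should be checked against the --help text of your builds.

```python
import shlex


def dav1d_cmd(src, filmgrain=False):
    """Build a dav1d CLI call that decodes `src` and discards the frames.

    The output/muxer arguments are an assumption about a typical
    benchmark run; only --filmgrain=0 is taken from the thread.
    """
    cmd = ["dav1d", "-i", src, "-o", "/dev/null", "--muxer", "null"]
    if not filmgrain:
        cmd.append("--filmgrain=0")  # skip film grain application in the decoder
    return cmd


def ffmpeg_cmd(src, filmgrain=False):
    """Build an ffmpeg call that decodes `src` via libdav1d and discards the frames."""
    cmd = ["ffmpeg", "-c:v", "libdav1d"]  # decoder options go before -i
    if not filmgrain:
        cmd += ["-filmgrain", "0"]
    cmd += ["-i", src, "-f", "null", "-"]
    return cmd


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    for build in (dav1d_cmd, ffmpeg_cmd):
        print(shlex.join(build("chimera_10bit.ivf")))
```

Printing the commands rather than running them keeps the sketch harmless; pass the list to subprocess.run and time it if you want actual numbers.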
Old 18th May 2021, 21:24   #126  |  Link
LigH
German doom9/Gleitz SuMo
 
Join Date: Oct 2001
Location: Germany, rural Altmark
Posts: 6,459
dav1d 0.9.0-0 (g8636b4f / 2021-05-16) (MSYS2 / MinGW, GCC 10.3.0)

I guess the "patches after release" increment is wrong...
__________________

New German Gleitz board
MediaFire: x264 | x265 | VPx | AOM | Xvid
Old 19th May 2021, 07:06   #127  |  Link
soresu
Registered User
 
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 191
Quote:
Originally Posted by Beelzebubu View Post
This is especially important because 10-bit film grain has no Neon SIMD optimizations yet (but 8-bit Neon/SSSE3/AVX2 and 10-bit AVX2 are present). So keep this in mind when running comparisons.
Even the big 10 bpc AVX2 dump only covers film grain for 4:2:0 content, but as that probably covers everything currently on YouTube or any other commercial source making significant use of AV1, it should be fine for most people.

There are also the beginnings of 10 bpc NEON optimisations, which were added just before 0.9 - so I would expect a 0.9.1/0.9.2 to cover it all before too long, since it was only a couple of months from the first NEON 8 bpc FG patch to the last.

Last edited by soresu; 19th May 2021 at 07:08.
Old 24th May 2021, 18:38   #128  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 3,923
Quote:
Originally Posted by quietvoid View Post
https://www.phoronix.com/scan.php?pa...0.9-Benchmarks

The changes also increased rav1e's encoding speed by 3x (with AVX2), since they share much of the ASM.
3x? That's pretty amazing!

That's for 10-bit specifically, correct?
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 24th May 2021, 18:42   #129  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 3,923
Quote:
Originally Posted by soresu View Post
Even the big 10 bpc AVX2 dump only covers film grain for 4:2:0 content, but as that probably covers everything currently on YouTube or any other commercial source making significant use of AV1, it should be fine for most people.
Yeah, I don't see any reason why content distribution beyond 4:2:0 10-bit would happen in the 2020s. Especially where software decode may be required, as 444 is about half the speed without any quality advantage in >99% of content.
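
The "about half the speed" figure is consistent with raw sample counts, at least as a rough sanity check. The sketch below assumes decode cost scales with the number of samples per frame, which is a simplification (entropy decoding, for one, does not scale that way); the table and function names are made up for illustration.

```python
# (horizontal, vertical) chroma subsampling divisors per layout
CHROMA_DIVISORS = {
    "4:2:0": (2, 2),
    "4:2:2": (2, 1),
    "4:4:4": (1, 1),
}


def samples_per_pixel(layout):
    """One luma sample plus two chroma planes scaled by the subsampling."""
    h, v = CHROMA_DIVISORS[layout]
    return 1 + 2 / (h * v)


def relative_sample_load(layout, baseline="4:2:0"):
    """How many times more samples `layout` carries than `baseline`."""
    return samples_per_pixel(layout) / samples_per_pixel(baseline)


if __name__ == "__main__":
    for layout in CHROMA_DIVISORS:
        print(layout, samples_per_pixel(layout), relative_sample_load(layout))
```

4:2:0 works out to 1.5 samples per pixel and 4:4:4 to 3, so 4:4:4 carries twice the data for the same resolution.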

Quote:
There are also the beginnings of 10 bpc NEON optimisations, which were added just before 0.9 - so I would expect a 0.9.1/0.9.2 to cover it all before too long, since it was only a couple of months from the first NEON 8 bpc FG patch to the last.
I would anticipate that 8-bit will be standard for where SW decode is needed, and 10-bit for HW, unless the perf overhead of 10-bit drops to <25% over 8-bit.
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 20th July 2021, 09:30   #130  |  Link
hajj_3
Registered User
 
Join Date: Mar 2004
Posts: 1,029
Dav1d 0.9.1 changelog:

- 10/12b SSSE3 optimizations for mc (avg, w_avg, mask, w_mask, emu_edge),
prep/put_bilin, prep/put_8tap, ipred (dc/h/v, paeth, smooth, pal, filter), wiener,
sgr (10b), warp8x8, deblock, film_grain, cfl_ac/pred for 32bit and 64bit x86 processors
- Film grain NEON for fguv 10/12b, fgy/fguv 8b and fgy/fguv 10/12 arm32
- Fixes for filmgrain on ARM
- itx 4x4 for SSE4
- Misc improvements on SSE2, SSE4
Old 20th July 2021, 21:38   #131  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 3,923
Quote:
Originally Posted by hajj_3 View Post
Dav1d 0.9.1 changelog:

- 10/12b SSSE3 optimizations for mc (avg, w_avg, mask, w_mask, emu_edge),
prep/put_bilin, prep/put_8tap, ipred (dc/h/v, paeth, smooth, pal, filter), wiener,
sgr (10b), warp8x8, deblock, film_grain, cfl_ac/pred for 32bit and 64bit x86 processors
- Film grain NEON for fguv 10/12b, fgy/fguv 8b and fgy/fguv 10/12 arm32
- Fixes for filmgrain on ARM
- itx 4x4 for SSE4
- Misc improvements on SSE2, SSE4
Sounds like good stuff.

Any updates on net 10-bit decode performance improvements?
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 2nd August 2021, 14:01   #132  |  Link
Spyros
Registered User
 
Join Date: Jun 2019
Posts: 8
dav1d 0.9.1: a ton of asm

Quote:
Optimizations coverage

With 0.9.1, we've done most of the optimizations for 8/10/12bit on the following platforms:
  • Desktop CPUs with AVX2 (64bit)
  • Desktop CPUs with SSSE3 in 32bit
  • Desktop CPUs with SSSE3 in 64bit
  • ARM CPU in 32bit, ARMv7
  • ARM CPU in 64bit, ARMv8
There are still some minor optimizations left to do, but they won't change much the overall performance of the decoder.
For example, intra z1/z2/z3, 12bit SGR or 12bit itxfm are not done, but their usefulness is debatable

Please note also, that some optimizations were done for SSE4 and not SSSE3.
Quote:
Assembly size

The portion of code in dav1d written in assembly is now reaching 140000 lines of code.

This code is composed of:
  • 90000 lines for x86 (AVX2, SSSE3-32, SSSE3-64);
  • 50000 lines for ARM (32bit and 64bit).
This is very large for handwritten assembly, and for comparison, this is more assembly than what there is in FFmpeg (for all codecs).

And yes, this code is faster than what the compilers can generate by themselves.
Source: jbkempf.com
Old 2nd August 2021, 20:19   #133  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 3,923
That's some impressive optimization work there!

Looks like about 2x faster on Ryzen 5 and 3x faster on recent Intel.

Do we have a comparison between 8-bit and 10-bit decode performance?
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 6th August 2021, 13:54   #134  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 89
Quote:
Originally Posted by benwaggoner View Post
Do we have a comparison between 8-bit and 10-bit decode performance?
From memory, on the starting scene (past frame 120) of Chimera, same content, same tools, same bitrate, same resolution, I saw a ~20% drop on a Haswell laptop, single-threaded. Overall, this will depend on content complexity or bitrate: high-complexity content or low-quantizer/high-bitrate encodes tend to have a smaller drop than low-complexity content or high-quantizer/low-bitrate encodes. If you want more accurate numbers, you'll have to give some insight into what you're looking for in terms of bitrate/complexity/quantizer/resolution/etc. - and probably also type of device.

(Reason: things like coefficient decoding are basically identical between 8bit and 10bit, but things like prediction are slower because they require twice the memory. Therefore, overall slowdown depends on ratio between things that require double the memory (like prediction) and things that don't (like coef decoding). Because overall memory usage is cumulative between threads, you'll see a slightly larger drop-off with more threads.)
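
The reasoning above can be written as a simple two-bucket model: some share of 8-bit decode time is spent in stages that get slower at 10-bit (prediction and the like), the rest is unaffected (coefficient decoding). This is a sketch with made-up inputs, not measured dav1d data; the function name and both parameters are hypothetical.

```python
def relative_fps(sensitive_fraction, stage_slowdown):
    """Estimate 10-bit fps relative to 8-bit fps.

    sensitive_fraction: share of 8-bit decode time spent in stages that
        slow down at 10-bit (a hypothetical input, between 0 and 1).
    stage_slowdown: how many times slower those stages run at 10-bit
        (a hypothetical input, at least 1).
    """
    # Unaffected time stays the same; affected time is stretched.
    new_time = (1 - sensitive_fraction) + sensitive_fraction * stage_slowdown
    return 1 / new_time


if __name__ == "__main__":
    # Illustrative pairs only; the shares are invented for the example.
    for fraction in (0.25, 0.5, 0.75):
        drop = 1 - relative_fps(fraction, 1.5)
        print(f"sensitive share {fraction:.2f}: fps drop {drop:.1%}")
```

With half the time in affected stages and those stages running 1.5x slower, the model gives a 20% fps drop. A high-bitrate encode shifts more time into coefficient decoding, which lowers the sensitive share and shrinks the drop, matching the trend described above.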
Old 14th August 2021, 21:41   #135  |  Link
soresu
Registered User
 
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 191
Quote:
Originally Posted by benwaggoner View Post
That's some impressive optimization work there!

Looks like about 2x faster on Ryzen 5 and 3x faster on recent Intel.

Do we have a comparison between 8-bit and 10-bit decode performance?
The SSSE3 path is currently missing the CDEF filter for 10/12 bit, which is in the merge request line-up on the dav1d GitLab. As CDEF is a major compute hog for AV1, that will give the next release (1.0.0) another step up, and some optimisations made for that particular commit will have equivalent additions to the current AVX2 CDEF filter asm.

As well as that, they just landed what I think is the last of the 10 bit film grain asm (gen_grain) for ARM64 NEON.

All told, 1.0.0 should have pretty much every significant SIMD path fairly well optimised for 8 and 10 bpc content, minus AVX512.
Old 4th September 2021, 14:05   #136  |  Link
lvqcl
Registered User
 
Join Date: Aug 2015
Posts: 163
dav1d 0.9.2

Quote:
0.9.2 is a small update of dav1d on the 0.9.x branch, focusing on adding SIMD on numerous small cases:

x86: SSE4 optimizations of inverse transforms for 10bit for all sizes
x86: mc.resize optimizations with AVX2/SSSE3 for 10/12b
x86: SSSE3 optimizations for cdef_filter in 10/12b and mc_w_mask_422/444 in 8b
ARM NEON optimizations for FilmGrain Gen_grain functions
Optimizations for splat_mv in SSE2/AVX2 and NEON
x86: SGR improvements for SSSE3 CPUs
x86: AVX2 optimizations for cfl_ac

This mostly concludes SIMD for SSSE3 (32+64), AVX2 and NEON (32+64). The rest are scaled-related and z1/z2/z3 and should not bring significant improvements in speed for most cases.


All times are GMT +1. The time now is 09:57.

