#1 | Link |
Registered User
Join Date: Mar 2004
Posts: 1,100
|
dav1d accelerated AV1 decoder
dav1d 0.5.0 'Asiatic Cheetah'
https://code.videolan.org/videolan/dav1d/-/releases
The fast and small AV1 decoder, codename 'Asiatic Cheetah'. It supports all AV1 features and all bit depths.
0.5.0 brings large speed improvements on SSSE3 CPUs (up to 40%), further speed improvements on AVX2 (4-7%) and ARM64 (up to 10%), and improvements on ARM32. It introduces some VSX, SSE2 and SSE4 optimizations.
0.5.0 also fixes some minor issues, can export ITU-T T.35 metadata and improves the player example.
Last edited by foxyshadis; 3rd December 2020 at 12:34. Reason: add general releases link
#2 | Link | |
Member
Join Date: Nov 2002
Posts: 203
|
https://code.videolan.org/videolan/d...releases#0.5.1
Quote:
http://download.opencontent.netflix....ix=AV1/Sparks/ Netflix posted new AV1 samples with and without film grain in 540p, 1080p and 2160p |
#3 | Link | ||
Registered User
Join Date: Nov 2009
Location: Northeast Ohio
Posts: 447
|
Quote:
A good number of people saw no reason to upgrade from AMD CPUs that top out at SSE3, whether that was a Phenom II (especially the 6-core models) or a first-gen "Llano" laptop APU. Worse still, upgrading from either required a different motherboard (AM3+ or FM2). Heck, when the first-gen AM3+ FX CPU reviews landed, it was even common for people to upgrade to the Phenom II X6 instead!
Quote:
For reference your Pentium is the same exact architecture as a second-gen Core 2 Duo but has 1MB of L2 cache, so I would expect its IPC to be similar to a first-gen 2MB L2 Core 2 Duo.
__________________
____HTPC____ | __Desktop PC__
2.93GHz Xeon x3470 (4c/8t Nehalem) | 4.5GHz 1.24v dual-core Haswell G3258
Radeon HD5870 | Intel iGPU
2x2GB+2x1GB DDR3-1333 | 4x4GB DDR3-1600
Last edited by Nintendo Maniac 64; 28th October 2019 at 00:05.
#4 | Link | |
Registered User
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
|
Quote:
1080p on SSE2 is not our goal. The goal is to have baseline support, so that ~5 years (or even earlier?) from now, AV1 can be the baseline instead of H.264. We don't know for sure, but this may imply some basic need for SSE2 support. So we're exploring what is possible and how much work it would be.
#5 | Link |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
According to the dAV1d team, the latest version, 0.5.0, is extremely fast: many times faster than libaom, even when using only SSSE3.
Based on the benchmarks below, what really surprises me is that, depending on content and CPU implementation, SSSE3 code running on 128-bit registers can be as fast as AVX2 code running on 256-bit registers! How is this even possible? I'm starting to believe that your AVX2 assembly optimizations could be optimized further.
BTW, any plans for AVX-512 in the near future? Would there be any benefit to it?
https://i.postimg.cc/7h1nkFss/dav1d-0-5-x86-s.png
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) HEVC decoding benchmarks H.264 DXVA Benchmarks for all |
#6 | Link |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
OK... So I took a look at the single-threaded performance, and I see a 20% gain for AVX2 compared to SSSE3.
It is really amazing that the remaining non-optimized parts of the algorithm can account for around 80% of the performance (!) Does that mean that all these months of writing optimized AVX2 assembly really contribute only 20%? I would really like to hear what the dAV1d team or other developers of software AV1 decoders say about that. Do we really have an 80% non-optimizable algorithm here? Looks like another instance of the Pareto principle to me.
https://i.postimg.cc/3rt91v4z/1-0o-W...a9-BSb3-SQ.png
#7 | Link |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
I took it from "you"
http://www.jbkempf.com/blog/post/201...elease-fastest
#8 | Link | |
Registered User
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
|
Quote:
Together, that means the speedup is well below half of half, so 20% is not entirely unreasonable. Sucks a bit, but you can't beat reality. |
#9 | Link | |
Registered User
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
|
Quote:
That's a little negative. But yes, you won't get a 2x (or even 1.5x) speedup. 1.2x is nothing bad, though. And this is straight Haswell; newer chips (Zen 2, Skylake) will get more, as will encoders.
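To put a rough number on why 1.2x is plausible, here is a minimal sketch based on Amdahl's law. It is an illustration only, not something from the dav1d code base: the function names are made up, and it assumes that going from 128-bit to 256-bit registers gives an ideal 2x on whatever part of the runtime benefits from it.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of the runtime becomes `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def fraction_needed(overall: float, factor: float) -> float:
    """Inverse of Amdahl's law: share of runtime that must speed up to reach `overall`."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / factor)


# Assumption: the wider registers give an ideal 2x on the affected code.
share = fraction_needed(overall=1.2, factor=2.0)
print(f"share of runtime that doubles in speed: {share:.1%}")
print(f"check, overall speedup: {amdahl_speedup(share, 2.0):.2f}x")
```

Under that assumption, a 1.2x overall gain corresponds to about a third of the decode time running twice as fast. Real kernels rarely reach the ideal 2x, so the share of code that was actually touched is likely larger.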
#10 | Link | |
Lost my old account :(
Join Date: Jul 2017
Posts: 321
|
Quote:
Tried to play that 2160p60 sample from Netflix with 0.74.1 and MPC-BE on an i7-7500U: 4 threads at 100% load at 3.2GHz, it could barely open the file, and after 30s it started playing at 2-10fps (downscaled to 1080p). Not that I was expecting smooth playback, but is this "normal" performance? HEVC 10-bit software decoding is about 3x faster on the same setup.
Last edited by excellentswordfight; 29th October 2019 at 14:26.
#11 | Link | |
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,329
|
Quote:
That won't necessarily guarantee that 2160p60 will play on a mobile U-series CPU, but it has the best chance. Just be careful not to pick the 10-bit variant of the Netflix Chimera video. 10-bit is not optimized at all yet, and it's not representative of real-world content yet; YouTube, for example, only delivers 8-bit AV1 so far. And since there is no 8-bit 2160p variant of Chimera, that's your answer.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders Last edited by nevcairiel; 29th October 2019 at 14:35. |
#12 | Link |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
I'm about to run a few benchmarks with various versions of dAV1d to track SSSE3 and AVX2 progress on Core2Duo, Haswell, Skylake and Coffee Lake Refresh CPUs.
Is there a link to 1080p and 4K 8-bit AV1 sample videos for testing?
#13 | Link |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
dAV1d benchmarks: 0.5.1 vs 0.2.1 on Core2Duo, Haswell, Skylake, Coffee Lake R
OK, here we are.

Test Systems:
Skylake Core i5 6500 (TDP 65W) - Win 10 v1809 (17763.805) - 8GB DDR4-2133 MHz (1 DIMM - Single Channel) - All-core turbo 3.3GHz
Haswell Core i3 4170 (TDP 54W) - Win 10 v1903 (18362.449) - 16GB DDR3-1600 MHz (Dual Channel 2x8GB) - Fixed 3.7GHz clock
Coffee Lake Refresh Core i3 9100F (TDP 65W) - Win 10 v1903 (18362.449) - 16GB DDR4-2400 MHz (Dual Channel 2x8GB) - All-core turbo 4.0GHz
Merom Core2Duo T7600 (TDP 34W) - Win 10 v1809 (17763.805) - 4GB DDR2-667 MHz (Dual Channel Interleaved 2x2GB) - Fixed 2.33GHz clock

SW Tools:
DXVA Checker v4.2.1
LAV filters 0.74.1 (dAV1d 0.2.1 - 12/03/2019)
LAV filters 0.74.1-29 (dAV1d 0.5.1 - 26/10/2019)

During the whole benchmarking procedure, the Core i5 6500 never dropped below its 3.3 GHz turbo clock and the Core i3 9100F never dropped below its 4.0 GHz turbo clock.

Core i5 6500: Max TDP for 1080p ~33W, Max TDP for 4K ~36W
Core i3 4170: Max TDP for 4K ~35W
Core i3 9100F: Max TDP for 4K ~54W
Core2Duo T7600: No tool can read power consumption

All video samples below are 8-bit.
Chimera 1080p24fps sample is from Netflix
Dua Lipa 1080p25fps sample is from YouTube
Holi Festival 4K25fps sample is from Elecard (thanks @HolyWu)
Summer Nature 4K25fps sample is from Elecard

The numbers below are frames per second (FPS), expressed as minimum/average/maximum.

1080p

Chimera ~6.6Mbps
Core i5 6500 86/134/290 CPU 87% - 0.5.1
Core i5 6500 77/127/273 CPU 91% - 0.2.1
Core2Duo T7600 10/19/94 CPU 72% - 0.5.1
Core2Duo T7600 8/17/100 CPU 87% - 0.2.1

Dua Lipa ~2.2Mbps
Core i5 6500 120/186/251 CPU 87% - 0.5.1
Core i5 6500 112/186/255 CPU 91% - 0.2.1
Core2Duo T7600 7/18/70 CPU 65% - 0.5.1
Core2Duo T7600 7/18/69 CPU 84% - 0.2.1

4K

Holi Festival ~14Mbps
Core i5 6500 34/43/61 CPU 94% - 0.5.1
Core i5 6500 30/40/60 CPU 95% - 0.2.1

Summer Nature ~23Mbps
Core i3 9100F 45/60/82 CPU 91% - 0.5.1
Core i5 6500 32/43/57 CPU 93% - 0.5.1
Core i5 6500 26/37/50 CPU 91% - 0.2.1
Core i3 4170 21/30/46 CPU 92% - 0.5.1
Core i3 4170 16/27/41 CPU 90% - 0.2.1

Comments:
0) Sorry guys... dAV1d 0.5.1 has a serious CPU utilization problem on my laptop Core2Duo, essentially wiping out any SSSE3 optimization. Dua Lipa has 0% gain over 0.2.1 and Chimera only 11% on average. The situation is a disaster for the SSSE3 optimizations.
1) Seven months after the 0.2.1 release, I would say the dAV1d team was certainly not busy doing AVX2 optimizations. 0.5.1 looks only 0% - 8% faster than 0.2.1 on Skylake, except for the last 4K clip, which gets a nice 16% gain.
2) The Skylake Core i5 6500 is certainly not capable of decoding anything more than 4K30fps AV1 at up to ~20Mbps without dropping frames, even with the latest version.
3) The Coffee Lake R Core i3 9100F is closer to 4K60fps, but its minimum frame rate is still well below 60fps.
4) 0.5.1 lowered CPU utilization a little on Skylake (but enormously on Core2Duo) compared to 0.2.1, which eats into some of the latest version's performance optimizations on Skylake. The only time CPU utilization increased compared to 0.2.1, the gain was a good 16%.
5) The Core i3 9100F vs Core i5 6500 results show that CFL-R is ~15% faster than its clock advantage alone would suggest, probably due to its much faster memory configuration.

Overall, the results comparing 0.2.1 vs 0.5.1 were bad for both SIMD instruction sets, SSSE3 and AVX2.
Over the last 7 months I see no progress according to my benchmarks, and I really wonder where all those huge gains claimed by the dAV1d team for 0.5.1 vs 0.2.1 came from.
Is there a difference when using a desktop Core2Duo?
Really looking forward to your tests and feedback.
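For anyone who wants to re-check the percentages in the comments, the sketch below recomputes them from the average FPS values listed in the results. The function names are illustrative, and the per-clock comparison assumes both CPUs really held the stated clocks (4.0 GHz and 3.3 GHz) for the whole run.

```python
# Average FPS copied from the results above, as (0.2.1, 0.5.1) pairs.
AVG_FPS = {
    ("Core i5 6500", "Chimera 1080p"): (127, 134),
    ("Core i5 6500", "Dua Lipa 1080p"): (186, 186),
    ("Core i5 6500", "Holi Festival 4K"): (40, 43),
    ("Core i5 6500", "Summer Nature 4K"): (37, 43),
    ("Core i3 4170", "Summer Nature 4K"): (27, 30),
    ("Core2Duo T7600", "Chimera 1080p"): (17, 19),
    ("Core2Duo T7600", "Dua Lipa 1080p"): (18, 18),
}


def gain_percent(old_fps: float, new_fps: float) -> float:
    """Relative FPS gain of the newer version over the older one, in percent."""
    return (new_fps / old_fps - 1.0) * 100.0


def clock_normalised_ratio(fps_a: float, ghz_a: float,
                           fps_b: float, ghz_b: float) -> float:
    """FPS ratio of CPU A over CPU B after dividing out the clock ratio."""
    return (fps_a / fps_b) / (ghz_a / ghz_b)


for (cpu, clip), (old, new) in AVG_FPS.items():
    print(f"{cpu:15} {clip:17} {gain_percent(old, new):5.1f}%")

# Summer Nature on 0.5.1: Core i3 9100F (60 fps, 4.0 GHz) vs Core i5 6500 (43 fps, 3.3 GHz).
ratio = clock_normalised_ratio(60, 4.0, 43, 3.3)
print(f"CFL-R vs Skylake per clock: {ratio:.2f}x")
```

Note that these are whole-pipeline numbers measured through LAV filters with all threads active, so they say nothing about which instruction set path was limiting.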
Last edited by NikosD; 31st October 2019 at 04:30.
#14 | Link |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
@Beelzebubu
@nevcairiel
Guys, I posted a huge benchmark report on dAV1d's progress between versions 0.2.1 and 0.5.1, i.e. over the last seven months, and I have seen no replies or reactions from you since. Can you confirm my findings, or refute them with results of your own? I have seen a lot of huge numbers about dAV1d's progress from the dAV1d team in the official release notes, which I couldn't confirm, but in here you are very quiet. Waiting for your feedback!
#15 | Link |
I am maddo saientisto!
Join Date: Aug 2018
Posts: 95
|
TLDR:
Win7 64bits, i7-4770k, 3.40GHz (stock), improvement between 0.2.1 and 0.5.1 using only SSSE3 accelerated routines, single thread:
Chimera: 33.2%
Dua Lipa: 34.9%
Included are some AVX2 tests too, because yes.
Code:
# time ./dav1d-0.2.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null
real 5m27,012s
user 0m0,000s
sys 0m0,000s
# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null
real 3m38,449s
user 0m0,000s
sys 0m0,000s
# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask avx2 -o /dev/null
real 3m5,282s
user 0m0,000s
sys 0m0,000s
# time ./dav1d-0.2.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null
real 2m42,726s
user 0m0,000s
sys 0m0,000s
# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null
real 1m45,987s
user 0m0,000s
sys 0m0,000s
# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask avx2 -o /dev/null
real 1m22,243s
user 0m0,000s
sys 0m0,000s
The last 7 months I see progress according to my benchmarks and I really don't have to wonder where all those huge numbers of gain came from dAV1d team regarding 0.5.1 version vs 0.2.1
Last edited by SmilingWolf; 2nd November 2019 at 10:14.
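The percentages in this post are reductions in wall-clock time, which is not the same number as a throughput speedup. A small sketch of the conversion, assuming the `real` lines are in the `XmY,ZZZs` form shown above (decimal comma from the locale); the helper names are made up for illustration:

```python
import re

# Matches bash `time` output such as "5m27,012s" or "5m27.012s".
_STAMP = re.compile(r"(\d+)m(\d+)[,.](\d+)s")


def to_seconds(stamp: str) -> float:
    """Convert a `real` value like '5m27,012s' to seconds."""
    match = _STAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    minutes, whole, frac = match.groups()
    return int(minutes) * 60 + float(f"{whole}.{frac}")


def time_saved_percent(old_s: float, new_s: float) -> float:
    """Share of the old wall-clock time that the new run no longer needs."""
    return (1.0 - new_s / old_s) * 100.0


def speedup(old_s: float, new_s: float) -> float:
    """Throughput ratio: how many times faster the new run is."""
    return old_s / new_s


# SSSE3, single thread, 0.2.1 vs 0.5.1, values copied from the output above.
runs = {
    "Chimera": ("5m27,012s", "3m38,449s"),
    "Dua Lipa": ("2m42,726s", "1m45,987s"),
}
for clip, (old, new) in runs.items():
    old_s, new_s = to_seconds(old), to_seconds(new)
    print(f"{clip}: {time_saved_percent(old_s, new_s):.1f}% less time, "
          f"{speedup(old_s, new_s):.2f}x faster")
```

So a run that takes about a third less time is decoding roughly 1.5x as many frames per second, which is the figure to compare against FPS-based results.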
#16 | Link |
I am maddo saientisto!
Join Date: Aug 2018
Posts: 95
|
FFMpeg says it's going to use 4 frame threads and 3 tile threads to decode the files, so I'll be using those numbers.
Chimera: 34%
Dua Lipa: 29.7%
Code:
# time ./dav1d-0.2.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null
real 1m40,657s
user 0m0,000s
sys 0m0,000s
# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null
real 1m6,398s
user 0m0,000s
sys 0m0,000s
# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask avx2 -o /dev/null
real 0m54,087s
user 0m0,000s
sys 0m0,000s
# time ./dav1d-0.2.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null
real 0m46,972s
user 0m0,000s
sys 0m0,000s
# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null
real 0m33,041s
user 0m0,000s
sys 0m0,000s
# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask avx2 -o /dev/null
real 0m25,912s
user 0m0,000s
sys 0m0,000s
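Two more ratios can be read out of the 0.5.1 runs in this post and the single-threaded post before it: how much AVX2 adds over SSSE3, and how well the 4 frame / 3 tile thread setup scales over a single thread. The sketch below only does that arithmetic. The `"1x1"` and `"4x3"` labels are shorthand invented here for the frame/tile thread counts, and the seconds are converted by hand from the `real` lines.

```python
# Wall-clock seconds for dav1d 0.5.1, keyed by (clip, threading, instruction set).
SECONDS = {
    ("Chimera", "1x1", "ssse3"): 218.449,
    ("Chimera", "1x1", "avx2"): 185.282,
    ("Chimera", "4x3", "ssse3"): 66.398,
    ("Chimera", "4x3", "avx2"): 54.087,
    ("Dua Lipa", "1x1", "ssse3"): 105.987,
    ("Dua Lipa", "1x1", "avx2"): 82.243,
    ("Dua Lipa", "4x3", "ssse3"): 33.041,
    ("Dua Lipa", "4x3", "avx2"): 25.912,
}


def avx2_over_ssse3(clip: str, threading: str) -> float:
    """How many times faster the AVX2 run is than the SSSE3 run."""
    return SECONDS[(clip, threading, "ssse3")] / SECONDS[(clip, threading, "avx2")]


def threading_speedup(clip: str, isa: str) -> float:
    """How many times faster the 4x3 run is than the single-threaded run."""
    return SECONDS[(clip, "1x1", isa)] / SECONDS[(clip, "4x3", isa)]


for clip in ("Chimera", "Dua Lipa"):
    for threading in ("1x1", "4x3"):
        print(f"{clip:8} {threading}: AVX2 is "
              f"{avx2_over_ssse3(clip, threading):.2f}x SSSE3")
    for isa in ("ssse3", "avx2"):
        print(f"{clip:8} {isa:5}: 4x3 is "
              f"{threading_speedup(clip, isa):.2f}x single thread")
```

On this one machine, the AVX2 advantage lands in the same ballpark as the roughly 1.2x discussed earlier in the thread. These are single runs, so small differences should not be over-read.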
#17 | Link |
I am maddo saientisto!
Join Date: Aug 2018
Posts: 95
|
The tool is from the dav1d project, found here: https://code.videolan.org/videolan/d...e/master/tools
You can either compile it yourself (MABS can do that) or use my copy: https://mega.nz/#!op5gGSTD!JPyhq1IqJ...g8SyR2HbSTs7gk
Finding the best frame/tile thread counts is a bit tricky. Fiddling can improve performance, and I made some quick tests that brought Dua Lipa down to 29 seconds on 0.5.1+SSSE3 using --framethreads 6 --tilethreads 2, but I did not want to post them because in the end almost no media player is going to let you stray from the FFmpeg, and therefore LAVFilters, defaults.
Last edited by SmilingWolf; 2nd November 2019 at 12:13.
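If you want to automate the fiddling, a thread-count sweep could look like the sketch below. It is a rough outline, not a tested harness: the flags are copied from the commands earlier in this thread (dav1d 0.5.x; other releases may differ), `build_command`, `timed_run` and `sweep` are hypothetical helper names, and the binary and clip paths in the usage comment are placeholders.

```python
import itertools
import os
import subprocess
import time


def build_command(binary, clip, frame_threads, tile_threads, cpumask=None):
    """Assemble a dav1d command line using the flags shown earlier in the thread."""
    cmd = [binary, "-q", "-i", clip, "--muxer", "yuv4mpeg2",
           "--framethreads", str(frame_threads),
           "--tilethreads", str(tile_threads)]
    if cpumask is not None:
        cmd += ["--cpumask", cpumask]
    # Discard the decoded output; os.devnull is "nul" on Windows, "/dev/null" elsewhere.
    cmd += ["-o", os.devnull]
    return cmd


def timed_run(cmd):
    """Run one decode and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def sweep(binary, clip, frame_counts, tile_counts, cpumask=None, run=timed_run):
    """Time every frame/tile combination and return (seconds, frame, tile), fastest first."""
    results = []
    for frame_threads, tile_threads in itertools.product(frame_counts, tile_counts):
        seconds = run(build_command(binary, clip, frame_threads, tile_threads, cpumask))
        results.append((seconds, frame_threads, tile_threads))
    return sorted(results)


# Example usage (placeholder paths, not executed here):
# for seconds, ft, tt in sweep("./dav1d-0.5.1.exe", "Dua_Lipa.ivf",
#                              [2, 4, 6], [1, 2, 3], cpumask="ssse3"):
#     print(f"--framethreads {ft} --tilethreads {tt}: {seconds:.1f}s")
```

The `run` parameter is there so the timing function can be swapped out, for example to repeat each run a few times and take the median, since single runs are noisy.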
#18 | Link | |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
Because we have seriously contradicting multi-threaded results for dAV1d between ffmpeg and LAV filters, according to your tests and mine. It could be your builds vs nevcairiel's builds, the dAV1d setup in LAV vs ffmpeg, or the Hyper-Threading of the 4770K. If you don't want to raise the thread count in order to reach 100% CPU utilization, you could disable Hyper-Threading in the BIOS and run the tests again with 4 threads. Also, you could run the LAV filters benchmark using GraphStudioNext or DXVA Checker on your Core i7 as is, with Hyper-Threading ON, and see how that goes.
#19 | Link |
I am maddo saientisto!
Join Date: Aug 2018
Posts: 95
|
I can't use DXVA Checker to check performance, since the dav1d library inside the FFmpeg library inside the LAVFilters library would default to using AVX2, unless you know of a way to target a specific instruction set from within DXVA Checker, or perhaps using an environment variable. OTOH, if you just want me to check how dav1d multithreading works in different versions of LAVFilters/ffmpeg, I can test that. But it won't be a 0.2.1 vs 0.5.1 SSSE3 benchmark anymore.
I'm not too sure why you think I'm not using all cores of my CPU. With the settings above, 4 frame and 3 tile threads, I get peaks of 80% CPU usage and an eyeballed average of around 70%.
Anyway, again just because, here is my best 0.5.1+SSSE3 Dua Lipa result so far:
Code:
# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 6 --tilethreads 3 --cpumask ssse3 -o /dev/null
real 0m28,737s
user 0m0,000s
sys 0m0,000s
Last edited by SmilingWolf; 2nd November 2019 at 13:09.
#20 | Link | ||
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
You could check SSSE3 using a Core2Duo or an AMD processor, and AVX2 on any Haswell or newer CPU or a Ryzen 3000. If you only have the i7 4770K, just check AVX2.
Quote:
Waiting for your AVX2 LAV filters results, 0.2.1 vs 0.5.1, preferably in the form of DXVA Checker min/avg/max plus the average CPU utilization reported by DXVA Checker.