#1 | Link |
Registered User
Join Date: Mar 2004
Posts: 1,099
|
dav1d accelerated AV1 decoder
dav1d 0.5.0 'Asiatic Cheetah'
https://code.videolan.org/videolan/dav1d/-/releases
The fast and small AV1 decoder, codename 'Asiatic Cheetah'. It supports all AV1 features and all bit depths. 0.5.0 brings large speed improvements on SSSE3 CPUs (up to 40%), further speed improvements on AVX2 (4-7%), ARM64 (up to 10%) and ARM32. It introduces some VSX, SSE2 and SSE4 optimizations. 0.5.0 also fixes some minor issues, can export ITU T.35 metadata and improves the player example.
Last edited by foxyshadis; 3rd December 2020 at 12:34. Reason: add general releases link |
#2 | Link | |
Registered User
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
|
Quote:
That's a little negative. But yes, you won't get a 2x (or even 1.5x) speedup. 1.2x is not bad, though. And this is straight Haswell; newer chipsets (Zen2, Skylake) will get more, as will encoders. |
|
#3 | Link | |
Lost my old account :(
Join Date: Jul 2017
Posts: 321
|
Quote:
Tried to play that 2160p60 sample from Netflix with 0.74.1 and MPC-BE on an i7-7500U: 4 threads at 100% load at 3.2 GHz. It could barely open the file, and after 30s it started playing at 2-10 fps (downscaled to 1080p). Not that I was expecting smooth playback, but is this "normal" performance? HEVC 10-bit software decoding is about 3x faster on the same setup.
Last edited by excellentswordfight; 29th October 2019 at 14:26. |
|
#4 | Link | |
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,329
|
Quote:
That won't necessarily guarantee that 2160p60 will play on a mobile U-series CPU, but it has the best chance. Just be careful not to pick the 10-bit variant of the Netflix Chimera video. 10-bit is not optimized at all yet, and it's not representative of real-world content yet. YouTube, for example, only delivers 8-bit AV1 so far. And since there is no 8-bit 2160p variant of Chimera, that's your answer.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders
Last edited by nevcairiel; 29th October 2019 at 14:35. |
|
#5 | Link |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
I'm about to start a few benchmarks with various versions of dAV1d, looking at SSSE3 and AVX2 progress on Core2Duo, Haswell, Skylake and Coffee Lake Refresh CPUs.
Is there a link to 1080p and 4K 8-bit AV1 sample videos to test with?
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) HEVC decoding benchmarks H.264 DXVA Benchmarks for all |
#6 | Link |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
dAV1d benchmarks: 0.5.1 vs 0.2.1 on Core2Duo, Haswell, Skylake, Coffee Lake R
OK, here we are.
Test Systems:
Skylake Core i5 6500 (TDP 65W) - Win 10 v1809 (17763.805) - 8GB DDR4-2133 MHz (1 DIMM - Single Channel) - All-core-turbo 3.3GHz
Haswell Core i3 4170 (TDP 54W) - Win 10 v1903 (18362.449) - 16GB DDR3-1600 MHz (Dual Channel 2x8GB) - Fixed 3.7GHz clock
Coffee Lake Refresh Core i3 9100F (TDP 65W) - Win 10 v1903 (18362.449) - 16GB DDR4-2400 MHz (Dual Channel 2x8GB) - All-core-turbo 4.0GHz
Merom Core2Duo T7600 (TDP 34W) - Win 10 v1809 (17763.805) - 4GB DDR2-667 MHz (Dual Channel Interleaved 2x2GB) - Fixed 2.33GHz clock

SW Tools:
DXVA Checker v4.2.1
LAV filters 0.74.1 (dAV1d 0.2.1 - 12/03/2019)
LAV filters 0.74.1-29 (dAV1d 0.5.1 - 26/10/2019)

During the whole benchmarking procedure, the Core i5 6500 never dropped its turbo clock of 3.3 GHz speed and Core i3 9100F never dropped its turbo clock of 4.0 GHz speed either.

Core i5 6500: Max TDP for 1080p ~33W, Max TDP for 4K ~36W
Core i3 4170: Max TDP for 4K ~35W
Core i3 9100F: Max TDP for 4K ~54W
Core2Duo T7600: No tool can read Power Consumption

All video samples below are 8bit.
Chimera 1080p24fps sample is from Netflix
Dua Lipa 1080p25fps sample is from Youtube
Holi Festival 4K25fps sample is from Elecard (thanks @HolyWu)
Summer Nature 4K25fps sample is from Elecard

The numbers below represent FramesPerSecond (FPS) expressed as minimum/average/maximum.
1080p

Chimera ~6.6Mbps
Core i5 6500 - dAV1d 0.5.1: 86/134/290, CPU 87%
Core i5 6500 - dAV1d 0.2.1: 77/127/273, CPU 91%
Core2Duo T7600 - dAV1d 0.5.1: 10/19/94, CPU 72%
Core2Duo T7600 - dAV1d 0.2.1: 8/17/100, CPU 87%

Dua Lipa ~2.2Mbps
Core i5 6500 - dAV1d 0.5.1: 120/186/251, CPU 87%
Core i5 6500 - dAV1d 0.2.1: 112/186/255, CPU 91%
Core2Duo T7600 - dAV1d 0.5.1: 7/18/70, CPU 65%
Core2Duo T7600 - dAV1d 0.2.1: 7/18/69, CPU 84%

4K

Holi Festival ~14Mbps
Core i5 6500 - dAV1d 0.5.1: 34/43/61, CPU 94%
Core i5 6500 - dAV1d 0.2.1: 30/40/60, CPU 95%

Summer Nature ~23Mbps
Core i3 9100F - dAV1d 0.5.1: 45/60/82, CPU 91%
Core i5 6500 - dAV1d 0.5.1: 32/43/57, CPU 93%
Core i5 6500 - dAV1d 0.2.1: 26/37/50, CPU 91%
Core i3 4170 - dAV1d 0.5.1: 21/30/46, CPU 92%
Core i3 4170 - dAV1d 0.2.1: 16/27/41, CPU 90%

Comments:
0) Sorry guys... dAV1d 0.5.1 has a serious CPU utilization problem on my laptop Core2Duo, essentially wiping out any optimization for the SSSE3 set. Dua Lipa has 0% gain over 0.2.1 and Chimera has only 11% on average. The situation is a disaster for the SSSE3 optimizations.
1) Seven months after the 0.2.1 release, I would say that the dAV1d team certainly was not busy doing AVX2 optimizations. It looks like 0.5.1 is only 0% - 8% faster than 0.2.1 on Skylake, apart from the last 4K clip, which gets a nice 16% gain.
2) The Skylake Core i5 6500 is certainly not capable of decoding anything more than 4K30fps AV1 at up to ~20Mbps without dropping frames, even with the latest version.
3) The Coffee Lake R Core i3 9100F is closer to 4K60fps, but its minimum frame rate is still well below 60fps.
4) 0.5.1 dropped CPU utilization a little on Skylake (but enormously on Core2Duo) compared to 0.2.1, eating some of the performance optimizations of the latest version on Skylake. The only time CPU utilization increased compared to 0.2.1, the gain was a good 16%.
5) The Core i3 9100F vs Core i5 6500 results show that CFL-R is ~15% faster than its clock advantage alone would explain, probably due to a much faster memory configuration.

Overall, the results comparing 0.2.1 vs 0.5.1 were bad for both SIMD instruction sets, SSSE3 and AVX2.
The last 7 months I see no progress according to my benchmarks and I really wonder where all those huge numbers of gain came from dAV1d team regarding 0.5.1 version vs 0.2.1.
Is there a difference using a desktop Core2Duo? Really looking forward to your tests and feedback.
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) HEVC decoding benchmarks H.264 DXVA Benchmarks for all Last edited by NikosD; 31st October 2019 at 04:30. |
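The percentages quoted in the comments of the post above can be re-derived from its average FPS figures. The following is only a minimal sketch, assuming "gain" means the new average FPS divided by the old one, minus one; the dictionary layout and the function names are illustrative and do not come from any tool used in the thread.

```python
# Average FPS copied from the benchmark post: (dAV1d 0.2.1, dAV1d 0.5.1).
# The layout and names here are illustrative assumptions.
AVG_FPS = {
    ("Chimera 1080p", "Core i5 6500"): (127, 134),
    ("Chimera 1080p", "Core2Duo T7600"): (17, 19),
    ("Dua Lipa 1080p", "Core i5 6500"): (186, 186),
    ("Dua Lipa 1080p", "Core2Duo T7600"): (18, 18),
    ("Holi Festival 4K", "Core i5 6500"): (40, 43),
    ("Summer Nature 4K", "Core i5 6500"): (37, 43),
    ("Summer Nature 4K", "Core i3 4170"): (27, 30),
}


def gain_percent(old_fps, new_fps):
    """Relative throughput gain of new over old, in percent."""
    return (new_fps / old_fps - 1.0) * 100.0


def clock_normalized_ratio(fps_a, ghz_a, fps_b, ghz_b):
    """FPS ratio of system A over system B after dividing out the clock ratio."""
    return (fps_a / fps_b) / (ghz_a / ghz_b)


if __name__ == "__main__":
    for (clip, cpu), (old, new) in AVG_FPS.items():
        print(f"{clip:18} {cpu:15} {gain_percent(old, new):5.1f}%")
    # Comment 5: Core i3 9100F (60 fps, 4.0 GHz) vs Core i5 6500 (43 fps, 3.3 GHz)
    ratio = clock_normalized_ratio(60, 4.0, 43, 3.3)
    print(f"9100F vs 6500, clock-normalized: {ratio:.2f}x")
```

With these inputs the Skylake gains fall between 0% and roughly 8%, except for Summer Nature at about 16%, and the clock-normalized 9100F advantage comes out near 1.15x, in line with the comments. The Core2Duo Chimera figure is closer to 12% than the 11% quoted, which is within rounding of integer FPS averages.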
#7 | Link |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
@Beelzebubu
@nevcairiel Guys, I posted a huge benchmark report on the progress of the dAV1d decoder between versions 0.2.1 and 0.5.1, i.e. over the last seven months, and I have seen no replies or reactions from you since. Can you confirm my findings, or reject them with results of your own that show something different? I have seen a lot of huge numbers regarding dAV1d progress from the dAV1d team in the official release notes, which I couldn't confirm, but in here you are very quiet. Waiting for your feedback!
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) HEVC decoding benchmarks H.264 DXVA Benchmarks for all |
#8 | Link | |
Registered User
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
|
Quote:
To be clear, we don't just do command-line interface tests. We test this in end-user applications such as VLC and Chrome/Firefox also, and we see the same performance improvements there that we also see in "dav1d" the commandline tool. |
|
#9 | Link | |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
Firstly, he posted the single-threaded performance difference and I posted the multi-threaded performance difference, besides the obvious difference in implementation. VLC is a popular media player, no doubt about it, but here we mostly prefer other players (MPC-HC / MPC-BE / MPV.NET etc). I don't think there is any other way to find out what is going on than to reproduce the tests yourself. Is it possible to test the two versions of LAV's implementation I posted above? Also, are the huge performance gains posted in various dAV1d release notes for single-threaded or multi-threaded performance? Thanks!
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) HEVC decoding benchmarks H.264 DXVA Benchmarks for all |
|
#10 | Link | |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
After my comment you posted multi-threaded results, not using the same tools and with different threading status. Anyway, the point here is to understand what's going on, not to play with words or intentions once again. You could try to delete the config file of DXVA Checker and uninstall and reinstall everything. I'm still waiting for an answer on whether the publicly reported gains between versions of dAV1d refer to single-threaded or multi-threaded performance. BTW, how do you benchmark dAV1d with the two executables you posted here? There is no internal command in these.
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) HEVC decoding benchmarks H.264 DXVA Benchmarks for all |
|
#11 | Link | |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
The drop in CPU utilization on Skylake was only 2%, although on Core2Duo the drop was huge. The main issue with dAV1d is the loss of any single-thread gain in real-world multi-thread decoding, for whatever internal reason. In the end, the end user doesn't know and doesn't care why the Dua Lipa video has exactly the same decoding speed with both dAV1d 0.2.1 and 0.5.1 on two different CPU architectures and instruction sets (Skylake using AVX2 / Core2Duo using SSSE3). It is we who are still searching for why this is happening and under what circumstances.
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) HEVC decoding benchmarks H.264 DXVA Benchmarks for all |
|
#12 | Link |
I am maddo saientisto!
Join Date: Aug 2018
Posts: 95
|
TLDR:
Win7 64bits, i7-4770k, 3.40GHz (stock), improvement between 0.2.1 and 0.5.1 using only SSSE3 accelerated routines, single thread: Chimera: 33.2% Dua Lipa: 34.9% Included are some AVX2 tests too, because yes. Code:
# time ./dav1d-0.2.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null
real 5m27,012s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null
real 3m38,449s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask avx2 -o /dev/null
real 3m5,282s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.2.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null
real 2m42,726s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask ssse3 -o /dev/null
real 1m45,987s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 1 --tilethreads 1 --cpumask avx2 -o /dev/null
real 1m22,243s
user 0m0,000s
sys 0m0,000s

The last 7 months I see progress according to my benchmarks and I really don't have to wonder where all those huge numbers of gain came from dAV1d team regarding 0.5.1 version vs 0.2.1
Last edited by SmilingWolf; 2nd November 2019 at 10:14. |
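For anyone re-deriving the TLDR percentages from the `real` lines: they correspond to the reduction in wall-clock time, which is not the same number as the throughput speedup. A minimal sketch follows, assuming the bash `time` format shown in the post with a comma as decimal separator; the function names are illustrative.

```python
import re

# Wall-clock "real" values copied from the single-thread runs in the post.
REAL_TIMES = {
    ("Chimera", "0.2.1", "ssse3"): "5m27,012s",
    ("Chimera", "0.5.1", "ssse3"): "3m38,449s",
    ("Chimera", "0.5.1", "avx2"): "3m5,282s",
    ("Dua Lipa", "0.2.1", "ssse3"): "2m42,726s",
    ("Dua Lipa", "0.5.1", "ssse3"): "1m45,987s",
    ("Dua Lipa", "0.5.1", "avx2"): "1m22,243s",
}


def parse_real(text):
    """Convert a bash `time` value such as '5m27,012s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)[.,](\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")


def time_reduction_percent(old_s, new_s):
    """How much less wall-clock time the new run needs, in percent."""
    return (1.0 - new_s / old_s) * 100.0


def speedup(old_s, new_s):
    """Throughput ratio of the new run over the old run."""
    return old_s / new_s


if __name__ == "__main__":
    for clip in ("Chimera", "Dua Lipa"):
        old = parse_real(REAL_TIMES[(clip, "0.2.1", "ssse3")])
        new = parse_real(REAL_TIMES[(clip, "0.5.1", "ssse3")])
        print(f"{clip}: {time_reduction_percent(old, new):.1f}% less time, "
              f"{speedup(old, new):.2f}x throughput")
```

Read this way, the 33.2% and 34.9% figures are time reductions, which correspond to roughly 1.5x the SSSE3 single-thread throughput of 0.2.1.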
#13 | Link |
I am maddo saientisto!
Join Date: Aug 2018
Posts: 95
|
FFMpeg says it's going to use 4 frame threads and 3 tile threads to decode the files, so I'll be using those numbers.
Chimera: 34% Dua Lipa: 29.7% Code:
# time ./dav1d-0.2.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null
real 1m40,657s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null
real 1m6,398s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask avx2 -o /dev/null
real 0m54,087s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.2.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null
real 0m46,972s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask ssse3 -o /dev/null
real 0m33,041s
user 0m0,000s
sys 0m0,000s

# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 4 --tilethreads 3 --cpumask avx2 -o /dev/null
real 0m25,912s
user 0m0,000s
sys 0m0,000s
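One way to read the single-thread and 4x3-thread SSSE3 runs together is to check how much threading helps each version in the command-line tool. This is only a rough sketch, assuming the ratio of single-thread to multi-thread wall-clock time is a fair proxy for threading scalability; the seconds are converted by hand from the `real` lines of the two posts, and the names are illustrative.

```python
# Wall-clock seconds from the SSSE3 runs in this thread:
# (1 frame thread / 1 tile thread, 4 frame threads / 3 tile threads).
SSSE3_SECONDS = {
    ("Chimera", "0.2.1"): (327.012, 100.657),
    ("Chimera", "0.5.1"): (218.449, 66.398),
    ("Dua Lipa", "0.2.1"): (162.726, 46.972),
    ("Dua Lipa", "0.5.1"): (105.987, 33.041),
}


def threading_speedup(single_s, multi_s):
    """How many times faster the multi-threaded run finished."""
    return single_s / multi_s


def time_reduction_percent(old_s, new_s):
    """How much less wall-clock time the new run needs, in percent."""
    return (1.0 - new_s / old_s) * 100.0


if __name__ == "__main__":
    for (clip, version), (single_s, multi_s) in SSSE3_SECONDS.items():
        print(f"{clip:9} {version}: {threading_speedup(single_s, multi_s):.2f}x from threading")
    for clip in ("Chimera", "Dua Lipa"):
        old_multi = SSSE3_SECONDS[(clip, "0.2.1")][1]
        new_multi = SSSE3_SECONDS[(clip, "0.5.1")][1]
        print(f"{clip:9} 0.5.1 vs 0.2.1 (4x3 threads): "
              f"{time_reduction_percent(old_multi, new_multi):.1f}% less time")
```

On these numbers both versions gain a little over 3x from the 4x3 configuration, so in the command-line tool the single-thread improvement largely carries over to multi-threaded decoding. It says nothing by itself about how a player integration schedules threads.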
#14 | Link |
I am maddo saientisto!
Join Date: Aug 2018
Posts: 95
|
The tool is from the dav1d project, found here: https://code.videolan.org/videolan/d...e/master/tools
You can either compile it yourself (MABS can do that) or use my copy: https://mega.nz/#!op5gGSTD!JPyhq1IqJ...g8SyR2HbSTs7gk Finding the best frame/tile threads numbers is a bit tricky. Fiddling can improve performance, and I made some quick tests that brought Dua Lipa down to 29 seconds on 0.5.1+SSSE3 using --framethreads 6 --tilethreads 2, but I did not want to post them because in the end almost no media player is going to let you stray from FFmpeg, and therefore LAVFilters, defaults. Last edited by SmilingWolf; 2nd November 2019 at 12:13. |
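The frame/tile thread fiddling described above can be scripted. The sketch below only assembles and times the same command line used in the earlier posts; the flags are copied from those posts (dav1d 0.2.1/0.5.1) and other dav1d versions may use different options, while the binary path, the input file name and the helper names are placeholders for illustration.

```python
import itertools
import subprocess
import time


def build_cmd(binary, ivf_path, frame_threads, tile_threads, cpumask="ssse3"):
    """Assemble the dav1d command line used in the posts above."""
    return [
        binary, "-q", "-i", ivf_path,
        "--muxer", "yuv4mpeg2",
        "--framethreads", str(frame_threads),
        "--tilethreads", str(tile_threads),
        "--cpumask", cpumask,
        "-o", "/dev/null",
    ]


def sweep_commands(binary, ivf_path, frame_options, tile_options, cpumask="ssse3"):
    """One command per (frame threads, tile threads) combination."""
    return [
        build_cmd(binary, ivf_path, frames, tiles, cpumask)
        for frames, tiles in itertools.product(frame_options, tile_options)
    ]


def time_cmd(cmd):
    """Run one command and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


# Example use (needs the binary and the sample file to be present):
#   for cmd in sweep_commands("./dav1d-0.5.1.exe", "Dua_Lipa.ivf", [4, 6], [2, 3]):
#       print(f"{time_cmd(cmd):8.3f}s  {' '.join(cmd)}")
```

Timing each combination more than once and keeping the best run would reduce noise, but as noted above, a tuned result mostly matters for the command-line tool, since media players tend to stay on the FFmpeg/LAV Filters defaults.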
#15 | Link | |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
Because we have seriously contradicting results between the multi-threaded performance of ffmpeg and LAV filters regarding dAV1d, according to your tests and mine. It could be your compilations vs nevcairiel's compilations, it could be the setup of LAV vs ffmpeg for dAV1d, or the hyperthreading nature of the 4770K. If you don't want to raise the thread count in order to reach 100% CPU utilization, you could disable hyperthreading in the BIOS and run the tests again with 4 threads. Also, you could run a LAV filters benchmark using GraphStudioNext or DXVA Checker on your Core i7 as is, with hyperthreading ON, and see how that goes.
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) HEVC decoding benchmarks H.264 DXVA Benchmarks for all |
|
#16 | Link |
I am maddo saientisto!
Join Date: Aug 2018
Posts: 95
|
I can't use DXVA Checker to check performance, since the dav1d library inside the FFmpeg library inside the LAVFilters library would default to using AVX2, unless you know of a way to target a specific instruction set from within DXVA Checker, or perhaps using an environment variable. OTOH, if you just want me to check how dav1d multithreading works in different versions of LAVFilters/ffmpeg, I can test that. But it won't be a 0.2.1 vs 0.5.1 SSSE3 benchmark anymore.
I'm not too sure why you think I'm not using all cores of my CPU. With the settings above, 4 frame 3 tile threads, I get peaks of 80% CPU usage, and an eyeballed average of around 70%. Anyway, again just because, here is my best 0.5.1+SSSE3 Dua Lipa result so far: Code:
# time ./dav1d-0.5.1.exe -q -i Dua_Lipa.ivf --muxer yuv4mpeg2 --framethreads 6 --tilethreads 3 --cpumask ssse3 -o /dev/null
real 0m28,737s
user 0m0,000s
sys 0m0,000s

Last edited by SmilingWolf; 2nd November 2019 at 13:09. |
#17 | Link | ||
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
You could check SSSE3 using a Core2Duo or an AMD processor and AVX2 on any Haswell onwards or Ryzen 3000. If you only have i7 4770K, just check AVX2. Quote:
Waiting for your AVX2 LAV filters results 0.2.1 vs 0.5.1 preferably in the form of DXVA Checker min/avg/max and the average CPU utilization reported by DXVA Checker.
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) HEVC decoding benchmarks H.264 DXVA Benchmarks for all |
||
#18 | Link |
I am maddo saientisto!
Join Date: Aug 2018
Posts: 95
|
Oh but you seemed so worried about how much dav1d was using all my cores just one day ago.
But here, have a Chimera run: Code:
LAVFilters 0.74.1-29:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 8929
FPS: 170,234 [103-349]
CPU Usage: -
GPU Usage: 0 [0-1] %
GPU Video Engine Usage: 0 [0-0] %

LAVFilters 0.74.1:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 8929
FPS: 139,201 [77-306]
CPU Usage: -
GPU Usage: 0 [0-1] %
GPU Video Engine Usage: 0 [0-0] %

Code:
LAVFilters 0.74.1-29:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 5615
FPS: 260,815 [183-335]
CPU Usage: -
GPU Usage: 0 [0-1] %
GPU Video Engine Usage: 0 [0-0] %

LAVFilters 0.74.1:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 5615
FPS: 248,936 [137-328]
CPU Usage: -
GPU Usage: 0 [0-0] %
GPU Video Engine Usage: 0 [0-0] %

Last edited by SmilingWolf; 3rd November 2019 at 20:41. |
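To put the two DXVA Checker result blocks above on the same footing as the earlier percentages, the average FPS values can be turned into relative gains. A minimal sketch, assuming the first number on each FPS line is the average with a decimal comma; the runs are keyed by frame count because the second block is not labelled with a clip name in the post, and the names are illustrative.

```python
# Average FPS reported by DXVA Checker in the two result blocks above
# (decimal comma converted to a decimal point), keyed by frame count.
LAV_AVG_FPS = {
    8929: {"0.74.1": 139.201, "0.74.1-29": 170.234},  # the Chimera run
    5615: {"0.74.1": 248.936, "0.74.1-29": 260.815},  # clip not named in the post
}


def gain_percent(old_fps, new_fps):
    """Relative throughput gain of new over old, in percent."""
    return (new_fps / old_fps - 1.0) * 100.0


if __name__ == "__main__":
    for frames, fps in LAV_AVG_FPS.items():
        gain = gain_percent(fps["0.74.1"], fps["0.74.1-29"])
        print(f"{frames} frames: {gain:.1f}% faster with LAVFilters 0.74.1-29")
```

On these figures the 8929-frame Chimera run gains about 22% on average FPS, while the 5615-frame run gains about 5%.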
#19 | Link | |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
The very low gain in real-world multi-thread performance between versions 0.2.1 and 0.5.1 of the dAV1d decoder, as measured by me using the above systems and tools, compared to the gains advertised and publicly reported by the dAV1d team for the SSSE3 and AVX2 optimizations. All my other comments express my agony to explain that huge difference by any means. Your results fully confirm mine for the Dua Lipa video, but there is a small light at the end of the tunnel for Chimera (regardless of the name of the video). I think @nevcairiel could explain better and test LAV filters' dAV1d implementation.
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) HEVC decoding benchmarks H.264 DXVA Benchmarks for all |
|
#20 | Link |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
@nevcairiel
@Beelzebubu @SmilingWolf
A few more interesting notes regarding LAV filters.
LAV filters v0.74.1 allows you to set Thread = 1, but for AV1 (using dAV1d) it actually shows around 50% CPU utilization, which means 2 cores = 2 threads. For all the other codecs it uses only 1 thread, as it should based on the selection.
LAV filters v0.74.1-29 doesn't even allow Thread = 1 for AV1: with that setting it is not enumerated in DXVA Checker when trying to decode AV1 files, while it can still be used for all the other codecs with only 1 thread.
So there is definitely something different about the dAV1d integration in LAV filters compared to all the other codecs.
In LAV filters 0.74.1-29, setting Thread = 4 gives exactly the same performance as Auto on my Core i5 6500.
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) HEVC decoding benchmarks H.264 DXVA Benchmarks for all |