View Full Version : Alliance for Open Media codecs
dapperdan
9th October 2019, 13:56
I couldn't find any x264 tune ssim encodes, but I did find some x265 tune PSNR that show basically the same pattern. Tuning for PSNR at the expense of Fast SSIM, while other metrics don't really change.
The major difference is that the PSNR tuning seems to help with Cb and Cr for x265, not just the Y channel.
Fascinating. So it appears x264 and x265 have their own version of VMAF which is just the combination of PSNR and FastSSIM metrics.
LigH
9th October 2019, 15:03
So it appears x264 and x265 have their own version of VMAF which is just the combination of PSNR and FastSSIM metrics.
That sounds confusing to me. PSNR and SSIM are mainly per-frame metrics, whereas VMAF should also consider motion, and thus have a temporal window.
The main metric used in x264 and x265 is the "rate factor", if I'm not completely wrong...
dapperdan
9th October 2019, 15:36
I just meant in the sense that VMAF is a fusion of other metrics, including PSNR.
VMAF uses fancy statistics to combine them all, based on how well they agreed with subjective measurements, but it looks like x264 just combined Fast SSIM and PSNR when they didn't move in opposite directions.
mzso
9th October 2019, 16:27
Will SVT-AV1 and rav1e be added to ffmpeg as encoder libraries?
Or is that not possible because of licensing?
sneaker_ger
9th October 2019, 18:55
AFAIK there are no licensing issues. Both rav1e and SVT-AV1 have open licences (BSD 2-Clause and BSD+patent). Some patches already seem to exist, but it seems they aren't fully ready yet (i.e. developers still need to do some work or want to wait a bit more for the respective projects to mature).
quietvoid
9th October 2019, 19:12
rav1e is still missing a stable C API for ffmpeg, IIRC.
TD-Linux
10th October 2019, 01:51
Will SVT-AV1 and rav1e be added to ffmpeg as encoder libraries?
Or is that not possible because of licensing?
For rav1e we are waiting on releasing 0.1. We're down to one blocker bug, so it should be out soon: https://github.com/xiph/rav1e/issues/1636
TD-Linux
10th October 2019, 01:54
Fascinating. So it appears x264 and x265 have their own version of VMAF which is just the combination of PSNR and FastSSIM metrics.
Unfortunately it's impossible to use VMAF directly for RDO because an encoder has to evaluate distortion on very small blocks, down to 8x8 pixels. So the best you can do is come up with something that approximates it.
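To make the block-size point concrete, here is a minimal sketch of what "evaluating distortion on a small block" looks like inside a rate-distortion decision. Everything in it is an assumption for illustration: plain SSE stands in for whatever distortion measure a real encoder uses, and the block contents, bit counts and lambda values are made up.

```python
# Illustrative sketch only: not taken from x264, x265 or any AV1 encoder.

def block_sse(src, rec, x0, y0, size=8):
    """Sum of squared errors over one size x size block (a purely local measure)."""
    total = 0
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            d = src[y][x] - rec[y][x]
            total += d * d
    return total


def rd_cost(distortion, bits, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R."""
    return distortion + lam * bits


def pick_mode(src, candidates, x0, y0, lam, size=8):
    """candidates: list of (name, reconstructed_frame, bits). Returns the cheapest name."""
    best = min(
        candidates,
        key=lambda c: rd_cost(block_sse(src, c[1], x0, y0, size), c[2], lam),
    )
    return best[0]


if __name__ == "__main__":
    src = [[100] * 8 for _ in range(8)]
    exact = [[100] * 8 for _ in range(8)]  # perfect reconstruction, costs many bits
    rough = [[102] * 8 for _ in range(8)]  # off by 2 everywhere, costs few bits
    candidates = [("exact", exact, 200), ("rough", rough, 20)]
    print(pick_mode(src, candidates, 0, 0, lam=1.0))  # low lambda favours quality
    print(pick_mode(src, candidates, 0, 0, lam=2.0))  # high lambda favours bitrate
```

The decision only ever sees 64 pixels at a time, which is why a frame-level, multi-feature metric like VMAF can't simply be dropped in as the distortion term.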
LigH
10th October 2019, 07:43
Since you mention rav1e ... this week they broke compilation for the x86 target, possibly in an attempt to add assembler optimizations. The developers seem to lack a cross-compilation environment for x86 (where I am not sure if that specifically means 32 bit code) for automated testing. I discovered that issue during my usual sporadic runs of the media-autobuild suite.
Kirakishou
14th October 2019, 07:42
Hi guys. I'm not good at English, sorry for that. I got the file [Xrip][Nekopara][OVA_Extra][GB][1080P][AV1_10bit].mp4 from nyaa. The latest mpv build and several dozen previous builds play it back choppily, in both the x86 and x64 versions.
But mpc-hc 1.8.4 x86 plays it back smoothly. That is only true of the x86 build of mpc-hc 1.8.4; the x64 build is also choppy, and so are later builds of mpc-hc. As far as I understand, there were some changes in LAV Filters that made things worse. Maybe this information will be useful to someone.
sneaker_ger
14th October 2019, 10:18
MPC-HC 1.8.4 (LAV v0.73.1) uses libaom for AV1 decoding, 1.8.5 (LAV v0.74) uses dav1d. dav1d isn't optimized for 10 bit AV1 yet and it seems for this particular case and your hardware that libaom is faster (for 8 bit dav1d is much faster). Probably same problem for mpv. As dav1d matures this problem will be solved.
https://code.videolan.org/videolan/dav1d/issues/216
https://code.videolan.org/videolan/dav1d/issues/78
benwaggoner
15th October 2019, 00:27
Unfortunately it's impossible to use VMAF directly for RDO because an encoder has to evaluate distortion on very small blocks, down to 8x8 pixels. So the best you can do is come up with something that approximates it.
And VMAF isn't THAT great a metric. Many encoders will do some more sophisticated things internally, particularly around maintaining temporal coherence. VMAF does at least include a lightweight interframe comparison metric, but it doesn't do anything new to figure out how the variation of quality of individual frames impacts the overall viewer experience.
benwaggoner
15th October 2019, 00:31
I couldn't find any x264 tune ssim encodes, but I did find some x265 tune PSNR that show basically the same pattern. Tuning for PSNR at the expense of Fast SSIM, while other metrics don't really change.
The major difference is that the PSNR tuning seems to help with Cb and Cr for x265, not just the Y channel.
Well, "help" in the sense that PSNR metrics would improve. --tune psnr reduces the subjective quality of the content at a given bitrate BY optimizing only for improved PSNR scores.
NikosD
23rd October 2019, 13:02
First commercial AV1 hardware decoder, claimed by Chips&Media and it's called Wave510A.
Can handle 4K60fps AV1 main profile using one core@500MHz and expands to dual core@1000MHz for 8K60fps.
Supports AV1 8bit/10bit up to 8Kx8K and up to 50Mbps
More here:
https://www.anandtech.com/show/15003/chipsmedia-launches-wave510a-hardware-av1-decoder-ip
and here:
https://en.chipsnmedia.com/page/product_view/5919
soresu
23rd October 2019, 13:57
Can handle 4K60fps AV1 main profile using one core@500MHz and expands to dual core@1000MHz for 8K60fps.
I wonder what the power draw for those configurations is.
benwaggoner
24th October 2019, 16:46
First commercial AV1 hardware decoder, claimed by Chips&Media and it's called Wave510A.
Can handle 4K60fps AV1 main profile using one core@500MHz and expands to dual core@1000MHz for 8K60fps.
Supports AV1 8bit/10bit up to 8Kx8K and up to 50Mbps
More here:
https://www.anandtech.com/show/15003/chipsmedia-launches-wave510a-hardware-av1-decoder-ip
and here:
https://en.chipsnmedia.com/page/product_view/5919
Interesting!
I wish there was some hint on how many transistors this takes, so we could estimate the silicon cost of adding it to a chip. It could be bigger than normal as this is JUST an AV1 decoder, without sharing anything with H.264/HEVC/VP9/etcetera decoders. In a more mature implementation, one would expect an integrated decoder which supports multiple bitstreams. That takes a lot fewer transistors in total than having all those as independent decoders.
400/500 MHz is pretty reasonable, as it can run in a processor in a relatively lower power state for better battery life on long-term content.
I am not a deep SoC guy, so take all above with an appropriately scaled grain of salt.
I'm looking forward to seeing an announcement for the first device with HW AV1 decode. AV1 isn't relevant for premium content until a material portion of customers have devices with HW decoders with integrated HW DRM.
So much hinges on whether the additive cost of AV1 decode will be low enough to be a default in lower cost SoCs in the next year or two. I'm kinda startled how murky that still is as we approach 2020.
birdie
24th October 2019, 21:10
Considering the timeline it looks like SnapDragon 865 (or whatever comes after SnapDragon 855) won't support AV1 HW decoding which is a huge bummer.
soresu
24th October 2019, 22:17
Considering the timeline it looks like SnapDragon 865 (or whatever comes after SnapDragon 855) won't support AV1 HW decoding which is a huge bummer.
Samsung's next chip doesn't have AV1 and they have actually joined AOM, so Qualcomm's conspicuous absence from AOM makes their support for AV1 dubious at best.
hajj_3
25th October 2019, 19:19
Leaked Amlogic roadmap showing the Amlogic S905X4, S908X and S805X2 chips that will support AV1. The S905X4 looks to be shipping within the next few months:
https://www.cnx-software.com/wp-content/uploads/2019/10/Amlogic-S905X4-S905X8-S805X2-AV1-8K-Processors-Large.jpg
Adonisds
26th October 2019, 21:46
Does Youtube keep the original of every video? Will every video receive an AV1 version? Just future ones?
sneaker_ger
26th October 2019, 22:09
Does Youtube keep the original of every video?
Yes.
Will every video receive an AV1 version? Just future ones?
Possibly every video but we can't say for sure. It will cost a lot of cpu time/money to convert all videos in all resolutions (+SDR/HDR) to AV1. For the longest time Youtube did not encode all videos to VP9 either. So we don't really know how it will be for AV1.
Adonisds
26th October 2019, 22:39
Yes.
Possibly every video but we can't say for sure. It will cost a lot of cpu time/money to convert all videos in all resolutions (+SDR/HDR) to AV1. For the longest time Youtube did not encode all videos to VP9 either. So we don't really know how it will be for AV1.
Thank you. Are all videos from before VP9 available in VP9?
vidschlub
26th October 2019, 23:48
Meanwhile AV1 was only standardised last year, it takes time to get these new codecs ship shape for production purposes, let alone significant maturity.
Due to the extended interest in AV1 from such a wide group of companies, could we expect to see it gain traction faster than 265 at least?
vidschlub
26th October 2019, 23:57
[Xrip][Nekopara][OVA_Extra][GB][1080P][AV1_10bit].mp4 .
As predicted, anime teams are always first to adopt crazy new codecs as soon as possible. I love these guys, they're nuts <3
birdie
27th October 2019, 11:44
Thank you. Are all videos from before VP9 available in VP9?
No. Google encodes into VP9 only videos with more than N views, or N views per day, where N and N/day are not publicly known.
Due to the extended interest in AV1 from such a wide group of companies, could we expect to see it gain traction faster than 265 at least?
I really doubt that considering that AV1 is up to two orders of magnitude more computationally expensive and older x86 CPUs cannot even decode FullHD 60fps videos encoded in it in real time.
dapperdan
27th October 2019, 14:18
Due to the extended interest in AV1 from such a wide group of companies, could we expect to see it gain traction faster than 265 at least?
Probably a lot depends on how you define traction in this situation.
For example, it wouldn't surprise me if the total stream watch time for AV1 is already above HEVC due to YouTube encoding low res versions of popular videos. Mozilla released some numbers that suggested AV1 Firefox video views on their nightly channel rapidly rose to 20% of all video plays just before YouTube paused their AV1 rollout. Would be interested to see where that's gone since.
Instagram already uses a software decoder for VP9 on Android (a version of libvpx, surprisingly), so switching to dav1d and AV1 for popular content isn't totally unbelievable even before hardware decoders are widespread. (I'm not sure if AV1 would be much better in terms of bitrate than VP9 for their user-generated content at low bitrates, but if SVT and dav1d are sufficiently better than libvpx then it would actually save them time and energy to upgrade.)
I'd love to see a real world comparison of watching something like Breaking Bad on a phone on a metered connection. What are realistic bitrates for these users? If you can get a 30% bitrate saving just from synthetic film grain, and AV1's tools appear to work better at lower bitrates, how much does the software decoding actually cost you? Once you factor in network savings, is it actually noticeable against the baseline of having the screen on? What about in the download scenario? Is it worth the battery hit to see 5 more episodes on your monthly bandwidth allowance?
IgorC
27th October 2019, 17:38
I wonder why dav1d developers have dedicated time to optimizing for SSE2. Isn't SSSE3 already old enough? AMD caught up and implemented SSSE3 in 2011. Even the outdated Core 2 Duo has SSSE3.
Meanwhile, 10-bit decoding has literally zero optimizations at the moment.
P.S. A few years ago I tested a 10-year-old laptop with a Pentium T4200 (SSSE3), which now sits unused. It could barely play YouTube VP9 720p videos and still dropped some frames, let alone AV1 with its 3x complexity.
AV1 would be actually a downgrade for this kind of hardware (from VP9 720p to AV1 360/480p). And we're talking about CPU with SSSE3.
birdie
27th October 2019, 19:02
https://code.videolan.org/videolan/dav1d/-/releases#0.5.1
http://download.opencontent.netflix.com/?prefix=AV1/Sparks/
Netflix posted new AV1 samples with and without film grain in 540p, 1080p and 2160p
Neither mpv, nor ffplay can open these *.obu files. Any ideas how one can play them?
sneaker_ger
27th October 2019, 19:20
Mux to mkv using mkvmerge first.
nevcairiel
28th October 2019, 10:43
dav1d's AVX2 is fine. If you want to properly compare SSSE3 vs AVX2, then you need to look at single-threaded benchmarks. Multi-threading is often limited in scaling, where such differences can "hide".
But you should also not expect twice the performance from AVX2, since once you optimize everything possible with SSSE3/AVX2, the remaining parts that cannot be optimized so easily will impact the performance the most.
Beelzebubu
28th October 2019, 12:55
Ok...So, I take a look at the single threaded performance and I see a 20% gain of AVX2 compared to SSSE3.
On what system (chipset)?
soresu
28th October 2019, 14:43
BTW, any plans for AVX-512 in near future ?
Is there any benefit on this ?
There was a merge request/issue some time ago for adding some support for it (specifically mentioned as Ice Lake), but I don't think any actual optimisations have been committed to the master yet, going by my git commit RSS/Atom feed anyway.
I'm more interested in the GPGPU work that happened over the summer, another mirror repo had some further Vulkan work that seemed like bugfixes or 'piping' as it were.
Is there any chance of getting some bench figures on that work soon?
NikosD
28th October 2019, 17:22
Few things going on there:
YMM (e.g. AVX2) functions are never exactly 2x as fast as XMM (e.g. SSSE3) functions, even in theoretical conditions;
True, but to be honest I was expecting something like 60% - 70%.
That's a reasonable gain going from SSEx to AVX2
YMM upper lane use will cause CPU downclocking (but not on modern AMD CPUs, I'm being told);
Using a Haswell for years I believed that too, but actually Intel has solved the issue since Skylake.
My Core i3 9100F can keep all core turbo of 4.0GHz forever using AVX2 optimized code (but not power virus like Prime95 small FFT)
certain code in SIMD functions does not use YMM upper lanes (effectively), usually because the block size is too small (width=4-8), but sometimes because we don't want a function-pointer-call overhead (multisymbol coding);
and obviously, a lot of code is not SIMD'ed at all, it's 50%-50% between SIMD and non-SIMD at best.
Together, that means the speedup is well below half of half, so 20% is not entirely unreasonable. Sucks a bit, but you can't beat reality.
It's not that I don't believe you or nevcairiel, but personally judging by H.265 encoding/decoding and VP9 decoding, I think the transition from SSEx to AVX2 could be more impressive than 20%.
For me it's still unreasonable.
Will see...
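The argument in the quoted points is essentially Amdahl's law. As a rough sketch: the 50% SIMD share is the "at best" figure quoted above, while the per-kernel gains (2.0x, 1.7x, 1.5x) are assumed values picked only to show the trend, not measurements.

```python
def overall_speedup(simd_fraction, simd_speedup):
    """Amdahl's law: only simd_fraction of the runtime becomes simd_speedup times faster."""
    return 1.0 / ((1.0 - simd_fraction) + simd_fraction / simd_speedup)


if __name__ == "__main__":
    # 0.5 = share of decode time spent in SIMD code (figure from the quoted post).
    # The kernel gains below are hypothetical.
    for kernel_gain in (2.0, 1.7, 1.5):
        s = overall_speedup(0.5, kernel_gain)
        print(f"kernels {kernel_gain:.1f}x faster -> whole decode {100 * (s - 1):.1f}% faster")
```

Under these assumptions, even a perfect 2x on every SIMD kernel gives only about a third more speed overall, and kernels that are 1.5x faster land at 20% for the whole decode.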
I'm more interested in the GPGPU work that happened over the summer, another mirror repo had some further Vulkan work that seemed like bugfixes or 'piping' as it were.
Is there any chance of getting some bench figures on that work soon?
I think pure fixed-function HW will appear for the first time in 2020 using Ampere architecture of nVidia (hopefully) and as I have said in the past, GPGPU can't be that effective with such a complex codec like AV1, in my opinion.
Regarding benchmarks, if there is a DirectShow or a MediaFoundation filter exposed via DXVA2/D3D11VA for AV1 compatible with nVidia GPUs, I would definitely try it although I really don't expect too much from GPGPU for video codecs.
Nintendo Maniac 64
28th October 2019, 19:51
1080p on SSE2 is not our goal. The goal is to have a baseline support so ~5 years (or even earlier?) from now, AV1 can be the baseline, not H.264. We don't know for sure, but this may imply some basic need for SSE2 support. So we're exploring what is possible and how much work it'd be.
Question about what constitutes a baseline - would this happen to be 8bit AV1 only?
I ask because I would have expected by now for 10bit to be the standard baseline for AV1 even for non-HDR content at sub-1080p resolutions since, at least back in the h.264 days, doing such can improve compression (though obviously no hardware 10bit h.264 decoder exists even today...but that's not an issue for newer codecs), but last I checked dav1d's 10bit decode performance was actually slower than even aomedia's reference decoder!
YMM upper lane use will cause CPU downclocking (but not on modern AMD CPUs, I'm being told)
Indeed, AVX2 workloads on Zen-based CPUs (Epyc, Ryzen, Athlon with iGPUs) do not cause any downclocking. It's for this reason that Zen2 (which has a full-width 256bit AVX2 implementation unlike Zen1/+'s half-width 128bit AVX2 implementation) can actually keep up quite well on a per-thread basis to Intel's AVX-512-equipped CPUs in several AVX-512-accelerated workloads like x265 despite no current AMD processor supporting AVX-512.
Beelzebubu
28th October 2019, 21:12
personally judging by H.265 encoding/decoding and VP9 decoding, I think the transition from SSEx to AVX2 could be more impressive than 20%.
For me it's still unreasonable.
OK, let's test your claim. HEVC/VP9/dav1d decoding using FFmpeg, recent snapshot, on my local Haswell laptop of a same-quality encoded sample of the same file (ToddlerFountain), everything single-threaded, in alphabetical order:
AV1: SSSE3 12.913s vs. AVX2 10.638s = 21.4% faster;
HEVC: SSSE3 15.188s vs. AVX2 10.938s = 38.9% faster;
VP9: SSSE3 7.133s vs. AVX2 6.748s = 5.7% faster;
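For anyone wanting to reproduce the percentages from the timings above, they follow from old_time / new_time - 1. A small sketch (the function and variable names are mine, the timings are the ones listed):

```python
def percent_faster(old_seconds, new_seconds):
    """Speed gain expressed as throughput ratio minus one, in percent."""
    return (old_seconds / new_seconds - 1.0) * 100.0


# (SSSE3 seconds, AVX2 seconds) as listed above
timings = {
    "AV1": (12.913, 10.638),
    "HEVC": (15.188, 10.938),
    "VP9": (7.133, 6.748),
}

if __name__ == "__main__":
    for codec, (ssse3, avx2) in timings.items():
        print(f"{codec}: {percent_faster(ssse3, avx2):.1f}% faster")
```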
So, that's weird, AV1 is halfway between HEVC and VP9 - what's going on here? It's simple: look at the profiles. First of all, let's do AV1, and check the top-10 functions for both runs (SSSE3 first, then AVX2):
2.25 s 18.2% 2.25 s dav1d_prep_8tap_ssse3.hv_w8_loop
1.25 s 10.1% 1.25 s motion_field_projection
700.00 ms 5.6% 700.00 ms decode_coefs
601.00 ms 4.8% 601.00 ms decode_b
475.00 ms 3.8% 475.00 ms dav1d_find_ref_mvs
391.00 ms 3.1% 391.00 ms add_tpl_ref_mv
363.00 ms 2.9% 363.00 ms dav1d_put_8tap_ssse3.hv_w8_loop
329.00 ms 2.6% 329.00 ms dav1d_prep_8tap_ssse3.h_loop
326.00 ms 2.6% 326.00 ms dav1d_cdef_filter_8x8_ssse3.k_loop
242.00 ms 1.9% 242.00 ms add_ref_mv_candidate
vs.
1.36 s 12.9% 1.36 s dav1d_prep_8tap_avx2.hv_w8_loop
1.20 s 11.4% 1.20 s motion_field_projection
741.00 ms 7.0% 741.00 ms decode_coefs
638.00 ms 6.0% 638.00 ms decode_b
479.00 ms 4.5% 479.00 ms dav1d_find_ref_mvs
433.00 ms 4.1% 433.00 ms add_tpl_ref_mv
247.00 ms 2.3% 247.00 ms add_ref_mv_candidate
241.00 ms 2.2% 241.00 ms ..@949.end
229.00 ms 2.1% 229.00 ms mc
206.00 ms 1.9% 206.00 ms dav1d_put_8tap_avx2.hv_w8_loop
What do we see? Nothing unexpected (except maybe the large percentage of time spent in ref_mvs.c functions, which we know about and is tracked in #217 (https://code.videolan.org/videolan/dav1d/issues/217)). Some time spent in Properly optimized functions, but nothing major.
OK, let's look at VP9 - again SSSE3 first, then AVX2:
2.16 s 29.5% 2.16 s decode_coeffs_8bpp
574.00 ms 7.8% 574.00 ms decode_mode
474.00 ms 6.4% 474.00 ms ff_vp9_loop_filter_h_16_16_ssse3
437.00 ms 5.9% 437.00 ms 0x1090748d0
366.00 ms 4.9% 366.00 ms ff_vp9_intra_recon_8bpp
277.00 ms 3.7% 277.00 ms ff_vp9_loop_filter_v_16_16_ssse3
229.00 ms 3.1% 229.00 ms 0x10907478f
220.00 ms 3.0% 220.00 ms ff_vp9_fill_mv
184.00 ms 2.5% 184.00 ms ff_vp9_decode_block
177.00 ms 2.4% 177.00 ms ff_vp9_loopfilter_sb
vs.
2.29 s 32.1% 2.29 s decode_coeffs_8bpp
576.00 ms 8.0% 576.00 ms decode_mode
387.00 ms 5.4% 387.00 ms ff_vp9_intra_recon_8bpp
381.00 ms 5.3% 381.00 ms ff_vp9_loop_filter_h_16_16_avx
261.00 ms 3.6% 261.00 ms 0x1075bc8d0
244.00 ms 3.4% 244.00 ms ff_vp9_loop_filter_v_16_16_avx
224.00 ms 3.1% 224.00 ms ff_vp9_decode_block
222.00 ms 3.1% 222.00 ms 0x1075bc78f
173.00 ms 2.4% 173.00 ms ff_vp9_fill_mv
168.00 ms 2.3% 168.00 ms ff_vp9_loopfilter_sb
(Sorry for the hex codes.) What you see here is simple. There is not much AVX2. There is AVX-XMM (Sandybridge), but that only helps a couple of percent at best, apparently (with three-operand instructions, and SSE4 opcodes).
Last, HEVC (SSSE3 first, then AVX2):
2.45 s 16.0% 2.45 s ff_hevc_hls_residual_coding
1.40 s 9.1% 1.40 s put_hevc_qpel_uni_w_hv_8
766.00 ms 5.0% 766.00 ms put_hevc_qpel_bi_w_hv_8
702.00 ms 4.6% 702.00 ms put_hevc_epel_uni_w_hv_8
685.00 ms 4.4% 685.00 ms put_hevc_qpel_hv_8
512.00 ms 3.3% 512.00 ms ff_hevc_hls_filter
511.00 ms 3.3% 511.00 ms ff_hevc_deblocking_boundary_strengths
498.00 ms 3.2% 498.00 ms hls_coding_quadtree
366.00 ms 2.4% 366.00 ms hls_transform_tree
332.00 ms 2.1% 332.00 ms put_hevc_epel_bi_w_hv_8
vs.
2.45 s 26.6% 2.45 s ff_hevc_hls_residual_coding
549.00 ms 5.9% 549.00 ms ff_hevc_hls_filter
474.00 ms 5.1% 474.00 ms ff_hevc_deblocking_boundary_strengths
448.00 ms 4.8% 448.00 ms hls_coding_quadtree
344.00 ms 3.7% 344.00 ms hls_transform_tree
233.00 ms 2.5% 233.00 ms pred_angular_2_8
230.00 ms 2.5% 230.00 ms hls_prediction_unit
192.00 ms 2.0% 192.00 ms intra_pred_2_8
184.00 ms 2.0% 184.00 ms 0x102e3032a
178.00 ms 1.9% 178.00 ms 0x102e30829
Aha, we have the opposite problem here: HEVC has no SSSE3 fallback for most routines, it only has AVX optimizations. No wonder the difference is so big, and even then, it's only ~40%... If I compare the SSE4.2 performance to AVX2 for the same file, I get a couple of % at best, even though ffhevc has a fair bunch of AVX2 optimizations.
I'm going to leave the decoding claim for you to re-visit if you wish, but I don't think my data supports your claim. In fact, dav1d appears to do quite well.
So, let's move over to the encoding claim: that is entirely plausible. Encoding is much more DSP heavy than decoding, and the expected speed-up is thus bigger. I would indeed expect a significantly-larger-than-20% speedup from AV1 encoding on AVX2 vs. SSEx.
benwaggoner
28th October 2019, 22:36
Question about what constitutes a baseline - would this happen to be 8bit AV1 only?
I ask because I would have expected by now for 10bit to be the standard baseline for AV1 even for non-HDR content at sub-1080p resolutions since, at least back in the h.264 days, doing such can improve compression (though obviously no hardware 10bit h.264 decoder exists even today...but that's not an issue for newer codecs), but last I checked dav1d's 10bit decode performance was actually slower than even aomedia's reference decoder!
There are plenty of 10-bit H.264 decoders out there. There certainly are some devices that can decode 10-bit HEVC but only 8-bit H.264, but plenty who can do 10-bit of both.
While H.264 did show a significant efficiency improvement from 10-bit encoding even of 8-bit sources, HEVC showed much less gain (due to improvements in 8-bit). I wouldn't assume that AV1 would see a gain similar to H.264 without significant testing.
The most important thing about 10-bit is that it's required for HDR content. And HDR is definitely on the path to become mainstream over the next five years.
NikosD
28th October 2019, 23:05
OK, let's test your claim. HEVC/VP9/dav1d decoding using FFmpeg, recent snapshot, on my local Haswell laptop of a same-quality encoded sample of the same file (ToddlerFountain)...
I'm going to leave the decoding claim for you to re-visit if you wish, but I don't think my data supports your claim. In fact, dav1d appears to do quite well.
Your analysis is very interesting.
Basically you say that ffmpeg HEVC decoding has no actual SSSE3 optimizations, which is why it gains 40% going from the older SSE level to AVX2. That is double the AV1 gain but still not that good, while SSE4.2 vs AVX2 is almost the same.
On the other hand, VP9 has no actual AVX2 optimizations that's why SSSE3 vs AVX2 is so close.
TBH, I remembered ffvp9 to be one of the best optimized decoders ever and I thought it was due to AVX2 and not SSSE3 optimizations.
The way you presented your research, it seems that all decoders are doomed in the SSEx vs AVX2 battle.
I will look into it a bit more and come back if I find something interesting.
Thank you for your time!
nevcairiel
28th October 2019, 23:14
There are plenty of 10-bit H.264 decoders out there. There certainly are some devices that can decode 10-bit HEVC but only 8-bit H.264, but plenty who can do 10-bit of both.
I think you got that backwards. 10-bit H.264 hardware decoders are very rare, in consumer space anyway, while any modern HEVC decoder will handle 10-bit, so devices that can do 8-bit H.264 only, but 10-bit HEVC are ample, and growing with every new device coming out.
Nintendo Maniac 64
29th October 2019, 07:46
And this is straight Haswell, newer chipsets (Zen2, Skylake) will get more, as will encoders.
Well unless it's a Pentium since those still lack AVX support altogether even if it's a desktop Skylake-based Pentium like the G5400 and such (which are effectively what an i3 used to be with Pentiums now being 2core/4thread).
Also, I thought Haswell's implementation of AVX2 was pretty much the same as Skylake's? (and I already touched on the Zen1/+ vs Zen2 implementation of AVX2 in my previous post)...unless you were alluding to Skylake-X's support of AVX-512.
nevcairiel
29th October 2019, 08:50
Well unless it's a Pentium since those still lack AVX support altogether even if it's a desktop Skylake-based Pentium like the G5400 and such (which are effectively what an i3 used to be with Pentiums now being 2core/4thread).
Chips like this are why SSSE3 was still a primary optimization target, among other things. We can't fix the lack of AVX2, but SSSE3 will do the best that is possible on them.
Also, I thought Haswell's implementation of AVX2 was pretty much the same as Skylake's? (and I already touched on the Zen1/+ vs Zen2 implementation of AVX2 in my previous post)...unless you were alluding to Skylake-X's support of AVX-512.
AVX2 improved a bit in Skylake: the better process allows the downclock to be less aggressive, and instruction latencies were slightly improved. The clocking difference would make the biggest difference there.
Skylake is, after all, a different micro-architecture than Haswell; it's just that we haven't gotten anything new since then.
soresu
29th October 2019, 12:10
I think pure fixed-function HW will appear for the first time in 2020 using Ampere architecture of nVidia (hopefully) and as I have said in the past, GPGPU can't be that effective with such a complex codec like AV1, in my opinion.
Regarding benchmarks, if there is a DirectShow or a MediaFoundation filter exposed via DXVA2/D3D11VA for AV1 compatible with nVidia GPUs, I would definitely try it although I really don't expect too much from GPGPU for video codecs.
It interests me purely for decoding on platforms that lack a more modern CPU core, but still have GPU power that could share the decoding effort and is otherwise going to waste.
The nVidia Shield TV is a good example of this - a relatively weak CPU by modern terms with a strong GPU.
Cortex A57 is decent but aging now, even some Amazon products have more recent ARM cores with higher IPC, and obviously phone products that currently exist are completely limited to CPU power which would strain to put out 4K24 in many cases, let alone 4K60 - which maybe Apple's flagship Axx SoC could do at the moment with dav1d using 8 bit content.
It's all about taking advantage of otherwise wasted compute power to augment decode fps, perhaps even do so more efficiently if enough can be executed on the GPU without extraneous copy/transfer overheads to the CPU.
soresu
29th October 2019, 12:17
I think you got that backwards. 10-bit H.264 hardware decoders are very rare, in consumer space anyway, while any modern HEVC decoder will handle 10-bit, so devices that can do 8-bit H.264 only, but 10-bit HEVC are ample, and growing with every new device coming out.
Unfortunately not fast enough, I bought an Amazon Fire Stick for my dad in 2017 assuming that the advertised HEVC support meant up to 10 bit, only to find out to my horror that most of the HEVC encoded videos I have did not work on it.
At least the newer FS 4K I bought this year has 10 bit capability, still the experience was somewhat disheartening considering the sheer amount of 10 bit content available at the time I bought the first Fire Stick, 5 years after HEVC was standardised.
Beelzebubu
29th October 2019, 12:42
I'm more interested in the GPGPU work that happened over the summer, another mirror repo had some further Vulkan work that seemed like bugfixes or 'piping' as it were.
Is there any chance of getting some bench figures on that work soon?
Last slide (https://www.twitch.tv/videos/498918740?t=4h40m49s) in this presentation (https://www.twitch.tv/videos/498918740?t=4h37m50s), although it was cut-off at a hard 3-minute limit. Slide shows lower watt-per-fps (originally written as fps-per-watt) when using the GPU code we have so far compared to pure CPU.
NikosD
29th October 2019, 13:03
For ffvp9, it was the other way around, we did everything-and-more in SSSE3, and then did a couple of things (some MC, some inverse transforms) in AVX2, but the smaller inverse transforms and MC, as well as the loopfilters and most intra predictors, were never done. So it's fairly incomplete.
I managed to find out a few more details regarding AVX2 optimizations of ffvp9 which are good (for 10bit and 12bit) on the specific routines and with a clear distance from the extremely optimized SSSE3 version.
vp9_diag_downleft_32x32_10bpp_c: 1101.2
vp9_diag_downleft_32x32_10bpp_sse2: 145.4
vp9_diag_downleft_32x32_10bpp_ssse3: 137.5
vp9_diag_downleft_32x32_10bpp_avx: 134.8
vp9_diag_downleft_32x32_10bpp_avx2: 94.0
vp9_diag_downleft_32x32_12bpp_c: 1108.5
vp9_diag_downleft_32x32_12bpp_sse2: 145.5
vp9_diag_downleft_32x32_12bpp_ssse3: 137.3
vp9_diag_downleft_32x32_12bpp_avx: 135.2
vp9_diag_downleft_32x32_12bpp_avx2: 94.0
AVX2 version is 32% faster than SSSE3 for vp9 ipred_dl_32x32_16
vp9_diag_downleft_32x32_12bpp_c: 1534.2
vp9_diag_downleft_32x32_12bpp_sse2: 145.9
vp9_diag_downleft_32x32_12bpp_ssse3: 140.0
vp9_diag_downleft_32x32_12bpp_avx: 134.8
vp9_diag_downleft_32x32_12bpp_avx2: 78.9
AVX2 version is 44% faster than SSSE3 for ipred_dl_32x32
vp9_vert_left_16x16_12bpp_c: 273.8
vp9_vert_left_16x16_12bpp_sse2: 69.4
vp9_vert_left_16x16_12bpp_ssse3: 35.3
vp9_vert_left_16x16_12bpp_avx: 34.6
vp9_vert_left_16x16_12bpp_avx2: 22.4
AVX2 version is 37% faster than SSSE3 for ipred_vl_16x16
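A note on how these percentages relate to the raw numbers: they match the share of time saved (1 - avx2 / ssse3), which is a different convention from the "old / new - 1" one used for the whole-decoder timings earlier in the thread. A quick sketch with the SSSE3/AVX2 figures listed above (function names are mine; I'm assuming the figures are timings where lower is better):

```python
def time_saved_pct(old, new):
    """Share of the old runtime that disappears, in percent."""
    return (1.0 - new / old) * 100.0


def throughput_gain_pct(old, new):
    """Extra work done per unit time, in percent (old / new - 1)."""
    return (old / new - 1.0) * 100.0


# (ssse3, avx2) pairs copied from the listings above
samples = {
    "diag_downleft_32x32 (137.5 vs 94.0)": (137.5, 94.0),
    "diag_downleft_32x32 (140.0 vs 78.9)": (140.0, 78.9),
    "vert_left_16x16 (35.3 vs 22.4)": (35.3, 22.4),
}

if __name__ == "__main__":
    for name, (ssse3, avx2) in samples.items():
        print(
            f"{name}: {time_saved_pct(ssse3, avx2):.1f}% less time, "
            f"{throughput_gain_pct(ssse3, avx2):.1f}% more throughput"
        )
```

So a "32% faster" routine by this convention would read as roughly 46% faster by the other one, which matters when comparing per-function numbers against the 21.4% whole-decoder figure.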
1.2x is nothing bad, though. And this is straight Haswell, newer chipsets (Zen2, Skylake) will get more, as will encoders.
I was ready to tell you to test exactly the same things using a more modern implementation of AVX2 than Haswell, like Skylake onwards (basically it's the same old 2015 Skylake architecture for all Intel CPUs after Skylake).
I have a Haswell Core i3 4170 and a Coffee Lake Refresh Core i3 9100F and I will try the latest dav1d using LAV Video when the 0.5.x version becomes embedded in the decoder (hopefully soonish).
excellentswordfight
29th October 2019, 15:55
0.2.1, the newest available at the time. You can use a nightly version (https://files.1f0.de/lavf/nightly/) which would come with 0.5.1, the newest available right now.
That won't necessarily guarantee that 2160p60 will play on a mobile U-series CPU, but it got the best chances.
Ah, ok, I missed the date on the build, I see now that it was released quite some time ago so that makes sense.
Yeah, I was not expecting 60fps; I can't do that with a sw HEVC decoder on this CPU either. I was just interested in how far sw decoding has come. But I will get a nightly build then.
Just be careful not to pick the 10-bit variant of the Netflix Chimera video. 10-bit is not optimized at all yet, and its not representative of real-world content yet. YouTube for example only delivers AV1 8-bit so far.
And since there is no 8-bit 2160p variant of Chimera, thats your answer.
I was using a 10bit sample; the new sparks one that was linked on previous page.
soresu
29th October 2019, 16:08
Last slide (https://www.twitch.tv/videos/498918740?t=4h40m49s) in this presentation (https://www.twitch.tv/videos/498918740?t=4h37m50s), although it was cut-off at a hard 3-minute limit. Slide shows lower fps-per-watt when using the GPU code we have so far compared to pure CPU.
Thanks for the update, disappointing but to be expected I guess for an early effort which likely expends a lot of power and time copying data back and forth between the CPU and GPU.
Also reports of Mali GPU power efficiency (used in the Kirin 970/Huawei P20) haven't been stellar either compared to alternatives in the mobile market, especially when A73 is such an efficient CPU core design to compare against.
dapperdan
29th October 2019, 20:43
Thanks for the update, disappointing but to be expected I guess for an early effort
I think the code works. It gets more FPS at the same power usage, or uses less power for the same FPS. I think the comment about "lower fps-per-watt" was intended to be the opposite, "lower watts-per-fps", or at least that's my reading of the graph from the presentation.
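Since the two phrasings are easy to mix up, here is a tiny sketch of how they relate. The fps and watt figures are invented purely for illustration and are not taken from the slide.

```python
def fps_per_watt(fps, watts):
    """Efficiency: higher is better."""
    return fps / watts


def watts_per_fps(fps, watts):
    """Cost per frame rate: lower is better (the reciprocal of fps_per_watt)."""
    return watts / fps


if __name__ == "__main__":
    # Hypothetical measurements: same frame rate, GPU-assisted path drawing less power.
    cpu_only = {"fps": 30.0, "watts": 3.0}
    gpu_assisted = {"fps": 30.0, "watts": 2.0}
    for name, m in (("CPU only", cpu_only), ("GPU assisted", gpu_assisted)):
        print(
            f"{name}: {fps_per_watt(m['fps'], m['watts']):.1f} fps/W, "
            f"{watts_per_fps(m['fps'], m['watts']):.3f} W/fps"
        )
```

A "lower" value is good news only for watts-per-fps; the same result shows up as a higher fps-per-watt.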
Beelzebubu
29th October 2019, 20:50
I think the code works. It gets more FPS at the same power usage, or uses less power for the same FPS. I think the comment about "lower fps-per-watt" was intended to be the opposite, "lower watts-per-fps", or at least that's my reading of the graph from the presentation.
Oops, yes, sorry, my bad. I'll update my post. Thanks for noticing.
TEB
30th October 2019, 13:17
Unfortunately not fast enough, I bought an Amazon Fire Stick for my dad in 2017 assuming that the advertised HEVC support meant up to 10 bit, only to find out to my horror that most of the HEVC encoded videos I have did not work on it.
At least the newer FS 4K I bought this year has 10 bit capability, still the experience was somewhat disheartening considering the sheer amount of 10 bit content available at the time I bought the first Fire Stick, 5 years after HEVC was standardised.
I can confirm this based on my research too. "All" devices released in the last 2-3 years support HEVC, but the devil's in the details here. 10-bit (Main10) support is lacking quite a lot, and many older chipsets support Main10 up to 4Kp60 on the spec sheet but in reality won't provide more than 8-bit at level 3.0.
It depends on a combination of chipset, microcode, OS version etc.
Example: I was testing x265 10-bit content encoded in Main10 on my older Samsung G7edge, which has a Snapdragon 820 that per the spec supports all we need (4Kp60 UHD... whatever that means in reality), but I was not able to HW decode anything over main@l2.1 via XO player. VLC happily decoded main10@l4.1 with 70% CPU usage on 6 of the 8 cores, but the power drain was awful.
On H.264, I haven't seen any chipset in recent years, regardless of how good the HEVC part of the SoC is, that is able to do anything over hp@4.2. Even the latest AV1-decode-enabled chipsets can't do 10-bit H.264.
Blue_MiSfit
30th October 2019, 18:09
^ exactly. Even though the SOC should do 4kp60, the actual implementation in $phone can only do level 3.0.