28th October 2019, 14:43 | #1882 | Link | |
Registered User
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 196
|
Quote:
I'm more interested in the GPGPU work that happened over the summer; another mirror repo had some further Vulkan work that seemed like bugfixes or 'piping', as it were. Is there any chance of getting some bench figures on that work soon? |
|
28th October 2019, 17:22 | #1883 | Link | |||||
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
Quote:
For me it's still unreasonable. Will see... Quote:
Regarding benchmarks, if there is a DirectShow or a MediaFoundation filter exposed via DXVA2/D3D11VA for AV1 compatible with nVidia GPUs, I would definitely try it although I really don't expect too much from GPGPU for video codecs.
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) HEVC decoding benchmarks H.264 DXVA Benchmarks for all |
|||||
28th October 2019, 19:51 | #1884 | Link | |
Registered User
Join Date: Nov 2009
Location: Northeast Ohio
Posts: 447
|
Quote:
I ask because I would have expected 10bit to be the standard baseline for AV1 by now, even for non-HDR content at sub-1080p resolutions. Back in the h.264 days, at least, encoding in 10bit could improve compression (though obviously no hardware 10bit h.264 decoder exists even today...but that's not an issue for newer codecs). Last I checked, however, dav1d's 10bit decode performance was actually slower than even aomedia's reference decoder!
Indeed, AVX2 workloads on Zen-based CPUs (Epyc, Ryzen, Athlon with iGPUs) do not cause any downclocking. It's for this reason that Zen2 (which has a full-width 256bit AVX2 implementation, unlike Zen1/+'s half-width 128bit AVX2 implementation) can actually keep up quite well on a per-thread basis with Intel's AVX-512-equipped CPUs in several AVX-512-accelerated workloads like x265, despite no current AMD processor supporting AVX-512.
__________________
____HTPC____ | __Desktop PC__
2.93GHz Xeon x3470 (4c/8t Nehalem) | 4.5GHz 1.24v dual-core Haswell G3258 Radeon HD5870 | Intel iGPU 2x2GB+2x1GB DDR3-1333 | 4x4GB DDR3-1600 Last edited by Nintendo Maniac 64; 28th October 2019 at 20:48. |
|
28th October 2019, 21:12 | #1885 | Link | |
Registered User
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
|
Quote:
So, that's weird, AV1 is halfway between HEVC and VP9 - what's going on here? It's simple: look at the profiles. First of all, let's do AV1, and check the top-10 functions for both runs (SSSE3 first, then AVX2):
Code:
2.25 s 18.2% 2.25 s dav1d_prep_8tap_ssse3.hv_w8_loop
1.25 s 10.1% 1.25 s motion_field_projection
700.00 ms 5.6% 700.00 ms decode_coefs
601.00 ms 4.8% 601.00 ms decode_b
475.00 ms 3.8% 475.00 ms dav1d_find_ref_mvs
391.00 ms 3.1% 391.00 ms add_tpl_ref_mv
363.00 ms 2.9% 363.00 ms dav1d_put_8tap_ssse3.hv_w8_loop
329.00 ms 2.6% 329.00 ms dav1d_prep_8tap_ssse3.h_loop
326.00 ms 2.6% 326.00 ms dav1d_cdef_filter_8x8_ssse3.k_loop
242.00 ms 1.9% 242.00 ms add_ref_mv_candidate
Code:
1.36 s 12.9% 1.36 s dav1d_prep_8tap_avx2.hv_w8_loop
1.20 s 11.4% 1.20 s motion_field_projection
741.00 ms 7.0% 741.00 ms decode_coefs
638.00 ms 6.0% 638.00 ms decode_b
479.00 ms 4.5% 479.00 ms dav1d_find_ref_mvs
433.00 ms 4.1% 433.00 ms add_tpl_ref_mv
247.00 ms 2.3% 247.00 ms add_ref_mv_candidate
241.00 ms 2.2% 241.00 ms ..@949.end
229.00 ms 2.1% 229.00 ms mc
206.00 ms 1.9% 206.00 ms dav1d_put_8tap_avx2.hv_w8_loop
OK, let's look at VP9 - again SSSE3 first, then AVX2:
Code:
2.16 s 29.5% 2.16 s decode_coeffs_8bpp
574.00 ms 7.8% 574.00 ms decode_mode
474.00 ms 6.4% 474.00 ms ff_vp9_loop_filter_h_16_16_ssse3
437.00 ms 5.9% 437.00 ms 0x1090748d0
366.00 ms 4.9% 366.00 ms ff_vp9_intra_recon_8bpp
277.00 ms 3.7% 277.00 ms ff_vp9_loop_filter_v_16_16_ssse3
229.00 ms 3.1% 229.00 ms 0x10907478f
220.00 ms 3.0% 220.00 ms ff_vp9_fill_mv
184.00 ms 2.5% 184.00 ms ff_vp9_decode_block
177.00 ms 2.4% 177.00 ms ff_vp9_loopfilter_sb
Code:
2.29 s 32.1% 2.29 s decode_coeffs_8bpp
576.00 ms 8.0% 576.00 ms decode_mode
387.00 ms 5.4% 387.00 ms ff_vp9_intra_recon_8bpp
381.00 ms 5.3% 381.00 ms ff_vp9_loop_filter_h_16_16_avx
261.00 ms 3.6% 261.00 ms 0x1075bc8d0
244.00 ms 3.4% 244.00 ms ff_vp9_loop_filter_v_16_16_avx
224.00 ms 3.1% 224.00 ms ff_vp9_decode_block
222.00 ms 3.1% 222.00 ms 0x1075bc78f
173.00 ms 2.4% 173.00 ms ff_vp9_fill_mv
168.00 ms 2.3% 168.00 ms ff_vp9_loopfilter_sb
Last, HEVC (SSSE3 first, then AVX2):
Code:
2.45 s 16.0% 2.45 s ff_hevc_hls_residual_coding
1.40 s 9.1% 1.40 s put_hevc_qpel_uni_w_hv_8
766.00 ms 5.0% 766.00 ms put_hevc_qpel_bi_w_hv_8
702.00 ms 4.6% 702.00 ms put_hevc_epel_uni_w_hv_8
685.00 ms 4.4% 685.00 ms put_hevc_qpel_hv_8
512.00 ms 3.3% 512.00 ms ff_hevc_hls_filter
511.00 ms 3.3% 511.00 ms ff_hevc_deblocking_boundary_strengths
498.00 ms 3.2% 498.00 ms hls_coding_quadtree
366.00 ms 2.4% 366.00 ms hls_transform_tree
332.00 ms 2.1% 332.00 ms put_hevc_epel_bi_w_hv_8
Code:
2.45 s 26.6% 2.45 s ff_hevc_hls_residual_coding
549.00 ms 5.9% 549.00 ms ff_hevc_hls_filter
474.00 ms 5.1% 474.00 ms ff_hevc_deblocking_boundary_strengths
448.00 ms 4.8% 448.00 ms hls_coding_quadtree
344.00 ms 3.7% 344.00 ms hls_transform_tree
233.00 ms 2.5% 233.00 ms pred_angular_2_8
230.00 ms 2.5% 230.00 ms hls_prediction_unit
192.00 ms 2.0% 192.00 ms intra_pred_2_8
184.00 ms 2.0% 184.00 ms 0x102e3032a
178.00 ms 1.9% 178.00 ms 0x102e30829
I'm going to leave the decoding claim for you to re-visit if you wish, but I don't think my data supports your claim. In fact, dav1d appears to do quite well. So, let's move over to the encoding claim: that is entirely plausible. Encoding is much more DSP heavy than decoding, and the expected speed-up is thus bigger. I would indeed expect a significantly-larger-than-20% speedup from AV1 encoding on AVX2 vs. SSEx.
Last edited by Beelzebubu; 28th October 2019 at 21:23. |
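One possible way to read the two dav1d profiles above is to match functions that appear in both runs and divide the SSSE3 self-time by the AVX2 self-time. The following is only a rough sketch: it assumes both runs decoded the same clip (the post does not say so explicitly), it drops the `_ssse3` / `_avx2` suffix so the rows can be paired, and the helper name `speedups` is made up for illustration.

```python
# Self-times in seconds, copied from the two dav1d profiles above.
# Assumption: both runs decode the same clip, so the times are comparable.
# The ISA suffix (_ssse3 / _avx2) is dropped so that rows can be matched.
ssse3 = {
    "dav1d_prep_8tap.hv_w8_loop": 2.25,
    "dav1d_put_8tap.hv_w8_loop": 0.363,
    "motion_field_projection": 1.25,
    "decode_coefs": 0.700,
    "decode_b": 0.601,
}
avx2 = {
    "dav1d_prep_8tap.hv_w8_loop": 1.36,
    "dav1d_put_8tap.hv_w8_loop": 0.206,
    "motion_field_projection": 1.20,
    "decode_coefs": 0.741,
    "decode_b": 0.638,
}


def speedups(before, after):
    """Return {function: before / after} for functions present in both runs."""
    return {name: before[name] / after[name] for name in before if name in after}


ratios = speedups(ssse3, avx2)

# Print the largest speedups first.
for name, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{name:30s} {ratio:.2f}x")
```

With these figures the two 8-tap motion-compensation loops come out at roughly 1.65x and 1.76x, while the scalar functions (`motion_field_projection`, `decode_coefs`, `decode_b`) stay within a few percent of 1.0x, which is consistent with only the DSP kernels benefiting from wider SIMD.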
|
28th October 2019, 22:36 | #1886 | Link | |
Moderator
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,750
|
Quote:
While H.264 did show a significant efficiency improvement from 10-bit encoding even of 8-bit sources, HEVC showed much less gain (due to improvements in 8-bit). I wouldn't assume that AV1 would see a gain similar to H.264 without significant testing. The most important thing about 10-bit is that it's required for HDR content. And HDR is definitely on the path to become mainstream over the next five years. |
|
28th October 2019, 23:05 | #1887 | Link | |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
Basically you are saying that ffmpeg's HEVC decoder has no actual SSSE3 optimizations, which is why it gains 40% going from the older SSE levels to AVX2 (double AV1's gain, but still not that good), while SSE4.2 vs AVX2 is almost the same. On the other hand, VP9 has no actual AVX2 optimizations, which is why SSSE3 vs AVX2 is so close. TBH, I remembered ffvp9 as one of the best optimized decoders ever, and I thought that was due to AVX2 rather than SSSE3 optimizations. The way you presented your research, it seems that all decoders are doomed in the SSEx vs AVX2 battle. I will look into it a little more and come back if I find something interesting. Thank you for your time!
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) HEVC decoding benchmarks H.264 DXVA Benchmarks for all |
|
28th October 2019, 23:14 | #1888 | Link |
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,344
|
I think you got that backwards. 10-bit H.264 hardware decoders are very rare, in consumer space anyway, while any modern HEVC decoder will handle 10-bit, so devices that can do 8-bit H.264 only, but 10-bit HEVC are ample, and growing with every new device coming out.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders |
29th October 2019, 07:46 | #1889 | Link | |
Registered User
Join Date: Nov 2009
Location: Northeast Ohio
Posts: 447
|
Quote:
Also, I thought Haswell's implementation of AVX2 was pretty much the same as Skylake's? (and I already touched on the Zen1/+ vs Zen2 implementation of AVX2 in my previous post)...unless you were alluding to Skylake-X's support of AVX-512.
__________________
____HTPC____ | __Desktop PC__
2.93GHz Xeon x3470 (4c/8t Nehalem) | 4.5GHz 1.24v dual-core Haswell G3258 Radeon HD5870 | Intel iGPU 2x2GB+2x1GB DDR3-1333 | 4x4GB DDR3-1600 Last edited by Nintendo Maniac 64; 29th October 2019 at 07:51. |
|
29th October 2019, 08:50 | #1890 | Link | ||
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,344
|
Quote:
Quote:
Skylake is after all a different micro-architecture than Haswell; it's just that we haven't gotten anything new since then.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders |
||
29th October 2019, 12:10 | #1891 | Link | |
Registered User
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 196
|
Quote:
The nVidia Shield TV is a good example of this: a relatively weak CPU by modern standards paired with a strong GPU. Cortex A57 is decent but aging now; even some Amazon products have more recent ARM cores with higher IPC. Phone products that currently exist are obviously limited to CPU power, which would strain to put out 4K24 in many cases, let alone 4K60 (which maybe Apple's flagship Axx SoC could do at the moment with dav1d using 8 bit content). It's all about taking advantage of otherwise wasted compute power to augment decode fps, perhaps even more efficiently if enough can be executed on the GPU without extraneous copy/transfer overheads to the CPU. |
|
29th October 2019, 12:17 | #1892 | Link | |
Registered User
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 196
|
Quote:
At least the newer FS 4K I bought this year has 10 bit capability, still the experience was somewhat disheartening considering the sheer amount of 10 bit content available at the time I bought the first Fire Stick, 5 years after HEVC was standardised. |
|
29th October 2019, 12:42 | #1893 | Link | |
Registered User
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
|
Quote:
Last edited by Beelzebubu; 29th October 2019 at 20:51. |
|
29th October 2019, 13:03 | #1894 | Link | ||
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
Code:
vp9_diag_downleft_32x32_10bpp_c: 1101.2
vp9_diag_downleft_32x32_10bpp_sse2: 145.4
vp9_diag_downleft_32x32_10bpp_ssse3: 137.5
vp9_diag_downleft_32x32_10bpp_avx: 134.8
vp9_diag_downleft_32x32_10bpp_avx2: 94.0
vp9_diag_downleft_32x32_12bpp_c: 1108.5
vp9_diag_downleft_32x32_12bpp_sse2: 145.5
vp9_diag_downleft_32x32_12bpp_ssse3: 137.3
vp9_diag_downleft_32x32_12bpp_avx: 135.2
vp9_diag_downleft_32x32_12bpp_avx2: 94.0
Code:
vp9_diag_downleft_32x32_12bpp_c: 1534.2
vp9_diag_downleft_32x32_12bpp_sse2: 145.9
vp9_diag_downleft_32x32_12bpp_ssse3: 140.0
vp9_diag_downleft_32x32_12bpp_avx: 134.8
vp9_diag_downleft_32x32_12bpp_avx2: 78.9
Code:
vp9_vert_left_16x16_12bpp_c: 273.8
vp9_vert_left_16x16_12bpp_sse2: 69.4
vp9_vert_left_16x16_12bpp_ssse3: 35.3
vp9_vert_left_16x16_12bpp_avx: 34.6
vp9_vert_left_16x16_12bpp_avx2: 22.4
Quote:
I have a Haswell Core i3 4170 and a Coffee Lake Refresh Core i3 9100F, and I will try the latest dav1d using LAV Video when the 0.5.x version becomes embedded in the decoder (hopefully soonish).
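The per-function benchmark figures quoted earlier in this post can be turned into speedup ratios with a few lines of Python. This is just a sketch under assumptions: it treats the numbers as timing counts where lower is faster (the post does not state the unit), it uses only two of the quoted functions, and `parse` and `speedup` are illustrative helper names rather than part of any real tool.

```python
# Two of the benchmark listings quoted in this post, one "name: value" per line.
# Assumption: the values are timing counts, so lower means faster.
BENCH_OUTPUT = """\
vp9_diag_downleft_32x32_10bpp_c: 1101.2
vp9_diag_downleft_32x32_10bpp_sse2: 145.4
vp9_diag_downleft_32x32_10bpp_ssse3: 137.5
vp9_diag_downleft_32x32_10bpp_avx: 134.8
vp9_diag_downleft_32x32_10bpp_avx2: 94.0
vp9_vert_left_16x16_12bpp_c: 273.8
vp9_vert_left_16x16_12bpp_sse2: 69.4
vp9_vert_left_16x16_12bpp_ssse3: 35.3
vp9_vert_left_16x16_12bpp_avx: 34.6
vp9_vert_left_16x16_12bpp_avx2: 22.4
"""


def parse(text):
    """Build {function: {isa: value}} from 'function_isa: value' lines."""
    table = {}
    for line in text.strip().splitlines():
        name, value = line.split(": ")
        func, isa = name.rsplit("_", 1)  # the last underscore separates the ISA
        table.setdefault(func, {})[isa] = float(value)
    return table


def speedup(table, func, baseline, target):
    """How many times faster `target` is than `baseline` for one function."""
    return table[func][baseline] / table[func][target]


table = parse(BENCH_OUTPUT)
for func in table:
    print(
        f"{func}: avx2 is {speedup(table, func, 'ssse3', 'avx2'):.2f}x vs ssse3, "
        f"{speedup(table, func, 'c', 'avx2'):.1f}x vs c"
    )
```

On these two intra-prediction functions AVX2 works out to roughly 1.5x faster than SSSE3 and about 12x faster than the C version.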
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) HEVC decoding benchmarks H.264 DXVA Benchmarks for all |
||
29th October 2019, 15:55 | #1895 | Link | ||
Lost my old account :(
Join Date: Jul 2017
Posts: 322
|
Quote:
Yeah, I was not expecting 60fps; I can't do that with a sw HEVC decoder on this CPU either. I was just interested in how far sw decoding has come. But I will get a nightly build then, Quote:
Last edited by excellentswordfight; 29th October 2019 at 15:58. |
||
29th October 2019, 16:08 | #1896 | Link | |
Registered User
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 196
|
Quote:
Also reports of Mali GPU power efficiency (used in the Kirin 970/Huawei P20) haven't been stellar either compared to alternatives in the mobile market, especially when A73 is such an efficient CPU core design to compare against. |
|
29th October 2019, 20:43 | #1897 | Link |
Registered User
Join Date: Aug 2009
Posts: 201
|
I think the code works. It gets more FPS at the same power usage, or uses less power for the same FPS. I think the comment about "lower fps-per-watt" was intended to be the opposite, "lower watts-per-fps", or at least that's my reading of the graph from the presentation.
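To make the difference between the two metrics concrete, here is a tiny sketch. The fps and watt figures are entirely hypothetical and are NOT taken from the presentation; the labels `baseline` and `optimized` are placeholders as well.

```python
def fps_per_watt(fps, watts):
    """Efficiency: higher is better."""
    return fps / watts


def watts_per_fps(fps, watts):
    """Cost per frame rate: lower is better (reciprocal of fps_per_watt)."""
    return watts / fps


# Hypothetical measurements, for illustration only.
baseline = {"fps": 30.0, "watts": 3.0}
optimized = {"fps": 30.0, "watts": 2.0}  # same FPS, less power

for label, run in (("baseline", baseline), ("optimized", optimized)):
    print(
        f"{label:9s} fps/W = {fps_per_watt(**run):.2f}  "
        f"W/fps = {watts_per_fps(**run):.3f}"
    )
```

An improvement shows up as a higher fps-per-watt and, equivalently, a lower watts-per-fps, so "lower fps-per-watt" would describe a regression.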
|
29th October 2019, 20:50 | #1898 | Link | |
Registered User
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
|
Quote:
|
|
30th October 2019, 13:17 | #1899 | Link | |
Registered User
Join Date: Feb 2003
Location: Palmcoast of Norway
Posts: 363
|
Quote:
It depends on a combination of chipset, microcode, OS version, etc. Example: I was testing x265 10bit encoded content in main10 on my older Samsung G7edge, which has a Snapdragon 820 that, going by the spec, supports all we need (4kp60 uhd...whatever that means in reality..), but I was not able to HW decode anything over main@l2.1 via XO player. VLC happily decoded main10@l4.1 with 70% CPU usage on 6 of the 8 cores, but the power drain was awful. On H.264, I haven't seen any chipset in recent years, regardless of how good the HEVC part of the SoC is, able to do anything over hp@4.2. Even the latest AV1 decode enabled chipsets can't do 10bit H.264 either. Last edited by TEB; 30th October 2019 at 13:23. |
|