Old 28th October 2019, 12:55   #1881  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
Quote:
Originally Posted by NikosD View Post
Ok...So, I take a look at the single threaded performance and I see a 20% gain of AVX2 compared to SSSE3.
On what system (chipset)?
Old 28th October 2019, 14:43   #1882  |  Link
soresu
Registered User
 
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 196
Quote:
Originally Posted by NikosD View Post
BTW, any plans for AVX-512 in near future ?
Is there any benefit on this ?
There was a merge request/issue some time ago for adding some support for it (specifically mentioned as Ice Lake), but I don't think any actual optimisations have been committed to the master yet, going by my git commit RSS/Atom feed anyway.

I'm more interested in the GPGPU work that happened over the summer, another mirror repo had some further Vulkan work that seemed like bugfixes or 'piping' as it were.

Is there any chance of getting some bench figures on that work soon?
Old 28th October 2019, 17:22   #1883  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
Quote:
Originally Posted by Beelzebubu View Post
Few things going on there:
  • YMM (e.g. AVX2) functions are never exactly 2x as fast as XMM (e.g. SSSE3) functions, even in theoretical conditions;
True, but to be honest I was expecting something like 60% - 70%.
That's a reasonable gain going from SSEx to AVX2.

Quote:
Originally Posted by Beelzebubu View Post
  • YMM upper lane use will cause CPU downclocking (but not on modern AMD CPUs, I'm being told);
Using a Haswell for years I believed that too, but actually Intel has solved the issue since Skylake.
My Core i3 9100F can keep all core turbo of 4.0GHz forever using AVX2 optimized code (but not power virus like Prime95 small FFT).

Quote:
Originally Posted by Beelzebubu View Post
  • certain code in SIMD functions does not use YMM upper lanes (effectively), usually because the block size is too small (width=4-8), but sometimes because we don't want a function-pointer-call overhead (multisymbol coding);
  • and obviously, a lot of code is not SIMD'ed at all, it's 50%-50% between SIMD and non-SIMD at best.

Together, that means the speedup is well below half of half, so 20% is not entirely unreasonable. Sucks a bit, but you can't beat reality.
It's not that I don't believe you or nevcairiel, but personally judging by H.265 encoding/decoding and VP9 decoding, I think the transition from SSEx to AVX2 could be more impressive than 20%.
For me it's still unreasonable.
Will see...
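For reference, the arithmetic behind the quoted bullet points can be sketched with Amdahl's law. This is only an illustration: the 50% SIMD share comes from the quoted post, while the per-kernel speedups (2.0x, 1.6x, 1.4x) are assumed values chosen to show the trend, not measurements.

```python
def overall_speedup(simd_fraction, kernel_speedup):
    """Amdahl's law: only simd_fraction of the runtime gets kernel_speedup."""
    return 1.0 / ((1.0 - simd_fraction) + simd_fraction / kernel_speedup)


# 0.5 is the "50%-50% at best" SIMD share from the quoted post.
# The kernel speedups below are assumptions for illustration only.
for kernel_speedup in (2.0, 1.6, 1.4):
    total = overall_speedup(0.5, kernel_speedup)
    print(f"kernels {kernel_speedup:.1f}x faster -> whole decoder {total:.2f}x faster")
```

Under these assumptions, even a perfect 2x on every SIMD kernel gives only about 1.33x overall, and anything less than 2x per kernel lands close to the 1.2x being discussed.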
Quote:
Originally Posted by soresu View Post
I'm more interested in the GPGPU work that happened over the summer, another mirror repo had some further Vulkan work that seemed like bugfixes or 'piping' as it were.



Is there any chance of getting some bench figures on that work soon?
I think pure fixed-function HW will appear for the first time in 2020 using Ampere architecture of nVidia (hopefully) and as I have said in the past, GPGPU can't be that effective with such a complex codec like AV1, in my opinion.
Regarding benchmarks, if there is a DirectShow or a MediaFoundation filter exposed via DXVA2/D3D11VA for AV1 compatible with nVidia GPUs, I would definitely try it although I really don't expect too much from GPGPU for video codecs.
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1)
HEVC decoding benchmarks
H.264 DXVA Benchmarks for all
Old 28th October 2019, 19:51   #1884  |  Link
Nintendo Maniac 64
Registered User
 
Join Date: Nov 2009
Location: Northeast Ohio
Posts: 447
Quote:
Originally Posted by Beelzebubu View Post
1080p on SSE2 is not our goal. The goal is to have a baseline support so ~5 years (or even earlier?) from now, AV1 can be the baseline, not H.264. We don't know for sure, but this may imply some basic need for SSE2 support. So we're exploring what is possible and how much work it'd be.
Question about what constitutes a baseline - would this happen to be 8bit AV1 only?

I ask because I would have expected by now for 10bit to be the standard baseline for AV1 even for non-HDR content at sub-1080p resolutions since, at least back in the h.264 days, doing such can improve compression (though obviously no hardware 10bit h.264 decoder exists even today...but that's not an issue for newer codecs), but last I checked dav1d's 10bit decode performance was actually slower than even aomedia's reference decoder!


Quote:
Originally Posted by Beelzebubu View Post
[*]YMM upper lane use will cause CPU downclocking (but not on modern AMD CPUs, I'm being told)
Indeed, AVX2 workloads on Zen-based CPUs (Epyc, Ryzen, Athlon with iGPUs) do not cause any downclocking. It's for this reason that Zen2 (which has a full-width 256bit AVX2 implementation unlike Zen1/+'s half-width 128bit AVX2 implementation) can actually keep up quite well on a per-thread basis to Intel's AVX-512-equipped CPUs in several AVX-512-accelerated workloads like x265 despite no current AMD processor supporting AVX-512.
__________________
____HTPC____  | __Desktop PC__
2.93GHz Xeon x3470 (4c/8t Nehalem) | 4.5GHz 1.24v dual-core Haswell G3258
Radeon HD5870  | Intel iGPU      
2x2GB+2x1GB DDR3-1333 | 4x4GB DDR3-1600       

Last edited by Nintendo Maniac 64; 28th October 2019 at 20:48.
Old 28th October 2019, 21:12   #1885  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
Quote:
Originally Posted by NikosD View Post
personally judging by H.265 encoding/decoding and VP9 decoding, I think the transition from SSEx to AVX2 could be more impressive than 20%.
For me it's still unreasonable.
OK, let's test your claim. HEVC/VP9/dav1d decoding using FFmpeg, recent snapshot, on my local Haswell laptop of a same-quality encoded sample of the same file (ToddlerFountain), everything single-threaded, in alphabetical order:
  • AV1: SSSE3 12.913s vs. AVX2 10.638s = 21.4% faster;
  • HEVC: SSSE3: 15.188s vs. AVX2: 10.938s = 38.9% faster;
  • VP9: SSSE3 7.133s vs. AVX2 6.748s = 5.7% faster;
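The percentages in this list follow from the ratio of the two wall-clock times. A minimal sketch that reproduces them (the timings are copied from the list; the function name `percent_faster` is just an illustrative choice):

```python
# Single-threaded decode times in seconds: (SSSE3, AVX2), copied from the list above.
timings = {
    "AV1": (12.913, 10.638),
    "HEVC": (15.188, 10.938),
    "VP9": (7.133, 6.748),
}


def percent_faster(t_old, t_new):
    """Throughput gain in percent: how much more work per second the faster run does."""
    return (t_old / t_new - 1.0) * 100.0


for codec, (ssse3, avx2) in timings.items():
    gain = percent_faster(ssse3, avx2)
    print(f"{codec}: {ssse3 / avx2:.3f}x, i.e. {gain:.1f}% faster")
```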

So, that's weird, AV1 is halfway between HEVC and VP9 - what's going on here? It's simple: look at the profiles. First of all, let's do AV1, and check the top-10 functions for both runs (SSSE3 first, then AVX2):

Code:
  2.25 s    18.2%	  2.25 s	 	 dav1d_prep_8tap_ssse3.hv_w8_loop
  1.25 s    10.1%	  1.25 s	 	 motion_field_projection
700.00 ms    5.6%	700.00 ms	 	 decode_coefs
601.00 ms    4.8%	601.00 ms	 	 decode_b
475.00 ms    3.8%	475.00 ms	 	 dav1d_find_ref_mvs
391.00 ms    3.1%	391.00 ms	 	 add_tpl_ref_mv
363.00 ms    2.9%	363.00 ms	 	 dav1d_put_8tap_ssse3.hv_w8_loop
329.00 ms    2.6%	329.00 ms	 	 dav1d_prep_8tap_ssse3.h_loop
326.00 ms    2.6%	326.00 ms	 	 dav1d_cdef_filter_8x8_ssse3.k_loop
242.00 ms    1.9%	242.00 ms	 	 add_ref_mv_candidate
vs.

Code:
  1.36 s    12.9%  	  1.36 s	 	 dav1d_prep_8tap_avx2.hv_w8_loop
  1.20 s    11.4%  	  1.20 s	 	 motion_field_projection
741.00 ms    7.0%	741.00 ms	 	 decode_coefs
638.00 ms    6.0%	638.00 ms	 	 decode_b
479.00 ms    4.5%	479.00 ms	 	 dav1d_find_ref_mvs
433.00 ms    4.1%	433.00 ms	 	 add_tpl_ref_mv
247.00 ms    2.3%	247.00 ms	 	 add_ref_mv_candidate
241.00 ms    2.2%	241.00 ms	 	 ..@949.end
229.00 ms    2.1%	229.00 ms	 	 mc
206.00 ms    1.9%	206.00 ms	 	 dav1d_put_8tap_avx2.hv_w8_loop
What do we see? Nothing unexpected (except maybe the large percentage of time spent in ref_mvs.c functions, which we know about and is tracked in #217). Some time spent in properly optimized functions, but nothing major.
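To read the two AV1 profiles side by side, it can help to compute a per-function ratio for entries that appear in both. A small sketch using times copied from the listings above; the shortened function labels (ISA suffix and `dav1d_` prefix dropped) are only for illustration:

```python
# Times in milliseconds for functions present in both AV1 profiles above.
ssse3_ms = {
    "prep_8tap.hv_w8_loop": 2250.0,
    "put_8tap.hv_w8_loop": 363.0,
    "motion_field_projection": 1250.0,
    "decode_coefs": 700.0,
}
avx2_ms = {
    "prep_8tap.hv_w8_loop": 1360.0,
    "put_8tap.hv_w8_loop": 206.0,
    "motion_field_projection": 1200.0,
    "decode_coefs": 741.0,
}


def speedups(before, after):
    """Return before/after time ratios for functions present in both profiles."""
    return {name: before[name] / after[name] for name in before if name in after}


for name, ratio in speedups(ssse3_ms, avx2_ms).items():
    print(f"{name}: {ratio:.2f}x")
```

On these figures the two 8-tap motion-compensation loops speed up by roughly 1.65x to 1.76x, while the non-SIMD functions stay at about 1.0x, which is why the whole-decoder gain is much smaller than the per-kernel gain.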

OK, let's look at VP9 - again SSSE3 first, then AVX2:

Code:
  2.16 s    29.5%	  2.16 s	 	 decode_coeffs_8bpp
574.00 ms    7.8%	574.00 ms	 	 decode_mode
474.00 ms    6.4%	474.00 ms	 	 ff_vp9_loop_filter_h_16_16_ssse3
437.00 ms    5.9%	437.00 ms	 	 0x1090748d0
366.00 ms    4.9%	366.00 ms	 	 ff_vp9_intra_recon_8bpp
277.00 ms    3.7%	277.00 ms	 	 ff_vp9_loop_filter_v_16_16_ssse3
229.00 ms    3.1%	229.00 ms	 	 0x10907478f
220.00 ms    3.0%	220.00 ms	 	 ff_vp9_fill_mv
184.00 ms    2.5%	184.00 ms	 	 ff_vp9_decode_block
177.00 ms    2.4%	177.00 ms	 	 ff_vp9_loopfilter_sb
vs.

Code:
  2.29 s   32.1%	  2.29 s	 	 decode_coeffs_8bpp
576.00 ms    8.0%	576.00 ms	 	 decode_mode
387.00 ms    5.4%	387.00 ms	 	 ff_vp9_intra_recon_8bpp
381.00 ms    5.3%	381.00 ms	 	 ff_vp9_loop_filter_h_16_16_avx
261.00 ms    3.6%	261.00 ms	 	 0x1075bc8d0
244.00 ms    3.4%	244.00 ms	 	 ff_vp9_loop_filter_v_16_16_avx
224.00 ms    3.1%	224.00 ms	 	 ff_vp9_decode_block
222.00 ms    3.1%	222.00 ms	 	 0x1075bc78f
173.00 ms    2.4%	173.00 ms	 	 ff_vp9_fill_mv
168.00 ms    2.3%	168.00 ms	 	 ff_vp9_loopfilter_sb
(Sorry for the hex codes.) What you see here is simple. There is not much AVX2. There is AVX-XMM (Sandybridge), but that only helps a couple of percent at best, apparently (with three-operand instructions, and SSE4 opcodes).

Last, HEVC (SSSE3 first, then AVX2):

Code:
  2.45 s    16.0%	  2.45 s	 	 ff_hevc_hls_residual_coding
  1.40 s     9.1%	  1.40 s	 	 put_hevc_qpel_uni_w_hv_8
766.00 ms    5.0%	766.00 ms	 	 put_hevc_qpel_bi_w_hv_8
702.00 ms    4.6%	702.00 ms	 	 put_hevc_epel_uni_w_hv_8
685.00 ms    4.4%	685.00 ms	 	 put_hevc_qpel_hv_8
512.00 ms    3.3%	512.00 ms	 	 ff_hevc_hls_filter
511.00 ms    3.3%	511.00 ms	 	 ff_hevc_deblocking_boundary_strengths
498.00 ms    3.2%	498.00 ms	 	 hls_coding_quadtree
366.00 ms    2.4%	366.00 ms	 	 hls_transform_tree
332.00 ms    2.1%	332.00 ms	 	 put_hevc_epel_bi_w_hv_8
vs.

Code:
  2.45 s    26.6%	  2.45 s	 	 ff_hevc_hls_residual_coding
549.00 ms    5.9%	549.00 ms	 	 ff_hevc_hls_filter
474.00 ms    5.1%	474.00 ms	 	 ff_hevc_deblocking_boundary_strengths
448.00 ms    4.8%	448.00 ms	 	 hls_coding_quadtree
344.00 ms    3.7%	344.00 ms	 	 hls_transform_tree
233.00 ms    2.5%	233.00 ms	 	 pred_angular_2_8
230.00 ms    2.5%	230.00 ms	 	 hls_prediction_unit
192.00 ms    2.0%	192.00 ms	 	 intra_pred_2_8
184.00 ms    2.0%	184.00 ms	 	 0x102e3032a
178.00 ms    1.9%	178.00 ms	 	 0x102e30829
Aha, we have the opposite problem here: HEVC has no SSSE3 fallback for most routines, it only has AVX optimizations. No wonder the difference is so big, and even then, it's only ~40%... If I compare the SSE4.2 performance to AVX2 for the same file, I get a couple of % at best, even though ffhevc has a fair bunch of AVX2 optimizations.

I'm going to leave the decoding claim for you to re-visit if you wish, but I don't think my data supports your claim. In fact, dav1d appears to do quite well.

So, let's move over to the encoding claim: that is entirely plausible. Encoding is much more DSP heavy than decoding, and the expected speed-up is thus bigger. I would indeed expect a significantly-larger-than-20% speedup from AV1 encoding on AVX2 vs. SSEx.

Last edited by Beelzebubu; 28th October 2019 at 21:23.
Old 28th October 2019, 22:36   #1886  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,750
Quote:
Originally Posted by Nintendo Maniac 64 View Post
Question about what constitutes a baseline - would this happen to be 8bit AV1 only?

I ask because I would have expected by now for 10bit to be the standard baseline for AV1 even for non-HDR content at sub-1080p resolutions since, at least back in the h.264 days, doing such can improve compression (though obviously no hardware 10bit h.264 decoder exists even today...but that's not an issue for newer codecs), but last I checked dav1d's 10bit decode performance was actually slower than even aomedia's reference decoder!
There are plenty of 10-bit H.264 decoders out there. There certainly are some devices that can decode 10-bit HEVC but only 8-bit H.264, but plenty who can do 10-bit of both.

While H.264 did show a significant efficiency improvement from 10-bit encoding even of 8-bit sources, HEVC showed much less gain (due to improvements in 8-bit). I wouldn't assume that AV1 would see a gain similar to H.264 without significant testing.

The most important thing about 10-bit is that it's required for HDR content. And HDR is definitely on the path to become mainstream over the next five years.
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 28th October 2019, 23:05   #1887  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
Quote:
Originally Posted by Beelzebubu View Post
OK, let's test your claim. HEVC/VP9/dav1d decoding using FFmpeg, recent snapshot, on my local Haswell laptop of a same-quality encoded sample of the same file (ToddlerFountain)...
I'm going to leave the decoding claim for you to re-visit if you wish, but I don't think my data supports your claim. In fact, dav1d appears to do quite well.
Your analysis is very interesting.
Basically you're saying that FFmpeg's HEVC decoder has no real SSSE3 optimizations, which is why it gains 40% going from SSSE3 to AVX2 (double AV1's gain, but still not that good), while SSE4.2 vs. AVX2 is almost the same.
On the other hand, VP9 has no real AVX2 optimizations, which is why SSSE3 vs. AVX2 is so close.
TBH, I remembered ffvp9 as one of the best optimized decoders ever, and I thought that was due to AVX2 rather than SSSE3 optimizations.
The way you presented your research, it seems that all decoders are doomed in the SSEx vs. AVX2 battle.
I will look into it a bit more and come back if I find something interesting.
Thank you for your time!
Old 28th October 2019, 23:14   #1888  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,344
Quote:
Originally Posted by benwaggoner View Post
There are plenty of 10-bit H.264 decoders out there. There certainly are some devices that can decode 10-bit HEVC but only 8-bit H.264, but plenty who can do 10-bit of both.
I think you got that backwards. 10-bit H.264 hardware decoders are very rare, in consumer space anyway, while any modern HEVC decoder will handle 10-bit, so devices that can do 8-bit H.264 only, but 10-bit HEVC are ample, and growing with every new device coming out.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders
Old 29th October 2019, 07:46   #1889  |  Link
Nintendo Maniac 64
Registered User
 
Join Date: Nov 2009
Location: Northeast Ohio
Posts: 447
Quote:
Originally Posted by Beelzebubu View Post
And this is straight Haswell, newer chipsets (Zen2, Skylake) will get more, as will encoders.
Well unless it's a Pentium since those still lack AVX support altogether even if it's a desktop Skylake-based Pentium like the G5400 and such (which are effectively what an i3 used to be with Pentiums now being 2core/4thread).

Also, I thought Haswell's implementation of AVX2 was pretty much the same as Skylake's? (and I already touched on the Zen1/+ vs Zen2 implementation of AVX2 in my previous post)...unless you were alluding to Skylake-X's support of AVX-512.
Last edited by Nintendo Maniac 64; 29th October 2019 at 07:51.
Old 29th October 2019, 08:50   #1890  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,344
Quote:
Originally Posted by Nintendo Maniac 64 View Post
Well unless it's a Pentium since those still lack AVX support altogether even if it's a desktop Skylake-based Pentium like the G5400 and such (which are effectively what an i3 used to be with Pentiums now being 2core/4thread).
Chips like this are why SSSE3 was still a primary optimization target, among other things. Can't fix the lack of AVX2, but SSSE3 will do the best that is possible on them.

Quote:
Originally Posted by Nintendo Maniac 64 View Post
Also, I thought Haswell's implementation of AVX2 was pretty much the same as Skylake's? (and I already touched on the Zen1/+ vs Zen2 implementation of AVX2 in my previous post)...unless you were alluding to Skylake-X's support of AVX-512.
AVX2 improved a bit in Skylake, the better process allows the downclock to be less aggressive, and instruction latencies were slightly improved. The clocking difference would make the biggest difference there.
Skylake is after all a different micro-architecture than Haswell; it's just that we haven't gotten anything new since then.
Old 29th October 2019, 12:10   #1891  |  Link
soresu
Registered User
 
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 196
Quote:
Originally Posted by NikosD View Post
I think pure fixed-function HW will appear for the first time in 2020 using Ampere architecture of nVidia (hopefully) and as I have said in the past, GPGPU can't be that effective with such a complex codec like AV1, in my opinion.
Regarding benchmarks, if there is a DirectShow or a MediaFoundation filter exposed via DXVA2/D3D11VA for AV1 compatible with nVidia GPUs, I would definitely try it although I really don't expect too much from GPGPU for video codecs.
It interests me purely for decoding on platforms that lack a more modern CPU core, but still have GPU power to divvy up the decoding effort, which is otherwise going to waste.

The nVidia Shield TV is a good example of this - a relatively weak CPU by modern terms with a strong GPU.

Cortex A57 is decent but aging now, even some Amazon products have more recent ARM cores with higher IPC, and obviously phone products that currently exist are completely limited to CPU power which would strain to put out 4K24 in many cases, let alone 4K60 - which maybe Apple's flagship Axx SoC could do at the moment with dav1d using 8 bit content.

It's all about taking advantage of otherwise wasted compute power to augment decode fps, perhaps even do so more efficiently if enough can be executed on the GPU without extraneous copy/transfer overheads to the CPU.
Old 29th October 2019, 12:17   #1892  |  Link
soresu
Registered User
 
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 196
Quote:
Originally Posted by nevcairiel View Post
I think you got that backwards. 10-bit H.264 hardware decoders are very rare, in consumer space anyway, while any modern HEVC decoder will handle 10-bit, so devices that can do 8-bit H.264 only, but 10-bit HEVC are ample, and growing with every new device coming out.
Unfortunately not fast enough, I bought an Amazon Fire Stick for my dad in 2017 assuming that the advertised HEVC support meant up to 10 bit, only to find out to my horror that most of the HEVC encoded videos I have did not work on it.

At least the newer FS 4K I bought this year has 10 bit capability, still the experience was somewhat disheartening considering the sheer amount of 10 bit content available at the time I bought the first Fire Stick, 5 years after HEVC was standardised.
Old 29th October 2019, 12:42   #1893  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
Quote:
Originally Posted by soresu View Post
I'm more interested in the GPGPU work that happened over the summer, another mirror repo had some further Vulkan work that seemed like bugfixes or 'piping' as it were.

Is there any chance of getting some bench figures on that work soon?
Last slide in this presentation, although it was cut off at a hard 3-minute limit. The slide shows lower watt-per-fps when using the GPU code we have so far compared to pure CPU.

Last edited by Beelzebubu; 29th October 2019 at 20:51.
Old 29th October 2019, 13:03   #1894  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
Quote:
Originally Posted by Beelzebubu View Post
For ffvp9, it was the other way around, we did everything-and-more in SSSE3, and then did a couple of things (some MC, some inverse transforms) in AVX2, but the smaller inverse transforms and MC, as well as the loopfilters and most intra predictors, were never done. So it's fairly incomplete.
I managed to find a few more details on the AVX2 optimizations in ffvp9. On the specific routines below (10-bit and 12-bit) they are good, with a clear margin over the extremely optimized SSSE3 version.

Code:
vp9_diag_downleft_32x32_10bpp_c: 1101.2
vp9_diag_downleft_32x32_10bpp_sse2: 145.4
vp9_diag_downleft_32x32_10bpp_ssse3: 137.5
vp9_diag_downleft_32x32_10bpp_avx: 134.8
vp9_diag_downleft_32x32_10bpp_avx2: 94.0

vp9_diag_downleft_32x32_12bpp_c: 1108.5
vp9_diag_downleft_32x32_12bpp_sse2: 145.5
vp9_diag_downleft_32x32_12bpp_ssse3: 137.3
vp9_diag_downleft_32x32_12bpp_avx: 135.2
vp9_diag_downleft_32x32_12bpp_avx2: 94.0
AVX2 version is 32% faster than SSSE3 for vp9 ipred_dl_32x32_16

Code:
vp9_diag_downleft_32x32_12bpp_c: 1534.2
vp9_diag_downleft_32x32_12bpp_sse2: 145.9
vp9_diag_downleft_32x32_12bpp_ssse3: 140.0
vp9_diag_downleft_32x32_12bpp_avx: 134.8
vp9_diag_downleft_32x32_12bpp_avx2: 78.9
AVX2 version is 44% faster than SSSE3 for ipred_dl_32x32

Code:
vp9_vert_left_16x16_12bpp_c: 273.8
vp9_vert_left_16x16_12bpp_sse2: 69.4
vp9_vert_left_16x16_12bpp_ssse3: 35.3
vp9_vert_left_16x16_12bpp_avx: 34.6
vp9_vert_left_16x16_12bpp_avx2: 22.4
AVX2 version is 37% faster than SSSE3 for ipred_vl_16x16
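Note that the percentages quoted for these routines are reductions in measured time, which is a different convention from a throughput ratio. A minimal sketch that parses one of the listings above and reports both; it assumes the numbers are timer readings where lower is better, and the helper names are made up for illustration:

```python
def parse_bench(text):
    """Parse 'function_isa: value' lines into {isa: value}."""
    results = {}
    for line in text.strip().splitlines():
        name, value = line.split(":")
        isa = name.rsplit("_", 1)[1]  # trailing suffix: c, sse2, ssse3, avx, avx2
        results[isa] = float(value)
    return results


def time_reduction_pct(results, old, new):
    """Percent less time taken by `new` compared to `old`."""
    return (1.0 - results[new] / results[old]) * 100.0


def speedup(results, old, new):
    """Throughput ratio of `new` over `old`."""
    return results[old] / results[new]


# Copied from the vert_left 16x16 listing above.
sample = """\
vp9_vert_left_16x16_12bpp_c: 273.8
vp9_vert_left_16x16_12bpp_sse2: 69.4
vp9_vert_left_16x16_12bpp_ssse3: 35.3
vp9_vert_left_16x16_12bpp_avx: 34.6
vp9_vert_left_16x16_12bpp_avx2: 22.4
"""

r = parse_bench(sample)
print(f"AVX2 vs SSSE3: {time_reduction_pct(r, 'ssse3', 'avx2'):.1f}% less time, "
      f"{speedup(r, 'ssse3', 'avx2'):.2f}x throughput")
```

So a routine that takes about 37% less time is running at roughly 1.58x the speed, which matters when comparing these per-routine figures with whole-decoder "percent faster" numbers.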


Quote:
Originally Posted by Beelzebubu View Post
1.2x is nothing bad, though. And this is straight Haswell, newer chipsets (Zen2, Skylake) will get more, as will encoders.
I was ready to tell you to test exactly the same things using a more modern implementation of AVX2 than Haswell, like Skylake onwards (basically it's the same old 2015 Skylake architecture for all Intel CPUs after Skylake).

I have a Haswell Core i3 4170 and a Coffee Lake Refresh Core i3 9100F, and I will try the latest dav1d using LAV Video when the 0.5.x version gets embedded in the decoder (hopefully soonish).
Old 29th October 2019, 15:55   #1895  |  Link
excellentswordfight
Lost my old account :(
 
Join Date: Jul 2017
Posts: 322
Quote:
Originally Posted by nevcairiel View Post
0.2.1, the newest available at the time. You can use a nightly version which would come with 0.5.1, the newest available right now.
That won't necessarily guarantee that 2160p60 will play on a mobile U-series CPU, but it got the best chances.
Ah, ok, I missed the date on the build, I see now that it was released quite some time ago so that makes sense.

Yeah, I was not expecting 60fps; I can't do that with a SW HEVC decoder on this CPU either. I was just interested in how far SW decoding has come. But I will get a nightly build then.

Quote:
Just be careful not to pick the 10-bit variant of the Netflix Chimera video. 10-bit is not optimized at all yet, and it's not representative of real-world content yet. YouTube for example only delivers AV1 8-bit so far.
And since there is no 8-bit 2160p variant of Chimera, that's your answer.
I was using a 10bit sample; the new sparks one that was linked on previous page.

Last edited by excellentswordfight; 29th October 2019 at 15:58.
Old 29th October 2019, 16:08   #1896  |  Link
soresu
Registered User
 
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 196
Quote:
Originally Posted by Beelzebubu View Post
Last slide in this presentation, although it was cut-off at a hard 3-minute limit. Slide shows lower fps-per-watt when using the GPU code we have so far compared to pure CPU.
Thanks for the update, disappointing but to be expected I guess for an early effort which likely expends a lot of power and time copying data back and forth between the CPU and GPU.

Also reports of Mali GPU power efficiency (used in the Kirin 970/Huawei P20) haven't been stellar either compared to alternatives in the mobile market, especially when A73 is such an efficient CPU core design to compare against.
Old 29th October 2019, 20:43   #1897  |  Link
dapperdan
Registered User
 
Join Date: Aug 2009
Posts: 201
Quote:
Originally Posted by soresu View Post
Thanks for the update, disappointing but to be expected I guess for an early effort
I think the code works. It gets more FPS at the same power usage, or uses less power for the same FPS. I think the comment about "lower fps-per-watt" was intended to be the opposite, "lower watts-per-fps", or at least that's my reading of the graph from the presentation.
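The two metrics are reciprocals, so "lower" is bad news for one and good news for the other. A tiny sketch with made-up numbers (they are hypothetical and not taken from the presentation):

```python
def fps_per_watt(fps, watts):
    """Efficiency: higher is better."""
    return fps / watts


def watts_per_fps(fps, watts):
    """Cost: lower is better."""
    return watts / fps


# Hypothetical measurements, NOT from the slide: same frame rate, less power with GPU help.
cpu_only = {"fps": 30.0, "watts": 6.0}
cpu_plus_gpu = {"fps": 30.0, "watts": 4.0}

for label, m in (("CPU only", cpu_only), ("CPU + GPU", cpu_plus_gpu)):
    print(f"{label}: {fps_per_watt(**m):.2f} fps/W, {watts_per_fps(**m):.3f} W/fps")
```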
Old 29th October 2019, 20:50   #1898  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
Quote:
Originally Posted by dapperdan View Post
I think the code works. It gets more FPS at the same power usage, or uses less power for the same FPS. I think the comment about "lower fps-per-watt" was intended to be the opposite, "lower watts-per-fps", or at least that's my reading of the graph from the presentation.
Oops, yes, sorry, my bad. I'll update my post. Thanks for noticing.
Old 30th October 2019, 13:17   #1899  |  Link
TEB
Registered User
 
Join Date: Feb 2003
Location: Palmcoast of Norway
Posts: 363
Quote:
Originally Posted by soresu View Post
Unfortunately not fast enough, I bought an Amazon Fire Stick for my dad in 2017 assuming that the advertised HEVC support meant up to 10 bit, only to find out to my horror that most of the HEVC encoded videos I have did not work on it.

At least the newer FS 4K I bought this year has 10 bit capability, still the experience was somewhat disheartening considering the sheer amount of 10 bit content available at the time I bought the first Fire Stick, 5 years after HEVC was standardised.
I can confirm this based on my research too. "All" devices released in the last 2-3 years support HEVC, but the devil's in the details here. 10-bit (main10) support is lacking quite a lot, and many older chipsets support main10 up to 4kp60 on paper but in reality won't provide more than 8-bit at level 3.0.
It depends on a combination of chipset, microcode, OS version etc.
Example: I was testing x265 10-bit encoded main10 content on my older Samsung G7edge, which has a Snapdragon 820 that on paper supports all we need (4kp60 UHD... whatever that means in reality), but I was not able to HW decode anything over main@l2.1 via XO player. VLC happily decoded main10@l4.1 with 70% CPU usage on 6 of the 8 cores, but the power drain was awful.
On H.264, I haven't seen any chipset in recent years, regardless of how good the HEVC part of the SoC is, able to do anything over hp@4.2. Even the latest AV1-decode-enabled chipsets can't do 10-bit H.264.

Last edited by TEB; 30th October 2019 at 13:23.
Old 30th October 2019, 18:09   #1900  |  Link
Blue_MiSfit
Derek Prestegard IRL
 
Join Date: Nov 2003
Location: Los Angeles
Posts: 5,988
^ exactly. Even though the SOC should do 4kp60, the actual implementation in $phone can only do level 3.0.