Alliance for Open Media codecs [Archive] - Page 27

utack

13th December 2018, 14:28

Why shouldn't we like tiled encoding?

They make compression efficiency worse. The current implementation splits the frame into equal parts, and most of the time you get a split right in the center of the picture, where most of the action takes place.
dav1d demonstrates that frame-parallel decoding works fairly well, and other encoders have managed to get perfect frame-parallel encoding done, so tiling just seems like a lazy solution until libaom gets row_mt running well.

LigH

13th December 2018, 16:34

New uploads: (MSYS2; MinGW32: GCC 7.4.0 / MinGW64: GCC 8.2.1)

AOM v1.0.0-1030-g7ac3eb1bb (https://www.mediafire.com/file/ro61f2rjhrw5y76/aom_v1.0.0-1030-g7ac3eb1bb.7z)
New parameters:
--enable-dual-filter=<arg> Enable dual filter (0: false, 1: true (default))
--enable-order-hint=<arg> Enable order hint (0: false, 1: true (default))
--enable-dist-wtd-comp=<arg Enable distance-weighted compound (0: false, 1: true (default))
--enable-masked-comp=<arg> Enable masked (wedge/diff-wtd) compound (0: false, 1: true (default))
--enable-interintra-comp=<a Enable interintra compound (0: false, 1: true (default))
--enable-diff-wtd-comp=<arg Enable difference-weighted compound (0: false, 1: true (default))
--enable-interinter-wedge=< Enable interinter wedge compound (0: false, 1: true (default))
--enable-interintra-wedge=< Enable interintra wedge compound (0: false, 1: true (default))
--enable-global-motion=<arg Enable global motion (0: false, 1: true (default))
--enable-warped-motion=<arg Enable local warped motion (0: false, 1: true (default))
--enable-obmc=<arg> Enable OBMC (0: false, 1: true (default))

rav1e 0.1.0 (64b9f50 / 2018-12-13) (https://www.mediafire.com/file/up68lvmt9j8m9e3/rav1e_0.1.0_2018-12-13_64b9f50.7z)

dav1d 0.1.0 (e5bca59 / 2018-12-13) (https://www.mediafire.com/file/v8d90s8ykco32zl/dav1d_0.1.0_2018-12-13_e5bca59.7z)

SmilingWolf

13th December 2018, 16:52

They make compression efficiency worse.
In x265, WPP hurts efficiency too. Should we stop using it?

The clip used is the F.Y.C one I described some pages ago (http://forum.doom9.org/showthread.php?p=1852449#post1852449)

Cmdlines:
x265 --preset veryslow --tune ssim --crf 20 -F 1 --no-wpp -o test.x265.crf20.1F.00WPP.hevc orig.i420.y4m
x265 --preset veryslow --tune ssim --crf 20 -F 1 -o test.x265.crf20.1F.12WPP.hevc orig.i420.y4m

Sizes:
test.x265.crf20.1F.00WPP.hevc: 5566953
test.x265.crf20.1F.12WPP.hevc: 5612446 (+0.81%)

PSNR-HVS-M:
test.x265.crf20.1F.00WPP.hevc: 42.9368
test.x265.crf20.1F.12WPP.hevc: 42.9299 (-0.02%)

MS-SSIM:
test.x265.crf20.1F.00WPP.hevc: 26.3172
test.x265.crf20.1F.12WPP.hevc: 26.3112 (-0.02%)
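
The percentages in parentheses are plain relative changes against the --no-wpp run. A minimal Python sketch of that arithmetic, using the figures posted above (the function name is made up for illustration, and the last digit may differ from the quoted values depending on rounding):

```python
def relative_change_percent(baseline, value):
    """Return how much `value` differs from `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Figures copied from the x265 runs above (no WPP vs. WPP).
size_delta = relative_change_percent(5566953, 5612446)
psnr_hvs_m_delta = relative_change_percent(42.9368, 42.9299)
ms_ssim_delta = relative_change_percent(26.3172, 26.3112)

print(f"size: {size_delta:+.2f}%")
print(f"PSNR-HVS-M: {psnr_hvs_m_delta:+.2f}%")
print(f"MS-SSIM: {ms_ssim_delta:+.2f}%")
```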

With libaom the compression efficiency loss is very low with a reasonable number of tiles (in this case, 4 on a 720p clip).
I have already measured it: http://forum.doom9.org/showthread.php?p=1856939#post1856939.
That's -0.75% space efficiency with 0.0X% loss in quality. It's even comparable to x265's WPP!

On the other hand, libaom's --frame-parallel=1 exhibits a 6% overhead. Just so that we're clear, libaom's --frame-parallel has got nothing to do with libdav1d's decoding option with the similar name, which doesn't depend on any optional characteristic of the bitstream.
You can already have row-mt WITH tiles which should work decently. Maybe combine it with chunked encoding for better overall performance.
So again, no excuses to not use tiles.

marcomsousa

13th December 2018, 16:57

I realize I sound like a broken record at this point, but the newest Pentiums and Celerons still do not support AVX, and this even applies to the models that use the full-fat Sky/Kaby/Coffeelake cores (though with smaller cache size) such as the ever-popular 2c/4t Pentium G4560 and its direct successor the G5400.

dav1d is already optimized for AVX2 (~50% market share).
Now they will begin optimizing for SSSE3 and SSE4.1, which all CPUs have.
They don't know yet whether it will work fine with just these two extensions...

1) Keep in mind that this codec will not be mainstream until there are some HW encoders (6 months to 1 more year for big companies to have a custom HW encoder).

2) If a Celeron can't decode 1080p with dav1d, the player has two options: serve another codec, or serve the same codec at a lower resolution.

If you are big enough, like YouTube, you can serve H.264 to that hardware and save bandwidth with the majority.

Or it's just fine to serve 720p AV1 video to Celerons, and higher resolutions to the others (they shouldn't be too picky).
For sure, 8K video on YouTube will only be served in AV1, like VP9 is today for >1080p.

Beelzebubu

13th December 2018, 18:18

Ronald Bultje commented on frame parallelism being a bad thing for VP9 (https://blogs.gnome.org/rbultje/2014/02/22/the-worlds-fastest-vp9-decoder-ffvp9/), so not much of a surprise it was turned off by default in AV1/libaom

No, that's a mis-interpretation. Frame parallelism is great.

For encoding, the speedup is slightly better than for tile parallelism, and the quality loss per added thread is less than for tile threading. For example, in my experiments, frame-multithreading in Eve/VP9 costs 0.0% BDRATE loss for a 1.8x speedup going from 1 to 2 threads, but tile threading only gives a 1.7x speedup and has a BDRATE quality loss of around 0.5%. This pattern holds for more threads, and tends to be true across multiple codecs and encoders. Now, obviously, libaom/vpx have no frame threaded encoding so not much to be said there. But in x264, my experience from many years ago is that they switched from slice to frame multi-threaded encoding for the same reason: better scaling *and* less BDRATE quality loss. So far, so good.

OK, next, decoding. This is trickier. For ffh264, for example, we classically found that frame-multithreaded decoding gives a higher speedup than slice-multi-threaded decoding per added thread. Given this pattern of frame multithreading scaling better *and* having less quality loss than within-frame alternatives in a variety of codecs, you'd expect everything to be good, right? Well, not exactly. It holds true, but only to some extent.

The problem in decoding of vp9/av1 is that frames depend on entropy output of the previous frame. For h264/5, cabac state resets in each frame, but this is not true for vp9/av1. So, for frame-multithreading, you need to split decoding in 2 passes, and pass 1 of the next dependent frame can only start when the previous frame finished its pass 1 and started its pass 2. So, vp9/av1 *decoding* scales less well than h264 *decoding* when using frame multithreading. Fortunately, the system load doesn't go up either, so really what it means is that you need more threads to fully saturate a system. It's even better if you combine frame and tile threading, like what dav1d does.

Wait, you're asking now, what about that statement that frame parallelism is bad in libvpx? Well, it's not what you think it is. --frame-parallel in vpxenc has nothing to do with frame multi-threading in the encoder. It's a header bit that removes the entropy dependency I just talked about. So now, it scales better when using frame multi-threading, which is why this bit is called the "frame parallelism" bit, but it also costs you all backwards entropy, incurring >1% BDRATE quality loss. However, there is no reason to do this. Hardware is not allowed to support higher resolutions with vs. without this feature, and there is no software decoder that implements frame multithreading with but not without entropy dependencies disabled. And if entropy dependencies are present, you can saturate system load anyway by simply using more threads. So the whole thing is kind of silly. Why give up quality for no gain whatsoever?
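
To make the two-pass argument concrete, here is a toy scheduling model in Python. It is only a sketch under simplifying assumptions that are not from the post: every frame costs the same, one worker handles a whole frame, pixel (reference) dependencies are ignored, and the names and timings are invented. With the entropy dependency on, pass 1 of a frame waits for pass 1 of the previous frame; with it off (roughly what the "frame parallelism" header bit buys), frames start as soon as a worker is idle.

```python
import heapq


def frame_threaded_makespan(num_frames, pass1, pass2, threads,
                            entropy_dependent=True):
    """Toy model: total time to decode `num_frames` with frame threading.

    Each frame needs `pass1` time units (entropy decoding) followed by
    `pass2` time units (reconstruction) on one worker thread. If
    `entropy_dependent` is True, pass 1 of a frame may only start once the
    previous frame has finished its own pass 1.
    """
    free_at = [0.0] * threads      # min-heap of times at which workers go idle
    heapq.heapify(free_at)
    prev_pass1_done = 0.0
    makespan = 0.0
    for _ in range(num_frames):
        start = heapq.heappop(free_at)
        if entropy_dependent:
            start = max(start, prev_pass1_done)
        prev_pass1_done = start + pass1
        done = prev_pass1_done + pass2
        heapq.heappush(free_at, done)
        makespan = max(makespan, done)
    return makespan


serial = frame_threaded_makespan(10, pass1=1, pass2=3, threads=1)
for threads in (2, 4, 8):
    t = frame_threaded_makespan(10, pass1=1, pass2=3, threads=threads)
    print(threads, "threads: speedup", round(serial / t, 2))
```

In this model the serial pass-1 chain caps the achievable speedup no matter how many workers are added, which is the "scales less well" effect described above; real decoders are of course more complicated.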

I can't tell how correct it is, but this was an interesting read: https://codecs.multimedia.cx/2018/12/why-i-am-sceptical-about-av1/
Author is a former libav/ffmpeg developer if you don't remember his name.

Kostya Shishkov.

Beelzebubu

13th December 2018, 18:29

Do we have numbers for the installed base of AVX2 capable PCs? They've been in all new mainstream systems for several years now. I'd guess it's >50% already.

It's around 50%, depending on what statistics you look at. So, I think some people have already tried to address the dav1d performance metrics, so to summarize:

single-threaded, playing 8-bits/component content on >=Haswell (i.e. AVX2=1) will give a 40-80% FPS increase when using dav1d compared to libaom;
multi-threaded, when using the right combinations of frame and tile threading (or just really large numbers) you can get several times higher FPS using dav1d compared to libaom when playing back 8-bit content on Haswell or newer (i.e. AVX2=1);
pre-Haswell (e.g. SSSE3 (https://code.videolan.org/videolan/dav1d/issues/216)), non-x86 (e.g. Neon (https://code.videolan.org/videolan/dav1d/issues/215)), 32bit (the AVX2 assembly is 64-bit only) and 10-bits/component are not yet done. They will not be faster ATM, and possibly significantly slower. We're working on it;
Firefox has a problem (https://bugzilla.mozilla.org/show_bug.cgi?id=1512462) integrating nasm so their version of dav1d has all assembly disabled ATM.

benwaggoner

13th December 2018, 18:48

No, that's a mis-interpretation. Frame parallelism is great.

For encoding, the speedup is slightly better than for tile parallelism, and the quality loss per added thread is less than for tile threading. For example, in my experiments, frame-multithreading in Eve/VP9 costs 0.0% BDRATE loss for a 1.8x speedup going from 1 to 2 threads, but tile threading only gives a 1.7x speedup and has a BDRATE quality loss of around 0.5%. This pattern holds for more threads, and tends to be true across multiple codecs and encoders. Now, obviously, libaom/vpx have no frame threaded encoding so not much to be said there. But in x264, my experience from many years ago is that they switched from slice to frame multi-threaded encoding for the same reason: better scaling *and* less BDRATE quality loss. So far, so good.
Also, content may not be encoded with slices/tiles, but almost certainly will be encoded with hierarchically structured reference frames (like a I P B b structure) where the majority of frames aren't reference frames (e.g. all non-ref b-frames can be decoded in parallel as long as their reference frames are already decoded). So a performant decoder needs to have frame level parallelism, even if it also has slice/tile level as well.
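
A small Python sketch of that idea: given which frames each frame predicts from, group the frames into "waves" whose members could be decoded in parallel. The mini-GOP and all names are hypothetical, and only pixel references are modelled here (no entropy dependencies).

```python
def decode_waves(references):
    """Group frames into waves; every frame in a wave can decode in parallel.

    `references` maps each frame name to the frames it predicts from.
    A frame is placed one wave after the latest of its references.
    """
    wave_of = {}

    def wave(frame):
        if frame not in wave_of:
            refs = references[frame]
            wave_of[frame] = 1 + max((wave(r) for r in refs), default=-1)
        return wave_of[frame]

    waves = {}
    for frame in references:
        waves.setdefault(wave(frame), []).append(frame)
    return [waves[i] for i in sorted(waves)]


# Hypothetical mini-GOP, display order: I0 b1 B2 b3 P4
gop = {
    "I0": [],
    "P4": ["I0"],
    "B2": ["I0", "P4"],
    "b1": ["I0", "B2"],
    "b3": ["B2", "P4"],
}
print(decode_waves(gop))
```

Here the two non-reference b-frames end up in the same wave, so they can be decoded at the same time once B2 is done.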

The problem in decoding of vp9/av1 is that frames depend on entropy output of the previous frame. For h264/5, cabac state resets in each frame, but this is not true for vp9/av1. So, for frame-multithreading, you need to split decoding in 2 passes, and pass 1 of the next dependent frame can only start when the previous frame finished its pass 1 and started its pass 2. So, vp9/av1 *decoding* scales less well than h264 *decoding* when using frame multithreading. Fortunately, the system load doesn't go up either, so really what it means is that you need more threads to fully saturate a system. It's even better if you combine frame and tile threading, like what dav1d does.
So, decoding will be limited by serial entropy decoding? Do non-reference frames still update and thus serialize the entropy state? If decoding the "bbbb" in an "IbbbbBbbbbP" sequence is serialized, that'll really impact decoder parallelization. But if all the non-ref b frames inherit the CABAC state of the most recently decoded reference frame, then it'll be a lot easier.

Beelzebubu

13th December 2018, 19:50

So, decoding will be limited by serial entropy decoding? Do non-reference frames still update and thus serialize the entropy state? If decoding the "bbbb" in an "IbbbbBbbbbP" sequence is serialized, that'll really impact decoder parallelization. But if all the non-ref b frames inherit the CABAC state of the most recently decoded reference frame, then it'll be a lot easier.

Frames with a "similar entropy" reference each other, so a high-level P might use the previous P (which is coded 16 frames back) as its entropy reference, and a non-reference inner B frame (which might not be a reference picture at all for pixel purposes) may actually use the previous inner B-frame (which may well be the one directly before this, or usually 2 and sometimes 3 frames back) as its reference. So this certainly influences how well frame-multithreading scales, not in the worst possible way but not ideal either.

And that's why you see weird things where using 256 instead of 128 threads (I think this is 32/16 frame threads x 8 tile threads) on a 32 core leads to pretty significant speedups (like this (https://medium.com/@ewoutterhoeven/dav1d-0-1-0-release-the-first-benchmarks-5404360e44e3)).

benwaggoner

13th December 2018, 20:00

Frames with a "similar entropy" reference each other, so a high-level P might use the previous P (which is coded 16 frames back) as its entropy reference, and a non-reference inner B frame (which might not be a reference picture at all for pixel purposes) may actually use the previous inner B-frame (which may well be the one directly before this, or usually 2 and sometimes 3 frames back) as its reference. So this certainly influences how well frame-multithreading scales, not in the worst possible way but not ideal either.

And that's why you see weird things where using 256 instead of 128 threads (I think this is 32/16 frame threads x 8 tile threads) on a 32 core leads to pretty significant speedups (like this (https://medium.com/@ewoutterhoeven/dav1d-0-1-0-release-the-first-benchmarks-5404360e44e3)).
Great analysis, thanks!

And huh, I can just imagine the tears of people trying to implement low-cost HW decoders for this. I can see how interframe entropy could provide a percent or two of compression efficiency, though.

I would rather have per-frame entropy and no slice requirement if I had a choice.

Beelzebubu

13th December 2018, 20:05

Great analysis, thanks!

And huh, I can just imagine the tears of people trying to implement low-cost HW decoders for this. I can see how interframe entropy could provide a percent or two of compression efficiency, though.

I would rather have per-frame entropy and no slice requirement if I had a choice.

TBH, from what I understand from people in the relevant committees, this was proposed for HEVC also. The reason they didn't do it had nothing to do with HW, though, but was simply to keep the VoD and RTC use cases technically more similar. (Entropy dependencies are obviously disabled for RTC use cases.)

benwaggoner

13th December 2018, 21:16

TBH, from what I understand from people in the relevant committees, this was proposed for HEVC also. The reason they didn't do it had nothing to do with HW, though, but was simply to keep the VoD and RTC use cases technically more similar. (Entropy dependencies are obviously disabled for RTC use cases.)
For RTC you could have backwards entropy states just fine, I think. So IPPPPPP could have each P reference the entropy state of the previous P. Error correction for lost packets would require trickiness. AV1 RTC would have the same issues.

Limiting entropy state reference to reference frames/tiles would be a lot more robust, but of reduced value. A bunch of non-ref b frames referencing the same frames probably have a lot more in common than any do to the ref-B/P/I frames they reference...

Random access would also be slowed by interframe entropy coding; it's essentially adding another layer of reference dependencies. Entropy is easier to decode, but getting to an arbitrary frame in a long GOP could require decoding the entropy state of a lot more frames than it would with a traditional IbBbP with inter-frame entropy only. With 8 b-frames, getting to an arbitrary frame of H.264/HEVC requires decoding about 1/8th of frames between the IDR and the target frame. Seems like it could be a lot worse in AV1, if I am understanding correctly.
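
The seek-cost argument can be sketched in Python with a toy GOP. Everything here is an assumption made for illustration: anchors every 9 frames with 8 non-reference b-frames in between, and, for the pessimistic case, each b-frame inheriting entropy state from the previous b-frame in display order. Real AV1 streams choose their entropy reference differently, so treat the numbers as a thought experiment only.

```python
def frames_to_decode(target, dependencies):
    """Return every frame that must be decoded to reach `target` (inclusive)."""
    needed, stack = set(), [target]
    while stack:
        frame = stack.pop()
        if frame not in needed:
            needed.add(frame)
            stack.extend(dependencies[frame])
    return needed


def build_gop(num_frames, anchor_spacing, entropy_chain):
    """Toy GOP in display order: frame 0 is the IDR, every
    `anchor_spacing`-th frame is a P anchor, the rest are non-reference
    b-frames predicted from the two surrounding anchors."""
    deps = {}
    for n in range(num_frames):
        if n == 0:
            deps[n] = []
        elif n % anchor_spacing == 0:
            deps[n] = [n - anchor_spacing]
        else:
            prev_anchor = n - n % anchor_spacing
            deps[n] = [prev_anchor, prev_anchor + anchor_spacing]
            if entropy_chain:
                # Assumed: entropy state comes from the previous b-frame.
                prev_b = n - 1 if (n - 1) % anchor_spacing else n - 2
                if prev_b > 0:
                    deps[n].append(prev_b)
    return deps


plain = build_gop(46, 9, entropy_chain=False)
chained = build_gop(46, 9, entropy_chain=True)
print(len(frames_to_decode(40, plain)), "frames without an entropy chain")
print(len(frames_to_decode(40, chained)), "frames with the assumed chain")
```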

Beelzebubu

13th December 2018, 21:20

Random access would also be slowed by interframe entropy coding; it's essentially adding another layer of reference dependencies. Entropy is easier to decode, but getting to an arbitrary frame in a long GOP could require decoding the entropy state of a lot more frames than it would with a traditional IbBbP with inter-frame entropy only. With 8 b-frames, getting to an arbitrary frame of H.264/HEVC requires decoding about 1/8th of frames between the IDR and the target frame. Seems like it could be a lot worse in AV1, if I am understanding correctly.

Yes, you're correct, random access (seeking) is going to be slower because of this.

mandarinka

14th December 2018, 00:31

In x265, WPP hurts efficiency too. Should we stop using it?
Why do you think the bestest encoders haven't? :cool: Enlightened ones have stopped using frame threading. :devil:

Do we have numbers for the installed base of AVX2 capable PCs? They've been in all new mainstream systems for several years now. I'd guess it's >50% already.

Steam is probably one of the largest datasets available, but it is probably quite skewed. It covers a disproportionate number of gaming PCs, but likely almost no HTPCs or office PCs. And all of those are going to watch AV1 video in browsers, even if it is just video ads. So real AVX2 penetration is likely worse than Steam shows, because of the Pentiums/Celerons and the like.
For illustration, look at the difference in Windows 10 versus Windows 7 usage between general browsing-based statistics and Steam. The former show ~45% for W10, while Steam gives it over 60%.

SmilingWolf

14th December 2018, 08:35

Why do you think the bestest encoders haven't? :cool: Enlightened ones have stopped using frame threading. :devil:
I am unsure of the meaning of this.
My point was that there is no point in not using either frame threading, WPP (for x265) or tiling (for libaom) when the overhead is not only so low, but even very similar between the two.
Yet I have never seen WPP get the same amount of flak tiling gets, especially considering tile-threading in libdav1d can contribute up to +108% of the decoding performance on its own: https://docs.google.com/spreadsheets/d/1AO3lDZnpC8pNJffOknY1rIxXwLog_ISwHhO_sv3Xlhg/edit#gid=1238661928

Kurosu

14th December 2018, 12:34

Tiles will cause a coding efficiency loss, even if negligible in the big picture. But it is not such a boon either, except for encoders with particular limits, or software decoders. Same for WPP, which really is more a software decoder thing. Contrary to dav1d, your regular HEVC software decoder does not exploit the combined "threadability" of frames and tiles/WPP.

nevcairiel

14th December 2018, 12:39

SmilingWolf

14th December 2018, 13:17

In the long run, features that allow faster software decoding are really just wasted coding efficiency. When a codec goes mainstream, you'll have a full stack of hardware decoders, which usually don't care that much about these things.
On top of that, if you look at frame threading numbers, the advantage from tile threading shrinks extremely rapidly. Comparing its speed advantage without frame threading is really only a very limited picture.

True, and true. I don't even have a retort to that.

I still think that we can care about removing tiling from a libaom encoding workflow whenever the hardware goes mainstream and makes 4K decodable even on budget CPUs like v0lt's Pentium G5600, which should be 2-3 years (?), but I'm ok with the above. Hopefully in the same time rav1e will get proper psy-RD and frame-parallel encoding, too, so we won't have to care about it anyway.

My main heat for the whole tiling debate comes from excluding from early adoption (i.e. right about now) a lot of low-medium tier systems with "inappropriate" encoding settings. In my early tests libdav1d could scale much better on my processor if combined with tiling rather than simply increasing the frame-threads above a certain threshold. Hard to justify a 4MB difference in 1GB of video when said video can't be decoded in real time at all.
Still, the spreadsheet I quoted makes me think I should run the numbers again for dav1d. It has been a couple of months after all.

Mierastor

14th December 2018, 18:37

"Intel: AV1 support not yet in Gen11 Graphics, but coming soon after"
https://www.reddit.com/r/AV1/comments/a5ufft/intel_av1_support_not_yet_in_gen11_graphics_but/

Meaning late 2020, if Intel as usual introduces new CPU generations late in the year?

Since these introductions have often only been paper launches, large-scale availability will only occur in 2021?

nevcairiel

14th December 2018, 19:45

That's about the time frame most here would expect hardware support. Maybe in 2020, or thereabouts.

Nintendo Maniac 64

14th December 2018, 21:36

But let's be honest here - with AMD finally being a viable alternative again, who is really buying Intel for their graphics capabilities? :p

Motenai Yoda

14th December 2018, 22:14

the ones that don't care about gpu capabilities and still get a display without needing a discrete card

nevcairiel

14th December 2018, 22:19

But let's be honest here - with AMD finally being a viable alternative again, who is really buying Intel for their graphics capabilities? :p

Gen11 is also supposed to be significantly faster. And Intel has among the best media capabilities today already, while AMD has the worst.
So for a small form factor media PC, there would be no competition for me.

Nintendo Maniac 64

15th December 2018, 04:16

the ones that don't care about gpu capabilities and still get a display without needing a discrete card

Uhhhh... (https://en.wikipedia.org/wiki/List_of_AMD_accelerated_processing_unit_microprocessors)

huhn

17th December 2018, 13:43

Mr_Khyron

17th December 2018, 18:53

you are aware that amd is still missing VP9 for hardware decoding even on vega.

while AMD is very competitive in the CPU market, their GPUs are currently at an all-time low. vega uses a lot of power, needs a huge die and is really slow if you take size into consideration, and the hardware decoder is pretty much worse than that of the nvidia cards that are getting 4 years old.

https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf
on page 14
“Vega” can also decode the VP9 format at resolutions up to 3840x2160 using a hybrid approach where the video and shader engines collaborate to offload work from the CPU.

sneaker_ger

17th December 2018, 19:56

AFAIK the newer AMD ones like Ryzen 5 2500U (Raven Ridge/ Vega 8) now have VP9 10 bit ASIC decoding.

But yeah, they are late. Don't expect it to be different for AV1.

huhn

17th December 2018, 21:39

hybrid decoding has nothing to do with hardware decoding.

letting a CPU and GPU core do the work of an ASIC is simply not the same.

it would be nice if the newest APUs had an ASIC decoder, but a not very trustworthy test i found says the vega 11 is hybrid.

alex1399

18th December 2018, 08:54

Nonsense, AMD's hybrid decoding of VP9 works the same way Intel's hybrid decoding of HEVC does on their 6th-generation Skylake processors. They ARE hardware decoding.

Blue_MiSfit

18th December 2018, 19:33

Keep things on topic, please.

SmilingWolf

18th December 2018, 22:31

Status report, redux
"rav1e is doing well" edition

1st edition: http://forum.doom9.org/showthread.php?p=1852449#post1852449
2nd edition: http://forum.doom9.org/showthread.php?p=1857587#post1857587
Whatever paragraph I don't repeat here can be assumed to be the same as in the aforementioned posts

First of all: graphs! Click to enlarge
Y axis: chosen metric
X axis: bits per pixel

720p:
https://i.ibb.co/ZJ8GZVw/msssim-720.png (https://ibb.co/ZJ8GZVw) https://i.ibb.co/fVWNjjy/psnrhvsm-720.png (https://ibb.co/fVWNjjy) https://i.ibb.co/6ZQ6sTn/hvmaf-720.png (https://ibb.co/6ZQ6sTn)

1080p:
https://i.ibb.co/3dKr5f5/msssim-1080.png (https://ibb.co/3dKr5f5) https://i.ibb.co/4NNHxCC/psnrhvsm-1080.png (https://ibb.co/4NNHxCC) https://i.ibb.co/nDTK5y5/hvmaf-1080.png (https://ibb.co/nDTK5y5)

Encoders improvement over time:
720p:
https://i.ibb.co/R0dQbtc/msssim-overtime-720.png (https://ibb.co/R0dQbtc) https://i.ibb.co/9rJQXbQ/psnrhvsm-overtime-720.png (https://ibb.co/9rJQXbQ)

1080p:
https://i.ibb.co/Y76p6Tp/msssim-overtime-1080.png (https://ibb.co/Y76p6Tp) https://i.ibb.co/WBhV3rN/psnrhvsm-overtime-1080.png (https://ibb.co/WBhV3rN)

BD rates for 720p:

Codecs ladder:                  | x264 relative:
x264 -> rav1e                   | x264 -> rav1e
         RATE (%)    DSNR (dB)  |          RATE (%)    DSNR (dB)
MSSSIM   -10.7345    0.541324   | MSSSIM   -10.7345    0.541324
PSNRHVS  -15.3271    1.07245    | PSNRHVS  -15.3271    1.07245
HVMAF    -7.40703    2.07138    | HVMAF    -7.40703    2.07138
--------------------------------|-------------------------------
rav1e -> vp9                    | x264 -> vp9
         RATE (%)    DSNR (dB)  |          RATE (%)    DSNR (dB)
MSSSIM   -15.1057    0.68453    | MSSSIM   -21.81      1.08927
PSNRHVS  -11.2436    0.654976   | PSNRHVS  -22.9586    1.51188
HVMAF    -19.2883    2.57019    | HVMAF    -24.1102    3.58993
--------------------------------|-------------------------------
vp9 -> x265                     | x264 -> x265
         RATE (%)    DSNR (dB)  |          RATE (%)    DSNR (dB)
MSSSIM   -4.25723    0.169151   | MSSSIM   -25.6195    1.21115
PSNRHVS  -8.19042    0.41409    | PSNRHVS  -29.8289    1.83058
HVMAF    -10.6714    0.708441   | HVMAF    -31.2046    4.45371
--------------------------------|-------------------------------
x265 -> av1                     | x264 -> av1
         RATE (%)    DSNR (dB)  |          RATE (%)    DSNR (dB)
MSSSIM   -18.9088    0.7852     | MSSSIM   -38.0511    1.97653
PSNRHVS  -15.3123    0.761791   | PSNRHVS  -38.6659    2.56119
HVMAF    -18.0023    1.0489     | HVMAF    -44.0411    4.3982

BD rates for 1080p:

Codecs ladder:                  | x264 relative:
x264 -> rav1e                   | x264 -> rav1e
         RATE (%)    DSNR (dB)  |          RATE (%)    DSNR (dB)
MSSSIM   -20.0683    1.00011    | MSSSIM   -20.0683    1.00011
PSNRHVS  -21.9935    1.47903    | PSNRHVS  -21.9935    1.47903
HVMAF    -18.4773    3.96202    | HVMAF    -18.4773    3.96202
--------------------------------|-------------------------------
rav1e -> vp9                    | x264 -> vp9
         RATE (%)    DSNR (dB)  |          RATE (%)    DSNR (dB)
MSSSIM   -18.2653    0.729489   | MSSSIM   -31.1754    1.49143
PSNRHVS  -14.922     0.755605   | PSNRHVS  -30.1275    1.87845
HVMAF    -20.1645    2.31195    | HVMAF    -32.4505    4.72978
--------------------------------|-------------------------------
vp9 -> x265                     | x264 -> x265
         RATE (%)    DSNR (dB)  |          RATE (%)    DSNR (dB)
MSSSIM   5.13717     -0.177855  | MSSSIM   -28.9956    1.18206
PSNRHVS  -0.096748   -0.0123981 | PSNRHVS  -31.474     1.63676
HVMAF    -3.78107    0.0881882  | HVMAF    -34.6185    4.22357
--------------------------------|-------------------------------
x265 -> av1                     | x264 -> av1
         RATE (%)    DSNR (dB)  |          RATE (%)    DSNR (dB)
MSSSIM   -26.486     0.938124   | MSSSIM   -45.4959    2.12535
PSNRHVS  -21.7431    0.905916   | PSNRHVS  -43.4792    2.56047
HVMAF    -22.7091    1.17861    | HVMAF    -48.0404    4.69582
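
For readers unfamiliar with the RATE (%) column: a BD rate is the average bitrate difference between two rate/quality curves at equal quality. The post does not say which tool produced the numbers, so the Python below is only a simplified sketch. It uses piecewise-linear interpolation of log(bitrate) over quality instead of the cubic fit of the classic Bjøntegaard method, all function names are invented, and the sample points are made up.

```python
import math


def _interp(x, xs, ys):
    """Linearly interpolate y at x; `xs` must be strictly increasing."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x lies outside the measured quality range")


def _mean_log_rate(points, lo, hi, steps=1000):
    """Average log(bitrate) over the quality interval [lo, hi] (trapezoid rule)."""
    pts = sorted(points, key=lambda p: p[1])
    quality = [q for _, q in pts]
    log_rate = [math.log(r) for r, _ in pts]
    total = 0.0
    for i in range(steps + 1):
        x = min(lo + (hi - lo) * i / steps, hi)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * _interp(x, quality, log_rate)
    return total / steps


def bd_rate_percent(anchor, test):
    """Average bitrate difference of `test` vs. `anchor` at equal quality.

    Both arguments are lists of (bitrate, quality) points. Negative results
    mean `test` needs fewer bits for the same quality.
    """
    lo = max(min(q for _, q in anchor), min(q for _, q in test))
    hi = min(max(q for _, q in anchor), max(q for _, q in test))
    diff = _mean_log_rate(test, lo, hi) - _mean_log_rate(anchor, lo, hi)
    return (math.exp(diff) - 1.0) * 100.0


# Made-up curves: the test encoder needs 20% fewer bits at every quality.
anchor = [(1000, 30.0), (2000, 33.0), (4000, 36.0), (8000, 39.0)]
test = [(800, 30.0), (1600, 33.0), (3200, 36.0), (6400, 39.0)]
print(round(bd_rate_percent(anchor, test), 2))
```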

Encoders:
x264 157-2935-545de2f
x265 2.9-4-471726d3a046
rav1e 0.1.0-977-64b9f501
libaom 1.0.0-908-g3a607f7b0
libvpx 1.7.0-1352-gea57f9acd

Cmdlines:
x264 --preset veryslow --tune ssim --crf 16 -o test.x264.crf16.264 orig.i420.y4m
x265 --preset veryslow --tune ssim --crf 16 -o test.x265.crf16.hevc orig.i420.y4m
rav1e --low_latency false -o test.rav1e.cq80.ivf --quantizer 80 -s 2 --tune psnr orig.i420.y4m
aomenc --frame-parallel=0 --tile-columns=3 --auto-alt-ref=1 --cpu-used=4 --tune=psnr --passes=2 --threads=2 --end-usage=q --cq-level=20 --test-decode=fatal -o test.av1.cq20.webm orig.i420.y4m
vpxenc --codec=vp9 --frame-parallel=0 --tile-columns=2 --good --cpu-used=0 --tune=psnr --passes=2 --threads=2 --end-usage=q --cq-level=20 --test-decode=fatal --ivf -o test.vp9.cq20.ivf orig.i420.y4m

Quality settings:
x264: CRF 16-24 step 1, and 24-34 step 2
x265: CRF 16-24 step 1, and 24-34 step 2
rav1e: CQ 80-160 step 16
aomenc: CQ 20-40 step 4
vpxenc: CQ 20-48 step 4
VMAF: model used: nflxall_vmafv4, pooling: harmonic_mean
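
The quality ladders above can be expanded mechanically. A short Python sketch that lists the quality points per encoder and rebuilds the x264 command line from the Cmdlines section for each CRF; the helper names are invented, and it is assumed that the value 24, which appears in both x264/x265 ranges, is encoded once.

```python
def ladder(start, stop, step):
    """Inclusive range of quality values."""
    return list(range(start, stop + 1, step))


# Quality points as listed in the post; 24 is shared by both x264/x265 ranges.
X26X_CRF = ladder(16, 23, 1) + ladder(24, 34, 2)
RAV1E_Q = ladder(80, 160, 16)
AOMENC_CQ = ladder(20, 40, 4)
VPXENC_CQ = ladder(20, 48, 4)


def x264_cmdline(crf, source="orig.i420.y4m"):
    """Build the x264 command line from the post for one CRF value."""
    return (f"x264 --preset veryslow --tune ssim --crf {crf} "
            f"-o test.x264.crf{crf}.264 {source}")


for crf in X26X_CRF:
    print(x264_cmdline(crf))
```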

Notes:
Revisiting the sequences from the previous report:

F.Y.C, x264 -> rav1e
         RATE (%)    DSNR (dB)
MSSSIM   -21.8223    1.2258
PSNRHVS  -28.408     2.29056
HVMAF    -18.7682    3.89782

PresageFlowerFight, x264 -> rav1e
         RATE (%)    DSNR (dB)
MSSSIM   -33.8353    1.95641
PSNRHVS  -33.5774    2.47118
HVMAF    -29.2079    7.17318

PresageFlowerWalk, x264 -> rav1e
         RATE (%)    DSNR (dB)
MSSSIM   -6.98635    0.333085
PSNRHVS  -3.78743    0.244148
HVMAF    2.23865     -0.150589

MAJOR improvements in static/very low motion scenes thanks to adaptive keyframe selection (https://github.com/xiph/rav1e/commit/869fef7002e7a7595bf6891228bd58d70a26a670), marginal improvements in the others.

Phanton_13

19th December 2018, 00:36

utack

19th December 2018, 05:26

The cpu-used heuristics in libaom seem to be poorly tuned in lossless mode.
Tested with a digitally animated gif (https://mir-s3-cdn-cf.behance.net/project_modules/max_1200/90d83d73226609.5c0277698d016.gif)
Size increases by 50% for 1/4 of the encode time saved compared to cpu-used=0, and doubles at 60% of the encode time of cpu-used=0.
Each dot represents one cpu-used value.
https://i.imgur.com/RR3IcEb.png

Nintendo Maniac 64

19th December 2018, 07:31

digitally animated gif

If you ever want to test higher-quality animation examples (read: not limited to 256 colors), then perhaps try an animated PNG.

SmilingWolf

19th December 2018, 11:06

SmilingWolf, could you also indicate the encoding speed? It doesn't need to be in fps; it can be relative to a specific encoder. It's more to see the variation in the speed of aomenc and rav1e due to optimizations or other improvements.

I use my PC under various loads while encoding, so I can't reliably measure times, which is why I haven't included anything time specific so far.
A quick test on the F.Y.C clip however has displayed an improvement, on average, of 12% over the last month (1.0.0-1058-g8547359cf vs 1.0.0-908-g3a607f7b0).
At the same time, quality as measured by MS-SSIM and PSNR-HVS-M has gone down ~2.5%, roughly back to the levels of 3 months ago (1.0.0-577-g8ae39302e).

As for rav1e, the times have only grown worse because there have been many new additions but almost no early breakout (or none at all?) strategies have been implemented yet.
And as of right now, some ASM optimizations are disabled (https://github.com/xiph/rav1e/blob/f0e759175f52a1dae4233ff528557fcf0e3c8319/src/me.rs#L10) on Windows (https://github.com/xiph/rav1e/blob/f0e759175f52a1dae4233ff528557fcf0e3c8319/src/predict.rs#L157), so I guess I'm basically running on Rust code. So not much of a speedup on that front either (yet).

Phanton_13

20th December 2018, 14:52

Thanks SmilingWolf for the info.

sneaker_ger

22nd December 2018, 11:31

Simply crashes on my system even with a simple "ffmpeg" (no parameters). Windows 7 x64, i5-2500K (SSE4.2 and AVX).

sneaker_ger

22nd December 2018, 17:47

Still crashes.

lvqcl

22nd December 2018, 18:40

My CPU doesn't support SSE4.2, only 4.1, but I still tried to run it.
ffmpeg.exe crashes at shlx instruction, which is part of BMI2 (Haswell/Excavator).
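
A quick way to see this kind of mismatch before running a binary is to compare the instruction sets it was built for with what the CPU advertises. The Python below is a sketch that assumes a Linux-style /proc/cpuinfo text with a "flags" line (flag spellings such as "bmi2" and "avx2" follow that file); the function name and the sample text are made up, and it does not inspect the binary itself.

```python
def missing_cpu_features(cpuinfo_text, required):
    """Return the required feature flags that the CPU does not advertise.

    `cpuinfo_text` is the content of a /proc/cpuinfo style file.
    """
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            available = set(value.split())
            return sorted(set(required) - available)
    raise ValueError("no 'flags' line found")


# Made-up excerpt for a CPU with AVX but without AVX2/BMI2.
sample = ("model name\t: example CPU\n"
          "flags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2 avx\n")
print(missing_cpu_features(sample, ["avx2", "bmi2", "sse4_1"]))
```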

nevcairiel

22nd December 2018, 19:11

It's usually not a good idea to build such a restricted binary, nor does it really give a meaningful speed enhancement. Just use a better build.
I often intentionally restrict what I allow compilers to do, because even GCC 8 is still terrible at using advanced instruction set instructions, and it can and will cause issues left and right.

This is especially true in a large and old code base like ffmpeg, which has code that was written 15 years ago, code that was written just now, code that was painstakingly hand-optimized, and whatnot, which can result in quite varying stress on the compiler.

Wolfberry

22nd December 2018, 22:01

I think the crash comes from the fact that I configured gcc to target skylake by default and forgot to use the --cpu switch. I will rebuild them later.

Thanks nevcairiel.

SmilingWolf

23rd December 2018, 01:26

GCC 8.2.1 20181221, static build
ffmpeg-4.2-92779-g8b53d1322f: https://mega.nz/#!5kJxjA7I!dHKMhcYcjQPVZZWEBVTkzgNpOxT5hyylOEHTDtESTYM
- libaom-av1 1.0.0-1103-g9a48f9ca5
- libdav1d 0.1.0-38-g1703f21

SmilingWolf

23rd December 2018, 13:22

My AVIF toolkit: https://mega.nz/#!5oQE2Sob!STZHdk4ob4ptHknMvNcB4JxbCt9xdu3WUKkg7iyh2EM

Needs MSYS2. It's mostly a hack job.
Don't mess with the directory structure. Images to convert to AVIF go in "images", AVIF files to convert to PNG go in "contained".
For now the script assumes all the AVIF images follow the BT709 matrix for YUV -> RGB conversion.

Usage:

encode.MT.sh takes as input a quality for aomenc
encode.ST.sh takes as input an image path and a quality setting
decode.MT.sh takes no arguments
decode.ST.sh takes the AVIF file path

ST and MT are for Single Thread and Multi Thread, respectively. Would be more correct to say multi-process but you get the idea. The default is 6 processes, you can tweak it by modifying xargs' -P option in the MT script.
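
The MT scripts are shell and are not reproduced here; the fan-out they perform with xargs -P can be sketched in Python as follows. The argument order passed to encode.ST.sh and all function names are assumptions for illustration. Threads are enough in this sketch because each job just waits on an external process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def encode_one(path, quality):
    """Run the single-image script for one file (argument order is assumed)."""
    return subprocess.run(["./encode.ST.sh", path, str(quality)]).returncode


def run_parallel(paths, worker, processes=6):
    """Apply `worker` to every path with at most `processes` jobs in flight."""
    with ThreadPoolExecutor(max_workers=processes) as pool:
        return list(pool.map(worker, paths))


# Example (not executed here):
# codes = run_parallel(image_paths, lambda p: encode_one(p, 20))
```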

The defaults are for high quality (--cpu-used=0 for extra overkill), so it might take a while to convert everything depending on your machine. You can tweak aomenc's options in encode.ST.sh

Also included a python script to gather the metrics' weighted average from a given stats file (auto generated by encode.MT.sh)

SmilingWolf

23rd December 2018, 17:24

zscale (libzimg) should be preferred (personal opinion)
While I do use libzimg from time to time, it has a tendency to randomly crash (esp. when downscaling video), so I don't want it in an automated workflow.

For the mod2 thing: setting YUV444 just for that looks way too wasteful unless you already have to preserve zones of high contrast like in the DDMC.png example. So maybe something like scale=-2:ih to set the width to a multiple of 2?

lvqcl

23rd December 2018, 17:47

I decided to test how my old CPU (Intel Core2 Quad Q9300, SSE4.1) decodes AV1 using ffmpeg from SmilingWolf build.

Test video: https://www.youtube.com/watch?v=PiWyCQV52h0 , 1280x720p.

-c:v libaom-av1: 1.89x realtime, utime = 94 sec.
-c:v libdav1d: 1.30x realtime, utime = 158 sec.
-c:v libdav1d -threads 4 -tilethreads 4: 1.67x realtime, utime = 159 sec.
-c:v libdav1d -threads 8 -tilethreads 8: 1.74x realtime, utime = 163 sec.
-c:v libdav1d -threads 16 -tilethreads 16: 2.02x realtime, utime = 161 sec.

(I have no idea what threads and tilethreads options do, so I just tested various values for them)

So, on my system dav1d requires ~160/94=1.7 times more CPU time than aom.

MoSal

24th December 2018, 12:16

(I have no idea what threads and tilethreads options do, so I just tested various values for them)

Try with tilethreads set to 1.

lvqcl

24th December 2018, 18:01

-c:v libdav1d: 1.31x realtime, utime = 156 sec.
-c:v libdav1d -threads 4 -tilethreads 1: 1.31x realtime, utime = 157 sec.
-c:v libdav1d -threads 8 -tilethreads 1: 1.61x realtime, utime = 158 sec.
-c:v libdav1d -threads 16 -tilethreads 1: 1.80x realtime, utime = 159 sec.
-c:v libdav1d -threads 32 -tilethreads 1: 1.98x realtime, utime = 160 sec.

v0lt

24th December 2018, 18:22

@lvqcl
What is "utime"? It doesn't look like decoding time.

NikosD

24th December 2018, 20:20

What's the progress of dav1d leveraging SSSE3 assembly optimizations ?

Are we still based on AVX2 only for dav1d ?

lvqcl

24th December 2018, 20:37

ffmpeg prints something like this:
bench: utime=178.762s stime=2.839s rtime=58.362s
IIUC:
utime = total time spent on user code (across all CPU cores)
stime = total time spent on system code
rtime = "real time" aka wall time

So: it took 58.362 seconds to decode a video, but CPU time spent on decoding was 178.762+2.839 sec.
That is, (178.762+2.839)/58.362 = 3.1 cores were active (on average) during decoding.
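
The same calculation as a small Python sketch that parses a "bench:" line of the form quoted above (the function name is made up):

```python
import re


def average_active_cores(bench_line):
    """Parse a 'bench:' line and return (utime + stime) / rtime."""
    times = {name: float(value)
             for name, value in re.findall(r"(utime|stime|rtime)=([\d.]+)s",
                                           bench_line)}
    return (times["utime"] + times["stime"]) / times["rtime"]


line = "bench: utime=178.762s stime=2.839s rtime=58.362s"
print(round(average_active_cores(line), 1))
```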

Wolfberry

25th December 2018, 01:55

@NikosD

SSSE3: issue #216 (https://code.videolan.org/videolan/dav1d/issues/216) (7 / 28)

AVX2: issue #78 (https://code.videolan.org/videolan/dav1d/issues/78) (9 / 52)