Alliance for Open Media codecs - Page 36

MoSal · 13th June 2018, 21:16

Quote:

I bought an LG smart TV 1.5 years ago. This year, after an upgrade of the YouTube application, it started to support VP9 (Stats for nerds reports it) and plays YouTube 4K without a single drop.

I tested a cheap locally-assembled TV with Chinese parts the other day. VP9 4K@60fps is supported out of the box. Opus was the codec that wasn't supported.

No one will forget AV1, not even the no-name chip manufacturers. Here's hoping that from now on they will not forget Opus either.

14th June 2018, 03:13

Quote:

Originally Posted by IgorC

It's not like VP9 isn't supported by any smart TVs.

I bought an LG smart TV 1.5 years ago. This year, after an upgrade of the YouTube application, it started to support VP9 (Stats for nerds reports it) and plays YouTube 4K without a single drop.

But it’s not supported by everything whereas anything that can do 4K with a Netflix app has to have HEVC support.

Blue_MiSfit · 15th June 2018, 09:46

Quote:

anything that can do 4K with a Netflix app has to have HEVC support.

Exactly

blurred · 22nd June 2018, 10:24

Interesting discussion regarding the choice of entropy coder for AV1: https://encode.ru/threads/1890-Bench...ll=1#post56945

The Daala range coder, using 16 multiplications per symbol, has won over rANS, which uses 1 multiplication per symbol and has ~7x faster implementations: https://sites.google.com/site/powturbo/entropy-coder

Does anybody know why the slower and more costly one was chosen?

ps. This nibble adaptive rANS is e.g. used in recent open source Dropbox DivANS: https://blogs.dropbox.com/tech/2018/...r-with-divans/

nevcairiel · 22nd June 2018, 11:43

Often the choice is for simpler hardware implementations, since that's really the future, not software. I'm also not convinced a generic benchmark can fully represent the performance characteristics of an actual codec.

Phanton_13 · 22nd June 2018, 12:13

If I remember correctly, rANS does some things in reverse order, among other things that complicate a hardware implementation and raise its cost. In other words, implementing the 16 multipliers of the Daala range coder in silicon is cheaper than implementing the memory that rANS needs. It also appears that rANS can increase latency, especially at low rates, due to the buffers. Most of these problems are being tackled in a new generation of ANS coders, but they are not going to be ready in time for a possible implementation in AV1.

Also remember: faster/cheaper in software is not the same as faster/cheaper in hardware.
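To make the "reverse" remark concrete, here is a minimal toy sketch of rANS, not the coder benchmarked in the thread. It assumes a hypothetical three-symbol alphabet with fixed frequencies and an unbounded Python integer as state (no renormalization, no bitstream). The names `FREQ`, `CUM`, `push` and `pop` are made up for illustration. The point is only that the state behaves like a stack, so the encoder has to buffer the symbols and feed them in reverse.

```python
# Toy rANS with an unbounded integer state (no renormalization); illustration only.
PROB_BITS = 4                       # probabilities are expressed in units of 1/16
FREQ = {"a": 8, "b": 5, "c": 3}     # hypothetical fixed frequencies, summing to 16
CUM = {"a": 0, "b": 8, "c": 13}     # cumulative frequency where each symbol starts
MASK = (1 << PROB_BITS) - 1


def push(state, sym):
    """Encode one symbol onto the state."""
    f, c = FREQ[sym], CUM[sym]
    return ((state // f) << PROB_BITS) + (state % f) + c


def pop(state):
    """Decode the most recently pushed symbol; one multiplication per symbol."""
    slot = state & MASK
    sym = next(s for s in FREQ if CUM[s] <= slot < CUM[s] + FREQ[s])
    state = FREQ[sym] * (state >> PROB_BITS) + slot - CUM[sym]
    return sym, state


def encode(symbols, state=1):
    # The whole message is buffered and pushed last-to-first,
    # so that the decoder pops it first-to-last.
    for sym in reversed(symbols):
        state = push(state, sym)
    return state


def decode(state, count):
    out = []
    for _ in range(count):
        sym, state = pop(state)
        out.append(sym)
    return out, state


message = list("abacabca")
decoded, final_state = decode(encode(message), len(message))
print("".join(decoded), final_state)
```

Pushing the symbols in their natural order instead would make the decoder return them backwards, which is why a streaming encoder needs the extra buffer discussed here.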

blurred · 22nd June 2018, 17:03

But don't 16 multiplications cost more energy than one - paid in energy consumption and battery life of our devices?

Phanton_13 · 23rd June 2018, 00:55

That only applies to software implementations. In hardware you don't think in number of instructions but in number of gates/transistors, and the mapping is not that simple: sometimes you can implement a function with 9 multiplications in the same number of gates that it takes to implement 2 multiplications (this is an example from a real case). Memory also has a cost in gates and power, and you need to evaluate whether it's more efficient to spend the gates on memory or on processing. The result of this evaluation is what determined the use of the Daala range coder, as AV1 has been designed with hardware implementation in mind because it is critical for mobile phones.

blurred · 24th June 2018, 12:32

Quote:

Originally Posted by Phanton_13

(...)you can implement a function with 9 multiplications in the same number of gates that it takes to implement 2 multiplications(...)

Looks like you are referring to serial execution, which might require 16x frequency increase here (?)
And hardware decoding requires replacing current hardware - meanwhile (~5 years) it will be done in software, where being 7x slower seems a huge sacrifice.
Additionally, Google is still fighting for this ANS patent over dead bodies ( https://arstechnica.com/tech-policy/...public-domain/ ) - if it is not intended for AV1, will it prevent others using ANS in video compression?

Phanton_13 · 24th June 2018, 15:21

Quote:

Originally Posted by blurred

Looks like you are referring to serial execution, which might require 16x frequency increase here (?)

No, I am referring to reimplementing the function. In that case a 50 MHz FPGA implementation of the full ASIC was able to match a Core 2 Duo at 2 GHz running the same functionality in software, and in silicon the ASIC was running at 1 GHz.

Hardware design is very different from software development. For example, in the Daala range coder most multiplications are constant*value, and in that case you don't always need a real multiplier in hardware. Take the case where the constant is "2": for unsigned values it is only a bit shift, and in hardware it is even cheaper because you only reroute the data lines; for signed values you use an adder or a modified shifter. For other constants there is usually an alternative, faster way to implement the operation than a full multiplier. Also, most of the time you don't need a full multiplication unit, as you only implement what you need: for example, you can build a 12-bit multiplier instead of a 16-bit one if your values always fit in 12 bits, or you implement only the lower bits of a 16-bit multiplication and ignore anything above 16 bits...
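As a rough software analogy for the two hardware tricks described in this post, the sketch below multiplies by a known constant using only shifts and adds, and models a multiplier that keeps only the low 16 bits of the product. The function names are hypothetical and this says nothing about how any real AV1 decoder is laid out; in silicon the shifts would simply be wiring.

```python
def shift_add_terms(constant):
    """Bit positions set in `constant`; each one is a shifted copy of the input."""
    return [k for k in range(constant.bit_length()) if (constant >> k) & 1]


def mul_by_constant(value, constant):
    """Multiply using only shifts and adds, one add per set bit of the constant."""
    return sum(value << k for k in shift_add_terms(constant))


def mul_truncated(a, b, bits=16):
    """Keep only the low `bits` bits of the product, as a narrowed multiplier would."""
    return (a * b) & ((1 << bits) - 1)


print(shift_add_terms(10))          # 10 = 0b1010, so bits 1 and 3 are set
print(mul_by_constant(123, 10))     # same result as 123 * 10
print(mul_truncated(0xFFFF, 0xFFFF))
```

A constant with few set bits therefore costs only a couple of adders, and multiplying by 2 costs no logic at all.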

In hardware design the frequency is a value derived from data propagation (delay, timing), and what you want is results: if an implementation runs at a lower frequency but produces the result faster, you go for it.

Quote:

Originally Posted by blurred

And hardware decoding requires replacing current hardware - meanwhile (~5 years) it will be done in software, where being 7x slower seems a huge sacrifice.

That is true, but it is more like 2-3 years for hardware to start appearing, and in this case it can be reduced to 1 year thanks to the various hardware designers and manufacturers in AOM.

Quote:

Originally Posted by blurred

Additionally, Google is still fighting for this ANS patent over dead bodies ( https://arstechnica.com/tech-policy/...public-domain/ ) - if it is not intended for AV1, will it prevent others using ANS in video compression?

No. Having it refused can actually be good: if the same thing is later approved for another entity, the earlier refusal can be used to call the patent office and that later approval into question and get it invalidated. This also demonstrates the dysfunction in both the patent system and the legal teams in companies.

blurred · 24th June 2018, 16:17

Quote:

Originally Posted by Phanton_13

(...)in the range coder of daala most multiplications are constant*value(...)

If I properly understand, there are 16 multiplications due to "maximal alphabet size" = 16 - it needs to multiply "range size" by CDF value for all 16 symbols.
In contrast, rANS needs to multiply by only one value (p[s] = CDF[s+1]-CDF[s]), where s is the currently decoded symbol.

CDF changes with data type (context), and can be adapted - these are definitely not constant values.
In hardware you can build 16 parallel multipliers not to increase frequency, but it would need 16x more gates, and most importantly: consume 16x more energy.
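The multiplication count discussed in this post can be sketched as follows. This is a simplified model of a range decoder's symbol search, assuming a 15-bit CDF and a linear search; `find_symbol` and its arguments are hypothetical, and the real AV1 coder uses different arithmetic details. Each candidate symbol costs one multiplication of the current range by a CDF entry, so a 16-symbol alphabet needs up to 16 of them.

```python
CDF_BITS = 15  # assumed precision: CDF values are scaled to 1 << 15


def find_symbol(rng, value, cdf):
    """Locate the symbol whose scaled interval contains `value` (0 <= value < rng).

    `cdf` holds ascending cumulative frequencies with cdf[0] == 0 and
    cdf[-1] == 1 << CDF_BITS. Returns (symbol, interval_low, interval_size,
    multiplications_used).
    """
    multiplications = 0
    low = 0
    for symbol in range(len(cdf) - 1):
        # One multiplication per CDF entry: scale the boundary to the current range.
        high = (rng * cdf[symbol + 1]) >> CDF_BITS
        multiplications += 1
        if value < high:
            return symbol, low, high - low, multiplications
        low = high
    raise ValueError("value must be smaller than rng")


uniform16 = [i * 2048 for i in range(17)]   # 16 equally likely symbols
print(find_symbol(1000, 0, uniform16))      # first symbol: 1 multiplication
print(find_symbol(1000, 999, uniform16))    # last symbol: 16 multiplications
```

All products share the operand `rng` and are independent of each other, which is why they can be computed in parallel in hardware or with SIMD, whereas the rANS decode step needs a single multiplication by the frequency of the decoded symbol.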

Phanton_13 · 24th June 2018, 19:45

You are partly right, but at the same time you are forgetting a few things. Do those 16 parallel multipliers consume more energy than the extra memory needed by rANS? Hardware is inherently parallel, so is it possible to do the update in parallel with another task during the decoding process? There is also the possibility of optimizing those 16 parallel multiplications, as one operand is common to all of them. One thing that helps with hardware is to think of it not as a computer program but as a data flow between operands.

blurred · 24th June 2018, 21:24

Such an additional buffer (for rANS) is only needed in the encoder, which for video compression is usually an order of magnitude more costly anyway and, e.g. for YouTube or Netflix video, is used only once per thousands or millions of views (decodings).
And video compressor seems to require huge flexible buffers for various modellings/predictions - is it a non-negligible cost to share a few kilobytes with entropy coder?

Quote:

Originally Posted by Phanton_13

Also there is the possibility of optimization for those 16 parallel multiplications as one operand is common to all.

Interesting, indeed the range is varying, but the same for all 16 multiplications.
Thinking about multiplication as shifts and additions, the cheap shifting part can be indeed shared, but it doesn't seem simple to get systematic optimization for separate additions - do you maybe know some paper showing how to optimize it?

Quikee · 25th June 2018, 21:13

AV1 1.0.0 code tag

Also specs don't have draft status anymore.

No official announcement yet..

GTPVHD · 26th June 2018, 00:12

https://aomediacodec.github.io/av1-spec/

Still says Draft Document here.

TD-Linux · 26th June 2018, 02:03

Quote:

Originally Posted by blurred

The Daala range coder, using 16 multiplications per symbol, has won over rANS, which uses 1 multiplication per symbol and has ~7x faster implementations: https://sites.google.com/site/powturbo/entropy-coder

Does anybody know why the slower and more costly one was chosen?

Firstly, the AV1 range coder only uses 1 multiplication per CDF entry, the 16 is the "worst case" (keep in mind that they can be done in parallel, e.g. with SIMD, so it's actually better to use more than less as the multiply is the cheapest part in software). Secondly, the difference is nowhere near 7x when we benchmarked the two - rANS was faster, but by a factor of about 2. However, the requirement to buffer and reverse the symbols was unfortunately insurmountable.

Also keep in mind that AV1 adjusts the probabilities on a per-symbol basis. The entropy coder CDFs are designed to make adapting the probabilities very fast (with only adds and shifts). This puts some constraints on the design that don't exist in the linked benchmark (which uses fixed probabilities as far as I can tell).
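A minimal sketch of what "adapting with only adds and shifts" can look like, assuming a 15-bit CDF stored as P(symbol <= i) and a fixed adaptation rate. The function name, the storage convention and the rate of 5 are assumptions for illustration; this is not the exact AV1 update rule, and it omits the minimum-probability handling a real coder needs.

```python
CDF_TOTAL = 1 << 15  # assumed 15-bit probability precision


def adapt_cdf(cdf, symbol, rate=5):
    """Return a CDF nudged toward the observed symbol using only adds and shifts.

    cdf[i] is P(sym <= i) scaled to CDF_TOTAL; the last entry stays at CDF_TOTAL.
    Entries at or above the symbol move up toward CDF_TOTAL, entries below it
    move down toward 0, each by 1/2**rate of the remaining distance.
    """
    updated = list(cdf)
    for i in range(len(cdf) - 1):
        if i >= symbol:
            updated[i] += (CDF_TOTAL - updated[i]) >> rate
        else:
            updated[i] -= updated[i] >> rate
    return updated


uniform4 = [8192, 16384, 24576, 32768]
after = adapt_cdf(uniform4, symbol=2)
print(after)                 # symbol 2 now owns a wider slice than before
print(after[2] - after[1])   # scaled probability of symbol 2 after one update
```

No multiplication or division is involved, and the entries stay in ascending order after every update, which is the kind of constraint a coder with fixed probabilities does not have to meet.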

benwaggoner · 26th June 2018, 04:50

Quote:

Originally Posted by nevcairiel

Often the choice is for simpler hardware implementations, since that's really the future, not software. I'm also not convinced a generic benchmark can fully represent the performance characteristics of an actual codec.

I think you can remain happily convinced that a generic benchmark will NOT "represent the performance characteristics of an actual codec"

There is so much clever that gets done, even in decoders. And there are so many different kinds of parallelization, SIMD, ASIC, etcetera available. And surprising numbers of decoders don't implement basic stuff like skipping non-reference frames when doing seeking, due to the system layer and the decoder layers not being tightly coupled enough.

AV1 is way better designed for parallelized HW decoders than VP9 was, which was pretty painfully serialized compared to HEVC, with software decoders pretty dependent on fast single-core performance.

LigH · 26th June 2018, 07:37

@ GTPVHD: Then the github site may have outdated content?

MABS also retrieves sources from GoogleSource. And had to disable a TESTS flag to continue compiling, 2 days ago.
__

P.S.: New upload:

AOM v1.0.0-6-gce8f4811b (yes, v1.0.0+)

Phanton_13 · 26th June 2018, 11:17

Quote:

Originally Posted by blurred

do you maybe know some paper showing how to optimize it?

Unfortunately no for the general case. While searching for one I found a paper, "DAALA_EC in AV1", that has some data for hardware implementations:

Daala_ec decoder: 54k gates, performance 1 symbol per clock, decoding time 1 clock.
Daala_ec encoder: 9k gates, performance 1 symbol per clock, encoding time 1 clock.
ANS decoder: 49k gates, performance 1 symbol per clock, decoding time 1 clock.
ANS encoder: 25k gates, performance 1 symbol every 2 clocks, encoding time 2 clocks.

For reference, the VP9 G2 hardware codec has 2.60M gates (2160p@30fps content playback: ~250 MHz).

Basically, ANS does not decode faster than the Daala range coder once implemented in hardware, and it is even slower in encoding. Speed differences in software not carrying over to hardware is common enough to call it the norm. Another thing is that the hardware guys at ARM/AMD/Intel/Nvidia appear to have had a strong hand in the decision to use the Daala range coder.

Also, rANS is quite recent and highly optimized, plus it uses 32/64-bit arithmetic and SIMD instructions, while the Daala range coder uses only 16-bit arithmetic. And you can fit between 2 and 4 single-clock 16-bit multipliers in the same number of gates as one single-clock 32-bit multiplier.
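To put the gate counts quoted in this post in proportion, here is a small back-of-the-envelope calculation. It uses only the figures given above; comparing them against the 2.60M-gate VP9 G2 figure is a rough assumption, since that number comes from a different design, and the dictionary keys are made-up labels.

```python
# Gate counts as quoted in the post (from the cited "DAALA_EC in AV1" paper).
GATES = {
    "daala_dec": 54_000,
    "daala_enc": 9_000,
    "ans_dec": 49_000,
    "ans_enc": 25_000,
}
VP9_G2_GATES = 2_600_000  # quoted reference figure for a whole VP9 hardware codec


def share_percent(gates, total=VP9_G2_GATES):
    """Size of a block as a percentage of the reference codec."""
    return 100.0 * gates / total


for name, gates in GATES.items():
    print(f"{name}: {gates} gates, {share_percent(gates):.2f}% of the reference")

daala_pair = GATES["daala_dec"] + GATES["daala_enc"]
ans_pair = GATES["ans_dec"] + GATES["ans_enc"]
print("encoder + decoder:", daala_pair, "vs", ans_pair)
```

On these numbers either entropy coder is around 2% of the reference codec, the decoders differ by 5k gates, and the encoder-plus-decoder pair is smaller for Daala_ec (63k) than for ANS (74k).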

Blue_MiSfit · 27th June 2018, 08:11

Congrats to the AOM team for hitting 1.0! It's only a few months late

Good stuff tho, looking forward to the encoder maturing. It's always great to see more options.
