13th June 2018, 21:16 | #701 | Link | |
Registered User
Join Date: Jun 2013
Posts: 95
|
Quote:
No one will forget AV1, not even the no-name chip manufacturers. Here's hoping that, from now on, they will not forget Opus either.
__________________
https://github.com/MoSal |
|
22nd June 2018, 10:24 | #704 | Link |
Registered User
Join Date: Jul 2016
Posts: 14
|
Interesting discussion regarding the choice of entropy coder for AV1: https://encode.ru/threads/1890-Bench...ll=1#post56945
The Daala range coder, which uses 16 multiplications per symbol, has won over rANS, which uses 1 multiplication per symbol and has ~7x faster implementations: https://sites.google.com/site/powturbo/entropy-coder Does anybody know why the slower and more costly one was chosen?
P.S. This nibble-adaptive rANS is used, for example, in the recent open-source Dropbox DivANS: https://blogs.dropbox.com/tech/2018/...r-with-divans/
Last edited by blurred; 22nd June 2018 at 10:40. Reason: Dropbox DivANS |
22nd June 2018, 11:43 | #705 | Link |
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,346
|
Often the choice is made for simpler hardware implementations, since that's really the future, not software. I'm also not convinced a generic benchmark can fully represent the performance characteristics of an actual codec.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders Last edited by nevcairiel; 22nd June 2018 at 11:47. |
22nd June 2018, 12:13 | #706 | Link |
Registered User
Join Date: May 2002
Posts: 95
|
If I remember correctly, rANS does some things in reverse order, plus other things that raise the cost of a hardware implementation. In other words, implementing the 16 multipliers of the Daala range coder in silicon is cheaper than implementing the memory that rANS needs. It also appears that rANS can increase latency, especially at low rates, because of the buffers. Most of these problems are being tackled in a new generation of ANS coders, but they are not going to be ready in time for a possible implementation in AV1.
Also remember: faster/cheaper in software is not the same as faster/cheaper in hardware. |
23rd June 2018, 00:55 | #708 | Link |
Registered User
Join Date: May 2002
Posts: 95
|
That only applies to software implementations. In hardware you don't think in number of instructions but in number of gates/transistors, and the mapping is not that simple: sometimes you can implement a function with 9 multiplications in the same number of gates that it takes to implement 2 multiplications (this is an example from a real case). Memory also has a cost in gates and power, and you need to evaluate whether it is more efficient to spend the gates on memory or on processing. The result of this evaluation is what determined the use of the Daala range coder, as AV1 has been designed with hardware implementation in mind because it is critical for mobile phones.
Last edited by Phanton_13; 23rd June 2018 at 00:58. |
24th June 2018, 12:32 | #709 | Link | |
Registered User
Join Date: Jul 2016
Posts: 14
|
Quote:
And hardware decoding requires replacing current hardware. In the meantime (~5 years) decoding will be done in software, where being 7x slower seems a huge sacrifice. Additionally, Google is still fighting relentlessly for this ANS patent ( https://arstechnica.com/tech-policy/...public-domain/ ) - if it is not intended for AV1, will it prevent others from using ANS in video compression? |
|
24th June 2018, 15:21 | #710 | Link | |||
Registered User
Join Date: May 2002
Posts: 95
|
Quote:
Hardware design is very different from software development. For example, in the Daala range coder most multiplications are constant*value, and in that case the hardware does not always need a real multiplication. When the constant is "2" there are several variants: for unsigned values it is only a bit shift, and in hardware it is even cheaper because you only reroute the data; for signed values you use an adder or a modified shifter. For other constants there is usually an alternative, faster way to implement the operation than a full multiplier. Also, most of the time you don't need a full multiplication unit, as you implement only what you need: for example, you can build a 12-bit multiplier instead of a 16-bit one if your values always fit in 12 bits, or implement only the lower bits of a 16-bit multiplication and ignore anything above 16 bits... In hardware design the frequency is a value derived from data propagation (delay, timing), and what you want is results: if an implementation runs at a lower frequency but produces the result faster, you go for it. Quote:
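A software analogy for the constant-multiplication point, as a rough sketch only. The constant 10 and the function names are hypothetical choices for illustration, not taken from the Daala coder; in real hardware the shifts would be plain wiring rather than operations.

```python
def times_two_unsigned(x):
    # Multiplying an unsigned value by 2 is a one-bit left shift
    # (in hardware: just rerouted wires, no gates at all).
    return x << 1


def times_ten(x):
    # 10 = 8 + 2, so x * 10 needs two shifts and one adder
    # instead of a general-purpose multiplier.
    return (x << 3) + (x << 1)


def mul_low16(a, b):
    # Keep only the low 16 bits of the product, mirroring a multiplier
    # that never builds the upper half of the result.
    return (a * b) & 0xFFFF


if __name__ == "__main__":
    print(times_two_unsigned(21), times_ten(7), mul_low16(300, 300))
```

The same idea extends to any fixed constant: each set bit of the constant costs one shifted copy of the input and one addition.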
Quote:
|
|||
24th June 2018, 16:17 | #711 | Link | |
Registered User
Join Date: Jul 2016
Posts: 14
|
Quote:
In contrast, rANS needs to multiply by only one value (p[s] = CDF[s+1]-CDF[s]), where s is the currently decoded symbol. The CDF changes with data type (context) and can be adapted - these are definitely not constant values. In hardware you can build 16 parallel multipliers so as not to increase the frequency, but that would need 16x more gates and, most importantly, consume 16x more energy. |
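A minimal sketch of that rANS step in Python, to make the single multiplication visible. It is a toy under stated assumptions: the 4-symbol CDF table and the 12-bit scale are hypothetical, the state is an unbounded Python integer, and the byte-wise renormalization a real coder needs is left out. Note that the encoder walks the symbols in reverse, which is the "reverse" behaviour mentioned earlier in the thread.

```python
SCALE_BITS = 12
TOTAL = 1 << SCALE_BITS

# Hypothetical static model: CDF[s] is the cumulative frequency below symbol s.
CDF = [0, 2048, 3072, 3584, 4096]


def encode(symbols):
    x = 1  # initial state
    for s in reversed(symbols):  # rANS encodes last symbol first
        freq = CDF[s + 1] - CDF[s]
        x = (x // freq) * TOTAL + (x % freq) + CDF[s]
    return x


def decode(x, count):
    out = []
    for _ in range(count):
        slot = x & (TOTAL - 1)
        # Linear search for the symbol whose CDF interval contains slot.
        s = next(i for i in range(len(CDF) - 1) if CDF[i] <= slot < CDF[i + 1])
        freq = CDF[s + 1] - CDF[s]  # p[s] = CDF[s+1] - CDF[s]
        x = freq * (x >> SCALE_BITS) + slot - CDF[s]  # the one multiplication
        out.append(s)
    return out


if __name__ == "__main__":
    message = [0, 1, 3, 2, 0, 0, 1]
    print(decode(encode(message), len(message)) == message)
```

The decoder multiplies once per symbol, but the encoder needs a division by freq and has to see the symbols in reverse, so it has to buffer them first.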
|
24th June 2018, 19:45 | #712 | Link |
Registered User
Join Date: May 2002
Posts: 95
|
You are partly right, but you are forgetting one thing: do those 16 parallel multipliers consume more energy than the extra memory needed by rANS? Hardware is inherently parallel, so is it possible to do the update in parallel with another task during the decoding process? There is also the possibility of optimizing those 16 parallel multiplications, as one operand is common to all of them. One thing that helps with hardware is to think of it not as a computer program but as a data flow between operands.
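For readers who want to see where the up-to-16 multiplications come from, here is a simplified multi-symbol range-decoder search in Python. It is a sketch under assumptions, not the Daala or AV1 code: the names rng, dif and decode_symbol are hypothetical, the CDF is scaled to 15 bits, and real implementations differ in precision and rounding. Every product shares the operand rng, and the products are independent of each other, so hardware can compute them side by side.

```python
CDF_BITS = 15


def decode_symbol(rng, dif, cdf):
    """Return (symbol, low, width) for the interval that contains dif.

    rng is the current range, dif the position inside it (0 <= dif < rng),
    cdf a table with cdf[0] == 0 and cdf[-1] == 1 << CDF_BITS.
    """
    # One multiplication per table entry, all with the same operand rng.
    bounds = [(rng * c) >> CDF_BITS for c in cdf]
    for s in range(len(cdf) - 1):
        if bounds[s] <= dif < bounds[s + 1]:
            return s, bounds[s], bounds[s + 1] - bounds[s]
    raise ValueError("dif must be smaller than rng")


if __name__ == "__main__":
    print(decode_symbol(40000, 25000, [0, 16384, 24576, 32768]))
```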
|
24th June 2018, 21:24 | #713 | Link | |
Registered User
Join Date: Jul 2016
Posts: 14
|
Such an additional buffer (for rANS) is only needed in the encoder, which for video compression is usually an order of magnitude more costly and, for services such as YouTube or Netflix, runs only once per thousands or millions of views (decodings).
And a video compressor seems to require huge flexible buffers for various modelling/prediction stages - is it a non-negligible cost to share a few kilobytes with the entropy coder? Quote:
Thinking about multiplication as shifts and additions, the cheap shifting part can indeed be shared, but it doesn't seem simple to get a systematic optimization for the separate additions - do you maybe know of a paper showing how to optimize it? |
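The split between shared shifts and separate additions can be illustrated in Python. This is only a software model with hypothetical names: the shifted copies of the common operand are built once, while each product still needs its own chain of additions, which is the part that is hard to optimize systematically.

```python
def products_with_shared_shifts(rng, cdf, bits=16):
    """Compute rng * c for every c in cdf using shifts and additions only."""
    # Shared part: the shifted copies of rng are computed once for all products.
    shifted = [rng << k for k in range(bits)]
    products = []
    for c in cdf:
        # Separate part: each product adds the copies selected by the bits of c.
        acc = 0
        for k in range(bits):
            if (c >> k) & 1:
                acc += shifted[k]
        products.append(acc)
    return products


if __name__ == "__main__":
    print(products_with_shared_shifts(40000, [0, 16384, 24576, 32768]))
```

Each cdf entry has to fit in `bits` bits for the result to be exact.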
|
26th June 2018, 02:03 | #716 | Link | |
Registered User
Join Date: Aug 2015
Posts: 34
|
Quote:
Also keep in mind that AV1 adjusts the probabilities on a per-symbol basis. The entropy coder CDFs are designed to make adapting the probabilities very fast (with only adds and shifts). This puts some constraints on the design that don't exist in the linked benchmark (which uses fixed probabilities as far as I can tell). Last edited by TD-Linux; 26th June 2018 at 02:06. |
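A sketch of what shift-and-add CDF adaptation can look like, in Python. It is an illustration under assumptions rather than the AV1 update rule: the 15-bit scale, the fixed RATE and the function name adapt are hypothetical, and a real codec also keeps every probability above zero, which this toy does not.

```python
TOTAL = 1 << 15  # probabilities scaled to 15 bits
RATE = 5         # hypothetical adaptation rate (larger means slower adaptation)


def adapt(cdf, s, rate=RATE):
    """Move the CDF toward the just-coded symbol s using only adds and shifts.

    cdf[i] is the scaled probability of a symbol below i, so cdf[0] == 0
    and cdf[-1] == TOTAL never change.
    """
    for i in range(1, len(cdf) - 1):
        if i > s:
            cdf[i] += (TOTAL - cdf[i]) >> rate  # pull toward TOTAL
        else:
            cdf[i] -= cdf[i] >> rate            # pull toward 0
    return cdf


if __name__ == "__main__":
    table = [0, 8192, 16384, 24576, 32768]
    print(adapt(table, 2))
```

After each call the interval belonging to s gets wider and the others shrink, with no multiplication or division involved.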
|
26th June 2018, 04:50 | #717 | Link | |
Moderator
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,770
|
Quote:
There is so much clever work that gets done, even in decoders. And there are so many different kinds of parallelization, SIMD, ASIC, etcetera available. And surprising numbers of decoders don't implement basic stuff like skipping non-reference frames when seeking, due to the system layer and the decoder layers not being tightly coupled enough. AV1 is way better designed for parallelized HW decoders than VP9 was, which was pretty painfully serialized compared to HEVC, with software decoders pretty dependent on fast single-core performance. |
|
26th June 2018, 07:37 | #718 | Link |
German doom9/Gleitz SuMo
Join Date: Oct 2001
Location: Germany, rural Altmark
Posts: 6,780
|
@ GTPVHD: Then the github site may have outdated content?
MABS also retrieves sources from GoogleSource. And I had to disable a TESTS flag to continue compiling, 2 days ago.
__
P.S.: New upload: AOM v1.0.0-6-gce8f4811b (yes, v1.0.0+)
Last edited by LigH; 26th June 2018 at 09:21. |
26th June 2018, 11:17 | #719 | Link |
Registered User
Join Date: May 2002
Posts: 95
|
Unfortunately not for the general case. While searching for one I found a paper, "DAALA_EC in AV1", that has some data for hardware implementations:
Daala_ec decoder: 54k gates, performance 1 symbol per clock, decoding time 1 clock.
Daala_ec encoder: 9k gates, performance 1 symbol per clock, encoding time 1 clock.
ANS decoder: 49k gates, performance 1 symbol per clock, decoding time 1 clock.
ANS encoder: 25k gates, performance 1 symbol every 2 clocks, encoding time 2 clocks.
For reference, the VP9 G2 hardware codec has 2.60M gates (2160p@30fps content playback: ~250 MHz).
Basically, once implemented in hardware, ANS does not decode faster than the Daala range coder, and it is even slower at encoding. A speed difference in software implementations failing to carry over to hardware is common enough to call it the norm. Another thing is that the hardware people at ARM/AMD/Intel/Nvidia appear to have had a good hand in the decision to use the Daala range coder. Also, the rANS implementation is quite recent and highly optimised, and it uses 32/64-bit arithmetic and SIMD instructions, while the Daala range coder uses only 16-bit arithmetic. And you can fit between 2 and 4 single-clock 16-bit multipliers in the same number of gates as one single-clock 32-bit multiplier.
Last edited by Phanton_13; 26th June 2018 at 12:48. |
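To put those gate counts in proportion, a few lines of Python that express each entropy coder as a share of the 2.60M-gate VP9 G2 codec quoted above. The numbers are the ones from this post; the variable and function names are made up for the sketch, and comparing AV1 entropy coders against a VP9 codec is only a rough yardstick.

```python
# Gate counts in thousands, as quoted in the post above.
GATES_K = {
    "Daala_ec decoder": 54,
    "Daala_ec encoder": 9,
    "ANS decoder": 49,
    "ANS encoder": 25,
}
VP9_G2_GATES_K = 2600  # 2.60M gates


def share_percent(gates_k):
    # Share of the whole reference codec, in percent.
    return 100.0 * gates_k / VP9_G2_GATES_K


if __name__ == "__main__":
    for name, gates_k in GATES_K.items():
        print(f"{name}: {share_percent(gates_k):.2f}% of the VP9 G2 gate count")
```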