
View Full Version : Any news about a potential successor to VVC / H.266?



kurkosdr
19th August 2022, 18:15
I know it's a bit early to ask, but with Leonardo Chiariglione having announced the "death of MPEG", I thought there is no harm in asking:

Has anyone heard any news about a potential successor to VVC / H.266?

Now, don't get me wrong, VVC is pretty advanced and all, but it still can't compress a UHD stream to the bitrate of an FHD H.264 stream with the same quality, while at the same time terrestrial bands are shrinking worldwide. That's why I am wondering whether the 10-year cadence (a new format every 10 years or so) will be kept alive despite the recent restructuring of MPEG or VVC / H.266 will be the end of the line.

Also, I know AV1 exists and will likely keep evolving, but it's not an ISO or ETSI standard, and free-to-air and free-sat broadcasters can only use ISO and ETSI standards (and ATSC standards for the US). In other words, ISO standards are still important if free-to-air and free-sat are to keep up with internet distribution in terms of quality.

nhw_pulsar
19th August 2022, 19:44
There is currently MPEG VVC ECM: Enhanced Compression Model (beyond VVC), which shows some impressive improvement over VVC, something like 20-30% if I remember correctly. MPEG is also studying machine-learning next-generation video coding tools, notably super-resolution and neural-network-based loop filters. For more info, details are available on their website under standard explorations.

A drawback is that it is really slower, notably to decode...

Also, the ultra-impressive ECM is however (as always) PSNR- and SSIM-driven, and those metrics don't correlate well with the human visual system. For example, on the CLIC learned image compression challenge website, I have seen an image at high compression with BPG and ECM, and I visually preferred the BPG image... Also, very quickly about VVC: I don't know if my VTM binaries are right (I use the default command line and the slowest preset), but I find that VVC lacks neatness. Compared with my codec, I find that NHW has more neatness (but less precision), and visually I find neatness more pleasant than precision... Don't know if ECM will correct neatness? But of course this is my very personal opinion; the industry, for example, absolutely doesn't share it and completely sticks with PSNR, SSIM and precision...

Cheers,
Raphael

Jamaika
20th August 2022, 09:30
What are BPG photos and who uses them? After all we have the HEIF/AVIF standard.
The purchase of equipment is also very interesting nowadays. After the promotion, Orange offers smartphones for X EURO, but it can buy equipment for 75% cheaper on amazon through an intermediary who keeps 15% commission.
As for the vision of 8K LTM, LCEVC, EVC, AV2, JXS. They are paid and only on the SAT.

The Webb telescope has problems, and it is not known what the photos of the 16K universe will be in. A good topic for a clairvoyant.
https://www.v-nova.com/try-v-nova-video-compression-technologies/
https://cloud.qencode.com/lcevc-video-codec
https://www.lcevc.org/how-lcevc-works/
https://www.mpeg.org/

Blue_MiSfit
21st August 2022, 03:56
LC-EVC is a big step forward for a lot of use cases, especially when paired with next gen formats like VVC and AV1.

It's not a slam dunk for every use case though :)

nhw_pulsar
21st August 2022, 19:25
Hello,

Just a quick correction, sorry for my misleading information: BPG is not better than VVC intra. On my image test sets, VVC VTM intra is a little visually better than BPG for me, but the difference seems rather slight (BPG -m 1 also takes 78 ms to encode, whereas VTM intra takes 75 seconds...), and I also don't test at extreme compression. But I think the difference really appears for video coding. I don't test video sequences myself, but I have seen a few comparisons between HEVC and VVC at the same bitrate, and VVC was then really visually better in the video case.

Cheers,
Raphael

rwill
22nd August 2022, 06:05
MOM !

People are comparing shitty (reference) encoder implementations against each other again !!!

ksec
22nd August 2022, 14:32
It is progressing at a surprisingly fast rate. ECM 4 + EE1.2 managed ~30% BD-Rate with 4K compared to VTM 11 (the VVC reference encoder, although VTM 16 does outperform VTM 11 by about 10%, so the actual difference isn't as big).

We should have some news about ECM 6 soon. We are not far off from the 40-50% BD Rate compared to VVC.
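For anyone unfamiliar with the BD-Rate figures quoted above: BD-Rate is the average bitrate difference between two encoders at equal quality, computed from their rate-distortion curves. Below is a minimal sketch, assuming piecewise-linear interpolation of log-bitrate over PSNR rather than the cubic or PCHIP fit the usual tools apply. The RD points are made up for illustration and are not ECM or VTM results.

```python
import math

def _interp(x, xs, ys):
    """Piecewise-linear interpolation of y at x; xs must be increasing."""
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x outside curve range")

def bd_rate(anchor, test, steps=1000):
    """Approximate BD-Rate in percent (negative means the test codec saves bitrate).

    anchor, test: lists of (bitrate_kbps, psnr_db) points.
    Simplification (assumption): linear interpolation instead of a cubic fit.
    """
    a = sorted((q, math.log(r)) for r, q in anchor)
    t = sorted((q, math.log(r)) for r, q in test)
    aq, ar = zip(*a)
    tq, tr = zip(*t)
    lo, hi = max(aq[0], tq[0]), min(aq[-1], tq[-1])  # overlapping quality range
    total = 0.0
    for i in range(steps):  # midpoint rule over the overlap
        q = lo + (hi - lo) * (i + 0.5) / steps
        total += _interp(q, tq, tr) - _interp(q, aq, ar)
    return (math.exp(total / steps) - 1.0) * 100.0

# Hypothetical RD points: the test codec needs 30% less bitrate at every quality level.
anchor = [(1000, 34.0), (2000, 37.0), (4000, 40.0), (8000, 43.0)]
test = [(700, 34.0), (1400, 37.0), (2800, 40.0), (5600, 43.0)]
print(round(bd_rate(anchor, test), 1))  # about -30
```

A result of -30 here would be reported as "30% BD-Rate" in the sense used in this thread.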

SeeMoreDigital
22nd August 2022, 15:06
I know it's a bit early to ask, but with Leonardo Chiariglione having announced the "death of MPEG", I thought there is no harm in asking...
Hmmm...

Leonardo Chiariglione's announcement was made over two years ago (Sat 06 Jun 2020) over on the MPEG Home Page (https://www.mpeg.org/) when his chairmanship of the group came to an end after 32 years!

I suspect he was more than a little pissed off... Was he pushed out? Probably ;)

benwaggoner
24th August 2022, 18:45
AV2 is also in development, and certainly will be available before ECM. It's too early to say if it'll be as good as VVC for real-world content, let alone better.

ksec
25th August 2022, 14:42
I would go as far as to say ECM is actually ahead of AV2 in terms of development. At this rate we could have next gen ( H.267? ) spec ready by 2025.

It will be interesting because ECM is looking like the first video codec which would require a dedicated hardware decoder to work. We are looking at ~5x the decoding complexity compared to VVC.

FranceBB
29th August 2022, 14:26
But most importantly... where the heck is x266?! :(
Honestly, is there any plan from Multicoreware to release what they have done and make it open source once and for all? 'cause honestly, I've been doing all my tests with Fraunhofer's VVEnc which is somewhat more usable than the VTM reference encoder, but still, I'd like to have x266 to play with and start the integration with our systems... :(

P.s looks like they're gonna be at this year's IBC, so maybe I can ask them in person https://multicorewareinc.com/ibc2022/

What do you say, Ben? Shall we get there together on behalf of Doom9 xD?

EDIT: Ok, I've actually booked with them on Monday, September 12. I'll ask about x266 and I'll come back here with some more info. Hopefully it's gonna be good news. Stay tuned.


terrestrial bands are shrinking worldwide. That's why I am wondering if the 10-year cadence (a new format every 10 years or so) will be kept alive despite the recent restructuring of MPEG or if VVC / H.266 will be the end of the line.


Don't worry about terrestrial, I'm sure that in 10 years time there are still gonna be channels in MPEG-2 720x576 yv12 25i TFF BT601 alive xD

Blue_MiSfit
29th August 2022, 20:02
Let us know what MCW says - I haven't spoken to them in quite a long time (since before Pradeep left).

benwaggoner
30th August 2022, 18:31
What do you say, Ben? Shall we get there together on behalf of Doom9 xD?
Yeah. We should have a doom9 meetup even if there's a critical mass going.

kurkosdr
2nd September 2022, 21:29
But most importantly... where the heck is x266?! :(
Honestly, is there any plan from Multicoreware to release what they have done and make it open source once and for all? 'cause honestly, I've been doing all my tests with Fraunhofer's VVEnc which is somewhat more usable than the VTM reference encoder, but still, I'd like to have x266 to play with and start the integration with our systems... :(

P.s looks like they're gonna be at this year's IBC, so maybe I can ask them in person https://multicorewareinc.com/ibc2022/

What do you say, Ben? Shall we get there together on behalf of Doom9 xD?

EDIT: Ok, I've actually booked with them on Monday, September 12. I'll ask about x266 and I'll come back here with some more info. Hopefully it's gonna be good news. Stay tuned.

To be honest, with the development of open-source encoders for ISO standards having slowed down to a standstill (for example: no high-quality AAC-LC encoder, or HE-AAC encoder, or MPEG-H 3D Audio encoder), I am glad Multicoreware has stepped up to create an open-source VVC encoder, so at least that thing is covered. It's the same deal with EVC, we had to wait for Samsung to donate an open-source encoder (xeve) to have one. So, let's give MultiCoreware some time, they have delivered well enough with x265 to deserve our trust IMO.

FranceBB
2nd September 2022, 22:34
Absolutely, I totally trust them, but there are two things that worry me now and didn't worry me in 2013 in the x265 days:

- They don't have an account where they post regularly on Doom9 anymore (and they deleted the x265_Project account from which the devs used to post)
- Pradeep Ramachandran has left Multicoreware (June 2015 - May 2021) and he was the manager of the video-related development team who was responsible for x265 and I have no idea who picked up the task after him

which is why I'm going to meet with them and ask them how things are going.
Speaking of which, I'm gonna open a new topic on Monday as I'd like to collect all the questions that the community might wanna ask them and report those to them so it would be as if you all came to their booth at IBC with me. ;)

kurkosdr
3rd September 2022, 16:42
LC-EVC is a big step forward for a lot of use cases, especially when paired with next gen formats like VVC and AV1.

It's not a slam dunk for every use case though :)

Is it any good when layered over HEVC? HEVC has emerged as the de facto format for UHD 4K video in terrestrial and satellite, so anything that could be done to reduce the bitrate without breaking compatibility is always welcome. With this arrangement, the current HEVC decoders could potentially decode a lower-bitrate HEVC stream while any newer receivers could also decode the LC-EVC enhancement to get the equivalent quality of a higher-bitrate HEVC-only stream. Provided LC-EVC over HEVC is worthwhile, of course.

To be honest, I am a bit disappointed by the current state of UHD in broadcasting. We are talking about state-of-the-art DVB-T2 muxes being able to broadcast a grand total of 3 channels (https://www.digitalbitrate.com/dtv.php?mux=HEVC&liste=1&live=1&lang=en). With VVC it would increase to a grand total of 5 (meanwhile with H.264 FHD you can fit 6). That's why I think that UHD in broadcasting won't be much of a success unless some further improvement is made in compression.
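The channel counts above are just mux capacity divided by per-service bitrate, rounded down. Here is a rough sketch of that arithmetic. All bitrates in it are assumptions picked to reproduce the 3 / 5 / 6 figures (a 36 Mbit/s mux payload, 12 Mbit/s per UHD HEVC service, VVC needing 40% less, 6 Mbit/s per FHD H.264 service); they are not measurements from the linked page.

```python
def channels_per_mux(mux_kbps, service_kbps):
    """How many fixed-bitrate services fit in one multiplex (no stat-mux gain)."""
    return mux_kbps // service_kbps

MUX_KBPS = 36_000                        # assumed DVB-T2 mux payload
UHD_HEVC_KBPS = 12_000                   # assumed UHD HEVC service bitrate
UHD_VVC_KBPS = UHD_HEVC_KBPS * 60 // 100 # assumed 40% saving over HEVC
FHD_H264_KBPS = 6_000                    # assumed FHD H.264 service bitrate

print(channels_per_mux(MUX_KBPS, UHD_HEVC_KBPS))  # 3
print(channels_per_mux(MUX_KBPS, UHD_VVC_KBPS))   # 5
print(channels_per_mux(MUX_KBPS, FHD_H264_KBPS))  # 6
```

Under these assumptions a codec would need roughly half the HEVC bitrate before a UHD mux carries as many services as an FHD H.264 mux.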

hajj_3
3rd September 2022, 20:07
To be honest, I am a bit disappointed by the current state of UHD in broadcasting. We are talking about state-of-the-art DVB-T2 muxes being able to broadcast a grand total of 3 channels (https://www.digitalbitrate.com/dtv.php?mux=HEVC&liste=1&live=1&lang=en). With VVC it would increase to a grand total of 5 (meanwhile with H.264 FHD you can fit 6). That's why I think that UHD in broadcasting won't be much of a success unless some further improvement is made in compression.

iptv is becoming more popular so dvb-t2 will become less relevant. Plenty of bandwidth available with DVB-S2 and DVB-C.

FranceBB
3rd September 2022, 22:23
To be honest, I am a bit disappointed by the current state of UHD in broadcasting. We are talking about state-of-the-art DVB-T2 muxes being able to broadcast a grand total of 3 channels (https://www.digitalbitrate.com/dtv.php?mux=HEVC&liste=1&live=1&lang=en). With VVC it would increase to a grand total of 5 (meanwhile with H.264 FHD you can fit 6). That's why I think that UHD in broadcasting won't be much of a success unless some further improvement is made in compression.

That's true for DTT, but don't think Satellite is any better 'cause HotBird is overcrowded and bandwidth is expensive af. :(

iptv is becoming more popular so dvb-t2 will become less relevant. Plenty of bandwidth available with DVB-S2 and DVB-C.

Not really, it's becoming more popular in cities only.
If you think about rural communities, they can easily get a DTT feed or a Satellite feed, but they can't really watch the telly over the internet.
To make a quick example, during the 2nd heatwave in Italy I didn't wanna stay in Milan and sweat my balls off in a 104°F heatwave, so I went to a mountain village at 5000 feet which was far better with as little as 61°F. Thank God they had a Satellite dish and I could watch the game quite happily in UHD at 25 Mbit/s H.265 HEVC 4:2:0 HLG HDR 10bit planar, 'cause otherwise I would have had to miss it (yes, miss it). Why? Well, 'cause even though the hotel had Wi-Fi, their Wi-Fi was hooked up to an ADSL modem 'cause nothing else was available or ever got to that village. Same goes for mobile connection which was limited to 3G only. Until internet is gonna be everywhere at decent speeds and with modern technologies, IPTV will never be a thing. If you use satellite, however, you have pretty much the same quality everywhere and it doesn't matter where people live, they almost certainly will be able to install a dish and watch their favorite programs at high quality. Internet is fine if you have things like OTT where you wanna watch a movie or a TV Series and it can load all the time you want as buffering doesn't matter, but if you have a live event like a Football game, then latency matters and in that case a satellite feed will always be the only option for lots of people, especially for those living in rural communities. Speaking of internet-only based providers, this is mainly the reason why companies like Netflix and Amazon who offer on-demand movies and TV series are succeeding in Italy, but those who only offer Sports like DAZN are not (and never will without satellite).

p.s I also work for a company who airs via satellite, so I'm biased xD

ksec
4th September 2022, 06:42
ECM 4.0 Tagged, ECM 5.0 expected in October.

ksec
27th December 2022, 14:04
ECM 4.0 Tagged, ECM 5.0 expected in October.

Now they tested ECM 6.0 vs VTM 11

The results of the two tests reveal congruent results for the performance of the ECM and the VTM. Both, the laboratory test conducted with naïve viewers and the on-site test with experts demonstrate a clear visual benefit of the ECM when compared to the VTM for a significant number of cases. For the laboratory results, reported BD rate savings indicate a benefit of about 38% for the UHD test sequences and about 32-33% for the HD test sequences on the given test set.

The 38% average came from three tests showing a 45% reduction and two tests showing only 30%.

Pretty damn impressive if you ask me. This is excluding the EE2 work using Neural Network. And they still have plans for ECM 7.

hajj_3
5th March 2023, 14:59
ECM 8.0 is out: https://vcgit.hhi.fraunhofer.de/ecm/ECM/-/tree/ECM-8.0

ksec
5th March 2023, 18:32
ECM 8.0 is out: https://vcgit.hhi.fraunhofer.de/ecm/ECM/-/tree/ECM-8.0

Nice. We will have to wait for the next JVET meeting in April to know its results. It is progressing rather quickly.

DTL
21st March 2023, 13:18
What is really required to make a real step forward over classic H.264 is a significant or complete move to object-oriented encoding, for any type of scene (including any natural scene).

Currently, as a supplement to the 'old block-oriented' MPEGs, I am making a simple pre-processing engine (pmode=1 for MDegrainN in the mvtools2 software pack for the AviSynth environment) and it already shows a good benefit (skipped blocks in x264 at crf=18 go from about 60% to 90%, with about 40..50% less bitrate on a mostly static scene and about 5% less bitrate on a mixed-scenes movie). It still does not support 'forward' motion compensation for moving scenes, so the total benefit for mixed-scenes titles will be higher after the full idea is implemented. Example of the processing software at https://forum.doom9.org/showthread.php?p=1984527#post1984527 .

But it is still very poor amateur-level software design, and from the still not quite dead current-civilization industry of video codec design a more complete and professional solution is expected:
A video codec must extract the scene content (textures, objects, lighting, motion data) from the incoming frame sequence and create a standard-compatible bitstream, so that the end-user decoder can decode a 2D frame sequence to feed a standard end-user 2D pixel display.

Currently only a very small part of this process is implemented: MDegrainN pmode=1 searches for the best texture view of a small image patch over the current tr-scoped frame pool (+tr frames around the current one, including 'backward' motion compensation to perform dissimilarity metric analysis) and duplicates this texture in the output frame sequence fed to the MPEG coder. So it is 100%/full temporal denoising. The MPEG coder understands it as a scene texture element and does not encode it in each frame (skip-block); it only references it as an element from the scene textures dataset (some ref frame in the H.264 standard) and applies some simple transform (motion) data (zero-MV currently) to create each output time-sampled frame for display. So an FHD-sized frame of a nearly static scene with a speaking talent can be encoded into B-frames of about 500 bytes in the H.264 standard (a compression ratio of about 6000:1). So object-oriented MPEG codecs can be expected to easily reach >10000:1 compression ratios for natural movies too.
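As a quick check of the "about 6000:1" figure: it works out if the raw source is 8-bit 4:2:0, which is an assumption here because the pixel format is not stated above. A small sketch of the arithmetic:

```python
def raw_frame_bytes(width, height, bits_per_sample=8, chroma="420"):
    """Uncompressed frame size in bytes for a planar YUV frame."""
    samples_per_pixel = {"420": 1.5, "422": 2.0, "444": 3.0}[chroma]
    return int(width * height * samples_per_pixel * bits_per_sample / 8)

fhd_bytes = raw_frame_bytes(1920, 1080)  # assumed 8-bit 4:2:0 source
b_frame_bytes = 500                      # size quoted in the post
print(fhd_bytes)                         # 3110400
print(round(fhd_bytes / b_frame_bytes))  # 6221, i.e. roughly 6000:1
```

This is a per-frame ratio for the skipped B-frames only; the I-frames and reference frames that carry the actual texture are not included in it.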

So these are the features I want from an industry-designed next-gen motion pictures (MPxx) compression standard:
1. Understand and compensate for many possible transforms (not only 2D translate as in current MPEGs), but also
- rotate transform (3 axes)
- skew transform
- scale transform
- lighting/shading transform
- maybe other possible basic geometric transforms of 2D projections of textured+lit 3D objects

2. Have a more advanced scene analysis engine to recognize physically unchanged scene textures (only damaged by photon-shot noise and/or changed by the list of transforms supported by the current MPEG generation). So the encoded data consists of a scene textures dataset and transforms for each of the scene elements (areas). Also, 'motion compensation' should be renamed to Transforms Compensation, meaning that not only the 2D translate transform can be compensated but a much more advanced list of possible geometric and lighting transforms.
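To make the geometric part of the wish list concrete, here is a minimal sketch of a 2D affine block transform that combines skew, scale, in-plane rotation and translation in one 2x3 matrix. The function names and the order of composition are assumptions made for illustration. Rotation around the other two axes and the lighting transform are outside this 2D sketch.

```python
import math

def affine(scale=1.0, angle_deg=0.0, skew_x=0.0, tx=0.0, ty=0.0):
    """Build a 2x3 matrix applying skew, then scale, then rotation, then translation.

    Maps (x, y) to (a*x + b*y + tx, d*x + e*y + ty).
    """
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    # Product R * S * K with R = rotation, S = uniform scale, K = horizontal skew.
    a = c * scale
    b = c * scale * skew_x - s * scale
    d = s * scale
    e = s * scale * skew_x + c * scale
    return [[a, b, tx], [d, e, ty]]

def apply_affine(m, point):
    """Apply the 2x3 matrix m to a single (x, y) point."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])

# A classic motion vector is the special case with only tx/ty set.
mv_only = affine(tx=3.0, ty=-2.0)
print(apply_affine(mv_only, (1.0, 1.0)))  # (4.0, -1.0)

# Corners of an 8x8 block after a 10 degree rotation and 5% zoom.
zoom_rot = affine(scale=1.05, angle_deg=10.0)
corners = [(0, 0), (8, 0), (0, 8), (8, 8)]
print([apply_affine(zoom_rot, p) for p in corners])
```

Translate-only compensation needs two numbers per block; this form needs six, which is the extra side information an encoder would have to pay for in exchange for a smaller residual.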

Really it is also a part of the future denoise engine of mvtools2 development. So MPEG compression and temporal denoising still share a large part of their processing. So in some future 'perfect world' I expect a final merging of temporal denoise + MPEG encoder into a single engine, so users no longer need to apply temporal-denoise pre-processing before the final MPEG encoder.

Are there any existing solutions moving in this direction?

" announced the "death of MPEG""

MPEG is simply the Moving Picture Experts Group. It will die at about the end of the current civilization (which really may be soon enough). But after that, the residual creatures may not be very busy with motion pictures processing. For the next long 'dark ages'.

"I am a bit disappointed by the current state of UHD in broadcasting. We are talking about state-of-the-art DVB-T2 muxes being able to broadcast a grand total of 3 channels. With VVC it would increase to a grand total of 5 (meanwhile with H.264 FHD you can fit 6). That's why I think that UHD in broadcasting won't be much of a success unless some further improvement is made in compression."

The tech solution is very simple - do not broadcast an unlimited number of junk channels. Do not count the number of channels as an advantage. Simply put 1 UHD channel per 8 MHz of physical bandwidth. But well designed.

"iptv is becoming more popular so dvb-t2 will become less relevant."

Some creatures really live simply on the surface of the planet and not in the cities. And in the current quickly dying civilization, our wireless 3G/4G providers already disabled unlimited-traffic internet tariffs a few months ago, and last week my pre-paid 3G internet almost died and support tried to fix it for several days without any success. So mostly only poor text-based internet is left. And no wire or optical provider can make a wired/optical connection for poor creatures living far from the city, not in multi-room multi-level buildings but in small self-built homes. So DVB-T or DVB-S is really the only way with high enough bandwidth to get some movie content in such places.

kurkosdr
24th March 2023, 16:23
... from still not very died current civilization industry of video codecs design ... will dead about at the end of current civilization (really may be soon enough) ... And in the current quickly dying civilization ...
Dude, can you stop doing this? Doom9 is a technical forum, not a place to live your eschatological fantasies. If you want to do this, there are websites run by cranks who believe in this kind of stuff. They will even help you buy gold and prepper supplies from the affiliate links.

DTL
24th March 2023, 19:43
We as visual tech industry engineers see the internals of the processes over an interval of about half a century. Though still trying to make this visual industry a bit better. But it is good to align the efforts with some projections for the near enough future (about 1/3..1/4 of a century or less).

As for object-oriented encoding, I remember reading some promises at the time of H.264 development, maybe 10..15 years ago. But it looks like all the hype about object-oriented video compression is now dead? I see close to nothing moving in this really beneficial direction. Maybe it was found that a universal codec cannot be built well for both broadcasting (which requires a very low channel switching time before the decoder can start to output the decoded frame sequence) and other ways of content delivery? I understand that the torrent-style, full-file form of content delivery via IP network before playback starts is somewhere at the lowest priority for the main industry video codec developers. Though it may be the main way of content compression and delivery in some communities, and they are mostly interested in high-quality video compression at the smallest file size per title (low network transfer cost, low storage cost, high number of happy seeds, long life of a release in the network and so on). So the requirement on startup time before the decoder can collect all required scene data may be dropped in file-based content delivery scenarios with fast enough random-access playback storage (like HDD or, better, flash).

So maybe future video codecs will be divided into broadcast- and streaming-friendly codecs, with lower quality and/or a higher required average bitrate, and title-per-file codecs for backing up private CD/DVD/BD or online-purchased files into the smallest file size. There the 'local cutscene bitrate' may be as high as required for selected cutscenes, and may be much higher than the strict requirements of broadcast-friendly video codecs allow.

benwaggoner
24th March 2023, 22:16
Object-oriented compression either requires the content be authored as objects (ala Atmos and MPEG-H) or a good way to extract objects from existing baseband media. One can imagine a codec which is basically a stream of GPU instructions to draw the video. However, video is full of things that look like objects but aren't, or don't act like objects, or turn into different kinds of objects. And they all exist under different lighting, and get color timed to get a look that may not have been possible to do in real life. It's not hard to come up with a promising demo like this. RealMedia blew our minds when they remade a South Park episode as animated vector graphics with a soundtrack. Standard definition that could stream in real time over a 56K modem!

Demos of actual film and video content getting automatically converted into objects and then reconstructed accurately? That I've never seen. Maybe possible with some crazy machine learning system, to some degree, but reconstruction to something that feels shot on 35mm film at 24p with a 180 degree shutter is a couple orders of magnitude beyond what we know how to do today. Getting something understandable, sure. But something that feels like the original seems very challenging.

Of course, codecs have been getting all kinds of features that make encoding objects in the source a lot more efficient. All the different CU sizes (including rectangular) and asymmetric motion partitions let us pretty accurately make an object get motion prediction distinct from the background. Intra-frame prediction lets us reuse repeated visual elements (great with text). But since it is fundamentally pixels in and out, stuff that can't be parsed as objects comes through just fine.

I also get nervous about too much ML compression. Bitstream corruption could go from annoying green blocks appearing to different dialog being generated and characters wearing different clothing.

birdie
25th March 2023, 08:53
I also get nervous about too much ML compression. Bitstream corruption could go from annoying green blocks appearing to different dialog being generated and characters wearing different clothing.

Or saying the wrong words since ML is now applied to audio as well. :D

DTL
25th March 2023, 15:49
"Object-oriented compression either requires the content be authored as objects (ala Atmos and MPEG-H) or a good way to extract objects from existing baseband media."

It should already be a 'very simple' task for today's 'neural networks' and other 'machine learning'. They can even create images from a text description of just several bytes.

As I see in practice, the total imaging system path to the end user may be going in a slightly wrong direction when engineers evaluate a video compression codec with metrics like 'as clear as the source' (the typical PSNR/SSIM/VIF and similar metrics). The final goal of motion picture content consumption is really the feeling in the head. So it may not only be unnecessary to losslessly encode the noise part of the scene image; some significant part of the 'real' scene may also add little to the total feeling of the movie. It is a really challenging task for 'very cool future codecs': analyse the total movie content (standard runtime of about 1.5 hours) and create a low-byte encoded description, so that at the player side the 'decoder' can reproduce a frame sequence very close to this movie (maybe in a different resolution and so on), maybe not using the very same objects and textures, but giving a feeling in the consumer's head close to watching the 'initial source' movie. The PSNR between the source frame sequence and the decoder output, as tested by 'equality metrics', may be very low. But customer experience and satisfaction will be close to perfect. Depending on the movie description format it may give 'much more than 10000:1 compression ratio' and also render any resolution required by the end user.

"One can imagine a codec which is basically a stream of GPU instructions to draw the video."

Currently MPEGs and temporal denoisers work close to 'simple 2D rendering', operating on small patches of each frame called 'blocks', and either assume the block is the same (100% motion compensation in MPEGs and 'skip' block status) or encode the 'residual error' that cannot be compensated (really described as a sequence of small commands to a renderer) with the transforms supported by the current system (or with no real geometric transform at all).

"And they all exist under different lighting,"

Lighting is also from the 'transforms' family and very easy to describe in a small byte sequence. For example, when some part of the scene is shaded by a moving object, only the 'lighting' transform data of the shaded blocks changes if the other geometry and light sources are static. The task for 'temporal denoisers' is the same: detect and 'back-compensate' as many transforms as possible to 'distill' as clean an original texture view as possible. After this the 'denoised' frames may be restored from a single clean texture database source + a number of transforms applied in each output frame (translation, scaling, rotation, lighting and so on). The only difference between a temporal denoiser and a video codec: the temporal denoiser uses all this data internally to 'regenerate the clean frame view' and outputs the same RAW frames as its input, while a video codec can output the clean textures + transform data in a 'smaller byte-size' form and pass them to the 'decoder' to restore the full-size image frame sequence for display. This is the same as we have now with MPEGs using I-frames and ref-frames (?) as a short-time texture database and MV data to describe translate-only compensation of textures when rendering each output frame. All other transforms not supported by current MPEGs are treated as 'unsupported' and encoded as residual error that is impossible to compensate, which increases the output bitrate and file size.
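One simple way to model such a 'lighting' transform is a per-block gain and offset, similar in spirit to the weight/offset pair used by weighted prediction in H.264 and HEVC. The sketch below fits both by least squares and shows the residual energy before and after compensation. The function names and sample values are made up for illustration and this is not how mvtools works today.

```python
def fit_gain_offset(ref, cur):
    """Least-squares gain g and offset o such that cur is close to g * ref + o."""
    n = len(ref)
    mean_r = sum(ref) / n
    mean_c = sum(cur) / n
    var_r = sum((r - mean_r) ** 2 for r in ref)
    if var_r == 0:  # flat reference block: only an offset can be estimated
        return 1.0, mean_c - mean_r
    cov = sum((r - mean_r) * (c - mean_c) for r, c in zip(ref, cur))
    g = cov / var_r
    return g, mean_c - g * mean_r

def residual_energy(ref, cur, g=1.0, o=0.0):
    """Sum of squared errors left after predicting cur from ref with gain/offset."""
    return sum((c - (g * r + o)) ** 2 for r, c in zip(ref, cur))

ref_block = [100, 120, 140, 160]  # texture samples in the reference frame
cur_block = [60, 70, 80, 90]      # same texture, now in shadow (half gain plus 10)

g, o = fit_gain_offset(ref_block, cur_block)
print(g, o)                                        # 0.5 10.0
print(residual_energy(ref_block, cur_block))       # 12600.0 without compensation
print(residual_energy(ref_block, cur_block, g, o)) # 0.0 with compensation
```

Two extra numbers per block remove the whole residual in this idealized case; with real noise and non-uniform shading the residual shrinks rather than vanishes.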

I think I will add support for 'lighting' transform analysis and 'compensation' to the mvtools software in the not very far future - it is easy enough. The 'rotate' and 'scale' transforms may also be useful in many real cutscenes. Then all 'base' SRT (Scale Rotate Translate) transforms will be covered, plus lighting. Currently, like the old enough MPEGs, it supports only 'translate' transform analysis and compensation. But it can reuse Translate analysis data ('standard block MVs') from a hardware MPEG encoder ASIC if one is present in the system and exposes the required API via drivers. So if some new version of a hardware MPEG encoder ASIC provides analysis data for more transforms (scale, rotate and others), it can also be reused easily enough for temporal degraining. But currently the Microsoft hardware accelerator API for motion picture analysis only supports the Translate transform.

Yes - making the motion search engine analyse the 'rotate' transform in 2 or, better, all 3 axes will bring a significant performance penalty compared with the current translate-only search engine in mvtools, so offloading this simple but compute-heavy work to some ASIC is a very 'nice to have' feature. Also, if this engine is reused in the MPEG coder, it may save significant investment for both the denoise and compression tasks.

"Of course, codecs have been getting all kinds of features that make encoding objects in the source a lot more efficient. All the different CU sizes (including rectangular) and asymmetric motion partitions let us pretty accurately make an object get motion prediction distinct from the background."

If all MPEGs from MPEG-1 to maybe H.266 support analysis and compensation of the translate transform only, it looks like the real engineering progress in video codec design is not very big.

"But since it is fundamentally pixels in and out, stuff that can't be parsed as objects comes through just fine."

There is already a sort of partial object orientation in maybe any MPEG - block-based compression. A block is a small object (it can be treated as a small texture) and it may or may not be supplemented with transform data to try to reach better compression over the frame sequence.
So the encoder can have 2 datapaths:
1. Standard compression of the block as a sample array without Transform Compensation (simple M-JPEG)
2. An attempt to analyse the full list of supported transforms and create an encoding using Transforms Compensation

Next, compare the compressed dataset size from path 1 and path 2 and select the smaller one if the difference between the decoded image and the input source is below a threshold (the quality-level encode parameter).

So only if a block cannot reach better compressibility after the attempt at transform-compensated compression is it encoded as an unknown transform or as an impossible-to-compensate 'standard JPEG'. Also, if compensating some transforms shows a benefit, the intermediate path is checked: whether partial transform compensation + JPEG-like compression of the residual error gives a smaller compressed data size. All other blocks are encoded with a lower bit count if Transform Compensation shows close to ideal compensation and a residual error close to zero (or below the quality threshold), and the total size per timeslice or per file is reduced. An example of blocks ideally compensated by TC is simply static blocks.
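The two-path decision described above can be sketched in a few lines. This is a toy model under loud assumptions: zlib stands in for the real entropy coder, both paths are lossless so the quality threshold is always met, and the side-information cost of two bytes is invented. None of the names come from a real encoder.

```python
import zlib

def intra_cost(block):
    """Path 1: compress the block samples on their own (the 'M-JPEG' path)."""
    return len(zlib.compress(bytes(block)))

def compensated_cost(block, prediction, side_info_bytes=2):
    """Path 2: compress only the residual against a compensated prediction.

    side_info_bytes models the transform data (e.g. a motion vector); assumed value.
    """
    residual = bytes((c - p) % 256 for c, p in zip(block, prediction))
    return len(zlib.compress(residual)) + side_info_bytes

def choose_mode(block, prediction):
    """Pick whichever path produces fewer bytes."""
    c1 = intra_cost(block)
    c2 = compensated_cost(block, prediction)
    return ("intra", c1) if c1 <= c2 else ("compensated", c2)

# Deterministic, hard-to-compress 16x16 'texture' (256 samples in 0..255).
texture = [(i * 37 + (i * i) % 11) % 256 for i in range(256)]
flat = [128] * 256

# Static block: the prediction matches exactly, so the residual is all zeros.
print(choose_mode(texture, texture)[0])  # compensated
# Flat block with an unrelated prediction: coding it directly is cheaper.
print(choose_mode(flat, texture)[0])     # intra
```

A real encoder would weigh distortion against rate as well (rate-distortion optimization) rather than compare byte counts alone, and would add the intermediate partial-compensation path as a third candidate.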

So for the typical cutscene: talking talent at fixed background - all blocks of fixed background are encoded as 100% TC with 'skipped' blocks.
The facial animation of talking talent is created from skin/face texture + transforms dataset (scale, rotate, translate, skew,...) . So also can be better encoded using simple static face single texture database + set of transforms data for each frame.

Also, a hierarchy of database data may be defined for codecs:
1. Group-of-frames texture database
2. Cutscene (group of groups of frames) texture database.
3. Group-of-cutscenes texture database.
4. Total-movie texture database.

The higher the level of texture database used, the more data can be skipped at the lower levels when compressing the whole movie to a file. For example, a single talent is typically present in many cutscenes of the movie, so a single texture database may be used for all of them. The texture database may act as a sort of 'macro-used ref frames' for an MPEG encoder with an 'unlimited' number of ref frames (computed after a first analysis pass over the full movie in multi-pass encoding mode). And the decoder can have fast random access to the total-movie texture (ref frame) database when decoding any output frame.
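
A rough sketch of such a multi-level lookup, just to make the idea concrete. The level names, the scope ids and the 1000-byte 'face' texture are all hypothetical, and no existing codec is being described here.

```python
from typing import Dict, Optional, Tuple

# (level, scope id) -> {texture id -> texture bytes}
Database = Dict[Tuple[str, int], Dict[str, bytes]]

# Search order for the decoder: most local level first.
LEVELS = ("frames", "cutscene", "cutscene_group", "movie")


def lookup(db: Database, scopes: Dict[str, int],
           texture_id: str) -> Optional[bytes]:
    """Find a texture for the current frame, walking up the hierarchy.

    `scopes` maps each level name to the id of the scope the current
    frame belongs to.
    """
    for level in LEVELS:
        textures = db.get((level, scopes[level]), {})
        if texture_id in textures:
            return textures[texture_id]
    return None


def stored_bytes(db: Database) -> int:
    """Total size of all textures stored at all levels."""
    return sum(len(data) for textures in db.values()
               for data in textures.values())


face = b"x" * 1000  # stand-in for one actor's face texture

# The same face stored separately in three cutscenes...
per_cutscene: Database = {("cutscene", i): {"face": face} for i in range(3)}
# ...versus stored once at movie level.
movie_level: Database = {("movie", 0): {"face": face}}

scopes = {"frames": 7, "cutscene": 2, "cutscene_group": 0, "movie": 0}
same = lookup(per_cutscene, scopes, "face") == lookup(movie_level, scopes, "face")
print(same, stored_bytes(per_cutscene), stored_bytes(movie_level))  # True 3000 1000
```

The decoder gets the same texture either way; only the amount of stored data changes.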

benwaggoner
27th March 2023, 16:34
Yes, ML can generate images from a short text description, which is amazing. But it can't take an image, describe it in some high-level pseudo-semantic way, and then regenerate the generally same image it started with. Nor is it feasible to have a sequence of images, and have ML code the slight differences between consecutive frames in order to reconstruct the original frames.

Maybe for something simple and of consistent style. For example, a Simpsons episode. There's a huge number of episodes to train on, with a somewhat limited number of art styles, characters and locations frequently on the screen. It's all fundamentally vector animation, which can be expressed quite concisely. I suspect having a cluster of ML models that could generate a movie from scratch would be a lot easier than one that can take existing content, compress it to a much lower bitrate than feasible today, and then reconstruct the same image. When a ML can make a movie, it'll know to only try to make the kind of movies it knows how to make. An input video sequence can be almost anything.

With a lot of handwaving, $1B, and five years I can imagine coming up with some sort of efficient ML system that can make a generic Simpsons episode. But that would still be a generic episode. If they did a modern 3D episode with ray tracing, the ML would have no way to express that accurately. There would still need to be a fallback to more traditional pixels-to-pixels encoding for novel stuff the ML wasn't trained on.

And a technology that can just take moving images and turn them into animated vectors in a pleasing way just doesn't exist, and that's the easier first stage of what you're talking about, ignoring reconstruction.

ML can be very powerful for predicting a right result for cases that land within the range of training data it was given. But trying to go out of that range can result in bizarre results and even "hallucinations" where it provides something plausible but wrong. Training a ML to handle any sort of video that's been made or might be made is an impossible task. Traditional codecs have their downsides, but they are robust. It's nigh impossible to find content that just breaks a codec anymore; at reasonable bitrates, the content is always recognizable as itself, even if degraded.

In a classic information theory sense, we can think of the ML model itself as the codebook used to reconstruct the compressed signal into something similar enough to the original. The more variety and specificity needed, the more training is required and the bigger the ML model that needs to be downloaded and stored.

ML is able to interpolate missing data in a more analog-like fashion, providing a verisimilitude without the artifacts of traditional encoding. But that verisimilitude means that bitstream or other errors can result in something wrong-but-plausible instead of an obvious artifact. Not good if a different actor's face is used, or text on signage says something else. It's a feature that traditional codecs will lose detail, but not make it up in ways that would change the narrative.

I think the MPAI approach, with something like a traditional codec at the core, enhanced by ML where ML can usefully enhance it, makes a lot more sense. ML would be great at rendering a pebbled street, since the location of the individual pebbles doesn't matter. But classic motion compensation would be needed to make sure the pebbles don't randomize when the camera pans. ML has also shown promising results for better deblocking/deringing without detail loss.

DTL
27th March 2023, 20:01
" If they did a modern 3D episode with ray tracing, the ML would have no way to express that accurately."

The 'accuracy measurement' is really the quality setting of the compression. If the user wants a smaller compressed file, he accepts a lower accuracy metric. This is very close to today's video codecs with PSNR and other accuracy metrics of how the decoded result differs from the input. And the final goal of a 'highly abstracted' video compression method is to create a more or less accurate (similar) feeling in the mind compared with watching the original uncompressed content. So it has the right to step significantly away from the original frames' sample values, object positions and other 'look'.

"An input video sequence can be almost anything."

It may not be completely correct. A human-oriented video codec may act much like the human visual system: it creates some (very simple) 3D scene model from 1 or 2 2D projections, and it is also learnable (as a small baby starts to watch the surrounding world, it learns how to reconstruct a limited world model in the mind from the limited-size 2D projections passed from the eyes to the brain). If a mature human trained on this planet is placed in a significantly different environment, he may lose the ability to orient in space using the eyes only and need to re-learn it.
So typical movie scenes are limited in their possible 'sample array values' (they are not pure white random noise) and describe a not infinitely big number of 'typical 3D' or even 'simple 2D' scenes. It is really a sort of shame for the current civilization, but it was found that the number of plots for movies is not only not infinitely big but also _very_ limited. So it is now very hard for content writers to write something significantly new and good for the general public. It is a bit like quickly exhausting the valid crypto values out of a close to infinitely big count of integer numbers.

"a technology that can just take moving images and turn them into animated vectors in a pleasing way just doesn't exist,"

In a simple form, this is how almost all current 'video compression codecs' work: turn the input into an array of textures (blocks) that can be transform-compensated and is smaller than the input frame sequence, and try to encode only the residual error after transform compensation. The transform dataset (currently only motion vectors for translate-transform compensation) is expected to be much smaller than the block sequence in the input frames. Sadly, most MPEG codecs from the beginning support only translate-transform analysis and compensation. New codecs are expected to support more transforms.

And yes - if we increase the number of supported transforms, the data size of the transform set will be somewhat bigger. So it is the encoder's task to select the best way for each block: either encode it as MJPEG only (no temporal compression), or apply transform compensation with 1 or more supported transforms and encode the residual error, then pick the smallest data output that keeps the selected quality level for the residual decoding error.
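
For the translate-only case, the analysis step can be sketched as an exhaustive block-matching search. This is a toy illustration in plain Python, not code from any real encoder; the 8x8 test frames and the SAD (sum of absolute differences) cost are assumptions made for the example.

```python
from typing import List, Tuple

Frame = List[List[int]]  # rows of luma samples


def sad(ref: Frame, cur: Frame, rx: int, ry: int,
        cx: int, cy: int, n: int) -> int:
    """Sum of absolute differences between two n x n blocks."""
    return sum(
        abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
        for j in range(n)
        for i in range(n)
    )


def best_motion_vector(ref: Frame, cur: Frame, cx: int, cy: int,
                       n: int, search: int) -> Tuple[Tuple[int, int], int]:
    """Exhaustive search for the translation that best predicts a block.

    Returns ((dx, dy), residual_sad), where (dx, dy) points from the
    current block position to the matching block in the reference frame.
    """
    height, width = len(ref), len(ref[0])
    best = ((0, 0), sad(ref, cur, cx, cy, cx, cy, n))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx <= width - n and 0 <= ry <= height - n:
                cost = sad(ref, cur, rx, ry, cx, cy, n)
                if cost < best[1]:
                    best = ((dx, dy), cost)
    return best


# Reference frame with a non-repeating pattern; the current frame is the
# same picture shifted 2 samples to the right.
ref = [[x * x + 10 * y for x in range(8)] for y in range(8)]
cur = [[ref[y][x - 2] if x >= 2 else 0 for x in range(8)] for y in range(8)]

mv, residual = best_motion_vector(ref, cur, cx=4, cy=2, n=2, search=2)
print(mv, residual)  # (-2, 0) 0
```

A residual of zero means the block can be sent as a motion vector only. Supporting rotate, scale or skew would mean searching over more parameters per block, which is where the extra transform data and encoder cost come from.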

stax76
28th March 2023, 00:34
A related news article that recently appeared in my news feed:

Apple acquired a startup using AI to compress videos. (https://techcrunch.com/2023/03/27/apple-acquired-a-startup-using-ai-to-compress-videos/?guccounter=1&guce_referrer=aHR0cHM6Ly9hcHAucmFpbmRyb3AuaW8v&guce_referrer_sig=AQAAAH2e4_hBg74yy7PXsXnfnGU64uhOeO_ce4a-Rhgl18wBUtwIZmTr481PXrlLTvn1qVRyd_uM0E4WA28FqZibr2CILSnfpntkBJ4x8h5eXYE-borEy0OppE4u_aohk3fWOMuBFcQQ2L9iKg3xBUrhzteoe47mbHxuwUJysn0TkRpJ)

benwaggoner
28th March 2023, 02:39
So it has the right to step significantly away from the original frames' sample values, object positions and other 'look'.
Oh my, we must work in very different markets. In professional content, a slight change in hue or contrast can mean a week of my time figuring out what went wrong and getting a fix implemented. Very senior executives start emailing me when a creative complains that the audio dynamic range is off. We can't change color, dynamic range, cropping, or anything else that would modify creative intent.

Changing the position of objects and the look of the title? Contracts would be updated to bar the use of compression technologies that could result in that.

Full stop for security cameras, news gathering, or anything else where accuracy is more important than verisimilitude. Courts were originally iffy on using MPEG-2 video in evidence because "it could be changed from what happened."

benwaggoner
28th March 2023, 02:45
A related news article that recently appeared in my news feed:

Apple acquired a startup using AI to compress videos. (https://techcrunch.com/2023/03/27/apple-acquired-a-startup-using-ai-to-compress-videos/?guccounter=1&guce_referrer=aHR0cHM6Ly9hcHAucmFpbmRyb3AuaW8v&guce_referrer_sig=AQAAAH2e4_hBg74yy7PXsXnfnGU64uhOeO_ce4a-Rhgl18wBUtwIZmTr481PXrlLTvn1qVRyd_uM0E4WA28FqZibr2CILSnfpntkBJ4x8h5eXYE-borEy0OppE4u_aohk3fWOMuBFcQQ2L9iKg3xBUrhzteoe47mbHxuwUJysn0TkRpJ)
That's basically ML driven preprocessing, softening less important detail to make it easier to encode. It's not clear if they're doing it baseband only, or with integration in the encoder itself; noise-reduction enhancement like that works better if it is quantization-aware.

This is a pretty powerful class of techniques to improve quality at low bitrates while maintaining compatibility with existing decoders. The new experimental --mctf filter in x265 is another stab at the same concept, but probably less sophisticated.

ML integration into the decoder side could help restore some of the missing detail in those "less important" regions, as the feel of the texture is more important than the specific pattern of the foliage. That's a kind of loss that creatives aren't likely to notice or care about.

DTL
28th March 2023, 18:02
"Oh my, we must work in very different markets. In professional content, a slight change in hue or contrast can mean a week of my time figuring out what went wrong and getting a fix implemented. Very senior executives start emailing me when a creative complains that the audio dynamic range is off. We can't change color, dynamic range, cropping, or anything else that would modify creative intent."

Yes - the quality requirements of the 'Hi-Fi' and 'Hi-End' markets are much higher than those of the 'general public / mass market'. So it may also not be good to attempt to use a 'single MPEG version per decade' across such different quality groups. For general public usage and the mass market, much more 'scene abstractive' video codecs may be used. In the same way, MP3 at 128 kbit/s is much more interesting to the general public than HD-Audio with lossless 192 kHz/24-bit. But both market applications exist.

Also, a progressive video codec can be 'receiver-optimized' and use the simple fact of the 'human vision receive bitrate' - it is really much lower than Gbit/s, so even uncompressed FHD over SDI at 1+ Gbit/s is really extra redundant. So if the codec is highly adaptive to the end user's visual system, it can also have a significant benefit while keeping a 'good quality feeling' in the customer's mind, and the customer can happily pay for the service. Though the 'internal tech/engineering measured quality of service' will be far from the 'Hi-Fi' and 'Hi-End' markets.

I work in general public state broadcasting, and I see how awful 2 Mbit/s H.264 SD (realtime MPEG compression with a small GOP size) looks on air compared with the studio's uncompressed FHD SDI source, but the general public typically makes zero complaints about quality (over years and decades). So it has worked very well for years and decades.
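
To put rough numbers on that gap, here is a small back-of-envelope calculation. It assumes 1920x1080 at 25 frames per second with 4:2:2 10-bit sampling (20 bits per pixel on average) and counts the active picture only, so blanking and audio on the SDI link are ignored; these parameters are assumptions, not figures from the post.

```python
def active_video_bitrate(width: int, height: int,
                         frames_per_second: int, bits_per_pixel: int) -> int:
    """Bits per second needed for the active picture, uncompressed."""
    return width * height * frames_per_second * bits_per_pixel


# 4:2:2 10-bit: 10 bits of luma plus, on average, 10 bits of chroma per pixel.
fhd_bps = active_video_bitrate(1920, 1080, 25, 20)
broadcast_bps = 2_000_000  # the 2 Mbit/s SD service mentioned above

print(fhd_bps)                  # 1036800000
print(fhd_bps / broadcast_bps)  # 518.4
```

So the on-air stream carries roughly 1/500 of the studio data rate. Part of that reduction comes from the downscale to SD rather than from the codec itself.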

"Changing the position of objects and the look of the title? Contracts would be updated to bar the use of compression technologies that could result in that."

There is a real secret: if the end user does not see the ground-truth source, he may not notice any difference if the total 'video compression system' is 'smart enough'. In most broadcasting tasks end users never see the real source footage. The typical marketing goal is some final feeling in the mind/head of the customer, and the customer generally pays for that. It is not about exact sample/pixel values and very precise locations of scene elements.

As I now work on a sort of artistic image creation, I see how _most_ real imaging, and of course broadcast imaging, typically has _very_poor_ artistic design (close to zero), and even some significant changes cannot further ruin an artistic value that is typically absent anyway.

It is not about specially designed artistic titles with several years of production, careful and costly per-scene and per-frame fine-tuning, and a cost of about $300 000 000 per 1.5 hours (5400 seconds) of runtime. It is about the 99+% of typical content for video compression.
If we could create most broadcast content at a price of about $50000 per second of runtime (about $500000 per scenecut), then it might require better video compression.

benwaggoner
28th March 2023, 20:48
The name of the game in my business is "preservation of creative intent." We try to deliver the experience the creators made and intended to be seen as accurately as possible. The ultimate goal is something that would pass an A/B test with the mezzanine (Rings of Power hits that goal). It is accepted that the representation of creative intent will fall short on small devices and at low bandwidth. But we never, never, never want to do anything that would change the creative intent to something else. Loss of detail is vastly more acceptable than the introduction of false detail.

Taking your MP3 example, a ML-enabled compression technology would be able to reproduce the same sound in fewer bits by offering better ways to reconstruct a complex signal from a compact compressed representation. But the object-based AI you're talking about would be more like taking a .WAV file and then converting it to MIDI. Sure, that could work well for some content, and deliver extremely low bitrates. But if there are novel instruments or effects in there, or lyrics, a MIDI version would express a radically different creative intent.

Of course there will be convergence and gray areas over time. But in general, the more a compression technology bends towards synthesis versus reproduction, the bigger risks to maintaining creative intent there will be. And there will always need to be fallbacks to classic compression approaches for elements that the ML doesn't know how to synthesize accurately.

benwaggoner
28th March 2023, 21:17
One can think of the difference between what can be synthesized and what can't as the residual.

DTL
28th March 2023, 22:21
" But the object-based AI you're talking about would be more like taking a .WAV file and then converting it to MIDI. Sure, that could work well for some content, and deliver extremely low bitrates. But if there are novel instruments or effects in there, or lyrics, a MIDI version would express a radically different creative intent."

Yes - it is very close to the sound model: take RAW WAV at the AI coder's input and it will output a MIDI-like dataset for distribution. Though, depending on the file size requested by the customer (and the resulting quality), the instrument set used may be the 'standard library', or a _partial_ description of new instruments may be included in the compressed dataset. And the quality of compression (adjustable by the user) is the amount of 'real instruments / textures' used - more or less. We have this now with current MPEGs: the lower the requested bitrate, the fewer texture details and other scene elements the end user gets to watch.

"And there will always need to be fallbacks to classic compression approaches for elements that the ML doesn't know how to synthesize accurately."

Yes - the initial implementations of 'highly abstractive' video compressors may be very poor. But they are expected to improve in quality through 'learning'. As was mentioned, the current civilization has already about exhausted all possible movie plots and accumulated a good library of 'classic movies' over half a century in colour. So it is now simply a task for the creators of machine-learning compression to scan over this classic-movies database and decrease the mean residual error of compression, using as many existing movies in the library as possible. Also, a 'typical basic database of textures' for the decoder (downloadable once) may be created. It is expected to be not very large.

It is the same as the MIDI standard instrument library, which looks like it is updated once at OS install or with the drivers for the playback card or software. And yes - the quality of MIDI playback depends on the quality (cost) of the player hardware. So it is a simple marketing parameter: the more the user pays, the more quality he gets at movie (content) playback.

It is also easily scalable in quality over a range of playback devices, from poor smartphones with low power and cost to premium home-cinema playback devices with high compute power. It is also very marketing-friendly.

DTL
29th March 2023, 09:33
"Courts were originally were iffy on using MPEG-2 video in evidence because "it could be changed from what happened.""

These are simply very different application cases. Because the human visual system is _very_ limited in the data rate it can 'understand', the following 2 cases of getting visual information from a frame are possible:

1. If a human has a very long time to analyse each frame and each frame area, he can get most of the visual information present in the frame through a long sequential scan over different areas (in location and size). This is the typical use case of Large Format photo viewing: the user takes hours viewing the total frame and its different areas to consume all the data from a 16K or much larger resolution frame, if the frame is designed by a good artist and has many 'levels' to view.

2. If the frame sequence is presented with external clocking at a very high frame rate (like 25 frames per second or even more), the human only has a chance of capturing some common feeling of the total cutscene and maybe some very limited areas of the scene in more detail. General public broadcasting and many other use cases of moving picture systems fall under this case. So a wisely designed video compression system can use this feature of the end user to decrease data rate and/or file size.

The limitation of case 2 can also harm the user in real-life, mission-critical viewing tasks like driving a car at high speed: if the human visual system, overloaded with a high input data rate, skips some really important part of the input scene, a serious incident may happen, even though the view was clear and all scene content was presented to the eyes in clear, non-distorted form.

FranceBB
6th April 2023, 17:11
The name of the game in my business is "preservation of creative intent." We try to deliver the experience the creators made and intended to be seen as accurately as possible. The ultimate goal is something that would pass an A/B test with the mezzanine (Rings of Power hits that goal).

I would assume that's because Prime has almost exclusively TV Series and movies (i.e creative content in general).
You're lucky, in a way, 'cause the way you get the master is the way you output it.

I mean, from your use case it probably doesn't matter if it's PQ at 23,976, if it's BT709 SDR at 25p etc while for linear broadcasting it's always a matter of performing the "best conversion possible" to respect the airing standard which are 50p BT2020 HLG for UHD, 25i TFF BT709 SDR for FULL HD and 25i TFF BT601 SDR for SD.

Back in my streaming services days (Crunchyroll and Viewster, 2013-2015), I was in the same safe boat as you guys as I didn't really have to change the framerate or the transfer or the primaries etc, but ever since I moved to a TV and started working in linear broadcasting (January 6th, 2016) I had to actively start doing those conversions in the least painful way possible.


By the way, the thing DTL is talking about would kinda work for a news channel, which almost every linear broadcaster has anyway. ;)



I work in general public state broadcasting, and I see how awful 2 Mbit/s H.264 SD (realtime MPEG compression with a small GOP size) looks on air compared with the studio's uncompressed FHD SDI source, but the general public typically makes zero complaints about quality (over years and decades). So it has worked very well for years and decades.

I feel you. Sky TG24 (the news channel) is still in SD on terrestrial (DTV), so seeing what happens to a perfectly good FULL HD 25i BT709 SDI uncompressed 1 Gbit/s feed from the video mixer is really painful... Luckily, on the satellite it is in FULL HD at decent bitrate, but that only covers around 4 million people and taking into account that the Italian population is 55 million people, one can quickly realize that the sad truth is that the overall majority watches news in SD... :(

benwaggoner
6th April 2023, 17:25
I would assume that's because Prime has almost exclusively TV Series and movies (i.e creative content in general).
You're lucky, in a way, 'cause the way you get the master is the way you output it.
Prime Video is doing a whole lot of sports content and live channels now as well. But yeah, for scripted stuff the job is to reproduce the mezzanine as accurately as possible.

I mean, from your use case it probably doesn't matter if it's PQ at 23,976, if it's BT709 SDR at 25p etc while for linear broadcasting it's always a matter of performing the "best conversion possible" to respect the airing standard which are 50p BT2020 HLG for UHD, 25i TFF BT709 SDR for FULL HD and 25i TFF BT601 SDR for SD.
Yeah, IP delivery makes things much simpler by allowing for more complex output options. And we don't ever have to deliver interlaced or HLG for anything!

Back in my streaming services days (Crunchyroll and Viewster, 2013-2015), I was in the same safe boat as you guys as I didn't really have to change the framerate or the transfer or the primaries etc, but ever since I moved to a TV and started working in linear broadcasting (January 6th, 2016) I had to actively start doing those conversions in the least painful way possible.
I started out in good old analog 480i production, so had sort of an inverse path to yours. And it sure was a breath of fresh air once I'd proved to everyone that we didn't need to do frame rate conversions etcetera.

I feel you. Sky TG24 (the news channel) is still in SD on terrestrial (DTV), so seeing what happens to a perfectly good FULL HD 25i BT709 SDI uncompressed 1 Gbit/s feed from the video mixer is really painful... Luckily, on the satellite it is in FULL HD at decent bitrate, but that only covers around 4 million people and taking into account that the Italian population is 55 million people, one can quickly realize that the sad truth is that the overall majority watches news in SD... :(
Ugh. It was that way in the USA for a long time as well. Hopefully your year-on-year numbers are moving in the right direction. I heard in another meeting today that 50% of EU eyeball hours are OTT already.

It's funny to recall how much of my career was trying to reach a high quality standard def, first on CD-ROM, and then on the web. It's fun to have HD SDR be the fallback now, with creators increasingly considering the 4K HDR the "real" version of the content.

ksec
17th July 2023, 14:52
And ECM 9.1 is out. https://vcgit.hhi.fraunhofer.de/ecm/ECM/-/tree/ECM-9.1

According to the latest test from JVET Geneva meeting, ECM 9.1 finally achieved *overall* 30% BD-Rate vs VTM 11 in Random Access.

benwaggoner
18th July 2023, 17:38
And ECM 9.1 is out. https://vcgit.hhi.fraunhofer.de/ecm/ECM/-/tree/ECM-9.1

According to the latest test from JVET Geneva meeting, ECM 9.1 finally achieved *overall* 30% BD-Rate vs VTM 11 in Random Access.
30% is impressive! Hopefully the irreducible decode complexity increase isn't >>2x to get those gains.

nevcairiel
18th July 2023, 17:59
30% sounds about average, it's in a similar ballpark to what VVC claimed over HEVC at the slower speeds. Much less and adopting a new codec is barely worth it.

ksec
22nd October 2023, 08:19
And ECM 10 is out in the usual place

According to the latest test from JVET meeting, ECM 10 achieved overall 33% BD-Rate vs VTM 11 in Random Access. With lots of improvement going in Low-Delay.

The most impressive thing is we get ~44% of BD-Rate in Text and Graphics in Motion Category. ( The overall result exclude this Category )

30% is impressive! Hopefully the irreducible decode complexity increase isn't >>2x to get those gains.

Compared to VTM we are looking at 8x increase in Encoding Complexity ( Fair but far from perfect ) and 8x increase in decoding. ( oouh )

I am not entirely sure how they intend to achieve 60% BD-Rate ( Yes 60%, not 50% as they are usually aiming at ). We might have hit the end of the S curve in terms of how we could further compress video.
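
A quick bit of arithmetic on how far 33% is from 60%. This sketch treats BD-Rate as a plain bitrate saving at equal quality and assumes the 60% target is measured against the same VTM anchor, which I have not verified.

```python
def remaining_saving(achieved: float, target: float) -> float:
    """Extra bitrate saving needed on top of `achieved` to reach `target`.

    Both arguments are fractions of the anchor bitrate saved at equal
    quality; savings compound multiplicatively, not additively.
    """
    return 1.0 - (1.0 - target) / (1.0 - achieved)


extra = remaining_saving(0.33, 0.60)
print(round(extra * 100, 1))  # 40.3
```

In other words, under those assumptions the tools still to come would have to save about another 40% relative to ECM 10 itself, which is more than everything ECM has gained over VTM so far.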

benwaggoner
23rd October 2023, 21:23
And ECM 10 is out in the usual place

According to the latest test from JVET meeting, ECM 10 achieved overall 33% BD-Rate vs VTM 11 in Random Access. With lots of improvement going in Low-Delay.
Yeah, we got the same update at the SMPTE Media Technology Summit last week. Quite impressive progress for just three years since standardization.

The most impressive thing is we get ~44% of BD-Rate in Text and Graphics in Motion Category. ( The overall result exclude this Category )
Oh, that is good. Not having screen content extensions be mandatory in Main in HEVC is one of my biggest frustrations from it. Those would make game streaming a lot better.

Compared to VTM we are looking at 8x increase in Encoding Complexity ( Fair but far from perfect ) and 8x increase in decoding. ( oouh )
Yeah, 8x decoder complexity is going to be a hard sell; we've never seen close to that big a jump in successive codecs. Although AV2's ML extensions might mean an atypical jump on the AOM side as well. Certainly the $ and mm^2 for a current-gen video decoder are WAY down compared to MPEG-2, even if we had an 8x jump circa 2026. Moore's Law keeps on giving.

I am not entirely sure how they intend to achieve 60% BD-Rate ( Yes 60%, not 50% as they are usually aiming at ). We might have hit the end of the S curve in terms of how we could further compress video.
At least not without some serious ML on the encoder side to make some more advanced transforms feasible to encode. For example, a spherical warp.

Getting a good film grain synthesis feature built-in is probably the biggest lower-hanging fruit left, as so much of the energy of the most difficult to encode content is grain or other temporally and spatially random noise.

Comcast and others are pushing for AOM's AVFG1 proposal, which would allow any codec to trigger the AV1 FGS post filter via a SEI message. AV1 FGS isn't great, but it certainly can be made better with better parameterization (SVT-AV1 sells it way short). The biggest gap is that the grain synthesis is defined relative to display resolution, not source content resolution, so you get cases where grain particle sizes vary on different displays. Since grain is a physical property, a given piece of synthetic grain should take up a specified percentage of the content, so a low resolution screen should show proportionally smaller grain.
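
The scaling idea in that last sentence can be sketched in a couple of lines. This is only an illustration of tying grain size to content resolution; the function name and the 2-pixel grain on 3840-wide content are made-up values, and it is not how AV1 FGS or the AVFG1 proposal is specified.

```python
def grain_size_on_display(grain_px_at_source: float,
                          source_width: int, display_width: int) -> float:
    """Grain size in display pixels so that the grain keeps covering the
    same fraction of the picture it covers at source resolution."""
    return grain_px_at_source * display_width / source_width


# Hypothetical: grain authored as 2 pixels wide on 3840-wide content.
for display_width in (3840, 1920, 1280):
    print(display_width, grain_size_on_display(2, 3840, display_width))
```

On the 1920-wide display the grain comes out at 1 pixel, half the size in pixels but the same size relative to the picture.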

ksec
10th February 2024, 15:54
ECM 11 is out, with a tiny improvement on top of ECM 10 and very small encoding and decoding efficiency improvements. They are looking at possibly merging the Neural Network work to push BD-Rate down further.

Tommy Carrot
10th February 2024, 20:10
Anyone care to share a Windows build of this encoder? I'm very curious how it performs.

Jamaika
11th February 2024, 23:30
https://www.sendspace.com/file/ymwwdd

Tommy Carrot
12th February 2024, 16:22
Thank you very much!

In random access mode it unfortunately crashes at the 2nd frame every time, but I could get it to work in allintra mode, so I can at least examine how it performs for still image compression.

Jamaika
12th February 2024, 18:44
I didn't check the codecs. I built the encoder with AVX only. I had no time.