Old 5th March 2023, 14:59   #21  |  Link
hajj_3
Registered User
 
Join Date: Mar 2004
Posts: 1,125
ECM 8.0 is out: https://vcgit.hhi.fraunhofer.de/ecm/ECM/-/tree/ECM-8.0
hajj_3 is offline   Reply With Quote
Old 5th March 2023, 18:32   #22  |  Link
ksec
Registered User
 
Join Date: Mar 2020
Posts: 117
Quote:
Originally Posted by hajj_3 View Post
Nice. We will have to wait for the next JVET meeting in April to know its results. It is progressing rather quickly.
__________________
Previously iwod
ksec is offline   Reply With Quote
Old 21st March 2023, 13:18   #23  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,058
What is really required to make a real step forward over classic h.264 is a significant or complete move to object-oriented encoding, for any type of scene (including natural scenes).

Currently, as a supplement to the old block-oriented MPEGs, I am making a simple pre-processing engine (pmode=1 for MDegrainN in the mvtools2 package for the AviSynth environment), and it already shows a good benefit (roughly 60% -> 90% skipped blocks in x264 at crf=18, about 40..50% less bitrate on a mostly static scene, and about 5% less bitrate on a mixed-scenes movie). It does not yet support 'forward' motion compensation for moving scenes, so the total encoding benefit on a mixed-scenes title will be higher after the full idea is implemented. An example of the processing software is at https://forum.doom9.org/showthread.p...27#post1984527 .

But this is still rather poor amateur-level software design; from the not-yet-quite-dead video codec industry of the current civilization one would expect a more complete and professional solution:
A video codec must extract from the incoming frame sequence the scene content (textures, objects, lighting, motion data) and create a standard-compatible bitstream for the end-user decoder, which then reconstructs a 2D frame sequence to feed a standard end-user 2D pixel display.

Only a very small part of this process is implemented today: MDegrainN pmode=1 searches for the best texture view of a small image patch over the current tr-scoped frame pool (+-tr frames around the current one, including 'backward' motion compensation for the dissimilarity-metric analysis) and duplicates that texture in the output frame sequence fed to the MPEG coder. So it is 100% (full) temporal denoising. The MPEG coder then treats the patch as a scene-texture element: it does not re-encode it in every frame (skip blocks), only references it as an element of the scene-texture dataset (some reference frame in the h.264 standard) and applies simple transform (motion) data (currently a zero MV) to create each time-sampled output frame for display. This way, an FHD-sized frame of a roughly static scene with a talking talent can be encoded into h.264 B-frames of about 500 bytes (a compression ratio of about 6000:1). So object-oriented MPEG codecs can be expected to easily reach >10000:1 compression ratios for natural movies too.
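As a rough back-of-the-envelope check of that ratio (a sketch only; the 500-byte figure is the observation above, the raw frame size is plain 8-bit 4:2:0 arithmetic):

[CODE]
# Rough arithmetic behind the ~6000:1 figure above.
# Assumes 8-bit 4:2:0 FHD frames; illustrative numbers, not measurements.
width, height = 1920, 1080
raw_bytes = width * height * 3 // 2      # 4:2:0 = 1.5 bytes/pixel -> 3,110,400 bytes
b_frame_bytes = 500                      # observed size of an almost fully skipped B-frame
print(f"{raw_bytes / b_frame_bytes:.0f}:1")   # ~6221:1
[/CODE]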

So these are the features I want from an industry-designed next-gen motion-pictures (MPxx) compression standard (a minimal data-structure sketch follows after this list):
1. Understand and compensate many more possible transforms, not only the 2D translation of current MPEGs, but also:
- rotation (around 3 axes)
- skew
- scale
- lighting/shading
- possibly other basic geometric transforms of 2D projections of textured and lit 3D objects

2. Have a more advanced scene-analysis engine that recognizes physically unchanged scene textures (only damaged by photon shot noise and/or changed by a transform from the list supported by the current MPEG generation). The encoded data then consists of a scene-texture dataset plus the transforms for each scene element (area). 'Motion compensation' should also be renamed to 'transform compensation', since not only the 2D translation but a much richer list of geometric and lighting transforms can be compensated.
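As a minimal sketch of what such a per-block 'transform compensation' record could carry (hypothetical field names, not any existing bitstream syntax):

[CODE]
from dataclasses import dataclass
from typing import Tuple

# Hypothetical per-block record for a "transform compensation" codec as wished for
# above: instead of a single 2D motion vector, each block carries a small set of
# geometric and lighting parameters applied to a reference texture patch.
@dataclass
class BlockTransform:
    texture_id: int                                        # index into the scene-texture dataset
    translate: Tuple[float, float] = (0.0, 0.0)            # classic MV, in pixels
    rotate: Tuple[float, float, float] = (0.0, 0.0, 0.0)   # rotation around 3 axes, radians
    scale: Tuple[float, float] = (1.0, 1.0)                # horizontal / vertical scale
    skew: float = 0.0                                      # shear factor
    gain: float = 1.0                                      # lighting/shading: multiplicative term
    offset: float = 0.0                                    # lighting/shading: additive term
[/CODE]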

Really, this is also part of the future denoising engine of mvtools2 development. MPEG compression and temporal denoising still share a large amount of identical processing, so in some future 'perfect world' one can expect a final merge of the temporal denoiser and the MPEG encoder into a single engine, so users no longer need to apply temporal-denoise pre-processing before the final MPEG encode.

Are there any existing solutions moving in this direction?

" announced the "death of MPEG""

MPEG is simply the Moving Picture Experts Group. It will be dead around the end of the current civilization (which may really be soon enough), and after that the remaining creatures may not be very busy with motion-picture processing, during the next long 'dark ages'.

"I am a bit disappointed by the current state of UHD in broadcasting. We are taking state-of-the-art DVB-T2 muxes being able to broadcast a grand total of 3 channels. With VVC it would increase to a grand total of 5 (meanwhile with H.264 FHD you can fit 6). That's why I think that UHD on broadcasting won't be much of a success unless some further improvement is made on compression."

The technical solution is very simple: do not broadcast an unlimited number of junk channels, and do not count the number of channels as an advantage. Simply put one well-designed UHD channel per 8 MHz of physical bandwidth.

"iptv is becoming more popular so dvb-t2 will become less relevant."

Some creatures really live simply on the surface of the planet and not in the cities. In the current quickly-dying civilization, unlimited-traffic tariffs from wireless 3G/4G providers were already discontinued a few months ago, and last week my prepaid 3G internet almost died; support spent several days trying to fix it without much success, so mostly only poor text-based internet is left. And no wire or optical provider will run a wire/optical connection to poor creatures living far from the city, in small self-built homes rather than multi-room multi-level buildings. So DVB-T or DVB-S is really the only high-enough-bandwidth way to get movie content in such places.

Last edited by DTL; 21st March 2023 at 15:02.
DTL is offline   Reply With Quote
Old 24th March 2023, 16:23   #24  |  Link
kurkosdr
Registered User
 
Join Date: Aug 2009
Posts: 313
Quote:
Originally Posted by DTL View Post
... from the not-yet-quite-dead video codec industry of the current civilization ... It will be dead around the end of the current civilization (which may really be soon enough) ... In the current quickly-dying civilization ...
Dude, can you stop doing this? Doom9 is a technical forum, not a place to live your eschatological fantasies. If you want to do this, there are websites run by cranks who believe in this kind of stuff. They will even help you buy gold and prepper supplies from the affiliate links.
kurkosdr is offline   Reply With Quote
Old 24th March 2023, 19:43   #25  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,058
We, as visual-tech industry engineers, see the internals of these processes on roughly a half-century scale, while still trying to make the visual industry a bit better. But it is good to steer the effort with some propositions for the reasonably near future (about a quarter to a third of a century, or less).

As for object-oriented encoding, I remember reading some promises around the time of h.264 development, maybe 10..15 years ago. But it looks like all the hype about object-oriented video compression is now dead? I see close to no movement in this really beneficial direction. Maybe it was found that it cannot produce a good universal codec for both broadcasting (which requires a very short channel-switching time before the decoder can start outputting decoded frames) and other content-delivery paths? I understand that the torrent-style, full-file form of content delivery over an IP network before playback starts is somewhere at the lowest priority for the main industry codec developers. Yet it may be the main way of compressing and delivering content in some communities, and they are mostly interested in high-quality compression at the smallest file size per title (low network-transfer cost, low storage cost, a high number of happy seeds, long life of a release on the network, and so on). So the requirement of a short startup time before the decoder has collected all the required scene data may be dropped in file-based delivery scenarios with fast random-access playback storage (HDD or, better, flash).

So maybe future video codecs will be divided into broadcast/streaming-friendly codecs (with lower quality and/or a higher required average bitrate) and title-per-file codecs for private CD/DVD/BD or purchased-file backups into the smallest possible file size. In the file case the 'local cut-scene bitrate' may be as high as required per selected cut-scene, while broadcast-friendly codecs face much stricter requirements.

Last edited by DTL; 24th March 2023 at 19:49.
DTL is offline   Reply With Quote
Old 24th March 2023, 22:16   #26  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,770
Object-oriented compression either requires the content be authored as objects (a la Atmos and MPEG-H) or a good way to extract objects from existing baseband media. One can imagine a codec which is basically a stream of GPU instructions to draw the video. However, video is full of things that look like objects but aren't, or don't act like objects, or turn into different kinds of objects. And they all exist under different lighting, and get color timed to get a look that may not have been possible to do in real life. It's not hard to come up with a promising demo like this. RealMedia blew our minds when they remade a South Park episode as animated vector graphics with a soundtrack. Standard definition that could stream in real time over a 56K modem!

Demos of actual film and video content getting automatically converted into objects and then reconstructed accurately? That I've never seen. Maybe possible with some crazy machine learning system, to some degree, but reconstruction to something that feels shot on 35mm film at 24p with a 180 degree shutter is a couple orders of magnitude beyond what we know how to do today. Getting something understandable, sure. But something that feels like the original seems very challenging.

Of course, codecs have been getting all kinds of features that make encoding objects in the source a lot more efficient. All the different CU sizes (including rectangular) and asymmetric motion partitions let us pretty accurately give an object motion prediction distinct from the background. Intra-frame prediction lets us reuse repeated visual elements (great with text). But since it is fundamentally pixels in and out, stuff that can't be parsed as objects comes through just fine.

I also get nervous about too much ML compression. Bitstream corruption could go from annoying green blocks appearing to different dialog being generated and characters wearing different clothing.
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
benwaggoner is offline   Reply With Quote
Old 25th March 2023, 08:53   #27  |  Link
birdie
Artem S. Tashkinov
 
birdie's Avatar
 
Join Date: Dec 2006
Posts: 344
Quote:
Originally Posted by benwaggoner View Post
I also get nervous about too much ML compression. Bitstream corruption could go from annoying green blocks appearing to different dialog being generated and characters wearing different clothing.
Or saying the wrong words since ML is now applied to audio as well.
birdie is offline   Reply With Quote
Old 25th March 2023, 15:49   #28  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,058
"Object-oriented compression either requires the content be authored as objects (ala Atmos and MPEG-H) or a good way to extract objects from existing basement media."

It should already be a 'very simple' task for today's 'neural networks' and other 'machine learning' systems. They can even create images from a text description of just a few bytes.

As I see it in practice, the total imaging pipeline to the end user may go in a slightly wrong direction when engineers compare video codecs with 'as close to the source as possible' metrics (the typical PSNR/SSIM/VIF and the like). The real goal of motion-picture consumption is the feeling in the viewer's head. Not only is it unnecessary to losslessly encode the noise part of the scene image, but a significant part of the 'real' scene may also not add much to the overall feeling of the movie. The really challenging task for 'very cool future codecs' is this: analyse the total movie content (a standard runtime of about 1.5 hours) and create a compact encoded description, so that at the player side the 'decoder' can reproduce a frame sequence very close to it (possibly at a different resolution and so on), possibly using not exactly the same objects and textures, but giving a feeling in the consumer's head close to watching the original source. The PSNR between the source frames and the decoder output, measured by 'equality metrics', may be very low, but customer experience and satisfaction will be close to perfect. Depending on the movie-description format, this may give compression ratios far above 10000:1 and also allow rendering at any resolution the end user requires.

"One can imagine a codec which is basically a stream of GPU instructions to draw the video."

Currently, MPEGs and temporal denoisers already work close to 'simple 2D rendering', operating on small patches of each frame called 'blocks': they either assume the block is unchanged (100% motion compensation in MPEGs, the 'skip' block status) or encode the 'residual error' (really a sequence of small commands to the renderer) that cannot be compensated by the transforms the current system supports, or that is not really a geometric transform at all.

"And they all exist under different lighting,"

Lighting is also part of the 'transforms' family and is very easy to describe in a small byte sequence. For example, when some part of the scene is shaded by a moving object, the shaded blocks only get their 'lighting' transform data changed, as long as the rest of the geometry and the light sources are static. The task for 'temporal denoisers' is the same: detect and 'back-compensate' as many transforms as possible to distil the cleanest possible view of the original texture. After that, the 'denoised' frames can be reconstructed from a single clean texture-database source plus the set of transforms applied in each output frame (translation, scaling, rotation, lighting and so on). The only difference between a temporal denoiser and a video codec is that the temporal denoiser uses all this data internally to 'regenerate clean frame views' and outputs the same RAW frames as its input, while the video codec can output the clean textures plus transform data in a smaller byte-size form and pass it to a 'decoder' to restore the full-size frame sequence for display. It is the same thing we already have with MPEGs using I-frames and reference frames as a short-term texture database and MV data to describe translation-only compensation for each rendered output frame. All other transforms not supported by current MPEGs are treated as 'unsupported', encoded as residual error impossible to compensate, and increase the output bitrate and file size.

I plan to add support for 'lighting' transform analysis and 'compensation' to the mvtools software in the not-too-distant future; it is easy enough. The 'rotate' and 'scale' transforms may also be useful in many real cut-scenes, so all the basic SRT (Scale, Rotate, Translate) transforms will be covered, plus lighting. Currently, like the old MPEGs, it only supports 'translate' analysis and compensation, but it can reuse translation analysis data (standard block MVs) from a hardware MPEG-encoder ASIC if one is present in the system and exposes the required API via drivers. So if some new version of a hardware MPEG-encoder ASIC provides data for more analysed transforms (scale, rotate and others), that could easily be reused for temporal degraining too. But currently the Microsoft hardware-acceleration API for motion-picture analysis only supports the translate transform.

Yes, building a motion-search engine that analyses the 'rotate' transform around two, or better all three, axes will impose a significant performance penalty compared with the current translate-only search engine in mvtools, so offloading this simple but compute-heavy work to some ASIC is a very 'nice to have' feature. And if that engine is also reused in an MPEG coder, it may bring significant savings of investment for both the denoising and the compression task.
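A minimal numpy sketch of that 'lighting transform' back-compensation idea (a least-squares gain/offset fit between the current block and the reference block; illustrative only, not actual mvtools code):

[CODE]
import numpy as np

def estimate_lighting(ref_block, cur_block):
    """Fit cur ~= gain * ref + offset in the least-squares sense."""
    ref = np.asarray(ref_block, dtype=np.float64).ravel()
    cur = np.asarray(cur_block, dtype=np.float64).ravel()
    A = np.stack([ref, np.ones_like(ref)], axis=1)
    (gain, offset), *_ = np.linalg.lstsq(A, cur, rcond=None)
    return gain, offset

def back_compensate(cur_block, gain, offset):
    """Undo the estimated lighting change, recovering the 'clean' texture view."""
    return (np.asarray(cur_block, dtype=np.float64) - offset) / gain

# Toy example: a reference block and the same block shaded by a passing object.
ref = np.random.default_rng(0).integers(16, 235, size=(8, 8))
cur = 0.7 * ref + 12.0                    # 30% darker plus a small offset
g, o = estimate_lighting(ref, cur)
print(np.allclose(back_compensate(cur, g, o), ref))   # True: only lighting changed
[/CODE]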

"Of course, codecs have been getting all kind of features that make encoding objects in the source a lot more efficient. All the different CU sizes (including rectangular) and asymetric motion partitions lets us pretty accurately make an object get motion prediction distinct from the background."

If all MPEGs from MPEG-1 up to maybe h.266 still only analyse and compensate the translate transform, then the real engineering progress in video codec design does not look very big.

"But since it is fundamentally pixels in and out, stuff that can't be parsed as objects comes through just fine."

There is already a sort of partial object orientation in maybe any MPEG: block-based compression. A block is a small object (it can be treated as a small texture), and it may or may not be supplemented with transform data to try to reach better compression across the frame sequence.
So an encoder can have two datapaths:
1. Standard compression of the block as a sample array without transform compensation (simple M-JPEG).
2. An attempt to analyse the full list of supported transforms and create an encoding using transform compensation.

Next, compare the compressed dataset sizes from path 1 and path 2 and select the smaller one, provided the difference between the decoded image and the input source is below a threshold (the quality-level parameter of the encode). A minimal sketch of this decision follows below.

So a block is only encoded as 'unknown transform / impossible to compensate' (standard JPEG) if it cannot reach better compressibility via transform-compensated compression. If some transform compensation shows a benefit, the intermediate path is also checked: partial transform compensation plus JPEG-like compression of the residual error may give a smaller compressed data size. All other blocks are encoded with a lower bit count when transform compensation is close to ideal and the residual error is close to zero (or below the quality threshold), so the total size per time slice or per file is reduced. An example of blocks ideally compensated by TC are simply static blocks.
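A minimal sketch of that two-path decision per block (the helper functions encode_intra, encode_compensated and decode_error are hypothetical stand-ins, not any existing encoder API):

[CODE]
# Hypothetical per-block mode decision for the two datapaths described above:
# path 1 = plain intra (M-JPEG-like) coding, path 2 = transform compensation
# plus residual coding. The smallest bitstream that still meets the quality
# target wins; a fully compensated static block degenerates to a near-free skip.

def choose_block_mode(block, ref_textures, quality_threshold,
                      encode_intra, encode_compensated, decode_error):
    intra_bits, intra_stream = encode_intra(block)
    tc_bits, tc_stream = encode_compensated(block, ref_textures)

    candidates = []
    if decode_error(intra_stream, block) <= quality_threshold:
        candidates.append((intra_bits, "intra", intra_stream))
    if decode_error(tc_stream, block) <= quality_threshold:
        candidates.append((tc_bits, "transform_compensated", tc_stream))

    if not candidates:                 # neither path met the target: fall back to intra
        return intra_bits, "intra", intra_stream
    return min(candidates, key=lambda c: c[0])
[/CODE]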

So for the typical cut-scene of a talking talent in front of a fixed background, all the background blocks are encoded as 100% TC 'skipped' blocks.
The facial animation of the talking talent is created from the skin/face texture plus a transform dataset (scale, rotate, translate, skew, ...), so it too can be encoded better as a single static face texture in the database plus a set of transform data for each frame.

Codecs could also be given a hierarchy of database levels:
1. Group-of-frames texture database.
2. Cut-scene (group of groups of frames) texture database.
3. Group-of-cut-scenes texture database.
4. Total-movie texture database.

The higher the texture-database level used, the more data can be skipped from the lower levels in compressing the whole movie to a file. For example, a single talent typically appears in many cut-scenes of a movie, so a single texture database entry can be used. The texture database would be a sort of 'macro-use reference frame' set for the MPEG encoder, with an 'unlimited' number of reference frames (computed after a first analysis pass over the whole movie in multi-pass encoding mode), and the decoder would have fast random access to the total-movie texture (reference-frame) database when decoding any output frame.
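A minimal sketch of that lookup order (hypothetical structure; the point is only that a block reference resolves to the nearest level that already holds its texture):

[CODE]
# Hypothetical texture-database hierarchy: GOP -> cut-scene -> group of cut-scenes -> movie.
# Textures reused across the whole title only need to be stored once at the movie level.

LEVELS = ["gop", "cutscene", "cutscene_group", "movie"]

def resolve_texture(texture_id, databases):
    """databases: dict of level name -> dict of texture_id -> texture data."""
    for level in LEVELS:
        if texture_id in databases.get(level, {}):
            return level, databases[level][texture_id]
    raise KeyError(f"texture {texture_id} not found at any level")

# Example: a recurring talent's face texture stored once for the whole movie.
dbs = {"gop": {1: "background patch"}, "movie": {42: "talent face texture"}}
print(resolve_texture(42, dbs))        # ('movie', 'talent face texture')
[/CODE]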

Last edited by DTL; 25th March 2023 at 21:55.
DTL is offline   Reply With Quote
Old 27th March 2023, 16:34   #29  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,770
Yes, ML can generate images from a short text description, which is amazing. But it can't take an image, describe it in some high-level pseudo-semantic way, and then regenerate generally the same image it started with. Nor is it feasible to have a sequence of images and have ML code the slight differences between consecutive frames in order to reconstruct the original frames.

Maybe for something simple and of consistent style. For example, a Simpsons episode. There's a huge number of episodes to train on, with a somewhat limited number of art styles, characters, and locations frequently on the screen. It's all fundamentally vector animation, which can be expressed quite concisely. I suspect having a cluster of ML models that could generate a movie from scratch would be a lot easier than one that can take existing content, compress it to a much lower bitrate than feasible today, and then reconstruct the same image. When an ML can make a movie, it'll know to only try to make the kind of movies it knows how to make. An input video sequence can be almost anything.

With a lot of handwaving, $1B, and five years I can imagine coming up with some sort of efficient ML system that can make a generic Simpsons episode. But that would still be a generic episode. If they did a modern 3D episode with ray tracing, the ML would have no way to express that accurately. There would still need to be a fallback to more traditional pixels-to-pixels encoding for novel stuff the ML wasn't trained on.

And a technology that can take moving images and turn them into animated vectors in a pleasing way just doesn't exist, and that's the easier first stage of what you're talking about, ignoring reconstruction.

ML can be very powerful for predicting a right result for cases that land within the range of its training data. But trying to go outside that range can produce bizarre results and even "hallucinations" where it provides something plausible but wrong. Training an ML to handle any sort of video that's been made or might be made is an impossible task. Traditional codecs have their downsides, but they are robust. It's nigh impossible to find content that just breaks a codec anymore; at reasonable bitrates, the content is always recognizable as itself, even if degraded.

In a classic information-theory sense, we can think of the ML model itself as the codebook that turns the compressed signal into something similar enough to the original. The more variety and specificity needed, the more training is required and the bigger the ML model that needs to be downloaded and stored.

ML is able to interpolate missing data in a more analog-like fashion, providing verisimilitude without the artifacts of traditional encoding. But that verisimilitude means that bitstream or other errors can result in something wrong-but-plausible instead of an obvious artifact. Not good if a different actor's face is used, or text on signage says something else. It's a feature that traditional codecs lose detail but don't make it up in ways that would change the narrative.

I think the MPAI approach of having something like a traditional codec at the core, enhanced by ML where ML can usefully enhance, makes a lot more sense. ML would be great at rendering a pebbled street, since the locations of the individual pebbles don't matter. But classic motion compensation would be needed to make sure the pebbles don't randomize when the camera pans. ML has also shown promising results for better deblocking/deringing without detail loss.
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
benwaggoner is offline   Reply With Quote
Old 27th March 2023, 20:01   #30  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,058
" If they did a modern 3D episode with ray tracing, the ML would have no way to express that accurately."

The 'accuracy measurement' is really the quality setting of the compression: if the user wants a smaller compressed file, they accept a lower accuracy metric. That is very close to today's video codecs with PSNR and other accuracy metrics of how much the decoded result differs from the input. And the final goal of a 'highly abstracted' video compression method is to create a more or less accurate (similar) feeling in the mind compared with watching the original uncompressed content. So it has the right to step significantly away from the original frame sample values, object positions and other aspects of the 'look'.

"An input video sequence can be almost anything."

That may not be completely correct. A human-oriented video codec could act much like the human visual system: it creates a (very simple) 3D scene model from one or two 2D projections, and it is also learnable (as a small baby starts to watch the surrounding world, it learns how to reconstruct a limited world model in its mind from the limited-size 2D projections passed from the eyes to the brain). If a mature human trained on this planet is placed in a significantly different environment, it may lose the ability to orient in space using the eyes alone and will need to re-learn.
So typical movie scenes are limited in their possible sample-array values (they are not pure white random noise) and describe a not-infinitely-large number of 'typical 3D' or even 'simple 2D' scenes. It is a sort of shame for the current civilization, but it turned out that the number of movie plots is not just finite but _very_ limited, so it is now very hard for content writers to write something significantly new and good for the general public. It is like quickly exhausting the valid crypto values from the nearly infinite set of integer numbers.

"a technology that can just take moving images and turn them into animated vectors in a pleasing way just doesn't exist,"

In a simple form that is already how almost all current 'video compression codecs' work: turn the input into an array of textures (blocks) that can be transform-compensated and is smaller than the input frame sequence, and try to encode only the residual error left after transform compensation. The transform dataset (currently only motion vectors for translation-transform compensation) is expected to be much smaller than the corresponding block sequence in the input frames. Sadly, most MPEGs have from the beginning only supported translation-transform analysis and compensation; from new codecs more supported transforms are expected.

And yes, if we increase the number of supported transforms, the size of the transform dataset becomes somewhat bigger. So it is the encoder's task to select the best path for each block: either encode it as M-JPEG only (no temporal compression), or apply transform compensation with one or more supported transforms plus encoding of the residual error, and pick the smallest possible output while keeping the residual decoding error within the selected quality level.

Last edited by DTL; 27th March 2023 at 20:09.
DTL is offline   Reply With Quote
Old 28th March 2023, 00:34   #31  |  Link
stax76
Registered User
 
stax76's Avatar
 
Join Date: Jun 2002
Location: On thin ice
Posts: 6,837
A related news article that recently appeared in my news feed:

Apple acquired a startup using AI to compress videos.
stax76 is offline   Reply With Quote
Old 28th March 2023, 02:39   #32  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,770
Quote:
Originally Posted by DTL View Post
So it has the right to step significantly away from the original frame sample values, object positions and other aspects of the 'look'.
Oh my, we must work in very different markets. In professional content, a slight change in hue or contrast can mean a week of my time figuring out what went wrong and getting a fix implemented. Very senior executives start emailing me when a creative complains that the audio dynamic range is off. We can't change color, dynamic range, cropping, or anything else that would modify creative intent.

Changing the position of objects and the look of the title? Contracts would be updated to bar the use of compression technologies that could result in that.

Full stop for security cameras, news gathering, or anything else where accuracy is more important than verisimilitude. Courts were originally iffy on using MPEG-2 video in evidence because "it could be changed from what happened."
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
benwaggoner is offline   Reply With Quote
Old 28th March 2023, 02:45   #33  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,770
Quote:
Originally Posted by stax76 View Post
A related news article that recently appeared in my news feed:

Apple acquired a startup using AI to compress videos.
That's basically ML-driven preprocessing, softening less important detail to make it easier to encode. It's not clear if they're doing it baseband only, or with integration into the encoder itself; noise-reduction enhancement like that works better if it is quantization-aware.

This is a pretty powerful class of techniques to improve quality at low bitrates while maintaining compatibility with existing decoders. The new experimental --mctf filter in x265 is another stab at the same concept, but probably less sophisticated.

ML integration into the decoder side could help restore some of the missing detail in those "less important" regions, as the feel of the texture is more important than the specific pattern of the foliage. That's a kind of loss that creatives aren't likely to notice or care about.
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
benwaggoner is offline   Reply With Quote
Old 28th March 2023, 18:02   #34  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,058
"Oh my, we must work in very different markets. In professional content, a slight change in hue or contrast can mean a week of my time figuring out what went wrong and getting a fix implemented. Very senior executives start emailing me when a creative complains that the audio dynamic range is off. We can't change color, dynamic range, cropping, or anything else that would modify creative intent."

Yes, the quality requirements of the 'Hi-Fi' and 'Hi-End' markets are much higher than those of the general public / mass market. So it may not be a good idea to try to use a 'single MPEG version per decade' across such different quality groups. For general-public, mass-market usage much more 'scene-abstractive' video codecs may be used, just as 128 kbit/s MP3 is much more interesting to the general public than lossless 192 kHz/24-bit HD audio. But both market applications exist.

A progressive video codec can also be 'receiver-optimized' and exploit the simple fact that the 'receive bitrate' of human vision is really much lower than gigabits per second, so even uncompressed FHD over SDI at 1+ Gbit/s is highly redundant. If the codec is highly adapted to the end-user's visual system, it can gain a significant benefit while keeping a 'good quality' feeling in the customer's mind, and the customer will happily pay for the service, even though the internally measured technical quality of service will be far from the 'Hi-Fi' and 'Hi-End' markets.

I work in general-public state broadcasting and I see how awful the 2 Mbit/s h.264 SD air signal is (real-time MPEG compression with a small GOP size) compared with the studio's uncompressed FHD SDI source, but the general public typically makes zero quality complaints, over years and decades. So it has worked very well for years and decades.

"Changing the position of objects and the look of the title? Contracts would be updated to bar the use of compression technologies that could result in that."

There is a real secret here: if the end user never sees the ground-truth source, they may not notice any difference, provided the total 'video compression system' is smart enough. In most broadcasting tasks end users never see the real source footage. The typical marketing goal is a final feeling in the customer's mind/head, and the customer generally pays for it; it is not about exact sample/pixel values and very precise scene-element locations.

As I now work on a sort of artistic image creation, I see how _most_ real imaging, and of course broadcast imaging, typically has _very_ poor artistic design (close to zero), so even significant changes cannot further ruin an artistic value that is mostly absent anyway.

This is not about specially designed artistic titles produced over several years with careful and costly per-scene and per-frame fine-tuning at a cost of about $300,000,000 per 1.5 hours (5400 seconds) of runtime. It is about the 99+% of typical content fed to video compression.
If most broadcast content were produced at about $50,000 per second of runtime (about $500,000 per scene cut), then it might call for better video compression.

Last edited by DTL; 28th March 2023 at 18:29.
DTL is offline   Reply With Quote
Old 28th March 2023, 20:48   #35  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,770
The name of the game in my business is "preservation of creative intent." We try to deliver the experience the creators made and intended to be seen as accurately as possible. The ultimate goal is something that would pass an A/B test with the mezzanine (Rings of Power hits that goal). It is accepted that the representation of creative intent will fall short on small devices and at low bandwidth. But we never, never, never want to do anything that would change the creative intent into something else. Loss of detail is vastly more acceptable than the introduction of false detail.

Taking your MP3 example, an ML-enabled compression technology would be able to reproduce the same sound in fewer bits by offering better ways to reconstruct a complex signal from a compact compressed representation. But the object-based AI you're talking about would be more like taking a .WAV file and then converting it to MIDI. Sure, that could work well for some content, and deliver extremely low bitrates. But if there are novel instruments or effects in there, or lyrics, a MIDI version would express a radically different creative intent.

Of course there will be convergence and gray areas over time. But in general, the more a compression technology bends towards synthesis versus reproduction, the bigger the risks to maintaining creative intent will be. And there will always need to be fallbacks to classic compression approaches for elements that the ML doesn't know how to synthesize accurately.
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
benwaggoner is offline   Reply With Quote
Old 28th March 2023, 21:17   #36  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,770
One can think of the difference between what can be synthesized and what can't as the residual.
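In sketch form (a toy numpy illustration of that split, not any specific codec):

[CODE]
import numpy as np

# Whatever the synthesis model can reproduce is "free"; the residual
# (original minus synthesized) is what still needs classic coding.
original = np.array([100.0, 102.0, 150.0, 151.0])
synthesized = np.array([100.0, 101.0, 149.0, 152.0])   # hypothetical ML reconstruction
residual = original - synthesized                      # carried by the traditional layer
assert np.array_equal(synthesized + residual, original)
[/CODE]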
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
benwaggoner is offline   Reply With Quote
Old 28th March 2023, 22:21   #37  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,058
" But the object-based AI you're talking about would be more like taking a .WAV file and then converting it to MIDI. Sure, that could work well for some content, and deliver extremely low bitrates. But if there are novel instruments or effects in there, or lyrics, a MIDI version would express a radically different creative intent."

Yes, it is very close to the sound model: take a RAW WAV at the AI coder's input and it outputs a MIDI-like dataset for distribution. Depending on the file size the customer requests (and the resulting quality), the instrument set used may be a 'standard library', or a description of _some_ new instruments may be included in the compressed dataset. The user-adjustable quality of the compression is then the amount of 'real instruments / textures' used, more or less. We already have this with current MPEGs: the lower the requested bitrate, the fewer texture details and other scene elements the end user gets to watch.

"And there will always need to be fallbacks to classic compression approaches for elements that the ML doesn't know how to synthesize accurately."

Yes, the initial implementations of 'highly abstractive' video compressors may be very poor, but they are expected to improve in quality through 'learning'. As mentioned, the current civilization has already just about exhausted all possible movie plots and has accumulated a good library of 'classic movies' over half a century of colour. So it is now simply a task for the creators of machine-learning compression to scan over this classic-movie database and decrease the mean residual error of the compression, using as many of the existing movies in the library as possible. A 'typical basic database of textures' for the decoder (downloadable once) may also be created; it is expected to not be very large.

Just as the standard MIDI instrument library is updated once, at OS install time or with the drivers for the playback card or software. And yes, the quality of MIDI playback depends on the quality (cost) of the player hardware. So it is a simple marketing parameter: the more the user pays, the more quality they get at movie (content) playback.

It also scales easily in quality across a range of playback devices, from cheap low-power smartphones to premium home-cinema playback devices with high compute power. That is also very marketing-friendly.

Last edited by DTL; 28th March 2023 at 22:35.
DTL is offline   Reply With Quote
Old 29th March 2023, 09:33   #38  |  Link
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,058
"Courts were originally were iffy on using MPEG-2 video in evidence because "it could be changed from what happened.""

Those are simply very different application cases. Because the human visual system is _very_ limited in the data rate it can 'understand', there are two possible cases of getting visual information from a frame:

1. If the human creature has a very long time to analyse each frame and each frame area, it can get most of the visual information presented in a frame by sequentially scanning different areas (of different positions and sizes) over a long time. That is the typical use case of large-format photo-frame consumption: the user spends hours viewing the total frame and its different areas to consume all the data of a 16K or much larger resolution frame, if the frame is designed by a good artist and has many 'levels' to view.

2. If the frame sequence presentation is externally clocked at a high frame rate (like 25 frames per second or more), the human creature only has a chance to capture some general feeling of the presented cut-scene and maybe a few very limited areas of the scene in more detail. General-public broadcasting and many other moving-picture use cases fall into this case, so a wisely designed video compression system can use this property of the end user to decrease data rate and/or file size.

The limitation in case 2 can also harm the user's life in some real use cases, such as mission-critical viewing tasks like driving a car at high speed: if the human visual system, overloaded by a high input data rate, skips some really important part of the input scene, a serious incident may happen, even though the view was clear and all the scene content was presented to the eyes in a clear, undistorted form.

Last edited by DTL; 29th March 2023 at 09:41.
DTL is offline   Reply With Quote
Old 6th April 2023, 17:11   #39  |  Link
FranceBB
Broadcast Encoder
 
FranceBB's Avatar
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,902
Quote:
Originally Posted by benwaggoner View Post
The name of the game in my business is "preservation of creative intent." We try to deliver the experience the creators made and intended to be seen as accurately as possible. The ultimate goal is something that would pass an A/B test with the mezzanine (Rings of Power hits that goal).
I would assume that's because Prime has almost exclusively TV series and movies (i.e. creative content in general).
You're lucky, in a way, 'cause the way you get the master is the way you output it.

I mean, in your use case it probably doesn't matter whether it's PQ at 23,976 or BT709 SDR at 25p etc, while for linear broadcasting it's always a matter of performing the "best conversion possible" to respect the airing standards, which are 50p BT2020 HLG for UHD, 25i TFF BT709 SDR for FULL HD and 25i TFF BT601 SDR for SD.

Back in my streaming services days (Crunchyroll and Viewster, 2013-2015), I was in the same safe boat as you guys as I didn't really have to change the framerate or the transfer or the primaries etc, but ever since I moved to a TV and started working in linear broadcasting (January 6th, 2016) I had to actively start doing those conversions in the least painful way possible.


By the way, the thing DTL is talking about would kinda work for a news channel, which almost every linear broadcaster has anyway.


Quote:
Originally Posted by DTL View Post
I work in general-public state broadcasting and I see how awful the 2 Mbit/s h.264 SD air signal is (real-time MPEG compression with a small GOP size) compared with the studio's uncompressed FHD SDI source, but the general public typically makes zero quality complaints, over years and decades. So it has worked very well for years and decades.
I feel you. Sky TG24 (the news channel) is still in SD on terrestrial (DTV), so seeing what happens to a perfectly good FULL HD 25i BT709 SDI uncompressed 1 Gbit/s feed from the video mixer is really painful... Luckily, on satellite it is in FULL HD at a decent bitrate, but that only covers around 4 million people; taking into account that the Italian population is 55 million, one quickly realizes that the sad truth is that the overall majority watches the news in SD...
FranceBB is offline   Reply With Quote
Old 6th April 2023, 17:25   #40  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,770
Quote:
Originally Posted by FranceBB View Post
I would assume that's because Prime has almost exclusively TV Series and movies (i.e creative content in general).
You're lucky, in a way, 'cause the way you get the master is the way you output it.
Prime Video is doing a whole lot of sports content and live channels now as well. But yeah, for scripted stuff the job is to reproduce the mezzanine as accurately as possible.

Quote:
I mean, from your use case it probably doesn't matter if it's PQ at 23,976, if it's BT709 SDR at 25p etc while for linear broadcasting it's always a matter of performing the "best conversion possible" to respect the airing standard which are 50p BT2020 HLG for UHD, 25i TFF BT709 SDR for FULL HD and 25i TFF BT601 SDR for SD.
Yeah, IP delivery makes things much simpler by allowing for more complex output options. And we don't ever have to deliver interlaced or HLG for anything!

Quote:
Back in my streaming services days (Crunchyroll and Viewster, 2013-2015), I was in the same safe boat as you guys as I didn't really have to change the framerate or the transfer or the primaries etc, but ever since I moved to a TV and started working in linear broadcasting (January 6th, 2016) I had to actively start doing those conversions in the least painful way possible.
I started out in good old analog 480i production, so had sort of an inverse path to yours. And it sure was a breath of fresh air once I'd proved to everyone that we didn't need to do frame rate conversions etcetera.

Quote:
I feel you. Sky TG24 (the news channel) is still in SD on terrestrial (DTV), so seeing what happens to a perfectly good FULL HD 25i BT709 SDI uncompressed 1 Gbit/s feed from the video mixer is really painful... Luckily, on the satellite it is in FULL HD at decent bitrate, but that only covers around 4 million people and taking into account that the Italian population is 55 million people, one can quickly realize that the sad truth is that the overall majority watches news in SD...
Ugh. It was that way in the USA for a long time as well. Hopefully your year-on-year numbers are moving in the right direction. I heard in another meeting today that 50% of EU eyeball hours are OTT already.

It's funny to recall how much of my career was spent trying to reach high-quality standard def, first on CD-ROM and then on the web. It's fun to have HD SDR be the fallback now, with creators increasingly considering 4K HDR the "real" version of the content.
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
benwaggoner is offline   Reply With Quote