Digital Subband Video 2 - Open Source Wavelet Codec [Archive]

LMP88959

4th October 2024, 16:06

Hello Doom9, I've released the second iteration of my video codec.

DSV1 wasn't giving me the performance I wanted so I continued work on it and created DSV2.

DSV2 is to DSV1 what H.264 was to MPEG-2. A similar 'generational leap' in terms of efficiency.

As of June 20, 2025 the DSV2 bitstream is frozen.
It is totally free for you to use however you'd like.

Please refer to the GitHub repository for code, examples, and a single-header decoder implementation.

https://github.com/LMP88959/Digital-Subband-Video-2

Description taken from GitHub README:

DSV2 Features

- compression using multiresolution subband analysis (also known as a wavelet transform) instead of DCT
- up to quarter-pixel motion compensation
- 4:1:0, 4:1:1, 4:2:0, 4:2:2 (+ UYVY) and 4:4:4 chroma subsampling formats
- adaptive quantization
- in-loop filtering
- intra and inter frames with variable length closed GOP
- no bidirectional prediction (also known as B-frames); only forward prediction with the previous picture as reference

Improvements and New Features since DSV1

- in-loop filtering after motion compensation
- more adaptive quantization potential
- skip blocks for temporal stability
- new subband filters + support for adaptive subband filtering
- better motion compensation through Expanded Prediction Range Mode (EPRM)
- quarter pixel compensation
- psychovisual optimizations in the codec and encoder design

benwaggoner

4th October 2024, 21:07

Can you give an architectural overview of how you're handling inter prediction in wavelets? Wavelets were always pretty good for intraframe compression, but weren't able to match block-based motion compensation efficiency.

I've long thought that this was due to the better symmetry between inter and intra in block-based coding, which minimized cumulative loss from using visibly compressed blocks to predict from.

Being able to do more wavelet-like interprediction always seemed like the holy grail here, but I at least never had a good idea how that would work in practice.

LMP88959

4th October 2024, 21:31

Hi Ben,

DSV2 does block-based motion compensation as well, using the 'traditional' techniques of subpel motion and in-loop 'deblocking' filtering.
The wavelets are only used for intra frame and inter (residual) frame compression.

Did I understand your question correctly?

benwaggoner

9th October 2024, 17:01

Yes, perfectly.

Is the deblocking to handle artifacts introduced by the pixel-based motion estimation?

Do you find mixing wavelet-based and pixel-based tools is a challenge for efficient motion prediction? My sense is this has been a big reason that wavelet codecs haven't been competitive for interframe encoding to date.

Many years ago I idly imagined a way to use 3D wavelets to extend into the temporal dimension to get around this. But my approach was so obviously awful I didn't get very far with it ;).

The Daala experimental approach of keeping everything in the frequency domain for prediction and only rasterizing for display seemed like it might be more promising for wavelets than it proved to be for a block-based transform.

Wavelets are a fascinating topic! And get used in things like digital projectors in movie theaters and the beloved Cineform codec. The intrinsic spatially scalable properties seem like they'd be well suited to a lot of internet applications if temporal prediction bitrate scaled with the subbands...

LMP88959

10th October 2024, 03:29

Yes, perfectly.

Is the deblocking to handle artifacts introduced by the pixel-based motion estimation?

Do you find mixing wavelet-based and pixel-based tools is a challenge for efficient motion prediction? My sense is this has been a big reason that wavelet codecs haven't been competitive for interframe encoding to date.

Yes the deblocking filter is to handle sharp discontinuities introduced by the motion compensation.

From what I've learned, a full-frame wavelet codec like DSV2 is unable to realistically test the entropy of potential predictions since it would be very unreasonable to subtract the prediction for a block, transform+quantize the whole frame, and then estimate the number of bits (or some other metric) that prediction would result in.

DCTs and wavelets are essentially opposites of each other. A DCT coefficient represents a pattern over a certain 2D area whereas a wavelet coefficient represents a 'feature' at a certain point in an area. As a result, what traditional DCT codecs are good at coding may be nightmarish for a wavelet and vice versa.
I say all this to bring attention to the fact that these differences must be taken into consideration when doing ME and mode decision.
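
As a rough illustration of the contrast described above (an editorial sketch, not code from DSV2), the following compares an orthonormal DCT-II with a full Haar decomposition on two 8-sample signals: a single spike and a pure cosine. The helper names `dct2`, `haar`, and `nonzero`, and the two test signals, are assumptions chosen for the example.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a list of samples."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out

def haar(x):
    """Full Haar decomposition; the length must be a power of two."""
    x = list(x)
    details = []
    while len(x) > 1:
        avg = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
        dif = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
        details = dif + details  # coarser levels end up first
        x = avg
    return x + details

def nonzero(coeffs, tol=1e-9):
    """Number of coefficients that are not (numerically) zero."""
    return sum(1 for c in coeffs if abs(c) > tol)

spike = [0, 0, 0, 1, 0, 0, 0, 0]                                # localized feature
cosine = [math.cos(math.pi * (i + 0.5) * 2 / 8) for i in range(8)]  # global pattern

print("spike : DCT", nonzero(dct2(spike)), "Haar", nonzero(haar(spike)))
print("cosine: DCT", nonzero(dct2(cosine)), "Haar", nonzero(haar(cosine)))
```

The spike spreads over all 8 DCT coefficients but needs only 4 Haar coefficients, while the cosine collapses to a single DCT coefficient and spreads over several Haar coefficients.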

Some might call this opinion a stretch, however, I believe H.264's concept of intra prediction is as effective as it is because it essentially combined wavelets and the DCT. Of course, wavelets weren't being directly used, but the removal of a gradient in a block is similar to a wavelet decomposition where you'd have one coefficient in a lower level subband which represents a gradient in the horizontal/vertical/diagonal direction over a certain part of an image.

Many years ago I idly imagined a way to use 3D wavelets to extend into the temporal dimension to get around this. But my approach was so obviously awful I didn't get very far with it ;).

Interestingly enough, I began the DSV project last year in an attempt to make a decent wavelet based video codec after seeing how virtually every codec was either using the DCT or (in the old days) VQ.
When I started the project I originally tried that 3D wavelet idea.
It had a few big issues:
1. Very slow.
2. High memory requirement (relatively)
3. Ineffective at temporal prediction.

Temporally, a Haar wavelet would be the best because it does not introduce much ringing at all but at the same time it's essentially the same as simple frame differencing. Any other wavelet would result in weird ghosting and blurring in the temporal domain.
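
A minimal sketch of why a temporal Haar step amounts to frame differencing, assuming two tiny one-dimensional "frames" and an unnormalized Haar pair (average and difference). The function names and sample values are hypothetical and not taken from DSV.

```python
def temporal_haar(frame_a, frame_b):
    """One Haar step along the time axis for a pair of frames."""
    low = [(a + b) / 2 for a, b in zip(frame_a, frame_b)]   # temporal average
    high = [a - b for a, b in zip(frame_a, frame_b)]        # temporal detail
    return low, high

def inverse_temporal_haar(low, high):
    """Rebuild both frames from the average and the difference."""
    frame_a = [l + h / 2 for l, h in zip(low, high)]
    frame_b = [l - h / 2 for l, h in zip(low, high)]
    return frame_a, frame_b

prev_frame = [10, 10, 200, 200, 10, 10]
next_frame = [10, 10, 10, 200, 200, 10]   # same edge, moved one pixel

low, high = temporal_haar(prev_frame, next_frame)
frame_diff = [a - b for a, b in zip(prev_frame, next_frame)]

print(high)        # the high band ...
print(frame_diff)  # ... is exactly the plain frame difference
```

Because the transform has no notion of motion, a one-pixel shift of the edge shows up as two large values in the high band rather than as a cheap motion vector.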

The Daala experimental approach of keeping everything in the frequency domain for prediction and only rasterizing for display seemed like it might be more promising for wavelets than it proved to be for a block-based transform.

Wavelets are a fascinating topic! And get used in things like digital projectors in movie theaters and the beloved Cineform codec. The intrinsic spatially scalable properties seem like they'd be well suited to a lot of internet applications if temporal prediction bitrate scaled with the subbands...

The idea of doing motion estimation using the wavelet decompositions of both the source and reference frames instead of the pixels themselves sounds interesting, I wonder how well it would work in practice...

Thank you for the great comments and questions!

benwaggoner

11th October 2024, 20:08

Thanks for your feedback! I love hearing about novel codec designs. I hadn't considered wavelets and DCT as inverses of each other. I'll need to spend some shower time ruminating on that.

I had tangential involvement with JPEG XR when I was at Microsoft. It had alternating DCT and wavelet spatial enhancement layers, which I thought was a fascinating approach. A fundamentally ineffective one, though. Some smart people spent a long time trying to make an encoder that would meaningfully outperform JPEG perceptually, which in theory was easy, but in practice wasn't accomplished. If that had worked, doing block-based motion estimation on the DCT layer and only using the wavelets for spatial enhancement could have been an interesting possibility.

Alas, so many good ideas just don't work in the end. The Daala development notes are a fascinating journey through trying lots of novel, clever ideas and having most of them prove impractical.

LMP88959

13th October 2024, 07:06

Of course, and thank you for your curiosity!
With the little experience I have, I can naively say there is a good amount of potential in mixing wavelets and DCTs in a codec. :D
And I agree, the Daala development pages are very interesting and document the various novel techniques quite well. They also often describe the fundamental problems they're trying to solve with each technique which helps newbies like myself understand what a codec needs to do.

LMP88959

18th November 2024, 20:19

Since last time I posted, I've made a few improvements to DSV2.

- Motion estimation uses RMSE instead of SAD
- Better mode decision
- Added a new psychovisual metric. Found in hme.c, block_has_vis_edge(). (may be interesting to a curious reader or other encoder developer out there)
- Replaced the high freq filter with something interesting and relatively unexplored (as far as I know) since ~1990, an asymmetric subband filter.
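
For readers unfamiliar with the two metrics in the first item, here is a small sketch of how SAD and RMSE can rank motion candidates differently. It is not the code from hme.c; the flattened 4-sample "blocks" and the function names are assumptions made for the example.

```python
import math

def sad(block, pred):
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(a - b) for a, b in zip(block, pred))

def rmse(block, pred):
    """Root mean squared error between a block and its prediction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(block, pred)) / len(block))

source = [100, 100, 100, 100]
cand_spread = [108, 108, 108, 108]   # small error on every sample
cand_peaky = [132, 100, 100, 100]    # one large, localized error

for name, cand in (("spread", cand_spread), ("peaky", cand_peaky)):
    print(name, "SAD =", sad(source, cand), "RMSE =", rmse(source, cand))
```

Both candidates have the same SAD, so SAD cannot choose between them, while RMSE penalizes the candidate with the single large error twice as heavily.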

In my opinion, the most fascinating addition by far is the asymmetric subband filtering.
Asymmetric in this context means the encoding is considerably slower than decoding. It is lossy, but such asymmetry is a rare thing in transform based coding where the inverse transform is typically just as complex as the forward transform.

Check out the pull request here:
https://github.com/LMP88959/Digital-Subband-Video-2/pull/9

Quote from the pull request:
Adding a new subband filter (which I'm calling ASF93 for reasons explained below) for increased intra frame decoding performance.
...
This new filter is only applied to level 1 (to create the highest frequency subbands). It is asymmetrical and does not allow for perfect reconstruction.

Subband filtering consists of a FIR filter bank that splits up a signal into different subbands. Wavelets and the Haar transform are all special cases of subband filters.
In subband filtering, the forward transform is called 'analysis' and the inverse transform is called 'synthesis.'

With ASF93, the analysis filters are 9-tap filters. The FIR coefficients were tuned to my personal psychovisual tastes (slightly emphasized low pass features, noisy/ringy but localized high pass features). The synthesis FIR filter coefficients are the standard 3-tap filters used elsewhere in DSV2. The name ASF93 stands for "asymmetric subband filter with 9-tap analysis and 3-tap synthesis."
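
To make the analysis/synthesis terminology concrete, here is a sketch of a two-channel subband split and its inverse. It uses the well-known reversible LeGall 5/3 lifting steps as a stand-in; these are not DSV2's filters and not the ASF93 coefficients, and unlike ASF93 this pair does give perfect reconstruction. The function names and the mirrored boundary handling are assumptions.

```python
def analyze(x):
    """Split an even-length integer signal into low and high subbands."""
    n = len(x)
    half = n // 2
    high = []
    for i in range(half):
        left = x[2 * i]
        right = x[2 * i + 2] if 2 * i + 2 < n else x[2 * i]   # mirror at the end
        high.append(x[2 * i + 1] - ((left + right) >> 1))     # predict step
    low = []
    for i in range(half):
        prev = high[i - 1] if i > 0 else high[0]              # mirror at the start
        low.append(x[2 * i] + ((prev + high[i] + 2) >> 2))    # update step
    return low, high

def synthesize(low, high):
    """Invert analyze() by undoing the lifting steps in reverse order."""
    half = len(low)
    x = [0] * (2 * half)
    for i in range(half):
        prev = high[i - 1] if i > 0 else high[0]
        x[2 * i] = low[i] - ((prev + high[i] + 2) >> 2)       # undo update
    for i in range(half):
        left = x[2 * i]
        right = x[2 * i + 2] if 2 * i + 2 < 2 * half else x[2 * i]
        x[2 * i + 1] = high[i] + ((left + right) >> 1)        # undo predict
    return x

signal = [3, 7, 1, 8, 2, 9, 4, 6]
low, high = analyze(signal)
print(low, high)
print(synthesize(low, high) == signal)
```

In an asymmetric scheme such as the one described in the pull request, the analysis side would use longer filters than the synthesis side, so the round trip would only be approximate.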

Reference for the idea:
Subband image coding with three-tap pyramids by
E H Adelson and E P Simoncelli (1990)

Strangely, I haven't been able to find anything else regarding this concept of asymmetrical subband filtering besides this paper, which is unfortunate. I believe there is some untapped potential here.
I hope this update to DSV2 will help push this effort forward as well as provide a modern use-case.

LMP88959

16th May 2025, 17:02

A new version of this codec has just been released on GitHub!
It consists of about half a year of research and development and brings a plethora of performance and quality improvements!

Even if you have tried the codec before please give it another try, it's nothing groundbreaking or cutting edge but it's different :)

benwaggoner

16th May 2025, 19:30

It's always great to see people taking fundamentally different approaches. We've got the traditional block-based algorithms so refined that so much development is just more refinement of that, while we're not exploring what we could build with equal effort on top of fundamentally different technology.

The industry spent so many years delivering different H.263 refinements. VVC is basically HEVC++.

We don't need to see something immediately competitive out of a different approach for it to be worthy or promising. I wish there was a way to compare different foundation approaches with a similar degree of refinement. It's not that informative to compare something with the latest x265 to see if the fundamental transforms are better. Comparing to an early build of reference encoders that had similar degree of refinement could be an interesting approach. So don't test rate control or frame type selection in a comparison when a novel codec doesn't have those yet.

LMP88959

17th May 2025, 16:36

It's always great to see people taking fundamentally different approaches. We've got the traditional block-based algorithms so refined that so much development is just more refinement of that, while we're not exploring what we could build with equal effort on top of fundamentally different technology.

The industry spent so many years delivering different H.263 refinements. VVC is basically HEVC++.

We don't need to see something immediately competitive out of a different approach for it to be worthy or promising. I wish there was a way to compare different foundation approaches with a similar degree of refinement. It's not that informative to compare something with the latest x265 to see if the fundamental transforms are better. Comparing to an early build of reference encoders that had similar degree of refinement could be an interesting approach. So don't test rate control or frame type selection in a comparison when a novel codec doesn't have those yet.

Thank you! I agree with you.

DCT based codecs were first to the market in the early 90s (and from what I read, easier to write hardware implementations of?) so it makes sense that people took it and expanded on it since it was already established.

Wavelets, on the other hand, seem to have been a topic of interest almost exclusively for researchers from the mid 90s until the 2010s, explored mostly in theory with few practical, fleshed-out implementations. By the late 2000s, any new wavelet codecs (e.g. Snow/Dirac) needed to compete with x264 which, like you said, is a very difficult (and unfair) thing to do given how new both the technology and encoders were.

This, and the fact that DSV has been a one-man project, is why on the DSV page I compare to minih264 (a relatively simple H.264 encoder).

I really appreciate your interest in the project! If you do end up trying it out, I'd love to hear your thoughts on it :)

nhw_pulsar

17th May 2025, 17:41

Thank you! I agree with you.

DCT based codecs were first to the market in the early 90s (and from what I read, easier to write hardware implementations of?) so it makes sense that people took it and expanded on it since it was already established.

Wavelets, on the other hand, seem to have been a topic of interest almost exclusively for researchers from the mid 90s until the 2010s, explored mostly in theory with few practical, fleshed-out implementations. By the late 2000s, any new wavelet codecs (e.g. Snow/Dirac) needed to compete with x264 which, like you said, is a very difficult (and unfair) thing to do given how new both the technology and encoders were.

This, and the fact that DSV has been a one-man project, is why on the DSV page I compare to minih264 (a relatively simple H.264 encoder).

I really appreciate your interest in the project! If you do end up trying it out, I'd love to hear your thoughts on it :)

Hello,

I never got to try your very interesting DSV project (very sorry), nor Dirac or Snow, because I don't test video compression; I am more of an image compression guy. Still, I wanted to note quickly that the excellent Rududu Image Compression (RIC) codec was published in April 2008. At the time I compared it on still images against what was then the best codec, x264 intra (the codec was named UCI codec), and for me RIC was as good as x264 intra (and RIC was notably faster to encode). If I remember correctly, RIC and x264 intra had the same PSNR, and visually it was 50/50 on my test images (I did not test extreme compression).

So sorry for the off-topic. Rududu Image Compression was really a very good wavelet image codec for 2008; it is so sad that its author could not continue work on it... Anyway, Rududu uses the dominant trend in wavelet coding, zerotree/SPIHT. It seems that you don't use that scheme in DSV (nor do I in my codec, NHW)?

Cheers,
Raphael

LMP88959

17th May 2025, 23:59

Hello! I never tried Rududu, but I'll see if I can find it and compile it. I didn't design DSV2 as a still image codec so I'm sure it will not perform as well. I don't use SPIHT/zerotrees as those are very slow and involve multiple passes over the image. I do simple value-coding for speed and it performs decently from what I've seen.

nhw_pulsar

18th May 2025, 09:40

I don't use SPIHT/zerotrees as those are very slow and involve multiple passes over the image.

Hello!

I don't know zerotree/SPIHT techniques, but yes it notably seems "a little" slow to decode....

I do simple value-coding for speed and it performs decently from what I've seen.

Just great!

I really appreciate your new approach of making the "classic" and advanced ME/MC techniques work with wavelets.

Cheers,
Raphael

LMP88959

18th May 2025, 19:39

I don't know zerotree/SPIHT techniques, but yes it notably seems "a little" slow to decode....

Ah, yes they (the normal versions) operate on bit planes rather than on individual coefficients which not only requires extra computation/iterations but also results in poor cache usage. They also require storing lists and iterating multiple times over the bits to determine what is and isn't significant. They are more or less symmetrical operations for encoding/decoding so it's not like this extra time will get you faster decoding or anything. Plus, these are PSNR optimized algorithms which squeeze the entropy quite well out of the wavelet tree but at lower bit rates they look awfully blurry due to the lack of psychovisual consideration.
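
A deliberately simplified sketch of the cost difference described above: scanning coefficients one bit plane at a time versus visiting each coefficient once. Real zerotree/SPIHT coders also maintain significance lists and exploit the tree structure, none of which is modeled here; the function names and coefficient values are hypothetical.

```python
def bitplane_scan(coeffs):
    """Rebuild magnitudes one bit plane at a time, counting visits."""
    planes = max(abs(c) for c in coeffs).bit_length()
    magnitudes = [0] * len(coeffs)
    visits = 0
    for plane in range(planes - 1, -1, -1):        # most significant plane first
        for i, c in enumerate(coeffs):             # full pass over the data
            visits += 1
            magnitudes[i] |= ((abs(c) >> plane) & 1) << plane
    return magnitudes, visits

def value_scan(coeffs):
    """Handle each coefficient's full value in a single pass."""
    magnitudes = [abs(c) for c in coeffs]
    return magnitudes, len(coeffs)

coeffs = [13, -3, 0, 7, 1, 0, -20, 2]
bp_mags, bp_visits = bitplane_scan(coeffs)
v_mags, v_visits = value_scan(coeffs)
print(bp_visits, v_visits)   # 40 visits versus 8
print(bp_mags == v_mags)     # same information either way
```

With a largest magnitude of 20 (five bit planes), the bit-plane scan touches every coefficient five times, which is also where the poor cache behavior on full-size frames comes from.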

Just great!

I really appreciate your new approach of making the "classic" and advanced ME/MC techniques work with wavelets.

Thank you so much, I took a look at your NHW project as well and it looks great as well. :)

nhw_pulsar

18th May 2025, 20:55

Ah, yes they (the normal versions) operate on bit planes rather than on individual coefficients which not only requires extra computation/iterations but also results in poor cache usage. They also require storing lists and iterating multiple times over the bits to determine what is and isn't significant. They are more or less symmetrical operations for encoding/decoding so it's not like this extra time will get you faster decoding or anything. Plus, these are PSNR optimized algorithms which squeeze the entropy quite well out of the wavelet tree but at lower bit rates they look awfully blurry due to the lack of psychovisual consideration.

Thank you for the explanation. I have also read that to squeeze entropy out of zerotree/SPIHT partitioning as well as possible, you need to couple it with context modeling (with arithmetic coding), and context modeling is quite slow...

Personally, I am not an expert, but I like your approach to make a _fast_ but psychovisually-optimized wavelet video codec.

Thank you so much, I took a look at your NHW project as well and it looks great as well. :)

Thank you very much for taking a look at NHW.

Keep up your great work,
Cheers,
Raphael

LMP88959

19th May 2025, 21:44

Thank you Raphael.

@benwaggoner I added lossless coding support as you suggested :)
https://github.com/LMP88959/Digital-Subband-Video-2/pull/13

benwaggoner

30th May 2025, 22:17

Ah, yes they (the normal versions) operate on bit planes rather than on individual coefficients which not only requires extra computation/iterations but also results in poor cache usage. They also require storing lists and iterating multiple times over the bits to determine what is and isn't significant. They are more or less symmetrical operations for encoding/decoding so it's not like this extra time will get you faster decoding or anything. Plus, these are PSNR optimized algorithms which squeeze the entropy quite well out of the wavelet tree but at lower bit rates they look awfully blurry due to the lack of psychovisual consideration.
Yeah, PSNR is very much a first-order quality metric!

As such, it's fine to use in early codec development. But too much focus on PSNR-at-QP optimization can miss opportunities to make a bitstream that is highly efficient with adaptive quantization and other psychovisual improvements.
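
A small sketch of why PSNR is only a first-order metric: it depends on the mean squared error alone, so it cannot tell where the error sits. The 8-bit peak value, the function name, and the sample values are assumptions made for the example.

```python
import math

def psnr(ref, test, peak=255):
    """Peak signal-to-noise ratio in dB between two sample lists."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)

ref = [128, 128, 128, 128]
spread_error = [136, 136, 136, 136]   # mild error everywhere
single_error = [144, 128, 128, 128]   # one clearly wrong sample

print(round(psnr(ref, spread_error), 2))
print(round(psnr(ref, single_error), 2))
```

Both distortions have the same mean squared error and therefore the same PSNR, even though a viewer may judge them quite differently.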

I don't get it when codec developers apologize for "we got 40% subjective improvement, but only 25% PSNR." Subjective improvement is the only thing we care about in distribution! I'd take that over 35% in both PSNR and subjective improvement.

This blind spot wound up being a big issue with VC-1, due to our inefficient RLE differential bitmask approach to adaptive quantization. At internet bitrates, the overhead often ate more bits than the psychovisual improvements could save.

There were some tricks to get around that to a worthwhile degree (like using it only for I-frames and selected P-frames), but they never became the default in anything.

LMP88959

30th May 2025, 22:45

Definitely, I've been using XPSNR as a rough guide during tuning but I always verify with my eyes.

Wavelets got a bad reputation for image/video coding (in my opinion) specifically because they were being researched mainly during the time when PSNR was almost exclusively the target metric. People saw good numbers but bad visual quality and for some strange reason decided to say wavelets were the problem and not the metric itself.

I really enjoy learning about psychovisual phenomena which is why I designed DSV2 with psychovisual considerations baked-in rather than something which was created to optimize for PSNR originally and had psy optimizations added as an afterthought.

Did you work on developing VC-1? I was curious about why the designers chose a cubic half-pel filter (-1,9,9,-1), not sure if you know why that was chosen?
I was using that filter originally but it made subpixel motion extremely blurry (and I wanted to keep the filters at 4-taps max) so I did some R&D on my end and settled on 'smart' temporal switching between two sharper cubic filters which from my testing significantly outperformed the (-1,9,9,-1) filter.
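
For reference, a sketch of half-sample interpolation with the 4-tap (-1,9,9,-1)/16 filter mentioned above, applied to one row of 8-bit samples. The edge clamping and the round-to-nearest offset are assumptions made for the example; this does not reproduce the exact rounding rules of VC-1 or the switched filters DSV2 ended up using.

```python
def half_pel_row(row):
    """Interpolate the half-sample positions between neighboring samples."""
    def px(i):
        # Clamp reads outside the row to the nearest edge sample.
        return row[min(max(i, 0), len(row) - 1)]

    out = []
    for i in range(len(row) - 1):
        acc = -px(i - 1) + 9 * px(i) + 9 * px(i + 1) - px(i + 2)
        value = (acc + 8) >> 4                  # divide by 16 with rounding
        out.append(min(max(value, 0), 255))     # clip to the 8-bit range
    return out

flat = [50, 50, 50, 50]
edge = [0, 0, 255, 255]
print(half_pel_row(flat))   # [50, 50, 50]
print(half_pel_row(edge))   # [0, 128, 255]
```

The negative outer taps cause under- and overshoot next to the edge (-16 and 271 before clipping in this example), which is why the result has to be clipped.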

Thank you!

LMP88959

21st June 2025, 03:59

Small but significant update(s):

1. DSV2 bitstream is now frozen
2. Improved wavelet coefficient entropy coding slightly (generally 1-5% smaller videos for same quality)
3. Improved adaptive quantization

benwaggoner

24th June 2025, 18:56

I really enjoy learning about psychovisual phenomena which is why I designed DSV2 with psychovisual considerations baked-in rather than something which was created to optimize for PSNR originally and had psy optimizations added as an afterthought.
That is the right way to do it!

Did you work on developing VC-1? I was curious about why the designers chose a cubic half-pel filter (-1,9,9,-1), not sure if you know why that was chosen?
I was using that filter originally but it made subpixel motion extremely blurry (and I wanted to keep the filters at 4-taps max) so I did some R&D on my end and settled on 'smart' temporal switching between two sharper cubic filters which from my testing significantly outperformed the (-1,9,9,-1) filter.
I joined Microsoft at the end of 2005, so the VC-1 bitstream was locked down before I was able to make any contributions. I was quite involved in the evolution of VC-1 decoders and real-world VC-1 implementations, though.

Bear in mind WMV9 was released in 2003, and thus was optimized to run well enough on single-core x86-32 MMX CPUs as a baseline, so they had a fraction of the MIPS/pixel that H.264 Main Profile had to work with. The major difference between WMV9 Main and WMV9 Advanced/VC-1 was allowing for adaptive QP on I-frames, which was overlooked in the original implementation. But performance was identical with the same parameters (although VC-1 implementations tended to have overlap transform and loop filter on by default, while WMV9 originally defaulted to them off for decoder performance reasons).

The in-loop deblocking filter was one of the big retrospective regrets by the VC-1 developers. They felt they over optimized on decoder performance and the codec would have been a lot more competitive against H.264 if they'd just used a few more taps so it could do a better job.
Bear in mind that VC-1 also had overlap transform, which also played a role in reducing blocking, and they were presumed to be working together.

The other big regret was implementing differential QP by doing an RLE bitmask of variable length codes of the per-macroblock differential. The actual bitrate overhead of that signaling turned out to be high enough to eliminate the value of using adaptive QP in many low-bitrate cases. H.264 did it a lot more efficiently, so it was a safe always-on feature to use. In more advanced VC-1 encoders, advanced adaptive deadzone techniques without any signaling overhead wound up addressing a lot of what adaptive QP did in other codecs.

LMP88959

24th June 2025, 23:19

The in-loop deblocking filter was one of the big retrospective regrets by the VC-1 developers. They felt they over optimized on decoder performance and the codec would have been a lot more competitive against H.264 if they'd just used a few more taps so it could do a better job.
Bear in mind that VC-1 also had overlap transform, which also played a role in reducing blocking, and they were presumed to be working together.

Very interesting, was the overlap transform an experimental curiosity of the time or a way to avoid patent issues?

benwaggoner

25th June 2025, 20:09

Very interesting, was the overlap transform an experimental curiosity of the time or a way to avoid patent issues?
I joined the team after it was already in there, so I don't know how it was decided to put it in. It did help reduce blocking at higher QPs with less computational overhead for software decoders than in-loop deblocking did. IIRC overlap and loop defaulted to off when encoding in Windows Vista and XP service pack something, but were set to on for Windows 7, as GPU decoders were common and min bar CPUs were faster.

VC-1 was really a well thought-out design for a codec that had to run well on x86-32 with MMX, and was competitive with H.264 Baseline at similar bitrate and software decoder overhead. And available encoders were faster and more robust than the early commercial H.264 encoders like Main Concept. The VC-1 Professional Edition encoder could handle a much broader range of content than early H.264 encoders, like screen recordings with transparent GUI (Vista and Aqua) and cel animation.

But it didn't have enough in the tank to compete with H.264 Main and High profiles, particularly when coupled with the miracle of x264 and how well it leveraged open source and a dedicated community with a wide variety of use cases to test.

And when Microsoft decided that the Windows Media mission was achieved (no more uncapped per unit decoder licenses like MPEG-2), there wasn't any reason to work so hard swimming upstream and the company pivoted to standards-based codecs like AAC and H.264. Microsoft had a really nice early H.264 encoder that was competitive with x264 in quality (but not at speed, as it only had slice-level parallelism while x264 had frame-level). But it was buried in a .dll and never really promoted or had any tools released that used it other than Expression Encoder.

And there wasn't any business justification to fund a team continuously optimizing it at the level the x264 community was providing that.

excellentswordfight

26th June 2025, 15:48

Wavelets got a bad reputation for image/video coding (in my opinion) specifically because they were being researched mainly during the time when PSNR was almost exclusively the target metric. People saw good numbers but bad visual quality and for some strange reason decided to say wavelets were the problem and not the metric itself.

Isn't JPEG2000 a wavelet-based codec? One of the most used DI/mezz codecs in the world.

Cineform was one as well, if I remember correctly. It didn't see that wide adoption, but it was very good, and was one of the best DI options and rather common on Windows until we started to see ProRes support there.

LMP88959

26th June 2025, 17:50

VC-1 was really a well thought-out design for a codec that had to run well on x86-32 with MMX, and was competitive with H.264 Baseline at similar bitrate and software decoder overhead...But didn't have enough in the tank to compete with H.264 Main and High profiles, particularly when coupled with the miracle of x264 and how well it leveraged open source and dedicated community with a wide variety of use cases to test.

And there wasn't any business justification to fund a team continuously optimizing it at the level the x264 community was providing that.

Very interesting history, I rarely hear anything about VC-1 so it's nice to learn more about it, thank you :)

Isn't JPEG2000 a wavelet-based codec? One of the most used DI/mezz codecs in the world.

Cineform was one as well, if I remember correctly. It didn't see that wide adoption, but it was very good, and was one of the best DI options and rather common on Windows until we started to see ProRes support there.

Yes they have use cases, mainly for mezzanine / intra-only high resolution content, because they outperform block based DCT codecs due to the inherent lack of scalability of a fixed block sized transform.

I should have specified 'low bit rate image/video coding' in my original comment because that's generally where people felt wavelets looked worse.

benwaggoner

26th June 2025, 20:35

Isn't JPEG2000 a wavelet-based codec? One of the most used DI/mezz codecs in the world.
And in digital cinema. All our theaters are playing back 12-bit J2K in log xyz.

Cineform was one as well, if I remember correctly. It didn't see that wide adoption, but it was very good, and was one of the best DI options and rather common on Windows until we started to see ProRes support there.
Yeah, I was a Cineform stan back in the day. It made HD editing on a laptop viable almost 20 years ago. IIRC, it was a simplified wavelet with IBIB encoding. The team that made it was brilliant, and went on to do some other cool stuff in the field that I am spacing on for the moment. Fun folks to hang out with as well.

benwaggoner

26th June 2025, 20:55

Very interesting history, I rarely hear anything about VC-1 so it's nice to learn more about it, thank you :)
I spent a few years as a VC-1 and Windows Media Evangelist. Feel free to ask me all the questions. My book linked in my sig is almost 16 years out of date, which is a feature if you want some serious VC-1 deep dives (and the first half about video and encoding fundamentals, and preprocessing is evergreen stuff).

The coolest use for VC-1 was in the Xbox 360 video streaming service. It did 1080p variable GOP VBR adaptive bitrate streaming in 2009! Due to the loop filter and overlap transform issues in VC-1, even 10 Mbps could get blocky at 1080p with some stressful content. And once you had a blocky P-frame, the rest of the GOP was going to look bad. So they built analysis of the motion vectors and frequency distribution horizontally and vertically. Then they would apply anamorphic spatial compression per fragment, up to the largest frame size that would maintain the target maximum QP. That worked well psychovisually, as it would tend to compress along the axis of motion, where there was motion blur and less resolution was needed to maintain detail!

This got around the classic compression conundrum of having to pick a static frame size that didn't look too terrible for high complexity content while not being overkill for static content which would have been fine with higher resolution (and benefits the most from it).

It was also helpful for software decode (and that's all there was in the 360/PS3 generation), as the more complex the motion compensation was for a frame, the fewer pixels needed to have it applied. That also left more compute to apply CPU load dynamic out-of-loop deblocking and deringing post processing.

It was a great technique that didn't really get picked up after that. Modern codecs with their stronger in-loop deblocking, SAO, and similar in-loop artifact suppression features can "recover" from a QP spike better, as a high QP frame can still make an okay reference frame. And support for bigger block sizes has meant that the required bitrate increase for a given resolution increase is relatively smaller, so it's safer to err on the side of higher resolutions. Still, Yuri from Brightcove demoed 10% savings even with HEVC using a similar technique at SMPTE MTS in 2023. Film grain was the primary limiting factor in practical use.

Yes, they have use cases, mainly for mezzanine / intra-only high resolution content, because they outperform block-based DCT codecs due to the inherent lack of scalability of a fixed-block-size transform.

I should have specified 'low bit rate image/video coding' in my original comment because that's generally where people felt wavelets looked worse.
Yeah, wavelets are very compression-efficient for intra-only encoding. Where block-based transforms have really outcompeted is with motion compensation, which is very straightforward to do with blocks. But not with wavelets. I've seen all sorts of attempts to do good motion compensation with wavelets, but they always felt bolted on, not something that could be integrated with the core transform like with typical codecs.

Fingers crossed someone will have the genius idea to make it work someday. I've thought I had it a half dozen times myself ;)! But "temporal wavelets" just never gel into anything plausibly practical.

Thanks for great excuses to go down codec memory lane!

I encoded my first digital video file back in 1989, so this has truly been my life's work.

Emulgator

27th June 2025, 00:16

Nice reading, Ben !
Going through the Blu-ray editions I bought I am still happy with the natural detail retention vs. bitrate over all the VC-1 encodes I've seen.
It is just good codec R&D.
I came to value .wmv much later, after H.264 indeed, as more and more bitrate-starved blotchy H.264 videos came into my task
and after all one would come to the conclusion that a 1Mbps .wmv would look less maimed.
I just would have wished an earlier Expression Encoder release and I would have paid for that.
The release strategy around VC-1 was a bit unclear for me, but well, I was late to this party anyway.

LMP88959

27th June 2025, 17:57

I spent a few years as a VC-1 and Windows Media Evangelist. Feel free to ask me all the questions. My book linked in my sig is almost 16 years out of date, which is a feature if you want some serious VC-1 deep dives (and the first half about video and encoding fundamentals, and preprocessing is evergreen stuff).

Fascinating! I need to read more about VC-1. It seems to have a cool set of features along with some niche use-cases which is always fun to see.

Yeah, wavelets are very compression-efficient for intra-only encoding. Where block-based transforms have really outcompeted is with motion compensation, which is very straightforward to do with blocks. But not with wavelets. I've seen all sorts of attempts to do good motion compensation with wavelets, but they always felt bolted on, not something that could be integrated with the core transform like with typical codecs.

Fingers crossed someone will have the genius idea to make it work someday. I've thought I had it a half dozen times myself ;)! But "temporal wavelets" just never gel into anything plausibly practical.

Thanks for great excuses to go down codec memory lane!

I encoded my first digital video file back in 1989, so this has truly been my life's work.

Yeah motion compensation is a tough problem, I don't think wavelets, or even current block transforms, are the best solution to it. I think the biggest issue with P-frame coding is the proliferation of artifacts at lower bit rates where all of these transforms end up adding noise or ringing to the compensated image, thus making more residual for the next frame to compensate. This is why I use a 2x2 Haar transform for P-frames in DSV. It's fast, it doesn't ring, but it doesn't compress too well either.
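To make the "it doesn't ring" point concrete, here is a minimal sketch of a 2x2 Haar transform. It is not taken from the DSV2 source: the function names, the unnormalized scaling and the subband ordering are assumptions made for illustration only.

```python
def haar2x2_forward(a, b, c, d):
    """Transform one 2x2 block [[a, b], [c, d]] into four subband values."""
    ll = a + b + c + d  # low-pass: block sum (4x the average)
    hl = a - b + c - d  # horizontal detail
    lh = a + b - c - d  # vertical detail
    hh = a - b - c + d  # diagonal detail
    return ll, hl, lh, hh


def haar2x2_inverse(ll, hl, lh, hh):
    """Exact inverse of haar2x2_forward for unquantized coefficients."""
    a = (ll + hl + lh + hh) // 4
    b = (ll - hl + lh - hh) // 4
    c = (ll + hl - lh - hh) // 4
    d = (ll - hl - lh + hh) // 4
    return a, b, c, d


def haar_plane(plane):
    """Apply the 2x2 transform to every block of a plane with even dimensions.

    Returns four half-size subbands (LL, HL, LH, HH) as lists of rows.
    """
    h, w = len(plane), len(plane[0])
    bands = [[[0] * (w // 2) for _ in range(h // 2)] for _ in range(4)]
    for y in range(0, h, 2):
        for x in range(0, w, 2):
            coeffs = haar2x2_forward(plane[y][x], plane[y][x + 1],
                                     plane[y + 1][x], plane[y + 1][x + 1])
            for band, value in zip(bands, coeffs):
                band[y // 2][x // 2] = value
    return bands
```

Each coefficient depends only on the four pixels of its own block, so an error introduced by quantizing a coefficient cannot spread past that 2x2 block. Longer wavelet filters overlap neighbouring samples, which is where ringing around edges comes from.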

3D DCT and 3D wavelets have been tried and always fall flat mainly due to how inefficient they are at coding similarities between adjacent frames. Block matching is both fast and effective since similarities between frames are not always able to be described by a curve or function.
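For readers unfamiliar with block matching, the idea can be sketched as an exhaustive integer-pel search that minimizes the sum of absolute differences (SAD). This is a generic textbook version, not DSV2's motion estimation; the block size, search range and function names are assumptions.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block in `cur` and a displaced block in `ref`."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(size)
        for x in range(size)
    )


def full_search(cur, ref, bx, by, size=4, search=2):
    """Return ((dx, dy), cost) of the best integer-pel match within +/- `search` pixels."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that would read outside the reference frame.
            if not (0 <= bx + dx <= width - size and 0 <= by + dy <= height - size):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

The search makes no assumption about how the content moved. It only asks which displaced patch of the previous picture looks most like the current block, which is why it copes with motion that no smooth curve or function would describe.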

What did you use to encode your first digital video file by the way?

FranceBB

29th June 2025, 23:51

And in digital cinema.

Yep. And it's already been 2 and a half years since I wrote "The Ultimate Avisynth DCP (Digital Cinema Package) creation guide (https://forum.doom9.org/showthread.php?p=1979437)" here on Doom9 describing exactly that. One of the funny things about JPEG2000 and its all-intra nature is that it still scales up nicely after so many years. Basically, being all intra, each frame of the source is a .tiff picture that gets fed to the OpenJPEG encoder to create a .j2k for each frame. Those then get appended together to create the actual JPEG2000 stream, and that stream is muxed into the MXF container. This lets you scale almost as much as you want, since you can encode as many frames in parallel as the CPU has cores. I've recently encoded every episode of The Last of Us for the marathon of the first series that was gonna be shown at the cinema leading up to the first episode of the second series and I was in a bit of a hurry, so I just set the number of threads to 56 (I have a 56c/112th Xeon in one of my servers) and it scaled up pretty nicely (https://i.imgur.com/nFdjDtw.png). :D
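The reason all-intra encoding parallelizes so easily can be sketched in a few lines: no frame depends on any other, so frames can be handed to a pool of workers and the results concatenated in order. The sketch below uses `zlib.compress` purely as a stand-in for a real per-frame encoder such as OpenJPEG, and `encode_frame` / `encode_sequence` are hypothetical names. A real pipeline would more likely launch one encoder process per frame.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def encode_frame(raw_frame):
    """Stand-in for an intra-only encoder (one source picture in, one coded frame out)."""
    return zlib.compress(raw_frame)


def encode_sequence(frames, workers=4):
    """Encode independent frames in parallel; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_frame, frames))


def build_stream(coded_frames):
    """Append the coded frames back to back, as the post describes for the .j2k files."""
    return b"".join(coded_frames)
```

With inter-frame codecs this is not possible in the same way, because each P-frame needs the reconstructed previous picture before it can be encoded.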

3D DCT and 3D wavelets have been tried and always fall flat mainly due to how inefficient they are at coding similarities between adjacent frames.

There have also been experiments with the KLT (Karhunen-Loève Transform) back when the first H.265 HEVC proposals were still being evaluated and before ending up with the actual DST and DCT. Ultimately, although the KLT was supposed to be optimal in many scenarios, its overall computational cost and lack of any known "fast" algorithm made it so that it wasn't worth pursuing.

a.ok.in

30th June 2025, 11:53

There have also been experiments with the KLT (Karhunen-Loève Transform) back when the first H.265 HEVC proposals were still being evaluated and before ending up with the actual DST and DCT. Ultimately, although the KLT was supposed to be optimal in many scenarios, its overall computational cost and lack of any known "fast" algorithm made it so that it wasn't worth pursuing.

AFAIK the KLT was proposed even before H.264, and it is currently being tested in the experimental AV2 codec.

benwaggoner

1st July 2025, 03:09

Nice reading, Ben !
Going through the Blu-ray editions I bought I am still happy with the natural detail retention vs. bitrate over all the VC-1 encodes I've seen.
It is just good codec R&D.
That is part of it.

Another secret weapon of VC-1 for Blu-ray was xscaler, a command line utility written by Spears and Munsil back when they worked at Microsoft. It was a really advanced, very tweakable dithering tool to get down from 10+ bit mezzanine sources to the 8-bit 4:2:0 of original flavor Blu-ray. And it was bundled with the VC-1 professional encoding tools, and the Blu-ray compressionists were trained in using it.

The best codec in the world can't fix upstream banding or dithering issues, so that really helped. Of course, the tool was codec agnostic, and got used by some of the same compressionists after they switched to H.264 disc authoring.

I came to value .wmv much later, after H.264 indeed, as more and more bitrate-starved blotchy H.264 videos came into my task
and after all one would come to the conclusion that a 1Mbps .wmv would look less maimed.
I just would have wished an earlier Expression Encoder release and I would have paid for that.
The release strategy around VC-1 was a bit unclear for me, but well, I was late to this party anyway.
The release strategy around VC-1 was incredibly chaotic, buffeted in the winds of Ballmer-era Microsoft corporate dysfunction! Strategies around VC-1, codecs, digital media, Windows Media, and media in Windows changed roughly every six months from 2006-2011. Just getting the stuff released that we did required a lot of skunkworks, favor trading, and begging forgiveness instead of asking permission. Getting the last service pack of Expression Encoder out the door took heroic efforts by people who cared about the product deeply. Silverlight almost had hardware-accelerated, DRM-protected multi-codec decode in 2011, but it never shipped because it was estimated to require a few weeks of testing by people who had been reassigned to Windows Phone.

Literally everything I worked on in my six years at Microsoft had been cancelled before I left, often in some incredibly stupid and customer-harmful ways. My book and aspects of MPEG-DASH are about all I did there that matters anymore.

At Amazon I'm still iterating on stuff I started my first day 13 years ago. "Customer Obsession" is real.

Emulgator

1st July 2025, 22:27

Many thanks for your open words, Ben.

FranceBB

2nd July 2025, 05:20

It was a really advanced, very tweakable dithering tool to get down from 10+ bit mezzanine sources to the 8-bit 4:2:0 of original flavor Blu-ray. And it was bundled with the VC-1 professional encoding tools, and the Blu-ray compressionists were trained in using it.

Having the dithering method and the encoder work together can do miracles, quite literally. Bit depth conversion and dithering is a problem in itself: make it too widespread, complex and changing and the encoder won't find the references and will require a lot of bitrate to encode the stream. Make it too static and recognizable (like in the normal ordered dither) and your brain will also detect the pattern. In the x264 days, the Sierra 2-4A method was introduced directly within the encoder, as it supported being fed with 16bit stacked (aka "double height" with MSB / LSB stacked one on top of the other) and 16bit interleaved (aka "double width" with MSB / LSB interleaved together), with support for normal 16bit planar added later on. Back then a lot of optimisation went into it so that this kind of dithering could be recognised and efficiently compressed compared to other patterns, and the results were like night and day. To see how bad the situation was before that (leaving aside Xvid, as it was never used for professional authoring and therefore rarely had higher than 8bit sources as an input anyway), one could use either the lavc open source MPEG-2 encoder or x262 and try to use dithering. Back in those days dithering had to be done in the frameserver outside of the encoder and there were three main dithering methods available aside from the usual ordered dithering: Stucki, Atkinson and the evergreen Floyd-Steinberg. Unfortunately, regardless of which one you picked, it would take an insane amount of bitrate for either of those two MPEG-2 encoders to avoid completely destroying the gradients in the background. For instance, for complicated titles like House of the Dragon (which I had to encode recently), after going through the Floyd-Steinberg error diffusion, it took the encoder 85 Mbit/s (XDCAM-85) in FULL HD 1920x1080 25fps 4:2:2 yv16 M=3 N=12 8bit BT709 to avoid destroying the background.
Limiting it to the classic 50 Mbit/s (XDCAM-50) would make it completely destroy any dark background, which is pretty annoying if the majority of the show has problematic low lights with several individual candles and other changing things making up the majority of the lighting in the scene. x265 built on the same concept x264 used and introduced two dithering methods: one that can be enabled via --dither and a basic one that is used otherwise, but in both cases the encoder should be able to detect the pattern and encode it efficiently. On the other hand, with 10bit becoming the standard, we've seen 8bit being used less and less, so it's gonna be interesting to see if x266 is gonna use the same concept. After all, dithering can still be used to go from 16bit to 10bit, so I still think it has useful applications and it's important to have those patterns detectable and encodable in a compression-friendly way in an encoder. We'll see.

Blue_MiSfit

2nd July 2025, 05:39

Many thanks for your open words, Ben.

QFT. Many thanks for all your contributions

benwaggoner

2nd July 2025, 17:43

Having the dithering method and the encoder work together can do miracles, quite literally.
Definitely! And heck, I don't know why we convert to 8-bit input pixels when we're going to convert back to a floating-point-like internal representation anyway. Just use the higher precision to get more accurate quantization.

That said, xscaler wasn't codec aware at all. It was just very good and tunable for its era. Parametrized Floyd-Steinberg and such. We have access to at least as good stuff today.
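As a rough illustration of what a parametrized Floyd-Steinberg bit depth reduction looks like, here is a minimal 10-bit to 8-bit sketch. It is not xscaler's algorithm, whose internals are not described here. The `strength` parameter and the function name are assumptions, chosen only to show one way such a filter can be made tunable.

```python
import math


def dither_10_to_8(plane, strength=1.0):
    """Reduce a 10-bit plane (rows of ints, 0..1023) to 8 bits with error diffusion.

    strength=1.0 is classic Floyd-Steinberg; strength=0.0 degenerates to plain rounding.
    """
    height, width = len(plane), len(plane[0])
    work = [[float(v) for v in row] for row in plane]
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            # Round to the nearest 8-bit code and clamp to the legal range.
            q = min(255, max(0, math.floor(old / 4.0 + 0.5)))
            out[y][x] = q
            # Quantization error in 10-bit units, scaled by the tuning parameter.
            err = (old - q * 4.0) * strength
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16
    return out
```

On a flat 10-bit area whose value falls between two 8-bit codes, plain rounding produces a single code and therefore a visible band edge where the rounding flips. With diffusion the output alternates between the two neighbouring codes so that the local average stays close to the original level.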

Bit depth conversion and dithering is a problem in itself: make it too widespread, complex and changing and the encoder won't find the references and will require a lot of bitrate to encode the stream. Make it too static and recognizable (like in the normal ordered dither) and your brain will also detect the pattern. In the x264 days, the Sierra 2-4A method was introduced directly within the encoder, as it supported being fed with 16bit stacked (aka "double height" with MSB / LSB stacked one on top of the other) and 16bit interleaved (aka "double width" with MSB / LSB interleaved together), with support for normal 16bit planar added later on. Back then a lot of optimisation went into it so that this kind of dithering could be recognised and efficiently compressed compared to other patterns, and the results were like night and day. To see how bad the situation was before that (leaving aside Xvid, as it was never used for professional authoring and therefore rarely had higher than 8bit sources as an input anyway), one could use either the lavc open source MPEG-2 encoder or x262 and try to use dithering. Back in those days dithering had to be done in the frameserver outside of the encoder and there were three main dithering methods available aside from the usual ordered dithering: Stucki, Atkinson and the evergreen Floyd-Steinberg. Unfortunately, regardless of which one you picked, it would take an insane amount of bitrate for either of those two MPEG-2 encoders to avoid completely destroying the gradients in the background. For instance, for complicated titles like House of the Dragon (which I had to encode recently), after going through the Floyd-Steinberg error diffusion, it took the encoder 85 Mbit/s (XDCAM-85) in FULL HD 1920x1080 25fps 4:2:2 yv16 M=3 N=12 8bit BT709 to avoid destroying the background.
Limiting it to the classic 50 Mbit/s (XDCAM-50) would make it completely destroy any dark background, which is pretty annoying if the majority of the show has problematic low lights with several individual candles and other changing things making up the majority of the lighting in the scene. x265 built on the same concept x264 used and introduced two dithering methods: one that can be enabled via --dither and a basic one that is used otherwise, but in both cases the encoder should be able to detect the pattern and encode it efficiently. On the other hand, with 10bit becoming the standard, we've seen 8bit being used less and less, so it's gonna be interesting to see if x266 is gonna use the same concept. After all, dithering can still be used to go from 16bit to 10bit, so I still think it has useful applications and it's important to have those patterns detectable and encodable in a compression-friendly way in an encoder. We'll see.
Yeah, decent dithering is still valuable in 10-bit, particularly HDR 10-bit where we're not even using the full 64-940 range anyway.

benwaggoner

2nd July 2025, 17:44

QFT. Many thanks for all your contributions
Happy to provide. I imagine the codec historians of the future will treasure the archive.org logs of Doom9!

And everyone is always welcome to ask more questions of this sort. Want to hear about MacroMind Director Accelerator RLE encoding circa 1989 ;)?

LMP88959

8th July 2025, 03:34

Tiny little update I added to address an issue I noticed at lower bit rates when encoding with a longer GOP length.
Basically, the encoder keeps track of which blocks in the GOP have been marked as intra at some point and if there are too many that have been marked intra, the encoder inserts an intra-frame.
Blocks that are unmoving are double counted since those are more noticeable to the viewer.

https://github.com/LMP88959/Digital-Subband-Video-2/pull/19
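The heuristic described in this post can be sketched roughly as follows. This is one reading of the description, not the code from the pull request: the class name, the per-block weighting and the `threshold` fraction are all assumptions.

```python
class IntraTracker:
    """Tracks which blocks have been coded as intra at some point in the current GOP."""

    def __init__(self, num_blocks, threshold=0.5):
        self.num_blocks = num_blocks
        self.threshold = threshold  # hypothetical fraction of the block count
        self.weights = [0] * num_blocks

    def reset(self):
        """Call when a new GOP starts (after an intra frame is emitted)."""
        self.weights = [0] * self.num_blocks

    def update(self, intra_flags, static_flags):
        """Record one inter frame: which blocks were intra, and which are unmoving."""
        for i, (is_intra, is_static) in enumerate(zip(intra_flags, static_flags)):
            if is_intra:
                # Unmoving blocks count double, since they are more noticeable.
                weight = 2 if is_static else 1
                self.weights[i] = max(self.weights[i], weight)

    def should_insert_intra_frame(self):
        """True once the weighted count of intra-marked blocks exceeds the threshold."""
        return sum(self.weights) > self.threshold * self.num_blocks
```

A caller would invoke `update` after mode decision for each inter frame, and force the next frame to be an intra frame (followed by `reset`) when `should_insert_intra_frame` returns true.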

LMP88959

13th September 2025, 02:39

Huge encoder update! (encoder version 14)

- statistics tracking and reporting
- 'sfr' argument now works with y4m inputs
- intra frame psy improvements

The biggest changes were to motion estimation:
- improved motion vector RDO
- added dynamic psy-based block difference metric
- improved inter/intra/skip mode selection
- added a ton of estimation candidate vectors + an extra subpel test

I have updated the example encodes / comparisons in the GitHub README, please check them out!
https://github.com/LMP88959/Digital-Subband-Video-2/

CruNcher

13th September 2025, 12:48

another waveleto

Hmm, after Rududu, Dirac, Snow and JPEG 2000, a lot of new movement ;)

The Chinese are also investing very actively in this area, improving on it with their deep learning capabilities for geospatial purposes (which is also a codeword for land reconnaissance UAVs in military/space terms).

Funny thing about VC-1: it was very successful with some people who preferred its skin detail retention. Now we can see that AV1 has improved on everything VC-1 did less efficiently.

BTW, Ben, see who found his way to Microsoft:

https://patents.google.com/patent/US20200169750A1/en?inventor=Sergey+Sablin

TSU->Elecard/Mainconcept->Aspex Semiconductor->Microsoft(Skype)->Meta(Facebook)->AV1

https://gitlab.com/users/ssablin/activity

https://gitlab.com/AOMediaCodec/SVT-AV1/-/merge_requests/2507