16th January 2012, 11:00 | #561
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
For Nvidia HW, why isn't it possible to accelerate MPEG4-ASP in DXVA (direct or frame copy)? Why do you have to use LAV CUVID?

Especially with Intel's MSDK implementation and 60fps VC-1 clips, there is the same problem of Turbo CPU frequency as with 60fps H.264 files, even in normal playback mode.
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) - HEVC decoding benchmarks - H.264 DXVA Benchmarks for all

16th January 2012, 11:10 | #562
Fantasy Codecs writer
Join Date: Nov 2007
Location: Yang Zhou,Jiang Su,China
Posts: 392
16th January 2012, 11:13 | #563
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,347
DXVA is also possible.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders

16th January 2012, 11:17 | #564
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
Nobody has done it yet.
Including you.
16th January 2012, 11:19 | #565
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,347
And I already said why I didn't do it yet: ffmpeg doesn't support MPEG4 DXVA2 yet, and implementing that is quite a bit of work for very little benefit. It is however planned for some time in the future.
Also, my DXVA2 decoder is only a few days old; I rather focused on issues like VC-1 interlaced DXVA, which required you to use a commercial DXVA decoder up to now. :D
Last edited by nevcairiel; 16th January 2012 at 11:22.

16th January 2012, 11:23 | #566
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
So, PotPlayer's (which is based on FFmpeg) UVD3 support for MPEG4-ASP VLD is just work of their own?
Or did FFmpeg implement MPEG4-ASP only for ATI's HW?

BTW, the DivX codec itself has DXVA MPEG4-ASP support only for UVD3.

Update: PotPlayer is free and has been able to accelerate DXVA VC-1 Interlaced for at least a year already!
16th January 2012, 11:33 | #568
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,347
It's just one example of why their attitude is not really productive. They take open source code, then add their own features, and claim to have more features than everyone else. If you base your work on open source, it's mandatory to also contribute any changes back to the project, or at least make the changes available to the public. It's not only a "nice thing to do", it's also required by copyright law!
Last edited by nevcairiel; 16th January 2012 at 11:42.

16th January 2012, 11:40 | #569
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
So, they look like the "bad guys" of the Open Source community.
They take things from others, but give nothing back.

I don't know if that's true - I have heard it from others too - but I do know that they have the most complete video player out there, especially regarding DXVA video codec support for all HW (ATI, Nvidia, Intel).

I consider PotPlayer and LAV Filters among the best "new" free multimedia software (MPC-HC and FFmpeg are the "grandfathers").
Last edited by NikosD; 16th January 2012 at 11:44.

16th January 2012, 14:41 | #570
QuickSync Decoder author
Join Date: Apr 2011
Location: Atlit, Israel
Posts: 916
Assumption: an image is a discrete representation of the continuous world. This means that pixels (samples) are integrals over an area in the real world, similar to audio samples, which represent an integral over time.

Scaling, AKA resampling, can be described as converting the discrete samples to a continuous signal and taking the value of (actually integrating) the signal at new positions. If you create more sample points from the continuous signal, you up-scale the image (more pixels); if you sample fewer points, you're performing down-scaling.

Signal processing theory describes the process (some signal processing knowledge required):
* Create a continuous signal from the discrete samples. This is done by adding zeros between the samples; the continuous signal is all zeroes with spikes where the samples were.
* Low-pass the signal (weaken or eliminate high frequencies).
* Sample the values of the low-passed signal at the new positions.

Actual resampling implementations (nearest neighbor, bi-linear, bi-cubic, Lanczos) do just that. Instead of integrating a signal most of which is zero, you can simply sample the low-pass function and multiply the sampled values with the original pixels. When down-scaling, the low-pass function must be designed so it removes high frequencies that cannot be represented in the output image. Every discrete signal has a Nyquist frequency, which is half its size (one for horizontal and one for vertical).

From a signal processing point of view, the perfect low-pass filter is a sinc (sin(x)/x). A sinc will clip all high frequencies and retain the amplitude (strength) of the low frequencies. For performance reasons the number of samples used to derive a new pixel value is limited; this is called the sampling window width. For down-scaling this would be perfect if all the pixels in the input image were used to create each and every output pixel. For up-scaling, things are not so easy.
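The steps above can be sketched in a few lines of Python. This is an illustrative toy, not any player's actual code; the function names, the kernel-widening rule for down-scaling, and the clamp-at-borders policy are my own choices:

```python
import math

def lanczos(x, a=3):
    # Windowed sinc: sinc(x) * sinc(x/a) for |x| < a, zero outside the window.
    if x == 0.0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)

def resample_1d(samples, out_len, a=3):
    # Map each output position back to the source, sample the kernel at the
    # fractional distances to nearby input pixels, and take the weighted sum.
    in_len = len(samples)
    scale = in_len / out_len
    stretch = max(scale, 1.0)  # when down-scaling, widen the kernel to lower its cutoff
    support = int(math.ceil(a * stretch))
    out = []
    for i in range(out_len):
        center = (i + 0.5) * scale - 0.5  # source coordinate of output sample i
        left = int(math.floor(center)) - support + 1
        acc = wsum = 0.0
        for j in range(left, left + 2 * support):
            w = lanczos((center - j) / stretch, a)
            acc += w * samples[min(max(j, 0), in_len - 1)]  # clamp at the borders
            wsum += w
        out.append(acc / wsum)  # normalize so the weights sum to one
    return out
```

Running the same 1D pass over rows and then columns gives the usual separable 2D scaler.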
Using a sinc, or a modified, trimmed version of it (Lanczos), will result in ripples near edges. This is unpleasing to the eye (false edges and mosquito noise). A variety of sampling functions exist; they are always a compromise between performance, sharpness and artifacts:
* Lanczos - the sharpest. Exhibits strong edge artifacts. The more taps used (the sampling window size), the sharper the output, with more artifacts.
* Bi-cubic - fewer artifacts, less sharp.
* Bi-linear - not sharp, geometric artifacts.
* Nearest neighbor - sharp, heavy geometric artifacts.

There are some sampling algorithms that work a little differently. They can guess the value of missing samples by some kind of heuristics or statistics (e.g. the NEDI algorithm). They are computationally very heavy and the results are not worth the effort.

SandyBridge's advanced video scaler has a different approach: a context adaptive scaler. It uses a Lanczos4 scaler (8 taps) in order to create very sharp images. In order to avoid most of the artifacts, it performs an analysis of the area and blends between the sharp scaler and a smooth scaler, depending on whether the analysis decided the target pixel is prone to artifacts. Context adaptive scaling is not a new idea, but this implementation's quality and performance are probably among the best. Some companies perform context adaptive scaling using a different paradigm - use a soft scaler like bi-cubic and run a post-processing sharpness filter on edges that were very strong in the source image.

BTW, these tricks are used only for upscaling. For downscaling, the optimal filter is Lanczos for a given sampling window size.

Regarding luma and chroma scaling: luma is the grey levels of the image (called Y) and chroma is the color information (called UV or CbCr). In the YUV color space, which most videos are encoded in, the UV color components are usually at a lower resolution and thus not fully aligned with the luma (Y) component.
A scaler algorithm must make sure that chroma scaling produces a pleasing result. Most of the time, chroma values are resampled using a softer scaler (a bi-cubic variant).

MadVR currently implements a wide variety of scaling algorithms, all of them well-known textbook algorithms, and allows selecting different algorithms for Y and UV scaling so the user can get the results he/she likes best. Since there is usually a trade-off between sharpness and various artifacts, some users will sacrifice one for the other.
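As a small concrete illustration of the softer chroma treatment: a toy 2x chroma upsampler using a plain linear filter with 3/4 and 1/4 weights. The interstitial chroma siting is an assumption made for this example (real siting depends on the video standard), and the function name is mine:

```python
def upsample_chroma_2x(row):
    # Double one chroma row with a soft linear filter. Assumes each chroma
    # sample sits midway between the two output positions it covers
    # (interstitial siting), which gives 3/4 and 1/4 weights; edges are clamped.
    n = len(row)
    out = []
    for c in range(n):
        prev = row[max(c - 1, 0)]
        nxt = row[min(c + 1, n - 1)]
        out.append((3 * row[c] + prev) / 4.0)
        out.append((3 * row[c] + nxt) / 4.0)
    return out
```

A soft filter like this avoids ringing on sharp color transitions at the cost of some chroma resolution, which is usually the right trade-off since the eye is less sensitive to chroma detail.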
__________________
Eric Gur, Processor Application Engineer for Overclocking and CPU technologies, Intel Corp. Intel QuickSync Decoder author. Last edited by egur; 16th January 2012 at 16:01.
16th January 2012, 16:07 | #571
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
But then, I think it is extremely easy to implement a "direct" DXVA decoder through the MSDK, just by sending the decoded frame directly to the EVR renderer. So I think the "direct" (through MSDK) DXVA decoder can be implemented earlier and more easily than the video scaling procedures.
16th January 2012, 16:16 | #572
Registered User
Join Date: Apr 2002
Location: Germany
Posts: 4,926
__________________
all my compares are riddles so please try to decipher them yourselves :) It is about Time Join the Revolution NOW before it is to Late ! http://forum.doom9.org/showthread.php?t=168004 Last edited by CruNcher; 16th January 2012 at 16:44.

16th January 2012, 16:32 | #573
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,347
It's not that easy; there are a number of annoying factors to deal with. Personally, I think the copy-back solution is easier, which is why I started with it.
16th January 2012, 16:51 | #574
Registered User
Join Date: Oct 2011
Posts: 108
@egur
Thanks for your reply, I appreciate it a lot. I want to clear something up: will SB scaling work on hybrid systems, and what are the conditions for it? For example, there is no real display connected to my Intel HD graphics; I made a fake one, like you suggested here.
16th January 2012, 17:05 | #575
Registered User
Join Date: Apr 2002
Location: Germany
Posts: 4,926
@Eric
As this is one of your professions, I would like to advise you that we also have Robidoux, Madshi, Tritical and others on Doom9 who research in that field over in the Avisynth area (I know you are fully occupied with the decoder, but maybe as soon as you get to the scaling implementation you could say hello):
http://forum.doom9.org/showthread.php?t=160038
http://forum.doom9.org/showthread.php?t=145358
http://forum.doom9.org/showthread.php?t=160610
http://forum.doom9.org/showthread.php?t=154143

Would really love to see your implementation's results, especially the speed of being native ASIC and not shader based, though copy-back will hurt that again. But yeah, I voted for Deinterlacing + IVTC; those are more important for now.
Last edited by CruNcher; 16th January 2012 at 19:19.

16th January 2012, 20:42 | #577
QuickSync Decoder author
Join Date: Apr 2011
Location: Atlit, Israel
Posts: 916
@CruNcher
SandyBridge's Advanced Video Scaler (AVS) is a programmable fixed-function scaler (ASIC), used by the EVR and by renderers from Cyberlink, ArcSoft and maybe other companies. I've confirmed that the Media SDK uses the AVS for scaling. Older GPUs had simpler scalers.

I didn't release a paper/patent since the actual implementation (the analysis part) is a trade secret. But again, context adaptive scaling (or context adaptive algorithms in general) is not new.

The performance of the AVS will vary with GPU clock speed, but it can do several 1080p60 streams simultaneously.

The best way to test upscaling is by scaling DVD resolution to 1080p (720p --> 1080p is a small scale factor). Downscaling can be checked by shrinking the player/renderer window and playing test patterns - look for aliasing.

@RBG
EVR will use the video processing features (DI, scaling, etc.) available on the GPU connected to the screen showing the video. So with hybrid setups, you get what AMD/Nvidia gives you.
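On the "play test patterns and look for aliasing" suggestion: a classic pattern for this is the zone plate, whose spatial frequency grows with distance from the center, so any aliasing in a downscaler shows up as moiré rings. A toy generator (the frequency scaling constant is an arbitrary choice of mine, not from any standard test suite):

```python
import math

def zone_plate(size):
    # 8-bit greyscale zone plate: intensity is the cosine of a radially
    # growing phase, so spatial frequency rises toward the image edges.
    img = []
    for y in range(size):
        row = []
        for x in range(size):
            dx, dy = x - size / 2.0, y - size / 2.0
            r2 = dx * dx + dy * dy
            v = 0.5 + 0.5 * math.cos(math.pi * r2 / size)
            row.append(int(round(v * 255)))
        img.append(row)
    return img
```

Downscale the pattern and look at the outer rings: a good scaler fades them to flat grey, a bad one turns them into new, coarser rings (aliasing).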
16th January 2012, 21:28 | #578
Registered User
Join Date: Oct 2011
Posts: 108
|
Now I don't understand what you meant by writing:

Quote:
What should be the next big feature?
* HW video processing: deinterlacing, film detection (3:2, 2:2 pulldowns, etc), noise reduction, sharpness, scaling, etc.

Internal HW deinterlacing and edge-enhancement all work in LAV Video (CUVID), and when I saw "scaling" in your list, I thought you were going to implement it somehow in the decoder itself; that is why I asked you about it.
16th January 2012, 22:02 | #579
QuickSync Decoder author
Join Date: Apr 2011
Location: Atlit, Israel
Posts: 916
The decoder is (traditionally) not responsible for scaling; the renderer is. But I can expose the feature anyway, like ffdshow does (scale to a fixed resolution).
16th January 2012, 23:40 | #580
Registered User
Join Date: Oct 2011
Posts: 108
|
Tags |
ffdshow, h264, intel, mpeg2, quicksync, vc1, zoom player |