Intel QuickSync Decoder - HW accelerated FFDShow decoder with video processing [Archive] - Page 6

egur

8th November 2011, 23:00

The problem is that my decoder is not connected to the renderer; it's unaware of the graph. I can redesign.
Currently I create my own device, but it fails in fullscreen (only when the decoder is instantiated during fullscreen).

nevcairiel

9th November 2011, 07:28

It's the only way. In FSE mode you cannot create a new device; you have to use the one the EVR gives you.

egur

9th November 2011, 09:46

It's the only way. In FSE mode you cannot create a new device; you have to use the one the EVR gives you.

Thanks.
Can you point me to the relevant reading material? I didn't see this anywhere.

Update:
I was referred to this article:
http://msdn.microsoft.com/en-us/library/windows/desktop/bb147220(v=vs.85).aspx

Basically this means I can create a device when these conditions are met:
* They are created by the same Direct3D object that created the device that is full-screen.
* They have the same focus window as the device that is full-screen.
* They represent a different adapter from any full-screen device.

So nevcairiel is right :( . I need to postpone decoder initialization until after the graph is connected, which is a little ugly for current use cases and requires shotgun surgery in ffdshow's code. Future use cases (outputting DXVA samples) will benefit from this design change, as I'll be able to know how many surfaces are queued in the renderer.

Blight

9th November 2011, 20:11

CruNcher:
Line21 still exists in digital format (see DVDs or M2TS content grabbed from DTV streams), mainly for Closed Captions.

egur

10th November 2011, 10:33

If you're using a device passed to DXVA2CreateVideoService (http://msdn.microsoft.com/en-us/library/windows/desktop/ms704721%28v=VS.85%29.aspx), make sure that the HWND pointer used when creating the device isn't linked to a monitor that will be used for exclusive mode.

Thanks for your help but I'm not sure this is a viable option in all cases:
* When there's just 1 GPU and one screen.
* Even in multi GPU setups I don't know how to do this :(

My home setup has 2 GPUs Intel + AMD. The screen is connected to the AMD and I still can't create the device on the Intel GPU.
Maybe I'm sending an HWND that's associated with the AMD-connected monitor. But I don't know how to create a generic HWND on the other monitor that will actually result in a functioning d3d device...

dukey

10th November 2011, 13:06

EVR sets things up like this
creates a window 1x1 pixels in size

D3DPRESENT_PARAMETERS pp;
ZeroMemory(&pp, sizeof(pp));

pp.BackBufferWidth = 1;
pp.BackBufferHeight = 1;
pp.Windowed = TRUE;
pp.SwapEffect = D3DSWAPEFFECT_COPY;
pp.BackBufferFormat = D3DFMT_UNKNOWN;
pp.hDeviceWindow = hwnd;
pp.Flags = D3DPRESENTFLAG_VIDEO;
pp.PresentationInterval = D3DPRESENT_INTERVAL_DEFAULT;

It doesn't use that for rendering. It creates additional swap chains which it renders into, and presents them when they are done. Don't know if that helps.

JanWillem32

10th November 2011, 18:52

That points out the hwnd handle nicely. Creating the window handle with WS_MINIMIZE|WS_POPUP and possibly WS_DISABLED should work. http://msdn.microsoft.com/en-us/library/windows/desktop/ms632679%28v=VS.85%29.aspx
If creating it minimized is a problem, using CloseWindow should also do the trick of minimizing it. http://msdn.microsoft.com/en-us/library/windows/desktop/ms632678%28v=VS.85%29.aspx
Also, structs can be assigned on creation. Using ZeroMemory is more something for class members (and also only if you can't zero them on class initialization).
D3DPRESENT_PARAMETERS pp = {1, 1, D3DFMT_X8R8G8B8, 1, D3DMULTISAMPLE_NONE, 0, D3DSWAPEFFECT_DISCARD, hWnd, TRUE, FALSE, D3DFMT_UNKNOWN, 0, 0, D3DPRESENT_INTERVAL_IMMEDIATE};

nevcairiel

10th November 2011, 19:05

For readability, everyone should always favor the syntax as dukey posted it. For complex structs like that, using the inline initializers is just asking for trouble.

JanWillem32

10th November 2011, 20:22

That would most certainly extend the size of the renderers I'm working on considerably (the color management section declares dozens of various structs). Also, I often declare structs like this as static const. Using ZeroMemory might inhibit making elements "rommable" (a syntax of: D3DPRESENT_PARAMETERS pp = {0}; is illegal in this case). I usually just add comments to mark the interesting bits (most elements of this type of struct are 0 or 1). In this case only the hWnd parameter is worth noting, the rest just describes parameters for a 1×1 pixel backbuffer without extras.
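To make the two initialization styles under discussion concrete, here is a minimal sketch. It uses a hypothetical stand-in struct (`PresentParams`, not the real `D3DPRESENT_PARAMETERS`) and an assumed constant `kSwapCopy`, so that it compiles without the DirectX headers. It only shows that both styles produce the same field values; which one reads better is the point being argued.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a D3D-style parameter struct (assumption:
// this is NOT the real D3DPRESENT_PARAMETERS layout).
struct PresentParams {
    std::uint32_t backBufferWidth;
    std::uint32_t backBufferHeight;
    std::uint32_t swapEffect;
    void*         deviceWindow;
    std::int32_t  windowed;
};

constexpr std::uint32_t kSwapCopy = 3;  // assumed value, for illustration only

// Style 1: zero the struct, then assign the interesting fields by name.
inline PresentParams makeByName(void* hwnd) {
    PresentParams pp;
    std::memset(&pp, 0, sizeof(pp));
    pp.backBufferWidth  = 1;
    pp.backBufferHeight = 1;
    pp.swapEffect       = kSwapCopy;
    pp.deviceWindow     = hwnd;
    pp.windowed         = 1;
    return pp;
}

// Style 2: positional aggregate initializer. The values must follow the
// declaration order exactly, which is the readability concern raised above.
inline PresentParams makePositional(void* hwnd) {
    PresentParams pp = {1, 1, kSwapCopy, hwnd, 1};
    return pp;
}

// Field-by-field comparison (avoids comparing padding bytes).
inline bool sameFields(const PresentParams& a, const PresentParams& b) {
    return a.backBufferWidth == b.backBufferWidth &&
           a.backBufferHeight == b.backBufferHeight &&
           a.swapEffect == b.swapEffect &&
           a.deviceWindow == b.deviceWindow &&
           a.windowed == b.windowed;
}

// Usage: both styles should describe the same 1x1 windowed setup.
inline void checkStylesAgree(void* hwnd) {
    assert(sameFields(makeByName(hwnd), makePositional(hwnd)));
}
```

The positional form can also be declared `static const`, which is the "rommable" case mentioned above; the by-name form cannot, because it assigns after construction.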

nevcairiel

10th November 2011, 20:23

If the code is already too long, using these things to "shorten" it is the worst idea ever. It'll just make already long code even harder to read/understand.

dukey

10th November 2011, 21:53

The point I was trying to make was that EVR creates and renders into additional swap chains; it doesn't really depend on the window, as the back buffer for the window it created was only 1x1.

egur

10th November 2011, 22:00

That points out the hwnd handle nicely. Creating the window handle with WS_MINIMIZE|WS_POPUP and possibly WS_DISABLED should work. http://msdn.microsoft.com/en-us/library/windows/desktop/ms632679%28v=VS.85%29.aspx
If creating it minimized is a problem, using CloseWindow should also do the trick of minimizing it. http://msdn.microsoft.com/en-us/library/windows/desktop/ms632678%28v=VS.85%29.aspx

Doesn't work :( I created the hWnd like you specified, tried offscreen coordinates (9999,9999), minimized, etc. Will not work in FS only in windowed mode.

If you've verified that it works on your system (in FSE), please send me the device creation code starting from hWnd creation up to the call to CreateDevice.

Thanks!

vivan

11th November 2011, 11:16

Added black borders to images with non 16 modulo width. Retaining non standard width can cause downstream filters to crash (dvobsub/vsfilter).

Is it possible to remove this "feature"?
1) There are a lot of people who are not using such... filters. For them it makes things only worse - e.g. 712x400 is 16:9, if you add 8pix border it would be 720x400 and you will have to enjoy top, right and bottom black borders on 16:9 display.
2) ffdshow already has such feature - resize & aspect filter - "expand to next multiple of 16".

egur

11th November 2011, 13:54

Is it possible to remove this "feature"?
1) There are a lot of people who are not using such... filters. For them it makes things only worse - e.g. 712x400 is 16:9, if you add 8pix border it would be 720x400 and you will have to enjoy top, right and bottom black borders on 16:9 display.
2) ffdshow already has such feature - resize & aspect filter - "expand to next multiple of 16".

Nev has already asked for this removal and I agreed.
What I plan to do is enable/disable this feature via config. This way the DS filter can control what's going on. I hope this will make everyone happy.
I personally use vobsub and it's quite common - that's why I implemented it in the first place.

The default will be to enable mod16 width because stability overrides quality.

BTW, ffdshow will copy the image with a significant performance penalty so it's not a good option.
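For reference, the mod16 padding being discussed is plain round-up arithmetic. The sketch below is not taken from the decoder's source; the function names are made up for illustration. It reproduces the 712 to 720 example from the post above.

```cpp
#include <cassert>
#include <cstdint>

// Round a width up to the next multiple of 16 (hypothetical helper name).
inline std::uint32_t padToMod16(std::uint32_t width) {
    assert(width > 0);
    return (width + 15u) & ~15u;
}

// Number of black border pixels the padding adds to one row.
inline std::uint32_t borderPixels(std::uint32_t width) {
    return padToMod16(width) - width;
}
```

With these helpers, `padToMod16(712)` gives 720 and `borderPixels(712)` gives 8, while a width that is already a multiple of 16 is left unchanged.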

CruNcher

13th November 2011, 16:59

Egur this is really cool http://www.mediafire.com/download.php?d0bg6khk2lk8bjl due to the vsync background noise completely gone you can immediately see what the Broadcast Encoder did wrong (without needing to look @ the bitstream) :) it happens @ every keyframe :)

I tried to get the same resolution with ffdshow-quicksync but it shows strange peakings and get a lock sometimes @ 25 fps
the nicetest scene shows extreme jitter also 16ms

Peaking wrong lock issue (telecine mpeg-2):

http://www.mediafire.com/?gauc931m49hya51

Heavy jitter issue (H.264 Interlaced):

http://www.mediafire.com/?x2crc2tuoq9il29

Interesting disabling Deinterlace auto flag output in ffdshow-quicksync fixes both of these issues but obviously no Deinterlacing anymore (which though is only really problematic for the Interlaced H.264 stream)

Though there is still strange peaking periodically going on for the telecined stream and jitter changes from 0.4xxms upto 2ms very strange, ahhh it loses the lock on the next keyframe again :) :(

Lock Lost issue (telecine mpeg-2):

http://www.mediafire.com/?tii7ytnm7di6ck4

egur

13th November 2011, 22:25

I'm on a business trip this week, so I'll take a look at it when i return. Thanks!

JanWillem32

14th November 2011, 16:52

Doesn't work :( I created the hWnd like you specified, tried offscreen coordinates (9999,9999), minimized, etc. Will not work in FS only in windowed mode.

If you've verified that it works on your system (in FSE), please send me the device creation code starting from hWnd creation up to the call to CreateDevice.

Thanks!

That's too bad. Let's first try something else. What errors does the DirectX debug runtime give in a tracing debug session? (Don't forget to define D3D_DEBUG_INFO globally in the project and enable the Direct3D debug runtime in the "Microsoft DirectX SDK (June 2010)\Utilities\bin\x64\dxcpl.exe" or "Microsoft DirectX SDK (June 2010)\Utilities\bin\x86\dxcpl.exe" utility.)
I hope you'll enjoy your trip, and hear from you later on.

skingery

20th November 2011, 04:43

I used to use LAV for splitting, audio and video. Recently I rebuilt my HTPC with a Sandybridge processor so I thought I'd give this build a try for video.
For a renderer, what are people generally using: EVR, EVR CP or madVR?

kwlee

22nd November 2011, 13:03

Hi,
I'm a newbie here, and I'm testing a limitation issue...
There should not be a (practical) limit. I can modify ffdshow to revert to libavcodec if initialization fails. Most likely that the platform will run out of RAM before this happens.

I am testing H.264 with DXVA2 in ffdshow to see how many applications (graphstudio.exe) can play back at the same time.

The maximum number is 6; the 7th graphstudio fails. I have tested on both an nVidia GT430 and an AMD Radeon 6900 series...

Is that a limitation of ffdshow? And will this Intel project have the same limitation?

Thanks!

egur

22nd November 2011, 22:34

I gave up after 12 graph studios. GPU RAM is probably the bottleneck: if you play multiple instances of a full HD H264 clip with lots of reference frames, a lot of GPU RAM is used, limiting the number of instances. BTW, if you've lowered the amount of RAM the GPU uses (BIOS setup), then the instance count will be lower.
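As a rough illustration of why GPU RAM can become the limit, the following back-of-the-envelope sketch estimates the memory held by NV12 surfaces for one decoder instance. The surface count is an assumption chosen for illustration; real drivers add their own alignment, padding and extra buffers, so treat the result as a lower bound.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one NV12 surface: a full-resolution Y plane plus a
// half-height interleaved CbCr plane, i.e. width * height * 3 / 2.
inline std::uint64_t nv12SurfaceBytes(std::uint64_t width, std::uint64_t height) {
    assert(width > 0 && height > 0);
    return width * height * 3 / 2;
}

// Estimated surface memory for one decoder instance.
// 'surfaces' (reference frames plus output queue) is an assumed figure.
inline std::uint64_t estimateInstanceBytes(std::uint64_t width,
                                           std::uint64_t height,
                                           std::uint64_t surfaces) {
    return nv12SurfaceBytes(width, height) * surfaces;
}
```

For example, a 1920x1088 surface comes to 3,133,440 bytes, so an instance holding an assumed 20 surfaces would need roughly 60 MB before any driver overhead, and each additional player instance adds the same amount again.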

egur

22nd November 2011, 22:46

The first official build of ffdshow with Intel QuickSync decoder can be downloaded from the ffdshow's download page at:
http://ffdshow-tryout.sourceforge.net/download.php

Same QuickSync decoder as v0.18, ffdshow itself contains changes mostly for subtitles.

This thread will continue to supply ffdshow builds for continuous testing of my decoder's versions.

kwlee

23rd November 2011, 01:52

I gave up after 12 graph studios. GPU RAM is probably the bottleneck.
Ya, if I test ffdshow with libavcodec, it easily gets up to 12 graph studios with 1080p avi files.

If I test ffdshow with DXVA2, even with smaller h264 video files,
the graph studio limit is still 6, no idea why :(

My purpose is for TV wall...

kwlee

24th November 2011, 04:47

Hi, A programming issue about progressive frame flag,

In QuickSync.cpp
mfxU32 CQuickSync::PicStructToDsFlags(mfxU32 picStruct)

if (picStruct == MFX_PICSTRUCT_PROGRESSIVE)
{
return AM_VIDEO_FLAG_WEAVE; --> Is it better to use "AM_VIDEO_FLAG_INTERLEAVED_FRAME" ?
}

egur

24th November 2011, 07:29

Hi, A programming issue about progressive frame flag,

In QuickSync.cpp
mfxU32 CQuickSync::PicStructToDsFlags(mfxU32 picStruct)

if (picStruct == MFX_PICSTRUCT_PROGRESSIVE)
{
return AM_VIDEO_FLAG_WEAVE; --> Is it better to use "AM_VIDEO_FLAG_INTERLEAVED_FRAME" ?
}

This is done to ensure that a deinterlacer will not run on this frame. AM_VIDEO_FLAG_INTERLEAVED_FRAME means that both fields exist. Do you see any issues with this?

nevcairiel

24th November 2011, 07:45

AM_VIDEO_FLAG_WEAVE is the right flag for progressive. AM_VIDEO_FLAG_INTERLEAVED_FRAME actually has the value 0, and is therefore always set. Most images we deal with here always have two interleaved fields; single fields are rather uncommon.
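The point about the zero-valued flag can be sketched as follows. The constants are local mirrors of the DirectShow `AM_VIDEO_FLAG_*` and Media SDK `MFX_PICSTRUCT_*` values, redefined here so the example builds without those headers; they should be checked against the real headers before use. Only the progressive branch comes from the code quoted in this thread; the interlaced branches are an assumption added for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Local mirrors of MFX_PICSTRUCT_* (verify against mfxstructures.h).
constexpr std::uint32_t kPicProgressive = 0x01;
constexpr std::uint32_t kPicFieldTff    = 0x02;
constexpr std::uint32_t kPicFieldBff    = 0x04;

// Local mirrors of AM_VIDEO_FLAG_* (verify against dvdmedia.h).
constexpr std::uint32_t kFlagFieldMask        = 0x0003;
constexpr std::uint32_t kFlagInterleavedFrame = 0x0000;
constexpr std::uint32_t kFlagField1First      = 0x0004;
constexpr std::uint32_t kFlagWeave            = 0x0008;

// Sketch of a picture-structure to sample-flag mapping.
inline std::uint32_t picStructToFlags(std::uint32_t picStruct) {
    if (picStruct == kPicProgressive)
        return kFlagWeave;  // tells a deinterlacer to leave the frame alone
    if (picStruct & kPicFieldTff)
        return kFlagInterleavedFrame | kFlagField1First;  // assumed mapping
    return kFlagInterleavedFrame;                         // assumed mapping
}

// "Interleaved frame" is detected through the field mask, because the
// flag itself is 0 and cannot be tested with a plain bitwise AND.
inline bool isInterleavedFrame(std::uint32_t flags) {
    return (flags & kFlagFieldMask) == kFlagInterleavedFrame;
}

// Usage: a progressive frame carries WEAVE and still counts as interleaved.
inline void demoProgressive() {
    const std::uint32_t f = picStructToFlags(kPicProgressive);
    assert((f & kFlagWeave) != 0 && isInterleavedFrame(f));
}
```

Because the interleaved-frame value is 0, OR-ing it in changes nothing, which is why returning the weave flag alone already describes a two-field frame that must not be deinterlaced.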

kwlee

24th November 2011, 08:06

This is done to ensure that a deinterlacer will not run on this frame. AM_VIDEO_FLAG_INTERLEAVED_FRAME means that both fields exist. Do you see any issues with this?
During my tests there was no problem (TBF, I don't have many test sample files).

http://msdn.microsoft.com/en-us/library/windows/desktop/dd373499(v=vs.85).aspx

AM_VIDEO_FLAG_WEAVE "This flag applies only when there are two fields per sample. " As MSDN explains..

Maybe someone can help to explain more..:)

nevcairiel

24th November 2011, 08:14

A progressive frame contains two "fields", you're just not supposed to handle them separately.

egur

24th November 2011, 08:58

A video frame can basically come from two sources: interlaced or progressive.
In the former, the 2 fields (called top/bottom or odd/even) are from different times. Each field has half the lines of the original frame. Because of the different time stamps, they should be interpolated (several methods exist) to a full frame (i.e. deinterlaced).
In a progressive frame, there's no notion of fields; the image has all the lines and no deinterlacing is needed.

A special case is what's called "film": a video that was shot in progressive and later artificially split into fields (mostly TV broadcast and DVDs). In "film" content, the fields share the same time stamps and need to be processed as progressive frames. A deinterlacer and/or a decoder will usually have a film detector mechanism of some sort, as a standard deinterlacer will produce horrible artifacts for some content.

There's also the matter of how CbCr is stored (4:2:0 only). If the frame is progressive, each line of CbCr values covers two consecutive lines of the frame. For interlaced content, it covers two consecutive lines of a specific field. Simply weaving the lines will produce color artifacts even if there's no motion.
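The chroma layout difference can be sketched as a row mapping. This is a simplified model that ignores chroma siting and filtering; it only shows which stored CbCr row a given luma row reads from in each layout. The function names are made up for this example.

```cpp
#include <cassert>

// Progressive 4:2:0: chroma row c covers frame luma rows 2c and 2c+1.
inline int chromaRowProgressive(int lumaRow) {
    assert(lumaRow >= 0);
    return lumaRow / 2;
}

// Interlaced 4:2:0: each field has its own chroma rows, and a chroma row
// covers two consecutive lines of the SAME field (frame rows 4k and 4k+2
// for the top field, 4k+1 and 4k+3 for the bottom field).
inline int chromaRowInterlaced(int lumaRow) {
    assert(lumaRow >= 0);
    const int field     = lumaRow % 2;  // 0 = top field, 1 = bottom field
    const int fieldLine = lumaRow / 2;  // line index inside that field
    return 2 * (fieldLine / 2) + field; // chroma row in the woven frame
}
```

Frame luma row 2 reads chroma row 1 in the progressive layout but chroma row 0 in the interlaced one, so treating one layout as the other pairs luma lines with the wrong chroma lines even in a static picture.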

egur

7th December 2011, 21:51

Version 0.19 is out with the following changes:
* Added limited support for WMC full screen exclusive mode:
- Renderer must be connected to the decoder directly - no intermediate filters.
- Screen is connected to the Intel GPU (decoder shares device with renderer).
- Might only work on single monitor setups.
* Decoder has exposed its configuration options via GetConfig/SetConfig. These must be called before initialization.
* Padding the image to mod16 width is now off by default. Works with vobsub.
* Decoder can be tested for compatibility with media types via the TestMediaType method
* FFDShow rev4126

Download from SourceForge home page:
http://sourceforge.net/p/qsdecoder

haruhiko_yamagata

9th December 2011, 11:20

Thanks a lot for the update.
Can I ask several questions and requests? Please excuse me for not reading all of this thread. I read several pages. I do not have sandy bridge and can not test your decoder.
I would like to ask you to add an entry to our wiki (http://ffdshow-tryout.sourceforge.net/wiki/video:codecs) and answer these questions. I think most of my questions are FAQ.
Version 0.19 is out with the following changes:
* Added limited support for WMC full screen exclusive mode:
- Renderer must be connected to the decoder directly - no intermediate filters.
- Screen is connected to the Intel GPU (decoder shares device with renderer).
- Might only work on single monitor setups.

Can ffdshow output other color spaces than NV12?
Can ffdshow resize?
If screen is not connected to the Intel GPU, and the renderer is not the WMC full screen, does your decoder work?

* Padding the image to mod16 width is now off by default. Works with vobsub.

Are the strides aligned? ffdshow exhibits bugs and a heavy loss of performance in several features if the strides are not aligned. I may be able to fix the bugs, but not the performance issue.

Additional FAQ for the wiki:
How fast is this?
For what kind of video is this useful?
Which profiles of H.264 does this support?
Can this output 10/12-bit formats?

nevcairiel

9th December 2011, 11:38

For what kind of video is this useful?
Which profiles of H.264 does this support?
Can this output 10/12-bit formats?

- It supports the same formats as DXVA, so only 8-bit 4:2:0 (MPEG2, H264 and VC1)
- H264 High profile, no 10-bit or 4:2:2/4:4:4
- See above, only 8-bit. Output is always NV12.

egur

9th December 2011, 14:49

Thanks a lot for the update.
Can I ask several questions and requests? Please excuse me for not reading all of this thread. I read several pages. I do not have sandy bridge and can not test your decoder.
I would like to ask you to add an entry to our wiki (http://ffdshow-tryout.sourceforge.net/wiki/video:codecs) and answer these questions. I think most of my questions are FAQ.

If I have permissions, I'll add the following questions to the Wiki. No prb.

Can ffdshow output other color spaces than NV12?
Can ffdshow resize?
If screen is not connected to the Intel GPU, and the renderer is not the WMC full screen, does your decoder work?
Are the strides aligned?
How fast is this?
For what kind of video is this useful?
Which profiles of H.264 does this support?
Can this output 10/12-bit formats?

* The HW decoder only outputs NV12, but ffdshow converts the output to whatever was agreed with the downstream filter.
* ffdshow can perform all of its post processing as long as it works well with NV12 input. swscale had issues with NV12->NV12 copies, so I changed ffdshow's code to use another copy method. Since then I haven't seen any issues and none were reported to me on the matter.
* With the exception of WMC FS, multi GPU setups are working (I personally have this setup, with an AMD Radeon HD6950). I'm working on fixing WMC FS with multi GPU setups; partial success so far.
* Strides are always 16 byte aligned.
* How fast? Worst case scenario is ~2x faster than libavcodec (low bitrate clips). The speed difference is greater for high bitrate clips. CPU frequency stays at LFM (800MHz mobile / 1600MHz desktop) for the duration of playback. CPU overhead is usually related to image resolution and frame rate and not bitrate, since I copy the frames to system memory.
* Decodes H264 up to and including high profile. MPEG2: all except the 4:2:2 profile. VC1: advanced profile in HW, MP & SP in SW. Future HW will support more video formats.
* Unfortunately, HW only decodes 8 bit 4:2:0 formats ATM :(. The decoder only outputs NV12. This might change in the future, but I can't commit to this.
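To illustrate the stride and frame-copy points above, here is a minimal sketch of copying an NV12 frame row by row into a buffer whose stride is a multiple of 16 bytes. It is not the decoder's actual copy routine. It assumes the source CbCr plane follows the Y plane directly and uses the same stride, and it only aligns the stride, not the buffer's base address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Round a byte count up to the next multiple of 16.
inline std::size_t alignUp16(std::size_t n) {
    return (n + 15) & ~static_cast<std::size_t>(15);
}

// Copy an NV12 frame (Y plane, then interleaved CbCr plane of half the
// height) into a new buffer with a 16 byte aligned stride.
// Assumptions: even height, CbCr plane starts at src + srcStride * height.
inline std::vector<std::uint8_t> copyNv12(const std::uint8_t* src,
                                          std::size_t srcStride,
                                          std::size_t width,
                                          std::size_t height,
                                          std::size_t& dstStride) {
    assert(src != nullptr && srcStride >= width && height % 2 == 0);
    dstStride = alignUp16(width);
    const std::size_t rows = height + height / 2;  // Y rows + CbCr rows
    std::vector<std::uint8_t> dst(dstStride * rows, 0);
    for (std::size_t r = 0; r < rows; ++r) {
        // Only 'width' bytes per row carry pixels; the rest is padding.
        std::memcpy(&dst[r * dstStride], src + r * srcStride, width);
    }
    return dst;
}
```

The per-row copy is what makes the CPU cost scale with resolution and frame rate rather than bitrate: the same number of bytes moves for every frame regardless of how hard the stream was to decode.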

haruhiko_yamagata

9th December 2011, 15:19

Thank you very much for reply.
You can register on the wiki here (http://ffdshow-tryout.sourceforge.net/wiki/video:codecs?do=register). Please note that the wiki cannot send any e-mails, even if it says it will. For example, password retrieval through e-mail is not available.

Esperado

12th December 2011, 08:48

Hi, Egur.
I have tried the latest ffdshow with QuickSync in DVBViewer. I wonder why it eats more CPU than, for example, CoreAVC with my Radeon's HW acceleration (DXVA).
Isn't it supposed to use only hardware?
I do not see any difference whether I plug a monitor into the Intel VGA output or not. And if I don't, and I start DVBViewer through Virtu, it eats more CPU..
Thanks for sharing your work, anyway.
I wonder why Intel does not provide codecs that directly use the Sandy Bridge acceleration?
Best regards.

egur

12th December 2011, 08:57

Hi, Egur.
I have tried the latest ffdshow with QuickSync in DVBViewer. I wonder why it eats more CPU than, for example, CoreAVC with my Radeon's HW acceleration (DXVA).
Isn't it supposed to use only hardware?
I do not see any difference whether I plug a monitor into the Intel VGA output or not. And if I don't, and I start DVBViewer through Virtu, it eats more CPU..
Thanks for sharing your work, anyway.

It uses HW for decoding, but the frames are copied back to CPU memory for further processing, hence the non-zero CPU usage.
Make sure you selected Intel QuickSync as the decoder in ffdshow's codec tab for h264,mpeg2,vc1.

I wonder why Intel does not provide codecs using directly the Sandy Bridge acceleration ?
Best regards.
Actually decoders are shipped with the driver.
BTW, Microsoft's DTV-DVD decoder also uses HW acceleration when connected to an EVR renderer. It fails on some types of clips but works on most content.
Using a pure DXVA playback pipeline is very restrictive - decoder must connect directly to a special renderer, decoder output can't be modified by the renderer, etc.
My solution of copying the images comes at a price but it's much more versatile.

haruhiko_yamagata

12th December 2011, 10:24

Egur, thank you very much for the wiki (http://ffdshow-tryout.sourceforge.net/wiki/video:codecs).
Now we have a good summary for our new decoder.
fastplayer, thank you for the update.

Esperado

13th December 2011, 01:04

It uses HW for decoding but the frame are copied back to CPU memory for further processing, hence the non zero CPU usage.
My solution of copying the images comes at a price but it's much more versatile.
As it costs *more* CPU than other ordinary hardware accelerated codecs, and I do not see much better quality in real time TNT decoding (especially regarding de-interlacing), I believe I will forget the graphics of my Sandy Bridge to save some power and temp.
What interested me, and the reason why I was interested in Sandy Bridge, was being able to watch TNT on my PC with no CPU load. Some kind of TV/monitor set.
Make sure you selected Intel QuickSync as the decoder in ffdshow's codec tab for h264,mpeg2,vc1.

Of course I did. And tried everything. Monitor connected to the motherboard VGA or not (no change), Virtu or not (it eats more CPU when launching DVBViewer with Virtu, with no noticeable quality change...)

Actually decoders are shipped with the driver but they are not very good. This is something I'm trying to improve behind the scenes.

You did very nice work. Shame it is unofficial.
In fact, i found no Intel Codecs registered for Direct show on my computer after Install, and i only found your thread after a long research on Google. I'm very disappointed with Intel position, in that matter, marketing announces effects, and nothing usable for real time decoding, apart your personal work in fact ? Unbelievable from such a big company.

egur

13th December 2011, 08:10

As it cost *more* CPU than other ordinary hardware accelerated codecs, and i do not see any much better quality in real time TNT decoding (specially about De-interlacing), i believe i will forget the graphics from my Sandy Bridge to save some power and temp.
The interest, for me -and the reason why i was interested in those Sandy Bridge- was to can watch TNT on my PC with no CPU load. Some kind of TV/Monitor set.
Of course i did. And tried everything. Monitor connected on motherboard VGA or not (no changes), Virtu or not (It eats more CPU, launching DvbViewer with Virtu, and no noticeable quality change...)

The frame copying paradigm allows decode flows unavailable to DXVA decoders - avisynth acceleration, multi-GPU setup (decoder on SNB + high-end renderer like MadVR on dGPU). The performance price is small.

Virtu isn't needed with my decoder. It would be useful with a DXVA decoder.

You made a very nice work. Shame it is unofficial.
In fact, i found no Intel Codecs registered for Direct show on my computer after Install, and i only found your thread after a long research on Google. I'm very disappointed with Intel position, in that matter, marketing announces effects, and nothing usable for real time decoding, apart your personal work in fact ? Unbelievable from such a big company.

One can always use Microsoft's DTV-DVD decoder which ships with Win7. But this decoder isn't 100% working on my test suite... But it works on most content.

CruNcher

14th December 2011, 20:48

Internally I'm pushing for better HW enablement (SW, docs, sample code) - for both end-users and independent developers. My work has created a lot of positive noise within Intel and it's one step forward towards high quality enablement.

Great news, hope those responsible @ Intel see the chance like Nvidia did right from the VPx start way back then. It's too sad to see such engagement so often smashing against the big chairs for whatever stupid reasons, though I have to say Intel's ecosystem around its hardware is much better from the start compared to how catastrophic it was with AMD, who took years and then came up with something that no one really likes to implement ;)
Not that I didn't expect Intel's SB support for all classes to be awesome from the start ;) but after the ATI disaster I was skeptical that anyone but Nvidia had realized the chance in such a strong community ecosystem, from devs to end users :)
Though since my move back from AMD to Intel I haven't been disappointed yet (OK, the bad B2 stepping thing was a big disaster itself, but it was fixed fast and without compromise; personally I decided I don't need the B3 anyway ;) )

Btw Eric, are you and Blight related, or is the surname match just a coincidence? :) Are you maybe brothers? That would make sense somehow, as he also knew something about your past work, I wondered.

haruhiko_yamagata

14th December 2011, 23:51

In ffdshow's codecs configuration page, the Intel QuickSync Decoder is listed for MPEG-1, but is this working? Because it is not listed in TglobalSettingsDecVideo::c_mpeg1, I suspect it is not working. Can I remove MPEG-1 support?

egur

15th December 2011, 09:18

In ffdshow's codecs configuration page, IntelQuick Sync Decoder is listed for MPEG-1, but is this working? Because it is not listed in TglobalSettingsDecVideo::c_mpeg1, I suspect it is not working. Can I remove MPEG-1 support?

Yes, my mistake, please remove.
Maybe the decoder supports mpeg1, but for such low resolution clips there's no point in HW acceleration. FFDShow's support for mpeg1 is already very good.

egur

15th December 2011, 09:58

...

Btw Eric are you and Blight related to each other family wise or is the surname match just a coincidence :) are you maybe brothers would make sense somehow as he also knew something about your past work i wondered ?

Yes, we are brothers. Good catch :)

egur

15th December 2011, 20:12

Version 0.20 is out with the following changes:
* Fixed support for WMC full screen exclusive mode:
- Works with multi-GPU setups. Video decoding is HW accelerated using QuickSync. Renderer can be on a different GPU.
- WMC's background thumbnail creation is done in SW
* FFDShow rev4149

Download from SourceForge home page:
http://sourceforge.net/p/qsdecoder

Blight

16th December 2011, 23:01

CruNcher:
Yes, we're bros :P
Eric's wanted to do something like this for years, but now he's in a good position to do so and I've been helping here and there to move things along.
Hopefully, the good people @Intel are watching... He's doing them a heck of a service, both technical and promotional.

hhb97b

17th December 2011, 14:46

Hi

I have been lucky enough to borrow an Acer Aspire 7750G for a day, which has an i7 2670QM CPU and an AMD Radeon HD6650M GPU.
This was my opportunity to try out your decoder. It worked without a problem, but there is one thing I don't understand.
I chose the QuickSync decoder in the decoder tab and chose resize, sharpen, blur and an avs script as the configuration to test. The reason for all the post processing was to generate as much processing load as possible. I used version 0.20 of the Intel QuickSync decoder and MPC-HC 1.5 as the testbed.

My expectation was that the QuickSync decoder would handle this load better than the libavcodec decoder, but this is not what I saw.
I couldn't play back a 1080p video with a bitrate of 10500 kbps at normal speed. The peaks were around 52ms/127% for "Time on ffdshow", which means it decodes slower than the movie's frame rate.
With the same configuration and libavcodec, however, the results were 38ms/90%. Was my expectation wrong, or what could cause this?
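The percentages quoted above relate per-frame processing time to the time available per frame. A minimal sketch of that arithmetic follows; the clip's frame rate is not stated in the post, so the 24 fps in the usage note is an assumption (the reported 127% and 90% imply a budget of roughly 41 to 42 ms, which is consistent with it).

```cpp
#include <cassert>

// Time available per frame, in milliseconds, at a given frame rate.
inline double frameBudgetMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Fraction of the frame budget consumed by processing one frame.
inline double loadRatio(double msPerFrame, double fps) {
    return msPerFrame / frameBudgetMs(fps);
}

// Playback keeps up only if processing fits inside the budget.
inline bool keepsUp(double msPerFrame, double fps) {
    return loadRatio(msPerFrame, fps) <= 1.0;
}
```

At an assumed 24 fps, `keepsUp(52.0, 24.0)` is false and `keepsUp(38.0, 24.0)` is true, matching the observation that the 52 ms peaks drop below real-time while the 38 ms case plays normally.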

CruNcher

17th December 2011, 20:25

The latency issue most probably comes from the memory path (main memory to GPU memory copy plus post-processing stress on the CPU), though it's faster than with a discrete card, or at least should be. Dynamic frequency switching, and so latency changes of the main memory, can also have an impact; you're shuffling quite a lot of data around @ 1080p with the additional post processing.

My tests with my QuickSync recording framework showed that it's really application dependent which is more efficient for the current task. A good rule of thumb, I guess: the more GPU resources the main application needs, the better it is to use the CPU for any other task, and vice versa; the more CPU resources an application uses, the better it is to use the GPU. Keeping both in balance is the key to the optimum (especially with DWM and Aero this becomes more complex). Finding the right balance and a dynamic way to distribute it (efficient resource scheduling) isn't easy and is very framework dependent, and the major culprits here are the OS and the driver itself, something very few people have the possibility to make major changes to (see the preemption changes in WDDM 1.2 in Win8) ;)

Another rule of thumb: if you really need time-critical performance that beats software in most cases, there is no way around native DXVA (Microsoft did an excellent job on it), or you need a very strong overall system. So it doesn't surprise me in your case. Try disabling power saving and see how that impacts the latency :)

If you really need the last drop of performance (power saving) in playback with good post-processing, try Mirillis Splash Player. Its performance is really excellent: it makes very good use of what Microsoft supplies to ISVs in the DirectX API on really every level: UI, renderer (its own Direct3D-based one, supporting deinterlacing and subtitles), subtitle renderer (its own), plus very efficient use of their own decoders, DXVA and shaders in combination :)
It was absolutely designed with one goal in mind: drawing and manipulating (post-processing) video on the screen as fast as possible without a lot of resources (power efficient, scheduling GPU/CPU as well as possible for the tasks), utilizing what Microsoft provides. I haven't seen anything better yet, and I know a lot; maybe only one player, from Asia, currently comes near it, though it still misses features and relies partly on other people's code (ffmpeg, vobsub, a lot from mpc-hc): that would be PotPlayer.

egur

20th December 2011, 23:29

Hi

I have been lucky enough to borrow an Acer Aspire 7750G for a day, which has an i7-2670QM CPU and an AMD Radeon HD 6650M GPU.
This was my opportunity to try out your decoder. It worked without a problem, but there is one thing I don't understand.
I had chosen the QuickSync decoder in the decoder tab and chose resize, sharpen, blur and an avs script as the configuration to test. The reason for all the post-processing was to generate as many processing instructions as possible. I used version 0.20 of the Intel QuickSync decoder and MPC-HC 1.5 as the testbed.

My expectation was that the QuickSync decoder would handle this load better than the libavcodec decoder, but this is not what I saw.
I couldn't play back a 1080p video with a bitrate of 10500 kbps at normal speed. The peaks were around 52ms/127% for "Time on ffdshow", which means it decodes slower than the movie's frame rate.
However, when I used the same configuration with libavcodec the result was 38ms/90%. Was my expectation wrong, or what could cause this?

I managed to reproduce similar results but I haven't root caused the problem.
Several options exist:
* The memory bus is saturated, like CruNcher said.
* ffdshow's video processing algorithms are not optimized for NV12 surfaces (I don't know; I need to check).
* Maybe there's a color space conversion to YV12.

I'll need to run a profiler, among other things, to get to the bottom of this, but it looks interesting. I'll report back.

With pure DXVA you get the best performance and power savings, but video processing becomes tricky and is usually done in the renderer. The frames output by the decoder are used by it as reference frames and mustn't be modified. Copying the frames is an option, but that's very similar to what I do. Writing video processing (like ffdshow has) in a shader language (or CUDA) isn't easy at all, as its rarity shows. Although I think it's possible to create a DXVA video processor filter, I'm not aware of one existing.

hhb97b

21st December 2011, 14:52

I believe that you are both correct.

* Memory bus is saturated like Cruncher said.
I think this is the case in some situations, because I couldn't use all of the CPU power even when I only used libavcodec. CPU usage was only at 50% by the time the "time on ffdshow" went over 41 ms.

* Maybe there's a color space conversion to YV12
I have set the "input colorspace" to yv12 under the avscript tab. I'm using the script rgb3dlut/t3dlut from tritical as a colour management system. The input colorspaces for the script are yuy2, rgb24 and rgb32. This means that in the test setup there would have been a colorspace conversion chain like this:

? -> nv12(decoder) -> yv12(avs tab) -> yuy2(script) -> rgb(script) -> output
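
To put rough numbers on that chain, here is a minimal C++ sketch that computes the per-frame size of each format and the bytes a single conversion step reads and writes. It assumes tightly packed 8-bit frames with no row padding; FrameBytes and ConversionTraffic are hypothetical helper names for illustration, not part of ffdshow or the QuickSync decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bits per pixel for the formats named in the chain above (8-bit samples).
inline int BitsPerPixel(const std::string& fmt) {
    if (fmt == "nv12" || fmt == "yv12") return 12;  // 4:2:0 subsampling
    if (fmt == "yuy2") return 16;                   // 4:2:2 packed
    if (fmt == "rgb24") return 24;
    if (fmt == "rgb32") return 32;
    return 0;                                       // unknown format
}

// Size of one tightly packed frame in bytes (assumption: no row padding).
inline std::size_t FrameBytes(const std::string& fmt, int width, int height) {
    const int bpp = BitsPerPixel(fmt);
    assert(bpp != 0 && width > 0 && height > 0);
    return static_cast<std::size_t>(width) * height * bpp / 8;
}

// Bytes moved by one conversion step: read the source frame, write the target.
inline std::size_t ConversionTraffic(const std::string& from,
                                     const std::string& to,
                                     int width, int height) {
    return FrameBytes(from, width, height) + FrameBytes(to, width, height);
}
```

Under these assumptions a 1920x1080 nv12 or yv12 frame is 3,110,400 bytes, a yuy2 frame is 4,147,200 bytes and an rgb32 frame is 8,294,400 bytes, so every extra step in the chain pushes several more megabytes per frame through main memory.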

Will the QuickSync decoder support colorspaces other than nv12, or is this a limitation of the hardware?

egur

21st December 2011, 15:14

I believe that you are both correct.

* Memory bus is saturated like Cruncher said.
I think this is the case in some situations, because I couldn't use all of the CPU power even when I only used libavcodec. CPU usage was only at 50% by the time the "time on ffdshow" went over 41 ms.

* Maybe there's a color space conversion to YV12
I have set the "input colorspace" to yv12 under the avscript tab. I'm using the script rgb3dlut/t3dlut from tritical as a colour management system. The input colorspaces for the script are yuy2, rgb24 and rgb32. This means that in the test setup there would have been a colorspace conversion chain like this:

? -> nv12(decoder) -> yv12(avs tab) -> yuy2(script) -> rgb(script) -> output

Will the QuickSync decoder support colorspaces other than nv12, or is this a limitation of the hardware?

I ran a profiler to find the problem and here's what I've found:
* ffdshow's internal video processing code runs at about the same speed.
* No color space conversion occurs in my test (I didn't use AviSynth), but your use case "enjoys" the extra conversions.
* The CPU time spent within the HW decoder + driver is extremely small. But wall time may still pass; I don't know how much. Wall time doesn't register anywhere in the profile, but it adds to the decode latency.
* ffdshow doesn't use threads to do video processing (it seems that way anyway), so it's not using the 8 threads my i7-2600 has. At most I got it to use <2 cores (up to 15% utilization in Task Manager).
* Time is spent locking the D3D surface within my decoder. I don't know how to optimize this operation.

As a quick conclusion, I see that I need to perform the frame copying on another thread (a simple fix), and maybe even the decode itself (not so simple).
This will cut down the wall time the decode thread spends within my decoder and, as a result, cut down ffdshow's latency, allowing more code to run.

This is especially useful for full speed decoding (e.g. transcoding use case).
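
A minimal sketch of the "copy on another thread" idea, written in standard C++ rather than taken from the decoder's actual code. CopyWorker is a hypothetical class: the decode thread hands a copy job to a single worker thread and returns immediately, and the worker runs the jobs in submission order. How the surface is locked and when its buffer may be reused are left out and would have to be handled by the caller.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <cstring>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical worker that runs frame copies off the decode thread.
class CopyWorker {
public:
    CopyWorker() : thread_([this] { Run(); }) {}

    // Drains any queued jobs, then stops and joins the worker thread.
    ~CopyWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_all();
        thread_.join();
    }

    // Called from the decode thread; queues the job and returns immediately.
    void Submit(std::function<void()> job) {
        assert(job != nullptr);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void Run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // done_ is set and the queue is drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run the copy outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool done_ = false;
    std::thread thread_;  // declared last so the members above exist before Run() starts
};
```

A caller would submit something like `worker.Submit([=] { std::memcpy(dst, src, size); });` from the decode thread. The source buffer must stay valid until the job has run, which is the part that makes this less trivial in a real decoder.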

If no horrible bugs were introduced in the last version, I'll start working on it.

Regarding HW support, NV12 is the only supported format. There's no point in adding support for surface conversions in my decoder. NV12 is the most HW-friendly format and I think all GPUs use it. NV12 uses only 2 "pointers": one for Y and one for UV. Color operations usually work on both U and V, so cache hits are much better. This is actually true for SW as well. Y is kept separate because many operations work on Y alone. It's also the format recommended by Microsoft. Although it doesn't support 4:2:2/4:4:4 or bit depths greater than 8 bits, 99.99% of video content can be represented as NV12.

Modern renderers that use HW acceleration also like this format, as it saves another format conversion.
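
To make the two-"pointer" NV12 layout concrete, here is a minimal C++ sketch that converts an NV12 frame in system memory to tightly packed YV12 (Y plane, then V, then U). Nv12View and Nv12ToYv12 are hypothetical names for illustration and are not taken from the decoder or ffdshow. The sketch assumes even dimensions and a single row pitch shared by both planes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical view of an NV12 frame: two plane pointers, as described above.
struct Nv12View {
    const std::uint8_t* y;   // luma plane: height rows of pitch bytes
    const std::uint8_t* uv;  // chroma plane: height/2 rows, bytes interleaved U0 V0 U1 V1 ...
    int width;               // visible width in pixels (assumed even)
    int height;              // visible height in pixels (assumed even)
    int pitch;               // bytes per row, >= width (assumed equal for both planes)
};

// Convert to tightly packed YV12: full-size Y, then quarter-size V, then U.
std::vector<std::uint8_t> Nv12ToYv12(const Nv12View& src) {
    assert(src.width > 0 && src.height > 0);
    assert(src.width % 2 == 0 && src.height % 2 == 0 && src.pitch >= src.width);

    const std::size_t ySize = static_cast<std::size_t>(src.width) * src.height;
    const std::size_t cSize = ySize / 4;
    std::vector<std::uint8_t> out(ySize + 2 * cSize);
    std::uint8_t* dstY = out.data();
    std::uint8_t* dstV = dstY + ySize;
    std::uint8_t* dstU = dstV + cSize;

    // Luma: copy row by row, dropping the padding beyond the visible width.
    for (int r = 0; r < src.height; ++r) {
        std::memcpy(dstY + static_cast<std::size_t>(r) * src.width,
                    src.y + static_cast<std::size_t>(r) * src.pitch,
                    static_cast<std::size_t>(src.width));
    }

    // Chroma: split the interleaved UV pairs into separate U and V planes.
    const int cw = src.width / 2;
    const int ch = src.height / 2;
    for (int r = 0; r < ch; ++r) {
        const std::uint8_t* row = src.uv + static_cast<std::size_t>(r) * src.pitch;
        for (int c = 0; c < cw; ++c) {
            dstU[static_cast<std::size_t>(r) * cw + c] = row[2 * c];
            dstV[static_cast<std::size_t>(r) * cw + c] = row[2 * c + 1];
        }
    }
    return out;
}
```

The luma plane is a plain row copy, while the chroma plane has to be touched byte by byte. That per-byte split is the kind of extra work a conversion away from NV12 adds on the CPU.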

hhb97b

21st December 2011, 17:44

I ran a profiler to find the problem and here's what I've found:
* ffdshow's internal video processing code runs at about the same speed.
* No color space conversion occurs in my test (I didn't use AviSynth), but your use case "enjoys" the extra conversions.
* The CPU time spent within the HW decoder + driver is extremely small. But wall time may still pass; I don't know how much. Wall time doesn't register anywhere in the profile, but it adds to the decode latency.
* ffdshow doesn't use threads to do video processing (it seems that way anyway), so it's not using the 8 threads my i7-2600 has. At most I got it to use <2 cores (up to 15% utilization in Task Manager).
* Time is spent locking the D3D surface within my decoder. I don't know how to optimize this operation.

As a quick conclusion, I see that I need to perform the frame copying on another thread (a simple fix), and maybe even the decode itself (not so simple).
This will cut down the wall time the decode thread spends within my decoder and, as a result, cut down ffdshow's latency, allowing more code to run.

This is especially useful for full speed decoding (e.g. transcoding use case).

If no horrible bugs were introduced in the last version, I'll start working on it.

Regarding HW support, NV12 is the only supported format. There's no point in adding support for surface conversions in my decoder. NV12 is the most HW-friendly format and I think all GPUs use it. NV12 uses only 2 "pointers": one for Y and one for UV. Color operations usually work on both U and V, so cache hits are much better. This is actually true for SW as well. Y is kept separate because many operations work on Y alone. It's also the format recommended by Microsoft. Although it doesn't support 4:2:2/4:4:4 or bit depths greater than 8 bits, 99.99% of video content can be represented as NV12.

Modern renderers that use HW acceleration also like this format, as it saves another format conversion.

Well, this was also my expectation regarding the colorspace question, but I just wanted you to confirm it. Thanks to you both for the detailed posts.