Intel QuickSync Decoder - HW accelerated FFDShow decoder with video processing [Archive] - Page 14

hajj_3

1st February 2012, 01:02

Those of you who have been benchmarking QuickSync and comparing how well it plays against CyberLink's decoders might want to try PowerDVD 12, which came out today, and see whether it has improved things in comparison to the QuickSync decoder.

CruNcher

1st February 2012, 01:22

PowerDVD rarely makes use of copy-back (it does when using some of the PP); it mostly utilizes DXVA natively.

CoolerKing

1st February 2012, 10:33

Just for a last try, use PotPlayer with internal DXVA codecs or Quicksync decoder from the same program.
It may play OK.

DXVA Decoder (VLD - Slice Long) plays fine, but how do I know if it's using QuickSync? because I can't seem to select it anywhere.. It's only showing VLD (bitstream decoder) under H264 in the built-in video decoder settings..

I've also tried LAV with PotPlayer and it gives the same stuttering when QuickSync is selected.

egur

1st February 2012, 10:46

So if the drivers (they should be identical whether obtained from windows update or the intel website), windows version (rsd78 is running 32bit and me 64bit) or lucid virtu (not installed) all aren't causing the problem, what could be? rsd78 said he's having the problem on 3 of his machines? So I'm quite surprised that only a few people mentioned having the back/forward skipping issue.

Unfortunately, they are not always identical. I don't know what drivers are shipped in Windows update for Intel's graphics drivers but I had multiple bad experiences with other HW. One time I even had to revert to a backup of my OS before the install because the network driver crippled Windows and stopped working at 1Gb completely.
Download the latest driver from http://downloadcenter.intel.com/ (graphics->processor graphics)

NikosD

1st February 2012, 11:35

DXVA Decoder (VLD - Slice Long) plays fine, but how do I know if it's using QuickSync? because I can't seem to select it anywhere.. It's only showing VLD (bitstream decoder) under H264 in the built-in video decoder settings..

I've also tried LAV with PotPlayer and it gives the same stuttering when QuickSync is selected.

When you select QuickSync inside PotPlayer or LAV Video, you are actually using Egur's work integrated in those filters.
It's not a direct DXVA solution; it uses Intel's Media SDK.

PotPlayer can use both QuickSync and DXVA directly, when you select internal decoders FFmpeg and nothing more.

Best solutions often are the simplest solutions.

CoolerKing

1st February 2012, 11:45

I've just installed the latest driver from Intel website and reinstalled ffdshow, and it's showing Decoder: Intel QuickSync in ffdshow properties, and no longer stuttering :) so I hope it's working now.. maybe MS drivers are indeed bad, or what I did different now as well is that I checked the DXVA video decoder box when installing FFDShow (dunno if that makes any difference).

Anyway, thx for all the help

ps. How much cpu usage is normal? Because me and a buddy are both getting 35-40% cpu usage with ffdshow (both 2500K), which seems quite high compared to other decoders

edit: Also tried LAV now with quicksync (it's working) and only about 5% cpu usage there... think I'll use PotPlayer w/spline resize + LAV QuickSync from now on

amtm

1st February 2012, 15:11

Unfortunately, they are not always identical. I don't know what drivers are shipped in Windows update for Intel's graphics drivers but I had multiple bad experiences with other HW. One time I even had to revert to a backup of my OS before the install because the network driver crippled Windows and stopped working at 1Gb completely.
Download the latest driver from http://downloadcenter.intel.com/ (graphics->processor graphics)

Windows Update only pushes out the WHQL drivers given to them by the manufacturer. It's not as if Windows Update is pushing a generic driver, they are pushing out the Intel branded WHQL driver. If Windows Update is pushing out a different driver than what is hosted on Intel's site, Intel is screwing something up and you need to tell them to stop doing so. Besides how can they be different if they have the same date and version info?

amtm

1st February 2012, 15:16

maybe MS drivers are indeed bad

They aren't. I have the same chipset as you in one of my computers and the driver from Windows Update is identical to the one from Intel's site and they perform identically. There was something else causing your issue.

vivan

1st February 2012, 15:30

2559 was pushed via Windows Update, but then it was removed. And now "latest" is previous - 2509 driver.
So problem is just in a buggy 2559 driver that people got through Windows Update, but it's not MS fault.

NikosD

1st February 2012, 15:33

Eric,

do you know if QuickSync ASIC exists inside HD P3000 graphics card and if there is any difference with HD 2000/3000 (regarding QuickSync only - not GPU in general) ?

amtm

1st February 2012, 16:23

2559 was pushed via Windows Update, but then it was removed. And now "latest" is previous - 2509 driver.
So problem is just in a buggy 2559 driver that people got through Windows Update, but it's not MS fault.

Exactly. Unless it's a generic driver, Windows Update is just pushing out the WHQL driver given to them by the manufacturer. If Intel's driver through Windows Update is broken or different, Intel screwed up not Microsoft.

egur

1st February 2012, 16:23

Eric,

do you know if QuickSync ASIC exists inside HD P3000 graphics card and if there is any difference with HD 2000/3000 (regarding QuickSync only - not GPU in general) ?

You mean the single socket server chip Xeon E3-12x5? Didn't try it. I have little access to the server platforms. Should be OK.

NikosD

1st February 2012, 18:54

Yes exactly.
Intel claims "up to 4x better performance than Intel HD Graphics 3000" but I am more interested in the QuickSync performance of that card, meaning whether there is already a new version of QS - before Ivy Bridge.

CruNcher

2nd February 2012, 02:31

Exactly. Unless it's a generic driver, Windows Update is just pushing out the WHQL driver given to them by the manufacturer. If Intel's driver through Windows Update is broken or different, Intel screwed up not Microsoft.

Yes though it seems for some Intel systems 2559 works excellent for Video the case on my system (I5-2400 GT1 HD2000) and i didn't experience a wrong frame order playback yet. So it seems system dependent also (on which level dunno) maybe 32bit 64 bit difference or maybe MSDK replaced the buggy driver mfx library, lets wait and see what the next intel drivers changelog says (though if it only states a lucid virtu issue it's still mysterious) ;)

i think i found a issue with your scaler Egur, though hardly mathematically seen it isn't if someone creates a wrong size stream 400x265 and the decoder isn't padding it, such things happen ;)

lav video = http://img821.imageshack.us/img821/5746/greenlinescalinglavvide.png
ffdshow = http://img62.imageshack.us/img62/3374/ffdshowscalingok.png

interesting though how the lines are shifted especially the bottom line looks unusual compared to for example Jans Lanczos4 Shader which errors with only a greater thicker green line on the top :)

Jans Lanczos4 Shader error (EVR Custom)

http://img140.imageshack.us/img140/2241/evrcustomjanlanczos4sha.png

Egur Intel Scaler error (EVR)

http://img718.imageshack.us/img718/9799/evregurintelscaler.png

though its definitely also visible that yours sharper :)

Again this shifting, though it looks correct, as more data of the frame is actually visible. (Have to compare how MadVR stands against it) :D

Egur = http://img830.imageshack.us/img830/9624/egurintelscaler.png
Jan Lanczos4 = http://img545.imageshack.us/img545/6514/janslanczos4shader.png
Bicubic 1.0 = http://img192.imageshack.us/img192/3310/janbicubic10.png

Also it seems Jans Lanczos 4 is faulty or do i imagine just more sharpness in the Bicubic one ?

!llus!on

2nd February 2012, 05:20

Hi to all,
can I ask about support for the new H264 profile Hi10P.
Is there any chance to offload at least the main h264 arithmetic from the CPU, because the new profile has 20-40% better compression and better video quality (10bit profile with ~30% better compression :) )

Here are some samples for testing:
http://www.nyaa.eu/?page=search&term=Hi10P&sort=2
http://coalgirls.wakku.to/?p=4465
picture samples:
http://blisswater.info/comparison/elephantsdream/

And some info about the new Hi10P (http://habrahabr.ru/blogs/mass_media/129099/)

CruNcher

2nd February 2012, 05:52

@!llus!on
Depending on how flexible the programmable part is, it might be possible, but no one knows that except Intel, and there is only a slim chance they will use this capability to integrate 10-bit support right now, because they see no demand for it yet. I would also like to see H.264 High 4:4:4 Predictive supported, but that will also stay a dream ;)
You would see better chances for 10 bit if all broadcasters suddenly started to implement it, or once 3D has burned out and something new needs to be sold. 10 bit would be a good candidate after 3D, since it requires so much new hardware. Don't expect it soon, though: a lot of hardware still needs to be sold, and some anime fans (0.00000000001 of the market) won't change that overnight ;)

nevcairiel

2nd February 2012, 08:19

I don't think the hardware is capable of 10-bit decoding, so the chance is basically zero. :p
That's the drawback of fixed-function hardware: it's designed for one task, and one task alone. But that task it does at supreme speed.

CruNcher

2nd February 2012, 10:10

Btw in case of SB if i get it right and Intels Scaler currently works only on EVR and not with EVR custom and subtitles only on EVR custom (DXVA) that would create a pretty bad situation as it seems not really efficient for SB user to use the current custom shaders and not be able to save that GPU load (which can be some amount of Shader Load depending on the Resize Shader) by letting Intels Hardware Scaler do the work and so also keep the better quality ;)

Though how the heck does Microsoft do it in their Media Center then, with closed captions displayed for broadcasts: EVR (using the hardware scaler) + DXVA (native)?

I guess there must be a way to disable Custom Shader resizing on EVR Custom and let the Hardware scaler do its job instead while still maintaining DXVA, Subtitle and after Resizing PP Shader capabilities.
So I wonder if it would be possible to implement a "Use Hardware Resizer instead of Shader" option in MPC-HC EVR Custom. I'm sure the idea initially was to replace the bad hardware scaling (bilinear) with bicubic using a shader, but hardware resizers are slowly getting to another level, so these old beliefs are pretty outdated. At least SB and Ivy Bridge users would otherwise be forced to use something of lower quality with less efficiency. :)

nevcairiel

2nd February 2012, 10:41

I guess there must be a way to disable Custom Shader resizing on EVR Custom and let the Hardware scaler do its job instead while still maintaining DXVA, Subtitle and after Resizing PP Shader capabilities.

Just configure it to "Bilinear" in the MPC-HC Output screen, and not one of the PS2.0 options, that should ask the GPU to do the scaling.

CruNcher

2nd February 2012, 11:05

Just configure it to "Bilinear" in the MPC-HC Output screen, and not one of the PS2.0 options, that should ask the GPU to do the scaling.

Would be nice if it were that easy, but nope, it doesn't work.

See the result: http://img171.imageshack.us/img171/4559/evrcustomcpu.png <- bilinear

egur

2nd February 2012, 11:13

Yes though it seems for some Intel systems 2559 works excellent for Video the case on my system (I5-2400 GT1 HD2000) and i didn't experience a wrong frame order playback yet. So it seems system dependent also (on which level dunno) maybe 32bit 64 bit difference or maybe MSDK replaced the buggy driver mfx library, lets wait and see what the next intel drivers changelog says (though if it only states a lucid virtu issue it's still mysterious) ;)

i think i found a issue with your scaler Egur, though hardly mathematically seen it isn't if someone creates a wrong size stream 400x265 and the decoder isn't padding it, such things happen ;)

lav video = http://img821.imageshack.us/img821/5746/greenlinescalinglavvide.png
ffdshow = http://img62.imageshack.us/img62/3374/ffdshowscalingok.png

interesting though how the lines are shifted especially the bottom line looks unusual compared to for example Jans Lanczos4 Shader which errors with only a greater thicker green line on the top :)

Jans Lanczos4 Shader error (EVR Custom)

http://img140.imageshack.us/img140/2241/evrcustomjanlanczos4sha.png

Egur Intel Scaler error (EVR)

http://img718.imageshack.us/img718/9799/evregurintelscaler.png

though its definitely also visible that yours sharper :)

Again this shifting, though it looks correct, as more data of the frame is actually visible. (Have to compare how MadVR stands against it) :D

Egur = http://img830.imageshack.us/img830/9624/egurintelscaler.png
Jan Lanczos4 = http://img545.imageshack.us/img545/6514/janslanczos4shader.png
Bicubic 1.0 = http://img192.imageshack.us/img192/3310/janbicubic10.png

Also it seems Jans Lanczos 4 is faulty or do i imagine just more sharpness in the Bicubic one ?

Is the grey line at the bottom part of the video or a scaling artifact? It doesn't appear in all your samples that used EVR.

There's definitely a shift between the Intel scaler and the rest.
I've noticed this in a simulation I ran as well (I have the scaler implementation in SW). Try scaling a pattern image where the columns are white-black-white-black... (1 pixel wide).
Scaling such an image by 2x should:
* Show a nice sine pattern - looks identical across the screen.
* Even columns (0, 2, 4, ...) should have identical values to the source image.
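
The pattern test above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `upscale2x_linear` is a hypothetical co-sited linear interpolator standing in for whatever scaler is under test (it is not Intel's scaler or the software model mentioned above, and it produces a triangle wave rather than a sine), and all function names are made up for the example. The part that carries over is the check itself: after a 2x upscale with no shift, every even output column should equal the corresponding source column.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One row of the test pattern: white, black, white, black ... (1 pixel wide).
std::vector<std::uint8_t> make_pattern_row(std::size_t width) {
    std::vector<std::uint8_t> row(width);
    for (std::size_t x = 0; x < width; ++x) {
        row[x] = (x % 2 == 0) ? 255 : 0;
    }
    return row;
}

// Hypothetical stand-in scaler: co-sited 2x linear interpolation.
// out[2i] copies src[i]; out[2i+1] is the rounded average of src[i] and
// src[i+1], with the last sample clamped at the right edge.
std::vector<std::uint8_t> upscale2x_linear(const std::vector<std::uint8_t>& src) {
    assert(!src.empty());
    std::vector<std::uint8_t> out(src.size() * 2);
    for (std::size_t i = 0; i < src.size(); ++i) {
        const std::size_t next = (i + 1 < src.size()) ? i + 1 : i;
        out[2 * i] = src[i];
        out[2 * i + 1] = static_cast<std::uint8_t>((src[i] + src[next] + 1) / 2);
    }
    return out;
}

// True if every even output column equals the source sample it came from.
// A scaler that shifts the image fails this check.
bool even_columns_match(const std::vector<std::uint8_t>& src,
                        const std::vector<std::uint8_t>& out) {
    if (out.size() != src.size() * 2) return false;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (out[2 * i] != src[i]) return false;
    }
    return true;
}
```

To test a real scaler, replace `upscale2x_linear` with a capture of its output row and run `even_columns_match` against the source row.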

egur

2nd February 2012, 11:20

Hi to all,
can I ask about support for the new H264 profile Hi10P.
Is there any chance to offload at least the main h264 arithmetic from the CPU, because the new profile has 20-40% better compression and better video quality (10bit profile with ~30% better compression :) )

Here are some samples for testing:
http://www.nyaa.eu/?page=search&term=Hi10P&sort=2
http://coalgirls.wakku.to/?p=4465
picture samples:
http://blisswater.info/comparison/elephantsdream/

And some info about the new Hi10P (http://habrahabr.ru/blogs/mass_media/129099/)

Well, 10bit is not supported. I'm not familiar with the internals of the decoder so I don't know if partial acceleration is possible. 10bit has system wide implications (lots of SW changes) so it's probably not a small feature. The return on such a feature is small as 10bit is very rare. This might change of course when 10/12bit becomes more mainstream.

egur

2nd February 2012, 11:23

Btw in case of SB if i get it right and Intels Scaler currently works only on EVR and not with EVR custom and subtitles only on EVR custom (DXVA) that would create a pretty bad situation as it seems not really efficient for SB user to use the current custom shaders and not be able to save that GPU load (which can be some amount of Shader Load depending on the Resize Shader) by letting Intels Hardware Scaler do the work and so also keep the better quality ;)

Though how the heck does Microsoft it in their Media Center then having closed captions for Broadcasts displayed, EVR (using the Hardware scaler) + DXVA (native) ?
...

I think subtitles are rendered into an image before EVR. EVR receives 2 raw video streams - video and subs and it blends them together to form the result image. That's the MS way of rendering subs. I don't know if the subs are scaled before or after it's blended with the video stream.

JanWillem32

2nd February 2012, 16:59

Hello again! CruNcher pointed me here, and maybe I can be of use.
-The resizers in EVR CP are purely derived from either the bilinear texture sampler stage or the internal pixel shaders in the renderer (shared with VMR-9 r., Quicktime DX9 h., and RealMedia DX9 h.). In either way, only the shadercore is used for resizing. I can derive specific resizers from external DLL files, such as EVR.dll, but it will have to wait until the main renderer no longer requires the four external mixers and their handlers to manage the video input pin and other connections. Both the DirectX 9 and DirectX 11 renderers are ready, except for a custom mixer, so that's one of the main issues I have at the moment.
-When rendering in DirectX 9, all pixels are offset .5 pixel from their vertex position: http://msdn.microsoft.com/en-us/library/windows/desktop/bb219690%28v=vs.85%29.aspx .
(Note that there are much more efficient forms of vertex lists than the one stated in the summary. As long as it's clear that the X and Y vertex positions need to be offset by 0.5 pixel to the top left to align the top left pixel on [0, 0], it's fine. I can show a sample with more efficient vertex handling on request.)
-The base VMR-9 and EVR renderers allow up to 16 (as I remember) substreams to overlay the main video. For example, in MPC-HC, the OSD renderer sends a A8R8G8B8 texture. As long as the characteristics of the overlay texture are sent, the renderers will render it. It requires a palette index for 256-color P8 textures, or a YUV matrix identification for Y'CbCr textures for example. Resolution isn't a problem, all these textures are scaled independently before blending.
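
For readers unfamiliar with the half-pixel offset mentioned above, here is a minimal sketch of the usual Direct3D 9 compensation. It deliberately avoids the D3D headers so it compiles anywhere: `ScreenVertex` and `make_aligned_quad` are hypothetical names for this example, and the layout merely mirrors the common pre-transformed position + rhw + one texture coordinate arrangement. It is not code from EVR CP or any other renderer discussed here.

```cpp
#include <array>
#include <cassert>

// Pre-transformed screen-space vertex: position, reciprocal w, and one
// texture coordinate pair (hypothetical layout for illustration).
struct ScreenVertex {
    float x, y, z, rhw;
    float u, v;
};

// Builds a triangle-strip quad covering a width x height render target.
// In Direct3D 9, pixel centers sit at integer coordinates while texel
// centers sit at half-integers, so the quad is moved 0.5 pixel towards the
// top left to make texel (0, 0) land exactly on pixel (0, 0).
std::array<ScreenVertex, 4> make_aligned_quad(int width, int height) {
    assert(width > 0 && height > 0);
    const float left = -0.5f;
    const float top = -0.5f;
    const float right = static_cast<float>(width) - 0.5f;
    const float bottom = static_cast<float>(height) - 0.5f;
    return {{
        {left,  top,    0.0f, 1.0f, 0.0f, 0.0f},
        {right, top,    0.0f, 1.0f, 1.0f, 0.0f},
        {left,  bottom, 0.0f, 1.0f, 0.0f, 1.0f},
        {right, bottom, 0.0f, 1.0f, 1.0f, 1.0f},
    }};
}
```

A renderer that skips this adjustment samples between texels, which shows up as a slight blur and a half-pixel shift of the whole image.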

CruNcher

2nd February 2012, 19:58

@Egur
thx for the explanation i guess that's also what most try currently to achieve os of MS
and yeah that 1 line most probably is a scaling artifact combined with the decoder

also something else another Dev integrated Intel GFX support :) http://www.youtube.com/watch?v=mhsNXaCHRd8

@JanWillem32

I see, so the .5 pixel offset is a render issue of EVR Custom and MadVR currently. Good news on the replacement of that renderer part; I guessed it wouldn't be so easy :(
Did you also look at the Lanczos4 result? It doesn't seem to reflect a 4-tap Lanczos.

i pushed MadVR and Haali now against the Intel Scaler on this case (no deinterlacing progressive higher SIF lower SD , high quantization (Sorenson Spark) to HD :)

http://img854.imageshack.us/img854/9254/madvrbillinear.png <- Billinear
http://img31.imageshack.us/img31/3447/madvrbicubic75.png <- Bicubic 0.75
http://img846.imageshack.us/img846/5562/madvrlanczos4.png <- Lanczos 4 taps
http://img833.imageshack.us/img833/2834/madvrlanczos4linear.png <- Lanczos 4 taps (Linear)
http://img713.imageshack.us/img713/2964/madvrsoftcubic.png <- Softcubic 50% softness
http://img840.imageshack.us/img840/8785/madvrspline4.png <- Spline 4 taps
http://img718.imageshack.us/img718/6958/haali75.png <- Haali -0.75 (seems also to adjust a little different though you can compensate it with its settings if needed)

http://img830.imageshack.us/img830/9624/egurintelscaler.png <- Intel Hardware Scaler (definetly the sharpest of all)
http://img545.imageshack.us/img545/6514/janslanczos4shader.png <- Jan Lanczos 4 tap
http://img192.imageshack.us/img192/3310/janbicubic10.png <- Jan??? Bicubic 1.00

PS: Though the PC/TV Scale issue makes the results not really 100% comparable (though im sure compensating that wouldn't change much of the percepted sharpness difference) and in the End Intels Scaler is pretty good (Egurs) and can easily compete with MadVRs custom shaders (and keeps the correct position on EVR, doesn't kill visible data) :)

The hardware deinterlacer, though, shows weaknesses (combing left behind) compared to yadif, and that already in the first test.

http://img839.imageshack.us/img839/771/intelscreenresult.png <- Intel
http://img16.imageshack.us/img16/9953/yadiffscreenresult.png <- Yadif

Though to be also fair this is in a extreme High Motion Scene that's so fast over no Human would realize it in that spot anyways so for Realtime Playback its ok see the encodes, though a lot of frame by frame comparer would hate it however (no im not one of those crazy guys if my goal is Realtime Playback , i don't watch stuff in slowmo or frame by frame usually, and even my perception of things is good it's not that good) ;)

JanWillem32

2nd February 2012, 22:10

Actually, the pictures comparing the renderers show what I expect them to look like. As far as I know, MadVR uses A16B16G16R16F textures, and you couldn't get EVR CP to work with anything but 8-bit textures. In terms of banding and handling contrast/color balance, it's quite visible.
The Haali renderer is well known to have the .5 pixel offset problem, EVR CP and MadVR properly compensate for it. This is also visible in the pictures. An overscan test of a single pixel border with large magnification factors can point out the problem more clearly. Some tests can be found here: http://www.w6rz.net/ .
As far as I know, the original bicubic shader set for VMR-9 (renderless) (or its predecessor, I'm not sure) was written by Haali. I only modified the resizer pixel shader set to allow more variants and load easier.

CruNcher

2nd February 2012, 22:29

So you would also agree that Intels Scaler does upscaling in this case more efficient keeping more visible data alive ?

egur

2nd February 2012, 22:45

@CruNcher one small correction. Lanczos4 is 8 tap. "4" is half the sampling window.
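
To make the tap count concrete, here is a small sketch of the standard Lanczos kernel. The function names are made up for this example and it is not taken from any of the shaders or scalers compared in this thread. With half-width a, the kernel is nonzero over (-a, a), so each output sample draws on 2a source samples per dimension: a = 4 gives 8 taps.

```cpp
#include <cassert>
#include <cmath>

// Normalized sinc: sin(pi x) / (pi x), with the limit value 1 at x = 0.
double normalized_sinc(double x) {
    if (x == 0.0) return 1.0;
    const double pi = 3.14159265358979323846;
    return std::sin(pi * x) / (pi * x);
}

// Lanczos kernel with half-width a: sinc(x) * sinc(x / a) inside the
// window |x| < a, zero outside it.
double lanczos_weight(double x, int a) {
    assert(a > 0);
    if (std::fabs(x) >= static_cast<double>(a)) return 0.0;
    return normalized_sinc(x) * normalized_sinc(x / a);
}

// Source samples contributing to one output sample, per dimension.
int lanczos_tap_count(int a) {
    return 2 * a;
}
```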

In the DI test there's a problem with the motion detection on the Intel side (fingers), but yadif mispredicts the necklace, so the necklace is reconstructed better by the Intel DI. I don't see detail loss in the Intel DI. Looking at the right side of the girl's shirt, you can see better details in the Intel scaler.
BTW, I don't know your exact setup, but you should turn off any other enhancement (denoise, sharpen, etc.) so we can have an apples to apples comparison.

CruNcher

2nd February 2012, 23:09

It's all off :)
Yes, you're right, the reconstruction looks more appropriate on the Intel side, if you see that it's a plain surface when the lights hit it.

egur

3rd February 2012, 14:36

Version 0.25 beta is out with the following changes:
* Fixed handling of CCV1 streams (Haali splitter custom fourCC).
* Support for H264 AVI files.
* Optimized memory copy further. Removed ASM code. Code now uses intrinsic for both 32 and 64 bit as intrinsic code reached 32 bit efficiency.
* Code cosmetics.
* FFDShow rev4295

Download from SourceForge home page (http://sourceforge.net/projects/qsdecoder/)

ramicio

3rd February 2012, 14:38

Is this supposed to work with EVR CP or not? I can get it to work with everything else, but when I use EVR CP I just get black. Are there any specific settings I need to change to make it work with EVR CP?

egur

3rd February 2012, 15:20

Is this supposed to work with EVR CP or not? I can get it to work with everything else, but when I use EVR CP I just get black. Are there any specific settings I need to change to make it work with EVR CP?

It works for me. Make sure NV12 output is enabled for best performance.

DragonQ

3rd February 2012, 15:22

Version 0.25 beta is out with the following changes:
* Fixed handling of CCV1 streams (Haali splitter custom fourCC).
* Support for H264 AVI files.
* Optimized memory copy further. Removed ASM code. Code now uses intrinsic for both 32 and 64 bit as intrinsic code reached 32 bit efficiency.
* Code cosmetics.
* FFDShow rev4295

Download from SourceForge home page (http://sourceforge.net/projects/qsdecoder/)
Do you know when support for pre-Sandy Bridge Intel IGPs is going to be fixed?

ramicio

3rd February 2012, 15:39

Everything defaults to NV12 color space. That was one of the first things I tried, switching between various output color spaces.

nevcairiel

3rd February 2012, 18:10

Do you know when support for pre-Sandy Bridge Intel IGPs is going to be fixed?

I have been wondering about this.
Even when no compatible hardware is available, shouldn't it be capable of software emulation?
It looks like the getOK method your interface provides returns failure.

I wish i had such hardware around so i could test it.

egur

4th February 2012, 15:38

I have been wondering about this.
Even when no compatible hardware is available, shouldn't it be capable of software emulation?
It looks like the getOK method your interface provides returns failure.

I wish i had such hardware around so i could test it.

There's something wrong with the installer QS check added by clsid (tested ffdshow r4291). It fails on my 9400T (Penryn with gm45 chipset). The installer checks for a processor revision of 42 (from SysInfo). This might also fail on IvyBridge.
I'll inform clsid about the problem.

SW emulation is working if the Media SDK 2012 is installed (free). I just verified this.
BTW, the DLL shipped with the driver doesn't contain all the code for SW emulation.
If you have a multi-GPU setup and want to test the SW fallback, connect the display to the dGPU and make sure the desktop is NOT extended to an iGPU socket. This will force the Media SDK to fall back to SW emulation.

nevcairiel

4th February 2012, 15:50

I'm not using any installer checks, but just creating the decoder instance seems to fail on non-SNB CPUs. All I'm doing is calling your functions to check if it's working, and the getOK function indicates failure. But, if I remember correctly, it used to work, so something must've changed.

egur

4th February 2012, 16:07

I'm not using any installer checks, but just creating the decoder instance seems to fail on non-SNB CPUs. All I'm doing is calling your functions to check if it's working, and the getOK function indicates failure. But, if I remember correctly, it used to work, so something must've changed.

Ok, I see the failure, I'll report back when I've solved it.

Update:
I couldn't get the HW acceleration to work on my Penryn system. The Media SDK fails to initialize. Even updated the driver and nothing. I'm almost 100% sure it worked before.
At clsid's request I'll add a check function. It will return general support for HW acceleration and SW emulation.

Update2:
Done at r32

CruNcher

5th February 2012, 02:16

Ok here is another Deinterlace test thx to Didée for providing the test sequence :)

Intel = http://www.mediafire.com/?hibu8gbdluh81pn
Yadif = http://www.mediafire.com/?0180493ch5cc4ln

Really great result for Intel on this one, which IMHO also weighs more than the motion failure it showed before, as this is definitely a very perceptible case :)

JanWillem32

5th February 2012, 22:42

Egur, if I may ask a few questions...
Can I please see your performance data on the copy function and the time required for the surface lock operation to complete in the command before the copy? I'd like to know more about the characteristics of CPU<->GPU memory copy functions, such as GetRenderTargetData, D3DXLoadSurfaceFromSurface and locking types, such as yours.
In the body of the copy function, streaming load operations are used, but no streaming store operations. Why is that? In general, all texture data chunks are too large for the CPU cache to hold and only slow down writing the data with non-streaming store operations.
In my earlier tests for inline copy functions, I could not find any benefit from stacking multiple load operations before a store operation. I see that you've implemented 8-register and 16-register store loops. Could you show the performance differences with other copy loops you tried?
(my previous "copytest": http://www.mediafire.com/?ud2dpkfum6zgchx , 134 KB - x86, x64 and source code On request, I can show more functions in implementations like this.)
About the alignment test, does this function ever copy a texture from a point other than the first pixel? All DirectX surfaces and other types of buffers I've ever encountered were at least 16-byte aligned (some items are even 64-kilobyte aligned by default) at the base.
Lastly, as I see you're using _mm_sfence(), you might find __faststorefence() interesting: http://msdn.microsoft.com/en-us/library/t710k390.aspx .

egur

5th February 2012, 23:35

Egur, if I may ask a few questions...
Can I please see your performance data on the copy function and the time required for the surface lock operation to complete in the command before the copy? I'd like to know more about the characteristics of CPU<->GPU memory copy functions, such as GetRenderTargetData, D3DXLoadSurfaceFromSurface and locking types, such as yours.
In the body of the copy function, streaming load operations are used, but no streaming store operations. Why is that? In general, all texture data chunks are too large for the CPU cache to hold and only slow down writing the data with non-streaming store operations.
In my earlier tests for inline copy functions, I could not find any benefit from stacking multiple load operations before a store operation. I see that you've implemented 8-register and 16-register store loops. Could you show the performance differences with other copy loops you tried?
(my previous "copytest": http://www.mediafire.com/?ud2dpkfum6zgchx , 134 KB - x86, x64 and source code On request, I can show more functions in implementations like this.)
About the alignment test, does this function ever copy a texture from a point other than the first pixel? All DirectX surfaces and other types of buffers I've ever encountered were at least 16-byte aligned (some items are even 64-kilobyte aligned by default) at the base.
Lastly, as I see you're using _mm_sfence(), you might find __faststorefence() interesting: http://msdn.microsoft.com/en-us/library/t710k390.aspx .

Well, I did quite a bit of testing after consulting with both architecture and driver performance experts.
My function may not be the fastest for all platforms (and GPUs) - I didn't test other CPUs. The driver isn't doing anything better; I've used all the tricks they do.
GetRenderTargetData didn't work for me. I don't know why.
I used D3D9 API for getting the address - pretty standard:

D3DLOCKED_RECT locked;
hr = pSurface->LockRect(&locked, NULL, D3DLOCK_READONLY | D3DLOCK_NOSYSLOCK);

I tried various other locking options. They either didn't work or had the same speed. Locking can be time consuming, that's why my code uses multithreading - decode in one thread (a worker thread) and copy in another. The DS decode thread is used mostly for synchronization.

The driver always returned a 64B (cache line) aligned address, BTW. I did the alignment and remainder checks for completeness.

_mm_sfence() is called only once, so its performance is irrelevant. Shaving off less than a microsecond won't change anything.

I always copy all the pixels in the surface. For most video sizes, it's 1:1. I did a few benchmarks and found out that it's not worth writing a separate function to copy lines or part of lines.
I just crop lines not needed in the output. That's why I copy Y and UV separately.

Here's a summary of the speedup tricks:
* Copy using 8 or 16 xmm registers using the method I used (via local variables). 16 registers give a very small performance boost. If you use _mm_stream_load_si128 to copy directly from source to target, MSVC will only use 2 xmm registers, causing performance degradation.
* The page offsets (12 LSBs) of the source and target addresses must be different. The CPU performs a check that they don't overlap, and the check is fastest with a 2K page offset. Allocate an extra 4K for the target buffer. I allocate the target buffer so I have control over this.
* Copy using 2 threads - each thread copies half. More than 2 threads didn't improve anything - it only degraded performance. It's also a good idea to use 2 threads for system to system copy.
* Load before store - made a difference (1%). Forces MSVC to use more xmm registers.
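
Two of the tricks above (the page offset and the two-thread split) can be sketched in portable C++. This is a rough illustration, not the decoder's actual code: `pick_target` and `copy_two_threads` are hypothetical names, `std::memcpy` stands in for the xmm register loop, a plain `std::thread` stands in for the thread pool, and the "2K page offset" is read here as "target page offset = source page offset + 2K", which is an assumption about what the list means.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <thread>

constexpr std::uintptr_t kPageSize = 4096;  // low 12 bits = page offset
constexpr std::uintptr_t kHalfPage = 2048;

// Given a target buffer over-allocated by 4K, returns a start address inside
// its first 4K whose page offset is 2K away from the source's page offset.
// If both inputs are 64-byte aligned, the result stays 64-byte aligned.
std::uint8_t* pick_target(std::uint8_t* buffer, const std::uint8_t* src) {
    const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(buffer);
    const std::uintptr_t src_offset =
        reinterpret_cast<std::uintptr_t>(src) & (kPageSize - 1);
    const std::uintptr_t wanted = (src_offset + kHalfPage) & (kPageSize - 1);
    const std::uintptr_t skip = (wanted - base) & (kPageSize - 1);  // 0..4095
    return buffer + skip;
}

// Copies size bytes using two threads: a worker takes the first half while
// the calling thread takes the second half.
void copy_two_threads(std::uint8_t* dst, const std::uint8_t* src,
                      std::size_t size) {
    assert(dst != nullptr && src != nullptr);
    const std::size_t half = size / 2;
    std::thread worker([=] { std::memcpy(dst, src, half); });
    std::memcpy(dst + half, src + half, size - half);
    worker.join();
}
```

Spawning a thread per copy is far too expensive for per-frame use; a real implementation would keep the worker alive, as the thread pool mentioned below does.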

You can look at my thread pool code. If you can improve it, let me know.

I don't have MB/s numbers as I did system tests using GraphStudioNext (high priority process). I used a 1080p clip with relatively low bitrate ~1.5mbps. In the benchmark ffdshow is copying the frame again to the renderer. ffdshow's copy method is not MT (yet).
For the clip I use, I get an average of ~835fps for 5000 frames for both 32 and 64 bit. This is far below the memory controller's speed, but again ffdshow is working, the GPU is decoding, etc. I didn't set up a pure copy benchmark environment because I fear that it might not reflect real world performance.
BTW, I have relatively cheap memory DDR3@1333MHz.
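
For a rough sense of scale, the frame rate above can be converted into bytes moved per copy. This is back-of-the-envelope arithmetic, not a measurement: it assumes a 1920x1080 NV12 surface with no pitch padding (1.5 bytes per pixel), and the function names are made up for this example. Under those assumptions one frame is 3,110,400 bytes, so ~835 fps is roughly 2.6 GB/s for a single copy of each frame. Since ffdshow copies the frame a second time in this benchmark, the total memory traffic would be higher.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one NV12 frame: a full-resolution Y plane plus a half-height
// interleaved UV plane, i.e. 1.5 bytes per pixel. Pitch padding is ignored.
std::uint64_t nv12_frame_bytes(std::uint32_t width, std::uint32_t height) {
    assert(width % 2 == 0 && height % 2 == 0);
    return static_cast<std::uint64_t>(width) * height * 3 / 2;
}

// Bytes moved per second when every frame is copied exactly once.
double copy_bytes_per_second(double fps, std::uint64_t frame_bytes) {
    return fps * static_cast<double>(frame_bytes);
}
```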

egur

6th February 2012, 00:03

Driver v2622 is out on Intel's download page (http://downloadcenter.intel.com/SearchResult.aspx?lang=eng&ProductFamily=Graphics&ProductLine=Processor+graphics&ProductProduct=2nd+Generation+Intel%C2%AE+Core%E2%84%A2+Processors+with+Intel%C2%AE+HD+Graphics+3000%2f2000&ProdId=3319&LineId=3310&FamilyId=39).

STaRGaZeR

6th February 2012, 00:25

Driver v2622 is out on Intel's download page (http://downloadcenter.intel.com/SearchResult.aspx?lang=eng&ProductFamily=Graphics&ProductLine=Processor+graphics&ProductProduct=2nd+Generation+Intel%C2%AE+Core%E2%84%A2+Processors+with+Intel%C2%AE+HD+Graphics+3000%2f2000&ProdId=3319&LineId=3310&FamilyId=39).

Any noticeable improvements?

CruNcher

6th February 2012, 00:44

http://downloadmirror.intel.com/20843/eng/ReleaseNotes_GFX_64.htm

Officially not a lot: QuickSync stability improvements, I think.

egur

6th February 2012, 08:57

Any noticeable improvements?

WMV9 HW support. Works fast but has occasional corruption in several of my test clips. Weirdly slow seeks (maybe I can fix this).

@CruNcher
A driver after that (not public yet) fixes the notorious mc.ts clip. BTW, this clip with Haali gives me a bad frame rate. Haali is sending time stamps consistent with 40fps :rolleyes:

I didn't notice quality improvements in other clips.
I didn't benchmark anything yet.

RBG

6th February 2012, 10:52

WMV9 HW support.

Only WMV9 advanced profile.:(

STaRGaZeR

6th February 2012, 16:06

WMV9 HW support. Works fast but has occasional corruption in several of my test clips. Weirdly slow seeks (maybe I can fix this).

Argh, it also seems that some VC1 corruption is still there.

nevcairiel

6th February 2012, 16:16

Argh, it also seems that some VC1 corruption is still there.

As long as there is no new corruption.... :)
Eric said they are working on some WMV9/VC1 things, so i'm hopeful! ;)

NikosD

6th February 2012, 16:40

The most useful and right thing for Intel to do regarding VC-1 is to open it to all, by making a ModeVC1_VLD mode accessible to everyone and not only to Cyberlink and Arcsoft, so that other video player developers can use HW acceleration for VC-1 without using quicksync.dll and Intel's MSDK.

No offence for Eric and his great effort to build something useful, like QuickSync decoder (and to fix some things internally regarding Video support of Intel, as I understand)

Xaurus

6th February 2012, 18:01

hi egur,

I want to thank you for your excellent work. As I understand it you are probably able to answer this question:

Slightly off-topic, but does anyone know if it is possible to run the Intel 2000 IGP at the same time as a Nvidia 450 GTS?
I mean, connect the display to the 450 and just take the audio from the Intel 2000 IGP.

I've searched everywhere but I can't really find an answer.