Intel QuickSync Decoder - HW accelerated FFDShow decoder with video processing [Archive] - Page 15

Blight

6th February 2012, 18:12

btw, a bit old, but the reason why downloading drivers from windows update is a bad idea for display drivers is that often, these drivers are just bare-bones. Enough to get the OS working in the right resolution, but doesn't include every module the driver downloaded from the company's site might have.

In the WinXP days, I downloaded an nvidia driver that lacked Direct3D support from windows update, so I stopped trying and now go directly to the source.

ramicio

6th February 2012, 18:14

Yeah, I don't know why people rely on Windows Update like it's some holy grail. For motherboards and video cards, I usually download drivers right from the chip manufacturer, instead of the hardware-slapper-together.

nevcairiel

6th February 2012, 18:33

In the WinXP days, I downloaded an nvidia driver that lacked Direct3D support from windows update, so I stopped trying and now go directly to the source.

To be fair, Windows Update in Win7 is significantly different (and much better) than what WinXP had.

Nevertheless, getting the real driver from the official source is always better. I usually ignore this rule for stuff like network drivers or printer drivers, though =p

HeadlessCow

6th February 2012, 19:11

btw, a bit old, but the reason why downloading drivers from windows update is a bad idea for display drivers is that often, these drivers are just bare-bones. Enough to get the OS working in the right resolution, but doesn't include every module the driver downloaded from the company's site might have.

Nevertheless, getting the real driver from the official source is always better. I usually ignore this rule for stuff like network drivers or printer drivers, though =p

For video drivers this might be bad, but for printer drivers, the barebones version is so, very much better :) And hundreds of megabytes smaller!

JanWillem32

6th February 2012, 20:45

Thank you for your quick response, egur.

Well, I did quite a bit of testing after consulting with both architecture and driver performance experts.
My function may not be the fastest for all platforms (and GPUs) - I didn't test other CPUs. The driver isn't doing anything better, I've used all the tricks they do.
GetRenderTargetData didn't work for me. Don't know why.

It's not well documented, but I believe the restrictions are pretty much the same as StretchRect: http://msdn.microsoft.com/en-us/library/windows/desktop/bb174471%28v=vs.85%29.aspx .
Locking a surface that isn't in the asynchronous command queue for renderer device tasks isn't very expensive indeed. The performance cost for allocating a lockable render target is probably low as well. (Standard render targets and render target surface levels of a texture are not lockable by default.)

I used D3D9 API for getting the address - pretty standard:

D3DLOCKED_RECT locked;
hr = pSurface->LockRect(&locked, NULL, D3DLOCK_READONLY | D3DLOCK_NOSYSLOCK);

I tried various other locking options. They either didn't work or had the same speed. Locking can be time consuming, that's why my code uses multithreading - decode in one thread (a worker thread) and copy in another. The DS decode thread is used mostly for synchronization.

There are no magic methods to get locking cheaper. Locking an offscreen plain surface might acquire a lock a bit faster, but that's mostly because it can't be re-used in a rendering chain. The D3DLOCK_NOSYSLOCK flag is unused since Windows 2000: http://msdn.microsoft.com/en-us/library/windows/desktop/ee416788%28v=vs.85%29.aspx . Windows 95 and 98 had 16-bit system parts that were problematic with locking.

The driver always returned a 64B (cache line) aligned address BTW. I did alignment and remainder checks for completeness.

Shouldn't that code go into a DEBUG_ONLY sequence or something similar then?

_mm_sfence() is called only once so its performance is meaningless. Shaving less than a microsecond won't change anything.

I always copy all the pixels in the surface. For most video sizes, it's 1:1. I did a few benchmarks and found out that it's not worth writing a separate function to copy lines or part of lines.
I just crop lines not needed in the output. That's why I copy Y and UV separately.

Are you sure it's separated Y and UV? In an NV12 texture, the Y plane takes 2/3 and the UV plane takes 1/3 of the total memory. Isn't it more efficient to distribute the load so that each thread takes 50%?

Here's a summary of the speedup tricks:
* Copy using 8 or 16 xmm registers using the method I used (via local variables). 16 registers give a very small performance boost.

I've taken a look at the assembly MSVC generates, and found out that the assembly generates fine for 1- to 16-register loops. The only thing I did see, is that a loop of 5 registers or more takes more than 64 bytes of instruction code, and isn't aligned on a 64 byte boundary for that reason. (The loop for 4 registers takes 56 bytes and is cache line aligned.) In my tests, the loop of one load and one store is the most stable in its speed across a few tests. Larger loops are rarely faster, but sometimes are a lot slower. It could be that there's a generation gap in between the processors in my PCs and Sandy Bridge, though.

If you use _mm_stream_load_si128 to copy from source to target, MSVC will only use 2 xmm registers causing performance degradation.

I didn't exactly mean _mm_stream_load_si128 (movntdqa reg mem), but using _mm_stream_si128 (movntdq mem reg) instead of _mm_store_si128 (movdqa mem reg). In my tests (256 MB memory copy), the version with non-temporal stores is 53% faster.

* Source and target addresses page offsets (12 LSBs) must be different. CPU performs check that they don't overlap. Check is fastest with a 2K page offset. Allocate an extra 4K for the target buffer. I allocate the target buffer so I have control over this.
* Copy using 2 threads - each thread copies half. More than 2 threads didn't improve - only degraded performance. It's also a good idea to use 2 threads for system to system copy.
* Load before store - made a difference (1%). Forces MSVC to use more xmm registers.
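
The page-offset trick from the list above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name `offset_target` and the two constants are made up here and are not taken from the actual gpu_memcpy sources. It expects the caller to over-allocate the target by 4K and then picks a start address whose low 12 bits are 2K away from the source's.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not from the QuickSync decoder sources.
static const std::uintptr_t kPageSize = 4096;   // page offset = 12 least significant bits
static const std::uintptr_t kHalfPage = 2048;   // the "2K page offset" from the list

// raw: start of an allocation of at least (copy size + 4096) bytes.
// src: the source address the copy will read from.
// Returns a pointer inside raw whose page offset is the source's plus 2K.
inline unsigned char* offset_target(unsigned char* raw, const void* src)
{
    const std::uintptr_t mask   = kPageSize - 1;
    const std::uintptr_t srcOff = reinterpret_cast<std::uintptr_t>(src) & mask;
    const std::uintptr_t want   = (srcOff + kHalfPage) & mask;   // desired page offset
    const std::uintptr_t rawOff = reinterpret_cast<std::uintptr_t>(raw) & mask;

    // Bytes to skip so that (raw + skip) lands on the desired page offset.
    const std::uintptr_t skip = (want - rawOff) & mask;
    assert(skip < kPageSize);   // which is why 4K of extra space is enough
    return raw + skip;
}
```

A side effect of this choice is that the target inherits the source's alignment within the page, so a 64B aligned source gives a 64B aligned target.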

You can look at my thread pool code. If you can improve it, let me know.

Thank you for the information, I'll try some more optimizations and let you know if I find something interesting.

I don't have MB/s numbers as I did system tests using GraphStudioNext (high priority process). I used a 1080p clip with relatively low bitrate ~1.5mbps. In the benchmark ffdshow is copying the frame again to the renderer. ffdshow's copy method is not MT (yet).
For the clip I use, I get an average of ~835fps for 5000 frames for both 32 and 64 bit. This is far below the memory controller's speed, but again ffdshow is working, the GPU is decoding, etc. I didn't set up a pure copy benchmark environment because I fear that it might not reflect real world performance.
BTW, I have relatively cheap memory DDR3@1333MHz.

I generally just insert _ReadWriteBarrier(), _mm_mfence(), and such to order data in a certain section, and then use QueryPerformanceCounter() at the beginning and end to profile that section. I do avoid unnecessary calls to QueryPerformanceCounter() in release builds that are not used for profiling, as the kernel round-trip it uses is slow.
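
A portable sketch of that kind of section profiling might look like the following. To be clear about the substitutions: `std::chrono::steady_clock` stands in for QueryPerformanceCounter(), `std::atomic_thread_fence` stands in for _ReadWriteBarrier()/_mm_mfence(), and a plain memcpy stands in for the section being measured. The function name `time_copy_seconds` is made up for this example.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Times one memcpy of `bytes` bytes (bytes > 0) and returns the elapsed seconds.
inline double time_copy_seconds(std::size_t bytes)
{
    std::vector<unsigned char> src(bytes, 0x5A);
    std::vector<unsigned char> dst(bytes, 0);

    // Fence so that the buffer setup is ordered before the start stamp.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    const std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();

    std::memcpy(dst.data(), src.data(), bytes);   // the section being profiled

    // Fence so that the copy is ordered before the end stamp.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    const std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();

    assert(dst[bytes - 1] == 0x5A);   // sanity check on the copied data
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Dividing the byte count by the returned time gives the MB/s figure for a pure copy, which is the synthetic number the system-level fps tests do not provide.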

nevcairiel

6th February 2012, 21:01

In my tests, the loop of one load and one store is the most stable in its speed across a few tests. Larger loops are rarely faster, but sometimes are a lot slower.

When using streaming loads (movntdqa), you need to process a multiple of 4 loads at a time to fully exhaust the 64B streaming cache line(s). If you start mixing it with writes, it'll degrade performance.
Note that movntdqa reverts to normal movdqa behaviour when you run it on "normal" memory (anything that's not USWC), so benchmarking it on a normal memory buffer is pointless and needs to be done on a GPU -> System copy.

I agree that Eric should look into using _mm_stream_si128, it might offer a speed enhancement still.
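
For readers following along, the two ideas in this post (batch four loads per 64B line before any store, and try non-temporal stores) could be combined in a sketch like the one below. It is not the decoder's gpu_memcpy: the function name and the `useStreamingStores` flag are invented here, and it uses plain `_mm_load_si128` so that it builds with SSE2 only. On USWC memory the loads would be `_mm_stream_load_si128` (SSE4.1) instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>   // SSE2: _mm_load_si128, _mm_store_si128, _mm_stream_si128, _mm_sfence

// Copies `bytes` bytes (a multiple of 64) between 16-byte aligned buffers,
// one 64-byte cache line per iteration: four loads first, then four stores.
// useStreamingStores picks _mm_stream_si128 (movntdq) over _mm_store_si128 (movdqa).
inline void copy_cache_lines(void* dst, const void* src, std::size_t bytes,
                             bool useStreamingStores)
{
    assert(bytes % 64 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);

    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);

    for (std::size_t i = 0; i < bytes / 16; i += 4) {
        // All four loads of the line are issued before any store.
        const __m128i x0 = _mm_load_si128(s + i + 0);
        const __m128i x1 = _mm_load_si128(s + i + 1);
        const __m128i x2 = _mm_load_si128(s + i + 2);
        const __m128i x3 = _mm_load_si128(s + i + 3);

        if (useStreamingStores) {
            _mm_stream_si128(d + i + 0, x0);
            _mm_stream_si128(d + i + 1, x1);
            _mm_stream_si128(d + i + 2, x2);
            _mm_stream_si128(d + i + 3, x3);
        } else {
            _mm_store_si128(d + i + 0, x0);
            _mm_store_si128(d + i + 1, x1);
            _mm_store_si128(d + i + 2, x2);
            _mm_store_si128(d + i + 3, x3);
        }
    }

    if (useStreamingStores)
        _mm_sfence();   // order the non-temporal stores before later stores
}
```

Which store variant wins has to be measured on the actual GPU -> system path, since results on ordinary cached memory do not carry over.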

STaRGaZeR

6th February 2012, 21:47

As long as there is no new corruption.... :)
Eric said they are working on some WMV9/VC1 things, so i'm hopeful! ;)

When talking about drivers, I guess you could say that... :D

nevcairiel

6th February 2012, 21:49

When talking about drivers, I guess you could say that... :D

He also said a yet unreleased driver fixes "MC.ts", which is a VC-1 stream that caused errors before, so, yay? :)

egur

6th February 2012, 22:27

...
The D3DLOCK_NOSYSLOCK flag is unused since Windows 2000

It was hard to benchmark the locks and I'm not as much of a D3D expert as you :)
Thanks, I'll remove the D3DLOCK_NOSYSLOCK flag.

Shouldn't that code go into a DEBUG_ONLY sequence or something similar then?
No, better safe than sorry, it doesn't affect performance. People might copy-paste this function to somewhere else...

Are you sure it's separated Y and UV? In an NV12 texture, the Y plane takes 2/3 and the UV plane takes 1/3 of the total memory. Isn't it more efficient to distribute the load so that each thread takes 50%?
Y is copied using 2 threads (50-50) and then UV is copied in 2 threads (50-50).
My (pseudo) code looks like:

mt_gpu_memcpy(outFrame.y, inFrame.y, height * pitch);
mt_gpu_memcpy(outFrame.uv, inFrame.uv, pitch * height / 2);
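
The plane arithmetic behind those two calls can be written out as a small sketch. The names below are hypothetical and not from the decoder sources, and rounding the split point down to a 64-byte boundary is an assumption made here so that both chunks start on a cache line.

```cpp
#include <cassert>
#include <cstddef>

struct Nv12Layout {
    std::size_t yBytes;    // luma plane: pitch * height
    std::size_t uvBytes;   // interleaved chroma plane: pitch * height / 2
};

inline Nv12Layout nv12_layout(std::size_t pitch, std::size_t height)
{
    assert(height % 2 == 0);   // NV12 chroma has half the number of lines
    Nv12Layout l;
    l.yBytes  = pitch * height;
    l.uvBytes = pitch * height / 2;
    return l;
}

// Size of the first of two chunks when one plane is split between two threads,
// rounded down to a multiple of 64 bytes. The second thread copies the rest.
inline std::size_t first_half_bytes(std::size_t planeBytes)
{
    return (planeBytes / 2) & ~static_cast<std::size_t>(63);
}
```

With this layout Y is 2/3 and UV is 1/3 of the frame, and because each plane is split 50-50 on its own, both threads end up with the same total amount of work.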

I've taken a look at the assembly MSVC generates, and found out that the assembly generates fine for 1- to 16-register loops. The only thing I did see, is that a loop of 5 registers or more takes more than 64 bytes of instruction code, and isn't aligned on a 64 byte boundary for that reason. (The loop for 4 registers takes 56 bytes and is cache line aligned.) In my tests, the loop of one load and one store is the most stable in its speed across a few tests. Larger loops are rarely faster, but sometimes are a lot slower. It could be that there's a generation gap in between the processors in my PCs and Sandy Bridge, though.

I didn't exactly mean _mm_stream_load_si128 (movntdqa reg mem), but using _mm_stream_si128 (movntdq mem reg) instead of _mm_store_si128 (movdqa mem reg). In my tests (256 MB memory copy), the version with non-temporal stores is 53% faster.

Thank you for the information, I'll try some more optimizations and let you know if I find something interesting.

What I meant is using only _mm_stream_load_si128:

dest[i] = _mm_stream_load_si128(src + i + j); // copy line 8 times, replace j with 0, 1, 2, ...

The above code would only use two xmm registers under MSVC. ICL 12 will use them all.
I didn't try using _mm_stream_si128. I'll check it out and report my findings.

I generally just insert _ReadWriteBarrier(), _mm_mfence(), and such to order data in a certain section, and then use QueryPerformanceCounter() at the beginning and end to profile that section. I do avoid unnecessary calls to QueryPerformanceCounter() in release builds that are not used for profiling, as the kernel round-trip it uses is slow.
That's all good and well, but I'm interested in system performance, so I look at fps in GraphStudioNext boosted to highest priority.
Synthetic benchmarks sometimes stray from the real world.

Update
Benchmarked the gpu_memcpy function using _mm_stream_si128 instead of _mm_store_si128. It was slower:
_mm_store_si128 (current code): avg 866fps
_mm_stream_si128 (new): avg 803fps

BTW, the 2622 driver is faster than the 2559 driver. The latter produced only 835fps.

Update2
I've managed to further optimize the copy function. fps is now 910 (was 866). The optimization was found by mistake and I don't know why it works faster. But it does. I'll build tomorrow for all to test. Checked in SVN at r34.
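
Since no MB/s numbers were given, the fps figures above can be turned into a rough copy rate with the arithmetic below. This is an estimate under assumptions, not a measurement: it assumes a 1920x1080 NV12 frame with no pitch padding, counts a single copy of each frame, and the function names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>

// Bytes in one NV12 frame, assuming the pitch equals the width (no padding).
inline std::size_t nv12_frame_bytes(std::size_t width, std::size_t height)
{
    return width * height * 3 / 2;
}

// Bytes per second moved by one copy of every frame at the given frame rate.
inline double copy_bytes_per_second(std::size_t frameBytes, double fps)
{
    return static_cast<double>(frameBytes) * fps;
}
```

Under those assumptions one frame is 3,110,400 bytes, so 866fps corresponds to roughly 2.69 GB/s and 910fps to roughly 2.83 GB/s per copy. The real surface may be padded and ffdshow copies the frame a second time, so actual memory traffic is higher.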

STaRGaZeR

6th February 2012, 23:14

He also said a yet unreleased driver fixes "MC.ts", which is a VC-1 stream that caused errors before, so, yay? :)

May I have that sample?

nevcairiel

7th February 2012, 07:42

May I have that sample?

http://www.mediafire.com/download.php?1uc5b42u55ue280
Field Interlaced VC-1 (one of the worse parts of the VC-1 spec)

DragonQ

7th February 2012, 10:50

When is interlaced VC-1 actually used? The only time I've ever seen VC-1 used at all is on BDs and most of those are 23.976p.

nevcairiel

7th February 2012, 10:53

Blu-rays use interlaced VC-1, mostly for documentaries or concerts.

STaRGaZeR

7th February 2012, 22:02

http://www.mediafire.com/download.php?1uc5b42u55ue280
Field Interlaced VC-1 (one of the worse parts of the VC-1 spec)

Thanks, the guy who made that sample sure knows how to motivate the driver team :D

Thunderbolt8

7th February 2012, 22:31

http://www.mediafire.com/download.php?1uc5b42u55ue280
Field Interlaced VC-1 (one of the worse parts of the VC-1 spec)

That file runs fine and smooth for me with LAV Video (albeit just barely, as the present queue indicates), but not with the MS Video Decoder DMO.

CruNcher

8th February 2012, 07:13

Thanks, the guy who made that sample sure knows how to motivate the driver team :D

It took Nvidia 1 Driver release to fix it just as a side note, seems it will be 2 for Intel ;)

egur

8th February 2012, 10:03

It took Nvidia 1 Driver release to fix it just as a side note, seems it will be 2 for Intel ;)

Reading Nev's thread on LAV filters shows that Nvidia's latest drivers are quite broken with respect to video, and I had issues getting my AMD 6950 to play VC-1. At least with the Intel driver the improvements are slow but monotonic.

nevcairiel

8th February 2012, 10:05

I should test if Intel's drivers still BSOD when I install my system in UEFI mode...
Nothing is ever perfect. :p

NikosD

8th February 2012, 10:21

All interlaced content (MPEG-2, H.264, VC-1) plays like progressive on ATi's HD 5000 series hardware/drivers using PotPlayer for the last many months (and HW accelerated of course)

What's so special about MC.ts anyway ?
Many extras in Blu-ray discs are encoded that way.
Like 300 for example.

CruNcher

8th February 2012, 11:11

Reading Nev's thread on LAV filters shows that Nvidia's latest drivers are quite broken with respect to video, and I had issues getting my AMD 6950 to play VC-1. At least with the Intel driver the improvements are slow but monotonic.

I'm already fascinated by the quality. Btw, is any of these implementations used in the CE4150 series (Sodaville) as well, or is it completely different? :)
It seems at least the Decoder Core is not from Intel but 3rd party licensed IP from PowerVR since the Atom http://www.youtube.com/watch?v=LzEgd1rF6Ps
so i really wonder if the next CE will use the new Intel Decoder and implementations around it (Scaler,Deinterlacer) as well :) ?

egur

8th February 2012, 12:47

I'm already fascinated by the quality. Btw, is any of these implementations used in the CE4150 series (Sodaville) as well, or is it completely different? :)
It seems at least the Decoder Core is not from Intel but 3rd party licensed IP from PowerVR since the Atom http://www.youtube.com/watch?v=LzEgd1rF6Ps
so i really wonder if the next CE will use the new Intel Decoder and implementations around it (Scaler,Deinterlacer) as well :) ?

The Atom family has a different GPU altogether (I think PowerVR) and a completely different and simpler core. Atoms are not scaled down versions of SandyBridge. It's probably impossible or at least not practical to have the same architecture and process span 1-150 Watts.

CruNcher

8th February 2012, 13:34

But a lot of improvements that were made on Atom (power consumption) have been ported to SB (SB in idle behaves, according to Intel's tools, like an Atom @ full utilization, coincidence maybe ;)) :) so I guessed it would also work the other way ;) seeing tri-gates coming, which surely the new CE generation is going to share as well :)
Also the Scaler and Deinterlacer are fixed function as you said and the Decoder IP too the question is just how does CE4100s Scaler,Deinterlacer compare to the current fixed function SB and coming Ivy Bridge implementations, and are we seeing here a move away from PowerVR IP over the long run and unifying the CE Series with Intels IP, also seeing that you now have your own Video Codec research back again which becomes really interesting since the Indeo IP was sold ;)

egur

8th February 2012, 13:36

But a lot of improvements that were made on Atom (Power Consumption) have been ported to SB :)
I don't think this is accurate at all :)

egur

9th February 2012, 21:00

Version 0.26 beta is out with the following changes:
* Added option to disable SW decoding when HW can’t decode. Default is not to decode in SW.
* Even faster memory copy function
* FFDShow rev4313

Download from SourceForge home page (http://sourceforge.net/projects/qsdecoder/)

egur

10th February 2012, 09:36

I ran a quick benchmark on my new HTPC:
i7-2600K, 3.8GHz, DDR3 6-8-6@1600. Windows 7 64 bit. 32 bit playback using LAV splitter and my latest build (0.26). GraphStudioNext.
Low bitrate H264 test clip.
Tried 2 memory speeds:
* BIOS default: 1333 - 905fps (about the same as my dev PC)
* XMP profile: 1600 6-8-6 timing - 1030fps.

So for the performance lovers, you can scale performance with memory speed by buying a slightly more expensive memory.

Changing the GPU clock (1350->1500) didn't change anything in this test. It might on high bitrate clips. I'll run a few more clips and test.

nevcairiel

10th February 2012, 09:42

egur

10th February 2012, 12:13

Someone over at AVS Forum already established that higher memory frequency has quite a significant impact on performance on the IGP when using madVR, for example.
I guess it's a two-fold process: faster download of the frames, and also faster upload. Additionally, the memory is also used as GPU memory, so any processing will also be faster.

Considering IVB will bump up the default to 1600, I'll probably aim for 1866 or even 2133.

It's one of the only reasons to buy fast memory - better iGPU utilization. In non-GPU benchmarks fast memory has very little impact, usually no impact.

Even today you can buy very fast memory but it will cost you a significant premium over the basic RAM.
SandyBridge's memory controller can handle 2133 of course - otherwise no one would buy this RAM.
The board manufacturer has to design a board that can handle these high speeds robustly. This comes at a price of course.
Benchmarks aside, is fast memory a must for building an HTPC?
If you use RAM that operates on standard voltage and delivers better bandwidth, then the CPU will spend less time in elevated power states, i.e. it will have more time to doze off and save power.
The sweet spot these days is 1600, because its price premium is small and it works 20% faster.
Future memory technologies (DDR3L, LP-DDR3, DDR4, ...) will use less power (less voltage) and provide same or better bandwidth helping to make the HTPCs smaller and quieter and probably cheaper too.

ajp_anton

10th February 2012, 13:31

I just bought some cheap "ValueRAM" at 1333MHz, overclocked it to 1600 and undervolted to 1.27V =).

egur

10th February 2012, 13:37

I just bought some cheap "ValueRAM" at 1333MHz, overclocked it to 1600 and undervolted to 1.27V =).

It works stable at 1.27V? Impressive.
Maybe this tweaking can be done automatically by the BIOS and save power...

DragonQ

10th February 2012, 13:38

Which kinda makes you wonder...is it worth spending more money on faster RAM just to utilise the IGP when you could get cheaper RAM and a stand-alone GPU (in a desktop anyway)?

nevcairiel

10th February 2012, 14:13

I would rather buy both, but i would also never try to save on a PC on the wrong ends. :p

CruNcher

10th February 2012, 15:09

i would rather use ARM or Atom (though not the consumer Atom stuff industry or CE ;) ) for a HTPC or Sandy or Ivy Bridge Mobile versions :)

ajp_anton

10th February 2012, 19:22

It works stable at 1.27V? Impressive.
Maybe this tweaking can be done automatically by the BIOS and save power...

I was as surprised as you. Actually, it was stable at 1.23V for half a year, but then I got a random BSOD. Don't know what caused it, but the RAM felt most likely so I upped it to 1.27V.
So now its voltage is the same as my CPU =).

RBG

10th February 2012, 23:12

I just bought some cheap "ValueRAM" at 1333MHz, overclocked it to 1600 and undervolted to 1.27V =).

And what's the point in lowering ram voltage? That will only reduce your stability with no actual benefits....

ryrynz

11th February 2012, 00:40

It reduces heat, power consumption and improves product longevity. It's not THAT important unless you just love to tweak.

ajp_anton

11th February 2012, 02:36

It's not THAT important unless you just love to tweak.

Haha, it's mostly this =). Two 4GB DIMMs probably won't use much power compared to the overclocked 2600k in the system.
But I was just making the point that no one should be running slow memory. 1600MHz doesn't seem to be a problem for even the cheapest RAM out there.

RBG

11th February 2012, 05:35

It reduces heat, power consumption and improves product longevity. It's not THAT important unless you just love to tweak.

To tell you the truth it is TOTALLY not important unless you're running an HPC system with terabytes of RAM. ;) On an ordinary computer, undervolting your RAM saves up to 1 watt at idle and up to 3 watts at peak; compared to the overall system power consumption that is not worth noting. Heat output is not a problem either. And I don't think you'll get a significant longevity increase by lowering RAM voltage, but you'll get stability problems for sure. From my experience I even doubt that ajp_anton's system is really stable, unless some kind of LoVo DDR3 is used. :rolleyes:

ryrynz

11th February 2012, 06:40

They're pretty insignificant for sure but you did ask the question :) it's mostly OCD regarding getting the most out of your components.

Regarding stability if it's stable for his needs, it's stable. Some go the whole nine yards to test rock solid stability and some don't, let's avoid turning this into an overclocking thread hmm? ;)
Watch this space for more QuickSync magic!

RBG

11th February 2012, 08:19

They're pretty insignificant for sure but you did ask the question :)

It was a rhetorical question. :p

Regarding stability if it's stable for his needs, it's stable.

Tertium non datur. You either have stable ram or you have not.;)

ajp_anton

11th February 2012, 19:40

Everything is unstable. The question is what the "half time"-equivalent of your system is. Out-of-the-box it's probably in the order of millions of years, so it's practically stable.
Having it running 24/7 for 6 months before a BSOD, with lots of video compression and Avisynth scripts filling 90% of my RAM, it's stable enough for me. After this I upped the voltage just a bit, so now it should be even more stable, *if* the BSOD was even caused by the RAM and not the OC'd CPU. Either way I don't care. I couldn't live with a computer I didn't tweak to death.

---

Random question that may or may not be off-topic: =)
Can for example x264 create lossless streams that are compatible with QS? Like, skipping the non-compatible but efficient lossless algorithms, and instead just throw enough bitrate at it to make it lossless?
Would be useful for intermediate files between heavy Avisynth scripts if they were really fast to decode in hardware.

nevcairiel

11th February 2012, 19:55

No matter how much bitrate you throw at it, it'll never be truly "lossless", however you can probably make it "visually lossless" to some degree.

egur

11th February 2012, 20:21

...
Random question that may or may not be off-topic: =)
Can for example x264 create lossless streams that are compatible with QS? Like, skipping the non-compatible but efficient lossless algorithms, and instead just throw enough bitrate at it to make it lossless?
Would be useful for intermediate files between heavy Avisynth scripts if they were really fast to decode in hardware.

My guess is that if you make the intermediate clip all I-frames with super high bitrate (50+ mbps) it will be fast and with excellent quality.

RBG

11th February 2012, 20:37

Everything is unstable. The question is what the "half time"-equivalent of your system is. Out-of-the-box it's probably in the order of millions of years, so it's practically stable.

You are moving the topic to a philosophical matter, and I think that is wrong. Almost all hardware parts in your PC are covered by the manufacturer's warranty, which means that during that period your hardware will work properly and be stable, as long as you follow the manufacturer's service instructions; otherwise it will be replaced or you'll receive a refund. So there is no third option here: either it is stable or it is not.

Either way I don't care. I couldn't live with a computer I didn't tweak to death.

Well it is your choice, though I strongly recommend you to do proper ram testing, pure read/write series of test patterns and the same under full psu load.

CruNcher

12th February 2012, 05:13

@ Egur

Intels PP doesn't seem to work with Adobes Flash Player (Custom Direct3D Renderer) ?

AMD has, I think, a special option for it in their Control Panel

http://oi40.tinypic.com/aavgg4.jpg

egur

12th February 2012, 08:00

@CruNcher,
Didn't check it, but it should be enabled via Adobe (in my opinion) and not forced by the driver.

DragonQ

12th February 2012, 12:06

Eww @ all that post-processing. That's what makes many modern TVs look so bad.

CruNcher

12th February 2012, 12:09

Ehh depends on the TV some use very good research up to the point of Super Resolution in Hardware ;)
though it's not easy to find out what kind of technology (algorithm) works in the background, and most marketing is like crazy encryption ;)
Samsung for example likes to work together with MSU(Yuvsoft) http://compression.ru/video/resampling/index_en.html on their TV PP research though it's not easy to find out in which actual product these things are being used in the end ;)

PS:
Can't wait to see MSU's GPU encoder comparison. It will be interesting to see if Intel can keep up the good result of SB with Ivy (better performance is clear, but will it still be @ the top in quality ;) ), whether Nvidia improved their encoder, and what AMD now reaches with the motion estimation improvements for the encoding part :)

egur

12th February 2012, 12:11

Eww @ all that post-processing. That's what makes many modern TVs look so bad.

Mine looks great and much better than my previous one (6 years old), which was one of the best at the time. You need to tune it and definitely not max out the enhancements. For PC display, you should turn practically all enhancements off.

CruNcher

12th February 2012, 19:31

@Egur
What is the difference between ILK = ??? and SNB = SandyBridge ? binaries like libmfxhw[bit]-i1/s1.dll ?

egur

12th February 2012, 23:47

@Egur
What is the difference between ILK = ??? and SNB = SandyBridge ? binaries like libmfxhw[bit]-i1/s1.dll ?

ILK is IronLake - the code name for Westmere's chipset (or integrated north bridge, don't remember).
The difference between the various Media SDK DLLs is the API they support. The newer the API, more features are supported.
I use MSDK API 1.1 which correlates to SNB. I don't have a Westmere (32nm i3/5/7 Core Processor) so I'm not sure what API it supports - probably API 1.0.
IvyBridge's driver has a similar DLL (libmfxhw32-i2.dll). It supports API version 1.3 (same as the latest MSDK 2012 version). The latter DLL does not work with SNB.