View Full Version : Intel QuickSync Decoder - HW accelerated FFDShow decoder with video processing


uncola
12th November 2013, 05:50
It's an nvidia optimus laptop with an external screen. I use the intel gpu 100% of the time. I installed the drivers direct from intel. I think that fake screen is only for people on desktops who use the discrete gpu as primary. I've tried everything I can think of, installed older win 8 drivers on win 8.1.. I've seen a bunch of other people mention they can't get it working either on 8.1. Is there a good freeware encoder I could try? I've only been using cyberlink power director 12
edit: NEVERMIND, figured it out haha. Man, this entire time I thought I was using the Intel GPU for CyberLink PowerDirector, but it turns out it was running on the Nvidia GPU automatically. This is the first time that function actually worked for anything other than games, so I was totally surprised. I just had to right-click PowerDirector and choose "run with GPU - integrated default".

NikosD
13th November 2013, 20:36
Eric,

I tried latest Intel GPA 2013 R3, in order to measure Media performance of QuickSync during playback on my Core i5 Sandy.

I was really disappointed once more to find that nothing has changed from 2013 R2 regarding interference with the results.

When you open System Analyzer to measure media performance, the fps drop a lot compared to clean decoding (with GPA closed)

The old GPA with Media Performance as a separate selection (not integrated in System Analyzer) had no such problems.

The performance of QuickSync decoding was the same, with or without GPA monitoring.

Any workarounds ?

wanezhiling
14th November 2013, 16:22
Hi Eric, QS decoding doesn't support MPEG1?

egur
14th November 2013, 17:38
Any workarounds ?
I'm not a power user of this tool. Sorry.

Hi Eric, QS decoding doesn't support MPEG1?
MPEG1 is not supported. Since this legacy format is limited to very low resolutions and bitrate, using HW acceleration wouldn't help much anyway. My guess is that it would work slower than SW if copy back was used (e.g. like QS decoder).

NikosD
15th November 2013, 12:04
I'm not a power user of this tool. Sorry.


OK.
I think the most suitable workaround is to downgrade to the last version of Intel GPA released before Media Performance was integrated into System Analyzer.

Is it possible to provide me a download link for that version ?

TIA

egur
16th November 2013, 23:19
I can try, which version are you looking for (version number)?

NikosD
16th November 2013, 23:23
I don't remember exactly.

It's the last version before the integration of Media Performance with System Analyzer.

itsonlyjustincase
20th November 2013, 12:34
If you want to use quick sync try to encode a file with handbrake quick sync version. Try it with or without QS.

egur
20th November 2013, 13:10
The latest Handbrake has QS, you need to enable it from the video tab. You'll need the screen connected to the Intel GPU.

NikosD
21st November 2013, 14:43
Eric,

I'm about to try a new Pentium Haswell dual-core processor with QuickSync inside, and I'm wondering if it's capable of playing 4K H.264 in HW with your QS decoder, or if I need to use DXVA native for 4K H.264.

For YouTube video, flash video and other streaming technologies like Vimeo, DXVA native is the only way.
Right ?

egur
21st November 2013, 15:29
Eric,

I'm about to try a new Pentium Haswell dual-core processor with QuickSync inside, and I'm wondering if it's capable of playing 4K H.264 in HW with your QS decoder, or if I need to use DXVA native for 4K H.264.

QS Decoder should work, let me know anyway how it performs.
It also works on Atom (Bay Trail) with Windows 8.1 (LAV filters + QS).

For YouTube video, flash video and other streaming technologies like Vimeo, DXVA native is the only way.
Right ?
I don't know what they use inside. Could be DXVA2, could be hardware MFTs.

itsonlyjustincase
21st November 2013, 17:00
Can't wait for the result :) but I'm sure it will work flawlessly, because Quick Sync is really powerful. I was so impressed by Handbrake. I'm also using a program that uses Quick Sync to record the screen while playing games, for example, and it records without losing a single fps.

Haswell + QuickSync should definitely do it.

NikosD
25th November 2013, 23:18
Eric, can you provide me a link for Intel GPA 2013 R1 (13.1) ?

It's the last version of separated Media Performance module.

Thanks.

NikosD
3rd December 2013, 20:04
Well,

today I got my new HTPC (signature system).

It has 8GB of DDR3-1600MHz memory.

I had a few surprises too!

Using latest drivers 3345 under Win 8.1 Pro x64, I have OpenGL 4.2/OpenCL 1.2 and of course QS decoding.

But the performance of QS hardware in both DXVA native and QS decoding is below my expectations.

With H.264 1080p clips, it's close to SNB HD2000 (6 EUs -1100MHz), just a little faster.

It shouldn't, because GT1 of Haswell has a 3rd generation QS hardware capable of 4K H.264 and 10 EUs@1100MHz.

The 4K performance is a lot worse than the Ivy 3770K GPU (16 EUs).

In Ducks-3840x2160@50fps I get only 43fps with QS decoding and 80fps with DXVA native.

Those 2 figures (~45fps for QS and ~80fps for DXVAn) seem to be the limit for 4K H.264 clips.

So I can play every 4K@60fps clip but only in native mode, in QS mode I can't even play 4K@50fps.

And some rare 4K@120fps clips are of course unreachable, even in native mode.

Need further investigations and maybe some iGPU overclocking too ;)
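
To put the figures in the post above side by side, a minimal sketch that compares measured decode throughput with the frame rate a clip needs. The 43 fps and 80 fps values are the ones reported for Ducks 2160p; the function names are made up for illustration, and the model ignores rendering and audio overhead.

```python
# Decode throughput reported in the post above, in frames per second.
measured_fps = {"QS decoding": 43, "DXVA native": 80}


def realtime_headroom(decode_fps, clip_fps):
    """Ratio of decode speed to required speed; below 1.0 means dropped frames."""
    return decode_fps / clip_fps


def can_play(decode_fps, clip_fps):
    """True when the decoder keeps up with the clip's native frame rate."""
    return realtime_headroom(decode_fps, clip_fps) >= 1.0


for mode, fps in measured_fps.items():
    for clip_fps in (50, 60, 120):
        verdict = "ok" if can_play(fps, clip_fps) else "too slow"
        ratio = realtime_headroom(fps, clip_fps)
        print(f"{mode:12s} 4K@{clip_fps:<3d} headroom {ratio:.2f} -> {verdict}")
```

With these inputs, QS decoding falls short at 4K@50 (headroom 0.86), while DXVA native clears 50 and 60 fps but not 120 fps, which matches the description in the post.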

egur
4th December 2013, 15:22
The numbers don't look so good.

I played DucksTakeOff (2160p@50) using ZoomPlayer + ffdshow QS and it played rock solid at 50fps. CPU utilization was 55-60%.
This was on my aging IvyBridge Laptop (2.3GHz) with DDR3 1600MHz. This is a 35W mobile CPU.

I'll try later on my Haswell i5 at home.

NikosD
4th December 2013, 18:59
Using MPC-HC v1.7.1.142 (Dec 2) with LAV filters 0.59.1.35 I got 40fps with QS decoding and 60-65% CPU utilization going full@3.2GHz for Ducks-2160p@50fps

The only way to really understand what is going on, is to use Intel GPA Monitor.

So, I installed latest version 2013 R3 and tried to catch GPU, EU and MFX utilization, during playback of MPC - HC but it was zero all the time!

For both DXVA native and QS decoding.

Probably some incompatibility with the drivers or the Pentium, or something really weird is going on here.

egur
5th December 2013, 14:25
@NikosD, Are you testing with EVR, EVR-CP, MadVR?
My numbers are with EVR.

Also use CPU-Z to check if your memory is actually working at 1600MHz.

NikosD
5th December 2013, 14:32
EVR always.
Yes memory is working@1600Mhz

I did some new tests with DXVA native in LAV Video 0.59.1.

For every 4096x2304 clip I get an upper limit of 75fps to 80fps, even for 10Mbps clips or 100Mbps clips

For every 3840x2160 clip I get an upper limit of 85fps to 90fps from low bandwidth to higher bandwidth clips.

QuickSync decoding should be better for Haswell than Ivy, even for my GT1 Haswell, because EUs should not play a significant role in decoding.

What are your figures of Haswell Core i5?
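
One way to read the two upper limits quoted above is to convert them into decoded pixels per second. The sketch below does only that arithmetic; the helper name is hypothetical, and the conclusion is an inference from four numbers rather than anything confirmed in the thread.

```python
def pixel_rate_mpix(width, height, fps):
    """Decoded pixels per second, in millions of pixels."""
    return width * height * fps / 1e6


# (width, height) -> (lower, upper) fps limit reported for DXVA native.
limits = {
    (4096, 2304): (75, 80),
    (3840, 2160): (85, 90),
}

for (w, h), (lo, hi) in limits.items():
    lo_rate = pixel_rate_mpix(w, h, lo)
    hi_rate = pixel_rate_mpix(w, h, hi)
    print(f"{w}x{h}: {lo_rate:.0f}-{hi_rate:.0f} Mpixel/s")
```

Both resolutions land at roughly 700 to 750 Mpixel/s, which would be consistent with a ceiling on pixel throughput that is independent of bitrate, as the post suggests.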

egur
5th December 2013, 15:09
I didn't have time yesterday to check.

NikosD
5th December 2013, 19:09
OK.

In order to have the same reference about decoding performance, I tested under signature system with LAV Video 0.59.1 in DXVA native the famous Ducks 2160p@50fps 370Mbps H.264 file with DXVA Checker v3.0

My results with fps and CPU utilization are in the picture

[Attachment 13900]

zcream
8th December 2013, 13:31
I am trying to use QuickSync to encode 2x RAW video streams from USB3 cameras. Project here -
http://www.personal-view.com/talks/discussion/8944/3d-3k-12-bit-raw-camera-project

I don't need P or B- frames. Just need i-frame compression at lossless or near lossless quality. With this requirement, I think a dual 3K stream would be possible onto a fast dedicated SSD.

The best option is to use ffdshow/ffmpeg or an encoder that accepts input from stdin (for virtualdub).

So far, the encoders I have seen only accept a complete input file. Would you consider enabling RAW input from stdin - usage with Virtualdub ? Or alternatively an input using a RAW stream from a DirectShow driver that comes with most Vision cameras ?

egur
8th December 2013, 14:37
I am trying to use QuickSync to encode 2x RAW video streams from USB3 cameras. Project here -
http://www.personal-view.com/talks/discussion/8944/3d-3k-12-bit-raw-camera-project

I don't need P or B- frames. Just need i-frame compression at lossless or near lossless quality. With this requirement, I think a dual 3K stream would be possible onto a fast dedicated SSD.

The best option is to use ffdshow/ffmpeg or an encoder that accepts input from stdin (for virtualdub).

So far, the encoders I have seen only accept a complete input file. Would you consider enabling RAW input from stdin - usage with Virtualdub ? Or alternatively an input using a RAW stream from a DirectShow driver that comes with most Vision cameras ?

I don't do encoding at all. As for decoding, only 8bit 4:2:0 is supported in HW.
If you're looking for 12bit HW encode/decode, you're out of luck, AMD and Nvidia are no better in this area.

zcream
9th December 2013, 07:34
Is this the status with ffdshow (ffmpeg) and Quicksync ATM ?
i.e. 8-bit 4:2:0 decoding only ?
I am not looking for 12-bit, only 8-bit 4:2:2 to begin with, and possibly 10-bit 4:2:2.
There are other encoders, however only virtualdub and ffmpeg support encoding from live USB vision streams. Hence, I was hoping for an encoder that can be offloaded from the CPU.
Cheers!

egur
9th December 2013, 08:22
Is this the status with ffdshow (ffmpeg) and Quicksync ATM ?
i.e. 8-bit 4:2:0 decoding only ?
I am not looking for 12-bit, only 8-bit 4:2:2 to begin with, and possibly 10-bit 4:2:2.
There are other encoders, however only virtualdub and ffmpeg support encoding from live USB vision streams. Hence, I was hoping for an encoder that can be offloaded from the CPU.
Cheers!

No support for anything other than 4:2:0 8-bit.
ATM, the HW supports NV12 surface output only.
Sadly, I don't see this changing soon, as changing the HW to support more than 8 bits is very expensive and 10/12-bit is not mainstream.
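
For readers unfamiliar with NV12, the format mentioned above: it is 8-bit 4:2:0 stored as a full-resolution Y plane followed by one half-height plane of interleaved U and V bytes, which works out to 1.5 bytes per pixel. A minimal sketch of that layout, assuming a tightly packed buffer (real hardware surfaces usually have a row pitch wider than the image, which is ignored here); the function name is made up for illustration.

```python
def nv12_layout(width, height):
    """Plane offsets and sizes in bytes for a tightly packed NV12 frame."""
    if width % 2 or height % 2:
        raise ValueError("NV12 needs even width and height")
    y_size = width * height        # 8-bit luma, one byte per pixel
    uv_size = width * height // 2  # U and V interleaved, subsampled 2x2
    return {
        "y_offset": 0,
        "y_size": y_size,
        "uv_offset": y_size,       # chroma plane follows luma directly
        "uv_size": uv_size,
        "total": y_size + uv_size,
    }


for w, h in ((1920, 1080), (3840, 2160)):
    layout = nv12_layout(w, h)
    print(f"{w}x{h}: {layout['total']} bytes per frame")
```

A 4:2:2 or 10-bit source therefore has to be converted down to this layout before the hardware can use it.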

NikosD
9th December 2013, 09:52
Eric,

any news about HW decoding performance of Haswell vs Ivybridge ?

andyvt
9th December 2013, 10:15
Is this the status with ffdshow (ffmpeg) and Quicksync ATM ?
i.e. 8-bit 4:2:0 decoding only ?
I am not looking for 12-bit, only 8-bit 4:2:2 to begin with, and possibly 10-bit 4:2:2.
There are other encoders, however only virtualdub and ffmpeg support encoding from live USB vision streams. Hence, I was hoping for an encoder that can be offloaded from the CPU.
Cheers!

QSTranscode uses ffmpeg for I/O so it should be possible to support this scenario with some customization. Does your desire to offload outweigh the desire for 4:2:2?

zcream
10th December 2013, 05:40
Hi Andy.
I was hoping to capture USB3 vision 4:2:2 Raw video streams in realtime using a DirectShow driver. Hence 4:2:2 is preferred.

Is there a lossless i-frame only mode in QSTranscode that will allow for lower compression but hopefully realtime compression in 4:2:0 ?
I have a core i7 with HD4000 - a mid-2012 Macbook Pro for this project.


QSTranscode uses ffmpeg for I/O so it should be possible to support this scenario with some customization. Does your desire to offload outweigh the desire for 4:2:2?

andyvt
10th December 2013, 10:59
Hi Andy.
I was hoping to capture USB3 vision 4:2:2 Raw video streams in realtime using a DirectShow driver. Hence 4:2:2 is preferred.

Is there a lossless i-frame only mode in QSTranscode that will allow for lower compression but hopefully realtime compression in 4:2:0 ?
I have a core i7 with HD4000 - a mid-2012 Macbook Pro for this project.

Any encoding will be lossy, how lossy will depend on the encoding parameters. I expect that you should be able to do the encoding in real time even at high settings (full transcoding is faster than real time for everything I've tried), but it is something you'd need to test out.

The goppicsize and goprefdist arguments can be used to control the mix of I/P/B-frames. IIRC setting goppicsize = 1 sets the encoder to only create I-frames.

If it is something that you'd like to pursue, LMK and I'll have a look at what it would take to signal to ffmpeg that a device should be the input source.
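
The effect of the two GOP arguments mentioned above can be pictured with a small model. This is a simplified sketch, not QSTranscode or encoder code: it assumes goppicsize is the number of frames per GOP and goprefdist is the distance between reference (I or P) frames, which is how the similarly named GopPicSize and GopRefDist fields are generally described for Intel Media SDK. Real encoders also treat trailing B-frames at a GOP boundary specially, which is ignored here.

```python
def gop_frame_types(gop_pic_size, gop_ref_dist, n_frames):
    """Frame types in display order under a simplified GOP model."""
    if gop_pic_size < 1 or gop_ref_dist < 1:
        raise ValueError("both GOP parameters must be >= 1")
    types = []
    for i in range(n_frames):
        pos = i % gop_pic_size          # position inside the current GOP
        if pos == 0:
            types.append("I")           # every GOP starts with an I-frame
        elif pos % gop_ref_dist == 0:
            types.append("P")           # reference frame every gop_ref_dist
        else:
            types.append("B")           # frames in between are B-frames
    return "".join(types)


print(gop_frame_types(1, 1, 6))    # I-frame only
print(gop_frame_types(4, 1, 8))    # I and P, no B-frames
print(gop_frame_types(6, 3, 12))   # two B-frames between references
```

Under this model a GOP size of 1 yields only I-frames, which is the intra-only mode asked about earlier in the thread.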

NikosD
10th December 2013, 18:02
Eric,

a lot of new things about my new Pentium Haswell.

First of all I managed to find Intel GPA 2013 R1 for Sandy, so that link is not necessary.

By reading 2013 R3 help, I realized that the HUD (head-up display metrics proposed by Intel) is using a lot less resources than System Analyzer, but I preferred the old Media Performance tool so I installed again 2013 R1.

2013 R3 has a lot of problems with my Pentium Haswell system because of Win 8.1 (not officially supported - only 8.1 preview) and/ or Haswell GT1 GPU - (4th generation core HD 4200 and above officially supported).

I guess I have to wait for 2013 R4 or 2014 R1 for proper Win 8.1 and/ or Haswell GT1 support.

I did some tests regarding 1080p decoding performance between QS1 (Sandy - GT1) vs QS3 (Haswell - GT1).

In normal bandwidth clips and easier codecs, like H.264<30Mbps or MPEG-2 1080p clips, QS3(Haswell) is a little slower!! than QS1(Sandy)

But H.264 1080p clips around 100Mbps are decoded a lot faster on QS3 than QS1, about 2x or even 2.4x times faster.

I found out one major difference that Intel silently has changed between QS generations.

Haswell QS3 supports a new VC-1 mode called VC1_VLD2010, which is used by both the LAV Video and WMVideo MFT decoders.

LAV has some problems, but on progressive video WMVideo is perfect for VC-1/WMV3 clips.
PotPlayer supports VC-1/WMV3 in DXVA native too (no problems)

Still, interlaced video can't be decoded properly.

I'm really looking forward to an answer about the IvyBridge vs Haswell performance comparison, especially for 4K H.264 clips.

I need to know whether it's only my Pentium Haswell QS that is slower than Ivy, or whether every Haswell QS is slower than any Ivy.

thanks.

zcream
10th December 2013, 23:39
I am going to test this as soon as I get my laptop.
ATM, ffmpeg can record from USB cameras.
https://trac.ffmpeg.org/wiki/DirectShow

If I can use the ffmpeg cli and pass the ffmpeg raw stream to QSTranscode it should be fine.
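
A rough sketch of the capture half of that idea: building an ffmpeg command line that reads a DirectShow device and writes raw NV12 frames to stdout. The `-f dshow`, `-pix_fmt` and `-f rawvideo` options are standard ffmpeg options; the device name is a placeholder, and whether QSTranscode can read such a stream from stdin is not established in this thread, so the receiving side is left out.

```python
def build_capture_command(device_name, pix_fmt="nv12"):
    """Argument list for ffmpeg: DirectShow device in, raw frames on stdout."""
    return [
        "ffmpeg",
        "-f", "dshow",                 # DirectShow input (Windows only)
        "-i", f"video={device_name}",  # device name as listed by ffmpeg
        "-pix_fmt", pix_fmt,           # convert to 8-bit 4:2:0 NV12
        "-f", "rawvideo",              # headerless raw frames
        "pipe:1",                      # write to stdout
    ]


# "USB3 Vision Camera" is a hypothetical device name.
cmd = build_capture_command("USB3 Vision Camera")
print(cmd)
```

The list can be passed to `subprocess.Popen(cmd, stdout=subprocess.PIPE)` so that another process reads the frames; since raw video has no header, the consumer must be told the resolution, pixel format and frame rate separately.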

I am very curious to see if there are any noticeable differences in Cineform 4:2:2 vs h.264 4:2:0 at a high enough bitrate.

Any encoding will be lossy, how lossy will depend on the encoding parameters. I expect that you should be able to do the encoding in real time even at high settings (full transcoding is faster than real time for everything I've tried), but it is something you'd need to test out.

The goppicsize and goprefdist arguments can be used to control the mix of I/P/B-frames. IIRC setting goppicsize = 1 sets the encoder to only create I-frames.

If it is something that you'd like to pursue, LMK and I'll have a look at what it would take to signal to ffmpeg that a device should be the input source.

CharlieCL
12th December 2013, 03:03
a lot of new things about my new Pentium Haswell.
......

Pentium Haswell hardware may be a crapped Haswell vs Core i in graphics.

wanezhiling
12th December 2013, 07:02
Apparently GT1 (HD Graphics) is not on the same level as GT2 (HD Graphics 4200/4400/4600); worse, it may be slower than IVB GT2 (HD Graphics 2500/4000).

NikosD
12th December 2013, 10:09
Pentium Haswell hardware may be a crapped Haswell vs Core i in graphics.

I couldn't disagree more with that statement.
Read below.

Apparently GT1 (HD Graphics) is not on the same level as GT2 (HD Graphics 4200/4400/4600); worse, it may be slower than IVB GT2 (HD Graphics 2500/4000).

First of all we are talking here about media performance, not graphics performance.

Media performance is based on QuickSync hardware (ASIC) which is inside GPU and the number of EUs (graphics performance) doesn't play a significant role in hardware decoding.

Anandtech here http://www.anandtech.com/show/5871/intel-core-i5-3470-review-hd-2500-graphics-tested/2 has exactly the opposite opinion than me, saying that transcoding is based a lot on number of EUs, but never mind, Anandtech is wrong and I'm right! ;)

The number of EUs play a significant role for scaling and other post processing video features, but in decoding the role of EUs is not significant. Sure they assist, but main decoding is done by MFX engine.

So QS decoding should be on par between same generation QS hardware (HD 2000 vs HD 3000), (HD 2500 vs HD 4000), Haswell GT1, GT2 etc and not influenced by the number of EUs which shows us mainly graphics and not media performance.

But, talking about the Pentium Haswell GPU in general, we have to say that even graphics performance has risen significantly compared to previous generations.

For example, GT1 of SandyBridge (HD 2000) and GT1 of IvyBridge (HD 2500) both have 6 EUs, whereas GT1 of Haswell has 10 EUs.

In real world apps and benchmarks I saw a 2.5x increase in DX10 and OpenGL performance between GT1 -HD 2000 (sandy) and my Haswell GT1, for example Cinebench OpenGL mark and a DX10 benchmark.

So Haswell GT1 is certainly faster than GT1 - HD 2500 (Ivy) and probably faster even than GT2 -HD 3000 (Sandy) and close to GT2- HD 4000 (Ivy).

The equation goes like this regarding graphics performance:

HD 2000 < HD 2500 < HD 3000 < Haswell GT1 < HD 4000

But our talk here is about HW H.264 decoding performance of GT1 Haswell compared to GT2, GT3 Haswell and GT1, GT2 performance of Ivy.

nevcairiel
12th December 2013, 11:00
Anandtech here http://www.anandtech.com/show/5871/intel-core-i5-3470-review-hd-2500-graphics-tested/2 has exactly the opposite opinion than me, saying that transcoding is based a lot on number of EUs, but never mind, Anandtech is wrong and I'm right! ;)

Transcoding is a lot more than plain decoding. EUs play a role in encoding and the processing required for that.

Anyhow, the Pentium "Intel HD Graphics" is also limited in media features. It doesn't have QuickSync, for example (the encoder), which could also be a sign that it's a completely different media block and does not share the same decoder as the other Haswell GPUs. It's very well possible that it behaves differently in decoding performance than the Core i3/i5/i7 would, and until someone does a clear 1:1 comparison, no one can know for sure.

Fact is, you are just assuming it's the same hardware. What if it's not?

NikosD
12th December 2013, 11:09
The best way to find out is to run some H.264 benchmarks (1080p & 4K) on your system (I know you have a Core i7 4770K ;) ) and compare them with mine.

It would be a straight comparison between GT1 vs GT2 Haswell media decoding performance.

kypec
12th December 2013, 11:51
The best way to find out is to run some H.264 benchmarks (1080p & 4K) on your system (I know you have a Core i7 4770K ;) ) and compare them with mine.

It would be a straight comparison between GT1 vs GT2 Haswell media decoding performance.
Please do that comparison test, I'm about to buy Pentium G3420 soon and need to be sure that H.264 decoding capabilities for 1080p (and possibly 4K as well) material are not crippled in this Haswell CPU. :thanks:

egur
12th December 2013, 19:52
Transcoding is a lot more than plain decoding. EUs play a role in encoding and the processing required for that.
Yes, a minor role.

Fact is, you are just assuming it's the same hardware. What if it's not?
It's the same HW (design wise). Same media block. Lowest binned silicon (leakage & frequency, but still reliable), much smaller L3 cache, 2 cores, no HT, no turbo and some other features turned off compared to i3/i5/i7.

NikosD
12th December 2013, 20:23
Please do that comparison test, I'm about to buy Pentium G3420 soon and need to be sure that H.264 decoding capabilities for 1080p (and possibly 4K as well) material are not crippled in this Haswell CPU. :thanks:

The decoding capabilities of Pentium G3420 are amazing.

With signature system and clips from my thread I got these results with LAV Video 0.59.1 DXVA native:

2. 298/303/313
6. 288/295/299
7. 264/265/265
10. -/240/-

For 4K H.264 I got:

Ducks 2160p30fps-243Mbps 72/75/77
Park Joy 2160p50fps - 136Mbps 77/79/79
Duck 2160p50fps - 370Mbps 73/78/80

CharlieCL
12th December 2013, 21:54
Your testing is not a pure decoding testing. There are render and fps. The bandwidth and memory performance as well as EU are factors. In Intel's official specification Pentium G3420 does not support Quick Sync Video. In word meaning that included both encoder and decoder.

egur
13th December 2013, 17:33
That's correct, rendering and video processing uses the EUs at some level.

NikosD
13th December 2013, 18:01
Your testing is not a pure decoding testing. There are render and fps. The bandwidth and memory performance as well as EU are factors. In Intel's official specification Pentium G3420 does not support Quick Sync Video. In word meaning that included both encoder and decoder.

I don't understand what you are trying to say.

kypec
13th December 2013, 22:33
The decoding capabilities of Pentium G3420 are amazing.
Thank you very much for your input, NikosD, much appreciated!
Now, I can't wait to get that new rig in my room. This is gonna be one nice Christmas after all...:D

NikosD
14th December 2013, 20:24
Eric,

I found out that the QS decoder suffers from the same limitations as DXVA copy-back.

It can't go up to 4K x 4K resolution for H.264 without blurring (artifacts).
I even tried some 3K x 3K clips and had problems too.

They aren't the most common resolutions on the planet, but it surprised me that only DXVA native can play such resolutions perfectly.

DXVA copy-back doesn't even initialize - it falls back to software mode with LAV Video.

egur
15th December 2013, 08:38
@NikosD, Interesting, please share the clips.

NikosD
15th December 2013, 08:45
The clips have been downloaded from your collection of DXVA limited clips with various resolutions of square, macroblocks, height or width limitations.
There are many of them.

If you don't have them anymore, tell me to upload one or two.

egur
15th December 2013, 08:57
I don't have those. Just share 1 of each resolution.

NikosD
15th December 2013, 09:02
OK.

But because you have both an IvyBridge and a Haswell processor, I want from you a comparison of H.264 decoding performance in DXVA native for selected clips of 1080p and 4K resolutions :)

Do we have a deal ? :D

egur
15th December 2013, 09:10
I'm not committing to anything these days. A lot of work issues. But I'll give it a try.
BTW, Intel® Graphics Performance Analyzers 2013 R4 is out.

NikosD
15th December 2013, 09:17
I tried Intel GPA 2013 R4 on my system but nothing changed.

Still, the media performance metrics (which have risen from only 3 to 11!) show me null values during playback.

Also I have a lot of hang-ups of OS while using 2013 R4.

Maybe Pentium G3000 series is still not compatible with that program.

Anyway, a typical square-resolution sample that shows image distortion with the QS decoder is here:
http://speedy.sh/Tm7BR/4080x4080-VP5.mkv

BTW, do you know why SandyBridge latest driver for HD2000 ver 3347 doesn't have VC1_VLD2010 support like Haswell ?

Are they going to add it in future driver releases ?

egur
15th December 2013, 09:22
SandyBridge drivers are feature-frozen, only serious bug fixes from now on.