LAV CUVID Decoder - High Quality Hardware decoding for NVIDIA [Archive] - Page 15

nevcairiel

30th May 2011, 07:19

Indeed, it's already been changed, and the next version won't require it anymore.

zhfi

30th May 2011, 08:40

Indeed, it's already been changed, and the next version won't require it anymore.

How long until the next version comes out?
Or would you build it for us?
I have vc2008...:eek:

roozhou

30th May 2011, 14:28

How long until the next version comes out?
Or would you build it for us?
I have vc2008...:eek:
My build:
http://www.mediafire.com/?lg67wlbe6ptklmn

zhfi

30th May 2011, 16:54

My build:
http://www.mediafire.com/?lg67wlbe6ptklmn

:helpful:
Thank you!

ney2x

1st June 2011, 18:16

My problem with AVI (DivX/XviD) was solved with the 275.33 driver. It's time to get rid of ffdshow now.

Question: Regarding H.264 videos, do ffdshow filters like deinterlacing, deband, sharpen, etc. work when LAV CUVID is in use and ffdshow (raw) is in the chain? Because I don't notice any difference :)

nevcairiel

1st June 2011, 18:17

Interesting that they fixed the timing issues with the driver update. :)

ffdshow should work, just make sure to force ffdshow to only accept YV12 input, as NV12 is broken in ffdshow.

ney2x

1st June 2011, 18:18

Interesting that they fixed the timing issues with the driver update. :)

ffdshow should work, just make sure to force ffdshow to only accept YV12 input, as NV12 is broken in ffdshow.

Thanks for your quick reply! I better sleep now, tomorrow will be movie marathon... :thanks:

jazzysmooth

1st June 2011, 22:17

Still can't use cuvid with any driver higher than 270.51 with my onboard GF9300. Oh well...

Blue_MiSfit

2nd June 2011, 03:50

Will this decoder work if I'm connected to the system via RDP?

DGDecNV does not, so I'm not expecting this to either, but I figured it was worth a shot :)

I don't currently have a system available via RDP that has the correct hardware, otherwise I'd just test. Actually... I will test when I get home :devil:

Derek

SamuriHL

2nd June 2011, 03:53

No. RDP uses its own virtual video driver. Doesn't work with AMD, nVidia, or any other hardware that I'm aware of.

nevcairiel

2nd June 2011, 14:32

LAV CUVID Decoder 0.7

0.7 - 2011/06/02
- x64 support
- The VC2010 runtime is no longer required
- New SSE2 NV12->YV12 conversion
- Improved CUDA GPU detection

Download: 32-bit (CUDA 4.0+) (http://files.1f0.de/cuvid/LAVCUVID-0.7.zip) - 64-bit (CUDA 4.0+) (http://files.1f0.de/cuvid/LAVCUVID-0.7-x64.zip) -- 32-bit (Older CUDA) (http://files.1f0.de/cuvid/LAVCUVID-0.7-LegacyCUDA.zip)

So, finally a new version. I still have some more changes planned for the near future, but felt like releasing this today.

First, on the different versions:
- The "normal" 32-bit and 64-bit builds now require a CUDA 4.0 driver (270 series).
- The "Old CUDA" build was compiled against an old CUDA API, and should work with a whole load of older drivers, at least in theory. This is not available for x64 at this time, because of recent changes in the 64-bit CUDA APIs.

For proper x64 support, i rewrote the NV12->YV12 conversion using SSE2 intrinsics, so that the x64 folks don't have to use the slow pure C conversion. This of course means that SSE2 is now required, but seriously, invest in an upgrade if you don't have SSE2 support.
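To make the NV12->YV12 conversion concrete, here is a minimal sketch of the repacking in plain Python. It is not the SSE2 code from the filter; it only shows the layout change, and it assumes even dimensions and tightly packed planes with no stride padding. The function name is made up for illustration.

```python
def nv12_to_yv12(nv12: bytes, width: int, height: int) -> bytes:
    """Repack one NV12 frame as YV12 (both are 4:2:0 with 12 bits per pixel).

    NV12: full Y plane, then one plane of interleaved U,V byte pairs.
    YV12: full Y plane, then the whole V plane, then the whole U plane.
    Assumes even width/height and no row padding (illustrative only).
    """
    y_size = width * height
    c_size = (width // 2) * (height // 2)
    if len(nv12) != y_size + 2 * c_size:
        raise ValueError("buffer size does not match width x height")
    y = nv12[:y_size]
    uv = nv12[y_size:]
    u = uv[0::2]  # even bytes of the interleaved plane are U samples
    v = uv[1::2]  # odd bytes are V samples
    return y + v + u
```

A SIMD version does the same deinterleave on 16 bytes at a time, which is where the speedup over a byte-by-byte C loop would come from.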

Oh, and hopefully LAV CUVID should now also work with cards that are not connected to a display.

SamuriHL

2nd June 2011, 14:35

Nice! Thanks, Nev!

ney2x

2nd June 2011, 15:13

You're the best Nev! I think you have the right to ask for donations for your effort!

LAV CUVID = free and updates once, twice and even thrice a month!
CoreAVC = $12.95 and update once a year! :(

*** There's no flame here, I just want to be honest! :D

Carpo

2nd June 2011, 17:20

now all we need is madVR x64 :p testing new LAV CUVID now

edit: xvid also working for me now :)

CruNcher

2nd June 2011, 17:37

http://mirror05.x264.nl/CruNcher/pperf/ <- no improvement yet (the low performance 60fps issue after xx seconds is still there with LAV CUVID)

nevcairiel

2nd June 2011, 17:38

Where did it say it's faster now? Your hardware is just too slow. :p

Carpo

2nd June 2011, 17:38

you got a link to that file?

SamuriHL

2nd June 2011, 17:41

Nev, for some reason the x64 build isn't finding the cuda libs for me. I do have both 32 and 64 bit versions of the 4.0 toolkit installed. I'm not feeling all that well right now so I haven't looked into it too much. Just thought I'd mention it. 32 bit version builds fine for me.

nevcairiel

2nd June 2011, 17:42

You need to make sure the CUDA_PATH environment variable points at the x64 version. Also, you only need to install the x64 version, it includes the 32-bit libs as well. (I only tested with the 4.0 SDK; 3.2 is known to be broken for x64, and I don't know about older versions.)

SamuriHL

2nd June 2011, 17:45

Ah, kuel. I didn't know that. I'll remove the 32 bit version and point it to the 64 bit version. Thanks!

Yup, that worked. Awesome!

CruNcher

2nd June 2011, 17:59

Carpo:

http://e.dl.playstation.net/e/wipeouthd/assets/WipEoutHD_EN_1080p.zip

another one failing

http://www.fileplanet.com/219665/210000/fileinfo/Battlefield-3-%27Fault-Line%27-Complete-12-Minute-HD-Gameplay-Footage

though 60 fps Full HD H.264 Inet content is still rare (mostly used for PC/Console Game marketing) ;)

Nope, my hardware isn't too slow; see Cyberlink's decoder and its CPU usage (you would like it to be, I know). At least the VP2 can't be the culprit here. I agree that it could be the NVCUVID API under XP, as all NVCUVID API based decoders show this performance issue, though other DXVA1 decoders also fail, so it's not really that issue either ;)

Carpo

2nd June 2011, 18:07

i will have to grab the Battlefield demo later, in the ISP's capping zone atm :(

edit: playstation link is slow as hell, 20K :( might be able to get the other one when this one finishes :D

ney2x

2nd June 2011, 18:20

http://mirror05.x264.nl/CruNcher/pperf/ <- no improvement yet (the low performance 60fps issue after xx seconds is still there with LAV CUVID)

@CruNcher, what is your GPU? I think it's time for you to upgrade your OS and especially your hardware (an affordable GTX 460, GTS 450 or GTX 550).

Maybe hardware limitations.

Tested:
Windows 7 and Windows XP / GTX 260 = smooth and sailing :)
Windows 7 and Windows XP / GTX 560 Ti = smooth and sailing :)

Didée

2nd June 2011, 18:24

@ CruNcher - is it certain that Cyberlink is decoding all frames? It's a bit hard to tell from your screen-recording, but it seems to me that Cyberlink's decoder isn't fluid, but is dropping frames. In that case, the riddle's solution could be that Cyberlink's decoder is using a smarter workaround-strategy (skipping B-frames, maybe?), and LAVCUVID simply tries to decode everything, and when the decoder can't keep up anymore, it gets into re-syncing issues ...

Bottom line: I think VP2 is too weak for high-bitrate 1080p60. :)
(This would be consistent with results from the "DGDecodeNV benchmark thread", where VP2 cards IIRC are ~around~ 60fps (often below that) for 1080 content.)

nevcairiel

2nd June 2011, 18:37

I've been trying to tell him that for ages. The Cyberlink decoder is the only decoder that manages to play 60p streams on his hardware, all others fail.
It's still my opinion that the Cyberlink Decoder does something special to get fluid playback, dropping some frames internally is of course one option.

VP2 is not fast enough to reliably decode 60p, and that'll never change.
CruNcher you can stop annoying everyone with your posts now, it'll never ever be fixed.

CruNcher

2nd June 2011, 18:38

Yes, that dropping is because of the x264 recording overhead (actually not entirely, the capture itself is also complex on XP). I'm working on better 60 fps recording results, but you can guess it's not easy, especially when you additionally have stuff like Process Explorer running with high precision timing ;) And nope, lossless would be too easy ;)

Also, Didée, it depends on the complexity of the bitstream. Those 2 aren't very complex; the 2 Girls bitstream (x264), for example, is much more complex: http://forum.doom9.org/showthread.php?p=1490026#post1490026 and there I would agree that VP2 is too weak, but only by a tad (though re-syncing drops happen with it anyway, as you said) :)

The problem is also only visible with VMR9 Renderless DXVA1. Only Cyberlink's decoder manages it so far, with excellent jitter results. CoreAVC DXVA now avoids it by switching to software decoding :)
see current situation http://forum.doom9.org/showthread.php?p=1488268&post1488268

And yeah i guess this Cyberlink Decoder behavior will stay a mystery, (my guess is excellent tuning maybe with NDA knowledge) :)

nevcairiel

2nd June 2011, 21:20

Request for samples

I'm looking to implement IVTC decimation in LAV CUVID, but to get this to work reliably, i could really use a lot of samples, preferably of different video codecs, to test against.

All i have right now is a NTSC DVD Set, but i realize that broadcasts are probably much harder to deal with than DVDs, so i'm looking for all sorts of TC'ed material to test with.

To summarize, i'm looking for NTSC Telecined samples - that is 24p content telecined to 30p/60i.
The samples should be at least 30s long so i can test the detection properly. Samples of all relevant video codecs would be great, mostly H264 and MPEG-2, i guess. Different resolutions welcome.
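For readers unfamiliar with the term, a small sketch of how 24p becomes 30p/60i through pulldown may help. It assumes the common 2:3 field pattern and uses letters as stand-ins for film frames; the function name and pattern argument are illustrative, not taken from any tool mentioned here.

```python
def telecine(film_frames, pattern=(2, 3)):
    """Spread film frames over fields using a repeating field-count pattern,
    then pair consecutive fields (top, bottom) into interlaced frames."""
    fields = []
    for i, frame in enumerate(film_frames):
        # Each film frame is held for 2 or 3 fields, alternating.
        fields.extend([frame] * pattern[i % len(pattern)])
    # Every two consecutive fields form one interlaced output frame.
    return [(fields[i], fields[i + 1]) for i in range(0, len(fields) - 1, 2)]


print(telecine("ABCD"))
```

Four film frames come out as five interlaced frames, two of which combine fields from different film frames. Inverse telecine has to undo exactly that, and decimation then removes the one surplus frame per group of five.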

I can provide upload space on my FTP if anyone has a large collection of samples, and wants to help.

I'll offer more technical/implementation details as the code progresses, in case anyone is interested.

SamuriHL

2nd June 2011, 21:44

I don't think I have any content like that in my collection. I can look and see what I can find though. Maybe some of my obscure NTSC DVD's might be screwy like that. :)

lsarver

2nd June 2011, 22:56

Request for samples

To summarize, i'm looking for NTSC Telecined samples - that is 24p content telecined to 30p/60i.
The samples should be at least 30s long so i can test the detection properly. Samples of all relevant video codecs would be great, mostly H264 and MPEG-2, i guess. Different resolutions welcome.

I can provide upload space on my FTP if anyone has a large collection of samples, and wants to help.

I have loads of movies captured from US cable via TiVo S3 and edited in VideoReDo. All are MPEG-2/AC3, 720p or 1080i, 29.97fps. Between my edit points, the cadences should be intact. (I considered IVTCing them with DGwhatever, but gave up: waaay over my head.)

I also have lots of movies captured from Dish Network (Russian cable channels), also edited in VRD. They are MPEG-2/MP2, 480i, 29.97fps. They also have long GOPs and have been converted D>A>D during capture, so may not be good candidates for IVTC.

Edit: Forgot to mention that all are in .mpg container.

e-t172

2nd June 2011, 23:37

I'm looking to implement IVTC decimation in LAV CUVID

Is this something CUDA offers or are you implementing the IVTC algorithms yourself?

Anakunda

3rd June 2011, 02:02

I must say it does not work on my machine. I have an NVIDIA card with CUDA, but if I assign LAV CUVID as the external decoder for Xvid in KMPlayer, playback starts but the screen stays black. The filter info shows that LAV CUVID is in use, so it must be the cause...

namaiki

3rd June 2011, 02:05

Anakunda

3rd June 2011, 02:10

What is your GPU model and version of Windows?

It's Windows7, GPU G102M i think.

namaiki

3rd June 2011, 02:19

It's Windows7, GPU G102M i think.

nVIDIA GPU with VP3 does not support DivX decode, only the next generations.

roozhou

3rd June 2011, 04:20

Hi nevcairiel,

I have a large collection of telecined clips, most of which were uploaded by D9 users. Some of them are very difficult to deal with.

I added an IVTC filter to x264. It's working fine with most "regular" telecined materials. I will be happy to give help if you are going to implement an IVTC algorithm in your decoder.

nevcairiel

3rd June 2011, 06:42

I'll let the NVIDIA hardware do the actual IVTC, all i plan to do is remove the duplicate frame that results from this, decimating from 30p to 24p.
I already wrote some code to compare frames, and as long as there is at least some motion in the picture, detecting the duplicate seems straightforward, i can clearly see the cadence in my comparison output (4 good frames, one near-duplicate, on a perfectly regular pattern). Just looking for more samples for comparison and testing.
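The decimation step described above can be sketched roughly as follows. This is only an illustration of the idea, not the filter's code: `diffs` is a hypothetical list of difference scores between each frame and the one before it, and the cadence is assumed to be aligned to groups of five.

```python
def decimate_30p_to_24p(frames, diffs):
    """Drop the most duplicate-like frame from every complete group of five.

    diffs[i] is a difference score between frames[i] and frames[i - 1];
    diffs[0] has no predecessor and should be set to a large value.
    """
    kept = []
    full = len(frames) - len(frames) % 5
    for start in range(0, full, 5):
        group = range(start, start + 5)
        # The near-duplicate is the frame that differs least from its predecessor.
        drop = min(group, key=lambda i: diffs[i])
        kept.extend(frames[i] for i in group if i != drop)
    # Pass an incomplete trailing group through untouched.
    kept.extend(frames[full:])
    return kept
```

With clear motion, the lowest score stands out, which matches the "4 good frames, one near-duplicate" pattern; in static scenes all scores are close together and the choice becomes arbitrary.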

roozhou

3rd June 2011, 07:02

Anakunda

3rd June 2011, 07:11

nVIDIA GPU with VP3 does not support DivX decode, only the next generations.

Hi, does this low-end GPU support decoding AVC via CUDA? I've seen that CoreAVC plays it, I just don't know if that is with CUDA support. But anyway, the CPU load is noticeably lower than when using a conventional codec. What do you recommend in this case, CoreAVC or LAV CUVID?

nevcairiel

3rd June 2011, 07:13

A VP3 GPU should work fine for AVC/H264 movies, only deinterlacing may be a bit too slow on your card.

FWIW, the next version will also contain a black/white list, only enabling XVID decoding on VP4 cards.
All other codecs are supported on VP2/VP3, even if only with partial acceleration.

madshi

3rd June 2011, 07:28

I'll let the NVIDIA hardware do the actual IVTC, all i plan to do is remove the duplicate frame that results from this, decimating from 30p to 24p.
Haha, great minds think alike, I guess... :p I'm a bit worried, though, if this will really work perfectly. E.g. what happens if the hardware deinterlacer thinks that some parts of the image are "video" content? In that case a few pixels of the "duplicate frame" will still be different. No problem if the output stays 30p. But if you try to throw away duplicate frames, this might become problematic. Especially if you want to really change output to 24p "officially" via media type information etc. In that case you're forced to throw away frames on a regular basis. What if the hardware deinterlacer isn't clear about which frames are duplicates and which are not? Of course you could pick the frames with the smallest differences to throw away. But what if there's e.g. an advertising break in a broadcast? Ads are often video mode, I think. Will you switch output to video mode then for the duration of the ad? Or what if there's a video mode scrolling text over the movie (end credits) for movies made for TV? I'm not sure about all this myself...

But I guess for many cases a rather "simple" solution like this will probably work just fine.

nevcairiel

3rd June 2011, 07:29

Let me guess, you did that in madVR for the next version? :p

madshi

3rd June 2011, 07:36

Let me guess, you did that in madVR for the next version? :p
No, I've not worked on deinterlacing for madVR yet. But of course I'm sometimes thinking ahead of how to do something, and I had the same idea you had... :)

But please look at the edit in my previous comment for some worries I'm having about this solution.

nevcairiel

3rd June 2011, 07:43

For one, i won't force IVTC, the user has to actively enable it, the default will remain video-mode deinterlacing.

I'm unsure how to handle mixed content. I think i would just stop dropping frames if the cadence is missing for a period of time (it switched to video content).
I could of course generate a new media type then, but to what end? Switching refresh-rate mid-playback seems really annoying, as it causes a 1-2 second black screen, at least on my hardware.

For "video" on top of "film": if it's really something moving, then the comparison algorithm will detect it as video content, and stop decimating - at least when the limits are set right. If it's not moving enough for the comparison to pick it up, well, i would wager you don't notice that i dropped a few frames of it. :p
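One way to picture the "stop dropping when the cadence is missing" behaviour is the sketch below. The `ratio` and `patience` limits are invented for illustration; the post does not say how the actual limits are chosen.

```python
def plan_drops(groups, ratio=0.25, patience=3):
    """For each 5-frame group of difference scores, return the index of the
    frame to drop, or None to pass all five frames through.

    A group shows film cadence when its smallest score is far below the
    second smallest. After `patience` groups in a row without cadence the
    content is treated as video and nothing is dropped.
    """
    plan = []
    misses = 0
    last_pos = None
    for diffs in groups:
        ordered = sorted(diffs)
        if ordered[0] < ratio * ordered[1]:
            misses = 0
            last_pos = diffs.index(ordered[0])
        else:
            misses += 1
        if last_pos is None or misses >= patience:
            plan.append(None)      # video content: keep every frame
        else:
            plan.append(last_pos)  # film, or a short gap: keep decimating
    return plan
```

Short gaps in detection keep using the last known position, so a single noisy group does not interrupt the 24p output.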

robpdotcom

3rd June 2011, 08:23

I can get you a bunch of samples with mixed content - I'll post some later today.

e-t172

3rd June 2011, 09:41

2 years ago I developed an IVTC filter inside ffdshow. It worked quite well except it was crashing sometimes. Unfortunately, I never got around to fixing and releasing it, as I stopped watching telecined content (I'm watching TV shows from iTunes now - much better quality than US broadcasts in my opinion).

The interesting thing is, I managed to develop a filter which was both much, much faster (something like 5x-10x faster if I remember correctly) than Decomb and other IVTC filters I tested, and yet was making fewer "mistakes" (it was very close to perfection, in fact, and even achieved it for most captures I had), even in scenes with very little motion, or scenes with interfering video content (such as animated banners at the bottom of the image). At the time I had a Core 2 Duo E6600 and my filter would IVTC (near-perfect) 1080i30 content with very small CPU consumption (something like 15%), whereas other filters like Decomb were just barely making it to realtime and making mistakes.

In order to achieve this performance, my filter relies heavily on context information: it would use measurements from the last N frames and M buffered future frames. N and M were quite big (15, 30, something like that), so it had a lot of context to use for making decisions. Big N was not really a problem since the filter only saved measurements instead of entire frames; however, increasing M meant buffering more frames and thus using more memory.

Basically, what made the algorithm so reliable is that it has access to all this context information: it knows what the measurements are not only for the current pattern, but also for the last N/5 patterns and the next M/5 patterns. So, for example, if it sees a sequence like this (numbers are measured IVTC pattern positions within 5-frame batches): 3 - 3 - 4 - 3 - 5 - 3 - 3 - 3, it knows that there is a very high chance that 3 is the right cadence and 4 and 5 are just "mistakes" from the measurements. Indeed, it took advantage of the fact that 99% of the time, the IVTC cadence is stable and predictable.

Actually the filter was a little smarter than that: if it sees, for example, 3 - 3 - 3 - 3 - 3 - 5 - 5 - 5 - 5, it is able to detect that the IVTC cadence is changing, and will use pattern position 3 for the first 5 batches and pattern position 5 for the last 4 batches, resulting in perfect IVTC. Obviously you need to have a big enough M for this to work properly (but we have so much RAM nowadays, why not use it?). The only corner case where this doesn't work is if the IVTC cadence is changing very frequently (every second or so) but I don't think such problematic content exists (and if it does, shoot the engineer...).
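The two examples above can be reproduced with a simple majority vote over a window of past and future batches. This is only a guess at the general shape of such a filter, not the ffdshow patch itself; the window sizes here are counted in batches and are arbitrary.

```python
from collections import Counter


def smooth_positions(measured, past=3, future=3):
    """Replace each measured pattern position with the most common value
    among the `past` previous batches, the batch itself and the `future`
    following batches."""
    smoothed = []
    for i in range(len(measured)):
        window = measured[max(0, i - past): i + future + 1]
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed


print(smooth_positions([3, 3, 4, 3, 5, 3, 3, 3]))     # isolated mistakes are outvoted
print(smooth_positions([3, 3, 3, 3, 3, 5, 5, 5, 5]))  # a real cadence change survives
```

Isolated outliers lose the vote, while a sustained change wins it as soon as the window is mostly on the new side. That is why enough future context lets the switch happen on the right batch.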

The thing is, this context-aware algorithm is so reliable that even when measuring with heavy subsampling it still manages to avoid mistakes. For example I did some tests with extreme subsampling (using 1% or less of all pixels, resulting in extremely fast measurements), and still got near-perfect playback provided N and M are big enough. In the end it's all about memory (M) vs. CPU (subsampling), although the filter is still able to produce very acceptable results only using past measurements (M = 0). My filter took advantage of the fact that when discussing IVTC we have much more memory than CPU in our machines.
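As an illustration of what heavy subsampling might look like, the sketch below compares only every `step`-th luma sample. The sampling scheme is an assumption; the post does not say how the pixels were picked.

```python
def subsampled_sad(luma_a, luma_b, step=97):
    """Sum of absolute differences over every `step`-th luma sample.

    With step around 100 only about 1% of the pixels are read. A step
    that does not divide the row width avoids sampling a single column.
    """
    if len(luma_a) != len(luma_b):
        raise ValueError("planes must have the same size")
    return sum(abs(luma_a[i] - luma_b[i]) for i in range(0, len(luma_a), step))
```

The result is noisier than a full comparison, which is the trade described above: cheaper measurements, with the context window absorbing the extra errors.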

Another advantage of being a context-aware filter: most of the time when there is a big, animated banner at the bottom of the screen from the TV network, my filter still manages to IVTC the original, 24p content behind it, whereas other filters get completely confused and make mistakes because of interference from the animation. In other words, it shows very good resistance to noise and interference.

I don't think I'll ever get the motivation to finish my filter (especially now that the code is 2 years old). I'm sharing my findings here so that all this won't be in vain. nevcairiel, this could be useful for decimation (but it is even more useful for someone implementing complete IVTC including decombing). If anyone is interested, I still have my code from 2 years ago (it's a patch to ffdshow-tryouts). I even think I could build it and distribute a "demo" ffdshow.ax, provided my toolchain is still able to compile a two year old ffdshow codebase.

Also, to anyone developing an IVTC filter, some pieces of advice from my own experience with IVTC filter development:
- Don't trust chroma. Only use luma. On most TV broadcasts I've seen chroma is unusable for IVTC because of crosstalk. I don't remember the exact reasons (again, it was 2 years ago), but I remember it was most problematic when the pattern is positioned at the same time as a camera angle change. (chroma is usable for decimating 60p content, however)
- Some telecined content out there is seriously fucked up. At the time I had 1080i captures from a TV show called "The Unit" on CBS. Guess what: sometimes the content was jumping back and forth between soft telecine and hard telecine... several times per minute! Fortunately soft telecine is easy to handle (just average the frame durations), but I had to tweak my filter so that the soft/hard transitions stayed smooth, which was quite "acrobatic". If I remember correctly, other broadcasts from CBS exhibited the same insane behavior.
- As I said above, don't put too much trust in measurements. No matter what measurement algorithm you use, it will be wrong quite often because of noise, compression artefacts and mixed video content. Use context.
- I found that Decomb's measurement method (apply SAD for each 16x16 block and take the maximum of all blocks) gives the best results.
- To further avoid interference from animated banners, add an option to ignore some bottom portion of the image in the measurements.
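Several of these tips can be combined in one measurement function. The sketch below follows the description in this post (SAD per 16x16 block, maximum over all blocks, luma only, optional ignored bottom strip); it has not been checked against Decomb's source, and the parameter names are illustrative.

```python
def max_block_sad(luma_a, luma_b, width, height, block=16, ignore_bottom=0):
    """Largest per-block sum of absolute differences between two luma planes.

    The bottom `ignore_bottom` rows are skipped so that an animated banner
    there cannot dominate the result.
    """
    usable_height = height - ignore_bottom
    worst = 0
    for by in range(0, usable_height, block):
        for bx in range(0, width, block):
            sad = 0
            for y in range(by, min(by + block, usable_height)):
                row = y * width
                for x in range(bx, min(bx + block, width)):
                    sad += abs(luma_a[row + x] - luma_b[row + x])
            worst = max(worst, sad)
    return worst
```

Taking the maximum instead of the total means a small moving object still produces a clear signal, even when the rest of the picture is static.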

madshi

3rd June 2011, 10:09

I'm unsure how to handle mixed content. I think i would just stop dropping frames if the cadence is missing for a period of time (it switched to video content).
But is that a good thing if the user sets IVTC to "on"? In that case the display probably runs at 24Hz. Ok, a good video renderer will then do the dropping for you, so I guess it'd be ok. The question is who's in a better position to decide which frames to drop: The video renderer or you? Maybe, if IVTC is activated, you should simply drop the frames with the smallest differences? The renderer will not look at frame differences when dropping frames...

Not sure, just thinking "aloud".

madshi

3rd June 2011, 10:12

@e-t172, why don't you publish your source code? It might be useful, or maybe not. But in any case it's better to publish it than to let it rot without use... :) If you do publish it, please decide on the license (public domain or GPL or [...]). Thx.

e-t172

3rd June 2011, 10:23

You're right. I'll try to compile it and do some basic tests to make sure it at least somewhat works so it can be used as a "demo" for my algorithm, but I don't promise anything. If I remember correctly, I got the algorithm working perfectly but the filter sometimes crashes in certain situations, or just randomly. However, if I'm not mistaken, it at least works for 5-10 minutes before the first crash, so it is still usable as a demo. I'll do my thing and get back to you.

nevcairiel

3rd June 2011, 10:32

I'm at work right now, so i cannot go into too much detail on your post, but thanks for the thorough explanation.

I'll of course be working with context on how to decide what to drop, however i'm not sure if i'll be able to have "future" context, as that would require quite some drastic changes and added complexity (the current filter design does not have a concept of buffering frames).
If i can, i would like to get around without future context, but if it's necessary and drastically improves precision, i might just do it.
As for memory usage, i'm not afraid to use some, as long as we can keep it within an acceptable memory limit (so that 32-bit applications still work perfectly).

e-t172

3rd June 2011, 10:45

I'll of course be working with context on how to decide what to drop, however i'm not sure if i'll be able to have "future" context, as that would require quite some drastic changes and added complexity (the current filter design does not have a concept of buffering frames).
If i can, i would like to get around without future context, but if it's necessary and drastically improves precision, i might just do it.

You don't need future context as long as the cadence does not change (i.e. there is no discontinuity in the telecined stream). If the filter only considers past context and stumbles upon a discontinuity, then it will get the pattern position wrong for a certain number of batches (because it'll trust past positions, which is wrong). After that it will pick up on the new pattern position and everything will be right again. So most of the time the IVTC will be perfect, but when a discontinuity occurs, the result will be ugly for X frames after the discontinuity.
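The lag after a discontinuity is easy to see with a vote that looks only backwards. This is a sketch under the same assumptions as a plain majority vote; the window length of 4 past batches is arbitrary.

```python
from collections import Counter


def past_only_positions(measured, past=4):
    """Majority vote over the current batch and the `past` batches before it."""
    out = []
    for i in range(len(measured)):
        window = measured[max(0, i - past): i + 1]
        out.append(Counter(window).most_common(1)[0][0])
    return out


# The cadence really changes at batch 5, but the vote only follows at batch 7.
print(past_only_positions([3, 3, 3, 3, 3, 5, 5, 5, 5]))
```

In this example the two batches right after the change still get the old position, which is the "ugly for X frames" window. A longer past window makes the vote steadier but lengthens that lag.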

In most 1080i30 broadcasts I've seen, these discontinuities are quite rare (except, of course, in case of advertising or packet loss due to reception problems). They do exist, however. Some broadcasts, like the example I gave about CBS, have discontinuities all over the place (switching back and forth between soft and hard telecine) and are nearly impossible to IVTC perfectly without future context.

Also, if you don't have future context, then the first X frames of output (beginning of playback, or seeking) are of course likely to be wrong. In fact, for the first 5 frames, you won't be able to make a decision at all, because you don't even have a complete 5-frame batch to begin with. If you buffer frames, it means you can have context even for the very first frames of the current stream, and so get the first batches right.

As for memory usage, i'm not afraid to use some, as long as we can keep it within an acceptable memory limit (so that 32-bit applications still work perfectly).

Well, you can use as much past context as you want without increasing memory usage (you just have to store the measured pattern positions, not the frames themselves). Memory is only consumed when buffering frames for future context.
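A back-of-the-envelope calculation shows the difference in scale. The numbers below assume 8-bit 4:2:0 frames at 1920x1080 and a hypothetical buffer of M = 30 future frames; the actual figures depend on the filter's design.

```python
from collections import deque


def frame_bytes(width, height):
    """Size of one 8-bit 4:2:0 frame (YV12 or NV12): 1.5 bytes per pixel."""
    return width * height * 3 // 2


# Past context: one small integer per batch, so the history stays tiny.
past_positions = deque(maxlen=30)

# Future context: whole frames have to be held until a decision is made.
future_buffer_bytes = 30 * frame_bytes(1920, 1080)
print(future_buffer_bytes / 2**20)  # size in MiB
```

That works out to roughly 89 MiB for the future buffer at 1080p, which is noticeable but fits within a 32-bit process, while the past history costs next to nothing.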