I just looked over your ffdshow changes, and for the record: it's always much nicer to keep changes separated across multiple commits. For example, the bug fixes and the addition of the QS decoder should've been at least two commits, if not more. Just sayin', it's not my project or anything.
One thing I noticed, though: your SSE2 memcpy seems superfluous. If ffdshow is configured to use function intrinsics, the MS compiler will already use an optimized memcpy that takes advantage of SSE2 if available. I did some testing along those lines recently, and a custom SSE2 memcpy was actually not faster.
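If you want to check this yourself, here is a rough sketch of the kind of comparison I mean. It is not the code from your patch: `sse2_memcpy`, `crt_memcpy`, and `time_copy` are names I made up for illustration, the copy loop is just a plain unaligned 16-bytes-at-a-time version, and it assumes an x86 build with SSE2 enabled and `n > 0`. A serious benchmark would also need warm-up runs and a range of buffer sizes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical hand-written SSE2 copy: 16 bytes per iteration,
   unaligned loads/stores, byte loop for the tail. */
static void *sse2_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_storeu_si128((__m128i *)(d + i), v);
    }
    for (; i < n; ++i)
        d[i] = s[i];
    return dst;
}

/* Thin wrapper so the compiler's own memcpy handling is what gets timed. */
static void *crt_memcpy(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Reading one byte back through a volatile keeps the copies from
   being optimized away entirely. */
static volatile unsigned char sink;

/* Returns the CPU time in seconds spent on `reps` copies of `n` bytes. */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t n, int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r) {
        fn(dst, src, n);
        sink = ((unsigned char *)dst)[0];
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Call `time_copy(sse2_memcpy, ...)` and `time_copy(crt_memcpy, ...)` on the same buffers with the same `reps` and compare the two numbers, ideally at frame-sized buffers since that is what ffdshow would be copying.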
In addition to that, I don't think ffdshow had a hard dependency on SSE2 before.
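If the SSE2 path stays in, one way to avoid the hard dependency would be to pick the copy routine at runtime. This is only a sketch of the idea, not how ffdshow does its CPU detection: `cpu_has_sse2`, `copy_generic`, `copy_sse2_placeholder`, and `pick_copy` are hypothetical names, and the placeholder just forwards to `memcpy` where the real SSE2 routine would go.

```c
#include <stddef.h>
#include <string.h>
#if defined(_MSC_VER)
#include <intrin.h>  /* __cpuid */
#endif

typedef void *(*copy_impl)(void *, const void *, size_t);

/* Returns 1 if the CPU reports SSE2 support, 0 otherwise. */
static int cpu_has_sse2(void)
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4];                  /* EAX, EBX, ECX, EDX */
    __cpuid(regs, 1);             /* leaf 1: feature flags */
    return (regs[3] >> 26) & 1;   /* EDX bit 26 = SSE2 */
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return 0;                     /* not x86: no SSE2 */
#endif
}

/* Fallback that works on every CPU. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Hypothetical stand-in for the SSE2 routine; here it only
   forwards to memcpy so the sketch stays self-contained. */
static void *copy_sse2_placeholder(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Choose once at startup and call through the returned pointer. */
static copy_impl pick_copy(void)
{
    return cpu_has_sse2() ? copy_sse2_placeholder : copy_generic;
}
```

That way a CPU without SSE2 still gets a working copy instead of an illegal-instruction crash, as long as the SSE2 code itself sits in a function that is only ever reached through the pointer.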