Yeah, copying back through an A8R8G8B8 surface is what has always killed me too: the readback is the slowest operation, and it has to transfer 4x as much data as necessary. That is why I was thinking that the four-frames-at-a-time approach would work well...
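Here is a rough sketch of how the four-frames-at-a-time idea could look on the CPU side. It assumes each frame is a single 8-bit plane (which is where the 4x waste would come from, since one useful byte then sits in a 32-bit pixel), and that four such planes go into the A, R, G and B channels of one A8R8G8B8-style buffer. The names `Plane`, `PackFourFrames` and `UnpackFourFrames` are made up for illustration, the frame-to-channel order is arbitrary, and no actual D3D calls are involved.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One 8-bit plane of a single frame (width * height bytes). Hypothetical type.
using Plane = std::vector<std::uint8_t>;

// Pack four equally sized 8-bit planes into 32-bit pixels laid out like
// D3DFMT_A8R8G8B8: A in bits 24-31, R in 16-23, G in 8-15, B in 0-7.
// Assumption: frame 0 -> A, frame 1 -> R, frame 2 -> G, frame 3 -> B.
std::vector<std::uint32_t> PackFourFrames(const std::array<Plane, 4>& frames)
{
    const std::size_t n = frames[0].size();
    for (const Plane& p : frames) {
        assert(p.size() == n);  // all four planes must have the same size
    }

    std::vector<std::uint32_t> packed(n);
    for (std::size_t i = 0; i < n; ++i) {
        packed[i] = (static_cast<std::uint32_t>(frames[0][i]) << 24)
                  | (static_cast<std::uint32_t>(frames[1][i]) << 16)
                  | (static_cast<std::uint32_t>(frames[2][i]) << 8)
                  |  static_cast<std::uint32_t>(frames[3][i]);
    }
    return packed;
}

// Split a packed buffer (e.g. after reading it back) into its four planes,
// using the same channel assignment as PackFourFrames.
std::array<Plane, 4> UnpackFourFrames(const std::vector<std::uint32_t>& packed)
{
    std::array<Plane, 4> frames;
    for (Plane& p : frames) {
        p.resize(packed.size());
    }
    for (std::size_t i = 0; i < packed.size(); ++i) {
        const std::uint32_t px = packed[i];
        frames[0][i] = static_cast<std::uint8_t>(px >> 24);
        frames[1][i] = static_cast<std::uint8_t>(px >> 16);
        frames[2][i] = static_cast<std::uint8_t>(px >> 8);
        frames[3][i] = static_cast<std::uint8_t>(px);
    }
    return frames;
}
```

With this layout every byte that comes back carries useful data, so one readback of a W x H surface returns four W x H planes instead of one.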
If you want the source for AviShader which has all the D3D stuff, PM me and I'll see what I can do. D3D is pretty straightforward once you grok it.