tsp,
Layer and Overlay make use of the alpha channel, so generally it is preferable not to clobber it. However, given that this filter currently copies a chosen channel to the other 2, it is not a great stretch to
define it to copy a channel to the other 3. It certainly makes the MMX code easier. The core also currently has a ShowAlpha(clip, string pixel_type), which is just C code that copies the alpha to the R, G and B channels, so there is a precedent.
For the MMX code I was thinking along the lines of
Code:
movq mm0,[src]                 // load pixels p1, p0 (4 bytes each)
movq mm1,[src+8]               // load pixels p3, p2
pand mm0,k000000ff000000ff     // dd p1, p0 (only the low byte of each pixel kept)
pand mm1,k000000ff000000ff     // dd p3, p2
packssdw mm0,mm1               // dw p3, p2, p1, p0
packuswb mm0,mm0               // db p3, p2, p1, p0, p3, p2, p1, p0
punpcklbw mm0,mm0              // db p3, p3, p2, p2, p1, p1, p0, p0
movq mm1,mm0                   // keep a copy for the upper two pixels
punpcklbw mm0,mm0              // db p1, p1, p1, p1, p0, p0, p0, p0
punpckhbw mm1,mm1              // db p3, p3, p3, p3, p2, p2, p2, p2
movq [dst],mm0                 // store pixels p1, p0
movq [dst+8],mm1               // store pixels p3, p2
As this code only uses 3 registers, you could add a 2nd (or 3rd) stream and process 8 (or 12) pixels at once. It should really scream.
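For checking the MMX output, a plain C reference of the same operation might look something like the sketch below. The function name broadcast_channel and its channel parameter are made up for illustration and are not anything in the core. It assumes 32-bit pixels, with channel 0 being the lowest byte of each pixel, which is the byte the pand mask selects.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine, not part of the AviSynth core.
   Copies one byte of each 32-bit pixel into all four bytes of the
   output pixel. channel 0 is the lowest byte of the pixel. */
static void broadcast_channel(const uint32_t *src, uint32_t *dst,
                              size_t pixels, unsigned channel)
{
    const unsigned shift = 8u * (channel & 3u);
    for (size_t i = 0; i < pixels; ++i) {
        uint32_t v = (src[i] >> shift) & 0xffu;  /* isolate the chosen byte */
        dst[i] = v * 0x01010101u;                /* replicate it into all 4 bytes */
    }
}
```

With channel 0 this should match the MMX sequence pixel for pixel. For any other channel the MMX version would need a psrld on mm0 and mm1 before the pand, which is what the shift does here.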
IanB