View Full Version : how to do smartsmoothing in YV12


FredBender
30th August 2003, 01:01
Hello,

I am planning on writing my own smartsmoother for avisynth just for the pure fun of it ;)

I have thought of different possibilities to implement it and would like to have information about how the popular smartsmoother/2dcleaner filters work or just what you think about it.

Just to clarify things: by smartsmoothing I mean replacing a pixel by the average of surrounding pixels which do not differ more than a certain threshold from the pixel in progress.

1) smartsmooth y, u, v separately (should be very fast)

2) smartsmooth y, u/v separately

3) smartsmooth u/v, then smartsmooth only those ys where the corresponding u/v were within the threshold.

4) smartsmooth only those "double pixels" where the two ys and the u/v lie within the threshold.
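A minimal sketch of the definition given in this post, applied to a single plane (that is, option 1 run once per plane). It is written in Python only for readability; the function name, the `radius` and `threshold` parameters, and the rounding are assumptions for illustration, not the behaviour of any existing filter.

```python
def smart_smooth(plane, radius=1, threshold=10):
    """Replace each pixel by the average of the neighbours whose value
    differs from it by at most `threshold` (the pixel itself always counts).

    `plane` is a list of rows of integer samples (hypothetical layout)."""
    height, width = len(plane), len(plane[0])
    out = [row[:] for row in plane]
    for y in range(height):
        for x in range(width):
            center = plane[y][x]
            total = 0
            count = 0
            # Clamp the window at the plane borders.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    value = plane[ny][nx]
                    if abs(value - center) <= threshold:
                        total += value
                        count += 1
            # count >= 1 because the center pixel always passes the test.
            out[y][x] = (total + count // 2) // count  # rounded average
    return out
```

Pixels on the far side of an edge fall outside the threshold and are left out of the average, which is what keeps edges sharp.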

Hm, YV12 can be a pain in the *** :(

Thanx for your replies in advance!

Kurosu
30th August 2003, 02:09
Sorry, most is already done and MMX optimized :)
http://kurosu.inforezo.org/avs/Smooth/index.html

Points 3 and 4 aren't implemented but could be. In terms of MMX code, having a mask of what to process and what not to process would only pay off in terms of quality. Speed-wise, it's only slower.

sh0dan
30th August 2003, 03:24
As always very nice (and efficient) work by Kurosu!

Smooth HiQ is quite difficult to do in YV12 - as in YUY2 - at least fast. I don't think the original idea is fit for a subsampled colorspace, thus MipSmooth. I'm still working on a new mipsmooth, that will hopefully soon be a fast and efficient (read: fast) alternative.

However for the time being the best (spatial) filter is without any doubt VagueDenoise - I can't recommend it enough. Use it - enjoy it!
:cool:

FredBender
30th August 2003, 03:58
Originally posted by Kurosu
Sorry, most is already done and MMX optimized :)
That's okay, as I was originally saying I'm just having fun coding it myself ;)
http://kurosu.inforezo.org/avs/Smooth/index.html
The best is to redirect you first to the pages who are at the base of this filter
The first two links do not work, I thought you might want to know (in case that's your page you're referring to, that's what I understood from the last posting).

FredBender
30th August 2003, 04:04
Originally posted by sh0dan
However for the time being the best (spatial) filter is without any doubt VagueDenoise - I can't recommend it enough. Use it - enjoy it!
:cool:
I am quite new to avisynth and all the filters and will give it a try. How do all the available spatial filters differ in terms of the cleaning strategy?

EDIT: This is what my .avs looks like:

loadplugin("mpeg2dec3.dll")
loadplugin("decomb.dll")
loadplugin("VagueDenoiser.dll")
mpeg2source("e:\temp\a.d2v")
crop(20,16,680,544)
telecide(chroma=true)
bilinearresize(512,384)
VagueDenoiser(filter=7,threshold=1.5,method=1,nsteps=6,chroma=true)

But I get the following error message when opening the .avs

avisynth open failure:
LoadPlugin: unable to load "VagueDenoiser.dll"

Help anyone... :confused:

Kurosu
30th August 2003, 15:20
Syntax shows you're using an outdated version of VagueDenoiser. This version usually needs a P4 to run, and maybe a runtime lib from Intel. But there are other versions, just check the 2 relevant main threads.

FredBender
30th August 2003, 15:32
Originally posted by Kurosu
Syntax shows you're using an outdated version of VagueDenoiser. This version usually needs a P4 to run, and maybe a runtime lib from Intel. But there are other versions, just check the 2 relevant main threads.
Thank you, I'll give it a try! Another question: In order to profile the speed of my filter, I would like to know if there is a possibility to store YV12 uncompressed in an AVI, or at least a codec that does not take up too much time compressing YV12.

Thanks in advance!

EDIT: For comparing the surrounding ys with the actual y, I load 4 pixels into mm0 by:

movd mm0, [ebx]

I did two different tests, one with a "surrounding pixels window" moving in units of 1 pixel (thus 75% of the movd being misaligned) and one where the window only moves every four pixels (by and ebx, -4 before the movd) for perfect alignment. But I cannot measure a difference of even one second in an encoding time of 6 minutes. Does movd not care about dword alignment?
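As a small illustration of the two test setups described in this edit, the sketch below models what `and ebx, -4` does to an address and counts how many addresses in a window stepping by one byte are not dword-aligned. The helper names and the sample addresses are made up for the example.

```python
def align_down(address, alignment=4):
    """Round an address down to a multiple of `alignment` (a power of two),
    which is what `and ebx, -4` does for alignment 4."""
    return address & -alignment


def misaligned_share(start, count, alignment=4):
    """Fraction of `count` consecutive byte addresses that are not aligned."""
    misaligned = sum(1 for a in range(start, start + count) if a % alignment != 0)
    return misaligned / count
```

For a window moving one pixel at a time, three out of every four loads start off a dword boundary, which matches the 75% figure in the post.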

Kurosu
30th August 2003, 16:50
Originally posted by FredBender
Thank you, I'll give it a try! Another question: In order to profile the speed of my filter, I would like to know if there is a possibility to store YV12 uncompressed in an AVI, or at least a codec that does not take up too much time compressing YV12.
You can try:
- VBLE codec by MarcFD (in one of the Avisynth forums)
- Newest builds of FFVfW have an even more powerful lossless codec, though the playback can eat a lot of resources depending on the compression parameters (in the XviD forum).

I did two different tests, one with a "surrounding pixels window" moving in units of 1 pixel (thus 75% of the movd being misaligned) and one where the window only moves every four pixels (by and ebx, -4 before the movd) for perfect alignment. But I cannot measure a difference of even one second in an encoding time of 6 minutes. Does movd not care about dword alignment?
1) On such processing, there may be a difference, but the overall processing also includes source decoding and video display (with colour conversion). You can check http://www.avstimer.de.tf/ or http://kurosu.inforezo.org/avs/Kronos/ for tools to more accurately check execution times. The latter has a lot more overhead, but I think it's still usable - precision vs usability, I guess. Anyway, you still need to run several tests when measuring, as memory fragmentation may have some consequences. When I want to be sure of my result, I do 3 runs of around 500 frames.
2) As pointed out by trbarry, once the data is in the cache, unaligned reads don't matter that much. And you need 16 such reads to generally hit a new reload, but I guess that even with that, hardware is able to do some natural prefetch. All in all, I guess it only matters with higher read stride, for instance when using xmm registers.
3) Sh0dan will probably give you a better explanation :)
4) For profiling, check "VagueDenoiser optimization" thread, it has some of the above stuff, and a link to AMD CodeAnalyst, if that can help you.

FredBender
30th August 2003, 17:13
Is there a faster way to do what you can see in the comments?

mov al, [esi]
movd mm0, eax // mm0 = | ?| ?| ?| ?| ?| ?| ?| Y|
punpcklbw mm0, mm3 // mm0 = | 0| ?| 0| Y|
punpcklwd mm0, mm0 // mm0 = | 0| 0| Y| Y|
punpckldq mm0, mm0 // mm0 = | Y| Y| Y| Y|


And:

__int64 _4times1 = 0x0001000100010001;

// mm4 = | sum3| sum2| sum1| sum0|
// mm5 = | n3| n2| n1| n0|

pmaddwd mm4, _4times1 // mm4 = | sum3+sum2| sum1+sum0|
pmaddwd mm5, _4times1 // mm5 = | n3+n2 | n1+n0 |
packssdw mm4, mm5 // mm4 = | n32| n10|sum32|sum10|
pmaddwd mm4, _4times1 // mm4 = | n| sum|


As far as I know pmaddwd is pretty slow, right? Aside from the fact that I don't really need a multiply, I just want to add 4 words in one mmx reg together.
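To make the horizontal add in this post easier to follow, here is a small Python model of `pmaddwd` against the constant `| 1| 1| 1| 1|`. The register is represented as a list of four 16-bit words, lowest word first; that representation and the helper names are assumptions for the sketch. One documented property worth keeping in mind is that `pmaddwd` reads its word operands as signed.

```python
def to_signed16(value):
    """Interpret the low 16 bits of `value` as a signed word."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pmaddwd(a, b):
    """Model of pmaddwd: multiply signed words pairwise and add adjacent
    products. Returns the two 32-bit results, lowest dword first."""
    sa = [to_signed16(w) for w in a]
    sb = [to_signed16(w) for w in b]
    low = (sa[0] * sb[0] + sa[1] * sb[1]) & 0xFFFFFFFF
    high = (sa[2] * sb[2] + sa[3] * sb[3]) & 0xFFFFFFFF
    return [low, high]


def horizontal_sum(words):
    """Add the four words of one register, as the pmaddwd sequence does."""
    low, high = pmaddwd(words, [1, 1, 1, 1])
    return (low + high) & 0xFFFFFFFF
```

With words that stay below 0x8000 the result is the plain sum; a word of 0x8000 or more is treated as negative, so `horizontal_sum([0x8000, 0, 0, 0])` comes out as 0xFFFF8000 rather than 32768.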

Kurosu
30th August 2003, 17:42
I'll add the proper tags missing :)

First of all, I have myself little experience with assembly language. It only means math stuff to me, I'm really not aware of the fine tricks available.

Originally posted by FredBender
Is there a faster way to do what you can see in the comments?
mov al, [esi]
movd mm0, eax // mm0 = | ?| ?| ?| ?| ?| ?| ?| Y|
punpcklbw mm0, mm3 // mm0 = | 0| ?| 0| Y|
punpcklwd mm0, mm0 // mm0 = | 0| 0| Y| Y|
punpckldq mm0, mm0 // mm0 = | Y| Y| Y| Y|
No, really, that's how we all do this.


__int64 _4times1 = 0x0001000100010001;

// mm4 = | sum3| sum2| sum1| sum0|
// mm5 = | n3| n2| n1| n0|

pmaddwd mm4, _4times1 // mm4 = | sum3+sum2| sum1+sum0|
pmaddwd mm5, _4times1 // mm5 = | n3+n2 | n1+n0 |
packssdw mm4, mm5 // mm4 = | n32| n10|sum32|sum10|
pmaddwd mm4, _4times1 // mm4 = | n| sum|

As far as I know pmaddwd is pretty slow, right? Aside from the fact that I don't really need a multiply, I just want to add 4 words in one mmx reg together.
I wonder if pmaddwd is that slow... I would have tried a mixture of pack/unpacks, shifts and adds but this all depends on what maximum result you may have to store:

//mm3 = 0x0000FFFF0000FFFFi64;
movq mm6,mm4
movq mm7,mm5
punpcklwd mm5,mm4
punpckhwd mm7,mm6
paddusw mm7,mm5
movq mm4,mm7
pand mm7,mm3
psrld mm4,16
pand mm4,mm3
paddd mm4,mm7
It has a higher instruction count, it has stalls and unpaired instructions. Probably not worth it. If you can hit the limit at the very first addition, then you have to use 4 regs, unpack them to dw with another zeroed (xor'ed) reg and do the addition from there. That might be even faster, because of fewer stalls and unpaired instructions.

EDIT: hope the font change makes it look better.
EDIT: not really, is there no possibility to make more than one space occur?
The magic [ code ] tag :)

Regarding a smarter smoother, having weights depending on how far the pixel is from the center pixel may yield a better result. Newer versions of Deen have this. (see pm)

FredBender
30th August 2003, 18:34
I just read through some mmx and general optimization guidelines (only knew about basic pairing rules until then) and came upon the following suggestion:

Align frequently executed branch targets on 16-byte boundaries.

So I did it for the innermost loop and guess what... execution time dropped to 38% which means the filter is now 2.6 times as fast just because of a single "align 16" directive!

sh0dan
30th August 2003, 18:48
I don't know if you've seen it, but there is also some stuff about Assembleroptimizing (http://www.avisynth.org/index.php?page=AssemblerOptimizing) at avisynth.org - you could also add your align 16 observation.

FredBender
31st August 2003, 01:37
Originally posted by sh0dan
I don't know if you've seen it, but there is also some stuff about Assembleroptimizing (http://www.avisynth.org/index.php?page=AssemblerOptimizing) at avisynth.org - you could also add your align 16 observation.

It's already mentioned in the Simple MMX Optimization Guide (http://www.avisynth.org/index.php?page=SimpleMmxOptimization):


yloop:

mov edx,[row_size]
align 16
xloop:


EDIT: d'oh, I just realised that I compared two completely different versions of my filter when I stated that with "align 16" it would be 2.6 times as fast. I have rebuilt everything and now the speed gain is barely measurable.

I worked some more hours on the code and unrolled the inner loop, resulting in a speed gain of ~20%. Plus I noticed that the entire pmaddwd thing at the end is a lot slower than just dividing without mmx :p

Reading through the mmx optimization guidelines I found, I wonder if the restriction that "MMX with memory access and INTEGER" instructions do not pair has been lifted on Athlon or P4? Anyone got good links?

MasterYoshidino
31st August 2003, 08:42
I hate to sound off the discussion but I would like to give 1000 thanks to korusu for compiling anything that can be used in avisynth :), as using veedub is very slow due to color conversion.
yv12 would be nice but since normal huffies are yuy2 it's not too necessary.

the code is already mmx optimized, and looks clean so just port it to support yv12 and see if it crashes anyone's system ;)


edit [see below post]: yes after testing on a TV commercial clip of a hentai anime "Usagichan de Cue!!!" I found it only worked in YV12. :D
finally something that works on rainbows that is not as slow as SSIQ :)

Kurosu
31st August 2003, 11:44
1) It's already and only YV12 (but the site doesn't mention it at all - though you should have figured it out by testing ;) )
2) It's Kurosu

Kurosu
31st August 2003, 12:00
Originally posted by FredBender
I worked some more hours on the code and unrolled the inner loop, resulting in a speed gain of ~20%. Plus I noticed that the entire pmaddwd thing at the end is a lot slower than just dividing without mmx :p
1) If you want to try to write a "more-than-mmx" code, maybe some prefetching would be nice.
2) I tested with success a small trick (that's almost half of the asm code)

a/b ~ (a*2 * (2^15-1)/b)>>16
(There are rounding problems, but considering we are already degrading the picture by smoothing...)
Why 2^15-1? Because 2^15 is a negative word for pmulhw, so 2^15-1 is the highest positive word. Why pmulhw? Because it automatically does the >>16 stuff.
However, wherever 'a' can reach 2^14 (that is for a 9*9 kernel - radius 4), you can't use such a trick. And using MMX code to work on dwords would be almost impractical, so I just resort to the integer divide.
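The trick can be checked with a one-lane model of `pmulhw`. This is only a sketch under stated assumptions: the table name `RECIP`, the helper names, and the choice of 49 table entries (a 7x7 kernel, whose worst-case sum 49*255 = 12495 stays under 2^14) are mine, not taken from the filter's source.

```python
def pmulhw_lane(x, y):
    """High 16 bits of the signed 16x16-bit product, as one pmulhw lane returns it."""
    assert -0x8000 <= x <= 0x7FFF and -0x8000 <= y <= 0x7FFF
    return (x * y) >> 16


# Precalculated (2^15-1)/b for every possible pixel count b = 1..49.
# Index 0 is a placeholder that is never used.
RECIP = [0] + [0x7FFF // b for b in range(1, 50)]


def approx_div(a, b):
    """a/b ~ ((a*2) * ((2^15-1)/b)) >> 16, valid while a*2 fits a signed word."""
    assert 0 <= a < (1 << 14)
    return pmulhw_lane(a * 2, RECIP[b])
```

In this model the result is never too high and is at most one too low: for example `approx_div(100, 1)` gives 99 and `approx_div(12495, 49)` gives 254 where the exact quotient is 255. That is the bias towards zero accepted in exchange for dropping the divide.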

FredBender
31st August 2003, 16:35
1) If you want to try to write a "more-than-mmx" code, maybe some prefetching would be nice.
Can you go a little deeper into details? I have honestly no idea what you're talking about :)

a/b ~ (a*2 * (2^15-1)/b)>>16
1) I don't understand why that's true, but I believe you ;)
2) Replacing one div by 2 muls and a div? Or would you precalc all possible (2^15-1)/b?

However, wherever 'a' can reach 2^14 (that is for a 9*9 kernel - radius 4), you can't use such a trick.
The highest possible value for a in my case is somewhere between 2^15 and 2^16 (assuming max y is 235 IIRC). I use a 16*16 'kernel' or what I referred to as 'moving pixel window'.

Since I assume a div is still a lot slower than a mul, I did a precalc'd table unsigned int divisiontable[257] where

divisiontable[0] = 0 (never occurs)
divisiontable[1] = 2^32-1 (not exact, but you can't store 2^32 in a 32 bit reg - however it's extremely improbable that none of the 255 surrounding pixels will be within the tolerance)
divisiontable[i] = 2^32 / i (for 2 <= i <= 256)

And I implement the division as follows (I left out paired integer instructions which have nothing to do with the actual algo for clarity):

// mm4 = | sum3| sum2| sum1| sum0|
// mm5 = | n3| n2| n1| n0|

movq mm1, [_4x1] // mm1 = | 1| 1| 1| 1|
pmaddwd mm4, mm1 // mm4 = | sum3+sum2| sum1+sum0|
movq [sum], mm4
pmaddwd mm5, mm1 // mm5 = | n3+n2 | n1+n0 |
movq [n], mm5

mov eax, dword ptr [sum] // eax = sum1+sum0
mov ecx, dword ptr [n] // ecx = n1+n0

add eax, dword ptr [sum+4] // eax = sum3+sum2+sum1+sum0 = final sum
add ecx, dword ptr [n+4] // ecx = n3+n2+n1+n0 = final n

mov ebx, divisiontable
mul dword ptr [ebx+4*ecx] // the quotient is now in dl
mov [edi], dl

Yeah, I left two pmaddwd instructions in because I did not want to add 16 bit values with normal integer registers. The last 3 instructions replace the integer div which would have been:

xor edx, edx
div ecx
mov [edi], al

So the lookup table mul version isn't even longer in terms of the number of instructions. The only penalty is the inaccuracy when n = 1, resulting in y/1 -> y-1 :(
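The table-based division can be modelled in a few lines to see what ends up in `dl`. This is a sketch of the scheme as described in this post, assuming truncating division when the table is built; the Python names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

# division_table[0] is unused, [1] is clamped to 2^32-1, the rest is 2^32 / i
# with truncation (an assumption about how the table was filled).
division_table = [0, MASK32] + [(1 << 32) // i for i in range(2, 257)]


def table_divide(total, n):
    """Model of `mul dword ptr [table+4*n]` followed by reading dl."""
    product = total * division_table[n]  # edx:eax = eax * table[n]
    edx = (product >> 32) & MASK32       # high dword holds the quotient
    return edx & 0xFF                    # dl
```

Running the model reproduces the y/1 -> y-1 case: `table_divide(200, 1)` returns 199. It also suggests that, with a truncated table, exact multiples come out one too low whenever n is not a power of two, for example `table_divide(765, 3)` returns 254, while `table_divide(65280, 256)` returns 255 as expected.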

Kurosu
31st August 2003, 17:52
Originally posted by FredBender
Can you go a little deeper into details? I have honestly no idea what you're talking about :)
In case you want to write iSSE code (which should ship with almost all x86 CPUs now - Duron and Athlon < 1GHz, K6, P2, celerons don't have it though), there are a few instructions that are useful in this case:
- prefetch functions that preemptively ask for memory to be read while the CPU is busy calculating
- some more I didn't think of, like pshufw (rearrange words in an mmx register), functions to extract part of an mmx reg to a conventional reg (said to be extremely slow) and maybe others you might find a use for.

1) I don't understand why that's true, but I believe you ;)
2) Replacing one div by 2 muls and a div? Or would you precalc all possible (2^15-1)/b?
Simply rewrite the equation (and assume no precision and rounding cause problems):

(a*2 * (2^15-1)/b)>>16 ~ (a*2*2^15/b)/2^16
~ (a*2^16/b)/2^16
~ (a/b)

Therefore you indeed just have to precalc (2^15-1)/b, and that's done in the constructor in my filter.

The highest possible value for a in my case is somewhere between 2^15 and 2^16 (assuming max y is 235 IIRC). I use a 16*16 'kernel' or what I referred to as 'moving pixel window'.
I wouldn't assume max Y is 235. Filters could change the dynamic; outputting values over 235. You'd have to enforce/invoke in your filter Limiter to be sure of this.
Regarding your kernel:
- if you weigh this kernel, that's worth doing this, but that's almost fetching 8 pixels from the current pixel; unless your video is very noisy, I don't really see the benefit
- is this window size fixed?
- 16*16*235 = 60160, and pmaddwd takes signed words; any flat area with at least y=128 will overflow this, and you won't get what you expected :)

Since I assume a div is still a lot slower than a mul, I did a precalc'd table unsigned int divisiontable[257] where
That's what I did, but it's only available for radius<5 (i.e. below an 11x11 kernel) :)
Actually I could use this too for bigger radius, but that was just to allow more dynamic. AFAIK, I limit the filter to 15x15 (radius of 7) and radius >= 5 is handled by another function.

So the lookup table mul version isn't even longer considered the number of instructions. The only penalty is the inaccuracy when n = 1, resulting in y/1 -> y-1 :(
Your algo looks fine to me, although I never practised regular asm. I just started using it 6 months ago, and only because of MMX. But I wouldn't care too much for the y-1 result: if you have to use a smart smoother, your picture is most probably noisy. What do you risk? 1 less on edges? luminance globally diminished? It's a very good trade-off for speed imo, as it can't be fixed easily.

Dreassica
31st August 2003, 23:19
Originally posted by Kurosu
1) It's already and only YV12 (but the site doesn't mention it at all - though you should have figured it out by testing ;) )
2) It's Kurosu

Niice, works like a charm as well.
A pity it doesn't work in a qmf environment, it makes it crash :(
Prolly a memory leak or something. That's what u get from testing unoptimised filters. :D

FredBender
31st August 2003, 23:27
In case you want to write iSSE code (which should ship with almost all x86 CPUs now - Duron and Athlon < 1GHz, K6, P2, celerons don't have them though)
Then it would be quite hard to write that code on my Duron 650 :)
some more I didn't think of, like pshufw (rearrange words in an mmx register)
AFAIR my CPU supports pshufw, I'm almost sure I've used it before. That's the perfect solution for a question I asked earlier in this thread:

mov al, [esi]
movd mm0, eax // mm0 = | ?| ?| ?| ?| ?| ?| ?| Y|
punpcklbw mm0, mm3 // mm0 = | 0| ?| 0| Y|
punpcklwd mm0, mm0 // mm0 = | 0| 0| Y| Y|
punpckldq mm0, mm0 // mm0 = | Y| Y| Y| Y|

could be replaced by

xor eax, eax
mov al, [esi]
movd mm0, eax // mm0 = | 0| 0| 0| Y|
pshufw mm0, mm0, 0 // mm0 = | Y| Y| Y| Y|
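Both broadcast sequences can be compared with a small register model. The sketch treats an MMX register as a 64-bit Python integer and models only the four instructions involved; the function names `broadcast_unpack` and `broadcast_pshufw` and the garbage value in the test are made up for the example.

```python
def punpcklbw(dst, src):
    """Interleave the low four bytes of dst and src (dst byte in the low half)."""
    out = 0
    for i in range(4):
        out |= ((dst >> (8 * i)) & 0xFF) << (16 * i)
        out |= ((src >> (8 * i)) & 0xFF) << (16 * i + 8)
    return out


def punpcklwd(dst, src):
    """Interleave the low two words of dst and src."""
    out = 0
    for i in range(2):
        out |= ((dst >> (16 * i)) & 0xFFFF) << (32 * i)
        out |= ((src >> (16 * i)) & 0xFFFF) << (32 * i + 16)
    return out


def punpckldq(dst, src):
    """Combine the low dwords of dst and src."""
    return (dst & 0xFFFFFFFF) | ((src & 0xFFFFFFFF) << 32)


def pshufw(src, order):
    """Each 2-bit field of `order` picks the source word for one result word."""
    out = 0
    for i in range(4):
        select = (order >> (2 * i)) & 3
        out |= ((src >> (16 * select)) & 0xFFFF) << (16 * i)
    return out


def broadcast_unpack(eax, zero=0):
    """mov al / movd / punpcklbw / punpcklwd / punpckldq; upper eax bytes may be garbage."""
    mm0 = eax & 0xFFFFFFFF
    mm0 = punpcklbw(mm0, zero)
    mm0 = punpcklwd(mm0, mm0)
    return punpckldq(mm0, mm0)


def broadcast_pshufw(y):
    """xor eax, eax / mov al / movd / pshufw mm0, mm0, 0."""
    return pshufw(y & 0xFF, 0)
```

Both paths yield the same `| Y| Y| Y| Y|` pattern. The unpack version tolerates garbage in the upper bytes of eax because only the lowest word survives the last two unpacks, whereas the pshufw version relies on the `xor eax, eax` to clear the high byte of the word it copies.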

The highest possible value for a in my case is somewhere between 2^15 and 2^16 (assuming max y is 235 IIRC)
I wouldn't assume max Y is 235
Okay, assuming a max Y of 255 leads to 16*16*255 = 65280 < 2^16 = 65536
that's almost fetching 8 pixels from the current pixel; unless your video is very noisy, I don't really see the benefit
I'm only using this filter for noisy cartoon dvds, i.e. the simpsons. I think for that purpose, the kernel can't be big enough - correct me if I'm wrong.
is this window size fixed?
Yes, otherwise I could not unroll the inner loop that efficiently. I could make it configurable, but honestly I don't think anyone other than me is going to use that filter.
16*16*235 = 60160, and pmaddwd takes signed words; any flat area with at least y=128 will overflow this, and you won't get what you expected
That's exactly what my first thought was, and I wondered why the results did not reveal horrible artefacts. But remember that I use pmaddwd only to merge two 16bit integers, and each of them is only 4*16*255 = 16320 (I store 4 different y in one mm), so I could even double the kernel size once more if I wanted to ;)
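The arithmetic in this reply can be written out as a quick check. The sketch assumes the kernel's pixels are spread evenly over the four word lanes of one register, as the post describes; the helper names are hypothetical.

```python
SIGNED_WORD_MAX = 0x7FFF  # largest value pmaddwd still reads as positive


def lane_sum_max(kernel_width, kernel_height, lanes=4, max_y=255):
    """Worst-case running sum in one word lane for a fully bright kernel."""
    return (kernel_width // lanes) * kernel_height * max_y


def fits_signed_word(value):
    """True if `value` can be fed to pmaddwd without turning negative."""
    return value <= SIGNED_WORD_MAX
```

For 16x16 each lane tops out at 4*16*255 = 16320, and doubling one dimension gives 32640, which still fits. The whole-kernel sum 16*16*255 = 65280 would not fit a signed word, which is why it only works because the final merge happens in 32 bits.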

BTW this is turning into a chat between Kurosu and me... :p

Kurosu
31st August 2003, 23:57
Originally posted by Dreassica
Niice, works like a charm as well.
A pity it doesn't work in a qmf environment, it makes it crash :(
Prolly a memory leak or something. That's what u get from testing unoptimised filters. :D
It _is_ optimized, not debugged :)
I'll look into that: it must be the same situation as with many other filters crashing with multiple instances.

But after looking, it doesn't seem to be so.
1) Could you make a new thread with the complete script (especially sshiq parameters)?
2) It might just be avisynth using too much memory because of the way qmf and the like work
3) really, deen should be sufficient for everybody :)

@FredBender
Well, pshufw originated from 3DNow! 1, ie (env->GetCPUFlags() & CPUF_3DNow) = TRUE :)
That's why I didn't write about it before. Many useful 3DNow! instructions regarding integers were added to IntegerSSE (I *think* that one is also available on your Duron - I had a Duron 700 one month ago :) ).

I'm only using this filter for noisy cartoon dvds, i.e. the simpsons. I think for that purpose, the kernel can't be big enough - correct me if I'm wrong.
This cartoon is a lot more flat than all the anime fans around here are used to - not even a single fade of luminance :)
But I still wonder if it's that useful to use a 16x16 kernel (odd that you choose even dimensions :D)
I would suggest to really have a look at my filter and see how it behaves on such a source.

Yes, otherwise I could not unroll the inner loop that efficiently. I could make it configurable, but honestly I don't think anyone other than me is going to use that filter.
People encoding The Simpsons will surely :)

That's exactly what my first thought was, and I wondered why the results did not reveal horrible artefacts. But remember that I use pmaddwd only to merge two 16bit integers, and each of them is only 4*16*255 = 16320
As with much of the code you provided, I'm not always fully aware of the situation in which you use it. I encountered that problem, and needed some time before discovering pmaddwd is signed :)

BTW this is turning into a chat between Kurosu and me... :p
Bah, that's the development forum, and many of the hardcore asm coders around here are no longer or not often here. In fact, all those who aren't commenting on some mmx code are off-topic :D

FredBender
1st September 2003, 00:54
(env->GetCPUFlags() & CPUF_3DNow) = TRUE
I think you meant '==' since (env...) surely is no variable and thus cannot be assigned a value via '=' ;)
Many useful 3DNow! instructions regarding integers were added to IntegerSSE
Yeah, like pminsw and pmaxsw which I'm already using for my filter to check if surrounding y lie within the threshold.
(I *think* that one is also available on your Duron - I had a Duron 700 one month ago)
Second. On the other hand it may also be pure coincidence that the unsupported operations result in the desired effect ;)
But I still wonder if it's that useful to use a 16x16 kernel
When I am done implementing u and v cleaning and realise that the filter is way too slow, I'm sure I'll adjust the kernel size to something less than 16x16 :)
(odd that you choose even dimensions)
Nice wordplay :) Yeah it's unsymmetrical, but easier to code and faster. For example, if I wanted a 15x15 kernel, I would have to do 3*4 + 2 + 1 pixels per scanline, with 16x16 I just do 4*4. Hm, of course I could set the last y to something beyond the tolerance... what are the official boundaries for y, u, v?
I encountered that problem, and needed some time before discovering pmaddwd is signed
I think it would have been nice if every mmx opcode would contain those 's' or 'us' like in the pack instructions. Something that gave me headaches during the coding was the fact that mul [ebx+4*ecx] is only an 8bit multiplication. I searched quite a long time until I found out I had to add this "dword ptr" thing.

BTW sorry none of your smileys made it into my reply, I'm a 56K user and always write my replies offline, copying and pasting from your post into my text editor.

Kurosu
1st September 2003, 15:28
Originally posted by FredBender
Yeah, like pminsw and pmaxsw which I'm already using for my filter to check if surrounding y lie within the threshold.
I should maybe try to implement such a thing myself. But I would think that going completely OO (now that I've a bit of experience in it) would make it possible to completely rewrite the loops, unrolling them and so much more. But speed is sufficient for now, I'm not in a contest :)

For example, if I wanted a 15x15 kernel, I would have to do 3*4 + 2 + 1 pixels per scanline, with 16x16 I just do 4*4.
As the lines are already fetched, I don't think doing 4 pixels at a time would be that much slower (not 8, because of the initial unpack needed). Of course, the pattern is more irregular, and the reads needed for 16x16 look unaligned and are harder to boost with prefetches.

Hm, of course I could set the last y to something beyond the tolerance... what are the official boundaries for y, u, v?
I think it depends: CCIR-601 for instance is limited to 16-235, and I don't think U/V has any limitation. And that limitation is I think no longer that important: PC scale is finer, but not compatible with TV scale.

I think it would have been nice if every mmx opcode would contain those 's' or 'us' like in the pack instructions.
I guess so, but the inventors certainly couldn't fit any more instructions without an unwanted price bump :)

Something that gave me headaches during the coding was the fact that mul [ebx+4*ecx] is only an 8bit multiplication. I searched quite a long time until I found out I had to add this "dword ptr" thing.
Well, I'm filling a 4-word array with the values of the looked up factor, then do a pmulhw: shift and multiply at the same time, at the expense of the bias towards 0; and 4 of those at a time - that gave me a 20% speed increase over resorting to the divide operation (or was it the equivalent of your mul?).

BTW sorry none of your smileys made it into my reply, I'm a 56K user and always write my replies offline, copying and pasting from your post into my text editor.
No biggie, and good luck. (Note to self: don't post 1MB+ bmp/png in that thread :)

sh0dan
1st September 2003, 16:52
A minor note - never assume that YUV is within valid CCIR 601 range - it might as well not be.

FredBender
1st September 2003, 17:05
I took a look at the mipsmooth source and have some questions about pairing. It seems the author (sh0dan IIRC?) placed instructions which are executed in the v-pipe one space to the right.

movq mm0,[esi+eax] // Load current frame pixels
pxor mm6,mm6

Why do these two instructions not pair? In the kernel loop they obviously do...

movq mm1,[edx+eax] // Load 8 pixels from test plane
movq mm2,mm0

..."since MMX instructions which access either memory or the integer register file can be issued in the U-pipe [only]".
But how is it possible the following two instructions pair:

pcmpeqb mm2,mm3 // Compare values to 0
prefetchnta [edx+eax+64] // it might just help - and we have an idle CPU here anyway ;)

BTW I would love to use the prefetch instructions the way shown here, but I am limited to one register as an argument since I am using VC6 and some macros provided by amd in "amd3dx.h". Is there a free tool available where I can "construct" MMX code in a GUI and it gives me some bytes I can insert in my asm code or something similar?

movq mm3,mm2
pxor mm2,[full] // mm2 inverse mask
pand mm5, mm3

Do modern processors have more than two MMX pipes? I honestly do not know. BTW the code is really nice to read, great work. Now on to your post, Kurosu:
Yeah, like pminsw and pmaxsw which I'm already using for my filter to check if surrounding y lie within the threshold.
I should maybe try myself to implement such a thing.
How do you check if the surrounding pixels are tolerated without the pminsw/pmaxsw instructions? psubusw and then checking for zero?
For example, if I wanted a 15x15 kernel, I would have to do 3*4 + 2 + 1 pixels per scanline, with 16x16 I just do 4*4.
As the lines are already fetched, I don't think doing 4 pixels at a time would be any slower
Maybe I did not get your point, but why would I do "4 pixels at a time" when there are only 3 left? That was my initial thought :)
Hm, of course I could set the last y to something beyond the tolerance... what are the official boundaries for y, u, v?
I think it depends: CCIR-601 for instance is limited to 16-235, and I don't think U/V has any limitation. And that limitation is I think no longer that important: PC scale is finer, but not compatible with TV scale.
I think I know how to do it now: I will just process 4 pixels, but before checking them against the tolerated boundaries I will set the 4th pixel's y to 0x7FFF (working in words, remember), so it will certainly not be tolerated. The question that arises is: is it worth the EXTRA TIME it takes to obtain a SMALLER kernel, just for the sake of symmetry? Still not sure about that...
I think it would have been nice if every mmx opcode would contain those 's' or 'us' like in the pack instructions.
I guess so, but the inventors certainly couldn't fit any more instructions without an unwanted price bump
Hehe :) No I did not mean every instruction should be available in all possible s/us, b/w/d combinations (though that would be nice), I just thought it would be cool if the implemented instructions were named the way they work even if there are no 's' or 'us' alternatives. For example if pmaddwd was called pmaddswsd or something similar.
Well, I'm filling a 4-word array with the values of the looked up factor, then do a pmulhw: shift and multiply at the same time, at the expense of the bias towards 0; and 4 of those at a time - that gave me a 20% speed increase over resorting to the divide operation (or was it the equivalent of your mul?).
Can you provide a commented code snippet? I am not 100% sure what you are talking about.

About that prefetch thing: what do the different versions like t0, t1 etc. mean and which should I use for my filter and WHEN? Should prefetches be aligned in any manner?

About loop unrolling: would unrolling the kernel y loop make the whole thing any faster? I mean the code would grow really large... maybe it would not fit in the cache anymore? Anyone?

sh0dan
1st September 2003, 17:29
Actually the Mipsmooth source is rather lazily indented, as most of it is copy/paste/slightly modified from temporalsoften. Most of it is experimental, so I didn't want to spend too much time cleaning up the code yet.

1) Yes it does pair. Probably some instruction was removed or something.

2) The third piece pairs, since U/V does not have to come in correct order. Athlon/P3/newer processors feature out-of-order execution that allows the processor to reorder instructions if they do not depend on each other - that's why it is correctly assigned.

3) Piece 4 - probably an instruction was removed. Only MMX multiply instructions allow for more instructions to be paired, as they have longer latency (much longer on P4). (btw, "pand mm5, mm3" depends on "movq mm3,mm2", so there is no way they can be paired).

Dreassica
2nd September 2003, 18:46
I'd be grateful if u "unbroke" radii higher than 4, cuz I can still use those with some clever tweaking of the settings, since I do use it for extremely noisy anime sources. I didn't see any "maintain diffweight"-ish parameter in the filter either. Any reason for that?
It's definitely faster for me than smoothhiq was, now a yv12 tempsmoother and I'm totally happy..

BTW it does work fine in qmf, i just jumped to conclusions too fast. :)

FredBender
5th September 2003, 20:34
Today I have finished a first working version of my filter that smoothes all three planes (Y, U, V). I would like to post it, but unfortunately I am not allowed to post attachments. Should/can I eMail the zip to some moderator?

MfA
6th September 2003, 17:38
What this filter seems to be doing is a form of G-neighbor (http://citeseer.nj.nec.com/boult95gneighbors.html) filtering.

What you can also try is a slightly more bayesian/fuzzy approach, using weighting instead of thresholding. This is what is known as bilateral filtering (http://vision.stanford.edu/public/publication/tomasi/tomasiIccv98.pdf). Given the stochastic nature of noise it seems a natural approach. Instead of using thresholds you weight each pixel in the window with a function of its distance to the center pixel (both spatially and photometrically ... usually the functions are gaussians, which you would need to implement with LUTs of course). For a fast photometric distance measure you could take the Euclidean distance in YUV space (this is very similar to your idea of requiring thresholds to be passed both in the Y and UV planes). Or rather the sum of squared differences, you can incorporate the square root in a LUT.
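A rough sketch of the weighting idea described in this post, restricted to a single luma plane for brevity (the YUV distance suggested in the post is not modelled). Both Gaussians are precalculated as lookup tables; the function name, the sigma parameters and the float arithmetic are assumptions for illustration, and samples are assumed to lie in 0..255.

```python
import math


def bilateral_smooth(plane, radius=2, sigma_space=2.0, sigma_range=10.0):
    """Weight every pixel in the window by spatial and photometric distance."""
    height, width = len(plane), len(plane[0])
    # LUT indexed by absolute sample difference (photometric distance).
    range_lut = [math.exp(-(d * d) / (2.0 * sigma_range ** 2)) for d in range(256)]
    # LUT indexed by window offset (spatial distance).
    space_lut = {
        (dy, dx): math.exp(-(dy * dy + dx * dx) / (2.0 * sigma_space ** 2))
        for dy in range(-radius, radius + 1)
        for dx in range(-radius, radius + 1)
    }
    out = [row[:] for row in plane]
    for y in range(height):
        for x in range(width):
            center = plane[y][x]
            acc = 0.0
            weight_sum = 0.0
            for (dy, dx), w_space in space_lut.items():
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    value = plane[ny][nx]
                    weight = w_space * range_lut[abs(value - center)]
                    acc += weight * value
                    weight_sum += weight
            # weight_sum >= 1 because the center pixel has weight 1.
            out[y][x] = int(round(acc / weight_sum))
    return out
```

Compared with a hard threshold, pixels across a strong edge still take part in the sum but with a weight so small that they barely move the result.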

Si
6th September 2003, 21:48
@FredBender
Just get some free web space from somewhere and post a link.

The ability to post attachments comes and goes here and has been gone for a long time now.

regards
Simon

FredBender
7th September 2003, 00:02
Originally posted by siwalters
@FredBender
Just get some free web space from somewhere and post a link.

<zoidberg>Of course, why didn't I think of that!</zoidberg>

Just go to my homepage for useless stuff (http://www.geocities.com/ackehurst/) and download the filter.