Smooth in time and space? - Page 11

vlad59 · 12th August 2002, 10:36

Quote:

Originally posted by bb
@vlad59:
I would like to test the Convolution3D with standard and full-1 matrix at the same time; this way it's easier to compare.
BTW: What do you say about the possibility of optimizing the algorithm for a full-1 matrix (in fact you wouldn't need a matrix at all in this special case)?

bb

I don't fully understand. What I'll do is provide a new Convolution which has a new parameter to choose between the two matrices.
I hope that's what you want

Optimizing for the full 1 matrix is not easy, let's take an example :

Code:

Thresholded matrices are:
10 11 11     13 15 5     10 11 11     
10 11 11     13 15 5     10 11 11     
10 11 11     13 15 5     10 11 11     

To find my new convolved value I have to add all the matrix values:
Sum = 10 + 11 + 11 + 13 + 15 + 5 + .........

And then divide by the number of matrix values, i.e. 9+9+9 = 27

new values = sum / 27

That division is the problem :
 - I have no MMX opcode to perform a division (or I missed something)
 - and a simple division will be way too slow.

That's why I first tried having a weight matrix like that : 
1 1 1    1 1 1    1 1 1  
1 2 1    2 2 2    1 2 1
1 1 1    1 1 1    1 1 1

With this the weights add up to 32, so I can use a right shift (by 5) to divide by 32
But with this the compressibility test is worse than with the standard matrix
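
To make the trade-off concrete, here is a minimal C sketch of both approaches on the example values above. The function names, and reading the three 3x3 blocks as previous, current and next frame, are assumptions made for illustration; this is not Convolution3D's actual code.

```c
#include <stdint.h>

/* The example neighbourhood, read as three 3x3 blocks (previous, current,
   next frame). That reading of the layout is an assumption. */
static const uint8_t PIX[3][3][3] = {
    {{10, 11, 11}, {10, 11, 11}, {10, 11, 11}},
    {{13, 15,  5}, {13, 15,  5}, {13, 15,  5}},
    {{10, 11, 11}, {10, 11, 11}, {10, 11, 11}},
};

/* The weight matrix from the post; its 27 weights add up to 32. */
static const uint8_t WEIGHT[3][3][3] = {
    {{1, 1, 1}, {1, 2, 1}, {1, 1, 1}},
    {{1, 1, 1}, {2, 2, 2}, {1, 1, 1}},
    {{1, 1, 1}, {1, 2, 1}, {1, 1, 1}},
};

/* Full-1 matrix: a plain mean, which needs a real division by 27. */
unsigned box_average(void)
{
    unsigned sum = 0;
    for (int f = 0; f < 3; ++f)
        for (int y = 0; y < 3; ++y)
            for (int x = 0; x < 3; ++x)
                sum += PIX[f][y][x];
    return sum / 27;
}

/* Weighted matrix: the divisor is 32, so a right shift by 5 is enough. */
unsigned weighted_average(void)
{
    unsigned sum = 0;
    for (int f = 0; f < 3; ++f)
        for (int y = 0; y < 3; ++y)
            for (int x = 0; x < 3; ++x)
                sum += PIX[f][y][x] * WEIGHT[f][y][x];
    return sum >> 5;
}
```

On these values the plain sum is 291 and the weighted sum is 346; both functions truncate, so neither rounds to nearest.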

So if anybody has an idea ....

I have no time to think about it this weekend, but Thursday is a holiday in France, so you can expect a release in 2 or 3 days max.

dividee · 12th August 2002, 11:22

If you don't have division... use multiplication.
Basic Idea:
If you accumulate the sums in words in an MMX register, load an MMX register with four times 65536/27. The usual way in C code would be to multiply by that value and then >> 16. In MMX code, you can use PMULHW (or PMULHUW in ISSE) so you don't even have to do the shift.
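
As a rough scalar illustration of that idea, the sketch below mimics one 16-bit lane of an unsigned high multiply (what PMULHUW does per word). The helper names are made up for this example, and the sum is assumed to be at most 27 * 255.

```c
#include <stdint.h>

/* Scalar stand-in for one 16-bit lane of PMULHUW: the high 16 bits of an
   unsigned 16x16 product. */
static uint16_t mulhi_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) >> 16);
}

/* Truncating "divide by 27": multiply by 65536/27 and keep the high word. */
static uint16_t div27_trunc(uint16_t sum)
{
    const uint16_t recip27 = 65536 / 27;   /* integer division gives 2427 */
    return mulhi_u16(sum, recip27);
}
```

Because 65536/27 is truncated to 2427, this version lands slightly low: an input of exactly 27 yields 0, not 1. That is one reason the rounding step matters.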

Take care of rounding (if you don't want Acaila to complain about a green tint):
in C code you could do:
(sum * 65536/27 + 32768) >> 16
but that doesn't work with PMULHW (3DNow has a nice PMULHRW, though).
The technique I used in MMX code can be described in C code as:
((sum<<1) + sum) * 32768/27) >> 16

It becomes more difficult to do the "divisions" in parallel if the divisor can vary for each word.
In TemporalSoften(2), I had independent divisors for each word that can vary from 1 to 16, so I constructed a big lookup table with 16^4 QWORD entries. I packed four divisors into a WORD to use as an index into the table and so was able to do 4 "divisions" in parallel.
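
The lookup-table idea could look roughly like the C sketch below. The nibble layout of the index, the stored constant (32768/d per lane) and all function names are assumptions for illustration; the post does not say what TemporalSoften actually stores.

```c
#include <stdint.h>
#include <stdlib.h>

/* Pack four divisors (each 1..16) into a 16-bit index, 4 bits per lane.
   Storing divisor - 1 in each nibble is an assumed layout. */
uint16_t pack_divisors(const unsigned div[4])
{
    unsigned idx = 0;
    for (int lane = 0; lane < 4; ++lane)
        idx |= ((div[lane] - 1u) & 0xFu) << (4 * lane);
    return (uint16_t)idx;
}

/* Build 16^4 = 65536 entries of 64 bits each (512 KiB). Every entry holds
   four 16-bit reciprocals, one per lane. The caller frees the table. */
uint64_t *build_reciprocal_table(void)
{
    uint64_t *table = malloc(65536u * sizeof *table);
    if (table == NULL)
        return NULL;
    for (uint32_t idx = 0; idx < 65536u; ++idx) {
        uint64_t entry = 0;
        for (int lane = 0; lane < 4; ++lane) {
            uint32_t d = ((idx >> (4 * lane)) & 0xFu) + 1u;
            entry |= (uint64_t)(32768u / d) << (16 * lane);
        }
        table[idx] = entry;
    }
    return table;
}

/* Pull the 16-bit reciprocal for one lane back out of an entry. */
uint16_t lane_reciprocal(uint64_t entry, int lane)
{
    return (uint16_t)(entry >> (16 * lane));
}
```

In MMX code one table entry would be loaded straight into a register and used as the multiplier for four packed sums at once; the scalar accessor here only exists to make the layout visible.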

vlad59 · 12th August 2002, 13:10

@dividee

That's what I call a clear explanation, thanks a lot.

I'll have a look at the TemporalSoften code to be sure I have understood.

Thanks again, I'm now obliged to optimize Convolution3D better

bb · 12th August 2002, 14:01

@vlad59:
Just another idea: what about using a 2x2x2 or a 4x4x2 matrix?

bb

vlad59 · 12th August 2002, 15:30

@bb

In theory it's possible, but with those matrices you don't have a center pixel, which I find annoying.
I thought about a 5x5x5 matrix, but it will be slower, and I first want a stable Convolution3D before changing the basis too much.
But if you explain better what the benefit would be, I can change my mind. You usually have great ideas.

I also thought of not using the previous and next frames when there is too much difference between the previous, current and next pixels (only luma will be checked). It should help a lot in fade-in or fade-out scenes.
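
One possible reading of that luma check, as a small C sketch: if a temporal neighbour differs too much from the current pixel, fall back to the current pixel's value so the neighbour has no influence. The function name, the threshold parameter and the substitution rule are assumptions, not the filter's actual logic.

```c
#include <stdint.h>
#include <stdlib.h>

/* Return the luma value to feed into the convolution for a temporal
   neighbour: the neighbour itself if it is close enough to the current
   pixel, otherwise the current pixel (assumed fallback). */
uint8_t temporal_sample(uint8_t cur, uint8_t neighbor, int threshold)
{
    int diff = abs((int)neighbor - (int)cur);
    return (diff > threshold) ? cur : neighbor;
}
```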

trbarry · 12th August 2002, 16:46

Vlad -

As a small point on performance issues you can usually expand the matrix horizontally faster than vertically. That's because the data will be coming from the same cache lines. So something like a 3x5 will probably be faster than a 4x4 by more than you might expect.

- Tom

bb · 12th August 2002, 19:25

@trbarry:
You are probably the master of filter optimisation. What do you think about my proposal of interleaving the frames before filtering, then de-interleaving again? You would store the 1st line of the 1st frame, then the 1st line of the 2nd frame, then the 1st line of the 3rd frame, then the 2nd line of the 1st frame, etc., like
1111111111111111 (first line)
2222222222222222 (first line)
3333333333333333 (first line)
1111111111111111 (second line)
2222222222222222 (second line)
3333333333333333 (second line)
...

This way the pixels you have to touch during filtering would be close together in memory, which - I think - would improve cache hits.
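
As a quick sketch of the addressing this layout implies (the function name and the `pitch` parameter, meaning bytes per stored line, are assumptions):

```c
#include <stddef.h>

/* Byte offset of a given line of a given frame in a buffer where the
   frames are interleaved line by line:
   line 0 of frame 0, line 0 of frame 1, ..., line 1 of frame 0, ... */
size_t interleaved_line_offset(size_t frame, size_t line,
                               size_t num_frames, size_t pitch)
{
    return (line * num_frames + frame) * pitch;
}
```

With three frames and a 16-byte pitch, the same line of consecutive frames sits 16 bytes apart, while consecutive lines of one frame sit 48 bytes apart.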

Non-cubic matrices will have an impact on the picture, as far as I can see. You could get the effect that it smoothes more horizontally than vertically. I also thought about dropping the edge pixels, because they have the biggest distance to the center pixel. The perfect 3D matrix would probably be a sphere, not a cube... What do you think?

@vlad59:
You're right, not having a center pixel is annoying, I thought of that problem. I had the idea of something like a running total; the value of a pixel would be updated more than once while the algorithm is "passing by". I still have to think that over, but I can post my idea if you find it useful.

There's a VirtualDub filter using a 5x5x2 matrix, but I was a little disappointed with that one...

bb

trbarry · 12th August 2002, 20:13

bb -

Dunno, you'd have to try it. But in order to do this you would have to copy the data 2 extra times and I'm not sure whether that would make up for the (hopefully) better arrangement. And the gain would be processor dependent since the different machines have different cache sizes and stuff.

- Tom

dividee · 12th August 2002, 21:33

Why 2 extra times ? Or do you count read & write as 2 times ?
TemporalSoften (and TemporalSmoother) use an idea similar to the one bb described (I didn't invent it; that's the technique Avery used in Temporal Smoother for VirtualDub), except that these filters interleave the clips on a per-pixel basis instead of per line.
So you have, with bb's notation (center = frame 2):
123123123...

Since these filters only operate on the temporal axis, this makes the inner loop work totally linearly. As long as the clip is read in sequential order, all you have to do for the next frame is replace the oldest pixels with the newest ones:
423423423 (for frame 3)
and then
453453453 (for frame 4)
The pixel replacement is also done in the inner loop.

The arithmetic of using such a circular buffer is sometimes a bit complicated (especially when you try to add full-scene change detection to the mix), but I think it's worth it.
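
A stripped-down C sketch of such a per-pixel interleaved circular buffer follows. It uses a plain three-frame mean with no thresholds or scene-change handling, and the names (`temporal_step`, `ring`, `oldest_slot`) are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

enum { DEPTH = 3 };  /* frames kept per pixel: previous, current, next */

/* ring holds DEPTH samples per pixel, interleaved: ring[p * DEPTH + slot].
   One call overwrites the oldest slot with the incoming frame and, in the
   same pass, writes the rounded temporal mean of the samples to out. */
void temporal_step(uint8_t *ring, const uint8_t *incoming, uint8_t *out,
                   size_t npixels, unsigned oldest_slot)
{
    for (size_t p = 0; p < npixels; ++p) {
        uint8_t *samples = ring + p * DEPTH;
        samples[oldest_slot] = incoming[p];      /* newest replaces oldest */
        unsigned sum = 0;
        for (unsigned k = 0; k < DEPTH; ++k)
            sum += samples[k];
        out[p] = (uint8_t)((sum + DEPTH / 2) / DEPTH);
    }
}
```

The caller would advance `oldest_slot` as 0, 1, 2, 0, ... from one frame to the next, which reproduces the 423 and 453 patterns shown above.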

trbarry · 12th August 2002, 23:39

Maybe I didn't think it through well enough. I was thinking that if you wanted to work with the data in a different format that you would also have to copy/reformat the results back at some point, but that's not necessarily true I guess.

But again, I guess you would have to try it and see if what you gain makes up for the up-front overhead.

When I wrote Greedy/HM (DScaler GreedyHMA version) I reformatted the data into what I thought was a clever arrangement that got around some then-current DScaler limitations, but in hindsight I'm not sure it was worth it. And I'm almost certain that if I went to the work of pulling out that stuff for the Avisynth version it would go faster. But maybe I just did something silly when I designed the data structure I used.

I guess the way I think about it now is that it is not as important to keep everything close together in memory as it is to minimize the number of cache lines that have to be repeatedly filled on each pass through the filter. So if you hit say, 6 or 8 cache lines (of 32, 64, or 128 bytes each), but then move 8 bytes to the right and it is still the same 6 or 8 cache lines, then you sort of get it for free. But when you move to the next line of the screen then it is probably going to be a different cache line or at least a different (slower) level of memory cache.
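
A small back-of-the-envelope helper can make that point countable. It is only a model: it assumes every row starts on a cache-line boundary, and the function and parameter names are invented for this sketch.

```c
#include <stddef.h>

/* How many cache lines are newly entered when a window `width` bytes wide
   and `rows` rows high moves `step` bytes to the right from byte x.
   Assumes each row starts on a cache-line boundary and width >= 1. */
size_t new_lines_after_step(size_t x, size_t step, size_t width,
                            size_t rows, size_t line_size)
{
    size_t last_before = (x + width - 1) / line_size;
    size_t last_after = (x + step + width - 1) / line_size;
    return rows * (last_after - last_before);
}
```

With 64-byte lines, an 8-byte-wide window over 6 rows that moves from byte 0 to byte 8 enters no new lines, while the move from byte 56 to byte 64 enters 6 at once.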

There, I have thoroughly confused myself. What did I just say?

- Tom

vlad59 · 13th August 2002, 19:50

@dividee
you made a typo when you explained to me how to divide:
instead of :
((sum<<1) + sum) * 32768/27) >> 16
you should have written :
((sum<<1) + 1) * 32768/27) >> 16

I just coded it, it works without any problem, thanks again.

@all
Tomorrow you'll have the new matrix (full 1).

dividee · 13th August 2002, 20:39

Indeed I made a typo, but your correction is also wrong; it should be
(((sum<<1)+27)*(32768/27))>>16 (or: ((sum+(27/2))*(65536/27))>>16 )
which is equivalent to the previously given
(sum * 65536/27 + 32768) >> 16
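
For reference, here are the two rounding forms written out as integer C, under the assumption that the sum is at most 27 * 255 = 6885. The function names are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Rounding form for plain C: add half (32768) after the multiply. */
unsigned div27_round_post(unsigned sum)
{
    assert(sum <= 27u * 255u);
    return (sum * (65536u / 27u) + 32768u) >> 16;
}

/* Pre-biased form: the +27 is added before the multiply, so a truncating
   high multiply (as PMULHW does per word) still rounds. */
unsigned div27_round_pre(unsigned sum)
{
    assert(sum <= 27u * 255u);
    return (((sum << 1) + 27u) * (32768u / 27u)) >> 16;
}
```

In the pre-biased form, (sum<<1)+27 never exceeds 13797, so it stays within a signed 16-bit word. The forms are equal as real-number formulas, but in integer code the constants are truncated (2427 and 1213), so a result can land one below exact rounding for some sums; for example the first form gives 174 for a sum of 4712, where 4712/27 is about 174.52.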

Koepi · 13th August 2002, 20:45

walk like an egyptian...

vlad59 · 13th August 2002, 21:22

@dividee

Of course you're right dividee, I really regret all the time I spent sleeping instead of listening to my Maths teacher.
Sorry to be such a pain in ....

@Koepi
I love the way you walk

@all

Thanks to dividee, you can now download the new version of Convolution3D (beta 1) with sources (not commented at all; I'll do that within 2 days).

You now have the full 1 weight matrix. The compressibility is better with this matrix.

No speed improvement for now; I'll need a day or 2 to understand the discussion between Tom and dividee

EDIT : Removed old attachment

Koepi · 13th August 2002, 21:32

vlad:

wanna walk that way as well?

then we could change to RunDMC feat. airosmith(?) - walk this way... and nod our heads like will smith.

Maybe that helps with understanding all those formulas.

I'm glad that I finally found out that there's a simple shift operator in c/++ - guess what I had to code around that in the xcdbackupcreator :-/

Dancing,

Koepi

NP: Velvet Acid Christ - Alien Surfaces

baz00ie · 14th August 2002, 02:10

Thanks for your work, Vlad59.
I've noticed an increase in quality and speed with the latest release.

take care
baz

vlad59 · 14th August 2002, 09:47

@baz00ie

First, thanks for testing and using my filter.
An increase in speed?? I did nothing yesterday to speed up Convolution3D. I'll run benchmarks tonight.

vlad59 · 17th August 2002, 22:18

Here is the beta 2 of Convolution3D.

The main change is the new temporal thresholds (thanks Tom for the idea), which should give better compressibility and less ghosting.

This version shouldn't be faster than beta 1.

I changed some parts of the internal engine, so if you see any difference, post here.

thanks in advance for testing.

I made some tests :
On my torture test (an old noisy anime badly mastered) :
Convolution3d (1, 12, 20, 8, 8, 0)
and
TemporalSmoother (4)
have roughly the same compressibility (51.8 for C3D and 50.8 for TS),
but TemporalSmoother produces some ghosting and handles fade-in scenes very badly (C3D was also bad but somewhat better than TS; Tom's STMedianFilter was the best for this scene).
On still scenes TS is way better but loses some details.

I'll make new tests with MAM tomorrow.

trbarry · 18th August 2002, 01:07

Vlad -

I just took a peek at your code. Looks like you've done a pretty good job optimizing in MMX.

I even learned something new from it. I hadn't realized the pinsrw and pextrw instructions could use general purpose register operands. Cool.

- Tom

vlad59 · 18th August 2002, 07:49

Quote:

Originally posted by trbarry
Vlad -

I just took a peek at your code. Looks like you've done a pretty good job optimizing in MMX.

I even learned something new from it. I hadn't realized the pinsrw and pextrw instructions could use general purpose register operands. Cool.

- Tom

Thanks Tom, I read a lot of code (mainly from you and dividee) to understand asm better and learn new tricks. I think I learned something. But I'm still not satisfied with my code; there is still a lot to do (especially adding more comments).

Yep, I used pinsrw and pextrw to compute each luma value individually. It costs a lot of time, and I still don't know if it is really useful.
