I figured out what's wrong when radius is 4 or 5. It calculates a sum which has to fit in 14 bits, so it can't be more than 16383. With radius 4 the window is 9 * 9 pixels, so the sum can reach 9 * 9 * 255 = 20655, which is too big to fit.
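To make the arithmetic above concrete, here is a small Python sketch that tabulates the worst-case sum per radius against the 14-bit limit. It assumes a square window of (2 * radius + 1) pixels per side and 8-bit pixels, which matches the 9 * 9 * 255 figure for radius 4. The function name is made up for illustration and this is not the plugin's code.

```python
# Largest value that fits in 14 bits.
LIMIT_14_BIT = (1 << 14) - 1  # 16383


def max_window_sum(radius, max_pixel=255):
    """Worst-case sum over a square window (hypothetical helper).

    Assumes the window is (2 * radius + 1) pixels on each side and
    every pixel has the maximum 8-bit value.
    """
    side = 2 * radius + 1
    return side * side * max_pixel


if __name__ == "__main__":
    for radius in range(1, 8):
        total = max_window_sum(radius)
        verdict = "fits" if total <= LIMIT_14_BIT else "too big"
        print(f"radius {radius}: {total:5d} ({verdict})")
```

Under these assumptions radius 3 is the last one whose worst case stays under 16383.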
This is easily fixed by using the pmulhuw instruction (unsigned multiply, introduced in SSE) instead of the pmulhw instruction (signed multiply, introduced in MMX).
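For anyone unfamiliar with the two instructions, here is a rough Python emulation of what each one does to a single 16-bit lane. Both keep the high 16 bits of the 32-bit product; the only difference is whether the operands are read as signed or unsigned. The helper names and the sample values are made up for illustration and are not taken from the plugin.

```python
def to_signed16(x):
    """Reinterpret the low 16 bits of x as a signed 16-bit integer."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pmulhw_lane(a, b):
    """One lane of pmulhw: signed multiply, keep the high 16 bits.

    The result is returned as a raw 16-bit pattern (0..65535).
    """
    product = to_signed16(a) * to_signed16(b)
    return (product >> 16) & 0xFFFF


def pmulhuw_lane(a, b):
    """One lane of pmulhuw: unsigned multiply, keep the high 16 bits."""
    product = (a & 0xFFFF) * (b & 0xFFFF)
    return (product >> 16) & 0xFFFF


if __name__ == "__main__":
    # Small operands: both instructions agree.
    print(pmulhw_lane(1000, 1000), pmulhuw_lane(1000, 1000))
    # An operand with the top bit set: pmulhw reads 40000 as negative.
    print(pmulhw_lane(40000, 2), pmulhuw_lane(40000, 2))
```

As long as both operands are below 32768 the two give the same answer. Once one of them has the top bit set, pmulhw reads it as a negative number and the result is garbage for unsigned data.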
https://github.com/dubhater/vapoursy...eleases/tag/v2
* Fix bad output with radius 4 and 5 (especially 5).
* Allow radius 6 and 7.
* Better precision in the calculations.