Since you asked in the other thread: your code is basically written in the simplest way possible, which is horrible for CPUs to actually execute. All those small loads for every pixel quickly add up. You're also creating and destroying a vector for every pixel, and that adds up too.
For example:
https://github.com/IFeelBloated/mins...Kernel.hpp#L92
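You should use a fixed-size 9-element array here (a plain array or std::array). std::vector is just slow for a small fixed-size allocation, because every vector you construct has to allocate its storage on the heap and free it again when it's destroyed.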
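To make the array suggestion concrete, here's a minimal sketch. It is not your kernel: median3x3 and its parameters are names I made up, and I'm assuming an 8-bit plane and a 3x3 median just to have something to compute. The point is only that the window lives on the stack, so nothing gets allocated or freed per pixel.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the linked code): median of the 3x3
// neighbourhood around (x, y) in an 8-bit plane.
// `src` points at the top-left pixel, `stride` is the row pitch in elements.
// The caller must keep (x, y) at least one pixel away from every border.
inline std::uint8_t median3x3(const std::uint8_t* src, std::ptrdiff_t stride,
                              std::ptrdiff_t x, std::ptrdiff_t y) {
    assert(x >= 1 && y >= 1);

    // Fixed-size window on the stack: no heap allocation, no free.
    std::array<std::uint8_t, 9> window;
    std::size_t i = 0;
    for (std::ptrdiff_t dy = -1; dy <= 1; ++dy) {
        const std::uint8_t* row = src + (y + dy) * stride;
        for (std::ptrdiff_t dx = -1; dx <= 1; ++dx)
            window[i++] = row[x + dx];
    }

    // Partially sort so that the middle element (index 4) is the median.
    std::nth_element(window.begin(), window.begin() + 4, window.end());
    return window[4];
}
```

Swapping std::vector for std::array like this shouldn't change any results, only where the nine values are stored.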
The rest mostly comes down to the fact that you're doing many small loads, which is costly. See the recreated removegrain code for examples of how to do things in a speedier way:
https://github.com/vapoursynth/vapou...rainvs.cpp#L39
That code is a bit convoluted with macros and stuff, but hopefully you'll get the idea.
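If the macros get in the way, here's a rough scalar sketch of the general shape. It is not the removegrain code: min3x3 and its parameters are hypothetical, and I picked a 3x3 minimum on an 8-bit plane only as a stand-in operation. The row pointers are set up once per row and the inner loop is a flat walk over neighbouring elements, which gives the compiler a chance to vectorize it. Hand-written SIMD goes further by loading many pixels at once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical example (not from the linked code): 3x3 minimum over the
// interior of an 8-bit plane. Border pixels are left untouched for brevity.
// `stride` is the row pitch in elements; `src` and `dst` must not overlap.
inline void min3x3(const std::uint8_t* src, std::uint8_t* dst,
                   std::ptrdiff_t width, std::ptrdiff_t height,
                   std::ptrdiff_t stride) {
    assert(width >= 3 && height >= 3 && stride >= width);

    for (std::ptrdiff_t y = 1; y < height - 1; ++y) {
        // Row pointers are computed once per row, not once per pixel.
        const std::uint8_t* above = src + (y - 1) * stride;
        const std::uint8_t* cur   = src + y * stride;
        const std::uint8_t* below = src + (y + 1) * stride;
        std::uint8_t* out = dst + y * stride;

        // Flat inner loop over contiguous memory, no temporaries allocated.
        for (std::ptrdiff_t x = 1; x < width - 1; ++x) {
            std::uint8_t m = std::min(above[x - 1], std::min(above[x], above[x + 1]));
            m = std::min(m, std::min(cur[x - 1], std::min(cur[x], cur[x + 1])));
            m = std::min(m, std::min(below[x - 1], std::min(below[x], below[x + 1])));
            out[x] = m;
        }
    }
}
```

Whether the compiler actually vectorizes this depends on the compiler and flags, so measure before and after.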