Some time ago I wrote a C++ implementation of the Hable implementation in tmap. I don't remember exactly how much faster it is, but IIRC about 2x.
It only works for float input with mod8 resolution video on an AVX CPU. I didn't bother writing an SSE implementation because of the large number of FMAD instructions used.
The static buffer optimization only works on Windows right now.
If I see any interest in this I'll try to add some functionality like the debug views, non-mod8 support, etc.
Here's the repo:
https://github.com/MonoS/tmap-vapoursynth/tree/V1