Released r3.
Added SIMD operations. The provided binary package contains two DLLs, one built for SSE2 and one for AVX2. In rough tests on my E3-1231 v3, the SSE2 version is about 10% to 20-odd percent faster than the C version, while the AVX2 version is only about 5% faster than the SSE2 version.