Old 31st May 2015, 21:06   #6  |  Link
MonoS
Registered User
 
Join Date: Aug 2012
Posts: 203
Ooook, I think I've properly converted all the functions inside croutines.

I haven't been able to test it, because the DLL produced by my copy of Code::Blocks is not recognized by vs.

Notable changes:
- All intermediate computation is done in float instead of double: less work for the CPU and less work for me
- Added a transposed LUT, so that during the DCT all 8 values can be loaded with a single instruction
- Reworked all the functions [except fillLUT] to use AVX instructions [for example, IACA says fillFactors needed 134 cycles to execute fully unrolled and now needs only 16; the row loop is now only 384 cycles fully unrolled, and I can't even imagine how many cycles it required before].
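
To make the transposed LUT point concrete, here is a minimal scalar sketch of the idea, not the code from the attached zip. It assumes an 8x8 row-major table whose values are needed column by column; the names `Lut`, `transposeLut`, `columnStrided` and `columnContiguous` are made up for illustration. Once the table is transposed, the 8 values sit next to each other in memory, which is the layout a single 8-float AVX load such as `_mm256_loadu_ps` needs.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 8;
using Lut = std::array<float, N * N>;  // hypothetical row-major 8x8 table

// Build the transposed copy: element (r, c) of src becomes (c, r) of dst.
Lut transposeLut(const Lut& src) {
    Lut dst{};
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t c = 0; c < N; ++c)
            dst[c * N + r] = src[r * N + c];
    return dst;
}

// Column c of the original table: 8 reads, each N floats apart.
std::array<float, N> columnStrided(const Lut& lut, std::size_t c) {
    std::array<float, N> out{};
    for (std::size_t r = 0; r < N; ++r)
        out[r] = lut[r * N + c];
    return out;
}

// The same 8 values read as one contiguous run from the transposed table.
// With AVX this loop is the part that one 8-float load can replace.
std::array<float, N> columnContiguous(const Lut& lutT, std::size_t c) {
    std::array<float, N> out{};
    for (std::size_t i = 0; i < N; ++i)
        out[i] = lutT[c * N + i];
    return out;
}
```

The transposed copy costs an extra 256 bytes for an 8x8 float table and is built once, so the trade is a little memory for replacing 8 strided reads with one contiguous run on every pass.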

Probably there are other places to optimize [clamping perhaps??], but I didn't dig too deep into the code.

Hope someone can test this and/or let me know if I made any mistakes; I repeat, it's my first time doing SIMD optimization.

EDIT: As I expected, there is room for further optimization [I haven't used a profiler yet, but I will, BTW]: the old algorithm, from before the row/column split, may be faster now with SIMD and the transposed LUT.
If I have some more spare time I'll try to implement it, and if I manage to compile it I'll run some tests.
Attached Files
File Type: zip dctfilter avx.zip (4.1 KB, 172 views)

Last edited by MonoS; 31st May 2015 at 23:34.