How to optimize the hell out of linear system solving for small matrices (10x10)? This would be used in an AR engine for a few games, but has to be done very fast.
This solver is to be executed in excess of 1 000 000 times in microseconds on an Intel CPU. I am talking about the extreme level of optimization used in graphics for computer games. I don't mind coding it in assembly and making it architecture-specific, or trading away precision or reliability and using floating-point hacks (like many games, I use the -ffast-math compile flag, no problem). The solve can even fail about 20% of the time!
Eigen's partialPivLu is the fastest in my current benchmark, outperforming LAPACK when compiled with -O3 and a good compiler. But now I am at the point of handcrafting a custom linear solver. Any advice would be greatly appreciated. I will open-source my final solution and share it here.
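For reference, here is the kind of baseline I'm starting from: a dependency-free partial-pivot LU solve with the dimension fixed at compile time, so the compiler can fully unroll and vectorize the inner loops. This is just a sketch of the standard algorithm, not my final optimized version; the function name `lu_solve10` and the singular-matrix bailout are my own choices.

```c
#include <math.h>

#define N 10

/* Sketch: in-place partial-pivot LU factorization with simultaneous
   forward elimination of b, then back substitution into x.
   Returns 0 on a (numerically) singular pivot, which is acceptable
   given the ~20% failure budget; returns 1 on success.
   A and b are destroyed. N is a compile-time constant so the
   compiler can unroll everything. */
static int lu_solve10(float A[N][N], float b[N], float x[N]) {
    for (int k = 0; k < N; k++) {
        /* Find the row with the largest pivot in column k. */
        int p = k;
        float maxv = fabsf(A[k][k]);
        for (int i = k + 1; i < N; i++) {
            float v = fabsf(A[i][k]);
            if (v > maxv) { maxv = v; p = i; }
        }
        if (maxv == 0.0f) return 0;  /* singular: let the caller retry/fail */

        /* Swap rows k and p of A, and the matching entries of b. */
        if (p != k) {
            for (int j = 0; j < N; j++) {
                float t = A[k][j]; A[k][j] = A[p][j]; A[p][j] = t;
            }
            float t = b[k]; b[k] = b[p]; b[p] = t;
        }

        /* One divide per column; the elimination uses only multiplies. */
        float inv = 1.0f / A[k][k];
        for (int i = k + 1; i < N; i++) {
            float m = A[i][k] * inv;
            for (int j = k + 1; j < N; j++)
                A[i][j] -= m * A[k][j];
            b[i] -= m * b[k];
        }
    }

    /* Back substitution. */
    for (int i = N - 1; i >= 0; i--) {
        float s = b[i];
        for (int j = i + 1; j < N; j++)
            s -= A[i][j] * x[j];
        x[i] = s / A[i][i];
    }
    return 1;
}
```

With `-O3 -ffast-math` and a fixed N, compilers tend to unroll these loops aggressively, which is roughly what Eigen's fixed-size `Matrix<float,10,10>` path achieves; the question is whether hand-tuned SIMD and a relaxed pivoting strategy can beat it.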