* Fix the matmul layout for accelerate and mkl.
* Reduce the required precision for pow (because of accelerate).
* Fix the gelu f16 test.
* Improve the handling of matmul with squeezed layouts.
* Fix for the cuda backend.
* Revert the temporary fix.