* Improve the handling of matmul with squeezed layouts.
* Fix for the CUDA backend.
* Revert the temporary fix.