* Softmax numerical stability.
* Fix the flash-attn test.
* Add some flash-attn tests.
* Add the cpu test.
* Fail when the head is not a multiple of 8.
* Polish the flash attention test.
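
For context on the softmax numerical stability item, the usual approach is to subtract the row maximum before exponentiating, so that large logits cannot overflow. The following is a minimal sketch of that general technique in plain Python, not the code from this change; the function name `stable_softmax` is hypothetical.

```python
import math


def stable_softmax(xs):
    """Softmax over a list of floats using the max-subtraction trick.

    Illustrative only: `stable_softmax` is a hypothetical name and this
    is not the implementation touched by the commits listed above.
    """
    # Shifting by the maximum keeps every exponent <= 0, so exp() cannot
    # overflow. The shift cancels out in the ratio, so the result is unchanged.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# A naive math.exp(1000.0) would raise OverflowError; the shifted form does not.
probs = stable_softmax([1000.0, 1001.0, 1002.0])
```

A CPU reference like this is also one plausible way to check a fused attention kernel: compute the same rows both ways and compare within a tolerance.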