* Add the AdamW optimizer.
* Add AdamW tests validated against PyTorch.
* Improve softmax numerical stability.
* Fix the flash-attn test.