* Make it easier to use samples from the repo.
* Use f32 for accumulation in the f16/bf16 kernels.
* Boilerplate code for conv1d.
* More boilerplate for conv1d.
* Conv1d work.
* Get the conv1d CUDA kernel to work.
* Conv1d support when no batch dim.
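
For reference, a minimal CPU sketch of what a conv1d with an optional batch dimension computes. This is not the code from this change: the function names `conv1d_single` and `conv1d_batched`, the row-major layouts, and the `padding`/`stride` parameters are all assumptions made for illustration. The unbatched case is treated as the basic operation and the batched case loops over it.

```rust
/// Naive 1D convolution (cross-correlation) for a single, unbatched input.
/// Hypothetical helper, not the repo's API.
/// `input` is [c_in, l_in] and `kernel` is [c_out, c_in, k], both row-major.
/// Returns [c_out, l_out] with l_out = (l_in + 2 * padding - k) / stride + 1.
pub fn conv1d_single(
    input: &[f32],
    kernel: &[f32],
    c_in: usize,
    l_in: usize,
    c_out: usize,
    k: usize,
    padding: usize,
    stride: usize,
) -> Vec<f32> {
    assert_eq!(input.len(), c_in * l_in);
    assert_eq!(kernel.len(), c_out * c_in * k);
    assert!(stride > 0 && k > 0 && l_in + 2 * padding >= k);
    let l_out = (l_in + 2 * padding - k) / stride + 1;
    let mut out = vec![0f32; c_out * l_out];
    for co in 0..c_out {
        for lo in 0..l_out {
            // The accumulator is a plain f32 in this sketch.
            let mut acc = 0f32;
            for ci in 0..c_in {
                for ki in 0..k {
                    // Position in the zero-padded input.
                    let pos = lo * stride + ki;
                    if pos < padding || pos >= padding + l_in {
                        continue; // falls in the padding, contributes zero
                    }
                    let x = input[ci * l_in + (pos - padding)];
                    let w = kernel[(co * c_in + ci) * k + ki];
                    acc += x * w;
                }
            }
            out[co * l_out + lo] = acc;
        }
    }
    out
}

/// Batched variant (hypothetical): `input` is [b, c_in, l_in], output is
/// [b, c_out, l_out]. Each batch element is convolved independently.
pub fn conv1d_batched(
    input: &[f32],
    kernel: &[f32],
    c_in: usize,
    l_in: usize,
    c_out: usize,
    k: usize,
    padding: usize,
    stride: usize,
) -> Vec<f32> {
    let sample = c_in * l_in;
    assert!(sample > 0 && input.len() % sample == 0);
    input
        .chunks(sample)
        .flat_map(|s| conv1d_single(s, kernel, c_in, l_in, c_out, k, padding, stride))
        .collect()
}
```

With this layout, an input without a batch dimension gives the same result as a batch of size one, so supporting the unbatched case only requires dispatching on the input rank.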