* Add the const-set op.
* Cuda implementation.
* Bugfix.
* Metal cleanup.
* Add the metal kernels.
* Add some testing.
* Finish the metal implementation.
* Bump the version.
* Start updating to cudarc 0.14.
* Adapt a couple more things.
* And a couple more fixes.
* More tweaks.
* And a couple more fixes.
* Bump the major version number.
* Proper module system for the cuda kernels.
* Proper ptx loading.
* Launch the sort kernel.
* Custom op.
* Start using the builder pattern.
* More builder.
* More builder.
* Get candle-core to compile.
* Get the tests to pass.
* Get candle-nn to work too.
* Support for custom cuda functions.
* cudnn fixes.
* Get flash attn to run.
* Switch the crate versions to be alpha.
* Bump the ug dependency.
* Update cudarc to v0.13.5 to support cuda 12.8.
* Bump the crate version.
---------
Co-authored-by: Michael McCulloch <michael.james.mcculloch@fastmail.com>
* Include the MLX gemm kernels.
* Clippy lints.
* Export the gemm_f32 kernel.
* Add the f16/bf16 variants.
* Add the initial dispatch code.
* More plugging of the mlx kernels.
* Add a currently broken test.
* Tweaks.
* Bugfix + get the tests to pass.
* Enable the gemm bf16 tests.
* Add some randomized tests.
* Update candle-metal-kernels/src/lib.rs
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* More fixes.
* More clippy fixes.
---------
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>