* Add some fast Metal MLX SDPA kernels (#32)
* Sketch the sdpa kernel
* Add full sdpa kernel.
* Add test
* Add vectorized kernel for decoding
* Update tests
* Add some docs
* Fix sdpa_vector names
* Add softcapping for vectorized sdpa
* Add softcapping for full sdpa
* Add support for head dim 32, 96, 256
* Update docs
* Add update notice
* Clippy and format
* Conditional compilation for bf16
* Use it in quantized llama
* Some review comments
* Use set_params!
* Remove unused
* Remove feature
* Fix metal sdpa for v stride
* Remove comma
* Add the dim method to layout and shape.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
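The entries above add fused Metal SDPA kernels (a full kernel plus a vectorized decode variant) with optional softcapping. As a hedged reference for what they compute, here is the same attention written with composite candle ops; the function name is illustrative, this is not the fused kernel's API, and softcapping is assumed to follow the usual `cap * tanh(logits / cap)` form.

```rust
use candle_core::{Result, Tensor};

// Reference SDPA with optional softcapping, written with composite ops.
// The fused Metal kernels avoid materializing the full attention matrix.
fn sdpa_reference(
    q: &Tensor, // (batch, heads, seq_q, head_dim)
    k: &Tensor, // (batch, heads, seq_kv, head_dim)
    v: &Tensor, // (batch, heads, seq_kv, head_dim)
    scale: f64,
    softcap: Option<f64>,
) -> Result<Tensor> {
    let mut att = (q.matmul(&k.transpose(2, 3)?)? * scale)?;
    if let Some(cap) = softcap {
        // Softcapping squashes the attention logits into (-cap, cap).
        att = ((att / cap)?.tanh()? * cap)?;
    }
    let att = candle_nn::ops::softmax_last_dim(&att)?;
    att.matmul(v)
}
```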
* Add a RotatingKVCache.
* Add some KvCache tests.
* Test the reset too.
* More kv-cache testing.
* More tests for the rotating kv-cache.
* Improve the API for the rotating cache so that the whole src tensor gets returned when it is too large.
* Handle contiguity + bugfix + use in mimi.
* Add a way to test the mimi streaming mode.
* Mimi streaming fixes.
* More rotating kv-cache.
* Fix the attn mask generation.
* Handle the abs case.
* Add some tests for the generated mask.
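The rotating KV cache keeps at most a fixed number of past entries and overwrites the oldest slot once full (returning the whole src tensor when a single append is larger than the cache). Below is a minimal sketch of that bookkeeping on a plain `Vec`; the type and its fields are illustrative only, not the candle API.

```rust
/// Minimal sketch of the rotating-buffer bookkeeping behind a rotating
/// KV cache; names and fields are illustrative, not the candle_nn API.
struct RotatingBuffer<T> {
    data: Vec<T>,
    max_len: usize,
    offset: usize, // next slot to (over)write
}

impl<T> RotatingBuffer<T> {
    fn new(max_len: usize) -> Self {
        Self { data: Vec::with_capacity(max_len), max_len, offset: 0 }
    }

    fn append(&mut self, x: T) {
        if self.data.len() < self.max_len {
            // Still growing: behave like a plain cache.
            self.data.push(x);
        } else {
            // Full: overwrite the oldest entry instead of growing further.
            self.data[self.offset] = x;
        }
        self.offset = (self.offset + 1) % self.max_len;
    }

    /// Entries ordered from oldest to newest.
    fn in_order(&self) -> impl Iterator<Item = &T> {
        let split = if self.data.len() < self.max_len { 0 } else { self.offset };
        self.data[split..].iter().chain(self.data[..split].iter())
    }
}
```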
* Add the layernorm cuda kernels.
* Dedicated layer norm op.
* Add the slower variant.
* Plug the cuda implementation.
* Add the metal variant.
* Add a dedicated test.
* Bugfix.
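For reference, the math that the dedicated layer-norm op and its CUDA/Metal kernels implement, spelled out with composite candle ops; a sketch only, the fused kernels do this in a single pass.

```rust
use candle_core::{Result, Tensor, D};

// Reference layer norm over the last dimension.
fn layer_norm_reference(x: &Tensor, weight: &Tensor, bias: &Tensor, eps: f64) -> Result<Tensor> {
    let mean = x.mean_keepdim(D::Minus1)?;
    let centered = x.broadcast_sub(&mean)?;
    let var = centered.sqr()?.mean_keepdim(D::Minus1)?;
    let normed = centered.broadcast_div(&(var + eps)?.sqrt()?)?;
    normed.broadcast_mul(weight)?.broadcast_add(bias)
}
```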
* When converting a tensor to a variable, clone if the tensor is already a variable.
* Add a test to ensure training a batch norm works with VarMaps
---------
Co-authored-by: Jeffrey Dallatezza <jeffreydallatezza@Jeffreys-Laptop.local>
* add sigmoid op
* small fix
* add as a method on `Tensor`
* implement gradient calculation for sigmoid
* add sigmoid tests
* we should have a specialized op for this
* fix clippy
* fix clippy 2
* Revert all previous commits in favor of a `CustomOp` based solution
* use `CustomOp1` implementation
* fix rustfmt
* add experimental metal impl
* add cuda kernel impl
* fix fmt
* Add a test + reduce some cuda duplication.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
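The merged version implements sigmoid as a `CustomOp1` with dedicated CPU/CUDA/Metal kernels. As a hedged sketch, here are the forward and backward formulas those kernels compute, written with composite ops (function names are illustrative).

```rust
use candle_core::{Result, Tensor};

// sigmoid(x) = 1 / (1 + exp(-x))
fn sigmoid_reference(x: &Tensor) -> Result<Tensor> {
    (x.neg()?.exp()? + 1.0)?.recip()
}

// Backward pass: d sigmoid(x) / dx = s * (1 - s), with s = sigmoid(x).
fn sigmoid_grad(s: &Tensor, grad_out: &Tensor) -> Result<Tensor> {
    let ds = (s * s.affine(-1.0, 1.0)?)?; // s * (1 - s)
    grad_out * ds
}
```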
* Fast kernels for rotary embeddings.
* Add a test for the fast CPU kernel.
* Rope cuda bindings.
* Cuda kernel.
* Metal kernel (part 1).
* Cuda kernels.
* Finish the metal kernel.
* Use the new kernels in the quantized example.
* Fix warning.
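A hedged reference for the rotary-embedding computation that the new fused kernels implement, in the non-interleaved ("rotate half") layout; a sketch with composite ops, not the fast kernels' API. cos/sin are assumed to be broadcastable to x's shape.

```rust
use candle_core::{Result, Tensor, D};

// x: (batch, heads, seq, head_dim); rotate_half(x) = cat(-x2, x1).
fn rope_reference(x: &Tensor, cos: &Tensor, sin: &Tensor) -> Result<Tensor> {
    let (_b, _h, _t, d) = x.dims4()?;
    let x1 = x.narrow(D::Minus1, 0, d / 2)?;
    let x2 = x.narrow(D::Minus1, d / 2, d / 2)?;
    let rotated = Tensor::cat(&[&x2.neg()?, &x1], D::Minus1)?;
    // out = x * cos + rotate_half(x) * sin
    x.broadcast_mul(cos)? + rotated.broadcast_mul(sin)?
}
```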
* Trying out a custom RmsNorm cuda kernel.
* CPU implementation for rms-norm.
* Cuda wrappers.
* Add some validation.
* Add some testing.
* More testing.
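For reference, the RMS-norm math that the custom CUDA kernel fuses; a sketch with composite candle ops.

```rust
use candle_core::{Result, Tensor, D};

// Reference RMS norm over the last dimension.
fn rms_norm_reference(x: &Tensor, weight: &Tensor, eps: f64) -> Result<Tensor> {
    let mean_sq = x.sqr()?.mean_keepdim(D::Minus1)?;
    let normed = x.broadcast_div(&(mean_sq + eps)?.sqrt()?)?;
    normed.broadcast_mul(weight)
}
```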
* add one-hot encoding
* one_hot: improve error handling, use generic to_vecN::<D>
Bails if an index value is equal to or greater than the depth value,
which would result in an out-of-bounds error.
A redundant check is added to ensure the index value does not exceed
the size of the one-hot matrix, which would also result in an
out-of-bounds error.
Bails if an index value is less than -1. An index value of -1 is not an
error: it simply skips setting the on_value for that position. Only
values less than -1 are treated as errors.
* one-hot: use two generics, one_hot::<I, O>, for input and output data types
Separating the input and output data types allows the input tensor
indices to be a different data type than the output encoded tensor data type.
For example, one_hot::<i64, u8>(...) will take an input tensor of i64 values
and encode the output tensor using u8 values.
The generic I::DTYPE must match the data type of the input indices, otherwise
the method will bail.
Additionally, this method adds an `allow_f64` option to enable the input indices
data type to be f64 values. f64 values are disabled by default.
TODO: indices data type and the generic I data type are currently not compile-time
checked.
* one_hot: remove input generic, use indices dtype matching
This commit removes the to_f64() type cast and explicitly
matches the DType of the input tensor. Currently, only U8,
U32 and I64 are supported for input tensors.
The match arms on the dtype are verbose. It would be nice
to use a generic type with the WithDType trait bound to
pass to the to_vecN method and then return an inner value.
Open to suggestions for better approaches to reduce
the match arm verbosity.
* one_hot: use flat_map iterator over dims instead of nested for loop
This commit replaces the nested for loops with a flat_map iterator over
the dimensions of the input tensor.
This commit also adds a test for a rank 3 input tensor.
* one_hot: use mandatory on/off-values, remove const msgs
This commit also updates doc tests, comments and test cases.
* Small cleanups.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
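A hedged usage sketch of the resulting encoder, assuming the final signature is `one_hot(indices, depth, on_value, off_value)` in `candle_nn::encoding` with a single output-type generic inferred from the on/off values.

```rust
use candle_core::{Device, Result, Tensor};
use candle_nn::encoding::one_hot;

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // 2x2 batch of class indices; -1 marks a position that keeps the off-value.
    let indices = Tensor::new(&[[0i64, 2], [1, -1]], &dev)?;
    // depth = 3, with mandatory on/off values.
    let encoded = one_hot(indices, 3, 1u8, 0u8)?;
    println!("{encoded}");
    Ok(())
}
```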
* Add training to batchnorm with exponential moving average
* Add more checks to batch norm
* Resolve some review comments
* Add with_momentum variants of `new` methods
* Add check for range of momentum variable; update batch norm test
* Run cargo fmt
* Add back num_features parameter
* Format; tiny simplification
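The running statistics are updated with an exponential moving average controlled by the (range-checked) momentum. A sketch of the formula only, not the `BatchNorm` internals.

```rust
/// EMA update used for the running mean/variance; `momentum` is expected
/// to lie in [0, 1], which the batch-norm constructor now checks.
fn ema_update(running: f64, batch_stat: f64, momentum: f64) -> f64 {
    (1.0 - momentum) * running + momentum * batch_stat
}
```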
* add bce with logit loss
* remove imports
* fix tiny bug
* add test documentation and refactor function
* fix test cases and formatting
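For reference, the numerically stable formula usually used for BCE-with-logits, mean(max(x, 0) - x * z + ln(1 + exp(-|x|))), written as a hedged sketch with composite candle ops (not necessarily the exact loss implementation that was merged).

```rust
use candle_core::{Result, Tensor};

fn bce_with_logits_reference(logits: &Tensor, targets: &Tensor) -> Result<Tensor> {
    // max(x, 0) - x * z
    let linear = (logits.relu()? - (logits * targets)?)?;
    // ln(1 + exp(-|x|))
    let log_term = (logits.abs()?.neg()?.exp()? + 1.0)?.log()?;
    (linear + log_term)?.mean_all()
}
```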
* Add a custom softmax implementation.
* Add softmaxlastdim to the benchmarks.
* And add a test.
* Support more dtypes.
* Polish the code.
* Use the slow implementation on cuda.
* Add a todo for the cuda kernel.
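The custom implementation targets the softmax-over-the-last-dim pattern. Here is a composite-op reference of what it computes, the kind of baseline it is benchmarked and tested against; a sketch only.

```rust
use candle_core::{Result, Tensor, D};

// Softmax over the last dimension with max subtraction for stability.
fn softmax_last_dim_reference(x: &Tensor) -> Result<Tensor> {
    let max = x.max_keepdim(D::Minus1)?;
    let exp = x.broadcast_sub(&max)?.exp()?;
    let sum = exp.sum_keepdim(D::Minus1)?;
    exp.broadcast_div(&sum)
}
```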
* Some fixes for yolo-v3.
* Use the running stats for inference in the batch-norm layer.
* Get some proper predictions for yolo.
* Avoid the quadratic insertion.
* Add a couple functions required for yolo.
* Add the yolo-v3 example.
* Add minimum and maximum.
* Use the newly introduced maximum.
* Cuda support for min/max + add some testing.
* Allow for more tests to work with accelerate.
* Fix a typo.
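A hedged usage sketch of the newly introduced elementwise `minimum`/`maximum`, assuming `Tensor::minimum` and `Tensor::maximum` take a tensor right-hand side as they do in current candle.

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let a = Tensor::new(&[1f32, 5., -2.], &dev)?;
    let b = Tensor::new(&[3f32, 2., 0.], &dev)?;
    // Elementwise min/max as used by the yolo-v3 post-processing.
    let lo = a.minimum(&b)?; // [1., 2., -2.]
    let hi = a.maximum(&b)?; // [3., 5., 0.]
    println!("{lo}\n{hi}");
    Ok(())
}
```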
* Rework the var-builder to handle initializations.
* Add some helper functions for layer creation.
* Improve the layer initializations.
* Get initialized variables.
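A hedged sketch of the initialization flow this rework enables, using the `VarMap`/`VarBuilder` API as it looks today (details may have differed at the time of these commits): variables created through the builder get their initialization and are tracked so they can be trained later.

```rust
use candle_core::{DType, Device, Result, Tensor};
use candle_nn::{Module, VarBuilder, VarMap};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let varmap = VarMap::new();
    let vb = VarBuilder::from_varmap(&varmap, DType::F32, &dev);
    // The layer constructor pulls (and initializes) its weight/bias
    // through the var-builder.
    let layer = candle_nn::linear(4, 2, vb.pp("layer"))?;
    let xs = Tensor::zeros((1, 4), DType::F32, &dev)?;
    let ys = layer.forward(&xs)?;
    println!("{ys}");
    Ok(())
}
```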
* Precompute the rot embeddings when training llamas.
* Add the nn::optim and some conversion traits.
* Add the backward_step function for SGD.
* Get the SGD optimizer to work and add a test.
* Make the test slightly simpler.
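A hedged end-to-end sketch of the SGD path, assuming the `Optimizer` trait with `SGD::new(vars, learning_rate)` and `backward_step(&loss)` as exposed by current candle_nn; it fits w in y = w * x on a toy problem.

```rust
use candle_core::{Device, Result, Tensor, Var};
use candle_nn::{Optimizer, SGD};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let w = Var::new(0f32, &dev)?;
    let x = Tensor::new(&[1f32, 2., 3.], &dev)?;
    let y = Tensor::new(&[2f32, 4., 6.], &dev)?;
    let mut sgd = SGD::new(vec![w.clone()], 0.01)?;
    for _step in 0..100 {
        let pred = x.broadcast_mul(&w)?;
        let loss = (pred - &y)?.sqr()?.mean_all()?;
        // backward_step computes the gradients and applies the update in one go.
        sgd.backward_step(&loss)?;
    }
    println!("w = {}", w.to_scalar::<f32>()?); // should be close to 2.0
    Ok(())
}
```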