* Optimize Tensor::new when called on nested Vec<..>.
* Improve performance.
* Similar flattening for the 4d case.
* More tweaks.
* Add some dummy test.
* Add the const-set op.
* Cuda implementation.
* Bugfix.
* Metal cleanup.
* Add the metal kernels.
* Add some testing.
* Finish the metal implementation.
* Bump the version.
* Start updating to cudarc 0.14.
* Adapt a couple more things.
* And a couple more fixes.
* More tweaks.
* And a couple more fixes.
* Bump the major version number.
* Proper module system for the cuda kernels.
* Proper ptx loading.
* Launch the sort kernel.
* Custom op.
* Start using the builder pattern.
* More builder.
* More builder.
* Get candle-core to compile.
* Get the tests to pass.
* Get candle-nn to work too.
* Support for custom cuda functions.
* cudnn fixes.
* Get flash attn to run.
* Switch the crate versions to be alpha.
* Bump the ug dependency.
* Improve reduce perf and add contiguous impl
* Improve arg reduce and add contiguous impl
* Improve softmax kernel. 33%-39% higher thrpt
* fmt
* Fixed all bugs. Improved code quality. Added tests.
* Stash for debugging
* Stash for debugging 2
* Fixing argmax bug and improve performance
Co-authored-by: Christopher Fleetwood <45471420+FL33TW00D@users.noreply.github.com>
* Fix test and add is_valid_simgroup_reduce_type trait
* Online softmax. Improved threadgroup reduce. Tidying up a bit.
* Remove redundant threadgroup_barrier from arg reduce
* Mostly tidying up. Some improvements
* Simplify indexed struct
* tidying
* Reuse operation operator instead of passing it in as a parameter
* Fix how operators are applied to indexed<vec<T,N>>
* Vectorized load. Scalar block reduce. Hitting max throughput for f32 reduce.
* Vectorized load for online softmax. Involves a reinterpret_cast of src which may be suboptimal.
* Metal as_type casting vec<bfloat, N> -> vec<float, N/2> for simd and fast math
* Use constant for input instead of const device. Fix strided reduce.
* Use contiguous reduce in tests
* Rename finalize -> to_scalar
* Support integer types max/min (switch with trait-inferred impl later)
* Was worried I was skipping work -> shuffling the 1D test cases
* Add build.rs to avoid metal kernel jit compile overhead
* Improve build. Extract utils
* Compile metal kernels for both macos and ios
* Fixed over xmas and then forgot about it
* Add calculate_reduce_threads util
* Remove old reduce.metal
* Improve f16/bf16 softmax precision by accumulating in f32
* Remove build.rs (for now)
* Move softmax bench to candle-nn
* Remove redundant thread calc util fn
* Use uint over ushort for indices etc
* Use fast exp in MDReduceOp
* Remove nested metal define for softmax
* Fix some clippy lint.
---------
Co-authored-by: Christopher Fleetwood <45471420+FL33TW00D@users.noreply.github.com>
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* module docs
* varbuilder gguf docs
* add a link to gguf files
* small additonal mod doc titles
* safetensor docs
* more core docs
* more module docs in canlde_core
* 2 more link fixes
* Add some fast Metal MLX SDPA kernels (#32)
* Sketch the sdpa kernel
* Add full sdpa kernel,
* Add test
* Add vectorized kernel for decoding
* Update tests
* Add some docs
* Fix sdpa_vector names
* Add softcapping for vectorized sdpa
* Add softcapping for full sdpa
* Add support for head dim 32, 96, 256
* Add support for head dim 32, 96, 256
* Update docs
* Add update notice
* Clippy and format
* Conditional compilation for bf16
* Use it in quantized llama
* Some review comments
* Use set_params!
* Remove unused
* Remove feature
* Fix metal sdpa for v stride
* Remove comma
* Add the dim method to layout and shape.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>