a2e925462c
Add the scatter in place ops. ( #2923 )
* Add the scatter_set op.
* Metal op.
* Cuda version.
* Merge the checks.
* Add the actual ops.
2025-04-26 07:36:49 +02:00
3827685524
Add the scatter op. ( #2921 )
* Add the scatter op.
* Backprop support.
* Cuda support.
2025-04-25 21:46:58 +02:00
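The scatter semantics this entry introduces can be sketched in plain Rust. This is a stdlib-only illustration of the 1-D case (candle's actual op generalizes over a chosen dimension and dtype); the function name `scatter_1d` is made up for the sketch, not candle's API.

```rust
// 1-D scatter semantics: for each i, out[index[i]] = src[i].
// The overwrite here is what distinguishes `scatter` from `scatter_add`,
// which would accumulate into the destination instead.
fn scatter_1d(init: &[f32], index: &[usize], src: &[f32]) -> Vec<f32> {
    let mut out = init.to_vec();
    for (i, &idx) in index.iter().enumerate() {
        out[idx] = src[i];
    }
    out
}

fn main() {
    let out = scatter_1d(&[0.0; 4], &[3, 1], &[10.0, 20.0]);
    println!("{out:?}"); // [0.0, 20.0, 0.0, 10.0]
}
```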
a4c56a958e
Add the const-set op. ( #2910 )
* Add the const-set op.
* Cuda implementation.
* Bugfix.
* Metal cleanup.
* Add the metal kernels.
* Add some testing.
* Finish the metal implementation.
* Bump the version.
2025-04-19 10:07:02 +02:00
9dbaf958dc
Add an enum for scalar values. ( #2909 )
* Add a scalar enum type.
* Add a bit more to the scalar type.
* Small tweak.
* More scalar usage.
2025-04-18 22:13:38 +02:00
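A scalar enum of the kind #2909 describes can be sketched as follows. The variant set and method names here are hypothetical, chosen only to illustrate the idea of carrying one typed value per dtype behind a single enum.

```rust
// Hypothetical sketch of an enum wrapping a single scalar value per dtype,
// so ops like const-set can take one argument regardless of tensor dtype.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Scalar {
    U32(u32),
    I64(i64),
    F32(f32),
    F64(f64),
}

impl Scalar {
    // Promote any variant to f64, e.g. to fill a buffer with a constant.
    fn to_f64(self) -> f64 {
        match self {
            Scalar::U32(v) => v as f64,
            Scalar::I64(v) => v as f64,
            Scalar::F32(v) => v as f64,
            Scalar::F64(v) => v,
        }
    }
}

fn main() {
    println!("{}", Scalar::U32(3).to_f64()); // 3
}
```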
ce5f8dd129
Check the bounds in the cuda indexing kernels. ( #2908 )
* Check the bounds in the cuda indexing kernels.
* Another check.
2025-04-18 20:08:17 +02:00
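The kind of check #2908 adds can be sketched in safe Rust: a gather that validates every index before reading, rather than trusting the caller. The cuda kernels do the equivalent on-device; `gather_checked` is an illustrative name, not candle's API.

```rust
// Bounds-checked gather: out-of-range indices produce an error instead of
// an out-of-bounds read, mirroring the checks added to the indexing kernels.
fn gather_checked(src: &[f32], index: &[usize]) -> Result<Vec<f32>, String> {
    index
        .iter()
        .map(|&i| {
            src.get(i)
                .copied()
                .ok_or_else(|| format!("index {i} out of bounds for len {}", src.len()))
        })
        .collect()
}

fn main() {
    println!("{:?}", gather_checked(&[1.0, 2.0, 3.0], &[2, 0])); // Ok([3.0, 1.0])
    println!("{:?}", gather_checked(&[1.0, 2.0, 3.0], &[5])); // Err(..)
}
```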
e4e7b0b2da
Use cudarc 0.16. ( #2900 )
* Use cudarc 0.16.
* Allow for disabling event tracking.
* Tweaks.
* Bump the ug version.
* And bump the candle version too.
2025-04-15 21:40:18 +02:00
f3a73f80d1
Support for cudnn conv1d. ( #2888 )
* Support for cudnn conv1d.
* More conv1d work.
* Get the conv1d to work with cudnn.
* Cleanup.
2025-04-13 16:47:37 +02:00
d9198deb37
Im2col cuda optimization. ( #2885 )
2025-04-13 10:07:53 +02:00
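The im2col transform this entry optimizes can be sketched in 1-D for brevity (the cuda kernel works on 2-D inputs with channels): unfold the input into one row per receptive field, so the convolution reduces to a single matmul. Stdlib-only illustration, not the kernel itself.

```rust
// 1-D im2col: each output row is one receptive field of length k, taken at
// the given stride. Convolution then becomes im2col(x) · kernel, i.e. a matmul.
fn im2col_1d(input: &[f32], k: usize, stride: usize) -> Vec<Vec<f32>> {
    let n_out = (input.len() - k) / stride + 1;
    (0..n_out)
        .map(|o| input[o * stride..o * stride + k].to_vec())
        .collect()
}

fn main() {
    println!("{:?}", im2col_1d(&[1.0, 2.0, 3.0, 4.0], 2, 1));
    // [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
}
```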
34505fdf3a
Avoid using batched-matmul in nn::Linear. ( #2883 )
* Avoid using batched-matmul in nn::Linear.
* Also avoid batched matmul in conv1d.
* Also tweak the conv2d.
* Batched tests.
* Also cover conv2d.
2025-04-12 19:53:58 +02:00
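The reshaping trick behind #2883 can be sketched with a naive matmul: a (b, t, d) input times a (d, o) weight gives the same result whether you run b separate (t, d) × (d, o) matmuls or merge the batch dims into one flat (b·t, d) × (d, o) matmul, and the flat form avoids the batched-matmul path. Illustrative stdlib code, not candle's implementation.

```rust
// Naive (m, k) x (k, n) matmul on flat row-major slices.
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                out[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }
    out
}

fn main() {
    // A (2, 3, 2) input times a (2, 2) weight: per-batch matmuls...
    let x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0];
    let w = [1.0, 0.0, 0.0, 1.0];
    let per_batch: Vec<f32> = x.chunks(6).flat_map(|xb| matmul(xb, &w, 3, 2, 2)).collect();
    // ...match a single flat (6, 2) x (2, 2) matmul over the merged batch dims.
    let flat = matmul(&x, &w, 6, 2, 2);
    assert_eq!(per_batch, flat);
    println!("per-batch and flat results agree");
}
```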
acc5bd335f
Cuda cleanup. ( #2880 )
* Cuda cleanup.
* More fixes.
2025-04-11 21:43:35 +02:00
d9904a3baf
Update to cudarc 0.14 (breaking change). ( #2858 )
* Start updating to cudarc 0.14.
* Adapt a couple more things.
* And a couple more fixes.
* More tweaks.
* And a couple more fixes.
* Bump the major version number.
* Proper module system for the cuda kernels.
* Proper ptx loading.
* Launch the sort kernel.
* Custom op.
* Start using the builder pattern.
* More builder.
* More builder.
* Get candle-core to compile.
* Get the tests to pass.
* Get candle-nn to work too.
* Support for custom cuda functions.
* cudnn fixes.
* Get flash attn to run.
* Switch the crate versions to be alpha.
* Bump the ug dependency.
2025-04-03 09:12:19 +02:00
b4daa03e59
add as_cuda_slice_mut to CudaStorage and CudaDType ( #2859 )
2025-04-01 19:34:52 +02:00
0af3e428ec
fix: place `ug` dep behind not wasm32 flag ( #2760 )
* place `ug` behind not wasm32 attr
so that wasm32 can compile
* mv `ug` to conditional target dep
assuming every non-wasm32 user wants this
2025-02-01 23:05:52 +01:00
b52c2c6050
Clippy fixes for the cuda feature. ( #2650 )
2024-11-29 09:01:34 +01:00
3159f91b90
20241118 docs ( #2629 )
* module docs
* varbuilder gguf docs
* add a link to gguf files
* small additional mod doc titles
* safetensor docs
* more core docs
* more module docs in candle_core
* 2 more link fixes
2024-11-19 04:07:07 +01:00
594d984f9c
Support for UG kernels. ( #2579 )
* Support for UG kernels.
* Add a dedicated test.
2024-10-27 13:37:19 +01:00
6faecaa616
Fix for cudnn bf16 conv2d. ( #2535 )
2024-10-02 23:18:55 +02:00
7b60bda4ed
Add support for cuda streams. ( #2532 )
2024-10-02 21:30:58 +02:00
aafa24ed93
Update cudarc to 0.12. ( #2451 )
* Update cudarc to 0.12.
* Some cudnn tweaks.
2024-08-27 10:10:30 +02:00
36cf54525d
Fix the fast bf16 gemm cublas kernels. ( #2274 )
* Use flash-attn in gemma.
* Fix for the fast bf16 cublas gemm.
* Fix some clippy lints.
* Fix another lint.
* Proper clippy fix.
2024-06-18 23:46:58 +02:00
1df2bddccf
Add the layernorm specialized op. ( #2212 )
* Add the layernorm cuda kernels.
* Dedicated layer norm op.
* Add the slower variant.
* Plug the cuda implementation.
* Add the metal variant.
* Add a dedicated test.
* Bugfix.
2024-05-24 15:58:01 +02:00
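The math the specialized layer-norm op fuses into one kernel can be sketched per row: normalize over the last dimension, then scale and shift. Stdlib-only illustration of the formula, not the cuda/metal kernels themselves.

```rust
// Layer norm over one row: y = (x - mean) / sqrt(var + eps) * gamma + beta.
fn layer_norm(x: &[f32], gamma: &[f32], beta: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter()
        .zip(gamma.iter().zip(beta))
        .map(|(v, (g, b))| (v - mean) * inv_std * g + b)
        .collect()
}

fn main() {
    let y = layer_norm(&[1.0, 3.0], &[1.0, 1.0], &[0.0, 0.0], 1e-5);
    println!("{y:?}"); // ≈ [-1.0, 1.0]
}
```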
6f0b807ffd
More efficient cuda implementation for ConvTranspose1d. ( #2211 )
* More efficient cuda implementation for ConvTranspose1d.
* Small tweak.
2024-05-24 11:05:43 +02:00
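The ConvTranspose1d semantics this entry speeds up can be sketched for the single-channel, no-padding case: each input element scatters a scaled copy of the kernel into the output at `i * stride`. An illustrative sketch only; the cuda kernel handles channels, padding, and dilation.

```rust
// ConvTranspose1d, single channel, no padding: output positions i*stride + k
// accumulate input[i] * kernel[k].
fn conv_transpose_1d(input: &[f32], kernel: &[f32], stride: usize) -> Vec<f32> {
    let out_len = (input.len() - 1) * stride + kernel.len();
    let mut out = vec![0.0f32; out_len];
    for (i, &x) in input.iter().enumerate() {
        for (k, &w) in kernel.iter().enumerate() {
            out[i * stride + k] += x * w;
        }
    }
    out
}

fn main() {
    println!("{:?}", conv_transpose_1d(&[1.0, 2.0], &[1.0, 1.0], 2));
    // [1.0, 1.0, 2.0, 2.0]
}
```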
9cff7bc3f4
Make it possible to use TF32 accumulation in F32 matmuls. ( #2178 )
* Allow the use of tf32 accumulation in matmul.
* Better timings.
* Dummy versions for use when cuda is not enabled.
2024-05-11 12:28:39 +02:00
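What the TF32 toggle trades away can be sketched by emulating TF32's reduced precision: TF32 keeps f32's 8 exponent bits but only 10 mantissa bits, so matmul inputs are effectively rounded before multiplication. The sketch below truncates (rather than rounds to nearest) for simplicity and is an assumption-laden illustration, not what the tensor cores do bit-for-bit.

```rust
// Emulate TF32's 10-bit mantissa by zeroing the low 13 of f32's 23 mantissa
// bits. Values needing more than 10 mantissa bits lose precision.
fn truncate_to_tf32(x: f32) -> f32 {
    f32::from_bits(x.to_bits() & !0x1fff)
}

fn main() {
    let x = 1.000_001_f32;
    // The tiny offset below 1.0's 10-bit mantissa granularity is dropped.
    println!("{} -> {}", x, truncate_to_tf32(x)); // 1.000001 -> 1
}
```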
89f53b9d7b
Bump the version number to 0.5.1. ( #2155 )
* Bump the version number to 0.5.1.
* Fix clippy lints for 1.78.
* More clippy fixes.
2024-05-03 11:17:05 +02:00
fa06f5f5f9
F16/BF16 bugfix (bis). ( #2143 )
* F16/BF16 bugfix (bis).
* Another fix.
* Yet another fix.
2024-04-29 14:08:44 +02:00
09d4845aa8
Bugfix the recent f16/bf16 changes. ( #2142 )
2024-04-29 13:30:11 +02:00
3bbb88fcb4
Fix sigmoid gradient calculation and move sigmoid into a specialized op ( #2114 )
* add sigmoid op
* small fix
* add as a method on `Tensor`
* implement gradient calculation for sigmoid
* add sigmoid tests
* we should have a specialized op for this
* fix clippy
* fix clippy 2
* Revert all previous commits in favor of a `CustomOp` based solution
* use `CustomOp1` implementation
* fix rustfmt
* experimental add metal impl
* add cuda kernel impl
* fix fmt
* Add a test + reduce some cuda duplication.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
2024-04-29 11:04:43 +02:00
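The gradient identity underlying this fix can be checked directly: sigmoid'(x) = sigmoid(x) · (1 - sigmoid(x)), so the backward pass only needs the forward output. A stdlib sketch with a finite-difference cross-check; it illustrates the math, not candle's `CustomOp1` implementation.

```rust
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

// Analytic gradient: sigmoid'(x) = s * (1 - s) where s = sigmoid(x).
fn sigmoid_grad(x: f64) -> f64 {
    let s = sigmoid(x);
    s * (1.0 - s)
}

fn main() {
    // Cross-check against a central finite difference.
    let (x, h) = (0.3, 1e-6);
    let numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h);
    println!("analytic {} vs numeric {}", sigmoid_grad(x), numeric);
}
```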
ed7b99f525
Add a toggle for F16/BF16 accumulation in gemm. ( #2141 )
* Add a toggle to control f16/bf16 gemm precision.
* Use the faster variant in the quantized example.
* Bugfix.
2024-04-29 09:21:07 +02:00
8a05743a21
Add StorageRef. ( #2113 )
* Add the storage-ref bits.
* Add the metal implementation.
2024-04-23 13:23:27 +02:00
53e5380bf6
Add a synchronize method to devices. ( #2055 )
* Add a synchronize method to devices.
* Metal version.
2024-04-14 16:32:55 +02:00
8967c46563
Split the cuda error file. ( #2003 )
2024-04-04 08:27:23 +02:00
318d143224
Relax the contiguous check for cuda kernels. ( #2000 )
* Relax the contiguous check for cuda kernels.
* Ensure contiguity for RNNs.
* Unrelated fix for segment anything.
* Better error message + allow concatenating empty slices.
2024-04-03 09:02:38 +02:00
08c049def3
Improve the handling of matmul with squeezed layouts. ( #1998 )
* Improve the handling of matmul with squeezed layouts.
* Fix for the cuda backend.
* Revert the temporary fix.
2024-04-02 23:17:05 +02:00
665da30487
Backend refactoring. ( #1966 )
* Backend refactoring.
* Metal tweaks.
* Move the cudnn module.
2024-03-29 23:02:11 +01:00