0af3e428ec
fix: place ug dep behind not wasm32 flag (#2760)
...
* place `ug` behind a not-wasm32 attr so that the wasm32 target can compile
* move `ug` to a conditional target dep, assuming every non-wasm32 user wants this
2025-02-01 23:05:52 +01:00
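A minimal sketch of the gating described in the commit above, using a hypothetical `ug_kernels` module; the manifest table in the comment and the `ug` version are illustrative, not the exact lines from the PR:

```rust
// Gate the `ug`-backed code path so the crate still builds for wasm32.
// In Cargo.toml the dependency moves to a conditional target table:
//   [target.'cfg(not(target_arch = "wasm32"))'.dependencies]
//   ug = "0.1"  // version is an assumption
#[cfg(not(target_arch = "wasm32"))]
mod ug_kernels {
    // Anything that pulls in `ug` lives behind the same cfg.
    pub fn available() -> bool {
        true
    }
}

#[cfg(target_arch = "wasm32")]
mod ug_kernels {
    // Stub so wasm32 callers compile without the dependency.
    pub fn available() -> bool {
        false
    }
}
```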
b52c2c6050
Clippy fixes for the cuda feature. (#2650)
2024-11-29 09:01:34 +01:00
3159f91b90
20241118 docs (#2629)
...
* module docs
* varbuilder gguf docs
* add a link to gguf files
* small additional mod doc titles
* safetensor docs
* more core docs
* more module docs in candle_core
* 2 more link fixes
2024-11-19 04:07:07 +01:00
594d984f9c
Support for UG kernels. (#2579)
...
* Support for UG kernels.
* Add a dedicated test.
2024-10-27 13:37:19 +01:00
6faecaa616
Fix for cudnn bf16 conv2d. (#2535)
2024-10-02 23:18:55 +02:00
7b60bda4ed
Add support for cuda streams. (#2532)
2024-10-02 21:30:58 +02:00
aafa24ed93
Update cudarc to 0.12. (#2451)
...
* Update cudarc to 0.12.
* Some cudnn tweaks.
2024-08-27 10:10:30 +02:00
36cf54525d
Fix the fast bf16 gemm cublas kernels. (#2274)
...
* Use flash-attn in gemma.
* Fix for the fast bf16 cublas gemm.
* Fix some clippy lints.
* Fix another lint.
* Proper clippy fix.
2024-06-18 23:46:58 +02:00
1df2bddccf
Add the layernorm specialized op. (#2212)
...
* Add the layernorm cuda kernels.
* Dedicated layer norm op.
* Add the slower variant.
* Plug the cuda implementation.
* Add the metal variant.
* Add a dedicated test.
* Bugfix.
2024-05-24 15:58:01 +02:00
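A hedged usage sketch for the change above: routing layer norm through `candle_nn`'s `LayerNorm` should, per the commit, hit the dedicated cuda/metal kernels; the shapes and the exact dispatch path are assumptions on my part.

```rust
use candle_core::{DType, Device, Tensor};
use candle_nn::{LayerNorm, Module};

fn main() -> candle_core::Result<()> {
    let device = Device::cuda_if_available(0)?;
    let hidden = 64usize;
    // Per-channel scale and shift for the normalization.
    let weight = Tensor::ones(hidden, DType::F32, &device)?;
    let bias = Tensor::zeros(hidden, DType::F32, &device)?;
    let ln = LayerNorm::new(weight, bias, 1e-5);
    let xs = Tensor::randn(0f32, 1f32, (2, 10, hidden), &device)?;
    // On cuda/metal this should now route to the specialized layer-norm op.
    let ys = ln.forward(&xs)?;
    println!("{:?}", ys.dims());
    Ok(())
}
```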
6f0b807ffd
More efficient cuda implementation for ConvTranspose1d. (#2211)
...
* More efficient cuda implementation for ConvTranspose1d.
* Small tweak.
2024-05-24 11:05:43 +02:00
9cff7bc3f4
Make it possible to use TF32 accumulation in F32 matmuls. (#2178)
...
* Allow the use of tf32 accumulation in matmul.
* Better timings.
* Dummy versions for use when cuda is not enabled.
2024-05-11 12:28:39 +02:00
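A hedged sketch of how the toggle above might be flipped from application code; `candle_core::cuda::set_gemm_reduced_precision_f32` is my reading of this change (the commit also mentions dummy versions when cuda is disabled), so treat the exact path as an assumption.

```rust
use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    // Opt in to TF32 accumulation for f32 matmuls: faster, slightly less precise.
    // Per the commit, a no-op dummy should be available when the cuda feature is off.
    candle_core::cuda::set_gemm_reduced_precision_f32(true);

    let device = Device::cuda_if_available(0)?;
    let a = Tensor::randn(0f32, 1f32, (1024, 1024), &device)?;
    let b = Tensor::randn(0f32, 1f32, (1024, 1024), &device)?;
    let c = a.matmul(&b)?;
    println!("{:?}", c.dims());
    Ok(())
}
```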
89f53b9d7b
Bump the version number to 0.5.1. (#2155)
...
* Bump the version number to 0.5.1.
* Fix clippy lints for 1.78.
* More clippy fixes.
2024-05-03 11:17:05 +02:00
fa06f5f5f9
F16/BF16 bugfix (bis). (#2143)
...
* F16/BF16 bugfix (bis).
* Another fix.
* Yet another fix.
2024-04-29 14:08:44 +02:00
09d4845aa8
Bugfix the recent f16/bf16 changes. (#2142)
2024-04-29 13:30:11 +02:00
3bbb88fcb4
Fix sigmoid gradient calculation and move sigmoid into a specialized op (#2114)
...
* add sigmoid op
* small fix
* add as a method on `Tensor`
* implement gradient calculation for sigmoid
* add sigmoid tests
* we should have a specialized op for this
* fix clippy
* fix clippy 2
* Revert all previous commits in favor of a `CustomOp` based solution
* use `CustomOp1` implementation
* fix rustfmt
* experimental add metal impl
* add cuda kernel impl
* fix fmt
* Add a test + reduce some cuda duplication.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
2024-04-29 11:04:43 +02:00
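A hedged usage sketch: `candle_nn::ops::sigmoid` appears to be the entry point that this change backs with the `CustomOp1`-based implementation (cuda/metal kernels plus the fixed gradient); the shapes below are illustrative.

```rust
use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    let device = Device::cuda_if_available(0)?;
    let xs = Tensor::randn(0f32, 1f32, (4, 8), &device)?;
    // Backed by the specialized op after this change rather than composed primitives.
    let ys = candle_nn::ops::sigmoid(&xs)?;
    println!("{:?}", ys.dims());
    Ok(())
}
```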
ed7b99f525
Add a toggle for F16/BF16 accumulation in gemm. (#2141)
...
* Add a toggle to control f16/bf16 gemm precision.
* Use the faster variant in the quantized example.
* Bugfix.
2024-04-29 09:21:07 +02:00
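In the same spirit as the TF32 toggle sketched earlier, a hedged one-liner for the f16/bf16 case; the function paths under `candle_core::cuda` are assumptions based on the commit description.

```rust
// Let gemm accumulate in reduced precision for f16/bf16 matmuls,
// trading a little accuracy for speed (as used in the quantized example).
fn enable_fast_half_precision_gemm() {
    candle_core::cuda::set_gemm_reduced_precision_f16(true);
    candle_core::cuda::set_gemm_reduced_precision_bf16(true);
}
```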
8a05743a21
Add StorageRef. (#2113)
...
* Add the storage-ref bits.
* Add the metal implementation.
2024-04-23 13:23:27 +02:00
53e5380bf6
Add a synchronize method to devices. (#2055)
...
* Add a synchronize method to devices.
* Metal version.
2024-04-14 16:32:55 +02:00
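A hedged sketch of why the synchronize method matters: kernel launches on cuda/metal are asynchronous, so waiting for the device before stopping a timer gives meaningful measurements. `Device::synchronize` returning a `Result` is my assumption.

```rust
use candle_core::{Device, Tensor};
use std::time::Instant;

fn main() -> candle_core::Result<()> {
    let device = Device::cuda_if_available(0)?;
    let a = Tensor::randn(0f32, 1f32, (2048, 2048), &device)?;
    let b = Tensor::randn(0f32, 1f32, (2048, 2048), &device)?;

    let start = Instant::now();
    let _c = a.matmul(&b)?;
    // Kernel launches are async; wait for completion before reading the clock.
    device.synchronize()?;
    println!("matmul took {:?}", start.elapsed());
    Ok(())
}
```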
8967c46563
Split the cuda error file. (#2003)
2024-04-04 08:27:23 +02:00
318d143224
Relax the contiguous check for cuda kernels. (#2000)
...
* Relax the contiguous check for cuda kernels.
* Ensure contiguity for RNNs.
* Unrelated fix for segment anything.
* Better error message + allow concatenating empty slices.
2024-04-03 09:02:38 +02:00
08c049def3
Improve the handling of matmul with squeezed layouts. (#1998)
...
* Improve the handling of matmul with squeezed layouts.
* Fix for the cuda backend.
* Revert the temporary fix.
2024-04-02 23:17:05 +02:00
665da30487
Backend refactoring. (#1966)
...
* Backend refactoring.
* Metal tweaks.
* Move the cudnn module.
2024-03-29 23:02:11 +01:00