1df2bddccf
Add the layernorm specialized op. ( #2212 )
...
* Add the layernorm cuda kernels.
* Dedicated layer norm op.
* Add the slower variant.
* Plug the cuda implementation.
* Add the metal variant.
* Add a dedicated test.
* Bugfix.
2024-05-24 15:58:01 +02:00
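For context, a minimal sketch of how the dedicated layer-norm op from this change might be invoked, assuming it is exposed as `candle_nn::ops::layer_norm` with an `(xs, weight, bias, eps)` signature; the exact path and signature are assumptions, not taken from the commit.

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // Normalize a (batch, hidden) activation over its last dimension.
    let xs = Tensor::randn(0f32, 1f32, (4, 64), &dev)?;
    let weight = Tensor::ones(64, DType::F32, &dev)?;
    let bias = Tensor::zeros(64, DType::F32, &dev)?;
    // Hypothetical entry point for the specialized op; backends without a
    // dedicated kernel would fall back to the slower composed variant.
    let ys = candle_nn::ops::layer_norm(&xs, &weight, &bias, 1e-5)?;
    println!("{:?}", ys.shape());
    Ok(())
}
```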
6f0b807ffd
More efficient cuda implementation for ConvTranspose1d. ( #2211 )
...
* More efficient cuda implementation for ConvTranspose1d.
* Small tweak.
2024-05-24 11:05:43 +02:00
d54e02d73d
Avoid a contiguous call in the quantized phi 3 model. ( #2209 )
...
* Simplify the KvCache api.
* Avoid a contiguous call in the quantized phi3 model.
2024-05-23 21:24:55 +02:00
45e235a747
Simplify the KvCache api. ( #2207 )
2024-05-23 17:07:21 +02:00
31cf64147b
Add a couple kv-cache helper functions. ( #2206 )
2024-05-23 16:21:47 +02:00
77ea479a18
Add Phi-3 Medium ( #2205 )
2024-05-23 13:33:17 +02:00
72e7ca529a
Add some missing where-cond kernels for metal. ( #2203 )
2024-05-22 09:44:52 +02:00
7ff921c538
Add RandomNormal ONNX operator ( #2200 )
0.5.1
2024-05-21 21:47:32 +02:00
9b8537a62f
Remove the deprecated wav crate in favor of hound. ( #2202 )
2024-05-21 21:43:35 +02:00
7ebc3548e1
Use flash-attn in gemma. ( #2195 )
...
* Use flash-attn in gemma.
* Fix flash-attn for head dim 256.
2024-05-18 19:18:59 +02:00
eefc1c77ef
Support flash-attn in quantized phi3. ( #2194 )
2024-05-18 17:12:56 +02:00
01545f7303
Add a slice_set op. ( #2193 )
...
* Add a slice_set op.
* Add some testing.
* Add the dedicated kv-cache module.
* Derive debug and clone.
* Expose more kv-cache functions.
* Return the current data when appending.
* Use the new cache in the quantized phi3 model.
2024-05-18 15:58:18 +02:00
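As a rough illustration of the pieces listed above, here is a sketch assuming `Tensor::slice_set(src, dim, offset)` writes `src` into `self` in place along `dim` starting at `offset`, and that the dedicated kv-cache module lives at `candle_nn::kv_cache::KvCache` with a `new(dim, max_seq_len)` constructor and an `append` that returns the data accumulated so far; the exact signatures are assumptions.

```rust
use candle_core::{DType, Device, Result, Tensor};
use candle_nn::kv_cache::KvCache;

fn main() -> Result<()> {
    let dev = Device::Cpu;

    // slice_set: copy a smaller tensor into a pre-allocated buffer in place
    // (assumed signature: slice_set(&src, dim, offset)).
    let buffer = Tensor::zeros((1, 8, 4), DType::F32, &dev)?;
    let chunk = Tensor::ones((1, 2, 4), DType::F32, &dev)?;
    buffer.slice_set(&chunk, 1, 3)?; // write the chunk at offset 3 along dim 1.

    // KvCache: grow the cached keys/values one step at a time; append returns
    // everything cached so far (assumed API).
    let mut cache = KvCache::new(2, 512); // cache along dim 2, up to 512 positions.
    let k = Tensor::randn(0f32, 1f32, (1, 8, 1, 64), &dev)?;
    let v = Tensor::randn(0f32, 1f32, (1, 8, 1, 64), &dev)?;
    let (k_all, v_all) = cache.append(&k, &v)?;
    println!("{:?} {:?}", k_all.shape(), v_all.shape());
    Ok(())
}
```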
349c3e806a
Support embedding model gte-Qwen1.5-7B-instruct ( #2190 )
...
* Support embedding model gte-Qwen1.5-7B-instruct
This is a text embedding model based on Qwen2. The two share the same
model architecture except for the last MLP module. This commit makes
minimal modifications to the old Qwen2 implementation to support both
models.
An example is provided and has been verified against the official
PyTorch implementation.
* Avoid doing the 'last-token filtering' based on the absence of an attention mask.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com >
2024-05-16 21:34:10 +02:00
bdaa34216a
chore: add fix for windows cudarc into the readme ( #2189 )
2024-05-16 14:32:50 +02:00
cc80e065e5
Allow the threshold argument to be negative in the segment-anything example ( #2187 )
...
The threshold is 0.0 by default; negative values include more points,
expanding the mask, while positive values are more selective, making the
mask smaller.
Negative numbers start with a minus sign, which clap normally
interprets as a flag.
2024-05-15 13:17:20 +02:00
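The negative-value issue described above is generic clap behavior; a minimal sketch of one way to accept negative thresholds follows. The field name and exact attribute choice are illustrative, not taken from the example's actual source.

```rust
use clap::Parser;

#[derive(Parser, Debug)]
struct Args {
    /// Detection threshold; negative values expand the mask, positive ones shrink it.
    /// `allow_hyphen_values` lets clap accept `-0.5` instead of treating it as a flag.
    #[arg(long, default_value_t = 0.0, allow_hyphen_values = true)]
    threshold: f32,
}

fn main() {
    let args = Args::parse();
    println!("threshold = {}", args.threshold);
}
```

Alternatively, callers can always attach the value with an equals sign, e.g. `--threshold=-0.5`, which clap parses unambiguously.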
13c64f6828
Fix VarBuilder::from_slice_safetensors ( #2180 )
...
Also implement SimpleBackend for SliceSafetensors
Signed-off-by: Harry Stern <harry@harrystern.net >
2024-05-12 07:26:06 +02:00
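A short usage sketch for the fixed constructor, assuming its signature mirrors the other safetensors-backed `VarBuilder` constructors, i.e. `from_slice_safetensors(bytes, dtype, device)`; the parameters and the tensor name below are assumptions for illustration.

```rust
use candle_core::{DType, Device, Result};
use candle_nn::VarBuilder;

fn load_weights(bytes: &[u8]) -> Result<()> {
    let dev = Device::Cpu;
    // Build a VarBuilder directly over an in-memory safetensors buffer,
    // without writing it to disk first (assumed signature).
    let vb = VarBuilder::from_slice_safetensors(bytes, DType::F32, &dev)?;
    // Tensors are then fetched by name as usual.
    let w = vb.get((16, 16), "linear.weight")?;
    println!("{:?}", w.shape());
    Ok(())
}
```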
21f82a5155
Add SliceSafetensors. ( #2179 )
...
* Add SliceSafetensors.
* And add some testing.
2024-05-11 13:15:42 +02:00
9cff7bc3f4
Make it possible to use TF32 accumulation in F32 matmuls. ( #2178 )
...
* Allow the use of tf32 accumulation in matmul.
* Better timings.
* Dummy versions for use when cuda is not enabled.
2024-05-11 12:28:39 +02:00
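A rough sketch of how the TF32 toggle might be flipped, assuming it is exposed as a setter in the cuda module with a name along the lines of the commit description; the module path and function name are assumptions.

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    // Opt into TF32 accumulation for f32 matmuls on cuda; this trades a little
    // precision for a significant speed-up on recent GPUs. Hypothetical API.
    candle_core::cuda::set_gemm_reduced_precision_f32(true);

    let dev = Device::cuda_if_available(0)?;
    let a = Tensor::randn(0f32, 1f32, (1024, 1024), &dev)?;
    let b = Tensor::randn(0f32, 1f32, (1024, 1024), &dev)?;
    let c = a.matmul(&b)?;
    println!("{:?}", c.shape());
    Ok(())
}
```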
d9bc5ec151
Switch cudarc back to dynamic linking. ( #2176 )
2024-05-09 10:35:44 +02:00
84328e2b60
Update cudarc requirement from 0.11.0 to 0.11.1 ( #2174 )
...
* Upgrade the cudarc dependency from v0.11.0 to v0.11.1, which resolves a compile-time bug.
See: https://github.com/huggingface/candle/issues/2173
2024-05-08 20:40:36 +02:00
82b641fd27
Update cudarc requirement from 0.10.0 to 0.11.0 ( #2165 )
...
* Update cudarc requirement from 0.10.0 to 0.11.0
Updates the requirements on [cudarc](https://github.com/coreylowman/cudarc ) to permit the latest version.
- [Release notes](https://github.com/coreylowman/cudarc/releases )
- [Commits](https://github.com/coreylowman/cudarc/compare/v0.10.0...v0.10.0 )
---
updated-dependencies:
- dependency-name: cudarc
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com >
* Use the default cuda version.
---------
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: laurent <laurent.mazare@gmail.com >
2024-05-06 17:12:14 +02:00
01794dc16e
Use write rather than try-write on the metal rw-locks. ( #2162 )
2024-05-05 07:22:46 +02:00
a75cd8164f
Force the revision for the phi3-llama quantized models. ( #2159 )
2024-05-04 10:41:18 +02:00
b13a82a438
Separate quantized phi-3 implementation. ( #2157 )
...
* Separate quantized phi-3 implementation.
* Integrate the quantized phi3 model.
* Small fixes, get the generation to work properly.
* Keep the old llama implementation around.
* Change the default.
2024-05-04 10:14:57 +02:00
59b18d974e
Pin the version used for the quantized phi 3 gguf file. ( #2156 )
2024-05-03 15:03:22 +02:00
89f53b9d7b
Bump the version number to 0.5.1. ( #2155 )
...
* Bump the version number to 0.5.1.
* Fix clippy lints for 1.78.
* More clippy fixes.
2024-05-03 11:17:05 +02:00
a09d451d11
Support top-k in the llama example. ( #2150 )
2024-05-01 22:25:47 +02:00
fa06f5f5f9
F16/BF16 bugfix (bis). ( #2143 )
...
* F16/BF16 bugfix (bis).
* Another fix.
* Yet another fix.
2024-04-29 14:08:44 +02:00
09d4845aa8
Bugfix the recent f16/bf16 changes. ( #2142 )
2024-04-29 13:30:11 +02:00
a0d03aded1
Bug Fix: When converting a tensor to a variable, clone if the tensor is already a variable. ( #2124 )
...
* When converting a tensor to a variable, clone if the tensor is already a variable.
* Add a test to ensure training a batch norm works with VarMaps
---------
Co-authored-by: Jeffrey Dallatezza <jeffreydallatezza@Jeffreys-Laptop.local >
2024-04-29 11:21:53 +02:00
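A small sketch of the behavior this fix guards, assuming `Var::from_tensor` is the conversion in question; the scenario is illustrative rather than taken from the added test.

```rust
use candle_core::{DType, Device, Result, Tensor, Var};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let v1 = Var::from_tensor(&Tensor::zeros(4, DType::F32, &dev)?)?;
    // Converting a tensor that is already variable-backed should clone the
    // storage, so v1 and v2 can be updated independently during training.
    let v2 = Var::from_tensor(v1.as_tensor())?;
    println!("{:?} {:?}", v1.shape(), v2.shape());
    Ok(())
}
```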
3bbb88fcb4
Fix sigmoid gradient calculation and move sigmoid into a specialized op ( #2114 )
...
* add sigmoid op
* small fix
* add as a method on `Tensor`
* implement gradient calculation for sigmoid
* add sigmoid tests
* we should have a specialized op for this
* fix clippy
* fix clippy 2
* Revert all previous commits in favor of a `CustomOp` based solution
* use `CustomOp1` implementation
* fix rustfmt
* experimental add metal impl
* add cuda kernel impl
* fix fmt
* Add a test + reduce some cuda duplication.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com >
2024-04-29 11:04:43 +02:00
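For reference, a minimal sketch of the specialized op, assuming it is reachable as `candle_nn::ops::sigmoid` and that gradients flow through it like any other op.

```rust
use candle_core::{Device, Result, Tensor, Var};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let x = Var::from_tensor(&Tensor::new(&[-2f32, 0.0, 3.0], &dev)?)?;
    let y = candle_nn::ops::sigmoid(x.as_tensor())?;
    // The CustomOp-based implementation also provides the backward pass:
    // d sigmoid(x)/dx = sigmoid(x) * (1 - sigmoid(x)).
    let grads = y.sum_all()?.backward()?;
    println!("{:?}", grads.get(&x).map(|g| g.shape()));
    Ok(())
}
```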
ed7b99f525
Add a toggle for F16/BF16 accumulation in gemm. ( #2141 )
...
* Add a toggle to control f16/bf16 gemm precision.
* Use the faster variant in the quantized example.
* Bugfix.
2024-04-29 09:21:07 +02:00
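Analogous to the f32/TF32 toggle above, a sketch of how the half-precision accumulation switch might be used; the function names follow the same assumed pattern and are not confirmed by the commit.

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    // Accumulate f16/bf16 matmuls in reduced precision for speed; keep the
    // default (f32 accumulation) when accuracy matters. Hypothetical API.
    candle_core::cuda::set_gemm_reduced_precision_f16(true);
    candle_core::cuda::set_gemm_reduced_precision_bf16(true);

    let dev = Device::cuda_if_available(0)?;
    let a = Tensor::randn(0f32, 1f32, (512, 512), &dev)?.to_dtype(DType::F16)?;
    let b = Tensor::randn(0f32, 1f32, (512, 512), &dev)?.to_dtype(DType::F16)?;
    let c = a.matmul(&b)?;
    println!("{:?}", c.dtype());
    Ok(())
}
```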
287013ef28
Add a forward_via_f16 method to the qmatmul op. ( #2138 )
2024-04-28 20:35:01 +02:00
eb26e2467e
Add the cuda dequantize f16 kernels. ( #2137 )
...
* Add the cuda dequantize f16 kernels.
* Expose the cuda kernels.
* Add some testing + fix.
* Test the other cases too.
* A few more tests.
* Add an environment variable to enable the dequantize f16 + matmul behavior.
2024-04-28 20:05:05 +02:00
c68ed8963f
chore: fix some typos in comments ( #2121 )
...
Signed-off-by: hardlydearly <799511800@qq.com >
2024-04-28 08:34:32 +02:00
e5c8b88f90
Apply the cast before the scaling. ( #2135 )
2024-04-28 08:30:35 +02:00
805f3be8e1
Add a sort function. ( #2134 )
2024-04-28 08:18:04 +02:00
3b429f3023
Make the dtype configurable for phi. ( #2133 )
2024-04-27 21:32:49 +02:00
96a48e5cc4
Add argsort. ( #2132 )
...
* Add the argsort cuda kernels.
* CPU version of arg-sort.
* Hook the cuda kernel + rework the cpu bits.
* Add some dedicated test.
* Working cuda kernel.
* Metal kernel.
* Metal adjustments.
* Bugfix.
* Use the fast rope in qwen.
* Rework the expert selection in qwen.
2024-04-27 20:17:35 +02:00
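A minimal sketch of the new sorting entry points, assuming they are exposed on `Tensor` as `arg_sort_last_dim` and `sort_last_dim` taking an ascending flag; the method names and signatures are assumptions.

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let xs = Tensor::new(&[[3f32, 1., 4.], [1., 5., 9.]], &dev)?;
    // Indices that would sort each row in ascending order (assumed API).
    let idx = xs.arg_sort_last_dim(true)?;
    // Sorted values together with the permutation (assumed API).
    let (sorted, _indices) = xs.sort_last_dim(true)?;
    println!("{idx}\n{sorted}");
    Ok(())
}
```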
6cf82fd7a3
Add Olmo models ( #2127 )
...
* add olmo support
* add olmo readme
* Fix fmt.
* Fix clippy.
* Get olmo to work on cuda.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com >
2024-04-26 11:02:51 +02:00
cfab6e7616
Mention phi-v3 in the readmes. ( #2122 )
2024-04-24 20:54:24 +02:00
11d4a3c588
Add the phi-3 model. ( #2120 )
...
* Add the phi-3 model.
* Faster rope.
* Bugfix.
* Fix the detokenization.
2024-04-24 09:48:13 +02:00
9d3f1c8af5
Add the phi-v3 quantized model. ( #2118 )
...
* Add the phi-v3 quantized model.
* Also include phi-3 in the main phi example.
2024-04-24 08:22:23 +02:00
7211009179
Fix for rustfmt. ( #2117 )
2024-04-23 19:09:33 +02:00
6fadaf2eff
candle-onnx: add operators RandomUniform and Exp ( #2116 )
...
* Add basic RandomUniform implementation
* Use is_some to check if seed is present
* Added Exp operator implementation
---------
Co-authored-by: Mateusz Okulus <mmokulus@gmail.com >
2024-04-23 19:02:19 +02:00
8a05743a21
Add StorageRef. ( #2113 )
...
* Add the storage-ref bits.
* Add the metal implementation.
2024-04-23 13:23:27 +02:00
b2e816752b
Use the faster rms-norm kernel for llama. ( #2107 )
...
* Use the faster rms-norm kernel for llama.
* Use the fast variant by default.
2024-04-22 18:52:00 +02:00
618ecf5e23
Better time measurement for the llama example. ( #2106 )
2024-04-22 17:54:27 +02:00
267601eec1
Update tokenizers requirement from 0.15.0 to 0.19.1 ( #2104 )
...
Updates the requirements on [tokenizers](https://github.com/huggingface/tokenizers ) to permit the latest version.
- [Release notes](https://github.com/huggingface/tokenizers/releases )
- [Changelog](https://github.com/huggingface/tokenizers/blob/main/RELEASE.md )
- [Commits](https://github.com/huggingface/tokenizers/compare/v0.15.0...v0.15.2 )
---
updated-dependencies:
- dependency-name: tokenizers
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-22 17:10:46 +02:00
08a15cb79e
Update zip requirement from 0.6.6 to 1.1.1 ( #2103 )
...
* Update zip requirement from 0.6.6 to 1.1.1
---
updated-dependencies:
- dependency-name: zip
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com >
* Fix for the zip crate update.
---------
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: laurent <laurent.mazare@gmail.com >
2024-04-22 16:23:27 +02:00