3fba2b5fc4
Add the SmolLM2 models. ( #2595 )
...
* Add the SmolLM2 models.
* More SmolLM2 support.
2024-11-03 17:11:12 +01:00
3699c1a053
Fix the repo name for llama 3.1. ( #2576 )
...
* Fix the repo name for llama 3.1.
* Fix the book.
2024-10-26 11:25:04 +02:00
ad8a4c5e5a
Add some llama-3.2 examples. ( #2508 )
...
* Add some llama-3.2 examples.
* Support tie-word-embeddings for llama.
2024-09-26 21:00:18 +02:00
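The tie-word-embeddings support mentioned above means the output projection reuses the token-embedding matrix instead of a separate lm_head weight. A minimal sketch of the idea in plain Rust with a toy vocabulary, not candle's tensor API:

```rust
/// With tied word embeddings the lm_head reuses the token-embedding matrix:
/// logits[v] = dot(hidden, embedding[v]). Plain-Rust sketch with a tiny
/// vocab/hidden size, not candle's tensor ops.
fn tied_lm_head(hidden: &[f32], embedding: &[Vec<f32>]) -> Vec<f32> {
    embedding
        .iter()
        .map(|row| row.iter().zip(hidden).map(|(e, h)| e * h).sum())
        .collect()
}

fn main() {
    // 3-token vocabulary, hidden size 2; the same matrix that embeds tokens
    // also produces the output logits.
    let embedding = vec![vec![1.0f32, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]];
    let hidden = [0.5f32, 2.0];
    println!("{:?}", tied_lm_head(&hidden, &embedding));
}
```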
0f5cbb08b3
Add support for Llama 3.1 ( #2359 )
...
* Add Llama 3.1 rope
* Clippy
* Format
* Clippy
* Add support for multiple eos tokens:
* Untagged either
* Remove either dep and fix settings.json
* Make the max positional embeddings configurable
2024-07-26 21:32:26 +02:00
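The Llama 3.1 rope change above rescales the rotary inverse frequencies so the model handles contexts longer than the original pre-training window. A plain-Rust sketch of that rescaling rule, using the commonly quoted Llama 3.1 defaults (scaling factor 8, low/high frequency factors 1 and 4, original context 8192) as assumptions rather than values read from candle:

```rust
use std::f32::consts::PI;

/// Rescale rotary inverse frequencies the way Llama 3.1 extends its context.
/// Parameter values are the commonly published defaults, not read from candle.
fn scale_rope_freqs(inv_freqs: &[f32]) -> Vec<f32> {
    let factor = 8.0f32;              // overall context extension factor
    let low_freq_factor = 1.0f32;
    let high_freq_factor = 4.0f32;
    let original_max_pos = 8192.0f32; // pre-training context length

    let low_freq_wavelen = original_max_pos / low_freq_factor;
    let high_freq_wavelen = original_max_pos / high_freq_factor;

    inv_freqs
        .iter()
        .map(|&freq| {
            let wavelen = 2.0 * PI / freq;
            if wavelen < high_freq_wavelen {
                // High-frequency components are left untouched.
                freq
            } else if wavelen > low_freq_wavelen {
                // Low-frequency components are slowed down by the full factor.
                freq / factor
            } else {
                // Smooth interpolation in between.
                let smooth = (original_max_pos / wavelen - low_freq_factor)
                    / (high_freq_factor - low_freq_factor);
                (1.0 - smooth) * freq / factor + smooth * freq
            }
        })
        .collect()
}

fn main() {
    // Toy inverse frequencies chosen to hit the three regimes.
    let freqs = [1.0f32, 1e-3, 1e-4];
    println!("{:?}", scale_rope_freqs(&freqs));
}
```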
a09d451d11
Support top-k in the llama example. ( #2150 )
2024-05-01 22:25:47 +02:00
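Top-k sampling, added to the example above, keeps only the k most likely tokens before drawing the next one. A minimal plain-Rust sketch of the filtering step, assuming the logits are already on the host (not candle's sampling helper):

```rust
/// Keep only the `k` largest logits, setting the rest to -inf so they get
/// zero probability after the softmax. Ties at the threshold are kept.
fn top_k_filter(logits: &mut [f32], k: usize) {
    if k == 0 || k >= logits.len() {
        return;
    }
    let mut sorted = logits.to_vec();
    sorted.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let threshold = sorted[k - 1];
    for l in logits.iter_mut() {
        if *l < threshold {
            *l = f32::NEG_INFINITY;
        }
    }
}

fn main() {
    let mut logits = vec![0.1f32, 2.0, -1.0, 0.5, 1.5];
    top_k_filter(&mut logits, 2);
    println!("{logits:?}"); // only the two largest remain finite
}
```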
618ecf5e23
Better time measurement for the llama example. ( #2106 )
2024-04-22 17:54:27 +02:00
52ae332910
Use llama v3 by default + add to readme. ( #2094 )
2024-04-20 16:11:24 +02:00
9c532aef47
Also enable llama-v3 8b instruct. ( #2088 )
2024-04-19 08:50:06 +02:00
e6ee7ba4d4
Llama v3. ( #2085 )
...
* Llama v3.
* Tweak the default params + handle special tokens.
* Small tweak.
2024-04-18 22:19:54 +02:00
28057781aa
Make the cache for the llama model explicit too. ( #1745 )
2024-02-22 12:04:33 +01:00
7c7400fb63
Use the tokenizer-output-stream in the llama example. ( #1715 )
...
* Use the tokenizer-output-stream in the llama example.
* Also use tokenizer-output-stream for llama2-c.
2024-02-15 16:47:33 +01:00
84250bf52f
fix index_pos bug when kv cache is disabled. ( #1517 )
...
* fix index_pos bug when kv cache is disabled
* Tweak the fix.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
2024-01-06 11:43:01 +01:00
1fb2dd905c
Add support for tiny-llama-1.1b. ( #1512 )
2023-12-31 12:18:25 +01:00
996a7f2e24
Rework the llama example config, add the solar model. ( #1485 )
2023-12-26 22:24:04 +01:00
bb3471ea31
Adapt more examples to the updated safetensor api. ( #947 )
...
* Simplify the safetensor usage.
* Convert more examples.
* Move more examples.
* Adapt stable-diffusion.
2023-09-23 21:26:03 +01:00
805bf9ffa7
Implement top_p / nucleus sampling ( #819 )
...
* Implement top_p / nucleus sampling
* Update changelog
* rustfmt
* Add tests
* Fix clippy warning
* Fix another clippy error
2023-09-12 18:10:16 +02:00
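Nucleus (top-p) sampling keeps the smallest set of tokens whose cumulative probability reaches p and renormalizes over it. A hedged plain-Rust sketch of the filtering step; the actual implementation in the sampling code may differ in details such as tie handling:

```rust
/// Zero out every token outside the smallest nucleus whose cumulative
/// probability reaches `top_p`, then renormalize. Plain-Rust sketch only.
fn top_p_filter(probs: &mut [f32], top_p: f32) {
    // Sort token indices by descending probability.
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());

    let mut cum = 0.0f32;
    let mut keep = vec![false; probs.len()];
    for &i in &idx {
        keep[i] = true;
        cum += probs[i];
        if cum >= top_p {
            break;
        }
    }
    for (i, p) in probs.iter_mut().enumerate() {
        if !keep[i] {
            *p = 0.0;
        }
    }
    let total: f32 = probs.iter().sum();
    for p in probs.iter_mut() {
        *p /= total;
    }
}

fn main() {
    let mut probs = vec![0.5f32, 0.3, 0.15, 0.05];
    top_p_filter(&mut probs, 0.8);
    println!("{probs:?}"); // [0.625, 0.375, 0.0, 0.0]
}
```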
d3f05eae8c
Move some models to candle-transformers so that it's easier to re-use. ( #794 )
...
* Move some models to candle-transformers so that they can be shared.
* Also move falcon.
* Move Llama.
* Move whisper (partial).
2023-09-10 09:40:27 +01:00
6e485f2deb
Add some optional repeat penalty. ( #623 )
...
* Add some optional repeat penalty.
* Add the missing files.
2023-08-27 10:48:45 +01:00
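The optional repeat penalty above follows the usual CTRL-style rule: logits of tokens already present in the context are scaled so repetition becomes less likely. A minimal plain-Rust sketch, not candle's helper:

```rust
/// Apply a CTRL-style repeat penalty: logits of tokens already in the context
/// are divided by `penalty` when positive and multiplied when negative.
fn apply_repeat_penalty(logits: &mut [f32], penalty: f32, context: &[u32]) {
    for &tok in context {
        if let Some(l) = logits.get_mut(tok as usize) {
            if *l >= 0.0 {
                *l /= penalty;
            } else {
                *l *= penalty;
            }
        }
    }
}

fn main() {
    let mut logits = vec![1.0f32, -2.0, 0.5, 3.0];
    apply_repeat_penalty(&mut logits, 1.2, &[0, 3]);
    println!("{logits:?}");
}
```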
c105550405
s/panic/bail/
2023-08-25 18:05:07 +02:00
4826a4212e
Adding support for codellama in examples.
...
Codellama requires bf16 for now (converting from bf16 to f16 currently errors out).
The multiprocess demo is not functional for it because flash-attn only supports
f16 for now.
2023-08-25 09:56:11 +00:00
f9ecc84477
GQA support in the quantized model. ( #555 )
...
* GQA support in the quantized model.
* Fix the reshaping.
* Fix the main llama model.
* Infer the proper gqa from the model kind.
2023-08-22 19:41:10 +01:00
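With grouped-query attention (GQA) the key/value projections have fewer heads than the queries, and each query head attends using the kv head of its group; the quantized model additionally infers the group count from the model kind. A small plain-Rust sketch of the head-index mapping:

```rust
/// Map a query head to the key/value head it shares under grouped-query
/// attention. With n_kv_head == n_head this degenerates to vanilla MHA,
/// with n_kv_head == 1 it is multi-query attention. Sketch only.
fn kv_head_for(query_head: usize, n_head: usize, n_kv_head: usize) -> usize {
    assert!(n_head % n_kv_head == 0, "heads must divide evenly into groups");
    let group_size = n_head / n_kv_head;
    query_head / group_size
}

fn main() {
    // 32 query heads sharing 8 kv heads, i.e. groups of 4.
    for q in [0, 3, 4, 31] {
        println!("query head {q} -> kv head {}", kv_head_for(q, 32, 8));
    }
}
```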
c78ce76501
Add a simple Module trait and implement it for the various nn layers ( #500 )
...
* Start adding the module trait.
* Use the module trait.
* Implement module for qmatmul.
2023-08-18 09:38:22 +01:00
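The Module trait above gives every layer a uniform forward entry point so layers can be composed and swapped generically. A sketch of the idea with a stand-in Tensor type; candle's actual trait operates on its own Tensor and returns a Result, so treat this purely as an illustration:

```rust
// A stand-in tensor type; this is only a sketch of the shape of the abstraction.
#[derive(Debug, Clone)]
struct Tensor(Vec<f32>);

trait Module {
    fn forward(&self, xs: &Tensor) -> Tensor;
}

/// A toy layer that scales its input, just to show how layers compose
/// once they share the trait.
struct Scale(f32);

impl Module for Scale {
    fn forward(&self, xs: &Tensor) -> Tensor {
        Tensor(xs.0.iter().map(|x| x * self.0).collect())
    }
}

/// Run a stack of layers in sequence through the shared trait object.
fn forward_all(layers: &[Box<dyn Module>], xs: &Tensor) -> Tensor {
    layers.iter().fold(xs.clone(), |acc, l| l.forward(&acc))
}

fn main() {
    let layers: Vec<Box<dyn Module>> = vec![Box::new(Scale(2.0)), Box::new(Scale(0.5))];
    println!("{:?}", forward_all(&layers, &Tensor(vec![1.0, 2.0, 3.0])));
}
```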
13401df4d1
Add an abstract type for RmsNorm. ( #499 )
2023-08-18 08:52:14 +01:00
d32e8199cd
Layer norm tweaks ( #482 )
...
* Add some options to make layer-norm more configurable.
* Add the rms-norm variant.
* Replace the RmsNorm with the shared bits.
2023-08-17 10:07:13 +01:00
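RmsNorm, added above as a layer-norm variant, only rescales by the root-mean-square of the activations: no mean subtraction and no bias term. A per-row sketch in plain Rust (candle's shared implementation works on tensors):

```rust
/// RMS normalization over one row of activations: x / rms(x) * weight,
/// with no mean subtraction and no bias. Plain-Rust sketch.
fn rms_norm(xs: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = xs.iter().map(|x| x * x).sum::<f32>() / xs.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    xs.iter()
        .zip(weight.iter())
        .map(|(x, w)| x * scale * w)
        .collect()
}

fn main() {
    let xs = [1.0f32, 2.0, 3.0, 4.0];
    let weight = [1.0f32; 4];
    println!("{:?}", rms_norm(&xs, &weight, 1e-5));
}
```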
c5f45887dc
Add some tracing to the quantized example. ( #473 )
2023-08-16 18:49:08 +01:00
102fa4c2e3
Fixing llamav1
2023-08-16 14:53:29 +02:00
3071134788
Get the ggml based llama to generate some text. ( #464 )
...
* Add more stats to the ggml example.
* Build a quantized model from the file content.
* Move the tensor retrieval in the main crate.
* Start adding the forward pass.
* Add more to the forward pass of the quantized llama.
* Apply the attention layers.
* Add the sampling loop.
* Get the sampling loop to work.
* Minor tweak.
* Add a quantize/dequantize test.
* Bugfix.
* Add a comment + swap the order.
* Bugfixes.
2023-08-16 12:41:07 +01:00
33c882ea74
Clippy.
2023-08-16 10:41:00 +02:00
76804730c6
Using the real config from the hub when available.
2023-08-16 10:36:01 +02:00
5b1690fffa
Tweak the llama example. ( #450 )
2023-08-15 12:18:20 +01:00
3cc87058b7
Support local weights & dynamic outputs ( #447 )
...
* Support local weights & dynamic outputs
* Revise as suggested
* Cargo code format
2023-08-15 11:51:57 +01:00
c84883ecf2
Add a cuda kernel for upsampling. ( #441 )
...
* Add a cuda kernel for upsampling.
* Update for the latest tokenizers version.
2023-08-14 13:12:17 +01:00
906c0f3eb5
Remove the checkpoint conversion script. ( #405 )
...
* Remove the checkpoint conversion script.
* Remove references to the script.
2023-08-11 05:59:48 +01:00
b278834267
Support the Accelerate BLAS on macOS. ( #325 )
...
* Add the accelerate feature.
* Ffi tweaks.
2023-08-05 17:25:24 +01:00
df6667ba88
Add some tracing to llama. ( #318 )
2023-08-03 13:52:22 +01:00
4bf2ebf836
Use u8 tensors for masks. ( #273 )
2023-07-29 11:32:58 +01:00
50d8273ae4
Support both llama v1 and llama v2. ( #272 )
2023-07-28 18:40:59 +01:00
7513a5e005
Line-up the llama implementation with the python-transformers one. ( #271 )
...
* Line-up the llama implementation with the python-transformers one.
* Also lineup the multiprocess version.
2023-07-28 18:31:28 +01:00
3eb2bc6d07
Softmax numerical stability. ( #267 )
...
* Softmax numerical stability.
* Fix the flash-attn test.
2023-07-28 13:13:01 +01:00
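The softmax numerical-stability fix is the standard max-subtraction trick: subtracting the row maximum before exponentiating keeps every exponential in a representable range without changing the result. A plain-Rust sketch:

```rust
/// Softmax with the usual stability trick: subtract the maximum before
/// exponentiating so exp() never overflows. Plain-Rust sketch.
fn softmax(xs: &[f32]) -> Vec<f32> {
    let max = xs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = xs.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    // Without the max subtraction, exp(1000.0) would overflow to infinity.
    println!("{:?}", softmax(&[1000.0f32, 999.0, 998.0]));
}
```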
ca479a873e
Upgrading hf-hub to 0.2.0
...
(Modified API to not pass the Repo around all the time)
2023-07-27 20:05:02 +02:00
84ad558e50
Switch to using llama-v2 by default. ( #251 )
2023-07-26 17:18:27 +01:00
f052ba76cb
Lining up the flash attn version with the non-flash one. ( #248 )
...
* Move the flash-attn function in the proper crate.
* Causality tweak.
2023-07-26 15:11:45 +01:00
2ce5f12513
Again set a few extra params in flash-attn. ( #245 )
...
* Again set a few extra params.
* Use the appropriate kernel sizes.
* Add all the kernel sizes.
* Parallel compiling.
* Reduce the amount of parallelism.
* Add the missing kernel.
* Fix a typo.
* Remove bf16 support for now.
2023-07-26 14:16:37 +01:00
fa2b64d678
Proper flash-attn parameters. ( #244 )
...
* Proper flash-attn parameters.
* Set the flash attention parameters.
* Add more validations.
* Setup the o_ flash attn parameters.
* More flash-attn support.
* Set more flash attn parameters.
2023-07-26 10:13:40 +01:00
e40b150bbe
Better handling of dtypes in llama. ( #243 )
2023-07-26 08:28:33 +01:00
d9f9c859af
Add flash attention ( #241 )
...
* Add some flash-attn kernel, import the code for flash-attn v2 from Dao-AILab.
* More flash attn.
* Set up the flash attn parameters.
* Get things to compile locally.
* Move the flash attention files in a different directory.
* Build the static C library with nvcc.
* Add more flash attention.
* Update the build part.
* Better caching.
* Exclude flash attention from the default workspace.
* Put flash-attn behind a feature gate.
* Get the flash attn kernel to run.
* Move the flags to a more appropriate place.
* Enable flash attention in llama.
* Use flash attention in llama.
2023-07-26 07:48:10 +01:00
43c7223292
Rename the .r functions to .dims so as to be a bit more explicit. ( #220 )
2023-07-22 10:39:27 +01:00
12d6dc018d
Support for MQA for llama v2. ( #205 )
...
* Support for MQA for llama v2.
* More llama-v2.
* Move the rotary embedding precomputation in the cache.
* Add a v2 flag.
* Use the hf model.
2023-07-20 06:39:04 +01:00
439321745a
Removing the internal candle-hub, which is being extracted into the standalone hf-hub crate.
2023-07-19 15:04:38 +02:00
66750f9827
Add some 'cuda-if-available' helper function. ( #172 )
2023-07-15 08:25:15 +01:00