52 Commits

Author SHA1 Message Date
3fba2b5fc4 Add the SmolLM2 models. (#2595)
* Add the SmolLM2 models.

* More SmolLM2 support.
2024-11-03 17:11:12 +01:00
3699c1a053 Fix the repo name for llama 3.1. (#2576)
* Fix the repo name for llama 3.1.

* Fix the book.
2024-10-26 11:25:04 +02:00
ad8a4c5e5a Add some llama-3.2 examples. (#2508)
* Add some llama-3.2 examples.

* Support tie-word-embeddings for llama.
2024-09-26 21:00:18 +02:00
0f5cbb08b3 Add support for Llama 3.1 (#2359)
* Add Llama 3.1 rope

* Clippy

* Format

* Clippy

* Add support for multiple eos tokens:

* Untagged either

* Remove either dep and fix settings.json

* Make the max positional embeddings configurable
2024-07-26 21:32:26 +02:00
a09d451d11 Support top-k in tthe llama example. (#2150) 2024-05-01 22:25:47 +02:00
618ecf5e23 Better time measurement for the llama example. (#2106) 2024-04-22 17:54:27 +02:00
52ae332910 Use llama v3 by default + add to readme. (#2094) 2024-04-20 16:11:24 +02:00
9c532aef47 Also enable llama-v3 8b instruct. (#2088) 2024-04-19 08:50:06 +02:00
e6ee7ba4d4 Llama v3. (#2085)
* Llama v3.

* Tweak the default params + handle special tokens.

* Small tweak.
2024-04-18 22:19:54 +02:00
28057781aa Make the cache for the llama model explicit too. (#1745) 2024-02-22 12:04:33 +01:00
7c7400fb63 Use the tokenizer-output-stream in the llama example. (#1715)
* Use the tokenizer-output-stream in the llama example.

* Also use tokenizer-output-stream for llama2-c.
2024-02-15 16:47:33 +01:00
84250bf52f fix index_pos bug when kv cache is disabled. (#1517)
* fix index_pos bug when kv cache is disabled

* Tweak the fix.

---------

Co-authored-by: laurent <laurent.mazare@gmail.com>
2024-01-06 11:43:01 +01:00
1fb2dd905c Add support for tiny-llama-1.1b. (#1512) 2023-12-31 12:18:25 +01:00
996a7f2e24 Rework the llama example config, add the solar model. (#1485) 2023-12-26 22:24:04 +01:00
bb3471ea31 Adapt more examples to the updated safetensor api. (#947)
* Simplify the safetensor usage.

* Convert more examples.

* Move more examples.

* Adapt stable-diffusion.
2023-09-23 21:26:03 +01:00
805bf9ffa7 Implement top_p / nucleus sampling (#819)
* Implement top_p / nucleus sampling

* Update changelog

* rustfmt

* Add tests

* Fix clippy warning

* Fix another clippy error
2023-09-12 18:10:16 +02:00
d3f05eae8c Move some models to candle-transformers so that it's easier to re-use. (#794)
* Move some models to candle-transformers so that they can be shared.

* Also move falcon.

* Move Llama.

* Move whisper (partial).
2023-09-10 09:40:27 +01:00
6e485f2deb Add some optional repeat penalty. (#623)
* Add some optional repeat penalty.

* Add the missing files.
2023-08-27 10:48:45 +01:00
c105550405 s/panic/bail/ 2023-08-25 18:05:07 +02:00
4826a4212e Adding support for codellama in examples.
Codellama requires bf16 for now (error to convert from bf16 to f16).
Multiprocess demo not functional for it because flash-attn only supports
f16 for now.
2023-08-25 09:56:11 +00:00
c5f45887dc Add some tracing to the quantized example. (#473) 2023-08-16 18:49:08 +01:00
76804730c6 Using the real config from the hub when available. 2023-08-16 10:36:01 +02:00
5b1690fffa Tweak the llama example. (#450) 2023-08-15 12:18:20 +01:00
3cc87058b7 Support local weights & dynamic outputs (#447)
* Support local weights & dynamic outputs

* Revise as suggested

* Cargo code format
2023-08-15 11:51:57 +01:00
c84883ecf2 Add a cuda kernel for upsampling. (#441)
* Add a cuda kernel for upsampling.

* Update for the latest tokenizers version.
2023-08-14 13:12:17 +01:00
906c0f3eb5 Remove the checkpoint conversion script. (#405)
* Remove the checkpoint conversion script.

* Remove references to the script.
2023-08-11 05:59:48 +01:00
b278834267 Support the Accelerate BLAS on macOS. (#325)
* Add the accelerate feature.

* Ffi tweaks.
2023-08-05 17:25:24 +01:00
df6667ba88 Add some tracing to llama. (#318) 2023-08-03 13:52:22 +01:00
50d8273ae4 Support both llama v1 and llama v2. (#272) 2023-07-28 18:40:59 +01:00
ca479a873e Upgrading hf-hub to 0.2.0 (Modified API to not pass the Repo around
all the time)
2023-07-27 20:05:02 +02:00
84ad558e50 Switch to using llama-v2 by default. (#251) 2023-07-26 17:18:27 +01:00
e40b150bbe Better handling of dtypes in llama. (#243) 2023-07-26 08:28:33 +01:00
d9f9c859af Add flash attention (#241)
* Add some flash-attn kernel, import the code for flash-attn v2 from Dao-AILab.

* More flash attn.

* Set up the flash attn parameters.

* Get things to compile locally.

* Move the flash attention files in a different directory.

* Build the static C library with nvcc.

* Add more flash attention.

* Update the build part.

* Better caching.

* Exclude flash attention from the default workspace.

* Put flash-attn behind a feature gate.

* Get the flash attn kernel to run.

* Move the flags to a more appropriate place.

* Enable flash attention in llama.

* Use flash attention in llama.
2023-07-26 07:48:10 +01:00
12d6dc018d Support for MQA for llama v2. (#205)
* Support for MQA for llama v2.

* More llama-v2.

* Move the rotary embedding precomputation in the cache.

* Add a v2 flag.

* Use the hf model.
2023-07-20 06:39:04 +01:00
439321745a Removing candle-hub internal to extract into hf-hub standalone. 2023-07-19 15:04:38 +02:00
66750f9827 Add some 'cuda-if-available' helper function. (#172) 2023-07-15 08:25:15 +01:00
4ed56d7861 Removing cuda default.
Seems very important for a lot of exploring users usually on laptop
without GPUs.

Adding more README instructions in a follow up.
2023-07-14 16:52:15 +02:00
3c02ea56b0 Add a cli argument to easily switch the dtype. (#161) 2023-07-13 19:18:49 +01:00
ba35d895e7 Sketch the candle-transformers crate. (#147)
* Sketch the candle-transformers crate.

* Format the empty files.
2023-07-12 13:49:31 +01:00
eae646d322 Use arange in the examples. (#146) 2023-07-12 12:12:34 +01:00
20599172ac Add from_iter and arange, use it in the doctests. (#145) 2023-07-12 12:03:01 +01:00
b3b39cca92 Llama batch (#144)
* Add a batch dimension to llama.

* Bugfixes.
2023-07-12 11:38:19 +01:00
fa760759e5 Allow for lazy loading of npz files, use it in llama to reduce memory usage in the cpu version. (#141) 2023-07-11 20:22:34 +01:00
37cad85869 Resurrect the llama npy support. (#140) 2023-07-11 19:32:10 +01:00
760f1d7055 Refactor the llama example to make it more in sync with the other ones. (#139)
* Refactor the llama example to make it more in sync with the other ones.

* Make clippy happy.

* Properly load the safetensor weights.

* Get llama back to a working state for the safetensors case.
2023-07-11 17:20:55 +01:00
e923b3adc2 Add a KV cache to falcon. (#104) 2023-07-07 17:24:38 +01:00
115629fe08 Creating new sync Api for candle-hub.
- `api::Api` -> `api::tokio::api` (And created new `api::sync::Api`).
- Remove `tokio` from all our examples.
- Using similar codebase for now instead of ureq (for simplicity).
2023-07-06 15:15:25 +02:00
dd60bd84bb MKL adjustments. (#87) 2023-07-06 11:37:27 +01:00
c297a50960 Add mkl support for matrix multiply. (#86)
* Fix some rebase issues.

* Use mkl instead.

* Use mkl in bert.

* Add the optional mkl feature.

* Conditional compilation based on the mkl feature.

* Add more mkl support.
2023-07-06 11:05:05 +01:00
e2bfbcb79c Support dim indexes in cat. 2023-07-05 20:39:08 +01:00