Commit Graph

346 Commits

Author SHA1 Message Date
3e89df938c Starcoder fix (#264)
* Bugfix for starcoder.

* Get some proper code generation.

* Slightly simpler softmax.
2023-07-28 11:17:49 +01:00
6a54ca115e Add some Bigcode model (#260)
* Start sketching the bigcode gpt model.

* Sketch the bigcode model.

* Implement the attention mechanism.

* Random reshaping.

* Sketch more of the example.

* Add some kv cache.

* Properly generate the position ids.

* Proper attention mask.

* Bail on upcasting.

* Properly apply the attention mask.

* Add the smaller starcoder variants.

* Update for the new hub api.

* Fix a shape issue.

* Fix another shape issue.

* Get some logits out.

* Adjust the weigth names.
2023-07-28 09:57:32 +01:00
4f260ef025 Merge pull request #216 from LaurentMazare/llama_multiprocess2
TP sharding v2
2023-07-28 08:06:13 +01:00
ca479a873e Upgrading hf-hub to 0.2.0 (Modified API to not pass the Repo around
all the time)
2023-07-27 20:05:02 +02:00
25a2086e8f Putting back Send + Sync 2023-07-27 09:58:47 +02:00
7c7e6ba201 Removing inner dependency on safetensors. 2023-07-27 09:58:47 +02:00
ed58de7551 Fixed TP sharded version. 2023-07-27 09:58:46 +02:00
1735e4831e TP sharding v2 2023-07-27 09:58:14 +02:00
209f06d7c3 Micro-cleanup. (#256) 2023-07-27 07:55:54 +01:00
84ad558e50 Switch to using llama-v2 by default. (#251) 2023-07-26 17:18:27 +01:00
1235aa2536 Use bail rather than wrapping a string where possible. (#249)
* Use bail rather than wrapping a string where possible.

* Revert the cuda default bit.
2023-07-26 15:42:46 +01:00
f052ba76cb Lining up the flash attn version with the non-flash one. (#248)
* Move the flash-attn function in the proper crate.

* Causality tweak.
2023-07-26 15:11:45 +01:00
8b1d12bead Merge pull request #246 from LaurentMazare/rename_custom_op
Rename exposed ops.
2023-07-26 14:20:29 +01:00
2ce5f12513 Again set a few extra params in flash-attn. (#245)
* Again set a few extra params.

* Use the appropriate kernel sizes.

* Add all the kernel sizes.

* Parallel compiling.

* Reduce the amount of parallelism.

* Add the missing kernel.

* Fix a typo.

* Remove bf16 support for now.
2023-07-26 14:16:37 +01:00
1a5416ec35 Rename exposed ops. 2023-07-26 12:43:19 +02:00
fa2b64d678 Proper flash-attn parameters. (#244)
* Proper flash-attn parameters.

* Set the flash attention parameters.

* Add more validations.

* Setup the o_ flash attn parameters.

* More flash-attn support.

* Set more flash attn parameters.
2023-07-26 10:13:40 +01:00
e40b150bbe Better handling of dtypes in llama. (#243) 2023-07-26 08:28:33 +01:00
d9f9c859af Add flash attention (#241)
* Add some flash-attn kernel, import the code for flash-attn v2 from Dao-AILab.

* More flash attn.

* Set up the flash attn parameters.

* Get things to compile locally.

* Move the flash attention files in a different directory.

* Build the static C library with nvcc.

* Add more flash attention.

* Update the build part.

* Better caching.

* Exclude flash attention from the default workspace.

* Put flash-attn behind a feature gate.

* Get the flash attn kernel to run.

* Move the flags to a more appropriate place.

* Enable flash attention in llama.

* Use flash attention in llama.
2023-07-26 07:48:10 +01:00
550a13a547 Use the binary decoder for llama2.c. (#230)
* Use the binary decoder for llama2.c.

* Add the temperature.

* Formatting tweak.

* Fix the rotary embeddings.
2023-07-24 10:56:08 +01:00
35b65fed88 Add llama2.c as an example. (#229)
* Start adding llama2.c.

* Model loading.

* Add the llama-v2 model.

* Start converting the weights.

* Rotary embedding tweaks.

* Get the model to generate some tokens.
2023-07-24 09:13:50 +01:00
b6f7dfb682 CPU implementation for the custom RMS example. (#228)
* CPU implementation for the custom RMS example.

* Add the eps parameter.
2023-07-23 20:04:20 +01:00
e449ce53a2 Wrapping code to call the custom op. (#225)
* Wrapping code to call the custom op.

* Get the rms example to work.

* Get around rustfmt failing in the CI.

* Fix the rms computation.
2023-07-23 11:31:17 +01:00
b8a10425ad Kernel build example (#224)
* Build example kernels.

* Add some sample custom kernel.

* Get the example kernel to compile.

* Add some cuda code.

* More cuda custom op.

* More cuda custom ops.
2023-07-23 07:15:37 +01:00
1f26042693 Move some shared functions to the nn module. (#221) 2023-07-22 13:25:11 +01:00
43c7223292 Rename the .r functions to .dims so as to be a bit more explicit. (#220) 2023-07-22 10:39:27 +01:00
52c5d8c087 Add the gather op. (#219)
* Start adding gather.

* Gather cpu implementation + use in simple training.

* Add scatter_add for the gradient of gather.

* Simple cpu implementation of scatter_add.

* Use gather in the simple-training backprop.
2023-07-22 07:21:28 +01:00
410654525f Refactor the reduce ops in order to introduce argmin/argmax. (#212)
* Refactor the reduce ops in order to introduce argmin/argmax.

* Clippy fixes.

* Use the newly introduced argmax.

* Fix the strided case.

* Handle the non-contiguous case.
2023-07-21 11:41:08 +01:00
c60831aad4 Add more gradient tests + bugfixes. (#211)
* Add more gradient tests + bugfixes.

* More tests and fixes.

* More tests.
2023-07-21 06:52:39 +01:00
4845d5cc64 More realistic training setup. (#210)
* More realistic training setup.

* Compute the model accuracy.

* Very inefficient backprop for index select.

* More backprop.

* Fix some backprop issues.

* Backprop fix.

* Another broadcasting backprop fix.

* Better backprop for reducing ops.

* Training again.

* Add some gradient tests.

* Get the training to work.
2023-07-20 18:25:41 +01:00
12d6dc018d Support for MQA for llama v2. (#205)
* Support for MQA for llama v2.

* More llama-v2.

* Move the rotary embedding precomputation in the cache.

* Add a v2 flag.

* Use the hf model.
2023-07-20 06:39:04 +01:00
9515e8ea6c Merge branch 'main' into remove_wrapper 2023-07-19 18:53:55 +02:00
e6584476c4 Merge pull request #200 from LaurentMazare/removing_candle_hub
Removing `candle-hub` internal to extract into `hf-hub` standalone.
2023-07-19 17:27:55 +02:00
cb687b4897 Add some more developed training examples. (#199)
* Use contiguous tensors for variables.

* Sketch the mnist example.

* Start adding the reduce ops.

* Renaming.

* Refactor the reduce operations.

* Bugfix for the broadcasting vectorization.
2023-07-19 15:37:52 +01:00
dfd624dbd3 [Proposal] Remove SafeTensor wrapper (allows finer control for users). 2023-07-19 16:25:44 +02:00
439321745a Removing candle-hub internal to extract into hf-hub standalone. 2023-07-19 15:04:38 +02:00
ff61a42ad7 Use mkl to accelerate binary ops. (#190)
* Vectorized binary ops with mkl.

* Improve the binary op mkl support.

* Push the support for mkl binary ops.

* Proper vectorization of binary ops.

* Proper mkl'isation when broadcasting binary ops.
2023-07-18 12:04:39 +01:00
b706f32839 Add Shape try into (#189)
* Add the TryInto trait for shapes.

* Use the vectorized operations in block mode too.
2023-07-18 10:52:16 +01:00
d6313d2447 Add more tracing details to bert. (#188) 2023-07-18 08:11:05 +01:00
b8abe2bb4b Factorize the tokenizers version in the workspace cargo def. (#186) 2023-07-18 06:48:13 +01:00
f0cccd08f0 Bert tracing (#184)
* Add some tracing to bert.

* More tracing.

* Add a flag for tracing.
2023-07-17 19:40:42 +01:00
104f89df31 Centralize the dependency versions and inherit them. (#177) 2023-07-16 07:47:17 +01:00
66750f9827 Add some 'cuda-if-available' helper function. (#172) 2023-07-15 08:25:15 +01:00
4ed56d7861 Removing cuda default.
Seems very important for a lot of exploring users usually on laptop
without GPUs.

Adding more README instructions in a follow up.
2023-07-14 16:52:15 +02:00
a2f72edc0d Simplify the parameters used by sum and sum_keepdim. (#165) 2023-07-14 08:22:08 +01:00
2bfa791336 Use the same default as pytorch for sum. (#164) 2023-07-13 21:32:32 +01:00
3c02ea56b0 Add a cli argument to easily switch the dtype. (#161) 2023-07-13 19:18:49 +01:00
50b0946a2d Tensor mutability (#154)
* Working towards tensor mutability.

* Use a ref-cell to provide tensor mutability.
2023-07-13 11:04:40 +01:00
a3663ce2f2 Encodec forward pass (#153)
* Sketch the forward pass for encodec.

* Forward pass for the encodec resnet block.

* Encodec decoding.
2023-07-13 08:18:39 +01:00
6c75a98ad2 Add the forward pass for the T5 model. (#152)
* Add the forward pass for the T5 model.

* More t5 forward pass.
2023-07-12 22:02:40 +01:00
ba35d895e7 Sketch the candle-transformers crate. (#147)
* Sketch the candle-transformers crate.

* Format the empty files.
2023-07-12 13:49:31 +01:00