Commit Graph

691 Commits

57267cd536 Add a flag to force running the quantized model on CPUs. (#1778)
* Add a flag to force running the quantized model on CPUs.

* Add encodec to the readme.
2024-02-28 14:58:42 +01:00
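
A CPU-override flag like this is mostly plumbing between clap and the device selection. Below is a hypothetical sketch of the pattern, assuming `clap` and the `candle_core::Device` API; the actual flag name and wiring in the commit may differ.

```rust
use candle_core::Device;
use clap::Parser;

#[derive(Parser)]
struct Args {
    /// Force the model to run on CPU even when a GPU is available.
    #[arg(long)]
    cpu: bool,
}

fn device(cpu: bool) -> candle_core::Result<Device> {
    if cpu {
        Ok(Device::Cpu)
    } else {
        // Falls back to CPU when no CUDA device is present.
        Device::cuda_if_available(0)
    }
}

fn main() -> candle_core::Result<()> {
    let args = Args::parse();
    let device = device(args.cpu)?;
    println!("running on {device:?}");
    Ok(())
}
```
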
60ee5cfd4d Support more modes in the encodec example. (#1777)
* Support more modes in the encodec example.

* Remove the old encodec model from the musicgen bits.
2024-02-28 09:22:33 +01:00
d0aca6c3c6 Encodec encoding demo. (#1775) 2024-02-28 06:49:03 +01:00
0c49e95dfb Encodec model. (#1771)
* Encodec model.

* Fixes.

* Add the padding functions.

* Get the LSTM bit to work.

* Get the encodec model to generate some tokens (decoder only for now).

* Minor tweak.

* Minor tweak.
2024-02-27 22:59:40 +01:00
32544a2ad6 Add an option to split the prompt. (#1766) 2024-02-27 11:24:11 +01:00
918136ba46 add quantized rwkv v5 model (#1743)
* Add quantized rwkv v5 model

* Integrate the quantized rwkv model in the initial example.

---------

Co-authored-by: laurent <laurent.mazare@gmail.com>
2024-02-25 21:43:40 +01:00
2f22afd80e Cuda acceleration for quantized model. (#1754)
* Boilerplate for the quantized cuda support.

* More basic cuda support.

* More cuda quantization (quantize on cpu for now).

* Add the dequantization bit.

* Start adding some dedicated cuda kernels from llama.cpp.

* Move the kernel code.

* Start interfacing with the kernel.

* Tweak the kernel launch params.

* Bugfix for quantized metal.

* Fix some clippy lints.

* Tweak the launch parameters.

* Tweak cuda basics to perform a quantized matmul.

* Perform the dequantization on the cpu + use cublas for matmul.

* Add the dequantization kernel.

* Test the qmatmul.

* More kernels.

* Matmul-vec kernel.

* Add a couple kernels.

* More dequantization kernels.
2024-02-25 18:11:47 +01:00
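
For context, the user-facing surface this kernel work sits behind is `QMatMul`: quantized weights go in, and the backend picks between dequantize-then-matmul and a dedicated kernel. A minimal CPU sketch, assuming the current `candle_core::quantized` API (with the cuda feature, a CUDA device would route through the kernels this PR adds):

```rust
use candle_core::quantized::{GgmlDType, QMatMul, QTensor};
use candle_core::{Device, Module, Tensor};

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;
    // Quantize a random weight matrix to the ggml q4_0 format.
    let weight = Tensor::randn(0f32, 1., (64, 64), &device)?;
    let qweight = QTensor::quantize(&weight, GgmlDType::Q4_0)?;
    let qmatmul = QMatMul::from_qtensor(qweight)?;
    let xs = Tensor::randn(0f32, 1., (1, 64), &device)?;
    // Whether this dequantizes or hits a dedicated kernel is a backend detail.
    let ys = qmatmul.forward(&xs)?;
    println!("{:?}", ys.shape());
    Ok(())
}
```
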
8d04f70f4d Fix the eos token for gemma. (#1753) 2024-02-24 11:07:02 +01:00
32eb56d6b3 Fix typo in README (#1740) 2024-02-22 12:35:26 +01:00
28057781aa Make the cache for the llama model explicit too. (#1745) 2024-02-22 12:04:33 +01:00
544018b6d0 Explicit caching in llama2.c. 2024-02-22 10:22:03 +01:00
45d5322d62 Add the Gemma models. (#1741)
* Add the Gemma models.

* Add the gemma example.

* Adapt the RmsNorm.

* Get the 2b model to work.

* 7b support.

* Use the config head dim.

* Yet another fix.

* Make the matrices contiguous.

* Also get the 7b model to work.

* And add to the readme.
2024-02-21 22:02:50 +01:00
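
The "Adapt the RmsNorm" bullet reflects that gemma scales by `(1 + weight)` where the stock rms-norm uses the weight directly. A minimal sketch of that variant, assuming candle tensor ops (the helper name is illustrative, not candle's API):

```rust
use candle_core::{DType, Device, Result, Tensor, D};

fn gemma_rms_norm(xs: &Tensor, weight: &Tensor, eps: f64) -> Result<Tensor> {
    // Normalize by the root-mean-square over the last dimension...
    let variance = xs.sqr()?.mean_keepdim(D::Minus1)?;
    let xs = xs.broadcast_div(&(variance + eps)?.sqrt()?)?;
    // ...then scale by (1 + weight) instead of weight.
    xs.broadcast_mul(&(weight + 1.0)?)
}

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let xs = Tensor::randn(0f32, 1., (1, 4, 8), &dev)?;
    let w = Tensor::zeros(8, DType::F32, &dev)?;
    let ys = gemma_rms_norm(&xs, &w, 1e-6)?;
    println!("{:?}", ys.shape());
    Ok(())
}
```
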
7c7400fb63 Use the tokenizer-output-stream in the llama example. (#1715)
* Use the tokenizer-output-stream in the llama example.

* Also use tokenizer-output-stream for llama2-c.
2024-02-15 16:47:33 +01:00
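
The point of the tokenizer-output-stream is to print text as tokens arrive without splitting multi-byte characters. A sketch of the usage pattern, assuming the `TokenOutputStream` helper from the candle-examples crate:

```rust
use candle_examples::token_output_stream::TokenOutputStream;
use tokenizers::Tokenizer;

fn stream_tokens(tokenizer: Tokenizer, tokens: &[u32]) -> anyhow::Result<()> {
    let mut stream = TokenOutputStream::new(tokenizer);
    for &token in tokens {
        // next_token only yields text once the buffered bytes decode to
        // valid utf-8, so partial multi-byte characters are never printed.
        if let Some(text) = stream.next_token(token)? {
            print!("{text}");
        }
    }
    // Flush whatever is still buffered at the end of generation.
    if let Some(rest) = stream.decode_rest()? {
        print!("{rest}");
    }
    Ok(())
}
```
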
058a910d0e Add a readme for rwkv. (#1712) 2024-02-14 15:31:33 +01:00
26fe162ab5 Custom tokenizer for rwkv. (#1711)
* Custom tokenizer for rwkv.

* Custom tokenizer.

* Getting the tokenizer to work.
2024-02-14 15:11:38 +01:00
2d5f2a728d Add the RWKV model (v5). (#1707)
* Start adding the RWKV model.

* More of the forward step.

* Handle rescaling.

* FeedForward.

* More work on RWKV.

* Better state tracking.

* Finish a first pass on forward.

* Fix the shape mismatches.

* Do not rescale in f32.

* Rename to rwkv-v5.

* Add the new models to the readme.
2024-02-14 10:58:32 +01:00
68f7655895 Add ConvNeXt-V2 and smaller model variants. (#1709) 2024-02-14 10:53:07 +01:00
ad73e93da2 Detach the tensors on batch-norm eval. (#1702)
* Detach the tensors on batch-norm eval.

* Fix pyo3 bindings.

* Black tweak.

* Formatting.

* Also update the pyo3-onnx formatting.

* Apply black.
2024-02-13 14:26:32 +01:00
13c67226e6 feat: support microphone whisper streaming (#1678)
* feat: support microphone whisper streaming

* fix: cleanup print stmts and adjust how input is read

* fix: remove incorrect comment

* feat: split into new example and simplify

* fix: feature flag example file

* fix: fmt fixes

* feat: simplify and remove redundant files
2024-02-12 18:01:21 +01:00
1e26d539d9 Improved mamba model optimized for inference (#1694)
* Sketch the mamba model for inference.

* Complete the forward pass.

* Add the mamba example.

* Optimize the selective-scan part.

* Fix a couple shape mismatches and get inference to work.

* Tweak the readmes.

* More readme tweaks.
2024-02-11 17:04:57 +01:00
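
The selective-scan part being optimized is, at its core, a sequential recurrence: h_t = a_t * h_{t-1} + b_t * x_t with readout y_t = c_t * h_t. A scalar sketch of that recurrence (the real model batches it over channels and state dimensions):

```rust
// Plain-Rust sketch of the selective-scan recurrence; illustrative only.
fn selective_scan(a: &[f32], b: &[f32], c: &[f32], x: &[f32]) -> Vec<f32> {
    let mut h = 0f32;
    a.iter()
        .zip(b)
        .zip(c)
        .zip(x)
        .map(|(((a, b), c), x)| {
            h = a * h + b * x; // state update, inherently sequential in time
            c * h // per-step readout
        })
        .collect()
}
```
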
74497e6bf7 Fixing the qwen tokenizer location. (#1693)
Using the chatglm one causes a bug where the "<|endoftext|>" token is
not found.
2024-02-11 08:52:36 +01:00
8ab384e63d docs: add trocr examples (#1692) 2024-02-10 16:14:50 +01:00
27ffd644a9 Mention TrOCR in the readmes. (#1691) 2024-02-10 15:49:38 +01:00
42ce593ec6 Use the repo config for trocr rather than hardcoding it + small tweaks. (#1689)
* Use the repo config for trocr rather than hardcoding it + small tweaks.

* Add support for the printed models.

* Fail with an appropriate error message on missing position embeddings.
2024-02-10 13:15:03 +01:00
1c8d61f051 ChatGLM custom tokenizer. (#1687) 2024-02-10 10:47:04 +01:00
90447bc993 Add the custom tokenizer. (#1686) 2024-02-09 17:36:50 +01:00
40ce16001b Use the proper endoftext token for qwen. (#1685) 2024-02-09 17:02:03 +01:00
5657e596cd Add the Qwen2 model (#1684)
* Initial check-in for the qwen2 model.

* More qwen2 inference.

* Polish the qwen example.

* Fix the rope basis.

* Get the inference to work.

* Support different model sizes.
2024-02-09 15:02:49 +01:00
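
"Fix the rope basis" refers to the rotary-embedding frequency basis, inv_freq[i] = theta^(-2i/d) over the first d/2 head dimensions. A sketch of that computation; the rope theta is a per-model config value, so 1e6 below is only an example:

```rust
fn rope_inv_freq(head_dim: usize, rope_theta: f64) -> Vec<f64> {
    (0..head_dim / 2)
        .map(|i| 1.0 / rope_theta.powf(2.0 * i as f64 / head_dim as f64))
        .collect()
}

fn main() {
    let inv_freq = rope_inv_freq(128, 1e6);
    println!("{:?}", &inv_freq[..4]);
}
```
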
0dee8ea19b Add the ChatGLM model. (#1237)
* Add the ChatGLM model.

* Rotary embeddings.

* Add to the forward pass.

* Add to the forward pass.

* Add the rotary embeddings.

* Add the KV cache.

* Add the chatglm example.

* Bugfix.

* More glm fixes.

* Fix some shape issues.

* Get the inference to work.
2024-02-09 11:51:38 +01:00
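
The KV-cache step amounts to concatenating the new keys (and values) onto the cached ones along the sequence axis. A minimal sketch, assuming a (batch, heads, seq, head_dim) layout so dim 2 is the sequence dimension:

```rust
use candle_core::Tensor;

fn append_kv(cache: Option<&Tensor>, k_new: &Tensor) -> candle_core::Result<Tensor> {
    match cache {
        // First step: the new keys become the cache.
        None => Ok(k_new.clone()),
        // Later steps: grow the cache along the sequence axis.
        Some(prev) => Tensor::cat(&[prev, k_new], 2),
    }
}
```
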
020a979de2 Fix clippy lints for 1.76. (#1682) 2024-02-08 16:48:47 +01:00
678f64dd27 Fix token generation in bilingual models (non-English outputs) (#1668)
Co-authored-by: Guoqing Bao <guoqing.bao@enflame-tech.com>
2024-02-06 12:03:53 +01:00
153c940a9c Update docs to reflect current usage of example (#1610)
modified:   candle-examples/examples/onnx/README.md
2024-02-04 11:59:47 +01:00
50be8a98ba Quantized support for stable-lm2. (#1654)
* Quantized support for stable-lm2.

* Quantized support for v2-zephyr.
2024-02-04 11:57:05 +01:00
d32abbce53 Add StableLM-2, StableLM Code and Zephyr variants (#1650)
* Add StableLM Code and Zephyr variants

* Add V2 models

* Update README
2024-02-03 14:58:41 +01:00
dfab45e1c8 Supports more audio formats (#1628)
* Supports more audio formats

* Simplify the handling of the different buffer types.

* Check the sample rate.

---------

Co-authored-by: laurent <laurent.mazare@gmail.com>
2024-02-03 14:26:04 +01:00
a52d407ae6 Add ConvNeXt model. (#1604) 2024-02-03 13:34:28 +01:00
403680f17d Quantized GGUF style (#1523)
* Metal quantized modifications proposal.

- Add a device param, wherever needed.
- Create new QMetal storage thing that implements QuantizedType.
- Update everywhere needed.

Fix Python.

Fixing examples.

Fix: fmt + clippy + stub.

Moving everything around.

Only missing the actual implems.

Fixing everything + adding dequantized kernels.

More work.

Fixing matmul.

Fmt + Clippy

Some clippy fixes.

Working state.

Q2K Metal -> Bugged (also present in GGML).
Q4K CPU -> Bugged (present previously, new test catches it).
Q5K CPU -> Bugged (present previously).
Q8_1 Both -> Never really implemented, it seems.
Q8K Metal -> Never implemented in metal.

Fixing Q2K bug (present in ggml).

* Cleanup.

* Fix the rebase.

* Removing the fences speeds everything up and *is* correct this time...

* Cleanup the fence.

* After rebase.

* Bad code removal.

* Rebase after phi2 merge + fix replit default to CPU.

* Making the CI happy.

* More happy tests.

---------

Co-authored-by: Nicolas Patry <nicolas@Nicolass-MacBook-Pro.local>
2024-01-17 10:27:58 +01:00
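
The GGUF container this rework is built around can be inspected with candle's `gguf_file` module. A small sketch, assuming the current `candle_core::quantized::gguf_file` API; the file path is a placeholder:

```rust
use candle_core::quantized::gguf_file;

fn main() -> anyhow::Result<()> {
    let mut file = std::fs::File::open("model.gguf")?;
    let content = gguf_file::Content::read(&mut file)?;
    for (name, info) in content.tensor_infos.iter() {
        // Each tensor records its shape and ggml quantization dtype (q4_0, q6k, ...).
        println!("{name}: {:?} {:?}", info.shape, info.ggml_dtype);
    }
    Ok(())
}
```
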
5270224f40 Add MobileOne model. (#1595)
* Add MobileOne model.

* Clippy fixes

* Remove a comment.

---------

Co-authored-by: laurent <laurent.mazare@gmail.com>
2024-01-16 06:34:16 +01:00
ea36f3b11f Use the new phi model by default. (#1589) 2024-01-15 12:30:27 +01:00
539ead927a Update the Phi model to use the updated architecture. (#1580)
* Update the Phi model to use the updated architecture.

* Add more of the phi model.

* Repeat KV + caching.

* Apply the rotary embeddings.

* Add support for the new phi model in the phi example.

* Fix a couple glitches.

* Fix a couple more glitches.
2024-01-13 17:38:27 +01:00
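
"Repeat KV" is the grouped-query-attention step where the kv heads are tiled n_rep times to line up with the query heads. A sketch of the usual helper; candle's own version may differ in detail:

```rust
use candle_core::{Result, Tensor};

fn repeat_kv(xs: Tensor, n_rep: usize) -> Result<Tensor> {
    if n_rep == 1 {
        return Ok(xs);
    }
    let (b, n_kv_heads, seq_len, head_dim) = xs.dims4()?;
    // Insert a repeat axis, broadcast it, then fold it into the head axis.
    xs.unsqueeze(2)?
        .expand((b, n_kv_heads, n_rep, seq_len, head_dim))?
        .reshape((b, n_kv_heads * n_rep, seq_len, head_dim))
}
```
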
e90bcdcc7c Metal: f16 and bf16 where_cond + benchmark (#1545)
* Use cfg to separate benchmark results based on features

* Add metal where_cond for f16 and bf16. Add benchmark

* Remove allow pragma

* Avoid some unnecessary returns.

* Improve benchmarks layout

* Updated the feature-separated benchmarks

---------

Co-authored-by: Laurent <laurent.mazare@gmail.com>
2024-01-12 11:18:11 +01:00
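
The op those metal kernels cover, `where_cond`, is an elementwise select between two tensors driven by an integer condition mask. A CPU sketch of the call; with the metal feature the same call would dispatch the f16/bf16 kernels added here:

```rust
use candle_core::{DType, Device, Tensor};

fn main() -> candle_core::Result<()> {
    let dev = Device::Cpu;
    // A u8 mask picks per element from on_true (1) or on_false (0).
    let cond = Tensor::new(&[1u8, 0, 1, 0], &dev)?;
    let on_true = Tensor::new(&[1f32, 2., 3., 4.], &dev)?.to_dtype(DType::BF16)?;
    let on_false = on_true.zeros_like()?;
    let ys = cond.where_cond(&on_true, &on_false)?;
    println!("{ys}");
    Ok(())
}
```
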
8e06bfb4fd Mention VGG in the readme. (#1573) 2024-01-12 09:59:29 +01:00
6242276c09 Pin the revision used for phi-v2 + make it the default. (#1572)
* Pin the revision used for phi-v2 + make it the default.

* Tweak the custom-ops build.
2024-01-12 09:19:30 +01:00
2480c5dbdd Add RepVGG model. (#1561)
* Add RepVGG model.

* Add RepVGG README

* Extract var to top level

* Replace hashmap with a match

* Add a variant for the model kind + avoid some unnecessary config cloning.

---------

Co-authored-by: Laurent <laurent.mazare@gmail.com>
2024-01-11 07:07:40 +01:00
89b5a06858 Use bindgen-cuda for the custom-kernel example. (#1536)
* Use bindgen-cuda for the custom-kernel example.

* Only depend on the kernels when cuda is enabled.

* Skip rustfmt.
2024-01-07 17:18:46 +01:00
84250bf52f Fix index_pos bug when the kv cache is disabled. (#1517)
* Fix index_pos bug when the kv cache is disabled.

* Tweak the fix.

---------

Co-authored-by: laurent <laurent.mazare@gmail.com>
2024-01-06 11:43:01 +01:00
03ce8caf40 Format properly the Stable Diffusion example run with params (#1511)
Move out the --sd-version flag out of the prompt.
2024-01-01 11:13:35 +01:00
b0fe5e4453 Do not implement Module for BatchNorm. (#1513) 2024-01-01 10:13:13 +01:00
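
Dropping `Module` for `BatchNorm` forces callers to state whether they are training, since batch norm normalizes with batch statistics in training and with running statistics in eval. A sketch of the resulting call pattern through `ModuleT`, assuming the candle_nn batch-norm API:

```rust
use candle_core::{DType, Device, ModuleT, Tensor};
use candle_nn::{batch_norm, BatchNormConfig, VarBuilder, VarMap};

fn main() -> candle_core::Result<()> {
    let dev = Device::Cpu;
    let varmap = VarMap::new();
    let vb = VarBuilder::from_varmap(&varmap, DType::F32, &dev);
    let bn = batch_norm(4, BatchNormConfig::default(), vb)?;
    let xs = Tensor::randn(0f32, 1., (8, 4), &dev)?;
    let train = bn.forward_t(&xs, true)?; // normalize with batch statistics
    let eval = bn.forward_t(&xs, false)?; // normalize with running statistics
    println!("{:?} {:?}", train.shape(), eval.shape());
    Ok(())
}
```
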
1fb2dd905c Add support for tiny-llama-1.1b. (#1512) 2023-12-31 12:18:25 +01:00
51e577a682 Add Policy Gradient to Reinforcement Learning examples (#1500)
* Added policy_gradient; modified main, ddpg, and README

* fixed typo in README

* removed unnecessary imports

* small refactor

* Use clap for picking up the subcommand to run.

---------

Co-authored-by: Laurent <laurent.mazare@gmail.com>
2023-12-30 09:01:29 +01:00
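
The "Use clap for picking up the subcommand" bullet maps to clap's derive-style subcommands. A hypothetical sketch of the pattern; the example's real command names may differ:

```rust
use clap::{Parser, Subcommand};

#[derive(Parser)]
struct Args {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Run the policy-gradient example.
    Pg,
    /// Run the ddpg example.
    Ddpg,
}

fn main() {
    match Args::parse().command {
        Command::Pg => println!("running policy gradient"),
        Command::Ddpg => println!("running ddpg"),
    }
}
```
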