Quantized version of mistral. (#1009)

* Quantized version of mistral.
* Integrate the quantized mistral variant.
* Use the quantized weight files.
* Tweak the quantization command.
* Fix the dtype when computing the rotary embeddings.
* Update the readme with the quantized version.
* Fix the decoding of the remaining tokens.
2025-06-19 19:58:35 +00:00 · 2023-09-30 19:25:47 +02:00
parent 06207332bc
commit deee7612da
7 changed files with 507 additions and 37 deletions
--- a/candle-examples/examples/mistral/README.md
+++ b/candle-examples/examples/mistral/README.md
@@ -6,6 +6,9 @@ as of 2023-09-28. Weights (and the original Python model code) are released unde
 - [Blog post](https://mistral.ai/news/announcing-mistral-7b/) from Mistral announcing the model release.
 - [Model card](https://huggingface.co/mistralai/Mistral-7B-v0.1) on the
  HuggingFace Hub.
+This example supports the initial model as well as a quantized variant.
+
+## Running the example

 ```bash
 $ cargo run --example mistral --release --features cuda -- --prompt 'Write helloworld code in Rust' --sample-len 150
@@ -38,3 +41,50 @@ fn main() {

 This example is released under the terms
 ```
+
+## Running the quantized version of the model
+
+```bash
+$ cargo run --example mistral --features accelerate --release -- \
+    --prompt "Here is a sample quick sort implementation in rust " --quantized -n 400
+avx: false, neon: true, simd128: false, f16c: false
+temp: 0.00 repeat-penalty: 1.10 repeat-last-n: 64
+retrieved the files in 562.292µs
+loaded the model in 1.100323667s
+Here is a sample quick sort implementation in rust
+
+``rust
+fn quick_sort(arr: &mut [i32]) {
+    if arr.len() <= 1 {
+        return;
+    }
+
+    let pivot = arr[0];
+    let mut left = vec![];
+    let mut right = vec![];
+
+    for i in 1..arr.len() {
+        if arr[i] < pivot {
+            left.push(arr[i]);
+        } else {
+            right.push(arr[i]);
+        }
+    }
+
+    quick_sort(&mut left);
+    quick_sort(&mut right);
+
+    let mut i = 0;
+    for _ in &left {
+        arr[i] = left.pop().unwrap();
+        i += 1;
+    }
+
+    for _ in &right {
+        arr[i] = right.pop().unwrap();
+        i += 1;
+    }
+}
+``
+226 tokens generated (10.91 token/s)
+```
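
The quick sort in the sample output above is verbatim model output and is not part of the example's source. As written it would not compile (it calls `left.pop()` while iterating over `&left`), it never writes the pivot back into `arr`, and popping from the end would copy each half back in reverse order. A minimal corrected sketch, keeping the same `left`/`right` structure, might look like the following; it is only an illustration and was not produced by the model:

```rust
/// Sorts the slice in place using the same left/right partitioning idea
/// as the generated sample, with the copy-back step fixed.
fn quick_sort(arr: &mut [i32]) {
    if arr.len() <= 1 {
        return;
    }

    // Use the first element as the pivot, as in the generated code.
    let pivot = arr[0];
    let mut left = vec![];
    let mut right = vec![];

    // Partition the remaining elements around the pivot.
    for &x in &arr[1..] {
        if x < pivot {
            left.push(x);
        } else {
            right.push(x);
        }
    }

    quick_sort(&mut left);
    quick_sort(&mut right);

    // Copy back in order: sorted left half, then the pivot, then the sorted right half.
    let n = left.len();
    arr[..n].copy_from_slice(&left);
    arr[n] = pivot;
    arr[n + 1..].copy_from_slice(&right);
}
```

Calling `quick_sort(&mut values)` on a mutable array or `Vec<i32>` sorts it in ascending order. The temporary vectors cost extra allocations; an in-place partition scheme would avoid them at the price of drifting further from the generated sample.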