mirror of https://github.com/huggingface/candle.git synced 2025-06-19 03:54:56 +00:00

Nicolas Patry 7161002a34 Finished scaffolding, lots of TODOs

- Most kernels just copy themselves to get the shapes correct
- Matmul works in only 1 case and simply allocates an empty buffer otherwise
- Logits are randomized so that the demo can finish.

Performance is quite bad (30ms/token), but there are lots of prints and allocs, and some actual sending to Metal.

Couldn't get the speed much higher by removing the obvious blockers (println + the actual running matmuls).

Allocations take between 1us and 100us and seem very stable. Maybe Metal doesn't really have a smart allocator and we'll need to own it.

2023-11-02 15:32:28 +01:00

assets

Add a gif to the quantized readme. (#833)

2023-09-13 08:43:52 +01:00

main.rs

Finished scaffolding, lots of TODOs

2023-11-02 15:32:28 +01:00

README.md

Add a gif to the quantized readme. (#833)

2023-09-13 08:43:52 +01:00

README.md

candle-quantized-llama: Fast Inference of quantized LLaMA models

This example provides a quantized LLaMA model similar to llama.cpp. It is based on candle's built-in quantization methods. Supported features include:

- 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit integer quantization support.
- SIMD optimizations on Apple Silicon and x86.
- Support for the gguf and ggml file formats.
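To give a rough idea of what 8-bit integer quantization means here, the sketch below quantizes `f32` values in blocks that share a single scale, in the spirit of ggml-style block formats. It is an illustration only, not candle's implementation: the names `BlockQ8`, `quantize_q8` and `dequantize_q8` are hypothetical, the block size of 32 is an assumption, and the scale is kept as `f32` for simplicity.

```rust
/// Number of values sharing one scale (assumed for this sketch).
const BLOCK_SIZE: usize = 32;

/// One quantized block: a single scale plus one i8 per value.
/// Hypothetical type for illustration, not a candle type.
#[derive(Debug, Clone, PartialEq)]
struct BlockQ8 {
    scale: f32,
    values: [i8; BLOCK_SIZE],
}

/// Quantize a slice whose length is a multiple of BLOCK_SIZE.
fn quantize_q8(input: &[f32]) -> Vec<BlockQ8> {
    assert!(
        input.len() % BLOCK_SIZE == 0,
        "length must be a multiple of BLOCK_SIZE"
    );
    input
        .chunks_exact(BLOCK_SIZE)
        .map(|chunk| {
            // The largest magnitude in the block is mapped to +/-127.
            let max_abs = chunk.iter().fold(0f32, |m, v| m.max(v.abs()));
            let scale = max_abs / 127.0;
            // An all-zero block gets scale 0 and all-zero values.
            let inv = if scale == 0.0 { 0.0 } else { 1.0 / scale };
            let mut values = [0i8; BLOCK_SIZE];
            for (q, v) in values.iter_mut().zip(chunk) {
                *q = (v * inv).round() as i8;
            }
            BlockQ8 { scale, values }
        })
        .collect()
}

/// Recover approximate f32 values from the quantized blocks.
fn dequantize_q8(blocks: &[BlockQ8]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.values.iter().map(move |&q| q as f32 * b.scale))
        .collect()
}
```

Each value costs one byte plus a small share of the per-block scale, and the round-trip error per value is at most half the block's scale. Lower bit widths follow the same idea with fewer integer levels per value.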

The weights are automatically downloaded for you from the Hugging Face Hub on the first run. There are various command-line flags for using local files instead; run with --help to learn about them.

Running an example

cargo run --example quantized --release -- --prompt "The best thing about coding in rust is "

> avx: true, neon: false, simd128: false, f16c: true
> temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
> loaded 291 tensors (3.79GB) in 2.17s
> params: HParams { n_vocab: 32000, n_embd: 4096, n_mult: 256, n_head: 32, n_layer: 32, n_rot: 128, ftype: 2 }
> The best thing about coding in rust is 1.) that I don’t need to worry about memory leaks, 2.) speed and 3.) my program will compile even on old machines.

Command-line flags

Run with --help to see all options.

--which: specify the model to use, e.g. 7b, 13-chat, 7b-code.
--prompt interactive: interactive mode where multiple prompts can be entered.
--model mymodelfile.gguf: use a local model file rather than getting one from the hub.
