Rotating kv cache positions (#2901 )

* Retrieve the current positions for rotating KV caches. * Add the function to the kv cache too. * More testing.
Updated candle-book: Introduction, Installation, MNIST guide, and added CONTRIBUTING.md (#2897 )
2025-06-16 18:48:51 +00:00 · 2025-04-15 23:09:26 +02:00 · 2025-04-15 21:41:10 +02:00 · 2025-04-15 21:40:18 +02:00 · 2025-04-14 20:47:52 +02:00 · 2025-04-14 15:52:11 +02:00
150 changed files with 8033 additions and 2575 deletions
--- a/.github/workflows/book-cd.yml
+++ b/.github/workflows/book-cd.yml
@ -1,40 +0,0 @@
-name: Deploy Rust book
-on:
-  push:
-    branches:
-      - main
-
-jobs:
-  deploy:
-    runs-on: ubuntu-latest
-    permissions:
-      contents: write  # To push a branch 
-      pull-requests: write  # To create a PR from that branch
-    steps:
-    - uses: actions/checkout@v3
-      with:
-        fetch-depth: 0
-    - name: Install latest mdbook
-      run: |
-        tag=$(curl 'https://api.github.com/repos/rust-lang/mdbook/releases/latest' | jq -r '.tag_name')
-        url="https://github.com/rust-lang/mdbook/releases/download/${tag}/mdbook-${tag}-x86_64-unknown-linux-gnu.tar.gz"
-        mkdir mdbook
-        curl -sSL $url | tar -xz --directory=./mdbook
-        echo `pwd`/mdbook >> $GITHUB_PATH
-    - name: Deploy GitHub Pages
-      run: |
-        # This assumes your book is in the root of your repository.
-        # Just add a `cd` here if you need to change to another directory.
-        cd candle-book
-        mdbook build
-        git worktree add gh-pages
-        git config user.name "Deploy from CI"
-        git config user.email ""
-        cd gh-pages
-        # Delete the ref to avoid keeping history.
-        git update-ref -d refs/heads/gh-pages
-        rm -rf *
-        mv ../book/* .
-        git add .
-        git commit -m "Deploy $GITHUB_SHA to gh-pages"
-        git push --force --set-upstream origin gh-pages
--- a/.github/workflows/book.yml
+++ b/.github/workflows/book.yml
@ -1,29 +0,0 @@
-name: CI
-on: 
-  pull_request:
-
-jobs:
-  test:
-    name: Test candle-book
-    runs-on: ubuntu-latest
-    permissions:
-      contents: write  # To push a branch 
-      pull-requests: write  # To create a PR from that branch
-    steps:
-    - uses: actions/checkout@master
-    - name: Install Rust
-      run: |
-        rustup set profile minimal
-        rustup toolchain install stable
-        rustup default stable
-    - name: Install latest mdbook
-      run: |
-        tag=$(curl 'https://api.github.com/repos/rust-lang/mdbook/releases/latest' | jq -r '.tag_name')
-        url="https://github.com/rust-lang/mdbook/releases/download/${tag}/mdbook-${tag}-x86_64-unknown-linux-gnu.tar.gz"
-        mkdir bin
-        curl -sSL $url | tar -xz --directory=bin
-        echo "$(pwd)/bin" >> $GITHUB_PATH
-    - name: Run tests
-      run: cd candle-book && cargo build && mdbook test -L ../target/debug/deps/
-
-
--- a/.github/workflows/maturin.yml
+++ b/.github/workflows/maturin.yml
--- a/Cargo.toml
+++ b/Cargo.toml
@ -3,7 +3,6 @@ members = [
    "candle-core",
    "candle-datasets",
    "candle-examples",
-    "candle-book",
    "candle-nn",
    "candle-pyo3",
    "candle-transformers",
@ -12,6 +11,7 @@ members = [
    "tensor-tools",
 ]
 exclude = [
+   "candle-book",
   "candle-flash-attn",
   "candle-kernels",
   "candle-metal-kernels",
@ -20,7 +20,7 @@ exclude = [
 resolver = "2"

 [workspace.package]
-version = "0.8.3"
+version = "0.9.0-alpha.4"
 edition = "2021"
 description = "Minimalist ML framework."
 repository = "https://github.com/huggingface/candle"
@ -33,21 +33,21 @@ ab_glyph = "0.2.23"
 accelerate-src = { version = "0.3.2" }
 anyhow = { version = "1", features = ["backtrace"] }
 byteorder = "1.4.3"
-candle = { path = "./candle-core", package = "candle-core", version = "0.8.3" }
-candle-datasets = { path = "./candle-datasets", version = "0.8.3" }
-candle-flash-attn = { path = "./candle-flash-attn", version = "0.8.3" }
-candle-kernels = { path = "./candle-kernels", version = "0.8.3" }
-candle-metal-kernels = { path = "./candle-metal-kernels", version = "0.8.3" }
-candle-nn = { path = "./candle-nn", version = "0.8.3" }
-candle-onnx = { path = "./candle-onnx", version = "0.8.3" }
-candle-transformers = { path = "./candle-transformers", version = "0.8.3" }
+candle = { path = "./candle-core", package = "candle-core", version = "0.9.0-alpha.4" }
+candle-datasets = { path = "./candle-datasets", version = "0.9.0-alpha.4" }
+candle-flash-attn = { path = "./candle-flash-attn", version = "0.9.0-alpha.4" }
+candle-kernels = { path = "./candle-kernels", version = "0.9.0-alpha.4" }
+candle-metal-kernels = { path = "./candle-metal-kernels", version = "0.9.0-alpha.4" }
+candle-nn = { path = "./candle-nn", version = "0.9.0-alpha.4" }
+candle-onnx = { path = "./candle-onnx", version = "0.9.0-alpha.4" }
+candle-transformers = { path = "./candle-transformers", version = "0.9.0-alpha.4" }
 clap = { version = "4.2.4", features = ["derive"] }
 criterion = { version = "0.5.1", default-features=false }
-cudarc = { version = "0.13.5", features = ["std", "cublas", "cublaslt", "curand", "driver", "nvrtc", "f16", "cuda-version-from-build-system", "dynamic-linking"], default-features=false }
+cudarc = { version = "0.16.0", features = ["std", "cublas", "cublaslt", "curand", "driver", "nvrtc", "f16", "cuda-version-from-build-system", "dynamic-linking"], default-features=false }
 fancy-regex = "0.13.0"
 gemm = { version = "0.17.0", features = ["wasm-simd128-enable"] }
 hf-hub = "0.4.1"
-half = { version = "2.3.1", features = ["num-traits", "use-intrinsics", "rand_distr"] }
+half = { version = "2.5.0", features = ["num-traits", "use-intrinsics", "rand_distr"] }
 hound = "3.5.1"
 image = { version = "0.25.2", default-features = false, features = ["jpeg", "png"] }
 imageproc = { version = "0.24.0", default-features = false }
@ -58,21 +58,21 @@ memmap2 = { version = "0.9.3", features = ["stable_deref_trait"] }
 num_cpus = "1.15.0"
 num-traits = "0.2.15"
 parquet = { version = "51.0.0" }
-rand = "0.8.5"
-rand_distr = "0.4.3"
+rand = "0.9.0"
+rand_distr = "0.5.1"
 rayon = "1.7.0"
 safetensors = "0.4.1"
 serde = { version = "1.0.171", features = ["derive"] }
 serde_plain = "1.0.2"
 serde_json = "1.0.99"
 thiserror = "1"
-tokenizers = { version = "0.19.1", default-features = false }
+tokenizers = { version = "0.21.0", default-features = false }
 tracing = "0.1.37"
 tracing-chrome = "0.7.1"
 tracing-subscriber = "0.3.7"
-ug = "0.1.0"
-ug-cuda = "0.1.0"
-ug-metal = "0.1.0"
+ug = "0.4.0"
+ug-cuda = "0.4.0"
+ug-metal = "0.4.0"
 yoke = { version = "0.7.2", features = ["derive"] }
 zip = { version = "1.1.1", default-features = false }
 metal = { version = "0.27.0", features = ["mps"]}
--- a/README.md
+++ b/README.md
@ -290,6 +290,8 @@ Cheatsheet:

 ### Why should I use Candle?

+<!--- ANCHOR: goals --->
+
 Candle's core goal is to *make serverless inference possible*. Full machine learning frameworks like PyTorch
 are very large, which makes creating instances on a cluster slow. Candle allows deployment of lightweight
 binaries.
@ -299,6 +301,7 @@ and the [GIL](https://www.backblaze.com/blog/the-python-gil-past-present-and-fut

 Finally, Rust is cool! A lot of the HF ecosystem already has Rust crates, like [safetensors](https://github.com/huggingface/safetensors) and [tokenizers](https://github.com/huggingface/tokenizers).

+<!--- ANCHOR_END: goals --->

 ### Other ML frameworks

--- a/candle-book/CONTRIBUTING.md
+++ b/candle-book/CONTRIBUTING.md
@ -0,0 +1,13 @@
+# Candle Book
+
+The book uses [mdBook](https://github.com/rust-lang/mdBook) for building.
+
+## Installation
+
+To install mdBook, run `cargo install mdbook`. More instructions can be found [here](https://rust-lang.github.io/mdBook/guide/installation.html).
+
+## Viewing the book
+
+To view the book, run `mdbook serve --open candle-book`. More instructions can be found [here](https://rust-lang.github.io/mdBook/guide/creating.html). 
+
+The book is built automatically in github CI.
--- a/candle-book/src/README.md
+++ b/candle-book/src/README.md
@ -1,6 +1,7 @@
 # Introduction

+{{#include ../../README.md:goals}}
+
 {{#include ../../README.md:features}}

-
-This book will introduce step by step how to use `candle`.
+This book will introduce step by step how to use `candle`.
--- a/candle-book/src/SUMMARY.md
+++ b/candle-book/src/SUMMARY.md
@ -5,7 +5,10 @@
 # User Guide

 - [Installation](guide/installation.md)
- [Hello World - MNIST](guide/hello_world.md)
+- [Tutorial - MNIST](guide/mnist/intro.md)
+  - [Modeling](guide/mnist/modeling.md)
+  - [Training](guide/mnist/training.md)
+  - [Saving And Loading](guide/mnist/saving_loading.md)
 - [PyTorch cheatsheet](guide/cheatsheet.md)

 # Reference Guide
--- a/candle-book/src/guide/installation.md
+++ b/candle-book/src/guide/installation.md
@ -1,8 +1,23 @@
 # Installation

-**With Cuda support**:
+## 1. Create a new rust app or library

-1. First, make sure that Cuda is correctly installed.
+```bash
+cargo new myapp
+cd myapp
+```
+
+## 2. Add the correct candle version
+
+### Standard
+
+```bash
+cargo add --git https://github.com/huggingface/candle.git candle-core
+```
+
+### CUDA
+
+First, make sure that Cuda is correctly installed.
 - `nvcc --version` should print information about your Cuda compiler driver.
 - `nvidia-smi --query-gpu=compute_cap --format=csv` should print your GPUs compute capability, e.g. something
 like:
@ -17,43 +32,36 @@ You can also compile the Cuda kernels for a specific compute cap using the

 If any of the above commands errors out, please make sure to update your Cuda version.

-2. Create a new app and add [`candle-core`](https://github.com/huggingface/candle/tree/main/candle-core) with Cuda support.
-
-Start by creating a new cargo:
-
-```bash
-cargo new myapp
-cd myapp
-```
-
-Make sure to add the `candle-core` crate with the cuda feature:
+Add the `candle-core` crate with the cuda feature:

 ```bash
 cargo add --git https://github.com/huggingface/candle.git candle-core --features "cuda"
 ```

+### MKL
+
+You can also see the `mkl` feature which can get faster inference on CPU.
+
+Add the `candle-core` crate with the mkl feature:
+
+```bash
+cargo add --git https://github.com/huggingface/candle.git candle-core --features "mkl"
+```
+
+### Metal
+
+Metal is exclusive to MacOS.
+
+Add the `candle-core` crate with the metal feature:
+
+```bash
+cargo add --git https://github.com/huggingface/candle.git candle-core --features "metal"
+```
+
+## 3. Building
+
 Run `cargo build` to make sure everything can be correctly built.

 ```bash
 cargo build
 ```
-
-**Without Cuda support**:
-
-Create a new app and add [`candle-core`](https://github.com/huggingface/candle/tree/main/candle-core) as follows:
-
-```bash
-cargo new myapp
-cd myapp
-cargo add --git https://github.com/huggingface/candle.git candle-core
-```
-
-Finally, run `cargo build` to make sure everything can be correctly built.
-
-```bash
-cargo build
-```
-
-**With mkl support**
-
-You can also see the `mkl` feature which could be interesting to get faster inference on CPU. [Using mkl](./advanced/mkl.md)
--- a/candle-book/src/guide/mnist/intro.md
+++ b/candle-book/src/guide/mnist/intro.md
@ -0,0 +1,17 @@
+# Candle MNIST Tutorial
+
+## Introduction
+
+This tutorial provides an introduction to Candle by implementing and training a neural network for MNIST digit classification from scratch. 
+
+Throughout this tutorial, you will learn the basics of:
+
+- Tensor operations and model construction
+- Creating and implementing neural network layers
+- Parameter initialization
+- Training loop implementation
+- Saving and loading trained models
+
+## Getting Started
+
+Before proceeding, please ensure that you have properly installed Candle by following the instructions in the [Installation](../installation.md) guide.
--- a/candle-book/src/guide/mnist/modeling.md
+++ b/candle-book/src/guide/mnist/modeling.md
@ -0,0 +1,172 @@
+# Candle MNIST Tutorial
+
+## Modeling
+
+Open `src/main.rs` in your project folder and insert the following code:
+
+```rust
+use candle_core::{Device, Result, Tensor};
+
+struct Model {
+    first: Tensor,
+    second: Tensor,
+}
+
+impl Model {
+    fn forward(&self, image: &Tensor) -> Result<Tensor> {
+        let x = image.matmul(&self.first)?;
+        let x = x.relu()?;
+        x.matmul(&self.second)
+    }
+}
+
+fn main() -> Result<()> {
+    // Use Device::new_cuda(0)?; to utilize GPU acceleration.
+    let device = Device::Cpu;
+
+    let first = Tensor::randn(0f32, 1.0, (784, 100), &device)?;
+    let second = Tensor::randn(0f32, 1.0, (100, 10), &device)?;
+    let model = Model { first, second };
+
+    let dummy_image = Tensor::randn(0f32, 1.0, (1, 784), &device)?;
+
+    let digit = model.forward(&dummy_image)?;
+    println!("Digit {digit:?} digit");
+    Ok(())
+}
+```
+
+Execute the program with:
+
+```bash
+$ cargo run --release
+
+> Digit Tensor[dims 1, 10; f32] digit
+```
+
+Since random inputs are provided, expect an incoherent output.
+
+## Implementing a `Linear` Layer
+
+To create a more sophisticated layer type, add a `bias` to the weight to construct the standard `Linear` layer.
+
+Replace the entire content of `src/main.rs` with:
+
+```rust
+use candle_core::{Device, Result, Tensor};
+
+struct Linear {
+    weight: Tensor,
+    bias: Tensor,
+}
+
+impl Linear {
+    fn forward(&self, x: &Tensor) -> Result<Tensor> {
+        let x = x.matmul(&self.weight)?;
+        x.broadcast_add(&self.bias)
+    }
+}
+
+struct Model {
+    first: Linear,
+    second: Linear,
+}
+
+impl Model {
+    fn forward(&self, image: &Tensor) -> Result<Tensor> {
+        let x = self.first.forward(image)?;
+        let x = x.relu()?;
+        self.second.forward(&x)
+    }
+}
+
+fn main() -> Result<()> {
+    // Use Device::new_cuda(0)?; for GPU acceleration.
+    // Use Device::Cpu; for CPU computation.
+    let device = Device::cuda_if_available(0)?;
+
+    // Initialize model parameters
+    let weight = Tensor::randn(0f32, 1.0, (784, 100), &device)?;
+    let bias = Tensor::randn(0f32, 1.0, (100, ), &device)?;
+    let first = Linear { weight, bias };
+    let weight = Tensor::randn(0f32, 1.0, (100, 10), &device)?;
+    let bias = Tensor::randn(0f32, 1.0, (10, ), &device)?;
+    let second = Linear { weight, bias };
+    let model = Model { first, second };
+
+    let dummy_image = Tensor::randn(0f32, 1.0, (1, 784), &device)?;
+
+    // Perform inference
+    let digit = model.forward(&dummy_image)?;
+    println!("Digit {digit:?} digit");
+    Ok(())
+}
+```
+
+Execute again with:
+
+```bash
+$ cargo run --release
+
+> Digit Tensor[dims 1, 10; f32] digit
+```
+
+## Utilizing `candle_nn`
+
+Many classical layers (such as [Linear](https://github.com/huggingface/candle/blob/main/candle-nn/src/linear.rs)) are already implemented in [candle-nn](https://github.com/huggingface/candle/tree/main/candle-nn).
+
+This `Linear` implementation follows PyTorch conventions for improved compatibility with existing models, utilizing the transpose of weights rather than direct weights.
+
+Let's simplify our implementation. First, add `candle-nn` as a dependency:
+
+```bash
+$ cargo add --git https://github.com/huggingface/candle.git candle-nn
+```
+
+Now, replace the entire content of `src/main.rs` with:
+
+```rust
+use candle_core::{Device, Result, Tensor};
+use candle_nn::{Linear, Module};
+
+struct Model {
+    first: Linear,
+    second: Linear,
+}
+
+impl Model {
+    fn forward(&self, image: &Tensor) -> Result<Tensor> {
+        let x = self.first.forward(image)?;
+        let x = x.relu()?;
+        self.second.forward(&x)
+    }
+}
+
+fn main() -> Result<()> {
+    // Use Device::new_cuda(0)?; for GPU acceleration.
+    let device = Device::Cpu;
+
+    // Note the dimension change: (784, 100) -> (100, 784)
+    let weight = Tensor::randn(0f32, 1.0, (100, 784), &device)?;
+    let bias = Tensor::randn(0f32, 1.0, (100, ), &device)?;
+    let first = Linear::new(weight, Some(bias));
+    let weight = Tensor::randn(0f32, 1.0, (10, 100), &device)?;
+    let bias = Tensor::randn(0f32, 1.0, (10, ), &device)?;
+    let second = Linear::new(weight, Some(bias));
+    let model = Model { first, second };
+
+    let dummy_image = Tensor::randn(0f32, 1.0, (1, 784), &device)?;
+
+    let digit = model.forward(&dummy_image)?;
+    println!("Digit {digit:?} digit");
+    Ok(())
+}
+```
+
+Execute the final version:
+
+```bash
+$ cargo run --release
+
+> Digit Tensor[dims 1, 10; f32] digit
+```
--- a/candle-book/src/guide/mnist/saving_loading.md
+++ b/candle-book/src/guide/mnist/saving_loading.md
@ -0,0 +1,158 @@
+# Candle MNIST Tutorial
+
+## Saving and Loading Models
+
+After training a model, it is useful to save and subsequently load the model parameters. In Candle, this functionality is managed through the `VarMap` data structure, with parameters stored on disk using the [safetensors](https://huggingface.co/docs/safetensors/index) format.
+
+### Saving Model Parameters
+
+Let's modify our `training_loop` function to include functionality for saving weights:
+
+```rust
+fn training_loop(
+    m: candle_datasets::vision::Dataset,
+) -> anyhow::Result<()> {
+    let dev = Device::cuda_if_available(0)?;
+
+    let train_labels = m.train_labels;
+    let train_images = m.train_images.to_device(&dev)?;
+    let train_labels = train_labels.to_dtype(DType::U32)?.to_device(&dev)?;
+
+    // Initialize a VarMap for trainable parameters
+    let varmap = VarMap::new();
+    let vs = VarBuilder::from_varmap(&varmap, DType::F32, &dev);
+    let model = Model::new(vs.clone())?;
+
+    let learning_rate = 0.05;
+    let epochs = 10;
+
+    // Initialize stochastic gradient descent optimizer
+    let mut sgd = candle_nn::SGD::new(varmap.all_vars(), learning_rate)?;
+    let test_images = m.test_images.to_device(&dev)?;
+    let test_labels = m.test_labels.to_dtype(DType::U32)?.to_device(&dev)?;
+    
+    for epoch in 1..epochs {
+        // Standard MNIST forward pass
+        let logits = model.forward(&train_images)?;
+        let log_sm = ops::log_softmax(&logits, D::Minus1)?;
+        
+        // Compute Negative Log Likelihood loss
+        let loss = loss::nll(&log_sm, &train_labels)?;
+
+        // Perform backward pass and update weights
+        sgd.backward_step(&loss)?;
+
+        // Evaluate model on test set
+        let test_logits = model.forward(&test_images)?;
+        let sum_ok = test_logits
+            .argmax(D::Minus1)?
+            .eq(&test_labels)?
+            .to_dtype(DType::F32)?
+            .sum_all()?
+            .to_scalar::<f32>()?;
+        let test_accuracy = sum_ok / test_labels.dims1()? as f32;
+        println!(
+            "{epoch:4} train loss: {:8.5} test acc: {:5.2}%",
+            loss.to_scalar::<f32>()?,
+            test_accuracy
+        );
+    }
+    
+    // Save model weights to disk
+    varmap.save("model_weights.safetensors")?;
+    Ok(())
+}
+```
+
+```bash
+$ cargo run --release
+
+> 1 train loss:  2.40485 test acc:  0.11%
+> 2 train loss:  2.34161 test acc:  0.14%
+> 3 train loss:  2.28841 test acc:  0.17%
+> 4 train loss:  2.24158 test acc:  0.19%
+> 5 train loss:  2.19898 test acc:  0.23%
+> 6 train loss:  2.15927 test acc:  0.26%
+> 7 train loss:  2.12161 test acc:  0.29%
+> 8 train loss:  2.08549 test acc:  0.32%
+> 9 train loss:  2.05053 test acc:  0.35%
+```
+
+### Loading Model Parameters
+
+Now that we have saved our model parameters, we can modify the code to load them. The primary change required is to make the `varmap` variable mutable:
+
+```rust
+fn training_loop(
+    m: candle_datasets::vision::Dataset,
+) -> anyhow::Result<()> {
+    let dev = Device::cuda_if_available(0)?;
+
+    let train_labels = m.train_labels;
+    let train_images = m.train_images.to_device(&dev)?;
+    let train_labels = train_labels.to_dtype(DType::U32)?.to_device(&dev)?;
+
+    // Create a mutable VarMap for trainable parameters
+    let mut varmap = VarMap::new();
+    let vs = VarBuilder::from_varmap(&varmap, DType::F32, &dev);
+    let model = Model::new(vs.clone())?;
+
+    // Load pre-trained weights from file
+    varmap.load("model_weights.safetensors")?;
+
+    let learning_rate = 0.05;
+    let epochs = 10;
+
+    // Initialize stochastic gradient descent optimizer
+    let mut sgd = candle_nn::SGD::new(varmap.all_vars(), learning_rate)?;
+    let test_images = m.test_images.to_device(&dev)?;
+    let test_labels = m.test_labels.to_dtype(DType::U32)?.to_device(&dev)?;
+    
+    for epoch in 1..epochs {
+        // Standard MNIST forward pass
+        let logits = model.forward(&train_images)?;
+        let log_sm = ops::log_softmax(&logits, D::Minus1)?;
+        
+        // Compute Negative Log Likelihood loss
+        let loss = loss::nll(&log_sm, &train_labels)?;
+
+        // Perform backward pass and update weights
+        sgd.backward_step(&loss)?;
+
+        // Evaluate model on test set
+        let test_logits = model.forward(&test_images)?;
+        let sum_ok = test_logits
+            .argmax(D::Minus1)?
+            .eq(&test_labels)?
+            .to_dtype(DType::F32)?
+            .sum_all()?
+            .to_scalar::<f32>()?;
+        let test_accuracy = sum_ok / test_labels.dims1()? as f32;
+        println!(
+            "{epoch:4} train loss: {:8.5} test acc: {:5.2}%",
+            loss.to_scalar::<f32>()?,
+            test_accuracy
+        );
+    }
+    
+    // Save updated weights back to disk
+    varmap.save("model_weights.safetensors")?;
+    Ok(())
+}
+```
+
+```bash
+$ cargo run --release
+
+> 1 train loss:  2.01645 test acc:  0.38%
+> 2 train loss:  1.98300 test acc:  0.41%
+> 3 train loss:  1.95008 test acc:  0.44%
+> 4 train loss:  1.91754 test acc:  0.47%
+> 5 train loss:  1.88534 test acc:  0.50%
+> 6 train loss:  1.85349 test acc:  0.53%
+> 7 train loss:  1.82198 test acc:  0.56%
+> 8 train loss:  1.79077 test acc:  0.59%
+> 9 train loss:  1.75989 test acc:  0.61%
+```
+
+Note that loading the weights will fail if the specified file does not exist or is incompatible with the current model architecture. Implementing file existence checks and appropriate error handling is left to the user.
--- a/candle-book/src/guide/mnist/training.md
+++ b/candle-book/src/guide/mnist/training.md
@ -0,0 +1,134 @@
+# Candle MNIST Tutorial
+
+## Training Implementation
+
+First, let's create a utility function `make_linear` that accepts a `VarBuilder` and returns an initialized linear layer. The `VarBuilder` constructs a `VarMap`, which is the data structure that stores our trainable parameters.
+
+```rust
+use candle_core::{Device, Result, Tensor};
+use candle_nn::{Linear, Module, VarBuilder, VarMap};
+
+fn make_linear(vs: VarBuilder, in_dim: usize, out_dim: usize) -> Result<Linear> {
+    let ws = vs.get_with_hints(
+        (out_dim, in_dim),
+        "weight",
+        candle_nn::init::DEFAULT_KAIMING_NORMAL,
+    )?;
+    let bound = 1. / (in_dim as f64).sqrt();
+    let bs = vs.get_with_hints(
+        out_dim,
+        "bias",
+        candle_nn::Init::Uniform {
+            lo: -bound,
+            up: bound,
+        },
+    )?;
+    Ok(Linear::new(ws, Some(bs)))
+}
+```
+
+Next, let's implement a `new` method for our model class to accept a `VarBuilder` and initialize the model. We use `VarBuilder::pp` to "push prefix" so that the parameter names are organized hierarchically: the first layer weights as `first.weight` and `first.bias`, and the second layer weights as `second.weight` and `second.bias`.
+
+```rust
+impl Model {
+    fn new(vs: VarBuilder) -> Result<Self> {
+        const IMAGE_DIM: usize = 784;
+        const HIDDEN_DIM: usize = 100;
+        const LABELS: usize = 10;
+
+        let first = make_linear(vs.pp("first"), IMAGE_DIM, HIDDEN_DIM)?;
+        let second = make_linear(vs.pp("second"), HIDDEN_DIM, LABELS)?;
+
+        Ok(Self { first, second })
+    }
+
+    fn forward(&self, image: &Tensor) -> Result<Tensor> {
+        let x = self.first.forward(image)?;
+        let x = x.relu()?;
+        self.second.forward(&x)
+    }
+}
+```
+
+Now, let's add the `candle-datasets` package to our project to access the MNIST dataset:
+
+```bash
+$ cargo add --git https://github.com/huggingface/candle.git candle-datasets
+```
+
+With the dataset available, we can implement our training loop:
+
+```rust
+use candle_core::{DType, Device, Result, Tensor, D};
+use candle_nn::{loss, ops, Linear, Module, Optimizer, VarBuilder, VarMap};
+
+fn training_loop(
+    m: candle_datasets::vision::Dataset,
+) -> anyhow::Result<()> {
+    let dev = Device::cuda_if_available(0)?;
+
+    let train_labels = m.train_labels;
+    let train_images = m.train_images.to_device(&dev)?;
+    let train_labels = train_labels.to_dtype(DType::U32)?.to_device(&dev)?;
+
+    // Initialize a VarMap to store trainable parameters
+    let varmap = VarMap::new();
+    let vs = VarBuilder::from_varmap(&varmap, DType::F32, &dev);
+    let model = Model::new(vs.clone())?;
+
+    let learning_rate = 0.05;
+    let epochs = 10;
+
+    // Initialize a stochastic gradient descent optimizer to update parameters
+    let mut sgd = candle_nn::SGD::new(varmap.all_vars(), learning_rate)?;
+    let test_images = m.test_images.to_device(&dev)?;
+    let test_labels = m.test_labels.to_dtype(DType::U32)?.to_device(&dev)?;
+    
+    for epoch in 1..epochs {
+        // Perform forward pass on MNIST data
+        let logits = model.forward(&train_images)?;
+        let log_sm = ops::log_softmax(&logits, D::Minus1)?;
+        
+        // Compute Negative Log Likelihood loss
+        let loss = loss::nll(&log_sm, &train_labels)?;
+
+        // Perform backward pass and update weights
+        sgd.backward_step(&loss)?;
+
+        // Evaluate model on test set
+        let test_logits = model.forward(&test_images)?;
+        let sum_ok = test_logits
+            .argmax(D::Minus1)?
+            .eq(&test_labels)?
+            .to_dtype(DType::F32)?
+            .sum_all()?
+            .to_scalar::<f32>()?;
+        let test_accuracy = sum_ok / test_labels.dims1()? as f32;
+        println!(
+            "{epoch:4} train loss: {:8.5} test acc: {:5.2}%",
+            loss.to_scalar::<f32>()?,
+            test_accuracy
+        );
+    }
+    Ok(())
+}
+```
+
+Finally, let's implement our main function:
+
+```rust
+pub fn main() -> anyhow::Result<()> {
+    let m = candle_datasets::vision::mnist::load()?;
+    return training_loop(m);
+}
+```
+
+Let's execute the training process:
+
+```bash
+$ cargo run --release
+
+> 1 train loss:  2.35449 test acc:  0.12%
+> 2 train loss:  2.30760 test acc:  0.15%
+> ...
+```
--- a/candle-core/Cargo.toml
+++ b/candle-core/Cargo.toml
@ -56,3 +56,7 @@ harness = false
 [[example]]
 name = "metal_basics"
 required-features = ["metal"]
+
+[[example]]
+name = "cuda_basics"
+required-features = ["cuda"]
--- a/candle-core/benches/benchmarks/mod.rs
+++ b/candle-core/benches/benchmarks/mod.rs
@ -21,7 +21,9 @@ impl BenchDevice for Device {
            Device::Cpu => Ok(()),
            Device::Cuda(device) => {
                #[cfg(feature = "cuda")]
-                return Ok(device.synchronize()?);
+                return Ok(device
+                    .synchronize()
+                    .map_err(|e| candle_core::Error::Cuda(Box::new(e)))?);
                #[cfg(not(feature = "cuda"))]
                panic!("Cuda device without cuda feature enabled: {:?}", device)
            }
--- a/candle-core/examples/cuda_basics.rs
+++ b/candle-core/examples/cuda_basics.rs
@ -6,28 +6,18 @@ extern crate intel_mkl_src;

 use anyhow::Result;
 use candle_core::{Device, Tensor};
-
+// xs: [1024, 64, 1924], c Tensor[dims 128, 64, 8; f32, cuda:0] Conv1dConfig { padding: 0, stride: 4, dilation: 1, groups: 1 }
 fn main() -> Result<()> {
    let device = Device::new_cuda(0)?;
-    let x = Tensor::randn(0f32, 1.0, (8 * 4096, 8 * 4096), &device)?
-        .to_dtype(candle_core::DType::BF16)?;
-    candle_core::cuda::set_gemm_reduced_precision_f32(false);
-    candle_core::cuda::set_gemm_reduced_precision_bf16(false);
-    let _x1 = x.matmul(&x)?;
-    drop(_x1);
-    let start_time = std::time::Instant::now();
-    let _x1 = x.matmul(&x)?;
-    device.synchronize()?;
-    println!("fp32: {:?}", start_time.elapsed());
-    drop(_x1);
-    candle_core::cuda::set_gemm_reduced_precision_f32(true);
-    candle_core::cuda::set_gemm_reduced_precision_bf16(true);
-    let _x1 = x.matmul(&x)?;
-    drop(_x1);
-    let start_time = std::time::Instant::now();
-    let _x1 = x.matmul(&x)?;
-    device.synchronize()?;
-    println!("tf32: {:?}", start_time.elapsed());
+    let x = Tensor::randn(0f32, 1.0, (1024, 64, 1924), &device)?;
+    let c = Tensor::randn(0f32, 1.0, (128, 64, 8), &device)?;
+    let _x1 = x.conv1d(&c, 0, 4, 1, 1)?;
    drop(_x1);
+    for _ in 0..20 {
+        let start_time = std::time::Instant::now();
+        let _x1 = x.conv1d(&c, 0, 4, 1, 1)?;
+        device.synchronize()?;
+        println!("conv1d: {:?}", start_time.elapsed());
+    }
    Ok(())
 }
--- a/candle-core/src/backprop.rs
+++ b/candle-core/src/backprop.rs
@ -32,7 +32,7 @@ impl Tensor {
    /// elements having dependencies on the latter ones, e.g. the first element if any is the
    /// argument.
    /// This assumes that the op graph is a DAG.
-    fn sorted_nodes(&self) -> Vec<&Tensor> {
+    pub fn sorted_nodes(&self) -> Vec<&Tensor> {
        // The vec of sorted nodes is passed as an owned value rather than a mutable reference
        // to get around some lifetime limitations.
        fn walk<'a>(
--- a/candle-core/src/conv.rs
+++ b/candle-core/src/conv.rs
@ -14,6 +14,7 @@ pub struct ParamsConv1D {
    pub(crate) padding: usize,
    pub(crate) stride: usize,
    pub(crate) dilation: usize,
+    pub(crate) cudnn_fwd_algo: Option<CudnnFwdAlgo>,
 }

 impl ParamsConv1D {
@ -54,7 +55,7 @@ impl ParamsConvTranspose1D {
    }
 }

-#[derive(Debug, Clone, PartialEq, Eq, Hash)]
+#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
 pub enum CudnnFwdAlgo {
    ImplicitGemm,
    ImplicitPrecompGemm,
@ -151,6 +152,19 @@ impl Tensor {
        stride: usize,
        dilation: usize,
        groups: usize,
+    ) -> Result<Self> {
+        self.conv1d_with_algo(kernel, padding, stride, dilation, groups, None)
+    }
+
+    /// Applies a 1D convolution over the input tensor.
+    pub fn conv1d_with_algo(
+        &self,
+        kernel: &Self,
+        padding: usize,
+        stride: usize,
+        dilation: usize,
+        groups: usize,
+        cudnn_fwd_algo: Option<CudnnFwdAlgo>,
    ) -> Result<Self> {
        let (c_out, c_in_k, k_size) = kernel.dims3()?;
        let (b_size, c_in, l_in) = self.dims3()?;
@ -174,6 +188,7 @@ impl Tensor {
            padding,
            stride,
            dilation,
+            cudnn_fwd_algo,
        };
        if groups == 1 {
            self.conv1d_single_group(kernel, &params)
@ -278,6 +293,18 @@ impl Tensor {
        stride: usize,
        dilation: usize,
        groups: usize,
+    ) -> Result<Self> {
+        self.conv2d_with_algo(kernel, padding, stride, dilation, groups, None)
+    }
+
+    pub fn conv2d_with_algo(
+        &self,
+        kernel: &Self,
+        padding: usize,
+        stride: usize,
+        dilation: usize,
+        groups: usize,
+        cudnn_fwd_algo: Option<CudnnFwdAlgo>,
    ) -> Result<Self> {
        let (b_size, c_in, i_h, i_w) = self.dims4()?;
        let (c_out, c_in_k, k_h, k_w) = kernel.dims4()?;
@ -297,7 +324,7 @@ impl Tensor {
            padding,
            stride,
            dilation,
-            cudnn_fwd_algo: None,
+            cudnn_fwd_algo,
        };
        if groups == 1 {
            self.conv2d_single_group(kernel, &params)
--- a/candle-core/src/cpu_backend/mod.rs
+++ b/candle-core/src/cpu_backend/mod.rs
@ -1289,6 +1289,15 @@ impl Map2 for MatMul {
        } else {
            Parallelism::None
        };
+        let (b, m, n, k) = if b_skip == 0 && a_skip == m * k {
+            // a_skip and c_skip should be updated but step is always 0 so
+            // it wouldn't matter.
+            (1, b * m, n, k)
+        } else if a_skip == 0 && b_skip == n * k {
+            (1, m, b * n, k)
+        } else {
+            (b, m, n, k)
+        };
        for step in 0..b {
            let lhs_p = &lhs[step * a_skip..];
            let rhs_p = &rhs[step * b_skip..];
@ -2482,15 +2491,15 @@ impl BackendDevice for CpuDevice {
        use rand::prelude::*;

        let elem_count = shape.elem_count();
-        let mut rng = rand::thread_rng();
+        let mut rng = rand::rng();
        match dtype {
            DType::U8 | DType::U32 | DType::I64 => {
                Err(Error::UnsupportedDTypeForOp(dtype, "rand_uniform").bt())
            }
            DType::BF16 => {
                let mut data = Vec::with_capacity(elem_count);
-                let uniform =
-                    rand::distributions::Uniform::new(bf16::from_f64(min), bf16::from_f64(max));
+                let uniform = rand::distr::Uniform::new(bf16::from_f64(min), bf16::from_f64(max))
+                    .map_err(Error::wrap)?;
                for _i in 0..elem_count {
                    data.push(rng.sample::<bf16, _>(uniform))
                }
@ -2498,8 +2507,8 @@ impl BackendDevice for CpuDevice {
            }
            DType::F16 => {
                let mut data = Vec::with_capacity(elem_count);
-                let uniform =
-                    rand::distributions::Uniform::new(f16::from_f64(min), f16::from_f64(max));
+                let uniform = rand::distr::Uniform::new(f16::from_f64(min), f16::from_f64(max))
+                    .map_err(Error::wrap)?;
                for _i in 0..elem_count {
                    data.push(rng.sample::<f16, _>(uniform))
                }
@ -2507,7 +2516,8 @@ impl BackendDevice for CpuDevice {
            }
            DType::F32 => {
                let mut data = Vec::with_capacity(elem_count);
-                let uniform = rand::distributions::Uniform::new(min as f32, max as f32);
+                let uniform =
+                    rand::distr::Uniform::new(min as f32, max as f32).map_err(Error::wrap)?;
                for _i in 0..elem_count {
                    data.push(rng.sample::<f32, _>(uniform))
                }
@ -2515,7 +2525,7 @@ impl BackendDevice for CpuDevice {
            }
            DType::F64 => {
                let mut data = Vec::with_capacity(elem_count);
-                let uniform = rand::distributions::Uniform::new(min, max);
+                let uniform = rand::distr::Uniform::new(min, max).map_err(Error::wrap)?;
                for _i in 0..elem_count {
                    data.push(rng.sample::<f64, _>(uniform))
                }
@ -2528,7 +2538,7 @@ impl BackendDevice for CpuDevice {
        use rand::prelude::*;

        let elem_count = shape.elem_count();
-        let mut rng = rand::thread_rng();
+        let mut rng = rand::rng();
        match dtype {
            DType::U8 | DType::U32 | DType::I64 => {
                Err(Error::UnsupportedDTypeForOp(dtype, "rand_normal").bt())
--- a/candle-core/src/cuda_backend/cudnn.rs
+++ b/candle-core/src/cuda_backend/cudnn.rs
@ -43,7 +43,7 @@ pub(crate) fn launch_conv2d<
        if let Some(cudnn) = cudnn.borrow().get(&device_id) {
            return Ok(cudnn.clone());
        }
-        let c = Cudnn::new(dev.cuda_device());
+        let c = Cudnn::new(dev.cuda_stream());
        if let Ok(c) = &c {
            cudnn.borrow_mut().insert(device_id, c.clone());
        }
@ -109,7 +109,7 @@ pub(crate) fn launch_conv2d<
        Some(CandleAlgo::Count) => A::CUDNN_CONVOLUTION_FWD_ALGO_COUNT,
    };
    let workspace_size = conv2d.get_workspace_size(alg)?;
-    let mut workspace = dev.cuda_device().alloc_zeros::<u8>(workspace_size)?;
+    let mut workspace = dev.cuda_stream().alloc_zeros::<u8>(workspace_size)?;
    unsafe {
        conv2d.launch::<CudaSlice<u8>, _, _, _>(
            alg,
@ -122,3 +122,104 @@ pub(crate) fn launch_conv2d<
    }
    Ok(())
 }
+
+pub(crate) fn launch_conv1d<
+    T: DeviceRepr + WithDType + ValidAsZeroBits + cudarc::cudnn::CudnnDataType,
+    Y: cudarc::cudnn::CudnnDataType,
+>(
+    src: &CudaView<T>,
+    src_l: &crate::Layout,
+    filter: &CudaView<T>,
+    dst: &mut CudaSlice<T>,
+    params: &crate::conv::ParamsConv1D,
+    dev: &crate::cuda_backend::CudaDevice,
+) -> crate::Result<()> {
+    use crate::conv::CudnnFwdAlgo as CandleAlgo;
+    use cudarc::cudnn::sys::cudnnConvolutionFwdAlgo_t as A;
+
+    let device_id = dev.id();
+    let cudnn = CUDNN.with(|cudnn| {
+        if let Some(cudnn) = cudnn.borrow().get(&device_id) {
+            return Ok(cudnn.clone());
+        }
+        let c = Cudnn::new(dev.cuda_stream());
+        if let Ok(c) = &c {
+            cudnn.borrow_mut().insert(device_id, c.clone());
+        }
+        c
+    })?;
+    let conv = cudnn.create_conv2d::<Y>(
+        /* pad */ [params.padding as i32, 0],
+        /* stride */ [params.stride as i32, 1],
+        /* dilation */ [params.dilation as i32, 1],
+        cudarc::cudnn::sys::cudnnConvolutionMode_t::CUDNN_CROSS_CORRELATION,
+    )?;
+    // https://docs.nvidia.com/deeplearning/cudnn/backend/latest/api/cudnn-ops-library.html#cudnnsettensornddescriptor
+    // > Tensors are restricted to having at least 4 dimensions, and at most CUDNN_DIM_MAX
+    // > dimensions (defined in cudnn.h). When working with lower dimensional data, it is
+    // > recommended that the user create a 4D tensor, and set the size along unused dimensions
+    // > to 1.
+    let x_shape = [
+        params.b_size as i32,
+        params.c_in as i32,
+        params.l_in as i32,
+        1,
+    ];
+    // Note that `src` already starts at the proper offset.
+    let x = if src_l.is_contiguous() {
+        cudnn.create_4d_tensor::<T>(
+            cudarc::cudnn::sys::cudnnTensorFormat_t::CUDNN_TENSOR_NCHW,
+            x_shape,
+        )?
+    } else {
+        let s = src_l.stride();
+        cudnn.create_4d_tensor_ex::<T>(x_shape, [s[0] as i32, s[1] as i32, s[2] as i32, 1i32])?
+    };
+    let w = cudnn.create_4d_filter::<T>(
+        cudarc::cudnn::sys::cudnnTensorFormat_t::CUDNN_TENSOR_NCHW,
+        [
+            params.c_out as i32,
+            params.c_in as i32,
+            params.k_size as i32,
+            1,
+        ],
+    )?;
+    let l_out = params.l_out() as i32;
+    let y = cudnn.create_4d_tensor::<T>(
+        cudarc::cudnn::sys::cudnnTensorFormat_t::CUDNN_TENSOR_NCHW,
+        [params.b_size as i32, params.c_out as i32, l_out, 1],
+    )?;
+    let conv1d = ConvForward {
+        conv: &conv,
+        x: &x,
+        w: &w,
+        y: &y,
+    };
+    let alg = match params.cudnn_fwd_algo {
+        None => conv1d.pick_algorithm()?,
+        Some(CandleAlgo::ImplicitGemm) => A::CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
+        Some(CandleAlgo::ImplicitPrecompGemm) => {
+            A::CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
+        }
+        Some(CandleAlgo::Gemm) => A::CUDNN_CONVOLUTION_FWD_ALGO_GEMM,
+        Some(CandleAlgo::Direct) => A::CUDNN_CONVOLUTION_FWD_ALGO_DIRECT,
+        Some(CandleAlgo::Fft) => A::CUDNN_CONVOLUTION_FWD_ALGO_FFT,
+        Some(CandleAlgo::FftTiling) => A::CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING,
+        Some(CandleAlgo::Winograd) => A::CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD,
+        Some(CandleAlgo::WinogradNonFused) => A::CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED,
+        Some(CandleAlgo::Count) => A::CUDNN_CONVOLUTION_FWD_ALGO_COUNT,
+    };
+    let workspace_size = conv1d.get_workspace_size(alg)?;
+    let mut workspace = dev.cuda_stream().alloc_zeros::<u8>(workspace_size)?;
+    unsafe {
+        conv1d.launch::<CudaSlice<u8>, _, _, _>(
+            alg,
+            Some(&mut workspace),
+            (T::one(), T::zero()),
+            src,
+            filter,
+            dst,
+        )?;
+    }
+    Ok(())
+}
--- a/candle-core/src/cuda_backend/device.rs
+++ b/candle-core/src/cuda_backend/device.rs
@ -2,8 +2,9 @@ use crate::backend::BackendDevice;
 use crate::{CpuStorage, CpuStorageRef, DType, Layout, Result, Shape};
 pub use candle_kernels as kernels;
 pub use cudarc;
-use cudarc::driver::{CudaFunction, LaunchAsync, LaunchConfig};
+use cudarc::driver::{CudaFunction, LaunchConfig, PushKernelArg};
 use half::{bf16, f16};
+use std::collections::HashMap;
 use std::sync::{Arc, Mutex};

 use super::{CudaError, CudaStorage, CudaStorageSlice, WrapErr};
@ -24,10 +25,17 @@ impl DeviceId {
 struct CudaRng(cudarc::curand::CudaRng);
 unsafe impl Send for CudaRng {}

+pub struct ModuleStore {
+    mdls: [Option<Arc<cudarc::driver::CudaModule>>; kernels::ALL_IDS.len()],
+}
+
 #[derive(Clone)]
 pub struct CudaDevice {
    id: DeviceId,
-    device: Arc<cudarc::driver::CudaDevice>,
+    context: Arc<cudarc::driver::CudaContext>,
+    modules: Arc<std::sync::RwLock<ModuleStore>>,
+    custom_modules: Arc<std::sync::RwLock<HashMap<String, Arc<cudarc::driver::CudaModule>>>>,
+    stream: Arc<cudarc::driver::CudaStream>,
    pub(crate) blas: Arc<cudarc::cublas::CudaBlas>,
    curand: Arc<Mutex<CudaRng>>,
 }
@ -38,17 +46,120 @@ impl std::fmt::Debug for CudaDevice {
    }
 }

-impl std::ops::Deref for CudaDevice {
-    type Target = Arc<cudarc::driver::CudaDevice>;
+impl CudaDevice {
+    #[allow(clippy::missing_safety_doc)]
+    pub unsafe fn alloc<T: cudarc::driver::DeviceRepr>(
+        &self,
+        len: usize,
+    ) -> Result<cudarc::driver::CudaSlice<T>> {
+        self.stream.alloc::<T>(len).w()
+    }
+
+    pub fn alloc_zeros<T: cudarc::driver::DeviceRepr + cudarc::driver::ValidAsZeroBits>(
+        &self,
+        len: usize,
+    ) -> Result<cudarc::driver::CudaSlice<T>> {
+        self.stream.alloc_zeros::<T>(len).w()
+    }
+
+    pub fn memcpy_htod<
+        T: cudarc::driver::DeviceRepr,
+        Src: cudarc::driver::HostSlice<T> + ?Sized,
+        Dst: cudarc::driver::DevicePtrMut<T>,
+    >(
+        &self,
+        src: &Src,
+        dst: &mut Dst,
+    ) -> Result<()> {
+        self.stream.memcpy_htod(src, dst).w()
+    }
+
+    pub fn memcpy_dtov<T: cudarc::driver::DeviceRepr, Src: cudarc::driver::DevicePtr<T>>(
+        &self,
+        src: &Src,
+    ) -> Result<Vec<T>> {
+        self.stream.memcpy_dtov(src).w()
+    }
+
+    pub fn memcpy_dtod<
+        T,
+        Src: cudarc::driver::DevicePtr<T>,
+        Dst: cudarc::driver::DevicePtrMut<T>,
+    >(
+        &self,
+        src: &Src,
+        dst: &mut Dst,
+    ) -> Result<()> {
+        self.stream.memcpy_dtod(src, dst).w()
+    }
+
+    pub fn memcpy_stod<
+        T: cudarc::driver::DeviceRepr,
+        Src: cudarc::driver::HostSlice<T> + ?Sized,
+    >(
+        &self,
+        src: &Src,
+    ) -> Result<cudarc::driver::CudaSlice<T>> {
+        self.stream.memcpy_stod(src).w()
+    }
+}
+
+pub struct CudaFunc {
+    func: CudaFunction,
+    stream: Arc<cudarc::driver::CudaStream>,
+}
+
+impl std::ops::Deref for CudaFunc {
+    type Target = CudaFunction;

    fn deref(&self) -> &Self::Target {
-        &self.device
+        &self.func
+    }
+}
+
+impl CudaFunc {
+    pub fn into_cuda_function(self) -> CudaFunction {
+        self.func
+    }
+}
+
+#[macro_export]
+macro_rules! builder_arg {
+    ($b:ident, $($arg:expr),*) => {
+        $(
+            let __arg = $arg;
+            $b.arg(&__arg);
+        )*
+    };
+}
+
+impl CudaFunc {
+    pub fn builder(&self) -> cudarc::driver::LaunchArgs<'_> {
+        self.stream.launch_builder(&self.func)
    }
 }

 impl CudaDevice {
-    pub fn cuda_device(&self) -> Arc<cudarc::driver::CudaDevice> {
-        self.device.clone()
+    pub fn cuda_stream(&self) -> Arc<cudarc::driver::CudaStream> {
+        self.stream.clone()
+    }
+
+    /// When turned on, all cuda tensors **created after calling this function** will
+    /// not track uses via cuda events.
+    ///
+    /// # Safety
+    ///
+    /// It is up to the user to ensure proper synchronization between multiple streams:
+    /// - Ensure that no tensor is freed before a use on another stream is finished.
+    /// - Ensure that a tensor is not used on another stream before allocation on the
+    ///   allocating stream finishes.
+    /// - Ensure that a tensor is not written two concurrently by multiple streams.
+    pub unsafe fn disable_event_tracking(&self) {
+        self.context.disable_event_tracking()
+    }
+
+    pub fn is_event_tracking(&self) -> bool {
+        self.context.is_event_tracking()
    }

    #[cfg(not(target_arch = "wasm32"))]
@ -56,7 +167,7 @@ impl CudaDevice {
        &self,
        func_name: &'static str,
        kernel: ug::lang::ssa::Kernel,
-    ) -> Result<CudaFunction> {
+    ) -> Result<CudaFunc> {
        let mut buf = vec![];
        ug_cuda::code_gen::gen(&mut buf, func_name, &kernel)?;
        let cuda_code = String::from_utf8(buf)?;
@ -65,12 +176,12 @@ impl CudaDevice {
            ..Default::default()
        };
        let ptx = cudarc::nvrtc::safe::compile_ptx_with_opts(cuda_code, opts).w()?;
-        self.device.load_ptx(ptx, "ug", &[func_name]).w()?;
-        let func = match self.device.get_func("ug", func_name) {
-            Some(func) => func,
-            None => crate::bail!("unknown function ug::{func_name}"),
-        };
-        Ok(func)
+        let module = self.context.load_module(ptx).w()?;
+        let func = module.load_function(func_name).w()?;
+        Ok(CudaFunc {
+            func,
+            stream: self.stream.clone(),
+        })
    }

    pub fn id(&self) -> DeviceId {
@ -83,58 +194,85 @@ impl CudaDevice {
        let slice = match dtype {
            DType::U8 => {
                // SAFETY: Set later by running the fill kernel.
-                let data = unsafe { self.alloc::<u8>(elem_count) }.w()?;
-                let func = self.get_or_load_func("fill_u8", kernels::FILL)?;
-                let params = (&data, v as u8, elem_count);
-                unsafe { func.launch(cfg, params) }.w()?;
+                let data = unsafe { self.alloc::<u8>(elem_count)? };
+                let func = self.get_or_load_func("fill_u8", &kernels::FILL)?;
+                let mut builder = self.stream.launch_builder(&func);
+                let v = v as u8;
+                builder.arg(&data);
+                builder.arg(&v);
+                builder.arg(&elem_count);
+                unsafe { builder.launch(cfg) }.w()?;
                CudaStorageSlice::U8(data)
            }
            DType::U32 => {
                // SAFETY: Set later by running the fill kernel.
-                let data = unsafe { self.alloc::<u32>(elem_count) }.w()?;
-                let func = self.get_or_load_func("fill_u32", kernels::FILL)?;
-                let params = (&data, v as u32, elem_count);
-                unsafe { func.launch(cfg, params) }.w()?;
+                let data = unsafe { self.alloc::<u32>(elem_count)? };
+                let func = self.get_or_load_func("fill_u32", &kernels::FILL)?;
+                let mut builder = self.stream.launch_builder(&func);
+                let v = v as u32;
+                builder.arg(&data);
+                builder.arg(&v);
+                builder.arg(&elem_count);
+                unsafe { builder.launch(cfg) }.w()?;
                CudaStorageSlice::U32(data)
            }
            DType::I64 => {
                // SAFETY: Set later by running the fill kernel.
-                let data = unsafe { self.alloc::<i64>(elem_count) }.w()?;
-                let func = self.get_or_load_func("fill_i64", kernels::FILL)?;
-                let params = (&data, v as i64, elem_count);
-                unsafe { func.launch(cfg, params) }.w()?;
+                let data = unsafe { self.alloc::<i64>(elem_count)? };
+                let func = self.get_or_load_func("fill_i64", &kernels::FILL)?;
+                let mut builder = self.stream.launch_builder(&func);
+                let v = v as i64;
+                builder.arg(&data);
+                builder.arg(&v);
+                builder.arg(&elem_count);
+                unsafe { builder.launch(cfg) }.w()?;
                CudaStorageSlice::I64(data)
            }
            DType::BF16 => {
                // SAFETY: Set later by running the fill kernel.
-                let data = unsafe { self.alloc::<bf16>(elem_count) }.w()?;
-                let func = self.get_or_load_func("fill_bf16", kernels::FILL)?;
-                let params = (&data, bf16::from_f64(v), elem_count);
-                unsafe { func.launch(cfg, params) }.w()?;
+                let data = unsafe { self.alloc::<bf16>(elem_count)? };
+                let func = self.get_or_load_func("fill_bf16", &kernels::FILL)?;
+                let mut builder = self.stream.launch_builder(&func);
+                let v = bf16::from_f64(v);
+                builder.arg(&data);
+                builder.arg(&v);
+                builder.arg(&elem_count);
+                unsafe { builder.launch(cfg) }.w()?;
                CudaStorageSlice::BF16(data)
            }
            DType::F16 => {
                // SAFETY: Set later by running the fill kernel.
-                let data = unsafe { self.alloc::<f16>(elem_count) }.w()?;
-                let func = self.get_or_load_func("fill_f16", kernels::FILL)?;
-                let params = (&data, f16::from_f64(v), elem_count);
-                unsafe { func.launch(cfg, params) }.w()?;
+                let data = unsafe { self.alloc::<f16>(elem_count)? };
+                let func = self.get_or_load_func("fill_f16", &kernels::FILL)?;
+                let mut builder = self.stream.launch_builder(&func);
+                let v = f16::from_f64(v);
+                builder.arg(&data);
+                builder.arg(&v);
+                builder.arg(&elem_count);
+                unsafe { builder.launch(cfg) }.w()?;
                CudaStorageSlice::F16(data)
            }
            DType::F32 => {
                // SAFETY: Set later by running the fill kernel.
-                let data = unsafe { self.alloc::<f32>(elem_count) }.w()?;
-                let func = self.get_or_load_func("fill_f32", kernels::FILL)?;
-                let params = (&data, v as f32, elem_count);
-                unsafe { func.launch(cfg, params) }.w()?;
+                let data = unsafe { self.alloc::<f32>(elem_count)? };
+                let func = self.get_or_load_func("fill_f32", &kernels::FILL)?;
+                let mut builder = self.stream.launch_builder(&func);
+                let v = v as f32;
+                builder.arg(&data);
+                builder.arg(&v);
+                builder.arg(&elem_count);
+                unsafe { builder.launch(cfg) }.w()?;
                CudaStorageSlice::F32(data)
            }
            DType::F64 => {
                // SAFETY: Set later by running the fill kernel.
-                let data = unsafe { self.alloc::<f64>(elem_count) }.w()?;
-                let func = self.get_or_load_func("fill_f64", kernels::FILL)?;
-                let params = (&data, v, elem_count);
-                unsafe { func.launch(cfg, params) }.w()?;
+                let data = unsafe { self.alloc::<f64>(elem_count) }?;
+                let func = self.get_or_load_func("fill_f64", &kernels::FILL)?;
+                let mut builder = self.stream.launch_builder(&func);
+                builder.arg(&data);
+                builder.arg(&v);
+                builder.arg(&elem_count);
+                unsafe { builder.launch(cfg) }.w()?;
                CudaStorageSlice::F64(data)
            }
        };
@ -144,38 +282,69 @@ impl CudaDevice {
        })
    }

-    pub fn get_or_load_func(&self, module_name: &str, ptx: &'static str) -> Result<CudaFunction> {
-        if !self.has_func(module_name, module_name) {
-            // Leaking the string here is a bit sad but we need a &'static str and this is only
-            // done once per kernel name.
-            let static_module_name = Box::leak(module_name.to_string().into_boxed_str());
-            self.load_ptx(ptx.into(), module_name, &[static_module_name])
-                .map_err(|cuda| CudaError::Load {
-                    cuda,
-                    module_name: module_name.to_string(),
-                })
-                .w()?;
+    pub fn get_or_load_custom_func(
+        &self,
+        fn_name: &str,
+        module_name: &str,
+        ptx: &str,
+    ) -> Result<CudaFunc> {
+        let ms = self.custom_modules.read().unwrap();
+        if let Some(mdl) = ms.get(module_name).as_ref() {
+            let func = mdl.load_function(fn_name).w()?;
+            return Ok(CudaFunc {
+                func,
+                stream: self.stream.clone(),
+            });
        }
-        self.get_func(module_name, module_name)
-            // Clippy recommends this `ok_or` rather than `ok_or_else` so hopefully the compiler is
-            // able to only build the error value if needed.
-            .ok_or(CudaError::MissingKernel {
-                module_name: module_name.to_string(),
-            })
-            .w()
+        drop(ms);
+        let mut ms = self.custom_modules.write().unwrap();
+        let cuda_module = self.context.load_module(ptx.into()).w()?;
+        ms.insert(module_name.to_string(), cuda_module.clone());
+        let func = cuda_module.load_function(fn_name).w()?;
+        Ok(CudaFunc {
+            func,
+            stream: self.stream.clone(),
+        })
+    }
+
+    pub fn get_or_load_func(&self, fn_name: &str, mdl: &kernels::Module) -> Result<CudaFunc> {
+        let ms = self.modules.read().unwrap();
+        if let Some(mdl) = ms.mdls[mdl.index()].as_ref() {
+            let func = mdl.load_function(fn_name).w()?;
+            return Ok(CudaFunc {
+                func,
+                stream: self.stream.clone(),
+            });
+        }
+        drop(ms);
+        let mut ms = self.modules.write().unwrap();
+        let cuda_module = self.context.load_module(mdl.ptx().into()).w()?;
+        ms.mdls[mdl.index()] = Some(cuda_module.clone());
+        let func = cuda_module.load_function(fn_name).w()?;
+        Ok(CudaFunc {
+            func,
+            stream: self.stream.clone(),
+        })
    }
 }

 impl CudaDevice {
    pub fn new_with_stream(ordinal: usize) -> Result<Self> {
-        let device = cudarc::driver::CudaDevice::new_with_stream(ordinal).w()?;
-        let blas = cudarc::cublas::CudaBlas::new(device.clone()).w()?;
-        let curand = cudarc::curand::CudaRng::new(299792458, device.clone()).w()?;
+        let context = cudarc::driver::CudaContext::new(ordinal).w()?;
+        let stream = context.new_stream().w()?;
+        let blas = cudarc::cublas::CudaBlas::new(stream.clone()).w()?;
+        let curand = cudarc::curand::CudaRng::new(299792458, stream.clone()).w()?;
+        let module_store = ModuleStore {
+            mdls: [const { None }; kernels::ALL_IDS.len()],
+        };
        Ok(Self {
            id: DeviceId::new(),
-            device,
+            context,
+            stream,
            blas: Arc::new(blas),
            curand: Arc::new(Mutex::new(CudaRng(curand))),
+            modules: Arc::new(std::sync::RwLock::new(module_store)),
+            custom_modules: Arc::new(std::sync::RwLock::new(HashMap::new())),
        })
    }
 }
@ -184,14 +353,21 @@ impl BackendDevice for CudaDevice {
    type Storage = CudaStorage;

    fn new(ordinal: usize) -> Result<Self> {
-        let device = cudarc::driver::CudaDevice::new(ordinal).w()?;
-        let blas = cudarc::cublas::CudaBlas::new(device.clone()).w()?;
-        let curand = cudarc::curand::CudaRng::new(299792458, device.clone()).w()?;
+        let context = cudarc::driver::CudaContext::new(ordinal).w()?;
+        let stream = context.default_stream();
+        let blas = cudarc::cublas::CudaBlas::new(stream.clone()).w()?;
+        let curand = cudarc::curand::CudaRng::new(299792458, stream.clone()).w()?;
+        let module_store = ModuleStore {
+            mdls: [const { None }; kernels::ALL_IDS.len()],
+        };
        Ok(Self {
            id: DeviceId::new(),
-            device,
+            context,
+            stream,
            blas: Arc::new(blas),
            curand: Arc::new(Mutex::new(CudaRng(curand))),
+            modules: Arc::new(std::sync::RwLock::new(module_store)),
+            custom_modules: Arc::new(std::sync::RwLock::new(HashMap::new())),
        })
    }

@ -199,13 +375,13 @@ impl BackendDevice for CudaDevice {
        // We do not call set_seed but instead create a new curand object. This ensures that the
        // state will be identical and the same random numbers will be generated.
        let mut curand = self.curand.lock().unwrap();
-        curand.0 = cudarc::curand::CudaRng::new(seed, self.device.clone()).w()?;
+        curand.0 = cudarc::curand::CudaRng::new(seed, self.stream.clone()).w()?;
        Ok(())
    }

    fn location(&self) -> crate::DeviceLocation {
        crate::DeviceLocation::Cuda {
-            gpu_id: self.device.ordinal(),
+            gpu_id: self.context.ordinal(),
        }
    }

@ -217,31 +393,31 @@ impl BackendDevice for CudaDevice {
        let elem_count = shape.elem_count();
        let slice = match dtype {
            DType::U8 => {
-                let data = self.alloc_zeros::<u8>(elem_count).w()?;
+                let data = self.alloc_zeros::<u8>(elem_count)?;
                CudaStorageSlice::U8(data)
            }
            DType::U32 => {
-                let data = self.alloc_zeros::<u32>(elem_count).w()?;
+                let data = self.alloc_zeros::<u32>(elem_count)?;
                CudaStorageSlice::U32(data)
            }
            DType::I64 => {
-                let data = self.alloc_zeros::<i64>(elem_count).w()?;
+                let data = self.alloc_zeros::<i64>(elem_count)?;
                CudaStorageSlice::I64(data)
            }
            DType::BF16 => {
-                let data = self.alloc_zeros::<bf16>(elem_count).w()?;
+                let data = self.alloc_zeros::<bf16>(elem_count)?;
                CudaStorageSlice::BF16(data)
            }
            DType::F16 => {
-                let data = self.alloc_zeros::<f16>(elem_count).w()?;
+                let data = self.alloc_zeros::<f16>(elem_count)?;
                CudaStorageSlice::F16(data)
            }
            DType::F32 => {
-                let data = self.alloc_zeros::<f32>(elem_count).w()?;
+                let data = self.alloc_zeros::<f32>(elem_count)?;
                CudaStorageSlice::F32(data)
            }
            DType::F64 => {
-                let data = self.alloc_zeros::<f64>(elem_count).w()?;
+                let data = self.alloc_zeros::<f64>(elem_count)?;
                CudaStorageSlice::F64(data)
            }
        };
@ -265,12 +441,12 @@ impl BackendDevice for CudaDevice {
                .w()?
            }
            DType::F32 => {
-                let mut data = unsafe { self.alloc::<f32>(elem_count) }.w()?;
+                let mut data = unsafe { self.alloc::<f32>(elem_count)? };
                curand.0.fill_with_uniform(&mut data).w()?;
                CudaStorageSlice::F32(data)
            }
            DType::F64 => {
-                let mut data = unsafe { self.alloc::<f64>(elem_count) }.w()?;
+                let mut data = unsafe { self.alloc::<f64>(elem_count)? };
                curand.0.fill_with_uniform(&mut data).w()?;
                CudaStorageSlice::F64(data)
            }
@ -309,7 +485,7 @@ impl BackendDevice for CudaDevice {
                .w()?
            }
            DType::F32 => {
-                let mut data = unsafe { self.alloc::<f32>(elem_count_round) }.w()?;
+                let mut data = unsafe { self.alloc::<f32>(elem_count_round)? };
                curand
                    .0
                    .fill_with_normal(&mut data, mean as f32, std as f32)
@ -317,7 +493,7 @@ impl BackendDevice for CudaDevice {
                CudaStorageSlice::F32(data)
            }
            DType::F64 => {
-                let mut data = unsafe { self.alloc::<f64>(elem_count_round) }.w()?;
+                let mut data = unsafe { self.alloc::<f64>(elem_count_round)? };
                curand.0.fill_with_normal(&mut data, mean, std).w()?;
                CudaStorageSlice::F64(data)
            }
@ -336,31 +512,31 @@ impl BackendDevice for CudaDevice {
        let elem_count = shape.elem_count();
        let slice = match dtype {
            DType::U8 => {
-                let data = self.alloc::<u8>(elem_count).w()?;
+                let data = self.alloc::<u8>(elem_count)?;
                CudaStorageSlice::U8(data)
            }
            DType::U32 => {
-                let data = self.alloc::<u32>(elem_count).w()?;
+                let data = self.alloc::<u32>(elem_count)?;
                CudaStorageSlice::U32(data)
            }
            DType::I64 => {
-                let data = self.alloc::<i64>(elem_count).w()?;
+                let data = self.alloc::<i64>(elem_count)?;
                CudaStorageSlice::I64(data)
            }
            DType::BF16 => {
-                let data = self.alloc::<bf16>(elem_count).w()?;
+                let data = self.alloc::<bf16>(elem_count)?;
                CudaStorageSlice::BF16(data)
            }
            DType::F16 => {
-                let data = self.alloc::<f16>(elem_count).w()?;
+                let data = self.alloc::<f16>(elem_count)?;
                CudaStorageSlice::F16(data)
            }
            DType::F32 => {
-                let data = self.alloc::<f32>(elem_count).w()?;
+                let data = self.alloc::<f32>(elem_count)?;
                CudaStorageSlice::F32(data)
            }
            DType::F64 => {
-                let data = self.alloc::<f64>(elem_count).w()?;
+                let data = self.alloc::<f64>(elem_count)?;
                CudaStorageSlice::F64(data)
            }
        };
@ -373,31 +549,31 @@ impl BackendDevice for CudaDevice {
    fn storage_from_slice<T: crate::WithDType>(&self, s: &[T]) -> Result<Self::Storage> {
        let slice = match T::cpu_storage_ref(s) {
            CpuStorageRef::U8(storage) => {
-                let data = self.htod_sync_copy(storage).w()?;
+                let data = self.memcpy_stod(storage)?;
                CudaStorageSlice::U8(data)
            }
            CpuStorageRef::U32(storage) => {
-                let data = self.htod_sync_copy(storage).w()?;
+                let data = self.memcpy_stod(storage)?;
                CudaStorageSlice::U32(data)
            }
            CpuStorageRef::I64(storage) => {
-                let data = self.htod_sync_copy(storage).w()?;
+                let data = self.memcpy_stod(storage)?;
                CudaStorageSlice::I64(data)
            }
            CpuStorageRef::BF16(storage) => {
-                let data = self.htod_sync_copy(storage).w()?;
+                let data = self.memcpy_stod(storage)?;
                CudaStorageSlice::BF16(data)
            }
            CpuStorageRef::F16(storage) => {
-                let data = self.htod_sync_copy(storage).w()?;
+                let data = self.memcpy_stod(storage)?;
                CudaStorageSlice::F16(data)
            }
            CpuStorageRef::F32(storage) => {
-                let data = self.htod_sync_copy(storage).w()?;
+                let data = self.memcpy_stod(storage)?;
                CudaStorageSlice::F32(data)
            }
            CpuStorageRef::F64(storage) => {
-                let data = self.htod_sync_copy(storage).w()?;
+                let data = self.memcpy_stod(storage)?;
                CudaStorageSlice::F64(data)
            }
        };
@ -410,31 +586,31 @@ impl BackendDevice for CudaDevice {
    fn storage_from_cpu_storage(&self, storage: &CpuStorage) -> Result<CudaStorage> {
        let slice = match storage {
            CpuStorage::U8(storage) => {
-                let data = self.htod_sync_copy(storage).w()?;
+                let data = self.memcpy_stod(storage)?;
                CudaStorageSlice::U8(data)
            }
            CpuStorage::U32(storage) => {
-                let data = self.htod_sync_copy(storage).w()?;
+                let data = self.memcpy_stod(storage)?;
                CudaStorageSlice::U32(data)
            }
            CpuStorage::I64(storage) => {
-                let data = self.htod_sync_copy(storage).w()?;
+                let data = self.memcpy_stod(storage)?;
                CudaStorageSlice::I64(data)
            }
            CpuStorage::BF16(storage) => {
-                let data = self.htod_sync_copy(storage).w()?;
+                let data = self.memcpy_stod(storage)?;
                CudaStorageSlice::BF16(data)
            }
            CpuStorage::F16(storage) => {
-                let data = self.htod_sync_copy(storage).w()?;
+                let data = self.memcpy_stod(storage)?;
                CudaStorageSlice::F16(data)
            }
            CpuStorage::F32(storage) => {
-                let data = self.htod_sync_copy(storage).w()?;
+                let data = self.memcpy_stod(storage)?;
                CudaStorageSlice::F32(data)
            }
            CpuStorage::F64(storage) => {
-                let data = self.htod_sync_copy(storage).w()?;
+                let data = self.memcpy_stod(storage)?;
                CudaStorageSlice::F64(data)
            }
        };
@ -447,31 +623,31 @@ impl BackendDevice for CudaDevice {
    fn storage_from_cpu_storage_owned(&self, storage: CpuStorage) -> Result<CudaStorage> {
        let slice = match storage {
            CpuStorage::U8(storage) => {
-                let data = self.htod_copy(storage).w()?;
+                let data = self.memcpy_stod(&storage)?;
                CudaStorageSlice::U8(data)
            }
            CpuStorage::U32(storage) => {
-                let data = self.htod_copy(storage).w()?;
+                let data = self.memcpy_stod(&storage)?;
                CudaStorageSlice::U32(data)
            }
            CpuStorage::I64(storage) => {
-                let data = self.htod_copy(storage).w()?;
+                let data = self.memcpy_stod(&storage)?;
                CudaStorageSlice::I64(data)
            }
            CpuStorage::BF16(storage) => {
-                let data = self.htod_copy(storage).w()?;
+                let data = self.memcpy_stod(&storage)?;
                CudaStorageSlice::BF16(data)
            }
            CpuStorage::F16(storage) => {
-                let data = self.htod_copy(storage).w()?;
+                let data = self.memcpy_stod(&storage)?;
                CudaStorageSlice::F16(data)
            }
            CpuStorage::F32(storage) => {
-                let data = self.htod_copy(storage).w()?;
+                let data = self.memcpy_stod(&storage)?;
                CudaStorageSlice::F32(data)
            }
            CpuStorage::F64(storage) => {
-                let data = self.htod_copy(storage).w()?;
+                let data = self.memcpy_stod(&storage)?;
                CudaStorageSlice::F64(data)
            }
        };
@ -482,7 +658,7 @@ impl BackendDevice for CudaDevice {
    }

    fn synchronize(&self) -> Result<()> {
-        self.device.synchronize().map_err(crate::Error::wrap)?;
+        self.stream.synchronize().map_err(crate::Error::wrap)?;
        Ok(())
    }
 }
--- a/candle-core/src/cuda_backend/mod.rs
+++ b/candle-core/src/cuda_backend/mod.rs
--- a/candle-core/src/custom_op.rs
+++ b/candle-core/src/custom_op.rs
@ -396,7 +396,10 @@ impl UgIOp1 {
        {
            let device = device.as_cuda_device()?;
            let func = device.compile(name, kernel)?;
-            Ok(Self { name, func })
+            Ok(Self {
+                name,
+                func: func.into_cuda_function(),
+            })
        }
        #[cfg(feature = "metal")]
        {
@ -459,16 +462,16 @@ impl InplaceOp1 for UgIOp1 {
    #[cfg(feature = "cuda")]
    fn cuda_fwd(&self, sto: &mut CudaStorage, layout: &Layout) -> Result<()> {
        use crate::cuda_backend::WrapErr;
-        use cudarc::driver::LaunchAsync;
+        use cudarc::driver::PushKernelArg;

        let elem_count = layout.shape().elem_count();
+        let stream = sto.device.cuda_stream();
        // TODO: support more dtypes.
        let sto = sto.as_cuda_slice::<f32>()?;
        let sto = match layout.contiguous_offsets() {
            None => crate::bail!("input has to be contiguous"),
            Some((o1, o2)) => sto.slice(o1..o2),
        };
-        let params = (&sto,);
        let (g, b) = if elem_count % 32 == 0 {
            (elem_count / 32, 32)
        } else {
@ -479,7 +482,9 @@ impl InplaceOp1 for UgIOp1 {
            block_dim: (b as u32, 1, 1),
            shared_mem_bytes: 0,
        };
-        unsafe { self.func.clone().launch(cfg, params) }.w()?;
+        let mut builder = stream.launch_builder(&self.func);
+        builder.arg(&sto);
+        unsafe { builder.launch(cfg) }.w()?;
        Ok(())
    }
 }
--- a/candle-core/src/pickle.rs
+++ b/candle-core/src/pickle.rs
@ -45,6 +45,7 @@ pub enum OpCode {
    BinFloat = b'G',
    Append = b'a',
    Appends = b'e',
+    Long1 = 0x8a,
 }

 // Avoid using FromPrimitive so as not to drag another dependency.
@ -84,6 +85,7 @@ impl TryFrom<u8> for OpCode {
            b'G' => Ok(Self::BinFloat),
            b'a' => Ok(Self::Append),
            b'e' => Ok(Self::Appends),
+            0x8a => Ok(Self::Long1),
            value => Err(value),
        }
    }
@ -106,6 +108,7 @@ pub enum Object {
        class_name: String,
    },
    Int(i32),
+    Long(i64),
    Float(f64),
    Unicode(String),
    Bool(bool),
@ -170,6 +173,14 @@ impl Object {
        }
    }

+    pub fn int_or_long(self) -> OResult<i64> {
+        match self {
+            Self::Int(t) => Ok(t as i64),
+            Self::Long(t) => Ok(t),
+            _ => Err(self),
+        }
+    }
+
    pub fn tuple(self) -> OResult<Vec<Self>> {
        match self {
            Self::Tuple(t) => Ok(t),
@ -590,6 +601,15 @@ impl Stack {
                let obj = self.new_obj(class, args)?;
                self.push(obj)
            }
+            OpCode::Long1 => {
+                let n_bytes = r.read_u8()?;
+                let mut v = 0;
+                // Decode the next n bytes in little endian
+                for i in 0..n_bytes {
+                    v |= (r.read_u8()? as i64) << (i * 8);
+                }
+                self.push(Object::Long(v))
+            }
        }
        Ok(false)
    }
@ -607,10 +627,10 @@ fn rebuild_args(args: Object) -> Result<(Layout, DType, String, usize)> {
    let mut args = args.tuple()?;
    let stride = Vec::<usize>::try_from(args.remove(3))?;
    let size = Vec::<usize>::try_from(args.remove(2))?;
-    let offset = args.remove(1).int()? as usize;
+    let offset = args.remove(1).int_or_long()? as usize;
    let storage = args.remove(0).persistent_load()?;
    let mut storage = storage.tuple()?;
-    let storage_size = storage.remove(4).int()? as usize;
+    let storage_size = storage.remove(4).int_or_long()? as usize;
    let path = storage.remove(2).unicode()?;
    let (_module_name, class_name) = storage.remove(1).class()?;
    let dtype = match class_name.as_str() {
@ -624,7 +644,11 @@ fn rebuild_args(args: Object) -> Result<(Layout, DType, String, usize)> {
            crate::bail!("unsupported storage type {other}")
        }
    };
-    let layout = Layout::new(crate::Shape::from(size), stride, offset);
+    let layout = Layout::new(
+        crate::Shape::from(size),
+        stride,
+        offset * dtype.size_in_bytes(),
+    );
    Ok((layout, dtype, path, storage_size))
 }

@ -792,7 +816,7 @@ impl PthTensors {
 /// # Arguments
 /// * `path` - Path to the pth file.
 /// * `key` - Optional key to retrieve `state_dict` from the pth file. Sometimes the pth file
-///           contains multiple objects and the state_dict is the one we are interested in.
+///   contains multiple objects and the state_dict is the one we are interested in.
 pub fn read_all_with_key<P: AsRef<std::path::Path>>(
    path: P,
    key: Option<&str>,
--- a/candle-core/src/quantized/cuda.rs
+++ b/candle-core/src/quantized/cuda.rs
@ -1,10 +1,10 @@
 use super::{GgmlDType, QStorage};
 use crate::quantized::k_quants::GgmlType;
 use crate::{backend::BackendDevice, cuda_backend::WrapErr};
-use crate::{CudaDevice, CudaStorage, Result};
+use crate::{builder_arg as barg, CudaDevice, CudaStorage, Result};
 use half::f16;

-use cudarc::driver::{CudaSlice, CudaView, DeviceSlice};
+use cudarc::driver::{CudaSlice, CudaView, PushKernelArg};

 #[derive(Clone, Debug)]
 struct PaddedCudaSlice {
@ -50,19 +50,20 @@ fn quantize_q8_1(
    ky: usize,
    dev: &CudaDevice,
 ) -> Result<()> {
-    use cudarc::driver::LaunchAsync;
-
    let kx = elem_count;
    let kx_padded = pad(kx, MATRIX_ROW_PADDING);
    let num_blocks = ceil_div(kx_padded, CUDA_QUANTIZE_BLOCK_SIZE);
-    let func = dev.get_or_load_func("quantize_q8_1", candle_kernels::QUANTIZED)?;
+    let func = dev.get_or_load_func("quantize_q8_1", &candle_kernels::QUANTIZED)?;
    let cfg = cudarc::driver::LaunchConfig {
        grid_dim: (num_blocks as u32, ky as u32, 1),
        block_dim: (CUDA_QUANTIZE_BLOCK_SIZE as u32, 1, 1),
        shared_mem_bytes: 0,
    };
-    let params = (src, dst, kx as i32, kx_padded as i32);
-    unsafe { func.launch(cfg, params) }.w()?;
+    let mut builder = func.builder();
+    builder.arg(src);
+    builder.arg(dst);
+    barg!(builder, kx as i32, kx_padded as i32);
+    unsafe { builder.launch(cfg) }.w()?;
    Ok(())
 }

@ -72,9 +73,7 @@ fn dequantize_f32(
    elem_count: usize,
    dev: &CudaDevice,
 ) -> Result<CudaStorage> {
-    use cudarc::driver::LaunchAsync;
-
-    let nb = (elem_count + 255) / 256;
+    let nb = elem_count.div_ceil(256);
    let (kernel_name, is_k, block_dim, num_blocks) = match dtype {
        GgmlDType::Q4_0 => ("dequantize_block_q4_0_f32", false, 32, nb),
        GgmlDType::Q4_1 => ("dequantize_block_q4_1_f32", false, 32, nb),
@ -99,8 +98,8 @@ fn dequantize_f32(
        GgmlDType::Q8K => ("dequantize_block_q8_K_f32", true, 32, nb),
        _ => crate::bail!("unsupported dtype for dequantize {dtype:?}"),
    };
-    let func = dev.get_or_load_func(kernel_name, candle_kernels::QUANTIZED)?;
-    let dst = unsafe { dev.alloc::<f32>(elem_count).w()? };
+    let func = dev.get_or_load_func(kernel_name, &candle_kernels::QUANTIZED)?;
+    let dst = unsafe { dev.alloc::<f32>(elem_count)? };
    // See e.g.
    // https://github.com/ggerganov/llama.cpp/blob/cbbd1efa06f8c09f9dff58ff9d9af509cc4c152b/ggml-cuda.cu#L7270
    let cfg = cudarc::driver::LaunchConfig {
@ -110,15 +109,20 @@ fn dequantize_f32(
    };

    if is_k {
-        let params = (&data.inner, &dst);
-        unsafe { func.launch(cfg, params) }.w()?;
+        let mut builder = func.builder();
+        builder.arg(&data.inner);
+        builder.arg(&dst);
+        unsafe { builder.launch(cfg) }.w()?;
    } else {
        let nb32 = match dtype {
            GgmlDType::Q5_0 | GgmlDType::Q5_1 => elem_count,
            _ => elem_count / 32,
        };
-        let params = (&data.inner, &dst, nb32 as i32);
-        unsafe { func.launch(cfg, params) }.w()?;
+        let mut builder = func.builder();
+        builder.arg(&data.inner);
+        builder.arg(&dst);
+        barg!(builder, nb32 as i32);
+        unsafe { builder.launch(cfg) }.w()?;
    }
    Ok(CudaStorage::wrap_cuda_slice(dst, dev.clone()))
 }
@ -129,9 +133,7 @@ fn dequantize_f16(
    elem_count: usize,
    dev: &CudaDevice,
 ) -> Result<CudaStorage> {
-    use cudarc::driver::LaunchAsync;
-
-    let nb = (elem_count + 255) / 256;
+    let nb = elem_count.div_ceil(256);
    let (kernel_name, is_k, block_dim, num_blocks) = match dtype {
        GgmlDType::Q4_0 => ("dequantize_block_q4_0_f16", false, 32, nb),
        GgmlDType::Q4_1 => ("dequantize_block_q4_1_f16", false, 32, nb),
@ -156,8 +158,8 @@ fn dequantize_f16(
        GgmlDType::Q8K => ("dequantize_block_q8_K_f16", true, 32, nb),
        _ => crate::bail!("unsupported dtype for dequantize {dtype:?}"),
    };
-    let func = dev.get_or_load_func(kernel_name, candle_kernels::QUANTIZED)?;
-    let dst = unsafe { dev.alloc::<f16>(elem_count).w()? };
+    let func = dev.get_or_load_func(kernel_name, &candle_kernels::QUANTIZED)?;
+    let dst = unsafe { dev.alloc::<f16>(elem_count)? };
    // See e.g.
    // https://github.com/ggerganov/llama.cpp/blob/cbbd1efa06f8c09f9dff58ff9d9af509cc4c152b/ggml-cuda.cu#L7270
    let cfg = cudarc::driver::LaunchConfig {
@ -167,15 +169,20 @@ fn dequantize_f16(
    };

    if is_k {
-        let params = (&data.inner, &dst);
-        unsafe { func.launch(cfg, params) }.w()?;
+        let mut builder = func.builder();
+        builder.arg(&data.inner);
+        builder.arg(&dst);
+        unsafe { builder.launch(cfg) }.w()?;
    } else {
        let nb32 = match dtype {
            GgmlDType::Q5_0 | GgmlDType::Q5_1 => elem_count,
            _ => elem_count / 32,
        };
-        let params = (&data.inner, &dst, nb32 as i32);
-        unsafe { func.launch(cfg, params) }.w()?;
+        let mut builder = func.builder();
+        builder.arg(&data.inner);
+        builder.arg(&dst);
+        barg!(builder, nb32 as i32);
+        unsafe { builder.launch(cfg) }.w()?;
    }
    Ok(CudaStorage::wrap_cuda_slice(dst, dev.clone()))
 }
@ -188,8 +195,6 @@ fn dequantize_mul_mat_vec(
    nrows: usize,
    dev: &CudaDevice,
 ) -> Result<CudaStorage> {
-    use cudarc::driver::LaunchAsync;
-
    let data_elems = data.len / dtype.type_size() * dtype.block_size();
    if data_elems < ncols * nrows {
        crate::bail!("unexpected data size {}, ncols {ncols} {nrows}", data_elems)
@ -210,8 +215,8 @@ fn dequantize_mul_mat_vec(
        GgmlDType::Q6K => "dequantize_mul_mat_vec_q6_k",
        _ => crate::bail!("unsupported dtype for quantized matmul {dtype:?}"),
    };
-    let func = dev.get_or_load_func(kernel_name, candle_kernels::QUANTIZED)?;
-    let dst = unsafe { dev.alloc::<f32>(nrows).w()? };
+    let func = dev.get_or_load_func(kernel_name, &candle_kernels::QUANTIZED)?;
+    let dst = unsafe { dev.alloc::<f32>(nrows)? };
    let block_num_y = ceil_div(nrows, GGML_CUDA_MMV_Y);
    let cfg = cudarc::driver::LaunchConfig {
        grid_dim: (block_num_y as u32, 1, 1),
@ -219,8 +224,12 @@ fn dequantize_mul_mat_vec(
        shared_mem_bytes: 0,
    };

-    let params = (&data.inner, y, &dst, ncols as i32, nrows as i32);
-    unsafe { func.launch(cfg, params) }.w()?;
+    let mut builder = func.builder();
+    builder.arg(&data.inner);
+    builder.arg(y);
+    builder.arg(&dst);
+    barg!(builder, ncols as i32, nrows as i32);
+    unsafe { builder.launch(cfg) }.w()?;
    Ok(CudaStorage::wrap_cuda_slice(dst, dev.clone()))
 }

@ -233,8 +242,6 @@ fn mul_mat_vec_via_q8_1(
    b_size: usize,
    dev: &CudaDevice,
 ) -> Result<CudaStorage> {
-    use cudarc::driver::LaunchAsync;
-
    let data_elems = data.len / dtype.type_size() * dtype.block_size();
    if data_elems < ncols * nrows {
        crate::bail!("unexpected data size {}, ncols {ncols} {nrows}", data_elems)
@ -249,7 +256,7 @@ fn mul_mat_vec_via_q8_1(
    let ncols_padded = pad(ncols, MATRIX_ROW_PADDING);
    let y_size_in_bytes =
        b_size * ncols_padded * GgmlDType::Q8_1.type_size() / GgmlDType::Q8_1.block_size();
-    let mut y_q8_1 = unsafe { dev.alloc::<u8>(y_size_in_bytes).w()? };
+    let mut y_q8_1 = unsafe { dev.alloc::<u8>(y_size_in_bytes)? };
    quantize_q8_1(y, &mut y_q8_1, ncols, b_size, dev)?;

    let kernel_name = match dtype {
@ -266,13 +273,13 @@ fn mul_mat_vec_via_q8_1(
        _ => crate::bail!("unsupported dtype for quantized matmul {dtype:?}"),
    };
    let kernel_name = format!("{kernel_name}{b_size}");
-    let func = dev.get_or_load_func(&kernel_name, candle_kernels::QUANTIZED)?;
-    let dst = unsafe { dev.alloc::<f32>(nrows * b_size).w()? };
+    let func = dev.get_or_load_func(&kernel_name, &candle_kernels::QUANTIZED)?;
+    let dst = unsafe { dev.alloc::<f32>(nrows * b_size)? };
    // https://github.com/ggerganov/llama.cpp/blob/facb8b56f8fd3bb10a693bf0943ae9d69d0828ef/ggml-cuda/mmvq.cu#L98
    let (nblocks, nwarps) = match b_size {
        1 => (nrows as u32, 4),
-        2..=4 => ((nrows as u32 + 1) / 2, 4),
-        5..=8 => ((nrows as u32 + 1) / 2, 2),
+        2..=4 => ((nrows as u32).div_ceil(2), 4),
+        5..=8 => ((nrows as u32).div_ceil(2), 2),
        _ => crate::bail!("unexpected bsize {b_size}"),
    };
    let cfg = cudarc::driver::LaunchConfig {
@ -281,16 +288,18 @@ fn mul_mat_vec_via_q8_1(
        shared_mem_bytes: 0,
    };

-    let params = (
-        &data.inner,
-        &y_q8_1,
-        &dst,
+    let mut builder = func.builder();
+    builder.arg(&data.inner);
+    builder.arg(&y_q8_1);
+    builder.arg(&dst);
+    barg!(
+        builder,
        /* ncols_x */ ncols as i32,
        /* nrows_x */ nrows as i32,
        /* nrows_y */ ncols_padded as i32,
-        /* nrows_dst */ nrows as i32,
+        /* nrows_dst */ nrows as i32
    );
-    unsafe { func.launch(cfg, params) }.w()?;
+    unsafe { builder.launch(cfg) }.w()?;
    Ok(CudaStorage::wrap_cuda_slice(dst, dev.clone()))
 }

@ -305,8 +314,6 @@ fn mul_mat_via_q8_1(
    y_cols: usize,
    dev: &CudaDevice,
 ) -> Result<CudaStorage> {
-    use cudarc::driver::LaunchAsync;
-
    let data_elems = data.len / dtype.type_size() * dtype.block_size();
    if data_elems < x_rows * x_cols {
        crate::bail!("unexpected lhs size {}, {x_rows} {x_cols}", data_elems)
@ -322,7 +329,7 @@ fn mul_mat_via_q8_1(
    let k_padded = pad(k, MATRIX_ROW_PADDING);
    let y_size_in_bytes =
        k_padded * y_cols * GgmlDType::Q8_1.type_size() / GgmlDType::Q8_1.block_size();
-    let mut y_q8_1 = unsafe { dev.alloc::<u8>(y_size_in_bytes).w()? };
+    let mut y_q8_1 = unsafe { dev.alloc::<u8>(y_size_in_bytes)? };
    quantize_q8_1(y, &mut y_q8_1, k, y_cols, dev)?;

    let (kernel_name, mmq_x, mmq_y) = match dtype {
@ -338,8 +345,8 @@ fn mul_mat_via_q8_1(
        GgmlDType::Q6K => ("mul_mat_q6_K", 64, 64),
        _ => crate::bail!("unsupported dtype for quantized matmul {dtype:?}"),
    };
-    let func = dev.get_or_load_func(kernel_name, candle_kernels::QUANTIZED)?;
-    let dst = unsafe { dev.alloc::<f32>(x_rows * y_cols).w()? };
+    let func = dev.get_or_load_func(kernel_name, &candle_kernels::QUANTIZED)?;
+    let dst = unsafe { dev.alloc::<f32>(x_rows * y_cols)? };
    let cfg = cudarc::driver::LaunchConfig {
        grid_dim: (
            ceil_div(x_rows, mmq_y) as u32,
@ -350,17 +357,19 @@ fn mul_mat_via_q8_1(
        shared_mem_bytes: 0,
    };

-    let params = (
-        /* vx */ &data.inner,
-        /* vy */ &y_q8_1,
-        /* dst */ &dst,
+    let mut builder = func.builder();
+    builder.arg(/* vx */ &data.inner);
+    builder.arg(/* vy */ &y_q8_1);
+    builder.arg(/* dst */ &dst);
+    barg!(
+        builder,
        /* ncols_x */ x_cols as i32,
        /* nrows_x */ x_rows as i32,
        /* ncols_y */ y_cols as i32,
        /* nrows_y */ k_padded as i32,
-        /* nrows_dst */ x_rows as i32,
+        /* nrows_dst */ x_rows as i32
    );
-    unsafe { func.launch(cfg, params) }.w()?;
+    unsafe { builder.launch(cfg) }.w()?;
    Ok(CudaStorage::wrap_cuda_slice(dst, dev.clone()))
 }

@ -369,7 +378,7 @@ impl QCudaStorage {
        let size_in_bytes = ceil_div(el_count, dtype.block_size()) * dtype.type_size();
        let padded_size_in_bytes =
            ceil_div(el_count + MATRIX_ROW_PADDING, dtype.block_size()) * dtype.type_size();
-        let inner = device.alloc_zeros::<u8>(padded_size_in_bytes).w()?;
+        let inner = device.alloc_zeros::<u8>(padded_size_in_bytes)?;
        Ok(QCudaStorage {
            data: PaddedCudaSlice {
                inner,
@ -416,8 +425,7 @@ impl QCudaStorage {

        let buffer = self
            .device
-            .dtoh_sync_copy(&self.data.inner.slice(..self.data.len))
-            .w()?;
+            .memcpy_dtov(&self.data.inner.slice(..self.data.len))?;
        let mut out = vec![0.0; elem_count];
        let block_len = elem_count / self.dtype.block_size();
        match self.dtype {
@ -448,9 +456,7 @@ impl QCudaStorage {
    pub fn quantize(&mut self, src: &CudaStorage) -> Result<()> {
        // Run the quantization on cpu.
        let src = match &src.slice {
-            crate::cuda_backend::CudaStorageSlice::F32(data) => {
-                self.device.dtoh_sync_copy(data).w()?
-            }
+            crate::cuda_backend::CudaStorageSlice::F32(data) => self.device.memcpy_dtov(data)?,
            _ => crate::bail!("only f32 can be quantized"),
        };
        let src_len = src.len();
@ -460,10 +466,9 @@ impl QCudaStorage {
        let data = qcpu_storage.data()?;
        let padded_len =
            data.len() + MATRIX_ROW_PADDING * self.dtype.type_size() / self.dtype.block_size();
-        let mut inner = unsafe { self.device.alloc::<u8>(padded_len).w()? };
+        let mut inner = unsafe { self.device.alloc::<u8>(padded_len)? };
        self.device
-            .htod_sync_copy_into(data.as_ref(), &mut inner.slice_mut(..data.len()))
-            .w()?;
+            .memcpy_htod(data.as_ref(), &mut inner.slice_mut(..data.len()))?;
        self.data = PaddedCudaSlice {
            inner,
            len: data.len(),
@ -597,10 +602,8 @@ pub fn load_quantized<T: super::GgmlType + Send + Sync + 'static>(
    };
    let dtype = T::DTYPE;
    let padded_len = data.len() + MATRIX_ROW_PADDING * dtype.type_size() / dtype.block_size();
-    let mut inner = unsafe { device.alloc::<u8>(padded_len).w()? };
-    device
-        .htod_sync_copy_into(data, &mut inner.slice_mut(..data.len()))
-        .w()?;
+    let mut inner = unsafe { device.alloc::<u8>(padded_len)? };
+    device.memcpy_htod(data, &mut inner.slice_mut(..data.len()))?;
    Ok(QStorage::Cuda(QCudaStorage {
        data: PaddedCudaSlice {
            inner,
@ -622,9 +625,9 @@ mod test {
        let el_padded = pad(el, MATRIX_ROW_PADDING);
        let y_size_in_bytes =
            el_padded * GgmlDType::Q8_1.type_size() / GgmlDType::Q8_1.block_size();
-        let mut y_q8_1 = unsafe { dev.alloc::<u8>(y_size_in_bytes).w()? };
+        let mut y_q8_1 = unsafe { dev.alloc::<u8>(y_size_in_bytes)? };
        let vs: Vec<f32> = (0..el).map(|v| v as f32).collect();
-        let y = dev.htod_sync_copy(&vs).w()?;
+        let y = dev.memcpy_stod(&vs)?;
        quantize_q8_1(&y.slice(..), &mut y_q8_1, el, 1, &dev)?;
        Ok(())
    }
@ -634,7 +637,7 @@ mod test {
        let dev = CudaDevice::new(0)?;
        let ncols = 256;
        let vs: Vec<f32> = (0..ncols).map(|v| v as f32).collect();
-        let y = dev.htod_sync_copy(&vs).w()?;
+        let y = dev.memcpy_stod(&vs)?;
        let mut xs = QCudaStorage::zeros(&dev, ncols, GgmlDType::Q4_0)?;
        xs.quantize(&CudaStorage::wrap_cuda_slice(y.clone(), dev.clone()))?;
        let cuda_storage = mul_mat_vec_via_q8_1(
@ -647,7 +650,7 @@ mod test {
            &dev,
        )?;
        let vs = cuda_storage.as_cuda_slice::<f32>()?;
-        let vs = dev.dtoh_sync_copy(&vs.slice(..)).unwrap();
+        let vs = dev.memcpy_dtov(&vs.slice(..))?;
        assert_eq!(vs.len(), 1);
        // for n = 255, n.(n+1).(2n+1) / 6 = 5559680
        // Q8 means 1/256 precision.
@ -662,7 +665,7 @@ mod test {
            &dev,
        )?;
        let vs = cuda_storage.as_cuda_slice::<f32>()?;
-        let vs = dev.dtoh_sync_copy(&vs.slice(..)).unwrap();
+        let vs = dev.memcpy_dtov(&vs.slice(..))?;
        assert_eq!(vs.len(), 1);
        assert_eq!(vs[0], 5561851.0);
        Ok(())
@ -673,7 +676,7 @@ mod test {
        let dev = CudaDevice::new(0)?;
        let ncols = 256;
        let vs: Vec<f32> = (0..ncols * 4).map(|v| v as f32 / 4.).collect();
-        let y = dev.htod_sync_copy(&vs).w()?;
+        let y = dev.memcpy_stod(&vs)?;
        let mut xs = QCudaStorage::zeros(&dev, ncols * 4, GgmlDType::Q4_0)?;
        xs.quantize(&CudaStorage::wrap_cuda_slice(y.clone(), dev.clone()))?;
        let cuda_storage = mul_mat_via_q8_1(
@ -687,7 +690,7 @@ mod test {
            &dev,
        )?;
        let vs = cuda_storage.as_cuda_slice::<f32>()?;
-        let vs = dev.dtoh_sync_copy(&vs.slice(..)).unwrap();
+        let vs = dev.memcpy_dtov(&vs.slice(..))?;

        /*
           x = torch.tensor([float(v) for v in range(1024)]).reshape(4, 256)
@ -714,7 +717,7 @@ mod test {
        let dev = CudaDevice::new(0)?;
        let (x_rows, ncols, y_cols) = (4, 16, 2048);
        let vs: Vec<f32> = (0..ncols * y_cols).map(|v| v as f32 / 256.).collect();
-        let y = dev.htod_sync_copy(&vs).w()?;
+        let y = dev.memcpy_stod(&vs)?;
        let mut xs = QCudaStorage::zeros(&dev, ncols * x_rows, GgmlDType::Q4_0)?;
        xs.quantize(&CudaStorage::wrap_cuda_slice(y.clone(), dev.clone()))?;
        let cuda_storage = mul_mat_via_q8_1(
@ -728,7 +731,7 @@ mod test {
            &dev,
        )?;
        let vs = cuda_storage.as_cuda_slice::<f32>()?;
-        let _vs = dev.dtoh_sync_copy(&vs.slice(..)).unwrap();
+        let _vs = dev.memcpy_dtov(&vs.slice(..))?;
        Ok(())
    }
 }
--- a/candle-core/src/shape.rs
+++ b/candle-core/src/shape.rs
@ -43,43 +43,22 @@ impl From<usize> for Shape {
    }
 }

-impl From<(usize,)> for Shape {
-    fn from(d1: (usize,)) -> Self {
-        Self(vec![d1.0])
+macro_rules! impl_from_tuple {
+    ($tuple:ty, $($index:tt),+) => {
+        impl From<$tuple> for Shape {
+            fn from(d: $tuple) -> Self {
+                Self(vec![$(d.$index,)+])
+            }
+        }
    }
 }

-impl From<(usize, usize)> for Shape {
-    fn from(d12: (usize, usize)) -> Self {
-        Self(vec![d12.0, d12.1])
-    }
-}
-
-impl From<(usize, usize, usize)> for Shape {
-    fn from(d123: (usize, usize, usize)) -> Self {
-        Self(vec![d123.0, d123.1, d123.2])
-    }
-}
-
-impl From<(usize, usize, usize, usize)> for Shape {
-    fn from(d1234: (usize, usize, usize, usize)) -> Self {
-        Self(vec![d1234.0, d1234.1, d1234.2, d1234.3])
-    }
-}
-
-impl From<(usize, usize, usize, usize, usize)> for Shape {
-    fn from(d12345: (usize, usize, usize, usize, usize)) -> Self {
-        Self(vec![d12345.0, d12345.1, d12345.2, d12345.3, d12345.4])
-    }
-}
-
-impl From<(usize, usize, usize, usize, usize, usize)> for Shape {
-    fn from(d123456: (usize, usize, usize, usize, usize, usize)) -> Self {
-        Self(vec![
-            d123456.0, d123456.1, d123456.2, d123456.3, d123456.4, d123456.5,
-        ])
-    }
-}
+impl_from_tuple!((usize,), 0);
+impl_from_tuple!((usize, usize), 0, 1);
+impl_from_tuple!((usize, usize, usize), 0, 1, 2);
+impl_from_tuple!((usize, usize, usize, usize), 0, 1, 2, 3);
+impl_from_tuple!((usize, usize, usize, usize, usize), 0, 1, 2, 3, 4);
+impl_from_tuple!((usize, usize, usize, usize, usize, usize), 0, 1, 2, 3, 4, 5);

 impl From<Vec<usize>> for Shape {
    fn from(dims: Vec<usize>) -> Self {
@ -636,4 +615,20 @@ mod tests {
        let shape = Shape::from((299, 792, 458));
        assert_eq!(shape.stride_contiguous(), [458 * 792, 458, 1]);
    }
+
+    #[test]
+    fn test_from_tuple() {
+        let shape = Shape::from((2,));
+        assert_eq!(shape.dims(), &[2]);
+        let shape = Shape::from((2, 3));
+        assert_eq!(shape.dims(), &[2, 3]);
+        let shape = Shape::from((2, 3, 4));
+        assert_eq!(shape.dims(), &[2, 3, 4]);
+        let shape = Shape::from((2, 3, 4, 5));
+        assert_eq!(shape.dims(), &[2, 3, 4, 5]);
+        let shape = Shape::from((2, 3, 4, 5, 6));
+        assert_eq!(shape.dims(), &[2, 3, 4, 5, 6]);
+        let shape = Shape::from((2, 3, 4, 5, 6, 7));
+        assert_eq!(shape.dims(), &[2, 3, 4, 5, 6, 7]);
+    }
 }
--- a/candle-core/src/sort.rs
+++ b/candle-core/src/sort.rs
@ -56,7 +56,7 @@ impl ArgSort {
 mod cuda {
    use super::*;
    use crate::cuda_backend::cudarc::driver::{
-        CudaSlice, DeviceRepr, LaunchAsync, LaunchConfig, ValidAsZeroBits,
+        CudaSlice, DeviceRepr, LaunchConfig, ValidAsZeroBits,
    };
    use crate::cuda_backend::{kernel_name, kernels, CudaStorageSlice as S, WrapErr};
    use crate::{CudaDevice, WithDType};
@ -69,27 +69,33 @@ mod cuda {
            layout: &crate::Layout,
            _wrap: W,
        ) -> Result<S> {
+            use cudarc::driver::PushKernelArg;
+
            let slice = match layout.contiguous_offsets() {
                None => crate::bail!("input has to be contiguous"),
                Some((o1, o2)) => src.slice(o1..o2),
            };
            let elem_count = layout.shape().elem_count();
-            let dst = unsafe { dev.alloc::<u32>(elem_count) }.w()?;
+            let dst = unsafe { dev.alloc::<u32>(elem_count)? };
            let func = if self.asc {
-                dev.get_or_load_func(&kernel_name::<T>("asort_asc"), kernels::SORT)?
+                dev.get_or_load_func(&kernel_name::<T>("asort_asc"), &kernels::SORT)?
            } else {
-                dev.get_or_load_func(&kernel_name::<T>("asort_desc"), kernels::SORT)?
+                dev.get_or_load_func(&kernel_name::<T>("asort_desc"), &kernels::SORT)?
            };
            let ncols = self.last_dim;
            let nrows = elem_count / ncols;
            let ncols_pad = next_power_of_2(ncols);
-            let params = (&slice, &dst, ncols as i32, ncols_pad as i32);
            let cfg = LaunchConfig {
                grid_dim: (1, nrows as u32, 1),
                block_dim: (ncols_pad as u32, 1, 1),
                shared_mem_bytes: (ncols_pad * std::mem::size_of::<u32>()) as u32,
            };
-            unsafe { func.launch(cfg, params) }.w()?;
+            let stream = dev.cuda_stream();
+            let mut builder = stream.launch_builder(&func);
+            let ncols = ncols as i32;
+            let ncols_pad = ncols_pad as i32;
+            builder.arg(&slice).arg(&dst).arg(&ncols).arg(&ncols_pad);
+            unsafe { builder.launch(cfg) }.w()?;
            Ok(S::U32(dst))
        }
    }
--- a/candle-core/src/tensor.rs
+++ b/candle-core/src/tensor.rs
@ -2580,6 +2580,28 @@ impl Tensor {
    pub fn broadcast_pow(&self, rhs: &Tensor) -> Result<Self> {
        rhs.broadcast_mul(&self.log()?)?.exp()
    }
+
+    /// Returns a new tensor with the order of elements reversed along the specified dimensions.
+    /// This function makes a copy of the tensor’s data.
+    ///
+    /// ```rust
+    /// # use candle_core::{Tensor, Device};
+    /// let t = Tensor::arange(0., 6., &Device::Cpu)?.reshape((2, 3))?;
+    /// assert_eq!(t.to_vec2::<f64>()?, &[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]);
+    /// let t_flipped = t.flip(&[0])?;
+    /// assert_eq!(t_flipped.to_vec2::<f64>()?, &[[3.0, 4.0, 5.0], [0.0, 1.0, 2.0]]);
+    /// # Ok::<(), candle_core::Error>(())
+    /// ```
+    pub fn flip(&self, dims: &[usize]) -> Result<Tensor> {
+        let mut result = self.clone();
+        for &dim in dims.iter() {
+            let size = result.dim(dim)?;
+            let indices: Vec<i64> = (0..size).rev().map(|x| x as i64).collect();
+            let indices_tensor = Tensor::from_vec(indices, (size,), result.device())?;
+            result = result.index_select(&indices_tensor, dim)?;
+        }
+        Ok(result)
+    }
 }

 macro_rules! bin_trait {
--- a/candle-core/src/test_utils.rs
+++ b/candle-core/src/test_utils.rs
@ -24,6 +24,15 @@ macro_rules! test_device {
    };
 }

+pub fn assert_tensor_eq(t1: &Tensor, t2: &Tensor) -> Result<()> {
+    assert_eq!(t1.shape(), t2.shape());
+    // Default U8 may not be large enough to hold the sum (`t.sum_all` defaults to the dtype of `t`)
+    let eq_tensor = t1.eq(t2)?.to_dtype(crate::DType::U32)?;
+    let all_equal = eq_tensor.sum_all()?;
+    assert_eq!(all_equal.to_scalar::<u32>()?, eq_tensor.elem_count() as u32);
+    Ok(())
+}
+
 pub fn to_vec0_round(t: &Tensor, digits: i32) -> Result<f32> {
    let b = 10f32.powi(digits);
    let t = t.to_vec0::<f32>()?;
--- a/candle-core/tests/conv_tests.rs
+++ b/candle-core/tests/conv_tests.rs
@ -53,6 +53,20 @@ fn conv1d(dev: &Device) -> Result<()> {
        test_utils::to_vec1_round(&res.flatten_all()?, 4)?,
        [2.4509, 2.6357, -1.3336, 4.1393, 0.5657, 1.8091, -1.1784, 3.5675, 0.5069, 3.3352]
    );
+    let res = {
+        let t = Tensor::cat(&[&t.zeros_like()?, &t, &t.zeros_like()?], 0)?;
+        t.conv1d(&w, /*padding*/ 1, 1, 1, 1)?
+    };
+    assert_eq!(res.dims(), [3, 2, 5]);
+    // Same as pytorch default padding: use zeros.
+    assert_eq!(
+        test_utils::to_vec1_round(&res.i(0)?.flatten_all()?, 4)?,
+        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]
+    );
+    assert_eq!(
+        test_utils::to_vec1_round(&res.i(1)?.flatten_all()?, 4)?,
+        [2.4509, 2.6357, -1.3336, 4.1393, 0.5657, 1.8091, -1.1784, 3.5675, 0.5069, 3.3352]
+    );

    let w = w.transpose(0, 1)?;
    // The CPU kernels applied in the contiguous and non contiguous cases are different.
@ -163,6 +177,22 @@ fn conv2d(dev: &Device) -> Result<()> {
            10.389, 3.6023, -4.2808, 0.2672, 5.3646, -5.2023, -2.1955, -9.4075
        ]
    );
+    let res = {
+        let t = Tensor::cat(&[&t.zeros_like()?, &t, &t.zeros_like()?], 0)?;
+        t.conv2d(&w, 0, 1, 1, 1)?
+    };
+    assert_eq!(res.dims(), [3, 2, 3, 3]);
+    assert_eq!(
+        test_utils::to_vec1_round(&res.i(0)?.flatten_all()?, 4)?,
+        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]
+    );
+    assert_eq!(
+        test_utils::to_vec1_round(&res.i(1)?.flatten_all()?, 4)?,
+        [
+            -4.2812, 2.0923, 5.2187, 7.5184, 0.752, -14.9426, 10.0087, 4.391, 0.2918, 1.6715,
+            10.389, 3.6023, -4.2808, 0.2672, 5.3646, -5.2023, -2.1955, -9.4075
+        ]
+    );

    let res = t.conv_transpose2d(&w.transpose(0, 1)?, 0, 0, 1, 1)?;

--- a/candle-core/tests/grad_tests.rs
+++ b/candle-core/tests/grad_tests.rs
@ -1,6 +1,6 @@
 #![allow(clippy::approx_constant)]
 use anyhow::{Context, Result};
-use candle_core::{test_device, test_utils, Device, Shape, Tensor, Var};
+use candle_core::{test_device, test_utils, DType, Device, Shape, Tensor, Var};

 fn simple_grad(device: &Device) -> Result<()> {
    let x = Var::new(&[3f32, 1., 4.], device)?;
@ -505,6 +505,36 @@ fn binary_grad(device: &Device) -> Result<()> {
    Ok(())
 }

+#[test]
+fn test_flip_backprop() -> Result<()> {
+    let device = &Device::Cpu;
+
+    // Create a tensor (leaf node) that requires gradients
+    let x = Var::ones((2, 2), DType::F64, device)?;
+    let weights = Tensor::arange(1.0, 5.0, device)?.reshape((2, 2))?;
+
+    let y = x.matmul(&weights)?;
+    let expected_y = Tensor::from_vec(vec![4.0, 6.0, 4.0, 6.0], (2, 2), device)?;
+    candle_core::test_utils::assert_tensor_eq(&y, &expected_y)?;
+
+    let z = y.flip(&[1])?;
+    let expected_z = Tensor::from_vec(vec![6.0, 4.0, 6.0, 4.0], (2, 2), device)?;
+    candle_core::test_utils::assert_tensor_eq(&z, &expected_z)?;
+
+    let loss = z.sum_all()?;
+
+    let grad_store = loss.backward()?;
+    let grad_x = grad_store.get_id(x.id()).unwrap();
+
+    let flipped_weights = weights.flip(&[1])?;
+    let dloss_dy = Tensor::ones((2, 2), DType::F64, device)?;
+    // dloss/dx = dloss/dy @ dy/dx = ones @ weight.flip.T
+    let expected_grad = dloss_dy.matmul(&flipped_weights.t()?)?;
+    candle_core::test_utils::assert_tensor_eq(grad_x, &expected_grad)?;
+
+    Ok(())
+}
+
 test_device!(
    simple_grad,
    simple_grad_cpu,
--- a/candle-core/tests/quantized_tests.rs
+++ b/candle-core/tests/quantized_tests.rs
@ -880,10 +880,10 @@ fn get_random_tensors(
    let mut rng = StdRng::seed_from_u64(314159265358979);

    let lhs = (0..m * k)
-        .map(|_| rng.gen::<f32>() - 0.5)
+        .map(|_| rng.random::<f32>() - 0.5)
        .collect::<Vec<_>>();
    let rhs = (0..n * k)
-        .map(|_| rng.gen::<f32>() - 0.5)
+        .map(|_| rng.random::<f32>() - 0.5)
        .collect::<Vec<_>>();

    let lhs = Tensor::from_vec(lhs, (m, k), device)?;
--- a/candle-core/tests/tensor_tests.rs
+++ b/candle-core/tests/tensor_tests.rs
@ -1682,3 +1682,54 @@ fn pow() -> Result<()> {
    );
    Ok(())
 }
+
+#[test]
+fn test_flip_1d() -> Result<()> {
+    // 1D: [0, 1, 2, 3, 4]
+    let t = Tensor::arange(0.0, 5.0, &Device::Cpu)?.reshape((5,))?;
+    let flipped = t.flip(&[0])?;
+    // Expected: [4, 3, 2, 1, 0]
+    let expected = Tensor::from_vec(vec![4.0, 3.0, 2.0, 1.0, 0.0], (5,), &Device::Cpu)?;
+    candle_core::test_utils::assert_tensor_eq(&flipped, &expected)?;
+    Ok(())
+}
+
+#[test]
+fn test_flip_2d() -> Result<()> {
+    // 2D:
+    // [[0, 1, 2],
+    //  [3, 4, 5]]
+    let t = Tensor::arange(0.0, 6.0, &Device::Cpu)?.reshape((2, 3))?;
+    let flipped = t.flip(&[0, 1])?;
+    // Expected:
+    // [[5, 4, 3],
+    //  [2, 1, 0]]
+    let expected = Tensor::from_vec(vec![5.0, 4.0, 3.0, 2.0, 1.0, 0.0], (2, 3), &Device::Cpu)?;
+    candle_core::test_utils::assert_tensor_eq(&flipped, &expected)?;
+    Ok(())
+}
+
+#[test]
+fn test_flip_3d_channels() -> Result<()> {
+    // 3D:
+    // [[[0,1,2],
+    //   [3,4,5]],
+    //
+    //  [[6,7,8],
+    //   [9,10,11]]]
+    let t = Tensor::arange(0.0, 12.0, &Device::Cpu)?.reshape((2, 2, 3))?;
+    let flipped = t.flip(&[2])?;
+    // Expected:
+    // [[[2,1,0],
+    //   [5,4,3]],
+    //
+    //  [[8,7,6],
+    //   [11,10,9]]]
+    let expected = Tensor::from_vec(
+        vec![2.0, 1.0, 0.0, 5.0, 4.0, 3.0, 8.0, 7.0, 6.0, 11.0, 10.0, 9.0],
+        (2, 2, 3),
+        &Device::Cpu,
+    )?;
+    candle_core::test_utils::assert_tensor_eq(&flipped, &expected)?;
+    Ok(())
+}
--- a/candle-datasets/src/nlp/tinystories.rs
+++ b/candle-datasets/src/nlp/tinystories.rs
@ -60,8 +60,8 @@ pub struct DatasetRandomIter<'a> {

 impl<'a> DatasetRandomIter<'a> {
    pub fn new(ds: &'a Dataset, valid: bool, seq_len: usize, device: Device) -> Self {
+        use rand::rng;
        use rand::seq::SliceRandom;
-        use rand::thread_rng;

        let all_tokens = if valid {
            &ds.valid_tokens
@ -69,13 +69,13 @@ impl<'a> DatasetRandomIter<'a> {
            &ds.train_tokens
        };
        let mut tokens = all_tokens.iter().collect::<Vec<_>>();
-        tokens.shuffle(&mut thread_rng());
+        tokens.shuffle(&mut rng());
        let current_tokens = tokens.pop().unwrap();
        let seq_len_in_bytes = seq_len * 2;
        let mut indexes_in_bytes = (0..current_tokens.len() - seq_len_in_bytes)
            .step_by(seq_len_in_bytes)
            .collect::<Vec<_>>();
-        indexes_in_bytes.shuffle(&mut thread_rng());
+        indexes_in_bytes.shuffle(&mut rng());
        Self {
            all_tokens,
            tokens,
@ -92,21 +92,21 @@ impl Iterator for DatasetRandomIter<'_> {

    fn next(&mut self) -> Option<Self::Item> {
        use byteorder::{LittleEndian, ReadBytesExt};
+        use rand::rng;
        use rand::seq::SliceRandom;
-        use rand::thread_rng;

        let seq_len = self.seq_len;
        if self.indexes_in_bytes.is_empty() {
            if self.tokens.is_empty() {
                self.tokens = self.all_tokens.iter().collect();
-                self.tokens.shuffle(&mut thread_rng());
+                self.tokens.shuffle(&mut rng());
            }
            self.current_tokens = self.tokens.pop().unwrap();
            let seq_len_in_bytes = self.seq_len * 2;
            self.indexes_in_bytes = (0..self.current_tokens.len() - seq_len_in_bytes)
                .step_by(seq_len_in_bytes)
                .collect::<Vec<_>>();
-            self.indexes_in_bytes.shuffle(&mut thread_rng());
+            self.indexes_in_bytes.shuffle(&mut rng());
        }
        let start_idx = self.indexes_in_bytes.pop().unwrap();
        let bytes = &self.current_tokens[start_idx..start_idx + 2 * (seq_len + 1)];
--- a/candle-datasets/src/vision/cifar.rs
+++ b/candle-datasets/src/vision/cifar.rs
@ -72,6 +72,8 @@ fn load_parquet(parquet: SerializedFileReader<std::fs::File>) -> Result<(Tensor,
            if let parquet::record::Field::Group(subrow) = field {
                for (_name, field) in subrow.get_column_iter() {
                    if let parquet::record::Field::Bytes(value) = field {
+                        // image-rs crate convention is to load in (width, height, channels) order
+                        // See: https://docs.rs/image/latest/image/trait.ImageDecoder.html#tymethod.dimensions
                        let image = image::load_from_memory(value.data()).unwrap();
                        buffer_images.extend(image.to_rgb8().as_raw());
                    }
@ -81,8 +83,10 @@ fn load_parquet(parquet: SerializedFileReader<std::fs::File>) -> Result<(Tensor,
            }
        }
    }
-    let images = (Tensor::from_vec(buffer_images, (samples, 3, 32, 32), &Device::Cpu)?
-        .to_dtype(DType::U8)?
+    // Reorder image-rs convention (width, height, channels) to candle/pytorch convolution convention (channels, height, width)
+    let images = (Tensor::from_vec(buffer_images, (samples, 32, 32, 3), &Device::Cpu)?
+        .to_dtype(DType::F32)?
+        .permute((0, 3, 2, 1))?
        / 255.)?;
    let labels = Tensor::from_vec(buffer_labels, (samples,), &Device::Cpu)?;
    Ok((images, labels))
--- a/candle-examples/Cargo.toml
+++ b/candle-examples/Cargo.toml
@ -60,7 +60,7 @@ bindgen_cuda = { version = "0.1.1", optional = true }
 default = []
 accelerate = ["dep:accelerate-src", "candle/accelerate", "candle-nn/accelerate", "candle-transformers/accelerate"]
 cuda = ["candle/cuda", "candle-nn/cuda", "candle-transformers/cuda", "dep:bindgen_cuda"]
-cudnn = ["candle/cudnn"]
+cudnn = ["candle/cudnn", "candle-nn/cudnn", "candle-transformers/cudnn"]
 flash-attn = ["cuda", "candle-transformers/flash-attn", "dep:candle-flash-attn"]
 mkl = ["dep:intel-mkl-src", "candle/mkl", "candle-nn/mkl", "candle-transformers/mkl"]
 nccl = ["cuda", "cudarc/nccl", "dep:half"]
@ -69,6 +69,7 @@ metal = ["candle/metal", "candle-nn/metal"]
 microphone = ["cpal", "rubato"]
 encodec = ["cpal", "symphonia", "rubato"]
 mimi = ["cpal", "symphonia", "rubato"]
+snac = ["cpal", "symphonia", "rubato"]
 depth_anything_v2 = ["palette", "enterpolation"]

 [[example]]
@ -107,6 +108,10 @@ required-features = ["candle-datasets"]
 name = "mimi"
 required-features = ["mimi"]

+[[example]]
+name = "snac"
+required-features = ["snac"]
+
 [[example]]
 name = "encodec"
 required-features = ["encodec"]
--- a/candle-examples/examples/chatglm/README.md
+++ b/candle-examples/examples/chatglm/README.md
@ -0,0 +1,13 @@
+# candle-chatglm
+
+Uses `THUDM/chatglm3-6b` to generate chinese text. Will not generate text for english (usually).
+ 
+## Text Generation
+
+```bash
+cargo run --example chatglm --release  -- --prompt "部署门槛较低等众多优秀特 "
+
+> 部署门槛较低等众多优秀特 点，使得其成为了一款备受欢迎的AI助手。
+> 
+> 作为一款人工智能助手，ChatGLM3-6B
+```
--- a/candle-examples/examples/chinese_clip/README.md
+++ b/candle-examples/examples/chinese_clip/README.md
@ -0,0 +1,42 @@
+# candle-chinese-clip
+
+Contrastive Language-Image Pre-Training (CLIP) is an architecture trained on
+pairs of images with related texts. This one is trained using in chinese instead of english.
+
+## Running on cpu
+
+```bash
+$ cargo run --example chinese_clip --release -- --images "candle-examples/examples/stable-diffusion/assets/stable-diffusion-xl.jpg","candle-examples/examples/yolo-v8/assets/bike.jpg" --cpu --sequences "一场自行车比赛","两只猫的照片","一个机器人拿着蜡烛"
+
+> Results for image: candle-examples/examples/stable-diffusion/assets/stable-diffusion-xl.jpg
+>
+> 2025-03-25T19:22:01.325177Z  INFO chinese_clip: Probability: 0.0000% Text: 一场自行车比赛 
+> 2025-03-25T19:22:01.325179Z  INFO chinese_clip: Probability: 0.0000% Text: 两只猫的照片 
+> 2025-03-25T19:22:01.325181Z  INFO chinese_clip: Probability: 100.0000% Text: 一个机器人拿着蜡烛 
+> 2025-03-25T19:22:01.325183Z  INFO chinese_clip: 
+> 
+> Results for image: candle-examples/examples/yolo-v8/assets/bike.jpg
+> 
+> 2025-03-25T19:22:01.325184Z  INFO chinese_clip: Probability: 100.0000% Text: 一场自行车比赛 
+> 2025-03-25T19:22:01.325186Z  INFO chinese_clip: Probability: 0.0000% Text: 两只猫的照片 
+> 2025-03-25T19:22:01.325187Z  INFO chinese_clip: Probability: 0.0000% Text: 一个机器人拿着蜡烛 
+```
+
+## Running on metal
+
+```bash 
+$ cargo run --features metal --example chinese_clip --release -- --images "candle-examples/examples/stable-diffusion/assets/stable-diffusion-xl.jpg","candle-examples/examples/yolo-v8/assets/bike.jpg" --cpu --sequences "一场自行车比赛","两只猫的照片","一个机器人拿着蜡烛"
+
+> Results for image: candle-examples/examples/stable-diffusion/assets/stable-diffusion-xl.jpg
+>
+> 2025-03-25T19:22:01.325177Z  INFO chinese_clip: Probability: 0.0000% Text: 一场自行车比赛 
+> 2025-03-25T19:22:01.325179Z  INFO chinese_clip: Probability: 0.0000% Text: 两只猫的照片 
+> 2025-03-25T19:22:01.325181Z  INFO chinese_clip: Probability: 100.0000% Text: 一个机器人拿着蜡烛 
+> 2025-03-25T19:22:01.325183Z  INFO chinese_clip: 
+> 
+> Results for image: candle-examples/examples/yolo-v8/assets/bike.jpg
+> 
+> 2025-03-25T19:22:01.325184Z  INFO chinese_clip: Probability: 100.0000% Text: 一场自行车比赛 
+> 2025-03-25T19:22:01.325186Z  INFO chinese_clip: Probability: 0.0000% Text: 两只猫的照片 
+> 2025-03-25T19:22:01.325187Z  INFO chinese_clip: Probability: 0.0000% Text: 一个机器人拿着蜡烛 
+```
--- a/candle-examples/examples/convmixer/README.md
+++ b/candle-examples/examples/convmixer/README.md
@ -0,0 +1,17 @@
+# candle-convmixer
+
+A lightweight CNN architecture that processes image patches similar to a vision transformer, with separate spatial and channel convolutions.
+
+ConvMixer from [Patches Are All You Need?](https://arxiv.org/pdf/2201.09792) and [ConvMixer](https://github.com/locuslab/convmixer). 
+
+## Running an example
+
+```bash
+$ cargo run --example convmixer --release -- --image candle-examples/examples/yolo-v8/assets/bike.jpg
+
+> mountain bike, all-terrain bike, off-roader: 61.75%
+> unicycle, monocycle     : 5.73%
+> moped                   : 3.66%
+> bicycle-built-for-two, tandem bicycle, tandem: 3.51%
+> crash helmet            : 0.85%
+```
--- a/candle-examples/examples/csm/README.md
+++ b/candle-examples/examples/csm/README.md
@ -0,0 +1,14 @@
+# Conversational Speech Model (CSM)
+
+CSM is a speech generation model from Sesame,
+[SesameAILabs/csm](https://github.com/SesameAILabs/csm).
+
+It can generate a conversational speech between two different speakers.
+The speakers turn are delimited by the `|` character in the prompt.
+
+```bash
+cargo run --example csm --features cuda -r -- \
+    --voices candle-examples/examples/csm/voices.safetensors  \
+    --prompt "Hey how are you doing?|Pretty good, pretty good. How about you?"
+```
+
--- a/candle-examples/examples/csm/main.rs
+++ b/candle-examples/examples/csm/main.rs
@ -0,0 +1,243 @@
+#[cfg(feature = "mkl")]
+extern crate intel_mkl_src;
+
+#[cfg(feature = "accelerate")]
+extern crate accelerate_src;
+
+use anyhow::{Error as E, Result};
+use clap::Parser;
+
+use candle_transformers::models::csm::{Config, Model};
+
+use candle::{DType, IndexOp, Tensor};
+use candle_nn::VarBuilder;
+use hf_hub::{api::sync::Api, Repo, RepoType};
+use tokenizers::Tokenizer;
+
+#[derive(Clone, Debug, Copy, PartialEq, Eq, clap::ValueEnum)]
+enum Which {
+    #[value(name = "1b")]
+    Csm1b,
+}
+
+#[derive(Parser, Debug)]
+#[command(author, version, about, long_about = None)]
+struct Args {
+    /// Run on CPU rather than on GPU.
+    #[arg(long)]
+    cpu: bool,
+
+    /// Enable tracing (generates a trace-timestamp.json file).
+    #[arg(long)]
+    tracing: bool,
+
+    #[arg(long)]
+    use_flash_attn: bool,
+
+    /// The prompt to be used for the generation, use a | to separate the speakers.
+    #[arg(long, default_value = "Hey how are you doing today?")]
+    prompt: String,
+
+    /// The voices to be used, in safetensors format.
+    #[arg(long)]
+    voices: String,
+
+    /// The output file using the wav format.
+    #[arg(long, default_value = "out.wav")]
+    out_file: String,
+
+    /// The temperature used to generate samples.
+    #[arg(long, default_value_t = 0.7)]
+    temperature: f64,
+
+    /// Nucleus sampling probability cutoff.
+    #[arg(long)]
+    top_p: Option<f64>,
+
+    /// Only sample among the top K samples.
+    #[arg(long)]
+    top_k: Option<usize>,
+
+    /// The seed to use when generating random samples.
+    #[arg(long, default_value_t = 299792458)]
+    seed: u64,
+
+    /// The length of the sample to generate (in tokens).
+    #[arg(long, short = 'n', default_value_t = 10000)]
+    sample_len: usize,
+
+    /// The model size to use.
+    #[arg(long, default_value = "1b")]
+    which: Which,
+
+    #[arg(long)]
+    model_id: Option<String>,
+
+    #[arg(long, default_value = "main")]
+    revision: String,
+
+    #[arg(long)]
+    tokenizer: Option<String>,
+
+    #[arg(long)]
+    config: Option<String>,
+
+    #[arg(long)]
+    weights: Option<String>,
+
+    /// The mimi model weight file, in safetensor format.
+    #[arg(long)]
+    mimi_weights: Option<String>,
+
+    /// Penalty to be applied for repeating tokens, 1. means no penalty.
+    #[arg(long, default_value_t = 1.1)]
+    repeat_penalty: f32,
+
+    /// The context size to consider for the repeat penalty.
+    #[arg(long, default_value_t = 64)]
+    repeat_last_n: usize,
+}
+
+fn main() -> Result<()> {
+    use tracing_chrome::ChromeLayerBuilder;
+    use tracing_subscriber::prelude::*;
+
+    let args = Args::parse();
+
+    let _guard = if args.tracing {
+        let (chrome_layer, guard) = ChromeLayerBuilder::new().build();
+        tracing_subscriber::registry().with(chrome_layer).init();
+        Some(guard)
+    } else {
+        None
+    };
+    println!(
+        "avx: {}, neon: {}, simd128: {}, f16c: {}",
+        candle::utils::with_avx(),
+        candle::utils::with_neon(),
+        candle::utils::with_simd128(),
+        candle::utils::with_f16c()
+    );
+    println!(
+        "temp: {:.2} repeat-penalty: {:.2} repeat-last-n: {}",
+        args.temperature, args.repeat_penalty, args.repeat_last_n
+    );
+
+    let start = std::time::Instant::now();
+    let api = Api::new()?;
+    let model_id = match args.model_id {
+        Some(model_id) => model_id,
+        None => {
+            let name = match args.which {
+                Which::Csm1b => "sesame/csm-1b",
+            };
+            name.to_string()
+        }
+    };
+    let repo = api.repo(Repo::with_revision(
+        model_id,
+        RepoType::Model,
+        args.revision,
+    ));
+    let filenames = match args.weights {
+        Some(files) => files
+            .split(',')
+            .map(std::path::PathBuf::from)
+            .collect::<Vec<_>>(),
+        None => vec![repo.get("model.safetensors")?],
+    };
+    let tokenizer_filename = match args.tokenizer {
+        Some(file) => std::path::PathBuf::from(file),
+        None => api
+            .model("meta-llama/Llama-3.2-1B".to_string())
+            .get("tokenizer.json")?,
+    };
+    let mimi_filename = match args.mimi_weights {
+        Some(model) => std::path::PathBuf::from(model),
+        None => Api::new()?
+            .model("kyutai/mimi".to_string())
+            .get("model.safetensors")?,
+    };
+    println!("retrieved the files in {:?}", start.elapsed());
+    let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;
+
+    let start = std::time::Instant::now();
+    let config: Config = match args.config {
+        Some(config_file) => serde_json::from_slice(&std::fs::read(config_file)?)?,
+        None => {
+            let config_file = repo.get("config.json")?;
+            serde_json::from_slice(&std::fs::read(config_file)?)?
+        }
+    };
+    let device = candle_examples::device(args.cpu)?;
+    let (mut model, device) = {
+        let dtype = device.bf16_default_to_f32();
+        let vb = unsafe { VarBuilder::from_mmaped_safetensors(&filenames, dtype, &device)? };
+        let model = Model::new(&config, vb)?;
+        (model, device)
+    };
+    let mut mimi_model = {
+        use candle_transformers::models::mimi;
+        let vb =
+            unsafe { VarBuilder::from_mmaped_safetensors(&[mimi_filename], DType::F32, &device)? };
+        let config = mimi::Config::v0_1(Some(32));
+        mimi::Model::new(config, vb)?
+    };
+    let cb = config.audio_num_codebooks;
+
+    println!("loaded the model in {:?}", start.elapsed());
+
+    let voices = candle::safetensors::load(args.voices, &device)?;
+    let mut lp = candle_transformers::generation::LogitsProcessor::new(
+        args.seed,
+        Some(args.temperature),
+        None,
+    );
+    let tokens = voices
+        .get("tokens")
+        .expect("no tokens in prompt")
+        .to_dtype(DType::U32)?;
+    let mask = voices.get("mask").expect("no mask in prompt").clone();
+
+    let mut pos = 0;
+    let _frame = model.generate_frame(&tokens, &mask, pos, &mut lp)?;
+    pos += tokens.dim(1)?;
+
+    let mut all_pcms = vec![];
+    for (turn_idx, prompt) in args.prompt.split('|').enumerate() {
+        println!("{prompt:?}");
+        let speaker_idx = turn_idx % 2;
+        let prompt = format!("[{speaker_idx}]{}<|end_of_text|>", prompt);
+        let prompt = tokenizer.encode(prompt, true).map_err(E::msg)?;
+
+        let (mut tokens, mut mask) = model.text_tokens_and_mask(prompt.get_ids())?;
+
+        let mut generated_tokens = vec![];
+        loop {
+            let frame = model.generate_frame(&tokens, &mask, pos, &mut lp)?;
+            pos += tokens.dim(1)?;
+            let is_done = frame.iter().all(|&x| x == 0);
+            (tokens, mask) = model.audio_tokens_and_mask(frame)?;
+            print!("\rframe {pos}");
+            if is_done {
+                let _frame = model.generate_frame(&tokens, &mask, pos, &mut lp)?;
+                pos += tokens.dim(1)?;
+                break;
+            }
+            generated_tokens.push(tokens.clone());
+        }
+        println!();
+        let generated_tokens = Tensor::cat(&generated_tokens, 1)?.narrow(2, 0, cb)?.t()?;
+        let pcm = mimi_model.decode(&generated_tokens)?;
+        let pcm = pcm.i(0)?.i(0)?.to_dtype(DType::F32)?;
+        let pcm = candle_examples::audio::normalize_loudness(&pcm, 24_000, true)?;
+        all_pcms.push(pcm);
+    }
+    let pcm = Tensor::cat(&all_pcms, 0)?;
+    let pcm = pcm.to_vec1::<f32>()?;
+    println!("writing output file {}", args.out_file);
+    let mut output = std::fs::File::create(args.out_file)?;
+    candle_examples::wav::write_pcm_as_wav(&mut output, &pcm, 24_000)?;
+
+    Ok(())
+}
--- a/candle-examples/examples/csm/voices.safetensors
+++ b/candle-examples/examples/csm/voices.safetensors
--- a/candle-examples/examples/custom-ops/README.md
+++ b/candle-examples/examples/custom-ops/README.md
@ -0,0 +1,17 @@
+# candle-custom-ops
+
+ This example illustrates how to implement forward and backward passes for custom operations on the CPU and GPU.
+ The custom op in this example implements RMS normalization for the CPU and CUDA.
+ 
+## Running an example
+
+```bash
+$ cargo run --example custom-ops
+
+> [[ 0.,  1.,  2.,  3.,  4.,  5.,  6.],
+>  [ 7.,  8.,  9., 10., 11., 12., 13.]]
+> Tensor[[2, 7], f32]
+> [[0.0000, 0.2773, 0.5547, 0.8320, 1.1094, 1.3867, 1.6641],
+>  [0.6864, 0.7845, 0.8825, 0.9806, 1.0786, 1.1767, 1.2748]]
+> Tensor[[2, 7], f32]
+```
--- a/candle-examples/examples/custom-ops/main.rs
+++ b/candle-examples/examples/custom-ops/main.rs
@ -56,7 +56,7 @@ impl CustomOp1 for LayerNorm {
        layout: &Layout,
    ) -> Result<(candle::CudaStorage, Shape)> {
        use candle::backend::BackendStorage;
-        use candle::cuda_backend::cudarc::driver::{LaunchAsync, LaunchConfig};
+        use candle::cuda_backend::cudarc::driver::{LaunchConfig, PushKernelArg};
        use candle::cuda_backend::WrapErr;
        let (d1, d2) = layout.shape().dims2()?;
        let d1 = d1 as u32;
@ -68,15 +68,19 @@ impl CustomOp1 for LayerNorm {
            Some((o1, o2)) => slice.slice(o1..o2),
        };
        let elem_count = layout.shape().elem_count();
-        let dst = unsafe { dev.alloc::<f32>(elem_count) }.w()?;
-        let func = dev.get_or_load_func("rms_f32", cuda_kernels::LAYERNORM_KERNELS)?;
-        let params = (&dst, &slice, self.eps, d1, d2);
+        let dst = unsafe { dev.alloc::<f32>(elem_count) }?;
+        let func =
+            dev.get_or_load_custom_func("rms_f32", "mymodule", cuda_kernels::LAYERNORM_KERNELS)?;
        let cfg = LaunchConfig {
            grid_dim: (d1, 1, 1),
            block_dim: (d2, 1, 1),
            shared_mem_bytes: 0,
        };
-        unsafe { func.launch(cfg, params) }.w()?;
+        let mut builder = func.builder();
+        builder.arg(&dst);
+        builder.arg(&slice);
+        candle::builder_arg!(builder, self.eps, d1, d2);
+        unsafe { builder.launch(cfg) }.w()?;

        let dst = candle::CudaStorage::wrap_cuda_slice(dst, dev);
        Ok((dst, layout.shape().clone()))
--- a/candle-examples/examples/deepseekv2/README.md
+++ b/candle-examples/examples/deepseekv2/README.md
@ -0,0 +1,33 @@
+# DeepSeek V2
+
+DeepSeek V2 an MoE model featuring MLA (Multi-Latent Attention). There is a lite (16B) and a full (236B) model.
+
+- Context length of **32k tokens** (Lite model), **128k tokens** (full model)
+- 64 routed experts (Lite model), 160 routed experts (full model)
+
+## Running the example
+
+```bash
+$ cargo run --example deepseekv2 --release --features metal -- --prompt "Recursive fibonacci code in Rust:" --which lite --sample-len 150  
+
+fn fibonacci(n: u32) -> u32 {
+    if n <= 1 {
+        return n;
+    } else {
+        return fibonacci(n - 1) + fibonacci(n - 2);
+    }
+}
+
+## Fibonacci code in Python:
+
+def fibonacci(n):
+    if n <= 1:
+        return n
+    else:
+        return fibonacci(n-1) + fibonacci(n-2)
+
+## Fibonacci code in JavaScript:
+
+function fibonacci(n) {
+    if (n <= 1
+```
--- a/candle-examples/examples/deepseekv2/main.rs
+++ b/candle-examples/examples/deepseekv2/main.rs
@ -0,0 +1,282 @@
+#[cfg(feature = "mkl")]
+extern crate intel_mkl_src;
+
+#[cfg(feature = "accelerate")]
+extern crate accelerate_src;
+
+use anyhow::{Error as E, Result};
+use clap::Parser;
+
+use candle_transformers::models::deepseek2::{DeepSeekV2, DeepSeekV2Config};
+
+use candle::{DType, Device, Tensor};
+use candle_examples::token_output_stream::TokenOutputStream;
+use candle_nn::VarBuilder;
+use candle_transformers::generation::{LogitsProcessor, Sampling};
+use hf_hub::{api::sync::Api, Repo, RepoType};
+use tokenizers::Tokenizer;
+
+struct TextGeneration {
+    model: DeepSeekV2,
+    device: Device,
+    tokenizer: TokenOutputStream,
+    logits_processor: LogitsProcessor,
+    repeat_penalty: f32,
+    repeat_last_n: usize,
+}
+
+impl TextGeneration {
+    #[allow(clippy::too_many_arguments)]
+    fn new(
+        model: DeepSeekV2,
+        tokenizer: Tokenizer,
+        seed: u64,
+        temp: Option<f64>,
+        top_p: Option<f64>,
+        top_k: Option<usize>,
+        repeat_penalty: f32,
+        repeat_last_n: usize,
+        device: &Device,
+    ) -> Self {
+        let logits_processor = {
+            let temperature = temp.unwrap_or(0.);
+            let sampling = if temperature <= 0. {
+                Sampling::ArgMax
+            } else {
+                match (top_k, top_p) {
+                    (None, None) => Sampling::All { temperature },
+                    (Some(k), None) => Sampling::TopK { k, temperature },
+                    (None, Some(p)) => Sampling::TopP { p, temperature },
+                    (Some(k), Some(p)) => Sampling::TopKThenTopP { k, p, temperature },
+                }
+            };
+            LogitsProcessor::from_sampling(seed, sampling)
+        };
+
+        Self {
+            model,
+            tokenizer: TokenOutputStream::new(tokenizer),
+            logits_processor,
+            repeat_penalty,
+            repeat_last_n,
+            device: device.clone(),
+        }
+    }
+
+    fn run(&mut self, prompt: &str, sample_len: usize) -> Result<()> {
+        use std::io::Write;
+        self.tokenizer.clear();
+        let mut tokens = self
+            .tokenizer
+            .tokenizer()
+            .encode(prompt, true)
+            .map_err(E::msg)?
+            .get_ids()
+            .to_vec();
+        for &t in tokens.iter() {
+            if let Some(t) = self.tokenizer.next_token(t)? {
+                print!("{t}")
+            }
+        }
+        std::io::stdout().flush()?;
+
+        let mut generated_tokens = 0usize;
+        let eos_token = match self.tokenizer.get_token("<｜end▁of▁sentence｜>") {
+            Some(token) => token,
+            None => anyhow::bail!("cannot find the <｜end▁of▁sentence｜> token"),
+        };
+        let start_gen = std::time::Instant::now();
+        for index in 0..sample_len {
+            let context_size = if index > 0 { 1 } else { tokens.len() };
+            let start_pos = tokens.len().saturating_sub(context_size);
+            let ctxt = &tokens[start_pos..];
+            let input = Tensor::new(ctxt, &self.device)?.unsqueeze(0)?;
+            let logits = self.model.forward(&input, start_pos)?;
+            let logits = logits.squeeze(0)?.squeeze(0)?.to_dtype(DType::F32)?;
+            let logits = if self.repeat_penalty == 1. {
+                logits
+            } else {
+                let start_at = tokens.len().saturating_sub(self.repeat_last_n);
+                candle_transformers::utils::apply_repeat_penalty(
+                    &logits,
+                    self.repeat_penalty,
+                    &tokens[start_at..],
+                )?
+            };
+
+            let next_token = self.logits_processor.sample(&logits)?;
+            tokens.push(next_token);
+            generated_tokens += 1;
+            if next_token == eos_token {
+                break;
+            }
+            if let Some(t) = self.tokenizer.next_token(next_token)? {
+                print!("{t}");
+                std::io::stdout().flush()?;
+            }
+        }
+        let dt = start_gen.elapsed();
+        if let Some(rest) = self.tokenizer.decode_rest().map_err(E::msg)? {
+            print!("{rest}");
+        }
+        std::io::stdout().flush()?;
+        println!(
+            "\n{generated_tokens} tokens generated ({:.2} token/s)",
+            generated_tokens as f64 / dt.as_secs_f64(),
+        );
+        Ok(())
+    }
+}
+
+#[derive(Clone, Debug, Copy, PartialEq, Eq, clap::ValueEnum)]
+enum Which {
+    #[value(name = "lite")]
+    Lite,
+    #[value(name = "lite-chat")]
+    LiteChat,
+    #[value(name = "coder-lite-chat")]
+    CoderLiteChat,
+    #[value(name = "v2")]
+    V2,
+    #[value(name = "v2-chat")]
+    V2Chat,
+}
+
+#[derive(Parser, Debug)]
+#[command(author, version, about, long_about = None)]
+struct Args {
+    /// Run on CPU rather than on GPU.
+    #[arg(long)]
+    cpu: bool,
+
+    /// Enable tracing (generates a trace-timestamp.json file).
+    #[arg(long)]
+    tracing: bool,
+
+    #[arg(long)]
+    use_flash_attn: bool,
+
+    #[arg(long)]
+    prompt: String,
+
+    /// The temperature used to generate samples.
+    #[arg(long)]
+    temperature: Option<f64>,
+
+    /// Nucleus sampling probability cutoff.
+    #[arg(long)]
+    top_p: Option<f64>,
+
+    /// Only sample among the top K samples.
+    #[arg(long)]
+    top_k: Option<usize>,
+
+    /// The seed to use when generating random samples.
+    #[arg(long, default_value_t = 299792458)]
+    seed: u64,
+
+    /// The length of the sample to generate (in tokens).
+    #[arg(long, short = 'n', default_value_t = 10000)]
+    sample_len: usize,
+
+    /// The model size to use.
+    #[arg(long, default_value = "lite")]
+    which: Which,
+
+    #[arg(long)]
+    model_id: Option<String>,
+
+    #[arg(long, default_value = "main")]
+    revision: String,
+
+    /// Penalty to be applied for repeating tokens, 1. means no penalty.
+    #[arg(long, default_value_t = 1.1)]
+    repeat_penalty: f32,
+
+    /// The context size to consider for the repeat penalty.
+    #[arg(long, default_value_t = 64)]
+    repeat_last_n: usize,
+}
+
+fn main() -> Result<()> {
+    use tracing_chrome::ChromeLayerBuilder;
+    use tracing_subscriber::prelude::*;
+
+    let args = Args::parse();
+
+    let _guard = if args.tracing {
+        let (chrome_layer, guard) = ChromeLayerBuilder::new().build();
+        tracing_subscriber::registry().with(chrome_layer).init();
+        Some(guard)
+    } else {
+        None
+    };
+    println!(
+        "avx: {}, neon: {}, simd128: {}, f16c: {}",
+        candle::utils::with_avx(),
+        candle::utils::with_neon(),
+        candle::utils::with_simd128(),
+        candle::utils::with_f16c()
+    );
+    println!(
+        "temp: {:.2} repeat-penalty: {:.2} repeat-last-n: {}",
+        args.temperature.unwrap_or(0.),
+        args.repeat_penalty,
+        args.repeat_last_n
+    );
+
+    let start = std::time::Instant::now();
+    let api = Api::new()?;
+    let model_id = match args.model_id {
+        Some(model_id) => model_id,
+        None => match args.which {
+            Which::CoderLiteChat => "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct".to_string(),
+            Which::LiteChat => "deepseek-ai/DeepSeek-V2-Lite-Chat".to_string(),
+            Which::Lite => "deepseek-ai/DeepSeek-V2-Lite".to_string(),
+            Which::V2 => "deepseek-ai/DeepSeek-V2".to_string(),
+            Which::V2Chat => "deepseek-ai/DeepSeek-V2-Chat".to_string(),
+        },
+    };
+    let repo = api.repo(Repo::with_revision(
+        model_id,
+        RepoType::Model,
+        args.revision,
+    ));
+    let tokenizer_filename = repo.get("tokenizer.json")?;
+    let filenames = candle_examples::hub_load_safetensors(&repo, "model.safetensors.index.json")?;
+    println!("retrieved the files in {:?}", start.elapsed());
+    let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;
+
+    let start = std::time::Instant::now();
+    let config: DeepSeekV2Config = {
+        let config_file = repo.get("config.json")?;
+        serde_json::from_slice(&std::fs::read(config_file)?)?
+    };
+    let device = candle_examples::device(args.cpu)?;
+    let (model, device) = {
+        let dtype = if device.is_cpu() {
+            DType::F16
+        } else {
+            DType::BF16
+        };
+        let vb = unsafe { VarBuilder::from_mmaped_safetensors(&filenames, dtype, &device)? };
+        let model = DeepSeekV2::new(&config, vb)?;
+        (model, device)
+    };
+
+    println!("loaded the model in {:?}", start.elapsed());
+
+    let mut pipeline = TextGeneration::new(
+        model,
+        tokenizer,
+        args.seed,
+        args.temperature,
+        args.top_p,
+        args.top_k,
+        args.repeat_penalty,
+        args.repeat_last_n,
+        &device,
+    );
+    pipeline.run(&args.prompt, args.sample_len)?;
+    Ok(())
+}
--- a/candle-examples/examples/distilbert/README.md
+++ b/candle-examples/examples/distilbert/README.md
@ -8,7 +8,7 @@ DistilBert is used to compute the sentence embeddings for a prompt. The model we
 are downloaded from the hub on the first run.

 ```bash
-cargo run --example distilbert --release -- --prompt "Here is a test sentence"
+$ cargo run --example distilbert --release -- --prompt "Here is a test sentence"

 > [[[ 0.5109,  0.1280, -0.2635, ...,  0.3462, -1.0434,  0.1441],
 >   [ 0.1735,  0.0818, -0.5549, ...,  0.3472, -0.8264, -0.0244],
@ -20,3 +20,25 @@ cargo run --example distilbert --release -- --prompt "Here is a test sentence"
 > Tensor[[1, 7, 768], f32]

 ```
+
+## Masked Token
+
+DistilBert is used to compute the top K choices for a masked token.
+
+```bash
+$ cargo run --example distilbert -- --prompt "The capital of France is [MASK]." --top-k 10
+
+> Input: The capital of France is [MASK].
+> Predictions for [MASK] at position 6:
+>   1: marseille       (probability: 12.14%)
+>   2: paris           (probability: 10.84%)
+>   3: toulouse        (probability: 8.57%)
+>   4: lyon            (probability: 7.61%)
+>   5: montpellier     (probability: 5.18%)
+>   6: bordeaux        (probability: 4.88%)
+>   7: nantes          (probability: 4.82%)
+>   8: lille           (probability: 4.07%)
+>   9: strasbourg      (probability: 3.12%)
+>   10: cannes          (probability: 3.04%)
+
+```
--- a/candle-examples/examples/distilbert/main.rs
+++ b/candle-examples/examples/distilbert/main.rs
@ -3,15 +3,48 @@ extern crate intel_mkl_src;

 #[cfg(feature = "accelerate")]
 extern crate accelerate_src;
-use candle_transformers::models::distilbert::{Config, DistilBertModel, DTYPE};
+use candle_transformers::models::distilbert::{
+    Config, DistilBertForMaskedLM, DistilBertModel, DTYPE,
+};

-use anyhow::{Error as E, Result};
+use anyhow::{Context, Error as E, Result};
 use candle::{Device, Tensor};
 use candle_nn::VarBuilder;
-use clap::Parser;
+use clap::{Parser, ValueEnum};
 use hf_hub::{api::sync::Api, Repo, RepoType};
+use std::path::PathBuf;
 use tokenizers::Tokenizer;

+enum ModelType {
+    Masked(DistilBertForMaskedLM),
+    UnMasked(DistilBertModel),
+}
+
+impl ModelType {
+    fn device(&self) -> &Device {
+        match self {
+            ModelType::Masked(model) => &model.bert.device,
+            ModelType::UnMasked(model) => &model.device,
+        }
+    }
+
+    fn forward(&self, input_ids: &Tensor, attention_mask: &Tensor) -> Result<Tensor> {
+        match self {
+            ModelType::Masked(model) => Ok(model.forward(input_ids, attention_mask)?),
+            ModelType::UnMasked(model) => Ok(model.forward(input_ids, attention_mask)?),
+        }
+    }
+}
+
+#[derive(Clone, Debug, Copy, PartialEq, Eq, ValueEnum)]
+enum Which {
+    #[value(name = "distilbert")]
+    DistilBert,
+
+    #[value(name = "distilbertformaskedlm")]
+    DistilbertForMaskedLM,
+}
+
 #[derive(Parser, Debug)]
 #[command(author, version, about, long_about = None)]
 struct Args {
@ -23,10 +56,14 @@ struct Args {
    #[arg(long)]
    tracing: bool,

+    #[arg(long, default_value = "distilbert")]
+    model: Which,
+
    /// The model to use, check out available models: https://huggingface.co/models?library=sentence-transformers&sort=trending
    #[arg(long)]
    model_id: Option<String>,

+    /// Revision or branch
    #[arg(long)]
    revision: Option<String>,

@ -42,94 +79,246 @@ struct Args {
    #[arg(long, default_value = "1")]
    n: usize,

-    /// L2 normalization for embeddings.
-    #[arg(long, default_value = "true")]
-    normalize_embeddings: bool,
+    /// Number of top predictions to show for each mask
+    #[arg(long, default_value = "5")]
+    top_k: usize,
 }

 impl Args {
-    fn build_model_and_tokenizer(&self) -> Result<(DistilBertModel, Tokenizer)> {
+    fn build_model_and_tokenizer(&self) -> Result<(ModelType, Tokenizer)> {
        let device = candle_examples::device(self.cpu)?;
+
+        let (model_id, revision) = self.resolve_model_and_revision();
+        let (config_path, tokenizer_path, weights_path) =
+            self.download_model_files(&model_id, &revision)?;
+
+        let config = std::fs::read_to_string(config_path)?;
+        let config: Config = serde_json::from_str(&config)?;
+        let tokenizer = Tokenizer::from_file(tokenizer_path).map_err(E::msg)?;
+
+        let vb = self.load_variables(&weights_path, &device)?;
+        let model = self.create_model(&config, vb)?;
+
+        Ok((model, tokenizer))
+    }
+
+    fn resolve_model_and_revision(&self) -> (String, String) {
        let default_model = "distilbert-base-uncased".to_string();
        let default_revision = "main".to_string();
-        let (model_id, revision) = match (self.model_id.to_owned(), self.revision.to_owned()) {
+
+        match (self.model_id.clone(), self.revision.clone()) {
            (Some(model_id), Some(revision)) => (model_id, revision),
-            (Some(model_id), None) => (model_id, "main".to_string()),
+            (Some(model_id), None) => (model_id, default_revision),
            (None, Some(revision)) => (default_model, revision),
            (None, None) => (default_model, default_revision),
-        };
+        }
+    }

-        let repo = Repo::with_revision(model_id, RepoType::Model, revision);
-        let (config_filename, tokenizer_filename, weights_filename) = {
-            let api = Api::new()?;
-            let api = api.repo(repo);
-            let config = api.get("config.json")?;
-            let tokenizer = api.get("tokenizer.json")?;
-            let weights = if self.use_pth {
-                api.get("pytorch_model.bin")?
-            } else {
-                api.get("model.safetensors")?
-            };
-            (config, tokenizer, weights)
-        };
-        let config = std::fs::read_to_string(config_filename)?;
-        let config: Config = serde_json::from_str(&config)?;
-        let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;
+    fn download_model_files(
+        &self,
+        model_id: &str,
+        revision: &str,
+    ) -> Result<(PathBuf, PathBuf, PathBuf)> {
+        let repo = Repo::with_revision(model_id.to_string(), RepoType::Model, revision.to_string());
+        let api = Api::new()?;
+        let api = api.repo(repo);

-        let vb = if self.use_pth {
-            VarBuilder::from_pth(&weights_filename, DTYPE, &device)?
+        let config = api.get("config.json")?;
+        let tokenizer = api.get("tokenizer.json")?;
+        let weights = if self.use_pth {
+            api.get("pytorch_model.bin")?
        } else {
-            unsafe { VarBuilder::from_mmaped_safetensors(&[weights_filename], DTYPE, &device)? }
+            api.get("model.safetensors")?
        };
-        let model = DistilBertModel::load(vb, &config)?;
-        Ok((model, tokenizer))
+
+        Ok((config, tokenizer, weights))
+    }
+
+    fn load_variables(&self, weights_path: &PathBuf, device: &Device) -> Result<VarBuilder> {
+        if self.use_pth {
+            Ok(VarBuilder::from_pth(weights_path, DTYPE, device)?)
+        } else {
+            Ok(unsafe { VarBuilder::from_mmaped_safetensors(&[weights_path], DTYPE, device)? })
+        }
+    }
+
+    fn create_model(&self, config: &Config, vb: VarBuilder) -> Result<ModelType> {
+        match self.model {
+            Which::DistilbertForMaskedLM => {
+                Ok(ModelType::Masked(DistilBertForMaskedLM::load(vb, config)?))
+            }
+            Which::DistilBert => Ok(ModelType::UnMasked(DistilBertModel::load(vb, config)?)),
+        }
    }
 }

-fn get_mask(size: usize, device: &Device) -> Tensor {
-    let mask: Vec<_> = (0..size)
-        .flat_map(|i| (0..size).map(move |j| u8::from(j > i)))
-        .collect();
-    Tensor::from_slice(&mask, (size, size), device).unwrap()
+fn main() -> Result<()> {
+    let args = Args::parse();
+    let _guard = setup_tracing(&args);
+
+    let (model, tokenizer) = args.build_model_and_tokenizer()?;
+    let device = model.device();
+
+    let (token_ids, mask) = prepare_inputs(&args, &tokenizer, device)?;
+    let output = model.forward(&token_ids, &mask)?;
+
+    process_output(&model, &output, &token_ids, &tokenizer, &args)?;
+
+    Ok(())
 }

-fn main() -> Result<()> {
-    use tracing_chrome::ChromeLayerBuilder;
-    use tracing_subscriber::prelude::*;
+fn setup_tracing(args: &Args) -> Option<impl Drop> {
+    if args.tracing {
+        use tracing_chrome::ChromeLayerBuilder;
+        use tracing_subscriber::prelude::*;

-    let args = Args::parse();
-    let _guard = if args.tracing {
        println!("tracing...");
        let (chrome_layer, guard) = ChromeLayerBuilder::new().build();
        tracing_subscriber::registry().with(chrome_layer).init();
        Some(guard)
    } else {
        None
-    };
-    let (model, mut tokenizer) = args.build_model_and_tokenizer()?;
-    let device = &model.device;
+    }
+}

-    let tokenizer = tokenizer
+fn prepare_inputs(args: &Args, tokenizer: &Tokenizer, device: &Device) -> Result<(Tensor, Tensor)> {
+    let mut binding = tokenizer.clone();
+    let tokenizer_configured = binding
        .with_padding(None)
        .with_truncation(None)
        .map_err(E::msg)?;
-    let tokens = tokenizer
-        .encode(args.prompt, true)
+
+    let tokens = tokenizer_configured
+        .encode(args.prompt.clone(), true)
        .map_err(E::msg)?
        .get_ids()
        .to_vec();
+
    let token_ids = Tensor::new(&tokens[..], device)?.unsqueeze(0)?;
-    let mask = get_mask(tokens.len(), device);

-    println!("token_ids: {:?}", token_ids.to_vec2::<u32>());
-    println!("mask: {:?}", mask.to_vec2::<u8>());
+    let mask = match args.model {
+        Which::DistilbertForMaskedLM => attention_mask_maskedlm(tokenizer, &args.prompt, device)?,
+        Which::DistilBert => attention_mask(tokens.len(), device)?,
+    };

-    let ys = model.forward(&token_ids, &mask)?;
-    println!("{ys}");
+    println!("token_ids: {:?}", token_ids.to_vec2::<u32>()?);
+
+    Ok((token_ids, mask))
+}
+
+fn process_output(
+    model: &ModelType,
+    output: &Tensor,
+    token_ids: &Tensor,
+    tokenizer: &Tokenizer,
+    args: &Args,
+) -> Result<()> {
+    match model {
+        ModelType::UnMasked(_) => {
+            println!("embeddings");
+            println!("{output}");
+        }
+        ModelType::Masked(_) => {
+            process_masked_output(output, token_ids, tokenizer, args)?;
+        }
+    }

    Ok(())
 }

-pub fn normalize_l2(v: &Tensor) -> Result<Tensor> {
-    Ok(v.broadcast_div(&v.sqr()?.sum_keepdim(1)?.sqrt()?)?)
+fn process_masked_output(
+    output: &Tensor,
+    token_ids: &Tensor,
+    tokenizer: &Tokenizer,
+    args: &Args,
+) -> Result<()> {
+    let input_ids_vec = token_ids.to_vec2::<u32>()?;
+    let mask_token_id = tokenizer
+        .token_to_id("[MASK]")
+        .context("Mask token, \"[MASK]\", not found in tokenizer.")?;
+
+    println!("\nInput: {}", args.prompt);
+
+    for (token_idx, &token_id) in input_ids_vec[0].iter().enumerate() {
+        if token_id == mask_token_id {
+            println!("Predictions for [MASK] at position {}:", token_idx);
+
+            let pos_logits = output.get(0)?.get(token_idx)?;
+            let probs = candle_nn::ops::softmax(&pos_logits, 0)?;
+            let (top_values, top_indices) = get_top_k(&probs, args.top_k)?;
+
+            let values = top_values.to_vec1::<f32>()?;
+            let indices = top_indices.to_vec1::<u32>()?;
+
+            for (i, (&token_id, &prob)) in indices.iter().zip(values.iter()).enumerate() {
+                let token = tokenizer.decode(&[token_id], false).map_err(E::msg)?;
+                println!(
+                    "  {}: {:15} (probability: {:.2}%)",
+                    i + 1,
+                    token,
+                    prob * 100.0
+                );
+            }
+        }
+    }
+
+    Ok(())
+}
+
+fn get_top_k(tensor: &Tensor, k: usize) -> Result<(Tensor, Tensor)> {
+    let n = tensor.dims().iter().product::<usize>();
+    let k = std::cmp::min(k, n);
+
+    let values = tensor.to_vec1::<f32>()?;
+    let mut value_indices: Vec<(f32, usize)> = values
+        .into_iter()
+        .enumerate()
+        .map(|(idx, val)| (val, idx))
+        .collect();
+
+    value_indices.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap_or(std::cmp::Ordering::Equal));
+
+    let top_k_values: Vec<f32> = value_indices.iter().take(k).map(|(val, _)| *val).collect();
+    let top_k_indices: Vec<u32> = value_indices
+        .iter()
+        .take(k)
+        .map(|(_, idx)| *idx as u32)
+        .collect();
+
+    let device = tensor.device();
+    let top_values = Tensor::from_vec(top_k_values, (k,), device)?;
+    let top_indices = Tensor::from_vec(top_k_indices, (k,), device)?;
+
+    Ok((top_values, top_indices))
+}
+
+fn attention_mask(size: usize, device: &Device) -> Result<Tensor> {
+    let mask: Vec<_> = (0..size)
+        .flat_map(|i| (0..size).map(move |j| u8::from(j > i)))
+        .collect();
+    Ok(Tensor::from_slice(&mask, (size, size), device)?)
+}
+
+fn attention_mask_maskedlm(tokenizer: &Tokenizer, input: &str, device: &Device) -> Result<Tensor> {
+    let tokens = tokenizer.encode(input, true).map_err(E::msg)?;
+    let seq_len = tokens.get_attention_mask().to_vec().len();
+
+    let mask_token_id = tokenizer
+        .token_to_id("[MASK]")
+        .context("Mask token, \"[MASK]\", not found in tokenizer.")?;
+
+    let mut attention_mask_vec = Vec::with_capacity(seq_len * seq_len);
+
+    let ids = tokens.get_ids();
+    for _ in 0..seq_len {
+        for id in ids.iter() {
+            let mask_value = if id == &mask_token_id { 1u8 } else { 0u8 };
+            attention_mask_vec.push(mask_value);
+        }
+    }
+
+    let shape = (1, 1, seq_len, seq_len);
+    let mask = Tensor::from_vec(attention_mask_vec, shape, device)?;
+
+    Ok(mask)
 }
--- a/candle-examples/examples/efficientnet/README.md
+++ b/candle-examples/examples/efficientnet/README.md
@ -0,0 +1,15 @@
+# candle-efficientnet
+
+Demonstrates a Candle implementation of EfficientNet for image classification based on ImageNet classes.
+
+## Running an example
+
+```bash
+$ cargo run --example efficientnet --release -- --image candle-examples/examples/yolo-v8/assets/bike.jpg --which b1
+
+> bicycle-built-for-two, tandem bicycle, tandem: 45.85%
+> mountain bike, all-terrain bike, off-roader: 30.45%
+> crash helmet            : 2.58%
+> unicycle, monocycle     : 2.21%
+> tricycle, trike, velocipede: 1.53%
+```
--- a/candle-examples/examples/falcon/README.md
+++ b/candle-examples/examples/falcon/README.md
@ -1,3 +1,10 @@
 # candle-falcon

 Falcon is a general large language model.
+
+## Running an example
+
+Make sure to include the `--use-f32` flag if using CPU, because there isn't a BFloat16 implementation yet.
+```
+cargo run --example falcon --release -- --prompt "Flying monkeys are" --use-f32
+```
--- a/candle-examples/examples/gemma/main.rs
+++ b/candle-examples/examples/gemma/main.rs
@ -9,6 +9,7 @@ use clap::Parser;

 use candle_transformers::models::gemma::{Config as Config1, Model as Model1};
 use candle_transformers::models::gemma2::{Config as Config2, Model as Model2};
+use candle_transformers::models::gemma3::{Config as Config3, Model as Model3};

 use candle::{DType, Device, Tensor};
 use candle_examples::token_output_stream::TokenOutputStream;
@ -47,29 +48,16 @@ enum Which {
    BaseV2_9B,
    #[value(name = "2-9b-it")]
    InstructV2_9B,
-}
-
-impl Which {
-    fn is_v1(&self) -> bool {
-        match self {
-            Self::Base2B
-            | Self::Base7B
-            | Self::Instruct2B
-            | Self::Instruct7B
-            | Self::InstructV1_1_2B
-            | Self::InstructV1_1_7B
-            | Self::CodeBase2B
-            | Self::CodeBase7B
-            | Self::CodeInstruct2B
-            | Self::CodeInstruct7B => true,
-            Self::BaseV2_2B | Self::InstructV2_2B | Self::BaseV2_9B | Self::InstructV2_9B => false,
-        }
-    }
+    #[value(name = "3-1b")]
+    BaseV3_1B,
+    #[value(name = "3-1b-it")]
+    InstructV3_1B,
 }

 enum Model {
    V1(Model1),
    V2(Model2),
+    V3(Model3),
 }

 impl Model {
@ -77,6 +65,7 @@ impl Model {
        match self {
            Self::V1(m) => m.forward(input_ids, pos),
            Self::V2(m) => m.forward(input_ids, pos),
+            Self::V3(m) => m.forward(input_ids, pos),
        }
    }
 }
@ -284,6 +273,8 @@ fn main() -> Result<()> {
            Which::InstructV2_2B => "google/gemma-2-2b-it".to_string(),
            Which::BaseV2_9B => "google/gemma-2-9b".to_string(),
            Which::InstructV2_9B => "google/gemma-2-9b-it".to_string(),
+            Which::BaseV3_1B => "google/gemma-3-1b-pt".to_string(),
+            Which::InstructV3_1B => "google/gemma-3-1b-it".to_string(),
        },
    };
    let repo = api.repo(Repo::with_revision(
@ -304,7 +295,10 @@ fn main() -> Result<()> {
            .split(',')
            .map(std::path::PathBuf::from)
            .collect::<Vec<_>>(),
-        None => candle_examples::hub_load_safetensors(&repo, "model.safetensors.index.json")?,
+        None => match args.which {
+            Which::BaseV3_1B | Which::InstructV3_1B => vec![repo.get("model.safetensors")?],
+            _ => candle_examples::hub_load_safetensors(&repo, "model.safetensors.index.json")?,
+        },
    };
    println!("retrieved the files in {:?}", start.elapsed());
    let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;
@ -317,14 +311,31 @@ fn main() -> Result<()> {
        DType::F32
    };
    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&filenames, dtype, &device)? };
-    let model = if args.which.is_v1() {
-        let config: Config1 = serde_json::from_reader(std::fs::File::open(config_filename)?)?;
-        let model = Model1::new(args.use_flash_attn, &config, vb)?;
-        Model::V1(model)
-    } else {
-        let config: Config2 = serde_json::from_reader(std::fs::File::open(config_filename)?)?;
-        let model = Model2::new(args.use_flash_attn, &config, vb)?;
-        Model::V2(model)
+    let model = match args.which {
+        Which::Base2B
+        | Which::Base7B
+        | Which::Instruct2B
+        | Which::Instruct7B
+        | Which::InstructV1_1_2B
+        | Which::InstructV1_1_7B
+        | Which::CodeBase2B
+        | Which::CodeBase7B
+        | Which::CodeInstruct2B
+        | Which::CodeInstruct7B => {
+            let config: Config1 = serde_json::from_reader(std::fs::File::open(config_filename)?)?;
+            let model = Model1::new(args.use_flash_attn, &config, vb)?;
+            Model::V1(model)
+        }
+        Which::BaseV2_2B | Which::InstructV2_2B | Which::BaseV2_9B | Which::InstructV2_9B => {
+            let config: Config2 = serde_json::from_reader(std::fs::File::open(config_filename)?)?;
+            let model = Model2::new(args.use_flash_attn, &config, vb)?;
+            Model::V2(model)
+        }
+        Which::BaseV3_1B | Which::InstructV3_1B => {
+            let config: Config3 = serde_json::from_reader(std::fs::File::open(config_filename)?)?;
+            let model = Model3::new(args.use_flash_attn, &config, vb)?;
+            Model::V3(model)
+        }
    };

    println!("loaded the model in {:?}", start.elapsed());
--- a/candle-examples/examples/glm4/README.org
+++ b/candle-examples/examples/glm4/README.org
@ -12,7 +12,7 @@ GLM-4-9B is the open-source version of the latest generation of pre-trained mode

 ** Running with ~cpu~
 #+begin_src shell
-  cargo run --example glm4 --release -- --cpu--prompt "Hello world"
+  cargo run --example glm4 --release -- --cpu --prompt "Hello world"
 #+end_src

 ** Output Example
--- a/candle-examples/examples/helium/main.rs
+++ b/candle-examples/examples/helium/main.rs
@ -46,7 +46,7 @@ impl TextGeneration {
                Sampling::ArgMax
            } else {
                match (top_k, top_p) {
-                    (None, None) => Sampling::All { temperature },
+                    (None, None) => Sampling::GumbelSoftmax { temperature },
                    (Some(k), None) => Sampling::TopK { k, temperature },
                    (None, Some(p)) => Sampling::TopP { p, temperature },
                    (Some(k), Some(p)) => Sampling::TopKThenTopP { k, p, temperature },
--- a/candle-examples/examples/llama/README.md
+++ b/candle-examples/examples/llama/README.md
@ -0,0 +1,11 @@
+# candle-llama
+
+Candle implementations of various Llama based architectures.
+
+## Running an example
+
+```bash
+$ cargo run --example llama -- --prompt "Machine learning is " --which v32-3b-instruct
+
+> Machine learning is  the part of computer science which deals with the development of algorithms and
+```
--- a/candle-examples/examples/llama2-c/main.rs
+++ b/candle-examples/examples/llama2-c/main.rs
@ -256,6 +256,12 @@ fn run_inference(args: &InferenceCmd, common_args: &Args) -> Result<()> {
    let tokenizer = common_args.tokenizer()?;

    let device = candle_examples::device(common_args.cpu)?;
+    #[cfg(feature = "cuda")]
+    if let candle::Device::Cuda(d) = &device {
+        unsafe {
+            d.disable_event_tracking();
+        }
+    };

    let is_gguf = config_path.extension().map_or(false, |v| v == "gguf");
    let is_safetensors = config_path
--- a/candle-examples/examples/mamba-minimal/model.rs
+++ b/candle-examples/examples/mamba-minimal/model.rs
@ -21,7 +21,7 @@ impl Config {
    }

    fn dt_rank(&self) -> usize {
-        (self.d_model + 15) / 16
+        self.d_model.div_ceil(16)
    }

    fn d_conv(&self) -> usize {
--- a/candle-examples/examples/mamba/README.md
+++ b/candle-examples/examples/mamba/README.md
@ -12,6 +12,6 @@ would only work for inference.
 ## Running the example

 ```bash
-$ cargo run --example mamba-minimal --release -- --prompt "Mamba is the"
+$ cargo run --example mamba --release -- --prompt "Mamba is the"
 ```

--- a/candle-examples/examples/marian-mt/README.md
+++ b/candle-examples/examples/marian-mt/README.md
@ -18,21 +18,19 @@ I know you are waiting for me. I will go through the forest, I will go through t
 mountain. I cannot stay far from you any longer.</s>
 ```

+### Changing model and language pairs
+
+```bash
+$ cargo run --example marian-mt --release -- --text "hello, how are you." --which base --language-pair en-zh
+
+你好,你好吗?
+```
+
 ## Generating the tokenizer.json files

-You can use the following script to generate the `tokenizer.json` config files
-from the hf-hub repos. This requires the `tokenizers` and `sentencepiece`
-packages to be install and use the `convert_slow_tokenizer.py` script from this
-directory.
-
-```python
-from convert_slow_tokenizer import MarianConverter
-from transformers import AutoTokenizer
-
-
-tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en", use_fast=False)
-fast_tokenizer = MarianConverter(tokenizer, index=0).converted()
-fast_tokenizer.save(f"tokenizer-marian-base-fr.json")
-fast_tokenizer = MarianConverter(tokenizer, index=1).converted()
-fast_tokenizer.save(f"tokenizer-marian-base-en.json")
-```
+The tokenizer for each `marian-mt` model was trained independently, 
+meaning each new model needs unique tokenizer encoders and decoders.
+You can use the `./python/convert_slow_tokenizer.py` script in this directory to generate 
+the `tokenizer.json` config files from the hf-hub repos.
+The script requires all the packages in `./python/requirements.txt` or `./python/uv.lock` 
+to be installed, and has only been tested for `python 3.12.7`.  
--- a/candle-examples/examples/marian-mt/convert_slow_tokenizer.py
+++ b/candle-examples/examples/marian-mt/convert_slow_tokenizer.py
--- a/candle-examples/examples/marian-mt/main.rs
+++ b/candle-examples/examples/marian-mt/main.rs
@ -20,6 +20,22 @@ enum Which {
    Big,
 }

+#[derive(Clone, Debug, Copy, PartialEq, Eq, ValueEnum)]
+enum LanguagePair {
+    #[value(name = "fr-en")]
+    FrEn,
+    #[value(name = "en-zh")]
+    EnZh,
+    #[value(name = "en-hi")]
+    EnHi,
+    #[value(name = "en-es")]
+    EnEs,
+    #[value(name = "en-fr")]
+    EnFr,
+    #[value(name = "en-ru")]
+    EnRu,
+}
+
 // TODO: Maybe add support for the conditional prompt.
 #[derive(Parser)]
 struct Args {
@ -36,6 +52,10 @@ struct Args {
    #[arg(long, default_value = "big")]
    which: Which,

+    // Choose which language pair to use
+    #[arg(long, default_value = "fr-en")]
+    language_pair: LanguagePair,
+
    /// Run on CPU rather than on GPU.
    #[arg(long)]
    cpu: bool,
@ -53,21 +73,43 @@ pub fn main() -> anyhow::Result<()> {
    use hf_hub::api::sync::Api;
    let args = Args::parse();

-    let config = match args.which {
-        Which::Base => marian::Config::opus_mt_fr_en(),
-        Which::Big => marian::Config::opus_mt_tc_big_fr_en(),
+    let config = match (args.which, args.language_pair) {
+        (Which::Base, LanguagePair::FrEn) => marian::Config::opus_mt_fr_en(),
+        (Which::Big, LanguagePair::FrEn) => marian::Config::opus_mt_tc_big_fr_en(),
+        (Which::Base, LanguagePair::EnZh) => marian::Config::opus_mt_en_zh(),
+        (Which::Base, LanguagePair::EnHi) => marian::Config::opus_mt_en_hi(),
+        (Which::Base, LanguagePair::EnEs) => marian::Config::opus_mt_en_es(),
+        (Which::Base, LanguagePair::EnFr) => marian::Config::opus_mt_fr_en(),
+        (Which::Base, LanguagePair::EnRu) => marian::Config::opus_mt_en_ru(),
+        (Which::Big, lp) => anyhow::bail!("big is not supported for language pair {lp:?}"),
+    };
+    let tokenizer_default_repo = match args.language_pair {
+        LanguagePair::FrEn => "lmz/candle-marian",
+        LanguagePair::EnZh
+        | LanguagePair::EnHi
+        | LanguagePair::EnEs
+        | LanguagePair::EnFr
+        | LanguagePair::EnRu => "KeighBee/candle-marian",
    };
    let tokenizer = {
        let tokenizer = match args.tokenizer {
            Some(tokenizer) => std::path::PathBuf::from(tokenizer),
            None => {
-                let name = match args.which {
-                    Which::Base => "tokenizer-marian-base-fr.json",
-                    Which::Big => "tokenizer-marian-fr.json",
+                let filename = match (args.which, args.language_pair) {
+                    (Which::Base, LanguagePair::FrEn) => "tokenizer-marian-base-fr.json",
+                    (Which::Big, LanguagePair::FrEn) => "tokenizer-marian-fr.json",
+                    (Which::Base, LanguagePair::EnZh) => "tokenizer-marian-base-en-zh-en.json",
+                    (Which::Base, LanguagePair::EnHi) => "tokenizer-marian-base-en-hi-en.json",
+                    (Which::Base, LanguagePair::EnEs) => "tokenizer-marian-base-en-es-en.json",
+                    (Which::Base, LanguagePair::EnFr) => "tokenizer-marian-base-en-fr-en.json",
+                    (Which::Base, LanguagePair::EnRu) => "tokenizer-marian-base-en-ru-en.json",
+                    (Which::Big, lp) => {
+                        anyhow::bail!("big is not supported for language pair {lp:?}")
+                    }
                };
                Api::new()?
-                    .model("lmz/candle-marian".to_string())
-                    .get(name)?
+                    .model(tokenizer_default_repo.to_string())
+                    .get(filename)?
            }
        };
        Tokenizer::from_file(&tokenizer).map_err(E::msg)?
@ -77,13 +119,21 @@ pub fn main() -> anyhow::Result<()> {
        let tokenizer = match args.tokenizer_dec {
            Some(tokenizer) => std::path::PathBuf::from(tokenizer),
            None => {
-                let name = match args.which {
-                    Which::Base => "tokenizer-marian-base-en.json",
-                    Which::Big => "tokenizer-marian-en.json",
+                let filename = match (args.which, args.language_pair) {
+                    (Which::Base, LanguagePair::FrEn) => "tokenizer-marian-base-en.json",
+                    (Which::Big, LanguagePair::FrEn) => "tokenizer-marian-en.json",
+                    (Which::Base, LanguagePair::EnZh) => "tokenizer-marian-base-en-zh-zh.json",
+                    (Which::Base, LanguagePair::EnHi) => "tokenizer-marian-base-en-hi-hi.json",
+                    (Which::Base, LanguagePair::EnEs) => "tokenizer-marian-base-en-es-es.json",
+                    (Which::Base, LanguagePair::EnFr) => "tokenizer-marian-base-en-fr-fr.json",
+                    (Which::Base, LanguagePair::EnRu) => "tokenizer-marian-base-en-ru-ru.json",
+                    (Which::Big, lp) => {
+                        anyhow::bail!("big is not supported for language pair {lp:?}")
+                    }
                };
                Api::new()?
-                    .model("lmz/candle-marian".to_string())
-                    .get(name)?
+                    .model(tokenizer_default_repo.to_string())
+                    .get(filename)?
            }
        };
        Tokenizer::from_file(&tokenizer).map_err(E::msg)?
@ -94,18 +144,48 @@ pub fn main() -> anyhow::Result<()> {
    let vb = {
        let model = match args.model {
            Some(model) => std::path::PathBuf::from(model),
-            None => match args.which {
-                Which::Base => Api::new()?
-                    .repo(hf_hub::Repo::with_revision(
+            None => {
+                let api = Api::new()?;
+                let api = match (args.which, args.language_pair) {
+                    (Which::Base, LanguagePair::FrEn) => api.repo(hf_hub::Repo::with_revision(
                        "Helsinki-NLP/opus-mt-fr-en".to_string(),
                        hf_hub::RepoType::Model,
                        "refs/pr/4".to_string(),
-                    ))
-                    .get("model.safetensors")?,
-                Which::Big => Api::new()?
-                    .model("Helsinki-NLP/opus-mt-tc-big-fr-en".to_string())
-                    .get("model.safetensors")?,
-            },
+                    )),
+                    (Which::Big, LanguagePair::FrEn) => {
+                        api.model("Helsinki-NLP/opus-mt-tc-big-fr-en".to_string())
+                    }
+                    (Which::Base, LanguagePair::EnZh) => api.repo(hf_hub::Repo::with_revision(
+                        "Helsinki-NLP/opus-mt-en-zh".to_string(),
+                        hf_hub::RepoType::Model,
+                        "refs/pr/13".to_string(),
+                    )),
+                    (Which::Base, LanguagePair::EnHi) => api.repo(hf_hub::Repo::with_revision(
+                        "Helsinki-NLP/opus-mt-en-hi".to_string(),
+                        hf_hub::RepoType::Model,
+                        "refs/pr/3".to_string(),
+                    )),
+                    (Which::Base, LanguagePair::EnEs) => api.repo(hf_hub::Repo::with_revision(
+                        "Helsinki-NLP/opus-mt-en-es".to_string(),
+                        hf_hub::RepoType::Model,
+                        "refs/pr/4".to_string(),
+                    )),
+                    (Which::Base, LanguagePair::EnFr) => api.repo(hf_hub::Repo::with_revision(
+                        "Helsinki-NLP/opus-mt-en-fr".to_string(),
+                        hf_hub::RepoType::Model,
+                        "refs/pr/9".to_string(),
+                    )),
+                    (Which::Base, LanguagePair::EnRu) => api.repo(hf_hub::Repo::with_revision(
+                        "Helsinki-NLP/opus-mt-en-ru".to_string(),
+                        hf_hub::RepoType::Model,
+                        "refs/pr/7".to_string(),
+                    )),
+                    (Which::Big, lp) => {
+                        anyhow::bail!("big is not supported for language pair {lp:?}")
+                    }
+                };
+                api.get("model.safetensors")?
+            }
        };
        unsafe { VarBuilder::from_mmaped_safetensors(&[&model], DType::F32, &device)? }
    };
--- a/candle-examples/examples/marian-mt/python/convert_slow_tokenizer.py
+++ b/candle-examples/examples/marian-mt/python/convert_slow_tokenizer.py
@ -0,0 +1,53 @@
+from pathlib import Path
+import warnings
+
+from transformers import AutoTokenizer
+from transformers.convert_slow_tokenizer import SpmConverter, requires_backends, import_protobuf
+
+class MarianConverter(SpmConverter):
+    def __init__(self, *args, index: int = 0):
+        requires_backends(self, "protobuf")
+
+        super(SpmConverter, self).__init__(*args)
+
+        # from .utils import sentencepiece_model_pb2 as model_pb2
+        model_pb2 = import_protobuf()
+
+        m = model_pb2.ModelProto()
+        print(self.original_tokenizer.spm_files)
+        with open(self.original_tokenizer.spm_files[index], "rb") as f:
+            m.ParseFromString(f.read())
+        self.proto = m
+        print(self.original_tokenizer)
+        #with open(self.original_tokenizer.vocab_path, "r") as f:
+        dir_path = Path(self.original_tokenizer.spm_files[0]).parents[0]
+        with open(dir_path / "vocab.json", "r") as f:
+            import json
+            self._vocab = json.load(f)
+
+        if self.proto.trainer_spec.byte_fallback:
+            if not getattr(self, "handle_byte_fallback", None):
+                warnings.warn(
+                    "The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option"
+                    " which is not implemented in the fast tokenizers. In practice this means that the fast version of the"
+                    " tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these "
+                    "unknown tokens into a sequence of byte tokens matching the original piece of text."
+                )
+
+    def vocab(self, proto):
+        vocab_size = max(self._vocab.values()) + 1
+        vocab = [("<NIL>", -100) for _ in range(vocab_size)]
+        for piece in proto.pieces:
+            try:
+                index = self._vocab[piece.piece]
+            except Exception:
+                print(f"Ignored missing piece {piece.piece}")
+            vocab[index] = (piece.piece, piece.score)
+        return vocab
+
+
+tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en", use_fast=False)
+fast_tokenizer = MarianConverter(tokenizer, index=0).converted()
+fast_tokenizer.save("tokenizer-marian-base-fr.json")
+fast_tokenizer = MarianConverter(tokenizer, index=1).converted()
+fast_tokenizer.save("tokenizer-marian-base-en.json")
--- a/candle-examples/examples/marian-mt/python/requirements.txt
+++ b/candle-examples/examples/marian-mt/python/requirements.txt
@ -0,0 +1,22 @@
+certifi==2025.1.31
+charset-normalizer==3.4.1
+click==8.1.8
+filelock==3.18.0
+fsspec==2025.3.2
+huggingface-hub==0.30.1
+idna==3.10
+joblib==1.4.2
+numpy==2.2.4
+packaging==24.2
+protobuf==6.30.2
+pyyaml==6.0.2
+regex==2024.11.6
+requests==2.32.3
+sacremoses==0.1.1
+safetensors==0.5.3
+sentencepiece==0.2.0
+tokenizers==0.21.1
+tqdm==4.67.1
+transformers==4.50.3
+typing-extensions==4.13.0
+urllib3==2.3.0
--- a/candle-examples/examples/metavoice/README.md
+++ b/candle-examples/examples/metavoice/README.md
@ -13,6 +13,6 @@ Note that the current candle implementation suffers from some limitations as of
 ## Run an example

 ```bash
-cargo run --example metavoice --release -- \\
+cargo run --example metavoice --release -- \
  --prompt "This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model."
 ```
--- a/candle-examples/examples/metavoice/main.rs
+++ b/candle-examples/examples/metavoice/main.rs
@ -16,7 +16,7 @@ use candle_transformers::models::quantized_metavoice::transformer as qtransforme
 use candle::{DType, IndexOp, Tensor};
 use candle_nn::VarBuilder;
 use hf_hub::api::sync::Api;
-use rand::{distributions::Distribution, SeedableRng};
+use rand::{distr::Distribution, SeedableRng};

 pub const ENCODEC_NTOKENS: u32 = 1024;

@ -250,7 +250,7 @@ fn main() -> Result<()> {
            let logits = logits.i(step)?.to_dtype(DType::F32)?;
            let logits = &(&logits / 1.0)?;
            let prs = candle_nn::ops::softmax_last_dim(logits)?.to_vec1::<f32>()?;
-            let distr = rand::distributions::WeightedIndex::new(prs.as_slice())?;
+            let distr = rand::distr::weighted::WeightedIndex::new(prs.as_slice())?;
            let sample = distr.sample(&mut rng) as u32;
            codes_.push(sample)
        }
--- a/candle-examples/examples/mnist-training/README.md
+++ b/candle-examples/examples/mnist-training/README.md
@ -0,0 +1,16 @@
+# candle-mnist-training
+
+Training a 2 layer MLP on mnist in Candle.
+
+## Running an example
+
+```bash
+$ cargo run --example mnist-training --features candle-datasets
+
+> train-images: [60000, 784]
+> train-labels: [60000]
+> test-images: [10000, 784]
+> test-labels: [10000]
+>    1 train loss:  2.30265 test acc: 68.08%
+>    2 train loss:  1.50815 test acc: 60.77%
+```
--- a/candle-examples/examples/mnist-training/main.rs
+++ b/candle-examples/examples/mnist-training/main.rs
@ -7,6 +7,7 @@ extern crate accelerate_src;

 use clap::{Parser, ValueEnum};
 use rand::prelude::*;
+use rand::rng;

 use candle::{DType, Result, Tensor, D};
 use candle_nn::{loss, ops, Conv2d, Linear, Module, ModuleT, Optimizer, VarBuilder, VarMap};
@ -138,7 +139,7 @@ fn training_loop_cnn(
    let mut batch_idxs = (0..n_batches).collect::<Vec<usize>>();
    for epoch in 1..args.epochs {
        let mut sum_loss = 0f32;
-        batch_idxs.shuffle(&mut thread_rng());
+        batch_idxs.shuffle(&mut rng());
        for batch_idx in batch_idxs.iter() {
            let train_images = train_images.narrow(0, batch_idx * BSIZE, BSIZE)?;
            let train_labels = train_labels.narrow(0, batch_idx * BSIZE, BSIZE)?;
--- a/candle-examples/examples/moondream/README.md
+++ b/candle-examples/examples/moondream/README.md
@ -12,7 +12,7 @@ $ wget https://raw.githubusercontent.com/vikhyat/moondream/main/assets/demo-1.jp

 Now you can run Moondream from the `candle-examples` crate:
 ```bash
-$ cargo run --example moondream --release -- --prompt "What is the girl eating?" --image "./demo-1.jpg"
+$ cargo run --example moondream --release -- --prompt "Describe the people behind the bikers?" --image "candle-examples/examples/yolo-v8/assets/bike.jpg"

 avavx: false, neon: true, simd128: false, f16c: false
 temp: 0.00 repeat-penalty: 1.00 repeat-last-n: 64
--- a/candle-examples/examples/musicgen/README.md
+++ b/candle-examples/examples/musicgen/README.md
@ -0,0 +1,20 @@
+# candle-musicgen
+
+Candle implementation of musicgen from [Simple and Controllable Music Generation](https://arxiv.org/pdf/2306.05284).
+
+## Running an example
+
+```bash
+$ cargo run --example musicgen -- --prompt "90s rock song with loud guitars and heavy drums"
+
+> tokens: [2777, 7, 2480, 2324, 28, 8002, 5507, 7, 11, 2437, 5253, 7, 1]
+> Tensor[dims 1, 13; u32]
+> [[[ 0.0902,  0.1256, -0.0585, ...,  0.1057, -0.5141, -0.4675],
+>   [ 0.1972, -0.0268, -0.3368, ..., -0.0495, -0.3597, -0.3940],
+>   [-0.0855, -0.0007,  0.2225, ..., -0.2804, -0.5360, -0.2436],
+>   ...
+>   [ 0.0515,  0.0235, -0.3855, ..., -0.4728, -0.6858, -0.2923],
+>   [-0.3728, -0.1442, -0.1179, ..., -0.4388, -0.0287, -0.3242],
+>   [ 0.0163,  0.0012, -0.0020, ...,  0.0142,  0.0173, -0.0103]]]
+> Tensor[[1, 13, 768], f32]
+```
--- a/candle-examples/examples/orpheus/README.md
+++ b/candle-examples/examples/orpheus/README.md
@ -0,0 +1,14 @@
+# Orpheus
+
+Orpheus is a 3B text-to-speech model based on Llama.
+
+- Weights on HuggingFace
+  [canopylabs/orpheus-3b-0.1-ft](https://huggingface.co/canopylabs/orpheus-3b-0.1-ft).
+- Code on GitHub [canopyai/Orpheus-TTS](https://github.com/canopyai/Orpheus-TTS).
+
+
+```bash
+cargo run --example orpheus --features cuda -r
+```
+
+
--- a/candle-examples/examples/orpheus/main.rs
+++ b/candle-examples/examples/orpheus/main.rs
@ -0,0 +1,329 @@
+#[cfg(feature = "mkl")]
+extern crate intel_mkl_src;
+
+#[cfg(feature = "accelerate")]
+extern crate accelerate_src;
+
+use anyhow::{Error as E, Result};
+use clap::Parser;
+
+use candle::{DType, Device, IndexOp, Tensor};
+use candle_nn::VarBuilder;
+use candle_transformers::models::llama::{Cache, Llama, LlamaConfig};
+use candle_transformers::models::snac::{Config as SnacConfig, Model as SnacModel};
+use tokenizers::Tokenizer;
+
+// https://github.com/canopyai/Orpheus-TTS/blob/df0b0d96685dd21885aef7f900ee7f705c669e94/realtime_streaming_example/main.py#L43
+const STOP_TOKEN_ID: u32 = 128258;
+
+#[derive(Parser)]
+struct Args {
+    /// Run on CPU rather than on GPU.
+    #[arg(long)]
+    cpu: bool,
+
+    /// Enable tracing (generates a trace-timestamp.json file).
+    #[arg(long)]
+    tracing: bool,
+
+    /// Display the token for the specified prompt.
+    #[arg(long)]
+    verbose_prompt: bool,
+
+    #[arg(long, default_value = "Hey, how are you doing today?")]
+    prompt: String,
+
+    /// The temperature used to generate samples.
+    #[arg(long, default_value_t = 0.6)]
+    temperature: f64,
+
+    /// Nucleus sampling probability cutoff.
+    #[arg(long)]
+    top_p: Option<f64>,
+
+    /// Only sample among the top K samples.
+    #[arg(long)]
+    top_k: Option<usize>,
+
+    /// The seed to use when generating random samples.
+    #[arg(long, default_value_t = 299792458)]
+    seed: u64,
+
+    #[arg(long)]
+    model_id: Option<String>,
+
+    #[arg(long)]
+    revision: Option<String>,
+
+    #[arg(long)]
+    model_file: Option<String>,
+
+    #[arg(long)]
+    tokenizer_file: Option<String>,
+
+    #[arg(long)]
+    config_file: Option<String>,
+
+    /// The output wav file.
+    #[arg(long, default_value = "out.wav")]
+    out_file: String,
+
+    #[arg(long, default_value = "3b-0.1-ft")]
+    which: Which,
+
+    #[arg(long, default_value = "tara")]
+    voice: Voice,
+
+    #[arg(long)]
+    use_flash_attn: bool,
+}
+
+#[derive(Clone, Debug, Copy, PartialEq, Eq, clap::ValueEnum)]
+enum Voice {
+    #[value(name = "tara")]
+    Tara,
+    #[value(name = "leah")]
+    Leah,
+    #[value(name = "jess")]
+    Jess,
+    #[value(name = "leo")]
+    Leo,
+    #[value(name = "dan")]
+    Dan,
+    #[value(name = "mia")]
+    Mia,
+    #[value(name = "zac")]
+    Zac,
+    #[value(name = "zoe")]
+    Zoe,
+}
+
+impl Voice {
+    fn as_str(&self) -> &'static str {
+        match self {
+            Voice::Tara => "tara",
+            Voice::Leah => "leah",
+            Voice::Jess => "jess",
+            Voice::Leo => "leo",
+            Voice::Dan => "dan",
+            Voice::Mia => "mia",
+            Voice::Zac => "zac",
+            Voice::Zoe => "zoe",
+        }
+    }
+}
+
+#[derive(Clone, Debug, Copy, PartialEq, Eq, clap::ValueEnum)]
+enum Which {
+    #[value(name = "3b-0.1-ft")]
+    ThreeB0_1Ft,
+}
+
+fn main() -> Result<()> {
+    use tracing_chrome::ChromeLayerBuilder;
+    use tracing_subscriber::prelude::*;
+
+    let args = Args::parse();
+
+    let _guard = if args.tracing {
+        let (chrome_layer, guard) = ChromeLayerBuilder::new().build();
+        tracing_subscriber::registry().with(chrome_layer).init();
+        Some(guard)
+    } else {
+        None
+    };
+    println!(
+        "avx: {}, neon: {}, simd128: {}, f16c: {}",
+        candle::utils::with_avx(),
+        candle::utils::with_neon(),
+        candle::utils::with_simd128(),
+        candle::utils::with_f16c()
+    );
+    let prompt = args.prompt.clone();
+    let mut model = Model::load(args)?;
+    model.run(&prompt)?;
+    Ok(())
+}
+
+struct Model {
+    model: Llama,
+    tokenizer: Tokenizer,
+    logits_processor: candle_transformers::generation::LogitsProcessor,
+    cache: Cache,
+    device: Device,
+    verbose_prompt: bool,
+    snac: SnacModel,
+    out_file: String,
+    voice: Voice,
+}
+
+fn load_snac(device: &Device) -> Result<SnacModel> {
+    let api = hf_hub::api::sync::Api::new()?;
+    let m = api.model("hubertsiuzdak/snac_24khz".to_string());
+    let config = m.get("config.json")?;
+    let config: SnacConfig = serde_json::from_reader(std::fs::File::open(config)?)?;
+    let m = api.model("lmz/candle-snac".to_string());
+    let model = m.get("snac_24khz.safetensors")?;
+    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&[model], DType::F32, device)? };
+    let model = SnacModel::new(&config, vb)?;
+    Ok(model)
+}
+
+impl Model {
+    fn load(args: Args) -> Result<Self> {
+        let start = std::time::Instant::now();
+        let api = hf_hub::api::sync::Api::new()?;
+        let model_id = match args.model_id {
+            Some(model_id) => model_id.to_string(),
+            None => match args.which {
+                Which::ThreeB0_1Ft => "canopylabs/orpheus-3b-0.1-ft".to_string(),
+            },
+        };
+        let revision = match args.revision {
+            Some(r) => r,
+            None => "main".to_string(),
+        };
+        let repo = api.repo(hf_hub::Repo::with_revision(
+            model_id,
+            hf_hub::RepoType::Model,
+            revision,
+        ));
+        let model_files = match args.model_file {
+            Some(m) => vec![m.into()],
+            None => match args.which {
+                Which::ThreeB0_1Ft => {
+                    candle_examples::hub_load_safetensors(&repo, "model.safetensors.index.json")?
+                }
+            },
+        };
+        let config = match args.config_file {
+            Some(m) => m.into(),
+            None => repo.get("config.json")?,
+        };
+        let tokenizer = match args.tokenizer_file {
+            Some(m) => m.into(),
+            None => repo.get("tokenizer.json")?,
+        };
+        println!("retrieved the files in {:?}", start.elapsed());
+        let tokenizer = Tokenizer::from_file(tokenizer).map_err(E::msg)?;
+
+        let start = std::time::Instant::now();
+        let device = candle_examples::device(args.cpu)?;
+        let dtype = device.bf16_default_to_f32();
+        let vb = unsafe { VarBuilder::from_mmaped_safetensors(&model_files, dtype, &device)? };
+        let config: LlamaConfig = serde_json::from_reader(std::fs::File::open(config)?)?;
+        let config = config.into_config(args.use_flash_attn);
+        let model = Llama::load(vb, &config)?;
+        let logits_processor = {
+            use candle_transformers::generation::{LogitsProcessor, Sampling};
+            let temperature = args.temperature;
+            let sampling = if temperature <= 0. {
+                Sampling::ArgMax
+            } else {
+                match (args.top_k.as_ref(), args.top_p.as_ref()) {
+                    (None, None) => Sampling::All { temperature },
+                    (Some(&k), None) => Sampling::TopK { k, temperature },
+                    (None, Some(&p)) => Sampling::TopP { p, temperature },
+                    (Some(&k), Some(&p)) => Sampling::TopKThenTopP { k, p, temperature },
+                }
+            };
+            LogitsProcessor::from_sampling(args.seed, sampling)
+        };
+
+        println!("loaded the model in {:?}", start.elapsed());
+        let cache = Cache::new(true, dtype, &config, &device)?;
+        let snac = load_snac(&device)?;
+        Ok(Self {
+            model,
+            tokenizer,
+            logits_processor,
+            cache,
+            device,
+            verbose_prompt: args.verbose_prompt,
+            snac,
+            voice: args.voice,
+            out_file: args.out_file,
+        })
+    }
+
+    fn run(&mut self, prompt: &str) -> Result<()> {
+        println!("running the model on '{}'", prompt);
+        let device = &self.device;
+        let prompt = format!("{voice}: {prompt}", voice = self.voice.as_str());
+        let tokens = self.tokenizer.encode(prompt, true).map_err(E::msg)?;
+        // https://github.com/canopyai/Orpheus-TTS/blob/df0b0d96685dd21885aef7f900ee7f705c669e94/orpheus_tts_pypi/orpheus_tts/engine_class.py#L82
+        let mut tokens = [
+            &[128259],
+            tokens.get_ids(),
+            &[128009, 128260, 128261, 128257],
+        ]
+        .concat();
+        if self.verbose_prompt {
+            println!("{:?}", tokens);
+        }
+        let mut cache = self.cache.clone();
+
+        println!("starting the inference loop");
+        let mut index_pos = 0;
+        let mut audio_tokens = vec![];
+        for index in 0..2000 {
+            let (context_size, context_index) = if index > 0 {
+                (1, index_pos)
+            } else {
+                (tokens.len(), 0)
+            };
+            let ctxt = &tokens[tokens.len().saturating_sub(context_size)..];
+            let input = Tensor::new(ctxt, device)?.unsqueeze(0)?;
+            let logits = self.model.forward(&input, context_index, &mut cache)?;
+            let logits = logits.squeeze(0)?;
+            index_pos += ctxt.len();
+
+            let next_token = self.logits_processor.sample(&logits)?;
+            if let Some(tok) = self.tokenizer.id_to_token(next_token) {
+                match tok.strip_prefix("<custom_token_") {
+                    Some(tok) => match tok.strip_suffix('>') {
+                        Some(tok) => {
+                            let tok = tok.parse::<u32>()?;
+                            // https://github.com/canopyai/Orpheus-TTS/blob/df0b0d96685dd21885aef7f900ee7f705c669e94/orpheus_tts_pypi/orpheus_tts/decoder.py#L86C35-L86C63
+                            let tok = tok - 10 - ((audio_tokens.len() as u32 % 7) * 4096);
+                            audio_tokens.push(tok);
+                        }
+                        None => {
+                            println!("{index}: unexpected custom token {next_token} {tok}");
+                        }
+                    },
+                    None => {
+                        println!("{index}: unexpected token {next_token} {tok}");
+                    }
+                }
+            }
+            if next_token == STOP_TOKEN_ID {
+                println!("reached stop token");
+                break;
+            }
+            tokens.push(next_token);
+        }
+        println!("generated {} audio tokens", audio_tokens.len());
+        let mut codes0 = vec![];
+        let mut codes1 = vec![];
+        let mut codes2 = vec![];
+        for audio_tokens in audio_tokens.chunks_exact(7) {
+            codes0.push(audio_tokens[0]);
+            for i in [1, 4] {
+                codes1.push(audio_tokens[i]);
+            }
+            for i in [2, 3, 5, 6] {
+                codes2.push(audio_tokens[i]);
+            }
+        }
+        let codes0 = Tensor::new(codes0, device)?.unsqueeze(0)?;
+        let codes1 = Tensor::new(codes1, device)?.unsqueeze(0)?;
+        let codes2 = Tensor::new(codes2, device)?.unsqueeze(0)?;
+        let pcm = self.snac.decode(&[&codes0, &codes1, &codes2])?;
+        println!("decoded to pcm {pcm:?}");
+        let mut output = std::fs::File::create(&self.out_file)?;
+        let pcm = pcm.i(0)?.i(0)?.to_vec1::<f32>()?;
+        candle_examples::wav::write_pcm_as_wav(&mut output, &pcm, 24000)?;
+        Ok(())
+    }
+}
--- a/candle-examples/examples/phi/main.rs
+++ b/candle-examples/examples/phi/main.rs
@ -148,6 +148,8 @@ enum WhichModel {
    #[value(name = "3-medium")]
    V3Medium,
    #[value(name = "2-old")]
+    V4Mini,
+    #[value(name = "4-mini")]
    V2Old,
    PuffinPhiV2,
    PhiHermes,
@ -261,6 +263,7 @@ fn main() -> Result<()> {
                    WhichModel::V2 | WhichModel::V2Old => "microsoft/phi-2".to_string(),
                    WhichModel::V3 => "microsoft/Phi-3-mini-4k-instruct".to_string(),
                    WhichModel::V3Medium => "microsoft/Phi-3-medium-4k-instruct".to_string(),
+                    WhichModel::V4Mini => "microsoft/Phi-4-mini-instruct".to_string(),
                    WhichModel::PuffinPhiV2 | WhichModel::PhiHermes => {
                        "lmz/candle-quantized-phi".to_string()
                    }
@ -281,6 +284,7 @@ fn main() -> Result<()> {
                    WhichModel::V2
                    | WhichModel::V3
                    | WhichModel::V3Medium
+                    | WhichModel::V4Mini
                    | WhichModel::PuffinPhiV2
                    | WhichModel::PhiHermes => "main".to_string(),
                }
@ -296,7 +300,8 @@ fn main() -> Result<()> {
            | WhichModel::V2
            | WhichModel::V2Old
            | WhichModel::V3
-            | WhichModel::V3Medium => repo.get("tokenizer.json")?,
+            | WhichModel::V3Medium
+            | WhichModel::V4Mini => repo.get("tokenizer.json")?,
            WhichModel::PuffinPhiV2 | WhichModel::PhiHermes => {
                repo.get("tokenizer-puffin-phi-v2.json")?
            }
@ -312,19 +317,21 @@ fn main() -> Result<()> {
                    WhichModel::V2 | WhichModel::V2Old => vec![repo.get("model-v2-q4k.gguf")?],
                    WhichModel::PuffinPhiV2 => vec![repo.get("model-puffin-phi-v2-q4k.gguf")?],
                    WhichModel::PhiHermes => vec![repo.get("model-phi-hermes-1_3B-q4k.gguf")?],
-                    WhichModel::V3 | WhichModel::V3Medium => anyhow::bail!(
+                    WhichModel::V3 | WhichModel::V3Medium | WhichModel::V4Mini => anyhow::bail!(
                        "use the quantized or quantized-phi examples for quantized phi-v3"
                    ),
                }
            } else {
                match args.model {
                    WhichModel::V1 | WhichModel::V1_5 => vec![repo.get("model.safetensors")?],
-                    WhichModel::V2 | WhichModel::V2Old | WhichModel::V3 | WhichModel::V3Medium => {
-                        candle_examples::hub_load_safetensors(
-                            &repo,
-                            "model.safetensors.index.json",
-                        )?
-                    }
+                    WhichModel::V2
+                    | WhichModel::V2Old
+                    | WhichModel::V3
+                    | WhichModel::V3Medium
+                    | WhichModel::V4Mini => candle_examples::hub_load_safetensors(
+                        &repo,
+                        "model.safetensors.index.json",
+                    )?,
                    WhichModel::PuffinPhiV2 => vec![repo.get("model-puffin-phi-v2.safetensors")?],
                    WhichModel::PhiHermes => vec![repo.get("model-phi-hermes-1_3B.safetensors")?],
                }
@ -341,7 +348,7 @@ fn main() -> Result<()> {
        WhichModel::V2 | WhichModel::V2Old => Config::v2(),
        WhichModel::PuffinPhiV2 => Config::puffin_phi_v2(),
        WhichModel::PhiHermes => Config::phi_hermes_1_3b(),
-        WhichModel::V3 | WhichModel::V3Medium => {
+        WhichModel::V3 | WhichModel::V3Medium | WhichModel::V4Mini => {
            panic!("use the quantized or quantized-phi examples for quantized phi-v3")
        }
    };
@ -361,7 +368,10 @@ fn main() -> Result<()> {
        let dtype = match args.dtype {
            Some(dtype) => std::str::FromStr::from_str(&dtype)?,
            None => {
-                if args.model == WhichModel::V3 || args.model == WhichModel::V3Medium {
+                if args.model == WhichModel::V3
+                    || args.model == WhichModel::V3Medium
+                    || args.model == WhichModel::V4Mini
+                {
                    device.bf16_default_to_f32()
                } else {
                    DType::F32
@ -377,7 +387,7 @@ fn main() -> Result<()> {
                let phi = Phi::new(&config, vb)?;
                Model::Phi(phi)
            }
-            WhichModel::V3 | WhichModel::V3Medium => {
+            WhichModel::V3 | WhichModel::V3Medium | WhichModel::V4Mini => {
                let config_filename = repo.get("config.json")?;
                let config = std::fs::read_to_string(config_filename)?;
                let config: Phi3Config = serde_json::from_str(&config)?;
--- a/candle-examples/examples/quantized-phi/README.md
+++ b/candle-examples/examples/quantized-phi/README.md
@ -0,0 +1,20 @@
+# candle-quantized-phi
+
+Candle implementation of various quantized Phi models.
+
+## Running an example
+
+```bash
+$ cargo run --example quantized-phi --release -- --prompt "The best thing about coding in rust is "
+
+> - it's memory safe (without you having to worry too much) 
+> - the borrow checker is really smart and will catch your mistakes for free, making them show up as compile errors instead of segfaulting in runtime.
+> 
+> This alone make me prefer using rust over c++ or go, python/Cython etc.
+> 
+> The major downside I can see now: 
+> - it's slower than other languages (viz: C++) and most importantly lack of libraries to leverage existing work done by community in that language. There are so many useful machine learning libraries available for c++, go, python etc but none for Rust as far as I am aware of on the first glance. 
+> - there aren't a lot of production ready projects which also makes it very hard to start new one (given my background)
+> 
+> Another downside:
+```
--- a/candle-examples/examples/quantized-qwen2-instruct/main.rs
+++ b/candle-examples/examples/quantized-qwen2-instruct/main.rs
@ -27,6 +27,8 @@ enum Which {
    W2_7b,
    #[value(name = "72b")]
    W2_72b,
+    #[value(name = "deepseekr1-qwen7b")]
+    DeepseekR1Qwen7B,
 }

 #[derive(Parser, Debug)]
@ -102,6 +104,7 @@ impl Args {
                    Which::W2_1_5b => "Qwen/Qwen2-1.5B-Instruct",
                    Which::W2_7b => "Qwen/Qwen2-7B-Instruct",
                    Which::W2_72b => "Qwen/Qwen2-72B-Instruct",
+                    Which::DeepseekR1Qwen7B => "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
                };
                let api = api.model(repo.to_string());
                api.get("tokenizer.json")?
@ -135,6 +138,11 @@ impl Args {
                        "qwen2-72b-instruct-q4_0.gguf",
                        "main",
                    ),
+                    Which::DeepseekR1Qwen7B => (
+                        "unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF",
+                        "DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf",
+                        "main",
+                    ),
                };
                let api = hf_hub::api::sync::Api::new()?;
                api.repo(hf_hub::Repo::with_revision(
@ -211,11 +219,15 @@ fn main() -> anyhow::Result<()> {

    let tokenizer = args.tokenizer()?;
    let mut tos = TokenOutputStream::new(tokenizer);
-    let prompt_str = args.prompt.unwrap_or_else(|| DEFAULT_PROMPT.to_string());
-    let prompt_str = format!(
-        "<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
-        prompt_str
-    );
+    let prompt_str = args
+        .prompt
+        .clone()
+        .unwrap_or_else(|| DEFAULT_PROMPT.to_string());
+
+    let prompt_str = match args.which {
+        Which::DeepseekR1Qwen7B => format!("<｜User｜>{prompt_str}<｜Assistant｜>"),
+        _ => format!("<|im_start|>user\n{prompt_str}<|im_end|>\n<|im_start|>assistant\n"),
+    };
    print!("formatted instruct prompt: {}", &prompt_str);
    let tokens = tos
        .tokenizer()
@ -260,7 +272,13 @@ fn main() -> anyhow::Result<()> {
        print!("{t}");
        std::io::stdout().flush()?;
    }
-    let eos_token = *tos.tokenizer().get_vocab(true).get("<|im_end|>").unwrap();
+
+    let eos_token = match args.which {
+        Which::DeepseekR1Qwen7B => "<｜end▁of▁sentence｜>",
+        _ => "<|im_end|>",
+    };
+
+    let eos_token = *tos.tokenizer().get_vocab(true).get(eos_token).unwrap();
    let start_post_prompt = std::time::Instant::now();
    let mut sampled = 0;
    for index in 0..to_sample {
--- a/candle-examples/examples/quantized-t5/README.md
+++ b/candle-examples/examples/quantized-t5/README.md
@ -1,5 +1,7 @@
 # candle-quantized-t5

+Candle implementation for quantizing and running T5 translation models.
+
 ## Seq2Seq example

 This example uses a quantized version of the t5 model.
--- a/candle-examples/examples/quantized/main.rs
+++ b/candle-examples/examples/quantized/main.rs
@ -75,6 +75,8 @@ enum Which {
    SmolLM2_360MInstruct,
    #[value(name = "SmoLM2-1.7B-Instruct")]
    SmolLM2_1BInstruct,
+    #[value(name = "deepseekr1-llama8b")]
+    DeepseekR1Llama8b,
 }

 impl Which {
@ -94,7 +96,8 @@ impl Which {
            | Self::L8b
            | Self::Phi3
            | Self::SmolLM2_1BInstruct
-            | Self::SmolLM2_360MInstruct => false,
+            | Self::SmolLM2_360MInstruct
+            | Self::DeepseekR1Llama8b => false,
            // Zephyr and OpenChat are fine tuned versions of mistral and should be treated in the
            // same way. Starling is a fine tuned version of OpenChat.
            Self::OpenChat35
@ -132,7 +135,8 @@ impl Which {
            | Self::L8b
            | Self::SmolLM2_1BInstruct
            | Self::SmolLM2_360MInstruct
-            | Self::Phi3 => false,
+            | Self::Phi3
+            | Self::DeepseekR1Llama8b => false,
            Self::Zephyr7bAlpha | Self::Zephyr7bBeta => true,
        }
    }
@ -160,11 +164,41 @@ impl Which {
            | Self::L8b
            | Self::SmolLM2_1BInstruct
            | Self::SmolLM2_360MInstruct
-            | Self::Phi3 => false,
+            | Self::Phi3
+            | Self::DeepseekR1Llama8b => false,
            Self::OpenChat35 | Self::Starling7bAlpha => true,
        }
    }

+    fn is_deepseek(&self) -> bool {
+        match self {
+            Self::L7b
+            | Self::L13b
+            | Self::L70b
+            | Self::L7bChat
+            | Self::L13bChat
+            | Self::L70bChat
+            | Self::L7bCode
+            | Self::L13bCode
+            | Self::L34bCode
+            | Self::Leo7b
+            | Self::Leo13b
+            | Self::Mixtral
+            | Self::MixtralInstruct
+            | Self::Mistral7b
+            | Self::Mistral7bInstruct
+            | Self::Mistral7bInstructV02
+            | Self::Zephyr7bAlpha
+            | Self::Zephyr7bBeta
+            | Self::L8b
+            | Self::SmolLM2_1BInstruct
+            | Self::SmolLM2_360MInstruct
+            | Self::Phi3
+            | Self::OpenChat35
+            | Self::Starling7bAlpha => false,
+            Self::DeepseekR1Llama8b => true,
+        }
+    }
    fn tokenizer_repo(&self) -> &'static str {
        match self {
            Self::L7b
@ -191,6 +225,7 @@ impl Which {
            Self::Phi3 => "microsoft/Phi-3-mini-4k-instruct",
            Self::SmolLM2_360MInstruct => "HuggingFaceTB/SmolLM2-360M-Instruct",
            Self::SmolLM2_1BInstruct => "HuggingFaceTB/SmolLM2-1.7B-Instruct",
+            Self::DeepseekR1Llama8b => "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        }
    }
 }
@ -363,6 +398,10 @@ impl Args {
                        "HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF",
                        "smollm2-1.7b-instruct-q4_k_m.gguf",
                    ),
+                    Which::DeepseekR1Llama8b => (
+                        "unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF",
+                        "DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf",
+                    ),
                };
                let revision = if self.which == Which::Phi3 {
                    "5eef2ce24766d31909c0b269fe90c817a8f263fb"
@ -477,6 +516,7 @@ fn main() -> anyhow::Result<()> {
                | Which::L8b
                | Which::SmolLM2_1BInstruct
                | Which::SmolLM2_360MInstruct
+                | Which::DeepseekR1Llama8b
                | Which::Phi3 => 1,
                Which::Mixtral
                | Which::MixtralInstruct
@ -530,6 +570,8 @@ fn main() -> anyhow::Result<()> {
                    }
                } else if args.which.is_mistral() {
                    format!("[INST] {prompt} [/INST]")
+                } else if args.which.is_deepseek() {
+                    format!("<｜User｜>{prompt}<｜Assistant｜>")
                } else {
                    prompt
                }
@ -597,6 +639,7 @@ fn main() -> anyhow::Result<()> {
        let eos_token = match args.which {
            Which::SmolLM2_360MInstruct | Which::SmolLM2_1BInstruct => "<|endoftext|>",
            Which::L8b => "<|end_of_text|>",
+            Which::DeepseekR1Llama8b => "<｜end▁of▁sentence｜>",
            _ => match args.which.is_open_chat() {
                true => "<|end_of_turn|>",
                false => "</s>",
--- a/candle-examples/examples/reinforcement-learning/README.md
+++ b/candle-examples/examples/reinforcement-learning/README.md
@ -2,6 +2,11 @@

 Reinforcement Learning examples for candle.

+> [!WARNING]  
+> uv is not currently compatible with pyo3 as of 2025/3/28.
+
+## System wide python
+
 This has been tested with `gymnasium` version `0.29.1`. You can install the
 Python package with:
 ```bash
--- a/candle-examples/examples/reinforcement-learning/ddpg.rs
+++ b/candle-examples/examples/reinforcement-learning/ddpg.rs
@ -5,7 +5,7 @@ use candle_nn::{
    func, linear, sequential::seq, Activation, AdamW, Optimizer, ParamsAdamW, Sequential,
    VarBuilder, VarMap,
 };
-use rand::{distributions::Uniform, thread_rng, Rng};
+use rand::{distr::Uniform, rng, Rng};

 use super::gym_env::GymEnv;

@ -103,8 +103,8 @@ impl ReplayBuffer {
        if self.size < batch_size {
            Ok(None)
        } else {
-            let transitions: Vec<&Transition> = thread_rng()
-                .sample_iter(Uniform::from(0..self.size))
+            let transitions: Vec<&Transition> = rng()
+                .sample_iter(Uniform::try_from(0..self.size).map_err(Error::wrap)?)
                .take(batch_size)
                .map(|i| self.buffer.get(i).unwrap())
                .collect();
@ -498,11 +498,11 @@ pub fn run() -> Result<()> {
        OuNoise::new(MU, THETA, SIGMA, size_action)?,
    )?;

-    let mut rng = rand::thread_rng();
+    let mut rng = rand::rng();

    for episode in 0..MAX_EPISODES {
        // let mut state = env.reset(episode as u64)?;
-        let mut state = env.reset(rng.gen::<u64>())?;
+        let mut state = env.reset(rng.random::<u64>())?;

        let mut total_reward = 0.0;
        for _ in 0..EPISODE_LENGTH {
@ -538,7 +538,7 @@ pub fn run() -> Result<()> {
    agent.train = false;
    for episode in 0..10 {
        // let mut state = env.reset(episode as u64)?;
-        let mut state = env.reset(rng.gen::<u64>())?;
+        let mut state = env.reset(rng.random::<u64>())?;
        let mut total_reward = 0.0;
        for _ in 0..EPISODE_LENGTH {
            let mut action = 2.0 * agent.actions(&state)?;
--- a/candle-examples/examples/reinforcement-learning/dqn.rs
+++ b/candle-examples/examples/reinforcement-learning/dqn.rs
@ -1,9 +1,8 @@
 use std::collections::VecDeque;

-use rand::distributions::Uniform;
-use rand::{thread_rng, Rng};
+use rand::{distr::Uniform, rng, Rng};

-use candle::{DType, Device, Module, Result, Tensor};
+use candle::{DType, Device, Error, Module, Result, Tensor};
 use candle_nn::loss::mse;
 use candle_nn::{linear, seq, Activation, AdamW, Optimizer, VarBuilder, VarMap};

@ -65,8 +64,8 @@ pub fn run() -> Result<()> {
        // fed to the model so that it performs a backward pass.
        if memory.len() > BATCH_SIZE {
            // Sample randomly from the memory.
-            let batch = thread_rng()
-                .sample_iter(Uniform::from(0..memory.len()))
+            let batch = rng()
+                .sample_iter(Uniform::try_from(0..memory.len()).map_err(Error::wrap)?)
                .take(BATCH_SIZE)
                .map(|i| memory.get(i).unwrap().clone())
                .collect::<Vec<_>>();
--- a/candle-examples/examples/reinforcement-learning/policy_gradient.rs
+++ b/candle-examples/examples/reinforcement-learning/policy_gradient.rs
@ -4,7 +4,7 @@ use candle_nn::{
    linear, ops::log_softmax, ops::softmax, sequential::seq, Activation, AdamW, Optimizer,
    ParamsAdamW, VarBuilder, VarMap,
 };
-use rand::{distributions::Distribution, rngs::ThreadRng, Rng};
+use rand::{distr::Distribution, rngs::ThreadRng, Rng};

 fn new_model(
    input_shape: &[usize],
@ -39,7 +39,7 @@ fn accumulate_rewards(steps: &[Step<i64>]) -> Vec<f64> {
 }

 fn weighted_sample(probs: Vec<f32>, rng: &mut ThreadRng) -> Result<usize> {
-    let distribution = rand::distributions::WeightedIndex::new(probs).map_err(Error::wrap)?;
+    let distribution = rand::distr::weighted::WeightedIndex::new(probs).map_err(Error::wrap)?;
    let mut rng = rng;
    Ok(distribution.sample(&mut rng))
 }
@ -65,10 +65,10 @@ pub fn run() -> Result<()> {

    let mut optimizer = AdamW::new(varmap.all_vars(), optimizer_params)?;

-    let mut rng = rand::thread_rng();
+    let mut rng = rand::rng();

    for epoch_idx in 0..100 {
-        let mut state = env.reset(rng.gen::<u64>())?;
+        let mut state = env.reset(rng.random::<u64>())?;
        let mut steps: Vec<Step<i64>> = vec![];

        loop {
@ -84,7 +84,7 @@ pub fn run() -> Result<()> {
            steps.push(step.copy_with_obs(&state));

            if step.terminated || step.truncated {
-                state = env.reset(rng.gen::<u64>())?;
+                state = env.reset(rng.random::<u64>())?;
                if steps.len() > 5000 {
                    break;
                }
--- a/candle-examples/examples/resnet/README.md
+++ b/candle-examples/examples/resnet/README.md
@ -7,7 +7,7 @@ probabilities for the top-5 classes.
 ## Running an example

 ```
-$ cargo run --example resnet --release -- --image tiger.jpg
+$ cargo run --example resnet --release -- --image candle-examples/examples/yolo-v8/assets/bike.jpg

 loaded image Tensor[dims 3, 224, 224; f32]
 model built
--- a/candle-examples/examples/segformer/README.md
+++ b/candle-examples/examples/segformer/README.md
@ -10,9 +10,11 @@ If you want you can use the example images from this [pull request][pr], downloa

 ```bash
 # run the image classification task
-cargo run --example segformer classify <path-to-image>
+cargo run --example segformer classify candle-examples/examples/yolo-v8/assets/bike.jpg
+
 # run the segmentation task
-cargo run --example segformer segment <path-to-image>
+cargo run --example segformer segment candle-examples/examples/yolo-v8/assets/bike.jpg
+
 ```

 Example output for classification:
--- a/candle-examples/examples/segment-anything/README.md
+++ b/candle-examples/examples/segment-anything/README.md
@ -14,8 +14,8 @@ based on [MobileSAM](https://github.com/ChaoningZhang/MobileSAM).

 ```bash
 cargo run --example segment-anything --release -- \
-    --image candle-examples/examples/yolo-v8/assets/bike.jpg
-    --use-tiny
+    --image candle-examples/examples/yolo-v8/assets/bike.jpg \
+    --use-tiny \
    --point 0.6,0.6 --point 0.6,0.55
 ```

--- a/candle-examples/examples/siglip/README.md
+++ b/candle-examples/examples/siglip/README.md
@ -5,7 +5,7 @@ SigLIP is multi-modal text-vision model that improves over CLIP by using a sigmo

 ### Running an example
 ```
-$ cargo run --features cuda -r --example siglip -
+$ cargo run --features cuda -r --example siglip
 softmax_image_vec: [2.1912122e-14, 2.3624872e-14, 1.0, 1.0, 2.4787932e-8, 3.2784535e-12]


--- a/candle-examples/examples/siglip/main.rs
+++ b/candle-examples/examples/siglip/main.rs
@ -13,11 +13,40 @@ use candle_transformers::models::siglip;

 use tokenizers::Tokenizer;

+#[derive(Clone, Copy, Debug, clap::ValueEnum, PartialEq, Eq)]
+enum Which {
+    #[value(name = "v1-base-patch16-224")]
+    V1BasePatch16_224,
+    #[value(name = "v2-base-patch16-224")]
+    V2BasePatch16_224,
+    #[value(name = "v2-base-patch16-256")]
+    V2BasePatch16_256,
+    #[value(name = "v2-base-patch16-384")]
+    V2BasePatch16_384,
+    #[value(name = "v2-base-patch16-512")]
+    V2BasePatch16_512,
+    #[value(name = "v2-large-patch16-256")]
+    V2LargePatch16_256,
+    #[value(name = "v2-large-patch16-384")]
+    V2LargePatch16_384,
+    #[value(name = "v2-large-patch16-512")]
+    V2LargePatch16_512,
+}
+
 #[derive(Parser)]
 struct Args {
    #[arg(long)]
    model: Option<String>,

+    #[arg(long)]
+    config: Option<String>,
+
+    #[arg(long)]
+    hf_repo: Option<String>,
+
+    #[arg(long, default_value = "v1-base-patch16-224")]
+    which: Which,
+
    #[arg(long)]
    tokenizer: Option<String>,

@ -66,16 +95,37 @@ fn load_images<T: AsRef<std::path::Path>>(

 pub fn main() -> anyhow::Result<()> {
    let args = Args::parse();
+    let hf_repo = match args.hf_repo.as_ref() {
+        Some(hf_repo) => hf_repo,
+        None => match args.which {
+            Which::V1BasePatch16_224 => "google/siglip-base-patch16-224",
+            Which::V2BasePatch16_224 => "google/siglip2-base-patch16-224",
+            Which::V2BasePatch16_256 => "google/siglip2-base-patch16-256",
+            Which::V2BasePatch16_384 => "google/siglip2-base-patch16-384",
+            Which::V2BasePatch16_512 => "google/siglip2-base-patch16-512",
+            Which::V2LargePatch16_256 => "google/siglip2-large-patch16-256",
+            Which::V2LargePatch16_384 => "google/siglip2-large-patch16-384",
+            Which::V2LargePatch16_512 => "google/siglip2-large-patch16-512",
+        },
+    };
    let model_file = match args.model {
        None => {
            let api = hf_hub::api::sync::Api::new()?;
-            let api = api.model("google/siglip-base-patch16-224".to_string());
+            let api = api.model(hf_repo.to_string());
            api.get("model.safetensors")?
        }
        Some(model) => model.into(),
    };
-    let tokenizer = get_tokenizer(args.tokenizer)?;
-    let config = siglip::Config::base_patch16_224();
+    let config_file = match args.config {
+        None => {
+            let api = hf_hub::api::sync::Api::new()?;
+            let api = api.model(hf_repo.to_string());
+            api.get("config.json")?
+        }
+        Some(config) => config.into(),
+    };
+    let tokenizer = get_tokenizer(hf_repo, args.tokenizer)?;
+    let config: siglip::Config = serde_json::from_slice(&std::fs::read(config_file)?)?;
    let device = candle_examples::device(args.cpu)?;
    let vec_imgs = match args.images {
        Some(imgs) => imgs,
@ -114,11 +164,11 @@ pub fn main() -> anyhow::Result<()> {
    Ok(())
 }

-pub fn get_tokenizer(tokenizer: Option<String>) -> anyhow::Result<Tokenizer> {
+pub fn get_tokenizer(hf_repo: &str, tokenizer: Option<String>) -> anyhow::Result<Tokenizer> {
    let tokenizer = match tokenizer {
        None => {
            let api = hf_hub::api::sync::Api::new()?;
-            let api = api.model("google/siglip-base-patch16-224".to_string());
+            let api = api.model(hf_repo.to_string());
            api.get("tokenizer.json")?
        }
        Some(file) => file.into(),
--- a/candle-examples/examples/silero-vad/README.md
+++ b/candle-examples/examples/silero-vad/README.md
@ -6,7 +6,14 @@ This example uses the models available in the hugging face [onnx-community/siler

 ## Running the example

+### using arecord
+
 ```bash
 $ arecord -t raw -f S16_LE -r 16000 -c 1 -d 5 - | cargo run --example silero-vad --release --features onnx -- --sample-rate 16000
 ```

+### using SoX
+
+```bash
+$ rec -t raw -r 48000 -b 16 -c 1 -e signed-integer - trim 0 5 | sox -t raw -r 48000 -b 16 -c 1 -e signed-integer - -t raw -r 16000 -b 16 -c 1 -e signed-integer - | cargo run --example silero-vad --release --features onnx -- --sample-rate 16000
+```
--- a/candle-examples/examples/snac/audio_io.rs
+++ b/candle-examples/examples/snac/audio_io.rs
@ -0,0 +1,275 @@
+use anyhow::{Context, Result};
+use std::sync::{Arc, Mutex};
+
+pub const SAMPLE_RATE: usize = 24_000;
+
+pub(crate) struct AudioOutputData_ {
+    resampled_data: std::collections::VecDeque<f32>,
+    resampler: rubato::FastFixedIn<f32>,
+    output_buffer: Vec<f32>,
+    input_buffer: Vec<f32>,
+    input_len: usize,
+}
+
+impl AudioOutputData_ {
+    pub(crate) fn new(input_sample_rate: usize, output_sample_rate: usize) -> Result<Self> {
+        use rubato::Resampler;
+
+        let resampled_data = std::collections::VecDeque::with_capacity(output_sample_rate * 10);
+        let resample_ratio = output_sample_rate as f64 / input_sample_rate as f64;
+        let resampler = rubato::FastFixedIn::new(
+            resample_ratio,
+            f64::max(resample_ratio, 1.0),
+            rubato::PolynomialDegree::Septic,
+            1024,
+            1,
+        )?;
+        let input_buffer = resampler.input_buffer_allocate(true).remove(0);
+        let output_buffer = resampler.output_buffer_allocate(true).remove(0);
+        Ok(Self {
+            resampled_data,
+            resampler,
+            input_buffer,
+            output_buffer,
+            input_len: 0,
+        })
+    }
+
+    pub fn reset(&mut self) {
+        use rubato::Resampler;
+        self.output_buffer.fill(0.);
+        self.input_buffer.fill(0.);
+        self.resampler.reset();
+        self.resampled_data.clear();
+    }
+
+    pub(crate) fn take_all(&mut self) -> Vec<f32> {
+        let mut data = Vec::with_capacity(self.resampled_data.len());
+        while let Some(elem) = self.resampled_data.pop_back() {
+            data.push(elem);
+        }
+        data
+    }
+
+    pub(crate) fn is_empty(&self) -> bool {
+        self.resampled_data.is_empty()
+    }
+
+    // Assumes that the input buffer is large enough.
+    fn push_input_buffer(&mut self, samples: &[f32]) {
+        self.input_buffer[self.input_len..self.input_len + samples.len()].copy_from_slice(samples);
+        self.input_len += samples.len()
+    }
+
+    pub(crate) fn push_samples(&mut self, samples: &[f32]) -> Result<()> {
+        use rubato::Resampler;
+
+        let mut pos_in = 0;
+        loop {
+            let rem = self.input_buffer.len() - self.input_len;
+            let pos_end = usize::min(pos_in + rem, samples.len());
+            self.push_input_buffer(&samples[pos_in..pos_end]);
+            pos_in = pos_end;
+            if self.input_len < self.input_buffer.len() {
+                break;
+            }
+            let (_, out_len) = self.resampler.process_into_buffer(
+                &[&self.input_buffer],
+                &mut [&mut self.output_buffer],
+                None,
+            )?;
+            for &elem in self.output_buffer[..out_len].iter() {
+                self.resampled_data.push_front(elem)
+            }
+            self.input_len = 0;
+        }
+        Ok(())
+    }
+}
+
+type AudioOutputData = Arc<Mutex<AudioOutputData_>>;
+
+pub(crate) fn setup_output_stream() -> Result<(cpal::Stream, AudioOutputData)> {
+    use cpal::traits::{DeviceTrait, HostTrait, StreamTrait};
+
+    println!("Setup audio output stream!");
+    let host = cpal::default_host();
+    let device = host
+        .default_output_device()
+        .context("no output device available")?;
+    let mut supported_configs_range = device.supported_output_configs()?;
+    let config_range = match supported_configs_range.find(|c| c.channels() == 1) {
+        // On macOS, it's commonly the case that there are only stereo outputs.
+        None => device
+            .supported_output_configs()?
+            .next()
+            .context("no audio output available")?,
+        Some(config_range) => config_range,
+    };
+    let sample_rate = cpal::SampleRate(SAMPLE_RATE as u32).clamp(
+        config_range.min_sample_rate(),
+        config_range.max_sample_rate(),
+    );
+    let config: cpal::StreamConfig = config_range.with_sample_rate(sample_rate).into();
+    let channels = config.channels as usize;
+    println!(
+        "cpal device: {} {} {config:?}",
+        device.name().unwrap_or_else(|_| "unk".to_string()),
+        config.sample_rate.0
+    );
+    let audio_data = Arc::new(Mutex::new(AudioOutputData_::new(
+        SAMPLE_RATE,
+        config.sample_rate.0 as usize,
+    )?));
+    let ad = audio_data.clone();
+    let stream = device.build_output_stream(
+        &config,
+        move |data: &mut [f32], _: &cpal::OutputCallbackInfo| {
+            data.fill(0.);
+            let mut ad = ad.lock().unwrap();
+            let mut last_elem = 0f32;
+            for (idx, elem) in data.iter_mut().enumerate() {
+                if idx % channels == 0 {
+                    match ad.resampled_data.pop_back() {
+                        None => break,
+                        Some(v) => {
+                            last_elem = v;
+                            *elem = v
+                        }
+                    }
+                } else {
+                    *elem = last_elem
+                }
+            }
+        },
+        move |err| eprintln!("cpal error: {err}"),
+        None, // None=blocking, Some(Duration)=timeout
+    )?;
+    stream.play()?;
+    Ok((stream, audio_data))
+}
+
+pub(crate) fn setup_input_stream() -> Result<(cpal::Stream, AudioOutputData)> {
+    use cpal::traits::{DeviceTrait, HostTrait, StreamTrait};
+
+    println!("Setup audio input stream!");
+    let host = cpal::default_host();
+    let device = host
+        .default_input_device()
+        .context("no input device available")?;
+    let mut supported_configs_range = device.supported_input_configs()?;
+    let config_range = supported_configs_range
+        .find(|c| c.channels() == 1)
+        .context("no audio input available")?;
+    let sample_rate = cpal::SampleRate(SAMPLE_RATE as u32).clamp(
+        config_range.min_sample_rate(),
+        config_range.max_sample_rate(),
+    );
+    let config: cpal::StreamConfig = config_range.with_sample_rate(sample_rate).into();
+    println!(
+        "cpal device: {} {} {config:?}",
+        device.name().unwrap_or_else(|_| "unk".to_string()),
+        config.sample_rate.0
+    );
+    let audio_data = Arc::new(Mutex::new(AudioOutputData_::new(
+        config.sample_rate.0 as usize,
+        SAMPLE_RATE,
+    )?));
+    let ad = audio_data.clone();
+    let stream = device.build_input_stream(
+        &config,
+        move |data: &[f32], _: &cpal::InputCallbackInfo| {
+            let mut ad = ad.lock().unwrap();
+            if let Err(err) = ad.push_samples(data) {
+                eprintln!("error processing audio input {err:?}")
+            }
+        },
+        move |err| eprintln!("cpal error: {err}"),
+        None, // None=blocking, Some(Duration)=timeout
+    )?;
+    stream.play()?;
+    Ok((stream, audio_data))
+}
+
+fn conv<T>(samples: &mut Vec<f32>, data: std::borrow::Cow<symphonia::core::audio::AudioBuffer<T>>)
+where
+    T: symphonia::core::sample::Sample,
+    f32: symphonia::core::conv::FromSample<T>,
+{
+    use symphonia::core::audio::Signal;
+    use symphonia::core::conv::FromSample;
+    samples.extend(data.chan(0).iter().map(|v| f32::from_sample(*v)))
+}
+
+pub(crate) fn pcm_decode<P: AsRef<std::path::Path>>(path: P) -> Result<(Vec<f32>, u32)> {
+    use symphonia::core::audio::{AudioBufferRef, Signal};
+
+    let src = std::fs::File::open(path)?;
+    let mss = symphonia::core::io::MediaSourceStream::new(Box::new(src), Default::default());
+    let hint = symphonia::core::probe::Hint::new();
+    let meta_opts: symphonia::core::meta::MetadataOptions = Default::default();
+    let fmt_opts: symphonia::core::formats::FormatOptions = Default::default();
+    let probed = symphonia::default::get_probe().format(&hint, mss, &fmt_opts, &meta_opts)?;
+    let mut format = probed.format;
+    let track = format
+        .tracks()
+        .iter()
+        .find(|t| t.codec_params.codec != symphonia::core::codecs::CODEC_TYPE_NULL)
+        .expect("no supported audio tracks");
+    let mut decoder = symphonia::default::get_codecs()
+        .make(&track.codec_params, &Default::default())
+        .expect("unsupported codec");
+    let track_id = track.id;
+    let sample_rate = track.codec_params.sample_rate.unwrap_or(0);
+    let mut pcm_data = Vec::new();
+    while let Ok(packet) = format.next_packet() {
+        while !format.metadata().is_latest() {
+            format.metadata().pop();
+        }
+        if packet.track_id() != track_id {
+            continue;
+        }
+        match decoder.decode(&packet)? {
+            AudioBufferRef::F32(buf) => pcm_data.extend(buf.chan(0)),
+            AudioBufferRef::U8(data) => conv(&mut pcm_data, data),
+            AudioBufferRef::U16(data) => conv(&mut pcm_data, data),
+            AudioBufferRef::U24(data) => conv(&mut pcm_data, data),
+            AudioBufferRef::U32(data) => conv(&mut pcm_data, data),
+            AudioBufferRef::S8(data) => conv(&mut pcm_data, data),
+            AudioBufferRef::S16(data) => conv(&mut pcm_data, data),
+            AudioBufferRef::S24(data) => conv(&mut pcm_data, data),
+            AudioBufferRef::S32(data) => conv(&mut pcm_data, data),
+            AudioBufferRef::F64(data) => conv(&mut pcm_data, data),
+        }
+    }
+    Ok((pcm_data, sample_rate))
+}
+
+pub(crate) fn resample(pcm_in: &[f32], sr_in: u32, sr_out: u32) -> Result<Vec<f32>> {
+    use rubato::Resampler;
+
+    let mut pcm_out =
+        Vec::with_capacity((pcm_in.len() as f64 * sr_out as f64 / sr_in as f64) as usize + 1024);
+
+    let mut resampler =
+        rubato::FftFixedInOut::<f32>::new(sr_in as usize, sr_out as usize, 1024, 1)?;
+    let mut output_buffer = resampler.output_buffer_allocate(true);
+    let mut pos_in = 0;
+    while pos_in + resampler.input_frames_next() < pcm_in.len() {
+        let (in_len, out_len) =
+            resampler.process_into_buffer(&[&pcm_in[pos_in..]], &mut output_buffer, None)?;
+        pos_in += in_len;
+        pcm_out.extend_from_slice(&output_buffer[0][..out_len]);
+    }
+
+    if pos_in < pcm_in.len() {
+        let (_in_len, out_len) = resampler.process_partial_into_buffer(
+            Some(&[&pcm_in[pos_in..]]),
+            &mut output_buffer,
+            None,
+        )?;
+        pcm_out.extend_from_slice(&output_buffer[0][..out_len]);
+    }
+
+    Ok(pcm_out)
+}
--- a/candle-examples/examples/snac/main.rs
+++ b/candle-examples/examples/snac/main.rs
@ -0,0 +1,197 @@
+#[cfg(feature = "mkl")]
+extern crate intel_mkl_src;
+
+#[cfg(feature = "accelerate")]
+extern crate accelerate_src;
+
+use anyhow::Result;
+use candle::{DType, IndexOp, Tensor};
+use candle_nn::VarBuilder;
+use candle_transformers::models::snac::{Config, Model};
+use clap::{Parser, ValueEnum};
+use hf_hub::api::sync::Api;
+
+mod audio_io;
+
+#[derive(Clone, Debug, Copy, PartialEq, Eq, ValueEnum)]
+enum Action {
+    AudioToAudio,
+    AudioToCode,
+    CodeToAudio,
+}
+
+#[derive(Clone, Debug, Copy, PartialEq, Eq, clap::ValueEnum)]
+enum Which {
+    #[value(name = "24khz")]
+    S24khz,
+    #[value(name = "32khz")]
+    S32khz,
+    #[value(name = "44khz")]
+    S44khz,
+}
+
+impl Which {
+    fn sample_rate(&self) -> u32 {
+        match self {
+            Which::S24khz => 24000,
+            Which::S32khz => 32000,
+            Which::S44khz => 44000,
+        }
+    }
+
+    fn config_repo(&self) -> &'static str {
+        match self {
+            Which::S24khz => "hubertsiuzdak/snac_24khz",
+            Which::S32khz => "hubertsiuzdak/snac_32khz",
+            Which::S44khz => "hubertsiuzdak/snac_44khz",
+        }
+    }
+
+    fn model_file(&self) -> &'static str {
+        match self {
+            Which::S24khz => "snac_24khz.safetensors",
+            Which::S32khz => "snac_32khz.safetensors",
+            Which::S44khz => "snac_44khz.safetensors",
+        }
+    }
+}
+
+#[derive(Parser, Debug)]
+#[command(author, version, about, long_about = None)]
+struct Args {
+    /// The action to be performed, specifies the format for the input and output data.
+    action: Action,
+
+    /// The input file, either an audio file or some snac tokens stored as safetensors.
+    in_file: String,
+
+    /// The output file, either a wave audio file or some snac tokens stored as safetensors.
+    out_file: String,
+
+    /// The model size to use.
+    #[arg(long, default_value = "24khz")]
+    which: Which,
+
+    /// Run on CPU rather than on GPU.
+    #[arg(long)]
+    cpu: bool,
+
+    /// The model weight file, in safetensor format.
+    #[arg(long)]
+    model: Option<String>,
+
+    /// The config file, in safetensor format.
+    #[arg(long)]
+    config: Option<String>,
+}
+
+fn main() -> Result<()> {
+    let args = Args::parse();
+    let device = candle_examples::device(args.cpu)?;
+    let model_sample_rate = args.which.sample_rate();
+    let config = match args.config {
+        Some(c) => std::path::PathBuf::from(c),
+        None => Api::new()?
+            .model(args.which.config_repo().to_string())
+            .get("config.json")?,
+    };
+    let config: Config = serde_json::from_slice(&std::fs::read(config)?)?;
+    let model = match args.model {
+        Some(model) => std::path::PathBuf::from(model),
+        None => Api::new()?
+            .model("lmz/candle-snac".to_string())
+            .get(args.which.model_file())?,
+    };
+    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&[model], DType::F32, &device)? };
+    let model = Model::new(&config, vb)?;
+
+    let codes = match args.action {
+        Action::CodeToAudio => {
+            let codes = candle::safetensors::load(args.in_file, &device)?;
+            let num_codebooks = model.num_codebooks();
+            (0..num_codebooks)
+                .map(|i| {
+                    codes
+                        .get(&format!("codes-{i}"))
+                        .expect("no codes in input file")
+                        .clone()
+                })
+                .collect::<Vec<_>>()
+        }
+        Action::AudioToCode | Action::AudioToAudio => {
+            let pcm = if args.in_file == "-" {
+                println!(">>>> RECORDING AUDIO, PRESS ENTER ONCE DONE <<<<");
+                let (stream, input_audio) = audio_io::setup_input_stream()?;
+                let mut pcms = vec![];
+                let stdin = std::thread::spawn(|| {
+                    let mut s = String::new();
+                    std::io::stdin().read_line(&mut s)
+                });
+                while !stdin.is_finished() {
+                    let input = input_audio.lock().unwrap().take_all();
+                    if input.is_empty() {
+                        std::thread::sleep(std::time::Duration::from_millis(100));
+                        continue;
+                    }
+                    pcms.push(input)
+                }
+                drop(stream);
+                pcms.concat()
+            } else {
+                let (pcm, sample_rate) = audio_io::pcm_decode(args.in_file)?;
+                if sample_rate != model_sample_rate {
+                    println!("WARNING: snac uses a {model_sample_rate} sample rate, input uses {sample_rate}, resampling...");
+                    audio_io::resample(&pcm, sample_rate, model_sample_rate)?
+                } else {
+                    pcm
+                }
+            };
+            let pcm_len = pcm.len();
+            let pcm = Tensor::from_vec(pcm, (1, 1, pcm_len), &device)?;
+            println!("input pcm shape: {:?}", pcm.shape());
+            model.encode(&pcm)?
+        }
+    };
+    for codes in codes.iter() {
+        println!("codes shape: {:?}", codes.shape());
+    }
+
+    match args.action {
+        Action::AudioToCode => {
+            let mut tensors = std::collections::HashMap::new();
+            for (i, codes) in codes.iter().enumerate() {
+                tensors.insert(format!("codes-{i}"), codes.clone());
+            }
+            candle::safetensors::save(&tensors, "codes.safetensors")?;
+        }
+        Action::AudioToAudio | Action::CodeToAudio => {
+            let codes = codes.iter().collect::<Vec<_>>();
+            let pcm = model.decode(&codes)?;
+            println!("output pcm shape: {:?}", pcm.shape());
+            let pcm = pcm.i(0)?.i(0)?;
+            let pcm = candle_examples::audio::normalize_loudness(&pcm, model_sample_rate, true)?;
+            let pcm = pcm.to_vec1::<f32>()?;
+            if args.out_file == "-" {
+                let (stream, ad) = audio_io::setup_output_stream()?;
+                {
+                    let mut ad = ad.lock().unwrap();
+                    ad.push_samples(&pcm)?;
+                }
+                loop {
+                    let ad = ad.lock().unwrap();
+                    if ad.is_empty() {
+                        break;
+                    }
+                    // That's very weird, calling thread::sleep here triggers the stream to stop
+                    // playing (the callback doesn't seem to be called anymore).
+                    // std::thread::sleep(std::time::Duration::from_millis(100));
+                }
+                drop(stream)
+            } else {
+                let mut output = std::fs::File::create(&args.out_file)?;
+                candle_examples::wav::write_pcm_as_wav(&mut output, &pcm, model_sample_rate)?;
+            }
+        }
+    }
+    Ok(())
+}
--- a/candle-examples/examples/stable-diffusion/main.rs
+++ b/candle-examples/examples/stable-diffusion/main.rs
@ -617,7 +617,7 @@ fn run(args: Args) -> Result<()> {
    let mut scheduler = sd_config.build_scheduler(n_steps)?;
    let device = candle_examples::device(cpu)?;
    // If a seed is not given, generate a random seed and print it
-    let seed = seed.unwrap_or(rand::thread_rng().gen_range(0u64..u64::MAX));
+    let seed = seed.unwrap_or(rand::rng().random_range(0u64..u64::MAX));
    println!("Using seed {seed}");
    device.set_seed(seed)?;
    let use_guide_scale = guidance_scale > 1.0;
--- a/candle-examples/examples/starcoder2/README.md
+++ b/candle-examples/examples/starcoder2/README.md
@ -0,0 +1,15 @@
+# candle-starcoder2
+
+Candle implementation of Star Coder 2 family of code generation model from [StarCoder 2 and The Stack v2: The Next Generation](https://arxiv.org/pdf/2402.19173).
+
+## Running an example
+
+```bash
+$ cargo run --example starcoder2 -- --prompt "write a recursive fibonacci function in python "
+
+> # that returns the nth number in the sequence.
+> 
+> def fib(n):
+>     if n
+
+```
--- a/candle-examples/examples/stella-en-v5/README.md
+++ b/candle-examples/examples/stella-en-v5/README.md
@ -10,7 +10,7 @@ Stella_en_1.5B_v5 is used to generate text embeddings embeddings for a prompt. T
 are downloaded from the hub on the first run.

 ```bash
-$ cargo run --example stella-en-v5 --release  -- --query "What are safetensors?"
+$ cargo run --example stella-en-v5 --release  -- --query "What are safetensors?" --which 1.5b

 > [[ 0.3905, -0.0130,  0.2072, ..., -0.1100, -0.0086,  0.6002]]
 >  Tensor[[1, 1024], f32]
--- a/candle-examples/examples/t5/README.md
+++ b/candle-examples/examples/t5/README.md
@ -1,5 +1,7 @@
 # candle-t5

+Candle implementations of the T5 family of translation models.
+
 ## Encoder-decoder example:

 ```bash
--- a/candle-examples/examples/vgg/README.md
+++ b/candle-examples/examples/vgg/README.md
@ -7,7 +7,7 @@ The VGG models are defined in `candle-transformers/src/models/vgg.rs`. The main
 You can run the example with the following command:

 ```bash
-cargo run --example vgg --release -- --image ../yolo-v8/assets/bike.jpg --which vgg13
+cargo run --example vgg --release -- --image candle-examples/examples/yolo-v8/assets/bike.jpg --which vgg13
 ```

 In the command above, `--image` specifies the path to the image file and `--which` specifies the VGG model to use (vgg13, vgg16, or vgg19).
--- a/candle-examples/examples/vit/README.md
+++ b/candle-examples/examples/vit/README.md
@ -7,8 +7,8 @@ probabilities for the top-5 classes.

 ## Running an example

-```
-$ cargo run --example vit --release -- --image tiger.jpg
+```bash
+$ cargo run --example vit --release -- --image candle-examples/examples/yolo-v8/assets/bike.jpg

 loaded image Tensor[dims 3, 224, 224; f32]
 model built
--- a/candle-examples/examples/whisper-microphone/README.md
+++ b/candle-examples/examples/whisper-microphone/README.md
@ -0,0 +1,15 @@
+# candle-whisper-microphone
+
+Whisper implementation using microphone as input.
+
+## Running an example
+
+```bash
+$ cargo run --example whisper-microphone --features microphone
+
+> transcribing audio...
+> 480256 160083
+> language_token: None
+> 0.0s -- 30.0s:  Hello, hello, I don't know if this is working, but You know, how long did I make this?
+> 480256 160085
+```
--- a/candle-examples/examples/whisper-microphone/main.rs
+++ b/candle-examples/examples/whisper-microphone/main.rs
@ -9,7 +9,7 @@ use candle::{Device, IndexOp, Tensor};
 use candle_nn::{ops::softmax, VarBuilder};
 use clap::{Parser, ValueEnum};
 use hf_hub::{api::sync::Api, Repo, RepoType};
-use rand::{distributions::Distribution, SeedableRng};
+use rand::{distr::Distribution, SeedableRng};
 use tokenizers::Tokenizer;

 mod multilingual;
@ -204,7 +204,7 @@ impl Decoder {
            let next_token = if t > 0f64 {
                let prs = softmax(&(&logits / t)?, 0)?;
                let logits_v: Vec<f32> = prs.to_vec1()?;
-                let distr = rand::distributions::WeightedIndex::new(&logits_v)?;
+                let distr = rand::distr::weighted::WeightedIndex::new(&logits_v)?;
                distr.sample(&mut self.rng) as u32
            } else {
                let logits_v: Vec<f32> = logits.to_vec1()?;
--- a/candle-examples/examples/whisper/main.rs
+++ b/candle-examples/examples/whisper/main.rs
@ -14,7 +14,9 @@ use candle::{Device, IndexOp, Tensor};
 use candle_nn::{ops::softmax, VarBuilder};
 use clap::{Parser, ValueEnum};
 use hf_hub::{api::sync::Api, Repo, RepoType};
-use rand::{distributions::Distribution, SeedableRng};
+use rand::distr::weighted::WeightedIndex;
+use rand::distr::Distribution;
+use rand::SeedableRng;
 use tokenizers::Tokenizer;

 mod multilingual;
@ -208,7 +210,7 @@ impl Decoder {
            let next_token = if t > 0f64 {
                let prs = softmax(&(&logits / t)?, 0)?;
                let logits_v: Vec<f32> = prs.to_vec1()?;
-                let distr = rand::distributions::WeightedIndex::new(&logits_v)?;
+                let distr = WeightedIndex::new(&logits_v)?;
                distr.sample(&mut self.rng) as u32
            } else {
                let logits_v: Vec<f32> = logits.to_vec1()?;
--- a/candle-examples/examples/yi/README.md
+++ b/candle-examples/examples/yi/README.md
@ -0,0 +1,13 @@
+# candle-yi
+
+Candle implentations of the Yi family of bilingual (English, Chinese) LLMs.
+
+## Running an example
+
+```bash
+$ cargo run --example yi -- --prompt "Here is a test sentence"
+
+> python
+> print("Hello World")
+> 
+```
--- a/candle-examples/examples/yolo-v3/README.md
+++ b/candle-examples/examples/yolo-v3/README.md
@ -0,0 +1,32 @@
+# candle-yolo-v3:
+
+Candle implementation of Yolo-V3 for object detection.
+
+## Running an example
+
+```bash
+$ cargo run --example yolo-v3 --release -- candle-examples/examples/yolo-v8/assets/bike.jpg
+
+> generated predictions Tensor[dims 10647, 85; f32]
+> person: Bbox { xmin: 46.362198, ymin: 72.177, xmax: 135.92522, ymax: 339.8356, confidence: 0.99705493, data: () }
+> person: Bbox { xmin: 137.25645, ymin: 67.58148, xmax: 216.90437, ymax: 333.80756, confidence: 0.9898516, data: () }
+> person: Bbox { xmin: 245.7842, ymin: 82.76726, xmax: 316.79053, ymax: 337.21613, confidence: 0.9884322, data: () }
+> person: Bbox { xmin: 207.52783, ymin: 61.815224, xmax: 266.77884, ymax: 307.92606, confidence: 0.9860648, data: () }
+> person: Bbox { xmin: 11.457404, ymin: 60.335564, xmax: 34.39357, ymax: 187.7714, confidence: 0.9545012, data: () }
+> person: Bbox { xmin: 251.88353, ymin: 11.235481, xmax: 286.56607, ymax: 92.54697, confidence: 0.8439807, data: () }
+> person: Bbox { xmin: -0.44309902, ymin: 55.486923, xmax: 13.160354, ymax: 184.09705, confidence: 0.8266243, data: () }
+> person: Bbox { xmin: 317.40826, ymin: 55.39501, xmax: 370.6704, ymax: 153.74887, confidence: 0.7327442, data: () }
+> person: Bbox { xmin: 370.02835, ymin: 66.120224, xmax: 404.22824, ymax: 142.09691, confidence: 0.7265741, data: () }
+> person: Bbox { xmin: 250.36511, ymin: 57.349842, xmax: 280.06335, ymax: 116.29384, confidence: 0.709422, data: () }
+> person: Bbox { xmin: 32.573215, ymin: 66.66239, xmax: 50.49056, ymax: 173.42068, confidence: 0.6998766, data: () }
+> person: Bbox { xmin: 131.72215, ymin: 63.946213, xmax: 166.66151, ymax: 241.52773, confidence: 0.64457536, data: () }
+> person: Bbox { xmin: 407.42416, ymin: 49.106407, xmax: 415.24307, ymax: 84.7134, confidence: 0.5955802, data: () }
+> person: Bbox { xmin: 51.650482, ymin: 64.4985, xmax: 67.40904, ymax: 106.952385, confidence: 0.5196007, data: () }
+> bicycle: Bbox { xmin: 160.10031, ymin: 183.90837, xmax: 200.86832, ymax: 398.609, confidence: 0.9623588, data: () }
+> bicycle: Bbox { xmin: 66.570915, ymin: 192.56966, xmax: 112.06765, ymax: 369.28497, confidence: 0.9174347, data: () }
+> bicycle: Bbox { xmin: 258.2856, ymin: 197.04532, xmax: 298.43106, ymax: 364.8627, confidence: 0.6851388, data: () }
+> bicycle: Bbox { xmin: 214.0034, ymin: 175.76498, xmax: 252.45158, ymax: 356.53818, confidence: 0.67071193, data: () }
+> motorbike: Bbox { xmin: 318.23938, ymin: 95.22487, xmax: 369.9743, ymax: 213.46263, confidence: 0.96691036, data: () }
+> motorbike: Bbox { xmin: 367.46417, ymin: 100.07982, xmax: 394.9981, ymax: 174.6545, confidence: 0.9185384, data: () }
+> writing "candle-examples/examples/yolo-v8/assets/bike.pp.jpg"
+```
--- a/candle-examples/examples/yolo-v3/darknet.rs
+++ b/candle-examples/examples/yolo-v3/darknet.rs
@ -133,6 +133,7 @@ fn conv(vb: VarBuilder, index: usize, p: usize, b: &Block) -> Result<(usize, Bl)
        padding,
        groups: 1,
        dilation: 1,
+        cudnn_fwd_algo: None,
    };
    let conv = if bias {
        conv2d(p, filters, size, conv_cfg, vb.pp(format!("conv_{index}")))?
--- a/candle-examples/examples/yolo-v8/model.rs
+++ b/candle-examples/examples/yolo-v8/model.rs
@ -92,6 +92,7 @@ impl ConvBlock {
            stride,
            groups: 1,
            dilation: 1,
+            cudnn_fwd_algo: None,
        };
        let bn = batch_norm(c2, 1e-3, vb.pp("bn"))?;
        let conv = conv2d_no_bias(c1, c2, k, cfg, vb.pp("conv"))?.absorb_bn(&bn)?;
--- a/Show More
+++ b/Show More
Author	SHA1	Message	Date
Laurent Mazare	7f0f83a7c1	Rotating kv cache positions (#2901 ) * Retrieve the current positions for rotating KV caches. * Add the function to the kv cache too. * More testing.	2025-04-15 23:09:26 +02:00
Kyle Birnbaum	76e565c4ab	Updated candle-book: Introduction, Installation, MNIST guide, and added CONTRIBUTING.md (#2897 ) * added CONTRIBUTING.md to candle-book * added description to candle-book introduction * Updated formatting and added different features to candle-book installation * mnist guide first draft candle-book * updated mnist guide syntax and grammar for candle-book * changed HelloWorld - Mnist to Tutorial - Mnist in SUMMARY.md * updated intro to mnist guide in candle-book	2025-04-15 21:41:10 +02:00
Laurent Mazare	e4e7b0b2da	Use cudarc 0.16. (#2900 ) * Use cudarc 0.16. * Allow for disabling event tracking. * Tweaks. * Bump the ug version. * And bump the candle version too.	2025-04-15 21:40:18 +02:00
Laurent Mazare	b01ebbad8a	Use cudarc 0.15.2. (#2896 )	2025-04-14 20:47:52 +02:00
Laurent Mazare	1d1d6d4fe6	Bump the crate version. (#2895 )	2025-04-14 15:52:11 +02:00
Laurent Mazare	2653002f29	Gumbel-Softmax sampling. (#2894 ) * Gumbel-Softmax sampling. * Add a sampling test. * Share the gumbel-softmax bits.	2025-04-14 15:42:42 +02:00
Laurent Mazare	a52b76ae82	Expose the cudnn algo in the conv ops. (#2892 ) * Set the algo. * Expose the cudnn preferred algo for conv ops.	2025-04-14 08:25:32 +02:00
Laurent Mazare	fb660b8d43	Add a cudnn feature to candle-nn/candle-transformers. (#2890 )	2025-04-13 17:43:41 +02:00
Laurent Mazare	2f9606b187	Exclude candle-book to avoid some CI failures. (#2889 ) * Exclude candle-book to avoid some CI failures. * Remove the book CIs.	2025-04-13 17:11:41 +02:00
Laurent Mazare	f3a73f80d1	Support for cudnn conv1d. (#2888 ) * Support for cudnn conv1d. * More conv1d work. * Get the conv1d to work with cudnn. * Cleanup.	2025-04-13 16:47:37 +02:00
Laurent Mazare	b44d38de0e	Add the Orpheus TTS. (#2886 ) * Add the Orpheus TTS. * Add a small readme. * Token fix. * Support more voices. * Clippy fixes.	2025-04-13 12:02:17 +02:00
Laurent Mazare	d9198deb37	Im2col cuda optimization. (#2885 )	2025-04-13 10:07:53 +02:00
Laurent Mazare	15ed0b11ce	Optimize the batched matmul for the cpu backend. (#2884 )	2025-04-12 21:40:40 +02:00
Laurent Mazare	34505fdf3a	Avoid using batched-matmul in nn::Linear. (#2883 ) * Avoid using batched-matmul in nn::Linear. * Also avoid batched matmul in conv1d. * Also tweak the conv2d. * Batched tests. * Also cover conv2d.	2025-04-12 19:53:58 +02:00
Laurent Mazare	d7b7ce16e4	Upgrade ug. (#2882 )	2025-04-12 13:19:32 +02:00
Laurent Mazare	19fb6dac1f	Bump the crate version. (#2881 )	2025-04-11 22:28:21 +02:00
Laurent Mazare	acc5bd335f	Cuda cleanup. (#2880 ) * Cuda cleanup. * More fixes.	2025-04-11 21:43:35 +02:00
Kyle Birnbaum	eb478ece92	Implementing DistilBertForMaskedLM. (#2866 ) * Initial commit: model weights working, prediciton incorrect * moved distilbertformaskedlm into distilbert modeling file * made maskedLM like bert example, still incorrect predictions * finally not getting NaNs, fixed attention mask * getting correct output sentences * get top k predictions * fixed output formatting slightly * added default arg for model_id * lint * moved masked token example code from distilbertformaskedlm example to distilbert example * lint * removed distilbertformaskedlm example * cleanup * clippy * removed embedding normalization from example * made output and model dependent on args instead of prompt * lint * replaced or_ok anyhow error with anyhow context * changed error message for mask token not found	2025-04-11 13:25:39 +02:00
Manpreet Singh	d339b01726	Fix hardcoded f32 dtype for attention_mask. Use the model dtype for compatibility. (#2872 )	2025-04-08 06:12:14 +02:00
Laurent Mazare	2f3bf42bcb	Support more snac variants. (#2871 )	2025-04-07 08:23:47 +02:00
Laurent Mazare	e3370c6316	Add the SNAC audio tokenizer. (#2869 ) * Add the SNAC audio tokenizer. * More snac. * Again more snac. * Add some example code for snac. * Get the weights to load. * Add to the snac model. * Fixes. * Get round-tripping to work. * Save/load code files. * Clippy fix. * Fmt fix.	2025-04-06 22:15:36 +02:00
Laurent Mazare	338f6a102e	Clippy 1.86 fixes for cuda. (#2868 )	2025-04-05 15:45:35 +02:00
Laurent Mazare	bc33df77e1	Add the missing voices for CSM. (#2867 )	2025-04-05 06:52:36 +02:00
Laurent Mazare	cf9d7bf24c	Add the CSM model. (#2862 ) * Add the CSM model. * Add some code to load the model. * Load the text tokenizer. * Add frame generation. * Get the sampling to work. * Rope fix. * Autoregressive generation. * Generate some audio file. * Use the actual prompt. * Support multiple turns. * Add a very barebone readme. * Move some of the shared bits to the model.	2025-04-04 06:48:03 +02:00
Laurent Mazare	9d31361c4f	Fix for clippy 1.86. (#2864 ) * Fix for clippy 1.86. * More clippy fixes. * More fixes.	2025-04-03 19:38:27 +02:00
Kyle Birnbaum	648596c073	Added readmes to examples (#2835 ) * added chatGLM readme * changed wording in readme * added readme for chinese-clip * added readme for convmixer * added readme for custom ops * added readme for efficientnet * added readme for llama * added readme to mnist-training * added readme to musicgen * added readme to quantized-phi * added readme to starcoder2 * added readme to whisper-microphone * added readme to yi * added readme to yolo-v3 * added readme to whisper-microphone * added space to example in glm4 readme * fixed mamba example readme to run mamba instead of mamba-minimal * removed slash escape character * changed moondream image to yolo-v8 example image * added procedure for making the reinforcement-learning example work with a virtual environment on my machine * added simple one line summaries to the example readmes without * changed non-existant image to yolo example's bike.jpg * added backslash to sam command * removed trailing - from siglip * added SoX to silero-vad example readme * replaced procedure for uv on mac with warning that uv isn't currently compatible with pyo3 * added example to falcon readme * added --which arg to stella-en-v5 readme * fixed image path in vgg readme * fixed the image path in the vit readme * Update README.md * Update README.md * Update README.md --------- Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>	2025-04-03 09:18:29 +02:00
Laurent Mazare	d9904a3baf	Update to cudarc 0.14 (breaking change). (#2858 ) * Start updating to cudarc 0.14. * Adapt a couple more things. * And a couple more fixes. * More tweaks. * And a couple more fixes. * Bump the major version number. * Proper module system for the cuda kernels. * Proper ptx loading. * Launch the sort kernel. * Custom op. * Start using the builder pattern. * More builder. * More builder. * Get candle-core to compile. * Get the tests to pass. * Get candle-nn to work too. * Support for custom cuda functions. * cudnn fixes. * Get flash attn to run. * Switch the crate versions to be alpha. * Bump the ug dependency.	2025-04-03 09:12:19 +02:00
Kyle Birnbaum	d6db305829	Added new language pairs to marian-mt example. (#2860 ) * added new language pairs to marian-mt * lint * seperated python code for converting tokenizers into its own file and and added a reqirements.txt for dependencies, updated instructions in readme and included python version * Cleanup. --------- Co-authored-by: Laurent <laurent.mazare@gmail.com>	2025-04-02 23:50:14 +02:00
Zack Angelo	b4daa03e59	add as_cuda_slice_mut to CudaStorage and CudaDType (#2859 )	2025-04-01 19:34:52 +02:00
Bryan Lee	9541467d6b	Add `flip` to `tensor` (#2855 ) * Add `flip` to `tensor` * Move the tests to the proper places. --------- Co-authored-by: laurent <laurent.mazare@gmail.com>	2025-04-01 09:07:16 +02:00
Kyle Birnbaum	6429609090	Added Deepseekr1 Llama8b variant to quantized example (#2842 ) * added deepseekr1 llama8b variant to quantized example * lint	2025-03-30 10:55:21 +02:00
Kyle Birnbaum	ba473290da	Added DeepseekR1 Qwen7B variant to quantized-qwen2-instruct example (#2843 ) * quantized deepseek qwen generating tokens * removed is_deepseek from Args and replaced prompt if statement with pattern matching	2025-03-30 10:54:22 +02:00
Bryan Lee	59c26195db	Fix CIFAR10 dataset types and dimension ordering (#2845 )	2025-03-30 10:53:25 +02:00
LongYinan	cb02b389d5	Fix reinforcement learning example (#2837 )	2025-03-26 16:27:45 +01:00
Kyle Birnbaum	0d4097031c	fixed rand import for mnist-training (#2833 )	2025-03-26 08:10:03 +01:00
Kyle Birnbaum	10853b803c	fixed rand imports for whisper-microphone example (#2834 )	2025-03-26 08:09:27 +01:00
xkeyC	f3d472952f	fix: `candle-flash-attn` linux and `msvc` build (#2829 ) * fix: candle-flash-attn linux and msvc build * Missing newline at eof. --------- Co-authored-by: laurent <laurent.mazare@gmail.com>	2025-03-25 08:45:12 +01:00
Christian Balcom	67b85f79f1	Pickle decoder fix and Long1 opcode addition. (#2824 ) * Pickle decoder changes: added Long1 opcode, fixed tensor offset calculation * Apply rustfmt. --------- Co-authored-by: Laurent <laurent.mazare@gmail.com>	2025-03-23 08:10:08 +01:00
Benjamin Beurdouche	0b24f7f0a4	Fix for whisper example. rand::distribution is now rand::distr (#2811 )	2025-03-16 19:14:55 +01:00
Laurent Mazare	3afb04925a	Allow for growing the default KV cache when needed. (#2810 )	2025-03-16 17:30:25 +01:00
André Cipriani Bandarra	cbf5fc80c2	Add Gemma 3 1b IT toe Gemma examples (#2809 ) - Updates the Gemma example to include Gemma 3 1b instruction tuned.	2025-03-16 17:00:48 +01:00
Laurent Mazare	468d1d525f	Bump the crate version to 0.8.4. (#2808 )	2025-03-15 07:42:24 +01:00
Mike Seddon	c930ab7e1a	upgrade half library to fix rand (#2806 ) fix lints	2025-03-14 09:01:54 +01:00
Laurent Mazare	111edbc4ea	Gemma 3 initial setup (text only). (#2802 ) * Gemma 3 initial setup (text only). * Use the rotating kv cache for the sliding window.	2025-03-14 07:56:02 +01:00
Laurent Mazare	e286cf7cc9	Parse the json config for siglip models. (#2800 ) * Parse the json config for siglip models. * Bump the tokenizers dependency. * Add a v2 model. * Support more v2 model.s	2025-03-09 14:01:09 +01:00
Mikhail Panfilov	e4ffb85228	Add ModernBert sentency classifier (#2796 )	2025-03-08 14:48:22 +01:00
Andrew Wason	37db86ff79	Allow ModernBert to be used to generate embeddings. (#2791 )	2025-03-03 12:39:04 +01:00
Jani Monoses	add3a714aa	phi-4-mini (#2790 )	2025-03-01 10:07:29 +01:00
Liang-Chi Hsieh	26c16923b9	Make sorted_nodes pub function (#2780 )	2025-02-22 10:23:45 +01:00
Laurent Mazare	9e8bf70333	Avoid some clippy lints on 1.85. (#2778 ) * Avoid some clippy lints on 1.85. * Upload artifacts v4.	2025-02-22 10:23:22 +01:00
Philip Fabianek	ac9cdbd448	Refactor From<Tuple> implementations by using macros, add tests (#2762 )	2025-02-19 10:58:29 +01:00
Eric Buehler	e6cc76fc37	Implement DeepSeek V2 (#2744 ) * Add deepseek v2 * Fix * Remove unused * Add kv cache * Remove from cargo.toml * Fix dtype selection logic * Fix unnecessary u32->f32->gather->u32 * Remove fromstr impl * Use local scopes for some clarity * Typo * Repeat k_pe * Chain calls to remove mut * Actually, remove all muts * Update readme	2025-02-19 10:51:01 +01:00