Tmp gemm.

Reuse buffers on our own reference counts.
Metal operational.
2025-06-17 19:18:50 +00:00 · 2023-11-19 20:43:59 +01:00 · 2023-11-18 23:28:59 +01:00 · 2023-11-18 00:52:38 +01:00 · 2023-11-17 10:36:57 +01:00 · 2023-11-16 11:07:56 +01:00
161 changed files with 3245 additions and 15469 deletions
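The commit message above mentions reusing buffers based on the backend's own reference counts, but the `metal_backend.rs` changes are not shown in this diff. The following is only a minimal sketch of that general idea in plain Rust, not the code from this commit: `Buffer`, `BufferPool`, `allocate`, and `allocated` are hypothetical names, and a `Vec<u8>` stands in for a device allocation. It assumes the pool keeps one `Arc` handle per buffer, so a strong count of 1 means no caller still holds the buffer and it can be handed out again.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for a device buffer (hypothetical; a real backend would wrap a GPU allocation).
#[derive(Debug)]
pub struct Buffer {
    pub bytes: Vec<u8>,
}

/// Hypothetical pool: buffers are grouped by size, and the pool keeps one
/// `Arc` handle to every buffer it has created.
#[derive(Default)]
pub struct BufferPool {
    by_size: HashMap<usize, Vec<Arc<Buffer>>>,
}

impl BufferPool {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns a buffer of `size` bytes, reusing an idle one when possible.
    pub fn allocate(&mut self, size: usize) -> Arc<Buffer> {
        let bucket = self.by_size.entry(size).or_default();
        // A strong count of 1 means the pool holds the only handle,
        // so no caller is still using this buffer.
        if let Some(idle) = bucket.iter().find(|b| Arc::strong_count(*b) == 1) {
            return Arc::clone(idle);
        }
        // No idle buffer of this size: create one and remember it.
        let fresh = Arc::new(Buffer { bytes: vec![0u8; size] });
        bucket.push(Arc::clone(&fresh));
        fresh
    }

    /// Number of buffers the pool has actually created.
    pub fn allocated(&self) -> usize {
        self.by_size.values().map(Vec::len).sum()
    }
}
```

In this sketch, allocating two 16-byte buffers, dropping the first handle, and allocating a third time hands back the first buffer, so `allocated()` stays at 2. Because `allocate` takes `&mut self`, a pool shared between threads would need to sit behind a lock such as a `Mutex`.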
--- a/.github/workflows/ci_cuda.yaml
+++ b/.github/workflows/ci_cuda.yaml
@@ -8,8 +8,6 @@ jobs:
  start-runner:
    name: Start self-hosted EC2 runner
    runs-on: ubuntu-latest
-    # Don't run on forks, they won't have access to secrets anyway.
-    if: ${{ github.event.pull_request.head.repo.full_name == github.event.pull_request.base.repo.full_name }}
    env:
      AWS_REGION: us-east-1
      EC2_AMI_ID: ami-03cfed9ea28f4b002
@@ -72,7 +70,7 @@ jobs:
    runs-on: ubuntu-latest
    env:
      AWS_REGION: us-east-1
-    if: ${{ (success() || failure()) && github.event.pull_request.head.repo.full_name == github.event.pull_request.base.repo.full_name }} # required to stop the runner even if the error happened in the previous jobs
+    if: ${{ always() }} # required to stop the runner even if the error happened in the previous jobs
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -63,7 +63,7 @@ This documents the main changes to the `candle` crate.
  [760](https://github.com/huggingface/candle/pull/760).
 - Add the Segment-Anything Model (SAM) as an example
  [773](https://github.com/huggingface/candle/pull/773).
-- TinyViT backbone for the segment anything example
+- TinyViT backbone for the segemnt anything example
  [787](https://github.com/huggingface/candle/pull/787).
 - Shape with holes support
  [770](https://github.com/huggingface/candle/pull/770).
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -19,7 +19,7 @@ exclude = [
 resolver = "2"

 [workspace.package]
-version = "0.3.3"
+version = "0.3.0"
 edition = "2021"
 description = "Minimalist ML framework."
 repository = "https://github.com/huggingface/candle"
@@ -32,7 +32,6 @@ accelerate-src = { version = "0.3.2" }
 anyhow = { version = "1", features = ["backtrace"] }
 byteorder = "1.4.3"
 clap = { version = "4.2.4", features = ["derive"] }
-criterion = { version = "0.5.1", default-features=false }
 cudarc = { version = "0.9.14", features = ["f16"] }
 gemm = { version = "0.16.6", features = ["wasm-simd128-enable"] }
 hf-hub = "0.3.0"
@@ -62,7 +61,7 @@ tracing-subscriber = "0.3.7"
 wav = "1.0.0"
 yoke = { version = "0.7.2", features = ["derive"] }
 zip = { version = "0.6.6", default-features = false }
-metal = { version = "0.27.0", features = ["mps"]}
+metal = { git = "https://github.com/ivarflakstad/metal-rs.git", features = ["mps"] }

 [profile.release-with-debug]
 inherits = "release"
--- a/README.md
+++ b/README.md
@@ -54,25 +54,19 @@ These online demos run entirely in your browser:
 - [whisper](https://huggingface.co/spaces/lmz/candle-whisper): speech recognition.
 - [LLaMA2](https://huggingface.co/spaces/lmz/candle-llama2): text generation.
 - [T5](https://huggingface.co/spaces/radames/Candle-T5-Generation-Wasm): text generation.
-- [Phi-1.5, and Phi-2](https://huggingface.co/spaces/radames/Candle-Phi-1.5-Wasm): text generation.
+- [Phi-v1.5](https://huggingface.co/spaces/radames/Candle-Phi-1.5-Wasm): text generation.
 - [Segment Anything Model](https://huggingface.co/spaces/radames/candle-segment-anything-wasm): Image segmentation.
 - [BLIP](https://huggingface.co/spaces/radames/Candle-BLIP-Image-Captioning): image captioning.

 We also provide a some command line based examples using state of the art models:

-- [LLaMA and LLaMA-v2](./candle-examples/examples/llama/): general LLM, includes
-  the SOLAR-10.7B variant.
+- [LLaMA and LLaMA-v2](./candle-examples/examples/llama/): general LLM.
 - [Falcon](./candle-examples/examples/falcon/): general LLM.
-- [Phi-1, Phi-1.5, and Phi-2](./candle-examples/examples/phi/): 1.3b and 2.7b general LLMs with performance on par with LLaMA-v2 7b.
+- [Phi-v1 and Phi-v1.5](./candle-examples/examples/phi/): a 1.3b general LLM with performance on par with LLaMA-v2 7b.
 - [StableLM-3B-4E1T](./candle-examples/examples/stable-lm/): a 3b general LLM
  pre-trained on 1T tokens of English and code datasets.
-- [Minimal Mamba](./candle-examples/examples/minimal-mamba/): a minimal
-  implementation of the Mamba state space model.
 - [Mistral7b-v0.1](./candle-examples/examples/mistral/): a 7b general LLM with
-  better performance than all publicly available 13b models as of 2023-09-28.
-- [Mixtral8x7b-v0.1](./candle-examples/examples/mixtral/): a sparse mixture of
-  experts 8x7b general LLM with better performance than a Llama 2 70B model with
-  much faster inference.
+  performance larger than all publicly available 13b models as of 2023-09-28.
 - [StarCoder](./candle-examples/examples/bigcode/): LLM specialized to code generation.
 - [Replit-code-v1.5](./candle-examples/examples/replit-code/): a 3.3b LLM specialized for code completion.
 - [Yi-6B / Yi-34B](./candle-examples/examples/yi/): two bilingual
@@ -84,7 +78,7 @@ We also provide a some command line based examples using state of the art models
 <img src="https://github.com/huggingface/candle/raw/main/candle-examples/examples/quantized/assets/aoc.gif" width="600">
  
 - [Stable Diffusion](./candle-examples/examples/stable-diffusion/): text to
-  image generative model, support for the 1.5, 2.1, SDXL 1.0 and Turbo versions.
+  image generative model, support for the 1.5, 2.1, and SDXL 1.0 versions.

 <img src="https://github.com/huggingface/candle/raw/main/candle-examples/examples/stable-diffusion/assets/stable-diffusion-xl.jpg" width="200">

@@ -128,7 +122,7 @@ There are also some wasm examples for whisper and
 [whisper](https://huggingface.co/spaces/lmz/candle-whisper),
 [llama2](https://huggingface.co/spaces/lmz/candle-llama2),
 [T5](https://huggingface.co/spaces/radames/Candle-T5-Generation-Wasm),
-[Phi-1.5, and Phi-2](https://huggingface.co/spaces/radames/Candle-Phi-1.5-Wasm),
+[Phi-v1.5](https://huggingface.co/spaces/radames/Candle-Phi-1.5-Wasm),
 [Segment Anything Model](https://huggingface.co/spaces/radames/candle-segment-anything-wasm).

 For LLaMA2, run the following command to retrieve the weight files and start a
@@ -145,20 +139,17 @@ And then head over to
 <!--- ANCHOR: useful_libraries --->

 ## Useful External Resources
-- [`candle-tutorial`](https://github.com/ToluClassics/candle-tutorial): A
+- [`candle-tutorial`](https://github.com/ToluClassics/candle-tutorial): a
  very detailed tutorial showing how to convert a PyTorch model to Candle.
-- [`candle-lora`](https://github.com/EricLBuehler/candle-lora): Efficient and
-  ergonomic LoRA implementation for Candle. `candle-lora` has      
-  out-of-the-box LoRA support for many models from Candle, which can be found
-  [here](https://github.com/EricLBuehler/candle-lora/tree/master/candle-lora-transformers/examples).
-- [`optimisers`](https://github.com/KGrewal1/optimisers): A collection of optimisers
+- [`optimisers`](https://github.com/KGrewal1/optimisers): a collection of optimisers
  including SGD with momentum, AdaGrad, AdaDelta, AdaMax, NAdam, RAdam, and RMSprop.
+- [`candle-lora`](https://github.com/EricLBuehler/candle-lora): a LoRA implementation
+  that conforms to the official `peft` implementation.
 - [`candle-vllm`](https://github.com/EricLBuehler/candle-vllm): Efficient platform for inference and
  serving local LLMs including an OpenAI compatible API server.
-- [`candle-ext`](https://github.com/mokeyish/candle-ext): An extension library to Candle that provides PyTorch functions not currently available in Candle.
-- [`kalosm`](https://github.com/floneum/floneum/tree/master/interfaces/kalosm): A multi-modal meta-framework in Rust for interfacing with local pre-trained models with support for controlled generation, custom samplers, in-memory vector databases, audio transcription, and more.
+- [`candle-ext`](https://github.com/mokeyish/candle-ext): an extension library to Candle that provides PyTorch functions not currently available in Candle.
+- [`kalosm`](https://github.com/floneum/floneum/tree/master/kalosm): A multi-modal meta-framework in Rust for interfacing with local pre-trained models with support for controlled generation, custom samplers, in-memory vector databases, audio transcription, and more.
 - [`candle-sampling`](https://github.com/EricLBuehler/candle-sampling): Sampling techniques for Candle.
-- [`gpt-from-scratch-rs`](https://github.com/jeroenvlek/gpt-from-scratch-rs): A port of Andrej Karpathy's _Let's build GPT_ tutorial on YouTube showcasing the Candle API on a toy problem.

 If you have an addition to this list, please submit a pull request.

@@ -177,23 +168,15 @@ If you have an addition to this list, please submit a pull request.
    - WASM support, run your models in a browser.
 - Included models.
    - Language Models.
-        - LLaMA v1 and v2 with variants such as SOLAR-10.7B.
+        - LLaMA v1 and v2.
        - Falcon.
        - StarCoder.
-        - Phi 1, 1.5, and 2.
-        - Minimal Mamba
+        - Phi v1.5.
        - Mistral 7b v0.1.
-        - Mixtral 8x7b v0.1.
        - StableLM-3B-4E1T.
        - Replit-code-v1.5-3B.
        - Bert.
        - Yi-6B and Yi-34B.
-    - Quantized LLMs.
-        - Llama 7b, 13b, 70b, as well as the chat and code variants.
-        - Mistral 7b, and 7b instruct.
-        - Mixtral 8x7b.
-        - Zephyr 7b a and b (Mistral-7b based).
-        - OpenChat 3.5 (Mistral-7b based).
    - Text to text.
        - T5 and its variants: FlanT5, UL2, MADLAD400 (translation), CoEdit (Grammar correction).
        - Marian MT (Machine Translation).
--- a/candle-book/Cargo.toml
+++ b/candle-book/Cargo.toml
@@ -11,11 +11,11 @@ readme = "README.md"

 [dependencies]
 accelerate-src = { workspace = true, optional = true }
-candle = { path = "../candle-core", version = "0.3.3", package = "candle-core" }
-candle-datasets = { path = "../candle-datasets", version = "0.3.3" }
-candle-nn = { path = "../candle-nn", version = "0.3.3" }
-candle-transformers = { path = "../candle-transformers", version = "0.3.3" }
-candle-flash-attn = { path = "../candle-flash-attn", version = "0.3.3", optional = true }
+candle = { path = "../candle-core", version = "0.3.0", package = "candle-core" }
+candle-datasets = { path = "../candle-datasets", version = "0.3.0" }
+candle-nn = { path = "../candle-nn", version = "0.3.0" }
+candle-transformers = { path = "../candle-transformers", version = "0.3.0" }
+candle-flash-attn = { path = "../candle-flash-attn", version = "0.3.0", optional = true }
 safetensors = { workspace = true }
 serde = { workspace = true }
 serde_json = { workspace = true }
--- a/candle-book/src/apps/dekstop.md
+++ b/candle-book/src/apps/dekstop.md
--- a/candle-book/src/lib.rs
+++ b/candle-book/src/lib.rs
@@ -28,7 +28,6 @@ let weights = candle::safetensors::load(weights_filename, &Device::Cpu).unwrap()
    #[rustfmt::skip]
    #[test]
    fn book_hub_2() {
-        {
 // ANCHOR: book_hub_2
 use candle::Device;
 use hf_hub::api::sync::Api;
@@ -46,10 +45,9 @@ let weights = candle::safetensors::load_buffer(&mmap[..], &Device::Cpu).unwrap()
        assert_eq!(weights.len(), 206);
    }

-    // #[rustfmt::skip]
-    // #[test]
-    // fn book_hub_3() {
-    {
+    #[rustfmt::skip]
+    #[test]
+    fn book_hub_3() {
 // ANCHOR: book_hub_3
 use candle::{DType, Device, Tensor};
 use hf_hub::api::sync::Api;
@@ -104,7 +102,6 @@ let tp_tensor = Tensor::from_raw_buffer(&raw, dtype, &tp_shape, &Device::Cpu).un
        assert_eq!(view.shape(), &[768, 768]);
        assert_eq!(tp_tensor.dims(), &[192, 768]);
    }
-}

    #[rustfmt::skip]
    #[test]
--- a/candle-core/Cargo.toml
+++ b/candle-core/Cargo.toml
@@ -12,8 +12,8 @@ readme = "README.md"
 [dependencies]
 accelerate-src = { workspace = true, optional = true }
 byteorder = { workspace = true }
-candle-kernels = { path = "../candle-kernels", version = "0.3.3", optional = true }
-candle-metal-kernels = { path = "../candle-metal-kernels", version = "0.3.3", optional = true }
+candle-kernels = { path = "../candle-kernels", version = "0.3.0", optional = true }
+candle-metal-kernels = { path = "../candle-metal-kernels", version = "0.3.0", optional = true }
 metal = { workspace = true, optional = true}
 cudarc = { workspace = true, optional = true }
 gemm = { workspace = true }
@@ -34,8 +34,6 @@ zip = { workspace = true }
 [dev-dependencies]
 anyhow = { workspace = true }
 clap = { workspace = true }
-criterion = { workspace = true }
-

 [features]
 default = []
@@ -44,8 +42,3 @@ cudnn = ["cuda", "cudarc/cudnn"]
 mkl = ["dep:libc", "dep:intel-mkl-src"]
 accelerate = ["dep:libc", "dep:accelerate-src"]
 metal = ["dep:metal", "dep:candle-metal-kernels"]
-
-[[bench]]
-name = "matmul"
-harness = false
-
--- a/candle-core/benches/matmul.rs
+++ b/candle-core/benches/matmul.rs
@@ -1,42 +0,0 @@
-use candle_core::{DType, Device, Tensor};
-use criterion::{black_box, criterion_group, criterion_main, Criterion, Throughput};
-use std::time::Instant;
-
-fn run(a: &Tensor, b: &Tensor) {
-    a.matmul(&b.t().unwrap()).unwrap();
-}
-
-fn criterion_benchmark(c: &mut Criterion) {
-    let b = 1;
-    let m = 1;
-    let n = 2048;
-    let k = 2048;
-
-    let device = Device::new_metal(0).unwrap();
-    let dtype = DType::F32;
-    let lhs = Tensor::zeros((b, m, k), dtype, &device).unwrap();
-    let rhs = Tensor::zeros((b, n, k), dtype, &device).unwrap();
-
-    let flops = b * m * n * k;
-
-    let mut group = c.benchmark_group("matmul_metal");
-    group.throughput(Throughput::Bytes(flops as u64));
-    group.bench_function("iter", move |b| {
-        b.iter_custom(|iters| {
-            let start = Instant::now();
-            for _i in 0..iters {
-                run(black_box(&lhs), black_box(&rhs));
-            }
-            if let Device::Metal(device) = &device {
-                device.wait_until_completed().unwrap();
-            } else {
-                panic!("Expected metal device");
-            }
-            start.elapsed()
-        })
-    });
-    group.finish();
-}
-
-criterion_group!(benches, criterion_benchmark);
-criterion_main!(benches);
--- a/candle-core/examples/basics.rs
+++ b/candle-core/examples/basics.rs
@@ -8,10 +8,11 @@ use anyhow::Result;
 use candle_core::{Device, Tensor};

 fn main() -> Result<()> {
-    let a = Tensor::new(&[[0.0f32, 1.0, 2.0], [3.0, 4.0, 5.0]], &Device::Cpu)?;
-    let b = Tensor::new(&[[88.0f32, 99.0]], &Device::Cpu)?;
-    let new_a = a.slice_scatter(&b, 1, 2)?;
-    assert_eq!(a.to_vec2::<f32>()?, [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]);
-    assert_eq!(new_a.to_vec2::<f32>()?, [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]);
+    let inp = Tensor::randn(0f32, 1., (2, 320, 96, 96), &Device::Cpu)?;
+    let w = Tensor::randn(0f32, 1., (320, 320, 3, 3), &Device::Cpu)?;
+    let start = std::time::Instant::now();
+    let res = inp.conv2d(&w, 0, 1, 1, 1)?;
+    println!("{:?}", start.elapsed());
+    println!("{res:?}");
    Ok(())
 }
--- a/candle-core/examples/tensor-tools.rs
+++ b/candle-core/examples/tensor-tools.rs
@@ -1,5 +1,5 @@
-use candle_core::quantized::{gguf_file, GgmlDType, QTensor};
-use candle_core::{Device, Result};
+use candle_core::quantized::{gguf_file, k_quants, QTensor};
+use candle_core::{Device, Result, Tensor};
 use clap::{Parser, Subcommand, ValueEnum};
 use rayon::prelude::*;

@@ -11,7 +11,12 @@ enum QuantizationMode {
 }

 impl QuantizationMode {
-    fn quantize(&self, name: &str, tensor: QTensor, dtype: GgmlDType) -> Result<QTensor> {
+    fn quantize(
+        &self,
+        name: &str,
+        tensor: QTensor,
+        default: fn(&Tensor) -> Result<QTensor>,
+    ) -> Result<QTensor> {
        match self {
            Self::Llama => {
                // Same behavior as the llama.cpp quantization.
@@ -19,9 +24,9 @@ impl QuantizationMode {
                if should_quantize {
                    let tensor = tensor.dequantize(&Device::Cpu)?;
                    if name == "output.weight" {
-                        QTensor::quantize(&tensor, GgmlDType::Q6K)
+                        QTensor::quantize::<k_quants::BlockQ6K>(&tensor)
                    } else {
-                        QTensor::quantize(&tensor, dtype)
+                        default(&tensor)
                    }
                } else {
                    Ok(tensor)
@@ -55,27 +60,6 @@ enum Quantization {
    F32,
 }

-impl Quantization {
-    fn dtype(&self) -> GgmlDType {
-        match self {
-            Quantization::Q4_0 => GgmlDType::Q4_0,
-            Quantization::Q4_1 => GgmlDType::Q4_1,
-            Quantization::Q5_0 => GgmlDType::Q5_0,
-            Quantization::Q5_1 => GgmlDType::Q5_1,
-            Quantization::Q8_0 => GgmlDType::Q8_0,
-            Quantization::Q8_1 => GgmlDType::Q8_1,
-            Quantization::Q2k => GgmlDType::Q2K,
-            Quantization::Q3k => GgmlDType::Q3K,
-            Quantization::Q4k => GgmlDType::Q4K,
-            Quantization::Q5k => GgmlDType::Q5K,
-            Quantization::Q6k => GgmlDType::Q6K,
-            Quantization::Q8k => GgmlDType::Q8K,
-            Quantization::F16 => GgmlDType::F16,
-            Quantization::F32 => GgmlDType::F32,
-        }
-    }
-}
-
 #[derive(ValueEnum, Debug, Clone)]
 enum Format {
    Safetensors,
@@ -141,12 +125,7 @@ struct Args {
    command: Command,
 }

-fn run_ls(
-    file: &std::path::PathBuf,
-    format: Option<Format>,
-    verbose: bool,
-    device: &Device,
-) -> Result<()> {
+fn run_ls(file: &std::path::PathBuf, format: Option<Format>, verbose: bool) -> Result<()> {
    let format = match format {
        Some(format) => format,
        None => match Format::infer(file) {
@@ -212,7 +191,7 @@ fn run_ls(
        }
        Format::Ggml => {
            let mut file = std::fs::File::open(file)?;
-            let content = candle_core::quantized::ggml_file::Content::read(&mut file, device)?;
+            let content = candle_core::quantized::ggml_file::Content::read(&mut file)?;
            let mut tensors = content.tensors.into_iter().collect::<Vec<_>>();
            tensors.sort_by(|a, b| a.0.cmp(&b.0));
            for (name, qtensor) in tensors.iter() {
@@ -253,8 +232,37 @@ fn run_quantize_safetensors(
    }
    println!("tensors: {}", tensors.len());

-    let dtype = q.dtype();
-    let block_size = dtype.block_size();
+    let quantize_fn = match q {
+        Quantization::Q4_0 => QTensor::quantize::<k_quants::BlockQ4_0>,
+        Quantization::Q4_1 => QTensor::quantize::<k_quants::BlockQ4_1>,
+        Quantization::Q5_0 => QTensor::quantize::<k_quants::BlockQ5_0>,
+        Quantization::Q5_1 => QTensor::quantize::<k_quants::BlockQ5_1>,
+        Quantization::Q8_0 => QTensor::quantize::<k_quants::BlockQ8_0>,
+        Quantization::Q8_1 => QTensor::quantize::<k_quants::BlockQ8_1>,
+        Quantization::Q2k => QTensor::quantize::<k_quants::BlockQ2K>,
+        Quantization::Q3k => QTensor::quantize::<k_quants::BlockQ3K>,
+        Quantization::Q4k => QTensor::quantize::<k_quants::BlockQ4K>,
+        Quantization::Q5k => QTensor::quantize::<k_quants::BlockQ5K>,
+        Quantization::Q6k => QTensor::quantize::<k_quants::BlockQ6K>,
+        Quantization::Q8k => QTensor::quantize::<k_quants::BlockQ8K>,
+        Quantization::F16 => QTensor::quantize::<half::f16>,
+        Quantization::F32 => QTensor::quantize::<f32>,
+    };
+    let block_size = match q {
+        Quantization::Q4_0 => k_quants::QK4_0,
+        Quantization::Q4_1 => k_quants::QK4_1,
+        Quantization::Q5_0 => k_quants::QK5_0,
+        Quantization::Q5_1 => k_quants::QK5_1,
+        Quantization::Q8_0 => k_quants::QK8_0,
+        Quantization::Q8_1 => k_quants::QK8_1,
+        Quantization::Q2k
+        | Quantization::Q3k
+        | Quantization::Q4k
+        | Quantization::Q5k
+        | Quantization::Q6k
+        | Quantization::Q8k => k_quants::QK_K,
+        Quantization::F16 | Quantization::F32 => 1,
+    };

    let qtensors = tensors
        .into_par_iter()
@@ -262,9 +270,9 @@ fn run_quantize_safetensors(
            let should_quantize = tensor.rank() == 2 && tensor.dim(1)? % block_size == 0;
            println!("  quantizing {name} {tensor:?} {should_quantize}");
            let tensor = if should_quantize {
-                QTensor::quantize(&tensor, dtype)?
+                quantize_fn(&tensor)?
            } else {
-                QTensor::quantize(&tensor, GgmlDType::F32)?
+                QTensor::quantize::<f32>(&tensor)?
            };
            Ok((name, tensor))
        })
@@ -282,7 +290,6 @@ fn run_quantize(
    out_file: std::path::PathBuf,
    q: Quantization,
    qmode: QuantizationMode,
-    device: &Device,
 ) -> Result<()> {
    if in_files.is_empty() {
        candle_core::bail!("no specified input files")
@@ -308,15 +315,31 @@ fn run_quantize(
    let content = gguf_file::Content::read(&mut in_)?;
    println!("tensors: {}", content.tensor_infos.len());

-    let dtype = q.dtype();
+    let quantize_fn = match q {
+        Quantization::Q4_0 => QTensor::quantize::<k_quants::BlockQ4_0>,
+        Quantization::Q4_1 => QTensor::quantize::<k_quants::BlockQ4_1>,
+        Quantization::Q5_0 => QTensor::quantize::<k_quants::BlockQ5_0>,
+        Quantization::Q5_1 => QTensor::quantize::<k_quants::BlockQ5_1>,
+        Quantization::Q8_0 => QTensor::quantize::<k_quants::BlockQ8_0>,
+        Quantization::Q8_1 => QTensor::quantize::<k_quants::BlockQ8_1>,
+        Quantization::Q2k => QTensor::quantize::<k_quants::BlockQ2K>,
+        Quantization::Q3k => QTensor::quantize::<k_quants::BlockQ3K>,
+        Quantization::Q4k => QTensor::quantize::<k_quants::BlockQ4K>,
+        Quantization::Q5k => QTensor::quantize::<k_quants::BlockQ5K>,
+        Quantization::Q6k => QTensor::quantize::<k_quants::BlockQ6K>,
+        Quantization::Q8k => QTensor::quantize::<k_quants::BlockQ8K>,
+        Quantization::F16 => QTensor::quantize::<half::f16>,
+        Quantization::F32 => QTensor::quantize::<f32>,
+    };
+
    let qtensors = content
        .tensor_infos
        .par_iter()
        .map(|(name, _)| {
            println!("  quantizing {name}");
            let mut in_file = std::fs::File::open(&in_files[0])?;
-            let tensor = content.tensor(&mut in_file, name, device)?;
-            let tensor = qmode.quantize(name, tensor, dtype)?;
+            let tensor = content.tensor(&mut in_file, name)?;
+            let tensor = qmode.quantize(name, tensor, quantize_fn)?;
            Ok((name, tensor))
        })
        .collect::<Result<Vec<_>>>()?;
@@ -336,7 +359,6 @@ fn run_quantize(

 fn main() -> anyhow::Result<()> {
    let args = Args::parse();
-    let device = Device::Cpu;
    match args.command {
        Command::Ls {
            files,
@@ -348,7 +370,7 @@ fn main() -> anyhow::Result<()> {
                if multiple_files {
                    println!("--- {file:?} ---");
                }
-                run_ls(file, format.clone(), verbose, &device)?
+                run_ls(file, format.clone(), verbose)?
            }
        }
        Command::Quantize {
@@ -356,7 +378,7 @@ fn main() -> anyhow::Result<()> {
            out_file,
            quantization,
            mode,
-        } => run_quantize(&in_file, out_file, quantization, mode, &device)?,
+        } => run_quantize(&in_file, out_file, quantization, mode)?,
    }
    Ok(())
 }
--- a/candle-core/src/backprop.rs
+++ b/candle-core/src/backprop.rs
@@ -114,7 +114,7 @@ impl Tensor {
                    | Op::Unary(_node, UnaryOp::Round) => nodes,
                    Op::Reshape(node)
                    | Op::UpsampleNearest1D(node)
-                    | Op::UpsampleNearest2D { arg: node, .. }
+                    | Op::UpsampleNearest2D(node)
                    | Op::AvgPool2D { arg: node, .. }
                    | Op::MaxPool2D { arg: node, .. }
                    | Op::Copy(node)
@@ -350,27 +350,9 @@ impl Tensor {
                    Op::UpsampleNearest1D { .. } => Err(Error::BackwardNotSupported {
                        op: "upsample-nearest1d",
                    })?,
-                    Op::UpsampleNearest2D {
-                        arg,
-                        target_h,
-                        target_w,
-                    } => {
-                        let (_n, c, h, w) = arg.dims4()?;
-                        if target_h % h != 0 || target_w % w != 0 {
-                            crate::bail!("backward not supported for non integer upscaling factors")
-                        }
-                        let scale_h = target_h / h;
-                        let scale_w = target_w / w;
-
-                        if scale_h != scale_w {
-                            crate::bail!("backward not supported for non uniform upscaling factors")
-                        };
-                        let kernel =
-                            Tensor::ones((c, 1, scale_h, scale_w), arg.dtype(), arg.device())?;
-                        let conv_sum = grad.conv2d(&kernel, 0, scale_h, 1, c)?;
-                        let sum_grad = grads.or_insert(arg)?;
-                        *sum_grad = conv_sum;
-                    }
+                    Op::UpsampleNearest2D { .. } => Err(Error::BackwardNotSupported {
+                        op: "upsample-nearest2d",
+                    })?,
                    Op::SliceScatter0(lhs, rhs, start_rhs) => {
                        let rhs_sum_grad = grads.or_insert(rhs)?;
                        let rhs_grad = grad.narrow(0, *start_rhs, rhs.dim(0)?)?;
--- a/candle-core/src/device.rs
+++ b/candle-core/src/device.rs
@@ -201,9 +201,10 @@ impl Device {
                    Ok(Storage::Cuda(storage))
                }
            }
-            Device::Metal(device) => {
-                let storage = device.rand_uniform(shape, dtype, lo, up)?;
-                Ok(Storage::Metal(storage))
+            Device::Metal(_device) => {
+                // let storage = device.rand_uniform(shape, dtype, lo, up)?;
+                // Ok(Storage::Metal(storage))
+                crate::bail!("Metal rand_uniform not implemented")
            }
        }
    }
--- a/candle-core/src/indexer.rs
+++ b/candle-core/src/indexer.rs
@@ -64,7 +64,7 @@ impl Tensor {
 #[derive(Debug)]
 /// Generic structure used to index a slice of the tensor
 pub enum TensorIndexer {
-    /// This selects the elements for which an index has some specific value.
+    /// This selects the elemnts for which an index has some specific value.
    Select(usize),
    /// This is a regular slice, purely indexing a chunk of the tensor
    Narrow(Bound<usize>, Bound<usize>),
@@ -104,31 +104,37 @@ impl From<&Tensor> for TensorIndexer {
    }
 }

-trait RB: RangeBounds<usize> {}
-impl RB for Range<usize> {}
-impl RB for RangeFrom<usize> {}
-impl RB for RangeFull {}
-impl RB for RangeInclusive<usize> {}
-impl RB for RangeTo<usize> {}
-impl RB for RangeToInclusive<usize> {}
+macro_rules! impl_from_range {
+    ($range_type:ty) => {
+        impl From<$range_type> for TensorIndexer {
+            fn from(range: $range_type) -> Self {
+                use std::ops::Bound::*;

-impl<T: RB> From<T> for TensorIndexer {
-    fn from(range: T) -> Self {
-        use std::ops::Bound::*;
-        let start = match range.start_bound() {
-            Included(idx) => Included(*idx),
-            Excluded(idx) => Excluded(*idx),
-            Unbounded => Unbounded,
-        };
-        let end = match range.end_bound() {
-            Included(idx) => Included(*idx),
-            Excluded(idx) => Excluded(*idx),
-            Unbounded => Unbounded,
-        };
-        TensorIndexer::Narrow(start, end)
-    }
+                let start = match range.start_bound() {
+                    Included(idx) => Included(*idx),
+                    Excluded(idx) => Excluded(*idx),
+                    Unbounded => Unbounded,
+                };
+
+                let end = match range.end_bound() {
+                    Included(idx) => Included(*idx),
+                    Excluded(idx) => Excluded(*idx),
+                    Unbounded => Unbounded,
+                };
+
+                TensorIndexer::Narrow(start, end)
+            }
+        }
+    };
 }

+impl_from_range!(Range<usize>);
+impl_from_range!(RangeFrom<usize>);
+impl_from_range!(RangeFull);
+impl_from_range!(RangeInclusive<usize>);
+impl_from_range!(RangeTo<usize>);
+impl_from_range!(RangeToInclusive<usize>);
+
 /// Trait used to implement multiple signatures for ease of use of the slicing
 /// of a tensor
 pub trait IndexOp<T> {
--- a/candle-core/src/lib.rs
+++ b/candle-core/src/lib.rs
@@ -123,6 +123,12 @@ pub trait Module {
    fn forward(&self, xs: &Tensor) -> Result<Tensor>;
 }

+impl Module for quantized::QMatMul {
+    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
+        self.forward(xs)
+    }
+}
+
 impl<T: Fn(&Tensor) -> Result<Tensor>> Module for T {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        self(xs)
--- a/candle-core/src/metal_backend.rs
+++ b/candle-core/src/metal_backend.rs
--- a/candle-core/src/op.rs
+++ b/candle-core/src/op.rs
@@ -132,11 +132,7 @@ pub enum Op {
    },

    UpsampleNearest1D(Tensor),
-    UpsampleNearest2D {
-        arg: Tensor,
-        target_h: usize,
-        target_w: usize,
-    },
+    UpsampleNearest2D(Tensor),

    Cat(Vec<Tensor>, usize),

--- a/candle-core/src/quantized/avx.rs
+++ b/candle-core/src/quantized/avx.rs
@@ -353,7 +353,7 @@ pub(crate) fn vec_dot_q3k_q8k(n: usize, xs: &[BlockQ3K], ys: &[BlockQ8K]) -> Res
                q3 = q3.add(32);

                // Prepare low and high bits
-                // We hardcode the shifts here to avoid loading them into a separate register
+                // We hardcode the shifts here to avoid loading them into a seperate register
                let q3l_0 = _mm256_and_si256(q3bits, m3);
                let q3h_0 = if j == 0 {
                    _mm256_srli_epi16(_mm256_andnot_si256(hbits, _mm256_slli_epi16(mone, 0)), 0)
@@ -586,7 +586,7 @@ pub(crate) fn vec_dot_q5k_q8k(n: usize, xs: &[BlockQ5K], ys: &[BlockQ8K]) -> Res
                let q5bits = _mm256_loadu_si256(q5 as *const __m256i);
                q5 = q5.add(32);

-                //Similar to q3k we hardcode the shifts here to avoid loading them into a separate register
+                //Similar to q3k we hardcode the shifts here to avoid loading them into a seperate register
                let q5l_0 = _mm256_and_si256(q5bits, m4);
                let q5l_0_shift_input = _mm256_and_si256(hbits, hmask);
                let q5l_0_right_shift = match j {
--- a/candle-core/src/quantized/ggml_file.rs
+++ b/candle-core/src/quantized/ggml_file.rs
@@ -1,9 +1,7 @@
 //! Support for the GGML file format.

-#[cfg(feature = "metal")]
-use super::metal::load_quantized_metal;
-use super::{k_quants, GgmlDType, QStorage};
-use crate::{Device, Result};
+use super::{k_quants, GgmlDType};
+use crate::Result;
 use byteorder::{LittleEndian, ReadBytesExt};
 use std::collections::HashMap;

@@ -123,22 +121,11 @@ fn from_raw_data<T: super::GgmlType + Send + Sync + 'static>(
    raw_data: &[u8],
    size_in_bytes: usize,
    dims: Vec<usize>,
-    device: &Device,
 ) -> Result<super::QTensor> {
    let raw_data_ptr = raw_data.as_ptr();
    let n_blocks = size_in_bytes / std::mem::size_of::<T>();
    let data = unsafe { std::slice::from_raw_parts(raw_data_ptr as *const T, n_blocks) };
-    let data: QStorage = match device {
-        Device::Cpu => QStorage::Cpu(Box::new(data.to_vec())),
-        #[cfg(feature = "metal")]
-        Device::Metal(metal) => load_quantized_metal(metal, data)?,
-        #[cfg(not(feature = "metal"))]
-        Device::Metal(_metal) => {
-            crate::bail!("Metal backend requires `metal` feature")
-        }
-        device => unimplemented!("Implement quantized tensor for device {device:?}"),
-    };
-    super::QTensor::new(data, dims)
+    super::QTensor::new(data.to_vec(), dims)
 }

 /// Creates a [Tensor] from a raw GGML tensor.
@@ -146,50 +133,29 @@ pub fn qtensor_from_ggml(
    ggml_dtype: GgmlDType,
    raw_data: &[u8],
    dims: Vec<usize>,
-    device: &Device,
 ) -> Result<super::QTensor> {
    let tensor_elems = dims.iter().product::<usize>();
-    let block_size = ggml_dtype.block_size();
-    if tensor_elems % block_size != 0 {
+    let blck_size = ggml_dtype.blck_size();
+    if tensor_elems % blck_size != 0 {
        crate::bail!(
-            "the number of elements {tensor_elems} is not divisible by the block size {block_size}"
+            "the number of elements {tensor_elems} is not divisible by the block size {blck_size}"
        )
    }
-    let size_in_bytes = tensor_elems / block_size * ggml_dtype.type_size();
+    let size_in_bytes = tensor_elems / blck_size * ggml_dtype.type_size();

    match ggml_dtype {
-        GgmlDType::F32 => from_raw_data::<f32>(raw_data, size_in_bytes, dims, device),
-        GgmlDType::F16 => from_raw_data::<half::f16>(raw_data, size_in_bytes, dims, device),
-        GgmlDType::Q4_0 => {
-            from_raw_data::<k_quants::BlockQ4_0>(raw_data, size_in_bytes, dims, device)
-        }
-        GgmlDType::Q4_1 => {
-            from_raw_data::<k_quants::BlockQ4_1>(raw_data, size_in_bytes, dims, device)
-        }
-        GgmlDType::Q5_0 => {
-            from_raw_data::<k_quants::BlockQ5_0>(raw_data, size_in_bytes, dims, device)
-        }
-        GgmlDType::Q5_1 => {
-            from_raw_data::<k_quants::BlockQ5_1>(raw_data, size_in_bytes, dims, device)
-        }
-        GgmlDType::Q8_0 => {
-            from_raw_data::<k_quants::BlockQ8_0>(raw_data, size_in_bytes, dims, device)
-        }
-        GgmlDType::Q2K => {
-            from_raw_data::<k_quants::BlockQ2K>(raw_data, size_in_bytes, dims, device)
-        }
-        GgmlDType::Q3K => {
-            from_raw_data::<k_quants::BlockQ3K>(raw_data, size_in_bytes, dims, device)
-        }
-        GgmlDType::Q4K => {
-            from_raw_data::<k_quants::BlockQ4K>(raw_data, size_in_bytes, dims, device)
-        }
-        GgmlDType::Q5K => {
-            from_raw_data::<k_quants::BlockQ5K>(raw_data, size_in_bytes, dims, device)
-        }
-        GgmlDType::Q6K => {
-            from_raw_data::<k_quants::BlockQ6K>(raw_data, size_in_bytes, dims, device)
-        }
+        GgmlDType::F32 => from_raw_data::<f32>(raw_data, size_in_bytes, dims),
+        GgmlDType::F16 => from_raw_data::<half::f16>(raw_data, size_in_bytes, dims),
+        GgmlDType::Q4_0 => from_raw_data::<k_quants::BlockQ4_0>(raw_data, size_in_bytes, dims),
+        GgmlDType::Q4_1 => from_raw_data::<k_quants::BlockQ4_1>(raw_data, size_in_bytes, dims),
+        GgmlDType::Q5_0 => from_raw_data::<k_quants::BlockQ5_0>(raw_data, size_in_bytes, dims),
+        GgmlDType::Q5_1 => from_raw_data::<k_quants::BlockQ5_1>(raw_data, size_in_bytes, dims),
+        GgmlDType::Q8_0 => from_raw_data::<k_quants::BlockQ8_0>(raw_data, size_in_bytes, dims),
+        GgmlDType::Q2K => from_raw_data::<k_quants::BlockQ2K>(raw_data, size_in_bytes, dims),
+        GgmlDType::Q3K => from_raw_data::<k_quants::BlockQ3K>(raw_data, size_in_bytes, dims),
+        GgmlDType::Q4K => from_raw_data::<k_quants::BlockQ4K>(raw_data, size_in_bytes, dims),
+        GgmlDType::Q5K => from_raw_data::<k_quants::BlockQ5K>(raw_data, size_in_bytes, dims),
+        GgmlDType::Q6K => from_raw_data::<k_quants::BlockQ6K>(raw_data, size_in_bytes, dims),
        _ => crate::bail!("quantized type {ggml_dtype:?} is not supported yet"),
    }
 }
@ -197,7 +163,6 @@ pub fn qtensor_from_ggml(
 fn read_one_tensor<R: std::io::Seek + std::io::Read>(
    reader: &mut R,
    magic: VersionedMagic,
-    device: &Device,
 ) -> Result<(String, super::QTensor)> {
    let n_dims = reader.read_u32::<LittleEndian>()?;
    let name_len = reader.read_u32::<LittleEndian>()?;
@ -218,11 +183,11 @@ fn read_one_tensor<R: std::io::Seek + std::io::Read>(
    }
    let dims = dims.iter().map(|&u| u as usize).collect::<Vec<_>>();
    let tensor_elems = dims.iter().product::<usize>();
-    let size_in_bytes = tensor_elems * ggml_dtype.type_size() / ggml_dtype.block_size();
+    let size_in_bytes = tensor_elems * ggml_dtype.type_size() / ggml_dtype.blck_size();
    // TODO: Mmap version to avoid copying the data around?
    let mut raw_data = vec![0u8; size_in_bytes];
    reader.read_exact(&mut raw_data)?;
-    match qtensor_from_ggml(ggml_dtype, &raw_data, dims, device) {
+    match qtensor_from_ggml(ggml_dtype, &raw_data, dims) {
        Ok(tensor) => Ok((name, tensor)),
        Err(e) => crate::bail!("Error creating tensor {name}: {e}"),
    }
@ -236,10 +201,7 @@ pub struct Content {
 }

 impl Content {
-    pub fn read<R: std::io::Seek + std::io::Read>(
-        reader: &mut R,
-        device: &Device,
-    ) -> Result<Content> {
+    pub fn read<R: std::io::Seek + std::io::Read>(reader: &mut R) -> Result<Content> {
        // https://github.com/ggerganov/llama.cpp/blob/468ea24fb4633a0d681f7ac84089566c1c6190cb/llama.cpp#L505
        let last_position = reader.seek(std::io::SeekFrom::End(0))?;
        reader.seek(std::io::SeekFrom::Start(0))?;
@ -249,7 +211,7 @@ impl Content {
        let mut tensors = HashMap::new();

        while reader.stream_position()? != last_position {
-            let (name, tensor) = read_one_tensor(reader, magic, device)?;
+            let (name, tensor) = read_one_tensor(reader, magic)?;
            tensors.insert(name, tensor);
        }
        Ok(Self {
--- a/candle-core/src/quantized/gguf_file.rs
+++ b/candle-core/src/quantized/gguf_file.rs
@ -3,7 +3,7 @@
 //! Spec: https://github.com/philpax/ggml/blob/gguf-spec/docs/gguf.md

 use super::{GgmlDType, QTensor};
-use crate::{Device, Result};
+use crate::Result;
 use byteorder::{LittleEndian, ReadBytesExt, WriteBytesExt};
 use std::collections::HashMap;

@ -41,7 +41,7 @@ impl VersionedMagic {
            (Magic::Gguf, 1) => Self::GgufV1,
            (Magic::Gguf, 2) => Self::GgufV2,
            (Magic::Gguf, 3) => Self::GgufV3,
-            _ => crate::bail!("gguf: unsupported magic/version {magic:?}/{version}"),
+            _ => crate::bail!("ggml: unsupported magic/version {magic:?}/{version}"),
        };
        Ok(versioned_magic)
    }
@ -59,25 +59,19 @@ impl TensorInfo {
        &self,
        reader: &mut R,
        tensor_data_offset: u64,
-        device: &Device,
    ) -> Result<QTensor> {
        let tensor_elems = self.shape.elem_count();
-        let block_size = self.ggml_dtype.block_size();
-        if tensor_elems % block_size != 0 {
+        let blck_size = self.ggml_dtype.blck_size();
+        if tensor_elems % blck_size != 0 {
            crate::bail!(
-            "the number of elements {tensor_elems} is not divisible by the block size {block_size}"
+            "the number of elements {tensor_elems} is not divisible by the block size {blck_size}"
        )
        }
-        let size_in_bytes = tensor_elems / block_size * self.ggml_dtype.type_size();
+        let size_in_bytes = tensor_elems / blck_size * self.ggml_dtype.type_size();
        let mut raw_data = vec![0u8; size_in_bytes];
        reader.seek(std::io::SeekFrom::Start(tensor_data_offset + self.offset))?;
        reader.read_exact(&mut raw_data)?;
-        super::ggml_file::qtensor_from_ggml(
-            self.ggml_dtype,
-            &raw_data,
-            self.shape.dims().to_vec(),
-            device,
-        )
+        super::ggml_file::qtensor_from_ggml(self.ggml_dtype, &raw_data, self.shape.dims().to_vec())
    }
 }

@ -466,13 +460,12 @@ impl Content {
        &self,
        reader: &mut R,
        name: &str,
-        device: &Device,
    ) -> Result<QTensor> {
        let tensor_info = match self.tensor_infos.get(name) {
            Some(tensor_info) => tensor_info,
-            None => crate::bail!("cannot find tensor info for {name}"),
+            None => crate::bail!("cannot find tensor info for {name}"),
        };
-        tensor_info.read(reader, self.tensor_data_offset, device)
+        tensor_info.read(reader, self.tensor_data_offset)
    }
 }

@ -524,9 +517,10 @@ pub fn write<W: std::io::Seek + std::io::Write>(
                "internal error, unexpected current position {tensor_start_pos} {offset} {pos}"
            )
        }
-        let data = tensor.data()?;
-        let size_in_bytes = data.len();
-        w.write_all(&data)?;
+        let data_ptr = tensor.as_ptr();
+        let size_in_bytes = tensor.storage_size_in_bytes();
+        let data = unsafe { std::slice::from_raw_parts(data_ptr, size_in_bytes) };
+        w.write_all(data)?;
        let padding = 31 - (31 + size_in_bytes) % 32;
        w.write_all(&vec![0u8; padding])?;
    }
--- a/candle-core/src/quantized/metal.rs
+++ b/candle-core/src/quantized/metal.rs
@ -1,155 +0,0 @@
-use super::{GgmlDType, QStorage};
-use crate::{DType, MetalDevice, MetalStorage, Result};
-use metal::Buffer;
-use std::sync::Arc;
-
-pub struct QMetalStorage {
-    dtype: GgmlDType,
-    device: MetalDevice,
-    buffer: Arc<Buffer>,
-}
-
-impl QMetalStorage {
-    pub fn dtype(&self) -> GgmlDType {
-        self.dtype
-    }
-
-    pub fn buffer(&self) -> &Buffer {
-        &self.buffer
-    }
-
-    pub fn new(buffer: Arc<Buffer>, device: MetalDevice, dtype: GgmlDType) -> Self {
-        Self {
-            device,
-            buffer,
-            dtype,
-        }
-    }
-
-    pub fn dequantize(&self, elem_count: usize) -> Result<MetalStorage> {
-        let buffer = self.device.new_buffer_managed(self.buffer.length())?;
-        let command_buffer = self.device.command_buffer()?;
-        command_buffer.set_label("to_cpu");
-        let blit = command_buffer.new_blit_command_encoder();
-        blit.set_label("blit_to_cpu");
-        // blit.wait_for_fence(&self.device.fence());
-        blit.copy_from_buffer(&self.buffer, 0, &buffer, 0, self.buffer.length());
-        // blit.update_fence(&self.device.fence());
-        blit.end_encoding();
-        self.device.wait_until_completed()?;
-        let mut out = vec![0.0; elem_count];
-        match self.dtype {
-            GgmlDType::F32 => {
-                let vec: Vec<f32> = read_to_vec(&buffer, elem_count);
-                use crate::quantized::k_quants::GgmlType;
-                f32::to_float(&vec, &mut out)?;
-            }
-            GgmlDType::F16 => {
-                let vec: Vec<half::f16> = read_to_vec(&buffer, elem_count);
-                use crate::quantized::k_quants::GgmlType;
-                half::f16::to_float(&vec, &mut out)?;
-            }
-            GgmlDType::Q4_0 => {
-                let vec: Vec<crate::quantized::BlockQ4_0> = read_to_vec(&buffer, elem_count);
-                use crate::quantized::k_quants::GgmlType;
-                crate::quantized::BlockQ4_0::to_float(&vec, &mut out)?;
-            }
-            GgmlDType::Q4_1 => {
-                let vec: Vec<crate::quantized::BlockQ4_1> = read_to_vec(&buffer, elem_count);
-                use crate::quantized::k_quants::GgmlType;
-                crate::quantized::BlockQ4_1::to_float(&vec, &mut out)?;
-            }
-            GgmlDType::Q5_0 => {
-                let vec: Vec<crate::quantized::BlockQ5_0> = read_to_vec(&buffer, elem_count);
-                use crate::quantized::k_quants::GgmlType;
-                crate::quantized::BlockQ5_0::to_float(&vec, &mut out)?;
-            }
-            GgmlDType::Q5_1 => {
-                let vec: Vec<crate::quantized::BlockQ5_1> = read_to_vec(&buffer, elem_count);
-                use crate::quantized::k_quants::GgmlType;
-                crate::quantized::BlockQ5_1::to_float(&vec, &mut out)?;
-            }
-            GgmlDType::Q8_0 => {
-                let vec: Vec<crate::quantized::BlockQ8_0> = read_to_vec(&buffer, elem_count);
-                use crate::quantized::k_quants::GgmlType;
-                crate::quantized::BlockQ8_0::to_float(&vec, &mut out)?;
-            }
-            GgmlDType::Q8_1 => {
-                let vec: Vec<crate::quantized::BlockQ8_1> = read_to_vec(&buffer, elem_count);
-                use crate::quantized::k_quants::GgmlType;
-                crate::quantized::BlockQ8_1::to_float(&vec, &mut out)?;
-            }
-            GgmlDType::Q2K => {
-                let vec: Vec<crate::quantized::BlockQ2K> =
-                    read_to_vec(&buffer, elem_count / self.dtype.block_size());
-                use crate::quantized::k_quants::GgmlType;
-                crate::quantized::BlockQ2K::to_float(&vec, &mut out)?;
-            }
-            GgmlDType::Q3K => {
-                let vec: Vec<crate::quantized::BlockQ3K> =
-                    read_to_vec(&buffer, elem_count / self.dtype.block_size());
-                use crate::quantized::k_quants::GgmlType;
-                crate::quantized::BlockQ3K::to_float(&vec, &mut out)?;
-            }
-            GgmlDType::Q4K => {
-                let vec: Vec<crate::quantized::BlockQ4K> =
-                    read_to_vec(&buffer, elem_count / self.dtype.block_size());
-                use crate::quantized::k_quants::GgmlType;
-                crate::quantized::BlockQ4K::to_float(&vec, &mut out)?;
-            }
-            GgmlDType::Q5K => {
-                let vec: Vec<crate::quantized::BlockQ5K> =
-                    read_to_vec(&buffer, elem_count / self.dtype.block_size());
-                use crate::quantized::k_quants::GgmlType;
-                crate::quantized::BlockQ5K::to_float(&vec, &mut out)?;
-            }
-            GgmlDType::Q6K => {
-                let vec: Vec<crate::quantized::BlockQ6K> =
-                    read_to_vec(&buffer, elem_count / self.dtype.block_size());
-                use crate::quantized::k_quants::GgmlType;
-                crate::quantized::BlockQ6K::to_float(&vec, &mut out)?;
-            }
-            GgmlDType::Q8K => {
-                let vec: Vec<crate::quantized::BlockQ8K> =
-                    read_to_vec(&buffer, elem_count / self.dtype.block_size());
-                use crate::quantized::k_quants::GgmlType;
-                crate::quantized::BlockQ8K::to_float(&vec, &mut out)?;
-            }
-        }
-
-        let buffer = self.device.new_buffer_with_data(&out)?;
-        Ok(MetalStorage::new(buffer, self.device.clone(), DType::F32))
-    }
-
-    pub fn quantize(&mut self, src: &MetalStorage) -> Result<()> {
-        // Quantization only happens on CPU for now.
-        let src = src.to_cpu::<f32>()?;
-        let elem_count = src.len();
-        let src = crate::Storage::Cpu(crate::CpuStorage::F32(src));
-        let mut qcpu_storage = crate::Device::Cpu.qzeros(elem_count, self.dtype)?;
-        qcpu_storage.quantize(&src)?;
-        let buffer = self.device.new_buffer_with_data(&qcpu_storage.data()?)?;
-        self.buffer = buffer;
-        Ok(())
-    }
-}
-
-pub fn load_quantized_metal<T: super::GgmlType + Send + Sync + 'static>(
-    device: &MetalDevice,
-    data: &[T],
-) -> Result<QStorage> {
-    let buffer = device.new_buffer_with_data(data)?;
-    let device = device.clone();
-    Ok(QStorage::Metal(QMetalStorage {
-        dtype: T::DTYPE,
-        device,
-        buffer,
-    }))
-}
-
-fn read_to_vec<T: Clone>(buffer: &Buffer, n: usize) -> Vec<T> {
-    let ptr = buffer.contents() as *const T;
-    assert!(!ptr.is_null());
-    let slice = unsafe { std::slice::from_raw_parts(ptr, n) };
-    slice.to_vec()
-}
--- a/candle-core/src/quantized/mod.rs
+++ b/candle-core/src/quantized/mod.rs
@ -1,125 +1,23 @@
-#[cfg(feature = "metal")]
-use crate::{backend::BackendStorage, DType};
-use crate::{CpuStorage, Device, Result, Shape, Storage, Tensor};
-use k_quants::*;
-use std::borrow::Cow;
+use crate::{Device, Result, Shape, Tensor};

 #[cfg(target_feature = "avx")]
 pub mod avx;
 pub mod ggml_file;
 pub mod gguf_file;
 pub mod k_quants;
-#[cfg(feature = "metal")]
-pub mod metal;
 #[cfg(target_feature = "neon")]
 pub mod neon;
 #[cfg(target_feature = "simd128")]
 pub mod simd128;
 pub mod utils;
-use half::f16;

 pub use k_quants::GgmlType;

 pub struct QTensor {
-    storage: QStorage,
+    data: Box<dyn QuantizedType>,
    shape: Shape,
 }

-impl Device {
-    fn qzeros(&self, elem_count: usize, dtype: GgmlDType) -> Result<QStorage> {
-        match self {
-            Device::Cpu => {
-                let storage = dtype.cpu_zeros(elem_count);
-                Ok(QStorage::Cpu(storage))
-            }
-            #[cfg(feature = "metal")]
-            Device::Metal(metal) => {
-                let size = elem_count * dtype.type_size() / dtype.block_size();
-                let buffer = metal.allocate_zeros(size)?;
-                Ok(QStorage::Metal(metal::QMetalStorage::new(
-                    buffer,
-                    metal.clone(),
-                    dtype,
-                )))
-            }
-            #[cfg(not(feature = "metal"))]
-            Device::Metal(_metal) => {
-                crate::bail!("Metal feature not activated");
-            }
-            Device::Cuda(_cuda) => {
-                crate::bail!("Cuda ggml quantization not supported");
-            }
-        }
-    }
-}
-
-pub enum QStorage {
-    Cpu(Box<dyn QuantizedType>),
-    #[cfg(feature = "metal")]
-    Metal(metal::QMetalStorage),
-}
-
-impl QStorage {
-    fn block_size(&self) -> usize {
-        match self {
-            QStorage::Cpu(storage) => storage.block_size(),
-            #[cfg(feature = "metal")]
-            QStorage::Metal(storage) => storage.dtype().block_size(),
-        }
-    }
-
-    fn dtype(&self) -> GgmlDType {
-        match self {
-            QStorage::Cpu(storage) => storage.dtype(),
-            #[cfg(feature = "metal")]
-            QStorage::Metal(storage) => storage.dtype(),
-        }
-    }
-
-    fn size_in_bytes(&self) -> usize {
-        match self {
-            QStorage::Cpu(storage) => storage.storage_size_in_bytes(),
-            #[cfg(feature = "metal")]
-            QStorage::Metal(storage) => storage.buffer().length() as usize,
-        }
-    }
-
-    fn quantize(&mut self, src: &Storage) -> Result<()> {
-        match (self, src) {
-            (QStorage::Cpu(storage), Storage::Cpu(src)) => {
-                storage.from_float(src.as_slice::<f32>()?)?;
-            }
-            #[cfg(feature = "metal")]
-            (QStorage::Metal(storage), Storage::Metal(src)) => storage.quantize(src)?,
-            _ => crate::bail!("Invalid dequantize storage locations do not match"),
-        }
-        Ok(())
-    }
-
-    fn dequantize(&self, elem_count: usize) -> Result<Storage> {
-        match self {
-            QStorage::Cpu(storage) => Ok(Storage::Cpu(storage.dequantize(elem_count)?)),
-            #[cfg(feature = "metal")]
-            QStorage::Metal(storage) => Ok(Storage::Metal(storage.dequantize(elem_count)?)),
-        }
-    }
-
-    fn data(&self) -> Result<Cow<[u8]>> {
-        match self {
-            QStorage::Cpu(storage) => {
-                let data_ptr = storage.as_ptr();
-                let size_in_bytes = storage.storage_size_in_bytes();
-                let data = unsafe { std::slice::from_raw_parts(data_ptr, size_in_bytes) };
-                Ok(Cow::from(data))
-            }
-            #[cfg(feature = "metal")]
-            QStorage::Metal(_storage) => {
-                crate::bail!("not implemented");
-            }
-        }
-    }
-}
-
 #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
 pub enum GgmlDType {
    F32,
@ -179,25 +77,6 @@ impl GgmlDType {
        }
    }

-    /// The block dtype
-    pub fn cpu_zeros(&self, elem_count: usize) -> Box<dyn QuantizedType> {
-        match self {
-            Self::F32 => Box::new(vec![f32::zeros(); elem_count]),
-            Self::F16 => Box::new(vec![f16::zeros(); elem_count]),
-            Self::Q4_0 => Box::new(vec![BlockQ4_0::zeros(); elem_count / BlockQ4_0::BLCK_SIZE]),
-            Self::Q4_1 => Box::new(vec![BlockQ4_1::zeros(); elem_count / BlockQ4_1::BLCK_SIZE]),
-            Self::Q5_0 => Box::new(vec![BlockQ5_0::zeros(); elem_count / BlockQ5_0::BLCK_SIZE]),
-            Self::Q5_1 => Box::new(vec![BlockQ5_1::zeros(); elem_count / BlockQ5_1::BLCK_SIZE]),
-            Self::Q8_0 => Box::new(vec![BlockQ8_0::zeros(); elem_count / BlockQ8_0::BLCK_SIZE]),
-            Self::Q8_1 => Box::new(vec![BlockQ8_1::zeros(); elem_count / BlockQ8_1::BLCK_SIZE]),
-            Self::Q2K => Box::new(vec![BlockQ2K::zeros(); elem_count / BlockQ2K::BLCK_SIZE]),
-            Self::Q3K => Box::new(vec![BlockQ3K::zeros(); elem_count / BlockQ3K::BLCK_SIZE]),
-            Self::Q4K => Box::new(vec![BlockQ4K::zeros(); elem_count / BlockQ4K::BLCK_SIZE]),
-            Self::Q5K => Box::new(vec![BlockQ5K::zeros(); elem_count / BlockQ5K::BLCK_SIZE]),
-            Self::Q6K => Box::new(vec![BlockQ6K::zeros(); elem_count / BlockQ6K::BLCK_SIZE]),
-            Self::Q8K => Box::new(vec![BlockQ8K::zeros(); elem_count / BlockQ8K::BLCK_SIZE]),
-        }
-    }
    /// The type size for blocks in bytes.
    pub fn type_size(&self) -> usize {
        use k_quants::*;
@ -221,7 +100,7 @@ impl GgmlDType {
    }

    /// The block size, i.e. the number of elements stored in each block.
-    pub fn block_size(&self) -> usize {
+    pub fn blck_size(&self) -> usize {
        match self {
            Self::F32 => 1,
            Self::F16 => 1,
@ -240,13 +119,9 @@ impl GgmlDType {
 pub trait QuantizedType: Send + Sync {
    fn dtype(&self) -> GgmlDType;
    fn matmul_t(&self, mkn: (usize, usize, usize), lhs: &[f32], dst: &mut [f32]) -> Result<()>;
-    fn dequantize(&self, elem_count: usize) -> Result<CpuStorage>;
+    fn to_float(&self, ys: &mut [f32]) -> Result<()>;
    fn storage_size_in_bytes(&self) -> usize;
    fn as_ptr(&self) -> *const u8;
-    fn block_size(&self) -> usize;
-    #[allow(clippy::wrong_self_convention)]
-    fn from_float(&mut self, xs: &[f32]) -> Result<()>;
-    fn size(&self) -> usize;
 }

 impl<T: k_quants::GgmlType + Send + Sync> QuantizedType for Vec<T> {
@ -254,26 +129,12 @@ impl<T: k_quants::GgmlType + Send + Sync> QuantizedType for Vec<T> {
        k_quants::matmul(mkn, lhs, self.as_slice(), dst)
    }

-    fn size(&self) -> usize {
-        self.len() * core::mem::size_of::<T>()
-    }
-
-    fn from_float(&mut self, xs: &[f32]) -> Result<()> {
-        T::from_float(xs, self)
-    }
-
    fn dtype(&self) -> GgmlDType {
        T::DTYPE
    }

-    fn block_size(&self) -> usize {
-        T::BLCK_SIZE
-    }
-
-    fn dequantize(&self, elem_count: usize) -> Result<CpuStorage> {
-        let mut ys = vec![0.0f32; elem_count];
-        T::to_float(self.as_slice(), &mut ys)?;
-        Ok(CpuStorage::F32(ys))
+    fn to_float(&self, ys: &mut [f32]) -> Result<()> {
+        T::to_float(self.as_slice(), ys)
    }

    fn storage_size_in_bytes(&self) -> usize {
@ -291,49 +152,56 @@ impl std::fmt::Debug for QTensor {
    }
 }

-fn check_shape(shape: &Shape, block_size: usize) -> Result<()> {
+fn check_shape<T: k_quants::GgmlType>(shape: &Shape) -> Result<()> {
    let dims = shape.dims();
    if dims.is_empty() {
        crate::bail!("scalar tensor cannot be quantized {shape:?}")
    }
-    if dims[dims.len() - 1] % block_size != 0 {
+    if dims[dims.len() - 1] % T::BLCK_SIZE != 0 {
        crate::bail!(
            "quantized tensor must have their last dim divisible by block size {shape:?} {}",
-            block_size
+            T::BLCK_SIZE
        )
    }
    Ok(())
 }

 impl QTensor {
-    pub fn new<S: Into<Shape>>(storage: QStorage, shape: S) -> Result<Self> {
+    pub fn new<S: Into<Shape>, T: k_quants::GgmlType + Send + Sync + 'static>(
+        data: Vec<T>,
+        shape: S,
+    ) -> Result<Self> {
        let shape = shape.into();
-        check_shape(&shape, storage.block_size())?;
-        Ok(Self { storage, shape })
+        check_shape::<T>(&shape)?;
+        Ok(Self {
+            data: Box::new(data),
+            shape,
+        })
    }

-    pub fn quantize(src: &Tensor, dtype: GgmlDType) -> Result<Self> {
+    pub fn quantize<T: k_quants::GgmlType + Send + Sync + 'static>(src: &Tensor) -> Result<Self> {
        let shape = src.shape();
-        let block_size = dtype.block_size();
-        check_shape(shape, block_size)?;
-        let src = src.to_dtype(crate::DType::F32)?.flatten_all()?;
-        let elem_count = shape.elem_count();
-        if elem_count % block_size != 0 {
+        check_shape::<T>(shape)?;
+        let src = src
+            .to_dtype(crate::DType::F32)?
+            .flatten_all()?
+            .to_vec1::<f32>()?;
+        if src.len() % T::BLCK_SIZE != 0 {
            crate::bail!(
                "tensor size ({shape:?}) is not divisible by block size {}",
-                block_size
+                T::BLCK_SIZE
            )
        }
-        let mut storage = src.device().qzeros(elem_count, dtype)?;
-        storage.quantize(&src.storage())?;
+        let mut data = vec![T::zeros(); src.len() / T::BLCK_SIZE];
+        T::from_float(&src, &mut data)?;
        Ok(Self {
-            storage,
+            data: Box::new(data),
            shape: shape.clone(),
        })
    }

    pub fn dtype(&self) -> GgmlDType {
-        self.storage.dtype()
+        self.data.dtype()
    }

    pub fn rank(&self) -> usize {
@ -345,19 +213,21 @@ impl QTensor {
    }

    pub fn dequantize(&self, device: &Device) -> Result<Tensor> {
-        let storage = self.storage.dequantize(self.shape.elem_count())?;
-        let none = crate::op::BackpropOp::none();
-        let is_variable = false;
-        crate::tensor::from_storage(storage, self.shape.clone(), none, is_variable)
-            .to_device(device)
+        let mut f32_data = vec![0f32; self.shape.elem_count()];
+        self.data.to_float(&mut f32_data)?;
+        Tensor::from_vec(f32_data, &self.shape, device)
+    }
+
+    pub fn matmul_t(&self, mkn: (usize, usize, usize), lhs: &[f32], dst: &mut [f32]) -> Result<()> {
+        self.data.matmul_t(mkn, lhs, dst)
    }

    pub fn storage_size_in_bytes(&self) -> usize {
-        self.storage.size_in_bytes()
+        self.data.storage_size_in_bytes()
    }

-    pub fn data(&self) -> Result<Cow<'_, [u8]>> {
-        self.storage.data()
+    pub fn as_ptr(&self) -> *const u8 {
+        self.data.as_ptr()
    }
 }

@ -424,97 +294,21 @@ impl crate::CustomOp1 for QTensor {
        }
        dst_shape.push(n);
        let dst_shape = Shape::from(dst_shape);
-        #[allow(clippy::infallible_destructuring_match)]
-        let self_storage = match &self.storage {
-            QStorage::Cpu(storage) => storage,
-            #[cfg(feature = "metal")]
-            _ => crate::bail!("Invalid storage"),
-        };
-        let slice = storage.as_slice::<f32>()?;
-        let slice = &slice[layout.start_offset()..layout.start_offset() + src_shape.elem_count()];
+        let storage = storage.as_slice::<f32>()?;
+        let storage =
+            &storage[layout.start_offset()..layout.start_offset() + src_shape.elem_count()];
        let mut dst_storage = vec![0f32; dst_shape.elem_count()];
-        self_storage.matmul_t((dst_shape.elem_count() / n, k, n), slice, &mut dst_storage)?;
+        self.matmul_t(
+            (dst_shape.elem_count() / n, k, n),
+            storage,
+            &mut dst_storage,
+        )?;
        Ok((crate::CpuStorage::F32(dst_storage), dst_shape))
    }
-
-    #[cfg(feature = "metal")]
-    fn metal_fwd(
-        &self,
-        storage: &crate::MetalStorage,
-        layout: &crate::Layout,
-    ) -> Result<(crate::MetalStorage, Shape)> {
-        use crate::MetalError;
-
-        if !layout.is_contiguous() {
-            crate::bail!("input tensor is not contiguous {layout:?}")
-        }
-        let src_shape = layout.shape();
-        // self is transposed so n is first then k.
-        if src_shape.rank() < 2 {
-            crate::bail!("input tensor has only one dimension {layout:?}")
-        }
-        let (n, k) = self.shape.dims2()?;
-        let mut dst_shape = src_shape.dims().to_vec();
-
-        let (b, m) = match dst_shape.len() {
-            3 => (dst_shape[0], dst_shape[1]),
-            2 => (1, dst_shape[0]),
-            n => crate::bail!("Invalid rank {n} for quantized matmul metal"),
-        };
-        let last_k = dst_shape.pop().unwrap();
-        if last_k != k {
-            crate::bail!("input tensor {layout:?} incompatible with {:?}", self.shape)
-        }
-        dst_shape.push(n);
-        let dst_shape = Shape::from(dst_shape);
-        let device = storage.device().clone();
-        let dst = device.new_buffer(dst_shape.elem_count(), DType::F32, "qmatmul")?;
-        let (buffer, dtype) = match &self.storage {
-            QStorage::Metal(metal) => (metal.buffer(), metal.dtype()),
-            _ => unreachable!("Cannot call metal matmul on non metal QTensor"),
-        };
-        let command_buffer = device.command_buffer()?;
-        candle_metal_kernels::call_quantized_matmul_t(
-            device.device(),
-            &command_buffer,
-            device.kernels(),
-            dtype.into(),
-            (b, m, n, k),
-            storage.buffer(),
-            layout.start_offset() * storage.dtype().size_in_bytes(),
-            buffer,
-            &dst,
-        )
-        .map_err(MetalError::from)?;
-        let dst_storage = crate::MetalStorage::new(dst, device, DType::F32);
-        Ok((dst_storage, dst_shape))
-    }
 }

-#[cfg(feature = "metal")]
-impl From<GgmlDType> for candle_metal_kernels::GgmlDType {
-    fn from(value: GgmlDType) -> Self {
-        match value {
-            GgmlDType::Q4_0 => candle_metal_kernels::GgmlDType::Q4_0,
-            GgmlDType::Q4_1 => candle_metal_kernels::GgmlDType::Q4_1,
-            GgmlDType::Q5_0 => candle_metal_kernels::GgmlDType::Q5_0,
-            GgmlDType::Q5_1 => candle_metal_kernels::GgmlDType::Q5_1,
-            GgmlDType::Q8_0 => candle_metal_kernels::GgmlDType::Q8_0,
-            GgmlDType::Q8_1 => candle_metal_kernels::GgmlDType::Q8_1,
-            GgmlDType::Q2K => candle_metal_kernels::GgmlDType::Q2K,
-            GgmlDType::Q3K => candle_metal_kernels::GgmlDType::Q3K,
-            GgmlDType::Q4K => candle_metal_kernels::GgmlDType::Q4K,
-            GgmlDType::Q5K => candle_metal_kernels::GgmlDType::Q5K,
-            GgmlDType::Q6K => candle_metal_kernels::GgmlDType::Q6K,
-            GgmlDType::Q8K => candle_metal_kernels::GgmlDType::Q8K,
-            GgmlDType::F16 => candle_metal_kernels::GgmlDType::F16,
-            GgmlDType::F32 => candle_metal_kernels::GgmlDType::F32,
-        }
-    }
-}
-
-impl crate::Module for QMatMul {
-    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
+impl QMatMul {
+    pub fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        match self {
            Self::QTensor(t) => xs.apply_op1_no_bwd(t.as_ref()),
            Self::Tensor(w) => {
--- a/candle-core/src/shape.rs
+++ b/candle-core/src/shape.rs
@ -478,6 +478,23 @@ extract_dims!(
    (usize, usize, usize, usize, usize)
 );

+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn stride() {
+        let shape = Shape::from(());
+        assert_eq!(shape.stride_contiguous(), Vec::<usize>::new());
+        let shape = Shape::from(42);
+        assert_eq!(shape.stride_contiguous(), [1]);
+        let shape = Shape::from((42, 1337));
+        assert_eq!(shape.stride_contiguous(), [1337, 1]);
+        let shape = Shape::from((299, 792, 458));
+        assert_eq!(shape.stride_contiguous(), [458 * 792, 458, 1]);
+    }
+}
+
 pub trait ShapeWithOneHole {
    fn into_shape(self, el_count: usize) -> Result<Shape>;
 }
@ -610,20 +627,3 @@ impl ShapeWithOneHole for (usize, usize, usize, usize, ()) {
        Ok((d1, d2, d3, d4, d).into())
    }
 }
-
-#[cfg(test)]
-mod tests {
-    use super::*;
-
-    #[test]
-    fn stride() {
-        let shape = Shape::from(());
-        assert_eq!(shape.stride_contiguous(), Vec::<usize>::new());
-        let shape = Shape::from(42);
-        assert_eq!(shape.stride_contiguous(), [1]);
-        let shape = Shape::from((42, 1337));
-        assert_eq!(shape.stride_contiguous(), [1337, 1]);
-        let shape = Shape::from((299, 792, 458));
-        assert_eq!(shape.stride_contiguous(), [458 * 792, 458, 1]);
-    }
-}
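Note on the relocated `stride` test: the assertions above are consistent with row-major (C-contiguous) strides, where the last dimension has stride 1 and each earlier stride is the product of all later dimensions. The sketch below is a standalone illustration under that assumption. The free function `stride_contiguous` is a hypothetical stand-in written for this note, not the crate's `Shape::stride_contiguous` implementation.

```rust
/// Row-major (C-contiguous) strides, counted in elements, for `dims`.
/// Hypothetical helper for illustration only.
fn stride_contiguous(dims: &[usize]) -> Vec<usize> {
    let mut strides: Vec<usize> = Vec::with_capacity(dims.len());
    // Running product of the dimensions to the right of the current one.
    let mut acc = 1usize;
    for &dim in dims.iter().rev() {
        strides.push(acc);
        acc *= dim;
    }
    // Strides were collected from the last dimension to the first.
    strides.reverse();
    strides
}
```

Calling it with `&[299, 792, 458]` follows the same arithmetic as the last assertion in the test: the strides are `458 * 792`, `458`, and `1`.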
--- a/candle-core/src/tensor.rs
+++ b/candle-core/src/tensor.rs
@ -1,4 +1,4 @@
-//! Tensors are N-dimensional matrixes of elements using a single data type.
+//! Tensors are N-dimenional matrixes of elements using a single data type.
 #![allow(clippy::redundant_closure_call)]
 use crate::backend::{BackendDevice, BackendStorage};
 use crate::op::{
@ -361,16 +361,6 @@ impl Tensor {
        Self::new_impl(array, shape, device, false)
    }

-    /// Returns a new tensor with all the elements having the same specified value. Note that
-    /// the tensor is not contiguous so you would have to call `.contiguous()` on it if needed.
-    pub fn full<D: crate::WithDType, S: Into<Shape>>(
-        value: D,
-        shape: S,
-        device: &Device,
-    ) -> Result<Self> {
-        Self::from_vec_impl(vec![value], (), device, false)?.broadcast_as(shape)
-    }
-
    /// Creates a new 1D tensor from an iterator.
    pub fn from_iter<D: crate::WithDType>(
        iter: impl IntoIterator<Item = D>,
@ -396,7 +386,7 @@ impl Tensor {
        device: &Device,
    ) -> Result<Self> {
        if D::is_zero(&step) {
-            bail!("step cannot be zero")
+            crate::bail!("step cannot be zero")
        }
        let mut data = vec![];
        let mut current = start;
@ -426,9 +416,7 @@ impl Tensor {
        if buffer_size != shape.elem_count() {
            return Err(Error::ShapeMismatch { buffer_size, shape }.bt());
        }
-        // println!("from vec {buffer_size}");
        let storage = device.storage_owned(data)?;
-        // println!("Created storage");
        let none = BackpropOp::none();
        Ok(from_storage(storage, shape, none, is_variable))
    }
@ -681,7 +669,7 @@ impl Tensor {
    }

    /// Split a tensor into the specified number of chunks, this may return less chunks than
-    /// specified.
+    /// specificed.
    pub fn chunk<D: Dim>(&self, chunks: usize, dim: D) -> Result<Vec<Self>> {
        let dim = dim.to_index(self.shape(), "chunk")?;
        let size = self.dim(dim)?;
@ -1006,11 +994,7 @@ impl Tensor {
    /// tensor also has four dimensions, `(batch, channels, target_h, target_w)`.
    pub fn interpolate2d(&self, target_h: usize, target_w: usize) -> Result<Self> {
        let (n, c, _h, _w) = self.dims4()?;
-        let op = BackpropOp::new1(self, |arg| Op::UpsampleNearest2D {
-            arg,
-            target_h,
-            target_w,
-        });
+        let op = BackpropOp::new1(self, Op::UpsampleNearest2D);
        let storage = self
            .storage()
            .upsample_nearest2d(self.layout(), target_h, target_w)?;
@ -1043,9 +1027,6 @@ impl Tensor {
        let kernel_size = kernel_size.to_usize2();
        let stride = stride.to_usize2();
        let (n, c, h, w) = self.dims4()?;
-        if h < kernel_size.0 || w < kernel_size.1 {
-            bail!("kernel-size {kernel_size:?} is larger than the input size {h},{w}")
-        }
        // https://pytorch.org/docs/stable/generated/torch.nn.AvgPool2d.html#torch.nn.AvgPool2d
        let h_out = (h - kernel_size.0) / stride.0 + 1;
        let w_out = (w - kernel_size.1) / stride.1 + 1;
@ -1081,9 +1062,6 @@ impl Tensor {
        let kernel_size = kernel_size.to_usize2();
        let stride = stride.to_usize2();
        let (n, c, h, w) = self.dims4()?;
-        if h < kernel_size.0 || w < kernel_size.1 {
-            bail!("kernel-size {kernel_size:?} is larger than the input size {h},{w}")
-        }
        // https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html#torch.nn.MaxPool2d
        let h_out = (h - kernel_size.0) / stride.0 + 1;
        let w_out = (w - kernel_size.1) / stride.1 + 1;
@ -1806,7 +1784,7 @@ impl Tensor {
        let is_permutation =
            dims.len() == self.rank() && (0..dims.len()).all(|i| dims.contains(&i));
        if !is_permutation {
-            bail!(
+            crate::bail!(
                "dimension mismatch in permute, tensor {:?}, dims: {:?}",
                self.dims(),
                dims
@ -1885,7 +1863,10 @@ impl Tensor {
                    Storage::Metal(metal.storage_from_cpu_storage(storage)?)
                }
                (Storage::Cuda(storage), Device::Cpu) => Storage::Cpu(storage.to_cpu_storage()?),
-                (Storage::Metal(storage), Device::Cpu) => Storage::Cpu(storage.to_cpu_storage()?),
+                (Storage::Metal(storage), Device::Cpu) => {
+                    println!("{storage:?} - {:?}", storage.to_cpu_storage()?);
+                    Storage::Cpu(storage.to_cpu_storage()?)
+                }
                (Storage::Cuda(storage), Device::Cuda(cuda)) => {
                    // TODO: Avoid passing through the cpu storage here, especially if the gpu ids
                    // are the same.
@ -2301,7 +2282,7 @@ impl Tensor {
        if left == 0 && right == 0 {
            Ok(self.clone())
        } else if self.elem_count() == 0 {
-            bail!("cannot use pad_with_same on an empty tensor")
+            crate::bail!("cannot use pad_with_same on an empty tensor")
        } else if left == 0 {
            let dim = dim.to_index(self.shape(), "pad_with_same")?;
            let r = self.narrow(dim, self.dim(dim)? - 1, 1)?;
@ -2465,126 +2446,17 @@ impl Tensor {
    pub fn normalize_axis(&self, axis: i64) -> Result<usize> {
        let rank = self.rank() as i64;
        if rank <= axis {
-            bail!("axis {axis} is too large, tensor rank {rank}")
+            crate::bail!("axis {axis} is too large, tensor rank {rank}")
        } else if 0 <= axis {
            Ok(axis as usize)
        } else {
            let naxis = rank + axis;
            if naxis < 0 {
-                bail!("axis {axis} is too small, tensor rank {rank}")
+                crate::bail!("axis {axis} is too small, tensor rank {rank}")
            }
            Ok(naxis as usize)
        }
    }
-
-    /// Returns a lower triangular matrix of ones of size n by n.
-    pub fn tril2(n: usize, dtype: DType, device: &Device) -> Result<Self> {
-        let t = Tensor::arange(0u32, n as u32, device)?;
-        let t1 = t.reshape((1, n))?.broadcast_as((n, n))?;
-        let t2 = t.reshape((n, 1))?.broadcast_as((n, n))?;
-        t1.le(&t2)?.to_dtype(dtype)
-    }
-
-    /// Returns an upper triangular matrix of ones of size n by n.
-    pub fn triu2(n: usize, dtype: DType, device: &Device) -> Result<Self> {
-        let t = Tensor::arange(0u32, n as u32, device)?;
-        let t1 = t.reshape((1, n))?.broadcast_as((n, n))?;
-        let t2 = t.reshape((n, 1))?.broadcast_as((n, n))?;
-        t1.ge(&t2)?.to_dtype(dtype)
-    }
-
-    /// Returns a matrix with a diagonal of ones of size n by n.
-    pub fn eye(n: usize, dtype: DType, device: &Device) -> Result<Self> {
-        let t = Tensor::arange(0u32, n as u32, device)?;
-        let t1 = t.reshape((1, n))?.broadcast_as((n, n))?;
-        let t2 = t.reshape((n, 1))?.broadcast_as((n, n))?;
-        t1.eq(&t2)?.to_dtype(dtype)
-    }
-
-    /// Returns the cumulative sum of elements of the input tensor summed over the specified
-    /// dimension.
-    ///
-    /// This operation is most efficient when dim is the last dimension of the tensor.
-    pub fn cumsum<D: Dim>(&self, dim: D) -> Result<Self> {
-        let dim = dim.to_index(self.shape(), "cumsum")?;
-        let rank = self.rank();
-        if rank == 0 {
-            return Ok(self.clone());
-        }
-        let n_axis = self.dim(dim)?;
-        let triu = Tensor::triu2(n_axis, self.dtype(), self.device())?;
-        if rank == 1 {
-            self.unsqueeze(0)?.matmul(&triu)?.squeeze(0)
-        } else {
-            let last = rank - 1;
-            let t = self.transpose(dim, last)?;
-            let t = t.broadcast_matmul(&triu)?;
-            t.transpose(dim, last)
-        }
-    }
-
-    /// Returns a copy of `self` where the values within `ranges` have been replaced with the
-    /// content of `src`.
-    pub fn slice_assign<D: std::ops::RangeBounds<usize>>(
-        &self,
-        ranges: &[D],
-        src: &Tensor,
-    ) -> Result<Self> {
-        let src_dims = src.dims();
-        let self_dims = self.dims();
-        if self_dims.len() != src_dims.len() {
-            bail!(
-                "slice-assign requires input with the same rank {} <> {}",
-                self_dims.len(),
-                src_dims.len()
-            )
-        }
-        if self_dims.len() != ranges.len() {
-            bail!(
-                "slice-assign requires input with the same rank as there are ranges {} <> {}",
-                self_dims.len(),
-                ranges.len()
-            )
-        }
-        let mut src = src.clone();
-        let mut mask = Self::ones(src.shape(), DType::U8, src.device())?;
-        for (i, range) in ranges.iter().enumerate() {
-            let start_included = match range.start_bound() {
-                std::ops::Bound::Unbounded => 0,
-                std::ops::Bound::Included(v) => *v,
-                std::ops::Bound::Excluded(v) => *v + 1,
-            };
-            let end_excluded = match range.end_bound() {
-                std::ops::Bound::Unbounded => self_dims[i],
-                std::ops::Bound::Included(v) => *v + 1,
-                std::ops::Bound::Excluded(v) => *v,
-            };
-            if end_excluded <= start_included {
-                bail!("slice-assign: empty range for dim {i}, {start_included} {end_excluded}")
-            }
-            if self_dims[i] < end_excluded {
-                bail!(
-                    "slice-assign: upper bound is out of range for dim {i}, {end_excluded} {}",
-                    self_dims[i]
-                )
-            }
-            if end_excluded - start_included != src_dims[i] {
-                bail!(
-                    "slice-assign: the range for dim {i} ({start_included}..{end_excluded}) does not match the size of src {}", src_dims[i]
-                )
-            }
-            src = src.pad_with_zeros(i, start_included, self_dims[i] - end_excluded)?;
-            mask = mask.pad_with_zeros(i, start_included, self_dims[i] - end_excluded)?
-        }
-        mask.where_cond(/* on_true= */ &src, /* on_false= */ self)
-    }
-
-    /// Returns log(sum(exp(tensor), dim)).
-    pub fn logsumexp<D: Dims>(&self, sum_dims: D) -> Result<Self> {
-        let exp = self.exp()?;
-        let sum = exp.sum(sum_dims)?;
-        sum.log()
-    }
 }

 macro_rules! bin_trait {
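Note on the removed `cumsum`: it multiplies the input by the upper-triangular matrix of ones from `triu2`, so output element `j` is the sum of inputs `0..=j`. The sketch below reproduces that idea on plain `f32` slices for the rank-1 case only. The names `triu_ones` and `cumsum_via_matmul` are hypothetical, and nothing here uses the candle API.

```rust
/// n x n matrix with ones where column >= row, zeros elsewhere
/// (the same pattern the removed `triu2` builds with `t1.ge(&t2)`).
fn triu_ones(n: usize) -> Vec<Vec<f32>> {
    (0..n)
        .map(|i| (0..n).map(|j| if j >= i { 1.0 } else { 0.0 }).collect())
        .collect()
}

/// Cumulative sum of a 1D slice, written as a (1 x n) by (n x n) matmul.
fn cumsum_via_matmul(xs: &[f32]) -> Vec<f32> {
    let n = xs.len();
    let triu = triu_ones(n);
    (0..n)
        .map(|j| (0..n).map(|i| xs[i] * triu[i][j]).sum::<f32>())
        .collect()
}
```

This formulation costs O(n^2) work per row, which is presumably why the removed doc comment singles out the last dimension as the efficient case: no transposes are needed around the matmul.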
--- a/candle-core/tests/grad_tests.rs
+++ b/candle-core/tests/grad_tests.rs
@ -270,166 +270,6 @@ fn unary_grad(device: &Device) -> Result<()> {
        [0.7358, 2.0000, 0.2707, 1.0000]
    );

-    // manually checked: see comments
-    let x = Var::new(&[[[[1f32, 2., 3.], [4., 5., 6.], [7., 8., 9.]]]], device)?;
-    let y = x.interpolate2d(6, 6)?.reshape(36)?;
-
-    #[rustfmt::skip]
-    let z = Tensor::new(
-        &[
-            1_f32, 02., 03., 04., 05., 06.,
-            07.,   08., 09., 10., 11., 12.,
-            13.,   14., 15., 16., 17., 18.,
-            19.,   20., 21., 22., 23., 24.,
-            25.,   26., 27., 28., 29., 30.,
-            31.,   32., 33., 34., 35., 36.,
-        ],
-        device,
-    )?;
-    // gradient should be
-    // row 1
-    // 1+2+7+8 = 18
-    // 3+4+9+10 = 26
-    // 5+6+11+12 = 34
-    // row 2
-    // 13+14+19+20 = 66
-    // 15+16+21+22 = 74
-    // 17+18+23+24 = 82
-    // row 3
-    // 25+26+31+32 = 114
-    // 27+28+33+34 = 122
-    // 29+30+35+36 = 130
-    let loss = y.unsqueeze(1)?.transpose(0, 1)?.matmul(&z.unsqueeze(1)?)?;
-
-    let grads = loss.backward()?;
-
-    let grad_x = grads.get(&x).context("no grad for x")?;
-    assert_eq!(
-        test_utils::to_vec2_round(&grad_x.flatten(0, 2)?, 4)?,
-        [[18_f32, 26., 34.], [66., 74., 82.], [114., 122., 130.]]
-    );
-
-    // manually checked: see comments
-    let x = Var::new(&[[[[1f32, 2.], [4., 5.]]]], device)?;
-    let y = x.interpolate2d(6, 6)?.reshape(36)?;
-
-    #[rustfmt::skip]
-    let z = Tensor::new(
-        &[
-            1_f32, 02., 03., 04., 05., 06.,
-            07.,   08., 09., 10., 11., 12.,
-            13.,   14., 15., 16., 17., 18.,
-            19.,   20., 21., 22., 23., 24.,
-            25.,   26., 27., 28., 29., 30.,
-            31.,   32., 33., 34., 35., 36.,
-        ],
-        device,
-    )?;
-    // gradient should be
-    // row 1
-    // 1+2+3+7+8+9+13+14+15 = 72
-    // 4+5+6+10+11+12+16+17+18 = 99
-    // row 2
-    // 19+20+21+25+26+27+31+32+33 = 234
-    // 22+23+24+28+29+30+34+35+36 = 261
-    let loss = y.unsqueeze(1)?.transpose(0, 1)?.matmul(&z.unsqueeze(1)?)?;
-
-    let grads = loss.backward()?;
-
-    let grad_x = grads.get(&x).context("no grad for x")?;
-    assert_eq!(
-        test_utils::to_vec2_round(&grad_x.flatten(0, 2)?, 4)?,
-        [[72_f32, 99.], [234., 261.]]
-    );
-
-    // manually checked: see comments
-    let x = Var::new(&[[[[1f32, 2.], [4., 5.]], [[6f32, 7.], [8., 9.]]]], device)?;
-
-    let y = x.interpolate2d(4, 4)?.reshape(32)?;
-
-    #[rustfmt::skip]
-    let z = Tensor::new(
-        &[
-            1_f32, 02., 03., 04.,
-            05.,   06., 07., 08.,
-            09.,   10., 11., 12.,
-            13.,   14., 15., 16.,
-            17.,   18., 19., 20.,
-            21.,   22., 23., 24.,
-            25.,   26., 27., 28.,
-            29.,   30., 31., 32.
-        ],
-        device,
-    )?;
-    // gradient should be
-    // m1r1
-    // 1+2+5+6=14
-    // 3+4+7+8=22
-    // m1r2
-    // 9+10+13+14=46
-    // 11+12+15+16=54
-    // m2r1
-    // 17+18+21+22=78
-    // 19+20+23+24=86
-    // m2r2
-    // 25+26+29+30=110
-    // 27+28+31+32=118
-    let loss = y.unsqueeze(1)?.transpose(0, 1)?.matmul(&z.unsqueeze(1)?)?;
-
-    let grads = loss.backward()?;
-
-    let grad_x = grads.get(&x).context("no grad for x")?;
-
-    assert_eq!(
-        test_utils::to_vec3_round(&grad_x.flatten(0, 1)?, 4)?,
-        [[[14_f32, 22.], [46., 54.]], [[78., 86.], [110., 118.]]]
-    );
-
-    // manually checked: see comments
-    let x = Var::new(
-        &[[[[1f32, 2.], [4., 5.]]], [[[6f32, 7.], [8., 9.]]]],
-        device,
-    )?;
-
-    let y = x.interpolate2d(4, 4)?.reshape(32)?;
-
-    #[rustfmt::skip]
-       let z = Tensor::new(
-           &[
-               1_f32, 02., 03., 04.,
-               05.,   06., 07., 08.,
-               09.,   10., 11., 12.,
-               13.,   14., 15., 16.,
-               17.,   18., 19., 20.,
-               21.,   22., 23., 24.,
-               25.,   26., 27., 28.,
-               29.,   30., 31., 32.
-           ],
-           device,
-       )?;
-    // gradient should be
-    // m1r1
-    // 1+2+5+6=14
-    // 3+4+7+8=22
-    // m1r2
-    // 9+10+13+14=46
-    // 11+12+15+16=54
-    // m2r1
-    // 17+18+21+22=78
-    // 19+20+23+24=86
-    // m2r2
-    // 25+26+29+30=110
-    // 27+28+31+32=118
-    let loss = y.unsqueeze(1)?.transpose(0, 1)?.matmul(&z.unsqueeze(1)?)?;
-
-    let grads = loss.backward()?;
-
-    let grad_x = grads.get(&x).context("no grad for x")?;
-
-    assert_eq!(
-        test_utils::to_vec3_round(&grad_x.flatten(0, 1)?, 4)?,
-        [[[14_f32, 22.], [46., 54.]], [[78., 86.], [110., 118.]]]
-    );
    Ok(())
 }

--- a/candle-core/tests/indexing_tests.rs
+++ b/candle-core/tests/indexing_tests.rs
@ -91,32 +91,3 @@ fn index_3d() -> Result<()> {
    assert_eq!(tensor.i((1, .., 3))?.to_vec1::<u32>()?, &[15, 19, 23]);
    Ok(())
 }
-
-#[test]
-fn slice_assign() -> Result<()> {
-    let dev = Device::Cpu;
-
-    let tensor = Tensor::arange(0u32, 4 * 5, &dev)?.reshape((4, 5))?;
-    let src = Tensor::arange(0u32, 2 * 3, &dev)?.reshape((3, 2))?;
-    let out = tensor.slice_assign(&[1..4, 3..5], &src)?;
-    assert_eq!(
-        out.to_vec2::<u32>()?,
-        &[
-            [0, 1, 2, 3, 4],
-            [5, 6, 7, 0, 1],
-            [10, 11, 12, 2, 3],
-            [15, 16, 17, 4, 5]
-        ]
-    );
-    let out = tensor.slice_assign(&[0..3, 0..2], &src)?;
-    assert_eq!(
-        out.to_vec2::<u32>()?,
-        &[
-            [0, 1, 2, 3, 4],
-            [2, 3, 7, 8, 9],
-            [4, 5, 12, 13, 14],
-            [15, 16, 17, 18, 19]
-        ]
-    );
-    Ok(())
-}
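Note on the removed `slice_assign` and its test: the implementation zero-pads `src` and a mask of ones up to the destination size, then selects per element between the padded source and the original tensor. The sketch below shows the same pad-and-select scheme for a single dimension on plain slices. The function `slice_assign_1d` and its `String` error type are assumptions made for this note and do not mirror the candle signatures.

```rust
use std::iter::repeat;

/// Returns a copy of `dst` where `dst[start..end]` is replaced by `src`.
/// Hypothetical 1D illustration of the pad-and-select approach.
fn slice_assign_1d(
    dst: &[u32],
    start: usize,
    end: usize,
    src: &[u32],
) -> Result<Vec<u32>, String> {
    if end <= start {
        return Err(format!("empty range {start}..{end}"));
    }
    if dst.len() < end {
        return Err(format!("upper bound {end} is out of range {}", dst.len()));
    }
    if end - start != src.len() {
        return Err(format!("range {start}..{end} does not match src size {}", src.len()));
    }
    let (left, right) = (start, dst.len() - end);
    // Zero-pad the source and a mask of ones to the destination length.
    let padded: Vec<u32> = repeat(0)
        .take(left)
        .chain(src.iter().copied())
        .chain(repeat(0).take(right))
        .collect();
    let mask: Vec<u8> = repeat(0u8)
        .take(left)
        .chain(repeat(1u8).take(src.len()))
        .chain(repeat(0u8).take(right))
        .collect();
    // Where the mask is set take the padded source, otherwise keep `dst`.
    Ok(mask
        .iter()
        .zip(padded.iter())
        .zip(dst.iter())
        .map(|((&m, &s), &d)| if m != 0 { s } else { d })
        .collect())
}
```

The three checks appear in the same order as the `bail!` calls in the removed method: empty range, upper bound out of range, and size mismatch with `src`.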
--- a/candle-core/tests/quantized_tests.rs
+++ b/candle-core/tests/quantized_tests.rs
@ -1,8 +1,7 @@
 use candle_core::{
    quantized::{self, GgmlDType},
-    test_device,
    test_utils::to_vec2_round,
-    Device, Module, Result, Tensor,
+    Device, Result, Tensor,
 };
 use quantized::{k_quants, GgmlType};
 use rand::prelude::*;
@ -14,44 +13,16 @@ const GGML_MAX_QUANTIZATION_TOTAL_ERROR_2BITS: f32 = 0.0075;
 const GGML_MAX_QUANTIZATION_TOTAL_ERROR_3BITS: f32 = 0.0040;
 const GGML_MAX_DOT_PRODUCT_ERROR: f32 = 0.02;

-fn test_matmul(
-    device: &Device,
-    (b, m, n, k): (usize, usize, usize, usize),
-    dtype: GgmlDType,
-) -> Result<()> {
-    let lhs = (0..(m * k))
-        .map(|v| v as f32 / (m * k) as f32)
-        .collect::<Vec<_>>();
-    let rhs = (0..(k * n))
-        .map(|v| v as f32 / (n * k) as f32)
-        .collect::<Vec<_>>();
-
-    let lhs = Tensor::from_slice(&lhs, (m, k), device)?;
-    let rhs = Tensor::from_slice(&rhs, (k, n), device)?;
-    let mm = lhs.matmul(&rhs)?;
-    let qtensor = quantized::QTensor::quantize(&rhs.t()?, dtype)?;
-    let matmul = quantized::QMatMul::from_qtensor(qtensor)?;
-    let res = matmul.forward(&lhs)?;
-
-    let error: f32 = ((&mm - &res)?.abs()? / &mm.abs()?)?
-        .sum_all()?
-        .to_scalar()?;
-    let error = error / (b * m * n) as f32;
-    assert!(
-        error <= 0.02,
-        "Error {error} is too big. \nExpected:\n {mm} \nFound:\n {res}\n for {dtype:?}"
-    );
-
-    Ok(())
-}
-
-fn quantized_matmul(device: &Device) -> Result<()> {
+#[test]
+fn quantized_matmul() -> Result<()> {
+    let cpu = &Device::Cpu;
    let (m, k, n) = (3, 64, 4);
    let lhs = (0..(m * k)).map(|v| v as f32).collect::<Vec<_>>();
-    let tensor_lhs = Tensor::from_slice(&lhs, (m, k), device)?;
+    let tensor_lhs = Tensor::from_slice(&lhs, (m, k), cpu)?;
    let mut dst = vec![42.; 3 * 4];
    let mut rhs_t = vec![k_quants::BlockQ4_0::zeros(); 8];
    let rhs = (0..(k * n)).map(|v| v as f32).collect::<Vec<_>>();
+    let tensor_rhs = Tensor::from_slice(&rhs, (n, k), cpu)?.t()?;
    k_quants::BlockQ4_0::from_float(&rhs, &mut rhs_t)?;
    k_quants::matmul((m, k, n), &lhs, &rhs_t, &mut dst)?;
    assert_eq!(
@ -61,7 +32,6 @@ fn quantized_matmul(device: &Device) -> Result<()> {
            341876.0, 994283.0, 1655709.0, 2301518.0
        ]
    );
-    let tensor_rhs = Tensor::from_slice(&rhs, (n, k), device)?.t()?;
    let mm = tensor_lhs.matmul(&tensor_rhs)?;
    assert_eq!(
        mm.to_vec2::<f32>()?,
@ -72,45 +42,35 @@ fn quantized_matmul(device: &Device) -> Result<()> {
        ]
    );

-    let qtensor = quantized::QTensor::quantize(&tensor_rhs.t()?, GgmlDType::Q4_0)?;
+    let qtensor = quantized::QTensor::new(rhs_t, (4, 64))?;
    let matmul = quantized::QMatMul::from_qtensor(qtensor)?;
    let res = matmul.forward(&tensor_lhs)?;
-    match device {
-        Device::Metal(_) => assert_eq!(
-            to_vec2_round(&res, 0)?,
-            &[
-                [84946.0, 214126.0, 344757.0, 473798.0],
-                [213458.0, 604350.0, 1000469.0, 1387990.0],
-                [341970.0, 994574.0, 1656181.0, 2302182.0]
-            ]
-        ),
-        _ => assert_eq!(
-            to_vec2_round(&res, 0)?,
-            &[
-                [85120.0, 214562.0, 345455.0, 474748.0],
-                [213475.0, 604465.0, 1000686.0, 1388317.0],
-                [341876.0, 994283.0, 1655709.0, 2301518.0]
-            ]
-        ),
-    }
-
-    test_matmul(device, (1, 3, 4, 256), GgmlDType::Q4_0)?;
+    assert_eq!(
+        to_vec2_round(&res, 0)?,
+        &[
+            [85120.0, 214562.0, 345455.0, 474748.0],
+            [213475.0, 604465.0, 1000686.0, 1388317.0],
+            [341876.0, 994283.0, 1655709.0, 2301518.0]
+        ]
+    );

    Ok(())
 }

-fn quantized_matmul_neg(device: &Device) -> Result<()> {
+#[test]
+fn quantized_matmul_neg() -> Result<()> {
+    let cpu = &Device::Cpu;
    let (m, k, n) = (3, 64, 4);
    let lhs = (0..(m * k))
        .map(|v| v as f32 - (m * k) as f32 / 2.0)
        .collect::<Vec<_>>();
-    let tensor_lhs = Tensor::from_slice(&lhs, (m, k), device)?;
+    let tensor_lhs = Tensor::from_slice(&lhs, (m, k), cpu)?;
    let mut dst = vec![42.; 3 * 4];
    let mut rhs_t = vec![k_quants::BlockQ4_0::zeros(); 8];
    let rhs = (0..k * n)
        .map(|v| v as f32 - (k * n) as f32 / 3.0)
        .collect::<Vec<_>>();
-    let tensor_rhs = Tensor::from_slice(&rhs, (n, k), device)?.t()?;
+    let tensor_rhs = Tensor::from_slice(&rhs, (n, k), cpu)?.t()?;
    k_quants::BlockQ4_0::from_float(&rhs, &mut rhs_t)?;
    k_quants::matmul((m, k, n), &lhs, &rhs_t, &mut dst)?;
    assert_eq!(
@ -130,52 +90,32 @@ fn quantized_matmul_neg(device: &Device) -> Result<()> {
        ]
    );

-    let qtensor = quantized::QTensor::quantize(&tensor_rhs.t()?, GgmlDType::Q4_0)?;
+    let qtensor = quantized::QTensor::new(rhs_t, (4, 64))?;
    let matmul = quantized::QMatMul::from_qtensor(qtensor)?;
    let res = matmul.forward(&tensor_lhs)?;
-    match device {
-        Device::Metal(_) => assert_eq!(
-            to_vec2_round(&res, 0)?,
-            &[
-                [243666.0, -19714.0, -285433.0, -550453.0],
-                [23782.0, 21654.0, 19400.0, 18369.0],
-                [-196102.0, 63022.0, 324233.0, 587191.0]
-            ]
-        ),
-        _ => assert_eq!(
-            to_vec2_round(&res, 0)?,
-            &[
-                [243524.0, -19596.0, -285051.0, -549815.0],
-                [23777.0, 21651.0, 19398.0, 18367.0],
-                [-196472.0, 63012.0, 324585.0, 587902.0]
-            ]
-        ),
-    }
+    assert_eq!(
+        to_vec2_round(&res, 0)?,
+        &[
+            [243524.0, -19596.0, -285051.0, -549815.0],
+            [23777.0, 21651.0, 19398.0, 18367.0],
+            [-196472.0, 63012.0, 324585.0, 587902.0]
+        ]
+    );

    Ok(())
 }

-test_device!(
-    quantized_matmul,
-    quantized_matmul_cpu,
-    quantized_matmul_cuda,
-    quantized_matmul_metal
-);
-test_device!(
-    quantized_matmul_neg,
-    quantized_matmul_neg_cpu,
-    quantized_matmul_neg_cuda,
-    quantized_matmul_neg_metal
-);
+#[test]
+fn quantize_q4_0() -> Result<()> {
+    use k_quants::BlockQ4_0;

-fn quantize_q4_0(device: &Device) -> Result<()> {
    let src = (0..32 * 4).map(|v| v as f32).collect::<Vec<_>>();
-
-    let src = Tensor::from_slice(&src, (32 * 4,), device)?;
-    let quant = quantized::QTensor::quantize(&src, GgmlDType::Q4_0)?;
-    let dst = quant.dequantize(device)?;
+    let mut dst = vec![0f32; 32 * 4];
+    let mut quant = vec![BlockQ4_0::zeros(); 4];
+    BlockQ4_0::from_float(&src, &mut quant)?;
+    BlockQ4_0::to_float(&quant, dst.as_mut_slice())?;
    assert_eq!(
-        dst.to_vec1::<f32>()?,
+        dst,
        &[
            -0.0, -0.0, 3.875, 3.875, 3.875, 3.875, 7.75, 7.75, 7.75, 7.75, 11.625, 11.625, 11.625,
            11.625, 15.5, 15.5, 15.5, 15.5, 19.375, 19.375, 19.375, 19.375, 23.25, 23.25, 23.25,
@ -191,17 +131,21 @@ fn quantize_q4_0(device: &Device) -> Result<()> {
            127.0, 127.0
        ]
    );
-    ggml_quantization_error_test(GgmlDType::Q4_0, device, GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
+    ggml_quantization_error_test::<BlockQ4_0>(GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
    Ok(())
 }

-fn quantize_q4_1(device: &Device) -> Result<()> {
+#[test]
+fn quantize_q4_1() -> Result<()> {
+    use k_quants::BlockQ4_1;
+
    let src = (0..32 * 4).map(|v| v as f32).collect::<Vec<_>>();
-    let src = Tensor::from_slice(&src, (32 * 4,), device)?;
-    let quant = quantized::QTensor::quantize(&src, GgmlDType::Q4_1)?;
-    let dst = quant.dequantize(device)?;
+    let mut dst = vec![0f32; 32 * 4];
+    let mut quant = vec![BlockQ4_1::zeros(); 4];
+    BlockQ4_1::from_float(&src, &mut quant)?;
+    BlockQ4_1::to_float(&quant, dst.as_mut_slice())?;
    assert_eq!(
-        round_vector(&dst.to_vec1::<f32>()?),
+        round_vector(&dst),
        &[
            0.0, 0.0, 2.066, 2.066, 4.133, 4.133, 6.199, 6.199, 8.266, 8.266, 10.332, 10.332,
            12.398, 12.398, 14.465, 14.465, 16.531, 16.531, 18.598, 18.598, 20.664, 20.664, 22.73,
@ -217,17 +161,21 @@ fn quantize_q4_1(device: &Device) -> Result<()> {
            118.73, 118.73, 120.797, 120.797, 122.863, 122.863, 124.93, 124.93, 126.996, 126.996
        ]
    );
-    ggml_quantization_error_test(GgmlDType::Q4_1, device, GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
+    ggml_quantization_error_test::<BlockQ4_1>(GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
    Ok(())
 }

-fn quantize_q5_0(device: &Device) -> Result<()> {
+#[test]
+fn quantize_q5_0() -> Result<()> {
+    use k_quants::BlockQ5_0;
+
    let src = (0..32 * 4).map(|v| v as f32).collect::<Vec<_>>();
-    let src = Tensor::from_slice(&src, (32 * 4,), device)?;
-    let quant = quantized::QTensor::quantize(&src, GgmlDType::Q5_0)?;
-    let dst = quant.dequantize(device)?;
+    let mut dst = vec![0f32; 32 * 4];
+    let mut quant = vec![BlockQ5_0::zeros(); 4];
+    BlockQ5_0::from_float(&src, &mut quant)?;
+    BlockQ5_0::to_float(&quant, dst.as_mut_slice())?;
    assert_eq!(
-        round_vector(&dst.to_vec1::<f32>()?),
+        round_vector(&dst),
        &[
            -0.0, 1.938, 1.938, 3.875, 3.875, 5.813, 5.813, 7.75, 7.75, 9.688, 9.688, 11.625,
            11.625, 13.563, 13.563, 15.5, 15.5, 17.438, 17.438, 19.375, 19.375, 21.313, 21.313,
@ -243,17 +191,21 @@ fn quantize_q5_0(device: &Device) -> Result<()> {
            119.063, 119.063, 119.063, 119.063, 127.0, 127.0, 127.0, 127.0
        ]
    );
-    ggml_quantization_error_test(GgmlDType::Q5_0, device, GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
+    ggml_quantization_error_test::<BlockQ5_0>(GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
    Ok(())
 }

-fn quantize_q5_1(device: &Device) -> Result<()> {
+#[test]
+fn quantize_q5_1() -> Result<()> {
+    use k_quants::BlockQ5_1;
+
    let src = (0..32 * 4).map(|v| v as f32).collect::<Vec<_>>();
-    let src = Tensor::from_slice(&src, (32 * 4,), device)?;
-    let quant = quantized::QTensor::quantize(&src, GgmlDType::Q5_1)?;
-    let dst = quant.dequantize(device)?;
+    let mut dst = vec![0f32; 32 * 4];
+    let mut quant = vec![BlockQ5_1::zeros(); 4];
+    BlockQ5_1::from_float(&src, &mut quant)?;
+    BlockQ5_1::to_float(&quant, dst.as_mut_slice())?;
    assert_eq!(
-        round_vector(&dst.to_vec1::<f32>()?),
+        dst,
        &[
            0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0,
            16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0,
@ -267,11 +219,13 @@ fn quantize_q5_1(device: &Device) -> Result<()> {
            124.0, 125.0, 126.0, 127.0
        ]
    );
-    ggml_quantization_error_test(GgmlDType::Q5_1, device, GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
+
+    ggml_quantization_error_test::<BlockQ5_1>(GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
    Ok(())
 }

-fn get_test_vector2(bound: f32, size: usize, device: &Device) -> Result<Tensor> {
+/// Generates a small test vector ranging from -`bound` to `bound` with `size` steps
+fn get_test_vector(bound: f32, size: usize) -> (Vec<f32>, Vec<f32>) {
    assert!(
        size % crate::quantized::k_quants::QK_K == 0,
        "size must be a multiple of {}",
@ -281,8 +235,10 @@ fn get_test_vector2(bound: f32, size: usize, device: &Device) -> Result<Tensor>
    let src = (0..size)
        .map(|v| (v as f32 - size as f32 / 2.) * bound / (size as f32 / 2.))
        .collect::<Vec<_>>();
+
+    let dst = vec![0f32; size];
    assert_eq!([src[0], src[size / 2]], [-bound, 0.0]);
-    Tensor::from_vec(src, (size,), device)
+    (src, dst)
 }

 /// Round a vector
@ -329,12 +285,11 @@ fn calculate_rmse(a: &[f32], b: &[f32]) -> f32 {
 }

 /// Mirrors the GGML quantization unit test: https://github.com/ggerganov/llama.cpp/blob/master/tests/test-quantize-fns.cpp#L43-L50
-fn ggml_quantization_error_test(dtype: GgmlDType, device: &Device, max_error: f32) -> Result<()> {
+fn ggml_quantization_error_test<T: GgmlType>(max_error: f32) -> Result<()> {
    let src = create_ggml_like_vector(0.0);
-    let src = Tensor::from_slice(&src, (GGML_TEST_SIZE,), device)?;
-    let quant = quantized::QTensor::quantize(&src, dtype)?;
-    let dst = quant.dequantize(device)?;
-    let error = calculate_rmse(&src.to_vec1::<f32>()?, &dst.to_vec1::<f32>()?);
+    let mut dst = vec![0.0; GGML_TEST_SIZE];
+    let _quant = quantize_roundtrip::<T>(src.as_slice(), dst.as_mut_slice())?;
+    let error = calculate_rmse(src.as_slice(), dst.as_slice());
    if error > max_error {
        candle_core::bail!(
            "Quantization error {} exceeds max error {}",
@ -345,15 +300,19 @@ fn ggml_quantization_error_test(dtype: GgmlDType, device: &Device, max_error: f3
    Ok(())
 }

-fn quantize_q2k(device: &Device) -> Result<()> {
-    let dtype = GgmlDType::Q2K;
+fn quantize_roundtrip<T: GgmlType>(src: &[f32], dst: &mut [f32]) -> Result<Vec<T>> {
+    let mut quant = vec![T::zeros(); src.len() / T::BLCK_SIZE];
+    T::from_float(src, &mut quant)?;
+    T::to_float(&quant, dst)?;
+    Ok(quant)
+}

-    let src = get_test_vector2(0.5, 1024, device)?;
-    let quant = quantized::QTensor::quantize(&src, dtype)?;
-    let dst = quant.dequantize(device)?;
+#[test]
+fn quantize_q2k() -> Result<()> {
+    use k_quants::BlockQ2K;

-    let src = src.to_vec1::<f32>()?;
-    let dst = dst.to_vec1::<f32>()?;
+    let (src, mut dst) = get_test_vector(0.5, 1024);
+    let _quant = quantize_roundtrip::<BlockQ2K>(src.as_slice(), dst.as_mut_slice())?;
    compare_with_error(dst.as_slice(), src.as_slice(), 0.1);

    // Test some specific values
@ -367,26 +326,20 @@ fn quantize_q2k(device: &Device) -> Result<()> {
        [-0.499, -0.366, -0.249, 0.0, 0.295, 0.492]
    );

-    let src_big = get_test_vector2(128.0, 1024, device)?;
-    let quant_big = quantized::QTensor::quantize(&src_big, dtype)?;
-    let dst_big = quant_big.dequantize(device)?;
-
-    let src_big = src_big.to_vec1::<f32>()?;
-    let dst_big = dst_big.to_vec1::<f32>()?;
+    let (src_big, mut dst_big) = get_test_vector(128.0, 1024);
+    let _quant_big = quantize_roundtrip::<BlockQ2K>(src_big.as_slice(), dst_big.as_mut_slice())?;
    compare_with_error(dst_big.as_slice(), src_big.as_slice(), 6.0);

-    ggml_quantization_error_test(dtype, device, GGML_MAX_QUANTIZATION_TOTAL_ERROR_2BITS)?;
+    ggml_quantization_error_test::<BlockQ2K>(GGML_MAX_QUANTIZATION_TOTAL_ERROR_2BITS)?;
    Ok(())
 }

-fn quantize_q3k(device: &Device) -> Result<()> {
-    let dtype = GgmlDType::Q3K;
-    let src = get_test_vector2(0.5, 1024, device)?;
-    let quant = quantized::QTensor::quantize(&src, dtype)?;
-    let dst = quant.dequantize(device)?;
+#[test]
+fn quantize_q3k() -> Result<()> {
+    use k_quants::BlockQ3K;

-    let src = src.to_vec1::<f32>()?;
-    let dst = dst.to_vec1::<f32>()?;
+    let (src, mut dst) = get_test_vector(0.5, 1024);
+    let _quant = quantize_roundtrip::<BlockQ3K>(src.as_slice(), dst.as_mut_slice())?;
    compare_with_error(dst.as_slice(), src.as_slice(), 0.03);

    // Test some specific values
@ -400,26 +353,20 @@ fn quantize_q3k(device: &Device) -> Result<()> {
        [-0.493, -0.37, -0.243, -0.0, 0.292, 0.492]
    );

-    let src_big = get_test_vector2(128.0, 1024, device)?;
-    let quant_big = quantized::QTensor::quantize(&src_big, dtype)?;
-    let dst_big = quant_big.dequantize(device)?;
-
-    let src_big = src_big.to_vec1::<f32>()?;
-    let dst_big = dst_big.to_vec1::<f32>()?;
+    let (src_big, mut dst_big) = get_test_vector(128.0, 1024);
+    let _quant_big = quantize_roundtrip::<BlockQ3K>(src_big.as_slice(), dst_big.as_mut_slice())?;
    compare_with_error(dst_big.as_slice(), src_big.as_slice(), 3.5);

-    ggml_quantization_error_test(dtype, device, GGML_MAX_QUANTIZATION_TOTAL_ERROR_3BITS)?;
+    ggml_quantization_error_test::<BlockQ3K>(GGML_MAX_QUANTIZATION_TOTAL_ERROR_3BITS)?;
    Ok(())
 }

-fn quantize_q4k(device: &Device) -> Result<()> {
-    let dtype = GgmlDType::Q4K;
-    let src = get_test_vector2(0.5, 1024, device)?;
-    let quant = quantized::QTensor::quantize(&src, dtype)?;
-    let dst = quant.dequantize(device)?;
+#[test]
+fn quantize_q4k() -> Result<()> {
+    use k_quants::BlockQ4K;

-    let src = src.to_vec1::<f32>()?;
-    let dst = dst.to_vec1::<f32>()?;
+    let (src, mut dst) = get_test_vector(0.5, 1024);
+    let _quant = quantize_roundtrip::<BlockQ4K>(src.as_slice(), dst.as_mut_slice())?;
    compare_with_error(dst.as_slice(), src.as_slice(), 0.017);

    // Test some specific values
@ -433,27 +380,21 @@ fn quantize_q4k(device: &Device) -> Result<()> {
        [-0.5, -0.373, -0.25, 0.0, 0.288, 0.498]
    );

-    let src_big = get_test_vector2(128.0, 1024, device)?;
-    let quant_big = quantized::QTensor::quantize(&src_big, dtype)?;
-    let dst_big = quant_big.dequantize(device)?;
-
-    let src_big = src_big.to_vec1::<f32>()?;
-    let dst_big = dst_big.to_vec1::<f32>()?;
+    let (src_big, mut dst_big) = get_test_vector(128.0, 1024);
+    let _quant_big = quantize_roundtrip::<BlockQ4K>(src_big.as_slice(), dst_big.as_mut_slice())?;
    compare_with_error(dst_big.as_slice(), src_big.as_slice(), 4.5);

-    ggml_quantization_error_test(dtype, device, GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
+    ggml_quantization_error_test::<BlockQ4K>(GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
    Ok(())
 }

-fn quantize_q5k(device: &Device) -> Result<()> {
-    let dtype = GgmlDType::Q5K;
-    let src = get_test_vector2(0.5, 1024, device)?;
-    let quant = quantized::QTensor::quantize(&src, dtype)?;
-    let dst = quant.dequantize(device)?;
+#[test]
+fn quantize_q5k() -> Result<()> {
+    use k_quants::BlockQ5K;

-    let src = src.to_vec1::<f32>()?;
-    let dst = dst.to_vec1::<f32>()?;
-    compare_with_error(dst.as_slice(), src.as_slice(), 0.009);
+    let (src, mut dst) = get_test_vector(0.5, 1024);
+    let _quant = quantize_roundtrip::<BlockQ5K>(src.as_slice(), dst.as_mut_slice())?;
+    compare_with_error(dst.as_slice(), src.as_slice(), 0.008);

    // Test some specific values
    assert_eq!(
@ -466,26 +407,21 @@ fn quantize_q5k(device: &Device) -> Result<()> {
        [-0.499, -0.372, -0.249, 0.001, 0.279, 0.499]
    );

-    let src_big = get_test_vector2(128.0, 1024, device)?;
-    let quant_big = quantized::QTensor::quantize(&src_big, dtype)?;
-    let dst_big = quant_big.dequantize(device)?;
-
-    let src_big = src_big.to_vec1::<f32>()?;
-    let dst_big = dst_big.to_vec1::<f32>()?;
+    let (src_big, mut dst_big) = get_test_vector(128.0, 1024);
+    let _quant_big = quantize_roundtrip::<BlockQ5K>(src_big.as_slice(), dst_big.as_mut_slice())?;
    compare_with_error(dst_big.as_slice(), src_big.as_slice(), 2.5);

-    ggml_quantization_error_test(dtype, device, GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
+    ggml_quantization_error_test::<BlockQ5K>(GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
+
    Ok(())
 }

-fn quantize_q6k(device: &Device) -> Result<()> {
-    let dtype = GgmlDType::Q6K;
-    let src = get_test_vector2(0.5, 1024, device)?;
-    let quant = quantized::QTensor::quantize(&src, dtype)?;
-    let dst = quant.dequantize(device)?;
+#[test]
+fn quantize_q6k() -> Result<()> {
+    use k_quants::BlockQ6K;

-    let src = src.to_vec1::<f32>()?;
-    let dst = dst.to_vec1::<f32>()?;
+    let (src, mut dst) = get_test_vector(0.5, 1024);
+    let _quant = quantize_roundtrip::<BlockQ6K>(src.as_slice(), dst.as_mut_slice())?;
    compare_with_error(dst.as_slice(), src.as_slice(), 0.008);

    // Test some specific values
@ -499,27 +435,22 @@ fn quantize_q6k(device: &Device) -> Result<()> {
        [-0.497, -0.372, -0.25, -0.0, 0.284, 0.5]
    );

-    let src_big = get_test_vector2(128.0, 1024, device)?;
-    let quant_big = quantized::QTensor::quantize(&src_big, dtype)?;
-    let dst_big = quant_big.dequantize(device)?;
-
-    let src_big = src_big.to_vec1::<f32>()?;
-    let dst_big = dst_big.to_vec1::<f32>()?;
+    let (src_big, mut dst_big) = get_test_vector(128.0, 1024);
+    let _quant_big = quantize_roundtrip::<BlockQ6K>(src_big.as_slice(), dst_big.as_mut_slice())?;
    compare_with_error(dst_big.as_slice(), src_big.as_slice(), 2.0);

-    ggml_quantization_error_test(dtype, device, GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
+    ggml_quantization_error_test::<BlockQ6K>(GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
+
    Ok(())
 }

-fn quantize_q8k(device: &Device) -> Result<()> {
-    let dtype = GgmlDType::Q8K;
-    let src = get_test_vector2(0.5, 1024, device)?;
-    let quant = quantized::QTensor::quantize(&src, dtype)?;
-    let dst = quant.dequantize(device)?;
+#[test]
+fn quantize_q8k() -> Result<()> {
+    use k_quants::BlockQ8K;

-    let src = src.to_vec1::<f32>()?;
-    let dst = dst.to_vec1::<f32>()?;
-    compare_with_error(dst.as_slice(), src.as_slice(), 0.008);
+    let (src, mut dst) = get_test_vector(0.5, 1024);
+    let _quant = quantize_roundtrip::<BlockQ8K>(src.as_slice(), dst.as_mut_slice())?;
+    compare_with_error(dst.as_slice(), src.as_slice(), 0.003);

    // Test some specific values
    assert_eq!(
@ -532,79 +463,15 @@ fn quantize_q8k(device: &Device) -> Result<()> {
        [-0.5, -0.375, -0.25, -0.0, 0.281, 0.499]
    );

-    let src_big = get_test_vector2(128.0, 1024, device)?;
-    let quant_big = quantized::QTensor::quantize(&src_big, dtype)?;
-    let dst_big = quant_big.dequantize(device)?;
-
-    let src_big = src_big.to_vec1::<f32>()?;
-    let dst_big = dst_big.to_vec1::<f32>()?;
+    let (src_big, mut dst_big) = get_test_vector(128.0, 1024);
+    let _quant_big = quantize_roundtrip::<BlockQ8K>(src_big.as_slice(), dst_big.as_mut_slice())?;
    compare_with_error(dst_big.as_slice(), src_big.as_slice(), 0.6);

-    ggml_quantization_error_test(dtype, device, GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
+    ggml_quantization_error_test::<BlockQ8K>(GGML_MAX_QUANTIZATION_TOTAL_ERROR)?;
+
    Ok(())
 }

-test_device!(
-    quantize_q4_0,
-    quantize_q4_0_cpu,
-    quantize_q4_0_cuda,
-    quantize_q4_0_metal
-);
-test_device!(
-    quantize_q4_1,
-    quantize_q4_1_cpu,
-    quantize_q4_1_cuda,
-    quantize_q4_1_metal
-);
-test_device!(
-    quantize_q5_0,
-    quantize_q5_0_cpu,
-    quantize_q5_0_cuda,
-    quantize_q5_0_metal
-);
-test_device!(
-    quantize_q5_1,
-    quantize_q5_1_cpu,
-    quantize_q5_1_cuda,
-    quantize_q5_1_metal
-);
-test_device!(
-    quantize_q2k,
-    quantize_q2k_cpu,
-    quantize_q2k_cuda,
-    quantize_q2k_metal
-);
-test_device!(
-    quantize_q3k,
-    quantize_q3k_cpu,
-    quantize_q3k_cuda,
-    quantize_q3k_metal
-);
-test_device!(
-    quantize_q4k,
-    quantize_q4k_cpu,
-    quantize_q4k_cuda,
-    quantize_q4k_metal
-);
-test_device!(
-    quantize_q5k,
-    quantize_q5k_cpu,
-    quantize_q5k_cuda,
-    quantize_q5k_metal
-);
-test_device!(
-    quantize_q6k,
-    quantize_q6k_cpu,
-    quantize_q6k_cuda,
-    quantize_q6k_metal
-);
-test_device!(
-    quantize_q8k,
-    quantize_q8k_cpu,
-    quantize_q8k_cuda,
-    quantize_q8k_metal
-);
-
 /// Very simple dot product implementation
 fn vec_dot_reference(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(a, b)| a * b).sum()
@ -699,108 +566,6 @@ fn get_random_tensors(
    Ok((lhs, rhs, mm))
 }

-#[macro_export]
-macro_rules! quantized_matmul {
-    // TODO: Switch to generating the two last arguments automatically once concat_idents is
-    // stable. https://github.com/rust-lang/rust/issues/29599
-    ($fn_name: ident, $fn_name_cpu: ident, $fn_name_cuda: ident, $fn_name_metal: ident, $dtype: expr) => {
-        fn $fn_name(device: &Device) -> Result<()> {
-            test_matmul(device, (1, 3, 4, 256), $dtype)?;
-            Ok(())
-        }
-
-        test_device!($fn_name, $fn_name_cpu, $fn_name_cuda, $fn_name_metal);
-    };
-}
-
-quantized_matmul!(
-    quantized_matmul_q4_0_bis,
-    quantized_matmul_q4_0_cpu,
-    quantized_matmul_q4_0_cuda,
-    quantized_matmul_q4_0_metal,
-    GgmlDType::Q4_0
-);
-quantized_matmul!(
-    quantized_matmul_q4_1_bis,
-    quantized_matmul_q4_1_cpu,
-    quantized_matmul_q4_1_cuda,
-    quantized_matmul_q4_1_metal,
-    GgmlDType::Q4_1
-);
-quantized_matmul!(
-    quantized_matmul_q5_0_bis,
-    quantized_matmul_q5_0_cpu,
-    quantized_matmul_q5_0_cuda,
-    quantized_matmul_q5_0_metal,
-    GgmlDType::Q5_0
-);
-quantized_matmul!(
-    quantized_matmul_q5_1_bis,
-    quantized_matmul_q5_1_cpu,
-    quantized_matmul_q5_1_cuda,
-    quantized_matmul_q5_1_metal,
-    GgmlDType::Q5_1
-);
-quantized_matmul!(
-    quantized_matmul_q8_0_bis,
-    quantized_matmul_q8_0_cpu,
-    quantized_matmul_q8_0_cuda,
-    quantized_matmul_q8_0_metal,
-    GgmlDType::Q8_0
-);
-// Not implemented in Ggml
-// quantized_matmul!(
-//     quantized_matmul_q8_1_bis,
-//     quantized_matmul_q8_1_cpu,
-//     quantized_matmul_q8_1_cuda,
-//     quantized_matmul_q8_1_metal,
-//     GgmlDType::Q8_1
-// );
-// TODO This is bugged (also bugged in GGML
-quantized_matmul!(
-    quantized_matmul_q2k_bis,
-    quantized_matmul_q2k_cpu,
-    quantized_matmul_q2k_cuda,
-    quantized_matmul_q2k_metal,
-    GgmlDType::Q2K
-);
-quantized_matmul!(
-    quantized_matmul_q3k_bis,
-    quantized_matmul_q3k_cpu,
-    quantized_matmul_q3k_cuda,
-    quantized_matmul_q3k_metal,
-    GgmlDType::Q3K
-);
-quantized_matmul!(
-    quantized_matmul_q4k_bis,
-    quantized_matmul_q4k_cpu,
-    quantized_matmul_q4k_cuda,
-    quantized_matmul_q4k_metal,
-    GgmlDType::Q4K
-);
-quantized_matmul!(
-    quantized_matmul_q5k_bis,
-    quantized_matmul_q5k_cpu,
-    quantized_matmul_q5k_cuda,
-    quantized_matmul_q5k_metal,
-    GgmlDType::Q5K
-);
-quantized_matmul!(
-    quantized_matmul_q6k_bis,
-    quantized_matmul_q6k_cpu,
-    quantized_matmul_q6k_cuda,
-    quantized_matmul_q6k_metal,
-    GgmlDType::Q6K
-);
-// Not implemented on metal
-// quantized_matmul!(
-//     quantized_matmul_q8k_bis,
-//     quantized_matmul_q8k_cpu,
-//     quantized_matmul_q8k_cuda,
-//     quantized_matmul_q8k_metal,
-//     GgmlDType::Q8K
-// );
-
 #[test]
 fn quantized_matmul_q2k() -> Result<()> {
    use k_quants::BlockQ2K;
@ -813,7 +578,7 @@ fn quantized_matmul_q2k() -> Result<()> {
    let dst = round_vector(&[dst[0], dst[m * n / 3], dst[m * n * 2 / 3], dst[m * n - 1]]);
    assert_eq!(dst, [1.262, 1.513, -0.208, 1.702]);

-    let rhs = quantized::QTensor::quantize(&rhs, GgmlDType::Q2K)?;
+    let rhs = quantized::QTensor::quantize::<BlockQ2K>(&rhs)?;
    let rhs = quantized::QMatMul::from_qtensor(rhs)?;
    let mm = rhs.forward(&lhs)?;

@ -839,7 +604,7 @@ fn quantized_matmul_q3k() -> Result<()> {
    let dst = round_vector(&[dst[0], dst[m * n / 3], dst[m * n * 2 / 3], dst[m * n - 1]]);
    assert_eq!(dst, [1.262, 1.513, -0.208, 1.702]);

-    let rhs = quantized::QTensor::quantize(&rhs, GgmlDType::Q3K)?;
+    let rhs = quantized::QTensor::quantize::<BlockQ3K>(&rhs)?;
    let rhs = quantized::QMatMul::from_qtensor(rhs)?;
    let mm = rhs.forward(&lhs)?;

@ -865,7 +630,7 @@ fn quantized_matmul_q4k() -> Result<()> {
    let dst = round_vector(&[dst[0], dst[m * n / 3], dst[m * n * 2 / 3], dst[m * n - 1]]);
    assert_eq!(dst, [1.262, 1.513, -0.208, 1.702]);

-    let rhs = quantized::QTensor::quantize(&rhs, GgmlDType::Q4K)?;
+    let rhs = quantized::QTensor::quantize::<BlockQ4K>(&rhs)?;
    let rhs = quantized::QMatMul::from_qtensor(rhs)?;
    let mm = rhs.forward(&lhs)?;

@ -891,7 +656,7 @@ fn quantized_matmul_q5k() -> Result<()> {
    let dst = round_vector(&[dst[0], dst[m * n / 3], dst[m * n * 2 / 3], dst[m * n - 1]]);
    assert_eq!(dst, [1.262, 1.513, -0.208, 1.702]);

-    let rhs = quantized::QTensor::quantize(&rhs, GgmlDType::Q5K)?;
+    let rhs = quantized::QTensor::quantize::<BlockQ5K>(&rhs)?;
    let rhs = quantized::QMatMul::from_qtensor(rhs)?;
    let mm = rhs.forward(&lhs)?;

@ -918,7 +683,7 @@ fn quantized_matmul_q6k() -> Result<()> {
    let dst = round_vector(&[dst[0], dst[m * n / 3], dst[m * n * 2 / 3], dst[m * n - 1]]);
    assert_eq!(dst, [1.262, 1.513, -0.208, 1.702]);

-    let rhs = quantized::QTensor::quantize(&rhs, GgmlDType::Q6K)?;
+    let rhs = quantized::QTensor::quantize::<BlockQ6K>(&rhs)?;
    let rhs = quantized::QMatMul::from_qtensor(rhs)?;
    let mm = rhs.forward(&lhs)?;

@ -943,7 +708,7 @@ fn quantized_matmul_q8k() -> Result<()> {
    let dst = round_vector(&[dst[0], dst[m * n / 3], dst[m * n * 2 / 3], dst[m * n - 1]]);
    assert_eq!(dst, [1.262, 1.513, -0.208, 1.702]);

-    let rhs = quantized::QTensor::quantize(&rhs, GgmlDType::Q8K)?;
+    let rhs = quantized::QTensor::quantize::<BlockQ8K>(&rhs)?;
    let rhs = quantized::QMatMul::from_qtensor(rhs)?;
    let mm = rhs.forward(&lhs)?;

--- a/candle-core/tests/tensor_tests.rs
+++ b/candle-core/tests/tensor_tests.rs
@ -1,4 +1,4 @@
-use candle_core::{test_device, test_utils, DType, Device, IndexOp, Result, Tensor, D};
+use candle_core::{test_device, test_utils, DType, Device, IndexOp, Result, Tensor};

 fn zeros(device: &Device) -> Result<()> {
    let tensor = Tensor::zeros((5, 2), DType::F32, device)?;
@ -32,14 +32,6 @@ fn ones(device: &Device) -> Result<()> {
    Ok(())
 }

-fn full(device: &Device) -> Result<()> {
-    assert_eq!(
-        Tensor::full(42u32, (2, 3), device)?.to_vec2::<u32>()?,
-        [[42, 42, 42], [42, 42, 42]],
-    );
-    Ok(())
-}
-
 fn arange(device: &Device) -> Result<()> {
    assert_eq!(
        Tensor::arange(0u8, 5u8, device)?.to_vec1::<u8>()?,
@ -1080,7 +1072,6 @@ fn randn(device: &Device) -> Result<()> {

 test_device!(zeros, zeros_cpu, zeros_gpu, zeros_metal);
 test_device!(ones, ones_cpu, ones_gpu, ones_metal);
-test_device!(full, full_cpu, full_gpu, full_metal);
 test_device!(arange, arange_cpu, arange_gpu, arange_metal);
 test_device!(add_mul, add_mul_cpu, add_mul_gpu, add_mul_metal);
 test_device!(tensor_2d, tensor_2d_cpu, tensor_2d_gpu, tensor_2d_metal);
@ -1168,88 +1159,3 @@ fn i64_abs() -> Result<()> {
    assert_eq!(t.to_vec1::<i64>()?, [42, 1337]);
    Ok(())
 }
-
-#[test]
-fn tril_triu_eye() -> Result<()> {
-    let t = Tensor::tril2(4, DType::F32, &Device::Cpu)?;
-    assert_eq!(
-        t.to_vec2::<f32>()?,
-        [
-            [1.0, 0.0, 0.0, 0.0],
-            [1.0, 1.0, 0.0, 0.0],
-            [1.0, 1.0, 1.0, 0.0],
-            [1.0, 1.0, 1.0, 1.0]
-        ],
-    );
-    let t = Tensor::triu2(4, DType::F32, &Device::Cpu)?;
-    assert_eq!(
-        t.to_vec2::<f32>()?,
-        [
-            [1.0, 1.0, 1.0, 1.0],
-            [0.0, 1.0, 1.0, 1.0],
-            [0.0, 0.0, 1.0, 1.0],
-            [0.0, 0.0, 0.0, 1.0]
-        ]
-    );
-    let t = Tensor::eye(4, DType::F32, &Device::Cpu)?;
-    assert_eq!(
-        t.to_vec2::<f32>()?,
-        [
-            [1.0, 0.0, 0.0, 0.0],
-            [0.0, 1.0, 0.0, 0.0],
-            [0.0, 0.0, 1.0, 0.0],
-            [0.0, 0.0, 0.0, 1.0]
-        ]
-    );
-    Ok(())
-}
-
-#[test]
-fn cumsum() -> Result<()> {
-    let t = &[3f32, 1., 4., 1., 5.];
-    let t = Tensor::new(t, &Device::Cpu)?;
-    assert_eq!(t.cumsum(0)?.to_vec1::<f32>()?, [3., 4., 8., 9., 14.]);
-    let t = t.unsqueeze(1)?;
-    assert_eq!(
-        t.cumsum(0)?.to_vec2::<f32>()?,
-        [[3.0], [4.0], [8.0], [9.0], [14.0]]
-    );
-    assert_eq!(
-        t.cumsum(1)?.to_vec2::<f32>()?,
-        [[3.0], [1.0], [4.0], [1.0], [5.0]]
-    );
-    let t = &[[3f32, 1., 4., 1., 5.], [2., 1., 7., 8., 2.]];
-    let t = Tensor::new(t, &Device::Cpu)?;
-    assert_eq!(
-        t.cumsum(1)?.to_vec2::<f32>()?,
-        [[3.0, 4.0, 8.0, 9.0, 14.0], [2.0, 3.0, 10.0, 18.0, 20.0]],
-    );
-    assert_eq!(
-        t.cumsum(0)?.to_vec2::<f32>()?,
-        [[3.0, 1.0, 4.0, 1.0, 5.0], [5.0, 2.0, 11.0, 9.0, 7.0]]
-    );
-    Ok(())
-}
-
-/// A helper function for floating point comparison. Both a and b must be 1D Tensor and contains the same amount of data.
-/// Assertion passes if the difference of all pairs of a and b is smaller than epsilon.
-fn assert_close(a: &Tensor, b: &Tensor, epsilon: f64) -> Result<()> {
-    let a_vec: Vec<f64> = a.to_vec1()?;
-    let b_vec: Vec<f64> = b.to_vec1()?;
-
-    assert_eq!(a_vec.len(), b_vec.len());
-    for (a, b) in a_vec.iter().zip(b_vec.iter()) {
-        assert!((a - b).abs() < epsilon);
-    }
-    Ok(())
-}
-
-#[test]
-fn logsumexp() -> Result<()> {
-    let input = Tensor::new(&[[1f64, 2., 3.], [4., 5., 6.]], &Device::Cpu)?;
-    let output = input.logsumexp(D::Minus1)?;
-    // The expectations obtained from pytorch.
-    let expected = Tensor::new(&[3.4076, 6.4076], &Device::Cpu)?;
-    assert_close(&output, &expected, 0.00001)?;
-    Ok(())
-}
--- a/candle-datasets/Cargo.toml
+++ b/candle-datasets/Cargo.toml
@ -11,8 +11,8 @@ readme = "README.md"

 [dependencies]
 byteorder = { workspace = true }
-candle = { path = "../candle-core", version = "0.3.3", package = "candle-core" }
-candle-nn = { path = "../candle-nn", version = "0.3.3" }
+candle = { path = "../candle-core", version = "0.3.0", package = "candle-core" }
+candle-nn = { path = "../candle-nn", version = "0.3.0" }
 hf-hub = { workspace = true}
 intel-mkl-src = { workspace = true, optional = true }
 memmap2 = { workspace = true }
--- a/candle-examples/Cargo.toml
+++ b/candle-examples/Cargo.toml
@ -11,17 +11,14 @@ readme = "README.md"

 [dependencies]
 accelerate-src = { workspace = true, optional = true }
-candle = { path = "../candle-core", version = "0.3.3", package = "candle-core" }
-candle-datasets = { path = "../candle-datasets", version = "0.3.3" }
-candle-nn = { path = "../candle-nn", version = "0.3.3" }
-candle-transformers = { path = "../candle-transformers", version = "0.3.3" }
-candle-flash-attn = { path = "../candle-flash-attn", version = "0.3.3", optional = true }
-candle-onnx = { path = "../candle-onnx", version = "0.3.3", optional = true }
-
-csv = "1.3.0"
+candle = { path = "../candle-core", version = "0.3.0", package = "candle-core" }
+candle-datasets = { path = "../candle-datasets", version = "0.3.0" }
+candle-nn = { path = "../candle-nn", version = "0.3.0" }
+candle-transformers = { path = "../candle-transformers", version = "0.3.0" }
+candle-flash-attn = { path = "../candle-flash-attn", version = "0.3.0", optional = true }
+candle-onnx = { path = "../candle-onnx", version = "0.3.0", optional = true }
 cudarc = { workspace = true, optional = true }
 half = { workspace = true, optional = true }
-hf-hub = { workspace = true, features=["tokio"]}
 image = { workspace = true }
 intel-mkl-src = { workspace = true, optional = true }
 num-traits = { workspace = true }
@ -36,6 +33,7 @@ tokenizers = { workspace = true, features = ["onig"] }
 anyhow = { workspace = true }
 byteorder = { workspace = true }
 clap = { workspace = true }
+hf-hub = { workspace = true, features=["tokio"]}
 imageproc = { workspace = true }
 memmap2 = { workspace = true }
 rand = { workspace = true }
--- a/candle-examples/build.rs
+++ b/candle-examples/build.rs
@ -32,8 +32,6 @@ impl KernelDirectories {
        if should_compile {
            #[cfg(feature = "cuda")]
            {
-                let ccbin_env = std::env::var("CANDLE_NVCC_CCBIN");
-                println!("cargo:rerun-if-env-changed=CANDLE_NVCC_CCBIN");
                let mut command = std::process::Command::new("nvcc");
                let out_dir = ptx_file.parent().context("no parent for ptx file")?;
                let include_dirs: Vec<String> =
@ -46,11 +44,6 @@ impl KernelDirectories {
                    .arg(format!("-I/{}", self.kernel_dir))
                    .args(include_dirs)
                    .arg(cu_file);
-                if let Ok(ccbin_path) = &ccbin_env {
-                    command
-                        .arg("-allow-unsupported-compiler")
-                        .args(["-ccbin", ccbin_path]);
-                }
                let output = command
                    .spawn()
                    .context("failed spawning nvcc")?
@ -175,16 +168,8 @@ fn set_cuda_include_dir() -> Result<()> {

 #[allow(unused)]
 fn compute_cap() -> Result<usize> {
-    println!("cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP");
-
-    // Try to parse compute cap from env
-    let mut compute_cap = if let Ok(compute_cap_str) = std::env::var("CUDA_COMPUTE_CAP") {
-        println!("cargo:rustc-env=CUDA_COMPUTE_CAP={compute_cap_str}");
-        compute_cap_str
-            .parse::<usize>()
-            .context("Could not parse code")?
-    } else {
-        // Grab compute cap from nvidia-smi
+    // Grab compute cap from nvidia-smi
+    let mut compute_cap = {
        let out = std::process::Command::new("nvidia-smi")
                    .arg("--query-gpu=compute_cap")
                    .arg("--format=csv")
@ -200,7 +185,6 @@ fn compute_cap() -> Result<usize> {
            .next()
            .context("missing line in stdout")?
            .replace('.', "");
-        println!("cargo:rustc-env=CUDA_COMPUTE_CAP={cap}");
        cap.parse::<usize>()
            .with_context(|| format!("cannot parse as int {cap}"))?
    };
--- a/candle-examples/examples/bert/README.md
+++ b/candle-examples/examples/bert/README.md
@ -2,10 +2,10 @@

 Bert is a general large language model. In this example it can be used for two
 different tasks:
-
 - Compute sentence embeddings for a prompt.
 - Compute similarities between a set of sentences.

+
 ## Sentence embeddings

 Bert is used to compute the sentence embeddings for a prompt. The model weights
@ -24,48 +24,6 @@ cargo run --example bert --release -- --prompt "Here is a test sentence"
 > Tensor[[1, 7, 384], f32]
 ```

-### Custom models
-
-You can specify different models, such as BGE, with the `--model-id` flag:
-
-```bash
-cargo run  --example bert --release -- \
--model-id BAAI/bge-large-zh-v1.5 \
--prompt "Here is a test sentence"
-Loaded and encoded 435.70775ms
-[[[ 3.0944e-1, -7.8455e-5,  -1.2768e0, ...,  1.3755e-2, -3.2371e-1,  2.3819e-1],
-  [-2.8506e-1,  1.9953e-1,  -1.3076e0, ...,  6.9819e-2,  1.0833e-2,  -1.1512e0],
-  [ 3.9892e-1,  2.0000e-1, -9.3178e-1, ..., -4.1393e-1, -4.9644e-2, -3.3786e-1],
-  ...
-  [ 6.0345e-1,  3.5744e-1,  -1.2672e0, ..., -6.9165e-1, -3.4973e-3, -8.4214e-1],
-  [ 3.9218e-1, -3.2735e-1,  -1.3123e0, ..., -4.9318e-1, -5.1334e-1, -3.6391e-1],
-  [ 3.0978e-1,  2.5662e-4,  -1.2773e0, ...,  1.3357e-2, -3.2390e-1,  2.3858e-1]]]
-Tensor[[1, 9, 1024], f32]
-Took 176.744667ms
-```
-
-### Gelu approximation
-
-You can get a speedup by using an approximation of the gelu activation, with a
-small loss of precision, by passing the `--approximate-gelu` flag:
-
-```bash
-$ cargo run  --example bert --release -- \
--model-id BAAI/bge-large-zh-v1.5 \
--prompt "Here is a test sentence" \
--approximate-gelu
-Loaded and encoded 244.388042ms
-[[[ 3.1048e-1, -6.0339e-4,  -1.2758e0, ...,  1.3718e-2, -3.2362e-1,  2.3775e-1],
-  [-2.8354e-1,  1.9984e-1,  -1.3077e0, ...,  6.9390e-2,  9.9681e-3,  -1.1531e0],
-  [ 3.9947e-1,  1.9917e-1, -9.3178e-1, ..., -4.1301e-1, -5.0719e-2, -3.3955e-1],
-  ...
-  [ 6.0499e-1,  3.5664e-1,  -1.2642e0, ..., -6.9134e-1, -3.4581e-3, -8.4471e-1],
-  [ 3.9311e-1, -3.2812e-1,  -1.3105e0, ..., -4.9291e-1, -5.1270e-1, -3.6543e-1],
-  [ 3.1082e-1, -2.6737e-4,  -1.2762e0, ...,  1.3319e-2, -3.2381e-1,  2.3815e-1]]]
-Tensor[[1, 9, 1024], f32]
-Took 116.840791ms
-```
-
 ## Similarities

 In this example, Bert is used to compute the sentence embeddings for a set of
--- a/candle-examples/examples/bert/main.rs
+++ b/candle-examples/examples/bert/main.rs
@ -3,7 +3,7 @@ extern crate intel_mkl_src;

 #[cfg(feature = "accelerate")]
 extern crate accelerate_src;
-use candle_transformers::models::bert::{BertModel, Config, HiddenAct, DTYPE};
+use candle_transformers::models::bert::{BertModel, Config, DTYPE};

 use anyhow::{Error as E, Result};
 use candle::Tensor;
@ -45,10 +45,6 @@ struct Args {
    /// L2 normalization for embeddings.
    #[arg(long, default_value = "true")]
    normalize_embeddings: bool,
-
-    /// Use tanh based approximation for Gelu instead of erf implementation.
-    #[arg(long, default_value = "false")]
-    approximate_gelu: bool,
 }

 impl Args {
@ -77,7 +73,7 @@ impl Args {
            (config, tokenizer, weights)
        };
        let config = std::fs::read_to_string(config_filename)?;
-        let mut config: Config = serde_json::from_str(&config)?;
+        let config: Config = serde_json::from_str(&config)?;
        let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;

        let vb = if self.use_pth {
@ -85,9 +81,6 @@ impl Args {
        } else {
            unsafe { VarBuilder::from_mmaped_safetensors(&[weights_filename], DTYPE, &device)? }
        };
-        if self.approximate_gelu {
-            config.hidden_act = HiddenAct::GeluApproximate;
-        }
        let model = BertModel::load(vb, &config)?;
        Ok((model, tokenizer))
    }
--- a/candle-examples/examples/blip/main.rs
+++ b/candle-examples/examples/blip/main.rs
@ -106,17 +106,17 @@ pub fn main() -> anyhow::Result<()> {

    let config = blip::Config::image_captioning_large();

-    let device = candle_examples::device(args.cpu)?;
    let (image_embeds, device, mut model) = if args.quantized {
        let device = Device::Cpu;
        let image = load_image(args.image)?.to_device(&device)?;
        println!("loaded image {image:?}");

-        let vb = quantized_blip::VarBuilder::from_gguf(model_file, &device)?;
+        let vb = quantized_blip::VarBuilder::from_gguf(model_file)?;
        let model = quantized_blip::BlipForConditionalGeneration::new(&config, vb)?;
        let image_embeds = image.unsqueeze(0)?.apply(model.vision_model())?;
        (image_embeds, device, Model::Q(model))
    } else {
+        let device = candle_examples::device(args.cpu)?;
        let image = load_image(args.image)?.to_device(&device)?;
        println!("loaded image {image:?}");

--- a/candle-examples/examples/distilbert/README.md
+++ b/candle-examples/examples/distilbert/README.md
@ -1,22 +0,0 @@
-# candle-distilbert
-
-DistilBert is a distiled version of the Bert model.
-
-## Sentence embeddings
-
-DistilBert is used to compute the sentence embeddings for a prompt. The model weights
-are downloaded from the hub on the first run.
-
-```bash
-cargo run --example distilbert --release -- --prompt "Here is a test sentence"
-
-> [[[ 0.5109,  0.1280, -0.2635, ...,  0.3462, -1.0434,  0.1441],
->   [ 0.1735,  0.0818, -0.5549, ...,  0.3472, -0.8264, -0.0244],
->   [ 0.0702, -0.1311, -0.4914, ...,  0.3483, -0.6194,  0.1829],
->   ...
->   [ 0.2993, -0.0106, -0.4640, ...,  0.2844, -0.6732,  0.0042],
->   [ 0.1066, -0.0081, -0.4299, ...,  0.3435, -0.7729,  0.0190],
->   [ 0.8903,  0.2055, -0.2541, ...,  0.3208, -0.6585,  0.0586]]]
-> Tensor[[1, 7, 768], f32]
-
-```
--- a/candle-examples/examples/distilbert/main.rs
+++ b/candle-examples/examples/distilbert/main.rs
@ -1,135 +0,0 @@
-#[cfg(feature = "mkl")]
-extern crate intel_mkl_src;
-
-#[cfg(feature = "accelerate")]
-extern crate accelerate_src;
-use candle_transformers::models::distilbert::{Config, DistilBertModel, DTYPE};
-
-use anyhow::{Error as E, Result};
-use candle::{Device, Tensor};
-use candle_nn::VarBuilder;
-use clap::Parser;
-use hf_hub::{api::sync::Api, Repo, RepoType};
-use tokenizers::Tokenizer;
-
-#[derive(Parser, Debug)]
-#[command(author, version, about, long_about = None)]
-struct Args {
-    /// Run on CPU rather than on GPU.
-    #[arg(long)]
-    cpu: bool,
-
-    /// Enable tracing (generates a trace-timestamp.json file).
-    #[arg(long)]
-    tracing: bool,
-
-    /// The model to use, check out available models: https://huggingface.co/models?library=sentence-transformers&sort=trending
-    #[arg(long)]
-    model_id: Option<String>,
-
-    #[arg(long)]
-    revision: Option<String>,
-
-    /// When set, compute embeddings for this prompt.
-    #[arg(long)]
-    prompt: String,
-
-    /// Use the pytorch weights rather than the safetensors ones
-    #[arg(long)]
-    use_pth: bool,
-
-    /// The number of times to run the prompt.
-    #[arg(long, default_value = "1")]
-    n: usize,
-
-    /// L2 normalization for embeddings.
-    #[arg(long, default_value = "true")]
-    normalize_embeddings: bool,
-}
-
-impl Args {
-    fn build_model_and_tokenizer(&self) -> Result<(DistilBertModel, Tokenizer)> {
-        let device = candle_examples::device(self.cpu)?;
-        let default_model = "distilbert-base-uncased".to_string();
-        let default_revision = "main".to_string();
-        let (model_id, revision) = match (self.model_id.to_owned(), self.revision.to_owned()) {
-            (Some(model_id), Some(revision)) => (model_id, revision),
-            (Some(model_id), None) => (model_id, "main".to_string()),
-            (None, Some(revision)) => (default_model, revision),
-            (None, None) => (default_model, default_revision),
-        };
-
-        let repo = Repo::with_revision(model_id, RepoType::Model, revision);
-        let (config_filename, tokenizer_filename, weights_filename) = {
-            let api = Api::new()?;
-            let api = api.repo(repo);
-            let config = api.get("config.json")?;
-            let tokenizer = api.get("tokenizer.json")?;
-            let weights = if self.use_pth {
-                api.get("pytorch_model.bin")?
-            } else {
-                api.get("model.safetensors")?
-            };
-            (config, tokenizer, weights)
-        };
-        let config = std::fs::read_to_string(config_filename)?;
-        let config: Config = serde_json::from_str(&config)?;
-        let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;
-
-        let vb = if self.use_pth {
-            VarBuilder::from_pth(&weights_filename, DTYPE, &device)?
-        } else {
-            unsafe { VarBuilder::from_mmaped_safetensors(&[weights_filename], DTYPE, &device)? }
-        };
-        let model = DistilBertModel::load(vb, &config)?;
-        Ok((model, tokenizer))
-    }
-}
-
-fn get_mask(size: usize, device: &Device) -> Tensor {
-    let mask: Vec<_> = (0..size)
-        .flat_map(|i| (0..size).map(move |j| u8::from(j > i)))
-        .collect();
-    Tensor::from_slice(&mask, (size, size), device).unwrap()
-}
-
-fn main() -> Result<()> {
-    use tracing_chrome::ChromeLayerBuilder;
-    use tracing_subscriber::prelude::*;
-
-    let args = Args::parse();
-    let _guard = if args.tracing {
-        println!("tracing...");
-        let (chrome_layer, guard) = ChromeLayerBuilder::new().build();
-        tracing_subscriber::registry().with(chrome_layer).init();
-        Some(guard)
-    } else {
-        None
-    };
-    let (model, mut tokenizer) = args.build_model_and_tokenizer()?;
-    let device = &model.device;
-
-    let tokenizer = tokenizer
-        .with_padding(None)
-        .with_truncation(None)
-        .map_err(E::msg)?;
-    let tokens = tokenizer
-        .encode(args.prompt, true)
-        .map_err(E::msg)?
-        .get_ids()
-        .to_vec();
-    let token_ids = Tensor::new(&tokens[..], device)?.unsqueeze(0)?;
-    let mask = get_mask(tokens.len(), device);
-
-    println!("token_ids: {:?}", token_ids.to_vec2::<u32>());
-    println!("mask: {:?}", mask.to_vec2::<u8>());
-
-    let ys = model.forward(&token_ids, &mask)?;
-    println!("{ys}");
-
-    Ok(())
-}
-
-pub fn normalize_l2(v: &Tensor) -> Result<Tensor> {
-    Ok(v.broadcast_div(&v.sqr()?.sum_keepdim(1)?.sqrt()?)?)
-}
--- a/candle-examples/examples/falcon/main.rs
+++ b/candle-examples/examples/falcon/main.rs
@ -165,7 +165,14 @@ fn main() -> Result<()> {
        args.revision,
    ));
    let tokenizer_filename = repo.get("tokenizer.json")?;
-    let filenames = candle_examples::hub_load_safetensors(&repo, "model.safetensors.index.json")?;
+    let mut filenames = vec![];
+    for rfilename in [
+        "model-00001-of-00002.safetensors",
+        "model-00002-of-00002.safetensors",
+    ] {
+        let filename = repo.get(rfilename)?;
+        filenames.push(filename);
+    }
    println!("retrieved the files in {:?}", start.elapsed());
    let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;

--- a/candle-examples/examples/llama/main.rs
+++ b/candle-examples/examples/llama/main.rs
@ -13,7 +13,7 @@ extern crate accelerate_src;
 extern crate intel_mkl_src;

 use anyhow::{bail, Error as E, Result};
-use clap::{Parser, ValueEnum};
+use clap::Parser;

 use candle::{DType, Tensor};
 use candle_nn::VarBuilder;
@ -22,21 +22,11 @@ use hf_hub::{api::sync::Api, Repo, RepoType};
 use std::io::Write;

 use candle_transformers::models::llama as model;
-use model::{Llama, LlamaConfig};
+use model::{Config, Llama, LlamaConfig};

 const EOS_TOKEN: &str = "</s>";
 const DEFAULT_PROMPT: &str = "My favorite theorem is ";

-#[derive(Clone, Debug, Copy, PartialEq, Eq, ValueEnum)]
-enum Which {
-    V1,
-    V2,
-    #[value(name = "solar-10.7b")]
-    Solar10_7B,
-    #[value(name = "tiny-llama-1.1b-chat")]
-    TinyLlama1_1BChat,
-}
-
 #[derive(Parser, Debug)]
 #[command(author, version, about, long_about = None)]
 struct Args {
@ -44,6 +34,10 @@ struct Args {
    #[arg(long)]
    cpu: bool,

+    /// Load the weights from an npz file instead of safetensors.
+    #[arg(long)]
+    npy: Option<String>,
+
    /// The temperature used to generate samples.
    #[arg(long)]
    temperature: Option<f64>,
@ -82,13 +76,17 @@ struct Args {
    #[arg(long)]
    revision: Option<String>,

-    /// The model size to use.
-    #[arg(long, default_value = "v2")]
-    which: Which,
+    #[arg(long)]
+    v1: bool,

    #[arg(long)]
    use_flash_attn: bool,

+    /// The folder name that contains safetensor weights and json files
+    /// (same structure as huggingface online)
+    #[arg(long)]
+    local_weights: Option<String>,
+
    /// Penalty to be applied for repeating tokens, 1. means no penalty.
    #[arg(long, default_value_t = 1.0)]
    repeat_penalty: f32,
@ -120,34 +118,65 @@ fn main() -> Result<()> {
        Some(dtype) => bail!("Unsupported dtype {dtype}"),
        None => DType::F16,
    };
-    let (llama, tokenizer_filename, cache) = {
-        let api = Api::new()?;
-        let model_id = args.model_id.unwrap_or_else(|| match args.which {
-            Which::V1 => "Narsil/amall-7b".to_string(),
-            Which::V2 => "meta-llama/Llama-2-7b-hf".to_string(),
-            Which::Solar10_7B => "upstage/SOLAR-10.7B-v1.0".to_string(),
-            Which::TinyLlama1_1BChat => "TinyLlama/TinyLlama-1.1B-Chat-v1.0".to_string(),
-        });
-        println!("loading the model weights from {model_id}");
-        let revision = args.revision.unwrap_or("main".to_string());
-        let api = api.repo(Repo::with_revision(model_id, RepoType::Model, revision));
+    let (llama, tokenizer_filename, cache) = match args.npy {
+        Some(filename) => {
+            let config = if args.v1 {
+                Config::config_7b_v1(args.use_flash_attn)
+            } else {
+                Config::config_7b_v2(args.use_flash_attn)
+            };
+            let cache = model::Cache::new(!args.no_kv_cache, dtype, &config, &device)?;
+            let vb = VarBuilder::from_npz(filename, dtype, &device)?;
+            let tokenizer = std::path::PathBuf::from("llama-tokenizer.json");
+            (Llama::load(vb, &cache, &config)?, tokenizer, cache)
+        }
+        None => {
+            let api = Api::new()?;
+            let model_id = args.model_id.unwrap_or_else(|| {
+                if args.v1 {
+                    "Narsil/amall-7b".to_string()
+                } else {
+                    "meta-llama/Llama-2-7b-hf".to_string()
+                }
+            });
+            println!("loading the model weights from {model_id}");
+            let revision = args.revision.unwrap_or("main".to_string());
+            let api = api.repo(Repo::with_revision(model_id, RepoType::Model, revision));

-        let tokenizer_filename = api.get("tokenizer.json")?;
-        let config_filename = api.get("config.json")?;
-        let config: LlamaConfig = serde_json::from_slice(&std::fs::read(config_filename)?)?;
-        let config = config.into_config(args.use_flash_attn);
+            let tokenizer_filename = match &args.local_weights {
+                Some(path) => std::path::Path::new(path).join("tokenizer.json"),
+                _ => api.get("tokenizer.json")?,
+            };

-        let filenames = match args.which {
-            Which::V1 | Which::V2 | Which::Solar10_7B => {
-                candle_examples::hub_load_safetensors(&api, "model.safetensors.index.json")?
+            let config_filename = match &args.local_weights {
+                Some(path) => std::path::Path::new(path).join("config.json"),
+                _ => api.get("config.json")?,
+            };
+            let config: LlamaConfig = serde_json::from_slice(&std::fs::read(config_filename)?)?;
+            let config = config.into_config(args.use_flash_attn);
+
+            let mut filenames = vec![];
+            for rfilename in [
+                "model-00001-of-00002.safetensors",
+                "model-00002-of-00002.safetensors",
+            ] {
+                match &args.local_weights {
+                    Some(path) => {
+                        filenames.push(std::path::Path::new(path).join(rfilename));
+                    }
+                    _ => {
+                        let filename = api.get(rfilename)?;
+                        filenames.push(filename);
+                    }
+                };
            }
-            Which::TinyLlama1_1BChat => vec![api.get("model.safetensors")?],
-        };
-        println!("building the model");
-        let cache = model::Cache::new(!args.no_kv_cache, dtype, &config, &device)?;

-        let vb = unsafe { VarBuilder::from_mmaped_safetensors(&filenames, dtype, &device)? };
-        (Llama::load(vb, &cache, &config)?, tokenizer_filename, cache)
+            println!("building the model");
+            let cache = model::Cache::new(!args.no_kv_cache, dtype, &config, &device)?;
+
+            let vb = unsafe { VarBuilder::from_mmaped_safetensors(&filenames, dtype, &device)? };
+            (Llama::load(vb, &cache, &config)?, tokenizer_filename, cache)
+        }
    };
    let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;
    let eos_token_id = tokenizer.token_to_id(EOS_TOKEN);
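When the `local_weights` option is set, the llama example resolves `tokenizer.json`, `config.json` and the weight shards relative to a user-supplied folder. A minimal sketch, using only the standard library and the hypothetical helper names `concat_path` and `joined_path`, of how plain string concatenation and `Path::join` differ when the folder is given without a trailing separator:

```rust
use std::path::{Path, PathBuf};

/// Plain string concatenation: only correct if `dir` already ends with a separator.
fn concat_path(dir: &str, file: &str) -> PathBuf {
    (dir.to_owned() + file).into()
}

/// `Path::join` inserts the separator when it is missing.
fn joined_path(dir: &str, file: &str) -> PathBuf {
    Path::new(dir).join(file)
}
```

With `dir = "weights"`, the concatenated form points at `weightsconfig.json`, while the joined form points at `weights/config.json`; with `dir = "weights/"` both resolve to the same file.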
--- a/candle-examples/examples/llama2-c/main.rs
+++ b/candle-examples/examples/llama2-c/main.rs
@ -262,7 +262,7 @@ fn run_inference(args: &InferenceCmd, common_args: &Args) -> Result<()> {
        .extension()
        .map_or(false, |v| v == "safetensors");
    let (model, config) = if is_gguf {
-        let vb = qmodel::VarBuilder::from_gguf(config_path, &device)?;
+        let vb = qmodel::VarBuilder::from_gguf(config_path)?;
        let (_vocab_size, dim) = vb
            .get_no_shape("model.embed_tokens.weight")?
            .shape()
@ -279,13 +279,13 @@ fn run_inference(args: &InferenceCmd, common_args: &Args) -> Result<()> {
                (config.seq_len, config.head_size() / 2),
                "rot.freq_cis_real",
            )?
-            .dequantize(&device)?;
+            .dequantize(&candle::Device::Cpu)?;
        let freq_cis_imag = vb
            .get(
                (config.seq_len, config.head_size() / 2),
                "rot.freq_cis_imag",
            )?
-            .dequantize(&device)?;
+            .dequantize(&candle::Device::Cpu)?;

        let fake_vb = candle_nn::VarBuilder::from_tensors(
            [
@ -295,7 +295,7 @@ fn run_inference(args: &InferenceCmd, common_args: &Args) -> Result<()> {
            .into_iter()
            .collect(),
            candle::DType::F32,
-            &device,
+            &candle::Device::Cpu,
        );
        let cache = model::Cache::new(true, &config, fake_vb)?;
        let model = Model::QLlama(QLlama::load(vb, &cache, config.clone())?);
--- a/candle-examples/examples/llama_multiprocess/main.rs
+++ b/candle-examples/examples/llama_multiprocess/main.rs
@ -143,7 +143,14 @@ fn main() -> Result<()> {
    let config_filename = api.get("config.json")?;
    let config: Config = serde_json::from_slice(&std::fs::read(config_filename)?)?;
    let tokenizer_filename = api.get("tokenizer.json")?;
-    let filenames = candle_examples::hub_load_safetensors(&api, "model.safetensors.index.json")?;
+    let mut filenames = vec![];
+    for rfilename in [
+        "model-00001-of-00002.safetensors",
+        "model-00002-of-00002.safetensors",
+    ] {
+        let filename = api.get(rfilename)?;
+        filenames.push(filename);
+    }

    if args.rank.is_none() {
        let children: Vec<_> = (0..args.num_shards)
--- a/candle-examples/examples/mamba-minimal/README.md
+++ b/candle-examples/examples/mamba-minimal/README.md
@ -1,12 +0,0 @@
-# candle-mamba-minimal: minimal implementation of Mamba
-
-This is based on [mamba-minimal](https://github.com/johnma2006/mamba-minimal).
-
-## Running the example
-
-```bash
-$ cargo run --example mamba-minimal --release -- --prompt "Mamba is the"
-Mamba is the most popular and best-selling game in the world. It has been downloaded more than 1,000 times by over 1 million people worldwide since its release on March 18th 2016.
-
-The Mamba series of games are a collection that combines elements from all genres including action, adventure, strategy & puzzle games with some unique gameplay features such as stealth and survival. The game is also known for its innovative graphics and the ability to play in a variety of different modes like single player or multiplayer.
-```
--- a/candle-examples/examples/mamba-minimal/main.rs
+++ b/candle-examples/examples/mamba-minimal/main.rs
@ -1,287 +0,0 @@
-#[cfg(feature = "mkl")]
-extern crate intel_mkl_src;
-
-#[cfg(feature = "accelerate")]
-extern crate accelerate_src;
-
-use anyhow::{Error as E, Result};
-use clap::{Parser, ValueEnum};
-
-mod model;
-use model::{Config, Model};
-
-use candle::{DType, Device, Module, Tensor};
-use candle_examples::token_output_stream::TokenOutputStream;
-use candle_nn::VarBuilder;
-use candle_transformers::generation::LogitsProcessor;
-use hf_hub::{api::sync::Api, Repo, RepoType};
-use tokenizers::Tokenizer;
-
-struct TextGeneration {
-    model: Model,
-    device: Device,
-    tokenizer: TokenOutputStream,
-    logits_processor: LogitsProcessor,
-    repeat_penalty: f32,
-    repeat_last_n: usize,
-}
-
-impl TextGeneration {
-    #[allow(clippy::too_many_arguments)]
-    fn new(
-        model: Model,
-        tokenizer: Tokenizer,
-        seed: u64,
-        temp: Option<f64>,
-        top_p: Option<f64>,
-        repeat_penalty: f32,
-        repeat_last_n: usize,
-        device: &Device,
-    ) -> Self {
-        let logits_processor = LogitsProcessor::new(seed, temp, top_p);
-        Self {
-            model,
-            tokenizer: TokenOutputStream::new(tokenizer),
-            logits_processor,
-            repeat_penalty,
-            repeat_last_n,
-            device: device.clone(),
-        }
-    }
-
-    fn run(&mut self, prompt: &str, sample_len: usize) -> Result<()> {
-        use std::io::Write;
-        self.tokenizer.clear();
-        let mut tokens = self
-            .tokenizer
-            .tokenizer()
-            .encode(prompt, true)
-            .map_err(E::msg)?
-            .get_ids()
-            .to_vec();
-        for &t in tokens.iter() {
-            if let Some(t) = self.tokenizer.next_token(t)? {
-                print!("{t}")
-            }
-        }
-        std::io::stdout().flush()?;
-
-        let mut generated_tokens = 0usize;
-        let eos_token = match self.tokenizer.get_token("<|endoftext|>") {
-            Some(token) => token,
-            None => anyhow::bail!("cannot find the <|endoftext|> token"),
-        };
-        let start_gen = std::time::Instant::now();
-        for _ in 0..sample_len {
-            let input = Tensor::new(tokens.as_slice(), &self.device)?.unsqueeze(0)?;
-            let logits = self.model.forward(&input)?;
-            let logits = logits.squeeze(0)?.squeeze(0)?.to_dtype(DType::F32)?;
-            let logits = if self.repeat_penalty == 1. {
-                logits
-            } else {
-                let start_at = tokens.len().saturating_sub(self.repeat_last_n);
-                candle_transformers::utils::apply_repeat_penalty(
-                    &logits,
-                    self.repeat_penalty,
-                    &tokens[start_at..],
-                )?
-            };
-
-            let next_token = self.logits_processor.sample(&logits)?;
-            tokens.push(next_token);
-            generated_tokens += 1;
-            if next_token == eos_token {
-                break;
-            }
-            if let Some(t) = self.tokenizer.next_token(next_token)? {
-                print!("{t}");
-                std::io::stdout().flush()?;
-            }
-        }
-        let dt = start_gen.elapsed();
-        if let Some(rest) = self.tokenizer.decode_rest().map_err(E::msg)? {
-            print!("{rest}");
-        }
-        std::io::stdout().flush()?;
-        println!(
-            "\n{generated_tokens} tokens generated ({:.2} token/s)",
-            generated_tokens as f64 / dt.as_secs_f64(),
-        );
-        Ok(())
-    }
-}
-
-#[derive(Parser, ValueEnum, Clone, Copy, PartialEq, Eq, Debug)]
-enum Which {
-    Mamba130m,
-    Mamba370m,
-    Mamba790m,
-    Mamba1_4b,
-    Mamba2_8b,
-    Mamba2_8bSlimPj,
-}
-
-impl std::fmt::Display for Which {
-    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
-        write!(f, "{:?}", self)
-    }
-}
-
-impl Which {
-    fn model_id(&self) -> &'static str {
-        match self {
-            Self::Mamba130m => "state-spaces/mamba-130m",
-            Self::Mamba370m => "state-spaces/mamba-370m",
-            Self::Mamba790m => "state-spaces/mamba-790m",
-            Self::Mamba1_4b => "state-spaces/mamba-1.4b",
-            Self::Mamba2_8b => "state-spaces/mamba-2.8b",
-            Self::Mamba2_8bSlimPj => "state-spaces/mamba-2.8b-slimpj",
-        }
-    }
-
-    fn revision(&self) -> &'static str {
-        match self {
-            Self::Mamba130m
-            | Self::Mamba370m
-            | Self::Mamba790m
-            | Self::Mamba1_4b
-            | Self::Mamba2_8bSlimPj => "refs/pr/1",
-            Self::Mamba2_8b => "refs/pr/4",
-        }
-    }
-}
-
-#[derive(Parser, Debug)]
-#[command(author, version, about, long_about = None)]
-struct Args {
-    /// Run on CPU rather than on GPU.
-    #[arg(long)]
-    cpu: bool,
-
-    /// Enable tracing (generates a trace-timestamp.json file).
-    #[arg(long)]
-    tracing: bool,
-
-    #[arg(long)]
-    prompt: String,
-
-    /// The temperature used to generate samples.
-    #[arg(long)]
-    temperature: Option<f64>,
-
-    /// Nucleus sampling probability cutoff.
-    #[arg(long)]
-    top_p: Option<f64>,
-
-    /// The seed to use when generating random samples.
-    #[arg(long, default_value_t = 299792458)]
-    seed: u64,
-
-    /// The length of the sample to generate (in tokens).
-    #[arg(long, short = 'n', default_value_t = 5000)]
-    sample_len: usize,
-
-    #[arg(long, default_value = "mamba130m")]
-    which: Which,
-
-    #[arg(long)]
-    model_id: Option<String>,
-
-    #[arg(long)]
-    revision: Option<String>,
-
-    #[arg(long)]
-    tokenizer_file: Option<String>,
-
-    #[arg(long)]
-    weight_files: Option<String>,
-
-    #[arg(long)]
-    config_file: Option<String>,
-
-    /// Penalty to be applied for repeating tokens, 1. means no penalty.
-    #[arg(long, default_value_t = 1.1)]
-    repeat_penalty: f32,
-
-    /// The context size to consider for the repeat penalty.
-    #[arg(long, default_value_t = 64)]
-    repeat_last_n: usize,
-}
-
-fn main() -> Result<()> {
-    use tracing_chrome::ChromeLayerBuilder;
-    use tracing_subscriber::prelude::*;
-
-    let args = Args::parse();
-    let _guard = if args.tracing {
-        let (chrome_layer, guard) = ChromeLayerBuilder::new().build();
-        tracing_subscriber::registry().with(chrome_layer).init();
-        Some(guard)
-    } else {
-        None
-    };
-    println!(
-        "avx: {}, neon: {}, simd128: {}, f16c: {}",
-        candle::utils::with_avx(),
-        candle::utils::with_neon(),
-        candle::utils::with_simd128(),
-        candle::utils::with_f16c()
-    );
-    println!(
-        "temp: {:.2} repeat-penalty: {:.2} repeat-last-n: {}",
-        args.temperature.unwrap_or(0.),
-        args.repeat_penalty,
-        args.repeat_last_n
-    );
-
-    let start = std::time::Instant::now();
-    let api = Api::new()?;
-    let repo = api.repo(Repo::with_revision(
-        args.model_id
-            .unwrap_or_else(|| args.which.model_id().to_string()),
-        RepoType::Model,
-        args.revision
-            .unwrap_or_else(|| args.which.revision().to_string()),
-    ));
-    let tokenizer_filename = match args.tokenizer_file {
-        Some(file) => std::path::PathBuf::from(file),
-        None => api
-            .model("EleutherAI/gpt-neox-20b".to_string())
-            .get("tokenizer.json")?,
-    };
-    let config_filename = match args.config_file {
-        Some(file) => std::path::PathBuf::from(file),
-        None => repo.get("config.json")?,
-    };
-    let filenames = match args.weight_files {
-        Some(files) => files
-            .split(',')
-            .map(std::path::PathBuf::from)
-            .collect::<Vec<_>>(),
-        None => {
-            vec![repo.get("model.safetensors")?]
-        }
-    };
-    println!("retrieved the files in {:?}", start.elapsed());
-    let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;
-
-    let start = std::time::Instant::now();
-    let config: Config = serde_json::from_slice(&std::fs::read(config_filename)?)?;
-    let device = candle_examples::device(args.cpu)?;
-    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&filenames, DType::F32, &device)? };
-    let model = Model::new(&config, vb.pp("backbone"))?;
-    println!("loaded the model in {:?}", start.elapsed());
-
-    let mut pipeline = TextGeneration::new(
-        model,
-        tokenizer,
-        args.seed,
-        args.temperature,
-        args.top_p,
-        args.repeat_penalty,
-        args.repeat_last_n,
-        &device,
-    );
-    pipeline.run(&args.prompt, args.sample_len)?;
-    Ok(())
-}
--- a/candle-examples/examples/mamba-minimal/model.rs
+++ b/candle-examples/examples/mamba-minimal/model.rs
@ -1,204 +0,0 @@
-/// This follows the lines of:
-/// https://github.com/johnma2006/mamba-minimal/blob/master/model.py
-/// Simple, minimal implementation of Mamba in one file of PyTorch.
-use candle::{IndexOp, Module, Result, Tensor, D};
-use candle_nn::{RmsNorm, VarBuilder};
-
-use candle_transformers::models::with_tracing::{linear, linear_no_bias, Linear};
-
-#[derive(Debug, Clone, serde::Deserialize)]
-pub struct Config {
-    d_model: usize,
-    n_layer: usize,
-    vocab_size: usize,
-    pad_vocab_size_multiple: usize,
-}
-
-impl Config {
-    fn vocab_size(&self) -> usize {
-        let pad = self.pad_vocab_size_multiple;
-        (self.vocab_size + pad - 1) / pad * pad
-    }
-
-    fn dt_rank(&self) -> usize {
-        (self.d_model + 15) / 16
-    }
-
-    fn d_conv(&self) -> usize {
-        4
-    }
-
-    fn d_state(&self) -> usize {
-        16
-    }
-
-    fn d_inner(&self) -> usize {
-        self.d_model * 2
-    }
-}
-
-// https://github.com/johnma2006/mamba-minimal/blob/61f01953ca153f8c4a850d7111beecbf4be9cee1/model.py#L177
-#[derive(Clone, Debug)]
-pub struct MambaBlock {
-    in_proj: Linear,
-    conv1d: candle_nn::Conv1d,
-    x_proj: Linear,
-    dt_proj: Linear,
-    a_log: Tensor,
-    d: Tensor,
-    out_proj: Linear,
-    dt_rank: usize,
-}
-
-impl MambaBlock {
-    pub fn new(cfg: &Config, vb: VarBuilder) -> Result<Self> {
-        let d_inner = cfg.d_inner();
-        let d_conv = cfg.d_conv();
-        let d_state = cfg.d_state();
-        let dt_rank = cfg.dt_rank();
-        let in_proj = linear_no_bias(cfg.d_model, d_inner * 2, vb.pp("in_proj"))?;
-        let conv_cfg = candle_nn::Conv1dConfig {
-            groups: d_inner,
-            padding: d_conv - 1,
-            ..Default::default()
-        };
-        let conv1d = candle_nn::conv1d(d_inner, d_inner, d_conv, conv_cfg, vb.pp("conv1d"))?;
-        let x_proj = linear_no_bias(d_inner, dt_rank + d_state * 2, vb.pp("x_proj"))?;
-        let dt_proj = linear(dt_rank, d_inner, vb.pp("dt_proj"))?;
-        let a_log = vb.get((d_inner, d_state), "A_log")?;
-        let d = vb.get(d_inner, "D")?;
-        let out_proj = linear_no_bias(d_inner, cfg.d_model, vb.pp("out_proj"))?;
-        Ok(Self {
-            in_proj,
-            conv1d,
-            x_proj,
-            dt_proj,
-            a_log,
-            d,
-            out_proj,
-            dt_rank,
-        })
-    }
-
-    fn ssm(&self, xs: &Tensor) -> Result<Tensor> {
-        let (_d_in, n) = self.a_log.dims2()?;
-        let a = self.a_log.to_dtype(candle::DType::F32)?.exp()?.neg()?;
-        let d = self.d.to_dtype(candle::DType::F32)?;
-        let x_dbl = xs.apply(&self.x_proj)?;
-        let delta = x_dbl.narrow(D::Minus1, 0, self.dt_rank)?;
-        let b = x_dbl.narrow(D::Minus1, self.dt_rank, n)?;
-        let c = x_dbl.narrow(D::Minus1, self.dt_rank + n, n)?;
-        let delta = delta.contiguous()?.apply(&self.dt_proj)?;
-        // softplus without threshold
-        let delta = (delta.exp()? + 1.)?.log()?;
-        let ss = selective_scan(xs, &delta, &a, &b, &c, &d)?;
-        Ok(ss)
-    }
-}
-
-// https://github.com/johnma2006/mamba-minimal/blob/61f01953ca153f8c4a850d7111beecbf4be9cee1/model.py#L275
-fn selective_scan(
-    u: &Tensor,
-    delta: &Tensor,
-    a: &Tensor,
-    b: &Tensor,
-    c: &Tensor,
-    d: &Tensor,
-) -> Result<Tensor> {
-    let (b_sz, l, d_in) = u.dims3()?;
-    let n = a.dim(1)?;
-    let delta = delta.t()?.reshape((b_sz, d_in, l, 1))?; // b d_in l 1
-    let delta_a = delta.broadcast_mul(&a.reshape((1, d_in, 1, n))?)?.exp()?;
-    let delta_b_u = delta
-        .broadcast_mul(&b.reshape((b_sz, 1, l, n))?)?
-        .broadcast_mul(&u.t()?.reshape((b_sz, d_in, l, 1))?)?;
-    let mut xs = Tensor::zeros((b_sz, d_in, n), delta_a.dtype(), delta_a.device())?;
-    let mut ys = Vec::with_capacity(l);
-    for i in 0..l {
-        xs = ((delta_a.i((.., .., i))? * xs)? + delta_b_u.i((.., .., i))?)?;
-        let y = xs.matmul(&c.i((.., i, ..))?.unsqueeze(2)?)?.squeeze(2)?;
-        ys.push(y)
-    }
-    let ys = Tensor::stack(ys.as_slice(), 1)?;
-    ys + u.broadcast_mul(d)
-}
-
-impl Module for MambaBlock {
-    // https://github.com/johnma2006/mamba-minimal/blob/61f01953ca153f8c4a850d7111beecbf4be9cee1/model.py#L206
-    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
-        let (_b_sz, seq_len, _dim) = xs.dims3()?;
-        let xs_and_res = xs.apply(&self.in_proj)?.chunk(2, D::Minus1)?;
-        let (xs, res) = (&xs_and_res[0], &xs_and_res[1]);
-        let xs = xs
-            .t()?
-            .apply(&self.conv1d)?
-            .narrow(D::Minus1, 0, seq_len)?
-            .t()?;
-        let xs = candle_nn::ops::silu(&xs)?;
-        let ys = (self.ssm(&xs)? * candle_nn::ops::silu(res))?;
-        ys.apply(&self.out_proj)
-    }
-}
-
-// https://github.com/johnma2006/mamba-minimal/blob/61f01953ca153f8c4a850d7111beecbf4be9cee1/model.py#L143
-#[derive(Clone, Debug)]
-pub struct ResidualBlock {
-    mixer: MambaBlock,
-    norm: RmsNorm,
-}
-
-impl ResidualBlock {
-    pub fn new(cfg: &Config, vb: VarBuilder) -> Result<Self> {
-        let norm = candle_nn::rms_norm(cfg.d_model, 1e-5, vb.pp("norm"))?;
-        let mixer = MambaBlock::new(cfg, vb.pp("mixer"))?;
-        Ok(Self { mixer, norm })
-    }
-}
-
-impl Module for ResidualBlock {
-    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
-        xs.apply(&self.norm)?.apply(&self.mixer)? + xs
-    }
-}
-
-// https://github.com/johnma2006/mamba-minimal/blob/61f01953ca153f8c4a850d7111beecbf4be9cee1/model.py#L56
-#[derive(Clone, Debug)]
-pub struct Model {
-    embedding: candle_nn::Embedding,
-    layers: Vec<ResidualBlock>,
-    norm_f: RmsNorm,
-    lm_head: Linear,
-}
-
-impl Model {
-    pub fn new(cfg: &Config, vb: VarBuilder) -> Result<Self> {
-        let embedding = candle_nn::embedding(cfg.vocab_size(), cfg.d_model, vb.pp("embedding"))?;
-        let mut layers = Vec::with_capacity(cfg.n_layer);
-        let vb_l = vb.pp("layers");
-        for layer_idx in 0..cfg.n_layer {
-            let layer = ResidualBlock::new(cfg, vb_l.pp(layer_idx))?;
-            layers.push(layer)
-        }
-        let norm_f = candle_nn::rms_norm(cfg.d_model, 1e-5, vb.pp("norm_f"))?;
-        let lm_head = Linear::from_weights(embedding.embeddings().clone(), None);
-        Ok(Self {
-            embedding,
-            layers,
-            norm_f,
-            lm_head,
-        })
-    }
-}
-
-impl Module for Model {
-    fn forward(&self, input_ids: &Tensor) -> Result<Tensor> {
-        let (_b_size, seq_len) = input_ids.dims2()?;
-        let mut xs = self.embedding.forward(input_ids)?;
-        for layer in self.layers.iter() {
-            xs = layer.forward(&xs)?
-        }
-        xs.narrow(1, seq_len - 1, 1)?
-            .apply(&self.norm_f)?
-            .apply(&self.lm_head)
-    }
-}
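The removed `selective_scan` function runs a sequential state-space recurrence over the sequence dimension. As a rough sketch for a single batch element and a single inner channel (plain `f64` slices instead of tensors; the name `selective_scan_1d` and its argument layout are assumptions made for illustration), the per-step update is `state = exp(delta * a) * state + delta * b * u`, followed by `y = c . state + d * u`:

```rust
/// Scalar-channel version of the selective scan recurrence.
/// `u`, `delta`: one value per time step; `a`: one value per state dimension;
/// `b`, `c`: one vector of length `a.len()` per time step; `d`: skip-connection weight.
fn selective_scan_1d(
    u: &[f64],
    delta: &[f64],
    a: &[f64],
    b: &[Vec<f64>],
    c: &[Vec<f64>],
    d: f64,
) -> Vec<f64> {
    let mut state = vec![0.0; a.len()];
    let mut ys = Vec::with_capacity(u.len());
    for t in 0..u.len() {
        // Skip connection, the `u * d` term added at the end of the tensor version.
        let mut y = d * u[t];
        for j in 0..a.len() {
            // Discretized state update followed by the output projection.
            state[j] = (delta[t] * a[j]).exp() * state[j] + delta[t] * b[t][j] * u[t];
            y += c[t][j] * state[j];
        }
        ys.push(y);
    }
    ys
}
```

The tensor code precomputes `exp(delta * a)` and `delta * b * u` for all time steps and then loops only over the sequence length; the sketch keeps everything inside the loop for readability.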
--- a/candle-examples/examples/mistral/main.rs
+++ b/candle-examples/examples/mistral/main.rs
@ -155,8 +155,8 @@ struct Args {
    #[arg(long, short = 'n', default_value_t = 100)]
    sample_len: usize,

-    #[arg(long)]
-    model_id: Option<String>,
+    #[arg(long, default_value = "lmz/candle-mistral")]
+    model_id: String,

    #[arg(long, default_value = "main")]
    revision: String,
@ -207,18 +207,8 @@ fn main() -> Result<()> {

    let start = std::time::Instant::now();
    let api = Api::new()?;
-    let model_id = match args.model_id {
-        Some(model_id) => model_id,
-        None => {
-            if args.quantized {
-                "lmz/candle-mistral".to_string()
-            } else {
-                "mistralai/Mistral-7B-v0.1".to_string()
-            }
-        }
-    };
    let repo = api.repo(Repo::with_revision(
-        model_id,
+        args.model_id,
        RepoType::Model,
        args.revision,
    ));
@ -235,7 +225,10 @@ fn main() -> Result<()> {
            if args.quantized {
                vec![repo.get("model-q4k.gguf")?]
            } else {
-                candle_examples::hub_load_safetensors(&repo, "model.safetensors.index.json")?
+                vec![
+                    repo.get("pytorch_model-00001-of-00002.safetensors")?,
+                    repo.get("pytorch_model-00002-of-00002.safetensors")?,
+                ]
            }
        }
    };
@ -244,14 +237,13 @@ fn main() -> Result<()> {

    let start = std::time::Instant::now();
    let config = Config::config_7b_v0_1(args.use_flash_attn);
-    let device = candle_examples::device(args.cpu)?;
    let (model, device) = if args.quantized {
        let filename = &filenames[0];
-        let vb =
-            candle_transformers::quantized_var_builder::VarBuilder::from_gguf(filename, &device)?;
+        let vb = candle_transformers::quantized_var_builder::VarBuilder::from_gguf(filename)?;
        let model = QMistral::new(&config, vb)?;
-        (Model::Quantized(model), device)
+        (Model::Quantized(model), Device::Cpu)
    } else {
+        let device = candle_examples::device(args.cpu)?;
        let dtype = if device.is_cuda() {
            DType::BF16
        } else {
--- a/candle-examples/examples/mixtral/README.md
+++ b/candle-examples/examples/mixtral/README.md
@ -1,25 +0,0 @@
-# candle-mixtral: 8x7b LLM using a sparse mixture of experts.
-
-Mixtral-8x7B-v0.1 is a pretrained generative LLM with about 47 billion parameters in total.
-
-- [Blog post](https://mistral.ai/news/mixtral-of-experts/) from Mistral announcing the model release.
-- [Model card](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) on the HuggingFace Hub.
-
-## Running the example
-
-```bash
-$ cargo run --example mixtral --release  -- --prompt "def print_prime(n): "
-def print_prime(n):  # n is the number of prime numbers to be printed
-    i = 2
-    count = 0
-    while (count < n):
-        if (isPrime(i)):
-            print(i)
-            count += 1
-        i += 1
-
-def isPrime(n):
-    for x in range(2, int(n**0.5)+1):
-        if (n % x == 0):
-            ...
-```
--- a/candle-examples/examples/mixtral/main.rs
+++ b/candle-examples/examples/mixtral/main.rs
@ -1,241 +0,0 @@
-#[cfg(feature = "mkl")]
-extern crate intel_mkl_src;
-
-#[cfg(feature = "accelerate")]
-extern crate accelerate_src;
-
-use anyhow::{Error as E, Result};
-use clap::Parser;
-
-use candle_transformers::models::mixtral::{Config, Model};
-
-use candle::{DType, Device, Tensor};
-use candle_examples::token_output_stream::TokenOutputStream;
-use candle_nn::VarBuilder;
-use candle_transformers::generation::LogitsProcessor;
-use hf_hub::{api::sync::Api, Repo, RepoType};
-use tokenizers::Tokenizer;
-
-struct TextGeneration {
-    model: Model,
-    device: Device,
-    tokenizer: TokenOutputStream,
-    logits_processor: LogitsProcessor,
-    repeat_penalty: f32,
-    repeat_last_n: usize,
-}
-
-impl TextGeneration {
-    #[allow(clippy::too_many_arguments)]
-    fn new(
-        model: Model,
-        tokenizer: Tokenizer,
-        seed: u64,
-        temp: Option<f64>,
-        top_p: Option<f64>,
-        repeat_penalty: f32,
-        repeat_last_n: usize,
-        device: &Device,
-    ) -> Self {
-        let logits_processor = LogitsProcessor::new(seed, temp, top_p);
-        Self {
-            model,
-            tokenizer: TokenOutputStream::new(tokenizer),
-            logits_processor,
-            repeat_penalty,
-            repeat_last_n,
-            device: device.clone(),
-        }
-    }
-
-    fn run(&mut self, prompt: &str, sample_len: usize) -> Result<()> {
-        use std::io::Write;
-        self.tokenizer.clear();
-        let mut tokens = self
-            .tokenizer
-            .tokenizer()
-            .encode(prompt, true)
-            .map_err(E::msg)?
-            .get_ids()
-            .to_vec();
-        for &t in tokens.iter() {
-            if let Some(t) = self.tokenizer.next_token(t)? {
-                print!("{t}")
-            }
-        }
-        std::io::stdout().flush()?;
-
-        let mut generated_tokens = 0usize;
-        let eos_token = match self.tokenizer.get_token("</s>") {
-            Some(token) => token,
-            None => anyhow::bail!("cannot find the </s> token"),
-        };
-        let start_gen = std::time::Instant::now();
-        for index in 0..sample_len {
-            let context_size = if index > 0 { 1 } else { tokens.len() };
-            let start_pos = tokens.len().saturating_sub(context_size);
-            let ctxt = &tokens[start_pos..];
-            let input = Tensor::new(ctxt, &self.device)?.unsqueeze(0)?;
-            let logits = self.model.forward(&input, start_pos)?;
-            let logits = logits.squeeze(0)?.squeeze(0)?.to_dtype(DType::F32)?;
-            let logits = if self.repeat_penalty == 1. {
-                logits
-            } else {
-                let start_at = tokens.len().saturating_sub(self.repeat_last_n);
-                candle_transformers::utils::apply_repeat_penalty(
-                    &logits,
-                    self.repeat_penalty,
-                    &tokens[start_at..],
-                )?
-            };
-
-            let next_token = self.logits_processor.sample(&logits)?;
-            tokens.push(next_token);
-            generated_tokens += 1;
-            if next_token == eos_token {
-                break;
-            }
-            if let Some(t) = self.tokenizer.next_token(next_token)? {
-                print!("{t}");
-                std::io::stdout().flush()?;
-            }
-        }
-        let dt = start_gen.elapsed();
-        if let Some(rest) = self.tokenizer.decode_rest().map_err(E::msg)? {
-            print!("{rest}");
-        }
-        std::io::stdout().flush()?;
-        println!(
-            "\n{generated_tokens} tokens generated ({:.2} token/s)",
-            generated_tokens as f64 / dt.as_secs_f64(),
-        );
-        Ok(())
-    }
-}
-
-#[derive(Parser, Debug)]
-#[command(author, version, about, long_about = None)]
-struct Args {
-    /// Run on CPU rather than on GPU.
-    #[arg(long)]
-    cpu: bool,
-
-    /// Enable tracing (generates a trace-timestamp.json file).
-    #[arg(long)]
-    tracing: bool,
-
-    #[arg(long)]
-    use_flash_attn: bool,
-
-    #[arg(long)]
-    prompt: String,
-
-    /// The temperature used to generate samples.
-    #[arg(long)]
-    temperature: Option<f64>,
-
-    /// Nucleus sampling probability cutoff.
-    #[arg(long)]
-    top_p: Option<f64>,
-
-    /// The seed to use when generating random samples.
-    #[arg(long, default_value_t = 299792458)]
-    seed: u64,
-
-    /// The length of the sample to generate (in tokens).
-    #[arg(long, short = 'n', default_value_t = 100)]
-    sample_len: usize,
-
-    #[arg(long, default_value = "mistralai/Mixtral-8x7B-v0.1")]
-    model_id: String,
-
-    #[arg(long, default_value = "main")]
-    revision: String,
-
-    #[arg(long)]
-    tokenizer_file: Option<String>,
-
-    #[arg(long)]
-    weight_files: Option<String>,
-
-    /// Penalty to be applied for repeating tokens, 1. means no penalty.
-    #[arg(long, default_value_t = 1.1)]
-    repeat_penalty: f32,
-
-    /// The context size to consider for the repeat penalty.
-    #[arg(long, default_value_t = 64)]
-    repeat_last_n: usize,
-}
-
-fn main() -> Result<()> {
-    use tracing_chrome::ChromeLayerBuilder;
-    use tracing_subscriber::prelude::*;
-
-    let args = Args::parse();
-    let _guard = if args.tracing {
-        let (chrome_layer, guard) = ChromeLayerBuilder::new().build();
-        tracing_subscriber::registry().with(chrome_layer).init();
-        Some(guard)
-    } else {
-        None
-    };
-    println!(
-        "avx: {}, neon: {}, simd128: {}, f16c: {}",
-        candle::utils::with_avx(),
-        candle::utils::with_neon(),
-        candle::utils::with_simd128(),
-        candle::utils::with_f16c()
-    );
-    println!(
-        "temp: {:.2} repeat-penalty: {:.2} repeat-last-n: {}",
-        args.temperature.unwrap_or(0.),
-        args.repeat_penalty,
-        args.repeat_last_n
-    );
-
-    let start = std::time::Instant::now();
-    let api = Api::new()?;
-    let repo = api.repo(Repo::with_revision(
-        args.model_id,
-        RepoType::Model,
-        args.revision,
-    ));
-    let tokenizer_filename = match args.tokenizer_file {
-        Some(file) => std::path::PathBuf::from(file),
-        None => repo.get("tokenizer.json")?,
-    };
-    let filenames = match args.weight_files {
-        Some(files) => files
-            .split(',')
-            .map(std::path::PathBuf::from)
-            .collect::<Vec<_>>(),
-        None => candle_examples::hub_load_safetensors(&repo, "model.safetensors.index.json")?,
-    };
-    println!("retrieved the files in {:?}", start.elapsed());
-    let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;
-
-    let start = std::time::Instant::now();
-    let config = Config::v0_1_8x7b(args.use_flash_attn);
-    let device = candle_examples::device(args.cpu)?;
-    let dtype = if device.is_cuda() {
-        DType::BF16
-    } else {
-        DType::F32
-    };
-    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&filenames, dtype, &device)? };
-    let model = Model::new(&config, vb)?;
-    println!("loaded the model in {:?}", start.elapsed());
-
-    let mut pipeline = TextGeneration::new(
-        model,
-        tokenizer,
-        args.seed,
-        args.temperature,
-        args.top_p,
-        args.repeat_penalty,
-        args.repeat_last_n,
-        &device,
-    );
-    pipeline.run(&args.prompt, args.sample_len)?;
-    Ok(())
-}
--- a/candle-examples/examples/musicgen/musicgen_model.rs
+++ b/candle-examples/examples/musicgen/musicgen_model.rs
@ -321,7 +321,7 @@ impl MusicgenDecoder {
        let positions = self.embed_positions.forward(&input)?.to_device(dev)?;
        let mut xs = inputs_embeds.broadcast_add(&positions)?;
        let attention_mask = self.prepare_decoder_attention_mask(b_sz, seq_len)?;
-        for decoder_layer in self.layers.iter_mut() {
+        for (_layer_idx, decoder_layer) in self.layers.iter_mut().enumerate() {
            xs = decoder_layer.forward(&xs, &attention_mask, None)?;
        }
        let xs = self.layer_norm.forward(&xs)?;
--- a/candle-examples/examples/phi/README.md
+++ b/candle-examples/examples/phi/README.md
@ -1,33 +1,14 @@
-# candle-phi: 1.3b and 2.7b LLM with state of the art performance for <10b models.
+# candle-phi: 1.3b LLM with state of the art performance for <10b models.

-[Phi-1.5](https://huggingface.co/microsoft/phi-1_5) and
-[Phi-2](https://huggingface.co/microsoft/phi-2) are language models using
-only 1.3 and 2.7 billion parameters but with state of the art performance compared to
+[Phi-1.5](https://huggingface.co/microsoft/phi-1_5) is a language model using
+only 1.3 billion parameters but with state of the art performance compared to
 models with up to 10 billion parameters.

 The candle implementation provides both the standard version as well as a
 quantized variant.

-## Running some examples
+## Running an example

-For the v2 version.
-```bash
-$ cargo run --example phi --release -- --model 2 \
-  --prompt "A skier slides down a frictionless slope of height 40m and length 80m. What's the skier speed at the bottom?"
-
-A skier slides down a frictionless slope of height 40m and length 80m. What's the skier speed at the bottom?
-
-Solution:
-The potential energy of the skier is converted into kinetic energy as it slides down the slope. The formula for potential energy is mgh, where m is mass, g is acceleration due to gravity (9.8 m/s^2), and h is height. Since there's no friction, all the potential energy is converted into kinetic energy at the bottom of the slope. The formula for kinetic energy is 1/2mv^2, where v is velocity. We can equate these two formulas:
-mgh = 1/2mv^2
-Solving for v, we get:
-v = sqrt(2gh)
-Substituting the given values, we get:
-v = sqrt(2*9.8*40) = 28 m/s
-Therefore, the skier speed at the bottom of the slope is 28 m/s.
-```
-
-For the v1.5 version.
 ```bash
 $ cargo run --example phi --release -- --prompt "def print_prime(n): "

--- a/candle-examples/examples/phi/main.rs
+++ b/candle-examples/examples/phi/main.rs
@ -123,8 +123,6 @@ enum WhichModel {
    V1,
    #[value(name = "1.5")]
    V1_5,
-    #[value(name = "2")]
-    V2,
    PuffinPhiV2,
    PhiHermes,
 }
@ -145,10 +143,7 @@ struct Args {
    verbose_prompt: bool,

    #[arg(long)]
-    prompt: Option<String>,
-
-    #[arg(long)]
-    mmlu_dir: Option<String>,
+    prompt: String,

    /// The temperature used to generate samples.
    #[arg(long)]
@ -163,7 +158,7 @@ struct Args {
    seed: u64,

    /// The length of the sample to generate (in tokens).
-    #[arg(long, short = 'n', default_value_t = 5000)]
+    #[arg(long, short = 'n', default_value_t = 100)]
    sample_len: usize,

    #[arg(long)]
@ -230,7 +225,6 @@ fn main() -> Result<()> {
                match args.model {
                    WhichModel::V1 => "microsoft/phi-1".to_string(),
                    WhichModel::V1_5 => "microsoft/phi-1_5".to_string(),
-                    WhichModel::V2 => "microsoft/phi-2".to_string(),
                    WhichModel::PuffinPhiV2 | WhichModel::PhiHermes => {
                        "lmz/candle-quantized-phi".to_string()
                    }
@ -247,9 +241,7 @@ fn main() -> Result<()> {
                match args.model {
                    WhichModel::V1 => "refs/pr/2".to_string(),
                    WhichModel::V1_5 => "refs/pr/18".to_string(),
-                    WhichModel::V2 | WhichModel::PuffinPhiV2 | WhichModel::PhiHermes => {
-                        "main".to_string()
-                    }
+                    WhichModel::PuffinPhiV2 | WhichModel::PhiHermes => "main".to_string(),
                }
            }
        }
@ -258,32 +250,27 @@ fn main() -> Result<()> {
    let tokenizer_filename = match args.tokenizer {
        Some(file) => std::path::PathBuf::from(file),
        None => match args.model {
-            WhichModel::V1 | WhichModel::V1_5 | WhichModel::V2 => repo.get("tokenizer.json")?,
+            WhichModel::V1 | WhichModel::V1_5 => repo.get("tokenizer.json")?,
            WhichModel::PuffinPhiV2 | WhichModel::PhiHermes => {
                repo.get("tokenizer-puffin-phi-v2.json")?
            }
        },
    };
-    let filenames = match args.weight_file {
-        Some(weight_file) => vec![std::path::PathBuf::from(weight_file)],
+    let filename = match args.weight_file {
+        Some(weight_file) => std::path::PathBuf::from(weight_file),
        None => {
            if args.quantized {
                match args.model {
-                    WhichModel::V1 => vec![repo.get("model-v1-q4k.gguf")?],
-                    WhichModel::V1_5 => vec![repo.get("model-q4k.gguf")?],
-                    WhichModel::V2 => vec![repo.get("model-v2-q4k.gguf")?],
-                    WhichModel::PuffinPhiV2 => vec![repo.get("model-puffin-phi-v2-q4k.gguf")?],
-                    WhichModel::PhiHermes => vec![repo.get("model-phi-hermes-1_3B-q4k.gguf")?],
+                    WhichModel::V1 => repo.get("model-v1-q4k.gguf")?,
+                    WhichModel::V1_5 => repo.get("model-q4k.gguf")?,
+                    WhichModel::PuffinPhiV2 => repo.get("model-puffin-phi-v2-q4k.gguf")?,
+                    WhichModel::PhiHermes => repo.get("model-phi-hermes-1_3B-q4k.gguf")?,
                }
            } else {
                match args.model {
-                    WhichModel::V1 | WhichModel::V1_5 => vec![repo.get("model.safetensors")?],
-                    WhichModel::V2 => candle_examples::hub_load_safetensors(
-                        &repo,
-                        "model.safetensors.index.json",
-                    )?,
-                    WhichModel::PuffinPhiV2 => vec![repo.get("model-puffin-phi-v2.safetensors")?],
-                    WhichModel::PhiHermes => vec![repo.get("model-phi-hermes-1_3B.safetensors")?],
+                    WhichModel::V1 | WhichModel::V1_5 => repo.get("model.safetensors")?,
+                    WhichModel::PuffinPhiV2 => repo.get("model-puffin-phi-v2.safetensors")?,
+                    WhichModel::PhiHermes => repo.get("model-phi-hermes-1_3B.safetensors")?,
                }
            }
        }
@ -295,132 +282,32 @@ fn main() -> Result<()> {
    let config = match args.model {
        WhichModel::V1 => Config::v1(),
        WhichModel::V1_5 => Config::v1_5(),
-        WhichModel::V2 => Config::v2(),
        WhichModel::PuffinPhiV2 => Config::puffin_phi_v2(),
        WhichModel::PhiHermes => Config::phi_hermes_1_3b(),
    };
-    let device = candle_examples::device(args.cpu)?;
-    let model = if args.quantized {
-        let vb = candle_transformers::quantized_var_builder::VarBuilder::from_gguf(
-            &filenames[0],
-            &device,
-        )?;
-        println!("Loaded vb");
-        let model = match args.model {
-            WhichModel::V2 => QMixFormer::new_v2(&config, vb)?,
-            _ => QMixFormer::new(&config, vb)?,
-        };
-        println!("Loaded model");
-        Model::Quantized(model)
+    let (model, device) = if args.quantized {
+        let vb = candle_transformers::quantized_var_builder::VarBuilder::from_gguf(&filename)?;
+        let model = QMixFormer::new(&config, vb)?;
+        (Model::Quantized(model), Device::Cpu)
    } else {
-        let vb = unsafe { VarBuilder::from_mmaped_safetensors(&filenames, DType::F32, &device)? };
-        let model = match args.model {
-            WhichModel::V2 => MixFormer::new_v2(&config, vb)?,
-            _ => MixFormer::new(&config, vb)?,
-        };
-        Model::MixFormer(model)
+        let device = candle_examples::device(args.cpu)?;
+        let vb = unsafe { VarBuilder::from_mmaped_safetensors(&[filename], DType::F32, &device)? };
+        let model = MixFormer::new(&config, vb)?;
+        (Model::MixFormer(model), device)
    };
    println!("loaded the model in {:?}", start.elapsed());

-    match (args.prompt, args.mmlu_dir) {
-        (None, None) | (Some(_), Some(_)) => {
-            anyhow::bail!("exactly one of --prompt and --mmlu-dir must be specified")
-        }
-        (Some(prompt), None) => {
-            let mut pipeline = TextGeneration::new(
-                model,
-                tokenizer,
-                args.seed,
-                args.temperature,
-                args.top_p,
-                args.repeat_penalty,
-                args.repeat_last_n,
-                args.verbose_prompt,
-                &device,
-            );
-            pipeline.run(&prompt, args.sample_len)?;
-        }
-        (None, Some(mmlu_dir)) => mmlu(model, tokenizer, &device, mmlu_dir)?,
-    }
-    Ok(())
-}
-
-fn mmlu<P: AsRef<std::path::Path>>(
-    mut model: Model,
-    tokenizer: Tokenizer,
-    device: &Device,
-    mmlu_dir: P,
-) -> anyhow::Result<()> {
-    for dir_entry in mmlu_dir.as_ref().read_dir()?.flatten() {
-        let dir_entry = dir_entry.path();
-        let theme = match dir_entry.file_stem().and_then(|v| v.to_str()) {
-            None => "".to_string(),
-            Some(v) => match v.strip_suffix("_test") {
-                None => v.replace('_', " "),
-                Some(v) => v.replace('_', " "),
-            },
-        };
-        if dir_entry.extension().as_ref().and_then(|v| v.to_str()) != Some("csv") {
-            continue;
-        }
-        println!("reading {dir_entry:?}");
-        let dir_entry = std::fs::File::open(dir_entry)?;
-        let mut reader = csv::ReaderBuilder::new()
-            .has_headers(false)
-            .from_reader(dir_entry);
-        let token_a = tokenizer.token_to_id("A").unwrap();
-        let token_b = tokenizer.token_to_id("B").unwrap();
-        let token_c = tokenizer.token_to_id("C").unwrap();
-        let token_d = tokenizer.token_to_id("D").unwrap();
-        for row in reader.records() {
-            let row = match row {
-                Err(_) => continue,
-                Ok(row) => row,
-            };
-            if row.len() < 5 {
-                continue;
-            }
-            let question = row.get(0).unwrap();
-            let answer_a = row.get(1).unwrap();
-            let answer_b = row.get(2).unwrap();
-            let answer_c = row.get(3).unwrap();
-            let answer_d = row.get(4).unwrap();
-            let answer = row.get(5).unwrap();
-            let prompt = format!(
-                    "{} {theme}.\n{question}\nA. {answer_a}\nB. {answer_b}\nC. {answer_c}\nD. {answer_d}\nAnswer:\n",
-                    "The following are multiple choice questions (with answers) about"
-                );
-            let tokens = tokenizer.encode(prompt.as_str(), true).map_err(E::msg)?;
-            let tokens = tokens.get_ids().to_vec();
-            let input = Tensor::new(tokens, device)?.unsqueeze(0)?;
-            let logits = match &mut model {
-                Model::MixFormer(m) => {
-                    m.clear_kv_cache();
-                    m.forward(&input)?
-                }
-                Model::Quantized(m) => {
-                    m.clear_kv_cache();
-                    m.forward(&input)?
-                }
-            };
-            let logits = logits.squeeze(0)?.to_dtype(DType::F32)?;
-            let logits_v: Vec<f32> = logits.to_vec1()?;
-            let pr_a = logits_v[token_a as usize];
-            let pr_b = logits_v[token_b as usize];
-            let pr_c = logits_v[token_c as usize];
-            let pr_d = logits_v[token_d as usize];
-            let model_answer = if pr_a > pr_b && pr_a > pr_c && pr_a > pr_d {
-                "A"
-            } else if pr_b > pr_c && pr_b > pr_d {
-                "B"
-            } else if pr_c > pr_d {
-                "C"
-            } else {
-                "D"
-            };
-
-            println!("{prompt}\n -> {model_answer} vs {answer}");
-        }
-    }
+    let mut pipeline = TextGeneration::new(
+        model,
+        tokenizer,
+        args.seed,
+        args.temperature,
+        args.top_p,
+        args.repeat_penalty,
+        args.repeat_last_n,
+        args.verbose_prompt,
+        &device,
+    );
+    pipeline.run(&args.prompt, args.sample_len)?;
    Ok(())
 }
--- a/candle-examples/examples/quantized-t5/main.rs
+++ b/candle-examples/examples/quantized-t5/main.rs
@ -132,8 +132,7 @@ impl T5ModelBuilder {
    }

    pub fn build_model(&self) -> Result<t5::T5ForConditionalGeneration> {
-        let device = Device::Cpu;
-        let vb = t5::VarBuilder::from_gguf(&self.weights_filename, &device)?;
+        let vb = t5::VarBuilder::from_gguf(&self.weights_filename)?;
        Ok(t5::T5ForConditionalGeneration::load(vb, &self.config)?)
    }

--- a/candle-examples/examples/quantized/README.md
+++ b/candle-examples/examples/quantized/README.md
@ -26,19 +26,6 @@ cargo run --example quantized --release -- --prompt "The best thing about coding
 > The best thing about coding in rust is 1.) that I don’t need to worry about memory leaks, 2.) speed and 3.) my program will compile even on old machines.
 ```

-Using the mixtral sparse mixture of expert model:
-```bash
-
-$ cargo run --example quantized --release -- --which mixtral --prompt "Lebesgue's integral is superior to Riemann's because "
-> avx: true, neon: false, simd128: false, f16c: true
-> temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
-> loaded 995 tensors (26.44GB) in 0.03s
-Lebesgue's integral is superior to Riemann's because 1. it is defined for a wider class of functions, those which are absolutely integrable; 2. the definition does not involve limits in two variables---one being computed before the other (which makes some computations more difficult); and 3. interchange of order of integration is easier to establish than with Riemann's integral. On the other hand, Lebesgue's integral applies only for bounded functions defined on finite intervals; it does not provide numerical values for improper integrals. The latter are best evaluated using Cauchy's limit definition.
-
-The reason $f(x) = x^2$ is discontinuous at the ends of its interval of definition, and Riemann's integral requires continuity on the whole of an open interval containing it (see our earlier post), sine no such function exists with this property, is that the endpoints are infinite in measure for Lebesgue's integral.
- ```
-
-
 ## Command-line flags

 Run with `--help` to see all options.
--- a/candle-examples/examples/quantized/main.rs
+++ b/candle-examples/examples/quantized/main.rs
@ -9,7 +9,7 @@ use std::io::Write;
 use tokenizers::Tokenizer;

 use candle::quantized::{ggml_file, gguf_file};
-use candle::Tensor;
+use candle::{Device, Tensor};
 use candle_transformers::generation::LogitsProcessor;

 use candle_examples::token_output_stream::TokenOutputStream;
@ -45,28 +45,14 @@ enum Which {
    L13bCode,
    #[value(name = "32b-code")]
    L34bCode,
-    #[value(name = "7b-leo")]
-    Leo7b,
-    #[value(name = "13b-leo")]
-    Leo13b,
    #[value(name = "7b-mistral")]
    Mistral7b,
    #[value(name = "7b-mistral-instruct")]
    Mistral7bInstruct,
-    #[value(name = "7b-mistral-instruct-v0.2")]
-    Mistral7bInstructV02,
    #[value(name = "7b-zephyr-a")]
    Zephyr7bAlpha,
    #[value(name = "7b-zephyr-b")]
    Zephyr7bBeta,
-    #[value(name = "7b-open-chat-3.5")]
-    OpenChat35,
-    #[value(name = "7b-starling-a")]
-    Starling7bAlpha,
-    #[value(name = "mixtral")]
-    Mixtral,
-    #[value(name = "mixtral-instruct")]
-    MixtralInstruct,
 }

 impl Which {
@ -80,20 +66,12 @@ impl Which {
            | Self::L70bChat
            | Self::L7bCode
            | Self::L13bCode
-            | Self::L34bCode
-            | Self::Leo7b
-            | Self::Leo13b => false,
-            // Zephyr and OpenChat are fine tuned versions of mistral and should be treated in the
-            // same way. Starling is a fine tuned version of OpenChat.
-            Self::OpenChat35
-            | Self::Starling7bAlpha
-            | Self::Zephyr7bAlpha
+            | Self::L34bCode => false,
+            // Zephyr is a fine tuned version of mistral and should be treated in the same way.
+            Self::Zephyr7bAlpha
            | Self::Zephyr7bBeta
-            | Self::Mixtral
-            | Self::MixtralInstruct
            | Self::Mistral7b
-            | Self::Mistral7bInstruct
-            | Self::Mistral7bInstructV02 => true,
+            | Self::Mistral7bInstruct => true,
        }
    }

@ -108,73 +86,17 @@ impl Which {
            | Self::L7bCode
            | Self::L13bCode
            | Self::L34bCode
-            | Self::Leo7b
-            | Self::Leo13b
-            | Self::Mixtral
-            | Self::MixtralInstruct
            | Self::Mistral7b
-            | Self::Mistral7bInstruct
-            | Self::Mistral7bInstructV02
-            | Self::OpenChat35
-            | Self::Starling7bAlpha => false,
+            | Self::Mistral7bInstruct => false,
            Self::Zephyr7bAlpha | Self::Zephyr7bBeta => true,
        }
    }
-
-    fn is_open_chat(&self) -> bool {
-        match self {
-            Self::L7b
-            | Self::L13b
-            | Self::L70b
-            | Self::L7bChat
-            | Self::L13bChat
-            | Self::L70bChat
-            | Self::L7bCode
-            | Self::L13bCode
-            | Self::L34bCode
-            | Self::Leo7b
-            | Self::Leo13b
-            | Self::Mixtral
-            | Self::MixtralInstruct
-            | Self::Mistral7b
-            | Self::Mistral7bInstruct
-            | Self::Mistral7bInstructV02
-            | Self::Zephyr7bAlpha
-            | Self::Zephyr7bBeta => false,
-            Self::OpenChat35 | Self::Starling7bAlpha => true,
-        }
-    }
-
-    fn tokenizer_repo(&self) -> &'static str {
-        match self {
-            Which::L7b
-            | Which::L13b
-            | Which::L70b
-            | Which::L7bChat
-            | Which::L13bChat
-            | Which::L70bChat
-            | Which::L7bCode
-            | Which::L13bCode
-            | Which::L34bCode => "hf-internal-testing/llama-tokenizer",
-            Which::Leo7b => "LeoLM/leo-hessianai-7b",
-            Which::Leo13b => "LeoLM/leo-hessianai-13b",
-            Which::Mixtral => "mistralai/Mixtral-8x7B-v0.1",
-            Which::MixtralInstruct => "mistralai/Mixtral-8x7B-Instruct-v0.1",
-            Which::Mistral7b
-            | Which::Mistral7bInstruct
-            | Which::Mistral7bInstructV02
-            | Which::Zephyr7bAlpha
-            | Which::Zephyr7bBeta => "mistralai/Mistral-7B-v0.1",
-            Which::OpenChat35 => "openchat/openchat_3.5",
-            Which::Starling7bAlpha => "berkeley-nest/Starling-LM-7B-alpha",
-        }
-    }
 }

 #[derive(Parser, Debug)]
 #[command(author, version, about, long_about = None)]
 struct Args {
-    /// GGML/GGUF file to load, typically a .bin/.gguf file generated by the quantize command from llama.cpp
+    /// GGML file to load, typically a .bin file generated by the quantize command from llama.cpp
    #[arg(long)]
    model: Option<String>,

@ -235,7 +157,11 @@ impl Args {
            Some(config) => std::path::PathBuf::from(config),
            None => {
                let api = hf_hub::api::sync::Api::new()?;
-                let repo = self.which.tokenizer_repo();
+                let repo = if self.which.is_mistral() {
+                    "mistralai/Mistral-7B-v0.1"
+                } else {
+                    "hf-internal-testing/llama-tokenizer"
+                };
                let api = api.model(repo.to_string());
                api.get("tokenizer.json")?
            }
@ -266,22 +192,6 @@ impl Args {
                    Which::L7bCode => ("TheBloke/CodeLlama-7B-GGUF", "codellama-7b.Q8_0.gguf"),
                    Which::L13bCode => ("TheBloke/CodeLlama-13B-GGUF", "codellama-13b.Q8_0.gguf"),
                    Which::L34bCode => ("TheBloke/CodeLlama-34B-GGUF", "codellama-34b.Q8_0.gguf"),
-                    Which::Leo7b => (
-                        "TheBloke/leo-hessianai-7B-GGUF",
-                        "leo-hessianai-7b.Q4_K_M.gguf",
-                    ),
-                    Which::Leo13b => (
-                        "TheBloke/leo-hessianai-13B-GGUF",
-                        "leo-hessianai-13b.Q4_K_M.gguf",
-                    ),
-                    Which::Mixtral => (
-                        "TheBloke/Mixtral-8x7B-v0.1-GGUF",
-                        "mixtral-8x7b-v0.1.Q4_K_M.gguf",
-                    ),
-                    Which::MixtralInstruct => (
-                        "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",
-                        "mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
-                    ),
                    Which::Mistral7b => (
                        "TheBloke/Mistral-7B-v0.1-GGUF",
                        "mistral-7b-v0.1.Q4_K_S.gguf",
@ -290,10 +200,6 @@ impl Args {
                        "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
                        "mistral-7b-instruct-v0.1.Q4_K_S.gguf",
                    ),
-                    Which::Mistral7bInstructV02 => (
-                        "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
-                        "mistral-7b-instruct-v0.2.Q4_K_S.gguf",
-                    ),
                    Which::Zephyr7bAlpha => (
                        "TheBloke/zephyr-7B-alpha-GGUF",
                        "zephyr-7b-alpha.Q4_K_M.gguf",
@ -301,11 +207,6 @@ impl Args {
                    Which::Zephyr7bBeta => {
                        ("TheBloke/zephyr-7B-beta-GGUF", "zephyr-7b-beta.Q4_K_M.gguf")
                    }
-                    Which::OpenChat35 => ("TheBloke/openchat_3.5-GGUF", "openchat_3.5.Q4_K_M.gguf"),
-                    Which::Starling7bAlpha => (
-                        "TheBloke/Starling-LM-7B-alpha-GGUF",
-                        "starling-lm-7b-alpha.Q4_K_M.gguf",
-                    ),
                };
                let api = hf_hub::api::sync::Api::new()?;
                let api = api.model(repo.to_string());
@ -361,16 +262,15 @@ fn main() -> anyhow::Result<()> {
    let model_path = args.model()?;
    let mut file = std::fs::File::open(&model_path)?;
    let start = std::time::Instant::now();
-    let device = candle_examples::device(false)?;

    let mut model = match model_path.extension().and_then(|v| v.to_str()) {
        Some("gguf") => {
-            let model = gguf_file::Content::read(&mut file).map_err(|e| e.with_path(model_path))?;
+            let model = gguf_file::Content::read(&mut file)?;
            let mut total_size_in_bytes = 0;
            for (_, tensor) in model.tensor_infos.iter() {
                let elem_count = tensor.shape.elem_count();
                total_size_in_bytes +=
-                    elem_count * tensor.ggml_dtype.type_size() / tensor.ggml_dtype.block_size();
+                    elem_count * tensor.ggml_dtype.type_size() / tensor.ggml_dtype.blck_size();
            }
            println!(
                "loaded {:?} tensors ({}) in {:.2}s",
@ -378,16 +278,15 @@ fn main() -> anyhow::Result<()> {
                &format_size(total_size_in_bytes),
                start.elapsed().as_secs_f32(),
            );
-            ModelWeights::from_gguf(model, &mut file, &device)?
+            ModelWeights::from_gguf(model, &mut file)?
        }
        Some("ggml" | "bin") | Some(_) | None => {
-            let model = ggml_file::Content::read(&mut file, &device)
-                .map_err(|e| e.with_path(model_path))?;
+            let model = ggml_file::Content::read(&mut file)?;
            let mut total_size_in_bytes = 0;
            for (_, tensor) in model.tensors.iter() {
                let elem_count = tensor.shape().elem_count();
                total_size_in_bytes +=
-                    elem_count * tensor.dtype().type_size() / tensor.dtype().block_size();
+                    elem_count * tensor.dtype().type_size() / tensor.dtype().blck_size();
            }
            println!(
                "loaded {:?} tensors ({}) in {:.2}s",
@ -403,20 +302,13 @@ fn main() -> anyhow::Result<()> {
                | Which::L13bChat
                | Which::L7bCode
                | Which::L13bCode
-                | Which::L34bCode
-                | Which::Leo7b
-                | Which::Leo13b => 1,
-                Which::Mixtral
-                | Which::MixtralInstruct
-                | Which::Mistral7b
+                | Which::L34bCode => 1,
+                Which::Mistral7b
                | Which::Mistral7bInstruct
-                | Which::Mistral7bInstructV02
                | Which::Zephyr7bAlpha
                | Which::Zephyr7bBeta
                | Which::L70b
-                | Which::L70bChat
-                | Which::OpenChat35
-                | Which::Starling7bAlpha => 8,
+                | Which::L70bChat => 8,
            };
            ModelWeights::from_ggml(model, args.gqa.unwrap_or(default_gqa))?
        }
@ -448,9 +340,7 @@ fn main() -> anyhow::Result<()> {
                        prompt.pop();
                    }
                }
-                if args.which.is_open_chat() {
-                    format!("GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:")
-                } else if args.which.is_zephyr() {
+                if args.which.is_zephyr() {
                    if prompt_index == 0 || is_interactive {
                        format!("<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>",)
                    } else {
@ -488,7 +378,7 @@ fn main() -> anyhow::Result<()> {

        let start_prompt_processing = std::time::Instant::now();
        let mut next_token = {
-            let input = Tensor::new(prompt_tokens.as_slice(), &device)?.unsqueeze(0)?;
+            let input = Tensor::new(prompt_tokens.as_slice(), &Device::Cpu)?.unsqueeze(0)?;
            let logits = model.forward(&input, 0)?;
            let logits = logits.squeeze(0)?;
            logits_processor.sample(&logits)?
@ -500,16 +390,12 @@ fn main() -> anyhow::Result<()> {
            std::io::stdout().flush()?;
        }

-        let eos_token = if args.which.is_open_chat() {
-            "<|end_of_turn|>"
-        } else {
-            "</s>"
-        };
-        let eos_token = *tos.tokenizer().get_vocab(true).get(eos_token).unwrap();
+        let eos_token = *tos.tokenizer().get_vocab(true).get("</s>").unwrap();
+
        let start_post_prompt = std::time::Instant::now();
        let mut sampled = 0;
        for index in 0..to_sample {
-            let input = Tensor::new(&[next_token], &device)?.unsqueeze(0)?;
+            let input = Tensor::new(&[next_token], &Device::Cpu)?.unsqueeze(0)?;
            let logits = model.forward(&input, prompt_tokens.len() + index)?;
            let logits = logits.squeeze(0)?;
            let logits = if args.repeat_penalty == 1. {
--- a/candle-examples/examples/reinforcement-learning/README.md
+++ b/candle-examples/examples/reinforcement-learning/README.md
@ -8,16 +8,9 @@ Python package with:
 pip install "gymnasium[accept-rom-license]"
 ```

-In order to run the examples, use the following commands. Note the additional
+In order to run the example, use the following command. Note the additional
 `--package` flag to ensure that there is no conflict with the `candle-pyo3`
 crate.
-
-For the Policy Gradient example:
 ```bash
-cargo run --example reinforcement-learning --features=pyo3 --package candle-examples -- pg
-```
-
-For the Deep Deterministic Policy Gradient example:
-```bash
-cargo run --example reinforcement-learning --features=pyo3 --package candle-examples -- ddpg
+cargo run --example reinforcement-learning --features=pyo3 --package candle-examples
 ```
--- a/candle-examples/examples/reinforcement-learning/atari_wrappers.py
+++ b/candle-examples/examples/reinforcement-learning/atari_wrappers.py
@ -78,7 +78,7 @@ class EpisodicLifeEnv(gym.Wrapper):
        # then update lives to handle bonus lives
        lives = self.env.unwrapped.ale.lives()
        if lives < self.lives and lives > 0:
-            # for Qbert sometimes we stay in lives == 0 condition for a few frames
+            # for Qbert somtimes we stay in lives == 0 condtion for a few frames
            # so its important to keep lives > 0, so that we only reset once
            # the environment advertises done.
            done = True
--- a/candle-examples/examples/reinforcement-learning/ddpg.rs
+++ b/candle-examples/examples/reinforcement-learning/ddpg.rs
@ -8,8 +8,6 @@ use candle_nn::{
 };
 use rand::{distributions::Uniform, thread_rng, Rng};

-use super::gym_env::GymEnv;
-
 pub struct OuNoise {
    mu: f64,
    theta: f64,
@ -451,106 +449,3 @@ impl DDPG<'_> {
        Ok(())
    }
 }
-
-// The impact of the q value of the next state on the current state's q value.
-const GAMMA: f64 = 0.99;
-// The weight for updating the target networks.
-const TAU: f64 = 0.005;
-// The capacity of the replay buffer used for sampling training data.
-const REPLAY_BUFFER_CAPACITY: usize = 100_000;
-// The training batch size for each training iteration.
-const TRAINING_BATCH_SIZE: usize = 100;
-// The total number of episodes.
-const MAX_EPISODES: usize = 100;
-// The maximum length of an episode.
-const EPISODE_LENGTH: usize = 200;
-// The number of training iterations after one episode finishes.
-const TRAINING_ITERATIONS: usize = 200;
-
-// Ornstein-Uhlenbeck process parameters.
-const MU: f64 = 0.0;
-const THETA: f64 = 0.15;
-const SIGMA: f64 = 0.1;
-
-const ACTOR_LEARNING_RATE: f64 = 1e-4;
-const CRITIC_LEARNING_RATE: f64 = 1e-3;
-
-pub fn run() -> Result<()> {
-    let env = GymEnv::new("Pendulum-v1")?;
-    println!("action space: {}", env.action_space());
-    println!("observation space: {:?}", env.observation_space());
-
-    let size_state = env.observation_space().iter().product::<usize>();
-    let size_action = env.action_space();
-
-    let mut agent = DDPG::new(
-        &Device::Cpu,
-        size_state,
-        size_action,
-        true,
-        ACTOR_LEARNING_RATE,
-        CRITIC_LEARNING_RATE,
-        GAMMA,
-        TAU,
-        REPLAY_BUFFER_CAPACITY,
-        OuNoise::new(MU, THETA, SIGMA, size_action)?,
-    )?;
-
-    let mut rng = rand::thread_rng();
-
-    for episode in 0..MAX_EPISODES {
-        // let mut state = env.reset(episode as u64)?;
-        let mut state = env.reset(rng.gen::<u64>())?;
-
-        let mut total_reward = 0.0;
-        for _ in 0..EPISODE_LENGTH {
-            let mut action = 2.0 * agent.actions(&state)?;
-            action = action.clamp(-2.0, 2.0);
-
-            let step = env.step(vec![action])?;
-            total_reward += step.reward;
-
-            agent.remember(
-                &state,
-                &Tensor::new(vec![action], &Device::Cpu)?,
-                &Tensor::new(vec![step.reward as f32], &Device::Cpu)?,
-                &step.state,
-                step.terminated,
-                step.truncated,
-            );
-
-            if step.terminated || step.truncated {
-                break;
-            }
-            state = step.state;
-        }
-
-        println!("episode {episode} with total reward of {total_reward}");
-
-        for _ in 0..TRAINING_ITERATIONS {
-            agent.train(TRAINING_BATCH_SIZE)?;
-        }
-    }
-
-    println!("Testing...");
-    agent.train = false;
-    for episode in 0..10 {
-        // let mut state = env.reset(episode as u64)?;
-        let mut state = env.reset(rng.gen::<u64>())?;
-        let mut total_reward = 0.0;
-        for _ in 0..EPISODE_LENGTH {
-            let mut action = 2.0 * agent.actions(&state)?;
-            action = action.clamp(-2.0, 2.0);
-
-            let step = env.step(vec![action])?;
-            total_reward += step.reward;
-
-            if step.terminated || step.truncated {
-                break;
-            }
-            state = step.state;
-        }
-        println!("episode {episode} with total reward of {total_reward}");
-    }
-    Ok(())
-}
--- a/candle-examples/examples/reinforcement-learning/main.rs
+++ b/candle-examples/examples/reinforcement-learning/main.rs
@ -6,32 +6,139 @@ extern crate intel_mkl_src;
 #[cfg(feature = "accelerate")]
 extern crate accelerate_src;

-use candle::Result;
-use clap::{Parser, Subcommand};
-
 mod gym_env;
 mod vec_gym_env;

 mod ddpg;
-mod policy_gradient;

-#[derive(Parser)]
+use candle::{Device, Result, Tensor};
+use clap::Parser;
+use rand::Rng;
+
+// The impact of the q value of the next state on the current state's q value.
+const GAMMA: f64 = 0.99;
+// The weight for updating the target networks.
+const TAU: f64 = 0.005;
+// The capacity of the replay buffer used for sampling training data.
+const REPLAY_BUFFER_CAPACITY: usize = 100_000;
+// The training batch size for each training iteration.
+const TRAINING_BATCH_SIZE: usize = 100;
+// The total number of episodes.
+const MAX_EPISODES: usize = 100;
+// The maximum length of an episode.
+const EPISODE_LENGTH: usize = 200;
+// The number of training iterations after one episode finishes.
+const TRAINING_ITERATIONS: usize = 200;
+
+// Ornstein-Uhlenbeck process parameters.
+const MU: f64 = 0.0;
+const THETA: f64 = 0.15;
+const SIGMA: f64 = 0.1;
+
+const ACTOR_LEARNING_RATE: f64 = 1e-4;
+const CRITIC_LEARNING_RATE: f64 = 1e-3;
+
+#[derive(Parser, Debug, Clone)]
+#[command(author, version, about, long_about = None)]
 struct Args {
-    #[command(subcommand)]
-    command: Command,
-}
+    /// Run on CPU rather than on GPU.
+    #[arg(long)]
+    cpu: bool,

-#[derive(Subcommand)]
-enum Command {
-    Pg,
-    Ddpg,
+    /// Enable tracing (generates a trace-timestamp.json file).
+    #[arg(long)]
+    tracing: bool,
 }

 fn main() -> Result<()> {
+    use tracing_chrome::ChromeLayerBuilder;
+    use tracing_subscriber::prelude::*;
+
    let args = Args::parse();
-    match args.command {
-        Command::Pg => policy_gradient::run()?,
-        Command::Ddpg => ddpg::run()?,
+
+    let _guard = if args.tracing {
+        let (chrome_layer, guard) = ChromeLayerBuilder::new().build();
+        tracing_subscriber::registry().with(chrome_layer).init();
+        Some(guard)
+    } else {
+        None
+    };
+
+    let env = gym_env::GymEnv::new("Pendulum-v1")?;
+    println!("action space: {}", env.action_space());
+    println!("observation space: {:?}", env.observation_space());
+
+    let size_state = env.observation_space().iter().product::<usize>();
+    let size_action = env.action_space();
+
+    let mut agent = ddpg::DDPG::new(
+        &Device::Cpu,
+        size_state,
+        size_action,
+        true,
+        ACTOR_LEARNING_RATE,
+        CRITIC_LEARNING_RATE,
+        GAMMA,
+        TAU,
+        REPLAY_BUFFER_CAPACITY,
+        ddpg::OuNoise::new(MU, THETA, SIGMA, size_action)?,
+    )?;
+
+    let mut rng = rand::thread_rng();
+
+    for episode in 0..MAX_EPISODES {
+        // let mut state = env.reset(episode as u64)?;
+        let mut state = env.reset(rng.gen::<u64>())?;
+
+        let mut total_reward = 0.0;
+        for _ in 0..EPISODE_LENGTH {
+            let mut action = 2.0 * agent.actions(&state)?;
+            action = action.clamp(-2.0, 2.0);
+
+            let step = env.step(vec![action])?;
+            total_reward += step.reward;
+
+            agent.remember(
+                &state,
+                &Tensor::new(vec![action], &Device::Cpu)?,
+                &Tensor::new(vec![step.reward as f32], &Device::Cpu)?,
+                &step.state,
+                step.terminated,
+                step.truncated,
+            );
+
+            if step.terminated || step.truncated {
+                break;
+            }
+            state = step.state;
+        }
+
+        println!("episode {episode} with total reward of {total_reward}");
+
+        for _ in 0..TRAINING_ITERATIONS {
+            agent.train(TRAINING_BATCH_SIZE)?;
+        }
+    }
+
+    println!("Testing...");
+    agent.train = false;
+    for episode in 0..10 {
+        // let mut state = env.reset(episode as u64)?;
+        let mut state = env.reset(rng.gen::<u64>())?;
+        let mut total_reward = 0.0;
+        for _ in 0..EPISODE_LENGTH {
+            let mut action = 2.0 * agent.actions(&state)?;
+            action = action.clamp(-2.0, 2.0);
+
+            let step = env.step(vec![action])?;
+            total_reward += step.reward;
+
+            if step.terminated || step.truncated {
+                break;
+            }
+            state = step.state;
+        }
+        println!("episode {episode} with total reward of {total_reward}");
    }
    Ok(())
 }
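Note on the constants moved into `main.rs` above: the body of `OuNoise` is not part of this hunk, so the following is only a minimal sketch of what the `MU`, `THETA` and `SIGMA` parameters usually mean, assuming the standard discretised Ornstein-Uhlenbeck update with a unit time step. The function names `ou_step` and `scaled_action` are hypothetical and do not exist in the example; `scaled_action` mirrors the `2.0 * action` followed by `clamp(-2.0, 2.0)` pattern in the loop, using `f64` for illustration.

```rust
/// One step of a discretised Ornstein-Uhlenbeck process (assumed dt = 1).
/// `eps` is a standard-normal sample supplied by the caller, which keeps the
/// sketch deterministic and free of external crates.
fn ou_step(x: f64, mu: f64, theta: f64, sigma: f64, eps: f64) -> f64 {
    // Pull the state back towards `mu` at rate `theta`, then add scaled noise.
    x + theta * (mu - x) + sigma * eps
}

/// Scale a raw actor output by 2 and clamp it to [-2, 2], as the training
/// loop does before stepping the environment.
fn scaled_action(raw: f64) -> f64 {
    (2.0 * raw).clamp(-2.0, 2.0)
}
```

With `eps = 0.0` the state simply decays towards `mu`, which is a quick way to sanity-check the `theta` term in isolation.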
--- a/candle-examples/examples/reinforcement-learning/policy_gradient.rs
+++ b/candle-examples/examples/reinforcement-learning/policy_gradient.rs
@ -1,146 +0,0 @@
-use super::gym_env::{GymEnv, Step};
-use candle::{DType, Device, Error, Module, Result, Tensor};
-use candle_nn::{
-    linear, ops::log_softmax, ops::softmax, sequential::seq, Activation, AdamW, Optimizer,
-    ParamsAdamW, VarBuilder, VarMap,
-};
-use rand::{distributions::Distribution, rngs::ThreadRng, Rng};
-
-fn new_model(
-    input_shape: &[usize],
-    num_actions: usize,
-    dtype: DType,
-    device: &Device,
-) -> Result<(impl Module, VarMap)> {
-    let input_size = input_shape.iter().product();
-
-    let mut varmap = VarMap::new();
-    let var_builder = VarBuilder::from_varmap(&varmap, dtype, device);
-
-    let model = seq()
-        .add(linear(input_size, 32, var_builder.pp("lin1"))?)
-        .add(Activation::Relu)
-        .add(linear(32, num_actions, var_builder.pp("lin2"))?);
-
-    Ok((model, varmap))
-}
-
-fn accumulate_rewards(steps: &[Step<i64>]) -> Vec<f64> {
-    let mut rewards: Vec<f64> = steps.iter().map(|s| s.reward).collect();
-    let mut acc_reward = 0f64;
-    for (i, reward) in rewards.iter_mut().enumerate().rev() {
-        if steps[i].terminated {
-            acc_reward = 0.0;
-        }
-        acc_reward += *reward;
-        *reward = acc_reward;
-    }
-    rewards
-}
-
-fn weighted_sample(probs: Vec<f32>, rng: &mut ThreadRng) -> Result<usize> {
-    let distribution = rand::distributions::WeightedIndex::new(probs).map_err(Error::wrap)?;
-    let mut rng = rng;
-    Ok(distribution.sample(&mut rng))
-}
-
-pub fn run() -> Result<()> {
-    let env = GymEnv::new("CartPole-v1")?;
-
-    println!("action space: {:?}", env.action_space());
-    println!("observation space: {:?}", env.observation_space());
-
-    let (model, varmap) = new_model(
-        env.observation_space(),
-        env.action_space(),
-        DType::F32,
-        &Device::Cpu,
-    )?;
-
-    let optimizer_params = ParamsAdamW {
-        lr: 0.01,
-        weight_decay: 0.01,
-        ..Default::default()
-    };
-
-    let mut optimizer = AdamW::new(varmap.all_vars(), optimizer_params)?;
-
-    let mut rng = rand::thread_rng();
-
-    for epoch_idx in 0..100 {
-        let mut state = env.reset(rng.gen::<u64>())?;
-        let mut steps: Vec<Step<i64>> = vec![];
-
-        loop {
-            let action = {
-                let action_probs: Vec<f32> =
-                    softmax(&model.forward(&state.detach()?.unsqueeze(0)?)?, 1)?
-                        .squeeze(0)?
-                        .to_vec1()?;
-                weighted_sample(action_probs, &mut rng)? as i64
-            };
-
-            let step = env.step(action)?;
-            steps.push(step.copy_with_obs(&state));
-
-            if step.terminated || step.truncated {
-                state = env.reset(rng.gen::<u64>())?;
-                if steps.len() > 5000 {
-                    break;
-                }
-            } else {
-                state = step.state;
-            }
-        }
-
-        let total_reward: f64 = steps.iter().map(|s| s.reward).sum();
-        let episodes: i64 = steps
-            .iter()
-            .map(|s| (s.terminated || s.truncated) as i64)
-            .sum();
-        println!(
-            "epoch: {:<3} episodes: {:<5} avg reward per episode: {:.2}",
-            epoch_idx,
-            episodes,
-            total_reward / episodes as f64
-        );
-
-        let batch_size = steps.len();
-
-        let rewards = Tensor::from_vec(accumulate_rewards(&steps), batch_size, &Device::Cpu)?
-            .to_dtype(DType::F32)?
-            .detach()?;
-
-        let actions_mask = {
-            let actions: Vec<i64> = steps.iter().map(|s| s.action).collect();
-            let actions_mask: Vec<Tensor> = actions
-                .iter()
-                .map(|&action| {
-                    // One-hot encoding
-                    let mut action_mask = vec![0.0; env.action_space()];
-                    action_mask[action as usize] = 1.0;
-
-                    Tensor::from_vec(action_mask, env.action_space(), &Device::Cpu)
-                        .unwrap()
-                        .to_dtype(DType::F32)
-                        .unwrap()
-                })
-                .collect();
-            Tensor::stack(&actions_mask, 0)?.detach()?
-        };
-
-        let states = {
-            let states: Vec<Tensor> = steps.into_iter().map(|s| s.state).collect();
-            Tensor::stack(&states, 0)?.detach()?
-        };
-
-        let log_probs = actions_mask
-            .mul(&log_softmax(&model.forward(&states)?, 1)?)?
-            .sum(1)?;
-
-        let loss = rewards.mul(&log_probs)?.neg()?.mean_all()?;
-        optimizer.backward_step(&loss)?;
-    }
-
-    Ok(())
-}
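Note on the removed `accumulate_rewards` helper: it turns per-step rewards into undiscounted reward-to-go values, resetting the running sum at every terminated step. A standalone sketch of the same loop is given here, with a hypothetical `Step` struct that keeps only the two fields the loop reads (the real `Step<i64>` in `gym_env` carries more).

```rust
/// Hypothetical stand-in for the example's `Step` type, reduced to the
/// fields needed for this illustration.
struct Step {
    reward: f64,
    terminated: bool,
}

/// Replace each reward with the sum of rewards from that step to the end of
/// its episode, walking the trajectory backwards.
fn accumulate_rewards(steps: &[Step]) -> Vec<f64> {
    let mut rewards: Vec<f64> = steps.iter().map(|s| s.reward).collect();
    let mut acc_reward = 0.0;
    for (i, reward) in rewards.iter_mut().enumerate().rev() {
        // A terminated step is the last step of an episode, so the running
        // sum must not leak in from the episode that follows it.
        if steps[i].terminated {
            acc_reward = 0.0;
        }
        acc_reward += *reward;
        *reward = acc_reward;
    }
    rewards
}

/// Build a small trajectory from `(reward, terminated)` pairs.
fn make_steps(spec: &[(f64, bool)]) -> Vec<Step> {
    spec.iter()
        .map(|&(reward, terminated)| Step { reward, terminated })
        .collect()
}
```

Note that only `terminated` resets the sum; as in the removed code, a truncated episode does not.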
--- a/candle-examples/examples/replit-code/main.rs
+++ b/candle-examples/examples/replit-code/main.rs
@ -236,11 +236,9 @@ fn main() -> Result<()> {
    let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;

    let start = std::time::Instant::now();
-    let device = Device::Cpu;
    let config = Config::replit_code_v1_5_3b();
    let (model, device) = if args.quantized {
-        let vb =
-            candle_transformers::quantized_var_builder::VarBuilder::from_gguf(&filename, &device)?;
+        let vb = candle_transformers::quantized_var_builder::VarBuilder::from_gguf(&filename)?;
        let model = Model::Q(Q::new(&config, vb.pp("transformer"))?);
        (model, Device::Cpu)
    } else {
--- a/candle-examples/examples/stable-diffusion/README.md
+++ b/candle-examples/examples/stable-diffusion/README.md
@ -8,7 +8,7 @@ XL using Rust and [candle](https://github.com/huggingface/candle).
 The `stable-diffusion` example is a conversion of
 [diffusers-rs](https://github.com/LaurentMazare/diffusers-rs) using candle
 rather than libtorch. This implementation supports Stable Diffusion v1.5, v2.1,
-as well as Stable Diffusion XL 1.0, and Turbo.
+as well as Stable Diffusion XL 1.0.

 ## Getting the weights

@ -23,26 +23,16 @@ cargo run --example stable-diffusion --release --features=cuda,cudnn \
    -- --prompt "a cosmonaut on a horse (hd, realistic, high-def)"
 ```

-The final image is named `sd_final.png` by default. The Turbo version is much
-faster than previous versions, to give it a try add a `--sd-version turbo` flag,
-e.g.:
-
-```bash
-cargo run --example stable-diffusion --release --features=cuda,cudnn \
-    -- --prompt "a cosmonaut on a horse (hd, realistic, high-def)" --sd-version turbo
-```
-
-The default scheduler for the v1.5, v2.1 and XL 1.0 version is the Denoising
-Diffusion Implicit Model scheduler (DDIM). The original paper and some code can
-be found in the [associated repo](https://github.com/ermongroup/ddim).
-The default scheduler for the XL Turbo version is the Euler Ancestral scheduler.
+The final image is named `sd_final.png` by default.
+The default scheduler is the Denoising Diffusion Implicit Model scheduler (DDIM). The
+original paper and some code can be found in the [associated repo](https://github.com/ermongroup/ddim).

 ### Command-line flags

 - `--prompt`: the prompt to be used to generate the image.
 - `--uncond-prompt`: the optional unconditional prompt.
- `--sd-version`: the Stable Diffusion version to use, can be `v1-5`, `v2-1`,
-  `xl`, or `turbo`.
+- `--sd-version`: the Stable Diffusion version to use, can be `v1-5`, `v2-1`, or
+  `xl`.
 - `--cpu`: use the cpu rather than the gpu (much slower).
 - `--height`, `--width`: set the height and width for the generated image.
 - `--n-steps`: the number of steps to be used in the diffusion process.
--- a/candle-examples/examples/stable-diffusion/main.rs
+++ b/candle-examples/examples/stable-diffusion/main.rs
@ -11,6 +11,8 @@ use candle::{DType, Device, IndexOp, Module, Tensor, D};
 use clap::Parser;
 use tokenizers::Tokenizer;

+const GUIDANCE_SCALE: f64 = 7.5;
+
 #[derive(Parser)]
 #[command(author, version, about, long_about = None)]
 struct Args {
@ -61,8 +63,8 @@ struct Args {
    sliced_attention_size: Option<usize>,

    /// The number of steps to run the diffusion for.
-    #[arg(long)]
-    n_steps: Option<usize>,
+    #[arg(long, default_value_t = 30)]
+    n_steps: usize,

    /// The number of samples to generate.
    #[arg(long, default_value_t = 1)]
@ -85,9 +87,6 @@ struct Args {
    #[arg(long)]
    use_f16: bool,

-    #[arg(long)]
-    guidance_scale: Option<f64>,
-
    #[arg(long, value_name = "FILE")]
    img2img: Option<String>,

@ -103,7 +102,6 @@ enum StableDiffusionVersion {
    V1_5,
    V2_1,
    Xl,
-    Turbo,
 }

 #[derive(Debug, Clone, Copy, PartialEq, Eq)]
@ -122,13 +120,12 @@ impl StableDiffusionVersion {
            Self::Xl => "stabilityai/stable-diffusion-xl-base-1.0",
            Self::V2_1 => "stabilityai/stable-diffusion-2-1",
            Self::V1_5 => "runwayml/stable-diffusion-v1-5",
-            Self::Turbo => "stabilityai/sdxl-turbo",
        }
    }

    fn unet_file(&self, use_f16: bool) -> &'static str {
        match self {
-            Self::V1_5 | Self::V2_1 | Self::Xl | Self::Turbo => {
+            Self::V1_5 | Self::V2_1 | Self::Xl => {
                if use_f16 {
                    "unet/diffusion_pytorch_model.fp16.safetensors"
                } else {
@ -140,7 +137,7 @@ impl StableDiffusionVersion {

    fn vae_file(&self, use_f16: bool) -> &'static str {
        match self {
-            Self::V1_5 | Self::V2_1 | Self::Xl | Self::Turbo => {
+            Self::V1_5 | Self::V2_1 | Self::Xl => {
                if use_f16 {
                    "vae/diffusion_pytorch_model.fp16.safetensors"
                } else {
@ -152,7 +149,7 @@ impl StableDiffusionVersion {

    fn clip_file(&self, use_f16: bool) -> &'static str {
        match self {
-            Self::V1_5 | Self::V2_1 | Self::Xl | Self::Turbo => {
+            Self::V1_5 | Self::V2_1 | Self::Xl => {
                if use_f16 {
                    "text_encoder/model.fp16.safetensors"
                } else {
@ -164,7 +161,7 @@ impl StableDiffusionVersion {

    fn clip2_file(&self, use_f16: bool) -> &'static str {
        match self {
-            Self::V1_5 | Self::V2_1 | Self::Xl | Self::Turbo => {
+            Self::V1_5 | Self::V2_1 | Self::Xl => {
                if use_f16 {
                    "text_encoder_2/model.fp16.safetensors"
                } else {
@ -192,7 +189,7 @@ impl ModelFile {
                            StableDiffusionVersion::V1_5 | StableDiffusionVersion::V2_1 => {
                                "openai/clip-vit-base-patch32"
                            }
-                            StableDiffusionVersion::Xl | StableDiffusionVersion::Turbo => {
+                            StableDiffusionVersion::Xl => {
                                // This seems similar to the patch32 version except some very small
                                // difference in the split regex.
                                "openai/clip-vit-large-patch14"
@ -209,11 +206,7 @@ impl ModelFile {
                    Self::Vae => {
                        // Override for SDXL when using f16 weights.
                        // See https://github.com/huggingface/candle/issues/1060
-                        if matches!(
-                            version,
-                            StableDiffusionVersion::Xl | StableDiffusionVersion::Turbo,
-                        ) && use_f16
-                        {
+                        if version == StableDiffusionVersion::Xl && use_f16 {
                            (
                                "madebyollin/sdxl-vae-fp16-fix",
                                "diffusion_pytorch_model.safetensors",
@ -268,7 +261,6 @@ fn text_embeddings(
    use_f16: bool,
    device: &Device,
    dtype: DType,
-    use_guide_scale: bool,
    first: bool,
 ) -> Result<Tensor> {
    let tokenizer_file = if first {
@ -293,6 +285,16 @@ fn text_embeddings(
    }
    let tokens = Tensor::new(tokens.as_slice(), device)?.unsqueeze(0)?;

+    let mut uncond_tokens = tokenizer
+        .encode(uncond_prompt, true)
+        .map_err(E::msg)?
+        .get_ids()
+        .to_vec();
+    while uncond_tokens.len() < sd_config.clip.max_position_embeddings {
+        uncond_tokens.push(pad_id)
+    }
+    let uncond_tokens = Tensor::new(uncond_tokens.as_slice(), device)?.unsqueeze(0)?;
+
    println!("Building the Clip transformer.");
    let clip_weights_file = if first {
        ModelFile::Clip
@ -308,24 +310,8 @@ fn text_embeddings(
    let text_model =
        stable_diffusion::build_clip_transformer(clip_config, clip_weights, device, DType::F32)?;
    let text_embeddings = text_model.forward(&tokens)?;
-
-    let text_embeddings = if use_guide_scale {
-        let mut uncond_tokens = tokenizer
-            .encode(uncond_prompt, true)
-            .map_err(E::msg)?
-            .get_ids()
-            .to_vec();
-        while uncond_tokens.len() < sd_config.clip.max_position_embeddings {
-            uncond_tokens.push(pad_id)
-        }
-
-        let uncond_tokens = Tensor::new(uncond_tokens.as_slice(), device)?.unsqueeze(0)?;
-        let uncond_embeddings = text_model.forward(&uncond_tokens)?;
-
-        Tensor::cat(&[uncond_embeddings, text_embeddings], 0)?.to_dtype(dtype)?
-    } else {
-        text_embeddings.to_dtype(dtype)?
-    };
+    let uncond_embeddings = text_model.forward(&uncond_tokens)?;
+    let text_embeddings = Tensor::cat(&[uncond_embeddings, text_embeddings], 0)?.to_dtype(dtype)?;
    Ok(text_embeddings)
 }

@ -370,7 +356,6 @@ fn run(args: Args) -> Result<()> {
        unet_weights,
        tracing,
        use_f16,
-        guidance_scale,
        use_flash_attn,
        img2img,
        img2img_strength,
@ -389,24 +374,6 @@ fn run(args: Args) -> Result<()> {
        None
    };

-    let guidance_scale = match guidance_scale {
-        Some(guidance_scale) => guidance_scale,
-        None => match sd_version {
-            StableDiffusionVersion::V1_5
-            | StableDiffusionVersion::V2_1
-            | StableDiffusionVersion::Xl => 7.5,
-            StableDiffusionVersion::Turbo => 0.,
-        },
-    };
-    let n_steps = match n_steps {
-        Some(n_steps) => n_steps,
-        None => match sd_version {
-            StableDiffusionVersion::V1_5
-            | StableDiffusionVersion::V2_1
-            | StableDiffusionVersion::Xl => 30,
-            StableDiffusionVersion::Turbo => 1,
-        },
-    };
    let dtype = if use_f16 { DType::F16 } else { DType::F32 };
    let sd_config = match sd_version {
        StableDiffusionVersion::V1_5 => {
@ -418,19 +385,13 @@ fn run(args: Args) -> Result<()> {
        StableDiffusionVersion::Xl => {
            stable_diffusion::StableDiffusionConfig::sdxl(sliced_attention_size, height, width)
        }
-        StableDiffusionVersion::Turbo => stable_diffusion::StableDiffusionConfig::sdxl_turbo(
-            sliced_attention_size,
-            height,
-            width,
-        ),
    };

    let scheduler = sd_config.build_scheduler(n_steps)?;
    let device = candle_examples::device(cpu)?;
-    let use_guide_scale = guidance_scale > 1.0;

    let which = match sd_version {
-        StableDiffusionVersion::Xl | StableDiffusionVersion::Turbo => vec![true, false],
+        StableDiffusionVersion::Xl => vec![true, false],
        _ => vec![true],
    };
    let text_embeddings = which
@ -446,18 +407,16 @@ fn run(args: Args) -> Result<()> {
                use_f16,
                &device,
                dtype,
-                use_guide_scale,
                *first,
            )
        })
        .collect::<Result<Vec<_>>>()?;
-
    let text_embeddings = Tensor::cat(&text_embeddings, D::Minus1)?;
    println!("{text_embeddings:?}");

    println!("Building the autoencoder.");
    let vae_weights = ModelFile::Vae.get(vae_weights, sd_version, use_f16)?;
-    let vae = sd_config.build_vae(vae_weights, &device, dtype)?;
+    let vae = sd_config.build_vae(&vae_weights, &device, dtype)?;
    let init_latent_dist = match &img2img {
        None => None,
        Some(image) => {
@ -467,7 +426,7 @@ fn run(args: Args) -> Result<()> {
    };
    println!("Building the unet.");
    let unet_weights = ModelFile::Unet.get(unet_weights, sd_version, use_f16)?;
-    let unet = sd_config.build_unet(unet_weights, &device, 4, use_flash_attn, dtype)?;
+    let unet = sd_config.build_unet(&unet_weights, &device, 4, use_flash_attn, dtype)?;

    let t_start = if img2img.is_some() {
        n_steps - (n_steps as f64 * img2img_strength) as usize
@ -475,19 +434,11 @@ fn run(args: Args) -> Result<()> {
        0
    };
    let bsize = 1;
-
-    let vae_scale = match sd_version {
-        StableDiffusionVersion::V1_5
-        | StableDiffusionVersion::V2_1
-        | StableDiffusionVersion::Xl => 0.18215,
-        StableDiffusionVersion::Turbo => 0.13025,
-    };
-
    for idx in 0..num_samples {
        let timesteps = scheduler.timesteps();
        let latents = match &init_latent_dist {
            Some(init_latent_dist) => {
-                let latents = (init_latent_dist.sample()? * vae_scale)?.to_device(&device)?;
+                let latents = (init_latent_dist.sample()? * 0.18215)?.to_device(&device)?;
                if t_start < timesteps.len() {
                    let noise = latents.randn_like(0f64, 1f64)?;
                    scheduler.add_noise(&latents, noise, timesteps[t_start])?
@ -514,31 +465,21 @@ fn run(args: Args) -> Result<()> {
                continue;
            }
            let start_time = std::time::Instant::now();
-            let latent_model_input = if use_guide_scale {
-                Tensor::cat(&[&latents, &latents], 0)?
-            } else {
-                latents.clone()
-            };
+            let latent_model_input = Tensor::cat(&[&latents, &latents], 0)?;

            let latent_model_input = scheduler.scale_model_input(latent_model_input, timestep)?;
            let noise_pred =
                unet.forward(&latent_model_input, timestep as f64, &text_embeddings)?;
-
-            let noise_pred = if use_guide_scale {
-                let noise_pred = noise_pred.chunk(2, 0)?;
-                let (noise_pred_uncond, noise_pred_text) = (&noise_pred[0], &noise_pred[1]);
-
-                (noise_pred_uncond + ((noise_pred_text - noise_pred_uncond)? * guidance_scale)?)?
-            } else {
-                noise_pred
-            };
-
+            let noise_pred = noise_pred.chunk(2, 0)?;
+            let (noise_pred_uncond, noise_pred_text) = (&noise_pred[0], &noise_pred[1]);
+            let noise_pred =
+                (noise_pred_uncond + ((noise_pred_text - noise_pred_uncond)? * GUIDANCE_SCALE)?)?;
            latents = scheduler.step(&noise_pred, timestep, &latents)?;
            let dt = start_time.elapsed().as_secs_f32();
            println!("step {}/{n_steps} done, {:.2}s", timestep_index + 1, dt);

            if args.intermediary_images {
-                let image = vae.decode(&(&latents / vae_scale)?)?;
+                let image = vae.decode(&(&latents / 0.18215)?)?;
                let image = ((image / 2.)? + 0.5)?.to_device(&Device::Cpu)?;
                let image = (image * 255.)?.to_dtype(DType::U8)?.i(0)?;
                let image_filename =
@ -552,7 +493,7 @@ fn run(args: Args) -> Result<()> {
            idx + 1,
            num_samples
        );
-        let image = vae.decode(&(&latents / vae_scale)?)?;
+        let image = vae.decode(&(&latents / 0.18215)?)?;
        let image = ((image / 2.)? + 0.5)?.to_device(&Device::Cpu)?;
        let image = (image.clamp(0f32, 1.)? * 255.)?.to_dtype(DType::U8)?.i(0)?;
        let image_filename = output_filename(&final_image, idx + 1, num_samples, None);
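Note on the guidance arithmetic in the hunks above: after this change the unconditional and text-conditioned predictions are always combined as `uncond + (text - uncond) * GUIDANCE_SCALE`. The sketch below reproduces that formula on plain `f64` slices instead of candle tensors; `apply_guidance` is a hypothetical name used only for illustration.

```rust
/// Combine unconditional and text-conditioned noise predictions elementwise:
/// result = uncond + (text - uncond) * scale.
fn apply_guidance(uncond: &[f64], text: &[f64], scale: f64) -> Vec<f64> {
    assert_eq!(uncond.len(), text.len(), "predictions must have the same length");
    uncond
        .iter()
        .zip(text)
        .map(|(u, t)| u + (t - u) * scale)
        .collect()
}
```

A scale of 1.0 returns the text-conditioned prediction unchanged, which is why the removed code only ran the doubled batch when `guidance_scale > 1.0`.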
--- a/candle-examples/examples/stable-lm/main.rs
+++ b/candle-examples/examples/stable-lm/main.rs
@ -234,14 +234,13 @@ fn main() -> Result<()> {

    let start = std::time::Instant::now();
    let config = Config::stablelm_3b_4e1t(args.use_flash_attn);
-    let device = candle_examples::device(args.cpu)?;
    let (model, device) = if args.quantized {
        let filename = &filenames[0];
-        let vb =
-            candle_transformers::quantized_var_builder::VarBuilder::from_gguf(filename, &device)?;
+        let vb = candle_transformers::quantized_var_builder::VarBuilder::from_gguf(filename)?;
        let model = QStableLM::new(&config, vb)?;
        (Model::Quantized(model), Device::Cpu)
    } else {
+        let device = candle_examples::device(args.cpu)?;
        let dtype = if device.is_cuda() {
            DType::BF16
        } else {
--- a/candle-examples/examples/t5/main.rs
+++ b/candle-examples/examples/t5/main.rs
@ -96,9 +96,25 @@ impl T5ModelBuilder {
        let api = api.repo(repo);
        let config_filename = api.get("config.json")?;
        let tokenizer_filename = api.get("tokenizer.json")?;
-        let weights_filename = if model_id == "google/flan-t5-xxl" || model_id == "google/flan-ul2"
-        {
-            candle_examples::hub_load_safetensors(&api, "model.safetensors.index.json")?
+        let weights_filename = if model_id == "google/flan-t5-xxl" {
+            vec![
+                api.get("model-00001-of-00005.safetensors")?,
+                api.get("model-00002-of-00005.safetensors")?,
+                api.get("model-00003-of-00005.safetensors")?,
+                api.get("model-00004-of-00005.safetensors")?,
+                api.get("model-00005-of-00005.safetensors")?,
+            ]
+        } else if model_id == "google/flan-ul2" {
+            vec![
+                api.get("model-00001-of-00008.safetensors")?,
+                api.get("model-00002-of-00008.safetensors")?,
+                api.get("model-00003-of-00008.safetensors")?,
+                api.get("model-00004-of-00008.safetensors")?,
+                api.get("model-00005-of-00008.safetensors")?,
+                api.get("model-00006-of-00008.safetensors")?,
+                api.get("model-00007-of-00008.safetensors")?,
+                api.get("model-00008-of-00008.safetensors")?,
+            ]
        } else {
            vec![api.get("model.safetensors")?]
        };
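Note on the shard lists above: the file names follow a fixed `model-XXXXX-of-YYYYY.safetensors` pattern, so they could also be generated rather than spelled out. The helper below is a hypothetical sketch that only builds the names; it does not call the hub API, and the shard count still has to be known per model (5 for `google/flan-t5-xxl` and 8 for `google/flan-ul2`, as in the hunk).

```rust
/// Build the shard file names for a model split into `num_shards` parts,
/// using 1-based, zero-padded five-digit indices.
fn shard_filenames(num_shards: usize) -> Vec<String> {
    (1..=num_shards)
        .map(|i| format!("model-{i:05}-of-{num_shards:05}.safetensors"))
        .collect()
}
```

Each returned name would then be passed to `api.get(...)` in place of the literal strings.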
--- a/candle-examples/examples/trocr/readme.md
+++ b/candle-examples/examples/trocr/readme.md
@ -8,7 +8,7 @@ the model itself.
 ## Running an example

 ```bash
-cargo run --example trocr --release --  --which base --cpu --image candle-examples/examples/trocr/assets/trocr.png
+cargo run --example trocr --release --  --which base --cpu --image assets/trocr.png
 ```

 ```
--- a/candle-examples/examples/whisper/main.rs
+++ b/candle-examples/examples/whisper/main.rs
@ -128,13 +128,7 @@ impl Decoder {
        let transcribe_token = token_id(&tokenizer, m::TRANSCRIBE_TOKEN)?;
        let translate_token = token_id(&tokenizer, m::TRANSLATE_TOKEN)?;
        let eot_token = token_id(&tokenizer, m::EOT_TOKEN)?;
-        let no_speech_token = m::NO_SPEECH_TOKENS
-            .iter()
-            .find_map(|token| token_id(&tokenizer, token).ok());
-        let no_speech_token = match no_speech_token {
-            None => anyhow::bail!("unable to find any non-speech token"),
-            Some(n) => n,
-        };
+        let no_speech_token = token_id(&tokenizer, m::NO_SPEECH_TOKEN)?;
        Ok(Self {
            model,
            rng: rand::rngs::StdRng::seed_from_u64(seed),
@ -518,7 +512,11 @@ fn main() -> Result<()> {
            )
        } else {
            let config = repo.get("config.json")?;
-            let tokenizer = repo.get("tokenizer.json")?;
+            let tokenizer = if args.model == WhichModel::LargeV3 {
+                panic!("openai/whisper-large-v3 does not provide a compatible tokenizer.json config at the moment")
+            } else {
+                repo.get("tokenizer.json")?
+            };
            let model = repo.get("model.safetensors")?;
            (config, tokenizer, model)
        };
@ -557,10 +555,8 @@ fn main() -> Result<()> {
    println!("loaded mel: {:?}", mel.dims());

    let mut model = if args.quantized {
-        let vb = candle_transformers::quantized_var_builder::VarBuilder::from_gguf(
-            &weights_filename,
-            &device,
-        )?;
+        let vb =
+            candle_transformers::quantized_var_builder::VarBuilder::from_gguf(&weights_filename)?;
        Model::Quantized(m::quantized_model::Whisper::load(&vb, config)?)
    } else {
        let vb =
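
Review note: the removed lines in the `Decoder` hunk above tried several candidate no-speech tokens and kept the first one the tokenizer knows, whereas the replacement looks up a single `m::NO_SPEECH_TOKEN`. A minimal sketch of that fallback pattern follows. It assumes a plain `HashMap` vocabulary in place of the real tokenizer; `token_id` here is a hypothetical stand-in for the example's helper, and `first_known_token` is an illustrative name, not part of the crate.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the example's `token_id` helper: looks a token
/// up in a vocabulary map instead of querying a real tokenizer.
fn token_id(vocab: &HashMap<&str, u32>, token: &str) -> Result<u32, String> {
    vocab
        .get(token)
        .copied()
        .ok_or_else(|| format!("no token-id for {token}"))
}

/// Returns the id of the first candidate token present in the vocabulary,
/// or an error if none of the candidates is known.
fn first_known_token(vocab: &HashMap<&str, u32>, candidates: &[&str]) -> Result<u32, String> {
    candidates
        .iter()
        // `.ok()` turns a failed lookup into `None`, so `find_map` moves on
        // to the next candidate instead of returning early.
        .find_map(|token| token_id(vocab, token).ok())
        .ok_or_else(|| "unable to find any non-speech token".to_string())
}
```

With a vocabulary that only contains the second candidate, the lookup still succeeds. A single-token lookup would fail in that case, which is the trade-off this hunk makes.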
--- a/candle-examples/examples/yi/main.rs
+++ b/candle-examples/examples/yi/main.rs
@ -74,9 +74,9 @@ impl TextGeneration {
        std::io::stdout().flush()?;

        let mut generated_tokens = 0usize;
-        let eos_token = match self.tokenizer.get_token("<|endoftext|>") {
+        let eos_token = match self.tokenizer.get_token("</s>") {
            Some(token) => token,
-            None => anyhow::bail!("cannot find the <|endoftext|> token"),
+            None => anyhow::bail!("cannot find the </s> token"),
        };
        let start_gen = std::time::Instant::now();
        for index in 0..sample_len {
@ -218,7 +218,21 @@ fn main() -> Result<()> {
            .split(',')
            .map(std::path::PathBuf::from)
            .collect::<Vec<_>>(),
-        None => candle_examples::hub_load_safetensors(&repo, "model.safetensors.index.json")?,
+        None => match args.which {
+            Which::L6b => vec![
+                repo.get("model-00001-of-00002.safetensors")?,
+                repo.get("model-00002-of-00002.safetensors")?,
+            ],
+            Which::L34b => vec![
+                repo.get("model-00001-of-00007.safetensors")?,
+                repo.get("model-00002-of-00007.safetensors")?,
+                repo.get("model-00003-of-00007.safetensors")?,
+                repo.get("model-00004-of-00007.safetensors")?,
+                repo.get("model-00005-of-00007.safetensors")?,
+                repo.get("model-00006-of-00007.safetensors")?,
+                repo.get("model-00007-of-00007.safetensors")?,
+            ],
+        },
    };
    println!("retrieved the files in {:?}", start.elapsed());
    let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;
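
Review note: the hard-coded lists added in the hunk above all follow the pattern `model-XXXXX-of-YYYYY.safetensors`. As a sketch only, the same names could be generated from the shard count. `shard_filenames` is a hypothetical helper, and the naming pattern is taken from the literals in this diff, not from any guarantee made by the hub.

```rust
/// Builds the shard file names `model-00001-of-0000N.safetensors` up to
/// `model-0000N-of-0000N.safetensors` for `count` shards.
/// The zero-padded, five-digit pattern is assumed from the literals in the diff.
fn shard_filenames(count: usize) -> Vec<String> {
    (1..=count)
        .map(|i| format!("model-{i:05}-of-{count:05}.safetensors"))
        .collect()
}
```

Each generated name would then be passed to `repo.get(..)` as in the hunk. The per-model shard count (2 for `Which::L6b`, 7 for `Which::L34b`) still has to be known up front, which is what the removed index-file lookup avoided.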
--- a/candle-examples/examples/yolo-v3/darknet.rs
+++ b/candle-examples/examples/yolo-v3/darknet.rs
@ -147,7 +147,7 @@ fn conv(vb: VarBuilder, index: usize, p: usize, b: &Block) -> Result<(usize, Bl)
    let func = candle_nn::func(move |xs| {
        let xs = conv.forward(xs)?;
        let xs = match &bn {
-            Some(bn) => xs.apply_t(bn, false)?,
+            Some(bn) => bn.forward(&xs)?,
            None => xs,
        };
        let xs = if leaky {
--- a/candle-examples/examples/yolo-v3/main.rs
+++ b/candle-examples/examples/yolo-v3/main.rs
@ -43,7 +43,6 @@ pub fn report(
    confidence_threshold: f32,
    nms_threshold: f32,
 ) -> Result<DynamicImage> {
-    let pred = pred.to_device(&Device::Cpu)?;
    let (npreds, pred_size) = pred.dims2()?;
    let nclasses = pred_size - 5;
    // The bounding boxes grouped by (maximum) class index.
--- a/candle-examples/examples/yolo-v8/README.md
+++ b/candle-examples/examples/yolo-v8/README.md
@ -32,7 +32,7 @@ Image source:
 ### Pose Estimation
 ```bash
 cargo run --example yolo-v8 --release -- \
-  candle-examples/examples/yolo-v8/assets/bike.jpg --task pose
+  candle-examples/examples/yolo-v8/assets/peoples.jpeg --task pose
 ```

 ![Leading group, Giro d'Italia 2021](./assets/bike.pose.jpg)
--- a/candle-examples/examples/yolo-v8/main.rs
+++ b/candle-examples/examples/yolo-v8/main.rs
@ -7,7 +7,7 @@ extern crate accelerate_src;
 mod model;
 use model::{Multiples, YoloV8, YoloV8Pose};

-use candle::{DType, Device, IndexOp, Result, Tensor};
+use candle::{DType, IndexOp, Result, Tensor};
 use candle_nn::{Module, VarBuilder};
 use candle_transformers::object_detection::{non_maximum_suppression, Bbox, KeyPoint};
 use clap::{Parser, ValueEnum};
@ -61,7 +61,6 @@ pub fn report_detect(
    nms_threshold: f32,
    legend_size: u32,
 ) -> Result<DynamicImage> {
-    let pred = pred.to_device(&Device::Cpu)?;
    let (pred_size, npreds) = pred.dims2()?;
    let nclasses = pred_size - 4;
    // The bounding boxes grouped by (maximum) class index.
@ -154,7 +153,6 @@ pub fn report_pose(
    confidence_threshold: f32,
    nms_threshold: f32,
 ) -> Result<DynamicImage> {
-    let pred = pred.to_device(&Device::Cpu)?;
    let (pred_size, npreds) = pred.dims2()?;
    if pred_size != 17 * 3 + 4 + 1 {
        candle::bail!("unexpected pred-size {pred_size}");
--- a/candle-examples/src/lib.rs
+++ b/candle-examples/src/lib.rs
@ -117,30 +117,3 @@ pub fn save_image_resize<P: AsRef<std::path::Path>>(
    image.save(p).map_err(candle::Error::wrap)?;
    Ok(())
 }
-
-/// Loads the safetensors files for a model from the hub based on a json index file.
-pub fn hub_load_safetensors(
-    repo: &hf_hub::api::sync::ApiRepo,
-    json_file: &str,
-) -> Result<Vec<std::path::PathBuf>> {
-    let json_file = repo.get(json_file).map_err(candle::Error::wrap)?;
-    let json_file = std::fs::File::open(json_file)?;
-    let json: serde_json::Value =
-        serde_json::from_reader(&json_file).map_err(candle::Error::wrap)?;
-    let weight_map = match json.get("weight_map") {
-        None => candle::bail!("no weight map in {json_file:?}"),
-        Some(serde_json::Value::Object(map)) => map,
-        Some(_) => candle::bail!("weight map in {json_file:?} is not a map"),
-    };
-    let mut safetensors_files = std::collections::HashSet::new();
-    for value in weight_map.values() {
-        if let Some(file) = value.as_str() {
-            safetensors_files.insert(file.to_string());
-        }
-    }
-    let safetensors_files = safetensors_files
-        .iter()
-        .map(|v| repo.get(v).map_err(candle::Error::wrap))
-        .collect::<Result<Vec<_>>>()?;
-    Ok(safetensors_files)
-}
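
Review note: the core of the removed `hub_load_safetensors` helper is a de-duplication step, since many tensors in `weight_map` point at the same shard file. A minimal sketch of that step follows. It assumes the map is already available as `(tensor name, file name)` pairs instead of parsed JSON, and it swaps the original `HashSet` for a `BTreeSet` so the result comes out in sorted order. `distinct_shards` is a hypothetical name.

```rust
use std::collections::BTreeSet;

/// Collects the distinct shard file names referenced by a weight map.
/// The map is given as (tensor name, file name) pairs; a `BTreeSet` removes
/// duplicates and yields the names in sorted order.
fn distinct_shards(weight_map: &[(&str, &str)]) -> Vec<String> {
    let files: BTreeSet<&str> = weight_map.iter().map(|(_, file)| *file).collect();
    files.into_iter().map(String::from).collect()
}
```

The removed helper then fetched each distinct file with `repo.get(..)`; that part is left out here because it needs network access.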
--- a/candle-flash-attn/Cargo.toml
+++ b/candle-flash-attn/Cargo.toml
@ -1,6 +1,6 @@
 [package]
 name = "candle-flash-attn"
-version = "0.3.3"
+version = "0.3.0"
 edition = "2021"

 description = "Flash attention layer for the candle ML framework."
@ -11,7 +11,7 @@ license = "MIT OR Apache-2.0"
 readme = "README.md"

 [dependencies]
-candle = { path = "../candle-core", features = ["cuda"], version = "0.3.3", package = "candle-core" }
+candle = { path = "../candle-core", features = ["cuda"], version = "0.3.0", package = "candle-core" }
 half = { version = "2.3.1", features = ["num-traits"] }

 [build-dependencies]
@ -21,4 +21,4 @@ rayon = "1.7.0"

 [dev-dependencies]
 anyhow = { version = "1", features = ["backtrace"] }
-candle-nn = { path = "../candle-nn", version = "0.3.3", features = ["cuda"] }
+candle-nn = { path = "../candle-nn", version = "0.3.0", features = ["cuda"] }
--- a/candle-kernels/Cargo.toml
+++ b/candle-kernels/Cargo.toml
@ -1,6 +1,6 @@
 [package]
 name = "candle-kernels"
-version = "0.3.3"
+version = "0.3.0"
 edition = "2021"

 description = "CUDA kernels for Candle"
@ -14,4 +14,4 @@ license = "MIT OR Apache-2.0"
 [build-dependencies]
 anyhow = { version = "1", features = ["backtrace"] }
 glob = "0.3.1"
-rayon = "1.7.0"
+rayon = "1.7.0"
--- a/candle-kernels/src/lib.rs
+++ b/candle-kernels/src/lib.rs
@ -1 +1,9 @@
-
+pub const AFFINE: &str = include_str!(concat!(env!("OUT_DIR"), "/affine.ptx"));
+pub const BINARY: &str = include_str!(concat!(env!("OUT_DIR"), "/binary.ptx"));
+pub const CAST: &str = include_str!(concat!(env!("OUT_DIR"), "/cast.ptx"));
+pub const CONV: &str = include_str!(concat!(env!("OUT_DIR"), "/conv.ptx"));
+pub const FILL: &str = include_str!(concat!(env!("OUT_DIR"), "/fill.ptx"));
+pub const INDEXING: &str = include_str!(concat!(env!("OUT_DIR"), "/indexing.ptx"));
+pub const REDUCE: &str = include_str!(concat!(env!("OUT_DIR"), "/reduce.ptx"));
+pub const TERNARY: &str = include_str!(concat!(env!("OUT_DIR"), "/ternary.ptx"));
+pub const UNARY: &str = include_str!(concat!(env!("OUT_DIR"), "/unary.ptx"));
--- a/candle-metal-kernels/Cargo.toml
+++ b/candle-metal-kernels/Cargo.toml
@ -1,16 +1,16 @@
 [package]
 name = "candle-metal-kernels"
-version = "0.3.3"
+version = "0.3.0"
 edition = "2021"

-description = "Metal kernels for Candle"
+description = "CUDA kernels for Candle"
 repository = "https://github.com/huggingface/candle"
 keywords = ["blas", "tensor", "machine-learning"]
 categories = ["science"]
 license = "MIT OR Apache-2.0"

 [dependencies]
-metal = { version = "0.27.0", features = ["mps"]}
+metal = { git = "https://github.com/ivarflakstad/metal-rs.git", features = ["mps"] }
 once_cell = "1.18.0"
 thiserror = "1"
 tracing = "0.1.37"
--- a/candle-metal-kernels/src/GEMM.metal
+++ b/candle-metal-kernels/src/GEMM.metal
@ -0,0 +1,499 @@
+//
+//  GEMM.metal
+//  MetalFlashAttention
+//
+//  Created by Philip Turner on 6/23/23.
+//
+
+#include <metal_stdlib>
+#include "metal_data_type"
+#include "metal_simdgroup_event"
+#include "metal_simdgroup_matrix_storage"
+using namespace metal;
+
+// MARK: - Function Constants
+
+// Dimensions of each matrix.
+constant uint M [[function_constant(0)]];
+constant uint N [[function_constant(1)]];
+constant uint K [[function_constant(2)]];
+
+// Whether each matrix is transposed.
+constant bool A_trans [[function_constant(10)]];
+constant bool B_trans [[function_constant(11)]];
+constant bool D_trans [[function_constant(13)]];
+constant uint A_leading_dim = A_trans ? M : K;
+constant uint B_leading_dim = B_trans ? K : N;
+
+// Alpha and beta constants from BLAS.
+constant float alpha [[function_constant(20)]];
+constant float beta [[function_constant(21)]];
+
+constant bool batched [[function_constant(100)]];
+constant bool fused_activation [[function_constant(101)]];
+constant bool fused_bias [[function_constant(50001)]]; // 102
+constant bool use_bias = is_function_constant_defined(fused_bias) ? fused_bias : false;
+constant bool use_activation_function = fused_activation && !fused_bias;
+constant bool use_activation = use_bias || use_activation_function;
+constant bool batched_activation_function = batched && use_activation_function;
+
+constant ushort M_simd [[function_constant(200)]];
+constant ushort N_simd [[function_constant(201)]];
+constant ushort K_simd [[function_constant(202)]];
+
+// Elide work on the edge when matrix dimension < SRAM block dimension.
+constant ushort M_modulo = (M % M_simd == 0) ? M_simd : (M % M_simd);
+constant ushort N_modulo = (N % N_simd == 0) ? N_simd : (N % N_simd);
+constant ushort M_padded = (M < M_simd) ? (M_modulo + 7) / 8 * 8 : M_simd;
+constant ushort N_padded = (N < N_simd) ? (N_modulo + 7) / 8 * 8 : N_simd;
+
+constant ushort M_splits [[function_constant(210)]];
+constant ushort N_splits [[function_constant(211)]];
+
+constant ushort M_group = M_simd * M_splits;
+constant ushort N_group = N_simd * N_splits;
+constant ushort A_block_leading_dim = (A_trans ? M_group : K_simd);
+constant ushort B_block_leading_dim = (B_trans ? K_simd : N_group);
+
+// There is no padding for M reads/writes.
+// There is no padding for N reads/writes.
+constant ushort K_simd_unpadded = (K % K_simd == 0) ? K_simd : (K % K_simd);
+constant ushort K_simd_padded = (K_simd_unpadded + 7) / 8 * 8;
+
+constant ushort A_sram_length = (M_simd / 8) * 1;
+constant ushort B_sram_length = 1 * (N_simd / 8);
+constant ushort A_block_length = M_group * K_simd;
+
+// Threadgroup block must fit entire C accumulator and partial sums.
+constant ushort A_sram_offset = 0;
+constant ushort B_sram_offset = A_sram_offset + A_sram_length;
+constant ushort C_sram_offset = B_sram_offset + B_sram_length;
+constant ushort A_block_offset = 0;
+constant ushort B_block_offset = A_block_offset + A_block_length;
+
+// MARK: - Utilities
+
+template <typename T>
+METAL_FUNC thread simdgroup_matrix_storage<T>* A_sram(thread simdgroup_matrix_storage<T> *sram, ushort2 matrix_origin) {
+  // A_sram[M_simd][8]
+  return sram + A_sram_offset + (matrix_origin.y / 8) * (8 / 8) + (matrix_origin.x / 8);
+}
+
+template <typename T>
+METAL_FUNC thread simdgroup_matrix_storage<T>* B_sram(thread simdgroup_matrix_storage<T> *sram, ushort2 matrix_origin) {
+  // B_sram[8][N_simd]
+  return sram + B_sram_offset + (matrix_origin.y / 8) * (N_simd / 8) + (matrix_origin.x / 8);
+}
+
+template <typename T>
+METAL_FUNC thread simdgroup_matrix_storage<T>* C_sram(thread simdgroup_matrix_storage<T> *sram, ushort2 matrix_origin) {
+  // C_sram[M_simd][N_simd]
+  return sram + C_sram_offset + (matrix_origin.y / 8) * (N_simd / 8) + (matrix_origin.x / 8);
+}
+
+template <typename T>
+METAL_FUNC void prefetch(threadgroup T *A_block, device T *A,
+                         ushort2 A_tile_src, uint2 A_offset,
+                         threadgroup T *B_block, device T *B,
+                         ushort2 B_tile_src, uint2 B_offset, uint k)
+{
+  A_tile_src.x = min(uint(K_simd), K - k);
+  B_tile_src.y = min(uint(K_simd), K - k);
+  auto A_src = simdgroup_matrix_storage<T>::apply_offset(A, A_leading_dim, A_offset, A_trans);
+  auto B_src = simdgroup_matrix_storage<T>::apply_offset(B, B_leading_dim, B_offset, B_trans);
+  
+  // Rounded-up ceiling for the threadgroup block.
+  const uint K_edge_floor = K - K_simd_unpadded;
+  const uint K_edge_ceil = K_edge_floor + K_simd_padded;
+  ushort K_padded;
+  if (K_edge_floor == K_simd) {
+    K_padded = K_simd;
+  } else {
+    K_padded = min(uint(K_simd), K_edge_ceil - k);
+  }
+  ushort2 A_tile_dst(K_padded, A_tile_src.y);
+  ushort2 B_tile_dst(B_tile_src.x, K_padded);
+  
+  simdgroup_event events[2];
+  events[0].async_copy(A_block, A_block_leading_dim, A_tile_dst, A_src, A_leading_dim, A_tile_src, A_trans);
+  events[1].async_copy(B_block, B_block_leading_dim, B_tile_dst, B_src, B_leading_dim, B_tile_src, B_trans);
+  simdgroup_event::wait(2, events);
+}
+
+// One iteration of the MACC loop, effectively k=8 iterations.
+template <typename T>
+METAL_FUNC void multiply_accumulate(thread simdgroup_matrix_storage<T> *sram,
+                                    const threadgroup T *A_block,
+                                    const threadgroup T *B_block,
+                                    bool accumulate = true)
+{
+#pragma clang loop unroll(full)
+  for (ushort m = 0; m < M_padded; m += 8) {
+    ushort2 origin(0, m);
+    A_sram(sram, origin)->load(A_block, A_block_leading_dim, origin, A_trans);
+  }
+#pragma clang loop unroll(full)
+  for (ushort n = 0; n < N_padded; n += 8) {
+    ushort2 origin(n, 0);
+    B_sram(sram, origin)->load(B_block, B_block_leading_dim, origin, B_trans);
+  }
+#pragma clang loop unroll(full)
+  for (ushort m = 0; m < M_padded; m += 8) {
+    auto A = A_sram(sram, ushort2(0, m));
+#pragma clang loop unroll(full)
+    for (ushort n = 0; n < N_padded; n += 8) {
+      auto B = B_sram(sram, ushort2(n, 0));
+      auto C = C_sram(sram, ushort2(n, m));
+      C->multiply(*A, *B, accumulate);
+    }
+  }
+}
+
+template <typename T>
+METAL_FUNC void partial_store(thread simdgroup_matrix_storage<T> *sram,
+                              threadgroup T *C_block, bool is_k_summation)
+{
+#pragma clang loop unroll(full)
+  for (ushort m = 0; m < M_padded; m += 8) {
+#pragma clang loop unroll(full)
+    for (ushort n = 0; n < N_padded; n += 8) {
+      ushort2 origin(n, m);
+      if (is_k_summation) {
+        C_sram(sram, origin)->store(C_block, N_simd, origin);
+      } else {
+        C_sram(sram, origin)->store(C_block, N_group, origin);
+      }
+    }
+  }
+}
+
+template <typename T>
+METAL_FUNC void partial_accumulate(thread simdgroup_matrix_storage<T> *sram,
+                                   threadgroup T *C_block, bool is_k_summation)
+{
+#pragma clang loop unroll(full)
+  for (ushort m = 0; m < M_padded; m += 8) {
+#pragma clang loop unroll(full)
+    for (ushort n = 0; n < N_padded; n += 8) {
+      ushort2 origin(n, m);
+      auto B = B_sram(sram, ushort2(n, 0));
+      if (is_k_summation) {
+        B->load(C_block, N_simd, origin);
+      } else {
+        B->load(C_block, N_group, origin);
+      }
+    }
+#pragma clang loop unroll(full)
+    for (ushort n = 0; n < N_padded; n += 8) {
+      ushort2 origin(n, m);
+      auto B = B_sram(sram, ushort2(n, 0));
+      auto C = C_sram(sram, origin);
+      if (is_k_summation) {
+        C->thread_elements()[0] += B->thread_elements()[0];
+      } else {
+        float2 C_old = float2(B->thread_elements()[0]);
+        float2 C_new = float2(C->thread_elements()[0]);
+        C->thread_elements()[0] = vec<T, 2>(fast::fma(C_old, beta, C_new));
+      }
+    }
+  }
+}
+
+template <typename T>
+METAL_FUNC void async_access_accumulator(threadgroup T *C_block, device T *C,
+                                         uint2 C_offset, bool is_store)
+{
+  ushort2 C_tile(min(uint(N_group), N - C_offset.x),
+                 min(uint(M_group), M - C_offset.y));
+  auto C_src = simdgroup_matrix_storage<T>::apply_offset(C, N, C_offset);
+  
+  simdgroup_event event;
+  if (is_store) {
+    event.async_copy(C_src, N, C_tile, C_block, N_group, C_tile);
+  } else {
+    event.async_copy(C_block, N_group, C_tile, C_src, N, C_tile);
+    simdgroup_event::wait(1, &event);
+  }
+}
+
+template <typename T>
+METAL_FUNC void store_accumulator(thread simdgroup_matrix_storage<T> *sram,
+                                  device T *C, bool m_is_edge, bool n_is_edge)
+{
+  const ushort m_start = (m_is_edge) ? M_modulo : 0;
+  const ushort n_start = (n_is_edge) ? N_modulo : 0;
+  const ushort m_end = (m_is_edge) ? M_simd : M_modulo;
+  const ushort n_end = (n_is_edge) ? N_simd : N_modulo;
+  
+#pragma clang loop unroll(full)
+  for (ushort m = m_start; m < m_end; m += 8) {
+#pragma clang loop unroll(full)
+    for (ushort n = n_start; n < n_end; n += 8) {
+      ushort2 origin(n, m);
+      C_sram(sram, origin)->store(C, N, origin);
+    }
+  }
+}
+
+template <typename T>
+struct activation_functor {
+  using function = void(threadgroup T *C,
+                        device void *D,
+                        uint grid_index_in_batch,
+                        uint2 matrix_origin,
+                        ushort2 tile_dimensions,
+                        ushort lane_id);
+  
+  typedef visible_function_table<function> function_table;
+};
+
+// MARK: - Kernels
+
+template <typename T>
+void _gemm_impl(device T *A [[buffer(0)]],
+                device T *B [[buffer(1)]],
+                device T *C [[buffer(2)]],
+                device void *D [[buffer(3), function_constant(use_activation)]],
+                
+                threadgroup T *threadgroup_block [[threadgroup(0)]],
+                constant ulong4 *matrix_offsets [[buffer(10), function_constant(batched)]],
+                typename activation_functor<T>::function_table table [[buffer(11), function_constant(use_activation_function)]],
+                constant uint *activation_function_offsets [[buffer(12), function_constant(batched_activation_function)]],
+                
+                uint3 gid [[threadgroup_position_in_grid]],
+                ushort sidx [[simdgroup_index_in_threadgroup]],
+                ushort lane_id [[thread_index_in_simdgroup]])
+{
+  if (batched) {
+    // TODO: Re-compute every inner loop iteration for FP64 accumulate.
+    ulong3 offsets = matrix_offsets[gid.z].xyz;
+    A = (device T*)((device uchar*)A + offsets[0]);
+    B = (device T*)((device uchar*)B + offsets[1]);
+    C = (device T*)((device uchar*)C + offsets[2]);
+  }
+  
+  simdgroup_matrix_storage<T> sram[1024];
+  auto A_block = threadgroup_block + A_block_offset;
+  auto B_block = threadgroup_block + B_block_offset;
+  ushort2 sid(sidx % N_splits, sidx / N_splits);
+  ushort2 offset_in_simd = simdgroup_matrix_storage<T>::offset(lane_id);
+  
+  uint2 A_offset(0, gid.y * M_group);
+  uint2 B_offset(gid.x * N_group, 0);
+  {
+    uint C_base_offset_x = B_offset.x + sid.x * N_simd;
+    uint C_base_offset_y = A_offset.y + sid.y * M_simd;
+    if (C_base_offset_x >= N || C_base_offset_y >= M) {
+      return;
+    }
+  }
+  
+  ushort2 offset_in_group(sid.x * N_simd + offset_in_simd.x,
+                          sid.y * M_simd + offset_in_simd.y);
+  
+  if (use_bias) {
+    if (sidx == 0) {
+      auto bias = (device T*)D;
+      if (batched) {
+        ulong offset = matrix_offsets[gid.z].w;
+        bias = (device T*)((device uchar*)bias + offset);
+      }
+      
+      ushort bias_elements;
+      if (is_function_constant_defined(D_trans) && D_trans) {
+        bias += A_offset.y;
+        bias_elements = min(uint(M_group), M - A_offset.y);
+      } else {
+        bias += B_offset.x;
+        bias_elements = min(uint(N_group), N - B_offset.x);
+      }
+      
+      simdgroup_event event;
+      event.async_copy(threadgroup_block, bias, bias_elements);
+      simdgroup_event::wait(1, &event);
+    }
+    threadgroup_barrier(mem_flags::mem_threadgroup);
+    
+    if (is_function_constant_defined(D_trans) && D_trans) {
+      auto bias = threadgroup_block + offset_in_group.y;
+#pragma clang loop unroll(full)
+      for (ushort m = 0; m < M_padded; m += 8) {
+        auto D = bias[m];
+#pragma clang loop unroll(full)
+        for (ushort n = 0; n < N_padded; n += 8) {
+          auto C = C_sram(sram, ushort2(n, m));
+          *(C->thread_elements()) = D;
+        }
+      }
+    } else {
+      auto bias = threadgroup_block + offset_in_group.x;
+#pragma clang loop unroll(full)
+      for (ushort n = 0; n < N_padded; n += 8) {
+        auto D = *(threadgroup vec<T, 2>*)(bias + n);
+#pragma clang loop unroll(full)
+        for (ushort m = 0; m < M_padded; m += 8) {
+          auto C = C_sram(sram, ushort2(n, m));
+          *(C->thread_elements()) = D;
+        }
+      }
+    }
+    threadgroup_barrier(mem_flags::mem_threadgroup);
+  }
+  
+  ushort2 A_tile_src;
+  ushort2 B_tile_src;
+  if (sidx == 0) {
+    A_tile_src.y = min(uint(M_group), M - A_offset.y);
+    B_tile_src.x = min(uint(N_group), N - B_offset.x);
+    prefetch(A_block, A, A_tile_src, A_offset, B_block, B, B_tile_src, B_offset, 0);
+  }
+  
+  if (K > K_simd && !use_bias) {
+#pragma clang loop unroll(full)
+    for (ushort m = 0; m < M_padded; m += 8) {
+#pragma clang loop unroll(full)
+      for (ushort n = 0; n < N_padded; n += 8) {
+        *C_sram(sram, ushort2(n, m)) = simdgroup_matrix_storage<T>(0);
+      }
+    }
+  }
+  
+  for (uint K_floor = 0; K_floor < K; K_floor += K_simd) {
+    ushort2 A_block_offset(offset_in_simd.x, offset_in_group.y);
+    ushort2 B_block_offset(offset_in_group.x, offset_in_simd.y);
+    auto A_block_src = simdgroup_matrix_storage<T>::apply_offset(A_block, A_block_leading_dim, A_block_offset, A_trans);
+    auto B_block_src = simdgroup_matrix_storage<T>::apply_offset(B_block, B_block_leading_dim, B_block_offset, B_trans);
+    threadgroup_barrier(mem_flags::mem_threadgroup);
+    
+#pragma clang loop unroll(full)
+    for (ushort k = 0; k < K_simd_padded; k += 8) {
+      bool accumulate = use_bias || !(K <= K_simd && k == 0);
+      multiply_accumulate(sram, A_block_src, B_block_src, accumulate);
+      A_block_src += A_trans ? 8 * A_block_leading_dim : 8;
+      B_block_src += B_trans ? 8 : 8 * B_block_leading_dim;
+    }
+    
+    if (K_floor + K_simd < K) {
+#pragma clang loop unroll(full)
+      for (ushort k = K_simd_padded; k < K_simd; k += 8) {
+        multiply_accumulate(sram, A_block_src, B_block_src);
+        A_block_src += A_trans ? 8 * A_block_leading_dim : 8;
+        B_block_src += B_trans ? 8 : 8 * B_block_leading_dim;
+      }
+      threadgroup_barrier(mem_flags::mem_threadgroup);
+      
+      if (sidx == 0) {
+        uint K_next = K_floor + K_simd;
+        A_offset.x = K_next;
+        B_offset.y = K_next;
+        prefetch(A_block, A, A_tile_src, A_offset, B_block, B, B_tile_src, B_offset, K_next);
+      }
+    }
+  }
+  
+  if (alpha != 1) {
+#pragma clang loop unroll(full)
+    for (int m = 0; m < M_padded; m += 8) {
+#pragma clang loop unroll(full)
+      for (int n = 0; n < N_padded; n += 8) {
+        C_sram(sram, ushort2(n, m))->thread_elements()[0] *= alpha;
+      }
+    }
+  }
+  
+  uint2 C_offset(B_offset.x, A_offset.y);
+  ushort2 C_block_offset = offset_in_group.xy;
+  threadgroup_barrier(mem_flags::mem_threadgroup);
+  
+  if (beta != 0) {
+    if (sidx == 0) {
+      async_access_accumulator(threadgroup_block, C, C_offset, false);
+    }
+    threadgroup_barrier(mem_flags::mem_threadgroup);
+    
+    auto C_block = simdgroup_matrix_storage<T>::apply_offset(threadgroup_block, N_group, C_block_offset);
+    partial_accumulate(sram, C_block, false);
+    threadgroup_barrier(mem_flags::mem_threadgroup);
+  }
+  
+  if (use_activation_function) {
+    auto C_block = simdgroup_matrix_storage<T>::apply_offset(threadgroup_block, N_group, C_block_offset);
+    partial_store(sram, C_block, false);
+    simdgroup_barrier(mem_flags::mem_threadgroup);
+    
+    uint grid_index_in_batch = (batched ? gid.z : 0);
+    uint2 matrix_origin = C_offset + uint2(C_block_offset);
+    matrix_origin &= ~7;
+    ushort2 tile_dimensions(min(uint(N_group), N - matrix_origin.x),
+                            min(uint(M_group), M - matrix_origin.y));
+    uint function_index = 0;
+    if (batched_activation_function) {
+      function_index = activation_function_offsets[gid.z];
+    }
+    table[function_index](C_block, D, grid_index_in_batch, matrix_origin, tile_dimensions, lane_id);
+    threadgroup_barrier(mem_flags::mem_threadgroup);
+    
+    if (sidx == 0) {
+      async_access_accumulator(threadgroup_block, C, C_offset, true);
+    }
+    return;
+  } else if ((M % 8 != 0) || (N % 8 != 0)) {
+    auto C_block = simdgroup_matrix_storage<T>::apply_offset(threadgroup_block, N_group, C_block_offset);
+    partial_store(sram, C_block, false);
+    threadgroup_barrier(mem_flags::mem_threadgroup);
+    
+    if (sidx == 0) {
+      async_access_accumulator(threadgroup_block, C, C_offset, true);
+    }
+  } else {
+    uint2 matrix_origin = C_offset + uint2(C_block_offset);
+    auto C_src = simdgroup_matrix_storage<T>::apply_offset(C, N, matrix_origin);
+    store_accumulator(sram, C_src, false, false);
+    
+    const uint M_edge_floor = M - M % M_simd;
+    const uint N_edge_floor = N - N % N_simd;
+    if (matrix_origin.y < M_edge_floor) {
+      store_accumulator(sram, C_src, true, false);
+    }
+    if (matrix_origin.x < N_edge_floor) {
+      store_accumulator(sram, C_src, false, true);
+      if (matrix_origin.y < M_edge_floor) {
+        store_accumulator(sram, C_src, true, true);
+      }
+    }
+  }
+}
+
+kernel void hgemm(device half *A [[buffer(0)]],
+                  device half *B [[buffer(1)]],
+                  device half *C [[buffer(2)]],
+                  device void *D [[buffer(3), function_constant(use_activation)]],
+                  
+                  threadgroup half *threadgroup_block [[threadgroup(0)]],
+                  constant ulong4 *matrix_offsets [[buffer(10), function_constant(batched)]],
+                  typename activation_functor<half>::function_table table [[buffer(11), function_constant(use_activation_function)]],
+                  constant uint *activation_function_offsets [[buffer(12), function_constant(batched_activation_function)]],
+                  
+                  uint3 gid [[threadgroup_position_in_grid]],
+                  ushort sidx [[simdgroup_index_in_threadgroup]],
+                  ushort lane_id [[thread_index_in_simdgroup]])
+{
+  _gemm_impl<half>(A, B, C, D, threadgroup_block, matrix_offsets, table, activation_function_offsets, gid, sidx, lane_id);
+}
+
+kernel void sgemm(device float *A [[buffer(0)]],
+                  device float *B [[buffer(1)]],
+                  device float *C [[buffer(2)]],
+                  device void *D [[buffer(3), function_constant(use_activation)]],
+                  
+                  threadgroup float *threadgroup_block [[threadgroup(0)]],
+                  constant ulong4 *matrix_offsets [[buffer(10), function_constant(batched)]],
+                  typename activation_functor<float>::function_table table [[buffer(11), function_constant(use_activation_function)]],
+                  constant uint *activation_function_offsets [[buffer(12), function_constant(batched_activation_function)]],
+                  
+                  uint3 gid [[threadgroup_position_in_grid]],
+                  ushort sidx [[simdgroup_index_in_threadgroup]],
+                  ushort lane_id [[thread_index_in_simdgroup]])
+{
+  _gemm_impl<float>(A, B, C, D, threadgroup_block, matrix_offsets, table, activation_function_offsets, gid, sidx, lane_id);
+}
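
Review note: the edge handling in `GEMM.metal` rests on two derived constants per dimension, `M_modulo`/`N_modulo` and `M_padded`/`N_padded`. The host-side sketch below mirrors that arithmetic so it can be checked with concrete numbers. It uses `u32` where the shader uses `ushort`, and the function names are illustrative rather than part of the kernel.

```rust
/// Mirrors `M_modulo` / `N_modulo`: the extent of the last (edge) tile along
/// one dimension, which is a full tile when `dim` divides evenly by `simd`.
fn edge_modulo(dim: u32, simd: u32) -> u32 {
    if dim % simd == 0 {
        simd
    } else {
        dim % simd
    }
}

/// Mirrors `M_padded` / `N_padded`: when the whole dimension is smaller than
/// one tile, round the edge extent up to a multiple of 8 (the simdgroup
/// matrix size); otherwise use the full tile.
fn padded(dim: u32, simd: u32) -> u32 {
    if dim < simd {
        (edge_modulo(dim, simd) + 7) / 8 * 8
    } else {
        simd
    }
}
```

For example, with a tile size of 32 a dimension of 20 is padded to 24, so the unrolled `m += 8` and `n += 8` loops in the shader run three times instead of four.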
--- a/candle-metal-kernels/src/affine.metal
+++ b/candle-metal-kernels/src/affine.metal
@ -29,7 +29,9 @@ kernel void FN_NAME( \
    if (id >= dim) { \
        return; \
    } \
-    output[id] = TYPENAME(float(input[id]) * mul + add); \
+    const TYPENAME m = TYPENAME(mul); \
+    const TYPENAME a = TYPENAME(add); \
+    output[id] = input[id] * m + a; \
 } \
 kernel void FN_NAME##_strided( \
    constant size_t &dim, \
@ -45,80 +47,15 @@ kernel void FN_NAME##_strided( \
    if (id >= dim) { \
        return; \
    } \
-    output[id] = TYPENAME(float(input[get_strided_index(id, num_dims, dims, strides)]) * mul + add); \
-}
-
-#define POWF(FN_NAME, TYPENAME) \
-kernel void FN_NAME( \
-    constant size_t &dim, \
-    constant float &mul, \
-    device const TYPENAME *input,  \
-    device TYPENAME *output, \
-    uint id [[ thread_position_in_grid ]] \
-) { \
-    if (id >= dim) { \
-        return; \
-    } \
-    output[id] = TYPENAME(pow(input[id], TYPENAME(mul))); \
-} \
-kernel void FN_NAME##_strided( \
-    constant size_t &dim, \
-    constant size_t &num_dims, \
-    constant size_t *dims, \
-    constant size_t *strides, \
-    constant float &mul, \
-    device const TYPENAME *input,  \
-    device TYPENAME *output, \
-    uint id [[ thread_position_in_grid ]] \
-) { \
-    if (id >= dim) { \
-        return; \
-    } \
-    output[id] = TYPENAME(pow(input[get_strided_index(id, num_dims, dims, strides)], TYPENAME(mul))); \
-}
-
-#define ELU(FN_NAME, TYPENAME) \
-kernel void FN_NAME( \
-    constant size_t &dim, \
-    constant float &mul, \
-    device const TYPENAME *input,  \
-    device TYPENAME *output, \
-    uint id [[ thread_position_in_grid ]] \
-) { \
-    if (id >= dim) { \
-        return; \
-    } \
-    const TYPENAME x = input[id]; \
-    output[id] = TYPENAME((x > 0)?x: mul * exp(x - 1)); \
-} \
-kernel void FN_NAME##_strided( \
-    constant size_t &dim, \
-    constant size_t &num_dims, \
-    constant size_t *dims, \
-    constant size_t *strides, \
-    constant float &mul, \
-    device const TYPENAME *input,  \
-    device TYPENAME *output, \
-    uint id [[ thread_position_in_grid ]] \
-) { \
-    if (id >= dim) { \
-        return; \
-    } \
-    const TYPENAME x = input[get_strided_index(id, num_dims, dims, strides)]; \
-    output[id] = TYPENAME((x > 0)?x: mul * exp(x - 1)); \
+    const TYPENAME m = TYPENAME(mul); \
+    const TYPENAME a = TYPENAME(add); \
+    output[id] = input[get_strided_index(id, num_dims, dims, strides)] * m + a; \
 } \

-
-AFFINE(affine_f32, float)
-AFFINE(affine_f16, half)
-POWF(powf_f32, float)
-POWF(powf_f16, half)
-ELU(elu_f32, float)
-ELU(elu_f16, half)
+AFFINE(affine_float, float)
+AFFINE(affine_half, half)


 #if __METAL_VERSION__ >= 310
-AFFINE(affine_bf16, bfloat);
-POWF(powf_bf16, bfloat);
-ELU(elu_bf16, bfloat);
+AFFINE(affine_bfloat, bfloat);
 #endif
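The strided affine kernel above reads its input through `get_strided_index` and then applies `x * mul + add`. The following is a minimal CPU-side sketch of that behaviour, useful for checking kernel output on small inputs. The names `strided_index` and `affine_strided_ref` are hypothetical, and the index arithmetic assumes the usual convention of unravelling the linear index from the innermost dimension outwards, since the body of `get_strided_index` is not shown in this part of the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for the kernels' get_strided_index helper.
// Assumption: the linear index is unravelled over `dims` starting from the
// innermost dimension and re-projected through `strides`.
std::size_t strided_index(std::size_t idx,
                          const std::vector<std::size_t>& dims,
                          const std::vector<std::size_t>& strides) {
    assert(dims.size() == strides.size());
    std::size_t offset = 0;
    for (std::size_t d = dims.size(); d-- > 0;) {
        offset += (idx % dims[d]) * strides[d];
        idx /= dims[d];
    }
    return offset;
}

// Reference for the strided affine kernel:
// out[id] = in[strided_index(id)] * mul + add, one iteration per GPU thread.
std::vector<float> affine_strided_ref(const std::vector<float>& input,
                                      const std::vector<std::size_t>& dims,
                                      const std::vector<std::size_t>& strides,
                                      float mul, float add) {
    std::size_t numel = 1;
    for (std::size_t d : dims) numel *= d;
    std::vector<float> out(numel);
    for (std::size_t id = 0; id < numel; ++id) {
        out[id] = input[strided_index(id, dims, strides)] * mul + add;
    }
    return out;
}
```

For example, viewing the row-major 2x3 buffer `{0, 1, 2, 3, 4, 5}` as its 3x2 transpose (dims `{3, 2}`, strides `{1, 3}`) with `mul = 2` and `add = 1` visits the input in the order 0, 3, 1, 4, 2, 5.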
--- a/candle-metal-kernels/src/binary.metal
+++ b/candle-metal-kernels/src/binary.metal
@ -1,8 +1,5 @@
 #include <metal_stdlib>

-#define MAX(x, y) ((x) > (y) ? (x) : (y))
-#define MIN(x, y) ((x) < (y) ? (x) : (y))
-
 METAL_FUNC uint get_strided_index(
    uint idx,
    constant size_t &num_dims,
@ -25,15 +22,15 @@ kernel void FN_NAME( \
    constant size_t &dim, \
    device const TYPENAME *left,  \
    device const TYPENAME *right,  \
-    device OUT_TYPENAME *output, \
-    uint tid [[ thread_position_in_grid ]] \
+    device TYPENAME *output, \
+    uint thread_position_in_grid [[ thread_position_in_grid ]] \
 ) { \
-    if (tid >= dim) { \
+    if (thread_position_in_grid >= dim) { \
        return; \
    } \
-    TYPENAME x = left[tid]; \
-    TYPENAME y = right[tid]; \
-    output[tid] = OUT_TYPENAME(FN); \
+    TYPENAME x = left[thread_position_in_grid]; \
+    TYPENAME y = right[thread_position_in_grid]; \
+    output[thread_position_in_grid] = OUT_TYPENAME(FN); \
 }\
 kernel void FN_NAME_STRIDED( \
    constant size_t &dim, \
@ -43,73 +40,33 @@ kernel void FN_NAME_STRIDED( \
    constant size_t *right_strides, \
    device const TYPENAME *left,  \
    device const TYPENAME *right,  \
-    device OUT_TYPENAME *output, \
-    uint tid [[ thread_position_in_grid ]] \
+    device TYPENAME *output, \
+    uint thread_position_in_grid [[ thread_position_in_grid ]] \
 ) { \
-    if (tid >= dim) { \
+    if (thread_position_in_grid >= dim) { \
        return; \
    } \
-    TYPENAME x = left[get_strided_index(tid, num_dims, dims, left_strides)]; \
-    TYPENAME y = right[get_strided_index(tid, num_dims, dims, right_strides)]; \
-    output[tid] = OUT_TYPENAME(FN); \
+    TYPENAME x = left[get_strided_index(thread_position_in_grid, num_dims, dims, left_strides)]; \
+    TYPENAME y = right[get_strided_index(thread_position_in_grid, num_dims, dims, right_strides)]; \
+    output[thread_position_in_grid] = OUT_TYPENAME(FN); \
 }

 #define BINARY_OP(FN, NAME) \
-BINARY(FN, float, float, NAME##_f32, NAME##_f32_strided); \
-BINARY(FN, half, half, NAME##_f16, NAME##_f16_strided); \
-BINARY(FN, uint32_t, uint32_t, NAME##_u32, NAME##_u32_strided); \
-BINARY(FN, uint8_t, uint8_t, NAME##_u8, NAME##_u8_strided);
-
-#define INT64_BINARY_OP(NAME, FN) \
-BINARY(FN, int64_t, int64_t, NAME##_i64, NAME##_i64_strided);
+BINARY(FN, float, float, NAME##_float, NAME##_float_strided); \
+BINARY(FN, half, half, NAME##_half, NAME##_half_strided);

 #define BFLOAT_BINARY_OP(FN, NAME) \
-BINARY(FN, bfloat, bfloat, NAME##_bf16, NAME##_bf16_strided);
+BINARY(FN, bfloat, bfloat, NAME##_bfloat, NAME##_bfloat_strided);

-#define BINARY_OP_OUT(NAME, FN) \
-BINARY(FN, float, uint8_t, NAME##_f32, NAME##_f32_strided); \
-BINARY(FN, half, uint8_t, NAME##_f16, NAME##_f16_strided); \
-BINARY(FN, uint32_t, uint8_t, NAME##_u32, NAME##_u32_strided); \
-BINARY(FN, uint8_t, uint8_t, NAME##_u8, NAME##_u8_strided);
-
-#define INT64_BINARY_OP_OUT(NAME, FN) \
-BINARY(FN, int64_t, int8_t, NAME##_i64, NAME##_i64_strided);

 BINARY_OP(x + y, add)
 BINARY_OP(x - y, sub)
 BINARY_OP(x * y, mul)
 BINARY_OP(x / y, div)
-BINARY_OP(MIN(x, y), min)
-BINARY_OP(MAX(x, y), max)
-
-BINARY_OP_OUT(eq, x == y)
-BINARY_OP_OUT(ne, x != y)
-BINARY_OP_OUT(le, x <= y)
-BINARY_OP_OUT(lt, x < y)
-BINARY_OP_OUT(ge, x >= y)
-BINARY_OP_OUT(gt, x > y)
-
-#if __METAL_VERSION__ >= 220
-INT64_BINARY_OP(add, x + y)
-INT64_BINARY_OP(sub, x - y)
-INT64_BINARY_OP(mul, x * y)
-INT64_BINARY_OP(div, x / y)
-INT64_BINARY_OP(min, MIN(x, y))
-INT64_BINARY_OP(max, MAX(x, y))
-
-INT64_BINARY_OP_OUT(eq, x == y)
-INT64_BINARY_OP_OUT(ne, x != y)
-INT64_BINARY_OP_OUT(le, x <= y)
-INT64_BINARY_OP_OUT(lt, x < y)
-INT64_BINARY_OP_OUT(ge, x >= y)
-INT64_BINARY_OP_OUT(gt, x > y)
-#endif

 #if __METAL_VERSION__ >= 310
 BFLOAT_BINARY_OP(x + y, add)
 BFLOAT_BINARY_OP(x - y, sub)
 BFLOAT_BINARY_OP(x * y, mul)
 BFLOAT_BINARY_OP(x / y, div)
-BFLOAT_BINARY_OP(MIN(x, y), min)
-BFLOAT_BINARY_OP(MAX(x, y), max)
 #endif
--- a/candle-metal-kernels/src/cast.metal
+++ b/candle-metal-kernels/src/cast.metal
@ -23,12 +23,12 @@ kernel void FN_NAME( \
    constant size_t &dim, \
    device const LEFT_TYPENAME *input,  \
    device RIGHT_TYPENAME *output, \
-    uint tid [[ thread_position_in_grid ]] \
+    uint thread_position_in_grid [[ thread_position_in_grid ]] \
 ) { \
-    if (tid >= dim) { \
+    if (thread_position_in_grid >= dim) { \
        return; \
    } \
-    output[tid] = RIGHT_TYPENAME(input[tid]); \
+    output[thread_position_in_grid] = RIGHT_TYPENAME(input[thread_position_in_grid]); \
 } \
 kernel void FN_NAME_STRIDED( \
    constant size_t &dim, \
@ -37,28 +37,17 @@ kernel void FN_NAME_STRIDED( \
    constant size_t *strides, \
    device const LEFT_TYPENAME *input,  \
    device RIGHT_TYPENAME *output, \
-    uint tid [[ thread_position_in_grid ]] \
+    uint i [[ thread_position_in_grid ]] \
 ) { \
-    if (tid >= dim) { \
+    if (i >= dim) { \
        return; \
    } \
-    output[tid] = RIGHT_TYPENAME(input[get_strided_index(tid, num_dims, dims, strides)]); \
+    output[i] = RIGHT_TYPENAME(input[get_strided_index(i, num_dims, dims, strides)]); \
 } \

-CAST(cast_u32_f32, cast_u32_f32_strided, uint32_t, float)
-CAST(cast_u32_u8, cast_u32_u8_strided, uint32_t, uint8_t)
-CAST(cast_u8_u32, cast_u8_u32_strided, uint8_t, uint32_t)
-CAST(cast_u8_f32, cast_u8_f32_strided, uint8_t, float)
+CAST(cast_u32_f32, cast_u32_f32_strided, uint32_t, float)
 CAST(cast_f16_f32, cast_f16_f32_strided, half, float)
 CAST(cast_f32_f16, cast_f32_f16_strided, float, half)

-#if __METAL_VERSION__ >= 220
-CAST(cast_u8_i64, cast_u8_i64_strided, uint8_t, int64_t)
-CAST(cast_u32_i64, cast_u32_i64_strided, uint32_t, int64_t)
-CAST(cast_i64_f32, cast_i64_f32_strided, int64_t, float)
-#endif
-
 #if __METAL_VERSION__ >= 310
-CAST(cast_bf16_f32, cast_bf16_f32_strided, bfloat, float)
-CAST(cast_f32_bf16, cast_f32_bf16_strided, float, bfloat)
 #endif
--- a/candle-metal-kernels/src/conv.metal
+++ b/candle-metal-kernels/src/conv.metal
@ -1,213 +0,0 @@
-template <typename T>
-METAL_FUNC void im2col(
-    constant size_t &dst_numel,
-    constant size_t &h_out,
-    constant size_t &w_out,
-    constant size_t &h_k,
-    constant size_t &w_k,
-    constant size_t &stride,
-    constant size_t &padding,
-    constant size_t &dilation,
-    constant size_t *src_dims,
-    constant size_t *src_strides,
-    device const T *src,
-    device T *dst,
-    uint tid [[ thread_position_in_grid ]]
-) {
-  // dst: (b_size, h_out, w_out, c_in, h_k, w_k)
-  // src: (b_size, c_in, h_in, w_in)
-  if (tid >= dst_numel) {
-    return;
-  }
-  const size_t b_in = src_dims[0];
-  const size_t c_in = src_dims[1];
-  const size_t h_in = src_dims[2];
-  const size_t w_in = src_dims[3];
-
-  const size_t dst_s4 = w_k;
-  const size_t dst_s3 = h_k * dst_s4;
-  const size_t dst_s2 = c_in * dst_s3;
-  const size_t dst_s1 = w_out * dst_s2;
-  const size_t dst_s0 = h_out * dst_s1;
-
-  size_t tmp_tid = tid;
-  const size_t b_idx = tmp_tid / dst_s0;
-  tmp_tid -= b_idx * dst_s0;
-  const size_t h_idx = tmp_tid / dst_s1;
-  tmp_tid -= h_idx * dst_s1;
-  const size_t w_idx = tmp_tid / dst_s2;
-  tmp_tid -= w_idx * dst_s2;
-  const size_t c_idx = tmp_tid / dst_s3;
-  tmp_tid -= c_idx * dst_s3;
-  const size_t h_k_idx = tmp_tid / dst_s4;
-  tmp_tid -= h_k_idx * dst_s4;
-  const size_t w_k_idx = tmp_tid;
-  size_t src_h_idx = h_idx * stride + h_k_idx * dilation;
-  size_t src_w_idx = w_idx * stride + w_k_idx * dilation;
-  if (src_h_idx < padding || src_h_idx >= h_in + padding) {
-    dst[tid] = static_cast<T>(0);
-  }
-  else if (src_w_idx < padding || src_w_idx >= w_in + padding) {
-    dst[tid] = static_cast<T>(0);
-  }
-  else {
-    src_h_idx -= padding;
-    src_w_idx -= padding;
-    const size_t src_i =
-      b_idx * src_strides[0]
-      + c_idx * src_strides[1]
-      + src_h_idx * src_strides[2]
-      + src_w_idx * src_strides[3];
-    dst[tid] = src[src_i];
-  }
-}
-
-template <typename T>
-METAL_FUNC void im2col1d(
-    constant size_t &dst_numel,
-    constant size_t &l_out,
-    constant size_t &l_k,
-    constant size_t &stride,
-    constant size_t &padding,
-    constant size_t &dilation,
-    constant size_t *src_dims,
-    constant size_t *src_strides,
-    device const T *src,
-    device T *dst,
-    uint tid [[ thread_position_in_grid ]]
-) {
-  // dst: (b_size, l_out, c_in, l_k)
-  // src: (b_size, c_in, l_in)
-  if (tid >= dst_numel) {
-    return;
-  }
-  const size_t b_in = src_dims[0];
-  const size_t c_in = src_dims[1];
-  const size_t l_in = src_dims[2];
-
-  const size_t dst_s2 = l_k;
-  const size_t dst_s1 = c_in * dst_s2;
-  const size_t dst_s0 = l_out * dst_s1;
-
-  size_t tmp_dst_i = tid;
-  const size_t b_idx = tmp_dst_i / dst_s0;
-  tmp_dst_i -= b_idx * dst_s0;
-  const size_t l_idx = tmp_dst_i / dst_s1;
-  tmp_dst_i -= l_idx * dst_s1;
-  const size_t c_idx = tmp_dst_i / dst_s2;
-  tmp_dst_i -= c_idx * dst_s2;
-  const size_t l_k_idx = tmp_dst_i;
-  size_t src_l_idx = l_idx * stride + l_k_idx * dilation;
-  if (src_l_idx < padding || src_l_idx >= l_in + padding) {
-    dst[tid] = static_cast<T>(0);
-  }
-  else {
-    src_l_idx -= padding;
-    const size_t src_i = b_idx * src_strides[0] + c_idx * src_strides[1] + src_l_idx * src_strides[2];
-    dst[tid] = src[src_i];
-  }
-}
-
-template <typename T>
-METAL_FUNC void upsample_nearest2d(
-    constant size_t &w_out,
-    constant size_t &h_out,
-    constant float &w_scale,
-    constant float &h_scale,
-    constant size_t *src_dims,
-    constant size_t *src_s,
-    device const T *src,
-    device T *dst,
-    uint tid [[ thread_position_in_grid ]]
-) {
-  // src: (b_size, c_in, w_in, h_in)
-
-  const size_t c = src_dims[1];
-  const size_t w_in = src_dims[2];
-  const size_t h_in = src_dims[3];
-
-  if (tid >= src_dims[0] * c * w_out * h_out) {
-    return;
-  }
-
-  // TODO: Improve this.
-  const size_t b_idx = tid / (w_out * h_out * c);
-  const size_t c_idx = (tid / (w_out * h_out)) % c;
-  const size_t dst_w = (tid / h_out) % w_out;
-  const size_t dst_h = tid % h_out;
-
-  size_t src_w = static_cast<size_t>(dst_w * w_scale);
-  size_t src_h = static_cast<size_t>(dst_h * h_scale);
-  if (src_w >= w_in) {
-    src_w = w_in - 1;
-  }
-  if (src_h >= h_in) {
-    src_h = h_in - 1;
-  }
-
-  const size_t src_i = b_idx * src_s[0] + c_idx * src_s[1] + src_w * src_s[2] + src_h * src_s[3];
-  dst[tid] = src[src_i];
-}
-
-#define IM2COL_OP(T, FN_NAME) \
-kernel void FN_NAME(  \
-    constant size_t &dst_numel, \
-    constant size_t &h_out, \
-    constant size_t &w_out, \
-    constant size_t &h_k, \
-    constant size_t &w_k, \
-    constant size_t &stride, \
-    constant size_t &padding, \
-    constant size_t &dilation, \
-    constant size_t *src_dims, \
-    constant size_t *src_strides, \
-    device const T *src, \
-    device T *dst, \
-    uint tid [[ thread_position_in_grid ]] \
-) {  \
-  im2col<T>(dst_numel, h_out, w_out, h_k, w_k, stride, padding, dilation, src_dims, src_strides, src, dst, tid); \
-} \
-
-#define IM2COL1D_OP(T, FN_NAME) \
-kernel void FN_NAME(  \
-    constant size_t &dst_numel, \
-    constant size_t &l_out, \
-    constant size_t &l_k, \
-    constant size_t &stride, \
-    constant size_t &padding, \
-    constant size_t &dilation, \
-    constant size_t *src_dims, \
-    constant size_t *src_strides, \
-    device const T *src, \
-    device T *dst, \
-    uint tid [[ thread_position_in_grid ]] \
-) {  \
-  im2col1d<T>(dst_numel, l_out, l_k, stride, padding, dilation, src_dims, src_strides, src, dst, tid); \
-} \
- 
-#define UPSAMPLE_NEAREST2D_OP(TYPENAME, FN_NAME) \
-kernel void FN_NAME(  \
-    constant size_t &w_out, \
-    constant size_t &h_out, \
-    constant float &w_scale, \
-    constant float &h_scale, \
-    constant size_t *dims, \
-    constant size_t *strides, \
-    device const TYPENAME *src, \
-    device TYPENAME *dst, \
-    uint tid [[ thread_position_in_grid ]] \
-) {  \
-  upsample_nearest2d<TYPENAME>(w_out, h_out, w_scale, h_scale, dims, strides, src, dst, tid); \
-} \
-
-IM2COL_OP(float, im2col_f32)
-IM2COL_OP(uint8_t, im2col_u8)
-IM2COL_OP(uint32_t, im2col_u32)
-
-IM2COL1D_OP(float, im2col1d_f32)
-IM2COL1D_OP(uint8_t, im2col1d_u8)
-IM2COL1D_OP(uint32_t, im2col1d_u32)
-
-UPSAMPLE_NEAREST2D_OP(float, upsample_nearest2d_f32)
-UPSAMPLE_NEAREST2D_OP(uint8_t, upsample_nearest2d_u8)
-UPSAMPLE_NEAREST2D_OP(uint32_t, upsample_nearest2d_u32)
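The `im2col1d` kernel removed above maps a `(b_size, c_in, l_in)` source to a `(b_size, l_out, c_in, l_k)` destination, writing zeros where the window falls into the padding. A small CPU reference can make that index arithmetic easier to follow. This is only a sketch: `im2col1d_ref` is a hypothetical name, the source is assumed contiguous (the kernel takes explicit strides), and `l_out` is computed here with the standard convolution output-length formula, whereas the kernel receives it as a parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for a 1D im2col over a contiguous (b, c_in, l_in) source.
// The result is laid out as (b, l_out, c_in, l_k), matching the kernel comment.
std::vector<float> im2col1d_ref(const std::vector<float>& src,
                                std::size_t b, std::size_t c_in, std::size_t l_in,
                                std::size_t l_k, std::size_t stride,
                                std::size_t padding, std::size_t dilation) {
    assert(src.size() == b * c_in * l_in);
    assert(l_k >= 1 && stride >= 1);
    assert(l_in + 2 * padding >= dilation * (l_k - 1) + 1);
    // Assumption: standard convolution output length.
    const std::size_t l_out =
        (l_in + 2 * padding - dilation * (l_k - 1) - 1) / stride + 1;

    std::vector<float> dst(b * l_out * c_in * l_k, 0.0f);
    for (std::size_t b_idx = 0; b_idx < b; ++b_idx)
        for (std::size_t l_idx = 0; l_idx < l_out; ++l_idx)
            for (std::size_t c_idx = 0; c_idx < c_in; ++c_idx)
                for (std::size_t k_idx = 0; k_idx < l_k; ++k_idx) {
                    // Position in the padded input, as in the kernel.
                    std::size_t src_l = l_idx * stride + k_idx * dilation;
                    if (src_l < padding || src_l >= l_in + padding) {
                        continue;  // padding region, destination stays zero
                    }
                    src_l -= padding;
                    const std::size_t dst_i =
                        ((b_idx * l_out + l_idx) * c_in + c_idx) * l_k + k_idx;
                    const std::size_t src_i = (b_idx * c_in + c_idx) * l_in + src_l;
                    dst[dst_i] = src[src_i];
                }
    return dst;
}
```

With a single-channel input `{1, 2, 3, 4}`, kernel length 2, stride 1, padding 1 and dilation 1, the five output windows are `[0,1]`, `[1,2]`, `[2,3]`, `[3,4]` and `[4,0]`.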
--- a/candle-metal-kernels/src/indexing.metal
+++ b/candle-metal-kernels/src/indexing.metal
@ -1,34 +1,6 @@
 #include <metal_stdlib>
 using namespace metal;

-template<typename TYPENAME, typename INDEX_TYPENAME>
-METAL_FUNC void index( 
-    constant size_t &dst_size, 
-    constant size_t &left_size, 
-    constant size_t &src_dim_size, 
-    constant size_t &right_size, 
-    constant size_t &ids_size, 
-    const device TYPENAME *input, 
-    const device INDEX_TYPENAME *input_ids, 
-    device TYPENAME *output, 
-    uint tid [[ thread_position_in_grid ]] 
-) { 
-    if (tid >= dst_size) { 
-        return; 
-    } 
-    const size_t id_i = (tid / right_size) % ids_size; 
-    const INDEX_TYPENAME input_i = min(input_ids[id_i], (INDEX_TYPENAME)(src_dim_size - 1)); 
-    const size_t right_rank_i = tid % right_size; 
-    const size_t left_rank_i = tid / right_size / ids_size; 
-    /*
-    // Clamp the index to prevent out-of-bounds reads,
-    // since there doesn't seem to be a good way to force a crash.
-    // No need to check for values below zero, we're only allowing unsigned index types.
-    */
-    const size_t src_i = left_rank_i * src_dim_size * right_size + input_i * right_size + right_rank_i; 
-    output[tid] = input[src_i]; 
-}
-
 # define INDEX_OP(NAME, INDEX_TYPENAME, TYPENAME) \
 kernel void NAME( \
    constant size_t &dst_size, \
@ -39,160 +11,93 @@ kernel void NAME( \
    const device TYPENAME *input, \
    const device INDEX_TYPENAME *input_ids, \
    device TYPENAME *output, \
-    uint tid [[ thread_position_in_grid ]] \
+    uint gid [[ thread_position_in_grid ]] \
 ) { \
-    index<TYPENAME, INDEX_TYPENAME>(dst_size, left_size, src_dim_size, right_size, ids_size, input, input_ids, output, tid); \
+    if (gid >= dst_size) { \
+        return; \
+    } \
+    const size_t id_i = (gid / right_size) % ids_size; \
+    const INDEX_TYPENAME input_i = min(input_ids[id_i], (INDEX_TYPENAME)(src_dim_size - 1)); \
+    const size_t right_rank_i = gid % right_size; \
+    const size_t left_rank_i = gid / right_size / ids_size; \
+    /* \
+    // Clamp the index to prevent out-of-bounds reads, \
+    // since there doesn't seem to be a good way to force a crash. \
+    // No need to check for values below zero, we're only allowing unsigned index types. \
+    */ \
+    const size_t src_i = left_rank_i * src_dim_size * right_size + input_i * right_size + right_rank_i; \
+    output[gid] = input[src_i]; \
 }


-template<typename TYPENAME, typename INDEX_TYPENAME>
-METAL_FUNC void gather( 
-    constant size_t &dst_size, 
-    constant size_t &left_size, 
-    constant size_t &src_dim_size, 
-    constant size_t &right_size, 
-    constant size_t &ids_size, 
-    const device TYPENAME *input, 
-    const device INDEX_TYPENAME *input_ids, 
-    device TYPENAME *output, 
-    uint tid [[ thread_position_in_grid ]] 
-) { 
-    if (tid >= dst_size) { 
-        return; 
-    } 
-    const INDEX_TYPENAME input_i = input_ids[tid]; 
-    const size_t right_rank_i = tid % right_size; 
-    const size_t left_rank_i = tid / right_size / ids_size; 
-    const size_t src_i = (left_rank_i * src_dim_size + input_i) * right_size + right_rank_i; 
-    output[tid] = input[src_i]; 
-}

-# define GATHER_OP(NAME, INDEX_TYPENAME, TYPENAME) \
-kernel void NAME( \
-    constant size_t &dst_size, \
-    constant size_t &left_size, \
-    constant size_t &src_dim_size, \
-    constant size_t &right_size, \
-    constant size_t &ids_size, \
-    const device TYPENAME *input, \
-    const device INDEX_TYPENAME *input_ids, \
-    device TYPENAME *output, \
-    uint tid [[ thread_position_in_grid ]] \
-) { \
-    gather<TYPENAME, INDEX_TYPENAME>(dst_size, left_size, src_dim_size, right_size, ids_size, input, input_ids, output, tid); \
-}
+template <typename T, typename I>
+void index_add(
+    device I *ids [[buffer(0)]],
+    device T *inp [[buffer(1)]],
+    device T *out [[buffer(2)]],

-template<typename TYPENAME, typename INDEX_TYPENAME>
-METAL_FUNC void scatter_add( 
-    constant size_t &dst_size, 
-    constant size_t &left_size, 
-    constant size_t &src_dim_size, 
-    constant size_t &right_size, 
-    constant size_t &dst_dim_size, 
-    const device TYPENAME *input, 
-    const device INDEX_TYPENAME *input_ids, 
-    device TYPENAME *output, 
-    uint tid [[ thread_position_in_grid ]] 
-) { 
-    if (tid >= dst_size) { 
-        return; 
-    } 
-    const size_t right_rank_i = tid % right_size; 
-    const size_t left_rank_i = tid / right_size; 
-    for (unsigned int j = 0; j < src_dim_size; ++j) {
-        const size_t src_i = (left_rank_i * src_dim_size + j) * right_size + right_rank_i; 
-        const INDEX_TYPENAME idx = input_ids[src_i];
-        const size_t dst_i = (left_rank_i * dst_dim_size + idx) * right_size + right_rank_i; 
-        output[dst_i] += input[src_i]; 
+    constant uint &ids_dim_size,
+    constant uint &left_size,
+    constant uint &dst_dim_size,
+    constant uint &right_size,
+
+    uint gid [[ thread_position_in_grid ]]
+) {
+
+    if (gid >= left_size * right_size) {
+        return;
+    }
+
+    const uint i = gid;
+    const uint pre = i / right_size;
+    const uint post = i % right_size;
+
+    for (uint j = 0; j < ids_dim_size; j++) {
+        const uint idx = ids[j];
+        const uint src_i = (pre * ids_dim_size + j) * right_size + post;
+        const uint dst_i = (pre * dst_dim_size + idx) * right_size + post;
+        out[dst_i] += inp[src_i];
    }
 }

-# define SCATTER_ADD_OP(NAME, INDEX_TYPENAME, TYPENAME) \
-kernel void NAME( \
-    constant size_t &dst_size, \
-    constant size_t &left_size, \
-    constant size_t &src_dim_size, \
-    constant size_t &right_size, \
-    constant size_t &dst_dim_size, \
-    const device TYPENAME *input, \
-    const device INDEX_TYPENAME *input_ids, \
-    device TYPENAME *output, \
-    uint tid [[ thread_position_in_grid ]] \
-) { \
-    scatter_add<TYPENAME, INDEX_TYPENAME>(dst_size, left_size, src_dim_size, right_size, dst_dim_size, input, input_ids, output, tid); \
-}
-
-template<typename TYPENAME, typename INDEX_TYPENAME>
-METAL_FUNC void index_add( 
-    constant size_t &dst_size, 
-    constant size_t &left_size, 
-    constant size_t &src_dim_size, 
-    constant size_t &right_size, 
-    constant size_t &dst_dim_size, 
-    constant size_t &ids_dim_size, 
-    const device TYPENAME *input, 
-    const device INDEX_TYPENAME *input_ids, 
-    device TYPENAME *output, 
-    uint tid [[ thread_position_in_grid ]] 
-) { 
-    if (tid >= dst_size) { 
-        return; 
-    } 
-    const size_t right_rank_i = tid % right_size; 
-    const size_t left_rank_i = tid / right_size; 
-    for (unsigned int j = 0; j < ids_dim_size; ++j) {
-        const INDEX_TYPENAME idx = input_ids[j];
-        const size_t src_i = (left_rank_i * src_dim_size + j) * right_size + right_rank_i; 
-        const size_t dst_i = (left_rank_i * dst_dim_size + idx) * right_size + right_rank_i; 
-        output[dst_i] += input[src_i]; 
-    }
-}
-
-# define INDEX_ADD_OP(NAME, INDEX_TYPENAME, TYPENAME) \
-kernel void NAME( \
-    constant size_t &dst_size, \
-    constant size_t &left_size, \
-    constant size_t &src_dim_size, \
-    constant size_t &right_size, \
-    constant size_t &dst_dim_size, \
-    constant size_t &ids_dim_size, \
-    const device TYPENAME *input, \
-    const device INDEX_TYPENAME *input_ids, \
-    device TYPENAME *output, \
-    uint tid [[ thread_position_in_grid ]] \
-) { \
-    index_add<TYPENAME, INDEX_TYPENAME>(dst_size, left_size, src_dim_size, right_size, dst_dim_size, ids_dim_size, input, input_ids, output, tid); \
-}
+#define IA_OP(TYPENAME, INDEX_TYPENAME, FN_NAME) \
+kernel void FN_NAME( \
+    device INDEX_TYPENAME *ids [[buffer(0)]], \
+    device TYPENAME *inp [[buffer(1)]], \
+    device TYPENAME *out [[buffer(2)]], \
+    constant uint &ids_dim_size, \
+    constant uint &left_size, \
+    constant uint &dst_dim_size, \
+    constant uint &right_size, \
+    uint gid [[ thread_position_in_grid ]] \
+) { index_add<TYPENAME, INDEX_TYPENAME>(ids, inp, out, ids_dim_size, left_size, dst_dim_size, right_size, gid); } \


 INDEX_OP(is_u32_f32, uint, float)
 INDEX_OP(is_u32_f16, uint, half)
-GATHER_OP(gather_u32_f32, uint, float)
-GATHER_OP(gather_u32_f16, uint, half)
-SCATTER_ADD_OP(sa_u32_f32, uint, float)
-SCATTER_ADD_OP(sa_u32_f16, uint, half)


 #if __METAL_VERSION__ >= 310
-INDEX_ADD_OP(ia_i64_bf16, int64_t, bfloat)
-INDEX_ADD_OP(ia_u32_bf16, uint32_t, bfloat)
-INDEX_ADD_OP(ia_u8_bf16, uint8_t, bfloat)
+IA_OP(bfloat, int64_t, ia_i64_bf16)
+IA_OP(bfloat, uint32_t, ia_u32_bf16)
+IA_OP(bfloat, uint8_t, ia_u8_bf16)
 #endif

-INDEX_ADD_OP(ia_u32_f16, uint32_t, half)
-INDEX_ADD_OP(ia_u8_f16, uint8_t, half)
+IA_OP(half, uint32_t, ia_u32_f16)
+IA_OP(half, uint8_t, ia_u8_f16)

-INDEX_ADD_OP(ia_i64_f32, int64_t, float)
-INDEX_ADD_OP(ia_i64_u8, int64_t, uint8_t)
-INDEX_ADD_OP(ia_i64_i64, int64_t, int64_t)
-INDEX_ADD_OP(ia_i64_u32, int64_t, uint32_t)
+IA_OP(float, int64_t, ia_i64_f32)
+IA_OP(uint8_t, int64_t, ia_i64_u8)
+IA_OP(int64_t, int64_t, ia_i64_i64)
+IA_OP(uint32_t, int64_t, ia_i64_u32)

-INDEX_ADD_OP(ia_u32_f32, uint32_t, float)
-INDEX_ADD_OP(ia_u32_u8, uint32_t, uint8_t)
-INDEX_ADD_OP(ia_u32_i64, uint32_t, int64_t)
-INDEX_ADD_OP(ia_u32_u32, uint32_t, uint32_t)
+IA_OP(float, uint32_t, ia_u32_f32)
+IA_OP(uint8_t, uint32_t, ia_u32_u8)
+IA_OP(int64_t, uint32_t, ia_u32_i64)
+IA_OP(uint32_t, uint32_t, ia_u32_u32)

-INDEX_ADD_OP(ia_u8_f32, uint8_t, float)
-INDEX_ADD_OP(ia_u8_u8, uint8_t, uint8_t)
-INDEX_ADD_OP(ia_u8_u32, uint8_t, uint32_t)
-INDEX_ADD_OP(ia_u8_i64, uint8_t, int64_t)
+IA_OP(float, uint8_t, ia_u8_f32)
+IA_OP(uint8_t, uint8_t, ia_u8_u8)
+IA_OP(uint32_t, uint8_t, ia_u8_u32)
+IA_OP(int64_t, uint8_t, ia_u8_i64)
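The rewritten `index_add` kernel assigns one thread per `(pre, post)` pair and loops over the ids inside the thread. Because each thread only ever writes destination elements with its own `pre` and `post`, duplicate ids accumulate sequentially within one thread rather than racing across threads. The sketch below mirrors that loop structure on the CPU. `index_add_ref` is a hypothetical helper and fixes the element and index types to `float` and `uint32_t` for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for the index_add kernel. The outer loop plays the role of
// the GPU grid: one iteration per (pre, post) pair.
void index_add_ref(const std::vector<std::uint32_t>& ids,
                   const std::vector<float>& inp,
                   std::vector<float>& out,
                   std::size_t left_size,
                   std::size_t dst_dim_size,
                   std::size_t right_size) {
    const std::size_t ids_dim_size = ids.size();
    assert(inp.size() == left_size * ids_dim_size * right_size);
    assert(out.size() == left_size * dst_dim_size * right_size);

    for (std::size_t gid = 0; gid < left_size * right_size; ++gid) {
        const std::size_t pre = gid / right_size;
        const std::size_t post = gid % right_size;
        for (std::size_t j = 0; j < ids_dim_size; ++j) {
            const std::size_t idx = ids[j];
            assert(idx < dst_dim_size);  // the kernel itself does not check this
            const std::size_t src_i = (pre * ids_dim_size + j) * right_size + post;
            const std::size_t dst_i = (pre * dst_dim_size + idx) * right_size + post;
            out[dst_i] += inp[src_i];
        }
    }
}
```

For instance, with `left_size = 1`, `right_size = 2`, `dst_dim_size = 3`, ids `{2, 0, 2}` and input rows `{1, 2}`, `{3, 4}`, `{5, 6}`, a zero-initialised destination ends up as `{3, 4, 0, 0, 6, 8}`: rows 0 and 2 of the input both accumulate into destination row 2.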
--- a/candle-metal-kernels/src/lib.rs
+++ b/candle-metal-kernels/src/lib.rs
--- a/candle-metal-kernels/src/libMetalFlashAttention.metallib
+++ b/candle-metal-kernels/src/libMetalFlashAttention.metallib
--- a/candle-metal-kernels/src/quantized.metal
+++ b/candle-metal-kernels/src/quantized.metal
--- a/candle-metal-kernels/src/reduce.metal
+++ b/candle-metal-kernels/src/reduce.metal
@ -1,9 +1,6 @@
 #include <metal_stdlib>
 using namespace metal;

-#define MAX(x, y) ((x) > (y) ? (x) : (y))
-#define MIN(x, y) ((x) < (y) ? (x) : (y))
-
 METAL_FUNC uint get_strided_index(
    uint idx,
    constant size_t &num_dims,
@ -19,160 +16,39 @@ METAL_FUNC uint get_strided_index(
    return strided_i;
 }

-constant int THREADGROUP_SIZE = 2048;
+constant int THREADGROUP_SIZE = 1024;

-
-#define ARGMIN(NAME, T, MAXVALUE) \
+# define REDUCE(FN, NAME, TYPENAME) \
 kernel void NAME( \
-    constant size_t &num_dims, \
-    constant size_t *dims, \
-    constant size_t *strides, \
+    constant size_t &src_numel, \
    constant size_t &el_to_sum_per_block, \
-    device const T *src, \
-    device uint *dst,  \
-    uint id [[ thread_position_in_grid ]],  \
-    uint tid [[ thread_index_in_threadgroup ]],  \
-    uint dst_id [[ threadgroup_position_in_grid ]],  \
-    uint block_dim [[ threads_per_threadgroup ]]  \
-) {  \
-      \
-   threadgroup T shared_memory[THREADGROUP_SIZE];  \
-   threadgroup uint shared_indices[THREADGROUP_SIZE];  \
-       \
-   shared_memory[tid] = MAXVALUE;  \
-   shared_indices[tid] = 0xFFFFFFFF; \
-   bool notset = true; \
-   /*  \
-   // Elements summed in this block range from dst_id * el_to_sum_per_block   \
-   // to (dst_id + 1) * el_to_sum_per_block.  \
-   */  \
-   size_t start_idx = dst_id * el_to_sum_per_block;  \
-   size_t stop_idx = start_idx + el_to_sum_per_block;  \
-   size_t idx = start_idx + tid;  \
-   while (idx < stop_idx) {  \
-     /*  \
-     // TODO: Fast version for the contiguous case.  \
-     */  \
-     size_t strided_i = get_strided_index(idx, num_dims, dims, strides);  \
-     if (notset || src[strided_i] < shared_memory[tid]) {  \
-         shared_memory[tid] = src[strided_i];  \
-          /* Assume that the reduction takes place over the last dimension which is contiguous. */ \
-          shared_indices[tid] = idx % dims[num_dims - 1]; \
-          notset = false; \
-     }  \
-     idx += block_dim;  \
-   }  \
-       \
-   threadgroup_barrier(mem_flags::mem_none);  \
-     \
-   /*  \
-   // reduction in shared memory  \
-   */  \
-   for (uint s = block_dim / 2; s > 0; s >>= 1) {  \
-       if (tid < s && shared_memory[tid + s] < shared_memory[tid]) {  \
-           shared_indices[tid] = shared_indices[tid + s];  \
-           shared_memory[tid] = shared_memory[tid + s];  \
-       }  \
-       threadgroup_barrier(mem_flags::mem_none);  \
-   }  \
-     \
-     if (tid == 0){ \
-       dst[dst_id] = shared_indices[0];  \
-     } \
-} \
-
-
-#define ARGMAX(NAME, T, MINVALUE) \
-kernel void NAME( \
-    constant size_t &num_dims, \
-    constant size_t *dims, \
-    constant size_t *strides, \
-    constant size_t &el_to_sum_per_block, \
-    device const T *src, \
-    device uint *dst,  \
-    uint id [[ thread_position_in_grid ]],  \
-    uint tid [[ thread_index_in_threadgroup ]],  \
-    uint dst_id [[ threadgroup_position_in_grid ]],  \
-    uint block_dim [[ threads_per_threadgroup ]]  \
-) {  \
-      \
-   threadgroup T shared_memory[THREADGROUP_SIZE];  \
-   threadgroup uint shared_indices[THREADGROUP_SIZE];  \
-       \
-   shared_memory[tid] = MINVALUE;  \
-   shared_indices[tid] = 0xFFFFFFFF; \
-   /*  \
-   // Elements summed in this block range from dst_id * el_to_sum_per_block   \
-   // to (dst_id + 1) * el_to_sum_per_block.  \
-   */  \
-   size_t start_idx = dst_id * el_to_sum_per_block;  \
-   size_t stop_idx = start_idx + el_to_sum_per_block;  \
-   size_t idx = start_idx + tid;  \
-   bool notset = true; \
-   while (idx < stop_idx) {  \
-     /*  \
-     // TODO: Fast version for the contiguous case.  \
-     */  \
-     size_t strided_i = get_strided_index(idx, num_dims, dims, strides);  \
-     if (notset || shared_memory[tid] < src[strided_i]) {  \
-         shared_memory[tid] = src[strided_i];  \
-         shared_indices[tid] = idx % dims[num_dims - 1]; \
-         notset = false; \
-     }  \
-     idx += block_dim;  \
-   }  \
-       \
-   threadgroup_barrier(mem_flags::mem_none);  \
-     \
-   /*  \
-   // reduction in shared memory  \
-   */  \
-   for (uint s = block_dim / 2; s > 0; s >>= 1) {  \
-       if (tid < s && shared_memory[tid + s] > shared_memory[tid]) {  \
-           shared_indices[tid] = shared_indices[tid + s];  \
-           shared_memory[tid] = shared_memory[tid + s];  \
-       }  \
-       threadgroup_barrier(mem_flags::mem_none);  \
-   }  \
-     \
-   if (tid == 0){ \
-       dst[dst_id] = shared_indices[0];  \
-   } \
-} \
-
-#define REDUCE(FN, NAME, T, START) \
-kernel void NAME( \
-    constant size_t &num_dims, \
-    constant size_t *dims, \
-    constant size_t *strides, \
-    constant size_t &el_to_sum_per_block, \
-    device const T *src,  \
-    device T *dst, \
+    device const TYPENAME *src,  \
+    device TYPENAME *dst, \
    uint id [[ thread_position_in_grid ]], \
    uint tid [[ thread_index_in_threadgroup ]], \
    uint dst_id [[ threadgroup_position_in_grid ]], \
-    uint block_dim [[ threads_per_threadgroup ]] \
+    uint blockDim [[ threads_per_threadgroup ]] \
 ) { \
     \
-   threadgroup T shared_memory[THREADGROUP_SIZE]; \
+   threadgroup float shared_memory[THREADGROUP_SIZE]; \
      \
-   shared_memory[tid] = START; \
+   shared_memory[tid] = 0; \
   /* \
   // Elements summed in this block range from dst_id * el_to_sum_per_block  \
   // to (dst_id + 1) * el_to_sum_per_block. \
   */ \
   size_t start_idx = dst_id * el_to_sum_per_block; \
-   size_t stop_idx = start_idx + el_to_sum_per_block; \
+   size_t stop_idx = min(start_idx + el_to_sum_per_block, src_numel); \
   size_t idx = start_idx + tid; \
   while (idx < stop_idx) { \
     /* \
     // TODO: Fast version for the contiguous case. \
+     // size_t strided_i = get_strided_index(idx, num_dims, dims, strides); \
     */ \
-     size_t strided_i = get_strided_index(idx, num_dims, dims, strides); \
-     T x = shared_memory[tid]; \
-     T y = src[strided_i]; \
+     TYPENAME x = shared_memory[tid]; \
+     TYPENAME y = src[idx]; \
     shared_memory[tid] = FN; \
-     idx += block_dim; \
+     idx += blockDim; \
   } \
      \
   threadgroup_barrier(mem_flags::mem_threadgroup); \
@ -180,10 +56,10 @@ kernel void NAME( \
   /* \
   // reduction in shared memory \
   */ \
-   for (uint s = block_dim / 2; s > 0; s >>= 1) { \
+   for (uint s = blockDim / 2; s > 0; s >>= 1) { \
       if (tid < s) { \
-           T x = shared_memory[tid]; \
-           T y = shared_memory[tid + s]; \
+           TYPENAME x = shared_memory[tid]; \
+           TYPENAME y = shared_memory[tid + s]; \
           shared_memory[tid] = FN; \
       } \
       threadgroup_barrier(mem_flags::mem_threadgroup); \
@ -192,115 +68,72 @@ kernel void NAME( \
   dst[dst_id] = shared_memory[0]; \
 } \

+kernel void softmax_float(
+    constant size_t &src_numel,
+    constant size_t &el_to_sum_per_block,
+    device const float *src, 
+    device float *dst,
+    uint id [[ thread_position_in_grid ]],
+    uint tid [[ thread_index_in_threadgroup ]],
+    uint dst_id [[ threadgroup_position_in_grid ]],
+    uint blockDim [[ threads_per_threadgroup ]]
+) {
+    
+   threadgroup float shared_memory[THREADGROUP_SIZE];
+     
+   shared_memory[tid] = -INFINITY;
+   // Elements summed in this block range from dst_id * el_to_sum_per_block
+   // to (dst_id + 1) * el_to_sum_per_block.
+   size_t start_idx = dst_id * el_to_sum_per_block;
+   size_t stop_idx = min(start_idx + el_to_sum_per_block, src_numel);
+   size_t idx = start_idx + tid;

-#define SOFTMAX(NAME, T)                                                          \
-kernel void NAME(                                                                 \
-    constant size_t &src_numel,                                                   \
-    constant size_t &el_to_sum_per_block,                                         \
-    device const T *src,                                                          \
-    device T *dst,                                                                \
-                                                                                  \
-    uint id [[ thread_position_in_grid ]],                                        \
-    uint tid [[ thread_index_in_threadgroup ]],                                   \
-    uint dst_id [[ threadgroup_position_in_grid ]],                               \
-    uint block_dim [[ threads_per_threadgroup ]]                                  \
-) {                                                                               \
-    threadgroup float shared_memory[THREADGROUP_SIZE];                                \
-    shared_memory[tid] = -INFINITY;                                            \
-    size_t start_idx = dst_id * el_to_sum_per_block;                              \
-    size_t stop_idx = min(start_idx + el_to_sum_per_block, src_numel);            \
-    size_t idx = start_idx + tid;                                                 \
-                                                                                  \
-                                                                                  \
-    float tmp = -INFINITY; \
-    while (idx < stop_idx) {                                                      \
-        tmp = MAX(tmp, float(src[idx]));                   \
-        idx += block_dim;                                                         \
-    }                                                                             \
-    shared_memory[tid] = tmp; \
-                                                                                  \
-    threadgroup_barrier(mem_flags::mem_threadgroup);                              \
-                                                                                  \
-    for (uint s = block_dim / 2; s > 0; s >>= 1) {                                \
-        if (tid < s) {                                                            \
-            shared_memory[tid] = MAX(shared_memory[tid], shared_memory[tid + s]); \
-        }                                                                         \
-        threadgroup_barrier(mem_flags::mem_threadgroup);                              \
-    }                                                                             \
-                                                                                  \
-    /* wait for shared_memory[0] to be filled */ \
-    threadgroup_barrier(mem_flags::mem_threadgroup);                              \
-                                                                                  \
-    float _max = shared_memory[0];                                                    \
-                                                                                  \
-    /* prevent tid=0 from overwriting _max before other threads have written it */ \
-    threadgroup_barrier(mem_flags::mem_threadgroup);                              \
-    shared_memory[tid] = 0;                                                       \
-                                                                                  \
-    idx = start_idx + tid;                                                        \
-    while (idx < stop_idx) {                                                      \
-        const float val = exp(float(src[idx]) - _max);                                    \
-        dst[idx] = T(val);                                                           \
-        shared_memory[tid] += val;                                                \
-        idx += block_dim;                                                         \
-    }                                                                             \
-    threadgroup_barrier(mem_flags::mem_threadgroup);                              \
-    for (uint s = block_dim / 2; s > 0; s >>= 1) {                                \
-        if (tid < s) {                                                            \
-            shared_memory[tid] += shared_memory[tid + s];                         \
-        }                                                                         \
-        threadgroup_barrier(mem_flags::mem_threadgroup);                              \
-    }                                                                             \
-                                                                                  \
-    const T inv_acc = T(1.0/shared_memory[0]);                                         \
-    idx = start_idx + tid;                                                        \
-    while (idx < stop_idx) {                                                      \
-        dst[idx] *= inv_acc;                                                      \
-        idx += block_dim;                                                         \
-    }                                                                             \
-}                                                                                 \
+   while (idx < stop_idx) {
+     // TODO: Fast version for the contiguous case.
+     shared_memory[tid] = max(shared_memory[tid], src[idx]);
+     idx += blockDim;
+   }
+     
+   threadgroup_barrier(mem_flags::mem_threadgroup);
+   
+   // reduction in shared memory
+   for (uint s = blockDim / 2; s > 0; s >>= 1) {
+       if (tid < s) {
+           shared_memory[tid] = max(shared_memory[tid], shared_memory[tid + s]);
+       }
+       threadgroup_barrier(mem_flags::mem_threadgroup);
+   }
+   
+   float max = shared_memory[0];

-REDUCE(x + y, fast_sum_f32_strided, float, 0)
-REDUCE(x + y, fast_sum_u32_strided, uint, 0)
-REDUCE(x + y, fast_sum_f16_strided, half, 0)
-REDUCE(x + y, fast_sum_u8_strided, uint8_t, 0)
-REDUCE(x * y, fast_mul_f32_strided, float, 1)
-REDUCE(x * y, fast_mul_u32_strided, uint, 1)
-REDUCE(x * y, fast_mul_f16_strided, half, 1)
-REDUCE(MAX(x, y), fast_max_f32_strided, float, -HUGE_VALF)
-REDUCE(MAX(x, y), fast_max_u32_strided, uint, 0)
-REDUCE(MAX(x, y), fast_max_f16_strided, half, -HUGE_VALH)
-REDUCE(MAX(x, y), fast_max_u8_strided, uint8_t, 0)
-REDUCE(MIN(x, y), fast_min_f32_strided, float, HUGE_VALF)
-REDUCE(MIN(x, y), fast_min_u32_strided, uint, 0xFFFFFFFF)
-REDUCE(MIN(x, y), fast_min_f16_strided, half, HUGE_VALH)
-REDUCE(MIN(x, y), fast_min_u8_strided, uint8_t, 0xFF)
-ARGMIN(fast_argmin_f32_strided, float, HUGE_VALF)
-ARGMIN(fast_argmin_f16_strided, half, HUGE_VALH)
-ARGMIN(fast_argmin_u32_strided, uint, 0xFFFFFFFF)
-ARGMIN(fast_argmin_u8_strided, uint8_t, 0xFF)
-ARGMAX(fast_argmax_f32_strided, float, -HUGE_VALF)
-ARGMAX(fast_argmax_f16_strided, half, -HUGE_VALH)
-ARGMAX(fast_argmax_u32_strided, uint, 0)
-ARGMAX(fast_argmax_u8_strided, uint8_t, 0)
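+   // Wait until every thread has read the maximum before the buffer is reused.
+   threadgroup_barrier(mem_flags::mem_threadgroup);
+   shared_memory[tid] = 0;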

-SOFTMAX(softmax_f32, float)
-SOFTMAX(softmax_f16, half)
+   // Second pass: restart from the beginning of the block to compute exp(x - max).
+   idx = start_idx + tid;
+   while (idx < stop_idx) {
+     // TODO: Fast version for the contiguous case.
+     const float val = exp(src[idx] - max);
+     dst[idx] = val; 
+     shared_memory[tid] += val;
+     idx += blockDim;
+   }
+   // Make every partial sum visible before the reduction in shared memory.
+   threadgroup_barrier(mem_flags::mem_threadgroup);
+   for (uint s = blockDim / 2; s > 0; s >>= 1) {
+       if (tid < s) {
+           shared_memory[tid] += shared_memory[tid + s];
+       }
+       threadgroup_barrier(mem_flags::mem_threadgroup);
+   }

-#if __METAL_VERSION__ >= 220
-REDUCE(x + y, fast_sum_i64_strided, int64_t, 0)
-REDUCE(MIN(x, y), fast_min_i64_strided, int64_t, INT_MAX)
-REDUCE(MAX(x, y), fast_max_i64_strided, int64_t, INT_MIN)
-ARGMIN(fast_argmin_i64_strided, int64_t, INT_MAX)
-ARGMAX(fast_argmax_i64_strided, int64_t, INT_MIN)
-#endif
+   const float inv_acc = 1/shared_memory[0];
+   idx = start_idx + tid;
+   while (idx < stop_idx) {
+     dst[idx] *= inv_acc; 
+     idx += blockDim;
+   }
+}

-#if __METAL_VERSION__ >= 310
-REDUCE(x + y, fast_sum_bf16, bfloat, 0)
-REDUCE(x * y, fast_mul_bf16, bfloat, 1)
-REDUCE(MAX(x, y), fast_max_bf16, bfloat, -HUGE_VALBF)
-REDUCE(MIN(x, y), fast_min_bf16, bfloat, HUGE_VALBF)
-ARGMIN(fast_argmin_bf16, bfloat, HUGE_VALBF)
-ARGMAX(fast_argmax_bf16, bfloat, -HUGE_VALBF)
-SOFTMAX(softmax_bf16, bfloat)
-#endif
+
+REDUCE(x + y, fast_sum_float, float)
+REDUCE(x * y, fast_mul_float, float)
+REDUCE(max(x, y), fast_max_float, float)
--- a/candle-metal-kernels/src/ternary.metal
+++ b/candle-metal-kernels/src/ternary.metal
@ -32,9 +32,6 @@ kernel void FN_NAME(  \
    device TYPENAME *out ,\
    uint i [[ thread_position_in_grid ]] \
 ) {  \
-   if (i >= numel){ \
-       return; \
-   } \
   uint strided_i = get_strided_index(i, num_dims, dims, strides); \
   uint strided_i_t = get_strided_index(i, num_dims, dims, strides_t); \
   uint strided_i_f = get_strided_index(i, num_dims, dims, strides_f); \
@ -55,9 +52,6 @@ kernel void FN_NAME(  \

 WHERE_OP(float, uint8_t, where_u8_f32)
 // WHERE_OP(double, uint8_t, where_u8_f64)
-WHERE_OP(uint8_t, uint8_t, where_u8_u8)
-WHERE_OP(uint32_t, uint8_t, where_u8_u32)
-
-#if __METAL_VERSION__ >= 220
-WHERE_OP(int64_t, uint8_t, where_u8_i64)
-#endif
+// WHERE_OP(uint8_t, uint8_t, where_u8_u8)
+// WHERE_OP(uint32_t, uint8_t, where_u8_u32)
+// WHERE_OP(int64_t, uint8_t, where_u8_i64)
--- a/candle-metal-kernels/src/tests.rs
+++ b/candle-metal-kernels/src/tests.rs
@ -1,808 +0,0 @@
-use super::*;
-use half::{bf16, f16};
-use metal::{Device, MTLResourceOptions};
-
-fn read_to_vec<T: Clone>(buffer: &Buffer, n: usize) -> Vec<T> {
-    let ptr = buffer.contents() as *const T;
-    assert!(!ptr.is_null());
-    let slice = unsafe { std::slice::from_raw_parts(ptr, n) };
-    slice.to_vec()
-}
-
-fn new_buffer<T>(device: &Device, data: &[T]) -> Buffer {
-    let options = MTLResourceOptions::StorageModeManaged;
-    let ptr = data.as_ptr() as *const core::ffi::c_void;
-    let size = (data.len() * std::mem::size_of::<T>()) as u64;
-    device.new_buffer_with_data(ptr, size, options)
-}
-
-fn device() -> Device {
-    Device::system_default().unwrap()
-}
-
-fn approx(v: Vec<f32>, digits: i32) -> Vec<f32> {
-    let b = 10f32.powi(digits);
-    v.iter().map(|t| f32::round(t * b) / b).collect()
-}
-
-fn approx_f16(v: Vec<f16>, digits: i32) -> Vec<f32> {
-    let b = 10f32.powi(digits);
-    v.iter().map(|t| f32::round(t.to_f32() * b) / b).collect()
-}
-
-fn approx_bf16(v: Vec<bf16>, digits: i32) -> Vec<f32> {
-    let b = 10f32.powi(digits);
-    v.iter().map(|t| f32::round(t.to_f32() * b) / b).collect()
-}
-
-fn run<T: Clone>(v: &[T], name: unary::contiguous::Kernel) -> Vec<T> {
-    let device = device();
-    let fence = device.new_fence();
-    let kernels = Kernels::new(fence);
-    let command_queue = device.new_command_queue();
-    let command_buffer = command_queue.new_command_buffer();
-    let input = new_buffer(&device, v);
-    let output = new_buffer(&device, v);
-    call_unary_contiguous(
-        &device,
-        command_buffer,
-        &kernels,
-        name,
-        v.len(),
-        &input,
-        &output,
-    )
-    .unwrap();
-    command_buffer.commit();
-    command_buffer.wait_until_completed();
-    read_to_vec(&output, v.len())
-}
-
-fn run_binary<T: Clone>(x: &[T], y: &[T], name: binary::contiguous::Kernel) -> Vec<T> {
-    let device = device();
-    let fence = device.new_fence();
-    let kernels = Kernels::new(fence);
-    let command_queue = device.new_command_queue();
-    let command_buffer = command_queue.new_command_buffer();
-    let options = MTLResourceOptions::StorageModeManaged;
-    let left = new_buffer(&device, x);
-    let right = new_buffer(&device, y);
-    let output = device.new_buffer(std::mem::size_of_val(x) as u64, options);
-    call_binary_contiguous(
-        &device,
-        command_buffer,
-        &kernels,
-        name,
-        x.len(),
-        &left,
-        &right,
-        &output,
-    )
-    .unwrap();
-    command_buffer.commit();
-    command_buffer.wait_until_completed();
-    read_to_vec(&output, x.len())
-}
-
-fn run_strided<T: Clone>(
-    v: &[T],
-    kernel: unary::strided::Kernel,
-    shape: &[usize],
-    strides: &[usize],
-    offset: usize,
-) -> Vec<T> {
-    let device = device();
-    let command_queue = device.new_command_queue();
-    let command_buffer = command_queue.new_command_buffer();
-    let input = new_buffer(&device, v);
-    let output = new_buffer(&device, v);
-    let fence = device.new_fence();
-    let kernels = Kernels::new(fence);
-    call_unary_strided(
-        &device,
-        command_buffer,
-        &kernels,
-        kernel,
-        shape,
-        &input,
-        strides,
-        offset,
-        &output,
-        0,
-    )
-    .unwrap();
-    command_buffer.commit();
-    command_buffer.wait_until_completed();
-    read_to_vec(&output, v.len())
-}
-
-#[test]
-fn cos_f32() {
-    let v = vec![1.0f32, 2.0, 3.0];
-    let results = run(&v, unary::contiguous::cos::FLOAT);
-    let expected: Vec<_> = v.iter().map(|v| v.cos()).collect();
-    assert_eq!(approx(results, 4), vec![0.5403, -0.4161, -0.99]);
-    assert_eq!(approx(expected, 4), vec![0.5403, -0.4161, -0.99]);
-
-    let v = vec![1.0f32; 10_000];
-    let results = run(&v, unary::contiguous::cos::FLOAT);
-    let expected: Vec<_> = v.iter().map(|v| v.cos()).collect();
-    assert_eq!(approx(results, 4), vec![0.5403; 10_000]);
-    assert_eq!(approx(expected, 4), vec![0.5403; 10_000]);
-}
-
-#[test]
-fn cos_f32_strided() {
-    let v = vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
-    let shape = vec![6];
-    let strides = vec![1];
-    let offset = 0;
-    let results = run_strided(&v, unary::strided::cos::FLOAT, &shape, &strides, offset);
-    let expected: Vec<_> = v.iter().map(|v| v.cos()).collect();
-    assert_eq!(
-        approx(results, 4),
-        vec![0.5403, -0.4161, -0.99, -0.6536, 0.2837, 0.9602]
-    );
-    assert_eq!(
-        approx(expected, 4),
-        vec![0.5403, -0.4161, -0.99, -0.6536, 0.2837, 0.9602]
-    );
-
-    // Contiguous
-    let v = vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
-    let shape = vec![3, 2];
-    let strides = vec![2, 1];
-    let offset = 0;
-    let results = run_strided(&v, unary::strided::cos::FLOAT, &shape, &strides, offset);
-    let expected: Vec<_> = v.iter().map(|v| v.cos()).collect();
-    assert_eq!(
-        approx(results, 4),
-        vec![0.5403, -0.4161, -0.99, -0.6536, 0.2837, 0.9602]
-    );
-    assert_eq!(
-        approx(expected, 4),
-        vec![0.5403, -0.4161, -0.99, -0.6536, 0.2837, 0.9602]
-    );
-
-    // Transposed
-    let v = vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
-    let shape = vec![3, 2];
-    let strides = vec![1, 3];
-    let offset = 0;
-    let results = run_strided(&v, unary::strided::cos::FLOAT, &shape, &strides, offset);
-    let expected: Vec<_> = v.iter().map(|v| v.cos()).collect();
-    assert_eq!(
-        approx(results, 4),
-        vec![0.5403, -0.6536, -0.4161, 0.2837, -0.99, 0.9602]
-    );
-    assert_eq!(
-        approx(expected, 4),
-        vec![0.5403, -0.4161, -0.99, -0.6536, 0.2837, 0.9602]
-    );
-
-    // Very large
-    let v = vec![1.0f32; 10_000];
-    let shape = vec![2, 5_000];
-    let strides = vec![2, 1];
-    let offset = 0;
-    let results = run_strided(&v, unary::strided::cos::FLOAT, &shape, &strides, offset);
-    let expected: Vec<_> = v.iter().map(|v| v.cos()).collect();
-    assert_eq!(approx(results, 4), vec![0.5403; 10_000]);
-    assert_eq!(approx(expected, 4), vec![0.5403; 10_000]);
-}
-
-#[test]
-fn cos_strided_random() {
-    let v: Vec<_> = (0..10_000).map(|_| rand::random::<f32>()).collect();
-    let shape = vec![5_000, 2];
-    let strides = vec![1, 5_000];
-    let offset = 0;
-    let results = run_strided(&v, unary::strided::cos::FLOAT, &shape, &strides, offset);
-    let expected: Vec<_> = v.iter().map(|v| v.cos()).collect();
-    assert_eq!(approx(vec![results[0]], 4), approx(vec![expected[0]], 4));
-    assert_eq!(
-        approx(vec![results[1]], 4),
-        approx(vec![expected[5_000]], 4)
-    );
-    assert_eq!(approx(vec![results[2]], 4), approx(vec![expected[1]], 4));
-    assert_eq!(
-        approx(vec![results[3]], 4),
-        approx(vec![expected[5_001]], 4)
-    );
-    assert_eq!(
-        approx(vec![results[5_000]], 4),
-        approx(vec![expected[2_500]], 4)
-    );
-}
-
-#[test]
-fn gelu_f16() {
-    let v: Vec<f16> = [-10f32, -1.0, 0., 1., 2., 3., 10.0, 20.0]
-        .iter()
-        .map(|v| f16::from_f32(*v))
-        .collect();
-    let expected: Vec<f32> = vec![-0.0, -0.16, 0.0, 0.84, 1.96, 3.0, 10.0, 20.0];
-    let results = run(&v, unary::contiguous::gelu::HALF);
-    assert_eq!(approx_f16(results, 2), expected);
-}
-
-#[test]
-fn gelu_f32() {
-    let v: Vec<f32> = vec![-10f32, -1.0, 0., 1., 2., 3., 10.0, 20.0];
-    let expected: Vec<f32> = vec![-0.0, -0.159, 0.0, 0.841, 1.955, 2.996, 10.0, 20.0];
-    let results = run(&v, unary::contiguous::gelu::FLOAT);
-    assert_eq!(approx(results, 3), expected);
-}
-
-#[test]
-fn binary_add_f32() {
-    let left = vec![1.0f32, 2.0, 3.0];
-    let right = vec![2.0f32, 3.1, 4.2];
-    let results = run_binary(&left, &right, binary::contiguous::add::FLOAT);
-    let expected: Vec<_> = left
-        .iter()
-        .zip(right.iter())
-        .map(|(&x, &y)| x + y)
-        .collect();
-    assert_eq!(approx(results, 4), vec![3.0f32, 5.1, 7.2]);
-    assert_eq!(approx(expected, 4), vec![3.0f32, 5.1, 7.2]);
-}
-
-fn cast<T: Clone, U: Clone>(v: &[T], name: &'static str) -> Vec<U> {
-    let device = device();
-    let fence = device.new_fence();
-    let kernels = Kernels::new(fence);
-    let command_queue = device.new_command_queue();
-    let command_buffer = command_queue.new_command_buffer();
-    let input = new_buffer(&device, v);
-    let options = MTLResourceOptions::StorageModeManaged;
-    let size = (v.len() * std::mem::size_of::<U>()) as u64;
-    let output = device.new_buffer(size, options);
-
-    call_cast_contiguous(
-        &device,
-        command_buffer,
-        &kernels,
-        name,
-        v.len(),
-        &input,
-        0,
-        &output,
-    )
-    .unwrap();
-    command_buffer.commit();
-    command_buffer.wait_until_completed();
-    read_to_vec(&output, v.len())
-}
-
-#[test]
-fn cast_u32_f32() {
-    let v = vec![1u32, 2, 3];
-    let results = cast(&v, "cast_u32_f32");
-    let expected: Vec<_> = v.iter().map(|&v| v as f32).collect();
-    assert_eq!(approx(results, 4), vec![1.0f32, 2.0, 3.0]);
-    assert_eq!(approx(expected, 4), vec![1.0f32, 2.0, 3.0]);
-
-    let v = vec![1.0f32, 2.0, 3.0];
-    let input: Vec<f16> = v.iter().map(|v| f16::from_f32(*v)).collect();
-    let results: Vec<f32> = cast(&input, "cast_f16_f32");
-    assert_eq!(results, vec![1.0f32, 2.0, 3.0]);
-
-    let v = vec![1.0f32; 10_000];
-    let input: Vec<f16> = v.iter().map(|v| f16::from_f32(*v)).collect();
-    let results: Vec<f32> = cast(&input, "cast_f16_f32");
-    assert_eq!(results.len(), 10_000);
-    assert_eq!(&results[..10], vec![1.0f32; 10]);
-    assert_eq!(results, vec![1.0f32; 10_000]);
-}
-
-fn run_affine<T: Clone>(v: &[T], mul: f64, add: f64) -> Vec<T> {
-    let device = device();
-    let fence = device.new_fence();
-    let kernels = Kernels::new(fence);
-    let command_queue = device.new_command_queue();
-    let command_buffer = command_queue.new_command_buffer();
-
-    let input = new_buffer(&device, v);
-    let output = new_buffer(&device, v);
-
-    let size = v.len();
-
-    call_affine(
-        &device,
-        command_buffer,
-        &kernels,
-        "affine_f32",
-        size,
-        &input,
-        &output,
-        mul as f32,
-        add as f32,
-    )
-    .unwrap();
-    command_buffer.commit();
-    command_buffer.wait_until_completed();
-
-    read_to_vec(&output, v.len())
-}
-
-fn run_affine_strided<T: Clone>(
-    v: &[T],
-    shape: &[usize],
-    strides: &[usize],
-    mul: f64,
-    add: f64,
-) -> Vec<T> {
-    let device = device();
-    let fence = device.new_fence();
-    let kernels = Kernels::new(fence);
-    let command_queue = device.new_command_queue();
-    let command_buffer = command_queue.new_command_buffer();
-
-    let input = new_buffer(&device, v);
-    let output = new_buffer(&device, v);
-
-    call_affine_strided(
-        &device,
-        command_buffer,
-        &kernels,
-        "affine_f32_strided",
-        shape,
-        &input,
-        strides,
-        0,
-        &output,
-        mul as f32,
-        add as f32,
-    )
-    .unwrap();
-    command_buffer.commit();
-    command_buffer.wait_until_completed();
-
-    let len: usize = shape.iter().product();
-    read_to_vec(&output, len)
-}
-
-#[test]
-fn affine() {
-    let input = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
-    let mul = 1.5;
-    let add = 1.1;
-    let result = run_affine(&input, mul, add);
-    assert_eq!(result, vec![2.6, 4.1, 5.6, 7.1, 8.6, 10.1, 11.6, 13.1]);
-
-    let input = [1.0f32; 40_000];
-    let mul = 1.5;
-    let add = 1.1;
-    let result = run_affine(&input, mul, add);
-    assert_eq!(result, vec![2.6; 40_000]);
-}
-
-#[test]
-fn affine_strided() {
-    let input = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
-    let mul = 1.5;
-    let add = 1.1;
-    let shape = [4];
-    let strides = [2];
-    let result = run_affine_strided(&input, &shape, &strides, mul, add);
-    // 1 on 2
-    assert_eq!(result, vec![2.6, 5.6, 8.6, 11.6]);
-}
-
-#[test]
-fn index_select() {
-    let embedding = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0];
-    let shape = [5, 2];
-    let ids = [0u32, 4, 2];
-    let dim = 0;
-    let result = run_index_select(&embedding, &shape, &ids, dim);
-    assert_eq!(result, vec![1.0f32, 2.0, 9.0, 10.0, 5.0, 6.0]);
-
-    let embedding = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0];
-    let shape = [2, 5];
-    let ids = [0u32, 1, 0];
-    let dim = 0;
-    let result = run_index_select(&embedding, &shape, &ids, dim);
-    assert_eq!(
-        result,
-        vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 1.0f32, 2.0, 3.0, 4.0, 5.0]
-    );
-}
-
-#[test]
-fn index_select_f16() {
-    let embedding: Vec<_> = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
-        .into_iter()
-        .map(|x| f16::from_f32(x))
-        .collect();
-    let shape = [5, 2];
-    let ids = [0u32, 4, 2];
-    let dim = 0;
-    let result = run_index_select(&embedding, &shape, &ids, dim);
-    assert_eq!(
-        approx_f16(result, 4),
-        vec![1.0f32, 2.0, 9.0, 10.0, 5.0, 6.0]
-    );
-}
-
-#[test]
-fn index_select_dim1() {
-    let embedding = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0];
-    let shape = [5, 2];
-    let ids = [0u32, 1, 0];
-    let dim = 1;
-    let result = run_index_select(&embedding, &shape, &ids, dim);
-    assert_eq!(
-        result,
-        vec![1.0f32, 2.0, 1.0, 3.0, 4.0, 3.0, 5.0, 6.0, 5.0, 7.0, 8.0f32, 7.0, 9.0, 10.0, 9.0]
-    );
-}
-
-fn run_index_select<T: Clone, I: Clone + std::fmt::Debug>(
-    embeddings: &[T],
-    shape: &[usize],
-    ids: &[I],
-    dim: usize,
-) -> Vec<T> {
-    let device = Device::system_default().expect("no device found");
-
-    let command_queue = device.new_command_queue();
-    let command_buffer = command_queue.new_command_buffer();
-    let embeddings_buffer = new_buffer(&device, &embeddings);
-    let ids_buffer = new_buffer(&device, &ids);
-
-    let left_size: usize = shape[..dim].iter().product();
-    let right_size: usize = shape[dim + 1..].iter().product();
-    let dst_el = ids.len() * left_size * right_size;
-    let dst_buffer = new_buffer(&device, &vec![0.0f32; dst_el]);
-
-    let name = match core::mem::size_of::<T>() {
-        4 => "is_u32_f32",
-        2 => "is_u32_f16",
-        _ => unimplemented!(),
-    };
-
-    let fence = device.new_fence();
-    let kernels = Kernels::new(fence);
-    call_index_select(
-        &device,
-        &command_buffer,
-        &kernels,
-        name,
-        shape,
-        ids.len(),
-        dim,
-        &embeddings_buffer,
-        &ids_buffer,
-        &dst_buffer,
-    )
-    .unwrap();
-
-    command_buffer.commit();
-    command_buffer.wait_until_completed();
-
-    read_to_vec(&dst_buffer, dst_el)
-}
-
-#[test]
-fn cos_f16() {
-    let v: Vec<f16> = [1.0f32, 2.0, 3.0]
-        .iter()
-        .map(|v| f16::from_f32(*v))
-        .collect();
-    let results = run(&v, unary::contiguous::cos::HALF);
-    let expected: Vec<f16> = v.iter().map(|v| f16::from_f32(v.to_f32().cos())).collect();
-    assert_eq!(approx_f16(results, 2), vec![0.54, -0.42, -0.99]);
-    assert_eq!(approx_f16(expected, 2), vec![0.54, -0.42, -0.99]);
-}
-
-fn run_reduce<T: Clone>(v: &[T], out_length: usize, name: &'static str) -> Vec<T> {
-    let device = device();
-    let fence = device.new_fence();
-    let kernels = Kernels::new(fence);
-    let command_queue = device.new_command_queue();
-    let command_buffer = command_queue.new_command_buffer();
-    let input = new_buffer(&device, v);
-
-    let options = MTLResourceOptions::StorageModeManaged;
-    let output = device.new_buffer((out_length * core::mem::size_of::<T>()) as u64, options);
-    let dims = vec![v.len()];
-    let strides = vec![1];
-    call_reduce_strided(
-        &device,
-        command_buffer,
-        &kernels,
-        name,
-        &dims,
-        &strides,
-        out_length,
-        &input,
-        0,
-        &output,
-    )
-    .unwrap();
-    command_buffer.commit();
-    command_buffer.wait_until_completed();
-
-    read_to_vec(&output, out_length)
-}
-
-fn run_softmax<T: Clone + std::fmt::Debug>(v: &[T], last_dim: usize, name: &'static str) -> Vec<T> {
-    let device = device();
-    let fence = device.new_fence();
-    let kernels = Kernels::new(fence);
-    let command_queue = device.new_command_queue();
-    let command_buffer = command_queue.new_command_buffer();
-    let input = new_buffer(&device, v);
-    let output = new_buffer(&device, v);
-    call_last_softmax(
-        &device,
-        command_buffer,
-        &kernels,
-        name,
-        v.len(),
-        last_dim,
-        &input,
-        0,
-        &output,
-    )
-    .unwrap();
-    command_buffer.commit();
-    command_buffer.wait_until_completed();
-
-    read_to_vec(&output, v.len())
-}
-
-#[test]
-fn reduce_sum() {
-    let v = vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
-    let out_length = 1;
-
-    let results = run_reduce(&v, out_length, "fast_sum_f32_strided");
-    assert_eq!(approx(results, 4), vec![21.0]);
-}
-
-#[test]
-fn reduce_sum2() {
-    let v = vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
-    let out_length = 2;
-
-    let results = run_reduce(&v, out_length, "fast_sum_f32_strided");
-    assert_eq!(approx(results, 4), vec![6.0, 15.0]);
-}
-
-#[test]
-fn softmax() {
-    let v = vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
-    let last_dim = 6;
-    let results = run_softmax(&v, last_dim, "softmax_f32");
-    assert_eq!(
-        approx(results, 4),
-        vec![0.0043, 0.0116, 0.0315, 0.0858, 0.2331, 0.6337]
-    );
-
-    let last_dim = 4096;
-    let n = 200;
-    let mut v = vec![0.0; n * last_dim];
-    for i in 0..n {
-        v[i * last_dim] = 20.0;
-    }
-    let results = run_softmax(&v, last_dim, "softmax_f32");
-    let results = approx(results, 4);
-    println!("{results:?}");
-    assert_eq!(
-        results.iter().map(|&s| s.round() as usize).sum::<usize>(),
-        n
-    );
-    assert_eq!(results[0], 1.0);
-    assert_eq!(results[1], 0.0);
-    assert_eq!(results[last_dim], 1.0);
-    assert_eq!(results[2 * last_dim], 1.0);
-
-    let v = vec![0.0f32, 1.0, 2.0, 3.0, 4.0, 5.0];
-    let last_dim = 6;
-    let results = run_softmax(&v, last_dim, "softmax_f32");
-    assert_eq!(
-        approx(results, 4),
-        vec![0.0043, 0.0116, 0.0315, 0.0858, 0.2331, 0.6337]
-    );
-
-    let v = vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
-    let last_dim = 3;
-    let results = run_softmax(&v, last_dim, "softmax_f32");
-    assert_eq!(
-        approx(results, 4),
-        vec![0.0900, 0.2447, 0.6652, 0.0900, 0.2447, 0.6652]
-    );
-
-    let v = vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0]
-        .iter()
-        .map(|v| f16::from_f32(*v))
-        .collect::<Vec<_>>();
-    let last_dim = 6;
-    let results = run_softmax(&v, last_dim, "softmax_f16");
-    assert_eq!(
-        approx_f16(results, 4),
-        vec![0.0043, 0.0116, 0.0316, 0.0858, 0.2332, 0.6338]
-    );
-
-    let v = vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0]
-        .iter()
-        .map(|v| bf16::from_f32(*v))
-        .collect::<Vec<_>>();
-    let last_dim = 6;
-    let results = run_softmax(&v, last_dim, "softmax_bf16");
-    assert_eq!(
-        approx_bf16(results, 4),
-        vec![0.0043, 0.0116, 0.0315, 0.0859, 0.2324, 0.6328]
-    );
-}
-
-fn run_where_cond<I: Clone, T: Clone>(
-    shape: &[usize],
-    cond: &[I],
-    (cond_stride, cond_offset): (Vec<usize>, usize),
-    left_true: &[T],
-    (left_stride, left_offset): (Vec<usize>, usize),
-    right_false: &[T],
-    (_right_stride, _right_offset): (Vec<usize>, usize),
-    name: &'static str,
-) -> Vec<T> {
-    let device = device();
-    let fence = device.new_fence();
-    let kernels = Kernels::new(fence);
-    let command_queue = device.new_command_queue();
-    let command_buffer = command_queue.new_command_buffer();
-    let options = MTLResourceOptions::StorageModeManaged;
-
-    let length = cond.len();
-    let cond = device.new_buffer_with_data(
-        cond.as_ptr() as *const core::ffi::c_void,
-        std::mem::size_of_val(cond) as u64,
-        options,
-    );
-    let left = device.new_buffer_with_data(
-        left_true.as_ptr() as *const core::ffi::c_void,
-        (length * core::mem::size_of::<T>()) as u64,
-        options,
-    );
-    let right = device.new_buffer_with_data(
-        right_false.as_ptr() as *const core::ffi::c_void,
-        (length * core::mem::size_of::<T>()) as u64,
-        options,
-    );
-
-    let output = device.new_buffer((length * core::mem::size_of::<T>()) as u64, options);
-    call_where_cond_strided(
-        &device,
-        command_buffer,
-        &kernels,
-        name,
-        shape,
-        &cond,
-        (&cond_stride, cond_offset),
-        &left,
-        (&left_stride, left_offset),
-        &right,
-        (&cond_stride, cond_offset),
-        &output,
-    )
-    .unwrap();
-    command_buffer.commit();
-    command_buffer.wait_until_completed();
-
-    read_to_vec(&output, length)
-}
-
-#[test]
-fn where_cond() {
-    let shape = vec![6];
-    let cond = vec![0u8, 1, 0, 0, 1, 1];
-    let cond_l = (vec![1], 0);
-    let left_true = vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
-    let left_l = (vec![1], 0);
-    let right_false = vec![-1.0f32, -2.0, -3.0, -4.0, -5.0, -6.0];
-    let right_l = (vec![1], 0);
-    let results = run_where_cond(
-        &shape,
-        &cond,
-        cond_l,
-        &left_true,
-        left_l,
-        &right_false,
-        right_l,
-        "where_u8_f32",
-    );
-    assert_eq!(approx(results, 4), vec![-1.0f32, 2.0, -3.0, -4.0, 5.0, 6.0]);
-}
-
-fn run_gemm<T: Clone>(
-    (b, m, n, k): (usize, usize, usize, usize),
-    lhs: &[T],
-    lhs_stride: Vec<usize>,
-    lhs_offset: usize,
-    rhs: &[T],
-    rhs_stride: Vec<usize>,
-    rhs_offset: usize,
-) -> Vec<T> {
-    let device = device();
-    let fence = device.new_fence();
-    let kernels = Kernels::new(fence);
-    let command_queue = device.new_command_queue();
-    let command_buffer = command_queue.new_command_buffer();
-    let options = MTLResourceOptions::StorageModeManaged;
-
-    let lhs = device.new_buffer_with_data(
-        lhs.as_ptr() as *const core::ffi::c_void,
-        std::mem::size_of_val(lhs) as u64,
-        options,
-    );
-    let rhs = device.new_buffer_with_data(
-        rhs.as_ptr() as *const core::ffi::c_void,
-        std::mem::size_of_val(rhs) as u64,
-        options,
-    );
-    let length = b * m * n;
-    let output = device.new_buffer((length * core::mem::size_of::<T>()) as u64, options);
-    call_gemm(
-        &device,
-        command_buffer,
-        &kernels,
-        "sgemm",
-        (b, m, n, k),
-        &lhs_stride,
-        lhs_offset,
-        &lhs,
-        &rhs_stride,
-        rhs_offset,
-        &rhs,
-        &output,
-    )
-    .unwrap();
-    command_buffer.commit();
-    command_buffer.wait_until_completed();
-
-    read_to_vec(&output, length)
-}
-
-#[test]
-fn gemm() {
-    let (b, m, n, k) = (1, 2, 4, 3);
-    let lhs_stride = vec![m * k, k, 1];
-    let lhs: Vec<f32> = (0..b * m * k).map(|f| f as f32).collect();
-    let rhs_stride = vec![n * k, n, 1];
-    let rhs: Vec<f32> = (0..b * n * k).map(|f| f as f32).collect();
-    let results = run_gemm((b, m, n, k), &lhs, lhs_stride, 0, &rhs, rhs_stride, 0);
-    assert_eq!(
-        approx(results, 4),
-        vec![20.0, 23.0, 26.0, 29.0, 56.0, 68.0, 80.0, 92.0]
-    );
-
-    let (b, m, n, k) = (2, 2, 4, 3);
-    let lhs_stride = vec![m * k, k, 1];
-    let lhs: Vec<f32> = (0..b * m * k).map(|f| f as f32).collect();
-    let rhs_stride = vec![n * k, n, 1];
-    let rhs: Vec<f32> = (0..b * n * k).map(|f| f as f32).collect();
-    let results = run_gemm((b, m, n, k), &lhs, lhs_stride, 0, &rhs, rhs_stride, 0);
-    assert_eq!(
-        approx(results, 4),
-        vec![
-            20.0, 23.0, 26.0, 29.0, 56.0, 68.0, 80.0, 92.0, 344.0, 365.0, 386.0, 407.0, 488.0,
-            518.0, 548.0, 578.0
-        ]
-    );
-
-    // OFFSET
-    let (b, m, n, k) = (2, 2, 4, 3);
-    let lhs_stride = vec![m * k, k, 1];
-    let lhs: Vec<f32> = (0..b * m * k).map(|f| f as f32).collect();
-    let rhs_stride = vec![n * k, n, 1];
-    let rhs: Vec<f32> = (0..b * n * k).map(|f| f as f32).collect();
-    // Manually set batch_size=1 and offset 12 elements * 4 the number of bytes for f32
-    let results = run_gemm((1, m, n, k), &lhs, lhs_stride, 0, &rhs, rhs_stride, 12 * 4);
-    assert_eq!(
-        approx(results, 4),
-        vec![56.0, 59.0, 62.0, 65.0, 200.0, 212.0, 224.0, 236.0]
-    );
-}
--- a/candle-metal-kernels/src/unary.metal
+++ b/candle-metal-kernels/src/unary.metal
@ -19,9 +19,7 @@ METAL_FUNC uint get_strided_index(
 }

 template <typename T> METAL_FUNC T sqr(T in){ return in * in; }
-template <typename T> METAL_FUNC T recip(T in){ return T(1.0 / in); }
 template <typename T> METAL_FUNC T neg(T in){ return -in; }
-
 template <typename T> METAL_FUNC T erf(T in){
    float x = (float) in;
    // constants
@ -44,14 +42,9 @@ template <typename T> METAL_FUNC T erf(T in){

    return T(sign*y);
 }
-template <typename T> METAL_FUNC T id(T in) { return in; }
-template <typename T> METAL_FUNC T gelu_erf(T x) {
-    return T(x * (1 + erf(x * M_SQRT1_2_F)) / 2);
-}
-template <typename T> METAL_FUNC T gelu(T x) {
-    if (x > 5) {
-        return x;
-    }
+template <typename T> METAL_FUNC T id(T in){ return in; }
+template <typename T> METAL_FUNC T gelu_erf(T x){ return T(x * (1 + erf(x * M_SQRT1_2_F)) / 2); }
+template <typename T> METAL_FUNC T gelu(T x){
    T x_sq = x * x;
    T x_cube = x_sq * x;
    T alpha = x + static_cast<T>(0.044715) * x_cube;
@ -59,17 +52,19 @@ template <typename T> METAL_FUNC T gelu(T x) {
    return static_cast<T>(0.5) * x * (static_cast<T>(1.0) + T(tanh(beta)));
 }

+
+
 #define UNARY(FN, TYPENAME, FN_NAME, FN_NAME_STRIDED) \
 kernel void FN_NAME( \
    constant size_t &dim, \
    device const TYPENAME *input,  \
    device TYPENAME *output, \
-    uint tid [[ thread_position_in_grid ]] \
+    uint thread_position_in_grid [[ thread_position_in_grid ]] \
 ) { \
-    if (tid >= dim) { \
+    if (thread_position_in_grid >= dim) { \
        return; \
    } \
-    output[tid] = TYPENAME(FN(float(input[tid]))); \
+    output[thread_position_in_grid] = TYPENAME(FN(input[thread_position_in_grid])); \
 }\
 kernel void FN_NAME_STRIDED( \
    constant size_t &dim, \
@ -78,20 +73,20 @@ kernel void FN_NAME_STRIDED( \
    constant size_t *strides, \
    device const TYPENAME *input,  \
    device TYPENAME *output, \
-    uint tid [[ thread_position_in_grid ]] \
+    uint thread_position_in_grid [[ thread_position_in_grid ]] \
 ) { \
-    if (tid >= dim) { \
+    if (thread_position_in_grid >= dim) { \
        return; \
    } \
-    output[tid] = TYPENAME(FN(float(input[get_strided_index(tid, num_dims, dims, strides)]))); \
+    output[thread_position_in_grid] = TYPENAME(FN(input[get_strided_index(thread_position_in_grid, num_dims, dims, strides)])); \
 }

 #define UNARY_OP(NAME) \
-UNARY(NAME, float, NAME##_f32, NAME##_f32_strided); \
-UNARY(NAME, half, NAME##_f16, NAME##_f16_strided);
+UNARY(NAME, float, NAME##_float, NAME##_float_strided); \
+UNARY(NAME, half, NAME##_half, NAME##_half_strided);

 #define BFLOAT_UNARY_OP(NAME) \
-UNARY(NAME, bfloat, NAME##_bf16, NAME##_bf16_strided);
+UNARY(NAME, bfloat, NAME##_bfloat, NAME##_bfloat_strided);


 UNARY_OP(cos)
@ -102,23 +97,16 @@ UNARY_OP(neg)
 UNARY_OP(exp)
 UNARY_OP(log)
 UNARY_OP(gelu)
-UNARY_OP(abs)
 UNARY_OP(ceil)
 UNARY_OP(floor)
 UNARY_OP(round)
 UNARY_OP(gelu_erf)
 UNARY_OP(erf)
-UNARY_OP(tanh)
-UNARY_OP(recip)
-UNARY(id, float, copy_f32, copy_f32_strided)
-UNARY(id, half, copy_f16, copy_f16_strided)
+UNARY(id, float, copy_float, copy_float_strided)
+UNARY(id, half, copy_half, copy_half_strided)
 UNARY(id, uint8_t, copy_u8, copy_u8_strided)
 UNARY(id, uint32_t, copy_u32, copy_u32_strided)

-#if __METAL_VERSION__ >= 220
-UNARY(id, int64_t, copy_i64, copy_i64_strided)
-#endif
-
 #if __METAL_VERSION__ >= 310
 BFLOAT_UNARY_OP(cos)
 BFLOAT_UNARY_OP(sin)
@ -128,14 +116,11 @@ BFLOAT_UNARY_OP(neg)
 BFLOAT_UNARY_OP(exp)
 BFLOAT_UNARY_OP(log)
 BFLOAT_UNARY_OP(gelu)
-BFLOAT_UNARY_OP(abs)
 BFLOAT_UNARY_OP(ceil)
 BFLOAT_UNARY_OP(floor)
 BFLOAT_UNARY_OP(round)
 BFLOAT_UNARY_OP(gelu_erf)
 BFLOAT_UNARY_OP(erf)
-BFLOAT_UNARY_OP(tanh)
-BFLOAT_UNARY_OP(recip)

-UNARY(id, bfloat, copy_bf16, copy_bf16_strided)
+UNARY(id, bfloat, copy_bfloat, copy_bfloat_strided)
 #endif
--- a/candle-nn/Cargo.toml
+++ b/candle-nn/Cargo.toml
@ -11,7 +11,7 @@ readme = "README.md"

 [dependencies]
 accelerate-src = { workspace = true, optional = true }
-candle = { path = "../candle-core", version = "0.3.3", package = "candle-core" }
+candle = { path = "../candle-core", version = "0.3.0", package = "candle-core" }
 half = { workspace = true }
 thiserror = { workspace = true }
 intel-mkl-src = { workspace = true, optional = true }
@ -19,7 +19,6 @@ num-traits = { workspace = true }
 rayon = { workspace = true }
 safetensors = { workspace = true }
 serde = { workspace = true }
-metal = { workspace = true, optional = true }
 candle-metal-kernels = { path = "../candle-metal-kernels", version = "0.3.0", optional = true }

 [dev-dependencies]
@ -31,4 +30,4 @@ default = []
 accelerate = ["dep:accelerate-src", "candle/accelerate"]
 cuda = ["candle/cuda"]
 mkl = ["dep:intel-mkl-src", "candle/mkl"]
-metal = ["candle/metal", "dep:candle-metal-kernels", "dep:metal"]
+metal = ["candle/metal", "dep:candle-metal-kernels"]
--- a/candle-nn/examples/cpu_benchmarks.rs
+++ b/candle-nn/examples/cpu_benchmarks.rs
@ -6,7 +6,7 @@ extern crate intel_mkl_src;
 extern crate accelerate_src;

 use candle::quantized::GgmlType;
-use candle::{CpuStorage, Device, Layout, Module, Result, Shape, Tensor, D};
+use candle::{CpuStorage, Device, Layout, Result, Shape, Tensor, D};
 use clap::{Parser, Subcommand};

 const CHECK_CONV2D: bool = false;
@ -222,10 +222,7 @@ impl Benchmark for QMatMul {
    type RunResult = Tensor;
    fn preprocess() -> Result<Self::PreProcessData> {
        let zeros = vec![candle::quantized::k_quants::BlockQ4_0::zeros(); 4096 * 11008 / 32];
-        let mm = candle::quantized::QTensor::new(
-            candle::quantized::QStorage::Cpu(Box::new(zeros)),
-            (4096, 11008),
-        )?;
+        let mm = candle::quantized::QTensor::new(zeros, (4096, 11008))?;
        let mm = candle::quantized::QMatMul::from_qtensor(mm)?;
        let arg = Tensor::randn(0f32, 1., (128, 11008), &Device::Cpu)?;
        Ok((mm, arg))
--- a/candle-nn/src/activation.rs
+++ b/candle-nn/src/activation.rs
@ -1,4 +1,4 @@
-use candle::{Result, Tensor};
+use candle::Tensor;
 use serde::Deserialize;

 #[derive(Debug, Clone, Copy, PartialEq, Deserialize, Default)]
@ -21,7 +21,7 @@ pub enum Activation {
 }

 impl super::Module for Activation {
-    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
+    fn forward(&self, xs: &Tensor) -> candle::Result<Tensor> {
        match self {
            Self::Gelu => xs.gelu_erf(),
            // https://github.com/huggingface/transformers/blob/12f043eaeaabfef6f6efea411d98e6f6d3c094b7/src/transformers/activations.py#L49-L78
@ -40,60 +40,3 @@ impl super::Module for Activation {
        }
    }
 }
-
-#[derive(Clone, Debug)]
-pub struct PReLU {
-    weight: Tensor,
-    is_scalar: bool,
-}
-
-impl PReLU {
-    pub fn new(weight: Tensor, is_scalar: bool) -> Self {
-        Self { weight, is_scalar }
-    }
-
-    pub fn weight(&self) -> &Tensor {
-        &self.weight
-    }
-
-    pub fn is_scalar(&self) -> bool {
-        self.is_scalar
-    }
-}
-
-impl candle::Module for PReLU {
-    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
-        let weight = if self.is_scalar {
-            self.weight.reshape(())?
-        } else if xs.rank() >= 2 {
-            let num_channels = xs.dim(1)?;
-            let num_weights = self.weight.elem_count();
-            if num_weights != num_channels {
-                candle::bail!("error in prelu: unexpected number of channels for the input, got {num_channels}, weight dim is {num_weights}")
-            }
-            let mut s = vec![1; xs.rank()];
-            s[1] = self.weight.elem_count();
-            self.weight.reshape(s)?
-        } else {
-            self.weight.clone()
-        };
-        let zeros = xs.zeros_like()?;
-        xs.maximum(&zeros)? + xs.minimum(&zeros)?.broadcast_mul(&weight)?
-    }
-}
-
-/// Create or initialize a new PReLU layer.
-///
-/// This uses some default name for weights, namely `"weight"`.
-/// # Arguments
-///
-/// * `num_channels` - The number of channels. Use `None` to have as single trainable value and
-/// `Some` for a 1D vector with the appropriate number of channels. When applying the `forward`
-/// function, the input tensor shape `s` should either be one dimension with this number of
-/// channels or if `s.len() >= 2` it should have `s[1]` equal to this number.
-pub fn prelu(num_channels: Option<usize>, vs: crate::VarBuilder) -> Result<PReLU> {
-    let init_ws = crate::init::Init::Const(0.25);
-    // When using a scalar weight, the PyTorch encoding is to use a 1d vector of length 1.
-    let ws = vs.get_with_hints((num_channels.unwrap_or(1),), "weight", init_ws)?;
-    Ok(PReLU::new(ws, num_channels.is_none()))
-}
--- a/candle-nn/src/batch_norm.rs
+++ b/candle-nn/src/batch_norm.rs
@ -7,21 +7,15 @@
 //! running stats.
 //!
 //! [`Batch Normalization`]: https://arxiv.org/abs/1502.03167
-use candle::{DType, Result, Tensor, Var};
+use candle::{DType, Result, Tensor};

 #[derive(Debug, Clone, Copy, PartialEq)]
 pub struct BatchNormConfig {
    pub eps: f64,
    pub remove_mean: bool,
-
    /// The meaning of affine here is different from LayerNorm: when false there is no learnable
    /// parameter at all, 1 used for gamma and 0 for beta.
    pub affine: bool,
-
-    /// Controls exponential moving average of running stats. Defaults to 0.1
-    ///
-    /// `running_stat * (1.0 - momentum) + stat * momentum`.
-    pub momentum: f64,
 }

 impl Default for BatchNormConfig {
@ -30,7 +24,6 @@ impl Default for BatchNormConfig {
            eps: 1e-5,
            remove_mean: true,
            affine: true,
-            momentum: 0.1,
        }
    }
 }
@ -39,61 +32,23 @@ impl From<f64> for BatchNormConfig {
    fn from(eps: f64) -> Self {
        Self {
            eps,
-            ..Default::default()
+            remove_mean: true,
+            affine: true,
        }
    }
 }

 #[derive(Clone, Debug)]
 pub struct BatchNorm {
-    running_mean: Var,
-    running_var: Var,
+    running_mean: Tensor,
+    running_var: Tensor,
    weight_and_bias: Option<(Tensor, Tensor)>,
    remove_mean: bool,
    eps: f64,
-    momentum: f64,
+    num_features: usize,
 }

 impl BatchNorm {
-    fn check_validity(&self, num_features: usize) -> Result<()> {
-        if self.eps < 0. {
-            candle::bail!("batch-norm eps cannot be negative {}", self.eps)
-        }
-        if !(0.0..=1.0).contains(&self.momentum) {
-            candle::bail!(
-                "batch-norm momentum must be between 0 and 1, is {}",
-                self.momentum
-            )
-        }
-        if self.running_mean.dims() != [num_features] {
-            candle::bail!(
-                "batch-norm running mean has unexpected shape {:?} should have shape [{num_features}]",
-                self.running_mean.shape(),
-            )
-        }
-        if self.running_var.dims() != [num_features] {
-            candle::bail!(
-                "batch-norm running variance has unexpected shape {:?} should have shape [{num_features}]",
-                self.running_var.shape(),
-            )
-        }
-        if let Some((ref weight, ref bias)) = self.weight_and_bias.as_ref() {
-            if weight.dims() != [num_features] {
-                candle::bail!(
-                    "batch-norm weight has unexpected shape {:?} should have shape [{num_features}]",
-                    weight.shape(),
-                )
-            }
-            if bias.dims() != [num_features] {
-                candle::bail!(
-                    "batch-norm weight has unexpected shape {:?} should have shape [{num_features}]",
-                    bias.shape(),
-                )
-            }
-        }
-        Ok(())
-    }
-
    pub fn new(
        num_features: usize,
        running_mean: Tensor,
@ -102,16 +57,29 @@ impl BatchNorm {
        bias: Tensor,
        eps: f64,
    ) -> Result<Self> {
-        let out = Self {
-            running_mean: Var::from_tensor(&running_mean)?,
-            running_var: Var::from_tensor(&running_var)?,
+        if eps < 0. {
+            candle::bail!("batch-norm eps cannot be negative {eps}")
+        }
+        if weight.dims() != [num_features] {
+            candle::bail!(
+                "batch-norm unexpected weight shape {:?} {num_features}",
+                weight.shape()
+            )
+        }
+        if bias.dims() != [num_features] {
+            candle::bail!(
+                "batch-norm unexpected bias shape {:?} {num_features}",
+                bias.shape()
+            )
+        }
+        Ok(Self {
+            running_mean,
+            running_var,
            weight_and_bias: Some((weight, bias)),
            remove_mean: true,
            eps,
-            momentum: 0.1,
-        };
-        out.check_validity(num_features)?;
-        Ok(out)
+            num_features,
+        })
    }

    pub fn new_no_bias(
@ -120,64 +88,25 @@ impl BatchNorm {
        running_var: Tensor,
        eps: f64,
    ) -> Result<Self> {
-        let out = Self {
-            running_mean: Var::from_tensor(&running_mean)?,
-            running_var: Var::from_tensor(&running_var)?,
+        if eps < 0. {
+            candle::bail!("batch-norm eps cannot be negative {eps}")
+        }
+        Ok(Self {
+            running_mean,
+            running_var,
            weight_and_bias: None,
            remove_mean: true,
            eps,
-            momentum: 0.1,
-        };
-        out.check_validity(num_features)?;
-        Ok(out)
-    }
-
-    pub fn new_with_momentum(
-        num_features: usize,
-        running_mean: Tensor,
-        running_var: Tensor,
-        weight: Tensor,
-        bias: Tensor,
-        eps: f64,
-        momentum: f64,
-    ) -> Result<Self> {
-        let out = Self {
-            running_mean: Var::from_tensor(&running_mean)?,
-            running_var: Var::from_tensor(&running_var)?,
-            weight_and_bias: Some((weight, bias)),
-            remove_mean: true,
-            eps,
-            momentum,
-        };
-        out.check_validity(num_features)?;
-        Ok(out)
-    }
-
-    pub fn new_no_bias_with_momentum(
-        num_features: usize,
-        running_mean: Tensor,
-        running_var: Tensor,
-        eps: f64,
-        momentum: f64,
-    ) -> Result<Self> {
-        let out = Self {
-            running_mean: Var::from_tensor(&running_mean)?,
-            running_var: Var::from_tensor(&running_var)?,
-            weight_and_bias: None,
-            remove_mean: true,
-            eps,
-            momentum,
-        };
-        out.check_validity(num_features)?;
-        Ok(out)
+            num_features,
+        })
    }

    pub fn running_mean(&self) -> &Tensor {
-        self.running_mean.as_tensor()
+        &self.running_mean
    }

    pub fn running_var(&self) -> &Tensor {
-        self.running_var.as_tensor()
+        &self.running_var
    }

    pub fn eps(&self) -> f64 {
@ -188,12 +117,7 @@ impl BatchNorm {
        self.weight_and_bias.as_ref().map(|v| (&v.0, &v.1))
    }

-    pub fn momentum(&self) -> f64 {
-        self.momentum
-    }
-
-    pub fn forward_train(&self, x: &Tensor) -> Result<Tensor> {
-        let num_features = self.running_mean.as_tensor().dim(0)?;
+    pub fn forward_learning(&self, x: &Tensor) -> Result<Tensor> {
        let x_dtype = x.dtype();
        let internal_dtype = match x_dtype {
            DType::F16 | DType::BF16 => DType::F32,
@ -205,54 +129,40 @@ impl BatchNorm {
                x.shape()
            )
        }
-        if x.dim(1)? != num_features {
+        if x.dim(1)? != self.num_features {
            candle::bail!(
                "batch-norm input doesn't have the expected number of features ({:?} <> {})",
                x.shape(),
-                num_features
+                self.num_features
            )
        }
        let x = x.to_dtype(internal_dtype)?;
        let x = x.transpose(0, 1)?;
        let x_dims_post_transpose = x.dims();
-        // Flatten all the dimensions exception the channel one as this performs a Spatial Batch
-        // Normalization.
        let x = x.flatten_from(1)?.contiguous()?;
        let x = if self.remove_mean {
-            // The mean is taken over dim 1 as this is the batch dim after the transpose(0, 1) above.
            let mean_x = x.mean_keepdim(1)?;
-            let updated_running_mean = ((self.running_mean.as_tensor() * (1.0 - self.momentum))?
-                + (mean_x.flatten_all()? * self.momentum)?)?;
-            self.running_mean.set(&updated_running_mean)?;
            x.broadcast_sub(&mean_x)?
        } else {
            x
        };
-        // The mean is taken over dim 1 as this is the batch dim after the transpose(0, 1) above.
        let norm_x = x.sqr()?.mean_keepdim(1)?;
-        let updated_running_var = {
-            let batch_size = x.dim(1)? as f64;
-            let running_var_weight = 1.0 - self.momentum;
-            let norm_x_weight = self.momentum * batch_size / (batch_size - 1.0);
-            ((self.running_var.as_tensor() * running_var_weight)?
-                + (&norm_x.flatten_all()? * norm_x_weight)?)?
-        };
-        self.running_var.set(&updated_running_var)?;
-        let x = x
-            .broadcast_div(&(norm_x + self.eps)?.sqrt()?)?
-            .to_dtype(x_dtype)?;
+        let x_normed = x.broadcast_div(&(norm_x + self.eps)?.sqrt()?)?;
+        let x = x_normed.to_dtype(x_dtype)?;
        let x = match &self.weight_and_bias {
            None => x,
            Some((weight, bias)) => {
-                let weight = weight.reshape(((), 1))?;
-                let bias = bias.reshape(((), 1))?;
+                let weight = weight.reshape((self.num_features, 1))?;
+                let bias = bias.reshape((self.num_features, 1))?;
                x.broadcast_mul(&weight)?.broadcast_add(&bias)?
            }
        };
        x.reshape(x_dims_post_transpose)?.transpose(0, 1)
    }
+}

-    fn forward_eval(&self, x: &Tensor) -> Result<Tensor> {
+impl crate::Module for BatchNorm {
+    fn forward(&self, x: &Tensor) -> Result<Tensor> {
        let target_shape: Vec<usize> = x
            .dims()
            .iter()
@ -260,13 +170,9 @@ impl BatchNorm {
            .map(|(idx, v)| if idx == 1 { *v } else { 1 })
            .collect();
        let target_shape = target_shape.as_slice();
-
        let x = x
-            .broadcast_sub(&self.running_mean.as_tensor().reshape(target_shape)?)?
-            .broadcast_div(
-                &(self.running_var.as_tensor().reshape(target_shape)? + self.eps)?.sqrt()?,
-            )?;
-
+            .broadcast_sub(&self.running_mean.reshape(target_shape)?)?
+            .broadcast_div(&(self.running_var.reshape(target_shape)? + self.eps)?.sqrt()?)?;
        match &self.weight_and_bias {
            None => Ok(x),
            Some((weight, bias)) => {
@ -278,41 +184,30 @@ impl BatchNorm {
    }
 }

-impl crate::ModuleT for BatchNorm {
-    fn forward_t(&self, x: &Tensor, train: bool) -> Result<Tensor> {
-        if train {
-            self.forward_train(x)
-        } else {
-            self.forward_eval(x)
-        }
-    }
-}
-
 pub fn batch_norm<C: Into<BatchNormConfig>>(
    num_features: usize,
    config: C,
    vb: crate::VarBuilder,
 ) -> Result<BatchNorm> {
-    use crate::Init;
    let config = config.into();
    if config.eps < 0. {
        candle::bail!("batch-norm eps cannot be negative {}", config.eps)
    }
-    let running_mean = vb.get_with_hints(num_features, "running_mean", Init::Const(0.))?;
-    let running_var = vb.get_with_hints(num_features, "running_var", Init::Const(1.))?;
+    let running_mean = vb.get_with_hints(num_features, "running_mean", crate::Init::Const(0.))?;
+    let running_var = vb.get_with_hints(num_features, "running_var", crate::Init::Const(1.))?;
    let weight_and_bias = if config.affine {
-        let weight = vb.get_with_hints(num_features, "weight", Init::Const(1.))?;
-        let bias = vb.get_with_hints(num_features, "bias", Init::Const(0.))?;
+        let weight = vb.get_with_hints(num_features, "weight", crate::Init::Const(1.))?;
+        let bias = vb.get_with_hints(num_features, "bias", crate::Init::Const(0.))?;
        Some((weight, bias))
    } else {
        None
    };
    Ok(BatchNorm {
-        running_mean: Var::from_tensor(&running_mean)?,
-        running_var: Var::from_tensor(&running_var)?,
+        running_mean,
+        running_var,
        weight_and_bias,
        remove_mean: config.remove_mean,
        eps: config.eps,
-        momentum: config.momentum,
+        num_features,
    })
 }
--- a/candle-nn/src/encoding.rs
+++ b/candle-nn/src/encoding.rs
@ -1,150 +0,0 @@
-//! Encoding Utilities. (e.g., one-hot/cold encoding)
-
-use candle::{bail, DType, Result, Tensor, WithDType};
-
-/// One-hot/cold encoding.
-///
-/// Given an input tensor of indices, this function returns a tensor of the same shape as the input
-/// tensor with an additional dimension of the given depth size. The values in the returned tensor are
-/// all set to the `off_value` except for the positions represented by the indices, which are set to the `on_value`.
-///
-/// This method returns a tensor with a rank that is one rank larger than the input tensor.
-///
-/// As an example, the following tensor will be encoded to a one-hot matrix:
-///
-/// `[[0i64, 2], [1, -1]]`
-///
-/// with a depth of 4 will be encoded to:
-///
-/// `[[[1, 0, 0, 0], [0, 0, 1, 0]], [[0, 1, 0, 0], [0, 0, 0, 0]]]`
-///
-/// When the input tensor index has a value of -1, the corresponding one-hot vector will be ignored,
-/// resulting in a vector of values set to the `off_value`.
-///
-///
-/// This method supports one-cold encoding by setting `on_value` to `0` and `off_value` to `1`.
-/// By default `on_value` is `1` and `off_value` is `0`.
-///
-/// Other encoding values can be used by setting `on_value` and `off_value` to the desired values.
-///
-/// # Examples
-///
-/// ## One-hot encoding
-///
-/// ```rust
-/// use candle::{Shape, Tensor, Device};
-/// use candle_nn::encoding::one_hot;
-///
-/// let device = candle::Device::Cpu;
-///
-/// let indices = Tensor::new(vec![vec![0i64, 2], vec![1, -1]], &device).unwrap();
-/// let depth = 4;
-/// let one_hot = one_hot(indices, depth, 1f32, 0f32).unwrap();
-///
-/// let expected_matrix = [
-///     [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
-///     [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
-/// ];
-///
-/// assert_eq!(one_hot.shape(), &Shape::from((2, 2, depth)));
-///
-/// let matrix = one_hot.to_vec3::<f32>().unwrap();
-///
-/// assert_eq!(matrix, expected_matrix);
-///```
-/// ## One-cold Encoding
-///
-/// ```rust
-/// use candle::{Shape, Tensor, Device};
-/// use candle_nn::encoding::one_hot;
-///
-///
-/// let device = candle::Device::Cpu;
-/// let depth = 4;
-/// let indices = Tensor::new(vec![vec![0u8, 2], vec![1, 3]], &device).unwrap();
-/// let one_cold = one_hot(indices, depth, 0u8, 1u8).unwrap();
-///
-/// let expected_matrix = [[[0, 1, 1, 1], [1, 1, 0, 1]], [[1, 0, 1, 1], [1, 1, 1, 0]]];
-///
-/// assert_eq!(one_cold.shape(), &Shape::from((2, 2, depth)));
-///
-/// let matrix = one_cold.to_vec3::<u8>().unwrap();
-///
-/// assert_eq!(matrix, expected_matrix);
-/// ```
-///
-///
-/// # Bails
-///
-/// This method bails if:
-/// - One of the index value is less than -1.
-/// - One of the index value is greater than or equal to the depth value.
-/// - The input data type is not `U8`, `U32`, or `I64`.
-///
-/// # API Design
-///
-/// The api design for this method is loosely based on the [TensorFlow One-Hot](https://www.tensorflow.org/api_docs/python/tf/one_hot) method.
-pub fn one_hot<D: WithDType>(
-    indices: Tensor,
-    depth: usize,
-    on_value: D,
-    off_value: D,
-) -> Result<Tensor> {
-    let mut target_shape = indices.dims().to_vec();
-    target_shape.push(depth);
-    let indices = indices.flatten_all()?;
-    let mut out = vec![off_value; depth * indices.elem_count()];
-    match indices.dtype() {
-        DType::U8 => {
-            let indices = indices.to_vec1::<u8>()?;
-            for (i, &index) in indices.iter().enumerate() {
-                set_at_index(index, i * depth, depth, &mut out, on_value)?;
-            }
-        }
-        DType::U32 => {
-            let indices = indices.to_vec1::<u32>()?;
-            for (i, &index) in indices.iter().enumerate() {
-                set_at_index(index, i * depth, depth, &mut out, on_value)?;
-            }
-        }
-        DType::I64 => {
-            let indices = indices.to_vec1::<i64>()?;
-            for (i, &index) in indices.iter().enumerate() {
-                set_at_index(index, i * depth, depth, &mut out, on_value)?;
-            }
-        }
-        dtype => {
-            bail!("one_hot: unsupported data type {dtype:?}, expected U8, U32, or I64")
-        }
-    };
-    Tensor::from_vec(out, target_shape, indices.device())
-}
-
-fn set_at_index<D: WithDType, I: Into<i64>>(
-    value: I,
-    offset: usize,
-    depth: usize,
-    v: &mut Vec<D>,
-    on_value: D,
-) -> Result<()> {
-    let value = value.into();
-    // Skip for an entire row of off_values
-    if value == -1 {
-        return Ok(());
-    }
-    if value < -1 {
-        bail!(
-            "one_hot: invalid negative index value {value}, expected a positive index value or -1"
-        );
-    }
-    let value = value as usize;
-    if value >= depth {
-        bail!("one_hot: index value {value} exceeds depth {depth}")
-    }
-    let idx = offset + value;
-    if idx >= v.len() {
-        bail!("one_hot: index out of bounds {idx}, len {}", v.len());
-    }
-    v[idx] = on_value;
-    Ok(())
-}
--- a/candle-nn/src/lib.rs
+++ b/candle-nn/src/lib.rs
@ -2,7 +2,6 @@ pub mod activation;
 pub mod batch_norm;
 pub mod conv;
 pub mod embedding;
-pub mod encoding;
 pub mod func;
 pub mod group_norm;
 pub mod init;
@ -16,7 +15,7 @@ pub mod sequential;
 pub mod var_builder;
 pub mod var_map;

-pub use activation::{prelu, Activation, PReLU};
+pub use activation::Activation;
 pub use batch_norm::{batch_norm, BatchNorm, BatchNormConfig};
 pub use conv::{
    conv1d, conv2d, conv2d_no_bias, conv_transpose2d, conv_transpose2d_no_bias, Conv1d,
--- a/candle-nn/src/linear.rs
+++ b/candle-nn/src/linear.rs
@ -56,7 +56,7 @@ impl super::Module for Linear {

 /// Create or initialize a new linear layer.
 ///
-/// This uses some default names for weights and biases, namely `"weight"` and `"bias"`.
+/// This uses some default names for weights and biases, namely `"weight"` and `"bias"`.
 pub fn linear(in_dim: usize, out_dim: usize, vs: crate::VarBuilder) -> Result<Linear> {
    let init_ws = crate::init::DEFAULT_KAIMING_NORMAL;
    let ws = vs.get_with_hints((out_dim, in_dim), "weight", init_ws)?;
@ -69,7 +69,6 @@ pub fn linear(in_dim: usize, out_dim: usize, vs: crate::VarBuilder) -> Result<Li
    Ok(Linear::new(ws, Some(bs)))
 }

-/// Create or initialize a new linear layer without biases.
 pub fn linear_no_bias(in_dim: usize, out_dim: usize, vs: crate::VarBuilder) -> Result<Linear> {
    let init_ws = crate::init::DEFAULT_KAIMING_NORMAL;
    let ws = vs.get_with_hints((out_dim, in_dim), "weight", init_ws)?;
--- a/candle-nn/src/ops.rs
+++ b/candle-nn/src/ops.rs
@ -208,35 +208,25 @@ impl candle::CustomOp1 for SoftmaxLastDim {
        storage: &candle::MetalStorage,
        layout: &Layout,
    ) -> Result<(candle::MetalStorage, Shape)> {
-        use candle::{backend::BackendStorage, DType};
-        let device = storage.device();
-        let command_buffer = device.command_buffer()?;
+        use candle::backend::BackendStorage;
+        let device = storage.device();
+        let command_buffer = device.command_buffer();
        let kernels = device.kernels();
-        let name = match storage.dtype() {
-            DType::F32 => "softmax_f32",
-            DType::F16 => "softmax_f16",
-            DType::BF16 => "softmax_bf16",
-            dtype => candle::bail!("softmax-last-dim is not implemented for {dtype:?}"),
-        };
-
-        let n = layout.stride().len();
-        if !(layout.is_contiguous() && layout.stride()[n - 1] == 1) {
-            candle::bail!("Non contiguous softmax-last-dim is not implemented");
-        }
-
+        let name = "softmax_float";
+        assert!(layout.is_contiguous());
+        assert!(layout.start_offset() == 0);
        let last_dim = layout.dims()[layout.shape().rank() - 1];
        let elem_count = layout.shape().elem_count();
-        let output = device.new_buffer(elem_count, storage.dtype(), "softmax")?;
+        let mut output = device.new_buffer(elem_count, storage.dtype());
        candle_metal_kernels::call_last_softmax(
            device.metal_device(),
            &command_buffer,
-            kernels,
+            &kernels,
            name,
            elem_count,
            last_dim,
            storage.buffer(),
-            layout.start_offset() * storage.dtype().size_in_bytes(),
-            &output,
+            &mut output,
        )
        .unwrap();
        let newstorage = candle::MetalStorage::new(output, device.clone(), storage.dtype());
--- a/candle-nn/src/optim.rs
+++ b/candle-nn/src/optim.rs
@ -190,12 +190,4 @@ impl AdamW {
        };
        Self::new(vars, params)
    }
-
-    pub fn params(&self) -> &ParamsAdamW {
-        &self.params
-    }
-
-    pub fn set_params(&mut self, params: ParamsAdamW) {
-        self.params = params;
-    }
 }
--- a/candle-nn/src/var_builder.rs
+++ b/candle-nn/src/var_builder.rs
@ -40,7 +40,7 @@ struct TensorData<B: Backend> {
 /// A trait that defines how tensor data is retrieved.
 ///
 /// Typically this would use disk storage in some specific format, or random initialization.
-/// Note that there is a specialized version of this trait (`SimpleBackend`) that can be used most
+/// Note that there is a specialized version of this trait (`SimpleBackend`) that can be used most
 /// of the time. The main restriction is that it doesn't allow for specific args (besides
 /// initialization hints).
 pub trait Backend: Send + Sync {
@ -535,18 +535,12 @@ impl Backend for ShardedSafeTensors {

    fn get(
        &self,
-        target_shape: Shape, // The size is only checked when the world size is 1.
+        _target_shape: Shape, // The size is not checked for ShardedTensors
        path: &str,
        h: Self::Hints,
        dtype: DType,
        dev: &Device,
    ) -> Result<Tensor> {
-        if h.world_size == 1 {
-            // There is no sharding to be applied here so we use the default backend to speed
-            // things up.
-            return SimpleBackend::get(&self.0, target_shape, path, Default::default(), dtype, dev);
-        }
-
        let Shard {
            dim,
            rank,
--- a/candle-nn/tests/batch_norm.rs
+++ b/candle-nn/tests/batch_norm.rs
@ -16,8 +16,6 @@ input = torch.randn(2, 5, 3, 4)
 output = m(input)
 print(input.flatten())
 print(output.flatten())
-print(m.running_mean)
-print(m.running_var)
 */
 #[test]
 fn batch_norm() -> Result<()> {
@ -39,7 +37,7 @@ fn batch_norm() -> Result<()> {
        1.4252, -0.9115, -0.1093, -0.3100, -0.6734, -1.4357, 0.9205,
    ];
    let input = Tensor::new(&input, &Device::Cpu)?.reshape((2, 5, 3, 4))?;
-    let output = bn.forward_train(&input)?;
+    let output = bn.forward_learning(&input)?;
    assert_eq!(output.dims(), &[2, 5, 3, 4]);
    let output = output.flatten_all()?;
    assert_eq!(
@ -67,20 +65,11 @@ fn batch_norm() -> Result<()> {
        Tensor::new(&[-1.5f32], &Device::Cpu)?.broadcast_as(5)?,
        1e-8,
    )?;
-    let output2 = bn2.forward_train(&input)?;
+    let output2 = bn2.forward_learning(&input)?;
    assert_eq!(output2.dims(), &[2, 5, 3, 4]);
    let output2 = output2.flatten_all()?;
    let diff2 = ((output2 - (output * 0.5)?)? + 1.5)?.sqr()?;
    let sum_diff2 = diff2.sum_keepdim(0)?;
    assert_eq!(test_utils::to_vec1_round(&sum_diff2, 4)?, &[0f32]);
-
-    assert_eq!(
-        test_utils::to_vec1_round(bn.running_mean(), 4)?,
-        &[-0.0133, 0.0197, -0.0153, -0.0073, -0.0020]
-    );
-    assert_eq!(
-        test_utils::to_vec1_round(bn.running_var(), 4)?,
-        &[0.9972, 0.9842, 0.9956, 0.9866, 0.9898]
-    );
    Ok(())
 }
--- a/candle-nn/tests/one_hot.rs
+++ b/candle-nn/tests/one_hot.rs
@ -1,120 +0,0 @@
-use candle::{Result, Shape, Tensor};
-use candle_nn::encoding::one_hot;
-
-#[test]
-fn test_i64_one_hot() -> Result<()> {
-    let device = candle::Device::Cpu;
-
-    let indices = Tensor::new(vec![vec![0i64, 2], vec![1, -1]], &device)?;
-    let depth = 4;
-
-    let on_value = 1.0;
-    let off_value = 0.0;
-
-    let one_hot = one_hot::<f32>(indices, depth, on_value, off_value)?;
-
-    let expected_matrix = [
-        [[1., 0., 0., 0.], [0., 0., 1., 0.]],
-        [[0., 1., 0., 0.], [0., 0., 0., 0.]],
-    ];
-
-    assert_eq!(one_hot.shape(), &Shape::from((2, 2, depth)));
-
-    let matrix = one_hot.to_vec3::<f32>()?;
-
-    assert_eq!(matrix, expected_matrix);
-
-    Ok(())
-}
-
-#[test]
-fn test_rank_3_one_hot() -> Result<()> {
-    let device = candle::Device::Cpu;
-
-    let indices = Tensor::new(
-        vec![
-            vec![vec![0i64, 1], vec![2, 3]],
-            vec![vec![3, 1], vec![1, -1]],
-        ],
-        &device,
-    )?;
-    let depth = 4;
-
-    let on_value = 1.0;
-    let off_value = 0.0;
-
-    let one_hot = one_hot::<f32>(indices, depth, on_value, off_value)?;
-
-    let expected_matrix = Tensor::new(
-        vec![
-            vec![
-                vec![vec![1f32, 0., 0., 0.], vec![0., 1., 0., 0.]],
-                vec![vec![0., 0., 1., 0.], vec![0., 0., 0., 1.]],
-            ],
-            vec![
-                vec![vec![0., 0., 0., 1.], vec![0., 1., 0., 0.]],
-                vec![vec![0., 1., 0., 0.], vec![0., 0., 0., 0.]],
-            ],
-        ],
-        &device,
-    )?;
-
-    assert_eq!(one_hot.shape(), expected_matrix.shape());
-    assert_eq!(one_hot.dims(), expected_matrix.dims());
-
-    let matrix = one_hot.get(1)?.to_vec3::<f32>()?;
-    let expected_matrix = expected_matrix.get(1)?.to_vec3::<f32>()?;
-
-    assert_eq!(matrix, expected_matrix);
-
-    Ok(())
-}
-
-#[test]
-fn test_u8_one_cold() -> Result<()> {
-    let device = candle::Device::Cpu;
-    let depth = 4;
-    let indices = Tensor::new(vec![vec![0i64, 2], vec![1, -1]], &device)?;
-
-    let on_value = 0u8;
-    let off_value = 1;
-
-    // Note that the method does not require the turbofish operator, as the type is inferred from the on_value.
-    let one_cold = one_hot(indices, depth, on_value, off_value)?;
-
-    let expected_matrix = [[[0, 1, 1, 1], [1, 1, 0, 1]], [[1, 0, 1, 1], [1, 1, 1, 1]]];
-
-    assert_eq!(one_cold.shape(), &Shape::from((2, 2, depth)));
-
-    let matrix = one_cold.to_vec3::<u8>()?;
-
-    assert_eq!(matrix, expected_matrix);
-
-    Ok(())
-}
-
-#[test]
-fn test_iter() -> Result<()> {
-    let device = candle::Device::Cpu;
-    let depth = 4;
-    let indices = Tensor::new(vec![vec![0i64, 2], vec![1, -1]], &device)?;
-    let matrix = indices.to_vec2::<i64>()?;
-    let (dim1, dim2) = indices.dims2()?;
-
-    let iter = (0..dim1).flat_map(|i| (0..dim2).map(move |j| (i, j)));
-
-    let mut v = vec![0; depth * dim1 * dim2];
-
-    for (i, j) in iter {
-        let idx = i * depth * dim2 + j * depth;
-        v[idx] = matrix[i][j];
-    }
-
-    for (i, row) in matrix.iter().enumerate() {
-        for (j, &value) in row.iter().enumerate() {
-            let idx = i * depth * dim2 + j * depth;
-            assert_eq!(v[idx], value);
-        }
-    }
-    Ok(())
-}
--- a/candle-onnx/Cargo.toml
+++ b/candle-onnx/Cargo.toml
@ -1,6 +1,6 @@
 [package]
 name = "candle-onnx"
-version = "0.3.3"
+version = "0.3.0"
 edition = "2021"

 description = "ONNX support for Candle"
@ -10,8 +10,8 @@ categories = ["science"]
 license = "MIT OR Apache-2.0"

 [dependencies]
-candle = { path = "../candle-core", version = "0.3.3", package = "candle-core" }
-candle-nn = { path = "../candle-nn", version = "0.3.3" }
+candle = { path = "../candle-core", version = "0.3.0", package = "candle-core" }
+candle-nn = { path = "../candle-nn", version = "0.3.0" }
 prost = "0.12.1"

 [build-dependencies]
Author	SHA1	Message	Date
Nicolas Patry	c65f68e988	Tmp gemm.	2023-11-19 20:43:59 +01:00
Nicolas Patry	eed1631ee2	Reuse buffers on our own reference counts.	2023-11-18 23:28:59 +01:00
Nicolas Patry	251c65f9f1	Metal operational.	2023-11-18 00:52:38 +01:00
Nicolas Patry	a0010898cc	Better batched matmul.	2023-11-17 10:36:57 +01:00
Nicolas Patry	2801541e5f	new_owned -> new()..to_owned().	2023-11-16 11:07:56 +01:00
Nicolas Patry	4289984d32	Remove some prints.	2023-11-13 14:51:40 +01:00
Nicolas Patry	1471f98f0b	BF16 metal fix.	2023-11-13 14:44:20 +01:00
Nicolas Patry	dd4a40f1c0	Fixes + cache compute_pipeline_state.	2023-11-13 14:33:16 +01:00
Nicolas Patry	79845bd93b	Working version for llama2-c.	2023-11-13 12:36:27 +01:00
Nicolas Patry	6071797450	Add erf.	2023-11-11 18:22:16 +01:00
Nicolas Patry	b58b247323	Putting back f16 index select.	2023-11-11 17:43:35 +01:00
Nicolas Patry	3900091e75	All tests are panicking instead of random failure.	2023-11-11 17:43:35 +01:00
Nicolas Patry	54355ff997	Adding some half kernels.	2023-11-11 17:43:35 +01:00
Nicolas Patry	e02f1912bb	Reusing a single buffer (for now) to speed things up.	2023-11-11 17:43:35 +01:00
Nicolas Patry	a52b71686b	Going back on remote metal-rs.	2023-11-11 17:43:35 +01:00
Nicolas Patry	7adfb70dff	Few fixes.	2023-11-11 17:43:35 +01:00
Nicolas Patry	3ad02147e4	Starting to fix some tests.	2023-11-11 17:43:34 +01:00
Nicolas Patry	4f39695465	Missing new test.	2023-11-11 17:42:53 +01:00
Nicolas Patry	4cf4844c9d	Adding the test scaffolding.	2023-11-11 17:27:19 +01:00
Nicolas Patry	d840838e95	Cleanup fixed a few ops removed debugging scaffolding.	2023-11-11 17:18:00 +01:00
Nicolas Patry	61a070fdd1	Debugging rope.	2023-11-11 17:18:00 +01:00
Nicolas Patry	e35669647d	Fixed matmul (display still broken without casting back to CPU first? )	2023-11-11 17:18:00 +01:00
Nicolas Patry	53e8b7ee3e	Tmp state.	2023-11-11 17:18:00 +01:00
Nicolas Patry	cc26cce23c	Fixing the kernels + launches to make them faster. Cool work by @ivarflakstad Co-authored-by: Ivar Flakstad <69173633+ivarflakstad@users.noreply.github.com>	2023-11-11 17:18:00 +01:00
Nicolas Patry	02c2ec2c71	Adding indexing. Co-authored-by: Ivar Flakstad <69173633+ivarflakstad@users.noreply.github.com>	2023-11-11 17:18:00 +01:00
Nicolas Patry	9a2784b8ab	Refactor to simplify our lives for settings the params in the encoder.	2023-11-11 17:18:00 +01:00
Nicolas Patry	0f652f0e3d	Adding the actual backend	2023-11-11 17:18:00 +01:00
Nicolas Patry	ddee9dc1dd	Remove tracing.	2023-11-11 17:18:00 +01:00
Nicolas Patry	fc9bb7784a	Metal part 1 - Scaffolding for metal.	2023-11-11 17:18:00 +01:00