Compare commits


1 Commit

Commit a910ec5993: CustomOp for einsum. (2023-09-08 20:46:30 +01:00)
145 changed files with 1388 additions and 7910 deletions

View File

@ -1,58 +1,13 @@
# Changelog
This documents the main changes to the `candle` crate.
## v0.2.3 - Unreleased
## v0.2.1 - Unreleased
### Added
### Modified
## v0.2.2 - 2023-09-18
### Added
- Support for `top_p` sampling
[819](https://github.com/huggingface/candle/pull/819).
- T5 model including decoding
[864](https://github.com/huggingface/candle/pull/864).
- 1-d upsampling
[839](https://github.com/huggingface/candle/pull/839).
### Modified
- Bugfix for conv2d
[820](https://github.com/huggingface/candle/pull/820).
- Support tensor based indexing using `.i`
[842](https://github.com/huggingface/candle/pull/842).
## v0.2.1 - 2023-09-11
### Added
- Add some RNNs (GRU and LSTM) in `candle-nn`
[674](https://github.com/huggingface/candle/pull/674),
[688](https://github.com/huggingface/candle/pull/688).
- gguf v2 support
[725](https://github.com/huggingface/candle/pull/725).
- Quantized llama example in Python using the pyo3 api
[716](https://github.com/huggingface/candle/pull/716).
- `candle-nn` layer for conv2d-transposed
[760](https://github.com/huggingface/candle/pull/760).
- Add the Segment-Anything Model (SAM) as an example
[773](https://github.com/huggingface/candle/pull/773).
- TinyViT backbone for the segment anything example
[787](https://github.com/huggingface/candle/pull/787).
- Shape with holes support
[770](https://github.com/huggingface/candle/pull/770).
### Modified
- Dilations are now supported in conv-transpose2d.
[671](https://github.com/huggingface/candle/pull/671).
- Interactive mode for the quantized model
[690](https://github.com/huggingface/candle/pull/690).
- Faster softmax operation
[747](https://github.com/huggingface/candle/pull/747).
- Faster convolution operations on CPU and CUDA via im2col
[802](https://github.com/huggingface/candle/pull/802).
- Moving some models to a more central location
[796](https://github.com/huggingface/candle/pull/796).
## v0.2.0 - 2023-08-30

View File

@ -8,7 +8,6 @@ members = [
"candle-pyo3",
"candle-transformers",
"candle-wasm-examples/llama2-c",
"candle-wasm-examples/segment-anything",
"candle-wasm-examples/whisper",
"candle-wasm-examples/yolo",
]
@ -19,7 +18,7 @@ exclude = [
resolver = "2"
[workspace.package]
version = "0.2.3"
version = "0.2.1"
edition = "2021"
description = "Minimalist ML framework."
repository = "https://github.com/huggingface/candle"
@ -34,7 +33,7 @@ byteorder = "1.4.3"
clap = { version = "4.2.4", features = ["derive"] }
cudarc = { version = "0.9.14", features = ["f16"] }
# TODO: Switch back to the official gemm implementation once it has caught up.
gemm = { version = "0.16.0", package = "candle-gemm" }
gemm = { version = "0.15.6", package = "candle-gemm" }
hf-hub = "0.3.0"
half = { version = "2.3.1", features = ["num-traits", "use-intrinsics", "rand_distr"] }
image = { version = "0.24.7", default-features = false, features = ["jpeg", "png"] }

README.md (113 changed lines)
View File

@ -8,9 +8,7 @@ Candle is a minimalist ML framework for Rust with a focus on performance (includ
and ease of use. Try our online demos:
[whisper](https://huggingface.co/spaces/lmz/candle-whisper),
[LLaMA2](https://huggingface.co/spaces/lmz/candle-llama2),
[yolo](https://huggingface.co/spaces/lmz/candle-yolo),
[Segment
Anything](https://huggingface.co/spaces/radames/candle-segment-anything-wasm).
[yolo](https://huggingface.co/spaces/lmz/candle-yolo).
## Get started
@ -47,49 +45,40 @@ For more advanced examples, please have a look at the following section.
## Check out our examples
These online demos run entirely in your browser:
- [yolo](https://huggingface.co/spaces/lmz/candle-yolo): pose estimation and
object recognition.
- [whisper](https://huggingface.co/spaces/lmz/candle-whisper): speech recognition.
- [LLaMA2](https://huggingface.co/spaces/lmz/candle-llama2): text generation.
- [Segment Anything Model](https://huggingface.co/spaces/radames/candle-segment-anything-wasm): Image segmentation.
We also provide some command line based examples using state-of-the-art models:
- [LLaMA and LLaMA-v2](./candle-examples/examples/llama/): general LLM.
- [Falcon](./candle-examples/examples/falcon/): general LLM.
- [StarCoder](./candle-examples/examples/bigcode/): LLM specialized to code
generation.
- [Quantized LLaMA](./candle-examples/examples/quantized/): quantized version of
the LLaMA model using the same quantization techniques as
[llama.cpp](https://github.com/ggerganov/llama.cpp).
<img src="https://github.com/huggingface/candle/raw/main/candle-examples/examples/quantized/assets/aoc.gif" width="600">
- [Stable Diffusion](./candle-examples/examples/stable-diffusion/): text to
image generative model, support for the 1.5, 2.1, and SDXL 1.0 versions.
<img src="https://github.com/huggingface/candle/raw/main/candle-examples/examples/stable-diffusion/assets/stable-diffusion-xl.jpg" width="200">
- [yolo-v3](./candle-examples/examples/yolo-v3/) and
[yolo-v8](./candle-examples/examples/yolo-v8/): object detection and pose
estimation models.
<img src="https://github.com/huggingface/candle/raw/main/candle-examples/examples/yolo-v8/assets/bike.od.jpg" width="200"><img src="https://github.com/huggingface/candle/raw/main/candle-examples/examples/yolo-v8/assets/bike.pose.jpg" width="200">
- [segment-anything](./candle-examples/examples/segment-anything/): image
segmentation model with prompt.
<img src="https://github.com/huggingface/candle/raw/main/candle-examples/examples/segment-anything/assets/sam_merged.jpg" width="200">
Check out our [examples](./candle-examples/examples/):
- [Whisper](./candle-examples/examples/whisper/): speech recognition model.
- [T5](./candle-examples/examples/t5), [Bert](./candle-examples/examples/bert/): useful for sentence embeddings.
- [LLaMA and LLaMA-v2](./candle-examples/examples/llama/): general LLM.
- [Falcon](./candle-examples/examples/falcon/): general LLM.
- [Bert](./candle-examples/examples/bert/): useful for sentence embeddings.
- [StarCoder](./candle-examples/examples/bigcode/): LLM specialized to code
generation.
- [Stable Diffusion](./candle-examples/examples/stable-diffusion/): text to
image generative model, support for the 1.5, 2.1, and SDXL 1.0 versions.
- [DINOv2](./candle-examples/examples/dinov2/): computer vision model trained
using self-supervision (can be used for imagenet classification, depth
evaluation, segmentation).
Run them using commands like:
- [Quantized LLaMA](./candle-examples/examples/quantized/): quantized version of
the LLaMA model using the same quantization techniques as
[llama.cpp](https://github.com/ggerganov/llama.cpp).
- [yolo-v3](./candle-examples/examples/yolo-v3/) and
[yolo-v8](./candle-examples/examples/yolo-v8/): object detection and pose
estimation models.
[segment-anything](./candle-examples/examples/segment-anything/): image
segmentation model with prompt.
Run them using the following commands:
```
cargo run --example whisper --release
cargo run --example llama --release
cargo run --example falcon --release
cargo run --example bert --release
cargo run --example bigcode --release
cargo run --example stable-diffusion --release -- --prompt "a rusty robot holding a fire torch"
cargo run --example dinov2 --release -- --image path/to/myinput.jpg
cargo run --example quantized --release
cargo run --example yolo-v3 --release -- myimage.jpg
cargo run --example yolo-v8 --release -- myimage.jpg # for pose estimation, add --task pose
cargo run --example segment-anything --release -- --image myimage.jpg
```
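
As noted just below, CUDA support is opt-in via a cargo feature; a typical GPU-enabled invocation would look like the following (illustrative only, assuming a working CUDA toolchain):

```
cargo run --example llama --release --features cuda
```
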
In order to use **CUDA** add `--features cuda` to the example command line. If
@ -99,8 +88,7 @@ There are also some wasm examples for whisper and
[llama2.c](https://github.com/karpathy/llama2.c). You can either build them with
`trunk` or try them online:
[whisper](https://huggingface.co/spaces/lmz/candle-whisper),
[llama2](https://huggingface.co/spaces/lmz/candle-llama2),
[Segment Anything Model](https://huggingface.co/spaces/radames/candle-segment-anything-wasm).
[llama2](https://huggingface.co/spaces/lmz/candle-llama2).
For LLaMA2, run the following command to retrieve the weight files and start a
test server:
@ -113,15 +101,6 @@ trunk serve --release --port 8081
And then head over to
[http://localhost:8081/](http://localhost:8081/).
<!--- ANCHOR: useful_libraries --->
## Useful Libraries
- [`candle-lora`](https://github.com/EricLBuehler/candle-lora) provides a LoRA implementation that conforms to the official `peft` implementation.
If you have an addition to this list, please submit a pull request.
<!--- ANCHOR_END: useful_libraries --->
<!--- ANCHOR: features --->
## Features
@ -134,20 +113,10 @@ If you have an addition to this list, please submit a pull request.
- CUDA backend for efficiently running on GPUs, multiple GPU distribution via NCCL.
- WASM support, run your models in a browser.
- Included models.
- Language Models.
- LLaMA v1 and v2.
- Falcon.
- StarCoder.
- T5.
- Bert.
- LLMs: LLaMA v1 and v2, Falcon, StarCoder.
- Whisper (multi-lingual support).
- Stable Diffusion v1.5, v2.1, XL v1.0.
- Computer Vision Models.
- DINOv2.
- EfficientNet.
- yolo-v3.
- yolo-v8.
- Segment-Anything Model (SAM).
- Stable Diffusion.
- Computer Vision: DINOv2, EfficientNet, yolo-v3, yolo-v8.
- File formats: load models from safetensors, npz, ggml, or PyTorch files.
- Serverless (on CPU), small and fast deployments.
- Quantization support using the llama.cpp quantized types.
@ -288,24 +257,6 @@ This is a bug in gcc-11 triggered by the Cuda compiler. To fix this, install a d
env CANDLE_NVCC_CCBIN=/usr/lib/gcc/x86_64-linux-gnu/10 cargo ...
```
#### Linking error on windows when running rustdoc or mdbook tests
```
Couldn't compile the test.
---- .\candle-book\src\inference\hub.md - Using_the_hub::Using_in_a_real_model_ (line 50) stdout ----
error: linking with `link.exe` failed: exit code: 1181
//very long chain of linking
= note: LINK : fatal error LNK1181: cannot open input file 'windows.0.48.5.lib'
```
Make sure you link all native libraries that might be located outside a project target, e.g., to run mdbook tests, you should run:
```
mdbook test candle-book -L .\target\debug\deps\ `
-L native=$env:USERPROFILE\.cargo\registry\src\index.crates.io-6f17d22bba15001f\windows_x86_64_msvc-0.42.2\lib `
-L native=$env:USERPROFILE\.cargo\registry\src\index.crates.io-6f17d22bba15001f\windows_x86_64_msvc-0.48.5\lib
```
#### Tracking down errors
You can set `RUST_BACKTRACE=1` to be provided with backtraces when a candle

View File

@ -11,11 +11,11 @@ readme = "README.md"
[dependencies]
accelerate-src = { workspace = true, optional = true }
candle = { path = "../candle-core", version = "0.2.3", package = "candle-core" }
candle-datasets = { path = "../candle-datasets", version = "0.2.3" }
candle-nn = { path = "../candle-nn", version = "0.2.3" }
candle-transformers = { path = "../candle-transformers", version = "0.2.3" }
candle-flash-attn = { path = "../candle-flash-attn", version = "0.2.3", optional = true }
candle = { path = "../candle-core", version = "0.2.1", package = "candle-core" }
candle-datasets = { path = "../candle-datasets", version = "0.2.1" }
candle-nn = { path = "../candle-nn", version = "0.2.1" }
candle-transformers = { path = "../candle-transformers", version = "0.2.1" }
candle-flash-attn = { path = "../candle-flash-attn", version = "0.2.1", optional = true }
safetensors = { workspace = true }
serde = { workspace = true }
serde_json = { workspace = true }

View File

@ -10,10 +10,10 @@
# Reference Guide
- [Running a model](inference/inference.md)
- [Running a model](inference/README.md)
- [Using the hub](inference/hub.md)
- [Error management](error_manage.md)
- [Training](training/training.md)
- [Training](training/README.md)
- [MNIST](training/mnist.md)
- [Fine-tuning]()
- [Serialization]()

View File

@ -29,7 +29,7 @@ After adding `RUST_BACKTRACE=1`:
Error: WithBacktrace { inner: ShapeMismatchBinaryOp { lhs: [1, 784], rhs: [1, 784], op: "matmul" }, backtrace: Backtrace [{ fn: "candle::error::Error::bt", file: "/home/nicolas/.cargo/git/checkouts/candle-5bb8ef7e0626d693/f291065/candle-core/src/error.rs", line: 200 }, { fn: "candle::tensor::Tensor::matmul", file: "/home/nicolas/.cargo/git/checkouts/candle-5bb8ef7e0626d693/f291065/candle-core/src/tensor.rs", line: 816 }, { fn: "myapp::main", file: "./src/main.rs", line: 29 }, { fn: "core::ops::function::FnOnce::call_once", file: "/rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/ops/function.rs", line: 250 }, { fn: "std::sys_common::backtrace::__rust_begin_short_backtrace", file: "/rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs", line: 135 }, { fn: "std::rt::lang_start::{{closure}}", file: "/rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/rt.rs", line: 166 }, { fn: "core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once", file: "/rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/ops/function.rs", line: 284 }, { fn: "std::panicking::try::do_call", file: "/rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs", line: 500 }, { fn: "std::panicking::try", file: "/rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs", line: 464 }, { fn: "std::panic::catch_unwind", file: "/rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panic.rs", line: 142 }, { fn: "std::rt::lang_start_internal::{{closure}}", file: "/rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/rt.rs", line: 148 }, { fn: "std::panicking::try::do_call", file: "/rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs", line: 500 }, { fn: "std::panicking::try", file: "/rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs", line: 464 }, { fn: "std::panic::catch_unwind", file: "/rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panic.rs", line: 142 }, { fn: "std::rt::lang_start_internal", file: "/rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/rt.rs", line: 148 }, { fn: "std::rt::lang_start", file: "/rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/rt.rs", line: 165 }, { fn: "main" }, { fn: "__libc_start_main" }, { fn: "_start" }] }
```
Not super pretty at the moment, but we can see error occurred on `{ fn: "myapp::main", file: "./src/main.rs", line: 29 }`
Not super pretty at the moment, but we can see error occured on `{ fn: "myapp::main", file: "./src/main.rs", line: 29 }`
Another thing to note, is that since Rust is compiled it is not necessarily as easy to recover proper stacktraces

View File

@ -6,7 +6,7 @@ Open `src/main.rs` and fill in this content:
```rust
# extern crate candle_core;
use candle_core::{Device, Result, Tensor};
use candle_core::{DType, Device, Result, Tensor};
struct Model {
first: Tensor,
@ -25,11 +25,11 @@ fn main() -> Result<()> {
// Use Device::new_cuda(0)?; to use the GPU.
let device = Device::Cpu;
let first = Tensor::randn(0f32, 1.0, (784, 100), &device)?;
let second = Tensor::randn(0f32, 1.0, (100, 10), &device)?;
let first = Tensor::zeros((784, 100), DType::F32, &device)?;
let second = Tensor::zeros((100, 10), DType::F32, &device)?;
let model = Model { first, second };
let dummy_image = Tensor::randn(0f32, 1.0, (1, 784), &device)?;
let dummy_image = Tensor::zeros((1, 784), DType::F32, &device)?;
let digit = model.forward(&dummy_image)?;
println!("Digit {digit:?} digit");
@ -50,7 +50,7 @@ the classical `Linear` layer. We can do as such
```rust
# extern crate candle_core;
# use candle_core::{Device, Result, Tensor};
# use candle_core::{DType, Device, Result, Tensor};
struct Linear{
weight: Tensor,
bias: Tensor,
@ -80,7 +80,7 @@ This will change the model running code into a new function
```rust
# extern crate candle_core;
# use candle_core::{Device, Result, Tensor};
# use candle_core::{DType, Device, Result, Tensor};
# struct Linear{
# weight: Tensor,
# bias: Tensor,
@ -110,15 +110,15 @@ fn main() -> Result<()> {
let device = Device::cuda_if_available(0)?;
// Creating a dummy model
let weight = Tensor::randn(0f32, 1.0, (784, 100), &device)?;
let bias = Tensor::randn(0f32, 1.0, (100, ), &device)?;
let weight = Tensor::zeros((784, 100), DType::F32, &device)?;
let bias = Tensor::zeros((100, ), DType::F32, &device)?;
let first = Linear{weight, bias};
let weight = Tensor::randn(0f32, 1.0, (100, 10), &device)?;
let bias = Tensor::randn(0f32, 1.0, (10, ), &device)?;
let weight = Tensor::zeros((100, 10), DType::F32, &device)?;
let bias = Tensor::zeros((10, ), DType::F32, &device)?;
let second = Linear{weight, bias};
let model = Model { first, second };
let dummy_image = Tensor::randn(0f32, 1.0, (1, 784), &device)?;
let dummy_image = Tensor::zeros((1, 784), DType::F32, &device)?;
// Inference on the model
let digit = model.forward(&dummy_image)?;
@ -146,7 +146,7 @@ And rewrite our examples using it
```rust
# extern crate candle_core;
# extern crate candle_nn;
use candle_core::{Device, Result, Tensor};
use candle_core::{DType, Device, Result, Tensor};
use candle_nn::{Linear, Module};
struct Model {
@ -167,15 +167,15 @@ fn main() -> Result<()> {
let device = Device::Cpu;
// This has changed (784, 100) -> (100, 784) !
let weight = Tensor::randn(0f32, 1.0, (100, 784), &device)?;
let bias = Tensor::randn(0f32, 1.0, (100, ), &device)?;
let weight = Tensor::zeros((100, 784), DType::F32, &device)?;
let bias = Tensor::zeros((100, ), DType::F32, &device)?;
let first = Linear::new(weight, Some(bias));
let weight = Tensor::randn(0f32, 1.0, (10, 100), &device)?;
let bias = Tensor::randn(0f32, 1.0, (10, ), &device)?;
let weight = Tensor::zeros((10, 100), DType::F32, &device)?;
let bias = Tensor::zeros((10, ), DType::F32, &device)?;
let second = Linear::new(weight, Some(bias));
let model = Model { first, second };
let dummy_image = Tensor::randn(0f32, 1.0, (1, 784), &device)?;
let dummy_image = Tensor::zeros((1, 784), DType::F32, &device)?;
let digit = model.forward(&dummy_image)?;
println!("Digit {digit:?} digit");
@ -188,8 +188,8 @@ Feel free to modify this example to use `Conv2d` to create a classical convnet i
Now that we have the running dummy code we can get to more advanced topics:
- [For PyTorch users](../guide/cheatsheet.md)
- [Running existing models](../inference/inference.md)
- [Training models](../training/training.md)
- [For PyTorch users](./guide/cheatsheet.md)
- [Running existing models](./inference/README.md)
- [Training models](./training/README.md)

View File

@ -12,7 +12,7 @@ readme = "README.md"
[dependencies]
accelerate-src = { workspace = true, optional = true }
byteorder = { workspace = true }
candle-kernels = { path = "../candle-kernels", version = "0.2.3", optional = true }
candle-kernels = { path = "../candle-kernels", version = "0.2.1", optional = true }
cudarc = { workspace = true, optional = true }
gemm = { workspace = true }
half = { workspace = true }

View File

@ -370,38 +370,6 @@ pub fn vd_sqr(a: &[f64], y: &mut [f64]) {
y.iter_mut().zip(a.iter()).for_each(|(y, a)| *y = *a * *a)
}
#[inline]
pub fn vs_tanh_inplace(y: &mut [f32]) {
unsafe { ffi::vvtanhf(y.as_mut_ptr(), y.as_ptr(), &(y.len() as i32)) }
}
#[inline]
pub fn vd_tanh_inplace(y: &mut [f64]) {
unsafe { ffi::vvtanh(y.as_mut_ptr(), y.as_ptr(), &(y.len() as i32)) }
}
#[inline]
pub fn vs_gelu(vs: &[f32], ys: &mut [f32]) {
for (&v, y) in vs.iter().zip(ys.iter_mut()) {
*y = (2.0f32 / std::f32::consts::PI).sqrt() * v * (1.0 + 0.044715 * v * v)
}
vs_tanh_inplace(ys);
for (&v, y) in vs.iter().zip(ys.iter_mut()) {
*y = 0.5 * v * (1.0 + *y)
}
}
#[inline]
pub fn vd_gelu(vs: &[f64], ys: &mut [f64]) {
for (&v, y) in vs.iter().zip(ys.iter_mut()) {
*y = (2.0f64 / std::f64::consts::PI).sqrt() * v * (1.0 + 0.044715 * v * v)
}
vd_tanh_inplace(ys);
for (&v, y) in vs.iter().zip(ys.iter_mut()) {
*y = 0.5 * v * (1.0 + *y)
}
}
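
For reference, `vs_gelu`/`vd_gelu` above implement the tanh-based GELU approximation. A standalone scalar sketch of the same computation (illustrative only, not part of this diff):

```rust
// Scalar form of the approximation used above:
// gelu(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
fn gelu_scalar(x: f32) -> f32 {
    let inner = (2.0f32 / std::f32::consts::PI).sqrt() * (x + 0.044715 * x * x * x);
    0.5 * x * (1.0 + inner.tanh())
}
```
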
macro_rules! binary_op {
($fn_name:ident, $ty:ty, $accelerate_name:ident) => {
#[inline]

View File

@ -57,7 +57,6 @@ pub trait BackendStorage: Sized {
fn avg_pool2d(&self, _: &Layout, _: (usize, usize), _: (usize, usize)) -> Result<Self>;
fn max_pool2d(&self, _: &Layout, _: (usize, usize), _: (usize, usize)) -> Result<Self>;
fn upsample_nearest1d(&self, _: &Layout, _: usize) -> Result<Self>;
fn upsample_nearest2d(&self, _: &Layout, _: usize, _: usize) -> Result<Self>;
fn gather(&self, _: &Layout, _: &Self, _: &Layout, _: usize) -> Result<Self>;

View File

@ -91,14 +91,13 @@ impl Tensor {
}
}
Op::Reshape(node)
| Op::UpsampleNearest1D(node)
| Op::UpsampleNearest2D(node)
| Op::AvgPool2D { arg: node, .. }
| Op::MaxPool2D { arg: node, .. }
| Op::Copy(node)
| Op::Broadcast(node)
| Op::Cmp(node, _)
| Op::Reduce(node, ReduceOp::Min | ReduceOp::Sum | ReduceOp::Max, _)
| Op::Reduce(node, _, _)
| Op::ToDType(node)
| Op::ToDevice(node)
| Op::Transpose(node, _, _)
@ -112,7 +111,6 @@ impl Tensor {
track_grad |= tg;
nodes
}
Op::Reduce(_, ReduceOp::ArgMin | ReduceOp::ArgMax, _) => nodes,
}
} else {
nodes
@ -264,9 +262,6 @@ impl Tensor {
let sum_grad = grads.or_insert(arg)?;
*sum_grad = sum_grad.add(&grad_arg)?;
}
Op::UpsampleNearest1D { .. } => Err(Error::BackwardNotSupported {
op: "upsample-nearest1d",
})?,
Op::UpsampleNearest2D { .. } => Err(Error::BackwardNotSupported {
op: "upsample-nearest2d",
})?,
@ -522,7 +517,6 @@ impl Tensor {
}
}
#[derive(Debug)]
pub struct GradStore(HashMap<TensorId, Tensor>);
impl GradStore {

View File

@ -4,9 +4,6 @@ use crate::{DType, Error, IntDType, Layout, Result, Shape, WithDType};
use half::{bf16, f16};
use rayon::prelude::*;
const USE_IM2COL_CONV1D: bool = true;
const USE_IM2COL_CONV2D: bool = true;
// TODO: Maybe we should not implement [Clone] here and instead have an explicit allocator +
// intercept the oom errors to avoid panicking and provide a proper error.
#[derive(Debug, Clone)]
@ -727,36 +724,6 @@ impl Map1 for MaxPool2D {
}
}
struct UpsampleNearest1D(usize);
impl Map1 for UpsampleNearest1D {
fn f<T: WithDType>(&self, src: &[T], layout: &Layout) -> Result<Vec<T>> {
// TODO: Specialized implementation for the case 2*sz?
let dst_sz = self.0;
let (b_sz, c, src_sz) = layout.shape().dims3()?;
let stride = layout.stride();
let stride_sz = stride[2];
let src_index = layout.start_offset();
let scale_sz = src_sz as f64 / dst_sz as f64;
let mut dst = vec![T::zero(); b_sz * c * dst_sz];
let src_idxs = (0..dst_sz)
.map(|idx| usize::min(src_sz - 1, (idx as f64 * scale_sz) as usize))
.collect::<Vec<_>>();
for b_idx in 0..b_sz {
let dst = &mut dst[b_idx * c * dst_sz..];
let src_index = src_index + b_idx * stride[0];
for c_idx in 0..c {
let dst = &mut dst[c_idx * dst_sz..];
let src_index = src_index + c_idx * stride[1];
for (idx, src_idx) in src_idxs.iter().enumerate() {
dst[idx] = src[src_index + src_idx * stride_sz]
}
}
}
Ok(dst)
}
}
struct UpsampleNearest2D(usize, usize);
impl Map1 for UpsampleNearest2D {
@ -1122,140 +1089,6 @@ impl<'a> Map2 for Conv1D<'a> {
}
}
struct Im2Col1D {
l_k: usize,
stride: usize,
dilation: usize,
padding: usize,
}
impl Im2Col1D {
fn l_out(&self, l: usize) -> usize {
(l + 2 * self.padding - self.dilation * (self.l_k - 1) - 1) / self.stride + 1
}
}
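
The `l_out` helper above is the usual convolution output-length formula (the 2-d `Im2Col::hw_out` further down applies the same formula to each spatial dimension). A quick standalone check with made-up numbers, not taken from the diff:

```rust
// Conv1d output length: (l + 2*padding - dilation*(l_k - 1) - 1) / stride + 1
fn l_out(l: usize, l_k: usize, stride: usize, dilation: usize, padding: usize) -> usize {
    (l + 2 * padding - dilation * (l_k - 1) - 1) / stride + 1
}

fn main() {
    // A length-10 input with kernel size 3, stride 1, dilation 1, no padding -> 8 positions.
    assert_eq!(l_out(10, 3, 1, 1, 0), 8);
}
```
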
impl Map1 for Im2Col1D {
fn f<T: WithDType>(&self, vs: &[T], layout: &Layout) -> Result<Vec<T>> {
let &Self {
l_k,
stride,
dilation,
padding,
} = self;
let (b, c, l) = layout.shape().dims3()?;
let l_out = self.l_out(l);
let src = &vs[layout.start_offset()..];
let mut dst = vec![T::zero(); b * l_out * c * l_k];
let (src_s0, src_s1, src_s2) = {
let s = layout.stride();
(s[0], s[1], s[2])
};
// TODO: provide specialized kernels for the common use cases.
// - l_k = 1
// - padding = 0
// - stride = 1
// - dilation = 1
for b_idx in 0..b {
let src_idx = b_idx * src_s0;
let dst_idx = b_idx * l_out * c * l_k;
for l_idx in 0..l_out {
let dst_idx = dst_idx + l_idx * c * l_k;
for c_idx in 0..c {
let dst_idx = dst_idx + c_idx * l_k;
let src_idx = c_idx * src_s1 + src_idx;
for l_k_idx in 0..l_k {
let src_l = l_idx * stride + l_k_idx * dilation;
if padding != 0 && (src_l < padding || src_l >= l + padding) {
continue;
}
let src_l = src_l - padding;
let src_idx = src_idx + src_l * src_s2;
let dst_idx = dst_idx + l_k_idx;
dst[dst_idx] = src[src_idx]
}
}
}
}
Ok(dst)
}
}
struct Im2Col {
h_k: usize,
w_k: usize,
stride: usize,
dilation: usize,
padding: usize,
}
impl Im2Col {
fn hw_out(&self, h: usize, w: usize) -> (usize, usize) {
let h_out = (h + 2 * self.padding - self.dilation * (self.h_k - 1) - 1) / self.stride + 1;
let w_out = (w + 2 * self.padding - self.dilation * (self.w_k - 1) - 1) / self.stride + 1;
(h_out, w_out)
}
}
impl Map1 for Im2Col {
fn f<T: WithDType>(&self, vs: &[T], layout: &Layout) -> Result<Vec<T>> {
let &Self {
h_k,
w_k,
stride,
dilation,
padding,
} = self;
let (b, c, h, w) = layout.shape().dims4()?;
let (h_out, w_out) = self.hw_out(h, w);
let src = &vs[layout.start_offset()..];
let mut dst = vec![T::zero(); b * h_out * w_out * c * h_k * w_k];
let (src_s0, src_s1, src_s2, src_s3) = {
let s = layout.stride();
(s[0], s[1], s[2], s[3])
};
// TODO: provide specialized kernels for the common use cases.
// - h_k = w_k = 1
// - padding = 0
// - stride = 1
// - dilation = 1
for b_idx in 0..b {
let src_idx = b_idx * src_s0;
let dst_idx = b_idx * h_out * w_out * c * h_k * w_k;
for h_idx in 0..h_out {
let dst_idx = dst_idx + h_idx * w_out * c * h_k * w_k;
for w_idx in 0..w_out {
let dst_idx = dst_idx + w_idx * c * h_k * w_k;
for c_idx in 0..c {
let dst_idx = dst_idx + c_idx * h_k * w_k;
let src_idx = c_idx * src_s1 + src_idx;
for h_k_idx in 0..h_k {
let src_h = h_idx * stride + h_k_idx * dilation;
if padding != 0 && (src_h < padding || src_h >= h + padding) {
continue;
}
let src_h = src_h - padding;
let src_idx = src_idx + src_h * src_s2;
let dst_idx = dst_idx + h_k_idx * w_k;
for w_k_idx in 0..w_k {
let src_w = w_idx * stride + w_k_idx * dilation;
if padding != 0 && (src_w < padding || src_w >= w + padding) {
continue;
}
let src_w = src_w - padding;
let src_idx = src_idx + src_w * src_s3;
let dst_idx = dst_idx + w_k_idx;
dst[dst_idx] = src[src_idx]
}
}
}
}
}
}
Ok(dst)
}
}
struct Conv2D<'a>(&'a crate::conv::ParamsConv2D);
impl<'a> Map2 for Conv2D<'a> {
@ -1461,9 +1294,8 @@ impl Map2 for MatMul {
) -> Result<Vec<T>> {
use gemm::{gemm, Parallelism};
match T::DTYPE {
DType::F16 | DType::F32 | DType::F64 => {}
_ => Err(Error::UnsupportedDTypeForOp(T::DTYPE, "matmul").bt())?,
if T::DTYPE == DType::BF16 {
return Err(Error::UnsupportedDTypeForOp(T::DTYPE, "matmul").bt())?;
}
let (b, m, n, k) = self.0;
@ -2167,10 +1999,6 @@ impl BackendStorage for CpuStorage {
MaxPool2D(kernel_size, stride).map(self, layout)
}
fn upsample_nearest1d(&self, layout: &Layout, sz: usize) -> Result<Self> {
UpsampleNearest1D(sz).map(self, layout)
}
fn upsample_nearest2d(&self, layout: &Layout, h: usize, w: usize) -> Result<Self> {
UpsampleNearest2D(h, w).map(self, layout)
}
@ -2399,40 +2227,7 @@ impl BackendStorage for CpuStorage {
kernel_l: &Layout,
params: &crate::conv::ParamsConv1D,
) -> Result<Self> {
if !USE_IM2COL_CONV1D {
return Conv1D(params).map(self, l, kernel, kernel_l);
}
let op = Im2Col1D {
l_k: params.k_size,
padding: params.padding,
stride: params.stride,
dilation: params.dilation,
};
let col = op.map(self, l)?;
let b = params.b_size;
let n = params.c_out;
let l_out = params.l_out();
let k = op.l_k * params.c_in;
let m = l_out;
let col_l = Layout::contiguous((b, m, k));
let res = if kernel_l.is_contiguous() {
let kernel_l = Layout::contiguous_with_offset((1, n, k), kernel_l.start_offset())
.transpose(1, 2)?
.broadcast_as((b, k, n))?;
col.matmul(kernel, (b, m, n, k), &col_l, &kernel_l)?
} else {
// Make the kernel contiguous if not already the case.
let mut kernel_c = self.device().zeros_impl(kernel_l.shape(), kernel.dtype())?;
kernel.copy_strided_src(&mut kernel_c, 0, kernel_l)?;
let kernel_l = Layout::contiguous_with_offset((1, n, k), kernel_l.start_offset())
.transpose(1, 2)?
.broadcast_as((b, k, n))?;
col.matmul(kernel, (b, m, n, k), &col_l, &kernel_l)?
};
let res_l = Layout::contiguous((b, l_out, params.c_out)).transpose(1, 2)?;
let mut res_t = self.device().zeros_impl(res_l.shape(), res.dtype())?;
res.copy_strided_src(&mut res_t, 0, &res_l)?;
Ok(res_t)
Conv1D(params).map(self, l, kernel, kernel_l)
}
fn conv2d(
@ -2442,43 +2237,7 @@ impl BackendStorage for CpuStorage {
kernel_l: &Layout,
params: &crate::conv::ParamsConv2D,
) -> Result<Self> {
if !USE_IM2COL_CONV2D {
return Conv2D(params).map(self, l, kernel, kernel_l);
}
let op = Im2Col {
h_k: params.k_h,
w_k: params.k_w,
padding: params.padding,
stride: params.stride,
dilation: params.dilation,
};
let col = op.map(self, l)?;
let b = params.b_size;
let n = params.c_out;
let (h_out, w_out) = (params.out_h(), params.out_w());
let k = op.h_k * op.w_k * params.c_in;
let m = h_out * w_out;
let col_l = Layout::contiguous((b, m, k));
let res = if kernel_l.is_contiguous() {
let kernel_l = Layout::contiguous_with_offset((1, n, k), kernel_l.start_offset())
.transpose(1, 2)?
.broadcast_as((b, k, n))?;
col.matmul(kernel, (b, m, n, k), &col_l, &kernel_l)?
} else {
// Make the kernel contiguous if not already the case.
let mut kernel_c = self.device().zeros_impl(kernel_l.shape(), kernel.dtype())?;
kernel.copy_strided_src(&mut kernel_c, 0, kernel_l)?;
let kernel_l = Layout::contiguous_with_offset((1, n, k), kernel_l.start_offset())
.transpose(1, 2)?
.broadcast_as((b, k, n))?;
col.matmul(kernel, (b, m, n, k), &col_l, &kernel_l)?
};
let res_l = Layout::contiguous((b, h_out, w_out, params.c_out))
.transpose(1, 2)?
.transpose(1, 3)?;
let mut res_t = self.device().zeros_impl(res_l.shape(), res.dtype())?;
res.copy_strided_src(&mut res_t, 0, &res_l)?;
Ok(res_t)
Conv2D(params).map(self, l, kernel, kernel_l)
}
fn conv_transpose2d(

View File

@ -312,13 +312,6 @@ impl BackendDevice for CudaDevice {
// cudarc changes.
let elem_count = shape.elem_count();
let curand = self.curand.lock().unwrap();
// curand can only generate an odd number of values.
// https://github.com/huggingface/candle/issues/734
let elem_count_round = if elem_count % 2 == 1 {
elem_count + 1
} else {
elem_count
};
let slice = match dtype {
DType::U8 | DType::U32 | DType::I64 | DType::F16 | DType::BF16 => {
Err(CudaError::UnsupportedDtype {
@ -328,7 +321,7 @@ impl BackendDevice for CudaDevice {
.w()?
}
DType::F32 => {
let mut data = unsafe { self.alloc::<f32>(elem_count_round) }.w()?;
let mut data = unsafe { self.alloc::<f32>(elem_count) }.w()?;
curand
.0
.fill_with_normal(&mut data, mean as f32, std as f32)
@ -336,7 +329,7 @@ impl BackendDevice for CudaDevice {
CudaStorageSlice::F32(data)
}
DType::F64 => {
let mut data = unsafe { self.alloc::<f64>(elem_count_round) }.w()?;
let mut data = unsafe { self.alloc::<f64>(elem_count) }.w()?;
curand.0.fill_with_normal(&mut data, mean, std).w()?;
CudaStorageSlice::F64(data)
}
@ -600,105 +593,6 @@ impl Map1 for Elu {
}
}
struct Im2Col1D {
l_k: usize,
stride: usize,
dilation: usize,
padding: usize,
}
impl Im2Col1D {
fn l_out(&self, l: usize) -> usize {
(l + 2 * self.padding - self.dilation * (self.l_k - 1) - 1) / self.stride + 1
}
}
impl Map1 for Im2Col1D {
fn f<T: DeviceRepr + WithDType>(
&self,
src: &CudaSlice<T>,
dev: &CudaDevice,
layout: &Layout,
) -> Result<CudaSlice<T>> {
let shape = layout.shape();
let dims = shape.dims();
let l_out = self.l_out(dims[2]);
let dst_el = dims[0] * l_out * dims[1] * self.l_k;
let cfg = LaunchConfig::for_num_elems(dst_el as u32);
let ds = dev.htod_copy([dims, layout.stride()].concat()).w()?;
let src = &src.slice(layout.start_offset()..);
let func = dev.get_or_load_func(&kernel_name::<T>("im2col1d"), kernels::CONV)?;
// SAFETY: Set later by running the kernel.
let dst = unsafe { dev.alloc::<T>(dst_el) }.w()?;
let params = (
dst_el,
l_out,
self.l_k,
self.stride,
self.padding,
self.dilation,
&ds,
src,
&dst,
);
// SAFETY: ffi.
unsafe { func.launch(cfg, params) }.w()?;
Ok(dst)
}
}
struct Im2Col {
h_k: usize,
w_k: usize,
stride: usize,
dilation: usize,
padding: usize,
}
impl Im2Col {
fn hw_out(&self, h: usize, w: usize) -> (usize, usize) {
let h_out = (h + 2 * self.padding - self.dilation * (self.h_k - 1) - 1) / self.stride + 1;
let w_out = (w + 2 * self.padding - self.dilation * (self.w_k - 1) - 1) / self.stride + 1;
(h_out, w_out)
}
}
impl Map1 for Im2Col {
fn f<T: DeviceRepr + WithDType>(
&self,
src: &CudaSlice<T>,
dev: &CudaDevice,
layout: &Layout,
) -> Result<CudaSlice<T>> {
let shape = layout.shape();
let dims = shape.dims();
let (h_out, w_out) = self.hw_out(dims[2], dims[3]);
let dst_el = dims[0] * h_out * w_out * dims[1] * self.h_k * self.w_k;
let cfg = LaunchConfig::for_num_elems(dst_el as u32);
let ds = dev.htod_copy([dims, layout.stride()].concat()).w()?;
let src = &src.slice(layout.start_offset()..);
let func = dev.get_or_load_func(&kernel_name::<T>("im2col"), kernels::CONV)?;
// SAFETY: Set later by running the kernel.
let dst = unsafe { dev.alloc::<T>(dst_el) }.w()?;
let params = (
dst_el,
h_out,
w_out,
self.h_k,
self.w_k,
self.stride,
self.padding,
self.dilation,
&ds,
src,
&dst,
);
// SAFETY: ffi.
unsafe { func.launch(cfg, params) }.w()?;
Ok(dst)
}
}
struct Powf(f64);
impl Map1 for Powf {
fn f<T: DeviceRepr + WithDType>(
@ -1756,46 +1650,9 @@ impl BackendStorage for CudaStorage {
kernel_l: &Layout,
params: &crate::conv::ParamsConv1D,
) -> Result<Self> {
const USE_IM2COL_CONV1D: bool = true;
let device = self.device().clone();
if !USE_IM2COL_CONV1D {
let slice = Conv1D(params).map(&self.slice, l, &kernel.slice, kernel_l, &device)?;
return Ok(Self { slice, device });
}
let col = Im2Col1D {
l_k: params.k_size,
stride: params.stride,
dilation: params.dilation,
padding: params.padding,
}
.map(&self.slice, &device, l)?;
let col = Self { slice: col, device };
let l_out = params.l_out();
let b = params.b_size;
let n = params.c_out;
let k = params.k_size * params.c_in;
let m = l_out;
let col_l = Layout::contiguous((b, m, k));
let res = if kernel_l.is_contiguous() {
let kernel_l = Layout::contiguous_with_offset((1, n, k), kernel_l.start_offset())
.transpose(1, 2)?
.broadcast_as((b, k, n))?;
col.matmul(kernel, (b, m, n, k), &col_l, &kernel_l)?
} else {
// Make the kernel contiguous if not already the case.
let mut kernel_c = self.device().zeros_impl(kernel_l.shape(), kernel.dtype())?;
kernel.copy_strided_src(&mut kernel_c, 0, kernel_l)?;
let kernel_l = Layout::contiguous_with_offset((1, n, k), kernel_l.start_offset())
.transpose(1, 2)?
.broadcast_as((b, k, n))?;
col.matmul(kernel, (b, m, n, k), &col_l, &kernel_l)?
};
let res_l = Layout::contiguous((b, l_out, n)).transpose(1, 2)?;
let mut res_t = self.device().zeros_impl(res_l.shape(), res.dtype())?;
res.copy_strided_src(&mut res_t, 0, &res_l)?;
Ok(res_t)
let slice = Conv1D(params).map(&self.slice, l, &kernel.slice, kernel_l, &device)?;
Ok(Self { slice, device })
}
#[cfg(not(feature = "cudnn"))]
@ -1806,50 +1663,9 @@ impl BackendStorage for CudaStorage {
kernel_l: &Layout,
params: &crate::conv::ParamsConv2D,
) -> Result<Self> {
const USE_IM2COL_CONV2D: bool = true;
let device = self.device().clone();
if !USE_IM2COL_CONV2D {
let slice = Conv2D(params).map(&self.slice, l, &kernel.slice, kernel_l, &device)?;
return Ok(Self { slice, device });
}
let col = Im2Col {
h_k: params.k_h,
w_k: params.k_w,
stride: params.stride,
dilation: params.dilation,
padding: params.padding,
}
.map(&self.slice, &device, l)?;
let col = Self { slice: col, device };
let h_out = params.out_h();
let w_out = params.out_w();
let b = params.b_size;
let n = params.c_out;
let k = params.k_h * params.k_w * params.c_in;
let m = h_out * w_out;
let col_l = Layout::contiguous((b, m, k));
let res = if kernel_l.is_contiguous() {
let kernel_l = Layout::contiguous_with_offset((1, n, k), kernel_l.start_offset())
.transpose(1, 2)?
.broadcast_as((b, k, n))?;
col.matmul(kernel, (b, m, n, k), &col_l, &kernel_l)?
} else {
// Make the kernel contiguous if not already the case.
let mut kernel_c = self.device().zeros_impl(kernel_l.shape(), kernel.dtype())?;
kernel.copy_strided_src(&mut kernel_c, 0, kernel_l)?;
let kernel_l = Layout::contiguous_with_offset((1, n, k), kernel_l.start_offset())
.transpose(1, 2)?
.broadcast_as((b, k, n))?;
col.matmul(kernel, (b, m, n, k), &col_l, &kernel_l)?
};
let res_l = Layout::contiguous((b, h_out, w_out, n))
.transpose(1, 2)?
.transpose(1, 3)?;
let mut res_t = self.device().zeros_impl(res_l.shape(), res.dtype())?;
res.copy_strided_src(&mut res_t, 0, &res_l)?;
Ok(res_t)
let slice = Conv2D(params).map(&self.slice, l, &kernel.slice, kernel_l, &device)?;
Ok(Self { slice, device })
}
#[cfg(feature = "cudnn")]
@ -1954,10 +1770,6 @@ impl BackendStorage for CudaStorage {
Ok(Self { slice, device })
}
fn upsample_nearest1d(&self, _: &Layout, _out_sz: usize) -> Result<Self> {
crate::bail!("upsample-nearest1d is not supported on cuda")
}
fn upsample_nearest2d(&self, l: &Layout, out_w: usize, out_h: usize) -> Result<Self> {
let device = self.device().clone();
let slice = UpsampleNearest2D(out_w, out_h).map(&self.slice, &device, l)?;

View File

@ -152,10 +152,6 @@ impl crate::backend::BackendStorage for CudaStorage {
Err(Error::NotCompiledWithCudaSupport)
}
fn upsample_nearest1d(&self, _: &Layout, _: usize) -> Result<Self> {
Err(Error::NotCompiledWithCudaSupport)
}
fn upsample_nearest2d(&self, _: &Layout, _: usize, _: usize) -> Result<Self> {
Err(Error::NotCompiledWithCudaSupport)
}

View File

@ -46,31 +46,19 @@ impl Tensor {
current_dim += 1;
out
}
TensorIndexer::IndexSelect(indexes) => {
if indexes.rank() != 1 {
crate::bail!("multi-dimensional tensor indexing is not supported")
}
let out = x.index_select(&indexes.to_device(x.device())?, current_dim)?;
current_dim += 1;
out
}
TensorIndexer::Err(e) => crate::bail!("indexing error {e:?}"),
};
}
Ok(x)
}
}
#[derive(Debug)]
#[derive(Debug, Clone)]
/// Generic structure used to index a slice of the tensor
pub enum TensorIndexer {
/// This selects the elements for which an index has some specific value.
Select(usize),
/// This is a regular slice, purely indexing a chunk of the tensor
Narrow(Bound<usize>, Bound<usize>),
/// Indexing via a 1d tensor
IndexSelect(Tensor),
Err(Error),
}
impl From<usize> for TensorIndexer {
@ -79,31 +67,6 @@ impl From<usize> for TensorIndexer {
}
}
impl From<&[u32]> for TensorIndexer {
fn from(index: &[u32]) -> Self {
match Tensor::new(index, &crate::Device::Cpu) {
Ok(tensor) => TensorIndexer::IndexSelect(tensor),
Err(e) => TensorIndexer::Err(e),
}
}
}
impl From<Vec<u32>> for TensorIndexer {
fn from(index: Vec<u32>) -> Self {
let len = index.len();
match Tensor::from_vec(index, len, &crate::Device::Cpu) {
Ok(tensor) => TensorIndexer::IndexSelect(tensor),
Err(e) => TensorIndexer::Err(e),
}
}
}
impl From<&Tensor> for TensorIndexer {
fn from(tensor: &Tensor) -> Self {
TensorIndexer::IndexSelect(tensor.clone())
}
}
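
These `From` impls are what allow a 1-d index tensor (or slice of indices) to be passed directly to `.i(..)`. A small usage sketch of the `IndexSelect` path shown above (hypothetical values, illustrative only):

```rust
use candle_core::{Device, IndexOp, Result, Tensor};

fn demo() -> Result<()> {
    let t = Tensor::new(&[[1u32, 2, 3], [4, 5, 6]], &Device::Cpu)?;
    let rows = Tensor::new(&[1u32, 0], &Device::Cpu)?;
    // Selects rows 1 and 0 via index_select on the first dimension.
    let picked = t.i(&rows)?;
    assert_eq!(picked.dims(), [2, 3]);
    Ok(())
}
```
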
macro_rules! impl_from_range {
($range_type:ty) => {
impl From<$range_type> for TensorIndexer {

View File

@ -110,8 +110,14 @@ impl ToUsize2 for (usize, usize) {
}
// A simple trait defining a module with forward method using a single argument.
pub trait Module {
pub trait Module: std::fmt::Debug {
fn forward(&self, xs: &Tensor) -> Result<Tensor>;
/// Change the module to use training mode vs eval mode.
///
/// The default implementation does nothing as this is only used for a couple modules such as
/// dropout or batch-normalization.
fn set_training(&mut self, _training: bool) {}
}
impl Module for quantized::QMatMul {
@ -119,9 +125,3 @@ impl Module for quantized::QMatMul {
self.forward(xs)
}
}
impl<T: Fn(&Tensor) -> Result<Tensor>> Module for T {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
self(xs)
}
}
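
For orientation, a minimal sketch of implementing the `Module` trait above for a custom layer (a hypothetical layer, assuming the `candle_core`/`candle_nn` crates used throughout this diff):

```rust
use candle_core::{Result, Tensor};
use candle_nn::Module;

// A toy layer that rescales its input by a fixed tensor.
#[derive(Debug)]
struct Scale {
    factor: Tensor,
}

impl Module for Scale {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        xs.broadcast_mul(&self.factor)
    }
    // In the variant that adds `set_training`, the default no-op body means a
    // stateless layer like this one does not need to override it.
}
```
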

View File

@ -116,7 +116,6 @@ pub enum Op {
stride: (usize, usize),
},
UpsampleNearest1D(Tensor),
UpsampleNearest2D(Tensor),
Cat(Vec<Tensor>, usize),
@ -601,24 +600,6 @@ impl UnaryOpT for Gelu {
fn f64_vec(xs: &[f64], ys: &mut [f64]) {
crate::mkl::vd_gelu(xs, ys)
}
#[cfg(feature = "accelerate")]
const F32_VEC: bool = true;
#[cfg(feature = "accelerate")]
#[inline(always)]
fn f32_vec(xs: &[f32], ys: &mut [f32]) {
crate::accelerate::vs_gelu(xs, ys)
}
#[cfg(feature = "accelerate")]
const F64_VEC: bool = true;
#[cfg(feature = "accelerate")]
#[inline(always)]
fn f64_vec(xs: &[f64], ys: &mut [f64]) {
crate::accelerate::vd_gelu(xs, ys)
}
}
impl UnaryOpT for Relu {

View File

@ -78,7 +78,11 @@ impl st::View for &Tensor {
}
impl Tensor {
pub fn save_safetensors<P: AsRef<Path>>(&self, name: &str, filename: P) -> Result<()> {
pub fn save_safetensors<P: AsRef<std::path::Path>>(
&self,
name: &str,
filename: P,
) -> Result<()> {
let data = [(name, self.clone())];
Ok(st::serialize_to_file(data, &None, filename.as_ref())?)
}
@ -263,7 +267,7 @@ impl MmapedFile {
/// # Safety
///
/// The unsafe is inherited from [`memmap2::MmapOptions`].
pub unsafe fn new<P: AsRef<Path>>(p: P) -> Result<Self> {
pub unsafe fn new<P: AsRef<std::path::Path>>(p: P) -> Result<Self> {
let p = p.as_ref();
let file = std::fs::File::open(p).map_err(|e| Error::from(e).with_path(p))?;
let inner = memmap2::MmapOptions::new()

View File

@ -444,18 +444,6 @@ impl<D1: Dim, D2: Dim, D3: Dim, D4: Dim, D5: Dim> Dims for (D1, D2, D3, D4, D5)
}
}
impl<D1: Dim, D2: Dim, D3: Dim, D4: Dim, D5: Dim, D6: Dim> Dims for (D1, D2, D3, D4, D5, D6) {
fn to_indexes_internal(self, shape: &Shape, op: &'static str) -> Result<Vec<usize>> {
let d0 = self.0.to_index(shape, op)?;
let d1 = self.1.to_index(shape, op)?;
let d2 = self.2.to_index(shape, op)?;
let d3 = self.3.to_index(shape, op)?;
let d4 = self.4.to_index(shape, op)?;
let d5 = self.5.to_index(shape, op)?;
Ok(vec![d0, d1, d2, d3, d4, d5])
}
}
extract_dims!(dims0, 0, |_: &[usize]| (), ());
extract_dims!(dims1, 1, |d: &[usize]| d[0], usize);
extract_dims!(dims2, 2, |d: &[usize]| (d[0], d[1]), (usize, usize));

View File

@ -369,19 +369,6 @@ impl Storage {
}
}
pub(crate) fn upsample_nearest1d(&self, layout: &Layout, sz: usize) -> Result<Self> {
match self {
Storage::Cpu(storage) => {
let storage = storage.upsample_nearest1d(layout, sz)?;
Ok(Self::Cpu(storage))
}
Self::Cuda(storage) => {
let storage = storage.upsample_nearest1d(layout, sz)?;
Ok(Self::Cuda(storage))
}
}
}
pub(crate) fn upsample_nearest2d(&self, layout: &Layout, h: usize, w: usize) -> Result<Self> {
match self {
Storage::Cpu(storage) => {

View File

@ -105,28 +105,6 @@ macro_rules! binary_op {
};
}
macro_rules! binary_op_scalar {
($fn_name:ident, $op_name:ident) => {
pub fn $fn_name<T: TensorOrScalar>(&self, rhs: T) -> Result<Self> {
let rhs = match rhs.to_tensor_scalar()? {
crate::scalar::TensorScalar::Tensor(rhs) => rhs,
crate::scalar::TensorScalar::Scalar(rhs) => rhs
.to_dtype(self.dtype())?
.to_device(self.device())?
.broadcast_as(self.shape())?,
};
let shape = self.same_shape_binary_op(&rhs, stringify!($fn_name))?;
let storage = self.storage().binary_impl::<crate::op::$op_name>(
&*rhs.storage(),
self.layout(),
rhs.layout(),
)?;
let op = BackpropOp::new2(self, &rhs, |t1, t2| Op::Binary(t1, t2, BinaryOp::$op_name));
Ok(from_storage(storage, shape.clone(), op, false))
}
};
}
macro_rules! broadcast_binary_op {
($fn_name:ident, $inner_fn_name:ident) => {
pub fn $fn_name(&self, rhs: &Self) -> Result<Self> {
@ -469,8 +447,8 @@ impl Tensor {
binary_op!(mul, Mul);
binary_op!(sub, Sub);
binary_op!(div, Div);
binary_op_scalar!(maximum, Maximum);
binary_op_scalar!(minimum, Minimum);
binary_op!(maximum, Maximum);
binary_op!(minimum, Minimum);
broadcast_binary_op!(broadcast_add, add);
broadcast_binary_op!(broadcast_mul, mul);
broadcast_binary_op!(broadcast_sub, sub);
@ -666,12 +644,7 @@ impl Tensor {
let storage = self.storage().reduce_op(op, self.layout(), &[dim])?;
let mut dims = self.dims().to_vec();
dims[dim] = 1;
let op = match op {
ReduceOp::Sum | ReduceOp::Min | ReduceOp::Max => {
BackpropOp::new1(self, |arg| Op::Reduce(arg, op, dims.to_vec()))
}
ReduceOp::ArgMin | ReduceOp::ArgMax => BackpropOp::none(),
};
let op = BackpropOp::new1(self, |arg| Op::Reduce(arg, op, dims.to_vec()));
let res = from_storage(storage, dims, op, false);
if keepdim {
Ok(res)
@ -854,35 +827,12 @@ impl Tensor {
self.cmp(rhs, CmpOp::Le)
}
/// Clamp the tensor values to be between `min` and `max`.
pub fn clamp<T1: TensorOrScalar, T2: TensorOrScalar>(&self, min: T1, max: T2) -> Result<Self> {
self.maximum(min)?.minimum(max)
}
/// Interpolate the input tensor to the `target_size` size, taking the value of the nearest element.
///
/// The input tensor should have three dimensions, `(batch, channels, l)`, the returned
/// tensor also has three dimensions, `(batch, channels, target_size)`.
pub fn interpolate1d(&self, target_size: usize) -> Result<Self> {
let (n, c, _l) = self.dims3()?;
let op = BackpropOp::new1(self, Op::UpsampleNearest1D);
let storage = self
.storage()
.upsample_nearest1d(self.layout(), target_size)?;
Ok(from_storage(storage, (n, c, target_size), op, false))
}
/// Alias for `interpolate1d`.
pub fn upsample_nearest1d(&self, target_size: usize) -> Result<Self> {
self.interpolate1d(target_size)
}
/// Interpolate the input tensor to the `(target_h, target_w)` size, taking the value of the
/// Upsample the input tensor to the `(target_h, target_w)` size, taking the value of the
/// nearest element.
///
/// The input tensor should have four dimensions, `(batch, channels, h, w)`, the returned
/// tensor also has four dimensions, `(batch, channels, target_h, target_w)`.
pub fn interpolate2d(&self, target_h: usize, target_w: usize) -> Result<Self> {
pub fn upsample_nearest2d(&self, target_h: usize, target_w: usize) -> Result<Self> {
let (n, c, _h, _w) = self.dims4()?;
let op = BackpropOp::new1(self, Op::UpsampleNearest2D);
let storage = self
@ -891,11 +841,6 @@ impl Tensor {
Ok(from_storage(storage, (n, c, target_h, target_w), op, false))
}
/// Alias for `interpolate2d`.
pub fn upsample_nearest2d(&self, target_h: usize, target_w: usize) -> Result<Self> {
self.interpolate2d(target_h, target_w)
}
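
A small usage sketch for the nearest-neighbour interpolation API covered by this hunk (hypothetical shapes, illustrative only):

```rust
use candle_core::{DType, Device, Result, Tensor};

fn demo() -> Result<()> {
    // (batch, channels, h, w) -> (batch, channels, target_h, target_w)
    let t = Tensor::zeros((1, 3, 4, 4), DType::F32, &Device::Cpu)?;
    let up = t.upsample_nearest2d(8, 8)?;
    assert_eq!(up.dims(), [1, 3, 8, 8]);
    Ok(())
}
```
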
/// 2D average pooling over an input tensor with multiple channels.
///
/// The input tensor should have four dimensions, `(batch, channels, h, w)`, the returned

View File

@ -33,17 +33,6 @@ fn tensor_2d(device: &Device) -> Result<()> {
Ok(())
}
fn clamp(device: &Device) -> Result<()> {
let data = &[[3f32, 1., 4., 1., 5.], [2., 1., 7., 8., 2.]];
let tensor = Tensor::new(data, device)?;
let tensor = tensor.clamp(1.5, 6.2)?;
assert_eq!(
tensor.to_vec2::<f32>()?,
[[3.0, 1.5, 4.0, 1.5, 5.0], [2.0, 1.5, 6.2, 6.2, 2.0]],
);
Ok(())
}
fn binary_op(device: &Device) -> Result<()> {
let data = &[[3f32, 1., 4., 1., 5.], [2., 1., 7., 8., 2.]];
let tensor1 = Tensor::new(data, device)?;
@ -888,14 +877,6 @@ fn broadcasting(device: &Device) -> Result<()> {
Ok(())
}
fn randn(device: &Device) -> Result<()> {
let tensor = Tensor::randn(0f32, 1f32, (5, 3), device)?;
assert_eq!(tensor.dims(), [5, 3]);
let tensor = Tensor::rand(0f32, 1f32, (5, 3), device)?;
assert_eq!(tensor.dims(), [5, 3]);
Ok(())
}
test_device!(zeros, zeros_cpu, zeros_gpu);
test_device!(add_mul, add_mul_cpu, add_mul_gpu);
test_device!(tensor_2d, tensor_2d_cpu, tensor_2d_gpu);
@ -918,8 +899,6 @@ test_device!(index_select, index_select_cpu, index_select_gpu);
test_device!(index_add, index_add_cpu, index_add_gpu);
test_device!(gather, gather_cpu, gather_gpu);
test_device!(scatter_add, scatter_add_cpu, scatter_add_gpu);
test_device!(randn, randn_cpu, randn_gpu);
test_device!(clamp, clamp_cpu, clamp_gpu);
// There was originally a bug on the CPU implementation for randn
// https://github.com/huggingface/candle/issues/381

View File

@ -11,8 +11,8 @@ readme = "README.md"
[dependencies]
byteorder = { workspace = true }
candle = { path = "../candle-core", version = "0.2.3", package = "candle-core" }
candle-nn = { path = "../candle-nn", version = "0.2.3" }
candle = { path = "../candle-core", version = "0.2.1", package = "candle-core" }
candle-nn = { path = "../candle-nn", version = "0.2.1" }
hf-hub = { workspace = true}
intel-mkl-src = { workspace = true, optional = true }
memmap2 = { workspace = true }

View File

@ -8,9 +8,13 @@ use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;
use std::io::{self, BufReader, Read};
fn read_u32<T: Read>(reader: &mut T) -> std::io::Result<u32> {
use byteorder::ReadBytesExt;
reader.read_u32::<byteorder::BigEndian>()
fn read_u32<T: Read>(reader: &mut T) -> Result<u32> {
let mut b = vec![0u8; 4];
reader.read_exact(&mut b)?;
let (result, _) = b.iter().rev().fold((0u64, 1u64), |(s, basis), &x| {
(s + basis * u64::from(x), basis * 256)
});
Ok(result as u32)
}
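
As a quick sanity check of the big-endian decoding above (a standalone sketch with made-up byte values):

```rust
fn main() {
    // read_u32 interprets four bytes as a big-endian u32: the last byte has
    // basis 1, the first byte basis 256^3.
    let bytes = [0u8, 0, 8, 3];
    let value = bytes.iter().fold(0u32, |acc, &b| (acc << 8) | u32::from(b));
    assert_eq!(value, 2051); // 0x00000803
}
```
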
fn check_magic_number<T: Read>(reader: &mut T, expected: u32) -> Result<()> {

View File

@ -11,20 +11,19 @@ readme = "README.md"
[dependencies]
accelerate-src = { workspace = true, optional = true }
candle = { path = "../candle-core", version = "0.2.3", package = "candle-core" }
candle-datasets = { path = "../candle-datasets", version = "0.2.3" }
candle-flash-attn = { path = "../candle-flash-attn", version = "0.2.3", optional = true }
candle-nn = { path = "../candle-nn", version = "0.2.3" }
candle-transformers = { path = "../candle-transformers", version = "0.2.3" }
cudarc = { workspace = true, optional = true }
half = { workspace = true, optional = true }
image = { workspace = true }
intel-mkl-src = { workspace = true, optional = true }
num-traits = { workspace = true }
rayon = { workspace = true }
candle = { path = "../candle-core", version = "0.2.1", package = "candle-core" }
candle-datasets = { path = "../candle-datasets", version = "0.2.1" }
candle-nn = { path = "../candle-nn", version = "0.2.1" }
candle-transformers = { path = "../candle-transformers", version = "0.2.1" }
candle-flash-attn = { path = "../candle-flash-attn", version = "0.2.1", optional = true }
safetensors = { workspace = true }
serde = { workspace = true }
serde_json = { workspace = true }
num-traits = { workspace = true }
intel-mkl-src = { workspace = true, optional = true }
cudarc = { workspace = true, optional = true }
half = { workspace = true, optional = true }
image = { workspace = true }
[dev-dependencies]
anyhow = { workspace = true }

View File

@ -1,44 +0,0 @@
# candle-bert
Bert is a general large language model. In this example it can be used for two
different tasks:
- Compute sentence embeddings for a prompt.
- Compute similarities between a set of sentences.
## Sentence embeddings
Bert is used to compute the sentence embeddings for a prompt. The model weights
are downloaded from the hub on the first run.
```bash
cargo run --example bert --release -- --prompt "Here is a test sentence"
> [[[ 0.0798, -0.0665, -0.0247, ..., -0.1082, -0.1000, -0.2751],
> [ 0.4218, 0.2690, 0.2740, ..., 0.3889, 1.3503, 0.9908],
> [ 0.0466, 0.3041, -0.1143, ..., 0.4427, 0.6926, -0.1515],
> ...
> [ 0.3396, 0.4320, -0.4408, ..., 0.9212, 0.2331, -0.6777],
> [ 0.2789, 0.7539, 0.4306, ..., -0.0095, 0.3375, -1.7529],
> [ 0.6737, 0.7882, 0.0548, ..., 0.1836, 0.7299, -0.6617]]]
> Tensor[[1, 7, 384], f32]
```
## Similarities
In this example, Bert is used to compute the sentence embeddings for a set of
sentences (hardcoded in the examples). Then cosine similarities are computed for
each sentence pair and they are reported by decreasing values, hence the first
reported pair contains the two sentences that have the highest similarity score.
The sentence embeddings are computed using average pooling through all the
sentence tokens, including some potential padding.
```bash
cargo run --example bert --release
> score: 0.85 'The new movie is awesome' 'The new movie is so great'
> score: 0.61 'The cat sits outside' 'The cat plays in the garden'
> score: 0.52 'I love pasta' 'Do you like pizza?'
> score: 0.23 'The new movie is awesome' 'Do you like pizza?'
> score: 0.22 'I love pasta' 'The new movie is awesome'
```
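
A minimal sketch of the pooling-plus-cosine-similarity step described above (a hypothetical helper assuming f32 embeddings of shape `(1, n_tokens, hidden)`; the example's actual code may differ):

```rust
use candle_core::{Result, Tensor};

// Average-pool token embeddings over the token dimension, then compute the
// cosine similarity between the two pooled vectors.
fn sentence_similarity(emb_a: &Tensor, emb_b: &Tensor) -> Result<f64> {
    let (_b, n_a, _h) = emb_a.dims3()?;
    let (_b, n_b, _h) = emb_b.dims3()?;
    let a = (emb_a.sum(1)? / (n_a as f64))?.flatten_all()?;
    let b = (emb_b.sum(1)? / (n_b as f64))?.flatten_all()?;
    let dot = (&a * &b)?.sum_all()?.to_scalar::<f32>()? as f64;
    let norm_a = (&a * &a)?.sum_all()?.to_scalar::<f32>()? as f64;
    let norm_b = (&b * &b)?.sum_all()?.to_scalar::<f32>()? as f64;
    Ok(dot / (norm_a.sqrt() * norm_b.sqrt()))
}
```
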

View File

@ -3,13 +3,14 @@ extern crate intel_mkl_src;
#[cfg(feature = "accelerate")]
extern crate accelerate_src;
use candle_transformers::models::bert::{BertModel, Config, DTYPE};
mod model;
use anyhow::{anyhow, Error as E, Result};
use candle::Tensor;
use candle_nn::VarBuilder;
use clap::Parser;
use hf_hub::{api::sync::Api, Cache, Repo, RepoType};
use model::{BertModel, Config, DTYPE};
use tokenizers::{PaddingParams, Tokenizer};
#[derive(Parser, Debug)]

View File

@ -1,19 +0,0 @@
# candle-starcoder: code generation model
[StarCoder/BigCode](https://huggingface.co/bigcode/starcoderbase-1b) is an LLM
specialized in code generation. The initial model was trained on 80
programming languages.
## Running some example
```bash
cargo run --example bigcode --release -- --prompt "fn fact(n: u64) -> u64 "
> fn fact(n: u64) -> u64 {
> if n == 0 {
> 1
> } else {
> n * fact(n - 1)
> }
> }
```

View File

@ -7,7 +7,8 @@ extern crate accelerate_src;
use anyhow::{Error as E, Result};
use clap::Parser;
use candle_transformers::models::bigcode::{Config, GPTBigCode};
mod model;
use model::{Config, GPTBigCode};
use candle::{DType, Device, Tensor};
use candle_nn::VarBuilder;
@ -28,10 +29,9 @@ impl TextGeneration {
tokenizer: Tokenizer,
seed: u64,
temp: Option<f64>,
top_p: Option<f64>,
device: &Device,
) -> Self {
let logits_processor = LogitsProcessor::new(seed, temp, top_p);
let logits_processor = LogitsProcessor::new(seed, temp);
Self {
model,
tokenizer,
@ -95,10 +95,6 @@ struct Args {
#[arg(long)]
temperature: Option<f64>,
/// Nucleus sampling probability cutoff.
#[arg(long)]
top_p: Option<f64>,
/// The seed to use when generating random samples.
#[arg(long, default_value_t = 299792458)]
seed: u64,
@ -154,14 +150,7 @@ fn main() -> Result<()> {
let model = GPTBigCode::load(vb, config)?;
println!("loaded the model in {:?}", start.elapsed());
let mut pipeline = TextGeneration::new(
model,
tokenizer,
args.seed,
args.temperature,
args.top_p,
&device,
);
let mut pipeline = TextGeneration::new(model, tokenizer, args.seed, args.temperature, &device);
pipeline.run(&args.prompt, args.sample_len)?;
Ok(())
}

View File

@ -1,19 +0,0 @@
# candle-dinov2
[DINOv2](https://github.com/facebookresearch/dinov2) is a computer vision model.
In this example, it is used as an ImageNet classifier: the model returns the
probability for the image to belong to each of the 1000 ImageNet categories.
## Running some example
```bash
cargo run --example dinov2 --release -- --image candle-examples/examples/yolo-v8/assets/bike.jpg
> mountain bike, all-terrain bike, off-roader: 43.67%
> bicycle-built-for-two, tandem bicycle, tandem: 33.20%
> crash helmet : 13.23%
> unicycle, monocycle : 2.44%
> maillot : 2.42%
```
![Leading group, Giro d'Italia 2021](../yolo-v8/assets/bike.jpg)

View File

@ -9,10 +9,285 @@ extern crate accelerate_src;
use clap::Parser;
use candle::{DType, IndexOp, D};
use candle_nn::{Module, VarBuilder};
use candle_transformers::models::dinov2;
use candle::{DType, IndexOp, Result, Tensor, D};
use candle_nn::{layer_norm, LayerNorm, Linear, Module, VarBuilder};
const IMG_SIZE: usize = 518;
const PATCH_SIZE: usize = 14;
const NUM_CLASSES: usize = 1000;
fn linear(vb: VarBuilder, in_dim: usize, out_dim: usize, bias: bool) -> Result<Linear> {
if bias {
candle_nn::linear(in_dim, out_dim, vb)
} else {
candle_nn::linear_no_bias(in_dim, out_dim, vb)
}
}
#[derive(Debug)]
struct Attention {
qkv: Linear,
proj: Linear,
num_heads: usize,
scale: f64,
}
impl Attention {
fn new(
vb: VarBuilder,
dim: usize,
num_heads: usize,
qkv_bias: bool,
proj_bias: bool,
) -> Result<Self> {
let qkv = linear(vb.pp("qkv"), dim, dim * 3, qkv_bias)?;
let proj = linear(vb.pp("proj"), dim, dim, proj_bias)?;
let scale = 1. / ((dim / num_heads) as f64).sqrt();
Ok(Self {
qkv,
proj,
num_heads,
scale,
})
}
}
impl Module for Attention {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let (b, n, c) = xs.dims3()?;
let qkv = self
.qkv
.forward(xs)?
.reshape((b, n, 3, self.num_heads, c / self.num_heads))?
.transpose(1, 2)? // 02134
.transpose(0, 1)? // 20134
.transpose(2, 3)?; // 20314
let q = (qkv.i(0)? * self.scale)?;
let k = qkv.i(1)?;
let v = qkv.i(2)?;
let attn = candle_nn::ops::softmax(&q.matmul(&k.t()?)?, D::Minus1)?;
let attn = attn.matmul(&v)?.transpose(1, 2)?.reshape((b, n, c))?;
self.proj.forward(&attn)
}
}
#[derive(Debug)]
struct LayerScale {
gamma: Tensor,
}
impl LayerScale {
fn new(vb: VarBuilder, dim: usize) -> Result<Self> {
let gamma = vb.get(dim, "gamma")?;
Ok(Self { gamma })
}
}
impl Module for LayerScale {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
xs.broadcast_mul(&self.gamma)
}
}
#[derive(Debug)]
struct Mlp {
fc1: Linear,
fc2: Linear,
}
impl Mlp {
fn new(vb: VarBuilder, in_features: usize, hidden_features: usize, bias: bool) -> Result<Self> {
let out_features = in_features;
let fc1 = linear(vb.pp("fc1"), in_features, hidden_features, bias)?;
let fc2 = linear(vb.pp("fc2"), hidden_features, out_features, bias)?;
Ok(Self { fc1, fc2 })
}
}
impl Module for Mlp {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let xs = self.fc1.forward(xs)?.gelu()?;
self.fc2.forward(&xs)
}
}
#[derive(Debug)]
struct Block {
norm1: LayerNorm,
attn: Attention,
ls1: LayerScale,
norm2: LayerNorm,
mlp: Mlp,
ls2: LayerScale,
}
impl Block {
fn new(vb: VarBuilder, dim: usize, num_heads: usize) -> Result<Self> {
let norm1 = layer_norm(dim, 1e-5, vb.pp("norm1"))?;
let attn = Attention::new(vb.pp("attn"), dim, num_heads, true, true)?;
let ls1 = LayerScale::new(vb.pp("ls1"), dim)?;
let norm2 = layer_norm(dim, 1e-5, vb.pp("norm2"))?;
let mlp = Mlp::new(vb.pp("mlp"), dim, dim * 4, true)?;
let ls2 = LayerScale::new(vb.pp("ls2"), dim)?;
Ok(Self {
norm1,
attn,
ls1,
norm2,
mlp,
ls2,
})
}
}
impl Module for Block {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let residual = xs;
let xs = self
.ls1
.forward(&self.attn.forward(&self.norm1.forward(xs)?)?)?;
let xs = (xs + residual)?;
let residual = &xs;
let xs = self
.ls2
.forward(&self.mlp.forward(&self.norm2.forward(&xs)?)?)?;
xs + residual
}
}
#[derive(Debug)]
struct PatchEmbed {
proj: candle_nn::Conv2d,
patch_size: (usize, usize),
num_patches: usize,
}
impl PatchEmbed {
fn new(
vb: VarBuilder,
img_size: usize,
patch_size: usize,
in_chans: usize,
embed_dim: usize,
) -> Result<Self> {
let config = candle_nn::Conv2dConfig {
stride: patch_size,
..Default::default()
};
let proj = candle_nn::conv2d(in_chans, embed_dim, patch_size, config, vb.pp("proj"))?;
let num_patches = (img_size / patch_size) * (img_size / patch_size);
Ok(Self {
proj,
patch_size: (patch_size, patch_size),
num_patches,
})
}
}
impl Module for PatchEmbed {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let (_b, _c, h, w) = xs.dims4()?;
let (patch_h, patch_w) = self.patch_size;
if (h % patch_h) != 0 {
candle::bail!("image height {h} is not a multiple of patch height {patch_h}")
}
if (w % patch_w) != 0 {
candle::bail!("image width {w} is not a multiple of patch width {patch_w}")
}
let xs = self.proj.forward(xs)?;
let (b, c, h, w) = xs.dims4()?;
// flatten embeddings.
xs.reshape((b, c, h * w))?.transpose(1, 2)
}
}
#[derive(Debug)]
pub struct DinoVisionTransformer {
patch_embed: PatchEmbed,
cls_token: Tensor,
pos_embed: Tensor,
blocks: Vec<Block>,
norm: LayerNorm,
head: Linear,
}
impl DinoVisionTransformer {
pub fn new(vb: VarBuilder, depth: usize, embed_dim: usize, num_heads: usize) -> Result<Self> {
let patch_embed =
PatchEmbed::new(vb.pp("patch_embed"), IMG_SIZE, PATCH_SIZE, 3, embed_dim)?;
let cls_token = vb.get((1, 1, embed_dim), "cls_token")?;
let num_tokens = 1;
let pos_embed = vb.get(
(1, patch_embed.num_patches + num_tokens, embed_dim),
"pos_embed",
)?;
let head = linear(vb.pp("head"), 2 * embed_dim, NUM_CLASSES, true)?;
let norm = layer_norm(embed_dim, 1e-5, vb.pp("norm"))?;
let vb_b = vb.pp("blocks");
let blocks = (0..depth)
.map(|i| Block::new(vb_b.pp(&i.to_string()), embed_dim, num_heads))
.collect::<Result<Vec<_>>>()?;
Ok(Self {
patch_embed,
cls_token,
pos_embed,
blocks,
norm,
head,
})
}
fn interpolate_pos_encoding(&self, xs: &Tensor, w: usize, h: usize) -> Result<Tensor> {
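// Resize the patch position embeddings to match the current patch grid (the model ships
// embeddings for a 518x518 input split into 14x14 patches); the class-token embedding is
// concatenated back unchanged.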
let npatch = xs.dim(1)? - 1;
let n = self.pos_embed.dim(1)? - 1;
let sqrt_n = (n as f64).sqrt();
if npatch == n && w == h {
return Ok(xs.clone());
}
let class_pos_embed = self.pos_embed.i((.., ..1))?;
let patch_pos_embed = self.pos_embed.i((.., 1..))?;
let dim = xs.dim(D::Minus1)?;
let (w0, h0) = ((w / PATCH_SIZE) as f64 + 0.1, (h / PATCH_SIZE) as f64 + 0.1);
let patch_pos_embed = patch_pos_embed
.reshape((1, sqrt_n as usize, sqrt_n as usize, dim))?
.transpose(2, 3)?
.transpose(1, 2)?;
// This uses bicubic interpolation in the original implementation.
let patch_pos_embed = patch_pos_embed.upsample_nearest2d(h0 as usize, w0 as usize)?;
let el_count = patch_pos_embed.shape().elem_count();
let patch_pos_embed =
patch_pos_embed
.transpose(1, 2)?
.transpose(2, 3)?
.reshape((1, el_count / dim, dim))?;
Tensor::cat(&[&class_pos_embed, &patch_pos_embed], 1)
}
fn prepare_tokens_with_mask(&self, xs: &Tensor) -> Result<Tensor> {
let (_b, _nc, w, h) = xs.dims4()?;
let xs = self.patch_embed.forward(xs)?;
let xs = Tensor::cat(&[&self.cls_token, &xs], 1)?;
&xs + &self.interpolate_pos_encoding(&xs, w, h)?
}
}
impl Module for DinoVisionTransformer {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let mut xs = self.prepare_tokens_with_mask(xs)?;
for blk in self.blocks.iter() {
xs = blk.forward(&xs)?
}
let xs = self.norm.forward(&xs)?;
let xs_norm_clstoken = xs.i((.., 0))?;
let xs_norm_patchtokens = xs.i((.., 1..))?.mean(1)?;
let xs = Tensor::cat(&[xs_norm_clstoken, xs_norm_patchtokens], D::Minus1)?;
self.head.forward(&xs)
}
}
pub fn vit_small(vb: VarBuilder) -> Result<DinoVisionTransformer> {
DinoVisionTransformer::new(vb, 12, 384, 6)
}
#[derive(Parser)]
struct Args {
#[arg(long)]
@ -45,7 +320,7 @@ pub fn main() -> anyhow::Result<()> {
let weights = unsafe { candle::safetensors::MmapedFile::new(model_file)? };
let weights = weights.deserialize()?;
let vb = VarBuilder::from_safetensors(vec![weights], DType::F32, &device);
let model = dinov2::vit_small(vb)?;
let model = vit_small(vb)?;
println!("model built");
let logits = model.forward(&image.unsqueeze(0)?)?;
let prs = candle_nn::ops::softmax(&logits, D::Minus1)?

View File

@ -8,11 +8,340 @@ extern crate intel_mkl_src;
#[cfg(feature = "accelerate")]
extern crate accelerate_src;
use candle::{DType, IndexOp, D};
use candle_nn::{Module, VarBuilder};
use candle_transformers::models::efficientnet::{EfficientNet, MBConvConfig};
use clap::{Parser, ValueEnum};
use candle::{DType, IndexOp, Result, Tensor, D};
use candle_nn as nn;
use nn::{Module, VarBuilder};
// Based on the Python version from torchvision.
// https://github.com/pytorch/vision/blob/0d75d9e5516f446c9c0ef93bd4ed9fea13992d06/torchvision/models/efficientnet.py#L47
#[derive(Debug, Clone, Copy)]
pub struct MBConvConfig {
expand_ratio: f64,
kernel: usize,
stride: usize,
input_channels: usize,
out_channels: usize,
num_layers: usize,
}
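// Round `v` to the nearest multiple of `divisor` (never below `divisor` itself), bumping up
// by one `divisor` if the rounded value falls below 90% of `v`, e.g. make_divisible(35.2, 8) == 32.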
fn make_divisible(v: f64, divisor: usize) -> usize {
let min_value = divisor;
let new_v = usize::max(
min_value,
(v + divisor as f64 * 0.5) as usize / divisor * divisor,
);
if (new_v as f64) < 0.9 * v {
new_v + divisor
} else {
new_v
}
}
fn bneck_confs(width_mult: f64, depth_mult: f64) -> Vec<MBConvConfig> {
let bneck_conf = |e, k, s, i, o, n| {
let input_channels = make_divisible(i as f64 * width_mult, 8);
let out_channels = make_divisible(o as f64 * width_mult, 8);
let num_layers = (n as f64 * depth_mult).ceil() as usize;
MBConvConfig {
expand_ratio: e,
kernel: k,
stride: s,
input_channels,
out_channels,
num_layers,
}
};
vec![
bneck_conf(1., 3, 1, 32, 16, 1),
bneck_conf(6., 3, 2, 16, 24, 2),
bneck_conf(6., 5, 2, 24, 40, 2),
bneck_conf(6., 3, 2, 40, 80, 3),
bneck_conf(6., 5, 1, 80, 112, 3),
bneck_conf(6., 5, 2, 112, 192, 4),
bneck_conf(6., 3, 1, 192, 320, 1),
]
}
impl MBConvConfig {
fn b0() -> Vec<Self> {
bneck_confs(1.0, 1.0)
}
fn b1() -> Vec<Self> {
bneck_confs(1.0, 1.1)
}
fn b2() -> Vec<Self> {
bneck_confs(1.1, 1.2)
}
fn b3() -> Vec<Self> {
bneck_confs(1.2, 1.4)
}
fn b4() -> Vec<Self> {
bneck_confs(1.4, 1.8)
}
fn b5() -> Vec<Self> {
bneck_confs(1.6, 2.2)
}
fn b6() -> Vec<Self> {
bneck_confs(1.8, 2.6)
}
fn b7() -> Vec<Self> {
bneck_confs(2.0, 3.1)
}
}
/// Conv2D with same padding.
#[derive(Debug)]
struct Conv2DSame {
conv2d: nn::Conv2d,
s: usize,
k: usize,
}
impl Conv2DSame {
fn new(
vb: VarBuilder,
i: usize,
o: usize,
k: usize,
stride: usize,
groups: usize,
bias: bool,
) -> Result<Self> {
let conv_config = nn::Conv2dConfig {
stride,
groups,
..Default::default()
};
let conv2d = if bias {
nn::conv2d(i, o, k, conv_config, vb)?
} else {
nn::conv2d_no_bias(i, o, k, conv_config, vb)?
};
Ok(Self {
conv2d,
s: stride,
k,
})
}
}
impl Module for Conv2DSame {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
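// Compute the (possibly asymmetric) zero padding so that the output spatial size is
// ceil(input / stride), mimicking TensorFlow-style "same" padding.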
let s = self.s;
let k = self.k;
let (_, _, ih, iw) = xs.dims4()?;
let oh = (ih + s - 1) / s;
let ow = (iw + s - 1) / s;
let pad_h = usize::max((oh - 1) * s + k - ih, 0);
let pad_w = usize::max((ow - 1) * s + k - iw, 0);
if pad_h > 0 || pad_w > 0 {
let xs = xs.pad_with_zeros(2, pad_h / 2, pad_h - pad_h / 2)?;
let xs = xs.pad_with_zeros(3, pad_w / 2, pad_w - pad_w / 2)?;
self.conv2d.forward(&xs)
} else {
self.conv2d.forward(xs)
}
}
}
#[derive(Debug)]
struct ConvNormActivation {
conv2d: Conv2DSame,
bn2d: nn::BatchNorm,
activation: bool,
}
impl ConvNormActivation {
fn new(
vb: VarBuilder,
i: usize,
o: usize,
k: usize,
stride: usize,
groups: usize,
) -> Result<Self> {
let conv2d = Conv2DSame::new(vb.pp("0"), i, o, k, stride, groups, false)?;
let bn2d = nn::batch_norm(o, 1e-3, vb.pp("1"))?;
Ok(Self {
conv2d,
bn2d,
activation: true,
})
}
fn no_activation(self) -> Self {
Self {
activation: false,
..self
}
}
}
impl Module for ConvNormActivation {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let xs = self.conv2d.forward(xs)?;
let xs = self.bn2d.forward(&xs)?;
if self.activation {
swish(&xs)
} else {
Ok(xs)
}
}
}
#[derive(Debug)]
struct SqueezeExcitation {
fc1: Conv2DSame,
fc2: Conv2DSame,
}
impl SqueezeExcitation {
fn new(vb: VarBuilder, in_channels: usize, squeeze_channels: usize) -> Result<Self> {
let fc1 = Conv2DSame::new(vb.pp("fc1"), in_channels, squeeze_channels, 1, 1, 1, true)?;
let fc2 = Conv2DSame::new(vb.pp("fc2"), squeeze_channels, in_channels, 1, 1, 1, true)?;
Ok(Self { fc1, fc2 })
}
}
impl Module for SqueezeExcitation {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let residual = xs;
// equivalent to adaptive_avg_pool2d([1, 1])
let xs = xs.mean_keepdim(D::Minus2)?.mean_keepdim(D::Minus1)?;
let xs = self.fc1.forward(&xs)?;
let xs = swish(&xs)?;
let xs = self.fc2.forward(&xs)?;
let xs = nn::ops::sigmoid(&xs)?;
residual.broadcast_mul(&xs)
}
}
#[derive(Debug)]
struct MBConv {
expand_cna: Option<ConvNormActivation>,
depthwise_cna: ConvNormActivation,
squeeze_excitation: SqueezeExcitation,
project_cna: ConvNormActivation,
config: MBConvConfig,
}
impl MBConv {
fn new(vb: VarBuilder, c: MBConvConfig) -> Result<Self> {
let vb = vb.pp("block");
let exp = make_divisible(c.input_channels as f64 * c.expand_ratio, 8);
let expand_cna = if exp != c.input_channels {
Some(ConvNormActivation::new(
vb.pp("0"),
c.input_channels,
exp,
1,
1,
1,
)?)
} else {
None
};
let start_index = if expand_cna.is_some() { 1 } else { 0 };
let depthwise_cna =
ConvNormActivation::new(vb.pp(start_index), exp, exp, c.kernel, c.stride, exp)?;
let squeeze_channels = usize::max(1, c.input_channels / 4);
let squeeze_excitation =
SqueezeExcitation::new(vb.pp(start_index + 1), exp, squeeze_channels)?;
let project_cna =
ConvNormActivation::new(vb.pp(start_index + 2), exp, c.out_channels, 1, 1, 1)?
.no_activation();
Ok(Self {
expand_cna,
depthwise_cna,
squeeze_excitation,
project_cna,
config: c,
})
}
}
impl Module for MBConv {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let use_res_connect =
self.config.stride == 1 && self.config.input_channels == self.config.out_channels;
let ys = match &self.expand_cna {
Some(expand_cna) => expand_cna.forward(xs)?,
None => xs.clone(),
};
let ys = self.depthwise_cna.forward(&ys)?;
let ys = self.squeeze_excitation.forward(&ys)?;
let ys = self.project_cna.forward(&ys)?;
if use_res_connect {
ys + xs
} else {
Ok(ys)
}
}
}
fn swish(s: &Tensor) -> Result<Tensor> {
s * nn::ops::sigmoid(s)?
}
#[derive(Debug)]
struct EfficientNet {
init_cna: ConvNormActivation,
blocks: Vec<MBConv>,
final_cna: ConvNormActivation,
classifier: nn::Linear,
}
impl EfficientNet {
fn new(p: VarBuilder, configs: Vec<MBConvConfig>, nclasses: usize) -> Result<Self> {
let f_p = p.pp("features");
let first_in_c = configs[0].input_channels;
let last_out_c = configs.last().unwrap().out_channels;
let final_out_c = 4 * last_out_c;
let init_cna = ConvNormActivation::new(f_p.pp(0), 3, first_in_c, 3, 2, 1)?;
let nconfigs = configs.len();
let mut blocks = vec![];
for (index, cnf) in configs.into_iter().enumerate() {
let f_p = f_p.pp(index + 1);
for r_index in 0..cnf.num_layers {
let cnf = if r_index == 0 {
cnf
} else {
MBConvConfig {
input_channels: cnf.out_channels,
stride: 1,
..cnf
}
};
blocks.push(MBConv::new(f_p.pp(r_index), cnf)?)
}
}
let final_cna =
ConvNormActivation::new(f_p.pp(nconfigs + 1), last_out_c, final_out_c, 1, 1, 1)?;
let classifier = nn::linear(final_out_c, nclasses, p.pp("classifier.1"))?;
Ok(Self {
init_cna,
blocks,
final_cna,
classifier,
})
}
}
impl Module for EfficientNet {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let mut xs = self.init_cna.forward(xs)?;
for block in self.blocks.iter() {
xs = block.forward(&xs)?
}
let xs = self.final_cna.forward(&xs)?;
// Equivalent to adaptive_avg_pool2d([1, 1]) -> squeeze(-1) -> squeeze(-1)
let xs = xs.mean(D::Minus1)?.mean(D::Minus1)?;
self.classifier.forward(&xs)
}
}
#[derive(Clone, Copy, Debug, ValueEnum)]
enum Which {
B0,

View File

@ -1,3 +0,0 @@
# candle-falcon
Falcon is a general large language model.

View File

@ -14,7 +14,8 @@ use clap::Parser;
use hf_hub::{api::sync::Api, Repo, RepoType};
use tokenizers::Tokenizer;
use candle_transformers::models::falcon::{Config, Falcon};
mod model;
use model::{Config, Falcon};
struct TextGeneration {
model: Falcon,
@ -25,25 +26,17 @@ struct TextGeneration {
repeat_last_n: usize,
}
struct GenerationOptions {
temp: Option<f64>,
top_p: Option<f64>,
repeat_penalty: f32,
repeat_last_n: usize,
}
impl TextGeneration {
fn new(
model: Falcon,
tokenizer: Tokenizer,
generation_options: GenerationOptions,
seed: u64,
temp: Option<f64>,
device: &Device,
repeat_penalty: f32,
repeat_last_n: usize,
) -> Self {
let logits_processor =
LogitsProcessor::new(seed, generation_options.temp, generation_options.top_p);
let repeat_penalty = generation_options.repeat_penalty;
let repeat_last_n = generation_options.repeat_last_n;
let logits_processor = LogitsProcessor::new(seed, temp);
Self {
model,
tokenizer,
@ -126,10 +119,6 @@ struct Args {
#[arg(long)]
temperature: Option<f64>,
/// Nucleus sampling probability cutoff.
#[arg(long)]
top_p: Option<f64>,
/// The seed to use when generating random samples.
#[arg(long, default_value_t = 299792458)]
seed: u64,
@ -197,14 +186,15 @@ fn main() -> Result<()> {
let model = Falcon::load(vb, config)?;
println!("loaded the model in {:?}", start.elapsed());
let generation_options = GenerationOptions {
temp: args.temperature,
top_p: args.top_p,
repeat_penalty: args.repeat_penalty,
repeat_last_n: args.repeat_last_n,
};
let mut pipeline =
TextGeneration::new(model, tokenizer, generation_options, args.seed, &device);
let mut pipeline = TextGeneration::new(
model,
tokenizer,
args.seed,
args.temperature,
&device,
args.repeat_penalty,
args.repeat_last_n,
);
pipeline.run(&args.prompt, args.sample_len)?;
Ok(())
}

View File

@ -1,4 +1,5 @@
use candle::{DType, Device, Result, Tensor, D};
use anyhow::Result;
use candle::{DType, Device, Tensor, D};
use candle_nn::{Embedding, LayerNorm, Linear, Module, VarBuilder};
const MAX_SEQ_LEN: usize = 5000;
@ -20,7 +21,7 @@ fn layer_norm(size: usize, eps: f64, vb: VarBuilder) -> Result<LayerNorm> {
if let (Ok(weight), Ok(bias)) = (vb.get(size, "gamma"), vb.get(size, "beta")) {
(weight, bias)
} else {
return Err(err);
return Err(err.into());
}
}
};
@ -81,13 +82,13 @@ impl Default for Config {
impl Config {
pub fn validate(&self) -> Result<()> {
if self.alibi {
candle::bail!("alibi is not supported");
anyhow::bail!("alibi is not supported");
}
if self.new_decoder_architecture {
candle::bail!("new_decoder_architecture is not supported");
anyhow::bail!("new_decoder_architecture is not supported");
}
if self.n_head_kv.is_some() {
candle::bail!("n_head_kv is not supported");
anyhow::bail!("n_head_kv is not supported");
}
Ok(())
}

View File

@ -21,10 +21,11 @@ use candle_transformers::generation::LogitsProcessor;
use hf_hub::{api::sync::Api, Repo, RepoType};
use std::io::Write;
use candle_transformers::models::llama as model;
mod model;
use model::{Config, Llama, LlamaConfig};
const EOS_TOKEN: &str = "</s>";
const MAX_SEQ_LEN: usize = 4096;
const DEFAULT_PROMPT: &str = "My favorite theorem is ";
#[derive(Parser, Debug)]
@ -42,10 +43,6 @@ struct Args {
#[arg(long)]
temperature: Option<f64>,
/// Nucleus sampling probability cutoff.
#[arg(long)]
top_p: Option<f64>,
/// The seed to use when generating random samples.
#[arg(long, default_value_t = 299792458)]
seed: u64,
@ -197,7 +194,7 @@ fn main() -> Result<()> {
println!("starting the inference loop");
print!("{prompt}");
let mut logits_processor = LogitsProcessor::new(args.seed, args.temperature, args.top_p);
let mut logits_processor = LogitsProcessor::new(args.seed, args.temperature);
let start_gen = std::time::Instant::now();
let mut index_pos = 0;
let mut token_generated = 0;

View File

@ -4,7 +4,7 @@ use serde::Deserialize;
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
pub const MAX_SEQ_LEN: usize = 4096;
use super::MAX_SEQ_LEN;
#[derive(Deserialize)]
pub struct LlamaConfig {

View File

@ -27,10 +27,6 @@ struct InferenceCmd {
#[arg(long)]
temperature: Option<f64>,
/// Nucleus sampling probability cutoff.
#[arg(long)]
top_p: Option<f64>,
#[arg(long, default_value = "")]
prompt: String,
@ -137,7 +133,6 @@ fn main() -> anyhow::Result<()> {
None => {
let cmd = InferenceCmd {
temperature: None,
top_p: None,
prompt: "".to_string(),
config: None,
model_id: "karpathy/tinyllamas".to_string(),
@ -261,7 +256,7 @@ fn run_inference(args: &InferenceCmd, common_args: &Args) -> Result<()> {
let model = Llama::load(vb, &cache, config)?;
println!("starting the inference loop");
let mut logits_processor = LogitsProcessor::new(299792458, args.temperature, args.top_p);
let mut logits_processor = LogitsProcessor::new(299792458, args.temperature);
let mut index_pos = 0;
print!("{}", args.prompt);

View File

@ -13,6 +13,7 @@ extern crate accelerate_src;
mod encodec_model;
mod musicgen_model;
mod nn;
mod t5_model;
use musicgen_model::{GenConfig, MusicgenForConditionalGeneration};
@ -77,7 +78,7 @@ fn main() -> Result<()> {
let model = model.deserialize()?;
let vb = VarBuilder::from_safetensors(vec![model], DTYPE, &device);
let config = GenConfig::small();
let mut model = MusicgenForConditionalGeneration::load(vb, config)?;
let model = MusicgenForConditionalGeneration::load(vb, config)?;
let tokens = tokenizer
.encode(args.prompt.as_str(), true)

View File

@ -1,10 +1,9 @@
use crate::encodec_model;
use crate::{encodec_model, t5_model};
use candle::{DType, Device, Result, Tensor, D};
use candle_nn::{
embedding, layer_norm, linear_no_bias, Activation, Embedding, LayerNorm, Linear, Module,
VarBuilder,
};
use candle_transformers::models::t5;
// https://github.com/huggingface/transformers/blob/cd4584e3c809bb9e1392ccd3fe38b40daba5519a/src/transformers/models/musicgen/configuration_musicgen.py#L83
#[derive(Debug, Clone, PartialEq)]
@ -371,7 +370,7 @@ impl MusicgenForCausalLM {
#[derive(Debug)]
pub struct MusicgenForConditionalGeneration {
pub text_encoder: t5::T5EncoderModel,
pub text_encoder: crate::t5_model::T5EncoderModel,
pub audio_encoder: crate::encodec_model::EncodecModel,
pub decoder: MusicgenForCausalLM,
cfg: GenConfig,
@ -380,7 +379,7 @@ pub struct MusicgenForConditionalGeneration {
#[derive(Debug, Clone, PartialEq)]
pub struct GenConfig {
musicgen: Config,
t5: t5::Config,
t5: crate::t5_model::Config,
encodec: crate::encodec_model::Config,
}
@ -388,7 +387,7 @@ impl GenConfig {
pub fn small() -> Self {
Self {
musicgen: Config::musicgen_small(),
t5: t5::Config::musicgen_small(),
t5: t5_model::Config::musicgen_small(),
encodec: encodec_model::Config::musicgen_small(),
}
}
@ -400,7 +399,7 @@ impl MusicgenForConditionalGeneration {
}
pub fn load(vb: VarBuilder, cfg: GenConfig) -> Result<Self> {
let text_encoder = t5::T5EncoderModel::load(vb.pp("text_encoder"), &cfg.t5)?;
let text_encoder = t5_model::T5EncoderModel::load(vb.pp("text_encoder"), &cfg.t5)?;
let audio_encoder =
encodec_model::EncodecModel::load(vb.pp("audio_encoder"), &cfg.encodec)?;
let decoder = MusicgenForCausalLM::load(vb.pp("decoder"), &cfg.musicgen)?;

View File

@ -1,39 +1,11 @@
// T5 Text Encoder
// https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py
use candle::{DType, Device, Result, Tensor, D};
use candle::{DType, Result, Tensor, D};
use candle_nn::{embedding, linear_no_bias, Activation, Embedding, Linear, Module, VarBuilder};
use serde::Deserialize;
use std::sync::Arc;
fn default_relative_attention_max_distance() -> usize {
128
}
fn default_is_decoder() -> bool {
false
}
fn default_use_cache() -> bool {
true
}
fn get_mask(size: usize, device: &Device) -> Result<Tensor> {
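// Causal mask: 1 strictly above the diagonal (j > i), later used to set future positions to -inf.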
let mask: Vec<_> = (0..size)
.flat_map(|i| (0..size).map(move |j| u8::from(j > i)))
.collect();
let result = Tensor::from_slice(&mask, (size, size), device)?;
Ok(result)
}
fn masked_fill(on_false: &Tensor, mask: &Tensor, on_true: f32) -> Result<Tensor> {
let shape = mask.shape();
let on_true = Tensor::new(on_true, on_false.device())?.broadcast_as(shape.dims())?;
let m = mask.where_cond(&on_true, on_false)?;
Ok(m)
}
#[derive(Debug, Clone, PartialEq, Deserialize)]
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
vocab_size: usize,
d_model: usize,
@ -43,20 +15,16 @@ pub struct Config {
num_decoder_layers: Option<usize>,
num_heads: usize,
relative_attention_num_buckets: usize,
#[serde(default = "default_relative_attention_max_distance")]
relative_attention_max_distance: usize,
dropout_rate: f64,
layer_norm_epsilon: f64,
initializer_factor: f64,
#[serde(default)]
feed_forward_proj: Activation,
#[serde(default = "default_is_decoder")]
is_decoder: bool,
is_encoder_decoder: bool,
#[serde(default = "default_use_cache")]
pub use_cache: bool,
pub pad_token_id: usize,
pub eos_token_id: usize,
use_cache: bool,
pad_token_id: usize,
eos_token_id: usize,
}
impl Default for Config {
@ -163,71 +131,27 @@ impl T5DenseActDense {
}
}
#[derive(Debug)]
struct T5DenseGatedActDense {
wi_0: Linear,
wi_1: Linear,
wo: Linear,
act: Activation,
}
impl T5DenseGatedActDense {
fn load(vb: VarBuilder, cfg: &Config) -> Result<Self> {
let wi_0 = linear_no_bias(cfg.d_model, cfg.d_ff, vb.pp("wi_0"))?;
let wi_1 = linear_no_bias(cfg.d_model, cfg.d_ff, vb.pp("wi_1"))?;
let wo = linear_no_bias(cfg.d_ff, cfg.d_model, vb.pp("wo"))?;
Ok(Self {
wi_0,
wi_1,
wo,
act: Activation::NewGelu,
})
}
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let hidden_gelu = self.act.forward(&self.wi_0.forward(xs)?)?;
let hidden_linear = self.wi_1.forward(xs)?;
let xs = hidden_gelu.broadcast_mul(&hidden_linear)?;
let xs = self.wo.forward(&xs)?;
Ok(xs)
}
}
#[derive(Debug)]
struct T5LayerFF {
dense_act: Option<T5DenseActDense>,
gated_dense_act: Option<T5DenseGatedActDense>,
dense_relu_dense: T5DenseActDense,
layer_norm: T5LayerNorm,
}
impl T5LayerFF {
fn load(vb: VarBuilder, cfg: &Config) -> Result<Self> {
// is_gated_act is not supported.
let dense_relu_dense = T5DenseActDense::load(vb.pp("DenseReluDense"), cfg)?;
let layer_norm =
T5LayerNorm::load(cfg.d_model, cfg.layer_norm_epsilon, vb.pp("layer_norm"))?;
let (dense_act, gated_dense_act) = if cfg.feed_forward_proj == Activation::NewGelu {
(
None,
Some(T5DenseGatedActDense::load(vb.pp("DenseReluDense"), cfg)?),
)
} else {
(
Some(T5DenseActDense::load(vb.pp("DenseReluDense"), cfg)?),
None,
)
};
Ok(Self {
dense_act,
gated_dense_act,
dense_relu_dense,
layer_norm,
})
}
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let ys = self.layer_norm.forward(xs)?;
let ys = match &self.dense_act {
Some(dense_act) => dense_act.forward(&ys)?,
None => self.gated_dense_act.as_ref().unwrap().forward(&ys)?,
};
let ys = self.dense_relu_dense.forward(&ys)?;
let xs = (xs + ys)?;
Ok(xs)
}
@ -245,23 +169,16 @@ struct T5Attention {
relative_attention_num_buckets: usize,
relative_attention_max_distance: usize,
inner_dim: usize,
use_cache: bool,
kv_cache: Option<(Tensor, Tensor)>,
}
impl T5Attention {
fn load(
has_relative_attention_bias: bool,
decoder: bool,
vb: VarBuilder,
cfg: &Config,
) -> Result<Self> {
fn load(h: bool, vb: VarBuilder, cfg: &Config) -> Result<Self> {
let inner_dim = cfg.num_heads * cfg.d_kv;
let q = linear_no_bias(cfg.d_model, inner_dim, vb.pp("q"))?;
let k = linear_no_bias(cfg.d_model, inner_dim, vb.pp("k"))?;
let v = linear_no_bias(cfg.d_model, inner_dim, vb.pp("v"))?;
let o = linear_no_bias(inner_dim, cfg.d_model, vb.pp("o"))?;
let relative_attention_bias = if has_relative_attention_bias {
let relative_attention_bias = if h {
let emb = embedding(
cfg.relative_attention_num_buckets,
cfg.num_heads,
@ -282,77 +199,47 @@ impl T5Attention {
relative_attention_num_buckets: cfg.relative_attention_num_buckets,
relative_attention_max_distance: cfg.relative_attention_max_distance,
inner_dim,
use_cache: cfg.use_cache && decoder,
kv_cache: None,
})
}
fn forward(
&mut self,
&self,
xs: &Tensor,
position_bias: Option<&Tensor>,
key_value_states: Option<&Tensor>,
mask: Option<&Tensor>,
) -> Result<(Tensor, Option<Tensor>)> {
// Performs Self-attention (if key_value_states is None) or attention
// over source sentence (provided by key_value_states).
let kv_input = match key_value_states {
None => xs,
Some(key_value_states) => key_value_states,
};
let (b_sz, q_len) = (xs.dim(0)?, xs.dim(1)?);
let kv_len = kv_input.dim(1)?;
// TODO: Apply the mask(s)?
// TODO: kv caching.
let (b_sz, seq_len) = (xs.dim(0)?, xs.dim(1)?);
let q = self.q.forward(xs)?;
let k = self.k.forward(kv_input)?;
let v = self.v.forward(kv_input)?;
let k = self.k.forward(xs)?;
let v = self.v.forward(xs)?;
let q = q
.reshape((b_sz, q_len, self.n_heads, self.d_kv))?
.reshape((b_sz, seq_len, self.n_heads, self.d_kv))?
.transpose(1, 2)?
.contiguous()?;
let mut k = k
.reshape((b_sz, kv_len, self.n_heads, self.d_kv))?
let k = k
.reshape((b_sz, seq_len, self.n_heads, self.d_kv))?
.transpose(1, 2)?
.contiguous()?;
let mut v = v
.reshape((b_sz, kv_len, self.n_heads, self.d_kv))?
let v = v
.reshape((b_sz, seq_len, self.n_heads, self.d_kv))?
.transpose(1, 2)?
.contiguous()?;
if self.use_cache {
if let Some((kv_cache_k, kv_cache_v)) = &self.kv_cache {
k = Tensor::cat(&[kv_cache_k, &k], 2)?.contiguous()?;
v = Tensor::cat(&[kv_cache_v, &v], 2)?.contiguous()?;
};
self.kv_cache = Some((k.clone(), v.clone()));
};
// TODO: Use flash_attn.
let scores = q.matmul(&k.t()?)?;
let scores = match mask {
None => scores,
Some(mask) => masked_fill(
&scores,
&mask
.unsqueeze(0)?
.unsqueeze(0)?
.repeat((b_sz, self.n_heads))?,
f32::NEG_INFINITY,
)?,
};
let (scores, position_bias) = match position_bias {
Some(position_bias) => (
scores.broadcast_add(position_bias)?,
Some(position_bias.clone()),
),
Some(position_bias) => ((scores + position_bias)?, Some(position_bias.clone())),
None => match &self.relative_attention_bias {
None => (scores, None),
Some(relative_attention_bias) => {
let query_length = seq_len;
let key_length = seq_len;
// This only handles the bidirectional case.
let num_buckets = self.relative_attention_num_buckets as u32 / 2;
let max_exact = num_buckets / 2;
let relative_position = (0..q_len as u32)
let relative_position = (0..query_length as u32)
.map(|i| {
(0..kv_len as u32)
(0..key_length as u32)
.map(|j| {
if i < j {
if j - i < max_exact {
@ -387,7 +274,7 @@ impl T5Attention {
.forward(&relative_buckets)?
.permute((2, 0, 1))?
.unsqueeze(0)?;
(scores.broadcast_add(&position_bias)?, Some(position_bias))
((scores + &position_bias)?, Some(position_bias))
// TODO: position_bias_masked?
}
},
@ -397,7 +284,7 @@ impl T5Attention {
let attn_output = attn_weights.matmul(&v)?;
let attn_output = attn_output
.transpose(1, 2)?
.reshape((b_sz, q_len, self.inner_dim))?;
.reshape((b_sz, seq_len, self.inner_dim))?;
let attn_output = self.o.forward(&attn_output)?;
Ok((attn_output, position_bias))
}
@ -410,8 +297,8 @@ struct T5LayerSelfAttention {
}
impl T5LayerSelfAttention {
fn load(h: bool, d: bool, vb: VarBuilder, cfg: &Config) -> Result<Self> {
let self_attention = T5Attention::load(h, d, vb.pp("SelfAttention"), cfg)?;
fn load(h: bool, vb: VarBuilder, cfg: &Config) -> Result<Self> {
let self_attention = T5Attention::load(h, vb.pp("SelfAttention"), cfg)?;
let layer_norm =
T5LayerNorm::load(cfg.d_model, cfg.layer_norm_epsilon, vb.pp("layer_norm"))?;
Ok(Self {
@ -421,52 +308,27 @@ impl T5LayerSelfAttention {
}
fn forward(
&mut self,
&self,
xs: &Tensor,
position_bias: Option<&Tensor>,
mask: Option<&Tensor>,
) -> Result<(Tensor, Option<Tensor>)> {
let normed_xs = self.layer_norm.forward(xs)?;
let (ys, position_bias) =
self.self_attention
.forward(&normed_xs, position_bias, None, mask)?;
let (ys, position_bias) = self.self_attention.forward(&normed_xs, position_bias)?;
let ys = (xs + ys)?;
Ok((ys, position_bias))
}
}
#[derive(Debug)]
struct T5LayerCrossAttention {
cross_attention: T5Attention,
layer_norm: T5LayerNorm,
}
struct T5LayerCrossAttention {}
impl T5LayerCrossAttention {
fn load(decoder: bool, vb: VarBuilder, cfg: &Config) -> Result<Self> {
let cross_attention = T5Attention::load(false, decoder, vb.pp("EncDecAttention"), cfg)?;
let layer_norm =
T5LayerNorm::load(cfg.d_model, cfg.layer_norm_epsilon, vb.pp("layer_norm"))?;
Ok(Self {
cross_attention,
layer_norm,
})
fn load(_vb: VarBuilder, _cfg: &Config) -> Result<Self> {
todo!()
}
fn forward(
&mut self,
hidden_states: &Tensor,
position_bias: Option<&Tensor>,
key_value_states: &Tensor,
) -> Result<(Tensor, Option<Tensor>)> {
let normed_hidden_states = self.layer_norm.forward(hidden_states)?;
let (ys, position_bias) = self.cross_attention.forward(
&normed_hidden_states,
position_bias,
Some(key_value_states),
None,
)?;
let ys = (hidden_states + ys)?;
Ok((ys, position_bias))
fn forward(&self, _xs: &Tensor) -> Result<Tensor> {
todo!()
}
}
@ -478,17 +340,11 @@ struct T5Block {
}
impl T5Block {
fn load(
has_relative_attention_bias: bool,
decoder: bool,
vb: VarBuilder,
cfg: &Config,
) -> Result<Self> {
fn load(has_relative_attention_bias: bool, vb: VarBuilder, cfg: &Config) -> Result<Self> {
let vb = vb.pp("layer");
let self_attn =
T5LayerSelfAttention::load(has_relative_attention_bias, decoder, vb.pp("0"), cfg)?;
let self_attn = T5LayerSelfAttention::load(has_relative_attention_bias, vb.pp("0"), cfg)?;
let cross_attn = if cfg.is_decoder {
Some(T5LayerCrossAttention::load(decoder, vb.pp("1"), cfg)?)
Some(T5LayerCrossAttention::load(vb.pp("1"), cfg)?)
} else {
None
};
@ -502,29 +358,14 @@ impl T5Block {
}
fn forward(
&mut self,
&self,
xs: &Tensor,
position_bias: Option<&Tensor>,
encoder_hidden_states: Option<&Tensor>,
) -> Result<(Tensor, Option<Tensor>)> {
// TODO: Cache masks
let mask = match self.cross_attn.is_some() {
true => {
let mask_len = xs.dim(1)?;
// If the input seq length is 1, no need for a mask, this is also helpful to avoid shape
// issues when using the KV cache in the decoder.
if mask_len <= 1 {
None
} else {
Some(get_mask(mask_len, xs.device())?)
}
}
false => None,
};
let (mut xs, position_bias) = self.self_attn.forward(xs, position_bias, mask.as_ref())?;
let (mut xs, position_bias) = self.self_attn.forward(xs, position_bias)?;
// TODO: clamp for f16?
if let Some(cross_attn) = &mut self.cross_attn {
(xs, _) = cross_attn.forward(&xs, None, encoder_hidden_states.unwrap())?;
if let Some(cross_attn) = &self.cross_attn {
xs = cross_attn.forward(&xs)?;
// TODO: clamp for f16?
}
let xs = self.ff.forward(&xs)?;
@ -541,9 +382,9 @@ struct T5Stack {
}
impl T5Stack {
fn load(decoder: bool, vb: VarBuilder, shared: &Arc<Embedding>, cfg: &Config) -> Result<Self> {
fn load(vb: VarBuilder, shared: &Arc<Embedding>, cfg: &Config) -> Result<Self> {
let block = (0..cfg.num_layers)
.map(|i| T5Block::load(i == 0, decoder, vb.pp(&format!("block.{i}")), cfg))
.map(|i| T5Block::load(i == 0, vb.pp(&format!("block.{i}")), cfg))
.collect::<Result<Vec<_>>>()?;
let final_layer_norm = T5LayerNorm::load(
cfg.d_model,
@ -557,113 +398,37 @@ impl T5Stack {
})
}
fn forward(
&mut self,
input_ids: &Tensor,
encoder_hidden_states: Option<&Tensor>,
) -> Result<Tensor> {
fn forward(&self, input_ids: &Tensor) -> Result<Tensor> {
let input_embeds = self.shared.as_ref().forward(input_ids)?;
let (_b_sz, _seq_len) = (input_embeds.dim(0)?, input_embeds.dim(1)?);
let mut hidden_states = input_embeds;
let mut position_bias = None;
for block in self.block.iter_mut() {
(hidden_states, position_bias) = block.forward(
&hidden_states,
position_bias.as_ref(),
encoder_hidden_states,
)?
for block in self.block.iter() {
(hidden_states, position_bias) =
block.forward(&hidden_states, position_bias.as_ref())?
}
self.final_layer_norm.forward(&hidden_states)
let hidden_states = self.final_layer_norm.forward(&hidden_states)?;
Ok(hidden_states)
}
}
#[derive(Debug)]
pub struct T5EncoderModel {
shared: Arc<Embedding>,
encoder: T5Stack,
device: Device,
}
impl T5EncoderModel {
pub fn load(vb: VarBuilder, cfg: &Config) -> Result<Self> {
let shared = embedding(cfg.vocab_size, cfg.d_model, vb.pp("shared"))?;
let shared = Arc::new(shared);
let encoder = T5Stack::load(false, vb.pp("encoder"), &shared, cfg)?;
Ok(Self {
encoder,
device: vb.device().clone(),
})
let encoder = T5Stack::load(vb.pp("encoder"), &shared, cfg)?;
Ok(Self { shared, encoder })
}
pub fn forward(&mut self, input_ids: &Tensor) -> Result<Tensor> {
self.encoder.forward(input_ids, None)
}
pub fn device(&self) -> &Device {
&self.device
}
}
#[derive(Debug)]
pub struct T5ForConditionalGeneration {
encoder: T5Stack,
decoder: T5Stack,
shared: Arc<Embedding>,
device: Device,
}
impl T5ForConditionalGeneration {
pub fn load(vb: VarBuilder, cfg: &Config) -> Result<Self> {
assert!(cfg.is_encoder_decoder);
let shared = embedding(cfg.vocab_size, cfg.d_model, vb.pp("shared"))?;
let shared = Arc::new(shared);
let mut encoder_cfg = cfg.clone();
encoder_cfg.is_decoder = false;
encoder_cfg.use_cache = false;
encoder_cfg.is_encoder_decoder = false;
let encoder = T5Stack::load(false, vb.pp("encoder"), &shared, &encoder_cfg)?;
let mut decoder_cfg = cfg.clone();
decoder_cfg.is_decoder = true;
decoder_cfg.is_encoder_decoder = false;
decoder_cfg.num_layers = cfg.num_decoder_layers.unwrap_or(cfg.num_layers);
let decoder = T5Stack::load(true, vb.pp("decoder"), &shared, &decoder_cfg)?;
Ok(Self {
encoder,
decoder,
shared,
device: vb.device().clone(),
})
}
pub fn encode(&mut self, input_ids: &Tensor) -> Result<Tensor> {
self.encoder.forward(input_ids, None)
}
pub fn decode(
&mut self,
decoder_input_ids: &Tensor,
encoder_output: &Tensor,
) -> Result<Tensor> {
let decoder_output = self
.decoder
.forward(decoder_input_ids, Some(encoder_output))?;
let sequence_output = decoder_output
.narrow(1, decoder_output.dim(1)? - 1, 1)?
.squeeze(1)?;
// TODO: check cfg.tie_word_embeddings to load from model instead.
let lm_head_weights = self.shared.embeddings().t()?;
let output = sequence_output.matmul(&lm_head_weights)?;
// TODO: Rescale output before projecting on vocab? * (self.model_dim**-0.5)
Ok(output)
}
pub fn forward(&mut self, input_ids: &Tensor, decoder_input_ids: &Tensor) -> Result<Tensor> {
let encoder_output = self.encode(input_ids)?;
self.decode(decoder_input_ids, &encoder_output)
}
pub fn device(&self) -> &Device {
&self.device
pub fn forward(&self, input_ids: &Tensor) -> Result<Tensor> {
let encoder_outputs = self.encoder.forward(input_ids)?;
Ok(encoder_outputs)
}
}

View File

@ -1,37 +0,0 @@
# candle-quantized-llama: Fast Inference of quantized LLaMA models
This example provides a quantized LLaMA model similar to
[llama.cpp](https://github.com/ggerganov/llama.cpp). It is based on candle's
built-in quantization methods. Supported features include:
- 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit integer quantization support.
- SIMD optimizations on Apple Silicon and x86.
- Support for the `gguf` and `ggml` file formats.
The weights are automatically downloaded for you from the [HuggingFace
Hub](https://huggingface.co/) on the first run. There are various command-line
flags to use local files instead; run with `--help` to learn about them.
![Axiom of Choice](./assets/aoc.gif)
## Running an example
```bash
cargo run --example quantized --release -- --prompt "The best thing about coding in rust is "
> avx: true, neon: false, simd128: false, f16c: true
> temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
> loaded 291 tensors (3.79GB) in 2.17s
> params: HParams { n_vocab: 32000, n_embd: 4096, n_mult: 256, n_head: 32, n_layer: 32, n_rot: 128, ftype: 2 }
> The best thing about coding in rust is 1.) that I dont need to worry about memory leaks, 2.) speed and 3.) my program will compile even on old machines.
```
## Command-line flags
Run with `--help` to see all options.
- `--which`: specify the model to use, e.g. `7b`, `13-chat`, `7b-code`.
- `--prompt interactive`: interactive mode where multiple prompts can be
entered.
- `--model mymodelfile.gguf`: use a local model file rather than getting one
from the hub.
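As a rough sketch of what the `--model` file contains, a local `gguf` file can be opened
and its tensors listed with candle's quantized loader. This snippet is not part of the
example; it assumes the `gguf_file::Content` API that the example imports, and the file
name is the hypothetical one from the flag above:

```rust
use candle::quantized::gguf_file;

fn main() -> anyhow::Result<()> {
    // Hypothetical local model file, as passed via `--model`.
    let path = "mymodelfile.gguf";
    let mut file = std::fs::File::open(path)?;
    // Parse the gguf container: metadata plus one entry per quantized tensor.
    let content = gguf_file::Content::read(&mut file)?;
    println!("loaded {} tensors from {path}", content.tensor_infos.len());
    Ok(())
}
```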

Binary file not shown (before: 119 KiB)

View File

@ -12,7 +12,7 @@ use candle::quantized::{ggml_file, gguf_file};
use candle::{Device, Tensor};
use candle_transformers::generation::LogitsProcessor;
use candle_transformers::models::quantized_llama as model;
mod model;
use model::ModelWeights;
const DEFAULT_PROMPT: &str = "My favorite theorem is ";
@ -71,10 +71,6 @@ struct Args {
#[arg(long, default_value_t = 0.8)]
temperature: f64,
/// Nucleus sampling probability cutoff.
#[arg(long)]
top_p: Option<f64>,
/// The seed to use when generating random samples.
#[arg(long, default_value_t = 299792458)]
seed: u64,
@ -314,7 +310,7 @@ fn main() -> anyhow::Result<()> {
prompt_tokens
};
let mut all_tokens = vec![];
let mut logits_processor = LogitsProcessor::new(args.seed, temperature, args.top_p);
let mut logits_processor = LogitsProcessor::new(args.seed, temperature);
let start_prompt_processing = std::time::Instant::now();
let mut next_token = {

View File

@ -144,7 +144,7 @@ impl LayerWeights {
let att = (q.matmul(&k.t()?)? / (self.head_dim as f64).sqrt())?;
let mask = mask.broadcast_as(att.shape())?;
let att = masked_fill(&att, &mask, f32::NEG_INFINITY)?;
let att = candle_nn::ops::softmax_last_dim(&att)?;
let att = candle_nn::ops::softmax(&att, D::Minus1)?;
// Convert to contiguous as matmul doesn't support strided vs for now.
let y = att.matmul(&v.contiguous()?)?;
let y = y.transpose(1, 2)?.reshape(&[b_sz, seq_len, n_embd])?;

View File

@ -1,40 +0,0 @@
# candle-segment-anything: Segment-Anything Model
This example is based on Meta AI [Segment-Anything
Model](https://github.com/facebookresearch/segment-anything). This model
provides a robust and fast image segmentation pipeline that can be steered via
prompts: points required to be inside the target mask, points required to be in
the background (so _not_ in the target mask), or a bounding box.
The default backbone can be replaced by the smaller and faster TinyViT model
based on [MobileSAM](https://github.com/ChaoningZhang/MobileSAM).
## Running an example
```bash
cargo run --example segment-anything --release -- \
  --image candle-examples/examples/yolo-v8/assets/bike.jpg \
  --use-tiny \
  --point-x 0.4 \
  --point-y 0.3
```
Running this command generates a `sam_merged.jpg` file containing the original
image with a blue overlay of the selected mask. The red dot marks the prompt
specified by `--point-x 0.4 --point-y 0.3`; this point is assumed to be part of
the target mask.
The values used for `--point-x` and `--point-y` should be between 0 and 1 and
are relative to the image dimensions, e.g. 0.5 0.5 selects the image center.
![Leading group, Giro d'Italia 2021](../yolo-v8/assets/bike.jpg)
![Leading group, Giro d'Italia 2021](./assets/sam_merged.jpg)
### Command-line flags
- `--use-tiny`: use the TinyViT-based MobileSAM backbone rather than the default
one.
- `--point-x`, `--point-y`: specify the location of the target point.
- `--threshold`: sets the threshold value for a point to be part of the mask; a
negative value results in a larger mask and can be specified as e.g. `--threshold=-1.2`.
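As a minimal sketch of how the normalized prompt coordinates are turned into pixel
coordinates before drawing the red marker (`to_pixel` is a hypothetical helper; the
example performs this computation inline):

```rust
/// Map a normalized prompt point (0..1 on each axis) to pixel coordinates.
fn to_pixel(point_x: f64, point_y: f64, width: u32, height: u32) -> (i32, i32) {
    // 0.5, 0.5 selects the center of the image.
    ((point_x * width as f64) as i32, (point_y * height as f64) as i32)
}
```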

Binary file not shown (before: 157 KiB)

View File

@ -7,11 +7,107 @@ extern crate intel_mkl_src;
#[cfg(feature = "accelerate")]
extern crate accelerate_src;
use candle::DType;
use candle_nn::VarBuilder;
use candle_transformers::models::segment_anything::sam;
pub mod model_image_encoder;
pub mod model_mask_decoder;
pub mod model_prompt_encoder;
pub mod model_sam;
pub mod model_transformer;
use candle::{DType, Result, Tensor};
use candle_nn::{Module, VarBuilder};
use clap::Parser;
pub fn linear(vb: VarBuilder, in_dim: usize, out_dim: usize, bias: bool) -> Result<Linear> {
let inner = if bias {
candle_nn::linear(in_dim, out_dim, vb)?
} else {
candle_nn::linear_no_bias(in_dim, out_dim, vb)?
};
let span = tracing::span!(tracing::Level::TRACE, "linear");
Ok(Linear { inner, span })
}
#[derive(Debug)]
pub struct LayerNorm2d {
weight: Tensor,
bias: Tensor,
num_channels: usize,
eps: f64,
}
impl LayerNorm2d {
pub fn new(num_channels: usize, eps: f64, vb: VarBuilder) -> Result<Self> {
let weight = vb.get(num_channels, "weight")?;
let bias = vb.get(num_channels, "bias")?;
Ok(Self {
weight,
bias,
num_channels,
eps,
})
}
}
impl Module for LayerNorm2d {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
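// Layer norm over the channel dimension (dim 1): subtract the per-position channel mean,
// divide by sqrt(channel variance + eps), then apply the per-channel weight and bias.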
let u = xs.mean_keepdim(1)?;
let xs = xs.broadcast_sub(&u)?;
let s = xs.sqr()?.mean_keepdim(1)?;
let xs = xs.broadcast_div(&(s + self.eps)?.sqrt()?)?;
xs.broadcast_mul(&self.weight.reshape((1, self.num_channels, 1, 1))?)?
.broadcast_add(&self.bias.reshape((1, self.num_channels, 1, 1))?)
}
}
#[derive(Debug)]
pub struct MlpBlock {
lin1: Linear,
lin2: Linear,
activation: candle_nn::Activation,
span: tracing::Span,
}
impl MlpBlock {
pub fn new(
embedding_dim: usize,
mlp_dim: usize,
activation: candle_nn::Activation,
vb: VarBuilder,
) -> Result<Self> {
let lin1 = linear(vb.pp("lin1"), embedding_dim, mlp_dim, true)?;
let lin2 = linear(vb.pp("lin2"), mlp_dim, embedding_dim, true)?;
let span = tracing::span!(tracing::Level::TRACE, "mlp-block");
Ok(Self {
lin1,
lin2,
activation,
span,
})
}
}
impl Module for MlpBlock {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let _enter = self.span.enter();
xs.apply(&self.lin1)?
.apply(&self.activation)?
.apply(&self.lin2)
}
}
#[derive(Debug)]
pub struct Linear {
inner: candle_nn::Linear,
span: tracing::Span,
}
impl Module for Linear {
fn forward(&self, x: &Tensor) -> Result<Tensor> {
let _enter = self.span.enter();
self.inner.forward(x)
}
}
#[derive(Parser)]
struct Args {
#[arg(long)]
@ -27,26 +123,15 @@ struct Args {
#[arg(long)]
generate_masks: bool,
/// The target point x coordinate, between 0 and 1 (0.5 is at the middle of the image).
#[arg(long, default_value_t = 0.5)]
point_x: f64,
/// The target point y coordinate, between 0 and 1 (0.5 is at the middle of the image).
#[arg(long, default_value_t = 0.5)]
point_y: f64,
/// The detection threshold for the mask, 0 is the default value, negative values mean a larger
/// mask, positive makes the mask more selective.
#[arg(long, default_value_t = 0.)]
threshold: f32,
/// Enable tracing (generates a trace-timestamp.json file).
#[arg(long)]
tracing: bool,
/// Use the TinyViT based models from MobileSAM
#[arg(long)]
use_tiny: bool,
}
pub fn main() -> anyhow::Result<()> {
@ -64,9 +149,28 @@ pub fn main() -> anyhow::Result<()> {
let device = candle_examples::device(args.cpu)?;
let (image, initial_h, initial_w) =
candle_examples::load_image(&args.image, Some(sam::IMAGE_SIZE))?;
let image = image.to_device(&device)?;
let (image, initial_h, initial_w) = if args.image.ends_with(".safetensors") {
let mut tensors = candle::safetensors::load(&args.image, &device)?;
let image = match tensors.remove("image") {
Some(image) => image,
None => {
if tensors.len() != 1 {
anyhow::bail!("multiple tensors in '{}'", args.image)
}
tensors.into_values().next().unwrap()
}
};
let image = if image.rank() == 4 {
image.get(0)?
} else {
image
};
let (_c, h, w) = image.dims3()?;
(image, h, w)
} else {
let (image, h, w) = candle_examples::load_image(&args.image, Some(model_sam::IMAGE_SIZE))?;
(image.to_device(&device)?, h, w)
};
println!("loaded image {image:?}");
let model = match args.model {
@ -74,22 +178,13 @@ pub fn main() -> anyhow::Result<()> {
None => {
let api = hf_hub::api::sync::Api::new()?;
let api = api.model("lmz/candle-sam".to_string());
let filename = if args.use_tiny {
"mobile_sam-tiny-vitt.safetensors"
} else {
"sam_vit_b_01ec64.safetensors"
};
api.get(filename)?
api.get("sam_vit_b_01ec64.safetensors")?
}
};
let weights = unsafe { candle::safetensors::MmapedFile::new(model)? };
let weights = weights.deserialize()?;
let vb = VarBuilder::from_safetensors(vec![weights], DType::F32, &device);
let sam = if args.use_tiny {
sam::Sam::new_tiny(vb)? // tiny vit_t
} else {
sam::Sam::new(768, 12, 12, &[2, 5, 8, 11], vb)? // sam_vit_b
};
let sam = model_sam::Sam::new(768, 12, 12, &[2, 5, 8, 11], vb)?; // sam_vit_b
if args.generate_masks {
// Default options similar to the Python version.
@ -101,7 +196,7 @@ pub fn main() -> anyhow::Result<()> {
/* crop_n_points_downscale_factor */ 1,
)?;
for (idx, bbox) in bboxes.iter().enumerate() {
println!("{idx} {bbox:?}");
println!("{bbox:?}");
let mask = (&bbox.data.to_dtype(DType::U8)? * 255.)?;
let (h, w) = mask.dims2()?;
let mask = mask.broadcast_as((3, h, w))?;
@ -123,42 +218,56 @@ pub fn main() -> anyhow::Result<()> {
println!("mask:\n{mask}");
println!("iou_predictions: {iou_predictions:?}");
let mask = (mask.ge(args.threshold)? * 255.)?;
// Save the mask as an image.
let mask = (mask.ge(0f32)? * 255.)?;
let (_one, h, w) = mask.dims3()?;
let mask = mask.expand((3, h, w))?;
candle_examples::save_image_resize(&mask, "sam_mask.png", initial_h, initial_w)?;
let mut img = image::io::Reader::open(&args.image)?
.decode()
.map_err(candle::Error::wrap)?;
let mask_pixels = mask.permute((1, 2, 0))?.flatten_all()?.to_vec1::<u8>()?;
let mask_img: image::ImageBuffer<image::Rgb<u8>, Vec<u8>> =
match image::ImageBuffer::from_raw(w as u32, h as u32, mask_pixels) {
Some(image) => image,
None => anyhow::bail!("error saving merged image"),
};
let mask_img = image::DynamicImage::from(mask_img).resize_to_fill(
img.width(),
img.height(),
image::imageops::FilterType::CatmullRom,
);
for x in 0..img.width() {
for y in 0..img.height() {
let mask_p = imageproc::drawing::Canvas::get_pixel(&mask_img, x, y);
if mask_p.0[0] > 100 {
let mut img_p = imageproc::drawing::Canvas::get_pixel(&img, x, y);
img_p.0[2] = 255 - (255 - img_p.0[2]) / 2;
img_p.0[1] /= 2;
img_p.0[0] /= 2;
imageproc::drawing::Canvas::draw_pixel(&mut img, x, y, img_p)
if !args.image.ends_with(".safetensors") {
let mut img = image::io::Reader::open(&args.image)?
.decode()
.map_err(candle::Error::wrap)?;
let mask_pixels = mask.permute((1, 2, 0))?.flatten_all()?.to_vec1::<u8>()?;
let mask_img: image::ImageBuffer<image::Rgb<u8>, Vec<u8>> =
match image::ImageBuffer::from_raw(w as u32, h as u32, mask_pixels) {
Some(image) => image,
None => anyhow::bail!("error saving merged image"),
};
let mask_img = image::DynamicImage::from(mask_img).resize_to_fill(
img.width(),
img.height(),
image::imageops::FilterType::CatmullRom,
);
for x in 0..img.width() {
for y in 0..img.height() {
let mask_p = imageproc::drawing::Canvas::get_pixel(&mask_img, x, y);
if mask_p.0[0] > 100 {
let mut img_p = imageproc::drawing::Canvas::get_pixel(&img, x, y);
img_p.0[2] = 255 - (255 - img_p.0[2]) / 2;
img_p.0[1] /= 2;
img_p.0[0] /= 2;
imageproc::drawing::Canvas::draw_pixel(&mut img, x, y, img_p)
}
}
}
match point {
Some((x, y)) => {
let (x, y) = (
(x * img.width() as f64) as i32,
(y * img.height() as f64) as i32,
);
imageproc::drawing::draw_filled_circle(
&img,
(x, y),
3,
image::Rgba([255, 0, 0, 200]),
)
.save("sam_merged.jpg")?
}
None => img.save("sam_merged.jpg")?,
};
}
let (x, y) = (
(args.point_x * img.width() as f64) as i32,
(args.point_y * img.height() as f64) as i32,
);
imageproc::drawing::draw_filled_circle(&img, (x, y), 3, image::Rgba([255, 0, 0, 200]))
.save("sam_merged.jpg")?
}
Ok(())
}

View File

@ -34,17 +34,23 @@ impl Module for PatchEmbed {
}
}
// A custom op to make add_decomposed_rel_pos faster. Most of the time is spent on the final
// addition in the case where b = 12, q_h = q_w = 4096, k_h = k_w = 4096
// (attn.reshape((b, q_h, q_w, k_h, k_w))?
// + rel_h.unsqueeze(4)?.broadcast_add(&rel_w.unsqueeze(3)?)?)?
// .reshape((b, q_h * q_w, k_h * k_w))
// Ideally we would perform this operation in place but this is not supported in candle at the
// moment. We should also investigate using f16 rather than f32.
struct Add3(usize, usize, usize, usize, usize);
impl candle::CustomOp3 for Add3 {
#[derive(Debug)]
struct Attention {
qkv: crate::Linear,
proj: crate::Linear,
num_heads: usize,
scale: f64,
rel_pos_hw: Option<(Tensor, Tensor)>,
span: tracing::Span,
span_rel_pos: tracing::Span,
span_softmax: tracing::Span,
}
// rel_h = torch.einsum("bhwc,hkc->bhwk", r_q, Rh)
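// i.e. rel_h[b, h, w, k] = sum_c r_q[b, h, w, c] * Rh[h, k, c]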
struct EinSum1;
impl candle::CustomOp2 for EinSum1 {
fn name(&self) -> &'static str {
"add3"
"einsum1"
}
fn cpu_fwd(
@ -53,12 +59,14 @@ impl candle::CustomOp3 for Add3 {
l1: &candle::Layout,
s2: &candle::CpuStorage,
l2: &candle::Layout,
s3: &candle::CpuStorage,
l3: &candle::Layout,
) -> Result<(candle::CpuStorage, candle::Shape)> {
use rayon::prelude::*;
use candle::cpu::kernels::VecOps;
let Add3(b, q_h, q_w, k_h, k_w) = *self;
let (b, h, w, c) = l1.shape().dims4()?;
let (h2, k, c2) = l2.shape().dims3()?;
if c != c2 || h != h2 {
candle::bail!("shape mismatch {l1:?} {l2:?}")
}
let s1 = s1.as_slice::<f32>()?;
let s1 = match l1.contiguous_offsets() {
None => candle::bail!("input1 has to be contiguous"),
@ -69,48 +77,90 @@ impl candle::CustomOp3 for Add3 {
None => candle::bail!("input2 has to be contiguous"),
Some((o1, o2)) => &s2[o1..o2],
};
let s3 = s3.as_slice::<f32>()?;
let s3 = match l3.contiguous_offsets() {
None => candle::bail!("input3 has to be contiguous"),
Some((o1, o2)) => &s3[o1..o2],
};
let mut dst = vec![0f32; b * q_h * q_w * k_h * k_w];
dst.par_chunks_exact_mut(k_h * k_w)
.enumerate()
.for_each(|(b_idx, dst)| {
let s1_idx = b_idx * k_h * k_w;
let s2_idx = b_idx * k_h;
let s3_idx = b_idx * k_w;
for h_idx in 0..k_h {
let s1_idx = s1_idx + h_idx * k_w;
let s2_idx = s2_idx + h_idx;
let dst_idx = h_idx * k_w;
for w_idx in 0..k_w {
let s1_idx = s1_idx + w_idx;
let s3_idx = s3_idx + w_idx;
let dst_idx = dst_idx + w_idx;
dst[dst_idx] = s1[s1_idx] + s2[s2_idx] + s3[s3_idx]
let mut dst = vec![0f32; b * h * w * k];
for b_idx in 0..b {
let lhs_idx = b_idx * h * w * c;
let dst_idx = b_idx * h * w * k;
for h_idx in 0..h {
let lhs_idx = lhs_idx + h_idx * w * c;
let rhs_idx = h_idx * k * c;
let dst_idx = dst_idx + h_idx * w * k;
for w_idx in 0..w {
let lhs_idx = lhs_idx + w_idx * c;
let dst_idx = dst_idx + w_idx * k;
let lhs = &s1[lhs_idx..lhs_idx + c];
for k_idx in 0..k {
let rhs_idx = rhs_idx + k_idx * c;
let rhs = &s2[rhs_idx..rhs_idx + c];
let mut d = 0f32;
unsafe { f32::vec_dot(lhs.as_ptr(), rhs.as_ptr(), &mut d, c) };
dst[dst_idx + k_idx] += d;
}
}
});
let dst = candle::WithDType::to_cpu_storage_owned(dst);
Ok((dst, (b, q_h * q_w, k_h * k_w).into()))
}
}
let storage = candle::WithDType::to_cpu_storage_owned(dst);
Ok((storage, (b, h, w, k).into()))
}
}
#[derive(Debug)]
struct Attention {
qkv: super::Linear,
proj: super::Linear,
num_heads: usize,
scale: f64,
rel_pos_hw: Option<(Tensor, Tensor)>,
span: tracing::Span,
span_matmul: tracing::Span,
span_rel_pos: tracing::Span,
span_softmax: tracing::Span,
}
// rel_w = torch.einsum("bhwc,wkc->bhwk", r_q, Rw)
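// i.e. rel_w[b, h, w, k] = sum_c r_q[b, h, w, c] * Rw[w, k, c]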
struct EinSum2;
impl candle::CustomOp2 for EinSum2 {
fn name(&self) -> &'static str {
"einsum2"
}
fn cpu_fwd(
&self,
s1: &candle::CpuStorage,
l1: &candle::Layout,
s2: &candle::CpuStorage,
l2: &candle::Layout,
) -> Result<(candle::CpuStorage, candle::Shape)> {
use candle::cpu::kernels::VecOps;
let (b, h, w, c) = l1.shape().dims4()?;
let (w2, k, c2) = l2.shape().dims3()?;
if c != c2 || w != w2 {
candle::bail!("shape mismatch {l1:?} {l2:?}")
}
let s1 = s1.as_slice::<f32>()?;
let s1 = match l1.contiguous_offsets() {
None => candle::bail!("input1 has to be contiguous"),
Some((o1, o2)) => &s1[o1..o2],
};
let s2 = s2.as_slice::<f32>()?;
let s2 = match l2.contiguous_offsets() {
None => candle::bail!("input2 has to be contiguous"),
Some((o1, o2)) => &s2[o1..o2],
};
let mut dst = vec![0f32; b * h * w * k];
for b_idx in 0..b {
let lhs_idx = b_idx * h * w * c;
let dst_idx = b_idx * h * w * k;
for h_idx in 0..h {
let lhs_idx = lhs_idx + h_idx * w * c;
let dst_idx = dst_idx + h_idx * w * k;
for w_idx in 0..w {
let lhs_idx = lhs_idx + w_idx * c;
let rhs_idx = w_idx * k * c;
let dst_idx = dst_idx + w_idx * k;
let lhs = &s1[lhs_idx..lhs_idx + c];
for k_idx in 0..k {
let rhs_idx = rhs_idx + k_idx * c;
let rhs = &s2[rhs_idx..rhs_idx + c];
let mut d = 0f32;
unsafe { f32::vec_dot(lhs.as_ptr(), rhs.as_ptr(), &mut d, c) };
dst[dst_idx + k_idx] += d;
}
}
}
}
let storage = candle::WithDType::to_cpu_storage_owned(dst);
Ok((storage, (b, h, w, k).into()))
}
}
impl Attention {
fn new(
dim: usize,
@ -121,11 +171,10 @@ impl Attention {
vb: VarBuilder,
) -> Result<Self> {
let span = tracing::span!(tracing::Level::TRACE, "attention");
let span_matmul = tracing::span!(tracing::Level::TRACE, "attn-matmul");
let span_rel_pos = tracing::span!(tracing::Level::TRACE, "attn-rel-pos");
let span_softmax = tracing::span!(tracing::Level::TRACE, "attn-sm");
let qkv = super::linear(vb.pp("qkv"), dim, dim * 3, qkv_bias)?;
let proj = super::linear(vb.pp("proj"), dim, dim, true)?;
let qkv = crate::linear(vb.pp("qkv"), dim, dim * 3, qkv_bias)?;
let proj = crate::linear(vb.pp("proj"), dim, dim, true)?;
let head_dim = dim / num_heads;
let scale = 1. / (head_dim as f64).sqrt();
let rel_pos_hw = if use_rel_pos {
@ -142,7 +191,6 @@ impl Attention {
scale,
rel_pos_hw,
span,
span_matmul,
span_rel_pos,
span_softmax,
})
@ -157,27 +205,18 @@ impl Attention {
) -> Result<Tensor> {
match &self.rel_pos_hw {
Some((rel_pos_h, rel_pos_w)) => {
println!("{:?} {:?}", attn.layout(), q.layout());
let r_h = get_rel_pos(q_h, k_h, rel_pos_h)?;
let r_w = get_rel_pos(q_w, k_w, rel_pos_w)?;
let (b, _, dim) = q.dims3()?;
let r_q = q.reshape((b, q_h, q_w, dim))?;
// rel_h = torch.einsum("bhwc,hkc->bhwk", r_q, Rh)
let rel_h = r_q.matmul(&r_h.broadcast_left(b)?.t()?.contiguous()?)?;
let rel_h = r_q.apply_op2_no_bwd(&r_h, &EinSum1)?;
// rel_w = torch.einsum("bhwc,wkc->bhwk", r_q, Rw)
let rel_w = r_q
.transpose(1, 2)? // -> bwhc
.contiguous()?
.matmul(&r_w.broadcast_left(b)?.t()?.contiguous()?)? // bwhc,bwck -> bwhk
.transpose(1, 2)?
.contiguous()?;
if attn.device().is_cpu() {
let op = Add3(b, q_h, q_w, k_h, k_w);
attn.apply_op3_no_bwd(&rel_h, &rel_w, &op)
} else {
(attn.reshape((b, q_h, q_w, k_h, k_w))?
+ rel_h.unsqueeze(4)?.broadcast_add(&rel_w.unsqueeze(3)?)?)?
.reshape((b, q_h * q_w, k_h * k_w))
}
let rel_w = r_q.apply_op2_no_bwd(&r_w, &EinSum2)?;
(attn.reshape((b, q_h, q_w, k_h, k_w))?
+ rel_h.unsqueeze(4)?.broadcast_add(&rel_w.unsqueeze(3)?)?)?
.reshape((b, q_h * q_w, k_h * k_w))
}
None => Ok(attn),
}
@ -222,10 +261,7 @@ impl Module for Attention {
let q = qkv.i(0)?;
let k = qkv.i(1)?;
let v = qkv.i(2)?;
let attn = {
let _enter = self.span_matmul.enter();
(&q * self.scale)?.matmul(&k.t()?)?
};
let attn = (&q * self.scale)?.matmul(&k.t()?)?;
let attn = {
let _enter = self.span_rel_pos.enter();
self.add_decomposed_rel_pos(attn, &q, (h, w), (h, w))?
@ -234,10 +270,7 @@ impl Module for Attention {
let _enter = self.span_softmax.enter();
candle_nn::ops::softmax_last_dim(&attn)?
};
let attn = {
let _enter = self.span_matmul.enter();
attn.matmul(&v)?
};
let attn = attn.matmul(&v)?;
let attn = attn
.reshape((b, self.num_heads, h, w, c / self.num_heads))?
.permute((0, 2, 3, 1, 4))?
@ -251,7 +284,7 @@ struct Block {
norm1: LayerNorm,
attn: Attention,
norm2: LayerNorm,
mlp: super::MlpBlock,
mlp: crate::MlpBlock,
window_size: usize,
span: tracing::Span,
}
@ -281,7 +314,7 @@ impl Block {
input_size_attn,
vb.pp("attn"),
)?;
let mlp = super::MlpBlock::new(dim, dim * 4, candle_nn::Activation::Gelu, vb.pp("mlp"))?;
let mlp = crate::MlpBlock::new(dim, dim * 4, candle_nn::Activation::Gelu, vb.pp("mlp"))?;
let span = tracing::span!(tracing::Level::TRACE, "ie-block");
Ok(Self {
norm1,
@ -375,9 +408,9 @@ pub struct ImageEncoderViT {
patch_embed: PatchEmbed,
blocks: Vec<Block>,
neck_conv1: candle_nn::Conv2d,
neck_ln1: super::LayerNorm2d,
neck_ln1: crate::LayerNorm2d,
neck_conv2: candle_nn::Conv2d,
neck_ln2: super::LayerNorm2d,
neck_ln2: crate::LayerNorm2d,
pos_embed: Option<Tensor>,
span: tracing::Span,
}
@ -433,13 +466,13 @@ impl ImageEncoderViT {
Default::default(),
vb.pp("neck.0"),
)?;
let neck_ln1 = super::LayerNorm2d::new(out_chans, 1e-6, vb.pp("neck.1"))?;
let neck_ln1 = crate::LayerNorm2d::new(out_chans, 1e-6, vb.pp("neck.1"))?;
let cfg = candle_nn::Conv2dConfig {
padding: 1,
..Default::default()
};
let neck_conv2 = candle_nn::conv2d_no_bias(out_chans, out_chans, 3, cfg, vb.pp("neck.2"))?;
let neck_ln2 = super::LayerNorm2d::new(out_chans, 1e-6, vb.pp("neck.3"))?;
let neck_ln2 = crate::LayerNorm2d::new(out_chans, 1e-6, vb.pp("neck.3"))?;
let pos_embed = if use_abs_pos {
let p = vb.get(
(1, img_size / patch_size, img_size / patch_size, embed_dim),

View File

@ -1,11 +1,11 @@
use candle::{IndexOp, Result, Tensor};
use candle_nn::{Module, VarBuilder};
use super::transformer::TwoWayTransformer;
use crate::model_transformer::TwoWayTransformer;
#[derive(Debug)]
struct MlpMaskDecoder {
layers: Vec<super::Linear>,
layers: Vec<crate::Linear>,
sigmoid_output: bool,
span: tracing::Span,
}
@ -28,7 +28,7 @@ impl MlpMaskDecoder {
} else {
hidden_dim
};
let layer = super::linear(vb.pp(i), in_dim, out_dim, true)?;
let layer = crate::linear(vb.pp(i), in_dim, out_dim, true)?;
layers.push(layer)
}
let span = tracing::span!(tracing::Level::TRACE, "mlp-mask-decoder");
@ -64,7 +64,7 @@ pub struct MaskDecoder {
mask_tokens: candle_nn::Embedding,
iou_prediction_head: MlpMaskDecoder,
output_upscaling_conv1: candle_nn::ConvTranspose2d,
output_upscaling_ln: super::LayerNorm2d,
output_upscaling_ln: crate::LayerNorm2d,
output_upscaling_conv2: candle_nn::ConvTranspose2d,
num_mask_tokens: usize,
output_hypernetworks_mlps: Vec<MlpMaskDecoder>,
@ -104,7 +104,7 @@ impl MaskDecoder {
vb.pp("output_upscaling.0"),
)?;
let output_upscaling_ln =
super::LayerNorm2d::new(transformer_dim / 4, 1e-6, vb.pp("output_upscaling.1"))?;
crate::LayerNorm2d::new(transformer_dim / 4, 1e-6, vb.pp("output_upscaling.1"))?;
let output_upscaling_conv2 = candle_nn::conv_transpose2d(
transformer_dim / 4,
transformer_dim / 8,

View File

@ -56,9 +56,9 @@ pub struct PromptEncoder {
point_embeddings: Vec<candle_nn::Embedding>,
not_a_point_embed: candle_nn::Embedding,
mask_downscaling_conv1: candle_nn::Conv2d,
mask_downscaling_ln1: super::LayerNorm2d,
mask_downscaling_ln1: crate::LayerNorm2d,
mask_downscaling_conv2: candle_nn::Conv2d,
mask_downscaling_ln2: super::LayerNorm2d,
mask_downscaling_ln2: crate::LayerNorm2d,
mask_downscaling_conv3: candle_nn::Conv2d,
no_mask_embed: candle_nn::Embedding,
image_embedding_size: (usize, usize),
@ -100,9 +100,9 @@ impl PromptEncoder {
vb.pp("mask_downscaling.6"),
)?;
let mask_downscaling_ln1 =
super::LayerNorm2d::new(mask_in_chans / 4, 1e-6, vb.pp("mask_downscaling.1"))?;
crate::LayerNorm2d::new(mask_in_chans / 4, 1e-6, vb.pp("mask_downscaling.1"))?;
let mask_downscaling_ln2 =
super::LayerNorm2d::new(mask_in_chans, 1e-6, vb.pp("mask_downscaling.4"))?;
crate::LayerNorm2d::new(mask_in_chans, 1e-6, vb.pp("mask_downscaling.4"))?;
let mut point_embeddings = Vec::with_capacity(num_points_embeddings);
let vb_e = vb.pp("point_embeddings");
for i in 0..num_points_embeddings {

View File

@ -1,10 +1,9 @@
use candle::{DType, IndexOp, Result, Tensor};
use candle_nn::{Module, VarBuilder};
use super::image_encoder::ImageEncoderViT;
use super::mask_decoder::MaskDecoder;
use super::prompt_encoder::PromptEncoder;
use super::tiny_vit::{tiny_vit_5m, TinyViT};
use crate::model_image_encoder::ImageEncoderViT;
use crate::model_mask_decoder::MaskDecoder;
use crate::model_prompt_encoder::PromptEncoder;
const PROMPT_EMBED_DIM: usize = 256;
pub const IMAGE_SIZE: usize = 1024;
@ -15,24 +14,9 @@ const STABILITY_SCORE_THRESHOLD: f32 = 0.95;
const MODEL_MASK_THRESHOLD: f32 = 0.0;
const CROP_NMS_THRESH: f32 = 0.7;
#[derive(Debug)]
enum ImageEncoder {
Original(ImageEncoderViT),
TinyViT(TinyViT),
}
impl Module for ImageEncoder {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
match self {
Self::Original(vit) => vit.forward(xs),
Self::TinyViT(vit) => vit.forward(xs),
}
}
}
#[derive(Debug)]
pub struct Sam {
image_encoder: ImageEncoder,
image_encoder: ImageEncoderViT,
prompt_encoder: PromptEncoder,
mask_decoder: MaskDecoder,
pixel_mean: Tensor,
@ -83,7 +67,7 @@ impl Sam {
let pixel_std =
Tensor::new(&[58.395f32, 57.12, 57.375], vb.device())?.reshape((3, 1, 1))?;
Ok(Self {
image_encoder: ImageEncoder::Original(image_encoder),
image_encoder,
prompt_encoder,
mask_decoder,
pixel_std,
@ -91,42 +75,6 @@ impl Sam {
})
}
pub fn new_tiny(vb: VarBuilder) -> Result<Self> {
let image_embedding_size = IMAGE_SIZE / VIT_PATCH_SIZE;
let image_encoder = tiny_vit_5m(vb.pp("image_encoder"))?;
let prompt_encoder = PromptEncoder::new(
PROMPT_EMBED_DIM,
(image_embedding_size, image_embedding_size),
(IMAGE_SIZE, IMAGE_SIZE),
16,
vb.pp("prompt_encoder"),
)?;
let mask_decoder = MaskDecoder::new(
PROMPT_EMBED_DIM,
/* num_multitask_outputs */ 3,
/* iou_head_depth */ 3,
/* iou_head_hidden_dim */ 256,
vb.pp("mask_decoder"),
)?;
let pixel_mean =
Tensor::new(&[123.675f32, 116.28, 103.53], vb.device())?.reshape((3, 1, 1))?;
let pixel_std =
Tensor::new(&[58.395f32, 57.12, 57.375], vb.device())?.reshape((3, 1, 1))?;
Ok(Self {
image_encoder: ImageEncoder::TinyViT(image_encoder),
prompt_encoder,
mask_decoder,
pixel_std,
pixel_mean,
})
}
pub fn embeddings(&self, img: &Tensor) -> Result<Tensor> {
let img = self.preprocess(img)?.unsqueeze(0)?;
self.image_encoder.forward(&img)
}
pub fn forward(
&self,
img: &Tensor,
@ -136,50 +84,33 @@ impl Sam {
let (_c, original_h, original_w) = img.dims3()?;
let img = self.preprocess(img)?.unsqueeze(0)?;
let img_embeddings = self.image_encoder.forward(&img)?;
let (low_res_mask, iou) = self.forward_for_embeddings(
&img_embeddings,
original_h,
original_w,
point,
multimask_output,
)?;
let mask = low_res_mask
.upsample_nearest2d(IMAGE_SIZE, IMAGE_SIZE)?
.get(0)?
.i((.., ..original_h, ..original_w))?;
Ok((mask, iou))
}
pub fn forward_for_embeddings(
&self,
img_embeddings: &Tensor,
original_h: usize,
original_w: usize,
point: Option<(f64, f64)>,
multimask_output: bool,
) -> Result<(Tensor, Tensor)> {
let image_pe = self.prompt_encoder.get_dense_pe()?;
let points = match point {
None => None,
Some((x, y)) => {
let points = Tensor::new(
&[[[x as f32 * original_w as f32, y as f32 * original_h as f32]]],
img_embeddings.device(),
img.device(),
)?;
let labels = Tensor::ones((1, 1), DType::F32, img_embeddings.device())?;
let labels = Tensor::ones((1, 1), DType::F32, img.device())?;
Some((points, labels))
}
};
let points = points.as_ref().map(|(x, y)| (x, y));
let (sparse_prompt_embeddings, dense_prompt_embeddings) =
self.prompt_encoder.forward(points, None, None)?;
self.mask_decoder.forward(
img_embeddings,
let (low_res_mask, iou_predictions) = self.mask_decoder.forward(
&img_embeddings,
&image_pe,
&sparse_prompt_embeddings,
&dense_prompt_embeddings,
multimask_output,
)
)?;
let mask = low_res_mask
.upsample_nearest2d(IMAGE_SIZE, IMAGE_SIZE)?
.get(0)?
.i((.., ..original_h, ..original_w))?;
Ok((mask, iou_predictions))
}
pub fn unpreprocess(&self, img: &Tensor) -> Result<Tensor> {
@ -208,7 +139,7 @@ impl Sam {
img: &Tensor,
cb: CropBox,
point_grids: &[(f64, f64)],
) -> Result<Vec<crate::object_detection::Bbox<Tensor>>> {
) -> Result<Vec<candle_examples::object_detection::Bbox<Tensor>>> {
// Crop the image and calculate embeddings.
let img = img.i((.., cb.y0..cb.y1, cb.x0..cb.x1))?;
let img = self.preprocess(&img)?.unsqueeze(0)?;
@ -281,7 +212,7 @@ impl Sam {
let min_max_x = min_max_indexes(&low_res_mask_per_x);
let min_max_y = min_max_indexes(&low_res_mask_per_y);
if let Some(((x0, x1), (y0, y1))) = min_max_x.zip(min_max_y) {
let bbox = crate::object_detection::Bbox {
let bbox = candle_examples::object_detection::Bbox {
xmin: x0 as f32,
ymin: y0 as f32,
xmax: x1 as f32,
@ -299,7 +230,7 @@ impl Sam {
let mut bboxes = vec![bboxes];
// Remove duplicates within this crop.
crate::object_detection::non_maximum_suppression(&mut bboxes, CROP_NMS_THRESH);
candle_examples::object_detection::non_maximum_suppression(&mut bboxes, CROP_NMS_THRESH);
// TODO: Return to the original image frame.
Ok(bboxes.remove(0))
@ -312,7 +243,7 @@ impl Sam {
crop_n_layer: usize,
crop_overlap_ratio: f64,
crop_n_points_downscale_factor: usize,
) -> Result<Vec<crate::object_detection::Bbox<Tensor>>> {
) -> Result<Vec<candle_examples::object_detection::Bbox<Tensor>>> {
let (_c, h, w) = img.dims3()?;
let point_grids = build_all_layer_point_grids(
points_per_side,

View File

@ -68,7 +68,7 @@ struct TwoWayAttentionBlock {
norm1: LayerNorm,
cross_attn_token_to_image: Attention,
norm2: LayerNorm,
mlp: super::MlpBlock,
mlp: crate::MlpBlock,
norm3: LayerNorm,
norm4: LayerNorm,
cross_attn_image_to_token: Attention,
@ -100,7 +100,7 @@ impl TwoWayAttentionBlock {
2,
vb.pp("cross_attn_image_to_token"),
)?;
let mlp = super::MlpBlock::new(
let mlp = crate::MlpBlock::new(
embedding_dim,
mlp_dim,
candle_nn::Activation::Relu,

View File

@ -1,63 +0,0 @@
# candle-stable-diffusion: A Diffusers API in Rust/Candle
![rusty robot holding a candle](./assets/stable-diffusion-xl.jpg)
_A rusty robot holding a fire torch in its hand_, generated by Stable Diffusion
XL using Rust and [candle](https://github.com/huggingface/candle).
The `stable-diffusion` example is a conversion of
[diffusers-rs](https://github.com/LaurentMazare/diffusers-rs) using candle
rather than libtorch. This implementation supports Stable Diffusion v1.5, v2.1,
as well as Stable Diffusion XL 1.0.
## Getting the weights
The weights are automatically downloaded for you from the [HuggingFace
Hub](https://huggingface.co/) on the first run. There are various command line
flags to use local files instead; run with `--help` to learn about them.
## Running some example.
```bash
cargo run --example stable-diffusion --release --features=cuda,cudnn \
-- --prompt "a cosmonaut on a horse (hd, realistic, high-def)"
```
The final image is named `sd_final.png` by default.
The default scheduler is the Denoising Diffusion Implicit Model scheduler (DDIM). The
original paper and some code can be found in the [associated repo](https://github.com/ermongroup/ddim).
### Command-line flags
- `--prompt`: the prompt to be used to generate the image.
- `--uncond-prompt`: the optional unconditional prompt.
- `--sd-version`: the Stable Diffusion version to use, can be `v1-5`, `v2-1`, or
`xl`.
- `--cpu`: use the cpu rather than the gpu (much slower).
- `--height`, `--width`: set the height and width for the generated image.
- `--n-steps`: the number of steps to be used in the diffusion process.
- `--num-samples`: the number of samples to generate.
- `--final-image`: the filename for the generated image(s).
### Using flash-attention
Using flash attention makes image generation a lot faster and uses less memory.
The downside is some long compilation time. You can set the
`CANDLE_FLASH_ATTN_BUILD_DIR` environment variable to something like
`/home/user/.candle` to ensure that the compilation artifacts are properly
cached.
Enabling flash-attention requires both a build-time feature flag, `--features flash-attn`,
and the command line flag `--use-flash-attn`.
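As a sketch, combining the two could look like the following, reusing the prompt from the example above; the exact feature combination may vary with your setup:
```bash
cargo run --example stable-diffusion --release --features=cuda,cudnn,flash-attn \
  -- --use-flash-attn --prompt "a cosmonaut on a horse (hd, realistic, high-def)"
```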
## Image to Image Pipeline
...
## FAQ
### Memory Issues
This requires a GPU with more than 8GB of memory; as a fallback, the CPU version can be used
with the `--cpu` flag, but it is much slower.
Alternatively, reducing the height and width with the `--height` and `--width`
flag is likely to reduce memory usage significantly.
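For instance, assuming the default setup above, a lower-memory invocation might look like this; the 512x512 values are only an illustration, any smaller resolution shrinks the intermediate tensors and therefore the memory footprint:
```bash
cargo run --example stable-diffusion --release --features=cuda,cudnn \
  -- --prompt "a cosmonaut on a horse (hd, realistic, high-def)" --height 512 --width 512
```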

Binary file not shown.


View File

@ -17,7 +17,7 @@ impl GeGlu {
}
}
impl Module for GeGlu {
impl GeGlu {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let _enter = self.span.enter();
let hidden_states_and_gate = self.proj.forward(xs)?.chunk(2, D::Minus1)?;
@ -53,7 +53,7 @@ impl FeedForward {
}
}
impl Module for FeedForward {
impl FeedForward {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let _enter = self.span.enter();
let xs = self.project_in.forward(xs)?;
@ -78,7 +78,7 @@ fn flash_attn(_: &Tensor, _: &Tensor, _: &Tensor, _: f32, _: bool) -> Result<Ten
}
#[derive(Debug)]
pub struct CrossAttention {
struct CrossAttention {
to_q: nn::Linear,
to_k: nn::Linear,
to_v: nn::Linear,
@ -94,7 +94,7 @@ pub struct CrossAttention {
impl CrossAttention {
// Defaults should be heads = 8, dim_head = 64, context_dim = None
pub fn new(
fn new(
vs: nn::VarBuilder,
query_dim: usize,
context_dim: Option<usize>,
@ -205,7 +205,7 @@ impl CrossAttention {
self.reshape_batch_dim_to_heads(&xs)
}
pub fn forward(&self, xs: &Tensor, context: Option<&Tensor>) -> Result<Tensor> {
fn forward(&self, xs: &Tensor, context: Option<&Tensor>) -> Result<Tensor> {
let _enter = self.span.enter();
let query = self.to_q.forward(xs)?;
let context = context.unwrap_or(xs).contiguous()?;
@ -501,10 +501,8 @@ impl AttentionBlock {
xs.reshape((batch, t, self.num_heads, h_times_d / self.num_heads))?
.transpose(1, 2)
}
}
impl Module for AttentionBlock {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
pub fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let _enter = self.span.enter();
let in_dtype = xs.dtype();
let residual = xs;

View File

@ -14,7 +14,7 @@ pub enum Activation {
Gelu,
}
impl Module for Activation {
impl Activation {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
match self {
Activation::QuickGelu => xs * nn::ops::sigmoid(&(xs * 1.702f64)?)?,
@ -99,36 +99,6 @@ impl Config {
activation: Activation::Gelu,
}
}
// https://huggingface.co/warp-ai/wuerstchen/blob/main/text_encoder/config.json
pub fn wuerstchen() -> Self {
Self {
vocab_size: 49408,
embed_dim: 1024,
intermediate_size: 4096,
max_position_embeddings: 77,
pad_with: None,
num_hidden_layers: 24,
num_attention_heads: 16,
projection_dim: 1024,
activation: Activation::Gelu,
}
}
// https://huggingface.co/warp-ai/wuerstchen-prior/blob/main/text_encoder/config.json
pub fn wuerstchen_prior() -> Self {
Self {
vocab_size: 49408,
embed_dim: 1280,
intermediate_size: 5120,
max_position_embeddings: 77,
pad_with: None,
num_hidden_layers: 32,
num_attention_heads: 20,
projection_dim: 512,
activation: Activation::Gelu,
}
}
}
// CLIP Text Model
@ -159,7 +129,7 @@ impl ClipTextEmbeddings {
}
}
impl Module for ClipTextEmbeddings {
impl ClipTextEmbeddings {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let token_embedding = self.token_embedding.forward(xs)?;
let position_embedding = self.position_embedding.forward(&self.position_ids)?;
@ -349,39 +319,21 @@ impl ClipTextTransformer {
}
// https://github.com/huggingface/transformers/blob/674f750a57431222fa2832503a108df3badf1564/src/transformers/models/clip/modeling_clip.py#L678
fn build_causal_attention_mask(
bsz: usize,
seq_len: usize,
mask_after: usize,
device: &Device,
) -> Result<Tensor> {
fn build_causal_attention_mask(bsz: usize, seq_len: usize, device: &Device) -> Result<Tensor> {
let mask: Vec<_> = (0..seq_len)
.flat_map(|i| {
(0..seq_len).map(move |j| {
if j > i || j > mask_after {
f32::MIN
} else {
0.
}
})
})
.flat_map(|i| (0..seq_len).map(move |j| if j > i { f32::MIN } else { 0. }))
.collect();
let mask = Tensor::from_slice(&mask, (seq_len, seq_len), device)?;
mask.broadcast_as((bsz, seq_len, seq_len))
}
}
pub fn forward_with_mask(&self, xs: &Tensor, mask_after: usize) -> Result<Tensor> {
impl ClipTextTransformer {
pub fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let (bsz, seq_len) = xs.dims2()?;
let xs = self.embeddings.forward(xs)?;
let causal_attention_mask =
Self::build_causal_attention_mask(bsz, seq_len, mask_after, xs.device())?;
let causal_attention_mask = Self::build_causal_attention_mask(bsz, seq_len, xs.device())?;
let xs = self.encoder.forward(&xs, &causal_attention_mask)?;
self.final_layer_norm.forward(&xs)
}
}
impl Module for ClipTextTransformer {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
self.forward_with_mask(xs, usize::MAX)
}
}

View File

@ -7,7 +7,7 @@
//!
//! Denoising Diffusion Implicit Models, J. Song et al, 2020.
//! https://arxiv.org/abs/2010.02502
use super::schedulers::{betas_for_alpha_bar, BetaSchedule, PredictionType};
use crate::schedulers::{betas_for_alpha_bar, BetaSchedule, PredictionType};
use candle::{Result, Tensor};
/// The configuration for the DDIM scheduler.
@ -67,14 +67,14 @@ impl DDIMScheduler {
.rev()
.collect();
let betas = match config.beta_schedule {
BetaSchedule::ScaledLinear => super::utils::linspace(
BetaSchedule::ScaledLinear => crate::utils::linspace(
config.beta_start.sqrt(),
config.beta_end.sqrt(),
config.train_timesteps,
)?
.sqr()?,
BetaSchedule::Linear => {
super::utils::linspace(config.beta_start, config.beta_end, config.train_timesteps)?
crate::utils::linspace(config.beta_start, config.beta_end, config.train_timesteps)?
}
BetaSchedule::SquaredcosCapV2 => betas_for_alpha_bar(config.train_timesteps, 0.999)?,
};

View File

@ -17,8 +17,8 @@ impl TimestepEmbedding {
}
}
impl Module for TimestepEmbedding {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
impl TimestepEmbedding {
pub fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let xs = nn::ops::silu(&self.linear_1.forward(xs)?)?;
self.linear_2.forward(&xs)
}
@ -41,8 +41,8 @@ impl Timesteps {
}
}
impl Module for Timesteps {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
impl Timesteps {
pub fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let half_dim = (self.num_channels / 2) as u32;
let exponent = (Tensor::arange(0, half_dim, xs.device())?.to_dtype(candle::DType::F32)?
* -f64::ln(10000.))?;

View File

@ -4,10 +4,20 @@ extern crate accelerate_src;
#[cfg(feature = "mkl")]
extern crate intel_mkl_src;
use candle_transformers::models::stable_diffusion;
mod attention;
mod clip;
mod ddim;
mod embeddings;
mod resnet;
mod schedulers;
mod stable_diffusion;
mod unet_2d;
mod unet_2d_blocks;
mod utils;
mod vae;
use anyhow::{Error as E, Result};
use candle::{DType, Device, IndexOp, Module, Tensor, D};
use candle::{DType, Device, IndexOp, Tensor, D};
use clap::Parser;
use tokenizers::Tokenizer;

View File

@ -4,7 +4,7 @@
//!
//! Deep Residual Learning for Image Recognition, K. He et al., 2015.
//! https://arxiv.org/abs/1512.03385
use super::utils::{conv2d, Conv2d};
use crate::utils::{conv2d, Conv2d};
use candle::{Result, Tensor, D};
use candle_nn as nn;
use candle_nn::Module;

View File

@ -1,15 +1,5 @@
pub mod attention;
pub mod clip;
pub mod ddim;
pub mod ddpm;
pub mod embeddings;
pub mod resnet;
pub mod schedulers;
pub mod unet_2d;
pub mod unet_2d_blocks;
pub mod utils;
pub mod vae;
use crate::schedulers::PredictionType;
use crate::{clip, ddim, unet_2d, vae};
use candle::{DType, Device, Result};
use candle_nn as nn;
@ -90,7 +80,7 @@ impl StableDiffusionConfig {
sliced_attention_size: Option<usize>,
height: Option<usize>,
width: Option<usize>,
prediction_type: schedulers::PredictionType,
prediction_type: PredictionType,
) -> Self {
let bc = |out_channels, use_cross_attn, attention_head_dim| unet_2d::BlockConfig {
out_channels,
@ -164,7 +154,7 @@ impl StableDiffusionConfig {
sliced_attention_size,
height,
width,
schedulers::PredictionType::VPrediction,
PredictionType::VPrediction,
)
}
@ -172,7 +162,7 @@ impl StableDiffusionConfig {
sliced_attention_size: Option<usize>,
height: Option<usize>,
width: Option<usize>,
prediction_type: schedulers::PredictionType,
prediction_type: PredictionType,
) -> Self {
let bc = |out_channels, use_cross_attn, attention_head_dim| unet_2d::BlockConfig {
out_channels,
@ -245,7 +235,7 @@ impl StableDiffusionConfig {
height,
width,
// https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/scheduler/scheduler_config.json
schedulers::PredictionType::Epsilon,
PredictionType::Epsilon,
)
}

View File

@ -2,9 +2,9 @@
//!
//! The 2D Unet models take as input a noisy sample and the current diffusion
//! timestep and return a denoised version of the input.
use super::embeddings::{TimestepEmbedding, Timesteps};
use super::unet_2d_blocks::*;
use super::utils::{conv2d, Conv2d};
use crate::embeddings::{TimestepEmbedding, Timesteps};
use crate::unet_2d_blocks::*;
use crate::utils::{conv2d, Conv2d};
use candle::{Result, Tensor};
use candle_nn as nn;
use candle_nn::Module;

View File

@ -1,11 +1,11 @@
//! 2D UNet Building Blocks
//!
use super::attention::{
use crate::attention::{
AttentionBlock, AttentionBlockConfig, SpatialTransformer, SpatialTransformerConfig,
};
use super::resnet::{ResnetBlock2D, ResnetBlock2DConfig};
use super::utils::{conv2d, Conv2d};
use candle::{Module, Result, Tensor, D};
use crate::resnet::{ResnetBlock2D, ResnetBlock2DConfig};
use crate::utils::{conv2d, Conv2d};
use candle::{Result, Tensor, D};
use candle_nn as nn;
#[derive(Debug)]
@ -43,7 +43,7 @@ impl Downsample2D {
}
}
impl Module for Downsample2D {
impl Downsample2D {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let _enter = self.span.enter();
match &self.conv {
@ -172,8 +172,8 @@ impl DownEncoderBlock2D {
}
}
impl Module for DownEncoderBlock2D {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
impl DownEncoderBlock2D {
pub fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let _enter = self.span.enter();
let mut xs = xs.clone();
for resnet in self.resnets.iter() {
@ -256,8 +256,8 @@ impl UpDecoderBlock2D {
}
}
impl Module for UpDecoderBlock2D {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
impl UpDecoderBlock2D {
pub fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let _enter = self.span.enter();
let mut xs = xs.clone();
for resnet in self.resnets.iter() {

View File

@ -4,7 +4,7 @@
//! Auto-encoder models compress their input to a usually smaller latent space
//! before expanding it back to its original shape. This results in the latent values
//! compressing the original information.
use super::unet_2d_blocks::{
use crate::unet_2d_blocks::{
DownEncoderBlock2D, DownEncoderBlock2DConfig, UNetMidBlock2D, UNetMidBlock2DConfig,
UpDecoderBlock2D, UpDecoderBlock2DConfig,
};
@ -132,15 +132,14 @@ impl Encoder {
impl Encoder {
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let mut xs = xs.apply(&self.conv_in)?;
let mut xs = self.conv_in.forward(xs)?;
for down_block in self.down_blocks.iter() {
xs = xs.apply(down_block)?
xs = down_block.forward(&xs)?
}
let xs = self
.mid_block
.forward(&xs, None)?
.apply(&self.conv_norm_out)?;
nn::ops::silu(&xs)?.apply(&self.conv_out)
let xs = self.mid_block.forward(&xs, None)?;
let xs = self.conv_norm_out.forward(&xs)?;
let xs = nn::ops::silu(&xs)?;
self.conv_out.forward(&xs)
}
}
@ -303,7 +302,7 @@ impl DiagonalGaussianDistribution {
}
pub fn sample(&self) -> Result<Tensor> {
let sample = self.mean.randn_like(0., 1.);
let sample = Tensor::randn(0., 1f32, self.mean.shape(), self.mean.device());
&self.mean + &self.std * sample
}
}

View File

@ -1,25 +0,0 @@
# candle-t5
## Encoder-decoder example:
```bash
$ cargo run --example t5 --release -- --model-id "t5-small" --prompt "translate to German: A beautiful candle." --decode
...
Running on CPU, to run on GPU, build this example with `--features cuda`
Eine schöne Kerze.
9 tokens generated (2.42 token/s)
```
## Sentence embedding example:
```bash
$ cargo run --example t5 --release -- --model-id "t5-small" --prompt "A beautiful candle."
...
[[[ 0.0515, -0.0541, -0.0761, ..., -0.0392, 0.1511, -0.0265],
[-0.0974, 0.0998, -0.1659, ..., -0.2450, 0.1738, -0.0164],
[ 0.0624, -0.1024, 0.0430, ..., -0.1388, 0.0564, -0.2962],
[-0.0389, -0.1173, 0.0026, ..., 0.1064, -0.1065, 0.0990],
[ 0.1300, 0.0027, -0.0326, ..., 0.0026, -0.0317, 0.0851]]]
Tensor[[1, 5, 512], f32]
Took 303.766583ms
```

View File

@ -1,286 +0,0 @@
#[cfg(feature = "mkl")]
extern crate intel_mkl_src;
#[cfg(feature = "accelerate")]
extern crate accelerate_src;
use std::io::Write;
use std::path::PathBuf;
use candle_transformers::models::t5;
use anyhow::{anyhow, Error as E, Result};
use candle::{DType, Device, Tensor};
use candle_nn::VarBuilder;
use candle_transformers::generation::LogitsProcessor;
use clap::Parser;
use hf_hub::{api::sync::Api, Cache, Repo, RepoType};
use tokenizers::Tokenizer;
const DTYPE: DType = DType::F32;
#[derive(Parser, Debug, Clone)]
#[command(author, version, about, long_about = None)]
struct Args {
/// Run on CPU rather than on GPU.
#[arg(long)]
cpu: bool,
/// Run offline (you must have the files already cached)
#[arg(long)]
offline: bool,
/// Enable tracing (generates a trace-timestamp.json file).
#[arg(long)]
tracing: bool,
/// The model repository to use on the HuggingFace hub.
#[arg(long)]
model_id: Option<String>,
#[arg(long)]
revision: Option<String>,
/// Enable decoding.
#[arg(long)]
decode: bool,
// Enable/disable decoding.
#[arg(long, default_value = "false")]
use_cache: bool,
/// Use this prompt, otherwise compute sentence similarities.
#[arg(long)]
prompt: Option<String>,
/// L2 normalization for embeddings.
#[arg(long, default_value = "true")]
normalize_embeddings: bool,
/// The temperature used to generate samples.
#[arg(long, default_value_t = 0.8)]
temperature: f64,
/// Nucleus sampling probability cutoff.
#[arg(long)]
top_p: Option<f64>,
/// Penalty to be applied for repeating tokens, 1. means no penalty.
#[arg(long, default_value_t = 1.1)]
repeat_penalty: f32,
/// The context size to consider for the repeat penalty.
#[arg(long, default_value_t = 64)]
repeat_last_n: usize,
}
struct T5ModelBuilder {
device: Device,
config: t5::Config,
weights_filename: PathBuf,
}
impl T5ModelBuilder {
pub fn load(args: &Args) -> Result<(Self, Tokenizer)> {
let device = candle_examples::device(args.cpu)?;
let default_model = "t5-small".to_string();
let default_revision = "refs/pr/15".to_string();
let (model_id, revision) = match (args.model_id.to_owned(), args.revision.to_owned()) {
(Some(model_id), Some(revision)) => (model_id, revision),
(Some(model_id), None) => (model_id, "main".to_string()),
(None, Some(revision)) => (default_model, revision),
(None, None) => (default_model, default_revision),
};
let repo = Repo::with_revision(model_id, RepoType::Model, revision);
let (config_filename, tokenizer_filename, weights_filename) = if args.offline {
let cache = Cache::default().repo(repo);
(
cache
.get("config.json")
.ok_or(anyhow!("Missing config file in cache"))?,
cache
.get("tokenizer.json")
.ok_or(anyhow!("Missing tokenizer file in cache"))?,
cache
.get("model.safetensors")
.ok_or(anyhow!("Missing weights file in cache"))?,
)
} else {
let api = Api::new()?;
let api = api.repo(repo);
(
api.get("config.json")?,
api.get("tokenizer.json")?,
api.get("model.safetensors")?,
)
};
let config = std::fs::read_to_string(config_filename)?;
let mut config: t5::Config = serde_json::from_str(&config)?;
config.use_cache = args.use_cache;
let tokenizer = Tokenizer::from_file(tokenizer_filename).map_err(E::msg)?;
Ok((
Self {
device,
config,
weights_filename,
},
tokenizer,
))
}
pub fn build_encoder(&self) -> Result<t5::T5EncoderModel> {
let weights =
unsafe { candle::safetensors::MmapedFile::new(self.weights_filename.clone())? };
let weights = weights.deserialize()?;
let vb = VarBuilder::from_safetensors(vec![weights], DTYPE, &self.device);
Ok(t5::T5EncoderModel::load(vb, &self.config)?)
}
pub fn build_conditional_generation(&self) -> Result<t5::T5ForConditionalGeneration> {
let weights =
unsafe { candle::safetensors::MmapedFile::new(self.weights_filename.clone())? };
let weights = weights.deserialize()?;
let vb = VarBuilder::from_safetensors(vec![weights], DTYPE, &self.device);
Ok(t5::T5ForConditionalGeneration::load(vb, &self.config)?)
}
}
fn main() -> Result<()> {
let args = Args::parse();
let (builder, mut tokenizer) = T5ModelBuilder::load(&args)?;
let device = &builder.device;
let tokenizer = tokenizer
.with_padding(None)
.with_truncation(None)
.map_err(E::msg)?;
match args.prompt {
Some(prompt) => {
let tokens = tokenizer
.encode(prompt, true)
.map_err(E::msg)?
.get_ids()
.to_vec();
let input_token_ids = Tensor::new(&tokens[..], device)?.unsqueeze(0)?;
if !args.decode {
let mut model = builder.build_encoder()?;
let start = std::time::Instant::now();
let ys = model.forward(&input_token_ids)?;
println!("{ys}");
println!("Took {:?}", start.elapsed());
} else {
let mut model = builder.build_conditional_generation()?;
let mut output_token_ids = [builder.config.pad_token_id as u32].to_vec();
let temperature = if args.temperature <= 0. {
None
} else {
Some(args.temperature)
};
let mut logits_processor = LogitsProcessor::new(299792458, temperature, args.top_p);
let encoder_output = model.encode(&input_token_ids)?;
let start = std::time::Instant::now();
for index in 0.. {
if output_token_ids.len() > 512 {
break;
}
let decoder_token_ids = if index == 0 || !builder.config.use_cache {
Tensor::new(output_token_ids.as_slice(), device)?.unsqueeze(0)?
} else {
let last_token = *output_token_ids.last().unwrap();
Tensor::new(&[last_token], device)?.unsqueeze(0)?
};
let logits = model
.decode(&decoder_token_ids, &encoder_output)?
.squeeze(0)?;
let logits = if args.repeat_penalty == 1. {
logits
} else {
let start_at = tokens.len().saturating_sub(args.repeat_last_n);
candle_transformers::utils::apply_repeat_penalty(
&logits,
args.repeat_penalty,
&tokens[start_at..],
)?
};
let next_token_id = logits_processor.sample(&logits)?;
if next_token_id as usize == builder.config.eos_token_id {
break;
}
output_token_ids.push(next_token_id);
if let Some(text) = tokenizer.id_to_token(next_token_id) {
let text = text.replace('▁', " ").replace("<0x0A>", "\n");
print!("{text}");
std::io::stdout().flush()?;
}
}
let dt = start.elapsed();
println!(
"\n{} tokens generated ({:.2} token/s)\n",
tokens.len(),
tokens.len() as f64 / dt.as_secs_f64(),
);
}
}
None => {
let mut model = builder.build_encoder()?;
let sentences = [
"The cat sits outside",
"A man is playing guitar",
"I love pasta",
"The new movie is awesome",
"The cat plays in the garden",
"A woman watches TV",
"The new movie is so great",
"Do you like pizza?",
];
let n_sentences = sentences.len();
let mut all_embeddings = Vec::with_capacity(n_sentences);
for sentence in sentences {
let tokens = tokenizer
.encode(sentence, true)
.map_err(E::msg)?
.get_ids()
.to_vec();
let token_ids = Tensor::new(&tokens[..], model.device())?.unsqueeze(0)?;
let embeddings = model.forward(&token_ids)?;
println!("generated embeddings {:?}", embeddings.shape());
// Apply some avg-pooling by taking the mean embedding value for all tokens (including padding)
let (_n_sentence, n_tokens, _hidden_size) = embeddings.dims3()?;
let embeddings = (embeddings.sum(1)? / (n_tokens as f64))?;
let embeddings = if args.normalize_embeddings {
normalize_l2(&embeddings)?
} else {
embeddings
};
println!("pooled embeddings {:?}", embeddings.shape());
all_embeddings.push(embeddings)
}
let mut similarities = vec![];
for (i, e_i) in all_embeddings.iter().enumerate() {
for (j, e_j) in all_embeddings
.iter()
.enumerate()
.take(n_sentences)
.skip(i + 1)
{
let sum_ij = (e_i * e_j)?.sum_all()?.to_scalar::<f32>()?;
let sum_i2 = (e_i * e_i)?.sum_all()?.to_scalar::<f32>()?;
let sum_j2 = (e_j * e_j)?.sum_all()?.to_scalar::<f32>()?;
let cosine_similarity = sum_ij / (sum_i2 * sum_j2).sqrt();
similarities.push((cosine_similarity, i, j))
}
}
similarities.sort_by(|u, v| v.0.total_cmp(&u.0));
for &(score, i, j) in similarities[..5].iter() {
println!("score: {score:.2} '{}' '{}'", sentences[i], sentences[j])
}
}
}
Ok(())
}
pub fn normalize_l2(v: &Tensor) -> Result<Tensor> {
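// L2-normalize along dim 1: divide each row by the square root of the sum of its squared elements.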
Ok(v.broadcast_div(&v.sqr()?.sum_keepdim(1)?.sqrt()?)?)
}

View File

@ -1,39 +0,0 @@
# candle-whisper: speech recognition
An implementation of [OpenAI Whisper](https://github.com/openai/whisper) using
candle. Whisper is a general-purpose speech recognition model; it can be used to
convert audio files (in the `.wav` format) to text. Supported features include
language detection as well as multilingual speech recognition.
## Running some example
If no audio file is passed as input, a [sample
file](https://huggingface.co/datasets/Narsil/candle-examples/resolve/main/samples_jfk.wav) is automatically downloaded
from the hub.
```bash
cargo run --example whisper --release
> No audio file submitted: Downloading https://huggingface.co/datasets/Narsil/candle_demo/blob/main/samples_jfk.wav
> loaded wav data: Header { audio_format: 1, channel_count: 1, sampling_rate: 16000, bytes_per_second: 32000, bytes_per_sample: 2, bits_per_sample: 16 }
> pcm data loaded 176000
> loaded mel: [1, 80, 3000]
> 0.0s -- 30.0s: And so my fellow Americans ask not what your country can do for you ask what you can do for your country
```
In order to use the multilingual mode, specify a multilingual model via the
`--model` flag; see the details below.
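For instance, a multilingual run might look like the sketch below; `medium` is one of the multilingual models listed under `--model`, and the input path is a placeholder for your own 16kHz wav file:
```bash
cargo run --example whisper --release -- --model medium --input /path/to/audio.wav
```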
## Command line flags
- `--input`: the audio file to be converted to text, in wav format.
- `--language`: force the language to some specific value rather than being
detected, e.g. `en`.
- `--task`: the task to be performed, can be `transcribe` (return the text data
in the original language) or `translate` (translate the text to English).
- `--timestamps`: enable the timestamp mode where some timestamps are reported
for each recognized audio extract.
- `--model`: the model to be used. Models that do not end with `-en` are
multilingual models; the other ones are English-only models. The supported models
are `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`,
`medium.en`, `large`, and `large-v2`.

View File

@ -198,13 +198,17 @@ fn log_mel_spectrogram_<T: Float + std::fmt::Display>(
mel
}
pub fn pcm_to_mel<T: Float + std::fmt::Display>(samples: &[T], filters: &[T]) -> Vec<T> {
log_mel_spectrogram_(
pub fn pcm_to_mel<T: Float + std::fmt::Display>(
samples: &[T],
filters: &[T],
) -> anyhow::Result<Vec<T>> {
let mel = log_mel_spectrogram_(
samples,
filters,
super::N_FFT,
super::HOP_LENGTH,
super::N_MELS,
false,
)
);
Ok(mel)
}

View File

@ -10,16 +10,41 @@ extern crate accelerate_src;
extern crate intel_mkl_src;
use anyhow::{Error as E, Result};
use candle::{Device, IndexOp, Tensor};
use candle::{DType, Device, IndexOp, Tensor};
use candle_nn::{ops::softmax, VarBuilder};
use clap::{Parser, ValueEnum};
use hf_hub::{api::sync::Api, Repo, RepoType};
use rand::{distributions::Distribution, SeedableRng};
use tokenizers::Tokenizer;
mod multilingual;
use candle_transformers::models::whisper::{self as m, audio, model};
mod audio;
mod model;
use model::{Config, Whisper};
mod multilingual;
const DTYPE: DType = DType::F32;
// Audio parameters.
const SAMPLE_RATE: usize = 16000;
const N_FFT: usize = 400;
const N_MELS: usize = 80;
const HOP_LENGTH: usize = 160;
const CHUNK_LENGTH: usize = 30;
const N_SAMPLES: usize = CHUNK_LENGTH * SAMPLE_RATE; // 480000 samples in a 30-second chunk
const N_FRAMES: usize = N_SAMPLES / HOP_LENGTH; // 3000 frames in a mel spectrogram input
const NO_SPEECH_THRESHOLD: f64 = 0.6;
const LOGPROB_THRESHOLD: f64 = -1.0;
const TEMPERATURES: [f64; 6] = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0];
const COMPRESSION_RATIO_THRESHOLD: f64 = 2.4;
// Tokenizer dependent bits.
const SOT_TOKEN: &str = "<|startoftranscript|>";
const TRANSCRIBE_TOKEN: &str = "<|transcribe|>";
const TRANSLATE_TOKEN: &str = "<|translate|>";
const NO_TIMESTAMPS_TOKEN: &str = "<|notimestamps|>";
const EOT_TOKEN: &str = "<|endoftext|>";
const NO_SPEECH_TOKEN: &str = "<|nocaptions|>";
#[allow(dead_code)]
#[derive(Debug, Clone)]
@ -69,7 +94,7 @@ impl Decoder {
timestamps: bool,
verbose: bool,
) -> Result<Self> {
let no_timestamps_token = token_id(&tokenizer, m::NO_TIMESTAMPS_TOKEN)?;
let no_timestamps_token = token_id(&tokenizer, NO_TIMESTAMPS_TOKEN)?;
// Suppress the notimestamps token when in timestamps mode.
// https://github.com/openai/whisper/blob/e8622f9afc4eba139bf796c210f5c01081000472/whisper/decoding.py#L452
let suppress_tokens: Vec<f32> = (0..model.config.vocab_size as u32)
@ -84,11 +109,11 @@ impl Decoder {
})
.collect();
let suppress_tokens = Tensor::new(suppress_tokens.as_slice(), device)?;
let sot_token = token_id(&tokenizer, m::SOT_TOKEN)?;
let transcribe_token = token_id(&tokenizer, m::TRANSCRIBE_TOKEN)?;
let translate_token = token_id(&tokenizer, m::TRANSLATE_TOKEN)?;
let eot_token = token_id(&tokenizer, m::EOT_TOKEN)?;
let no_speech_token = token_id(&tokenizer, m::NO_SPEECH_TOKEN)?;
let sot_token = token_id(&tokenizer, SOT_TOKEN)?;
let transcribe_token = token_id(&tokenizer, TRANSCRIBE_TOKEN)?;
let translate_token = token_id(&tokenizer, TRANSLATE_TOKEN)?;
let eot_token = token_id(&tokenizer, EOT_TOKEN)?;
let no_speech_token = token_id(&tokenizer, NO_SPEECH_TOKEN)?;
Ok(Self {
model,
rng: rand::rngs::StdRng::seed_from_u64(seed),
@ -195,17 +220,17 @@ impl Decoder {
}
fn decode_with_fallback(&mut self, segment: &Tensor) -> Result<DecodingResult> {
for (i, &t) in m::TEMPERATURES.iter().enumerate() {
for (i, &t) in TEMPERATURES.iter().enumerate() {
let dr: Result<DecodingResult> = self.decode(segment, t);
if i == m::TEMPERATURES.len() - 1 {
if i == TEMPERATURES.len() - 1 {
return dr;
}
// On errors, we try again with a different temperature.
match dr {
Ok(dr) => {
let needs_fallback = dr.compression_ratio > m::COMPRESSION_RATIO_THRESHOLD
|| dr.avg_logprob < m::LOGPROB_THRESHOLD;
if !needs_fallback || dr.no_speech_prob > m::NO_SPEECH_THRESHOLD {
let needs_fallback = dr.compression_ratio > COMPRESSION_RATIO_THRESHOLD
|| dr.avg_logprob < LOGPROB_THRESHOLD;
if !needs_fallback || dr.no_speech_prob > NO_SPEECH_THRESHOLD {
return Ok(dr);
}
}
@ -223,13 +248,13 @@ impl Decoder {
let mut segments = vec![];
while seek < content_frames {
let start = std::time::Instant::now();
let time_offset = (seek * m::HOP_LENGTH) as f64 / m::SAMPLE_RATE as f64;
let segment_size = usize::min(content_frames - seek, m::N_FRAMES);
let time_offset = (seek * HOP_LENGTH) as f64 / SAMPLE_RATE as f64;
let segment_size = usize::min(content_frames - seek, N_FRAMES);
let mel_segment = mel.narrow(2, seek, segment_size)?;
let segment_duration = (segment_size * m::HOP_LENGTH) as f64 / m::SAMPLE_RATE as f64;
let segment_duration = (segment_size * HOP_LENGTH) as f64 / SAMPLE_RATE as f64;
let dr = self.decode_with_fallback(&mel_segment)?;
seek += segment_size;
if dr.no_speech_prob > m::NO_SPEECH_THRESHOLD && dr.avg_logprob < m::LOGPROB_THRESHOLD {
if dr.no_speech_prob > NO_SPEECH_THRESHOLD && dr.avg_logprob < LOGPROB_THRESHOLD {
println!("no speech detected, skipping {seek} {dr:?}");
continue;
}
@ -467,8 +492,8 @@ fn main() -> Result<()> {
let mut input = std::fs::File::open(input)?;
let (header, data) = wav::read(&mut input)?;
println!("loaded wav data: {header:?}");
if header.sampling_rate != m::SAMPLE_RATE as u32 {
anyhow::bail!("wav file must have a {} sampling rate", m::SAMPLE_RATE)
if header.sampling_rate != SAMPLE_RATE as u32 {
anyhow::bail!("wav file must have a {} sampling rate", SAMPLE_RATE)
}
let data = data.as_sixteen().expect("expected 16 bit wav file");
let pcm_data: Vec<_> = data[..data.len() / header.channel_count as usize]
@ -476,14 +501,14 @@ fn main() -> Result<()> {
.map(|v| *v as f32 / 32768.)
.collect();
println!("pcm data loaded {}", pcm_data.len());
let mel = audio::pcm_to_mel(&pcm_data, &mel_filters);
let mel = audio::pcm_to_mel(&pcm_data, &mel_filters)?;
let mel_len = mel.len();
let mel = Tensor::from_vec(mel, (1, m::N_MELS, mel_len / m::N_MELS), &device)?;
let mel = Tensor::from_vec(mel, (1, N_MELS, mel_len / N_MELS), &device)?;
println!("loaded mel: {:?}", mel.dims());
let weights = unsafe { candle::safetensors::MmapedFile::new(weights_filename)? };
let weights = weights.deserialize()?;
let vb = VarBuilder::from_safetensors(vec![weights], m::DTYPE, &device);
let vb = VarBuilder::from_safetensors(vec![weights], DTYPE, &device);
let config: Config = serde_json::from_str(&std::fs::read_to_string(config_filename)?)?;
let mut model = Whisper::load(&vb, config)?;

View File

@ -1,5 +1,5 @@
use candle::{Device, IndexOp, Result, Tensor, D};
use candle_nn::{Conv1d, Conv1dConfig, Embedding, LayerNorm, Module, VarBuilder};
use candle_nn::{ops::softmax, Conv1d, Conv1dConfig, Embedding, LayerNorm, Module, VarBuilder};
use serde::Deserialize;
// The names in comments correspond to the original implementation:
@ -166,7 +166,7 @@ impl MultiHeadAttention {
}
let w = {
let _enter = self.softmax_span.enter();
candle_nn::ops::softmax_last_dim(&qk)?
softmax(&qk, D::Minus1)?
};
let wv = {
let _enter = self.matmul_span.enter();

View File

@ -113,7 +113,7 @@ pub fn detect_language(model: &mut Whisper, tokenizer: &Tokenizer, mel: &Tensor)
.iter()
.map(|(t, _)| crate::token_id(tokenizer, &format!("<|{t}|>")))
.collect::<Result<Vec<_>>>()?;
let sot_token = crate::token_id(tokenizer, crate::m::SOT_TOKEN)?;
let sot_token = crate::token_id(tokenizer, crate::SOT_TOKEN)?;
let audio_features = model.encoder.forward(&mel, true)?;
let tokens = Tensor::new(&[[sot_token]], device)?;
let language_token_ids = Tensor::new(language_token_ids.as_slice(), device)?;

View File

@ -1,391 +0,0 @@
#![allow(unused)]
#[cfg(feature = "accelerate")]
extern crate accelerate_src;
#[cfg(feature = "mkl")]
extern crate intel_mkl_src;
use candle_transformers::models::stable_diffusion;
use candle_transformers::models::wuerstchen;
use anyhow::{Error as E, Result};
use candle::{DType, Device, IndexOp, Module, Tensor, D};
use clap::Parser;
use tokenizers::Tokenizer;
const PRIOR_GUIDANCE_SCALE: f64 = 8.0;
const RESOLUTION_MULTIPLE: f64 = 42.67;
const LATENT_DIM_SCALE: f64 = 10.67;
const PRIOR_CIN: usize = 16;
const DECODER_CIN: usize = 4;
#[derive(Parser)]
#[command(author, version, about, long_about = None)]
struct Args {
/// The prompt to be used for image generation.
#[arg(
long,
default_value = "A very realistic photo of a rusty robot walking on a sandy beach"
)]
prompt: String,
#[arg(long, default_value = "")]
uncond_prompt: String,
/// Run on CPU rather than on GPU.
#[arg(long)]
cpu: bool,
/// Enable tracing (generates a trace-timestamp.json file).
#[arg(long)]
tracing: bool,
/// The height in pixels of the generated image.
#[arg(long)]
height: Option<usize>,
/// The width in pixels of the generated image.
#[arg(long)]
width: Option<usize>,
/// The decoder weight file, in .safetensors format.
#[arg(long, value_name = "FILE")]
decoder_weights: Option<String>,
/// The CLIP weight file, in .safetensors format.
#[arg(long, value_name = "FILE")]
clip_weights: Option<String>,
/// The CLIP weight file used by the prior model, in .safetensors format.
#[arg(long, value_name = "FILE")]
prior_clip_weights: Option<String>,
/// The prior weight file, in .safetensors format.
#[arg(long, value_name = "FILE")]
prior_weights: Option<String>,
/// The VQGAN weight file, in .safetensors format.
#[arg(long, value_name = "FILE")]
vqgan_weights: Option<String>,
#[arg(long, value_name = "FILE")]
/// The file specifying the tokenizer to be used for tokenization.
tokenizer: Option<String>,
#[arg(long, value_name = "FILE")]
/// The file specifying the tokenizer to be used for prior tokenization.
prior_tokenizer: Option<String>,
/// The size of the sliced attention or 0 for automatic slicing (disabled by default)
#[arg(long)]
sliced_attention_size: Option<usize>,
/// The number of steps to run the diffusion for.
#[arg(long, default_value_t = 30)]
n_steps: usize,
/// The number of samples to generate.
#[arg(long, default_value_t = 1)]
num_samples: i64,
/// The name of the final image to generate.
#[arg(long, value_name = "FILE", default_value = "sd_final.png")]
final_image: String,
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModelFile {
Tokenizer,
PriorTokenizer,
Clip,
PriorClip,
Decoder,
VqGan,
Prior,
}
impl ModelFile {
fn get(&self, filename: Option<String>) -> Result<std::path::PathBuf> {
use hf_hub::api::sync::Api;
match filename {
Some(filename) => Ok(std::path::PathBuf::from(filename)),
None => {
let repo_main = "warp-ai/wuerstchen";
let repo_prior = "warp-ai/wuerstchen-prior";
let (repo, path) = match self {
Self::Tokenizer => (repo_main, "tokenizer/tokenizer.json"),
Self::PriorTokenizer => (repo_prior, "tokenizer/tokenizer.json"),
Self::Clip => (repo_main, "text_encoder/model.safetensors"),
Self::PriorClip => (repo_prior, "text_encoder/model.safetensors"),
Self::Decoder => (repo_main, "decoder/diffusion_pytorch_model.safetensors"),
Self::VqGan => (repo_main, "vqgan/diffusion_pytorch_model.safetensors"),
Self::Prior => (repo_prior, "prior/diffusion_pytorch_model.safetensors"),
};
let filename = Api::new()?.model(repo.to_string()).get(path)?;
Ok(filename)
}
}
}
}
fn output_filename(
basename: &str,
sample_idx: i64,
num_samples: i64,
timestep_idx: Option<usize>,
) -> String {
let filename = if num_samples > 1 {
match basename.rsplit_once('.') {
None => format!("{basename}.{sample_idx}.png"),
Some((filename_no_extension, extension)) => {
format!("{filename_no_extension}.{sample_idx}.{extension}")
}
}
} else {
basename.to_string()
};
match timestep_idx {
None => filename,
Some(timestep_idx) => match filename.rsplit_once('.') {
None => format!("{filename}-{timestep_idx}.png"),
Some((filename_no_extension, extension)) => {
format!("{filename_no_extension}-{timestep_idx}.{extension}")
}
},
}
}
fn encode_prompt(
prompt: &str,
uncond_prompt: Option<&str>,
tokenizer: std::path::PathBuf,
clip_weights: std::path::PathBuf,
clip_config: stable_diffusion::clip::Config,
device: &Device,
) -> Result<Tensor> {
let tokenizer = Tokenizer::from_file(tokenizer).map_err(E::msg)?;
let pad_id = match &clip_config.pad_with {
Some(padding) => *tokenizer.get_vocab(true).get(padding.as_str()).unwrap(),
None => *tokenizer.get_vocab(true).get("<|endoftext|>").unwrap(),
};
println!("Running with prompt \"{prompt}\".");
let mut tokens = tokenizer
.encode(prompt, true)
.map_err(E::msg)?
.get_ids()
.to_vec();
let tokens_len = tokens.len();
while tokens.len() < clip_config.max_position_embeddings {
tokens.push(pad_id)
}
let tokens = Tensor::new(tokens.as_slice(), device)?.unsqueeze(0)?;
println!("Building the clip transformer.");
let text_model =
stable_diffusion::build_clip_transformer(&clip_config, clip_weights, device, DType::F32)?;
let text_embeddings = text_model.forward_with_mask(&tokens, tokens_len - 1)?;
match uncond_prompt {
None => Ok(text_embeddings),
Some(uncond_prompt) => {
let mut uncond_tokens = tokenizer
.encode(uncond_prompt, true)
.map_err(E::msg)?
.get_ids()
.to_vec();
let uncond_tokens_len = uncond_tokens.len();
while uncond_tokens.len() < clip_config.max_position_embeddings {
uncond_tokens.push(pad_id)
}
let uncond_tokens = Tensor::new(uncond_tokens.as_slice(), device)?.unsqueeze(0)?;
let uncond_embeddings =
text_model.forward_with_mask(&uncond_tokens, uncond_tokens_len - 1)?;
let text_embeddings = Tensor::cat(&[text_embeddings, uncond_embeddings], 0)?;
Ok(text_embeddings)
}
}
}
fn run(args: Args) -> Result<()> {
use tracing_chrome::ChromeLayerBuilder;
use tracing_subscriber::prelude::*;
let Args {
prompt,
uncond_prompt,
cpu,
height,
width,
n_steps,
tokenizer,
final_image,
sliced_attention_size,
num_samples,
clip_weights,
prior_weights,
vqgan_weights,
decoder_weights,
tracing,
..
} = args;
let _guard = if tracing {
let (chrome_layer, guard) = ChromeLayerBuilder::new().build();
tracing_subscriber::registry().with(chrome_layer).init();
Some(guard)
} else {
None
};
let device = candle_examples::device(cpu)?;
let height = height.unwrap_or(1024);
let width = width.unwrap_or(1024);
let prior_text_embeddings = {
let tokenizer = ModelFile::PriorTokenizer.get(args.prior_tokenizer)?;
let weights = ModelFile::PriorClip.get(args.prior_clip_weights)?;
encode_prompt(
&prompt,
Some(&uncond_prompt),
tokenizer.clone(),
weights,
stable_diffusion::clip::Config::wuerstchen_prior(),
&device,
)?
};
println!("generated prior text embeddings {prior_text_embeddings:?}");
let text_embeddings = {
let tokenizer = ModelFile::Tokenizer.get(tokenizer)?;
let weights = ModelFile::Clip.get(clip_weights)?;
encode_prompt(
&prompt,
None,
tokenizer.clone(),
weights,
stable_diffusion::clip::Config::wuerstchen(),
&device,
)?
};
println!("generated text embeddings {text_embeddings:?}");
println!("Building the prior.");
let b_size = 1;
let image_embeddings = {
// https://huggingface.co/warp-ai/wuerstchen-prior/blob/main/prior/config.json
let latent_height = (height as f64 / RESOLUTION_MULTIPLE).ceil() as usize;
let latent_width = (width as f64 / RESOLUTION_MULTIPLE).ceil() as usize;
let mut latents = Tensor::randn(
0f32,
1f32,
(b_size, PRIOR_CIN, latent_height, latent_width),
&device,
)?;
let prior = {
let prior_weights = ModelFile::Prior.get(prior_weights)?;
let weights = unsafe { candle::safetensors::MmapedFile::new(prior_weights)? };
let weights = weights.deserialize()?;
let vb = candle_nn::VarBuilder::from_safetensors(vec![weights], DType::F32, &device);
wuerstchen::prior::WPrior::new(
/* c_in */ PRIOR_CIN, /* c */ 1536, /* c_cond */ 1280,
/* c_r */ 64, /* depth */ 32, /* nhead */ 24, vb,
)?
};
let prior_scheduler = wuerstchen::ddpm::DDPMWScheduler::new(60, Default::default())?;
let timesteps = prior_scheduler.timesteps();
println!("prior denoising");
for (index, &t) in timesteps.iter().enumerate() {
let start_time = std::time::Instant::now();
if index == timesteps.len() - 1 {
continue;
}
let latent_model_input = Tensor::cat(&[&latents, &latents], 0)?;
let ratio = (Tensor::ones(2, DType::F32, &device)? * t)?;
let noise_pred = prior.forward(&latent_model_input, &ratio, &prior_text_embeddings)?;
let noise_pred = noise_pred.chunk(2, 0)?;
let (noise_pred_text, noise_pred_uncond) = (&noise_pred[0], &noise_pred[1]);
let noise_pred = (noise_pred_uncond
+ ((noise_pred_text - noise_pred_uncond)? * PRIOR_GUIDANCE_SCALE)?)?;
latents = prior_scheduler.step(&noise_pred, t, &latents)?;
let dt = start_time.elapsed().as_secs_f32();
println!("step {}/{} done, {:.2}s", index + 1, timesteps.len(), dt);
}
((latents * 42.)? - 1.)?
};
println!("Building the vqgan.");
let vqgan = {
let vqgan_weights = ModelFile::VqGan.get(vqgan_weights)?;
let weights = unsafe { candle::safetensors::MmapedFile::new(vqgan_weights)? };
let weights = weights.deserialize()?;
let vb = candle_nn::VarBuilder::from_safetensors(vec![weights], DType::F32, &device);
wuerstchen::paella_vq::PaellaVQ::new(vb)?
};
println!("Building the decoder.");
// https://huggingface.co/warp-ai/wuerstchen/blob/main/decoder/config.json
let decoder = {
let decoder_weights = ModelFile::Decoder.get(decoder_weights)?;
let weights = unsafe { candle::safetensors::MmapedFile::new(decoder_weights)? };
let weights = weights.deserialize()?;
let vb = candle_nn::VarBuilder::from_safetensors(vec![weights], DType::F32, &device);
wuerstchen::diffnext::WDiffNeXt::new(
/* c_in */ DECODER_CIN,
/* c_out */ DECODER_CIN,
/* c_r */ 64,
/* c_cond */ 1024,
/* clip_embd */ 1024,
/* patch_size */ 2,
vb,
)?
};
for idx in 0..num_samples {
// https://huggingface.co/warp-ai/wuerstchen/blob/main/model_index.json
let latent_height = (image_embeddings.dim(2)? as f64 * LATENT_DIM_SCALE) as usize;
let latent_width = (image_embeddings.dim(3)? as f64 * LATENT_DIM_SCALE) as usize;
let mut latents = Tensor::randn(
0f32,
1f32,
(b_size, DECODER_CIN, latent_height, latent_width),
&device,
)?;
println!("diffusion process with prior {image_embeddings:?}");
let scheduler = wuerstchen::ddpm::DDPMWScheduler::new(60, Default::default())?;
let timesteps = scheduler.timesteps();
for (index, &t) in timesteps.iter().enumerate() {
let start_time = std::time::Instant::now();
if index == timesteps.len() - 1 {
continue;
}
let ratio = (Tensor::ones(1, DType::F32, &device)? * t)?;
let noise_pred =
decoder.forward(&latents, &ratio, &image_embeddings, Some(&text_embeddings))?;
latents = scheduler.step(&noise_pred, t, &latents)?;
let dt = start_time.elapsed().as_secs_f32();
println!("step {}/{} done, {:.2}s", index + 1, timesteps.len(), dt);
}
println!(
"Generating the final image for sample {}/{}.",
idx + 1,
num_samples
);
let image = vqgan.decode(&(&latents * 0.3764)?)?;
// TODO: Add the clamping between 0 and 1.
let image = ((image / 2.)? + 0.5)?.to_device(&Device::Cpu)?;
let image = (image * 255.)?.to_dtype(DType::U8)?.i(0)?;
let image_filename = output_filename(&final_image, idx + 1, num_samples, None);
candle_examples::save_image(&image, image_filename)?
}
Ok(())
}
fn main() -> Result<()> {
let args = Args::parse();
run(args)
}

View File

@ -4,7 +4,7 @@ extern crate intel_mkl_src;
#[cfg(feature = "accelerate")]
extern crate accelerate_src;
use candle_transformers::object_detection::{non_maximum_suppression, Bbox};
use candle_examples::object_detection::{non_maximum_suppression, Bbox};
mod darknet;
use anyhow::Result;

View File

@ -1,47 +0,0 @@
# candle-yolo-v8: Object Detection and Pose Estimation
This is a port of [Ultralytics
YOLOv8](https://github.com/ultralytics/ultralytics). The implementation is based
on the [tinygrad
version](https://github.com/tinygrad/tinygrad/blob/master/examples/yolov8.py)
and on the model architecture described in this
[issue](https://github.com/ultralytics/ultralytics/issues/189). The supported
tasks are object detection and pose estimation.
You can try this model online on the [Candle YOLOv8
Space](https://huggingface.co/spaces/lmz/candle-yolo). The model then fully runs
in your browser using WebAssembly - if you use a custom image it will never
leave your phone/computer!
## Running some example
### Object Detection
```bash
cargo run --example yolo-v8 --release -- candle-examples/examples/yolo-v8/assets/bike.jpg
```
This prints details about the detected objects and generates a `bike.pp.jpg` file.
![Leading group, Giro d'Italia 2021](./assets/bike.jpg)
Image source:
[wikimedia](https://commons.wikimedia.org/wiki/File:Leading_group,_Giro_d%27Italia_2021,_Stage_15.jpg).
![Leading group, Giro d'Italia 2021](./assets/bike.od.jpg)
### Pose Estimation
```bash
cargo run --example yolo-v8 --release -- \
candle-examples/examples/yolo-v8/assets/peoples.jpeg --task pose
```
![Leading group, Giro d'Italia 2021](./assets/bike.pose.jpg)
### Command-line flags
- `--which`: select the model variant to be used, `n`, `s`, `m`, `l`, or `x` by
increasing size and quality.
- `--task`: `detect` for object detection and `pose` for pose estimation.
- `--legend-size`: the size of the characters to print.
- `--model`: use a local model file rather than downloading it from the hub.

Binary file not shown.


Binary file not shown.


Binary file not shown.


View File

@ -8,8 +8,8 @@ mod model;
use model::{Multiples, YoloV8, YoloV8Pose};
use candle::{DType, Device, IndexOp, Result, Tensor};
use candle_examples::object_detection::{non_maximum_suppression, Bbox, KeyPoint};
use candle_nn::{Module, VarBuilder};
use candle_transformers::object_detection::{non_maximum_suppression, Bbox, KeyPoint};
use clap::{Parser, ValueEnum};
use image::DynamicImage;

View File

@ -1,5 +1,6 @@
pub mod coco_classes;
pub mod imagenet;
pub mod object_detection;
use candle::{Device, Result, Tensor};

View File

@ -1,6 +1,6 @@
[package]
name = "candle-flash-attn"
version = "0.2.3"
version = "0.2.1"
edition = "2021"
description = "Flash attention layer for the candle ML framework."
@ -11,7 +11,7 @@ license = "MIT OR Apache-2.0"
readme = "README.md"
[dependencies]
candle = { path = "../candle-core", features = ["cuda"], version = "0.2.3", package = "candle-core" }
candle = { path = "../candle-core", features = ["cuda"], version = "0.2.1", package = "candle-core" }
half = { version = "2.3.1", features = ["num-traits"] }
[build-dependencies]
@ -21,4 +21,4 @@ rayon = "1.7.0"
[dev-dependencies]
anyhow = { version = "1", features = ["backtrace"] }
candle-nn = { path = "../candle-nn", version = "0.2.3", features = ["cuda"] }
candle-nn = { path = "../candle-nn", version = "0.2.1", features = ["cuda"] }

View File

@ -1,6 +1,6 @@
[package]
name = "candle-kernels"
version = "0.2.3"
version = "0.2.1"
edition = "2021"
description = "CUDA kernels for Candle"

View File

@ -164,8 +164,6 @@ mod cuda {
println!("cargo:rustc-env=CUDA_COMPUTE_CAP=sm_{compute_cap}");
let ccbin_env = std::env::var("CANDLE_NVCC_CCBIN");
println!("cargo:rerun-if-env-changed=CANDLE_NVCC_CCBIN");
let children = kernel_paths
.par_iter()
.flat_map(|p| {
@ -190,13 +188,8 @@ mod cuda {
.args(["--output-directory", &out_dir])
// Flash attention only
// .arg("--expt-relaxed-constexpr")
.args(&include_options);
if let Ok(ccbin_path) = &ccbin_env {
command
.arg("-allow-unsupported-compiler")
.args(["-ccbin", ccbin_path]);
}
command.arg(p);
.args(&include_options)
.arg(p);
Some((p, command.spawn()
.expect("nvcc failed to start. Ensure that you have CUDA installed and that `nvcc` is in your PATH.").wait_with_output()))
}})

View File

@ -51,118 +51,6 @@ __device__ void conv1d(
dst[dst_i] = static_cast<T>(d);
}
template <typename T>
__device__ void im2col1d(
const size_t dst_numel,
const size_t l_out,
const size_t l_k,
const size_t stride,
const size_t padding,
const size_t dilation,
const size_t *info,
const T *src,
T *dst
) {
const size_t dst_i = blockIdx.x * blockDim.x + threadIdx.x;
// dst: (b_size, l_out, c_in, l_k)
// src: (b_size, c_in, l_in)
if (dst_i >= dst_numel) {
return;
}
const size_t *src_dims = info;
const size_t *src_s = info + 3;
const size_t b_in = src_dims[0];
const size_t c_in = src_dims[1];
const size_t l_in = src_dims[2];
const size_t dst_s2 = l_k;
const size_t dst_s1 = c_in * dst_s2;
const size_t dst_s0 = l_out * dst_s1;
size_t tmp_dst_i = dst_i;
const size_t b_idx = tmp_dst_i / dst_s0;
tmp_dst_i -= b_idx * dst_s0;
const size_t l_idx = tmp_dst_i / dst_s1;
tmp_dst_i -= l_idx * dst_s1;
const size_t c_idx = tmp_dst_i / dst_s2;
tmp_dst_i -= c_idx * dst_s2;
const size_t l_k_idx = tmp_dst_i;
size_t src_l_idx = l_idx * stride + l_k_idx * dilation;
if (src_l_idx < padding || src_l_idx >= l_in + padding) {
dst[dst_i] = static_cast<T>(0);
}
else {
src_l_idx -= padding;
const size_t src_i = b_idx * src_s[0] + c_idx * src_s[1] + src_l_idx * src_s[2];
dst[dst_i] = src[src_i];
}
}
template <typename T>
__device__ void im2col(
const size_t dst_numel,
const size_t h_out,
const size_t w_out,
const size_t h_k,
const size_t w_k,
const size_t stride,
const size_t padding,
const size_t dilation,
const size_t *info,
const T *src,
T *dst
) {
const size_t dst_i = blockIdx.x * blockDim.x + threadIdx.x;
// dst: (b_size, h_out, w_out, c_in, h_k, w_k)
// src: (b_size, c_in, h_in, w_in)
if (dst_i >= dst_numel) {
return;
}
const size_t *src_dims = info;
const size_t *src_s = info + 4;
const size_t b_in = src_dims[0];
const size_t c_in = src_dims[1];
const size_t h_in = src_dims[2];
const size_t w_in = src_dims[3];
const size_t dst_s4 = w_k;
const size_t dst_s3 = h_k * dst_s4;
const size_t dst_s2 = c_in * dst_s3;
const size_t dst_s1 = w_out * dst_s2;
const size_t dst_s0 = h_out * dst_s1;
size_t tmp_dst_i = dst_i;
const size_t b_idx = tmp_dst_i / dst_s0;
tmp_dst_i -= b_idx * dst_s0;
const size_t h_idx = tmp_dst_i / dst_s1;
tmp_dst_i -= h_idx * dst_s1;
const size_t w_idx = tmp_dst_i / dst_s2;
tmp_dst_i -= w_idx * dst_s2;
const size_t c_idx = tmp_dst_i / dst_s3;
tmp_dst_i -= c_idx * dst_s3;
const size_t h_k_idx = tmp_dst_i / dst_s4;
tmp_dst_i -= h_k_idx * dst_s4;
const size_t w_k_idx = tmp_dst_i;
size_t src_h_idx = h_idx * stride + h_k_idx * dilation;
size_t src_w_idx = w_idx * stride + w_k_idx * dilation;
if (src_h_idx < padding || src_h_idx >= h_in + padding) {
dst[dst_i] = static_cast<T>(0);
}
else if (src_w_idx < padding || src_w_idx >= w_in + padding) {
dst[dst_i] = static_cast<T>(0);
}
else {
src_h_idx -= padding;
src_w_idx -= padding;
const size_t src_i =
b_idx * src_s[0]
+ c_idx * src_s[1]
+ src_h_idx * src_s[2]
+ src_w_idx * src_s[3];
dst[dst_i] = src[src_i];
}
}
// Naive implementation of conv2d.
template <typename T, typename A>
__device__ void conv2d(
@ -475,38 +363,6 @@ extern "C" __global__ void FN_NAME( \
conv2d<TYPENAME, TYPEACC>(src_numel, w_out, h_out, stride, padding, dilation, info, src, kernel, dst); \
} \
#define IM2COL1D_OP(TYPENAME, FN_NAME) \
extern "C" __global__ void FN_NAME( \
const size_t dst_numel, \
const size_t l_out, \
const size_t l_k, \
const size_t stride, \
const size_t padding, \
const size_t dilation, \
const size_t *info, \
const TYPENAME *src, \
TYPENAME *dst \
) { \
im2col1d<TYPENAME>(dst_numel, l_out, l_k, stride, padding, dilation, info, src, dst); \
} \
#define IM2COL_OP(TYPENAME, FN_NAME) \
extern "C" __global__ void FN_NAME( \
const size_t dst_numel, \
const size_t h_out, \
const size_t w_out, \
const size_t h_k, \
const size_t w_k, \
const size_t stride, \
const size_t padding, \
const size_t dilation, \
const size_t *info, \
const TYPENAME *src, \
TYPENAME *dst \
) { \
im2col<TYPENAME>(dst_numel, h_out, w_out, h_k, w_k, stride, padding, dilation, info, src, dst); \
} \
#define CONVT2D_OP(TYPENAME, TYPEACC, FN_NAME) \
extern "C" __global__ void FN_NAME( \
const size_t src_numel, \
@ -572,8 +428,6 @@ CONVT2D_OP(__nv_bfloat16, float, conv_transpose2d_bf16)
AVG_POOL2D_OP(__nv_bfloat16, float, avg_pool2d_bf16)
MAX_POOL2D_OP(__nv_bfloat16, max_pool2d_bf16)
UPSAMPLE_NEAREST2D_OP(__nv_bfloat16, upsample_nearest2d_bf16)
IM2COL_OP(__nv_bfloat16, im2col_bf16)
IM2COL1D_OP(__nv_bfloat16, im2col1d_bf16)
#endif
#if __CUDA_ARCH__ >= 530
@ -583,8 +437,6 @@ CONVT2D_OP(__half, float, conv_transpose2d_f16)
AVG_POOL2D_OP(__half, float, avg_pool2d_f16)
MAX_POOL2D_OP(__half, max_pool2d_f16)
UPSAMPLE_NEAREST2D_OP(__half, upsample_nearest2d_f16)
IM2COL_OP(__half, im2col_f16)
IM2COL1D_OP(__half, im2col1d_f16)
#endif
CONV1D_OP(float, float, conv1d_f32)
@ -616,13 +468,3 @@ UPSAMPLE_NEAREST2D_OP(float, upsample_nearest2d_f32)
UPSAMPLE_NEAREST2D_OP(double, upsample_nearest2d_f64)
UPSAMPLE_NEAREST2D_OP(uint8_t, upsample_nearest2d_u8)
UPSAMPLE_NEAREST2D_OP(uint32_t, upsample_nearest2d_u32)
IM2COL_OP(float, im2col_f32)
IM2COL_OP(double, im2col_f64)
IM2COL_OP(uint8_t, im2col_u8)
IM2COL_OP(uint32_t, im2col_u32)
IM2COL1D_OP(float, im2col1d_f32)
IM2COL1D_OP(double, im2col1d_f64)
IM2COL1D_OP(uint8_t, im2col1d_u8)
IM2COL1D_OP(uint32_t, im2col1d_u32)
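For context, the `im2col` kernels removed above unfold every convolution window into one row of a matrix (destination layout `(b, h_out, w_out, c_in, h_k, w_k)` as in the comments), so the convolution itself reduces to a single matrix multiplication against the kernel flattened to `(c_out, c_in * h_k * w_k)`. The output spatial sizes follow the usual formula, with padding `p`, stride `s` and dilation `d`:

$$
h_{\text{out}} = \left\lfloor \frac{h + 2p - d\,(h_k - 1) - 1}{s} \right\rfloor + 1,
\qquad
w_{\text{out}} = \left\lfloor \frac{w + 2p - d\,(w_k - 1) - 1}{s} \right\rfloor + 1
$$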

View File

@ -11,14 +11,13 @@ readme = "README.md"
[dependencies]
accelerate-src = { workspace = true, optional = true }
candle = { path = "../candle-core", version = "0.2.3", package = "candle-core" }
candle = { path = "../candle-core", version = "0.2.1", package = "candle-core" }
half = { workspace = true }
thiserror = { workspace = true }
intel-mkl-src = { workspace = true, optional = true }
num-traits = { workspace = true }
rayon = { workspace = true }
safetensors = { workspace = true }
serde = { workspace = true }
[dev-dependencies]
anyhow = { workspace = true }

View File

@ -6,11 +6,9 @@ extern crate intel_mkl_src;
extern crate accelerate_src;
use candle::quantized::GgmlType;
use candle::{CpuStorage, Device, Layout, Result, Shape, Tensor, D};
use candle::{Device, Result, Tensor, D};
use clap::{Parser, Subcommand};
const CHECK_CONV2D: bool = false;
trait Benchmark {
type PreProcessData;
type RunResult;
@ -21,87 +19,6 @@ trait Benchmark {
const ITERS: usize;
}
struct Im2Col {
h_k: usize,
w_k: usize,
stride: usize,
dilation: usize,
padding: usize,
}
impl Im2Col {
fn hw_out(&self, h: usize, w: usize) -> (usize, usize) {
let h_out = (h + 2 * self.padding - self.dilation * (self.h_k - 1) - 1) / self.stride + 1;
let w_out = (w + 2 * self.padding - self.dilation * (self.w_k - 1) - 1) / self.stride + 1;
(h_out, w_out)
}
}
impl candle::CustomOp1 for Im2Col {
fn name(&self) -> &'static str {
"im2col"
}
fn cpu_fwd(&self, storage: &CpuStorage, layout: &Layout) -> Result<(CpuStorage, Shape)> {
let &Self {
h_k,
w_k,
stride,
dilation,
padding,
} = self;
let (b, c, h, w) = layout.shape().dims4()?;
let (h_out, w_out) = self.hw_out(h, w);
let slice = storage.as_slice::<f32>()?;
let src = &slice[layout.start_offset()..];
let mut dst = vec![0f32; b * h_out * w_out * c * h_k * w_k];
let (src_s0, src_s1, src_s2, src_s3) = {
let s = layout.stride();
(s[0], s[1], s[2], s[3])
};
// TODO: provide specialized kernels for the common use cases.
// - h_k = w_k = 1
// - padding = 0
// - stride = 1
// - dilation = 1
for b_idx in 0..b {
let src_idx = b_idx * src_s0;
let dst_idx = b_idx * h_out * w_out * c * h_k * w_k;
for h_idx in 0..h_out {
let dst_idx = dst_idx + h_idx * w_out * c * h_k * w_k;
for w_idx in 0..w_out {
let dst_idx = dst_idx + w_idx * c * h_k * w_k;
for c_idx in 0..c {
let dst_idx = dst_idx + c_idx * h_k * w_k;
let src_idx = c_idx * src_s1 + src_idx;
for h_k_idx in 0..h_k {
let src_h = h_idx * stride + h_k_idx * dilation;
if padding != 0 && (src_h < padding || src_h >= h + padding) {
continue;
}
let src_h = src_h - padding;
let src_idx = src_idx + src_h * src_s2;
let dst_idx = dst_idx + h_k_idx * w_k;
for w_k_idx in 0..w_k {
let src_w = w_idx * stride + w_k_idx * dilation;
if padding != 0 && (src_w < padding || src_w >= w + padding) {
continue;
}
let src_w = src_w - padding;
let src_idx = src_idx + src_w * src_s3;
let dst_idx = dst_idx + w_k_idx;
dst[dst_idx] = src[src_idx]
}
}
}
}
}
}
let storage = candle::WithDType::to_cpu_storage_owned(dst);
Ok((storage, (b * h_out * w_out, c * h_k * w_k).into()))
}
}
// Conv1d example as used in whisper.
struct Conv1d;
impl Benchmark for Conv1d {
@ -136,48 +53,7 @@ impl Benchmark for Conv2d {
d.0.conv2d(&d.1, 0, 1, 1, 1)
}
const ITERS: usize = 5;
}
// Conv2d example as used in stable-diffusion, im2col implementation.
struct Conv2dIm2Col;
impl Benchmark for Conv2dIm2Col {
type PreProcessData = (Tensor, Tensor);
type RunResult = Tensor;
fn preprocess() -> Result<Self::PreProcessData> {
let inp = Tensor::randn(0f32, 1., (2, 320, 96, 96), &Device::Cpu)?;
let w = Tensor::randn(0f32, 1., (320, 320, 3, 3), &Device::Cpu)?;
Ok((inp, w))
}
fn run_one(d: &Self::PreProcessData) -> Result<Self::RunResult> {
// d.0.conv2d(&d.1, 0, 1, 1, 1)
let (b, _, h, w) = d.0.dims4()?;
let (_, _, h_k, w_k) = d.1.dims4()?;
let op = Im2Col {
h_k,
w_k,
stride: 1,
dilation: 1,
padding: 0,
};
let (h_out, w_out) = op.hw_out(h, w);
let col = d.0.apply_op1_no_bwd(&op)?;
let res = col.matmul(&d.1.flatten_from(1)?.t()?)?;
let res = res
.reshape((b, h_out, w_out, ()))?
.permute((0, 3, 1, 2))?
.contiguous()?;
if CHECK_CONV2D {
let res2 = d.0.conv2d(&d.1, op.padding, op.stride, op.dilation, 1);
let diff = (&res - res2)?.sqr()?.mean_all()?;
println!("{diff}");
}
Ok(res)
}
const ITERS: usize = 5;
const ITERS: usize = 1;
}
struct Matmul;
@ -269,7 +145,6 @@ fn run<B: Benchmark>(iters: Option<usize>) -> Result<()> {
enum Task {
Conv1d,
Conv2d,
Conv2dIm2Col,
Matmul,
Qmatmul,
Softmax,
@ -292,7 +167,6 @@ fn main() -> Result<()> {
match args.task {
Task::Conv1d => run::<Conv1d>(args.iters)?,
Task::Conv2d => run::<Conv2d>(args.iters)?,
Task::Conv2dIm2Col => run::<Conv2dIm2Col>(args.iters)?,
Task::Matmul => run::<Matmul>(args.iters)?,
Task::Softmax => run::<Softmax>(args.iters)?,
Task::SoftmaxLastDim => run::<SoftmaxLastDim>(args.iters)?,
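As a rough standalone check of the plain `Conv2d` benchmark path (assuming `candle-core` ~0.2 as a dependency; the tensor shapes are the ones used in the benchmark above):

```rust
use candle::{Device, Result, Tensor};

fn main() -> Result<()> {
    // Same shapes as the benchmark: batch 2, 320 channels, 96x96 input, 3x3 kernel.
    let inp = Tensor::randn(0f32, 1., (2, 320, 96, 96), &Device::Cpu)?;
    let w = Tensor::randn(0f32, 1., (320, 320, 3, 3), &Device::Cpu)?;
    // conv2d(kernel, padding, stride, dilation, groups)
    let res = inp.conv2d(&w, 0, 1, 1, 1)?;
    // With no padding and stride 1, 96 - (3 - 1) = 94 in each spatial dimension.
    assert_eq!(res.dims(), &[2, 320, 94, 94]);
    Ok(())
}
```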

View File

@ -1,29 +1,18 @@
use candle::Tensor;
use serde::Deserialize;
#[derive(Debug, Clone, Copy, PartialEq, Deserialize, Default)]
#[serde(rename_all = "lowercase")]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Activation {
#[default]
Gelu,
#[serde(rename = "gated-gelu")]
NewGelu,
Relu,
Elu(f64),
LeakyRelu(f64),
}
impl super::Module for Activation {
fn forward(&self, xs: &Tensor) -> candle::Result<Tensor> {
match self {
Self::Gelu => xs.gelu(),
// TODO: This is "gelu_new", not the original "gelu".
// There's some small numerical difference:
// https://github.com/huggingface/transformers/blob/12f043eaeaabfef6f6efea411d98e6f6d3c094b7/src/transformers/activations.py#L49-L78
Self::NewGelu => xs.gelu(),
Self::Relu => xs.relu(),
&Self::Elu(alpha) => xs.elu(alpha),
&Self::LeakyRelu(negative_slope) => crate::ops::leaky_relu(xs, negative_slope),
}
}
}
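As a rough usage sketch of the `Activation` module above (assuming this candle-nn version re-exports `Activation` and `Module` from its crate root, as the `impl super::Module` above suggests):

```rust
use candle::{Device, Result, Tensor};
use candle_nn::{Activation, Module};

fn main() -> Result<()> {
    let xs = Tensor::new(&[-1f32, 0., 2.], &Device::Cpu)?;
    // Relu is present on both sides of this diff; Elu/LeakyRelu only on the newer one.
    let ys = Activation::Relu.forward(&xs)?;
    assert_eq!(ys.to_vec1::<f32>()?, vec![0., 0., 2.]);
    Ok(())
}
```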

View File

@ -39,14 +39,6 @@ impl Conv1d {
pub fn config(&self) -> &Conv1dConfig {
&self.config
}
pub fn weight(&self) -> &Tensor {
&self.weight
}
pub fn bias(&self) -> Option<&Tensor> {
self.bias.as_ref()
}
}
impl crate::Module for Conv1d {
@ -107,14 +99,6 @@ impl Conv2d {
pub fn config(&self) -> &Conv2dConfig {
&self.config
}
pub fn weight(&self) -> &Tensor {
&self.weight
}
pub fn bias(&self) -> Option<&Tensor> {
self.bias.as_ref()
}
}
impl crate::Module for Conv2d {
@ -203,10 +187,10 @@ pub fn conv1d(
out_channels: usize,
kernel_size: usize,
cfg: Conv1dConfig,
vb: crate::VarBuilder,
vs: crate::VarBuilder,
) -> Result<Conv1d> {
let init_ws = crate::init::DEFAULT_KAIMING_NORMAL;
let ws = vb.get_with_hints(
let ws = vs.get_with_hints(
(out_channels, in_channels / cfg.groups, kernel_size),
"weight",
init_ws,
@ -216,7 +200,7 @@ pub fn conv1d(
lo: -bound,
up: bound,
};
let bs = vb.get_with_hints(out_channels, "bias", init_bs)?;
let bs = vs.get_with_hints(out_channels, "bias", init_bs)?;
Ok(Conv1d::new(ws, Some(bs), cfg))
}
@ -225,10 +209,10 @@ pub fn conv2d(
out_channels: usize,
kernel_size: usize,
cfg: Conv2dConfig,
vb: crate::VarBuilder,
vs: crate::VarBuilder,
) -> Result<Conv2d> {
let init_ws = crate::init::DEFAULT_KAIMING_NORMAL;
let ws = vb.get_with_hints(
let ws = vs.get_with_hints(
(
out_channels,
in_channels / cfg.groups,
@ -243,7 +227,7 @@ pub fn conv2d(
lo: -bound,
up: bound,
};
let bs = vb.get_with_hints(out_channels, "bias", init_bs)?;
let bs = vs.get_with_hints(out_channels, "bias", init_bs)?;
Ok(Conv2d::new(ws, Some(bs), cfg))
}
@ -252,10 +236,10 @@ pub fn conv2d_no_bias(
out_channels: usize,
kernel_size: usize,
cfg: Conv2dConfig,
vb: crate::VarBuilder,
vs: crate::VarBuilder,
) -> Result<Conv2d> {
let init_ws = crate::init::DEFAULT_KAIMING_NORMAL;
let ws = vb.get_with_hints(
let ws = vs.get_with_hints(
(
out_channels,
in_channels / cfg.groups,
@ -273,19 +257,19 @@ pub fn conv_transpose2d(
out_channels: usize,
kernel_size: usize,
cfg: ConvTranspose2dConfig,
vb: crate::VarBuilder,
vs: crate::VarBuilder,
) -> Result<ConvTranspose2d> {
let bound = 1. / (out_channels as f64).sqrt() / kernel_size as f64;
let init = crate::Init::Uniform {
lo: -bound,
up: bound,
};
let ws = vb.get_with_hints(
let ws = vs.get_with_hints(
(in_channels, out_channels, kernel_size, kernel_size),
"weight",
init,
)?;
let bs = vb.get_with_hints(out_channels, "bias", init)?;
let bs = vs.get_with_hints(out_channels, "bias", init)?;
Ok(ConvTranspose2d::new(ws, Some(bs), cfg))
}
@ -294,15 +278,15 @@ pub fn conv_transpose2d_no_bias(
out_channels: usize,
kernel_size: usize,
cfg: ConvTranspose2dConfig,
vb: crate::VarBuilder,
vs: crate::VarBuilder,
) -> Result<ConvTranspose2d> {
let bound = 1. / (out_channels as f64).sqrt() / kernel_size as f64;
let init = crate::Init::Uniform {
lo: -bound,
up: bound,
};
let ws = vb.get_with_hints(
(in_channels, out_channels, kernel_size, kernel_size),
let ws = vs.get_with_hints(
(out_channels, in_channels, kernel_size, kernel_size),
"weight",
init,
)?;
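A rough sketch of how the `conv2d` builder above is typically called; the shapes are made up for illustration, and it assumes this candle-nn version provides `VarBuilder::zeros` and a `Default` impl for `Conv2dConfig` (no padding, stride 1):

```rust
use candle::{DType, Device, Result, Tensor};
use candle_nn::{conv2d, Conv2dConfig, Module, VarBuilder};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // Zero-initialised weights, good enough for a shape check.
    let vb = VarBuilder::zeros(DType::F32, &dev);
    // in_channels = 3, out_channels = 8, kernel_size = 3.
    let conv = conv2d(3, 8, 3, Conv2dConfig::default(), vb)?;
    let xs = Tensor::randn(0f32, 1., (1, 3, 16, 16), &dev)?;
    let ys = conv.forward(&xs)?;
    // 3x3 kernel, no padding, stride 1: spatial dims shrink from 16 to 14.
    assert_eq!(ys.dims(), &[1, 8, 14, 14]);
    Ok(())
}
```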

View File

@ -18,11 +18,6 @@ impl Embedding {
pub fn embeddings(&self) -> &Tensor {
&self.embeddings
}
/// Get the hidden size of the embedding matrix
pub fn hidden_size(&self) -> usize {
self.hidden_size
}
}
impl crate::Module for Embedding {

View File

@ -44,10 +44,6 @@ pub fn sigmoid(xs: &Tensor) -> Result<Tensor> {
(xs.neg()?.exp()? + 1.0)?.recip()
}
pub fn leaky_relu(xs: &Tensor, negative_slope: f64) -> Result<Tensor> {
xs.relu()?.minimum(&(xs * negative_slope)?)
}
pub fn dropout(xs: &Tensor, drop_p: f32) -> Result<Tensor> {
// This implementation is inefficient as it stores the full mask for the backward pass.
// Instead we could just store the seed and have a specialized kernel that would both
@ -189,42 +185,3 @@ impl candle::CustomOp1 for SoftmaxLastDim {
pub fn softmax_last_dim(xs: &Tensor) -> Result<Tensor> {
xs.apply_op1_no_bwd(&SoftmaxLastDim)
}
// https://pytorch.org/docs/stable/generated/torch.nn.PixelShuffle.html
pub fn pixel_shuffle(xs: &Tensor, upscale_factor: usize) -> Result<Tensor> {
let (b_size, c, h, w) = xs.dims4()?;
let out_c = c / upscale_factor / upscale_factor;
xs.reshape((b_size, out_c, upscale_factor, upscale_factor, h, w))?
.permute((0, 1, 4, 2, 5, 3))?
.reshape((b_size, out_c, h * upscale_factor, w * upscale_factor))
}
pub fn pixel_unshuffle(xs: &Tensor, downscale_factor: usize) -> Result<Tensor> {
let (b_size, c, h, w) = xs.dims4()?;
let out_c = c * downscale_factor * downscale_factor;
xs.reshape((
b_size,
c,
h / downscale_factor,
downscale_factor,
w / downscale_factor,
downscale_factor,
))?
.permute((0, 1, 3, 5, 2, 4))?
.reshape((b_size, out_c, h / downscale_factor, w / downscale_factor))
}
// https://pytorch.org/docs/stable/generated/torch.nn.ReplicationPad2d.html
pub fn replication_pad2d(xs: &Tensor, pad: usize) -> Result<Tensor> {
match pad {
0 => Ok(xs.clone()),
1 => {
let (_b_size, _c, h, w) = xs.dims4()?;
let (first, last) = (xs.narrow(3, 0, 1)?, xs.narrow(3, w - 1, 1)?);
let xs = Tensor::cat(&[&first, xs, &last], 3)?;
let (first, last) = (xs.narrow(2, 0, 1)?, xs.narrow(2, h - 1, 1)?);
Tensor::cat(&[&first, &xs, &last], 2)
}
n => candle::bail!("replication-pad with a size of {n} is not supported"),
}
}
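The `pixel_shuffle` helper removed above only moves elements around; a minimal shape check, inlining the same reshape/permute sequence (assuming a candle version that provides `Tensor::arange` and the tuple-based `permute` used above):

```rust
use candle::{Device, Result, Tensor};

fn main() -> Result<()> {
    let r = 2usize; // upscale factor
    // (b, c, h, w) with c = out_c * r * r.
    let xs = Tensor::arange(0f32, 24f32, &Device::Cpu)?.reshape((1, 4, 2, 3))?;
    let (b, c, h, w) = xs.dims4()?;
    let out_c = c / r / r;
    let ys = xs
        .reshape((b, out_c, r, r, h, w))?
        .permute((0, 1, 4, 2, 5, 3))?
        .reshape((b, out_c, h * r, w * r))?;
    // (1, 4, 2, 3) with r = 2 becomes (1, 1, 4, 6).
    assert_eq!(ys.dims(), &[1, 1, 4, 6]);
    Ok(())
}
```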

candle-pyo3/.gitignore
View File

@ -1,160 +0,0 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

View File

@ -12,10 +12,11 @@ readme = "README.md"
[lib]
name = "candle"
crate-type = ["cdylib"]
doc = false
[dependencies]
candle = { path = "../candle-core", version = "0.2.3", package = "candle-core" }
candle-nn = { path = "../candle-nn", version = "0.2.3" }
candle = { path = "../candle-core", version = "0.2.1", package = "candle-core" }
candle-nn = { path = "../candle-nn", version = "0.2.1" }
half = { workspace = true }
pyo3 = { version = "0.19.0", features = ["extension-module"] }

View File

@ -1,26 +1,7 @@
## Installation
From the `candle-pyo3` directory, activate a virtual environment in which you want the
candle package to be installed, then run:
```bash
maturin develop -r
maturin develop
python test.py
```
## Generating Stub Files for Type Hinting
For type hinting support, the `candle-pyo3` package requires `*.pyi` files. You can automatically generate these files using the `stub.py` script.
### Steps:
1. Install the package using `maturin`.
2. Generate the stub files by running:
```
python stub.py
```
### Validation:
To ensure that the stub files match the current implementation, execute:
```
python stub.py --check
```

Some files were not shown because too many files have changed in this diff.