Finished scaffolding, lots of TODOs

- Most kernels just copy themselves to get the shapes correct
- Matmul works in only one case and simply allocates an empty buffer otherwise
- Logits are randomized to make the demo finish

Performance is quite bad (30ms/token), but there are lots of prints and allocs and some actual sending to Metal. Couldn't get performance much higher by removing the obvious blockers (println + actually running the matmuls). Allocation takes between 1us and 100us and seems very stable. Maybe Metal doesn't really have a smart allocator and we'll need to own it.
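
One possible shape for "owning" the allocator is a size-bucketed pool that recycles buffers instead of asking the device every time. The sketch below is an assumption, not what this commit implements: `Buffer` and `BufferPool` are hypothetical names, and `Buffer` wraps a plain `Vec<u8>` as a stand-in for a real Metal buffer.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a device buffer; a real backend would wrap
/// a Metal buffer handle here instead of host memory.
#[derive(Debug)]
pub struct Buffer {
    pub bytes: Vec<u8>,
}

/// Hypothetical pool that recycles released buffers, keyed by capacity.
#[derive(Default)]
pub struct BufferPool {
    /// Free buffers, grouped by their (power-of-two) capacity in bytes.
    free: HashMap<usize, Vec<Buffer>>,
    /// Number of times the pool had to create a brand-new buffer.
    pub fresh_allocs: usize,
}

impl BufferPool {
    /// Round a request up to the next power of two so that similar
    /// sizes land in the same bucket and can reuse each other's buffers.
    fn bucket(size: usize) -> usize {
        size.max(1).next_power_of_two()
    }

    /// Return a buffer of at least `size` bytes, reusing a free one if
    /// the matching bucket is non-empty. Reused buffers are NOT zeroed.
    pub fn alloc(&mut self, size: usize) -> Buffer {
        let bucket = Self::bucket(size);
        if let Some(buf) = self.free.get_mut(&bucket).and_then(|v| v.pop()) {
            return buf;
        }
        self.fresh_allocs += 1;
        Buffer { bytes: vec![0u8; bucket] }
    }

    /// Hand a buffer back to the pool so a later `alloc` can reuse it.
    pub fn release(&mut self, buf: Buffer) {
        // Buffers created by `alloc` always have a power-of-two length,
        // so the length doubles as the bucket key.
        self.free.entry(buf.bytes.len()).or_default().push(buf);
    }
}
```

With this layout, allocating 1000 bytes, releasing the buffer, and then allocating 900 bytes touches the underlying allocator only once, since both requests round up to the 1024-byte bucket. The trade-off is memory overhead of up to 2x per buffer from the rounding.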
2025-06-19 03:54:56 +00:00 · 2023-11-02 15:32:28 +01:00
parent 82cce52e73
commit 7161002a34
11 changed files with 212 additions and 52 deletions
--- a/candle-core/src/op.rs
+++ b/candle-core/src/op.rs
@@ -182,7 +182,7 @@ pub trait CustomOp1 {
        _layout: &Layout,
    ) -> Result<(MetalStorage, Shape)> {
        Err(crate::Error::Metal(
-            format!("no cuda implementation for {}", self.name()).into(),
+            format!("no metal implementation for {}", self.name()).into(),
        ))
    }