Finished scaffolding, lots of TODOs

- Most kernels just copy themselves to get the shapes correct.
- Matmul works in only one case and simply allocates an empty buffer otherwise.
- Logits are randomized to make the demo finish by itself.

Performance is quite bad (30ms/token), but there are lots of prints and allocs and some actual sending to Metal. Couldn't get it much faster by removing the obvious blockers (println + actually running the matmuls). Allocations take between 1us and 100us and seem very stable. Maybe Metal doesn't really have a smart allocator and we'll need to own it.
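
One possible reading of "own it" is a size-bucketed buffer pool that hands back released buffers instead of allocating new ones. The sketch below is only an illustration of that idea, not what this commit does: `Buffer` and `BufferPool` are hypothetical names, and a plain `Vec<u8>` stands in for a real Metal buffer so the example runs without a GPU.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a device buffer (assumption: a real backend
/// would wrap a Metal buffer here instead of host memory).
struct Buffer {
    bytes: Vec<u8>,
}

/// Size-bucketed free list: released buffers are kept and reused on the
/// next request of the same size instead of being allocated again.
#[derive(Default)]
struct BufferPool {
    free: HashMap<usize, Vec<Buffer>>,
    fresh_allocations: usize,
}

impl BufferPool {
    /// Returns a pooled buffer of exactly `size` bytes if one is available,
    /// otherwise allocates a new one and counts it.
    fn acquire(&mut self, size: usize) -> Buffer {
        if let Some(buf) = self.free.get_mut(&size).and_then(|list| list.pop()) {
            return buf;
        }
        self.fresh_allocations += 1;
        Buffer { bytes: vec![0u8; size] }
    }

    /// Puts a buffer back into the bucket for its size. Contents are not cleared.
    fn release(&mut self, buf: Buffer) {
        self.free.entry(buf.bytes.len()).or_default().push(buf);
    }
}

fn main() {
    let mut pool = BufferPool::default();
    // Simulate a per-token loop that needs a scratch buffer each step.
    for _ in 0..100 {
        let buf = pool.acquire(4096);
        pool.release(buf);
    }
    println!("fresh allocations: {}", pool.fresh_allocations);
}
```

With a pool like this, a per-token loop that always asks for the same sizes only pays the allocation cost on the first pass. Whether that helps here depends on how the Metal allocator actually behaves, which the measurements above do not settle.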
2025-06-20 04:00:28 +00:00 · 2023-11-02 15:32:28 +01:00
parent 82cce52e73
commit 7161002a34
11 changed files with 212 additions and 52 deletions
--- a/candle-nn/src/ops.rs
+++ b/candle-nn/src/ops.rs
@@ -190,6 +190,16 @@ impl candle::CustomOp1 for SoftmaxLastDim {
            device: dev.clone(),
        };
        Ok((dst, layout.shape().clone()))
+    }    
+
+    #[cfg(feature = "metal")]
+    fn metal_fwd(
+        &self,
+        storage: &candle::MetalStorage,
+        layout: &Layout,
+    ) -> Result<(candle::MetalStorage, Shape)> {
+        println!("TODO softmax-last-dim");
+        Ok((storage.clone(), layout.shape().clone()))
    }
 }