- Most kernels just copy themselves to get the shapes correct
- Matmul works in only one case and simply allocates an empty buffer otherwise
- Logits are randomized so the demo finishes by itself.
Performance is quite bad (30ms/token), but there are lots of prints and allocs, plus some actual sending to Metal.
Couldn't get it super high even after removing the obvious blockers (println + actually running the matmuls).
Allocations take between 1us and 100us and seem very stable. Maybe Metal doesn't really have a smart allocator and we'll need to own it.
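
A rough sketch of what "owning it" could look like: a size-bucketed pool that hands back released buffers instead of allocating again. Everything here is hypothetical (`BufferPool`, `acquire`, `release` are not in this code base), and `Vec<u8>` stands in for a Metal buffer so the sketch runs without a device.

```rust
use std::collections::HashMap;

/// Hypothetical size-bucketed buffer pool.
/// `Vec<u8>` is a stand-in for a device buffer (assumption for illustration).
pub struct BufferPool {
    /// Released buffers, keyed by their exact byte length.
    free: HashMap<usize, Vec<Vec<u8>>>,
    /// Number of times a brand-new buffer had to be allocated.
    pub fresh_allocs: usize,
    /// Number of times a released buffer was handed out again.
    pub reuses: usize,
}

impl BufferPool {
    pub fn new() -> Self {
        Self {
            free: HashMap::new(),
            fresh_allocs: 0,
            reuses: 0,
        }
    }

    /// Return a buffer of exactly `size` bytes, reusing a released one if any.
    /// Reused buffers keep their old contents; callers must overwrite them.
    pub fn acquire(&mut self, size: usize) -> Vec<u8> {
        if let Some(buf) = self.free.get_mut(&size).and_then(|bucket| bucket.pop()) {
            self.reuses += 1;
            return buf;
        }
        self.fresh_allocs += 1;
        vec![0u8; size]
    }

    /// Give a buffer back to the pool so a later `acquire` of the same size can reuse it.
    pub fn release(&mut self, buf: Vec<u8>) {
        self.free.entry(buf.len()).or_default().push(buf);
    }
}
```

Since shapes repeat for every token, an exact-size bucket should hit almost every time after the first token. Whether that actually beats the 1us to 100us per allocation seen above would need measuring.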