* Add the metavoice transformer.
* Sketch the speaker-encoder module.
* Add to the metavoice model.
* Start adding the metavoice example.
* Get some logits out.
* Load the second stage model.
* Get the second step to run.
* Tweak the example.
* Add encodec tilting.
* Glue the different bits together.
* Fix a shape issue.
* Use a constant.
* BPE tokenization.
* Add a warning.
* Encodec model.
* Fixes.
* Add the padding functions.
* Get the LSTM bit to work.
* Get the encodec model to generate some tokens (decoder only for now).
* Minor tweak.
* Minor tweak.
* Add the quantized rwkv v5 model.
* Integrate the quantized rwkv model in the initial example.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
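A minimal sketch of the decode loop behind the "get some logits out" and token-generation steps above, assuming plain greedy (argmax) sampling; `forward`, the eos handling, and all names are illustrative stand-ins rather than the example's actual API:

```rust
/// Pick the index of the largest logit (greedy sampling).
fn argmax(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .unwrap_or(0)
}

/// Greedy decode loop: feed generated tokens back in until the model emits
/// `eos` or `max_len` is reached. `forward` stands in for the first-stage
/// transformer returning logits for the next token.
fn decode(
    mut forward: impl FnMut(&[u32]) -> Vec<f32>,
    prompt: &[u32],
    eos: u32,
    max_len: usize,
) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    while tokens.len() < max_len {
        let logits = forward(&tokens);
        let next = argmax(&logits) as u32;
        if next == eos {
            break;
        }
        tokens.push(next);
    }
    tokens
}
```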
* Boilerplate for the quantized cuda support.
* More basic cuda support.
* More cuda quantization (quantize on cpu for now).
* Add the dequantization bit.
* Start adding some dedicated cuda kernels from llama.cpp.
* Move the kernel code.
* Start interfacing with the kernel.
* Tweak the kernel launch params.
* Bugfix for quantized metal.
* Fix some clippy lints.
* Tweak the launch parameters.
* Tweak cuda basics to perform a quantized matmul.
* Perform the dequantization on the cpu + use cublas for matmul.
* Add the dequantization kernel.
* Test the qmatmul.
* More kernels.
* Matmul-vec kernel.
* Add a couple kernels.
* More dequantization kernels.
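A minimal sketch of the interim "dequantize on the cpu, then do a regular matmul" path mentioned above, assuming a simplified Q8_0-like block layout (32 int8 values plus one scale per block, with the scale kept as f32 here instead of f16); in the real path the dequantized buffer is handed to cuBLAS rather than this naive loop:

```rust
/// Number of quantized values per block (Q8_0-style layout).
const BLOCK: usize = 32;

/// Simplified quantized block: one scale and 32 int8 quants.
struct QBlock {
    scale: f32,
    qs: [i8; BLOCK],
}

/// Dequantize every block back to f32: value = scale * quant.
fn dequantize(blocks: &[QBlock]) -> Vec<f32> {
    let mut out = Vec::with_capacity(blocks.len() * BLOCK);
    for b in blocks {
        for &q in b.qs.iter() {
            out.push(b.scale * q as f32);
        }
    }
    out
}

/// Naive row-major matmul on the dequantized weights: (m, k) x (k, n).
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0f32; m * n];
    for i in 0..m {
        for l in 0..k {
            let av = a[i * k + l];
            for j in 0..n {
                c[i * n + j] += av * b[l * n + j];
            }
        }
    }
    c
}
```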
* Add the Gemma models.
* Add the gemma example.
* Adapt the RmsNorm.
* Get the 2b model to work.
* 7b support.
* Use the config head dim.
* Yet another fix.
* Make the matrices contiguous.
* Also get the 7b model to work.
* And add to the readme.
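A minimal sketch of what "Adapt the RmsNorm" presumably involves for Gemma, assuming the Gemma convention of scaling the normalized activations by (1 + weight) rather than by the weight directly; shapes are flattened to a single hidden vector for illustration:

```rust
/// Gemma-style RmsNorm over one hidden vector: normalize by the
/// root-mean-square, then scale by (1 + weight).
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1f32 / (mean_sq + eps).sqrt();
    x.iter()
        .zip(weight.iter())
        .map(|(v, w)| v * scale * (1. + w))
        .collect()
}
```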
* Start adding the RWKV model.
* More of the forward step.
* Handle rescaling.
* FeedForward.
* More work on RWKV.
* Better state tracking.
* Finish a first pass on forward.
* Fix the shape mismatches.
* Do not rescale in f32.
* Rename to rwkv-v5.
* Add the new models to the readme.
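A minimal sketch of the channel-mix feed-forward and the per-layer state tracking mentioned above, assuming the usual RWKV formulation (token shift, squared ReLU, sigmoid gate); the naive `matvec` and all weights are illustrative stand-ins for the real linear layers:

```rust
/// Naive matrix-vector product standing in for a linear layer.
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

struct ChannelMix {
    time_mix_k: Vec<f32>,
    time_mix_r: Vec<f32>,
    w_k: Vec<Vec<f32>>,
    w_r: Vec<Vec<f32>>,
    w_v: Vec<Vec<f32>>,
}

impl ChannelMix {
    /// `state` holds the previous token's input for this layer and is
    /// updated in place, which is the "state tracking" part.
    fn forward(&self, x: &[f32], state: &mut Vec<f32>) -> Vec<f32> {
        // Token shift: blend the current input with the previous one.
        let mix = |m: &[f32]| -> Vec<f32> {
            x.iter()
                .zip(state.iter())
                .zip(m)
                .map(|((xi, si), mi)| xi * mi + si * (1. - mi))
                .collect()
        };
        let xk = mix(&self.time_mix_k);
        let xr = mix(&self.time_mix_r);
        // Sigmoid gate.
        let r: Vec<f32> = matvec(&self.w_r, &xr)
            .into_iter()
            .map(|v| 1. / (1. + (-v).exp()))
            .collect();
        // Squared ReLU.
        let k: Vec<f32> = matvec(&self.w_k, &xk)
            .into_iter()
            .map(|v| v.max(0.).powi(2))
            .collect();
        let v = matvec(&self.w_v, &k);
        // Remember this token's input for the next step.
        *state = x.to_vec();
        r.iter().zip(v.iter()).map(|(a, b)| a * b).collect()
    }
}
```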
* feat: support microphone whisper streaming
* fix: clean up print statements and adjust how the input is read
* fix: remove incorrect comment
* feat: split into new example and simplify
* fix: feature flag example file
* fix: fmt fixes
* feat: simplify and remove redundant files
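A minimal sketch of the buffering behind microphone streaming, assuming fixed-size windows over a growing sample buffer; the sample rate, window length, and type names are illustrative, and the actual example wires this up to a microphone capture callback:

```rust
/// Illustrative streaming parameters, not the example's actual values.
const SAMPLE_RATE: usize = 16_000;
const CHUNK_SECS: usize = 5;

/// Accumulate incoming samples and hand out fixed-size windows.
struct StreamBuffer {
    samples: Vec<f32>,
}

impl StreamBuffer {
    fn new() -> Self {
        Self { samples: Vec::new() }
    }

    /// Append freshly captured audio (e.g. from a microphone callback).
    fn push(&mut self, chunk: &[f32]) {
        self.samples.extend_from_slice(chunk);
    }

    /// Pop a full window once enough audio has been captured.
    fn next_window(&mut self) -> Option<Vec<f32>> {
        let window = SAMPLE_RATE * CHUNK_SECS;
        if self.samples.len() < window {
            return None;
        }
        let out = self.samples[..window].to_vec();
        self.samples.drain(..window);
        Some(out)
    }
}
```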
* Sketch the mamba model for inference.
* Complete the forward pass.
* Add the mamba example.
* Optimize the selective-scan part.
* Fix a couple shape mismatches and get inference to work.
* Tweak the readmes.
* More readme tweaks.
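A minimal sketch of the selective-scan recurrence that the optimization above targets, reduced to a single channel with scalar state; real code vectorizes this over the state dimension and the batch, and the parameter names are illustrative:

```rust
/// Very simplified, single-channel selective scan:
///   h_t = exp(delta_t * A) * h_{t-1} + (delta_t * b_t) * x_t
///   y_t = c_t * h_t
/// `x`, `delta`, `b` and `c` must all have the same length.
fn selective_scan(x: &[f32], delta: &[f32], a: f32, b: &[f32], c: &[f32]) -> Vec<f32> {
    let mut h = 0f32;
    let mut ys = Vec::with_capacity(x.len());
    for t in 0..x.len() {
        let a_bar = (delta[t] * a).exp();
        let b_bar = delta[t] * b[t];
        h = a_bar * h + b_bar * x[t];
        ys.push(c[t] * h);
    }
    ys
}
```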
* Use the repo config for trocr rather than hardcoding it + small tweaks.
* Add support for the printed models.
* Fail with an appropriate error message on missing position embeddings.
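A minimal sketch of failing with a descriptive error on a missing tensor such as the position embeddings, assuming a plain map of weights and a string error for illustration:

```rust
use std::collections::HashMap;

/// Look up a weight by name and return a descriptive error when it is
/// absent from the checkpoint, instead of panicking on a bare unwrap.
fn get_tensor<'a>(
    weights: &'a HashMap<String, Vec<f32>>,
    name: &str,
) -> Result<&'a Vec<f32>, String> {
    weights
        .get(name)
        .ok_or_else(|| format!("cannot find '{name}' in the model weights"))
}
```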
* Initial check-in for the qwen2 model.
* More qwen2 inference.
* Polish the qwen example.
* Fix the rope base.
* Get the inference to work.
* Support different model sizes.
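A minimal sketch of the rotary-embedding math behind the rope fix above, assuming the interleaved-pair convention; the base (`rope_theta`) should come from the model config rather than being hardcoded, and the function name is illustrative:

```rust
/// Apply rotary position embeddings to one head in place: each pair
/// (x[2i], x[2i+1]) is rotated by an angle that depends on the position
/// and on an inverse frequency derived from `base`.
fn apply_rope(x: &mut [f32], pos: usize, base: f32) {
    let dim = x.len();
    for i in 0..dim / 2 {
        let inv_freq = 1f32 / base.powf(2. * i as f32 / dim as f32);
        let angle = pos as f32 * inv_freq;
        let (sin, cos) = angle.sin_cos();
        let (x0, x1) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = x0 * cos - x1 * sin;
        x[2 * i + 1] = x0 * sin + x1 * cos;
    }
}
```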
* Add the ChatGLM model.
* Rotary embeddings.
* Add to the forward pass.
* Add to the forward pass.
* Add the rotary embeddings.
* Add the KV cache.
* Add the chatglm example.
* Bugfix.
* More glm fixes.
* Fix some shape issues.
* Get the inference to work.
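A minimal sketch of the KV cache added above: keys and values computed for new tokens are appended to those of earlier tokens so attention only has to run over the new positions. Shapes are flattened to `Vec<f32>` for illustration; the real code keeps (batch, heads, seq, head_dim) tensors:

```rust
/// Per-layer key/value cache.
#[derive(Default)]
struct KvCache {
    k: Vec<f32>,
    v: Vec<f32>,
    seq_len: usize,
}

impl KvCache {
    /// Append the keys/values computed for the current step and return the
    /// full sequence seen so far.
    fn append(&mut self, k_new: &[f32], v_new: &[f32], new_tokens: usize) -> (&[f32], &[f32]) {
        self.k.extend_from_slice(k_new);
        self.v.extend_from_slice(v_new);
        self.seq_len += new_tokens;
        (&self.k, &self.v)
    }

    /// Clear the cache, e.g. when starting a new prompt.
    fn reset(&mut self) {
        self.k.clear();
        self.v.clear();
        self.seq_len = 0;
    }
}
```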
* Support more audio formats
* Simplify the handling of the different buffer types.
* Check the sample rate.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
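A minimal sketch of the sample-rate check mentioned above, assuming a 16 kHz target and a naive linear resampler for illustration; the actual example may instead reject mismatched input or use a proper resampling crate:

```rust
/// Target rate expected by the downstream model (assumed 16 kHz mono).
const EXPECTED_SAMPLE_RATE: u32 = 16_000;

/// Return the samples at the expected rate, linearly interpolating when
/// the source rate differs.
fn to_expected_rate(samples: &[f32], rate: u32) -> Vec<f32> {
    if rate == EXPECTED_SAMPLE_RATE {
        return samples.to_vec();
    }
    let ratio = rate as f32 / EXPECTED_SAMPLE_RATE as f32;
    let out_len = (samples.len() as f32 / ratio) as usize;
    (0..out_len)
        .map(|i| {
            let src = i as f32 * ratio;
            let idx = src as usize;
            let frac = src - idx as f32;
            let a = samples[idx];
            let b = samples[(idx + 1).min(samples.len() - 1)];
            a * (1. - frac) + b * frac
        })
        .collect()
}
```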
* Metal quantized modifications proposal.
- Add a device param, wherever needed.
- Create new QMetal storage thing that implements QuantizedType.
- Update everywhere needed.
- Fix Python.
- Fix the examples.
- Fix: fmt + clippy + stub.
- Move everything around; only the actual implementations are missing.
- Fix everything and add the dequantization kernels.
- More work.
- Fix matmul.
- Fmt + clippy.
- Some clippy fixes.
- Working state; known issues at this point:
  - Q2K Metal -> bugged (also present in GGML).
  - Q4K CPU -> bugged (present previously; the new test catches it).
  - Q5K CPU -> bugged (present previously).
  - Q8_1 both -> never really implemented, it seems.
  - Q8K Metal -> never implemented in Metal.
- Fix the Q2K bug (present in ggml).
* Cleanup.
* Fix the rebase.
* Removing the fences speeds everything up and *is* correct this time...
* Cleanup the fence.
* After rebase.
* Bad code removal.
* Rebase after phi2 merge + fix replit default to CPU.
* Making the CI happy.
* More happy tests.
---------
Co-authored-by: Nicolas Patry <nicolas@Nicolass-MacBook-Pro.local>
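A minimal sketch of the kind of round-trip check that caught the Q4K/Q5K CPU bugs listed above: run the same matmul through a reference f32 path and through the quantized path under test, and fail when the aggregate relative error is too large. The closures, threshold, and error type are illustrative:

```rust
/// Compare a quantized matmul against the f32 reference on the same inputs.
fn check_matmul(
    reference: impl Fn(&[f32], &[f32]) -> Vec<f32>,
    quantized: impl Fn(&[f32], &[f32]) -> Vec<f32>,
    lhs: &[f32],
    rhs: &[f32],
) -> Result<(), String> {
    let expected = reference(lhs, rhs);
    let got = quantized(lhs, rhs);
    // Sum of absolute differences relative to the reference magnitude.
    let err: f32 = expected
        .iter()
        .zip(got.iter())
        .map(|(e, g)| (e - g).abs())
        .sum::<f32>()
        / expected.iter().map(|e| e.abs()).sum::<f32>().max(1e-6);
    if err > 0.05 {
        Err(format!("relative error too high: {err}"))
    } else {
        Ok(())
    }
}
```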
* Update the Phi model to use the updated architecture.
* Add more of the phi model.
* Repeat KV + caching.
* Apply the rotary embeddings.
* Add support for the new phi model in the phi example.
* Fix a couple glitches.
* Fix a couple more glitches.
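A minimal sketch of the "Repeat KV" step above, assuming grouped-query attention where each key/value head is repeated to match the number of query heads; heads are flat slices here instead of 4-D tensors:

```rust
/// Repeat each of the `n_kv_heads` heads `n_rep` times so the key/value
/// tensor lines up with the query heads.
fn repeat_kv(kv: &[f32], n_kv_heads: usize, head_len: usize, n_rep: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(kv.len() * n_rep);
    for h in 0..n_kv_heads {
        let head = &kv[h * head_len..(h + 1) * head_len];
        for _ in 0..n_rep {
            out.extend_from_slice(head);
        }
    }
    out
}
```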
* Use cfg to separate benchmark results based on features (see the sketch after this list)
* Add metal where_cond for f16 and bf16. Add benchmark
* Remove allow pragma
* Avoid some unnecessary returns.
* Improve benchmarks layout
* Update the feature-separated benchmarks
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
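A minimal sketch of using `cfg` to keep feature-separated benchmark results apart, assuming the enabled backend feature is baked into the benchmark id so that e.g. metal and cpu runs do not overwrite each other; the feature names and helper are illustrative:

```rust
/// Pick a backend label from the enabled cargo features; exactly one of
/// these cfg branches is active for any feature combination.
#[cfg(feature = "metal")]
const BACKEND: &str = "metal";
#[cfg(all(feature = "cuda", not(feature = "metal")))]
const BACKEND: &str = "cuda";
#[cfg(not(any(feature = "metal", feature = "cuda")))]
const BACKEND: &str = "cpu";

/// Build a benchmark id such as "where_cond_f16_metal".
fn bench_id(name: &str) -> String {
    format!("{}_{}", name, BACKEND)
}
```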