* Improve metal buffer usage
* Clone cpu storage when loading to reduce wait_until_complete calls
* Use powers of two for buffer sizes so reuse is more likely.
* Select best available buffer by size.
* Add count to MetalStorage so it can use a buffer with a different size
Co-authored-by: Chris Fleetwood <christopher.fleetwood@huggingface.co>
* Simplify new buffer creation without blit copy. Revert &[] -> Vec
* Add documentation on newBufferWithBytes safety / synchronization
* Drop unused buffers after command buffer is done syncing.
---------
Co-authored-by: Chris Fleetwood <christopher.fleetwood@huggingface.co>
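
The two buffer items above (power-of-two sizes, best available buffer by size) can be illustrated with a minimal sketch. This is not the code from this change: `bucket_size`, `best_fit`, and the `free_sizes` slice are hypothetical names, and a plain list of capacities stands in for the real Metal buffers.

```rust
/// Round a requested byte size up to the next power of two, so that
/// requests of similar sizes fall into the same bucket and can share buffers.
/// (Hypothetical helper, for illustration only.)
fn bucket_size(requested: usize) -> usize {
    // `max(1)` makes the zero-byte case explicit: it gets the smallest bucket.
    requested.max(1).next_power_of_two()
}

/// Return the index of the smallest idle buffer that can hold `requested`
/// bytes, or `None` if every idle buffer is too small.
/// `free_sizes` is an assumed list of capacities of currently unused buffers.
fn best_fit(free_sizes: &[usize], requested: usize) -> Option<usize> {
    free_sizes
        .iter()
        .enumerate()
        // Keep only buffers that are large enough.
        .filter(|&(_, &cap)| cap >= requested)
        // Among those, prefer the one that wastes the least space.
        .min_by_key(|&(_, &cap)| cap)
        .map(|(idx, _)| idx)
}
```

Under these assumptions, `bucket_size(1000)` gives 1024, and `best_fit(&[4096, 1024, 2048], 1500)` picks index 2 (the 2048-byte buffer) rather than the larger 4096-byte one.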
* Normalize loudness of the generated audio.
* Lints.
* One more lint.
* Avoid running the bs1770 tests.
* Another attempt at discarding doc comments.
* Also normalize the loudness in the encodec example.
* Add the metavoice transformer.
* Sketch the speaker-encoder module.
* Adding to the metavoice model.
* Start adding the metavoice example.
* Get some logits out.
* Load the second stage model.
* Get the second step to run.
* Tweak the example.
* Add encodec tilting.
* Glue the different bits together.
* Fix a shape issue.
* Use a constant.
* BPE tokenization.
* Fix the position index in metavoice.
* Add the metavoice transformer.
* Sketch the speaker-encoder module.
* Adding to the metavoice model.
* Start adding the metavoice example.
* Get some logits out.
* Load the second stage model.
* Get the second step to run.
* Tweak the example.
* Add encodec tilting.
* Glue the different bits together.
* Fix a shape issue.
* Use a constant.
* BPE tokenization.
* Add a warning.
* Encodec model.
* Fixes.
* Add the padding functions.
* Get the LSTM bit to work.
* Get the encodec model to generate some tokens (decoder only for now).
* Minor tweak.
* Minor tweak.
* Add quantized rwkv v5 model
* Integrate the quantized rwkv model in the initial example.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* Boilerplate for the quantized cuda support.
* More basic cuda support.
* More cuda quantization (quantize on cpu for now).
* Add the dequantization bit.
* Start adding some dedicated cuda kernels from llama.cpp.
* Move the kernel code.
* Start interfacing with the kernel.
* Tweak the kernel launch params.
* Bugfix for quantized metal.
* Fix some clippy lints.
* Tweak the launch parameters.
* Tweak cuda basics to perform a quantized matmul.
* Perform the dequantization on the cpu + use cublas for matmul.
* Add the dequantization kernel.
* Test the qmatmul.
* More kernels.
* Matmul-vec kernel.
* Add a couple kernels.
* More dequantization kernels.
* Add the Gemma models.
* Add the gemma example.
* Adapt the RmsNorm.
* Get the 2b model to work.
* 7b support.
* Use the config head dim.
* Yet another fix.
* Make the matrices contiguous.
* Also get the 7b model to work.
* And add to the readme.
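
The "Adapt the RmsNorm" item does not say what was changed. One plausible reading, assuming it follows the published Gemma reference implementation, is that the learned scale is applied as `1 + weight` instead of `weight`. The sketch below shows that variant on plain slices; `rms_norm_one_plus` is a hypothetical name and this is not the code from this change.

```rust
/// RMS normalization of a single vector, with the learned scale applied as
/// `(1 + weight)`. Hypothetical helper working on plain slices, for
/// illustration only; the `1 + weight` form is an assumption about what the
/// commit adapted.
fn rms_norm_one_plus(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len(), "x and weight must have the same length");
    // Mean of the squared entries.
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    // Reciprocal of the root mean square, with eps for numerical stability.
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter()
        .zip(weight)
        .map(|(v, w)| v * inv_rms * (1.0 + w))
        .collect()
}
```

With a zero `weight`, this reduces to plain RMS normalization, so a zero-initialized scale leaves the normalized values unchanged.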