* Separate quantized phi-3 implementation.
* Integrate the quantized phi3 model.
* Small fixes, get the generation to work properly.
* Keep the old llama implementation around.
* Change the default.
* Add the argsort cuda kernels.
* CPU version of arg-sort.
* Hook the cuda kernel + rework the cpu bits.
* Add some dedicated tests.
* Working cuda kernel.
* Metal kernel.
* Metal adjustments.
* Bugfix.
* Use the fast rope in qwen.
* Rework the expert selection in qwen.
* Quantized phi in a separate file.
* Add the quantized phi example + rework the model code.
* Improve the phi model.
* Get some generation out.
* Use the appropriate rope shape.
* Tweak the default prompt.
---------
Co-authored-by: Jane Doe <jane.doe@example.org>
* moondream implementation
* add moondream example
* change config default activation
* Add assets and integrate phi mixformer with example
* Make use of kv cache and fix seq_len bug; Clean up example code
* Add README link to example
* Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig
* Delete image
* Use apply instead of forward
* Use latest release special token; Fix token/s accuracy; Use GeluPytorchTanh in VisionConfig v2
* Derive debug and clone traits for Moondream model.
* This change avoids crashes when running T5 models with F16 tensors on CPU.
* This enables running ProstT5's (https://huggingface.co/Rostlab/ProstT5) encoder-only mode in Candle. This ProstT5 mode stores its embed_tokens weights within the encoder, as its decoding stage was replaced with a CNN. You could write more, like: This alone is not sufficient to run ProstT5 within Candle examples. We will develop a ProstT5 runner outside candle for now, but would be willing to upstream it to candle-examples at a later point.
* Revert "This enables running ProstT5's (https://huggingface.co/Rostlab/ProstT5) encoder-only mode in Candle. This ProstT5 mode stores its embed_tokens weights within the encoder, as its decoding stage was replaced with a CNN. You could write more, like: This alone is not sufficient to run ProstT5 within Candle examples. We will develop a ProstT5 runner outside candle for now, but would be willing to upstream it to candle-examples at a later point."
This reverts commit d886d3ce5e.
* This change avoids crashes when running T5 models with F16 tensors on CPU.
* This enables running ProstT5's (https://huggingface.co/Rostlab/ProstT5) encoder-only mode in Candle. This ProstT5 mode stores its embed_tokens weights within the encoder, as its decoding stage was replaced with a CNN. This alone is not sufficient to run ProstT5 within Candle examples. We will develop a ProstT5 runner outside candle for now, but would be willing to upstream it to candle-examples at a later point.
* Start adding the recurrent-gemma model.
* More griffin.
* Add the example + get the weights to load from the HF version.
* More inference code.
* Rope + kv-cache on the attention side.
* Add to the inference code.
* Add more to the recurrent gemma inference.
* Get some first inference to run.
* Add the softcap on logits.
* Fixes.
* Use partial rotary embeddings.
* Get inference to work.
* Add a comment.
* And add a readme.
* moondream implementation
* add moondream example
* change config default activation
* Add assets and integrate phi mixformer with example
* Make use of kv cache and fix seq_len bug; Clean up example code
* Add README link to example
* Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig
* Delete image
* Use apply instead of forward
* Use latest release special token; Fix token/s accuracy; Use GeluPytorchTanh in VisionConfig v2
* moondream implementation
* add moondream example
* change config default activation
* Add assets and integrate phi mixformer with example
* Make use of kv cache and fix seq_len bug; Clean up example code
* Add README link to example
* Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig
* Delete image
* Use apply instead of forward
* Pass bos token at the beginning of tensor.
* Quantize moondream.
* Forward with image bos token.
* Clippy.
* Use q4_0 quantization.
* Add pointers for sequence and tokens; Remove seq_len conditional
* moondream implementation
* add moondream example
* change config default activation
* Add assets and integrate phi mixformer with example
* Make use of kv cache and fix seq_len bug; Clean up example code
* Add README link to example
* Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig
* Delete image
* Use apply instead of forward
* CLIP model implementation with example
* CLIP Implementation fixes, batch images
* CLIP model remove images from git
* CLIP model remove unnecessary use of batch_indices
* Fast kernels for rotary embeddings.
* Add a test for the fast CPU kernel.
* Rope cuda bindings.
* Cuda kernel.
* Metal kernel (part 1).
* Cuda kernels.
* Finish the metal kernel.
* Use the new kernels in the quantized example.
* Fix warning.