candle

mirror of https://github.com/huggingface/candle.git synced 2025-06-16 10:38:54 +00:00

Author	SHA1	Message	Date
Laurent Mazare	b13a82a438	Separate quantized phi-3 implementation. (#2157 ) * Separate quantized phi-3 implementation. * Integrate the quantized phi3 model. * Small fixes, get the generation to work properly. * Keep the old llama implementation around. * Change the default.	2024-05-04 10:14:57 +02:00
Laurent Mazare	89f53b9d7b	Bump the version number to 0.5.1. (#2155 ) * Bump the version number to 0.5.1. * Fix clippy lints for 1.78. * More clippy fixes.	2024-05-03 11:17:05 +02:00
Laurent Mazare	96a48e5cc4	Add argsort. (#2132 ) * Add the argsort cuda kernels. * CPU version of arg-sort. * Hook the cuda kernel + rework the cpu bits. * Add some dedicated test. * Working cuda kernel. * Metal kernel. * Metal adjustments. * Bugfix. * Use the fast rope in qwen. * Rework the expert selection in qwen.	2024-04-27 20:17:35 +02:00
Isotr0py	6cf82fd7a3	Add Olmo models (#2127 ) * add olmo support * add olmo readme * Fix fmt. * Fix clippy. * Get olmo to work on cuda. --------- Co-authored-by: laurent <laurent.mazare@gmail.com>	2024-04-26 11:02:51 +02:00
Laurent Mazare	11d4a3c588	Add the phi-3 model. (#2120 ) * Add the phi-3 model. * Faster rope. * Bugfix. * Fix the detokenization.	2024-04-24 09:48:13 +02:00
Laurent Mazare	b2e816752b	Use the faster rms-norm kernel for llama. (#2107 ) * Use the faster rms-norm kernel for llama. * Use the fast variant by default.	2024-04-22 18:52:00 +02:00
Laurent Mazare	c388be93e7	Updated quantized phi model (#2099 ) * Quantized phi in a separate file. * Add the quantized phi example + rework the model code. * Improve the phi model. * Get some generation out. * Use the appropriate rope shape. * Tweak the default prompt. --------- Co-authored-by: Jane Doe <jane.doe@example.org>	2024-04-21 07:37:07 +02:00
Santiago Medina	d22f1d4f4e	Derive clone and debug traits for Moondream model (#2100 ) * moondream implementation * add moondream example * change config default activation * Add assets and integrate phi mixformer with example * Make use of kv cache and fix seq_len bug; Clean up example code * Add README link to example * Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig * Delete image * Use apply instead of forward * Use latest release special token; Fix token/s accuracy; Use GeluPytorchTanh in VisionConfig v2 * Derive debug and clone traits for Moondream model.	2024-04-21 07:08:28 +02:00
Laurent Mazare	587ee3bb6f	Small cleanups to the llama multi-process example. (#2098 )	2024-04-20 22:19:46 +02:00
Laurent Mazare	b45c710dbf	Fix for gemma MQA. (#2091 )	2024-04-19 21:49:55 +02:00
Laurent Mazare	2b93dffe64	Use faster rotary embeddings for llama like models. (#2087 )	2024-04-18 22:34:29 +02:00
Laurent Mazare	e6ee7ba4d4	Llama v3. (#2085 ) * Llama v3. * Tweak the default params + handle special tokens. * Small tweak.	2024-04-18 22:19:54 +02:00
Laurent Mazare	af955f260c	Make the falcon model cloneable. (#2067 )	2024-04-15 09:39:03 +02:00
Laurent Mazare	8ad822a983	Add a function to clear the KV cache in falcon. (#2066 ) * Add a function to clear the KV cache in falcon. * Clippy.	2024-04-15 09:29:25 +02:00
Laurent Mazare	50e49ecc5f	Add a quantized version of recurrent-gemma. (#2054 ) * Add a quantized version of recurrent-gemma. * Share the rglru part. * Get the quantized gemma model to work.	2024-04-13 20:07:01 +02:00
Victor-Mihaila	fb805b8ca2	Avoid crashes when running T5 models with F16 tensors on CPU (#2047 ) * This change avoids crashes when running T5 models with F16 tensors on CPU. * This enables running ProstT5's (https://huggingface.co/Rostlab/ProstT5) encoder-only mode in Candle. This ProstT5 mode stores its embed_tokens weights within the encoder, as its decoding stage was replaced with a CNN. You could write more, like: This alone is not sufficient to run ProstT5 within Candle examples. We will develop a ProstT5 runner outside candle for now, but would be willing to upstream it to candle-examples at a later point. * Revert "This enables running ProstT5's (https://huggingface.co/Rostlab/ProstT5) encoder-only mode in Candle. This ProstT5 mode stores its embed_tokens weights within the encoder, as its decoding stage was replaced with a CNN. You could write more, like: This alone is not sufficient to run ProstT5 within Candle examples. We will develop a ProstT5 runner outside candle for now, but would be willing to upstream it to candle-examples at a later point." This reverts commit `d886d3ce5e`.	2024-04-13 11:07:28 +02:00
Victor-Mihaila	79e3bec789	Change for the encoder-only ProstT5 model (#2045 ) * This change avoids crashes when running T5 models with F16 tensors on CPU. * This enables running ProstT5's (https://huggingface.co/Rostlab/ProstT5) encoder-only mode in Candle. This ProstT5 mode stores its embed_tokens weights within the encoder, as its decoding stage was replaced with a CNN. This alone is not sufficient to run ProstT5 within Candle examples. We will develop a ProstT5 runner outside candle for now, but would be willing to upstream it to candle-examples at a later point.	2024-04-13 11:06:24 +02:00
Laurent Mazare	2bf413caa3	Add the recurrent-gemma model. (#2039 ) * Start adding the recurrent-gemma model. * More griffin. * Add the example + get the weights to load from the HF version. * More inference code. * Rope + kv-cache on the attention side. * Add to the inference code. * Add more to the recurrent gemma inference. * Get some first inference to run. * Add the softcap on logits. * Fixes. * Use partial rotary embeddings. * Get inference to work. * Add a comment. * And add a readme.	2024-04-13 00:05:21 +02:00
Laurent Mazare	3ad4770eb6	Use cat for faster MQA computation. (#2043 ) * Use cat for faster MQA computation. * Move the function to utils + use it in mistral. * Use the shared repeat-kv in a few more models. * Fix.	2024-04-12 09:15:10 +02:00
Laurent Mazare	a0460cd2b1	Add the code-gemma models. (#2038 ) * Add the code-gemma models. * Tweak to the gemma config.	2024-04-10 21:19:21 +02:00
Laurent Mazare	b81ecf712d	Support alternative dtypes for mamba (#2036 ) * Allow different dtypes in mamba. * Add a dtype flag.	2024-04-10 18:10:01 +02:00
Laurent Mazare	33c9b66554	Add the new gemma models. (#2023 ) * Add the new gemma models. * Revert the lightning changes. * Support for the 1.1 models.	2024-04-06 21:25:38 +02:00
Laurent Mazare	e662431acf	Fix the final rmsnorm for quantized-metavoice. (#2021 )	2024-04-06 19:35:01 +02:00
Laurent Mazare	b869a659ec	Faster mask implementation for mixformers. (#2017 ) * Faster mask implementation for mixformers. * Clippy.	2024-04-05 09:38:26 +02:00
Laurent Mazare	88f7793598	Moondream tracing. (#2016 ) * Moondream tracing. * A bit more tracing.	2024-04-05 09:11:08 +02:00
Laurent Mazare	2ac302a5d1	Add the rope THD kernel. (#2014 ) * Add the rope THD kernel. * Cuda kernel for rope-thd. * Add the metal kernels. * Add a dedicated test.	2024-04-05 08:32:58 +02:00
Laurent Mazare	c87381fc96	Use F16 for moondream on cuda. (#2013 )	2024-04-04 23:30:10 +02:00
Laurent Mazare	f48c07e242	Include topk sampling in the quantized example. (#2005 ) * Include topk sampling in the quantized example. * Also sample with top-k on the mistral side.	2024-04-04 09:27:54 +02:00
Laurent Mazare	318d143224	Relax the contiguous check for cuda kernels. (#2000 ) * Relax the contiguous check for cuda kernels. * Ensure contiguity for RNNs. * Unrelated fix for segment anything. * Better error message + allow concatenating empty slices.	2024-04-03 09:02:38 +02:00
Laurent Mazare	08c049def3	Improve the handling of matmul with squeezed layouts. (#1998 ) * Improve the handling of matmul with squeezed layouts. * Fix for the cuda backend. * Revert the temporary fix.	2024-04-02 23:17:05 +02:00
Santiago Medina	d17b2cdad9	Match Moondream's latest release (#1997 ) * moondream implementation * add moondream example * change config default activation * Add assets and integrate phi mixformer with example * Make use of kv cache and fix seq_len bug; Clean up example code * Add README link to example * Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig * Delete image * Use apply instead of forward * Use latest release special token; Fix token/s accuracy; Use GeluPytorchTanh in VisionConfig v2	2024-04-02 21:37:09 +02:00
Jorge António	fb918a23c8	first commit (#1994 )	2024-04-02 16:31:05 +02:00
Laurent Mazare	b23436bf90	Stable diffusion fix. (#1993 ) * Stable diffusion fix. * And add a comment.	2024-04-02 14:36:28 +02:00
Laurent Mazare	be9c200cbb	Expose the t5 config fields + allow t5-large. (#1987 )	2024-04-01 20:58:34 +02:00
Santiago Medina	ea0d8d3753	Quantized moondream implementation and BOS token (#1980 ) * moondream implementation * add moondream example * change config default activation * Add assets and integrate phi mixformer with example * Make use of kv cache and fix seq_len bug; Clean up example code * Add README link to example * Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig * Delete image * Use apply instead of forward * Pass bos token at the beginning of tensor. * Quantize moondream. * Forward with image bos token. * Clippy. * Use q4_0 quantization. * Add pointers for sequence and tokens; Remove seq_len conditional	2024-04-01 19:37:54 +02:00
Laurent Mazare	f9954b73ba	Add options to use local files + specify a custom repo or branch. (#1973 )	2024-03-31 09:32:50 +02:00
Santiago Medina	92f81d2fcb	Add Moondream transformer implementation and example (#1970 ) * moondream implementation * add moondream example * change config default activation * Add assets and integrate phi mixformer with example * Make use of kv cache and fix seq_len bug; Clean up example code * Add README link to example * Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig * Delete image * Use apply instead of forward	2024-03-31 08:54:56 +02:00
Laurent Mazare	b190fd8592	Remove some unnecessary calls to contiguous. (#1968 ) * Remove some unnecessary calls to contiguous. * Slightly improved kv cache concatenation.	2024-03-30 13:22:00 +01:00
Laurent Mazare	708e422456	Qwen MoE model. (#1960 ) * Qwen MoE model. * Add the MoE model to the example. * Fix the scaling. * Readme updates. * Readme tweaks.	2024-03-28 23:10:57 +01:00
Laurent Mazare	cdc8b57b5c	Fix clippy lints + minor cleanups. (#1957 ) * Fix clippy lints + minor cleanups. * fmt. * Derive clone.	2024-03-28 14:17:46 +01:00
Tigran Zhampeissov	b0340d72ec	CLIP model implementation with example (#1950 ) * CLIP model implementation with example * CLIP Implementation fixes, batch images * CLIP model remove images from git * CLIP model remove unnecessary use of batch_indices	2024-03-28 13:44:12 +01:00
Jorge António	ada5d7c096	add send and sync trait bounds for scheduler config in stable diffusion models (#1952 ) * first commit * add Sync deriving * static * remove static	2024-03-28 10:03:00 +01:00
Jorge António	75b6d4b0da	add config for mamba 2.8b model parameter (#1946 ) * first commit * Make the mamba config public. --------- Co-authored-by: laurent <laurent.mazare@gmail.com>	2024-03-27 07:47:23 +01:00
Laurent Mazare	66f0a4eeea	Another fix for squeezing. (#1943 )	2024-03-26 17:05:26 +01:00
Laurent Mazare	4523ecfb2a	Faster repeat penalty (#1940 ) * Avoid the attention mask where possible. * Faster repeat penalty.	2024-03-26 11:31:20 +01:00
Laurent Mazare	196765e995	Use the new rope kernel in mistral. (#1937 ) * Use the new rope kernel in mistral. * Compute the cos and sin with full precision. * Bugfix.	2024-03-25 23:26:05 +01:00
Laurent Mazare	d3a8d291d5	Avoid the attention mask where possible. (#1933 )	2024-03-25 15:31:04 +01:00
Laurent Mazare	1b98f84a2b	Fast kernels for rotary embeddings. (#1928 ) * Fast kernels for rotary embeddings. * Add a test for the fast CPU kernel. * Rope cuda bindings. * Cuda kernel. * Metal kernel (part 1). * Cuda kernels. * Finish the metal kernel. * Use the new kernels in the quantized example. * Fix warning.	2024-03-24 22:48:52 +01:00
laurent	cf7d7fcf2f	Also avoid the mask in the llama example.	2024-03-24 19:04:32 +01:00
laurent	8c0db87992	Avoid using the attn mask when not necessary.	2024-03-24 18:55:56 +01:00

288 Commits