Commit Graph

378 Commits

SHA1 Message Date
2b93dffe64 Use faster rotary embeddings for llama like models. (#2087) 2024-04-18 22:34:29 +02:00
e6ee7ba4d4 Llama v3. (#2085)
* Llama v3.

* Tweak the default params + handle special tokens.

* Small tweak.
2024-04-18 22:19:54 +02:00
af955f260c Make the falcon model cloneable. (#2067) 2024-04-15 09:39:03 +02:00
8ad822a983 Add a function to clear the KV cache in falcon. (#2066)
* Add a function to clear the KV cache in falcon.

* Clippy.
2024-04-15 09:29:25 +02:00
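A minimal sketch of the cache-clearing pattern this commit describes, assuming the cache is stored per attention layer as an `Option<(Tensor, Tensor)>` (names are illustrative, not the exact falcon code):

```rust
use candle_core::Tensor;

struct Attention {
    // (keys, values) accumulated across decoding steps.
    kv_cache: Option<(Tensor, Tensor)>,
}

impl Attention {
    // Dropping the cached tensors is enough: the next forward pass
    // repopulates the cache from scratch for the new prompt.
    fn clear_kv_cache(&mut self) {
        self.kv_cache = None;
    }
}
```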
50e49ecc5f Add a quantized version of recurrent-gemma. (#2054)
* Add a quantized version of recurrent-gemma.

* Share the rglru part.

* Get the quantized gemma model to work.
2024-04-13 20:07:01 +02:00
fb805b8ca2 Avoid crashes when running T5 models with F16 tensors on CPU (#2047)
* This change avoids crashes when running T5 models with F16 tensors on CPU.

* This enables running ProstT5's (https://huggingface.co/Rostlab/ProstT5) encoder-only mode in Candle. This ProstT5 mode stores its embed_tokens weights within the encoder, as its decoding stage was replaced with a CNN. This alone is not sufficient to run ProstT5 within the Candle examples; we will develop a ProstT5 runner outside candle for now, but would be willing to upstream it to candle-examples at a later point.

* Revert "This enables running ProstT5's (https://huggingface.co/Rostlab/ProstT5) encoder-only mode in Candle. This ProstT5 mode stores it's embed_tokens weights within the encoder, as its decoding stage was replaced with a CNN. You could write more, like: This alone is not sufficient to run ProstT5 within Candle examples. We will develop a ProstT5 runner outside candle for now, but would be willing to upstream it to candle-examples at a later point."

This reverts commit d886d3ce5e.
2024-04-13 11:07:28 +02:00
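The commit message does not spell out the fix, but a common pattern for this class of F16-on-CPU crash is to run the numerically fragile step (typically the attention softmax) in F32 and convert back afterwards. An illustrative sketch of that pattern, not necessarily the exact change:

```rust
use candle_core::{DType, Result, Tensor};

// Upcast to F32 for the softmax, then return to the original dtype,
// so F16 activations cannot overflow to inf/NaN on the way through.
fn stable_softmax(scores: &Tensor) -> Result<Tensor> {
    let dtype = scores.dtype();
    let scores = scores.to_dtype(DType::F32)?;
    candle_nn::ops::softmax_last_dim(&scores)?.to_dtype(dtype)
}
```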
79e3bec789 Change for the encoder-only ProstT5 model (#2045)
* This change avoids crashes when running T5 models with F16 tensors on CPU.

* This enables running ProstT5's (https://huggingface.co/Rostlab/ProstT5) encoder-only mode in Candle. This ProstT5 mode stores its embed_tokens weights within the encoder, as its decoding stage was replaced with a CNN. This alone is not sufficient to run ProstT5 within the Candle examples; we will develop a ProstT5 runner outside candle for now, but would be willing to upstream it to candle-examples at a later point.
2024-04-13 11:06:24 +02:00
2bf413caa3 Add the recurrent-gemma model. (#2039)
* Start adding the recurrent-gemma model.

* More griffin.

* Add the example + get the weights to load from the HF version.

* More inference code.

* Rope + kv-cache on the attention side.

* Add to the inference code.

* Add more to the recurrent gemma inference.

* Get some first inference to run.

* Add the softcap on logits.

* Fixes.

* Use partial rotary embeddings.

* Get inference to work.

* Add a comment.

* And add a readme.
2024-04-13 00:05:21 +02:00
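The softcap mentioned above squashes the logits through a scaled tanh so no logit exceeds the cap in magnitude. A sketch of the usual formula with candle tensor ops:

```rust
use candle_core::{Result, Tensor};

// cap * tanh(logits / cap): smooth, monotonic, and bounded to (-cap, cap).
fn softcap(logits: &Tensor, cap: f64) -> Result<Tensor> {
    (logits / cap)?.tanh()? * cap
}
```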
3ad4770eb6 Use cat for faster MQA computation. (#2043)
* Use cat for faster MQA computation.

* Move the function to utils + use it in mistral.

* Use the shared repeat-kv in a few more models.

* Fix.
2024-04-12 09:15:10 +02:00
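The speed-up comes from materializing the repeated KV heads with a single `cat` plus a reshape instead of a strided broadcast copy. A sketch of the shared helper, close to (but not guaranteed identical with) the version in `candle_transformers::utils`:

```rust
use candle_core::{Result, Tensor};

// Repeat each KV head n_rep times so multi-query / grouped-query
// attention can reuse the standard attention path.
// xs: (batch, n_kv_heads, seq_len, head_dim)
fn repeat_kv(xs: Tensor, n_rep: usize) -> Result<Tensor> {
    if n_rep == 1 {
        Ok(xs)
    } else {
        let (b, n_kv_heads, seq_len, head_dim) = xs.dims4()?;
        // One contiguous cat + reshape is faster than a strided copy.
        Tensor::cat(&vec![&xs; n_rep], 2)?
            .reshape((b, n_kv_heads * n_rep, seq_len, head_dim))
    }
}
```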
a0460cd2b1 Add the code-gemma models. (#2038)
* Add the code-gemma models.

* Tweak to the gemma config.
2024-04-10 21:19:21 +02:00
b81ecf712d Support alternative dtypes for mamba (#2036)
* Allow different dtypes in mamba.

* Add a dtype flag.
2024-04-10 18:10:01 +02:00
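A hedged sketch of what such a dtype flag can look like (the flag name and parsing are illustrative, not the exact mamba example code; assumes clap with the `derive` feature):

```rust
use candle_core::DType;

#[derive(clap::Parser)]
struct Args {
    /// Run the model weights and activations in this dtype.
    #[arg(long, default_value = "f32")]
    dtype: String,
}

fn parse_dtype(s: &str) -> anyhow::Result<DType> {
    Ok(match s {
        "f32" => DType::F32,
        "f16" => DType::F16,
        "bf16" => DType::BF16,
        _ => anyhow::bail!("unsupported dtype {s}"),
    })
}
```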
33c9b66554 Add the new gemma models. (#2023)
* Add the new gemma models.

* Revert the lightning changes.

* Support for the 1.1 models.
2024-04-06 21:25:38 +02:00
e662431acf Fix the final rmsnorm for quantized-metavoice. (#2021) 2024-04-06 19:35:01 +02:00
b869a659ec Faster mask implementation for mixformers. (#2017)
* Faster mask implementation for mixformers.

* Clippy.
2024-04-05 09:38:26 +02:00
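One way such a mask gets cheaper is to build the whole lower-triangular pattern host-side in a single pass and upload it as one tensor, roughly as below (illustrative, not necessarily the exact mixformers change):

```rust
use candle_core::{Device, Result, Tensor};

// Causal mask: position i may not attend to positions j > i.
// Building the pattern in one Vec avoids per-row tensor ops.
fn causal_mask(t: usize, device: &Device) -> Result<Tensor> {
    let mask: Vec<u8> = (0..t)
        .flat_map(|i| (0..t).map(move |j| u8::from(j > i)))
        .collect();
    Tensor::from_slice(&mask, (t, t), device)
}
```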
88f7793598 Moondream tracing. (#2016)
* Moondream tracing.

* A bit more tracing.
2024-04-05 09:11:08 +02:00
2ac302a5d1 Add the rope THD kernel. (#2014)
* Add the rope THD kernel.

* Cuda kernel for rope-thd.

* Add the metal kernels.

* Add a dedicated test.
2024-04-05 08:32:58 +02:00
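"THD" refers to the input layout: (batch, time, heads, head_dim) instead of the (batch, heads, time, head_dim) the earlier rope kernels expect, which saves a transpose for models that already keep their tensors in that order. A usage sketch, assuming the kernel is exposed as `candle_nn::rotary_emb::rope_thd`:

```rust
use candle_core::{Device, Result, Tensor};

fn demo() -> Result<()> {
    let dev = Device::Cpu;
    let (b, t, h, d) = (1, 7, 4, 64);
    // q laid out as (batch, time, heads, head_dim): no transpose needed.
    let q = Tensor::randn(0f32, 1., (b, t, h, d), &dev)?;
    // cos/sin tables of shape (time, head_dim / 2).
    let cos = Tensor::randn(0f32, 1., (t, d / 2), &dev)?;
    let sin = Tensor::randn(0f32, 1., (t, d / 2), &dev)?;
    let _q_rot = candle_nn::rotary_emb::rope_thd(&q, &cos, &sin)?;
    Ok(())
}
```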
c87381fc96 Use F16 for moondream on cuda. (#2013) 2024-04-04 23:30:10 +02:00
f48c07e242 Include topk sampling in the quantized example. (#2005)
* Include topk sampling in the quantized example.

* Also sample with top-k on the mistral side.
2024-04-04 09:27:54 +02:00
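Top-k sampling restricts the draw to the k most probable tokens. A minimal, self-contained sketch of the filtering step on raw logits (the example wires the equivalent into its own sampling loop):

```rust
// Keep the k largest logits; everything else goes to -inf so it gets
// zero probability after the softmax. Ties at the cutoff may keep a
// few extra tokens, which is usually acceptable.
fn top_k_filter(logits: &mut [f32], k: usize) {
    let mut sorted: Vec<f32> = logits.to_vec();
    sorted.sort_by(|a, b| b.total_cmp(a));
    let cutoff = sorted[k.min(sorted.len()) - 1];
    for l in logits.iter_mut() {
        if *l < cutoff {
            *l = f32::NEG_INFINITY;
        }
    }
}
```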
318d143224 Relax the contiguous check for cuda kernels. (#2000)
* Relax the contiguous check for cuda kernels.

* Ensure contiguity for RNNs.

* Unrelated fix for segment anything.

* Better error message + allow concatenating empty slices.
2024-04-03 09:02:38 +02:00
08c049def3 Improve the handling of matmul with squeezed layouts. (#1998)
* Improve the handling of matmul with squeezed layouts.

* Fix for the cuda backend.

* Revert the temporary fix.
2024-04-02 23:17:05 +02:00
d17b2cdad9 Match Moondream's latest release (#1997)
* moondream implementation

* add moondream example

* change config default activation

* Add assets and integrate phi mixformer with example

* Make use of kv cache and fix seq_len bug; Clean up example code

* Add README link to example

* Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig

* Delete image

* Use apply instead of forward

* Use latest release special token; Fix token/s accuracy; Use GeluPytorchTanh in VisionConfig v2
2024-04-02 21:37:09 +02:00
fb918a23c8 first commit (#1994) 2024-04-02 16:31:05 +02:00
b23436bf90 Stable diffusion fix. (#1993)
* Stable diffusion fix.

* And add a comment.
2024-04-02 14:36:28 +02:00
be9c200cbb Expose the t5 config fields + allow t5-large. (#1987) 2024-04-01 20:58:34 +02:00
ea0d8d3753 Quantized moondream implementation and BOS token (#1980)
* moondream implementation

* add moondream example

* change config default activation

* Add assets and integrate phi mixformer with example

* Make use of kv cache and fix seq_len bug; Clean up example code

* Add README link to example

* Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig

* Delete image

* Use apply instead of forward

* Pass bos token at the beginning of tensor.

* Quantize moondream.

* Forward with image bos token.

* Clippy.

* Use q4_0 quantization.

* Add pointers for sequence and tokens; Remove seq_len conditional
2024-04-01 19:37:54 +02:00
f9954b73ba Add options to use local files + specify a custom repo or branch. (#1973) 2024-03-31 09:32:50 +02:00
92f81d2fcb Add Moondream transformer implementation and example (#1970)
* moondream implementation

* add moondream example

* change config default activation

* Add assets and integrate phi mixformer with example

* Make use of kv cache and fix seq_len bug; Clean up example code

* Add README link to example

* Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig

* Delete image

* Use apply instead of forward
2024-03-31 08:54:56 +02:00
b190fd8592 Remove some unnecessary calls to contiguous. (#1968)
* Remove some unnecessary calls to contiguous.

* Slightly improved kv cache concatenation.
2024-03-30 13:22:00 +01:00
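KV-cache concatenation appends the current step's keys and values to those from previous steps along the sequence axis. An illustrative sketch of the usual pattern (not the exact code from this commit):

```rust
use candle_core::{Result, Tensor};

// k, v: (batch, heads, new_tokens, head_dim). Cloning a Tensor only
// bumps a reference count, so storing the result back is cheap.
fn append_kv(
    cache: &mut Option<(Tensor, Tensor)>,
    k: Tensor,
    v: Tensor,
) -> Result<(Tensor, Tensor)> {
    let (k, v) = match cache.as_ref() {
        None => (k, v),
        Some((pk, pv)) => (Tensor::cat(&[pk, &k], 2)?, Tensor::cat(&[pv, &v], 2)?),
    };
    *cache = Some((k.clone(), v.clone()));
    Ok((k, v))
}
```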
708e422456 Qwen MoE model. (#1960)
* Qwen MoE model.

* Add the MoE model to the example.

* Fix the scaling.

* Readme updates.

* Readme tweaks.
2024-03-28 23:10:57 +01:00
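At the core of an MoE layer is a router that picks the top-k experts per token and renormalizes their weights. A hedged, plain-Rust sketch of that routing step (illustrative, not the exact Qwen MoE code):

```rust
// router_probs: softmaxed router outputs for one token, one entry per
// expert. Returns (expert_index, renormalized_weight) pairs.
fn route(router_probs: &[f32], top_k: usize) -> Vec<(usize, f32)> {
    let mut idx: Vec<usize> = (0..router_probs.len()).collect();
    idx.sort_by(|&a, &b| router_probs[b].total_cmp(&router_probs[a]));
    let chosen = &idx[..top_k.min(idx.len())];
    let norm: f32 = chosen.iter().map(|&i| router_probs[i]).sum();
    chosen.iter().map(|&i| (i, router_probs[i] / norm)).collect()
}
```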
cdc8b57b5c Fix clippy lints + minor cleanups. (#1957)
* Fix clippy lints + minor cleanups.

* fmt.

* Derive clone.
2024-03-28 14:17:46 +01:00
b0340d72ec CLIP model implementation with example (#1950)
* CLIP model implementation with example

* CLIP Implementation fixes, batch images

* CLIP model remove images from git

* CLIP model remove unnecessary use of batch_indices
2024-03-28 13:44:12 +01:00
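CLIP scores an image against a batch of texts by L2-normalizing both embeddings and taking dot products. A hedged sketch of that similarity computation (illustrative; not the exact code in this PR):

```rust
use candle_core::{D, Result, Tensor};

// img: (n_images, dim), txt: (n_texts, dim).
// Returns the (n_images, n_texts) similarity matrix.
fn clip_logits(img: &Tensor, txt: &Tensor) -> Result<Tensor> {
    let normalize = |t: &Tensor| -> Result<Tensor> {
        t.broadcast_div(&t.sqr()?.sum_keepdim(D::Minus1)?.sqrt()?)
    };
    normalize(img)?.matmul(&normalize(txt)?.t()?)
}
```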
ada5d7c096 add send and sync trait bounds for scheduler config in stable diffusion models (#1952)
* first commit

* add Sync deriving

* static

* remove static
2024-03-28 10:03:00 +01:00
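Adding `Send + Sync` as supertraits lets a boxed scheduler config cross thread boundaries, e.g. when the pipeline is driven from an async server. A sketch of the shape of the change (the `Scheduler` trait body here is illustrative):

```rust
use candle_core::Result;

pub trait Scheduler: Send + Sync {
    fn step(&mut self, timestep: usize) -> Result<()>; // illustrative method
}

// The fix in spirit: with Send + Sync as supertraits, every implementor
// is thread-safe by contract and Box<dyn SchedulerConfig> can be shared.
pub trait SchedulerConfig: std::fmt::Debug + Send + Sync {
    fn build(&self, inference_steps: usize) -> Result<Box<dyn Scheduler>>;
}
```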
75b6d4b0da add config for mamba 2.8b model parameter (#1946)
* first commit

* Make the mamba config public.

---------

Co-authored-by: laurent <laurent.mazare@gmail.com>
2024-03-27 07:47:23 +01:00
66f0a4eeea Another fix for squeezing. (#1943) 2024-03-26 17:05:26 +01:00
4523ecfb2a Faster repeat penalty (#1940)
* Avoid the attention mask where possible.

* Faster repeat penalty.
2024-03-26 11:31:20 +01:00
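The usual repeat penalty divides positive logits (and multiplies negative ones) for tokens already present in the context; deduplicating the context so each token is visited once is one way to make it faster. A sketch modeled on (but not guaranteed identical to) `candle_transformers::utils::apply_repeat_penalty`:

```rust
use candle_core::{Result, Tensor};
use std::collections::HashSet;

// Assumes a rank-1 F32 logits tensor.
fn apply_repeat_penalty(logits: &Tensor, penalty: f32, context: &[u32]) -> Result<Tensor> {
    let device = logits.device().clone();
    let mut logits = logits.to_vec1::<f32>()?;
    let mut seen = HashSet::new();
    for &token_id in context {
        // Each distinct context token is penalized exactly once.
        if !seen.insert(token_id) {
            continue;
        }
        if let Some(logit) = logits.get_mut(token_id as usize) {
            if *logit >= 0. { *logit /= penalty } else { *logit *= penalty }
        }
    }
    let len = logits.len();
    Tensor::from_vec(logits, len, &device)
}
```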
196765e995 Use the new rope kernel in mistral. (#1937)
* Use the new rope kernel in mistral.

* Compute the cos and sin with full precision.

* Bugfix.
2024-03-25 23:26:05 +01:00
d3a8d291d5 Avoid the attention mask where possible. (#1933) 2024-03-25 15:31:04 +01:00
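During incremental decoding the model sees one new query token per step, and a causal mask over a single query position hides nothing, so it can be skipped entirely. A sketch of the guard, reusing a mask helper like the one sketched for the mixformers commit above:

```rust
use candle_core::{Device, Result, Tensor};

fn causal_mask(t: usize, device: &Device) -> Result<Tensor> {
    let m: Vec<u8> = (0..t)
        .flat_map(|i| (0..t).map(move |j| u8::from(j > i)))
        .collect();
    Tensor::from_slice(&m, (t, t), device)
}

// A one-token step can attend to everything already cached, so the
// mask is only built for the multi-token prompt-processing pass.
fn maybe_mask(seq_len: usize, device: &Device) -> Result<Option<Tensor>> {
    if seq_len <= 1 {
        Ok(None)
    } else {
        causal_mask(seq_len, device).map(Some)
    }
}
```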
1b98f84a2b Fast kernels for rotary embeddings. (#1928)
* Fast kernels for rotary embeddings.

* Add a test for the fast CPU kernel.

* Rope cuda bindings.

* Cuda kernel.

* Metal kernel (part 1).

* Cuda kernels.

* Finish the metal kernel.

* Use the new kernels in the quantized example.

* Fix warning.
2024-03-24 22:48:52 +01:00
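For reference, the rotation these kernels implement, assuming the non-interleaved convention where element i of a head pairs with element i + d/2, looks like this on a single head in plain Rust:

```rust
// x: one head of even length d; cos/sin: d/2 precomputed values for
// this position. out[i] = x[i]*cos[i] - x[i+d/2]*sin[i], and
// out[i+d/2] = x[i]*sin[i] + x[i+d/2]*cos[i].
fn rope_one_head(x: &[f32], cos: &[f32], sin: &[f32]) -> Vec<f32> {
    let d2 = x.len() / 2;
    let mut out = vec![0f32; x.len()];
    for i in 0..d2 {
        out[i] = x[i] * cos[i] - x[i + d2] * sin[i];
        out[i + d2] = x[i] * sin[i] + x[i + d2] * cos[i];
    }
    out
}
```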
cf7d7fcf2f Also avoid the mask in the llama example. 2024-03-24 19:04:32 +01:00
8c0db87992 Avoid using the attn mask when not necessary. 2024-03-24 18:55:56 +01:00
e2b4829531 Support more mistral models. (#1927)
* Support more mistral models.

* Use the appropriate rope parameter.
2024-03-24 08:04:04 +01:00
5e70821dd0 Allow for arbitrary temperature modifications. 2024-03-23 15:47:39 +01:00
a62a97340c Add topk sampling. (#1923) 2024-03-23 15:26:09 +01:00
6f877592a7 Avoid broadcasting on the batch dimension for the attention mask. (#1920) 2024-03-23 13:08:53 +01:00
32f567bac4 Fix loading the gguf files. (#1913) 2024-03-22 10:28:38 +01:00
c07e4057ab Fix for the llama model. (#1906) 2024-03-21 19:36:10 +01:00
c0bdd9c7a6 Use the fast RmsNorm in the quantized model. (#1904) 2024-03-21 18:49:35 +01:00
455c42aa72 Avoid copying the data on squeeze and unsqueeze. (#1884)
* Avoid copying the data on squeeze and unsqueeze.

* Fix the quantized llama example.

* Unrelated fix for the quantized stable-lm example on cuda.

* Fix for mamba on cuda (unrelated to the PR).
2024-03-20 13:04:36 +01:00
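Squeezing or unsqueezing a size-1 dimension only changes shape metadata, so the result can alias the existing storage instead of copying it. A small usage sketch:

```rust
use candle_core::{DType, Device, Result, Tensor};

fn demo() -> Result<()> {
    let t = Tensor::zeros((1, 4, 1, 8), DType::F32, &Device::Cpu)?;
    // Only the shape/stride metadata changes; after this commit the
    // tensors share the underlying buffer rather than copying it.
    let u = t.squeeze(0)?.squeeze(1)?; // shape (4, 8)
    let _w = u.unsqueeze(0)?; // shape (1, 4, 8)
    Ok(())
}
```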
90fc82211f Use a common with_tracing::RmsNorm in a few models. (#1871)
* Add RmsNorm with tracing.

* Use with_tracing::RmsNorm in some models.
2024-03-18 21:40:06 +01:00
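`with_tracing::RmsNorm` wraps the norm in a tracing span so forward calls show up in profiles; the computation itself is standard RMS normalization, sketched here (illustrative, not the exact shared implementation):

```rust
use candle_core::{D, Result, Tensor};

// y = x / sqrt(mean(x^2) + eps) * weight, with the mean taken over
// the last (hidden) dimension.
fn rms_norm(x: &Tensor, weight: &Tensor, eps: f64) -> Result<Tensor> {
    let rms = (x.sqr()?.mean_keepdim(D::Minus1)? + eps)?.sqrt()?;
    x.broadcast_div(&rms)?.broadcast_mul(weight)
}
```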
ff03fd3fb3 Expose some helper functions to create quantized models. (#1837) 2024-03-12 11:30:24 +01:00