candle

mirror of https://github.com/huggingface/candle.git synced 2025-06-16 10:38:54 +00:00

Author	SHA1	Message	Date
Laurent Mazare	957d604a78	Enable BF16 on metal. (#2380 )	2024-08-01 11:05:07 +02:00
Takanori MAEHARA	ce90287f45	Add get_ids to GradStore (#2379 )	2024-08-01 10:56:13 +02:00
Laurent Mazare	1ba87a9450	Use BF16 on metal when possible. (#2378 )	2024-08-01 10:48:58 +02:00
Yun-Jhong Wu	bd80078acf	Fix log_sum_exp to handle large positive/negative inputs (#2367 )	2024-08-01 10:37:02 +02:00
Laurent Mazare	8696cf6494	Enable the affine kernel for u8/u32. (#2376 )	2024-08-01 10:03:11 +02:00
Eric Buehler	0f5cbb08b3	Add support for Llama 3.1 (#2359 ) * Add Llama 3.1 rope * Clippy * Format * Clippy * Add support for multiple eos tokens: * Untagged either * Remove either dep and fix settings.json * Make the max positional embeddings configurable	2024-07-26 21:32:26 +02:00
Ivor Wanders	f25173d68b	Fix for backprop in ConvTranspose2D with stride of 2 (#2337 ) * Add gradient test for conv_transpose2d with stride of 2. * Swap dilation and stride in ConvTranspose2D backpropagation. Without this, a shape mismatch occurs with a stride of 2 and dilation of 1. * Add further tests of the ConvTranspose2D gradient. Values calculated with torch, minor numerical errors adjusted and commented.	2024-07-17 19:22:23 +02:00
Alexey Gerasev	6a4741bbf9	Fix Elu gradient NaN on large input (#2328 ) * Fix Elu gradient NaN on large input * Reuse previously computed exp in Elu	2024-07-16 14:41:16 +02:00
Laurent Mazare	25960676ca	Add a basic metal example with capture (#2324 ) * Add some tracing. * Get the trace to work.	2024-07-09 12:38:11 +02:00
Laurent Mazare	6baa1d486b	Fix a bug in the metal implementation of col2im1d. (#2284 )	2024-06-22 23:21:20 +02:00
Laurent Mazare	36cf54525d	Fix the fast bf16 gemm cublas kernels. (#2274 ) * Use flash-attn in gemma. * Fix for the fast bf16 cublas gemm. * Fix some clippy lints. * Fix another lint. * Proper clippy fix.	2024-06-18 23:46:58 +02:00
Eric Buehler	9182c828e6	Automatically upcast for to_u64 (#2244 )	2024-06-04 11:32:36 +02:00
Lionel Touati	1ec3b2cc18	add where_cond f32 for metal (#2236 )	2024-06-02 14:30:06 +02:00
Laurent Mazare	0814dfd148	Add a metal kernel for col2im1d. (#2214 ) * Add a metal kernel for col2im1d. * Enable the col2im variant. * Bugfix. * Revert the quantized tweak.	2024-05-25 11:03:23 +02:00
Laurent Mazare	1df2bddccf	Add the layernorm specialized op. (#2212 ) * Add the layernorm cuda kernels. * Dedicated layer norm op. * Add the slower variant. * Plug the cuda implementation. * Add the metal variant. * Add a dedicated test. * Bugfix.	2024-05-24 15:58:01 +02:00
Laurent Mazare	6f0b807ffd	More efficient cuda implementation for ConvTranspose1d. (#2211 ) * More efficient cuda implementation for ConvTranspose1d. * Small tweak.	2024-05-24 11:05:43 +02:00
Laurent Mazare	01545f7303	Add a slice_set op. (#2193 ) * Add a slice_set op. * Add some testing. * Add the dedicated kv-cache module. * Derive debug and clone. * Expose more kv-cache functions. * Return the current data when appending. * Use the new cache in the quantized phi3 model.	2024-05-18 15:58:18 +02:00
Laurent Mazare	21f82a5155	Add SliceSafetensors. (#2179 ) * Add SlicedSafetensors. * And add some testing.	2024-05-11 13:15:42 +02:00
Laurent Mazare	9cff7bc3f4	Make it possible to use TF32 accumulation in F32 matmuls. (#2178 ) * Allow the use of tf32 accumulation in matmul. * Better timings. * Dummy versions for use when cuda is not enabled.	2024-05-11 12:28:39 +02:00
Laurent Mazare	01794dc16e	Use write rather than try-write on the metal rw-locks. (#2162 )	2024-05-05 07:22:46 +02:00
Laurent Mazare	b13a82a438	Separate quantized phi-3 implementation. (#2157 ) * Separate quantized phi-3 implementation. * Integrate the quantized phi3 model. * Small fixes, get the generation to work properly. * Keep the old llama implementation around. * Change the default.	2024-05-04 10:14:57 +02:00
Laurent Mazare	89f53b9d7b	Bump the version number to 0.5.1. (#2155 ) * Bump the version number to 0.5.1. * Fix clippy lints for 1.78. * More clippy fixes.	2024-05-03 11:17:05 +02:00
Laurent Mazare	fa06f5f5f9	F16/BF16 bugfix (bis). (#2143 ) * F16/BF16 bugfix (bis). * Another fix. * Yet another fix.	2024-04-29 14:08:44 +02:00
Laurent Mazare	09d4845aa8	Bugfix the recent f16/bf16 changes. (#2142 )	2024-04-29 13:30:11 +02:00
Jeffrey Dallatezza	a0d03aded1	Bug Fix: When converting a tensor to a variable, clone if the tensor is already a variable. (#2124 ) * When converting a tensor to a variable, clone if the tensor is already a variable. * Add a test to ensure training a batch norm works with VarMaps --------- Co-authored-by: Jeffrey Dallatezza <jeffreydallatezza@Jeffreys-Laptop.local>	2024-04-29 11:21:53 +02:00
MilkFather	3bbb88fcb4	Fix sigmoid gradient calculation and move sigmoid into a specialized op (#2114 ) * add sigmoid op * small fix * add as a method on `Tensor` * implement gradient calculation for sigmoid * add sigmoid tests * we should have a specialized op for this * fix clippy * fix clippy 2 * Revert all previous commits in favor of a `CustomOp` based solution * use `CustomOp1` implementation * fix rustfmt * experimental add metal impl * add cuda kernel impl * fix fmt * Add a test + reduce some cuda duplication. --------- Co-authored-by: laurent <laurent.mazare@gmail.com>	2024-04-29 11:04:43 +02:00
Laurent Mazare	ed7b99f525	Add a toggle for F16/BF16 accumulation in gemm. (#2141 ) * Add a toggle to control f16/bf16 gemm precision. * Use the faster variant in the quantized example. * Bugfix.	2024-04-29 09:21:07 +02:00
Laurent Mazare	287013ef28	Add a forward_via_f16 method to the qmatmul op. (#2138 )	2024-04-28 20:35:01 +02:00
Laurent Mazare	eb26e2467e	Add the cuda dequantize f16 kernels. (#2137 ) * Add the cuda dequantize f16 kernels. * Expose the cuda kernels. * Add some testing + fix. * Test the other cases too. * A few more tests. * Add an environment variable to enable the dequantize f16 + matmul behavior.	2024-04-28 20:05:05 +02:00
Laurent Mazare	805f3be8e1	Add a sort function. (#2134 )	2024-04-28 08:18:04 +02:00
Laurent Mazare	96a48e5cc4	Add argsort. (#2132 ) * Add the argsort cuda kernels. * CPU version of arg-sort. * Hook the cuda kernel + rework the cpu bits. * Add some dedicated test. * Working cuda kernel. * Metal kernel. * Metal adjustments. * Bugfix. * Use the fast rope in qwen. * Rework the expert selection in qwen.	2024-04-27 20:17:35 +02:00
Laurent Mazare	8a05743a21	Add StorageRef. (#2113 ) * Add the storage-ref bits. * Add the metal implementation.	2024-04-23 13:23:27 +02:00
dependabot[bot]	08a15cb79e	Update zip requirement from 0.6.6 to 1.1.1 (#2103 ) * Update zip requirement from 0.6.6 to 1.1.1 --- updated-dependencies: - dependency-name: zip dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * Fix for the zip crate update. --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: laurent <laurent.mazare@gmail.com>	2024-04-22 16:23:27 +02:00
Thomas Santerre	0067fe00a8	Metal Unary: Add benchmarks and process kernels in a tile based fashion (#2056 ) * add basic unary bench for sqrt * process unary commands in tiles of 4 * re-enable all benchmarks * rename helper to unary * modify approach to split up tiled and non-tiled operations * undo bench ignore for other tests * update tile size to 2 * only perform the optimization on the contiguous even numbered element case	2024-04-21 00:10:33 +02:00
Laurent Mazare	587ee3bb6f	Small cleanups to the llama multi-process example. (#2098 )	2024-04-20 22:19:46 +02:00
Laurent Mazare	dd78422701	Handle multiple dimensions in metal QMM + two fixes. (#2097 )	2024-04-20 18:55:45 +02:00
Laurent Mazare	1690ab45d2	Fix the silu gradient issue on 0. (#2083 )	2024-04-18 14:31:41 +02:00
Laurent Mazare	8de0ce6cba	Add more QMMV cuda kernels. (#2077 ) * Add more QMMV cuda kernels. * Enable the new kernels. * Adapt the testing.	2024-04-18 08:36:43 +02:00
Laurent Mazare	2817643db9	Add the mmv kernels for small batch sizes. (#2075 ) * Add the mmv kernels for smaller sizes. * Support more mmv kernels. * Use the new kernels. * Fix the call. * Silly fix. * Improve the testing. * Fix for dmmv. * Add another dedicated test for the batching mmv.	2024-04-16 21:30:51 +02:00
Laurent Mazare	f135b7963d	Fix for the batch dim in the quantized matmul example. (#2073 ) * Fix for the batch dim in the quantized matmul example. * Enable more tests on cuda. * Add a test for qmm with a batch. * Fix the zeros-dim test on metal.	2024-04-15 20:00:28 +02:00
Laurent Mazare	8ad822a983	Add a function to clear the KV cache in falcon. (#2066 ) * Add a function to clear the KV cache in falcon. * Clippy.	2024-04-15 09:29:25 +02:00
Laurent Mazare	e198bb0816	Handle zero dims in some simple operations. (#2064 ) * Handle zero dims in some simple operations. * Handle zero-dims in matmul. * More testing.	2024-04-15 09:18:54 +02:00
Laurent Mazare	f7d5bf5b97	Faster kernels for quantized matmul on cuda (#2060 ) * Hook the quantized matmul cuda kernels. * Add a (currently broken) test. * Kernel fixes. * Fix by transposing the rhs matrix. * Add the q4-1 kernels. * Proper block sizes. * More details in the tests.	2024-04-15 08:32:47 +02:00
Laurent Mazare	c449f65b12	Expose the synchronize function on the generic device. (#2062 )	2024-04-14 23:02:03 +02:00
ivarflakstad	db7dbf3071	Add missing bfloat unary strided kernels and fix typo (#2058 )	2024-04-14 20:01:13 +02:00
Laurent Mazare	53e5380bf6	Add a synchronize method to devices. (#2055 ) * Add a synchronize method to devices. * Metal version.	2024-04-14 16:32:55 +02:00
Thomas Santerre	4c88c3ce06	Add benchmarks for qmatmul operations (#2048 ) * Add qmatmul bench * add all dtypes	2024-04-13 12:30:14 +02:00
Laurent Mazare	a4d5a414e3	Support gather on bf16 for metal. (#2035 )	2024-04-10 12:49:25 +02:00
Laurent Mazare	718671a0d5	Use BufferOffset in metal backend ops. (#2029 ) * Use BufferOffset in the metal backend. * More BufferOffset usage. * Use in where-cond.	2024-04-08 09:37:25 +02:00
Laurent Mazare	c5fe4a7f89	Rework the buffer offset logic for metal kernels (#2028 ) * Move the metal kernels utils in a separate module. * Use the BufferOffset for unary ops. * Fix clippy lints. * Use the new BufferOffset. * Adapt the binary ops. * Affine. * More ops (powf, elu, cast).	2024-04-07 22:37:53 +02:00

771 Commits