Commit Graph

14 Commits

SHA1 Message Date
8d1a57c9a0 chore: update flash attention kernels (#1518)
* chore: update flash attention kernels

* fmt

* remove unused kernels

* force f32

* correct stride
2024-01-05 18:28:55 +01:00
d2c3f14773 Fix for flash-attn. (#1310)
Co-authored-by: laurent <laurent@par2dc5-ai-prd-cl01dgx02.cm.cluster>
2023-11-10 10:27:27 +01:00
ab0d9fbdd1 Properly set the is_bf16 flag. (#738) 2023-09-04 16:45:26 +01:00
f80fd44201 BF16 support for flash-attn. (#737) 2023-09-04 16:35:43 +01:00
d0cdea95a5 Add back the bf16 flash-attn kernels. (#730) 2023-09-04 07:50:52 +01:00
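Taken together, the three bf16 entries above add a second kernel family and a flag that selects it from the input dtype; a minimal sketch of that dispatch (illustrative names, not the crate's actual internals):

```rust
use candle_core::{DType, Result, Tensor};

// Illustrative helper: derive an is_bf16 flag from the query dtype.
// The real kernel selection lives inside candle-flash-attn; this only sketches the idea.
fn is_bf16(q: &Tensor) -> Result<bool> {
    match q.dtype() {
        DType::F16 => Ok(false),
        DType::BF16 => Ok(true),
        dtype => candle_core::bail!("flash-attn expects f16 or bf16 inputs, got {dtype:?}"),
    }
}
```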
03be33eea4 Relax the requirements on CustomOp. (#486)
* Relax the requirements on CustomOp.

* Simplify the custom-ops when no backward is required.
2023-08-17 11:12:05 +01:00
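With the backward pass made optional, a forward-only custom op only needs a name and a CPU kernel. The sketch below assumes the CustomOp1 trait shape in candle_core at the time (a cpu_fwd over a CpuStorage/Layout pair, with cuda_fwd and bwd left to their defaults); treat the details as an approximation rather than the exact API.

```rust
use candle_core::{CpuStorage, CustomOp1, Layout, Result, Shape};

// A forward-only custom op (no bwd, no cuda_fwd): doubles an f32 tensor on CPU.
struct Double;

impl CustomOp1 for Double {
    fn name(&self) -> &'static str {
        "double"
    }

    fn cpu_fwd(&self, storage: &CpuStorage, layout: &Layout) -> Result<(CpuStorage, Shape)> {
        let slice = storage.as_slice::<f32>()?;
        let src = match layout.contiguous_offsets() {
            Some((start, end)) => &slice[start..end],
            None => candle_core::bail!("double: input must be contiguous"),
        };
        let dst: Vec<f32> = src.iter().map(|v| v * 2.0).collect();
        Ok((CpuStorage::F32(dst), layout.shape().clone()))
    }
}

// Usage (assumed entry point): let ys = xs.apply_op1(Double)?;
```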
67834119fc Fix the flash-attention function names. (#282) 2023-07-31 10:04:39 +01:00
0ace420e66 Flash attention without padding (varlen). (#281)
* Expose the seqlen variable for flash-attn without padding.

* Fix the batched call.

* Adapt for the varlen variant.

* No need to set the batch strides when in varlen mode.

* Add a test (disabled at the moment).

* Get the test to work properly.
2023-07-31 09:45:39 +01:00
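The varlen entry above is the padding-free path: sequences of different lengths are packed back to back and delimited by cumulative offsets instead of being padded to a common length. A rough sketch of such a call, assuming a flash_attn_varlen entry point along these lines (argument order and expected layout should be checked against candle-flash-attn):

```rust
use candle_core::{Result, Tensor};

// Hypothetical wrapper: q/k/v are packed as (total_tokens, num_heads, head_dim),
// here with two sequences of lengths 3 and 5, so the cumulative offsets are [0, 3, 8].
fn packed_attention(q: &Tensor, k: &Tensor, v: &Tensor) -> Result<Tensor> {
    let seqlens = Tensor::new(&[0u32, 3, 8], q.device())?;
    let max_seqlen = 5usize;
    let softmax_scale = 1f32 / (64f32).sqrt(); // assuming head_dim = 64
    candle_flash_attn::flash_attn_varlen(
        q,
        k,
        v,
        &seqlens, // cumulative sequence lengths for q
        &seqlens, // cumulative sequence lengths for k
        max_seqlen,
        max_seqlen,
        softmax_scale,
        true, // causal
    )
}
```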
4f92420132 Add some flash attn test (#253)
* Add some flash-attn test.

* Add the cpu test.

* Fail when the head is not a multiple of 8.

* Polish the flash attention test.
2023-07-26 20:56:00 +01:00
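A sketch of what a test in this spirit might check: run the kernel on small f16 inputs and compare against a plain softmax(QKᵀ)·V reference. This is an approximation of the idea, with assumed tensor layouts, not the repository's actual test.

```rust
#[cfg(feature = "flash-attn")]
#[test]
fn flash_attn_close_to_reference() -> candle_core::Result<()> {
    use candle_core::{DType, Device, Tensor, D};

    let dev = Device::new_cuda(0)?;
    // Small (batch=1, seq=4, heads=2, head_dim=8) inputs; the kernel wants f16/bf16.
    let q = Tensor::randn(0f32, 1.0, (1, 4, 2, 8), &dev)?.to_dtype(DType::F16)?;
    let k = Tensor::randn(0f32, 1.0, (1, 4, 2, 8), &dev)?.to_dtype(DType::F16)?;
    let v = Tensor::randn(0f32, 1.0, (1, 4, 2, 8), &dev)?.to_dtype(DType::F16)?;
    let scale = 1f32 / (8f32).sqrt();

    let flash = candle_flash_attn::flash_attn(&q, &k, &v, scale, false)?;

    // Naive reference in f32 on a (batch, heads, seq, head_dim) layout.
    let qr = q.to_dtype(DType::F32)?.transpose(1, 2)?;
    let kr = k.to_dtype(DType::F32)?.transpose(1, 2)?;
    let vr = v.to_dtype(DType::F32)?.transpose(1, 2)?;
    let att = (qr.matmul(&kr.t()?)? * scale as f64)?;
    let att = att.exp()?;
    let att = att.broadcast_div(&att.sum_keepdim(D::Minus1)?)?;
    let reference = att.matmul(&vr)?.transpose(1, 2)?;

    let diff = (flash.to_dtype(DType::F32)? - reference)?
        .abs()?
        .mean_all()?
        .to_scalar::<f32>()?;
    assert!(diff < 1e-2, "mean abs diff too large: {diff}");
    Ok(())
}
```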
1235aa2536 Use bail rather than wrapping a string where possible. (#249)
* Use bail rather than wrapping a string where possible.

* Revert the cuda default bit.
2023-07-26 15:42:46 +01:00
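The change above is purely stylistic: candle_core::bail! formats the message and returns the error in one step, instead of building an Error::Msg by hand. A hypothetical before/after:

```rust
use candle_core::{bail, Result};

// Hypothetical validation, just to show the two styles side by side.
fn validate_head_dim(head_dim: usize) -> Result<()> {
    if head_dim > 256 {
        // Before: return Err(candle_core::Error::Msg(format!("unsupported head_dim {head_dim}")));
        // After: bail! does the formatting and the early return in one line.
        bail!("unsupported head_dim {head_dim}")
    }
    Ok(())
}
```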
f052ba76cb Lining up the flash attn version with the non-flash one. (#248)
* Move the flash-attn function in the proper crate.

* Causality tweak.
2023-07-26 15:11:45 +01:00
2ce5f12513 Again set a few extra params in flash-attn. (#245)
* Again set a few extra params.

* Use the appropriate kernel sizes.

* Add all the kernel sizes.

* Parallel compiling.

* Reduce the amount of parallelism.

* Add the missing kernel.

* Fix a typo.

* Remove bf16 support for now.
2023-07-26 14:16:37 +01:00
fa2b64d678 Proper flash-attn parameters. (#244)
* Proper flash-attn parameters.

* Set the flash attention parameters.

* Add more validations.

* Setup the o_ flash attn parameters.

* More flash-attn support.

* Set more flash attn parameters.
2023-07-26 10:13:40 +01:00
d9f9c859af Add flash attention (#241)
* Add some flash-attn kernel, import the code for flash-attn v2 from Dao-AILab.

* More flash attn.

* Set up the flash attn parameters.

* Get things to compile locally.

* Move the flash attention files in a different directory.

* Build the static C library with nvcc.

* Add more flash attention.

* Update the build part.

* Better caching.

* Exclude flash attention from the default workspace.

* Put flash-attn behind a feature gate.

* Get the flash attn kernel to run.

* Move the flags to a more appropriate place.

* Enable flash attention in llama.

* Use flash attention in llama.
2023-07-26 07:48:10 +01:00
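The feature-gate and llama bullets above are the part most users touch: the kernels only build when the flash-attn cargo feature is enabled, and the model code dispatches on that feature. A rough sketch of that dispatch, assuming the flash_attn(q, k, v, softmax_scale, causal) entry point and a (batch, seq_len, num_heads, head_dim) layout (both to be checked against candle-flash-attn):

```rust
use candle_core::{Result, Tensor};

// Feature-gated dispatch: use the flash-attn kernel when the cargo feature is on,
// otherwise return an error. Signature and layout are assumptions to verify.
#[cfg(feature = "flash-attn")]
fn flash_attention(q: &Tensor, k: &Tensor, v: &Tensor, softmax_scale: f32, causal: bool) -> Result<Tensor> {
    candle_flash_attn::flash_attn(q, k, v, softmax_scale, causal)
}

#[cfg(not(feature = "flash-attn"))]
fn flash_attention(_q: &Tensor, _k: &Tensor, _v: &Tensor, _softmax_scale: f32, _causal: bool) -> Result<Tensor> {
    candle_core::bail!("compile with '--features flash-attn' to enable flash attention")
}
```

In practice the non-flash path would fall back to regular softmax(QKᵀ)·V attention rather than erroring; the stub above only keeps the sketch short.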