Commit Graph

14 Commits

SHA1 Message Date
8d1a57c9a0 chore: update flash attention kernels (#1518)
* chore: update flash attention kernels

* fmt

* remove unused kernels

* force f32

* correct stride
2024-01-05 18:28:55 +01:00
d2c3f14773 Fix for flash-attn. (#1310)
Co-authored-by: laurent <laurent@par2dc5-ai-prd-cl01dgx02.cm.cluster>
2023-11-10 10:27:27 +01:00
ab0d9fbdd1 Properly set the is_bf16 flag. (#738) 2023-09-04 16:45:26 +01:00
f80fd44201 BF16 support for flash-attn. (#737) 2023-09-04 16:35:43 +01:00
d0cdea95a5 Add back the bf16 flash-attn kernels. (#730) 2023-09-04 07:50:52 +01:00
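Taken together, the three bf16 entries above add a second kernel family and a flag that selects it from the input dtype; a minimal sketch of that dispatch (illustrative names, not the crate's actual internals):

```rust
use candle_core::{DType, Result, Tensor};

// Illustrative helper: derive an is_bf16 flag from the query dtype.
// The real kernel selection lives inside candle-flash-attn; this only sketches the idea.
fn is_bf16(q: &Tensor) -> Result<bool> {
    match q.dtype() {
        DType::F16 => Ok(false),
        DType::BF16 => Ok(true),
        dtype => candle_core::bail!("flash-attn expects f16 or bf16 inputs, got {dtype:?}"),
    }
}
```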
03be33eea4 Relax the requirements on CustomOp. (#486)
* Relax the requirements on CustomOp.

* Simplify the custom-ops when no backward is required.
2023-08-17 11:12:05 +01:00
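With the backward pass made optional, a forward-only custom op only needs a name and a CPU kernel. The sketch below assumes the CustomOp1 trait shape in candle_core at the time (a cpu_fwd over a CpuStorage/Layout pair, with cuda_fwd and bwd left to their defaults); treat the details as an approximation rather than the exact API.

```rust
use candle_core::{CpuStorage, CustomOp1, Layout, Result, Shape};

// A forward-only custom op (no bwd, no cuda_fwd): doubles an f32 tensor on CPU.
struct Double;

impl CustomOp1 for Double {
    fn name(&self) -> &'static str {
        "double"
    }

    fn cpu_fwd(&self, storage: &CpuStorage, layout: &Layout) -> Result<(CpuStorage, Shape)> {
        let slice = storage.as_slice::<f32>()?;
        let src = match layout.contiguous_offsets() {
            Some((start, end)) => &slice[start..end],
            None => candle_core::bail!("double: input must be contiguous"),
        };
        let dst: Vec<f32> = src.iter().map(|v| v * 2.0).collect();
        Ok((CpuStorage::F32(dst), layout.shape().clone()))
    }
}

// Usage (assumed entry point): let ys = xs.apply_op1(Double)?;
```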
67834119fc Fix the flash-attention function names. (#282) 2023-07-31 10:04:39 +01:00
0ace420e66 Flash attention without padding (varlen). (#281)
* Expose the seqlen variable for flash-attn without padding.

* Fix the batched call.

* Adapt for the varlen variant.

* No need to set the batch strides when in varlen mode.

* Add a test (disabled at the moment).

* Get the test to work properly.
2023-07-31 09:45:39 +01:00
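The varlen entry above is the padding-free path: sequences of different lengths are packed back to back and delimited by cumulative offsets instead of being padded to a common length. A rough sketch of such a call, assuming a flash_attn_varlen entry point along these lines (argument order and expected layout should be checked against candle-flash-attn):

```rust
use candle_core::{Result, Tensor};

// Hypothetical wrapper: q/k/v are packed as (total_tokens, num_heads, head_dim),
// here with two sequences of lengths 3 and 5, so the cumulative offsets are [0, 3, 8].
fn packed_attention(q: &Tensor, k: &Tensor, v: &Tensor) -> Result<Tensor> {
    let seqlens = Tensor::new(&[0u32, 3, 8], q.device())?;
    let max_seqlen = 5usize;
    let softmax_scale = 1f32 / (64f32).sqrt(); // assuming head_dim = 64
    candle_flash_attn::flash_attn_varlen(
        q,
        k,
        v,
        &seqlens, // cumulative sequence lengths for q
        &seqlens, // cumulative sequence lengths for k
        max_seqlen,
        max_seqlen,
        softmax_scale,
        true, // causal
    )
}
```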
4f92420132 Add some flash attn test (#253)
* Add some flash-attn test.

* Add the cpu test.

* Fail when the head is not a multiple of 8.

* Polish the flash attention test.
2023-07-26 20:56:00 +01:00
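A sketch of what a test in this spirit might check: run the kernel on small f16 inputs and compare against a plain softmax(QKᵀ)·V reference. This is an approximation of the idea, with assumed tensor layouts, not the repository's actual test.

```rust
#[cfg(feature = "flash-attn")]
#[test]
fn flash_attn_close_to_reference() -> candle_core::Result<()> {
    use candle_core::{DType, Device, Tensor, D};

    let dev = Device::new_cuda(0)?;
    // Small (batch=1, seq=4, heads=2, head_dim=8) inputs; the kernel wants f16/bf16.
    let q = Tensor::randn(0f32, 1.0, (1, 4, 2, 8), &dev)?.to_dtype(DType::F16)?;
    let k = Tensor::randn(0f32, 1.0, (1, 4, 2, 8), &dev)?.to_dtype(DType::F16)?;
    let v = Tensor::randn(0f32, 1.0, (1, 4, 2, 8), &dev)?.to_dtype(DType::F16)?;
    let scale = 1f32 / (8f32).sqrt();

    let flash = candle_flash_attn::flash_attn(&q, &k, &v, scale, false)?;

    // Naive reference in f32 on a (batch, heads, seq, head_dim) layout.
    let qr = q.to_dtype(DType::F32)?.transpose(1, 2)?;
    let kr = k.to_dtype(DType::F32)?.transpose(1, 2)?;
    let vr = v.to_dtype(DType::F32)?.transpose(1, 2)?;
    let att = (qr.matmul(&kr.t()?)? * scale as f64)?;
    let att = att.exp()?;
    let att = att.broadcast_div(&att.sum_keepdim(D::Minus1)?)?;
    let reference = att.matmul(&vr)?.transpose(1, 2)?;

    let diff = (flash.to_dtype(DType::F32)? - reference)?
        .abs()?
        .mean_all()?
        .to_scalar::<f32>()?;
    assert!(diff < 1e-2, "mean abs diff too large: {diff}");
    Ok(())
}
```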
1235aa2536 Use bail rather than wrapping a string where possible. (#249)
* Use bail rather than wrapping a string where possible.

* Revert the cuda default bit.
2023-07-26 15:42:46 +01:00
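The change above is purely stylistic: candle_core::bail! formats the message and returns the error in one step, instead of building an Error::Msg by hand. A hypothetical before/after:

```rust
use candle_core::{bail, Result};

// Hypothetical validation, just to show the two styles side by side.
fn validate_head_dim(head_dim: usize) -> Result<()> {
    if head_dim > 256 {
        // Before: return Err(candle_core::Error::Msg(format!("unsupported head_dim {head_dim}")));
        // After: bail! does the formatting and the early return in one line.
        bail!("unsupported head_dim {head_dim}")
    }
    Ok(())
}
```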
f052ba76cb Lining up the flash attn version with the non-flash one. (#248)
* Move the flash-attn function in the proper crate.

* Causality tweak.
2023-07-26 15:11:45 +01:00
2ce5f12513 Again set a few extra params in flash-attn. (#245)
* Again set a few extra params.

* Use the appropriate kernel sizes.

* Add all the kernel sizes.

* Parallel compiling.

* Reduce the amount of parallelism.

* Add the missing kernel.

* Fix a typo.

* Remove bf16 support for now.
2023-07-26 14:16:37 +01:00
fa2b64d678 Proper flash-attn parameters. (#244)
* Proper flash-attn parameters.

* Set the flash attention parameters.

* Add more validations.

* Setup the o_ flash attn parameters.

* More flash-attn support.

* Set more flash attn parameters.
2023-07-26 10:13:40 +01:00
d9f9c859af Add flash attention (#241)
* Add some flash-attn kernel, import the code for flash-attn v2 from Dao-AILab.

* More flash attn.

* Set up the flash attn parameters.

* Get things to compile locally.

* Move the flash attention files in a different directory.

* Build the static C library with nvcc.

* Add more flash attention.

* Update the build part.

* Better caching.

* Exclude flash attention from the default workspace.

* Put flash-attn behind a feature gate.

* Get the flash attn kernel to run.

* Move the flags to a more appropriate place.

* Enable flash attention in llama.

* Use flash attention in llama.
2023-07-26 07:48:10 +01:00
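The feature-gate and llama bullets above are the part most users touch: the kernels only build when the flash-attn cargo feature is enabled, and the model code dispatches on that feature. A rough sketch of that dispatch, assuming the flash_attn(q, k, v, softmax_scale, causal) entry point and a (batch, seq_len, num_heads, head_dim) layout (both to be checked against candle-flash-attn):

```rust
use candle_core::{Result, Tensor};

// Feature-gated dispatch: use the flash-attn kernel when the cargo feature is on,
// otherwise return an error. Signature and layout are assumptions to verify.
#[cfg(feature = "flash-attn")]
fn flash_attention(q: &Tensor, k: &Tensor, v: &Tensor, softmax_scale: f32, causal: bool) -> Result<Tensor> {
    candle_flash_attn::flash_attn(q, k, v, softmax_scale, causal)
}

#[cfg(not(feature = "flash-attn"))]
fn flash_attention(_q: &Tensor, _k: &Tensor, _v: &Tensor, _softmax_scale: f32, _causal: bool) -> Result<Tensor> {
    candle_core::bail!("compile with '--features flash-attn' to enable flash attention")
}
```

In practice the non-flash path would fall back to regular softmax(QKᵀ)·V attention rather than erroring; the stub above only keeps the sketch short.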