3ad4770eb6
Use cat for faster MQA computation. (#2043)
* Use cat for faster MQA computation.
* Move the function to utils + use it in mistral.
* Use the shared repeat-kv in a few more models.
* Fix.
2024-04-12 09:15:10 +02:00
196765e995
Use the new rope kernel in mistral. (#1937)
* Use the new rope kernel in mistral.
* Compute the cos and sin with full precision.
* Bugfix.
2024-03-25 23:26:05 +01:00
e2b4829531
Support more mistral models. (#1927)
* Support more mistral models.
* Use the appropriate rope parameter.
2024-03-24 08:04:04 +01:00
6f877592a7
Avoid broadcasting on the batch dimension for the attention mask. (#1920)
2024-03-23 13:08:53 +01:00
c0bdd9c7a6
Use the fast RmsNorm in the quantized model. (#1904)
2024-03-21 18:49:35 +01:00
f6408a3779
feat: add clear_kv_cache to mistral and qmistral models (#1464)
2023-12-21 21:19:19 +01:00
902d0b9166
More model cloning. (#1126)
* More model cloning.
* More cloning on quantized models.
2023-10-18 21:55:46 +01:00
392fe02fba
Move the common quantized-nn code to a shared module. (#1063)
2023-10-09 06:22:22 +01:00
deee7612da
Quantized version of mistral. (#1009)
* Quantized version of mistral.
* Integrate the quantized mistral variant.
* Use the quantized weight files.
* Tweak the quantization command.
* Fix the dtype when computing the rotary embeddings.
* Update the readme with the quantized version.
* Fix the decoding of the remaining tokens.
2023-09-30 18:25:47 +01:00