Cuda acceleration for quantized model. (#1754)

mirror of https://github.com/huggingface/candle.git synced 2025-06-15 18:28:24 +00:00

* Boilerplate for the quantized cuda support.

* More basic cuda support.

* More cuda quantization (quantize on cpu for now).

* Add the dequantization bit.

* Start adding some dedicated cuda kernels from llama.cpp.

* Move the kernel code.

* Start interfacing with the kernel.

* Tweak the kernel launch params.

* Bugfix for quantized metal.

* Fix some clippy lints.

* Tweak the launch parameters.

* Tweak cuda basics to perform a quantized matmul.

* Perform the dequantization on the cpu + use cublas for matmul.

* Add the dequantization kernel.

* Test the qmatmul.

* More kernels.

* Matmul-vec kernel.

* Add a couple kernels.

* More dequantization kernels.
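
The "dequantize, then matmul" step mentioned in the list above can be illustrated on the CPU with a minimal sketch. This is not the code from this commit: `BlockQ4`, `dequantize`, and `qmatvec` are hypothetical names. The block layout is modeled on ggml's Q4_0 format (32 weights per block, 4-bit values stored with an offset of 8), with the simplifying assumption that the scale is an `f32` rather than an `f16`.

```rust
/// Number of weights per block (32, as in ggml's Q4_0 layout).
const QK: usize = 32;

/// Hypothetical, simplified 4-bit block: one scale plus 32 values packed
/// two per byte. Assumption: the scale is f32 here to avoid an f16 dependency.
#[derive(Clone, Debug)]
pub struct BlockQ4 {
    pub d: f32,
    pub qs: [u8; QK / 2],
}

/// Expand quantized blocks into dense f32 values.
/// Low nibbles hold the first 16 weights of a block, high nibbles the last 16;
/// each stored nibble is offset by 8 so it can represent negative values.
pub fn dequantize(blocks: &[BlockQ4]) -> Vec<f32> {
    let mut out = vec![0f32; blocks.len() * QK];
    for (i, b) in blocks.iter().enumerate() {
        for j in 0..QK / 2 {
            let lo = (b.qs[j] & 0x0F) as i32 - 8;
            let hi = (b.qs[j] >> 4) as i32 - 8;
            out[i * QK + j] = lo as f32 * b.d;
            out[i * QK + j + QK / 2] = hi as f32 * b.d;
        }
    }
    out
}

/// Matrix-vector product for a row-major quantized matrix of shape (rows, cols):
/// dequantize the weights first, then run an ordinary dense dot product per row.
pub fn qmatvec(w: &[BlockQ4], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(cols % QK, 0, "cols must be a multiple of the block size");
    assert_eq!(w.len(), rows * cols / QK, "unexpected number of blocks");
    assert_eq!(x.len(), cols, "input vector length must equal cols");
    let dense = dequantize(w);
    (0..rows)
        .map(|r| {
            dense[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}
```

A dedicated matmul-vec kernel would typically fuse the two steps and never materialize the dense weights; the sketch keeps them separate only to make the data layout easy to follow.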

Author: Laurent Mazare
Date: 2024-02-25 18:11:47 +01:00
Committed by: GitHub
Parent: 8d04f70f4d
Commit: 2f22afd80e

11 changed files with 1996 additions and 69 deletions

									

candle-examples/examples/custom-ops/cuda_kernels.rs
									
@@ -0,0 +1 @@
+pub const LAYERNORM_KERNELS: &str = include_str!(concat!(env!("OUT_DIR"), "/layernorm_kernels.ptx"));

