109e95b189
Basic qmatmul parallelization ( #492 )
* Basic `par_iter` parallelization
* Pass errors up
* Disable `avx` for x86 macs
2023-08-18 09:45:37 +01:00
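The `par_iter` change parallelizes the quantized matmul over output rows with rayon. A minimal sketch of the idea on a plain row-major `f32` matrix (the real kernel works on quantized blocks; `matvec_par` is an illustrative name, not candle's API):

```rust
use rayon::prelude::*;

/// Multiply an `m x k` row-major matrix by a `k`-vector, one output row per task.
fn matvec_par(a: &[f32], x: &[f32], m: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(x.len(), k);
    a.par_chunks_exact(k) // each chunk is one row of `a`, processed on the rayon pool
        .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum())
        .collect()
}

fn main() {
    let (m, k) = (4, 3);
    let a: Vec<f32> = (0..m * k).map(|i| i as f32).collect();
    let x = vec![1.0f32; k];
    println!("{:?}", matvec_par(&a, &x, m, k));
}
```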
c78ce76501
Add a simple Module trait and implement it for the various nn layers ( #500 )
* Start adding the module trait.
* Use the module trait.
* Implement module for qmatmul.
2023-08-18 09:38:22 +01:00
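The module trait boils down to a single `forward` method that every layer implements, so layers can be used interchangeably. A self-contained sketch with stand-in `Tensor`/`Result` types rather than candle's own:

```rust
// Illustrative stand-ins for candle's Tensor and Result types.
type Tensor = Vec<f32>;
type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

/// A layer is anything that maps an input tensor to an output tensor.
trait Module {
    fn forward(&self, xs: &Tensor) -> Result<Tensor>;
}

/// Example layer: elementwise scale, standing in for Linear, QMatMul, etc.
struct Scale(f32);

impl Module for Scale {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        Ok(xs.iter().map(|x| x * self.0).collect())
    }
}

fn main() -> Result<()> {
    let layer = Scale(2.0);
    println!("{:?}", layer.forward(&vec![1.0, 2.0, 3.0])?);
    Ok(())
}
```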
13401df4d1
Add an abstract type for RmsNorm. ( #499 )
2023-08-18 08:52:14 +01:00
a22b1bed7b
Tensor -> QTensor conversion ( #496 )
* Sketch some qmatmul test.
* Add the quantization function.
* More testing.
* Make the test smaller and faster.
* Add some shape checking.
2023-08-18 08:19:20 +01:00
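The conversion needs a quantization function plus a test that the round trip stays within one quantization step. A toy version with a single per-tensor scale (ggml formats use one scale per block of 32 or 256 values); the function names here are illustrative:

```rust
/// Toy symmetric 8-bit quantization: one scale for the whole tensor.
fn quantize(xs: &[f32]) -> (f32, Vec<i8>) {
    let amax = xs.iter().fold(0f32, |m, x| m.max(x.abs()));
    let scale = if amax == 0.0 { 1.0 } else { amax / 127.0 };
    let qs = xs.iter().map(|x| (x / scale).round() as i8).collect();
    (scale, qs)
}

fn dequantize(scale: f32, qs: &[i8]) -> Vec<f32> {
    qs.iter().map(|&q| q as f32 * scale).collect()
}

fn main() {
    let xs: Vec<f32> = (0..32).map(|i| (i as f32 - 16.0) / 4.0).collect();
    let (scale, qs) = quantize(&xs);
    let ys = dequantize(scale, &qs);
    // The round-trip error should stay within half a quantization step.
    let max_err = xs.iter().zip(&ys).map(|(a, b)| (a - b).abs()).fold(0f32, f32::max);
    assert!(max_err <= scale * 0.5 + 1e-6, "max_err {max_err}");
}
```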
26fd37b348
Use the main branch of the HF repo where possible. ( #498 )
* Use the main branch of the HF repo where possible.
* And add the large model.
2023-08-18 08:18:30 +01:00
f056dcab21
Add medium model ( #497 )
2023-08-18 08:08:59 +01:00
557b2c28dd
Q6K quantization ( #495 )
* Print the detected arch options.
* Add the q6k quantization.
* Add a currently broken test.
* Bugfix.
* Bugfix.
* Another bugfix.
* Another bugfix + get the test to work.
2023-08-17 22:22:57 +01:00
fc81af1712
AVX version of the q6k vec-dot. ( #493 )
* AVX version of the q6k vec-dot.
* Use the avx sum.
2023-08-17 20:13:18 +01:00
3164cd24fa
Replicate the sot-token logic from the Python implementation more accurately. ( #491 )
* Replicate the sot-token logic from the Python implementation more accurately.
* Add a flag to control the timestamp mode.
2023-08-17 16:59:36 +01:00
5f30c1e1e0
Add the whisper small model. ( #490 )
2023-08-17 15:48:34 +01:00
ad7c53953b
Add a verbose-prompt mode, similar to llama.cpp. ( #489 )
2023-08-17 15:26:44 +01:00
5d99026fd2
F16 support for stable diffusion ( #488 )
* F16 support for stable diffusion.
* Keep the attention bits in F32.
* Keep more of the attention bits in F32.
* More mixed precision support.
2023-08-17 13:48:56 +01:00
c3176f0dfb
Flash-attention support in stable diffusion ( #487 )
* Add flash-attention for the stable-diffusion example.
* Change the dtype.
* Silly fix.
* Another fix.
* Revert the dtype back to the query dtype after applying flash-attn.
2023-08-17 12:16:40 +01:00
03be33eea4
Relax the requirements on CustomOp. ( #486 )
* Relax the requirements on CustomOp.
* Simplify the custom-ops when no backward is required.
2023-08-17 11:12:05 +01:00
d32e8199cd
Layer norm tweaks ( #482 )
* Add some options to make layer-norm more configurable.
* Add the rms-norm variant.
* Replace the RmsNorm with the shared bits.
2023-08-17 10:07:13 +01:00
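The rms-norm variant only rescales by the root-mean-square of the input, with no mean subtraction and no bias. A scalar sketch of the formula, independent of candle's tensor API:

```rust
/// RMS norm over one vector: y_i = x_i / sqrt(mean(x^2) + eps) * w_i.
/// Unlike layer norm it neither subtracts the mean nor adds a bias.
fn rms_norm(xs: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = xs.iter().map(|x| x * x).sum::<f32>() / xs.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    xs.iter().zip(weight).map(|(x, w)| x * inv_rms * w).collect()
}

fn main() {
    let xs = [1.0f32, -2.0, 3.0, -4.0];
    let w = [1.0f32; 4];
    println!("{:?}", rms_norm(&xs, &w, 1e-5));
}
```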
d99cac3ec3
Move the avx specific bits to a separate file. ( #481 )
2023-08-17 09:01:06 +01:00
f708efb19c
Add some accelerate details on the readme. ( #480 )
2023-08-17 08:26:02 +01:00
306c8eee7a
AVX version of the vecdot for q4_0. ( #474 )
* AVX version of the vecdot for q4_0.
* Tweak the avx bits.
* Add a qmatmul benchmark.
* Fix the quantized test.
2023-08-17 07:03:32 +01:00
098909de40
Add vecdot for q6k-q8k. ( #476 )
* Add vecdot for q6k-q8k.
* Add some testing for q8k.
* Use QMatMul for the output layer.
2023-08-16 20:59:40 +01:00
3bedba1fce
Use a zipped iterator. ( #475 )
* Use a zipped iterator.
* Add to/from float for q8k.
2023-08-16 20:15:11 +01:00
c5f45887dc
Add some tracing to the quantized example. ( #473 )
2023-08-16 18:49:08 +01:00
fa4590d7fd
Merge pull request #469 from huggingface/fix_llama_v1
Fixing llamav1
2023-08-16 17:47:40 +02:00
2e206e269d
Add the model argument. ( #471 )
2023-08-16 16:41:06 +01:00
575e88a999
Add a quantized test that uses negative values. ( #470 )
* Add a quantized test that uses negative values.
* Add a default tokenizer.
2023-08-16 16:32:58 +01:00
a9101700b6
Add a kv-cache to the quantized llama example. ( #466 )
* Add a kv-cache to the quantized llama example.
* Also print the prompt.
* Bugfix in q6k dequantizing.
* Another bugfix.
2023-08-16 14:28:42 +01:00
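The kv-cache stores the keys and values of already-processed tokens so each decoding step only computes them for the newest token and attends over the accumulated history. A minimal sketch of the bookkeeping, with plain `Vec`s standing in for tensors concatenated along the sequence dimension:

```rust
/// Per-layer key/value cache: new time steps are appended so that per-token
/// work is done once and reused by every later decoding step.
struct KvCache {
    k: Vec<Vec<f32>>, // one entry per generated position
    v: Vec<Vec<f32>>,
}

impl KvCache {
    fn new() -> Self {
        Self { k: Vec::new(), v: Vec::new() }
    }
    /// Append this step's key/value and return views over the full history.
    fn append(&mut self, k: Vec<f32>, v: Vec<f32>) -> (&[Vec<f32>], &[Vec<f32>]) {
        self.k.push(k);
        self.v.push(v);
        (&self.k, &self.v)
    }
}

fn main() {
    let mut cache = KvCache::new();
    for step in 0..3 {
        let (ks, _vs) = cache.append(vec![step as f32; 4], vec![step as f32; 4]);
        println!("step {step}: attending over {} cached positions", ks.len());
    }
}
```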
102fa4c2e3
Fixing llamav1
2023-08-16 14:53:29 +02:00
3071134788
Get the ggml based llama to generate some text. ( #464 )
* Add more stats to the ggml example.
* Build a quantized model from the file content.
* Move the tensor retrieval in the main crate.
* Start adding the forward pass.
* Add more to the forward pass of the quantized llama.
* Apply the attention layers.
* Add the sampling loop.
* Get the sampling loop to work.
* Minor tweak.
* Add a quantize/dequantize test.
* Bugfix.
* Add a comment + swap the order.
* Bugfixes.
2023-08-16 12:41:07 +01:00
fec87e86f5
Merge pull request #465 from huggingface/llama_hub_config
Using the real config from the hub when available.
2023-08-16 13:28:59 +02:00
33c882ea74
Clippy.
2023-08-16 10:41:00 +02:00
76804730c6
Using the real config from the hub when available.
2023-08-16 10:36:01 +02:00
965597a873
Add a test for qmatmul. ( #459 )
2023-08-16 06:36:27 +01:00
ca449f9ee1
Add quantized tensors. ( #458 )
* Add quantized tensors.
* Implement the debug trait for QTensor.
* Add the QMatMul custom op.
2023-08-15 22:45:53 +01:00
b8263aa15c
Quantized support for f16 and f32 ( #457 )
* Add f32 as a quantized type.
* Add f16 as a quantized type too.
2023-08-15 21:09:37 +01:00
e68b2accb4
Split out the quantized file. ( #456 )
2023-08-15 20:26:27 +01:00
08effe3762
More quantization support ( #455 )
* Properly initialize wdata.
* Simplify the matmul bits.
* Add from_float for q4_0.
* Fix a couple bugs.
* Get the test to work.
* Get clippy to be happy.
2023-08-15 18:58:04 +01:00
8ad4a21ffc
Add a basic optimizer example. ( #454 )
2023-08-15 17:19:18 +01:00
5e49922be2
Basic quantization support ( #453 )
* Add a vecdot trait.
* Start implementing mul_mat.
* Add to the mul mat implementation.
* Add q8_0 quantization.
* Implement the GgmlType trait for all types.
* Add the missing block.
* Add a TODO.
2023-08-15 15:53:19 +01:00
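With q8_0-style blocks, the matmul inner loop becomes integer multiply-accumulates plus one rescale per pair of blocks. A simplified sketch (ggml's real block stores the scale as f16, and the vec-dot trait also covers mixed pairs such as q4_0 with q8_0 activations):

```rust
const BLOCK: usize = 32;

/// A q8_0-style block: 32 signed bytes plus one scale.
/// (f32 scale here to keep the sketch dependency-free.)
struct BlockQ8 {
    scale: f32,
    qs: [i8; BLOCK],
}

fn quantize_block(xs: &[f32; BLOCK]) -> BlockQ8 {
    let amax = xs.iter().fold(0f32, |m, x| m.max(x.abs()));
    let scale = if amax == 0.0 { 1.0 } else { amax / 127.0 };
    let mut qs = [0i8; BLOCK];
    for (q, x) in qs.iter_mut().zip(xs) {
        *q = (x / scale).round() as i8;
    }
    BlockQ8 { scale, qs }
}

/// Dot product of two quantized blocks: integer multiply-accumulate,
/// then a single rescale by the two block scales.
fn vec_dot(a: &BlockQ8, b: &BlockQ8) -> f32 {
    let acc: i32 = a.qs.iter().zip(&b.qs).map(|(&x, &y)| x as i32 * y as i32).sum();
    acc as f32 * a.scale * b.scale
}

fn main() {
    let xs: [f32; BLOCK] = std::array::from_fn(|i| i as f32 / 8.0);
    let (a, b) = (quantize_block(&xs), quantize_block(&xs));
    println!("quantized dot = {}", vec_dot(&a, &b));
}
```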
ebcfd96d94
add c++17 flags ( #452 )
2023-08-15 15:29:34 +01:00
5b1690fffa
Tweak the llama example. ( #450 )
2023-08-15 12:18:20 +01:00
3cc87058b7
Support local weights & dynamic outputs ( #447 )
* Support local weights & dynamic outputs
* Revise as suggested
* Cargo code format
2023-08-15 11:51:57 +01:00
531f23b4d0
Rename vec-dot to vec-ops. ( #449 )
* Rename vec-dot to vec-ops.
* Also bump the crate version.
* Add a currently empty readme.
2023-08-15 10:48:57 +01:00
495e0b7580
Simd support ( #448 )
* Import the simd intrinsics in candle-core.
* simd version of reduce-sum.
* Bugfix.
* Fix some clippy lints.
2023-08-15 09:50:38 +01:00
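A reduce-sum is the simplest place to start with the imported intrinsics: accumulate eight f32 lanes at a time, then fold the lanes and the scalar tail. A hedged sketch, not the crate's actual kernel:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Sum a slice 8 lanes at a time. The caller must have checked
/// `is_x86_feature_detected!("avx")` before calling.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn sum_avx(xs: &[f32]) -> f32 {
    let chunks = xs.chunks_exact(8);
    let tail = chunks.remainder();
    let mut acc = _mm256_setzero_ps();
    for chunk in chunks {
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(chunk.as_ptr()));
    }
    // Horizontal reduction of the 8 accumulator lanes, plus the scalar tail.
    let mut lanes = [0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    lanes.iter().sum::<f32>() + tail.iter().sum::<f32>()
}

fn main() {
    let xs: Vec<f32> = (0..1000).map(|i| i as f32).collect();
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx") {
        println!("{}", unsafe { sum_avx(&xs) });
        return;
    }
    println!("{}", xs.iter().sum::<f32>());
}
```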
90374097dc
Cudnn support ( #445 )
* Add a cudnn feature to be used for conv2d.
* Allocate the proper workspace.
* Only create a single cudnn handle per cuda device.
* Proper cudnn usage.
* Bugfix.
2023-08-14 21:30:41 +01:00
c84883ecf2
Add a cuda kernel for upsampling. ( #441 )
* Add a cuda kernel for upsampling.
* Update for the latest tokenizers version.
2023-08-14 13:12:17 +01:00
a094dc503d
Add a cuda kernel for avg-pool2d. ( #440 )
* Add a cuda kernel for avg-pool2d.
* Avoid running out of bounds.
* Finish wiring the avg pool kernel + add some testing.
* Support for max-pool + testing.
2023-08-14 12:32:05 +01:00
34f4b3187e
Add a naive conv2d cuda kernel. ( #438 )
* Add a naive conv2d cuda kernel.
* Proper conv2d support on the rust side.
* Conv1d testing on gpu.
* Also use the test on gpus.
* Fix the clean-ptx target.
2023-08-14 10:34:42 +01:00
eab54e4490
Fix the tests for mkl. ( #437 )
2023-08-14 08:09:27 +01:00
9e7e6e0288
Add dequantization for ggml's `q4_0`, `q4_1`, `q5_0`, `q5_1` and `q8_0` ( #407 )
* Added dequantization for `q4_0`, `q4_1`, `q5_0`, `q5_1` and `q8_0`
* expose `tensor_from_ggml` for external usage
* bugfixes & example
2023-08-13 23:22:57 +01:00
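q4_0 packs 32 weights into 16 bytes (two 4-bit quants per byte) plus one scale, so dequantization is just unpacking nibbles and rescaling. A sketch with an f32 scale standing in for ggml's f16:

```rust
const QK4_0: usize = 32;

/// One q4_0 block: a scale plus 32 unsigned 4-bit quants packed two per byte.
#[allow(non_camel_case_types)]
struct BlockQ4_0 {
    d: f32,
    qs: [u8; QK4_0 / 2],
}

/// Dequantize one block: value = (nibble - 8) * d.
/// Low nibbles hold the first 16 values, high nibbles the last 16.
fn dequantize_q4_0(block: &BlockQ4_0) -> [f32; QK4_0] {
    let mut out = [0f32; QK4_0];
    for (j, &byte) in block.qs.iter().enumerate() {
        out[j] = ((byte & 0x0f) as i32 - 8) as f32 * block.d;
        out[j + QK4_0 / 2] = ((byte >> 4) as i32 - 8) as f32 * block.d;
    }
    out
}

fn main() {
    let block = BlockQ4_0 { d: 0.5, qs: [0x8fu8; 16] }; // low nibble 15, high nibble 8
    let ys = dequantize_q4_0(&block);
    println!("{:?} ... {:?}", &ys[..4], &ys[16..20]); // 3.5s, then 0.0s
}
```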
8bd2b22b33
Optimize the logit computations in the whisper example. ( #434 )
2023-08-13 22:00:13 +01:00
d379a76a9e
Add a softmax bench. ( #433 )
* Add a softmax bench.
* Add the vectorized sum reduce.
2023-08-13 20:09:18 +01:00
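The softmax being benchmarked is dominated by an exp pass and a sum reduction, which is where the vectorized sum reduce helps. The numerically stable scalar version for reference:

```rust
/// Numerically stable softmax: subtract the max before exponentiating so the
/// exponentials stay in range, then normalize by the sum.
fn softmax(xs: &[f32]) -> Vec<f32> {
    let max = xs.iter().fold(f32::NEG_INFINITY, |m, &x| m.max(x));
    let exps: Vec<f32> = xs.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum(); // this reduction is what the vectorized sum speeds up
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    println!("{:?}", softmax(&[1.0, 2.0, 3.0]));
}
```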