Commit Graph

460 Commits

Author SHA1 Message Date
607ffb9f1e Retrieve more information from PyTorch checkpoints. (#515)
* Retrieve more information from PyTorch checkpoints.

* Add enough support to load dino-v2 backbone weights.
2023-08-19 15:05:34 +01:00
f861a9df6e Add ggml support to tensor-tools (#512)
* Pickle work-in-progress.

* More unpickling.

* More pickling.

* Proper handling of setitems.

* Clippy.

* Again more pickling.

* Restore the example.

* Add enough pickle support to get the list of tensors.

* Read the data from zip files.

* Retrieve the tensor shape.

* Extract the size and dtype.

* More storage types.

* Improve the destructuring.

* Also support ggml files.
2023-08-19 11:45:22 +01:00
ad33715c61 Preliminary support for importing PyTorch weights. (#511)
* Pickle work-in-progress.

* More unpickling.

* More pickling.

* Proper handling of setitems.

* Clippy.

* Again more pickling.

* Restore the example.

* Add enough pickle support to get the list of tensors.

* Read the data from zip files.

* Retrieve the tensor shape.

* Extract the size and dtype.

* More storage types.

* Improve the destructuring.
2023-08-19 11:26:32 +01:00
90ff04e77e Add the tensor-tools binary. (#510) 2023-08-19 09:06:44 +01:00
cb069d6063 Add the permute op (similar to pytorch). (#504)
* Add the permute op (similar to pytorch).

* Add the backprop for dimension permutation.
2023-08-18 16:30:53 +01:00
95462c6a2e Add a vision transformer example (dino-v2). (#502)
* Add a vision transformer example (dino-v2).

* Add some documentation + test.

* CI fix.

* Another fix (still unable to replicate the errors locally :( )
2023-08-18 11:58:06 +01:00
109e95b189 Basic qmatmul parallelization (#492)
* Basic `par_iter` parallelization

* Pass errors up

* Disable `avx` for x86 macs
2023-08-18 09:45:37 +01:00
c78ce76501 Add a simple Module trait and implement it for the various nn layers (#500)
* Start adding the module trait.

* Use the module trait.

* Implement module for qmatmul.
2023-08-18 09:38:22 +01:00
a22b1bed7b Tensor -> QTensor conversion (#496)
* Sketch some qmatmul test.

* Add the quantization function.

* More testing.

* Make the test smaller and faster.

* Add some shape checking.
2023-08-18 08:19:20 +01:00
557b2c28dd Q6K quantization (#495)
* Print the detected arch options.

* Add the q6k quantization.

* Add a currently broken test.

* Bugfix.

* Bugfix.

* Another bugfix.

* Another bugfix + get the test to work.
2023-08-17 22:22:57 +01:00
fc81af1712 AVX version of the q6k vec-dot. (#493)
* AVX version of the q6k vec-dot.

* Use the avx sum.
2023-08-17 20:13:18 +01:00
03be33eea4 Relax the requirements on CustomOp. (#486)
* Relax the requirements on CustomOp.

* Simplify the custom-ops when no backward is required.
2023-08-17 11:12:05 +01:00
d99cac3ec3 Move the avx specific bits to a separate file. (#481) 2023-08-17 09:01:06 +01:00
306c8eee7a AVX version of the vecdot for q4_0. (#474)
* AVX version of the vecdot for q4_0.

* Tweak the avx bits.

* Add a qmatmul benchmark.

* Fix the quantized test.
2023-08-17 07:03:32 +01:00
098909de40 Add vecdot for q6k-q8k. (#476)
* Add vecdot for q6k-q8k.

* Add some testing for q8k.

* Use QMatMul for the output layer.
2023-08-16 20:59:40 +01:00
3bedba1fce Use a zipped iterator. (#475)
* Use a zipped iterator.

* Add to/from float for q8k.
2023-08-16 20:15:11 +01:00
575e88a999 Add a quantized test that use negative values. (#470)
* Add a quantized test that use negative values.

* Add a default tokenizer.
2023-08-16 16:32:58 +01:00
a9101700b6 Add a kv-cache to the quantized llama example. (#466)
* Add a kv-cache to the quantized llama example.

* Also print the prompt.

* Bugfix in q6k dequantizing.

* Another bugfix.
2023-08-16 14:28:42 +01:00
3071134788 Get the ggml based llama to generate some text. (#464)
* Add more stats to the ggml example.

* Build a quantized model from the file content.

* Move the tensor retrieval in the main crate.

* Start adding the forward pass.

* Add more to the forward pass of the quantized llama.

* Apply the attention layers.

* Add the sampling loop.

* Get the sampling loop to work.

* Minor tweak.

* Add a quantize/dequantize test.

* Bugfix.

* Add a comment + swap the order.

* Bugfixes.
2023-08-16 12:41:07 +01:00
965597a873 Add a test for qmatmul. (#459) 2023-08-16 06:36:27 +01:00
ca449f9ee1 Add quantized tensors. (#458)
* Add quantized tensors.

* Implement the debug trait for QTensor.

* Add the QMatMul custom op.
2023-08-15 22:45:53 +01:00
b8263aa15c Quantized support for f16 and f32 (#457)
* Add f32 as a quantized type.

* Add f16 as a quantized type too.
2023-08-15 21:09:37 +01:00
e68b2accb4 Split out the quantized file. (#456) 2023-08-15 20:26:27 +01:00
08effe3762 More quantization support (#455)
* Properly initialize wdata.

* Simplify the matmul bits.

* Add from_float for q4_0.

* Fix a couple bugs.

* Get the test to work.

* Get clippy to be happy.
2023-08-15 18:58:04 +01:00
5e49922be2 Basic quantization support (#453)
* Add a vecdot trait.

* Start implementing mul_mat.

* Add to the mul mat implementation.

* Add q8_0 quantization.

* Implement the GgmlType trait for all types.

* Add the missing block.

* Add a TODO.
2023-08-15 15:53:19 +01:00
531f23b4d0 Rename vec-dot to vec-ops. (#449)
* Rename vec-dot to vec-ops.

* Also bump the crate version.

* Add a currently empty readme.
2023-08-15 10:48:57 +01:00
495e0b7580 Simd support (#448)
* Import the simd intrinsics in candle-core.

* simd version of reduce-sum.

* Bugfix.

* Fix some clippy lints.
2023-08-15 09:50:38 +01:00
90374097dc Cudnn support (#445)
* Add a cudnn feature to be used for conv2d.

* Allocate the proper workspace.

* Only create a single cudnn handle per cuda device.

* Proper cudnn usage.

* Bugfix.
2023-08-14 21:30:41 +01:00
c84883ecf2 Add a cuda kernel for upsampling. (#441)
* Add a cuda kernel for upsampling.

* Update for the latest tokenizers version.
2023-08-14 13:12:17 +01:00
a094dc503d Add a cuda kernel for avg-pool2d. (#440)
* Add a cuda kernel for avg-pool2d.

* Avoid running out of bounds.

* Finish wiring the avg pool kernel + add some testing.

* Support for max-pool + testing.
2023-08-14 12:32:05 +01:00
34f4b3187e Add a naive conv2d cuda kernel. (#438)
* Add a naive conv2d cuda kernel.

* Proper conv2d support on the rust side.

* Conv1d testing on gpu.

* Also use the test on gpus.

* Fix the clean-ptx target.
2023-08-14 10:34:42 +01:00
9e7e6e0288 Add dequantization for ggmls q4_0, q4_1, q5_0, q5_1 and q8_0 (#407)
* Added dequantization for `q4_0`, `q4_1`, `q5_0`, `q5_1` and `q8_0`

* expose `tensor_from_ggml` for external usage

* bugfixes & example
2023-08-13 23:22:57 +01:00
d379a76a9e Add a softmax bench. (#433)
* Add a softmax bench.

* Add the vectorized sum reduce.
2023-08-13 20:09:18 +01:00
9af438ac1b Track the conv2d operations in stable-diffusion. (#431)
* Track the conv2d operations in stable-diffusion.

* Add more tracing to stable-diffusion.

* Also trace the resnet bits.

* Trace the attention blocks.

* Also trace the attention inner part.

* Small tweak.
2023-08-13 15:58:26 +01:00
5a63b51f14 Add a matmul benchmark. (#429) 2023-08-13 13:41:03 +01:00
9aca398a4f More accelerate optimizations (#427)
* Add more tracing to the whisper example.

* Support accelerate in more examples.

* Use accelerate for pointwise functions.

* Use accelerate for binary operations too.

* Bugfix for binary operation: use the rhs before the lhs.
2023-08-13 12:53:34 +01:00
16b89f5b83 fix: can directly save the loaded weights (#421) 2023-08-12 16:33:29 +01:00
e12372021b Expose the tensor write-bytes function. (#412) 2023-08-11 17:13:42 +01:00
01ea57da8c Fix the conv tests. (#409) 2023-08-11 14:59:54 +01:00
662db45fc3 Use zero padding in conv1d and conv2d (same as pytorch). (#408) 2023-08-11 14:53:05 +01:00
e29c7809ec Parallelise the CPU kernels for the conv ops. (#401)
* Parallelise the conv2d op.

* Tighter control on threading.

* Also parallelise conv1d.

* Add some safety comment.
2023-08-11 05:51:58 +01:00
a325c1aa50 Upsample test + bugfix. (#399) 2023-08-10 21:02:35 +02:00
94eff56aee Optimize the cpu conv2d kernel (#396)
* Conv2d simd optimization.

* Fix the contiguous copying.

* Small tweak.
2023-08-10 17:40:09 +01:00
ff53f38467 Small example for benchmarking some cpu ops (#394)
* Refactor the benchmark example.

* Rename the example.

* Add some comments.
2023-08-10 17:00:17 +01:00
c8039579a5 Conv1d optimize (#392)
* Reorder the conv1d loops in the cpu backend.

* Optimize the 1d convolution.

* Conv1D optimize.

* Fix some clippy lints.
2023-08-10 15:23:52 +01:00
f3fe730a30 Npy tweaks & error with path (#384)
* Simplify the npy writing.

* Wrap the file path so as to provide better errors.
2023-08-10 06:21:58 +01:00
c7f92f985e Further randn tweaks: use the appropriate rng rather than the f64 one, some cleanup. (#383) 2023-08-10 05:48:19 +01:00
Lei
3bbc08a8df Fix randn cpu (#382)
* Change distributions

Standard generates in [0, 1), Normal is correct.

* Add test

Not sure if this is the best place to put  the test

* Remove unnecessary use
2023-08-10 05:33:44 +01:00
25ec2d9f6b fix: remove incorrect unwrap (#379) 2023-08-09 21:45:24 +01:00
fcfdcbd337 Add a conv1d benchmark based on the whisper sizes. (#377)
* Add a conv1d benchmark based on the whisper sizes.

* Enforce the batch-dim in conv1d.
2023-08-09 20:27:03 +01:00