* Optimize Tensor::new when called on nested Vec<..>.
* Improve performance.
* Similar flattening for the 4d case.
* More tweaks.
* Add some dummy test.
* Support for (un)-batched rope.
* Use 3d rope in the rope/ropei/rope_thd functions.
* Get the CPU versions to work.
* Fix the cuda version.
* Adapt the metal side.
* Fix the metal tests.
* removed scale factor from computation and made quantized gemma3 work similarly to non-quantized gemma3
* created default consts, replaced is_sliding with Option holding a window_size
* gemma3: changed RotaryEmbedding base freq based on layer and sliding window
* Changed attention mask per layer, either normal or sliding
* made attention mask creation slightly more efficient by only creating them once per model iteration
* changed is_sliding to an Option
* clippy
* changed generation to stop on both <eos> and <end_of_turn> instead of only one of them
* Add the const-set op.
* Cuda implementation.
* Bugfix.
* Metal cleanup.
* Add the metal kernels.
* Add some testing.
* Finish the metal implementation.
* Bump the version.
* implemented quantized-gemma, inference not working
* Fixed a few modeling bugs: outputting the correct tokens for a few iterations then garbage
* lint
* clippy
* quantized-gemma3 example working
* added readme
* clippy
* added CONTRIBUTING.md to candle-book
* added description to candle-book introduction
* Updated formatting and added different features to candle-book installation
* mnist guide first draft candle-book
* updated mnist guide syntax and grammar for candle-book
* changed HelloWorld - Mnist to Tutorial - Mnist in SUMMARY.md
* updated intro to mnist guide in candle-book
* Initial commit: model weights working, prediction incorrect
* moved distilbertformaskedlm into distilbert modeling file
* made maskedLM like bert example, still incorrect predictions
* finally not getting NaNs, fixed attention mask
* getting correct output sentences
* get top k predictions
* fixed output formatting slightly
* added default arg for model_id
* lint
* moved masked token example code from distilbertformaskedlm example to distilbert example
* lint
* removed distilbertformaskedlm example
* cleanup
* clippy
* removed embedding normalization from example
* made output and model dependent on args instead of prompt
* lint
* replaced or_ok anyhow error with anyhow context
* changed error message for mask token not found
* Add the SNAC audio tokenizer.
* More snac.
* Again more snac.
* Add some example code for snac.
* Get the weights to load.
* Add to the snac model.
* Fixes.
* Get round-tripping to work.
* Save/load code files.
* Clippy fix.
* Fmt fix.
* Add the CSM model.
* Add some code to load the model.
* Load the text tokenizer.
* Add frame generation.
* Get the sampling to work.
* Rope fix.
* Autoregressive generation.
* Generate some audio file.
* Use the actual prompt.
* Support multiple turns.
* Add a very barebone readme.
* Move some of the shared bits to the model.
* added chatGLM readme
* changed wording in readme
* added readme for chinese-clip
* added readme for convmixer
* added readme for custom ops
* added readme for efficientnet
* added readme for llama
* added readme to mnist-training
* added readme to musicgen
* added readme to quantized-phi
* added readme to starcoder2
* added readme to whisper-microphone
* added readme to yi
* added readme to yolo-v3
* added readme to whisper-microphone
* added space to example in glm4 readme
* fixed mamba example readme to run mamba instead of mamba-minimal
* removed slash escape character
* changed moondream image to yolo-v8 example image
* added procedure for making the reinforcement-learning example work with a virtual environment on my machine
* added simple one line summaries to the example readmes that lacked one
* changed non-existent image to yolo example's bike.jpg
* added backslash to sam command
* removed trailing - from siglip
* added SoX to silero-vad example readme
* replaced procedure for uv on mac with warning that uv isn't currently compatible with pyo3
* added example to falcon readme
* added --which arg to stella-en-v5 readme
* fixed image path in vgg readme
* fixed the image path in the vit readme
* Update README.md
* Update README.md
* Update README.md
---------
Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>
* Start updating to cudarc 0.14.
* Adapt a couple more things.
* And a couple more fixes.
* More tweaks.
* And a couple more fixes.
* Bump the major version number.
* Proper module system for the cuda kernels.
* Proper ptx loading.
* Launch the sort kernel.
* Custom op.
* Start using the builder pattern.
* More builder.
* More builder.
* Get candle-core to compile.
* Get the tests to pass.
* Get candle-nn to work too.
* Support for custom cuda functions.
* cudnn fixes.
* Get flash attn to run.
* Switch the crate versions to be alpha.
* Bump the ug dependency.
* added new language pairs to marian-mt
* lint
* separated python code for converting tokenizers into its own file, added a requirements.txt for dependencies, updated instructions in readme and included python version
* Cleanup.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* update to cudarc to v0.13.5 to support cuda 12.8
* Bump the crate version.
---------
Co-authored-by: Michael McCulloch <michael.james.mcculloch@fastmail.com>
* Improve reduce perf and add contiguous impl
* Improve arg reduce and add contiguous impl
* Improve softmax kernel. 33%-39% higher thrpt
* fmt
* Fixed all bugs. Improved code quality. Added tests.
* Stash for debugging
* Stash for debugging 2
* Fixing argmax bug and improve performance
Co-authored-by: Christopher Fleetwood <45471420+FL33TW00D@users.noreply.github.com>
* Fix test and add is_valid_simgroup_reduce_type trait
* Online softmax. Improved threadgroup reduce. Tidying up a bit.
* Remove redundant threadgroup_barrier from arg reduce
* Mostly tidying up. Some improvements
* Simplify indexed struct
* tidying
* Reuse operation operator instead of passing it in as a parameter
* Fix how operators are applied to indexed<vec<T,N>>
* Vectorized load. Scalar block reduce. Hitting max throughput for f32 reduce.
* Vectorized load for online softmax. Involves a reinterpret_cast of src which may be suboptimal.
* Metal as_type casting vec<bfloat, N> -> vec<float, N/2> for simd and fast math
* Use constant for input instead of const device. Fix strided reduce.
* Use contiguous reduce in tests
* Rename finalize -> to_scalar
* Support integer types max/min (switch with trait-inferred impl later)
* Was worried I was skipping work -> shuffling the 1D test cases
* Add build.rs to avoid metal kernel jit compile overhead
* Improve build. Extract utils
* Compile metal kernels for both macos and ios
* Fixed over xmas and then forgot about it
* Add calculate_reduce_threads util
* Remove old reduce.metal
* Improve f16/bf16 softmax precision by accumulating in f32
* Remove build.rs (for now)
* Move softmax bench to candle-nn
* Remove redundant thread calc util fn
* Use uint over ushort for indices etc
* Use fast exp in MDReduceOp
* Remove nested metal define for softmax
* Fix some clippy lint.
---------
Co-authored-by: Christopher Fleetwood <45471420+FL33TW00D@users.noreply.github.com>
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Add some metal sort kernels imported from MLX.
* Add another test.
* Start adding the multiblock version.
* Proper kernel names.
* Split out the main metal file.
* Multi-block sort.
* More sorting.
* DType parametrization.
* Add a larger test.
* Update main.rs
* Update codegeex4_9b.rs
* Get things to compile.
* Add some default for when rope_ratio is missing.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Update the stable diffusion example with inpainting support for 1.5, 2 and XL.
* Apply cargo fmt.
* Clippy fixes.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* update flash-attn v1
* restore: hdim224
* add 224 flash_fwd_template
* remove whitespace
* softcap is working, including test and api.
* make softcap test case better
* unpadded lse added
* update flash-attn v1
* restore: hdim224
* add 224 flash_fwd_template
* remove whitespace
* softcap is working, including test and api.
* make softcap test case better
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* add xlm-roberta-base
* Add task enum for fill-mask and reranker in xlm-roberta example; update README and fix attention mask dimensions
- Introduced a new `Task` enum to replace string task identifiers in the xlm-roberta example.
- Updated the logic in `main.rs` to handle tasks using the new enum.
- Enhanced README with example output for fill-mask task.
- Fixed dimension retrieval in `prepare_4d_attention_mask` function for better clarity and safety.
* Clippy fix.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* Fix bug in whisper transformer due to num_threads going to zero in the single-threaded case
* Apply rustfmt.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* change: BertEncoder struct to public
* change: make certain fields in Config struct public
* change: all fields in bert config struct to be public
* change: add clone to bert encoder and others
* Clippy fix.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Adds support for stella_en_v5 embedding model -400M variant
* Unified stella
* WIP: Unified Stella
* Combined stella for both 1.5B and 400M variants
* Cargo fmt for the CI
* removed redundant stella-400m model and example after merge into stella-en-v5
* cargo fmt --all
---------
Co-authored-by: Anubhab Bandyopadhyay <4890833+AnubhabB@users.noreply.github.com>
Co-authored-by: laurent <laurent.mazare@gmail.com>
At least on my macOS Sequoia system (MBP 14" 2021, M1 Pro), when I run
the `whisper-microphone` example after it has gathered 10 seconds of
audio, it fails before the transcription:
```
Error: Insufficient buffer size 384 for input channel 0, expected 1024
```
At least for the audio device I'm using (Airpods Pro Max), there is no
guarantee that each audio buffer is a multiple of 1024 samples. Thus at
the end of the 10 seconds, `buffered_pcm` can have some samples at the
end that do not form a complete 1024 sample chunk.
This fixes that by tracking when there is a partial chunk at the end of
the buffer, and leaving it in `buffered_pcm` to be processed on the next
loop iteration.
Note that, in the interest of keeping this PR as small as possible, I
didn't make any other changes to this example.
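A minimal sketch of the buffering approach described above, assuming `buffered_pcm` is a `Vec<f32>` and `process_chunk` stands in for the resampling/transcription step (both names are illustrative):
```rust
const CHUNK: usize = 1024;

fn drain_full_chunks(buffered_pcm: &mut Vec<f32>, mut process_chunk: impl FnMut(&[f32])) {
    // Only forward complete 1024-sample chunks to the model.
    let full_len = (buffered_pcm.len() / CHUNK) * CHUNK;
    for chunk in buffered_pcm[..full_len].chunks(CHUNK) {
        process_chunk(chunk);
    }
    // Keep the partial tail in the buffer for the next loop iteration.
    buffered_pcm.drain(..full_len);
}
```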
* module docs
* varbuilder gguf docs
* add a link to gguf files
* small additional mod doc titles
* safetensor docs
* more core docs
* more module docs in candle_core
* 2 more link fixes
* dinov2
* add another example
* add dinov2reg4
* eva2
* efficientvit
* moondream
* update t5
* update t5
* rwkv
* stable diffusion docs
* add wasm link
* add segment_anything
* adjust for clippy
* ignore bertdoc
* dinov2 ignore
* update block to be text
* remove the rust blocks for the moment
* bump python to 3.11
* add a setup-python step
* add py311 to test as well
* links in chinese_clip
* links for clip model
* add mod docs for flux and llava
* module doc for MMDIT and MIMI
* add docs for a few more models
* mod docs for bert naser and beit
* add module docs for convmixer colpali codegeex and chatglm
* add another series of moddocs
* add fastvit-llama2_c
* module docs mamba -> mobileone
* module docs from moondream-phi3
* mod docs for quantized and qwen
* update to yi
* fix long names
* Update llama2_c.rs
* Update llama2_c_weights.rs
* Fix the link for mimi + tweaks
---------
Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>
* Add some fast Metal MLX SDPA kernels (#32)
* Sketch the sdpa kernel
* Add full sdpa kernel,
* Add test
* Add vectorized kernel for decoding
* Update tests
* Add some docs
* Fix sdpa_vector names
* Add softcapping for vectorized sdpa
* Add softcapping for full sdpa
* Add support for head dim 32, 96, 256
* Add support for head dim 32, 96, 256
* Update docs
* Add update notice
* Clippy and format
* Conditional compilation for bf16
* Use it in quantized llama
* Some review comments
* Use set_params!
* Remove unused
* Remove feature
* Fix metal sdpa for v stride
* Remove comma
* Add the dim method to layout and shape.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Stable diffusion 3.5 support.
* Clippy fixes.
* CFG fix.
* Remove some unnecessary clones.
* Avoid duplicating some of the code.
* Release the mmdit model earlier to reduce memory usage.
* Stella_en_1.5B_v5
* Separated creation. This is a critical step for numerical accuracy and would be documented in the readme
* EmbedDim would require clone and copy
* WIP: example
* Examples added
* a little more in README
* WIP: ONNX Reduce-max ops
* WIP: tests for ReduceMin
* Reduce min/ max v18+
* Reformatting tests for better review readability
* Error on empty set, backward compatibility (13 and below) with 'axes'
* Stella_en_1.5B_v5
* Separated creation. This is a critical step for numerical accuracy and would be documented in the readme
* EmbedDim would require clone and copy
* WIP: example
* Examples added
* a little more in README
* Add stable diffusion 3 example
Add get_qkv_linear to handle different dimensionality in linears
Add stable diffusion 3 example
Add use_quant_conv and use_post_quant_conv for vae in stable diffusion
adapt existing AutoEncoderKLConfig to the change
add forward_until_encoder_layer to ClipTextTransformer
rename sd3 config to sd3_medium in mmdit; minor clean-up
Enable flash-attn for mmdit impl when the feature is enabled.
Add sd3 example codebase
add document
crediting references
pass the cargo fmt test
pass the clippy test
* fix typos
* expose cfg_scale and time_shift as options
* Replace the sample image with JPG version. Change image output format accordingly.
* make meaningful error messages
* remove the tail-end assignment in sd3_vae_vb_rename
* remove the CUDA requirement
* use default_value in clap args
* add use_flash_attn to turn on/off flash-attn for MMDiT at runtime
* resolve clippy errors and warnings
* use default_value_t
* Pin the web-sys dependency.
* Clippy fix.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* start to impl chinese clip
* impl vision model
* copy code from bert
* refactor use
* refactor use again
* fix text model
* refactor
* try to fix text model
* tuning
* tuning chinese clip
* delete useless code
* revert code
* Clippy fixes.
* Also apply cargo fmt.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* add bert for masked lm
* working example
* add example readme
* Clippy fix.
* And apply rustfmt.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* WIP: hopefully better const impl
* with GPU
* More tests on
* Reverting primitive for
* Incorporating review changes - added elem count check in kernel, using for call strategy
* rustfmt ran
* add: direction for lstm layer
* lint: remove unused Error import
* refactor: remove unnecessary int assignment to Direction enum:
* refactor: use &'static str type instead of String for direction_str:
* Run cargofmt.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Add Pixtral.
* More pixtral vision encoder.
* Sketch a pixtral example.
* Sketch a pixtral example.
* Better image loading.
* Support loading images embedded in safetensor files.
* Clippy fixes.
* Add the llava multimodal adapter.
* Add more of the llava bits.
* Add the pixtral config.
* More pixtral inference.
* Add the text generation bits.
* Get the example to work.
* Bugfix.
* Run some bits of the model in f32.
* Blessed version :)
* Better rope frequency computations.
* README update.
* Add the SigLIP model.
* Add more to the forward pass of the vision model.
* Complete the forward pass.
* Add the siglip example.
* Fix.
* Another fix.
* Get everything in place.
* Add a readme.
* candle-onnx: Add Split and Expand operators, Fix Where Op
Implemented based on https://github.com/onnx/onnx/blob/main/docs/Operators.md
Test cases based on those examples.
TODO: Should add the remaining Split examples as tests
TODO: Add test case that motivates Where fix
* candle-onnx: Add ReduceSum operator
Implemented based on https://github.com/onnx/onnx/blob/main/docs/Operators.md
Test cases based on those examples.
TODO: Should add the remaining ReduceSum examples as tests
* candle-onnx: Add ReduceL2 operator
Implemented based on https://github.com/onnx/onnx/blob/main/docs/Operators.md
Test cases based on those examples.
TODO: Should add the remaining ReduceSum examples as tests
* candle-onnx: Fix Clip operator empty string as default arg issue
Optional input args may be signified by an empty string. The length of the input array is not enough because non optional args may follow optional ones.
I encountered this when trying to use the ONNX model found at https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 for example.
The LSTM op has a utility which I factored to be more generally accessible, and I have used it in the ops I have recently created or debugged.
I believe it is likely that this issue may also manifest in other ops, but I didn't want to change anything that I'm not testing.
* fix formatting
* fix small mistake made during refactor
* Quantized version of flux.
* More generic sampling.
* Hook the quantized model.
* Use the newly minted gguf file.
* Fix for the quantized model.
* Default to avoid the faster cuda kernels.
* Add a RotatingKVCache.
* Add some KvCache tests.
* Test the reset too.
* More kv-cache testing.
* More tests for the rotating kv-cache.
* Improve the api for the rotating cache so that the whole src tensor gets returned when it's overlarge.
* Handle contiguity + bugfix + use in mimi.
* Add a way to test the mimi streaming mode.
* Mimi streaming fixes.
* More rotating kv-cache.
* Fix the attn mask generation.
* Handle the abs case.
* Add some tests for the generated mask.
* Adding Granite 7b Instruct model example
* Minor refactoring to make it a little more idiomatic
* Clippy fixes.
* Adding a README with some information about supported Granite models
* Changing the default prompt to better accommodate the language modality of the Granite 7b Instruct model
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Add the mimi audio-tokenizer.
* Formatting tweaks.
* Add a full example.
* Use the transformers names.
* More renamings.
* Get encoding and decoding to work.
* Clippy fixes.
* Include the MLX gemm kernels.
* Clippy lints.
* Export the gemm_f32 kernel.
* Add the f16/bf16 variants.
* Add the initial dispatch code.
* More plugging of the mlx kernels.
* Add a currently broken test.
* Tweaks.
* Bugfix + get the tests to pass.
* Enable the gemm bf16 tests.
* Add some randomized tests.
* Update candle-metal-kernels/src/lib.rs
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* More fixes.
* More clippy fixes.
---------
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* Allow loading images with given std and mean
* OpenCLIP text encoder component
* Two MobileCLIP models
* Clippy fixes.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* silero-vad v5 example
This change adds an example of how to run silero-vad v5
* PR: rename 'vad' to 'silero-vad'
* Update README.md
---------
Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>
index_select does not support negative indexing, but
this change adds just enough workarounds in onnx to
allow evaluating silero-vad models (which make use of
negative indices).
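A sketch of the kind of workaround this refers to (hypothetical helper, not the actual simple_eval code): map a negative index onto the equivalent non-negative one before calling index_select.
```rust
// Convert an ONNX-style index (which may be negative) into the non-negative
// form expected by index_select, given the size of the indexed dimension.
fn normalize_index(idx: i64, dim_len: usize) -> u32 {
    if idx < 0 {
        (idx + dim_len as i64) as u32
    } else {
        idx as u32
    }
}

#[test]
fn negative_index_maps_to_end() {
    assert_eq!(normalize_index(-1, 4), 3);
    assert_eq!(normalize_index(2, 4), 2);
}
```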
* onnx: workaround pow with negative base
rather than fully defining pow in the cpu backend (as in #2318),
this implements a much smaller change which is sufficient to evaluate silero-vad
onnx models. Specifically, checking if pow is run with a 2.0 exponent and, if so,
evaluating it as simply `x*x` instead of the cpu backend's `e^(2.0 * ln(x))`.
* PR: use Tensor::powf instead
powf correctly handles a negative base.
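A sketch contrasting the two approaches (assuming `x` is a candle `Tensor`; not the exact eval code):
```rust
use candle_core::{Result, Tensor};

// The workaround described above: when the exponent is exactly 2.0, evaluate
// pow as a plain elementwise multiplication, which stays well-defined for
// negative bases.
fn square(x: &Tensor) -> Result<Tensor> {
    x * x
}

// The final fix: Tensor::powf handles a negative base correctly, unlike a
// naive e^(2.0 * ln(x)) formulation which yields NaN for x < 0.
fn square_powf(x: &Tensor) -> Result<Tensor> {
    x.powf(2.0)
}
```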
* Start sketching parler-tts support.
* Implement the attention.
* Add the example code.
* Fix the example.
* Add the description + t5 encode it.
* More of the parler forward pass.
* Fix the positional embeddings.
* Support random sampling in generation.
* Handle EOS.
* Add the python decoder.
* Proper causality mask.
* Soft NMS with thresholds
* NMS Test
* Soft nms w/ boxes removed below threshold
* Soft nms test
* No longer removing bounding boxes to fit Soft-NMS focus
* Initialize confidence
* Added comments
* Refactored out updating based on IOU/sigma
* Score_threshold -> confidence_threshold for clarity
* Remove bboxes below confidence threshold
* Softnms basic functionality test
* Softnms confidence decay test
* Softnms confidence threshold test
* Softnms no overlapping bbox test
* Testing confidence after no overlap test
* Single bbox and no bbox tests
* Signify test completion
* Handling result of test functions
* Checking all pairs of bboxes instead of a forward pass
* Equal confidence overlap test
* Clarified tests for implementation
* No longer dropping boxes, just setting to 0.0
* Formatted w/ cargo
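For reference, a sketch of the Gaussian Soft-NMS update the commits above revolve around (standard formulation; names are illustrative and details may differ from the actual implementation): overlapping boxes are no longer dropped, their confidence is decayed based on IoU and a sigma parameter, and boxes are only zeroed out once they fall below the confidence threshold.
```rust
// Decay a box's confidence based on its IoU with a higher-scoring box.
fn soft_nms_decay(confidence: f32, iou: f32, sigma: f32) -> f32 {
    confidence * (-(iou * iou) / sigma).exp()
}

// Confidences that fall below the threshold are set to 0.0 rather than the
// box being removed, matching the behavior described above.
fn apply_threshold(confidence: f32, confidence_threshold: f32) -> f32 {
    if confidence < confidence_threshold {
        0.0
    } else {
        confidence
    }
}
```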
* Add the flux autoencoder.
* Add the encoder down-blocks.
* Upsampling in the decoder.
* Sketch the flow matching model.
* More flux model.
* Add some of the positional embeddings.
* Add the rope embeddings.
* Add the sampling functions.
* Add the flux example.
* Fix the T5 bits.
* Proper T5 tokenizer.
* Clip encoder path fix.
* Get the clip embeddings.
* No configurable weights in layer norm.
* More weights related fixes.
* Yet another shape fix.
* DType fix.
* Fix a couple more shape issues.
* DType fixes.
* Fix the latent dims.
* Fix more shape issues.
* Autoencoder fixes.
* Get some generations out.
* Bugfix.
* T5 padding.
* Clippy fix.
* Add the decode only mode.
* Fix.
* More fixes.
* Finally get some generations to work.
* Add readme.
* bert attention mask
* Allow for using None as a mask.
* Revert part of the changes so that the proper default mask applies.
* Cosmetic change.
* Another cosmetic tweak.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Add Llama 3.1 rope
* Clippy
* Format
* Clippy
* Add support for multiple eos tokens:
* Untagged either
* Remove either dep and fix settings.json
* Make the max positional embeddings configurable
* onnx: fix pad, unsqueeze
both implementations have off-by-one errors:
- Pad 'reflect' cycle for eg `dim==3` is `[0,1,2,1]` which has length of
4 (or `dim*2 - 2`) not 5 (current code `dim*2 - 1`)
- Unsqueeze(-1) for tensor with `dim==3` should be 3 (ie `dim+index+1`)
not 2 (ie currently `dim+index`)
in addition, Pad is incorrectly calculating the starting padding.
If we want to pad out 2 elements to the start, and we have this cycle
of indices of length 6, then we should skip 4 elements, but currently
we skip 2. A more visual representation of what's going on is below:
```
pad_start: 2
data: [a,b,c,d]
indices: [0, 1, 2, 3, 2, 1, 0, 1, 2, 3, 2, 1, 0, ..] // zigzag between 0..4
actual: skip [ c d| c b a b]
expected: ~ skip ~ [ c b| a b c d]
```
The values between `[` and `|` are padding and the values between
`|` and `]` in the example should match the original data being padded.
* Fix clippy lints.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
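A small sketch of the reflect index cycle described in the Pad fix above (illustrative helper, not the onnx eval code):
```rust
// For a dimension of size `dim` (assumed >= 1), the reflect-pad cycle is the
// zigzag 0, 1, ..., dim-1, dim-2, ..., 1 whose length is dim*2 - 2, not dim*2 - 1.
fn reflect_cycle(dim: usize) -> Vec<usize> {
    (0..dim).chain((1..dim - 1).rev()).collect()
}

#[test]
fn reflect_cycle_has_expected_length() {
    assert_eq!(reflect_cycle(3), vec![0, 1, 2, 1]);
    assert_eq!(reflect_cycle(4), vec![0, 1, 2, 3, 2, 1]);
}
```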
* Add gradient test for conv_transpose2d with stride of 2.
* Swap dilation and stride in ConvTranspose2D backpropagation.
Without this, a shape mismatch occurs with a stride of 2 and dilation of 1.
* Add further tests of the ConvTranspose2D gradient.
Values calculated with torch, minor numerical errors adjusted and commented.
* Add: DINOv2Reg4 with PlantCLEF2024 weights and example ( See https://arxiv.org/abs/2309.16588 and https://zenodo.org/records/10848263 )
* Remove extra files + update README to download them + remove extra lines
* minor fix (README remove extra spaces)
* minor fix (README: Fix image url)
* Modify: Add back interpolate_pos_encoding() + fix when no interpolation + remove extra comments + Update README (source image changed and so did the predictions)
* Fix: Improve code readability with '$ cargo clippy' and '$ cargo fmt'
* Another clippy fix.
---------
Co-authored-by: x-VEspit <vincent.espitalier@cirad.fr>
Co-authored-by: laurent <laurent.mazare@gmail.com>
* feat(gemm): implement Gemm operator in candle-onnx
* feat(onnx): Add support for ArgMax operator in candle-onnx
* Apply rustfmt.
* Remove argmax as it was already present.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* define structs
* construct ResidualConvUnit
* forward() for ResidualConvUnit
* implement FeatureFusionBlock
* implement Scratch
* implement DPTHead
* add identity module
* implement forward for DTPHead
* add get_intermediate_layers to DinoVisionTransformer
* implement DepthAnythingV2
* some minor tweaks
* fix compile errors
* fix var builder prefixes
* setup initial example
* use fixed patch size of 37 (518 / 14)
* debugged until output
* print min and max values
* add some dynamism to the output location
* scale input image
* extract prep function
* extract output path function
* normalize image with magic mean and std
* add spectral coloring
* squeeze in the right place
* make interpolation optional
* use bail instead of panic
* omit unnecessary Shape call
* remove empty curly braces
* use bail instead of assert
* use vb and pp
* remove closures
* extract config object
* Apply rustfmt.
* Fix some clippy lints.
* More lints.
* Use the array methods.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* implement if, and pad reflect mode
The intent of this change is to allow eval of the current silero_vad.onnx (v4).
This onnx file uses 'If' and 'Pad' nodes, which had not been supported
by simple_eval until now
* Cleanup (fmt, clippy, minor test tweaks).
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Add the layernorm cuda kernels.
* Dedicated layer norm op.
* Add the slower variant.
* Plug the cuda implementation.
* Add the metal variant.
* Add a dedicated test.
* Bugfix.
* Add a slice_set op.
* Add some testing.
* Add the dedicated kv-cache module.
* Derive debug and clone.
* Expose more kv-cache functions.
* Return the current data when appending.
* Use the new cache in the quantized phi3 model.
* Support embedding model gte-Qwen1.5-7B-instruct
This is a text embedding model based on Qwen2. They share same
model architecture except the last MLP module. This commit brings in
minimal modification of the old Qwen2 implementation to support both
models.
An example is provided, and had been verified according to the official
PyTorch implementation.
* Avoid doing the 'last-token filtering' based on the absence of attention mask.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
Threshold is 0.0 by default, negative values make more points included,
expanding the mask. Positive values make it more picky, making the mask
smaller.
Negative numbers start with a minus sign, which normally makes clap
consider it a flag.
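A minimal clap sketch of the kind of fix this calls for (clap derive API; the `allow_negative_numbers` setting is the relevant knob, the rest of the struct is illustrative):
```rust
use clap::Parser;

#[derive(Parser)]
struct Args {
    /// Mask threshold: 0.0 by default, negative values expand the mask,
    /// positive values make it smaller.
    #[arg(long, default_value_t = 0.0, allow_negative_numbers = true)]
    threshold: f32,
}

fn main() {
    let args = Args::parse();
    println!("threshold: {}", args.threshold);
}
```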
* Separate quantized phi-3 implementation.
* Integrate the quantized phi3 model.
* Small fixes, get the generation to work properly.
* Keep the old llama implementation around.
* Change the default.
* When converting a tensor to a variable, clone if the tensor is already a variable.
* Add a test to ensure training a batch norm works with VarMaps
---------
Co-authored-by: Jeffrey Dallatezza <jeffreydallatezza@Jeffreys-Laptop.local>
* add sigmoid op
* small fix
* add as a method on `Tensor`
* implement gradient calculation for sigmoid
* add sigmoid tests
* we should have a specialized op for this
* fix clippy
* fix clippy 2
* Revert all previous commits in favor of a `CustomOp` based solution
* use `CustomOp1` implementation
* fix rustfmt
* experimental add metal impl
* add cuda kernel impl
* fix fmt
* Add a test + reduce some cuda duplication.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* Add the cuda dequantize f16 kernels.
* Expose the cuda kernels.
* Add some testing + fix.
* Test the other cases too.
* A few more tests.
* Add an environment variable to enable the dequantize f16 + matmul behavior.
* Add the argsort cuda kernels.
* CPU version of arg-sort.
* Hook the cuda kernel + rework the cpu bits.
* Add some dedicated test.
* Working cuda kernel.
* Metal kernel.
* Metal adjustments.
* Bugfix.
* Use the fast rope in qwen.
* Rework the expert selection in qwen.
* Quantized phi in a separate file.
* Add the quantized phi example + rework the model code.
* Improve the phi model.
* Get some generation out.
* Use the appropriate rope shape.
* Tweak the default prompt.
---------
Co-authored-by: Jane Doe <jane.doe@example.org>
* moondream implementation
* add moondream example
* change config default activation
* Add assets and integrate phi mixformer with example
* Make use of kv cache and fix seq_len bug; Clean up example code
* Add README link to example
* Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig
* Delete image
* Use apply instead of forward
* Use latest release special token; Fix token/s accuracy; Use GeluPytorchTanh in VisionConfig v2
* Derive debug and clone traits for Moondream model.
* add basic unary bench for sqrt
* process unary commands in tiles of 4
* re-enable all benchmarks
* rename helper to unary
* modify approach to split up tiled and non-tiled operations
* undo bench ignore for other tests
* update tile size to 2
* only perform the optimization on the contiguous even numbered element case
* add support for l3b, new tokenizer
* add todo
* Add todo and use k_s model
* Use the official tokenizers.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* Add the mmv kernels for smaller sizes.
* Support more mmv kernels.
* Use the new kernels.
* Fix the call.
* Silly fix.
* Improve the testing.
* Fix for dmmv.
* Add another dedicated test for the batching mmv.
* Utilize batches in Stable Diffusion that were already there, but unutilized.
Also refactor out the `save_image` function.
* Clippy + cosmetic fixes.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* Fix for the batch dim in the quantized matmul example.
* Enable more tests on cuda.
* Add a test for qmm with a batch.
* Fix the zeros-dim test on metal.
* Hook the quantized matmul cuda kernels.
* Add a (currently broken) test.
* Kernel fixes.
* Fix by transposing the rhs matrix.
* Add the q4-1 kernels.
* Proper block sizes.
* More details in the tests.
* This change avoids crashes when running T5 models with F16 tensors on CPU.
* This enables running ProstT5's (https://huggingface.co/Rostlab/ProstT5) encoder-only mode in Candle. This ProstT5 mode stores its embed_tokens weights within the encoder, as its decoding stage was replaced with a CNN. This alone is not sufficient to run ProstT5 within Candle examples. We will develop a ProstT5 runner outside candle for now, but would be willing to upstream it to candle-examples at a later point.
* Revert "This enables running ProstT5's (https://huggingface.co/Rostlab/ProstT5) encoder-only mode in Candle. This ProstT5 mode stores it's embed_tokens weights within the encoder, as its decoding stage was replaced with a CNN. You could write more, like: This alone is not sufficient to run ProstT5 within Candle examples. We will develop a ProstT5 runner outside candle for now, but would be willing to upstream it to candle-examples at a later point."
This reverts commit d886d3ce5e.
* This change avoids crashes when running T5 models with F16 tensors on CPU.
* This enables running ProstT5's (https://huggingface.co/Rostlab/ProstT5) encoder-only mode in Candle. This ProstT5 mode stores its embed_tokens weights within the encoder, as its decoding stage was replaced with a CNN. This alone is not sufficient to run ProstT5 within Candle examples. We will develop a ProstT5 runner outside candle for now, but would be willing to upstream it to candle-examples at a later point.
* Start adding the recurrent-gemma model.
* More griffin.
* Add the example + get the weights to load from the HF version.
* More inference code.
* Rope + kv-cache on the attention side.
* Add to the inference code.
* Add more to the recurrent gemma inference.
* Get some first inference to run.
* Add the softcap on logits.
* Fixes.
* Use partial rotary embeddings.
* Get inference to work.
* Add a comment.
* And add a readme.
* Move the metal kernels utils in a separate module.
* Use the BufferOffset for unary ops.
* Fix clippy lints.
* Use the new BufferOffset.
* Adapt the binary ops.
* Affine.
* More ops (powf, elu, cast).
* moondream implementation
* add moondream example
* change config default activation
* Add assets and integrate phi mixformer with example
* Make use of kv cache and fix seq_len bug; Clean up example code
* Add README link to example
* Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig
* Delete image
* Use apply instead of forward
* Use latest release special token; Fix token/s accuracy; Use GeluPytorchTanh in VisionConfig v2
* Add flag to use f16
* Avoid breaking the quantized version on cuda.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* add the sign unary operator
* remove unneeded import
* remove unneeded import
* undo formatting
* undo formatting
* remove unnecessary redefinition
* allow gradient to flow through for sign and round
* fix cpu ops to ensure that negzero and positive zero are handled properly
* clippy fixes
* Properly avoid gradient tracking.
* Use a branchless version.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* moondream implementation
* add moondream example
* change config default activation
* Add assets and integrate phi mixformer with example
* Make use of kv cache and fix seq_len bug; Clean up example code
* Add README link to example
* Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig
* Delete image
* Use apply instead of forward
* Use latest release special token; Fix token/s accuracy; Use GeluPytorchTanh in VisionConfig v2
* moondream implementation
* add moondream example
* change config default activation
* Add assets and integrate phi mixformer with example
* Make use of kv cache and fix seq_len bug; Clean up example code
* Add README link to example
* Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig
* Delete image
* Use apply instead of forward
* Pass bos token at the beginning of tensor.
* Quantize moondream.
* Forward with image bos token.
* Clippy.
* Use q4_0 quantization.
* Add pointers for sequence and tokens; Remove seq_len conditional
* Add more cuda kernels for quantized matmul.
* Add the vec-dot bits.
* Expose the quantized matmul-vec kernels.
* Also include the quantize-q8-1 kernel.
* Glue code for the q8-1 quantization.
* mm-vec product via q8-1 quantization.
* Add a test.
* Add a mm test.
* Get the test to return some sensible results.
* Also test dmmv.
* Fix the launch params.
* Allow for tweaking the force_dmmv parameter while it's experimental.
* moondream implementation
* add moondream example
* change config default activation
* Add assets and integrate phi mixformer with example
* Make use of kv cache and fix seq_len bug; Clean up example code
* Add README link to example
* Remove pos_embed scaling; Remove assets; Add to README; Expand VisionConfig
* Delete image
* Use apply instead of forward
* CLIP model implementation with example
* CLIP Implementation fixes, batch images
* CLIP model remove images from git
* CLIP model remove unnecessary use of batch_indices
* Fast kernels for rotary embeddings.
* Add a test for the fast CPU kernel.
* Rope cuda bindings.
* Cuda kernel.
* Metal kernel (part 1).
* Cuda kernels.
* Finish the metal kernel.
* Use the new kernels in the quantized example.
* Fix warning.
* Trying out a custom RmsNorm cuda kernel.
* CPU implementation for rms-norm.
* Cuda wrappers.
* Add some validation.
* Add some testing.
* More testing.
* Avoid copying the data on squeeze and unsqueeze.
* Fix the quantized llama example.
* Unrelated fix for the quantized stable-lm example on cuda.
* Fix for mamba on cuda (unrelated to the PR).
* first attempt
* progress
* integrate into metal backend
* finish and get test passing
* add other dtype support
* update transpose1d dtypes supported
* first pass at implementation of maxpool2d
* Add definitions for other dtypes
* add tests for other dtypes
* Cosmetic tweaks + re-enable maxpool2d tests for metal.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Add a specialized kernel for copy2d.
* Move the cat operations.
* Avoid transpositions in cat.
* Bugfix.
* Bugfix for the cuda kernel.
* Add a benchmark.
* Add more testing.
* Test fix.
* Faster kernel.
* Add the missing kernel.
* Tweak the test.
* Add a metal kernel.
* Fix for the metal kernel.
* Get the tests to pass on metal.
* Also use this opportunity to fix the metal kernel for ELU.
* Add some bf16 kernels.
* Clippy fixes.
* use_resource API misunderstood. It is not additive. Several usages must be bit-ORed together.
* The seeding was incorrect and used the address instead of the value of the passed in seed.
* Add a check that likely exhibits failure to update the seed between generation of random tensors.
* Buffer overrun, the length given to the std::ptr::copy call was in bytes, and not 32-bit units.
* By default seed the RNG with a time-based value, so that different runs may produce different output, just like the CPU engine.
Use device.set_seed if determinism is warranted.
* Revert "By default seed the RNG with a time-based value, so that different runs may produce different output, just like the CPU engine. Use device.set_seed if determinism is warranted."
This reverts commit d7302de9
Discussion in https://github.com/huggingface/candle/pull/1811#issuecomment-1983079119
* The Metal random kernel failed to set element N/2 of tensors with N elements, N being even. The reason was that all threads but thread 0 created 2 random samples, while thread 0 created only one, i.e. an odd number. In order to produce an even number of samples, the early termination of thread 0 should only ever occur for odd sized tensors.
* Add a test catching any deterministic tensor element in rand and randn output.
---------
Co-authored-by: niklas <niklas@appli.se>
Co-authored-by: Ivar Flakstad <69173633+ivarflakstad@users.noreply.github.com>
* Add a --seed argument to the stable-diffusion example.
* When no seed is specified, do not set one and use the engine's default. This makes the CPU engine work again when no --seed is given, and bails out when a seed is provided, as the engine does not currently support it.
---------
Co-authored-by: niklas <niklas@appli.se>
* Improve metal buffer usage
* Clone cpu storage when loading to reduce wait_until_complete calls
* Use powers of two for buffer sizes so reuse is more likely.
* Select best available buffer by size.
* Add count to MetalStorage -> can use buffer with different size
Co-authored-by: Chris Fleetwood <christopher.fleetwood@huggingface.co>
* Simplify new buffer creation without blit copy. Revert &[] -> Vec
* Add documentation on newBufferWithBytes safety / synchronization
* Drop unused buffers after command buffer is done syncing.
---------
Co-authored-by: Chris Fleetwood <christopher.fleetwood@huggingface.co>
* Normalize loudness of the generated audio.
* Lints.
* One more lint.
* Avoid running the bs1770 tests.
* Another attempt at discarding doc comments.
* Also normalize the loudness in the encodec example.
* Add the metavoice transformer.
* Sketch the speaker-encoder module.
* Adding to the metavoice model.
* Start adding the metavoice example.
* Get some logits out.
* Load the second stage model.
* Get the second step to run.
* Tweak the example.
* Add encodec tilting.
* Glue the different bits together.
* Fix a shape issue.
* Use a constant.
* BPE tokenization.
* Fix the position index in metavoice.
* Add the metavoice transformer.
* Sketch the speaker-encoder module.
* Adding to the metavoice model.
* Start adding the metavoice example.
* Get some logits out.
* Load the second stage model.
* Get the second step to run.
* Tweak the example.
* Add encodec tilting.
* Glue the different bits together.
* Fix a shape issue.
* Use a constant.
* BPE tokenization.
* Add a warning.
* Encodec model.
* Fixes.
* Add the padding functions.
* Get the LSTM bit to work.
* Get the encodec model to generate some tokens (decoder only for now).
* Minor tweak.
* Minor tweak.
* and quantized rwkv v5 model
* Integrate the quantized rwkv model in the initial example.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* Boilerplate for the quantized cuda support.
* More basic cuda support.
* More cuda quantization (quantize on cpu for now).
* Add the dequantization bit.
* Start adding some dedicated cuda kernels from llama.cpp.
* Move the kernel code.
* Start interfacing with the kernel.
* Tweak the kernel launch params.
* Bugfix for quantized metal.
* Fix some clippy lints.
* Tweak the launch parameters.
* Tweak cuda basics to perform a quantized matmul.
* Perform the dequantization on the cpu + use cublas for matmul.
* Add the dequantization kernel.
* Test the qmatmul.
* More kernels.
* Matmul-vec kernel.
* Add a couple kernels.
* More dequantization kernels.
* Add the Gemma models.
* Add the gemma example.
* Adapt the RmsNorm.
* Get the 2b model to work.
* 7b support.
* Use the config head dim.
* Yet another fix.
* Make the matrixes contiguous.
* Also get the 7b model to work.
* And add to the readme.
* Start adding the RWKV model.
* More of the forward step.
* Handle rescaling.
* FeedForward.
* More work on RWKV.
* Better state tracking.
* Finish a first pass on forward.
* Fix the shape mismatches.
* Do not rescale in f32.
* Rename to rwkv-v5.
* Add the new models to the readme.
* feat: support microphone whisper streaming
* fix: cleanup print stmts and adjust how input is read
* fix: remove incorrect comment
* feat: split into new example and simplify
* fix: feature flag example file
* fix: fmt fixes
* feat: simplify and remove redundant files
* Sketch the mamba model for inference.
* Complete the forward pass.
* Add the mamba example.
* Optimize the selective-scan part.
* Fix a couple shape mismatches and get inference to work.
* Tweak the readmes.
* More readme tweaks.
* Use the repo config for trocr rather than hardcoding it + small tweaks.
* Add support for the printed models.
* Fail with an appropriate error message on missing position embeddings.
* Initial check-in for the qwen2 model.
* More qwen2 inference.
* Polish the qwen example.
* Fix the rope basis.
* Get the inference to work.
* Support different model sizes.
* Add the ChatGLM model.
* Rotary embeddings.
* Add to the forward pass.
* Add to the forward pass.
* Add the rotary embeddings.
* Add the KV cache.
* Add the chatglm example.
* Bugfix.
* More glm fixes.
* Fix some shape issues.
* Get the inference to work.
* feat: support multithread spectrogram and small perf tweaks
* feat: clippy improvement for loop variable
* fix: add back speed up scale down logic
* fix: readd mirroring logic
* feat: prefer scoped thread and simplify/improve logic/traits
* Add support for loading Fortran contiguous tensors
This commit introduces the ability to handle Fortran contiguous tensors in the tensor loading process. Previously, the code only supported loading tensors that were contiguous in memory, failing with an error for non-contiguous tensors. With this update, tensors identified as Fortran contiguous (column-major order) are now correctly handled by reversing their dimensions after loading. This enhancement ensures broader compatibility with different tensor layouts, improving the robustness of tensor loading operations.
- Check if a tensor is Fortran contiguous using the `is_fortran_contiguous` flag.
- For Fortran contiguous tensors, reverse the dimensions after loading to correctly represent their layout in memory.
- Continue to bail out with an error for tensors that are neither C contiguous nor Fortran contiguous, maintaining the previous behavior for non-contiguous tensors without explicit support.
This change addresses the issue of loading Fortran contiguous tensors, which was previously unsupported, thereby extending the functionality of the tensor loading mechanism to accommodate a wider variety of tensor layouts.
* Add reshape step to handle fortran contiguous case
* Skip fortran contiguous fix if rank is < 2
* Fail on rank 0, 1 if contiguous
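A minimal sketch of the layout trick described above for the 2D case (illustrative, not the actual loader code): a Fortran-contiguous buffer of shape (rows, cols) has the same memory layout as a C-contiguous buffer of shape (cols, rows), so it can be loaded with the reversed dimensions and transposed back.
```rust
use candle_core::{Device, Result, Tensor};

fn load_fortran_2d(data: &[f32], rows: usize, cols: usize) -> Result<Tensor> {
    // Interpret the column-major data as a C-contiguous (cols, rows) tensor...
    let t = Tensor::from_slice(data, (cols, rows), &Device::Cpu)?;
    // ...then transpose to recover the logical (rows, cols) view.
    t.t()
}
```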
`candle-nn` already exposes a trait to define custom backends. However,
it's not possible to actually construct a `VarBuilder` with a custom
backend because the constructor is not exposed.
This change makes the constructor public and renames it from `new` to
`from_backend` to avoid it being seen as the primary
constructor (which could be confusing to users).
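A usage sketch under the assumption that `from_backend` takes a boxed `SimpleBackend` together with a dtype and device, and that `VarMap` implements `SimpleBackend` (it stands in here for a user-defined backend):
```rust
use candle_core::{DType, Device};
use candle_nn::{VarBuilder, VarMap};

fn main() -> candle_core::Result<()> {
    let varmap = VarMap::new();
    // Build a VarBuilder from a custom backend via the newly exposed constructor.
    let vb = VarBuilder::from_backend(Box::new(varmap), DType::F32, Device::Cpu);
    // Then use it like any other VarBuilder.
    let _layer = candle_nn::linear(4, 2, vb.pp("layer1"))?;
    Ok(())
}
```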
* Supports more audio formats
* Simplify the handling of the different buffer types.
* Check the sample rate.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
Update the source of configuration_mixformer_sequential.py.
It has been removed upstream; however, it is still available at this commit -> d38e6f954ec29b96fe2cf033937dad64e279b5d9
* Metal quantized modifications proposal.
- Add a device param, wherever needed.
- Create new QMetal storage thing that implements QuantizedType.
- Update everywhere needed.
Fix Python.
Fixing examples.
Fix: fmt + clippy + stub.
Moving everything around.
Only missing the actual implems.
Fixing everything + adding dequantized kernels.
More work.
Fixing matmul.
Fmt + Clippy
Some clippy fixes.
Working state.
Q2K Metal -> Bugged (also present in GGML).
Q4K CPU -> Bugged (present previously, new test catch it).
Q5K CPU -> Bugged (present previously).
Q8_1 Both -> Never really implemented it seems
Q8K metal -> Never implemented in metal
Fixing Q2K bug (present in ggml).
* Cleanup.
* Fix the rebase.
* Removing the fences speeds everything up and *is* correct this time...
* Cleanup the fence.
* After rebase.
* Bad code removal.
* Rebase after phi2 merge + fix replit default to CPU.
* Making the CI happy.
* More happy tests.
---------
Co-authored-by: Nicolas Patry <nicolas@Nicolass-MacBook-Pro.local>
* Update the Phi model to use the updated architecture.
* Add more of the phi model.
* Repeat KV + caching.
* Apply the rotary embeddings.
* Add support for the new phi model in the phi example.
* Fix a couple glitches.
* Fix a couple more glitches.
* Use cfg to separate benchmark results based on features
* Add metal where_cond for f16 and bf16. Add benchmark
* Remove allow pragma
* Avoid some unnecessary returns.
* Improve benchmarks layout
* Updated feature separated benchmarks
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Use cfg to separate benchmark results based on features
* Remove allow pragma
* Avoid some unnecessary returns.
* Improve benchmarks layout
* Derive bench_name from actual device
* Run CPU benchmarks even when GPU feature is enabled
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Add RepVGG model.
* Add RepVGG README
* Extract var to top level
* Replace hashmap with a match
* Add a variant for the model kind + avoid some unnecessary config cloning.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Add relu kernel for metal
* Copy error messages proposed in #1491
* Revert non relu changes
* Fix name changes
* Fix the last of us (:
* Fix copy and paste mistakes
* Fix typo
* Revert order changes
* Revert order change
* Add deleted functions back
* Run rustfmt
* feat: add dependabot to the project
* feat: add let's accept patches/fix from other libs
* Revert "feat: add let's accept patches/fix from other libs"
This reverts commit d31a956f81.
* Simpler repro for the neon optimization issue.
* Bugfix for q4k.
* Improve the fix, share the dot-prod bit.
* Clippy fixes.
* Fix for q6k.
* Also fix for q2k.
* Use the new shared dotprod.
* Add more testing.
* add one-hot encoding
* one_hot: improve error handling, use generic to_vecN::<D>
Bails if the index value is equal to or greater than the depth value,
which would result in an out-of-bounds error.
A redundant check is added to ensure the index value does not exceed
the length of the one-hot matrix size, which would also result in an
out-of-bounds error.
Bails if the index value is less than -1. If the index value is -1,
then it ignores the setting of the on_value for the index value. Only
values that are less than -1 are considered errors.
* one-hot: use two generics, one_hot::<I, O>, for input and output data types
Separating the input and output data types allows the input tensor
indices to be a different data type than the output encoded tensor data type.
For example, one_hot::<i64, u8>(...) will take an input tensor of i64 values
and encode the output tensor using u8 values.
The generic I::DTYPE must match the data type of the input indices, otherwise
the method will bail.
Additionally, this method adds an `allow_f64` option to enable the input indices
data type to be f64 values. f64 values are disabled by default.
TODO: indices data type and the generic I data type are currently not compile-time
checked.
* one_hot: remove input generic, use indices dtype matching
This commit removes the to_f64() type cast and explicitly
matches the DType from the input tensor. Currently, only U8,
U32 and I64 is supported for input tensors.
The match arms on the dtype are verbose. It would be nice
to use a generic type with the WithDtype trait bound to
pass to the to_vecN method and then return an inner value.
Open to suggestions for better approaches here to reduce
the match arm verbosity.
* one_hot: use flat_map iterator over dims instead of nested for loop
This commit replaces the nested for loops with a flat_map iter over
the dimensions of the input tensor.
This commit also adds a test for a rank 3 input tensor.
* one_hot: use mandatory on/off-values, remove const msgs
This commit also updates doc tests, comments and test cases.
* Small cleanups.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
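A small usage sketch of the resulting helper, assuming the `candle_nn::encoding::one_hot` signature with mandatory on/off values described above:
```rust
use candle_core::{Device, Result, Tensor};
use candle_nn::encoding::one_hot;

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // An index of -1 leaves the corresponding row at the off value.
    let indices = Tensor::new(&[0i64, 2, -1, 1], &dev)?;
    let encoded = one_hot::<u8>(indices, /* depth */ 4, /* on_value */ 1, /* off_value */ 0)?;
    println!("{encoded}");
    Ok(())
}
```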
* Add training to batchnorm with exponential moving average
* Add more checks to batch norm
* Resolve some review comments
* Add with_momentum variants of `new` methods
* Add check for range of momentum variable; update batch norm test
* Run cargo fmt
* Add back num_features parameter
* Format; tiny simplification
* added policy_gradient, modified main, ddpg and README
* fixed typo in README
* removed unnecessary imports
* small refactor
* Use clap for picking up the subcommand to run.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Add the Mixtral model.
* Add more of the mixtral layers.
* Add the final layers for mixtral.
* Sketch the expert selection.
* Add some expert routing logic.
* Hopefully finish the routing logic for mixtral.
* Add the mixtral example.
* Fix the weight filenames.
* Bugfix.
* Another fix.
* Yet another fix + remove the unused pragma.
* Shape fix.
* Support for quantized mixtral.
* Support mixtral in the quantized example.
* Mlp or moe type.
* Fix the expert field namings.
* Refactor the mlp bit.
* More MoE logic.
* Add the MoE quantized logic.
* Fix the experts length.
* Add the Mixtral model.
* Add more of the mixtral layers.
* Add the final layers for mixtral.
* Sketch the expert selection.
* Add some expert routing logic.
* Hopefully finish the routing logic for mixtral.
* Add the mixtral example.
* Fix the weight filenames.
* Bugfix.
* Another fix.
* Yet another fix + remove the unused pragma.
* Shape fix.
* Add a readme.
* encode size of upsample in enum
* working convolution method for limited 2d kernels
* add test for sf 3 interpolation
* add higher dimensional tests, fix to work with multichannel input
* Remove commented out line.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Add support for SD Turbo
* Set Leading as default in euler_ancestral discrete
* Use the appropriate default values for n_steps and guidance_scale.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
Few fixes.
Going back on remote metal-rs.
Reusing a single buffer (for now) to speed things up.
Adding some half kernels.
All tests are panicking instead of random failure.
Putting back f16 index select.
Add erf.
Working version for llama2-c.
Fixes + cache compute_pipeline_state.
BF16 metal fix.
Remove some prints.
new_owned -> new()..to_owned().
Better batched matmul.
Metal operational.
Reuse buffers on our own reference counts.
Tmp gemm.
Revert "Tmp gemm."
This reverts commit c65f68e988.
Interleave committing.
Speeding up copies using blit.
Fmt.
Fmt.
Remove the assert!
Fmt all.
Fixes after big rebase.
Add softmax for half and bfloat + tests
Fixing Llama example + accumulate softmax in float.
* add bce with logit loss
* add bce with logit loss
* remove imports
* fix tiny bug
* add test documentation and refactor function
* fix test cases and formatting
* distilbert files
* Apply various cleanups.
* More cleanups.
* More polish.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* Fix linspace implementation
`steps` should be strictly greater than 1 to make it consistent with the context.
* Handle steps == 0 and steps == 1.
* Fix rustfmt.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
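A sketch of the spacing computation that motivates the special cases (plain-Rust illustration, not the candle code): the step divides by `steps - 1`, so `steps == 0` and `steps == 1` need explicit handling.
```rust
fn linspace_values(start: f64, stop: f64, steps: usize) -> Vec<f64> {
    match steps {
        0 => vec![],
        1 => vec![start],
        _ => {
            let delta = (stop - start) / (steps - 1) as f64;
            (0..steps).map(|i| start + i as f64 * delta).collect()
        }
    }
}
```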
* Add OpenChat to quantized examples
* Add chat prompt
* Make the openchat example more in line with the other models.
* Fix a typo.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
Updating the readme to coincide with other examples. If you try to run it as previously written, you will get a "cannot find the path specified" error.
* Add support to UL2 model family
* Update docs with UL2
* Create ActivationWithOptionalGating to avoid polluting activations
* Also refactor quantized t5
* Remove useless conversion
* Revert Activation::NewGelu name change
* Remove useless return
* Apply rustfmt and clippy recommendations
* Reuse t5::ActivationWithOptionalGating in quantized version
* (cosmetic change) use a match rather than ifs + avoid early returns.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* add bce with logit loss
* add bce with logit loss
* remove imports
* fix tiny bug
* add test documentation and refactor function
* fix test cases and formatting
* add trocr model
* fix formatting
* commit the actual model lol
* more formatting
* remove tokenizer config
* Support the shape op in ONNX.
* Share the axis normalization bits.
* Add some limited support for gather.
* Unsqueeze.
* Comparison with broadcasting.
* Add Not + handle i32.
* Tweaks for the quantized model.
I think it makes more sense to have it there, since it's a seq2seq model with cross attention, and not an LM. There are also decoder-only T5 models that work as LMs, but that's not the standard.
* Support the shape op in ONNX.
* Share the axis normalization bits.
* Add some limited support for gather.
* Unsqueeze.
* Comparison with broadcasting.
* Add Not + handle i32.
* Adds check for 7b-zephyr and uses correct template
* Handle zephyr as mistral.
* Disable the protoc bits of the CI.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Add more models to the onnx example.
* Input validation.
* Input validation.
* Bugfix.
* Implement clip.
* BatchNorm support.
* Get the efficientnet onnx to work.
* Add the onnx protos.
* Move the reading bits.
* Install protoc on the CI.
* Install protoc on the cuda CI too.
* Use clap for the onnx tool.
* Tweak the CI protoc install.
* Add some simple evalution function.
* Add some binary operator support.
* Negative and `*args` shape handling
* Rename to `PyShapeWithHole` + validate that only one hole exists
* Regenerate stubs
---------
Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>
* Skeleton files for the marian MT model.
* Marian initialization.
* Implement the attention forward method.
* Forward pass for the encoder side.
* Expose the encoder and decoder.
* Start plugging the decoder.
* Forward pass for the decoder layer.
* Set up the marian example.
* Add some missing backtraces.
* Bugfix.
* feat: implement VGG13, VGG16 and VGG19
* Cosmetic fixes.
* More cosmetic tweaks + avoid re-loading the weights on each final layer.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Fix Gym wrapper
- It was returning things in the wrong order
- Gym now differentiates between terminated and truncated
* Add DDPG
* Apply fixes
* Remove Result annotations
* Also remove Vec annotation
* rustfmt
* Various small improvements (avoid cloning, mutability, get clippy to pass, ...)
---------
Co-authored-by: Travis Hammond <travis.hammond@alexanderthamm.com>
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Add the jina-bert model.
* Use alibi.
* Remove the unused pragma.
* Recompute the alibi embeddings.
* Generate the token type ids.
* Use the module trait.
* Add the jina-bert example.
* DType fix.
* Get the inference to work.
* add bce with logit loss
* add bce with logit loss
* remove imports
* fix tiny bug
* add test documentation and refactor function
* fix test cases and formatting
* Add the blip example.
* Tweak the example.
* Implement the cross-attn logic.
* Fix some shape mismatches.
* Get some logits out.
* Get some caption to be generated.
* Start adding vision-transformers.
* Add self-attn.
* More vision transformers.
* vit-vit.
* Add the actual vit model.
* Add the example code for the vision transformers.
* Add some reinforcement learning example.
* Python initialization.
* Get the example to run.
* Vectorized gym envs for the atari wrappers.
* Get some simulation loop to run.
* Only optimize float tensors.
* Use full tensors for zeros and ones.
* Add a benchmark for the matmul slowness.
* Add the convmixer model.
* Proper adaptive pooling.
* Some first `Module` implementations
* Add `state_dict` and `load_state_dict` functionality
* Move modules around and create `candle.nn.Linear`
* Add `nn.Embedding` and `nn.LayerNorm`
* Add BERT implementation
* Batch q-matmul
* Automatically dequantize `QTensors` if a `Tensor` is expected
* Add Module `.to()`, `.cuda()`, `cpu()` and `.type()` functionality
* Unittests for `Module`, `Tensor` and `candle.utils`
* Add `pytorch` like slicing to `Tensor`
* Cleanup and BERT fixes
* `black` formatting + unit-test for `nn.Linear`
* Refactor slicing implementation
* [Whisper] Update to use quantized model
* [whisper] add language detection
* [whisper] change assets location
* [whisper] adapt js example with quantized models
* [whisper] better task parsing
* [whisper] minor fixes
* [SAM] Add undo button and background point mode
* [SAM] remove pts on near clicks
* [SAM] check shiftKey toggle point mode
* [SAM] clear points when clearing image
* Add the quantized-whisper model.
* Quantized the whisper model.
* Adapt the whisper example to handle quantization.
* Add the quantized flag.
* Load the proper weights.
* Quantized version of mistral.
* Integrate the quantized mistral variant.
* Use the quantized weight files.
* Tweak the quantization command.
* Fix the dtype when computing the rotary embeddings.
* Update the readme with the quantized version.
* Fix the decoding of the remaining tokens.
* Add the mistral example.
* Use the two model files.
* Adjust the dtype.
* Tweak the weight paths.
* Remove the end of text token.
* Get the mistral model to generate some text.
* Fix when running on the gpu.
* More gpu fixes.
* Add the mistral example.
* Use the two model files.
* Adjust the dtype.
* Tweak the weight paths.
* Remove the end of text token.
* Get the mistral model to generate some text.
* add phi wasm module
* replace input with textarea
* trim input prompt
* stop on <|endoftext|>
* formatting
* clean up
* add blurb, and syntax highlighting
* add phi-v1.5 wasm
* add note
* hide Options on details
* add first token to generated text
* whitespaces for new line
* fix: abort -> aborted
* Use yoke to provide a self-referential container for mmaped safetensor files.
* Add the new self-owned type for safetensor files without removing the previous version.
* Add routing.
* Add an initializer for the case of multiple files.
* Complete the mixformers implementation.
* Tweak the attention.
* Add the phi-1.5 example.
* Improve the phi example.
* Bugfix.
* Get the phi example to work.
* Load gguf files for the quantized t5.
* Add the quantized t5 example.
* Allow for loading local files.
* Add some support for quantizing safetensor files.
* Transpose before quantizing.
* Quantized t5.
* Retrieve the weights from the hub.
* Only use classifier free guidance for the prior.
* Add another specific layer-norm structure.
* Tweaks.
* Fix the latent shape.
* Print the prior shape.
* More shape fixes.
* Remove some debugging continue.
* Specialized attention module for Wuerstchen.
* Reshaping ops.
* Attention processor.
* Finish the forward pass.
* Hook the new attention processor.
* Get the prior forward pass to work.
* Make it contiguous.
* More Weurstchen fixes.
* More shape fixes.
* Add more of the prior specific bits.
* Broadcast add.
* Fix the clip config.
* Add some masking options to the clip model.
* Load t5 decoder
* Run enc, dec, and lm head, but no cross attn
* Cross-attention over key_value_states
* New arg for decoder input ids
* Add mask, don't forward position biases through decoder
* Update t5 examples
* Clippy + rustfmt
* Add a readme for the segment-anything model.
* Add the original image.
* Clean-up the segment anything cli example.
* Also print the mask id in the outputs.
* Enable im2col on the cpu side.
* Hook im2col on the cpu backend.
* Use the kernel offset.
* Avoid an unnecessary copy.
* Handle non-contiguous kernels.
* Add a const to select the conv2d kernel.
* im2col implementation for conv2d.
* Fix for the im2col implementation to match the current conv2d.
* Small optimization.
* Add a cuda kernel.
* Handle arbitrary layouts.
* Im2Col cuda code.
* Add ability to override the compiler used by NVCC from an environment variable
* Allow relative paths in CANDLE_FLASH_ATTN_BUILD_DIR
* Add the compilation failure to the readme, with a possible solution
* Adjust the error message, and remove the special handling of the relative paths
* Start processing images.
* Add LayerNorm2d.
* Properly use LayerNorm2d.
* Tweak eps.
* Use LayerNorm on inputs with a rank different from 3.
* Window partitioning.
* Fix a couple todos.
* More todos.
* Hard-code the einsums.
* More padding support.
* Some sizes tweaks.
* Use the hub to get the weights.
* Use a batch matmul.
* Tweaks.
* More fixes.
* Get some predictions to be generated.
* More segment-anything again.
* Transformer block forward.
* Two-ways transformer.
* Position embeddings.
* Sketch the prompt encoder.
* More prompt-encoder.
* More prompt-encoder.
* Add the main sam module.
* Embed the transformer.
* And hook the transformer forward step.
* Build the model.
* Handle the global attn indexes.
* Get the model to load.
* img2img pipeline for stable diffusion.
* Rename the arguments + fix.
* Fix for zero strength.
* Another fix.
* Another fix.
* Revert.
* Include the backtrace.
* Noise scaling.
* Fix the height/width.
* Add a custom softmax implementation.
* Add softmaxlastdim to the benchmarks.
* And add a test.
* Support more dtypes.
* Polish the code.
* Use the slow implementation on cuda.
* Add a todo for the cuda kernel.
* Sketch a quantized llama using the pyo3 api.
* Add more ops.
* Expose a few more functions to use in the quantized model.
* Rope embeddings.
* Get the forward pass to work.
* Add more pyo3 support.
* Add some support for quantized tensors in pyo3.
* Add an arc layer on qmatmul.
* Add the quantized matmul.
* Quantization support.
* More quantization support.
* Test the python quantization.
* Add the dilation parameter.
* Restore the basic optimizer example.
* Dilation support in cudnn.
* Use the dilation parameter in the cpu backend.
* More dilation support.
* No support for dilation in transposed convolutions.
* Add dilation to a test.
* Remove a print.
* Helper function.
* Add conv-transpose.
* Return zeros for now.
* Naive CPU implementation.
* Add a conv-transpose test + fix the cpu implementation.
* Add a second test.
* return detections with class names
* ignore .DS_Store
* example how to load wasm module
* add param to set model size
* add param for model size
* accept iou and confidence threshold on run
* conf and iou thresholds
* clamp only
* remove images from branch
* a couple of renamings, add readme with instructions
* final design
* minor font + border update
Codellama requires bf16 for now (error when converting from bf16 to f16).
The multiprocess demo is not functional for it because flash-attn only supports
f16 for now.
* Add the pose estimation head for yolo.
* Properly handle the added position dimensions.
* Integrate the pose estimation head in the forward pass.
* Renaming.
* Fix for pose estimation.
* Add to the cuda example a reproduction of the issue.
* Tweak.
* Add a test using non-square matrixes.
* Fix the conv2d kernel.
* Display the error.
* And tweak the comment.
* Add some group parameter to convolutions.
* Avoid some unnecessary groups checks.
* Move the tensor convolution bits.
* Proper handling of groups.
* Bump the crate version.
* And add a changelog.
According to the API:
```rust
inp = inp.to_device(&Device::Cuda(0)?)?;
```
cannot work as `Device::Cuda(...)` expects a `CudaDevice`, not an integer.
I'd recommend using `Device::new_cuda(...)` instead.
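For reference, a corrected fragment along those lines (untested sketch, mirroring the snippet above):
```rust
// Sketch: build the device once via its constructor, then move tensors onto it.
let device = Device::new_cuda(0)?;
let inp = inp.to_device(&device)?;
```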
* first q2 implementation
* First Q4K and Q5K implementations
* fix `q2k` and `q5k`
* Some first cleanups
* run `clippy` on tests
* finally implement `q3k`
* deactivate `q3k` test on macos
* also disable the test on linux
* Fix floating bits in `q3k` dequantization
* Refactoring pass + reorder quants in file
* `fmt`
* Re-add `src` asserts and redefine `dst`
* Sketch the yolo wasm example.
* Web ui.
* Get the web ui to work.
* UI tweaks.
* More UI tweaks.
* Use the natural width/height.
* Add a link to the hf space in the readme.
* Sketching yolo-v8.
* Get the model to load.
* yolo-v8 forward pass.
* Complete(?) the forward pass.
* Fix some shape issues.
* Add the missing padding.
* Process the predictions.
* Some fixes for yolo-v3.
* Use the running stats for inference in the batch-norm layer.
* Get some proper predictions for yolo.
* Avoid the quadratic insertion.
* Add a couple functions required for yolo.
* Add the yolo-v3 example.
* Add minimum and maximum.
* Use the newly introduced maximum.
* Cuda support for min/max + add some testing.
* Allow for more tests to work with accelerate.
* Fix a typo.
* Skeleton files for neon support of quantization.
* SIMD version for q4 vecdot.
* Also simdify the q6k multiplication.
* Add some timings to stable-diffusion.
* Separate the prompt stats from the post-prompt ones in the quantized example.
* Slightly nicer output printing.
* Line up with the llama.cpp implementation.
* Pickle work-in-progress.
* More unpickling.
* More pickling.
* Proper handling of setitems.
* Clippy.
* Again more pickling.
* Restore the example.
* Add enough pickle support to get the list of tensors.
* Read the data from zip files.
* Retrieve the tensor shape.
* Extract the size and dtype.
* More storage types.
* Improve the destructuring.
* Also support ggml files.
* Pickle work-in-progress.
* More unpickling.
* More pickling.
* Proper handling of setitems.
* Clippy.
* Again more pickling.
* Restore the example.
* Add enough pickle support to get the list of tensors.
* Read the data from zip files.
* Retrieve the tensor shape.
* Extract the size and dtype.
* More storage types.
* Improve the destructuring.
* Add a vision transformer example (dino-v2).
* Add some documentation + test.
* CI fix.
* Another fix (still unable to replicate the errors locally :( )
* Print the detected arch options.
* Add the q6k quantization.
* Add a currently broken test.
* Bugfix.
* Bugfix.
* Another bugfix.
* Another bugfix + get the test to work.
* Add flash-attention for the stable-diffusion example.
* Change the dtype.
* Silly fix.
* Another fix.
* Revert the dtype back to the query dtype after applying flash-attn.
* Add more stats to the ggml example.
* Build a quantized model from the file content.
* Move the tensor retrieval in the main crate.
* Start adding the forward pass.
* Add more to the forward pass of the quantized llama.
* Apply the attention layers.
* Add the sampling loop.
* Get the sampling loop to work.
* Minor tweak.
* Add a quantize/dequantize test.
* Bugfix.
* Add a comment + swap the order.
* Bugfixes.
* Properly initialize wdata.
* Simplify the matmul bits.
* Add from_float for q4_0.
* Fix a couple bugs.
* Get the test to work.
* Get clippy to be happy.
* Add a vecdot trait.
* Start implementing mul_mat.
* Add to the mul mat implementation.
* Add q8_0 quantization.
* Implement the GgmlType trait for all types.
* Add the missing block.
* Add a TODO.
* Add a cudnn feature to be used for conv2d.
* Allocate the proper workspace.
* Only create a single cudnn handle per cuda device.
* Proper cudnn usage.
* Bugfix.
* Add a cuda kernel for avg-pool2d.
* Avoid running out of bounds.
* Finish wiring the avg pool kernel + add some testing.
* Support for max-pool + testing.
* Add a naive conv2d cuda kernel.
* Proper conv2d support on the rust side.
* Conv1d testing on gpu.
* Also use the test on gpus.
* Fix the clean-ptx target.
* Track the conv2d operations in stable-diffusion.
* Add more tracing to stable-diffusion.
* Also trace the resnet bits.
* Trace the attention blocks.
* Also trace the attention inner part.
* Small tweak.
* Add more tracing to the whisper example.
* Support accelerate in more examples.
* Use accelerate for pointwise functions.
* Use accelerate for binary operations too.
* Bugfix for binary operation: use the rhs before the lhs.
* Change distributions
Standard generates in [0, 1), Normal is correct.
* Add test
Not sure if this is the best place to put the test
* Remove unnecessary use
* Start adding a stable-diffusion example.
* Proper computation of the causal mask.
* Add the chunk operation.
* Work in progress: port the attention module.
* Add some dummy modules for conv2d and group-norm, get the attention module to compile.
* Re-enable the 2d convolution.
* Add the embeddings module.
* Add the resnet module.
* Add the unet blocks.
* Add the unet.
* And add the variational auto-encoder.
* Use the pad function from utils.
* Move the vision datasets to a separate crate.
* Move the batcher bits.
* Update the readme.
* Move the tiny-stories bits.
---------
Co-authored-by: Jane Doe <jane.doe@example.org>
- Loading with memmap
- Loading a sharded tensor
- Moved some snippets to `candle-examples/src/lib.rs`. This is because
managing book-specific dependencies is a pain: https://github.com/rust-lang/mdBook/issues/706
- This causes a non-aligned inclusion (https://github.com/rust-lang/mdBook/pull/1856) which we have
to exclude from fmt to remove.
mdbook might need some more love :)
* Rework the var-builder to handle initializations.
* Add some helper functions for layer creation.
* Improve the layer initializations.
* Get initialized variables.
* Precompute the rot embeddings when training lamas.
* Rework the commands and run inference by default.
* Add the training module and load the training dataset.
* Random dataset iterator.
* Proper valid-loss computation.
* Compute the evaluation loss.
* Add more substance to the training loop.
* Expose the seqlen variable for flash-attn without padding.
* Fix the batched call.
* Adapt for the varlen variant.
* No need to set the batch strides when in varlen mode.
* Add a test (disabled at the moment).
* Get the test to work properly.
* Cuda support for the mnist training.
* min/max fix + testing.
* Add the argmin/argmax tests.
* More cuda support for argmin/argmax.
* Cuda kernels for argmin and argmax.
* Start sketching the bigcode gpt model.
* Sketch the bigcode model.
* Implement the attention mechanism.
* Random reshaping.
* Sketch more of the example.
* Add some kv cache.
* Properly generate the position ids.
* Proper attention mask.
* Bail on upcasting.
* Properly apply the attention mask.
* Add the smaller starcoder variants.
* Update for the new hub api.
* Fix a shape issue.
* Fix another shape issue.
* Get some logits out.
* Adjust the weight names.
- Add ms/token on llama2.c (15ms/token on my personal machine)
- Hide `Run` buttons while models are not ready
- Add dummy `progress` while weights are downloading (I briefly looked
at putting a real progressbar.. and nothing easy enough came up.)
* Again set a few extra params.
* Use the appropriate kernel sizes.
* Add all the kernel sizes.
* Parallel compiling.
* Reduce the amount of parallelism.
* Add the missing kernel.
* Fix a typo.
* Remove bf16 support for now.
* Proper flash-attn parameters.
* Set the flash attention parameters.
* Add more validations.
* Setup the o_ flash attn parameters.
* More flash-attn support.
* Set more flash attn parameters.
* Add some flash-attn kernel, import the code for flash-attn v2 from Dao-AILab.
* More flash attn.
* Set up the flash attn parameters.
* Get things to compile locally.
* Move the flash attention files in a different directory.
* Build the static C library with nvcc.
* Add more flash attention.
* Update the build part.
* Better caching.
* Exclude flash attention from the default workspace.
* Put flash-attn behind a feature gate.
* Get the flash attn kernel to run.
* Move the flags to a more appropriate place.
* Enable flash attention in llama.
* Use flash attention in llama.
* Skeleton methods for IndexAdd/ScatterAdd.
* Add a Map2InPlace trait.
* Add the glue code for the index-add/scatter-add kernels.
* Tweak the file name: embeddings -> indexing.
* Add the cuda kernel for indexadd.
* And add the scatter-add kernels.
* Start adding llama2.c.
* Model loading.
* Add the llama-v2 model.
* Start converting the weights.
* Rotary embedding tweaks.
* Get the model to generate some tokens.
* Add the copy op.
* Tweak some cat error messages.
* Handle the contiguous case in to_vec1.
* Fast variant for to_vec2.
* Add a faster to_vec3 variant.
* Cleanup some todos.
* Fix more todo.
* Optimize for the contiguous case.
* Add the IntDType trait.
* Handle the intdtype trait for more ops.
* Remove a todo.
* Remove a todo.
* Build example kernels.
* Add some sample custom kernel.
* Get the example kernel to compile.
* Add some cuda code.
* More cuda custom op.
* More cuda custom ops.
* Start adding gather.
* Gather cpu implementation + use in simple training.
* Add scatter_add for the gradient of gather.
* Simple cpu implementation of scatter_add.
* Use gather in the simple-training backprop.
* Refactor the reduce ops in order to introduce argmin/argmax.
* Clippy fixes.
* Use the newly introduced argmax.
* Fix the strided case.
* Handle the non-contiguous case.
* More realistic training setup.
* Compute the model accuracy.
* Very inefficient backprop for index select.
* More backprop.
* Fix some backprop issues.
* Backprop fix.
* Another broadcasting backprop fix.
* Better backprop for reducing ops.
* Training again.
* Add some gradient tests.
* Get the training to work.
* Add the comparison operations.
* Add the helper functions on the tensor side.
* More cmp operations.
* Cpu implementation for the comparison operations.
* Use contiguous tensors for variables.
* Sketch the mnist example.
* Start adding the reduce ops.
* Renaming.
* Refactor the reduce operations.
* Bugfix for the broadcasting vectorization.
Make sure that you have [`candle-core`](https://github.com/huggingface/candle/tree/main/candle-core) correctly installed as described in [**Installation**](https://huggingface.github.io/candle/guide/installation.html).
Let's see how to run a simple matrix multiplication.
Write the following to your `myapp/src/main.rs` file:
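A minimal version of that program might look like this (the exact snippet in the guide may differ slightly):
```rust
use candle_core::{Device, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let device = Device::Cpu;

    // Two random matrices with compatible shapes for a matmul.
    let a = Tensor::randn(0f32, 1.0, (2, 3), &device)?;
    let b = Tensor::randn(0f32, 1.0, (3, 4), &device)?;

    let c = a.matmul(&b)?;
    println!("{c}");
    Ok(())
}
```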
- [`optimisers`](https://github.com/KGrewal1/optimisers): A collection of optimisers
including SGD with momentum, AdaGrad, AdaDelta, AdaMax, NAdam, RAdam, and RMSprop.
- [`candle-vllm`](https://github.com/EricLBuehler/candle-vllm): Efficient platform for inference and
serving local LLMs including an OpenAI compatible API server.
- [`candle-ext`](https://github.com/mokeyish/candle-ext): An extension library to Candle that provides PyTorch functions not currently available in Candle.
- [`candle-coursera-ml`](https://github.com/vishpat/candle-coursera-ml): Implementation of ML algorithms from Coursera's [Machine Learning Specialization](https://www.coursera.org/specializations/machine-learning-introduction) course.
- [`kalosm`](https://github.com/floneum/floneum/tree/master/interfaces/kalosm): A multi-modal meta-framework in Rust for interfacing with local pre-trained models with support for controlled generation, custom samplers, in-memory vector databases, audio transcription, and more.
- [`candle-sampling`](https://github.com/EricLBuehler/candle-sampling): Sampling techniques for Candle.
- [`gpt-from-scratch-rs`](https://github.com/jeroenvlek/gpt-from-scratch-rs): A port of Andrej Karpathy's _Let's build GPT_ tutorial on YouTube showcasing the Candle API on a toy problem.
- [`candle-einops`](https://github.com/tomsanbear/candle-einops): A pure rust implementation of the python [einops](https://github.com/arogozhnikov/einops) library.
- [`atoma-infer`](https://github.com/atoma-network/atoma-infer): A Rust library for fast inference at scale, leveraging FlashAttention2 for efficient attention computation, PagedAttention for efficient KV-cache memory management, and multi-GPU support. It is OpenAI api compatible.
- [`llms-from-scratch-rs`](https://github.com/nerdai/llms-from-scratch-rs): A comprehensive Rust translation of the code from Sebastian Raschka's Build an LLM from Scratch book.
If you have an addition to this list, please submit a pull request.
<!--- ANCHOR_END: useful_libraries --->
<!--- ANCHOR: features --->
## Features
- Simple syntax, looks and feels like PyTorch.
- Model training.
- Distributed computing (NCCL).
- Serverless (CPU), small and fast deployments.
- Models out of the box (Llama, Whisper, Falcon, ...).
- Embed user-defined ops/kernels, such as [flash-attention v2](https://github.com/huggingface/candle/blob/89ba005962495f2bfbda286e185e9c3c7f5300a3/candle-flash-attn/src/lib.rs#L152).
- Backends.
    - Optimized CPU backend with optional MKL support for x86 and Accelerate for macs.
    - CUDA backend for efficiently running on GPUs, multiple GPU distribution via NCCL.
    - WASM support, run your models in a browser.
    - f16 and bf16 support.
- Included models.
    - Language Models.
        - LLaMA v1, v2, and v3 with variants such as SOLAR-10.7B.
- [candle-onnx](./candle-onnx/): ONNX model evaluation.
## FAQ
### Why should I use Candle?

<!--- ANCHOR: goals --->
Candle's core goal is to *make serverless inference possible*. Full machine learning frameworks like PyTorch
are very large, which makes creating instances on a cluster slow. Candle allows deployment of lightweight
binaries.

Secondly, Candle lets you *remove Python* from production workloads. Python overhead can seriously hurt performance,
and the [GIL](https://www.backblaze.com/blog/the-python-gil-past-present-and-future/) is a notorious source of headaches.

Finally, Rust is cool! A lot of the HF ecosystem already has Rust crates, like [safetensors](https://github.com/huggingface/safetensors) and [tokenizers](https://github.com/huggingface/tokenizers).
<!--- ANCHOR_END: goals --->
### Other ML frameworks
- [dfdx](https://github.com/coreylowman/dfdx) is a formidable crate, with shapes being included
  in types. This prevents a lot of headaches by getting the compiler to complain about shape mismatches right off the bat.
  However, we found that some features still require nightly, and writing code can be a bit daunting for non rust experts.
  We're leveraging and contributing to other core crates for the runtime so hopefully both crates can benefit from each
  other.
- [burn](https://github.com/burn-rs/burn) is a general crate that can leverage multiple backends so you can choose the best
  engine for your workload.
- [tch-rs](https://github.com/LaurentMazare/tch-rs.git) Bindings to the torch library in Rust. Extremely versatile, but they
  bring in the entire torch library into the runtime. The main contributor of `tch-rs` is also involved in the development
  of `candle`.
### Common Errors

#### Missing symbols when compiling with the mkl feature.

If you get some missing symbols when compiling binaries/tests using the mkl
or accelerate features, e.g. for mkl you get:
```
= note: /usr/bin/ld: (....o): in function `blas::sgemm':
.../blas-0.22.0/src/lib.rs:1944: undefined reference to `sgemm_' collect2: error: ld returned 1 exit status
= note: some `extern` functions couldn't be found; some native libraries may need to be installed or have their path specified
= note: use the `-l` flag to specify native libraries to link
= note: use the `cargo:rustc-link-lib` directive to specify the native libraries to link with Cargo (see https://doc.rust-lang.org/cargo/reference/build-scripts.html#cargorustc-link-libkindname)
```
or for accelerate:
```
Undefined symbols for architecture arm64:
"_dgemm_", referenced from:
candle_core::accelerate::dgemm::h1b71a038552bcabe in libcandle_core...
"_sgemm_", referenced from:
candle_core::accelerate::sgemm::h2cf21c592cba3c47 in libcandle_core...
ld: symbol(s) not found for architecture arm64
```
This is likely due to a missing linker flag that was needed to enable the mkl library. You
can try adding the following for mkl at the top of your binary:
```rust
extern crate intel_mkl_src;
```
or for accelerate:
```rust
extern crate accelerate_src;
```
#### Cannot run the LLaMA examples: access to source requires login credentials
```
Error: request error: https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/tokenizer.json: status code 401
```
This is likely because you're not permissioned for the LLaMA-v2 model. To fix
this, you have to register on the huggingface-hub, accept the [LLaMA-v2 model
conditions](https://huggingface.co/meta-llama/Llama-2-7b-hf), and set up your
authentication token. See issue
[#350](https://github.com/huggingface/candle/issues/350) for more details.
#### Missing cute/cutlass headers when compiling flash-attn
```
In file included from kernels/flash_fwd_launch_template.h:11:0,
from kernels/flash_fwd_hdim224_fp16_sm80.cu:5:
kernels/flash_fwd_kernel.h:8:10: fatal error: cute/algorithm/copy.hpp: No such file or directory
#include <cute/algorithm/copy.hpp>
^~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
Error: nvcc error while compiling:
```
[cutlass](https://github.com/NVIDIA/cutlass) is provided as a git submodule so you may want to run the following command to check it out properly.
```bash
git submodule update --init
```
#### Compiling with flash-attention fails
```
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
```
This is a bug in gcc-11 triggered by the Cuda compiler. To fix this, install a different, supported gcc version - for example gcc-10, and specify the path to the compiler in the NVCC_CCBIN environment variable.
#### How to know where an error comes from.
You can set `RUST_BACKTRACE=1` to be provided with backtraces when a candle
error is generated.
#### CudaRC error
If you encounter an error like this one `called `Result::unwrap()` on an `Err` value: LoadLibraryExW { source: Os { code: 126, kind: Uncategorized, message: "The specified module could not be found." } }` on Windows, you can fix it by copying and renaming these 3 files (make sure they are in your path). The paths depend on your CUDA version.
The book uses [mdBook](https://github.com/rust-lang/mdBook) for building.
## Installation
To install mdBook, run `cargo install mdbook`. More instructions can be found [here](https://rust-lang.github.io/mdBook/guide/installation.html).
## Viewing the book
To view the book, run `mdbook serve --open candle-book`. More instructions can be found [here](https://rust-lang.github.io/mdBook/guide/creating.html).
Not super pretty at the moment, but we can see the error occurred on `{ fn: "myapp::main", file: "./src/main.rs", line: 29 }`.
Another thing to note is that, since Rust is compiled, it is not necessarily as easy to recover proper stacktraces,
especially in release builds. We're using [`anyhow`](https://docs.rs/anyhow/latest/anyhow/) for that.
The library is still young, please [report](https://github.com/LaurentMazare/candle/issues) any issues detecting where an error is coming from.
## Cuda error management
When running a model on Cuda, you might get a stacktrace not really representing the error.
The reason is that CUDA is async by nature, and therefore the error might be caught while you were sending totally different kernels.
One way to avoid this is to use `CUDA_LAUNCH_BLOCKING=1` as an environment variable. This will force every kernel to be launched sequentially.
You might however still see errors reported on other kernels, as the faulty kernel might exit without an error but corrupt some pointer, and the error will only surface later, for instance when the `CudaSlice` is dropped.
If this occurs, you can use [`compute-sanitizer`](https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html).
This tool is like `valgrind` but for cuda. It will help locate the errors in the kernels.
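For example, possible invocations look like this (the example binary name is just a placeholder):
```bash
# Force synchronous kernel launches to get a more accurate stacktrace.
CUDA_LAUNCH_BLOCKING=1 cargo run --release --example my-example
# Or run the binary under compute-sanitizer (valgrind-like, for cuda).
compute-sanitizer ./target/release/examples/my-example
```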
Now that it works, this is a great way to create your own layers.
But most of the classical layers are already implemented in [candle-nn](https://github.com/huggingface/candle/tree/main/candle-nn).

## Using `candle_nn`.

For instance [Linear](https://github.com/huggingface/candle/blob/main/candle-nn/src/linear.rs) is already there.
This Linear is coded with PyTorch layout in mind, to better reuse existing models out there, so it uses the transpose of the weights and not the weights directly.
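As a small illustration (the shapes below are arbitrary), a `Linear` built from a `(out_features, in_features)` weight can be used like this:
```rust
use candle_core::{DType, Device, Tensor};
use candle_nn::{Linear, Module};

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;
    // Weight stored as (out_features, in_features), PyTorch style; `forward`
    // transposes it internally.
    let weight = Tensor::randn(0f32, 1.0, (10, 784), &device)?;
    let bias = Tensor::zeros(10, DType::F32, &device)?;
    let layer = Linear::new(weight, Some(bias));

    let xs = Tensor::randn(0f32, 1.0, (2, 784), &device)?;
    let ys = layer.forward(&xs)?;
    println!("{:?}", ys.shape()); // (2, 10)
    Ok(())
}
```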
This tutorial provides an introduction to Candle by implementing and training a neural network for MNIST digit classification from scratch.
Throughout this tutorial, you will learn the basics of:
- Tensor operations and model construction
- Creating and implementing neural network layers
- Parameter initialization
- Training loop implementation
- Saving and loading trained models
## Getting Started
Before proceeding, please ensure that you have properly installed Candle by following the instructions in the [Installation](../installation.md) guide.
Many classical layers (such as [Linear](https://github.com/huggingface/candle/blob/main/candle-nn/src/linear.rs)) are already implemented in [candle-nn](https://github.com/huggingface/candle/tree/main/candle-nn).
This `Linear` implementation follows PyTorch conventions for improved compatibility with existing models, utilizing the transpose of weights rather than direct weights.
Let's simplify our implementation. First, add `candle-nn` as a dependency:
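One way to do that from the command line (assuming the crates.io release is what you want):
```bash
cargo add candle-nn
```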
After training a model, it is useful to save and subsequently load the model parameters. In Candle, this functionality is managed through the `VarMap` data structure, with parameters stored on disk using the [safetensors](https://huggingface.co/docs/safetensors/index) format.
### Saving Model Parameters
Let's modify our `training_loop` function to include functionality for saving weights:
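A minimal sketch of the saving part, assuming the training loop already has a `varmap: VarMap` holding the trained parameters (the file name is arbitrary):
```rust
use candle_core::Result;
use candle_nn::VarMap;

// Hypothetical helper: persist all trainable parameters to a safetensors file.
fn save_weights(varmap: &VarMap) -> Result<()> {
    varmap.save("model.safetensors")
}
```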
Now that we have saved our model parameters, we can modify the code to load them. The primary change required is to make the `varmap` variable mutable:
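Loading is the mirror image; a sketch (again with an arbitrary file name), where the important part is that `load` takes a mutable `VarMap` whose variables already have the right names and shapes:
```rust
use candle_core::Result;
use candle_nn::VarMap;

// Hypothetical helper: overwrite the varmap's variables with values read from disk.
fn load_weights(varmap: &mut VarMap) -> Result<()> {
    varmap.load("model.safetensors")
}
```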
Note that loading the weights will fail if the specified file does not exist or is incompatible with the current model architecture. Implementing file existence checks and appropriate error handling is left to the user.
First, let's create a utility function `make_linear` that accepts a `VarBuilder` and returns an initialized linear layer. The `VarBuilder` constructs a `VarMap`, which is the data structure that stores our trainable parameters.
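A sketch of such a helper (the initialization scheme here is an assumption; the guide may use a different one):
```rust
use candle_core::Result;
use candle_nn::{init, Linear, VarBuilder};

fn make_linear(vs: VarBuilder, in_dim: usize, out_dim: usize) -> Result<Linear> {
    // Weight is (out_features, in_features) to match the PyTorch convention.
    let weight = vs.get_with_hints((out_dim, in_dim), "weight", init::DEFAULT_KAIMING_NORMAL)?;
    let bias = vs.get_with_hints(out_dim, "bias", init::ZERO)?;
    Ok(Linear::new(weight, Some(bias)))
}
```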
Next, let's implement a `new` method for our model class to accept a `VarBuilder` and initialize the model. We use `VarBuilder::pp` to "push prefix" so that the parameter names are organized hierarchically: the first layer weights as `first.weight` and `first.bias`, and the second layer weights as `second.weight` and `second.bias`.
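And a sketch of the corresponding constructor, reusing the `make_linear` helper from the previous snippet (the layer sizes are placeholders):
```rust
use candle_core::Result;
use candle_nn::{Linear, VarBuilder};

struct Model {
    first: Linear,
    second: Linear,
}

impl Model {
    fn new(vs: VarBuilder) -> Result<Self> {
        // `pp` pushes a prefix, so these parameters end up named
        // `first.weight`, `first.bias`, `second.weight` and `second.bias`.
        let first = make_linear(vs.pp("first"), 784, 100)?;
        let second = make_linear(vs.pp("second"), 100, 10)?;
        Ok(Self { first, second })
    }
}
```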
We now have access to all the [tensors](https://huggingface.co/bert-base-uncased?show_tensors=true) within the file.
You can check all the names of the tensors [here](https://huggingface.co/bert-base-uncased?show_tensors=true)
## Using async
`hf-hub` comes with an async API.
```bash
cargo add hf-hub --features tokio
```
```rust,ignore
# This is tested directly in examples crate because it needs external dependencies unfortunately:
# See [this](https://github.com/rust-lang/mdBook/issues/706)
{{#include ../lib.rs:book_hub_1}}
```
## Using in a real model.
Now that we have our weights, we can use them in our bert architecture:
```rust
# extern crate candle_core;
# extern crate candle_nn;
# extern crate hf_hub;
# use hf_hub::api::sync::Api;
#
# let api = Api::new().unwrap();
# let repo = api.model("bert-base-uncased".to_string());
#
# let weights = repo.get("model.safetensors").unwrap();
use candle_core::{Device, Tensor, DType};
use candle_nn::{Linear, Module};
let weights = candle_core::safetensors::load(weights, &Device::Cpu).unwrap();
let weight = weights.get("bert.encoder.layer.0.attention.self.query.weight").unwrap();
let bias = weights.get("bert.encoder.layer.0.attention.self.query.bias").unwrap();
let linear = Linear::new(weight.clone(), Some(bias.clone()));
let input_ids = Tensor::zeros((3, 768), DType::F32, &Device::Cpu).unwrap();
let output = linear.forward(&input_ids).unwrap();
```
For a full reference, you can check out the full [bert](https://github.com/LaurentMazare/candle/tree/main/candle-examples/examples/bert) example.
## Memory mapping
For more efficient loading, instead of reading the file, you could use [`memmap2`](https://docs.rs/memmap2/latest/memmap2/)
**Note**: Be careful about memory mapping; it seems to cause issues on [Windows, WSL](https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/5893)
and will definitely be slower on network mounted disks, because it will issue more read calls.
```rust,ignore
{{#include ../lib.rs:book_hub_2}}
```
**Note**: This operation is **unsafe**. [See the safety notice](https://docs.rs/memmap2/latest/memmap2/struct.Mmap.html#safety).
In practice model files should never be modified, and the mmaps should be mostly READONLY anyway, so the caveat most likely does not apply, but always keep it in mind.
## Tensor Parallel Sharding
When using multiple GPUs with tensor parallelism in order to get good latency, you can load only the part of the tensor you need.
For that you need to use [`safetensors`](https://crates.io/crates/safetensors) directly.
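A rough sketch of what this can look like, assuming `safetensors`, `memmap2` and `anyhow` as dependencies; the file name, tensor name and dtype below are placeholders for illustration:
```rust
use candle_core::{DType, Device, Tensor};
use memmap2::MmapOptions;
use safetensors::SafeTensors;
use std::fs::File;

fn main() -> anyhow::Result<()> {
    let file = File::open("model.safetensors")?;
    // Safety: we assume the file is not modified while it is mapped.
    let buffer = unsafe { MmapOptions::new().map(&file)? };
    let tensors = SafeTensors::deserialize(&buffer)?;
    let view = tensors.tensor("weight")?;
    let (rows, cols) = (view.shape()[0], view.shape()[1]);

    // Keep only the first half of the rows, e.g. the shard for rank 0 out of 2.
    let half = rows / 2;
    let bytes_per_row = cols * DType::F32.size_in_bytes();
    let shard_bytes = &view.data()[..half * bytes_per_row];
    let shard = Tensor::from_raw_buffer(shard_bytes, DType::F32, &[half, cols], &Device::Cpu)?;
    println!("{:?}", shard.shape());
    Ok(())
}
```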
//! This program implements a neural network to predict the winner of the second round of elections based on the results of the first round.
//!
//! ## Key points:
//!
//! A multilayer perceptron with two hidden layers is used. The first hidden layer has 4 neurons, the second has 2 neurons.
//! The input is a vector of 2 numbers - the percentage of votes for the first and second candidates in the first stage.
//! The output is the number 0 or 1, where 1 means that the first candidate will win in the second stage, 0 means that he will lose.
//! For training, samples with real data on the results of the first and second stages of different elections are used.
//! The model is trained by backpropagation using gradient descent and the cross-entropy loss function.
//! Model parameters (weights of neurons) are initialized randomly, then optimized during training.
//! After training, the model is tested on a deferred sample to evaluate the accuracy.
//! If the accuracy on the test set is below 100%, the model is considered underfit and the learning process is repeated.
//! Thus, this neural network learns to find hidden relationships between the results of the first and second rounds of voting in order to make predictions for new data.
Tracing is a powerful tool for identifying performance issues and bottlenecks in code.
> Profiling on GPUs is trickier due to asynchronous execution, see the [GPU section](#gpu).
## Overview
Candle uses the [tracing](https://docs.rs/tracing/latest/tracing/) crate for instrumentation.
To try it out, run an example in `candle-examples` with the `--tracing` flag.
This generates a trace file, typically named `trace-<timestamp>.json`.
You can view the trace in Chrome by navigating to `chrome://tracing/`, clicking **Load**, and selecting the generated trace file.
## Adding Tracing
Candle includes built-in tracing for many internal operations, using [spans](https://docs.rs/tracing/latest/tracing/struct.Span.html) to mark key points of execution.
To add custom tracing in your code, you can define a span like this:
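For example (the span name here is arbitrary):
```rust
let span = tracing::span!(tracing::Level::TRACE, "my-custom-op");
```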
Then, to record the span during execution, create a guard:
```rust
let _enter = span.enter();
```
This guard will record the span's duration, from when it is created to when it is dropped, into a global data structure managed by the tracing crate.
## Recording and Saving a Trace
To capture and save trace data, you need to configure the tracing system with an output format. Candle uses the [tracing_subscriber](https://docs.rs/tracing-subscriber/latest/tracing_subscriber/) and [tracing_chrome](https://docs.rs/tracing-chrome/latest/tracing_chrome/) crates.
The snippet below sets up a Chrome compatible recorder that logs all tracing activity between creation and drop of the guard:
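A minimal sketch of that setup, similar to what the candle examples do when `--tracing` is passed:
```rust
use tracing_chrome::ChromeLayerBuilder;
use tracing_subscriber::prelude::*;

// Keep `_guard` alive for as long as you want to record; the trace file is
// flushed to disk when it is dropped.
let (chrome_layer, _guard) = ChromeLayerBuilder::new().build();
tracing_subscriber::registry().with(chrome_layer).init();
```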
When using CUDA, Metal, or other asynchronous GPU backends, tracing may produce misleading timing data because operations are queued rather than executed immediately.
### CUDA
For CUDA-specific profiling, you have two options:
1. Set the environment variable `CUDA_LAUNCH_BLOCKING=1` which forces synchronous execution. This makes trace timings more accurate, at the cost of reduced performance.
2. Use [NVIDIA's Nsight Systems](https://developer.nvidia.com/nsight-systems) (`nsys profile` and `nsys-ui`) which are designed specifically for profiling asynchronous CUDA executions.
We recommend using NVIDIA's Nsight Systems when possible, as it offers accurate performance data without altering typical execution patterns. In contrast, setting the `CUDA_LAUNCH_BLOCKING` environment variable forces synchronous execution, which can significantly alter execution behavior.
#### Performance Profiling with NVIDIA Nsight Systems
1. Generate an `.nsys-rep` file containing performance data ([docs](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#example-single-command-lines))
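For example, a possible command line (the binary name and output file are placeholders):
```bash
# Profile a release-mode example binary and write my_profile.nsys-rep.
nsys profile -o my_profile ./target/release/examples/my-example
```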
This program implements a neural network to predict the winner of the second round of elections based on the results of the first round.
Key points:
1. A multilayer perceptron with two hidden layers is used. The first hidden layer has 4 neurons, the second has 2 neurons.
2. The input is a vector of 2 numbers - the percentage of votes for the first and second candidates in the first stage.
3. The output is the number 0 or 1, where 1 means that the first candidate will win in the second stage, 0 means that he will lose.
4. For training, samples with real data on the results of the first and second stages of different elections are used.
5. The model is trained by backpropagation using gradient descent and the cross-entropy loss function.
6. Model parameters (weights of neurons) are initialized randomly, then optimized during training.
7. After training, the model is tested on a deferred sample to evaluate the accuracy.
8. If the accuracy on the test set is below 100%, the model is considered underfit and the learning process is repeated.
Thus, this neural network learns to find hidden relationships between the results of the first and second rounds of voting in order to make predictions for new data.
//! This implementation should be in line with the [PyTorch version](https://github.com/pytorch/pytorch/blob/7b419e8513a024e172eae767e24ec1b849976b13/torch/_tensor_str.py).
//! let a = Tensor::arange(0f32, 6f32, &Device::Cpu)?.reshape((2, 3))?;
//! let b = Tensor::arange(0f32, 12f32, &Device::Cpu)?.reshape((3, 4))?;
//!
//! let c = a.matmul(&b)?;
//!
//! # Ok(())}
//! ```
//!
//! ## Features
//!
//! - Simple syntax (looks and feels like PyTorch)
//! - CPU and Cuda backends (and M1 support)
//! - Enable serverless (CPU) small and fast deployments
//! - Model training
//! Python can really add overhead in more complex workflows and the [GIL](https://www.backblaze.com/blog/the-python-gil-past-present-and-future/) is a notorious source of headaches.
//!
//! Rust is cool, and a lot of the HF ecosystem already has Rust crates [safetensors](https://github.com/huggingface/safetensors) and [tokenizers](https://github.com/huggingface/tokenizers)
//!
//! ## Other Crates
//!
//! Candle consists of a number of crates. This crate holds the core common data structures, but you may wish
//! to look at the docs for the other crates which can be found here:
//!
//! - [candle-core](https://docs.rs/candle-core/). Core Datastructures and DataTypes.
//! - [candle-nn](https://docs.rs/candle-nn/). Building blocks for Neural Nets.
//! - [candle-datasets](https://docs.rs/candle-datasets/). Rust access to commonly used Datasets like MNIST.
//! - [candle-examples](https://docs.rs/candle-examples/). Examples of Candle in Use.
//! - [candle-onnx](https://docs.rs/candle-onnx/). Loading and using ONNX models.
//! - [candle-pyo3](https://docs.rs/candle-pyo3/). Access to Candle from Python.
//! - [candle-transformers](https://docs.rs/candle-transformers/). Candle implementation of many published transformer models.
/// on a single command buffer. Using a single command buffer would be fastest on the GPU but
/// prevents overlapping of CPU and GPU commands (because command buffer needs to be committed
/// to start to work).
/// Despite what the documentation says, command buffers are NOT ordered. They are ordered
/// by their START time, but there's no guarantee that command buffer1 will finish before
/// command buffer2 starts (or there are metal bugs there)
command_buffer: CommandBuffer,
/// Keeps track of the current amount of compute command encoders on the current
/// command buffer
/// Arc, RwLock because of the interior mutability.
command_buffer_index: usize,
/// The maximum amount of [compute command encoder](https://developer.apple.com/documentation/metal/mtlcomputecommandencoder?language=objc) per [command buffer](https://developer.apple.com/documentation/metal/mtlcommandbuffer?language=objc)