* Add the Mixtral model.
* Add more of the mixtral layers.
* Add the final layers for mixtral.
* Sketch the expert selection.
* Add some expert routing logic.
* Hopefully finish the routing logic for mixtral.
* Add the mixtral example.
* Fix the weight filenames.
* Bugfix.
* Another fix.
* Yet another fix + remove the unused pragma.
* Shape fix.
* Support for quantized mixtral.
* Support mixtral in the quantized example.
* Mlp or moe type.
* Fix the expert field namings.
* Refactor the mlp bit.
* More MoE logic.
* Add the MoE quantized logic.
* Fix the experts length.
* Add OpenChat to quantized examples
* Add chat prompt
* Make the openchat example more in line with the other models.
* Fix a typo.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* Support the shape op in ONNX.
* Share the axis normalization bits.
* Add some limited support for gather.
* Unsqueeze.
* Comparison with broadcasting.
* Add Not + handle i32.
* Tweaks for the quantized model.
* Adds check for 7b-zephyr and uses correct template
* Handle zephyr as mistral.
* Disable the protoc bits of the CI.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Add a couple functions required for yolo.
* Add the yolo-v3 example.
* Add minimum and maximum.
* Use the newly introduced maximum.
* Cuda support for min/max + add some testing.
* Allow for more tests to work with accelerate.
* Fix a typo.
* Separate the prompt stats from the post-prompt ones in the quantized example.
* Slightly nicer output printing.
* Line up with the llama.cpp implementation.
* Print the detected arch options.
* Add the q6k quantization.
* Add a currently broken test.
* Bugfix.
* Bugfix.
* Another bugfix.
* Another bugfix + get the test to work.