* Use cfg to seperate benchmark results based on features
* Add metal where_cond for f16 and bf16. Add benchmark
* Remove allow pragma
* Avoid some unnecessary returns.
* Improve benchmarks layout
* Updated feature separated benchmarks
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
* Build example kernels.
* Add some sample custom kernel.
* Get the example kernel to compile.
* Add some cuda code.
* More cuda custom op.
* More cuda custom ops.