Commit Graph

23 Commits

SHA1 Message Date
1ad5baecc5 Handle transposed matrices in cublas. 2023-06-26 17:49:29 +01:00
3761f02aa8 Use atomicAdd as a quick workaround for a cuda synchronisation issue. 2023-06-26 16:31:24 +01:00
f2ac5547fc Avoid the race condition on cuda sums. 2023-06-26 16:19:06 +01:00
cd2a171c06 Add the where kernels. 2023-06-26 13:25:02 +01:00
f6104c4b64 Add the reduce-sum kernel. 2023-06-26 12:35:26 +01:00
16f0f5b9d2 Add a cuda kernel for embeddings. 2023-06-26 11:47:57 +01:00
8ed350dc94 Add a couple of unary ops. 2023-06-23 20:19:20 +01:00
88187b784b Also optimize the contiguous case for the binary cuda kernels. 2023-06-23 19:04:13 +01:00
5ca309ecb0 Optimize the unary cuda kernels for the contiguous case. 2023-06-23 18:40:15 +01:00
4c8931d2e4 More u32 support. 2023-06-23 14:54:03 +01:00
f8848db001 Fix the gelu kernel for f16. 2023-06-23 13:38:54 +01:00
09b7731b8d Fix unary op. 2023-06-23 13:10:26 +02:00
56ae71dd4c Address comments. 2023-06-23 13:08:04 +02:00
fd21c708ab Creating Gelu op (no backward). 2023-06-23 13:07:39 +02:00
1a90f9d3a6 Cuda implementation for copying data around. 2023-06-23 11:18:29 +01:00
065b7a19c7 Stride support for unary ops. 2023-06-22 15:46:34 +01:00
5b1ab5b687 Support strides in affine. 2023-06-22 15:38:42 +01:00
5276755fb3 Add cuda support for unary ops. 2023-06-22 15:12:59 +01:00
b8f514d9c6 Add more binary kernels. 2023-06-22 14:07:02 +01:00
e1eb86db61 Add some first binary op (add). 2023-06-22 13:52:02 +01:00
83d6198009 Simplify the binary kernels. 2023-06-22 13:16:03 +01:00
4b1c3405e9 Add a couple cuda kernels from dfdx. 2023-06-22 12:56:29 +01:00
083ced4428 Integrate the kernels bits. 2023-06-22 09:59:00 +01:00