6708870e63
Add the alloc_uninit function. ( #1901 )
...
* Add the alloc_uninit function.
* Dummy metal fix.
* Lazy initialization.
2024-03-22 07:25:23 +01:00
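The motivation, as a rough sketch: when the destination buffer is guaranteed to be fully overwritten (as in a copy or cat), zero-initializing it first is wasted work. The function below is illustrative only and does not reflect candle's internal alloc_uninit API.

```rust
// Illustrative sketch, not candle's internals: allocate without zeroing when
// every element is written before the buffer is read.
fn copy_into_fresh_buffer(src: &[f32]) -> Vec<f32> {
    // `Vec::with_capacity` reserves memory but does not zero it...
    let mut dst = Vec::with_capacity(src.len());
    // ...and every element is written here before anything reads `dst`.
    dst.extend_from_slice(src);
    dst
}
```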
ec97c98e81
Async tensor copying. ( #1900 )
2024-03-21 13:09:42 +01:00
ce9fbc3682
Optimize the cat operation on contiguous tensors ( #1855 )
...
* Add a specialized kernel for copy2d.
* Move the cat operations.
* Avoid transpositions in cat.
* Bugfix.
* Bugfix for the cuda kernel.
* Add a benchmark.
* Add more testing.
* Test fix.
* Faster kernel.
* Add the missing kernel.
* Tweak the test.
* Add a metal kernel.
* Fix for the metal kernel.
* Get the tests to pass on metal.
* Also use this opportunity to fix the metal kernel for ELU.
* Add some bf16 kernels.
* Clippy fixes.
2024-03-17 10:49:13 +01:00
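For reference, a minimal sketch of the user-facing call this speeds up; the copy2d path applies when the inputs are contiguous (shapes and dtypes below are just for illustration):

```rust
use candle_core::{DType, Device, Tensor};

fn cat_example() -> candle_core::Result<Tensor> {
    let dev = Device::Cpu;
    let a = Tensor::zeros((2, 3), DType::F32, &dev)?;
    let b = Tensor::ones((2, 5), DType::F32, &dev)?;
    // Concatenating contiguous tensors along a non-zero dim is the case the
    // specialized copy2d kernel targets.
    Tensor::cat(&[&a, &b], 1)
}
```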
3440cec3a0
Fast CPU kernel for transposed 1d convolutions. ( #1822 )
...
* Fast CPU kernel for transposed 1d convolutions.
* Bugfix.
2024-03-08 22:43:07 +01:00
a2cb2edead
Add a couple backtraces on cpu errors. ( #1738 )
2024-02-20 19:54:13 +01:00
fc67d878bb
Bugfix for conv-transpose1d ( #1734 )
...
* Add a currently broken test.
* Bugfix + fix test.
2024-02-19 09:04:49 +01:00
be4555c5a5
Add the conv-transpose1d op. ( #1251 )
...
* Skeleton structure for conv-transpose1d.
* CPU implementation for conv-transpose1d.
2023-11-03 09:44:46 +01:00
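The output length the op has to produce follows the usual transposed-convolution formula; the helper below is illustrative, not part of candle's API:

```rust
// l_out = (l_in - 1) * stride - 2 * padding + dilation * (k - 1) + output_padding + 1
fn conv_transpose1d_out_len(
    l_in: usize,
    k: usize,
    stride: usize,
    padding: usize,
    output_padding: usize,
    dilation: usize,
) -> usize {
    (l_in - 1) * stride - 2 * padding + dilation * (k - 1) + output_padding + 1
}
```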
7bbde55c61
Marian MT model ( #1210 )
...
* Skeleton files for the marian MT model.
* Marian initialization.
* Implement the attention forward method.
* Forward pass for the encoder side.
* Expose the encoder and decoder.
* Start plugging the decoder.
* Forward pass for the decoder layer.
* Set up the marian example.
* Add some missing backtraces.
* Bugfix.
2023-10-29 15:12:22 +00:00
9abeddd750
Make the cuda rng seedable. ( #1056 )
2023-10-08 09:32:36 +01:00
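A minimal sketch of what this enables, assuming `Device::set_seed` as the entry point (treat the exact call as an assumption):

```rust
use candle_core::{Device, Tensor};

fn seeded_randn() -> candle_core::Result<Tensor> {
    // Falls back to the CPU device when no CUDA device is available.
    let dev = Device::cuda_if_available(0)?;
    // Seeding the per-device rng makes randn/rand reproducible across runs.
    dev.set_seed(42)?;
    Tensor::randn(0f32, 1f32, (2, 3), &dev)
}
```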
9a465e1b26
Add 1d upsampling. ( #839 )
...
* Add 1d upsampling.
* Add the interpolate functions.
2023-09-13 16:50:39 +01:00
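A short usage sketch, assuming `upsample_nearest1d` as the exposed method and an (N, C, L) layout:

```rust
use candle_core::{Device, Tensor};

fn upsample_example() -> candle_core::Result<Tensor> {
    let dev = Device::Cpu;
    // (batch, channels, length) = (1, 2, 4), upsampled to length 8.
    let t = Tensor::randn(0f32, 1f32, (1, 2, 4), &dev)?;
    t.upsample_nearest1d(8)
}
```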
871efc0307
Bugfix for the conv2d cpu kernel. ( #820 )
2023-09-11 23:11:27 +01:00
dbd4561416
im2col version of the conv1d kernel. ( #815 )
...
* im2col version of the cuda conv1d kernel.
* im2col version of the conv1d cpu kernel.
2023-09-11 14:40:09 +01:00
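The idea behind im2col, as an illustrative (non-candle) sketch: lay each input window out as a row so the convolution reduces to a single matmul with the flattened kernel.

```rust
// Naive im2col for a 1d convolution: one row per output position, k columns.
// The convolution then becomes `rows x flattened_kernel` via a single GEMM.
fn im2col_1d(input: &[f32], k: usize, stride: usize) -> Vec<Vec<f32>> {
    let out_len = (input.len() - k) / stride + 1;
    (0..out_len)
        .map(|o| input[o * stride..o * stride + k].to_vec())
        .collect()
}
```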
70f38c2069
Proper error on unsupported dtypes when using gemm. ( #813 )
2023-09-11 12:10:51 +01:00
6fb665004c
Enable im2col on the cpu side. ( #805 )
...
* Enable im2col on the cpu side.
* Hook im2col on the cpu backend.
* Use the kernel offset.
* Avoid an unnecessary copy.
* Handle non-contiguous kernels.
* Add a const to select the conv2d kernel.
2023-09-11 09:28:13 +01:00
a4f40f3dc8
Use rayon directly rather than constraining the number of threads. ( #749 )
2023-09-05 20:26:15 +01:00
cda45a7443
Let outside CustomOp2 implementations use binary_map/binary_map_vec ( #741 )
2023-09-05 09:27:32 +01:00
84d003ff53
Handle arbitrary shapes in Tensor::new. ( #718 )
2023-09-02 19:59:21 +01:00
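A minimal sketch of the constructor handling different ranks (the values are arbitrary):

```rust
use candle_core::{Device, Tensor};

fn new_examples() -> candle_core::Result<()> {
    let dev = Device::Cpu;
    // Scalars, slices and nested arrays all go through the same constructor.
    let _scalar = Tensor::new(3.14f32, &dev)?;
    let _vector = Tensor::new(&[1f32, 2., 3.], &dev)?;
    let matrix = Tensor::new(&[[1u32, 2], [3, 4], [5, 6]], &dev)?;
    assert_eq!(matrix.dims(), &[3, 2]);
    Ok(())
}
```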
393690387f
Support dilation in conv-transpose2d. ( #671 )
2023-08-30 09:22:00 +01:00
59b731de99
Add the powf op. ( #664 )
...
* Add the powf op.
* Cuda kernels and backprop.
* Add a test.
2023-08-29 20:48:18 +01:00
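Usage is elementwise with a plain float exponent (a minimal sketch):

```rust
use candle_core::{Device, Tensor};

fn powf_example() -> candle_core::Result<Tensor> {
    let t = Tensor::new(&[1f32, 2., 3., 4.], &Device::Cpu)?;
    // Elementwise x^2.5; the exponent is an f64, not a tensor.
    t.powf(2.5)
}
```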
71221559d3
Fix the dilated convolutions. ( #659 )
2023-08-29 16:37:42 +01:00
a044907ffc
Dilated convolutions ( #657 )
...
* Add the dilation parameter.
* Restore the basic optimizer example.
* Dilation support in cudnn.
* Use the dilation parameter in the cpu backend.
* More dilation support.
* No support for dilation in transposed convolutions.
* Add dilation to a test.
* Remove a print.
* Helper function.
2023-08-29 16:12:11 +01:00
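For reference, the output length with dilation follows the standard formula; the helper below is illustrative only:

```rust
// Dilation widens the effective kernel to dilation * (k - 1) + 1.
fn conv_out_len(l_in: usize, k: usize, stride: usize, padding: usize, dilation: usize) -> usize {
    (l_in + 2 * padding - dilation * (k - 1) - 1) / stride + 1
}
```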
72fae3140c
Optimize the conv2d transpose cpu kernel. ( #644 )
...
* Optimize the conv2d transpose cpu kernel.
* Use multiple cores.
2023-08-28 20:06:31 +01:00
ca26198b95
Fix the cpu kernel for conv-transpose. ( #643 )
2023-08-28 16:45:12 +01:00
b292047882
Backprop for conv2d. ( #638 )
...
* Start adding backprop for conv2d.
* Backprop for conv2d.
* Bugfix + start adding a conv2d test.
* Conv2d backprop testing.
* More conv fixes.
2023-08-28 16:08:55 +01:00
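A minimal sketch of exercising the backward pass through candle's autograd; the shapes and the use of `Var::randn` for the weights are assumptions for illustration:

```rust
use candle_core::{Device, Tensor, Var};

fn conv2d_grad_sketch() -> candle_core::Result<Tensor> {
    let dev = Device::Cpu;
    let x = Tensor::randn(0f32, 1f32, (1, 3, 8, 8), &dev)?; // (N, C_in, H, W)
    let w = Var::randn(0f32, 1f32, (4, 3, 3, 3), &dev)?; // (C_out, C_in, kH, kW)
    // padding = 1, stride = 1, dilation = 1, groups = 1
    let y = x.conv2d(&w, 1, 1, 1, 1)?;
    let loss = y.sqr()?.sum_all()?;
    let grads = loss.backward()?;
    // Gradient of the scalar loss with respect to the convolution weights.
    Ok(grads.get(&w).expect("no gradient for w").clone())
}
```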
3cca89cc70
Add conv-transpose. ( #635 )
...
* Add conv-transpose.
* Return zeros for now.
* Naive CPU implementation.
* Add a conv-transpose test + fix the cpu implementation.
* Add a second test.
2023-08-28 10:10:12 +01:00
d8ba0452dc
Fail on bf16. ( #594 )
2023-08-25 06:10:38 +01:00
329f661d9b
Trace softmax ( #568 )
...
* Trace the softmax op.
* Inline the sum.
* Add min/max vec operations.
2023-08-23 15:25:50 +01:00
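What the traced op has to compute, written with plain tensor ops (a sketch of the numerically stable formulation, not the traced kernel itself):

```rust
use candle_core::{Tensor, D};

fn softmax_last_dim(t: &Tensor) -> candle_core::Result<Tensor> {
    // Subtracting the per-row max keeps exp() from overflowing.
    let max = t.max_keepdim(D::Minus1)?;
    let exp = t.broadcast_sub(&max)?.exp()?;
    let sum = exp.sum_keepdim(D::Minus1)?;
    exp.broadcast_div(&sum)
}
```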
9a5c7db91a
Add support for i64 ( #563 )
...
* Add the i64 dtype.
* Adapt the cuda kernels.
2023-08-23 10:42:19 +01:00
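With the new dtype, i64 data goes through the usual constructor (a minimal sketch):

```rust
use candle_core::{DType, Device, Tensor};

fn i64_example() -> candle_core::Result<()> {
    let ids = Tensor::new(&[1i64, -2, 300_000_000_000], &Device::Cpu)?;
    assert_eq!(ids.dtype(), DType::I64);
    Ok(())
}
```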
495e0b7580
Simd support ( #448 )
...
* Import the simd intrinsics in candle-core.
* simd version of reduce-sum.
* Bugfix.
* Fix some clippy lints.
2023-08-15 09:50:38 +01:00
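Illustrative only (the actual kernels use explicit simd intrinsics): splitting the accumulation across independent lanes removes the serial dependency chain, which is what lets a reduce-sum vectorize.

```rust
// Eight independent accumulators; the compiler can map these to vector lanes.
fn sum_lanes(xs: &[f32]) -> f32 {
    let mut acc = [0f32; 8];
    for chunk in xs.chunks(8) {
        for (a, x) in acc.iter_mut().zip(chunk) {
            *a += x;
        }
    }
    acc.iter().sum()
}
```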
d379a76a9e
Add a softmax bench. ( #433 )
...
* Add a softmax bench.
* Add the vectorized sum reduce.
2023-08-13 20:09:18 +01:00
9af438ac1b
Track the conv2d operations in stable-diffusion. ( #431 )
...
* Track the conv2d operations in stable-diffusion.
* Add more tracing to stable-diffusion.
* Also trace the resnet bits.
* Trace the attention blocks.
* Also trace the attention inner part.
* Small tweak.
2023-08-13 15:58:26 +01:00
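The tracing pattern, sketched with the tracing crate (the struct and span name are illustrative, not the ones used in the example):

```rust
// Each block keeps a span; entering it around forward() makes the op show up
// in the trace with its own timing.
struct TracedConv2d {
    span: tracing::Span,
}

impl TracedConv2d {
    fn new() -> Self {
        let span = tracing::span!(tracing::Level::TRACE, "conv2d");
        Self { span }
    }

    fn forward(&self) {
        let _enter = self.span.enter();
        // ... the actual conv2d work runs while the span guard is alive ...
    }
}
```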
662db45fc3
Use zero padding in conv1d and conv2d (same as PyTorch). ( #408 )
2023-08-11 14:53:05 +01:00
e29c7809ec
Parallelise the CPU kernels for the conv ops. ( #401 )
...
* Parallelise the conv2d op.
* Tighter control on threading.
* Also parallelise conv1d.
* Add some safety comment.
2023-08-11 05:51:58 +01:00
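The parallelization pattern, as a sketch: each output row (or channel) is independent, so the outer loop can be handed to rayon directly. The function below is illustrative, not the conv kernel itself.

```rust
use rayon::prelude::*;

fn scale_rows(out: &mut [f32], row_len: usize, factor: f32) {
    // One rayon task per row; no two tasks touch the same output element.
    out.par_chunks_mut(row_len).for_each(|row| {
        for v in row.iter_mut() {
            *v *= factor;
        }
    });
}
```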
a325c1aa50
Upsample test + bugfix. ( #399 )
2023-08-10 21:02:35 +02:00
94eff56aee
Optimize the cpu conv2d kernel ( #396 )
...
* Conv2d simd optimization.
* Fix the contiguous copying.
* Small tweak.
2023-08-10 17:40:09 +01:00
c8039579a5
Optimize conv1d. ( #392 )
...
* Reorder the conv1d loops in the cpu backend.
* Optimize the 1d convolution.
* Further conv1d optimization.
* Fix some clippy lints.
2023-08-10 15:23:52 +01:00
c7f92f985e
Further randn tweaks: use the appropriate rng rather than the f64 one, some cleanup. ( #383 )
2023-08-10 05:48:19 +01:00
3bbc08a8df
Fix randn cpu ( #382 )
...
* Change distributions: Standard generates in [0, 1), whereas Normal is the correct choice here.
* Add a test (not sure if this is the best place to put it).
* Remove an unnecessary use.
2023-08-10 05:33:44 +01:00
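The distinction behind the fix, sketched with the rand/rand_distr crates: Standard yields uniform samples in [0, 1), while Normal is the gaussian that randn should draw from.

```rust
use rand::prelude::*;
use rand_distr::Normal;

fn sample_both() -> (f32, f32) {
    let mut rng = rand::thread_rng();
    // Uniform in [0, 1): what the buggy version was drawing from.
    let uniform: f32 = rng.sample(rand::distributions::Standard);
    // Gaussian with mean 0 and std 1: what randn actually needs.
    let normal = Normal::new(0f32, 1f32).expect("valid std");
    let gaussian: f32 = rng.sample(normal);
    (uniform, gaussian)
}
```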
fcfdcbd337
Add a conv1d benchmark based on the whisper sizes. ( #377 )
...
* Add a conv1d benchmark based on the whisper sizes.
* Enforce the batch-dim in conv1d.
2023-08-09 20:27:03 +01:00
a5c5a893aa
Add max_pool2d. ( #371 )
...
Co-authored-by: 赵理山 <ls@zhaolishandeMacBook-Air.local>
2023-08-09 18:05:26 +01:00
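A short usage sketch for the pooling ops on an (N, C, H, W) tensor; the companion avg_pool2d call is shown alongside it, and the exact method names are taken from the current API:

```rust
use candle_core::{Device, Tensor};

fn pooling_example() -> candle_core::Result<(Tensor, Tensor)> {
    let dev = Device::Cpu;
    let t = Tensor::randn(0f32, 1f32, (1, 3, 8, 8), &dev)?; // (N, C, H, W)
    // 2x2 windows; both calls halve the spatial dims to 4x4 here.
    let maxed = t.max_pool2d(2)?;
    let avged = t.avg_pool2d(2)?;
    Ok((maxed, avged))
}
```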
1892bd139c
Extract the strides in the conv ops. ( #370 )
2023-08-09 17:57:05 +01:00
cd225bd3b1
More testing for avg-pool2d. ( #366 )
...
* More testing for avg-pool2d.
* Another fix.
* Add a max-pool test with non-divisible kernel sizes.
2023-08-09 16:12:23 +01:00
b80348d22f
Bugfix for avg-pool + add some test. ( #365 )
2023-08-09 15:44:16 +01:00
dbc6f281c9
Conv1d test with padding. ( #356 )
2023-08-09 05:45:38 +01:00
cf965ecaa8
Simplify the conv1d and conv2d code. ( #352 )
2023-08-08 22:10:59 +01:00
608b2358c6
Add some conv1d test + bugfix using padding. ( #349 )
2023-08-08 20:50:20 +01:00
1e6dbeac01
Add some conv2d tests. ( #347 )
...
* Add some conv2d tests.
* Add a simpler conv2d test.
* More conv2d testing + bugfix.
* Add a todo.
2023-08-08 19:02:42 +01:00
13ce68ff9b
Bugfix for conv2d. ( #343 )
2023-08-08 15:20:00 +01:00
ab35684326
Naive implementation for conv2d. ( #341 )
2023-08-08 06:34:36 +01:00
b5bb5e056d
Add more conv2d support. ( #340 )
...
* Add more conv2d support.
* Conv2d cpu work.
* Conv2d output shape.
2023-08-08 06:04:32 +01:00
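A usage sketch against the current conv2d signature (padding, stride, dilation, groups as plain usize arguments; treat the exact parameter order as an assumption), together with the output shape it is expected to produce:

```rust
use candle_core::{Device, Tensor};

fn conv2d_example() -> candle_core::Result<()> {
    let dev = Device::Cpu;
    let x = Tensor::randn(0f32, 1f32, (1, 3, 16, 16), &dev)?; // (N, C_in, H, W)
    let w = Tensor::randn(0f32, 1f32, (8, 3, 3, 3), &dev)?; // (C_out, C_in, kH, kW)
    // padding = 1, stride = 1, dilation = 1, groups = 1
    let y = x.conv2d(&w, 1, 1, 1, 1)?;
    // H_out = (16 + 2*1 - (3 - 1) - 1) / 1 + 1 = 16
    assert_eq!(y.dims(), &[1, 8, 16, 16]);
    Ok(())
}
```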