Training:

- Removed a lot of API surface (`SerializedFileReader` ownership is really
  painful).
- Moved example + vision to hf.co version.
- Removed feature gate.
This commit is contained in:
Nicolas Patry
2023-08-14 17:23:08 +02:00
parent dd02f589c0
commit d7a273be51
7 changed files with 148 additions and 16 deletions


@@ -6,19 +6,19 @@ start with the Hello world dataset of machine learning, MNIST.
Let's start with downloading `MNIST` from [huggingface](https://huggingface.co/datasets/mnist).
-This requires `candle-datasets` with the `hub` feature.
+This requires [`hf-hub`](https://github.com/huggingface/hf-hub).
```bash
-cargo add candle-datasets --features hub
+cargo add hf-hub
```
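Before the snippet below, it can help to see roughly what fetching those files looks like with `hf-hub`'s synchronous API. This is a hedged sketch, not the book's actual code: the exact parquet file name on the hub is an assumption here.

```rust,ignore
use hf_hub::{api::sync::Api, Repo, RepoType};

// Sketch only: point at the `refs/convert/parquet` branch of the
// `mnist` dataset and download a single file to the local cache.
// The file name `mnist/test/0000.parquet` is an assumed example.
let api = Api::new()?;
let repo = api.repo(Repo::with_revision(
    "mnist".to_string(),
    RepoType::Dataset,
    "refs/convert/parquet".to_string(),
));
let local_path = repo.get("mnist/test/0000.parquet")?;
```

`get` returns a local path to the cached file, which is what we then hand to the parquet reader.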
This is going to be very hands-on for now.
```rust,ignore
{{#include ../../../candle-examples/src/lib.rs:book_training_1}}
```
This uses the standardized `parquet` files from the `refs/convert/parquet` branch on every dataset.
-`files` is now a `Vec` of [`parquet::file::serialized_reader::SerializedFileReader`].
+Our handles are now [`parquet::file::serialized_reader::SerializedFileReader`].
We can inspect the content of the files with:
@@ -37,5 +37,3 @@ Column id 0, name image, value {bytes: [137, ....]
So each row contains two columns (image, label), with the image saved as bytes.
Let's put them into a useful struct.
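A minimal, hypothetical shape for such a struct (the book's actual version lives in `candle-examples`; the names here are made up): raw pixel bytes and labels, kept side by side in memory.

```rust
// Hypothetical in-memory container for the decoded rows.
struct MnistBatch {
    images: Vec<u8>, // 28 * 28 = 784 bytes per image, concatenated
    labels: Vec<u8>, // one label in 0..=9 per image
}

impl MnistBatch {
    fn len(&self) -> usize {
        self.labels.len()
    }
}

fn main() {
    // Two fake images worth of zeroed pixel bytes.
    let batch = MnistBatch {
        images: vec![0u8; 2 * 784],
        labels: vec![3, 7],
    };
    assert_eq!(batch.len(), 2);
    assert_eq!(batch.images.len(), batch.len() * 784);
    println!("{} images", batch.len());
}
```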


@@ -1 +1,10 @@
# MNIST
Now that we have downloaded the MNIST parquet files, let's put them into a simple struct.
```rust,ignore
{{#include ../../../candle-examples/src/lib.rs:book_training_3}}
```
Parsing the files and loading them into single tensors requires the entire dataset to fit in memory.
It is quite rudimentary, but simple enough for a small dataset like MNIST.
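To see why fitting everything in memory is fine here, a quick back-of-the-envelope calculation (assuming the usual 60,000 training images of 28x28 grayscale pixels, one byte per pixel once decoded):

```rust
fn main() {
    // MNIST train split: 60_000 images, 28x28 grayscale pixels,
    // one byte per pixel after decoding.
    let images: usize = 60_000;
    let bytes_per_image: usize = 28 * 28; // 784
    let total_bytes = images * bytes_per_image;
    let total_mib = total_bytes / (1024 * 1024);
    println!("~{total_mib} MiB"); // prints "~44 MiB"
}
```

Roughly 45 MiB of pixel data, so holding the whole dataset as a single tensor is entirely reasonable on any modern machine.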