Training:

- Removed a lot of API surface (`SerializedFileReader` ownership is really
  painful).
- Moved example + vision to hf.co version.
- Removed feature gate.
This commit is contained in:
Nicolas Patry
2023-08-14 17:23:08 +02:00
parent dd02f589c0
commit d7a273be51
7 changed files with 148 additions and 16 deletions


@@ -6,19 +6,19 @@ start with the Hello world dataset of machine learning, MNIST.
Let's start with downloading `MNIST` from [huggingface](https://huggingface.co/datasets/mnist).
-This requires `candle-datasets` with the `hub` feature.
+This requires [`hf-hub`](https://github.com/huggingface/hf-hub).
```bash
-cargo add candle-datasets --features hub
+cargo add hf-hub
```
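Before the snippet below, it can help to see roughly what fetching those files looks like with `hf-hub`'s synchronous API. This is a hedged sketch, not the book's actual code: the exact parquet file name on the hub is an assumption here.

```rust,ignore
use hf_hub::{api::sync::Api, Repo, RepoType};

// Sketch only: point at the `refs/convert/parquet` branch of the
// `mnist` dataset and download a single file to the local cache.
// The file name `mnist/test/0000.parquet` is an assumed example.
let api = Api::new()?;
let repo = api.repo(Repo::with_revision(
    "mnist".to_string(),
    RepoType::Dataset,
    "refs/convert/parquet".to_string(),
));
let local_path = repo.get("mnist/test/0000.parquet")?;
```

`get` returns a local path to the cached file, which is what we then hand to the parquet reader.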
This is going to be very hands-on for now.
```rust,ignore
{{#include ../../../candle-examples/src/lib.rs:book_training_1}}
```
This uses the standardized `parquet` files from the `refs/convert/parquet` branch on every dataset.
-`files` is now a `Vec` of [`parquet::file::serialized_reader::SerializedFileReader`].
+Our handles are now [`parquet::file::serialized_reader::SerializedFileReader`].
We can inspect the content of the files with:
@@ -37,5 +37,3 @@ Column id 0, name image, value {bytes: [137, ....]
So each row contains two columns (image, label), with the image saved as bytes.
Let's put them into a useful struct.
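A minimal, hypothetical shape for such a struct (the book's actual version lives in `candle-examples`; the names here are made up): raw pixel bytes and labels, kept side by side in memory.

```rust
// Hypothetical in-memory container for the decoded rows.
struct MnistBatch {
    images: Vec<u8>, // 28 * 28 = 784 bytes per image, concatenated
    labels: Vec<u8>, // one label in 0..=9 per image
}

impl MnistBatch {
    fn len(&self) -> usize {
        self.labels.len()
    }
}

fn main() {
    // Two fake images worth of zeroed pixel bytes.
    let batch = MnistBatch {
        images: vec![0u8; 2 * 784],
        labels: vec![3, 7],
    };
    assert_eq!(batch.len(), 2);
    assert_eq!(batch.images.len(), batch.len() * 784);
    println!("{} images", batch.len());
}
```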


@@ -1 +1,10 @@
# MNIST
Now that we have downloaded the MNIST parquet files, let's put them into a simple struct.
```rust,ignore
{{#include ../../../candle-examples/src/lib.rs:book_training_3}}
```
Parsing the files and loading them into single tensors requires the entire dataset to fit in memory.
It is quite rudimentary, but simple enough for a small dataset like MNIST.
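To see why fitting everything in memory is fine here, a quick back-of-the-envelope calculation (assuming the usual 60,000 training images of 28x28 grayscale pixels, one byte per pixel once decoded):

```rust
fn main() {
    // MNIST train split: 60_000 images, 28x28 grayscale pixels,
    // one byte per pixel after decoding.
    let images: usize = 60_000;
    let bytes_per_image: usize = 28 * 28; // 784
    let total_bytes = images * bytes_per_image;
    let total_mib = total_bytes / (1024 * 1024);
    println!("~{total_mib} MiB"); // prints "~44 MiB"
}
```

Roughly 45 MiB of pixel data, so holding the whole dataset as a single tensor is entirely reasonable on any modern machine.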