2. Datasets

Datasets are an important part of developing machine learning algorithms. In this part of the onboarding, you will get to know DVC and add a dataset yourself.

2.1. Prepare the Data

In a real project, acquiring and preparing the data is an important and often challenging part of development. However, we don’t want you to spend days manually labeling images, so we have already prepared a complete dataset for you.

You can download the data by running

cd kitcar_ml/onboarding && python3 setup.py

2.2. Explore the data

You should now see a new folder dataset that contains many images and two .yaml files:

dataset
├── ...
├── img_9990.jpg
├── img_9998.jpg
├── test.yaml
└── train.yaml

If you look at the images, you will see digits from 0 to 9. The files train.yaml and test.yaml contain our labels. They will later be used to train and test our neural networks to predict which digit each image shows.

Familiarize yourself with the dataset. Read through our datasets tutorial and make sure you understand how to open the dataset using Python.
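For a first look before working through the tutorial, a minimal sketch like the following can count the images and print the first lines of each label file. It uses only the Python standard library and deliberately does not parse the YAML, because the label layout is not described here. The function name summarize_dataset and its parameters are hypothetical and not part of our code base; the datasets tutorial remains the reference for how datasets are actually opened.

```python
from pathlib import Path


def summarize_dataset(root="dataset", preview_lines=5):
    """Count the images and preview the label files in a dataset folder.

    Hypothetical helper for exploration only; not part of the project API.
    """
    root = Path(root)
    # Assumption: images follow the img_<number>.jpg naming from the listing.
    images = sorted(root.glob("img_*.jpg"))
    summary = {"num_images": len(images), "label_previews": {}}
    for name in ("train.yaml", "test.yaml"):
        label_file = root / name
        if label_file.is_file():
            # Read only the first few lines as plain text; no label
            # structure is assumed.
            with label_file.open(encoding="utf-8") as f:
                lines = [line.rstrip("\n") for _, line in zip(range(preview_lines), f)]
            summary["label_previews"][name] = lines
    return summary


if __name__ == "__main__":
    info = summarize_dataset()
    print(f"{info['num_images']} images found")
    for name, lines in info["label_previews"].items():
        print(f"--- {name} ---")
        print("\n".join(lines))
```

Run it from the directory that contains the dataset folder. Seeing the raw lines of train.yaml first should make it easier to follow how the tutorial maps labels to images.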

2.3. Adding Datasets

Until now, the dataset is nothing more than a folder on your computer. Whenever we develop code, we rely on git to version control and share it. To version control and share datasets, we use a tool called DVC. To find out more about DVC, you can (and will probably need to) read the docs about it here.

Your Task

Add and commit the new dataset using DVC and git.

Tip

You should see a new dataset.dvc file afterward. Be careful to add /dataset to the .gitignore file so that you do not commit the complete dataset to git.
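Before committing, the two points from the tip can be checked with a small script. The sketch below is only an illustration: check_dvc_tracking is a hypothetical helper, and its .gitignore check is simplified to a few literal spellings of the entry rather than full gitignore pattern matching.

```python
from pathlib import Path


def check_dvc_tracking(repo_root=".", name="dataset"):
    """Return a list of problems; an empty list means both checks passed.

    Hypothetical helper; not part of DVC or of the project.
    """
    repo_root = Path(repo_root)
    problems = []
    # The tip says a dataset.dvc file should exist after adding the dataset.
    if not (repo_root / f"{name}.dvc").is_file():
        problems.append(f"{name}.dvc is missing")
    # Simplified check: only literal spellings of the ignore entry are
    # accepted, not arbitrary gitignore patterns.
    gitignore = repo_root / ".gitignore"
    entries = set()
    if gitignore.is_file():
        text = gitignore.read_text(encoding="utf-8")
        entries = {line.strip() for line in text.splitlines()}
    accepted = {f"/{name}", f"/{name}/", name, f"{name}/"}
    if not entries & accepted:
        problems.append(f"/{name} is not listed in .gitignore")
    return problems


if __name__ == "__main__":
    for problem in check_dvc_tracking():
        print(problem)
```

If the script prints nothing when run next to the dataset folder, both conditions from the tip hold. It does not replace checking git status before you commit.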

If you want to, you can also push your data to our server with

dvc push dataset.dvc