Datasets
========

Datasets are an important part of developing machine learning algorithms.
In this part of the onboarding, you will get to know DVC and add a dataset yourself.

Prepare the Data
^^^^^^^^^^^^^^^^

In a real project, acquiring and preparing the data is an important and often challenging part of development.
However, we don't want you to spend days manually labeling images, so we have already prepared a complete dataset for you.

You can download the data by running

.. prompt:: bash

  cd kitcar_ml/onboarding && python3 setup.py


Explore the Data
^^^^^^^^^^^^^^^^

You should now see a new folder ``dataset`` that contains many images and two **.yaml** files:

.. code:: bash

  dataset
  ├── ...
  ├── img_9990.jpg
  ├── img_9998.jpg
  ├── test.yaml
  └── train.yaml

If you look at the images, you will see the digits 0 to 9. ``train.yaml`` and ``test.yaml`` contain the labels.
They will later be used to train and test neural networks that predict which digit each image shows.

Familiarize yourself with the dataset. Read through our :ref:`datasets <datasets>` tutorial and make sure you understand how to open the dataset using Python.
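
Before opening the labels, it can help to check what the folder actually contains.
The following is a minimal sketch that uses only the Python standard library; the helper name ``summarize_dataset`` is hypothetical and not part of our code base.
It only counts the ``.jpg`` images and lists the ``.yaml`` files. It does not parse the labels, because their format is described in the datasets tutorial.

.. code:: python

  from pathlib import Path


  def summarize_dataset(folder):
      """Count the images and list the label files in a dataset folder.

      Hypothetical helper for illustration only.
      """
      root = Path(folder)
      # Images are assumed to be the .jpg files directly inside the folder.
      images = sorted(root.glob("*.jpg"))
      # Label files are assumed to be the .yaml files next to the images.
      label_files = sorted(path.name for path in root.glob("*.yaml"))
      return {"num_images": len(images), "label_files": label_files}

Calling ``summarize_dataset("dataset")`` from the ``onboarding`` folder should report both ``test.yaml`` and ``train.yaml`` if the download succeeded.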

Adding Datasets
^^^^^^^^^^^^^^^

So far, the dataset is nothing more than a folder on your computer. Whenever we develop code, we rely on git to version control and share it.
To version control and share datasets, we use a tool called DVC.
To find out more about DVC, you can (and will probably need to) read the docs :ref:`here<DVC Tutorial>`.

.. admonition:: Your Task

  Add and commit the new dataset using DVC and git.

.. tip::

  You should see a new ``dataset.dvc`` file afterward.
  Be careful to add ``/dataset`` to the ``.gitignore`` file and not to commit the complete dataset to git.

If you want to, you can also push your data to our server with

.. prompt:: bash

  dvc push dataset.dvc