.. _datasets:

Datasets
========

Developing data-hungry code, e.g. machine learning applications, comes with the burden
of creating and maintaining datasets.
While creating a dataset might seem to be most of the work, maintaining it is often
the hard part.

Our dataset format
------------------

We have defined our own labeled dataset format that uses a single *.yaml* file with

* a description of all class ids,
* a description of all attributes, and
* a list of all labels.

.. admonition:: Example

   Let's consider a small dataset of three images::

      tree .
      .
      ├── frame1.png
      ├── frame2.png
      ├── frame3.png
      └── labels.yaml

   The *labels.yaml* can look like this:

   .. literalinclude:: resources/labels.yaml
      :language: yaml

Why?
^^^^

* Python can easily open and construct classes out of yaml files
* Class descriptions are easier to understand than integer IDs
* Attributes are accessible at the beginning of each label file

How to open a dataset?
^^^^^^^^^^^^^^^^^^^^^^

.. doctest::
   :skip:

   >>> import torch # doctest: +SKIP
   >>> from kitcar_ml.utils.data.labeled_dataset import LabeledDataset # doctest: +SKIP
   >>> dataset = LabeledDataset.from_yaml("path/to/label/file") # doctest: +SKIP
   >>> dataloader = torch.utils.data.DataLoader(
   ...     dataset,
   ...     batch_size=2,
   ...     shuffle=False,
   ...     num_workers=1,
   ... ) # doctest: +SKIP
   >>> for i, data in enumerate(dataloader): # doctest: +SKIP
   ...     pass

Our datasets
------------

Here are all datasets we currently have set up and how you can download them.

1. Real images
^^^^^^^^^^^^^^

**Old labeled data**

On the webdav we have a labeled dataset created for the first traffic sign recognition
network. The dataset has a different format, so it needs to be converted first.
With a :ref:`script ` you can convert it to our current data format.
There is also a DVC stage for this conversion:

.. literalinclude:: ../../../data/pre_processed/real/dvc.yaml
   :language: yaml

You can download and convert the dataset with the following commands:

.. prompt:: bash

   dvc pull data/raw/real/old_labeled_data.dvc
   dvc repro data/pre_processed/real/dvc.yaml

2. Generated images
^^^^^^^^^^^^^^^^^^^

Somewhere between real and simulated datasets are generated images.
They are real images that are artificially modified to create a dataset automatically.

The first generated dataset contains real camera images from our vehicles
that are supplemented with images of traffic signs.

.. figure:: resources/no_generated.png
   :alt: unedited

.. figure:: resources/generated.png
   :alt: generated

.. figure:: resources/generated_debug.png
   :alt: generated

The dataset is located at ``data/pre_processed/real/generated_data``.
It can be downloaded and/or updated using the following commands:

.. prompt:: bash

   dvc pull data/pre_processed/real/generated_data
   dvc repro data/pre_processed/real/dvc.yaml

The underlying tool can be found in
:py:mod:`kitcar_ml.utils.data_generation.data_generation_tool`.

3. Simulated images
^^^^^^^^^^^^^^^^^^^

Using DVC we can easily import datasets from our simulation repository.

.. literalinclude:: ../../../data/raw/simulation/labeled_images/random_roads.dvc
   :language: yaml

.. literalinclude:: ../../../data/raw/simulation/labeled_images/random_roads_gan.dvc
   :language: yaml

The dataset contains automatically generated and labeled images from simulated roads.
It can be downloaded with the following command:

.. prompt:: bash

   dvc pull -R data/raw/simulation/labeled_images

Analyzing Datasets
------------------

Datasets can be analyzed with this tool: :py:mod:`kitcar_ml.utils.data.analyse_bbox_dataset`.

With this command you can create a detailed report for every bounding box dataset:

.. prompt:: bash

   python3 -m kitcar_ml.utils.data.analyse_bbox_dataset --label-file LABEL_FILE --output-folder OUTPUT_FOLDER

Report
^^^^^^

The analysis tool creates a comprehensive report, which looks something like this:

.. include:: resources/analysis/report.txt
   :literal:

Class Distribution
^^^^^^^^^^^^^^^^^^

The class distribution diagram shows how the classes are distributed in the dataset.

.. figure:: resources/analysis/class_distribution.png
   :alt: Class Distribution

Heatmaps
^^^^^^^^

A heat map visualizes for each point in the image the number of bounding boxes that cover it.

.. figure:: resources/analysis/heatmaps/total.png
   :alt: Total Heatmap

.. figure:: resources/analysis/heatmaps/pass_right_sign.png
   :alt: Pass Right Sign Heatmap

Scatter
^^^^^^^

A scatter plot displays a dot for the center point of every bounding box in a dataset.

.. figure:: resources/analysis/scatter/total.png
   :alt: Total Scatter

.. figure:: resources/analysis/scatter/pass_right_sign.png
   :alt: Pass Right Sign Scatter
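
The heat map and scatter definitions above can be illustrated with a small, self-contained sketch. It is not the implementation used by ``analyse_bbox_dataset``. The box format ``(x_min, y_min, x_max, y_max)`` in pixels and the helper names ``coverage_heatmap`` and ``box_centers`` are assumptions made purely for illustration.

```python
from typing import List, Tuple

# Hypothetical box format (an assumption, not read from labels.yaml):
# (x_min, y_min, x_max, y_max) in pixels, with x_max and y_max exclusive.
Box = Tuple[int, int, int, int]


def coverage_heatmap(boxes: List[Box], width: int, height: int) -> List[List[int]]:
    """Count, for every pixel, how many bounding boxes cover it."""
    heatmap = [[0] * width for _ in range(height)]
    for x_min, y_min, x_max, y_max in boxes:
        # Clip the box to the image so out-of-range labels cannot raise.
        for y in range(max(y_min, 0), min(y_max, height)):
            for x in range(max(x_min, 0), min(x_max, width)):
                heatmap[y][x] += 1
    return heatmap


def box_centers(boxes: List[Box]) -> List[Tuple[float, float]]:
    """Return the center point of every box, i.e. the dots of a scatter plot."""
    return [
        ((x_min + x_max) / 2, (y_min + y_max) / 2)
        for x_min, y_min, x_max, y_max in boxes
    ]


# Two overlapping example boxes in a 4x3 pixel image.
boxes = [(0, 0, 2, 2), (1, 1, 4, 3)]
heatmap = coverage_heatmap(boxes, width=4, height=3)
centers = box_centers(boxes)
```

In this sketch the pixel at ``x=1, y=1`` lies inside both boxes, so its heat map value is 2, while pixels covered by only one box have the value 1. Plotting ``centers`` with any plotting library would give the scatter view for the same two boxes.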