Datasets

Developing data-hungry code, such as machine learning applications, comes with the burden of creating and maintaining datasets. Creating a dataset may look like most of the work, but maintaining it is often the harder part.

Our dataset format

We have defined our own format for labeled datasets. It uses a single .yaml file with

a description of all class ids,

a description of all attributes, and

a list of all labels.

Example

Let’s consider a small dataset of three images:

tree .
.
├── frame00001.png
├── frame00002.png
├── frame00003.png
└── labels.yaml

The labels.yaml can look like this. Each label is a list of values in the order given under attributes:

attributes:
- x1
- y1
- x2
- y2
- class_id
classes:
  135: zebra_crossing_sign
  202: CrosswalkLines
labels:
  frame00001.png:
  - - 381
    - 66
    - 436
    - 113
    - 135
  - - -84
    - 135
    - 345
    - 195
    - 202
  frame00002.png: []
  frame00003.png:
  - - 745
    - 67
    - 789
    - 109
    - 135
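
To make the role of the attributes list concrete, here is a minimal sketch of how one label row could be decoded. The dictionary is written out by hand and is roughly what a YAML parser such as PyYAML's yaml.safe_load would return for the file above. Box and decode are hypothetical helpers for illustration and are not part of kitcar_ml.

```python
from dataclasses import dataclass

# Hand-written equivalent of the parsed labels.yaml shown above.
parsed = {
    "attributes": ["x1", "y1", "x2", "y2", "class_id"],
    "classes": {135: "zebra_crossing_sign", 202: "CrosswalkLines"},
    "labels": {
        "frame00001.png": [[381, 66, 436, 113, 135], [-84, 135, 345, 195, 202]],
        "frame00002.png": [],
        "frame00003.png": [[745, 67, 789, 109, 135]],
    },
}


@dataclass
class Box:  # hypothetical helper, not a kitcar_ml class
    x1: int
    y1: int
    x2: int
    y2: int
    class_id: int


def decode(parsed):
    """Turn every label row into a Box, keyed by image file name."""
    names = parsed["attributes"]
    return {
        image: [Box(**dict(zip(names, row))) for row in rows]
        for image, rows in parsed["labels"].items()
    }


boxes = decode(parsed)
first = boxes["frame00001.png"][0]
# Look up the readable class name for the integer id stored in the label.
print(first, parsed["classes"][first.class_id])
```

Because the attribute names live in the file itself, the decoding step does not need to hard-code the column order.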

Why?

Python can easily load YAML files and construct classes from them

Class names are easier to understand than integer IDs

Attributes are listed at the beginning of each label file

How to open a dataset?

>>> import torch
>>> from kitcar_ml.utils.data.labeled_dataset import LabeledDataset
>>> # Load the dataset from its label file
>>> dataset = LabeledDataset.from_yaml("path/to/label/file")
>>> # Wrap it in a DataLoader that yields batches of two samples, in order
>>> dataloader = torch.utils.data.DataLoader(
...     dataset,
...     batch_size=2,
...     shuffle=False,
...     num_workers=1,
... )
>>> for i, data in enumerate(dataloader):
...     pass  # process each batch here

Our datasets

Here are all datasets we currently have set up and how you can download them.

1. Real images

Old labeled data

On the WebDAV server we have a labeled dataset that was created for the first traffic sign recognition network.

This dataset uses a different format, so it has to be converted first.

A script converts it to our current data format.

There is also a DVC stage for this conversion (parse_old_labeled_data), defined next to the stages for the Label Studio datasets:

stages:
  parse_old_labeled_data:
    cmd: |
      python3 -m kitcar_ml.utils.data.import_kitcar_xml_labels --input-dir ../../raw/real/old_labeled_data/ --output-dir old_labeled_data
      python3 -c "from kitcar_ml.utils.data.visual_labeled_dataset import VisualLabeledDataset; vl = VisualLabeledDataset.from_yaml('old_labeled_data/labels.yaml'); vl.create_debug_images('debug',sample_size=50);"
      python3 -c "from kitcar_ml.utils.data.labeled_dataset import LabeledDataset; LabeledDataset.split_file('old_labeled_data/labels.yaml',{'train':0.7,'test':0.2,'val':0.1});"
      python3 -m kitcar_ml.utils.data.analyse_bbox_dataset --label-file "old_labeled_data/labels.yaml" --output-folder "old_labeled_data/analysis"
    deps:
    - ../../../kitcar_ml/utils/data/import_kitcar_xml_labels.py
    - ../../../kitcar_ml/utils/data/visual_labeled_dataset.py
    - ../../../kitcar_ml/utils/data/analyse_bbox_dataset.py
    - ../../../kitcar_ml/utils/data/label_conversion.yaml
    - ../../raw/real/old_labeled_data
    outs:
    - old_labeled_data
  parse_label_studio_data_standard_signs:
    cmd: |
      python3 -m kitcar_ml.utils.data.import_label_studio_labels --annotations-dir ../../raw/real/label_studio_annotations/object_detection_standard_signs --output-dir standard_signs
      python3 -c "from kitcar_ml.utils.data.visual_labeled_dataset import VisualLabeledDataset; vl = VisualLabeledDataset.from_yaml('standard_signs/labels.yaml'); vl.create_debug_images('debug',sample_size=50);"
      python3 -c "from kitcar_ml.utils.data.labeled_dataset import LabeledDataset; LabeledDataset.split_file('standard_signs/labels.yaml',{'train':0.8,'val':0.2});"
      python3 -m kitcar_ml.utils.data.analyse_bbox_dataset --label-file "standard_signs/labels.yaml" --output-folder "standard_signs/analysis"
    deps:
    - ../../../kitcar_ml/utils/data/import_label_studio_labels.py
    - ../../../kitcar_ml/utils/data/visual_labeled_dataset.py
    - ../../../kitcar_ml/utils/data/analyse_bbox_dataset.py
    - ../../raw/real/label_studio_annotations/object_detection_standard_signs/
    outs:
    - standard_signs:
        persist: true
  parse_label_studio_data_speed_limits:
    cmd: |
      python3 -m kitcar_ml.utils.data.import_label_studio_labels --annotations-dir ../../raw/real/label_studio_annotations/object_detection_speed_limits --output-dir speed_limits
      python3 -c "from kitcar_ml.utils.data.visual_labeled_dataset import VisualLabeledDataset; vl = VisualLabeledDataset.from_yaml('speed_limits/labels.yaml'); vl.create_debug_images('debug',sample_size=50);"
      python3 -c "from kitcar_ml.utils.data.labeled_dataset import LabeledDataset; LabeledDataset.split_file('speed_limits/labels.yaml',{'train':0.8,'val':0.2});"
      python3 -m kitcar_ml.utils.data.analyse_bbox_dataset --label-file "speed_limits/labels.yaml" --output-folder "speed_limits/analysis"
    deps:
    - ../../../kitcar_ml/utils/data/import_label_studio_labels.py
    - ../../../kitcar_ml/utils/data/visual_labeled_dataset.py
    - ../../../kitcar_ml/utils/data/analyse_bbox_dataset.py
    - ../../raw/real/label_studio_annotations/object_detection_speed_limits/
    outs:
    - speed_limits:
        persist: true
  parse_label_studio_data_obstacles:
    cmd: |
      python3 -m kitcar_ml.utils.data.import_label_studio_labels --annotations-dir ../../raw/real/label_studio_annotations/object_detection_obstacles --output-dir obstacles
      python3 -c "from kitcar_ml.utils.data.visual_labeled_dataset import VisualLabeledDataset; vl = VisualLabeledDataset.from_yaml('obstacles/labels.yaml'); vl.create_debug_images('debug',sample_size=50);"
      python3 -c "from kitcar_ml.utils.data.labeled_dataset import LabeledDataset; LabeledDataset.split_file('obstacles/labels.yaml',{'train':0.8,'val':0.2});"
      python3 -m kitcar_ml.utils.data.analyse_bbox_dataset --label-file "obstacles/labels.yaml" --output-folder "obstacles/analysis"
    deps:
    - ../../../kitcar_ml/utils/data/import_label_studio_labels.py
    - ../../../kitcar_ml/utils/data/visual_labeled_dataset.py
    - ../../../kitcar_ml/utils/data/analyse_bbox_dataset.py
    - ../../raw/real/label_studio_annotations/object_detection_obstacles/
    outs:
    - obstacles:
        persist: true

You can download and convert the dataset with the following commands:

dvc pull data/raw/real/old_labeled_data.dvc
dvc repro data/pre_processed/real/dvc.yaml
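
Each stage above calls LabeledDataset.split_file with a mapping such as {'train':0.7,'test':0.2,'val':0.1}. The sketch below shows one way such a ratio-based split can work. split_keys is a hypothetical stand-in: the real method may shuffle, round, or store its results differently.

```python
import random


def split_keys(keys, ratios, seed=0):
    """Partition keys into named subsets whose sizes follow ratios.

    Hypothetical illustration, not the kitcar_ml implementation.
    """
    keys = sorted(keys)
    random.Random(seed).shuffle(keys)  # reproducible shuffle
    total = sum(ratios.values())
    splits, start, cumulative = {}, 0, 0.0
    for name, ratio in ratios.items():
        cumulative += ratio
        # Cut at the cumulative fraction so rounding errors do not add up.
        end = round(len(keys) * cumulative / total)
        splits[name] = keys[start:end]
        start = end
    return splits


images = [f"frame{i:05d}.png" for i in range(1, 11)]
parts = split_keys(images, {"train": 0.7, "test": 0.2, "val": 0.1})
print({name: len(part) for name, part in parts.items()})
```

Cutting at cumulative fractions guarantees that every image lands in exactly one subset, even when the ratios do not divide the dataset size evenly.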

2. Generated images

Somewhere between real and simulated datasets are generated images. They are real images that are modified artificially so that a labeled dataset can be created automatically. The first generated dataset contains real camera images from our vehicles that are supplemented with images of traffic signs.

The dataset is located at data/pre_processed/real/generated_data. It can be downloaded and/or updated using the following commands:

dvc pull data/pre_processed/real/generated_data
dvc repro data/pre_processed/real/dvc.yaml

The underlying tool can be found in kitcar_ml.utils.data_generation.data_generation_tool.
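
The idea of supplementing a camera image with a traffic sign can be sketched in a few lines. This is a simplified illustration and not the code of the data generation tool: paste_sign is a hypothetical function, images are plain nested lists, and x2 and y2 are assumed to be exclusive. The point is that the label comes for free, because the paste position is known.

```python
import random


def paste_sign(background, sign, class_id, rng):
    """Paste sign into a copy of background and return (image, label).

    Images are row-major nested lists of pixel values.
    Hypothetical illustration, not the kitcar_ml tool.
    """
    bg_h, bg_w = len(background), len(background[0])
    s_h, s_w = len(sign), len(sign[0])
    # Pick a top-left corner so the sign fits entirely inside the image.
    x1 = rng.randint(0, bg_w - s_w)
    y1 = rng.randint(0, bg_h - s_h)
    image = [row[:] for row in background]  # leave the original untouched
    for dy, sign_row in enumerate(sign):
        image[y1 + dy][x1:x1 + s_w] = sign_row
    # Label row in the attribute order x1, y1, x2, y2, class_id.
    return image, [x1, y1, x1 + s_w, y1 + s_h, class_id]


background = [[0] * 8 for _ in range(4)]  # 8x4 stand-in for a camera image
sign = [[9, 9], [9, 9]]                   # 2x2 stand-in for a traffic sign
image, label = paste_sign(background, sign, 135, random.Random(0))
print(label)
```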

3. Simulated images

Using DVC, we can easily import datasets from our simulation repository. The following .dvc files import the random_roads and random_roads_gan datasets:

md5: 174fa7e322a1a970b4f2fed0372eb83e
frozen: true
deps:
- path: data/labeled_images/random_roads
  repo:
    url: git@git.kitcar-team.de:kitcar/kitcar-gazebo-simulation.git
    rev: master
    rev_lock: c275771e37692142a81e8277bd17cc8950ce608c
outs:
- md5: dadb48c84308d367fd53f78541372628.dir
  size: 1134048848
  nfiles: 16926
  path: random_roads
  hash: md5

md5: 454b0cd8128b523b033becec457d90ca
frozen: true
deps:
- path: data/labeled_images/random_roads_gan
  repo:
    url: git@git.kitcar-team.de:kitcar/kitcar-gazebo-simulation.git
    rev: master
    rev_lock: c275771e37692142a81e8277bd17cc8950ce608c
outs:
- md5: f4a8b6547f4b9a128debae6fcec4af69.dir
  size: 815084441
  nfiles: 2771
  path: random_roads_gan
  hash: md5

The datasets contain automatically generated and labeled images of simulated roads. They can be downloaded with the following command:

dvc pull -R data/raw/simulation/labeled_images

Analyzing Datasets

Datasets can be analyzed with the tool kitcar_ml.utils.data.analyse_bbox_dataset.

The following command creates a detailed report for a bounding box dataset:

python3 -m kitcar_ml.utils.data.analyse_bbox_dataset --label-file LABEL_FILE --output-folder OUTPUT_FOLDER

Report

The analysis tool creates a comprehensive report, which looks something like this:

BBox Dataset Analysis
---------------------

Basic Information

Path to dataset: old_labeled_data
Number of Classes: 23
Number of Bounding Boxes: 6569
Number of Images: 3236
Image Size: (1280, 300)

Detailed Information

  ID  Class Name                      Targets        Width          Height         Aspect Ratio
----  ------------------------------  -------------  -------------  -------------  --------------
   0  pass_right_sign                 376, 5.72 %    29.01 +-20.34  26.22 +-17.8   1.11 +-0.17
   1  zebra_crossing_sign             338, 5.15 %    43.8 +-33.44   43.9 +-32.47   1.02 +-0.25
   2  sharp_turn_left_sign            679, 10.34 %   52.03 +-59.51  23.36 +-23.19  2.3 +-0.68
   3  expressway_start_sign           378, 5.75 %    50.54 +-43.33  47.33 +-40.56  1.08 +-0.24
   4  priority_sign                   152, 2.31 %    29.28 +-18.32  33.59 +-18.66  0.88 +-0.23
   5  oncoming_traffic_sign           136, 2.07 %    37.1 +-36.56   34.38 +-29.82  1.05 +-0.16
   6  parking_sign                    357, 5.43 %    43.78 +-38.8   42.57 +-35.46  1.01 +-0.14
   7  uphill_grade_sign               206, 3.14 %    53.84 +-43.32  45.1 +-33.48   1.16 +-0.16
   8  speed_limit_zone_30_end_sign    333, 5.07 %    42.83 +-47.8   42.15 +-40.17  0.99 +-0.2
   9  yield_sign                      342, 5.21 %    46.75 +-34.78  40.4 +-33.28   1.18 +-0.26
  10  sharp_turn_right_sign           392, 5.97 %    54.6 +-50.7    22.14 +-19.42  2.5 +-0.66
  11  sharp_turn_right_small_sign     390, 5.94 %    28.98 +-26.37  26.5 +-18.92   1.06 +-0.29
  12  turn_right_sign                 244, 3.71 %    44.89 +-33.76  39.64 +-30.32  1.12 +-0.22
  13  downhill_grade_sign             197, 3.0 %     63.84 +-45.81  53.82 +-41.21  1.2 +-0.22
  14  turn_left_sign                  184, 2.8 %     37.8 +-33.22   35.61 +-28.08  1.04 +-0.18
  15  stop_sign                       157, 2.39 %    51.5 +-33.32   51.5 +-35.61   1.03 +-0.16
  16  no_overtaking_end_sign          84, 1.28 %     26.73 +-17.24  31.1 +-23.03   0.9 +-0.12
  17  speed_limit_zone_30_start_sign  207, 3.15 %    49.16 +-56.22  46.25 +-46.01  1.01 +-0.17
  18  sharp_turn_left_small_sign      621, 9.45 %    27.34 +-27.93  26.6 +-23.6    1.02 +-0.21
  19  expressway_end_sign             388, 5.91 %    41.14 +-37.02  40.23 +-33.98  1.02 +-0.18
  20  no_overtaking_start_sign        164, 2.5 %     42.37 +-20.75  43.27 +-31.42  1.05 +-0.26
  21  speed_limit_zone_20_start_sign  125, 1.9 %     61.07 +-53.5   56.74 +-42.93  1.03 +-0.16
  22  speed_limit_zone_20_end_sign    119, 1.81 %    65.92 +-60.24  57.38 +-46.67  1.1 +-0.16
      TOTAL                           6569, 100.0 %  43.49 +-42.05  36.51 +-32.72  1.27 +-0.59
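
The Width, Height, and Aspect Ratio columns list a mean and a spread per class. A minimal sketch of how such values can be derived from label rows follows. The rows are made up, the aspect ratio is taken as width divided by height, and using the population standard deviation is an assumption: the tool may compute the spread differently.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Label rows in the attribute order x1, y1, x2, y2, class_id (made-up values).
rows = [
    [10, 10, 40, 30, 1],
    [0, 0, 50, 50, 1],
    [5, 5, 105, 45, 2],
]

# Collect (width, height) of every box, grouped by class id.
sizes = defaultdict(list)
for x1, y1, x2, y2, class_id in rows:
    sizes[class_id].append((x2 - x1, y2 - y1))


def summarize(values):
    """Return (mean, standard deviation); population std is an assumption."""
    return mean(values), pstdev(values)


report = {}
for class_id, wh in sizes.items():
    report[class_id] = {
        "targets": len(wh),
        "width": summarize([w for w, _ in wh]),
        "height": summarize([h for _, h in wh]),
        "aspect_ratio": summarize([w / h for w, h in wh]),
    }

for class_id, stats in report.items():
    print(class_id, stats)
```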

Class Distribution

The class distribution diagram shows how the classes are distributed in the dataset.
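
The numbers behind such a diagram are simple counts per class. The following sketch computes them for the small example dataset from the format section, with the labels written out by hand. It only illustrates the idea and is not taken from the analysis tool.

```python
from collections import Counter

classes = {135: "zebra_crossing_sign", 202: "CrosswalkLines"}
labels = {
    "frame00001.png": [[381, 66, 436, 113, 135], [-84, 135, 345, 195, 202]],
    "frame00002.png": [],
    "frame00003.png": [[745, 67, 789, 109, 135]],
}

# class_id is the last attribute of every label row.
counts = Counter(row[-1] for rows in labels.values() for row in rows)
total = sum(counts.values())

# Map each class name to (number of boxes, share in percent).
distribution = {
    classes[class_id]: (n, round(100 * n / total, 2))
    for class_id, n in counts.items()
}
print(distribution)
```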

Heatmaps

A heatmap shows, for each point in the image, the number of bounding boxes that cover it.
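
As a sketch of that definition, the function below counts the covering boxes for every pixel of a tiny image. It is a hypothetical illustration rather than the tool's implementation. It assumes x2 and y2 are exclusive and clips boxes to the image, since labels may extend past the border.

```python
def heatmap(boxes, width, height):
    """Count, for each pixel, how many boxes cover it."""
    counts = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        # Clip to the image, since a box may start at a negative coordinate.
        for y in range(max(y1, 0), min(y2, height)):
            for x in range(max(x1, 0), min(x2, width)):
                counts[y][x] += 1
    return counts


boxes = [(0, 0, 3, 2), (2, 1, 5, 3), (-2, 2, 1, 4)]
grid = heatmap(boxes, width=5, height=4)
for row in grid:
    print(row)
```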

Scatter

A scatter plot displays a dot for the center point of every bounding box in a dataset.
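
The points of such a plot are the box centers, which follow directly from the corner coordinates. A minimal sketch, using the label rows from the example dataset (centers is a hypothetical helper):

```python
def centers(rows):
    """Return the (x, y) center of every x1, y1, x2, y2, class_id row."""
    return [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2, _ in rows]


rows = [
    [381, 66, 436, 113, 135],
    [-84, 135, 345, 195, 202],
    [745, 67, 789, 109, 135],
]
points = centers(rows)
print(points)
```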