DVC Tutorial
Authentication
To authenticate, you need to create a credentials file. The wiki explains how to do that.
DVC Stages
Stages are the core of DVC’s reproducibility feature. They represent the processes that form the steps of a pipeline.
A stage is defined by:
command, e.g. python3 train.py
dependencies [paths], e.g. data/train_dataset
outputs [paths], e.g. checkpoints/model.pt
To create your own stage, use the command dvc run:
dvc run \
-n NAME_OF_THE_STAGE \
-d DEPENDENCY_1 \
-d DEPENDENCY_2 \
-o OUTPUT_1 \
-o OUTPUT_2 \
COMMAND
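This executes the command and computes hashes for the dependencies and outputs. To push your results to the DVC remote, use dvc push. Note that dvc run was removed in DVC 3.0; on newer versions, dvc stage add accepts the same options but only records the stage, which you then execute with dvc repro.
dvc run creates a dvc.yaml file, or modifies the existing one, to store the new stage in a human-readable form. It also creates a dvc.lock file, which stores all the hashes. To reproduce your stage, use: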
dvc repro dvc.yaml
DVC then checks the dependencies for changes. If nothing has changed, it reuses the outputs of the last execution; otherwise, it runs the command again and saves the new hashes in the dvc.lock file.
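The change detection described above can be illustrated with a small, DVC-free sketch. It is a conceptual model only and not how DVC is implemented: the directory repro_demo, the file names, and the run_stage function are all hypothetical, and cksum stands in for the hashes DVC stores in dvc.lock.

```shell
#!/bin/sh
# Conceptual sketch only: this is NOT DVC's internal implementation.
demo_dir=repro_demo            # hypothetical scratch directory
mkdir -p "$demo_dir"
dep="$demo_dir/input.txt"      # stands in for a stage dependency
out="$demo_dir/output.txt"     # stands in for a stage output
lock="$demo_dir/stage.lock"    # stands in for dvc.lock
runs="$demo_dir/runs.log"      # one line per real execution

run_stage() {
    # Checksum of the dependency's current content (POSIX cksum).
    new_hash=$(cksum < "$dep")
    # Checksum recorded by the previous run; empty if there was none.
    old_hash=$(cat "$lock" 2>/dev/null)
    if [ "$new_hash" = "$old_hash" ] && [ -f "$out" ]; then
        echo "dependency unchanged, skipping stage"
    else
        echo "dependency changed, running stage"
        tr 'a-z' 'A-Z' < "$dep" > "$out"   # the stage's "command"
        echo "$new_hash" > "$lock"         # remember what we ran on
        echo run >> "$runs"
    fi
}

# Start from a clean state so the sketch can be re-run.
: > "$runs"
rm -f "$lock" "$out"

echo "first version" > "$dep"
run_stage    # executes: no checksum has been stored yet
run_stage    # skipped: the stored checksum still matches
echo "second version" > "$dep"
run_stage    # executes again: the dependency content changed
```

Of the three calls, only the first and the third actually execute the command, which mirrors what dvc repro does when a dependency is unchanged or modified.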
Multiple DVC stages form a pipeline when the outputs of one stage are the dependencies of another stage.
To visualize the pipeline, run dvc dag, or scroll down to see an image of the current pipeline.
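As a minimal sketch of such a pipeline, the snippet below writes by hand the kind of dvc.yaml that two chained stages correspond to. Normally DVC generates this file for you. The directory pipeline_demo, the stage names prepare and train, the script prepare.py, and the path data/raw are hypothetical; the remaining paths reuse the examples from above.

```shell
#!/bin/sh
# Illustration only: DVC normally writes dvc.yaml for you.
demo_dir=pipeline_demo          # hypothetical scratch directory
mkdir -p "$demo_dir"

cat > "$demo_dir/dvc.yaml" <<'EOF'
stages:
  prepare:                      # hypothetical first stage
    cmd: python3 prepare.py
    deps:
      - data/raw
    outs:
      - data/train_dataset      # output of prepare ...
  train:
    cmd: python3 train.py
    deps:
      - data/train_dataset      # ... is a dependency of train
    outs:
      - checkpoints/model.pt
EOF

# The shared path is what links the two stages into one pipeline.
grep -n 'data/train_dataset' "$demo_dir/dvc.yaml"
```

Inside a DVC repository containing a file like this, dvc dag should draw prepare feeding into train, because data/train_dataset appears as an output of one stage and a dependency of the other.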
Create a DVC tracked Dataset
If you have your own dataset and would like to track it with DVC and Git, you can use the dvc add command.
For example, suppose you have a folder full of images named test_dataset.
All you have to do is cd into the parent directory and run:
dvc add test_dataset
This computes a hash for your dataset and stores it in a .dvc file. Next, push your dataset to the DVC remote with dvc push. Then commit the newly created .dvc file and the changes to the .gitignore file, and push them with Git.
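The steps above can be bundled into a small helper. This is a sketch under assumptions: the function name track_dataset and the commit message are made up for illustration, and it expects to be called from the dataset's parent directory inside a repository where both a DVC remote and a Git remote are already configured.

```shell
#!/bin/sh
# Hypothetical helper: track a dataset with DVC, then record it in Git.
# Usage (from the dataset's parent directory): track_dataset test_dataset
track_dataset() {
    dataset=${1%/}    # drop a trailing slash so "$dataset.dvc" is correct
    if [ -z "$dataset" ]; then
        echo "usage: track_dataset DATASET_DIR" >&2
        return 1
    fi
    # Each step runs only if the previous one succeeded.
    dvc add "$dataset" &&                       # writes $dataset.dvc, updates .gitignore
    dvc push &&                                 # uploads the data to the DVC remote
    git add "$dataset.dvc" .gitignore &&        # Git tracks only the small pointer files
    git commit -m "Track $dataset with DVC" &&
    git push
}
```

Chaining the commands with && means a failed dvc push stops the helper before anything is committed, so Git never references data that is missing from the remote.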
Import a Dataset from other Repositories
To import a dataset from another repository that also uses DVC, you can use the command dvc import.
For example, you can import the maschinen_halle_parking dataset from the simulation repository with the following command:
dvc import \
git@git.kitcar-team.de:kitcar/kitcar-gazebo-simulation.git \
data/real_images/maschinen_halle_parking
This imports the dataset into your current directory and adds it to the .gitignore file to ensure that Git doesn’t track the dataset itself. Git only tracks the newly created .dvc file, a YAML file that stores the reference to the simulation repository.
Freezing and Unfreezing
To freeze or unfreeze a DVC stage, use the commands dvc freeze [target] and dvc unfreeze [target].
By default, an imported dataset is frozen. This means that the data will not change unless you update it manually with the command dvc update.
If you would like to keep your dataset up to date without updating it manually, you can unfreeze the import stage. This makes a dvc repro call take more time, but ensures that you always have the latest version of the data.
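The two approaches can be sketched as shell helpers. The function names update_import and track_latest and the commit message are hypothetical, and the sketch assumes the import is referenced through its .dvc file, for example maschinen_halle_parking.dvc.

```shell
#!/bin/sh
# Option 1 (hypothetical helper): keep the import frozen, refresh on demand.
# Usage: update_import maschinen_halle_parking.dvc
update_import() {
    import_file=$1
    dvc update "$import_file" &&               # fetch the latest upstream version
    git add "$import_file" &&                  # the .dvc file now points to it
    git commit -m "Update $import_file"
}

# Option 2 (hypothetical helper): unfreeze, so dvc repro re-checks the source.
# Usage: track_latest maschinen_halle_parking.dvc
track_latest() {
    import_file=$1
    dvc unfreeze "$import_file" &&
    dvc repro "$import_file"
}
```

The first option keeps every data change visible as an explicit Git commit, while the second trades that explicitness for convenience.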
DVC Dependency Graph