CVRunner - A Runner for CV Models
a Deep Learning Infrastructure Project
In this project, I built CVRunner, a Python framework that simplifies and automates the setup for deep learning model training, making it easier to manage experiments, logging, and distributed training.
CVRunner is designed to streamline the process of training deep learning models by providing:
-
Automated Logging with Weights & Biases (wandb) Seamless experiment tracking and visualization without manual setup, allowing researchers to focus on model development rather than logging infrastructure.
-
Multi-GPU and Distributed Training via
DistributedTrainRunnerBuilt-in support for distributed data parallelism, enabling efficient scaling of training workloads across multiple GPUs with minimal code changes. -
Flexible Configuration using a Python-based
ExperimentClass Define all experiment parameters—epochs, batch size, model architecture, dataset, and more—in a stateless, declarativeExperimentclass. This makes experiments fully reproducible and easy to version-control. -
Multi-Environment Support Run experiments locally, in Docker containers, or on Kubernetes clusters with minimal configuration changes, ensuring seamless transitions between development and production environments.
Customizable Training Logic
The Runner class manages the state of a training job (model, optimizer, metrics, etc.) and holds a reference to the experiment configuration. To customize training, subclass the runner and override methods such as run, train_epoch, val_epoch, and checkpoint.
Usage Examples
Running an experiment locally:
cvrunner -e tests/test_runner.py -l
Running in a Docker container:
cvrunner -e test/test_generator/mnist_components.py --target_image test_cvrunner --build
Running on Kubernetes:
cvrunner -e test/test_generator/mnist_components.py --target_image test_cvrunner --build --k8s