Automated version control for LLMs using DVC and CI/CD

In the world of large language models (LLMs), experimentation is the norm — not the exception. From dataset preprocessing to fine-tuning hyperparameters and evaluating different checkpoints, each stage introduces variations that can impact model performance. Without a robust system in place, tracking what worked (and what didn’t) becomes a guessing game, making reproducibility and collaboration a challenge. That’s where version control becomes critical.
Traditional version control systems like Git are great for tracking code, but they fall short when it comes to handling large datasets, model weights, and experiment metadata — all of which are essential components in an LLM workflow. This is where DVC (Data Version Control) can help. DVC extends Git-like capabilities to data science workflows, enabling seamless versioning of large files, model artifacts, and even ML pipelines — all while keeping your Git repository clean and lightweight. It lets teams roll back to previous experiments, compare metrics across runs, and reproduce results with confidence.
But tracking versions manually is still prone to human error. By integrating DVC into your automated CI/CD pipeline on CircleCI, you can automate experiment tracking, dataset versioning, and model checkpoint storage every time a new training job is triggered. Whether it’s pushing updated datasets to remote storage or logging a new model checkpoint with experiment metadata, this approach ensures every step is documented and reproducible by design.
This tutorial uses a LoRA-based fine-tuning workflow for a language model to show how to version control its lightweight adapter checkpoints using DVC and CircleCI. While this example centers on LoRA, the approach is easily extensible to other LLM training paradigms, making it a versatile foundation for any MLOps-ready workflow.
Prerequisites
For this tutorial, you need to set up a Python development environment on your machine. You also need a CircleCI account to automate the training and versioning of your LLM. You will need:
- A GitHub account
- A local Python installation
- A CircleCI account
- A Hugging Face account
- A Google account for syncing artifacts to a Google Drive remote folder
Training a language model using LoRA-based adapter
In this tutorial, you will write a Python script to fine-tune a language model on a custom dataset. Training large language models from scratch is typically resource-intensive, requiring significant compute power and memory due to their billions of parameters. To make this process more efficient, you will use LoRA (Low-Rank Adaptation) — a parameter-efficient fine-tuning technique that trains only a small set of additional low-rank weights while keeping the original model weights frozen, preserving overall performance.
For this tutorial, you’ll fine-tune the lightweight distilgpt2 model on a small Shakespeare dataset. While this example uses a small model and dataset for simplicity, the same workflow can be easily adapted to larger models or different datasets by changing a few configuration values.
Setting up your Python environment and installing dependencies
In Python projects, it’s a best practice to use virtual environments to isolate dependencies and avoid version conflicts across different projects. Start by creating a requirements.txt file that lists all the necessary packages for this tutorial. Then, install them into a clean virtual environment using the pip package manager.
Key dependencies include:
- dvc and dvc-gdrive for tracking and version-controlling datasets and models, with support for syncing to a remote Google Drive folder.
- transformers and datasets from Hugging Face, which allow you to download pre-trained language models and datasets and fine-tune them locally.
# File Name: requirements.txt
dvc
dvc-gdrive
datasets
torch
transformers
peft
huggingface_hub[hf_xet]
Begin by creating and activating a new virtual environment using the following commands:
python3 -m venv venv
source venv/bin/activate
Once the virtual environment is active, you can install all the dependencies at once by running:
pip install -r requirements.txt
This command reads the package list from requirements.txt and installs each package into your current environment, setting you up with everything needed to train, fine-tune, and version-control your LLM workflow.
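If you want to confirm that everything installed correctly, a quick optional check is to import the key packages and print their versions. The file name below is just a suggestion:
# File Name: check_env.py (optional sanity check)
import dvc
import datasets
import torch
import transformers
import peft

# Print the installed version of each core dependency
for pkg in (dvc, datasets, torch, transformers, peft):
    print(f"{pkg.__name__}: {pkg.__version__}")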
Creating a text generation data loader in Python
Data preprocessing is a crucial step in training any language model. Raw text data must be transformed into a format that the model can understand — specifically, a sequence of numerical tokens based on the model’s vocabulary. This is done through tokenization, a process that converts text into subword units using a tokenizer tailored to the specific language model.
In this tutorial, you’ll use the datasets library to load a dataset from the Hugging Face Hub. You’ll also use the transformers library to load the tokenizer corresponding to the model you’re fine-tuning. The model name is passed as a parameter, allowing you to dynamically load the appropriate tokenizer.
The core of this step is the prepare_dataset function, which iterates over the dataset’s text column and applies the tokenizer to each sample. This transforms the raw text into tokenized inputs — numerical arrays that the model can process. Tokenization is essential because language models operate on numbers and vectors, not plain text.
# File Name: data/dataloader.py
from datasets import load_dataset
from transformers import AutoTokenizer


class TextGenerationDataLoader:
    """
    A class to load and preprocess datasets for text generation tasks.

    Attributes:
        model_name (str): The name of the pre-trained model for tokenization.
        max_length (int): Maximum length for truncating/padding input text.
        data_split (str): The split of the dataset to load (e.g., "train").
        tokenizer (AutoTokenizer): Tokenizer for the specified model.
    """

    def __init__(
        self,
        model_name: str,
        max_length: int = 512,
        data_split: str = "train"
    ) -> None:
        """
        Initializes the TextGenerationDataLoader with model and dataset parameters.

        Args:
            model_name (str): The model identifier for tokenization.
            max_length (int, optional): Maximum length for tokenized inputs. Defaults to 512.
            data_split (str, optional): The dataset split (train, test). Defaults to "train".
        """
        self.model_name = model_name
        self.max_length = max_length
        self.data_split = data_split
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        if not self.tokenizer.pad_token:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = 'right'

    def prepare_dataset(
        self,
        dataset_name: str,
    ):
        """
        Prepares and tokenizes the dataset for text generation.

        Args:
            dataset_name (str): The name of the dataset to load.

        Returns:
            tokenized_data (Dataset): The tokenized dataset ready for model input.
        """
        dataset = load_dataset(dataset_name, trust_remote_code=True)[self.data_split]
        text_column = self._get_text_column(dataset)

        def tokenize_func(examples):
            texts = [txt + self.tokenizer.eos_token for txt in examples[text_column]]
            return self.tokenizer(
                texts,
                truncation=True,
                max_length=self.max_length,
                padding='max_length'
            )

        tokenized_data = dataset.map(tokenize_func, batched=True)
        tokenized_data.set_format(
            type="torch",
            columns=['input_ids', 'attention_mask']
        )
        return tokenized_data

    def _get_text_column(self, dataset):
        """
        Determines the text column name in the dataset based on common column names.

        Args:
            dataset (Dataset): The dataset object.

        Returns:
            str: The name of the text column in the dataset.
        """
        possible_text_columns = ['text', 'content', 'sentence', 'document']
        for column in possible_text_columns:
            if column in dataset.column_names:
                return column
        # If none of the common names are found, use the first column
        return dataset.column_names[0]

    def get_dataset_stats(self, dataset):
        """
        Retrieves basic statistics about the dataset.

        Args:
            dataset (Dataset): The tokenized dataset.

        Returns:
            dict: A dictionary with the total number of samples and shapes of input_ids and attention_mask.
        """
        return {
            "total_samples": len(dataset),
            "input_ids_shape": dataset["input_ids"].shape,
            "attention_mask_shape": dataset["attention_mask"].shape
        }
You will use the TextGenerationDataLoader helper class later to preprocess the dataset before training the model.
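As a quick, optional sanity check, you can exercise the class from a Python shell before wiring it into the training script. The model and dataset names below simply mirror the defaults used later in this tutorial:
# Example usage of the data loader (run from the project root)
from data.dataloader import TextGenerationDataLoader

loader = TextGenerationDataLoader("distilbert/distilgpt2", max_length=512, data_split="train")
dataset = loader.prepare_dataset(dataset_name="karpathy/tiny_shakespeare")

# Inspect the tokenized dataset: sample count and tensor shapes
print(loader.get_dataset_stats(dataset))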
Loading the language model and adding LoRA adapter
Now that your dataset is ready, load the base language model and prepare it for parameter-efficient fine-tuning using LoRA (Low-Rank Adaptation). To encapsulate this logic, define a LoraLanguageModel class that handles loading the model and injecting the LoRA adapters.
# File Name: model/model_loader.py
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model


class LoraLanguageModel:
    """
    A class for initializing and managing a causal language model with LoRA (Low-Rank Adaptation) applied.

    Attributes:
        model (AutoModelForCausalLM): The causal language model with LoRA adaptation.
        tokenizer (AutoTokenizer): The tokenizer for the specified model.
    """

    def __init__(
        self,
        model_name: str,
        rank: int = 16,
        lora_alpha: int = 32
    ) -> None:
        """
        Initializes the LoraLanguageModel with LoRA configuration.

        Args:
            model_name (str): The name of the pre-trained model.
            rank (int, optional): The rank of the LoRA adaptation. Defaults to 16.
            lora_alpha (int, optional): The alpha scaling factor for LoRA. Defaults to 32.
        """
        self.model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
        lora_config = LoraConfig(
            r=rank,
            lora_alpha=lora_alpha,
            task_type="CAUSAL_LM"
        )
        self.model = get_peft_model(self.model, lora_config)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        if not self.tokenizer.pad_token:
            self.tokenizer.pad_token = self.tokenizer.eos_token

    def get_model(self):
        return self.model

    def get_tokenizer(self):
        return self.tokenizer

    def print_trainable_parameters(self):
        """
        Prints the number of trainable parameters and their percentage out of the total parameters.

        This function calculates the number of parameters that are trainable (i.e., require gradients)
        and the total number of parameters in the model.

        Outputs:
            - Trainable parameters count.
            - Total parameters count.
            - Percentage of trainable parameters.
        """
        trainable_params = 0
        all_param = 0
        for _, param in self.model.named_parameters():
            all_param += param.numel()
            if param.requires_grad:
                trainable_params += param.numel()
        print(f"Trainable params: {trainable_params} || All params: {all_param}")
        print(f"Trainable percentage: {100 * trainable_params / all_param:.2f}%")
At the core of this process is the get_peft_model function, which modifies the original model architecture by inserting additional trainable weights — the LoRA adapters. Instead of updating all parameters of the model (which can be in the billions), LoRA fine-tunes only a small number of added weights, drastically reducing the training cost while preserving performance.
Two important hyperparameters govern the behavior of LoRA:
- rank: Controls the size of the low-rank matrices. A higher rank increases the number of trainable parameters, which may lead to better performance but also requires more computational resources.
- lora_alpha: Scales the update applied by the LoRA layers. This parameter can influence training dynamics and convergence speed.
You will use the LoraLanguageModel helper class to load a LoRA-augmented model for efficient training in the next step.
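Before moving on, you can optionally verify the effect of these hyperparameters by loading a model interactively and inspecting how few parameters are actually trainable:
# Example: load distilgpt2 with LoRA adapters and inspect trainable parameters
from model.model_loader import LoraLanguageModel

lora_model = LoraLanguageModel("distilbert/distilgpt2", rank=16, lora_alpha=32)
lora_model.print_trainable_parameters()  # the trainable share should be only a small fraction of all parameters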
Training script for LoRA language model
With the helper classes in place, you’re now ready to train the language model. In this example, you’ll fine-tune the lightweight distilgpt2 model on a small Shakespeare dataset — chosen to keep the setup simple, fast, and reproducible for anyone following along.
The training script includes an argument parser, allowing you to easily customize key parameters such as the model name, dataset path, number of training epochs, and more via command-line arguments. This flexibility makes the script adaptable to various models and datasets with minimal changes.
# File Name: train.py
import os
import argparse
from datetime import datetime

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

from data.dataloader import TextGenerationDataLoader
from model.model_loader import LoraLanguageModel


def parse_arguments():
    """
    Parses command-line arguments for training the text generation model.

    Returns:
        argparse.Namespace: The parsed command-line arguments.
    """
    parser = argparse.ArgumentParser(description="Training Parameters for Text Generation Model")
    parser.add_argument("--model_name", type=str, default="distilbert/distilgpt2", help="HuggingFace Model Name")
    parser.add_argument("--dataset_name", type=str, default="karpathy/tiny_shakespeare", help="HuggingFace Dataset Name")
    parser.add_argument("--max_length", type=int, default=512, help="Max Length of Training Example")
    parser.add_argument("--data_split", type=str, default="train", help="Data Split of the Mentioned Dataset")
    parser.add_argument("--run_name", type=str, default=None, help="Training Run Name")
    parser.add_argument("--batch_size", type=int, default=1, help="Batch Size for LORA Training")
    parser.add_argument('--epochs', type=int, default=1, help='Number of training epochs')
    parser.add_argument('--learning_rate', type=float, default=2e-5, help='Learning rate')
    parser.add_argument('--lora_r', type=int, default=16, help='LoRA rank')
    parser.add_argument('--lora_alpha', type=int, default=32, help='LoRA alpha')
    return parser.parse_args()


def train(args):
    """
    Trains a text generation model with LoRA adaptation using the specified parameters.

    Args:
        args (argparse.Namespace): The parsed arguments containing training parameters.

    Returns:
        bool: True if training completes successfully.
    """
    short_model_name = args.model_name.split("/")[-1]
    short_dataset_name = args.dataset_name.split("/")[-1]
    if not args.run_name:
        args.run_name = f"{short_model_name}-{short_dataset_name}-{int(datetime.timestamp(datetime.utcnow()))}"
    output_dir = f"./results/{args.run_name}"

    dataloader = TextGenerationDataLoader(args.model_name, max_length=args.max_length, data_split=args.data_split)
    dataset = dataloader.prepare_dataset(dataset_name=args.dataset_name)
    dataset_stats = dataloader.get_dataset_stats(dataset)

    model = LoraLanguageModel(model_name=args.model_name, rank=args.lora_r, lora_alpha=args.lora_alpha)
    model.print_trainable_parameters()

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.batch_size,
        learning_rate=args.learning_rate,
        logging_dir=f'{output_dir}/logs',
        logging_steps=10,
        save_strategy="no",
        run_name=args.run_name
    )

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=model.get_tokenizer(),
        mlm=False  # Not using masked language modeling
    )

    trainer = Trainer(
        model=model.get_model(),
        args=training_args,
        train_dataset=dataset,
        data_collator=data_collator
    )

    train_result = trainer.train()

    adapter_path = os.path.join("./results", 'latest_lora_adapter')
    model.get_model().save_pretrained(adapter_path)
    return True


if __name__ == "__main__":
    """
    Main entry point for training the text generation model with LoRA adaptation.
    Parses arguments and initiates the training process.
    """
    args = parse_arguments()
    train(args)
You can modify the CLI arguments to change the model, dataset, or training hyperparameters. Run the script using the command below:
python3 train.py \
--model_name distilbert/distilgpt2 \
--dataset_name karpathy/tiny_shakespeare \
--run_name train-lora-model \
--max_length 512 \
--data_split train \
--epochs 1 \
--batch_size 4 \
--lora_r 16 \
--lora_alpha 32
For demonstration purposes, the command above runs a single training epoch and saves the final LoRA adapter weights to the local directory ./results/latest_lora_adapter/. These weights represent the trained parameters introduced by LoRA and are essential for generating predictions during inference. It’s important to persist these weights for future use — whether for evaluation, continued training, or deployment.
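For reference, here is a minimal sketch of how those saved adapter weights might be loaded back for inference. It assumes the default distilgpt2 base model from this tutorial; the prompt is arbitrary:
# Example: generate text with the saved LoRA adapter (illustrative)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

# Attach the trained LoRA weights saved by train.py
model = PeftModel.from_pretrained(base_model, "./results/latest_lora_adapter")
model.eval()

inputs = tokenizer("To be, or not to be", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))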
To make this process automated and reproducible, integrate DVC to track and version-control the adapter weights. DVC lets you sync these artifacts to a remote storage location — in this case, a Google Drive folder — ensuring that your training results are safely stored, easily shareable, and fully traceable over time.
Automating model version control with DVC and CircleCI
While Git is excellent for tracking code, it struggles with large files like datasets and model weights. That’s where DVC (Data Version Control) comes in. DVC extends Git-like versioning to data and model artifacts, allowing you to sync large files to remote storage backends such as Google Drive, S3, or Azure — all while keeping your Git repository clean and lightweight.
In this tutorial, you’ll use Google Drive as the remote storage provider, offering a free and convenient solution for backing up model artifacts. DVC’s CLI is intentionally similar to Git, making it intuitive for developers already familiar with version control workflows.
Initializing DVC and adding model weights
Start by setting up DVC in your project and adding the trained LoRA adapter weights to version control. Make sure the packages from the requirements.txt file are installed in your Python environment. To install the DVC packages explicitly with pip, run this command:
pip install dvc dvc-gdrive
Then, initialize DVC in your project root:
git init # Initializing Git is important because it is used to track DVC artifacts
dvc init
This command creates a .dvc folder with configuration and internal metadata files; tracked artifacts are identified by MD5 hashes — similar to how Git handles source files.
To track your trained LoRA adapter weights, run:
dvc add results/latest_lora_adapter
This adds the entire folder to DVC’s tracking system and creates a corresponding .dvc file, ./results/latest_lora_adapter.dvc. This file holds metadata and a unique hash pointing to the exact state of your model weights. Add this file to Git:
git add results/latest_lora_adapter.dvc
This links your Git commit to a specific version of your model artifacts — allowing full reproducibility when paired with remote DVC storage.
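Once a remote is configured (covered in the next section), any collaborator can restore the adapter exactly as it existed at a given commit, either by running git checkout <rev> -- results/latest_lora_adapter.dvc followed by dvc checkout, or programmatically. The sketch below assumes DVC 3.x; the repository URL and revision are placeholders:
# Example: fetch a specific version of the adapter from the DVC remote (illustrative)
from dvc.api import DVCFileSystem

repo_url = "https://github.com/<OWNER>/<REPO_NAME>"  # placeholder Git repository
revision = "<git-commit-or-tag>"                     # the experiment you want to restore

fs = DVCFileSystem(repo_url, rev=revision)
fs.get("results/latest_lora_adapter", "restored_adapter", recursive=True)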
Setting up Google Drive as remote storage
To use Google Drive as remote storage with DVC, start by creating a private folder in your personal or organizational Drive account. Each folder has a unique folder ID, which will be used to configure it as the remote storage location for DVC.
However, simply using the folder ID leads to authentication issues, as DVC will prompt a browser-based login. While this may work on a local machine, it’s not feasible for automated CI/CD pipelines that require non-interactive authentication. To support secure and fully automated workflows, you’ll use Google’s Service Accounts, which allow key-based authentication and eliminate the need for browser interaction.
Service Accounts act like virtual users with restricted access, enabling you to grant upload permissions to only the necessary Drive folders. This ensures your storage remains private, while still allowing model artifacts to be pushed programmatically during CI runs.
To set this up, follow Google’s official documentation to create a Service Account. To summarize, here is what you need to do in a few steps:
- Go to the Google Cloud Console and create a new project or choose an existing one.
- In the project, go to IAM & Admin and create a new Service Account.
- In the Service Account settings, open the Keys tab and create a new key.
- Download the JSON file with the service account credentials.
- In the project's APIs & Services settings, enable the Google Drive API.
At the end of the process, you will have a JSON key file like this:
{
  "type": "service_account",
  "project_id": "",
  "private_key_id": "",
  "private_key": "",
  "client_email": "xxxx.iam.gserviceaccount.com",
  "client_id": "",
  "auth_uri": "",
  "token_uri": "",
  "auth_provider_x509_cert_url": "",
  "client_x509_cert_url": "",
  "universe_domain": "googleapis.com"
}
Take the client_email from this file and share your Google Drive folder with it, giving the Service Account permission to access and write to the folder. Later, you’ll configure DVC to use this JSON key and folder ID to automatically push model weights to Google Drive during your CI/CD workflows — all without requiring manual logins.
CircleCI config for automating model version control
Now, set up the CircleCI configuration to automate the entire process — from retraining the model to versioning and pushing the new weights to Google Drive. The goal of this CI pipeline is to retrain the model, track the updated weights with DVC, push them to the remote Google Drive folder, and commit the updated DVC metadata to Git. This ensures that your experiments are fully automated and reproducible, allowing you to revert to any previous version of both the repository and the model weights when needed.
The CircleCI configuration is as follows:
# File Name: .circleci/config.yml
version: 2.1

parameters:
  model_name:
    type: string
    default: "gpt2"
    description: "HuggingFace Model Path"
  dataset_name:
    type: string
    default: "karpathy/tiny_shakespeare"

orbs:
  python: circleci/python@3.0.0

jobs:
  train-and-version-lora:
    docker:
      - image: cimg/python:3.10
    steps:
      - checkout
      - python/install-packages:
          pkg-manager: pip
      - run:
          name: Setup DVC with Secure Credentials
          command: |
            if [ ! -d ".dvc" ]; then
              dvc init
            else
              echo "DVC already initialized"
            fi
            dvc remote add -d storage gdrive://${GDRIVE_FOLDER_ID}
            if [ ! -z "${GDRIVE_CONFIG_B64}" ]; then
              # Create credentials from environment variable
              echo "${GDRIVE_CONFIG_B64}" | base64 --decode > /tmp/gdrive_credentials.json
              dvc remote modify storage gdrive_use_service_account true
              dvc remote modify storage gdrive_service_account_json_file_path /tmp/gdrive_credentials.json
            fi
      - run:
          name: Train LoRA Adapter
          command: |
            python train.py \
              --model_name << pipeline.parameters.model_name >> \
              --dataset_name << pipeline.parameters.dataset_name >>
      - run:
          name: DVC Versioning with Secure Credentials
          command: |
            MODEL_NAME=$(echo << pipeline.parameters.model_name >> | tr '/' '-')
            OUTPUT_DIR="./results/latest_lora_adapter"
            dvc add ${OUTPUT_DIR}
            dvc push
            git add results/latest_lora_adapter.dvc
            git config --global user.email ${GIT_CONFIG_EMAIL}
            git config --global user.name ${GIT_CONFIG_USERNAME}
            git commit -m "Update LORA Adapter Post Training [skip ci]"
            git push -u ${GITHUB_REPO_URL} main
            if [ -f "/tmp/gdrive_credentials.json" ]; then
              rm /tmp/gdrive_credentials.json
            fi

workflows:
  train-workflow:
    jobs:
      - train-and-version-lora:
          name: << pipeline.parameters.model_name >>-<< pipeline.parameters.dataset_name >>
In the CircleCI pipeline, begin by setting up the DVC credentials. Use the Google Service Account’s JSON key, which is securely stored as an environment variable in CircleCI. During the workflow, the key is written temporarily to the job’s Docker container to authenticate the pipeline with Google Drive.
Once authenticated, configure DVC to use the Google Drive remote storage. Afterward, use the dvc push command to upload the newly trained model weights to the Drive folder. Finally, add the generated .dvc files to Git and make a new commit to finalize the changes, ensuring that all versioning information is captured in the repository.
This setup fully automates the version control of model weights and configuration changes, creating a seamless and reproducible process that integrates training, storage, and versioning in a CI/CD environment.
Setting up the project on CircleCI
The complete code for this project is available on my GitHub account. To execute the defined workflow, you can connect the code repository to your CircleCI account. Start by heading to the Projects tab on your CircleCI dashboard and creating a new project. This will redirect you to a page where you can set up your workflow.
If you haven’t already connected your GitHub account to CircleCI, you’ll need to do that first. Once connected, select the relevant repository for this project.
CircleCI will automatically detect the config.yml file that you defined in the project. You can proceed with the configuration and set up the necessary triggers to control when your pipeline will execute. For this example, configure the pipeline to run whenever a PR is merged to the default branch, which will retrain and version control the model whenever there are confirmed changes. This setting is ideal for stable repositories with occasional changes. However, you can configure it to run on different branches or tagged pull requests, depending on your needs.
Once the project is created, you can access it to view the workflow details. Before triggering the pipeline, you need to set the required environment variables for execution. To do this, go to Project Settings > Environment Variables, and add the following variables: GDRIVE_FOLDER_ID, GDRIVE_CONFIG_B64, GIT_CONFIG_EMAIL, GIT_CONFIG_USERNAME, and GITHUB_REPO_URL.
To find the GDRIVE_FOLDER_ID, check the URL of your Google Drive folder:
https://drive.google.com/drive/u/0/folders/<YOUR_GOOGLE_FOLDER_ID>
Next, to securely add the JSON key file for your Google Drive service account to the GDRIVE_CONFIG_B64 variable, encode the JSON as a base64 string. You can do this using the following command:
base64 -i <path_to_your_google_drive_key_json>
Next, configure your GitHub credentials to enable version control of the newly created .dvc files after training, and to push them to your remote GitHub repository. To associate your commits with the correct GitHub account, set the following environment variables using your personal GitHub details:
- GIT_CONFIG_EMAIL: your GitHub email address
- GIT_CONFIG_USERNAME: your GitHub username
To specify the remote repository, set the GITHUB_REPO_URL environment variable. To avoid authentication issues during pushes, it’s recommended to use a URL containing your Personal Access Token (PAT). The URL should follow this format:
https://<GITHUB_USERNAME>:<PERSONAL_ACCESS_TOKEN>@github.com/<GITHUB_REPO_URL>
Note:
- GITHUB_USERNAME should be your GitHub username, which can differ from the repository owner (e.g., if the repository belongs to an organization).
- GITHUB_REPO_URL should follow the format <OWNER>/<REPO_NAME>, representing the full path to the GitHub repository.
After adding the environment variables, your project settings page should list all five variables, similar to the example below. Make sure the key names exactly match those referenced in the CircleCI YAML configuration.
While the pipeline will execute automatically based on the triggers you set, you can also trigger it manually with modified parameters. Once triggered, you can follow the pipeline’s progress in the CircleCI dashboard and confirm that it completes successfully.
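If you prefer to trigger runs from a script rather than the dashboard, a minimal sketch using CircleCI’s v2 pipeline API is shown below. It assumes a personal API token stored in a CIRCLECI_TOKEN environment variable, the requests package installed, and a placeholder project slug:
# Example: trigger the training pipeline with custom parameters (illustrative)
import os
import requests

project_slug = "gh/<GITHUB_USERNAME>/<REPO_NAME>"  # placeholder

response = requests.post(
    f"https://circleci.com/api/v2/project/{project_slug}/pipeline",
    headers={"Circle-Token": os.environ["CIRCLECI_TOKEN"]},
    json={
        "branch": "main",
        "parameters": {
            "model_name": "distilbert/distilgpt2",
            "dataset_name": "karpathy/tiny_shakespeare",
        },
    },
)
response.raise_for_status()
print(response.json()["number"])  # pipeline number of the newly created run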
Conclusion
In this blog, you learned how to automate model version control using DVC and CircleCI, with Google Drive as a remote storage backend. This setup streamlines model training workflows by making them reproducible, manageable, and versioned, ensuring easy rollback to previous versions and consistent model performance.
While this approach is effective for simple models, it can be extended to more advanced use cases. By integrating tools like MLflow, you can enhance experiment tracking by logging hyperparameters, metrics, and model versions. This allows for easier comparison of different runs and identification of the best-performing configurations.
Additionally, this system can be adapted for fine-tuning workflows of pre-trained models, version-controlled datasets for data preprocessing, and collaborative model development where multiple contributors can securely share models. Incorporating version control and experiment tracking into your machine learning workflows will improve reproducibility, transparency, and scalability, and make it easier to iterate and collaborate on production-ready AI systems.