CI/CD preprocessing pipelines in LLM applications

In Large Language Model (LLM) applications, the quality of the training data is paramount in determining final model performance. One of the most important steps in preparing datasets is cleaning and transforming raw data into consistent, usable formats. However, this process is tedious and time-consuming when done manually. Automating these data cleaning workflows is essential to improve efficiency and maintain consistency across multiple datasets.
Unprocessed datasets often require repetitive cleaning tasks like handling missing values, formatting text, and removing unnecessary columns. Performing these steps manually is error-prone and leads to burnout, while automating them ensures consistency and speed. Additionally, converting datasets from formats like CSV to Parquet is common, as Parquet’s columnar storage offers improved performance, reduced disk space usage, and faster read/write speeds for LLM workflows. Automating this part of the machine learning pipeline cuts down on manual effort and on human error in repetitive work.
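As a quick illustration of the format conversion itself, pandas can rewrite a CSV file as Parquet in a couple of lines. This is only a sketch: the file names below are placeholders, and writing Parquet requires a Parquet engine such as pyarrow to be installed alongside pandas.

# Sketch: convert a CSV file to Parquet with pandas
import pandas as pd

df = pd.read_csv("raw_data.csv")       # placeholder input file
df.to_parquet("raw_data.parquet")      # columnar output; needs pyarrow or fastparquet

Later in this tutorial, the same conversion happens automatically when the cleaned dataset is pushed back to the Hugging Face Hub.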
In this article, you will learn how to write a simple Python program that uses the Hugging Face API to clean and process datasets. You will then schedule it as a workflow using CircleCI, enabling you to streamline and automate data cleaning for any LLM application and saving you valuable time and effort in the process.
Prerequisites
For this tutorial, you need to set up a Python development environment on your machine. You also need a CircleCI account to automate the data processing workflow. Refer to this list to set up everything required for this tutorial.
- A GitHub account
- Download and install Python
- Create a CircleCI account
- A Hugging Face account and an HfApi access token
Setting up your Python environment and installing dependencies
Python projects commonly use virtual environments to prevent package version conflicts and dependency mismatches. You’ll need Python 3 and can create and activate a virtual environment using the following commands:
python3 -m venv venv
source venv/bin/activate
For this tutorial, you will work with the Hugging Face Datasets package and the HfApi client provided by the huggingface-hub package. Additionally, pandas will be used for data processing. These dependencies can be installed using Python’s package manager. Run the following command to install them with pip:
pip install datasets pandas huggingface-hub
Python code to fetch and clean data from Hugging Face
Next, let’s write a simple Python program that retrieves all datasets from your Hugging Face account, processes them, and uploads the cleaned versions back. It is crucial to process only datasets that have not been previously processed and are intended for cleaning. A straightforward way to achieve this is by tagging such datasets with an unprocessed label. While this is just one approach, it is essential to filter out irrelevant datasets to prevent an endless processing loop or accidental modifications to already cleaned datasets. You can manually edit your dataset cards and add tags to mark them for processing.
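If you would rather tag datasets programmatically instead of editing each dataset card by hand, the huggingface_hub library exposes a metadata_update helper that edits the YAML header of a dataset card. A minimal sketch is shown below; the repository name is a placeholder, the token must have write access, and depending on the card’s existing metadata you may need to pass overwrite=True:

# Sketch: tag a dataset repository as "unprocessed" so the pipeline picks it up
from huggingface_hub import metadata_update

metadata_update(
    repo_id="your-username/your-dataset",   # placeholder repository name
    metadata={"tags": ["unprocessed"]},
    repo_type="dataset",
    token="hf_xxx",                         # or rely on the HF_TOKEN environment variable
)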
Below is the Python code that uses the Hugging Face API client (HfApi) to fetch, process, and re-upload datasets in the optimized Parquet format, which is well suited for LLM applications and efficient data reading.
# File Name: process_datasets.py
from datasets import load_dataset, Dataset, DatasetDict
from huggingface_hub import HfApi
import os


class DatasetMonitor:
    def __init__(self, username: str, token: str) -> None:
        self.api = HfApi(token=token)
        self.username = username
        self.token = token

    def clean_dataset(self, data: Dataset) -> Dataset:
        # -- Simple sample cleaning operations -- #
        df = data.to_pandas()
        df = df.drop_duplicates()
        df = df.dropna(how='all')
        df = df.fillna({
            'text': '',
            'numeric_column': 0,
        })
        # -- Include any other possible cleaning requirements as per your datasets -- #
        return Dataset.from_pandas(df, preserve_index=False)

    def process_datasets(self):
        # -- Fetches all datasets information from the user's Hugging Face account -- #
        datasets = self.api.list_datasets(
            author=self.username,
        )
        for dataset in datasets:
            dataset_name = dataset.id  # dataset.id is the username/repo-name representation
            # -- Skip the dataset if it is not marked for processing or has already been processed -- #
            if "unprocessed" not in dataset.tags or self.api.repo_exists(dataset_name + "-processed", repo_type="dataset"):
                continue
            data = load_dataset(dataset_name, token=self.token)
            # -- Each dataset has multiple splits, e.g. train, test -- #
            # -- We want to process them separately and maintain the data partitions -- #
            cleaned_splits = DatasetDict({
                split_name: self.clean_dataset(split_data)
                for split_name, split_data in data.items()
            })
            # -- Create a new HF repository to hold the processed dataset. -- #
            # -- Trivially, for now we just append -processed to signal it is a processed repository -- #
            new_data_repo = self.api.create_repo(repo_id=dataset_name + "-processed", exist_ok=True, repo_type="dataset")
            # -- Upload the cleaned splits to the new HF dataset repository -- #
            # -- push_to_hub converts the data to Parquet format automatically -- #
            cleaned_splits.push_to_hub(
                new_data_repo.repo_id,
                private=True,
                token=self.token,
            )
            print(f"Processed Dataset: {dataset_name}")


if __name__ == "__main__":
    HF_USERNAME = os.environ["HF_USERNAME"]
    HF_TOKEN = os.environ["HF_TOKEN"]

    monitor = DatasetMonitor(username=HF_USERNAME, token=HF_TOKEN)
    monitor.process_datasets()
We begin by initializing the DatasetMonitor class, which uses the HF_USERNAME and HF_TOKEN values to authenticate with the Hugging Face API. Since these credentials are sensitive, they should always be read from environment variables rather than hardcoded into your codebase. They grant access to your Hugging Face account, and you can generate your own access tokens in the account settings.
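To try the script locally before wiring it into CI, export the two variables in your shell and run the file directly (the values below are placeholders for your own username and token):

export HF_USERNAME="your-hf-username"
export HF_TOKEN="hf_your_token_here"
python process_datasets.py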
Next, the process_datasets function retrieves all dataset repositories under your account and filters for those tagged as unprocessed in their metadata. It then loads each dataset, applies the pandas cleaning operations split by split, and re-uploads the cleaned result to Hugging Face with a -processed suffix appended to its name.
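One more file to prepare before moving on: the CircleCI job in the next section installs dependencies from a requirements.txt file, so add one to the repository root listing the packages you installed earlier. Versions are left unpinned here; pin them if you need reproducible builds.

# File Name: requirements.txt
datasets
pandas
huggingface-hub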
Setting up a cron job with CircleCI
Now that the local Python script is ready, you can automate its execution using CircleCI. Scheduling a cron job ensures that your script periodically checks your organization’s Hugging Face account for unprocessed datasets and uploads the cleaned versions automatically.
With CircleCI, we will configure the script to run every night at 00:00 UTC or whenever a new commit is pushed to the main branch. This setup ensures that data processing happens on a regular schedule while also allowing for immediate updates when new changes are introduced.
Below is the CircleCI configuration YAML file that defines this workflow:
# File Name: .circleci/config.yml
version: 2.1

jobs:
  process_datasets:
    docker:
      - image: cimg/python:3.10
    steps:
      - checkout
      - run:
          name: Install Dependencies
          command: pip install -r requirements.txt
      - run:
          name: Process Datasets
          command: python process_datasets.py

workflows:
  pipeline:
    jobs:
      - process_datasets: # Runs on every push to main
          filters:
            branches:
              only:
                - main
  scheduled_pipeline:
    triggers:
      - schedule:
          cron: "0 0 * * *" # Runs every day at 00:00 UTC
          filters:
            branches:
              only:
                - main
    jobs:
      - process_datasets
The configuration file defines a CI job that runs in a Python 3.10 Docker container, installs the required dependencies, and executes the Python script. The workflow triggers on every push to the main branch and also schedules a nightly cron job to ensure datasets are processed regularly.
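If you have the CircleCI CLI installed, you can optionally validate the configuration before pushing it. Run the following command from the repository root to check .circleci/config.yml for syntax and schema errors:

circleci config validate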
With the configuration properly set up, the next step is to create a GitHub repository for the project and push all the code to it. If you’re unfamiliar with the process, you can review pushing a project to GitHub for step-by-step instructions.
For reference, you can also check out my GitHub repository for this project, which contains a similar setup.
Setting up the project on CircleCI
Now that the project is on GitHub, you can connect the code repository to your CircleCI account to execute the defined workflow. Head to the Projects tab on your CircleCI account:
You can see some of my previously added sample projects in the image above, but your own setup may differ. You’ll need to create a new project by clicking the “Create Project” button, which will redirect you to a new page where you can set up your workflow.
If you haven’t already, you’ll need to connect your GitHub account. Once connected, select the relevant repository for this project to proceed with the setup.
CircleCI will automatically detect the .circleci/config.yml file in your repository. You can proceed with this configuration, which will also set up the necessary triggers. Once the project is created, you can execute workflows and view the completed runs in the Projects tab.
You will see the newly created project, where you can view the workflow details. While the cron job will handle periodic execution, you can manually trigger a run by pushing changes to your GitHub repository to test the pipeline.
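If you don’t have a code change handy, an empty commit is a convenient way to trigger the push-based workflow:

git commit --allow-empty -m "Trigger CircleCI pipeline"
git push origin main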
Note that you may face issues during the first execution. If the run fails and shows the error below, it is likely due to unset environment variables required for accessing your Hugging Face account in the Python script.
Setting environment variables for execution
Go to Project Settings > Environment Variables and add the HF_TOKEN and HF_USERNAME environment variables. Once configured correctly, the page should look similar to the example below.
Now, re-execute the workflow and it should turn green.
Conclusion
In this blog, we walked through automating the pre-processing of Hugging Face datasets using a simple Python program and CircleCI for scheduling. We explored how to clean datasets, convert them to more efficient formats like Parquet, and upload them back to Hugging Face under a repository name with a -processed suffix. Automating this process is crucial for improving efficiency, consistency, and scalability, especially when dealing with large or frequently updated datasets.
While this example serves as a basic boilerplate to help you understand the core steps, the exact implementation will depend on your specific organizational and data requirements. With CircleCI and CI/CD workflows, you can save valuable time, reduce errors, and focus more on the core aspects of your machine learning projects.