Real-time synthetic data generation for LLM training with CircleCI workflows
Senior NLP Researcher

Large Language Models (LLMs) are not magic. They learn from massive amounts of curated, supervised data, especially question-answer pairs that teach them how to understand context and generate relevant responses. Once deployed, these models remain static with their knowledge limited to the original training data. In a world where new events, topics, and user needs emerge every day, relying on a stale training dataset quickly leads to outdated responses.
To keep LLMs aligned with the latest information, developers need to continuously update their training data and fine-tune the model at regular intervals. However, with the sheer amount of data published online, it is practically impossible to manually create annotated question-answer pairs for supervised training. Creating and labeling new conversational data by hand is slow, laborious, and often expensive.
Synthetic data generation can automate the process. Rather than waiting for labeled datasets to become available, you can generate relevant question-answer pairs on the fly using current online content. You can also automate the entire process using a CI/CD pipeline.
In this tutorial, I will guide you through a complete workflow that uses CircleCI to automatically generate synthetic data in real-time. You will start by searching for recent news or trending topics using Python and DuckDuckGoSearch, scrape full article content using BeautifulSoup4, and feed that content into an LLM to generate conversational training data. Then, you will automate this flow using a scheduled CircleCI pipeline that runs every day.
While your specific use case may differ, the same process can be used to generate supervised datasets for FAQ systems, conversational chat support agents over quickly evolving enterprise data, or domain-specific training corpora.
Prerequisites
To follow this tutorial, you will need Python and a CircleCI account to run the automated pipeline. Together AI is used as the LLM provider; it offers $25 in free credits, which is sufficient to follow along with this tutorial.
Before you start:
- Set up a GitHub account
- Download and install Python
- Create a CircleCI account
- Sign up for a TogetherAI account and create a new API access token
Creating a new Python project and installing required dependencies
Start by setting up a new Python project with a fresh virtual environment and installing the necessary packages. Creating a new virtual environment helps isolate your project's dependencies from other Python projects on your system, ensuring a clean and conflict-free workspace. By defining a `requirements.txt` file, you make it easy to reinstall the packages needed for the project.
Create a new directory and initialize a new virtual environment. On Unix-based systems, execute these commands:
mkdir SyntheticDataGeneration
cd SyntheticDataGeneration/
python3 -m venv venv
source venv/bin/activate
Now, let’s create a requirements.txt file to specify the Python packages you will use throughout the project. Here’s a quick overview of what each package does:
- duckduckgo-search: A Python client to perform web searches using the DuckDuckGo API.
- beautifulsoup4: A popular HTML parser used to scrape and extract article content from web pages.
- requests: The de facto standard library for making HTTP requests in Python.
- tqdm: A lightweight utility that displays progress bars while scraping and generating data.
- together: A Python SDK for accessing Together AI's large language models, which you will use to generate synthetic Q&A pairs from scraped content.
Create a new text file named requirements.txt and add this content:
# File Name: requirements.txt
duckduckgo-search
together
requests
tqdm
BeautifulSoup4
Install the required dependencies in the new virtual environment, using Python’s package manager:
pip install -r requirements.txt
Now, you can start with the Python project.
Scraping real-time news articles and blogs for relevant content
To create relevant question-answer pairs, you need up-to-date content like the latest news articles, blogs, or web content about a particular topic. In this section, you will build a Python script to automatically search for and extract content from the web, ensuring that the data you use for training your model is fresh and relevant.
You will use the duckduckgo-search package to perform search queries and retrieve article URLs. A key feature here is the `timelimit="d"` parameter in DuckDuckGo's search, which filters the results to articles published within the last 24 hours. You can adjust this parameter to match your requirements and cron schedule, with other options for weekly, monthly, or yearly filters. The examples use a sample query of "technology"; modify the search term as you see fit.
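To make the link between your cron cadence and the search window explicit, here is a small sketch. The helper name and mapping dict are invented for illustration; the single-letter values are the `timelimit` codes that duckduckgo-search accepts:

```python
# Hypothetical helper: maps a readable window name to the single-letter
# timelimit codes used by duckduckgo-search ("d", "w", "m", "y").
TIMELIMIT_CODES = {"day": "d", "week": "w", "month": "m", "year": "y"}

def timelimit_for(window: str) -> str:
    """Return the DDGS timelimit code for a human-readable window."""
    if window not in TIMELIMIT_CODES:
        raise ValueError(f"Unsupported window: {window!r}")
    return TIMELIMIT_CODES[window]

# A weekly cron job would pair naturally with a weekly search window:
print(timelimit_for("week"))  # → w
```

For a daily pipeline like the one in this tutorial, `timelimit_for("day")` returns `"d"`, which is the value used in the script below.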
After gathering the URLs, you will use the `requests` library to fetch each page and BeautifulSoup4 to scrape the content. The script will filter out irrelevant or short paragraphs, focusing only on meaningful text that will be useful when generating conversational training data.
Create a new file called `scrape.py` and add this Python code:
# File Name: scrape.py
from typing import List, Dict
import os
import time
import json

import requests
from tqdm import tqdm
from duckduckgo_search import DDGS
from duckduckgo_search.exceptions import RatelimitException
from bs4 import BeautifulSoup

# -- Headers for scraping GET request -- #
HEADERS = {
    "User-Agent": "Mozilla/5.0"
}


def fetch_article_urls(query: str, max_results: int = 10) -> List[str]:
    """
    Uses DuckDuckGo text search for a query.
    Returns relevant URLs for the query from the previous day.

    :param query: str = Text to search
    :param max_results: int = Max number of URLs to return.
    :return List[str] = List of URL endpoints for relevant articles.
    """
    with DDGS() as ddgs:
        try:
            response = ddgs.text(
                query,
                timelimit="d",  # -- IMP: Filters results on time. -- #
                max_results=max_results
            )
        except RatelimitException:
            print("DDGS Rate Limit Error")
            return []
        except Exception as e:
            print(f"Unexpected error in DDGS: {e}")
            return []
    return [r["href"] for r in response]


def extract_content_bs4(url: str) -> Dict[str, str]:
    """
    Scrapes a URL with a GET request.
    If successfully scraped, returns a dict with url, title, and text keys.
    Else, returns a dict with url and error keys.

    :param url: str = URL to scrape
    :return Dict[str, str]
    """
    try:
        resp = requests.get(url, headers=HEADERS, timeout=10)
        if resp.status_code != 200:
            return {"url": url, "error": f"Status code: {resp.status_code}"}
        soup = BeautifulSoup(resp.text, "html.parser")
        # Try to get the title (guard against pages with no <title> text)
        title = soup.title.string.strip() if soup.title and soup.title.string else ""
        # Attempt to extract main article content heuristically
        paragraphs = soup.find_all("p")
        # -- Only keep content paragraphs with more than 40 characters -- #
        content = "\n".join(p.get_text(strip=True) for p in paragraphs if len(p.get_text()) > 40)
        return {
            "url": url,
            "title": title,
            "text": content
        }
    except Exception as e:
        return {"url": url, "error": str(e)}


def main():
    query = "technology"  # Sample query. Modify as required.
    urls = fetch_article_urls(query)
    articles = []
    for url in tqdm(urls, desc="Scraping URLs"):
        article = extract_content_bs4(url)
        articles.append(article)
        time.sleep(1)  # rate limit to avoid blocks
    os.makedirs("data/", exist_ok=True)
    with open("data/scraped_articles.json", "w") as f:
        json.dump(articles, f, indent=2)


if __name__ == "__main__":
    main()
Here is a quick breakdown of how the script works:
- `fetch_article_urls(query)`: Searches DuckDuckGo for your specified query (like "technology") and returns a list of URLs from the past day.
- `extract_content_bs4(url)`: Takes each URL, makes a GET request, parses the page, and returns the article's title and main content.
- `main()`: Runs the whole process, scraping articles and saving them as a JSON file in the `data/` folder.
The script also includes a `time.sleep(1)` delay between requests to avoid hitting rate limits or getting blocked by websites.
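A fixed one-second sleep works well for small batches. If you later scrape many more URLs, retrying with exponential backoff is a common refinement. The sketch below is not part of scrape.py; `fetch_with_retries` and the flaky fake fetcher are hypothetical names used only to demonstrate the pattern:

```python
import time

def fetch_with_retries(fetch, url, retries=3, base_delay=1.0):
    """Call fetch(url), retrying with exponential backoff on transient errors.

    Hypothetical helper for illustration, not part of scrape.py itself.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Demo with a fake fetcher that fails twice before succeeding:
calls = {"n": 0}

def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return f"ok:{url}"

print(fetch_with_retries(flaky, "https://example.com", base_delay=0.01))  # → ok:https://example.com
```

In the real script you would pass a wrapper around `requests.get` as the `fetch` callable.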
This step is essential because it provides the real-time data you need to generate synthetic Q&A pairs in the next section.
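If you want to see the 40-character paragraph filter in isolation before running the full script, this minimal sketch applies the same heuristic to a made-up HTML snippet (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup  # third-party; installed via requirements.txt

# Invented HTML standing in for a scraped article page.
html = """
<html><body>
  <p>Subscribe to our newsletter!</p>
  <p>Researchers announced a new battery chemistry that stores roughly twice
  the energy of conventional lithium-ion cells, the team reported today.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# The same heuristic scrape.py uses: keep only paragraphs over 40 characters.
content = "\n".join(
    p.get_text(strip=True) for p in soup.find_all("p") if len(p.get_text()) > 40
)
print(content)
```

The short promotional paragraph is dropped, while the substantive article text survives, which is exactly what you want feeding into the LLM in the next section.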
Generate a conversational question-answer dataset from the scraped content
Now that you have scraped real-time articles, it is time to turn that raw content into something much more valuable: structured question-answer (Q&A) pairs that mimic the kind of conversational data used to train large language models.
In this section, you will use the Together API to query an open-source language model like `mistralai/Mixtral-8x7B-Instruct-v0.1` and generate realistic Q&A pairs based on the scraped articles. This step essentially transforms unstructured web content into synthetic training data.
The Together API is a developer-friendly platform that provides access to powerful open-source large language models (like Mixtral and LLaMA) through a simple API, allowing you to run inference without hosting the models yourself. It's fast, cost-efficient, and offers $25 in free credits to get started. If you do not have an existing account, create one on the platform and head over to the API Keys section in the user settings to get your API key. The key is read from an environment variable to establish a connection to the API server.
Create a new .env file:
TOGETHER_API_KEY=<YOUR_API_KEY_HERE>
Execute these commands to export the variables from the .env file into your current shell session:
set -a
source .env
Note: Never commit your .env file — add it to .gitignore to prevent exposing sensitive credentials in version control.
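If you prefer failing fast from Python instead of relying on the shell, a small guard like this can help; `require_env` is a hypothetical helper for illustration, not part of the scripts in this tutorial:

```python
import os

def require_env(name: str, env=os.environ) -> str:
    """Return the value of an environment variable, or fail with a clear hint.

    Hypothetical helper, shown only to illustrate failing fast on a missing key.
    """
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; run `set -a; source .env` first.")
    return value

# Demonstrated with a fake environment mapping instead of the real one:
print(require_env("TOGETHER_API_KEY", {"TOGETHER_API_KEY": "sk-demo"}))  # → sk-demo
```

The `assert` at the top of generate.py below plays the same role more tersely.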
Now, generate the question-answer pairs using the LLM. Create a new file called generate.py and add this code:
# File Name: generate.py
from typing import List, Dict
import re
import os
import json

from tqdm import tqdm
from together import Together

TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
assert TOGETHER_API_KEY, "Together API key is required to generate data"

# -- Together API client. Automatically uses the TOGETHER_API_KEY environment variable -- #
CLIENT = Together()


def build_prompt(title: str, context: str) -> str:
    """
    Builds an explained prompt for the LLM.

    :param title: str = Title of the article
    :param context: str = Content of the article
    :return str = Explained user prompt for the LLM.
    """
    return f"""You are an AI assistant that generates training data for large language models.

Title: {title}

Article:
{context}

Task:
Based on the article above, generate a list of question and answer pairs in the following JSON format:

```json
{{
    "question": "...",
    "answer": "..."
}}
```

Enclose the JSON between ```json and ``` code blocks.
"""


def generate_qa(title: str, content: str, model: str = "mistralai/Mixtral-8x7B-Instruct-v0.1") -> str:
    """
    Given an article title and content, uses an LLM to generate relevant question-answer pairs.

    :param title: str = Title of the article
    :param content: str = Content of the article
    :param model: str = Model ID of the Together LLM. Uses the open-source Mixtral-8x7B model by default.
    :return str = Generated response from the completions endpoint.
    """
    prompt = build_prompt(title, content)
    response = CLIENT.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant to generate a structured synthetic question-answering dataset for an LLM."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        model=model,
    )
    return response.choices[0].message.content


def extract_json_from_markdown(text: str) -> List[Dict]:
    """
    Matches content between ```json and ``` blocks.
    Uses regex to parse the string content and convert the JSON.

    :param text: str = Generated content from the LLM.
    :return List[Dict] = A list of question-answer pairs generated by the LLM.
    """
    # -- Use regex to get all content between ```json and ``` -- #
    matches = re.findall(r"```json(.*?)```", text, re.DOTALL)
    json_objects = []
    for match in matches:
        try:
            obj = json.loads(match.strip())
            json_objects.append(obj)
        except json.JSONDecodeError as e:
            print("Error decoding JSON:", e)
    return json_objects


def main():
    FILE_PATH = "data/scraped_articles.json"  # File path of the scraped articles JSON. Produced by scrape.py.
    assert os.path.exists(FILE_PATH), "No scraped data file found."
    with open(FILE_PATH) as f:
        articles = json.load(f)
    articles = list(filter(lambda x: "text" in x, articles))  # Only retain articles that were successfully scraped.

    qa_pairs = []
    for article in tqdm(articles, desc="Generating Q/A Pairs"):
        try:
            qa_json = generate_qa(article["title"], article["text"])
            qa_json = extract_json_from_markdown(qa_json)
            qa_pairs.append({
                "url": article["url"],
                "generated_questions": qa_json
            })
        except Exception as e:
            print(f"Error processing article: {article['url']}\n{e}")

    with open("data/qa_dataset.json", "w") as f:
        json.dump(qa_pairs, f, indent=2)


if __name__ == "__main__":
    main()
Here is what you're doing, step by step:
- `build_prompt(title, context)`: Creates a clean, instructive prompt for the LLM. It explains the task clearly and tells the model to return data in a consistent JSON format.
- `generate_qa(title, content)`: This is where the magic happens. You send the prompt to the Together API using their `chat.completions.create()` interface, and the LLM returns a response containing Q&A pairs based on the article.
- `extract_json_from_markdown(text)`: Since the model wraps the output in markdown code blocks, this function uses regex to extract just the raw JSON and safely load it into Python.
- `main()`: Loads the scraped articles, filters out the ones without content, and generates Q&A data for each article. It saves everything to `data/qa_dataset.json` so you can use it in downstream training or analysis.
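To convince yourself the regex extraction works before spending API credits, you can run it on a hand-written response of the expected shape. The sample text below is invented; the fence is assembled from a variable only so this example renders cleanly:

```python
import json
import re

fence = "`" * 3  # build the backtick fence without breaking this code block
llm_output = (
    "Here are the pairs:\n"
    + fence + "json\n"
    + '[{"question": "What was announced?", "answer": "A new battery chemistry."}]\n'
    + fence
)

# The same pattern generate.py uses: everything between ```json and ```.
pattern = fence + r"json(.*?)" + fence
matches = re.findall(pattern, llm_output, re.DOTALL)
pairs = [json.loads(m.strip()) for m in matches]
print(pairs[0][0]["question"])  # → What was announced?
```

Malformed JSON in a block would raise `json.JSONDecodeError`, which the full script catches and logs rather than crashing the whole run.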
To test the code locally, execute these commands:
python scrape.py
python generate.py
This will create a new data directory containing the JSON files for the extracted articles and the generated question-answer pairs. You can now automate this process using CircleCI scheduled pipelines.
Write a scheduled CircleCI workflow to generate synthetic data
Now that you have got the scraping and generation scripts ready, you can automate everything using CircleCI. In this section, you will create a scheduled workflow that runs daily and continuously produces fresh question-answer data from real-world content.
Create a new CircleCI configuration file named `.circleci/config.yml` in your project directory and add this YAML configuration:
# File Name: .circleci/config.yml
version: 2.1

jobs:
  generate_synthetic_data:
    docker:
      - image: cimg/python:3.10
    steps:
      - checkout
      - run:
          name: Install Dependencies
          command: pip install -r requirements.txt
      - run:
          name: Scrape Relevant Data
          command: python scrape.py
      - run:
          name: Generate QA Pairs for LLM
          command: python generate.py
      - store_artifacts:
          path: data
          destination: synthetic-data

workflows:
  pipeline:
    jobs:
      - generate_synthetic_data
  scheduled_pipeline:
    triggers:
      - schedule:
          cron: "59 23 * * *" # Cron to schedule the workflow at 23:59 every day.
          filters:
            branches:
              only:
                - main
    jobs:
      - generate_synthetic_data
This CI pipeline checks out the latest code, creates a fresh Python environment, and executes the Python scripts to create the new JSON files. The `store_artifacts` command saves the generated JSON files from your `data/` directory; these files will be available as downloadable artifacts in your CircleCI job summary. The scheduled workflow defines a cron expression that automatically triggers the job every night at 11:59 PM UTC.
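If a nightly run does not match your needs, you can swap a different standard five-field cron expression into the schedule trigger. For example (all times are UTC):

```yaml
workflows:
  scheduled_pipeline:
    triggers:
      - schedule:
          # cron: "59 23 * * *"  # nightly at 23:59 (used in this tutorial)
          # cron: "0 */6 * * *"  # every six hours
          cron: "0 9 * * 1-5"    # 09:00 on weekdays only
          filters:
            branches:
              only:
                - main
    jobs:
      - generate_synthetic_data
```

Whichever cadence you pick, keep the DuckDuckGo `timelimit` window in scrape.py roughly aligned with it so articles are neither missed nor scraped twice.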
With this setup, you have now built a self-refreshing pipeline that scrapes, generates, and saves high-quality synthetic data on a daily schedule.
Now that your code is ready, it’s time to commit and push it to a remote repository on GitHub.
First, create a new repository on GitHub:
- Go to github.com and create a new repository.
- Copy the repository URL (e.g., `https://github.com/your-username/your-repo.git`).
Next, in your local directory, add the remote repository and sync your local commits with it. On Unix-based systems, run these commands, replacing `<YOUR_GIT_REPOSITORY_URL>` with the actual URL of your remote repository:
git add .
git commit -m "Initial Commit"
git remote add origin <YOUR_GIT_REPOSITORY_URL>
git push -u origin main
Setting up the project on CircleCI
The complete code for this project is available on my GitHub account. To set up the project on CircleCI, log in to CircleCI using your GitHub account via OAuth. This is important because scheduled workflows are supported only for organizations connected through GitHub OAuth. After you log in, CircleCI will automatically detect repositories and allow you to set them up as projects. Go to your CircleCI dashboard and select the GitHub repository containing this project. Once linked, CircleCI will recognize the pipeline configuration and the scheduled trigger defined in your `config.yml`.
Because your pipeline calls the Together API, you’ll need to securely provide your API key:
- Go to Project Settings in CircleCI.
- Go to the Environment Variables tab.
- Add a new variable with the name `TOGETHER_API_KEY` and paste your key as the value.
Although your pipeline is scheduled to run automatically every night at 11:59 PM UTC, you can also trigger it manually using the CircleCI API. This is useful for testing or generating data on demand. Once triggered, you can monitor the progress from the CircleCI dashboard. If everything is configured correctly, the pipeline should successfully complete and show a green run.
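As a sketch of a manual trigger, the snippet below builds a request against CircleCI's v2 pipeline-trigger endpoint. The project slug is a placeholder you must replace, and the actual POST is commented out so nothing is sent without a valid personal API token:

```python
import os

# Placeholder project slug; replace with your own "gh/<user>/<repo>".
PROJECT_SLUG = "gh/your-username/SyntheticDataGeneration"
url = f"https://circleci.com/api/v2/project/{PROJECT_SLUG}/pipeline"
payload = {"branch": "main"}
headers = {
    "Circle-Token": os.getenv("CIRCLECI_TOKEN", "<YOUR_PERSONAL_API_TOKEN>"),
    "Content-Type": "application/json",
}

# Uncomment to actually trigger the pipeline (requires a valid token):
# import requests
# resp = requests.post(url, json=payload, headers=headers)
# print(resp.status_code, resp.json())
print(url)
```

A successful trigger returns the new pipeline's number and ID, and the run then appears on your dashboard like any scheduled run.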
After successful execution, navigate to the Artifacts tab for the completed job in CircleCI. You'll find the generated files, such as `scraped_articles.json` and `qa_dataset.json`, there. These files are downloadable and can be used to periodically fine-tune your LLM.
Conclusion
In this blog, you built a complete automated pipeline to generate synthetic question-answer data using CircleCI and open-source LLMs via the Together API. You scraped fresh web content, transformed it into conversational QA pairs, and stored the results as artifacts using CircleCI’s scheduling and job orchestration capabilities.
To make this system production-ready, you can persist the generated data to cloud storage like AWS S3. This ensures easy access, long-term storage, and centralized dataset collection across runs, which is useful for retraining or dataset curation at scale.
For advanced implementations, you can extend this pipeline in multiple ways, like:
- Domain-specific generation (e.g., medical, legal, or finance)
- Multilingual QA datasets using translation layers
- Human-in-the-loop validation to refine LLM outputs
- Dataset versioning to track model improvements over time