How to use LLMs to generate test data (and why it matters more than ever)
The way software is written is changing fast. In the past few years, AI coding assistants and large language models (LLMs) have gone from novelty to necessity for many developers. Tools like Cursor, ChatGPT, and custom in-house models are helping teams generate boilerplate, scaffold features, and even build entire apps within minutes.
It’s exciting. But it also raises the stakes.
When code is written faster, it’s deployed faster. And when that code is generated by a model trained on the entire internet, it doesn’t always behave how developers expect. That’s where testing comes in. The more AI you use to write code, the more you need to test it thoroughly.
Testing is no longer just about catching bugs. It’s not only “does this run?” but also “does it do what I intended?” — and just as importantly, “does it avoid what I didn’t intend?”
Which leads to an interesting question: if you’re already using LLMs to write your code, why not use them to help test it, too?
Why LLMs are a natural fit for test data generation
If you’ve ever written a unit test, you know the pain of coming up with realistic test data. Say you’re testing a payment API. You need a bunch of fake transactions, users with different account types, edge cases like refunds and chargebacks, and maybe even a few malformed inputs to make sure your validations are working.
LLMs are great at this.
You can prompt a model to generate dozens or hundreds of test cases in plain English. Need a list of JSON payloads representing ecommerce orders from customers in five countries with various discount codes? Easy. Want a mix of valid and invalid email addresses that mimic real-world typos? Done. The model can give you both structure and variation—fast.
This doesn’t replace test design, of course. It just supercharges it. You still need to know what you’re testing and why. But once you do, the model can take your intent and spin up more data than you’d ever want to write by hand.
Real-world example: Testing a user onboarding flow
Here’s a concrete example of how LLMs can be used to generate test data for a user onboarding flow, integrate that into a test suite, and connect it with CI/CD.
Step 1: Define what you want to test
Suppose you’re working on an API endpoint that handles new user signups. Each signup includes the following fields:
- name (string)
- email (string)
- age (integer, must be between 13 and 120)
- preferences (optional object)
To ensure your API behaves correctly, you should test a variety of input scenarios, such as:
- Valid inputs with internationalized names
- Emails with common user typos
- Invalid ages (below 13, above 120, non-integer values)
- Missing fields
- Duplicate emails
- JSON schema violations (e.g., a string instead of a number)
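Before prompting anything, it can help to pin down what “valid” means in a machine-checkable form. Here is a minimal sketch of those rules as a JSON Schema, checked with the jsonschema Python library; the module and function names are hypothetical, and the schema is an illustration of the constraints above rather than part of any existing codebase.
# signup_schema.py (hypothetical helper module)
import jsonschema

SIGNUP_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "email": {"type": "string"},  # real email validation is left to the API itself
        "age": {"type": "integer", "minimum": 13, "maximum": 120},
        "preferences": {"type": "object"},
    },
    "required": ["name", "email", "age"],
    "additionalProperties": False,
}

def matches_schema(payload: dict) -> bool:
    """Return True if a signup payload satisfies the Step 1 rules."""
    try:
        jsonschema.validate(instance=payload, schema=SIGNUP_SCHEMA)
        return True
    except jsonschema.ValidationError:
        return False
Having the rules written down like this also gives you something concrete to reuse when you classify the generated payloads later on.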
Step 2: Use an LLM to generate the data
Start by crafting a prompt like this:
Generate 20 JSON user signup payloads for testing based on the following data schema:
- Name (string)
- Email (string)
- Age (integer, must be between 13 and 120)
Include a mix of valid and invalid examples. Cover edge cases like:
- Duplicate emails
- International characters in names
- Missing fields
- Ages below 13, above 120, and non-numeric ages
- Incorrect field types
The model might return something like:
[
{"name": "Miyuki たなか", "email": "miyuki@example.com", "age": 28},
{"name": "José Alvarez", "email": "jose@exmaple.com", "age": 34},
{"name": "Anna", "email": "anna@example.com", "age": 12},
{"name": "", "email": "invalid", "age": "twenty"},
{"name": "John Smith", "email": "john@example.com", "age": 200},
{"email": "missingname@example.com", "age": 22},
{"name": "Sarah", "email": "sarah@example.com", "age": null},
{"name": "Omar", "email": "omar@example.com", "age": 29, "preferences": {"language": "ar"}},
{"name": "Miyuki たなか", "email": "miyuki@example.com", "age": 28},
{"name": "Bot", "email": "bot@example.com", "age": 30, "preferences": "likes cheese"}
]
If you’re doing a lot of prompt iteration to dial this in, it helps to use a structured environment where you can experiment with variations, test outputs side-by-side, and quickly adjust your wording based on results. Tools like Circlet are purpose-built for this: they give you a clean interface to write and compare prompts, visualize outputs, and keep track of what works best. This can save hours of guesswork and help you generate data that truly reflects the edge cases you’re aiming to test.
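If you would rather script this step than paste prompts into a chat window, a few lines of Python can send the same prompt to a model and write the result to a file. This is only a sketch: it assumes the official openai Python client (version 1 or later), an OPENAI_API_KEY in the environment, and a placeholder model name you would swap for whatever you actually use.
# generate_signup_payloads.py (hypothetical helper script)
import json
import pathlib

from openai import OpenAI  # assumes openai>=1.0 is installed

PROMPT = """Generate 20 JSON user signup payloads for testing based on the following data schema:
- Name (string)
- Email (string)
- Age (integer, must be between 13 and 120)
Include a mix of valid and invalid examples. Cover edge cases like:
- Duplicate emails
- International characters in names
- Missing fields
- Ages below 13, above 120, and non-numeric ages
- Incorrect field types
Return only a JSON array, with no commentary or code fences."""

def main() -> None:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: use whichever model you have access to
        messages=[{"role": "user", "content": PROMPT}],
    )
    payloads = json.loads(response.choices[0].message.content)  # fails loudly if the reply is not valid JSON
    out_path = pathlib.Path("test_data/signup_payloads.json")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(payloads, ensure_ascii=False, indent=2))
    print(f"Wrote {len(payloads)} payloads to {out_path}")

if __name__ == "__main__":
    main()
Review the generated file before committing it: models occasionally wrap output in extra text, and you want the payloads in version control so every test run sees the same data.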
Step 3: Plug the data into your tests
Assuming you’re using a test framework like Jest (Node.js) or Pytest (Python), you can structure your tests like so:
# test_signup_api.py
import json

import pytest
import requests

# Load the LLM-generated payloads once, at import time.
with open('test_data/signup_payloads.json') as f:
    payloads = json.load(f)

@pytest.mark.parametrize("payload", payloads)
def test_signup_api(payload):
    # is_valid is your own helper that encodes the Step 1 rules (a sketch follows below).
    response = requests.post("http://localhost:3000/api/signup", json=payload)
    if is_valid(payload):
        assert response.status_code == 200
    else:
        assert response.status_code in (400, 422)
This test loops through each generated payload and sends it to the signup API. If the input is valid (as defined by your custom is_valid function), it expects a 200 OK response. For invalid inputs, it checks that the API correctly rejects them with a 400 or 422 error, helping ensure your backend handles edge cases just as gracefully as the happy path.
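The test leans on an is_valid helper that you define yourself. A minimal sketch, mirroring the rules from Step 1 (name and email required, age an integer between 13 and 120, preferences an optional object), might look like the following; the duplicate-email case needs state on the server side, so it is deliberately left out here.
# A hypothetical is_valid helper for test_signup_api.py
def is_valid(payload):
    """Return True if a signup payload should be accepted under the Step 1 rules."""
    name = payload.get("name")
    email = payload.get("email")
    age = payload.get("age")
    preferences = payload.get("preferences")

    if not isinstance(name, str) or not name.strip():
        return False
    if not isinstance(email, str) or "@" not in email:
        return False
    # bool is a subclass of int in Python, so rule it out explicitly
    if not isinstance(age, int) or isinstance(age, bool) or not 13 <= age <= 120:
        return False
    if preferences is not None and not isinstance(preferences, dict):
        return False
    return True
If you sketched out a JSON Schema in Step 1, is_valid could just as easily delegate to that validator instead of re-checking each field by hand.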
Where CI/CD fits in
Whether you’re generating test data by hand or with an LLM, the value really comes through when you run those tests continuously. CI/CD ensures that every change, no matter how small, is automatically vetted against your full suite of tests—including all that rich, varied data you just created.
CircleCI lets you hook that process directly into your development workflow. Push a branch, run the pipeline, get feedback. Quickly. Reliably. Repeatedly.
This becomes even more important when AI is part of your stack. Models are non-deterministic by nature, which means the code they generate might differ from one run to the next. Your tests (and your CI pipeline) are what keep that variability in check.
Here’s how you can set it up.
Put the generated test data into a file (test_data/signup_payloads.json, the same path the test above reads from) and commit it with your tests. Then, set up your CircleCI config like this:
version: 2.1
jobs:
  test:
    docker:
      - image: cimg/python:3.11
    steps:
      - checkout
      - run: pip install -r requirements.txt
      - run: pytest tests/
workflows:
  version: 2
  test_and_deploy:
    jobs:
      - test
Now every time you push changes, CircleCI will:
- Run your tests using real and edge-case inputs
- Catch regressions early
- Give you confidence in your LLM-assisted code
You can go a step further by generating the test data automatically in your pipeline using a script and the OpenAI API (or any local model). Just remember to cache or snapshot the data when consistency matters: regenerating on every run can introduce flakiness and make failures harder to reproduce.
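One way to keep runs reproducible is to regenerate only when a snapshot is missing or explicitly requested, so ordinary pipeline runs always test against the committed file. A rough sketch, reusing the hypothetical generator script from Step 2 and a made-up REFRESH_TEST_DATA environment variable:
# refresh_test_data.py (hypothetical wrapper around the generator sketch above)
import os
import pathlib

from generate_signup_payloads import main as regenerate

SNAPSHOT = pathlib.Path("test_data/signup_payloads.json")

if SNAPSHOT.exists() and os.environ.get("REFRESH_TEST_DATA") != "1":
    # Use the committed snapshot so every run sees identical inputs.
    print(f"Using existing snapshot: {SNAPSHOT}")
else:
    # Regenerate (and commit the result) only when explicitly requested.
    regenerate()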
Tips for using LLMs effectively in test generation
To make the most of LLMs in your test data generation workflow, keep the following principles in mind:
- Be specific in your prompts. The more context you give, the better the output.
- Mix valid and invalid cases. LLMs can generate both, and both are important.
- Use examples from your domain. Give the model real-world scenarios to work with.
- Validate the output. Don’t assume the model is always right. Review and tweak as needed.
- Automate it in your pipeline. With a little scripting, you can generate fresh test data on every run. This helps ensure your tests are resilient to changing inputs and better reflect real-world variability.
Once you’ve tried a few of these techniques, step back and consider how they fit into your broader testing strategy and CI/CD workflow. LLMs are just one piece of the puzzle, but a powerful one.
Final thoughts
Software teams are entering a new era. The tools are more powerful. The pace is faster. But some things never change: good code still needs good tests.
Using LLMs to generate test data is a simple, practical step that can save time, improve coverage, and make your CI/CD process more valuable. If it isn’t already part of your development workflow, it’s worth exploring.
For teams working with LLMs regularly—especially those experimenting with prompt design—tools like Circlet can bring needed structure to an otherwise ad hoc process. They make it easier to iterate, test, and reuse prompts effectively, helping ensure that the way test data is generated is just as thoughtful and rigorous as the code it supports.
Done right, this kind of prompt-driven testing can become a core part of how teams ship reliable, AI-assisted software with confidence.