Tutorials · Apr 23, 2025 · 8 min read

Preventing harmful LLM output with automated moderation

Najia Gul

Data Scientist


Large Language Models (LLMs) can produce impressive text responses, but they’re not immune to generating harmful or disallowed content. If you’re developing an LLM-powered application, you need a reliable way to detect and block risky outputs. Disallowed content – hate speech, explicit descriptions, harmful instructions – can damage your product’s reputation, endanger user safety, and potentially violate legal or platform guidelines.

In this tutorial, we’ll show you how to automate a moderation check so that whenever a user requests disallowed content or the model produces it, your pipeline will fail and alert you immediately.

OpenAI provides an off-the-shelf Moderation API that checks for categories like hate, self-harm, sexual content, and violence. It returns a simple flag if text is potentially disallowed. Other solutions exist – Google’s Perspective API can measure toxicity, or you could deploy a local toxicity classifier. However, we’ll use OpenAI’s Moderation Endpoint because it integrates easily with GPT-based apps.

The goal is straightforward:

  • Build a simple chatbot that answers user questions.
  • Integrate OpenAI’s Moderation API (or any moderation logic) to check for flagged outputs.
  • Automate the process with CircleCI so that:
    • Harmful content triggers an immediate “blocked” response.
    • A commit is pushed to the repo to fail your CircleCI build, showing you and your team that disallowed content was generated.

By the end, you’ll have a working example of how to catch harmful LLM responses before they reach users, with CircleCI acting as a continuous safety net. Let’s begin!

Prerequisites

Before you begin, make sure you have:

  • Python 3 installed on your machine.
  • An OpenAI account and API key.
  • A GitHub account and a repository for this project.
  • A CircleCI account connected to your GitHub account.

With those in place, you can start by building the basic chatbot.

Building the basic chatbot

In this section, we’ll focus on the chatbot’s basic functionality – no moderation just yet. Once it can chat, you can integrate your automated moderation and CI steps.

Project setup

Create a new folder and initialize git by running:

git init

Next, create a requirements.txt file and add these packages:

openai
python-dotenv

Install the dependencies by running:

pip install -r requirements.txt

Note: You might want to install the dependencies inside a virtual environment.
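For example, you could create and activate one before installing the requirements (the folder name venv is just a common convention; the activation command differs on Windows):

python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
pip install -r requirements.txt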

Obtain and store your OpenAI key

Copy your key from your OpenAI dashboard.

Next, create a .env file in your project folder and add the key to it:

OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxx

To make sure you do not accidentally commit your .env file to GitHub, you should add it to your .gitignore file.

If you do not already have a .gitignore file, create one in the root of your project, and add this line:

.env

This tells Git to ignore the .env file so it stays private and is not included in version control.

Code the basic chatbot

Create a Python file named chatbot.py. Insert this code:

import os
from dotenv import load_dotenv
from openai import OpenAI

def initialize_openai_client() -> OpenAI:
    """Loads the .env file and returns an OpenAI client configured with your API key."""
    load_dotenv()
    return OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def get_chat_response(client: OpenAI, user_input: str) -> str:
    """
    Sends user input to GPT-3.5-Turbo and returns the model's response.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant specialized in technology topics."},
            {"role": "user", "content": user_input}
        ],
        max_tokens=100,
        temperature=0.7
    )
    return response.choices[0].message.content.strip()

def run_chatbot():
    print("=== Basic GPT-3.5 Chatbot ===")
    print("Type 'exit' or 'quit' to end.\n")

    client = initialize_openai_client()

    while True:
        user_input = input("User: ")
        if user_input.lower() in ["exit", "quit"]:
            print("Chatbot: Goodbye!")
            break

        response = get_chat_response(client, user_input)
        print(f"Chatbot: {response}\n")

if __name__ == "__main__":
    run_chatbot()

We load environment variables via load_dotenv(), then create an OpenAI client with the key from OPENAI_API_KEY. The get_chat_response function calls the GPT-3.5-Turbo model with a short conversation: a system message and one user message. The run_chatbot() function runs a while loop that captures user input and prints GPT’s response. You can test this locally by running python chatbot.py in the terminal. You can end the conversation by typing “exit” or “quit.”

At this point, you have a working chatbot that can handle simple queries on technology topics. Next, extend it by adding moderation logic and then automate everything with CircleCI.

Adding moderation

Now that your chatbot is responding sensibly to user questions, you can teach it how to recognize harmful or disallowed text. To do this, use the OpenAI Moderation Endpoint, which scans a piece of text and returns a simple “flagged” result if it’s potentially unsafe.

Here’s the plan:

  1. Moderate the user’s input before sending it to GPT‐3.5.
  2. Moderate the model’s response before displaying it back to the user.

If either is flagged, you’ll refuse to proceed.

Implementing our moderation logic

OpenAI’s Moderation Endpoint is designed to catch problematic content (hate speech, explicit material, violence, etc.). If a string is flagged, the response’s results[0].flagged field is True. That is the signal to block the text.

Write a helper function that calls client.moderations.create(input=...) and returns True if flagged, False otherwise. Here’s a snippet illustrating how you might do it:

def moderate_text(client, text: str) -> bool:
    """
    Checks text against the Moderation endpoint.
    Returns True if disallowed, otherwise False.
    """
    try:
        response = client.moderations.create(input=text)
        return response.results[0].flagged
    except Exception as e:
        print(f"Moderation error: {e}")
        return True  # default to flagged if an error occurs

Pass in a client (our OpenAI connection) plus the text. If flagged is True, it means the content might violate policy.
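To see the helper in action, you could call it with a client created the same way as in the chatbot. This is just a quick sketch; the sample prompt is purely illustrative:

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

if moderate_text(client, "Explain how SSDs store data."):
    print("Blocked by moderation.")
else:
    print("Safe to send to the model.")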

Blocking harmful user input and model output

With your helper function in place, you can check both ends of the conversation:

  • User Input: If flagged, it is not sent to the model, and a message like [User content flagged] is shown instead.
  • Model Output: If the response is flagged, it is withheld from the user.

Here’s a short snippet showing where you might call moderate_text() inside your chatbot loop. Notice that it skips any flagged user input and blocks flagged model output:

# 1) Moderate user input
if moderate_text(client, user_input):
    print("Assistant: [User content flagged by moderation. Not processed.]")
    continue
# 2) Generate model output
assistant_reply = ... # call GPT-3.5
# 3) Moderate model output
if moderate_text(client, assistant_reply):
    print("Assistant: [Model output flagged by moderation. Not displayed.]")
else:
    print(f"Assistant: {assistant_reply}")

This approach ensures no disallowed text reaches the conversation on either side.

Try entering risky prompts – such as something with explicit hate speech. If the Moderation Endpoint recognizes it, you should see a refusal message in your console instead of GPT’s full reply.

Note: Certain borderline content might not be flagged if it doesn’t meet the threshold. That’s normal behavior for OpenAI’s category scoring. If you need stricter filtering, you can build additional checks on top of the moderation results.
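As a sketch of what those additional checks might look like, the code below rejects anything whose per-category score exceeds a custom threshold, even when the endpoint’s own flagged field is False. The threshold value and the strictly_flagged helper name are illustrative, not part of OpenAI’s API:

STRICT_THRESHOLD = 0.4  # illustrative value - tune to your own tolerance

def strictly_flagged(client, text: str) -> bool:
    """Returns True if OpenAI flags the text OR any category score exceeds our threshold."""
    result = client.moderations.create(input=text).results[0]
    if result.flagged:
        return True
    # category_scores holds per-category probabilities between 0.0 and 1.0
    scores = result.category_scores.model_dump()
    return any((score or 0.0) > STRICT_THRESHOLD for score in scores.values())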

With moderation in place, your chatbot is safer. Next, connect this to CircleCI so that any flagged content not only gets blocked, but also fails the build – making your entire team instantly aware of disallowed outputs.

Automating moderation with CircleCI

Your chatbot now refuses harmful user input and filters disallowed model outputs. But what if you want your entire team to be alerted whenever the chatbot produces flagged content? That’s where CircleCI comes in. By pushing a “flagged” file to the repo whenever disallowed text is detected, you automatically trigger CircleCI to run – and fail the build if such files exist.

Here is code that includes both moderation checks and the commit‐and‐push approach for raising alerts in CircleCI:

#!/usr/bin/env python3

import os
import subprocess
from datetime import datetime

from openai import OpenAI

def record_flagged_content(offending_text: str):
    """
    Writes a small file indicating flagged content, then commits + pushes it
    to your GitHub repo. CircleCI picks up this commit and runs the pipeline,
    which can fail if it sees any flagged files.

    Requirements:
      - Git must be configured in this local environment with credentials or SSH.
      - You have a local clone of the repository that CircleCI monitors.
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"flagged_event_{timestamp}.txt"

    with open(filename, "w") as f:
        f.write(f"Flagged content detected at {timestamp}\nOffending text:\n{offending_text}\n")

    try:
        # Stage and commit
        subprocess.run(["git", "add", filename], check=True)
        subprocess.run(["git", "commit", "-m", f"Add flagged content file {filename}"], check=True)

        # Push to main (ensure 'main' matches your default branch)
        subprocess.run(["git", "push", "origin", "main"], check=True)
        print(f"Pushed flagged file {filename} to repo, triggering CircleCI pipeline.")
    except subprocess.CalledProcessError as e:
        print(f"Error pushing flagged content file: {e}")

def initialize_openai_client():
    """
    Reads OPENAI_API_KEY from your environment (or sets it directly)
    and returns an OpenAI client instance for the 1.0+ interface.
    """
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY not set.")

    return OpenAI(api_key=api_key)

def moderate_text(client: OpenAI, text: str) -> bool:
    """
    Checks text against the Moderation endpoint.
    Returns True if flagged (disallowed), otherwise False.
    """
    try:
        response = client.moderations.create(input=text)
        return response.results[0].flagged
    except Exception as e:
        print(f"Moderation API error: {e}")
        # Default to flagged or handle differently
        return True

def run_chatbot():
    """
    A console-based chatbot using GPT-3.5-Turbo.
    We moderate BOTH the user's input and the model's output.
    If any is flagged, we block and commit/push a 'flagged_event' file
    so CircleCI fails the build on the next run.
    """
    client = initialize_openai_client()

    print("\n=== GPT-3.5-Turbo Chatbot (Moderate User & Model, then Commit Flagged File) ===")
    print("Type 'exit' or 'quit' to end the session.\n")

    messages = [
        {"role": "system", "content": "You are a helpful assistant specialized in technology topics."}
    ]

    while True:
        user_input = input("User: ")
        if user_input.lower() in ["exit", "quit"]:
            print("Assistant: Goodbye!")
            break

        # 1) Moderate the user's input
        user_flagged = moderate_text(client, user_input)
        if user_flagged:
            print("Assistant: [User content flagged by moderation. Not processed.]")
            record_flagged_content(user_input)
            continue  # Skip sending to the model

        # If user input is safe, add to conversation
        messages.append({"role": "user", "content": user_input})

        try:
            # 2) Get model response
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=messages,
                temperature=0.7,
                max_tokens=150
            )

            assistant_reply = response.choices[0].message.content.strip()

            # 3) Moderate the model's output
            model_flagged = moderate_text(client, assistant_reply)
            if model_flagged:
                print("Assistant: [Model output flagged by moderation. Not displayed.]")
                messages.append({"role": "assistant", "content": "[Blocked content]"})
                record_flagged_content(assistant_reply)
            else:
                print(f"Assistant: {assistant_reply}\n")
                messages.append({"role": "assistant", "content": assistant_reply})

        except Exception as e:
            print(f"Error: {e}")
            break

if __name__ == "__main__":
    run_chatbot()

How it works

  1. record_flagged_content(): Whenever disallowed text is detected (either from the user or the model), this function writes a small text file – like flagged_event_20231005_153340.txt – and immediately commits and pushes it.

    a. CircleCI discovers the new commit on main and runs a build.
    b. If your .circleci/config.yml is set up to fail when it finds these flagged files, the build will go red.

  2. moderate_text(): Returns True for flagged content. You run it on both user input and model output.

  3. run_chatbot(): A console loop that prompts you for input. If something’s flagged, it prints a short [... flagged] message and calls record_flagged_content(). Otherwise, it just displays the safe chatbot reply.

By itself, pushing flagged files doesn’t cause the build to fail unless you tell CircleCI how to handle them. In the next section, you will configure your .circleci/config.yml to do exactly that.

Failing the build in CircleCI

Now that your chatbot commits a “flagged” file whenever disallowed text appears, you need CircleCI to look for those files and fail the pipeline if any are found. This is as simple as adding a check_flagged_files job in your .circleci/config.yml.

In the root directory of your repository (the same one that contains your chatbot code), create a folder named .circleci. Then inside that folder, make a file called config.yml:

version: 2.1

workflows:
  build_flagged_files:
    jobs:
      - check_flagged_files:
          filters:
            branches:
              only: main

jobs:
  check_flagged_files:
    docker:
      - image: cimg/base:stable
    steps:
      - checkout
      - run:
          name: Detect flagged files
          command: |
            FILE_COUNT=$(find . -maxdepth 1 -name 'flagged_event_*.txt' | wc -l)
            if [ "$FILE_COUNT" -gt 0 ]; then
              echo "Found $FILE_COUNT flagged_event_*.txt files - failing build!"
              exit 1
            else
              echo "No flagged files. Passing build."
              exit 0
            fi

Here’s what the file does:

  1. workflows: Defines a workflow named build_flagged_files which runs the job check_flagged_files.

  2. check_flagged_files:

    a. Docker image: uses cimg/base:stable, a minimal CircleCI image for basic shell commands.
    b. checkout: pulls your repo (including any flagged_event_*.txt files).
    c. run: executes a small shell script that counts how many files match flagged_event_*.txt.

    i.  If one or more exist, it prints a message and exits 1, causing the build to fail.
    ii. If none exist, it prints "No flagged files" and exits 0, passing the build.
    

CircleCI reads this config.yml from your repository, so you must commit and push it to GitHub (or whichever VCS you use).
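If you want to sanity-check the detection logic locally before pushing, here is a small Python equivalent of the CI step. It assumes you run it from the repository root; the script name and wording are just suggestions:

import glob
import sys

# Mirror the CI step: fail (exit 1) if any flagged_event_*.txt files exist in the repo root.
flagged_files = glob.glob("flagged_event_*.txt")
if flagged_files:
    print(f"Found {len(flagged_files)} flagged file(s): {flagged_files}")
    sys.exit(1)

print("No flagged files. The CI check would pass.")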

Enable CircleCI for your repo

Go to CircleCI.com and log in. In CircleCI’s dashboard, click Projects (it’s in the left sidebar). Find your repo in the list and select Set Up Project.

Note: If your repo doesn’t appear, ensure your GitHub settings allow CircleCI access to it.

CircleCI should detect your .circleci/config.yml in the main branch and automatically attempt a pipeline. You’ll see it in the CircleCI dashboard.

Next, run your chatbot and enter a “harmful” prompt that triggers a flagged event. When the chatbot sees flagged content, it commits a new file (flagged_event_20231008_134500.txt, for example).


CircleCI sees the new commit on main. The check_flagged_files job finds that file and fails the pipeline. In the CircleCI dashboard, you’ll see a new workflow run for “build_flagged_files.”


The pipeline fails (red build) if that flagged file is present; it passes (green) if none are found.

This gives you a clear signal that harmful content was generated – everyone on the team sees the failing build in CircleCI’s dashboard, and you can take action.
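The sample project doesn’t prescribe a cleanup step, but since the check simply counts flagged files in the repo, one reasonable follow-up (a suggestion, not part of the tutorial’s pipeline) is to remove them once the incident has been reviewed, which turns the next build green again:

git rm flagged_event_*.txt
git commit -m "Remove reviewed flagged-content files"
git push origin main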

Conclusion

Congratulations! You’ve now built a simple GPT‑3.5 chatbot that checks every piece of text – both user input and model output – for disallowed content using the OpenAI Moderation Endpoint. You’ve also automated it with CircleCI so that whenever something slips through the cracks, a flagged file is committed, triggering a failing pipeline. This ensures your entire team is immediately aware that harmful or offensive output was generated.

You can find the complete sample project here: LLM Moderation Repo.
