
Mar 4, 2025 • 15 min read

Building a customer churn detection system with Hugging Face and CircleCI

Armstrong Asenavi

Full Stack Engineer

Losing a customer to a competitor can be costly; customer retention is vital for business success and growth. Businesses must anticipate when and why a customer might leave, so they can implement measures to retain them. One solution might be to build a system that predicts churn. But can it be done?

Using machine learning (ML) techniques to analyze customer service interactions can provide valuable insight into customer sentiment. This post analyzes both structured data and unstructured data to predict churn risk.

In this tutorial, you’ll learn how to build an automated customer churn prediction system. You will use Hugging Face Transformers to process the text data and CircleCI for continuous integration and delivery (CI/CD) pipelines. You will serve predictions automatically through a FastAPI endpoint.

By the end of this tutorial, you will have built an app that tells support agents whether a customer might leave.

Let’s get started building a system that will help businesses retain their customers.

Prerequisites

Before starting this tutorial, you should have:

If you’re missing any of these prerequisites, take a moment to set them up using the links. You don’t need to be an expert in any of these tools – this tutorial will guide you through each step of the process.

Setting up your customer churn prediction environment

In this section, you will set up your tools and learn about the dataset. 

Set up the development environment

First, run this command to create a project folder:

mkdir churn_prediction_app
cd churn_prediction_app

Next, run this command to clone the GitHub repo and change into the cloned folder:

git clone https://github.com/CIRCLECI-GWP/churn-predictor-app.git
cd churn-predictor-app

Now, create a Python virtual environment and activate it.

python -m venv churn_env
churn_env\Scripts\activate  # Windows; on macOS and Linux use `source churn_env/bin/activate`

Install the packages:

pip install -r requirements.txt

After setting up the Python environment and adding packages, you can start building your system to predict churn. In the next section, you will explore the telecom churn dataset to identify some patterns.

Overview of the dataset (features, label distribution, and data characteristics)

This tutorial uses a fictional dataset for a telecom company that provides phone and internet services to customers in California. The phone conversation transcripts were generated using the GPT-2 model.

The dataset has 35 attributes (features), including:

  • Gender - Whether the customer is male or female
  • Senior citizen - Whether the customer is 65 or over
  • Partner - Whether the customer has a partner
  • Dependents - Whether the customer has dependents
  • Tenure months - The customer’s tenure with the company
  • Phone service - Whether the customer subscribes to home phone service
  • Multiple lines - Whether the customer subscribes to multiple phone lines
  • Internet service - Whether the customer subscribes to internet service
  • Conversation - The synthetic GPT-2 generated customer-agent ticket conversations
  • Churn label - Whether the customer left the service or not

Find more details on the attributes on IBM Cognos Analytics. The dataset for this post is available in the data folder from the GitHub repo. 

The conversation column contains synthetic text generated with GPT-2, which shows how you could apply this solution to real-world customer support ticket interactions and phone conversations. GPT-2 is a powerful unsupervised transformer language model built by OpenAI.

You can run the code in VS Code or your favorite code editor. In the root folder, create a new file with the “.ipynb” extension or open the experiments.ipynb notebook file in the folder. VS Code will recognize it as a Jupyter Notebook and prompt you to select a Python kernel. Choose the virtual environment you created earlier. Now you can create and run code cells using the interface.

Run this code to load the data and view basic distributions:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Load the dataset
df = pd.read_csv('./data/Telco_customer_churn_with_text.csv')
### Initial exploration
print("Dataset shape:", df.shape)
print("\nFeature distribution:\n", df.describe())
print("\nMissing values:\n", df.isnull().sum())

Take a moment to examine the distribution of your target variable (churn):

print("Churn distribution:")  # Print table title
print(df['Churn Label'].value_counts(normalize=True))

Variable distribution

The output summarizes the distribution of variables and the shape of the dataframe.

The figure below plots the distribution of the target variable (Churn Label). Notice that the dataset is imbalanced, with 26.5% churners and 73.5% retained customers. This class imbalance is typical for churn prediction problems, since most customers do not churn.

Churn distribution
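If you want to reproduce a similar plot yourself, here is a minimal sketch using seaborn (this assumes seaborn is installed and that df is the DataFrame loaded above):

import matplotlib.pyplot as plt
import seaborn as sns

# Plot the counts of churned vs. retained customers
ax = sns.countplot(data=df, x="Churn Label")
ax.set_title("Churn distribution")
plt.show()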

Next, you will learn how to preprocess this data and engineer meaningful features. 

Preparing and analyzing customer churn data

Now that you have loaded your data, the next crucial step is preprocessing. Even the most sophisticated ML model can crumble under poor-quality input data. You need to clean your data and create meaningful features that will help you build a robust model.

Data preprocessing

The Python script data_preprocessing.py cleans and preprocesses your data. Run it in the terminal:

python data_preprocessing.py

This generates a processed data file (processed_telco_data.csv) in the data folder and a report in the outputs folder on missing values, duplicates, and data types.

# data_preprocessing.py
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, PowerTransformer
from sklearn.impute import SimpleImputer
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class DataPreparation:
    def __init__(self):
        self.label_encoders = {}
        self.scaler = StandardScaler()
        self.power_transformer = PowerTransformer()
        self.imputer = SimpleImputer(strategy='median')

    def load_data(self, file_path):
        """Load and validate the dataset."""
        try:
            df = pd.read_csv(file_path)
            logger.info(f"Successfully loaded dataset with shape: {df.shape}")
            return df
        except Exception as e:
            logger.error(f"Error loading data: {e}")
            raise

    def validate_data(self, df):
        """Perform basic data validation checks."""
        validation_report = {
            'missing_values': df.isnull().sum().to_dict(),
            'duplicates': df.duplicated().sum(),
            'data_types': df.dtypes.to_dict()
        }

        # Check for invalid values in important columns
        if 'Monthly Charges' in df.columns:
            validation_report['negative_charges'] = (df['Monthly Charges'] < 0).sum()

        if 'Tenure Months' in df.columns:
            validation_report['invalid_tenure'] = (df['Tenure Months'] < 0).sum()

        logger.info("Data validation completed")
        return validation_report

Feature engineering

The data_preprocessing.py script also shows how you can create and transform features. Which transformations you apply depends on your use case; you can make the pipeline as complex or as simple as you want. For a comprehensive list of feature transformation techniques, visit SKLearn Dataset Transformations.

The data_preprocessing.py script creates new features that capture key patterns in customer behavior and encode business insights into the model. For example (a pandas sketch follows this list):

  • Customer value features (Revenue_per_Month, Average_Monthly_Charges, and Charges_Evolution)
  • Service usage (Total_Services)
  • Customer segments (Value_Segment) - grouped as low, medium, high, and premium
  • Contract risk score - customers with month-to-month contracts are more likely to churn
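The exact transformations live in data_preprocessing.py. As a rough illustration only, features like these could be derived with pandas; the formulas, risk weights, and bin thresholds below are assumptions, not the script's actual values:

import pandas as pd
import numpy as np

df = pd.read_csv("./data/Telco_customer_churn_with_text.csv")

# Customer value features
df["Total Charges"] = pd.to_numeric(df["Total Charges"], errors="coerce")
df["Revenue_per_Month"] = df["Total Charges"] / df["Tenure Months"].replace(0, np.nan)
df["Charges_Evolution"] = df["Monthly Charges"] - df["Revenue_per_Month"]

# Service usage: count the add-on services a customer subscribes to
service_cols = ["Online Security", "Online Backup", "Device Protection",
                "Tech Support", "Streaming TV", "Streaming Movies"]
df["Total_Services"] = (df[service_cols] == "Yes").sum(axis=1)

# Customer segments based on monthly charges (quartile bins are an assumption)
df["Value_Segment"] = pd.qcut(df["Monthly Charges"], q=4,
                              labels=["low", "medium", "high", "premium"])

# Contract risk score: month-to-month contracts churn most often
risk_map = {"Month-to-month": 2, "One year": 1, "Two year": 0}
df["Contract_Risk_Score"] = df["Contract"].map(risk_map)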

To visualize features, run feature_analysis.py in a Jupyter notebook. 

Feature distribution

The image shows the distribution of numeric variables, splitting each metric between churned (Yes) and retained (No) customers. Customers who churn often show distinct behavioral patterns, for example higher monthly charges and shorter tenure.

Feature correlations

For categorical variables, customers who are more likely to churn have common features:

  • Month-to-month contracts
  • Electronic check payments
  • No extra services (like online security or device protection)

Categorical features

You can get more information on the features using mutual information. Contract-related features such as Contract_Encoded, Contract_Risk_Score, and Tenure_Months have high importance. Demographic factors like Gender_Encoded show lower predictive power.
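If you want to compute mutual information scores yourself, here is a minimal sketch; it assumes the processed file contains the encoded, numeric feature columns plus a Yes/No Churn Label column:

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("./data/processed_telco_data.csv")

y = df["Churn Label"].map({"Yes": 1, "No": 0})
# Keep only numeric (already encoded) feature columns
X = df.select_dtypes(include="number").fillna(0)

mi_scores = mutual_info_classif(X, y, random_state=42)
print(pd.Series(mi_scores, index=X.columns).sort_values(ascending=False).head(10))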

Feature Importance

Basic modeling (training a baseline classifier)

Using customer profile features, you can train a baseline model. For this tutorial, use the RandomForestClassifier, select features using the Boruta algorithm, and tune hyperparameters with GridSearchCV.
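Feature selection happens inside baseline_model.py. As a rough sketch of how Boruta selection typically looks with the boruta package (X and y here are assumed to be your numeric feature matrix and churn labels):

import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

# Boruta needs a tree-based estimator and works on NumPy arrays
rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=42)
selector = BorutaPy(rf, n_estimators="auto", random_state=42)
selector.fit(X.values, y.values)

# Keep only the features Boruta confirmed as important
selected_features = X.columns[selector.support_].tolist()
print("Selected features:", selected_features)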

Implement the baseline model by running baseline_model.py in the terminal or notebook:

python baseline_model.py

This code snippet shows how you build the model, tune hyperparameters, and select the best model.

# baseline_model.py (excerpt)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define model
rf = RandomForestClassifier(random_state=42)

# Hyperparameter tuning
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
}
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

# Best model
best_model = grid_search.best_estimator_

Baseline model classification report

The baseline model has an accuracy of 0.80. Adding textual features from customer conversations can improve ML performance.

Implementing text analysis with HuggingFace Transformers

This tutorial uses Hugging Face to process unstructured text data (agent-customer conversations).

Set up the Hugging Face Transformers library

Getting started with the Hugging Face Transformers library is straightforward. Hugging Face gives you access to thousands of pre-trained models, so you don’t have to train them yourself.

The Hugging Face Transformers library is a rich collection of pre-trained models for various NLP tasks, such as sentiment analysis, text classification, and question answering. You can access the models through the AutoModelForSequenceClassification and AutoTokenizer classes, and you can explore the available models on the Hugging Face model hub.

In the next section, you will learn how to load a pre-trained sentiment analysis model. You will apply it to the customer_text column to generate customer sentiments. 

Implement sentiment analysis for support tickets

You can generate customer sentiments for the dataset by running data_preparation.py:

python data_preparation.py
Copy to clipboard

The extract_sentiment() function from the DataPreparation class, shown in the snippet below, extracts customer sentiment from the customer_text column.

# In data_preparation.py (imported at the top of the file):
# from transformers import AutoTokenizer, AutoModelForSequenceClassification

def extract_sentiment(self, text_column, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    """Extracts sentiment from a text column and returns the updated DataFrame."""
    # Load model and tokenizer once
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

The transformers library loads two essential components:

  • AutoTokenizer - converts text into a format the model understands
  • AutoModelForSequenceClassification - makes the sentiment prediction

For this tutorial, you’re using DistilBERT, a faster, lighter version of the BERT model, for sentiment analysis. The get_sentiment() helper processes a customer’s text and returns a prediction of either positive or negative sentiment.
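For illustration, here is a minimal sketch of the same idea using the high-level pipeline API. The file and column names mirror those used in this tutorial, but this is not the exact code from data_preparation.py:

import pandas as pd
from transformers import pipeline

# Load the DistilBERT sentiment model once
sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

df = pd.read_csv("./data/processed_telco_data.csv")

# Truncate long conversations to fit the model's 512-token limit
results = sentiment(df["customer_text"].fillna("").tolist(), truncation=True)
df["sentiment"] = [r["label"] for r in results]        # POSITIVE / NEGATIVE
df["sentiment_score"] = [r["score"] for r in results]  # model confidence
print(df[["customer_text", "sentiment", "sentiment_score"]].head())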

Build feature extractors using sentence transformers

Text embeddings are numerical representations of text that capture semantic meaning; in other words, embedding converts words into numbers that represent what they mean. Embeddings are important for tasks like similarity search and clustering.

Hugging Face sentence transformers are handy for building feature extractors. You will use the sentence transformers to generate embeddings as inputs for the churn prediction task. The approach allows you to use the pre-trained model’s rich semantic information. In this post, you will use the all-mpnet-base-v2 model.

The all-mpnet-base-v2 sentence-transformers model maps each customer conversation to a 768-dimensional dense vector. This post then applies PCA to reduce those embeddings to 10 features.
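A minimal sketch of this step, assuming the same file and column names as above (the real logic lives in generate_model_data.py), could look like this:

import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

df = pd.read_csv("./data/processed_telco_data.csv")

# Map each conversation to a 768-dimensional embedding
model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(df["customer_text"].fillna("").tolist())

# Reduce the embeddings to 10 components, as in this tutorial
pca = PCA(n_components=10, random_state=42)
text_features = pca.fit_transform(embeddings)

# Attach the reduced embeddings as new feature columns
for i in range(text_features.shape[1]):
    df[f"text_emb_{i}"] = text_features[:, i]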

Combine with structured features

Combine the customer profile features with the text features like this:

python generate_model_data.py

This creates a model_data.csv file in the data folder. Because this file is already included in the repo, you can skip this step and jump right into model building.

Building your customer churn prediction system

The model building process (a scikit-learn sketch follows this list):

  • Imputes numerical features, replacing missing values with the mean
  • One-hot encodes categorical features
  • Generates embeddings for textual features
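generate_model_data.py implements these steps. As a rough sketch of the structured-data part with scikit-learn (the column lists here are assumptions based on the dataset):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

numeric_cols = ["Tenure Months", "Monthly Charges", "Total Charges"]
categorical_cols = ["Contract", "Payment Method", "Internet Service"]

preprocessor = ColumnTransformer(
    transformers=[
        # Impute numerical features with the mean
        ("num", SimpleImputer(strategy="mean"), numeric_cols),
        # One-hot encode categorical features
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)

df = pd.read_csv("./data/processed_telco_data.csv")
structured_features = preprocessor.fit_transform(df[numeric_cols + categorical_cols])
# The text embeddings (see the sentence-transformers sketch above) are then
# concatenated to these structured features before training.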

Data processing pipeline

Design the risk-scoring algorithm to predict customer churn

Now you are ready to build and train a classifier to predict whether a customer will churn. The model outputs both a class (Yes/No) and the probability of churning, which is crucial for designing retention strategies.

You will use the scikit-learn RandomForestClassifier model. It has many hyperparameters to choose from; use grid search to optimize them for more accurate predictions.

Here is the model:

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define model
rf = RandomForestClassifier(random_state=42)

# Hyperparameter tuning
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
}
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

# Best model
best_model = grid_search.best_estimator_
joblib.dump(best_model, "churn_model.pkl")

Model results

The results show a slight dip in accuracy compared to the baseline model. You could improve performance by generating higher-quality synthetic conversations or, better yet, by using a dataset of actual agent-customer conversations.

Create the main prediction pipeline

After building, training, and saving the model as churn_model.pkl, you can load it to make predictions. You will use prediction_pipeline.py together with the FastAPI endpoint defined in main.py. The predict_churn function serves as the main prediction pipeline, taking three parameters:

  • data: The customer data for prediction
  • model_path: Path to your saved ML model
  • scaler_path: Path to your saved scaler

The pipeline accepts different data formats:

  • a pandas DataFrame
  • a list of dictionaries,
  • a single dictionary for one-off predictions

The pipeline follows a clear sequence (a simplified sketch appears after the next list):

  • Preprocesses the data using the data_preparation.py module
  • Scales the data using a pre-trained scaler
  • Generates predictions using the pre-trained model
  • Returns both probabilities and binary predictions (1 = will churn, 0 = won’t churn)

The pipeline implements error checking for common issues:

  • Invalid data formats
  • Missing files (model or scaler not found)
  • Missing columns
  • Unexpected errors 
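Putting this together, here is a simplified sketch of what predict_churn might look like. The defaults and comments below are assumptions; prediction_pipeline.py in the repo is the reference implementation:

import joblib
import pandas as pd

def predict_churn(data, model_path="churn_model.pkl", scaler_path="scaler.pkl"):
    """Return binary churn predictions and churn probabilities for customer data."""
    # Accept a DataFrame, a list of dicts, or a single dict
    if isinstance(data, dict):
        data = [data]
    if not isinstance(data, pd.DataFrame):
        data = pd.DataFrame(data)

    try:
        model = joblib.load(model_path)
        scaler = joblib.load(scaler_path)
    except FileNotFoundError:
        raise FileNotFoundError("Model or scaler file not found")

    # In the real pipeline, data_preparation.py handles encoding, sentiment, and
    # embeddings first; here we assume `data` already holds the numeric model features.
    features = scaler.transform(data)

    probabilities = model.predict_proba(features)[:, 1]
    predictions = (probabilities >= 0.5).astype(int)  # 1 = will churn, 0 = won't churn
    return predictions, probabilities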

The next section shows a screenshot of how FastAPI works.

Create a FastAPI endpoint for serving the model

FastAPI is one of the fastest ways to deploy ML models, and it makes it easy to validate incoming data against specific data types. The FastAPI package runs on a Uvicorn server; you already installed both when setting up the project.

The first step is to define the format of the data that you will provide to the model to generate predictions. This is important because the model works with both numeric and text data. You want to feed the model with correct data.

Define the request body (the data sent from the client to the API) using BaseModel, a Pydantic class. See the main.py module for details. The response body is the data sent from the API back to the client.

In summary:

  • BaseModel defines the structure and validation rules for request data.
  • CustomerData specifies the expected input, mapping attributes to request fields.
  • Field specifies default values and aliases.
  • Config with allow_population_by_field_name allows the use of both attribute names and aliases.

The proper definition of inputs and outputs simplifies handling of API requests.
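As an illustration of how main.py might wire this together (the field list is abbreviated and the handler body simplified; the repo's main.py is the reference):

from fastapi import FastAPI
from pydantic import BaseModel, Field

from prediction_pipeline import predict_churn  # assumed import path

app = FastAPI()

class CustomerData(BaseModel):
    gender: str = Field(..., alias="Gender")
    tenure_months: int = Field(..., alias="Tenure Months")
    monthly_charges: float = Field(..., alias="Monthly Charges")
    customer_text: str
    # ... the remaining dataset fields follow the same pattern

    class Config:
        allow_population_by_field_name = True

@app.post("/predict")
def predict(customer: CustomerData):
    data = customer.dict(by_alias=True)
    predictions, probabilities = predict_churn(data)
    return {
        "churn_prediction": int(predictions[0]),
        "churn_probability": float(probabilities[0]),
    }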

Open the terminal in the root directory (with the virtual environment activated) and run:

uvicorn main:app --reload

Go to http://127.0.0.1:8000.

FastAPI welcome screen

Open http://127.0.0.1:8000/docs in your browser to test your API. 

This will open a Swagger UI.

Swagger UI home page

Click the POST button.

Swagger UI try it out button

Click Try it Out and enter your prediction data.

Swagger UI running predictions

Enter the values and click Execute. The predictions are shown under the response body section.

For example, if you enter this data:

{
  "Gender": "Female",
  "Senior Citizen": "No",
  "Partner": "Yes",
  "Dependents": "No",
  "Tenure Months": 24,
  "Phone Service": "Yes",
  "Multiple Lines": "No",
  "Internet Service": "DSL",
  "Online Security": "Yes",
  "Online Backup": "No",
  "Device Protection": "Yes",
  "Tech Support": "No",
  "Streaming TV": "Yes",
  "Streaming Movies": "No",
  "Contract": "Month-to-month",
  "Paperless Billing": "Yes",
  "Payment Method": "Electronic check",
  "Monthly Charges": 65.6,
  "Total Charges": 1576.45,
  "customer_text": "Your internet is horrible these days!"
}

You should get a class response and the probability as shown in the following image.

Response and probability

The customer is unlikely to churn, with a churn probability of 31.9%.

You have successfully deployed your ML model as an API using FastAPI.

Setting up CI/CD for churn prediction with CircleCI

When you’re building an ML application, it is crucial to have a reliable CI/CD pipeline. Let’s walk through how to set up CircleCI’s CI/CD for your churn prediction app.

Create a Dockerfile

The structure of the Dockerfile is:

FROM python:3.12.9-bookworm
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy model and application files
COPY churn_model.pkl .
COPY scaler.pkl .
COPY boruta_features.pkl .
COPY main.py .
COPY prediction_pipeline.py .
COPY data_preparation.py .
COPY quantile_bins.pkl .
# Security best practice: run as non-root user
RUN useradd -m appuser && chown -R appuser /app
USER appuser
# Run the FastAPI application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

This Dockerfile creates a containerized environment for the FastAPI-based churn app:

  • FROM python:3.12.9-bookworm - Uses a Python 3.12.9 image based on Debian Bookworm.
  • WORKDIR /app - Sets /app as the working directory inside the container.
  • COPY requirements.txt . - Copies the dependency list into the container.
  • RUN pip install --no-cache-dir -r requirements.txt - Installs dependencies efficiently.
  • COPY ... - Copies the model and application files into the container.
  • RUN useradd -m appuser && chown -R appuser /app - Adds a non-root user (appuser) for better security.
  • USER appuser - Runs the container as appuser instead of root.
  • CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"] - Starts the FastAPI app using Uvicorn on port 8000.

The container provides a secure, self-contained, and reproducible environment for deploying the app.

Create a CircleCI configuration

First, you’ll need a .circleci/config.yml file in your project; you will find it in the cloned repo. The config.yml file tells CircleCI what to do with your code. Here is a snippet from config.yml:

version: 2.1
jobs:
  build-and-test:
    docker:
      - image: cimg/python:3.12.9
    steps:
      - checkout
      - setup_remote_docker:
          docker_layer_caching: true
      # Rest of the configuration...

Here is a breakdown of each part of the pipeline:

  1. Environment setup:
    • Your pipeline runs in a Python 3.12 Docker container
    • You will use CircleCI’s remote Docker engine to build a Docker image for your application
  2. Dependency management:
- restore_cache:
    keys:
      - v1-dependencies-{{ checksum "requirements.txt" }}
- run:
    name: Install Dependencies
    command: |
      python -m venv venv
      . venv/bin/activate
      pip install --no-cache-dir -r requirements.txt

The first part of the CI/CD pipeline runs these tasks:

  • Creates a virtual environment for your project
  • Caches your dependencies to speed up future builds
  • Installs all required packages from requirements.txt
  3. Testing:
- run:
    name: Run Tests
    command: |
      . venv/bin/activate
      pytest tests/ -v

This part:

  • Runs your software testing suite using pytest
  • Fails the build if any tests fail
  4. Docker image building and publishing:
- run:
    name: Build and Push Docker Image
    command: |
      docker build -t $DOCKER_USERNAME/churn-prediction:latest .
      echo $DOCKER_PASSWORD | docker login -u $DOCKER_USERNAME --password-stdin
      docker push $DOCKER_USERNAME/churn-prediction:latest

The command:

  • Builds your Docker image
  • Logs into Docker Hub using credentials stored in CircleCI
  • Pushes the image to your Docker Hub repository

Managing Environments

For your CI/CD pipeline to work properly, you will need to set up a few things in CircleCI:

  1. Create a CircleCI context called docker-hub-creds:

Learn how to create and use contexts for your organization.

Next, click the Add Environment Variables button within the CircleCI context, and configure your Docker Hub credentials:

  • DOCKER_USERNAME: Your Docker Hub username
  • DOCKER_PASSWORD: Your Docker Hub password or access token
  2. The pipeline uses these variables automatically through the context:
workflows:
  build-test-deploy:
    jobs:
      - build-and-test:
          context:
            - docker-hub-creds

Now, every time you push code to your repository:

  • CircleCI automatically runs your tests
  • If tests pass, it builds a new Docker image
  • It pushes the image to Docker Hub

You can monitor all of this in the CircleCI dashboard.

CircleCI dashboard

This setup ensures thorough testing and packaging of your churn prediction app before deployment, helping you maintain high quality and reliability in your ML apps.

Deploying and monitoring your customer churn system

In this section, you will learn how to build and test your container locally.

Writing and running tests

Your tests are organized in the tests/ folder with two main files:

  • test_api.py: Tests your FastAPI endpoints (a sketch of such a test appears below)
  • test_prediction.py: Tests your prediction pipeline logic
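For illustration, a minimal API test could use FastAPI's TestClient like this (a sketch only; the repo's tests and the exact response keys may differ):

# tests/test_api.py (illustrative sketch)
from fastapi.testclient import TestClient
from main import app

client = TestClient(app)

def test_predict_endpoint():
    payload = {
        "Gender": "Female",
        "Senior Citizen": "No",
        "Partner": "Yes",
        "Dependents": "No",
        "Tenure Months": 24,
        "Phone Service": "Yes",
        "Multiple Lines": "No",
        "Internet Service": "DSL",
        "Online Security": "Yes",
        "Online Backup": "No",
        "Device Protection": "Yes",
        "Tech Support": "No",
        "Streaming TV": "Yes",
        "Streaming Movies": "No",
        "Contract": "Month-to-month",
        "Paperless Billing": "Yes",
        "Payment Method": "Electronic check",
        "Monthly Charges": 65.6,
        "Total Charges": 1576.45,
        "customer_text": "Your internet is horrible these days!"
    }
    response = client.post("/predict", json=payload)
    assert response.status_code == 200
    # The key name below is an assumption matching the endpoint sketch above
    assert "churn_probability" in response.json()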

To run your tests, simply use:

pytest tests/ -v

Pytest results

Tests help to verify that both your API and prediction logic work properly before deployment.

Building and testing your Docker Container locally

You used CircleCI to build and push the Docker image to your Docker Hub repository. To test the container locally, first pull the image by running this command on your machine:

# Pull the latest image
docker pull yourusername/churn-prediction:latest


After pulling the image, run this command to test it locally:

docker run -d -p 8000:8000 yourusername/churn-prediction:latest

Monitor your deployment by running:

docker ps
docker logs <container_id>

To test the deployed endpoint, open your browser and go to: http://127.0.0.1:8000/docs (Swagger UI).

If you prefer, use curl to test it:

curl -X 'POST' \
'http://127.0.0.1:8000/predict' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"Gender": "Female",
"Senior Citizen": "No",
"Partner": "Yes",
"Dependents": "No",
"Tenure Months": 24,
"Phone Service": "Yes",
"Multiple Lines": "No",
"Internet Service": "DSL",
"Online Security": "Yes",
"Online Backup": "No",
"Device Protection": "Yes",
"Tech Support": "No",
"Streaming TV": "Yes",
"Streaming Movies": "No",
"Contract": "Month-to-month",
"Paperless Billing": "Yes",
"Payment Method": "Electronic check",
"Monthly Charges": 65.6,
"Total Charges": 1576.45,
"customer_text": "Your internet is horrible these days!"
}'

Your churn prediction system is now ready for use. You have successfully tested, containerized, and deployed it. Remember to monitor and update your app regularly.

Conclusion

Congratulations! You have successfully built a complete ML pipeline for predicting customer churn. You have learned how to:

  • Build a complete churn prediction model
  • Create a FastAPI app that serves real-time predictions
  • Build a CI/CD pipeline using CircleCI
  • Containerize your application so it is ready for deployment
  • Automate testing and quality checks

CircleCI’s CI/CD tools help you run tests and build your app automatically. Automating these steps saves you time and prevents mistakes.

Try it yourself

Want to build your own ML pipeline? CircleCI offers a generous free tier to get you started:

  1. Sign up at CircleCI
  2. Connect your repository
  3. Specify your workflows in .circleci/config.yml
  4. Push your code to automate your CI/CD pipelines

Now you know how to get your ML models into production with CircleCI’s CI/CD. Fork this project, sign up for CircleCI, and start building right away.

Have a question or need help? Drop a comment or join the CircleCI community.


Armstrong Asenavi is a detail-oriented full stack AI engineer and technical writer. He draws on his experience to translate complex AI/ML concepts into accessible content and documentation.
