Tutorials | Apr 1, 2025 | 15 min read

Benchmarking Kotlin Coroutines performance with CircleCI

Hangga Aji Sayekti

Software Engineer


A benchmark can be interpreted as a standard of comparison used to assess something.

In everyday life, for example, when we want to buy a new cellphone and want to know which one is faster, we can see the speed test (benchmark) by measuring how fast the cellphone opens applications or runs games. From there, we can compare which cellphone is better based on the numbers produced.

In the context of programming, a benchmark is the process of testing and measuring the performance of a system, software, method, or algorithm based on certain metrics. Benchmarks are used to compare the performance of various implementations, identify bottlenecks, and optimize code to be more efficient.

In this post, we will benchmark the performance of Kotlin coroutines across various Dispatchers. The benchmark will show how coroutines perform under each Dispatcher, and we will also compare them against a sequential process, which refers to running the same logic in the normal way, without coroutines.

Prerequisites

To follow along with this tutorial, you will need the following:

Why coroutines benchmarking belongs in CI/CD

Benchmarking coroutines can be done locally on your own computer. However, in my experience, it is better to do it in CI/CD. Why? Here are the reasons:

  • Consistency
  • Similar to production
  • Early detection
  • Automated benchmarking and performance tracking
  • Multi-environment testing

Consistency

If we run the test on our own computers, the results may vary because everyone uses a laptop or PC with different specifications: some have powerful processors, others more modest ones. CI/CD uses servers with more stable conditions, so the test results are more reliable. Benchmark results on a personal machine can also be skewed by other applications running at the same time, such as browsers, IDEs, or other programs. In CI/CD, the test runs in a cleaner environment that is not disturbed by this kind of noise.

Similar to production

A CI/CD pipeline typically runs on servers that are much closer to the environment where the application runs in the real world (the production environment). Testing in CI/CD therefore produces results that are more relevant to the actual conditions after the application is released.

Early detection

If a code change makes the application slower, CI/CD can detect it early, before the change is merged and deployed. That way, you can catch a performance problem before it reaches users.

Automated benchmarking and performance tracking

With CI/CD, benchmark results will be stored over time, so we can see if there is an increase or decrease in performance. Imagine if this was done manually on each computer; it would be difficult to track changes from one version to the next.

Multi-environment testing

CI/CD allows testing on multiple versions of Java, various operating systems, or different types of servers. On our own computers, we can usually test only one particular configuration.

So benchmarking coroutines in CI/CD pipelines ensures more accurate, stable, and reproducible results, reduces the influence of external factors, and allows for early detection of performance regressions. Additionally, we can automatically compare results between versions and ensure optimal performance before releasing to production.

Choosing a benchmarking framework

Selecting the appropriate benchmarking framework depends on several factors, such as the programming language, type of testing, and specific application needs.

Types of benchmarking

Before diving into specific frameworks, it is essential to understand the two main types of benchmarking in the context of performance programming:

  • Microbenchmarking: Measures the performance of specific functions or algorithms with high precision. This type of benchmarking is useful for evaluating small, isolated code snippets while accounting for compiler optimizations.
  • Multi-threaded benchmarking: Evaluates how an application handles concurrent execution, making it crucial for multi-threaded systems such as servers, applications with thread pools, or programs that rely on asynchronous processing.

While these two categories are the primary focus in performance programming, benchmarking can also encompass broader areas such as system benchmarking, application benchmarking, and load testing, depending on the project’s specific needs. The sections below cover:

  • Microbenchmarking frameworks
  • Multi-threaded benchmarking frameworks
  • Benchmarking in Kotlin
  • Example use cases

Microbenchmarking frameworks

For microbenchmarking, frameworks such as JMH (Java), BenchmarkDotNet (.NET), and timeit (Python) are suitable because they have low overhead and support optimizations like JIT (Just-In-Time) compilation and GC (Garbage Collection) tuning.

Note: To learn more about JIT optimizations and GC tuning, refer to the official JVM documentation on those topics.

Multi-threaded benchmarking frameworks

For multi-threaded applications, frameworks such as JMH and Gatling support concurrent execution testing, providing an accurate picture of how an application handles parallel processes. These benchmarks are essential for evaluating system performance in multi-threaded environments.

Benchmarking in Kotlin

Because we are using Kotlin, the most suitable choice is kotlinx-benchmark, which leverages the JMH framework to execute benchmarks on the JVM.

To automate these benchmarks in a CI/CD pipeline, we will use CircleCI. Automated benchmarking ensures consistent performance tracking over time and helps identify regressions early in the development cycle.

Example use cases for dispatchers

Kotlin’s coroutines offer different dispatchers for handling concurrent workloads efficiently:

  • Sequential Execution → Processes transactions one by one in blocking mode.
  • Dispatchers.Default → Uses a CPU-optimized thread pool for parallel execution, suitable for computationally intensive tasks.
  • Dispatchers.IO → Uses an IO-optimized thread pool, commonly used for database reads, file operations, and network calls.
  • Dispatchers.Unconfined → Not confined to any particular thread; starts in the caller’s thread and may resume on a different one.
  • Parallel Execution (async-await) → Uses async-await with Dispatchers.Default to process transactions in parallel efficiently.
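
To make these differences concrete, here is a minimal, hypothetical sketch (not part of the benchmark project we build below) that runs the same kind of work on each dispatcher:

import kotlinx.coroutines.*

fun main() = runBlocking {
    // CPU-bound work on the Default dispatcher (thread pool sized to CPU cores)
    launch(Dispatchers.Default) { println("Default runs on ${Thread.currentThread().name}") }

    // Blocking I/O-style work on the IO dispatcher
    launch(Dispatchers.IO) { println("IO runs on ${Thread.currentThread().name}") }

    // Unconfined starts in the caller's thread and may resume elsewhere
    launch(Dispatchers.Unconfined) { println("Unconfined runs on ${Thread.currentThread().name}") }

    // Parallel execution with async-await on the Default dispatcher
    val squares = (1..4).map { n ->
        async(Dispatchers.Default) { n * n } // each async runs in parallel
    }.awaitAll()
    println("Squares: $squares")
}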

By considering these factors, developers can choose the most appropriate benchmarking framework and ensure their application maintains optimal performance under different conditions.

Benchmarking scenario

To measure the performance of coroutines, we need to write accurate and reliable benchmark code. This process involves executing coroutines in various scenarios to observe their execution time and efficiency. In this section, we will discuss how to write proper benchmarking code to systematically evaluate the performance of coroutines.

When benchmarking coroutines, it is important to compare different execution approaches to understand their performance differences. One effective way is to test several Dispatchers and compare them to a sequential approach. By doing this, we can evaluate which is more efficient in a given scenario and identify the trade-offs between parallelism and overhead of each execution strategy. In this section, we will begin writing code to benchmark these approaches using the outlined workflow below:

  1. Benchmark objective: Measure the performance of various coroutine execution strategies in processing transactions asynchronously.

  2. Test scenario:
    • Simulate transaction validation and processing with random latency.
    • Each transaction has a chance of failing (e.g., due to insufficient balance).
  3. Tested methods:
    • Sequential → Processes transactions one by one in blocking mode.
    • Default dispatcher (Dispatchers.Default) → Uses a CPU-optimized thread pool for parallel execution.
    • Input/output dispatcher (Dispatchers.IO) → Takes advantage of an IO-optimized thread pool, usually for heavy I/O operations.
    • Unconfined dispatcher (Dispatchers.Unconfined) → Not tied to a specific thread, as pointed out earlier.
    • Parallel execution (async-await) → Uses async-await with Dispatchers.Default to process transactions in parallel.

Expected results

  • Sequential is the slowest because it runs one after another.
  • Default and I/O Dispatchers are faster because they utilize the thread pool.
  • Unconfined Dispatchers can vary depending on the first execution.
  • Parallel execution is usually the most optimal because all transactions run simultaneously.

Benchmark flows

Writing benchmark code for coroutines

Now you can implement the benchmark scenario in Kotlin code.

Create new project

Create a new Kotlin project and make sure the project properties are like this:

  • Project Name: CoroutineBenchmark
  • Build System: Gradle
  • JDK: Version 11 or higher
  • Gradle DSL: Kotlin

New project

The project structure will be similar to:

CoroutineBenchmark/
├── gradle/
│   └── wrapper/
│       ├── gradle-wrapper.jar
│       └── gradle-wrapper.properties
├── src/
│   └── main/
│       └── kotlin/
├── build.gradle.kts
├── gradle.properties
├── gradlew
├── gradlew.bat
└── settings.gradle.kts

Set up the kotlinx.benchmark toolkit

We need to set up kotlinx.benchmark and some other required dependencies in Gradle. Please open the build.gradle.kts file and add these snippets.

Set up the required plugins:

plugins {
    kotlin("jvm") version "2.1.10"
    id("org.jetbrains.kotlinx.benchmark") version "0.4.13"
    kotlin("plugin.allopen") version "2.0.20"
}

This contains configuration for the Kotlin project with JMH benchmarking:

  • kotlin("jvm") → Enables Kotlin on JVM. Automatically added when creating a new project with Kotlin.
  • org.jetbrains.kotlinx.benchmark → It is highly recommended to use the latest version currently, which is 0.4.13.
  • kotlin("plugin.allopen") → Makes classes with @State annotation open, as required by JMH.

Why is this needed? Because JMH requires benchmark classes annotated with @State to be non-final, as it needs to generate subclasses for proper benchmarking. Since Kotlin classes are final by default, this configuration ensures that JMH can extend them.
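
For illustration, without the plugin you would have to mark every benchmark class open by hand; MyBenchmark below is a hypothetical example:

// Without the allOpen plugin, each @State class must be opened manually:
@State(Scope.Benchmark)
open class MyBenchmark {
    // benchmark methods...
}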

Also add some libraries we need including coroutines and kotlinx.benchmark to the dependencies list:

dependencies {
    testImplementation(kotlin("test"))
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core:1.7.1")
    implementation("org.jetbrains.kotlinx:kotlinx-benchmark-runtime:0.4.13")
}

This dependencies block includes:

  • testImplementation(kotlin("test")) → Adds Kotlin’s standard testing library.
  • implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core:1.7.1") → Provides support for Kotlin coroutines.
  • implementation("org.jetbrains.kotlinx:kotlinx-benchmark-runtime:0.4.13") → Adds the runtime library for benchmarking with JMH.

Then also add this configuration:

allOpen {
    annotation("org.openjdk.jmh.annotations.State")
}

This is a Gradle Kotlin DSL configuration for the Kotlin allOpen compiler plugin. It makes classes annotated with @State non-final at compile time.

  • allOpen plugin: A Kotlin compiler plugin that removes the final modifier from classes with specific annotations.
  • annotation("org.openjdk.jmh.annotations.State"): Specifies that classes annotated with @State should be open for inheritance.

Finally, designate the main source set as a benchmark target:

benchmark {
    targets {
        register("main")
    }
}

This means that the benchmark will be run on the main code (not test or other modules).
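
If you need finer control, the plugin also provides a configurations block that can override the annotation defaults. Here is a sketch with illustrative values; see the kotlinx-benchmark documentation for the full set of options:

benchmark {
    configurations {
        named("main") {
            warmups = 10          // number of warm-up iterations
            iterations = 10       // number of measurement iterations
            iterationTime = 1     // duration of each iteration
            iterationTimeUnit = "s"
        }
    }
    targets {
        register("main")
    }
}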

Now you can write the benchmark code in Kotlin.

Creating a Kotlin class

In the main directory, src/main/kotlin, create a new class and name it CoroutineBenchmark. You can also create a new package first, for example id.web.hangga, so that the file is located at src/main/kotlin/id/web/hangga/CoroutineBenchmark.kt.

New class

After creating a Kotlin class, add the following annotations:

import kotlinx.benchmark.*
import kotlinx.coroutines.*
import org.openjdk.jmh.annotations.Fork
import org.openjdk.jmh.annotations.Level
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger
import kotlin.random.Random
import kotlin.system.measureTimeMillis

@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 10)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
class CoroutineBenchmark {

}

These JMH annotations configure the benchmark:

  • @State(Scope.Benchmark) → Shares state across all iterations.
  • @Fork(1) → Runs in a fresh JVM once.
  • @Warmup(iterations = 10) → Runs 10 warm-up iterations to stabilize JVM optimizations.
  • @Measurement(iterations = 10, time = 1, timeUnit = SECONDS) → Runs 10 test iterations, each lasting 1 second.

Then, define the state and initialization for the benchmark inside the CoroutineBenchmark class:

private val transactionCount = 100 // Number of transactions to be tested
private val processedTransactions = AtomicInteger(0)
private lateinit var transactions: List<Int>

@Setup(Level.Iteration) // Prepare transactions before each benchmark iteration
fun setup() {
    transactions = List(transactionCount) { it + 1 }
    processedTransactions.set(0)
}

  • transactionCount = 100 → Total transactions to test.
  • processedTransactions = AtomicInteger(0) → Tracks completed transactions safely in a multi-threaded environment.
  • transactions: List<Int> → Holds transaction IDs.
  • @Setup(Level.Iteration) → Runs before each benchmark iteration to reset data.
  • setup() → Initializes transactions (1 to 100) and resets the processed counter.

Then, create a transaction simulation and validate it:

// Simulate transaction validation (e.g., balance verification)
private suspend fun validateTransaction(transactionId: Int): Boolean {
    delay(Random.nextLong(5, 20)) // Simulate validation latency
    return transactionId % 10 != 0 // Simulate some failed transactions (e.g., insufficient balance)
}

// Simulate transaction processing
private suspend fun processTransaction(transactionId: Int) {
    if (validateTransaction(transactionId)) {
        delay(Random.nextLong(10, 50)) // Simulate transaction processing time
        processedTransactions.incrementAndGet()
    }
}

This code simulates transaction validation and processing with delays to mimic real-world latency.

  • validateTransaction(transactionId: Int): Boolean → Checks if a transaction is valid with a simulated latency (5-20 ms). Transactions that are multiples of 10 are considered failed.
  • processTransaction(transactionId: Int) → If the transaction is valid, it undergoes processing with a delay (10-50 ms), and the counter for successfully processed transactions is incremented.

Finally, create the methods that we will benchmark:

@Benchmark
fun sequentialTransactions() = runBlocking {
    val time = measureTimeMillis {
        for (id in 1..transactionCount) {
            processTransaction(id)
        }
    }
    println("Sequential processing time: $time ms")
}

@Benchmark
fun defaultDispatchersTransactions() = runBlocking {
    benchmarkDispatcher("Default", Dispatchers.Default)
}

@Benchmark
fun iODispatchersTransactions() = runBlocking { benchmarkDispatcher("IO", Dispatchers.IO) }

@Benchmark
fun unconfinedDispatchersTransactions() = runBlocking {
    benchmarkDispatcher("Unconfined", Dispatchers.Unconfined)
}

private suspend fun benchmarkDispatcher(name: String, dispatcher: CoroutineDispatcher) {
    val time = measureTimeMillis {
        coroutineScope {
            (1..transactionCount)
                .map { id -> launch(dispatcher) { processTransaction(id) } }
                .joinAll()
        }
    }
    println("Concurrent processing time on $name: $time ms")
}

@Benchmark
fun parallelDispatchersTransactions() = runBlocking { benchmarkDispatcherParallel() }

private suspend fun benchmarkDispatcherParallel() {
    val time = measureTimeMillis {
        // Dispatchers.Default runs on a thread pool sized to the number of CPU cores.
        withContext(Dispatchers.Default) {
            (1..transactionCount)
                .map { id ->
                    async { processTransaction(id) } // Use async to run transactions in parallel
                }
                .awaitAll() // Wait for all async tasks to complete
        }
    }
    println("Parallel processing time: $time ms")
}

This code benchmarks different coroutine execution strategies for processing transactions.

  • sequentialTransactions() → Runs all transactions sequentially using runBlocking, measuring the total execution time.
  • defaultDispatchersTransactions() → Uses Dispatchers.Default, which runs coroutines on a thread pool optimized for CPU-intensive tasks.
  • iODispatchersTransactions() → Uses Dispatchers.IO, optimized for I/O operations like file or network access.
  • unconfinedDispatchersTransactions() → Uses Dispatchers.Unconfined, which starts coroutines in the caller’s thread but may shift to another thread.
  • benchmarkDispatcher(name, dispatcher) → A helper function that runs transactions concurrently on the specified dispatcher and measures execution time.
  • parallelDispatchersTransactions() → Runs transactions in parallel using Dispatchers.Default and async-await to fully utilize CPU cores.

Now you are ready to run the benchmark. To run the benchmark locally, use this command:

./gradlew benchmark

In IntelliJ IDEA, you can also use the right-hand Gradle tool window to run the benchmark task:

Local benchmark

Local result

Local Benchmark Summary

Metric | Value | Description
Average Success | 14.935 ops/s | Average operations per second.
Confidence Interval (99.9%) | [14.858, 15.051] ops/s | Expected throughput range with 99.9% confidence.
Min, Avg, Max | (14.876, 14.935, 15.078) ops/s | Lowest, average, and highest observed values.
Standard Deviation | 0.064 | Indicates stable performance with minimal fluctuations.
Margin of Error (99.9%) | ±0.097 ops/s | Possible deviation in the average throughput.

Local Benchmark Results (Per Test)

No | Benchmark | Mode | Count | Score | Error | Units
1 | defaultDispatchersTransactions | thrpt | 10 | 14.923 | ±0.229 | ops/s
2 | iODispatchersTransactions | thrpt | 10 | 14.877 | ±0.088 | ops/s
3 | parallelDispatchersTransactions | thrpt | 10 | 14.994 | ±0.136 | ops/s
4 | sequentialTransactions | thrpt | 10 | 0.229 | ±0.012 | ops/s
5 | unconfinedDispatchersTransactions | thrpt | 10 | 14.954 | ±0.097 | ops/s

Comparison of local benchmark vs. expected results

No | Benchmark | Local Score (ops/s) | Expected Performance | Matches Expectation?
1 | sequentialTransactions | 0.229 | Slowest (one-by-one execution) | ✅ Yes
2 | defaultDispatchersTransactions | 14.923 | Faster (thread pool usage) | ✅ Yes
3 | iODispatchersTransactions | 14.877 | Faster (thread pool usage) | ✅ Yes
4 | parallelDispatchersTransactions | 14.994 | Fastest (full parallel execution) | ✅ Yes
5 | unconfinedDispatchersTransactions | 14.954 | Variable performance | ✅ Yes (consistent, but could vary)

Observations:

  • Sequential Execution is the slowest, as expected, since transactions run one after another.
  • Default and I/O Dispatchers perform similarly, which aligns with expectations because both rely on the thread pool.
  • Parallel Dispatchers provide the best performance, slightly outperforming Default and IO Dispatchers, making it the most efficient option.
  • Unconfined Dispatchers were expected to be inconsistent, but in this run, they remained competitive with other thread-pool-based dispatchers.

Overall, the benchmark results align well with the expected behavior. If needed, further testing under different workloads could reveal variations in unconfined performance.

Next, commit and push the changes to your GitHub repository. Review Pushing a project to GitHub for instructions.

Running benchmarks in CircleCI

To run the benchmark on CircleCI, first push the project to GitHub, then integrate it with CircleCI. Once set up, CircleCI creates a new branch named circleci-project-setup that contains a generated .circleci/config.yml file.

Local benchmark

Open .circleci/config.yml. The default contents are similar to this:

# This config was automatically generated from your source code
# Stacks detected: deps:java:.,tool:gradle:
version: 2.1
jobs:
  test-java:
    docker:
      - image: cimg/openjdk:17.0
    steps:
      - checkout
      - run:
          name: Calculate cache key
          command: |-
            find . -name 'pom.xml' -o -name 'gradlew*' -o -name '*.gradle*' | \
                    sort | xargs cat > /tmp/CIRCLECI_CACHE_KEY
      - restore_cache:
          key: cache-{{ checksum "/tmp/CIRCLECI_CACHE_KEY" }}
      - run:
          command: ./gradlew check
      - store_test_results:
          path: build/test-results
      - save_cache:
          key: cache-{{ checksum "/tmp/CIRCLECI_CACHE_KEY" }}
          paths:
            - ~/.gradle/caches
      - store_artifacts:
          path: build/reports

  deploy:
    # This is an example deploy job, not actually used by the workflow
    docker:
      - image: cimg/base:stable
    steps:
      # Replace this with steps to deploy to users
      - run:
          name: deploy
          command: "#e.g. ./deploy.sh"

workflows:
  build-and-test:
    jobs:
      - test-java
    # - deploy:
    #     requires:
    #       - test-java

By default, this config defines two jobs, though only one of them actually runs:

  1. Job test-java
    • Runs Java tests using Gradle in a Docker container with OpenJDK 17.
    • Utilizes caching to speed up the build process.
    • Stores test results and build reports.
  2. Job deploy (Not in use)
    • Placeholder for the deployment process.

The build-and-test workflow runs the test-java job, while the deploy job is commented out and not used.

Next, we need to create a new job to run the benchmark:

benchmark:
  docker:
    - image: cimg/openjdk:17.0
  steps:
    - checkout
    - run:
        name: Run Kotlin Benchmark
        command: ./gradlew benchmark

This job checks out the source code and then executes ./gradlew benchmark, which runs the JMH-based benchmark to measure the performance of the Kotlin code.
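
Optionally, you can persist the benchmark reports as build artifacts so they can be inspected from the CircleCI UI. As a sketch, assuming the default kotlinx.benchmark output location of build/reports/benchmarks (verify the path in your own build):

benchmark:
  docker:
    - image: cimg/openjdk:17.0
  steps:
    - checkout
    - run:
        name: Run Kotlin Benchmark
        command: ./gradlew benchmark
    - store_artifacts:
        path: build/reports/benchmarks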

Add the benchmark job into the build-and-test workflow as shown below:

workflows:
  build-and-test:
    jobs:
      - test-java
      - benchmark

Now, push to the GitHub repository. The build-and-test workflow will run two jobs: test-java and benchmark.

Pipeline benchmark

Open the benchmark job for the details:

Pipeline benchmark

Then, in Parallel runs, expand Run Kotlin Benchmark and review the log:

Pipeline benchmark

The results are slightly different from the local execution that you completed before.

Analyzing the results

Here is an example and explanation of the benchmark results in CircleCI:

  Success: 15.130 ±(99.9%) 0.131 ops/s [Average]
  (min, avg, max) = (14.961, 15.130, 15.237), stdev = 0.086
  CI (99.9%): [15.000, 15.261] (assumes normal distribution)

main summary:
Benchmark                                              Mode  Cnt   Score   Error  Units
CoroutineBenchmark.defaultDispatchersTransactions     thrpt   10  15.144 ± 0.173  ops/s
CoroutineBenchmark.iODispatchersTransactions          thrpt   10  15.128 ± 0.143  ops/s
CoroutineBenchmark.parallelDispatchersTransactions    thrpt   10  15.131 ± 0.185  ops/s
CoroutineBenchmark.sequentialTransactions             thrpt   10   0.262 ± 0.017  ops/s
CoroutineBenchmark.unconfinedDispatchersTransactions  thrpt   10  15.130 ± 0.131  ops/s

This benchmark result evaluates the performance of different Kotlin coroutine Dispatchers based on throughput (ops/s), which measures the number of operations per second. Here it is in table form.

Success summary

Metric | Value | Description
Average Success | 15.130 ops/s | Average operations per second.
Confidence Interval (99.9%) | [15.000, 15.261] ops/s | Expected throughput range with 99.9% confidence.
Min, Avg, Max | (14.961, 15.130, 15.237) ops/s | Lowest, average, and highest observed values.
Standard Deviation | 0.086 | Indicates stable performance with minimal fluctuations.
Margin of Error (99.9%) | ±0.131 ops/s | Possible deviation in the average throughput.

Main summary

Benchmark | Mode | Count | Score | Error | Units
CoroutineBenchmark.defaultDispatchersTransactions | Throughput | 10 | 15.144 | ±0.173 | ops/s
CoroutineBenchmark.iODispatchersTransactions | Throughput | 10 | 15.128 | ±0.143 | ops/s
CoroutineBenchmark.parallelDispatchersTransactions | Throughput | 10 | 15.131 | ±0.185 | ops/s
CoroutineBenchmark.sequentialTransactions | Throughput | 10 | 0.262 | ±0.017 | ops/s
CoroutineBenchmark.unconfinedDispatchersTransactions | Throughput | 10 | 15.130 | ±0.131 | ops/s

Here’s a comparison between the expected results and the actual benchmark results:

Comparison of benchmarks vs. expected results

Execution Type | Expected Performance | Actual Performance (ops/s) | Observation
Sequential | Slowest due to running sequentially | 0.262 | Matches expectation (significantly slower).
Default dispatcher | Faster due to thread pool usage | 15.144 | Matches expectation (similar to other dispatchers).
IO dispatcher | Faster due to thread pool usage | 15.128 | Matches expectation (similar to other dispatchers).
Unconfined dispatcher | Varies depending on first execution | 15.130 | Contrary to expectation; performs similarly to other dispatchers.
Parallel execution | Expected to be the fastest | 15.131 | Contrary to expectation; performs similarly to other dispatchers.

Key observations

  1. Sequential execution is indeed the slowest, aligning with expectations.
  2. Default, IO, and Parallel Dispatchers perform nearly identically (~15.1 ops/s), contradicting the expectation that parallel execution would be significantly faster.
  3. Unconfined Dispatcher does not show significant variation, contradicting the expectation that it would behave unpredictably.
  4. Minimal performance difference among the thread-pool-based dispatchers suggests that coroutine dispatching overhead is negligible in this scenario.
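
A quick back-of-the-envelope check based on the simulated delays in the benchmark code explains these numbers. Each transaction waits on average about 12.5 ms for validation (uniform 5-20 ms) plus, for the ~90% that pass validation, about 30 ms for processing (uniform 10-50 ms), or roughly 39.5 ms in total. Processed sequentially, 100 transactions therefore take about 3.95 s, which is roughly 0.25 ops/s, in line with the measured 0.262 ops/s. When all 100 transactions run concurrently, the total time is bounded by the slowest single transaction, at most about 70 ms, which corresponds to roughly 15 ops/s, matching every concurrent variant. Because delay() suspends without blocking a thread, all 100 simulated waits overlap no matter which dispatcher schedules them, so the dispatcher choice barely matters for this workload.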

Conclusion

Benchmarking is essential for evaluating and optimizing performance, whether in daily life, programming, or software development. In programming, benchmarking helps measure system efficiency, identify bottlenecks, and improve code execution.

For Kotlin coroutines, benchmarking across different Dispatchers provides insights into performance trade-offs. Running benchmarks in a CI/CD pipeline ensures consistency, accuracy, and early detection of performance regressions. It also enables automated tracking over time and multi-environment testing.

Using kotlinx-benchmark and JMH, developers can systematically compare coroutine execution strategies to optimize performance before deployment.

The complete code for this project is available on GitHub.
