Benchmarking Kotlin Coroutines performance with CircleCI


A benchmark is a standard of comparison used to assess something.
In everyday life, for example, when we want to buy a new cellphone and want to know which one is faster, we can run a speed test (a benchmark) that measures how quickly the phone opens applications or runs games. From those numbers, we can compare which phone performs better.
In the context of programming, a benchmark is the process of testing and measuring the performance of a system, software, method, or algorithm based on certain metrics. Benchmarks are used to compare the performance of various implementations, identify bottlenecks, and optimize code to be more efficient.
In this post, we will benchmark the performance of Kotlin coroutines across the various `Dispatchers`. From this benchmark, we can find out how coroutines perform with each dispatcher. We will also compare them with sequential processing, which refers to the normal way of running an algorithm without coroutines.
Prerequisites
To follow along with this tutorial, you will need the following:
- Basic knowledge of the Kotlin language.
- A CircleCI account.
- A GitHub account.
- IntelliJ IDEA installed on your system.
Why coroutines benchmarking belongs in CI/CD
Benchmarking coroutines can be done locally on your own computer. However, in my experience, it is better to do it in CI/CD, for these reasons:
- Consistency
- Similar to production
- Early detection
- Automated benchmarking and performance tracking
- Multi-environment testing
Consistency
If we run the test on our own computers, the results may vary because everyone uses a laptop or PC with different specifications: some have powerful processors, some ordinary ones. CI/CD uses a server with more stable conditions, so the results are more reliable. Local benchmark results can also be affected by other applications running at the same time, such as browsers or IDEs. In CI/CD, the test runs in a cleaner environment that is not disturbed by such processes.
Similar to production
Applications are usually deployed to production through a CI/CD pipeline, so the CI/CD environment tends to resemble the real-world server environment. Testing in CI/CD produces results that are more relevant to the actual conditions after the application is released.
Early detection
If a code change makes the application slower, CI/CD can detect it early, before the change is merged. That way, you can catch a performance problem before it reaches users.
Automated benchmarking and performance tracking
With CI/CD, benchmark results will be stored over time, so we can see if there is an increase or decrease in performance. Imagine if this was done manually on each computer; it would be difficult to track changes from one version to the next.
Multi-environment testing
CI/CD allows testing on multiple versions of Java, various operating systems, or different types of servers. On our own computers, we can usually test only one particular configuration.
So benchmarking coroutines in CI/CD pipelines ensures more accurate, stable, and reproducible results, reduces the influence of external factors, and allows for early detection of performance regressions. Additionally, we can automatically compare results between versions and ensure optimal performance before releasing to production.
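As an illustration of multi-environment testing, a CI configuration can fan the same benchmark job out across several JDK versions. This is a hypothetical CircleCI sketch; the job name and image tags are illustrative and not part of the project we build below:

```yaml
version: 2.1
jobs:
  benchmark:
    parameters:
      jdk-image:
        type: string
    docker:
      - image: << parameters.jdk-image >>
    steps:
      - checkout
      - run: ./gradlew benchmark
workflows:
  benchmark-matrix:
    jobs:
      - benchmark:
          matrix:
            parameters:
              jdk-image: ["cimg/openjdk:11.0", "cimg/openjdk:17.0", "cimg/openjdk:21.0"]
```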
Choosing a benchmarking framework
Selecting the appropriate benchmarking framework depends on several factors, such as the programming language, type of testing, and specific application needs.
Types of benchmarking
Before diving into specific frameworks, it is essential to understand the two main types of benchmarking in the context of performance programming:
- Microbenchmarking: Measures the performance of specific functions or algorithms with high precision. This type of benchmarking is useful for evaluating small, isolated code snippets while accounting for compiler optimizations.
- Multi-threaded benchmarking: Evaluates how an application handles concurrent execution, making it crucial for multi-threaded systems such as servers, applications with thread pools, or programs that rely on asynchronous processing.
While these two categories are the primary focus in performance programming, benchmarking can also encompass broader areas such as system benchmarking, application benchmarking, and load testing, depending on the project's specific needs. The following sections cover:
- Microbenchmarking frameworks
- Multi-threaded benchmarking frameworks
- Benchmarking in Kotlin
- Example use cases
Microbenchmarking frameworks
For microbenchmarking, frameworks such as JMH (Java), BenchmarkDotNet (.NET), and timeit (Python) are suitable because they have low overhead and support optimizations like JIT (Just-In-Time) compilation and GC (Garbage Collection) tuning.
Note: To learn more about JIT optimizations and GC tuning, refer to:
- Optimizing JMH Benchmark Setup for Improved Performance
- BenchmarkDotNet Documentation
- Python timeit module
Multi-threaded benchmarking frameworks
For multi-threaded applications, frameworks such as JMH and Gatling support concurrent execution testing, providing an accurate picture of how an application handles parallel processes. These benchmarks are essential for evaluating system performance in multi-threaded environments.
Benchmarking in Kotlin
Because we are using Kotlin, the most suitable choice is kotlinx-benchmark, which leverages the JMH framework to execute benchmarks on the JVM.
To automate these benchmarks in a CI/CD pipeline, we will use CircleCI. Automated benchmarking ensures consistent performance tracking over time and helps identify regressions early in the development cycle.
Example use cases for dispatchers
Kotlin’s coroutines offer different dispatchers for handling concurrent workloads efficiently:
- Sequential execution → Processes transactions one by one in blocking mode.
- `Dispatchers.Default` → Uses a CPU-optimized thread pool for parallel execution, suitable for computationally intensive tasks.
- `Dispatchers.IO` → Uses an IO-optimized thread pool, commonly used for database reads, file operations, and network calls.
- `Dispatchers.Unconfined` → Not tied to a specific thread; executes tasks on the first available thread.
- Parallel execution (`async`-`await`) → Uses `async`-`await` with `Dispatchers.Default` to process transactions in parallel efficiently.
By considering these factors, developers can choose the most appropriate benchmarking framework and ensure their application maintains optimal performance under different conditions.
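The intuition behind these trade-offs can be sketched in a few lines. This is not the benchmark we build below; `fakeIoCall` is a hypothetical stand-in for a real network or disk operation, and the timings only illustrate why concurrent execution wins for latency-bound work:

```kotlin
import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

// Hypothetical stand-in for a real I/O operation with ~50 ms latency.
suspend fun fakeIoCall() = delay(50)

// Runs 10 calls one after another: total time is roughly the sum of latencies.
fun timeSequential(): Long = runBlocking {
    measureTimeMillis { repeat(10) { fakeIoCall() } }
}

// Runs the same 10 calls concurrently on the IO-optimized dispatcher:
// total time is roughly the latency of a single call.
fun timeConcurrent(): Long = runBlocking {
    measureTimeMillis {
        (1..10).map { launch(Dispatchers.IO) { fakeIoCall() } }.joinAll()
    }
}

fun main() {
    println("sequential=${timeSequential()} ms, concurrent=${timeConcurrent()} ms")
}
```

On a typical machine the sequential run takes around 500 ms (10 × 50 ms), while the concurrent run finishes in a fraction of that.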
Benchmarking scenario
To measure the performance of coroutines, we need to write accurate and reliable benchmark code. This process involves executing coroutines in various scenarios to observe their execution time and efficiency. In this section, we will discuss how to write proper benchmarking code to systematically evaluate the performance of coroutines.
When benchmarking coroutines, it is important to compare different execution approaches to understand their performance differences. One effective way is to test several `Dispatchers` and compare them to a sequential approach. By doing this, we can evaluate which is more efficient in a given scenario and identify the trade-offs between the parallelism and overhead of each execution strategy. In this section, we will begin writing code to benchmark these approaches using the workflow outlined below:
- Benchmark objective: Measure the performance of various coroutine execution strategies in processing transactions asynchronously.
- Test scenario:
  - Simulate transaction validation and processing with random latency.
  - Each transaction has a chance of failing (e.g., due to insufficient balance).
- Tested methods:
  - Sequential → Processes transactions one by one in blocking mode.
  - Default dispatcher (`Dispatchers.Default`) → Uses a CPU-optimized thread pool for parallel execution.
  - Input/output dispatcher (`Dispatchers.IO`) → Takes advantage of an IO-optimized thread pool, usually for heavy I/O operations.
  - Unconfined dispatcher (`Dispatchers.Unconfined`) → Not tied to a specific thread, as pointed out earlier.
  - Parallel execution (`async`-`await`) → Uses `async`-`await` to process transactions in parallel with `Dispatchers.Default`.
Expected results
- Sequential is the slowest because it runs one after another.
- Default and I/O Dispatchers are faster because they utilize the thread pool.
- Unconfined Dispatchers can vary depending on the first execution.
- Parallel execution is usually the most optimal because all transactions run simultaneously.
Writing benchmark code for coroutines
Now you can implement the benchmark scenario in Kotlin code.
Create new project
Create a new Kotlin project and make sure the project properties are like this:
- Project Name: CoroutineBenchmark
- Build System: Gradle
- JDK: version 11 or higher
- Gradle DSL: Kotlin
The project structure will be similar to:
CoroutineBenchmark/
├── gradle/
│ └── wrapper/
│ ├── gradle-wrapper.jar
│ └── gradle-wrapper.properties
├── src/
│ └── main/
│ └── kotlin/
├── build.gradle.kts
├── gradle.properties
├── gradlew
├── gradlew.bat
└── settings.gradle.kts
Set up the `kotlinx.benchmark` toolkit
We need to set up `kotlinx.benchmark` and some other required dependencies in Gradle. Open the `build.gradle.kts` file and add these snippets.
Set up the required plugins:
plugins {
kotlin("jvm") version "2.1.10"
id("org.jetbrains.kotlinx.benchmark") version "0.4.13"
kotlin("plugin.allopen") version "2.1.10"
}
This configures the Kotlin project with JMH benchmarking:
- `kotlin("jvm")` → Enables Kotlin on the JVM. Automatically added when creating a new Kotlin project.
- `org.jetbrains.kotlinx.benchmark` → The benchmark plugin; it is highly recommended to use the latest version, currently 0.4.13.
- `kotlin("plugin.allopen")` → Makes classes with the `@State` annotation open, as required by JMH. Its version should match the Kotlin plugin version.
Why is this needed? JMH requires benchmark classes annotated with `@State` to be non-final, because it generates subclasses of them for proper benchmarking. Since Kotlin classes are `final` by default, this configuration ensures that JMH can extend them.
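To see the problem allOpen solves, consider this small illustration (the class names here are made up for the example):

```kotlin
// Kotlin classes are final by default, so JMH's generated subclasses
// would fail to compile against them.
class Closed          // final: `class Sub : Closed()` would not compile
open class Opened     // what the allOpen plugin effectively turns @State classes into

class Sub : Opened()  // subclassing now works, which is what JMH needs

fun main() {
    println(Sub() is Opened) // prints "true"
}
```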
Also add the libraries we need, including coroutines and `kotlinx.benchmark`, to the dependencies list:
dependencies {
testImplementation(kotlin("test"))
implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core:1.7.1")
implementation("org.jetbrains.kotlinx:kotlinx-benchmark-runtime:0.4.13")
}
This `dependencies` block includes:
- `testImplementation(kotlin("test"))` → Adds Kotlin's standard testing library.
- `implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core:1.7.1")` → Provides support for Kotlin coroutines.
- `implementation("org.jetbrains.kotlinx:kotlinx-benchmark-runtime:0.4.13")` → Adds the runtime library for benchmarking with JMH.
Then also add this configuration:
allOpen {
annotation("org.openjdk.jmh.annotations.State")
}
This Gradle Kotlin DSL configuration is used by the Kotlin allOpen compiler plugin. It makes classes annotated with `@State` open at compile time:
- `allOpen` plugin: A Kotlin compiler plugin that removes the `final` modifier from classes with specific annotations.
- `annotation("org.openjdk.jmh.annotations.State")`: Specifies that classes annotated with `@State` should be open for inheritance.
Finally, designate the main source set as a benchmark target:
benchmark {
targets {
register("main")
}
}
This means the benchmark will run on the `main` source set (not `test` or other modules).
Now you can write the benchmark code in Kotlin.
Creating a Kotlin class
In the main directory, `src/main/kotlin`, create a new class and name it `CoroutineBenchmark`. We can also create a new package first, for example `id.web.hangga`, so that the file location is `src/main/kotlin/id/web/hangga/CoroutineBenchmark.kt`.
After creating the Kotlin class, add the following imports and annotations:
import kotlinx.benchmark.*
import kotlinx.coroutines.*
import org.openjdk.jmh.annotations.Fork
import org.openjdk.jmh.annotations.Level
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger
import kotlin.random.Random
import kotlin.system.measureTimeMillis
@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 10)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
class CoroutineBenchmark {
}
These JMH annotations configure the benchmark:
- `@State(Scope.Benchmark)` → Shares state across all iterations.
- `@Fork(1)` → Runs in a fresh JVM once.
- `@Warmup(iterations = 10)` → Runs 10 warm-up iterations to stabilize JVM optimizations.
- `@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)` → Runs 10 test iterations, each lasting 1 second.
Then define the initialization for benchmarking in the `CoroutineBenchmark` class:
private val transactionCount = 100 // Number of transactions to be tested
private val processedTransactions = AtomicInteger(0)
private lateinit var transactions: List<Int>
@Setup(Level.Iteration) // Prepare transactions before each benchmark iteration
fun setup() {
transactions = List(transactionCount) { it + 1 }
processedTransactions.set(0)
}
- `transactionCount = 100` → Total transactions to test.
- `processedTransactions = AtomicInteger(0)` → Tracks completed transactions safely in a multi-threaded environment.
- `transactions: List<Int>` → Holds transaction IDs.
- `@Setup(Level.Iteration)` → Runs before each benchmark iteration to reset data.
- `setup()` → Initializes transactions (1 to 100) and resets the processed counter.
Then, create a transaction simulation and validate it:
// Simulate transaction validation (e.g., balance verification)
private suspend fun validateTransaction(transactionId: Int): Boolean {
delay(Random.nextLong(5, 20)) // Simulate validation latency
return transactionId % 10 != 0 // Simulate some failed transactions (e.g., insufficient balance)
}
// Simulate transaction processing
private suspend fun processTransaction(transactionId: Int) {
if (validateTransaction(transactionId)) {
delay(Random.nextLong(10, 50)) // Simulate transaction processing time
processedTransactions.incrementAndGet()
}
}
This code simulates transaction validation and processing with delays to mimic real-world latency.
- `validateTransaction(transactionId: Int): Boolean` → Checks whether a transaction is valid, with a simulated latency (5-20 ms). Transactions that are multiples of 10 are considered failed.
- `processTransaction(transactionId: Int)` → If the transaction is valid, it undergoes processing with a delay (10-50 ms), and the counter for successfully processed transactions is incremented.
Finally, create the methods that we will benchmark:
@Benchmark
fun sequentialTransactions() = runBlocking {
val time = measureTimeMillis {
for (id in 1..transactionCount) {
processTransaction(id)
}
}
println("Sequential processing time: $time ms")
}
@Benchmark
fun defaultDispatchersTransactions() = runBlocking {
benchmarkDispatcher("Default", Dispatchers.Default)
}
@Benchmark
fun iODispatchersTransactions() = runBlocking { benchmarkDispatcher("IO", Dispatchers.IO) }
@Benchmark
fun unconfinedDispatchersTransactions() = runBlocking {
benchmarkDispatcher("Unconfined", Dispatchers.Unconfined)
}
private suspend fun benchmarkDispatcher(name: String, dispatcher: CoroutineDispatcher) {
val time = measureTimeMillis {
coroutineScope {
(1..transactionCount)
.map { id -> launch(dispatcher) { processTransaction(id) } }
.joinAll()
}
}
println("Concurrent processing time on $name: $time ms")
}
@Benchmark fun parallelDispatchersTransactions() = runBlocking { benchmarkDispatcherParallel() }
private suspend fun benchmarkDispatcherParallel() {
val time = measureTimeMillis {
withContext(
Dispatchers.Default
) { // Runs in a thread pool that matches the number of CPU cores.
(1..transactionCount)
.map { id ->
async { processTransaction(id) } // Uses async to run in parallel
}
.awaitAll() // Wait for all async tasks to complete
}
}
println("Parallel processing time : $time ms")
}
This code benchmarks different coroutine execution strategies for processing transactions.
- `sequentialTransactions()` → Runs all transactions sequentially using `runBlocking`, measuring the total execution time.
- `defaultDispatchersTransactions()` → Uses `Dispatchers.Default`, which runs coroutines on a thread pool optimized for CPU-intensive tasks.
- `iODispatchersTransactions()` → Uses `Dispatchers.IO`, optimized for I/O operations like file or network access.
- `unconfinedDispatchersTransactions()` → Uses `Dispatchers.Unconfined`, which starts coroutines in the caller's thread but may shift to another thread after suspension.
- `benchmarkDispatcher(name, dispatcher)` → A helper function that runs transactions concurrently on the specified dispatcher and measures execution time.
- `parallelDispatchersTransactions()` → Runs transactions in parallel using `Dispatchers.Default` and `async`-`await` to fully utilize CPU cores.
Now you are ready to run the benchmark. To run the benchmark locally, use this command:
./gradlew benchmark
In IntelliJ IDEA, you can also use the Gradle tool window on the right to run the benchmark task:
Local Benchmark Summary
Metric | Value | Description |
---|---|---|
Average Success | 14.935 ops/s | Average operations per second. |
Confidence Interval (99.9%) | [14.858, 15.051] ops/s | Expected throughput range with 99.9% confidence. |
Min, Avg, Max | (14.876, 14.935, 15.078) ops/s | Lowest, average, and highest observed values. |
Standard Deviation | 0.064 | Indicates stable performance with minimal fluctuations. |
Margin of Error (99.9%) | ±0.097 ops/s | Possible deviation in the average throughput. |
Local Benchmark Results (Per Test)
No | Benchmark | Mode | Count | Score | Error | Units |
---|---|---|---|---|---|---|
1 | `defaultDispatchersTransactions` | thrpt | 10 | 14.923 | ±0.229 | ops/s |
2 | `iODispatchersTransactions` | thrpt | 10 | 14.877 | ±0.088 | ops/s |
3 | `parallelDispatchersTransactions` | thrpt | 10 | 14.994 | ±0.136 | ops/s |
4 | `sequentialTransactions` | thrpt | 10 | 0.229 | ±0.012 | ops/s |
5 | `unconfinedDispatchersTransactions` | thrpt | 10 | 14.954 | ±0.097 | ops/s |
Comparison of local benchmark vs. expected results
No | Benchmark | Local Score (ops/s) | Expected Performance | Matches Expectation? |
---|---|---|---|---|
1 | `sequentialTransactions` | 0.229 | Slowest (one-by-one execution) | ✅ Yes |
2 | `defaultDispatchersTransactions` | 14.923 | Faster (thread pool usage) | ✅ Yes |
3 | `iODispatchersTransactions` | 14.877 | Faster (thread pool usage) | ✅ Yes |
4 | `parallelDispatchersTransactions` | 14.994 | Fastest (full parallel execution) | ✅ Yes |
5 | `unconfinedDispatchersTransactions` | 14.954 | Variable performance | ✅ Yes (consistent here, but could vary) |
Observations:
- Sequential Execution is the slowest, as expected, since transactions run one after another.
- Default and I/O Dispatchers perform similarly, which aligns with expectations because both rely on the thread pool.
- Parallel execution provides the best performance, slightly outperforming the Default and IO dispatchers, making it the most efficient option.
- Unconfined Dispatchers were expected to be inconsistent, but in this run, they remained competitive with other thread-pool-based dispatchers.
Overall, the benchmark results align well with the expected behavior. If needed, further testing under different workloads could reveal variations in `Dispatchers.Unconfined` performance.
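The caveat about `Dispatchers.Unconfined` can be observed directly. This is a minimal sketch, separate from the benchmark: the coroutine starts in the caller's thread, but after its first suspension it typically resumes on whichever thread completed the delay.

```kotlin
import kotlinx.coroutines.*

// Returns the thread names observed before and after the first suspension
// of an Unconfined coroutine.
fun unconfinedThreads(): Pair<String, String> = runBlocking {
    var before = ""
    var after = ""
    launch(Dispatchers.Unconfined) {
        before = Thread.currentThread().name // the caller's thread
        delay(10)
        after = Thread.currentThread().name  // usually a different thread
    }.join()
    before to after
}

fun main() {
    val (before, after) = unconfinedThreads()
    println("started on: $before, resumed on: $after")
}
```

Because the resuming thread depends on what completed the suspension, Unconfined results can vary from run to run.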
Next, commit and push the changes to your GitHub repository. Review Pushing a project to GitHub for instructions.
Running benchmarks in CircleCI
To run the benchmark on CircleCI, we first push our project to GitHub. Then, we integrate it with CircleCI. Once successful, CircleCI will create a new branch named circleci-project-setup, which contains a file like .circleci/config.yml.
Open .circleci/config.yml
. The default contents are similar to this:
# This config was automatically generated from your source code
# Stacks detected: deps:java:.,tool:gradle:
version: 2.1
jobs:
test-java:
docker:
- image: cimg/openjdk:17.0
steps:
- checkout
- run:
name: Calculate cache key
command: |-
find . -name 'pom.xml' -o -name 'gradlew*' -o -name '*.gradle*' | \
sort | xargs cat > /tmp/CIRCLECI_CACHE_KEY
- restore_cache:
key: cache-{{ checksum "/tmp/CIRCLECI_CACHE_KEY" }}
- run:
command: ./gradlew check
- store_test_results:
path: build/test-results
- save_cache:
key: cache-{{ checksum "/tmp/CIRCLECI_CACHE_KEY" }}
paths:
- ~/.gradle/caches
- store_artifacts:
path: build/reports
deploy:
# This is an example deploy job, not actually used by the workflow
docker:
- image: cimg/base:stable
steps:
# Replace this with steps to deploy to users
- run:
name: deploy
command: "#e.g. ./deploy.sh"
workflows:
build-and-test:
jobs:
- test-java
# - deploy:
# requires:
# - test-java
By default, this workflow defines two jobs, but only one runs because the other is not used.
- Job `test-java`:
  - Runs Java tests using Gradle in a Docker container with OpenJDK 17.
  - Utilizes caching to speed up the build process.
  - Stores test results and build reports.
- Job `deploy` (not in use):
  - Placeholder for the deployment process.
The `build-and-test` workflow runs the `test-java` job, while the `deploy` job is commented out and not used.
Next, we need to create a new job to run the benchmark in the workflow:
benchmark:
docker:
- image: cimg/openjdk:17.0
steps:
- checkout
- run:
name: Run Kotlin Benchmark
command: ./gradlew benchmark
This job first checks out the source code and then executes `./gradlew benchmark`, which runs the JMH benchmark to measure the performance of the Kotlin code.
Add the `benchmark` job to the `build-and-test` workflow as shown below:
workflows:
build-and-test:
jobs:
- test-java
- benchmark
Now, push to the GitHub repository. The `build-and-test` workflow will run two jobs: `test-java` and `benchmark`.
Open the `benchmark` job for the details:
Then, in Parallel runs expand Run Kotlin Benchmark and review the log:
The results are slightly different from the local execution that you completed before.
Analyzing the results
Here is an example and explanation of the benchmark results in CircleCI:
Success: 15.130 ±(99.9%) 0.131 ops/s [Average]
(min, avg, max) = (14.961, 15.130, 15.237), stdev = 0.086
CI (99.9%): [15.000, 15.261] (assumes normal distribution)
main summary:
Benchmark Mode Cnt Score Error Units
CoroutineBenchmark.defaultDispatchersTransactions thrpt 10 15.144 ± 0.173 ops/s
CoroutineBenchmark.iODispatchersTransactions thrpt 10 15.128 ± 0.143 ops/s
CoroutineBenchmark.parallelDispatchersTransactions thrpt 10 15.131 ± 0.185 ops/s
CoroutineBenchmark.sequentialTransactions thrpt 10 0.262 ± 0.017 ops/s
CoroutineBenchmark.unconfinedDispatchersTransactions thrpt 10 15.130 ± 0.131 ops/s
This benchmark result evaluates the performance of different Kotlin coroutine `Dispatchers` based on throughput (ops/s), which measures the number of operations per second. Here it is in table form.
Success summary
Metric | Value | Description |
---|---|---|
Average Success | 15.130 ops/s | Average operations per second. |
Confidence Interval (99.9%) | [15.000, 15.261] ops/s | Expected throughput range with 99.9% confidence. |
Min, Avg, Max | (14.961, 15.130, 15.237) ops/s | Lowest, average, and highest observed values. |
Standard Deviation | 0.086 | Indicates stable performance with minimal fluctuations. |
Margin of Error (99.9%) | ±0.131 ops/s | Possible deviation in the average throughput. |
Main summary
Benchmark | Mode | Count | Score | Error | Units |
---|---|---|---|---|---|
CoroutineBenchmark.defaultDispatchersTransactions | Throughput | 10 | 15.144 | ±0.173 | ops/s |
CoroutineBenchmark.iODispatchersTransactions | Throughput | 10 | 15.128 | ±0.143 | ops/s |
CoroutineBenchmark.parallelDispatchersTransactions | Throughput | 10 | 15.131 | ±0.185 | ops/s |
CoroutineBenchmark.sequentialTransactions | Throughput | 10 | 0.262 | ±0.017 | ops/s |
CoroutineBenchmark.unconfinedDispatchersTransactions | Throughput | 10 | 15.130 | ±0.131 | ops/s |
Here’s a comparison between the expected results and the actual benchmark results:
Comparison of benchmarks vs. expected results
Execution Type | Expected Performance | Actual Performance (ops/s) | Observation |
---|---|---|---|
Sequential | Slowest due to running sequentially | 0.262 | Matches expectation (significantly slower). |
Default dispatcher | Faster due to thread pool usage | 15.144 | Matches expectation (similar to other dispatchers). |
IO dispatcher | Faster due to thread pool usage | 15.128 | Matches expectation (similar to other dispatchers). |
Unconfined dispatcher | Varies depending on first execution | 15.130 | Contrary to expectation; performs similarly to other dispatchers. |
Parallel execution | Expected to be the fastest | 15.131 | Contrary to expectation; performs similarly to other dispatchers. |
Key observations
- Sequential execution is indeed the slowest, aligning with expectations.
- Default, IO, and Parallel Dispatchers perform nearly identically (~15.1 ops/s), contradicting the expectation that parallel execution would be significantly faster.
- Unconfined Dispatcher does not show significant variation, contradicting the expectation that it would behave unpredictably.
- Minimal performance difference among the thread-pool-based dispatchers suggests that coroutine dispatching overhead is negligible in this scenario.
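One hedged way to probe this further is to replace the suspending delays with CPU-bound work, where dispatcher choice matters more. `busyWork` below is a hypothetical stand-in for real computation and is not part of the benchmark project:

```kotlin
import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

// Deterministic CPU-bound workload: no suspension points, so the dispatcher's
// thread pool actually determines how much runs in parallel.
fun busyWork(n: Int): Long {
    var acc = 0L
    for (i in 1..n) acc += (i.toLong() * i) % 7
    return acc
}

fun main() = runBlocking {
    val tasks = 8
    val sequential = measureTimeMillis { repeat(tasks) { busyWork(20_000_000) } }
    val parallel = measureTimeMillis {
        (1..tasks).map { async(Dispatchers.Default) { busyWork(20_000_000) } }.awaitAll()
    }
    // On a multi-core machine, the parallel run should be noticeably faster here,
    // unlike in the delay-bound benchmark above.
    println("sequential=$sequential ms, parallel=$parallel ms")
}
```

In the delay-bound benchmark, all dispatchers spend most of their time suspended, so their thread pools are never the bottleneck; with CPU-bound work like this, the differences between execution strategies become visible.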
Conclusion
Benchmarking is essential for evaluating and optimizing performance, whether in daily life, programming, or software development. In programming, benchmarking helps measure system efficiency, identify bottlenecks, and improve code execution.
For Kotlin coroutines, benchmarking across different `Dispatchers` provides insights into performance trade-offs. Running benchmarks in a CI/CD pipeline ensures consistency, accuracy, and early detection of performance regressions. It also enables automated tracking over time and multi-environment testing.
Using kotlinx-benchmark and JMH, developers can systematically compare coroutine execution strategies to optimize performance before deployment.
The complete code for this project is available on GitHub.