Troubleshooting Server Installations

CircleCI Server version 2.x is no longer a supported release. Please consult your account team for help in upgrading to a supported release.

This document describes an initial set of troubleshooting steps to take if you are having problems with your CircleCI installation on your private server. If your issue is not addressed below, you can generate a support bundle and contact CircleCI Support Engineers by opening a support ticket.

Generating a Support Bundle

To download a support bundle, select Support from the Management Console menu bar, and then select Download Support Bundle. CircleCI support engineers will often request support bundles to help diagnose and fix the problem you are experiencing.

Debug Queuing Builds

If your Services component is operating correctly but builds are not running, or all builds are queueing, follow the steps below.

1. Check Dispatcher Logs for Errors

Run sudo docker logs dispatcher. If the log output is free of errors, you may continue to the next step.
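If you are not sure whether the dispatcher container exists or is running, a quick check looks like the following (a minimal sketch, assuming standard Docker tooling on the Services machine):

# List dispatcher containers, whether running or stopped:
sudo docker ps -a --filter name=dispatcher

# Tail the most recent log output:
sudo docker logs --tail 100 dispatcher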

If the dispatcher container does not exist or is down, start it by running the sudo docker start <container_name> command and monitor its progress. The following output indicates that the log dispatcher is up and running correctly:

Jan 4 22:38:38.589:+0000 INFO circle.backend.build.run-queue dispatcher mode is on - no need for run-queue
Jan 4 22:38:38.589:+0000 INFO circle.backend.build.usage-queue 5a4ea0047d560d00011682dc: GERey/realitycheck/37 -> forwarded to run-queue
Jan 4 22:38:38.589:+0000 INFO circle.backend.build.usage-queue 5a4ea0047d560d00011682dc: publishing :usage-changed (:recur) event
Jan 4 22:38:39.069:+0000 INFO circle.backend.build.usage-queue got usage-queue event for 5a4ea0047d560d00011682dc (finished-build)

If you see errors or do not see the above output, investigate the stack traces, as they indicate a problem with routing builds between 1.0 and 2.0.

If you can run 1.0 builds but not 2.0 builds, or you can only run 2.0 builds, and the log dispatcher is up and running, continue to the next step.

2. Check Picard-Dispatcher Logs for Errors

Run the sudo docker logs picard-dispatcher command. A healthy picard-dispatcher should output the following:

Jan 9 19:32:33 INFO picard-dispatcher.init Still running...
Jan 9 19:34:33 INFO picard-dispatcher.init Still running...
Jan 9 19:34:44 INFO picard-dispatcher.core taking build=GERey/realitycheck/38
Jan 9 19:34:45 INFO circle.http.builds project GERey/realitycheck at revision 2c6179654541ee3d successfully fetched and parsed .circleci/config.yml
picard-dispatcher.tasks build GERey/realitycheck/38 is using resource class {:cpu 2.0, :ram 4096, :class :medium}
picard-dispatcher.tasks Computed tasks for build=GERey/realitycheck/38, stage=:write_artifacts, parallel=1
Jan 9 19:34:45 INFO picard-dispatcher.tasks build has matching jobs: build=GERey/realitycheck/38 parsed=:write_artifacts passed=:write_artifacts

The output should be filled with the above messages. If it is a slow day and builds are not happening very often, the output will appear as follows:

Jan 9 19:32:33.629:+0000 INFO picard-dispatcher.init Still running...

As soon as you run a build, you should see the dispatch messages shown above, indicating that the build has been dispatched to the scheduler. If you do not see that output, or you see a stack trace in the picard-dispatcher container, contact support@circleci.com.

If you run a 2.0 build and do not see a message in the picard-dispatcher log output, it often indicates that a job is getting lost between the dispatcher and the picard-dispatcher.

Stop and restart the CircleCI app in the Management Console at port 8800 to re-establish the connection between the two containers.

3. Check Picard-Scheduler Logs for Errors

Run the sudo docker logs picard-scheduler command. The picard-scheduler schedules jobs and sends them to Nomad through a direct connection; it does not actually handle queuing of jobs in CircleCI.

4. Check Nomad Node Status

Check to see if there are any nomad nodes by running the nomad node-status -allocs command and viewing the following output:

ID        DC         Name             Class        Drain  Status  Running Allocs
ec2727c5  us-east-1  ip-127-0-0-1     linux-64bit  false  ready   0

If you do not see any Nomad clients listed, please consult our Introduction to Nomad Cluster Operation guide for more detailed information on managing and troubleshooting the Nomad server.

DC in the output stands for datacenter; it will always print us-east-1 and should be left as such. It does not affect or break anything. The most important columns are Drain, Status, and Running Allocs.
  • Drain - If Drain is true, CircleCI will not route jobs to that Nomad client. You can change this value by running the nomad node-drain [options] <node> command. If you set Drain to true, the node finishes the jobs it is currently running and then stops accepting builds; after the number of allocations reaches 0, it is safe to terminate the instance. If Drain is false, the node is accepting connections and should be getting builds. See the example after this list.

  • Status - If Status is ready, the node is ready to accept builds and is wired up correctly. A node that does not show ready in the Status column will not accept builds and should be investigated.

  • Allocs - Allocs is a term used to refer to builds, so the number of Running Allocs is the number of builds running on a single node. This number indicates whether builds are routing. If all of the Builders have Running Allocs but your job is still queued, you do not have enough capacity and need to add more Builders to your fleet.
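For example, draining a node ahead of termination might look like the following (a sketch; ec2727c5 is the example node ID from the node-status output above):

# Stop routing new builds to the node:
nomad node-drain -enable ec2727c5

# Watch Running Allocs fall to 0 before terminating the instance:
nomad node-status -allocs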

If the node status output looks healthy but your builds are still queued, continue to the next step.

5. Check Job Processing Status

Run the sudo docker exec -it nomad nomad status command to view the jobs that are currently being processed. It should list the status of each job as well as the ID of the job, as follows:

ID                                                      Type   Priority  Status
5a4ea06b7d560d000116830f-0-build-GERey-realitycheck-1   batch  50        dead
5a4ea0c9fa4f8c0001b6401b-0-build-GERey-realitycheck-2   batch  50        dead
5a4ea0cafa4f8c0001b6401c-0-build-GERey-realitycheck-3   batch  50        dead

After a job has completed, its Status shows dead. This is a normal state for jobs. If the status shows running, the job is currently running and should appear in the CircleCI app builds dashboard. If it is not appearing in the app, there may be a problem with the output processor. Run the docker logs picard-output-processor command and check the logs for any obvious stack traces.

  • If the job is in a constant pending state with no allocations being made, run the sudo docker exec -it nomad nomad status JOB_ID command to see where Nomad is stuck and then refer to standard Nomad Cluster error documentation for information.

  • If the job is running/dead but the CircleCI app shows nothing:

    • Check the Nomad job logs by running the sudo docker exec -it nomad nomad logs --stderr --job JOB_ID command.

    • Check the picard-output-processor logs (docker logs picard-output-processor, as above) for specific errors.

The --stderr flag prints the specific error, if one exists.
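Putting those checks together, a typical pass over a stuck job might look like this (a sketch; the job ID is an example copied from the nomad status output above):

# Example job ID from the `nomad status` output:
JOB_ID=5a4ea06b7d560d000116830f-0-build-GERey-realitycheck-1

# See where Nomad is stuck for a pending job:
sudo docker exec -it nomad nomad status "$JOB_ID"

# Print the job's stderr, where the specific error usually appears:
sudo docker exec -it nomad nomad logs --stderr --job "$JOB_ID"

# Cross-check the output processor for stack traces:
sudo docker logs picard-output-processor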

Why do my jobs stay in queued status until they fail and never successfully run?

If the Nomad client logs contain the following type of error message, check that port 8585 is open:

{"error":"rpc error: code = Unavailable desc = grpc: the connection is
unavailable","level":"warning","msg":"error fetching config, retrying","time":"2018-04-17T18:47:01Z"}

Why is the cache failing to unpack?

If a restore_cache step is failing for one of your jobs, it is worth checking the size of the cache. You can view the cache size from the CircleCI Jobs page, within the restore_cache step. We recommend keeping cache sizes under 500MB; this is our upper limit for corruption checks, because above this limit check times would be excessively long. Larger cache sizes are allowed, but they raise the chance of decompression issues and corruption during download. To keep cache sizes down, consider splitting a single large cache into multiple distinct caches.
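To see which directories are inflating a cache before splitting it, you can measure the candidates from a shell (a sketch; the paths are illustrative, not a prescription):

# Report the size of each candidate cache directory:
du -sh ~/.m2 ~/.cache/yarn node_modules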

How do I get around the API service being impacted by a high thread count?

Disable cache warming by completing the following steps:

  1. Add the export DOMAIN_SERVICE_REFRESH_USERS=false flag to the /etc/circleconfig/api-service/customizations file on the Services machine (see the sketch after these steps). For more information on configuration overrides, see the guide to Service Configuration Overrides.

  2. Restart CircleCI:

    1. Navigate to the Management Console

    2. Click Stop Now and wait for it to stop

    3. Click Start
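For step 1, appending the override from a shell on the Services machine might look like this (a sketch, using the file path given above):

# Append the override to the api-service customizations file:
echo 'export DOMAIN_SERVICE_REFRESH_USERS=false' | sudo tee -a /etc/circleconfig/api-service/customizations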


