This topic lists common issues and solutions for host deployment in Alibaba Cloud DevOps.
Alibaba Cloud ECS prompts "deploy channel error" or "offline"
Restart Cloud Assistant if it is not working normally. For more information, see View execution results and fix common issues.
Clean up the ECS disk if it is out of space.
Self-hosted instances go offline
Restart the agent if it is offline.
Common agent operations
Check status.
/home/staragent/bin/staragentctl status
Restart agent.
/home/staragent/bin/staragentctl restart
Uninstall agent.
/home/staragent/bin/staragentctl stop
rm -rf /home/staragent
rm /usr/sbin/staragent_sn
The build running on the instance is not the latest, and cannot be updated
Check the server disk as indicated by the logs. Clean up the disk if it is out of space.
Host deployment fails and no logs are generated
For a self-hosted instance (not Alibaba Cloud ECS) that was created from an image, uninstall the agent, then re-add the self-hosted instance and try the deployment again.
Rollback tasks
You must roll back multiple tasks separately.
In the drop-down menu in Deployment History, select the task you want to roll back.
Click Rollback.
Deployment fails and the service does not start
Debug your deployment script to verify that it works, because Alibaba Cloud DevOps executes the script without any modifications.
Copy the deployment script from the Alibaba Cloud DevOps configuration to the server, and execute it to see if it can start the service.
If the script works on the server but not in Alibaba Cloud DevOps, replace all relative paths with absolute paths.
Windows host deployment fails with "deploy channel error"
Alibaba Cloud DevOps does not support Windows hosts. Here are two options to try:
Execute a script on a Linux server that connects to the Windows host.
Upload the build to your OSS, so that the Windows host can then download the package from it for deployment.
Cannot add a non-Alibaba Cloud ECS instance to the host group
Use the hybrid cloud hosting mode for self-hosted instances that are not Alibaba Cloud ECS. For details, see https://help.aliyun.com/document_detail/201140.html
The host reports that an environment variable is "not a valid identifier"
This happens because the environment variable contains special characters. To solve this, follow the steps below:
In the host deployment task, select Encode Variables.
In the deployment script, decode all the environment variables with Base64. For example, to use the PIPELINE_ID environment variable, add the following line at the beginning of the deployment script:
export PIPELINE_ID=$(echo $PIPELINE_ID | base64 -d)
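When several encoded variables are used, the decoding step can be wrapped in a small helper. The following is a minimal sketch, not part of the product: the function name decode_var is hypothetical, and it assumes each variable name passed in is a valid shell identifier that holds a Base64-encoded value.

```shell
# Hypothetical helper: replace the value of the named variable with its
# Base64-decoded form and export it.
decode_var() {
  name=$1
  # Read the current (encoded) value of the variable whose name is in $name.
  # Assumes $name is a trusted, valid identifier.
  eval "encoded=\${$name}"
  # Stop if the value is not valid Base64.
  decoded=$(printf '%s' "$encoded" | base64 -d) || return 1
  export "$name=$decoded"
}

# Example call at the top of a deployment script:
#   decode_var PIPELINE_ID
```

Call the helper once per variable before the rest of the script reads it.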
Deployment fails due to insufficient disk space
Log on to your host and execute
df -hl
to check the remaining disk space. Clean up the disk if it is out of space.
Check the Runner status and service log information of the host.
ls -al /etc/systemd/system | grep runner
systemctl status runner-{version}-{username}.service
If the host goes offline, uninstall the residual service and reinstall it.
# Stop the specified systemd service.
systemctl stop runner-{version}-{tenant}.service
# Delete the systemd configuration file of the service.
rm -rf /etc/systemd/system/runner-{version}-{tenant}.service
# Delete the runner configuration directory related to the user.
rm -rf /root/yunxiao/{username}/runner/config
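The disk-space check described in this section can also be scripted so that it runs before each deployment. The sketch below is only an illustration: the function name disk_usage_pct and the 90% threshold are assumptions, not product defaults.

```shell
# Print the used percentage (without the % sign) of the filesystem
# that contains the given path (defaults to /).
disk_usage_pct() {
  # -P forces the POSIX single-line output format, so the data is on line 2
  # and the capacity is in column 5.
  df -P "${1:-/}" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

THRESHOLD=90   # assumed warning threshold, adjust as needed
used=$(disk_usage_pct /)
if [ "$used" -ge "$THRESHOLD" ]; then
  echo "Disk usage on / is ${used}%: clean up the disk before deploying"
else
  echo "Disk usage on / is ${used}%"
fi
```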
Host deployment fails due to the conflict of Runner configurations
If you create instance B from an image of instance A, host deployment fails because instance B reuses the Runner configuration of instance A.
Root cause analysis: The Runner's unique identifier (e.g., registration information) on instance A is the same as on instance B. As a result, the routing system cannot differentiate between instance A and instance B.
Critical impact:
Deployment targets might be misrouted. Tasks intended for instance B might be dispatched to instance A.
Automation workflows might be disrupted, leading to environmental contamination or service overlap.
Recommended solutions:
Uninstall Runner on instance B.
Identify the Runner service name, which is typically runner-{version}-{tenant}.service.
ls -al /etc/systemd/system | grep runner
Stop and remove the Runner service.
# Stop the Runner service
systemctl stop runner-{version}-{tenant}.service
# Remove the service file
rm -rf /etc/systemd/system/runner-{version}-{tenant}.service
Clean up Runner configuration directories.
# Delete the Runner configuration directory
rm -rf /root/yunxiao/{tenant}/runner/config
# Completely remove the entire Yunxiao directory
rm -rf /root/yunxiao/
Verify full cleanup.
# Check for running processes
ps -ef | grep runner
If you replicate an instance by image cloning, reinstall Runner on the new instance and ensure that the Runner's channel ID is unique.
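Before running the removal commands on a cloned instance, it can help to list what would be removed. The following is a dry-run sketch under stated assumptions: the function names find_runner_units and print_cleanup_plan are hypothetical, and it only prints the stop and remove commands for unit files that match runner-*.service. It does not touch the configuration directory under /root/yunxiao, which still has to be removed as described above.

```shell
# List Runner unit files in a directory (defaults to the systemd unit directory).
find_runner_units() {
  unit_dir=${1:-/etc/systemd/system}
  for unit in "$unit_dir"/runner-*.service; do
    # Skip the unexpanded glob pattern when nothing matches.
    if [ -e "$unit" ] || [ -L "$unit" ]; then
      basename "$unit"
    fi
  done
  return 0
}

# Print (do not run) the cleanup commands for each unit found.
print_cleanup_plan() {
  unit_dir=${1:-/etc/systemd/system}
  find_runner_units "$unit_dir" | while IFS= read -r unit; do
    echo "systemctl stop $unit"
    echo "rm -rf $unit_dir/$unit"
  done
}

print_cleanup_plan
```

Review the printed commands, then run them manually once they match the service you intend to remove.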
The deployment script executes successfully on the host, but fails on Flow
Load the environment variables that the script depends on, for example by adding "source /root/.bash_profile;source /etc/profile;" at the beginning of the script.
Use an absolute path for the script, such as "/home/admin/app/deploy.sh" instead of "./deploy.sh".
Add "| grep -v rdc_deploy_command" at the end of grep command. For example, "ps -ef | grep athens | grep -v grep" needs to be changed to "ps -ef | grep athens | grep -v grep | grep -v rdc_deploy_command".
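The grep recommendation above can be illustrated with a small filter. This is a sketch only: the function name filter_procs is hypothetical, and the sample process lines are made up to show the effect, not captured from a real host.

```shell
# Keep lines that match the pattern, but drop the grep process itself and
# the deployment command wrapper.
filter_procs() {
  grep "$1" | grep -v grep | grep -v rdc_deploy_command
}

# Made-up sample of a process listing, for illustration only.
sample='admin 101 1 java -jar /home/admin/athens/athens.jar
admin 202 1 sh -c rdc_deploy_command /home/admin/athens/deploy.sh
admin 303 202 grep athens'

# Only the real service process remains after filtering.
printf '%s\n' "$sample" | filter_procs athens
```

On a real host the same filter would be used as "ps -ef | filter_procs athens".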
The host deployment fails and reports User.NoPermission
It means the user does not have permission to call a certain API. Check the service connection of the deployment group.
Host deployment service has started successfully, but the pipeline still shows "deploying"
It is usually because the return code is not correctly passed, or the child process does not exit normally.
Example script: https://atomgit.com/flow-example/spring-boot/blob/master/deploy.sh
Follow the steps below for troubleshooting:
Return code verification
Add
echo $?
after key steps in the script (such as the service start command) to ensure the return code is 0. If a non-zero value is returned, check the exception handling logic of the related commands.
The end of the script must explicitly declare
exit 0
to avoid implicit reliance on the return code of the last command.
Child process management
If you are using nohup, use the standard syntax
nohup java -jar ${JAR_NAME} > ${JAVA_OUT} 2>&1 &
to start a background process (such as a Java service). The & at the end runs the command in the background so that it does not block the main process.
Check other commands that may create child processes (such as systemd or docker run -d) to ensure they correctly detach their processes.
Timeout mechanism enhancement
If the service startup takes a long time, add polling detection (such as HTTP health check) in the script, so that the script exits once the service is ready.
Example:
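# Start the service in the background
nohup java -jar app.jar > log.txt 2>&1 &
# For a background command, $? only reports that the job was launched,
# not that the service started successfully
echo "Service launched with exit code: $?"
# Explicitly declare script successful exit
exit 0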
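The polling idea under "Timeout mechanism enhancement" could look like the sketch below. It is an illustration under assumptions: the helper name wait_until is hypothetical, and the health check URL in the usage comment is a placeholder for whatever endpoint your service exposes.

```shell
# Poll a command once per second until it succeeds or the timeout expires.
# $1 = timeout in seconds, remaining arguments = command to run.
wait_until() {
  timeout_s=$1
  shift
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Example use after starting the service (placeholder URL):
#   wait_until 60 curl -fsS http://127.0.0.1:8080/health || exit 1
#   exit 0
```

Returning a non-zero code when the service never becomes ready lets the pipeline report a failure instead of staying in "deploying".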
If the problem is still not solved after the above optimizations, check the pipeline configuration, or Alibaba Cloud DevOps host Runner Service status.
The workspace cleanup task gets stuck during the host deployment
If Docker is installed manually, follow the steps below for troubleshooting and fixes.
# Get the Docker service status
systemctl status docker
# Key fields in the output:
# Active: active (running): The service is running
# Active: inactive (dead): The service is not started
# Loaded: loaded (...): The service has correctly loaded the configuration
# List running containers (this also confirms that the Docker daemon responds)
docker ps
# Restart the docker service
sudo systemctl restart docker
Troubleshoot Flow Runner
Use the tool to get prompts for troubleshooting
Important: Check the Runner status first when host deployment fails, or when a self-hosted instance fails to apply for an environment.
Download the troubleshooting tool (supports Linux only).
wget "https://rdc-public-software.oss-cn-hangzhou.aliyuncs.com/runner/runnerStatusCheck" -O runnerStatusCheck
Set execution permissions.
chmod u+x runnerStatusCheck
Execute the tool.
./runnerStatusCheck
Follow the prompts provided by the tool for further processing.
Check the Linux system version.
Use the command
lsb_release -a
to get the version.
Alibaba Cloud DevOps Unified Runner currently supports the following Linux distribution versions:
CentOS 6 and later
Ubuntu 16.04 and later
Alibaba Cloud Linux 2 and 3
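On hosts where lsb_release is not installed, many distributions expose similar information in /etc/os-release. The sketch below assumes that file exists, which may not be true on older releases such as CentOS 6; the helper name os_field is hypothetical.

```shell
# Print the value of one key from an os-release style file.
# $1 = key (for example ID or VERSION_ID), $2 = file (defaults to /etc/os-release)
os_field() {
  sed -n "s/^$1=//p" "${2:-/etc/os-release}" 2>/dev/null | tr -d '"' | head -n 1
}

echo "Distribution: $(os_field ID) $(os_field VERSION_ID)"
```

Compare the printed distribution and version against the supported list above.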
Check the Runner service status and logs.
Use the command
ls -al /etc/systemd/system | grep runner
to identify the Runner service name, which is typically formatted as runner-{version}-{tenant}.service.
Check the Runner service status with the command
systemctl status runner-{version}-{tenant}.service
If the status is active (running), the service is functioning properly.
You can view the Runner execution logs with the command
journalctl -u runner-{version}-{tenant}.service -a --no-pager --since '5 minutes ago' -f
Common issues with offline host groups
During building or deploying, logs are not reported or the host goes offline
Use the command
df -hl
to check the disk status. Clean up the disk if it is out of space.
If the Runner service is not active (running), restart it with
systemctl restart runner-{version}-{tenant}.service
Ensure network connectivity.
Query the Runner service status with
systemctl status runner-{version}-{tenant}.service
and note the service process parameter --configPath=***.
Check the URL in the configPath using
cat {***}/config.yml | grep url
Check whether the URL is accessible using the following command.
# Replace {url} in the command below with the URL from config.yml
curl '{url}/api/v2/runner/storage/latest?os=linux&arch=amd64'
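The connectivity check can be chained into one script. This is a sketch under a loud assumption: it supposes that config.yml contains a line of the form "url: <address>", which the source does not confirm. The function names extract_url and check_runner_endpoint are hypothetical, and the config path in the usage comment stays a placeholder.

```shell
# Print the first "url:" value found in a YAML-like file.
# Assumes a line such as:  url: https://example.invalid
extract_url() {
  sed -n 's/^[[:space:]]*url:[[:space:]]*//p' "$1" | tr -d "\"'" | head -n 1
}

# Return 0 if the Runner storage endpoint under the given base URL responds.
check_runner_endpoint() {
  curl -fsS -o /dev/null --max-time 10 \
    "$1/api/v2/runner/storage/latest?os=linux&arch=amd64"
}

# Example use, with the path taken from --configPath:
#   url=$(extract_url "{***}/config.yml")
#   check_runner_endpoint "$url" && echo "reachable" || echo "not reachable"
```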