Implement graceful shutdown based on MSE XXL-JOB

Microservices Engine (MSE) XXL-JOB provides the graceful shutdown feature in addition to the features provided by open source XXL-JOB. Before MSE XXL-JOB gracefully shuts down your application, it notifies the server to stop dispatching new jobs and waits for all running jobs to complete. Business interruptions do not occur during this process. This topic describes how to enable graceful shutdown based on Alibaba Cloud XXL-JOB to help you handle application restart and shutdown scenarios.

Overview

In actual business scenarios, jobs in an application process are scheduled at a fixed frequency. When an application is released and restarted, running jobs are forcefully stopped, which may lead to incomplete data and a sudden drop in the scheduling success rate. These business data issues have the following causes:

Job execution interruptions: When a job is running, the application process is terminated and business processing is stopped. This may cause incomplete business data.
Decreased job scheduling success rate: During the application release and restart process, the scheduler distributes jobs to terminated nodes. This causes job scheduling failures and decreases the overall job processing efficiency.

To address the preceding issues in job scheduling, you need to enable graceful shutdown for jobs to ensure business continuity when applications are restarted during rolling deployments.

Practices for implementing graceful shutdown based on open source XXL-JOB

Principles and existing issues of graceful shutdown based on open source XXL-JOB

When you use open source XXL-JOB to schedule jobs, XXL-JOB executors cannot properly implement graceful shutdown out of the box. Therefore, custom modifications are required if you want to use open source XXL-JOB to gracefully shut down applications. Before you make modifications, analyze the entire process of job distribution and execution in open source XXL-JOB. The process involves two modules: XXL-JOB Admin and XXL-JOB Executor.

The following content describes the issues that may occur when you use open source XXL-JOB to gracefully shut down applications.

Issue 1: Jobs are scheduled to offline nodes

After executors go offline, XXL-JOB Admin does not update the executor list in a timely manner. As a result, jobs may still be scheduled to offline nodes, which results in scheduling failures.

Logical processing operation 1: Register executors

After the required XXL-JOB SDK starts, your business application initializes the ExecutorRegistryThread thread to continuously send heartbeat messages to XXL-JOB Admin.
Upon receipt of heartbeat messages, XXL-JOB Admin writes the information about the registered executors to a database table named xxl_job_registry by using JobRegistryHelper.
A specific thread in JobRegistryHelper periodically queries and updates the address_list field in the xxl_job_group table. address_list provides a list of the registered executors.

Logical processing operation 2: Select online executors

After a scheduling thread triggers a job, XxlJobTrigger is responsible for running the job.
Before running the job, XxlJobTrigger reads a list of executors from the address_list field in the xxl_job_group table.
ExecutorRouter selects an executor from the executor list based on the specified routing policy.
After an executor is selected, XxlJobTrigger sends an RPC request to distribute the job to the node with the specified IP address. In this case, if you select an offline node, the job may fail to be triggered.

Summary: The executor list obtained during executor registration is not synchronized with the executor list obtained for running the job in real time. As a result, the online executor list is not updated in a timely manner.

Issue 2: Jobs are forcefully stopped

When an XXL-JOB executor is disabled, the JobThread thread of the job is stopped and the job is marked as failed. In addition, the requests of all jobs in the queue are discarded and the jobs are marked as failed.

Logical processing operation 1: Distribute jobs to specific executors

After the executor of your business application receives a job request, the executor creates a JobThread thread for each job based on the job ID. The JobThread thread is used to run the job.
When a job request is triggered, it is added to the queue for the current JobThread thread to process. Different jobs have different blocking policies.
The JobThread thread continuously reads trigger requests from the queue and executes the corresponding JobHandler to complete business logic processing.
After the job ends, the JobThread thread submits the execution information to the execution response queue of TriggerCallbackThread and proceeds to the next job.
When the executor stops, it executes the XxlJobExecutor.destroy method to stop the JobThread thread and clears the queue of scheduling requests.

Logical processing operation 2: Feed back execution results of jobs

The TriggerCallbackThread thread runs continuously, reads the current queue of execution results, and sends the execution results to XXL-JOB Admin in batches.
If the TriggerCallbackThread thread fails to send execution results to XXL-JOB Admin, the TriggerCallbackThread thread stores the execution results to local disks. The TriggerCallbackThread thread will try to resend execution results later.
After XXL-JOB Admin receives the execution results, it writes them to the database.

Summary: In the preceding shutdown process, the removeJobThread method that XxlJobExecutor.destroy calls directly stops the JobThread thread, discards the job requests that remain in the thread queue, and marks the jobs as failed.

Implementation process

Based on the process analysis in Principles and existing issues of graceful shutdown based on open source XXL-JOB, graceful shutdown of an application involves three steps: remove traffic, wait for jobs in the queue to complete, and then shut down the application.

The com.xxl.job.core.executor.XxlJobExecutor#destroy method provided by the XXL-JOB core module is automatically called back when an application process exits in Spring Boot mode. This method stops the application executors and reclaims their resources. However, its logic does not fully implement graceful shutdown. Therefore, you need to perform the following steps to implement graceful shutdown based on the preceding analysis.

Step 1: Remove traffic from application nodes

The stopEmbedServer() method in XxlJobExecutor#destroy stops the heartbeat registration mechanism and sends the registryRemove request to XXL-JOB Admin to remove the current executor.
After XXL-JOB Admin receives a request, it removes the current executor from the xxl_job_registry table in the database. However, based on the preceding analysis, the address_list field in the xxl_job_group table is not synchronized and updated in real time. Therefore, traffic is not removed.
The features provided by the XXL-JOB Admin server need to be modified to remove traffic. You can use one of the following methods to modify the features.
- Add the subsequent processing logic to the JobRegistryHelper.registryRemove method to update the address_list field in the xxl_job_group table. You can also implement the update logic in the freshGroupRegistryInfo method.
- Modify the XxlJobTrigger#trigger() method to change how the executor list is read. Instead of reading the address_list field in the xxl_job_group table, the XXL-JOB Admin server reads the addresses of automatically registered executors directly from the xxl_job_registry table.

After you perform the preceding operations, traffic is removed.
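
The first method can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not XXL-JOB Admin code: the RegistryModel class and its fields are hypothetical stand-ins for the xxl_job_registry table and the address_list field, and only the method names registryRemove and freshGroupRegistryInfo echo the methods mentioned above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory model; not the real XXL-JOB Admin classes. */
public class RegistryModel {
    // Stand-in for the xxl_job_registry table: executors that are currently registered.
    private final Set<String> registry = new LinkedHashSet<>();
    // Stand-in for xxl_job_group.address_list: the snapshot that the trigger reads.
    private List<String> addressList = new ArrayList<>();
    // When true, a removal refreshes the snapshot immediately (the proposed modification).
    private final boolean refreshOnRemove;

    public RegistryModel(boolean refreshOnRemove) {
        this.refreshOnRemove = refreshOnRemove;
    }

    /** An executor registers itself (heartbeat). */
    public void registry(String address) {
        registry.add(address);
    }

    /** An executor deregisters itself during shutdown. */
    public void registryRemove(String address) {
        registry.remove(address);
        if (refreshOnRemove) {
            freshGroupRegistryInfo();
        }
    }

    /** Rebuilds the snapshot from the registry; in the unmodified flow only a periodic thread does this. */
    public void freshGroupRegistryInfo() {
        addressList = new ArrayList<>(registry);
    }

    /** What the trigger would see when it selects an executor. */
    public List<String> readAddressList() {
        return new ArrayList<>(addressList);
    }

    public static void main(String[] args) {
        for (boolean refresh : new boolean[] {false, true}) {
            RegistryModel model = new RegistryModel(refresh);
            model.registry("10.0.0.1:9999");
            model.registry("10.0.0.2:9999");
            model.freshGroupRegistryInfo();          // the periodic refresh has run once
            model.registryRemove("10.0.0.2:9999");   // the second executor goes offline
            System.out.println("refreshOnRemove=" + refresh + " -> " + model.readAddressList());
        }
    }
}
```

Without the immediate refresh, the removed address stays in the snapshot until the next periodic refresh, which is the window in which jobs can be routed to an offline node. With the refresh, the address disappears as soon as the removal is processed.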

Step 2: Wait for jobs in the queue to complete

Modify the subsequent processing logic of the XxlJobExecutor#destroy method to wait for all jobs in the queue to complete. The following sample code shows an example.

public void destroy() {

    // destroy executor-server
    stopEmbedServer();

    // destroy jobThreadRepository
    if (jobThreadRepository.size() > 0) {
        // Copy the keys so that the loop is not affected by concurrent changes to the repository.
        List<Integer> keyList = new ArrayList<>(jobThreadRepository.keySet());
        for (Integer jobId : keyList) {
            JobThread jobThread = jobThreadRepository.get(jobId);
            // Wait for all jobs in the queue to complete.
            while (jobThread != null && jobThread.isRunningOrHasQueue()) {
                try {
                    TimeUnit.SECONDS.sleep(1L);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop waiting.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
    }
    jobHandlerRepository.clear();

    // destroy JobLogFileCleanThread
    JobLogFileCleanThread.getInstance().toStop();

    // destroy TriggerCallbackThread
    TriggerCallbackThread.getInstance().toStop();

}

Wait for the queue of execution results to feed back all execution results. In open source XXL-JOB, the TriggerCallbackThread.getInstance().toStop() method sends the remaining execution results in one final batch after the TriggerCallbackThread thread is stopped. Therefore, additional processing is not required.

After you perform this step, you can wait for running jobs to complete. You can also perform custom shutdown operations based on job types.
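
The wait loop in the sample code has no upper bound, so a job that never finishes would block shutdown indefinitely. One possible custom shutdown operation is to cap the wait. The following sketch shows a generic bounded wait. The BoundedWait class, the awaitIdle method, and its parameters are hypothetical names introduced for illustration and are not part of XXL-JOB.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for a bounded wait; not part of XXL-JOB. */
public class BoundedWait {

    /**
     * Polls until isBusy returns false or the timeout elapses.
     *
     * @return true if the work finished in time, false on timeout or interruption
     */
    public static boolean awaitIdle(BooleanSupplier isBusy, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (isBusy.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // Timed out: the caller decides whether to force a stop.
            }
            try {
                TimeUnit.MILLISECONDS.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Simulated job that finishes after about 200 ms.
        long finishAt = System.currentTimeMillis() + 200;
        boolean drained = awaitIdle(() -> System.currentTimeMillis() < finishAt, 2_000, 20);
        // Simulated job that never finishes; the wait gives up after 100 ms.
        boolean timedOut = !awaitIdle(() -> true, 100, 20);
        System.out.println("drained=" + drained + ", timedOut=" + timedOut);
    }
}
```

In the destroy method, the inner while loop could then become a call such as awaitIdle(jobThread::isRunningOrHasQueue, 60_000, 1_000) after a null check on jobThread. The timeout and polling interval here are placeholder values that you would choose based on how long your jobs run.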

Step 3: Stop application processes

To stop application processes, we recommend that you add kill -15 to the application deployment script. This command sends the SIGTERM signal, which triggers the JVM shutdown hooks. You can also forcefully stop application processes after a timeout based on your business requirements.
Alternatively, if your application uses Spring Boot Actuator, you can enable the shutdown endpoint and call the /actuator/shutdown interface to shut down the application.

Prerequisites

The engine version is 2.1.0 or later. For more information about engine versions, see Release notes for XXL-JOB.
The dependency related to the SchedulerX plug-in package is added to the pom.xml file of the client. For more information about XXL-JOB plug-in versions, see Release notes for the XXL-JOB plug-in.
```
<dependency>
  <groupId>com.aliyun.schedulerx</groupId>
  <artifactId>schedulerx3-plugin-xxljob</artifactId>
  <version>Latest version</version>
</dependency>
```

Procedure

This section describes how to configure and enable the graceful shutdown solution for different business forms and deployment scenarios. The procedure consists of two steps.

Step 1: Perform initial integration with executors by using different frameworks

Different deployment modes of business applications require different methods for initial integration.

Mode 1: Integrate business applications with XXL-JOB executors by using Spring Boot (recommended)

If you integrate a business application with XXL-JOB executors by using Spring Boot, the system can automatically perform initial integration for graceful shutdown. Perform the following steps:

Add the Maven dependency of the SchedulerX plug-in. For more information about plug-in versions, see Release notes for the XXL-JOB plug-in.

<dependency>
  <groupId>com.aliyun.schedulerx</groupId>
  <artifactId>schedulerx3-plugin-xxljob</artifactId>
  <version>Latest version</version>
</dependency>

Add the application configuration parameter and enable graceful shutdown. For more information about the parameter, see Parameter description.
```
# Configure graceful shutdown
xxl.job.executor.shutdownMode=WAIT_ALL
```

Mode 2: Integrate business applications with XXL-JOB executors by using Spring

If your business application is a web application that is started by using the Spring framework, you must add the POM dependency and application startup parameter. For more information, see Mode 1: Integrate business applications with XXL-JOB executors by using Spring Boot (recommended). You must also add the following XxlJobExecutorEnhancerInitializer configuration to the web.xml file:

<web-app>
  <context-param>
        <!-- Spring ApplicationContextInitializer is used to improve the capabilities of XXL-JOB executors. -->
        <param-name>globalInitializerClasses</param-name>
        <param-value>com.aliyun.schedulerx.xxljob.enhance.XxlJobExecutorEnhancerInitializer</param-value>
    </context-param>
</web-app>

Mode 3: Integrate business applications with XXL-JOB executors in the frameless mode

In the frameless mode, which is described in the use cases of XXL-JOB executors, the business application is started in pure Java without a framework. In this mode, you can perform initial integration for graceful shutdown by using custom code. You must add the POM dependency and application startup parameters for the business application. For more information, see Mode 1: Integrate business applications with XXL-JOB executors by using Spring Boot (recommended). Then, make the following changes:

Before executors start, add EnhancerLoader.load(xxlJobProp) to load feature enhancements.
Before executors start, add Runtime.getRuntime().addShutdownHook(...) to add a shutdown hook for the current application.

Sample code

public static void main(String[] args) {
    try {
        // Load the xxl-job-executor.properties file.
        Properties xxlJobProp = FrameLessXxlJobConfig.loadProperties("xxl-job-executor.properties");

        // Load the enhancement features for XXL-JOB executors during the initial integration.
        EnhancerLoader.load(xxlJobProp);

        // Start an XXL-JOB executor.
        FrameLessXxlJobConfig.getInstance().initXxlJobExecutor(xxlJobProp);

        // Add a shutdown hook for graceful shutdown.
        Runtime.getRuntime().addShutdownHook(new Thread(){
            @Override
            public void run() {
                FrameLessXxlJobConfig.getInstance().destroyXxlJobExecutor();
            }
        });
        // Blocks until interrupted.
        while (true) {
            try {
                TimeUnit.HOURS.sleep(1);
            } catch (InterruptedException e) {
                break;
            }
        }
    } catch (Exception e) {
        logger.error(e.getMessage(), e);
    } finally {
        // destroy
        FrameLessXxlJobConfig.getInstance().destroyXxlJobExecutor();
    }
}

Step 2: Shut down an application

Use kill -15 in a self-built CD process

In a self-built continuous delivery (CD) process, a dedicated stage is available to stop application processes. In this stage, you can create a script named stop.sh to stop or exit application processes. The script must contain the logic for the graceful shutdown of the application. The following code shows the sample script content for application process shutdown.

Sample script content for application process shutdown

# Write the process ID to the app.pid file after an application starts.
PID="{Application deployment path}/app.pid"
FORCE=1
if [ -f "${PID}" ]; then
  TARGET_PID=$(cat "${PID}")
  kill -15 "${TARGET_PID}"
  loop=1
  while (( loop <= 5 ))
  do
    ## Use a health check to confirm that the current application process is terminated. The logic can be customized based on the application characteristics.
    ## health is a placeholder that is expected to return 0 while the process is still running.
    health
    if [ $? -eq 0 ]; then
      echo "check $loop times, current app has not stopped yet."
      sleep 5
      loop=$((loop + 1))
    else
      FORCE=0
      break
    fi
  done
  if [ $FORCE -eq 1 ]; then
    echo "App(pid:${TARGET_PID}) stop timeout, forced termination."
    kill -9 "${TARGET_PID}"
  fi
  rm -f "${PID}"
  echo "App(pid:${TARGET_PID}) stopped successfully."
fi

Use the preStop hook for Kubernetes deployments

The lifecycle management feature of Kubernetes pods can automatically implement graceful shutdown. You can also use the preStop hook to implement the graceful shutdown logic, either by using exec to run a script or by sending an HTTP request.

No modification required: If the application process is the main process (PID 1) in a container, the system automatically sends a SIGTERM signal to the main process to gracefully shut down the application.
Custom preStop: If the container has a complex multi-process relationship, you can configure a custom preStop script that uses kill -15 PID to stop the application process. You can also call the preconfigured stop.sh script to exit the application process.

Important

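In this solution, the terminationGracePeriodSeconds parameter of the pod specifies the maximum wait time, in seconds, for graceful shutdown. This period includes the time that the preStop hook takes to run. The default value of this parameter is 30. You must configure the parameter in the pod spec based on your business requirements, such as the longest expected run time of your jobs.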

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app-container
        image: my-app-image:latest
        lifecycle:
          preStop:
            exec:
              # command: ["/bin/sh", "-c", "kill -15 PID && sleep 30"]
              command: ["/bin/sh", "-c", "Script path/stop.sh"]

Use automatic integration on the application release platform in Alibaba Cloud

This solution will be available soon.

Parameter description

You must configure the following parameter to enable graceful shutdown for your application. This parameter provides two modes for you to implement graceful shutdown.

# The graceful shutdown mode. Valid values: WAIT_ALL and WAIT_RUNNING. 
# If you do not configure this parameter, the original logic of XXL-JOB is applied so that graceful shutdown is disabled by default. 
xxl.job.executor.shutdownMode=WAIT_ALL

| Graceful shutdown mode | Description |
| --- | --- |
| `WAIT_ALL` | In this mode, an application exits only after all jobs, including running jobs and jobs in queue, are complete. This mode is recommended. |
| `WAIT_RUNNING` | In this mode, an application exits after running jobs to which threads are allocated in the application are complete. Jobs in queue are dropped. |
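
The difference between the two modes can be sketched with a small model. This is a hypothetical illustration of the semantics described in the table, not the plug-in's implementation: the ShutdownModeSketch class, its Outcome type, and the job names are assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of the two shutdown modes; not the plug-in's implementation. */
public class ShutdownModeSketch {

    public enum ShutdownMode { WAIT_ALL, WAIT_RUNNING }

    /** Records which jobs finished and which were dropped when the executor shut down. */
    public static final class Outcome {
        public final List<String> completed = new ArrayList<>();
        public final List<String> dropped = new ArrayList<>();
    }

    public static Outcome shutdown(ShutdownMode mode, String runningJob, Deque<String> queuedJobs) {
        Outcome outcome = new Outcome();
        // Both modes let the job that is currently running finish.
        if (runningJob != null) {
            outcome.completed.add(runningJob);
        }
        // The modes differ only in how queued jobs are treated.
        String job;
        while ((job = queuedJobs.poll()) != null) {
            if (mode == ShutdownMode.WAIT_ALL) {
                outcome.completed.add(job);
            } else {
                outcome.dropped.add(job);
            }
        }
        return outcome;
    }

    public static void main(String[] args) {
        for (ShutdownMode mode : ShutdownMode.values()) {
            Deque<String> queue = new ArrayDeque<>(Arrays.asList("job-2", "job-3"));
            Outcome outcome = shutdown(mode, "job-1", queue);
            System.out.println(mode + ": completed=" + outcome.completed + ", dropped=" + outcome.dropped);
        }
    }
}
```

In this model, WAIT_ALL completes job-1, job-2, and job-3, whereas WAIT_RUNNING completes only job-1 and drops the two queued jobs. WAIT_RUNNING therefore exits sooner, at the cost of the dropped jobs having to be rerun by other means.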