MapReduce is a lightweight model developed by SchedulerX to perform batch data processing in a distributed manner. You can implement the MapJobProcessor or MapReduceJobProcessor interface to use connected workers as a distributed computing engine that processes large amounts of data in batches. Compared with traditional batch-processing engines for massive data, such as Hadoop and Spark, this model offers lower costs, higher speed, and simpler programming: MapReduce processes data in seconds, without importing the data to a big data platform and without additional storage or computing costs.
Precautions
- A single task cannot exceed 64 KB in size.
- The result field of ProcessResult cannot exceed 1,000 bytes in size.
- If you use the reduce method, the results of all tasks are cached on the master node, which puts high memory pressure on that node. We recommend that you keep both the number of tasks and the size of the result field small. If you do not need the reduce method, implement the MapJobProcessor interface instead.
- If a failover is triggered, SchedulerX may run a task more than once. You must implement the idempotence of tasks on your own.
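Because of the last point, subtask logic should tolerate duplicate delivery. The following is a minimal sketch of one common approach, deduplication by business key; all names are illustrative, and a production check would go to a database or cache rather than an in-memory set.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// A failover can re-dispatch the same subtask, so the handler must be
// idempotent. This sketch skips work whose business key was already
// completed. The in-memory set is a stand-in for a durable store.
public class IdempotentTaskSketch {
    private static final Set<String> done = ConcurrentHashMap.newKeySet();

    // Returns true only the first time a given task key is processed.
    public static boolean processOnce(String taskKey) {
        if (!done.add(taskKey)) {
            return false; // duplicate delivery after a failover: skip
        }
        // ... the real side effect (e.g. updating an order) would go here ...
        return true;
    }

    public static void main(String[] args) {
        System.out.println(processOnce("order-42")); // first delivery
        System.out.println(processOnce("order-42")); // duplicate, skipped
    }
}
```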
Methods
MapJobProcessor

| Method | Description | Required |
| --- | --- | --- |
| `public ProcessResult process(JobContext context) throws Exception;` | The entry point of task execution. Obtain the task name from the context and implement the corresponding logic. Returns a `ProcessResult`. | Yes |
| `public ProcessResult map(List<? extends Object> taskList, String taskName);` | Distributes a batch of tasks to multiple workers. Can be called multiple times. Returns an error if `taskList` is empty. Returns a `ProcessResult`. | Yes |
| `public void kill(JobContext context);` | Triggered when the job is terminated in the console. Implement the logic that interrupts your workload. | No |
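As a concrete illustration of these methods, here is a minimal, runnable sketch of the MapJobProcessor pattern: the root task fans out subtasks via `map`, and every other task name is handled as a subtask. The SchedulerX types (`ProcessResult`, `JobContext`) and the `map` call are replaced with simplified stand-ins so the control flow runs on its own; the real interfaces come from the SchedulerX worker SDK, and all task names here are made up.

```java
import java.util.ArrayList;
import java.util.List;

public class MapJobSketch {

    // Stand-in for SchedulerX's ProcessResult.
    static class ProcessResult {
        final boolean success;
        final String result;
        ProcessResult(boolean success) { this(success, null); }
        ProcessResult(boolean success, String result) {
            this.success = success;
            this.result = result;
        }
    }

    // Stand-in for the relevant parts of JobContext.
    static class JobContext {
        final String taskName;
        final Object task;
        JobContext(String taskName, Object task) {
            this.taskName = taskName;
            this.task = task;
        }
        String getTaskName() { return taskName; }
        Object getTask() { return task; }
    }

    static final String ROOT_TASK = "ROOT_TASK";
    static final List<String> processedOrders = new ArrayList<>();

    // Stand-in for the SDK's map(taskList, taskName): the real call ships
    // the batch to workers; here we run each subtask in-process.
    ProcessResult map(List<? extends Object> taskList, String taskName) {
        if (taskList == null || taskList.isEmpty()) {
            return new ProcessResult(false); // empty batches are an error
        }
        for (Object t : taskList) {
            process(new JobContext(taskName, t)); // simulate worker-side run
        }
        return new ProcessResult(true);
    }

    // The entry point: branch on the task name obtained from the context.
    public ProcessResult process(JobContext context) {
        if (ROOT_TASK.equals(context.getTaskName())) {
            List<String> orderIds = new ArrayList<>();
            for (int i = 1; i <= 3; i++) {
                orderIds.add("order-" + i);
            }
            return map(orderIds, "OrderTask"); // fan out subtasks
        }
        // Subtask branch: handle one unit of work.
        processedOrders.add((String) context.getTask());
        return new ProcessResult(true);
    }

    public static void main(String[] args) {
        MapJobSketch job = new MapJobSketch();
        ProcessResult r = job.process(new JobContext(ROOT_TASK, null));
        System.out.println("success=" + r.success
                + ", processed=" + processedOrders);
    }
}
```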
MapReduceJobProcessor

| Method | Description | Required |
| --- | --- | --- |
| `public ProcessResult process(JobContext context) throws Exception;` | The entry point of task execution. Obtain the task name from the context and implement the corresponding logic. Returns a `ProcessResult`. | Yes |
| `public ProcessResult map(List<? extends Object> taskList, String taskName);` | Distributes a batch of tasks to multiple workers. Can be called multiple times. Returns an error if `taskList` is empty. Returns a `ProcessResult`. | Yes |
| `public ProcessResult reduce(JobContext context);` | Called back on the master node after the tasks on all workers have finished. Usually used to aggregate data, send notifications downstream, or pass data between upstream and downstream jobs in a workflow. The `reduce` method handles the results of all tasks: each task returns its execution result, such as an order ID, with `return new ProcessResult(true, result)`. In `reduce`, obtain the statuses (`context.getTaskStatuses()`) and execution results (`context.getTaskResults()`) of all tasks from the context and process them. | Yes |
| `public void kill(JobContext context);` | Triggered when the job is terminated in the console. Implement the logic that interrupts your workload. | No |
| `public boolean runReduceIfFail(JobContext context);` | Specifies whether to call the `reduce` method if a task fails. By default, `reduce` is executed even if tasks fail. | No |
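The reduce flow described above can be sketched as follows: each subtask returns its result through `ProcessResult(true, result)`, and `reduce` aggregates the results collected in the context. As in the previous sketch, the SchedulerX types are simplified stand-ins so the flow is runnable on its own; the real interfaces live in the SchedulerX worker SDK.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapReduceSketch {

    // Stand-in for SchedulerX's ProcessResult.
    static class ProcessResult {
        final boolean success;
        final String result;
        ProcessResult(boolean success) { this(success, null); }
        ProcessResult(boolean success, String result) {
            this.success = success;
            this.result = result;
        }
    }

    // Stand-in for the parts of JobContext that reduce() reads.
    static class JobContext {
        final String taskName;
        final Object task;
        final Map<Long, String> taskResults = new LinkedHashMap<>();
        JobContext(String taskName, Object task) {
            this.taskName = taskName;
            this.task = task;
        }
        String getTaskName() { return taskName; }
        Object getTask() { return task; }
        Map<Long, String> getTaskResults() { return taskResults; }
    }

    static final String ROOT_TASK = "ROOT_TASK";

    // Root task fans out subtasks; each subtask returns its value as the
    // ProcessResult result string (kept small: the field is capped at
    // 1,000 bytes).
    public ProcessResult process(JobContext context) {
        if (ROOT_TASK.equals(context.getTaskName())) {
            List<Integer> amounts = new ArrayList<>();
            amounts.add(10);
            amounts.add(20);
            amounts.add(30);
            // In the real SDK this would be: return map(amounts, "SumTask");
            return new ProcessResult(true);
        }
        return new ProcessResult(true, String.valueOf(context.getTask()));
    }

    // Runs on the master node after all subtasks complete: aggregate the
    // per-subtask results collected in the context.
    public ProcessResult reduce(JobContext context) {
        int total = 0;
        for (String r : context.getTaskResults().values()) {
            total += Integer.parseInt(r);
        }
        return new ProcessResult(true, String.valueOf(total));
    }

    public static void main(String[] args) {
        MapReduceSketch job = new MapReduceSketch();
        // Simulate three finished subtasks and the final reduce callback.
        JobContext ctx = new JobContext("SumTask", null);
        long id = 0;
        for (int amount : new int[] {10, 20, 30}) {
            ProcessResult sub = job.process(new JobContext("SumTask", amount));
            ctx.taskResults.put(id++, sub.result);
        }
        System.out.println("total=" + job.reduce(ctx).result);
    }
}
```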
Procedure
1. Log on to the SchedulerX console. In the left-side navigation pane, click Task Management.
2. On the Task Management page, click Create task.
3. In the Create Task panel, select MapReduce from the Execution mode drop-down list and configure the parameters in the Advanced Configuration section. The following table describes the parameters.
| Parameter | Description |
| --- | --- |
| Distribution Policy | Note: The agent version must be 1.10.3 or later.<br>Polling Policy (default): The system distributes the same number of subtasks to each worker. Suitable for scenarios in which each subtask takes roughly the same amount of time to process.<br>WorkerLoad Optimal Policy: The master node detects the load on each worker node. Suitable for scenarios in which subtask processing times vary widely. |
| Number of Concurrent Subtasks per Machine | The number of execution threads on a worker. Default value: 5. Specify a larger value to speed up execution, or a smaller value if downstream systems or databases cannot withstand the load. |
| Number of Retries for Failed Subtasks | The number of times a failed subtask is automatically retried. Default value: 0. |
| Subtask Retry Interval | The interval between two consecutive retries of a failed subtask. Unit: seconds. Default value: 0. |
| Subtask Failover Policy | Note: The agent version must be 1.8.12 or later.<br>Specifies whether to redistribute a subtask to another worker when the worker that was executing it fails and stops. If you turn on the switch, a subtask may be executed more than once when a failover is triggered; you must implement the idempotence of subtasks on your own. |
| Master Node Participates in Execution | Note: The agent version must be 1.8.12 or later.<br>Specifies whether the master node also executes subtasks. At least two workers must be available to run tasks. If an extremely large number of subtasks exist, we recommend that you turn off the switch. |
| Subtask Distribution Method | Push model: The system distributes the same number of subtasks to each worker.<br>Pull model: Each worker pulls subtasks on demand, so slow workers do not hold back fast ones, and workers added during execution can also pull subtasks. During the pull process, all subtasks are cached on the master node, which puts high pressure on its memory; we recommend that you do not distribute more than 10,000 subtasks at a time. |
For more information about the parameters, see Advanced parameters for job management.