This topic describes how to use the data warehouse development feature in Data Management (DMS) to archive data.

Background information

The volume of data in online tables increases as your business grows. Business data can be divided into hot data and cold data. Hot data is recent data that is frequently accessed. Cold data is historical data that is infrequently accessed but must be retained for a long period for backup, archiving, or tracing purposes. If the amount of data in your business tables is manageable and the database performance is sufficient for your workload, you may not need to process cold data. However, if frequent data or schema changes are required, separating cold data from hot data in a timely and efficient manner can significantly shorten the time required to perform the changes and reduce the performance overhead.

Before the data warehouse development feature was introduced, data was typically archived in one of the following ways:

  • Developers write the dumping and cleanup logic and deploy the code on an application server, where it runs on a schedule.
  • Database administrators (DBAs) or O&M engineers write the dumping and cleanup SQL statements in a script and deploy the script on a database server as a crontab job, so that it runs on a schedule (see the sketch after this list).
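
A minimal sketch of such a crontab entry, assuming a hypothetical archive script at /opt/scripts/archive_orders.sh:

  # Hypothetical crontab entry: run the archive script at 02:00 every day
  # and append its output to a log file.
  0 2 * * * /opt/scripts/archive_orders.sh >> /var/log/archive_orders.log 2>&1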

Both archiving methods have the following shortcomings:

  • The development and configuration processes must be repeated for each business table.
  • There is no global control, so scheduled tasks cannot be stopped during important periods such as major promotions or large-scale data changes.
  • When a task does not run as expected, the owner is not notified and cannot fix the issue in a timely manner. Data then accumulates in the online tables, which degrades service performance.
  • Execution logs are not managed centrally, which makes tracing and troubleshooting inefficient.

Benefits

Archiving data online by using the data warehouse development feature in DMS provides the following benefits:
  • You can configure archiving logic for business tables on demand at any time.
  • The configurations of all task flows are visible in one place. Scheduled tasks can be easily paused during important events such as major promotions.
  • Developers, DBAs, and O&M engineers can all configure task flows in this module in the console, which reduces the workload of developers.
  • The centralized operations interface allows you to check execution logs at any time.
  • When a task fails, you receive a timely notification so that you can resolve the failure as early as possible.

Procedure

  1. Log on to the DMS console V5.0.
  2. In the top navigation bar, click DTS. In the left-side navigation pane, choose Data Development > Data Warehouse Development.
    Note You cannot create task flows in the Data Warehouse Development (Old) module.
  3. Optional: Create a workspace. For more information, see Create a workspace.
    Note If an existing workspace in Data Warehouse Development can accommodate the task flow that you want to create, you do not need to create a new workspace.
  4. Find the workspace that you want to manage and click Go to Workspace in the Operation column.
  5. Create a task flow.
    Note You can also click Create Task Flow on the page that appears.
    1. In the left-side navigation pane of the workspace, click the + icon and select Business Scenario.
    2. In the Create Business Scenario dialog box, enter a name and description for the business scenario.
    3. Click OK.
    4. Right-click the business scenario and select Create Task Flow.
    5. In the Create Task Flow dialog box, enter a name and description for the task flow.
    6. Click OK.
  6. Find the Single Instance SQL node in the Task Type list on the left side of the canvas and drag it to a blank area on the canvas.
    Note If the dumped data and the online data are not stored on the same instance, use the Cross-database Spark SQL node instead.
  7. Configure the task node.
    1. Right-click the Single Instance SQL node and select Edit Current Node.
    2. Edit the dumping logic on the page that appears.
      • You can use the Design schemas and Schema design for logical tables features to create the tables for multiple years at a time. You can also use a CREATE TABLE statement to create tables in the current SQL task.
      • If you want to store each month's data in a separate table, specify the table name by using a variable in the INSERT statement. The variable format is ${Variable name}. See the sketch after this step.
    3. In the Properties panel, click Variable configuration. On the Node Variable tab, configure the variables that are used in the task flow.
    4. After you edit the SQL statements and configure the variables, click Preview to check whether the final SQL statements, with the variables replaced, are as expected.
      Note If Auto Save is selected, you do not need to manually save the configuration.
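    A minimal sketch of the dumping logic that a Single Instance SQL node might contain follows. The table names orders and orders_history_${yyyymm}, the gmt_create column, and the 90-day retention window are hypothetical examples, and ${yyyymm} is assumed to be a variable configured on the Node Variable tab:

      -- Create this month's history table if it does not exist yet.
      -- orders and orders_history_${yyyymm} are example names.
      CREATE TABLE IF NOT EXISTS orders_history_${yyyymm} LIKE orders;

      -- Copy rows older than 90 days (an example retention window)
      -- from the online table into the history table.
      INSERT INTO orders_history_${yyyymm}
      SELECT *
      FROM orders
      WHERE gmt_create < DATE_SUB(CURDATE(), INTERVAL 90 DAY);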
  8. Repeat steps 6 and 7 to create another task node that cleans up the dumped data in the online table, as shown in the following sketch.
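    A matching cleanup node might contain a statement such as the following. The same hypothetical table, column, and retention window as in the previous sketch are assumed. The DELETE statement must run only after the dump node succeeds, which the DAG connection in the next step enforces:

      -- Remove the rows that the dump node has already copied
      -- to the history table.
      DELETE FROM orders
      WHERE gmt_create < DATE_SUB(CURDATE(), INTERVAL 90 DAY);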
  9. Return to the Directed Acyclic Graph (DAG) and connect the two task nodes.
    Note Connect the dump node to the cleanup node, because data must be dumped before it can be cleaned up.
  10. After the task nodes are connected in the DAG, enable scheduling in the Properties panel in the lower part of the page and set the scheduling parameters.
    Note Auto Save is selected by default. You do not need to manually save the changes.

What to do next

On the Data Warehouse Development page, find the workspace that you want to manage and click Operation in the Operation column to view the status of the task flow.
Note If a task fails, the owners of the task flow are notified. By default, the owners are the creator of the task flow, DMS administrators, and DBAs.