The Internet of Things (IoT) is a network built on the Internet and traditional telecommunication networks. It carries data and connects all physical objects that are independently addressable. This topic describes how to automatically synchronize IoT data to MaxCompute on the cloud.

Background information

The IoT aims to collect required information in real time by using various technologies and devices such as sensors. The IoT can connect objects on various types of networks to enable communications among the objects and interaction between the objects and human beings. On the IoT, objects and processes can be detected, identified, and managed in an intelligent manner.

As the three representative technologies of the third information and communications technology (ICT) wave, IoT, big data analytics, and cloud computing will have a wide influence in the future. IoT focuses on connecting things. Big data analytics aims to exploit the data value. Cloud computing provides service support such as computing resources for big data analytics and IoT.

Big data is an important part of the IoT system. The IoT system consists of devices, networks, platforms, applications, and features such as big data analytics and security assurance. Big data analytics is an important means to exploit the value of big data. Before you can run big data analytics, you must synchronize the data to the cloud.

Solution

The solution for automatically synchronizing IoT data to the cloud consists of two parts: storing the raw data, and synchronizing the raw data to a data analytics system.

Large amounts of raw data from IoT devices are generally stored in semi-structured form. For example, the raw data is stored in Comma-Separated Values (CSV) files in Object Storage Service (OSS).
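As an illustration of the semi-structured raw data described above, the following minimal sketch serializes a few device readings to CSV text. The column layout (device ID, timestamp, temperature), the device names, and the sample values are hypothetical and are not taken from any real device. Uploading the resulting text to OSS is out of scope here.

```python
import csv
import io
from datetime import datetime


def readings_to_csv(readings):
    """Serialize (device_id, timestamp, temperature) tuples to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for device_id, ts, temperature in readings:
        writer.writerow([
            device_id,
            ts.strftime("%Y-%m-%d %H:%M:%S"),
            f"{temperature:.1f}",
        ])
    return buffer.getvalue()


# Hypothetical sample readings, for illustration only.
sample = [
    ("sensor-001", datetime(2019, 11, 6, 23, 15, 0), 21.5),
    ("sensor-002", datetime(2019, 11, 6, 23, 15, 30), 19.0),
]
csv_text = readings_to_csv(sample)
print(csv_text)
```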

To synchronize raw data to a big data analytics system or a traditional database, a professional data synchronization system is required. The following figure shows the process of synchronizing raw data from OSS to MaxCompute by using DataWorks Data Integration.
  1. Create a batch sync node. For more information, see Create a sync node by using the codeless UI.
  2. Configure the batch sync node to read data from OSS. For more information, see Configure OSS Reader.
  3. Configure the batch sync node to write data to MaxCompute. You can also write data to a data store of another type. For more information, see Supported data stores and plug-ins.

Configure automatic data synchronization

To use DataWorks to read CSV files from OSS, you must specify the names of the files to read. Because IoT devices constantly generate raw data and store it in new CSV files, specifying each file name manually is impractical. This section describes how to configure automatic synchronization of data to MaxCompute.

When you use this solution, note the following points:
  • CSV files must be periodically generated in OSS.

    DataWorks can periodically run a sync node as scheduled. You can configure the scheduling system to run the sync node at the interval at which CSV files are generated in OSS. For example, if a new CSV file is generated in OSS every 15 minutes, you can configure the scheduling system to run the sync node every 15 minutes.

  • Each file obtained from OSS must be named based on the corresponding timestamp.
    Each time the sync node runs, it must read the file whose name carries the corresponding timestamp. You can specify a variable so that DataWorks dynamically generates the file name for each run and the generated name is the same as the name of the corresponding file in OSS.
    Note: We recommend that you specify a variable for DataWorks to generate file names with a timestamp in the yyyymmddhhmm format, for example, iot_log_201911062315.csv.
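The two points above can be sketched in Python. This is a minimal illustration, not part of DataWorks: it lists the object names that a device would need to produce if it writes one file every 15 minutes and names each file with a yyyymmddhhmm timestamp. The function names, the iot_log_ prefix, and the .csv suffix follow the example file name above and are otherwise assumptions.

```python
from datetime import datetime, timedelta


def object_name(ts, prefix="iot_log_", suffix=".csv"):
    """Build a file name with a timestamp that is accurate to the minute."""
    return f"{prefix}{ts.strftime('%Y%m%d%H%M')}{suffix}"


def expected_object_names(start, count, interval_minutes=15):
    """List the names expected for `count` consecutive intervals."""
    step = timedelta(minutes=interval_minutes)
    return [object_name(start + i * step) for i in range(count)]


names = expected_object_names(datetime(2019, 11, 6, 23, 15), 4)
for name in names:
    print(name)
```

If the sync node runs at the same interval and generates names with the same pattern, each run finds exactly one matching file.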
To configure automatic synchronization, perform the following steps:
  1. Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace that you want to use and click Data Integration in the Actions column.
  2. Add connections to OSS and MaxCompute. For more information, see Configure an OSS connection and Configure a MaxCompute connection.
  3. Click the DataWorks icon in the upper-left corner and choose All Products > DataStudio. On the page that appears, create a workflow. For more information, see Create a workflow.
  4. Create a batch sync node. For more information, see Create a batch synchronization node.
  5. On the configuration tab of the batch sync node, specify the Connection parameter and set Object Name Prefix to a file name format that contains a variable to indicate the timestamp.

    As shown in the preceding figure, specify the variable in the ${Variable} format in the Object Name Prefix field. You can customize the variable name, for example, filename.

    Click the Properties tab in the right-side navigation pane. In the Properties pane that appears, assign a value to the variable filename in the Arguments field in the General section, for example, filename=$[yyyymmddhh24mi]. For more information, see Scheduling parameters.

    The value $[yyyymmddhh24mi] indicates that the timestamp is accurate to the minute. For example, 201911062315, 202005250843, and 201912012207 represent 23:15 on November 6, 2019, 08:43 on May 25, 2020, and 22:07 on December 1, 2019, respectively.

  6. In the Properties pane, set Instance Recurrence in the Schedule section.
    Set Instance Recurrence to Minute. Then, set Start From, Interval, and End At as required.
    Notice: The value of Interval must be the same as the interval at which files are generated in OSS. For example, if a new file is generated in OSS every 15 minutes, you must set Interval to 15 minutes.
  7. Commit and deploy the batch sync node. For more information, see Deploy a node.
  8. After you deploy the batch sync node, click Operation Center in the upper-right corner. On the page that appears, choose Cycle Task Maintenance > Cycle Task or Cycle Task Maintenance > Cycle Instance and check whether the generated node or instance meets your needs. For more information, see Auto triggered nodes.
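To make the variable substitution in step 5 concrete, the following Python sketch emulates locally how an Object Name Prefix that contains ${filename} could resolve when filename=$[yyyymmddhh24mi]. It is not DataWorks code. It assumes that the timestamp comes from the time at which the instance is scheduled to run, that multiple assignments in the Arguments field are separated by spaces, and that the prefix template is iot_log_${filename}.csv. The helper names render_time_expr and resolve_prefix are hypothetical.

```python
import re
from datetime import datetime

# Format tokens used in this topic, mapped to strftime directives.
_TOKENS = [("yyyy", "%Y"), ("hh24", "%H"), ("mm", "%m"), ("dd", "%d"), ("mi", "%M")]


def render_time_expr(expr, scheduled_time):
    """Expand an expression such as $[yyyymmddhh24mi] for a given time."""
    match = re.fullmatch(r"\$\[(.+)\]", expr)
    if match is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    pattern = match.group(1)
    parts = []
    i = 0
    while i < len(pattern):
        for token, directive in _TOKENS:
            if pattern.startswith(token, i):
                parts.append(directive)
                i += len(token)
                break
        else:
            # Keep any other character as a literal; escape % for strftime.
            parts.append("%%" if pattern[i] == "%" else pattern[i])
            i += 1
    return scheduled_time.strftime("".join(parts))


def resolve_prefix(prefix_template, arguments, scheduled_time):
    """Replace ${name} placeholders by using name=expression assignments."""
    values = {}
    for assignment in arguments.split():
        name, _, expr = assignment.partition("=")
        values[name] = render_time_expr(expr, scheduled_time)
    return re.sub(r"\$\{(\w+)\}", lambda m: values[m.group(1)], prefix_template)


resolved = resolve_prefix(
    "iot_log_${filename}.csv",      # assumed Object Name Prefix
    "filename=$[yyyymmddhh24mi]",   # value of the Arguments field
    datetime(2019, 11, 6, 23, 15),  # assumed scheduled time of the instance
)
print(resolved)
```

Running a check like this against the file names that actually exist in OSS is a quick way to confirm that the variable format and the Interval setting line up before you deploy the node.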