Alibaba Cloud provides a wide array of cloud storage and database services. To search and analyze the data stored in these services, you can use Data Integration to collect the data and synchronize it to Alibaba Cloud Elasticsearch.

For more information about Data Integration, see the Data Integration documentation.

Supported data sources

  • Alibaba Cloud databases such as ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, ApsaraDB RDS for SQL Server, ApsaraDB RDS for PPAS, ApsaraDB for MongoDB, and ApsaraDB for HBase
  • Alibaba Cloud DRDS
  • Alibaba Cloud MaxCompute
  • Alibaba Cloud OSS
  • Alibaba Cloud Tablestore
  • User-created data stores such as HDFS, Oracle, FTP, DB2, MySQL, PostgreSQL, SQL Server, PPAS, MongoDB, and HBase
Notice Data synchronization may incur Internet traffic fees.

Overview

Follow these steps to import data in offline mode:

  • Purchase an ECS instance that resides in the same VPC as your Elasticsearch cluster. This ECS instance collects data from your data sources and writes the data to your Elasticsearch cluster. Data Integration issues the tasks that perform these operations.
  • Activate Data Integration, and add the ECS instance to Data Integration as a resource to run data synchronization tasks.
  • Create a data synchronization script and schedule it to run periodically.

Preparations

  • Create an Alibaba Cloud Elasticsearch cluster.
  • Enable the Auto Indexing feature for the Elasticsearch cluster.

    For more information, see Procedure. A quick way to verify this setting from Kibana is shown at the end of this section.

  • Purchase an ECS instance that resides in the same VPC as the Elasticsearch cluster, and assign a public IP address to the ECS instance or activate Elastic IP Address.

    You can also use an existing ECS instance. For more information about how to purchase an ECS instance, see Create an ECS instance.

    Notice
    • We recommend that you choose CentOS 6, CentOS 7, or Aliyun Linux for your ECS instance.
    • If you want the ECS instance to execute MaxCompute tasks or data synchronization tasks, make sure that Python 2.6 or 2.7 is installed on the instance. CentOS 5 ships with Python 2.4 by default, whereas the other recommended operating systems include Python 2.6 or later.
    • Make sure that the ECS instance has a public IP address.
  • Activate Data Integration.

    For more information, see Create a workspace.

  • Prepare the data source from which you need to import data. For more information about supported data sources, see Supported data sources.
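
After you complete these preparations, you can verify from the Kibana console that Auto Indexing is in effect. The following is a quick sanity check only, assuming that your Kibana account is allowed to read cluster settings; it reads the effective value of the action.auto_create_index setting, which Alibaba Cloud Elasticsearch changes when you enable Auto Indexing:

    GET _cluster/settings?include_defaults=true&filter_path=**.action.auto_create_index

If the returned value is "true", the destination index can be created automatically during the first write.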

Configure scheduling resources

  1. Log on to the DataWorks console. In the top navigation bar, click Workspaces. Then, in the top navigation bar of the page that appears, select the region where your workspace resides. Find your workspace and click Data Integration in the Actions column.
    If you have activated Data Integration or DataWorks, the Data Integration page is displayed.

    If you have not activated Data Integration or DataWorks, activate it as prompted. Activation incurs fees; read and understand the pricing displayed on-screen before you proceed.

  2. In the left-side navigation pane of the Data Integration console, click Custom Resource Group.
  3. In the upper-right corner of the Custom Resource Groups page, click Add Resource Group to configure the ECS instance that you have created in the VPC as a scheduling resource.
    Notice The Add Resource Group feature is only available in the DataWorks Professional Edition.

Configure a data source

  1. In the left-side navigation pane of the Data Integration console, click Connection. Then, in the left-side navigation pane of the page that appears, click Data Source.
  2. In the upper-right corner of the Data Source page, click Add Connection.
  3. In the Add Connection dialog box, select a data source. In the dialog box that appears, specify the required parameters.

    For more information, see Data source configuration.

  4. Click Test Connection.

    For more information about the data sources that support connectivity tests and for answers to frequently asked questions about these tests, see Test data store connectivity.

  5. After the connectivity test is successful, click Complete.

Create and execute a data synchronization script

  1. In the left-side navigation pane of the Data Integration console, click Home Page.
  2. Click New Task.
  3. In the Create Node dialog box that appears, specify the required parameters.
    • Node Type: Select Batch Synchronization.
    • Node Name: Enter a node name.
    • Location: Choose Business Flow > Data Integration. If you have not created a business flow, create one first. For more information, see Workflow.
  4. Click Commit.
  5. On the page that appears, configure the data synchronization task.

    You can use the wizard or script mode to configure the task and specify a database as the reader and Alibaba Cloud Elasticsearch as the writer. For more information, see Create a sync node by using the codeless UI and Create a sync node by using the code editor.

    This section uses the script mode as an example. The script is as follows:

    {
        "type": "job",
        "steps": [
            {
                "stepType": "odps",
                "parameter": {
                    "partition": [],
                    "datasource": "odps_first",
                    "column": [],
                    "emptyAsNull": false,
                    "table": ""
                },
                "name": "Reader",
                "category": "reader"
            },
            {
                "stepType": "elasticsearch",
                "parameter": {
                    "accessId": "",
                    "endpoint": "http://es-cn-xxxxx.elasticsearch.aliyuncs.com:9200",
                    "indexType": "",
                    "accessKey": "",
                    "cleanup": true,
                    "discovery": false,
                    "column": [
                        {
                            "name": "",
                            "type": ""
                        }
                    ],
                    "index": "",
                    "batchSize": 1000,
                    "splitter": ","
                },
                "name": "Writer",
                "category": "writer"
            }
        ],
        "version": "2.0",
        "order": {
            "hops": [
                {
                    "from": "Reader",
                    "to": "Writer"
                }
            ]
        },
        "setting": {
            "errorLimit": {
                "record": ""
            },
            "speed": {
                "concurrent": 2,
                "throttle": false
            }
        }
    }

    For more information about Elasticsearch configuration parameters, see Configure Elasticsearch Writer.
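
    For reference, a populated Writer step might look like the following sketch. The index name, column names, and field types here are hypothetical placeholders; replace them with values that match the columns emitted by your reader:

    {
        "stepType": "elasticsearch",
        "parameter": {
            "accessId": "<your_elasticsearch_username>",
            "accessKey": "<your_elasticsearch_password>",
            "endpoint": "http://es-cn-xxxxx.elasticsearch.aliyuncs.com:9200",
            "index": "customer",
            "indexType": "_doc",
            "cleanup": true,
            "discovery": false,
            "batchSize": 1000,
            "splitter": ",",
            "column": [
                {"name": "id", "type": "id"},
                {"name": "name", "type": "text"},
                {"name": "age", "type": "integer"}
            ]
        },
        "name": "Writer",
        "category": "writer"
    }

    The id column type maps that column to the document _id; the other types are regular Elasticsearch field types. Also note the cleanup parameter: when it is set to true, the index is deleted and re-created before data is written, so set it to false if you want to append to an existing index.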

  6. After the script is configured and saved, click the Properties tab in the right-side navigation pane. In the Properties pane that appears, specify the required parameters.

    For more information, see Scheduling configuration.

    Notice
    • Before submitting a task, you must set Parent Nodes in the Dependencies section of the Properties pane. For more information, see Dependencies.
    • If you want to schedule the task periodically, you must configure time properties in the Schedule section, such as the execution time, scheduling period, and effective period of the task.
    • The configuration of a periodic task takes effect at 00:00 of the next day.
  7. Click the Submit icon to submit the task.
  8. In the upper-right corner, click Operation Center. In the left-side navigation pane, choose Cycle Task Maintenance > Cycle Task. On the page that appears, find the task you submitted and click Change next to Scheduling Resource Group. Then, change the default scheduling resource group to the scheduling resource group you configured.

Verify data synchronization results

  1. Log on to the Kibana console of your Elasticsearch cluster.
  2. In the left-side navigation pane, click Dev Tools.
  3. On the Console tab of the Dev Tools page, run the following command to query synchronized data:
    GET /<your_index_name>/_search

    Replace <your_index_name> with the name of the Elasticsearch index that you configured in the data synchronization script.
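
    To check a specific field, you can run a match query. In the following example, name is a hypothetical field; substitute a column from your own mapping:

    GET /<your_index_name>/_search
    {
      "query": {
        "match": {
          "name": "John"
        }
      }
    }

    You can also run GET /<your_index_name>/_count to return only the number of synchronized documents, which is a quick way to compare against the row count of the source table.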