AnalyticDB for MySQL allows you to import Hadoop data by using Data Lake Analytics (DLA). This topic describes the procedure to import Hadoop data. For more information, see Feature description.

Prerequisites

1. Configure the network environment
DLA can be deployed in the same VPC as the Hadoop cluster or the AnalyticDB for MySQL cluster. You must first establish a connection between the VPCs of the Hadoop and AnalyticDB for MySQL clusters so that the two clusters can access each other.
Note To test connectivity, you can access the ECS instance at one end from the ECS instance at the other end, for example by running the telnet command. Regardless of whether a connection is established, add the CIDR block of the vSwitch for the Hadoop cluster to the whitelist of the AnalyticDB for MySQL cluster.
  1. Connect the VPCs for the Hadoop and AnalyticDB for MySQL clusters.

    For more information, see Establish IPsec-VPN connections between two VPCs or submit a ticket to VPC engineers.

  2. View the vSwitch ID and security group ID of the Hadoop cluster.

    Log on to the E-MapReduce console. On the Cluster Overview page, view the vSwitch ID and security group ID of the Hadoop cluster.

2. Create a YAML configuration file
Transmission mappings are configured in a YAML file. The following sections describe the information that is required in the YAML file and how to obtain it. Data can be transmitted from a single Hadoop Parquet file to a single AnalyticDB for MySQL table.

The YAML configuration file template:

marmaray:
  adb:
    column_mapping:   
    username:         
    password:
    port:            
    host:            
    database:       
    tablename:         
  Hadoop:
    yarn_queue:     
    fs.defaultFS:    
  hive:
    data_path:         
    job_name:       

Parameters in the template:

  • Parameters for the destination AnalyticDB for MySQL cluster
    • column_mapping: the mapping of column names.

      The mapping between the schema fields of the Parquet file and the fields of the destination AnalyticDB for MySQL table. Each source field is separated from its destination field by a vertical bar (|). Separate multiple mapping pairs with commas (,).

      Example:

      The schema of the Parquet file contains three fields: a, b, and c. They are mapped to three fields of the destination AnalyticDB for MySQL table: a1, b1, and c1. The mapping of column names:

      a|a1,b|b1,c|c1
    • username: the username that is used to log on to the AnalyticDB for MySQL database.
    • password: the password that is used to log on to the AnalyticDB for MySQL database.
    • port: the port number that is used to log on to the AnalyticDB for MySQL database.
    • host: the endpoint of the AnalyticDB for MySQL cluster.
    • database: the name of the database.
    • tablename: the name of the table.
  • Parameters for the source Hadoop cluster
    • yarn_queue: the name of the YARN queue of the Hadoop cluster.
    • fs.defaultFS: the HADOOP_CONF_DIR configuration. DLA cannot automatically obtain the configurations in the HADOOP_CONF_DIR directory of the Hadoop cluster, so you must specify the configurations that are required to read data from the Hadoop cluster. Specify different configurations for high-availability nodes and other nodes:
      • Non-high-availability nodes: Provide the absolute path in the hdfs://<master_ip>:9000/path/to/file format.

        The method to obtain master_ip:

        1. Log on to the E-MapReduce console and click Cluster Management in the navigation bar.
        2. Obtain the IP address of the master role.
      • High-availability nodes: HADOOP_CONF_DIR configurations are required. For more information, see Hadoop. A hypothetical sketch for a high-availability cluster is shown after the sample YAML configuration file below.
    • data_path: the path of the Parquet file in the Hadoop cluster.
    • job_name: the name of the job.

A sample YAML configuration file:

marmaray:
  adb:
    column_mapping: a|b,c|d,e|f
    username: xiaoer
    possward: password
    port: 3306
    host: am-xxxx.aliyuncs.com
    database: test_db
    tablename: test_table
  Hadoop:
    yarn_queue: default
    fs.defaultFS: hdfs://172.123.123.123:9000
  hive:
    data_path: /tmp/parquet
    job_name: testParquetToADB
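
For a high-availability Hadoop cluster, the Hadoop section might look like the following sketch. This is only an assumption for illustration: the nameservice name, the NameNode addresses, and the idea that the standard HDFS high-availability client properties from hdfs-site.xml can be listed alongside fs.defaultFS are not confirmed by this topic. Use the values from the hdfs-site.xml file of your cluster and see Hadoop for the authoritative configuration.

  Hadoop:
    # Hypothetical HA sketch. Replace all values with those from your hdfs-site.xml file.
    fs.defaultFS: hdfs://emr-cluster
    dfs.nameservices: emr-cluster
    dfs.ha.namenodes.emr-cluster: nn1,nn2
    dfs.namenode.rpc-address.emr-cluster.nn1: 172.123.123.123:8020
    dfs.namenode.rpc-address.emr-cluster.nn2: 172.123.123.124:8020
    dfs.client.failover.proxy.provider.emr-cluster: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
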
3. Upload the YAML configuration file
After you create a YAML configuration file, upload it to OSS.

Procedure

The transmission service runs on DLA, so you must obtain the Marmaray JAR package and load the transmission mapping configurations. Place the Marmaray JAR package and the YAML configuration file that contains the transmission mappings in an OSS directory, and specify their OSS paths in the DLA job parameters.
  1. Submit a ticket to obtain the Marmaray JAR package and upload it to OSS.
  2. Log on to the Data Lake Analytics console. In the left-side navigation pane, choose Serverless Spark > Submit job to go to the Parameter Configuration page.
  3. Click Create Job. In the Create Job dialog box, set File Name, Data Format, and Parent, and then click OK.
  4. Edit the job. The created job appears in the Job List hierarchy tree.

    Configuration details

    DLA jobs are defined in JSON files. A sample job template:

    {
        "args": [
            "-c",
            "./<YAML configuration file name>             
        ],
        "name": "<Job name>",
        "className": "com.alibaba.analyticdb.marmaray.ParquetToADBJob",
        "conf": {
            "spark.driver.resourceSpec": "<Specifications {small, medium, large, xlarge}>",
            "spark.executor.resourceSpec": "<Specifications {small, medium, large, xlarge}>",
            "spark.executor.instances": "<Number of instances>",
            "spark.dla.eni.enable": "<Whether to enable ENI{true}>",
            "spark.dla.eni.vswitch.id": "<vSwitch ID of the Hadoop cluster>",
            "spark.dla.eni.security.group.id": "<Security group ID of the Hadoop cluster>",
        },
        "file": "oss://${oss-buck-name}/<Marmaray JAR package path>",
        "files": "oss://${oss-buck-name}/<YAML configuration file path>"
    }

    Parameters in the sample template:

    • name: the name of the job. You can use it to track job information.
    • spark.driver.resourceSpec: the resource specifications of the Spark driver. Valid values: small (1 CPU core and 4 GB of memory), medium (2 CPU cores and 8 GB of memory), large (4 CPU cores and 16 GB of memory), and xlarge (8 CPU cores and 32 GB of memory). Corresponding parameters in the Apache Spark community: spark.driver.cores and spark.driver.memory.
    • spark.executor.resourceSpec: the resource specifications of Spark executors, which are the same as those of spark.driver.resourceSpec. Corresponding parameters in the Apache Spark community: spark.executor.cores and spark.executor.memory.
    • spark.executor.instances: the number of executors.
    • spark.dla.eni.enable: specifies whether to enable ENI. Set the value to true to enable connections between VPCs.
    • spark.dla.eni.vswitch.id: the vSwitch ID of the Hadoop cluster.
    • spark.dla.eni.security.group.id: the security group ID of the Hadoop cluster.
    • file: the OSS path of the Marmaray JAR package.
    • files: the OSS path of the YAML configuration file.

    For a filled-in example, see the sample job configuration after this procedure.
  5. After you configure the job parameters, click Execute to run the job.
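
The following filled-in sample job configuration is consistent with the sample YAML configuration file in this topic. The YAML file name, OSS bucket name, JAR file path, vSwitch ID, and security group ID are placeholders for illustration; replace them with your own values.

{
    "args": [
        "-c",
        "./testParquetToADB.yaml"
    ],
    "name": "testParquetToADB",
    "className": "com.alibaba.analyticdb.marmaray.ParquetToADBJob",
    "conf": {
        "spark.driver.resourceSpec": "medium",
        "spark.executor.resourceSpec": "medium",
        "spark.executor.instances": "2",
        "spark.dla.eni.enable": "true",
        "spark.dla.eni.vswitch.id": "vsw-xxxxxxxxxxxxxxxx",
        "spark.dla.eni.security.group.id": "sg-xxxxxxxxxxxxxxxx"
    },
    "file": "oss://your-bucket/marmaray/marmaray.jar",
    "files": "oss://your-bucket/marmaray/testParquetToADB.yaml"
}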