AnalyticDB for MySQL allows you to import Hadoop data by using Data Lake Analytics (DLA). This topic describes the procedure.


1. Configure the network environment
DLA is deployed in the same virtual private cloud (VPC) as either the Hadoop cluster or the AnalyticDB for MySQL cluster. You must establish a connection between the VPCs of the Hadoop and AnalyticDB for MySQL clusters so that the clusters can access each other.
Note To test the connectivity, you can access the Elastic Compute Service (ECS) instance at one end from the ECS instance at the other end. For example, you can run the telnet command. Regardless of whether a connection is established, add the CIDR block of the vSwitch of the Hadoop cluster to the whitelist of the AnalyticDB for MySQL cluster.
  1. Connect the VPCs for the Hadoop and AnalyticDB MySQL clusters.

    For more information, see Establish IPsec-VPN connections between two VPCs or submit a ticket to VPC engineers.

  2. View the vSwitch ID and security group ID of the Hadoop cluster.

    Log on to the E-MapReduce console and view the basic information of the cluster. View the vSwitch ID and security group ID of the Hadoop cluster on the Cluster Overview page.
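The connectivity test mentioned in the note above can also be scripted. The following sketch is an illustrative helper (not part of DLA or E-MapReduce) that checks whether a TCP connection can be opened between the two ends, which is what a telnet test verifies:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This mirrors a `telnet host port` connectivity check.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check whether the HDFS NameNode port of the Hadoop cluster
# is reachable. The IP address below is a placeholder.
# print(can_connect("172.123.XX.XX", 9000))
```

If the check fails, verify the VPC connection, the whitelist, and the security group rules before you submit the DLA job.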

2. Create a YAML configuration file
Transmission mappings are configured in a YAML file. The following sections describe the information that is required in the YAML file and how to obtain it. Each job transmits data from a single Parquet file in the Hadoop cluster to a single AnalyticDB for MySQL table.

YAML configuration file template:
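A minimal sketch of the template, assuming the keys shown in the parameter list and the sample file later in this topic (all values are placeholders):

```yaml
column_mapping: <source_field|destination_field,...>
username: <database account of the AnalyticDB for MySQL cluster>
password: <password of the database account>
host: <endpoint of the AnalyticDB for MySQL cluster>
port: <port number>
database: <database name>
table_name: <table name>
yarn_queue: <YARN queue name of the Hadoop cluster>
fs.defaultFS: hdfs://<master_ip>:9000
data_path: </path/to/parquet/file>
job_name: <job name>
```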


Parameters in the template:

  • Parameters for the destination AnalyticDB MySQL cluster
    • column_mapping: the mapping of column names.

      The mapping between the schema fields of the Parquet file and the fields of the destination AnalyticDB for MySQL table. Join each source field and destination field with a vertical bar (|). Separate multiple mapping pairs with commas (,).


      For example, assume that the schema of the Parquet file contains three fields: a, b, and c, and they are mapped to three fields of the destination AnalyticDB for MySQL table: a1, b1, and c1. In this case, the fields are mapped in the following format:

          column_mapping: a|a1,b|b1,c|c1

    • username: the username that is used to log on to the AnalyticDB for MySQL database.
    • password: the password that is used to log on to the AnalyticDB for MySQL database.
    • port: the port number that is used to log on to the AnalyticDB for MySQL database.
    • host: the endpoint of the AnalyticDB for MySQL cluster.
    • database: the name of the database.
    • table_name: the name of the table.
  • Parameters for the source Hadoop cluster
    • yarn_queue: the name of the YARN queue of the Hadoop cluster.
    • fs.defaultFS: the HDFS configuration from HADOOP_CONF_DIR. DLA cannot automatically obtain the configurations in the HADOOP_CONF_DIR directory of the Hadoop cluster. Therefore, you must specify the configurations so that DLA can read the parameters related to the Hadoop cluster. Specify different configurations for high-availability and non-high-availability nodes:
      • Non-high-availability nodes: Provide the absolute path in the hdfs://<master_ip>:9000/path/to/file format.

        Perform the following steps to obtain master_ip:

        1. Log on to the E-MapReduce console and click Cluster Management in the navigation bar.
        2. Obtain the IP address of the master role.
      • High-availability nodes: the HADOOP_CONF_DIR configurations are required. For more information, see Hadoop.
    • data_path: the path of the Parquet file in the Hadoop cluster.
    • job_name: the name of the job.

Sample YAML configuration file:

    column_mapping: a|b,c|d,e|f
    username: xiaoer
    password: password
    port: 3306
    database: test_db
    table_name: test_table
    yarn_queue: default
    fs.defaultFS: hdfs://172.123.XX.XX:9000
    data_path: /tmp/parquet
    job_name: testParquetToADB
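As an illustration of how the column_mapping value is interpreted, the following Python sketch (illustrative only, not part of the DLA transmission service) splits a mapping such as a|b,c|d,e|f into source-to-destination pairs:

```python
def parse_column_mapping(mapping: str) -> dict:
    """Parse a column_mapping value such as "a|b,c|d,e|f".

    Each comma-separated pair joins a Parquet schema field and the
    destination AnalyticDB for MySQL column with a vertical bar.
    """
    result = {}
    for pair in mapping.split(","):
        source, _, destination = pair.partition("|")
        if not source or not destination:
            raise ValueError(f"malformed mapping pair: {pair!r}")
        result[source.strip()] = destination.strip()
    return result

print(parse_column_mapping("a|b,c|d,e|f"))
# {'a': 'b', 'c': 'd', 'e': 'f'}
```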
3. Upload the YAML configuration file
After you create a YAML configuration file, upload it to Object Storage Service (OSS).


The transmission service runs on DLA. Therefore, you must pull the Marmaray JAR package and load the transmission mapping configurations. Place the transmission mapping configurations and the Marmaray JAR package in an OSS directory, and specify the OSS paths in the DLA job parameters.
  1. Submit a ticket to obtain the Marmaray JAR package and upload it to OSS.
  2. Log on to the Data Lake Analytics console. In the left-side navigation pane, choose Serverless Spark > Submit job to go to the Parameter Configuration page.
  3. Click Create a Job Template. In the Create a Job Template dialog box, set File Name, Data Format, Parent, and Job Type, and then click OK.
  4. Edit the job. The created job is displayed in the Job List hierarchy tree.

    Configuration details

    DLA jobs are configured as JSON files. Sample job template:

        "args": [
            "./<YAML configuration file name>             
        "name": "<Job name>",
        "className": "",
        "conf": {
            "spark.driver.resourceSpec": "<Specifications {small, medium, large, xlarge}>",
            "spark.executor.resourceSpec": "<Specifications {small, medium, large, xlarge}>",
            "spark.executor.instances": "<Number of instances>",
            "spark.dla.eni.enable": "<Whether to enable ENI{true}>",
            "": "{vSwitch ID of the Hadoop cluster}",
            "": "<Security group ID of the Hadoop cluster>",
        "file": "oss://${oss-buck-name}/<Marmaray JAR package path>",
        "files": "oss://${oss-buck-name}/<YAML configuration file path>"

    Parameters in the template:

    • name: the name of the job. You can use it to track job information.
    • spark.driver.resourceSpec: the resource specifications of the Spark driver. This parameter corresponds to the spark.driver.cores and spark.driver.memory parameters defined by the Apache Spark community. Valid values:
      • small: 1 CPU core and 4 GB of memory
      • medium: 2 CPU cores and 8 GB of memory
      • large: 4 CPU cores and 16 GB of memory
      • xlarge: 8 CPU cores and 32 GB of memory
    • spark.executor.resourceSpec: the resource specifications of the Spark executor. The valid values are the same as those of spark.driver.resourceSpec. This parameter corresponds to the spark.executor.cores and spark.executor.memory parameters defined by the Apache Spark community.
    • spark.executor.instances: the number of executors.
    • spark.dla.eni.enable: specifies whether to enable the elastic network interface (ENI) feature. Set this parameter to true because the VPCs of the Hadoop and AnalyticDB for MySQL clusters are connected.
    • spark.dla.eni.vswitch.id: the vSwitch ID of the Hadoop cluster.
    • spark.dla.eni.security.group.id: the security group ID of the Hadoop cluster.
    • file: the path of the Marmaray JAR package.
    • files: the path of the YAML configuration file.
  5. After you configure the job parameters, click Execute to run the job.
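The job configuration described above can also be assembled programmatically. The following Python sketch is an illustration only: the spark.dla.eni.* keys are reconstructed from the parameter list above, and all IDs and paths are placeholder assumptions, so verify them against your DLA console before use:

```python
import json

def build_dla_job(name: str, yaml_file: str, jar_path: str, yaml_path: str,
                  vswitch_id: str, security_group_id: str,
                  instances: int = 2, spec: str = "medium") -> str:
    """Assemble a DLA Serverless Spark job description as a JSON string."""
    job = {
        "args": [f"./{yaml_file}"],
        "name": name,
        "className": "",
        "conf": {
            "spark.driver.resourceSpec": spec,
            "spark.executor.resourceSpec": spec,
            "spark.executor.instances": str(instances),
            "spark.dla.eni.enable": "true",
            "spark.dla.eni.vswitch.id": vswitch_id,
            "spark.dla.eni.security.group.id": security_group_id,
        },
        "file": f"oss://{jar_path}",
        "files": f"oss://{yaml_path}",
    }
    return json.dumps(job, indent=4)

# Example with placeholder bucket, vSwitch, and security group IDs:
print(build_dla_job("testParquetToADB", "job.yaml",
                    "my-bucket/marmaray.jar", "my-bucket/job.yaml",
                    "vsw-xxxxxxxx", "sg-xxxxxxxx"))
```

Serializing the job with json.dumps guarantees that the submitted configuration is well-formed JSON, which avoids the most common submission errors (missing braces or unterminated strings).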