This topic describes how to configure network settings so that the serverless Spark engine of Data Lake Analytics (DLA) can access data sources in your virtual private cloud (VPC). The data sources include ApsaraDB RDS, AnalyticDB, PolarDB, ApsaraDB for MongoDB, Elasticsearch, ApsaraDB for HBase, E-MapReduce, Message Queue for Apache Kafka, and self-managed data services hosted on Elastic Compute Service (ECS) instances.

Background information

The driver and executors of the serverless Spark engine run in a security container. You can attach an elastic network interface (ENI) of your VPC to the security container. The security container then runs in your VPC in much the same way as an ECS instance does. The lifecycle of an ENI is the same as that of the Spark process that uses it. After a job is completed, all of its ENIs are released.

To attach an ENI of your VPC to the serverless Spark engine, you must specify the IDs of a security group and a vSwitch of your VPC in the configurations of the Spark job. If you have an ECS instance that can already access the destination data, you only need to specify the IDs of the security group and vSwitch that are associated with that ECS instance.

Note On the serverless Spark engine, the driver and each executor occupy one IP address of the specified vSwitch. Before you submit a job, make sure that the classless inter-domain routing (CIDR) block of the vSwitch has enough available IP addresses.
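
The IP address check in the preceding note can be estimated before you submit a job. The following Python sketch counts one IP address for the driver and one for each executor, and compares that number with the size of the vSwitch CIDR block. The function names, the `ips_in_use` argument, and the default of 4 reserved addresses per vSwitch are assumptions made for illustration and are not part of DLA. Check the actual number of available IP addresses of your vSwitch in the console.

```python
import ipaddress


def ips_needed(executor_instances):
    """Return the IP addresses one job occupies: 1 driver + N executors."""
    if executor_instances < 0:
        raise ValueError("executor_instances must not be negative")
    return 1 + executor_instances


def usable_ips(vswitch_cidr, reserved=4):
    """Return the number of assignable addresses in a vSwitch CIDR block.

    `reserved` is an assumption: the count of addresses the platform keeps
    for itself in each vSwitch. Adjust it to match your environment.
    """
    network = ipaddress.ip_network(vswitch_cidr, strict=True)
    return max(network.num_addresses - reserved, 0)


def has_enough_ips(vswitch_cidr, executor_instances, ips_in_use=0, reserved=4):
    """Return True if the vSwitch can hold the driver and all executors.

    `ips_in_use` is the number of addresses already taken by other
    resources in the same vSwitch (a value you supply, hypothetical here).
    """
    free = usable_ips(vswitch_cidr, reserved) - ips_in_use
    return free >= ips_needed(executor_instances)


if __name__ == "__main__":
    # Example values only: a /24 vSwitch and a job with 100 executors.
    print(has_enough_ips("192.168.0.0/24", 100, ips_in_use=50))
```

If the check fails, select a vSwitch with a larger CIDR block or reduce the value of spark.executor.instances.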

Usage notes

If you use DLA to access data from the following Alibaba Cloud services, the configurations described in this topic are not required: Object Storage Service (OSS), MaxCompute, Tablestore, and Log Service (SLS). To access data from these services, you must configure an AccessKey pair.

Procedure

  1. Obtain the IDs of the required vSwitch and security group.
    You can use one of the following methods to obtain the IDs of the vSwitch and security group. If your ECS instance can already access the destination data source over your VPC, we recommend that you use Method 1 to obtain the IDs of the security group and vSwitch of that ECS instance. If your ECS instance cannot access the destination data source, use Method 2 to obtain the IDs from the basic information of the destination data source. You can also use Method 3 to create a security group and vSwitch.
    • Method 1: Use the IDs of the security group and vSwitch of an ECS instance
      1. Log on to the ECS console and find the required ECS instance in the ECS instance list.
      2. On the Instance Details page of the ECS instance, view the IDs of the security group and vSwitch, as shown in the following figure.
         Figure: IDs of security group and vSwitch
    • Method 2: Use the IDs of the existing security group and vSwitch of the destination data source
      You can obtain the IDs of the security group and vSwitch from the basic information of the destination data source. The following figure shows the basic information of an E-MapReduce cluster.
      Figure: Cluster Overview
      If the basic information of the destination data source does not include the security group information, you can log on to the Virtual Private Cloud console and select a security group in the VPC where the destination data source resides.
      Figure: Security group ID
    • Method 3: Create a security group and vSwitch for the VPC that you want to access
      1. Create a security group and vSwitch for the VPC that you want to access. For more information, see Create a vSwitch and Create a security group.
      2. Add an outbound rule to the security group that you created in the preceding step. This rule allows access to the destination data source.

        To add the rule, log on to the ECS console. In the left-side navigation pane, choose Network & Security > Security Groups. On the Security Groups page, find the security group and add the outbound rule. For more information, see Add security group rules.

  2. Add the CIDR block of the vSwitch to a whitelist of the destination data source.
    • If the destination data source is an Alibaba Cloud instance, such as an ApsaraDB RDS or ApsaraDB for MongoDB instance, log on to the console of the service to configure a whitelist. You can add the CIDR block of the vSwitch and the security group ID to the whitelist. The following figure shows the whitelists configured for an ApsaraDB RDS instance.
      Figure: Whitelists and security groups
    • If the destination data source is a self-managed service hosted on an ECS instance, you must add an inbound rule to the security group that is associated with the ECS instance. This rule allows access to the destination data source from the security group or from the CIDR block of the vSwitch that you obtained in Step 1. For more information, see Add security group rules.
    Note If your security group is an advanced security group, instances in the security group cannot access each other. In this case, you must add the CIDR block of the selected vSwitch to both the inbound and outbound rules of the security group.
  3. Submit a Spark job.
    Write a spark-submit script for the serverless Spark engine and include the ENI settings in the conf section, as in the following example. For more information, see Create and run Spark jobs.
    {
        "name": "SparkPi",
        "file": "local:///tmp/spark-examples.jar",
        "className": "org.apache.spark.examples.DriverSubmissio*****",
        "args": [
           "100000"
        ],
        "conf": {
            "spark.driver.resourceSpec": "small",
            "spark.executor.resourceSpec": "medium",
            "spark.executor.instances": 1,
            "spark.dla.eni.enable": "true",
            "spark.dla.eni.vswitch.id": "vsw-bp17jqw3lrrobn6y*****",
            "spark.dla.eni.security.group.id": "sg-bp163uxgt4zandx*****"
        }
    }
    Note
    • Set spark.dla.eni.enable to true to attach an ENI of your VPC to the serverless Spark engine. The engine can then access your VPC.
    • Set spark.dla.eni.vswitch.id to the vSwitch ID and spark.dla.eni.security.group.id to the security group ID that you obtained in Step 1.
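
The job configuration in Step 3 can also be generated by a script, which helps avoid JSON syntax errors such as trailing commas. The following Python sketch builds the same structure as the example above. The `build_eni_job_config` function is hypothetical and not part of any DLA SDK, and the `vsw-` and `sg-` prefix checks are assumptions based only on the sample IDs in this topic. The IDs in the usage example are placeholders.

```python
import json


def build_eni_job_config(name, file, class_name, args,
                         vswitch_id, security_group_id,
                         executor_instances=1,
                         driver_spec="small", executor_spec="medium"):
    """Build a Spark job configuration dict with the ENI settings enabled."""
    # Assumption: IDs use the same prefixes as the sample values in this topic.
    if not vswitch_id.startswith("vsw-"):
        raise ValueError("vswitch_id is expected to start with 'vsw-'")
    if not security_group_id.startswith("sg-"):
        raise ValueError("security_group_id is expected to start with 'sg-'")
    return {
        "name": name,
        "file": file,
        "className": class_name,
        "args": list(args),
        "conf": {
            "spark.driver.resourceSpec": driver_spec,
            "spark.executor.resourceSpec": executor_spec,
            "spark.executor.instances": executor_instances,
            # Attach an ENI of your VPC to the driver and executors.
            "spark.dla.eni.enable": "true",
            "spark.dla.eni.vswitch.id": vswitch_id,
            "spark.dla.eni.security.group.id": security_group_id,
        },
    }


if __name__ == "__main__":
    # Placeholder values: replace them with your own JAR, class, and IDs.
    config = build_eni_job_config(
        name="SparkPi",
        file="local:///tmp/spark-examples.jar",
        class_name="com.example.YourMainClass",
        args=["100000"],
        vswitch_id="vsw-your-vswitch-id",
        security_group_id="sg-your-security-group-id",
    )
    # json.dumps always emits valid JSON.
    print(json.dumps(config, indent=4))
```

Paste the printed JSON into the job editor, or pass it to the tool that you use to submit Spark jobs.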