This topic describes how to configure the network of data sources to allow the serverless Spark engine of Data Lake Analytics (DLA) to access data of the data sources in your VPC. The data sources include ApsaraDB RDS, AnalyticDB, PolarDB, ApsaraDB for MongoDB, Elasticsearch, ApsaraDB for HBase, E-MapReduce Hadoop, Kafka, and self-managed data services hosted on Elastic Compute Service (ECS) instances.

Usage notes

If you use DLA to access data in Object Storage Service (OSS), MaxCompute, or Tablestore, the configurations described in this topic are not required. To access data in these services, you must configure an AccessKey ID and AccessKey secret.

Principles

The driver and executors of the serverless Spark engine run in a security container. You can attach an elastic network interface (ENI) of your VPC to the security container. This way, the security container behaves like an ECS instance in the VPC and can access the network of your data sources. The lifecycle of an ENI is the same as that of the process to which it is attached on the serverless Spark engine. After a job is complete, all of its ENIs are released.

To attach an ENI of your VPC to the serverless Spark engine, you must configure the IDs of the security group and vSwitch of your VPC in the job configuration of the serverless Spark engine. If your ECS instance can access the destination data, you need only to configure the IDs of the security group and vSwitch of the ECS instance on the serverless Spark engine.

Note
On the serverless Spark engine, the driver and each executor run in their own container, and each of them occupies one IP address of the specified vSwitch. Before you submit a job, make sure that the CIDR block of the vSwitch has enough available IP addresses.
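The IP address requirement in the preceding note can be estimated before you submit a job. The following Python sketch is illustrative only: the function names are hypothetical, the number of addresses that the system reserves in each vSwitch (4) is an assumption that you should verify against the VPC documentation, and the number of addresses already in use must be read from the VPC console.

```python
import ipaddress

# Assumption: number of addresses the system reserves in each vSwitch.
# Verify this value against the VPC documentation before relying on it.
RESERVED_PER_VSWITCH = 4


def ips_required(executor_instances: int) -> int:
    """Return one IP address for the driver plus one for each executor."""
    return 1 + executor_instances


def has_enough_ips(vswitch_cidr: str, executor_instances: int, ips_in_use: int = 0) -> bool:
    """Check whether the vSwitch CIDR block can hold the driver and all executors.

    ips_in_use is the number of addresses already occupied by other
    resources in the vSwitch (a value you look up yourself).
    """
    network = ipaddress.ip_network(vswitch_cidr, strict=True)
    free = network.num_addresses - RESERVED_PER_VSWITCH - ips_in_use
    return free >= ips_required(executor_instances)


if __name__ == "__main__":
    # Example: a /24 vSwitch with 120 addresses in use and a job with 100 executors.
    print(has_enough_ips("192.168.1.0/24", executor_instances=100, ips_in_use=120))
```

If the check fails, you can reduce the number of executors or use a vSwitch that has a larger CIDR block.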

Obtain the IDs of the required vSwitch and security group

You can use one of the following methods to obtain the IDs of the required vSwitch and security group. If one of your ECS instances has accessed the destination data over your VPC, we recommend that you use Method 1 to obtain the IDs of the security group and vSwitch of that ECS instance. Otherwise, use Method 2 to obtain the IDs of the existing vSwitch and security group of the data source, or use Method 3 to create a security group and vSwitch.

Method 1: Use the IDs of the security group and vSwitch of an ECS instance

If one of your ECS instances has accessed data from a data source, such as ApsaraDB RDS, you can perform the following steps to obtain the IDs of the security group and vSwitch of the ECS instance:

  1. Log on to the ECS console. Enter the instance name in the search box to find the instance.
  2. On the Instance Details page of the ECS instance, view the vSwitch ID.
  3. On the security group page, view the security group ID.

Method 2: Use the IDs of the existing vSwitch and security group

  1. Obtain the vSwitch ID from the basic information page of the specific data source. For example, for ApsaraDB RDS, the vSwitch ID is displayed on the Basic Information tab.
  2. Obtain the security group ID from the basic information page of a specific data source.
  3. If the basic information page of a data source does not contain its security group information, log on to the VPC console and select a security group in the VPC related to the data source.

Method 3: Create a security group and vSwitch in the VPC that you want to access

  1. Create a vSwitch.

    For more information, see Create a vSwitch.

  2. Create a security group.

    For more information, see Create a security group.

  3. Configure the outbound rules of the security group that is created in Step 2.

    Log on to the ECS console. In the left-side navigation pane, choose Network & Security > Security Groups. On the Security Groups page, configure outbound rules to allow access to the destination data.

Add the CIDR block of the vSwitch to a whitelist

If the destination data is stored in an instance of an Alibaba Cloud service, such as ApsaraDB RDS or ApsaraDB for MongoDB, you can configure a whitelist in the console of the service. A whitelist can include CIDR blocks or security groups. To configure a whitelist of CIDR blocks, add the CIDR block of the vSwitch that you obtained or created in the preceding section to the whitelist. To configure a whitelist of security groups, add the security group that you obtained or created in the preceding section to the whitelist. ApsaraDB RDS is used in this example.

If the destination data is stored in a self-managed database hosted on an ECS instance, you must configure inbound rules for the security group of the ECS instance. The rules must allow access from the security group that is used by the Spark job or from the CIDR block of the vSwitch.

Note
  1. For more information about how to configure a security group, see Add security group rules.
  2. In most cases, access failures are caused by invalid configurations of the security group or whitelist. If a failure occurs, check the outbound rules of the security group for the Spark job, the inbound rules of the security group for the ECS instance that stores the destination data, and the whitelist of the data source instance.
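
When you troubleshoot a whitelist of CIDR blocks, it can help to confirm that the CIDR block of the vSwitch is covered by at least one whitelist entry. The following Python sketch shows one way to do this with the standard ipaddress module. The function name and the sample values are hypothetical, and the whitelist entries must be copied from the console of the data source.

```python
import ipaddress


def cidr_is_whitelisted(vswitch_cidr: str, whitelist_entries: list) -> bool:
    """Return True if a single whitelist entry fully covers the vSwitch CIDR block.

    Entries can be CIDR blocks or single IP addresses. This sketch does not
    detect the case where several smaller entries together cover the block.
    """
    vswitch = ipaddress.ip_network(vswitch_cidr)
    for entry in whitelist_entries:
        # strict=False lets a single IP address be treated as a /32 (or /128) network.
        allowed = ipaddress.ip_network(entry, strict=False)
        if vswitch.version == allowed.version and vswitch.subnet_of(allowed):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(cidr_is_whitelisted("192.168.1.0/24", ["10.0.0.0/8", "192.168.0.0/16"]))
```

A result of False means that at least some IP addresses that the driver or executors may obtain from the vSwitch are not covered by any single whitelist entry.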

Submit a job

Write a spark-submit script for the serverless Spark engine. For more information, see Create and run Spark jobs.
Note
  1. If the spark.dla.eni.enable parameter is set to true, the permissions to access the VPC are granted and you can attach an ENI of your VPC to the serverless Spark engine.
  2. The value of spark.dla.eni.vswitch.id is the vSwitch ID that is obtained in the Obtain the IDs of the required vSwitch and security group section.
  3. The value of spark.dla.eni.security.group.id is the security group ID that is obtained in the Obtain the IDs of the required vSwitch and security group section.
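
The three parameters in the preceding note can be assembled into a job configuration as sketched below in Python. Only the three spark.dla.eni.* parameter names come from this topic. The surrounding JSON layout (name, file, className, conf), the function name, and all sample values are assumptions for illustration. See Create and run Spark jobs for the authoritative job format.

```python
import json


def build_job_config(vswitch_id: str, security_group_id: str) -> dict:
    """Build a job configuration that attaches an ENI of your VPC to the job."""
    return {
        # Hypothetical job metadata; replace with the values of your own job.
        "name": "vpc-access-example",
        "file": "oss://your-bucket/path/to/your-job.jar",
        "className": "com.example.YourMainClass",
        "conf": {
            # Enable ENI attachment so that the job can access the VPC.
            "spark.dla.eni.enable": "true",
            # IDs obtained by using Method 1, Method 2, or Method 3.
            "spark.dla.eni.vswitch.id": vswitch_id,
            "spark.dla.eni.security.group.id": security_group_id,
        },
    }


if __name__ == "__main__":
    # Placeholder IDs; use the real vSwitch and security group IDs of your VPC.
    config = build_job_config("vsw-example", "sg-example")
    print(json.dumps(config, indent=4))
```

Keeping the vSwitch ID and security group ID as function arguments makes it easier to reuse the same template for jobs that access data sources in different VPCs.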