This topic describes how to access instances in an Alibaba Cloud Virtual Private Cloud (VPC) from Spark on MaxCompute.

Directly access instances in a VPC

You can access instances in an Alibaba Cloud VPC or custom private domain names from Spark on MaxCompute. Instances in an Alibaba Cloud VPC include Elastic Compute Service (ECS) instances, ApsaraDB for HBase instances, and ApsaraDB RDS instances.

When you access instances in a VPC from Spark on MaxCompute, add the spark.hadoop.odps.cupid.vpc.domain.list parameter to the spark-defaults.conf file or to the DataWorks configurations to specify one or more instances. The value of this parameter is a JSON string. Before you set this parameter, remove all spaces and line feeds from the JSON text so that the whole value is on one line.
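The merge step can be scripted instead of done by hand. The following Python sketch uses only the standard json module to compact formatted JSON text into one line without spaces or line feeds. The function name to_single_line is illustrative and is not part of MaxCompute. The sample domain and VPC ID are the masked placeholders used in this topic.

```python
import json


def to_single_line(pretty_json: str) -> str:
    """Return the JSON text on one line, with no spaces or line feeds between tokens."""
    # Parsing first also verifies that the text is valid JSON.
    parsed = json.loads(pretty_json)
    # separators=(",", ":") drops the spaces that json.dumps inserts by default.
    # Spaces inside string values, if any, are kept as they are.
    return json.dumps(parsed, separators=(",", ":"))


pretty = """
{
  "regionId": "cn-beijing",
  "vpcs": [
    {
      "vpcId": "vpc-2zeaeq21mb1dmkqh0****",
      "zones": [
        {"urls": [{"domain": "rm-2zem49k73c54z****.mysql.rds.aliyuncs.com", "port": 3306}]}
      ]
    }
  ]
}
"""

# Print the parameter in key=value form.
print("spark.hadoop.odps.cupid.vpc.domain.list=" + to_single_line(pretty))
```

Parsing the text before it is serialized again means that a malformed value fails early, before the job is submitted.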

The following examples show the configurations of the spark.hadoop.odps.cupid.vpc.domain.list parameter when you access different instances. You must use the actual values to replace the values of the regionId, vpcId, domain, and port parameters in the following examples. For information about the ID of each region, see Project operations.
Notice
  • You must add the classless inter-domain routing (CIDR) block 100.104.0.0/16 to the whitelist of the instance that you want to access.
  • If the regionId parameter is set to cn-shanghai or cn-beijing, you must set spark.hadoop.odps.cupid.smartnat.enable to true.
  • You can access only the services in one VPC in the current region from Spark on MaxCompute.
  • You do not need to set spark.hadoop.odps.cupid.eni.enable to true.
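The region rule in the notice can be expressed as a small helper. The following Python sketch assembles the parameters for the direct access method and adds spark.hadoop.odps.cupid.smartnat.enable only for the two regions named above. The helper name direct_access_conf and the constant SMARTNAT_REGIONS are hypothetical, and the sketch does not check the whitelist of the target instance.

```python
# Regions for which the notice above requires the smartnat parameter.
SMARTNAT_REGIONS = frozenset({"cn-shanghai", "cn-beijing"})


def direct_access_conf(region_id: str, domain_list_json: str) -> dict:
    """Collect the Spark parameters for the direct access method (hypothetical helper)."""
    conf = {"spark.hadoop.odps.cupid.vpc.domain.list": domain_list_json}
    if region_id in SMARTNAT_REGIONS:
        conf["spark.hadoop.odps.cupid.smartnat.enable"] = "true"
    # spark.hadoop.odps.cupid.eni.enable is deliberately left unset for this method.
    return conf


domain_list = (
    '{"regionId":"cn-beijing","vpcs":[{"zones":[{"urls":'
    '[{"domain":"cn-beijing-intranet.log.aliyuncs.com","port":80}]}]}]}'
)
for key, value in direct_access_conf("cn-beijing", domain_list).items():
    print(f"{key}={value}")
```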

Access instances in a VPC by using an ENI

Compared with the direct access method described in Directly access instances in a VPC, this access method provides higher stability and better performance. In addition, this access method supports Internet access.

When you use this access method, take note of the following points:
  • You can use this access method to access instances in one VPC. If your Spark job needs to access instances in multiple VPCs at the same time, connect the VPC that you access by using the ENI to the other VPCs.
  • For a Spark job that runs in a MaxCompute project, the user ID (UID) must be the same for the Alibaba Cloud account that owns the MaxCompute project and the Alibaba Cloud account that owns the VPC. Otherwise, the following error message appears: You are not allowed to use this vpc - vpc owner and project owner must be the same person.
To access instances in a VPC by using an ENI, perform the following steps:
  1. Provide the following information of the VPC that you want to access to the MaxCompute technical support team.
    • Region ID: the ID of the region in which the VPC is deployed, such as cn-beijing.
    • UID: the UID of the Alibaba Cloud account that owns the MaxCompute project. It is also the UID of the Alibaba Cloud account that owns the VPC that you want to access.
    • VPC ID: the ID of the VPC that you want to access by using an ENI.
    • vSwitch ID: the ID of the vSwitch that belongs to the VPC. You can go to the VPC console to query the ID. If multiple vSwitch IDs are displayed in the VPC console, select one from these IDs.
    • Security group ID: the ID of the security group that belongs to the VPC. To access instances in a VPC from Spark on MaxCompute, you must create a security group for the VPC to control access requests. For more information about how to create a security group, see Create a security group.
  2. Grant Spark on MaxCompute the permissions to create an ENI in the VPC.

    After the permissions are granted, you can create an ENI in the VPC to allow access from Spark on MaxCompute. To grant the permissions, you can use an Alibaba Cloud account to log on to the Resource Access Management (RAM) console and click Grants in the left-side navigation pane.

  3. Wait for the MaxCompute technical support team to enable the ENI.
  4. Add security group rules.

    After the ENI is enabled, you must add rules to the security group provided in Step 1. These rules specify the ports that you can use to access the instances from Spark on MaxCompute, such as ports 9200 and 31000.

    For example, if you want to access an ApsaraDB RDS instance, you must add such rules to the security group provided in Step 1. If such rules cannot be added to the security group and only the rules that specify IP addresses can be added, the rules that you want to add must include the CIDR block of the vSwitch provided in Step 1.

  5. Configure your Spark job.
    To run your Spark job, you must add the following configurations to access the instances in the VPC by using the ENI.
    spark.hadoop.odps.cupid.eni.enable = true
    spark.hadoop.odps.cupid.eni.info = <region ID>:<VPC ID>
    In this case, the following configurations are not required:
    spark.hadoop.odps.cupid.vpc.domain.list
    spark.hadoop.odps.cupid.smartnat.enable
    spark.hadoop.odps.cupid.pvtz.rolearn   # (Used to access a custom domain name)
    spark.hadoop.odps.cupid.vpc.usepvtz    # (Used to access a custom domain name)
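The ENI settings in Step 5 can be generated and checked with a short script. The following Python sketch assumes that the value of spark.hadoop.odps.cupid.eni.info is the region ID and the VPC ID joined by a colon, as the placeholder in Step 5 suggests. The helper names eni_conf and unneeded_keys are hypothetical, and the VPC ID is the masked placeholder used in this topic.

```python
# Parameters that the ENI access method does not require, as listed in Step 5.
NOT_REQUIRED_WITH_ENI = (
    "spark.hadoop.odps.cupid.vpc.domain.list",
    "spark.hadoop.odps.cupid.smartnat.enable",
    "spark.hadoop.odps.cupid.pvtz.rolearn",
    "spark.hadoop.odps.cupid.vpc.usepvtz",
)


def eni_conf(region_id: str, vpc_id: str) -> dict:
    """Build the two ENI parameters (assumed value format: region ID, colon, VPC ID)."""
    if not region_id or not vpc_id:
        raise ValueError("region_id and vpc_id must both be non-empty")
    return {
        "spark.hadoop.odps.cupid.eni.enable": "true",
        "spark.hadoop.odps.cupid.eni.info": f"{region_id}:{vpc_id}",
    }


def unneeded_keys(conf: dict) -> list:
    """Return the keys in conf that are not required when the ENI is used."""
    return [key for key in NOT_REQUIRED_WITH_ENI if key in conf]


conf = eni_conf("cn-beijing", "vpc-2zeaeq21mb1dmkqh0****")
for key, value in conf.items():
    print(f"{key} = {value}")
print("Not required with ENI:", unneeded_keys(conf))
```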

Example 1: Access an ApsaraDB for MongoDB instance

The following code shows the value of spark.hadoop.odps.cupid.vpc.domain.list when you access an ApsaraDB for MongoDB instance. In this example, the ApsaraDB for MongoDB instance has a primary node and a secondary node, and each node has its own entry in urls.
{
  "regionId":"cn-beijing",
  "vpcs":[
    {
      "vpcId":"vpc-2zeaeq21mb1dmkqh0****",
      "zones":[
        {
          "urls":[
            {
              "domain":"dds-2ze3230cfea08****.mongodb.rds.aliyuncs.com",
              "port": 3717
            },
            {
              "domain":"dds-2ze3230cfea08****.mongodb.rds.aliyuncs.com",
              "port":3717
            }
          ]
        }
      ]
    }
  ]
}
Results of merging JSON text into one line:
{"regionId":"cn-beijing","vpcs":[{"vpcId":"vpc-2zeaeq21mb1dmkqh0****","zones":[{"urls":[{"domain":"dds-2ze3230cfea08****.mongodb.rds.aliyuncs.com","port":3717},{"domain":"dds-2ze3230cfea08****.mongodb.rds.aliyuncs.com","port":3717}]}]}]}

Example 2: Access an ApsaraDB RDS instance

The following code shows the value of spark.hadoop.odps.cupid.vpc.domain.list when you access an ApsaraDB RDS instance.
{
  "regionId":"cn-beijing",
  "vpcs":[
    {
      "vpcId":"vpc-2zeaeq21mb1dmkqh0****",
      "zones":[
        {
          "urls":[
            {
              "domain":"rm-2zem49k73c54z****.mysql.rds.aliyuncs.com",
              "port": 3306
            }
          ]
        }
      ]
    }
  ]
}
Results of merging JSON text into one line:

{"regionId":"cn-beijing","vpcs":[{"vpcId":"vpc-2zeaeq21mb1dmkqh0****","zones":[{"urls":[{"domain":"rm-2zem49k73c54z****.mysql.rds.aliyuncs.com","port":3306}]}]}]}

Example 3: Access an ApsaraDB for HBase instance

The following code shows the value of spark.hadoop.odps.cupid.vpc.domain.list when you access an ApsaraDB for HBase instance.
{
  "regionId":"cn-beijing",
  "vpcs":[
    {
      "vpcId":"vpc-2zeaeq21mb1dmkqh0exox",
      "zones":[
        {
          "urls":[
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":2181
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":16000
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":16020
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":2181
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":16000
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":16020
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":2181
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":16000
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":16020
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-cor*-***.hbase.rds.aliyuncs.com",
              "port":16020
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-cor*-***.hbase.rds.aliyuncs.com",
              "port":16020
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-cor*-***.hbase.rds.aliyuncs.com",
              "port":16020
            }
          ]
        }
      ]
    }
  ]
}
Results of merging JSON text into one line:

{"regionId":"cn-beijing","vpcs":[{"vpcId":"vpc-2zeaeq21mb1dmkqh0exox","zones":[{"urls":[{"domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com","port":2181},{"domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com","port":16000},{"domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com","port":16020},{"domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com","port":2181},{"domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com","port":16000},{"domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com","port":16020},{"domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com","port":2181},{"domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com","port":16000},{"domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com","port":16020},{"domain":"hb-2zecxg2ltnpeg8me4-cor*-***.hbase.rds.aliyuncs.com","port":16020},{"domain":"hb-2zecxg2ltnpeg8me4-cor*-***.hbase.rds.aliyuncs.com","port":16020},{"domain":"hb-2zecxg2ltnpeg8me4-cor*-***.hbase.rds.aliyuncs.com","port":16020}]}]}]}
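A list of this size is easier to generate than to type. The following Python sketch builds the single-line value from a list of host names and the ports that each kind of node exposes in the example above (2181, 16000, and 16020 for master nodes, and 16020 for core nodes). The helper name build_domain_list and the host names and VPC ID in the script are hypothetical placeholders. Replace them with the actual endpoints of your ApsaraDB for HBase instance.

```python
import json

# Ports taken from the example above.
MASTER_PORTS = (2181, 16000, 16020)
CORE_PORTS = (16020,)


def build_domain_list(region_id: str, vpc_id: str, endpoints) -> str:
    """Return the single-line JSON value for a list of (domain, port) pairs."""
    urls = [{"domain": domain, "port": port} for domain, port in endpoints]
    document = {
        "regionId": region_id,
        "vpcs": [{"vpcId": vpc_id, "zones": [{"urls": urls}]}],
    }
    # Compact separators give one line without spaces.
    return json.dumps(document, separators=(",", ":"))


# Hypothetical host names, for illustration only.
masters = [f"hb-example-master{i}-001.hbase.rds.aliyuncs.com" for i in (1, 2, 3)]
cores = [f"hb-example-cor{i}-001.hbase.rds.aliyuncs.com" for i in (1, 2, 3)]

endpoints = [(host, port) for host in masters for port in MASTER_PORTS]
endpoints += [(host, port) for host in cores for port in CORE_PORTS]

value = build_domain_list("cn-beijing", "vpc-example", endpoints)
print(value)
```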

Example 4: Access an ApsaraDB for Redis instance

The following code shows the value of spark.hadoop.odps.cupid.vpc.domain.list when you access an ApsaraDB for Redis instance.
{
  "regionId":"cn-beijing",
  "vpcs":[
    {
      "vpcId":"vpc-2zeaeq21mb1dmkqh0****",
      "zones":[
        {
          "urls":[
            {
              "domain":"r-2zebda0d3c05****.redis.rds.aliyuncs.com",
              "port":3717
            }
          ]
        }
      ]
    }
  ]
}
Results of merging JSON text into one line:

{"regionId":"cn-beijing","vpcs":[{"vpcId":"vpc-2zeaeq21mb1dmkqh0****","zones":[{"urls":[{"domain":"r-2zebda0d3c05****.redis.rds.aliyuncs.com","port":3717}]}]}]}

Example 5: Access a LogHub instance

The following code shows the value of spark.hadoop.odps.cupid.vpc.domain.list when you access a LogHub instance.
{
  "regionId":"cn-beijing",
  "vpcs":[
    {
      "zones":[
        {
          "urls":[
            {
              "domain":"cn-beijing-intranet.log.aliyuncs.com",
              "port":80
            }
          ]
        }
      ]
    }
  ]
}
Results of merging JSON text into one line:

{"regionId":"cn-beijing","vpcs":[{"zones":[{"urls":[{"domain":"cn-beijing-intranet.log.aliyuncs.com","port":80}]}]}]}

Set the domain parameter to the classic network endpoint or the VPC endpoint of the LogHub instance. For the endpoint of each region, see Endpoints.

Example 6: Access a DataHub instance

The following code shows the value of spark.hadoop.odps.cupid.vpc.domain.list when you access a DataHub instance.
{
  "regionId":"cn-beijing",
  "vpcs":[
    {
      "zones":[
        {
          "urls":[
            {
              "domain":"dh-cn-beijing.aliyun-inc.com",
              "port":80
            }
          ]
        }
      ]
    }
  ]
}
Results of merging JSON text into one line:

{"regionId":"cn-beijing","vpcs":[{"zones":[{"urls":[{"domain":"dh-cn-beijing.aliyun-inc.com","port":80}]}]}]}

Set the domain parameter to the ECS endpoint on the classic network.

Example 7: Access a custom domain name

In this example, the custom domain name example.aliyundoc.com is used in a VPC, and Spark on MaxCompute accesses this domain name at example.aliyundoc.com:80, where 80 is the port number. Perform the following operations before you access the domain name:
  1. Associate a zone with a VPC in PrivateZone.
  2. On the Cloud Resource Access Authorization page in the RAM console, click Confirm Authorization Policy to grant MaxCompute the read-only permissions on PrivateZone.
  3. Add the following parameters to the configurations of the Spark node:
    spark.hadoop.odps.cupid.pvtz.rolearn=acs:ram::xxxxxxxxxxx:role/aliyunodpsdefaultrole 
    spark.hadoop.odps.cupid.vpc.usepvtz=true

    The spark.hadoop.odps.cupid.pvtz.rolearn parameter specifies the Alibaba Cloud Resource Name (ARN) of the role. You can obtain the ARN from the RAM console.

  4. Add the spark.hadoop.odps.cupid.vpc.domain.list parameter to the configuration file of your Spark job. The following code shows the value of this parameter:
    {
      "regionId":"cn-beijing",
      "vpcs":[
        {
          "vpcId":"vpc-2zeaeq21mb1dmkqh0****",
          "zones":[
            {
              "urls":[
                {
                  "domain":"example.aliyundoc.com",
                  "port":80
                }
              ],
              "zoneId":"9b7ce89c6a6090e114e0f7c415ed****"
            }
          ]
        }
      ]
    }
    Results of merging JSON text into one line:
    
    {"regionId":"cn-beijing","vpcs":[{"vpcId":"vpc-2zeaeq21mb1dmkqh0****","zones":[{"urls":[{"domain":"example.aliyundoc.com","port":80}],"zoneId":"9b7ce89c6a6090e114e0f7c415ed****"}]}]}
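The three parameters from Steps 3 and 4 can be produced together, which keeps the zone ID and the role ARN in one place. The following Python sketch is one possible way to do this. The helper name private_zone_conf is hypothetical, and the role ARN, VPC ID, and zone ID are the masked placeholders from this example.

```python
import json


def private_zone_conf(region_id, vpc_id, zone_id, domain, port, role_arn):
    """Build the parameters for accessing a custom domain name (hypothetical helper)."""
    zone = {"urls": [{"domain": domain, "port": port}], "zoneId": zone_id}
    domain_list = {"regionId": region_id, "vpcs": [{"vpcId": vpc_id, "zones": [zone]}]}
    return {
        "spark.hadoop.odps.cupid.pvtz.rolearn": role_arn,
        "spark.hadoop.odps.cupid.vpc.usepvtz": "true",
        # Compact separators produce the single-line form without spaces.
        "spark.hadoop.odps.cupid.vpc.domain.list": json.dumps(
            domain_list, separators=(",", ":")
        ),
    }


conf = private_zone_conf(
    region_id="cn-beijing",
    vpc_id="vpc-2zeaeq21mb1dmkqh0****",
    zone_id="9b7ce89c6a6090e114e0f7c415ed****",
    domain="example.aliyundoc.com",
    port=80,
    role_arn="acs:ram::xxxxxxxxxxx:role/aliyunodpsdefaultrole",
)
for key, value in conf.items():
    print(f"{key}={value}")
```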

Example 8: Access an HDFS instance

  • To enable HDFS support, add the hdfs-site.xml file. The following code shows sample configurations in the file:
    <?xml version="1.0"?>
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>dfs://DfsMountpointDomainName:10290</value>
        </property>
        <property>
            <name>fs.dfs.impl</name>
            <value>com.alibaba.dfs.DistributedFileSystem</value>
        </property>
        <property>
            <name>fs.AbstractFileSystem.dfs.impl</name>
            <value>com.alibaba.dfs.DFS</value>
        </property>
    </configuration>
  • Add the spark.hadoop.odps.cupid.vpc.domain.list parameter to the configuration file of your Spark job. The following code shows the value of this parameter:
    {
        "regionId": "cn-shanghai",
        "vpcs": [{
            "vpcId": "vpc-xxxxxx",
            "zones": [{
                "urls": [{
                    "domain": "DfsMountpointDomainName",
                    "port": 10290
                }]
            }]
        }]
    }
    Results of merging JSON text into one line:
    
    {"regionId":"cn-shanghai","vpcs":[{"vpcId":"vpc-xxxxxx","zones":[{"urls":[{"domain":"DfsMountpointDomainName","port":10290}]}]}]}
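The domain and port in spark.hadoop.odps.cupid.vpc.domain.list must match the endpoint in fs.defaultFS, so it can help to derive one from the other. The following Python sketch reads fs.defaultFS from the hdfs-site.xml content and builds the single-line value from it. The helper names default_fs_endpoint and hdfs_domain_list are hypothetical, and DfsMountpointDomainName and vpc-xxxxxx are the placeholders from this example.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

HDFS_SITE = """<?xml version="1.0"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>dfs://DfsMountpointDomainName:10290</value>
    </property>
</configuration>
"""


def default_fs_endpoint(hdfs_site_xml: str):
    """Return (host, port) taken from the fs.defaultFS property."""
    root = ET.fromstring(hdfs_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == "fs.defaultFS":
            parts = urlsplit(prop.findtext("value", default="").strip())
            # Split netloc by hand because urlsplit(...).hostname lowercases the host.
            host = parts.netloc.rsplit(":", 1)[0]
            return host, parts.port
    raise KeyError("fs.defaultFS is not set")


def hdfs_domain_list(region_id: str, vpc_id: str, hdfs_site_xml: str) -> str:
    """Build the single-line domain list for the fs.defaultFS endpoint."""
    host, port = default_fs_endpoint(hdfs_site_xml)
    document = {
        "regionId": region_id,
        "vpcs": [{"vpcId": vpc_id, "zones": [{"urls": [{"domain": host, "port": port}]}]}],
    }
    return json.dumps(document, separators=(",", ":"))


print(hdfs_domain_list("cn-shanghai", "vpc-xxxxxx", HDFS_SITE))
```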