This topic describes how Spark on MaxCompute can access instances in a Virtual Private Cloud (VPC).

About VPC access

Spark on MaxCompute can access instances in a VPC, such as ECS, HBase, and RDS instances, as well as user-defined private domain names.

When Spark on MaxCompute accesses instances in a VPC, add the spark.hadoop.odps.cupid.vpc.domain.list parameter to the spark-defaults.conf file to specify one or more instances to access. The value of this parameter is in JSON format. You must compress the JSON into a single line.
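For example, a whitelist that contains a single RDS instance is written as the following one-line entry in spark-defaults.conf (the region ID, VPC ID, and domain are placeholders taken from Example 2 below):

spark.hadoop.odps.cupid.vpc.domain.list={"regionId":"cn-beijing","vpcs":[{"vpcId":"vpc-2zeaeq21mb1dmkqh0****","zones":[{"urls":[{"domain":"rm-2zem49k73c54z****.mysql.rds.aliyuncs.com","port":3306}]}]}]}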

The following examples show how to set the spark.hadoop.odps.cupid.vpc.domain.list parameter for different types of instances. Replace the region ID, VPC ID, instance domain name, and port number with the actual values. For the ID of each region, see Regions and zones.

Example 1: Access MongoDB

When Spark on MaxCompute accesses MongoDB, set spark.hadoop.odps.cupid.vpc.domain.list as follows. In this example, the MongoDB cluster has two instances: a primary instance and a standby instance.
{
  "regionId":"cn-beijing",
  "vpcs":[
    {
      "vpcId":"vpc-2zeaeq21mb1dmkqh0****",
      "zones":[
        {
          "urls":[
            {
              "domain":"dds-2ze3230cfea08****.mongodb.rds.aliyuncs.com",
                  "port": 3717
            },
            {
              "domain":"dds-2ze3230cfea08****.mongodb.rds.aliyuncs.com",
              "port":3717
            }
          ]
        }
      ]
    }
  ]
}
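With the whitelist in place, a Spark job can reach the instances over the VPC. The following Scala sketch is a minimal example, not a definitive implementation: it assumes the mongo-spark-connector dependency on the classpath, and the user name, password, and the database and collection (test.coll) are hypothetical placeholders.

import org.apache.spark.sql.SparkSession

object MongoDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mongo-vpc-demo").getOrCreate()
    // The URI must use the same domain and port that are whitelisted in
    // spark.hadoop.odps.cupid.vpc.domain.list. The credentials and the
    // database.collection pair (test.coll) are placeholders.
    val df = spark.read
      .format("com.mongodb.spark.sql.DefaultSource") // newer connector versions also register the short name "mongo"
      .option("uri", "mongodb://user:password@dds-2ze3230cfea08****.mongodb.rds.aliyuncs.com:3717/test.coll")
      .load()
    df.show()
    spark.stop()
  }
}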

Example 2: Access RDS

When Spark on MaxCompute accesses RDS, set spark.hadoop.odps.cupid.vpc.domain.list as follows.
{
  "regionId":"cn-beijing",
  "vpcs":[
    {
      "vpcId":"vpc-2zeaeq21mb1dmkqh0****",
      "zones":[
        {
          "urls":[
            {
              "domain":"rm-2zem49k73c54z****.mysql.rds.aliyuncs.com",
              "port": 3306
            }
          ]
        }
      ]
    }
  ]
}
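After the whitelist takes effect, the RDS instance can be read over JDBC. A minimal Scala sketch, assuming the mysql-connector-java driver on the classpath; the database (test_db), table (test_table), user name, and password are hypothetical placeholders:

import org.apache.spark.sql.SparkSession

object RdsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rds-vpc-demo").getOrCreate()
    // The JDBC URL must use the whitelisted domain and port.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://rm-2zem49k73c54z****.mysql.rds.aliyuncs.com:3306/test_db")
      .option("driver", "com.mysql.jdbc.Driver") // assumes mysql-connector-java 5.x
      .option("dbtable", "test_table")
      .option("user", "user")         // placeholder
      .option("password", "password") // placeholder
      .load()
    df.show()
    spark.stop()
  }
}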

Example 3: Access HBase

When Spark on MaxCompute accesses HBase, set spark.hadoop.odps.cupid.vpc.domain.list as follows. In this example, ports 2181, 16000, and 16020 are whitelisted for each master node, and port 16020 for each core node.
{
  "regionId":"cn-beijing",
  "vpcs":[
    {
      "vpcId":"vpc-2zeaeq21mb1dmkqh0exox",
      "zones":[
        {
          "urls":[
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":2181
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":16000
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":16020
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":2181
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":16000
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":16020
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":2181
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":16000
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com",
              "port":16020
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-cor*-***.hbase.rds.aliyuncs.com",
              "port":16020
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-cor*-***.hbase.rds.aliyuncs.com",
              "port":16020
            },
            {
              "domain":"hb-2zecxg2ltnpeg8me4-cor*-***.hbase.rds.aliyuncs.com",
              "port":16020
            }
          ]
        }
      ]
    }
  ]
}
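With the ZooKeeper and HBase node ports whitelisted, a Spark job can use the standard HBase client. A minimal Scala sketch, assuming the hbase-client dependency; the ZooKeeper quorum below reuses the masked master domain from the example, and the table name (test_table) and row key (row1) are hypothetical placeholders:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

object HbaseDemo {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    // Point the client at the whitelisted ZooKeeper quorum (port 2181).
    conf.set("hbase.zookeeper.quorum", "hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("test_table"))
    // Fetch a single row to verify connectivity.
    val result = table.get(new Get(Bytes.toBytes("row1")))
    println(Bytes.toString(result.getRow))
    table.close()
    connection.close()
  }
}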

Example 4: Access Redis

When Spark on MaxCompute accesses Redis, set spark.hadoop.odps.cupid.vpc.domain.list as follows.
{
  "regionId":"cn-beijing",
  "vpcs":[
    {
      "vpcId":"vpc-2zeaeq21mb1dmkqh0****",
      "zones":[
        {
          "urls":[
            {
              "domain":"r-2zebda0d3c05****.redis.rds.aliyuncs.com",
              "port":3717
            }
          ]
        }
      ]
    }
  ]
}
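A Spark job can then reach the Redis instance with an ordinary Redis client. A minimal Scala sketch using the Jedis client (an assumption; any Redis client works), with the password and key as hypothetical placeholders:

import redis.clients.jedis.Jedis

object RedisDemo {
  def main(args: Array[String]): Unit = {
    // Connect to the whitelisted domain and port.
    val jedis = new Jedis("r-2zebda0d3c05****.redis.rds.aliyuncs.com", 6379)
    jedis.auth("password") // placeholder
    jedis.set("spark-demo-key", "hello")
    println(jedis.get("spark-demo-key"))
    jedis.close()
  }
}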

Example 5: Access LogHub

When Spark on MaxCompute accesses LogHub, set spark.hadoop.odps.cupid.vpc.domain.list as follows.
{
  "regionId":"cn-beijing",
  "vpcs":[
    {
      "zones":[
        {
          "urls":[
            {
              "domain":"cn-beijing-intranet.log.aliyuncs.com",
              "port": 80
            }
          ]
        }
      ]
    }
  ]
}

Set the domain parameter to the classic network endpoint or the VPC endpoint of LogHub in the target region. For the endpoints of each region, see Service endpoint.
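Before wiring up a LogHub connector, it can be useful to verify that the endpoint is reachable from inside the job. A minimal, JDK-only Scala check; the same pattern works for any whitelisted domain, including the DataHub endpoint and user-defined domains in the later examples:

import java.net.{InetSocketAddress, Socket}

object EndpointCheck {
  def main(args: Array[String]): Unit = {
    val socket = new Socket()
    // Connect to the whitelisted LogHub endpoint with a 3-second timeout;
    // an exception here means the VPC whitelist is not yet effective.
    socket.connect(new InetSocketAddress("cn-beijing-intranet.log.aliyuncs.com", 80), 3000)
    println("LogHub endpoint is reachable")
    socket.close()
  }
}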

Example 6: Access DataHub

When Spark on MaxCompute accesses DataHub, set spark.hadoop.odps.cupid.vpc.domain.list as follows.
{
  "regionId":"cn-beijing",
  "vpcs":[
    {
      "zones":[
        {
          "urls":[
            {
              "domain":"dh-cn-beijing.aliyun-inc.com",
              "port": 80
            }
          ]
        }
      ]
    }
  ]
}

For the domain parameter, use the classic network ECS endpoint as the DataHub endpoint.

Example 7: Access a user-defined domain name

Assume that you have a user-defined domain name a.b.com in a VPC, and Spark accesses the domain name and port as a.b.com:80. First, complete the following configurations. A usage sketch follows the steps.
  1. Associate a Zone with a VPC in PrivateZone.
  2. Click Authorize to grant MaxCompute read-only access to PrivateZone.
  3. In Spark node configuration, add the following two parameters.
    spark.hadoop.odps.cupid.pvtz.rolearn=acs:ram::xxxxxxxxxxx:role/aliyunodpsdefaultrole 
    spark.hadoop.odps.cupid.vpc.usepvtz=true

    The spark.hadoop.odps.cupid.pvtz.rolearn parameter specifies the Alibaba Cloud Resource Name (ARN) of your role. You can obtain the ARN from the RAM console.

  4. In the Spark configuration file, set spark.hadoop.odps.cupid.vpc.domain.list as follows.
    {
      "regionId":"cn-beijing",
      "vpcs":[
        {
          "vpcId":"vpc-2zeaeq21mb1dmkqh0****",
          "zones":[
            {
              "urls":[
                {
                  "domain":"abc.com",
                  "port": 80,
                }
              ],
              "zoneId":"9b7ce89c6a6090e114e0f7c415ed****"
            }
          ]
        }
      ]
    }
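After the configuration above, a Spark job can resolve the private domain through PrivateZone and access it directly. A minimal, JDK-only Scala sketch that issues an HTTP GET to the hypothetical domain a.b.com on port 80:

import java.net.{HttpURLConnection, URL}
import scala.io.Source

object PrivateDomainDemo {
  def main(args: Array[String]): Unit = {
    // a.b.com resolves through PrivateZone inside the VPC.
    val connection = new URL("http://a.b.com:80/").openConnection().asInstanceOf[HttpURLConnection]
    connection.setConnectTimeout(3000)
    val body = Source.fromInputStream(connection.getInputStream).mkString
    println(body)
    connection.disconnect()
  }
}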

Example 8: Access HDFS

Spark on MaxCompute can use Hadoop Distributed File System (HDFS) for file storage. To access HDFS, perform the following configurations. A usage sketch follows the configurations.
  • To enable HDFS support, add hdfs-site.xml as follows.
    <?xml version="1.0"?>
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>dfs://DfsMountpointDomainName:10290</value>
        </property>
        <property>
            <name>fs.dfs.impl</name>
            <value>com.alibaba.dfs.DistributedFileSystem</value>
        </property>
        <property>
            <name>fs.AbstractFileSystem.dfs.impl</name>
            <value>com.alibaba.dfs.DFS</value>
        </property>
    </configuration>
  • In the Spark configuration file, set spark.hadoop.odps.cupid.vpc.domain.list as follows:
    {
        "regionId": "cn-shanghai",
        "vpcs": [{
            "vpcId": "vpc-xxxxxx",
            "zones": [{
                "urls": [{
                    "domain": "DfsMountpointDomainName",
                    "port": 10290
                }]
            }]
        }]
    }
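With both files in place, a Spark job can read from and write to the mount point through the dfs:// scheme. A minimal Scala sketch, where DfsMountpointDomainName is the placeholder from the configuration above and the path /tmp/spark-demo is hypothetical:

import org.apache.spark.sql.SparkSession

object HdfsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hdfs-vpc-demo").getOrCreate()
    // Write a small RDD to the mount point, then read it back.
    spark.sparkContext.parallelize(Seq("hello", "world"))
      .saveAsTextFile("dfs://DfsMountpointDomainName:10290/tmp/spark-demo")
    spark.read.text("dfs://DfsMountpointDomainName:10290/tmp/spark-demo").show()
    spark.stop()
  }
}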