MaxCompute can use external tables, user-defined functions (UDFs), or the Alibaba Cloud LakeHouse solution to access the Internet or a virtual private cloud (VPC). Before MaxCompute accesses the Internet or a VPC, you must establish a network connection between MaxCompute and the destination IP address or service. The service can be ApsaraDB for HBase, ApsaraDB RDS, or a Hadoop cluster. This topic describes how to establish a network connection. For some operations in this process, you must submit a ticket to contact Alibaba Cloud technical support personnel.

Network architecture

The following figure shows the network architecture in scenarios where MaxCompute accesses services, such as ApsaraDB for HBase, ApsaraDB RDS, and Hadoop clusters, over the Internet and a VPC.
The architecture consists of the following domains:
MaxCompute domain
  MaxCompute is a software as a service (SaaS) cloud data warehouse. It can use external tables, UDFs, or the Alibaba Cloud LakeHouse solution to access the Internet.
Network connection domain
  You can use one of the following schemes to enable MaxCompute to access the destination IP address or network:
  • Service mapping scheme: includes Internet-based service mapping and VPC-based service mapping. This scheme allows MaxCompute to access the destination IP address or network by using a specific service IP address, domain name, and port. For more information, see Service mapping scheme.
  • VPC connection scheme: allows you to create a leased line in a VPC and authorize MaxCompute to access the destination IP address or network. For more information, see VPC connection scheme.
Destination domain
  • The public IP address or domain name is the IP address or domain name of the network that you want to access.
  • A VPC is a virtual private network established on Alibaba Cloud. In a VPC, you can manage the IP addresses or domain names of various services and managed clusters, such as Elastic Compute Service (ECS), ApsaraDB RDS, ApsaraDB for HBase, and Hadoop clusters. An elastic network interface (ENI) is a virtual network interface controller (NIC) that can be bound to an ECS instance that is deployed in a VPC. You can use ENIs to implement fine-grained network management.
  • MaxCompute can also access Alibaba Cloud services such as Object Storage Service (OSS) and Tablestore. The network connections between MaxCompute and these services are already established.

Prerequisites

Before you establish a network connection, make sure that the following conditions are met:
  • The connection information of the Internet destination, ApsaraDB RDS instance, ApsaraDB for HBase cluster, or Hadoop cluster that you want to access is obtained. The Hadoop cluster can be a self-managed cluster on the cloud or an E-MapReduce (EMR) cluster. For a Hadoop cluster, the information includes the IP addresses and port numbers of the Hive metastore service and NameNode of HDFS.
  • A MaxCompute project is created. If you have a MaxCompute project, you can directly use it without the need to create another one. The MaxCompute project must be in the same region as the VPC to which the ApsaraDB RDS instance, ApsaraDB for HBase cluster, or Hadoop cluster you use belongs. If the MaxCompute project needs to access the Hadoop system or an open source data lake engine, we recommend that you set the data type edition for the project to the Hive-compatible data type edition.
  • The Alibaba Cloud account that owns the VPC you want to access is obtained. The Alibaba Cloud account that is used to access the MaxCompute project and the administrator account of the destination system or cluster are also obtained.

Supported regions

The service mapping and VPC connection schemes are supported in the following regions. You cannot establish a network connection for MaxCompute in regions where none of the schemes is supported.
Scheme: Internet-based service mapping
  Supported regions: all regions at the China site (aliyun.com)
  Connection object: public IP address or domain name
Scheme: VPC-based service mapping
  Supported regions:
  • China (Beijing)
  • China (Shanghai)
  Connection object: IP address or domain name of a VPC
Scheme: VPC connection scheme
  Supported regions:
  • China (Beijing)
  • China (Shanghai)
  • China (Zhangjiakou)
  • China (Hangzhou)
  Connection objects:
  • IP address or domain name of a VPC
  • ApsaraDB RDS
  • ApsaraDB for HBase cluster
  • Hadoop cluster
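
When you choose between the two VPC schemes, it can help to encode the supported regions listed above as data. The following is a minimal sketch, not part of any MaxCompute SDK: the class name SchemeRegions and the method vpcSchemesFor are hypothetical, and the region names are copied as they appear in this topic. Internet-based service mapping is left out because its region list is not enumerated here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: encodes the supported regions of the two VPC schemes.
class SchemeRegions {
    // Scheme name -> regions, in the order used in this topic.
    private static final Map<String, List<String>> VPC_SCHEME_REGIONS = new LinkedHashMap<>();
    static {
        VPC_SCHEME_REGIONS.put("VPC-based service mapping",
            List.of("China (Beijing)", "China (Shanghai)"));
        VPC_SCHEME_REGIONS.put("VPC connection scheme",
            List.of("China (Beijing)", "China (Shanghai)", "China (Zhangjiakou)", "China (Hangzhou)"));
    }

    /** Returns the VPC schemes that list the given region; empty if none does. */
    static List<String> vpcSchemesFor(String region) {
        List<String> schemes = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : VPC_SCHEME_REGIONS.entrySet()) {
            if (entry.getValue().contains(region)) {
                schemes.add(entry.getKey());
            }
        }
        return schemes;
    }

    public static void main(String[] args) {
        // Prints [VPC connection scheme]
        System.out.println(vpcSchemesFor("China (Hangzhou)"));
    }
}
```

An empty result means that neither VPC scheme is listed for the region, so only Internet-based service mapping may apply.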

Service mapping scheme

The service mapping scheme is used when network resources are preset on the cloud. When a MaxCompute job is running, MaxCompute can automatically access the destination object over a network based on the configured destination IP address.

The service mapping scheme allows MaxCompute to access a single destination IP address or domain name over the Internet or a VPC.

  • Internet-based service mapping
    To enable MaxCompute to access the Internet, perform the following operations:
    1. Submit a ticket to apply for a whitelist of public IP addresses or domain names and the port numbers.

      You must enter the destination IP address or domain name and the port number in the ticket. If you need to access multiple destinations, separate them with commas (,). For example, if you want to access an Alibaba Cloud domain name, provide the network configuration information www.aliyun.com:80. If you want to access the AMAP service, provide the network configuration information restapi.amap.com:443,restapi.amap.com:80.

    2. After the ticket is processed, you can run a job on the MaxCompute client. Sample commands:
      -- Specify the public IP address or domain name and port number that are configured in the whitelist. The specified information is used in the following SELECT statement to access the Internet. 
      set odps.internet.access.list=<ip_address:port|realm_name:port>; 
      select <UDF_name>("<http://ip_address|realm_name>");
      • ip_address:port|realm_name:port: required. The public IP address or domain name and port number that you want to access.
      • UDF_name: the name of the UDF that you use to access the public IP address or domain name. The following sample code shows how to create a UDF:
        package com.aliyun.odps.test.udf;
        import com.aliyun.odps.udf.UDF;
        import java.io.BufferedReader;
        import java.io.IOException;
        import java.io.InputStreamReader;
        import java.net.URL;
        public class <UDF_name> extends UDF {
          public String evaluate(String urlStr) throws IOException {
            URL url = new URL(urlStr);
            StringBuilder sb = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
              String line;
              while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
              }
            }
            return sb.toString();
          }
        }
      Sample commands that call the preceding UDF, assuming that it is registered as a function named url_fetch:
      set odps.internet.access.list=www.aliyun.com:80;
      SELECT url_fetch("http://www.aliyun.com");
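
Before you submit a job, you may want to confirm that the URL passed to the UDF matches an entry in the value of odps.internet.access.list. The following sketch parses a comma-separated host:port list in the format described in step 1 and checks a URL against it. The class InternetAccessList and its methods are hypothetical, and the exact-match comparison on host and port is an assumption made for this local check. It does not describe how MaxCompute enforces the whitelist.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical client-side check; not part of the MaxCompute client or SDK.
class InternetAccessList {
    /** Parses "host:port,host:port" into a set of lowercase "host:port" entries. */
    static Set<String> parse(String accessList) {
        Set<String> entries = new HashSet<>();
        for (String raw : accessList.split(",")) {
            String entry = raw.trim().toLowerCase(Locale.ROOT);
            int colon = entry.lastIndexOf(':');
            if (colon <= 0 || colon == entry.length() - 1) {
                throw new IllegalArgumentException("Expected host:port but got: " + raw);
            }
            // Throws NumberFormatException if the port is not numeric.
            int port = Integer.parseInt(entry.substring(colon + 1));
            if (port < 1 || port > 65535) {
                throw new IllegalArgumentException("Port out of range: " + raw);
            }
            entries.add(entry);
        }
        return entries;
    }

    /** Returns true if the host and port of an http or https URL appear in the list. */
    static boolean covers(String accessList, String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            return false;
        }
        int port = uri.getPort();
        if (port == -1) {
            // No explicit port: use 443 for https and 80 otherwise.
            port = "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
        }
        return parse(accessList).contains(host.toLowerCase(Locale.ROOT) + ":" + port);
    }

    public static void main(String[] args) {
        System.out.println(covers("www.aliyun.com:80", "http://www.aliyun.com"));
        System.out.println(covers("www.aliyun.com:80", "https://www.aliyun.com"));
    }
}
```

In this sketch, an https URL is not covered by an entry for port 80, which is why the AMAP example in step 1 lists both restapi.amap.com:443 and restapi.amap.com:80.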
  • VPC-based service mapping
    To enable MaxCompute to access a VPC, perform the following operations:
    1. Add the Classless Inter-Domain Routing (CIDR) blocks of MaxCompute to the whitelist of a security group of the VPC that you want to access. After that, the IP addresses in the CIDR blocks of MaxCompute can be used to access the VPC.
      Note This configuration is required only when you use the VPC-based service mapping scheme.
      CIDR blocks to add to a security group of the VPC, by region:
      • China (Shanghai): 100.104.49.64/26, 100.104.212.192/26, 100.104.244.0/26, 100.104.94.0/26
      • China (Beijing): 100.104.218.0/26, 100.104.120.0/26, 100.104.156.192/26, 100.104.149.0/26, 100.104.49.64/26, 100.104.212.192/26, 100.104.244.0/26, 100.104.94.0/26
    2. Log on to the MaxCompute client and specify the destination VPC. Sample command:
      -- Add the destination VPC to the whitelist of MaxCompute. 
      setproject odps.security.outbound.destination=<RegionID>_<VPC ID>[*];  
      • RegionID: required. The ID of the region to which the VPC belongs. For more information, see Obtain the region ID and VPC ID of a VPC.
      • VPC ID: required. The ID of the VPC that you want to access.
      • [*]: optional. A wildcard, which indicates that all IP addresses and port numbers under the VPC are added to the whitelist of MaxCompute.
      For example, the ID of the VPC that you want to access is vpc-bp1e4p7feyvwt103j****. The region to which the VPC belongs is China (Shanghai), and the destination IP address and port number in the VPC are 192.0.2.0:80. To add all IP addresses and port numbers in the VPC to the whitelist of MaxCompute, run the following command:
      setproject odps.security.outbound.destination=cn-shanghai_vpc-bp1e4p7feyvwt103j****[*];
    3. Commit an SQL statement on the MaxCompute client to use a UDF to access the destination IP address of the VPC.

      If you want to access the IP address of the VPC that you added in the preceding step, run the following commands:

      -- Specify the ID of the VPC that you want to access. 
      set odps.vpc.id=vpc-bp1e4p7feyvwt103j****; 
      -- Specify the IP address and port number of the VPC that you want to access. 
      set odps.vpc.access.ips=192.0.2.0:80; 
      -- Use a UDF to access the destination VPC.      
      SELECT url_fetch("http://192.0.2.0:80");   
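
When you troubleshoot the security group rule from step 1, you can check whether a source IP address falls inside one of the MaxCompute CIDR blocks. The following sketch does the IPv4 prefix comparison by hand for the China (Shanghai) blocks listed in step 1. The class MaxComputeCidrCheck and its methods are hypothetical helpers, and the sample IP addresses are for illustration only.

```java
// Hypothetical helper: tests IPv4 addresses against the CIDR blocks from step 1.
class MaxComputeCidrCheck {
    // CIDR blocks for China (Shanghai), copied from step 1.
    static final String[] SHANGHAI_BLOCKS = {
        "100.104.49.64/26", "100.104.212.192/26", "100.104.244.0/26", "100.104.94.0/26"
    };

    /** Converts a dotted-quad IPv4 address to its 32-bit value, held in a long. */
    static long toLong(String ip) {
        String[] parts = ip.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range in: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Returns true if the address shares the network prefix of the CIDR block. */
    static boolean inBlock(String ip, String cidr) {
        String[] parts = cidr.split("/");
        int prefix = Integer.parseInt(parts[1]);
        // Keep the top `prefix` bits of a 32-bit mask.
        long mask = prefix == 0 ? 0 : (0xFFFFFFFFL << (32 - prefix)) & 0xFFFFFFFFL;
        return (toLong(ip) & mask) == (toLong(parts[0]) & mask);
    }

    /** Returns true if the address is inside at least one of the blocks. */
    static boolean inAnyBlock(String ip, String[] blocks) {
        for (String block : blocks) {
            if (inBlock(ip, block)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(inAnyBlock("100.104.49.100", SHANGHAI_BLOCKS));
        System.out.println(inAnyBlock("10.0.0.1", SHANGHAI_BLOCKS));
    }
}
```

A /26 block spans 64 addresses, so 100.104.49.64/26 covers 100.104.49.64 through 100.104.49.127.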

VPC connection scheme

If you use the VPC connection scheme, you must specify a VPC and authorize MaxCompute to access services in the VPC. To do so, you must allow MaxCompute to create an ENI in a security group of the VPC and bind this ENI to an ECS instance. In addition, you must expose the necessary services in the VPC to MaxCompute. The services include ApsaraDB RDS, ApsaraDB for HBase, and the Hive metastore and HDFS services of Hadoop clusters. The VPC connection scheme allows MaxCompute to access multiple IP addresses or CIDR blocks in the destination VPC.

To use the VPC connection scheme, perform the following steps:

  1. Allow MaxCompute to create an ENI in the destination VPC by using Resource Access Management (RAM) authorization. This establishes a connection between MaxCompute and the VPC. To perform RAM authorization, you must use the Alibaba Cloud account that owns the VPC to log on to the RAM console and click Confirm Authorization Policy. When you use the Alibaba Cloud LakeHouse solution, you must also use this Alibaba Cloud account to create a MaxCompute project.
  2. Log on to the VPC console. On the VPCs page, click the name of the VPC that you want to access in the Instance ID/Name column. On the page that appears, view the security groups in the VPC Resources section of the Resources tab.
  3. Click Add on the right of the value under Security Group. On the Security Groups page, click Create Security Group and create a security group for MaxCompute to manage its access permissions in the VPC. Record the ID of the security group that you created. You will use it when you submit a MaxCompute ticket to apply for VPC access. When you create a security group, make sure that the VPC you select for the security group is the same as that of the service that you want to access. For more information, see Create a security group.

    You can create a security group for MaxCompute to manage its permissions to access various resources in a VPC. ENIs that are created by MaxCompute are managed in this security group. When a network connection is established by using a ticket, MaxCompute creates two or more ENIs based on the bandwidth requirements. The ENIs are managed in the security group that is included in the ticket.
    Note
    • You must create a basic security group, instead of an advanced security group. By default, basic security groups allow outbound traffic. However, advanced security groups do not allow outbound traffic by default. If you use an advanced security group, no services in the VPC can be accessed.
    • If access to ApsaraDB for HBase cannot be allowed in the security group, contact the MaxCompute technical service team. After the team grants the network access permissions to ApsaraDB for HBase, you need only to allow access from the IP address of the ENI of MaxCompute. However, if the configuration of MaxCompute changes, the IP address of the ENI may also change. To facilitate operations, we recommend that you allow access from the CIDR blocks of the vSwitch of the destination VPC.
  4. Submit a ticket to request the MaxCompute technical service team to create a network connection. The following information is required to create a network connection:
    • Region ID: the ID of the region to which the VPC belongs. For more information, see Region IDs.
    • VPC ID: the ID of the VPC that you want to access. If MaxCompute needs to access an EMR or ApsaraDB for HBase cluster, you can obtain the VPC ID from the network information in the EMR or ApsaraDB for HBase console.
    • vSwitch ID: If MaxCompute needs to access an EMR or ApsaraDB for HBase cluster, you can obtain the vSwitch ID from the network information in the EMR or ApsaraDB for HBase console.

      To obtain the vSwitch ID, perform the following operations: Log on to the VPC console. In the left-side navigation pane, click vSwitches. On the vSwitches page, click the name of the vSwitch in the Instance ID/Name column. On the page that appears, view the vSwitch ID in the vSwitch Basic Information section.

    • Security group ID: the ID of the security group you created in Step 3.
    • Alibaba Cloud account ID: You can obtain this information in the Alibaba Cloud Management Console.
  5. Configure a firewall or security group for the cluster that you want to access.
    • Configure a firewall or security group for a Hadoop cluster
      MaxCompute accesses a Hadoop cluster through the ENI of MaxCompute, which belongs to the security group that you created for MaxCompute. Therefore, you must configure the security group or firewall of the Hadoop cluster to allow inbound traffic on the required ports from the security group to which the ENI belongs. To allow this inbound traffic, perform the following operations:
      • Configure an inbound rule for the security group of the Hadoop cluster.
      • Make sure that the authorization object of the rule is the security group to which the ENI belongs.
      • Set the port for the Hive MetaStore service to 9083.
      • Set the port for NameNode of HDFS to 8020.
      • Set the port for DataNode of HDFS to 50010.
    • Configure a firewall or security group for an ApsaraDB for HBase cluster
      Log on to the ApsaraDB for HBase console and add the security group that is created for MaxCompute to ApsaraDB for HBase.

      For more information about how to configure a whitelist and security group of ApsaraDB for HBase, see Configure IP address whitelists and security groups.

      Note If you are not allowed to add a security group, add the IP address of the ENI that is created by MaxCompute.
    • Configure a security group for ApsaraDB RDS
      Log on to the ApsaraDB RDS console and add the security group or the IP address of the ENI that is created by MaxCompute to ApsaraDB RDS.
      Note If the configuration of MaxCompute changes, the IP address of the ENI may also change. To facilitate operations, we recommend that you allow access from the CIDR blocks of the vSwitch of the destination VPC.
      For more information about how to configure security policies of ApsaraDB RDS, such as whitelists, see Access control.
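
For the Hadoop cluster configuration in step 5, a simple check can confirm that the planned inbound rules cover all three ports for the security group of the ENI. The following sketch uses a simplified, hypothetical rule model (a port range and an authorized source security group). It is not the ECS API, and the security group ID sg-example is a placeholder.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of inbound rules; real security group rules have more fields.
class HadoopInboundRuleCheck {
    /** One inbound rule: an inclusive port range and the authorized source security group. */
    static final class InboundRule {
        final int fromPort;
        final int toPort;
        final String sourceSecurityGroupId;

        InboundRule(int fromPort, int toPort, String sourceSecurityGroupId) {
            this.fromPort = fromPort;
            this.toPort = toPort;
            this.sourceSecurityGroupId = sourceSecurityGroupId;
        }

        boolean allows(int port, String securityGroupId) {
            return port >= fromPort && port <= toPort
                && sourceSecurityGroupId.equals(securityGroupId);
        }
    }

    // Ports named in step 5 of this topic.
    private static final Map<String, Integer> REQUIRED_PORTS = new LinkedHashMap<>();
    static {
        REQUIRED_PORTS.put("Hive MetaStore service", 9083);
        REQUIRED_PORTS.put("NameNode of HDFS", 8020);
        REQUIRED_PORTS.put("DataNode of HDFS", 50010);
    }

    /** Returns the services whose port is not allowed from the ENI security group. */
    static List<String> missingServices(List<InboundRule> rules, String eniSecurityGroupId) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Integer> required : REQUIRED_PORTS.entrySet()) {
            boolean allowed = false;
            for (InboundRule rule : rules) {
                if (rule.allows(required.getValue(), eniSecurityGroupId)) {
                    allowed = true;
                    break;
                }
            }
            if (!allowed) {
                missing.add(required.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<InboundRule> rules = List.of(
            new InboundRule(9083, 9083, "sg-example"),
            new InboundRule(8000, 8100, "sg-example"));
        // Prints [DataNode of HDFS]
        System.out.println(missingServices(rules, "sg-example"));
    }
}
```

An empty result means that every port in step 5 is covered by at least one rule in the model. It does not prove that the cluster is reachable from MaxCompute.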