MaxCompute is the core computing component of the Alibaba Cloud big data platform and provides powerful computing capabilities. MaxCompute schedules a large number of nodes to run computing jobs in parallel. It also provides systematic mechanisms, such as failovers and retries, to handle the complexities of distributed computing.

Background information

MaxCompute SQL provides an entry point for distributed data processing. This allows you to process and store exabytes of offline data. The computing framework of MaxCompute continues to evolve to meet the requirements of growing big data businesses and new usage scenarios. In its early versions, MaxCompute provided powerful computing capabilities only for internal data stored in special formats. It can now also process external data.

MaxCompute SQL was originally designed to process structured data that is stored in MaxCompute internal tables in the CFile columnar storage format. To compute on other user data, such as text files and unstructured data, you must first use separate tools to import the data into MaxCompute tables. For example, to process Object Storage Service (OSS) data in MaxCompute, you can use one of the following methods:
  • Use OSS SDK or other tools to download data from OSS. Then, use MaxCompute Tunnel to import the downloaded data to a MaxCompute table.
  • Write a user-defined function (UDF) to call OSS SDK and access OSS data.
However, both methods have drawbacks.
  • The first method requires data transfer operations outside the MaxCompute system. If a large amount of OSS data needs to be processed, you must run parallel operations yourself to accelerate the transfer. As a result, you cannot fully utilize the large-scale computing capabilities of MaxCompute.
  • The second method requires you to apply for permissions for UDF-based access to OSS. It also requires that developers control the number of parallel jobs and handle data partitioning themselves.

MaxCompute provides external tables to address these issues. External tables are used to process data that is stored outside MaxCompute internal tables. You can execute a simple Data Definition Language (DDL) statement to create an external table in MaxCompute and associate it with an external data source. This allows you to read and write data in various formats. In most cases, external tables can be accessed in the same way as standard MaxCompute tables, so you can fully utilize the computing capabilities of MaxCompute SQL to process external data.
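As a minimal sketch of such a DDL statement, the following example maps an external table to CSV files in an OSS directory. The table name, columns, endpoint, bucket, and directory are hypothetical placeholders; the statement assumes the built-in CSV storage handler for OSS data:

```sql
-- Hypothetical example: map an external table to CSV files in an OSS directory.
-- The endpoint, bucket, and directory in LOCATION are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS ambulance_data_csv_external
(
    vehicle_id  BIGINT,
    record_id   BIGINT,
    patient_id  BIGINT,
    call_time   DATETIME
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'  -- built-in handler for CSV data on OSS
LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/oss-odps-test/Demo/';
```

After the table is created, the CSV files under the specified OSS directory can be queried without first importing them into MaxCompute.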

Note
  • If you use an external table, the data in this table is not stored in MaxCompute, and you are not charged for the storage of the table data.
  • Full search is supported for external tables.
  • Tunnel commands and Tunnel SDK cannot be used for external tables. You can use Tunnel to upload data to MaxCompute internal tables. You can also use OSS SDK for Python to upload data to OSS and map the data to external tables in MaxCompute.
  • You can create, search, configure, and process external tables in the DataWorks console. You can also query and analyze data in external tables. For more information, see External table.
  • If external tables are used, you are charged only for the computing resources that MaxCompute consumes, based on the billing rules for MaxCompute computing resources. Data in external tables is not stored in MaxCompute, so no MaxCompute storage fees are generated. For more information about storage fees, see the billing rules of the data source that stores the data. If you use a public endpoint of MaxCompute to access an external table, you are also charged for Internet traffic and data downloads. For more information about MaxCompute fees, see Overview.

Examples

This section describes how to use MaxCompute external tables to process unstructured data:
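As a hypothetical illustration of the overall pattern, assume that an external table named `ambulance_data_csv_external` has already been created over CSV data in OSS. It can then be queried like a standard MaxCompute table:

```sql
-- Hypothetical example: query the external table as if it were an internal table.
SELECT   vehicle_id,
         COUNT(*) AS record_cnt
FROM     ambulance_data_csv_external
GROUP BY vehicle_id;
```

The query runs on MaxCompute computing resources while the underlying data remains in OSS.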