MaxCompute is the core computing component of the Alibaba Cloud big data platform and provides powerful computing capabilities. MaxCompute schedules a large number of nodes to run computing jobs in parallel. It also provides mechanisms, such as failover and retry, to systematically manage distributed computing.
Background information
MaxCompute SQL provides an entry point for distributed data processing. This allows you to process and store exabytes of offline data. The computing framework of MaxCompute continues to evolve to meet the requirements that arise from expanding big data business and new use scenarios. In early versions, MaxCompute provided powerful computing capabilities only for internal data stored in its proprietary format. It can now also process external data. Before external tables were introduced, you could process OSS data in MaxCompute in only the following ways:
- Use the OSS SDK or other tools to download data from OSS. Then, use MaxCompute Tunnel to import the downloaded data into a MaxCompute table.
- Write a user-defined function (UDF) that calls the OSS SDK to access OSS data directly.
- The first method requires data transfer operations outside the MaxCompute system. If a large amount of OSS data needs to be processed, you must implement parallel downloads yourself. As a result, you cannot fully utilize the large-scale computing capabilities of MaxCompute.
- The second method requires you to manage OSS access permissions in the UDF. It also requires developers to control the number of parallel jobs and handle issues related to data partitioning.
MaxCompute provides external tables to address these issues. External tables are used to process data that is stored outside MaxCompute internal tables. You can execute a simple data definition language (DDL) statement to create an external table in MaxCompute and associate it with an external data source. This allows you to read and write data in various formats. In most cases, external tables can be accessed in the same way as standard MaxCompute tables, so you can fully utilize the computing capabilities of MaxCompute SQL to process external data.
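For example, CSV objects in an OSS directory can be mapped with a single DDL statement. The following is a minimal sketch that assumes the built-in CSV storage handler; the bucket name, directory, table name, and columns are hypothetical:

```sql
-- Hypothetical example: map CSV objects under an OSS directory
-- to an external table by using the built-in CSV storage handler.
CREATE EXTERNAL TABLE IF NOT EXISTS ambulance_data_csv_external
(
    vehicle_id  BIGINT,
    record_id   BIGINT,
    total_price DOUBLE
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'   -- built-in handler for CSV data
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/my-bucket/ambulance/';

-- The external table can then be queried like a standard MaxCompute table.
SELECT vehicle_id, SUM(total_price)
FROM ambulance_data_csv_external
GROUP BY vehicle_id;
```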
- If you use an external table, the data in this table is not stored in MaxCompute, and you are not charged for the storage of the table data.
- Full table scans are supported for external tables.
- Tunnel commands and the Tunnel SDK cannot be used for external tables. You can use Tunnel only to upload data to MaxCompute internal tables. Alternatively, you can use OSS SDK for Python to upload data to OSS and map the data to external tables in MaxCompute.
- You can create, search, configure, and process external tables in the DataWorks console. You can also query and analyze data in external tables. For more information, see External table.
- If external tables are used, you are charged only computing fees based on the billing rules of MaxCompute computing resources. Data in external tables is not stored in MaxCompute, so no MaxCompute storage fees are generated. For more information about storage fees, see the billing rules of the data source that stores the data. If you use a public endpoint of MaxCompute to connect to an external table, you are also charged for Internet traffic and data downloads. For more information about MaxCompute fees, see Overview.
Examples
- To access unstructured data in OSS and Tablestore, see Access OSS data and Access Tablestore data.
- To use external tables to access OSS data, you must authorize MaxCompute to access OSS. The authorization is performed in the Resource Access Management (RAM) console. For more information, see STS authorization.
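After you complete the RAM authorization, the role that MaxCompute assumes to read OSS is typically specified in the table DDL. The following sketch assumes the `odps.properties.rolearn` property and the built-in TSV storage handler; the account ID, role name, bucket, and columns are hypothetical:

```sql
-- Hypothetical example: grant MaxCompute access to OSS through a RAM role.
CREATE EXTERNAL TABLE IF NOT EXISTS oss_log_external
(
    ts      STRING,
    content STRING
)
STORED BY 'com.aliyun.odps.TsvStorageHandler'
WITH SERDEPROPERTIES (
    -- ARN of the RAM role that MaxCompute assumes to read OSS data.
    'odps.properties.rolearn' = 'acs:ram::123456789:role/aliyunodpsdefaultrole'
)
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/my-bucket/log/';
```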
- The unstructured data processing framework of MaxCompute allows you to export MaxCompute data to OSS by using the INSERT statement. For more information, see Write data to OSS.
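As a sketch of this export path, the following statement writes query results to OSS. It assumes that `oss_output_external` is an OSS external table you created beforehand; the table and column names are hypothetical:

```sql
-- Writing query results into the OSS external table exports the data to OSS.
INSERT OVERWRITE TABLE oss_output_external
SELECT vehicle_id, total_price
FROM   ambulance_data_internal
WHERE  total_price > 100;
```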
- For more information about how to process data in various open source formats, see Open source data formats supported by OSS external tables.
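For open source formats such as Parquet or ORC, a `STORED AS` clause can be used instead of a storage handler. A minimal sketch with a hypothetical table, columns, and OSS path:

```sql
-- Hypothetical example: map Parquet files in OSS to an external table.
CREATE EXTERNAL TABLE IF NOT EXISTS oss_parquet_external
(
    user_id BIGINT,
    event   STRING
)
STORED AS PARQUET
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/my-bucket/parquet-data/';
```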