This topic describes how to use DataWorks to create and configure external tables. This topic also lists the data types supported in external tables.
Concepts
Concept | Description |
---|---|
Object Storage Service (OSS) | OSS supports storage classes including standard, infrequent access, and archive. It is applicable to various data storage scenarios. In addition, OSS can deeply integrate with the Hadoop open-source community and other products such as E-MapReduce (EMR), Batch Compute, MaxCompute, Machine Learning Platform for AI, and Function Compute. |
MaxCompute | MaxCompute is an efficient and fully managed data warehousing solution. When used in conjunction with OSS, it enables you to analyze and process large amounts of data with reduced costs. |
External tables of MaxCompute | Based on the computing framework V2.0 of MaxCompute, you can use external tables to directly query numerous files in OSS without loading data into MaxCompute tables. This reduces the time and labor required for data migration and lowers the storage costs. |
- Data stored in OSS is converted through the unstructured data processing framework and passed to user-defined interfaces using the InputStream Java class. You need to write code for the EXTRACT logic, that is, to read, parse, convert, and compute data from input streams. The format of extracted data records must be supported by MaxCompute.
- The extracted data records can be further processed by the SQL compute engine built in MaxCompute. More data records can be produced during the processing.
- The produced records are passed to the user-defined output logic for further computing and conversion. Finally, the system uses the OutputStream Java class to export required data in the records to OSS.
You can create, search for, configure, and process external tables by using a visual interface in the DataWorks console. You can also query, compute, and analyze data in external tables. DataWorks is powered by MaxCompute.
Network and access authorization
MaxCompute and OSS are deployed on separate clusters, so network connectivity between the two services may affect the ability of MaxCompute to access the data stored in OSS. We recommend that you use an internal endpoint, which ends with -internal.aliyuncs.com, when you access the data stored in OSS from MaxCompute.
- Perform STS authorization.
You need to authorize the account used for running MaxCompute jobs to access OSS data. STS is a token management service provided by the RAM service of Alibaba Cloud. With STS, authorized RAM users and cloud services can issue tokens with custom validity and permissions. Applications can use tokens to call Alibaba Cloud API operations to manipulate resources.
You can grant the permissions by using either of the following methods:
- If the MaxCompute project and OSS bucket are owned by the same Alibaba Cloud account, log on to the DataWorks console and perform authorization as follows:
- Open the editing page of a newly created table and find the Physical Model section.
- Set Table Type to External Table.
- Configure Storage Address and click Authorize.
- In the Cloud Resource Access Authorization dialog box, click Confirm Authorization Policy.
- Grant MaxCompute the permission to access OSS data in the RAM console.
- Log on to the RAM console.
Note If you use different accounts for logon to MaxCompute and OSS, you must log on to the RAM console using the OSS account to perform the following operations.
- In the left-side navigation pane, click RAM Roles.
- In the Create Role panel, select Alibaba Cloud Account for the Select Trusted Entity parameter and click Next.
- Configure RAM Role Name and Note.
Note The role name must be set to AliyunODPSDefaultRole or AliyunODPSRoleForOtherUser.
- Under Select Trusted Alibaba Cloud Account, select Current Alibaba Cloud Account or Other Alibaba Cloud Account.
Note If you select Other Alibaba Cloud Account, enter the account ID.
- Click OK.
- Configure the role details.
On the RAM Roles page, click the target role in the RAM Role Name column. On the Trusted Policy Management tab, click Edit Trust Policy and enter a policy as required.
- If the MaxCompute project and OSS bucket belong to the same account, enter the following content:
```json
{
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "odps.aliyuncs.com"
        ]
      }
    }
  ],
  "Version": "1"
}
```
- If the MaxCompute project and OSS bucket belong to different accounts, enter the following content:
```json
{
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "ID of the Alibaba Cloud account that owns the MaxCompute project@odps.aliyuncs.com"
        ]
      }
    }
  ],
  "Version": "1"
}
```
After the configuration is completed, click OK.
- Associate an authorization policy with the role. Click Add Permissions and search for the AliyunODPSRolePolicy policy, which grants the required OSS access, and attach it to the role. If you cannot find the policy this way, click Input and Attach and enter the following policy content:
{ "Version": "1", "Statement": [ { "Action": [ "oss:ListBuckets", "oss:GetObject", "oss:ListObjects", "oss:PutObject", "oss:DeleteObject", "oss:AbortMultipartUpload", "oss:ListParts" ], "Resource": "*", "Effect": "Allow" }, { "Action": [ "ots:ListTable", "ots:DescribeTable", "ots:GetRow", "ots:PutRow", "ots:UpdateRow", "ots:DeleteRow", "ots:GetRange", "ots:BatchGetRow", "ots:BatchWriteRow", "ots:ComputeSplitPointsBySize" ], "Resource": "*", "Effect": "Allow" } ] }
- Use an OSS data store.
If you have created and saved an OSS data store, you can view and use the data store.
Create an external table
- Use Data Definition Language (DDL) to create an external table.
Go to the Data Analytics page and use a DDL statement to create an external table. The DDL statement must comply with the MaxCompute syntax. If you have completed STS authorization, you do not need to include the odps.properties.rolearn attribute in the DDL statement.
The following sample DDL statement creates an external table. The EXTERNAL keyword indicates an external table:
```sql
CREATE EXTERNAL TABLE IF NOT EXISTS ambulance_data_csv_external
(
  vehicleId int,
  recordId int,
  patientId int,
  calls int,
  locationLatitute double,
  locationLongtitue double,
  recordTime string,
  direction string
)
STORED BY 'com.aliyun.odps.udf.example.text.TextStorageHandler'  -- Required. The STORED BY clause specifies the name of a custom storage handler class or another file format.
WITH SERDEPROPERTIES (
  'delimiter'='\\|',  -- Optional. The SERDEPROPERTIES clause specifies the parameters used to serialize or deserialize data. These parameters can be passed to the EXTRACT logic through DataAttributes.
  'odps.properties.rolearn'='acs:ram::xxxxxxxxxxxxx:role/aliyunodpsdefaultrole'
)
LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/oss-odps-test/Demo/SampleData/CustomTxt/AmbulanceData/'  -- Required. The LOCATION clause specifies the location of the external table.
USING 'odps-udf-example.jar';  -- Required if you use a custom format class. The USING clause specifies the JAR package that contains the custom format class.
```
STORED BY can be followed by the built-in storage handler for comma-separated values (CSV) or tab-separated values (TSV) files. The details are as follows:
- com.aliyun.odps.CsvStorageHandler: The table is stored as a CSV file. The column delimiter is a comma (,) and the newline character is \n. Example: STORED BY 'com.aliyun.odps.CsvStorageHandler'.
- com.aliyun.odps.TsvStorageHandler: The table is stored as a TSV file. The column delimiter is \t and the newline character is \n.
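For reference, the following is a minimal sketch of an external table that uses the built-in CSV handler. The table name, columns, bucket, and directory are hypothetical placeholders, and the rolearn property is omitted on the assumption that STS authorization is complete:
```sql
-- Hypothetical example: an external table over CSV files that uses the built-in CSV storage handler.
-- Replace the table name, columns, bucket, and directory with your own values.
CREATE EXTERNAL TABLE IF NOT EXISTS demo_csv_external (
  vehicleId int,
  recordId int,
  recordTime string
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'  -- Built-in handler: comma delimiter, \n newline character.
LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/your-bucket/demo/csv/';
```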
STORED BY can also be followed by a parameter indicating an open-source external table, such as ORC, Parquet, sequence file, Record Columnar File (RCFile), Avro, or text file. For example, you can specify the org.apache.hive.hcatalog.data.JsonSerDe class to save the table as a text file.
- org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: The table is stored as a text file.
- org.apache.hadoop.hive.ql.io.orc.OrcSerde: The table is stored in the Optimized Row Columnar (ORC) format.
- org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe: The table is stored in the Parquet format.
- org.apache.hadoop.hive.serde2.avro.AvroSerDe: The table is stored in the Avro format.
- org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: The table is stored as a sequence file.
The following DDL statement creates an open-source external table:
```sql
CREATE EXTERNAL TABLE [IF NOT EXISTS] <table name>
(<column schemas>)
[PARTITIONED BY (partition column schemas)]
[ROW FORMAT SERDE '<serde class>']
STORED AS <file format>
[WITH SERDEPROPERTIES (
  'odps.properties.rolearn'='${roleran}'
  [,'name2'='value2',...]
)]
LOCATION 'oss://${endpoint}/${bucket}/${userfilePath}/';
```
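For illustration, the following is a minimal sketch of the preceding template for a Parquet external table. The table name, columns, bucket, and directory are hypothetical placeholders, and the rolearn property is omitted on the assumption that STS authorization is complete:
```sql
-- Hypothetical example: an open-source external table stored in the Parquet format.
-- Replace the table name, columns, bucket, and directory with your own values.
CREATE EXTERNAL TABLE IF NOT EXISTS orders_parquet_external (
  order_id bigint,
  order_date string,
  total_price double
)
STORED AS PARQUET
LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/your-bucket/orders/parquet/';
```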
The following table lists the attributes of the SERDEPROPERTIES clause.

Attribute | Valid value | Default value | Description |
---|---|---|---|
odps.text.option.gzip.input.enabled | true or false | false | Indicates whether the file to be read is GZIP-compressed. |
odps.text.option.gzip.output.enabled | true or false | false | Indicates whether the file to be written is GZIP-compressed. |
odps.text.option.header.lines.count | N (a non-negative integer) | 0 | Skips the first N lines of the file. |
odps.text.option.null.indicator | A string | Empty string | The string that represents NULL. |
odps.text.option.ignore.empty.lines | true or false | true | Specifies whether to ignore blank lines. |
odps.text.option.encoding | UTF-8, UTF-16, or US-ASCII | UTF-8 | The encoding of the file. |

Note: MaxCompute can also read GZIP-compressed CSV and TSV files by using a built-in extractor. Whether the file is GZIP-compressed determines the attributes you need to specify.

The LOCATION parameter is in the oss://oss-cn-shanghai-internal.aliyuncs.com/Bucket name/Directory name format. You can select the OSS directory in the graphical user interface (GUI). You do not need to append a file name to the directory.
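These attributes can be combined in a single DDL statement. The following minimal sketch applies some of them to a text-format external table based on the preceding template. The table name, columns, bucket, and directory are hypothetical placeholders:
```sql
-- Hypothetical example: a text-format external table that reads GZIP-compressed files,
-- skips one header line, and treats the string "NULL" as a null value.
CREATE EXTERNAL TABLE IF NOT EXISTS orders_text_external (
  order_id bigint,
  order_date string,
  total_price double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS TEXTFILE
WITH SERDEPROPERTIES (
  'odps.text.option.gzip.input.enabled'='true',
  'odps.text.option.header.lines.count'='1',
  'odps.text.option.null.indicator'='NULL'
)
LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/your-bucket/orders/text/';
```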
You can find tables created by using DDL statements in the node tree on the Workspace Tables tab. You can modify the values of level 1 folder and level 2 folder to change the table locations.
- Create a Table Store external table.
The following statement creates a Table Store external table:
```sql
CREATE EXTERNAL TABLE IF NOT EXISTS ots_table_external (
  odps_orderkey bigint,
  odps_orderdate string,
  odps_custkey bigint,
  odps_orderstatus string,
  odps_totalprice double
)
STORED BY 'com.aliyun.odps.TableStoreStorageHandler'
WITH SERDEPROPERTIES (
  'tablestore.columns.mapping'=':o_orderkey,:o_orderdate,o_custkey,o_orderstatus,o_totalprice',
  'tablestore.table.name'='ots_tpch_orders',
  'odps.properties.rolearn'='acs:ram::xxxxx:role/aliyunodpsdefaultrole'
)
LOCATION 'tablestore://odps-ots-dev.cn-shanghai.ots-internal.aliyuncs.com';
```
The parameters in the statement are described as follows:
- com.aliyun.odps.TableStoreStorageHandler: the storage handler built into MaxCompute for processing data stored in Table Store.
- SERDEPROPERTIES: provides the options of the storage handler. You must specify tablestore.columns.mapping and tablestore.table.name when you use TableStoreStorageHandler.
  - tablestore.columns.mapping: Required. This parameter describes the columns of the table in Table Store that MaxCompute accesses, including the primary key columns and property columns. A primary key column is indicated by a colon (:) at the beginning of the column name. In this example, the primary key columns are :o_orderkey and :o_orderdate, and the others are property columns. Table Store supports up to four primary key columns of the STRING, INTEGER, and BINARY types. The first primary key column is the partition key. You must specify all primary key columns of the table in Table Store in the mapping, but only the property columns that MaxCompute accesses.
  - tablestore.table.name: the name of the table to access in Table Store. If the table does not exist in Table Store, an error is returned. MaxCompute does not create tables in Table Store.
- LOCATION: specifies the name and endpoint of the Table Store instance.
- Create a table in the GUI.
Go to the Data Analytics page and create a table in the GUI. An external table has the following attributes:
- General
- Table name (entered when you create a table)
- Display name
- Level 1 folder and level 2 folder
- Description
- Physical Model
- Table Type: Select External Table.
- Partitioning: Table Store external tables do not support partitioning.
- Storage Address: In the Physical Model section, you can enter the value of the LOCATION parameter as the storage address. You can also click Click to Select to select a storage address. Then, click Authorize.
- Storage Format: Select the file format as required. CSV, TSV, ORC, Parquet, sequence file, RCFile, Avro, text file, and custom file formats are supported. If you select a custom file format, you need to select the corresponding resource. After you commit a resource, DataWorks automatically parses out included class names and displays them in the Class Name drop-down list.
- rolearn: If you have completed STS authorization, you can leave it unspecified.
- Schema
Parameter | Description |
---|---|
Field Type | MaxCompute 2.0 supports fields of the TINYINT, SMALLINT, INT, BIGINT, VARCHAR, and STRING types. |
Actions | The add, change, and delete actions are supported. |
Definition or Maximum Value Length | You can set the maximum length of VARCHAR fields. For complex data types, enter the definition. |
Data types
Type | New | Example | Description |
---|---|---|---|
TINYINT | Yes | 1Y and -127Y | A signed eight-bit integer in the range of -128 to 127. |
SMALLINT | Yes | 32767S and -100S | A signed 16-bit integer in the range of -32768 to 32767. |
INT | Yes | 1000 and -15645787 | A signed 32-bit integer in the range of -2^31 to 2^31 - 1. |
BIGINT | No | 100000000000L and -1L | A signed 64-bit integer in the range of -2^63 + 1 to 2^63 - 1. |
FLOAT | Yes | None. | The 32-bit binary floating point type. |
DOUBLE | No | 3.1415926 and 1E+7 | An eight-byte double precision floating-point number (a 64-bit binary floating-point number). |
DECIMAL | No | 3.5BD and 99999999999.9999999BD | A decimal exact number. The integer part is in the range of -10^36 + 1 to 10^36 - 1, and the fractional part is accurate to 10^-18. |
VARCHAR(n) | Yes | None. | A variable-length character string, which can be 1 to 65535 characters in length. |
STRING | No | "abc", 'bcd', and "alibaba" | A string. Currently, the maximum length is 8 MB. |
BINARY | Yes | None. | A binary number. Currently, the maximum length is 8 MB. |
DATETIME | No | DATETIME '2017-11-11 00:00:00' | The data type for dates and times. UTC+8 is used as the standard time of the system. The range is from 0000-01-01 to 9999-12-31, accurate to milliseconds. |
TIMESTAMP | Yes | TIMESTAMP '2017-11-11 00:00:00.123456789' | The TIMESTAMP data type, which is independent of time zones. The range is from 0000-01-01 to 9999-12-31 23:59:59.999999999, accurate to nanoseconds. |
BOOLEAN | No | TRUE and FALSE | The value must be TRUE or FALSE. |
Type | Definition | Constructor |
---|---|---|
ARRAY | array< int >; array< struct< a:int, b:string >> | array(1, 2, 3); array(array(1, 2), array(3, 4)) |
MAP | map< string, string >; map< smallint, array< string>> | map("k1", "v1", "k2", "v2"); map(1S, array('a', 'b'), 2S, array('x', 'y')) |
STRUCT | struct< x:int, y:int>; struct< field1:bigint, field2:array< int>, field3:map< int, int>> | named_struct('x', 1, 'y', 2); named_struct('field1', 100L, 'field2', array(1, 2), 'field3', map(1, 100, 2, 200)) |
If you want to use the data types newly supported by MaxCompute 2.0, including TINYINT, SMALLINT, INT, FLOAT, VARCHAR, TIMESTAMP, BINARY, and the complex data types, you must include set odps.sql.type.system.odps2=true; before the statement used to create a table, and commit the set statement together with the table creation statement. If compatibility with Hive is required, we recommend that you also include the set odps.sql.hive.compatible=true; statement.
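For example, the following minimal sketch commits the set statement together with a table creation statement. The table and column names are hypothetical placeholders:
```sql
-- Enable MaxCompute 2.0 data types for this batch of statements, then create the table.
set odps.sql.type.system.odps2=true;
-- Optional: uncomment the next line if compatibility with Hive is required.
-- set odps.sql.hive.compatible=true;
CREATE TABLE IF NOT EXISTS type_demo (
  tiny_col TINYINT,        -- new integer type
  ts_col   TIMESTAMP,      -- new timestamp type
  tags     ARRAY<STRING>   -- complex data type
);
```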
View and process external tables
Go to the Data Analytics page and click Workspace Tables in the left-side navigation pane to view external tables. External tables are processed in the same way as internal tables.
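For example, after the sample external table earlier in this topic is created, a query such as the following minimal sketch reads its data the same way as it would for an internal table (the column selection is arbitrary):
```sql
-- Hypothetical query against the external table created earlier in this topic.
SELECT vehicleId, recordTime, direction
FROM ambulance_data_csv_external
LIMIT 10;
```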