This topic describes how to use DataWorks to create and configure external tables, and lists the data types that external tables support.

Concepts

The following table lists the concepts you need to know before using external tables.
Concept | Description
Object Storage Service (OSS) | OSS supports storage classes including standard, infrequent access, and archive, and is applicable to various data storage scenarios. OSS can also deeply integrate with the Hadoop open-source community and with products such as E-MapReduce (EMR), Batch Compute, MaxCompute, Machine Learning Platform for AI, and Function Compute.
MaxCompute | MaxCompute is an efficient and fully managed data warehousing solution. Used in conjunction with OSS, it enables you to analyze and process large amounts of data at reduced costs.
External tables of MaxCompute | Based on the MaxCompute V2.0 computing framework, external tables allow you to directly query files in OSS without loading the data into MaxCompute tables. This reduces the time and labor required for data migration and lowers storage costs.
Currently, MaxCompute supports external tables that store unstructured data, such as tables in OSS and Table Store. The unstructured data processing framework allows MaxCompute to import data from and export data to external tables in OSS and Table Store. The following steps take OSS external tables as an example to describe the processing logic:
  1. Data stored in OSS is converted through the unstructured data processing framework and passed to user-defined interfaces using the InputStream Java class. You need to write code for the EXTRACT logic, that is, to read, parse, convert, and compute data from input streams. The format of extracted data records must be supported by MaxCompute.
  2. The extracted data records can be further processed by the SQL compute engine built in MaxCompute. More data records can be produced during the processing.
  3. The produced records are passed to the user-defined output logic for further computing and conversion. Finally, the system uses the OutputStream Java class to export required data in the records to OSS.
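To make this round trip concrete, the following is a minimal SQL sketch. It assumes the ambulance_data_csv_external table that is defined later in this topic, plus a hypothetical output external table named ambulance_data_output that uses the same storage handler and schema.
    -- Steps 1 and 2: reading from the external table invokes the EXTRACT logic,
    -- and the extracted records are then processed by the MaxCompute SQL engine.
    -- Step 3: INSERT OVERWRITE passes the produced records to the output logic,
    -- which writes them back to OSS.
    INSERT OVERWRITE TABLE ambulance_data_output
    SELECT * FROM ambulance_data_csv_external WHERE calls > 0;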

You can create, search for, configure, and process external tables by using a visual interface in the DataWorks console. You can also query, compute, and analyze data in external tables. DataWorks is powered by MaxCompute.

Network and access authorization

MaxCompute and OSS run on separate clusters, so network connectivity between the two services may affect the ability of MaxCompute to access the data stored in OSS. We recommend that you use the private endpoint, which ends with -internal.aliyuncs.com, when you access the data stored in OSS from MaxCompute.

MaxCompute requires a secure authorized channel to access OSS data. MaxCompute uses Resource Access Management (RAM) and Security Token Service (STS) of Alibaba Cloud to ensure data access security. MaxCompute applies for data access permissions from STS as the table creator. The permission settings of Table Store are the same as those of OSS.
  1. Perform STS authorization.

    You need to authorize the account used for running MaxCompute jobs to access OSS data. STS is a token management service provided by the RAM service of Alibaba Cloud. With STS, authorized RAM users and cloud services can issue tokens with custom validity and permissions. Applications can use tokens to call Alibaba Cloud API operations to manipulate resources.

    You can grant the permissions by using either of the following methods:
    • If the MaxCompute project and OSS bucket are owned by the same Alibaba Cloud account, log on to the DataWorks console and perform authorization.
      1. Open the editing page of a newly created table and find the Physical Model section.
      2. Set Table Type to External Table.
      3. Configure Storage Address and click Authorize.
      4. In the Cloud Resource Access Authorization dialog box, click Confirm Authorization Policy.
    • Grant MaxCompute the permission to access OSS data in the RAM console.
      1. Log on to the RAM console.
        Note If you use different accounts to log on to MaxCompute and OSS, you must log on to the RAM console with the OSS account to perform the following operations.
      2. In the left-side navigation pane, click RAM Roles.
      3. Click Create RAM Role, select Alibaba Cloud Account, and then click Next.
      4. Configure RAM Role Name and Note.
        Note The role name must be set to AliyunODPSDefaultRole or AliyunODPSRoleForOtherUser.
      5. Under Select Trusted Alibaba Cloud Account, select Current Alibaba Cloud Account or Other Alibaba Cloud Account.
        Note If you select Other Alibaba Cloud Account, enter the account ID.
      6. Click OK.
      7. Configure the role details.
        On the RAM Roles page, click the target role in the RAM Role Name column. On the Trusted Policy Management tab, click Edit Trust Policy and enter a policy as required.
        --If the MaxCompute project and OSS bucket belong to the same account, enter the following content:
        {
          "Statement": [
            {
              "Action": "sts:AssumeRole",
              "Effect": "Allow",
              "Principal": {
                "Service": [
                  "odps.aliyuncs.com"
                ]
              }
            }
          ],
          "Version": "1"
        }
        --If the MaxCompute project and OSS bucket belong to different accounts, enter the following content:
        {
          "Statement": [
            {
              "Action": "sts:AssumeRole",
              "Effect": "Allow",
              "Principal": {
                "Service": [
                  "ID of the Alibaba Cloud account that owns the MaxCompute project@odps.aliyuncs.com"
                ]
              }
            }
          ],
          "Version": "1"
        }

        After the configuration is completed, click OK.

      8. Associate an authorization policy with the role. Click Add Permissions, search for the AliyunODPSRolePolicy policy that grants OSS access, and attach it to the role. If you cannot find the policy this way, click Input and Attach and enter the following policy:
        {
          "Version": "1",
          "Statement": [
            {
              "Action": [
                "oss:ListBuckets",
                "oss:GetObject",
                "oss:ListObjects",
                "oss:PutObject",
                "oss:DeleteObject",
                "oss:AbortMultipartUpload",
                "oss:ListParts"
              ],
              "Resource": "*",
              "Effect": "Allow"
            },
            {
              "Action": [
                "ots:ListTable",
                "ots:DescribeTable",
                "ots:GetRow",
                "ots:PutRow",
                "ots:UpdateRow",
                "ots:DeleteRow",
                "ots:GetRange",
                "ots:BatchGetRow",
                "ots:BatchWriteRow",
                "ots:ComputeSplitPointsBySize"
              ],
              "Resource": "*",
              "Effect": "Allow"
            }
          ]
        }
  2. Use an OSS data store.

    If you have created and saved an OSS data store, you can view and use it by choosing Workspace Manage > Data Source.

Create an external table

  1. Use Data Definition Language (DDL) to create an external table.

    Go to the Data Analytics page and use a DDL statement to create an external table. The DDL statement must comply with the MaxCompute syntax. If you have completed STS authorization, you do not need to include the odps.properties.rolearn attribute in the DDL statement.

    The following section provides a sample DDL statement, in which EXTERNAL indicates an external table:
    CREATE EXTERNAL TABLE IF NOT EXISTS ambulance_data_csv_external(
    vehicleId int,
    recordId int,
    patientId int,
    calls int,
    locationLatitude double,
    locationLongitude double,
    recordTime string,
    direction string
    )
    STORED BY 'com.aliyun.odps.udf.example.text.TextStorageHandler' -- Required. The STORED BY clause specifies the name of a custom storage handler class or a built-in storage handler.
    WITH SERDEPROPERTIES (
    'delimiter'='\\|', -- Optional. The SERDEPROPERTIES clause specifies the parameters used when serializing or deserializing data. These parameters can be passed to the EXTRACT logic through DataAttributes.
    'odps.properties.rolearn'='acs:ram::xxxxxxxxxxxxx:role/aliyunodpsdefaultrole'
    )
    LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/oss-odps-test/Demo/SampleData/CustomTxt/AmbulanceData/' -- Required. The LOCATION clause specifies the location of the external table.
    USING 'odps-udf-example.jar'; -- Required if you use a custom format class. The USING clause specifies the JAR package where the custom format class resides.
    STORED BY can be followed by the name of a built-in storage handler for comma-separated values (CSV) or tab-separated values (TSV) files:
    • com.aliyun.odps.CsvStorageHandler indicates that the table is stored as a CSV file. With this handler, the column delimiter is a comma (,) and the newline character is \n. One example is STORED BY 'com.aliyun.odps.CsvStorageHandler'.
    • com.aliyun.odps.TsvStorageHandler indicates that the table is stored as a TSV file. With this handler, the column delimiter is \t and the newline character is \n.
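    For reference, the following is a minimal sketch of a CSV external table that uses the built-in handler. The bucket and directory in LOCATION are placeholders, and STS authorization is assumed to be completed.
    CREATE EXTERNAL TABLE IF NOT EXISTS ambulance_data_builtin_csv(
    vehicleId int,
    recordTime string
    )
    STORED BY 'com.aliyun.odps.CsvStorageHandler'
    LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/your-bucket/your-dir/';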
    You can also create an external table in an open-source format, such as ORC, Parquet, sequence file, Record Columnar File (RCFile), Avro, or text file, by specifying a serializer/deserializer (SerDe) class in the ROW FORMAT SERDE clause and a file format in the STORED AS clause. For example, you can specify the org.apache.hive.hcatalog.data.JsonSerDe class to save the table as a text file in JSON format. The following SerDe classes are available:
    • org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: The table is stored as a text file.
    • org.apache.hadoop.hive.ql.io.orc.OrcSerde: The table is stored in the Optimized Row Columnar (ORC) format.
    • org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe: The table is stored in the Parquet format.
    • org.apache.hadoop.hive.serde2.avro.AvroSerDe: The table is stored in the Avro format.
    • org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: The table is stored as a sequence file.
    The following section provides a DDL statement for creating an open-source external table:
      CREATE EXTERNAL TABLE [IF NOT EXISTS] <table name> (<column schemas>)
      [PARTITIONED BY (partition column schemas)]
      [ROW FORMAT SERDE '<serde class>'
        [WITH SERDEPROPERTIES ('odps.properties.rolearn'='${rolearn}'
        [,'name2'='value2',...])]
      ]
      STORED AS <file format>
      LOCATION 'oss://${endpoint}/${bucket}/${userfilePath}/';
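    As a concrete sketch of this template, the following statement creates a Parquet external table. The bucket and directory are placeholders, and STS authorization is assumed to be completed, so odps.properties.rolearn is omitted.
      CREATE EXTERNAL TABLE IF NOT EXISTS orders_parquet_external(
      o_orderkey bigint,
      o_totalprice double
      )
      STORED AS PARQUET
      LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/your-bucket/parquet-data/';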
    The following table lists the attributes of the SERDEPROPERTIES clause.
    Attribute | Valid values | Default value | Description
    odps.text.option.gzip.input.enabled | true or false | false | Specifies whether the file to be read is GZIP-compressed.
    odps.text.option.gzip.output.enabled | true or false | false | Specifies whether the file to be written is GZIP-compressed.
    odps.text.option.header.lines.count | N (a non-negative integer) | 0 | Skips the first N lines of the file.
    odps.text.option.null.indicator | A string | Empty string | The string that represents NULL.
    odps.text.option.ignore.empty.lines | true or false | true | Specifies whether to ignore blank lines.
    odps.text.option.encoding | UTF-8, UTF-16, or US-ASCII | UTF-8 | The encoding of the file.
    Note MaxCompute can also read GZIP-compressed CSV and TSV files by using the built-in extractor. Whether a file is GZIP-compressed determines which of the preceding attributes you need to specify.
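    For example, the following sketch declares a CSV external table whose files are GZIP-compressed. The bucket and directory are placeholders.
    CREATE EXTERNAL TABLE IF NOT EXISTS ambulance_data_csv_gz_external(
    vehicleId int,
    recordTime string
    )
    STORED BY 'com.aliyun.odps.CsvStorageHandler'
    WITH SERDEPROPERTIES (
    'odps.text.option.gzip.input.enabled'='true',
    'odps.text.option.gzip.output.enabled'='true'
    )
    LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/your-bucket/csv-gz-data/';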

    The LOCATION parameter is in the oss://oss-cn-shanghai-internal.aliyuncs.com/Bucket name/Directory name format. You can obtain the OSS directory in the graphical user interface (GUI). Do not append a file name after the directory name.

    You can find tables created by using DDL statements in the node tree on the Workspace Tables tab. You can change a table's location by modifying its Level 1 Folder and Level 2 Folder values.

  2. Create a Table Store external table.
    The following section describes a statement used to create a Table Store external table:
    CREATE EXTERNAL TABLE IF NOT EXISTS ots_table_external(
    odps_orderkey bigint,
    odps_orderdate string,
    odps_custkey bigint,
    odps_orderstatus string,
    odps_totalprice double
    )
    STORED BY 'com.aliyun.odps.TableStoreStorageHandler'
    WITH SERDEPROPERTIES (
    'tablestore.columns.mapping'=':o_orderkey,:o_orderdate,o_custkey,o_orderstatus,o_totalprice',
    'tablestore.table.name'='ots_tpch_orders',
    'odps.properties.rolearn'='acs:ram::xxxxx:role/aliyunodpsdefaultrole'
    )
    LOCATION 'tablestore://odps-ots-dev.cn-shanghai.ots-internal.aliyuncs.com';
    The parameters in the statement are described as follows:
    • com.aliyun.odps.TableStoreStorageHandler specifies the storage handler built in MaxCompute for processing data stored in Table Store.
    • SERDEPROPERTIES provides options for parameters. You must specify tablestore.columns.mapping and tablestore.table.name when using TableStoreStorageHandler.
      • tablestore.columns.mapping: Required. This parameter describes the columns of the table in Table Store that MaxCompute accesses, including the primary key columns and property columns. A primary key column is indicated by a colon (:) at the beginning of its name. In this example, the primary key columns are :o_orderkey and :o_orderdate. The others are property columns.

        Table Store supports up to four primary key columns. The data types include STRING, INTEGER, and BINARY. The first primary key column is the partition key. You must specify all primary key columns of the table in Table Store when specifying the mapping. You only need to specify the property columns that MaxCompute accesses instead of specifying all property columns.

      • tablestore.table.name: the name of the table to access in Table Store. If the specified table does not exist in Table Store, an error is returned. MaxCompute does not create a table in Table Store.
    • LOCATION: specifies the name and endpoint of the Table Store instance.
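    After the external table is created, you can query the Table Store data directly. The following is a minimal sketch that assumes the ots_table_external table above; the filter value is a placeholder.
    SELECT odps_orderkey, odps_totalprice
    FROM ots_table_external
    WHERE odps_orderstatus = 'F';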
  3. Create a table in the GUI.
    Go to the Data Analytics page and create a table in the GUI. An external table has the following attributes:
    • General
      • Table name (entered when you create a table)
      • Display name
      • Level 1 folder and level 2 folder
      • Description
    • Physical Model
      • Table Type: Select External Table.
      • Partitioning: Table Store external tables do not support partitioning.
      • Storage Address: In the Physical Model section, you can enter the value of the LOCATION parameter as the storage address. You can also click Click to Select to select a storage address. Then, click Authorize.
      • Storage Format: Select the file format as required. CSV, TSV, ORC, Parquet, sequence file, RCFile, Avro, text file, and custom file formats are supported. If you select a custom file format, you need to select the corresponding resource. After you commit a resource, DataWorks automatically parses out included class names and displays them in the Class Name drop-down list.
      • rolearn: If you have completed STS authorization, you can leave it unspecified.
    • Schema
      Parameter | Description
      Field Type | MaxCompute 2.0 supports fields of the TINYINT, SMALLINT, INT, BIGINT, VARCHAR, and STRING types.
      Actions | The add, change, and delete actions are supported.
      Definition or Maximum Value Length | You can set the maximum length of VARCHAR fields. For complex data types, enter the type definition.

Data types

The following table describes the data types supported in external tables.
Type | New in MaxCompute 2.0 | Example | Description
TINYINT | Yes | 1Y and -127Y | A signed 8-bit integer in the range of -128 to 127.
SMALLINT | Yes | 32767S and -100S | A signed 16-bit integer in the range of -32768 to 32767.
INT | Yes | 1000 and -15645787 | A signed 32-bit integer in the range of -2^31 to 2^31 - 1.
BIGINT | No | 100000000000L and -1L | A signed 64-bit integer in the range of -2^63 + 1 to 2^63 - 1.
FLOAT | Yes | None | A 32-bit binary floating-point number.
DOUBLE | No | 3.1415926 and 1E+7 | A 64-bit binary floating-point number (eight-byte double precision).
DECIMAL | No | 3.5BD and 99999999999.9999999BD | An exact decimal number. The integer part is in the range of -10^36 + 1 to 10^36 - 1, and the fractional part is accurate to 10^-18.
VARCHAR(n) | Yes | None | A variable-length character string, where n is in the range of 1 to 65535.
STRING | No | "abc", 'bcd', and "alibaba" | A string. Currently, the maximum length is 8 MB.
BINARY | Yes | None | Binary data. Currently, the maximum length is 8 MB.
DATETIME | No | DATETIME '2017-11-11 00:00:00' | The date and time type, using UTC+8 as the standard system time. The range is from 0000-01-01 to 9999-12-31, accurate to milliseconds.
TIMESTAMP | Yes | TIMESTAMP '2017-11-11 00:00:00.123456789' | The timestamp type, which is independent of time zones. The range is from 0000-01-01 00:00:00.000000000 to 9999-12-31 23:59:59.999999999, accurate to nanoseconds.
BOOLEAN | No | TRUE and FALSE | The value is TRUE or FALSE.
The following table lists complex field types supported in external tables.
Type | Definition | Constructor
ARRAY | array<int>; array<struct<a:int, b:string>> | array(1, 2, 3); array(array(1, 2), array(3, 4))
MAP | map<string, string>; map<smallint, array<string>> | map("k1", "v1", "k2", "v2"); map(1S, array('a', 'b'), 2S, array('x', 'y'))
STRUCT | struct<x:int, y:int>; struct<field1:bigint, field2:array<int>, field3:map<int, int>> | named_struct('x', 1, 'y', 2); named_struct('field1', 100L, 'field2', array(1, 2), 'field3', map(1, 100, 2, 200))

If you want to use the data types newly supported by MaxCompute 2.0, including TINYINT, SMALLINT, INT, FLOAT, VARCHAR, TIMESTAMP, BINARY, and the complex data types, you must include set odps.sql.type.system.odps2=true; before the table creation statement and commit the two statements together. If compatibility with Hive is required, we recommend that you also include the set odps.sql.hive.compatible=true; statement, as shown in the sketch below.
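The following minimal sketch shows the set statements committed together with a table creation statement; the table name, columns, and OSS path are placeholders.
set odps.sql.type.system.odps2=true;
set odps.sql.hive.compatible=true;
CREATE EXTERNAL TABLE IF NOT EXISTS demo_new_types_external(
tiny_col TINYINT,
ts_col TIMESTAMP,
tags ARRAY<STRING>
)
STORED AS PARQUET
LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/your-bucket/parquet-data/';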

View and process external tables

Go to the Data Analytics page and click Workspace Tables in the left-side navigation pane to view external tables. External tables are processed in the same way as internal tables.
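For example (a sketch that assumes the ambulance_data_csv_external table created earlier in this topic), you can query an external table directly or materialize it into an internal table with the same SQL that you would use for internal tables:
SELECT vehicleId, COUNT(*) AS record_count
FROM ambulance_data_csv_external
GROUP BY vehicleId;

-- Materialize the external data into an internal table for faster repeated access.
CREATE TABLE ambulance_data_internal AS
SELECT * FROM ambulance_data_csv_external;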