This topic describes how to import data from Tablestore to MaxCompute to establish seamless connections between multiple data sources.
Tablestore is a NoSQL database service that is built on the Apsara distributed operating system. Tablestore allows you to store large volumes of structured data and access the data in real time. For more information, see Tablestore documentation.
You can create, search for, configure, and process external tables in the DataWorks console. You can also query and analyze data by using the external table feature. For more information, see External table.
You must ensure the network connectivity between MaxCompute and Tablestore. If you use Alibaba Cloud MaxCompute to access a Tablestore instance, we recommend that you use the internal endpoint of the Tablestore instance. The endpoint ends with ots-internal.aliyuncs.com. Example: tablestore://odps-ots-dev.cn-shanghai.ots-internal.aliyuncs.com.
MaxCompute data type | Tablestore data type |
---|---|
STRING | STRING |
BIGINT | INTEGER |
DOUBLE | DOUBLE |
BOOLEAN | BOOLEAN |
BINARY | BINARY |
STS authorization
To use MaxCompute to access data in Tablestore, you must configure a secure authorization channel. MaxCompute uses Alibaba Cloud Resource Access Management (RAM) and Security Token Service (STS) to ensure secure access to data.
- One-click authorization. If the MaxCompute project and the Tablestore instance that you want to use belong to the same Alibaba Cloud account, you must log on to the Alibaba Cloud Management Console by using the Alibaba Cloud account and then perform one-click authorization on the Cloud Resource Access Authorization page of the RAM console. For more information, see Cloud Resource Access Authorization.
- Custom authorization
- Authorize access from MaxCompute to Tablestore in the RAM console.
Log on to the RAM console and create a role such as AliyunODPSDefaultRole or AliyunODPSRoleForOtherUser. If the MaxCompute project and the Tablestore instance that you want to use belong to different Alibaba Cloud accounts, you must log on to the RAM console by using the Alibaba Cloud account to which the Tablestore instance belongs.
- Modify policy settings.
-- If the MaxCompute project and the Tablestore instance that you want to use belong to the same Alibaba Cloud account, use the following policy configuration: { "Statement": [ { "Action": "sts:AssumeRole", "Effect": "Allow", "Principal": { "Service": [ "odps.aliyuncs.com" ] } } ], "Version": "1" } -- If the MaxCompute project and the Tablestore instance that you want to use belong to different Alibaba Cloud accounts, use the following policy configuration: { "Statement": [ { "Action": "sts:AssumeRole", "Effect": "Allow", "Principal": { "Service": [ "ID of the Alibaba Cloud account to which the MaxCompute project belongs@odps.aliyuncs.com" ] } } ], "Version": "1" }
You can click your profile picture in the upper-right corner of the current page to view the ID of your Alibaba Cloud account. - Edit the AliyunODPSRolePolicy policy for the role.
{ "Version": "1", "Statement": [ { "Action": [ "ots:ListTable", "ots:DescribeTable", "ots:GetRow", "ots:PutRow", "ots:UpdateRow", "ots:DeleteRow", "ots:GetRange", "ots:BatchGetRow", "ots:BatchWriteRow", "ots:ComputeSplitPointsBySize" ], "Resource": "*", "Effect": "Allow" } ] } -- You can also specify other permissions based on your business requirements.
- Attach the AliyunODPSRolePolicy policy to the role.
- Authorize access from MaxCompute to Tablestore in the RAM console.
Create an external table
MaxCompute provides the external table feature. You can use external tables to import data from Tablestore to the metadata system of MaxCompute for processing. The following section describes how to create a Tablestore external table:
DROP TABLE IF EXISTS ots_table_external;
CREATE EXTERNAL TABLE IF NOT EXISTS ots_table_external
(
odps_orderkey bigint,
odps_orderdate string,
odps_custkey bigint,
odps_orderstatus string,
odps_totalprice double,
odps_createdate timestamp
)
STORED BY 'com.aliyun.odps.TableStoreStorageHandler'
WITH SERDEPROPERTIES (
'tablestore.columns.mapping'=':o_orderkey,:o_orderdate,o_custkey,o_orderstatus,o_totalprice',
'tablestore.table.name'='ots_tpch_orders',
'odps.properties.rolearn'='acs:ram::xxxxx:role/aliyunodpsdefaultrole',
'tablestore.timestamp.ticks.unit'='seconds',
'tablestore.column.odps_createdate.timestamp.ticks.unit'='millis'
)
LOCATION 'tablestore://odps-ots-dev.cn-shanghai.ots-internal.aliyuncs.com';
com.aliyun.odps.TableStoreStorageHandler
is a built-in MaxCompute storage handler that is used to process Tablestore data. This storage handler defines the interaction between MaxCompute and Tablestore. The related logic is implemented by MaxCompute.SERDEPROPERITES
provides parameters. If you use TableStoreStorageHandler, you must configure three required parameters. You can also configure two optional parameters based on your business requirements.- Three required parameters:
- tablestore.columns.mapping: the columns of the Tablestore table that you want to access. The columns include
primary key columns and attribute columns.
- The column whose name starts with a colon (:) is a primary key column, such as
:o_orderkey
and:o_orderdate
in the preceding example. Other columns are attribute columns. - A Tablestore table supports up to four primary key columns. The data types of the primary key columns must be STRING, INTEGER, or BINARY. The first primary key column is the partition key column.
- You must specify all the primary key columns of the Tablestore table and the attribute columns that you want to access. Make sure that all the attribute columns that you specify are in the Tablestore table. Otherwise, errors are returned when you query data by using the external table.
- The column whose name starts with a colon (:) is a primary key column, such as
- tablestore.table.name: the name of the Tablestore table that you want to access. If you specify an invalid table name or the table name that you specify does not exist, an error is returned. MaxCompute does not actively create a Tablestore table.
- odps.properties.rolearn: the Alibaba Cloud Resource Name (ARN) of the AliyunODPSDefaultRole role in RAM. You can obtain the ARN on the RAM Roles page of the RAM console.
- tablestore.columns.mapping: the columns of the Tablestore table that you want to access. The columns include
primary key columns and attribute columns.
- Two optional parameters:
- tablestore.timestamp.ticks.unit: the time unit for tables. This parameter specifies that all fields of the INTEGER data type in the external table use the same time unit. Valid values: seconds, millis, micros, and nanos.
- tablestore.column.<col1_name>.timestamp.ticks.unit: the time unit for columns. This parameter specifies the time unit of a specific column in the external table. Valid values: seconds, millis, micros, and nanos.
Note You can use the preceding optional parameters to map the fields of the INTEGER data type in a Tablestore table to the fields of the TIMESTAMP data type in a MaxCompute table. If both parameters are specified, the tablestore.column.<col1_name>.timestamp.ticks.unit parameter has a higher priority than the tablestore.timestamp.ticks.unit parameter.
- Three required parameters:
- LOCATION specifies the name and endpoint of the Tablestore instance. You must complete RAM authorization or STS authorization to ensure secure access to Tablestore data.
desc extended <table_name>;
In the execution result, Extended Info includes the basic information about the external table, the information about the storage handler, and the location of the external table.
Query data by using an external table
SELECT odps_orderkey, odps_orderdate, SUM(odps_totalprice) AS sum_total
FROM ots_table_external
WHERE odps_orderkey > 5000 AND odps_orderkey < 7000 AND odps_orderdate >= '1996-05-03' AND odps_orderdate < '1997-05-01'
GROUP BY odps_orderkey, odps_orderdate
HAVING sum_total> 400000.0;
If you access Tablestore data by using MaxCompute SQL statements, all operations, such as the selection of column names, are performed in MaxCompute. In the preceding example, the column names are odps_orderkey and odps_totalprice rather than the names of the primary key column o_orderkey and attribute column o_totalprice in the Tablestore table. This is because the mappings are defined in the DDL statement that is used to create the external table. You can also retain the names of the primary key columns and attribute columns in the Tablestore table based on your business requirements.
CREATE TABLE internal_orders AS
SELECT odps_orderkey, odps_orderdate, odps_custkey, odps_totalprice
FROM ots_table_external
WHERE odps_orderkey > 5000 ;
internal_orders is a MaxCompute table that supports all features of a MaxCompute internal table. The internal_orders table uses the efficiently compressed column store and contains complete internal macro data and statistical information. The internal_orders table is stored in MaxCompute. Therefore, you can access the internal_orders table faster than a Tablestore table. This method is suitable for data that needs to be computed multiple times.
Export data from MaxCompute to Tablestore
INSERT OVERWRITE TABLE
statement. Example: INSERT OVERWRITE TABLE ots_table_external
SELECT odps_orderkey, odps_orderdate, odps_custkey, CONCAT(odps_custkey, 'SHIPPED'), CEIL(odps_totalprice)
FROM internal_orders;
DISTRIBUTE BY rand()
. Example: INSERT OVERWRITE TABLE ots_table_external
SELECT odps_orderkey, odps_orderdate, odps_custkey, CONCAT(odps_custkey, 'SHIPPED'), CEIL(odps_totalprice)
FROM (SELECT * FROM internal_orders DISTRIBUTE BY rand()) t;
Tablestore is a NoSQL data storage service that stores data in the key-value pair format. Data outputs from MaxCompute affect only the rows that include the primary keys of the Tablestore table. In this example, only the rows that include odps_orderkey and odps_orderdate are affected. Only the attribute columns that are specified when you create the ots_table_external table are updated. The columns that are not included in the external table are not modified.
- If the size of data that you want to write from MaxCompute to Tablestore at a time
is greater than 4 MB, an error may occur. In this case, you must remove the excess
data and then write data to Tablestore again.
ODPS-0010000:System internal error - Output to TableStore failed with exception: TableStore BatchWrite request id XXXXX failed with error code OTSParameterInvalid and message:The total data size of BatchWriteRow request exceeds the limit
- Writing multiple data entries at the same time or by row is considered a single operation. For more information, see BatchWriteRow. If you want to write large amounts of data at a time, you can write the data by row.
- If you want to write multiple data entries at a time, make sure that you do not write
duplicate rows. If duplicate rows exist, the following error may occur:
ErrorCode: OTSParameterInvalid, ErrorMessage: The input parameter is invalid
For more information, see What do I do if OTSParameterInvalid is reported when I use BatchWriteRow to submit 100 data entries at a time.