A StarRocks connector seamlessly accesses a MaxCompute data source and performs complex SQL queries using an external catalog, without the need to import data or create an external table. This improves data analytics efficiency and reduces operational difficulties and costs. This topic describes how to use a StarRocks connector to access MaxCompute.
Background
StarRocks is a next-generation, high-speed Massively Parallel Processing (MPP) database. StarRocks meets various enterprise-level analysis requirements, such as multi-dimensional analysis, custom report creation, real-time data analytics, and ad hoc data analytics. For more information about StarRocks, see What is StarRocks?. StarRocks supports the data catalog feature. The data catalog feature lets you manage internal and external data within the same system. This lets you easily access and query data from various external data sources. For more information, see Catalog overview.
Prerequisites
A StarRocks cluster (V3.2.3 or later) is deployed, and the StarRocks cluster can access MaxCompute. For information about how to deploy a StarRocks cluster, see Deploy a shared-nothing StarRocks cluster using Docker Compose.
You have purchased an exclusive resource group for Data Transmission Service (subscription) .
Limits
You can read data from standard tables, partitioned tables, clustered tables, Delta tables, and materialized views. You cannot read data from foreign tables or logical views.
You cannot read JSON data.
Only MaxCompute projects that do not have the schema feature enabled are supported. For more information about the schema feature, see Schema-related operations.
Create a MaxCompute catalog
Log on to the StarRocks cluster and create a MaxCompute catalog. For information about external catalogs, see CREATE EXTERNAL CATALOG.
Syntax
CREATE EXTERNAL CATALOG <catalog_name>
[COMMENT <comment>]
PROPERTIES
(
"type" = "odps",
CatalogParams,
ScanParams,
CachingMetaParams
)Parameters
Common parameters
Parameter
Required
Description
catalog_name
Yes
The name of the MaxCompute catalog. The name must meet the following requirements:
The name can contain only letters (a to z or A to Z), digits (0 to 9), or underscores (_) and must start with a letter.
The name cannot exceed 1,023 characters in length.
The catalog name is case-sensitive.
type
Yes
The data source type. Set this parameter to
odps.comment
No
The description of the MaxCompute catalog.
CatalogParams
Yes
The parameters that are used for StarRocks to connect to MaxCompute.
ScanParams
No
The parameters that are used for StarRocks to connect to MaxCompute file storage.
CachingMetaParams
No
The parameters that define the metadata cache policy.
Parameters configurable in CatalogParams
Parameter
Required
Description
odps.endpoint
Yes
The endpoint of the MaxCompute service. You must configure this parameter based on the region and network connectivity mode that you selected when you create the MaxCompute project. For information about the endpoints that correspond to different regions and network connectivity modes, see Endpoints.
ImportantOnly Alibaba Cloud VPC networks are supported.
odps.project
Yes
The name of the MaxCompute project. You can log on to the MaxCompute console and choose in the navigation pane on the left to obtain the MaxCompute project name.
NoteIf your workspace is in standard mode, you must distinguish the project name in the development environment (_dev) from the project name in the production environment when you configure this parameter.
odps.access.id
Yes
The AccessKey ID of your Alibaba Cloud account or RAM user. You can go to the AccessKey Management page to obtain the AccessKey ID.
odps.access.key
Yes
The AccessKey secret that corresponds to the AccessKey ID.
odps.tunnel.quota
Yes
The name of the quota used to access MaxCompute. Two types of resources are supported for accessing MaxCompute: exclusive resource groups for Data Transmission Service (subscription) and open storage (pay-as-you-go). You can specify only one quota name.
Subscription: Log on to the MaxCompute console, switch the region in the upper-left corner, and then choose Manage Configurations > Quotas in the navigation pane on the left to view the list of available quotas. For more information, see Computing resources - Quota management.
Pay-as-you-go: Log on to the MaxCompute console, switch the region in the upper-left corner, and then choose Manage Configurations > Tenant Property in the navigation pane on the left. Turn on the Storage API Switch switch to grant the required permissions.
Parameters configurable in ScanParams
Parameter
Required
Description
odps.split.policy
No
The data sharding policy that is used during data scanning. Valid values:
size: Data is sharded by data size. This is the default value. The default size for sharding is 256 MB.
row_offset: Data is sharded by the number of rows.
NoteThe sharding policy significantly affect the data scan concurrency of the compute engine. If the default policy does not meet your business requirements, you can use the
row_offsetpolicy together with theodps.split.row.countparameter to make adjustments as needed.odps.split.row.count
No
The maximum number of rows of data contained in each shard. Default value: 4194304. Value range: values greater than 0.
NoteThis parameter is required only when you set the
odps.split.policyparameter torow_offset.Parameters configurable in CachingMetaParams
Parameter
Required
Description
odps.cache.table.enable
No
Specifies whether to cache table metadata. Valid values:
true: caches table metadata. This is the default value. Table metadata caching helps improve metadata retrieval and update speed. Table metadata caching is suitable for scenarios in which you need to frequently perform DDL operations, a large environment that involves multiple users is used, or you want to optimize complex queries.
false: does not cache table metadata. The caching resources are limited, and excessive dependence on caching may exhaust memory resources. You can disable table metadata caching in scenarios in which you do not need to frequently query data or table metadata frequently changes.
odps.cache.table.expire
No
The duration after which the cached table metadata is automatically deleted. Unit: seconds (s). Default value: 86400, which is equal to 24 hours. Value range: values greater than or equal to 0.
odps.cache.table.size
No
The number of tables whose metadata needs to be cached. Default value: 1000.
odps.cache.partition.enable
No
Specifies whether to cache the metadata of all partitions in the desired table. Valid values:
true: caches the metadata of all partitions in the desired table. This is the default value.
false: does not cache the metadata of any partitions in the desired table.
NoteThe
odps.cache.partition.enableparameter does not take effect for non-partitioned tables.odps.cache.partition.expire
No
The duration after which the cached metadata of all partitions is automatically deleted. Unit: seconds (s). Default value: 86400, which is equal to 24 hours. Value range: values greater than or equal to 0.
odps.cache.partition.size
No
The number of partitioned tables that can be cached. Default value: 1000.
odps.cache.table-name.enable
No
Specifies whether to cache the names of all tables in the MaxCompute project. Valid values:
true: caches the names of all tables in the MaxCompute project.
false: does not cache the names of all tables in the MaxCompute project. This is the default value.
odps.cache.table-name.expire
No
The duration after which the cached names of tables in the MaxCompute project are automatically deleted. Unit: seconds (s). Default value: 86400, which is equal to 24 hours. Value range: values greater than or equal to 0.
Example
This section provides an example of how to create a MaxCompute catalog named odps_catalog. This example uses the MaxCompute project mf_mc_bj.
-- Create a catalog.
CREATE EXTERNAL CATALOG odps_catalog PROPERTIES(
"type"="odps",
"odps.access.id"="<yourAccessKeyId>",
"odps.access.key"="<yourAccessKeySecret>",
"odps.endpoint"="http://service.cn-beijing.maxcompute.aliyun.com/api",
"odps.tunnel.quota"="pay-as-you-go",
"odps.project"="mf_mc_bj"
);Access MaxCompute
Switch to the MaxCompute catalog and project that you created. This example uses the MaxCompute catalog
odps_catalogand the MaxCompute projectmf_mc_bj.Syntax
Method 1
-- Specify a catalog. SET CATALOG <catalog_name>; -- Specify a MaxCompute project. USE <project_name>;Method 2
USE <catalog_name>.<project_name>;
Example
SET CATALOG odps_catalog; USE mf_mc_bj;
Query data. This example uses the
srctable.
SELECT * FROM src LIMIT 10;The following result is returned.
+------+-------+
| key | value |
+------+-------+
| 1 | 1 |
| 3 | 3 |
| 2 | 2 |
| 4 | 100 |
| 5 | 200 |
| 6 | 300 |
| 3 | 400 |
+------+-------+Other operations
View MaxCompute catalogs
Query all catalogs in the current StarRocks cluster.
SHOW CATALOGS;The following result is returned.
+-----------------+----------+------------------------------------------------------------------+ | Catalog | Type | Comment | +-----------------+----------+------------------------------------------------------------------+ | default_catalog | Internal | An internal catalog contains this cluster‘s self-managed tables. | | odps_catalog | Odps | NULL | +-----------------+----------+------------------------------------------------------------------+Query the creation statement for a specific catalog.
SHOW CREATE CATALOG odps_catalog;The following result is returned.
+--------------+--------------------------------------------------------------------------+ | Catalog | Create Catalog | +--------------+--------------------------------------------------------------------------+ | odps_catalog | CREATE EXTERNAL CATALOG `odps_catalog` PROPERTIES ("odps.endpoint" = "http://service.cn-beijing.maxcompute.aliyun.com/api", "odps.access.id" = "<yourAccessKeyId>", "odps.access.key" = "<yourAccessKeySecret>", "odps.project" = "odps_project", "type" = "odps" )| +-----------------------------------------------------------------------------------------+
View the table schema of a MaxCompute table
Query the schema of a MaxCompute table.
Syntax
DESC[RIBE] <catalog_name>.<database_name>.<table_name>;Parameters
catalog_name: Required. The name of the catalog.
database_name: Required. The name of the MaxCompute project in the catalog.
table_name: Required. The name of the table in the MaxCompute project.
Example
DESC odps_catalog.mf_mc_bj.src;The following result is returned.
+-------+------+------+-------+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------+------+------+-------+---------+-------+ | id | INT | Yes | false | NULL | | | a | INT | Yes | false | NULL | | +-------+------+------+-------+---------+-------+
Query the schema of a MaxCompute table and its creation statement.
Syntax
SHOW CREATE TABLE <catalog_name>.<database_name>.<table_name>;Parameters
catalog_name: Required. The name of the catalog.
database_name: Required. The name of the MaxCompute project in the catalog.
table_name: Required. The name of the table in the MaxCompute project.
Example
SHOW CREATE TABLE odps_catalog.mf_mc_bj.src;The following result is returned.
+-------+--------------------------------------------------------------------------------+ | Table | Create Table | +-------+--------------------------------------------------------------------------------+ | src | CREATE TABLE `src` ( `id` int(11) DEFAULT NULL, `a` int(11) DEFAULT NULL ) | +-------+--------------------------------------------------------------------------------+
Delete a MaxCompute catalog
You can run the following command to delete a MaxCompute catalog.
DROP CATALOG odps_catalog;Collect statistical information for Cost-Based Optimizer (CBO)
StarRocks does not support automatic collection of statistical information for a MaxCompute table. You can run the following command to manually create a task to collect the statistical information.
-- Replace <table_name> with the name of the MaxCompute table.
ANALYZE TABLE <table_name>;The StarRocks Cost-Based Optimizer (CBO) estimates the execution cost of SQL queries based on various types of statistical information and selects the execution plan with the lowest cost from tens of thousands of potential execution plans. This improves the efficiency and performance of complex queries. For more information, see Gather statistics for CBO.
Manually update the cached metadata
By default, StarRocks caches MaxCompute metadata to improve query performance. If you change a table schema or update data in a table, you can manually update the table's metadata. This ensures that StarRocks retrieves the latest metadata cache as soon as possible.
-- Replace <table_name> with the name of the MaxCompute table.
REFRESH EXTERNAL TABLE <table_name>;FAQ
Error message "Your slot quota is exceeded" is reported during data reads.
Problem analysis
This error occurs when the data read operation exceeds the quota of the Data Transmission Service. When StarRocks reads data, it splits the table into multiple shards and distributes them to backend (BE) nodes. The BE nodes then read the data concurrently using a thread pool. Therefore, the number of slots occupied during the read operation is typically calculated as follows:
Max(Number of shards, Number of BEs × Concurrency of the BE thread pool).Solutions
Purchase an exclusive resource group for Data Transmission Service that supports high concurrency. For more information, see Purchase and use an exclusive resource group for Data Transmission Service.
Configure the
ScanParamsparameter to adjust the number of shards. If you use therow_offsetsharding policy, you must also increase the value of theodps.split.row.countparameter.Modify the number of threads in the StarRocks BE thread pool by changing the
scanner_thread_pool_queue_sizeandscanner_thread_pool_thread_numparameters. For more information about how to modify these parameters, see What is StarRocks?.
Appendix: Data type mappings
Field data type in MaxCompute | Field data type in StarRocks |
BOOLEAN | BOOLEAN |
TINYINT | TINYINT |
SMALLINT | SMALLINT |
INT | INT |
BIGINT | BIGINT |
FLOAT | FLOAT |
DOUBLE | DOUBLE |
DECIMAL(p, s) | DECIMAL(p, s) |
STRING | VARCHAR(1073741824) |
VARCHAR(n) | VARCHAR(n) |
CHAR(n) | CHAR(n) |
JSON | VARCHAR(1073741824) |
BINARY | VARBINARY |
DATE | DATE |
DATETIME | DATETIME |
TIMESTAMP | DATETIME Important StarRocks does not support the TIMESTAMP data type. When StarRocks reads data of the TIMESTAMP type, StarRocks converts the data type of the data into DATETIME, which causes a loss in data precision. |
ARRAY | ARRAY |
MAP | MAP |
STRUCT | STRUCT |