To further optimize query performance and management flexibility, StarRocks 3.3 introduces the concept of a warehouse, which is a group of compute nodes (CNs) or backend (BE) nodes within a StarRocks instance and is responsible for executing computational logic. This topic describes how to use a warehouse in E-MapReduce (EMR) Serverless StarRocks.
Background information
In a distributed database system, rational allocation and utilization of resources is key to ensuring high system performance. StarRocks is a high-performance Massively Parallel Processing (MPP) analytic database designed for real-time analysis and processing of large amounts of data. By intelligently distributing queries and executing them in parallel, StarRocks delivers high throughput while ensuring low latency.
You can use warehouses in scenarios that include, but are not limited to, the following:
Load balancing and resource isolation: Query requests from different businesses or user groups are distributed to different warehouses to achieve load balancing and ensure that the resources of different businesses do not interfere with each other.
Performance optimization: For queries that require fast responses or for critical business operations, you can specify a high-performance warehouse for execution. Such warehouses can be provisioned with more powerful computing resources or better network conditions.
Data security and compliance: When the system handles sensitive data, the execution of specific queries must comply with relevant security standards or must be performed on specific warehouses to meet strict data protection regulations.
To support these requirements, StarRocks provides a series of configuration and management features that allow you to flexibly specify and use warehouses to optimize query and resource management policies.
Prerequisites
An EMR Serverless StarRocks instance is created. For more information, see Create an instance.
You can add warehouses only to shared-data instances of StarRocks 3.3 or later. Shared-nothing instances support only a default warehouse named default_warehouse and do not support the multi-warehouse mode.
Usage notes
By default, asynchronous tasks, such as dynamic partition creation, compaction, schema change, materialized view refresh, and statistics collection, are performed in default_warehouse. To ensure high performance and smooth user experience, we recommend that you configure at least three CNs or BE nodes in default_warehouse.
Query warehouse and node information
Query information about all warehouses
Syntax
SHOW WAREHOUSES [LIKE '<pattern>'];
[LIKE '<pattern>'] is optional and is used to match warehouses by name. <pattern> supports the following wildcards:
%: matches characters of any length, including zero characters.
_: matches a single character.
Examples of <pattern>:
LIKE 'test%': matches all warehouses whose names start with test.
LIKE '%_warehouse': matches all warehouses whose names end with _warehouse.
Examples
Query information about all warehouses.
SHOW WAREHOUSES;
Query information about warehouses whose names match the specified pattern.
SHOW WAREHOUSES LIKE 'test%';
Query information about all nodes in warehouses
The information of each node includes the name of the warehouse to which it belongs.
SHOW COMPUTE NODES;
Query information about nodes in a specific warehouse
Query information about nodes in a single warehouse.
SHOW NODES FROM WAREHOUSE <warehouse_name>;
Query information about nodes in multiple warehouses.
SHOW NODES FROM WAREHOUSES [LIKE '<pattern>'];
If you do not specify [LIKE '<pattern>'], information about nodes in all warehouses is displayed.
Use a specific warehouse to perform operations
If you do not specify a warehouse, all operations are performed in default_warehouse.
Queries
You can specify a warehouse for a query only by using a SET_VAR hint or by manually switching the warehouse in the current session.
Specify a warehouse by using SET_VAR
In StarRocks, you can use SET_VAR hints to dynamically specify the warehouse used by the current query. This method is suitable for scenarios where you need to temporarily switch the warehouse. It does not affect other queries or session configurations. For more information about SET_VAR hints, see System variables.
The following syntax is supported:
SELECT /*+SET_VAR(warehouse="<warehouse_name>")*/ * FROM <database_name>.<table_name>;
/*+SET_VAR(warehouse="<warehouse_name>")*/ is written in a fixed format and is used to dynamically configure parameters at the query level. <warehouse_name> indicates the name of the warehouse. <database_name> and <table_name> indicate the names of the database and table. Replace them with actual names.
SET_VAR hints allow you to flexibly specify a warehouse for a single query, enabling resource isolation and performance optimization. Take note of the following items:
Required version: You must use StarRocks 3.3 or later.
Syntax specifications: Ensure that the position and format of /*+SET_VAR(...)*/ are correct.
Permissions and resources: Ensure that the specified warehouse exists and you have access permissions on the warehouse.
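For example, assuming a warehouse named test exists and a table sales_db.orders is accessible (both names are placeholders for illustration), the following query is executed in that warehouse:
SELECT /*+SET_VAR(warehouse="test")*/ * FROM sales_db.orders LIMIT 10;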
Manually switch the warehouse in the current session
Specify a warehouse.
SET [session] warehouse [=] <warehouse_name>;
In the statement, session and = are optional and do not affect the semantics.
Query the warehouse in use.
SHOW variables LIKE "%warehouse%";
The specified warehouse takes effect only in the current session. default_warehouse is used if no warehouse is specified. default_warehouse is also used for asynchronous tasks, such as compaction, asynchronous materialized view refresh, and statistics collection.
Execute a query.
SELECT * FROM my_db.my_table;
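Putting these steps together, a minimal session sketch that follows the syntax above (assuming a warehouse named test exists and you have the USAGE permission on it) looks as follows:
SET warehouse = test;
SHOW variables LIKE "%warehouse%";
SELECT * FROM my_db.my_table;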
Data import operations (Load)
Stream Load
Add the warehouse information to the header of the HTTP request. The following syntax is supported:
curl --location-trusted -u <username>:<password> \
    -H "label:<label_name>" \
    -H "timeout:100" \
    -H "max_filter_ratio:1" \
    -H "warehouse:<warehouse_name>" \
    -T <file_path> \
    -XPUT http://<fe_host>:<fe_http_port>/api/<database_name>/<table_name>/_stream_load
The following table describes the parameters in the code.
| Parameter | Required | Description |
| --- | --- | --- |
| <username>:<password> | Yes | The username and password of the StarRocks instance. If no password is set for the account, you only need to enter <username>:, such as admin:. |
| label | No | The label of the import job. If you do not specify a label, StarRocks automatically generates one for the import job. Data with the same label cannot be imported repeatedly. This prevents repeated import of the same data. |
| timeout | No | The timeout period of the import job. Valid values: 1 to 259200. Unit: seconds. |
| max_filter_ratio | No | The maximum fault tolerance rate of the import job, which is the maximum proportion of data rows that can be filtered out due to unqualified data quality. Valid values: 0 to 1. |
| warehouse | Yes | The warehouse that is used to perform the Stream Load operation. |
| <file_path> | Yes | The path where the data file is located. |
| fe_host | Yes | The internal endpoint of the frontend (FE) node in the StarRocks instance. You can view the endpoint on the Instance Details page of the StarRocks instance. Note: If you submit the import job to a BE node or a CN, you must specify the IP address of the BE node or CN. You can run the SHOW COMPUTE NODES; command to view the IP address. |
| fe_http_port | Yes | The HTTP port number of the FE node in the StarRocks instance. Default value: 8030. Note: If you submit the import job to a BE node or a CN, you must specify the HTTP port number of the BE node or CN. Default value: 8040. |
| database_name | Yes | The name of the database where the destination StarRocks table resides. |
| table_name | Yes | The name of the destination StarRocks table. |
External transaction import
When you call the /api/transaction/begin interface to start a transaction, you must specify a label and add the warehouse information to the HTTP request header. The following syntax is supported:
curl --location-trusted -u <username>:<password> \
    -H "label:<label_name>" \
    -H "db:<database_name>" \
    -H "table:<table_name>" \
    -H "warehouse:<warehouse_name>" \
    -XPOST http://<fe_host>:<fe_http_port>/api/transaction/begin
After you run the preceding command, when you call the /api/transaction/load, /api/transaction/prepare, or /api/transaction/commit interface, you must use the label that was specified for the /api/transaction/begin interface, but you do not need to specify the warehouse information again in the HTTP request header.
Broker Load
LOAD LABEL [<database_name>.]<label_name>
( data_desc[, data_desc ...] )
WITH BROKER ( StorageCredentialParams )
[PROPERTIES ( opt_properties )]
Add the warehouse parameter to opt_properties in the statement that is used to create an import task. The warehouse of the current session is used by default. Sample statement:
LOAD LABEL my_db.path_extra_fields_label
(
    DATA INFILE("hdfs://<hdfs_ip>:9000/starrocks_test_data/hdfs_data/*.orc")
    INTO TABLE `my_tbl`
    FORMAT AS "orc"
)
WITH BROKER hdfs_broker
(
    "hadoop.security.authentication" = "simple",
    "username" = "<hdfs_username>",
    "password" = "<hdfs_password>"
)
PROPERTIES
(
    "warehouse" = "test"
);
Routine Load
Add the warehouse parameter to job_properties in the statement. The warehouse of the current session is used by default.
CREATE ROUTINE LOAD <database_name>.<job_name> ON <table_name>
[load_properties]
[job_properties]
FROM data_source
[data_source_properties]
Specify a warehouse named test for Routine Load to continuously consume Kafka messages and import the messages to StarRocks. Sample statement:
CREATE ROUTINE LOAD my_db.routine_load_basic_types_fe824e00_680a_11ee_81d7_00163e135aa9 ON my_table
COLUMNS (k1,k2,k3,k4,k5,k6,v1,v2,v3,v4,v5,v6,v7,v8),
COLUMNS TERMINATED BY ',',
WHERE v1 != 0
PROPERTIES
(
    "desired_concurrent_number" = "1",
    "max_error_number" = "1000",
    "max_batch_interval" = "7",
    "warehouse" = "test"
)
FROM KAFKA
(
    "kafka_broker_list" = "<kafka_broker_ip>:<kafka_broker_port>",
    "kafka_topic" = "xxx",
    "kafka_partitions" = "0",
    "kafka_offsets" = "OFFSET_BEGINNING"
);
Flink Connector
StarRocks supports native integration with Apache Flink, allowing efficient data ingestion into StarRocks via Flink. Starting from version 1.2.9 of the StarRocks Connector for Apache Flink, you can configure the sink.properties.warehouse parameter to write data into a specified warehouse, enabling more flexible resource management and performance optimization.
Use sink.properties.warehouse to specify the target warehouse:
sink.properties.warehouse=<warehouse_name>
<warehouse_name>: the name of the target warehouse. If this parameter is not specified, the system uses default_warehouse.
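For illustration, a Flink SQL sink definition that routes writes to a specific warehouse might look like the following sketch. The table schema, connection options (jdbc-url and load-url), credentials, and warehouse name are placeholders; adjust them to your instance:
CREATE TABLE starrocks_sink (
    order_id BIGINT,
    total    DOUBLE
) WITH (
    'connector' = 'starrocks',
    'jdbc-url' = 'jdbc:mysql://<fe_host>:<fe_query_port>',
    'load-url' = '<fe_host>:<fe_http_port>',
    'database-name' = 'my_db',
    'table-name' = 'my_table',
    'username' = '<username>',
    'password' = '<password>',
    -- Pass the warehouse through to the underlying load requests.
    'sink.properties.warehouse' = '<warehouse_name>'
);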
Materialized views
Add the warehouse parameter to PROPERTIES in the CREATE MATERIALIZED VIEW statement. The warehouse of the current session is used by default. The following syntax is supported:
CREATE MATERIALIZED VIEW [IF NOT EXISTS] [database_name.]<mv_name>
[COMMENT ""]
[PROPERTIES ("key"="value", ...)]
AS
<query_statement>
For example, create a materialized view named order_mv and specify a warehouse named test to support subsequent refresh operations on the materialized view. Sample statement:
CREATE MATERIALIZED VIEW order_mv
DISTRIBUTED BY HASH(`order_id`)
REFRESH ASYNC EVERY (interval 1 MINUTE)
PROPERTIES ("warehouse" = "test")
AS SELECT
order_list.order_id,
sum(goods.price) as total
FROM order_list INNER JOIN goods ON goods.item_id1 = order_list.item_id2
GROUP BY order_id;
After the execution succeeds, you can use one of the following methods to view the warehouse information. The fe.audit.log file also records the execution status of refresh statements on the materialized view, which includes the Warehouse field.
Method 1: Run the following command to view the details of the order_mv materialized view:
SHOW materialized views WHERE name="order_mv";
Method 2: Run the following command to view the complete SQL statement for creating the order_mv materialized view:
SHOW CREATE materialized VIEW order_mv;
Compaction
Compaction is executed in default_warehouse by default. If you want to specify a different warehouse, you can modify the lake_compaction_warehouse configuration item. You can modify this configuration item dynamically at runtime, or modify it in the fe.conf file and then restart the FE node.
For example, to specify a warehouse named test for compaction:
Method 1: You can run the ADMIN SET FRONTEND CONFIG command to dynamically modify the configuration item.
ADMIN SET FRONTEND CONFIG ("lake_compaction_warehouse" = "test");
Method 2: Add lake_compaction_warehouse = test to the fe.conf file and restart the FE node for the modification to take effect.
After the modification, you can run the SHOW PROC '/compactions'; command to view the latest compaction records. The warehouse information is in the last column.
Connect to a specific warehouse
JDBC connection
The following syntax is supported:
jdbc.url=jdbc:mysql://<mysql_host>:3306/dbName?sessionVariables=warehouse=<warehouse_name>
Superset connection
Install starrocks-sqlalchemy by following the instructions in Superset Support and use init_command to specify a warehouse.
starrocks://<username>:<password>@<host>:<port>/<database_name>?init_command=set warehouse = <warehouse_name>
Specify a default warehouse for users
This feature is supported only in shared-data instances of StarRocks 3.3.8 or later.
You can specify a default warehouse (for example, sr_warehouse) for an existing user (for example, sr_user) by modifying the user properties, or specify the default warehouse when you create a user.
Method 1: Modify the properties of an existing user by using ALTER USER.
ALTER USER sr_user SET PROPERTIES ('session.warehouse' = 'sr_warehouse');
Method 2: Create a new user and specify the default warehouse by using CREATE USER.
CREATE USER 'sr_user' PROPERTIES ('session.warehouse' = 'sr_warehouse');
After you specify the default warehouse, you can view the properties of a specific user by executing the SHOW PROPERTY command.
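For example, the following statement (a sketch that assumes the sr_user account from above) shows that user's properties, including session.warehouse:
SHOW PROPERTY FOR 'sr_user';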
Grant permissions on a warehouse
You can execute the GRANT statement to grant permissions to users or roles. The following syntax is supported:
GRANT
{ USAGE }
ON { WAREHOUSE <warehouse_name> [, <warehouse_name>,...] | ALL WAREHOUSE }
TO { ROLE | USER } { <role_name> | <user_identity> } [ WITH GRANT OPTION ]
| Permission | Description |
| --- | --- |
| USAGE | Allows users to perform queries and other operations by using a specific warehouse. |
By default, a user who has the USAGE permission on a warehouse also has the SHOW permission on the warehouse.
For example, execute the following statement to grant the USAGE permission to user1:
GRANT USAGE ON warehouse waa TO user1;
For more permission-related operations, see Manage user privileges.
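Based on the syntax above, you can also grant the permission to a role instead of a user; the role name analyst_role below is a hypothetical example:
GRANT USAGE ON WAREHOUSE waa TO ROLE analyst_role;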