This topic provides answers to some frequently asked questions about StarRocks.
Business testing and evaluation
Data model and table design
What requirements does StarRocks have for hardware resources?
We recommend that the servers for backend (BE) nodes have at least 16 CPU cores and 64 GB of memory, and the servers for frontend (FE) nodes have at least 8 CPU cores and 16 GB of memory.
In most cases, the servers for FE nodes in a production environment have 16 CPU cores, 32 or 64 GB of memory, and an NVMe disk with 200 to 500 GB of memory.
HDD and SSD can be used.
The disk capacity can be estimated based on the following condition: The compression ratio is 3, and the disk usage is 70% to 75%.
If you want to import Hive data in Parquet or ORC format into StarRocks, estimate the disk capacity based on the compression ratio of 1:1.
For example, if the size of the Hive data that you want to import is 3 TB, the size of the data imported into StarRocks is 3 TB.
The CPU must support the Advanced Vector Extensions 2 (AVX2) instruction set. You can run the
cat /proc/cpuinfo |grep avx2command to determine whether the AVX2 instruction set is supported.
The fully vectorized execution engine makes more efficient use of CPU processing power if the CPU has the instruction set. If not, we recommend that you change the server.
We recommend that you select 10GE network interface controllers (NICs) and 10GE switches.
What requirements does StarRocks have for software configurations?
For more information, see Parameter configuration.
How many replicas do I need to specify in a production environment?
In most cases, 2 or 3 replicas are needed in a production environment. We recommend that you specify 3 replicas.
How do I determine the number of partitions of a table?
Proper partitions help you effectively decrease the amount of data to be scanned.
In most cases, you can select partition keys based on your business requirements. For example, you can select the time or region column as the partition key.
If you need partitions to be automatically created, you can enable dynamic partitioning.
How do I determine the number of tablets?
You can select a column with high cardinality as a bucketing key to prevent data skew between tablets.
If a unique ID column exists, we recommend that you select the unique ID column as the bucketing key.
If data cannot be evenly distributed into each tablet by using one bucketing column, you can select multiple columns as the bucketing key. However, do not use an excessive amount of columns as the bucketing key.
You can estimate the optimal size of tablets based on the following information. The number of tablets can be calculated based on the following items and the total data amount.
For raw data, such as data in the CSV format, the size of a tablet ranges from 1 GB to 10 GB.
For data in the Parquet format, the size of a tablet is about 1 GB.
If you want to make full use of machine resources, use the following formula to calculate the number of tablets:
Number of BE nodes × Number of CPU cores/2.
How do I determine the sort key?
The sort key should be designed based on the characteristics of data queries.
You can select columns that are frequently used for filter conditions and in the Group BY clause as the sort key to accelerate queries.
If a large number of point queries are often performed, set the ID column of the point queries as the first column of the sort key columns.
For example, if the query statement is
select sum(revenue) from lineorder where user_id='aaa100';and high concurrency occurs, we recommend that you set the
user_idcolumn as the first column of the sort key columns.
If aggregate queries and scans are mainly performed, we recommend that you set the columns with low cardinality first.
For example, if the query statement is
select region, nation, count(*) from lineorder_flat group by region, nation, we recommend that you specify the
regioncolumn as the first column and the
nationcolumn as the second column.
How do I select proper data types?
We recommend that you select data types with high precision. For example, select an integer data type rather than a string data type, and select the INT data type rather than the BIGINT data type. Data types with high precision can help you fully make use of the performance of databases.
How do I perform parameter tuning when the Routine Load import method has performance issues?
Parameter tuning policies
If the Routine Load import method has performance issues, you can perform parameter tuning from the following dimensions:
Job scheduling cycle
You can modify the max_batch_interval parameter to shorten the scheduling cycle and accelerate data consumption. However, if you shorten the scheduling cycle, more CPU resources may be consumed.Important
The minimum scheduling cycle is 5 seconds.
If a large number of partitions and BE nodes are involved, you can configure the following parameters to accelerate the job execution. However, if you increase the degree of parallelism (DOP), more CPU resources may be consumed.
A Routine Load job is split into several tasks based on the number of partitions in a Kafka topic and distributed to multiple BE nodes for execution based on the number of BE nodes. The DOP of a job refers to the number of tasks that a Routine Load job is split into.
You can use the following formula to calculate the DOP of a job.
concurrent_num = Min( Min( partition_num, Min( desired_concurrent_num, alive_be_num ) ),Config.max_routine_load_task_concurrent_num )
Import batch size
routine_load_task_consume_second: You can accelerate data consumption by increasing the duration of a read operation.
max_routine_load_batch_size: You can accelerate data consumption by increasing the data amount in a read operation.
You can use the following logs to determine whether the max_routine_load_batch_size parameter is set to a lower value. The value of the
left_bytesfield in the logs should be no less than 0, which indicates that the amount of data read at a time does not exceed the value of the max_routine_load_batch_size parameter. Otherwise, the max_routine_load_batch_size parameter is set to a lower value.
I0325 20:27:50.410579 15259 data_consumer_group.cpp:131] consumer group done: 41448fb1a0ca59ad-30e34dabfa7e47a0. consume time(ms)=3261, received rows=179190, received bytes=9855450, eos: 1, left_time: -261, left_bytes: 514432550, blocking get time(us): 3065086, blocking put time(us): 24855 1
Parameters of Routine Load jobs
The maximum number of Routing Load jobs in the NEED_SCHEDULE, RUNNING, or PAUSED state.
The maximum DOP of a Routine Load job.
The maximum number of Routine Load jobs that can be allocated to a single BE node.
The maximum amount of data that can be read from Kafka at a time.
The maximum period used to read data from Kafka at a time.
The timeout period for which a Routine Load job can run.
The maximum number of consumers in a consumer group.
The expected DOP of a Routine Load job. The actual DOP is calculated based on the following formula:
The scheduling cycle of a Routine Load job.
This parameter is used only to specify the range of the error detection window. The range of the window is calculated by multiplying 10 and the value of this parameter.
The maximum number of error rows allowed in the sampling window. The value must be greater than or equal to 0. The default value 0 specifies that no error row is allowed.
Error rows do not contain the rows that are filtered out by a WHERE clause.
Specifies whether to enable the strict mode. By default, the strict mode is enabled. After you enable the strict mode, if a column of the raw data is changed from NOT NULL to NULL, the data is filtered out.
How do I select a data model?
StarRocks provides the following four data models. You can choose a data model based on your business requirements.
Aggregate key model
Data is updated for real-time analysis.
Data is updated for real-time analysis.
If some columns are updated, we recommend that the number of updated columns is no more than 200.
How does the COUNT function work in various StarRocks data models?
StarRocks has four data models, which are duplicate key, unique key, aggregate key, and primary key models. The way in which the COUNT function works varies based on different data models. The following items describe the differences:
Duplicate key model: This data model does not require merge operations, and thus the result of the COUNT function is returned in a quick manner.
Unique key and aggregate key models: The two data models require merge operations on multiple versions of data, and thus the result of the COUNT function is returned later than that of the duplicate key model.
Theoretically, if the key column is of the STRING type, the result of the COUNT function is more slowly returned.
Primary key model: The data model supports the primary key index and the delete operation on vectors when data is read. The model does not require merge operations. As a result, the result of the COUNT function is returned in the model faster than those in the unique key and aggregate key models.
We recommend that you select the primary key model if your data involves the update operations.
How do I reduce the disk usage of the
The deleted data is stored in the
/mnt/disk1/starrocks/storage/trash/ directory. If you want to reduce the disk usage of this directory, you can reduce the value of the trash_file_expire_time_sec parameter in the be.conf file to shorten the retention period of the trash directory. The default retention period is 259,200 seconds, which equals 72 hours.
What do I do if an error occurs when I create a materialized view?
Problem description: The following figure shows the error message.
Execute the following statements:
show proc "/cluster_balance";and
show proc "/statistic";.
Check whether a tablet is rebalanced.
If yes, wait until the execution is complete.
If not, you can execute the
set disable_balance=truestatement and create a materialized view.
What do I do if a slow data query occurs?
We recommend that you use the following solution:
Enable profile reporting and then view the profile information on StarRocks user interface (UI).
You can execute the
SET is_report_success = true;statement to enable profile reporting.
Common troubleshooting methods:
Adjust the parallelism.
SET pipeline_dop = 8; SET enable_pipeline_engine = true;
SET enable_pipeline_engine=false; SET parallel_fragment_exec_instance_num=8;
View the data distribution of tablets.
show data xxx;Note
The recommended size of a tablet is 1 to 10 GB.
View the created table.
You can determine the I/O wait based on the profile information. If the wait time is high, you can delete unnecessary indexes, such as the frequently created bitmap indexes.
View the table data model and select an appropriate data model. For example, in the unique key model, the pushdown does not work before the compaction phase is incomplete. This causes slow queries.
Why do I fail to access a StarRocks cluster by using port 8030 or port 8040?
For StarRocks clusters of E-MapReduce (EMR) V5.8.0, EMR V3.42.0, or a minor version that is earlier than EMR V5.8.0 or EMR V3.42.0, the port number of load_url is 8030 and the port number of webserver_port is 8040. However, for StarRocks clusters of EMR V5.9.0 or a later minor version, or EMR V3.43.0 or a later minor version, the port number of load_url is 18030 and the port number of webserver_port is 18040. You must select the port that you want to use to access the StarRocks cluster based on your cluster version.
You can run the
show frontends command to view the actual port number of the StarRocks cluster.
In which regions is EMR StarRocks available?
StarRocks is available in all regions.
How does StarRocks distribute data to the data disks of BE nodes?
By default, four enhanced SSDs of PL1 are configured for each BE node. StarRocks evenly distributes data to BE nodes based on loads and the bucket mechanism. The data in the same tablet and replica is stored on the same disk.
How do I reset a cluster if a data exception occurs?
After you perform the following operations, cluster data is cleared. Proceed with caution.
Stop the services that are deployed on the FE and BE nodes.
Clear the directory in which the metadata of FE nodes is stored.
View the fe.conf file in the
/opt/apps/STARROCKS/starrocks-current/fe/conf/directory and obtain the configuration directory of
Delete the bdb folder from the configuration directory.
Clear the image folder in the configuration directory.
Clear BE data and metadata.
View the be.conf file in the
/opt/apps/STARROCKS/starrocks-current/be/conf/directory and obtain the configuration directory of
Delete all folders and files except for the data and meta folders from the configuration directory.
Clear the data and meta folders in the configuration directory.
The configuration directory may exist in multiple paths. You must perform the preceding operations for each path.
Restart the services that are deployed on the FE and BE nodes.
How do I view the logs of the FE and BE nodes?
The logs of the FE and BE nodes are stored in the following directories:
How do I use the compute-storage separation mode in StarRocks 3.X?
By default, EMR StarRocks 3.X supports compute-storage integration rather than compute-storage separation. If you want to use the compute-storage separation mode, submit a ticket.