All Products
Search
Document Center

E-MapReduce:FAQ

Last Updated:Jun 20, 2026

This topic answers frequently asked questions about using StarRocks.

Hardware requirements

  • Overall machine configuration

    We recommend that BE nodes have at least 16 CPU cores and 64 GB of memory, and FE nodes have at least 8 CPU cores and 16 GB of memory.

    In a production environment, an FE node typically has 16 CPU cores, 32 GB or 64 GB of memory, and a 200 GB to 500 GB NVMe solid-state drive (SSD).

  • Disk

    • You can use Hard Disk Drives (HDDs) or SSDs.

    • Estimate the required disk capacity using a 3:1 compression ratio and 70% to 75% maximum disk utilization.

    • If you import Hive data in Parquet or ORC format into StarRocks, use a 1:1 compression ratio for estimation.

      For example, if the source Hive data is 3 TB, the data size in StarRocks will also be 3 TB after import.

  • CPU

    • The CPU must support the Advanced Vector Extensions 2 (AVX2) instruction set. You can run the cat /proc/cpuinfo |grep avx2 command to verify support.

    • The vectorized execution engine works best with a compatible CPU instruction set. If your machine's CPU does not support the required instruction set, we recommend upgrading your hardware.

  • Network

    We recommend using 10 Gigabit Ethernet network interface cards (NICs) and switches.

Software configuration requirements

For more information about software configurations, see Parameter configuration.

Replica configuration

In a typical production environment, two or three replicas are sufficient. We recommend setting the number of replicas to three.

Partitioning a table

Proper partitioning can effectively reduce the amount of scanned data.

  • Typically, you select partition keys based on how your business manages data. For example, you can use time or region columns as partition keys.

  • If you need to create partitions automatically, you can use dynamic partitioning.

Using bucketing

  • Select a high-cardinality column as the bucketing key to prevent data skew between buckets.

    • If a unique ID exists, we recommend using it for bucketing.

    • If you encounter severe data skew, you can use multiple columns as the bucketing key. However, avoid using too many columns.

  • Estimate the optimal tablet size and the total number of buckets based on your total data volume and the following guidelines.

    • For uncompressed raw data, such as data in CSV format, the recommended size for each tablet is between 1 GB and 10 GB.

    • For data in Parquet format, we recommend a tablet size of around 1 GB.

  • To maximize resource utilization on a small number of machines, calculate the number of buckets by using the formula: Number of BEs * CPU cores / 2.

Designing a sort key

Design the sort key based on your query patterns:

  • To accelerate queries, use columns that are frequently used in filter conditions and GROUP BY clauses as the sort key.

  • If your workload involves a large number of point lookups, place the ID column used for lookups as the first column of the sort key.

    For example, if you frequently run a high-concurrency query like select sum(revenue) from lineorder where user_id='aaa100';, we strongly recommend making user_id the first column of the sort key.

  • If your queries are mostly aggregations and scans, place low-cardinality columns first in the sort key.

    For example, for a query like select region, nation, count(*) from lineorder_flat group by region, nation, it is better to set region as the first column and nation as the second column.

Choosing data types

Use the most precise data types possible. For example, use an integer type instead of a string type if applicable, and prefer INT over BIGINT when the data range allows. Using precise data types improves database performance.

Tuning Routine Load parameters

Parameter tuning strategy

If you experience performance issues with Routine Load, consider tuning the following parameters:

  • Task scheduling cycle

    To accelerate data consumption, shorten the task scheduling cycle by modifying the max_batch_interval parameter.

    Important

    The minimum task scheduling cycle is 5 seconds.

  • Task parallelism

    When you have a large number of partitions and BE nodes, you can increase the values of the following parameters to accelerate task execution. However, increasing parallelism may lead to higher CPU consumption.

    • max_routine_load_task_concurrent_num

    • desired_concurrent_number

    StarRocks splits a single Routine Load job into multiple sub-tasks based on the number of Kafka topic partitions and available BE nodes. These sub-tasks are then distributed to the BEs for execution. Task parallelism is the number of sub-tasks that a job is split into.

    The actual task parallelism is calculated using the following formula.

    concurrent_num = Min( Min( partition_num, Min( desired_concurrent_num, alive_be_num ) ),Config.max_routine_load_task_concurrent_num )
  • Task batch size

    • routine_load_task_consume_second: Accelerate data consumption by increasing the duration of a single read.

    • max_routine_load_batch_size: Accelerate data consumption by increasing the amount of data read in a single batch.

    You can check the following log to determine if the current batch size parameter is set too low. Normally, a value of 0 or greater for the left_bytes field in this log indicates that the amount of data read in a single batch has not exceeded the max_routine_load_batch_size limit. A negative value indicates that max_routine_load_batch_size is too small.

    I0325 20:27:50.410579 15259 data_consumer_group.cpp:131] consumer group done:
    41448fb1a0ca59ad-30e34dabfa7e47a0. consume time(ms)=3261, received rows=179190,
    received bytes=9855450, eos: 1, left_time: -261, left_bytes: 514432550, blocking
    get time(us): 3065086, blocking put time(us): 24855
    1

Routine Load parameters

Parameter

Type

Default value

Description

max_routine_load_job_num

fe.conf

100

The maximum number of Routine Load jobs in the NEED_SCHEDULE, RUNNING, or PAUSED state.

max_routine_load_task_concurrent_num

fe.conf

5

The maximum parallelism of a single Routine Load job.

max_routine_load_task_num_per_be

fe.conf

5

The maximum number of Routine Load tasks that can be scheduled on a single BE node.

max_routine_load_batch_size

fe.conf

500 MB

The maximum amount of data to read from Kafka in a single batch.

routine_load_task_consume_second

fe.conf

3

The maximum time to read data from Kafka in a single batch.

routine_load_task_timeout_second

fe.conf

15

The timeout for a single Routine Load task.

max_consumer_num_per_group

be.conf

3

The maximum number of consumers per consumer group.

desired_concurrent_number

properties

3

The desired parallelism for the Routine Load job. The actual parallelism is calculated based on the following formula: concurrent_num = Min( Min( partition_num, Min(desired_concurrent_num, alive_be_num ) ),Config.max_routine_load_task_concurrent_num ).

max_batch_interval

properties

10s

The scheduling interval for a Routine Load job.

max_batch_rows

properties

200000

This parameter defines the error detection window size. The window size is 10 * max_batch_rows.

max_error_number

properties

0

The maximum number of error rows allowed within a sampling window. The value must be 0 or greater. The default, 0, allows no error rows.

Important

Rows filtered out by a WHERE clause are not counted as error rows.

strict_mode

properties

true

Specifies whether to enable strict mode. The default is true. In strict mode, rows are filtered if a non-nullable source column is transformed to NULL.

Selecting a data model

StarRocks provides four data models. Select the one that best fits your use case:

Data model

Scenario

duplicate key model

  • Data is not updated frequently.

  • Query patterns are flexible and do not involve pre-aggregation.

  • Raw data needs to be retained.

aggregate key model

  • Data is only appended, not updated.

  • Business queries include aggregate functions, such as MIN, MAX, or SUM.

  • Raw detailed data does not need to be queried.

unique key model

Ideal for real-time analysis of frequently updated data.

primary key model

Ideal for real-time analysis of frequently updated data.

If you have partial column updates, we recommend keeping the number of columns under 200.

COUNT function in data models

StarRocks provides four data models: the duplicate key model, unique key model, aggregate key model, and primary key model. The behavior of the COUNT function differs significantly across these models.

  • duplicate key model: This model does not require a merge operation, so COUNT operations are fast.

  • unique key and aggregate key models: The implementation of COUNT involves multi-version merge operations, making them relatively slower.

    If the key is a string type, the COUNT operation is theoretically even slower.

  • primary key model: When reading data, this model uses an in-memory index and a delete vector, which eliminates the need for merge operations. As a result, COUNT operations are faster than in the unique key and aggregate key models.

    Use this model for workloads that involve update operations.

Reducing trash directory disk usage

The /mnt/disk1/starrocks/storage/trash/ directory stores deleted data. To reduce the disk usage of this directory, you can decrease the value of the trash_file_expire_time_sec parameter in the be.conf file to shorten the retention period for the trash directory. The default value is 259200 seconds (72 hours).

Error creating a materialized view

  • Symptom: When you try to create a materialized view, the following error occurs.

    mysql> CREATE MATERIALIZED VIEW test_bl1_mvte AS select timed, event, count(userId) as PV from test_bl1 group by timed, event;
    ERROR 1064 (HY000): table [test_bl1] is not stable. Some tablets of this table may not be healthy or are being scheduled. You need to repair the table first or stop cluster balance. See 'help admin;'.
    mysql>
  • Solution:

    1. Run the commands show proc "/cluster_balance"; and show proc "/statistic";.

    2. Check whether any tablet is being rebalanced:

      • If a tablet is being rebalanced, wait for the process to complete.

      • If not, run the command set disable_balance=true and then try creating the materialized view again.

Troubleshooting slow queries

We recommend the following troubleshooting methods:

  • Enable profile reporting and view the profile information in the StarRocks UI.

    Run the command SET is_report_success = true; to enable profile reporting.

  • Follow these basic troubleshooting steps:

    1. Adjust parallelism.

      • Pipeline

        SET pipeline_dop = 8;
        SET enable_pipeline_engine = true;
      • Non-pipeline engine

        SET enable_pipeline_engine=false;
        SET parallel_fragment_exec_instance_num=8;
    2. Check the tablet distribution.

      show data xxx;
      Note

      The recommended tablet size is between 1 GB and 10 GB.

    3. Check the table definition.

      1. Check the iotime in the profile. If it is high, you can remove unnecessary indexes, such as an excessive number of bitmap indexes.

      2. Check the table's data model and select an appropriate one. For example, in a unique key model, predicate pushdown cannot be applied until compaction is complete, which often causes slow queries.

Access failure on port 8030 or 8040

The default ports for load_url and webserver_port depend on your EMR version. For EMR V5.8.0/V3.42.0 and earlier, use ports 8030 and 8040, respectively. For EMR V5.9.0/V3.43.0 and later, use ports 18030 and 18040.

Note

You can run the show frontends command to view the actual ports in use.

EMR StarRocks region availability

EMR StarRocks is available in all regions.

BE data disks and data distribution

By default, four enhanced SSD (ESSD) PL1 cloud disks are mounted. StarRocks automatically balances data across all nodes based on load, bucketing, and other factors. Each tablet replica is stored entirely on a single disk.

Resetting the cluster for data recovery

Important

The following operations clear all data in the cluster. Proceed with caution.

  1. Stop the cluster services (FE and BE).

  2. Clear the FE metadata directory.

    1. Check the fe.conf file in the /opt/apps/STARROCKS/starrocks-current/fe/conf/ directory to find the configured directory for meta_dir.

    2. Delete the bdb folder from the configured directory.

    3. Empty the image folder in the configured directory.

  3. Clear the BE data and metadata.

    1. Check the be.conf file in the /opt/apps/STARROCKS/starrocks-current/be/conf/ directory to find the configured directory for storage_root_path.

    2. Delete all folders and files from the configured directory except for the data and meta folders.

    3. Empty the data and meta folders in the configured directory.

    Note

    The configuration may specify multiple paths. You must perform these operations for each path.

  4. Restart the cluster services (FE and BE).

Viewing logs

The log directories are typically found in the following paths:

  • FE

    • /opt/apps/STARROCKS/starrocks-current/fe/log/

    • /mnt/disk1/log/starrocks/

  • BE

    • /opt/apps/STARROCKS/starrocks-current/be/log/

    • /mnt/disk1/log/starrocks/

Scaled-out task nodes not used for computation

Symptom

After you scale out a StarRocks cluster by adding three task nodes, the new nodes are assigned as compute nodes (CNs) but do not automatically run computation tasks. Although the existing BE nodes are under high load, the newly added task nodes remain underutilized, with monitoring data showing consistently low load.

Cause

In a compute-storage integration architecture, the CNs in a StarRocks cluster can only be used for queries on external tables. Therefore, if the application primarily involves querying internal tables, BE nodes handle these tasks, leaving the new CNs underutilized.

Solution

Before you run query tasks, execute the following SQL command to prioritize CNs for task execution.

SET GLOBAL prefer_compute_node=true;

Viewing the StarRocks cluster password

The password for a StarRocks cluster is the one you set for the root user when you created the cluster. If you have forgotten the password, you can reset it. For more information, see How do I reset the cluster logon password?