
E-MapReduce:Kudu FAQ

Last Updated: Feb 27, 2026

This topic provides answers to some frequently asked questions about Kudu.

General

Where are Kudu log files stored?

Kudu log files are in /mnt/disk1/log/kudu.

What partitioning methods does Kudu support?

Kudu supports range partitioning and hash partitioning. The two methods can be combined. For details, see Apache Kudu Schema Design.

How do I access the Kudu web UI?

Kudu is not integrated with Knox. Create an SSH tunnel to access the web UI instead. For instructions, see Create an SSH tunnel to access web UIs of open source components.
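A minimal tunnel sketch, assuming the Kudu master's default web UI port of 8051 (tablet servers default to 8050); the key path and host below are placeholders for your own values:

```shell
# Forward local port 8051 to the Kudu master web UI through SSH.
# Replace the key path and master host with your cluster's values.
ssh -i ~/.ssh/your_key.pem -N -L 8051:localhost:8051 root@<master-node-address>
# Then open http://localhost:8051 in a local browser.
```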

Where can I find the Kudu community FAQ?

See the Apache Kudu Troubleshooting page.

Startup errors

NonRecoverableException when connecting with the Kudu client

The following error indicates a mismatch between the number of master nodes configured on the client and the number expected by the cluster:

org.apache.kudu.client.NonRecoverableException: Could not connect to a leader master. Client configured with 1 master(s) (192.168.0.10:7051) but cluster indicates it expects 3 master(s) (192.168.0.36:7051,192.168.0.11:7051,192.168.0.10:7051)

Deploy all required master nodes and configure the Kudu client with the addresses of all master nodes in the cluster, not just one of them.
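For example, using the three addresses from the error message above, the client's master list would look like this (the `kudu master list` check assumes the kudu CLI is installed):

```shell
# All three master RPC addresses from the error message, comma-separated.
KUDU_MASTERS="192.168.0.36:7051,192.168.0.11:7051,192.168.0.10:7051"
echo "$KUDU_MASTERS"
# With the kudu CLI available, verify membership and the current leader:
#   kudu master list "$KUDU_MASTERS"
```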

Kudu fails to start due to Bigboot monitor defect

A defect in Bigboot V3.5.0 prevents Kudu from restarting after a crash. The Bigboot monitor fails to delete the service information from its database, causing subsequent restart attempts to fail.

Stop Kudu and then start it again by running the following commands directly on the machine.

Run the following commands on a core or task node. On a master node, replace kudu-tserver in the commands with kudu-master.

/usr/lib/b2monitor-current/bin/monictrl -stop kudu-tserver
/usr/lib/b2monitor-current/bin/monictrl -start kudu-tserver
Note: Run these commands on the machine itself. The EMR console may not be able to perform the stop operation because the service is already terminated.

Clock synchronization error prevents Kudu from starting

This error appears when ntpd on the machine cannot connect to the configured NTP server:

Service unavailable: RunTabletServer() failed: Cannot initialize clock: timed out waiting for clock synchronisation: Error reading clock. Clock considered unsynchronized

The logs may also include output similar to:

E1010 10:37:54.165313 29920 system_ntp.cc:104] /sbin/ntptime
------------------------------------------
stdout:
ntp_gettime() returns code 5 (ERROR)
  time e6ee0402.2a452c4c  Mon, Oct 10 2022 10:37:54.165, (.165118697),
  maximum error 16000000 us, estimated error 16000000 us, TAI offset 0
ntp_adjtime() returns code 5 (ERROR)
  modes 0x0 (),
  offset 0.000 us, frequency 187.830 ppm, interval 1 s,
  maximum error 16000000 us, estimated error 16000000 us,
  status 0x2041 (PLL,UNSYNC,NANO),
  time constant 6, precision 0.001 us, tolerance 500 ppm,

Restart the ntpd service so that it can synchronize with the configured NTP server, and then start Kudu again.
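A quick health check on the affected node before restarting Kudu, assuming ntpd and its tools are installed (on chrony-based images, use chronyc instead):

```shell
# ntpstat exits non-zero while the clock is unsynchronized.
ntpstat
# List configured peers; the current sync peer is marked with '*'.
ntpq -p
# If no server is reachable, fix connectivity or the server list, then:
systemctl restart ntpd
```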

Runtime errors

Network error: unable to resolve hostname

Bad status: Network error: Could not obtain a remote proxy to the peer.: unable to resolve address for <hostname>: Name or service not known

This error occurs when a hostname cannot be resolved to an IP address. A Raft peer hosting a Kudu tablet cannot identify its fellow Raft servers, so it terminates the network connection.

Solution 1: Manually add the hostname-to-IP mapping in /etc/hosts.

Solution 2: If the host behind the hostname has been released, add a mapping between the hostname and any IP address to /etc/hosts. The IP address does not need to be reachable. Once the mapping is in place, the Kudu tablet server replicates data from the unavailable Raft server to a new Raft server in the Raft group.
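A sketch of Solution 2. By default it writes to a stand-in file; set HOSTS_FILE=/etc/hosts on the real node. The hostname and IP below are hypothetical; use the hostname from your error message:

```shell
# Append a placeholder mapping for a hostname that no longer resolves.
HOSTS_FILE="${HOSTS_FILE:-/tmp/hosts.example}"   # use /etc/hosts on the node
UNRESOLVED_HOST="emr-worker-3.cluster"           # hostname from the error message
PLACEHOLDER_IP="192.168.0.99"                    # any address; reachability not required
grep -q "$UNRESOLVED_HOST" "$HOSTS_FILE" 2>/dev/null || \
  echo "$PLACEHOLDER_IP $UNRESOLVED_HOST" >> "$HOSTS_FILE"
grep "$UNRESOLVED_HOST" "$HOSTS_FILE"
```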

Filesystem layout integrity error

Bad status: I/O error: Failed to load Fs layout: could not verify integrity of files: <directory>, <number> data directories provided, but expected <number>

The number of disks specified by -fs_data_dirs does not match the metadata recorded by -fs_metadata_dir. Update -fs_data_dirs so the disk count matches what is recorded in -fs_metadata_dir.
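For illustration, a matching pair of flags might look like the following (the paths are hypothetical; list exactly the directories the metadata expects):

```shell
# kudu-tserver / kudu-master startup flags (example paths).
--fs_metadata_dir=/mnt/disk1/kudu/master
--fs_data_dirs=/mnt/disk1/kudu/data,/mnt/disk2/kudu/data
```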

Thread creation failure (pthread_create error 11)

pthread_create failed: Resource temporarily unavailable (error 11)

Check the following causes in order.

Insufficient process limits

Check the current limit for max user processes:

ulimit -a

If the value is too low, increase it by modifying /etc/security/limits.conf or by creating /etc/security/limits.d/kudu.conf.
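For example, check the current limit and, if needed, raise it with a drop-in file (the values below are illustrative, not recommendations):

```shell
# Show the per-user process/thread limit for the current shell.
ulimit -u
# Persistent fix: add lines like these to /etc/security/limits.d/kudu.conf,
# then log in again for them to take effect:
#   kudu  soft  nproc  32768
#   kudu  hard  nproc  65536
```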

Kudu client V0.8 thread leak in hybrid deployments

In hybrid deployments, Spark executors may leak threads when using Kudu client V0.8. This is a known issue documented in KUDU-1453. Upgrade to Kudu client V0.9 to resolve the issue.

Trino shutdown thread leak

When Trino exits, the shutdown hook thread blocks in the take method of BlockingQueue, waiting for an element, and cannot be interrupted. The EMR controller keeps sending SIGTERM, and each signal spawns a new SIGTERM Handler thread until the process runs out of threads.

Fix the issue on the Trino side, or terminate the process with kill -9.
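A sketch of the forced termination; the pgrep pattern is an assumption and may need adjusting to match your Trino launcher:

```shell
# Force-kill a Trino server that ignores SIGTERM.
TRINO_PID=$(pgrep -f 'io.trino.server.TrinoServer' || true)
if [ -n "$TRINO_PID" ]; then
  kill -9 $TRINO_PID
  echo "sent SIGKILL to $TRINO_PID"
else
  echo "no Trino process found"
fi
```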

Jindo SDK thread pool leak

Spark uses the JindoOssCommitter class for write jobs. This class creates a JindoOssMagicCommitter object that generates a thread pool named oss-committer-pool. The thread pool is not static and is never shut down. As new JindoOssMagicCommitter objects are created, thread pools accumulate without being released. This is especially likely with Spark Streaming or Structured Streaming workloads.

Add the following Spark parameters to work around the issue:

spark.sql.hive.outputCommitterClass=org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
spark.sql.sources.outputCommitterClass=org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

Find the process with the most threads

Use the following threads_monitor.sh script to identify which process consumes the most threads:

#!/bin/bash
# Count the threads of every process and report the process with the most.

total_threads=0
max_pid=-1
max_threads=-1

for pid in $(ls /proc | grep -E '^[0-9]+$'); do
  if [[ -f /proc/$pid/status ]]; then
    num_threads=$(awk '/^Threads/ {print $NF}' /proc/$pid/status)
    ((total_threads+=num_threads))
    if [[ ${max_threads} -lt ${num_threads} ]]; then
      max_pid=${pid}
      max_threads=${num_threads}
    fi
    # echo "Threads of ${pid}: ${num_threads}"
  fi
done

echo "Total threads: ${total_threads}"
echo "Max threads: ${max_threads}, pid is ${max_pid}"
ps -fp "${max_pid}"
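On Linux, ps can report the same information directly through the NLWP (number of lightweight processes) column, which is a quick alternative to the script:

```shell
# Top 5 processes by thread count, highest first.
ps -eo nlwp,pid,comm --sort=-nlwp | head -n 6
```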

Soft memory limit exceeded

Rejecting Write request: Soft memory limit exceeded

The volume of data being written exceeds the soft memory limit. You can perform the following operations:

  1. Configure the memory_limit_hard_bytes parameter to raise the hard memory limit. The default value is 0, which lets the system set the maximum memory usage automatically. A value of -1 removes the limit entirely.

  2. Configure the memory_limit_soft_percentage parameter to adjust the percentage of the hard limit at which Kudu starts rejecting writes. The default value is 80.
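For example, the corresponding tablet server flags might look like this (the values are illustrative, not recommendations):

```shell
# kudu-tserver startup flags (example values).
--memory_limit_hard_bytes=0         # 0 = auto-size from system memory; -1 = unlimited
--memory_limit_soft_percentage=85   # default is 80
```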