
E-MapReduce: FAQ

Last Updated: Mar 26, 2026

This topic covers common problems you may encounter when running Apache Druid on E-MapReduce (EMR), including indexing failures and frequently seen runtime errors.

Troubleshoot indexing failures

Follow these steps to diagnose an indexing failure, starting from the outer layer and working inward.

Batch indexing

  1. Run the curl command and check its output. If the output shows an error or nothing at all, check your input file format. To inspect the raw API response, add the -v flag to the curl command (see the example after this list).

  2. Open the Overlord page and check the job execution status. If the job failed, view the logs directly on that page.

  3. If no logs appear on the Overlord page, open the YARN page and check whether an index job was generated. This step applies to Hadoop-based jobs.

  4. If you still cannot identify the cause, log in to the EMR Druid cluster and check the Overlord log:

    /mnt/disk1/log/druid/overlord-emr-header-1.cluster-xxxx.log

    For a high availability (HA) cluster, check the Overlord that received the job submission.

  5. If the job was submitted to MiddleManager but MiddleManager returned a failure, find the worker node assigned to the job in Overlord, log in to that node, and check the MiddleManager log:

    /mnt/disk1/log/druid/middleManager-emr-header-1.cluster-xxxx.log
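
For reference, a minimal version of the curl command from step 1 looks like the following. The spec file name and the Overlord address are placeholders, and 8090 is the Apache Druid default Overlord port; your EMR cluster may listen elsewhere.

# Submit the indexing spec to the Overlord; -v prints the raw HTTP exchange.
curl -v -X POST -H 'Content-Type: application/json' \
    -d @index_task.json \
    http://emr-header-1:8090/druid/indexer/v1/task

# Check the status of a submitted task by its ID.
curl http://emr-header-1:8090/druid/indexer/v1/task/<task_id>/status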

Real-time Tranquility indexing

Start by checking the Tranquility log to confirm whether messages were received or dropped. Then follow steps 2 through 5 of the batch indexing procedure above.

Most indexing errors fall into two categories:

  • Cluster configuration errors: JVM memory parameters, cross-cluster connectivity, high-security mode access, and Kerberos principals.

  • Job errors: Job description file format, input data parsing, and other job-level settings such as ioConfig.
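
For example, a minimal ioConfig for a Hadoop batch job looks like the sketch below. The input path is a placeholder; a wrong type or an unreadable paths value is a typical job-level error.

"ioConfig": {
    "type": "hadoop",
    "inputSpec": {
        "type": "static",
        "paths": "hdfs://emr-header-1.cluster-xxxx:9000/data/input.json"
    }
}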

Common errors and fixes

Service startup fails

Cause: The machine's available memory is insufficient for the JVM parameters configured for the Druid component — for example, a large heap size or a high thread count on a machine with limited memory.

Fix: Check the component logs to identify which parameter is over-provisioned, then reduce it. JVM memory has two parts: heap memory and direct memory. For tuning guidance, see Apache Druid Performance FAQ.
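
For example, these limits live in the component's jvm.config file, one JVM argument per line. The values below are an illustrative sketch, not recommendations: -Xmx caps the heap and -XX:MaxDirectMemorySize caps direct memory, so reduce whichever the log shows the machine cannot satisfy.

-server
-Xms4g
-Xmx4g
-XX:MaxDirectMemorySize=8g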

YARN task fails with a JAR conflict error

Error:

Error: class com.fasterxml.jackson.datatype.guava.deser.HostAndPortDeserializer overrides final method deserialize.(Lcom/fasterxml/jackson/core/JsonParser;Lcom/fasterxml/jackson/databind/DeserializationContext;)Ljava/lang/Object;

Cause: Druid's dependencies conflict with JAR files already present on the Hadoop cluster.

Fix: Add one of the following properties to the tuningConfig section of the indexing job configuration:

Property                                       Effect
mapreduce.job.classloader: "true"              Runs the MapReduce job with a standalone, isolated class loader
mapreduce.job.user.classpath.first: "true"     Loads your job's JAR files before the cluster's

Setting either one of the two properties is sufficient. Example:

"tuningConfig": {
    "jobProperties": {
        "mapreduce.job.classloader": "true"
    }
}

For more information, see Apache Druid: Working with different versions of Hadoop.

Reduce task cannot create the segments directory

Fix: Check your deep storage configuration — specifically the type and directory fields:

  • If type is local: Verify that the directory exists and that the EMR Druid account has write permission.

  • If type is hdfs: Write the path as a full HDFS URI, for example hdfs://<hdfs_master>:9000/. For hdfs_master, use the IP address. If you must use a hostname, use the full hostname — for example, emr-header-1.cluster-xxxxxxxx rather than emr-header-1.

For Hadoop batch indexing on a standalone EMR Druid cluster, deep storage must be set to "hdfs". Using "local" causes the MapReduce job to enter an undefined state because the remote YARN cluster cannot write to a local path on a different machine.
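
For reference, the two fields map to the following entries in common.runtime.properties. The HDFS URI is a placeholder for your own master, and /druid/segments is only a conventional directory name.

druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://emr-header-1.cluster-xxxxxxxx:9000/druid/segments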

Failed to create directory within 10,000 attempts

Cause: The directory specified by java.io.tmpdir in the JVM configuration does not exist.

Fix: Create the directory and make sure the EMR Druid account has permission to access it.
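
For example, if jvm.config sets -Djava.io.tmpdir=/mnt/disk1/tmp (a hypothetical path), create the directory and grant access. The druid user and hadoop group names are assumptions and may differ on your cluster.

# Path, user, and group below are assumptions; match them to your jvm.config and cluster.
mkdir -p /mnt/disk1/tmp
chown druid:hadoop /mnt/disk1/tmp
chmod 755 /mnt/disk1/tmp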

com.twitter.finagle.NoBrokersAvailableException: No hosts are available for disco!firehose:druid:overlord

Cause: ZooKeeper connection mismatch between Druid and Tranquility.

Fix: Make sure both services use the same ZooKeeper connection string.

The default ZooKeeper path for EMR Druid is /druid, so the zookeeper.connect value in your Tranquility configuration must include /druid.

If you are using Tranquility with Kafka, two separate ZooKeeper settings exist:

Setting                      Connects to
zookeeper.connect            ZooKeeper of the EMR Druid cluster
kafka.zookeeper.connect      ZooKeeper of the Kafka cluster

These two ZooKeeper clusters may be different. Verify each setting points to the correct cluster.
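
As a sketch, the two settings sit side by side in the properties block of a Tranquility Kafka configuration file. Both connection strings are placeholders; note the /druid path on the Druid side.

"properties": {
    "zookeeper.connect": "emr-header-1.cluster-xxxx:2181/druid",
    "kafka.zookeeper.connect": "emr-kafka-1.cluster-yyyy:2181"
}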

MiddleManager cannot find com.hadoop.compression.lzo.LzoCodec

Cause: The EMR Hadoop cluster is configured with LZO compression, but the required files are not in Druid's dependency directory.

Fix: Copy the LZO JAR and its native library from HADOOP_HOME/lib on the EMR cluster to Druid's druid.extensions.hadoopDependenciesDir (default: DRUID_HOME/hadoop-dependencies).
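
A sketch of the copy, assuming default locations; the exact JAR name and the native library files vary with your Hadoop build.

cp ${HADOOP_HOME}/lib/hadoop-lzo-*.jar ${DRUID_HOME}/hadoop-dependencies/
cp ${HADOOP_HOME}/lib/native/libgplcompression* ${DRUID_HOME}/hadoop-dependencies/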

Indexing fails with GPLNativeCodeLoader IOException

Error:

2018-02-01T09:00:32,647 ERROR [task-runner-0-priority-0] com.hadoop.compression.lzo.GPLNativeCodeLoader - could not unpack the binaries
  java.io.IOException: No such file or directory
          at java.io.UnixFileSystem.createFileExclusively(Native Method) ~[?:1.8.0_151]
          at java.io.File.createTempFile(File.java:2024) ~[?:1.8.0_151]
          at java.io.File.createTempFile(File.java:2070) ~[?:1.8.0_151]
          at com.hadoop.compression.lzo.GPLNativeCodeLoader.unpackBinaries(GPLNativeCodeLoader.java:115) [hadoop-lzo-0.4.21-SNAPSHOT.jar:?]

Cause: The directory specified by java.io.tmpdir does not exist, the same root cause as the 10,000-attempts error above.

Fix: Create the directory and make sure the EMR Druid account has permission to access it, as shown under Failed to create directory within 10,000 attempts above.