E-MapReduce: Ingestion Spec

Last Updated: Dec 27, 2024

This section briefly introduces the Ingestion Spec, the file that describes the data to be indexed.

Ingestion Spec is a unified description of the format of the data being indexed and how it is indexed by Druid. It is a JSON file, which consists of three parts:

```json
{
    "dataSchema" : {...},
    "ioConfig" : {...},
    "tuningConfig" : {...}
}
```

| Key | Format | Description | Required |
| --- | --- | --- | --- |
| dataSchema | JSON object | Describes the schema of the data you want to consume. dataSchema is fixed and does not change with the way in which data is consumed. | Yes |
| ioConfig | JSON object | Describes the source and destination of the data you want to consume. ioConfig differs depending on how the data is consumed. | Yes |
| tuningConfig | JSON object | Configures the parameters for consuming the data. The adjustable parameters differ depending on how the data is consumed. | No |

dataSchema

dataSchema describes the format of the data and how to parse the data. The typical structure is as follows:

```json
{
    "dataSource": <name_of_dataSource>,
    "parser": {
        "type": <>,
        "parseSpec": {
            "format": <>,
            "timestampSpec": {},
            "dimensionsSpec": {}
        }
    },
    "metricsSpec": {},
    "granularitySpec": {}
}
```

| Key | Format | Description | Required |
| --- | --- | --- | --- |
| dataSource | String | Name of the data source. | Yes |
| parser | JSON object | How the data is parsed. | Yes |
| metricsSpec | Array of JSON objects | Aggregator list. | Yes |
| granularitySpec | JSON object | Data aggregation settings, such as creating segments and aggregation granularity. | Yes |

  • parser

    parser determines how your data is parsed. metricsSpec defines how the data is aggregated. granularitySpec defines the granularity of the segments and the minimum granularity of queries.

    There are two types of parser: string and hadoopyString. The latter is used for Hadoop index jobs. parseSpec defines how the specific data format is parsed.

    | Key | Format | Description | Required |
    | --- | --- | --- | --- |
    | format | String | The data format. Valid values: json, jsonLowercase, csv, and tsv. | Yes |
    | timestampSpec | JSON object | Timestamp column and timestamp type. | Yes |
    | dimensionsSpec | JSON object | The dimensions (columns) of the data. | Yes |

    Additional parseSpec options may exist for specific data formats. The following tables describe timestampSpec and dimensionsSpec, in that order.

    | Key | Format | Description | Required |
    | --- | --- | --- | --- |
    | column | String | The column that contains the timestamp. | Yes |
    | format | String | The timestamp type. Valid values: ISO, millis, POSIX, auto, or any format supported by Joda-Time. | Yes |

    | Key | Format | Description | Required |
    | --- | --- | --- | --- |
    | dimensions | JSON array | Describes which dimensions the data contains. Each dimension can be a plain string, or an object that also specifies attributes such as the type. For example: ["dimension1", "dimension2", {"type": "long", "name": "dimension3"}]. The default type is string. | Yes |
    | dimensionExclusions | Array of JSON strings | Dimensions to exclude when data is consumed. | No |
    | spatialDimensions | Array of JSON objects | Spatial dimensions. | No |
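
    As an illustration of how the keys above fit together, the following is a minimal sketch of a parser block for JSON input. The column names (time, channel, page, commentLength, internalId) are hypothetical and are not taken from this document; replace them with the columns in your own data.

    ```json
    {
        "type": "string",
        "parseSpec": {
            "format": "json",
            "timestampSpec": {
                "column": "time",
                "format": "auto"
            },
            "dimensionsSpec": {
                "dimensions": [
                    "channel",
                    "page",
                    {"type": "long", "name": "commentLength"}
                ],
                "dimensionExclusions": ["internalId"]
            }
        }
    }
    ```

    In this sketch, channel and page use the default string type, and commentLength is declared as a long.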

  • metricsSpec

    metricsSpec is an array of JSON objects that defines the aggregators. An aggregator typically has the following structure:

    ```json
    {
        "type": <type>,
        "name": <output_name>,
        "fieldName": <metric_name>
    }
    ```

    The following commonly used aggregators are provided:

    | Category | Available types |
    | --- | --- |
    | count | count |
    | sum | longSum, doubleSum, floatSum |
    | min/max | longMin/longMax, doubleMin/doubleMax, floatMin/floatMax |
    | first/last | longFirst/longLast, doubleFirst/doubleLast, floatFirst/floatLast |
    | javascript | javascript |
    | cardinality | cardinality |
    | hyperUnique | hyperUnique |

    Note

    The last three types in the table are advanced aggregators. For information about how to use them, see the official Apache Druid documentation.
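
    For illustration, a metricsSpec that combines a few of the basic aggregators might look like the sketch below. The output names and the input fields (added, delta) are hypothetical. The count aggregator counts ingested rows, so it takes no fieldName.

    ```json
    [
        {"type": "count", "name": "count"},
        {"type": "longSum", "name": "added", "fieldName": "added"},
        {"type": "doubleMax", "name": "maxDelta", "fieldName": "delta"}
    ]
    ```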

  • granularitySpec

    Two aggregation modes are supported: uniform and arbitrary. The uniform mode aggregates data with a fixed interval of time. The arbitrary mode tries to make sure that each of the segments has the same size, but the time interval for aggregation is not fixed. Uniform is the current default option.

    | Key | Format | Description | Required |
    | --- | --- | --- | --- |
    | segmentGranularity | String | Segment granularity. Applies to the uniform type. The default is DAY. | No |
    | queryGranularity | String | Minimum data aggregation granularity for queries. | No |
    | rollup | Boolean | Whether to aggregate data. The default is true. | No |
    | intervals | Array of JSON strings | Time intervals of the data to consume. | Yes for batch, No for real-time |
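
    A uniform granularitySpec for a batch job could be sketched as follows. The values are illustrative assumptions only: the interval is a made-up one-day range, and intervals is written as an array of ISO 8601 interval strings, which is the form Apache Druid accepts.

    ```json
    {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "rollup": true,
        "intervals": ["2015-09-12/2015-09-13"]
    }
    ```

    With these settings, one segment is created per day, and rows are rolled up to hourly buckets.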

ioConfig

ioConfig describes the data source. The following example shows the ioConfig of a Hadoop index job:

```json
{
    "type": "hadoop",
    "inputSpec": {
        "type": "static",
        "paths": "hdfs://emr-header-1.cluster-6789:9000/druid/quickstart/wikiticker-2015-09-16-sampled.json"
    }
}
```

Note

This part is not required for streaming data that is processed through Tranquility.

tuningConfig

tuningConfig contains additional settings. For example, you can specify MapReduce parameters when you use Hadoop to create an index for batch data. The contents of tuningConfig may vary based on the data source.
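
As a rough sketch of the Hadoop case, MapReduce parameters can be passed through the jobProperties map of a Hadoop tuningConfig. The two memory settings below are standard Hadoop properties, but the values are placeholders chosen for illustration, not recommendations from this document.

```json
{
    "type": "hadoop",
    "jobProperties": {
        "mapreduce.map.memory.mb": "2048",
        "mapreduce.reduce.memory.mb": "4096"
    }
}
```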