This section briefly introduces the ingestion spec, the file that describes the data to be indexed.
The ingestion spec is a unified description of the format of the data to be indexed and of how Druid indexes it. It is a JSON file that consists of three parts:
{
"dataSchema" : {...},
"ioConfig" : {...},
"tuningConfig" : {...}
}
Key | Format | Description | Required |
dataSchema | JSON object | Describes the schema of the data you want to consume. The dataSchema is fixed and does not change with the way in which the data is consumed. | Yes |
ioConfig | JSON object | Describes the source and destination of the data you want to consume. The ioConfig differs depending on how the data is consumed. | Yes |
tuningConfig | JSON object | Configures the tuning parameters for consuming the data. The adjustable parameters differ depending on how the data is consumed. | No |
dataSchema
dataSchema describes the format of the data and how to parse the data. The typical structure is as follows:
{
"dataSource": <name_of_dataSource>,
"parser": {
"type": <>,
"parseSpec": {
"format": <>,
"timestampSpec": {},
"dimensionsSpec": {}
}
},
"metricsSpec": [],
"granularitySpec": {}
}
Key | Format | Description | Required |
dataSource | String | Name of the data source. | Yes |
parser | JSON object | How the data is parsed. | Yes |
metricsSpec | Array of JSON objects | Aggregator list. | Yes |
granularitySpec | JSON object | Data aggregation settings, such as creating segments and aggregation granularity. | Yes |
parser
parser determines how your data is parsed. metricsSpec defines how the data is aggregated. granularitySpec defines the granularity of the segments and the granularity of queries.
There are two types of parser: string and hadoopyString. The latter is used for Hadoop index jobs. parseSpec defines how the specific data format is parsed.
Key | Format | Description | Required |
format | String | The data format. It can be json, jsonLowercase, csv, or tsv. | Yes |
timestampSpec | JSON object | The timestamp column and the timestamp format. | Yes |
dimensionsSpec | JSON object | The dimensions of the data, that is, the columns it contains. | Yes |
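As a rough illustration of how these keys fit together, a parser for JSON-formatted records might look like the following. This is only a sketch: the column names timestamp, page, and country are hypothetical and stand in for fields in your own data.

```json
{
  "type": "string",
  "parseSpec": {
    "format": "json",
    "timestampSpec": {
      "column": "timestamp",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": ["page", "country"]
    }
  }
}
```

For a Hadoop index job, the outer type would be the Hadoop parser type instead of string.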
Depending on the data format, additional parseSpec options may exist. The following two tables describe timestampSpec and dimensionsSpec, in that order.
Key | Format | Description | Required |
column | String | The column that contains the timestamp. | Yes |
format | String | The timestamp format. It can be ISO, millis, POSIX, auto, or any format supported by Joda-Time. | Yes |
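For example, if the timestamp is stored as text such as 2015-09-12 08:30:00, a timestampSpec that uses a Joda-Time pattern could be sketched as follows. The column name event_time is a hypothetical placeholder, and the pattern must match how your own data is written.

```json
{
  "column": "event_time",
  "format": "yyyy-MM-dd HH:mm:ss"
}
```

If the timestamps are ISO 8601 strings or epoch milliseconds, setting format to auto is usually the simpler choice.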
Key | Format | Description | Required |
dimensions | JSON array | Describes which dimensions the data contains. Each dimension can be a plain string, or a JSON object that also specifies attributes of the dimension such as its type. For example: "dimensions": ["dimension1", "dimension2", {"type": "long", "name": "dimension3"}]. The type is string by default. | Yes |
dimensionExclusions | Array of JSON strings | Dimensions to be excluded when the data is consumed. | No |
spatialDimensions | Array of JSON objects | Spatial dimensions. | No |
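A dimensionsSpec that mixes plain string dimensions with a typed dimension might look like the following sketch. The dimension names page, country, and userId are hypothetical and not taken from any particular dataset.

```json
{
  "dimensions": [
    "page",
    "country",
    { "type": "long", "name": "userId" }
  ]
}
```

Here page and country are treated as string dimensions because no type is given, while userId is declared as a long.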
metricsSpec
metricsSpec is an array of JSON objects, each of which defines an aggregator. An aggregator typically has the following structure:
```json
{
  "type": <type>,
  "name": <output_name>,
  "fieldName": <metric_name>
}
```
The following commonly used aggregators are provided:
Type | Available types |
count | count |
sum | longSum, doubleSum, floatSum |
min/max | longMin/longMax, doubleMin/doubleMax, floatMin/floatMax |
first/last | longFirst/longLast, doubleFirst/doubleLast, floatFirst/floatLast |
javascript | javascript |
cardinality | cardinality |
hyperUnique | hyperUnique |
Note: The last three types in the table are advanced aggregators. For information about how to use them, see the official Apache Druid documentation.
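To make the structure concrete, a metricsSpec that counts rows and aggregates two numeric columns could be sketched as follows. The input columns added and latency and the output names are hypothetical. The count aggregator takes no fieldName because it counts ingested rows rather than reading a column.

```json
[
  { "type": "count", "name": "count" },
  { "type": "longSum", "name": "added_sum", "fieldName": "added" },
  { "type": "doubleMax", "name": "latency_max", "fieldName": "latency" }
]
```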
granularitySpec
Two aggregation modes are supported: uniform and arbitrary. The uniform mode aggregates data over a fixed time interval. The arbitrary mode tries to keep all segments the same size, so the time interval for aggregation is not fixed. The default mode is uniform.
Key | Format | Description | Required |
segmentGranularity | String | Segment granularity. Applies to the uniform type. The default is DAY. | No |
queryGranularity | String | Minimum data aggregation granularity for queries. | No |
rollup | Boolean | Whether to aggregate the data. The default is true. | No |
intervals | Array of strings | Time intervals of the data to be consumed. | Yes for batch, No for realtime |
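As an illustration, a uniform granularitySpec for a batch job that covers a single day might look like the following. The interval and the granularity values are example choices only, not recommendations. The type key selects the aggregation mode.

```json
{
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "HOUR",
  "rollup": true,
  "intervals": ["2015-09-12/2015-09-13"]
}
```

With these settings, one segment is created per day, and rows are rolled up to hourly granularity when they are ingested.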
ioConfig
ioConfig describes where the data comes from. The following example shows an ioConfig for a Hadoop index job:
{
"type": "hadoop",
"inputSpec": {
"type": "static",
"paths": "hdfs://emr-header-1.cluster-6789:9000/druid/quickstart/wikiticker-2015-09-16-sampled.json"
}
}
This part is not required for streaming data that is processed through Tranquility.
tuningConfig
tuningConfig specifies additional settings. For example, when you use Hadoop to create an index for batch data, you can specify MapReduce parameters here. The contents of tuningConfig vary with the data source.
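For a Hadoop index job, a tuningConfig that passes MapReduce parameters might be sketched as follows. The partition size and memory values are placeholders for illustration and should be adjusted to your cluster. jobProperties is a map of Hadoop configuration properties that are passed to the MapReduce job.

```json
{
  "type": "hadoop",
  "partitionsSpec": {
    "type": "hashed",
    "targetPartitionSize": 5000000
  },
  "jobProperties": {
    "mapreduce.map.memory.mb": "2048",
    "mapreduce.map.java.opts": "-Xmx1536m"
  }
}
```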