The Azure Blob Storage data source lets you read files from Azure Blob Storage and synchronize the data to a destination. This topic covers the supported data types, how to add the data source, and the script parameters for configuring a batch synchronization task.
Supported data types
| Data type | Description |
|---|---|
| STRING | Text |
| LONG | Integer |
| BYTES | Byte array. Text content is read and converted into a UTF-8 encoded byte array. |
| BOOL | Boolean |
| DOUBLE | Floating-point number |
| DATE | Date and time. Supported formats: yyyy-MM-dd HH:mm:ss, yyyy-MM-dd, HH:mm:ss |
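For example (a sketch only; the field order is hypothetical), a CSV row such as `1001,true,3.14,hello,2024-01-01 00:00:00` could be mapped to these types in the Reader's column configuration:

```json
"column": [
  { "index": 0, "type": "long" },
  { "index": 1, "type": "boolean" },
  { "index": 2, "type": "double" },
  { "index": 3, "type": "string" },
  { "index": 4, "type": "date" }
]
```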
Prerequisites
Before you develop a synchronization task, add Azure Blob Storage as a data source in DataWorks. For instructions, see Data source management.
Parameter descriptions are available in the DataWorks console when you add the data source.
Develop a synchronization task
Configure an offline sync task for a single table
Use either the codeless UI or the code editor to configure a synchronization task:
- Codeless UI: Configure a task in codeless UI
- Code editor: Configure a task in the code editor
For all script parameters and a demo script, see the Appendix: Script demo and parameter descriptions section.
Appendix: Script demo and parameter descriptions
Reader script demo
The following script configures a batch synchronization task that reads from Azure Blob Storage using the code editor. All Reader parameters are set under steps[0].parameter.
{
"type": "job",
"version": "2.0",
"steps": [
{
"stepType": "azureblob",
"parameter": {
"datasource": "",
"object": ["f/z/1.csv"],
"fileFormat": "csv",
"encoding": "utf8/gbk/...",
"fieldDelimiter": ",",
"useMultiCharDelimiter": true,
"lineDelimiter": "\n",
"skipHeader": true,
"compress": "zip/gzip",
"column": [
{
"index": 0,
"type": "long"
},
{
"index": 1,
"type": "boolean"
},
{
"index": 2,
"type": "double"
},
{
"index": 3,
"type": "string"
},
{
"index": 4,
"type": "date"
}
]
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "stream",
"parameter": {},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": {
"record": "0"
},
"speed": {
"concurrent": 1
}
},
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
}
}
Reader script parameters
Required parameters
| Parameter | Description | Default |
|---|---|---|
| datasource | The data source name. Must match the name of the data source you added in DataWorks. | None |
| fileFormat | The file format. Valid values: csv, text, parquet, orc. | None |
| object | The file path for CSV and text files. Supports the * wildcard character and array values. Required when fileFormat is csv or text. | None |
| path | The file path for Parquet and ORC files. Supports the * wildcard character and array values. Required when fileFormat is parquet or orc. | None |
| column | The list of columns to read. Each entry requires type (data type) and either index (0-based column position) or value (constant to generate). | All columns as STRING |
Wildcard examples for object and path:
| Pattern | Matches |
|---|---|
| a/b/*.csv | All CSV files directly under a/b/ |
| a/b/1.csv | A single file |
| ["a/b/1.csv", "a/b/2.csv"] | Multiple specific files |
Column configuration examples:
Read all columns as STRING:
"column": ["*"]
Read specific columns with explicit types:
"column": [
{ "type": "long", "index": 0 },
{ "type": "string", "value": "alibaba" }
]
The value field generates a constant column instead of reading from the source file.
Optional parameters
| Parameter | Description | Default |
|---|---|---|
| encoding | The file encoding, for example utf8 or gbk. | utf-8 |
| fieldDelimiter | The field delimiter for reading data. For non-printable characters, use Unicode encoding, for example \u001b. | , (comma) |
| useMultiCharDelimiter | Specifies whether to treat fieldDelimiter as a multi-character delimiter. Set to true to enable multi-character delimiter support. | false |
| lineDelimiter | The row delimiter. Valid only when fileFormat is text. | None |
| compress | The compression type. Valid values: gzip, bzip2, zip. Leave blank for no compression. | No compression |
| nullFormat | Defines which string value represents null. For example, "nullFormat": "null" treats the source string null as a null field. If not set, the source data is written to the destination without conversion. | None |
| skipHeader | Applies to CSV files only. Set to true to skip the header row during synchronization. Not supported for compressed files. | false |
| parquetSchema | Specifies the schema for Parquet files. Valid only when fileFormat is parquet. | None |
| csvReaderConfig | Additional configuration for reading CSV files. Map type. Uses default values if not set. | None |
| maxRetryTimes | The maximum number of retries when a file download fails. Set to 0 to disable retries. Only available in the code editor, not the codeless UI. | 0 |
| retryIntervalSeconds | The retry interval in seconds when a file download fails. Only available in the code editor, not the codeless UI. | 5 |
parquetSchema format:
message <MessageTypeName> {
required/optional <DataType> <ColumnName>;
...;
}
- Set all fields to optional so they can be null.
- Supported data types: BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY (for strings), FIXED_LEN_BYTE_ARRAY.
- Each column definition must end with a semicolon, including the last line.
Example:
"parquetSchema": "message m { optional int32 minute_id; optional int32 dsp_id; optional int32 adx_pid; optional int64 req; optional int64 res; optional int64 suc; optional int64 imp; optional double revenue; }"