
DataWorks: Azure Blob Storage

Last Updated: Mar 26, 2026

The Azure Blob Storage data source lets you read files from Azure Blob Storage and synchronize the data to a destination. This topic covers the supported data types, how to add the data source, and the script parameters for configuring a batch synchronization task.

Supported data types

  • STRING: Text.

  • LONG: Integer.

  • BYTES: Byte array. Text content is read and converted into a UTF-8 encoded byte array.

  • BOOL: Boolean.

  • DOUBLE: Floating-point number.

  • DATE: Date and time. Supported formats: yyyy-MM-dd HH:mm:ss, yyyy-MM-dd, and HH:mm:ss.
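As an illustration of these conversions (a minimal sketch, not the connector's actual code), BYTES encodes text to a UTF-8 byte array, and DATE tries each of the listed patterns in turn:

```python
from datetime import datetime

# BYTES: text content is converted into a UTF-8 encoded byte array.
raw = "héllo".encode("utf-8")

# DATE: the three supported patterns, expressed as strptime formats
# (yyyy-MM-dd HH:mm:ss -> %Y-%m-%d %H:%M:%S, and so on).
DATE_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%H:%M:%S"]

def parse_date(value):
    """Try each supported pattern in turn; raise if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unsupported date format: {value!r}")
```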

Prerequisites

Before you develop a synchronization task, add Azure Blob Storage as a data source in DataWorks. For instructions, see Data source management.

Parameter descriptions are available in the DataWorks console when you add the data source.

Develop a synchronization task

Configure an offline sync task for a single table

Use either the codeless UI or the code editor to configure the synchronization task.

For all script parameters and a demo script, see the Appendix: Script demo and parameter descriptions section.

Appendix: Script demo and parameter descriptions

Reader script demo

The following script configures a batch synchronization task that reads from Azure Blob Storage using the code editor. All Reader parameters are set under steps[0].parameter.

{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "azureblob",
      "parameter": {
        "datasource": "",
        "object": ["f/z/1.csv"],
        "fileFormat": "csv",
        "encoding": "utf8/gbk/...",
        "fieldDelimiter": ",",
        "useMultiCharDelimiter": true,
        "lineDelimiter": "\n",
        "skipHeader": true,
        "compress": "zip/gzip",
        "column": [
          {
            "index": 0,
            "type": "long"
          },
          {
            "index": 1,
            "type": "boolean"
          },
          {
            "index": 2,
            "type": "double"
          },
          {
            "index": 3,
            "type": "string"
          },
          {
            "index": 4,
            "type": "date"
          }
        ]
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "stream",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting": {
    "errorLimit": {
      "record": "0"
    },
    "speed": {
      "concurrent": 1
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}

Reader script parameters

Required parameters

  • datasource: The name of the data source. Must match the name of the Azure Blob Storage data source that you added in DataWorks. No default.

  • fileFormat: The file format. Valid values: csv, text, parquet, and orc. No default.

  • object: The file path for CSV and text files. Supports the * wildcard character and array values. Required when fileFormat is csv or text. No default.

  • path: The file path for Parquet and ORC files. Supports the * wildcard character and array values. Required when fileFormat is parquet or orc. No default.

  • column: The list of columns to read. Each entry requires type (the data type) and either index (the 0-based column position) or value (a constant to generate). Default: all columns are read as STRING.

Wildcard examples for object and path:

  • a/b/*.csv: all CSV files directly under a/b/

  • a/b/1.csv: a single file

  • ["a/b/1.csv", "a/b/2.csv"]: multiple specific files
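The wildcard behavior above can be sketched with a small matcher. This is an illustration that assumes * does not cross directory boundaries, as the examples imply; the connector's real matching logic may differ:

```python
import re

def matches(pattern: str, path: str) -> bool:
    """Translate a connector-style wildcard pattern into a regex in
    which * matches any run of characters except the path separator."""
    regex = "".join("[^/]*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.fullmatch(regex, path) is not None
```

Under this reading, a/b/*.csv matches a/b/1.csv but not a/b/c/1.csv.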

Column configuration examples:

Read all columns as STRING:

"column": ["*"]

Read specific columns with explicit types:

"column": [
  { "type": "long", "index": 0 },
  { "type": "string", "value": "alibaba" }
]

The value field generates a constant column instead of reading from the source file.
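How a column list is applied to a parsed row can be sketched as follows. This illustrates the index/value semantics only; the helper name and logic are not the connector's implementation:

```python
def project_row(row, columns):
    """Build an output record from a parsed CSV row.
    Each column spec carries either an 'index' (read that source
    field by position) or a 'value' (emit that constant)."""
    out = []
    for col in columns:
        if "value" in col:
            out.append(col["value"])       # constant column
        else:
            out.append(row[col["index"]])  # positional column
    return out

columns = [
    {"type": "long", "index": 0},
    {"type": "string", "value": "alibaba"},
]
```

With this configuration, every output record contains the first source field followed by the constant string "alibaba".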

Optional parameters

  • encoding: The file encoding, such as utf8 or gbk. Default: utf-8.

  • fieldDelimiter: The field delimiter for reading data. For non-printable characters, use Unicode encoding, such as \u001b. Default: , (comma).

  • useMultiCharDelimiter: Specifies whether to treat fieldDelimiter as a multi-character delimiter. Set to true to enable multi-character delimiter support. Default: false.

  • lineDelimiter: The row delimiter. Valid only when fileFormat is text. No default.

  • compress: The compression type. Valid values: gzip, bzip2, and zip. Leave empty for no compression. Default: no compression.

  • nullFormat: The string that represents null. For example, "nullFormat": "null" treats the source string null as a null field. If not set, the source data is written to the destination without conversion. No default.

  • skipHeader: Applies to CSV files only. Set to true to skip the header row during synchronization. Not supported for compressed files. Default: false.

  • parquetSchema: The schema for Parquet files. Valid only when fileFormat is parquet. No default.

  • csvReaderConfig: Additional configuration for reading CSV files. Map type. Default values are used if not set. No default.

  • maxRetryTimes: The maximum number of retries when a file download fails. Set to 0 to disable retries. Available only in the code editor, not in the codeless UI. Default: 0.

  • retryIntervalSeconds: The retry interval, in seconds, when a file download fails. Available only in the code editor, not in the codeless UI. Default: 5.
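The two retry parameters combine as in this sketch. The download helper is hypothetical; the actual retry logic is internal to the reader:

```python
import time

def download_with_retry(download, max_retry_times=0, retry_interval_seconds=5):
    """Call download(); on failure, retry up to max_retry_times times,
    sleeping retry_interval_seconds between attempts.
    max_retry_times=0 disables retries (a single attempt)."""
    attempts = max_retry_times + 1
    for attempt in range(attempts):
        try:
            return download()
        except Exception:
            if attempt == attempts - 1:
                raise  # retries exhausted; surface the last error
            time.sleep(retry_interval_seconds)
```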

parquetSchema format:

message <MessageTypeName> {
  required/optional <DataType> <ColumnName>;
  ...;
}
  • Set all fields to optional so they can be null.

  • Supported data types: BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY (for strings), and FIXED_LEN_BYTE_ARRAY.

  • Each column definition must end with a semicolon, including the last line.

Example:

"parquetSchema": "message m { optional int32 minute_id; optional int32 dsp_id; optional int32 adx_pid; optional int64 req; optional int64 res; optional int64 suc; optional int64 imp; optional double revenue; }"