The Azure Blob Storage data source lets you read files from Azure Blob Storage and synchronize the data to a destination. This topic covers the supported data types, how to add the data source, and the script parameters for configuring a batch synchronization task.
Supported data types
| Data type | Description |
|---|---|
| STRING | Text |
| LONG | Integer |
| BYTES | Byte array. Text content is read and converted into a UTF-8 encoded byte array. |
| BOOL | Boolean |
| DOUBLE | Floating-point number |
| DATE | Date and time. Supported formats: yyyy-MM-dd HH:mm:ss, yyyy-MM-dd, HH:mm:ss |
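For example (a sketch only; the field order is hypothetical), a CSV row such as `1001,true,3.14,hello,2024-01-01 00:00:00` could be mapped to these types in the Reader's column configuration:

```json
"column": [
  { "index": 0, "type": "long" },
  { "index": 1, "type": "boolean" },
  { "index": 2, "type": "double" },
  { "index": 3, "type": "string" },
  { "index": 4, "type": "date" }
]
```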
Prerequisites
Before you develop a synchronization task, add Azure Blob Storage as a data source in DataWorks. For instructions, see Data source management.
Parameter descriptions are available in the DataWorks console when you add the data source.
Develop a synchronization task
Configure an offline sync task for a single table
Use either the codeless UI or the code editor to configure a synchronization task:
- Codeless UI: Configure a task in codeless UI
- Code editor: Configure a task in the code editor
For all script parameters and a demo script, see the Appendix: Script demo and parameter descriptions section.
Appendix: Script demo and parameter descriptions
Reader script demo
The following script configures a batch synchronization task that reads from Azure Blob Storage using the code editor. All Reader parameters are set under steps[0].parameter.
{
"type": "job",
"version": "2.0",
"steps": [
{
"stepType": "azureblob",
"parameter": {
"datasource": "",
"object": ["f/z/1.csv"],
"fileFormat": "csv",
"encoding": "utf8/gbk/...",
"fieldDelimiter": ",",
"useMultiCharDelimiter": true,
"lineDelimiter": "\n",
"skipHeader": true,
"compress": "zip/gzip",
"column": [
{
"index": 0,
"type": "long"
},
{
"index": 1,
"type": "boolean"
},
{
"index": 2,
"type": "double"
},
{
"index": 3,
"type": "string"
},
{
"index": 4,
"type": "date"
}
]
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "stream",
"parameter": {},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": {
"record": "0"
},
"speed": {
"concurrent": 1
}
},
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
}
}
Reader script parameters
Required parameters
| Parameter | Description | Default |
|---|---|---|
| datasource | The data source name. Must match the name of the data source you added in DataWorks. | None |
| fileFormat | The file format. Valid values: csv, text, parquet, orc. | None |
| object | The file path for CSV and text files. Supports the * wildcard character and array values. Required when fileFormat is csv or text. | None |
| path | The file path for Parquet and ORC files. Supports the * wildcard character and array values. Required when fileFormat is parquet or orc. | None |
| column | The list of columns to read. Each entry requires type (data type) and either index (0-based column position) or value (constant to generate). | All columns as STRING |
Wildcard examples for object and path:
| Pattern | Matches |
|---|---|
| a/b/*.csv | All CSV files directly under a/b/ |
| a/b/1.csv | A single file |
| ["a/b/1.csv", "a/b/2.csv"] | Multiple specific files |
Column configuration examples:
Read all columns as STRING:
"column": ["*"]
Read specific columns with explicit types:
"column": [
{ "type": "long", "index": 0 },
{ "type": "string", "value": "alibaba" }
]
The value field generates a constant column instead of reading from the source file.
Optional parameters
| Parameter | Description | Default |
|---|---|---|
| encoding | The file encoding, for example utf8 or gbk. | utf-8 |
| fieldDelimiter | The field delimiter for reading data. For non-printable characters, use Unicode encoding, for example \u001b. | , (comma) |
| useMultiCharDelimiter | Specifies whether to treat fieldDelimiter as a multi-character delimiter. Set to true to enable multi-character delimiter support. | false |
| lineDelimiter | The row delimiter. Valid only when fileFormat is text. | None |
| compress | The compression type. Valid values: gzip, bzip2, zip. Leave blank for no compression. | No compression |
| nullFormat | Defines which string value represents null. For example, "nullFormat": "null" treats the source string null as a null field. If not set, the source data is written to the destination without conversion. | None |
| skipHeader | Applies to CSV files only. Set to true to skip the header row during synchronization. Not supported for compressed files. | false |
| parquetSchema | Specifies the schema for Parquet files. Valid only when fileFormat is parquet. | None |
| csvReaderConfig | Additional configuration for reading CSV files. Map type. Uses default values if not set. | None |
| maxRetryTimes | The maximum number of retries when a file download fails. Set to 0 to disable retries. Only available in the code editor, not the codeless UI. | 0 |
| retryIntervalSeconds | The retry interval in seconds when a file download fails. Only available in the code editor, not the codeless UI. | 5 |
parquetSchema format:
message <MessageTypeName> {
required/optional <DataType> <ColumnName>;
...;
}
- Set all fields to optional so they can be null.
- Supported data types: BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY (for strings), FIXED_LEN_BYTE_ARRAY.
- Each column definition must end with a semicolon, including the last line.
Example:
"parquetSchema": "message m { optional int32 minute_id; optional int32 dsp_id; optional int32 adx_pid; optional int64 req; optional int64 res; optional int64 suc; optional int64 imp; optional double revenue; }"