DataWorks Data Integration uses the HTTP protocol to download files from remote endpoints and sync them to a target data source.
Supported resource groups
HttpFile supports the following resource groups:
Supported field types
| Data type | Description |
|---|---|
| STRING | Text. |
| LONG | Integer. |
| BYTES | Byte array. Text content is converted to a UTF-8 encoded byte array. |
| BOOL | Boolean. |
| DOUBLE | Decimal. |
| DATE | Date and time. Supported formats: yyyy-MM-dd HH:mm:ss, yyyy-MM-dd, HH:mm:ss. |
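The three DATE formats in the table map directly onto `strptime` patterns. The following sketch shows one way to pre-validate date values in source data before a sync; it is an illustration, not part of DataWorks itself.

```python
from datetime import datetime

# The three supported DATE formats, expressed as strptime patterns.
DATE_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%H:%M:%S"]

def parse_date(text: str) -> datetime:
    """Try each supported format in turn; raise if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unsupported DATE value: {text!r}")
```

Values that match none of the three formats fail the conversion, so checking them up front avoids dirty-data errors at sync time.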
Supported file formats and compression
| File format | Supported |
|---|---|
| CSV | Yes |
| TEXT (delimited) | Yes |

| Compression | Supported |
|---|---|
| gzip | Yes |
| bzip2 | Yes |
| zip | Yes |
skipHeader is not supported for compressed files.
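Because skipHeader cannot be applied to compressed sources, one workaround is to drop the header row in your own pre- or post-processing after decompression. A minimal sketch, using an in-memory gzip payload to stand in for a downloaded file:

```python
import csv
import gzip
import io

# Simulated compressed source file (gzip-compressed CSV with a header row).
payload = gzip.compress(b"id,name\n1,alice\n2,bob\n")

# Decompress, parse, and drop the first row manually, since skipHeader
# is not available for compressed files.
with gzip.open(io.BytesIO(payload), mode="rt", encoding="utf-8") as f:
    rows = list(csv.reader(f))

header, data = rows[0], rows[1:]
```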
Add a data source
Add the HttpFile data source on the Data Source Management page before creating a synchronization task. For instructions, see Data source management.
Configure a synchronization task
Configure an offline synchronization task
Use the codeless UI or the code editor to configure your task.
For the full script reference and parameter descriptions, see Script reference.
Script reference
Reader script example
The following script reads from a CSV file over HTTP using GET, skips the header row, and maps five columns to different data types.
```json
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "httpfile",
      "parameter": {
        "datasource": "<data-source-name>",
        "fileName": "/data/export.csv",
        "requestMethod": "GET",
        "requestHeaders": {
          "Authorization": "Bearer <token>"
        },
        "socketTimeoutSeconds": 3600,
        "connectTimeoutSeconds": 60,
        "bufferByteSizeInKB": 1024,
        "fileFormat": "csv",
        "encoding": "utf-8",
        "fieldDelimiter": ",",
        "skipHeader": true,
        "compress": "",
        "column": [
          { "index": 0, "type": "long" },
          { "index": 1, "type": "boolean" },
          { "index": 2, "type": "double" },
          { "index": 3, "type": "string" },
          { "index": 4, "type": "date" }
        ]
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "stream",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting": {
    "errorLimit": {
      "record": "0"
    },
    "speed": {
      "concurrent": 1
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}
```
Replace the placeholders with your actual values:
| Placeholder | Description | Example |
|---|---|---|
| <data-source-name> | The name of the HttpFile data source on the Data Source Management page. | my-http-source |
| <token> | Your API authentication token. | eyJhbGc... |
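The job script is plain JSON, so you can lint it before submitting. The following sketch checks that a reader step is present and carries the parameters the example above relies on; it is a local sanity check, not DataWorks validation.

```python
import json

# A trimmed-down job script; in practice, load your full script from a file.
job = json.loads("""
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {"stepType": "httpfile",
     "parameter": {"datasource": "my-http-source",
                   "fileName": "/data/export.csv",
                   "fileFormat": "csv",
                   "fieldDelimiter": ",",
                   "column": [{"index": 0, "type": "long"}]},
     "name": "Reader", "category": "reader"}
  ]
}
""")

# Locate the reader step and list any required parameters it is missing.
reader = next(s for s in job["steps"] if s["category"] == "reader")
missing = [k for k in ("datasource", "fileName", "fieldDelimiter", "column")
           if k not in reader["parameter"]]
```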
Reader parameters
Parameters are grouped by function. Connection parameters define how to reach the endpoint; read behavior parameters control how the file is parsed.
Connection parameters
| Parameter | Description | Required | Default |
|---|---|---|---|
| datasource | The name of the HttpFile data source. Must match exactly the name on the Data Source Management page. | Yes | None |
| fileName | The file path on the HTTP server. URL-encode any special characters or non-ASCII characters. For example, a space in /file/test abc.csv becomes /file/test%20abc.csv. The final request URL combines the data source base URL with this path. For encoding rules, see HTML URL Encoding Reference. | Yes | None |
| requestMethod | The HTTP method. Valid values: GET, POST, PUT. | No | GET |
| requestParam | Query parameters appended to the URL. Takes effect only when requestMethod is GET. URL-encode any special characters. For example, start=2024-03-25 17:06:54 becomes start=2024-03-25%2017:06:54. | No | None |
| requestBody | The request body. Takes effect only when requestMethod is POST or PUT. Pair with Content-Type in requestHeaders. Example: {"requestBody": "{\"a\":\"b\"}", "requestHeaders": {"Content-Type": "application/json"}} | No | None |
| requestHeaders | HTTP request headers as key-value pairs. Example: {"Content-Type": "application/json"} | No | {"User-Agent": "DataX Http File Reader"} |
| connectTimeoutSeconds | How long to wait when establishing an HTTP connection, in seconds. If exceeded, the task fails. Available in Advanced mode only; not configurable in the codeless UI. | No | 60 |
| socketTimeoutSeconds | How long to wait between consecutive data packets, in seconds. If exceeded, the task fails. Available in Advanced mode only; not configurable in the codeless UI. | No | 3600 |
| bufferByteSizeInKB | Download buffer size, in KB. | No | 1024 |
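The URL-encoding rules for fileName and requestParam can be applied with Python's standard library before you paste values into the task configuration. A sketch reproducing the two encoding examples above; the base URL is a hypothetical placeholder:

```python
from urllib.parse import quote, urlencode

base_url = "https://example.com"  # stands in for the data source base URL

# fileName: percent-encode the path; quote() keeps "/" unescaped by default.
file_name = quote("/file/test abc.csv")  # space -> %20

# requestParam: encode the query string, keeping ":" literal as in the docs.
query = urlencode({"start": "2024-03-25 17:06:54"}, safe=":", quote_via=quote)

# The final request URL combines the base URL, the path, and the query.
request_url = f"{base_url}{file_name}?{query}"
```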
Read behavior parameters
| Parameter | Description | Required | Default |
|---|---|---|---|
| fileFormat | Source file format. Valid values: csv, text. Both formats support custom field delimiters. | No | None |
| encoding | File character encoding. | No | utf-8 |
| fieldDelimiter | Field delimiter. For non-printable characters, use the Unicode representation, for example \u001b. | Yes | , |
| useMultiCharDelimiter | Specifies whether the field delimiter is a multi-character string. | No | false |
| lineDelimiter | Line delimiter. Takes effect only when fileFormat is text. | No | None |
| skipHeader | Specifies whether to skip the first row. Set to true for files with a header row. Not supported for compressed files. | No | false |
| compress | Compression format of the source file. Leave blank if the file is uncompressed. Valid values: gzip, bzip2, zip. | No | None (uncompressed) |
| column | List of columns to read. Each entry requires type and either index or value (not both). See Column configuration. | Yes | All columns read as STRING |
| nullFormat | The string in the source file that represents a null value. For example, "nullFormat": "null" treats the string null as null; "nullFormat": "\u0001" treats the non-printable character as null. If not set, source data is written to the destination as-is. | No | None |
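The interaction of fieldDelimiter, skipHeader, and nullFormat can be illustrated with a short parse in plain Python. This is a sketch of the semantics described above, not the reader's actual implementation:

```python
import csv
import io

# A pipe-delimited source with a header row and a "null" marker cell.
raw = "id|name|score\n1|null|3.14\n"
field_delimiter = "|"
null_format = "null"
skip_header = True

rows = list(csv.reader(io.StringIO(raw), delimiter=field_delimiter))
if skip_header:
    rows = rows[1:]  # drop the header row, as skipHeader: true would

# Cells equal to nullFormat are emitted as nulls; everything else as-is.
parsed = [[None if cell == null_format else cell for cell in row]
          for row in rows]
```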
Column configuration
Each entry in the column array uses the following fields:
| Field | Description |
|---|---|
| type | Data type of the column. Required. Valid values: long, boolean, double, string, date. |
| index | Column position in the source file, starting from 0. Specify either index or value, not both. |
| value | A constant value to populate the column with, instead of reading from the source file. Specify either index or value, not both. |
To read all columns as STRING without specifying individual types:
```json
"column": ["*"]
```
To map specific columns with types and inject a constant:
```json
"column": [
  { "type": "long", "index": 0 },
  { "type": "string", "value": "alibaba" }
]
```
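How index-based and value-based entries combine per output record can be sketched in a few lines. This mirrors the semantics described above (indexed cells come from the source row; constant cells come from the entry's value) and is illustrative only:

```python
# Column configuration mirroring the JSON example above.
column = [
    {"type": "long", "index": 0},
    {"type": "string", "value": "alibaba"},
]

def project(source_row, column_config):
    """Build one output record from a parsed source row."""
    out = []
    for col in column_config:
        if "index" in col:
            out.append(source_row[col["index"]])  # read from the source file
        else:
            out.append(col["value"])              # inject the constant
    return out
```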