DataWorks Data Integration uses the HTTP protocol to download files from remote endpoints and sync them to a target data source.
Supported resource groups
HttpFile supports the following resource groups:
Supported field types
| Data type | Description |
|---|---|
| STRING | Text. |
| LONG | Integer. |
| BYTES | Byte array. Text content is converted to a UTF-8 encoded byte array. |
| BOOL | Boolean. |
| DOUBLE | Decimal. |
| DATE | Date and time. Supported formats: yyyy-MM-dd HH:mm:ss, yyyy-MM-dd, HH:mm:ss. |
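The three DATE formats in the table map directly onto `strptime` patterns. The following sketch shows one way to pre-validate date values in source data before a sync; it is an illustration, not part of DataWorks itself.

```python
from datetime import datetime

# The three supported DATE formats, expressed as strptime patterns.
DATE_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%H:%M:%S"]

def parse_date(text: str) -> datetime:
    """Try each supported format in turn; raise if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unsupported DATE value: {text!r}")
```

Values that match none of the three formats fail the conversion, so checking them up front avoids dirty-data errors at sync time.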
Supported file formats and compression
| File format | Supported |
|---|---|
| CSV | Yes |
| TEXT (delimited) | Yes |

| Compression | Supported |
|---|---|
| gzip | Yes |
| bzip2 | Yes |
| zip | Yes |
skipHeader is not supported for compressed files.
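Because skipHeader cannot be applied to compressed sources, one workaround is to drop the header row in your own pre- or post-processing after decompression. A minimal sketch, using an in-memory gzip payload to stand in for a downloaded file:

```python
import csv
import gzip
import io

# Simulated compressed source file (gzip-compressed CSV with a header row).
payload = gzip.compress(b"id,name\n1,alice\n2,bob\n")

# Decompress, parse, and drop the first row manually, since skipHeader
# is not available for compressed files.
with gzip.open(io.BytesIO(payload), mode="rt", encoding="utf-8") as f:
    rows = list(csv.reader(f))

header, data = rows[0], rows[1:]
```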
Add a data source
Add the HttpFile data source on the Data Source Management page before creating a synchronization task. For instructions, see Data source management.
Configure a synchronization task
Configure an offline synchronization task
Use the codeless UI or the code editor to configure your task.
For the full script reference and parameter descriptions, see Script reference.
Script reference
Reader script example
The following script reads from a CSV file over HTTP using GET, skips the header row, and maps five columns to different data types.
```json
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "httpfile",
      "parameter": {
        "datasource": "<data-source-name>",
        "fileName": "/data/export.csv",
        "requestMethod": "GET",
        "requestHeaders": {
          "Authorization": "Bearer <token>"
        },
        "socketTimeoutSeconds": 3600,
        "connectTimeoutSeconds": 60,
        "bufferByteSizeInKB": 1024,
        "fileFormat": "csv",
        "encoding": "utf-8",
        "fieldDelimiter": ",",
        "skipHeader": true,
        "compress": "",
        "column": [
          { "index": 0, "type": "long" },
          { "index": 1, "type": "boolean" },
          { "index": 2, "type": "double" },
          { "index": 3, "type": "string" },
          { "index": 4, "type": "date" }
        ]
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "stream",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting": {
    "errorLimit": {
      "record": "0"
    },
    "speed": {
      "concurrent": 1
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}
```
Replace the placeholders with your actual values:
| Placeholder | Description | Example |
|---|---|---|
| <data-source-name> | The name of the HttpFile data source on the Data Source Management page. | my-http-source |
| <token> | Your API authentication token. | eyJhbGc... |
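The job script is plain JSON, so you can lint it before submitting. The following sketch checks that a reader step is present and carries the parameters the example above relies on; it is a local sanity check, not DataWorks validation.

```python
import json

# A trimmed-down job script; in practice, load your full script from a file.
job = json.loads("""
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {"stepType": "httpfile",
     "parameter": {"datasource": "my-http-source",
                   "fileName": "/data/export.csv",
                   "fileFormat": "csv",
                   "fieldDelimiter": ",",
                   "column": [{"index": 0, "type": "long"}]},
     "name": "Reader", "category": "reader"}
  ]
}
""")

# Locate the reader step and list any required parameters it is missing.
reader = next(s for s in job["steps"] if s["category"] == "reader")
missing = [k for k in ("datasource", "fileName", "fieldDelimiter", "column")
           if k not in reader["parameter"]]
```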
Reader parameters
Parameters are grouped by function. Connection parameters define how to reach the endpoint; read behavior parameters control how the file is parsed.
Connection parameters
| Parameter | Description | Required | Default |
|---|---|---|---|
| datasource | The name of the HttpFile data source. Must match exactly the name on the Data Source Management page. | Yes | None |
| fileName | The file path on the HTTP server. URL-encode any special characters or non-ASCII characters. For example, a space in /file/test abc.csv becomes /file/test%20abc.csv. The final request URL combines the data source base URL with this path. For encoding rules, see HTML URL Encoding Reference. | Yes | None |
| requestMethod | The HTTP method. Valid values: GET, POST, PUT. | No | GET |
| requestParam | Query parameters appended to the URL. Takes effect only when requestMethod is GET. URL-encode any special characters. For example, start=2024-03-25 17:06:54 becomes start=2024-03-25%2017:06:54. | No | None |
| requestBody | The request body. Takes effect only when requestMethod is POST or PUT. Pair with Content-Type in requestHeaders. Example: {"requestBody": "{\"a\":\"b\"}", "requestHeaders": {"Content-Type": "application/json"}} | No | None |
| requestHeaders | HTTP request headers as key-value pairs. Example: {"Content-Type": "application/json"} | No | {"User-Agent": "DataX Http File Reader"} |
| connectTimeoutSeconds | How long to wait when establishing an HTTP connection, in seconds. If exceeded, the task fails. Available in Advanced mode only; not configurable in the codeless UI. | No | 60 |
| socketTimeoutSeconds | How long to wait between consecutive data packets, in seconds. If exceeded, the task fails. Available in Advanced mode only; not configurable in the codeless UI. | No | 3600 |
| bufferByteSizeInKB | Download buffer size, in KB. | No | 1024 |
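The URL-encoding rules for fileName and requestParam can be applied with Python's standard library before you paste values into the task configuration. A sketch reproducing the two encoding examples above; the base URL is a hypothetical placeholder:

```python
from urllib.parse import quote, urlencode

base_url = "https://example.com"  # stands in for the data source base URL

# fileName: percent-encode the path; quote() keeps "/" unescaped by default.
file_name = quote("/file/test abc.csv")  # space -> %20

# requestParam: encode the query string, keeping ":" literal as in the docs.
query = urlencode({"start": "2024-03-25 17:06:54"}, safe=":", quote_via=quote)

# The final request URL combines the base URL, the path, and the query.
request_url = f"{base_url}{file_name}?{query}"
```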
Read behavior parameters
| Parameter | Description | Required | Default |
|---|---|---|---|
| fileFormat | Source file format. Valid values: csv, text. Both formats support custom field delimiters. | No | None |
| encoding | File character encoding. | No | utf-8 |
| fieldDelimiter | Field delimiter. For non-printable characters, use the Unicode representation, for example \u001b. | Yes | , |
| useMultiCharDelimiter | Specifies whether the field delimiter is a multi-character string. | No | false |
| lineDelimiter | Line delimiter. Takes effect only when fileFormat is text. | No | None |
| skipHeader | Specifies whether to skip the first row. Set to true for files with a header row. Not supported for compressed files. | No | false |
| compress | Compression format of the source file. Leave blank if the file is uncompressed. Valid values: gzip, bzip2, zip. | No | None (uncompressed) |
| column | List of columns to read. Each entry requires type and either index or value (not both). See Column configuration. | Yes | All columns read as STRING |
| nullFormat | The string in the source file that represents a null value. For example, "nullFormat": "null" treats the string null as null; "nullFormat": "\u0001" treats the non-printable character as null. If not set, source data is written to the destination as-is. | No | None |
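The interaction of fieldDelimiter, skipHeader, and nullFormat can be illustrated with a short parse in plain Python. This is a sketch of the semantics described above, not the reader's actual implementation:

```python
import csv
import io

# A pipe-delimited source with a header row and a "null" marker cell.
raw = "id|name|score\n1|null|3.14\n"
field_delimiter = "|"
null_format = "null"
skip_header = True

rows = list(csv.reader(io.StringIO(raw), delimiter=field_delimiter))
if skip_header:
    rows = rows[1:]  # drop the header row, as skipHeader: true would

# Cells equal to nullFormat are emitted as nulls; everything else as-is.
parsed = [[None if cell == null_format else cell for cell in row]
          for row in rows]
```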
Column configuration
Each entry in the column array uses the following fields:
| Field | Description |
|---|---|
| type | Data type of the column. Required. Valid values: long, boolean, double, string, date. |
| index | Column position in the source file, starting from 0. Specify either index or value, not both. |
| value | A constant value to populate the column with, instead of reading from the source file. Specify either index or value, not both. |
To read all columns as STRING without specifying individual types:
```json
"column": ["*"]
```
To map specific columns with types and inject a constant:
```json
"column": [
  { "type": "long", "index": 0 },
  { "type": "string", "value": "alibaba" }
]
```
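How index-based and value-based entries combine per output record can be sketched in a few lines. This mirrors the semantics described above (indexed cells come from the source row; constant cells come from the entry's value) and is illustrative only:

```python
# Column configuration mirroring the JSON example above.
column = [
    {"type": "long", "index": 0},
    {"type": "string", "value": "alibaba"},
]

def project(source_row, column_config):
    """Build one output record from a parsed source row."""
    out = []
    for col in column_config:
        if "index" in col:
            out.append(source_row[col["index"]])  # read from the source file
        else:
            out.append(col["value"])              # inject the constant
    return out
```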