You can create a REST API data source and use a synchronization task to read JSON data from a RESTful API and write it to another data source, such as MaxCompute. A REST API data source can also act as a destination that receives data from other data sources. This topic describes the data synchronization capabilities of the REST API data source in DataWorks.
Limitations
- This data source currently supports only Serverless resource groups and exclusive resource groups for Data Integration.
- You cannot configure the request timeout parameter. The built-in request timeout in DataWorks is 60 seconds. If your API query takes longer than 60 seconds to return a response, the task fails.
Supported column types
When you synchronize data to a destination, only a flat, single-level table structure is supported. Nested column structures are not supported. For example, if an API returns a structure like {data: {user: { id: 1, name:'lily'}, value: 123}}, the columns must be flattened into parallel columns such as user_id, user_name, and value in the destination.
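The flattening described above can be illustrated with a short Python sketch. It is an illustration of the mapping only, not the plug-in's implementation; the function name flatten and the underscore separator are assumptions chosen to match the user_id and user_name column names in the example.

```python
def flatten(obj, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining key names with sep."""
    flat = {}
    for key, value in obj.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# The sample structure from the paragraph above.
response = {"data": {"user": {"id": 1, "name": "lily"}, "value": 123}}
row = flatten(response["data"])
print(row)  # {'user_id': 1, 'user_name': 'lily', 'value': 123}
```

Each key of the resulting dictionary corresponds to one parallel column in the destination table.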
| Type | Column type |
| --- | --- |
| Integer | LONG, INT |
| String | STRING |
| Floating-point | DOUBLE, FLOAT |
| Boolean | BOOLEAN |
| Date and time | DATE |
Add a data source
Before you develop a synchronization task in DataWorks, add the required data source to DataWorks by following the instructions in Data source management. When you add the data source, you can view the description of each parameter in the DataWorks console.
Data source authentication
The REST API data source supports the following three authentication methods:
- No Auth: No authentication is required. You can directly access the API. This method is suitable for public APIs that do not require authentication.
- Basic Auth: Authentication is performed by using a username and password. After you select this method, enter the username and password on the configuration page.
- Token Auth: Authentication is performed by using a token. After you select this method, enter the access_token obtained from the third-party API in the token field on the configuration page.
DataWorks does not provide a built-in tool for obtaining third-party API tokens. If your third-party API uses token-based authentication such as OAuth 2.0, you must obtain the access_token from the API provider on your own. The following example shows how to obtain a token by using curl:
curl -X POST https://api.example.com/oauth/token \
-d 'grant_type=client_credentials&client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET'
After you obtain the token, set Authentication Method to Token Auth when you create a REST API data source, and enter the token in the corresponding field.
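If you prefer to script the token request instead of running curl, the following Python sketch builds the same client-credentials request with the standard library. It is a sketch under assumptions: the function names are hypothetical, the URL and credentials are the placeholders from the curl example, and it assumes that the provider returns a JSON body with an access_token field, as is common for OAuth 2.0. Check your provider's documentation for the actual request and response format.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(token_url, client_id, client_secret):
    """Build (but do not send) a form-encoded client-credentials POST request."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(token_url, data=body, method="POST")


def fetch_access_token(token_url, client_id, client_secret):
    """Send the request and return the token.

    Assumption: the response is JSON and contains an "access_token" field.
    """
    request = build_token_request(token_url, client_id, client_secret)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["access_token"]


# Placeholder values from the curl example; replace them with your own.
request = build_token_request(
    "https://api.example.com/oauth/token", "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET"
)
print(request.get_method(), request.full_url)
```

Calling fetch_access_token with real values returns the string that you enter in the token field of the data source.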
Develop a data synchronization task
For the entry point and the procedure for configuring a synchronization task, see the following configuration guides.
Configure a single-table batch synchronization task
- For the procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
- For the complete parameters and script demo in script mode, see Appendix: Script demo and parameter description.
Examples
FAQ
- Can I only specify the number of pagination requests?
  Answer: Yes.
- Is automatic pagination supported? For example, stop paginating when the request returns no data.
  Answer: No. If pagination were automatic, split-based sharding could not be performed.
- If I specify more pagination pages than actually exist, causing empty data for the remaining pages, how does the system handle this?
  Answer: When the remaining pages return empty data, it is equivalent to an SQL query returning no data. The system continues to query the next record.
- Does the system support parsing only one level of JSON data?
  Answer: Yes. Deeper-level parsing is not performed.
- How do I configure a non-array data type for a REST API in DataWorks Data Integration?
  Answer: In the parameter section of the reader, set dataPath to the path that points to the non-array data. For example: dataPath:"data.list". This helps the plug-in correctly locate the data columns you want to read. Then, set dataMode to multiData. This means DataWorks processes the data as multiple individual records, even if they are not in array form in the source data.
  Note: In multiData mode, the column configuration is no longer applicable. Specify the data path directly in dataPath.
  The following is an example of configuring a non-array data type for the REST API in Data Integration:
  "reader": {
      "name": "restapi",
      "parameter": {
          "dataPath": "data.list",
          "dataMode": "multiData"
          // Other parameters
      }
  }
Appendix: Script demo and parameter description
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Script mode configuration. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
Reader script demo
The following is a script example:
{
    "type":"job",
    "version":"2.0",
    "steps":[
        {
            "stepType":"restapi",
            "parameter":{
                "url":"http://127.0.0.1:5000/get_array5",
                "dataMode":"oneData",
                "responseType":"json",
                "column":[
                    {
                        "type":"long",
                        "name":"a.b" //Find data from the a.b path
                    },
                    {
                        "type":"string", //Find data from the a.c path
                        "name":"a.c"
                    }
                ],
                "dirtyData":"null",
                "method":"get",
                "socketTimeout":"60000",
                "defaultHeader":{
                    "X-Custom-Header":"test header"
                },
                "customHeader":{
                    "X-Custom-Header2":"test header2"
                },
                "parameters":"abc=1&def=1"
            },
            "name":"restapireader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":""
        },
        "speed":{
            "throttle":true, //When throttle is set to false, the mbps parameter does not take effect, indicating no throttling. When throttle is set to true, throttling is enabled.
            "concurrent":1, //Job concurrency.
            "mbps":"12" //Throttling. Here 1 mbps = 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
The script mode configuration is described as follows:
After the RESTful API plug-in sends an HTTP(S) request, it receives a response body (the body is a JSON object). The dataPath parameter specifies the JSON path to extract data from the body. Here are two examples:
Example 1: Using the following API response body as an example, the business data is in DATA, and the API returns multiple rows of data at once (DATA is an array):
{
    "HEADER": {
        "BUSID": "bid1",
        "RECID": "uuid",
        "SENDER": "dc",
        "RECEIVER": "pre",
        "DTSEND": "202201250000"
    },
    "DATA": [
        {
            "SERNR": "sernr1"
        },
        {
            "SERNR": "sernr2"
        }
    ]
}
To extract multiple rows of data from DATA as multiple sync records, configure column as "column": [ "SERNR" ], dataMode as "dataMode": "multiData", and dataPath as "dataPath": "DATA".
Example 2: Using the following API response body as an example, the business data is in content.DATA, and the API returns one row of data at a time (DATA is an object):
{
    "HEADER": {
        "BUSID": "bid1",
        "RECID": "uuid",
        "SENDER": "dc",
        "RECEIVER": "pre",
        "DTSEND": "202201250000"
    },
    "content": {
        "DATA": {
            "SERNR": "sernr2"
        }
    }
}
To extract one row of data from content.DATA as a single sync record, configure column as "column": [ "SERNR" ], dataMode as "dataMode": "oneData", and dataPath as "dataPath": "content.DATA".
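The relationship between dataPath, dataMode, and column in the two examples above can be sketched in Python as follows. This is an illustration of the documented behavior under stated assumptions, not the plug-in's code: the helper names resolve_path and extract_records are hypothetical, and the sketch assumes that multiData expects an array at dataPath and oneData expects a single object.

```python
def resolve_path(obj, path):
    """Walk a dot-separated path such as "content.DATA"; return None if missing."""
    current = obj
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_records(body, data_path, data_mode, columns):
    """Return one list of column values per sync record."""
    node = resolve_path(body, data_path) if data_path else body
    if data_mode == "multiData":
        # Assumption: dataPath points at an array, one element per record.
        items = node if isinstance(node, list) else []
    elif data_mode == "oneData":
        # Assumption: dataPath points at a single object, which is one record.
        items = [node] if isinstance(node, dict) else []
    else:
        raise ValueError(f"unsupported dataMode: {data_mode}")
    return [[resolve_path(item, column) for column in columns] for item in items]


multi_body = {
    "HEADER": {"BUSID": "bid1"},
    "DATA": [{"SERNR": "sernr1"}, {"SERNR": "sernr2"}],
}
one_body = {
    "HEADER": {"BUSID": "bid1"},
    "content": {"DATA": {"SERNR": "sernr2"}},
}

print(extract_records(multi_body, "DATA", "multiData", ["SERNR"]))
# [['sernr1'], ['sernr2']]
print(extract_records(one_body, "content.DATA", "oneData", ["SERNR"]))
# [['sernr2']]
```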
Reader script parameters
The following parameters are involved in the process of adding a data source and configuring a Data Integration task node.
The current plug-in does not support scheduling parameters.
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| url | The RESTful API URL. | Yes | N/A |
| dataMode | The format of the JSON data returned by the RESTful API request. | Yes | N/A |
| responseType | The data format of the response. Currently, only the JSON format is supported. | Yes | JSON |
| column | The list of columns to read. The type parameter specifies the data type of the source data, and the name parameter specifies the JSON path from which the current column data is retrieved. You can specify column information as follows: "column":[{"type":"long","name":"a.b"},{"type":"string","name":"a.c"}]. In this example, the first column is retrieved from path a.b and the second column from path a.c. For each column you specify, type and name are required. | Yes | N/A |
| dataPath | The path to a single JSON object or JSON array in the response. | No | N/A |
| method | The request method. GET and POST are supported. | Yes | N/A |
| socketTimeout | The socket timeout for accessing the RESTful API, in milliseconds. | No | 60000 |
| customHeader | The header information passed to the RESTful API. | No | N/A |
| parameters | The parameter information passed to the RESTful API. | No | N/A |
| dirtyData | Specifies how to handle data when no data is found at the specified column JSON path. | Yes | dirty |
| requestTimes | The number of times to request data from the RESTful API. | Yes | single |
| requestParam | When requestTimes is set to multiple, you must specify the loop parameter, such as pageNumber. The plug-in loops through the pageNumber parameter based on the startIndex, endIndex, and step values, and passes it to the RESTful API for multiple requests. | No | N/A |
| startIndex | The start index of the loop requests. The start index is inclusive. | No | N/A |
| endIndex | The end index of the loop requests. The end index is inclusive. | No | N/A |
| step | The step size of the loop requests. | No | N/A |
| authType | The authentication method. | No | N/A |
| authUsername/authPassword | The username and password for Basic Auth authentication. | No | N/A |
| authToken | The token for Token Auth authentication. | No | N/A |
| accessKey/accessSecret | The account information for Alibaba Cloud API signature authentication. | No | N/A |
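To make the loop parameters concrete, the following Python sketch lists the values that a loop parameter such as pageNumber takes for given startIndex, endIndex, and step values, with both indexes inclusive as described in the table. The function names are hypothetical, and appending the loop parameter to the parameters string as a query argument is an assumption about how the value reaches the API, not documented plug-in behavior.

```python
from urllib.parse import urlencode


def loop_values(start_index, end_index, step):
    """Return the loop values, treating both start and end as inclusive."""
    if step <= 0:
        raise ValueError("step must be a positive integer")
    return list(range(start_index, end_index + 1, step))


def build_query_strings(base_parameters, request_param, start_index, end_index, step):
    """Build one query string per request (assumed query-string passing)."""
    queries = []
    for value in loop_values(start_index, end_index, step):
        extra = urlencode({request_param: value})
        queries.append(f"{base_parameters}&{extra}" if base_parameters else extra)
    return queries


print(loop_values(1, 5, 1))
# [1, 2, 3, 4, 5]
print(build_query_strings("abc=1&def=1", "pageNumber", 1, 5, 2))
# ['abc=1&def=1&pageNumber=1', 'abc=1&def=1&pageNumber=3', 'abc=1&def=1&pageNumber=5']
```

Because the number of requests is fixed by these three values, pages beyond the last page of real data are still requested and return empty results.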
Writer script demo
{
    "type":"job",
    "version":"2.0",
    "steps":[
        {
            "stepType":"stream",
            "parameter":{
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"restapi",
            "parameter":{
                "url":"http://127.0.0.1:5000/writer1",
                "dataMode":"oneData",
                "responseType":"json",
                "column":[
                    {
                        "type":"long", //Place column data to path a.b
                        "name":"a.b"
                    },
                    {
                        "type":"string", //Place column data to path a.c
                        "name":"a.c"
                    }
                ],
                "method":"post",
                "defaultHeader":{
                    "X-Custom-Header":"test header"
                },
                "customHeader":{
                    "X-Custom-Header2":"test header2"
                },
                "parameters":"abc=1&def=1",
                "batchSize":256
            },
            "name":"restapiwriter",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0" //The error count.
        },
        "speed":{
            "throttle":true, //If throttle is set to false, the mbps parameter does not take effect, which means throttling is disabled. If throttle is set to true, throttling is enabled.
            "concurrent":1, //The concurrency of the job.
            "mbps":"12" //Throttling. 1 mbps = 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
Writer script parameters
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| url | The RESTful API URL. | Yes | N/A |
| dataMode | The format of the JSON data passed through the RESTful request. | Yes | N/A |
| column | The list of column paths for generating JSON data. The type parameter specifies the data type of the source data, and the name parameter specifies the JSON path where the current column data is placed. You can specify column information as follows: "column":[{"type":"long","name":"a.b"},{"type":"string","name":"a.c"}]. In this example, the first column is placed at path a.b and the second column at path a.c. Note: For each column you specify, type and name are required. | Yes | N/A |
| dataPath | The JSON object path where the data result is placed. | No | N/A |
| method | The request method. POST and PUT are supported. | Yes | N/A |
| customHeader | The header information passed to the RESTful API. | No | N/A |
| authType | The authentication method. | No | N/A |
| authUsername/authPassword | The username and password for Basic Auth authentication. | No | N/A |
| authToken | The token for Token Auth authentication. | No | N/A |
| accessKey/accessSecret | The account information for Alibaba Cloud API signature authentication. | No | N/A |
| batchSize | The maximum number of records per request when dataMode is set to multiData. | Yes | 512 |
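The way the writer's column paths and batchSize shape the request data can be sketched in Python as follows. This is a sketch under assumptions rather than the plug-in's implementation: the function names are hypothetical, and it assumes that oneData sends one JSON object per request while multiData sends an array of at most batchSize objects per request. Wrapping the result under dataPath is omitted.

```python
def set_path(target, path, value):
    """Place value at a dot-separated path such as "a.b", creating parents."""
    keys = path.split(".")
    node = target
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value


def build_object(record, columns):
    """Turn one record (a list of values) into a nested JSON-style object."""
    obj = {}
    for value, column in zip(record, columns):
        set_path(obj, column["name"], value)
    return obj


def build_bodies(records, columns, data_mode, batch_size=512):
    """Return the request bodies (assumed shapes; see the lead-in)."""
    objects = [build_object(record, columns) for record in records]
    if data_mode == "oneData":
        return objects  # one object per request
    if data_mode == "multiData":
        # One array per request, holding at most batch_size objects.
        return [objects[i:i + batch_size] for i in range(0, len(objects), batch_size)]
    raise ValueError(f"unsupported dataMode: {data_mode}")


columns = [{"type": "long", "name": "a.b"}, {"type": "string", "name": "a.c"}]
records = [[1, "x"], [2, "y"], [3, "z"]]

print(build_bodies(records, columns, "oneData")[0])
# {'a': {'b': 1, 'c': 'x'}}
print(len(build_bodies(records, columns, "multiData", batch_size=2)))
# 2
```

With three records and a batch size of 2, the multiData case produces two requests: the first carries two objects and the second carries one.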