Alibaba Cloud Data Lake Formation (DLF) is a fully managed platform that provides unified metadata, data storage, and data management. DLF offers features such as metadata management, storage management, permission management, storage analysis, and storage optimization. You can use DataWorks Data Integration to write data to DLF data sources. This topic describes how to use a DLF data source.
Limits
You can use Data Lake Formation data sources only in Data Integration and only with serverless resource groups.
Create a data source
Go to the Data Sources page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Data Sources.
Click Add Data Source. Search for and select Data Lake Formation. Configure the parameters as described in the following table:
Parameter
Description
Data Source Name
Enter a custom name for the data source. The name must be unique within the workspace, can contain only letters, digits, and underscores (_), and cannot start with a digit or an underscore.
Configuration Mode
Only Alibaba Cloud Instance Mode is supported.
Endpoint
Select the endpoint of the DLF engine instance from the drop-down list.
Access Identity
Select one of the following identities as needed:
Alibaba Cloud Account
Alibaba Cloud RAM User
Alibaba Cloud RAM Role
Note: If you select a RAM user or RAM role, grant it the following permissions:
In the RAM console, attach the system policy AliyunDataWorksDIAccessDLF to the RAM user or RAM role. This grants the RAM permissions that DLF requires to access metadata. For more information, see Grant permissions to a RAM user.
In the Data Lake Formation console, grant the RAM user or RAM role the Data Editor permission on the tables to be synchronized.
DLF Data Catalog
Select a DLF data catalog in the same region as your DataWorks workspace.
Database Name
Select a database in the data catalog.
After you configure the parameters, test the connectivity between the data source and the serverless resource group in the connection configuration section. If the connectivity test is successful, click Complete Creation to create the data source. If the test fails, troubleshoot the issue as described in Network connectivity configuration.
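The naming rules for Data Source Name can be expressed as a short pattern. The following is an illustrative sketch, not part of any Alibaba Cloud SDK; the regex is derived from the stated rules (letters, digits, and underscores only; the first character must be a letter), and it cannot check workspace-level uniqueness:

```python
import re

# Hypothetical helper (not an official validator): checks a candidate data
# source name against the rules stated above. The first character must be a
# letter; the rest may be letters, digits, or underscores.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_valid_datasource_name(name: str) -> bool:
    return NAME_PATTERN.fullmatch(name) is not None

print(is_valid_datasource_name("dlf_source_01"))  # True
print(is_valid_datasource_name("1_dlf_source"))   # False: starts with a digit
print(is_valid_datasource_name("_dlf_source"))    # False: starts with an underscore
```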
Create a data integration task
You can use a Data Lake Formation data source in a DataWorks data integration task. For more information, see Synchronize data to Data Lake Formation.
Appendix: Script examples and parameter descriptions
Configure an offline task script
If you use the code editor to configure an offline task, you must add the parameters to the task script in the standard format. For more information, see Configure a task in the code editor. The following sections describe the data source parameters for the code editor.
Reader script example
{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "dlf",
            "parameter": {
                "datasource": "guxuan_dlf",
                "table": "auto_ob_3088545_0523",
                "column": [
                    "id",
                    "col1",
                    "col2",
                    "col3"
                ],
                "where": "id > 1"
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "stream",
            "parameter": {
                "print": false
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": "" // The number of error records.
        },
        "speed": {
            "throttle": true, // If set to false, the mbps parameter does not take effect, which means the rate is not limited. If set to true, the rate is limited.
            "concurrent": 20, // The job concurrency.
            "mbps": "12" // The rate limit. 1 mbps = 1 MB/s.
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
Reader script parameters
Parameter | Description | Required |
datasource | The name of the DLF data source. | Yes |
table | The name of the table to read. | Yes |
column | The names of the columns to be synchronized. | Yes |
where | The filter condition used to select source data, such as id > 1. | No |
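For reference, the reader step above can be assembled programmatically before being pasted into the code editor. This is an illustrative sketch, not an official DataWorks API; the helper name is my own, and the parameters match the table above (datasource, table, and column required; where optional):

```python
import json

# Hypothetical helper (not part of DataWorks): builds a DLF reader step from
# the parameters listed in the table above.
def dlf_reader_step(datasource, table, column, where=None):
    parameter = {"datasource": datasource, "table": table, "column": column}
    if where is not None:
        # "where" is optional, so it is added only when a filter is given.
        parameter["where"] = where
    return {
        "stepType": "dlf",
        "parameter": parameter,
        "name": "Reader",
        "category": "reader",
    }

step = dlf_reader_step("guxuan_dlf", "auto_ob_3088545_0523",
                       ["id", "col1", "col2", "col3"], where="id > 1")
print(json.dumps(step, indent=2))
```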
Writer script example
{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "dlf",
            "parameter": {
                "datasource": "guxuan_dlf",
                "column": [
                    "id",
                    "col1",
                    "col2",
                    "col3"
                ],
                "table": "auto_ob_3088545_0523"
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": "" // The number of error records.
        },
        "speed": {
            "throttle": true, // If set to false, the mbps parameter does not take effect, which means the rate is not limited. If set to true, the rate is limited.
            "concurrent": 20, // The job concurrency.
            "mbps": "12" // The rate limit. 1 mbps = 1 MB/s.
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
Writer script parameters
Parameter | Description | Required | Default Value |
datasource | The name of the DLF data source. | Yes | None |
table | The name of the table to write to. | Yes | None |
column | The names of the columns to be synchronized. | Yes | None |
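When editing job scripts by hand, a quick sanity check against the required writer parameters in the table above can catch omissions before the task is submitted. This is an illustrative sketch, not an official DataWorks tool; the function name and the check logic are my own:

```python
# Hypothetical check (not part of DataWorks): verifies that a DLF writer step
# carries all of the required parameters listed in the table above.
REQUIRED_WRITER_PARAMS = ("datasource", "table", "column")

def missing_writer_params(step):
    parameter = step.get("parameter", {})
    return [key for key in REQUIRED_WRITER_PARAMS if key not in parameter]

# Example: a writer step that forgot to list the columns.
writer = {
    "stepType": "dlf",
    "parameter": {"datasource": "guxuan_dlf", "table": "auto_ob_3088545_0523"},
    "name": "Writer",
    "category": "writer",
}
print(missing_writer_params(writer))  # ['column']
```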