DataWorks provides LogHub Reader and LogHub Writer for you to read data from and write data to Simple Log Service data sources. This topic describes the capabilities of synchronizing data from or to Simple Log Service data sources.
Limits
When you use DataWorks Data Integration to run batch synchronization tasks to write data to Simple Log Service, Simple Log Service does not ensure idempotence. If you rerun a failed task, redundant data may be generated.
Data types
The following table provides the support status of main data types in Simple Log Service.
| Data type | LogHub Reader for batch data read | LogHub Writer for batch data write | LogHub Reader for real-time data read |
| --- | --- | --- | --- |
| STRING | Supported | Supported | Supported |
LogHub Writer for batch data write
LogHub Writer converts the data types supported by Data Integration to STRING before data is written to Simple Log Service. The following table lists the data type mappings based on which LogHub Writer converts data types.
| Data Integration data type | Simple Log Service data type |
| --- | --- |
| LONG | STRING |
| DOUBLE | STRING |
| STRING | STRING |
| DATE | STRING |
| BOOLEAN | STRING |
| BYTES | STRING |
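To make the mapping above concrete, here is a minimal Python sketch of how a writer could stringify each Data Integration type before sending it. The function name and the exact rendering of BOOLEAN and DATE values are illustrative assumptions, not the plug-in's actual implementation:

```python
from datetime import datetime

def to_loghub_string(value):
    """Illustrative stand-in for the type-to-STRING conversion:
    every supported Data Integration type becomes a STRING."""
    if isinstance(value, bytes):        # BYTES
        return value.decode("utf-8", errors="replace")
    if isinstance(value, bool):         # BOOLEAN (check before numeric types)
        return "true" if value else "false"
    if isinstance(value, datetime):     # DATE
        return value.strftime("%Y-%m-%d %H:%M:%S")
    return str(value)                   # LONG, DOUBLE, STRING

row = [42, 3.14, "text", datetime(2018, 1, 11, 1, 30), True, b"raw"]
print([to_loghub_string(v) for v in row])
```

The bool check must precede the generic fallback because `bool` is a subclass of `int` in Python.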
LogHub Reader for real-time data read
The following table describes the metadata fields that LogHub Reader for real-time data synchronization provides.
| Field provided by LogHub Reader for real-time data synchronization | Data type | Description |
| --- | --- | --- |
| __time__ | STRING | A reserved field of Simple Log Service. The field specifies the time when logs are written to Simple Log Service. The field value is a UNIX timestamp in seconds. |
| __source__ | STRING | A reserved field of Simple Log Service. The field specifies the source device from which logs are collected. |
| __topic__ | STRING | A reserved field of Simple Log Service. The field specifies the name of the topic for logs. |
| __tag__:__receive_time__ | STRING | The time when logs arrive at the server. If you enable the public IP address recording feature, this field is added to each raw log when the server receives the logs. The field value is a UNIX timestamp in seconds. |
| __tag__:__client_ip__ | STRING | The public IP address of the source device. If you enable the public IP address recording feature, this field is added to each raw log when the server receives the logs. |
| __tag__:__path__ | STRING | The path of the log file collected by Logtail. Logtail automatically adds this field to logs. |
| __tag__:__hostname__ | STRING | The hostname of the device from which Logtail collects data. Logtail automatically adds this field to logs. |
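Because `__time__` and `__tag__:__receive_time__` arrive as STRING values holding UNIX timestamps in seconds, downstream code typically converts them back to datetimes. A sketch in Python, using the field names from the table above (the sample record values are invented for illustration):

```python
from datetime import datetime, timezone

record = {  # sample values for illustration only
    "__time__": "1515634200",
    "__source__": "192.0.2.10",
    "__topic__": "nginx-access",
    "__tag__:__receive_time__": "1515634205",
}

def parse_unix_seconds(field_value):
    """Convert a STRING UNIX timestamp (seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(field_value), tz=timezone.utc)

written_at = parse_unix_seconds(record["__time__"])
received_at = parse_unix_seconds(record["__tag__:__receive_time__"])
print(written_at.isoformat(), (received_at - written_at).total_seconds())
```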
Develop a data synchronization task
For information about the entry point for and the procedure of configuring a data synchronization task, see the following sections. For information about the parameter settings, view the infotip of each parameter on the configuration tab of the task.
Add a data source
Before you configure a data synchronization task to synchronize data from or to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources.
Configure a batch synchronization task to synchronize data of a single table
For more information about the configuration procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
For information about all parameters that are configured and the code that is run when you use the code editor to configure a batch synchronization task, see Appendix: Code and parameters.
Configure a real-time synchronization task to synchronize data of a single table
For more information about the configuration procedure, see Create a real-time synchronization task to synchronize incremental data from a single table and Configure a real-time synchronization task in DataStudio.
Configure synchronization settings to implement batch synchronization of all data in a database, real-time synchronization of full or incremental data in a database, and real-time synchronization of data from sharded tables in a sharded database
For more information about the configuration procedure, see Configure a synchronization task in Data Integration.
FAQ
For more information, see FAQ about Data Integration.
Appendix: Code and parameters
Appendix: Configure a batch synchronization task by using the code editor
If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader and writer of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader and writer in the code editor.
Code for LogHub Reader
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"LogHub",// The plug-in name.
"parameter":{
"datasource":"",// The name of the data source.
"column":[// The names of the columns.
"col0",
"col1",
"col2",
"col3",
"col4",
"C_Category",
"C_Source",
"C_Topic", // The log topic.
"C_MachineUUID", // The unique identifier of the host.
"C_HostName", // The hostname.
"C_Path", // The path of the log file.
"C_LogTime" // The log time.
],
"beginDateTime":"",// The start time of data consumption.
"batchSize":"",// The number of data entries that are queried at a time.
"endDateTime":"",// The end time of data consumption.
"fieldDelimiter":",",// The column delimiter.
"logstore":""// The name of the Logstore.
},
"name":"Reader",
"category":"reader"
},
{
"stepType":"stream",
"parameter":{},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The maximum number of dirty data records allowed.
},
"speed":{
"throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"concurrent":1,// The maximum number of parallel threads.
"mbps":"12"// The maximum transmission rate. Unit: MB/s.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
Parameters in code for LogHub Reader
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| endPoint | The endpoint of Simple Log Service. The endpoint is a URL that you can use to access the project and the log data in the project. The endpoint varies based on the project name and the Alibaba Cloud region where the project resides. For more information about the endpoints of Simple Log Service in each region, see Endpoints. | Yes | No default value |
| accessId | The AccessKey ID of the Alibaba Cloud account that is used to access the Simple Log Service project. | Yes | No default value |
| accessKey | The AccessKey secret of the Alibaba Cloud account that is used to access the Simple Log Service project. | Yes | No default value |
| project | The name of the Simple Log Service project. A project is the basic unit for managing resources in Simple Log Service. Projects are used to isolate resources and control access to the resources. | Yes | No default value |
| logstore | The name of the Logstore. A Logstore is a basic unit that you can use to collect, store, and query log data in Simple Log Service. | Yes | No default value |
| batchSize | The number of data entries to read from Simple Log Service at a time. | No | 128 |
| column | The names of the columns. You can set this parameter to the metadata in Simple Log Service. Supported metadata includes the log topic, unique identifier of the host, hostname, path, and log time. Note Column names are case-sensitive. For more information about column names in Simple Log Service, see Introduction. | Yes | No default value |
| beginDateTime | The start time of data consumption. The value is the time at which log data arrives at Simple Log Service. This parameter defines the left boundary of a left-closed, right-open interval in the format of yyyyMMddHHmmss, such as 20180111013000. This parameter can work with the scheduling parameters in DataWorks. For example, if you enter beginDateTime=${yyyymmdd} in the Parameters field on the Properties tab, you can set Start Timestamp to ${beginDateTime}000000 on the task configuration tab to consume logs that are generated from 00:00:00 of the data timestamp. For more information, see Supported formats of scheduling parameters. Note The beginDateTime and endDateTime parameters must be used in pairs. | Yes | No default value |
| endDateTime | The end time of data consumption. This parameter defines the right boundary of a left-closed, right-open interval in the format of yyyyMMddHHmmss, such as 20180111013010. This parameter can work with the scheduling parameters in DataWorks. For example, if you enter endDateTime=${yyyymmdd} in the Parameters field on the Properties tab, you can set End Timestamp to ${endDateTime}000000 on the task configuration tab to consume logs that are generated until 00:00:00 of the next day of the data timestamp. For more information, see Supported formats of scheduling parameters. Note The time that is specified by the endDateTime parameter of the previous interval cannot be earlier than the time that is specified by the beginDateTime parameter of the current interval. Otherwise, data in some regions may not be read. | Yes | No default value |
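The yyyyMMddHHmmss interval boundaries described above can be derived from a data timestamp programmatically. A sketch, assuming a daily task that consumes one full day of logs as a left-closed, right-open interval (the date string stands in for the ${yyyymmdd} scheduling parameter):

```python
from datetime import datetime, timedelta

def daily_interval(data_date: str):
    """Build beginDateTime/endDateTime values (yyyyMMddHHmmss) for one
    data day. The interval is left-closed, right-open:
    [00:00:00 of the day, 00:00:00 of the next day)."""
    day = datetime.strptime(data_date, "%Y%m%d")
    begin = day.strftime("%Y%m%d%H%M%S")              # ${beginDateTime}000000
    end = (day + timedelta(days=1)).strftime("%Y%m%d%H%M%S")
    return begin, end

begin, end = daily_interval("20180111")
print(begin, end)  # 20180111000000 20180112000000
```

Because consecutive intervals share a boundary, back-to-back daily runs neither skip nor double-read logs, which is exactly the pairing constraint the Note above describes.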
Code for LogHub Writer
{
"type": "job",
"version": "2.0",// The version number.
"steps": [
{
"stepType": "stream",
"parameter": {},
"name": "Reader",
"category": "reader"
},
{
"stepType":"LogHub",// The plug-in name.
"parameter": {
"datasource": "",// The name of the data source.
"column": [// The names of the columns.
"col0",
"col1",
"col2",
"col3",
"col4",
"col5"
],
"topic": "",// The name of the topic.
"batchSize": "1024",// The number of data records to write at a time.
"logstore": ""// The name of the Logstore.
},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": {
"record": ""// The maximum number of dirty data records allowed.
},
"speed": {
"throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"concurrent":3, // The maximum number of parallel threads.
"mbps":"12"// The maximum transmission rate. Unit: MB/s.
}
},
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
}
}
Parameters in code for LogHub Writer
LogHub Writer obtains data from a reader and converts the data types supported by Data Integration into STRING. If the number of data records reaches the value specified for the batchSize parameter, LogHub Writer sends the data records to Simple Log Service at a time by using Simple Log Service SDK for Java.
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| endpoint | The endpoint of Simple Log Service. The endpoint is a URL that you can use to access the project and the log data in the project. The endpoint varies based on the project name and the Alibaba Cloud region where the project resides. For more information about the endpoints of Simple Log Service in each region, see Endpoints. | Yes | No default value |
| accessKeyId | The AccessKey ID of the Alibaba Cloud account that is used to access the Simple Log Service project. | Yes | No default value |
| accessKeySecret | The AccessKey secret of the Alibaba Cloud account that is used to access the Simple Log Service project. | Yes | No default value |
| project | The name of the Simple Log Service project. | Yes | No default value |
| logstore | The name of the Logstore. A Logstore is a basic unit that you can use to collect, store, and query log data in Simple Log Service. | Yes | No default value |
| topic | The name of the topic. | No | Empty string |
| batchSize | The number of data records to write to Simple Log Service at a time. Default value: 1024. Maximum value: 4096. Note The size of the data to write to Simple Log Service at a time cannot exceed 5 MB. You can change the value of this parameter based on the size of a single data record. | No | 1024 |
| column | The names of columns in each data record. | Yes | No default value |
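The batching behavior described above (flush every batchSize records, with a 5 MB cap per request) can be mimicked client-side when you size batches yourself. A simplified Python sketch, not the plug-in's actual implementation:

```python
MAX_BATCH_RECORDS = 1024            # default batchSize; hard cap is 4096
MAX_BATCH_BYTES = 5 * 1024 * 1024  # per-request size limit: 5 MB

def batch_records(records, batch_size=MAX_BATCH_RECORDS):
    """Group stringified records into batches that respect both the
    record-count limit and the per-request byte-size limit."""
    batches, current, current_bytes = [], [], 0
    for rec in records:
        size = len(rec.encode("utf-8"))
        if current and (len(current) >= batch_size
                        or current_bytes + size > MAX_BATCH_BYTES):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(rec)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

print(len(batch_records(["x"] * 2500)))  # 3 batches: 1024 + 1024 + 452
```

This is why the table suggests lowering batchSize when individual records are large: with big records the byte cap, not the record count, determines when a flush happens.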