DataWorks: Simple Log Service data source

Last Updated: Nov 08, 2023

DataWorks provides LogHub Reader and LogHub Writer for you to read data from and write data to Simple Log Service data sources. This topic describes the capabilities of synchronizing data from or to Simple Log Service data sources.

Limits

When you use DataWorks Data Integration to run batch synchronization tasks that write data to Simple Log Service, the writes are not idempotent: if you rerun a failed task, duplicate data may be generated.

Data types

The following table provides the support status of main data types in Simple Log Service.

| Data type | LogHub Reader for batch data read | LogHub Writer for batch data write | LogHub Reader for real-time data read |
| --- | --- | --- | --- |
| STRING | Supported | Supported | Supported |

  • LogHub Writer for batch data write

    LogHub Writer converts the data types supported by Data Integration to STRING before data is written to Simple Log Service. The following table lists the data type mappings that LogHub Writer uses for the conversion; a brief conversion sketch follows the table.

    | Data Integration data type | Simple Log Service data type |
    | --- | --- |
    | LONG | STRING |
    | DOUBLE | STRING |
    | STRING | STRING |
    | DATE | STRING |
    | BOOLEAN | STRING |
    | BYTES | STRING |
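    The sketch below is a minimal Python illustration of this mapping, not the plug-in's actual code. The exact DATE format and BYTES decoding that LogHub Writer applies are assumptions here.

    from datetime import datetime

    def to_sls_string(value):
        # Check BOOLEAN first: bool is a subclass of int in Python.
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, (int, float)):            # LONG / DOUBLE
            return str(value)
        if isinstance(value, datetime):                # DATE (format assumed)
            return value.strftime("%Y-%m-%d %H:%M:%S")
        if isinstance(value, (bytes, bytearray)):      # BYTES (decoding assumed)
            return bytes(value).decode("utf-8", errors="replace")
        return str(value)                              # STRING

    print(to_sls_string(42), to_sls_string(3.14), to_sls_string(True))  # 42 3.14 true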

  • LogHub Reader for real-time data read

    The following table describes the metadata fields that LogHub Reader provides during real-time data synchronization.

    | Field | Data type | Description |
    | --- | --- | --- |
    | __time__ | STRING | A reserved field of Simple Log Service. The field specifies the time when logs are written to Simple Log Service. The field value is a UNIX timestamp in seconds. |
    | __source__ | STRING | A reserved field of Simple Log Service. The field specifies the source device from which logs are collected. |
    | __topic__ | STRING | A reserved field of Simple Log Service. The field specifies the name of the topic for logs. |
    | __tag__:__receive_time__ | STRING | The time when logs arrive at the server. If you enable the public IP address recording feature, this field is added to each raw log when the server receives the logs. The field value is a UNIX timestamp in seconds. |
    | __tag__:__client_ip__ | STRING | The public IP address of the source device. If you enable the public IP address recording feature, this field is added to each raw log when the server receives the logs. |
    | __tag__:__path__ | STRING | The path of the log file collected by Logtail. Logtail automatically adds this field to logs. |
    | __tag__:__hostname__ | STRING | The hostname of the device from which Logtail collects data. Logtail automatically adds this field to logs. |
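    The __time__ and __tag__:__receive_time__ values arrive as STRING-typed UNIX timestamps in seconds. A minimal downstream sketch of converting them back to datetimes (the record dict is a hypothetical representation of one synchronized row, not a DataWorks API):

    from datetime import datetime, timezone

    record = {"__time__": "1699430400", "__tag__:__receive_time__": "1699430402"}

    for field in ("__time__", "__tag__:__receive_time__"):
        seconds = int(record[field])  # STRING value -> integer seconds
        print(field, datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat())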

Develop a data synchronization task

For information about the entry point for and the procedure of configuring a data synchronization task, see the following sections. For information about the parameter settings, view the infotip of each parameter on the configuration tab of the task.

Add a data source

Before you configure a data synchronization task to synchronize data from or to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources.

Configure a batch synchronization task to synchronize data of a single table

For more information about the configuration procedure, see Configure a batch synchronization task by using the codeless UI.

Configure a real-time synchronization task to synchronize data of a single table

For more information about the configuration procedure, see Create a real-time synchronization task to synchronize incremental data from a single table and Configure a real-time synchronization task in DataStudio.

Configure synchronization settings to implement batch synchronization of all data in a database, real-time synchronization of full data or incremental data in a database, and real-time synchronization of data from sharded tables in a sharded database

For more information about the configuration procedure, see Configure a synchronization task in Data Integration.

FAQ

For more information, see FAQ about Data Integration.

Appendix: Code and parameters

Configure a batch synchronization task by using the code editor

If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader and writer of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader and writer in the code editor.

Code for LogHub Reader

{
 "type":"job",
 "version":"2.0",// The version number. 
 "steps":[
     {
         "stepType":"LogHub",// The plug-in name. 
         "parameter":{
             "datasource":"",// The name of the data source. 
             "column":[// The names of the columns. 
                 "col0",
                 "col1",
                 "col2",
                 "col3",
                 "col4",
                 "C_Category",
                 "C_Source",
                 "C_Topic",
                 "C_MachineUUID", // The log topic. 
                 "C_HostName", // The hostname. 
                 "C_Path", // The path. 
                 "C_LogTime" // The time when the event occurred. 
             ],
             "beginDateTime":"",// The start time of data consumption. 
             "batchSize":"",// The number of data entries that are queried at a time. 
             "endDateTime":"",// The end time of data consumption. 
             "fieldDelimiter":",",// The column delimiter. 
             "logstore":""// The name of the Logstore. 
         },
         "name":"Reader",
         "category":"reader"
     },
     { 
         "stepType":"stream",
         "parameter":{},
         "name":"Writer",
         "category":"writer"
     }
 ],
 "setting":{
     "errorLimit":{
         "record":"0"// The maximum number of dirty data records allowed. 
     },
     "speed":{
         "throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":1 // The maximum number of parallel threads. 
            "mbps":"12",// The maximum transmission rate. Unit: MB/s. 
     }
 },
 "order":{
     "hops":[
         {
             "from":"Reader",
             "to":"Writer"
         }
     ]
 }
}

Parameters in code for LogHub Reader

| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| endPoint | The endpoint of Simple Log Service. The endpoint is a URL that you can use to access the project and the log data in the project. The endpoint varies based on the project name and the Alibaba Cloud region where the project resides. For more information about the endpoints of Simple Log Service in each region, see Endpoints. | Yes | No default value |
| accessId | The AccessKey ID of the Alibaba Cloud account that is used to access the Simple Log Service project. | Yes | No default value |
| accessKey | The AccessKey secret of the Alibaba Cloud account that is used to access the Simple Log Service project. | Yes | No default value |
| project | The name of the Simple Log Service project. A project is the basic unit for managing resources in Simple Log Service. Projects are used to isolate resources and control access to the resources. | Yes | No default value |
| logstore | The name of the Logstore. A Logstore is a basic unit that you can use to collect, store, and query log data in Simple Log Service. | Yes | No default value |
| batchSize | The number of data entries to read from Simple Log Service at a time. | No | 128 |
| column | The names of the columns. You can set this parameter to the metadata in Simple Log Service. Supported metadata includes the log topic, unique identifier of the host, hostname, path, and log time. Note: Column names are case-sensitive. For more information about column names in Simple Log Service, see Introduction. | Yes | No default value |
| beginDateTime | The start time of data consumption: the time at which log data arrives at Simple Log Service. This parameter defines the left boundary of a left-closed, right-open interval in the format of yyyyMMddHHmmss, such as 20180111013000, and can work with the scheduling parameters in DataWorks. For example, if you enter beginDateTime=${yyyymmdd-1} in the Parameters field on the Properties tab, you can set Start Timestamp to ${beginDateTime}000000 on the task configuration tab to consume logs that are generated from 00:00:00 of the data timestamp. For more information, see Supported formats of scheduling parameters. Note: The beginDateTime and endDateTime parameters must be used in pairs. | Yes | No default value |
| endDateTime | The end time of data consumption. This parameter defines the right boundary of a left-closed, right-open interval in the format of yyyyMMddHHmmss, such as 20180111013010, and can work with the scheduling parameters in DataWorks. For example, if you enter endDateTime=${yyyymmdd} in the Parameters field on the Properties tab, you can set End Timestamp to ${endDateTime}000000 on the task configuration tab to consume logs that are generated until 00:00:00 of the day after the data timestamp. For more information, see Supported formats of scheduling parameters. Note: The endDateTime of the previous interval cannot be earlier than the beginDateTime of the current interval. Otherwise, data in some time ranges may not be read. | Yes | No default value |
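How the scheduling parameters expand into a consumption window can be sketched as follows. This is a minimal Python illustration of the left-closed, right-open window for a daily instance, assuming the data timestamp (bizdate) is the day before the run date; it is not DataWorks' actual parameter-substitution logic.

from datetime import datetime, timedelta

bizdate = datetime(2018, 1, 11)  # hypothetical data timestamp of a daily instance

# [beginDateTime, endDateTime) in the yyyyMMddHHmmss format:
begin_datetime = bizdate.strftime("%Y%m%d") + "000000"                      # 00:00:00 of the data timestamp
end_datetime = (bizdate + timedelta(days=1)).strftime("%Y%m%d") + "000000"  # 00:00:00 of the next day

assert begin_datetime < end_datetime  # the pair must describe a non-empty window
print(begin_datetime, end_datetime)   # 20180111000000 20180112000000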

Code for LogHub Writer

{
    "type": "job",
    "version": "2.0",// The version number. 
    "steps": [
        { 
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType":"LogHub",// The plug-in name. 
            "parameter": {
                "datasource": "",// The name of the data source. 
                "column": [// The names of the columns. 
                    "col0",
                    "col1",
                    "col2",
                    "col3",
                    "col4",
                    "col5"
                ],
                "topic": "",// The name of the topic. 
                "batchSize": "1024",// The number of data records to write at a time. 
                "logstore": ""// The name of the Logstore. 
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": ""// The maximum number of dirty data records allowed. 
        },
        "speed": {
            "throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":3, // The maximum number of parallel threads. 
            "mbps":"12"// The maximum transmission rate. Unit: MB/s. 
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}

Parameters in code for LogHub Writer

Note

LogHub Writer obtains data from a reader and converts the data types supported by Data Integration into STRING. When the number of accumulated data records reaches the value specified by the batchSize parameter, LogHub Writer sends the records to Simple Log Service in a single request by using Simple Log Service SDK for Java.

| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| endpoint | The endpoint of Simple Log Service. The endpoint is a URL that you can use to access the project and the log data in the project. The endpoint varies based on the project name and the Alibaba Cloud region where the project resides. For more information about the endpoints of Simple Log Service in each region, see Endpoints. | Yes | No default value |
| accessKeyId | The AccessKey ID of the Alibaba Cloud account that is used to access the Simple Log Service project. | Yes | No default value |
| accessKeySecret | The AccessKey secret of the Alibaba Cloud account that is used to access the Simple Log Service project. | Yes | No default value |
| project | The name of the Simple Log Service project. | Yes | No default value |
| logstore | The name of the Logstore. A Logstore is a basic unit that you can use to collect, store, and query log data in Simple Log Service. | Yes | No default value |
| topic | The name of the topic. | No | Empty string |
| batchSize | The number of data records to write to Simple Log Service at a time. Maximum value: 4096. Note: The size of the data to write to Simple Log Service at a time cannot exceed 5 MB. Adjust this value based on the size of a single data record. | No | 1,024 |
| column | The names of columns in each data record. | Yes | No default value |
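Taken together, the batchSize cap (4,096 records) and the 5 MB per-request limit suggest an accumulate-and-flush pattern. The sketch below is our Python illustration of that pattern, not the Java-based writer itself; send_to_sls is a hypothetical stand-in for the actual Simple Log Service write call, and the per-record size estimate is approximate.

MAX_RECORDS = 1024              # batchSize (maximum allowed value: 4096)
MAX_BYTES = 5 * 1024 * 1024     # per-request payload limit: 5 MB

def send_to_sls(batch):
    # Hypothetical stand-in for the real write call.
    print(f"flushing {len(batch)} records")

def write_records(records):
    batch, batch_bytes = [], 0
    for record in records:
        size = sum(len(str(v).encode("utf-8")) for v in record.values())
        # Flush before either limit would be exceeded.
        if batch and (len(batch) >= MAX_RECORDS or batch_bytes + size > MAX_BYTES):
            send_to_sls(batch)
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        send_to_sls(batch)

write_records([{"col0": "a" * 100}] * 2500)  # flushes 1024, 1024, then 452 records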