This topic introduces the terms that are related to the data transformation feature.
Basic terms
- ETL
Extract, transform, and load (ETL) is a process during which data is extracted from business systems, cleansed, transformed, and loaded. This process unifies and standardizes data from different sources. Log Service can load data from a source Logstore, transform data, and then write transformed data to destination Logstores. Log Service can also load data from Object Storage Service (OSS) buckets, ApsaraDB RDS instances, or other Logstores.
- event, data, and log
In data transformation, events and data are represented by logs. For example, the event time is equivalent to the log time, and the `drop_event_fields` function discards log fields.
- log time
The log time indicates the point in time at which an event occurs. The log time is also known as the event time. The log time is indicated by the reserved field `__time__` in Log Service. The value of this field is extracted from the time information in logs. The value is a UNIX timestamp representing the number of seconds that have elapsed since the epoch time January 1, 1970, 00:00:00 UTC. Data type: integer. Unit: seconds.
- log receiving time
The log receiving time indicates the point in time at which a log is received by a server of Log Service. By default, this time is not saved in logs. However, if you turn on Log Public IP for a Logstore, this time is recorded in the log tag field `__receive_time__`. In the data transformation process, the complete name of this field is `__tag__:__receive_time__`. The value is a UNIX timestamp representing the number of seconds that have elapsed since the epoch time January 1, 1970, 00:00:00 UTC. Data type: integer. Unit: seconds.
Note: In most scenarios, logs are sent to Log Service in real time, and the log time is the same as the log receiving time. If you import historical logs, the log time is different from the log receiving time. For example, if you import logs generated during the last 30 days by using an SDK, the log receiving time is the current time and is different from the log time.
- tag
Logs have tags. Each tag field is prefixed with `__tag__:`. Log Service supports two types of tags.
  - Custom tags: the tags that you add when you call the PutLogs operation to write data.
  - System tags: the tags that are added by Log Service, including `__client_ip__` and `__receive_time__`.
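Both `__time__` and `__tag__:__receive_time__` hold integer UNIX timestamps in seconds. As an illustration outside Log Service, the following plain-Python sketch shows how such values map to readable UTC times; the sample event and its field values are hypothetical:

```python
from datetime import datetime, timezone

# A hypothetical event as the data transformation feature sees it:
# reserved fields such as __time__ are plain key-value pairs on the event.
event = {
    "__time__": "1609459200",                  # log time (event time)
    "__tag__:__receive_time__": "1609459260",  # log receiving time, 60 s later
    "message": "user login",
}

def to_utc(ts: str) -> str:
    """Convert a UNIX timestamp in seconds to a readable UTC string."""
    return datetime.fromtimestamp(int(ts), tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

log_time = to_utc(event["__time__"])                      # "2021-01-01 00:00:00"
receive_time = to_utc(event["__tag__:__receive_time__"])  # "2021-01-01 00:01:00"
```

In real-time ingestion the two values are typically equal; for imported historical data, the gap between them can be large.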
Configuration-related terms
- source Logstore
The data transformation feature reads data from a source Logstore for transformation.
You can configure only one source Logstore for a data transformation task. However, you can configure the same source Logstore for different data transformation tasks.
- destination Logstore
The data transformation feature writes transformed data to destination Logstores.
You can configure one or more destination Logstores for a data transformation task. Data can be written to destination Logstores in static or dynamic mode. For more information, see Distribute data to multiple destination Logstores.
- DSL for Log Service
The domain-specific language (DSL) for Log Service is a Python-compatible scripting language, and is used for data transformation in Log Service. The DSL for Log Service is built on top of Python. The DSL provides more than 200 built-in functions to simplify common data transformation tasks. The DSL also allows you to use custom Python extensions. For more information, see Language introduction.
- transformation rule
A transformation rule is a data transformation script that is orchestrated by using the DSL for Log Service.
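Conceptually, a transformation rule behaves like a function that takes one event (a set of string fields) and returns the transformed event. The following plain-Python sketch mimics that model; the helper names echo DSL built-ins such as `drop_event_fields`, but they are local stand-ins, not the real DSL functions:

```python
def drop_fields(event: dict, *names: str) -> dict:
    """Stand-in for a field-dropping step: remove the given fields."""
    return {k: v for k, v in event.items() if k not in names}

def set_field(event: dict, name: str, value: str) -> dict:
    """Stand-in for a field-setting step: add or overwrite one field."""
    return {**event, name: value}

def rule(event: dict) -> dict:
    """A toy transformation rule: drop a noisy field, then tag the event."""
    event = drop_fields(event, "debug_info")
    event = set_field(event, "source_type", "nginx")
    return event

out = rule({"status": "200", "debug_info": "trace...", "host": "web-1"})
# out == {"status": "200", "host": "web-1", "source_type": "nginx"}
```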
- data transformation task
A data transformation task is the minimum scheduling unit of data transformation. You must configure a source Logstore, one or more destination Logstores, a transformation rule, a transformation time range, and other parameters for a data transformation task.
Rule-related terms
- resource
Resources refer to third-party data sources that are referenced during data transformation. The data sources include but are not limited to on-premises resources, Object Storage Service (OSS), ApsaraDB RDS, and Logstores other than the source and destination Logstores. The resources may be referenced to enrich data. For more information, see Resource functions.
- dimension table
A dimension table contains dimension information that can be used to enrich data. A dimension table is an external table. For example, a dimension table can contain the information of users, products, and geographical locations of a company. In most scenarios, dimension tables are included in resources and may be dynamically updated.
- enrichment or mapping
If the information contained in a log cannot meet your requirements, you can map one or more fields in the log by using a dimension table to obtain more information. This process is called enrichment or mapping.
For example, a request log contains the status field that specifies the HTTP status code. You can map the field to the status_desc field to obtain the HTTP status description by using the following table.

| status (before enrichment) | status_desc (after enrichment) |
| -------------------------- | ------------------------------ |
| 200                        | Success                        |
| 300                        | Redirect                       |
| 400                        | Permission error               |
| 500                        | Server error                   |

If a source log contains the user_id field, you can map the field by using a dimension table that contains account details to obtain more information. For example, you can obtain the user name, gender, registration time, and email address for each user ID. Then, you can add the information to the source log and write the log to the destination Logstores. For more information, see Mapping and enrichment functions.
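Outside the DSL, this kind of enrichment is essentially a lookup against a dimension table. A minimal Python sketch of the idea, using the status codes and descriptions from the example above (the `"Unknown"` fallback is an added assumption, not part of the original example):

```python
# Dimension table: HTTP status code -> description (values from the example above).
status_dim = {
    "200": "Success",
    "300": "Redirect",
    "400": "Permission error",
    "500": "Server error",
}

def enrich(event: dict) -> dict:
    """Map the status field to a new status_desc field via the dimension table."""
    desc = status_dim.get(event.get("status", ""), "Unknown")
    return {**event, "status_desc": desc}

enriched_ok = enrich({"status": "200"})   # status_desc becomes "Success"
enriched_miss = enrich({"status": "404"}) # unmatched codes fall back to "Unknown"
```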
- event splitting
If a log contains multiple pieces of information, the log can be split into multiple logs. This process is called event splitting.
For example, a log contains the following information:
__time__: 1231245
__topic__: "win_logon_log"
content: [
  {
    "source": "192.0.2.1",
    "dest": "192.0.2.1",
    "action": "login",
    "result": "pass"
  },
  {
    "source": "192.0.2.2",
    "dest": "192.0.2.1",
    "action": "logout",
    "result": "pass"
  }
]
The log can be split into the following two logs:
__time__: 1231245
__topic__: "win_logon_log"
content: {
  "source": "192.0.2.1",
  "dest": "192.0.2.1",
  "action": "login",
  "result": "pass"
}
__time__: 1231245
__topic__: "win_logon_log"
content: {
  "source": "192.0.2.2",
  "dest": "192.0.2.1",
  "action": "logout",
  "result": "pass"
}
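In plain-Python terms, event splitting turns one event whose content field holds a JSON array into one event per array element, with the remaining fields copied. A hedged sketch of that logic (an illustration of the idea, not the DSL's own splitting functions):

```python
import json

def split_event(event: dict, field: str = "content") -> list:
    """Split one event into many: one per element of the JSON array in `field`."""
    items = json.loads(event[field])
    # Copy the other fields (e.g. __time__, __topic__) into every output event.
    return [{**event, field: json.dumps(item)} for item in items]

event = {
    "__time__": "1231245",
    "__topic__": "win_logon_log",
    "content": json.dumps([
        {"source": "192.0.2.1", "dest": "192.0.2.1", "action": "login", "result": "pass"},
        {"source": "192.0.2.2", "dest": "192.0.2.1", "action": "logout", "result": "pass"},
    ]),
}

parts = split_event(event)  # two events, each keeping __time__ and __topic__
```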
- grok
Grok uses patterns to replace complex regular expressions.
For example, the `grok("%{IPV4}")` pattern indicates a regular expression that is used to match IPv4 addresses and is equivalent to the following expression:
`"(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])"`
For more information, see Grok function.
- content capturing by using a regular expression
You can use a regular expression to capture specified content in a field and include the content in a new field.
For example, the function `e_regex("content", "(?P<email>[a-zA-Z][a-zA-Z0-9_.+-=:]+@\w+\.com)")` extracts the email address from the `content` field and stores the extracted address in the `email` field. Here, the email address is extracted by using a common regular expression. We recommend that you use the grok pattern `e_regex("content", grok("%{EMAILADDRESS:email}"))` to simplify the regular expression. For more information, see Regular expressions.
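Because the DSL's regular expressions follow Python syntax, the same named-group capture can be tried in plain Python. A small sketch using the email pattern from the example above; the sample content string is hypothetical:

```python
import re

# The named group (?P<email>...) captures the address, as in the e_regex example.
pattern = re.compile(r"(?P<email>[a-zA-Z][a-zA-Z0-9_.+-=:]+@\w+\.com)")

content = "contact: alice@example.com for details"
match = pattern.search(content)
email = match.group("email") if match else ""  # "alice@example.com"
```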