LogHub is the streaming data hub of Log Service. Logs can be consumed in real time after they are written to LogHub. The extract, transform, load (ETL) procedure of Log Service can process such streaming data within seconds.

Background information

ETL is the procedure of extracting data from a source, transforming the data, and loading it to a destination. The traditional ETL process is an important part of building a data warehouse: you extract the required data from the data source, cleanse it, and then load it to the destination data warehouse based on a pre-defined data warehouse model. As businesses grow, different systems need to exchange ever larger amounts of data, and this data must flow between systems so that you can extract more value from log data.

Benefits of the ETL procedure based on Log Service and Function Compute
  • The procedure is a centralized solution to collect, store, transform, and analyze data.
  • Data transformation is fully managed, triggered by time, and automatically retried.
  • Shards can be scaled out to meet the resource requirements of big data.
  • Data is transformed in Function Compute, which provides elastic resources and supports the pay-as-you-go billing method.
  • The ETL procedure is transparent to users and provides the logging and alerting features.
  • Built-in function templates are continuously added to reduce the cost of function development for mainstream requirements.

Scenarios

  • Data cleansing and transformation
    Log Service allows you to quickly collect, process, query, and analyze logs.
  • Data shipping
    You can ship data to the destination and build data pipelines between big data services in the cloud.

ETL model

The ETL model is a stream-based, real-time data processing model. The ETL trigger polls all shards of the source Logstore to track the position up to which data has been written, and periodically generates triplets of shard, begin cursor, and end cursor that trigger the corresponding functions. Each triplet identifies the range of data that the current ETL task must process.

Shards can be scaled out to ensure the auto scaling of the ETL procedure. Tasks are triggered by a timer to ensure that data is continuously loaded.

During the ETL procedure, data is transformed by calling functions in Function Compute, which gives you the flexibility of user-defined functions (UDFs). Function Compute provides pay-as-you-go billing, auto scaling, and custom code execution to meet your specific requirements.
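
A Log Service trigger delivers this range information to each function call as a JSON event. The following Python sketch shows the part of the event that carries the triplet; the field names follow the commonly documented trigger payload and should be verified against your trigger version, and the values are illustrative only:

    # The "source" section of the trigger event identifies the data range of
    # one ETL task: source project and Logstore, shard, and cursor range.
    source = {
        "endpoint": "http://cn-hangzhou-intranet.log.aliyuncs.com",
        "projectName": "etl-test",
        "logstoreName": "nginx_access_log",
        "shardId": 0,
        "beginCursor": "<opaque begin cursor>",
        "endCursor": "<opaque end cursor>",
    }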


ETL logs

  • ETL procedure logs

    ETL procedure logs record the key points and errors of each step during the implementation of the ETL procedure, including the start time, end time, initialization status, and module error information. With ETL procedure logs, you can view the status of the ETL procedure at any time and locate error causes. Logs generated by functions record the key points and errors during data transformation.

  • ETL scheduling logs

    ETL scheduling logs record the start time and end time of an ETL task, whether the task is successful, and the information returned when the task is successful. If an error occurs, an ETL error log is generated and an alert email or SMS message is sent to the system administrator. Based on scheduling logs, reports can be created to display the status of the ETL procedure.
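
For example, a record in the scheduling logs can be pictured as follows. The field names are the ones used by the report queries later in this topic; the values are invented for this illustration.

    # Illustrative scheduling log record; field names match the report queries
    # in this topic, values are made up for the example.
    scheduling_log = {
        "project_name": "etl-test",
        "job_name": "ceff019ca3d077f85acaad35bb6b9bba65da6717",  # ID of the ETL job
        "task_status": "Success",   # whether this ETL task succeeded
        "ingest_bytes": 1048576,    # bytes read from the source Logstore by this task
        "ingest_lines": 2048,       # log lines read from the source Logstore by this task
    }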

Application examples

HTTP server software such as NGINX and Apache can record an access log entry for each request. By using the ETL procedure based on Log Service and Function Compute, you can analyze the regions where your services are used and the links through which your services are accessed.

For data analysis engineers, the ETL procedure accounts for 60% to 70% of the total workload of a project. With the built-in function templates of Log Service, you can set up such an ETL procedure in less than 15 minutes.

Procedure

  1. Centralize log storage.
    You can use the Logtail client of Log Service to collect log files, such as NGINX access logs, from your machines.
    NGINX access logs collected by the Logtail client are stored in a Logstore. As shown in the following figure, the value of the forward field indicates the IP address from which the user sends the request.
  2. Transform data in the cloud.
    1. Log on to the Function Compute console and create a service.
      In the Advanced Configuration section, we recommend that you configure a Logstore to store logs generated during data transformation for the ETL function to be created. Then, you can use the logs to locate errors during data transformation. Grant the AliyunLogFullAccess permission to the ETL function. Then, the ETL function can read data from the source Logstore and write the data to the destination Logstore after data transformation.
    2. Use a built-in template to create the ETL function. For orientation, a simplified sketch of what such a function does is provided after this procedure.
    3. Create a Log Service trigger for the ETL function.
      The following figure shows the configurations of the Log Service trigger.

      Set the data source to the Logstore described in step 1. In this example, the data source is the nginx_access_log Logstore in the etl-test project (etl-test/logstore:nginx_access_log).

      Log Service polls data in the source Logstore. When data is continuously generated, an ETL task is created every 60 seconds to call the ETL function and process data. The trigger and results of the function are recorded in the etl-trigger-log Logstore that stores trigger logs.

      Function configurations vary with the implementation methods and features of functions. For information about the configuration items of the ip-lookup template, see the relevant README file.

    4. Save the configurations and wait for one minute. Then, the ETL task starts.
      The generated ETL procedure logs are stored in the etl-function-log Logstore, and scheduling logs are stored in the etl-trigger-log Logstore.
      You can also use query statements to create reports based on ETL logs.
      • The chart in the upper-left corner shows the number of times that the ETL function is triggered per minute. This chart is created by using the following query statement:

        project_name : etl-test and job_name : ceff019ca3d077f85acaad35bb6b9bba65da6717 | select from_unixtime(__time__ - __time__ % 60) as t, count(1) as invoke_count group by from_unixtime(__time__ - __time__ % 60) order by t asc limit 1000

      • The chart in the upper-right corner shows the percentages of successful and failed ETL tasks. This chart is created by using the following query statement:

        project_name : etl-test and job_name : ceff019ca3d077f85acaad35bb6b9bba65da6717 | select task_status, count(1) group by task_status

      • The chart in the lower-left corner shows the number of bytes of logs read from the source Logstore every five minutes. This chart is created by using the following query statement:

        project_name : etl-test and job_name : ceff019ca3d077f85acaad35bb6b9bba65da6717 and task_status : Success | select from_unixtime(__time__ - __time__ % 300) as t, sum(ingest_bytes) as ingest_bytes group by from_unixtime(__time__ - __time__ % 300) order by t asc limit 1000

      • The chart in the lower-right corner shows the number of lines of logs read from the source Logstore every five minutes. This chart is created by using the following query statement:

        project_name : etl-test and job_name : ceff019ca3d077f85acaad35bb6b9bba65da6717 and task_status : Success | select from_unixtime(__time__ - __time__ % 300) as t, sum(ingest_lines) as ingest_lines group by from_unixtime(__time__ - __time__ % 300) order by t asc limit 1000

  3. Model data after data transformation.
    NGINX access logs on the machine are collected by the Logtail client to the source Logstore in real time, transformed by the ETL function in near real time, and then written to the destination Logstore. The following figure shows the transformed data that contains the IP address information.

    Compared with the untransformed data, the transformed data contains four new fields: country, province, city, and isp. The four fields indicate that the request whose IP address is 117.136.90.160 comes from Taiyuan, Shanxi, China, and that the Internet service provider (ISP) is China Mobile. This enrichment step is also outlined in the sketch that follows this procedure.

    You can use the log analysis feature of Log Service to query the cities from which requests originate and the distribution of ISPs over a period of time. Use the following two query statements to create reports:
    • * | select city, count(1) as c group by city order by c desc limit 15
    • * | select isp, count(1) as c group by isp order by c desc limit 15
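
For orientation only, the following Python sketch outlines what an ETL function of this kind does: it reads the data range described by the trigger event from the source shard, adds the location fields to each log, and writes the result to the destination Logstore. It is not the code of the built-in ip-lookup template. The sketch assumes the aliyun-log-python-sdk; the destination project and Logstore names and the ip_to_location helper are placeholders, and the SDK and Function Compute credential attribute names should be checked against their references.

    import json
    from aliyun.log import LogClient, LogItem, PutLogsRequest

    # Placeholders: replace with your destination project and Logstore.
    DEST_PROJECT = "etl-test"
    DEST_LOGSTORE = "nginx-access-log-enriched"

    def ip_to_location(ip):
        # Stand-in for a real IP geolocation lookup (for example, a local IP
        # library file or a web service); returns empty fields in this sketch.
        return {"country": "", "province": "", "city": "", "isp": ""}

    def enrich(log):
        # Add the four location fields based on the client IP in the forward field.
        location = ip_to_location(log.get("forward", ""))
        log.update(location)
        return log

    def handler(event, context):
        # The trigger event identifies the source Logstore, the shard, and the
        # cursor range (the triplet) that this ETL task must process.
        evt = json.loads(event)
        src = evt["source"]
        creds = context.credentials
        client = LogClient(src["endpoint"], creds.access_key_id,
                           creds.access_key_secret, creds.security_token)
        cursor = src["beginCursor"]
        while True:
            resp = client.pull_logs(src["projectName"], src["logstoreName"],
                                    src["shardId"], cursor, count=100,
                                    end_cursor=src["endCursor"])
            items = []
            for log in resp.get_flatten_logs_json():
                log = enrich(log)
                item = LogItem()
                for key, value in log.items():
                    item.push_back(str(key), str(value))
                items.append(item)
            if items:
                client.put_logs(PutLogsRequest(DEST_PROJECT, DEST_LOGSTORE,
                                               "", "", items))
            next_cursor = resp.get_next_cursor()
            if next_cursor == cursor:  # no more data in the triggered range
                break
            cursor = next_cursor
        return "OK"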