
Use SPL to Efficiently Implement Flink SLS Connector Pushdown

This article introduces SPL and its application in the Realtime Compute for Apache Flink SLS Connector.

By Weilong Pan (Huolang)

1. Background

Simple Log Service (SLS) is a cloud-native observability and analysis platform that provides large-scale, low-cost, and real-time services for Log, Metric, and Trace data. With the convenient data access of SLS, you can ingest system logs and business logs into SLS for storage and analysis. Realtime Compute for Apache Flink is a big data analysis platform built by Alibaba Cloud based on Apache Flink, widely used in real-time data analysis and risk detection. Realtime Compute for Apache Flink supports the SLS Connector, which lets users use SLS as a source table or a result table on the platform.

When you configure SLS as a source table in Realtime Compute for Apache Flink, Flink consumes the SLS Logstore data by default to build a dynamic table. You can specify a starting time for consumption, but everything after that time is consumed in full. In specific scenarios, you only need to analyze logs with certain characteristics, or only some fields of those logs. Such requirements can be satisfied by the WHERE and SELECT clauses of Flink SQL, but two problems remain:

1) The Connector pulls many unnecessary data rows or columns from the source, resulting in network overhead.

2) The unnecessary data must still be filtered and projected in Flink. These cleansing tasks are not the focus of data analysis and waste computing resources.

In this scenario, is there a better solution?

The answer is Yes. SLS introduces the SPL language, which can efficiently cleanse and process log data. This capability is also integrated into log consumption scenarios, including the SLS Connector in the Realtime Compute for Apache Flink. You can configure SLS SPL to implement data cleansing rules. This reduces the amount of data transmitted over the network and the computing consumption of Flink.

Next, we will introduce SPL and its application in the Realtime Compute for Apache Flink SLS Connector.

2. Introduction to SLS SPL


SLS SPL is a high-performance processing language provided by SLS for weakly structured logs. It can be used in Logtail, scan-based query, and streaming consumption scenarios at the same time. SLS SPL is interactive, exploratory, and simple to use.

The basic syntax of SPL is as follows:

<data-source> 
| <spl-cmd> -option=<option> -option ... <expression>, ... as <output>, ...
| <spl-cmd> ...
| <spl-cmd> ...

Each line after the data source is an SPL instruction. SPL supports semi-structured data processing such as row filtering, column expansion, column pruning, regular expression matching, field projection, numerical calculation, and JSON and CSV parsing. For more information, see SPL Instruction [1]. Specifically, the SPL instructions include:

SQL calculation instructions on structured data: support row filtering, column expansion, numeric calculation, and SQL function calls

• extend creates new fields based on the results of an SQL expression

• where filters data entries based on the results of an SQL expression

*
| extend latency=cast(latency as BIGINT)
| where status='200' AND latency>100
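
For illustration, given a hypothetical input row with status='200' and latency='120' (a string), the pipeline above casts latency to an integer and then applies the filter:

Input:  status='200', latency='120'
Output: status='200', latency=120 (row retained; rows with status other than '200' or latency <= 100 are dropped)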

Field processing instructions: support field projection, field renaming, and column pruning

• project retains the fields that match the given pattern and renames the specified fields

• project-away removes the fields that match the given pattern and retains all other fields as they are

• project-rename renames the specified field and retains all other fields as they are

*
| project-away -wildcard "__tag__:*"
| project-rename __source__=remote_addr
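
For illustration, assuming a hypothetical input row with the fields __tag__:__receive_time__, remote_addr, and status, the pipeline above removes all tag fields and renames remote_addr:

Input fields:  __tag__:__receive_time__='1706531737', remote_addr='127.0.0.1', status='200'
Output fields: __source__='127.0.0.1', status='200'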

Extraction instructions on unstructured data: support the processing of unstructured field values such as JSON, regular expression, and CSV

• parse-regexp extracts the information that matches groups in the specified regular expression from the specified field

• parse-json extracts the first-layer JSON information from the specified field

• parse-csv extracts the information in the CSV format from the specified field

*
| parse-csv -delim='^_^' content as time, body
| parse-regexp body, '(\S+)\s+(\w+)' as msg, user
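
For illustration, assuming a hypothetical content field whose value is '2024-01-29 12:00:00^_^INFO user1', the pipeline above would produce:

After parse-csv:    time='2024-01-29 12:00:00', body='INFO user1'
After parse-regexp: msg='INFO', user='user1' (time and body are retained)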

3. Introduction to the Principle of SPL in Flink SLS Connector

Realtime Compute for Apache Flink supports the SLS Connector, which can pull data from an SLS Logstore in real time and also write analysis results back to SLS in real time. Flink is a high-performance computing engine, and Flink SQL is widely used on it to analyze structured data with SQL syntax.

In the SLS Connector, you can map log fields to table fields in Flink SQL and then analyze the data with SQL. Before SPL configuration was supported, the SLS Connector always delivered the full log data to the Flink computing platform in real time. This consumption model has the following characteristics:

• In Flink, usually not all log rows are needed for calculation. For example, in security scenarios, only data with certain characteristics may be required, so logs need to be filtered. Nevertheless, unnecessary log rows are still pulled, wasting network bandwidth.

• In Flink, generally only specific field columns are calculated. For example, if there are 30 fields in the Logstore, only 10 may need to be calculated in Flink. Pulling all fields wastes network bandwidth.

These scenarios incur unnecessary network traffic and computing overhead. Based on these characteristics, SLS has integrated SPL capabilities into the new version of the SLS Connector, which can perform row filtering and column pruning before data reaches Flink. This preprocessing runs on the SLS server, saving both network traffic and Flink computing (filtering and column pruning) overhead.

3.1 Principle Comparison

• When no SPL statement is configured, Flink tasks pull full log data (including all columns and rows) of SLS for computing.

• When an SPL statement is configured, if the SPL statement contains filtering and column pruning operations, then Flink tasks based on the SPL statement only pull the data that has been filtered and column pruned for further processing and computing.


4. Use SLS SPL in Flink

Next, an NGINX log is used as an example to describe how to use Flink with SLS SPL. For better demonstration, we configure the SLS source table in the Flink console and then start a continuous query to observe the effect. In actual use, you only need to modify the SLS source table and can keep the rest of the analysis and write logic unchanged.

Next, we will introduce how to use SPL to implement row filtering and column pruning in Flink.

4.1 Preparing Data from SLS

• Activate SLS, create a Project and a Logstore in SLS, and create an AK/SK pair for an account that has permission to consume the Logstore.

• The Logstore is populated with simulated SLB layer-7 access logs, which contain more than 10 fields.


Simulated access continuously generates random log data. The following example shows the log content:

{
  "__source__": "127.0.0.1",
  "__tag__:__receive_time__": "1706531737",
  "__time__": "1706531727",
  "__topic__": "slb_layer7",
  "body_bytes_sent": "3577",
  "client_ip": "114.137.195.189",
  "host": "www.pi.mock.com",
  "http_host": "www.cwj.mock.com",
  "http_user_agent": "Mozilla/5.0 (Windows NT 6.2; rv:22.0) Gecko/20130405 Firefox/23.0",
  "request_length": "1662",
  "request_method": "GET",
  "request_time": "31",
  "request_uri": "/request/path-0/file-3",
  "scheme": "https",
  "slbid": "slb-02",
  "status": "200",
  "upstream_addr": "42.63.187.102",
  "upstream_response_time": "32",
  "upstream_status": "200",
  "vip_addr": "223.18.47.239"
}

The slbid field in the Logstore has two values: slb-01 and slb-02. Statistics on slbid over 15 minutes of log data show that the numbers of slb-01 and slb-02 logs are approximately equal.
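
This comparison can be reproduced with a query statement in the SLS console. The following is a minimal sketch, assuming the Logstore has an analytics index on slbid:

* | SELECT slbid, count(*) AS cnt GROUP BY slbid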


5. Row Filtering Scenario

Filtering data is a common requirement in data processing. In Flink, you can use the filter operator or the WHERE condition in SQL to filter data, which is very convenient. However, using the filter operator in Flink means that the full data has already crossed the network into the Flink computing engine, consuming both network bandwidth and Flink computing resources. For this scenario, SLS SPL provides filter pushdown for the Flink SLS Connector: you configure the filter condition in the query option of the SLS Connector, avoiding full data transmission and full data filtering calculations, as sketched below.
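
For contrast, the non-pushdown approach expresses the filter in Flink SQL. The following sketch assumes a source table sls_input defined without the query option (the table itself is created in the next subsection); the full data is pulled before the WHERE clause takes effect:

SELECT slbid, count(1) AS slb_cnt
FROM sls_input           -- full rows and columns are pulled from SLS first
WHERE slbid = 'slb-01'   -- filtering happens inside Flink, after network transfer
GROUP BY slbid;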


5.1 Create an SQL Job

Create a blank SQL stream job draft in the Realtime Compute for Apache Flink console. Click Next to write the job.


Enter the following statements to create a temporary table in the job draft:

CREATE TEMPORARY TABLE sls_input(
  request_uri STRING,
  scheme STRING,
  slbid STRING,
  status STRING,
  `__topic__` STRING METADATA VIRTUAL,
  `__source__` STRING METADATA VIRTUAL,
  `__timestamp__` STRING METADATA VIRTUAL,
  `__tag__` MAP<VARCHAR, VARCHAR> METADATA VIRTUAL,
  proctime AS PROCTIME()
) WITH (
  'connector' = 'sls',
  'endpoint' = 'cn-beijing-intranet.log.aliyuncs.com',
  'accessId' = '${ak}',
  'accessKey' = '${sk}',
  'starttime' = '2024-01-21 00:00:00',
  'project' = '${project}',
  'logstore' = 'test-nginx-log',
  'query' = '* | where slbid = ''slb-01'''
);

• For the convenience of demonstration, only request_uri, scheme, slbid, status, and some metadata fields are set as table fields.

• Replace ${ak}, ${sk}, and ${project} with an account that has the Logstore consumption permission.

• endpoint: enter the internal endpoint of SLS in the same region.

• query: enter the SPL statement of SLS, in this case the filter statement * | where slbid = ''slb-01''. Note that in Realtime Compute for Apache Flink job development, single quotation marks inside a string literal must be escaped by doubling them, as shown below.
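
For example, the standalone SPL statement and its escaped form in the Flink WITH clause differ only in the doubled inner quotation marks:

-- Standalone SPL statement:
* | where slbid = 'slb-01'

-- The same statement as a Flink string literal in the WITH clause:
'query' = '* | where slbid = ''slb-01'''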

5.2 Continuous Query and Effect

Enter the following analysis statement in the job to perform an aggregate query by slbid. The dynamic query result refreshes in real time as logs change.

SELECT slbid, count(1) as slb_cnt FROM sls_input GROUP BY slbid

Click the Debug button in the upper-right corner to debug. In the result, the value of the slbid field is always slb-01.


After the SPL statement is configured, sls_input contains only the slbid='slb-01' data. Data that does not meet the condition is filtered out.

5.3 Traffic Comparison

After SPL is used, the read traffic from SLS to Flink drops significantly while the write traffic to SLS remains unchanged. In scenarios where filtering consumes a large share of Flink CUs, Flink CU consumption also decreases after the data is filtered by SPL.


6. Column Pruning Scenario

Column pruning is also a common requirement in data processing. Raw data often carries the full set of fields, but only specific fields are required for actual calculations. Column pruning and transformation can be done with the project operator in Flink or SELECT in SQL, but as with filtering, using the project operator in Flink means the full data has already crossed the network into the Flink computing engine, consuming network bandwidth and Flink computing resources. For this scenario, SLS SPL provides projection pushdown for the Flink SLS Connector: you configure the projection in the query option of the SLS Connector, avoiding full data transmission and full data projection calculations.

6.1 Create an SQL Job

Following the same steps as in the row filtering scenario, enter the following statement to create a temporary table in the job draft. Compared with the previous example, the query parameter is modified: a projection instruction is appended after the filter so that only specific fields are pulled from the SLS server.

CREATE TEMPORARY TABLE sls_input_project(
  request_uri STRING,
  scheme STRING,
  slbid STRING,
  status STRING,
  `__topic__` STRING METADATA VIRTUAL,
  `__source__` STRING METADATA VIRTUAL,
  `__timestamp__` STRING METADATA VIRTUAL,
  `__tag__` MAP<VARCHAR, VARCHAR> METADATA VIRTUAL,
  proctime AS PROCTIME()
) WITH (
  'connector' = 'sls',
  'endpoint' = 'cn-beijing-intranet.log.aliyuncs.com',
  'accessId' = '${ak}',
  'accessKey' = '${sk}',
  'starttime' = '2024-01-21 00:00:00',
  'project' = '${project}',
  'logstore' = 'test-nginx-log',
  'query' = '* | where slbid = ''slb-01'' | project request_uri, scheme, slbid, status, __topic__, __source__, "__tag__:__receive_time__"'
);

For readability, the query configuration is shown below on multiple lines. In an actual Flink statement, the configuration must be written on a single line.

* 
| where slbid = ''slb-01'' 
| project request_uri, scheme, slbid, status, __topic__, __source__, "__tag__:__receive_time__"

The preceding statement uses the pipeline syntax of SLS SPL, which is similar to a Unix pipeline, to project the data after filtering. The | symbol separates instructions: the output of each instruction becomes the input of the next, and the output of the last instruction is the output of the entire pipeline.

6.2 Continuous Query and Effect


Enter the analysis statement in the job. The result is similar to that in the row filtering scenario.

SELECT slbid, count(1) as slb_cnt FROM sls_input_project GROUP BY slbid

🔔 Note: This is different from row filtering. In the preceding row filtering scenario, all fields are returned. With the current statement, the SLS Connector returns only the specified fields, which further reduces data transmission over the network.
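
For illustration, a record returned after projection pushdown contains only the projected fields. The following sketch reuses the values from the sample log above, with slbid matching the filter:

{
  "request_uri": "/request/path-0/file-3",
  "scheme": "https",
  "slbid": "slb-01",
  "status": "200",
  "__topic__": "slb_layer7",
  "__source__": "127.0.0.1",
  "__tag__:__receive_time__": "1706531737"
}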

7. What Else Can SPL Do?

• The preceding examples demonstrate how the filtering and projection functions of SLS SPL implement pushdown in the SLS Connector, effectively reducing network traffic and Flink CU usage by avoiding the extra filtering and projection computation before the actual analysis in Flink.

• SLS SPL offers more than filtering and projection. For the complete syntax supported by SLS SPL, see SPL Instruction [1]. The full SPL pipeline syntax is supported in the Flink Connector configuration.

• SLS SPL supports data preprocessing, such as field expansion with regular expressions, JSON, and CSV, data format conversion, column addition and removal, and filtering. Besides consumption scenarios, SPL can also be applied in Scan mode and on the collection side, so users can use SPL capabilities at both the collection side and the consumer side. A combined sketch follows this list.
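
For example, a hypothetical preprocessing pipeline can combine the instructions introduced earlier, assuming a content field that holds first-layer JSON with a status key:

*
| parse-json content
| project-away content
| where status = '200'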

References

[1] SPL Instruction

[2] Simple Log Service Overview

[3] SPL Overview

[4] Alibaba Cloud Realtime Compute for Apache Flink Connector SLS
