AnalyticDB for MySQL Data Lakehouse Edition (V3.0) allows you to synchronize data from a Simple Log Service Logstore to a Data Lakehouse Edition (V3.0) cluster. This topic describes how to use the data synchronization feature of AnalyticDB for MySQL Data Lakehouse Edition (V3.0) to synchronize data from a Simple Log Service Logstore to a Data Lakehouse Edition (V3.0) cluster, analyze data, and manage data sources.
Overview
AnalyticDB for MySQL Data Lakehouse Edition (V3.0) provides the data synchronization feature that allows you to synchronize data from a Simple Log Service Logstore to a Data Lakehouse Edition (V3.0) cluster in real time from a specified point in time. This feature helps meet requirements such as near-real-time data ingestion, full data archiving, and elastic analysis.
Prerequisites
- An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created. For more information, see Create a cluster.
A job resource group is created. For more information, see Create a resource group.
- A database account is created.
- If you use an Alibaba Cloud account, you only need to create a privileged database account. For more information, see Create a privileged account.
- If you use a Resource Access Management (RAM) user, you must create both a privileged database account and a standard database account and associate the standard account to the RAM user. For more information, see Create a database account and Associate or disassociate a database account with or from a RAM user.
Simple Log Service is activated. A project and a Logstore are created in the same region as the Data Lakehouse Edition (V3.0) cluster. For more information, see Getting Started.
Usage notes
The data of a Simple Log Service Logstore can be synchronized to only a single table in an AnalyticDB for MySQL cluster.
Procedure
Step 1: (Optional) Configure RAM authorization.
Step 2: Create a data source.
Step 3: Create a data synchronization job.
Step 4: Start the data synchronization job.
Step 5: Analyze data.
Step 6: (Optional) Manage the data source.
Configure RAM authorization
If you want to synchronize Simple Log Service data to an AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster across Alibaba Cloud accounts, you must create a RAM role on the data source, grant permissions to the RAM role by attaching a policy, and then edit the trust policy of the RAM role. If you want to synchronize Simple Log Service data within the same Alibaba Cloud account, you can skip this step and create a data source. For more information, see the "Create a data source" section of this topic.
Create a RAM role. For more information, see Create a RAM role for a trusted Alibaba Cloud account.
NoteSelect Other Alibaba Cloud Account for the Select Trusted Alibaba Cloud Account parameter and enter the ID of the Alibaba Cloud account to which the Data Lakehouse Edition (V3.0) cluster belongs. You can log on to the Account Management console and go to the Overview page to view the account ID.
Attach the AliyunAnalyticDBAccessingLogRolePolicy policy to the RAM role. For more information, see the "Method 2: Grant permissions to a RAM role by clicking Input and Attach on the Roles page" section of the Grant permissions to a RAM role topic.
Edit the trust policy of the RAM role. For more information, see Edit the trust policy of a RAM role.
{ "Statement": [ { "Action": "sts:AssumeRole", "Effect": "Allow", "Principal": { "RAM": [ "acs:ram::<Alibaba Cloud account ID>:root" ], "Service": [ "<Alibaba Cloud account ID>@ads.aliyuncs.com" ] } } ], "Version": "1" }
NoteIn the preceding code, the Alibaba Cloud account ID parameter specifies the account ID that is entered in Step 1. When you specify this parameter, delete the angle brackets (<>).
Create a data source
If you want to synchronize data from an existing data source, skip this step and create a data synchronization job. For more information, see the "Create a data synchronization job" section of this topic.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Data Ingestion > Data Sources.
In the upper-right corner, click Create Data Source.
On the Create Data Source page, configure the parameters that are described in the following table.
Parameter
Description
Data Source Type
The data source. Select SLS.
Data Source Name
The name of the data source. By default, the system generates a name based on the data source type and the current time. You can modify the name based on your business requirements.
Data Source Description
The description of the data source. For example, you can enter the use case and business limits.
Deployment Mode
Only Alibaba Cloud Instance is supported.
Region of Log Service Project
The region where the source Simple Log Service project resides.
Across Alibaba Cloud Accounts
Specifies whether the Simple Log Service data can be synchronized to the Data Lakehouse Edition (V3.0) cluster across Alibaba Cloud accounts. Valid values:
No.
Yes. If you select Yes, you must specify the Alibaba Cloud Account and RAM Role parameters.
NoteAlibaba Cloud Account: the ID of the Alibaba Cloud account to which the data source belongs.
RAM Role: the name of the RAM role that is created on the data source in Step 1.
Log Service Project
The name of the source Simple Log Service project.
ImportantAll projects that belong to the Alibaba Cloud account and RAM users are displayed. If you select a project that belongs to the Alibaba Cloud account and want to synchronize data as a RAM user, make sure that the RAM user is granted permissions on the project. Otherwise, data cannot be synchronized to Data Lakehouse Edition (V3.0).
Log Service Logstore
The name of the source Simple Log Service Logstore.
Click Create.
Create a data synchronization job
In the left-side navigation pane, click Data Synchronization.
In the upper-right corner, click Create Synchronization Job.
On the Log Service Data Source tab of the Create Synchronization Job page, configure the parameters in the Source and Destination Settings, Destination Database and Table Settings, and Synchronization Settings sections.
The following table describes the parameters in the Source and Destination Settings section.
Parameter
Description
Data Job Name
The name of the data synchronization job. By default, the system generates a name based on the data source type and the current time. You can modify the name based on your business requirements.
Data Source
The data source. You can select an existing Simple Log Service data source or create a data source.
Destination Type
The data storage type in AnalyticDB for MySQL. Only Data Lake - OSS Storage is supported.
OSS Path
The Object Storage Service (OSS) storage path of the AnalyticDB for MySQL cluster data.
ImportantAll buckets that reside in the same region as the AnalyticDB for MySQL cluster are displayed. You can select one of the buckets. Configure this parameter based on your business requirements. After you configure this parameter, it cannot be modified.
We recommend that you select an empty directory that does not have nested relationships with the directories of other data synchronization jobs. This prevents historical data from being overwritten. For example, assume that the involved OSS storage paths of two data synchronization jobs are oss://adb_demo/test/sls1/ and oss://adb_demo/test/. In this case, data overwriting occurs during data synchronization because these two paths have nested relationships with each other.
The following table describes the parameters in the Destination Database and Table Settings section.
Parameter
Description
Database Name
The name of the destination database in the Data Lakehouse Edition (V3.0) cluster. If no database that uses the same name exists, a database is created. If a database that uses the same name already exists, data is synchronized to the existing database. For information about the naming conventions for databases, see Limits.
Table Name
The name of the destination table in the Data Lakehouse Edition (V3.0) cluster. If no table that uses the same name exists in the database, a table is created. If a table that uses the same name already exists in the database, data fails to be synchronized. For information about the naming conventions for tables, see Limits.
Schema Field Mapping
The source and destination fields. By default, fields are obtained based on the shipping task configuration of Simple Log Service. If no shipping task is configured for the Logstore, fields are obtained based on the most recent log data by default.
The following data types are supported: BOOLEAN, INT, BIGINT, FLOAT, DOUBLE, and STRING.
The reserved fields of Simple Log Service can be synchronized. For more information, see Reserved fields.
ImportantThe destination field names cannot be modified.
If a data synchronization job is started, regardless of whether it is being started and running or has been started and completed, information about existing columns cannot be modified. However, you can add more columns. If a data synchronization job is created but has never been started, information about existing columns can be modified.
Partition Key Settings
The partition field settings for the destination table. To ensure that data is ingested and queries are executed as expected, we recommend that you configure partitions based on the log time or business logic. If you do not configure partitions, no partitions exist in the destination table.
Valid values for the Format Processing Method field:
Formatted Time: You must specify a time field value such as a timestamp. The destination table is partitioned based on the specified time field and the format of the time field. For example, if the source partition field is gmt_created and the format of the source field is yyyy-MM-dd, you can set the destination partition field to year and the format of the destination partition field to yyyy. This specifies that the destination table is partitioned by year.
Specified Partition Field: The destination table is partitioned based on the specified partition field.
The following table describes the parameters in the Synchronization Settings section.
Parameter
Description
Starting Consumer Offset for Incremental Synchronization
The point in time from which the system consumes Simple Log Service data when the data synchronization job starts. Valid values:
Earliest Offset (begin_cursor): The system consumes Simple Log Service data from the point in time at which the earliest data record was generated.
Latest Offset (end_cursor): The system consumes Simple Log Service data from the point in time at which the latest data record was generated.
Custom Offset: You can select a point in time. Then, the system consumes Simple Log Service data from the first entry that is generated as of the selected point in time.
Job Resource Group
The job resource group that runs the data synchronization job.
ACUs for Incremental Synchronization
The number of AnalyticDB compute units (ACUs) that are required for the job resource group to run the data synchronization job. The value ranges from 2 to the maximum number of computing resources that are available in the job resource group. To improve the stability and performance of data ingestion, we recommend that you specify a larger number of ACUs.
NoteWhen you create a data synchronization job in a job resource group, elastic resources in the resource group are used, and the system deducts the used resources from the resource group. For example, assume that a job resource group has 48 ACUs of reserved computing resources and a synchronization job that consumes 8 ACUs is created. When you create another synchronization job in the resource group, you can select up to 40 ACUs.
Advanced Settings
The personalized settings for the data synchronization job. If you want to configure personalized settings, contact technical support.
Click Submit.
Start the data synchronization job
On the Data Synchronization page, find the data synchronization job that you created and click Start in the Actions column.
In the upper-right corner, click Search. If the state of the job changes to Starting, the data synchronization job is started.
Analyze data
After the data synchronization job is completed, you can use Spark JAR Development to analyze the data that is synchronized to the AnalyticDB for MySQL cluster. For more information about Spark development, see Spark development editor and Spark batch application development.
In the left-side navigation pane, choose .
Enter SQL statements in the default template and click Run Now.
-- Here is just an example of SparkSQL. Modify the content and run your spark program. conf spark.driver.resourceSpec=medium; conf spark.executor.instances=2; conf spark.executor.resourceSpec=medium; conf spark.app.name=Spark SQL Test; conf spark.adb.connectors=oss; -- Here are your sql statements show tables from lakehouse20220413156_adbTest;
(Optional) On the Applications tab, find an application and click Log in the Actions column to view the Spark SQL running log of the application.
Manage the data source
On the Data Sources page, you can perform the following operations in the Actions column.
Operation | Description |
Create Job | Directs you to the Create Synchronization Job or Create Migration Job page to create a job that uses the data source. |
View | Obtains the detailed configurations of the data source. |
Edit | Allows you to edit data source properties, such as updating the data source name and description. |
Delete | Deletes the data source. Note If the data source is being used in a data synchronization or data migration job, you cannot delete the data source. In this case, you must first go to the Data Synchronization page, find the job, and then click Delete in the Actions column to delete the job. |