All Products
Search
Document Center

Create a data warehouse by merging databases within T+1 days

Last Updated: Apr 14, 2021

Background information

Increased volume of data stored in an ApsaraDB RDS may increase the time that is required to handle data queries. This affects user experience. To continue using ApsaraDB RDS, you can use the sharding technique. This technique allows you to split data of one table in an ApsaraDB RDS database into multiple tables in multiple ApsaraDB RDS databases, modify business code, and use a middleware similar to TDDL. This way, user experience is improved. However, the data analysis process becomes complex because one table is logically split into multiple tables.

Solution

DLA provides a solution to create a data warehouse by merging multiple databases in an ApsaraDB RDS instance within T+1 days. If you use this solution, you can use the wizard in the Data Lake Analytics (DLA) console to aggregate data of tables in multiple ApsaraDB RDS databases into one partitioned table that stores the data. This way, you can analyze all data at a time or analyze data in a specified partition. The analysis does not affect business operations on the ApsaraDB RDS instance.

Prerequisites

  • If you want to create a data warehouse by merging multiple databases in an ApsaraDB RDS instance, DLA uses Object Storage Service (OSS) as the data warehouse to store ApsaraDB RDS data. Before you perform this operation, make sure that the following operations are performed on OSS:

    1. OSS is activated. For more information, see Activate OSS.

    2. A bucket is created. For more information, see Create buckets.

    3. A folder is created. For more information, see Create folders.

    Note

    You can determine whether to create a folder to store ApsaraDB RDS data based on your business requirements.

  • The ApsaraDB RDS data source is prepared based on your business requirements. For more information, see Quick Start.

Procedure

  1. Log on to the DLA console.

  2. In the top navigation bar, select the region where DLA is deployed.

  3. In the left-side navigation pane, choose Data Lake Management > Data into the lake. On the Data into the lake page, click Go To The Wizard in the Multi-database merger section.

  4. If you use DLA to access the ApsaraDB RDS data source for the first time, you must grant read-only permissions on ApsaraDB RDS to DLA. Then, click Next.

    Note

    If the read-only permissions on ApsaraDB RDS have been granted to DLA, skip this step.

  5. Configure the parameters as prompted.

    Category

    Parameter

    Description

    Manual selection of rds: allows you to manually specify an ApsaraDB RDS instance. This method is suitable for scenarios where the number of ApsaraDB RDS instances is small and static or does not increase frequently.

    Type

    The type of the data source. In this topic, the type is ApsaraDB RDS. Select an ApsaraDB RDS instance and add it as a data source.

    Database Filtering Rules

    The names of the databases from which data is synchronized to DLA. Separate multiple database names with commas (,). Database names can contain the percent sign (%), for example, user_%.

    Specify By Query: allows you to specify an ApsaraDB RDS data source by using SQL SELECT statements. This method is suitable for scenarios where the number of ApsaraDB RDS instances is large and increases dynamically.

    -

    Example: SELECT 'mysql' AS engine, 'db001' AS db_name, 'rm-111..aliyuncs.com' AS host, 3306 AS port, 'rm-123445' AS instance_id, 'vpc-3424555' AS vpc_id FROM tbl1

    Certification Information

    User Name

    The username that is used to log on to a database. For ease of use, the DLA console requires you to use the same username and password to log on to all databases.

    Password

    The password that corresponds to the username. After you enter the username and password, you can click Test Connection to test the connectivity.

    Position Opening Configuration

    Schema Name

    The name of the schema. This parameter specifies the name of the DLA database that is mapped to the database in the ApsaraDB RDS instance.

    Location

    The OSS directory where data of an ApsaraDB RDS database is saved when you create a data warehouse in DLA. DLA automatically selects the OSS bucket that resides in the same region as DLA. You can click Select Location to select a bucket and an object. Before you create a data warehouse by merging multiple databases, make sure that DLA is authorized to delete OSS data. This is because DLA can perform extract, transform, and load (ETL) operations only after OSS data is deleted. For more information, see Authorize DLA to delete OSS files.

    Sync Time

    The time at which ApsaraDB RDS data is synchronized to OSS. By default, data is synchronized at 00:30. To prevent your business from being affected when data is synchronized, we recommend that you set this parameter to a time at off-peak hours.

    Table Name Generation Rules

    The rule for generating the name of the ApsaraDB RDS table in the data warehouse when you create a data warehouse by merging databases in DLA. The ApsaraDB RDS table in the data warehouse is mapped to the table in the ApsaraDB RDS database. Valid values: IdentityResolver: The name of the table in the ApsaraDB RDS database is directly used as the name of the table in the data warehouse. This rule is suitable for scenarios where database sharding is used but table sharding is not used for the ApsaraDB RDS database. RemoveTrailingUnderscoreAndNumberResolver: The last underscore (_) and the digits that follow this underscore are removed from the name of the table in the ApsaraDB RDS database and the new table name is used as the name of the table in the data warehouse. For example, if the name of the table in the ApsaraDB RDS database is tbl_001, the name of the table in the data warehouse is tbl.

    Partition Configuration

    Specifies how to generate partition fields and their values in the data warehouse that you want to create. A partition field value is an expression that contains variables, such as ${rdsInstanceId}. DLA supports only the rdsEngine variable. This variable indicates the type of the engine supported by ApsaraDB RDS and can be set to MySQL, SQL Server, PostgreSQL, or Oracle. rdsDbName: the name of the database in the ApsaraDB RDS instance. rdsTableName: the name of the table in the database of the ApsaraDB RDS instance. rdsInstanceId: the ID of the ApsaraDB RDS instance. rdsVpcId: the ID of the VPC to which the ApsaraDB RDS instance belongs. We recommend that you specify rdsInstanceId and rdsDbName.

    Advanced Options

    Custom parameters, such as filter fields. For more information, see Advanced options.

  6. After you configure the preceding parameters, click Create to create a data warehouse.

After you create a data warehouse, DLA automatically synchronizes data from the database in the ApsaraDB RDS instance to OSS at the specified time. At the same time, a table schema that is the same as the table schema in the database of the ApsaraDB RDS instance is created in OSS and an OSS table is created in DLA.