All Products
Search
Document Center

Multi-database warehousing

Last Updated: Apr 09, 2020

Background

In the actual applications of databases, as the data volume of a single ApsaraDB for RDS (RDS) instance increases, the responsiveness to data queries drops, which severely affects user experience. In this case, the service end usually splits the data of a single table in an RDS database into multiple tables in different databases by using the sharding technology, and then modifies the service-layer code accordingly. With sharding middleware such as Taobao distributed data layer (TDDL), you can continue to use the RDS database in services.

While the preceding solution rectifies user experience degradation due to the large data volume of a single RDS database, another issue arises. That is, the big data analysis of the shard data is very complex, because one table is split into multiple tables logically, but no middleware such as TDDL is available to shield the splitting of physical tables.

What is multi-database warehousing?

The multi-database warehousing function of Data Lake Analytics (DLA) is designed to aggregate the shard data of an RDS instance into a unified table by going through the DLA console wizard, in order to store data in the form of sharded tables. This allows you to analyze all data globally and select a shard to perform focal analysis on the shard data, without affecting service running on the RDS instance.

Prerequisites

Before using the multi-database warehousing function of Data Lake Analytics (DLA), you must complete the following steps:

Steps

  1. Log on to the Data Lake Analytics console.

  2. In the upper-left corner of the page, select the region where your DLA service is located.

  3. In the left-side navigation pane, choose Schema.

  4. On the Schema page, click Create Schema.

  5. On the Popular tab page of the New Schema page, click Create By Wizard in the Multi-database Warehousing section.

  6. If the DLA console accesses the RDS instance for the first time, you must grant the read-only permission of the instance for the DLA console. After authorizing, click Next.

    Skip this step if you have granted the read-only permission of the instance for the DLA console.

  7. Configure the RDS instance and the OSS warehouse for storing RDS data as instructed.

    Category Parameter Description
    Manual Selection

    You can manually specify data sources for RDS instances. This method is applicable to scenarios where only a small number of RDS instances are used and the number of instances is static without frequent increase.

    Type The type of the data source, which is RDS in this case. Select the radio button for an RDS instance to add the RDS instance to Data Source.
    Database Filtering Rules Enter the name of the database to be synchronized.
    • Separate multiple database names with commas (,).
    • A database name can include the wildcard %, for example, user_%.
    Specify by Query

    Specify the RDS data source by performing SQL query. This method is applicable to scenarios where a large number of RDS instances are used and the number of instances increases dynamically.

    - For example, you can run the following query statement to specify the data source: SELECT 'mysql' AS engine, 'db001' AS db_name, 'rm-111..aliyuncs.com' AS host, 3306 AS port, 'rm-123445' AS instance_id, 'vpc-3424555' AS vpc_id FROM tbl1.
    Certification Information User Name For ease of use, the DLA console requires you to use the same username and password for all databases.
    Password The password of the preceding username.

    After entering the username and password, you can click Test Connection to test the connectivity.

    Warehousing Configuration Schema Name The name of the schema. This indicates the name of the database to which the RDS database is mapped in the DLA console.
    Location The detailed address for storing RDS data in OSS when you create the data warehouse.

    The system automatically pulls OSS buckets that are in the same region as the DLA console. Click Select Location. A dialog box appears, prompting for selecting an OSS path. You can select a bucket and an object based on your service requirements.

    To use the multi-database warehousing function, the DLA console must have the permission to delete OSS data, so that it can extract, transform, and load (ETL) OSS data to RDS. For more information about authorization, see Authorize DLA to delete OSS files.

    Sync Time The time when RDS data is synchronized to OSS.

    The default data synchronization time is 00:30. You can change this time to an off-peak hour based on service traffic, to minimize the impact on services during synchronization.

    Table Name Generation Rules The name of the table to which the RDS table is mapped in the data warehouse during the configuration of DLA-based warehousing. The mapped table name is automatically generated according to the following rules:
    • IdentityResolver: The table name in the data warehouse is the same as that in the RDS table. It is applicable to scenarios where the RDS instance has database shards but no table shards.
    • RemoveTrailingUnderscoreAndNumberResolver: The underscore (_) and numeric string that last appear in the RDS table name are removed to form the table name in the data warehouse.

      For example, if the RDS table name is tbl_001, the table name in the data warehouse is tbl.

    Partition Configuration Sets the fields for sharding in the data warehouse and the method for generating shard field values.

    The shard field value is an expression with variables, for example, ${rdsInstanceId}. Currently, the DLA console supports the following variables:

    • rdsEngine: The database engine supported by RDS, which can be MySQL, SQL Server, PostgreSQL, or Oracle.
    • rdsDbName: The database name in the RDS instance.
    • rdsTableName: The data table name in the RDS instance.
    • rdsInstanceId: The ID of the RDS instance.
    • rdsVpcId: The ID of the Virtual Private Cloud (VPC) to which the RDS instance belongs.
    We recommend that you enter both the RDS instance ID and the database name, for example:
    • Shard name rds_instance_id and the shard value ${rdsInstanceId}
    • Shard name rds_db_name and the shard value ${rdsDbName}
    Advanced Options The user-defined options, such as filtering fields. For more information, see Advanced options.

    Multi-database Warehousing

  8. After completing the preceding steps, click Create to create the data warehouse.

After the data warehouse is created, DLA automatically synchronizes RDS data to OSS at the specified synchronization time. In addition, DLA creates a table structure in OSS, which is the same as the table structure in the RDS instance, and creates a corresponding OSS table in DLA.