This topic describes how to create an AnalyticDB for MySQL V3.0 dimension table. This topic also describes the parameters in the WITH clause and cache parameters used when you create an AnalyticDB for MySQL V3.0 dimension table.

Notice This topic applies only to Blink-3.5.0-hotfix and later.

DDL syntax

CREATE TABLE dim_ads(
   `name` VARCHAR,
    id VARCHAR,
    PRIMARY KEY (`name`),
    PERIOD FOR SYSTEM_TIME
)with(
    type='ADB30',
    url='jdbc:mysql://<Internal endpoint>/<databaseName>',
    tableName='xxx',
    userName='xxx',
    password='xxx'
);
Note
  • You must specify a primary key when you declare a dimension table.
  • When you join a dimension table with another table, the ON condition must contain equality conditions that include all primary keys.
  • The primary key of an AnalyticDB for MySQL database can be defined as the primary key or unique index column of an AnalyticDB for MySQL dimension table.

Parameters in the WITH clause

Parameter Description Required Remarks
type The type of the dimension table. Yes Set the value to ADB30.
url The URL of the AnalyticDB for MySQL database. Yes The URL of the AnalyticDB for MySQL database, such as url='jdbc:mysql://databaseName****-cn-shenzhen-a.ads.aliyuncs.com:10014/databaseName'.
Note
tableName The name of the table. Yes N/A.
userName The username that is used to access the AnalyticDB for MySQL database. Yes N/A.
password The password that is used to access the AnalyticDB for MySQL database. Yes N/A.
maxRetryTimes The maximum number of retries for writing data to the table. No Default value: 3.

Cache parameters

Parameter Description Required Remarks
cache The policy that is used to cache data. No Valid values:
  • None: indicates that data is not cached. This is the default cache policy.
  • LRU: indicates that only the specified data in the dimension table is cached. The system searches the cache each time it receives a data record from the source table. If the system does not find the record in the cache, the system searches for the data record in the physical dimension table.

    If this cache policy is used, you must configure the cacheSize and cacheTTLMs parameters.

  • ALL: indicates that all the data in the dimension table is cached. Before the system runs a job, the system loads all data in the dimension table to the cache. This way, the cache is searched for all subsequent queries in the dimension table. If the system does not find the data record in the cache, the join key does not exist. The system reloads all data in the cache after cache entries expire.

    If the amount of data in a remote table is small and a large number of missing keys exist, we recommend that you set this parameter to ALL. The source table and dimension table cannot be associated based on the ON clause.

    If you use this cache policy, you must configure the cacheTTLMs and cacheReloadTimeBlackList parameters.

Note
  • If you set the cache parameter to ALL, you must increase the memory of the node for joining tables because the system asynchronously loads data from the dimension table. The increased memory size is twice the memory size of the remote table.
  • If a dimension table stores a large volume of data and the cache parameter is set to ALL, an out of memory (OOM) error may occur or a full garbage collection (GC) may be time-consuming. To address this issue, you can use one of the following methods:
    • If the cache parameter can be set to ALL for a dimension table, enable the partitionedJoin feature. For a Blink version earlier than Blink 3.6.0, the full data of the dimension table is loaded for each concurrent job by default. For a Blink version later than Blink 3.6.0, the partitionedJoin feature is available if you set the cache parameter to ALL. After you enable the partitionedJoin feature, only the required data is cached for each concurrent job.
    • Use an ApsaraDB for HBase or ApsaraDB RDS dimension table that uses key-value pairs to store data.
cacheSize The maximum number of rows of data records that can be cached. No This parameter is available only if you set the cache parameter to LRU. Default value: 10000.
cacheTTLMs The interval at which the system refreshes the cache. The system reloads the latest data in the dimension table based on the value of this parameter. This ensures that the data in the source table is associated with the latest data in the dimension table. No Unit: milliseconds. This parameter is empty by default. This indicates that the updates in the dimension table are not reloaded.
cacheReloadTimeBlackList The periods of time during which cache is not refreshed. This parameter takes effect when the cache parameter is set to ALL. The cache is not refreshed during the time periods that you specify for this parameter. This parameter is useful for large-scale online promotional events such as Double 11. No This parameter is optional. This parameter is empty by default. For example, you can specify this parameter as '2017-10-24 14:00 -> 2017-10-24 15:00, 2017-11-10 23:30 -> 2017-11-11 08:00'. Use the following delimiters to separate time periods:
  • Separate multiple time periods with commas (,).
  • Separate the start time and end time of each time period with a hyphen and a greater-than sign (->).
partitionedJoin Specifies whether to enable the partitionedJoin feature. If the partitionedJoin feature is enabled, shuffling is implemented based on join keys before the primary table is joined with the dimension table. This process provides the following benefits:
  • If you set the cache parameter to LRU, the cache hit rate increases.
  • If you set the cache parameter to ALL, memory resources are reduced because only the required data is cached for each concurrent job.
No The default value of this parameter is false. This indicates that the partitionedJoin feature is disabled.
Note Before you enable the partitionedJoin feature, set partitionedJoin to true.
maxJoinRows The maximum number of results that are returned each time a data record in the primary table is queried and matched with data records in the dimension table. No Default value: 1024. If you can estimate that a data record in the primary table corresponds to a maximum of n data records in the dimension table, you can set the maxJoinRows to n to ensure efficient matching in Realtime Compute for Apache Flink.
Note When you join a dimension table with another table, this parameter specifies the maximum number of results that can be returned after a data record in the primary table is matched with data records in the dimension table.

Sample code

CREATE TABLE datahub_input1 (
  id      BIGINT,
  name    VARCHAR,
  age     BIGINT
) WITH (
  type='datahub'
);

create table phoneNumber (
  name VARCHAR,
  phoneNumber BIGINT,
  primary key(name),
  PERIOD FOR SYSTEM_TIME--The identifier of a dimension table. 
) with (
  type='ADB30'
);

CREATE table result_infor (
  id BIGINT,
  phoneNumber BIGINT,
  name VARCHAR
) with (
  type='rds'
);

INSERT INTO result_infor
SELECT
  t.id,
  w.phoneNumber,
  t.name
FROM datahub_input1 as t
JOIN phoneNumber FOR SYSTEM_TIME AS OF PROCTIME() as w -- You must include this clause when you perform a JOIN operation on the dimension table. 
ON t.name = w.name;