All Products
Search
Document Center

Create OSS partition tables in DLA

Last Updated: May 13, 2019

Background

In relational databases, tables that contain large amounts of data can be partitioned to improve the query performance. Data Lake Analytics (DLA) is a serverless big data analysis service, which allows you to use standard SQL statements to query the data that is stored in Alibaba Cloud Object Storage Service (OSS) and Table Store. In DLA, you can also use partition tables to refine data in OSS and Table Store, thereby shortening the query response time.

This topic describes how to create and use partition tables in DLA by using OSS as the data source.

Directory structure of partition tables in OSS

DLA can map directories or files that are stored in OSS into a table. Data in the table is the file content in OSS. The partition columns in a partition table are mapped to directories that are named following a special naming rule in OSS:

  • Each partition column is mapped to a subdirectory under the LOCATION field of the table. A directory is named as follows: Partition column name=Partition column value.

  • If the table has multiple partition columns, these columns must be nested in the sequence specified in the table creation statement.

The following is an example of the directory structure in OSS:

  1. https://bucket-name.oss-cn-hangzhou.aliyuncs.com/dla/y=2018/m=12/kv2.txt
  2. https://bucket-name.oss-cn-hangzhou.aliyuncs.com/dla/y=2019/m=2/kv1.txt

In the preceding directory structure, use DLA to create an external table named dla that points to the OSS file. The dla table is a partition table with two partition columns, namely, y (year) and m (month):

  1. CREATE EXTERNAL TABLE dla (
  2. prod_name string,
  3. cnt bigint)
  4. PARTITIONED BY (y string, m string)
  5. STORED AS TEXTFILE
  6. LOCATION 'oss://bucket-name/dla/';

Notes

  • The nesting sequence of partition columns in the OSS directory structure must be the same as the sequence that is defined in the table.

    For the preceding OSS directory structure, the following external table creation statement is incorrect:

    1. CREATE EXTERNAL TABLE dla (
    2. prod_name string,
    3. cnt bigint)
    4. PARTITIONED BY (m string, y string)
    5. STORED AS TEXTFILE
    6. LOCATION 'oss://bucket-name/dla/';
  • Only the data in the directories that map the partition columns in the partition table is scanned.

    For the following directory structure:

    1. https://bucket-name.oss-cn-hangzhou.aliyuncs.com/dla/y=2019/m=3/kv3.txt
    2. https://bucket-name.oss-cn-hangzhou.aliyuncs.com/dla/kv4.txt

    If the table creation statement specifies columns y and m as the partition columns, you can query only the data in the kv3.txt file through the partition table. Data in the kv4.txt file cannot be queried.

Prerequisites

Test data is ready in OSS. If not, follow these steps to prepare the data:

  1. Activate OSS.

  2. Create a bucket.

  3. Create a directory to store files and name it in the format of Partition column name=Partition column value.

  4. Upload a file.

Procedure

Step 1: Create an OSS connection in DLA

Log on to the Data Lake Analytics console. In the endpoint list, locate the row that contains the target endpoint and click Log on in DMS in the Actions column to create a connection to OSS in DLA. Use the following syntax to create the connection:

  1. CREATE SCHEMA dla_oss_db with DBPROPERTIES(
  2. catalog='oss',
  3. location= 'oss://bucket-name/dla/'
  4. );

location: indicates the file storage directory in the OSS bucket, which must end with a slash (/).

Step 2: Create an external table (partition table) that points to the OSS file in DLA

  1. CREATE EXTERNAL TABLE dla (
  2. prod_name string,
  3. cnt bigint)
  4. PARTITIONED BY (y string, m string)
  5. STORED AS TEXTFILE
  6. LOCATION 'oss://bucket-name/dla/';

Step 3: Run the MSCK command to update partition information

After the external table is created, run the MSCK REPAIR TABLE command to synchronize partition information to DLA. The MSCK command can only identify the directories that are named following the partition column naming rule Partition column name=Partition column value for DLA.

  1. MSCK REPAIR TABLE dla;

Note: If the partition directories in OSS have changed, when you run the MSCK command, DLA automatically synchronizes partition information based on the current partition values in OSS.

After the MSCK command is executed, the following information appears:

  1. Repair: Added partition to metastore dla_oss_db.dla:y=2018/m=12
  2. Repair: Added partition to metastore dla_oss_db.dla:y=2019/m=2

Step 4: Run the SHOW PARTITIONS command to view partition information

After the MSCK command is executed, run the SHOW PARTITIONS command to view all partition information in the table.

  1. show partitions dla;
  2. +-----------+
  3. | y=2018/m=12|
  4. | y=2019/m=2 |

Step 5: Query data in the partition table

  1. select count(*) from dla;
  2. +-------+
  3. | _col0 |
  4. +-------+
  5. | 4 |
  6. select * from dla;
  7. +---------------------------+
  8. |prod_name | cnt | y | m |
  9. +---------------------------+
  10. |Monitor | 2 |2019 | 2 |
  11. |Hard disk | 6 |2018 | 12 |
  12. |Keyboard | 3 |2018 | 12 |
  13. |Mouse | 1 |2019 | 2 |

You can use partition columns as filter criteria:

  1. select * from dla where y='2019';
  2. +---------------------------+
  3. |prod_name | cnt | y | m |
  4. +---------------------------+
  5. |Monitor | 2 |2019 | 2 |
  6. |Mouse | 1 |2019 | 2 |

Subsequent operations

Synchronize partition information

When partition directories in OSS change, for example, when a partition directory is added to OSS, you can run the MSCK command to make the new partition take effect.

  1. MSCK REPAIR TABLE table_name;

Add partitions

  1. ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
  2. partition_spec:
  3. : (partition_column = partition_col_value, partition_column = partition_col_value, ...)

You can add multiple partitions at one time. Separate them with commas (,). For example:

  1. ALTER TABLE order_part ADD
  2. PARTITION (dt='2008-08-08', status='ready') location '/path/to/ready/part080808',
  3. PARTITION (dt='2008-08-09', status='new') location '/path/to/new/part080809';

Note: If you specify IF NOT EXISTS when adding a partition and the partition that you want to add already exists, the new directory that is specified by the LOCATION parameter overwrites the directory to which the original partition points.

Delete partitions

  1. ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...] ;

You can delete multiple partitions at one time. Separate them with commas (,). For example:

  1. ALTER TABLE order_part DROP
  2. PARTITION (dt='2008-08-08', status='ready');

Note:

  • In the preceding command, you can specify the partitions that you want to delete only by setting **Partition column name=Partition column value**. The partitions cannot be specified by an expression, such as partitionCol > 100.

  • If a deleted partition complies with the naming rule Partition column name=Partition column value, the MSCK command will automatically add this partition.