This topic describes the background information, scenarios, and limits of the partition projection feature of Data Lake Analytics (DLA). It also describes how to use this feature.

Background information

If you store data in Object Storage Service (OSS) or Hadoop Distributed File System (HDFS), you may use a large number of partitions to manage the data. In this scenario, if you use the serverless Presto engine to analyze the data, it takes a long time to retrieve partition metadata from the metadata service of DLA. As a result, query performance deteriorates. DLA introduces the partition projection feature to address this issue. After you enable this feature, the serverless Presto engine of DLA can calculate partition values and locations based on the locations of partitioned tables. You do not need to retrieve partition metadata from the metadata service of DLA. This feature reduces the runtime of queries against a large number of partitioned tables.

Scenarios

We recommend that you use the partition projection feature in the following scenarios:
  • A partitioned table contains a large number of partitions. The serverless Presto engine of DLA takes a long time to retrieve partition metadata from the metadata service.
  • One partition, such as dt=2021-02-19, needs to be added to a partitioned table each day. You do not want to manually perform this operation.
  • Data is stored in OSS tables. These tables contain a large number of partitions, such as tens of thousands of partitions. However, only data in a few partitions needs to be queried.

Limits

  • If the partition projection feature is enabled for tables, the following SQL statement limits are imposed on these tables.
    • MSCK REPAIR TABLE tblName or ALTER TABLE tblName ADD PARTITION cannot be used to add partitions to these tables.
    • SHOW PARTITIONS tblName cannot be used to query partitions of these tables.
    • ALTER TABLE tblName DROP PARTITION cannot be used to drop partitions of these tables.
  • Only the serverless Presto engine of DLA supports the partition projection feature. This feature can be used only in DLA CU Edition.
  • If no OSS directories are projected into partitions and you use the partition projection feature to execute SQL statements, DLA returns None and no errors are returned.

Use the partition projection feature

Partition projection is a table-level feature. To use this feature, you need only to specify related parameters in the statements that are used to create partitioned tables. Sample statement:
create external table projection_test (
  field1 int
) 
partitioned by (
  a string, 
  b string, 
  c int, 
  d string
) 
LOCATION 'oss://oss-cn-hangzhou-xxx/xxx/projection_test' 
tblproperties(
  'projection.enabled' = 'true',    # Enable the partition projection feature for the table that you want to create.
  'projection.a.type' = 'injected',
  'projection.a.miss' = 'LIST',
  'projection.b.type' = 'date',
  'projection.b.range' = '2021-02-01, NOW',
  'projection.b.format' = 'yyyy-MM-dd',
  'projection.b.interval.unit' = 'days',  
  'projection.b.interval' = '1',
  'projection.c.type' = 'integer',   #projection.columnName.type (projection.c.type) indicates the partition projection type of a specified column. For more information, see "Partition projection type" in this topic.
  'projection.c.range' = '1,2',
  'projection.d.type' = 'enum',
  'projection.d.values' = '1,2',
  'storage.location.template' = 'oss://oss-cn-hangzhou-xxx/xxx/projection_test/${a}/${b}/${c}/${d}/'   # The format of an OSS directory. This format is considered a template that uses partition key columns as placeholders. The default directory format is table_path/Partition key column 1=Value 1/Partition key column 2=Value 2/. You can also use other directory formats.
)

Partition projection types

projection.c.type specifies the partition projection type of a partition key column. DLA supports four types of partition projection: enum, integer, date, and injected. The following table describes the parameters that you must specify for different partition projection types.
  • enum
    Parameter Required Example Description
    projection.columnName.type Yes enum The partition projection type of a partition key column. enum is one of the supported partition projection types.
    projection.columnName.values Yes RED, YELLOW, GREEN A comma-delimited list of enumeration values.
  • integer
    Parameter Required Example Description
    projection.columnName.type Yes integer The partition projection type of a partition key column. integer is one of the supported partition projection types.
    projection.columnName.range Yes 1, 100 A comma-delimited list of integers. Negative integers are supported. In the list, the first integer value must be less than the second integer value.
    projection.columnName.interval No 1 The interval between successive partition values of a partition key column. The value of this parameter must be an integer greater than 0. The default value is 1. For example, if you set projection.columnName.range to 1,10 and projection.columnName.interval to 3, the partition values that are generated are 1, 4, and 7.
    projection.columnName.digits No 1 The final format of partition values that are generated. The value of this parameter must be a positive integer. The default value is 0. For example, if you set projection.columnName.range to 1,10, projection.columnName.interval to 3, and projection.columnName.digits to 2, the partition values that are generated are 01, 04, and 07.
  • date
    Parameter Required Example Description
    projection.columnName.type Yes date The partition projection type of a partition key column. date is one of the supported partition projection types.
    projection.columnName.range Yes 20201212, 20210219, 2020-12,NOW-8YEARS, NOW+6months A comma-delimited list of date values. The value of this parameter can be a date value of the STRING type, for example, 2021-02-19. You can also use the following components to form a character string: NOW, +/-, YEARS, MONTHS, WEEKS, DAYS, HOURS, MINUTES, and SECONDS. The character string must comply with the \s*NOW\s*(([\+\-])\s*([0-9]+)\s*(YEARS?|MONTHS?|WEEKS?|DAYS?|HOURS?|MINUTES?|SECONDS?)\s*)? regular expression. Characters in the string are not case-sensitive.
    projection.columnName.format Yes yyyy-MM, yyyy/MM/dd, yyyyMMdd HH:mm:ss The format of a date value. You must make sure that the format that you specified can be parsed by using java.text.SimpleDateFormat.
    projection.columnName.interval Yes 1 The interval between successive partition values of a partition key column. The value of this parameter must be an integer greater than 0. The default value is 1.
    projection.columnName.interval.unit Yes YEARS, MONTHS, WEEKS, DAYS, HOURS, MINUTES, SECONDS The date part to increment a date value. Values in the left table are supported. The value of this parameter is not case-sensitive. For example, if you set projection.columnName.range to 2020-02, 2021-02, projection.columnName.interval to 3, and projection.columnName.interval.unit to MONTHS, the partition values that are generated are 2020-02, 2020-05, 2020-08, 2020-11, and 2021-02.
  • injected
    Parameter Required Example Description
    projection.columnName.type Yes injected The partition projection type of a partition key column. injected is one of the supported partition projection types. The value range is not limited.
    projection.columnName.miss Yes ERROR The value of this parameter is not case-sensitive. Valid values:
    • ERROR: indicates that an error is returned if a partition key column is not specified for a query.
    • LIST: indicates that the OSS directories that are projected into a partition with a partition key column are listed if this column is not specified for a query. All the directories that comply with the directory structure are projected into the same partition.
      Notice If you set this parameter to LIST, DLA performance may deteriorate. Exercise caution when you use this setting.
    Note If you set projection.columnName.miss to LIST and more than 1,000 OSS directories are listed, the serverless Presto engine of DLA may take a long time to obtain data from OSS. As a result, an error is returned. To work around this issue, you can add projection-max-paths=-1 before the statement that is used to query data. Sample statement:
    /*+ projection-max-paths=-1 */ select * from projection_test;