Some data files, such as standard forms and log files generated by enterprise services, are periodically uploaded to Object Storage Service (OSS). However, metadata in these files is usually not managed. This makes it difficult to compute and analyze data. To address this issue, Data Lake Analytics (DLA) provides the metadata discovery feature. After this feature is enabled, you can create a metadata discovery task in the DLA console. This task automatically generates and updates DLA metadata for OSS files in a single run. The metadata may be included in one or more tables. This task also automatically detects data fields and types in files, maps subdirectories to table partitions, detects new columns and partitions, and splits files into tables.

Configuration modes

OSS data sources can be configured in two modes: data warehouse mode and free mode. The following table describes the differences between the two modes.
| Configuration mode | Scenario | OSS directory format | Discovery precision | Performance |
| --- | --- | --- | --- | --- |
| Data warehouse mode | Users directly upload data to OSS and plan to build a standard data warehouse. This data warehouse is used to compute and analyze data. | Database name/Table name/File name or Database name/Table name/Partition name/.../Partition name/File name | High | High |
| Free mode | OSS data already exists, but the directory for saving OSS data is not specified. Users plan to create databases, tables, and partitions by using metadata discovery tasks. | No requirements | Medium | Medium |

OSS directory formats in data warehouse mode

OSS is an open file system. To efficiently build a data warehouse on top of OSS, you must make sure that OSS directories are in the correct format. If OSS data sources are configured in data warehouse mode, metadata discovery tasks of DLA support only the following OSS directory formats: Database name/Table name/File name and Database name/Table name/Partition name/.../Partition name/File name. The root directory is mapped to a schema. Second-level subdirectories are mapped to tables, and the name of each second-level subdirectory is used as the name of its table. If third-level or higher-level subdirectories exist, these subdirectories are mapped to partitions.
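The directory layout described above can be illustrated with a small parser. This is a minimal sketch and not DLA's implementation: the names `DiscoveredObject` and `parse_warehouse_path` are hypothetical, and the sample object key in the usage note is made up for illustration.

```python
from typing import NamedTuple


class DiscoveredObject(NamedTuple):
    """Hypothetical container for the pieces of a data warehouse mode path."""
    schema: str            # root (first-level) directory, mapped to a schema
    table: str             # second-level directory, mapped to a table
    partitions: list[str]  # third-level and deeper directories, in order
    file_name: str


def parse_warehouse_path(key: str) -> DiscoveredObject:
    """Split a key laid out as Database/Table[/Partition...]/File."""
    parts = [p for p in key.split("/") if p]
    if len(parts) < 3:
        raise ValueError(f"expected at least Database/Table/File, got {key!r}")
    schema, table, *partitions, file_name = parts
    return DiscoveredObject(schema, table, partitions, file_name)
```

For example, `parse_warehouse_path("db1/Table3/year=2020/month=03/day=01/data.csv")` yields the schema `db1`, the table `Table3`, and three partition directories. A key with fewer than three levels raises `ValueError`, because it does not fit either supported format.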

After DLA discovers metadata of OSS data sources in data warehouse mode, DLA automatically creates tables that are mapped to OSS directories. The following table describes the mappings.
| OSS directory | DLA table | Mapping description |
| --- | --- | --- |
| Table1 | No table mapped | Files in the Table1 directory are in different formats (CSV and JSON). As a result, DLA cannot create a table that is mapped to this directory. A mapped table can be created in DLA only if files in the directory are in the same format. Note: If files in a directory are in the same format but fields in the files are of different types, DLA cannot create a table that is mapped to this directory. |
| Table2 | Table2 | All files in the Table2 directory are in the CSV format. The Table2 table is created in DLA to map to the Table2 directory. |
| Table3 | Table3 (Partitioned table) | Subdirectories under the Table3 directory are named in the year=xx/month=xx/day=xx/ format. These subdirectories are automatically mapped to the following partitions in the Table3 table: year=2020/month=03/day=01, year=2020/month=03/day=02, year=2020/month=04/day=29, and year=2020/month=04/day=30. |
| Table4 | Table4 (Partitioned table) | The subdirectory under the Table4 directory is named in the age=xx format. This subdirectory is automatically mapped to the age=20 partition in the Table4 table. |
| Table5 | Table5 (Partitioned table) | Subdirectories under the Table5 directory are named in the partition_0=xx/partition_1=xx/partition_2=xx/ format. These subdirectories are automatically mapped to the following partitions in the Table5 table: 2020/03/29 and 2020/03/30. Note: No partition keys exist in these partitions. Therefore, partition_num is used. |
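The partition naming shown for Table3, Table4, and Table5 can be sketched as follows. This is an illustration under stated assumptions, not DLA's code: the function name `partition_spec` is hypothetical, and the sketch assumes that the number in a default key is the zero-based depth of the subdirectory.

```python
def partition_spec(subdirs: list[str]) -> dict[str, str]:
    """Map partition subdirectory names to {partition key: value}.

    A name such as "year=2020" contributes the key "year". A name without
    a key, such as "2020", falls back to a default key "partition_<n>",
    where <n> is assumed to be the zero-based directory depth.
    """
    spec: dict[str, str] = {}
    for index, name in enumerate(subdirs):
        key, sep, value = name.partition("=")
        if sep and key and value:
            spec[key] = value
        else:
            spec[f"partition_{index}"] = name
    return spec
```

With this sketch, `["year=2020", "month=03", "day=01"]` maps to the keys `year`, `month`, and `day`, whereas `["2020", "03", "29"]` maps to `partition_0`, `partition_1`, and `partition_2`.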

Procedure

  1. Log on to the DLA console.
  2. In the left-side navigation pane, choose Data Lake Management > Meta information discovery.
  3. On the Meta information discovery page, click Go To The Wizard in the OSS data source section.
  4. On the page that appears, click the OSS data source tab and configure the parameters described in the following table.
    Parameter: Description
    Data warehouse mode and Free mode: You can select Data warehouse mode or Free mode.
    • Data warehouse mode: allows you to create an automatic metadata discovery task if you build a standard data warehouse on top of OSS. In this mode, the precision of metadata discovery is high. OSS directories must be in the format of Database name/Table name/File name or Database name/Table name/Partition name/.../Partition name/File name.
    • Free mode: allows you to create an automatic metadata discovery task if you want to analyze data stored on OSS. This mode has no requirements for the OSS data structure. This may result in differentiated tables.
    OSS directory location: The OSS directory in which the destination file is stored. The directory ends with a forward slash (/). DLA automatically specifies this parameter based on the directory that you selected.
    Note: DLA automatically obtains the OSS bucket that is in the same region as DLA. You can also select a bucket from the Bucket drop-down list. After you select a bucket, DLA automatically lists all objects and files in this bucket. After you select the destination object and file, DLA automatically adds them to the OSS directory on the right.
    Format parser: A format parser reads data from a data file to determine the data format of the file. The default value of this parameter is Automatic parsing. This indicates that all built-in parsers are called in sequence. To specify the parser of a specific file format, you can set this parameter to json, parquet, avro, orc, or csv.
    • json: reads the beginning of the file to determine the file format.
    • parquet: reads the schema at the end of the file to determine the file format.
    • avro: reads the schema at the beginning of the file to determine the file format.
    • orc: reads the metadata from the file to determine the file format.
    • csv: checks the following delimiters: comma (,), vertical bar (|), tab (\t), semicolon (;), space ( ), and \u0001.
    Configuration options (optional): The advanced custom options, such as File field change rules and Object deletion change rules.
    Scheduling frequency: The frequency at which metadata discovery tasks are scheduled.
    Schema Name: The name of the schema, which is mapped to the name of the database in DLA. After a metadata discovery task is created, a schema is automatically created for this task.
  5. After you configure the preceding parameters, click Create to create a metadata discovery task.
    Note: After a metadata discovery task is created, DLA automatically runs the task at specified intervals. To immediately run the task, you can find your task on the Task List tab and click Execution in the Operation column.
  6. After the task starts to run, its status is displayed on the Task List tab. On this tab, you can manage the task: view its status, modify its configurations, or go to the Execute page to query data.
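The Format parser parameter in step 4 describes parsers that are tried in sequence. The following sketch illustrates that idea only; it is not how DLA implements its parsers. The function names, the order in which formats are tried, the JSON check (first non-blank character), and the CSV rule (a delimiter that appears the same non-zero number of times on every sampled line) are all assumptions. The magic bytes used for Parquet (`PAR1`), Avro object container files (`Obj\x01`), and ORC (`ORC`) come from those formats' public specifications.

```python
from typing import Optional

# Delimiters listed for the csv parser, in the order given in the text.
CSV_DELIMITERS = [",", "|", "\t", ";", " ", "\u0001"]


def guess_delimiter(lines: list[str]) -> Optional[str]:
    """Return the first candidate that splits every line into the same
    number (> 1) of fields, or None if no candidate qualifies."""
    for delim in CSV_DELIMITERS:
        counts = {line.count(delim) for line in lines}
        if len(counts) == 1 and counts != {0}:
            return delim
    return None


def sniff_format(data: bytes) -> Optional[str]:
    """Try each format check in turn (order is an assumption)."""
    # Parquet files begin and end with the 4-byte magic "PAR1".
    if len(data) >= 8 and data[:4] == b"PAR1" and data[-4:] == b"PAR1":
        return "parquet"
    # Avro object container files begin with "Obj" followed by byte 0x01.
    if data.startswith(b"Obj\x01"):
        return "avro"
    # ORC files begin with the 3-byte magic "ORC".
    if data.startswith(b"ORC"):
        return "orc"
    text = data.decode("utf-8", errors="replace")
    # Assumed JSON check: first non-blank character opens an object or array.
    if text.lstrip()[:1] in ("{", "["):
        return "json"
    sample = [line for line in text.splitlines() if line][:10]
    if guess_delimiter(sample) is not None:
        return "csv"
    return None
```

A real parser would also validate the schema or footer that follows the magic bytes. The sketch stops at the first plausible match, which is enough to show why pinning the parser to one format avoids ambiguous guesses.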

Usage notes

If you configure OSS data sources in data warehouse mode, take note of the following items when you run a metadata discovery task:
  • If an OSS directory is not discovered by DLA, you need to check whether all the files in the directory are in the same format. If the files are in the CSV format, you can configure parameters, such as Field delimiter, Identification, and Table header mode.
  • Metadata discovery tasks sample only some of the data records. If the fields in some rows are of different types, some fields may be missing from the table generated by DLA.
  • Metadata discovery tasks can discover partitions and tables only if the mapped subdirectories contain only files. If an OSS directory contains both subdirectories and files, metadata discovery tasks omit this directory. As a result, partitions cannot be generated for the mapped table.
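Before you run a task in data warehouse mode, you can check a bucket listing for directories that contain both files and subdirectories, because such directories are omitted. The following is a minimal sketch: `mixed_directories` is a hypothetical helper, it works on a plain list of object keys that you have already obtained, and it assumes that keys ending with a forward slash are directory placeholders.

```python
def mixed_directories(keys: list[str]) -> list[str]:
    """Return directories that directly contain both files and subdirectories."""
    has_files: set[str] = set()
    has_subdirs: set[str] = set()
    for key in keys:
        if key.endswith("/"):
            continue  # assumed to be a directory placeholder, not a file
        parts = key.split("/")
        # Every proper prefix of the key is a directory on the path to the file.
        for depth in range(1, len(parts)):
            directory = "/".join(parts[:depth])
            if depth == len(parts) - 1:
                has_files.add(directory)    # the file sits directly in it
            else:
                has_subdirs.add(directory)  # a deeper directory sits in it
    return sorted(has_files & has_subdirs)
```

For the keys `db/t1/a.csv`, `db/t1/year=2020/b.csv`, and `db/t2/c.csv`, the helper reports `db/t1`, which holds a file next to a partition subdirectory.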
If you configure OSS data sources in free mode, take note of the following items when you run a metadata discovery task:
  • Generate table names

    When you run a metadata discovery task, DLA automatically generates table names. The names of the tables in a schema are generated based on the following rules:

    • The name of the last-level directory is automatically used as the table name for OSS data files.
    • The table name can contain only letters, digits, and underscores (_).
    • The table name must be 1 to 128 characters in length. If the number of characters in the table name exceeds 128, the task truncates the generated name.
    • If a duplicate table name is generated, the task appends a .MD5 extension to the table name.
  • Create partitions

    If a metadata discovery task discovers multiple files in an OSS directory, the task determines the root directory and subdirectories based on the structure of the OSS directory. The root directory is mapped to a DLA table and the subdirectories are mapped to the partitions in the table.

    The name of a DLA table is based on the name or name prefix of an OSS directory that is mapped to the DLA table. If the subdirectories at the same level in a directory have almost the same directory structure and file formats, the task creates a partitioned table. Partitions in the partitioned table are mapped to the subdirectories. Sample OSS directory structure:
    oss://bucket01/folder1/table1/partition1/file.txt
    oss://bucket01/folder1/table1/partition2/file.txt
    oss://bucket01/folder1/table2/partition3/file.txt
    oss://bucket01/folder1/table2/partition4/file.txt

    In the preceding directories, subdirectories and files under the table1 and table2 directories are similar. In this situation, the task creates a table with two partitions.

    Sample OSS directory structure:
    oss://bucket01/folder1/table1/partition1/file.csv
    oss://bucket01/folder1/table1/partition2/file.csv
    oss://bucket01/folder1/table2/partition3/file.json
    oss://bucket01/folder1/table2/partition4/file.json

    In the preceding directories, the formats of the files under the table1 and table2 directories are different. The task creates two tables, each with two partitions. The table1 table has the partition1 and partition2 partitions. The table2 table has the partition3 and partition4 partitions.

    If subdirectories are named in the format of key-value pairs that are used by Hive tables, the metadata discovery task automatically uses the names of the keys as the partition names. Otherwise, the default partition names, such as partition_0 and partition_1, are used.
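The table naming rules listed under "Generate table names" can be sketched as follows. This is an illustration only, and several details are assumptions that the text does not specify: invalid characters are replaced with underscores, the MD5 value is the hexadecimal digest of the full directory path, and the digest is joined to the name with a period. The function name `table_name_for` is hypothetical.

```python
import hashlib
import re

MAX_TABLE_NAME_LENGTH = 128  # upper bound stated in the naming rules


def table_name_for(directory: str, existing: set[str]) -> str:
    """Derive a table name from the last-level directory of an OSS path.

    `existing` holds names that are already taken and is updated in place.
    """
    last_level = directory.rstrip("/").rsplit("/", 1)[-1]
    # Assumption: characters other than letters, digits, and underscores
    # are replaced with underscores, then the name is truncated.
    name = re.sub(r"[^A-Za-z0-9_]", "_", last_level)[:MAX_TABLE_NAME_LENGTH]
    if not name:
        raise ValueError(f"cannot derive a table name from {directory!r}")
    if name in existing:
        # Assumption: the MD5 extension is the hex digest of the full path.
        digest = hashlib.md5(directory.encode("utf-8")).hexdigest()
        name = f"{name}.{digest}"
    existing.add(name)
    return name
```

Calling the function twice with two different directories that both end in `table1` returns `table1` the first time and `table1.` followed by a 32-character digest the second time.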