Some data files such as standard forms and log files generated by enterprise services are periodically uploaded to Object Storage Service (OSS). However, metadata in these files is usually not managed. This makes it difficult to analyze and compute data. To address this issue, Data Lake Analytics (DLA) provides the metadata discovery feature. After this feature is enabled, you can create a metadata discovery task in the DLA console. This task automatically generates and updates DLA metadata for OSS files in a single run. The metadata may be included in one or more tables. This task also automatically detects data fields and types in files, maps subdirectories to table partitions, detects new columns and partitions, and splits files into tables.

Configuration modes

OSS data sources can be configured in two modes: data warehouse mode and free mode. The following table describes the differences between the two modes.
Configuration mode | Scenario | OSS directory format | Recognition accuracy | Performance
Data warehouse mode | Users directly upload data to OSS and plan to create a standard data warehouse. This data warehouse is used to analyze and compute data. | Database name/Table name/File name or Database name/Table name/Partition name/.../Partition name/File name | High | High
Free mode | OSS data already exists, but the directory for saving OSS data is not specified. Users plan to create databases, tables, and partitions by using metadata discovery tasks. | No requirements | Moderate | Moderate

OSS directory formats in data warehouse mode

OSS is an open file system. To efficiently create a data warehouse in OSS, you must make sure that OSS directories are in the correct format. If OSS data sources are configured in data warehouse mode, metadata discovery tasks of DLA support only the following OSS directory formats: Database name/Table name/File name and Database name/Table name/Partition name/.../Partition name/File name. The root directory is mapped to a schema. Second-level subdirectories are mapped to tables, and the names of these subdirectories are used as the table names. If third-level or higher-level subdirectories exist, these subdirectories are mapped to partitions.
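The directory formats above can be sketched as a small path parser. This is only an illustration of the documented layout, not DLA's implementation. The names WarehousePath and parse_warehouse_path are hypothetical, and the sketch assumes that the object key starts at the database-level directory.

```python
from typing import NamedTuple, Tuple


class WarehousePath(NamedTuple):
    database: str                # root directory, mapped to a schema
    table: str                   # second-level subdirectory, mapped to a table
    partitions: Tuple[str, ...]  # third-level and deeper subdirectories
    file_name: str


def parse_warehouse_path(key: str) -> WarehousePath:
    """Split 'Database/Table[/Partition...]/File' into its parts."""
    parts = [p for p in key.strip("/").split("/") if p]
    if len(parts) < 3:
        raise ValueError(f"expected at least Database/Table/File, got {key!r}")
    database, table, *partition_dirs, file_name = parts
    return WarehousePath(database, table, tuple(partition_dirs), file_name)
```

For example, parse_warehouse_path("db1/Table3/year=2020/month=03/day=01/a.csv") yields the database db1, the table Table3, and three partition directories.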

After DLA discovers metadata of OSS data sources in data warehouse mode, DLA automatically creates tables that are mapped to OSS directories. The following table describes the mappings.
OSS directory DLA table Mapping description
Table1 No table mapped Files in the Table1 directory are in different formats (CSV and JSON). As a result, DLA cannot create a table that is mapped to this directory. A mapped table can be created in DLA only if files in the directory are in the same format.
Note If files in a directory are in the same format but fields in the files are of different types, DLA cannot create a table that is mapped to this directory.
Table2 Table2 All files in the Table2 directory are in the CSV format. The Table2 table is created in DLA to map to this directory.
Table3 Table3 (Partitioned table) Subdirectories under the Table3 directory are named in the format of year=xx/month=xx/day=xx/. These subdirectories are automatically mapped to the following partitions in the Table3 table:
  • year=2020/month=03/day=01
  • year=2020/month=03/day=02
  • year=2020/month=04/day=29
  • year=2020/month=04/day=30
Table4 Table4 (Partitioned table) The subdirectory under the Table4 directory is named in the format of age=xx. This subdirectory is automatically mapped to the age=20 partition in the Table4 table.
Table5 Table5 (Partitioned table) Subdirectories under the Table5 directory are not named in the key=value format. They are automatically mapped to partitions in the format of partition_0=xx/partition_1=xx/partition_2=xx/. The following subdirectories are mapped to partitions in the Table5 table:
  • 2020/03/29
  • 2020/03/30
Note The names of these subdirectories contain no partition keys. Therefore, default keys in the format of partition_num are used.
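The partition mappings for Table3, Table4, and Table5 can be sketched as follows. The function name map_partitions is hypothetical. The sketch also assumes that the default keys are applied to all levels if any subdirectory name lacks a key, which this topic does not state.

```python
def map_partitions(partition_dirs):
    """Return (key, value) pairs for a list of partition subdirectory names."""
    # Hive-style names such as year=2020 already carry their partition key.
    if partition_dirs and all("=" in d for d in partition_dirs):
        return [tuple(d.split("=", 1)) for d in partition_dirs]
    # Assumption: without keys, fall back to partition_0, partition_1, ...
    return [(f"partition_{i}", d) for i, d in enumerate(partition_dirs)]
```

With this sketch, ["year=2020", "month=03", "day=01"] keeps the keys year, month, and day, whereas ["2020", "03", "29"] is assigned the keys partition_0, partition_1, and partition_2.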

Procedure

  1. Log on to the DLA console.
  2. In the left-side navigation pane, choose Data Lake Management > Meta information discovery.
  3. On the Meta information discovery page, click Go To The Wizard in the OSS data source section.
  4. On the OSS data source tab, select an appropriate format from the Format parser drop-down list.
  5. On the OSS data source tab, configure other parameters as prompted. The following table describes the parameters.
    Parameter Description
    Format parser A format parser reads data from a data file to determine the data format of the file. The default value of this parameter is Automatic parsing. This indicates that all built-in parsers are called in sequence. To specify the parser of a specific file format, you can set this parameter to json, parquet, avro, orc, or csv.
    • json: reads the beginning of the file to determine the file format.
    • parquet: reads the schema at the end of the file to determine the file format.
    • avro: reads the schema at the beginning of the file to determine the file format.
    • orc: reads the metadata from the file to determine the file format.
    • csv: checks the following delimiters: comma (,), vertical bar (|), tab (\t), semicolon (;), space ( ), and \u0001.
    Scheduling frequency The frequency at which metadata discovery tasks are run. You can specify this parameter to schedule metadata discovery tasks.
    Schema Name The name of a schema, which indicates the name of the DLA database that is mapped to the OSS database. After a metadata discovery task is created, a schema is automatically created for this task.
    Configuration options (optional) The advanced custom options, such as File field change rules and Object deletion change rules.
  6. After you configure the preceding parameters, click Create to create a metadata discovery task.
    Note After a metadata discovery task is created, DLA automatically runs the task at specified intervals. To immediately run the task, you can find your task on the Task List tab and click Execution in the Operation column.
  7. After the task starts to run, the running status of the task is displayed on the History List tab. On the Task List tab, you can view the task status and modify task configurations. In addition, you can go to the Execute page to query data.
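The csv format parser in step 5 checks a fixed set of delimiters. This topic does not describe how DLA chooses among them, so the following is only one possible heuristic: select the candidate that occurs the same nonzero number of times in every sampled line. The names CANDIDATE_DELIMITERS and guess_delimiter are hypothetical.

```python
# Delimiters listed for the csv parser in this topic.
CANDIDATE_DELIMITERS = [",", "|", "\t", ";", " ", "\u0001"]


def guess_delimiter(lines):
    """Return the most plausible delimiter for the sampled lines, or None."""
    best, best_count = None, 0
    for delim in CANDIDATE_DELIMITERS:
        # Number of occurrences per non-empty line, collapsed into a set.
        counts = {line.count(delim) for line in lines if line}
        if len(counts) == 1:  # the same count on every line
            (count,) = counts
            if count > best_count:
                best, best_count = delim, count
    return best
```

This heuristic ignores quoting and escaped delimiters. If automatic detection fails for real files, set the Field delimiter parameter explicitly.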

Usage notes

If you configure OSS data sources in data warehouse mode, take note of the following items when you run a metadata discovery task:
  • If an OSS directory is not recognized by DLA, you can check whether files in the directory are in the same format. If the files are in the CSV format, you can configure parameters such as Field delimiter, Identification, and Table header mode.
  • Metadata discovery tasks sample only some of the data records. If the fields in some rows are of different types, some fields may be missing from the table generated by DLA.
  • Metadata discovery tasks can discover partitions and tables only if the mapped subdirectories contain only files. If an OSS directory contains both subdirectories and files, metadata discovery tasks omit this directory. As a result, partitions cannot be generated for the mapped table.
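Before you run a task, you can check for directories that contain both files and subdirectories. The following sketch works on a plain list of object keys. How the keys are listed from OSS is out of scope, and the function name find_mixed_directories is hypothetical.

```python
def find_mixed_directories(keys):
    """Return directories that directly contain both files and subdirectories."""
    has_files, has_subdirs = set(), set()
    for key in keys:
        if key.endswith("/"):
            continue  # skip directory placeholder objects
        parts = key.strip("/").split("/")
        for depth in range(1, len(parts)):
            directory = "/".join(parts[:depth])
            if depth == len(parts) - 1:
                has_files.add(directory)    # direct parent of the file
            else:
                has_subdirs.add(directory)  # ancestor with a nested directory
    return sorted(has_files & has_subdirs)
```

For the keys db/t1/year=2020/a.csv and db/t1/b.csv, the sketch reports db/t1, because this directory holds both the file b.csv and the subdirectory year=2020.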
If you configure OSS data sources in free mode, take note of the following items when you run a metadata discovery task:
  • Generate table names

    When you run a metadata discovery task, DLA automatically generates table names. The table names generated in a schema comply with the following rules:

    • By default, the name of the last-level directory is used as the table name for OSS data files.
    • The table name can contain only letters, digits, and underscores (_).
    • The table name can contain a maximum of 128 characters. If the number of characters in the table name exceeds 128, the task deletes the excess characters from the table name.
    • If a duplicate table name is generated, the task appends a .MD5 extension to the table name.
  • Create partitions

    If a metadata discovery task discovers multiple files in an OSS directory, the task determines the root directory and subdirectories based on the structure of the OSS directory. The root directory is mapped to a DLA table and the subdirectories are mapped to the partitions in the table.

    The name of a DLA table is based on the name or name prefix of an OSS directory that is mapped to the DLA table. If the subdirectories at the same level in a directory have almost the same directory structure and file formats, the task creates a partitioned table. Partitions in the partitioned table are mapped to the subdirectories. Sample OSS directory structure:
    oss://bucket01/folder1/table1/partition1/file.txt
    oss://bucket01/folder1/table1/partition2/file.txt
    oss://bucket01/folder1/table2/partition3/file.txt
    oss://bucket01/folder1/table2/partition4/file.txt

    In the preceding directories, subdirectories and files under the table1 and table2 directories are similar. In this situation, the task creates a table with two partitions.

    Sample OSS directory structure:
    oss://bucket01/folder1/table1/partition1/file.csv
    oss://bucket01/folder1/table1/partition2/file.csv
    oss://bucket01/folder1/table2/partition3/file.json
    oss://bucket01/folder1/table2/partition4/file.json

    In the preceding directories, the files in the table1 and table2 directories are in different formats. In this situation, the task creates two tables in DLA, each with two partitions. The table1 table includes partition1 and partition2. The table2 table includes partition3 and partition4.

    If subdirectories are named in the format of key-value pairs that are used by Hive tables, the task automatically uses the names of the partition keys as the partition names. Otherwise, the default partition names are used.
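The table name rules listed under Generate table names can be sketched as follows. This topic does not specify several details, so the sketch makes assumptions: unsupported characters are replaced with underscores, the MD5 value is computed from the full directory path, and the extension is added after truncation. The names make_table_name and MAX_TABLE_NAME_LENGTH are hypothetical.

```python
import hashlib
import re

MAX_TABLE_NAME_LENGTH = 128


def make_table_name(directory, existing_names):
    """Derive a table name from the last-level directory of an OSS path."""
    last_level = directory.rstrip("/").rsplit("/", 1)[-1]
    # Assumption: characters other than letters, digits, and underscores
    # are replaced with underscores. Excess characters are then cut off.
    name = re.sub(r"[^A-Za-z0-9_]", "_", last_level)[:MAX_TABLE_NAME_LENGTH]
    if name in existing_names:
        # Assumption: the extension is the MD5 hex digest of the full path.
        digest = hashlib.md5(directory.encode("utf-8")).hexdigest()
        name = f"{name}.{digest}"
    return name
```

For example, oss://bucket01/folder1/table1/ produces the name table1. If table1 already exists in the schema, the sketch returns table1 followed by a period and a 32-character digest.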