This topic describes how to use the wizard to create a metadata crawling task. When a metadata crawling task runs, DLA automatically creates and updates the metadata (one or more tables) for the data files stored in OSS in a single run. The task can also automatically discover data fields and field types, map directories to partitions, detect new columns and partitions, and split files into tables.
- Go to the Metadata crawling page.
- On the Metadata crawling page, click the Go To The Wizard button.
- On the left of the Metadata crawling page, specify the directory in which the OSS data you want to crawl is located.
- On the right of the Metadata crawling page, specify the parameters listed in the following table as prompted.
| Parameter | Description |
| --- | --- |
| Format parser | The default value of this parameter is Automatic parsing. This indicates that all built-in parsers are called in sequence. You can also specify a parser for a specific file type such as JSON, Apache Parquet, Apache Avro, ORC, or CSV. |
| Crawling frequency | You can set this parameter to schedule metadata crawling tasks as needed. |
| Schema Name | The name of the schema, which refers to the name of the database mapped to DLA. After a crawling task is created, a new schema is automatically created for this task. |
| Configuration options (optional) | The advanced custom options, including How to handle table updates when the fields of the file under the oss Directory Change and How to handle table updates when oss objects are deleted. |
- After you configure the preceding parameters, click Create to create a metadata crawling task.
Note: After a metadata crawling task is created, DLA automatically runs the task at the specified intervals. You can also manually run the task on the Task List tab to synchronize data immediately.
- After the task starts to run, the running status of the task instance is displayed on the Instance List tab. On the Task List tab, you can also view the task status and modify the task configuration. In addition, you can go to the SQL window of DLA to query the data.
- Generate a table name during metadata crawling
Table names are automatically generated during metadata crawling. The names of the tables saved in the metadata schema comply with the following rules:
- By default, the name of the last-level directory is used as the table name for OSS data files.
- The table name can contain letters, numbers, and underscores (_).
- The table name can contain a maximum of 128 characters. If the generated name exceeds this limit, the crawler truncates it.
- If a duplicate name is generated, the .MD5 extension is appended to the name during metadata crawling.
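The naming rules above can be sketched in a few lines of Python. This is only an illustration, not the crawler's actual implementation: the function name `generate_table_name` is hypothetical, and the replacement of disallowed characters with underscores and the exact form of the MD5 suffix (an underscore followed by the hash of the full path) are assumptions that the rules do not spell out.

```python
import hashlib
import re

MAX_TABLE_NAME_LENGTH = 128  # limit stated in the naming rules


def generate_table_name(oss_path, existing_names):
    """Derive a table name from the last-level directory of an OSS path."""
    # The last non-empty path segment is the last-level directory.
    segments = [s for s in oss_path.rstrip("/").split("/") if s]
    name = segments[-1]
    # Assumption: characters other than letters, digits, and "_" become "_".
    name = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # Truncate names that exceed the length limit.
    name = name[:MAX_TABLE_NAME_LENGTH]
    if name in existing_names:
        # Assumption: the MD5 "extension" is a hash of the full path,
        # joined with "_" so that the result still fits the allowed characters.
        suffix = "_" + hashlib.md5(oss_path.encode("utf-8")).hexdigest()
        name = name[: MAX_TABLE_NAME_LENGTH - len(suffix)] + suffix
    return name
```

For example, `generate_table_name("oss://bucket01/folder1/table1/", set())` yields `table1`, and a second directory that would also map to `table1` receives a hashed suffix instead of overwriting the first table.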
- Create partitions during metadata crawling
If the crawling task detects multiple directories and files in OSS during file scanning, the task determines the root directory of each table in the directory structure and which directories are used as table partitions. The table name is based on the prefix of its parent directory or the directory name. In the following example, the two directories at the same directory level have almost the same directory structure and file formats. In this situation, the crawler creates a partitioned table.
oss://bucket01/folder1/table1/partition1/file.txt
oss://bucket01/folder1/table1/partition2/file.txt
oss://bucket01/folder1/table2/partition3/file.txt
oss://bucket01/folder1/table2/partition4/file.txt
In the preceding example, the directories and file content under table1 and table2 are similar. In this situation, the crawler creates one table with two partition columns: partition_0 (the table directory) and partition_1 (the partition directory). Compare this with the following example:
oss://bucket01/folder1/table1/partition1/file.csv
oss://bucket01/folder1/table1/partition2/file.csv
oss://bucket01/folder1/table2/partition3/file.json
oss://bucket01/folder1/table2/partition4/file.json
In the preceding example, the file formats under table1 and table2 are different. The crawler creates two tables, each with one partition key column. The partition column of table1 contains the values partition1 and partition2. The partition column of table2 contains the values partition3 and partition4.
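The decision between one table and several tables can be sketched as follows. This is a simplified illustration under stated assumptions, not the crawler's real logic: `plan_tables` is a hypothetical helper, every file is assumed to sit below a first-level directory of the root, and similarity is judged by file extension only, whereas the actual crawler also inspects file content.

```python
import posixpath
from collections import defaultdict


def plan_tables(root, paths):
    """Map each planned table root to its list of partition column names."""
    root = root.rstrip("/") + "/"
    formats = defaultdict(set)  # first-level directory -> file extensions seen
    depth = defaultdict(int)    # first-level directory -> directory levels below it
    for path in paths:
        parts = path[len(root):].split("/")  # e.g. ["table1", "partition1", "file.csv"]
        top, filename = parts[0], parts[-1]
        formats[top].add(posixpath.splitext(filename)[1].lower())
        depth[top] = max(depth[top], len(parts) - 2)

    if len({frozenset(exts) for exts in formats.values()}) == 1:
        # Same formats everywhere: one table at the root, and the first-level
        # directory itself becomes the partition column partition_0.
        levels = 1 + max(depth.values())
        return {root: [f"partition_{i}" for i in range(levels)]}

    # Different formats: one table per first-level directory.
    return {
        root + top + "/": [f"partition_{i}" for i in range(depth[top])]
        for top in formats
    }
```

With the `.txt` layout shown earlier, this sketch plans a single table rooted at `oss://bucket01/folder1/` with `partition_0` and `partition_1`. With the mixed `.csv` and `.json` layout, it plans one table each for `table1` and `table2`, each with a single `partition_0` column.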
If the partition directories are named as Hive-style key-value pairs, the crawler uses the key names to automatically populate the partition column names. Otherwise, the default names are used, such as partition_0 and partition_1.
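A minimal sketch of this naming rule is shown below. It assumes that "key-value pairs" refers to the usual Hive convention of naming directories `key=value`. The helper `partition_columns` is hypothetical and takes the directory names below a table root, from the outermost level to the innermost.

```python
def partition_columns(partition_dirs):
    """Return {column_name: value} for the directory levels below a table root."""
    columns = {}
    for index, directory in enumerate(partition_dirs):
        key, separator, value = directory.partition("=")
        if separator and key and value:
            # Hive-style "key=value" directory: the key becomes the column name.
            columns[key] = value
        else:
            # Plain directory name: fall back to the positional default name.
            columns[f"partition_{index}"] = directory
    return columns
```

Under these assumptions, the directories `year=2020/month=01` produce the columns `year` and `month`, while a plain directory such as `partition1` is exposed as `partition_0`.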
- Format parsers supported by metadata crawling
A format parser reads the content of a data file to determine the data format of the file. DLA provides the following built-in format parsers for different file types. If you do not specify a format parser, the metadata crawling task calls the built-in parsers in the sequence described in the following table.
| Parser | Description |
| --- | --- |
| JSON | Reads the beginning of the file to determine the file format. |
| Parquet | Reads the schema at the end of the file to determine the file format. |
| CSV | Checks the following delimiters: comma (,), vertical bar (\|), tab (\t), semicolon (;), space ( ), and \u0001. |
| ORC | Reads the metadata in the file to determine the file format. |
| AVRO | Reads the schema at the beginning of the file to determine the file format. |
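The idea of calling parsers in sequence until one recognizes the file can be sketched as follows. This is an illustrative approximation, not DLA's implementation: all function names are hypothetical, and the checks are deliberately shallow. The magic bytes for Parquet (`PAR1`), ORC (`ORC`), and Avro (`Obj\x01`) come from the public file format specifications. The JSON and CSV heuristics are assumptions made for this sketch.

```python
CSV_DELIMITERS = [",", "|", "\t", ";", " ", "\u0001"]  # delimiters listed in the table


def looks_like_json(data):
    # Assumption: a JSON file begins with an object or an array.
    return data.lstrip()[:1] in (b"{", b"[")


def looks_like_parquet(data):
    # Parquet files start and end with the 4-byte magic "PAR1".
    return len(data) >= 8 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"


def detect_csv_delimiter(data):
    """Return the first listed delimiter that splits the sampled lines evenly."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # treat undecodable content as binary, not CSV
    lines = [line for line in text.splitlines() if line][:10]
    if not lines:
        return None
    for delimiter in CSV_DELIMITERS:
        counts = {line.count(delimiter) for line in lines}
        if len(counts) == 1 and counts.pop() > 0:
            return delimiter
    return None


def looks_like_orc(data):
    return data[:3] == b"ORC"


def looks_like_avro(data):
    return data[:4] == b"Obj\x01"


# Same order as the table: the first parser that matches wins.
PARSERS = [
    ("JSON", looks_like_json),
    ("Parquet", looks_like_parquet),
    ("CSV", lambda data: detect_csv_delimiter(data) is not None),
    ("ORC", looks_like_orc),
    ("AVRO", looks_like_avro),
]


def detect_format(data):
    """Return the name of the first matching parser, or None if none matches."""
    for name, matches in PARSERS:
        if matches(data):
            return name
    return None
```

Because the CSV check runs before the ORC and Avro checks, this sketch relies on binary content failing UTF-8 decoding. A production parser would validate much more of the file than these few bytes.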