This topic describes how to use the metadata discovery feature of Data Lake Analytics (DLA) to query and analyze data stored on Object Storage Service (OSS). It helps you get a quick start with DLA.
- Activate DLA.
- Log on to the OSS console and upload a file to OSS. For more information, see Upload objects. For example, upload the supplier_with_header.csv file to the oss://alibaba-crawler/schema1/supply_ceshi/ directory in OSS.
- Log on to the DLA console. In the left-side navigation pane, choose .
- In the OSS data source section of the page that appears, click Go To The Wizard.
- On the OSS data source tab, configure the parameters in the Data source configuration, Scheduling configuration, and Target metadata configuration sections. The following table describes the parameters.
| Parameter | Description |
| --- | --- |
| Data warehouse mode and Free mode | You can select Data warehouse mode or Free mode. Data warehouse mode: allows you to create an automatic metadata discovery task if you built a standard data warehouse on top of OSS. In this mode, the precision of metadata discovery is high. OSS directories must be in the format of Database name/Table name/File name or Database name/Table name/Partition name/.../Partition name/File name. Free mode: allows you to create an automatic metadata discovery task if you want to analyze OSS data. This mode has no requirements for the OSS data structure. Because no structure is enforced, the discovered tables may differ from what you expect. |
| OSS directory location | The OSS directory in which the destination file is stored. The directory ends with a forward slash (/). DLA automatically specifies this parameter based on the directory that you selected. Note: DLA automatically obtains the OSS bucket that is in the same region as DLA. You can also select a bucket from the Bucket drop-down list. After you select a bucket, DLA automatically lists all the objects and files in this bucket. After you select the destination object and file, DLA automatically adds them to the OSS directory on the right. |
| Format parser | The default value of this parameter is Automatic parsing. This indicates that all built-in parsers are called in sequence. To parse a specific file format, you can set this parameter to json, parquet, avro, orc, or csv. |
| Scheduling frequency | The frequency at which metadata discovery tasks are scheduled. |
| Schema Name | The name of the schema, which is the name of the database in DLA. After a metadata discovery task is created, a schema is automatically created for this task. |
| Configuration options (optional) | Advanced custom options, such as Field delimiter, identification, Table header mode, and Allow single column. |
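Before you choose Data warehouse mode, it can help to check whether your object paths follow the required Database name/Table name/.../File name layout. The following Python sketch illustrates that check locally. It is not part of DLA: the function name `parse_warehouse_path`, the bucket `example-bucket`, and the sample paths are hypothetical, and the sketch assumes that the layout is interpreted relative to the OSS directory location that you select.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WarehousePath:
    database: str
    table: str
    partitions: List[str]  # zero or more partition directory names
    file_name: str


def parse_warehouse_path(root: str, object_path: str) -> Optional[WarehousePath]:
    """Split an object path into database/table/partitions/file.

    `root` plays the role of the OSS directory location and must end
    with "/". Returns None when the path does not match the
    Database/Table/[Partition/...]/File layout (assumed to be relative
    to `root`; this is an assumption, not documented DLA behavior).
    """
    if not root.endswith("/") or not object_path.startswith(root):
        return None
    parts = object_path[len(root):].split("/")
    # At least database, table, and file name are required,
    # and no path segment may be empty.
    if len(parts) < 3 or any(part == "" for part in parts):
        return None
    return WarehousePath(
        database=parts[0],
        table=parts[1],
        partitions=parts[2:-1],
        file_name=parts[-1],
    )


if __name__ == "__main__":
    # Hypothetical root directory and object path, for illustration only.
    root = "oss://example-bucket/warehouse/"
    print(parse_warehouse_path(root, root + "sales_db/orders/dt=2024-01-01/part-0.csv"))
```

A path that has fewer than three segments under the root, such as a file placed directly in a table-less directory, returns `None` in this sketch. Such data is a candidate for Free mode instead.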
- After you specify the parameters, click Create. After the metadata discovery task is created, you can view it on the Task List tab. The task runs manually or periodically, depending on the Scheduling frequency that you specified.
- After the metadata discovery task succeeds, find the task on the History List tab and click the database name link, such as alibaba, in the schema name/prefix column to go to the Execute page. You can then view the databases, tables, and columns that DLA automatically discovered.
- On the Execute page, edit SQL statements in the code editor and click Sync Execute(F8) or Async Execute(F9) to execute them. For example, execute the following statement under schema1_test:
  select * from `schema1_test`.`supply_ceshi` limit 20;
  In the lower part of the Execute page, click Result Set to view the metadata that DLA automatically discovered from the supplier_with_header.csv file in the oss://alibaba-crawler/schema1/supply_ceshi/ directory.
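To sanity-check the discovered columns, you can compare them with the header row of the file that you uploaded. The following Python sketch reads the first row of a CSV file as column names. It is a local approximation only: how DLA infers columns and types is not described in this topic, the helper `preview_columns` is hypothetical, and the sample rows are invented for illustration rather than taken from supplier_with_header.csv.

```python
import csv
import io
from typing import List


def preview_columns(csv_text: str, delimiter: str = ",") -> List[str]:
    """Return the header row of CSV text as a list of column names.

    Assumes that the first row is a header, which corresponds to a file
    uploaded with a header line. Returns an empty list for empty input.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader, [])
    return [name.strip() for name in header]


# Invented sample content; the real file's columns may differ.
SAMPLE = "s_suppkey,s_name,s_nationkey\n1,Supplier#1,17\n"

if __name__ == "__main__":
    print(preview_columns(SAMPLE))
```

If the column names returned by the query differ from the header row, review the Table header mode and Field delimiter settings under Configuration options before you rerun the task.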