×
Community Blog How to Use OSS LOCATION in Data Lake Analytics

How to Use OSS LOCATION in Data Lake Analytics

This tutorial shows you how to specify the LOCATION of a table in Data Lake Analytics by using a sample file stored in an Object Storage Service instance.

Alibaba Cloud Data Lake Analytics (DLA) enables you to directly query and analyze data stored in Object Storage Service (OSS) and Table Store instances by using standard SQL statements.

Before querying data, you must create a table in DLA based on the format and content of the data file. This tutorial takes a file stored in an OSS instance as an example to explain how to specify LOCATION of a table.

OSS LOCATION

In DLA, the syntax of a table creation statement is as follows:

CREATE EXTERNAL TABLE [IF NOT EXISTS] [db_name.]table_name
    [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
    [COMMENT table_comment]
    [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
    [ROW FORMAT row_format]
    [STORE AS file_format]
        | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
    LOCATION oss_path

LOCATION can be set to the data file itself or to the directory where the data file is located.

If LOCATION is Set to a Data File

When creating a table, you can directly associate the table with the data file. In this case, LOCATION must be set to the absolute path of the data file in the OSS instance.

For example:

CREATE EXTERNAL TABLE loc_file_csv(
    N_NATIONKEY INT,
    N_NAME STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://my-bucket/datasets/test/test_create/create_table/csv/dir1/dir2/nation.csv';

If LOCATION is Set to a Directory

If LOCATION is set to a directory in the OSS instance, all the files under this directory are the data files of the table.

When creating a table, you can set recursive.directories to determine whether to traverse all data files under this directory. If recursive.directories is set to the default value or is not explicitly specified, DLA resolves that recursive.directories is false, that is, the directory is not traversed.

For example, the directory structure in the OSS instance is as follows:

2018-07-05 11:16:11 1752.00B Standard oss://my-bucket/datasets/test/test_create/create_table/csv/dir1/dir2/nation.csv
2018-07-05 11:15:57 1752.00B Standard oss://my-bucket/datasets/test/test_create/create_table/csv/dir1/nation.csv
2018-07-05 11:16:17 1752.00B Standard oss://my-bucket/datasets/test/test_create/create_table/csv/nation.csv

The table creation statement is as follows:

CREATE EXTERNAL TABLE loc_file_csv(
    N_NATIONKEY INT,
    N_NAME STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
'oss://my-bucket/datasets/test/test_create/create_table/csv'
TBLPROPERTIES ('recursive.directories' = 'false');

If recursive.directories is set to false, the data file of this table is the nation.csv file under LOCATION:

oss://my-bucket/datasets/test/test_create/create_table/csv/nation.csv

The table creation statement is as follows:

CREATE EXTERNAL TABLE loc_file_csv(
    N_NATIONKEY INT,
    N_NAME STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
'oss://my-bucket/datasets/test/test_create/create_table/csv'
TBLPROPERTIES ('recursive.directories' = 'true');

If recursive.directories is set to true, the data files of this table are all the .csv files under LOCATION and all its subdirectories:

oss://my-bucket/datasets/test/test_create/create_table/csv/nation.csv
oss://my-bucket/datasets/test/test_create/create_table/csv/dir1/nation.csv
oss://my-bucket/datasets/test/test_create/create_table/csv/dir1/dir2/nation.csv

Additional Notes

  1. If LOCATION is a directory, DLA determines by default that formats of all data files under this directory are the same, and parses files based on SERDE specified in the table creation statement. If the formats of data files under this directory are different, DLA encounters a parsing failure. Consequently, data in the table is incorrect and the query result is affected.

    For example, if STORED AS TEXTFILE is specified in the table creation statement, but both .csv and .orc files are available under the directory, no error is reported during table creation and query. However, you may see data with garbled characters or incorrect data when running the SELECT statement.

  2. Currently, DLA does not support .csv files with headers. If the first line of a file is the header, you must manually process data. Otherwise, DLA identifies the header as a data record.

To learn more about Alibaba Cloud Data Lake Analytics (DLA), visit www.alibabacloud.com/products/data-lake-analytics

1 1 1
Share on

Alibaba Clouder

2,599 posts | 763 followers

You may also like

Comments

Raja_KT March 1, 2019 at 8:33 am

Good one. I miss out the recursive setting :)

Alibaba Clouder

2,599 posts | 763 followers

Related Products