
Quickly analyze data in OSS

Last Updated: Jan 18, 2019

Prerequisites

The following conditions must be met before you can use Data Lake Analytics (DLA).

  • You have registered and activated an Alibaba Cloud account, and activated the DLA service (that is, obtained DLA logon credentials).

  • You have uploaded a file that you want to analyze to OSS.

Note

  • You can only analyze OSS or Table Store files located in the same region as your DLA service. Currently, DLA does not support cross-region operations.

  • DLA provides service access through classic network and VPC endpoints.

    • Classic network endpoints can be accessed by applications that have network access.
    • VPC endpoints can be accessed by applications on ECS instances in the same VPC, which must be located in the same region as DLA.

Procedure

  1. If you have already downloaded and installed an OSS client, skip this step. Otherwise, go to the Alibaba Cloud website to download the OSS client, and then upload your data files to OSS. For details, see [OSS Downloading and Installation].

  2. Create a local data file test_csv.txt with the following content:

    1|1|China, Hangzhou|2018-01-01|2018-01-01 00:00:00|120.19|30.26
    2|2|China, Hong Kong|2018-01-02|2018-01-02 11:11:11|114.10|22.20

    Upload the file to the OSS bucket:

    ./ossutilmac64 cp test_csv.txt oss://test-bucket-julian/test_csv/test_csv.txt
  3. Connect to DLA and create a schema. Then, create an external table that references the OSS file.

    You can obtain the connection information at [DMS for Data Lake Analytics].

    • Connect to DLA.

      mysql -hservice.cn-hangzhou.openanalytics.aliyuncs.com -P10000 -uxxx -pxxx -c -A

      The -c option preserves comments in the statements sent to the server (needed for the SQL hint used later in this topic), and -A disables automatic rehashing so that the client starts faster.
    • Create a schema.

      Note:

      Currently, DLA schema names within a region must be unique. If you specify a schema name that already exists, an error is returned.

  CREATE SCHEMA my_test_schema WITH DBPROPERTIES (
      LOCATION = 'oss://xxx/xxx/',
      catalog = 'oss'
  );

Your OSS LOCATION directory must end with a forward slash (/). All OSS files referenced by tables created within this schema must be located under this directory or its subdirectories.
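If the schema name is already taken in your region (see the note above), you can check which schemas exist before creating one. The following is a minimal sketch that assumes DLA's MySQL-compatible endpoint supports the standard SHOW DATABASES statement; the USE statement is the same one used in the next step:

    SHOW DATABASES;      -- list existing schemas; names are unique per region
    USE my_test_schema;  -- switch to the newly created schema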

  • Create a table that references the OSS file.

    • The fields in the data file are separated by vertical bars (|), so specify ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'.

    • LOCATION refers to the OSS bucket directory where the data file is located.

      Note:

      LOCATION must point to the OSS directory that contains the data file, not directly to the data file itself. For example, oss://test-bucket-julian/test_csv is the directory that contains the data file. We recommend that you do not place data files directly under the bucket's root directory; otherwise, the DLA table will reference all files under that directory.

    mysql> use my_test_schema;
    Database changed

    CREATE EXTERNAL TABLE test_csv (
        id INT,
        name STRING,
        location STRING,
        create_date DATE,
        create_timestamp TIMESTAMP,
        longitude DOUBLE,
        latitude DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian/test_csv';
    Query OK, 0 rows affected (0.73 sec)
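    To double-check the table definition, you can inspect it from the same session. This sketch assumes the MySQL-compatible endpoint supports the standard DESCRIBE and SHOW CREATE TABLE statements:

      DESCRIBE test_csv;           -- column names and types
      SHOW CREATE TABLE test_csv;  -- full DDL, including ROW FORMAT and LOCATION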
  4. Connect to DLA and run an SQL query.

    mysql> select * from test_csv;
    +------+------+------------------+-------------+-----------------------+-----------+----------+
    | id   | name | location         | create_date | create_timestamp      | longitude | latitude |
    +------+------+------------------+-------------+-----------------------+-----------+----------+
    |    1 | 1    | China, Hangzhou  | 2018-01-01  | 2018-01-01 00:00:00.0 |    120.19 |    30.26 |
    |    2 | 2    | China, Hong Kong | 2018-01-02  | 2018-01-02 11:11:11.0 |     114.1 |     22.2 |
    +------+------+------------------+-------------+-----------------------+-----------+----------+
    2 rows in set (0.51 sec)

    Add more files to the OSS directory referenced by the external table, then run the query again to see the updated results.

    ./ossutilmac64 cp test_csv.txt oss://test-bucket-julian/test_csv/test_csv_copy.txt
    mysql> select * from test_csv;
    +------+------+------------------+-------------+-----------------------+-----------+----------+
    | id   | name | location         | create_date | create_timestamp      | longitude | latitude |
    +------+------+------------------+-------------+-----------------------+-----------+----------+
    |    1 | 1    | China, Hangzhou  | 2018-01-01  | 2018-01-01 00:00:00.0 |    120.19 |    30.26 |
    |    2 | 2    | China, Hong Kong | 2018-01-02  | 2018-01-02 11:11:11.0 |     114.1 |     22.2 |
    |    1 | 1    | China, Hangzhou  | 2018-01-01  | 2018-01-01 00:00:00.0 |    120.19 |    30.26 |
    |    2 | 2    | China, Hong Kong | 2018-01-02  | 2018-01-02 11:11:11.0 |     114.1 |     22.2 |
    +------+------+------------------+-------------+-----------------------+-----------+----------+
    4 rows in set (0.66 sec)
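    Because the external table scans every file under its LOCATION directory, the rows from the copied file appear as duplicates. As a small illustration of analyzing all of those files at once, the following is a minimal sketch using standard SQL aggregation over the table defined above:

      -- Count rows per location across all files under the LOCATION directory.
      SELECT location, COUNT(*) AS row_count
      FROM test_csv
      GROUP BY location;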
  5. Use a MySQL-compatible client tool to connect to DLA and execute an SQL query. For details, see SQL Workbench.

    (Screenshots: connecting to DLA and executing the query in SQL Workbench.)

If you add the following SQL comment before the query:

  /*+ force-persist-result=true */

then the query results, including the data file and the metadata description, will be automatically uploaded to your OSS bucket in the same region:

  oss://aliyun-oa-query-results-${uid}-${regionId}/DLA_Result/{yyyy}/{mm}/{dd}/${processId}/result.csv
  oss://aliyun-oa-query-results-${uid}-${regionId}/DLA_Result/{yyyy}/{mm}/{dd}/${processId}/result.csv.meta
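For example, the hint is prepended directly to the query text. Note that the -c option used when connecting keeps the mysql client from stripping the comment before it reaches the server:

  /*+ force-persist-result=true */
  SELECT * FROM test_csv;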

The metadata description is created as a JSON file in the following format:

  {
      "query_id": "20180419204602e573ff53000024",
      "columns": [
          { "id": { "display_size": 15, "type": "Integer" } },
          { "name": { "display_size": 15, "type": "String" } },
          { "location": { "display_size": 15, "type": "String" } },
          { "create_date": { "display_size": 15, "type": "Date" } },
          { "create_timestamp": { "display_size": 15, "type": "Timestamp" } },
          { "create_time": { "display_size": 15, "type": "String" } }
      ]
  }