All Products
Search
Document Center

CSV

Last Updated: Jul 25, 2019

This topic describes how to create tables for CSV data in DLA.

Prerequisites

Before creating tables, you must create an OSS schema first. In the example, all tables use the following OSS schema:

  1. CREATE SCHEMA dla_oss_db with DBPROPERTIES(
  2. catalog='oss',
  3. location 'oss://dlaossfile1/dla/'
  4. );

Create a UTF-8-encoded table

For example, the OSS instance has a UTF-8-encoded file oss://dlaossfile1/dla/city.csv whose columns are separated by commas (,). It stores the following data:

  1. Beijing,China,010
  2. ShangHai,China,021
  3. Tianjin,China,022

The following is a sample statement for creating a table based on this file:

  1. CREATE EXTERNAL TABLE city (
  2. city STRING,
  3. country STRING,
  4. code INT
  5. )
  6. ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  7. STORED AS TEXTFILE
  8. LOCATION 'oss://dlaossfile1/dla/';
  • STORED AS TEXTFILE indicates the type of the generated file.

  • ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' indicates that columns must be separated by commas (,).

Create a GBK-encoded table

When the data files are not UTF-8-encoded, you need to specify the SerDe value and encoding format in the statement, that is, SERDE and WITH SERDEPROPERTIES ('serialization.encoding'='gbk').

For example, the OSS instance has a GBK-encoded file oss://dlaossfile1/dla/gbk.csv that stores the following data:

  1. hangzhou,aliyuncs
  2. shanghai,aliyuncs

The following is a sample statement for creating a table based on this file:

  1. CREATE EXTERNAL TABLE gbk (
  2. region string,
  3. owner string
  4. )
  5. ROW FORMAT
  6. SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  7. WITH SERDEPROPERTIES('serialization.encoding'='gbk' 'field.delim'=',')
  8. STORED AS TEXTFILE
  9. LOCATION 'oss://dlaossfile1/dla/'
  10. TBLPROPERTIES (
  11. 'skip.header.line.count' = '0'
  12. )
  • LazySimpleSerDe: indicates that SerDe(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe) is inserted, which is used to process text files.

  • 'skip.header.line.count' = '0' indicates reading all data in the file, without filtering any data. For more information, see skip.header.line.count.

Process data in CSV files through OpenCSVSerDe

CSV files store table data in plain texts, such as numbers and texts, and can separate data by using separators besides commas (,). A CSV file can contain an unlimited number of records that are separated by line breaks.

When you create a table based on a CSV file, you must confirm the data in the CSV file. If the CSV file contains too many items in quotation marks, usually containing many special characters, you can use OpenCSVSerDe to deserialize the data. OpenCSVSerDe is used to specify the field separator, field content quote character, and escape character for fields in a line, for example, WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "“, “escapeChar” = “\” )`. Select appropriate SerDe values, which are used to serialize and deserialize data records.

Considerations

Pay attention to the following to prevent errors when using OpenCSVSerDe:

  • OpenCSVSerDe does not allow you to embed line breaks in fields.

  • When you use OpenCSVSerDe, the table you created only supports strings.

  • You can convert data of other types into strings by using SQL statements. For more information, see function sections in the SQL documentation.

For example, the OSS instance has a UTF-8-encoded file oss://dlaossfile1/dla/openCsv.csv that stores the following data:

  1. 1,1,"China,Hangzhou",2018-01-01,2018-01-01 00:00:00,120.19,30.26
  2. 2,2,"China,Hong Kong","2018-01-02","2018-01-02 11:11:11",114.10,22.20

The following is a sample statement for creating a table based on this file:

  1. CREATE EXTERNAL TABLE openCsv (
  2. id STRING,
  3. name STRING,
  4. location STRING,
  5. create_date STRING
  6. )
  7. ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  8. with serdeproperties(
  9. 'separatorChar'=',',
  10. 'quoteChar'='"',
  11. 'escapeChar'='\\'
  12. )
  13. STORED AS TEXTFILE LOCATION 'oss://dlaossfile1/dla';

The query result is as follows:

  1. select * from test_csv_opencsvserde;
  2. +------+------+---------------+-------------+---------------------+-----------+----------+
  3. | id | name | location | create_date | create_timestamp | longitude | latitude |
  4. +--------------+----------------+---------------------+----------+---------+-------+--------+
  5. | 1 | 1 | China,Hangzhou | 2018-01-01 | 2018-01-01 00:00:00 | 120.19 | 30.26 |
  6. | 2 | 2 | China,Hong Kong | 2018-01-02 | 2018-01-02 11:11:11 | 114.10 | 22.20

If the file does not contain quoted data, you do not need to specify the SerDe value when creating the table.

  1. CREATE EXTERNAL TABLE csv_table2 (
  2. id STRING,
  3. name STRING,
  4. location STRING,
  5. create_date STRING,
  6. create_timestamp STRING,
  7. longitude STRING,
  8. latitude STRING
  9. )
  10. ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  11. STORED AS TEXTFILE LOCATION 'oss://dlaossfile1/dla';

The query result is as follows:

  1. select * from csv_table2;
  2. +------+------+----------+-------------+------------------+-----------------------+----------+
  3. | id | name | location | create_date | create_timestamp | longitude | latitude |
  4. +--------------+----------------+---------------------+----------+---------+-------+--------+
  5. | 1 | 1 | "China | Hangzhou" | 2018-01-01 | 2018-01-01 00:00:00 | 120.19 |
  6. | 2 | 2 | "China | Hong Kong" | "2018-01-02" | "2018-01-02 11:11:11" | 114.10 |

Specify separators for a CSV file

DLA allows you to specify the column separator (FIELDS TERMINATED BY), escape character (ESCAPED BY), and line terminator (LINES TERMINATED BY) for CSV files.

  1. ROW FORMAT DELIMITED
  2. FIELDS TERMINATED BY '\t'
  3. ESCAPED BY '\\'
  4. LINES TERMINATED BY '\n'

Ignore the header of a CSV file

When a CSV file has a header but you want to ignore the header when reading the data, you can specify skip.header.line.count="x" in the statement for creating the table to filter out data in the first to the xth lines. The relationship between x and n (the actual number of lines in the file) is as follows:

  • When x< = 0, DLA reads all the data in the file, without filtering out any information.

  • When 0<x<n, DLA filters out data in the first x lines and starts to read data from the x+1th line.

  • When x ≥ n, DLA filters out all the file content.

For example, the OSS instance has a UTF-8-encoded file oss://dlaossfile1/dla/nation_header.csv that stores the following data:

  1. N_NATIONKEY|N_NAME|N_REGIONKEY|N_COMMENT
  2. 0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|
  3. 1|ARGENTINA|1|al foxes promise slyly according to the regular accounts. bold requests alon|
  4. 2|BRAZIL|1|y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special |
  5. 3|CANADA|1|eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold|
  6. 4|EGYPT|4|y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d|
  7. 5|ETHIOPIA|0|ven packages wake quickly. regu|

The following is a sample statement for creating a table based on the file:

  1. CREATE EXTERNAL TABLE nation_header (
  2. N_NATIONKEY INT,
  3. N_NAME STRING,
  4. N_REGIONKEY INT,
  5. N_COMMENT STRING
  6. )
  7. ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  8. STORED AS TEXTFILE
  9. LOCATION 'oss://dlaossfile1/dla/'
  10. TBLPROPERTIES ("skip.header.line.count"="1");

The query result is as follows:

  1. SELECT * FROM nation_header;
  2. +-------------+--------+-------------+-------------+-----------
  3. | n_nationkey | n_name | n_regionkey | n_comment |
  4. +-------------+--------+-------------+-------------+-----------
  5. | 0 | ALGERIA | 0 | haggle. carefully final deposits detect slyly agai
  6. VARCHAR |
  7. +-------------+-------- +-------------+-------------+-----------
  8. | 1 | ARGENTINA | 1 | al foxes promise slyly according to the regular accounts. bold requests alon |
  9. +-------------+-------- +-------------+-------------+-----------
  10. |2 | BRAZIL | 1 | y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special |
  11. +-------------+-------- +-------------+-------------+-----------
  12. |3 | CANADA | 1 | eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold|
  13. +-------------+-------- +-------------+-------------+-----------
  14. |4 | EGYPT | 4 | y above the carefully unusual theodolites. final|
  15. +-------------+-------- +-------------+-------------+-----------
  16. |5 | ETHIOPIA | 0 | ven packages wake quickly. regu|