
CSV

Last Updated: Mar 25, 2021

This topic describes how to create tables for CSV files of different encoding formats in Data Lake Analytics (DLA). This topic also describes how to use OpenCSVSerDe to process fields that are enclosed in double quotation marks in CSV files.

Before you create a table for CSV files, you must create an Object Storage Service (OSS) schema in DLA. In this topic, all tables for CSV files use the OSS schema that is created by executing the following statement:

CREATE SCHEMA dla_oss_db
WITH DBPROPERTIES (
  catalog = 'oss',
  location = 'oss://oss-bucket-name/dla/'
);

Create a table for a UTF-8 encoded file

A UTF-8 encoded file named city.csv is stored in the oss://oss-bucket-name/dla/ directory. Columns in the city.csv file are separated by commas (,). This file contains the following data:

Beijing,China,010
ShangHai,China,021
Tianjin,China,022

You can execute the following statement to create a table for this file:

CREATE EXTERNAL TABLE city (
  city STRING,
  country STRING,
  code INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'oss://oss-bucket-name/dla/';
  • STORED AS TEXTFILE indicates that the CSV file is of the TEXTFILE type.

  • ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' indicates that columns in the CSV file are separated by commas (,).
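As a quick local sanity check outside DLA, the same comma-splitting behavior can be reproduced with Python's csv module; the sample content below mirrors the city.csv data from this topic:

```python
import csv
import io

# Content mirroring the city.csv sample above.
data = "Beijing,China,010\nShangHai,China,021\nTianjin,China,022\n"

# csv.reader splits each record on commas, just as
# ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' does.
rows = list(csv.reader(io.StringIO(data)))
for city, country, code in rows:
    print(city, country, code)
```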

Create a table for a GBK encoded file

To create a table for a GBK encoded file, you must specify LazySimpleSerDe and WITH SERDEPROPERTIES ('serialization.encoding'='gbk') in the CREATE EXTERNAL TABLE statement.

A GBK encoded file named gbk.csv is stored in the oss://oss-bucket-name/dla/ directory. This file contains the following data:

hangzhou,aliyuncs
shanghai,aliyuncs

You can execute the following statement to create a table for this file:

CREATE EXTERNAL TABLE gbk (
  region STRING,
  owner STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.encoding' = 'gbk',
  'field.delim' = ','
)
STORED AS TEXTFILE
LOCATION 'oss://oss-bucket-name/dla/'
TBLPROPERTIES ('skip.header.line.count' = '0');
  • LazySimpleSerDe indicates a built-in SerDe (org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe), which is used to process files of the TEXTFILE type.

  • 'skip.header.line.count' = '0' indicates that no header rows are skipped and all data in the file is read.

Note:

  • You must set the serialization.encoding parameter based on the encoding of the source file. In Linux, you can run file filepath to check the encoding. Note that the file command may report a GBK-encoded file as ISO-8859. If query results contain garbled characters, set the serialization.encoding parameter to gbk.

  • OpenCSVSerDe does not support the serialization.encoding parameter. You cannot pass an encoding format to OpenCSVSerDe.
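To see why the encoding setting matters, here is a small Python sketch (the sample string is illustrative) of how GBK bytes turn into garbled text when decoded with the wrong codec:

```python
# A GBK-encoded Chinese string; ASCII content such as "hangzhou" looks the
# same in every encoding, so use characters where GBK and UTF-8 differ.
text = "杭州"
raw = text.encode("gbk")  # bytes as they would be stored in a GBK file

# Decoding with the wrong codec garbles the text; the right codec restores it.
wrong = raw.decode("utf-8", errors="replace")
right = raw.decode("gbk")
print(wrong)  # mojibake / replacement characters
print(right)  # the original text
```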

Use OpenCSVSerDe to process fields enclosed in double quotation marks

CSV files store data in plain text. A CSV file contains a number of records: records are separated by line feeds, and fields within a record are separated by commas (,).

When you create a table for a CSV file, you must check the data in the file. If the data contains information such as special characters enclosed in double quotation marks, you can use OpenCSVSerDe to deserialize data. OpenCSVSerDe allows you to specify the delimiter, quote, and escape character for the fields in each row. For example, you can use with serdeproperties('separatorChar'=',','quoteChar'='"','escapeChar'='\\') to specify the delimiter, quote, and escape character.
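The three SERDEPROPERTIES map directly onto the parameters of Python's csv.reader, which gives a quick local way to preview how a row will be split (the sample line mirrors the openCsv.csv data used later in this topic):

```python
import csv
import io

line = '1,1,"China,Hangzhou",2018-01-01\n'

# delimiter, quotechar, and escapechar correspond to
# separatorChar, quoteChar, and escapeChar in SERDEPROPERTIES.
reader = csv.reader(io.StringIO(line), delimiter=",", quotechar='"', escapechar="\\")
fields = next(reader)
print(fields)  # the comma inside the quoted field does not split it
```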

Usage notes

Take note of the following items when you use OpenCSVSerDe. Otherwise, an error is returned.

  • OpenCSVSerDe does not allow row delimiters to be embedded in fields.

  • When you create a table for a CSV file in DLA by using OpenCSVSerDe, all columns must be defined as the STRING type. If you need a different data type, use a type conversion function, such as CAST, in your queries.
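Because every column comes back as a string, the conversion is routine downstream work; a minimal Python sketch of the idea, with field values taken from the sample data below:

```python
# OpenCSVSerDe returns every column as a string; convert types after reading.
row = ["1", "1", "China,Hangzhou", "2018-01-01", "2018-01-01 00:00:00", "120.19", "30.26"]

record = {
    "id": int(row[0]),
    "location": row[2],
    "longitude": float(row[5]),
    "latitude": float(row[6]),
}
print(record)
```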

A UTF-8 encoded file named openCsv.csv is stored in the oss://oss-bucket-name/dla/ directory. This file contains the following data:

1,1,"China,Hangzhou",2018-01-01,2018-01-01 00:00:00,120.19,30.26
2,2,"China,Hong Kong","2018-01-02","2018-01-02 11:11:11",114.10,22.20

You can execute the following statement to create a table for the file:

CREATE EXTERNAL TABLE openCsv (
  id STRING,
  name STRING,
  location STRING,
  create_date STRING,
  create_timestamp STRING,
  longitude STRING,
  latitude STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 'oss://oss-bucket-name/dla';

Query data from the table that you created. The following result is returned:

SELECT * FROM openCsv;
+------+------+-----------------+-------------+---------------------+-----------+----------+
| id   | name | location        | create_date | create_timestamp    | longitude | latitude |
+------+------+-----------------+-------------+---------------------+-----------+----------+
| 1    | 1    | China,Hangzhou  | 2018-01-01  | 2018-01-01 00:00:00 | 120.19    | 30.26    |
| 2    | 2    | China,Hong Kong | 2018-01-02  | 2018-01-02 11:11:11 | 114.10    | 22.20    |
+------+------+-----------------+-------------+---------------------+-----------+----------+

If the fields in the file are not enclosed in double quotation marks, you do not need to specify a SerDe when you create the table.

CREATE EXTERNAL TABLE csv_table2 (
  id STRING,
  name STRING,
  location STRING,
  create_date STRING,
  create_timestamp STRING,
  longitude STRING,
  latitude STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'oss://oss-bucket-name/dla';

Query data from the table that you created. The following result is returned:

SELECT * FROM csv_table2;
+------+------+----------+-------------+------------------+-----------------------+----------+
| id   | name | location | create_date | create_timestamp | longitude             | latitude |
+------+------+----------+-------------+------------------+-----------------------+----------+
| 1    | 1    | "China   | Hangzhou"   | 2018-01-01       | 2018-01-01 00:00:00   | 120.19   |
| 2    | 2    | "China   | Hong Kong"  | "2018-01-02"     | "2018-01-02 11:11:11" | 114.10   |
+------+------+----------+-------------+------------------+-----------------------+----------+
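The shifted columns and retained quotation marks in this result follow from plain delimiter splitting. A short Python contrast between a naive split and a quote-aware parse (sample line from openCsv.csv) shows the difference:

```python
import csv
import io

line = '2,2,"China,Hong Kong","2018-01-02","2018-01-02 11:11:11",114.10,22.20\n'

# Plain split on commas (LazySimpleSerDe-style): quotes are kept and the
# quoted field breaks in two, shifting every later column.
naive = line.strip().split(",")

# Quote-aware parse (OpenCSVSerDe-style): quotes stripped, field intact.
aware = next(csv.reader(io.StringIO(line)))

print(naive)
print(aware)
```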

Use custom delimiters, escape characters, and line feeds for a CSV file

DLA allows you to specify FIELDS TERMINATED BY, ESCAPED BY, and LINES TERMINATED BY for a CSV file. FIELDS TERMINATED BY indicates the delimiter between columns in the file. ESCAPED BY indicates the escape character used in the file. LINES TERMINATED BY indicates the line feed used in the file.

ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
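These settings can be previewed locally with Python's csv module under the same assumptions (the sample data is illustrative): a tab delimiter, a backslash that escapes a literal tab inside a field, and newline-terminated rows:

```python
import csv
import io

# Two rows; in the first row, "b\<TAB>c" is one field containing an escaped tab.
data = "a\tb\\\tc\td\ne\tf\tg\n"

reader = csv.reader(io.StringIO(data), delimiter="\t", escapechar="\\")
rows = list(reader)
print(rows)  # the escaped tab stays inside the second field of row one
```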

Skip a header in a CSV file

If a CSV file contains a header and the header information does not need to be read, you can specify skip.header.line.count="x" in the CREATE EXTERNAL TABLE statement to skip the data from the first row to the xth row.

DLA uses different methods to read file data based on the relationship between the last row to be skipped (x) and the number of rows (n) in the file:

  • If the value of x is less than or equal to 0, DLA reads all data in the file.

  • If the value of x is greater than 0 and less than the value of n, DLA skips data from the first row to the xth row and reads data from the (x+1)th row.

  • If the value of x is greater than or equal to the value of n, DLA does not read file data.
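The three rules can be summarized in a short Python sketch (an illustration of the semantics, not DLA's implementation):

```python
def rows_read(lines, x):
    """Rows DLA reads from a file given skip.header.line.count = x."""
    n = len(lines)
    if x <= 0:
        return lines       # read all data
    if x >= n:
        return []          # read nothing
    return lines[x:]       # skip rows 1..x, read from row x+1

lines = ["header", "row1", "row2"]
print(rows_read(lines, 0))   # all rows
print(rows_read(lines, 1))   # header skipped
print(rows_read(lines, 9))   # nothing read
```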

A UTF-8 encoded file named nation_header.csv is stored in the oss://oss-bucket-name/dla/ directory. This file contains the following data:

N_NATIONKEY|N_NAME|N_REGIONKEY|N_COMMENT
0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|
1|ARGENTINA|1|al foxes promise slyly according to the regular accounts. bold requests alon|
2|BRAZIL|1|y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special |
3|CANADA|1|eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold|
4|EGYPT|4|y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d|
5|ETHIOPIA|0|ven packages wake quickly. regu|

You can execute the following statement to create a table for the file:

CREATE EXTERNAL TABLE nation_header (
  N_NATIONKEY INT,
  N_NAME STRING,
  N_REGIONKEY INT,
  N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://oss-bucket-name/dla/'
TBLPROPERTIES ("skip.header.line.count" = "1");

Query data from the table that you created. The following result is returned:

SELECT * FROM nation_header;
+-------------+-----------+-------------+-------------------------------------------------------------------------------------------------------------+
| n_nationkey | n_name    | n_regionkey | n_comment                                                                                                   |
+-------------+-----------+-------------+-------------------------------------------------------------------------------------------------------------+
| 0           | ALGERIA   | 0           |  haggle. carefully final deposits detect slyly agai                                                         |
| 1           | ARGENTINA | 1           | al foxes promise slyly according to the regular accounts. bold requests alon                                |
| 2           | BRAZIL    | 1           | y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special  |
| 3           | CANADA    | 1           | eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold       |
| 4           | EGYPT     | 4           | y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d         |
| 5           | ETHIOPIA  | 0           | ven packages wake quickly. regu                                                                             |
+-------------+-----------+-------------+-------------------------------------------------------------------------------------------------------------+