This topic describes how to create, read from, and write to Object Storage Service (OSS) foreign tables in CSV and TSV formats.
Scope
OSS external tables do not support the cluster property.
The size of a single file cannot exceed 2 GB. You must split files that are larger than 2 GB.
MaxCompute and OSS must be in the same region.
Supported data types
For more information about MaxCompute data types, see Data types V1.0 and Data types V2.0.
For more information about Smart Parse, see Smart Parse for more flexible type compatibility.
| Data type | com.aliyun.odps.CsvStorageHandler/TsvStorageHandler (built-in text data parser) | org.apache.hadoop.hive.serde2.OpenCSVSerde (built-in open source data parser) |
|---|---|---|
TINYINT | ||
SMALLINT | ||
INT | ||
BIGINT | ||
BINARY | ||
FLOAT | ||
DOUBLE | ||
DECIMAL(precision,scale) | ||
VARCHAR(n) | ||
CHAR(n) | ||
STRING | ||
DATE | ||
DATETIME | ||
TIMESTAMP | ||
TIMESTAMP_NTZ | ||
BOOLEAN | ||
ARRAY | ||
MAP | ||
STRUCT | ||
JSON | | |
Supported compression formats
When you read from or write to compressed OSS files, you can add the required properties in the WITH serdeproperties clause of the CREATE TABLE statement. For more information, see with serdeproperties parameters.
| Compression format | com.aliyun.odps.CsvStorageHandler/TsvStorageHandler (built-in text data parser) | org.apache.hadoop.hive.serde2.OpenCSVSerde (built-in open source data parser) |
|---|---|---|
GZIP | ||
SNAPPY | ||
LZO | | |
Schema evolution support
| Operation type | Supported | Description |
|---|---|---|
| Add columns | | |
| Delete columns | | This operation is not recommended because it may cause a mismatch between the schema and the data. |
| Change column order | | This operation is not recommended because it may cause a mismatch between the schema and the data. |
| Change column data types | | For the data type conversion table, see Change column data types. |
| Change column names | | |
| Change column comments | | The comment must be a valid string of no more than 1024 bytes. Otherwise, an error is reported. |
| Change the NOT NULL property of a column | No | This operation is not supported. The default value is Nullable. |
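The schema evolution operations above map to ordinary DDL statements. The following sketch uses a hypothetical foreign table name (`mc_oss_demo`) and column names for illustration only; verify the exact ALTER TABLE syntax against the MaxCompute DDL reference before use.

```sql
-- Add a column (supported).
ALTER TABLE mc_oss_demo ADD COLUMNS (new_col STRING);

-- Change a column comment; the comment must be a valid string of no more than 1024 bytes.
ALTER TABLE mc_oss_demo CHANGE COLUMN new_col COMMENT 'illustrative comment';

-- Rename a column.
ALTER TABLE mc_oss_demo CHANGE COLUMN new_col RENAME TO renamed_col;
```

Because CSV and TSV files are matched to the schema by position, prefer appending columns at the end of the schema over deleting or reordering columns.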
Parameter settings
The schema of a CSV or TSV foreign table is mapped to the file columns by position. If the number of columns in an OSS file does not match the number of columns in the foreign table schema, you can use the `odps.sql.text.schema.mismatch.mode` parameter to specify how the mismatched rows are handled. This also determines how existing data is read after you modify the table schema:

- `odps.sql.text.schema.mismatch.mode=truncate`: Data that conforms to the current schema is read as-is. Mismatched rows are read based on the current schema: missing columns are filled with NULL, and extra columns are truncated. For example, if you add a column to a table, the value of the new column in existing rows is NULL when you read the table.
- `odps.sql.text.schema.mismatch.mode=ignore`: Data that conforms to the current schema is read as-is. Mismatched rows are discarded. For example, if you add a column to a table, entire rows of existing data are discarded when you read the table.
- `odps.sql.text.schema.mismatch.mode=error`: Data that conforms to the current schema is read as-is. Reading a mismatched row reports an error. For example, if you add a column to a table, an error occurs when you read historical data because that data lacks the new column.
Create a foreign table
Syntax
Built-in text data parser
CSV format
```sql
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED BY 'com.aliyun.odps.CsvStorageHandler'
[WITH serdeproperties (
  ['<property_name>'='<property_value>',...]
)]
LOCATION '<oss_location>'
[tblproperties ('<tbproperty_name>'='<tbproperty_value>',...)];
```

TSV format
```sql
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED BY 'com.aliyun.odps.TsvStorageHandler'
[WITH serdeproperties (
  ['<property_name>'='<property_value>',...]
)]
LOCATION '<oss_location>'
[tblproperties ('<tbproperty_name>'='<tbproperty_value>',...)];
```

Built-in open source data parser
```sql
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
[WITH serdeproperties (
  ['<property_name>'='<property_value>',...]
)]
STORED AS TEXTFILE
LOCATION '<oss_location>'
[tblproperties ('<tbproperty_name>'='<tbproperty_value>',...)];
```

Common parameters
For more information about common parameters, see Basic syntax parameters.
Unique parameters
with serdeproperties parameters
| Applicable parser | property_name | Scenario | Description | property_value | Default value |
|---|---|---|---|---|---|
| Built-in text data parser (CsvStorageHandler/TsvStorageHandler) | odps.text.option.gzip.input.enabled | Add this property when you need to read data from GZIP-compressed CSV or TSV files. | CSV and TSV compression property. Set this parameter to True to allow MaxCompute to read compressed files. Otherwise, the read operation fails. | True/False | False |
| | odps.text.option.gzip.output.enabled | Add this property when you need to write data to OSS in GZIP compression format. | CSV and TSV compression property. Set this parameter to True to allow MaxCompute to write data to OSS in GZIP compression format. Otherwise, the data is not compressed. | True/False | False |
| | odps.text.option.header.lines.count | Add this property when the OSS data file is in CSV or TSV format and you need to ignore the first N rows of the file. | When MaxCompute reads the OSS data file, it ignores the specified number of rows. | A non-negative integer | 0 |
| | odps.text.option.null.indicator | Add this property when the OSS data file is in CSV or TSV format and you need to define a parsing rule for NULL values in the file. | The string specified by this parameter is parsed as NULL in SQL. | String | Empty string |
| | odps.text.option.ignore.empty.lines | Add this property when the OSS data file is in CSV or TSV format and you need to define a rule for handling empty lines in the file. | If this parameter is set to True, MaxCompute ignores empty lines in the data file. Otherwise, empty lines are read. | True/False | True |
| | odps.text.option.encoding | Add this property when the OSS data file is in CSV or TSV format and its encoding format is not the default. | Make sure that the encoding format specified here is the same as the encoding format of the OSS data file. Otherwise, MaxCompute cannot read the data. | | UTF-8 |
| | odps.text.option.delimiter | Add this property when you need to specify the column delimiter for a CSV or TSV data file. | Make sure that the column delimiter specified here can correctly separate each column of the OSS data file. Otherwise, the data read by MaxCompute may be misaligned. | Single character | Comma (,) |
| | odps.text.option.use.quote | Add this property when a field in the CSV or TSV data file contains a line break (CRLF), a double quotation mark, or a comma. | When a CSV field contains a line break, a double quotation mark (which must itself be escaped by doubling it), or a comma, the whole field must be enclosed in double quotation marks. Set this parameter to True to parse such fields correctly. | True/False | False |
| | odps.sql.text.option.flush.header | Add this property when you write data to OSS and the first row of each file block must be the table header. | This parameter takes effect only for files in the CSV format. | True/False | False |
| | odps.sql.text.schema.mismatch.mode | Add this property when the number of data columns in the OSS file does not match the number of schema columns in the foreign table. | Specifies how to handle rows with a mismatched number of columns. Note: This feature does not take effect when odps.text.option.use.quote is set to True. | truncate/ignore/error | error |
| Built-in open source data parser (OpenCSVSerde) | separatorChar | Add this property when you need to specify the column delimiter for CSV data saved in TEXTFILE format. | Specifies the column delimiter for CSV data. | Single string | Comma (,) |
| | quoteChar | Add this property when a field in the CSV data saved in TEXTFILE format contains a line break, a double quotation mark, or a comma. | Specifies the quote character for CSV data. | Single string | None |
| | escapeChar | Add this property when you need to specify the escape rule for CSV data saved in TEXTFILE format. | Specifies the escape character for CSV data. | Single string | None |
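To show how several of the built-in text parser properties combine, here is a sketch of a CREATE statement. The table name, columns, and chosen property values are illustrative assumptions, and `<oss_location>` and the RAM role ARN are placeholders you must replace.

```sql
-- Illustrative foreign table over semicolon-delimited CSV files with a header row.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_csv_props
(
  id   BIGINT,
  name STRING
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH serdeproperties (
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
  'odps.text.option.delimiter'=';',            -- use a semicolon instead of the default comma
  'odps.text.option.header.lines.count'='1',   -- skip the header row when reading
  'odps.text.option.null.indicator'='NULL',    -- parse the literal string NULL as SQL NULL
  'odps.text.option.ignore.empty.lines'='true' -- skip empty lines
)
LOCATION '<oss_location>';
```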
tblproperties parameters
| Applicable parser | property_name | Scenario | Description | property_value | Default value |
|---|---|---|---|---|---|
| Built-in open source data parser (OpenCSVSerde) | skip.header.line.count | Add this property when you need to ignore the first N rows of a CSV file saved in TEXTFILE format. | When MaxCompute reads data from OSS, it ignores the specified number of rows starting from the first row. | A non-negative integer | None |
| | skip.footer.line.count | Add this property when you need to ignore the last N rows of a CSV file saved in TEXTFILE format. | When MaxCompute reads data from OSS, it ignores the specified number of rows starting from the last row. | A non-negative integer | None |
| | mcfed.mapreduce.output.fileoutputformat.compress | Add this property when you need to write a TEXTFILE data file to OSS in a compressed format. | TEXTFILE compression property. Set this parameter to True to allow MaxCompute to write a TEXTFILE data file to OSS in a compressed format. Otherwise, the data is not compressed. | True/False | False |
| | mcfed.mapreduce.output.fileoutputformat.compress.codec | Add this property when you need to write a TEXTFILE data file to OSS in a compressed format. | TEXTFILE compression property. Sets the compression format for the TEXTFILE data file. | | None |
| | io.compression.codecs | Add this property when the OSS data file is in Raw-Snappy format. | Set this parameter to the codec class so that MaxCompute can read the compressed data. Otherwise, MaxCompute cannot read the data. | com.aliyun.odps.io.compress.SnappyRawCodec | None |
| | odps.text.option.bad.row.skipping | Add this property when the OSS data file is in CSV format and you need to skip dirty data in the file. | Specifies whether MaxCompute skips dirty data when it reads an OSS data file. For more information, see BadRowSkipping. | rigid/flexible | None |
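As an illustration of the tblproperties above for the open source parser, the following sketch creates a hypothetical OpenCSVSerde table that skips a header and footer row; the table name, columns, and `<oss_location>` placeholder are illustrative assumptions.

```sql
-- Illustrative OpenCSVSerde foreign table; OpenCSVSerde reads quoted, comma-separated fields.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_open_csv
(
  id   STRING,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH serdeproperties (
  'separatorChar'=',',
  'quoteChar'='"'
)
STORED AS TEXTFILE
LOCATION '<oss_location>'
tblproperties (
  'skip.header.line.count'='1',  -- ignore the first row of each file
  'skip.footer.line.count'='1'   -- ignore the last row of each file
);
```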
Write data
For more information about the MaxCompute write syntax, see Write syntax.
Query and analysis
For more information about the SELECT syntax, see Query syntax.
For more information about query plan optimization, see Query optimization.
For more information about BadRowSkipping, see BadRowSkipping.
BadRowSkipping
If dirty data exists in CSV data, you can use the BadRowSkipping feature to set parameters that specify whether to skip the data that causes errors. Enabling or disabling this feature does not affect the parsing of the underlying data format.
Parameter settings
Table-level parameter
- `odps.text.option.bad.row.skipping`:
  - `rigid`: The skip logic is always executed and cannot be overwritten by session-level or project-level settings.
  - `flexible`: Enables flexible skipping at the data plane. This setting can be overwritten by session-level or project-level settings.

Session-level or project-level parameters

- `odps.sql.unstructured.text.bad.row.skipping`: This parameter can overwrite the flexible table-level parameter but not the rigid table-level parameter.
  - `on`: Actively enabled. If the table-level parameter is not configured, skipping is enabled.
  - `off`: Actively disabled. If the table-level parameter is set to flexible, this feature is disabled. In other cases, the table-level parameter takes precedence.
  - `<null>` or an invalid value: The behavior is determined only by the table-level configuration.
- `odps.sql.unstructured.text.bad.row.skipping.debug.num`: Specifies the number of error rows that can be printed to standard output (stdout) in Logview.
  - The maximum value is 1000.
  - If the value is less than or equal to 0, or if the input is invalid, printing is disabled.
Interaction between session-level and table-level parameters
| tbl property | session flag | result |
|---|---|---|
| rigid | on | On, forcibly enabled |
| rigid | off | On, forcibly enabled |
| rigid | `<null>`, an invalid value, or not configured | On, forcibly enabled |
| flexible | on | On |
| flexible | off | Off, disabled by the session |
| flexible | `<null>`, an invalid value, or not configured | On |
| Not configured | on | On, enabled by the session |
| Not configured | off | Off |
| Not configured | `<null>`, an invalid value, or not configured | Off |

Usage examples
Usage examples
Data preparation
Upload the test data file csv_bad_row_skipping.csv, which contains dirty data, to the `oss-mc-test/badrow/` directory in OSS.

Create a CSV foreign table
The following three scenarios describe different combinations of table-level parameters and session-level flags.
- Table parameter: `odps.text.option.bad.row.skipping = flexible/rigid/<not specified>`
- Session flag: `odps.sql.unstructured.text.bad.row.skipping = on/off/<not specified>`
No table-level parameter is configured
```sql
-- No table-level parameter is configured. Errors are reported as they occur.
-- The behavior is controlled by the session-level flag.
CREATE EXTERNAL TABLE test_csv_bad_data_skipping_flag
(
  a INT,
  b INT
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH serdeproperties (
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
)
LOCATION '<oss://databucketpath>';
```

The table-level parameter is set to skip error rows, but this can be disabled at the session level
```sql
-- The table-level parameter is set to skip error rows, but this can be actively disabled by the session-level flag.
CREATE EXTERNAL TABLE test_csv_bad_data_skipping_flexible
(
  a INT,
  b INT
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH serdeproperties (
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
)
LOCATION '<oss://databucketpath>'
tblproperties (
  'odps.text.option.bad.row.skipping' = 'flexible' -- Flexibly enabled. Can be disabled at the session level.
);
```

The table-level parameter is set to skip error rows, and this cannot be disabled at the session level
```sql
-- The table-level parameter is set to forcibly skip error rows. This cannot be disabled at the session level.
CREATE EXTERNAL TABLE test_csv_bad_data_skipping_rigid
(
  a INT,
  b INT
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH serdeproperties (
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
)
LOCATION '<oss://databucketpath>'
tblproperties (
  'odps.text.option.bad.row.skipping' = 'rigid' -- Forcibly enabled.
);
```

Verify the query result
No table-level parameter is configured
```sql
-- Enable at the session level.
SET odps.sql.unstructured.text.bad.row.skipping=on;
-- Disable at the session level. If the table-level parameter is set to 'flexible', this feature is disabled. If the table-level parameter is set to 'rigid', this setting has no effect.
SET odps.sql.unstructured.text.bad.row.skipping=off;
-- Print problematic row data at the session level. A maximum of 1,000 rows can be printed. If the value is less than or equal to 0, printing is disabled.
SET odps.sql.unstructured.text.bad.row.skipping.debug.num=10;
SELECT * FROM test_csv_bad_data_skipping_flag;
```

The following error is returned: `FAILED: ODPS-0123131:User defined function exception`
The table-level parameter is set to skip error rows, but this can be disabled at the session level
```sql
-- Enable at the session level.
SET odps.sql.unstructured.text.bad.row.skipping=on;
-- Disable at the session level. If the table-level parameter is set to 'flexible', this feature is disabled. If the table-level parameter is set to 'rigid', this setting has no effect.
SET odps.sql.unstructured.text.bad.row.skipping=off;
-- Print problematic row data at the session level. A maximum of 1,000 rows can be printed. If the value is less than or equal to 0, printing is disabled.
SET odps.sql.unstructured.text.bad.row.skipping.debug.num=10;
SELECT * FROM test_csv_bad_data_skipping_flexible;
```

The following error is returned: `FAILED: ODPS-0123131:User defined function exception`
The table-level parameter is set to skip error rows, and this cannot be disabled at the session level
```sql
-- Enable at the session level.
SET odps.sql.unstructured.text.bad.row.skipping=on;
-- Disable at the session level. If the table-level parameter is set to 'flexible', this feature is disabled. If the table-level parameter is set to 'rigid', this setting has no effect.
SET odps.sql.unstructured.text.bad.row.skipping=off;
-- Print problematic row data at the session level. A maximum of 1,000 rows can be printed. If the value is less than or equal to 0, printing is disabled.
SET odps.sql.unstructured.text.bad.row.skipping.debug.num=10;
SELECT * FROM test_csv_bad_data_skipping_rigid;
```

The following result is returned:

```
+------------+------------+
| a          | b          |
+------------+------------+
| 1          | 26         |
| 5          | 37         |
+------------+------------+
```
Smart Parse for more flexible type compatibility
For OSS foreign tables in CSV format, MaxCompute SQL uses data types V2.0 for read and write operations. MaxCompute is now compatible with various types of values in CSV files. The following table provides the details:
| Type | Input as String | Output as String | Description |
|---|---|---|---|
| BOOLEAN | | | The input is trimmed before parsing. If the value is not in the enumeration, the parsing fails. |
| TINYINT | | | 8-bit integer. If the value is outside the range of [-128, 127], an error is reported. |
| SMALLINT | | | 16-bit integer. If the value is outside the range of [-32768, 32767], an error is reported. |
| INT | | | 32-bit integer. If the value is outside the range of [-2147483648, 2147483647], an error is reported. |
| BIGINT | | | 64-bit integer. If the value is outside the range of [-9223372036854775808, 9223372036854775807], an error is reported. |
| FLOAT | | | Special values (case-insensitive) include NaN, Inf, -Inf, Infinity, and -Infinity. If the value is outside the value range, an error is reported. If the precision is exceeded, the value is rounded and truncated. |
| DOUBLE | | | Special values (case-insensitive) include NaN, Inf, -Inf, Infinity, and -Infinity. If the value is outside the value range, an error is reported. If the precision is exceeded, the value is rounded and truncated. |
| DECIMAL(precision, scale) Example: DECIMAL(15,2) | | | If the number of digits in the integer part exceeds (precision - scale), an error is reported. If the fractional part exceeds the scale, the value is rounded and truncated. |
| CHAR(n) Example: CHAR(7) | | | The maximum length is 255. If the string is shorter than n, it is padded with spaces, and the padding spaces are not used in comparisons. If the length exceeds n, the string is truncated. |
| VARCHAR(n) Example: VARCHAR(7) | | | The maximum length is 65535. If the length exceeds n, the string is truncated. |
| STRING | | | The length is limited to 8 MB. |
| DATE | | | You can also customize the parsing method for reads by setting odps.text.option.date.io.format. |
| TIMESTAMP_NTZ | | | OpenCSVSerde does not support this data type because it is not compatible with the Hive data format. |
| DATETIME | | | Parsed and output in the system time zone (for example, Asia/Shanghai). |
| TIMESTAMP | | | Parsed and output in the system time zone (for example, Asia/Shanghai). |
General rules
For any data type, an empty string in a CSV data file is parsed as NULL when read into a table.
Unsupported data types
- Complex types (STRUCT, ARRAY, and MAP): Not supported, because the values of these complex types can easily conflict with common CSV separators, such as the comma (,), which can cause parsing failures.
- BINARY and INTERVAL types: Not supported. To use these types, you can contact MaxCompute technical support.
Numeric types (such as INT and DOUBLE)
For INT, SMALLINT, TINYINT, BIGINT, FLOAT, DOUBLE, and DECIMAL data types, the default parsing capability is flexible.
If you want to parse only basic number strings, you can set the `odps.text.option.smart.parse.level` parameter to `naive` in tblproperties. This configures the parser to support only common number strings, such as "123" and "123.456". An error is reported when parsing strings in other formats.
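As a minimal sketch of the naive parsing level, the following statement creates a hypothetical table whose numeric columns accept only basic number strings; the table name, columns, and `<oss_location>` placeholder are illustrative.

```sql
-- Illustrative table: numeric parsing restricted to plain number strings such as "123" and "123.456".
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_csv_naive
(
  id     INT,
  amount DOUBLE
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
LOCATION '<oss_location>'
tblproperties ('odps.text.option.smart.parse.level'='naive');
```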
Date and time types (such as DATE and TIMESTAMP)
The four date and time-related types, DATE, DATETIME, TIMESTAMP, and TIMESTAMP_NTZ, are all processed using `java.time.format.DateTimeFormatter` in the underlying code.

Default format: MaxCompute has several built-in default parsing formats.
Custom format:
You can set `odps.text.option.<date|datetime|timestamp|timestamp_ntz>.io.format` in tblproperties to customize multiple parsing formats and one output format.

You can separate multiple parsing formats with a number sign (#).

Custom formats have a higher priority than MaxCompute's built-in parsing formats. The first custom format is used as the output format.

Example: If you customize the format string for the DATE type as `pattern1#pattern2#pattern3`, strings that conform to pattern1, pattern2, or pattern3 can be parsed. However, when writing to a file, the pattern1 format is used. For more information, see DateTimeFormatter.
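A sketch of a custom DATE format, assuming a hypothetical table and the placeholder `<oss_location>`; the two patterns are illustrative DateTimeFormatter patterns.

```sql
-- Illustrative table: reads accept either yyyy/MM/dd or yyyy-MM-dd;
-- writes use the first pattern (yyyy/MM/dd) as the output format.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_csv_dates
(
  id INT,
  d  DATE
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
LOCATION '<oss_location>'
tblproperties (
  'odps.text.option.date.io.format'='yyyy/MM/dd#yyyy-MM-dd'
);
```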
Important: Notes on the 'z' time zone format
We recommend that you do not use 'z' (time-zone name) as a custom time zone format, especially for users in China, because 'z' can be ambiguous in some contexts.
We recommend that you use 'x' (zone-offset) or 'VV' (time-zone ID) as the time zone pattern instead.
Example: 'CST' usually stands for China Standard Time (UTC+8) in China. However, when 'CST' is parsed by `java.time.format.DateTimeFormatter`, it is identified as US Central Standard Time (UTC-6). This may lead to unexpected input or output results.
Usage examples
Prerequisites
An OSS bucket and folder are available. For more information, see Create a bucket and Manage folders.
MaxCompute supports automatic folder creation in OSS. If an SQL statement involves external tables and user-defined functions (UDFs), you can use a single statement to read from and write to the tables and use the UDFs. You can also create folders manually.
MaxCompute is deployed only in specific regions. To avoid potential issues with cross-region data connections, ensure that your OSS bucket is in the same region as your MaxCompute project.
Authorization
You must have permissions to access OSS. You can use an Alibaba Cloud account, a Resource Access Management (RAM) user, or a RAM role to access OSS external tables. For more information about authorization, see Authorize access in STS mode for OSS.
You must have the CreateTable permission in the MaxCompute project. For more information about table permissions, see MaxCompute permissions.
Create an OSS foreign table using the built-in text data parser
Example 1: Non-partitioned table
Map the table to the `Demo1/` folder in Example data. The following sample code shows how to create an OSS foreign table.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_csv_external1
(
  vehicleId INT,
  recordId INT,
  patientId INT,
  calls INT,
  locationLatitute DOUBLE,
  locationLongtitue DOUBLE,
  recordTime STRING,
  direction STRING
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH serdeproperties (
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
)
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo1/';
-- You can run the DESC EXTENDED mc_oss_csv_external1; command to view the schema of the created OSS foreign table.
```

The role used in this example is `aliyunodpsdefaultrole`. If you use another role, replace `aliyunodpsdefaultrole` with the name of your role and grant the role permissions to access OSS.

Query the non-partitioned foreign table.
```sql
SELECT * FROM mc_oss_csv_external1;
```

The following result is returned:

```
+-----------+----------+-----------+-------+------------------+-------------------+----------------+-----------+
| vehicleid | recordid | patientid | calls | locationlatitute | locationlongtitue | recordtime     | direction |
+-----------+----------+-----------+-------+------------------+-------------------+----------------+-----------+
| 1         | 1        | 51        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:00 | S         |
| 1         | 2        | 13        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:00 | NE        |
| 1         | 3        | 48        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:00 | NE        |
| 1         | 4        | 30        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:00 | W         |
| 1         | 5        | 47        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:00 | S         |
| 1         | 6        | 9         | 1     | 46.81006         | -92.08174         | 9/15/2014 0:00 | S         |
| 1         | 7        | 53        | 1     | 46.81006         | -92.08174         | 9/15/2014 0:00 | N         |
| 1         | 8        | 63        | 1     | 46.81006         | -92.08174         | 9/15/2014 0:00 | SW        |
| 1         | 9        | 4         | 1     | 46.81006         | -92.08174         | 9/15/2014 0:00 | NE        |
| 1         | 10       | 31        | 1     | 46.81006         | -92.08174         | 9/15/2014 0:00 | N         |
+-----------+----------+-----------+-------+------------------+-------------------+----------------+-----------+
```

Write data to the non-partitioned foreign table and check whether the data is written.
```sql
INSERT INTO mc_oss_csv_external1 VALUES (1,12,76,1,46.81006,-92.08174,'9/14/2014 0:10','SW');
SELECT * FROM mc_oss_csv_external1 WHERE recordId=12;
```

The following result is returned:

```
+-----------+----------+-----------+-------+------------------+-------------------+----------------+-----------+
| vehicleid | recordid | patientid | calls | locationlatitute | locationlongtitue | recordtime     | direction |
+-----------+----------+-----------+-------+------------------+-------------------+----------------+-----------+
| 1         | 12       | 76        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:10 | SW        |
+-----------+----------+-----------+-------+------------------+-------------------+----------------+-----------+
```

Check the OSS path `Demo1/`. A new file is generated.
Example 2: Partitioned table
Map the table to the `Demo2/` folder in Example data. The following sample code shows how to create an OSS foreign table and import partition data.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_csv_external2
(
  vehicleId INT,
  recordId INT,
  patientId INT,
  calls INT,
  locationLatitute DOUBLE,
  locationLongtitue DOUBLE,
  recordTime STRING
)
PARTITIONED BY (
  direction STRING
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH serdeproperties (
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
)
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo2/';
-- You can run the DESC EXTENDED mc_oss_csv_external2; command to view the schema of the created foreign table.
```

The role used in this example is `aliyunodpsdefaultrole`. If you use another role, replace `aliyunodpsdefaultrole` with the name of your role and grant the role permissions to access OSS.

Import partition data. When you create a partitioned OSS foreign table, you must perform an additional operation to import partition data. For more information, see Syntax for adding partition data to an OSS foreign table.
```sql
MSCK REPAIR TABLE mc_oss_csv_external2 ADD PARTITIONS;
-- This is equivalent to the following statement.
ALTER TABLE mc_oss_csv_external2
  ADD PARTITION (direction = 'N')
      PARTITION (direction = 'NE')
      PARTITION (direction = 'S')
      PARTITION (direction = 'SW')
      PARTITION (direction = 'W');
```

Query the partitioned foreign table.
```sql
SELECT * FROM mc_oss_csv_external2 WHERE direction='NE';
```

The following result is returned:

```
+-----------+----------+-----------+-------+------------------+-------------------+----------------+-----------+
| vehicleid | recordid | patientid | calls | locationlatitute | locationlongtitue | recordtime     | direction |
+-----------+----------+-----------+-------+------------------+-------------------+----------------+-----------+
| 1         | 2        | 13        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:00 | NE        |
| 1         | 3        | 48        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:00 | NE        |
| 1         | 9        | 4         | 1     | 46.81006         | -92.08174         | 9/15/2014 0:00 | NE        |
+-----------+----------+-----------+-------+------------------+-------------------+----------------+-----------+
```

Write data to the partitioned foreign table and check whether the data is written.
```sql
INSERT INTO mc_oss_csv_external2 PARTITION(direction='NE') VALUES (1,12,76,1,46.81006,-92.08174,'9/14/2014 0:10');
SELECT * FROM mc_oss_csv_external2 WHERE direction='NE' AND recordId=12;
```

The following result is returned:

```
+-----------+----------+-----------+-------+------------------+-------------------+----------------+-----------+
| vehicleid | recordid | patientid | calls | locationlatitute | locationlongtitue | recordtime     | direction |
+-----------+----------+-----------+-------+------------------+-------------------+----------------+-----------+
| 1         | 12       | 76        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:10 | NE        |
+-----------+----------+-----------+-------+------------------+-------------------+----------------+-----------+
```

Check the OSS path `Demo2/direction=NE`. A new file is generated.
Example 3: Compressed data
This example shows how to create a GZIP-compressed foreign table in CSV format and perform read and write operations.
Create an internal table and write test data to it for subsequent write tests.
```sql
CREATE TABLE vehicle_test (
  vehicleid INT,
  recordid INT,
  patientid INT,
  calls INT,
  locationlatitute DOUBLE,
  locationlongtitue DOUBLE,
  recordtime STRING,
  direction STRING
);
INSERT INTO vehicle_test VALUES (1,1,51,1,46.81006,-92.08174,'9/14/2014 0:00','S');
```

Create a GZIP-compressed foreign table in CSV format and map it to the `Demo3/` (compressed data) folder in Example data. The following sample code shows how to create the OSS foreign table.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_csv_external3
(
  vehicleId INT,
  recordId INT,
  patientId INT,
  calls INT,
  locationLatitute DOUBLE,
  locationLongtitue DOUBLE,
  recordTime STRING,
  direction STRING
)
PARTITIONED BY (dt STRING)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH serdeproperties (
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
  'odps.text.option.gzip.input.enabled'='true',
  'odps.text.option.gzip.output.enabled'='true'
)
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo3/';
-- Import partition data.
MSCK REPAIR TABLE mc_oss_csv_external3 ADD PARTITIONS;
-- You can run the DESC EXTENDED mc_oss_csv_external3; command to view the schema of the created foreign table.
```

The role used in this example is `aliyunodpsdefaultrole`. If you use another role, replace `aliyunodpsdefaultrole` with the name of your role and grant the role permissions to access OSS.

Use the MaxCompute client to read data from OSS. The following sample code provides an example:
Note: If the compressed data in OSS is in an open source data format, you must add the `set odps.sql.hive.compatible=true;` command before the SQL statement and submit them together.

```sql
-- Enable a full table scan. This setting is valid only for the current session.
SET odps.sql.allow.fullscan=true;
SELECT recordId, patientId, direction FROM mc_oss_csv_external3 WHERE patientId > 25;
```

The following result is returned:
```
+----------+-----------+-----------+
| recordid | patientid | direction |
+----------+-----------+-----------+
| 1        | 51        | S         |
| 3        | 48        | NE        |
| 4        | 30        | W         |
| 5        | 47        | S         |
| 7        | 53        | N         |
| 8        | 63        | SW        |
| 10       | 31        | N         |
+----------+-----------+-----------+
```

Read data from the internal table and write it to the OSS foreign table.
You can run the `INSERT OVERWRITE` or `INSERT INTO` command on the foreign table using the MaxCompute client to write data to OSS.

```sql
INSERT INTO TABLE mc_oss_csv_external3 PARTITION (dt='20250418')
SELECT * FROM vehicle_test;
```

After the statement is executed, you can view the exported file in the OSS folder.
Create an OSS foreign table and specify the first row of the corresponding OSS file as the table header
Create the Demo11 folder in the oss-mc-test bucket from Example data and run the following statements:
--Create a foreign table.
CREATE EXTERNAL TABLE mf_oss_wtt
(
id BIGINT,
name STRING,
tran_amt DOUBLE
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH serdeproperties (
'odps.text.option.header.lines.count' = '1',
'odps.sql.text.option.flush.header' = 'true'
)
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo11/';
--Insert data.
INSERT OVERWRITE TABLE mf_oss_wtt VALUES (1, 'val1', 1.1),(2, 'value2', 1.3);
--Query data.
--When you create the table, you can set the data type of all fields to STRING. Otherwise, an error is reported when the table header is read.
--Alternatively, add the parameter to skip the table header when you create the table: 'odps.text.option.header.lines.count' = '1'
SELECT * FROM mf_oss_wtt;

The role used in this example is aliyunodpsdefaultrole. If you use another role, you can replace aliyunodpsdefaultrole with the name of your role and grant the role permissions to access OSS.
The following result is returned:
+----------+--------+------------+
| id | name | tran_amt |
+----------+--------+------------+
| 1 | val1 | 1.1 |
| 2 | value2 | 1.3 |
+----------+--------+------------+Create an OSS foreign table where the number of columns does not match the number of columns in the OSS data
Create the demo folder in the oss-mc-test bucket from Example data and upload the test.csv file. The test.csv file contains the following content:

1,kyle1,this is desc1
2,kyle2,this is desc2,this is two
3,kyle3,this is desc3,this is three, I have 4 columns

Create a foreign table.
Set the handling mode for rows with a mismatched number of columns to TRUNCATE.

-- Delete the table.
DROP TABLE test_mismatch;
-- Create a new foreign table.
CREATE EXTERNAL TABLE IF NOT EXISTS test_mismatch
(
    id STRING,
    name STRING,
    dect STRING,
    col4 STRING
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH serdeproperties ('odps.sql.text.schema.mismatch.mode' = 'truncate')
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/demo/';

Set the handling mode for rows with a mismatched number of columns to IGNORE.

-- Delete the table.
DROP TABLE test_mismatch01;
-- Create a new foreign table.
CREATE EXTERNAL TABLE IF NOT EXISTS test_mismatch01
(
    id STRING,
    name STRING,
    dect STRING,
    col4 STRING
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH serdeproperties ('odps.sql.text.schema.mismatch.mode' = 'ignore')
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/demo/';

Query table data.
Query the test_mismatch table.

SELECT * FROM test_mismatch;
--Result
+----+-------+---------------+---------------+
| id | name  | dect          | col4          |
+----+-------+---------------+---------------+
| 1  | kyle1 | this is desc1 | NULL          |
| 2  | kyle2 | this is desc2 | this is two   |
| 3  | kyle3 | this is desc3 | this is three |
+----+-------+---------------+---------------+

Query the test_mismatch01 table.

SELECT * FROM test_mismatch01;
--Result
+----+-------+---------------+-------------+
| id | name  | dect          | col4        |
+----+-------+---------------+-------------+
| 2  | kyle2 | this is desc2 | this is two |
+----+-------+---------------+-------------+
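The truncate and ignore behaviors shown above can be sketched locally in plain Python (an illustration only, not MaxCompute code; the rows match the sample test.csv):

```python
# Local sketch of odps.sql.text.schema.mismatch.mode semantics for rows
# whose column count differs from the foreign table schema.
def apply_mismatch_mode(row, n_cols, mode):
    if len(row) == n_cols:
        return row
    if mode == "error":
        raise ValueError("SCHEMA MISMATCH")
    if mode == "truncate":
        # Extra columns are discarded; missing columns are filled with NULL (None).
        return (row + [None] * n_cols)[:n_cols]
    if mode == "ignore":
        return None  # the whole mismatched row is skipped

rows = [
    ["1", "kyle1", "this is desc1"],                                        # 3 columns
    ["2", "kyle2", "this is desc2", "this is two"],                         # 4 columns
    ["3", "kyle3", "this is desc3", "this is three", " I have 4 columns"],  # 5 columns
]

truncated = [apply_mismatch_mode(r, 4, "truncate") for r in rows]
ignored = [r for r in rows if apply_mismatch_mode(r, 4, "ignore")]
```

With a four-column schema, truncate keeps all three rows (padding or trimming as needed), while ignore keeps only the row that has exactly four columns, matching the query results above.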
Create an OSS foreign table using the built-in open source parser
This example shows how to create an OSS foreign table using the built-in open source parser to read a comma-separated file and ignore the first and last rows.
Create the demo-test folder in the oss-mc-test bucket from Example data and upload the test file test.csv. The test file contains the following data:

1,1,51,1,46.81006,-92.08174,9/14/2014 0:00,S
1,2,13,1,46.81006,-92.08174,9/14/2014 0:00,NE
1,3,48,1,46.81006,-92.08174,9/14/2014 0:00,NE
1,4,30,1,46.81006,-92.08174,9/14/2014 0:00,W
1,5,47,1,46.81006,-92.08174,9/14/2014 0:00,S
1,6,9,1,46.81006,-92.08174,9/15/2014 0:00,S
1,7,53,1,46.81006,-92.08174,9/15/2014 0:00,N
1,8,63,1,46.81006,-92.08174,9/15/2014 0:00,SW
1,9,4,1,46.81006,-92.08174,9/15/2014 0:00,NE
1,10,31,1,46.81006,-92.08174,9/15/2014 0:00,N

Create a foreign table, specify the separator as a comma, and set parameters to ignore the first and last rows.
CREATE EXTERNAL TABLE ext_csv_test08
(
    vehicleId INT,
    recordId INT,
    patientId INT,
    calls INT,
    locationLatitute DOUBLE,
    locationLongtitue DOUBLE,
    recordTime STRING,
    direction STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH serdeproperties (
    "separatorChar" = ","
)
STORED AS textfile
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/***/'
-- Set parameters to ignore the first and last rows.
TBLPROPERTIES (
    "skip.header.line.count"="1",
    "skip.footer.line.count"="1"
);

Read the foreign table.
SELECT * FROM ext_csv_test08;

-- Only 8 rows are read; the first (header) and last (footer) rows are ignored.
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+
| vehicleid  | recordid   | patientid  | calls      | locationlatitute | locationlongtitue | recordtime     | direction  |
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+
| 1          | 2          | 13         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:00 | NE         |
| 1          | 3          | 48         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:00 | NE         |
| 1          | 4          | 30         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:00 | W          |
| 1          | 5          | 47         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:00 | S          |
| 1          | 6          | 9          | 1          | 46.81006         | -92.08174         | 9/15/2014 0:00 | S          |
| 1          | 7          | 53         | 1          | 46.81006         | -92.08174         | 9/15/2014 0:00 | N          |
| 1          | 8          | 63         | 1          | 46.81006         | -92.08174         | 9/15/2014 0:00 | SW         |
| 1          | 9          | 4          | 1          | 46.81006         | -92.08174         | 9/15/2014 0:00 | NE         |
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+
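The effect of the skip.header.line.count and skip.footer.line.count properties can be mimicked with a short Python sketch (a local illustration only, not MaxCompute code):

```python
def skip_header_footer(lines, header_count=1, footer_count=1):
    # Mirrors skip.header.line.count / skip.footer.line.count:
    # drop the first `header_count` lines and the last `footer_count` lines.
    end = len(lines) - footer_count
    return lines[header_count:end]

# 10 input lines, numbered like the recordId column in the sample file.
lines = [f"1,{i},..." for i in range(1, 11)]
kept = skip_header_footer(lines)  # 8 lines remain (records 2 through 9)
```

This matches the query result above: records 1 and 10 are treated as header and footer and are not returned.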
Create an OSS foreign table in CSV format with custom date and time data types
For more information about custom parsing and output formats for time types in CSV, see Smart Parse for more flexible type compatibility.
Create a CSV foreign table with custom formats for the date and time data types (DATE, DATETIME, TIMESTAMP, and TIMESTAMP_NTZ).

CREATE EXTERNAL TABLE test_csv
(
    col_date DATE,
    col_datetime DATETIME,
    col_timestamp TIMESTAMP,
    col_timestamp_ntz TIMESTAMP_NTZ
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/demo/'
TBLPROPERTIES (
    'odps.text.option.date.io.format' = 'MM/dd/yyyy',
    'odps.text.option.datetime.io.format' = 'yyyy-MM-dd-HH-mm-ss x',
    'odps.text.option.timestamp.io.format' = 'yyyy-MM-dd HH-mm-ss VV',
    'odps.text.option.timestamp_ntz.io.format' = 'yyyy-MM-dd HH:mm:ss.SS'
);

INSERT OVERWRITE TABLE test_csv VALUES(DATE'2025-02-21', DATETIME'2025-02-21 08:30:00', TIMESTAMP'2025-02-21 12:30:00', TIMESTAMP_NTZ'2025-02-21 16:30:00.123456789');

After the data is inserted, the content of the CSV file is:
02/21/2025,2025-02-21-08-30-00 +08,2025-02-21 12-30-00 Asia/Shanghai,2025-02-21 16:30:00.12

Read the data again to view the result.

SELECT * FROM test_csv;

The following result is returned:
+------------+---------------------+---------------------+------------------------+
| col_date   | col_datetime        | col_timestamp       | col_timestamp_ntz      |
+------------+---------------------+---------------------+------------------------+
| 2025-02-21 | 2025-02-21 08:30:00 | 2025-02-21 12:30:00 | 2025-02-21 16:30:00.12 |
+------------+---------------------+---------------------+------------------------+
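The io.format options use Java DateTimeFormatter-style patterns. For a quick local sanity check of what such patterns accept, two of them can be approximated with Python strptime directives (the mapping below is an illustrative assumption, not part of MaxCompute):

```python
from datetime import datetime

# Rough Python equivalents of two of the Java patterns above:
#   MM/dd/yyyy              -> %m/%d/%Y
#   yyyy-MM-dd HH:mm:ss.SS  -> %Y-%m-%d %H:%M:%S.%f  (approximate: %f accepts
#                              1-6 fractional digits and right-pads to microseconds)
col_date = datetime.strptime("02/21/2025", "%m/%d/%Y")
col_ntz = datetime.strptime("2025-02-21 16:30:00.12", "%Y-%m-%d %H:%M:%S.%f")
```

Note how the SS pattern truncates the inserted nanosecond value (16:30:00.123456789) to two fractional digits in the stored file, which is why the query returns 16:30:00.12.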
FAQ
A "column count mismatch" error is reported when reading CSV/TSV data
Symptom
If the number of columns in a CSV/TSV file does not match the number of columns in the foreign table DDL, a "column count mismatch" error is reported when you read the CSV/TSV data. For example:
FAILED: ODPS-0123131:User defined function exception - Traceback:
java.lang.RuntimeException: SCHEMA MISMATCH:xxx

Solution

You can control the behavior by setting the odps.sql.text.schema.mismatch.mode parameter at the session level:

SET odps.sql.text.schema.mismatch.mode=error: If the number of columns does not match, the system reports an error.
SET odps.sql.text.schema.mismatch.mode=truncate: If the file has more columns than the foreign table DDL, the extra data is discarded. If the file has fewer columns than the foreign table DDL, the missing columns are filled with NULL.
SET odps.sql.text.schema.mismatch.mode=ignore: Rows whose column count does not match the foreign table DDL are skipped.