JSON External Tables - MaxCompute - Alibaba Cloud Documentation Center

This topic describes how to create, read from, and write to JSON-formatted external tables in OSS.

Applicability

OSS external tables do not support the cluster property.
The size of a single file cannot exceed 2 GB. You must split files that are larger than 2 GB.
MaxCompute and OSS must be in the same region.
Supported data types.
Schema evolution support.

Create an External Table

Syntax

When you read JSON data, if a JSON file has fewer columns than what is defined in the external table, MaxCompute populates the missing column values with NULL. If the file has more columns, MaxCompute discards the extra columns.

Streamlined syntax structure

CREATE EXTERNAL TABLE <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS textfile
LOCATION '<oss_location>';

Full Syntax

CREATE EXTERNAL TABLE <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
  [WITH serdeproperties (
    ['<property_name>'='<property_value>',...])
  ]
STORED AS textfile
LOCATION '<oss_location>'
[tblproperties ('<tbproperty_name>'='<tbproperty_value>',...)];

Common Parameters

For more information, see Common syntax parameters.

Table-Specific Parameters

tblproperties Parameters

property_name	Use case	Description	property_value	Default
mcfed.mapreduce.output.fileoutputformat.compress	Add this property when you want to compress TEXTFILE data files before writing them to OSS.	TEXTFILE compression property. Set this parameter to True to enable compression. If set to False, no compression occurs.	True False	False
mcfed.mapreduce.output.fileoutputformat.compress.codec	Add this property when you want to compress TEXTFILE data files before writing them to OSS.	TEXTFILE compression property. Specifies the compression codec for TEXTFILE data files. The default output format is `.deflate`. Note: Only the compression codecs listed in `property_value` are supported.	com.hadoop.compression.lzo.LzoCodec com.hadoop.compression.lzo.LzopCodec org.apache.hadoop.io.compress.SnappyCodec com.aliyun.odps.io.compress.SnappyRawCodec org.apache.hadoop.io.compress.BZip2Codec org.apache.hadoop.io.compress.Lz4Codec org.apache.hadoop.io.compress.DeflateCodec org.apache.hadoop.io.compress.GzipCodec	org.apache.hadoop.io.compress.DeflateCodec
odps.external.data.output.prefix (Backward compatible with odps.external.data.prefix)	Add this property when you want to add a custom prefix to output files.	Must contain only letters, digits, or underscores (a–z, A–Z, 0–9, _). Length must be between 1 and 10 characters.	A valid combination of allowed characters, such as `mc_`.	None
odps.external.data.enable.extension	Add this property when you want to display file extensions in output filenames.	If set to True, output files include extensions. If set to False, extensions are omitted.	True False	False
odps.external.data.output.suffix	Add this property when you want to add a custom suffix to output files.	Must contain only letters, digits, or underscores (a–z, A–Z, 0–9, _).	A valid combination of allowed characters, such as _hangzhou.	None
odps.external.data.output.explicit.extension	Add this property when you want to specify a custom file extension for output files.	Must contain only letters, digits, or underscores (a–z, A–Z, 0–9, _). Length must be between 1 and 10 characters. Takes precedence over `odps.external.data.enable.extension`.	A valid combination of allowed characters, such as `jsonl`.	None
odps.text.option.bad.row.skipping	Add this property when you want to skip dirty data in OSS data files.	When MaxCompute reads OSS data files, you can choose whether to skip dirty data.	rigid: Skips dirty rows unconditionally. Cannot be overridden by session- or project-level settings. flexible: Enables flexible skipping at the data layer, but it can also be overwritten by configurations at the session or project layer.

When you create a JSON external table, if a field contains a nested JSON object, do not define its data type as STRING or JSON. Otherwise, MaxCompute cannot automatically parse the subfields.

You can use one of the two approaches described in the following examples:

Define the field as STRING. Then, use functions such as get_json_object during queries to extract subfield values.
Use the STRUCT type to define the field, mapping each subfield of the JSON object to a separate column in the table. You can directly access the data using the field_name.subfield_name syntax.

Write Data

For more information about the syntax for writing data from MaxCompute to OSS, see Write data to OSS.

Query and Analysis

For more information about SELECT syntax, see Read data from OSS.
For more information about query plan optimization, see Query optimization.
For more information about BadRowSkipping, see BadRowSkipping.

BadRowSkipping

If dirty data is present in your JSON data, you can use the BadRowSkipping feature to skip or retain the rows that contain errors. Enabling or disabling this feature does not change how the underlying data format is interpreted.

Parameter settings

Table-level parameter odps.text.option.bad.row.skipping
- rigid: The skip logic must be executed and cannot be overwritten by session-level or project-level settings.
- flexible: Enables flexible skipping at the data plane. This setting can be overwritten by session-level or project-level settings.
Session-level or project-level parameters
- odps.sql.unstructured.text.bad.row.skipping: This parameter can overwrite the flexible table-level parameter but not the rigid table-level parameter.
  - on: Actively enabled. If this parameter is not configured for the table, it is automatically enabled.
  - off: Actively disabled. If the table-level parameter is set to flexible, this feature is disabled. In other cases, the table-level parameter takes precedence.
  - <null>/If the input is invalid: The behavior is determined only by the table-level configuration.
- odps.sql.unstructured.text.bad.row.skipping.debug.num: Specifies the number of error results that can be printed to standard output (stdout) in Logview.
  - The maximum value is 1000.
  - If the value is less than or equal to 0, this feature is disabled.
  - If the input is invalid, this feature is disabled.

Interaction between session-level and table-level parameters

tbl property	session flag	result
rigid	on	On, forcibly enabled
	off
	<null>, an invalid value, or this parameter is not configured
flexible	on	On
	off	Off, disabled by the session
	<null>, an invalid value, or this parameter is not configured	On
Not configured	on	Status: Session enabled
	off	Off
	<null>, an invalid value, or this parameter is not configured	Off

Examples

Prepare the data.
Upload the test file json_bad_row_skipping.json, which contains dirty data, to the oss-mc-test/badrow/ path in your OSS bucket.

Create a JSON external table.

The behavior differs based on the table-level parameter and the session-level flag. The following three cases are possible:

Table-level parameter: odps.text.option.bad.row.skipping = flexible/rigid/<not set>
Session-level flag: odps.sql.unstructured.text.bad.row.skipping = on/off/<not set>

No table-level parameter set

-- No table-level parameter set. Errors occur as expected. Session flag controls behavior.
CREATE EXTERNAL TABLE test_json_bad_data_skipping_flag
(
  a INT,
  b INT
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored AS textfile
location '<oss://databucketpath>';

Table-level parameter skips errors. Session-level flag can disable it.

-- Table-level parameter skips errors, but session-level flag can override it.
CREATE EXTERNAL TABLE test_json_bad_data_skipping_flexible
(
  a INT,
  b INT
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored AS textfile
location '<oss://databucketpath>'
tblproperties (
  'odps.text.option.bad.row.skipping' = 'flexible'   -- Elastic mode. Can be disabled by session flag.
);

Table-level parameter forces error skipping. Session-level flag cannot disable it.

-- Table-level parameter forces error skipping. Session-level flag cannot override it.
CREATE EXTERNAL TABLE test_json_bad_data_skipping_rigid
(
  a INT,
  b INT
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored AS textfile
location '<oss://databucketpath>'
tblproperties (
  'odps.text.option.bad.row.skipping' = 'rigid'  -- Forced mode.
);

Verify the query results.

No table-level parameter set

-- Enable at session level.
SET odps.sql.unstructured.text.bad.row.skipping=on;

-- Disable at session level. If table is set to 'flexible', it disables skipping. If table is set to 'rigid', it has no effect.
SET odps.sql.unstructured.text.bad.row.skipping=off;

-- Print up to 10 problematic rows to Logview Stdout. Values ≤ 0 disable printing.
SET odps.sql.unstructured.text.bad.row.skipping.debug.num=10;

SELECT * FROM test_json_bad_data_skipping_flag;

The query fails and returns the following error: FAILED: ODPS-0123131:User defined function exception

Table-level parameter skips errors. Session-level flag can disable it.

-- Enable at session level.
SET odps.sql.unstructured.text.bad.row.skipping=on;

-- Disable at session level. If table is set to 'flexible', it disables skipping. If table is set to 'rigid', it has no effect.
SET odps.sql.unstructured.text.bad.row.skipping=off;

-- Print up to 10 problematic rows to Logview Stdout. Values ≤ 0 disable printing.
SET odps.sql.unstructured.text.bad.row.skipping.debug.num=10;

SELECT * FROM test_json_bad_data_skipping_flexible;

The query fails and returns the following error: FAILED: ODPS-0123131:User defined function exception

Table-level parameter forces error skipping. Session-level flag cannot disable it.

-- Enable at session level.
SET odps.sql.unstructured.text.bad.row.skipping=on;

-- Disable at session level. If table is set to 'flexible', it disables skipping. If table is set to 'rigid', it has no effect.
SET odps.sql.unstructured.text.bad.row.skipping=off;

-- Print up to 10 problematic rows to Logview Stdout. Values ≤ 0 disable printing.
SET odps.sql.unstructured.text.bad.row.skipping.debug.num=10;

SELECT * FROM test_json_bad_data_skipping_rigid;

The query returns the following result:

+------------+------------+
| a          | b          | 
+------------+------------+
| 1          | 2          | 
| 15         | 16         | 
+------------+------------+

Examples

Prerequisites

You have created a MaxCompute project.
You have prepared an OSS bucket and OSS directory. For more information, see Create a bucket and Manage folders.
Because MaxCompute is available only in specific regions, cross-region connectivity issues may occur. We recommend using an OSS bucket in the same region as your MaxCompute project.
Authorization
1. You have permissions to access OSS. An Alibaba Cloud account (primary account), Resource Access Management (RAM) user, or RAM role can access OSS external tables. For authorization details, see STS-mode authorization for OSS.
2. You have CreateTable permissions in the MaxCompute project. For details about table operation permissions, see MaxCompute permissions.

Example 1: Create, write to, and query a JSON external table

This example shows how to create a JSON external table using the built-in open-source JSON parser, and then write data to and query data from OSS.

Prepare the data.
Log on to the OSS console and upload the test file json2025.txt to the external-table-test/json/dt=20250521/ directory in your OSS bucket. For more information, see Upload OSS files.

Create a JSON external table.

CREATE EXTERNAL TABLE mc_oss_extable_name_json
(
  action STRING,
  time STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH serdeproperties (
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
)
STORED AS textfile
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/external-table-test/json/';

Add partitions. If your OSS external table is partitioned, you must run this command to add the existing partitions. For more information, see Add partitions to an OSS external table.
```
-- Add partitions.
MSCK REPAIR TABLE mc_oss_extable_name_json ADD PARTITIONS;
```

Read the JSON external table.

SELECT * FROM mc_oss_extable_name_json WHERE dt=20250526;

The query returns the following result:

+------------+------------+------------+
| action     | time       | dt         |
+------------+------------+------------+
| Close      | 1469679568 | 20250526   |
| Close      | 1469679568 | 20250526   |
+------------+------------+------------+

Write to the JSON external table.

INSERT INTO mc_oss_extable_name_json PARTITION (dt='20250526') VALUES ('test','1627273823');

Check the written data.

SELECT * FROM mc_oss_extable_name_json WHERE dt=20250526;

The query returns the following result:

+------------+------------+------------+
| action     | time       | dt         |
+------------+------------+------------+
| test       | 1627273823 | 20250526   |
| Close      | 1469679568 | 20250526   |
| Close      | 1469679568 | 20250526   |
+------------+------------+------------+

Example 2: Read JSON field values

Prepare the data

Create a JSON file named events.json with the following content:

{"a":{"x":1, "y":2}, "id":"123"}
{"a":{"x":3, "y":4}, "id":"345"}

Log on to the OSS console and upload the test file to the external-table-test/json-sturct/ directory in your OSS bucket. For more information, see Upload OSS files.

Method 1: Create a TEXTFILE external table and use the get_json_object function to read field values

Create a TEXTFILE external table with a single string column:

CREATE EXTERNAL TABLE extable_json_test01 (
  col STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS textfile
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/external-table-test/json_struct/';  

SELECT * FROM extable_json_test01;

The query returns the following result:

+------------------------------------+
|               col                  |
+------------------------------------+
| {"a": {"x": 1, "y": 2},"id":"123"} |
| {"a": {"x": 3, "y": 4},"id":"345"} |
+------------------------------------+

Use the get_json_object function to read the a and id fields:

SELECT 
    get_json_object(col, '$.a') AS a,
    get_json_object(col, '$.id') AS id
FROM extable_json_test01;

The query returns the following result:

+-------------------+-----+
|        a          | id  |
+-------------------+-----+
| {"x":1,"y":2}     | 123 |
| {"x":3,"y":4}     | 345 |
+-------------------+-----+

Read the nested fields x, y, and id:

SELECT 
    get_json_object(get_json_object(col,'$.a'),'$.x') AS x,
    get_json_object(get_json_object(col,'$.a'),'$.y') AS y,
    get_json_object(col,'$.id') AS id
FROM extable_json_test01;

The query returns the following result:

+---+---+-----+
| x | y | id  |
+---+---+-----+
| 1 | 2 |123  |
| 3 | 4 |345  |
+---+---+-----+

Method 2: Create a JSON external table and use the STRUCT type to handle nested fields

Create a JSON external table that uses the STRUCT type to represent nested fields:

CREATE EXTERNAL TABLE extable_json_test02
(
  a STRUCT<x: BIGINT, y: BIGINT>,
  id STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS textfile
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/external-table-test/json_struct/';

Query the table directly:

SELECT * FROM extable_json_test02;

The query returns the following result:

+----------+-----+
|    a     | id  |
+----------+-----+
| {x:1, y:2}|123 |
| {x:3, y:4}|345 |
+----------+-----+

Use the get_json_object and TO_JSON functions to read the x and y fields:

SELECT 
    get_json_object(TO_JSON(a), '$.x') AS x,
    get_json_object(TO_JSON(a), '$.y') AS y,
    id
FROM extable_json_test02;

The query returns the following result:

+---+---+-----+
| x | y | id  |
+---+---+-----+
| 1 | 2 |123  |
| 3 | 4 |345  |
+---+---+-----+

Example 3: Set custom prefixes, suffixes, and file extensions for OSS output files

Set a custom prefix test06_ for the output files. The DDL is as follows:

CREATE EXTERNAL TABLE  <mc_oss_extable_name>
(
  vehicleId INT,
  recordId INT,
  patientId INT,
  calls INT,
  locationLatitute DOUBLE,
  locationLongitude DOUBLE,
  recordTime STRING,
  direction STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH serdeproperties (
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
) 
STORED AS textfile
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/***/'
TBLPROPERTIES (
-- Set custom prefix.
    'odps.external.data.output.prefix'='test06_') 
;

-- Write data to the external table.
INSERT INTO  <mc_oss_extable_name> VALUES (1,32,76,1,63.32106,-92.08174,'9/14/2014 0:10','NW');

The output files are displayed in the following figure:

Set a custom suffix _beijing for the output files. The DDL is as follows:

CREATE EXTERNAL TABLE <mc_oss_extable_name>
(
  vehicleId INT,
  recordId INT,
  patientId INT,
  calls INT,
  locationLatitute DOUBLE,
  locationLongitude DOUBLE,
  recordTime STRING,
  direction STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH serdeproperties (
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
) 
STORED AS textfile
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/***/'
TBLPROPERTIES (
-- Set custom suffix.
    'odps.external.data.output.suffix'='_beijing') 
;

-- Write data to the external table.
INSERT INTO <mc_oss_extable_name> VALUES (1,32,76,1,63.32106,-92.08174,'9/14/2014 0:10','NW');

The output files are displayed in the following figure:

You can enable automatic file extensions for output files. DDL:

CREATE EXTERNAL TABLE <mc_oss_extable_name>
(
  vehicleId INT,
  recordId INT,
  patientId INT,
  calls INT,
  locationLatitute DOUBLE,
  locationLongitude DOUBLE,
  recordTime STRING,
  direction STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH serdeproperties (
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
) 
STORED AS textfile
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/***/'
TBLPROPERTIES (
-- Enable automatic file extensions.
    'odps.external.data.enable.extension'='true') 
;

-- Write data to the external table.
INSERT INTO <mc_oss_extable_name> VALUES (1,32,76,1,63.32106,-92.08174,'9/14/2014 0:10','NW');

The output files appear as shown in the following image:

Set a custom file extension jsonl for the output files. The DDL is as follows:

CREATE EXTERNAL TABLE <mc_oss_extable_name>
(
  vehicleId INT,
  recordId INT,
  patientId INT,
  calls INT,
  locationLatitute DOUBLE,
  locationLongitude DOUBLE,
  recordTime STRING,
  direction STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH serdeproperties (
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
) 
STORED AS textfile
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/***/'
TBLPROPERTIES (
-- Set custom file extension.
   'odps.external.data.output.explicit.extension'='jsonl') 
;

-- Write data to the external table.
INSERT INTO <mc_oss_extable_name> VALUES (1,32,76,1,63.32106,-92.08174,'9/14/2014 0:10','NW');

The output files are displayed in the following figure:

Set a custom prefix mc_, a suffix _beijing, and an extension jsonl for the output files. The DDL is as follows:

CREATE EXTERNAL TABLE <mc_oss_extable_name>
(
  vehicleId INT,
  recordId INT,
  patientId INT,
  calls INT,
  locationLatitute DOUBLE,
  locationLongitude DOUBLE,
  recordTime STRING,
  direction STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH serdeproperties (
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
) 
STORED AS textfile
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/***/'
TBLPROPERTIES (
    -- Set custom prefix.
    'odps.external.data.output.prefix'='mc_', 
    -- Set custom suffix.
    'odps.external.data.output.suffix'='_beijing', 
    -- Set custom file extension.
    'odps.external.data.output.explicit.extension'='jsonl') 
;  

-- Write data to the external table.
INSERT INTO <mc_oss_extable_name> VALUES (1,32,76,1,63.32106,-92.08174,'9/14/2014 0:10','NW');

The output files are displayed in the following figure:

Supported Data Types

For more information about MaxCompute data types, see Data types version 1.0 and Data types version 2.0.

Data type	Supported	Data type	Supported?
TINYINT	Supported	STRING	Supported
SMALLINT	Supported	DATE	Supported
INT	Supported	DATETIME	Not supported
BIGINT	Supported	TIMESTAMP	Not supported
BINARY	Not supported	TIMESTAMP_NTZ	Supported
FLOAT	Supported	BOOLEAN	Supported
DOUBLE	Supported	ARRAY	Supported
DECIMAL(precision,scale)	Supported	MAP	Supported
VARCHAR(n)	Supported	STRUCT	Supported
CHAR(n)	Supported	JSON	Not supported

Schema Evolution Support

JSON external tables map column values by name between the table schema and the columns in the JSON file.

The Data compatibility notes column in the following table describes whether an external table with an evolved schema can read both new data that matches the updated schema and historical data that does not.

Operation Type	Supported	Description	Data compatibility notes
Add column	Supported	New columns have no guaranteed order. They always appear last. Default values for new columns apply only to data written by MaxCompute.	New data matching the updated schema reads correctly. The columns of the legacy Schema data are not modified. Instead, the table is read using the new Schema. Example: After adding a column, historical data lacks that column. When querying, the new column returns NULL for those rows.
Drop column	Supported	JSON external tables map column values by name.	Compatible.
Modify Column Order	Supported	JSON external tables map column values by name.	Compatible.
Change column data type	Supported	See Change column data type for supported conversions.	Compatible.
Modify column name	Not supported	We do not recommend renaming columns. Because JSON external tables map by name, renaming breaks alignment between JSON file keys and the table schema. This causes query failures.	New data matching the updated schema reads correctly. If you do not perform a column modification operation on existing data that uses the old schema, the table is read using the new schema. Example: After renaming a column, if the JSON file still uses the old key name, the renamed column returns NULL.
Modify column comment	Supported	Comments must be valid strings up to 1024 bytes. Invalid comments cause errors.	Compatible.
Modify column nullability	Not supported	This operation is not supported. All columns are nullable by default.	Not applicable.