Use DataWorks Data Integration to load JSON data from Object Storage Service (OSS) into MaxCompute, then query individual JSON fields using the GET_JSON_OBJECT function.
JSON data often has a flexible, evolving schema — new fields can appear at any time. Instead of mapping each JSON field to a separate column upfront, this tutorial stores each JSON document as a single string in MaxCompute. You can then extract any field on demand using GET_JSON_OBJECT, without needing to redefine the table schema as your data changes.
The end-to-end flow:
-
An OSS bucket holds a TXT file containing JSON data.
-
A batch synchronization task in DataWorks Data Integration reads the file and writes each line as a string into MaxCompute.
-
An ODPS SQL node queries the table and extracts JSON fields using
GET_JSON_OBJECT.
Prerequisites
Before you begin, ensure that you have:
-
MaxCompute and DataWorks activated. See Activate MaxCompute and DataWorks
-
A MaxCompute data source added. See Add a MaxCompute data source
-
A workflow created in your workspace. This tutorial uses a workspace in basic mode. See Create a workflow
-
A TXT file containing JSON data uploaded to an OSS bucket. This tutorial uses a bucket in the China (Shanghai) region. Sample data:
{
"store": {
"book": [
{
"category": "reference",
"author": "Nigel Rees",
"title": "Sayings of the Century",
"price": 8.95
},
{
"category": "fiction",
"author": "Evelyn Waugh",
"title": "Sword of Honour",
"price": 12.99
},
{
"category": "fiction",
"author": "J. R. R. Tolkien",
"title": "The Lord of the Rings",
"isbn": "0-395-19395-8",
"price": 22.99
}
],
"bicycle": {
"color": "red",
"price": 19.95
}
},
"expensive": 10
}
Step 1: Add an OSS data source
Add an OSS data source in DataWorks so the batch synchronization task can connect to your OSS bucket. See Add an OSS data source.
Step 2: Create a MaxCompute table
Create a single-column table to store the raw JSON strings.
-
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O\&M > Data Development, select your workspace, and click Go to Data Development.
-
In the left-side navigation pane, click Data source. On the Data Source page, click Add Data Source to add a MaxCompute project.
-
In the Add Data Source dialog box, fill in the parameters and click Associate Purchased Resource Group.
-
On the DataStudio page, move your pointer over the
icon and choose Create Table > MaxCompute > Table. -
In the Create Table dialog box, configure Path and Name, then click Create.
Note: If multiple MaxCompute data sources are associated with DataStudio, select the data source you want to use.
-
On the table editing page, click DDL Statement.
-
In the DDL dialog box, enter the following statement and click Generate Table Schema:
CREATE TABLE mqdata (mq_data string);This creates a table named
mqdatawith a single string columnmq_data. Each row stores one complete JSON document as a string. -
In the confirmation message, click OK.
-
In the General section, configure Display Name, then click Commit to Production Environment.
Note: If your workspace is not in basic mode, click both Commit to Development Environment and Commit to Production Environment.
Step 3: Create a batch synchronization task
Configure an offline synchronization node to read the JSON file from OSS and write it to the mqdata table.
The Reader plugin reads the entire file as a binary stream — each line becomes one string value. The column configuration maps line index 0 (the first and only column) to the mq_data column in MaxCompute.
-
On the data analytics page, right-click your workflow and choose Create Node > Data Integration > Offline synchronization.
-
In the Create Node dialog box, enter a node name and click Submit.
-
In the top navigation bar, click the
icon to switch to script mode. -
In script mode, click the
icon to import a template. -
In the Import Template dialog box, set the source type, data source, target type, and data source, then click Confirm.
-
Replace the generated JSON with the following configuration and click the
icon to run the task:{ "type": "job", "steps": [ { "stepType": "oss", "parameter": { "fieldDelimiterOrigin": "^", "nullFormat": "", "compress": "", "datasource": "OSS_userlog", "column": [ { "name": 0, "type": "string", "index": 0 } ], "skipHeader": "false", "encoding": "UTF-8", "fieldDelimiter": "^", "fileFormat": "binary", "object": [ "applog.txt" ] }, "name": "Reader", "category": "reader" }, { "stepType": "odps", "parameter": { "partition": "", "isCompress": false, "truncate": true, "datasource": "odps_source", "column": [ "mqdata" ], "emptyAsNull": false, "table": "mqdata" }, "name": "Writer", "category": "writer" } ], "version": "2.0", "order": { "hops": [ { "from": "Reader", "to": "Writer" } ] }, "setting": { "errorLimit": { "record": "" }, "speed": { "concurrent": 2, "throttle": false } } }Replace
OSS_userlogwith your OSS data source name andodps_sourcewith your MaxCompute data source name. The key parameters in this configuration are:Parameter
Value
Description
fileFormatbinaryReads the file line by line without field splitting, so each JSON line is captured as a single string
column[].index0Maps the first (and only) column in the file to the target
fieldDelimiter^Uses
^as a delimiter that does not appear in JSON, preventing accidental field splitstruncatetrueClears the target table before each run, so re-running the task does not produce duplicate rows
Step 4: Verify the migration
Create an ODPS SQL node to query the migrated data and extract JSON fields.
-
Right-click your workflow and choose new > MaxCompute > ODPS SQL.
-
In the dialog box, enter a node name and click Submit.
-
On the node configuration tab, enter the following SQL statements:
-- Query all rows in the mqdata table. SELECT * FROM mqdata; -- Extract the value of the "expensive" field from each JSON document. SELECT GET_JSON_OBJECT(mqdata.MQdata, '$.expensive') FROM mqdata; -
Click the
icon to run the statements. -
Check the operation Log for the query results. The second query returns
10, which is the value of theexpensivefield in the sample JSON.
What's next
-
To extract other fields, modify the JSONPath expression passed to
GET_JSON_OBJECT. For example, use$.store.bicycle.colorto get the bicycle color. -
To schedule this synchronization task, configure a trigger in your DataWorks workflow.