Build an OSS-to-MaxCompute JSON Data Pipeline - DataWorks

Use DataWorks Data Integration to load JSON data from Object Storage Service (OSS) into MaxCompute, then query individual JSON fields using the GET_JSON_OBJECT function.

JSON data often has a flexible, evolving schema — new fields can appear at any time. Instead of mapping each JSON field to a separate column upfront, this tutorial stores each JSON document as a single string in MaxCompute. You can then extract any field on demand using GET_JSON_OBJECT, without needing to redefine the table schema as your data changes.

The end-to-end flow:

An OSS bucket holds a TXT file containing JSON data.
A batch synchronization task in DataWorks Data Integration reads the file and writes each line as a string into MaxCompute.
An ODPS SQL node queries the table and extracts JSON fields using GET_JSON_OBJECT.

Prerequisites

Before you begin, ensure that you have:

MaxCompute and DataWorks activated. See Activate MaxCompute and DataWorks
A MaxCompute data source added. See Add a MaxCompute data source
A workflow created in your workspace. This tutorial uses a workspace in basic mode. See Create a workflow
A TXT file containing JSON data uploaded to an OSS bucket. This tutorial uses a bucket in the China (Shanghai) region. Sample data:

{
    "store": {
        "book": [
             {
                "category": "reference",
                "author": "Nigel Rees",
                "title": "Sayings of the Century",
                "price": 8.95
             },
             {
                "category": "fiction",
                "author": "Evelyn Waugh",
                "title": "Sword of Honour",
                "price": 12.99
             },
             {
                 "category": "fiction",
                 "author": "J. R. R. Tolkien",
                 "title": "The Lord of the Rings",
                 "isbn": "0-395-19395-8",
                 "price": 22.99
             }
          ],
          "bicycle": {
              "color": "red",
              "price": 19.95
          }
    },
    "expensive": 10
}

Step 1: Add an OSS data source

Add an OSS data source in DataWorks so the batch synchronization task can connect to your OSS bucket. See Add an OSS data source.

Step 2: Create a MaxCompute table

Create a single-column table to store the raw JSON strings.

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O\&M > Data Development, select your workspace, and click Go to Data Development.
In the left-side navigation pane, click Data source. On the Data Source page, click Add Data Source to add a MaxCompute project.
In the Add Data Source dialog box, fill in the parameters and click Associate Purchased Resource Group.
On the DataStudio page, move your pointer over the icon and choose Create Table > MaxCompute > Table.
In the Create Table dialog box, configure Path and Name, then click Create.

Note: If multiple MaxCompute data sources are associated with DataStudio, select the data source you want to use.
On the table editing page, click DDL Statement.
In the DDL dialog box, enter the following statement and click Generate Table Schema:
```
CREATE TABLE mqdata (mq_data string);
```
This creates a table named mqdata with a single string column mq_data. Each row stores one complete JSON document as a string.
In the confirmation message, click OK.
In the General section, configure Display Name, then click Commit to Production Environment.

Note: If your workspace is not in basic mode, click both Commit to Development Environment and Commit to Production Environment.

Step 3: Create a batch synchronization task

Configure an offline synchronization node to read the JSON file from OSS and write it to the mqdata table.

The Reader plugin reads the entire file as a binary stream — each line becomes one string value. The column configuration maps line index 0 (the first and only column) to the mq_data column in MaxCompute.

On the data analytics page, right-click your workflow and choose Create Node > Data Integration > Offline synchronization.
In the Create Node dialog box, enter a node name and click Submit.
In the top navigation bar, click the icon to switch to script mode.
In script mode, click the icon to import a template.
In the Import Template dialog box, set the source type, data source, target type, and data source, then click Confirm.

Replace the generated JSON with the following configuration and click the icon to run the task:

{
    "type": "job",
    "steps": [
        {
            "stepType": "oss",
            "parameter": {
                "fieldDelimiterOrigin": "^",
                "nullFormat": "",
                "compress": "",
                "datasource": "OSS_userlog",
                "column": [
                    {
                        "name": 0,
                        "type": "string",
                        "index": 0
                    }
                ],
                "skipHeader": "false",
                "encoding": "UTF-8",
                "fieldDelimiter": "^",
                "fileFormat": "binary",
                "object": [
                    "applog.txt"
                ]
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "odps",
            "parameter": {
                "partition": "",
                "isCompress": false,
                "truncate": true,
                "datasource": "odps_source",
                "column": [
                    "mqdata"
                ],
                "emptyAsNull": false,
                "table": "mqdata"
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "version": "2.0",
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    },
    "setting": {
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "concurrent": 2,
            "throttle": false
        }
    }
}

Replace OSS_userlog with your OSS data source name and odps_source with your MaxCompute data source name. The key parameters in this configuration are:

Parameter	Value	Description
`fileFormat`	`binary`	Reads the file line by line without field splitting, so each JSON line is captured as a single string
`column[].index`	`0`	Maps the first (and only) column in the file to the target
`fieldDelimiter`	`^`	Uses `^` as a delimiter that does not appear in JSON, preventing accidental field splits
`truncate`	`true`	Clears the target table before each run, so re-running the task does not produce duplicate rows

Step 4: Verify the migration

Create an ODPS SQL node to query the migrated data and extract JSON fields.

Right-click your workflow and choose new > MaxCompute > ODPS SQL.
In the dialog box, enter a node name and click Submit.

On the node configuration tab, enter the following SQL statements:

-- Query all rows in the mqdata table.
SELECT * FROM mqdata;
-- Extract the value of the "expensive" field from each JSON document.
SELECT GET_JSON_OBJECT(mqdata.MQdata, '$.expensive') FROM mqdata;

Click the icon to run the statements.
Check the operation Log for the query results. The second query returns 10, which is the value of the expensive field in the sample JSON.

What's next

To extract other fields, modify the JSONPath expression passed to GET_JSON_OBJECT. For example, use $.store.bicycle.color to get the bicycle color.
To schedule this synchronization task, configure a trigger in your DataWorks workflow.