MaxCompute:Migrate JSON-formatted data from OSS to MaxCompute

Last Updated: Mar 26, 2026

Use DataWorks Data Integration to load a JSON file from Object Storage Service (OSS) into MaxCompute, then extract field values with the GET_JSON_OBJECT function.

This guide stores the entire JSON document as a single string column. After loading, GET_JSON_OBJECT lets you query individual fields by JSONPath expression without restructuring the table.
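
For example, given a JSON string and a JSONPath expression, GET_JSON_OBJECT returns the matching value as text. A minimal sketch (the literal below is illustrative, not part of this guide's sample data; it assumes your project supports SELECT without a FROM clause):

    -- GET_JSON_OBJECT(json_string, json_path) returns the matched value as a string.
    SELECT GET_JSON_OBJECT('{"a": {"b": 1}}', '$.a.b');
    -- Returns: 1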

Prerequisites

Before you begin, make sure you have:

  • MaxCompute and DataWorks activated. See Activate MaxCompute and DataWorks.

  • A MaxCompute data source added in DataWorks. See Add a MaxCompute data source.

  • A workflow created in your DataWorks workspace (this guide uses a workspace in basic mode). See Create a workflow.

  • A TXT file containing JSON data uploaded to an OSS bucket. This guide uses a bucket in the China (Shanghai) region, with a file named applog.txt that contains the following JSON:

    {
        "store": {
            "book": [
                 {
                    "category": "reference",
                    "author": "Nigel Rees",
                    "title": "Sayings of the Century",
                    "price": 8.95
                 },
                 {
                    "category": "fiction",
                    "author": "Evelyn Waugh",
                    "title": "Sword of Honour",
                    "price": 12.99
                 },
                 {
                     "category": "fiction",
                     "author": "J. R. R. Tolkien",
                     "title": "The Lord of the Rings",
                     "isbn": "0-395-19395-8",
                     "price": 22.99
                 }
              ],
              "bicycle": {
                  "color": "red",
                  "price": 19.95
              }
        },
        "expensive": 10
    }

Step 1: Add an OSS data source

Add your OSS bucket as a data source in DataWorks. See Add an OSS data source.

Step 2: Create a table to store the JSON data

Create a MaxCompute table with a single string column. Each row stores one complete JSON document. You query individual fields later using GET_JSON_OBJECT — no need to define a column per field at this stage.

  1. Log on to the DataWorks console. In the top navigation bar, select the region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select your workspace and click Go to Data Development.

  2. In the left-side navigation pane, click Data Source. On the Data Source page, click Add Data Source to add a MaxCompute project.

  3. In the Add Data Source dialog box, fill in the parameters and click Associate Purchased Resource Group.

  4. On the DataStudio page, move the pointer over the Create icon and choose Create Table > MaxCompute > Table.

  5. In the Create Table dialog box, configure Path and Name, then click Create.

    If multiple MaxCompute data sources are associated with DataStudio, select the correct data source in the Create Table dialog box.
  6. On the table editing page, click DDL Statement.

  7. In the DDL dialog box, enter the following statement and click Generate Table Schema:

    CREATE TABLE mqdata (mq_data string);

    This creates a single string column (mq_data) that stores each JSON document as raw text.

  8. In the Confirm dialog box, click OK.

  9. In the General section, set the Display Name parameter. Click Commit to Production Environment.

    For workspaces in standard mode, click both Commit to Development Environment and Commit to Production Environment.
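
To confirm the schema from an SQL node, a DESC statement is enough (a minimal check, assuming the table was committed as mqdata):

    -- Describe the table; it should show a single string column named mq_data.
    DESC mqdata;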

Step 3: Create a batch synchronization task

Configure a DataWorks offline synchronization node to read applog.txt from OSS and write it to the mqdata table.

  1. On the Data Analytics page, right-click your workflow and choose Create Node > Data Integration > Offline synchronization.

  2. In the Create Node dialog box, enter a node name and click Submit.

  3. In the top navigation bar, click the Conversion script icon to switch to script mode.

  4. In script mode, click the Import Template icon.

  5. In the Import Template dialog box, select the source type, source data source, destination type, and destination data source, then click Confirm.

  6. Replace the generated JSON with the following configuration, then click the Run icon to run the task:

    {
        "type": "job",
        "steps": [
            {
                "stepType": "oss",
                "parameter": {
                    "fieldDelimiterOrigin": "^",
                    "nullFormat": "",
                    "compress": "",
                    "datasource": "OSS_userlog",
                    "column": [
                        {
                            "name": 0,
                            "type": "string",
                            "index": 0
                        }
                    ],
                    "skipHeader": "false",
                    "encoding": "UTF-8",
                    "fieldDelimiter": "^",
                    "fileFormat": "binary",
                    "object": [
                        "applog.txt"
                    ]
                },
                "name": "Reader",
                "category": "reader"
            },
            {
                "stepType": "odps",
                "parameter": {
                    "partition": "",
                    "isCompress": false,
                    "truncate": true,
                    "datasource": "odps_source",
                    "column": [
                        "mqdata"
                    ],
                    "emptyAsNull": false,
                    "table": "mqdata"
                },
                "name": "Writer",
                "category": "writer"
            }
        ],
        "version": "2.0",
        "order": {
            "hops": [
                {
                    "from": "Reader",
                    "to": "Writer"
                }
            ]
        },
        "setting": {
            "errorLimit": {
                "record": ""
            },
            "speed": {
                "concurrent": 2,
                "throttle": false
            }
        }
    }

    Key parameters:

      • fileFormat: binary. Reads the file as raw bytes without field-level parsing. The entire file is treated as a single record.
      • fieldDelimiter: ^. Delimiter set for the source file. Because fileFormat is binary, the file loads as a whole record regardless of this value.
      • datasource (reader): OSS_userlog. Name of your OSS data source in DataWorks. Replace with the actual name you configured.
      • datasource (writer): odps_source. Name of your MaxCompute data source in DataWorks. Replace with the actual name you configured.
      • truncate: true. Clears the target table before each run. Set to false to append data instead.
      • concurrent: 2. Number of parallel channels for the transfer.
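
    Because fileFormat is binary, the entire applog.txt file lands in a single row. After the task finishes, a quick count confirms this (a minimal check, run from an ODPS SQL node):

    -- The whole file is loaded as one record, so exactly one row is expected.
    SELECT COUNT(*) FROM mqdata;
    -- Expected: 1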

Verify the result

Create an ODPS SQL node to query the loaded data and extract a JSON field.

  1. Right-click your workflow and choose Create Node > MaxCompute > ODPS SQL.

  2. In the Create Node dialog box, enter a node name and click Submit.

  3. On the configuration tab, enter the following SQL statements:

    -- Query all rows in the mqdata table
    SELECT * FROM mqdata;
    
    -- Extract the value of the "expensive" field from each JSON document
    SELECT GET_JSON_OBJECT(mqdata.mq_data, '$.expensive') FROM mqdata;
  4. Click the Run icon to run the statements.

  5. View the results in the operation log. The first query returns the raw JSON stored in each row. The second query extracts the top-level expensive field. Based on the sample data, the expected output for the second query is:

    10

What's next

  • To query nested fields, use a longer JSONPath expression in GET_JSON_OBJECT. For example, $.store.bicycle.color returns red from the sample data (see the sketch after this list).

  • To automate recurring loads, schedule the synchronization node as a periodic task in DataWorks.
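
A sketch of the nested-field queries mentioned in the first item, using the table and column names from this guide:

    -- Extract a nested object field.
    SELECT GET_JSON_OBJECT(mqdata.mq_data, '$.store.bicycle.color') FROM mqdata;  -- red

    -- Index into the book array (JSONPath array indexes are zero-based).
    SELECT GET_JSON_OBJECT(mqdata.mq_data, '$.store.book[0].title') FROM mqdata;  -- Sayings of the Century
    SELECT GET_JSON_OBJECT(mqdata.mq_data, '$.store.book[2].price') FROM mqdata;  -- 22.99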