E-MapReduce: Use Spark SIMD JSON

Last Updated: Aug 16, 2023

Spark SIMD JSON is a JSON parser based on single instruction, multiple data (SIMD) that parses JSON data more efficiently than the native JSON parser of Spark. This topic describes how to enable and use Spark SIMD JSON.

Enable Spark SIMD JSON

You can enable Spark SIMD JSON in the E-MapReduce (EMR) console.

Enable Spark SIMD JSON for Spark Thrift Server

  1. Go to the Services tab.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

    2. In the top navigation bar, select the region in which your cluster resides and select a resource group based on your business requirements.

    3. On the EMR on ECS page, find the cluster that you want to manage and click Services in the Actions column.

  2. Add a configuration item.

    1. On the Services tab, click Configure in the Spark3 section.

      In this example, Spark3 is used.

    2. Click the spark-thriftserver.conf tab.

    3. Click Add Configuration Item.

    4. In the Add Configuration Item dialog box, add a configuration item whose Key is spark.sql.simd.json.enabled and whose Value is true.

    5. Click OK.

    6. In the dialog box that appears, enter an execution reason in the Execution Reason field and click Save.

  3. Restart Spark Thrift Server.

    1. On the Services tab, click the Status tab.

    2. In the Components section, find SparkThriftServer and click Restart in the Actions column.

    3. In the dialog box that appears, enter an execution reason in the Execution Reason field and click OK.

    4. In the Confirm message, click OK.

Enable Spark SIMD JSON for a Spark job

Add the following configuration item to the command that you use to submit the Spark job, for example by using the --conf option of spark-submit:

spark.sql.simd.json.enabled=true
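As a rough illustration of where the parameter goes, the following minimal sketch assembles a spark-submit command line that passes the configuration item through the standard --conf option. The helper name build_spark_submit and the job file my_job.py are hypothetical placeholders, not part of EMR or Spark.

```python
import shlex

# Configuration key and value taken from this topic.
SIMD_JSON_CONF = "spark.sql.simd.json.enabled=true"


def build_spark_submit(app, app_args=()):
    """Return a spark-submit argument list with Spark SIMD JSON enabled.

    `app` is the application file to run. The helper itself is only an
    illustration; it is not an EMR or Spark API.
    """
    # Options such as --conf must come before the application file.
    # Everything after the application file is passed to the application.
    return ["spark-submit", "--conf", SIMD_JSON_CONF, app, *app_args]


if __name__ == "__main__":
    # my_job.py is a hypothetical application file.
    print(shlex.join(build_spark_submit("my_job.py")))
```

The printed command can be pasted into a shell on a cluster node. Keeping the arguments as a list also makes it straightforward to hand them to subprocess.run without shell quoting issues.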

Supported functions

Spark SIMD JSON supports the following functions. You can use these functions in the same way as you use them with the native JSON parser of Spark.

  • get_json_object(expr, path)

    Sample code:

    SELECT get_json_object('{"a":"b"}', '$.a');
    -- Result: b

  • json_tuple(jsonStr, path1 [, ...])

    Sample code:

    SELECT json_tuple('{"a":1, "b":2}', 'a', 'b'), 'Spark';
    -- Result: 1 2 Spark
    SELECT json_tuple('{"a":1, "b":2}', 'a', 'c'), 'Hive';
    -- Result: 1 NULL Hive
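To make the return values in the samples above easier to reason about, the following sketch emulates the lookup behavior of the two functions in plain Python with the standard json module. It is not how Spark or Spark SIMD JSON is implemented, and it assumes only simple dotted paths such as $.a or $.a.b (no array indexes or wildcards). The helper _to_sql_string is a hypothetical name introduced for this illustration.

```python
import json


def _to_sql_string(value):
    """Render a parsed JSON value as the string a SQL result would show."""
    if value is None:
        return None  # JSON null or a missing key maps to SQL NULL
    if isinstance(value, str):
        return value  # strings are returned without surrounding quotes
    return json.dumps(value, separators=(",", ":"))


def get_json_object(json_str, path):
    """Emulate get_json_object for simple '$.a.b' style paths only."""
    if not path.startswith("$"):
        return None
    try:
        node = json.loads(json_str)
    except json.JSONDecodeError:
        return None  # malformed input yields NULL in this sketch
    # Walk the path one key at a time, skipping the empty piece after '$'.
    for key in filter(None, path[1:].split(".")):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return _to_sql_string(node)


def json_tuple(json_str, *fields):
    """Emulate json_tuple: one value per requested top-level field."""
    try:
        doc = json.loads(json_str)
    except json.JSONDecodeError:
        doc = None
    if not isinstance(doc, dict):
        return tuple(None for _ in fields)
    return tuple(_to_sql_string(doc.get(field)) for field in fields)


if __name__ == "__main__":
    print(get_json_object('{"a":"b"}', "$.a"))
    print(json_tuple('{"a":1, "b":2}', "a", "b"))
    print(json_tuple('{"a":1, "b":2}', "a", "c"))
```

In this sketch a missing field comes back as None, which corresponds to the NULL in the second json_tuple sample, and numeric values come back as strings.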