All Products
Search
Document Center

Spark SQL

Last Updated: Nov 19, 2021

The serverless Spark and Presto engines of Data Lake Analytics (DLA) can share metadata with Data Lake Formation (DLF).

Access the metadata service of DLA by using the serverless Spark engine

The serverless Spark engine of DLA supports a variety of metadata services. This engine allows you to access the metadata service of self-managed Hive databases and the metadata service of DLA. The metadata service of DLA allows multiple engines to access metadata in DLA. The serverless Spark engine can read and use the database and table schemas that are created when you , discover metadata, or perform one-click data warehousing based on full synchronization in T+1 days. The metadata that is created or modified by using Spark SQL can also be accessed by other engines.

After you choose Serverless Presto > Execute in the left-side navigation pane of the DLA console, you can view all the database and table schemas in DLA, query and analyze table data online, and manage the permissions of RAM users on the databases and tables. The following figure shows the Execute page.
Execute page

Run Spark SQL jobs

The serverless Spark engine of DLA allows you to directly write Spark SQL statements in the DLA console, without requiring you to package the statements into a JAR file or write Python code. This helps data developers analyze data by using the serverless Spark engine.

To create a Spark SQL job, log on to the DLA console. In the left-side navigation pane, choose Serverless Spark > Submit job to create a Spark SQL job. After you create a Spark SQL job, DLA automatically enables the metadata service. You can disable the metadata service of DLA by using one of the following methods:

  • Set the spark.sql.catalogImplementation parameter to in-memory.

    set spark.sql.catalogImplementation = in-memory;
  • Set the spark.sql.hive.metastore.version parameter to 1.2.1 or another value.

    set spark.sql.catalogImplementation = hive;
    set spark.sql.hive.metastore.version = 1.2.1;

You can directly enter SQL statements of the Spark SQL job in the code editor of the Execute page. SQL statements are separated by semicolons (;).

SQL statements support the following types of commands:

  • SET

    • This command is used to configure the serverless Spark engine. In most cases, SET commands are placed before a set of SQL statements.

    • Each SET command specifies the value of a Spark parameter. SET commands are separated by semicolons (;).

    • Keys and values in SET commands cannot be enclosed in single quotation marks (') or double quotation marks (").

  • ADD JAR

    • This command is used to add JAR packages that are required for executing Spark SQL statements, such as the JAR packages of user-defined functions (UDFs) and JAR packages of various data source connectors. JAR packages must be stored in an Object Storage Service (OSS) path. In most cases, this command is placed before a set of SQL statements.

    • One ADD JAR command specifies one OSS path in which a JAR package is stored. Strings in an OSS path cannot be enclosed in single quotation marks (') or double quotation marks ("). ADD JAR commands are separated by semicolons (;).

  • DDL or DML statements supported by Spark SQL:

    • SELECT: queries data.

    • INSERT: inserts data into a table.

    • SHOW DATABASE: queries databases.

Note
  • For more information about limits on Spark SQL statements, see Limits and usage notes of the metadata service of DLA in this topic.

  • If the SET or ADD JAR command is not placed before a set of SQL statements, these two commands take effect after the SQL statement that precedes these commands is executed. For example, if the SET command is placed between two SELECT statements, the SET command takes effect only after the SQL statement that precedes this command is executed.

Use Spark SQL in code

In DLA, you can execute Spark SQL statements, manage metadata, and read data from and write data to tables. The following example demonstrates how to use PySpark SQL statements. The method of using SQL statements in other programming languages is similar. Create a Python file named example.py and upload the file to OSS.

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # init pyspark context
    spark = SparkSession \
        .builder \
        .appName("Python SQL Test") \
        .getOrCreate()
    
    # create a database
    spark.sql(
            "create database if not exists dlatest comment 'c' location 'oss://{your bucket name}/{path}/' WITH DBPROPERTIES(k1='v1', k2='v2')")
    # create table
    spark.sql(
            "create table dlatest.tp(col1 INT)  PARTITIONED BY (p1 STRING, p2 STRING) location 'oss://{your bucket name}/{path}/' STORED AS parquet TBLPROPERTIES ('parquet.compress'='SNAPPY')")
    # show structure
    print(spark.sql("show create table dlatest.tp").collect()[0])
    
    # insert data
    spark.sql("INSERT into dlatest.tp partition(p1='a',p2='a') values(1)")
    
    # show data
    spark.sql("select * from dlatest.tp").show()

Use the following JSON code to submit a Spark job in the DLA console.

{
    "name": "DLA SQL Test",
    "file": "oss://path/to/example.py",
    "conf": {
        "spark.driver.resourceSpec": "small",
        "spark.sql.hive.metastore.version": "dla",
        "spark.sql.catalogImplementation": "hive",
        "spark.dla.connectors": "oss",
        "spark.executor.instances": 1,
        "spark.dla.job.log.oss.uri": "oss://path/to/spark-logs",
        "spark.executor.resourceSpec": "small"
    }
}

After the job is submitted, you can find the dlatest database and the tp table stored in the dlatest database on the Execute page in the DLA console.

Notice

When the metadata service of DLA references the names of databases and tables, it ignores the case sensitivity of these names.

Limits and usage notes of the metadata service of DLA

1. The serverless Spark engine of DLA allows you to create only external databases and external tables, and perform read and write operations only on these tables.

If you connect the serverless Spark engine to the metadata service of DLA, you can create only external tables and perform read and write operations only on these tables
When you create a database, you must explicitly specify LOCATION for the database. Sample statement:
CREATE DATABASE db1 LOCATION 'oss://test/db1/';
Similarly, when you create a table, you must explicitly specify LOCATION for the table. Sample statement:
CREATE TABLE table1(col1 INT) LOCATION 'oss://test/db1/table1/';
Take note of the following items:
  • If you use the serverless Spark engine to drop a table or a partition of a table in DLA, the OSS file that is mapped to the table is not deleted.

  • If you create a table in DLA, LOCATION of the table must be a subdirectory in the directory specified by LOCATION of the database.

  • If you add a partition to a table in DLA, LOCATION of the partition must be a subdirectory of the directory specified by LOCATION of the table.

  • If you execute the RENAME PARTITION statement to rename a partition of a table in DLA, the OSS directory structure does not change.

2. The serverless Spark engine supports only OSS external tables.

DLA SQL statements support data from multiple data stores, such as ApsaraDB RDS and Tablestore. The metadata service of DLA allows the serverless Spark engine to read data from and write data to OSS external tables.
If you use the serverless Spark engine to create databases and tables, you must specify a valid OSS path for LOCATION.
More data stores will be supported in the future.
3. A database named DEFAULT cannot be created.
When this rule applies, take note of the following items:
  • A database named DEFAULT cannot be created or managed in the serverless Spark engine.

  • Before you perform operations on a table by using SQL statements, you must execute the USE DatabaseName statement to specify the database in which the table is stored. You can also explicitly specify the database in a SELECT statement, such as SELECT * FROM db1.table1.

4. The serverless Spark engine has some limits on ALTER statements.
You can modify the metadata of databases and tables by using ALTER statements, such as ALTER DATABASE. When you use ALTER statements, take note of the following points:
  • You can execute ALTER statements to modify only the comment of a database. Other information of the database, such as the location and properties, cannot be modified.

  • You can execute ALTER statements to modify the columns or properties of a table. For example, you can add a column or modify properties. In this case, the columns must not be partition key columns.

5. The serverless Spark engine has some limits on the operations that you perform on Spark tables by using SQL statements.
After you create the db1 database and the table1 table in the serverless Spark engine, DLA must explicitly check whether the database or table exists when you execute SQL statements on the database or table. If you want to drop the db1 database, you must execute the following statement:
DROP DATABASE IF EXISTS db1;
6. Spark SQL does not support GRANT statements.
You cannot execute a GRANT statement in the serverless Spark engine to modify the permissions of RAM users. This rule is the same as that defined by the Apache Spark community.