The serverless Spark and Presto engines of Data Lake Analytics (DLA) can share metadata with Data Lake Formation (DLF).
Access the metadata service of DLA by using the serverless Spark engine
The serverless Spark engine of DLA supports a variety of metadata services. This engine allows you to access the metadata service of self-managed Hive databases and the metadata service of DLA. The metadata service of DLA allows multiple engines to access metadata in DLA. The serverless Spark engine can read and use the database and table schemas that are created when you execute DLA SQL statements, discover metadata, or perform one-click data warehousing based on T+1 full synchronization. The metadata that is created or modified by using Spark SQL can also be accessed by other engines.

Run Spark SQL jobs
The serverless Spark engine of DLA allows you to directly write Spark SQL statements in the DLA console, without requiring you to package the statements into a JAR file or write Python code. This helps data developers analyze data by using the serverless Spark engine.
To create a Spark SQL job, log on to the DLA console. In the left-side navigation pane, choose Serverless Spark > Submit job, and then create the job. After you create a Spark SQL job, DLA automatically enables the metadata service. You can disable the metadata service of DLA by using one of the following methods:
Set the spark.sql.catalogImplementation parameter to in-memory:
set spark.sql.catalogImplementation = in-memory;
Set the spark.sql.hive.metastore.version parameter to 1.2.1 or another value:
set spark.sql.catalogImplementation = hive;
set spark.sql.hive.metastore.version = 1.2.1;
You can directly enter the SQL statements of a Spark SQL job in the code editor on the Execute page. Separate SQL statements with semicolons (;).
The following types of commands are supported:
SET
This command is used to configure the serverless Spark engine. In most cases, SET commands are placed before a set of SQL statements.
Each SET command specifies the value of a Spark parameter. SET commands are separated by semicolons (;).
Keys and values in SET commands cannot be enclosed in single quotation marks (') or double quotation marks (").
ADD JAR
This command is used to add JAR packages that are required for executing Spark SQL statements, such as the JAR packages of user-defined functions (UDFs) and JAR packages of various data source connectors. JAR packages must be stored in an Object Storage Service (OSS) path. In most cases, this command is placed before a set of SQL statements.
One ADD JAR command specifies one OSS path in which a JAR package is stored. Strings in an OSS path cannot be enclosed in single quotation marks (') or double quotation marks ("). ADD JAR commands are separated by semicolons (;).
DDL or DML statements supported by Spark SQL, for example:
SELECT: queries data.
INSERT: inserts data into a table.
SHOW DATABASES: queries databases.
A combined example of these command types follows this list.
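The following sketch combines these command types in a single job. The JAR path, parameter values, and table names are placeholders used only for illustration:
set spark.sql.hive.metastore.version = dla;
set spark.dla.connectors = oss;
add jar oss://path/to/your/udf.jar;
show databases;
insert into db1.table1 values(1);
select * from db1.table1;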
For more information about limits on Spark SQL statements, see Limits and usage notes of the metadata service of DLA in this topic.
If a SET or ADD JAR command is not placed before a set of SQL statements, the command takes effect only after the SQL statement that precedes it is executed. For example, if a SET command is placed between two SELECT statements, the SET command takes effect only after the SQL statement that precedes it is executed.
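The following sketch illustrates this behavior. The table names are placeholders. The SET command takes effect only after the first SELECT statement is executed, so the new value applies only to the second SELECT statement:
select * from db1.table1;
set spark.sql.shuffle.partitions = 100;
select * from db1.table2;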
Use Spark SQL in code
In DLA, you can also execute Spark SQL statements in code to manage metadata and to read data from and write data to tables. The following example demonstrates how to execute SQL statements in PySpark. The method is similar in other programming languages. Create a Python file named example.py and upload the file to OSS.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # init pyspark context
    spark = SparkSession \
        .builder \
        .appName("Python SQL Test") \
        .getOrCreate()

    # create a database
    spark.sql(
        "create database if not exists dlatest comment 'c' location 'oss://{your bucket name}/{path}/' WITH DBPROPERTIES(k1='v1', k2='v2')")

    # create table
    spark.sql(
        "create table dlatest.tp(col1 INT) PARTITIONED BY (p1 STRING, p2 STRING) location 'oss://{your bucket name}/{path}/' STORED AS parquet TBLPROPERTIES ('parquet.compress'='SNAPPY')")

    # show structure
    print(spark.sql("show create table dlatest.tp").collect()[0])

    # insert data
    spark.sql("INSERT into dlatest.tp partition(p1='a',p2='a') values(1)")

    # show data
    spark.sql("select * from dlatest.tp").show()
Use the following JSON code to submit a Spark job in the DLA console.
{
    "name": "DLA SQL Test",
    "file": "oss://path/to/example.py",
    "conf": {
        "spark.driver.resourceSpec": "small",
        "spark.sql.hive.metastore.version": "dla",
        "spark.sql.catalogImplementation": "hive",
        "spark.dla.connectors": "oss",
        "spark.executor.instances": 1,
        "spark.dla.job.log.oss.uri": "oss://path/to/spark-logs",
        "spark.executor.resourceSpec": "small"
    }
}
After the job is submitted, you can find the dlatest database and the tp table that it contains on the Execute page of the DLA console.
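For example, you can run the following statements in a Spark SQL job to verify the result. This is a sketch based on the database and table created in the preceding example:
show databases;
show tables in dlatest;
select * from dlatest.tp;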
The metadata service of DLA ignores the case of database and table names when it references these names.
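For example, the following two statements refer to the same table because names are matched regardless of case. This is a sketch based on the table created above:
select * from dlatest.tp;
SELECT * FROM DLATEST.TP;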
Limits and usage notes of the metadata service of DLA
1. The serverless Spark engine of DLA allows you to create only external databases and external tables, and perform read and write operations only on these tables.
You must explicitly specify LOCATION for the database that you create. Sample statement: CREATE DATABASE db1 LOCATION 'oss://test/db1/';
You must explicitly specify LOCATION for the table that you create. Sample statement: CREATE TABLE table1(col1 INT) LOCATION 'oss://test/db1/table1/';
If you use the serverless Spark engine to drop a table or a partition of a table in DLA, the OSS files that are mapped to the table are not deleted.
If you create a table in DLA, the LOCATION of the table must be a subdirectory of the directory specified by the LOCATION of the database.
If you add a partition to a table in DLA, the LOCATION of the partition must be a subdirectory of the directory specified by the LOCATION of the table.
If you execute the RENAME PARTITION statement to rename a partition of a table in DLA, the OSS directory structure does not change.
A sketch that illustrates these LOCATION rules follows this list.
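The following sketch illustrates the LOCATION nesting rules above. The bucket name test, the paths, and the object names are placeholders:
CREATE DATABASE db1 LOCATION 'oss://test/db1/';
CREATE TABLE db1.table1(col1 INT) PARTITIONED BY (p1 STRING) LOCATION 'oss://test/db1/table1/';
ALTER TABLE db1.table1 ADD PARTITION (p1='a') LOCATION 'oss://test/db1/table1/p1=a/';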
2. The serverless Spark engine supports only OSS external tables.
DLA SQL statements support data from multiple data stores, such as ApsaraDB RDS and Tablestore. However, the metadata service of DLA allows the serverless Spark engine to read data from and write data to only OSS external tables.
3. A database named DEFAULT cannot be created.
A database named DEFAULT cannot be created or managed in the serverless Spark engine. Before you perform operations on a table by using SQL statements, you must execute the USE DatabaseName statement to specify the database in which the table is stored. You can also explicitly specify the database in a SELECT statement, such as SELECT * FROM db1.table1, as shown in the sketch after this item.
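For example (a sketch in which db1 and table1 are placeholder names):
USE db1;
SELECT * FROM table1;
SELECT * FROM db1.table1;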
4. Only some ALTER statements are supported, such as ALTER DATABASE. When you use ALTER statements, take note of the following points:
You can execute ALTER statements to modify only the comment of a database. Other information of the database, such as the location and properties, cannot be modified.
You can execute ALTER statements to modify the columns or properties of a table. For example, you can add a column or modify properties. The modified columns must not be partition key columns.
A sketch of ALTER statements that stay within these limits follows this list.
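The following sketch shows ALTER statements that stay within these limits. The table name, column name, and table property are illustrative assumptions:
ALTER TABLE db1.table1 ADD COLUMNS (col2 STRING);
ALTER TABLE db1.table1 SET TBLPROPERTIES ('parquet.compress'='SNAPPY');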
5. Specific SQL statements cannot be executed in the serverless Spark engine of DLA. Sample statement: DROP DATABASE IF EXISTS db1;
GRANT statements are also not supported.