Quick start: Run a Flink Hive SQL job - Realtime Compute for Apache Flink

Prerequisites

If you access the console as a RAM user, a RAM role, or another identity, ensure you have the required permissions. For more information, see permissions.
You have created a workspace. For more information, see Activate Realtime Compute for Apache Flink.

Limitations

Only Ververica Runtime (VVR) 8.0.11 and later supports the Hive dialect.
SQL jobs currently support only the INSERT statement syntax of the Hive dialect, and you must declare USE CATALOG <yourHiveCatalog> before the INSERT statements. To create a table, run the DDL on the Scripts page instead.
Hive and Flink user-defined functions (UDFs) are not supported.
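
The second limitation can be checked before you deploy a draft. The following is a minimal sketch in plain Java, under one reading of that limitation: the first statement is USE CATALOG and every later statement is an INSERT. The class HiveJobScriptCheck and its methods are hypothetical helpers and are not part of Realtime Compute for Apache Flink. The parsing is deliberately naive and assumes that neither "--" nor ";" appears inside a string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical pre-deployment check; not a platform API.
final class HiveJobScriptCheck {

    // Splits a script into statements: drops "--" comments, then splits on ';'.
    // Simplification: assumes no "--" or ';' inside string literals.
    static List<String> statements(String script) {
        StringBuilder noComments = new StringBuilder();
        for (String line : script.split("\\R")) {
            int i = line.indexOf("--");
            noComments.append(i >= 0 ? line.substring(0, i) : line).append('\n');
        }
        List<String> result = new ArrayList<>();
        for (String part : noComments.toString().split(";")) {
            String s = part.trim().replaceAll("\\s+", " ");
            if (!s.isEmpty()) {
                result.add(s);
            }
        }
        return result;
    }

    // Returns a list of problems; an empty list means the script fits the assumed rule.
    static List<String> check(String script) {
        List<String> problems = new ArrayList<>();
        List<String> stmts = statements(script);
        if (stmts.isEmpty()
                || !stmts.get(0).toUpperCase(Locale.ROOT).startsWith("USE CATALOG ")) {
            problems.add("The first statement must be USE CATALOG <yourHiveCatalog>.");
        }
        // Every statement after the first one must be an INSERT.
        for (int i = 1; i < stmts.size(); i++) {
            if (!stmts.get(i).toUpperCase(Locale.ROOT).startsWith("INSERT ")) {
                problems.add("Unsupported statement: " + stmts.get(i));
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        String script = "USE CATALOG hdfshive;\n"
                + "CREATE TABLE t (id INT);\n"
                + "INSERT INTO t VALUES (1);";
        check(script).forEach(System.out::println);
    }
}
```

Run against the script in main, the check reports the CREATE TABLE statement, which belongs on the Scripts page rather than in the job.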

Step 1: Create a Hive catalog

Configure Hive metadata. For more information, see Configure Hive metadata.
Create a Hive catalog. For more information, see Create a Hive catalog.

In this tutorial, the Hive catalog is named hdfshive.

Step 2: Prepare sample Hive tables

In the left-side navigation pane, go to Development > Scripts. Click New to create a script.

Run the following sample SQL statements.

Important

The Hive source table and sink table must be permanent tables created with the CREATE TABLE statement. You cannot use temporary tables created with the CREATE TEMPORARY TABLE statement.

```
-- Use the Hive catalog. In this example, the catalog is named hdfshive and was created in Step 1.
USE CATALOG hdfshive;
-- Create a source table with the default storage format.
CREATE TABLE source_table (
  id INT,
  name STRING,
  age INT,
  city STRING,
  salary FLOAT
) WITH ('connector' = 'hive');
-- Create a sink table with the default storage format.
CREATE TABLE target_table (
  city STRING,
  avg_salary FLOAT,
  user_count INT
) WITH ('connector' = 'hive');
-- Insert sample data into the source table.
INSERT INTO source_table VALUES
(1, 'Alice', 25, 'New York', 5000.0),
(2, 'Bob', 30, 'San Francisco', 6000.0),
(3, 'Charlie', 35, 'New York', 7000.0),
(4, 'David', 40, 'San Francisco', 8000.0),
(5, 'Eva', 45, 'Los Angeles', 9000.0);
-- Create a table with a specific storage format, for example, Parquet.
-- Load the Hive module.
LOAD MODULE hive WITH ('hive-version' = '2.3.6');
USE CATALOG `hdfshive`;
-- Required: Set the SQL dialect to 'hive' to recognize Hive DDL keywords such as 'STORED'.
SET 'table.sql-dialect' = 'hive';
CREATE TABLE `parquet_table` (
  id INT,
  name STRING,
  age INT,
  city STRING,
  salary FLOAT
) STORED AS PARQUET;
```

Step 3: Create a Hive SQL job

In the left-side navigation pane, go to Development > ETL.
Click New. In the New Draft dialog box, select Blank Batch Draft (BETA) and click Next.

Enter the job information.

Parameter	Description	Example
Name	The name of the job. Note: The job name must be unique within the current workspace.	hive-sql
Location	The folder where the job's code file is stored. You can also click the icon to the right of an existing folder to create a subfolder.	Drafts
Engine version	The Flink engine version used by the job. We recommend that you select a version with the RECOMMENDED tag. These versions offer higher reliability and performance. For more information about engine versions, see Release notes and Engine versions.	vvr-8.0.11-flink-1.17
SQL dialect	The SQL language for data processing. Note: This parameter appears only if you select an engine version that supports the Hive dialect.	Hive SQL

Click Create.

Step 4: Write and deploy the Hive SQL job

Write the SQL statements.

This example calculates the number of users older than 30 and the average salary for each city. You can copy the following SQL script into the SQL editor.

```
-- Use the Hive catalog. In this example, the catalog is named hdfshive and was created in Step 1.
USE CATALOG hdfshive;
INSERT INTO TABLE target_table
SELECT
  city,
  AVG(salary) AS avg_salary, -- Calculate the average salary
  COUNT(id) AS user_count -- Count the number of users
FROM source_table
WHERE age > 30 -- Filter for users older than 30
GROUP BY city; -- Group by city
```

In the upper-right corner, click Deploy. In the dialog box, configure the parameters as needed (this tutorial uses the default settings) and click OK.

(Optional) Step 5: Configure runtime parameters

Important

This step is required only if you use JindoSDK to access your Hive cluster.

In the left-side navigation pane, go to O&M > Deployments.
From the drop-down list, select Batch job. Find the target job and click Details in the Actions column.
In the deployment details panel, click Edit in the Runtime parameters configuration section.

In the Other Configuration field, add the following configuration:

```
fs.oss.jindo.endpoint: <YOUR_Endpoint>
fs.oss.jindo.buckets: <YOUR_Buckets>
fs.oss.jindo.accessKeyId: <YOUR_AccessKeyId>
fs.oss.jindo.accessKeySecret: <YOUR_AccessKeySecret>
```

For more information about these parameters, see Write data to OSS-HDFS.

Click Save.
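
Before you paste the configuration into the Other Configuration field, it can help to confirm that all four keys are present and that no placeholder is left unreplaced. The following is a minimal sketch in plain Java. JindoConfigCheck and its methods are hypothetical names for illustration and are not a platform API. The sketch assumes one "key: value" entry per line, as in the snippet above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical check for the Other Configuration text; not a platform API.
final class JindoConfigCheck {

    // The four keys listed in this step.
    static final List<String> REQUIRED_KEYS = List.of(
            "fs.oss.jindo.endpoint",
            "fs.oss.jindo.buckets",
            "fs.oss.jindo.accessKeyId",
            "fs.oss.jindo.accessKeySecret");

    // Parses "key: value" lines; lines without a ':' are ignored.
    static Map<String, String> parse(String text) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                entries.put(line.substring(0, colon).trim(),
                        line.substring(colon + 1).trim());
            }
        }
        return entries;
    }

    // Returns the required keys that are missing, empty, or still a <...> placeholder.
    static List<String> unresolvedKeys(String text) {
        Map<String, String> entries = parse(text);
        List<String> unresolved = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            String value = entries.getOrDefault(key, "");
            if (value.isEmpty() || (value.startsWith("<") && value.endsWith(">"))) {
                unresolved.add(key);
            }
        }
        return unresolved;
    }

    public static void main(String[] args) {
        // The endpoint value below is a made-up example, not a real endpoint.
        String text = "fs.oss.jindo.endpoint: endpoint.example.com\n"
                + "fs.oss.jindo.buckets: <YOUR_Buckets>\n";
        System.out.println("Unresolved keys: " + unresolvedKeys(text));
    }
}
```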

Step 6: Start the job and view the results

On the Deployments page, select Batch job from the filter, find your target job (for example, hive-sql), and click Start in the Actions column.
After the job status changes to FINISHED, view the results.

On the Development > Scripts page, run the following SQL statement to view the data, which includes the number of users older than 30 and their average salary in each city.
```
-- Use the Hive catalog. In this example, the catalog is named hdfshive and was created in Step 1.
USE CATALOG hdfshive; 
select * from target_table;
```
The query returns three rows from target_table with the columns city, avg_salary, and user_count: Los Angeles (9000.0, 1), New York (7000.0, 1), and San Francisco (8000.0, 1).
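
To see why these three rows are expected, the Step 4 query can be replayed on the Step 2 sample data outside Flink. The following is a minimal sketch in plain Java. ExpectedResultCheck, Row, and CityStats are hypothetical names introduced for illustration, and the stream pipeline only mirrors the WHERE, GROUP BY, AVG, and COUNT logic of the query. It does not involve Flink or Hive.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper: replays the Step 4 query on the Step 2 sample rows.
final class ExpectedResultCheck {

    // One row of source_table.
    static final class Row {
        final int id; final String name; final int age; final String city; final float salary;
        Row(int id, String name, int age, String city, float salary) {
            this.id = id; this.name = name; this.age = age; this.city = city; this.salary = salary;
        }
    }

    // avg_salary and user_count for one city.
    static final class CityStats {
        final double avgSalary; final long userCount;
        CityStats(double avgSalary, long userCount) {
            this.avgSalary = avgSalary; this.userCount = userCount;
        }
    }

    // The five rows inserted into source_table in Step 2.
    static List<Row> sampleRows() {
        return List.of(
                new Row(1, "Alice", 25, "New York", 5000.0f),
                new Row(2, "Bob", 30, "San Francisco", 6000.0f),
                new Row(3, "Charlie", 35, "New York", 7000.0f),
                new Row(4, "David", 40, "San Francisco", 8000.0f),
                new Row(5, "Eva", 45, "Los Angeles", 9000.0f));
    }

    // Mirrors: SELECT city, AVG(salary), COUNT(id) FROM source_table WHERE age > 30 GROUP BY city
    static Map<String, CityStats> aggregate(List<Row> rows) {
        return rows.stream()
                .filter(r -> r.age > 30) // strictly older than 30, so Bob (30) is excluded
                .collect(Collectors.groupingBy(
                        (Row r) -> r.city,
                        TreeMap::new, // sorted by city for stable output
                        Collectors.collectingAndThen(
                                Collectors.summarizingDouble((Row r) -> r.salary),
                                (DoubleSummaryStatistics s) ->
                                        new CityStats(s.getAverage(), s.getCount()))));
    }

    public static void main(String[] args) {
        aggregate(sampleRows()).forEach((city, s) ->
                System.out.println(city + ", " + s.avgSalary + ", " + s.userCount));
    }
}
```

Alice (25) and Bob (30) fail the age > 30 filter, so each city is left with exactly one user, and each average equals that user's salary.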

Hive JAR job development

You can run Hive dialect jobs as JAR jobs. This requires version 11.2 or later of the "ververica-connector-hive-2.3.6" JAR package. You must also ensure that the Hive configurations in your JAR job and the console settings match.

Console settings
1. The JAR URI specifies the uploaded JAR package for the JAR job.
2. In Additional Dependencies, upload the four configuration files from your Hive cluster: core-site.xml, mapred-site.xml, hdfs-site.xml, and hive-site.xml. You must also upload the ververica-connector-hive-2.3.6 JAR package.
3. Configure the following runtime parameters. If your Hive cluster configuration requires writing data to OSS-HDFS, also add the settings from (Optional) Step 5: Configure runtime parameters.
```
table.sql-dialect: HIVE
classloader.parent-first-patterns.additional: org.apache.hadoop;org.antlr.runtime
kubernetes.application-mode.classpath.include-user-jar: true
```
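
Item 2 of the console settings depends on four configuration files being available for upload. The following is a minimal sketch in plain Java of a local check that a directory contains all of them before you upload. HiveConfigFilesCheck and missingFiles are hypothetical names introduced for illustration, and the directory you pass in is an assumption about where you copied the files from your Hive cluster.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical local pre-upload check; not a platform API.
final class HiveConfigFilesCheck {

    // The four files that the console settings ask you to upload as Additional Dependencies.
    static final List<String> REQUIRED = List.of(
            "core-site.xml", "mapred-site.xml", "hdfs-site.xml", "hive-site.xml");

    // Returns the required file names that are not regular files in the given directory.
    static List<String> missingFiles(Path dir) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!Files.isRegularFile(dir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Defaults to the current directory when no argument is given.
        Path dir = Path.of(args.length > 0 ? args[0] : ".");
        List<String> missing = missingFiles(dir);
        System.out.println(missing.isEmpty() ? "All four files found." : "Missing: " + missing);
    }
}
```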

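Example JAR job code (the class name HiveDialectJob is illustrative):

```
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.CatalogDescriptor;
import org.apache.flink.table.module.hive.HiveModule;

public class HiveDialectJob {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
        // Catalog options. Both conf dirs point to the directory used for the uploaded configuration files.
        Configuration conf = new Configuration();
        conf.setString("type", "hive");
        conf.setString("default-database", "default");
        conf.setString("hive-version", "2.3.6");
        conf.setString("hive-conf-dir", "/flink/usrlib/");
        conf.setString("hadoop-conf-dir", "/flink/usrlib/");
        CatalogDescriptor descriptor = CatalogDescriptor.of("hivecat", conf);
        tableEnv.createCatalog("hivecat", descriptor);
        // Load the Hive module, switch to the Hive catalog, then run the INSERT.
        tableEnv.loadModule("hive", new HiveModule());
        tableEnv.useModules("hive");
        tableEnv.useCatalog("hivecat");
        tableEnv.executeSql("insert into `hivecat`.`default`.`test_write` select * from `hivecat`.`default`.`test_read`;");
    }
}
```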
