DataWorks:SQL query and analysis (new)

Last Updated: Nov 26, 2025

You can use SQL statements to quickly query and analyze data from data sources such as MaxCompute, EMR Hive, and Hologres. This topic describes how to query data sources using SQL statements.

Important

This topic describes the new version of DataAnalysis. For information about the old version, see SQL query (old). You can switch between the new and old versions of DataAnalysis in the navigation bar.

Supported data sources

SQL query supports the following data source types: MaxCompute, Hologres, EMR, CDH, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, ClickHouse, StarRocks, MySQL, PostgreSQL, Oracle, SQL Server, Doris, and SelectDB.

Note

Only MaxCompute supports both direct connections and data source connections. For all other data source types, you can query only data sources that have been added to a workspace.

Permissions for data sources

Data source scope

  • Data source connection mode: You can select data only from data sources in workspaces where you have permissions. Therefore, you must first contact an administrator to be added to a workspace as a Data Analyst, Developer, O&M, or Workspace Administrator.

  • Direct connection mode: You can select only the MaxCompute projects that the current logon account is a member of. For more information about how to configure permissions for MaxCompute projects, see MaxCompute Users and Permissions.

Data source access permissions

You can access data sources by using either of the following identity modes.

  • Executor Identity

    Description: The identity of the Alibaba Cloud account that is used to log on to DataWorks.

    Supported data sources: MaxCompute and Hologres.

    Authorization operation: Contact the administrator of the specified MaxCompute project or Hologres instance to grant your account member access permissions. A sketch of the commands that an administrator might run for a MaxCompute project is provided after this list.

  • Data Source Default Access Identity

    Description: The access identity that was configured when the data source was created.

    Supported data sources: All data source types that SQL query supports.

    Authorization operation: If your current logon account is not the default access identity of the data source, contact a user who has Workspace Administrator permissions to grant access to your Alibaba Cloud account.
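
The following is a minimal sketch of the MaxCompute commands that a project administrator might run to add your Alibaba Cloud account as a project member and grant basic privileges. The account, project, and table names are placeholders, and the privileges that you actually need depend on your scenario:

    -- Run by the MaxCompute project administrator, for example in the MaxCompute client (odpscmd).
    -- The account, project, and table names below are placeholders.
    ADD USER ALIYUN$your_account@example.com;
    GRANT CreateInstance ON PROJECT your_project TO USER ALIYUN$your_account@example.com;
    GRANT Select ON TABLE your_table TO USER ALIYUN$your_account@example.com;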

Important

If IP whitelist-based access control is enabled for your MaxCompute project, you must add the IP addresses in the DataAnalysis whitelist to the IP whitelist of the MaxCompute project.
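
If the whitelist is managed within MaxCompute, the project owner can typically update it with the setproject command in the MaxCompute client. The IP addresses below are placeholders, not the actual DataAnalysis addresses; use the addresses listed for your region:

    -- Run by the MaxCompute project owner in the MaxCompute client (odpscmd).
    -- The IP addresses below are placeholders.
    setproject odps.security.ip.whitelist=10.0.0.1,192.168.0.0/24;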

Access the feature

Log on to DataWorks DataAnalysis, switch to the target region, and click Enter Data Analysis.

  • If you see Go To New DataAnalysis in the navigation bar, click it to switch to the new DataAnalysis page.

  • If you see Return To Legacy DataAnalysis in the navigation bar, you are on the new DataAnalysis page.

Create an SQL query

  1. Hover over Personal Directory > My Files, click the icon that appears on the right, and then select New SQL File.

    You can also click New Folder to customize the directory structure for your SQL query files.
  2. On the SQL editor page, write an SQL statement.

    Important

    You can also generate an SQL statement in the following ways:

    • On the Data Catalog page, after you add a data catalog, find the target table, right-click the table, and select Generate SQL Statement.

    • Copy an SQL statement that is shared with you in the Other People's Files folder.

    • After you edit the SQL statement, you can click the Formatting button in the toolbar to format the code.

    • SQL queries support Copilot. You can click the Copilot icon in the upper-right corner to use intelligent assistant features, such as SQL Generate and SQL Rectify, to help you write code. In addition, Copilot supports code auto-completion in the editor to improve your coding efficiency.

    Sample SQL

    For example, to query the GitHub public event data in MaxCompute, you can use the following SQL statement:

    -- Enable session-level schema syntax.
    SET odps.namespace.schema=true; 
    -- Query 100 rows of data from the dwd_github_events_odps table.
    SELECT * FROM bigdata_public_dataset.github_events.dwd_github_events_odps WHERE ds='${dt}' LIMIT 100;
    Note
    • When you query a MaxCompute project for which Schema syntax is enabled, you must add the SET odps.namespace.schema=true; command before the query statement. This enables session-level Schema syntax and prevents the query from failing.

    • By default, queries run against the data source specified in the current Run Configuration. To query other MaxCompute projects, you must explicitly specify the project name in your query statement. For example, if your running configuration specifies MaxCompute A as the computing resource and your query statement targets a table in MaxCompute B (SELECT * FROM B.schema_name.table_name WHERE ****), MaxCompute A is used as the execution engine to query data from MaxCompute B.
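
    The following is a minimal sketch of this cross-project pattern. The project, schema, and table names are placeholders:

    -- Runs on the MaxCompute project that is selected in Run Configuration (for example, project_a),
    -- but reads a table that belongs to another project (project_b). All names below are placeholders.
    SET odps.namespace.schema=true;
    SELECT * FROM project_b.your_schema.your_table WHERE ds='${dt}' LIMIT 100;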

  3. After you write the SQL statement, click Run Configuration on the right to configure the SQL query. You can configure parameters such as Data Source and Script Parameters.

    • Type: Select the target data source type for the SQL query.

    • Computing Resource: Select the target data source for the SQL query. If you query MaxCompute data, you can use a direct connection or a data source connection. For other data source types, you can query data only from data sources in workspaces where you have permissions. For more information, see Permissions for data sources.

    • Script Parameters: If your SQL query uses parameter variables, assign their runtime values here (see the sketch at the end of this step).

      Important

      If parameter parsing fails, go to the Settings page in the lower-left corner, search for parsing, and turn on the Enable DataWorks LSP code parameter feature option.
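
    For example, if your script uses the ${dt} variable from the sample SQL above, you can assign it a value such as 20240101 in Script Parameters. The value is illustrative; substitution works as follows:

    -- Script with a parameter variable:
    SELECT COUNT(*) FROM bigdata_public_dataset.github_events.dwd_github_events_odps WHERE ds='${dt}';
    -- If Script Parameters assigns dt the value 20240101 (illustrative), the executed SQL becomes:
    -- SELECT COUNT(*) FROM bigdata_public_dataset.github_events.dwd_github_events_odps WHERE ds='20240101';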

  4. Click Run in the toolbar to view the Executed SQL Content, Run Log, and Run Results in the results window.

Visualize query results

In the toolbar to the left of the query results, click the chart icon to automatically generate a chart from the results.

Export and share results

Important

If you want to export data to your local machine and then import it into another data source, we recommend that you use an offline synchronization task in Data Integration for more efficient and stable data migration and synchronization.

To the right of the SQL query results, click Export As. You can export the data in the following formats:

  • Local file: You can download the query results to your local computer in CSV format.

    • Download limits:

      • Only the MaxCompute and EMR engines are supported. For more information, see Data download limits.

      • If the data protection mechanism is enabled for a MaxCompute project, which prohibits data downloads, downloading data by using DataAnalysis fails.

    • Download scope: You can choose to download Only Data Displayed in Table or All Data.

      • Only Data Displayed in Table: Downloads only the data displayed on the current page. The default limit is 10,000 rows.

      • All Data: Exports all query results within the download limits.

    • Download method: Supports Download After Approval and Download Without Approval.

      • Download After Approval: Allows you to configure Fraud Detection rules to detect risks in data download operations. When you download data, you must submit a download approval request to ensure the compliance and security of data usage.

        Note: Only DataWorks Enterprise Edition supports setting and enabling Fraud Detection rules.

      • Download Without Approval: This is the default method. No approval request is required for the download.

  • Object Storage Service (OSS): You can export the query results in a specified format, such as CSV or Parquet, to an Alibaba Cloud Object Storage Service (OSS) bucket. This method is suitable for scenarios that involve archiving large data volumes or integrating with other cloud products.

    The first time you use this feature, you must grant DataWorks permissions to access your OSS resources. In the File Path drop-down list, click the One-click Authorization link in the prompt and follow the on-screen instructions to complete the RAM authorization.

    • File Path: Click the folder icon on the right to select the OSS bucket and directory in which you want to store the result file.

    • File Name: The system automatically generates a file name. You can also modify it manually.

    • File Format: Select the file format for the export. The supported formats are csv, text, orc, and parquet.

    • Delimiter: Specify the delimiter between columns. The default delimiter is a comma (,).

    • Encoding Format: Select the encoding format for the file, such as UTF-8 or GBK.

    • CU: Configure the number of compute units (CUs) for this export task. The default value is 1 CU.

    • Resource Group: Select the serverless resource group that runs this export task. If you do not select a resource group, the Data Integration resource group that is set in DataAnalysis > System Management is used by default.

    After you complete the configuration, click OK to start the export task. On the task execution page, you can view the export progress, run logs, and configuration details. After the task is complete, go to the OSS console to download the exported object file to your local computer.

  • Workbooks or Workbooks and Share: You can save the results to a workbook to perform a more detailed analysis of the query data. You can also share the latest analysis results from your workbook with others.