
DataWorks:DataStudio

Last Updated:Dec 04, 2023

This topic provides answers to some frequently asked questions about DataStudio.

Which kind of resource group can I use when I reference a third-party package in a PyODPS node?

Use an exclusive resource group for scheduling. For more information, see Use a PyODPS node to reference a third-party package.
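
The following is a minimal sketch of how a PyODPS node can reference a third-party package, run on an exclusive resource group for scheduling as described above. The table name sample_table, the column ts_col, and the resource name python_dateutil.zip are hypothetical placeholders; the package must be uploaded as a resource first.

    # `o` is the MaxCompute entry object that is predefined in a PyODPS node.
    from odps.df import DataFrame

    df = DataFrame(o.get_table('sample_table'))

    def extract_year(value):
        # Import the third-party module inside the function so that the
        # import happens on the workers where the package is distributed.
        from dateutil import parser
        return parser.parse(value).year

    # Pass the uploaded package through `libraries` so that the workers
    # that run the map function can import it.
    print(df.ts_col.map(extract_year, 'int64').execute(libraries=['python_dateutil.zip']))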

How do I control whether the queried table data can be downloaded?

Before you download data from DataWorks, you must enable the Download SELECT Query Result feature. If no download entry point is available, the feature is disabled for your workspace. If you use a RAM user and need this feature, contact the owner of the Alibaba Cloud account or the workspace administrator to enable it in the Workspace Settings panel or on the Workspace Management page.

After you query data in DataStudio, the download entry point is displayed in the lower-right corner of the query result tab.

Due to the limits of the compute engine, you can download a maximum of 10,000 data records from DataStudio.

How do I download more than 10,000 data records?

Use a Tunnel command of MaxCompute. For more information, see Use SQLTask and Tunnel to export a large amount of data.
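
If you use PyODPS, the following minimal sketch is a counterpart of the SQLTask and Tunnel approach; the table name is a hypothetical placeholder.

    # `o` is the MaxCompute entry object that is predefined in a PyODPS node.
    instance = o.execute_sql('SELECT * FROM sample_table')

    # tunnel=True reads the result through the Tunnel endpoint, and
    # limit=False lifts the 10,000-record cap of the regular result set.
    with instance.open_reader(tunnel=True, limit=False) as reader:
        for record in reader:
            print(record)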

How do I reference a resource in a node?

Find the resource that you want to reference, right-click the resource name in the Scheduled Workflow pane, and then select Insert Resource Path.

How do I download a resource that is uploaded to DataWorks?

Find the resource that you want to download in the Scheduled Workflow pane, right-click the resource name, and then select View Earlier Versions. In the Versions dialog box, click Download in the Actions column.

How do I upload a resource whose size is greater than 30 MB?

Use a Tunnel command to upload the resource. Then, add the resource to DataStudio in the MaxCompute Resources pane for future use. For more information, see How do I use a resource that is uploaded to DataWorks by using odpscmd?.
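
For example, assuming the MaxCompute client (odpscmd) is installed and the local path is hypothetical, a large JAR package can be added as a resource with the following command:

    add jar /path/to/udf_bundle.jar -f;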

How do I use a resource that is uploaded to DataWorks by using odpscmd?

If you want to use a resource that is uploaded to DataWorks by using odpscmd, add the resource to DataStudio in the MaxCompute Resources pane.

How do I upload a JAR package on my on-premises machine to DataWorks as a JAR resource and reference the uploaded resource in a node?

Upload the JAR package to DataWorks on the DataStudio page as a JAR resource. To reference the resource in a node, find the resource in the Scheduled Workflow pane, right-click the resource name, and then select Insert Resource Path. A comment is automatically added at the beginning of the code for the node, and the node can reference the resource in its code by the resource name.

For example, you want to reference the resource test.jar in a Shell node. After you select Insert Resource Path, the comment ##@resource_reference{"test.jar"} is automatically added at the beginning of the code for the Shell node.
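
A minimal sketch of the resulting Shell node code might look as follows. The java invocation is hypothetical and depends on the content of the JAR package; at runtime, the referenced resource is downloaded to the working directory of the node, so it can be accessed by file name.

    ##@resource_reference{"test.jar"}
    java -jar test.jar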

How do I use a MaxCompute table in DataWorks?

You cannot use the codeless user interface (UI) to upload a MaxCompute table to DataWorks. If you want to use a MaxCompute table in DataWorks, perform the following steps:

  1. On the DataStudio page of DataWorks, create a file resource that has the same name as the MaxCompute table and upload the file. In this example, the userlog3.txt file is uploaded.

    Note

    Do not select Upload to MaxCompute.

  2. After you upload the file, execute a statement in odpscmd to add the MaxCompute table as a resource. In this example, the statement add table userlog3 -f; is executed.

  3. Select the uploaded file resource to use the resource.
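
For example, a Python user-defined function (UDF) can read the table resource through the get_cache_table API of MaxCompute. The following is a minimal sketch; the UDF name and signature are hypothetical, and the table resource userlog3 must be included in the resource list of the function when you register it.

    from odps.udf import annotate
    from odps.distcache import get_cache_table

    @annotate('string->bigint')
    class CountUserlog(object):
        def __init__(self):
            # get_cache_table returns an iterable over the records of the
            # table resource; materialize it once when the UDF starts.
            self.rows = list(get_cache_table('userlog3'))

        def evaluate(self, arg):
            return len(self.rows)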

Can a Python resource call another Python resource?

A Python resource can call another Python resource in the same workspace.
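
For example, in a MaxCompute Python UDF, one Python resource can import another if both resources are included in the resource list of the function when you register it. The following sketch is hypothetical; a.py is assumed to define a helper function named normalize.

    # b.py, registered together with a.py as resources of the same function.
    from odps.udf import annotate
    import a  # a.py is importable by module name once it is referenced

    @annotate('string->string')
    class Wrap(object):
        def evaluate(self, value):
            return a.normalize(value)  # hypothetical helper defined in a.py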

Can PyODPS call custom functions to use third-party packages?

Yes. If you do not want to use the map method of DataFrame to call a function, you can use PyODPS to call custom functions that use third-party packages. For more information, see Reference a third-party package in a PyODPS node.

When I call a pickle file in a PyODPS 3 node, the following error message appears: _pickle.UnpicklingError: invalid load key, '\xef'. What do I do?

Check whether the code of your PyODPS 3 node contains special characters. If the code contains special characters, compress the code into a ZIP package, upload the package to DataWorks, and then decompress the package to call the pickle file.
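
The following is a minimal sketch for the PyODPS 3 node, assuming the pickle file model.pkl was compressed into a package named model_bundle.zip and uploaded as a file resource; all names are hypothetical.

    import io
    import pickle
    import zipfile

    # `o` is the MaxCompute entry object that is predefined in a PyODPS node.
    with o.open_resource('model_bundle.zip', mode='rb') as f:
        payload = f.read()

    # Decompress the package in memory and load the pickle file.
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        with zf.open('model.pkl') as pf:
            model = pickle.load(pf)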

How do I delete a MaxCompute resource?

To delete a MaxCompute resource in a workspace in basic mode, right-click the resource name and select Delete. To delete a MaxCompute resource in a workspace in standard mode, you must delete the resource from the development environment and then from the production environment. The following example shows how to delete a MaxCompute resource in both environments.

Note

In a DataWorks workspace in standard mode, the development environment is isolated from the production environment. If you delete a resource on the DataStudio page, the resource is deleted only from the development environment. The same resource is deleted from the production environment only after you deploy the delete operation to the production environment.

  1. Delete a resource from the development environment. In the desired workflow, choose MaxCompute > Resource, right-click the name of the resource that you want to delete, and then select Delete. In the Delete dialog box, click OK.

  2. Delete a resource from the production environment. A resource can be deleted from the production environment only after the delete operation is deployed. On the DataStudio page, click Deploy in the upper-right corner. On the Create Deploy Task page, set Change Type to Offline, find the package of the resource that you deleted in the previous step, and click Deploy in the Actions column. In the Create Deploy Task dialog box, click Deploy. After you click Deploy, the resource is deleted from the production environment.

How do I recover a node that is deleted?

On the DataStudio page, click the Recycle Bin icon in the left-side navigation pane. In the Recycle Bin pane, find the node that you want to recover, right-click the node name, and then select Restore.

How do I view the versions of a node?

Find the node whose versions you want to view in the Scheduled Workflow pane and double-click the node name to go to the configuration tab of the node. Then, click Versions in the right-side navigation pane. On the Versions tab, you can view the versions of the node.

Important

A version is generated only after you commit the code.


How do I clone a workflow?

Use a node group. For more information, see Create and manage a node group.

How do I export the code of a node?

Use Migration Assistant. For more information, see Overview.

How do I check whether a node is committed?

If you want to check whether a node is committed, find the desired workflow in the Scheduled Workflow pane and expand the workflow to view the status of each node in the workflow. If a commit icon is displayed on the left side of a node, the node is committed. Otherwise, the node is not committed.

Can I configure properties for all nodes in a workflow at a time?

No. DataWorks does not allow you to configure properties for a workflow as a whole. If a workflow contains multiple nodes, you must configure properties for the nodes one by one.

What is the impact on the instances of a node after the node is deleted?

The scheduling system generates one or more instances for a node every day based on the time properties of the node. If the node is deleted after it is run for a period of time, its instances are retained. However, the instances will fail to run after the node is deleted. This is because the required code is unavailable.

After a node is modified and committed and deployed to the production environment, is the existing faulty node in the production environment overwritten?

No, the existing faulty node is not overwritten. The updated code is used to run new node instances that are not run, and the existing node instances are retained. If scheduling properties are modified, the modified configurations apply only to the new node instances.

How do I create a table in a visualized manner?

Go to the DataStudio page and click the Workspace Tables icon in the left-side navigation pane. In the Workspace Tables pane, create a table.

How do I add fields to a table that is in the production environment?

If you use an Alibaba Cloud account, add fields to the table in the Workspace Tables pane of the DataStudio page and commit the table to the production environment.

If you use a RAM user, you must request the permissions of the O&M engineer or workspace administrator role for the RAM user, use the RAM user to add fields to the table in the Workspace Tables pane of the DataStudio page, and then commit the table to the production environment.

How do I delete a table?

You can delete a table from the development environment on the DataStudio page.

To delete a table from the production environment, use one of the following methods:

  • Go to Data Map and delete the table on the My Data tab.

  • Create an ODPS SQL node, and enter and execute the DROP statement on the configuration tab of the node. For more information about how to create an ODPS SQL node, see Develop a MaxCompute SQL task. For more information about the syntax of the DROP statement, see Table operations.

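For example, you can execute the following statement in the ODPS SQL node; the table name is a hypothetical placeholder:

    DROP TABLE IF EXISTS userlog3;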

How do I upload data from my on-premises machine to a MaxCompute table?

Go to the DataStudio page and use the Import feature in the Scheduled Workflow pane to import the data.
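
Alternatively, for large files, you can use a Tunnel command in the MaxCompute client (odpscmd); the file path and table name in the following example are hypothetical:

    tunnel upload /path/to/local_data.txt your_project.your_table;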

When I create a table in a workspace with which an E-MapReduce (EMR) compute engine instance is associated, the following error message appears: call emr exception. What do I do?

  • Possible cause:

    Security settings are not configured for the security group to which your EMR cluster belongs. Before you associate an EMR compute engine instance with your workspace, add the following rules to the security group of the ECS instance that hosts your EMR cluster. Otherwise, the preceding error message may appear.

    • Action: Allow

    • Protocol type: Custom TCP

    • Port range: 8898/8898

    • Authorization object: 100.104.0.0/16

  • Solution:

    Check the security settings of the security group of the ECS instance that hosts your EMR cluster. If the security settings do not include the preceding rules, add the rules to the security group.

How do I query data that is in the production environment from the development environment on the DataStudio page?

In a workspace in standard mode, if you want to query data that is in the production environment from the development environment on the DataStudio page, specify the table whose data you want to query in the project_name.table_name format.

In a workspace that is upgraded from the basic mode to the standard mode, you must first request the permissions of the producer role and then specify the table in the same project_name.table_name format. For more information about how to request the permissions, see Request permissions on tables.
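
For example, assuming the production MaxCompute project is named myproject and the associated development project is named myproject_dev, the following query reads production data from the development environment; the table name is hypothetical:

    -- Run on the DataStudio page of the development environment.
    SELECT * FROM myproject.sale_detail LIMIT 10;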

How do I query historical operational logs on the DataStudio page?

Click the Operational history icon in the left-side navigation pane of the DataStudio page. In the Operational history pane, you can view the historical operational logs.

How long are operational logs on the DataStudio page retained?

By default, operational logs on the DataStudio page are retained for three days.

Note

For more information about the retention period of logs and instances in Operation Center of the production environment, see How long are the logs of resource groups for scheduling and node instances that are run on such resource groups retained?.

How do I perform operations on multiple nodes, resources, or functions at a time?

Go to the DataStudio page and click the Batch Operation icon in the Scheduled Workflow pane. On the Batch Operation-Data Development tab, you can perform the desired operation on multiple nodes, resources, or functions at a time. Then, you can commit the objects on which you perform the operation at a time and deploy the objects on the Create Deploy Task page to make the modifications take effect.


How do I change resource groups for scheduling for multiple nodes in a workflow at a time on the DataStudio page?

Find the desired workflow on the DataStudio page, move the pointer over the workflow name, and then click the icon on the right side of the workflow name. On the tab that appears, select the nodes for which you want to change resource groups for scheduling and click Switch Resource Groups. After you change the resource groups for the nodes, click the Submit icon in the top toolbar to commit the nodes at a time. Then, deploy the nodes on the Create Deploy Task page at a time to make the modifications take effect in the production environment.

What do I do if an error is reported when I connect Power BI to MaxCompute?

Power BI cannot connect directly to MaxCompute. We recommend that you connect Power BI to Hologres and use Hologres to query MaxCompute data. For more information, see Endpoints for connecting to Hologres.

When I call a DataWorks API operation, the following error message appears: access is forbidden. Please first activate DataWorks Enterprise Edition or Flagship Edition. What do I do?

Activate DataWorks Enterprise Edition. For more information, see Overview.

How do I disable the MaxCompute Query Acceleration (MCQA) feature if I want to obtain the instance ID that is used to download more than 10,000 data records?

To obtain the instance ID that is required to download the data records, you must disable the MCQA feature.

Note

DataWorks allows you to download a maximum of 10,000 data records. If you want to download more than 10,000 data records of an ODPS SQL node, you must use a Tunnel command.

Add set odps.mcqa.disable=true; to the code of the ODPS SQL node and execute it together with the SELECT statements.
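
For example, the following statements disable MCQA for a query so that a regular instance ID is generated; the table name is a hypothetical placeholder:

    set odps.mcqa.disable=true;
    SELECT * FROM your_table;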