Dataphin: Debug real-time tasks

Last Updated: Feb 08, 2025

Dataphin supports debugging developed real-time task code locally or on a Session cluster, with debug data that is sampled automatically or uploaded manually, to help you verify that the task code is correct and avoid human errors and omissions. This topic describes how to debug real-time tasks.

Debugging methods

  • Local debugging method: Debugging does not run on a cluster, and the debug data is non-streaming. This method is relatively fast, but data must be uploaded or entered manually, and automatic sampling is supported only for specific data sources.

  • Session cluster debugging method: Debugging runs on a Session cluster and uses real online streaming data; when data is written into the source table, the computed result for that data is output immediately, consistent with the task running online. The Session cluster lets you view the Flink task's status, logs, and output results in real time, so you can verify the task's correctness by observing its behavior and output, and iteratively modify and re-debug the task code to quickly locate and resolve problems.

    Note

    The debugging results of the Session cluster debugging method will not be written into the result table.
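
As a point of reference, the following is a minimal sketch of the kind of Flink SQL task that these methods debug. The table and field names (order_source, order_summary, shop_id, amount) are hypothetical placeholders for tables registered in your project:

  -- A hypothetical real-time task that aggregates order amounts per shop.
  -- With local debugging, order_source is fed from uploaded or sampled
  -- non-streaming data; with Session cluster debugging, it reads real
  -- online streaming data and emits results as records arrive.
  INSERT INTO order_summary
  SELECT shop_id, SUM(amount) AS total_amount
  FROM order_source
  GROUP BY shop_id;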

Limits

  • For the Blink engine, only local debugging is supported, and it requires engine version 3.6.0 or later.

  • DataStream tasks are not supported for debugging.

  • The Session cluster debugging method is currently available only to customers who use the open-source Flink engine and whose deployment is based on the latest architecture. For details, contact the product operations team.

Debug task operation entry

  1. On the Dataphin home page, click Development in the top menu bar.

  2. Select the task to be debugged and open the task's Debug Configuration dialog box.

    Currently, only single-mode debugging is supported. After you select a mode, the table data for that mode is sampled for debugging.

    • Real-time Mode Debugging: Samples data from the corresponding real-time physical tables. After data sampling is complete, you can perform local debugging in Flink Stream mode or Session cluster debugging. For specific operations, see Real-time mode debugging.

    • Offline Mode Debugging: Samples data from the corresponding offline physical tables. After data sampling is complete, you can perform local debugging in Flink Batch mode. For specific operations, see Offline mode debugging.

Real-time mode debugging

  1. In the Debug Configuration dialog box, on the Select Sampling Mode tab, select Real-time Mode - Flink Stream Task.

  2. Click Next.

  3. In the Debug Configuration dialog box, select the debug data source.

    • Manually upload data (local debugging method)

      Debug with manually provided data through the local debugging method. You can prepare the data by manually uploading sample data files, manually entering data, or using automatic data sampling.

      • Manually upload sample data files

        You can upload local data as debug input. Before uploading, download the sample file first. The sample is a CSV template generated by Dataphin that automatically identifies the tables the task reads and writes, together with their schema information. Edit the data to be uploaded based on the downloaded sample. After you click Upload, the data is automatically filled into the Metadata Sampling area.
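
        For illustration, if the task reads a hypothetical source table with fields order_id, shop_id, and amount, the downloaded CSV sample might look like the following (the actual columns follow the schema information that Dataphin identifies for your tables):

          order_id,shop_id,amount
          1001,s_001,25.50
          1002,s_002,13.00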

      • Manually enter data

        This is suitable for scenarios where the amount of data to collect is small or where the collected data needs to be modified.

      • Automatic data sampling

        The automatically sampled data is random, so it is suitable for scenarios where there are no restrictions on the collected data. Automatic data sampling is supported for HBase, MySQL, MaxCompute, DataHub, and Kafka data sources. You can click Automatic Sampling to sample data.

        Note
        • Kafka supports automatic sampling of json, csv, canal-json, maxwell-json, and debezium-json data formats.

        • Kafka automatic sampling only supports no authentication and username + password authentication methods, and does not support SSL.

        • During Kafka automatic sampling, you can select the data range to read; a maximum of 100 records is sampled.
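
        For example, with the json format, each sampled Kafka record is parsed as a JSON object whose keys map to the source table's fields. A hypothetical sampled record (field names are placeholders) might look like the following:

          {"order_id": 1001, "shop_id": "s_001", "amount": 25.50}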

  4. After completing metadata sampling for all data tables, click OK.

  5. On the Result page, you can view the Debugging Results, whether the task was debugged with manually uploaded data (local debugging method) or with collected online data (Session cluster debugging method).

Offline mode debugging

  1. In the Debug Configuration dialog box, on the Select Sampling Mode tab, select Offline Mode - Flink Batch Task.

  2. Click Next.

  3. In the Debug Configuration dialog box, select the debug data source.

    • Manually upload data (local debugging method)

      Debug with manually provided data through the local debugging method. You can prepare the data by manually uploading sample data files, manually entering data, or using automatic data sampling.

      • Manually upload sample data files

        You can upload local data as debug input. Before uploading, download the sample file first. The sample is a CSV template generated by Dataphin that automatically identifies the tables the task reads and writes, together with their schema information. Edit the data to be uploaded based on the downloaded sample. After you click Upload, the data is automatically filled into the Metadata Sampling area.

      • Manually enter data

        This is suitable for scenarios where the amount of data to collect is small or where the collected data needs to be modified.

      • Automatic data sampling

        The automatically sampled data is random, making it ideal for scenarios without specific data requirements. Automatic data sampling is available for HBase, MySQL, MaxCompute, DataHub, and Kafka data sources. Click Automatic Sampling to proceed with data sampling.

        Note
        • Kafka supports automatic sampling for json, csv, canal-json, maxwell-json, and debezium-json data formats.

        • Kafka's automatic sampling is compatible with no authentication and username + password authentication methods, but does not support SSL.

        • Kafka's automatic sampling allows for data range selection during sampling, with a maximum of 100 records.

  4. Once metadata sampling for all data tables is complete, click OK at the bottom of the page.

  5. On the Result page, you can review the Debug Data, Intermediate Results, and Debugging Results.

Appendix: Automatically sampled debug data

When using automatic sampling for local debugging, the debug data is determined by the meta table configuration and by how the task references the table. Consider the following:

  • When the meta table's Default Read property is set to Development Table during task debugging:

    • If the task references Project_Name_dev.meta_table_name, the development meta table is sampled. If the data source does not contain a development meta table, Automatic Sampling is not available.

    • If the task references Project_Name.meta_table_name, the production meta table is sampled. If you do not have permissions on the table in the production environment, an error is reported, and you must request access to the production table. For more information, see apply for table permission.

    • If the task references ${Project_Name}.meta_table_name or meta_table_name, the development meta table is sampled. If a development meta table is not available, Automatic Sampling is not available.

  • When the meta table's Default Read property is set to Production Table during task debugging:

    • If the task references Project_Name_dev.meta_table_name, the development meta table is sampled. If the data source does not contain a development meta table, Automatic Sampling is not available.

    • If the task references Project_Name.meta_table_name, the production meta table is sampled.

    • If the task references ${Project_Name}.meta_table_name or meta_table_name, the system replaces the ${Project_Name} variable based on the parameter settings. The actual project specified in the parameters (development or production) determines whether the development or production meta table is sampled. If ${Project_Name} is not set, the production meta table is sampled by default.
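
To illustrate the reference forms, assume a hypothetical project named sales_dw and a meta table named orders. In task code, the forms described above would look like the following:

  SELECT * FROM sales_dw_dev.orders;      -- always samples the development meta table
  SELECT * FROM sales_dw.orders;          -- always samples the production meta table
  SELECT * FROM ${Project_Name}.orders;   -- resolved based on Default Read and parameter settings
  SELECT * FROM orders;                   -- same resolution rules as the variable form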