Compute Engine Overview for Data Development - Dataphin

Before you create a project space for data development, you must configure the compute engine for your Dataphin instance. After the compute engine is configured, you can add the corresponding compute resources to the project space, which provides the computing and storage resources that the project space needs. This topic describes the compute engines that Dataphin supports.

Permissions

Only super administrators and system administrators can configure the compute engine.

Billing

To configure a real-time computing engine, you must first purchase and enable the real-time module.

The agile development version does not support the configuration of the real-time computing engine.

Limits

Changing the compute engine type for the metadata warehouse after initial setup may result in incorrect metadata operations for the business tenant. Consult with the Dataphin operations team before making changes to the metadata warehouse compute engine type.
When you change offline computing engine settings, the system synchronizes the change to the compute source configuration but does not check the connectivity of the compute source. Make sure that the configuration is accurate to prevent node failures, and manually verify compute source connectivity after the change.
Configuration changes will take effect within 30 seconds. During synchronization, viewing the compute source configuration may display inconsistencies, and SQL execution may still use the previous settings.
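
The manual verification recommended above can be partly scripted. The following is a minimal sketch, not a Dataphin API: it waits out the 30-second synchronization window and then checks only whether a TCP connection to the compute cluster endpoint can be opened. The function names, the host, and the port are hypothetical placeholders, and a successful TCP connection does not confirm that credentials or engine-specific settings are correct.

```python
import socket
import time

# From the limit above: configuration changes take effect within 30 seconds.
SYNC_WINDOW_SECONDS = 30


def endpoint_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def verify_after_change(host: str, port: int,
                        wait_seconds: float = SYNC_WINDOW_SECONDS,
                        sleep=time.sleep) -> bool:
    """Wait for the synchronization window, then check reachability.

    host and port are assumptions: use the endpoint of your own
    compute cluster. This is a network-level check only.
    """
    sleep(wait_seconds)
    return endpoint_reachable(host, port)


if __name__ == "__main__":
    # Hypothetical endpoint, for illustration only.
    ok = verify_after_change("compute-cluster.example.internal", 10000)
    print("reachable" if ok else "unreachable")
```

A check like this catches only network-level problems. Running a test query against the compute source from within Dataphin remains the more reliable confirmation.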

Supported compute engines

Before you use Dataphin, complete the compute engine settings for your instance by configuring the compute cluster endpoint. You can then create compute sources based on this cluster. Dataphin supports the following compute engines:

Note

If no offline compute source exists, you can change both the compute engine type and its configuration in the compute settings. If an offline compute source exists, you can modify only the compute settings, not the compute engine type.

Once the metadata warehouse compute engine is initialized, only the supported compute engines for the current metadata warehouse tenant are selectable.

Compute Engine	Description	References
Offline Computing Engine
MaxCompute	Alibaba's native big data computing platform, offering massive data storage and processing capabilities with high efficiency and stability.	Configure the Dataphin instance compute engine to MaxCompute
AnalyticDB for PostgreSQL	An OLAP-focused analytic database, offering a petabyte-scale, high-concurrency, real-time data warehouse with seamless scalability for extensive data computation.	Configure the Dataphin instance compute engine to AnalyticDB for PostgreSQL
E-MapReduce3.x Hadoop and E-MapReduce5.x Hadoop	An open-source Hadoop cluster on Alibaba Cloud ECS, leveraging Alibaba Cloud E-MapReduce (EMR) for robust data processing.	Configure the Dataphin instance compute engine to Hadoop
CDH5.x Hadoop and CDH6.x Hadoop	A globally recognized distributed system infrastructure, featuring HDFS and MapReduce for extensive data storage and processing.
Cloudera Data Platform 7.x	CDP represents the combined strengths of Cloudera's CDH and Hortonworks' HDP, post-merger.
Huawei FusionInsight 8.x Hadoop	A big data platform by Huawei, enhancing Apache open-source software for comprehensive data storage, query, and analysis.
AsiaInfo DP5.3 Hadoop	An integrated big data support platform, built on an open-source ecosystem and leveraging carrier-grade capabilities.
Transwarp ArgoDB	Transwarp ArgoDB is a distributed analytic database from Transwarp Technology. Note: Transwarp ArgoDB is not supported by the intelligent development version.	Configure the Dataphin instance compute engine to TDH or ArgoDB
Transwarp TDH 6.x	Transwarp Data Hub (TDH) is Transwarp's comprehensive big data platform.
StarRocks	StarRocks is a high-performance analytic data warehouse, utilizing vectorization, MPP architecture, CBO, intelligent materialized views, and a real-time updatable columnar storage engine for multidimensional, real-time, high-concurrency data analysis.	Use StarRocks as the metadata warehouse compute engine for initialization
Lindorm (Compute Engine)	Lindorm is Alibaba Cloud's cloud-native multi-model database product, with a compute engine mode that supports offline big data applications.	Set the Dataphin compute engine to Lindorm (Compute Engine)
GaussDB (DWS)	GaussDB (DWS) is a distributed relational database from Huawei. It is based on PostgreSQL and is compatible with Oracle, MySQL, and Teradata syntax.	Set the Dataphin instance compute engine to GaussDB (DWS)
Databricks	Databricks is a unified data analytics platform based on Apache Spark. It provides managed Spark clusters, an interactive notebook environment, and seamless integration with cloud storage to support high-volume data processing and large-scale analytics.	Set the Dataphin instance compute engine to Databricks
Amazon EMR	Amazon EMR is a managed Hadoop big data cluster platform that provides big data computing capabilities such as Hive and Spark.	Set the Dataphin instance compute engine to Amazon EMR
SelectDB	SelectDB Enterprise is the commercial version of Apache Doris from SelectDB.	Set the Dataphin instance compute engine to SelectDB or Doris
Doris	Apache Doris is a high-performance, real-time analytic database based on a massively parallel processing (MPP) architecture.
Real-time Computing Engine
Alibaba Cloud Realtime Compute for Apache Flink	Alibaba Cloud's next-generation compute engine, Flink, supports real-time computing with high throughput and low latency, and is also capable of offline computing and scheduling.	Once the tenant enables the real-time development module, the system will suggest settings based on the offline computing engine selection, which can be customized. For instructions on enabling real-time development, see tenant settings.
Apache Flink	Apache Flink is a distributed processing engine for stateful computations on both unbounded and bounded data streams.
FusionInsight Flink	FusionInsight Flink is a real-time computation and analysis engine for high-speed data streams, based on Apache Flink.
Blink Exclusive	Blink is Alibaba Cloud's exclusive real-time computing engine. Important: This version is no longer available on the public cloud. Select it with caution.