This article discusses the main differences and similarities between the big data services of Azure and Alibaba Cloud. It covers the following service types and products:
1. Data computing
2. Data processing
The comparison covers the products shown in the table below.
|Service type||Azure||Alibaba Cloud|
|Data computing||Azure HDInsight||Alibaba Cloud MaxCompute|
|Data processing||Azure Data Factory, Azure Data Catalog||Alibaba Cloud DataWorks|
After data is collected into the corresponding cloud environment, these products can convert it, filter it, and then run computations on it based on your needs.
The following table compares the basic functions and terminology of Azure HDInsight and Alibaba Cloud MaxCompute:
|Function||Alibaba Cloud MaxCompute||Azure HDInsight|
|Data channel||Tunnel upload/download, with SDK-based plug-ins: DTS, Sqoop, Kettle, CLT. DataHub real-time transfer, with SDK-based plug-ins: OGG, Flume, Logstash, Fluentd||Kafka|
|Data storage||File compression storage, RaidFile mechanism||Azure Blob container|
|Calculation & Analysis task||SQL (Hive-like SQL), UDF||Supported|
|Unstructured data processing||Supported|
|System security||Rights management model (project space user and authorization management, resource sharing across project spaces, project space data protection, project space security configuration); package resource sharing; LabelSecurity access control||Protect enterprise data assets with Azure Virtual Network, encryption, and integration with Azure Active Directory. Meet the most popular industry and government compliance standards.|
|Open Source Ecology||API; log import tools: Fluentd, Flume; open-source code: R, Sqoop, OGG, Eclipse, JDBC Driver||Hadoop, Spark, LLAP, Kafka, Storm, HBase, ML Services|
|Maximum size||Single cluster: 10,000+; multiple clusters supported||Hadoop/HBase cluster|
|High availability||Storage and scheduling systems are highly available, with no single point of failure||HDInsight cluster provides two head nodes|
Azure HDInsight
Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform (HDP). Azure HDInsight makes it easy, fast, and cost-effective to process massive amounts of data. You can use the most popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R, and more. With these frameworks, you can enable a broad range of scenarios such as extract, transform, and load (ETL), data warehousing, machine learning, and IoT. Azure HDInsight is a fully managed, full-spectrum, open-source analytics service for enterprises.
Alibaba Cloud MaxCompute
Alibaba Cloud MaxCompute is the largest big data cloud service platform in China. It provides massive data storage, massive data computing, and data exchange among multiple organizations. MaxCompute is a large-scale distributed computing system developed independently by Alibaba Group. It supports multi-cluster dual-active disaster recovery, so you do not need to worry about infrastructure stability and can concentrate on your own business. MaxCompute also ensures data consistency and service continuity. It provides a comprehensive set of big data development tools and data import and export solutions, as well as various classic distributed computing models, to solve massive data computing problems quickly, reduce enterprise costs effectively, and safeguard data security.
- Cloud native: Azure HDInsight enables you to create optimized clusters for Hadoop, Spark, Interactive query (LLAP), Kafka, Storm, HBase, and ML Services on Azure. HDInsight also provides an end-to-end SLA on all your production workloads.
- Low-cost and scalable: HDInsight enables you to scale workloads up or down. You can reduce costs by creating clusters on demand and paying only for what you use.
- Secure and compliant: HDInsight enables you to protect your enterprise data assets with Azure Virtual Network, encryption, and integration with Azure Active Directory. HDInsight also meets the most popular industry and government compliance standards.
- Monitoring: Azure HDInsight integrates with Azure Log Analytics to provide a single interface with which you can monitor all your clusters.
- Productivity: Azure HDInsight enables you to use rich productive tools for Hadoop and Spark with your preferred development environments. These development environments include Visual Studio, VSCode, Eclipse, and IntelliJ for Scala, Python, R, Java, and .NET support.
- Extensibility: You can extend the HDInsight clusters with installed components (Hue, Presto, and so on) by using script actions, by adding edge nodes, or by integrating with other big data certified applications.
The underlying architecture of Azure HDInsight is based on open-source Hadoop, Spark, and other products. MaxCompute is optimized for high-concurrency processing and execution planning, and it performs better in scenarios such as CPU- and I/O-sensitive computation and large-volume join computation. When the amount of data and the amount of resources are scaled up in the same ratio, the computation time of MaxCompute remains more stable: it makes full use of the allocated computing resources, so computing capacity grows linearly with the resource quantity.
With the same amount of data and resources, the same test set, and the same standard conditions, the overall performance of MaxCompute is better. In addition, MaxCompute is highly productized and easier to use.
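The linear-scaling claim can be made concrete with a small idealized model. The sketch below is not a MaxCompute benchmark: the function name, the throughput figure, and the baseline sizes are all hypothetical, and it simply assumes that every worker contributes the same throughput.

```python
def ideal_runtime_seconds(data_gb: float, workers: int, gb_per_worker_second: float) -> float:
    """Runtime under an idealized linear-scaling model (an assumption, not a measured figure)."""
    if workers <= 0 or gb_per_worker_second <= 0:
        raise ValueError("workers and throughput must be positive")
    return data_gb / (workers * gb_per_worker_second)


# Hypothetical baseline: 1,000 GB on 10 workers, each processing 0.5 GB per second.
baseline = ideal_runtime_seconds(1000, 10, 0.5)

# Scale data and resources by the same factor k; the modeled runtime stays constant.
scaled = [ideal_runtime_seconds(1000 * k, 10 * k, 0.5) for k in (1, 2, 4, 8)]

print(baseline, scaled)
```

In a real system, scheduling overhead and data skew push the runtime above this ideal, so the model only shows what "stable computation time under proportional scaling" means.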
In summary, in data computing, MaxCompute has the following advantages over Azure HDInsight:
- Fast computing and excellent performance
- Hyperscale computing and storage
- Support for multiple computing engines
- Support for multi-cluster and cross-cluster computing
- Integrated big data development environment
- Dramatically reduced enterprise usage costs
- High stability and security
Data processing covers data transfer, data conversion, and other related operations: data is brought in from different data sources, then transformed and processed. Finally, the data is extracted to other data systems, which completes the entire process of data acquisition, conversion, development, and analysis.
The following table compares the basic functions and terminology of Azure Data Factory and Azure Data Catalog with those of Alibaba Cloud DataWorks:
|Function||Property||Azure Data Factory||Azure Data Catalog||Alibaba Cloud DataWorks|
|Data acquisition||Real-time acquisition||Not supported||N/A||Supported|
|Data acquisition||Local data||Supported (via deployment of proxy gateways)||N/A||Supported|
|Data acquisition||Heterogeneous data sources||Azure storage, databases, files||N/A||Over 20 supported (RDBMS, NoSQL, MPP, unstructured storage, big data storage, etc.)|
|Data management||Data discovery||N/A||Supported||Supported|
|Data management||Version management||N/A||Not supported||Not supported|
|Data management||Capturing schema changes||N/A||Not supported||Not supported|
|Data management||Automatic identification detection||N/A||Not supported||Not supported|
|Data management||Collecting/structuring tags||N/A||Supported||Not supported|
|Data Conversion & Development||Automatic code generation||Not supported||N/A||Not supported|
|Data Conversion & Development||Online editing||Not supported||N/A||Supported|
|Data Conversion & Development||Version management||Not supported||N/A||Supported|
|Data Conversion & Development||Mode||Based on computing engines (HDInsight, Data Lake Analytics U-SQL, Machine Learning, R)||N/A||Based on computing engines (ODPS SQL, Shell, PAI)|
|Orchestrating and Task Scheduling||Trigger mode||Cycle||N/A||Cycle, API trigger|
|Monitoring & Alarm||Monitor dashboard||Supported||N/A||Supported|
|Data quality||Offline monitoring||Not supported||Not supported||Supported|
|Data quality||Online monitoring||Not supported||Not supported||Supported|
|Data quality||Self-defined monitoring rules||Not supported||Not supported||Supported|
Azure Data Factory
Data Factory, the Azure data integration and development tool, has been available for a long time and integrates data acquisition, data development, and task monitoring capabilities.
In the second half of 2017, Data Factory released its V2 version, which restructured the functional model, added visual drag-and-drop editing and complex flow control, and enhanced task monitoring. Its capabilities and user experience in complex scenarios improved considerably.
Azure Data Factory is a cloud-based data integration service that lets you create data-driven workflows in the cloud to coordinate and automate data movement and transformation. You can use Azure Data Factory to perform the following tasks:
- Create and schedule data-driven workflows (called pipelines), so that data can be ingested from different data stores.
- Use computing services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning to process or transform data.
- Output data to a data store (for example, Azure SQL Data Warehouse) for business intelligence (BI) applications.
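The three tasks above map naturally onto a pipeline with an ingest, a transform, and an output activity. The following is a minimal sketch of how such a pipeline definition might be assembled and serialized as JSON. It follows the general shape of a Data Factory V2 pipeline document as we understand it (`activities`, `type`, `dependsOn`), but the pipeline and activity names are hypothetical and each activity is heavily abbreviated, so check the official schema before relying on any field.

```python
import json


def activity(name, activity_type, depends_on=()):
    """Build an abbreviated activity entry.

    A real definition would also carry inputs, outputs, and typeProperties;
    they are omitted here on purpose.
    """
    return {
        "name": name,
        "type": activity_type,
        "dependsOn": [
            {"activity": upstream, "dependencyConditions": ["Succeeded"]}
            for upstream in depends_on
        ],
    }


pipeline = {
    "name": "ExamplePipeline",  # hypothetical name
    "properties": {
        "activities": [
            activity("IngestFromStore", "Copy"),
            activity("TransformWithHive", "HDInsightHive", depends_on=["IngestFromStore"]),
            activity("LoadToWarehouse", "Copy", depends_on=["TransformWithHive"]),
        ]
    },
}

pipeline_json = json.dumps(pipeline, indent=2)
print(pipeline_json)
```

Keeping the definition as plain data means it can be stored in version control and reviewed like any other text file.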
Azure Data Catalog
Azure Data Catalog is designed to help enterprises make the most of existing information assets.
Data Catalog helps users who manage data to discover and understand data sources more easily. It provides a cloud-based service in which you can register data sources: the data stays in its existing location, while a copy of its metadata is added to Data Catalog along with a reference to the data source location. This metadata is also indexed so that each data source can be found easily through search and understood by the users who find it.
After data sources are registered, users can enrich their metadata. Every user can annotate a data source with descriptions, tags, or other metadata (such as documentation on how to request access to the data source). This descriptive metadata supplements the structured metadata, such as column names and data types, that is registered from the data source.
The primary purpose of registering data sources is to make them, and their purpose, discoverable and understandable. Enterprise users may need data for business intelligence, application development, data science, or anything else. They can use Azure Data Catalog to quickly find data that matches their needs and learn about it. The data can then be used by opening the data source in their selected tool.
Users can also contribute to Azure Data Catalog by tagging, documenting, and annotating registered data sources. They can also register new data sources, which can then be found and used by the Azure Data Catalog community.
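The register, annotate, and search workflow described above can be sketched as a tiny in-memory catalog. This is not the Azure Data Catalog API: `MiniCatalog`, `DataSourceEntry`, and the example connection string are hypothetical names used only to illustrate that the catalog stores metadata plus a location reference (never the data itself) and indexes that metadata for search.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DataSourceEntry:
    name: str
    location: str   # reference to where the data actually lives
    columns: dict   # structured metadata: column name -> data type
    description: str = ""
    tags: set = field(default_factory=set)


class MiniCatalog:
    """Hypothetical in-memory catalog; stores metadata only, never the data."""

    def __init__(self):
        self._entries = {}
        self._index = defaultdict(set)  # lowercase term -> data source names

    def _index_terms(self, name, text):
        for term in text.lower().replace("_", " ").split():
            self._index[term].add(name)

    def register(self, name, location, columns):
        entry = DataSourceEntry(name, location, dict(columns))
        self._entries[name] = entry
        self._index_terms(name, name)
        for column in columns:
            self._index_terms(name, column)
        return entry

    def annotate(self, name, description="", tags=()):
        entry = self._entries[name]
        if description:
            entry.description = description
            self._index_terms(name, description)
        for tag in tags:
            entry.tags.add(tag)
            self._index_terms(name, tag)

    def search(self, term):
        return sorted(self._index.get(term.lower(), set()))


catalog = MiniCatalog()
catalog.register(
    "sales_orders",
    "sqlserver://example-host/sales/orders",  # hypothetical location
    {"order_id": "int", "amount": "decimal"},
)
catalog.annotate("sales_orders", description="Daily order facts for BI", tags=["finance"])
print(catalog.search("finance"))
```

Because descriptions and tags are indexed alongside column names, a search can match either the structured metadata or the knowledge that users added later.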
Alibaba Cloud DataWorks
- Product positioning: one-stop big data platform, covering data integration, data management, data development, data operation, data service sharing, data security, data quality, and other stages of the big data lifecycle
- Methodology: cloud data warehouse, stream computing
- Target users: data developers (data integration, data development, data operation), data managers (data management, data security, data quality), data users (data management, data service, real-time analytics)
- How to use: web-based
- Deployment approach: public cloud serverless, proprietary cloud
- Development languages: SQL, Java (OpenMR), Python, R, etc.
- Service level: public beta (Data Integration is officially commercial)
- Underlying engines: MaxCompute, Blink
Advantages of Azure Data Factory
- Rigorous conceptual model. Azure Data Factory abstracts all possible objects and behaviors in data processing and establishes a self-consistent system and methodology. There is virtually no possibility of ambiguity, and it is easy to extend functionality in the future.
- Rich ecosystem. Data Factory abstracts the supported data sources and processing engines as linked service objects; the set of linked services supported differs from activity to activity. According to the official documentation, it supports 68 different data sources for data movement and eight different processing engines for transformation.
- Unified user experience. As an Azure “window”, Data Factory offers a user experience consistent with other Azure products; you do not even need to open a new browser window or tab (a page can contain multiple windows).
- Full support for text-based operations. All object definitions are written in JSON, and all interface operations can be run through the corresponding Azure PowerShell commands. Users can leave the browser entirely and save their work as text.
Disadvantages of Azure Data Factory
- Online editing of activities is not supported. All activity types, especially transformation activities, require uploading scripts or defining stored procedures, which results in a poor user experience.
- Only pipeline-level triggers are supported. That is, within a pipeline, you cannot define time requirements for an individual activity. As soon as its dependsOn property is satisfied, the activity is executed.
- Weak monitoring capability. Pipeline monitoring is based entirely on Azure Monitor, and there is no dedicated monitoring of data quality.
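The dependency-only scheduling described in the list above can be illustrated with a short simulation. The sketch below is not Data Factory code: it assumes activities are plain dictionaries with a `name` and a `dependsOn` list, uses hypothetical activity names, and relies on Python's standard `graphlib` module (Python 3.9 or later) to work out which activities become runnable once their upstream activities finish.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_waves(activities):
    """Group activities into waves; every activity in a wave has all of its
    dependsOn entries completed by earlier waves."""
    graph = {
        act["name"]: {dep["activity"] for dep in act.get("dependsOn", [])}
        for act in activities
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if dependencies form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical pipeline: two independent copies, then a join, then an export.
activities = [
    {"name": "CopyOrders"},
    {"name": "CopyCustomers"},
    {"name": "JoinInHive", "dependsOn": [{"activity": "CopyOrders"}, {"activity": "CopyCustomers"}]},
    {"name": "Export", "dependsOn": [{"activity": "JoinInHive"}]},
]

waves = execution_waves(activities)
print(waves)
```

Nothing in this model lets an activity wait for a particular time of day, which is exactly the limitation noted above: timing can only be attached to the pipeline as a whole.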
Complete enterprise-class metadata management
Data Catalog builds on Azure’s experience in enterprise-class data management. It integrates with Azure AD to simplify the management of corporate organization and staff privileges. Data Catalog manages metadata permissions by ownership, annotation, registration, visibility, and terms of use, and thereby standardizes the description of asset objects and asset attributes. All of these properties suit enterprise-class collaboration scenarios and together constitute a more complete solution.
Data knowledge sharing and managing
Data Catalog manages not only metadata but also metadata-related knowledge:
- You can set a friendly name for an asset object that is easy to identify.
- For asset objects and attributes, you can set a comment, tag, or term.
- Experts can be assigned to asset objects to associate them with people.
- For asset objects, you can write text-formatted documents.
- Anyone with annotation privileges can write comments, tags, and terms.
Data Profile
When an asset object is registered, Data Catalog collects a Data Profile, which contains statistical information that reflects the characteristics of the data, so that users can gain an intuitive understanding of the data content.
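To give a sense of what such a profile might contain, here is a minimal sketch that computes a few per-column statistics over in-memory rows. The source does not list the exact statistics that Data Catalog collects, so the fields below (row count, null count, distinct count, minimum, maximum), the function names, and the sample rows are illustrative assumptions.

```python
def profile_column(values):
    """Summarize one column; None is treated as a missing value."""
    present = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(present),
        "distinct_count": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


def profile_table(rows):
    """Profile every column found in a list of row dictionaries."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical sample rows.
rows = [
    {"order_id": 1, "amount": 19.5},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 7.0},
    {"order_id": 3, "amount": 19.5},
]

profile = profile_table(rows)
print(profile["amount"])
```

Even this small summary tells a reader that `amount` has missing values and that `order_id` contains duplicates, without opening the underlying data source.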
- In terms of interface interaction, Data Catalog maintains a good user experience with an informative and friendly user interface, but several aspects restrict the onboarding of new users:
  - It is open to corporate or school Azure accounts only.
  - You must subscribe to an Azure ready-to-use package. Although a free edition of Data Catalog itself is available, subscribing to this package forfeits free usage of other Azure products.
  - The data source import tool must be run on a 64-bit Windows operating system; macOS is not supported.
- Data Catalog is more independent than other Azure products. Functionally, Azure Data Catalog focuses on managing data catalogs and the associated knowledge and has no link with Data Factory, so its application scenarios are limited. Interaction with other products presupposes that the data pipeline is seamlessly integrated, that the delay in data transmission is small enough, and that the likelihood of compatibility problems is low enough.
In summary, in the data warehouse and data business process areas, the advantages of DataWorks are:
- Data Integration: support for streaming control and real-time synchronization.
- Data Development: powerful online editing capabilities, with an experience comparable to an offline IDE.
- Monitoring Operations: support for business baseline monitoring.
- Data Management: complete data management capabilities, plus unique functions such as classification and data desensitization.
- Data Quality: features that are unique among competitors.