Big Data

Last Updated: Aug 13, 2018

Alibaba Cloud for Azure Professionals

This article discusses the main differences and similarities between Azure and Alibaba Cloud in big data services. We mainly discuss the following service types and products:

1. Data computing

2. Data processing

The comparison covers the products as shown in the table below.

| Feature | Azure | Alibaba Cloud |
| --- | --- | --- |
| Data computing | Azure HDInsight | Alibaba Cloud MaxCompute |
| Data processing | Azure Data Factory, Azure Data Catalog | Alibaba Cloud DataWorks |

1. Data computing

After data has been collected into the cloud environment, these products can convert and filter it, and then run computations on it according to your needs.

1.1 Service comparison

The following table compares the basic functions and terminologies of Azure HDInsight vs Alibaba Cloud MaxCompute:

| Function | Alibaba Cloud MaxCompute | Azure HDInsight |
| --- | --- | --- |
| Data channel | Tunnel upload/download (plug-ins developed on the SDK: DTS, Sqoop, Kettle, CLT); DataHub real-time transfer (plug-ins developed on the SDK: OGG, Flume, Logstash, Fluentd) | Kafka |
| Data storage | Compressed file storage; RaidFile mechanism | Azure Blob container |
| Calculation & analysis tasks | SQL (Hive-like SQL), UDF | Supported |
| | MapReduce | Supported |
| | Graph | Not supported |
| | Unstructured data processing | Supported |
| | Spark | Supported |
| | Elasticsearch | N/A |
| | BigGraph | N/A |
| System security | Rights management model: project space user and authorization management; resource sharing across project spaces; project space data protection; project space security configuration; ACL authorization; policy authorization; package resource sharing; LabelSecurity access control | Protects enterprise data assets with Azure Virtual Network, encryption, and integration with Azure Active Directory. Meets the most popular industry and government compliance standards. |
| Open source ecosystem | API; SDK: Python, Java; log import tools: Fluentd, Flume; clients: CLT, Studio; open source code: R, Sqoop, OGG, Eclipse, JDBC driver | Hadoop, Spark, LLAP, Kafka, Storm, HBase, ML Services |
| Maximum size | Single cluster: 10,000+; multiple clusters supported | Hadoop/HBase cluster |
| Elastic scaling | Supported | Supported |
| Hot fix | Supported | N/A |
| Quasi-real-time | Supported | N/A |
| High availability | Storage and scheduling systems are highly available, with no single point of failure | An HDInsight cluster provides two head nodes |

1.2 Product comparison overview

Azure HDInsight

Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform (HDP). Azure HDInsight makes it easy, fast, and cost-effective to process massive amounts of data. You can use the most popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R, and more. With these frameworks, you can enable a broad range of scenarios such as extract, transform, and load (ETL), data warehousing, machine learning, and IoT. Azure HDInsight is a fully managed, full-spectrum, open-source analytics service for enterprises.

Alibaba Cloud MaxCompute

Alibaba Cloud MaxCompute is the largest big data cloud service platform in China. It provides massive data storage, massive data computing, and data exchange among multiple organizations. MaxCompute is a large distributed computing system independently developed by Alibaba Group. It supports multi-cluster dual-active disaster recovery, so you do not need to worry about infrastructure stability and can concentrate on your own business. MaxCompute also ensures data consistency and service continuity. It provides a comprehensive set of big data development tools that improve data import and export, as well as various classic distributed computing models that help process massive data quickly, reduce enterprise costs, and safeguard data security.

1.3 Advantage and disadvantage comparison

Azure HDInsight product advantages

  • Cloud native: Azure HDInsight enables you to create optimized clusters for Hadoop, Spark, Interactive query (LLAP), Kafka, Storm, HBase, and ML Services on Azure. HDInsight also provides an end-to-end SLA on all your production workloads.
  • Low-cost and scalable: HDInsight enables you to scale workloads up or down. You can reduce costs by creating clusters on demand and paying only for what you use.
  • Secure and compliant: HDInsight enables you to protect your enterprise data assets with Azure Virtual Network, encryption, and integration with Azure Active Directory. HDInsight also meets the most popular industry and government compliance standards.
  • Monitoring: Azure HDInsight integrates with Azure Log Analytics to provide a single interface with which you can monitor all your clusters.
  • Productivity: Azure HDInsight enables you to use rich productive tools for Hadoop and Spark with your preferred development environments. These development environments include Visual Studio, VSCode, Eclipse, and IntelliJ for Scala, Python, R, Java, and .NET support.
  • Extensibility: You can extend the HDInsight clusters with installed components (Hue, Presto, and so on) by using script actions, by adding edge nodes, or by integrating with other big data certified applications.

Azure HDInsight product disadvantages

The underlying architecture of Azure HDInsight is based on open source products such as Hadoop and Spark. MaxCompute is optimized for high-concurrency processing and execution planning, so it performs better in scenarios such as CPU- and I/O-sensitive computation and large-volume join computation. When the amount of data and the amount of resources are scaled up in the same ratio, the computation time of MaxCompute is more stable: it makes full use of the allocated computing resources, and the amount of computation completed increases linearly with the amount of resources.
With the same amount of data and resources, the same test set, and the same standard conditions, the overall performance of MaxCompute is better. In addition, MaxCompute is highly productized and easier to use.
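
The linear-scaling claim above can be checked with simple arithmetic. The following Python sketch is a minimal illustration, not a benchmark. It assumes you have measured the same job twice: once at a baseline, and once after multiplying both the data volume and the resources by the same factor. The function name and the runtimes are hypothetical placeholders.

```python
def scaling_efficiency(baseline_seconds: float, scaled_seconds: float) -> float:
    """Return the baseline runtime divided by the runtime after scaling.

    Assumes data volume and resources were both multiplied by the same
    factor between the two runs. A value close to 1.0 means the runtime
    stayed stable, which is what linear scaling predicts.
    """
    if baseline_seconds <= 0 or scaled_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / scaled_seconds


# Hypothetical measurements, for illustration only (not benchmark results).
baseline = 600.0  # seconds with 1x data on 1x resources
scaled = 640.0    # seconds with 4x data on 4x resources
efficiency = scaling_efficiency(baseline, scaled)
print(f"scaling efficiency: {efficiency:.2f}")
```

Running the same check at several scale factors shows whether the ratio stays near 1.0 or drifts downward as the workload grows.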

1.4 Conclusion

In summary, in data computing, MaxCompute has the following advantages over Azure HDInsight:

  • Fast computing and excellent performance
  • Hyperscale computing and storage
  • Support for multiple computing engines
  • Support for multi-cluster and cross-cluster computing
  • Integrated big data development environment
  • Dramatically reduced enterprise usage costs
  • High stability and security

2. Data processing

Data processing covers data transfer, data conversion, and related operations. It brings in data from different data sources, transforms and processes it, and finally extracts it to other data systems, completing the entire cycle of data acquisition, conversion, development, and analysis.

2.1 Service comparison

The following table compares the basic functions and terminologies of Azure Data Factory and Azure Data Catalog vs Alibaba Cloud DataWorks:

| Function | Property | Azure Data Factory | Azure Data Catalog | Alibaba Cloud DataWorks |
| --- | --- | --- | --- | --- |
| Data acquisition | Real-time acquisition | Not supported | N/A | Supported |
| | Batch acquisition | Supported | N/A | Supported |
| | Client acquisition | Supported | N/A | Supported |
| | Local data | Supported (deployment of proxy gateways) | N/A | Supported |
| | Cloud data | Supported | N/A | Supported |
| | Heterogeneous data sources | Azure storage, databases, files | N/A | Over 20 supported (RDBMS, NoSQL, MPP, unstructured storage, big data storage, and so on) |
| Data management | Data discovery | N/A | Supported | Supported |
| | Capture metadata | N/A | Supported | Supported |
| | Version management | N/A | Not supported | Not supported |
| | Capturing schema changes | N/A | Not supported | Not supported |
| | Automatic identification detection | N/A | Not supported | Not supported |
| | Comment | N/A | Supported | Not supported |
| | Collecting/structuring tags | N/A | Supported | Not supported |
| | Data relationship | N/A | N/A | Supported |
| Data conversion & development | Automatic code generation | Not supported | N/A | Not supported |
| | Online editing | Not supported | N/A | Supported |
| | Version management | Not supported | N/A | Supported |
| | Mode | Based on computing engines (HDInsight, Data Lake Analytics U-SQL, Machine Learning, R) | N/A | Based on computing engines (ODPS SQL, Shell, PAI) |
| Orchestration and task scheduling | Trigger mode | Cycle | N/A | Cycle, API trigger |
| | Serverless | Supported | N/A | Supported |
| | Automatic re-run | Supported | N/A | Supported |
| Monitoring & alarm | Monitor dashboard | Supported | N/A | Supported |
| | Alarm | Supported | N/A | Supported |
| Data quality | Offline monitoring | Not supported | Not supported | Supported |
| | Online monitoring | Not supported | Not supported | Supported |
| | Self-defined monitoring rules | Not supported | Not supported | Supported |
| Openness | API | Supported | Supported | Supported |
| | SDK | Supported | Supported | Not supported |

2.2 Product comparison overview

Azure Data Factory

Data Factory, Azure's data integration and development tool, has been available for a long time and integrates data acquisition, data development, and task monitoring capabilities.
In the second half of 2017, Data Factory released its V2 version, which reworked the functional model, added visual drag-and-drop editing and complex flow control, and enhanced task monitoring. Its capabilities and user experience in complex scenarios improved considerably.
Azure Data Factory is a cloud-based data integration service that lets you create data-driven workflows in the cloud to orchestrate and automate data movement and transformation. You can use Azure Data Factory to perform the following tasks:

  • Create and schedule data-driven workflows (called pipelines) that ingest data from different data stores.
  • Use computing services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning to process or transform data.
  • Output data to a data store (for example, Azure SQL Data Warehouse) for business intelligence (BI) applications.

Azure Data Catalog

Azure Data Catalog is designed to help enterprises make the most of existing information assets.
Data Catalog helps users who manage data to discover and understand data sources more easily. It provides a cloud-based service in which you can register data sources: the data stays in its existing location, while a copy of its metadata is added to Data Catalog along with a reference to the data source location. This metadata is also indexed so that each data source can be found easily through search and understood by the users who find it.
After a data source is registered, users can enrich its metadata. Any user can annotate the data source with descriptions, tags, or other metadata (such as documentation and the process for requesting access to the data source). This descriptive metadata supplements the structured metadata, such as column names and data types, that is registered from the data source.
The primary purpose of registering data sources is to let users discover and understand each data source and its purpose. Enterprise users may need data for business intelligence, application development, data science, or anything else. They can use Azure Data Catalog to quickly find data that matches their needs and learn about it, and then use the data by opening the data source in their selected tool.
Users can also contribute to Azure Data Catalog by tagging, documenting, and annotating registered data sources. They can register new data sources as well, which can then be found and used by the Azure Data Catalog community.
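
The register, annotate, and search cycle described above can be sketched as a toy in-memory model. This is not the Azure Data Catalog API. The class, method, and field names below are hypothetical and serve only to show that the catalog stores metadata plus a reference to the data's location, never the data itself.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    location: str   # reference to where the data lives; the data is not copied
    columns: dict   # structured metadata: column name -> data type
    description: str = ""
    tags: set = field(default_factory=set)


class ToyCatalog:
    """Hypothetical in-memory catalog, for illustration only."""

    def __init__(self):
        self._entries = {}

    def register(self, name, location, columns):
        """Store structured metadata and a pointer to the data source."""
        entry = CatalogEntry(name, location, dict(columns))
        self._entries[name] = entry
        return entry

    def annotate(self, name, description=None, tags=()):
        """Enrich a registered source with descriptive metadata."""
        entry = self._entries[name]
        if description is not None:
            entry.description = description
        entry.tags.update(tags)

    def search(self, keyword):
        """Return names of sources whose metadata mentions the keyword."""
        needle = keyword.lower()
        hits = []
        for entry in self._entries.values():
            haystack = [entry.name, entry.description, *entry.tags, *entry.columns]
            if any(needle in text.lower() for text in haystack):
                hits.append(entry.name)
        return sorted(hits)


catalog = ToyCatalog()
catalog.register("orders", "sqlserver://sales-db/orders",
                 {"order_id": "int", "amount": "decimal"})
catalog.register("clicks", "blob://logs/clicks/",
                 {"user_id": "int", "url": "string"})
catalog.annotate("orders", description="Confirmed customer orders", tags={"finance"})
print(catalog.search("finance"))
```

Because descriptions and tags are searched alongside column names, an annotation added by one user makes the source discoverable to others.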

DataWorks

  • Product positioning: a one-stop big data platform that covers data integration, data management, data development, data operation, data service sharing, data security, data quality, and other stages of the big data lifecycle.
  • Methodology: cloud data warehouse, stream computing.
  • Target users: data developers (data integration, data development, data operation), data managers (data management, data security, data quality), and data users (data management, data service, real-time analytics).
  • How to use: web console.
  • Deployment approach: public cloud (serverless), proprietary cloud.
  • Development languages: SQL, Java (OpenMR), Python, R, and so on.
  • Service level: public beta (Data Integration is officially commercial).
  • Underlying engines: MaxCompute, Blink.

2.3 Advantage and disadvantage comparison

Azure Data Factory product advantages

  • Rigorous conceptual model. Azure Data Factory abstracts all possible objects and behaviors in data processing and establishes a self-consistent system and methodology. There is virtually no room for ambiguity, and it is easy to extend functionality in the future.
  • Rich ecosystem. Data Factory abstracts the supported data sources and processing engines as linked service objects, and the range of linked services supported differs from activity to activity. According to the official documentation, it supports 68 different data sources for data movement and eight different processing engines for transformation.
  • Unified user experience. As a “window” within Azure, Data Factory offers a user experience consistent with other Azure products. You do not even need to open a new browser window or tab (a single page can contain multiple windows).
  • Full support for text-based operations. All object definitions are written in JSON, and every interface operation has a corresponding Azure PowerShell command. Users can work entirely outside the browser and save their work as text.
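
To make the text-based workflow concrete, the following Python sketch builds a pipeline-shaped definition and serializes it to JSON that could be kept under version control. The field names follow the general shape of a Data Factory pipeline definition, but the fragment is abbreviated (a real Copy activity carries further settings, such as source and sink properties). The helper functions and all activity and dataset names are hypothetical.

```python
import json


def copy_activity(name, input_dataset, output_dataset, depends_on=()):
    """Build a dict shaped like an abbreviated Copy activity definition."""
    return {
        "name": name,
        "type": "Copy",
        # Each dependency names an upstream activity and the condition to wait for.
        "dependsOn": [
            {"activity": upstream, "dependencyConditions": ["Succeeded"]}
            for upstream in depends_on
        ],
        "inputs": [{"referenceName": input_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": output_dataset, "type": "DatasetReference"}],
    }


def pipeline_definition(name, activities):
    """Wrap a list of activities in a pipeline-shaped dict."""
    return {"name": name, "properties": {"activities": list(activities)}}


def to_text(definition):
    """Serialize a definition to stable, diff-friendly JSON text."""
    return json.dumps(definition, indent=2, sort_keys=True)


# Hypothetical names, for illustration only.
stage = copy_activity("StageRawData", "RawBlobDataset", "StagingDataset")
load = copy_activity("LoadWarehouse", "StagingDataset", "WarehouseDataset",
                     depends_on=["StageRawData"])
definition = pipeline_definition("DailyLoadPipeline", [stage, load])
text = to_text(definition)
print(text)
```

Sorting keys and fixing the indentation keeps the serialized text stable between runs, so changes to a pipeline show up as small, reviewable diffs.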

Azure Data Factory product disadvantages

  • Online editing of activities is not supported. All activity types, especially transformation activities, require you to upload scripts or define stored procedures, which results in a poor user experience.
  • Only pipeline-level triggers are supported. Within a pipeline, you cannot define time requirements for an individual activity. An activity is executed as soon as its dependsOn property is satisfied.
  • Weak monitoring capability. Pipeline monitoring is based entirely on Azure Monitor, and there is no dedicated monitoring of data quality.
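
The dependency-driven execution described in the trigger limitation above can be illustrated with a simplified model. The Python sketch below is not Data Factory's scheduler. It assumes each activity lists only the names of the activities it depends on, and it ignores dependency conditions. The function and activity names are hypothetical.

```python
def execution_waves(activities):
    """Group activities into waves of simultaneously runnable work.

    `activities` maps an activity name to the list of activity names it
    depends on. An activity becomes runnable as soon as all of its
    dependencies have finished; no per-activity start time is involved.
    """
    remaining = {name: set(deps) for name, deps in activities.items()}
    for name, deps in remaining.items():
        unknown = deps - set(remaining)
        if unknown:
            raise ValueError(f"{name} depends on unknown activities: {sorted(unknown)}")

    done = set()
    waves = []
    while remaining:
        # Everything whose dependencies are all finished can start now.
        ready = sorted(name for name, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


# Hypothetical pipeline, for illustration only.
pipeline = {
    "CopyRaw": [],
    "CleanData": ["CopyRaw"],
    "ArchiveRaw": ["CopyRaw"],
    "BuildReport": ["CleanData"],
}
for number, wave in enumerate(execution_waves(pipeline), start=1):
    print(number, wave)
```

In this model the only way to delay an activity is to add another dependency in front of it, which mirrors the limitation that time requirements cannot be set per activity.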

Azure Data Catalog product advantages

  • Complete enterprise-class metadata management
    Data Catalog builds on Azure’s experience in enterprise-class data management. It integrates with Azure AD to simplify the management of corporate organizations and staff privileges. It manages metadata permissions through ownership, annotation, registration, visibility, and terms of use, and standardizes the description of asset objects and asset attributes. These features suit enterprise-class collaboration scenarios and together form a more complete solution.

  • Data knowledge sharing and management
    Data Catalog manages not only metadata but also metadata-related knowledge:

  1. You can set a friendly name for an asset object so that it is easy to identify.
  2. You can set comments, tags, or terms for asset objects and attributes.
  3. You can assign experts to asset objects to associate them with people.
  4. You can write text-formatted documentation for asset objects.
  5. Anyone with annotation privileges can write comments, tags, and terms.
  • Data profile
    When an asset object is registered, Data Catalog collects a data profile containing statistical information that reflects the characteristics of the data, so that users can get an intuitive understanding of the data content.

Azure Data Catalog product disadvantages

  • In terms of interface interaction, Data Catalog provides a good user experience with an informative and friendly user interface, but several restrictions make it harder for new users to get started:
  1. It is open to corporate or school Azure accounts only.
  2. You must subscribe to an Azure ready-to-use package. Although a free version of Data Catalog is available, subscribing to this package results in the loss of free usage for other Azure products.
  3. The data source import tool must be run on a 64-bit Windows operating system; macOS is not supported.
  • Data Catalog is more independent than other Azure products. Functionally, it focuses on managing data catalogs and associated knowledge and has no link with Data Factory, so its application scenarios are limited. Interaction with other products presupposes that the data pipeline is seamlessly integrated, that data transmission delay is small enough, and that the likelihood of compatibility problems is low enough.

2.4 Conclusion

In summary, in the data warehouse and data business process areas, DataWorks has the following advantages:

  • Data Integration: supports streaming control and real-time synchronization.
  • Data Development: powerful online editing capabilities that provide an experience comparable to an offline IDE.
  • Monitoring Operations: supports business baseline monitoring.
  • Data Management: complete data management capabilities, plus unique functions such as classification and data desensitization.
  • Data Quality: features that are unique among the compared products.