Big Data

Last Updated: Aug 13, 2018

Alibaba Cloud for AWS Professionals

This article discusses the main differences and similarities between AWS and Alibaba Cloud in big data services. We mainly discuss the following service types and products:

1. Data collection

2. Data computing

3. Data analysis

4. Data visualization

5. Data processing

The comparison covers the products as shown in the table below.

Feature | AWS | Alibaba Cloud
Data collection | AWS Kinesis | Alibaba Cloud Log Service, Alibaba Cloud DataHub
Data computing | AWS Elastic MapReduce, AWS Redshift | Alibaba Cloud E-MapReduce, Alibaba Cloud MaxCompute
Data analysis | AWS QuickSight | Alibaba Cloud Quick BI
Data visualization | N/A | Alibaba Cloud DataV
Data processing | AWS Glue, AWS Data Pipeline | Alibaba Cloud DataWorks

1. Data collection

AWS Kinesis, Alibaba Cloud Log Service, and Alibaba Cloud DataHub can all be used to extract and collect data into their own cloud environments or the corresponding data models. However, each service uses a different service model.

1.1 Service models

The following table compares the basic functions and terminology of AWS Kinesis, Alibaba Cloud Log Service, and Alibaba Cloud DataHub.

Feature | AWS Kinesis | Alibaba Cloud Log Service | Alibaba Cloud DataHub (public beta on the China site)
Client support & collection methods | Native agent; open-source client; API | Native agent; open-source client; over 30 collection ends, such as Logstash and Fluent | Native agent; open-source client; multiple collection ends, such as mobile devices, applications, website services, and sensors
Client expansion | N/A | Supported | Supported
Retention days | 1 to 7 days | 1 to 365 days | 7 days
Stream computing support | Open source stream computing engines, Kinesis Analytics | Open source stream computing engines, ARMS and StreamCompute (which will be launched on the international site later), and CloudMonitor | StreamCompute stream computing engine
Deployment location | Region | Region (global) | Region (public beta)
Shipping destination | S3/Redshift/ES | OSS/MaxCompute/Table Store | OSS/MaxCompute/and so on
Maximum record size | 1 MB | 3 MB | 1 MB
Throughput | 5 MB/s, 5000 records/s | No upper limit, elastic | Supports up to several TB of daily data input with a single topic, with each shard supporting several hundred GB of daily data input
Delay | S3/ES: 60 to 900 s; Redshift: >60 s | OSS/Table Store: 60 to 900 s; MaxCompute: 15 min | Maximum delay: 5 min
Storage cost | USD 0.02/GB | USD 0.01/GB | Public beta, free for the time being
ETL support | Lambda | JSON/CSV/Parquet; Function Compute | Connected to MaxCompute and the Blink platform, Alibaba Cloud DataHub supports all ETL tools on these two platforms
Pricing strategy | Kinesis pricing | Log Service pricing | Public beta, free for the time being
Security | Supports customization of permissions, groups, and access control of users and roles | HTTPS + transmission signature + multi-tenant isolation + access control | Provides enterprise-level multi-layer security protection and a multi-user resource isolation mechanism; provides various authentication and authorization mechanisms, as well as whitelist and primary/subaccount features

1.2 Main functions

AWS Kinesis is a cloud service that supports stream computing. It enables users to collect and process data in real time. AWS Kinesis provides multiple core capabilities for processing data streams economically and effectively, and gives you the flexibility to choose the tools that best fit your application needs. By default, records added to a stream are accessible for up to 24 hours after they are added. You can increase the data retention period to seven days by enabling extended data retention. The maximum size of the data block in a record is 1 MB. You can use the REST API or the Kinesis Producer Library (KPL) to send data to AWS Kinesis.

Alibaba Cloud Log Service provides an all-in-one solution for log collection, log processing, and real-time log analysis. Its collection component, LogHub, supports clients, web pages, protocols, SDKs/APIs (mobile apps and games), and many other log collection methods. All log collection methods are implemented on top of a RESTful API, and you can also use the API/SDK to implement new collection methods. The maximum data block size supported by Alibaba Cloud Log Service is 3 MB. You can choose a data retention period from 1 to 365 days. Log Service also supports rich ETL options and elastic throughput with no upper limit.

Alibaba Cloud DataHub is currently in public beta release, and is only targeted at the Chinese market. The international version will be developed later. Alibaba Cloud DataHub can continually collect, store, and process data from mobile devices, application software, website services, sensors, and other units that generate streaming data. The maximum data block size supported by Alibaba Cloud DataHub is 1 MB. The data retention period is seven days. Because DataHub connects to MaxCompute and the Blink platform, it supports all ETL tools on these two platforms.
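The record size and retention limits described above can be checked on the client side before data is sent. The following is a minimal sketch, not part of any SDK: the `LIMITS` table and the `check_record` function are hypothetical names, and 1 MB is assumed to mean 1024 * 1024 bytes.

```python
# Limits taken from the comparison above.
# Assumption: 1 MB means 1024 * 1024 bytes.
MB = 1024 * 1024

LIMITS = {
    "kinesis": {"max_record_bytes": 1 * MB, "retention_days": (1, 7)},
    "log_service": {"max_record_bytes": 3 * MB, "retention_days": (1, 365)},
    "datahub": {"max_record_bytes": 1 * MB, "retention_days": (7, 7)},
}


def check_record(service: str, payload: bytes, retention_days: int) -> list:
    """Return a list of problems; an empty list means the record fits."""
    limits = LIMITS[service]
    problems = []
    if len(payload) > limits["max_record_bytes"]:
        problems.append(
            f"payload is {len(payload)} bytes, limit is {limits['max_record_bytes']}"
        )
    low, high = limits["retention_days"]
    if not low <= retention_days <= high:
        problems.append(f"retention of {retention_days} days is outside {low}-{high}")
    return problems
```

For example, a 2 MB payload with a 30-day retention requirement passes for Log Service but fails both checks for Kinesis.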

1.3 Data shipping

AWS Kinesis can use AWS Kinesis Firehose to load streaming data into data stores such as AWS S3, AWS Redshift, or AWS Elasticsearch Service.

Alibaba Cloud Log Service can use LogShipper to deliver the collected data to Alibaba Cloud storage products such as OSS, Table Store, and MaxCompute in real time. You only need to complete the configuration on the console. In addition, LogShipper provides a complete status API and an automatic retry function. LogShipper can also be used together with E-MapReduce (Spark, Hive) and MaxCompute for offline computing.

The DataHub service also supports distributing streaming data to various cloud products, such as MaxCompute (formerly known as ODPS) and OSS.
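LogShipper performs its retries on the service side, so no client code is required for it. For readers who ship data with their own tooling, the same retry idea can be sketched as follows. This is a generic illustration, not LogShipper code: `deliver_with_retry`, `send`, and the backoff values are all hypothetical.

```python
import time


def deliver_with_retry(send, batch, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send(batch) until it succeeds or max_attempts is reached.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    Returns the number of attempts used; re-raises the last error on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(batch)
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential backoff before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so that the backoff schedule can be verified without actually waiting.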

1.4 Cost

The price of AWS Kinesis Streams is based on two core dimensions, Shard Hour and PUT Payload Unit, and one optional dimension, Extended Data Retention. Data is retained for 24 hours by default. Once you enable extended data retention, you are charged an additional rate for each shard hour incurred by your stream. AWS Kinesis Streams uses a provisioning model, which means you pay for the provisioned resources even if you use only some of them or none at all. The price of AWS Kinesis Firehose is based on the data transmission volume.

Alibaba Cloud Log Service uses the Pay-As-You-Go pricing method: you are charged at tiered monthly prices based on the volume of resources used. If you have a free quota for Log Service, you are not charged for usage within the quota and are charged only for the excess. In addition, resource packs are available that offer better prices.

Alibaba Cloud DataHub is currently in public beta and is free of charge for the time being.
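The Kinesis Streams pricing dimensions described above (Shard Hour, PUT Payload Unit, and optional Extended Data Retention) can be combined into a rough monthly estimate. The sketch below makes several assumptions that are not stated in this article: a PUT Payload Unit is counted in 25 KB chunks, a month is taken as 730 hours, and all rates are supplied by the caller rather than hard-coded. The function names are hypothetical.

```python
import math

# Assumption: one PUT Payload Unit covers up to 25 KB of a record.
PUT_UNIT_BYTES = 25 * 1024


def put_payload_units(record_bytes: int) -> int:
    """Number of PUT Payload Units counted for a single record."""
    return max(1, math.ceil(record_bytes / PUT_UNIT_BYTES))


def monthly_stream_cost(shards, records_per_second, record_bytes,
                        shard_hour_rate, put_unit_rate_per_million,
                        extended_retention_rate=0.0, hours=730):
    """Estimate one month of stream charges from caller-supplied rates."""
    # Provisioned shards are billed for every hour, used or not.
    shard_cost = shards * hours * (shard_hour_rate + extended_retention_rate)
    units = put_payload_units(record_bytes) * records_per_second * hours * 3600
    put_cost = units / 1_000_000 * put_unit_rate_per_million
    return shard_cost + put_cost
```

Because the shard term does not depend on traffic, the estimate stays above zero even for an idle stream, which reflects the provisioning model.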

2. Data computing

After data is collected into the corresponding cloud environment, these products can convert, filter, and then compute on the data based on your needs.

2.1 Service comparison

The following table compares the basic functions and terminology of AWS Elastic MapReduce and Alibaba Cloud E-MapReduce.

Item | AWS Elastic MapReduce | Alibaba Cloud E-MapReduce
Open source framework | Apache Hadoop and Apache Spark | Apache Hadoop and Apache Spark
Service integration | Yes | Yes
Scaling | Manual | Manual
Deployment location | Zonal | Zonal
Pricing model | Hourly | Hourly
Deployment unit | Cluster | Cluster
Unit of scale | Node (master, core, and task nodes) | Node (master and slave nodes, scalable)
Unit of work | Step | Job
Computing model | MapReduce, Apache Hive, Apache Pig, Apache Spark, Spark SQL, and PySpark | MapReduce, Apache Hive, Apache Pig, Apache Spark, Spark SQL, HBase, and so on
Customization | Bootstrap actions | Bootstrap actions

The following table compares the basic functions and terminology of AWS Redshift and Alibaba Cloud MaxCompute.

Item | AWS Redshift | Alibaba Cloud MaxCompute
Computing scale | EB level | EB level
Data source | AWS S3, DynamoDB Activity Log, Kinesis, web app server… | Application-generated data (ApsaraDB for RDS/OSS/AnalyticDB/SLS…), existing data center (Oracle DB), independent data set (Hadoop cluster)…
Provisioning unit | Nodes | N/A (fully managed)
Data security | Uses VPC to isolate clusters, and KMS to manage keys | Provides multi-layer sandbox protection/monitoring, a project-based data protection mechanism, package authorization, Trusted mode, as well as RAM and ACL authorizations
Scaling | Manual | Auto
Backup management | Snapshots | Cluster disaster recovery
Deployment location | Zonal | Regional
Data format | TEXTFile, SequenceFile, RCFile, AVRO, Parquet, ORC, and so on | TEXTFile, SequenceFile, RCFile, AVRO, Parquet, ORC, and so on
Ecosystem connectivity | JDBC and ODBC | JDBC, ODBC, R, Python Pandas, and IntelliJ IDEA
Community compatibility | PostgreSQL compatible | Standard SQL, MR, and Tunnel statements

2.2 Main functions

Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to run data processing frameworks. Amazon EMR consumes and processes real-time data from Amazon Kinesis, Apache Kafka, or other data streams with Spark Streaming. Amazon EMR performs streaming analytics in a fault-tolerant way and writes results to AWS S3 or HDFS. Amazon EMR can also be used to quickly and cost-effectively perform data transformation (ETL) workloads, such as sort, aggregate, and join, on large datasets.
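The three transformation types named above (sort, aggregate, and join) are illustrated below on a tiny in-memory dataset. This is only a pure-Python sketch of the operations themselves; on EMR or E-MapReduce they would normally be expressed in Spark, Hive, or a similar engine. The `orders` and `customers` data is made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data.
orders = [
    {"order_id": 1, "customer_id": "c2", "amount": 30.0},
    {"order_id": 2, "customer_id": "c1", "amount": 12.5},
    {"order_id": 3, "customer_id": "c2", "amount": 7.5},
]
customers = {"c1": "Ada", "c2": "Lin"}  # customer_id -> name

# Sort: order rows by amount, largest first.
by_amount = sorted(orders, key=lambda row: row["amount"], reverse=True)

# Aggregate: total amount per customer.
totals = defaultdict(float)
for row in orders:
    totals[row["customer_id"]] += row["amount"]

# Join: attach the customer name to each total.
report = [
    {"customer": customers[cid], "total": total}
    for cid, total in sorted(totals.items())
]
```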

Alibaba Cloud E-MapReduce is a big data processing solution that runs on the Alibaba Cloud platform. E-MapReduce is built on Alibaba Cloud Elastic Compute Service (ECS) and is based on open source Apache Hadoop and Apache Spark. It makes it easy to use other systems in the Hadoop and Spark ecosystems (for example, Apache Hive, Apache Pig, and HBase) to analyze and process your own data. You can also easily export data to and import data from other cloud data storage and database systems, such as Alibaba Cloud OSS and Alibaba Cloud ApsaraDB for RDS.

AWS Redshift is an enterprise-level data warehouse: a relational database query and management system. AWS Redshift supports client connections from multiple types of applications, including business intelligence (BI), reporting, data, and analytics tools. AWS Redshift Spectrum allows you to store and process data at any time as needed.

Alibaba Cloud MaxCompute is the largest big data cloud service platform in China, and provides massive data storage, massive data computing, and data exchange among multiple organizations. Alibaba Cloud MaxCompute is a large distributed computing system independently developed by Alibaba Group. MaxCompute supports multi-cluster dual-active disaster recovery. You do not need to worry about infrastructure stability, which allows you to concentrate on your own business. MaxCompute also ensures data consistency and continuity of its services. Alibaba Cloud MaxCompute provides users with a comprehensive set of big data development tools to improve data import and export solutions, as well as various classic distributed computing models to quickly solve massive data computation, effectively reduce enterprise cost, and safeguard data security.

Alibaba Cloud MaxCompute has better support for ecosystem connectivity and community compatibility. In terms of data format support, Alibaba Cloud MaxCompute and AWS Redshift are basically tied. They both have their own security policies, but the security policies of Alibaba Cloud MaxCompute are more extensive. In terms of data backup, AWS Redshift’s automatic snapshot function can continuously back up data from the clusters to AWS S3. Snapshots are automatically created in a continuous and incremental manner. AWS Redshift stores your snapshots for a customized period, which can be 1 to 35 days. For Alibaba Cloud MaxCompute, data is stored in the clusters of the Apsara system. Apsara Distributed File System stores data in triplicate and uses a multi-master mechanism to ensure the availability of the masters and the reliability of the data. Apsara Distributed File System guarantees both high data availability and high service availability. Alibaba Cloud MaxCompute also supports timed data backup.

In addition, MaxCompute has also developed the next generation engine, MaxCompute 2.0. Using the internal big data platform of Alibaba Group and Alibaba Cloud, MaxCompute 2.0 features high performance and low cost, which are the most fundamental indicators of a computing platform. We have also been constantly optimizing the architecture and performance. In terms of language support, we have launched NewSQL, a new generation big data language that combines the advantages of imperative and declarative languages. With regard to multi-machine collaboration, we have deployed more than 10 clusters, and data operation is subject to smart scheduling among clusters. MaxCompute also has multi-cluster disaster tolerance capability to ensure financial-level stability. In terms of computing models, MaxCompute supports batch MR, DAG-based processing, interactive computing, in-memory computing, cluster learning, and many other computing models, and achieves open-source compatibility by collaborating with computing platforms.

2.3 High scalability

The service model of Alibaba Cloud E-MapReduce is very similar to that of AWS EMR. Taking full advantage of the open source big data ecosystem, including Hadoop, Spark, Hive, Storm, and Pig, E-MapReduce provides users with an all-in-one big data processing and analysis solution that covers cluster, job, and data management. With either service, users create a cluster that contains multiple nodes: one master node and a variable number of worker nodes.

Both AWS EMR and Alibaba Cloud E-MapReduce support manually adjusting the number of nodes in a cluster after the cluster is launched. Decisions about cluster size and scaling operations are made by the user or administrator who monitors the performance and usage of the cluster. Users of both products are charged by the number of nodes provisioned.

As with the Apache Spark model used with AWS EMR and Alibaba Cloud E-MapReduce, an AWS Redshift user who wants to scale a cluster up or down, for example to add resources during high-usage periods or to reduce cost during low-usage periods, must do so manually.

MaxCompute provides higher flexibility and security, extensive functions, integrated architecture, elastic scaling methods, and a variety of supported tools. In addition, DataWorks is closely linked with MaxCompute and provides MaxCompute with all-in-one solutions for data synchronization, task development, data workflow development, data management, data O&M, and other functions. For details, see DataWorks.

2.4 Cost

AWS EMR supports on-demand pricing as well as short-term and long-term discounts. Both AWS EMR and E-MapReduce use hourly pricing.

When purchasing E-MapReduce clusters, Alibaba Cloud ECS is purchased automatically, so you do not need to prepare ECS in advance. If you are entitled to a discount for ECS, you enjoy the same discount when purchasing ECS here. For details, refer to E-MapReduce pricing descriptions.

AWS Redshift pricing options include:

  1. On-Demand pricing: no upfront costs. You pay an hourly rate based on the type and number of nodes in your cluster.

  2. Amazon Redshift Spectrum pricing: enables you to run SQL queries directly against all of your data in AWS S3. You pay for the number of bytes scanned.

  3. Reserved Instance pricing: lets you save cost by committing to use Redshift for a certain period of time.
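The three Redshift pricing options listed above can be compared with simple arithmetic. The following is a sketch under stated assumptions: all rates are supplied by the caller (no real prices are included), 1 TB is taken as 1024^4 bytes, and the function names are hypothetical.

```python
# Assumption: 1 TB means 1024 ** 4 bytes.
TB = 1024 ** 4


def on_demand_cost(nodes: int, hours: float, rate_per_node_hour: float) -> float:
    """On-Demand: pay per node for every hour the cluster runs."""
    return nodes * hours * rate_per_node_hour


def spectrum_cost(bytes_scanned: int, rate_per_tb: float) -> float:
    """Redshift Spectrum: pay for the bytes scanned by queries."""
    return bytes_scanned / TB * rate_per_tb


def reserved_saving(on_demand_total: float, reserved_total: float) -> float:
    """Fraction saved by a reserved commitment over the same period."""
    return 1 - reserved_total / on_demand_total
```

For instance, `reserved_saving(100.0, 75.0)` reports a saving of one quarter relative to the on-demand total for the same period.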

MaxCompute offers two pricing options:

  1. Volume-based post-payment: the volume of resources consumed by jobs is the unit of measurement, and you pay after the jobs are executed.

  2. CU-based pre-payment: you reserve a certain quantity of resources in advance. CU-based pre-payment is only supported on the Alibaba Cloud big data cloud platform. For detailed pricing descriptions, refer to MaxCompute.

3. Data analysis

Data analysis services compute, process, and analyze the collected big data and convert it into information that is useful to the enterprise, providing value for enterprise planning, product R&D, and market research.

3.1 Service comparison

The following table compares the basic functions and terminology of AWS QuickSight and Alibaba Cloud Quick BI.

Item | AWS QuickSight | Alibaba Cloud Quick BI
Data connection | Strong relational database, multidimensional database, NoSQL database, Hadoop & local files | Relational database, multidimensional database, NoSQL database, Hadoop & local files, Alibaba ecosystem
Data model | Cube support, system time cycle (date, week, month, quarter, year), offline data source acceleration (ApsaraDB for RDS acceleration, high cost) | Cube support, system time cycle (date, week (7 types), month, quarter, year, MTD, QTD, YTD, fiscal year), offline data source acceleration (computation acceleration, full coverage, low cost)
Report generation | Standard table, composite electronic reports | Standard table, composite electronic reports (Excel proficiency)
Data visualization | Data components (14 types), visual screen creation, widget filtering (time, drop down, button) | Data components (16 types), visual screen creation, widget filtering (time, drop down, text, button, comparison, comment)
Dashboard & sharing | Supported | Supported
Permission management | Assigns ADMIN or USER roles | Includes organization permission management and row-level permission management
Data view | Mobile and web terminals, DirectMail | Mobile and web terminals, portal creation, DingTalk account support, DirectMail
System capability | Professionalism (enterprise level BI), easy-to-use (good web page interaction), integration (third-party embedding supported) | Professionalism (enterprise level BI), easy-to-use (excellent web page interaction), integration (third-party embedding supported)
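The to-date time cycles listed for Quick BI in the table (month-to-date, quarter-to-date, and year-to-date) each define a window that starts on the first day of the current period and ends today. A minimal sketch of how such window starts can be computed is shown below. It assumes calendar periods rather than a fiscal year, and `period_start` is a hypothetical helper, not a Quick BI function.

```python
from datetime import date


def period_start(today: date, period: str) -> date:
    """Return the first day of the to-date window that ends on `today`."""
    if period == "MTD":
        return today.replace(day=1)
    if period == "QTD":
        # Quarters start in months 1, 4, 7, and 10.
        first_month = 3 * ((today.month - 1) // 3) + 1
        return date(today.year, first_month, 1)
    if period == "YTD":
        return date(today.year, 1, 1)
    raise ValueError(f"unknown period: {period}")
```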

3.2 Main functions

Both AWS QuickSight and Alibaba Cloud Quick BI are cloud-computing-based business analysis services that can:

  • provide smart data modeling.

  • integrate the scale advantage and flexibility of cloud computing into business analysis to solve business pain points.

  • help enterprises complete data analysis and data animation.

  • provide highly efficient capabilities and methods for business digitization.

QuickSight uses SPICE (Super-fast, Parallel, In-memory Calculation Engine) to provide quick-response query performance, and allows quick interactive analysis on various AWS data sources. Alibaba Cloud Quick BI has a built-in intelligent query acceleration engine that enables real-time online analysis of massive amounts of data. Without extensive data preprocessing, Quick BI can smoothly analyze massive amounts of data, which significantly improves analysis efficiency. In addition, Quick BI supports multiple data sources, including Alibaba Cloud data sources and data sources related to the Alibaba Group ecosystem.

Both AWS QuickSight and Alibaba Cloud Quick BI support Cube (a multidimensional database, or multidimensional data cube). With Cube, you can compress the required data, which is especially useful when processing large data sizes. For example, FineCube for FineBI can avoid data modeling and increase data processing speed. In terms of offline data acceleration, Alibaba Cloud Quick BI uses computation acceleration with full coverage, at a lower cost than the ApsaraDB for RDS acceleration used by AWS QuickSight. In addition, Alibaba Cloud Quick BI supports a wider range of period types within the system time cycle.
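To make the idea of a multidimensional data cube concrete, the sketch below pre-aggregates one measure for every combination of dimensions, so that later queries read a small summary instead of scanning all rows. It is a generic illustration of the concept, not how QuickSight or Quick BI implement Cube; `build_cube` and the sample `sales` rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate `measure` for every subset of `dimensions`.

    Returns {dims_tuple: {values_tuple: total}}; the empty tuple
    holds the grand total.
    """
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cell = defaultdict(float)
            for row in rows:
                cell[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(cell)
    return cube


# Hypothetical sample data.
sales = [
    {"region": "east", "product": "a", "amount": 10.0},
    {"region": "east", "product": "b", "amount": 5.0},
    {"region": "west", "product": "a", "amount": 2.5},
]
cube = build_cube(sales, ["region", "product"], "amount")
```

With two dimensions the cube holds four groupings: the grand total, totals by region, totals by product, and totals by region and product together.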

They both support standard tables, but the composite electronic reports (Excel proficiency) of Alibaba Cloud Quick BI perform better. Note that the standard edition only contains the worksheet function; only the advanced editions have the electronic report function.

Data visualization: AWS QuickSight uses a technology called AutoGraph, which chooses the most appropriate visual type based on data attributes (such as numbers and data types) you select. Alibaba Cloud Quick BI supports extensive data visualization effects to meet data presentation demands of different scenarios. Besides, it automatically recognizes data features and smartly recommends an appropriate visualization solution.
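The general idea of recommending a visual type from data attributes can be sketched with a small rule table. This is purely illustrative: it is not the AutoGraph algorithm or the Quick BI recommendation logic, and the `recommend_chart` function, the column kinds, and the chart names are all assumptions made for this example.

```python
def recommend_chart(columns: dict) -> str:
    """Pick a chart type from a {column_name: kind} mapping.

    `kind` is one of "number", "category", or "date".
    """
    kinds = sorted(columns.values())
    if kinds == ["date", "number"]:
        return "line"      # a value over time
    if kinds == ["category", "number"]:
        return "bar"       # a value per category
    if kinds == ["number", "number"]:
        return "scatter"   # relationship between two values
    if kinds == ["number"]:
        return "kpi"       # a single headline figure
    return "table"         # fallback when no rule applies
```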

In terms of permission management, a newly created AWS QuickSight account has the ADMIN permission by default. AWS QuickSight users can invite other users and assign them the ADMIN or USER role. The data permission management of Alibaba Cloud Quick BI includes internal organization member management and supports row-level data permissions, to meet different permission requirements for different users.

In terms of data sharing and data view: AWS QuickSight allows users to use the sharing icon on the service interface to share analysis results, dashboards, and tables. Before sharing something with others, users can choose the recipient (email address, user name, or group name), permission level and other options. Similarly, Alibaba Cloud Quick BI supports sharing worksheets/spreadsheets, dashboards, and data portals to other logged-on users, and publishing dashboards to the Internet for access by non-logged-on users. Data view support: Both Alibaba Cloud Quick BI and AWS QuickSight support data view at the mobile terminal, web terminal, and through DirectMail. Alibaba Cloud Quick BI also supports DingTalk account, which is convenient for DingTalk users. Alibaba Cloud Quick BI also supports data portal creation, and allows users to drag-and-drop dashboards to create a data portal, embed links to dashboards, and conduct basic settings for templates and the menu bar.

3.3 System capability

Both AWS QuickSight and Alibaba Cloud Quick BI support enterprise level BI and third-party integration. Alibaba Cloud Quick BI offers flexible report integration solutions, which allow you to embed reports created from Alibaba Cloud Quick BI into your own system, and directly access the reports from your system without logging on to Alibaba Cloud Quick BI. Quick BI is easy-to-use. With an intelligent data modeling tool, Quick BI greatly reduces data acquisition cost and makes it much easier to use. Besides, the drag-drop operation and the extensive visual chart controls allow you to easily complete data perspective analysis, self-service data acquisition, business data profiling, report making, and data portal creation.

Both AWS QuickSight and Alibaba Cloud Quick BI are priced based on the number of users and the subscription duration, and both of them provide two editions (standard edition and enterprise edition) with different pricing options. Annual subscription is required by AWS QuickSight. The purchased Alibaba Cloud Quick BI instance can last for at most one year. You can select the number of users and the service length. When your Quick BI instance is going to expire, the system sends a message to remind you to renew your Quick BI instance in time.

4. Data visualization

DataV is a powerful and easy-to-use data visualization tool, which has extensive geographical presentation functions and user-friendly interfaces.

4.1 Application scenario

  • Presentation: presents business performance data (investor relationship, public relations, exhibitions, road shows, and reception).

  • Monitoring: uses data to boost business growth (real-time monitoring, alert, and quick response support).

  • Data-driven: discovers hidden data value (real-time presentation of multidimensional data may bring new possibilities).

4.2 Main functions

4.2.1 Templates for different solutions

DataV provides multiple templates for diversified scenarios, such as the control center, geographic analysis, real-time monitoring, and operation presentation, which can be used after slight customization from the client. You can design high quality visual presentations without help from professional designers.

4.2.2 Open and extensive visualization component library

Apart from the basic charts, DataV is good at combining data and geographical information, such as map-based traffic routes, heat maps and scatter charts. DataV also allows you to draw geographic tracks, geographic lines, heat maps, geographic blocks, 3D maps, and 3D globes that involve massive amounts of data, and to overlay geographic data. The topographic maps, tree charts, and other distinctive charts are also available to you.

4.2.3 Support for various data sources

DataV can be connected to various data sources, including Alibaba Cloud AnalyticDB, ApsaraDB for RDS, and API, and supports dynamic requests. Static data stored in CSV and JSON files is also supported.
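Static CSV data often needs to be turned into a JSON array of objects before it is attached to a visualization component. The following sketch shows one way to do that with the Python standard library. The field names `city` and `value` are made up for illustration and do not reflect a required DataV schema; `csv_to_json` is a hypothetical helper.

```python
import csv
import io
import json


def csv_to_json(csv_text: str, numeric_fields=()) -> str:
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # CSV values are read as strings; convert the numeric ones.
        for field in numeric_fields:
            row[field] = float(row[field])
        rows.append(row)
    return json.dumps(rows)


# Hypothetical sample data.
sample = "city,value\nHangzhou,12\nBeijing,7.5\n"
payload = csv_to_json(sample, numeric_fields=["value"])
```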

4.2.4 User-friendly interface

With a graphical interface and configuration widgets, you only need to drag and drop to create professional visualization projects, which requires very little programming skill.

4.2.5 Flexible publishing and adaptation

DataV projects can be published as web pages, or published with password or access token to control access and security information displayed on the dashboard. For better display effects on spliced screens, DataV is optimized to improve resolution.

4.2.6 Support for internal deployment

In some scenarios, data is subject to a very high level of confidentiality and cannot be posted online, or network access is restricted. In such cases, the internal deployment solution can be used. After editing the dashboard interface in the cloud version of the DataV editor, you can compress the content of your edits into a single file, download it to your local DataVServer, and then connect it to your local database and publish it locally.

4.2.7 Tools for dashboard broadcasting and splicing

DataV also provides a lightweight solution for dashboard broadcasting and splicing. Distinct from traditional solutions, Mscreen ensures each interface is stably run as an independent process and can be spliced together to form a customized solution for single channel output of signal.

4.2.8 Component customization

DataV provides a secondary development environment that allows developers to integrate their own JavaScript components into DataV solutions. Users can configure the data source and styles of customized components, just like local components. Developers can also sell their component libraries at Alibaba Cloud Marketplace.

4.3 Pricing

DataV offers two product editions for public cloud users: Basic Edition and Enterprise Edition. Prices and feature details are listed as follows.

Item | Basic (USD 360 annually) | Enterprise (USD 3,000 annually)
Sharing - share projects publicly | Yes | Yes
Sharing - share with password | N/A | Yes
Sharing - share with access token | N/A | Yes
Sharing - transfer projects to another user | Available only when target user is using Enterprise Edition | Available only when target user is using Enterprise Edition
Projects and templates - available templates | 5 | All templates (updating)
Projects and templates - available projects | 5 | 20
Data source - ApsaraDB for RDS for MySQL | Yes | Yes
Data source - AnalyticDB | Yes | Yes
Data source - MySQL-compatible database | Yes | Yes
Data source - CSV | Yes | Yes
Data source - API | Yes | Yes
Data source - Static JSON | Yes | Yes
Data source - DataV Data Proxy Service | Yes | Yes
Data source - ApsaraDB for RDS for PostgreSQL | N/A | Yes
Data source - ApsaraDB for RDS for SQL Server | N/A | Yes
Data source - HybridDB for PostgreSQL | N/A | Yes
Data source - Alibaba Cloud API Gateway | N/A | Yes
Data source - Table Store | N/A | Yes
Data source - Alibaba Cloud intranet IP | N/A | Yes
Data source - OSS | N/A | Yes
Data source - Alibaba Cloud Log Service | N/A | Yes
Data source - Oracle | N/A | Yes
Data source - SQL Server | N/A | Yes
Visualization widgets - basic charts | Yes | Yes
Visualization widgets - basic maps | Yes | Yes
Visualization widgets - advanced maps | N/A | Yes
Visualization widgets - ECharts | N/A | Yes

5. Data processing

Data processing covers data transfer, data conversion, and other related operations: data is brought in from different data sources, transformed, and processed. Finally, the data is extracted to other data systems, completing the entire process of data acquisition, conversion, development, and analysis.

5.1 Service comparison

The following table compares the basic functions and terminology of AWS Glue, AWS Data Pipeline, and Alibaba Cloud DataWorks:

Function | Property | AWS Glue | AWS Data Pipeline | Alibaba Cloud DataWorks
Data acquisition | Real-time acquisition | Not supported | Not supported | Supported
Data acquisition | Batch acquisition | Supported | Supported | Supported
Data acquisition | Client acquisition | Supported | Supported | Supported
Data acquisition | Local data | N/A | Not supported | Supported
Data acquisition | Cloud data | Supported | Supported | Supported
Data acquisition | Heterogeneous data sources | S3, DynamoDB, RDS, Redshift, JDBC | S3, DynamoDB, RDS, Redshift, JDBC | Over 20 types (RDBMS, NoSQL, MPP, unstructured storage, big data storage, and so on)
Data management | Data discovery | Supported | N/A | Supported
Data management | Capture metadata | Supported | N/A | Supported
Data management | Version management | Supported | N/A | Not supported
Data management | Capturing schema changes | Supported | N/A | Not supported
Data management | Automatic identification detection | Supported | N/A | Not supported
Data management | Comment | Not supported | N/A | Not supported
Data management | Collecting/structuring tags | Not supported | N/A | Not supported
Data management | Data relationship | N/A | N/A | Supported
Data conversion & development | Automatic code generation | Supported | Supported | Not supported
Data conversion & development | Online editing | Supported | N/A | Supported
Data conversion & development | Version management | Supported by Git | Supported by Git | Supported
Data conversion & development | Mode | Runs in a Spark container, auto scaling | Based on the computing engine (SQL, shell scripts, EMR, Hive, Pig) | Based on the computing engine (ODPS SQL, Shell, PAI)
Orchestration and task scheduling | Trigger mode | Cycle | Cycle, event trigger, Lambda | Cycle, API trigger
Orchestration and task scheduling | Serverless | Supported | Supported | Supported
Orchestration and task scheduling | Automatic re-run | Supported | Supported | Supported
Monitoring & alarm | Monitor dashboard | Supported | Supported | Supported
Monitoring & alarm | Alarm | Supported | Supported | Supported
Data quality | Offline monitoring | Not supported | Not supported | Supported
Data quality | Online monitoring | Not supported | Not supported | Supported
Data quality | Self-defined monitoring rules | Not supported | Not supported | Supported
Openness | API | N/A | Supported | Supported
Openness | SDK | N/A | Supported | Not supported

5.2 Product comparison overview

AWS Glue

AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it cost-efficient to classify, clean, and enrich data, and to move it reliably between a variety of data stores. AWS Glue consists of a central metadata repository called the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and re-runs. AWS Glue is a serverless service, so you do not need to set up or manage infrastructure.
You can use the AWS Glue console to discover data, convert it, and make it available for searching and querying. The console calls the underlying services to coordinate the work required to transform the data. You can also use the AWS Glue API operations to edit, debug, and test Python or Scala Apache Spark ETL code in a familiar development environment.

AWS Data Pipeline

AWS Data Pipeline is a web service that enables you to automate data movement and transformation. Using AWS Data Pipeline, you can define data-driven workflows in which a task runs based on whether the previous task completed successfully.
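The success-dependent workflow described above can be sketched in a few lines. This is a conceptual illustration only and does not use the Data Pipeline API: `run_pipeline`, the step names, and the status strings are hypothetical, and a task is assumed to succeed if it returns without raising an exception.

```python
def run_pipeline(steps):
    """Run (name, task) pairs in order; skip everything after the first failure.

    A task succeeds if it returns without raising. Returns {name: status}.
    """
    status = {}
    failed = False
    for name, task in steps:
        if failed:
            # An earlier task failed, so this one never starts.
            status[name] = "skipped"
            continue
        try:
            task()
            status[name] = "succeeded"
        except Exception:
            status[name] = "failed"
            failed = True
    return status
```

For example, if the second of three steps raises an error, the first is reported as succeeded, the second as failed, and the third as skipped.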


Alibaba Cloud DataWorks

  • Product positioning: one-stop big data platform, covering data integration, data management, data development, data operation, data service sharing, data security, data quality, and other stages of the big data lifecycle
  • Methodology: cloud data warehouse, stream computing
  • Target users: data developers (data integration, data development, data operation), data managers (data management, data security, data quality), data users (data management, data service, real-time analytics)
  • How to use: web-based
  • Deployment approach: public cloud serverless, proprietary cloud
  • Development languages: SQL, Java (openmr), Python, R, and so on
  • Service level: public beta (Data Integration is officially commercial)
  • Underlying engines: MaxCompute, Blink

5.3 Advantage and disadvantage comparison

AWS Glue product advantages

  • Compatible with different storage forms through metadata abstraction

AWS Glue supports file formats such as CSV and JSON, as well as database connections over JDBC. By mapping these different forms of storage to common metadata such as database, table, and schema, Glue hides the differences between them. This reduces the difficulty of developing data conversion jobs and makes code easier to reuse.
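
The benefit of metadata abstraction can be sketched in plain Python. This is only an illustration of the idea, not Glue's Data Catalog: the `TableMeta` class, the `read_rows` and `count_by` helpers, and the sample data are hypothetical. The point is that the same processing code works on both formats once each source is described by the same metadata.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass
class TableMeta:
    """Hypothetical catalog entry: where a table lives and how it is stored."""
    database: str
    table: str
    columns: List[str]
    fmt: str  # "csv" or "json" (one JSON object per line)


def read_rows(meta: TableMeta, raw: str) -> Iterator[Dict[str, str]]:
    """Yield rows as dicts keyed by meta.columns, whatever the storage format."""
    if meta.fmt == "csv":
        for record in csv.reader(io.StringIO(raw)):
            yield dict(zip(meta.columns, record))
    elif meta.fmt == "json":
        for line in raw.splitlines():
            if line.strip():
                obj = json.loads(line)
                yield {col: str(obj.get(col, "")) for col in meta.columns}
    else:
        raise ValueError(f"unsupported format: {meta.fmt}")


def count_by(rows: Iterable[Dict[str, str]], column: str) -> Dict[str, int]:
    """Format-agnostic transformation: count rows per value of a column."""
    counts: Dict[str, int] = {}
    for row in rows:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


csv_meta = TableMeta("sales", "orders_csv", ["order_id", "country"], "csv")
json_meta = TableMeta("sales", "orders_json", ["order_id", "country"], "json")

csv_counts = count_by(read_rows(csv_meta, "1,DE\n2,FR\n3,DE\n"), "country")
json_counts = count_by(
    read_rows(json_meta,
              '{"order_id": 1, "country": "DE"}\n'
              '{"order_id": 2, "country": "FR"}\n'),
    "country",
)
```

`count_by` never needs to know whether the rows came from CSV or JSON, which is the kind of code reuse the paragraph above refers to.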

  • Preprocessing unstructured text with Classifiers

Using a Classifier, you can automatically structure unstructured text with 12 different built-in formats. You can also define custom formats, for example with Grok. Classifiers provide very good compatibility.
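
How a Grok-style custom format turns a line of text into structured fields can be sketched as follows. This is a simplified illustration, not Glue's Classifier implementation: the pattern table is a tiny, hypothetical subset of what real Grok libraries ship, and `grok_to_regex`, `classify`, and the classifier names are invented for this example.

```python
import re
from typing import Dict, List, Optional, Tuple

# A tiny, hypothetical subset of Grok's named patterns (real ones are stricter).
GROK_PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "NUMBER": r"\d+(?:\.\d+)?",
    "NOTSPACE": r"\S+",
}


def grok_to_regex(pattern: str) -> "re.Pattern":
    """Expand %{NAME:field} tokens into named regular-expression groups."""
    def expand(match: "re.Match") -> str:
        name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{GROK_PATTERNS[name]})"
    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, pattern))


def classify(line: str,
             classifiers: List[Tuple[str, "re.Pattern"]]) -> Optional[Dict[str, str]]:
    """Try each classifier in order; the first full match wins."""
    for name, regex in classifiers:
        match = regex.fullmatch(line)
        if match:
            return {"classifier": name, **match.groupdict()}
    return None


CLASSIFIERS = [
    ("access_log", grok_to_regex(
        r"%{IP:client} %{WORD:method} %{NOTSPACE:path} %{NUMBER:status}")),
    ("key_value", grok_to_regex(r"%{WORD:key}=%{NOTSPACE:value}")),
]

parsed = classify("10.0.0.7 GET /index.html 200", CLASSIFIERS)
```

Lines that match no classifier return `None`, so a caller can decide whether to drop them or keep them as raw text.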

  • Supports dynamic metadata acquisition

In addition to creating metadata manually or obtaining it with a Crawler, Glue also supports dynamic metadata acquisition. The Crawler itself supports wildcard characters and acquires metadata for new tables. Glue also supports defining the run schedule in cron format. Different processing policies are supported for tables that are added, changed, or deleted. These measures let the metadata track the data sources while maintaining a record of changes.
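
A scheduled Crawler with change-handling policies might be configured along these lines. The sketch only builds a request dictionary and does not call AWS. The field names (`Schedule`, `SchemaChangePolicy`, `UpdateBehavior`, `DeleteBehavior`, `Targets`) are believed to follow the Glue CreateCrawler request shape and the six-field `cron(...)` schedule syntax, but treat them as assumptions and check the current API reference. The helper `build_crawler_request` and all names, roles, and paths are hypothetical.

```python
import re
from typing import Any, Dict

# Glue-style schedule: cron(Minutes Hours Day-of-month Month Day-of-week Year)
GLUE_CRON = re.compile(r"^cron\((\S+) (\S+) (\S+) (\S+) (\S+) (\S+)\)$")


def build_crawler_request(name: str, role: str, database: str,
                          s3_path: str, schedule: str) -> Dict[str, Any]:
    """Assemble a crawler definition with a cron schedule and change policies."""
    if not GLUE_CRON.match(schedule):
        raise ValueError(f"expected a six-field cron(...) expression: {schedule!r}")
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": schedule,
        "SchemaChangePolicy": {
            # Changed tables: update the catalog entry.
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            # Deleted tables: mark as deprecated rather than removing them.
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",
        },
    }


# Hypothetical values: run every day at 12:15 UTC.
request = build_crawler_request(
    name="example-log-crawler",
    role="ExampleGlueCrawlerRole",
    database="example_db",
    s3_path="s3://example-bucket/logs/",
    schedule="cron(15 12 * * ? *)",
)
```

With a client library such as boto3, a dictionary like this would typically be passed as keyword arguments to the Glue client's `create_crawler` call.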

  • Closely integrated with the Spark ecosystem

Unlike AWS Data Pipeline, AWS Glue supports only Spark on YARN, and its code submission interface is open directly to users. When developing an ETL job, you can write or upload PySpark or Scala files directly online. To migrate an existing Spark script, you only need to change a few DataFrame classes.
Meanwhile, Glue also integrates a Zeppelin Notebook Server with Spark Shell as a debugging tool, which makes it convenient to manually run the Spark scripts you have written. Thanks to the Zeppelin Notebook integration, Glue can also use the computing capabilities of the Spark cluster as an OLAP tool.

AWS Glue product disadvantages

  • Supports only the Spark engine

AWS Glue supports only its own Spark engine and cannot use data in Redshift and S3 directly, which may result in additional network transmission costs.

AWS Data Pipeline product advantages

  • Supports different kinds of computing engines

With its rich set of Activity extensions, Data Pipeline supports EMR, Hive, Redshift, Pig, and SQL as computing engines.

  • Implements only core functions and combines seamlessly with other products

Data Pipeline and DataWorks have some overlapping usage scenarios, but Data Pipeline’s functional model is much simpler. Most functions need to be achieved with other AWS products.

  • Programmer-oriented, script-defined, and flexible

The Data Pipeline graphical interface exposes only a fraction of the functionality, and the experience is not ideal.
The main means of operation is JSON, which reflects Data Pipeline’s programmer-oriented design. With its well-defined grammar, JSON lets you use many of Data Pipeline’s “hidden” features in a flexible manner. Many usage scenarios, such as parameterized runs, complex scheduling settings, and state inheritance, can be easily defined in JSON.
The same functions would require more complex interaction design if exposed through a GUI, and JSON avoids that problem. Being programmer-oriented is the main design idea behind many AWS products.
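
A JSON-defined pipeline with a schedule, a dependency, and a parameter might be sketched like this. The structure (`objects`, `parameters`, `values`, `{"ref": ...}` references, and `#{...}` expressions) is believed to follow the Data Pipeline definition file syntax, but it should be checked against the AWS documentation before use. The helper `build_pipeline_definition`, the object ids, and the commands are hypothetical, and required fields such as the compute resource (`runsOn`) are omitted for brevity.

```python
import json
from typing import Any, Dict


def build_pipeline_definition(extract_command: str,
                              period: str = "1 day") -> Dict[str, Any]:
    """Build a minimal, illustrative pipeline definition as a Python dict."""
    return {
        "objects": [
            {
                "id": "DailySchedule",
                "name": "DailySchedule",
                "type": "Schedule",
                "period": period,
                "startAt": "FIRST_ACTIVATION_DATE_TIME",
            },
            {
                "id": "ExtractStep",
                "name": "ExtractStep",
                "type": "ShellCommandActivity",
                "command": extract_command,
                "schedule": {"ref": "DailySchedule"},
            },
            {
                "id": "LoadStep",
                "name": "LoadStep",
                "type": "ShellCommandActivity",
                # The parameter is referenced with an expression.
                "command": "echo loading into #{myTargetTable}",
                "schedule": {"ref": "DailySchedule"},
                # Runs only after ExtractStep has completed successfully.
                "dependsOn": {"ref": "ExtractStep"},
            },
        ],
        "parameters": [
            {"id": "myTargetTable", "type": "String",
             "description": "Hypothetical target table name"},
        ],
        "values": {"myTargetTable": "orders"},
    }


definition = build_pipeline_definition("echo extracting")
definition_json = json.dumps(definition, indent=2)
```

Keeping the definition as data makes it straightforward to generate, review, and version alongside other code, which is the flexibility the paragraphs above attribute to the JSON approach.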

AWS Data Pipeline product disadvantages

  • Simple scheduling model

Data Pipeline scheduling is time-based and takes the day as its basic scheduling unit. A task that runs more often than once a day is called a high-frequency task, and the shortest interval is 15 minutes.
Meanwhile, unlike DataWorks, each pipeline is atomic and stateless. Atomic means that pipelines are independent of each other: they cannot trigger each other, and the activities inside them are independent too. Stateless means that a pipeline itself does not support parameter input and that no variables are passed between activities.

  • Simple interface and limited functionality

The Data Pipeline interface is extremely simple, and some advanced features cannot be used through the GUI. The Activity script editor is only a simple text box, with no development aids such as syntax highlighting.
In terms of functional design, Data Pipeline focuses on task scheduling. For most other features, you need to call other AWS products:

  1. Data Pipeline supports very few types of data sources. Other types must be converted to supported types by using Glue.
  2. Data Pipeline does not support parameter input and variable passing. However, this can be achieved by using the variety of supported data nodes.
  3. Data Pipeline does not provide code management, and SQL can only be saved as plain text. JAR packages can only be uploaded through S3.

These combinations presuppose that Data Pipeline is seamlessly integrated with the other products, that the data transmission delay is small enough, and that the likelihood of compatibility problems is low enough.

5.4 Conclusion

In summary, in the areas of data warehousing and data business processes, the advantages of DataWorks are:

  • Data Integration: supports streaming control and real-time synchronization.
  • Data Development: powerful online editing capabilities, with an experience comparable to an offline IDE.
  • Monitoring Operations: supports business baseline monitoring.
  • Data Management: complete data management capabilities, plus unique functions such as classification and data desensitization.
  • Data Quality: features that competing products do not offer.