This article discusses the main differences and similarities between AWS and Alibaba Cloud in big data services. We mainly discuss the following service types and products:
1. Data collection
2. Data computing
3. Data analysis
4. Data visualization
5. Data processing
The comparison covers the products as shown in the table below.
|Service type||AWS||Alibaba Cloud|
|Data collection||AWS Kinesis||Alibaba Cloud Log Service, Alibaba Cloud DataHub|
|Data computing||AWS Elastic MapReduce, AWS Redshift||Alibaba Cloud E-MapReduce, Alibaba Cloud MaxCompute|
|Data analysis||AWS QuickSight||Alibaba Cloud Quick BI|
|Data visualization||N/A||Alibaba Cloud DataV|
|Data processing||AWS Glue, AWS Data Pipeline||Alibaba Cloud DataWorks|
Both AWS Kinesis and Alibaba Cloud Log Service & DataHub can be used to extract and collect data into their respective cloud environments or the corresponding data models. However, each product uses a different service model.
The following table compares the basic functions and terminologies of AWS Kinesis vs. Alibaba Cloud Log Service and Alibaba Cloud DataHub.
|Feature||AWS Kinesis||Alibaba Cloud Log Service||Alibaba Cloud DataHub (public beta on the China site)|
|Client support & collection methods||Native Agent||Native Agent||Native Agent|
| ||Open-source client||Open-source client||Open-source client|
| ||API||Over 30 collection agents, such as Logstash and Fluentd.||Multiple collection sources, such as mobile devices, applications, website services, and sensors.|
|Retention days||1 ~ 7 days||1 ~ 365 days||7 days|
|Stream computing support||Open-source stream computing engines, plus Kinesis Analytics.||Open-source stream computing engines, ARMS and StreamCompute (which will be launched on the international site later), and CloudMonitor.||Supports the StreamCompute stream computing engine.|
|Deployment Location||Region||Region (global)||Region (public beta)|
|Shipping destination||S3/Redshift/ES||OSS/MaxCompute/Table Store||OSS/MaxCompute, and so on|
|Maximum record size||1 MB||3 MB||1 MB|
|Throughput||5 MB/s, 5000 records/s.||No upper limit, elastic.||Supports up to several TB of daily data input per topic, with each shard supporting several hundred GB of daily data input.|
|Delay||S3/ES: 60~900 s; Redshift: >60 s||OSS/Table Store: 60~900 s; MaxCompute: 15 min.||Maximum delay: 5 min.|
|Storage cost||USD 0.02/GB||USD 0.01/GB||Public beta, free for the time being.|
|ETL support||Lambda||JSON/CSV/Parquet, Function Compute||Connected to the MaxCompute and Blink platforms, Alibaba Cloud DataHub supports all ETL tools on these two platforms.|
|Pricing strategy||Kinesis pricing||Log Service pricing||Public beta, free for the time being.|
|Security||Supports customization of permissions, group, and access control of users and roles.||HTTPS + Transmission signature + Multi-tenant isolation + Access control||Provides enterprise-level multi-layer security protection and multi-user resource isolation mechanism.Provides various authentication and authorization mechanisms, as well as whitelist and primary/subaccount features.|
AWS Kinesis is a cloud service that supports stream computing. It enables users to collect and process data in real time. AWS Kinesis provides multiple core capabilities to process the corresponding data flows economically and effectively, and it has the flexibility to let you choose the tools that best fit your application's needs. By default, records added to a data stream are accessible for up to 24 hours after being added. You can increase the data retention period to seven days by enabling extended data retention. The maximum data blob size in a record is 1 MB. You can use the REST API or the Kinesis Producer Library (KPL) to send data to AWS Kinesis.

Alibaba Cloud Log Service provides all-in-one solutions for log collection, log processing, and real-time log analysis. The collection component, LogHub, supports clients, web pages, protocols, SDK/API (mobile apps and games), and many other log collection methods. All log collection methods are implemented based on RESTful APIs, and you can also use the API/SDK to implement new collection methods. The maximum data block size supported by Alibaba Cloud Log Service is 3 MB. You can choose a data retention period from 1 to 365 days. It also supports rich ETL and elastic throughput (without an upper limit).

Alibaba Cloud DataHub is currently in public beta, and is only targeted at the Chinese market; the international version will be developed later. Alibaba Cloud DataHub can continually collect, store, and process data from mobile devices, application software, website services, sensors, and other units that generate streaming data. The maximum data block size supported by Alibaba Cloud DataHub is 1 MB, with a data retention period of seven days. By connecting to the MaxCompute and Blink platforms, DataHub supports all ETL tools on these two platforms.
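The record-size limits above lend themselves to a quick pre-flight check. The sketch below (plain Python, no SDK calls; limits taken from the comparison table) validates a payload against each service's maximum before sending:

```python
# Maximum record/blob sizes from the comparison above (bytes).
MAX_RECORD_SIZE = {
    "kinesis": 1 * 1024 * 1024,      # AWS Kinesis: 1 MB
    "log_service": 3 * 1024 * 1024,  # Alibaba Cloud Log Service: 3 MB
    "datahub": 1 * 1024 * 1024,      # Alibaba Cloud DataHub: 1 MB
}

def fits(service: str, payload: bytes) -> bool:
    """Return True if the payload fits within the service's record limit."""
    return len(payload) <= MAX_RECORD_SIZE[service]

# A 2 MB payload fits Log Service but not Kinesis or DataHub.
payload = b"x" * (2 * 1024 * 1024)
print(fits("log_service", payload))  # True
print(fits("kinesis", payload))      # False
```

In practice an oversized payload would need to be split across multiple records before calling the service's put API.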
AWS Kinesis can use AWS Kinesis Firehose to load streaming data into data storage, delivering data to AWS S3, AWS Redshift, or AWS Elasticsearch Service. Alibaba Cloud Log Service can use LogShipper to deliver the collected data to Alibaba Cloud's storage products, such as OSS, Table Store, and MaxCompute, in real time. You only need to complete configuration on the console. In addition, LogShipper provides a complete status API and an automatic retry function. LogShipper can also be used in concert with E-MapReduce (Spark, Hive) and MaxCompute to conduct offline computing. The DataHub service also supports distributing streaming data to various cloud products, such as MaxCompute (formerly known as ODPS) and OSS.

The price of AWS Kinesis Streams is based on two core dimensions, shard hours and PUT payload units, plus an optional dimension, extended data retention. Data is retained for 24 hours by default; once you enable extended data retention, you are charged an additional rate for each shard hour incurred by your stream. AWS Kinesis Streams uses a provisioning model, which means you must pay for the provisioned resources even if you choose not to use some or all of them. The price of AWS Kinesis Firehose is based on the data transmission volume.
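Because Kinesis Streams bills per provisioned shard hour, it helps to estimate how many shards a workload needs. The sketch below assumes the commonly documented per-shard write limits of 1 MB/s and 1,000 records/s; the hourly rate is a placeholder for illustration, not current AWS pricing:

```python
import math

def shards_needed(mb_per_sec: float, records_per_sec: float) -> int:
    """Estimate shards to provision, assuming per-shard write limits
    of 1 MB/s and 1,000 records/s (commonly documented for Kinesis)."""
    return max(math.ceil(mb_per_sec / 1.0), math.ceil(records_per_sec / 1000.0))

def monthly_shard_cost(shards: int, rate_per_shard_hour: float = 0.015) -> float:
    """Shard-hour cost over a 30-day month; the rate is a placeholder,
    not current AWS pricing."""
    return shards * 24 * 30 * rate_per_shard_hour

print(shards_needed(3.5, 2500))         # 4 (bandwidth, not record count, dominates)
print(round(monthly_shard_cost(4), 2))  # 43.2 at the placeholder rate
```

This also illustrates the provisioning model described above: the shard-hour charge accrues whether or not the stream's full capacity is used.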
Alibaba Cloud Log Service uses the Pay-As-You-Go pricing method: you are billed monthly based on the volume of resources used at each stage. If you have a free credit line for Log Service, you are not charged for the volume within the credit line, and are only charged for the excess. In addition, resource packs are available to provide you with better offers. Alibaba Cloud DataHub is currently at the public beta stage and is free for the time being.
After collecting data to the corresponding cloud environment, these products can convert data, filter the data, and then compute on the data based on your needs.
The following table compares the basic functions and terminologies of AWS Elastic MapReduce vs Alibaba Cloud E-MapReduce.
|Item||AWS Elastic MapReduce||Alibaba Cloud E-MapReduce|
|Open-source foundation||Apache Hadoop and Apache Spark||Apache Hadoop and Apache Spark|
|Dimensional unit||Node (master, core, and task nodes)||Node (master and slave nodes, scalable)|
|Unit of Work||Step||Job|
|Computing model||MapReduce, Apache Hive, Apache Pig, Apache Spark, Spark SQL, and PySpark.||MapReduce, Apache Hive, Apache Pig, Apache Spark, Spark SQL, HBase, and so on.|
|Customization||Pilot operation||Pilot operation|
The following table compares the basic functions and terminologies of AWS Redshift vs. Alibaba Cloud MaxCompute.
|Item||AWS Redshift||Alibaba Cloud MaxCompute|
|Computing level||EB level||EB level|
|Data source||AWS S3, DynamoDB, activity logs, Kinesis, web app servers…||Application-generated data (ApsaraDB for RDS/OSS/AnalyticDB/SLS…), existing data centers (Oracle DB), independent data sets (Hadoop clusters)…|
|Provisioning unit||Nodes||N/A (fully managed)|
|Data security||Uses VPC to isolate clusters, and KMS to manage keys.||Provides multi-layer sandbox protection/monitoring, project-based data protection mechanism, package authorization, Trusted mode, as well as RAM and ACL authorizations.|
|Backup management||Snapshots||Cluster disaster recovery|
|Data format||TEXTFile, SequenceFile, RCFile, AVRO, Parquet, ORC, and so on.||TEXTFile, SequenceFile, RCFile, AVRO, Parquet, ORC, and so on.|
|Ecological connectivity||JDBC and ODBC.||JDBC, ODBC, R, Python Pandas, and IntelliJ IDEA.|
|Community compatibility||PostgreSQL compatible||Standard SQL, MR, and Tunnel statements.|
Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to run data processing frameworks. Amazon EMR consumes and processes real-time data from Amazon Kinesis, Apache Kafka, or other data streams with Spark Streaming. Amazon EMR performs streaming analytics in a fault-tolerant way and writes results to AWS S3 or HDFS. Amazon EMR can be used to quickly and cost-effectively perform data transformation (ETL) workloads, such as sort, aggregate, and join, on large datasets.

Alibaba Cloud E-MapReduce is a big data processing system solution running on the Alibaba Cloud platform. E-MapReduce is built on Alibaba Cloud Elastic Compute Service (ECS) based on open-source Apache Hadoop and Apache Spark. It facilitates use of the other peripheral systems in the Hadoop and Spark ecosystems (for example, Apache Hive, Apache Pig, and HBase) to analyze and process data. Moreover, you can also easily export and import data to and from other cloud data storage systems and database systems, such as Alibaba Cloud OSS and Alibaba Cloud ApsaraDB for RDS.

AWS Redshift is an enterprise-level data warehouse with relational database query and management capabilities. AWS Redshift supports multiple types of applications, including business intelligence (BI), reporting, and data analytics tools, which can establish client connections to it. AWS Redshift Spectrum allows you to query and process data directly in AWS S3 as needed.

Alibaba Cloud MaxCompute is the largest big data cloud service platform in China, and provides massive data storage, massive data computing, and data exchange among multiple organizations. Alibaba Cloud MaxCompute is a large distributed computing system independently developed by Alibaba Group. MaxCompute supports multi-cluster active-active disaster recovery, so you do not have to worry about infrastructure stability and can concentrate on your own business. MaxCompute also ensures data consistency and the continuity of its services.
Alibaba Cloud MaxCompute provides users with a comprehensive set of big data development tools to improve data import and export solutions, as well as various classic distributed computing models to quickly solve massive data computation, effectively reduce enterprise cost, and safeguard data security.

Alibaba Cloud MaxCompute has better support for ecological connectivity and community compatibility. In terms of data format support, Alibaba Cloud MaxCompute and AWS Redshift are basically tied. Both have their own security policies, but the security policies of Alibaba Cloud MaxCompute are more extensive. In terms of data backup, AWS Redshift's automatic snapshot function can continuously back up data from the clusters to AWS S3; snapshots are created automatically in a continuous and incremental manner, and AWS Redshift stores your snapshots for a customizable period of 1 to 35 days. For Alibaba Cloud MaxCompute, data is stored in the Apsara system's clusters. The Apsara Distributed File System stores data in triplicate and uses a multi-master mechanism to ensure master availability and data reliability, guaranteeing both high data availability and high service availability. Alibaba Cloud MaxCompute also supports timed data backup.

In addition, MaxCompute has developed the next-generation engine MaxCompute 2.0. Serving as the internal big data platform of Alibaba Group and Alibaba Cloud, MaxCompute 2.0 features high performance and low cost, the most fundamental indicators of a computing platform, and its architecture and performance are constantly being optimized. In terms of language support, it offers NewSQL, a new-generation big data language that combines both imperative and declarative advantages. With regard to multi-machine collaboration, more than 10 clusters have been deployed, and data operations are subject to smart scheduling among clusters.
MaxCompute also has multi-cluster disaster tolerance capabilities to ensure financial-grade stability. In terms of computing models, MaxCompute supports batch MR, DAG-based processing, interactive analysis, in-memory computing, machine learning, and many other computing models, and achieves open-source compatibility by collaborating with other computing platforms.
The service model of Alibaba Cloud E-MapReduce is very similar to that of AWS EMR. Taking full advantage of the open-source big data ecosystems, including Hadoop, Spark, Hive, Storm, and Pig, E-MapReduce provides users with an all-in-one big data processing and analysis solution that covers cluster, job, and data management. When using these two services, users may create a cluster that contains multiple nodes: one master node and a variable number of worker nodes.

Both AWS EMR and Alibaba Cloud E-MapReduce support manual adjustment of the node quantity within a cluster after launching the cluster. Decisions about cluster size and scaling operations are made by the user or administrator who monitors the cluster's performance and usage. Users of these two products are charged by the number of nodes provisioned. For example, if a user wants to scale a cluster up or down, to increase resources during high-usage periods or reduce cost during low-usage periods, the user must do it manually.

MaxCompute provides higher flexibility and security, extensive functions, an integrated architecture, elastic scaling methods, and a variety of supported tools. In addition, DataWorks is closely linked with MaxCompute and provides MaxCompute with all-in-one solutions for data synchronization, task development, data workflow development, data management, data O&M, and other functions. For details, see DataWorks.

AWS EMR supports on-demand pricing as well as short-term and long-term discounts. Both AWS EMR and E-MapReduce use hourly pricing. When purchasing E-MapReduce clusters, Alibaba Cloud ECS is purchased automatically, so you do not need to prepare ECS in advance. If you are entitled to a discount for ECS, you enjoy the same discount when purchasing ECS here. For details, refer to the E-MapReduce pricing descriptions.

AWS Redshift pricing options include:
- On-Demand pricing: no upfront costs. You pay an hourly rate based on the type and number of nodes in your cluster.
- Amazon Redshift Spectrum pricing: enables you to run SQL queries directly against all of your data in AWS S3. You pay for the number of bytes scanned.
- Reserved Instance pricing: save costs by committing to use Redshift for a certain period of time.
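As a rough illustration of the two main billing dimensions above, the helpers below compute an on-demand monthly estimate and a Spectrum per-query estimate. Both rates are placeholders chosen for the example, not current AWS pricing:

```python
def on_demand_monthly(nodes: int, hourly_rate: float, hours: int = 720) -> float:
    """On-demand: an hourly rate per node times node count.
    The rate is a placeholder, not current AWS pricing."""
    return nodes * hourly_rate * hours

def spectrum_query_cost(bytes_scanned: int, rate_per_tb: float = 5.0) -> float:
    """Spectrum: billed by the number of bytes scanned in S3.
    The per-TB rate is a placeholder, not current AWS pricing."""
    return (bytes_scanned / 1024**4) * rate_per_tb

print(on_demand_monthly(2, 0.25))       # 360.0 for a 2-node cluster, 30 days
print(spectrum_query_cost(1024**4))     # 5.0 for a query scanning 1 TB
```

The key difference the arithmetic shows: on-demand cost accrues with time regardless of usage, while Spectrum cost accrues only with data actually scanned.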
MaxCompute offers two pricing options:
- Volume-based post-payment: taking the volume of resources consumed by jobs as the measurement indicator, you pay after the jobs are executed.
- CU-based pre-payment: you can reserve a certain quantity of resources in advance. CU-based pre-payment is only supported on the Alibaba Cloud big data platform. For detailed pricing descriptions, refer to MaxCompute.
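To see how the two options trade off, a simple break-even comparison can be sketched. Every rate below is hypothetical, used only to show the mechanics of choosing between post-payment and reserved CUs:

```python
def pay_as_you_go_cost(monthly_jobs: int, avg_cost_per_job: float) -> float:
    """Post-payment: billed per job after execution (rate is hypothetical)."""
    return monthly_jobs * avg_cost_per_job

def cu_prepaid_cost(cu_count: int, price_per_cu: float = 22.0) -> float:
    """Pre-payment: a fixed monthly fee per reserved CU (price is hypothetical)."""
    return cu_count * price_per_cu

def cheaper_option(monthly_jobs: int, avg_cost_per_job: float, cu_count: int) -> str:
    payg = pay_as_you_go_cost(monthly_jobs, avg_cost_per_job)
    prepaid = cu_prepaid_cost(cu_count)
    return "pay-as-you-go" if payg < prepaid else "CU pre-paid"

print(cheaper_option(300, 0.5, 10))   # 150 vs 220 -> pay-as-you-go
print(cheaper_option(1000, 0.5, 10))  # 500 vs 220 -> CU pre-paid
```

The general pattern holds regardless of the actual rates: light or bursty workloads favor post-payment, while steady heavy workloads favor reserved CUs.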
Data analysis computes, processes, and analyzes the collected big data, and converts it into information that is useful to the enterprise, providing value for enterprise planning, product R&D, and market research.
The following table compares the basic functions and terminologies of Alibaba Cloud Quick BI vs. AWS QuickSight.
|Item||AWS QuickSight||Alibaba Cloud Quick BI|
|Data connection||Relational databases, multidimensional databases, NoSQL databases, Hadoop & local files.||Relational databases, multidimensional databases, NoSQL databases, Hadoop & local files, Alibaba ecosystem.|
|Data model||Cube support, system time cycle (date, week, month, quarter, year), offline data source acceleration (ApsaraDB for RDS acceleration, high cost).||Cube support, system time cycle (date, week (7 types), month, quarter, year, MTD, QTD, YTD, fiscal year), offline data source acceleration (computation acceleration, full coverage, low cost).|
|Report generation||Standard tables, composite electronic reports.||Standard tables, composite electronic reports (Excel-style capabilities).|
|Data visualization||Data components (14 types), visual screen creation, widget filtering (time, drop down, button).||Data components (16 types), visual screen creation, widget filtering (time, drop down, text, button, comparison, comment).|
|Dashboard & sharing||Supported||Supported|
|Permission management||Assigns ADMIN or User roles.||Includes organization permission management and row-level permission management.|
|Data view||Mobile and web terminals, DirectMail.||Mobile and web terminals, portal creation, DingTalk account support, DirectMail.|
|System capability||Professionalism (enterprise level BI), easy-to-use (good web page interaction), integration (third-party embedding supported).||Professionalism (enterprise level BI), easy-to-use (excellent web page interaction), integration (third-party embedding supported).|
Both AWS QuickSight and Alibaba Cloud Quick BI are cloud-computing-based business analysis services that can:
- provide smart data modeling.
- integrate the scale advantage and flexibility of cloud computing into business analysis to solve business pain points.
- help enterprises complete data analysis and data animation.
- provide highly efficient capabilities and methods for business digitization.
QuickSight uses SPICE (Super-fast, Parallel, In-memory Calculation Engine) to provide quick-response query performance, and allows quick interactive analysis on various AWS data sources. Alibaba Cloud Quick BI has a built-in intelligent query acceleration engine that enables real-time online analysis of massive amounts of data. Without a large amount of data preprocessing, Quick BI can smoothly analyze massive amounts of data, which significantly improves analysis efficiency. In addition, Quick BI supports multiple data sources, including Alibaba Cloud data sources and data sources related to the Alibaba Group ecosystem.

Both AWS QuickSight and Alibaba Cloud Quick BI support Cube (multidimensional database, or multidimensional data cube). Using Cube, you can compress the required data, which is especially useful when processing large data sizes; for example, FineCube for FineBI can avoid data modeling and increase data processing speed. In terms of offline data acceleration, Alibaba Cloud Quick BI uses computation acceleration with full coverage, at a lower cost than the ApsaraDB for RDS acceleration used by AWS QuickSight. In addition, Alibaba Cloud Quick BI supports a wider range of data types within the system time cycle.

Both support standard tables, but Alibaba Cloud's composite electronic report (with Excel-style capabilities) performs better. Note that the standard edition only contains the worksheet function; only the advanced editions have the electronic report function.

Data visualization: AWS QuickSight uses a technology called AutoGraph, which chooses the most appropriate visual type based on the data attributes (such as numbers and data types) you select. Alibaba Cloud Quick BI supports extensive data visualization effects to meet the data presentation demands of different scenarios.
Besides, it automatically recognizes data features and smartly recommends an appropriate visualization solution. In terms of permission management: when an AWS QuickSight account is created, it has the ADMIN permission by default. AWS QuickSight users can invite other users and assign them the ADMIN or USER role. Alibaba Cloud Quick BI's security-controlled data permission management includes internal organization member management, and supports row-level data permissions, to meet different permission requirements for different users. In terms of data sharing and data view: AWS QuickSight allows users to use the sharing icon on the service interface to share analysis results, dashboards, and tables. Before sharing something with others, users can choose the recipient (email address, user name, or group name), the permission level, and other options. Similarly, Alibaba Cloud Quick BI supports sharing worksheets/spreadsheets, dashboards, and data portals with other logged-on users, and publishing dashboards to the Internet for access by non-logged-on users. Data view support: both Alibaba Cloud Quick BI and AWS QuickSight support data viewing on mobile terminals, on web terminals, and through DirectMail. Alibaba Cloud Quick BI also supports DingTalk accounts, which is convenient for DingTalk users, and supports data portal creation, allowing users to drag-and-drop dashboards to create a data portal, embed links to dashboards, and configure basic settings for templates and the menu bar.
Both AWS QuickSight and Alibaba Cloud Quick BI support enterprise-level BI and third-party integration. Alibaba Cloud Quick BI offers flexible report integration solutions, which allow you to embed reports created in Alibaba Cloud Quick BI into your own system, and directly access the reports from your system without logging on to Alibaba Cloud Quick BI. Quick BI is easy to use: with an intelligent data modeling tool, Quick BI greatly reduces data acquisition cost, and its drag-and-drop operation and extensive visual chart controls allow you to easily complete data perspective analysis, self-service data acquisition, business data profiling, report making, and data portal creation.

Both AWS QuickSight and Alibaba Cloud Quick BI are priced based on the number of users and the subscription duration, and both provide two editions (standard and enterprise) with different pricing options. AWS QuickSight requires an annual subscription. A purchased Alibaba Cloud Quick BI instance can last for at most one year; you can select the number of users and the service length. When your Quick BI instance is about to expire, the system sends a message reminding you to renew it in time.
DataV is a powerful and easy-to-use data visualization tool, which has extensive geographical presentation functions and user-friendly interfaces.
- Presentation: presents business performance data (investor relationship, public relations, exhibitions, road shows, and reception).
- Monitoring: uses data to boost business growth (real-time monitoring, alert, and quick response support).
- Data-driven: discovers hidden data value (real-time presentation of multidimensional data may bring new responsibilities).
DataV offers two product editions for public cloud users: Basic Edition and Enterprise Edition. Prices and feature details are listed as follows.
|Item||Basic (USD 360 Annually)||Enterprise (USD 3,000 Annually)|
|Sharing - share projects publicly||Yes||Yes|
|Sharing - share with password||N/A||Yes|
|Sharing - share with access token||N/A||Yes|
|Sharing - transfer projects to another user||Available only when target user is using Enterprise Edition.||Available only when target user is using Enterprise Edition.|
|Projects and templates - available templates||5||All templates (updating)|
|Projects and templates - available projects||5||20|
|Data source - ApsaraDB for RDS for MySQL||Yes||Yes|
|Data source - Analytic DB||Yes||Yes|
|Data source - MySQL Compatible Database||Yes||Yes|
|Data source - CSV||Yes||Yes|
|Data source - API||Yes||Yes|
|Data source - Static JSON||Yes||Yes|
|Data source - DataV Data Proxy Service||Yes||Yes|
|Data source - ApsaraDB for RDS for PostgreSQL||N/A||Yes|
|Data source - ApsaraDB for RDS for SQL Server||N/A||Yes|
|Data source - HybridDB for PostgreSQL||N/A||Yes|
|Data source - Alibaba Cloud API Gateway||N/A||Yes|
|Data source - Table Store||N/A||Yes|
|Data source - Alibaba Cloud intranet IP||N/A||Yes|
|Data source - OSS||N/A||Yes|
|Data source - Alibaba Cloud Log Service||N/A||Yes|
|Data source - Oracle||N/A||Yes|
|Data source - SQL Server||N/A||Yes|
|Visualization widgets - basic charts||Yes||Yes|
|Visualization widgets - basic maps||Yes||Yes|
|Visualization widgets - advanced maps||N/A||Yes|
|Visualization widgets - ECharts||N/A||Yes|
Data processing carries out data transfer, data conversion, and other related operations, introducing data from different data sources to transform and process it. Finally, the data is exported to other data systems, completing the entire process of data acquisition, conversion, development, and analysis.
The following table compares the basic functions and terminologies of AWS Glue and AWS Data Pipeline vs. Alibaba Cloud DataWorks:
|Function||Property||AWS Glue||AWS Data Pipeline||Alibaba Cloud DataWorks|
|Data acquisition||Real-time acquisition||Not supported||Not supported||Supported|
| ||Local data||N/A||Not supported||Supported|
| ||Heterogeneous data sources||S3, DynamoDB, RDS, Redshift, JDBC||S3, DynamoDB, RDS, Redshift, JDBC||Over 20 types supported (RDBMS, NoSQL, MPP, unstructured storage, big data storage, etc.)|
|Data management||Data discovery||Supported||N/A||Supported|
| ||Version management||Supported||N/A||Not supported|
| ||Capturing schema changes||Supported||N/A||Not supported|
| ||Automatic identification detection||Supported||N/A||Not supported|
| ||Comment||Not supported||N/A||Not supported|
| ||Collecting/structuring tags||Not supported||N/A||Not supported|
|Data Conversion & Development||Automatic code generating||Supported||Supported||Not supported|
| ||Version management||Supported by Git||Supported by Git||Supported|
| ||Mode||Runs in Spark containers, with auto scaling||Based on computing engines (SQL, Shell scripts, EMR, Hive, Pig)||Based on computing engines (ODPS SQL, Shell, PAI)|
|Orchestration and task scheduling||Trigger mode||Cycle||Cycle, event trigger, Lambda||Cycle, API trigger|
|Monitoring & Alarm||Monitor dashboard||Supported||Supported||Supported|
|Data quality||Offline monitoring||Not supported||Not supported||Supported|
| ||Online monitoring||Not supported||Not supported||Supported|
| ||Self-defined monitoring rules||Not supported||Not supported||Supported|
AWS Glue is a fully managed ETL (extract, transform, and load) service that economically and efficiently classifies, cleans, and enriches data, and reliably moves it between a variety of data stores. AWS Glue consists of a central metadata repository called the AWS Glue Data Catalog, an ETL engine that autogenerates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and re-runs. AWS Glue is a serverless service, so you do not need to set up or manage any infrastructure.
You can use the AWS Glue console to discover data, convert it, and make it available for searching and querying. The console calls the underlying services to coordinate the work required to transform the data. You can also use the AWS Glue API operations to edit, debug, and test Python or Scala Apache Spark ETL code in a familiar development environment.
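To illustrate the kind of transform an autogenerated Glue script performs, here is a framework-free sketch of a field-mapping step. A real Glue job would use the awsglue DynamicFrame API (for example its ApplyMapping transform) on Spark; the field names below are made up for the example:

```python
# Minimal, dependency-free sketch of a Glue-style field mapping:
# each triple maps (source_field, target_field, cast_function).
def apply_mapping(records, mapping):
    """Rename and cast fields of each record per the mapping triples."""
    out = []
    for rec in records:
        out.append({target: cast(rec[source]) for source, target, cast in mapping})
    return out

# Hypothetical input rows and mapping, mimicking a generated ETL script.
rows = [{"user_id": "42", "signup": "2018-01-05"}]
mapping = [("user_id", "id", int), ("signup", "signup_date", str)]
print(apply_mapping(rows, mapping))  # [{'id': 42, 'signup_date': '2018-01-05'}]
```

In a generated Glue script the equivalent step runs distributed on Spark, but the shape of the operation, declarative rename-and-cast triples, is the same.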
AWS Data Pipeline is a web service that enables you to automate data movement and transformation. Using AWS Data Pipeline, you can define a data-driven workflow in which a task performs subsequent operations based on whether the previous task completed successfully.
- Product positioning: a one-stop big data platform, covering data integration, data management, data development, data operation, data service sharing, data security, data quality, and the other stages of the big data lifecycle
- Methodology: cloud data warehouse, streaming computing
- Target users: data developers (data integration, data development, data operation), data managers (data management, data security, data quality), data users (data management, data services, real-time analytics)
- How to use: web-based
- Deployment approach: public cloud serverless, proprietary cloud
- Development languages: SQL, Java (OpenMR), Python, R, etc.
- Service level: public beta (Data Integration is officially commercial)
- Underlying engine: MaxCompute, Blink.
- Compatible with different storage forms by metadata abstraction
AWS Glue supports unstructured files, including CSV and JSON, and database connections in the form of JDBC. By mapping different forms of storage to metadata such as databases, tables, and schemas, the differences between them are abstracted away. This reduces the development difficulty of the data conversion process while enabling code reuse.
- Preprocessing unstructured text with Classifier
Using Classifier, you can automatically structure the obtained unstructured text with 12 different built-in formats. You can also customize formats in ways such as Grok. Classifier provides very good compatibility.
- Supports dynamic metadata acquisition
In addition to manually creating metadata or getting it via Crawler, Glue also supports dynamic metadata acquisition. Crawler itself supports wildcard characters and metadata acquisition for new tables. Glue also supports defining the run plan in cron format, and different processing policies are supported for tables that are added, changed, or deleted. These measures enable metadata to track data sources while maintaining change records.
- Closely integrated with the Spark ecosystem
Unlike AWS Data Pipeline, AWS Glue only supports Spark on YARN, and its code submission interface is opened directly to users. During the development of an ETL job, you can develop or upload PySpark or Scala files directly online. You only need to change several DataFrame classes to complete the migration of existing Spark scripts.
Meanwhile, Glue also integrates the Zeppelin notebook server with the Spark shell as a debugging tool, which is convenient for users who want to manually run the written Spark scripts. Thanks to the Zeppelin notebook integration, Glue can also use the computing capabilities of the Spark cluster as an OLAP tool.
- Only supports the Spark engine
AWS Glue only supports its own Spark engine and cannot use data in Redshift and S3 directly, which may result in additional network transmission costs.
- Supports different kinds of computing engines
Rich in Activity extensions, Data Pipeline supports EMR, Hive, Redshift, Pig, and SQL as its computing engines.
- Only implements core functions, and combines seamlessly with other products
Data Pipeline and DataWorks have certain overlapping usage scenarios, but Data Pipeline's functional model is much simpler. Most functions need to be achieved with other AWS products.
- Programmer-oriented, scripting definition, flexible
The Data Pipeline graphical interface exposes only a fraction of the functionality, and the experience is not ideal.
The main means of operation is JSON, which reflects Data Pipeline's programmer-oriented philosophy. By means of a well-defined grammar, JSON lets you use many of Data Pipeline's "hidden" features in a flexible manner. Many usage scenarios, such as parameterized operation, complex scheduling settings, and state inheritance, can all be easily defined using JSON.
The same functions would require more complex interaction designs if exposed through a GUI, but the use of JSON avoids these problems. Programmer orientation is the main design idea behind many of AWS's products.
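As an illustration of this JSON-first approach, a minimal pipeline definition can be assembled as plain data: a Schedule object plus an activity that references it by id. The fields below follow the general pipeline-definition shape (an "objects" list of id/type entries with "ref" links); the ids and values are made up:

```python
import json

# Illustrative (not exhaustive) pipeline definition: one schedule object
# and one activity referencing it. All names/values here are hypothetical.
pipeline = {
    "objects": [
        {"id": "DailySchedule", "type": "Schedule",
         "period": "1 day", "startDateTime": "2018-01-01T00:00:00"},
        {"id": "CopyActivity", "type": "CopyActivity",
         "schedule": {"ref": "DailySchedule"}},
    ]
}
definition = json.dumps(pipeline, indent=2)
print(definition)
```

Because the definition is ordinary JSON, it can be versioned, templated, and generated programmatically, which is exactly the flexibility the GUI hides.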
- Simple scheduling model
Data Pipeline scheduling is time-based, and takes the day as the basic scheduling unit (a task with a period of less than a day is called a high-frequency task; the shortest period is 15 minutes).
Meanwhile, unlike DataWorks, each Pipeline is atomic and stateless. Atomicity means that different Pipelines are independent of each other: Pipelines cannot trigger one another, and the activities inside them are independent too. Statelessness means that a Pipeline itself does not support parameter input, and no variables are passed between activities.
- Simple interface and limited functionality
Data Pipeline's interface interaction is extremely simple, and some advanced features cannot be used from the GUI. Editing an Activity script only provides a plain text box, with no development aids such as syntax highlighting.
For functional design, Data Pipeline focuses on task scheduling. For most other features, you need to call other AWS products:
- Data Pipeline supports very few types of data sources; other types must be converted to supported types using Glue.
- Data Pipeline does not support parameter input and variable transfer; however, this can be achieved through the support of a variety of DataNodes.
- Data Pipeline also does not provide code management, and SQL can only be saved in plain text. Uploading a JAR package can only be done via S3.
The premise of these combinations is that Data Pipeline is seamlessly integrated with other products, the delay in data transmission is small enough, and the possibility of compatibility problems is low enough.
In summary, in data warehouse and data business process areas, the advantages of DataWorks are:
- Data Integration: supports flow control and real-time synchronization.
- Data Development: powerful online editing capabilities, with an experience comparable to an offline IDE.
- Monitoring Operations: supports business baseline monitoring.
- Data Management: complete data management capabilities, plus unique functions such as classification and data masking.
- Data Quality: features unique among competitors.