Build an Elastic Big Data Cloud Analysis Platform with Zero Barriers to Entry

Big data and big data analytics have become a focus of attention for enterprises. The big data analysis platform is shifting from a premium option to standard equipment: it is the foundation on which an enterprise pursues the goal of "all business data, all data business", that is, every business process produces data, and every piece of data serves the business.

Bao Yuansong, Senior Solution Architect at Alibaba Cloud Intelligence, gave a talk titled "Building an Elastic Big Data Cloud Analysis Platform with Zero Barriers to Entry", in which he broke the construction of a big data analysis platform into stages and explained each stage in detail.

The figure below shows the four stages of big data analysis platform construction: self-built, cloud hosting, cloud service, and cloud native.

Self-built big data analysis platforms

Why did enterprises build their own big data analysis platforms? There are three main reasons:

1: Traditional data analysis technology could no longer cope with big data, so new technologies had to be introduced.
2: Early big data technology was relatively immature and unreliable, so enterprises needed dedicated technical staff to study it.
3: The market lacked proven cases and practices of effective big data analysis, so enterprises had to cross the river by feeling for the stones.

The self-built big data analysis platform is an asset-heavy model with many shortcomings, mainly the following:

1: Long cycle: the construction cycle is extremely long, spanning data center selection, hardware procurement, cluster deployment, testing and tuning, data services, and operation and maintenance management.
2: High cost: costs fall into two categories. Explicit costs include servers, storage, networking, operation and maintenance, and IDC fees; implicit costs include business impact, idle resources, elastic expansion, and one-time capital outlay. These costs are certain, but the returns are not.
3: High barrier to entry: big data technology has developed rapidly in recent years, and every dimension of it (data integration, data storage, analytics and computing, data operations) contains many specialized technologies. Any one of them requires dedicated staff for in-depth study, so for an ordinary enterprise the talent bar is high.
4: Slow results: the big data analysis platform must be iterated and revised from start to finish until data quality meets expectations and the analysis results are trustworthy; only then does it truly deliver elastic performance, high reliability, and applicability across scenarios.

Cloud hosting of the big data analysis platform

Against the backdrop of these shortcomings of the self-built platform, cloud hosting emerged to meet demand, for three reasons:

1: Enterprises wanted to shed the burden of heavy assets.
2: Big data technology was maturing, so enterprises no longer had to focus on the technology itself; they simply needed a team with big data skills to do big data development.
3: Cloud vendors combined their own strengths to provide hosted big data platforms on the cloud.

A self-built big data analysis platform is usually based on the open source Hadoop stack. Cloud hosting turns that self-built open source Hadoop platform into an enterprise-grade, standardized big data analysis platform, with unified cluster management, complete monitoring and alerting, separation of compute and storage, elastic scaling, on-demand provisioning, data security, low-barrier operation and maintenance, and rich integration with the cloud ecosystem.

Alibaba Cloud EMR (E-MapReduce) provides platform capabilities spanning basic resources, platform management, data storage, data integration, compute engines, data consumption, and job management. It ships with complete monitoring and alerting for every component, so any component anomaly triggers an immediate alert and notification. On top of the platform, it also provides intelligent operation and maintenance, scheduling, and other functions.

Next, let's look more closely at the advantages of cloud hosting from three perspectives: infrastructure, operation and maintenance management, and the cloud ecosystem.

Cloud hosting infrastructure

First, the cloud offers a rich family of instance types. Alibaba Cloud's virtual machines fall into three categories: general-purpose computing, heterogeneous computing, and bare metal & high-performance computing. Each category targets different scenarios, so a big data analysis platform tailored to a given scenario can be built quickly.

Second, by exploiting the elasticity of the cloud, compute and storage resources can be scaled independently, whether to absorb business peaks or to pursue extreme performance, and clusters can be built on demand.
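
As a purely illustrative sketch (not any cloud vendor's API), the toy Python rule below grows a stateless compute tier with the scheduler backlog while leaving storage alone, something that is only possible once compute and storage are decoupled; all names and numbers are hypothetical:

    def target_task_nodes(pending_containers, containers_per_node=8,
                          min_nodes=2, max_nodes=50):
        # Toy scale-out rule: size the compute tier from the backlog of
        # pending containers. Storage is never resized here, because the
        # data lives in a separate service rather than on these nodes.
        needed = -(-pending_containers // containers_per_node)  # ceiling division
        return max(min_nodes, min(max_nodes, needed))

    print(target_task_nodes(pending_containers=300))  # nightly peak -> 38 nodes
    print(target_task_nodes(pending_containers=5))    # off-peak     -> 2 nodes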

Finally, the cost of building a big data analysis platform on the cloud can be optimized substantially, because the purchase model can be chosen flexibly to match business characteristics. For example, spot instances can cut the cost of compute nodes dramatically.
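
As a back-of-the-envelope sketch of that saving, with entirely hypothetical prices (actual spot discounts vary by region, instance type, and time of day):

    # Hypothetical hourly prices for one compute node.
    on_demand_price = 0.50   # USD/hour, pay-as-you-go
    spot_price = 0.10        # USD/hour, spot/preemptible capacity

    nodes, hours_per_day = 20, 6   # elastic batch tier, nightly window only
    daily_on_demand = nodes * hours_per_day * on_demand_price
    daily_spot = nodes * hours_per_day * spot_price

    print(f"on-demand: ${daily_on_demand:.2f}/day, spot: ${daily_spot:.2f}/day")
    print(f"saving: {1 - daily_spot / daily_on_demand:.0%}")   # -> saving: 80%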

Operation and maintenance management of cloud hosting

Operating an entire big data analysis platform is very complicated, requiring professional talent and heavy investment. Cloud vendors provide operation and maintenance capabilities across multiple dimensions, from basic O&M to management O&M to component O&M.

Basic O&M: cloud vendors draw on their experience operating servers at scale to build AIOps systems that detect and analyze hardware issues in advance and, once a fault is found, quickly perform proactive O&M to reduce the impact on the business.

Management O&M: EMR delivers one-click deployment and out-of-the-box use, and also provides unified configuration management, platform status monitoring, and fault alerting.

Component O&M: component O&M is the most complicated part of a big data analysis platform. When upgrading versions, the tangled dependencies between components make compatibility the top priority.

Another important aspect of component O&M is performance optimization. Cloud vendors combine their cloud computing strengths to optimize the underlying infrastructure and the kernel engines, helping open source components improve performance.

Cloud ecosystem of cloud hosting

The cloud offers a rich ecosystem, so latecomers need not reinvent the wheel or start from scratch, as shown in the figure below:

For underlying storage, the cloud provides both OSS object storage and HDFS. HDFS-based workloads can access OSS object storage directly and seamlessly, no differently from accessing HDFS files, which allows flexible data archiving and cost optimization.
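
A minimal PySpark sketch of what that seamless access looks like, assuming an EMR cluster whose OSS connector is already configured; the bucket and paths are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("oss-demo").getOrCreate()

    # With the connector in place, an oss:// path is read exactly like an
    # hdfs:// path; only the URI scheme differs.
    events = spark.read.parquet("oss://my-bucket/warehouse/events/")
    events.groupBy("event_type").count().show()

    # Cold data can be written back to OSS, where storage is cheaper than
    # keeping it on cluster-local HDFS disks.
    events.write.mode("overwrite").parquet("oss://my-bucket/archive/events/")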

For data sources, services such as OSS, SLS, RDS, and message queues are supported; for compute engines, the EMR platform on the cloud can interoperate with the MaxCompute, Flink, and TensorFlow engines.
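
For instance, pulling a table from an RDS for MySQL instance into Spark on the cluster is a standard JDBC read; the endpoint, database, and credentials below are hypothetical, and the MySQL driver must already be on the classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rds-ingest").getOrCreate()

    # Standard Spark JDBC source: read the orders table from RDS.
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:mysql://rds-host.example.com:3306/shop")
              .option("dbtable", "orders")
              .option("user", "analyst")
              .option("password", "change-me")
              .load())

    orders.createOrReplaceTempView("orders")
    spark.sql("SELECT COUNT(*) AS n FROM orders").show()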

For integration, the cloud provides the DataWorks service, which unifies metadata management and data quality management across the whole Hadoop stack.

In addition, the cloud also provides analysis and visualization capabilities such as DataV and QuickBI.
