2022 Apsara Conference: Integrated Big Data and Intelligence Summit

At this year's main technical forum, we talked about the continuous evolution of algorithms in artificial intelligence. These intelligent applications have, in turn, driven explosive growth in the demand for big data. In today's forum, we want to start from the data side and look at the sparks of innovation between data and intelligence.

The concept of big data is not new. It started with relational databases, and by the 1990s, as data volumes and applications grew, the industry began to think systematically about methodologies for big data. The most typical is probably the three V's that are familiar in the data field today: volume, velocity, and variety. Guided by these methodologies, we began to build big data systems from the perspectives of storage, computing, scheduling, and services.

On the main forum, we mentioned an impossible triangle in AI today: ease of use, scale, and efficiency. The same three tensions exist in big data. Mapped onto big data, they look like this:

First, the ease of use of production tools. A data platform is a very "heavy" platform that is closely tied to the business, so we often focus on its security and stability. That is not wrong, but today we need data for more and more decisions. So how do we build a more flexible and convenient platform, so that anyone can use data by writing a single line of SQL, or even discover patterns in the data from a business perspective without writing SQL at all? This is the usability question.

Second, productivity at large scale. New data platforms and data warehouses keep appearing. For us, however, the practical demands are how to eliminate data silos and improve data efficiency, and how to greatly reduce the complexity and cost of the data platform while the business keeps growing at scale. Alibaba Cloud's big data platform today supports 10 exabytes of computation every day. I believe many cloud customers are facing the same problem of scale.

Finally, the diversified use of data creates a demand for production efficiency. In an enterprise, the data platform always seems to be computing something: ETL, stream computing, OLAP, and so on. The question is whether anyone ever looks at the results. If no one does, that computation may simply be waste. There is also the question of whether tasks are written well: for example, a task may scan a full table while actually processing only a small amount of data. Both are governance problems. Just as many applications need quality control today, so does data. Production efficiency is not only the efficiency of the technology, but also the efficiency of organizational governance.
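The two governance checks described above (results that nobody reads, and tasks that scan far more data than they use) can be sketched as a simple audit over task statistics. This is only an illustration under stated assumptions: the `TaskStats` fields, the `audit` function, the 1% selectivity threshold, and the sample numbers are all hypothetical and are not part of any Alibaba Cloud product or API.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    """Hypothetical per-task metrics a platform might collect."""
    name: str
    rows_scanned: int      # rows read from source tables
    rows_used: int         # rows that survive filtering
    output_reads_30d: int  # times the task's output was read in 30 days


def audit(tasks, min_selectivity=0.01):
    """Return {task name: [issues]} for tasks that look wasteful.

    min_selectivity is an assumed threshold: a task that keeps less than
    this fraction of the rows it scans is flagged as a likely full scan.
    """
    findings = {}
    for t in tasks:
        issues = []
        if t.output_reads_30d == 0:
            issues.append("output never read")
        if t.rows_scanned > 0 and t.rows_used / t.rows_scanned < min_selectivity:
            issues.append("scans far more than it uses")
        if issues:
            findings[t.name] = issues
    return findings


# Made-up sample data, for illustration only.
tasks = [
    TaskStats("daily_etl", 1_000_000, 900_000, 42),
    TaskStats("legacy_report", 500_000, 400_000, 0),
    TaskStats("ad_hoc_lookup", 10_000_000, 1_200, 7),
]
report = audit(tasks)
for name, issues in report.items():
    print(name, "->", ", ".join(issues))
```

In practice such signals would come from the platform's own metadata and lineage, and the threshold would need tuning per workload; the point is only that both kinds of waste are measurable once the statistics are collected.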

So what have we done from Alibaba's perspective?

Our big data platform also started from simple open source and single-point technologies: we initially built Hadoop clusters and our self-developed ODPS. We then improved platform efficiency by moving to the cloud. By centralizing an enterprise's data, we could connect its different business units, break down data barriers, and build a data system from zero to one.

By managing the full range of data tasks, we can sustain business growth at a very low cost. Along the way we also ran into many performance challenges, which come from two directions. On the one hand, once the system is large, there is a great deal of internal work to do on how to run SQL faster, how to store data better, and how to raise the overall level of storage and computing.

On the other hand, we see many diversified computing needs. Take offline computing and real-time computing: one pursues resource utilization, while the other pursues OLAP efficiency. Today, our guests will talk about how a more integrated design can technically resolve the tension between diversified demand and cost.

The last point is how to lower the barrier to entry. Data development and governance are complex. We think this is where we differ from many international data service providers, and where we have done a good job. Take Snowflake, which many of us became familiar with a year ago: our approach is very different. Alibaba Cloud provides a complete upper-layer system for development, operation and maintenance, modeling, and governance. From a developer's perspective, you get a panoramic view from development through to system operation and maintenance. From an enterprise governance perspective, you can see how efficiently each department and business manages and governs its data, which gives data development a more global view.

The capabilities mentioned above are fairly abstract. So what product capabilities can we provide on the cloud today?

First, open source is a major trend today. Whether you use traditional Hadoop and Hive or today's Spark and data lake architectures, we can provide an experience on the cloud that is fully consistent with open source, and we can also provide many capabilities that are missing when you simply install open source software yourself. Put simply: enterprise-grade stability, elasticity, and freedom from maintenance. Today, whether it is EMR, Flink, or Elasticsearch, we provide serverless capabilities and a managed foundation, so that users do not need to care about these "dirty jobs". At the same time, we have done a lot of innovative work in the open source field. For example, Celeborn, a project we recently donated to the Apache Foundation, greatly improves data shuffle performance for many engines on the data lake.

Second, we provide ODPS, an integrated, self-developed big data platform. It is composed of MaxCompute, which focuses on offline processing and scale, and Hologres, which focuses on real-time data analysis and serving. A major trend we see today is "self-driving" data platforms. Users do not need to care whether a table is offline or real-time, or whether the engine and the underlying storage are connected. Instead, one set of storage, one set of metadata, and one set of scheduling solves the problem. At the same time, lakehouse integration gives a seamless connection between the open source data lake and the self-developed data warehouse.

Third, on top of these various data engines, we have comprehensively upgraded DataWorks, our platform for data development and governance. Today, DataWorks supports multiple underlying engines. It helps industry experts in data modeling and governance build their own data middle platform more quickly, and it can also find and fix a range of efficiency problems on that data middle platform, such as data health classification. In addition, this year we will provide corresponding OpenAPI capabilities for each edition of DataWorks, making secondary development on top of it easier.

A very clear trend we see today is that, while traditional data analysis and computation are still the mainstream, more and more data applications are in the field of artificial intelligence. For example, deep learning for vision, speech, and NLP often uses unstructured data. At the same time, scenarios such as intelligent search and user recommendation are also tightly bound to data.

Today, on top of the data lake and the data warehouse, we have built PAI, an artificial intelligence platform designed to integrate big data and AI. For example, ModelScope, the open source model platform we released at the main forum, our high-performance computing solutions for autonomous driving, and our solutions for intelligent recommendation and user growth are all built on PAI.

Finally, we will show you how each of the products we just mentioned fits together in one big picture, for your reference.
