How do big data practitioners cope with the challenges brought by new technology trends?

Hadoop, the project that started the open source big data movement, is now more than 15 years old. Over the past decade the open source big data field has developed rapidly, and we have witnessed the rise and evolution of a diverse range of technologies.

To gain insight into the past, present, and future of open source big data technology by processing and visualizing the massive amounts of public data gathered on code hosting platforms, and to provide a useful reference for enterprises and developers in applying, learning, selecting, and developing open source big data technology, the OpenAtom Open Source Foundation, the X-Lab Open Laboratory, and the Alibaba Open Source Committee jointly launched the "2022 Open Source Big Data Heat Report" project.

Starting from 2015, the 10th year of Hadoop's development, the report collects relevant public data for correlation analysis, studying the technology trends that emerged after open source big data entered a new stage and the role the open source community's operating model plays in boosting those trends.

After studying the 102 most active open source big data projects, the report found that the heat value of open source projects doubles roughly every 40 months, with each cycle completing a round of technology iteration. Over the past eight years there have been five large-scale migrations of technology and heat, and diversification, integration, and cloud native have become the most prominent features of the current development of open source big data.

Liu Jingjuan, Deputy Secretary General of the OpenAtom Open Source Foundation, said the report hopes to help the following groups:

(1) Enterprises and developers engaged in big data R&D. The report helps them understand development trends in big data technology so as to guide their learning and skill development, and offers a reference, from the perspective of technical activity, for technology selection in application development.

(2) Developers interested in contributing code to open source projects. Open source big data has many subdivisions, some of which remain relatively weak, such as data security and data management. Developers can start from any of these subdivisions and help those fields develop further.

(3) Operators or maintainers of open source big data projects. They can draw experience and patterns from the heat trends of successful projects and so operate their own open source projects in a more mature way.

For big data practitioners, what is the technology development logic behind the heat migration of open source big data projects? How should we deal with the challenges posed by new technology trends? With these questions, InfoQ recently spoke with Jia Yangqing, Vice President of Alibaba Group, Chairman of the Alibaba Open Source Committee, and Head of the Alibaba Cloud Computing Platform Business Unit, and Wang Feng, initiator of the Apache Flink Chinese community and Head of the Alibaba Open Source Big Data Platform.

Diversified user needs drive technology diversification

Since 2015, the open source big data ecosystem, once centered on Hadoop, has shifted to the parallel development of diverse technologies.

On the one hand, product iteration within the original Hadoop ecosystem has stabilized. Some Hadoop ecosystem projects (such as HDFS) have become the foundation of other emerging technologies. Common open source big data component combinations such as Flink + Kafka and Spark + HDFS have been tested by the open source market, are relatively mature and easy to use, and have become fairly fixed, standardized choices.

On the other hand, developer enthusiasm has concentrated on six technical hot spots: search and analysis, stream processing, data visualization, interactive analysis, DataOps, and data lakes. Each hot spot focuses on solving problems in a specific scenario.

The heat migrations (large jumps in heat value) in the hot areas presented in the report are consistent with this: "data visualization" saw heat migrations in 2016 and 2021, "search and analysis" and "stream processing" in 2019, "interactive analysis" and "DataOps" in 2018 and 2021, and "data lakes" in 2020.

This also echoes the changes in big data application scenarios and scale: from the popularization of upper-layer data visualization applications, to the upgrading of data processing technology, to the structural evolution of data storage and management; finally, improved data infrastructure capabilities in turn drive innovation in upper-layer application technologies.

In Jia Yangqing's view, the technology development behind these heat migrations is a spiral process. After user-side needs push technology forward, the system side must advance to achieve better scalability, lower cost, and higher flexibility; once the system side can carry larger-scale data computation, processing, and analysis, people begin to ask whether data analysis and visualization on the user or business side can be done better. The popular open source big data visualization tool Apache Superset is one example: Preset, the commercial company behind it, has recently made new explorations in BI.

Compared with the now-stable Hadoop ecosystem, open source big data components in newer scenarios, such as data governance and analysis, stream computing + OLAP, and data lakes, are still evolving, and more changes are coming.

As these emerging open source big data components grow more complex, it becomes increasingly difficult for developers to master the open source data platform. This poses considerable challenges for enterprises that want to build enterprise-grade big data platforms from open source components.

First, most business scenarios require multiple big data components to work together, which requires the technical team to master many different components at once and to understand how best to combine them. However, most small and medium-sized companies and traditional enterprises lack foundational big data talent. Such enterprises may need a professional big data team to provide design, consulting, and guidance so that they can connect open source big data components into a complete solution.

Second, as business scale grows, an enterprise's requirements for the stability, security, and high availability of its big data platform grow too, which inevitably increases the complexity of building the platform. For example, when a problem occurs in the system: what is the problem, and how do you diagnose it, analyze it, alert on it, and observe it in real time? These supporting capabilities are not provided by the open source big data components themselves; they require supporting tools or products in the open source ecosystem to help enterprises find, locate, and solve problems.

In the view of Jia Yangqing and Wang Feng, these challenges can be solved to some extent by standardized products on the cloud. On the other hand, the cloud itself also brings new challenges to the big data technology stack.

Changes and challenges brought by cloud native

The report shows that new projects emerging after 2015 have, without exception, actively invested in the cloud native direction. Open source projects born in the cloud native era, such as Pulsar, DolphinScheduler, JuiceFS, Celeborn, and Arctic, have sprung up like mushrooms. These new projects accounted for 51% of total heat value in 2022; in fields such as data integration, data storage, and data development and management, the project landscape has changed dramatically, and new projects account for more than 80% of heat value. Since 2020, mainstream projects such as Spark, Kafka, and Flink have also officially supported Kubernetes.
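Official Kubernetes support typically means an engine can be deployed directly against a cluster without a separate resource manager. As a hedged sketch (the cluster ID, image tag, and jar path below are illustrative placeholders, and the flags should be verified against the Flink version in use), a Flink job can be launched in native Kubernetes application mode like this:

```shell
# Deploy a Flink job in native Kubernetes application mode.
# All identifiers below are illustrative placeholders; requires a
# configured kubectl context pointing at a live cluster.
./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=my-flink-app \
    -Dkubernetes.container.image=flink:1.16 \
    local:///opt/flink/usrlib/my-job.jar
```

Flink then creates the JobManager and TaskManager pods itself, which is exactly the kind of native integration the report counts as "officially supporting Kubernetes".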

Amid the cloud native trend, the open source big data technology stack is being restructured, and the data integration field is being restructured faster than other subdivisions.

With the explosion of diversified data collection needs on the cloud and changes in downstream data analysis logic, data integration has evolved from "labor-intensive" ETL tools into flexible, efficient, easy-to-use "data processing pipelines". The traditional data integration tools Flume and Camel are in stable maintenance mode, and Sqoop was retired from the Apache Foundation in 2021, while projects more closely integrated with cloud native, such as Airbyte, Flink CDC, SeaTunnel, and InLong, have developed rapidly.
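Projects like Flink CDC illustrate the "pipeline" style: a change-data-capture source is declared rather than hand-coded. The following is a hedged sketch only; the hostnames, credentials, and table names are illustrative placeholders, and the connector options should be checked against the Flink CDC documentation for the version deployed:

```sql
-- Declare a MySQL change-data-capture source in Flink SQL.
-- All connection details below are illustrative placeholders.
CREATE TABLE orders_source (
  order_id INT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql.example.com',
  'port' = '3306',
  'username' = 'reader',
  'password' = '******',
  'database-name' = 'shop',
  'table-name' = 'orders'
);

-- One INSERT continuously syncs change events downstream.
INSERT INTO orders_sink SELECT * FROM orders_source;
```

The whole pipeline is configuration plus one SQL statement, which is what distinguishes this generation of tools from the batch-oriented ETL jobs of the Sqoop era.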

The heat trends in the report bear this out: cloud native data integration surpassed traditional data integration in 2018. Since 2019 this evolution has accelerated, with heat value doubling year over year; the compound annual growth rate of the heat value of many newly incubated projects exceeded 100%, showing strong growth momentum.

Over the past few years, data sources and data storage have gradually migrated to the cloud, and increasingly diverse computing workloads now run on the cloud as well. The cloud foundation has in fact changed many of the premises of open source big data. Having lived through this change, Wang Feng feels it deeply.

More than ten years ago, Wang Feng worked on Hadoop-based big data development. He was a relatively junior engineer then, and most of his work was tied to physical machines: which model to buy, how large a disk, whether to prioritize disk capacity or memory.

Now the foundation of big data is virtual machines, containers, object storage, and cloud storage. From day one, every open source project must consider an elastic architecture, good observability, and how to connect naturally with Kubernetes and the rest of the cloud native ecosystem. These concerns are already in everyone's subconscious, changing what is expected of an open source project at its inception.

At the same time, running on the cloud also changes the data architecture. For example, the separation of compute and storage has become the standard architecture of the big data platform, something everyone must now consider by default. So whatever the big data component (Flink, Hive, Presto, etc.), when shuffling data it can no longer rely on the local machine or assume a local disk exists, because machines on the cloud are bound to migrate. This requires a general shuffle service to do the work and to adapt scheduling to changes in cloud resource types. Not long ago, Alibaba donated the open source project Celeborn to the Apache Foundation to help computing engines improve data shuffle performance.
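In Spark's case, delegating shuffle to an external service is largely a matter of configuration. A minimal sketch, assuming a Celeborn deployment (the class name, endpoint, and property keys below should be verified against the Celeborn release in use):

```properties
# Route Spark shuffle through a remote shuffle service (Celeborn).
# Values below are illustrative placeholders.
spark.shuffle.manager            org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.celeborn.master.endpoints  celeborn-master-0:9097
# With shuffle data held by the service rather than on executor
# disks, executors hold no shuffle state and can be released freely.
spark.dynamicAllocation.enabled  true
```

Because intermediate shuffle data now lives in the service instead of on local disks, executors can be reclaimed or migrated by the cloud without losing work, which is exactly the assumption the compute-storage-separated architecture requires.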

At the technical level, the cloud brings further challenges to the original open source big data system. Take scheduling: resource management and scheduling used to be done on Hadoop YARN by default. After moving to the cloud, everyone embraced the Kubernetes ecosystem and now orchestrates and schedules on Kubernetes. However, Kubernetes, designed as a scheduler for online services, hits bottlenecks when scheduling large-scale data computing tasks; Alibaba itself has made many improvements to Kubernetes.

In addition, although cloud storage has many advantages, it falls short in transmission bandwidth and data locality. Some emerging open source projects, such as JuiceFS and Alluxio, were designed to accelerate cloud storage; JindoData, the big data storage acceleration service on Alibaba Cloud EMR, does the same work. Wang Feng said the open source big data system is still evolving and still has some way to go before it truly integrates with cloud native.

How can big data practitioners face the future?

As a testing ground for technological innovation, cloud native brings more flexible resource elasticity and lower storage and operations costs. It lets many enterprises experiment more boldly on the cloud, raises everyone's willingness to try new things, and is naturally more conducive to the success of innovative technologies and software. Jia Yangqing has observed that many new software products and engines overseas take the road of "attract users first, then convert users into customers". In his view, this may become a trend in China as well, because the cloud does provide a good environment for distributing resources and software.

However, the cloud has only given us a good hammer; someone not strong enough may hit their own foot with it. For example, because resources are so easy to obtain on the cloud, users may store and compute data carelessly, resulting in waste. From a business perspective, then, cloud service providers and users are in it together: on top of large-scale open source data systems, cloud providers need to do more data governance work according to enterprise needs, helping users wield the cloud "hammer" well and avoid wasting resources. Jia Yangqing said that using cloud platforms and cloud resources more effectively has become a growing concern for enterprises. Today this work is done mainly by enterprise software; in the future, more open source tools may appear.

To ride the future trend of data engineering, Jia Yangqing believes big data practitioners can focus on three directions. The first is the cloud, that is, using the cloud to solve system architecture problems, covering four kinds of integration: offline and real-time, big data and AI, stream and batch, and lake and warehouse. Both open source projects and closed source products are in fact moving in a more integrated direction; the Flink community's new subproject Table Store is an exploration in this direction. The ultimate goal is to reduce the complexity of the big data technology stack. In the future, computing engines may also integrate further with data governance and data orchestration tools to become integrated data solutions.

Second, upper-layer data applications will become simpler; in the long run, end users should be able to analyze all data through a common SQL interface. Third, the ecosystem will develop further. Alibaba Cloud's Flink and EMR, like Databricks, Snowflake, and BigQuery overseas, are all just tool platforms; on top of these data platforms, more enterprises like Salesforce are needed to build richer vertical solutions and ultimately form a more prosperous data ecosystem. The premise, of course, is that the data platforms must first be standardized.
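The "common SQL interface" idea can be illustrated with a minimal, self-contained sketch. Python's built-in sqlite3 stands in for a cloud data platform here, and the table and figures are illustrative, not taken from the report:

```python
import sqlite3

# End users query data through plain SQL, regardless of where the
# data actually lives. sqlite3 is a stand-in for a data platform;
# the table and numbers below are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE project_heat (project TEXT, year INTEGER, heat INTEGER)"
)
conn.executemany(
    "INSERT INTO project_heat VALUES (?, ?, ?)",
    [("flink", 2021, 120), ("flink", 2022, 250), ("spark", 2022, 180)],
)
# One SQL dialect is the whole user-facing interface.
rows = conn.execute(
    "SELECT project, SUM(heat) FROM project_heat "
    "GROUP BY project ORDER BY SUM(heat) DESC"
).fetchall()
print(rows)  # [('flink', 370), ('spark', 180)]
```

To the analyst, swapping the storage engine underneath changes nothing: the SQL stays the same, which is exactly the simplification of upper-layer data applications described above.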

Wang Feng is also optimistic about the future of the data ecosystem, especially for open source big data. Many data companies in Europe and the United States now have their own distinct strengths yet can also form synergies with one another, keeping everyone in a state of healthy competition. Wang Feng believes this will help standardize vendors' products and bring many benefits to users and customers. From the perspective of the whole application market, once standardization takes hold, all parties can plug into the data engineering ecosystem at relatively low cost, which in turn grows the pie for everyone.

In the rapidly developing field of data engineering, how should big data practitioners face the future? Will their roles and responsibilities change?

Jia Yangqing believes the roles of system engineer, data engineer, and data scientist will all continue to exist. But since the cloud has solved many system problems, more and more engineers will move toward upper-layer business. Low-level system-building work, the system engineer's role, may concentrate in the cloud data service providers and data engine vendors that offer standardized services. Other enterprises will focus more on the business itself, putting more people into data science and upper-layer business rather than employing many system engineers. This is the natural result of an increasingly refined division of labor.

For data scientists and data engineers, responsibilities will focus more on using data to realize business value, including data modeling and data governance driven by business needs, rather than on solving underlying system and operations problems as before, because those problems will have been well solved by system engineers and the cloud.
