This topic provides answers to some frequently asked questions about the basic concepts of fully managed Flink.
- What is Realtime Compute for Apache Flink?
- What are the differences between real-time computing and batch processing?
- What is streaming data?
- Which types of data stores are supported by Realtime Compute for Apache Flink?
What is Realtime Compute for Apache Flink?
Realtime Compute for Apache Flink is developed to meet strict requirements for timely data processing. The business value of data decreases as time passes after the data is generated. Therefore, data needs to be computed and processed as soon as possible after it is generated. As the requirements for data timeliness and operability increase, software systems must process more data in shorter periods of time.
In traditional big data processing models, online transaction processing (OLTP) and offline data analysis are performed separately at different times. These models follow a scheduled processing mode in which data is accumulated and then processed in a computing cycle that can last for hours or even days. Therefore, they cannot meet the growing demand for computing data streams in real time. Processing delays may lead to serious consequences in delay-sensitive scenarios, such as real-time big data analytics, risk control and alerting, real-time prediction, and financial transactions. To address this issue, Alibaba Cloud provides Realtime Compute for Apache Flink to perform computations on data streams in real time.
Realtime Compute for Apache Flink shortens the data processing delay, applies computational logic in real time, and significantly reduces computing costs. This helps meet business requirements for real-time processing of large amounts of data. Realtime Compute for Apache Flink has the following features:
- Real-time and unbounded data streams
Realtime Compute for Apache Flink processes data streams in real time. Streaming data is continuously generated from data sources and is subscribed to and consumed in chronological order. For example, when Realtime Compute for Apache Flink processes log streams that are generated by website visits, the log streams continuously enter the Realtime Compute for Apache Flink system as long as the website is online.
- Continuous and efficient computing
Realtime Compute for Apache Flink is an event-driven system in which unbounded event or data streams continuously trigger real-time computations. Each streaming data record triggers a computational task. Realtime Compute for Apache Flink performs continuous and real-time computations on data streams.
- Real-time integration of streaming data
Realtime Compute for Apache Flink writes the computing result of each streaming data record to the destination data store in real time. For example, the system can directly write the computed report data to an ApsaraDB RDS instance to display reports. Because the result data is written continuously, Realtime Compute for Apache Flink can be used as a data source that generates data streams for the destination data store.
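The three features above can be illustrated with a minimal plain-Python sketch. It is not Flink code and does not use any Flink API. The names `page_view_stream`, `run_job`, and `results` are hypothetical, and the source is finite only so that the example terminates. Each record triggers one computation, and the result is written to the sink immediately.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def page_view_stream() -> Iterator[Dict[str, str]]:
    """Stand-in for an unbounded log stream (finite here so the sketch ends)."""
    yield {"user": "alice", "url": "/home"}
    yield {"user": "bob", "url": "/cart"}
    yield {"user": "alice", "url": "/cart"}


def run_job(source: Iterable[Dict[str, str]],
            sink: Callable[[Dict[str, int]], None]) -> None:
    """Process records one by one and write each result as soon as it exists."""
    views_per_url: Dict[str, int] = {}
    for record in source:  # each arriving record triggers a computation
        url = record["url"]
        views_per_url[url] = views_per_url.get(url, 0) + 1
        sink({url: views_per_url[url]})  # write the updated count right away


# A list plays the role of the destination data store in this sketch.
results: List[Dict[str, int]] = []
run_job(page_view_stream(), results.append)
print(results)
```

The sink receives one updated count per input record, so a downstream consumer sees the results as a stream of their own.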
What are the differences between real-time computing and batch processing?
- Batch processing
Batch processing jobs are initiated by users or systems and are processed with a long delay. Most traditional data computing and analysis services are built on the batch processing model. Extract, transform, and load (ETL) systems or OLTP systems load data into data storage systems. Online data services, such as ad hoc queries and dashboards, then access the structured data and obtain analysis results by using SQL statements. As relational databases evolved, the batch processing model came to be widely used across industries. The following figure shows the traditional batch processing model.
The traditional batch processing procedure consists of the following steps:
- Load data
To perform batch processing, you must load data to a computing system in advance. You can use an ETL system or an online transaction processing system as your computing system. The system performs a series of query optimizations, analyses, and computations on the loaded data based on the storage and computation method.
- Submit a request
A system initiates a computing job, such as a MaxCompute SQL job or a Hive SQL job, and submits requests to the computing system. Then, the computing system schedules computing nodes to compute large amounts of data. This process may take several minutes or even hours. Because the data is not processed as soon as it is generated, the data may already be outdated by the time the computation is complete.
Note: You can modify SQL statements and submit a job again based on your business requirements. You can also perform ad hoc queries to query updated data in real time.
- Return results
After the computing job is complete, the result data is returned as a result set. The amount of result data that is stored in the computing system is large. Therefore, you must integrate the result data into another system. Large amounts of result data lead to a data integration process that takes several minutes or even hours to complete.
- Real-time computing
Real-time computing jobs are continuously triggered by events and are processed with a short delay. Real-time computing is a new technology in the field of big data computing. The real-time computing model is simple. Therefore, real-time computing is considered a value-added service of batch processing in most big data processing scenarios. Real-time computing performs computations on data streams with short delays. The following figure shows the real-time computing model.
Real-time computing is performed in the following order:
- Send real-time data streams
Data integration tools are used to send streaming data to streaming data stores, such as Message Queue and DataHub, in real time. Streaming data is sent in micro batches in real time to minimize the delay in data integration.
Streaming data is continuously written to storage systems without the need to preload the data. Realtime Compute for Apache Flink does not store streaming data that is continuously processed. Streaming data is discarded immediately after the data is processed.
- Publish a streaming job
In batch processing, you can start a computing job only after data integration is complete. A real-time computing job is a resident computing service. When you start a Realtime Compute for Apache Flink job, Realtime Compute for Apache Flink immediately computes streaming data and generates results after a small batch of data enters a streaming data store. Realtime Compute for Apache Flink also divides large batches of data records into smaller batches for incremental computing. This way, the processing delay is reduced.
In stream computing jobs, you must predefine the computational logic in Realtime Compute for Apache Flink. You cannot change the computational logic when stream computing jobs are running. If you terminate a running job and publish the job again after you change the computational logic, the streaming data that is processed before the change cannot be reprocessed.
- Generate result data streams in real time
In batch processing, result data can be written to an online system only after all accumulated data is processed, and all results are written at the same time. A stream computing job delivers result data to an online system or a batch system immediately after each micro batch of data records is processed.
In summary, a real-time computing job runs as follows:
- A user publishes a real-time computing job.
- Streaming data triggers the real-time computing job.
- The result data of the real-time computing job is continuously written to the destination system.
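The incremental computing described above can be sketched in plain Python. This is an illustration under stated assumptions and not how Realtime Compute for Apache Flink is implemented: the helper names `micro_batches` and `incremental_sum`, the batch size, and the sample amounts are all hypothetical. The job keeps a running total, processes only the newly arrived micro batch, and emits an updated result after each one.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(records: Iterable[int], size: int) -> Iterator[List[int]]:
    """Split an incoming sequence of records into micro batches of at most `size`."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def incremental_sum(records: Iterable[int], size: int) -> Iterator[int]:
    """Update a running total per micro batch and emit it immediately."""
    total = 0
    for batch in micro_batches(records, size):
        total += sum(batch)  # only the new batch is computed, nothing is re-read
        yield total          # a result is available before the stream ends


order_amounts = [5, 10, 20, 1, 4]  # hypothetical sample data
emitted = list(incremental_sum(order_amounts, size=2))
print(emitted)
```

A batch job would produce only the final total after all five records were loaded. The sketch produces an intermediate result after every micro batch, which is what keeps the delay short.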
|Item|Batch processing|Real-time computing|
|---|---|---|
|Data integration|You must load data to the data processing system in advance.|Realtime Compute for Apache Flink loads data in real time.|
|Computational logic|The computational logic can be changed, and data can be reprocessed.|If the computational logic is changed, data cannot be reprocessed because streaming data is processed in real time.|
|Data scope|You can query and process all or most data in a dataset.|You can query and process the latest data record or the data within a tumbling window.|
|Data amount|Large amounts of data are processed.|Individual records or micro batches of data that consist of a few records are processed.|
|Performance|The processing delay lasts for several minutes or hours.|The processing delay lasts for several milliseconds or seconds.|
|Analysis|The analysis is complex.|The analysis is based on simple response functions, aggregates, and rolling metrics.|
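The difference in data scope can be made concrete with a small plain-Python sketch. It assumes that events arrive in timestamp order and uses hypothetical names (`events`, `batch_total`, `tumbling_window_totals`) and sample values. It does not use the Flink API. The batch function sees the whole dataset at once, whereas the streaming function only holds the current tumbling window and emits its total when the window closes.

```python
from typing import Iterable, Iterator, List, Tuple

# (event time in seconds, value); hypothetical sample events in time order
events: List[Tuple[int, int]] = [(1, 10), (4, 20), (11, 5), (14, 5), (25, 7)]


def batch_total(all_events: List[Tuple[int, int]]) -> int:
    """Batch style: the full dataset is loaded first, then processed once."""
    return sum(value for _, value in all_events)


def tumbling_window_totals(stream: Iterable[Tuple[int, int]],
                           window_seconds: int) -> Iterator[Tuple[int, int]]:
    """Streaming style: emit (window start, total) each time a window closes.

    Assumes timestamps never decrease; late or out-of-order events are not handled.
    """
    current_start = None
    total = 0
    for timestamp, value in stream:
        window_start = (timestamp // window_seconds) * window_seconds
        if current_start is not None and window_start != current_start:
            yield current_start, total  # the previous window is complete
            total = 0
        current_start = window_start
        total += value
    if current_start is not None:
        yield current_start, total  # flush, only because the sample stream ends


windows = list(tumbling_window_totals(events, window_seconds=10))
print(batch_total(events), windows)
```

Windows with no events produce no output in this sketch, which is a simplification chosen to keep the example short.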
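What is streaming data?
Streaming data is data that is continuously generated from data sources and is consumed in chronological order. Examples of streaming data include: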
- Log files that are generated by using mobile or web applications
- Online shopping data
- In-game player activities
- Data from social networking sites
- Telemetry data from connected devices on a financial trading floor or in a geospatial data center
- Geospatial service information
- Telemetry data from devices or instruments
Which types of data stores are supported by Realtime Compute for Apache Flink?
- Streaming data: Streaming data inputs trigger real-time computing. At least one streaming data source must be declared for each Realtime Compute for Apache Flink job.
- Static data: Static data refers to dimension tables. In Realtime Compute for Apache Flink, each streaming data record can be associated with an external static data source for data queries.
- Result table: Realtime Compute for Apache Flink writes result data to a destination data table. The table provides read and write interfaces that downstream storage systems can use to continue to consume the data.
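How the three types of data work together can be sketched in plain Python. The example is illustrative only and makes several assumptions: the table names, fields, and the `enrich` helper are hypothetical, a dictionary stands in for the dimension table, and a list stands in for the result table. In an actual job, these would be tables declared in Realtime Compute for Apache Flink.

```python
from typing import Dict, Iterable, Iterator, List

# Static data: a hypothetical dimension table keyed by product ID.
product_dim: Dict[str, str] = {"p1": "books", "p2": "games"}

# Streaming data: a hypothetical source of order records.
order_stream: List[Dict[str, object]] = [
    {"order_id": 1, "product_id": "p1", "amount": 12.0},
    {"order_id": 2, "product_id": "p2", "amount": 30.0},
    {"order_id": 3, "product_id": "p9", "amount": 8.0},  # no matching dimension row
]


def enrich(stream: Iterable[Dict[str, object]],
           dim: Dict[str, str]) -> Iterator[Dict[str, object]]:
    """Look up each streaming record in the dimension table as the record arrives."""
    for order in stream:
        category = dim.get(str(order["product_id"]), "unknown")
        yield {**order, "category": category}


# Result table: each enriched record is appended as soon as it is produced.
result_table: List[Dict[str, object]] = []
for row in enrich(order_stream, product_dim):
    result_table.append(row)

print(result_table)
```

The streaming source drives the job: a lookup in the dimension table happens only when an order record arrives, and the static data on its own triggers no computation.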