E-MapReduce (EMR) V3.27.X and earlier versions use the open source version of Flink. Versions later than EMR V3.27.X use Ververica Runtime (VVR), an enterprise-grade computing engine. VVR is fully compatible with Flink. This topic describes how to configure a VVR-based Flink job.
Background information
Enterprise-edition Flink is released by the founding team of Apache Flink and is maintained under a globally consistent brand. VVR provides GeminiStateBackend, an enterprise-grade state backend with the following characteristics:
- Uses a new data structure to increase random query speed and reduce disk I/O operations.
- Optimizes the cache policy. If memory is sufficient, hot data is not written to disk and cache entries do not expire after compaction.
- Implements GeminiStateBackend in Java, which eliminates the Java Native Interface (JNI) overhead incurred by RocksDB.
- Uses off-heap memory and implements an efficient memory allocator based on GeminiDB, which eliminates the impact of Java Virtual Machine (JVM) garbage collection.
- Supports asynchronous incremental checkpoints. Only memory indexes are copied during data synchronization, so GeminiStateBackend avoids the jitter that I/O operations cause in RocksDB.
- Supports local recovery and storage of timers.
The basic configurations of the checkpoint and state backend in Flink also apply to GeminiStateBackend. For more information, see Configuration.
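Because the standard Flink checkpoint and state backend settings carry over, a job that runs on GeminiStateBackend can enable checkpointing through the regular DataStream API before any Gemini-specific tuning. The following is a minimal sketch that uses only standard Flink classes; the interval, pause, and timeout values are illustrative assumptions rather than recommended settings.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds with exactly-once semantics.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        // Leave at least 30 seconds between the end of one checkpoint and the start of the next.
        checkpointConfig.setMinPauseBetweenCheckpoints(30_000);
        // Abort a checkpoint that does not complete within 10 minutes.
        checkpointConfig.setCheckpointTimeout(600_000);

        // Define sources, transformations, and sinks here, then submit the job:
        // env.execute("checkpointed-flink-job");
    }
}
```

The following table describes the parameters that are specific to GeminiStateBackend.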
Parameter | Description |
---|---|
state.backend.gemini.memory.managed | Specifies whether the memory size of each backend is calculated from the values of the Managed Memory and Task Slot parameters. Valid values: true and false. Default value: true. |
state.backend.gemini.offheap.size | Specifies the memory size of each backend when the state.backend.gemini.memory.managed parameter is set to false. Unit: GiB. Default value: 2. |
state.backend.gemini.local.dir | Specifies the directory that stores the local data files of GeminiDB. |
state.backend.gemini.timer-service.factory | Specifies the storage location of the timer-service state. Default value: HEAP. |
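For example, to give each backend a fixed amount of off-heap memory instead of deriving it from managed memory, you might set the parameters as follows. This is only a sketch: the values and the local directory are assumptions, and depending on your deployment you can place the settings in flink-conf.yaml or pass them when you submit the job (for example, with -yD options on YARN).

```
# Illustrative values only; adjust them to your workload.
state.backend.gemini.memory.managed: false
# 4 GiB of off-heap memory for each backend (the unit is GiB, as described above).
state.backend.gemini.offheap.size: 4
# Hypothetical local directory for GeminiDB data files.
state.backend.gemini.local.dir: /mnt/disk1/flink/gemini
state.backend.gemini.timer-service.factory: HEAP
```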
Prerequisites
- An EMR Hadoop cluster is created. For more information, see Create a cluster.
- A project is created. For more information, see Manage projects.
- The resources required by the job and the data files to be processed are obtained, such as the JAR packages, the names of the data files, and the storage paths of the packages and files.
Note
- We recommend that you use Object Storage Service (OSS) to maintain the JAR packages that you want to run.
- If you reference a file by its local path, use the absolute path.
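For reference, the following shows what the two ways of referencing a JAR package might look like. The bucket name, directories, and file name are placeholders, not values from this topic.

```
# JAR package stored in OSS (bucket and object names are hypothetical).
oss://your-bucket/flink/jobs/your-flink-job.jar

# Absolute local path on the cluster (hypothetical).
/opt/apps/jobs/your-flink-job.jar
```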
Procedure
- Go to the Data Platform tab.
    - Log on to the Alibaba Cloud EMR console by using your Alibaba Cloud account.
    - In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
    - Click the Data Platform tab.
- In the Projects section of the page that appears, find the project that you want to manage and click Edit Job in the Actions column.
- Create a Flink job.
    - In the Edit Job pane on the left, right-click the folder on which you want to perform operations and select Create Job.
    - In the Create Job dialog box, specify Name and Description, and select Flink from the Job Type drop-down list.
    - Click OK.
- Edit job content.
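As an illustration only, the content of a Flink job that is submitted to YARN typically consists of flink run arguments. In the sketch below, the memory sizes, slot count, job name, GeminiStateBackend settings, and JAR path are all assumptions that you would replace with your own values, and the exact way your EMR version expects OSS resources to be referenced may differ.

```
run -m yarn-cluster -yjm 1024 -ytm 2048 -ys 2 -ynm my-flink-job \
  -yD state.backend.gemini.memory.managed=false \
  -yD state.backend.gemini.offheap.size=4 \
  oss://your-bucket/flink/jobs/your-flink-job.jar
```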