This topic describes GeminiStateBackend, an enterprise-level state backend, and compares its performance with that of RocksDBStateBackend.
- A large number of random accesses exist and few range queries are performed.
- Data traffic and hotspots change frequently. As a result, different parallel threads of the same operator can use different data access modes.
- Uses a new architecture and data structure design to improve the overall data processing performance.
The overall architecture of GeminiStateBackend is based on the log-structured merge-tree (LSM tree) data structure. GeminiStateBackend provides three capabilities: adaptation to changes in data volume and access characteristics, tiered storage of hot and cold data, and switchover between anti-caching and caching architectures. GeminiStateBackend also supports a hash storage structure that allows random access. The performance comparison that uses Nexmark shows that GeminiStateBackend outperforms RocksDBStateBackend.
- Supports compute-storage separation to eliminate the dependency of state data on local disks.
Local disk space is limited, so a job that has a large amount of state data often runs out of local disk space. In most cases, if a job that runs on RocksDBStateBackend runs out of local disk space, you need to increase the parallelism or use other methods to add resources. GeminiStateBackend supports compute-storage separation, which makes state storage independent of local disks and prevents job failures that are caused by excessive local state data. For more information about the configurations of compute-storage separation, see Parameters for compute-storage separation.
- Supports key-value separation to significantly improve the performance of jobs that involve dual-stream JOIN or multi-stream JOIN.
Dual-stream JOIN and multi-stream JOIN are among the most challenging scenarios in stream processing and are typical scenarios in which state storage becomes a bottleneck. GeminiStateBackend is developed based on key-value separation to significantly improve the performance of jobs that involve dual-stream JOIN or multi-stream JOIN. Verification during the Double 11 Shopping Festival of Alibaba Group showed that enabling key-value separation increased computing resource utilization by 50% on average, and by 100% to 200% in typical scenarios. For more information about the configurations of key-value separation, see Parameters for key-value separation.
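The general idea behind key-value separation in an LSM-style store can be illustrated with a toy example. The sketch below is not the GeminiStateBackend implementation. The class name `KVSeparatedStore`, the `threshold` cutoff, and the 16-byte pointer size are all hypothetical and chosen only for illustration. It shows that when large values (such as buffered JOIN rows) are written to a separate value log, the index that compaction has to rewrite holds only keys and small pointers.

```python
from dataclasses import dataclass, field


@dataclass
class KVSeparatedStore:
    """Toy store: keys and small pointers live in the index, large values in a log."""

    threshold: int = 64  # hypothetical size cutoff in bytes
    index: dict = field(default_factory=dict)
    value_log: bytearray = field(default_factory=bytearray)

    def put(self, key: str, value: bytes) -> None:
        if len(value) < self.threshold:
            # Small values stay inline with the key.
            self.index[key] = ("inline", value)
        else:
            # Large values are appended to the log; the index keeps a pointer.
            offset = len(self.value_log)
            self.value_log.extend(value)
            self.index[key] = ("log", offset, len(value))

    def get(self, key: str) -> bytes:
        entry = self.index[key]
        if entry[0] == "inline":
            return entry[1]
        _, offset, length = entry
        return bytes(self.value_log[offset:offset + length])

    def index_bytes(self) -> int:
        """Approximate bytes that a compaction of the index would rewrite."""
        total = 0
        for key, entry in self.index.items():
            total += len(key.encode())
            # Assumed pointer size of 16 bytes for separated values.
            total += len(entry[1]) if entry[0] == "inline" else 16
        return total


store = KVSeparatedStore(threshold=64)
store.put("order:1", b"x" * 1000)  # large value, goes to the value log
store.put("order:2", b"ok")        # small value, stays inline
```

In this toy model, the 1,000-byte value contributes only a pointer to the index, so the data that would be moved when the index is reorganized stays small even when the stored rows are large.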
- Supports adaptive parameter tuning.
In stream processing jobs, different operators have different state access modes, and in most cases different combinations of parameters are required to achieve the optimal performance of state storage. These parameters involve underlying technologies, so manual tuning has a high learning cost. To resolve this issue, GeminiStateBackend supports adaptive parameter tuning. While a job is running, parameter configurations are automatically tuned based on the current data access mode and traffic to achieve the optimal performance of state storage in various scenarios. Verification during the Double 11 Shopping Festival of Alibaba Group showed that this technology reduces manual parameter tuning by more than 95% and increases single-core throughput by 10% to 40%. For more information about the configurations of adaptive parameter tuning, see Parameters for adaptive parameter tuning.
Performance comparison by using Nexmark
|Case name|Gemini TPS/core|RocksDB TPS/core|Improved by|
|---|---|---|---|
|q4|83.63 K/s|53.26 K/s|57.02%|
|q5|84.52 K/s|57.86 K/s|46.08%|
|q8|468.96 K/s|361.37 K/s|29.77%|
|q9|59.42 K/s|26.56 K/s|123.72%|
|q11|93.08 K/s|48.82 K/s|90.66%|
|q18|150.93 K/s|87.37 K/s|72.75%|
|q19|143.46 K/s|58.5 K/s|145.23%|
|q20|75.69 K/s|22.44 K/s|237.30%|
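The "Improved by" column appears to be the relative gain in per-core throughput, that is, (Gemini TPS / RocksDB TPS - 1) × 100. The short sketch below recomputes it from the throughput values in the table above. The variable and function names are illustrative only.

```python
# (Gemini TPS/core, RocksDB TPS/core) in K/s, copied from the table above.
results = {
    "q4": (83.63, 53.26),
    "q5": (84.52, 57.86),
    "q8": (468.96, 361.37),
    "q9": (59.42, 26.56),
    "q11": (93.08, 48.82),
    "q18": (150.93, 87.37),
    "q19": (143.46, 58.5),
    "q20": (75.69, 22.44),
}


def improvement_pct(gemini_tps: float, rocksdb_tps: float) -> float:
    """Relative throughput gain of Gemini over RocksDB, in percent."""
    return round((gemini_tps / rocksdb_tps - 1) * 100, 2)


improvements = {
    query: improvement_pct(gemini, rocksdb)
    for query, (gemini, rocksdb) in results.items()
}

for query, pct in improvements.items():
    print(f"{query}: {pct:.2f}%")
```

Under this reading, a value above 100% (q9, q19, q20) means that GeminiStateBackend delivers more than twice the per-core throughput of RocksDBStateBackend for that query.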