Use Lindorm to migrate a high-throughput recommendation service from HBase

A leading online education company ran its recommendation service on self-built open source HBase clusters. As user traffic grew, the clusters struggled to keep up: write throughput hit a ceiling, garbage collection (GC) pauses caused latency spikes, and storage costs climbed with every new cohort of data. Manual scaling ahead of promotion events added further operational risk.

After migrating to Lindorm, the recommendation service handles 200,000 write operations per second, three times the throughput of the previous self-built setup. Write latency dropped to one tenth of its original value, and storage costs fell by more than 50%.

Challenges

Insufficient write throughput. Self-built HBase clusters could not sustain the hundreds of thousands of write and compute events per second required by the recommendation pipeline, creating a hard ceiling on service capacity.
Latency spikes from GC pauses. Deficiencies in the GC mechanism of open source HBase caused unpredictable stop-the-world pauses, making stable latency impossible for the recommendation service.
Storage costs growing unchecked. As stored data volumes increased, storage costs scaled linearly with no mechanism to separate infrequently accessed historical data from hot working data.
High operations and maintenance (O&M) overhead from manual scaling. Without a unified O&M platform, engineers had to scale HBase clusters manually, leading to operational failures and high labor costs — especially before high-traffic promotion events.

Solution

High-throughput writes with linear scalability

Lindorm addresses write throughput through three complementary mechanisms:

Group Commit: an optimized batch write mechanism that makes batch writes three times as fast.
Lindorm Log Consensus (LLC): a three-replica architecture that uses quorum-based algorithms to reduce write latency by 50%.
Linear scalability: a single table supports tens of millions of read and write operations without database or table partitioning.
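The first two mechanisms can be illustrated with a minimal sketch. This is not Lindorm's implementation: the function names `group_commit` and `quorum_ack_latency`, the batch size, and the latency figures are all hypothetical. It only shows the general ideas, namely that batching lets many writes share one log sync, and that a quorum write over three replicas is acknowledged once any two replicas respond.

```python
from typing import List


def group_commit(pending_writes: List[bytes], max_batch: int) -> List[List[bytes]]:
    """Split pending writes into batches; each batch would share a single log sync."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    return [
        pending_writes[i:i + max_batch]
        for i in range(0, len(pending_writes), max_batch)
    ]


def quorum_ack_latency(replica_latencies_ms: List[float]) -> float:
    """Return the time at which a majority of replicas have acknowledged a write."""
    if not replica_latencies_ms:
        raise ValueError("at least one replica is required")
    quorum = len(replica_latencies_ms) // 2 + 1
    # The write completes when the quorum-th fastest replica answers.
    return sorted(replica_latencies_ms)[quorum - 1]


# Hypothetical workload: ten events, at most four per batch.
writes = [f"event-{i}".encode() for i in range(10)]
batches = group_commit(writes, max_batch=4)
print(len(batches))  # 3 log syncs instead of 10

# Hypothetical replica response times: one replica is slow.
print(quorum_ack_latency([2.0, 3.0, 40.0]))  # 3.0
```

With three replicas the quorum is two, so a single slow replica does not delay the acknowledgment.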

Stable latency through optimized GC

Lindorm's GC mechanism is built on the Z Garbage Collector (ZGC) provided by Alibaba JDK (AJDK). It significantly reduces the maximum response latency seen by 99.9% of requests (P99.9 latency), giving the recommendation service stable, predictable latency.
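To make the P99.9 measure concrete, the sketch below computes it with the nearest-rank method over synthetic latency samples. The helper name `latency_at_per_mille` and all sample values are assumptions made for illustration; they are not measurements from this migration.

```python
from typing import Sequence


def latency_at_per_mille(samples_ms: Sequence[float], per_mille: int = 999) -> float:
    """Nearest-rank percentile: the smallest latency that covers per_mille/1000 of requests."""
    if not samples_ms:
        raise ValueError("samples_ms must not be empty")
    if not 0 < per_mille <= 1000:
        raise ValueError("per_mille must be in 1..1000")
    ordered = sorted(samples_ms)
    # Integer ceiling of per_mille * n / 1000 avoids floating-point rounding.
    rank = (per_mille * len(ordered) + 999) // 1000
    return ordered[rank - 1]


# Synthetic data: 20 of 10,000 requests are stalled by long pauses.
with_long_pauses = [2.0] * 9_980 + [150.0] * 20
# Synthetic data: only 5 of 10,000 requests are slowed, and by much less.
with_short_pauses = [2.0] * 9_995 + [12.0] * 5

print(latency_at_per_mille(with_long_pauses))   # 150.0
print(latency_at_per_mille(with_short_pauses))  # 2.0
```

Because P99.9 ignores only the slowest 0.1% of requests, even a small number of long pauses shows up in it, which is why it is a useful measure of latency stability.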

Lower storage costs with hot and cold data separation

Lindorm uses optimized compression algorithms to reduce storage costs by up to 50%. The hot and cold data separation feature stores hot and cold data from the same table on different storage media, with no changes to application code, which reduces costs further.
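One common way to separate hot and cold data is an age boundary: rows older than the boundary go to cheaper media. The sketch below assumes that approach. The function `storage_tier`, the 30-day boundary, and the row keys are hypothetical, and the sketch does not represent Lindorm's configuration syntax or internal logic.

```python
from datetime import datetime, timedelta


def storage_tier(written_at: datetime, now: datetime, cold_boundary: timedelta) -> str:
    """Classify a row as 'hot' or 'cold' by its age relative to the boundary."""
    return "cold" if now - written_at >= cold_boundary else "hot"


# Hypothetical rows from one table, keyed by row key, with their write times.
now = datetime(2024, 1, 31)
rows = {
    "lesson:1001": datetime(2024, 1, 30),  # written 1 day ago
    "lesson:0042": datetime(2023, 11, 2),  # written 90 days ago
}

tiers = {key: storage_tier(ts, now, timedelta(days=30)) for key, ts in rows.items()}
print(tiers)  # {'lesson:1001': 'hot', 'lesson:0042': 'cold'}
```

Because the tier is decided by the storage layer from the write time, the application reads and writes the table the same way regardless of where a row is stored.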

Automated scaling with storage-compute decoupled architecture

Lindorm separates storage nodes from compute nodes. Each layer scales independently based on actual traffic without interrupting running services. The platform automatically rebalances data and requests, eliminating the manual scaling work that previously caused operational failures.
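The benefit of independent scaling can be shown with simple capacity arithmetic. In the sketch below, only the 200,000 writes per second figure comes from this case study. The per-node capacities, the data volume, and the doubled promotion traffic are placeholder assumptions, not Lindorm specifications.

```python
def nodes_needed(demand: int, capacity_per_node: int) -> int:
    """Smallest number of nodes whose combined capacity covers the demand."""
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    return -(-demand // capacity_per_node)  # integer ceiling division


# Placeholder capacities: 25,000 writes/s per compute node, 16 TB per storage node.
compute_nodes = nodes_needed(200_000, 25_000)  # normal write traffic
storage_nodes = nodes_needed(120, 16)          # hypothetical 120 TB of data

# Hypothetical promotion event: write traffic doubles, data volume is unchanged.
promo_compute_nodes = nodes_needed(400_000, 25_000)

print(compute_nodes, storage_nodes, promo_compute_nodes)  # 8 8 16
```

In this example only the compute layer grows for the promotion, while the storage layer stays the same size.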

Results: 3x throughput, 1/10 write latency, 50%+ cost reduction

Dimension	Before (self-built HBase)	After (Lindorm)	Improvement
Write throughput	Capacity ceiling reached	200,000 write operations per second	3x higher
Write latency	Baseline	1/10 of previous	90% reduction
Data compression rate	Baseline	2x higher compression ratio	—
Storage costs	Growing linearly	Reduced by more than 50%	Further reduced by hot and cold data separation
Service stability	Recurring latency spikes	No faults reported after migration	Stable P99.9 latency
Scaling for promotion events	Manual, error-prone	Self-service scale-out on the O&M platform	Reduced costs and operational risk

A unified O&M platform now handles scaling for Spring Festival promotion events and beyond, removing the need for manual intervention and eliminating the overhead of managing multiple HBase clusters independently.