Lindorm: Migrate the recommendation service of a leading online education company to Lindorm

Last Updated: Mar 28, 2026

A leading online education company ran its recommendation service on self-built open source HBase clusters. As user traffic grew, the clusters struggled to keep up: write throughput hit a ceiling, garbage collection (GC) pauses caused latency spikes, and storage costs climbed with every new cohort of data. Manual scaling ahead of promotion events added further operational risk.

After migrating to Lindorm, the recommendation service handles 200,000 write operations per second — three times the throughput of the previous self-built setup — with write latency reduced to 1/10 of the original and storage costs cut by more than 50%.

Challenges

  • Insufficient write throughput. Self-built HBase clusters could not sustain the hundreds of thousands of write and compute events per second required by the recommendation pipeline, creating a hard ceiling on service capacity.

  • Latency spikes from GC pauses. Deficiencies in the GC mechanism of open source HBase caused unpredictable stop-the-world pauses, making stable latency impossible for the recommendation service.

  • Storage costs growing unchecked. As stored data volumes increased, storage costs scaled linearly with no mechanism to separate infrequently accessed historical data from hot working data.

  • High operations and maintenance (O&M) overhead from manual scaling. Without a unified O&M platform, engineers had to scale HBase clusters manually, leading to operational failures and high labor costs — especially before high-traffic promotion events.

Solution

High-throughput writes with linear scalability

Lindorm addresses write throughput through three complementary mechanisms:

  1. Group Commit: an optimized batch write mechanism that improves batch write performance by three times.

  2. Lindorm Log Consensus (LLC): a three-replica architecture that uses quorum-based algorithms to reduce write latency by 50%.

  3. Linear scalability: a single table supports tens of millions of read and write operations without sharding data across databases and tables.
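
The batching idea behind Group Commit can be illustrated with a minimal Python sketch. It is not Lindorm's implementation: the `WriteAheadLog` and `GroupCommitter` classes and the `max_batch` parameter are hypothetical names chosen for illustration. The sketch only shows why grouping writes helps: several pending writes share one expensive durable flush instead of paying for one flush each.

```python
from dataclasses import dataclass, field


@dataclass
class WriteAheadLog:
    """Toy log that counts how many (expensive) durable flushes it performs."""
    records: list = field(default_factory=list)
    sync_count: int = 0

    def append_and_sync(self, batch):
        self.records.extend(batch)
        self.sync_count += 1  # one flush covers the whole batch


class GroupCommitter:
    """Buffers writes and commits them to the log in groups (hypothetical helper)."""

    def __init__(self, log, max_batch=4):
        self.log = log
        self.max_batch = max_batch
        self.pending = []

    def write(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.pending:
            self.log.append_and_sync(self.pending)
            self.pending = []


log = WriteAheadLog()
committer = GroupCommitter(log, max_batch=4)
for i in range(10):
    committer.write(f"event-{i}")
committer.flush()  # commit the final partial group

print(len(log.records), log.sync_count)  # 10 records, 3 flushes
```

In this toy run, ten writes cost three flushes rather than ten. A production system would also bound how long a write may wait in the buffer, which the sketch leaves out.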

Stable latency through optimized GC

Lindorm uses a GC mechanism that is optimized on top of the Z Garbage Collector (ZGC) provided by Alibaba JDK (AJDK). This significantly reduces P99.9 latency, that is, the maximum response latency for 99.9% of requests, and delivers stable, predictable latency for the recommendation service.
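
To make the link between GC pauses and tail latency concrete, here is a small Python sketch that computes a nearest-rank P99.9 over synthetic latencies. All numbers are made up for illustration and are not measurements from this case study. The function name `percentile_per_mille` is likewise an assumption.

```python
def percentile_per_mille(samples, per_mille):
    """Nearest-rank percentile; per_mille=999 means P99.9.

    Integer arithmetic avoids floating-point rounding in the rank.
    """
    ordered = sorted(samples)
    rank = (len(ordered) * per_mille + 999) // 1000  # ceil(n * per_mille / 1000)
    return ordered[max(rank, 1) - 1]


# Synthetic latencies in milliseconds: 0.2% of requests coincide with a GC pause.
fast_ms = 2
long_pauses = [fast_ms] * 9980 + [500] * 20   # long stop-the-world pauses
short_pauses = [fast_ms] * 9980 + [10] * 20   # same pause frequency, shorter pauses

for name, samples in (("long pauses", long_pauses), ("short pauses", short_pauses)):
    median = percentile_per_mille(samples, 500)
    p999 = percentile_per_mille(samples, 999)
    print(f"{name}: median={median} ms, P99.9={p999} ms")
```

The median is 2 ms in both cases, while P99.9 drops from 500 ms to 10 ms. That is why a small fraction of paused requests can dominate the tail even when typical latency looks healthy.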

Lower storage costs with hot and cold data separation

Lindorm uses optimized compression algorithms to reduce storage costs by up to 50%. The hot and cold data separation feature stores hot and cold data from the same table in different storage media — without any changes to application code — reducing costs further.
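
The tiering decision can be pictured with a minimal Python sketch. It assumes a purely age-based rule; the `COLD_AFTER` boundary of 30 days and the `choose_tier` function are hypothetical and do not describe Lindorm's actual policy or configuration. In the service described here the placement is handled by the database, so application code would not contain logic like this.

```python
from datetime import datetime, timedelta, timezone

COLD_AFTER = timedelta(days=30)  # hypothetical boundary, not a Lindorm default


def choose_tier(written_at, now, cold_after=COLD_AFTER):
    """Return 'cold' for rows at least as old as the boundary, otherwise 'hot'."""
    return "cold" if now - written_at >= cold_after else "hot"


now = datetime(2026, 3, 28, tzinfo=timezone.utc)
rows = {
    "user42:click": now - timedelta(hours=3),     # recent behavior event
    "user42:signup": now - timedelta(days=400),   # historical record
}
tiers = {key: choose_tier(written_at, now) for key, written_at in rows.items()}
print(tiers)
```

Both rows belong to the same logical table, yet the recent event maps to the hot tier and the old record to the cold tier.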

Automated scaling with storage-compute decoupled architecture

Lindorm separates storage nodes from compute nodes. Each layer scales independently based on actual traffic without interrupting running services. The platform automatically rebalances data and requests, eliminating the manual scaling work that previously caused operational failures.
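
As a rough picture of rebalancing after a scale-out, the following Python sketch reassigns a fixed set of shards across compute nodes. It is a toy model under stated assumptions: the `rebalance` function, the shard and node names, and the round-robin strategy are all hypothetical, and a real scheduler would also weigh load and limit how many shards move.

```python
def rebalance(shards, nodes):
    """Assign shards to compute nodes round-robin so each node gets an even share."""
    assignment = {node: [] for node in nodes}
    for index, shard in enumerate(shards):
        assignment[nodes[index % len(nodes)]].append(shard)
    return assignment


shards = [f"shard-{i}" for i in range(12)]
before = rebalance(shards, ["node-a", "node-b", "node-c"])
after = rebalance(shards, ["node-a", "node-b", "node-c", "node-d"])

print({node: len(owned) for node, owned in before.items()})  # 4 shards per node
print({node: len(owned) for node, owned in after.items()})   # 3 shards per node
```

The sketch models only which compute node serves which shard. Adding a fourth node lowers the per-node share from four shards to three without changing the set of shards itself.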

Results: 3x throughput, 1/10 write latency, 50%+ cost reduction

| Dimension | Before (self-built HBase) | After (Lindorm) | Improvement |
| --- | --- | --- | --- |
| Write throughput | Capacity ceiling reached | 200,000 write operations per second | 3x higher |
| Write latency | Baseline | 1/10 of previous | 90% reduction |
| Data compression rate | Baseline | 2x higher compression ratio | |
| Storage costs | Growing linearly | Reduced by more than 50% | Further reduced by hot and cold data separation |
| Post-migration incidents | Recurring latency spikes | No faults reported after migration | Stable P99.9 latency |
| Scaling for promotion events | Manual, error-prone | Self-service scale-out on the O&M platform | Reduced costs and operational risk |
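
The ratios reported above can be cross-checked with a few lines of Python. The sketch uses only the stated figures. The implied pre-migration throughput is a derived value that assumes "three times" is meant literally; it is not a number reported in the case study.

```python
from fractions import Fraction

after_writes_per_s = 200_000
throughput_factor = 3
# Implied (not reported) pre-migration throughput if "3x" is taken literally.
implied_before = Fraction(after_writes_per_s, throughput_factor)

latency_ratio = Fraction(1, 10)            # latency is 1/10 of the original
latency_reduction = 1 - latency_ratio      # fraction of latency removed

compression_gain = 2                       # 2x higher compression ratio
storage_reduction = 1 - Fraction(1, compression_gain)  # same data in half the bytes

print(round(implied_before))               # 66667
print(f"{float(latency_reduction):.0%}")   # 90%
print(f"{float(storage_reduction):.0%}")   # 50%
```

A 2x compression ratio alone accounts for a 50% reduction in stored bytes, which is consistent with hot and cold data separation pushing the total saving above 50%.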

A unified O&M platform now handles scaling for Spring Festival promotion events and beyond, removing the need for manual intervention and eliminating the overhead of managing multiple HBase clusters independently.