A leading online education company ran its recommendation service on self-built open source HBase clusters. As user traffic grew, the clusters struggled to keep up: write throughput hit a ceiling, garbage collection (GC) pauses caused latency spikes, and storage costs climbed with every new cohort of data. Manual scaling ahead of promotion events added further operational risk.
After migrating to Lindorm, the recommendation service handles 200,000 write operations per second — three times the throughput of the previous self-built setup — with write latency reduced to 1/10 of the original and storage costs cut by more than 50%.
Challenges
Insufficient write throughput. Self-built HBase clusters could not sustain the hundreds of thousands of write and compute events per second required by the recommendation pipeline, creating a hard ceiling on service capacity.
Latency spikes from GC pauses. Deficiencies in the GC mechanism of open source HBase caused unpredictable stop-the-world pauses, making stable latency impossible for the recommendation service.
Storage costs growing unchecked. As stored data volumes increased, storage costs scaled linearly with no mechanism to separate infrequently accessed historical data from hot working data.
High operations and maintenance (O&M) overhead from manual scaling. Without a unified O&M platform, engineers had to scale HBase clusters manually, leading to operational failures and high labor costs — especially before high-traffic promotion events.
Solution
High-throughput writes with linear scalability
Lindorm addresses write throughput through three complementary mechanisms:
Group Commit: an optimized batch write mechanism that delivers three times the batch write performance.
Lindorm Log Consensus (LLC): a three-replica architecture that uses a quorum-based consensus algorithm to reduce write latency by 50%.
Linear scalability: a single table supports tens of millions of read and write operations without splitting data across multiple databases or tables.
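The idea behind group commit can be illustrated with a minimal sketch. This is not Lindorm's implementation, which the case study does not describe. The `GroupCommitLog` class and its methods are hypothetical names, and the disk sync is simulated by appending a batch to a list. The sketch shows only the general technique: many queued writes share the cost of a single flush.

```python
import threading
from typing import List


class GroupCommitLog:
    """Toy write-ahead log (hypothetical) that batches appends into one flush."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: List[bytes] = []
        # Each inner list stands in for one simulated disk sync.
        self.flushed: List[List[bytes]] = []

    def append(self, record: bytes) -> None:
        """Queue a record; it is not considered durable until flush() runs."""
        with self._lock:
            self._pending.append(record)

    def flush(self) -> int:
        """Persist every queued record with a single simulated sync."""
        with self._lock:
            batch, self._pending = self._pending, []
        if batch:
            self.flushed.append(batch)  # assumption: replaces a real fsync
        return len(batch)


log = GroupCommitLog()
for i in range(5):
    log.append(f"event-{i}".encode())
committed = log.flush()  # five records share one simulated sync
```

In a real system the flush would typically be triggered by a background thread or a size or time threshold, and callers would wait for the flush that covers their record before acknowledging the write.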
Stable latency through optimized GC
Lindorm uses a GC mechanism built on the Z Garbage Collector (ZGC) provided by Alibaba JDK (AJDK). This significantly lowers P99.9 latency, the maximum response time observed by 99.9% of requests, and gives the recommendation service stable, predictable latency.
Lower storage costs with hot and cold data separation
Lindorm uses optimized compression algorithms to reduce storage costs by up to 50%. The hot and cold data separation feature stores hot and cold data from the same table in different storage media — without any changes to application code — reducing costs further.
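As a rough illustration of hot and cold data separation, the sketch below routes rows from one table to two tiers by age. The time-based boundary, the `Row` type, and the `split_by_age` function are assumptions made for this example. The case study does not state how Lindorm decides which data is cold.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Row:
    key: str
    write_ts: int  # write time in seconds since the epoch


def split_by_age(rows: List[Row], now: int, boundary_seconds: int) -> Dict[str, List[Row]]:
    """Assign each row to a 'hot' or 'cold' tier based on its age.

    Assumption: rows older than boundary_seconds are treated as cold.
    """
    tiers: Dict[str, List[Row]] = {"hot": [], "cold": []}
    for row in rows:
        age = now - row.write_ts
        tier = "cold" if age > boundary_seconds else "hot"
        tiers[tier].append(row)
    return tiers


rows = [Row("recent-click", 1000), Row("old-click", 100)]
tiers = split_by_age(rows, now=1100, boundary_seconds=500)
```

Because the tiering decision happens inside the storage layer, the application keeps reading and writing a single table, which is why no application code changes are needed.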
Automated scaling with storage-compute decoupled architecture
Lindorm separates storage nodes from compute nodes. Each layer scales independently based on actual traffic without interrupting running services. The platform automatically rebalances data and requests, eliminating the manual scaling work that previously caused operational failures.
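The rebalancing step can be pictured with a simple greedy sketch. It is not Lindorm's algorithm, which the case study does not detail. The `rebalance` function, shard names, and load figures are hypothetical. The sketch assigns the heaviest shards first to whichever compute node currently carries the least load.

```python
import heapq
from typing import Dict, List


def rebalance(shard_load: Dict[str, int], nodes: List[str]) -> Dict[str, List[str]]:
    """Greedy assignment: heaviest shard first, to the least-loaded node."""
    heap = [(0, node) for node in nodes]  # (total load, node name)
    heapq.heapify(heap)
    assignment: Dict[str, List[str]] = {node: [] for node in nodes}
    # Sort by descending load, then by name so the result is deterministic.
    for shard, load in sorted(shard_load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, node = heapq.heappop(heap)
        assignment[node].append(shard)
        heapq.heappush(heap, (total + load, node))
    return assignment


loads = {"s1": 50, "s2": 30, "s3": 20, "s4": 10}  # hypothetical requests per shard
before = rebalance(loads, ["node-a", "node-b"])
after = rebalance(loads, ["node-a", "node-b", "node-c"])  # scale out by one node
```

In a storage-compute decoupled design, a reassignment like this generally changes only which compute node serves a shard. The underlying data stays in the shared storage layer, so scaling out does not have to wait for a bulk data copy.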
Results: 3x throughput, 1/10 write latency, 50%+ cost reduction
| Dimension | Before (self-built HBase) | After (Lindorm) | Improvement |
|---|---|---|---|
| Write throughput | Capacity ceiling reached | 200,000 write operations per second | 3x higher |
| Write latency | Baseline | 1/10 of previous | 90% reduction |
| Data compression rate | Baseline | 2x higher compression ratio | — |
| Storage costs | Growing linearly | Reduced by more than 50% | Further reduced by hot and cold data separation |
| Service stability | Recurring latency spikes | No faults reported after migration | Stable P99.9 latency |
| Scaling for promotion events | Manual, error-prone | Self-service scale-out on the O&M platform | Reduced costs and operational risk |
A unified O&M platform now handles scaling for Spring Festival promotion events and beyond, removing the need for manual intervention and eliminating the overhead of managing multiple HBase clusters independently.