BladeLLM is an inference engine tailored for large language model (LLM) optimization and high-performance model deployment. BladeLLM features an advanced technical architecture, a user-friendly interface, and outstanding performance to address new opportunities and challenges in the LLM field. This makes BladeLLM a suitable choice for enterprises that want to deploy LLMs and run inference on them.
Technical architecture
The following figure shows the technical architecture of BladeLLM.
Deployment platform
BladeLLM is compatible with various GPU architectures, including NVIDIA, AMD, and other GPUs. BladeLLM is also deeply integrated with Elastic Algorithm Service (EAS) for resource scheduling and management to provide efficient and reliable one-stop model deployment.
BladeLLM
Model computation
BladeLLM features high-performance operators and AI compilation. BladeLLM is equipped with BlaDNN, a flexible LLM operator library that surpasses mainstream libraries in feature coverage and performance. FlashNN is an AI compilation-based operator library of BladeLLM that automatically generates operators and extends across multiple hardware platforms to match the performance of manually optimized operators.
Quantization compression is one of the most important model optimization methods in LLM inference scenarios. BladeLLM supports advanced algorithms, such as GPTQ, AWQ, SmoothQuant, and SmoothQuant+, which significantly improve throughput and reduce latency.
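To illustrate why weight quantization matters, the following back-of-the-envelope sketch compares the weight memory footprint of a hypothetical 7B-parameter model at FP16 and INT4 precision. The numbers are rough illustrative estimates, not BladeLLM measurements.

# Illustrative estimate of weight memory for a hypothetical 7B-parameter model.
# These are rough approximations, not BladeLLM measurements.
PARAMS = 7e9  # 7 billion parameters (assumed model size)

def weight_memory_gib(params: float, bits_per_weight: int) -> float:
    """Return the approximate weight footprint in GiB."""
    return params * bits_per_weight / 8 / 1024**3

fp16 = weight_memory_gib(PARAMS, 16)  # ~13.0 GiB
int4 = weight_memory_gib(PARAMS, 4)   # ~3.3 GiB (e.g., 4-bit GPTQ/AWQ weights)

print(f"FP16 weights: {fp16:.1f} GiB, INT4 weights: {int4:.1f} GiB")
print(f"Memory freed for the KV cache: {fp16 - int4:.1f} GiB")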
BladeLLM supports distributed inference over multiple GPUs, provides tensor parallelism and pipeline parallelism strategies, and supports arbitrary degrees of parallelism to address the GPU memory bottleneck issues of LLMs.
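As a rough illustration of how tensor parallelism relieves the GPU memory bottleneck, the sketch below splits the weight footprint of a hypothetical 70B-parameter FP16 model across different degrees of parallelism. The model size and GPU memory figure are assumptions for illustration only.

# Illustrative per-GPU weight footprint under tensor parallelism (TP).
# Assumes a hypothetical 70B-parameter FP16 model; ignores activations and KV cache.
TOTAL_WEIGHT_GIB = 70e9 * 2 / 1024**3  # FP16 = 2 bytes per parameter, ~130 GiB

for tp_degree in (1, 2, 4, 8):
    per_gpu = TOTAL_WEIGHT_GIB / tp_degree
    print(f"TP={tp_degree}: ~{per_gpu:.0f} GiB of weights per GPU")
# TP=1 does not fit on a single 80 GiB GPU, whereas TP=2 or higher does.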
Generation engine
In addition to optimizations in model computation, BladeLLM provides a fully asynchronous runtime designed specifically for LLM scenarios to handle high-concurrency service requests in real-world applications. User requests are asynchronously submitted to the batch scheduling module, asynchronously forwarded to the generation engine, and finally processed by asynchronous decoding.
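The asyncio sketch below illustrates the general idea of such a fully asynchronous pipeline: submission, batch scheduling, generation, and decoding run as independent coroutines connected by queues so that no stage blocks another. It is a simplified illustration with hypothetical names, not BladeLLM's actual runtime.

import asyncio

# Simplified illustration of an asynchronous request pipeline. All names are
# hypothetical; this is not BladeLLM's code.
async def submit(requests, schedule_q):
    for req in requests:
        await schedule_q.put(req)      # asynchronous submission
    await schedule_q.put(None)         # sentinel: no more requests

async def batch_scheduler(schedule_q, engine_q):
    while (req := await schedule_q.get()) is not None:
        await engine_q.put(req)        # forward to the generation engine
    await engine_q.put(None)

async def generation_engine(engine_q, decode_q):
    while (req := await engine_q.get()) is not None:
        await asyncio.sleep(0.01)      # stand-in for model forward passes
        await decode_q.put((req, [101, 102, 103]))  # fake token IDs
    await decode_q.put(None)

async def decoder(decode_q):
    while (item := await decode_q.get()) is not None:
        req, tokens = item
        print(f"{req}: decoded {len(tokens)} tokens asynchronously")

async def main():
    schedule_q, engine_q, decode_q = (asyncio.Queue() for _ in range(3))
    await asyncio.gather(
        submit(["req-1", "req-2"], schedule_q),
        batch_scheduler(schedule_q, engine_q),
        generation_engine(engine_q, decode_q),
        decoder(decode_q),
    )

asyncio.run(main())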
BladeLLM supports continuous batching, which improves throughput and the first-packet response speed.
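The toy scheduler below sketches the continuous batching idea: finished sequences leave the batch and waiting requests join it at every decoding step, instead of the whole batch finishing together. It is an illustration, not BladeLLM's actual scheduler.

import random
from collections import deque

# Toy continuous batching: the running batch is refilled from the waiting
# queue after every decoding step. Not BladeLLM's actual scheduler.
MAX_BATCH = 4
waiting = deque(f"req-{i}" for i in range(8))
running = {}  # request id -> tokens still to generate

step = 0
while waiting or running:
    # Admit new requests whenever a slot frees up (continuous batching).
    while waiting and len(running) < MAX_BATCH:
        running[waiting.popleft()] = random.randint(1, 5)
    # One decoding step produces one token for every running request.
    for req in list(running):
        running[req] -= 1
        if running[req] == 0:
            del running[req]  # finished requests leave the batch immediately
    step += 1

print(f"All requests finished in {step} decoding steps")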
Prompt Cache allows BladeLLM to obtain the previous calculation results from the cache for repeated or similar queries. This reduces the response time.
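The sketch below shows the basic mechanism behind a prompt cache: the result of processing a prompt prefix is stored under a key derived from its tokens and reused when a later request shares that prefix. It is a conceptual illustration; BladeLLM's implementation details may differ.

import hashlib

# Conceptual prompt-cache sketch keyed by the token prefix.
# This is an illustration, not BladeLLM's internal data structure.
cache = {}

def prefix_key(tokens):
    return hashlib.sha256(",".join(map(str, tokens)).encode()).hexdigest()

def prefill(tokens):
    key = prefix_key(tokens)
    if key in cache:
        return cache[key]                              # cache hit: skip recomputation
    kv_state = f"kv-state-for-{len(tokens)}-tokens"    # stand-in for KV-cache blocks
    cache[key] = kv_state
    return kv_state

system_prompt = [1, 2, 3, 4]
prefill(system_prompt)   # the first request computes and stores the prefix
prefill(system_prompt)   # a repeated query reuses the cached result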
During decoding, BladeLLM uses efficient decoding methods, such as speculative decoding and lookahead decoding, to predict subsequent tokens in advance. This accelerates token generation without sacrificing accuracy.
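The following toy loop shows the draft-then-verify pattern behind speculative decoding: a cheap draft model proposes several tokens, and the target model accepts the longest verified prefix, so the output is identical to target-only decoding. The two "models" here are stand-in functions, not BladeLLM APIs.

import random

# Toy draft-then-verify loop illustrating speculative decoding.
# draft_model and target_model are stand-in functions, not BladeLLM APIs.
def draft_model(context, k=4):
    # The cheap draft model proposes k tokens and sometimes guesses wrong.
    proposals, nxt = [], context[-1] + 1
    for _ in range(k):
        proposals.append(nxt if random.random() < 0.8 else nxt + 7)
        nxt += 1
    return proposals

def target_model(context, candidates):
    # Verify the candidates in order and accept the longest correct prefix.
    # For illustration, the "correct" next token is always previous + 1.
    accepted, expected = [], context[-1] + 1
    for tok in candidates:
        if tok != expected:
            break
        accepted.append(tok)
        expected += 1
    # If the first candidate is rejected, the target emits one token itself,
    # so the final output matches decoding with the target model alone.
    return accepted or [context[-1] + 1]

context = [0]
while len(context) < 16:
    context.extend(target_model(context, draft_model(context)))
print(context)  # always [0, 1, 2, ...]: accuracy is preserved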
Service framework
As the model scale increases, the resources of a single instance may no longer be sufficient, and models need to be deployed across multiple instances. BladeLLM implements efficient distributed scheduling strategies and combines them with the intelligent LLM routing of EAS to achieve dynamic request distribution and balanced loads, maximizing cluster utilization.
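A load-balancing policy of the kind such a setup relies on can be sketched in a few lines. The instance names and the least-outstanding-requests policy below are illustrative assumptions, not the actual EAS LLM routing algorithm.

# Illustrative least-outstanding-requests router across multiple instances.
# Instance names and the policy are assumptions, not the actual EAS LLM router.
outstanding = {"instance-a": 0, "instance-b": 0, "instance-c": 0}

def route(request_id: str) -> str:
    # Pick the instance with the fewest in-flight requests to balance the load.
    target = min(outstanding, key=outstanding.get)
    outstanding[target] += 1
    print(f"{request_id} -> {target}")
    return target

def complete(instance: str) -> None:
    outstanding[instance] -= 1

for i in range(5):
    route(f"req-{i}")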
Scenario
BladeLLM supports various scenarios, including chat, Retrieval-Augmented Generation (RAG), multimodal, and JSON mode, providing efficient model deployment solutions.
User experience
BladeLLM prioritizes a user-friendly experience to simplify the deployment and usage of LLMs.
Simple and convenient startup: BladeLLM provides scenario-based deployment in EAS with pre-configured images, startup commands, and common parameters. This way, users only need to select an open source or custom model and an appropriate instance type to achieve one-click deployment of model services.
Flexible and easy invocation: BladeLLM supports streaming and non-streaming response interfaces by using HTTP Server-Sent Events (SSE). The interfaces are compatible with the OpenAI API protocol for quick business system integration, as shown in the client sketch after this list.
Powerful and rich model compatibility: The BladeLLM model format is compatible with community standards, such as Hugging Face and ModelScope, which allows users to directly use existing model weights without additional conversion.
Out-of-the-box optimization options: BladeLLM supports optimization features, such as quantization compression, speculative sampling, and prompt cache, which users can enable with simple parameter configuration.
Stable and comprehensive production support: BladeLLM provides production-ready images and real-time monitoring and performance testing tools on EAS to ensure steady and reliable operation of the business of customers.
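Because the invocation interface is OpenAI-compatible and streams over SSE, a client call can look like the hedged sketch below. The endpoint URL, token, and model name are placeholders that depend on your EAS deployment.

import json
import requests

# Hedged example of calling an OpenAI-compatible streaming endpoint over SSE.
# The URL, token, and model name are placeholders for your EAS deployment.
URL = "https://<your-eas-endpoint>/v1/chat/completions"
HEADERS = {"Authorization": "Bearer <your-eas-token>"}
payload = {
    "model": "<your-model-name>",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,  # set to False for a non-streaming response
}

with requests.post(URL, headers=HEADERS, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"].get("content", ""), end="")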
Performance comparison
The following section compares the performance of BladeLLM v0.8.0 with that of a mainstream open source framework.
TTFT-QPS curve: BladeLLM reduces the Time To First Token (TTFT) by a factor of 2 to 3 in typical load scenarios and doubles the queries per second (QPS) in scenarios that have typical latency requirements for TTFT.
TBT-QPS curve: BladeLLM reduces the Time Between Tokens (TBT) by a factor of approximately 2 to 3.3 in typical load scenarios and increases QPS by 1.6 times in scenarios that have typical latency requirements for TBT.
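For reference, TTFT and TBT can be measured on the client side roughly as in the following sketch: TTFT is the delay until the first streamed token arrives, and TBT is the average gap between subsequent tokens. The token stream here is simulated.

import time

# Client-side sketch of measuring TTFT and TBT from a streamed response.
# The token stream is simulated; replace it with your service's SSE stream.
def fake_token_stream(n=5):
    for _ in range(n):
        time.sleep(0.05)   # stand-in for network and generation latency
        yield "token"

start = time.perf_counter()
arrivals = []
for _ in fake_token_stream():
    arrivals.append(time.perf_counter())

ttft = arrivals[0] - start                              # Time To First Token
gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
tbt = sum(gaps) / len(gaps)                             # mean Time Between Tokens
print(f"TTFT: {ttft * 1000:.1f} ms, TBT: {tbt * 1000:.1f} ms")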