|
Date
|
Image version
|
Built-in library version
|
Updates
|
|
2024.6.21
|
-
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4
Tag: chat-llm-webui:3.0
-
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-flash-attn
-
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-vllm
Tag: chat-llm-webui:3.0-vllm
-
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-vllm-flash-attn
-
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-blade
Tag: chat-llm-webui:3.0-blade
|
-
Torch: 2.3.0
-
Torchvision: 0.18.0
-
Transformers: 4.41.2
-
vLLM: 0.5.0.post1
-
vllm-flash-attn: 2.5.9
-
Blade: 0.7.0
|
-
Supports Rerank model deployment.
-
Supports simultaneous or separate deployment of Embedding, Rerank, and LLM models.
-
The Transformers backend supports Deepseek-V2, Yi1.5, and Qwen2.
-
Changes the model type of Qwen1.5 to qwen1.5.
-
The vLLM backend supports Qwen2.
-
The BladeLLM backend supports Llama3 and Qwen2.
-
The Hugging Face (HF) backend supports batch inputs.
-
The BladeLLM backend supports OpenAI Chat.
-
Fixes access to BladeLLM metrics.
-
The Transformers backend supports 8-bit floating point (FP8) model deployment.
-
The Transformers backend supports multiple quantization tools, such as AWQ, HQQ, and Quanto.
-
The vLLM backend supports FP8.
-
The vLLM and Blade inference parameters support setting stop words.
-
The Transformers backend is adapted for H-series GPUs.
|
|
2024.4.30
|
-
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3
-
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-flash-attn
-
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-vllm
-
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-vllm-flash-attn
-
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.3-blade
|
-
Torch: 2.3.0
-
Torchvision: 0.18.0
-
Transformers: 4.40.2
-
vLLM: 0.4.2
-
Blade: 0.5.1
|
-
Supports embedding model deployment.
-
The vLLM backend supports returning token usage.
-
Supports Sentence-Transformers model deployment.
-
The Transformers backend supports yi-9B, qwen2-moe, llama3, qwencode, qwen1.5-32B/110B, phi-3, and gemma-1.1-2B/7B.
-
The vLLM backend supports yi-9B, qwen2-moe, SeaLLM, llama3, and phi-3.
-
The Blade backend supports qwen1.5 and SeaLLM.
-
Supports multi-model deployment of LLM and Embedding models.
-
Releases a flash-attn runtime image for the Transformers backend.
-
Releases a flash-attn runtime image for the vLLM backend.
|
|
2024.3.28
|
-
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2
-
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2-vllm
-
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.2-blade
|
-
Torch: 2.1.2
-
Torchvision: 0.16.2
-
Transformers: 4.38.2
-
vLLM: 0.3.3
-
Blade: 0.4.8
|
-
Adds the Blade inference backend, which supports multi-GPU configurations on a single machine and quantization settings.
-
The Transformers backend performs inference based on tokenizer chat templates.
-
The HF backend supports Multi-LoRA inference.
-
Blade supports quantized model deployment.
-
Blade automatically splits models.
-
The Transformers backend supports Deepseek and Gemma.
-
The vLLM backend supports Deepseek and Gemma.
-
The Blade backend supports qwen1.5 and yi models.
-
The vLLM and Blade runtime images provide access to /metrics.
-
The Transformers backend supports token statistics in streaming returns.
|
|
2024.2.22
|
|
-
Torch: 2.1.2
-
Torchvision: 0.16.0
-
Transformers: 4.37.2
-
vLLM: 0.3.0
|
-
Extends vLLM parameter settings to support changing all inference parameters during inference.
-
vLLM supports Multi-LoRA.
-
vLLM supports quantized model deployment.
-
The vLLM runtime image no longer depends on the LangChain demo.
-
The Transformers inference backend supports qwen1.5 and qwen2 models.
-
The vLLM inference backend supports qwen1.5 and qwen2 models.
|
|
2024.1.23
|
|
-
Torch: 2.1.2
-
Torchvision: 0.16.2
-
Transformers: 4.37.2
-
vLLM: 0.2.6
|
-
Splits backend runtime images for independent compilation and publishing. Adds the new BladeLLM backend.
-
Supports the standard OpenAI API.
-
Models such as Baichuan support performance statistics.
-
Supports models such as yi-6b-chat, yi-34b-chat, and secgpt.
-
The openai/v1/chat/completions endpoint is adapted for the chatglm3 history format.
-
Optimizes asynchronous streaming.
-
vLLM model support is aligned with HF.
-
Optimizes backend API calls.
-
Improves error logs.
|
|
2023.12.6
|
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:2.1
Tag: chat-llm-webui:2.1
|
-
Torch: 2.0.1
-
Torchvision: 0.15.2
-
Transformers: 4.33.3
-
vLLM: 0.2.0
|
-
The HF backend supports mistral, zephyr, yi-6b, yi-34b, qwen-72b, qwen-1.8b, qwen7b-int4, qwen14b-int4, qwen7b-int8, qwen14b-int8, qwen-72b-int4, qwen-72b-int8, qwen-1.8b-int4, and qwen-1.8b-int8 models.
-
The vLLM backend supports Qwen and ChatGLM1/2/3 models.
-
The HF inference backend supports flash attention.
-
The ChatGLM series of models supports performance statistics.
-
Adds the --history-format command-line parameter to support setting roles.
-
The LangChain demo supports the Qwen model.
-
Optimizes the FastAPI streaming access interface.
|
|
2023.9.13
|
eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:2.0
Tag: chat-llm-webui:2.0
|
|
-
Supports multiple backends: vLLM and HF.
-
The LangChain demo supports ChatLLM and Llama2 models.
-
Supports models such as Baichuan, Baichuan2, Qwen, Falcon, Llama2, ChatGLM, ChatGLM2, ChatGLM3, and yi.
-
Adds HTTP and WebSocket support for conversation streaming.
-
Non-streaming responses include the number of generated tokens.
-
All models support multi-turn conversations.
-
Supports exporting conversation records.
-
Supports System Prompt settings and prompt concatenation for template-free inputs.
-
Inference parameters are configurable.
-
Supports Debug mode for logs, which includes inference time in the output.
-
The vLLM backend uses tensor parallelism (TP) by default for multi-GPU configurations on a single machine.
-
Supports model deployment with Float32, Float16, Int8, and Int4 precision.
|
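
The image references in the table appear to follow a regular pattern: a repository path ending in `chat-llm-webui`, a version number, and an optional backend suffix such as `-vllm`, `-blade`, or `-flash-attn`. The following is a minimal sketch of how a deployment script might split such a reference into its parts. The names `ImageRef`, `parse_image_ref`, and `backend_of` are hypothetical helpers, not part of any official tooling, and the mapping of an unsuffixed or `-flash-attn` image to the Transformers backend is an assumption inferred from the release notes above.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the image references listed in the table, for example
# ".../pai-eas/chat-llm-webui:3.0.4-vllm-flash-attn". This is an assumption,
# not an official naming specification.
_IMAGE_RE = re.compile(
    r"^(?P<repo>[^:\s]+/chat-llm-webui)"
    r":(?P<version>\d+(?:\.\d+)*)"
    r"(?:-(?P<variant>[a-z-]+))?$"
)


class ImageRef(NamedTuple):
    repo: str               # registry host and repository path
    version: str            # for example "3.0.4" or "2.1"
    variant: Optional[str]  # for example "vllm-flash-attn", or None


def parse_image_ref(ref: str) -> ImageRef:
    """Split an image reference into repository, version, and variant."""
    match = _IMAGE_RE.match(ref.strip())
    if match is None:
        raise ValueError(f"unrecognized image reference: {ref!r}")
    return ImageRef(match["repo"], match["version"], match["variant"])


def backend_of(ref: str) -> str:
    """Guess the inference backend from the variant suffix.

    Assumption: images without a backend suffix (including the plain
    "-flash-attn" image) run the Transformers (HF) backend.
    """
    variant = parse_image_ref(ref).variant or ""
    if variant.startswith("vllm"):
        return "vllm"
    if variant.startswith("blade"):
        return "blade"
    return "transformers"
```

For example, `parse_image_ref("eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4-vllm")` yields version `3.0.4` with variant `vllm`, and `backend_of` on the same reference returns `"vllm"`. References that do not match the assumed pattern raise `ValueError` rather than being guessed at.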