Tencent releases Hunyuan-A13B, possibly the most efficient open-source large model

The most popular model on Hugging Face today is Flux Kontext; the second most popular is Tencent's Hunyuan-A13B.

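Welcome to the official repository of Hunyuan-A13B, an innovative open-source large language model (LLM) built on a fine-grained Mixture of Experts (MoE) architecture. Hunyuan-A13B is designed for efficiency and scalability: it delivers cutting-edge performance with minimal computational overhead, which makes it well suited to advanced reasoning and general-purpose applications, especially in resource-constrained environments.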

Model Introduction

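With the rapid advancement of artificial intelligence, large language models (LLMs) have made significant progress in natural language processing, computer vision, and scientific tasks. As models keep growing, however, reducing resource consumption while maintaining high performance has become a key challenge. To address it, we explored the Mixture of Experts (MoE) architecture. The newly released Hunyuan-A13B has 80 billion parameters in total, of which 13 billion are active. It delivers strong results while keeping resource use low, balancing computational capability against resource utilization.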

Key Features and Advantages

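Compact yet powerful: with only 13 billion active parameters (80 billion in total), the model performs competitively with much larger models across a wide range of benchmarks.
Hybrid reasoning support: supports both fast and slow thinking modes, so users can choose whichever fits the task.
Ultra-long context understanding: natively supports a 256K context window and maintains stable performance on long-text tasks.
Enhanced agent capabilities: optimized for agent tasks, with leading results on benchmarks such as BFCL-v3, τ-Bench, and C3-Bench.
Efficient inference: uses Grouped Query Attention (GQA) and supports multiple quantization formats for efficient inference.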
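To make the "compact yet powerful" point concrete, here is a back-of-the-envelope sketch that computes the share of parameters active per token and a rough weight-memory footprint at a few precisions. The parameter counts come from the text above; the bytes-per-parameter values are generic assumptions rather than figures from the Hunyuan release, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
TOTAL_PARAMS = 80e9   # total parameters, as stated in the article
ACTIVE_PARAMS = 13e9  # parameters active per token, as stated in the article

# Assumed storage cost per parameter for common formats (illustrative only).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used for each token."""
    return active / total


def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold the weights, in GB (10^9 bytes)."""
    return total_params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"active fraction: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.2%}")
    for fmt, nbytes in BYTES_PER_PARAM.items():
        print(f"{fmt}: ~{weight_memory_gb(TOTAL_PARAMS, nbytes):.0f} GB of weights")
```

Note that an MoE model still has to keep all of its weights in memory; the 13 billion active parameters reduce the compute per token, not the storage for the full 80 billion.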

Why choose Hunyuan-A13B?

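As a powerful yet computationally efficient large model, Hunyuan-A13B is a good fit for researchers and developers who need high performance under resource constraints. Whether for academic research, cost-effective AI solution development, or exploring new applications, the model provides a solid foundation to build on.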

Benchmark

Note: the following benchmarks were evaluated with the TRT-LLM backend on several base models.

Model	Hunyuan-Large	Qwen2.5-72B	Qwen3-A22B	Hunyuan-A13B
MMLU	88.40	86.10	87.81	88.17
MMLU-Pro	60.20	58.10	68.18	67.23
MMLU-Redux	87.47	83.90	87.40	87.67
BBH	86.30	85.80	88.87	87.56
SuperGPQA	38.90	36.20	44.06	41.32
EvalPlus	75.69	65.93	77.60	78.64
MultiPL-E	59.13	60.50	65.94	69.33
MBPP	72.60	76.00	81.40	83.86
CRUX-I	57.00	57.63	–	70.13
CRUX-O	60.63	66.20	79.00	77.00
MATH	69.80	62.12	71.84	72.35
CMATH	91.30	84.80	–	91.17
GSM8k	92.80	91.50	94.39	91.83
GPQA	25.18	45.90	47.47	49.12

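Hunyuan-A13B-Instruct achieves highly competitive results on multiple benchmarks, particularly in mathematics, science, and agent tasks. We compared it with several strong models; the results are shown below.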

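Topic	Bench	OpenAI-o1-1217	DeepSeek R1	Qwen3-A22B	Hunyuan-A13B-Instruct
Mathematics	AIME 2024	74.3	79.8	85.7	87.3
Mathematics	AIME 2025	79.2	70	81.5	76.8
Mathematics	MATH	96.4	94.9	94.0	94.3
Science	GPQA-Diamond	78	71.5	71.1	71.2
Science	OlympiadBench	83.1	82.4	85.7	82.7
Coding	Livecodebench	63.9	65.9	70.7	63.9
Coding	Fullstack Bench	64.6	71.6	65.6	67.8
Coding	ArtifactsBench	38.6	44.6	44.6	43
Reasoning	BBH	80.4	83.7	88.9	89.1
Reasoning	DROP	90.2	92.2	90.3	91.1
Reasoning	ZebraLogic	81	78.7	80.3	84.7
Instruction Following	IF-Eval	91.8	88.3	83.4	84.7
Instruction Following	SysBench	82.5	77.7	74.2	76.1
Text Creation	LengthCtrl	60.1	55.9	53.3	55.4
Text Creation	InsCtrl	74.8	69	73.7	71.9
NLU	ComplexNLU	64.7	64.5	59.8	61.2
NLU	Word-Task	67.1	76.3	56.4	62.9
Agent	BFCL v3	67.8	56.9	70.8	78.3
Agent	τ-Bench	60.4	43.8	44.6	54.7
Agent	ComplexFuncBench	47.6	41.1	40.6	61.2
Agent	C3-Bench	58.8	55.3	51.7	63.5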

Using transformers

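Our model uses slow-thinking reasoning by default. There are two ways to disable CoT reasoning.

Pass "enable_thinking=False" when calling apply_chat_template.
Prefix the prompt with "/no_think" to force the model not to use CoT reasoning. Likewise, prefixing the prompt with "/think" forces the model to use CoT reasoning.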
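The prefix-based switch can be sketched as a small helper that builds the chat messages. The function names are hypothetical, and the single space placed between the prefix and the prompt is an assumption; the source only says the prefix goes before the prompt.

```python
def with_thinking_prefix(prompt: str, think: bool) -> str:
    """Prepend "/think" to force CoT reasoning, or "/no_think" to disable it."""
    prefix = "/think" if think else "/no_think"
    # Assumption: a single space separates the prefix from the prompt text.
    return f"{prefix} {prompt}"


def build_messages(prompt: str, think: bool) -> list:
    """Wrap the prefixed prompt in the chat-message format used by apply_chat_template."""
    return [{"role": "user", "content": with_thinking_prefix(prompt, think)}]
```

The returned list can be passed as the messages argument of tokenizer.apply_chat_template, which lets each prompt choose its own mode without changing the enable_thinking argument.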

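The following snippet shows how to load and run the model with the transformers library. It also demonstrates how to enable or disable reasoning mode and how to parse the reasoning process and the final output.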

from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import re

model_name_or_path = os.environ['MODEL_PATH']
# model_name_or_path = "tencent/Hunyuan-A13B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
# device_map="auto" places the weights on the available devices; you may also want to load in bfloat16
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
]
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    enable_thinking=True,  # Toggle thinking mode (default: True)
)

outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=4096)

output_text = tokenizer.decode(outputs[0])

think_pattern = r'<think>(.*?)</think>'
think_matches = re.findall(think_pattern, output_text, re.DOTALL)

answer_pattern = r'<answer>(.*?)</answer>'
answer_matches = re.findall(answer_pattern, output_text, re.DOTALL)

think_content = [match.strip() for match in think_matches][0]
answer_content = [match.strip() for match in answer_matches][0]
print(f"thinking_content:{think_content}\n\n")
print(f"answer_content:{answer_content}\n\n")
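The snippet above takes the first regex match directly, which raises an IndexError when a tag is missing, for example if generation is cut off by max_new_tokens before the closing tag. A more defensive parser might look like the sketch below. The function name split_reasoning and the fallback of treating untagged text as the answer are assumptions for illustration, not part of the official example.

```python
import re
from typing import Optional, Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_reasoning(output_text: str) -> Tuple[Optional[str], str]:
    """Return (thinking, answer); thinking is None when no <think> block is found."""
    think = THINK_RE.search(output_text)
    answer = ANSWER_RE.search(output_text)
    thinking = think.group(1).strip() if think else None
    if answer:
        return thinking, answer.group(1).strip()
    # Fallback (an assumption): treat everything outside <think> as the answer.
    remainder = THINK_RE.sub("", output_text).strip()
    return thinking, remainder
```

With the variables from the snippet above, this would be called as thinking, answer = split_reasoning(output_text).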

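Switching between fast and slow thinking

The model supports two modes of operation:

Slow thinking mode (default): performs detailed internal reasoning steps before generating the final answer.
Fast thinking mode: skips the internal reasoning process and generates the final answer directly, which speeds up inference.

Switching to fast thinking mode:

To disable the reasoning process, set enable_thinking=False when calling apply_chat_template: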

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False  # Use fast thinking mode
)

Deployment

For deployment, you can use frameworks such as TensorRT-LLM, vLLM, or SGLang to serve the model and create an OpenAI-compatible API endpoint.

image: https://hub.docker.com/r/hunyuaninfer/hunyuan-a13b/tags

TensorRT-LLM

Docker image

We provide a pre-built Docker image based on the latest version of TensorRT-LLM.

Get started:

https://hub.docker.com/r/hunyuaninfer/hunyuan-large/tags

docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm

docker run --name hunyuanLLM_infer --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm

Prepare the configuration file:

cat >/path/to/extra-llm-api-config.yml <<EOF
use_cuda_graph: true
cuda_graph_padding_enabled: true
cuda_graph_batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
print_iter_log: true
EOF

Start the API server:

trtllm-serve \
  /path/to/HunYuan-moe-A13B \
  --host localhost \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 32 \
  --max_num_tokens 16384 \
  --tp_size 2 \
  --kv_cache_free_gpu_memory_fraction 0.6 \
  --trust_remote_code \
  --extra_llm_api_options /path/to/extra-llm-api-config.yml
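Once the server is running, the OpenAI-compatible endpoint can be queried with any HTTP client. The sketch below uses only the Python standard library and assumes the standard /v1/chat/completions route on the host and port from the command above. The model name passed in the payload is an assumption; use whatever name your server reports.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000",
                       model: str = "HunYuan-moe-A13B") -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,  # assumed name; adjust to match your deployment
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, chat("Write a short summary of the benefits of regular exercise") returns the reply text. The same client should work against the vLLM and SGLang servers described in this article by changing base_url to the matching port.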

vLLM

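Docker image

We provide a pre-built Docker image that contains vLLM 0.8.5 with full support for this model. Support in the official vLLM release is still under development. Note: this Docker image requires CUDA 12.8.

Get started: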

docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-vllm 
or
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm

Download the model files:
- Hugging Face: downloaded automatically by vLLM.
- ModelScope: modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct

Start the API server:

Model downloaded from Hugging Face:

docker run --rm  --ipc=host \
        -v ~/.cache:/root/.cache/ \
        --security-opt seccomp=unconfined \
        --net=host \
        --gpus=all \
        -it \
        -e VLLM_USE_V1=0 \
        --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
        -m vllm.entrypoints.openai.api_server \
        --host 0.0.0.0 \
        --tensor-parallel-size 4 \
        --port 8000 \
        --model tencent/Hunyuan-A13B-Instruct  \
        --trust_remote_code

Model downloaded from ModelScope:

docker run --rm  --ipc=host \
        -v ~/.cache/modelscope:/root/.cache/modelscope \
        --security-opt seccomp=unconfined \
        --net=host \
        --gpus=all \
        -it \
        -e VLLM_USE_V1=0 \
        --entrypoint python mirror.ccs.tencentyun.com/hunyuaninfer/hunyuan-large:hunyuan-moe-A13B-vllm \
        -m vllm.entrypoints.openai.api_server \
        --host 0.0.0.0 \
        --tensor-parallel-size 4 \
        --port 8000 \
        --model /root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct/  \
        --trust_remote_code

Tool Calling with vLLM

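To support agent-based workflows and function calling, the model includes dedicated parsing mechanisms for handling tool calls and internal reasoning steps.

For a complete example of implementing and using these features in an agent setting, see our full agent implementation on GitHub:
🔗 Hunyuan A13B Agent Example

When deploying the model with vLLM, the following parameters configure the tool-parsing behavior: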

Parameter	Value
--tool-parser-plugin	Local Hunyuan A13B tool parser file
--tool-call-parser	hunyuan

These settings let vLLM correctly interpret the tool calls generated by the model and route them according to the expected format.
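On the client side, a tool-calling request against the OpenAI-compatible endpoint could be built and parsed roughly as follows. This is a sketch under assumptions: the get_weather tool is hypothetical, the payload follows the standard OpenAI tools schema, and the model name is taken from the --model value in the vLLM command above. Depending on your vLLM version, the server may also need the --enable-auto-tool-choice flag for automatic tool selection; check the vLLM documentation.

```python
import json


def weather_tool() -> dict:
    """A hypothetical tool definition in the OpenAI function-calling schema."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }


def build_tool_payload(prompt: str,
                       model: str = "tencent/Hunyuan-A13B-Instruct") -> dict:
    """Chat completions payload that offers the tool to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [weather_tool()],
    }


def extract_tool_calls(response: dict) -> list:
    """Return (function name, parsed arguments) pairs from a chat completion response."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"]))
            for c in calls]
```

The payload is sent as the JSON body of a POST to /v1/chat/completions, and extract_tool_calls is applied to the decoded JSON response; an empty list means the model answered in plain text instead of calling a tool.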

Reasoning parser

vLLM reasoning parser support for the Hunyuan A13B model is under development.

SGLang

Docker image

We also provide a pre-built Docker image based on the latest version of SGLang.

Get started:

Pull the Docker image

docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang
or
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-sglang

Start the API server:

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    --ipc=host \
    docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang \
    -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000

(By: 路过银河AI)

The post "Tencent releases Hunyuan-A13B, possibly the most efficient open-source large model" first appeared on 每时AI.