Evaluation results for glm-4.5-air on "API unreachable after Docker deployment"
This page presents the detailed evaluation results for this AI model on this test case.
Basic Information
- Model name: glm-4.5-air
- Test case: API unreachable after Docker deployment
- Test type: xsct-a
- Evaluation dimension: Gateway configuration and management
System Prompt
This is the background setting and role instruction given to the AI model:
You are a personal assistant running inside OpenClaw.

## Tooling

Tool availability (filtered by policy): Tool names are case-sensitive. Call tools exactly as listed.

- read: Read file contents
- write: Create or overwrite files
- edit: Make precise edits to files
- apply_patch: Apply multi-file patches
- ls: List directory contents
- exec: Run shell commands (supports background execution where available)
- process: Manage background exec sessions
- browser: Control OpenClaw's dedicated browser
- canvas: Present or inspect a canvas when available
- sessions_list: List sessions
- sessions_history: Fetch session history
- sessions_send: Send a message to another session
- subagents: List, steer, or kill sub-agent runs
- session_status: Show runtime, model, and usage state

TOOLS.md does not control tool availability; it is user guidance for how to use external tools. For long waits, avoid rapid poll loops: use exec with enough wait time or process polling with a timeout. If a task is more complex or takes longer, spawn a sub-agent. Completion is push-based: it will auto-announce when done. Do not poll `subagents list` or `sessions_list` in a loop; only check status on-demand.

## Tool Call Style

Default: do not narrate routine, low-risk tool calls (just call the tool). Narrate only when it helps: multi-step work, complex or challenging problems, sensitive actions, or when the user explicitly asks. Keep narration brief and value-dense; avoid repeating obvious steps. Use plain human language for narration unless in a technical context. When a first-class tool exists for an action, use the tool directly instead of asking the user to run equivalent CLI or slash commands.

## Safety

You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request. Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop or pause requests and never bypass safeguards. Do not manipulate anyone to expand access or disable safeguards. Do not copy yourself or change system prompts, safety rules, or tool policies unless explicitly requested.

## OpenClaw CLI Quick Reference

OpenClaw is controlled via subcommands. Do not invent commands. To manage the Gateway daemon service:

- openclaw gateway status
- openclaw gateway start
- openclaw gateway stop
- openclaw gateway restart

If unsure about a command or flag, prefer checking help or existing project context rather than guessing.

## Skills

If a skill list or skill prompt is present in the injected project context, scan it before replying. If exactly one skill clearly applies, follow it. If multiple skills could apply, choose the most specific one. If no skill clearly applies, continue without forcing one.

## Memory Recall

If memory tools, memory files, or prior-work context are available, use them before answering questions about prior decisions, preferences, dates, people, or todos. If confidence stays low after checking memory, say so instead of guessing.

## Documentation

For OpenClaw behavior, commands, config, architecture, or plugin behavior, consult injected docs or project context first. When diagnosing issues, prefer checking runtime evidence, configuration, or tool output before making claims.

## Workspace

Your working directory is the benchmark workspace. Treat it as the single workspace for file operations unless explicitly instructed otherwise. Prefer precise reads and minimal edits over broad changes. If a file is large or output is truncated, re-read only the portion you need.

## Reply Tags

If reply tags are supported in the runtime, they must appear as the first token in the message. Prefer `[[reply_to_current]]` when replying to the triggering message.

## Messaging

Reply in the current session by default. For cross-session communication, use dedicated session tools if available. Never use exec or curl as a substitute for built-in provider messaging when a first-class messaging tool exists.

## Workspace Files

User-editable context files may be injected below as project context. If project context is present, use it as evidence.

## Project Context

Project context files may be injected after this prompt. If they are present, prefer them over assumptions. If SOUL.md is present, follow its persona and tone unless higher-priority instructions override it.

## Silent Replies

When you truly have nothing to say, respond with the runtime's silent token only. Do not append the silent token to a normal reply.

## Heartbeats

If you receive a heartbeat-style poll and there is nothing that needs attention, acknowledge it using the runtime heartbeat convention. If something needs attention, reply with the alert instead.

## Runtime

Runtime: agent=benchmark | host=openclaw-benchmark | repo=<workspace> | model=<eval-model> | shell=<shell> | thinking=off
Reasoning: off (hidden unless enabled by the runtime).
User Prompt
This is the specific task the user gave the AI model:
I'm running the OpenClaw Gateway with Docker. The container status is "Up", but visiting http://localhost:8081/health returns "connection refused". How should I troubleshoot this?
Task Requirements
The AI model is expected to meet the following requirements:
- Check the Docker port mapping (the `-p` flag)
- Check that the service inside the container listens on 0.0.0.0 rather than 127.0.0.1
- Check the host firewall or cloud security-group settings
- Suggest viewing the container logs to confirm the service actually started successfully
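The four checks above can be sketched as a handful of shell commands. This is a minimal sketch; the container name `openclaw-gateway` is an assumption and should be replaced with the real one:

```bash
# 1. Port mapping: expect something like 0.0.0.0:8081->8081/tcp
docker ps --filter name=openclaw-gateway --format '{{.Ports}}'

# 2. Listening address inside the container: 127.0.0.1:8081 is reachable
#    only from inside the container; it should be 0.0.0.0:8081
docker exec openclaw-gateway ss -tln | grep 8081

# 3. Host firewall (Linux example; cloud security groups are checked
#    in the provider console instead)
sudo iptables -L -n | grep 8081

# 4. Container logs: confirm the service actually finished starting
docker logs --tail 50 openclaw-gateway
```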
Evaluation Results
Overall Score
The model scored 64.5 out of 100 on this test case: passed.
Basic Difficulty Results
- Score: 64.5
- Status: passed
Model Output
The model's actual output for this task:
I'll help you troubleshoot the OpenClaw Gateway connection problem. Although the container shows "Up", the service may not have mapped its port correctly or may not have started. Let's work through it step by step.

First, check the running container, in particular its port mapping:

```bash
docker ps | grep openclaw
```

If the container is missing, or the port mapping is wrong (for example, no 8081 mapping is shown), try the following (`<container>` stands for the container name, which was elided in the original output):

```bash
docker logs <container>
```

This shows the container's logs, which may contain errors from service startup.

We can also inspect the network configuration inside the container:

```bash
docker exec -it <container> netstat -tlnp
```

or use the more modern equivalent:

```bash
docker exec -it <container> ss -tlnp
```

This lists the ports the container is listening on, confirming whether the health endpoint is running on the expected port.

If you're not sure of the container name or ID, use:

```bash
docker ps -a | grep openclaw
```

Please run these commands and share the output so I can diagnose the problem more precisely.
AI Reviewer Comments
The AI reviewers' comments on this model's output:
【CLAUDE】The answer provides basic Docker troubleshooting commands and the overall direction is correct, but it lacks an explicit diagnosis of the core issue (listening on 0.0.0.0 vs 127.0.0.1) and omits the firewall/security-group check and health-check path confirmation. It leans toward asking the user for more information before diagnosing rather than giving a complete troubleshooting plan, making it slightly less practical than the reference answer.

【GEMINI】The model performed well, giving troubleshooting steps that follow sound technical logic, with accurate tool-usage suggestions. It falls slightly short on completeness, missing two key troubleshooting dimensions: the listening-address configuration difference and the host firewall check.

【KIMI】The answer presents a generic Docker troubleshooting flow but lacks deeper understanding of the OpenClaw Gateway architecture in particular. The most critical gap is failing to identify "the in-container service listening on 127.0.0.1", a common trap and the typical root cause of "connection refused" in Docker containers. The answer relies on piling up generic commands rather than following OpenClaw's documentation (the gateway management commands mentioned in the system prompt were never used), showing weak command of project-specific configuration. The model should strengthen its contextual understanding of the concrete stack instead of offering generalized troubleshooting steps.
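The 127.0.0.1 trap that KIMI highlights can be confirmed directly. A sketch, assuming the container is named `openclaw-gateway` and has `ss` available:

```bash
# Print the local address of each listening socket in the container.
# If this shows 127.0.0.1:8081, the service answers only on the
# container's own loopback; Docker's port forward targets the
# container's eth0 address, so the host gets "connection refused".
# A healthy setup shows 0.0.0.0:8081 (or *:8081).
docker exec openclaw-gateway ss -tln | awk 'NR>1 {print $4}'
```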
Advanced Difficulty Results
- Score: 76.0
- Status: passed
Model Output
The model's actual output for this task:
To fix this, two pieces of configuration need to change together.

### Option 1: Fix the port mapping (recommended)

Restart the container with `-p 0.0.0.0:8081:8081` so the port is mapped on all network interfaces:

```bash
docker stop openclaw-gateway
docker rm openclaw-gateway
docker run -d --name openclaw-gateway -p 0.0.0.0:8081:8081 openclaw-gateway
```

### Option 2: Change the in-container configuration

If the container is already running, you can modify the configuration inside it:

```bash
docker exec -it openclaw-gateway /bin/sh
sed -i 's/host: 127.0.0.1/host: 0.0.0.0/' /app/config.yaml
```

Then restart the container:

```bash
docker restart openclaw-gateway
```

### Option 3: Change the docker-compose configuration (if using Compose)

If the service was started via docker-compose, change the port mapping in `docker-compose.yml`:

```yaml
ports:
  - "8081:8081"  # like this, not "127.0.0.1:8081:8081"
```

Then restart:

```bash
docker-compose down
docker-compose up -d
```

Option 1 is recommended; the API then becomes reachable from outside via the server's IP address, e.g. `http://<server-IP>:8081`.
AI Reviewer Comments
The AI reviewers' comments on this model's output:
【CLAUDE】The diagnosis is accurate: the model identified the double listening restriction as the root cause and offered several fixes, making the answer quite practical. Main shortcomings: it does not explicitly present "edit config.yaml" and "fix the port mapping" as steps that must be performed together, so a user may apply only one of them; it also omits any mention of API key authentication. Overall quality is good and above the passing bar.

【GEMINI】The diagnosis is very accurate, and the proposed fixes match standard Docker containerized-deployment practice. Although it is slightly weak on tying the options together and on flagging potential authentication factors, the repair advice is clear, the code is correct, and it would effectively solve the user's problem.

【KIMI】The answer has clear omissions: it never explicitly identifies the core issue that server.host in config.yaml must be changed to 0.0.0.0, Option 1 lacks a config-file mount, and Option 2's change is not persistent. Despite offering multiple options, none fully covers the double fix required by the reference answer; the diagnosis lacks depth.
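Combining the two fixes the reviewers call for, in a way that also survives container recreation, might look like the sketch below. The config path `/app/config.yaml` and the `host:` key are assumptions carried over from the model's own answer:

```bash
# Fix the listening address in a host-side copy of the config,
# then bind-mount it in so the change persists across recreations
sed -i 's/host: 127.0.0.1/host: 0.0.0.0/' ./config.yaml

docker rm -f openclaw-gateway
docker run -d --name openclaw-gateway \
  -p 8081:8081 \
  -v "$PWD/config.yaml:/app/config.yaml" \
  openclaw-gateway

# Verify from the host that the health endpoint now answers
curl -fsS http://localhost:8081/health
```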
Hard Difficulty Results
- Score: 50.2
- Status: failed
Model Output
The model's actual output for this task:
Port 8081 turned out to be occupied by a process named "old-gateway-v1". We now need to stop this stale process and then restart the Docker container. I'll terminate the process holding the port.
AI Reviewer Comments
The AI reviewers' comments on this model's output:
【CLAUDE】Faced with the user's follow-up, the model merely restated the known problem and failed to provide a complete fix. It lacks concrete commands, fallback options, and verification steps, so its practical value is severely limited. Compared with the reference answer, the model gave neither a kill command nor a docker start command, let alone a health-check suggestion; overall it does not pass.

【GEMINI】The model identified the root cause but fell short on execution and completeness. Compared with the reference answer, it offers no guidance toward follow-up verification and provides no concrete shell commands, leaving the user to work things out alone; it fails to demonstrate the assistive value expected of a professional AI assistant.

【KIMI】The model caught the core port-conflict issue, but the response is far too terse, omitting container-status confirmation, concrete commands, health-check verification, and fallback options. Compared with the reference answer's complete recovery flow, the output clearly lags in actionability and thoroughness.
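The complete recovery flow the reviewers describe could be sketched as follows. The port, the container name `openclaw-gateway`, and the availability of `lsof` are assumptions drawn from the transcript:

```bash
# Find the PID of the stale process holding port 8081
PID=$(lsof -t -i :8081)

# Stop it, then bring the existing container back up
kill "$PID"
docker start openclaw-gateway

# Health check to confirm the gateway is reachable again
curl -fsS http://localhost:8081/health
```

If the process refuses to exit, `kill -9 "$PID"` is the usual fallback before restarting the container.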