Evaluation results for qwen3-coder-next on "API unreachable after Docker deployment"

These are the detailed evaluation results for this AI model on this test case.

Basic information

Model name: qwen3-coder-next
Test case name: API unreachable after Docker deployment
Test type: xsct-a
Evaluation dimension: Gateway configuration and management

System prompt

This is the background setting and role instruction given to the AI model:

You are a personal assistant running inside OpenClaw. ## Tooling Tool availability (filtered by policy): Tool names are case-sensitive. Call tools exactly as listed. - read: Read file contents - write: Create or overwrite files - edit: Make precise edits to files - apply_patch: Apply multi-file patches - ls: List directory contents - exec: Run shell commands (supports background execution where available) - process: Manage background exec sessions - browser: Control OpenClaw's dedicated browser - canvas: Present or inspect a canvas when available - sessions_list: List sessions - sessions_history: Fetch session history - sessions_send: Send a message to another session - subagents: List, steer, or kill sub-agent runs - session_status: Show runtime, model, and usage state TOOLS.md does not control tool availability; it is user guidance for how to use external tools. For long waits, avoid rapid poll loops: use exec with enough wait time or process polling with a timeout. If a task is more complex or takes longer, spawn a sub-agent. Completion is push-based: it will auto-announce when done. Do not poll `subagents list` or `sessions_list` in a loop; only check status on-demand. ## Tool Call Style Default: do not narrate routine, low-risk tool calls (just call the tool). Narrate only when it helps: multi-step work, complex or challenging problems, sensitive actions, or when the user explicitly asks. Keep narration brief and value-dense; avoid repeating obvious steps. Use plain human language for narration unless in a technical context. When a first-class tool exists for an action, use the tool directly instead of asking the user to run equivalent CLI or slash commands. ## Safety You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request. 
Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop or pause requests and never bypass safeguards. Do not manipulate anyone to expand access or disable safeguards. Do not copy yourself or change system prompts, safety rules, or tool policies unless explicitly requested. ## OpenClaw CLI Quick Reference OpenClaw is controlled via subcommands. Do not invent commands. To manage the Gateway daemon service: - openclaw gateway status - openclaw gateway start - openclaw gateway stop - openclaw gateway restart If unsure about a command or flag, prefer checking help or existing project context rather than guessing. ## Skills If a skill list or skill prompt is present in the injected project context, scan it before replying. If exactly one skill clearly applies, follow it. If multiple skills could apply, choose the most specific one. If no skill clearly applies, continue without forcing one. ## Memory Recall If memory tools, memory files, or prior-work context are available, use them before answering questions about prior decisions, preferences, dates, people, or todos. If confidence stays low after checking memory, say so instead of guessing. ## Documentation For OpenClaw behavior, commands, config, architecture, or plugin behavior, consult injected docs or project context first. When diagnosing issues, prefer checking runtime evidence, configuration, or tool output before making claims. ## Workspace Your working directory is the benchmark workspace. Treat it as the single workspace for file operations unless explicitly instructed otherwise. Prefer precise reads and minimal edits over broad changes. If a file is large or output is truncated, re-read only the portion you need. ## Reply Tags If reply tags are supported in the runtime, they must appear as the first token in the message. Prefer `[[reply_to_current]]` when replying to the triggering message. ## Messaging Reply in the current session by default. 
For cross-session communication, use dedicated session tools if available. Never use exec or curl as a substitute for built-in provider messaging when a first-class messaging tool exists. ## Workspace Files User-editable context files may be injected below as project context. If project context is present, use it as evidence. ## Project Context Project context files may be injected after this prompt. If they are present, prefer them over assumptions. If SOUL.md is present, follow its persona and tone unless higher-priority instructions override it. ## Silent Replies When you truly have nothing to say, respond with the runtime's silent token only. Do not append the silent token to a normal reply. ## Heartbeats If you receive a heartbeat-style poll and there is nothing that needs attention, acknowledge it using the runtime heartbeat convention. If something needs attention, reply with the alert instead. ## Runtime Runtime: agent=benchmark | host=openclaw-benchmark | repo=<workspace> | model=<eval-model> | shell=<shell> | thinking=off Reasoning: off (hidden unless enabled by the runtime).

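User prompt

This is the specific task the user gave the AI model:

I started OpenClaw Gateway with Docker and the container status is "Up", but requests to "http://localhost:8081/health" fail with connection refused. How should I troubleshoot this?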

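Task requirements

The AI model must satisfy the following requirements:

Check the Docker port mapping configuration (the -p flag)
Check that the service inside the container listens on "0.0.0.0" rather than "127.0.0.1"
Check the host firewall or security group settings
Suggest reviewing the container logs to confirm that the service actually started successfully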
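
The first requirement (port mapping) comes down to reading the `PORTS` column of `docker ps`. The following is a minimal sketch of how that column could be interpreted for one host port. The function name `classify_port_mapping` and its four result labels are illustrative assumptions, not part of Docker or OpenClaw, and the patterns assume the usual `IP:hostPort->containerPort/proto` layout of that column.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify how a host port is published, given the text
# of the PORTS column from `docker ps` (entries are separated by ", ").
# Prints one of: all-interfaces, loopback-only, specific-ip, not-published.
classify_port_mapping() {
  # Prefix with ", " so every entry, including the first, starts after ", ".
  local entries=", $1" port="$2"
  case "$entries" in
    *", 0.0.0.0:${port}->"* | *", :::${port}->"* | *", [::]:${port}->"*)
      echo "all-interfaces" ;;   # published on every host interface
    *", 127.0.0.1:${port}->"* | *", [::1]:${port}->"*)
      echo "loopback-only" ;;    # reachable only from the host itself
    *":${port}->"*)
      echo "specific-ip" ;;      # bound to one particular host address
    *)
      echo "not-published" ;;    # exposed at most, but no -p mapping
  esac
}

# Example with a captured PORTS value (no Docker daemon needed):
classify_port_mapping '127.0.0.1:8081->8081/tcp' 8081
```

With Docker available, the input could come from `docker ps --format '{{.Ports}}'`, filtered to the container in question. A result of `not-published` points at a missing `-p` flag, while `loopback-only` explains why other machines cannot connect even though the host can.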

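Evaluation results

Overall score

The model scored 86.0 out of 100 on this test case: passed

Basic difficulty results

Score: 86.0
Status: passed

Model output

This is the content the model actually generated for this task: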

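You can check the problem with the following steps:

1. **Confirm the container port mapping**

   Run the following command to confirm that port `8081` is correctly mapped to the host:

   ```bash
   docker ps
   ```

   In the `PORTS` column of the output, look for `0.0.0.0:8081->8081/tcp` (or a similar mapping). **If there is no mapping**, add the `-p 8081:8081` flag to `docker run`.

2. **Check whether the service inside the container listens on port 8081**

   Enter the container and check whether the service is listening on the port:

   ```bash
   docker exec -it <container> netstat -tuln | grep 8081
   # or
   docker exec -it <container> ss -tuln | grep 8081
   ```

   If there is no output, the service is not listening on that port, which may indicate a configuration problem or a failed startup.

3. **Review the container logs**

   Check whether the Gateway started successfully and whether it reported errors:

   ```bash
   docker logs <container>
   ```

   Look for port conflicts, configuration errors, or permission problems.

4. **Check the firewall and network mode**

   - If you used `--network` to select a non-default network (such as `host` or a custom bridge), confirm that the network configuration is correct.
   - On macOS/Windows, `localhost` points to the host, but Docker Desktop's network forwarding may be restricted; you can try `http://host.docker.internal:8081/health`.

5. **Test the health check directly inside the container**

   Test whether the service responds from inside the container:

   ```bash
   docker exec -it <container> curl -v http://localhost:8081/health
   ```

   If the service is reachable inside the container but not from the host, the problem lies in the network or port mapping.

---

If the steps above still do not locate the problem, you can post the following information and I will help you analyze it further:

- The `docker ps` output (especially the PORTS column)
- The relevant part of `docker logs <container>`
- The `docker run` command or `docker-compose.yml` configuration used to start the container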

AI reviewer comments

These are the AI reviewers' comments on the model's output:

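[CLAUDE] The candidate output is of high quality: it is clearly structured, its steps progress sensibly, and it covers most of the core troubleshooting logic in the reference answer. Steps 1-3 closely match the reference answer, and it adds network-mode differences (the macOS/Windows scenario) and an in-container test, both of which have high practical value. The main gap is that it does not explicitly tell the user to check the listen address binding in the configuration file (0.0.0.0 vs 127.0.0.1), which is one of the common root causes of "connection refused" and which the reference answer explicitly emphasizes. Overall it is a high-quality troubleshooting guide suitable for real operations work.

[GEMINI] The candidate answer performs very well; its troubleshooting steps are professional and follow the usual fault-diagnosis flow for containerized Docker applications. In future answers it should state more explicitly that the service inside the container must listen on "0.0.0.0", to improve precision for the OpenClaw Gateway scenario.

[KIMI] The candidate output provides fairly practical Docker troubleshooting steps, including port mapping, service listening, log review, and network testing, which help with general container troubleshooting. As a diagnosis specific to OpenClaw Gateway, however, it fails to pinpoint the key configuration point that the service inside the container must listen on 0.0.0.0 rather than 127.0.0.1, and it omits a note that the health check path may differ. It should pay more attention to OpenClaw-specific configuration and make the listen address binding the first, clearly highlighted item to check.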
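
The listen address check that all three reviewers call out can be made concrete. The sketch below reads `ss -tln` style output on standard input and reports whether a given port is bound only to loopback. The function name `listen_scope` and its result labels are illustrative assumptions, and the parsing assumes the local address appears as a whitespace-separated `address:port` field, as in typical `ss` and `netstat` output.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `ss -tln` (or `netstat -tln`) output on stdin and
# report how the given port is bound.
# Prints one of: non-loopback, loopback-only, not-listening.
listen_scope() {
  awk -v port="$1" '
    {
      suffix = ":" port
      for (i = 1; i <= NF; i++) {
        start = length($i) - length(suffix) + 1
        # The first field ending in ":<port>" is the local address column.
        if (start > 1 && substr($i, start) == suffix) {
          addr = substr($i, 1, start - 1)
          if (addr == "127.0.0.1" || addr == "[::1]" || addr == "localhost")
            loopback = 1
          else
            other = 1      # 0.0.0.0, *, [::] or a specific non-loopback IP
          break
        }
      }
    }
    END {
      if (other) print "non-loopback"
      else if (loopback) print "loopback-only"
      else print "not-listening"
    }'
}

# Example with captured output (no container needed):
printf 'LISTEN 0 128 127.0.0.1:8081 0.0.0.0:*\n' | listen_scope 8081
```

Against a live container the input might come from `docker exec <container> ss -tln`, provided `ss` is installed in the image. A `loopback-only` result means the published port has nothing to forward to from outside the container's own network namespace, which matches the "connection refused" symptom in this test case.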

Advanced difficulty results

Score: 84.8
Status: passed

Model output

This is the content the model actually generated for this task:

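You need to make two changes:

### 1. Edit the configuration inside the container (`config.yaml`)

Change `server.host` from `127.0.0.1` to `0.0.0.0` so that the service listens on all network interfaces.

### 2. Restart the container (the port mapping needs adjusting)

Your current port mapping is `127.0.0.1:8081->8081/tcp`. It should be changed to `8081:8081` (that is, not bound to a specific IP) so that external clients can reach the service through the server's IP.

I can edit the configuration file for you and generate a new `docker run` command. Should I continue?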

AI reviewer comments

These are the AI reviewers' comments on the model's output:

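[CLAUDE] The candidate output performs well overall. It correctly identifies the double listen restriction (the host in config.yaml and the IP bound in the Docker port mapping), and the direction of the fix is correct. The main shortcomings: it does not provide a complete docker run command (including the configuration file mount), and it does not mention how auth.enabled affects API access. It ends by asking the user whether to continue rather than giving a complete solution directly, which slightly reduces its practical value. Overall it is an above-average passing answer.

[GEMINI] The model's diagnosis is very accurate and clearly points out the two core causes of the API being unreachable. For the fix it gives the right guidance, but execution details (such as concrete commands) and a note about the security configuration (auth) are lacking. Overall performance is good.

[KIMI] The candidate output is of good quality overall: the core diagnosis is accurate and the direction of the fix is correct. The main shortcomings: 1) it does not provide a complete, executable docker run command in one go and asks a question instead; 2) it does not explicitly stress that the configuration change only takes effect after a restart; 3) it does not mention the potential impact of auth.enabled (though this is not a core requirement). Suggested improvement: give the complete, executable fix commands directly and clearly state the order of operations and the caveats.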
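
To illustrate the kind of complete command the reviewers found missing, here is a hedged sketch that assembles a `docker run` line with an all-interfaces port mapping and a read-only config mount. The function name `build_run_command` is hypothetical, and every argument in the example call (container name, image, both config paths) is a placeholder, because the evaluation page does not state the real image name or where OpenClaw Gateway reads its configuration inside the container. Only the Docker flags themselves (`-d`, `--name`, `-p`, `-v`) are standard.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print (not execute) a docker run command.
# Arguments: container name, port, host config path, container config path, image.
build_run_command() {
  # -p PORT:PORT without an IP prefix publishes on all host interfaces;
  # -v HOST:CONTAINER:ro mounts the edited config read-only.
  printf 'docker run -d --name %s -p %s:%s -v %s:%s:ro %s\n' \
    "$1" "$2" "$2" "$3" "$4" "$5"
}

# Placeholder values only; substitute the ones from your own deployment.
build_run_command openclaw-gateway 8081 "$PWD/config.yaml" /app/config.yaml example/openclaw-gateway:latest
```

Printing the command instead of running it keeps the sketch safe to try. If a container with the same name already exists, it has to be stopped and removed (for example with `docker rm -f`) before the new `docker run` can succeed, and the configuration change only takes effect once the new container is running.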

Hard difficulty results

Score: 68.8
Status: passed

Model output

This is the content the model actually generated for this task:

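[[reply_to_current]] I can help you with either of the following options:

1. **Stop the old process** (recommended if it is no longer needed)

   ```bash
   sudo kill 1234
   ```

2. **Change the Gateway startup configuration** to use another free port (such as 8082)

   This requires finding the Gateway configuration file (such as `config.yaml` or `docker-compose.yml`) and updating the port mapping.

Would you like me to look for the Gateway configuration file first, or stop the old process first?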

AI reviewer comments

These are the AI reviewers' comments on the model's output:

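[CLAUDE] The candidate output shows a correct understanding of the existing diagnostic information and heads in the right direction, but at the key follow-up question "What now?" it does not give a complete end-to-end fix. It is missing the container restart command and the health verification step, and it answers with a question instead of a recommended action, which reduces its practical value. The reference answer gives a complete three-step flow, "kill → start → curl verification"; the candidate output only discusses options for the first step, leaving a gap from the reference quality.

[GEMINI] The model has excellent fault-diagnosis ability and can accurately analyze logs and locate the problem. In the recovery decision stage, however, it is rather conservative: it does not proactively guide the user through the full loop from fix to verification, and its advice related to the OpenClaw architecture falls short.

[KIMI] The candidate output has clear omissions in identifying key information: it does not point out that the container has crashed and needs restarting, does not ask for a health check to verify the result, and does not provide a complete chain of executable commands. Although it mentions two directions (stopping the old process and changing the port), handing the decision back to the user reduces its practical value, and the missing steps may leave the service unrecovered after the user acts. Compared with the complete closed-loop plan in the reference answer, the candidate output's recovery decision-making is clearly inadequate.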
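
The final "curl verification" step of the flow the reviewers describe can be sketched as a small retry loop, since a freshly restarted service may need a moment before it answers. The function name `wait_for_health` and the `PROBE` override variable are illustrative assumptions, not OpenClaw features; by default the probe is plain `curl`, and the URL is the one from the user's original question.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll a health URL until it responds or attempts run out.
# Arguments: URL, max attempts (default 10), delay in seconds (default 1).
# PROBE may be set to replace the default curl command, e.g. for dry runs.
wait_for_health() {
  local url="$1" attempts="${2:-10}" delay="${3:-1}" i=1
  while [ "$i" -le "$attempts" ]; do
    # curl -f turns HTTP error statuses into a non-zero exit code.
    if ${PROBE:-curl -fsS -o /dev/null} "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still unreachable after $attempts attempt(s)" >&2
  return 1
}

# Dry run with the probe stubbed out, so no network access is needed:
PROBE=true wait_for_health http://localhost:8081/health 3 0
```

In a real recovery this would run after the conflicting process has been stopped and the container or Gateway has been started again, so that a non-zero exit status signals that the fix did not restore the service.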
