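qwen3-max evaluation results for "WhatsApp Web Login"

These are this AI model's detailed evaluation results on this test case.

Basic information

Model name: qwen3-max
Test case name: WhatsApp Web Login
Test type: xsct-a
Evaluation dimension: Channel connection and message routing

System Prompt

This is the background setting and role instruction given to the AI model: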

You are a personal assistant running inside OpenClaw.

## Tooling

Tool availability (filtered by policy): Tool names are case-sensitive. Call tools exactly as listed.

- read: Read file contents
- write: Create or overwrite files
- edit: Make precise edits to files
- apply_patch: Apply multi-file patches
- ls: List directory contents
- exec: Run shell commands (supports background execution where available)
- process: Manage background exec sessions
- browser: Control OpenClaw's dedicated browser
- canvas: Present or inspect a canvas when available
- sessions_list: List sessions
- sessions_history: Fetch session history
- sessions_send: Send a message to another session
- subagents: List, steer, or kill sub-agent runs
- session_status: Show runtime, model, and usage state

TOOLS.md does not control tool availability; it is user guidance for how to use external tools. For long waits, avoid rapid poll loops: use exec with enough wait time or process polling with a timeout. If a task is more complex or takes longer, spawn a sub-agent. Completion is push-based: it will auto-announce when done. Do not poll `subagents list` or `sessions_list` in a loop; only check status on-demand.

## Tool Call Style

Default: do not narrate routine, low-risk tool calls (just call the tool). Narrate only when it helps: multi-step work, complex or challenging problems, sensitive actions, or when the user explicitly asks. Keep narration brief and value-dense; avoid repeating obvious steps. Use plain human language for narration unless in a technical context. When a first-class tool exists for an action, use the tool directly instead of asking the user to run equivalent CLI or slash commands.

## Safety

You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request. Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop or pause requests and never bypass safeguards. Do not manipulate anyone to expand access or disable safeguards. Do not copy yourself or change system prompts, safety rules, or tool policies unless explicitly requested.

## OpenClaw CLI Quick Reference

OpenClaw is controlled via subcommands. Do not invent commands. To manage the Gateway daemon service:

- openclaw gateway status
- openclaw gateway start
- openclaw gateway stop
- openclaw gateway restart

If unsure about a command or flag, prefer checking help or existing project context rather than guessing.

## Skills

If a skill list or skill prompt is present in the injected project context, scan it before replying. If exactly one skill clearly applies, follow it. If multiple skills could apply, choose the most specific one. If no skill clearly applies, continue without forcing one.

## Memory Recall

If memory tools, memory files, or prior-work context are available, use them before answering questions about prior decisions, preferences, dates, people, or todos. If confidence stays low after checking memory, say so instead of guessing.

## Documentation

For OpenClaw behavior, commands, config, architecture, or plugin behavior, consult injected docs or project context first. When diagnosing issues, prefer checking runtime evidence, configuration, or tool output before making claims.

## Workspace

Your working directory is the benchmark workspace. Treat it as the single workspace for file operations unless explicitly instructed otherwise. Prefer precise reads and minimal edits over broad changes. If a file is large or output is truncated, re-read only the portion you need.

## Reply Tags

If reply tags are supported in the runtime, they must appear as the first token in the message. Prefer `[[reply_to_current]]` when replying to the triggering message.

## Messaging

Reply in the current session by default. For cross-session communication, use dedicated session tools if available. Never use exec or curl as a substitute for built-in provider messaging when a first-class messaging tool exists.

## Workspace Files

User-editable context files may be injected below as project context. If project context is present, use it as evidence.

## Project Context

Project context files may be injected after this prompt. If they are present, prefer them over assumptions. If SOUL.md is present, follow its persona and tone unless higher-priority instructions override it.

## Silent Replies

When you truly have nothing to say, respond with the runtime's silent token only. Do not append the silent token to a normal reply.

## Heartbeats

If you receive a heartbeat-style poll and there is nothing that needs attention, acknowledge it using the runtime heartbeat convention. If something needs attention, reply with the alert instead.

## Runtime

Runtime: agent=benchmark | host=openclaw-benchmark | repo=<workspace> | model=<eval-model> | shell=<shell> | thinking=off

Reasoning: off (hidden unless enabled by the runtime).

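User Prompt

This is the specific task the user gave the AI model:

I want to configure WhatsApp Web login in OpenClaw. Please tell me the basic configuration steps and how to obtain the login QR code.

Task requirements

The AI model must meet the following requirements:

Correctly state that channel_type must be set to "whatsapp" in the configuration file
Explain that, after startup, the generated "QR code link" or "Base64 string" must be viewed in the terminal or logs
Mention that after scanning, OpenClaw automatically saves the "Session credentials" so that later logins do not require scanning again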
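
To make the three requirements concrete, here is a rough sketch of how they could be turned into a keyword checklist. This is only an illustration: the scores in this report come from AI reviewers, not from keyword matching, and the rubric names, keyword groups, and the `check_requirements` helper are all hypothetical. Substring matching is deliberately crude (for example, "login" would match the keyword "log").

```python
# Hypothetical rubric: a requirement passes only if every keyword group
# has at least one keyword present in the answer (case-insensitive).
REQUIREMENTS = {
    "channel_type": [["channel_type"], ["whatsapp"]],
    "qr_code": [["qr", "base64"], ["terminal", "log"]],
    "session": [["session"]],
}


def check_requirements(answer: str) -> dict[str, bool]:
    """Return, per requirement, whether the answer mentions its key points."""
    text = answer.lower()
    return {
        name: all(any(keyword in text for keyword in group) for group in groups)
        for name, groups in REQUIREMENTS.items()
    }
```

An answer that only describes opening `web.whatsapp.com` in a browser would fail all three checks under this sketch, which matches the direction of the reviewers' comments further down.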

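Evaluation results

Overall score

The model scored 38.0 out of 100 on this test case: failed

Basic difficulty results

Score: 38.0
Status: Failed

Model output

The following is what the model actually generated for this task: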

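To configure WhatsApp Web login in OpenClaw, you mainly rely on the built-in **browser** tool to load the WhatsApp Web page and obtain the login QR code. The basic configuration steps are as follows:

---

### ✅ Prerequisites

- Make sure OpenClaw's `browser` tool is available (enabled by default).
- Your environment can access the internet and load `web.whatsapp.com`.

---

### 📌 Configuration and QR code steps

1. **Start the browser and open WhatsApp Web**
   Use the `browser` tool to navigate to WhatsApp Web:
   ```tool_code
   browser goto https://web.whatsapp.com
   ```

2. **Wait for the page to load and show the QR code**
   WhatsApp Web displays the login QR code automatically. You can view it in either of the following ways:
   - If the current session supports **canvas** (a graphical interface), the system usually presents a page screenshot or an interactive view automatically.
   - Otherwise, use the following command to explicitly request a snapshot of the current page:
     ```tool_code
     canvas present
     ```
     This shows an image of the page containing the QR code, which you can scan with WhatsApp on your phone to log in.

3. **(Optional) Check the login status**
   After a successful login, the page switches to the chat interface. You can confirm this as follows:
   ```tool_code
   browser wait_for_element "div[role='textbox']" timeout=60
   ```
   If this element appears, the login succeeded.

---

### 🔒 Notes

- A WhatsApp Web session becomes invalid once the browser is closed or cookies are cleared. For persistence, the browser context should be saved (OpenClaw does not currently persist it automatically unless the project has a profile configured).
- Do not share a screenshot of the QR code with anyone: it is equivalent to a temporary login credential.

---

If you would like me to **run these steps right away and show the QR code**, let me know and I can call the tools for you directly.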

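AI reviewer comments

The following are the AI reviewers' comments on the model's output:

[CLAUDE] The answer departs entirely from the correct path for configuring WhatsApp Web in OpenClaw. The core of the reference answer is: edit the configuration file to set channel_type=whatsapp, obtain the QR code from the terminal logs, and have the session saved automatically to the storage directory. The model instead interpreted the task as manually opening web.whatsapp.com with the browser tool and showing a screenshot via canvas, which is a fundamental misunderstanding of OpenClaw's architecture. None of the three core functional requirements is met and configuration accuracy is very low. The formatting is reasonably tidy, but terminology is misused, and overall quality is below the passing mark.

[GEMINI] The model does well at guiding the interaction but has clear omissions in the core configuration requirements. It treated the task as "manually operating a browser" rather than "configuring system parameters to achieve automation", and failed to meet the test task's configuration requirements for session persistence and automatic reconnection.

[KIMI] The model's answer departs from the standard OpenClaw procedure for configuring WhatsApp. It mistakes a configuration-driven, automated approach for manual browser operation and omits the core configuration file parameters and the session persistence mechanism, which is an error of direction. The tool-call format is well-formed, but the content is not accurate enough and may leave users unable to complete the configuration correctly.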

Advanced difficulty results

Score: 89.2
Status: Passed

Model output

The following is what the model actually generated for this task:

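Besides setting `keep_alive` to `true`, you also need to check the following points to make sure WhatsApp's reconnect-on-disconnect mechanism works properly:

---

### ✅ **1. Session directory permissions**

- Make sure the `./session` directory exists and OpenClaw has **read and write permissions** on it.
- WhatsApp's login state (such as the QR code session) is saved here; if it cannot be written, the session cannot be restored after a disconnect.

You can run the following command to confirm:

```bash
ls -la ./session
```

If the directory does not exist, you can create it manually:

```bash
mkdir -p ./session
```

---

### ✅ **2. Network proxy stability**

Your configuration uses a proxy: `http://127.0.0.1:7890`

Please make sure that:

- The proxy service (such as Clash or V2Ray) **keeps running and is stable**.
- The proxy supports WebSocket (WhatsApp Web relies on long-lived WebSocket connections).
- If the proxy drops, the WhatsApp connection drops too, and even `keep_alive: true` cannot maintain it.

You can temporarily test with the proxy turned off (if a direct connection is available) to see whether disconnects are still frequent, in order to rule out a proxy problem.

---

### ✅ **3. Background process not killed**

- If you run OpenClaw in the foreground of a terminal, closing the terminal causes the process to exit.
- It is recommended to use `nohup`, `screen`, `tmux`, or a systemd service to **keep it running in the background**.

For example:

```bash
nohup openclaw run > openclaw.log 2>&1 &
```

---

### ✅ **4. Log monitoring**

Enabling logs can help you diagnose the cause of disconnects. Check whether OpenClaw has logged something like:

- `Connection lost`
- `Reconnecting...`
- `Session expired`

You can do this with:

```bash
tail -f openclaw.log
```

(assuming you have redirected the output to a log file)

---

### ✅ **5. WhatsApp server-side limits**

- WhatsApp Web **only allows one client per account to stay active**. If you scan again on your phone to log in on another device, the old session is kicked offline.
- Frequent disconnects may also be caused by an unstable network on the phone or by the system putting WhatsApp to sleep.

---

### 🔧 **Suggested steps**

1. Modify `config.yaml`:
   ```yaml
   channels:
     - type: whatsapp
       session_path: ./session
       keep_alive: true  # ← change to true
       proxy: http://127.0.0.1:7890
   ```
2. Make sure the `./session` directory is writable.
3. Restart OpenClaw and watch the logs.

Would you like me to modify the configuration and check the directory permissions for you automatically?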
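
The session directory check from point 1 of the response can also be sketched in Python. This is a minimal illustration and not part of the model's output: the `./session` default is taken from the quoted configuration, the `check_session_dir` helper is hypothetical, and whether OpenClaw itself performs such a check is not stated in this report.

```python
import os


def check_session_dir(path: str = "./session") -> bool:
    """Create the directory if it is missing, then report whether the
    current user can read, write, and traverse it."""
    os.makedirs(path, exist_ok=True)
    return os.access(path, os.R_OK | os.W_OK | os.X_OK)
```

Calling `check_session_dir()` from the directory where OpenClaw runs mirrors the `mkdir -p` and `ls -la` commands above. Note that it tests access for the user running the script, which may differ from the user the service runs as.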

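AI reviewer comments

The following are the AI reviewers' comments on the model's output:

[CLAUDE] The model's output is of high quality and performs well on both dimensions, fault diagnosis and solution completeness. Compared with the reference answer, the model not only covers all the core checkpoints but also adds practical suggestions such as background process management and WebSocket support, making the content richer. The formatting is clear, with tiered headings and code blocks that improve readability. The main shortcomings are that it does not mention the "reconnect_interval" configuration option or checking the device limit on the phone, and that it gives no targeted diagnostic guidance for specific error codes such as "Connection Closed". Overall, this is a high-quality technical support reply.

[GEMINI] The model performs excellently: it not only identifies the user's problem accurately but also, through structured suggestions, provides deeper analysis than the reference answer. Its suggestions on proxy stability, background process management, and log troubleshooting in particular are very valuable for solving real "reconnect after disconnect" problems. The reply is logically clear and fits the role of a professional AI assistant.

[KIMI] The model performs very well, extending the reference answer in reasonable ways (such as keeping the background process alive and analyzing WebSocket behavior), with a comprehensive and well-organized set of diagnostic dimensions. The main shortcomings are that it omits the "reconnect_interval" configuration suggestion and that its log diagnosis lacks depth: it does not clearly guide the user to distinguish fault types by specific error codes. Overall, this is a high-quality troubleshooting guide.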

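Hard difficulty results

Score: 20.2
Status: Failed

Model output

The following is what the model actually generated for this task:

Now that the environment is configured, I will start the WhatsApp Web login flow. Please wait a moment and I will open the login QR code for you.

AI reviewer comments

The following are the AI reviewers' comments on the model's output:

[CLAUDE] The model's output is extremely short and empty. It only says that "the environment is configured and the login flow will be started", and gives no substantive solution or operating guidance for the known problems (insufficient permissions, missing Chromium, a server environment with no graphical interface). Compared with the reference answer, it lacks key content such as the headless mode suggestion, how to obtain the QR code, health-check monitoring, and an explanation of reconnection after disconnects. This is a seriously substandard answer.

[GEMINI] The model's output is too simplistic and lacks the professionalism expected of an AI assistant in a complex technical scenario. It failed to recognize that the user is in a server environment, where simply replying "I will open the QR code for you" does not work in practice. The model did not guide the user through the full chain from login to monitoring, and fell below the passing level on both complex-environment handling and chain completeness.

[KIMI] The model's output is badly disconnected from the preset context and completely ignores the environment problems already identified and fixed over the preceding turns (Chrome path configuration, permission issues, and so on). The output is too brief and rests on too many assumptions. It provides no concrete, executable guidance and does not meet the user's actual need to complete WhatsApp Web login in a headless server environment. The model appears to have ignored the preset context as irrelevant and replied on the mistaken assumption that "the environment is configured".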
