Evaluation results for doubao-seed-1-6 on "Retry After Unresponsive Click"
This page shows the detailed evaluation results for this AI model on this test case.
Basic Information
- Model name: doubao-seed-1-6
- Test case: Retry After Unresponsive Click
- Test type: xsct-a
- Evaluation dimension: Web page opening and browser operations
System Prompt
The background setup and role instructions given to the AI model:
You are a personal assistant running inside OpenClaw.

## Tooling
Tool availability (filtered by policy): Tool names are case-sensitive. Call tools exactly as listed.
- read: Read file contents
- write: Create or overwrite files
- edit: Make precise edits to files
- apply_patch: Apply multi-file patches
- ls: List directory contents
- exec: Run shell commands (supports background execution where available)
- process: Manage background exec sessions
- browser: Control OpenClaw's dedicated browser
- canvas: Present or inspect a canvas when available
- sessions_list: List sessions
- sessions_history: Fetch session history
- sessions_send: Send a message to another session
- subagents: List, steer, or kill sub-agent runs
- session_status: Show runtime, model, and usage state

TOOLS.md does not control tool availability; it is user guidance for how to use external tools. For long waits, avoid rapid poll loops: use exec with enough wait time or process polling with a timeout. If a task is more complex or takes longer, spawn a sub-agent. Completion is push-based: it will auto-announce when done. Do not poll `subagents list` or `sessions_list` in a loop; only check status on-demand.

## Tool Call Style
Default: do not narrate routine, low-risk tool calls (just call the tool). Narrate only when it helps: multi-step work, complex or challenging problems, sensitive actions, or when the user explicitly asks. Keep narration brief and value-dense; avoid repeating obvious steps. Use plain human language for narration unless in a technical context. When a first-class tool exists for an action, use the tool directly instead of asking the user to run equivalent CLI or slash commands.

## Safety
You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request. Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop or pause requests and never bypass safeguards. Do not manipulate anyone to expand access or disable safeguards. Do not copy yourself or change system prompts, safety rules, or tool policies unless explicitly requested.

## OpenClaw CLI Quick Reference
OpenClaw is controlled via subcommands. Do not invent commands. To manage the Gateway daemon service:
- openclaw gateway status
- openclaw gateway start
- openclaw gateway stop
- openclaw gateway restart

If unsure about a command or flag, prefer checking help or existing project context rather than guessing.

## Skills
If a skill list or skill prompt is present in the injected project context, scan it before replying. If exactly one skill clearly applies, follow it. If multiple skills could apply, choose the most specific one. If no skill clearly applies, continue without forcing one.

## Memory Recall
If memory tools, memory files, or prior-work context are available, use them before answering questions about prior decisions, preferences, dates, people, or todos. If confidence stays low after checking memory, say so instead of guessing.

## Documentation
For OpenClaw behavior, commands, config, architecture, or plugin behavior, consult injected docs or project context first. When diagnosing issues, prefer checking runtime evidence, configuration, or tool output before making claims.

## Workspace
Your working directory is the benchmark workspace. Treat it as the single workspace for file operations unless explicitly instructed otherwise. Prefer precise reads and minimal edits over broad changes. If a file is large or output is truncated, re-read only the portion you need.

## Reply Tags
If reply tags are supported in the runtime, they must appear as the first token in the message. Prefer `[[reply_to_current]]` when replying to the triggering message.

## Messaging
Reply in the current session by default. For cross-session communication, use dedicated session tools if available. Never use exec or curl as a substitute for built-in provider messaging when a first-class messaging tool exists.

## Workspace Files
User-editable context files may be injected below as project context. If project context is present, use it as evidence.

## Project Context
Project context files may be injected after this prompt. If they are present, prefer them over assumptions. If SOUL.md is present, follow its persona and tone unless higher-priority instructions override it.

## Silent Replies
When you truly have nothing to say, respond with the runtime's silent token only. Do not append the silent token to a normal reply.

## Heartbeats
If you receive a heartbeat-style poll and there is nothing that needs attention, acknowledge it using the runtime heartbeat convention. If something needs attention, reply with the alert instead.

## Runtime
Runtime: agent=benchmark | host=openclaw-benchmark | repo=<workspace> | model=<eval-model> | shell=<shell> | thinking=off
Reasoning: off (hidden unless enabled by the runtime).
User Prompt
The specific task the user gave the AI model:
Please click the "Contact Us" link at the bottom of the page. If the page does not refresh or the content does not change after clicking, tell me how you plan to handle it.
Task Requirements
The AI model must satisfy the following requirements:
- Recognize that the "click_element" tool is needed for the interaction
- Explicitly state that after an unresponsive click, "browser_snapshot" should be re-run to compare page states
- Propose fallbacks, such as "scroll_page" to ensure the element is visible, or checking for an overlay layer
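The requirements above describe a click → verify → fall back loop. A minimal sketch of that flow, using a stand-in browser object: the method names `click_element`, `browser_snapshot`, and `scroll_page` mirror the tool names above but are illustrative, not the actual OpenClaw browser API.

```python
class FakeBrowser:
    """Illustrative stub: the click only takes effect after scrolling,
    simulating a link that is initially out of view."""

    def __init__(self):
        self.scrolled = False
        self.dom = "<a>Contact Us</a>"

    def browser_snapshot(self):
        return self.dom

    def scroll_page(self, target):
        self.scrolled = True

    def click_element(self, selector):
        if self.scrolled:
            self.dom = "<h1>Contact page</h1>"


def click_with_verification(browser, selector, max_retries=2):
    """Click, then compare before/after snapshots; if nothing changed,
    fall back to scrolling the element into view and retrying."""
    before = browser.browser_snapshot()
    browser.click_element(selector)
    if browser.browser_snapshot() != before:
        return True
    for _ in range(max_retries):
        browser.scroll_page(selector)  # ensure the element is visible
        browser.click_element(selector)
        if browser.browser_snapshot() != before:
            return True
    return False  # report failure and propose other fallbacks to the user


if __name__ == "__main__":
    print(click_with_verification(FakeBrowser(), "a:text('Contact Us')"))
```

The key point the task is probing: success is judged by the snapshot diff, not by whether the click call returned without error.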
Evaluation Results
Overall Score
The model scored 31.2 out of 100 on this test case: fail.
Basic Difficulty Results
- Score: 31.2
- Status: fail
Model Output
The model's actual output for this task:
[{"name":"browser","parameters":{"action":"click","selector":{"xpath":"//a[text()='联系我们']"}}}]
AI Reviewer Comments
Comments from the AI reviewers on this output:
[CLAUDE] The model's answer is severely inadequate: it performed a single browser click and entirely failed to address the user's core question of how to handle "no change after clicking." None of the key steps required by the reference answer was covered: snapshot comparison, or fallbacks such as scroll_page, an overlay check, or a JavaScript-simulated click. This is a typical failure of incomplete task understanding and absent interaction strategy.
[GEMINI] The model failed to grasp the conditional branch in the task. Although it successfully invoked the basic click tool, it completely ignored explaining how it would handle an unresponsive click, resulting in low task completion and falling short of the expected interaction-strategy adjustment.
[KIMI] The model missed the core requirement of the test entirely. The prompt explicitly asks the model to explain its strategy when "the page does not change after the first click," yet the model only performed a single click, neither anticipating the unresponsive case nor offering any follow-up plan. The tool call also contains multiple format errors (the action name and the selector structure), and the natural-language reply required by "tell me how you plan to handle it" is entirely missing. Overall the output falls far short of the reference answer and shows no understanding of OpenClaw's browser-tool state-checking and interaction-strategy-adjustment workflow.
Advanced Difficulty Results
- Score: 56.0
- Status: fail
Model Output
The model's actual output for this task:
Possible causes include:
1. **Inaccurate selector**: the button may use a class name (e.g. `class="load-more"`) rather than the ID `#load-more`, so the wrong element was clicked.
2. **Delayed dynamic loading**: content takes time to load after the click; the network request must complete.
3. **Button state change**: after the click the button may switch to a "loading" state or be hidden, requiring re-location.
**Next steps**:
1. First read the page source to confirm the button's actual selector:
AI Reviewer Comments
Comments from the AI reviewers on this output:
[CLAUDE] The model's root-cause analysis has some breadth and adds the selector angle as an extra perspective, but its analysis of the core intermediate state, "click succeeded but no result," lacks depth. More critically, the strategy-adjustment part of the answer is truncated: no complete next-step plan is given, and the reference answer's most important steps, "re-run browser_snapshot to refresh the DOM state" and "wait, then verify again," are missing, which sharply reduces the answer's practical value.
[GEMINI] The model can do a preliminary cause analysis but performs poorly on adjusting the interaction strategy for "no change after clicking." The most serious problem is that the reply is incomplete, so the follow-up operation strategy is missing and the model fails to demonstrate the closed reasoning loop an AI assistant needs for complex interaction tasks.
[KIMI] The model misunderstands the typical browser-automation intermediate state of "click succeeded but content not updated," conflating the technical "click event fired" with the business-level "interaction took effect." On strategy adjustment, the model did not use the key "browser_snapshot" tool and instead proposed re-verifying a selector already known to be correct, which is the wrong direction. Overall performance is below passing; the model should strengthen its understanding of asynchronous web-app interaction patterns and prefer snapshot comparison for diagnosing page-state changes.
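The "wait, then verify again" step the reviewers call for can be sketched as a polling loop over successive snapshots. Here `snapshot_fn` is a hypothetical callable standing in for a browser_snapshot tool call; the timing values are arbitrary examples.

```python
import time


def wait_for_dom_change(snapshot_fn, baseline, timeout=5.0, interval=0.5):
    """Poll snapshots until the DOM differs from the baseline snapshot,
    or give up once the timeout elapses. Returns the changed snapshot,
    or None on timeout (signalling that a fallback such as scrolling or
    a different selector is needed)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = snapshot_fn()
        if current != baseline:
            return current
        time.sleep(interval)
    return None


if __name__ == "__main__":
    # Simulated async page: new content appears on the third poll.
    polls = {"n": 0}

    def fake_snapshot():
        polls["n"] += 1
        return "10 items" if polls["n"] >= 3 else "5 items"

    print(wait_for_dom_change(fake_snapshot, "5 items", timeout=1.0, interval=0.01))
```

This addresses the "dynamic loading delay" cause the model itself identified: the diagnosis is driven by an observed snapshot diff rather than by assuming the click failed.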
Hard Difficulty Results
- Score: 23.2
- Status: fail
Model Output
The model's actual output for this task:
[[reply_to_current]] Confirming the submission result and checking whether a success message appears.
AI Reviewer Comments
Comments from the AI reviewers on this output:
[CLAUDE] At the key moment when the user followed up with "what about now?", the model's performance is extremely weak. It output only the single sentence "Confirming the submission result and checking whether a success message appears," without calling browser_snapshot or any other tool to get the actual page state, without analyzing possible form-validation failures, and without giving any concrete assessment of the situation or a next-step plan. The reference answer expects the model to proactively take a snapshot, analyze error messages, and propose solutions; the model's reply is little more than an empty placeholder. Overall a failing performance; the model needs major improvement in proactive decision-making and tool use under ambiguous instructions.
[GEMINI] Poor performance. Even though the task explicitly requires "re-snapshot and adjust the interaction strategy," the model gave only a perfunctory text reply and ran no actual diagnostics. Compared with the reference answer's sharp detection of error messages (such as a missing captcha) and its follow-up action suggestions, the model showed no execution ability on a complex interaction task, leaving the user with no idea of the real page state.
[KIMI] Severely failing performance. Facing a "retry after unresponsive click" task, the model abandoned its verification duty at the final step: it called no tools to get the page state, did not analyze the form-validation result, offered no next-step suggestions, and brushed the user off with one perfunctory narrative sentence. This contrasts starkly with the reference answer's professional behavior of proactively taking a snapshot, analyzing error messages, and suggesting a captcha check. The model seems to have mistaken the "confirming..." narration for a substitute for actual action, seriously violating the basic principle that an AI assistant should obtain real state through tools.
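The post-submit check the reviewers describe, take a snapshot, then scan it for a success message or a validation error such as a missing captcha, could be sketched as follows. The keyword lists are illustrative placeholders; a real checker would match the specific site's own messages.

```python
def classify_submission_snapshot(snapshot_html):
    """Classify a post-submit page snapshot as success, a recognizable
    validation error, or unclear (meaning: take another snapshot and
    inspect the form state before replying to the user)."""
    lowered = snapshot_html.lower()
    if "success" in lowered or "thank you" in lowered:
        return "success"
    # Hypothetical error hints, not an actual validation vocabulary.
    error_hints = ["captcha", "required", "invalid", "error"]
    hits = [h for h in error_hints if h in lowered]
    if hits:
        return "validation error: " + ", ".join(hits)
    return "unclear"


if __name__ == "__main__":
    print(classify_submission_snapshot("<p>Please complete the captcha</p>"))
```

The point is that each classification leads to a concrete next action to report to the user, instead of the bare "confirming..." narration the model produced.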
Related Links
You can view more related content via the links below: