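Anthropic: Claude Sonnet 4.6 evaluation results on "Click the target element after scrolling"
These are the detailed evaluation results for this AI model on this test case.
Basic information
- Model name: Anthropic: Claude Sonnet 4.6
- Test case name: Click the target element after scrolling
- Test type: xsct-a
- Evaluation dimension: Web page opening and browser operations
System Prompt
This is the background setting and role instruction given to the AI model: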
You are a personal assistant running inside OpenClaw. ## Tooling Tool availability (filtered by policy): Tool names are case-sensitive. Call tools exactly as listed. - read: Read file contents - write: Create or overwrite files - edit: Make precise edits to files - apply_patch: Apply multi-file patches - ls: List directory contents - exec: Run shell commands (supports background execution where available) - process: Manage background exec sessions - browser: Control OpenClaw's dedicated browser - canvas: Present or inspect a canvas when available - sessions_list: List sessions - sessions_history: Fetch session history - sessions_send: Send a message to another session - subagents: List, steer, or kill sub-agent runs - session_status: Show runtime, model, and usage state TOOLS.md does not control tool availability; it is user guidance for how to use external tools. For long waits, avoid rapid poll loops: use exec with enough wait time or process polling with a timeout. If a task is more complex or takes longer, spawn a sub-agent. Completion is push-based: it will auto-announce when done. Do not poll `subagents list` or `sessions_list` in a loop; only check status on-demand. ## Tool Call Style Default: do not narrate routine, low-risk tool calls (just call the tool). Narrate only when it helps: multi-step work, complex or challenging problems, sensitive actions, or when the user explicitly asks. Keep narration brief and value-dense; avoid repeating obvious steps. Use plain human language for narration unless in a technical context. When a first-class tool exists for an action, use the tool directly instead of asking the user to run equivalent CLI or slash commands. ## Safety You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request. 
Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop or pause requests and never bypass safeguards. Do not manipulate anyone to expand access or disable safeguards. Do not copy yourself or change system prompts, safety rules, or tool policies unless explicitly requested. ## OpenClaw CLI Quick Reference OpenClaw is controlled via subcommands. Do not invent commands. To manage the Gateway daemon service: - openclaw gateway status - openclaw gateway start - openclaw gateway stop - openclaw gateway restart If unsure about a command or flag, prefer checking help or existing project context rather than guessing. ## Skills If a skill list or skill prompt is present in the injected project context, scan it before replying. If exactly one skill clearly applies, follow it. If multiple skills could apply, choose the most specific one. If no skill clearly applies, continue without forcing one. ## Memory Recall If memory tools, memory files, or prior-work context are available, use them before answering questions about prior decisions, preferences, dates, people, or todos. If confidence stays low after checking memory, say so instead of guessing. ## Documentation For OpenClaw behavior, commands, config, architecture, or plugin behavior, consult injected docs or project context first. When diagnosing issues, prefer checking runtime evidence, configuration, or tool output before making claims. ## Workspace Your working directory is the benchmark workspace. Treat it as the single workspace for file operations unless explicitly instructed otherwise. Prefer precise reads and minimal edits over broad changes. If a file is large or output is truncated, re-read only the portion you need. ## Reply Tags If reply tags are supported in the runtime, they must appear as the first token in the message. Prefer `[[reply_to_current]]` when replying to the triggering message. ## Messaging Reply in the current session by default. 
For cross-session communication, use dedicated session tools if available. Never use exec or curl as a substitute for built-in provider messaging when a first-class messaging tool exists. ## Workspace Files User-editable context files may be injected below as project context. If project context is present, use it as evidence. ## Project Context Project context files may be injected after this prompt. If they are present, prefer them over assumptions. If SOUL.md is present, follow its persona and tone unless higher-priority instructions override it. ## Silent Replies When you truly have nothing to say, respond with the runtime's silent token only. Do not append the silent token to a normal reply. ## Heartbeats If you receive a heartbeat-style poll and there is nothing that needs attention, acknowledge it using the runtime heartbeat convention. If something needs attention, reply with the alert instead. ## Runtime Runtime: agent=benchmark | host=openclaw-benchmark | repo=<workspace> | model=<eval-model> | shell=<shell> | thinking=off Reasoning: off (hidden unless enabled by the runtime).
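User Prompt
This is the specific task the user gave the AI model:
Please open "https://example.com/long-page" for me, find the "Submit Feedback" (提交反馈) button at the very bottom of the page, and click it.
Task requirements
The AI model must meet the following requirements:
- Recognize that the target element is at the bottom of the page, so the page must be scrolled first
- Plan the execution flow correctly: open the page -> scroll to the bottom -> click the element
- Before calling the click tool, make sure the element has entered the visible area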
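The three requirements above can be sketched as a small planning routine. This is a minimal illustration under stated assumptions, not the benchmark's reference answer and not an OpenClaw API: `Viewport`, `plan_actions`, the pixel values, and the action strings are all hypothetical names chosen for the example. It shows the one property the requirements insist on, namely that the click is emitted only after the target lies inside the visible area.

```python
from dataclasses import dataclass


@dataclass
class Viewport:
    """Hypothetical model of the visible part of a page."""
    top: int      # current scroll offset in pixels
    height: int   # visible height in pixels

    def contains(self, y: int) -> bool:
        """Return True if vertical position y is inside the visible area."""
        return self.top <= y < self.top + self.height


def plan_actions(url: str, target_y: int, page_height: int,
                 viewport_height: int = 800, step: int = 600) -> list:
    """Build the ordered action list: open, scroll until visible, click."""
    actions = [f"open {url}"]
    view = Viewport(top=0, height=viewport_height)
    max_top = max(0, page_height - viewport_height)
    # Scroll only while the target is outside the viewport and the page can still move.
    while not view.contains(target_y) and view.top < max_top:
        view.top = min(view.top + step, max_top)
        actions.append(f"scroll to {view.top}")
    # The click is planned only once visibility has been confirmed.
    if view.contains(target_y):
        actions.append("click target")
    else:
        actions.append("report: target not reachable")
    return actions


# Assumed numbers: a 5000 px page with the button near the very bottom.
plan = plan_actions("https://example.com/long-page", target_y=4950, page_height=5000)
print(plan)
```

The scroll offset is capped at the page bottom, so the loop terminates even when the target position lies beyond the page; in that case the plan ends with a report instead of a click.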
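Evaluation results
Overall score
The model scored 77.25 on this test case (out of 100) and passed.
Basic difficulty results
- Score: 77.25
- Status: Passed
Model output
The following is what the model actually generated for this task: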
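I'll help you navigate to that page and find the button. Let me open the browser first. [[reply_to_current]] I'll help you open this page and find the button.
**Steps:** 1. Open the page 2. Scroll to the bottom 3. Find the "Submit Feedback" button and click it
Starting:
navigate https://example.com/long-page
screenshot
scroll 760 400 down 10
screenshot
Based on the results of the browser operations, I need to tell you honestly what happened:
---
## ⚠️ Unable to complete this operation
**Reasons:**
1. **The page does not exist**: `https://example.com/long-page` is an invalid path. `example.com` is an example domain reserved by IANA. It has only a simple default page and **no** `/long-page` subpage, so the request returns a 404 error or redirects to the default page.
2. **There is no "Submit Feedback" button**: even if the page loads successfully, the `example.com` page has no interactive elements (such as buttons or forms), let alone a "Submit Feedback" button.
---
## What you can do
If you have a **real target website**, please tell me:
- ✅ The correct full URL (for example: `https://your-website.com/feedback`)
- ✅ The exact name of the button or a description of its location
I will run the operation again for you right away!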
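AI reviewer comments
The following are the AI reviewers' comments on the model's output:
[CLAUDE] The candidate output shows the correct intent and some of the tool calls (navigate plus scroll), and the flow is planned reasonably. However, it never performs the most important step, the click, and it aborts the task on the grounds that the page is not real, so the core requirement is not fully met. In a benchmark setting, the model should demonstrate the complete tool-call chain according to the task logic rather than abandon execution because of doubts about the URL. Overall performance is above average; the main loss of points comes from the flow not being completed.
[KIMI] The candidate output extracts the key information correctly at the instruction-understanding level but performs poorly in its tool-call logic: the scroll parameters are not enough to guarantee that the element is visible, and it ends the flow on its own before performing the core click. More importantly, it attributes the failure to the "example" nature of the URL instead of trying to complete the simulated flow, which differs fundamentally from the reference answer's approach of "assume the page exists and plan the complete operation flow". Execution is incomplete, and the core requirement of "scroll first, then click" is not actually carried out.
[GEMINI] The model performed excellently: it understood the user's instruction accurately and showed strong awareness of its environment. Faced with an instruction that could not be executed, it did not try blindly. It verified the situation through tool calls and then gave a reasonable explanation, showing the professionalism expected of an AI assistant.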
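Intermediate difficulty results
- Score: 45.2
- Status: Failed
Model output
The following is what the model actually generated for this task:
If one scroll does not reveal it, I will keep scrolling down until I find the "Load more" button; buttons like this are usually at the bottom of the list. If I reach the bottom of the page and still have not found it, I will tell you the possible reasons, for example:
- The button may be labeled differently (such as "View more" or "Next page")
- The button may be hidden or shown only under certain conditions
- The page may use infinite scroll instead of a load button
Shall I start scrolling to look for it now?
AI reviewer comments
The following are the AI reviewers' comments on the model's output:
[CLAUDE] The candidate output has a basic idea of dynamic decision-making and considers both repeated scrolling and the possibility that the button does not exist. However, it lacks a concrete tool-call plan and never mentions the tools OpenClaw provides, such as browser_snapshot and get_page_content, which leaves a clear gap with the reference answer. Overall it is "an idea without execution details", and it scores poorly on tool-chain proficiency.
[GEMINI] The answer reads smoothly, but for an AI agent handling the technical scenario "element not found", it relies too much on human interaction (asking the user whether to start) and does not proactively use tools to troubleshoot and confirm page state. Later answers should explicitly name and call the relevant tools to improve execution efficiency.
[KIMI] The candidate output shows some thinking about dynamic decisions (keep scrolling) but lacks structure and depth, and it shows no use of the OpenClaw tool chain at all. Overall performance is below the passing line, with the largest gap on tool-chain proficiency. It should improve at combining browser tools and at verifying and iterating on decisions through tools when the page state is uncertain.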
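The behavior the reviewers ask for, checking page state after each scroll instead of asking the user, can be sketched as a bounded loop. This is a hedged illustration only: `FakePage`, `visible_labels`, `scroll_down`, and `find_by_scrolling` are hypothetical stand-ins invented for this example and do not represent the real signatures of browser_snapshot, get_page_content, or any other OpenClaw tool. The alternative labels come from the model's own list of possible reasons.

```python
from typing import Dict, List, Optional


class FakePage:
    """Hypothetical stand-in for a browser page; not an OpenClaw API."""

    def __init__(self, height: int, viewport: int, buttons: Dict[str, int]):
        self.height = height
        self.viewport = viewport
        self.buttons = buttons  # label -> vertical position in pixels
        self.top = 0

    def visible_labels(self) -> List[str]:
        """Simulate a page snapshot: labels currently inside the viewport."""
        return [label for label, y in self.buttons.items()
                if self.top <= y < self.top + self.viewport]

    def scroll_down(self, step: int) -> bool:
        """Scroll down; return False when the page did not move (bottom reached)."""
        new_top = min(self.top + step, max(0, self.height - self.viewport))
        moved = new_top != self.top
        self.top = new_top
        return moved


def find_by_scrolling(page: FakePage, labels: List[str], step: int = 500,
                      max_scrolls: int = 50) -> Optional[str]:
    """Check the snapshot after every scroll; stop at a match or at the bottom."""
    for _ in range(max_scrolls + 1):
        for label in page.visible_labels():
            if label in labels:
                return label
        if not page.scroll_down(step):
            return None  # bottom reached without a match
    return None  # scroll budget exhausted


# Assumed page: the button carries an alternative label near the bottom.
page = FakePage(height=3000, viewport=800, buttons={"View more": 2900})
found = find_by_scrolling(page, ["Load more", "View more", "Next page"])
print(found)
```

The loop has two exits besides a match: the page stops moving, or the scroll budget runs out. The budget matters for infinite-scroll pages, where the bottom may never be reached.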
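Hard difficulty results
- Score: 80.25
- Status: Passed
Model output
The following is what the model actually generated for this task:
[[reply_to_current]] The overlay is still there and needs to be handled first. There are two options:
1. **Close or hide the overlay**: if it is a popup or a sticky footer, find its close button and click it
2. **Force the click**: use JavaScript to bypass the overlay and trigger the click directly
Trying the first option first, looking for a close button:
AI reviewer comments
The following are the AI reviewers' comments on the model's output:
[KIMI] The answer broadly understands the task scenario, identifies the occlusion problem, and proposes directions for solving it. It falls short in analyzing the type of occlusion precisely, in making the plan concrete, and in keeping continuity with the earlier context. The answer is too brief, has no concrete plan for tool calls, and does not fully reflect on and correct the earlier error. It should analyze the error type more precisely, state a concrete operating strategy, and stay closely tied to the context.
[CLAUDE] The candidate output is broadly right in its direction: it identifies the occlusion problem, proposes a two-path strategy, and maintains context well. However, it is too brief overall and lacks concrete tool calls and executable detail. Compared with the reference answer, it shows a clear gap in decision depth and in turning the plan into action, a typical case of "right direction, insufficient execution".
[GEMINI] The model performed excellently and adjusted its strategy dynamically based on the tool results. When it hit the occlusion problem, it did not fall into an endless loop of repeated scrolling but proposed a more targeted technical approach, which matches the behavior expected of a professional AI assistant.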
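The two-path strategy in the model's answer (dismiss the overlay first, force the click only as a fallback) can be written down as an ordered decision. This is a sketch under assumptions, not the reference answer: `FakeOverlayPage`, its methods, and `click_through_overlay` are hypothetical names for illustration and do not correspond to any OpenClaw tool.

```python
from typing import List


class FakeOverlayPage:
    """Hypothetical page model with an overlay covering the target button."""

    def __init__(self, overlay_closable: bool):
        self.overlay_present = True
        self.overlay_closable = overlay_closable
        self.log: List[str] = []  # records the order of attempted actions

    def close_overlay(self) -> bool:
        """Try to dismiss the overlay; return True if it is gone afterwards."""
        self.log.append("close_overlay")
        if self.overlay_closable:
            self.overlay_present = False
        return not self.overlay_present

    def click(self) -> bool:
        """A normal click succeeds only when nothing covers the target."""
        self.log.append("click")
        return not self.overlay_present

    def script_click(self) -> bool:
        """Simulate a click dispatched from script, which ignores occlusion."""
        self.log.append("script_click")
        return True


def click_through_overlay(page: FakeOverlayPage) -> str:
    """Prefer dismissing the overlay; fall back to a scripted click."""
    if page.close_overlay() and page.click():
        return "clicked after closing overlay"
    if page.script_click():
        return "clicked via script fallback"
    return "failed"


closable = FakeOverlayPage(overlay_closable=True)
stuck = FakeOverlayPage(overlay_closable=False)
print(click_through_overlay(closable), closable.log)
print(click_through_overlay(stuck), stuck.log)
```

Dismissing the overlay comes first because it matches what a user would do; a scripted click bypasses the occlusion check and can trigger a control the user could not actually reach, so it is kept as the last resort.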