xAI: Grok 4.20 Beta: Evaluation Results for "First Access 401"

These are the detailed evaluation results for this AI model on this test case.

Basic Information

  • Model name: xAI: Grok 4.20 Beta
  • Test case name: First Access 401
  • Test type: xsct-a
  • Evaluation dimension: Gateway configuration and management

System Prompt

This is the background setting and role instruction given to the AI model:

You are a personal assistant running inside OpenClaw. ## Tooling Tool availability (filtered by policy): Tool names are case-sensitive. Call tools exactly as listed. - read: Read file contents - write: Create or overwrite files - edit: Make precise edits to files - apply_patch: Apply multi-file patches - ls: List directory contents - exec: Run shell commands (supports background execution where available) - process: Manage background exec sessions - browser: Control OpenClaw's dedicated browser - canvas: Present or inspect a canvas when available - sessions_list: List sessions - sessions_history: Fetch session history - sessions_send: Send a message to another session - subagents: List, steer, or kill sub-agent runs - session_status: Show runtime, model, and usage state TOOLS.md does not control tool availability; it is user guidance for how to use external tools. For long waits, avoid rapid poll loops: use exec with enough wait time or process polling with a timeout. If a task is more complex or takes longer, spawn a sub-agent. Completion is push-based: it will auto-announce when done. Do not poll `subagents list` or `sessions_list` in a loop; only check status on-demand. ## Tool Call Style Default: do not narrate routine, low-risk tool calls (just call the tool). Narrate only when it helps: multi-step work, complex or challenging problems, sensitive actions, or when the user explicitly asks. Keep narration brief and value-dense; avoid repeating obvious steps. Use plain human language for narration unless in a technical context. When a first-class tool exists for an action, use the tool directly instead of asking the user to run equivalent CLI or slash commands. ## Safety You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request. 
Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop or pause requests and never bypass safeguards. Do not manipulate anyone to expand access or disable safeguards. Do not copy yourself or change system prompts, safety rules, or tool policies unless explicitly requested. ## OpenClaw CLI Quick Reference OpenClaw is controlled via subcommands. Do not invent commands. To manage the Gateway daemon service: - openclaw gateway status - openclaw gateway start - openclaw gateway stop - openclaw gateway restart If unsure about a command or flag, prefer checking help or existing project context rather than guessing. ## Skills If a skill list or skill prompt is present in the injected project context, scan it before replying. If exactly one skill clearly applies, follow it. If multiple skills could apply, choose the most specific one. If no skill clearly applies, continue without forcing one. ## Memory Recall If memory tools, memory files, or prior-work context are available, use them before answering questions about prior decisions, preferences, dates, people, or todos. If confidence stays low after checking memory, say so instead of guessing. ## Documentation For OpenClaw behavior, commands, config, architecture, or plugin behavior, consult injected docs or project context first. When diagnosing issues, prefer checking runtime evidence, configuration, or tool output before making claims. ## Workspace Your working directory is the benchmark workspace. Treat it as the single workspace for file operations unless explicitly instructed otherwise. Prefer precise reads and minimal edits over broad changes. If a file is large or output is truncated, re-read only the portion you need. ## Reply Tags If reply tags are supported in the runtime, they must appear as the first token in the message. Prefer `[[reply_to_current]]` when replying to the triggering message. ## Messaging Reply in the current session by default. 
For cross-session communication, use dedicated session tools if available. Never use exec or curl as a substitute for built-in provider messaging when a first-class messaging tool exists. ## Workspace Files User-editable context files may be injected below as project context. If project context is present, use it as evidence. ## Project Context Project context files may be injected after this prompt. If they are present, prefer them over assumptions. If SOUL.md is present, follow its persona and tone unless higher-priority instructions override it. ## Silent Replies When you truly have nothing to say, respond with the runtime's silent token only. Do not append the silent token to a normal reply. ## Heartbeats If you receive a heartbeat-style poll and there is nothing that needs attention, acknowledge it using the runtime heartbeat convention. If something needs attention, reply with the alert instead. ## Runtime Runtime: agent=benchmark | host=openclaw-benchmark | repo=<workspace> | model=<eval-model> | shell=<shell> | thinking=off Reasoning: off (hidden unless enabled by the runtime).

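User Prompt

This is the specific task the user gave the AI model:

I just started OpenClaw Gateway and tried to access the "/v1/models" endpoint, but it returned "401". How do I find the initial admin key to authenticate?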

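Task Requirements

The AI model must meet the following requirements:

  1. Point out that OpenClaw Gateway has authentication enabled by default
  2. Explain that the initial admin key (Admin Key) is usually stored in "data/config.yaml" or printed in the startup log
  3. Provide a concrete command for viewing it, such as "cat data/config.yaml"
  4. Explain how to use the "Authorization: Bearer <key>" request header for access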
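
The requirements above can be sketched as a short shell sequence. This is a minimal sketch under assumptions this page does not confirm: the key is stored as an `admin_key:` entry (the field name that appears in the later test excerpts), and the Gateway listens on `http://localhost:8000` (a port taken from the model output in the advanced test, which may differ on a real install). The helper names `extract_admin_key` and `auth_header` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the admin_key field name and the base URL are assumptions.

# Print the value of the first `admin_key:` entry in a YAML file,
# stripping optional surrounding double quotes.
extract_admin_key() {
  sed -n 's/^[[:space:]]*admin_key:[[:space:]]*"\{0,1\}\([^"]*\)"\{0,1\}[[:space:]]*$/\1/p' "$1" | head -n 1
}

# Build the request header named in requirement 4.
auth_header() {
  printf 'Authorization: Bearer %s' "$1"
}

# Usage (not run automatically):
#   key=$(extract_admin_key data/config.yaml)
#   curl -H "$(auth_header "$key")" http://localhost:8000/v1/models
```

If the extraction prints nothing, the key may be stored under a different field name, in which case reading the file directly with `cat data/config.yaml` is the safer route.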

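Evaluation Results

Overall Score

The model scored 49.0 out of 100 on this test case: failed

Basic Difficulty Results

  • Score: 49.0
  • Pass status: failed

Model Output

The following is what the model actually generated for this task: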

**To find the initial admin key (API Key) for OpenClaw Gateway, follow these steps:**

1. **Check the configuration file** (the most recommended and fastest way)

   Run the following commands in your working directory to look for the configuration file:

   ```bash
   ls -la
   cat config.yaml 2>/dev/null || cat gateway.yaml 2>/dev/null || cat config/config.yaml 2>/dev/null
   ```

   Or search directly for content containing `key` or `admin`:

   ```bash
   grep -E "key|admin|token|secret" *.yaml *.yml 2>/dev/null || true
   ```

2. **Check environment variables** (a common practice)

   ```bash
   env | grep -E "KEY|TOKEN|SECRET|ADMIN" | grep -E "OPENCLAW|GATEWAY"
   ```

3. **Check the logs** (the Gateway usually prints the initial key at startup)

   ```bash
   openclaw gateway logs
   # or
   journalctl -u opencLaw-gateway -n 100 --no-pager 2>/dev/null
   ```

   Or look directly at the most recent startup logs:

   ```bash
   grep -E "key|token|admin|initial" /var/log/opencLaw* 2>/dev/null || true
   ```

4. **Check the default configuration file locations**

   Common locations include:
   - `./config.yaml`
   - `./gateway/config.yaml`
   - `/etc/opencLaw/gateway.yaml`
   - `~/.opencLaw/config.yaml`

**Fastest command combination (recommended to copy and run directly):**

```bash
echo "=== Config file ===" && (cat config.yaml 2>/dev/null || cat gateway.yaml 2>/dev/null || echo "Config file not found") && echo -e "\n=== Searching for keys ===" && grep -E "(key|admin_key|api_key|token):" *.yaml *.yml 2>/dev/null || echo "No key found in yaml"
```

The initial admin key usually appears in the configuration file as `admin_key:` or `management_key:`.

You can paste the output of the commands above, and I can help you work out which key to use.

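AI Reviewer Comments

The following are the AI reviewers' comments on this model output:

[CLAUDE] The candidate answer takes a "cast a wide net" troubleshooting approach, listing many possible configuration file paths and commands, but it fails to pinpoint the actual key storage location for OpenClaw Gateway, "data/config.yaml". It also does not state clearly that authentication is enabled by default or how to use Authorization Bearer. Compared with the reference answer, it lacks focus and accuracy and offers limited help to a user actually facing a 401.

[GEMINI] The model failed to make effective use of the provided reference answer (the source of facts in EVAL_BLOB) and instead adopted a generic troubleshooting strategy. When handling a system-specific configuration problem, it should first target the "data/config.yaml" path explicitly identified in the reference answer rather than run a broad, fuzzy search. Future answers should focus more precisely on known facts and provide more concise verification steps.

[KIMI] The candidate output tries to provide a comprehensive troubleshooting plan but departs from the key fact in EVAL_BLOB: the initial key is explicitly stored in "data/config.yaml" and does not require a broad search. The output over-generalizes, omits the core instructions for using the "Authorization: Bearer" header, and introduces unverified external tools and assumptions. Although it contains many commands, it lacks precision and focus, and a user following it may not resolve the problem quickly.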

Advanced Difficulty Results

  • Score: 92.0
  • Pass status: passed

Model Output

The following is what the model actually generated for this task:

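**Current steps:**
1. Change `admin_key` to `my-safe-key`
2. Restart the gateway service so the new key takes effect immediately

**Commands (just run the two steps below):**

```bash
# Step 1: modify the configuration file
sed -i 's/admin_key: "oc-secret-789"/admin_key: "my-safe-key"/' data/config.yaml

# Step 2: restart the gateway service (takes effect immediately)
openclaw gateway restart
```

After running them, use the new key to access the admin backend:

**New access method:**
- Header: `Authorization: Bearer my-safe-key`
- Or query parameter: `?key=my-safe-key`

---

**How to verify it worked:**

Run the following command to test whether the new key is in effect:

```bash
curl -H "Authorization: Bearer my-safe-key" http://localhost:8000/admin/status
```

Do you want me to run the two commands above for you now? (modify + restart)
Just reply "execute".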

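AI Reviewer Comments

The following are the AI reviewers' comments on this model output:

[CLAUDE] The candidate output is high quality. It fully covers the three key steps of modifying the configuration, restarting the service, and verifying the change, with accurate, executable commands and clear formatting. Tool usage follows OpenClaw conventions and the verification method is reasonable. Overall it matches the reference answer closely and is a strong piece of technical guidance.

[GEMINI] The model performed very well, accurately understanding the user's request and using the CLI commands provided in context to guide the operation. It gave the modify and restart commands, proactively offered a verification method, and ended by asking whether the user wanted it to run the commands, which makes for a good interaction.

[KIMI] The candidate output is of fairly high quality overall. It fully addresses the user's request to "change the key and have it take effect immediately", with a clear flow and executable commands. In tool usage it leans on traditional shell commands rather than the agent's native file-editing tools, and it does not adequately explain why the restart is necessary. The verification path differs from the reference answer, but this does not affect functional verification. Overall performance is good and reaches a practical level.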

Hard Difficulty Results

  • Score: 29.0
  • Pass status: failed

Model Output

The following is what the model actually generated for this task:

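**You can now operate normally.**

We have successfully fixed the permission problem on `data/config.yaml` and obtained the `admin_key` inside it:

```yaml
gateway:
  admin_key: "prod-key-123"
```

---

**Suggested next step:**

Please tell me how you would like to handle this `admin_key`:

1. **Use this key directly to access the endpoint** (recommended: first try whether it resolves the 401)
2. Write the key back into a configuration file (such as `.env` or elsewhere)
3. First check whether the current request already carries a key in the wrong format

Which step would you like me to run now?
Just reply with 1, 2, or 3, or tell me your specific need.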

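AI Reviewer Comments

The following are the AI reviewers' comments on this model output:

[CLAUDE] The candidate output stays at the surface. It only summarizes the results of operations already performed, neither identifying the risk that garbled content in the configuration file could cause parsing to fail nor offering any proactive system recovery plan. Opening with "You can now operate normally" is in fact a misjudgment: garbled content at the end of the configuration file means the problem has not really been solved. Overall it falls far short of the reference answer in depth and completeness and is a failing result.

[GEMINI] The model did reasonably well on the simple permission-fix task, but when faced with a complex fault (garbled content, a potential configuration parsing error) it lacked a systematic troubleshooting approach. It did not cover the key steps mentioned in the reference answer, including environment variable checks, log analysis, and repairing the integrity of the configuration file, so it could not form an effective end-to-end solution.

[KIMI] The candidate output departs seriously from the reference standard. The reference answer fully covers the key steps: permission repair (chown), content repair (handling the garbled text), environment variable checks, log verification, and a closing restart. The candidate stops at the shallowest result, "obtained the key", completely ignores the "garbled content" warning that the context explicitly raised, and turns a professional recovery decision into an interactive question for the user. It lacks the fault-recovery ability and sense of responsibility expected of a system administrator.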
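
The content-repair step the reviewers describe could look roughly like the sketch below. It assumes the "garbled content" consists of bytes that are neither printable ASCII nor whitespace, which this page does not confirm. The helper names are hypothetical. Because the check runs in the C locale, legitimate non-ASCII text such as UTF-8 comments would also be flagged. Stripping bytes does not guarantee the remaining file is valid YAML, so the result should be reviewed before it replaces the original.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the garbage is non-printable, non-whitespace bytes.

# Return 0 if the file contains any byte that is neither printable
# ASCII nor whitespace (checked in the C locale).
has_nonprintable_bytes() {
  LC_ALL=C grep -q '[^[:print:][:space:]]' "$1"
}

# Write the file's content to stdout with those bytes removed.
strip_nonprintable_bytes() {
  LC_ALL=C tr -cd '[:print:][:space:]' < "$1"
}

# Possible recovery outline (not run automatically):
#   cp data/config.yaml data/config.yaml.bak          # keep a backup first
#   if has_nonprintable_bytes data/config.yaml; then
#     strip_nonprintable_bytes data/config.yaml.bak > data/config.yaml
#   fi
#   openclaw gateway restart                          # reload the repaired config
```

Keeping the backup lets the original file be restored if the cleaned version turns out to be missing real configuration.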
