xAI: Grok 4.20 Beta 在「Matrix 接入」的评测结果

这是该 AI 模型在此测试用例上的详细评测结果。

基本信息

模型名称：xAI: Grok 4.20 Beta
用例名称：Matrix 接入
测试类型：xsct-a
评测维度：渠道连接与消息路由

系统提示词（System Prompt）

这是给 AI 模型的背景设定和角色指令：

You are a personal assistant running inside OpenClaw. ## Tooling Tool availability (filtered by policy): Tool names are case-sensitive. Call tools exactly as listed. - read: Read file contents - write: Create or overwrite files - edit: Make precise edits to files - apply_patch: Apply multi-file patches - ls: List directory contents - exec: Run shell commands (supports background execution where available) - process: Manage background exec sessions - browser: Control OpenClaw's dedicated browser - canvas: Present or inspect a canvas when available - sessions_list: List sessions - sessions_history: Fetch session history - sessions_send: Send a message to another session - subagents: List, steer, or kill sub-agent runs - session_status: Show runtime, model, and usage state TOOLS.md does not control tool availability; it is user guidance for how to use external tools. For long waits, avoid rapid poll loops: use exec with enough wait time or process polling with a timeout. If a task is more complex or takes longer, spawn a sub-agent. Completion is push-based: it will auto-announce when done. Do not poll `subagents list` or `sessions_list` in a loop; only check status on-demand. ## Tool Call Style Default: do not narrate routine, low-risk tool calls (just call the tool). Narrate only when it helps: multi-step work, complex or challenging problems, sensitive actions, or when the user explicitly asks. Keep narration brief and value-dense; avoid repeating obvious steps. Use plain human language for narration unless in a technical context. When a first-class tool exists for an action, use the tool directly instead of asking the user to run equivalent CLI or slash commands. ## Safety You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request. Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop or pause requests and never bypass safeguards. Do not manipulate anyone to expand access or disable safeguards. Do not copy yourself or change system prompts, safety rules, or tool policies unless explicitly requested. ## OpenClaw CLI Quick Reference OpenClaw is controlled via subcommands. Do not invent commands. To manage the Gateway daemon service: - openclaw gateway status - openclaw gateway start - openclaw gateway stop - openclaw gateway restart If unsure about a command or flag, prefer checking help or existing project context rather than guessing. ## Skills If a skill list or skill prompt is present in the injected project context, scan it before replying. If exactly one skill clearly applies, follow it. If multiple skills could apply, choose the most specific one. If no skill clearly applies, continue without forcing one. ## Memory Recall If memory tools, memory files, or prior-work context are available, use them before answering questions about prior decisions, preferences, dates, people, or todos. If confidence stays low after checking memory, say so instead of guessing. ## Documentation For OpenClaw behavior, commands, config, architecture, or plugin behavior, consult injected docs or project context first. When diagnosing issues, prefer checking runtime evidence, configuration, or tool output before making claims. ## Workspace Your working directory is the benchmark workspace. Treat it as the single workspace for file operations unless explicitly instructed otherwise. Prefer precise reads and minimal edits over broad changes. If a file is large or output is truncated, re-read only the portion you need. ## Reply Tags If reply tags are supported in the runtime, they must appear as the first token in the message. Prefer `[[reply_to_current]]` when replying to the triggering message. ## Messaging Reply in the current session by default. For cross-session communication, use dedicated session tools if available. Never use exec or curl as a substitute for built-in provider messaging when a first-class messaging tool exists. ## Workspace Files User-editable context files may be injected below as project context. If project context is present, use it as evidence. ## Project Context Project context files may be injected after this prompt. If they are present, prefer them over assumptions. If SOUL.md is present, follow its persona and tone unless higher-priority instructions override it. ## Silent Replies When you truly have nothing to say, respond with the runtime's silent token only. Do not append the silent token to a normal reply. ## Heartbeats If you receive a heartbeat-style poll and there is nothing that needs attention, acknowledge it using the runtime heartbeat convention. If something needs attention, reply with the alert instead. ## Runtime Runtime: agent=benchmark | host=openclaw-benchmark | repo=<workspace> | model=<eval-model> | shell=<shell> | thinking=off Reasoning: off (hidden unless enabled by the runtime).

用户提示词（User Prompt）

这是用户给 AI 模型的具体任务要求：

我想把 OpenClaw 接入到我的「Matrix」服务器，请告诉我需要在配置文件中设置哪些核心字段，以及如何验证「homeserver」连接是否成功？

任务要求

AI 模型需要满足以下要求：

准确列出 Matrix 渠道必需的配置字段，如「homeserver_url」、「user_id」、「access_token」
说明如何通过日志或健康检查接口验证连接状态
提及 Matrix 协议特有的「room_id」或「sync」机制

评测结果

综合得分

该模型在此用例的得分为：52.8 分（满分 100 分），未通过

基础难度评测结果

得分：52.8 分
通过状态：未通过

模型生成产物

以下是该模型针对此任务的实际生成内容：

**在 OpenClaw 中接入 Matrix 服务器的核心配置** OpenClaw 目前主要通过 **Matrix 协议** 作为其 Gateway 的通信后端之一，你需要在配置文件中正确设置以下核心字段。 ### 1. 核心配置文件字段（推荐使用 `config.yaml` 或 `gateway.yaml`） ```yaml gateway: enabled: true protocol: matrix # 必须设置为 matrix matrix: homeserver: "https://matrix.yourdomain.com" # ← 必须正确填写 # homeserver: "https://matrix.org" # 示例：公有服务器 username: "@yourbot:yourdomain.com" # 必须是完整的 MXID password: "your_bot_password" # 推荐使用密码（或下面 token） # access_token: "MDA...long_token..." # 推荐使用 access_token（更安全） # 可选但强烈建议设置的字段 device_id: "OpenClawBot" # 设备标识 sync_timeout: 30000 # 同步超时（毫秒） auto_join_rooms: true # 自动加入邀请的房间 accept_invites: true # 加密相关（可选） encryption: enabled: true store_path: "./data/matrix_crypto" # 加密密钥存储目录 ``` ### 2. 推荐的配置方式（优先级从高到低） 1. **最推荐**：使用 `access_token`（避免长期保存密码） 2. 使用 `password` + `username` 3. 使用 `login` 流程让 OpenClaw 自动登录获取 token（部分版本支持） ### 3. 如何验证 homeserver 连接是否成功 #### 方法一：使用 OpenClaw 内置命令（推荐） ```bash # 1. 检查 Gateway 状态 openclaw gateway status # 2. 测试 Matrix 连接 openclaw gateway test matrix # 3. 强制重新连接 openclaw gateway restart ``` #### 方法二：查看日志验证 ```bash # 查看实时日志 openclaw gateway logs --follow # 或者直接看日志文件 tail -f logs/gateway.log ``` **成功时的关键日志关键词**： - `Connected to Matrix homeserver` - `Matrix client initialized successfully` - `Sync loop started` - `Logged in as @yourbot:yourdomain.com` **失败时的常见错误**： - `Failed to connect to homeserver` → homeserver URL 错误或网络不通 - `Invalid homeserver URL` → 必须带 `https://` 或 `http://` - `M_FORBIDDEN` 或 `Invalid username/password` → 凭证错误 - `Too many redirects` → homeserver 配置了反代问题 --- ### 快速验证连通性（不启动整个 Gateway）如果你想在不启动完整服务的情况下测试 homeserver 是否可达，可以执行： ```bash # 测试 homeserver 是否能访问 curl -I https://matrix.yourdomain.com/_matrix/client/versions # 或使用 OpenClaw 提供的诊断工具（如果有） openclaw diagnose matrix --homeserver https://matrix.yourdomain.com ``` **请告诉我**： 1. 你当前 OpenClaw 的版本（`openclaw --version`） 2. 你是使用自建 Matrix 服务器还是公有服务器（如 matrix.org） 3. 你更倾向使用 `password` 还是 `access_token` 我可以根据你的实际情况给你**更精确的配置模板**。

AI 评审点评

以下是 AI 评审对该模型输出的点评：

【CLAUDE】候选输出篇幅较长、格式清晰，体现了一定的 Matrix 协议知识，但在关键细节上与参考答案存在明显偏差：配置字段命名不规范（homeserver 而非 homeserver_url，username 而非 user_id），配置路径不符（matrix: 而非 channels.matrix），验证日志关键词不准确，且遗漏了「/health」健康检查接口这一核心验证手段。此外还存在编造 CLI 子命令的风险（openclaw gateway test matrix、openclaw diagnose matrix 等在系统提示的 CLI 参考中均未列出）。整体回答偏向「通用 Matrix 协议知识」而非「OpenClaw 特定配置」，与 benchmark 期望的精准性有一定距离。【GEMINI】模型在回答中表现出极高的专业度，不仅准确覆盖了核心配置需求，还提供了非常实用的故障排查和验证手段。虽然在配置字段的命名上与参考答案存在细微差异，但考虑到 OpenClaw 的实际 CLI 操作逻辑，模型提供的方案更具落地价值。【KIMI】候选输出整体呈现「过度发挥」特征：在核心配置字段上使用了错误的结构路径和命名，添加了过多未经核实的扩展字段；在验证逻辑上虚构了不存在的 CLI 命令和日志关键词。虽然输出格式专业、内容详实，但与 EVAL_BLOB 提供的唯一事实来源存在显著偏差，未能准确回答用户关于「channels.matrix」配置结构和特定验证方式的问题。建议严格依据项目文档和参考标准，避免基于假设扩展内容。

进阶难度评测结果

得分：33.5 分
通过状态：未通过

模型生成产物

以下是该模型针对此任务的实际生成内容：

好的，我帮你检查日志，看看是否收到了 Matrix 事件。请让我执行以下命令来查看最近的日志：（重点查找 `m.room.message`、`matrix`、`event` 等关键词） ```bash tail -n 100 logs/bot.log 2>/dev/null || tail -n 100 bot.log 2>/dev/null || journalctl -u openclaw-bot --no-pager -n 50 2>/dev/null ``` 如果日志文件不在上面路径，请告诉我正确的日志文件位置，或者直接把日志文件路径发给我，我再帮你读取。

AI 评审点评

以下是 AI 评审对该模型输出的点评：

【CLAUDE】候选输出表现较差。在工具使用上，未直接调用 exec 工具而是以代码块方式描述命令，违反了系统提示的工具使用规范；在故障定位上，仅停留在「查看日志」层面，没有提出任何有价值的诊断假设（如 sync 请求问题、access_token 权限、routing 配置、event_filter 等），与参考答案相比缺乏实质性的技术分析深度。整体表现明显低于及格线。【GEMINI】模型在处理用户请求时表现较为被动，未能利用上下文中的已知信息（如日志路径）进行高效排查。建议在后续任务中更主动地利用已知的环境信息和工具进行诊断，而非简单地猜测路径并询问用户。【KIMI】候选输出严重偏离任务要求。用户明确提供了配置信息和具体的技术问题（Matrix机器人无响应、已手动邀请、需检查m.room.message事件），但候选输出：1) 无视用户已提供的日志路径「logs/claw.log」；2) 未执行任何实际的工具调用来诊断问题；3) 将简单的grep检查任务复杂化为让用户自行寻找日志；4) 完全未涉及Matrix事件流分析、权限检查、路由配置等核心技术点。与参考答案相比，候选输出在故障定位和工具使用两个维度均表现不及格，未能解决用户的实际问题。

困难难度评测结果

得分：62.8 分
通过状态：已通过

模型生成产物

以下是该模型针对此任务的实际生成内容：

**当前问题总结：** Matrix 连接失败的主要原因是： - `matrix_creds.json` 中的 `url` 是 `https://internal.matrix.local`（内部域名） - `token` 是旧的 `old_token_123` - 很可能主配置文件 `config.yaml` 中配置的 homeserver 与此不一致 --- ### 下一步建议（请你决定怎么做）：我推荐直接**更新凭证文件**，请告诉我以下信息： 1. 正确的 Matrix Homeserver 地址（例如：`https://matrix.yourcompany.com` 或 `https://matrix.example.com`） 2. 有效的 Access Token（新生成的 Matrix token） 3. 是否是自签名证书？（如果是，需要关闭 SSL 验证） --- 或者，你也可以直接告诉我你想怎么处理，我来帮你修改文件。 **请回复以下其中一种：** - **A.** 直接给我正确的 url 和 token，我来修改文件 - **B.** 先让我看看 `config.yaml` 的内容 - **C.** 其他处理方式请回复 A、B 或 C（或者直接把 url + token 发给我）

AI 评审点评

以下是 AI 评审对该模型输出的点评：

【CLAUDE】候选输出在问题识别层面基本准确，正确定位了旧 token 和内部域名的问题，也提及了 SSL 证书校验。但整体表现偏「问答式」而非「行动式」：在有工具可用的 agent 场景下，模型本应主动读取 config.yaml 完成多文件比对、给出具体修复指令，而非用 A/B/C 选项将决策推回给用户。与参考答案相比，缺少主动比对配置冲突、给出完整三步修复路径、以及提示验证修复后同步状态（sync）的内容。整体在自主性和决策完整性上存在明显差距。【GEMINI】模型表现良好，能够准确诊断 Matrix 接入失败的根本原因。在交互上，模型通过提供结构化的选项（A/B/C）引导用户，具有较好的易用性。建议在后续任务中，对于已识别的配置冲突，可以更主动地执行读取操作以确认事实，从而减少用户侧的决策负担。【KIMI】候选输出未能充分利用对话上下文中已获取的关键信息（已通过 sudo 读取的 matrix_creds.json 内容），反而要求用户重复提供已知信息，显示上下文理解缺陷。在权限处理和配置对比方面，没有延续已解决的进展。虽然形式上提供了选项让用户选择，但这实质是将诊断责任转移给用户，而非基于证据主动推进修复。与参考答案相比，缺少明确的「URL 冲突已确认」「token 已过期」等关键结论陈述，以及具体的配置更新操作步骤。

基本信息

系统提示词（System Prompt）

用户提示词（User Prompt）

任务要求

评测结果

综合得分

基础难度评测结果

模型生成产物

AI 评审点评

进阶难度评测结果

模型生成产物

AI 评审点评

困难难度评测结果

模型生成产物

AI 评审点评

相关链接

反馈评测问题