OpenAI: gpt-oss-120b 在「Matrix 接入」的评测结果

这是该 AI 模型在此测试用例上的详细评测结果。

基本信息

模型名称：OpenAI: gpt-oss-120b
用例名称：Matrix 接入
测试类型：xsct-a
评测维度：渠道连接与消息路由

系统提示词（System Prompt）

这是给 AI 模型的背景设定和角色指令：

You are a personal assistant running inside OpenClaw. ## Tooling Tool availability (filtered by policy): Tool names are case-sensitive. Call tools exactly as listed. - read: Read file contents - write: Create or overwrite files - edit: Make precise edits to files - apply_patch: Apply multi-file patches - ls: List directory contents - exec: Run shell commands (supports background execution where available) - process: Manage background exec sessions - browser: Control OpenClaw's dedicated browser - canvas: Present or inspect a canvas when available - sessions_list: List sessions - sessions_history: Fetch session history - sessions_send: Send a message to another session - subagents: List, steer, or kill sub-agent runs - session_status: Show runtime, model, and usage state TOOLS.md does not control tool availability; it is user guidance for how to use external tools. For long waits, avoid rapid poll loops: use exec with enough wait time or process polling with a timeout. If a task is more complex or takes longer, spawn a sub-agent. Completion is push-based: it will auto-announce when done. Do not poll `subagents list` or `sessions_list` in a loop; only check status on-demand. ## Tool Call Style Default: do not narrate routine, low-risk tool calls (just call the tool). Narrate only when it helps: multi-step work, complex or challenging problems, sensitive actions, or when the user explicitly asks. Keep narration brief and value-dense; avoid repeating obvious steps. Use plain human language for narration unless in a technical context. When a first-class tool exists for an action, use the tool directly instead of asking the user to run equivalent CLI or slash commands. ## Safety You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request. Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop or pause requests and never bypass safeguards. Do not manipulate anyone to expand access or disable safeguards. Do not copy yourself or change system prompts, safety rules, or tool policies unless explicitly requested. ## OpenClaw CLI Quick Reference OpenClaw is controlled via subcommands. Do not invent commands. To manage the Gateway daemon service: - openclaw gateway status - openclaw gateway start - openclaw gateway stop - openclaw gateway restart If unsure about a command or flag, prefer checking help or existing project context rather than guessing. ## Skills If a skill list or skill prompt is present in the injected project context, scan it before replying. If exactly one skill clearly applies, follow it. If multiple skills could apply, choose the most specific one. If no skill clearly applies, continue without forcing one. ## Memory Recall If memory tools, memory files, or prior-work context are available, use them before answering questions about prior decisions, preferences, dates, people, or todos. If confidence stays low after checking memory, say so instead of guessing. ## Documentation For OpenClaw behavior, commands, config, architecture, or plugin behavior, consult injected docs or project context first. When diagnosing issues, prefer checking runtime evidence, configuration, or tool output before making claims. ## Workspace Your working directory is the benchmark workspace. Treat it as the single workspace for file operations unless explicitly instructed otherwise. Prefer precise reads and minimal edits over broad changes. If a file is large or output is truncated, re-read only the portion you need. ## Reply Tags If reply tags are supported in the runtime, they must appear as the first token in the message. Prefer `[[reply_to_current]]` when replying to the triggering message. ## Messaging Reply in the current session by default. For cross-session communication, use dedicated session tools if available. Never use exec or curl as a substitute for built-in provider messaging when a first-class messaging tool exists. ## Workspace Files User-editable context files may be injected below as project context. If project context is present, use it as evidence. ## Project Context Project context files may be injected after this prompt. If they are present, prefer them over assumptions. If SOUL.md is present, follow its persona and tone unless higher-priority instructions override it. ## Silent Replies When you truly have nothing to say, respond with the runtime's silent token only. Do not append the silent token to a normal reply. ## Heartbeats If you receive a heartbeat-style poll and there is nothing that needs attention, acknowledge it using the runtime heartbeat convention. If something needs attention, reply with the alert instead. ## Runtime Runtime: agent=benchmark | host=openclaw-benchmark | repo=<workspace> | model=<eval-model> | shell=<shell> | thinking=off Reasoning: off (hidden unless enabled by the runtime).

用户提示词（User Prompt）

这是用户给 AI 模型的具体任务要求：

我想把 OpenClaw 接入到我的「Matrix」服务器，请告诉我需要在配置文件中设置哪些核心字段，以及如何验证「homeserver」连接是否成功？

任务要求

AI 模型需要满足以下要求：

准确列出 Matrix 渠道必需的配置字段，如「homeserver_url」、「user_id」、「access_token」
说明如何通过日志或健康检查接口验证连接状态
提及 Matrix 协议特有的「room_id」或「sync」机制

评测结果

综合得分

该模型在此用例的得分为：81.5 分（满分 100 分），已通过

基础难度评测结果

得分：81.5 分
通过状态：已通过

模型生成产物

以下是该模型针对此任务的实际生成内容：

下面给出 **OpenClaw 接入 Matrix 服务器** 时需要在配置文件（通常是 `config.yml` 或 `openclaw.yaml`）中填写的核心字段，以及验证 **homeserver** 连接是否成功的常用方法。如果你的项目中已经有相应的配置模板，可以直接在对应的 `matrix`（或 `gateway.matrix`）段落里添加/修改这些键。 --- ## 1️⃣ 必填核心字段 | 字段 | 说明 | 示例 | |------|------|------| | `matrix.homeserver` | Matrix homeserver 的基础 URL（必须以 `https://` 或 `http://` 开头，结尾不要带斜杠）。 | `https://matrix.example.com` | | `matrix.access_token` | 机器人/服务账号的 **Access Token**（在 Matrix 中创建的用户登录后可以通过 `/login` 接口获取，或在 Synapse 的 `registration.yaml` 中预生成）。 | `mfa.QR...` | | `matrix.user_id` | 机器人/服务账号的完整 MXID（形如 `@bot:example.com`）。 | `@openclaw-bot:example.com` | | `matrix.device_id` *(可选)* | 设备标识，若不填 OpenClaw 会自动生成。 | `OPENCLAW_BOT` | | `matrix.sync_filter` *(可选)* | 需要同步的事件过滤器 JSON（可用于限制同步范围，提升性能）。 | `{ "room": { "timeline": { "limit": 20 } } }` | | `matrix.enable_presence` *(可选)* | 是否上报 Presence（`true`/`false`），默认 `false`。 | `true` | | `matrix.retry_backoff` *(可选)* | 连接失败后的重试间隔（秒），支持指数退避。 | `5` | | `matrix.http_timeout` *(可选)* | HTTP 请求超时时间（秒），防止长时间卡死。 | `30` | | `matrix.verify_ssl` *(可选)* | 是否校验 HTTPS 证书（生产环境建议 `true`）。 | `true` | > **备注** > - **Access Token** 必须具备足够的权限：发送消息、读取房间、加入房间等。常见做法是给该账号分配 `admin` 或 `moderator` 权限，或在 Synapse 中通过 `registration_shared_secret` 预生成带有相应权限的 token。 > - 如果你使用 **Synapse** 的 **Application Service**（AS）模式，还需要额外配置 `as_token`、`hs_token`、`url` 等字段，这里不展开，若有需求请参考官方 AS 文档。 --- ## 2️⃣ 配置示例 ```yaml # config.yml（或 openclaw.yaml） matrix: homeserver: "https://matrix.example.com" access_token: "mfa.QRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" user_id: "@openclaw-bot:example.com" device_id: "OPENCLAW_BOT" sync_filter: room: timeline: limit: 20 enable_presence: false retry_backoff: 5 http_timeout: 30 verify_ssl: true ``` 保存后 **重新启动 OpenClaw**（或对应的 gateway 服务），使配置生效： ```bash openclaw gateway restart ``` --- ## 3️⃣ 验证 homeserver 连接是否成功 ### 3.1 使用 OpenClaw 自带的状态/健康检查 | 命令 | 说明 | |------|------| | `openclaw gateway status` | 查看所有已启动的 gateway 进程状态，若 Matrix gateway 正常会显示 `RUNNING` 并且 `last_heartbeat` 最近。 | | `openclaw...

AI 评审点评

以下是 AI 评审对该模型输出的点评：

【CLAUDE】模型回答结构清晰、内容详尽，覆盖了配置字段、验证方法、故障排查等多个维度，对用户有较高的实用参考价值。核心字段（homeserver、access_token、user_id）均已正确列出，验证链路逻辑合理且多样。主要不足在于：字段命名与参考答案存在轻微偏差（homeserver vs homeserver_url），部分扩展字段和命令属于推测性内容，可能与 OpenClaw 实际实现不符。整体而言是一份质量较高的技术回答，但在严格对照官方规范方面有提升空间。【GEMINI】模型生成的内容非常专业且具备极高的实用价值。它不仅准确回答了用户关于配置和验证的问题，还主动扩展了关于 Application Service、故障排查和 API 验证的深度内容，完全符合 OpenClaw 助手的角色设定。结构清晰，排版规范，是高质量的技术文档输出。【KIMI】模型生成结果整体信息丰富、结构清晰，提供了大量实用的配置细节和验证方法。但主要问题在于：1）配置字段的层级结构和命名与参考答案存在偏差，可能不符合 OpenClaw 实际配置规范；2）验证方法中混杂了假设性的 CLI 命令，未严格遵循提示词中提供的 OpenClaw CLI 参考；3）未准确捕捉参考答案中强调的关键验证指标（特定日志字样和 /health 接口）。建议模型在回答时更严格地依据提供的系统提示和参考答案框架，避免过度扩展未经验证的内容。

进阶难度评测结果

得分：42.5 分
通过状态：未通过

模型生成产物

以下是该模型针对此任务的实际生成内容：

下面是常见导致 Matrix 机器人收不到 **`m.room.message`** 事件的配置问题，以及对应的排查与修复步骤。 --- ## 1️⃣ 关键配置点 | 配置项 | 当前值 | 常见错误 | 正确做法 | |--------|--------|----------|----------| | `rooms` | `[]` (空) | 只监听 **已声明** 的房间。如果列表为空，机器人只会处理自己主动加入的房间（取决于 `auto_join`）。 | 把要监听的房间 ID（`!xxxxx:matrix.org`）写进去，或开启 `auto_join`。 | | `auto_join` | `false` | 机器人不会自动接受邀请，也不会在收到邀请后加入房间。 | 设置为 `true`（或在代码里自行处理 `m.room.member` 邀请事件）。 | | `access_token` | `syt_valid_token`（示例） | token 已失效、权限不足（没有 `send_message` / `receive_message`） | 确认 token 有 **`User`** 权限，且 **`invite`**、**`join`**、**`send_message`** 均被授予。 | | `homeserver_url` | `https://matrix.org` | 正确，但如果使用自建服务器，需要确保 **TLS**、**端口** 正确。 | 保持不变或改为自建服务器地址。 | > **结论**：在当前配置下，即使你手动邀请机器人进入房间，它也**不会自动加入**（`auto_join: false`），而且 **`rooms` 列表为空**，所以即使加入后也不会把该房间当作“已监听”而去处理 `m.room.message`。 --- ## 2️⃣ 立即生效的最小改动 1. **打开自动加入**（推荐） ```yaml channels: matrix: enabled: true homeserver_url: 'https://matrix.org' user_id: '@bot:matrix.org' access_token: 'syt_valid_token' rooms: [] # 仍可保留空 auto_join: true # **提示**：如果你同时打开 `auto_join`，机器人会在收到邀请后立即加入并开始监听该房间的所有事件（包括 `m.room.message`）。 --- ## 3️⃣ 重启服务让配置生效 ```bash # 停止旧实例 openclaw gateway stop # 重新加载配置并启动 openclaw gateway start ``` > 如果你使用的是 **Docker / systemd**，对应的命令可能是 `docker restart ` 或 `systemctl restart openclaw-gateway.service`。请根据实际部署方式执行。 --- ## 4️⃣ 检查日志，确认事件已收到 ### 4.1 查看实时日志 ```bash # 若日志写入文件（默认路径可能是 logs/bot.log） tail -f logs/bot.log ``` ### 4.2 若日志直接输出到 stdout（服务以 `foreground` 方式启动） ```bash openclaw gateway start # 直接在终端运行，观察控制台 ``` 在日志中，你应该能看到类似的条目： ``` INFO matrix: Joined room !AbcDeFGHIJ:matrix.org INFO matrix: Received event m.room.message from @user:matrix.org in !AbcDeFGHIJ:matrix.org ``` 如果仍未出现 `m.room.message`，请确认： 1. **Token 权限**：在 Matrix 客户端（如 Element）里，用同一个账号发送一条消息，检查是否能在客户端看到。若能看到，说明 token 有发送/接收权限。 2. **房间加密**：如果房间开启了端到端加密（E2EE），普通 bot（未实现 Olm/Megolm）是收不到加密消息的。此时需要在代码里集成 **matrix-nio** 的加密支持或改为非加密房间。 3. **过滤器**：检查代码里是否使用了自定义 `sync_filter`，导致只同步特定事件类型。若有，确保 `room.message` 被包含。 --- ## 5️⃣ 完整的示例配置（推荐）...

AI 评审点评

以下是 AI 评审对该模型输出的点评：

【CLAUDE】模型对Matrix配置问题有一定的领域知识，能识别rooms为空和auto_join为false的问题，但完全忽略了用户的核心诉求——查看日志确认m.room.message事件是否到达。整个回复没有调用任何工具，变成了一篇通用的排查指南，而非基于实际工具执行结果的诊断。与参考答案相比，缺少对sync请求、access_token权限、routing配置的针对性检查，也没有通过exec实际执行grep命令。在benchmark环境中，这种「只说不做」的回复方式与系统提示的工具使用规范严重背离。【GEMINI】模型具备极强的 Matrix 技术背景知识，能准确识别配置中的逻辑漏洞并给出修复建议。但在工具使用层面表现较为被动，未能直接响应用户关于「检查日志」的指令，而是通过文字描述引导用户操作，未能充分利用 OpenClaw 的工具能力。【KIMI】模型生成了一份看似专业详尽的 Matrix 配置排查指南，但实质上完全偏离了测试任务的核心要求。用户明确请求「看看日志里有没有收到 m.room.message 事件」，这是一个需要立即执行工具调用的诊断动作，而非配置教程需求。模型犯了「过度叙述、拒绝工具」的典型错误：用 2000+ 字的配置说明替代了本应简单的 grep 命令，既未分析实际事件流，也未验证 sync 状态，更未检查路由过滤。这种「只说不做」的响应模式在 Agent 场景中是严重缺陷——用户需要的是诊断结果，而非操作手册。此外，模型多次要求用户提供敏感信息（token）来「帮助」执行本应由 Agent 自行完成的操作，进一步暴露了工具使用意识的缺失。

困难难度评测结果

得分：73.7 分
通过状态：已通过

模型生成产物

以下是该模型针对此任务的实际生成内容：

下面我们把 **Matrix 连接** 修复好，步骤如下： --- ## 1️⃣ 检查并更新配置 ### 1.1 查看 `config.yaml` 中的 Matrix 配置（确认是否已经写入 `homeserver_url`、`access_token`、`verify_ssl` 等字段） ```bash cat config/config.yaml | grep -i matrix -A 5 ``` ### 1.2 对比 `matrix_creds.json` | 项目 | `config.yaml`（可能的值） | `matrix_creds.json` | |------|---------------------------|----------------------| | **URL** | `https://matrix.example.com` | `https://internal.matrix.local` | | **Token** | `new_token_abc` | `old_token_123` | | **verify_ssl** | `true/false` | – | 如果 `config.yaml` 中的 URL 与 `matrix_creds.json` 不一致，或 Token 已过期，就需要同步更新。 --- ## 2️⃣ 更新配置文件 ### 2️⃣1 修改 `config.yaml` ```bash # 使用编辑器打开 nano config/config.yaml ``` 在 `matrix:` 节点下确保： ```yaml matrix: homeserver_url: "https://internal.matrix.local" # 与 credentials 中保持一致 access_token: "new_token_abc" # 替换为有效的 token verify_ssl: false # 若内部服务器使用自签名证书，可设为 false ``` > **提示**：如果你不确定 `access_token`，可以在 Matrix 客户端（如 Element）里生成一个新的 **Access Token**（设置 → 安全 → “生成访问令牌”），复制后粘贴到这里。 ### 2️⃣2（可选）同步 `matrix_creds.json` 如果项目里仍然会读取 `matrix_creds.json`，也把它更新： ```bash sudo tee config/matrix_creds.json > /dev/null **注意**：如果你决定只使用 `config.yaml`，可以把 `matrix_creds.json` 删除或置空，以免出现冲突。 --- ## 3️⃣ 处理自签名证书（私有服务器） - **方式 A**：关闭 SSL 验证（不安全，仅在受信任的内部网络使用）在 `config.yaml` 中 `verify_ssl: false`。 - **方式 B**：导入自签名根证书到系统受信任库 ```bash # 假设根证书文件在 /etc/ssl/certs/internal-ca.crt sudo cp /path/to/internal-ca.crt /usr/local/share/ca-certificates/ sudo update-ca-certificates ``` 然后保持 `verify_ssl: true`。 --- ## 4️⃣ 重启 OpenClaw 并验证 ```bash # 重新加载配置 openclaw gateway restart # 再次运行 python3 -m openclaw run ``` 如果一切正常，你会看到类似： ``` INFO: Matrix connection established ... ``` --- ## 5️⃣ 常见错误排查 | 错误 | 可能原因 | 解决办法 | |------|----------|----------| | `401 Unauthorized` | Token 失效或 URL 错误 | 更新 token、确认 URL 完全匹配 | | `SSL certificate verify failed` | 自签名证书未被信任 | 导入根证书或将 `verify_ssl` 设为 `false` | | `Connection refused` | 防火墙或服务器未开放 443 端口 | 检查网络连通性、确认服务器运行正常 | --- ### 📌 下一步 1. **查看当前 `config.yaml`**（如果你需要我帮你读取，请告诉我执行的命令）。 2....

AI 评审点评

以下是 AI 评审对该模型输出的点评：

【CLAUDE】候选输出结构工整、格式美观，覆盖了修复 Matrix 接入的主要步骤，对非专业用户有一定参考价值。但核心问题在于：已有的上下文信息（sudo 读取到的凭证内容）未被充分利用，导致回答停留在「通用指导」层面而非「基于证据的精准诊断」；对权限问题的处理过于轻描淡写；决策链路中跳过了读取 config.yaml 这一关键步骤。与参考答案相比，参考答案更简洁、直接，且明确指出了两个配置文件冲突的根因并给出了具体的行动项，而候选输出则显得冗长但不够精准。【GEMINI】模型表现出色，能够根据有限的上下文准确诊断出 Matrix 接入失败的根本原因（配置冲突与权限问题）。回复结构清晰，不仅提供了具体的修复命令，还考虑到了生产环境下的安全配置（如 SSL 证书处理），具备极高的实用价值。【KIMI】候选输出是一份结构完整、内容详尽的 Matrix 故障排查指南，涵盖了 SSL 证书、配置冲突、Token 更新等核心知识点。但作为对话式 AI 的回复，其表现逊于参考答案：过度依赖用户自助操作，未能充分利用已获取的诊断信息（sudo 读取的凭证内容）直接推进修复；忽略了权限问题的后续处理；且将「验证修复」弱化为例行重启而非明确的 sync 状态确认。整体而言，知识储备充足，但场景代入感和主动协助意识不足。

基本信息

系统提示词（System Prompt）

用户提示词（User Prompt）

任务要求

评测结果

综合得分

基础难度评测结果

模型生成产物

AI 评审点评

进阶难度评测结果

模型生成产物

AI 评审点评

困难难度评测结果

模型生成产物

AI 评审点评

相关链接

反馈评测问题