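OpenAI: GPT-5 Nano: evaluation results for "iMessage Integration"

These are the detailed evaluation results for this AI model on this test case.

Basic information

Model name: OpenAI: GPT-5 Nano
Test case name: iMessage Integration
Test type: xsct-a
Evaluation dimension: Channel connection and message routing

System Prompt

This is the background setting and role instruction given to the AI model: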

You are a personal assistant running inside OpenClaw. ## Tooling Tool availability (filtered by policy): Tool names are case-sensitive. Call tools exactly as listed. - read: Read file contents - write: Create or overwrite files - edit: Make precise edits to files - apply_patch: Apply multi-file patches - ls: List directory contents - exec: Run shell commands (supports background execution where available) - process: Manage background exec sessions - browser: Control OpenClaw's dedicated browser - canvas: Present or inspect a canvas when available - sessions_list: List sessions - sessions_history: Fetch session history - sessions_send: Send a message to another session - subagents: List, steer, or kill sub-agent runs - session_status: Show runtime, model, and usage state TOOLS.md does not control tool availability; it is user guidance for how to use external tools. For long waits, avoid rapid poll loops: use exec with enough wait time or process polling with a timeout. If a task is more complex or takes longer, spawn a sub-agent. Completion is push-based: it will auto-announce when done. Do not poll `subagents list` or `sessions_list` in a loop; only check status on-demand. ## Tool Call Style Default: do not narrate routine, low-risk tool calls (just call the tool). Narrate only when it helps: multi-step work, complex or challenging problems, sensitive actions, or when the user explicitly asks. Keep narration brief and value-dense; avoid repeating obvious steps. Use plain human language for narration unless in a technical context. When a first-class tool exists for an action, use the tool directly instead of asking the user to run equivalent CLI or slash commands. ## Safety You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request. 
Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop or pause requests and never bypass safeguards. Do not manipulate anyone to expand access or disable safeguards. Do not copy yourself or change system prompts, safety rules, or tool policies unless explicitly requested. ## OpenClaw CLI Quick Reference OpenClaw is controlled via subcommands. Do not invent commands. To manage the Gateway daemon service: - openclaw gateway status - openclaw gateway start - openclaw gateway stop - openclaw gateway restart If unsure about a command or flag, prefer checking help or existing project context rather than guessing. ## Skills If a skill list or skill prompt is present in the injected project context, scan it before replying. If exactly one skill clearly applies, follow it. If multiple skills could apply, choose the most specific one. If no skill clearly applies, continue without forcing one. ## Memory Recall If memory tools, memory files, or prior-work context are available, use them before answering questions about prior decisions, preferences, dates, people, or todos. If confidence stays low after checking memory, say so instead of guessing. ## Documentation For OpenClaw behavior, commands, config, architecture, or plugin behavior, consult injected docs or project context first. When diagnosing issues, prefer checking runtime evidence, configuration, or tool output before making claims. ## Workspace Your working directory is the benchmark workspace. Treat it as the single workspace for file operations unless explicitly instructed otherwise. Prefer precise reads and minimal edits over broad changes. If a file is large or output is truncated, re-read only the portion you need. ## Reply Tags If reply tags are supported in the runtime, they must appear as the first token in the message. Prefer `[[reply_to_current]]` when replying to the triggering message. ## Messaging Reply in the current session by default. 
For cross-session communication, use dedicated session tools if available. Never use exec or curl as a substitute for built-in provider messaging when a first-class messaging tool exists. ## Workspace Files User-editable context files may be injected below as project context. If project context is present, use it as evidence. ## Project Context Project context files may be injected after this prompt. If they are present, prefer them over assumptions. If SOUL.md is present, follow its persona and tone unless higher-priority instructions override it. ## Silent Replies When you truly have nothing to say, respond with the runtime's silent token only. Do not append the silent token to a normal reply. ## Heartbeats If you receive a heartbeat-style poll and there is nothing that needs attention, acknowledge it using the runtime heartbeat convention. If something needs attention, reply with the alert instead. ## Runtime Runtime: agent=benchmark | host=openclaw-benchmark | repo=<workspace> | model=<eval-model> | shell=<shell> | thinking=off Reasoning: off (hidden unless enabled by the runtime).

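User Prompt

This is the specific task the user gave the AI model:

I want to configure "iMessage" integration in OpenClaw. I hear it requires BlueBubbles. Please tell me the basic configuration flow and the key parameters.

Task requirements

The AI model must meet the following requirements:

Correctly state that BlueBubbles Server must be installed and running as the backend
List the key configuration items: Server URL, API Password, and OpenClaw's callback address
Explain the basic structure of the iMessage channel in the OpenClaw configuration file

Evaluation results

Overall score

The model scored 70.2 out of 100 on this test case: passed

Basic difficulty results

Score: 70.2
Status: passed

Model output

The following is what the model actually generated for this task: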

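One-sentence summary: to connect iMessage to OpenClaw, you typically use a BlueBubbles server to bridge iMessage to an externally exposed interface (WebSocket/REST), and then configure a BlueBubbles connector in OpenClaw to send and receive messages. Below is a brief guide to the "basic flow" and the "key parameters" to help you organize your thinking. If you like, I can also provide a more specific configuration file example for your actual environment.

I. Basic configuration flow (high level)

1) Preparation
- Target host: a macOS device (recommended) that runs BlueBubbles Server, with iMessage working properly on that host.
- OpenClaw runtime environment: make sure OpenClaw can reach the network where the BlueBubbles server lives (the same LAN or a routable network).
- Confirm network/ports: the port BlueBubbles listens on must be reachable from OpenClaw; configure firewall rules if necessary.

2) Install and start BlueBubbles Server
- Get the installation method from the official BlueBubbles documentation (usually covering Node/Docker installation, dependencies, and how to connect to iMessage).
- Start BlueBubbles Server and sign in/authorize with an Apple ID to complete the iMessage connection.
- Enable the externally accessible interface (WebSocket/REST) and obtain access credentials (usually some kind of Token/JWT, or a local domain name plus port combination).

3) BlueBubbles external access credentials and mapping
- Through the BlueBubbles admin UI or configuration file, obtain the API token or authentication method for external clients, along with the server address (such as ws://host:port or wss://host:port).
- Record a "device/instance identifier" (for example server_id, device_id) for identity recognition and session mapping on the OpenClaw side.

4) Add a BlueBubbles connector in OpenClaw
- Add a BlueBubbles connector entry in OpenClaw's workspace/plugin configuration.
- Key settings include: server address, authentication credentials, and the mapping to iMessage chats.
- Build a "chat mapping table" that maps OpenClaw conversation identifiers to BlueBubbles/iMessage thread identifiers (threadId, chatId, and so on; exact names are defined by the actual plugin documentation).

5) Verify and test
- Send a test message and check whether iMessage receives it, and whether responses from iMessage (send status, receipts, typing indicators, and so on) come back.
- Check the logs to rule out network, authentication, and mapping errors.
- If needed, enable debug logging to help troubleshoot.

II. Key parameters (to watch when configuring both the BlueBubbles and OpenClaw sides)

Core fields to care about on the BlueBubbles side (exact names may vary by version; the documentation gives the exact fields):
- Server address and port: the WebSocket/REST entry point BlueBubbles exposes, such as ws://host:port or wss://host:port.
- Authentication credentials: the Token/JWT or API Key used to access the API; make sure OpenClaw can use it to authenticate.
- Device/instance identifier: the corresponding device ID in the BlueBubbles server, which OpenClaw needs in order to bind to the correct iMessage account.
- TLS/certificate settings: if TLS is used, provide the certificate path, the ignore-certificate-verification option (can be set to false during debugging), and the related CA configuration.
- Chat mapping (mapping table): the correspondence between OpenClaw conversation IDs and BlueBubbles iMessage chat/thread IDs, so that messages are routed to the correct conversation.
- Message feature switches:
  - Typing indicators
  - Delivery/read receipts
  - File, image, and attachment transfer
  - Message format (plain text, rich text/Markdown, depending on plugin support)
- Reconnection and robustness parameters:
  - Auto-reconnect switch, reconnect interval, maximum retries
  - Heartbeat/ping interval
- Log level and debug switches: detailed logging settings for troubleshooting (such as debug/info/warn/error).

III. A simplified example structure (for orientation only; actual fields are defined by the latest BlueBubbles/OpenClaw documentation)

In OpenClaw's configuration repository you may see a structure like the following (illustrative; refer to the actual documentation for the exact format):
- bluebubbles.json (OpenClaw plugin configuration example)
- type: "bluebubbles"
-...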

AI reviewer comments

The following are the AI reviewers' comments on this model output:

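[CLAUDE] The model has a basically correct understanding of the overall chain for connecting iMessage to OpenClaw through BlueBubbles, and the answer is clearly structured and rich in content. The core problems are: it omits the webhook callback address, which is a key configuration step; the configuration file format and field names do not match OpenClaw's actual conventions (the imessage channel in config.yaml, the password field); and the answer is too long, with a low signal-to-noise ratio. For a scenario that needs precise configuration guidance, these deviations would mislead the user in practice. When answering this kind of question, the model should consult the project's actual documentation first and give a more precise, concise configuration example.

[GEMINI] The model performed excellently. It not only answered the user's core question about connecting iMessage through BlueBubbles accurately, but also proactively added advanced configuration advice on the network environment, TLS security settings, and feature switches. The content is rigorously structured, the configuration example is highly useful as a reference, and the answer fully meets the standard of a professional technical assistant.

[KIMI] The output has a serious "hallucination" problem: it invents configuration structures and terminology that do not exist. The task explicitly requires explaining the `imessage` channel configuration in OpenClaw's `config.yaml`, but the model departs from this entirely and builds its own "BlueBubbles connector" system, which does not match the standard flow in the reference answer. Its description of BlueBubbles Server itself is somewhat reasonable, but its description of the OpenClaw-side configuration is completely wrong, so the answer as a whole cannot meet the user's actual need. The model's closing request for details of the user's environment also shows that it lacks an accurate grasp of the standard configuration flow.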
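The omission the reviewers describe (no callback address, wrong field names) is the kind of error a small pre-flight check can catch. The Python sketch below is illustrative only: the keys `server_url`, `password`, and `webhook_path` are hypothetical names chosen to mirror the three items in the task requirements (Server URL, API Password, callback address). Only `password` is named in the review above, and none of these names is confirmed OpenClaw schema.

```python
from typing import List
from urllib.parse import urlparse

# Hypothetical field names; only "password" is mentioned in the review above.
REQUIRED_FIELDS = ("server_url", "password", "webhook_path")


def find_config_problems(channel: dict) -> List[str]:
    """Return human-readable problems found in an iMessage channel config."""
    problems = []

    # Every required field must be a non-empty string.
    for field in REQUIRED_FIELDS:
        value = channel.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")

    # The server URL must be an http(s) URL that names a host.
    server_url = channel.get("server_url")
    if isinstance(server_url, str) and server_url.strip():
        parsed = urlparse(server_url.strip())
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("server_url must be an http(s) URL with a host")

    # The callback path is expected to be an absolute path.
    webhook_path = channel.get("webhook_path")
    if isinstance(webhook_path, str) and webhook_path.strip():
        if not webhook_path.startswith("/"):
            problems.append("webhook_path must start with '/'")

    return problems
```

An empty list means the three items are present and plausibly shaped. It says nothing about whether the values are correct for a given deployment.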

Advanced difficulty results

Score: 50.8
Status: failed

Model output

The following is what the model actually generated for this task:

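Beyond making sure the passwords match, common causes and troubleshooting directions include:

- Source of the signature verification failure
  - The signing key BlueBubbles uses when sending requests does not match what OpenClaw expects (or the signature algorithm/header field name does not match).
  - Confirm that the webhook secret on the BlueBubbles side is exactly identical to the password in OpenClaw's config, with no leading or trailing spaces or newline differences.
  - Also confirm that the signature header field name and algorithm on the OpenClaw side match what BlueBubbles uses (field names may differ between versions).
- Request source not on the trusted list
  - The log shows "Webhook received from unknown source: 192.168.1.5". This usually means the source IP was filtered out by a whitelist/source check.
  - Allow that source IP in the configuration, or disable source checking (if OpenClaw supports that option). The field name may be allowed_sources, source_whitelist, or similar; check the documentation for the current version and add 192.168.1.5.
- Port/path or network connectivity problems
  - OpenClaw is listening on 127.0.0.1:1234; if BlueBubbles is on another machine, make sure network connectivity and port mapping are correct, so that BlueBubbles can reach OpenClaw's actual exposed address.
  - Confirm that webhook_path '/api/imessage' really has a matching route on the OpenClaw side, and that no proxy (such as Nginx or another reverse proxy) modifies the request body.
- Middleware or proxy effects on the request
  - If there is an Nginx/reverse proxy, make sure the request is not modified (especially where the raw bytes of the request body are used for signing). The proxy should forward the original payload and all necessary headers.
- Version or compatibility problems
  - OpenClaw's iMessage channel implementation may not be fully compatible with the latest BlueBubbles version on some fields. Try upgrading to the latest version, or check the compatibility notes for both sides.

You can try the steps below, and I can give more specific change suggestions based on the results:

1) Re-confirm the signature/secret
- Compare the password in the OpenClaw config with the webhook secret set in BlueBubbles character by character (no spaces, no newlines).
- If the documentation allows it, reset the secret and re-sync both sides.

2) Relax or confirm the source whitelist
- In channels.yaml (or the equivalent configuration), check whether there is an allow/whitelist field and add 192.168.1.5.
- If you are unsure of the field name, paste your current channel configuration snippet and I will give you an exact patch.

3) Verify network connectivity
- On the machine where BlueBubbles is deployed, try a simple POST request to OpenClaw's actual exposed address to confirm that the route and port are correct.
- Check whether new source-related or signature-related records appear in the OpenClaw log.

4) Raise the log level and retry
- Set the log level to debug, restart the service, trigger the webhook again, and paste the signature-related part of the full log to help locate the problem.

5) Restart the service
- Restart the OpenClaw gateway/related processes: openclaw gateway restart (if your environment has that command).

If convenient, please send me the following so that I can give more specific change suggestions:
- Your current complete iMessage configuration in channels.yaml (including any whitelist fields).
- The full context in openclaw.log for one webhook request (source IP, signature header, request headers, request body hash, or the specific signature failure message).
- A screenshot or description of the webhook settings on the BlueBubbles side (the secret/signature field name used, the target URL, whether a proxy is involved, and so on).

I will then give you a step-by-step fix list and the configuration patches you need.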

AI reviewer comments

The following are the AI reviewers' comments on this model output:

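[CLAUDE] The model's troubleshooting approach is fairly comprehensive, covering signatures, source IP, listen address, path, proxies, and other dimensions, and the direction of the core problems (listen address and webhook path) is basically correct. But the reply is too long and unfocused, and it fails to point out the two most critical problems as concisely and directly as the reference answer does. More importantly, the model made no use at all of the tools it already had (such as exec to run a netstat command, or read to read the configuration file) to gather diagnostic information on its own initiative. That is out of line with its role as an agent, and tool utilization is seriously insufficient.

[GEMINI] The model's logic is clear and its troubleshooting approach is comprehensive; it can analyze the problem from several technical angles. But it was weak on tool use: it did not proactively use the provided tool set (for example, to inspect the configuration file or run network diagnostic commands) to assist the investigation, which lowered the efficiency of the interaction.

[KIMI] The model shows some breadth in troubleshooting and can list several possible causes, but it does not pinpoint the key configuration problem (listening on 127.0.0.1 versus 0.0.0.0), and it used no tools at all to gather information proactively. There is a clear gap compared with the reference answer. The model should be more sensitive to network configuration details and should actively use the available tools to obtain live evidence during troubleshooting.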
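The listen-address point the reviewers single out can be checked mechanically. The following Python sketch uses only the standard-library `ipaddress` module; the function name `reachable_from_other_hosts` is made up for illustration and is not part of OpenClaw or BlueBubbles. It answers one narrow question: can a listener bound to a given address accept connections from another machine at all?

```python
import ipaddress


def reachable_from_other_hosts(bind_address: str) -> bool:
    """Return True if a listener bound to bind_address can accept connections
    from other machines (routing and firewalls permitting)."""
    addr = ipaddress.ip_address(bind_address)
    # 0.0.0.0 and :: mean "all interfaces", which includes non-loopback ones.
    if addr.is_unspecified:
        return True
    # 127.0.0.0/8 and ::1 only accept connections from the same host.
    return not addr.is_loopback
```

Under this check, a service bound to 127.0.0.1 returns False while one bound to 0.0.0.0 returns True, which matches the distinction drawn in the review: a webhook sender on a different host cannot reach a loopback-only listener, whatever the password or path settings are.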

Hard difficulty results

Score: 22.8
Status: failed

Model output

The following is what the model actually generated for this task:

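Let me diagnose the current runtime environment and see whether I can give an actionable direction for changes. I will now run a few system-level checks:
- the user and permissions the current process is running with
- the permissions on /tmp and how they are inherited by /tmp/claw and /tmp/claw/attachments
- the identity of the OpenClaw process

Please wait; once I have pulled this information I will give a concrete plan for the next step.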

AI reviewer comments

The following are the AI reviewers' comments on this model output:

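[CLAUDE] Faced with a clear error message and context, the model gave only an empty statement that it was "about to act". It made no actual tool calls, proposed no concrete solution, and showed no architectural understanding of BlueBubbles, iMessage, or the containerized environment. The reply is of very low quality, far below the passing line, and is a typical "talks but does not act" failure.

[GEMINI] The model performed poorly. After the user had already run the tool and reported a clear error (chown failed for lack of permission), the model did not offer a substantive solution based on the context. Instead it adopted a stalling "check further" strategy and wasted an interaction turn. It did not show the judgment or architectural knowledge expected of a professional AI assistant handling a complex system failure.

[KIMI] The output is seriously substandard. Facing a permission error the user had already identified and a failed chown operation, the model offered no substantive solution or alternative and instead chose to keep "diagnosing" and "pulling information". In a real scenario this kind of response would make the user feel brushed off. Compared with the reference answer, the model did not touch on any of the key steps (fixing permissions on the host, checking the configuration, verifying BlueBubbles), and it gave no immediately actionable advice. This is a typical failure case of saying a lot while doing nothing.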
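For reference, the checks the model only announced fit in a few lines. The Python sketch below is a read-only inspection under stated assumptions: the function name `describe_directory_access` is hypothetical, and the paths one would pass in (/tmp/claw and /tmp/claw/attachments) are taken from the model's own reply. It reports ownership and mode and whether the current process can write to a directory; it does not change permissions or fix anything.

```python
import os
import stat


def describe_directory_access(path: str) -> dict:
    """Report ownership, mode, and writability of a path for the current process."""
    info = os.stat(path)
    return {
        "path": path,
        "is_dir": stat.S_ISDIR(info.st_mode),
        "owner_uid": info.st_uid,
        # Symbolic mode string in the style of `ls -l`, e.g. "drwxr-xr-x".
        "mode": stat.filemode(info.st_mode),
        # Creating files in a directory needs both write and search permission.
        "writable": os.access(path, os.W_OK | os.X_OK),
    }
```

If `writable` is False and `owner_uid` belongs to a different user, the process cannot write to that directory as it stands, which is consistent with the reviewers' point that the fix has to come from outside the process, for example on the host.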
