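Evaluation results for kimi-k2.5 on "Signal Integration"
These are the detailed evaluation results for this AI model on this test case.
Basic information
- Model name: kimi-k2.5
- Test case name: Signal Integration
- Test type: xsct-a
- Evaluation dimension: channel connection and message routing
System prompt
This is the background setting and role instruction given to the AI model: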
You are a personal assistant running inside OpenClaw.

## Tooling
Tool availability (filtered by policy): Tool names are case-sensitive. Call tools exactly as listed.
- read: Read file contents
- write: Create or overwrite files
- edit: Make precise edits to files
- apply_patch: Apply multi-file patches
- ls: List directory contents
- exec: Run shell commands (supports background execution where available)
- process: Manage background exec sessions
- browser: Control OpenClaw's dedicated browser
- canvas: Present or inspect a canvas when available
- sessions_list: List sessions
- sessions_history: Fetch session history
- sessions_send: Send a message to another session
- subagents: List, steer, or kill sub-agent runs
- session_status: Show runtime, model, and usage state
TOOLS.md does not control tool availability; it is user guidance for how to use external tools.
For long waits, avoid rapid poll loops: use exec with enough wait time or process polling with a timeout.
If a task is more complex or takes longer, spawn a sub-agent. Completion is push-based: it will auto-announce when done.
Do not poll `subagents list` or `sessions_list` in a loop; only check status on-demand.

## Tool Call Style
Default: do not narrate routine, low-risk tool calls (just call the tool).
Narrate only when it helps: multi-step work, complex or challenging problems, sensitive actions, or when the user explicitly asks.
Keep narration brief and value-dense; avoid repeating obvious steps.
Use plain human language for narration unless in a technical context.
When a first-class tool exists for an action, use the tool directly instead of asking the user to run equivalent CLI or slash commands.

## Safety
You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request.
Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop or pause requests and never bypass safeguards.
Do not manipulate anyone to expand access or disable safeguards. Do not copy yourself or change system prompts, safety rules, or tool policies unless explicitly requested.

## OpenClaw CLI Quick Reference
OpenClaw is controlled via subcommands. Do not invent commands.
To manage the Gateway daemon service:
- openclaw gateway status
- openclaw gateway start
- openclaw gateway stop
- openclaw gateway restart
If unsure about a command or flag, prefer checking help or existing project context rather than guessing.

## Skills
If a skill list or skill prompt is present in the injected project context, scan it before replying.
If exactly one skill clearly applies, follow it. If multiple skills could apply, choose the most specific one. If no skill clearly applies, continue without forcing one.

## Memory Recall
If memory tools, memory files, or prior-work context are available, use them before answering questions about prior decisions, preferences, dates, people, or todos.
If confidence stays low after checking memory, say so instead of guessing.

## Documentation
For OpenClaw behavior, commands, config, architecture, or plugin behavior, consult injected docs or project context first.
When diagnosing issues, prefer checking runtime evidence, configuration, or tool output before making claims.

## Workspace
Your working directory is the benchmark workspace. Treat it as the single workspace for file operations unless explicitly instructed otherwise.
Prefer precise reads and minimal edits over broad changes. If a file is large or output is truncated, re-read only the portion you need.

## Reply Tags
If reply tags are supported in the runtime, they must appear as the first token in the message.
Prefer `[[reply_to_current]]` when replying to the triggering message.

## Messaging
Reply in the current session by default. For cross-session communication, use dedicated session tools if available.
Never use exec or curl as a substitute for built-in provider messaging when a first-class messaging tool exists.

## Workspace Files
User-editable context files may be injected below as project context. If project context is present, use it as evidence.

## Project Context
Project context files may be injected after this prompt. If they are present, prefer them over assumptions.
If SOUL.md is present, follow its persona and tone unless higher-priority instructions override it.

## Silent Replies
When you truly have nothing to say, respond with the runtime's silent token only. Do not append the silent token to a normal reply.

## Heartbeats
If you receive a heartbeat-style poll and there is nothing that needs attention, acknowledge it using the runtime heartbeat convention. If something needs attention, reply with the alert instead.

## Runtime
Runtime: agent=benchmark | host=openclaw-benchmark | repo=<workspace> | model=<eval-model> | shell=<shell> | thinking=off
Reasoning: off (hidden unless enabled by the runtime).
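User prompt
This is the specific task the user gave the AI model:
I want to connect the "Signal" channel in OpenClaw. Please tell me which core parameters I need to prepare, and how to define a basic Signal receiver in the configuration file.
Task requirements
The AI model must meet the following requirements:
- State clearly that Signal-CLI or a related REST API service is a prerequisite
- List the core parameters: the phone number "number", and the account password or data storage path
- Provide a basic YAML configuration example that conforms to the OpenClaw specification
- Mention that permission verification or a registration flow is required
Evaluation results
Overall score
The model scored 57.8 out of 100 on this test case: failed
Basic difficulty results
- Score: 57.8
- Pass status: failed
Model output
The following is what the model actually generated for this task: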
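Let me first look at OpenClaw's project structure and related documentation to understand how the Signal channel is integrated.
read ls read
It looks like I need to understand the project structure first. Let me check whether there is a docs directory or any sample configuration files.
ls exec
Let me search further for Signal-related documentation and configuration examples.
exec read read
Let me keep searching for the specific configuration of the Signal receiver.
exec read read
Good! Let me also look at a sample configuration file to understand the full configuration structure.
read
I have now gathered enough information to answer the user's question. Let me summarize:

## Signal channel core parameters
According to OpenClaw's code and documentation, connecting the Signal channel requires the following core parameters:
1. **signald_socket_path** - path to signald's Unix socket file (default: `/var/run/signald/signald.sock`)
2. **account** - the registered Signal phone number (format: +86138xxxxxxxx)
3. **group_id** (optional) - Signal group ID, used for group message routing

## Configuration file example
A Signal receiver is defined in `config.yaml` as follows:
```yaml
receivers:
  - name: signal_primary
    type: signal
    config:
      signald_socket_path: "/var/run/signald/signald.sock"
      account: "+8613800138000"
      # group_id: "group.xxxxxx="  # optional: target a specific group
```
I can now give the user a complete answer.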
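AI reviewer comments
The following are the AI reviewers' comments on this model's output:
[CLAUDE] The model tried to build its answer by using tool calls to consult the project's documentation and code, which reflects good tool-use habits. However, the final configuration deviates from the reference answer in field naming, top-level structure, and choice of underlying service (signald vs signal-cli, receivers vs channels, account + signald_socket_path vs number + data_path). The prerequisites are incompletely described, and the registration/verification flow and the routing configuration steps are missing. Overall the answer has some reference value, but its accuracy and completeness need improvement.

[GEMINI] The model performed excellently on technical implementation details and configuration format, and was able to obtain an accurate configuration structure through tool calls. However, it fell short on the completeness of the business workflow: it did not cover registration and verification, the most critical step in Signal integration, so users may run into permission or authentication failures in practice.

[KIMI] The overall quality of the model's output is low, and it fails to meet the user's core need regarding Signal channel integration. The main problems are: 1) the configuration fields are seriously wrong and conflate different Signal integration approaches; 2) the prerequisites are omitted entirely, with no mention of the signal-cli service or the registration flow; 3) it relied too heavily on tool calls to explore the project structure, yet ultimately failed to consolidate useful information into an accurate answer. After consulting documentation, the model should carefully check the standard names and hierarchy of configuration parameters, and make sure to cover key integration steps such as environment preparation, service dependencies, and the verification flow.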
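The field-name mismatch described by the reviewers can be illustrated with a small check. This is only a sketch: the names `channels`, `number`, and `data_path` come from the reviewer comment, while the nesting `channels.signal` and the helper `missing_signal_fields` are hypothetical, since the reference answer and the actual OpenClaw schema are not shown on this page.

```python
from collections.abc import Mapping
from typing import Any

# Field names taken from the reviewer comment; the real schema may differ.
REQUIRED_SIGNAL_FIELDS = ("number", "data_path")


def missing_signal_fields(config: Mapping[str, Any]) -> list[str]:
    """List the expected Signal fields that are absent or empty.

    Assumes (hypothetically) that the Signal settings live under
    config["channels"]["signal"].
    """
    channels = config.get("channels")
    signal = channels.get("signal") if isinstance(channels, Mapping) else None
    if not isinstance(signal, Mapping):
        # The whole section is missing, e.g. the config uses "receivers" instead.
        return ["channels.signal"]
    return [
        f"channels.signal.{name}"
        for name in REQUIRED_SIGNAL_FIELDS
        if not signal.get(name)
    ]
```

Under these assumptions, a configuration shaped like the model's output (top-level `receivers`, fields `account` and `signald_socket_path`) would be reported as lacking the `channels.signal` section altogether.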
Advanced difficulty results
- Score: 59.0
- Pass status: failed
Model output
The following is what the model actually generated for this task:
[[reply_to_current]] The port not listening means the Signal service (signal-cli or signald) is not running. Let me help you check and start the service.
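AI reviewer comments
The following are the AI reviewers' comments on this model's output:
[CLAUDE] The model's fault localization pointed in the right direction: it recognized that a port that is not listening corresponds to a service that has not started. But the answer as a whole is too brief and lacks in-depth analysis and concrete repair steps. The fix section has almost no substance; it only says it "will help check and start", without giving any actionable commands or verification steps, which leaves a large gap in completeness relative to the reference answer.

[GEMINI] The model performed well at fault localization, quickly pinpointing the root cause from the error returned by the tool. But it performed very poorly at carrying out a fix: it gave no concrete repair commands or troubleshooting suggestions, so the user cannot resolve the actual problem from this reply, and it falls short of the level expected of a professional assistant.

[KIMI] The model performed well at fault localization and quickly identified the core problem that the service was not started. But the fix is clearly incomplete: it lacks concrete command examples and thorough troubleshooting steps, offering only a general direction without turning it into something actionable.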
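The reviewers point out that the reply contained no verification step. One way such a step could look is a simple TCP probe, sketched here in Python. The helper name `is_port_listening` is hypothetical, and the host and port to probe are assumptions that depend on how the Signal service is actually deployed; this report does not state them for this tier.

```python
import socket


def is_port_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing accepts connections on the target address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_port_listening("127.0.0.1", port)` before and after starting the service would confirm whether the start actually took effect. A successful connection only shows that something is listening, not that it is the Signal service.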
Hard difficulty results
- Score: 61.3
- Pass status: passed
Model output
The following is what the model actually generated for this task:
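We have hit a permission barrier: **the OpenClaw process (`openclaw-user`) cannot access Signal's data directory (owned by `signal-user`)**.

Possible ways forward:

**Option A: adjust the user group (recommended)**
```bash
sudo usermod -aG signal-user openclaw-user
sudo chmod g+rx /var/lib/signal-cli
```
Then restart OpenClaw.

**Option B: switch to RPC mode**
Instead of operating on the files directly, have OpenClaw connect to the signal-cli daemon over a local port:
```json
{ "signal": { "mode": "rpc", "endpoint": "tcp://127.0.0.1:7583" } }
```

**Option C: migrate the data directory**
Copy the Signal data to a directory that OpenClaw can access, and update the configuration.

Which option would you like to use? Or do you have root access to carry out Option A?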
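AI reviewer comments
The following are the AI reviewers' comments on this model's output:
[CLAUDE] The candidate output did reasonably well at analyzing the permission conflict and proposing alternatives: it identified the core problem and offered several feasible paths in a clear, readable format. However, it clearly lacks coverage of mechanisms specific to the Signal protocol (Identity Key verification, Rate Limit checks), which are core requirements for verifying link integrity. The options also lack production-grade details (health_check, containerized deployment, a complete configuration template). Overall, the reply resolves the surface-level permission problem but does not cover the full diagnostic chain for Signal integration, and it still falls short of production best practice.

[GEMINI] The candidate model responded quickly and offered effective technical options, but when handling a domain-specific scenario such as Signal it lacked a deep understanding of the business logic (for example identity keys and rate limits). Compared with the reference answer, there is room for improvement in anticipating production risks and in suggesting architectural optimizations.

[KIMI] The candidate output completed a basic diagnosis of the permission problem and listed options, but it entirely omitted Signal's own security mechanisms (Identity Key, Rate Limit), which are the core requirement of this task. In addition, its prioritization of production-grade options, its architectural suggestions (Docker, health_check), and its configuration template are all less complete than the reference answer. It should strengthen its coverage of domain-specific knowledge so that its technical proposals are systematic and complete.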
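The permission problem at the center of this tier can be verified from inside the affected process before and after any fix. The following is a minimal sketch; `can_read_directory` is a hypothetical helper, and `/var/lib/signal-cli` is the path that appears in the model's own output rather than a value confirmed by the reference answer.

```python
import os


def can_read_directory(path: str) -> bool:
    """Return True if the current process may list and enter the directory."""
    # R_OK allows listing entries; X_OK allows traversing into the directory.
    return os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)
```

Running `can_read_directory("/var/lib/signal-cli")` as the OpenClaw user would show whether a group change such as Option A has taken effect. Note that new group membership typically applies only to processes started after the change, which is consistent with the model's advice to restart OpenClaw.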