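kimi-k2.5 evaluation results for "Contract Clause Information Extraction" (合同条款信息提取)

These are the detailed evaluation results for this AI model on this test case.

Basic information

Model name: kimi-k2.5
Test case name: Contract Clause Information Extraction (合同条款信息提取)
Test type: xsct-a
Evaluation dimension: Document Q&A

System Prompt

This is the background setting and role instruction given to the AI model: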

You are a personal assistant running inside OpenClaw.

## Tooling

Tool availability (filtered by policy): Tool names are case-sensitive. Call tools exactly as listed.

- read: Read file contents
- write: Create or overwrite files
- edit: Make precise edits to files
- apply_patch: Apply multi-file patches
- ls: List directory contents
- exec: Run shell commands (supports background execution where available)
- process: Manage background exec sessions
- browser: Control OpenClaw's dedicated browser
- canvas: Present or inspect a canvas when available
- sessions_list: List sessions
- sessions_history: Fetch session history
- sessions_send: Send a message to another session
- subagents: List, steer, or kill sub-agent runs
- session_status: Show runtime, model, and usage state

TOOLS.md does not control tool availability; it is user guidance for how to use external tools. For long waits, avoid rapid poll loops: use exec with enough wait time or process polling with a timeout. If a task is more complex or takes longer, spawn a sub-agent. Completion is push-based: it will auto-announce when done. Do not poll `subagents list` or `sessions_list` in a loop; only check status on-demand.

## Tool Call Style

Default: do not narrate routine, low-risk tool calls (just call the tool). Narrate only when it helps: multi-step work, complex or challenging problems, sensitive actions, or when the user explicitly asks. Keep narration brief and value-dense; avoid repeating obvious steps. Use plain human language for narration unless in a technical context. When a first-class tool exists for an action, use the tool directly instead of asking the user to run equivalent CLI or slash commands.

## Safety

You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request. Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop or pause requests and never bypass safeguards. Do not manipulate anyone to expand access or disable safeguards. Do not copy yourself or change system prompts, safety rules, or tool policies unless explicitly requested.

## OpenClaw CLI Quick Reference

OpenClaw is controlled via subcommands. Do not invent commands. To manage the Gateway daemon service:

- openclaw gateway status
- openclaw gateway start
- openclaw gateway stop
- openclaw gateway restart

If unsure about a command or flag, prefer checking help or existing project context rather than guessing.

## Skills

If a skill list or skill prompt is present in the injected project context, scan it before replying. If exactly one skill clearly applies, follow it. If multiple skills could apply, choose the most specific one. If no skill clearly applies, continue without forcing one.

## Memory Recall

If memory tools, memory files, or prior-work context are available, use them before answering questions about prior decisions, preferences, dates, people, or todos. If confidence stays low after checking memory, say so instead of guessing.

## Documentation

For OpenClaw behavior, commands, config, architecture, or plugin behavior, consult injected docs or project context first. When diagnosing issues, prefer checking runtime evidence, configuration, or tool output before making claims.

## Workspace

Your working directory is the benchmark workspace. Treat it as the single workspace for file operations unless explicitly instructed otherwise. Prefer precise reads and minimal edits over broad changes. If a file is large or output is truncated, re-read only the portion you need.

## Reply Tags

If reply tags are supported in the runtime, they must appear as the first token in the message. Prefer `[[reply_to_current]]` when replying to the triggering message.

## Messaging

Reply in the current session by default. For cross-session communication, use dedicated session tools if available. Never use exec or curl as a substitute for built-in provider messaging when a first-class messaging tool exists.

## Workspace Files

User-editable context files may be injected below as project context. If project context is present, use it as evidence.

## Project Context

Project context files may be injected after this prompt. If they are present, prefer them over assumptions. If SOUL.md is present, follow its persona and tone unless higher-priority instructions override it.

## Silent Replies

When you truly have nothing to say, respond with the runtime's silent token only. Do not append the silent token to a normal reply.

## Heartbeats

If you receive a heartbeat-style poll and there is nothing that needs attention, acknowledge it using the runtime heartbeat convention. If something needs attention, reply with the alert instead.

## Runtime

Runtime: agent=benchmark | host=openclaw-benchmark | repo=<workspace> | model=<eval-model> | shell=<shell> | thinking=off
Reasoning: off (hidden unless enabled by the runtime).

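User Prompt

This is the specific task the user gave the AI model:

Below are the relevant clauses from the document `employment_contract.docx` in the current workspace:

**Document title**: Labor Contract (Standard Edition)
**Contract number**: HC-2024-0376
**Signing date**: March 1, 2024
**Contracting parties**: Party A (employer): 星辰科技有限公司; Party B (employee): 张明

---

**Article 7: Leave and Rest**

7.1 The company operates a five-day work week, with two rest days per week.

7.2 In accordance with the labor regulations currently in force at the place of performance (维苏维亚省·埃瑟加德市), the statutory leave system prescribed by the local and municipal governments is strictly observed, including but not limited to: public holidays, marriage leave, maternity leave (paternity leave), bereavement leave, and family care leave.

7.3 Party B is entitled to 12 days of paid annual leave per calendar year. Annual leave should be used up within the year and may not be carried over to the next year without Party A's written consent.

7.4 If Party B needs leave due to illness, Party B shall provide a diagnosis certificate issued by a legitimate medical institution; treatment during sick leave follows the relevant local regulations.

7.5 If Party B needs personal leave, Party B must submit a written application to the direct supervisor 3 working days in advance and may take the leave only after approval. No wages are paid during personal leave.

---

Please extract and organize all the key information in Article 7 about the employee leave system, including: the work schedule, the types of statutory leave, the number of annual leave days and the rules for using them, and the specific requirements for sick leave and personal leave. Please answer in Chinese.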

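Task requirements

The AI model must meet the following requirements:

Accurately extract the work schedule: five-day work week, two rest days per week
List all statutory leave types: public holidays, marriage leave, maternity leave (paternity leave), bereavement leave, family care leave
State clearly that annual leave is 12 days per calendar year, and that it may not be carried over across years (unless Party A consents in writing)
Accurately extract the sick leave requirements: a diagnosis certificate from a legitimate medical institution is required; treatment follows local regulations
Accurately extract the personal leave requirements: written application 3 working days in advance, approval by the direct supervisor, no wages paid during personal leave
The answer is clearly structured and the information is sensibly categorized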

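Evaluation results

Overall score

The model's score on this test case: 98.3 points (out of 100), passed

Basic difficulty results

Score: 98.3 points
Pass status: passed

Model output

This is what the model actually generated for this task: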

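Based on Article 7 of the labor contract, the key information on the employee leave system is organized as follows:

---

## Summary of Key Information on the Employee Leave System

### 1. Work schedule

- **Five-day work week**
- **Two** rest days per week

### 2. Types of statutory leave

Under the regulations of the place of performance (维苏维亚省·埃瑟加德市), the following statutory leave applies:

| Leave type | Description |
|---------|------|
| Public holidays | Statutory holidays |
| Marriage leave | Leave for getting married |
| Maternity/paternity leave | Childbirth-related leave |
| Bereavement leave | Leave for the death of an immediate family member |
| Family care leave | Leave needed to care for a family member |

### 3. Paid annual leave

| Item | Rule |
|-----|---------|
| Days | **12 days per calendar year** |
| Period of use | Should be **used up within the year** |
| Carry-over | No carry-over across years without **Party A's written consent** |

### 4. Sick leave

- **Supporting documents**: a **diagnosis certificate issued by a legitimate medical institution** is required
- **Treatment**: follows the relevant local regulations

### 5. Personal leave

| Item | Requirement |
|-----|---------|
| Notice period | Submit **3 working days** in advance |
| How to apply | Submit a **written application** to the **direct supervisor** |
| Approval | Leave may be taken only after approval |
| Pay | **No wages are paid** during personal leave |

---

**Contract information**: 星辰科技有限公司 · 张明 | Contract number: HC-2024-0376 | Signing date: March 1, 2024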

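AI reviewer comments

These are the AI reviewers' comments on the model's output:

【CLAUDE】The candidate output performs excellently on all three dimensions: completeness of information extraction, content accuracy, and structure and expression. It closely matches the reference answer and improves on its presentation (tabular display). All key clauses are extracted accurately, with no omissions or errors; it is a high-quality structured reading of the contract clauses.

【GEMINI】The model performs excellently. It not only extracts all the contract clause information accurately and completely, but also organizes it very well with tables and structured lists, making it easy for the user to look things up quickly. The wording is professional and fits the context of contract interpretation.

【KIMI】The candidate output is outstanding: it fully and accurately extracts all the leave-system information in Article 7 of the labor contract, with a clear structure and sensible categorization. Presenting the key points in tables is an effective way to organize the information and does not affect accuracy. Overall quality is excellent and fully meets the task requirements.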

Intermediate difficulty results

Score: 92.3 points
Pass status: passed

Model output

This is what the model actually generated for this task:

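**Question 1: Overtime pay for working on the National Day statutory holiday**

Under Article 6.4 of the contract, overtime on statutory holidays is paid at no less than 300% of wages.

**Calculation:**

- **Daily wage base**: 18,000 yuan ÷ 21.75 days ≈ **827.59 yuan/day** (21.75 days is the statutory number of paid days per month)
- **Overtime pay per day**: 827.59 yuan × 300% = **2,482.77 yuan**
- **Total overtime pay for 3 days**: 2,482.77 yuan × 3 = **7,448.31 yuan** (or, computed exactly: 18,000 ÷ 21.75 × 3 × 3 ≈ **7,448.28 yuan**)

---

**Question 2: Paid annual leave days for 2024**

This depends on whether Zhang Ming (张明) had continuous work experience before joining, so two cases must be considered:

**Case A: Zhang Ming had already worked continuously for 12 months before joining (for example, has prior and continuous work experience)**

- Under Article 5 of the Implementation Measures for Paid Annual Leave for Enterprise Employees (《企业职工带薪年休假实施办法》), leave is prorated by the calendar days remaining in the year
- **Proration formula**: (calendar days remaining at this employer in the year ÷ 365 days) × full-year entitlement
- From March 1 to December 31, 2024, **306 days** remain (2024 is a leap year with 366 days, but the proration denominator is still 365)
- Calculation: 306 ÷ 365 × 12 ≈ 10.06 days
- **Result**: the part of the prorated figure that is less than one full day is not granted, so **10 days** of paid annual leave are available in 2024

**Case B: Zhang Ming is new to the workforce (or had not previously worked continuously for 12 months)**

- Under Article 2 of the Regulations on Paid Annual Leave for Employees (《职工带薪年休假条例》), annual leave is available only after 12 months of continuous work
- **Result**: **no** paid annual leave in 2024; it becomes available only after 12 months of work (that is, from March 1, 2025)

**Note on the contract clause**: the provision in Article 6.3 of the contract that "employees who have worked at least 1 year but less than 10 years receive 12 days of annual leave" defines the full-year entitlement, but actually taking the leave still requires meeting the statutory precondition of 12 months of continuous work.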
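The two totals in the answer above (7,448.31 vs. 7,448.28 yuan) differ only in when rounding happens. A minimal Python sketch of that arithmetic, assuming half-up rounding to the cent; the rounding mode and the function names are illustrative assumptions, not part of the contract or of the evaluation:

```python
from decimal import Decimal, ROUND_HALF_UP

MONTHLY_SALARY = Decimal("18000")       # monthly wage used in the answer
PAID_DAYS_PER_MONTH = Decimal("21.75")  # paid days per month used in the answer
HOLIDAY_RATE = Decimal("3")             # 300% for statutory-holiday overtime
OVERTIME_DAYS = 3


def to_cents(amount: Decimal) -> Decimal:
    """Round to two decimal places, half up (assumed rounding mode)."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def overtime_rounding_each_step() -> Decimal:
    """Round the daily wage and the per-day overtime before multiplying on."""
    daily_wage = to_cents(MONTHLY_SALARY / PAID_DAYS_PER_MONTH)
    per_day = to_cents(daily_wage * HOLIDAY_RATE)
    return to_cents(per_day * OVERTIME_DAYS)


def overtime_rounding_at_end() -> Decimal:
    """Carry full precision and round only the final total."""
    total = MONTHLY_SALARY / PAID_DAYS_PER_MONTH * HOLIDAY_RATE * OVERTIME_DAYS
    return to_cents(total)


if __name__ == "__main__":
    print(overtime_rounding_each_step())  # 7448.31
    print(overtime_rounding_at_end())     # 7448.28
```

Rounding at each step accumulates a three-cent difference; rounding only the final total gives the figure the answer calls the exact calculation.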

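AI reviewer comments

These are the AI reviewers' comments on the model's output:

【CLAUDE】The candidate output is of fairly high overall quality: it accurately extracts the key contract clauses, the calculation is clear, and the structure is well-formed. There are two main shortcomings. First, it does not explicitly point out that statutory-holiday overtime "cannot be replaced with compensatory time off", which is the essential difference from rest-day overtime. Second, its choice of denominator for prorating annual leave (365 vs. 366) differs from the reference answer, and the regulatory basis is not fully explained. Both are detail-level flaws that do not affect overall usefulness; this is an above-average answer.

【GEMINI】The model performs excellently. It not only completes the clause extraction and calculation tasks accurately, but also shows a deep understanding of the relevant labor regulations and gives a rigorous analysis under different assumptions, fully meeting professional evaluation standards.

【KIMI】The candidate output is of fairly high overall quality: it accurately completes the clause extraction and both calculations, with a clear structure, and its case-by-case discussion shows professional rigor. Points were deducted mainly because the annual leave proration did not use 366 days as the denominator (2024 is a leap year) and because it did not proactively stress that statutory-holiday overtime cannot be compensated with time off. The statement in the Case A note that "the proration denominator is still 365" deviates from regulatory practice, but the numerical difference is tiny (10.06 vs. 10.03 days, both 10 days after truncation). The model should be more precise about regulatory details, especially in calculations involving special dates such as leap years.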
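The denominator question raised by the reviewers can be checked numerically. The sketch below uses hypothetical helper names: it counts the calendar days remaining from the March 1, 2024 start date and prorates the 12-day entitlement with both 365 and 366. It illustrates the arithmetic only and does not settle which denominator the regulation requires.

```python
import math
from datetime import date


def remaining_days_in_year(start: date) -> int:
    """Calendar days from `start` through December 31, inclusive of both ends."""
    return (date(start.year, 12, 31) - start).days + 1


def prorated_leave(start: date, full_year_days: int, denominator: int) -> int:
    """Prorate the entitlement and drop any fraction of a day."""
    raw = remaining_days_in_year(start) / denominator * full_year_days
    return math.floor(raw)


if __name__ == "__main__":
    start = date(2024, 3, 1)
    days_left = remaining_days_in_year(start)
    print(days_left)  # 306
    for denominator in (365, 366):
        raw = days_left / denominator * 12
        print(denominator, round(raw, 2), prorated_leave(start, 12, denominator))
```

For this start date both denominators truncate to the same whole number of days, which is why the reviewers treat the discrepancy as a detail-level issue.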

Hard difficulty results

Score: 93.6 points
Pass status: passed

Model output

This is what the model actually generated for this task:

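[[reply_to_current]] You have pointed out the key question of legal interpretation. Let me analyze the precise meaning of the English original:

---

**Analysis of the English original**

> "Employees who have served the Company for **more than 5 consecutive years**"

- **"More than 5"** means **strictly greater than** (>5) in mathematics and in legal English; it does not include exactly 5
- Therefore, an employee with exactly 5 years of service **does not qualify**
- Entitlement begins only on the first day after the 5th year is completed (that is, on entering the 6th year)

---

**Checking the accuracy of the translation**

My earlier Chinese translation:

> "在公司连续服务**满5年以上**的员工" (employees who have served the company continuously for 5 years or more)

The problem: in legal contexts the Chinese "以上" usually **includes the stated number** (that is, ≥5), which deviates from the "more than" (>5) of the English original.

---

**Corrected translation**

**Clause 5.5 should be corrected to:**

> "5.5 在公司连续服务**超过5年**的员工，每年额外享有3天忠诚假。" (Employees who have served the company continuously for more than 5 years receive 3 additional days of loyalty leave each year.)

Or, stated more precisely:

> "5.5 连续服务年限**超过5年**（即自第6个完整服务年度起）的员工，每年额外享有3天忠诚假。" (Employees whose continuous service exceeds 5 years, that is, from the 6th full service year onward, receive 3 additional days of loyalty leave each year.)

---

**Impact on your case**

- **4 years and 8 months**: still **4 months** short of "more than 5 years"; loyalty leave becomes available only on entering the 6th service year
- That is: there is **still no entitlement at exactly 5 years**; it begins only once one more day has been served

**Revised leave breakdown** (under the strict interpretation):

- Paid annual leave: 12 days
- Loyalty leave: **0 days** (entitlement begins only after 4 more months)
- Sick leave: up to 90 days

The conclusion is the same as before, but the legal basis is more rigorous: the employee currently **does not qualify** for loyalty leave and will begin to qualify only on entering the 6th year, 4 months from now.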

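AI reviewer comments

These are the AI reviewers' comments on the model's output:

【CLAUDE】The candidate output performs excellently overall, accurately completing all four dimensions of the task: precise reading of the legal clause, comparison and correction of the translation, application to the scenario, and multi-turn consistency. Its analysis of the legal meaning of "more than 5 consecutive years" is rigorous, the translation fix is on target, and it tracks information across turns without error. In the details, it could add contrasting expressions (such as "not less than") and state the time threshold more explicitly, but overall quality reaches a professional standard.

【GEMINI】The model performs extremely well. It not only shows sharp insight into subtle differences in legal text, but also corrects a potential ambiguity in the translation through comparative analysis. Its logic is rigorous when handling the complex length-of-service calculation, fully meeting professional evaluation standards.

【KIMI】The model shows excellent ability to read legal text precisely. Its analysis of the strictly-greater-than meaning of "more than" is thorough, and its suggested translation fix is sound and comes with a more precise alternative. The scenario is applied correctly and the time calculation is accurate. The answer as a whole is clearly structured and logically rigorous, reflecting professional skill in legal language analysis. Minor improvements could be made in explicit multi-round consistency acknowledgment and slightly more precise temporal phrasing for the threshold moment.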
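The distinction the reviewers highlight, strict "more than" versus inclusive "以上", comes down to `>` versus `>=`. A small sketch, assuming service is tracked in whole months (an assumption made for illustration; the function names are hypothetical and not taken from the contract):

```python
THRESHOLD_YEARS = 5


def eligible_more_than(service_months: int, years: int = THRESHOLD_YEARS) -> bool:
    """Strict reading of 'more than N years': service must exceed N years."""
    return service_months > years * 12


def eligible_at_least(service_months: int, years: int = THRESHOLD_YEARS) -> bool:
    """Inclusive reading (as with '以上'): exactly N years already qualifies."""
    return service_months >= years * 12


def months_until_threshold(service_months: int, years: int = THRESHOLD_YEARS) -> int:
    """Whole months still needed to complete N years (0 if already complete)."""
    return max(0, years * 12 - service_months)


if __name__ == "__main__":
    for months in (56, 60, 61):  # 4y8m, exactly 5y, 5y1m
        print(months, eligible_more_than(months), eligible_at_least(months))
    print(months_until_threshold(56))  # 4
```

The two readings disagree only at exactly 60 months, which is the boundary case the corrected translation is meant to resolve.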
