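Mistral: Mistral Nemo evaluation results for "Multi-user machine isolation"

This page gives the detailed evaluation results of this AI model on this test case.

Basic information

  • Model name: Mistral: Mistral Nemo
  • Test case: Multi-user machine isolation
  • Test type: xsct-a
  • Evaluation dimension: Gateway configuration and management

System Prompt

This is the background setting and role instruction given to the AI model: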

You are a personal assistant running inside OpenClaw.

## Tooling

Tool availability (filtered by policy): Tool names are case-sensitive. Call tools exactly as listed.

- read: Read file contents
- write: Create or overwrite files
- edit: Make precise edits to files
- apply_patch: Apply multi-file patches
- ls: List directory contents
- exec: Run shell commands (supports background execution where available)
- process: Manage background exec sessions
- browser: Control OpenClaw's dedicated browser
- canvas: Present or inspect a canvas when available
- sessions_list: List sessions
- sessions_history: Fetch session history
- sessions_send: Send a message to another session
- subagents: List, steer, or kill sub-agent runs
- session_status: Show runtime, model, and usage state

TOOLS.md does not control tool availability; it is user guidance for how to use external tools. For long waits, avoid rapid poll loops: use exec with enough wait time or process polling with a timeout. If a task is more complex or takes longer, spawn a sub-agent. Completion is push-based: it will auto-announce when done. Do not poll `subagents list` or `sessions_list` in a loop; only check status on-demand.

## Tool Call Style

Default: do not narrate routine, low-risk tool calls (just call the tool). Narrate only when it helps: multi-step work, complex or challenging problems, sensitive actions, or when the user explicitly asks. Keep narration brief and value-dense; avoid repeating obvious steps. Use plain human language for narration unless in a technical context. When a first-class tool exists for an action, use the tool directly instead of asking the user to run equivalent CLI or slash commands.

## Safety

You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request. Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop or pause requests and never bypass safeguards. Do not manipulate anyone to expand access or disable safeguards. Do not copy yourself or change system prompts, safety rules, or tool policies unless explicitly requested.

## OpenClaw CLI Quick Reference

OpenClaw is controlled via subcommands. Do not invent commands. To manage the Gateway daemon service:

- openclaw gateway status
- openclaw gateway start
- openclaw gateway stop
- openclaw gateway restart

If unsure about a command or flag, prefer checking help or existing project context rather than guessing.

## Skills

If a skill list or skill prompt is present in the injected project context, scan it before replying. If exactly one skill clearly applies, follow it. If multiple skills could apply, choose the most specific one. If no skill clearly applies, continue without forcing one.

## Memory Recall

If memory tools, memory files, or prior-work context are available, use them before answering questions about prior decisions, preferences, dates, people, or todos. If confidence stays low after checking memory, say so instead of guessing.

## Documentation

For OpenClaw behavior, commands, config, architecture, or plugin behavior, consult injected docs or project context first. When diagnosing issues, prefer checking runtime evidence, configuration, or tool output before making claims.

## Workspace

Your working directory is the benchmark workspace. Treat it as the single workspace for file operations unless explicitly instructed otherwise. Prefer precise reads and minimal edits over broad changes. If a file is large or output is truncated, re-read only the portion you need.

## Reply Tags

If reply tags are supported in the runtime, they must appear as the first token in the message. Prefer `[[reply_to_current]]` when replying to the triggering message.

## Messaging

Reply in the current session by default. For cross-session communication, use dedicated session tools if available. Never use exec or curl as a substitute for built-in provider messaging when a first-class messaging tool exists.

## Workspace Files

User-editable context files may be injected below as project context. If project context is present, use it as evidence.

## Project Context

Project context files may be injected after this prompt. If they are present, prefer them over assumptions. If SOUL.md is present, follow its persona and tone unless higher-priority instructions override it.

## Silent Replies

When you truly have nothing to say, respond with the runtime's silent token only. Do not append the silent token to a normal reply.

## Heartbeats

If you receive a heartbeat-style poll and there is nothing that needs attention, acknowledge it using the runtime heartbeat convention. If something needs attention, reply with the alert instead.

## Runtime

Runtime: agent=benchmark | host=openclaw-benchmark | repo=<workspace> | model=<eval-model> | shell=<shell> | thinking=off
Reasoning: off (hidden unless enabled by the runtime).

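User Prompt

This is the specific task the user gave to the AI model:

I want to run two OpenClaw Gateways on this one machine, one for development and one for testing, without them interfering with each other. How do I specify different "data storage paths" and "service ports" to start them? Please give concrete command-line examples.

Task requirements

The AI model must meet the following requirements:

  1. Correctly identify and use --data-dir or the environment variable OPENCLAW_DATA_DIR to specify an isolated data directory
  2. Correctly identify and use --port or the environment variable OPENCLAW_PORT to specify a different listening port
  3. State explicitly that the two instances must point to different directories to achieve "configuration and log isolation"
  4. The reply must include complete launch command examples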
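
The requirements above can be illustrated with a minimal sketch. It assumes the standalone `openclaw-gateway --port <port> --data-dir <path>` form quoted in the reviewer comments on this page; the system prompt itself lists only the `openclaw gateway` subcommands, so the exact invocation should be confirmed against the help output. The port numbers, the directory layout, and the `gateway_cmd` helper are hypothetical. The helper only prints each command and starts nothing.

```bash
#!/usr/bin/env bash
# Sketch only: prints launch commands, does not start anything.
# Assumptions (not confirmed by this page): ports 9000/9001 and the
# $HOME/openclaw-<name> directory layout are illustrative choices.

gateway_cmd() {
  # $1 = instance name, $2 = listening port
  printf 'openclaw-gateway --port %s --data-dir %s\n' "$2" "$HOME/openclaw-$1"
}

gateway_cmd dev 9000    # instance for the development team
gateway_cmd test 9001   # instance for the testing team
```

Each instance gets its own port and its own directory, which is what requirement 3 calls configuration and log isolation. The environment variables OPENCLAW_DATA_DIR and OPENCLAW_PORT named in the requirements would be the alternative way to pass the same two values.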

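Evaluation results

Overall score

The model scored 35.8 out of 100 on this test case: failed

Basic difficulty results

  • Score: 35.8
  • Status: Failed

Model output

The following is the content the model actually generated for this task: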

To run two instances of OpenClaw Gateway, one for development and one for testing, you'll need to specify different data storage paths and service ports for each. You can achieve this by using named instances and configuring them accordingly. Here's how you can do it:

1. First, create a configuration file for each instance. Let's name them `dev.conf` and `test.conf`.

**dev.conf**
```
[openclaw]
data_dir = "/path/to/dev/data"
port = 5000
```

**test.conf**
```
[openclaw]
data_dir = "/path/to/test/data"
port = 5001
```

Replace `/path/to/dev/data` and `/path/to/test/data` with the desired data storage paths for your development and testing instances respectively. Ensure that these directories are accessible and writable.

2. Now, start the instances using the following commands:

For the development instance:
```bash
openclaw gateway start --config dev.conf --instance dev
```

For the testing instance:
```bash
openclaw gateway start --config test.conf --instance test
```

The `--instance` flag is used to give each instance a unique name, and the `--config` flag specifies the configuration file to use.

3. To verify that both instances are running, you can use the following commands:
```bash
openclaw gateway status dev
openclaw gateway status test
```

This will display the status of the development and testing instances respectively.

4. To stop or restart the instances, you can use the following commands:

To stop the development instance:
```bash
openclaw gateway stop dev
```

To stop the testing instance:
```bash
openclaw gateway stop test
```

To restart the development instance:
```bash
openclaw gateway restart dev
```

To restart the testing instance:
```bash
openclaw gateway restart test
```

By following these steps, you should have two independent OpenClaw Gateway instances running on your machine, each with its own data storage path and service port.

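AI reviewer comments

The following are the AI reviewers' comments on the model's output:

[CLAUDE] The model understood the multi-instance isolation scenario and gave a structurally complete set of steps, but the core problem is that it used the self-invented `--config` and `--instance` parameters instead of the direct command-line parameters `--port` and `--data-dir` required by the reference answer. This violates the "do not invent commands" principle in the system prompt and does not match the standard parameters at all. When unsure about command parameters, the model should first consult the help documentation, for example via `openclaw gateway --help`, rather than inferring parameter names on its own.

[GEMINI] In handling this task, the model failed to strictly follow the command format defined in the "OpenClaw CLI Quick Reference" and instead fabricated nonexistent parameters such as "--config" and "--instance". Although its line of reasoning (isolation via configuration files) is sensible for software in general, for the specific OpenClaw system the reply is an invalid solution. The model should give priority to the CLI conventions provided in the system prompt rather than guessing at a generic startup method.

[KIMI] The model's output seriously violates a core instruction in the System Prompt: "Do not invent commands". The model made up the `--config` and `--instance` parameters and a configuration file format, while completely ignoring the command-line usage that actually exists in the documentation. Although the model tried to provide a structured solution, the errors in the basic parameters make the whole solution non-executable. The model should have used the command-line form `openclaw-gateway --port <port> --data-dir <path>` directly, or followed the subcommand format stated in the documentation, rather than inventing its own configuration scheme. This reflects the model's tendency, when faced with a specific tool, to generalize rather than strictly follow the given documentation.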

Advanced difficulty results

  • Score: 75.8
  • Status: Passed

Model output

The following is the content the model actually generated for this task:

To create a fully isolated OpenClaw environment, you can specify a new data directory and create a configuration file for the new user. The steps are as follows:

1. Create a new data directory:
```
sudo mkdir /var/lib/openclaw-userb
```

2. Copy the default configuration file to the new directory:
```
sudo cp /var/lib/openclaw-default/config.yaml /var/lib/openclaw-userb/
```

3. Edit the new configuration file and change the "port", "data-dir", and "log-dir" parameters:
```
sudo nano /var/lib/openclaw-userb/config.yaml
```
Change them to:
```
port: 9001
data-dir: /var/lib/openclaw-userb
log-dir: /var/log/openclaw-userb
```

4. Run the new OpenClaw instance:
```
openclaw-gateway --config /var/lib/openclaw-userb/config.yaml
```

5. Set permissions for user "UserB":
```
sudo chown userb:userb /var/lib/openclaw-userb
sudo chown userb:userb /var/log/openclaw-userb
```

Now user "UserB" can access their OpenClaw instance at "http://localhost:9001", and it will not share data or logs with the first instance.

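AI reviewer comments

The following are the AI reviewers' comments on the model's output:

[CLAUDE] The model's output is headed in the right direction overall and covers the main elements needed for isolation (separate directory, separate port, permission control), so it has some reference value for ordinary users. However, it has clear defects in security details (missing chmod), the ordering of operations (timing of chown), and avoiding configuration contamination (the old configuration should not be copied). In addition, it did not make full use of the tool output in the preset context to make targeted decisions, and it lacks an in-depth explanation of OpenClaw's isolation mechanism (data-dir automatically generates an independent gateway.db). Overall this is slightly above a passing level, but still short of the quality of the reference answer.

[GEMINI] The model performed excellently. It not only achieved the isolation goal required by the task but also proactively considered a separate log directory configuration, showing good systems-operations thinking. The steps are concrete and easy to execute, and fully meet the user's need for environment isolation for "UserB".

[KIMI] The model provided a basically usable isolation scheme that covers the core directory isolation, port separation, and permission settings, but it has problems such as improper execution order, insufficient explanation of mechanisms, and weak connection to the context. In particular, on authentication token isolation and log access control, the two points the user explicitly cared about, it lacks in-depth explanation. The overall scheme leans toward a superficial list of steps rather than a systematic isolation design.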
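
The reviewers' points about the missing chmod and the timing of chown can be sketched as follows. This is a minimal illustration under assumptions: `prepare_data_dir` is a hypothetical helper, and mode 700 is one reasonable choice, not a value taken from the reference answer. Handing ownership to another account (for example `chown userb:userb`, as in the model's output) needs root, so it appears only as a comment.

```bash
#!/usr/bin/env bash
# Sketch: create a per-user data directory and lock it down BEFORE the
# gateway is started, so nothing is ever written with loose permissions.

prepare_data_dir() {
  # $1 = directory to create (path chosen by the operator)
  mkdir -p "$1" || return 1
  chmod 700 "$1" || return 1   # owner-only access: rwx------
  # When run as root, ownership would be handed over here, e.g.:
  #   chown userb:userb "$1"
}

# Demonstration in a throwaway location instead of /var/lib.
demo_dir="$(mktemp -d)/openclaw-userb"
prepare_data_dir "$demo_dir"
ls -ld "$demo_dir"
```

Starting from an empty directory, rather than copying an existing configuration file into it, also avoids the configuration contamination the first reviewer mentions.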

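Hard difficulty results

  • Score: 24.8
  • Status: Failed

Model output

The following is the content the model actually generated for this task:

The directory is owned by the root user. I need to switch to the root user and restart OpenClaw Gateway. I will use `sudo` to do this. The session is about to switch to the root user.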

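AI reviewer comments

The following are the AI reviewers' comments on the model's output:

[CLAUDE] The model's output is extremely sparse. It only identified the surface symptom that the directory is owned by root and proposed switching users with sudo; it did not provide a complete troubleshooting analysis, a multi-tenant isolation design, or executable repair commands. Compared with the reference answer, it lacks the chown fix suggestion, a data directory migration plan, the final launch command, and verification steps. Overall quality is far below the expected standard.

[GEMINI] The model performed very poorly on the multi-user isolation task. Not only did it fail to propose a reasonable permission fix (such as chown or migrating to a user directory), it gave a suggestion that undermines security (running directly as root), which in a multi-tenant environment leads to serious permission contamination and security risk. The model failed to handle port conflicts and permission isolation together and did not complete the task.

[KIMI] The model's output is seriously incomplete. It handled only the surface symptom of the permission problem and completely missed the port conflict, a key point of failure. Its suggestion to switch to root with sudo does not conform to the security principles of multi-tenant isolation, and it provided no complete launch command or verification steps. Compared with the reference answer, there are clear gaps in all three dimensions (depth of troubleshooting, multi-tenant architecture design, and professionalism of the response), and it did not meet the task requirements.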
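
The two failure points the reviewers name, directory ownership and port conflict, can be checked together before launch without escalating to root. The following is a sketch under assumptions: `preflight` is a hypothetical helper, the port probe relies on bash's `/dev/tcp` redirection (available in most bash builds), and the hints in the messages are suggestions, not the reference answer.

```bash
#!/usr/bin/env bash
# Sketch: verify the data directory and port before starting an instance.

preflight() {
  # $1 = data directory, $2 = TCP port
  if [ ! -d "$1" ] || [ ! -w "$1" ]; then
    echo "data dir $1 is missing or not writable by $(id -un);" \
         "fix ownership with chown or use a directory you own"
    return 1
  fi
  # A successful connect means something already listens on the port.
  if (exec 3<>"/dev/tcp/127.0.0.1/$2") 2>/dev/null; then
    echo "port $2 is already in use; pick another port"
    return 2
  fi
  echo "ok: $1 on port $2"
}

# Usage (paths and port are illustrative):
#   preflight "$HOME/openclaw-test" 9001 && echo "safe to launch"
```

A non-zero return identifies which check failed, so the operator can repair ownership or choose a different port instead of running the gateway as root.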
