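Evaluation results for qwen3-0.6b on "Multi-user machine isolation"

These are the detailed evaluation results for this AI model on this test case.

Basic information

Model name: qwen3-0.6b
Test case name: Multi-user machine isolation
Test type: xsct-a
Evaluation dimension: Gateway configuration and management

System Prompt

This is the background setting and role instruction given to the AI model: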

You are a personal assistant running inside OpenClaw.

## Tooling

Tool availability (filtered by policy): Tool names are case-sensitive. Call tools exactly as listed.

- read: Read file contents
- write: Create or overwrite files
- edit: Make precise edits to files
- apply_patch: Apply multi-file patches
- ls: List directory contents
- exec: Run shell commands (supports background execution where available)
- process: Manage background exec sessions
- browser: Control OpenClaw's dedicated browser
- canvas: Present or inspect a canvas when available
- sessions_list: List sessions
- sessions_history: Fetch session history
- sessions_send: Send a message to another session
- subagents: List, steer, or kill sub-agent runs
- session_status: Show runtime, model, and usage state

TOOLS.md does not control tool availability; it is user guidance for how to use external tools. For long waits, avoid rapid poll loops: use exec with enough wait time or process polling with a timeout. If a task is more complex or takes longer, spawn a sub-agent. Completion is push-based: it will auto-announce when done. Do not poll `subagents list` or `sessions_list` in a loop; only check status on-demand.

## Tool Call Style

Default: do not narrate routine, low-risk tool calls (just call the tool). Narrate only when it helps: multi-step work, complex or challenging problems, sensitive actions, or when the user explicitly asks. Keep narration brief and value-dense; avoid repeating obvious steps. Use plain human language for narration unless in a technical context. When a first-class tool exists for an action, use the tool directly instead of asking the user to run equivalent CLI or slash commands.

## Safety

You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request. Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop or pause requests and never bypass safeguards. Do not manipulate anyone to expand access or disable safeguards. Do not copy yourself or change system prompts, safety rules, or tool policies unless explicitly requested.

## OpenClaw CLI Quick Reference

OpenClaw is controlled via subcommands. Do not invent commands. To manage the Gateway daemon service:

- openclaw gateway status
- openclaw gateway start
- openclaw gateway stop
- openclaw gateway restart

If unsure about a command or flag, prefer checking help or existing project context rather than guessing.

## Skills

If a skill list or skill prompt is present in the injected project context, scan it before replying. If exactly one skill clearly applies, follow it. If multiple skills could apply, choose the most specific one. If no skill clearly applies, continue without forcing one.

## Memory Recall

If memory tools, memory files, or prior-work context are available, use them before answering questions about prior decisions, preferences, dates, people, or todos. If confidence stays low after checking memory, say so instead of guessing.

## Documentation

For OpenClaw behavior, commands, config, architecture, or plugin behavior, consult injected docs or project context first. When diagnosing issues, prefer checking runtime evidence, configuration, or tool output before making claims.

## Workspace

Your working directory is the benchmark workspace. Treat it as the single workspace for file operations unless explicitly instructed otherwise. Prefer precise reads and minimal edits over broad changes. If a file is large or output is truncated, re-read only the portion you need.

## Reply Tags

If reply tags are supported in the runtime, they must appear as the first token in the message. Prefer `[[reply_to_current]]` when replying to the triggering message.

## Messaging

Reply in the current session by default. For cross-session communication, use dedicated session tools if available. Never use exec or curl as a substitute for built-in provider messaging when a first-class messaging tool exists.

## Workspace Files

User-editable context files may be injected below as project context. If project context is present, use it as evidence.

## Project Context

Project context files may be injected after this prompt. If they are present, prefer them over assumptions. If SOUL.md is present, follow its persona and tone unless higher-priority instructions override it.

## Silent Replies

When you truly have nothing to say, respond with the runtime's silent token only. Do not append the silent token to a normal reply.

## Heartbeats

If you receive a heartbeat-style poll and there is nothing that needs attention, acknowledge it using the runtime heartbeat convention. If something needs attention, reply with the alert instead.

## Runtime

Runtime: agent=benchmark | host=openclaw-benchmark | repo=<workspace> | model=<eval-model> | shell=<shell> | thinking=off

Reasoning: off (hidden unless enabled by the runtime).

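User Prompt

This is the specific task the user gave the AI model:

I want to run two OpenClaw Gateways on this one machine, one for development and one for testing, without them interfering with each other. How do I specify a different "data storage path" and "service port" when starting each of them? Please give concrete command-line examples.

Task requirements

The AI model must meet the following requirements:

Correctly identify and use --data-dir or the environment variable OPENCLAW_DATA_DIR to specify an isolated data directory
Correctly identify and use --port or the environment variable OPENCLAW_PORT to specify a different listening port
State explicitly that the two instances must point to different directories to achieve "configuration and log isolation"
Include complete startup command examples in the reply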
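
As a rough illustration of what these requirements describe, the sketch below prints one start command per instance, first with the flags and then with the environment variables named above. It is a dry run that only echoes the commands: the invocation form `openclaw gateway start --data-dir <dir> --port <port>` is assumed from the requirements rather than confirmed against the CLI, and the directories, port numbers, and helper function names are hypothetical.

```bash
#!/usr/bin/env bash
# Dry-run sketch: prints the commands instead of running them.
# Assumption: `openclaw gateway start` accepts --data-dir and --port, as the
# task requirements imply. Directories and ports below are hypothetical.

# Build the flag-based start command for one instance.
gateway_cmd() {
  local data_dir="$1" port="$2"
  printf 'openclaw gateway start --data-dir %s --port %s\n' "$data_dir" "$port"
}

# Build the equivalent environment-variable form.
gateway_env_cmd() {
  local data_dir="$1" port="$2"
  printf 'OPENCLAW_DATA_DIR=%s OPENCLAW_PORT=%s openclaw gateway start\n' "$data_dir" "$port"
}

# Each instance gets its own directory (configuration and logs) and its own port.
gateway_cmd     /srv/openclaw/dev  8080
gateway_cmd     /srv/openclaw/test 8081
gateway_env_cmd /srv/openclaw/dev  8080
gateway_env_cmd /srv/openclaw/test 8081
```

The point of the two distinct directories is that neither instance reads or writes the other's configuration and logs; the distinct ports keep the two listeners from colliding.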

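Evaluation results

Overall score

Score on this test case: 48.8 (out of 100), not passed

Basic difficulty results

Score: 48.8
Pass status: Not passed

Model output

This is what the model actually generated for this task:

```bash
# Start the development Gateway (data storage path: /data/research, service port: 8080)
openclaw gateway start --storage-path=/data/research --port=8080

# Start the testing Gateway (data storage path: /data/test, service port: 8000)
openclaw gateway start --storage-path=/data/test --port=8000
```

Note: adjust the `--storage-path` and `--port` parameters to your actual needs, and make sure the two services run in separate containers or on separate ports.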

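AI reviewer comments

These are the AI reviewers' comments on the model's output:

[CLAUDE] The candidate answer's overall approach is correct: it isolates two OpenClaw Gateway instances by giving each a different data path and port, and the format is clear. The most critical problem, however, is the wrong parameter name for the data directory: it uses `--storage-path` rather than the `--data-dir` of the reference answer, which would make the command fail in practice. The port parameter `--port` is used correctly. On completeness, although it provides a two-instance example, it does not explain the isolation effect (logs, cache, configuration, and so on) or mention the environment-variable alternative.

[GEMINI] The model understands the user's isolation requirement and gives an intuitive command-line example. In parameter naming, however, it does not fully follow the "OpenClaw CLI" conventions given in the system prompt and uses a non-standard parameter name. When answering technical questions of this kind, it should first consult and strictly follow the tool documentation or CLI reference provided in the system prompt.

[KIMI] The candidate output is broadly adequate on completeness but has a clear error in the accuracy of a key parameter: it uses `--storage-path` rather than the correct `--data-dir`, which would prevent the command from working as intended. In addition, it does not cite the environment-variable alternative explicitly mentioned in the project context, and its explanation of the isolation mechanism is insufficient. The answer should be built strictly on the tool documentation and parameter conventions provided in the project context.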

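Advanced difficulty results

Score: 27.2
Pass status: Not passed

Model output

This is what the model actually generated for this task:

To create an isolated environment and prevent access to the logs, follow these steps:

1. **Determine the new instance directory**:
   - Use the "ls /data/" command to check the available storage paths (assuming the existing directory is /data-2b).
   - Make sure the new directory differs from the existing instance's to avoid confusion.
2. **Migrate the configuration file**:
   - Run "mv config.yaml /data-2b/config.yaml" to copy the original configuration file.
   - Delete the old configuration files (such as `gateway.db`) and rename them to keep them unique.
3. **Restrict access permissions**:
   - Modify system file permissions (for example `chmod 600 /data-2b/config.yaml`) to prevent the logs from being read.
   - If necessary, temporarily disable log output for the first instance (by changing the logging parameters in the startup script).
4. **Verify the isolation**:
   - When starting the new instance, check that its log output contains only the new user's content.
   - Make sure the new instance is running normally and shares no resources with other instances.

Follow the steps above to give UserB an isolated environment.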

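AI reviewer comments

These are the AI reviewers' comments on the model's output:

[CLAUDE] The candidate answer is of poor quality and fails to meet the core functional requirements. It neither uses the process information already present in the conversation context (port, data directory) to give targeted advice, nor offers a correct isolation scheme (it lacks chown or creating a system user, the --port and --data-dir parameters, and an explanation of the data-dir automatic isolation mechanism). The steps contain obvious errors (such as running mv on the configuration file and deleting gateway.db) and are not practically usable; overall it is far below the passing line.

[GEMINI] The candidate answer does not accurately understand how OpenClaw runs. Its suggested operations are destructive (such as moving the configuration file and deleting the database), and it does not follow the system prompt's core requirement to "isolate sessions, data directories, and permissions". By contrast, the reference answer provides a standard and safe implementation path, whereas the candidate's scheme not only fails to achieve isolation but could also crash the existing system.

[KIMI] The candidate output fails to meet the core requirements of the task. On isolation depth, it entirely omits key elements such as creating a system user, specifying the --data-dir and --port parameters, and explaining OpenClaw's automatic isolation mechanism, and instead proposes incorrect operations (migrating the configuration file, disabling the first instance's logging). On logical coherence, it does not reason from the given tool output (the ps and ls results), the assumed directory path has nothing to do with the actual situation, and the steps are disordered and run counter to the isolation goal. Overall performance is far below the passing line, and it does not give the user a workable isolation scheme.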
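
The reviewers' point about directory-level isolation can be illustrated with ordinary filesystem permissions. The sketch below creates one private directory per instance with mode 700, so that only the owning account can list or read what is stored there. The `userA` and `userB` names, the temporary base path, and the helper function are hypothetical, and creating dedicated system accounts and running `chown` on the directories (both of which need root) are left out.

```bash
#!/usr/bin/env bash
# Sketch: one private data directory per instance.
# Hypothetical names; a real setup would also chown each directory to a
# dedicated account, which requires root and is omitted here.

base="$(mktemp -d)"   # stand-in for the real parent directory

make_private_dir() {
  local dir="$1"
  mkdir -p "$dir"
  chmod 700 "$dir"    # owner only: no access for group or others
}

make_private_dir "$base/userA"
make_private_dir "$base/userB"

# Show the resulting mode bits (first 10 characters of `ls -ld`).
for d in "$base/userA" "$base/userB"; do
  ls -ld "$d" | cut -c1-10
done
```

Restricting the directory itself, rather than a single file inside it, covers logs and databases that an instance creates later.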

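Hard difficulty results

Score: 24.8
Pass status: Not passed

Model output

This is what the model actually generated for this task:

The current directory `/etc/openclaw/tenant1` has permissions `drwxr-xr-x` (755), which means the user has read, write, and execute permissions on the directory. However, the insufficient-permissions error needs further confirmation:

1. **Check directory ownership**: make sure the directory's owner is a user allowed to write (such as root).
   ```bash
   ls -l /etc/openclaw/tenant1 | grep "rw-r--r--"
   ```
2. **Adjust file permissions**: if permissions are indeed insufficient, temporarily change them to improve availability, for example:
   ```bash
   chmod 755 /etc/openclaw/tenant1
   ```
3. **Verify the database path**: confirm that the `gateway.db` file does exist in the specified directory and has no permission problems.

Fix the permissions as described above, then try starting the instance again.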

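AI reviewer comments

These are the AI reviewers' comments on the model's output:

[CLAUDE] Faced with known information (`ls -ld` clearly shows root as the owner), the candidate output fails to make the correct diagnosis, proposes ineffective or even wrong fixes (a redundant chmod, incorrect grep usage), entirely ignores the follow-up handling of the port-in-use problem, and proposes no multi-tenant isolation architecture. The gap from the reference answer is significant, and the core task requirements are not met.

[GEMINI] The model shows a serious knowledge gap in handling Linux permission problems. It fails to recognize the conflict between the root owner shown in the `ls -ld` output and the current user's permissions, and the fix it gives (such as `chmod 755`) is not only ineffective but also reveals a flawed understanding of the Linux permission model. In addition, for the multi-tenant isolation task, the model does not propose security practices such as moving the data directory into the user's home directory. Overall performance is below the passing line.

[KIMI] The candidate output deviates seriously from the task goal: it does not correctly diagnose the permission problem caused by root ownership, completely misses the port-in-use fault, and does not propose a complete multi-tenant isolation scheme. The technical commands contain obvious errors and the suggested fixes are ineffective; overall quality is far below the passing level.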
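
The diagnosis the reviewers expected hinges on reading the owner column of `ls -ld`, not only the mode bits: with mode 755 only the owner can write, so a non-owner gets a permission error, and running `chmod 755` on a directory that is already 755 changes nothing. A minimal sketch of that check follows, using a temporary directory as a stand-in for the tenant directory; the helper names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: report who owns a directory and whether the current user can write to it.

dir_owner() {
  # Third field of `ls -ld` is the owning user.
  ls -ld "$1" | awk '{print $3}'
}

can_write() {
  if [ -w "$1" ]; then echo "writable"; else echo "not-writable"; fi
}

target="$(mktemp -d)"   # stand-in for the directory under investigation
chmod 755 "$target"     # same mode as in the reviewed case

echo "owner:   $(dir_owner "$target")"
echo "current: $(id -un)"
echo "access:  $(can_write "$target")"
```

When the owner differs from the current user and the access line reads `not-writable`, the realistic options are to change ownership as root or to point the instance at a directory the user already owns.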
