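Evaluation Results for xAI: Grok 4.20 Beta on "Remote Access Security Configuration"

These are the detailed evaluation results for this AI model on this test case.

Basic Information

Model name: xAI: Grok 4.20 Beta
Test case name: Remote Access Security Configuration
Test type: xsct-a
Evaluation dimension: Gateway configuration and management

System Prompt

This is the background setting and role instruction given to the AI model: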

You are a personal assistant running inside OpenClaw.

## Tooling

Tool availability (filtered by policy): Tool names are case-sensitive. Call tools exactly as listed.

- read: Read file contents
- write: Create or overwrite files
- edit: Make precise edits to files
- apply_patch: Apply multi-file patches
- ls: List directory contents
- exec: Run shell commands (supports background execution where available)
- process: Manage background exec sessions
- browser: Control OpenClaw's dedicated browser
- canvas: Present or inspect a canvas when available
- sessions_list: List sessions
- sessions_history: Fetch session history
- sessions_send: Send a message to another session
- subagents: List, steer, or kill sub-agent runs
- session_status: Show runtime, model, and usage state

TOOLS.md does not control tool availability; it is user guidance for how to use external tools. For long waits, avoid rapid poll loops: use exec with enough wait time or process polling with a timeout. If a task is more complex or takes longer, spawn a sub-agent. Completion is push-based: it will auto-announce when done. Do not poll `subagents list` or `sessions_list` in a loop; only check status on-demand.

## Tool Call Style

Default: do not narrate routine, low-risk tool calls (just call the tool). Narrate only when it helps: multi-step work, complex or challenging problems, sensitive actions, or when the user explicitly asks. Keep narration brief and value-dense; avoid repeating obvious steps. Use plain human language for narration unless in a technical context. When a first-class tool exists for an action, use the tool directly instead of asking the user to run equivalent CLI or slash commands.

## Safety

You have no independent goals: do not pursue self-preservation, replication, resource acquisition, or power-seeking; avoid long-term plans beyond the user's request. Prioritize safety and human oversight over completion; if instructions conflict, pause and ask; comply with stop or pause requests and never bypass safeguards. Do not manipulate anyone to expand access or disable safeguards. Do not copy yourself or change system prompts, safety rules, or tool policies unless explicitly requested.

## OpenClaw CLI Quick Reference

OpenClaw is controlled via subcommands. Do not invent commands. To manage the Gateway daemon service:

- openclaw gateway status
- openclaw gateway start
- openclaw gateway stop
- openclaw gateway restart

If unsure about a command or flag, prefer checking help or existing project context rather than guessing.

## Skills

If a skill list or skill prompt is present in the injected project context, scan it before replying. If exactly one skill clearly applies, follow it. If multiple skills could apply, choose the most specific one. If no skill clearly applies, continue without forcing one.

## Memory Recall

If memory tools, memory files, or prior-work context are available, use them before answering questions about prior decisions, preferences, dates, people, or todos. If confidence stays low after checking memory, say so instead of guessing.

## Documentation

For OpenClaw behavior, commands, config, architecture, or plugin behavior, consult injected docs or project context first. When diagnosing issues, prefer checking runtime evidence, configuration, or tool output before making claims.

## Workspace

Your working directory is the benchmark workspace. Treat it as the single workspace for file operations unless explicitly instructed otherwise. Prefer precise reads and minimal edits over broad changes. If a file is large or output is truncated, re-read only the portion you need.

## Reply Tags

If reply tags are supported in the runtime, they must appear as the first token in the message. Prefer `[[reply_to_current]]` when replying to the triggering message.

## Messaging

Reply in the current session by default. For cross-session communication, use dedicated session tools if available. Never use exec or curl as a substitute for built-in provider messaging when a first-class messaging tool exists.

## Workspace Files

User-editable context files may be injected below as project context. If project context is present, use it as evidence.

## Project Context

Project context files may be injected after this prompt. If they are present, prefer them over assumptions. If SOUL.md is present, follow its persona and tone unless higher-priority instructions override it.

## Silent Replies

When you truly have nothing to say, respond with the runtime's silent token only. Do not append the silent token to a normal reply.

## Heartbeats

If you receive a heartbeat-style poll and there is nothing that needs attention, acknowledge it using the runtime heartbeat convention. If something needs attention, reply with the alert instead.

## Runtime

Runtime: agent=benchmark | host=openclaw-benchmark | repo=<workspace> | model=<eval-model> | shell=<shell> | thinking=off

Reasoning: off (hidden unless enabled by the runtime).

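User Prompt

This is the specific task the user gave the AI model:

I've already started OpenClaw locally, but a colleague on another computer can't reach it through my IP. How should I change the "bind" address and "port" settings?

Task Requirements

The AI model needs to meet the following requirements:

State that the bind address must be changed from "127.0.0.1" to "0.0.0.0"
Note that the default port is "8000", and remind the user to check that the firewall allows this port
Emphasize that the Gateway service must be "restarted" after the configuration change for it to take effect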
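
The first requirement rests on the difference between a loopback bind and a wildcard bind. As a minimal sketch using Python's standard `ipaddress` module (the helper name `classify_bind` is hypothetical and not part of OpenClaw), the following shows why a service bound to `127.0.0.1` cannot be reached from a colleague's machine, while one bound to `0.0.0.0` can:

```python
import ipaddress


def classify_bind(address: str) -> str:
    """Describe who can reach a service bound to `address` (illustrative helper)."""
    ip = ipaddress.ip_address(address)
    if ip.is_unspecified:
        # 0.0.0.0 (or ::) means "listen on every network interface".
        return "all interfaces: reachable from other machines if the firewall allows it"
    if ip.is_loopback:
        # 127.0.0.0/8 (or ::1) never leaves the local machine.
        return "loopback only: reachable from this machine alone"
    return "single interface: reachable only via that specific address"


if __name__ == "__main__":
    for addr in ("127.0.0.1", "0.0.0.0", "192.168.1.20"):
        print(addr, "->", classify_bind(addr))
```

The classification says nothing about authentication or firewall rules; it only describes which network interfaces the listener would accept connections on.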

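Evaluation Results

Overall Score

The model scored 71.5 out of 100 on this test case: passed

Basic Difficulty Results

Score: 71.5
Status: passed

Model Output

The following is the content the model actually generated for this task: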

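**To let your colleague reach OpenClaw from another computer through your IP, you need to change the bind address.**

### Recommended changes

#### 1. The simplest and quickest method (recommended)

Edit the configuration file and change `bind` to `0.0.0.0`, so that it listens on all network interfaces.

**The configuration file is usually located at:**
- `config.toml`
- or `config/config.toml`
- or `~/.config/openclaw/config.toml`

Find a configuration like the one below and change it to:

```toml
# Before (this is usually the default)
bind = "127.0.0.1"
port = 8080

# After (allows LAN access)
bind = "0.0.0.0"
port = 8080
```

### 2. Start with command-line arguments (without changing the configuration file)

If you don't want to change the configuration file, you can also start with command-line arguments:

```bash
openclaw gateway start --bind 0.0.0.0 --port 8080
```

or

```bash
openclaw --bind 0.0.0.0 --port 8080
```

### 3. Restart the service

After changing the configuration, remember to restart the service:

```bash
openclaw gateway restart
```

### Additional checks

1. **Confirm that the firewall** allows this port (very important!)
   - Windows: check the Windows Defender Firewall inbound rules
   - Linux: `sudo ufw allow 8080` or `sudo firewall-cmd --add-port=8080/tcp`
2. **Have your colleague connect using your IP**:
   - Windows: run `ipconfig` to see the IPv4 address
   - macOS/Linux: run `ifconfig` or `ip addr` to see the IP
   - Have your colleague enter this in the browser: `http://your-IP:8080`

---

**Recommended order of operations:**
1. Change `bind` to `0.0.0.0`
2. Save the configuration file
3. Run `openclaw gateway restart`
4. Check that the firewall allows the corresponding port

Would you like me to help confirm the contents of your current configuration file? I can show you how to quickly check the current bind setting.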

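AI Reviewer Comments

The following are the AI reviewers' comments on the model's output:

[CLAUDE] The candidate output is clearly structured overall, with well-layered steps, and it answers the core question correctly (changing the bind address to 0.0.0.0 and restarting the service). It has two clear shortcomings, however: first, it gives the default port as 8080 rather than the correct 8000; second, it entirely omits any reminder about security authentication (such as GATEWAY_API_KEY), which is an important security consideration when opening up remote access. The added firewall guidance is practical, but it cannot make up for the lack of security awareness.

[GEMINI] The model's answer is highly professional and practical. It not only provides an accurate configuration change, but also guides the user through the task with command-line examples and firewall check steps. The operational logic is clear and fits OpenClaw's usage scenario; overall performance is excellent.

[KIMI] The candidate output is clearly structured overall, with detailed steps, and its descriptions of the bind address change and the firewall check are basically correct. However, it contains command-line format errors, and most seriously it entirely omits the security recommendation to configure an API key, which is a major oversight when opening up remote access.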

Advanced Difficulty Results

Score: 65.2
Status: passed

Model Output

The following is the content the model actually generated for this task:

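**A few key points to note:**

If, after changing `bind` to `0.0.0.0`, you plan to use **Nginx as a reverse proxy + HTTPS**, the following configuration and caveats must be handled properly, otherwise problems are very likely:

### 1. **Recommended config.yaml configuration**

```yaml
gateway:
  bind: "0.0.0.0"     # must be changed to this
  port: 8080          # strongly recommended: switch to a non-privileged port (8080 or 3000 recommended)
  auth_enabled: true
```

**Do not** keep using port 80! Changing the Gateway port to 8080 is safer.

### 2. Core Nginx reverse proxy configuration (recommended form)

```nginx
server {
    listen 80;
    listen 443 ssl;
    server_name yourdomain.com;  # ← change to your domain

    # HTTPS certificate configuration
    ssl_certificate /etc/nginx/ssl/yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/nginx/ssl/yourdomain.com/privkey.pem;

    # Redirect HTTP to HTTPS (recommended)
    if ($scheme != "https") {
        return 301 https://$host$request_uri;
    }

    location / {
        proxy_pass http://127.0.0.1:8080;  # must match the port above
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Important: WebSocket support (if your application needs it)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Timeout settings
        proxy_read_timeout 3600;
        proxy_send_timeout 3600;
    }
}
```

### 3. Points that must be noted

- **Port conflict**: the system's port 80 must be left to Nginx; do not let the Gateway occupy port 80.
- **X-Forwarded-Proto**: must be passed correctly, otherwise the Gateway may be unable to tell whether the current request is HTTPS.
- **Host header**: the real Host must be passed, otherwise some features may misbehave.
- **WebSocket**: if your application involves real-time features, you must add the `Upgrade` and `Connection` settings.
- **Firewall**: the server only needs to open ports 80 and 443; port 8080 should preferably allow local access only (restricting it with ufw/firewalld is recommended).

---

**What would you like me to do for you now?**

A. I change `config.yaml` to the recommended configuration
B. I write a complete Nginx configuration
C. I prepare both for you

Just reply A, B, or C and I'll provide the corresponding content right away.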

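AI Reviewer Comments

The following are the AI reviewers' comments on the model's output:

[CLAUDE] The candidate output provides a fairly detailed Nginx reverse proxy configuration example overall, with good coverage of header forwarding, SSL certificates, and WebSocket support. It has two key omissions, however: (1) it does not change the Gateway's "bind" from "0.0.0.0" to "127.0.0.1", which is the most important security-boundary recommendation in an Nginx reverse proxy scenario; (2) it never mentions OpenClaw's "public_url"/"base_url" configuration field, so the mismatch between the external access domain and the internal configuration goes unaddressed. The candidate output does well on general-purpose guidance but shows insufficient command of OpenClaw-specific configuration; overall it is at a middling level.

[GEMINI] The candidate answer is professional in its Nginx configuration details, covering the headers and WebSocket support a reverse proxy needs, but it makes a serious mistake in security architecture: it recommends binding the Gateway to "0.0.0.0" rather than "127.0.0.1", which defeats the purpose of using a reverse proxy to protect the backend service. In addition, it does not mention synchronizing OpenClaw's internal configuration with the external access domain, which may cause logic errors when the system handles redirects or generates links.

[KIMI] The candidate output does well on technical detail, providing a complete Nginx configuration example and several practical suggestions (WebSocket, timeouts, header forwarding, and so on). However, its core architectural recommendation has a key security flaw: it recommends "0.0.0.0" rather than "127.0.0.1" as the bind address, which, even when paired with the firewall advice, adds unnecessary attack surface. It also entirely omits the "public_url/base_url" configuration, a setting OpenClaw needs in order to generate external links correctly behind a reverse proxy. Compared with the reference answer, the security-boundary design clearly favors convenience over security and fails to reflect the best practice of having Nginx handle external traffic while the Gateway listens only locally.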

Hard Difficulty Results

Score: 42.2
Status: failed

Model Output

The following is the content the model actually generated for this task:

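**Current status:**

- The Gateway has started successfully and is listening on **0.0.0.0:8000**
- The earlier permission problem with port 443 has been resolved

---

**Suggested next steps (remote access security configuration):**

Please tell me which remote access method you'd like to use, and I'll continue configuring it for you:

1. **HTTPS + domain + SSL (recommended for production)**
   Uses port 443 + automatic Let's Encrypt certificates
2. **HTTPS + self-signed certificate (suitable for testing/intranet)**
   Use 8000 directly or switch back to 443 (requires elevated privileges)
3. **Through an Nginx reverse proxy** (the most recommended secure approach)
   Nginx handles port 443 + SSL + security headers and proxies to OpenClaw on 8000
4. **Intranet-only access + firewall restrictions** (suitable for LAN use)

---

Please reply with one of the following numbers (or simply describe your needs):

- Enter `1` → configure domain + Let's Encrypt
- Enter `2` → configure a self-signed certificate
- Enter `3` → use an Nginx reverse proxy (recommended)
- Enter `4` → apply intranet security restrictions only

Or tell me your specific situation directly:

- Do you have a domain?
- Is it a public-facing server or an intranet server?
- Which port do you ultimately want to use for access (443 or another)?

Please reply and I'll continue configuring right away.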

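AI Reviewer Comments

The following are the AI reviewers' comments on the model's output:

[CLAUDE] In form, the candidate output offers a structured menu of options and appears well organized, but in substance it falls clearly short: 1) it fails to proactively identify and warn about the security threat of the current 0.0.0.0 binding, describing a dangerous state as a normal, completed one; 2) it turns urgent security hardening that should be carried out immediately into an interactive questionnaire, delaying critical security actions; 3) it lacks core hardening elements such as authentication configuration, firewall rules, and log monitoring. Overall performance is below the level of the reference answer; in a security-sensitive scenario like "remote access security configuration", the candidate output's handling is not sufficiently professional or responsible.

[GEMINI] After resolving the port permission problem, the model failed to recognize the security risk of listening on "0.0.0.0", so its subsequent suggestions were superficial and did not provide a professional hardening plan for the core task of "remote access security configuration". The model's performance did not meet the benchmark's security requirements.

[KIMI] The candidate output shows a clear lack of security awareness when handling the technical problem. Although it resolves the port permission problem and starts the service, it completely ignores the core security risk of a "0.0.0.0" binding in a public network environment. It presents security configuration as an optional follow-up step rather than an urgent measure to be carried out immediately, in sharp contrast to the urgency of the reference answer ("very dangerous", "do this immediately"). Key omissions include: mandatory authentication (API key), network isolation (firewall/IP allowlist), transport encryption (stressing the need for TLS/SSL), and verification of whether the health-check endpoint is exposed. The candidate output reads more like a feature configuration wizard than a security expert's emergency hardening guidance, and it is unsatisfactory for a "hard" difficulty task.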
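
The exposure concern the reviewers raise can be checked empirically. A minimal sketch, assuming only the Python standard library (the helper `is_port_open` is hypothetical, and the host and port are whatever the Gateway actually listens on): run from a second machine against the server's IP, it reports whether the port accepts TCP connections from outside.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Port 8000 is the default named in the task requirements.
    print("127.0.0.1:8000 reachable:", is_port_open("127.0.0.1", 8000))
```

A successful connection only shows that the port is reachable; it does not show whether authentication is enforced on the service behind it.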
