New documentation pages covering UAT Agent setup, ECS and Model configs and how to run agent scripts

This commit is contained in:
Tom Sakretz 2026-01-29 13:31:35 +01:00
parent d86ece2a2a
commit 14f84e78e9
4 changed files with 418 additions and 3 deletions

@@ -1,7 +1,21 @@
 ---
-title: "Autonomous UAT Agent"
-linkTitle: "Autonomous UAT Agent"
+title: "General Documentation"
+linkTitle: "Documentation"
 weight: 10
 description: >
-  Deliverable D66 documentation for the Autonomous UAT Agent (Business + Technical)
+  General documentation for D66 and the Autonomous UAT Agent (primarily technical, English, Markdown)
 ---
+
+# General Documentation (D66)
+
+This section contains the core documentation for D66, focusing on how the Autonomous UAT Agent works and how to run it.
+
+## Pages
+
+- [Overview](./overview.md)
+- [Quickstart](./quickstart.md)
+- [Running Autonomous UAT Agent Scripts](./running-auata-scripts.md)
+- [Workflow Diagram](./agent-workflow-diagram.md)
+- [Model Stack](./model-stack.md)
+- [Outputs & Artifacts](./outputs-and-artifacts.md)
+- [Troubleshooting](./troubleshooting.md)

@@ -0,0 +1,83 @@
---
title: "Agent Workflow Diagram"
linkTitle: "Workflow Diagram"
weight: 5
description: >
  Visual workflow of a typical Agent S (Autonomous UAT Agent) run (gui_agent_cli.py) across Ministral, Holo, and VNC
---
# Agent Workflow Diagram (Autonomous UAT Agent)
This page provides a **visual sketch** of the typical workflow (example: `gui_agent_cli.py`).
## High-level data flow
```mermaid
flowchart LR
    %% Left-to-right overview of one typical agent loop
    user[Operator / Prompt] --> cli["Agent S script (gui_agent_cli.py)"]
    subgraph otc["OTC (Open Telekom Cloud)"]
        subgraph ecsMin["ecs_ministral_L4 (164.30.28.242:8001) - Ministral vLLM"]
            ministral[(Ministral 3 8B - Thinking / Planning)]
        end
        subgraph ecsHolo["ecs_holo_A40 (164.30.22.166:8000) - Holo vLLM"]
            holo[(Holo 1.5-7B - Vision / Grounding)]
        end
        subgraph ecsGui["GUI test target (VNC + Firefox)"]
            vnc[VNC / Desktop]
            browser[Firefox]
        end
    end
    cli -->|"1. plan step (vLLM_THINKING_ENDPOINT)"| ministral
    ministral -->|"next action (click/type/wait)"| cli
    cli -->|"2. capture screenshot"| vnc
    vnc -->|"screenshot (PNG)"| cli
    cli -->|"3. grounding request (vLLM_VISION_ENDPOINT)"| holo
    holo -->|"coordinates + UI element info"| cli
    cli -->|"4. execute action (mouse/keyboard)"| vnc
    vnc --> browser
    cli -->|"logs + screenshots (results/ folder)"| artifacts[("Artifacts: logs, screenshots, JSON comms")]
```
## Sequence (one loop)
```mermaid
sequenceDiagram
    autonumber
    actor U as Operator
    participant CLI as gui_agent_cli.py
    participant MIN as Ministral vLLM<br/>(ecs_ministral_L4)
    participant VNC as VNC Desktop<br/>(Firefox)
    participant HOLO as Holo vLLM<br/>(ecs_holo_A40)
    U->>CLI: Provide goal / prompt
    loop Step loop (until done)
        CLI->>MIN: Plan next step (text-only reasoning)
        MIN-->>CLI: Next action (intent)
        CLI->>VNC: Capture screenshot
        VNC-->>CLI: Screenshot (image)
        CLI->>HOLO: Ground UI element(s) in screenshot
        HOLO-->>CLI: Coordinates + element metadata
        CLI->>VNC: Execute click/type/scroll
    end
    CLI-->>U: Result summary + saved artifacts
```
## Notes
- The **thinking** and **grounding** models are deliberately separate: this split improves coordinate reliability and makes failures easier to debug.
- The agent loop typically produces artifacts (logs + screenshots) which are later copied into D66 evidence bundles.
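To make the first hop of this loop concrete, here is a minimal sketch of a single planning request, using the OpenAI-compatible chat completions API that vLLM serves (endpoint, model path, and API key are taken from the [Model Stack](./model-stack.md) page; the prompt itself is illustrative):
```bash
# One planning request to the thinking endpoint (vLLM, OpenAI-compatible).
# Endpoint/model/key values are from the Model Stack page; the prompt is an example.
curl -s http://164.30.28.242:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy-key" \
  -d '{
    "model": "/home/ubuntu/ministral-vllm/models/ministral-3-8b",
    "temperature": 0.0,
    "messages": [
      {"role": "user", "content": "Goal: open telekom.de and click the cart icon. Reply with the next single UI action."}
    ]
  }'
```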

@@ -0,0 +1,88 @@
---
title: "Model Stack"
linkTitle: "Models"
weight: 4
description: >
  Thinking vs grounding model split for D66 (current state and target state)
---
# Model Stack (D66)
For a visual overview of how the models interact with the VNC-based GUI automation loop, see: [Workflow Diagram](./agent-workflow-diagram.md)
## Requirement
D66 must use **open-source models from European companies**.
## Target setup
- **Thinking / planning:** Ministral
- **Grounding / coordinates:** Holo 1.5
The Agent S framework runs an iterative loop: it uses a reasoning model to decide *what to do next* (plan the next action) and a grounding model to translate UI intent into *pixel-accurate coordinates* on the current screenshot. This split is essential for reliable GUI automation because planning and “where exactly to click” are different problems and benefit from different model capabilities.
## Why split models?
- Reasoning models optimize planning and textual decision making
- Vision/grounding models optimize stable coordinate output
- Separation reduces “coordinate hallucinations” and makes debugging easier
## Current state in repo
- Some scripts and docs still reference historical **Claude** and **Pixtral** experiments.
- **Pixtral is not suitable for pixel-level grounding in this use case**: in our evaluations it did not provide the consistency and coordinate stability required for reliable UI automation.
- In an early prototyping phase, **Anthropic Claude Sonnet** was useful due to its strong instruction following and reasoning quality; however, it does not meet the D66 constraints (open-source + European provider), so it could not be used for the D66 target solution.
## Current configuration (D66)
### Thinking model: Ministral 3 8B (Instruct)
- HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_ministral_L4` (public IP: `164.30.28.242`)
- Flavor: GPU-accelerated | 16 vCPUs | 64 GiB | `pi5e.4xlarge.4`
- GPU: 1 × NVIDIA Tesla L4 (24 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (chat completions)
- Endpoint env var: `vLLM_THINKING_ENDPOINT`
- Current server (deployment reference): `http://164.30.28.242:8001/v1`
- Recommendation: set `vLLM_THINKING_ENDPOINT` explicitly (do not rely on script defaults).
**Operational note:** vLLM is configured to **auto-start on server boot** (OTC ECS restart) via `systemd`.
**Key serving settings (vLLM):**
- `--gpu-memory-utilization 0.90`
- `--max-model-len 32768`
- `--host 0.0.0.0`
- `--port 8001`
**Key client settings (Autonomous UAT Agent scripts):**
- `model`: `/home/ubuntu/ministral-vllm/models/ministral-3-8b`
- `temperature`: `0.0`
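Taken together, these settings imply a serving command along the following lines (a sketch only; the actual invocation is wrapped by the `systemd` unit on `ecs_ministral_L4`):
```bash
# Approximation of the vLLM serving command implied by the settings above;
# the systemd unit on the ECS wraps the real invocation.
vllm serve /home/ubuntu/ministral-vllm/models/ministral-3-8b \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --host 0.0.0.0 \
  --port 8001
```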
### Grounding model: Holo 1.5-7B
- HuggingFace model card: https://huggingface.co/holo-1.5-7b
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_holo_A40` (public IP: `164.30.22.166`)
- Flavor: GPU-accelerated | 48 vCPUs | 384 GiB | `g7.12xlarge.8`
- GPU: 1 × NVIDIA A40 (48 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (multimodal grounding)
- Endpoint env var: `vLLM_VISION_ENDPOINT`
- Current server (deployment reference): `http://164.30.22.166:8000/v1`
**Key client settings (grounding / coordinate space):**
- `model`: `holo-1.5-7b`
- Native coordinate space: `3840×2160` (4K)
- Client grounding dimensions:
  - `grounding_width`: `3840`
  - `grounding_height`: `2160`
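As a rough illustration of what these dimensions mean: Holo returns coordinates in its native `3840×2160` space, so a client driving a desktop at a different resolution has to rescale before clicking (all values below are illustrative; the real conversion lives in the agent scripts):
```bash
# Illustrative only: rescale a coordinate from Holo's native 3840x2160 space
# to the actual desktop resolution before clicking.
HOLO_X=1920; HOLO_Y=1080      # coordinate as returned in the 3840x2160 space
SCREEN_W=1920; SCREEN_H=1080  # actual VNC desktop size
awk -v x="$HOLO_X" -v y="$HOLO_Y" -v sw="$SCREEN_W" -v sh="$SCREEN_H" \
  'BEGIN { printf "click at %d,%d\n", x * sw / 3840, y * sh / 2160 }'
```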
Notes:
- Prompting and output-format hardening (reliability work):
  - `docs/story-026-001-context.md` (Holo output reliability)
  - `docs/story-025-001-context.md` (double grounding / calibration)
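For reference, a grounding request uses the same OpenAI-compatible chat format with the screenshot attached as an image. A minimal sketch (prompt wording and screenshot filename are illustrative; the hardened prompt formats are covered by the stories above):
```bash
# One grounding request to the Holo endpoint (vLLM, OpenAI-compatible multimodal).
# Endpoint/model/key values are from this page; prompt and screenshot are examples.
IMG_B64=$(base64 -w0 screenshot.png)
curl -s http://164.30.22.166:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy-key" \
  -d @- <<EOF
{
  "model": "holo-1.5-7b",
  "temperature": 0.0,
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Return the pixel coordinates of the cart icon."},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,${IMG_B64}"}}
    ]
  }]
}
EOF
```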

@@ -0,0 +1,230 @@
---
title: "Running Autonomous UAT Agent Scripts"
linkTitle: "Run Scripts"
weight: 3
description: >
  How to run the key D66 evaluation scripts and what they produce
---
# Running Autonomous UAT Agent Scripts
All commands below assume you are running from the **Agent-S repository root** (Linux/ECS), i.e. the folder that contains `staging_scripts/`.
The **Autonomous UAT Agent** is the overall UX/UI testing use case built on top of the Agent S codebase and scripts in this repo.
If you are inside the monorepo workspace, first `cd ~/Projects/Agent_S3/Agent-S` on the Ubuntu ECS and then run the same commands.
## One-command recommended run (ECS)
If you run only one thing to produce clean, repeatable evidence (screenshots with click markers), make it the calibration CLI:
```bash
DISPLAY=:1 python staging_scripts/gui_agent_cli.py --prompt "Go to telekom.de and click the cart icon" --max-steps 10
```
This writes screenshots to `./results/gui_agent_cli/<timestamp>/screenshots/`.
## ECS runner notes
- **Working directory matters:** the default output path is relative to the current working directory (it should be the Agent-S repo root on ECS).
- **GUI required:** `pyautogui` needs an X server (`DISPLAY=:1` is assumed by most scripts).
- **Persistence:** if you want results after the task ends, ensure `./results/` is on a mounted volume or copied out as an artifact (see the sketch below).
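One way to bundle a finished run and pull it off the ECS (key path and IP match the golden run section below; adjust both to your setup):
```bash
# On the ECS: bundle the results folder of the current repo checkout
tar czf uat-results-$(date +%Y%m%d-%H%M%S).tar.gz ./results/
# From the Windows workstation (key path / IP as in the golden run section):
# scp -i "C:\Path to KeyPair\KeyPair-ECS.pem" ubuntu@80.158.3.120:~/Projects/Agent_S3/Agent-S/uat-results-*.tar.gz .
```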
## Prerequisites (runtime)
- Linux GUI session (VNC/Xvfb) because these scripts drive a real browser via `pyautogui`.
- A working `DISPLAY` (most of the scripts assume `:1`).
- Network access to the model endpoints (thinking + vision/grounding).
Common environment variables used by the vLLM-backed scripts (example exports below):
- `vLLM_THINKING_ENDPOINT` (default in code if unset)
- `vLLM_VISION_ENDPOINT` (default in code if unset)
- `vLLM_API_KEY` (default: `dummy-key`)
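For a typical ECS session this amounts to something like the following (endpoint values are the current deployment references from the [Model Stack](./model-stack.md) page):
```bash
# Point the scripts at the current vLLM deployments (values from the Model Stack page)
export vLLM_THINKING_ENDPOINT="http://164.30.28.242:8001/v1"
export vLLM_VISION_ENDPOINT="http://164.30.22.166:8000/v1"
export vLLM_API_KEY="dummy-key"   # default; change if your deployment enforces a real key
```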
## Key scripts (repo locations)
Core scripts referenced for D66 demonstrations:
- UI check (Agent S3): `staging_scripts/1_UI_check_AS3.py`
- Functional correctness check: `staging_scripts/1_UI_functional_correctness_check.py`
- Visual quality audit: `staging_scripts/2_UX_visual_quality_audit.py`
- Task-based UX flow (newsletter): `staging_scripts/3_UX_taskflow_newsletter_signup.py`
Calibration / CLI entry point (used for click coordinate scaling validation):
- GUI Agent CLI (Holo click calibration): `staging_scripts/gui_agent_cli.py`
Legacy / historical:
- `staging_scripts/old scripts/agent_s3_1_old.py`
- `staging_scripts/old scripts/agent_s3_ui_test.py`
## Common configuration knobs
Many scripts support these environment variables:
- `AS2_TARGET_URL`: website URL to test
- `AS2_MAX_STEPS`: max steps (varies by script)
- `ASK_EVERY_STEPS`: interactive prompt cadence
Execution environment:
- Linux GUI environment typically expects `DISPLAY=:1`
## Recommended: run gui_agent_cli.py (calibration / click precision)
This is the “clean” CLI entry point for repeatable calibration runs.
Minimal run (prompt mode):
```bash
python staging_scripts/gui_agent_cli.py \
--prompt "Go to telekom.de and click the cart icon" \
--max-steps 30
```
Optional scaling factors for debugging (defaults to `1.0` / `1.0`):
```bash
python staging_scripts/gui_agent_cli.py \
--prompt "Go to telekom.de and click the cart icon" \
--x-scale 2.0 \
--y-scale 2.0 \
--max-steps 30
```
Outputs:
- Default run folder: `./results/gui_agent_cli/<timestamp>/`
- Screenshots: `./results/gui_agent_cli/<timestamp>/screenshots/`
- Text log (stdout/stderr): `./results/gui_agent_cli/<timestamp>/logs/run.log`
If `--enable-logging` is set, the script also writes a structured JSON communication log (Story 026-002) into the same run `logs/` folder by default.
Enable model communication logging (recommended when debugging mis-clicks):
```bash
python staging_scripts/gui_agent_cli.py \
--prompt "Click the Telekom icon" \
--max-steps 10 \
--output-dir ./results/gui_agent_cli/debug_run_telekom_icon \
--enable-logging \
--log-output-dir ./results/gui_agent_cli/debug_run_telekom_icon/logs
```
## Golden run (terminal on ECS)
This is the “golden run” command sequence currently used for D66 evidence generation.
### 1) Connect from Windows
```powershell
ssh -i "C:\Path to KeyPair\KeyPair-ECS.pem" ubuntu@80.158.3.120
```
### 2) Prepare the ECS runtime (GUI + browser)
```bash
# Activate venv
# Recommended: use the Agent S3 venv
source ~/Projects/Agent_S3/Agent-S/venv/bin/activate
# Go to Agent-S repo root
cd ~/Projects/Agent_S3/Agent-S
# Start VNC (DISPLAY=:1) and a browser
vncserver :1
export XAUTHORITY="$HOME/.Xauthority"
export DISPLAY=":1"
firefox &
```
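Before launching a run, it can save time to confirm that both model endpoints respond; vLLM's OpenAI-compatible server exposes a `/v1/models` route for this:
```bash
# Sanity check: both vLLM endpoints should list their served model
curl -s -H "Authorization: Bearer dummy-key" http://164.30.28.242:8001/v1/models
curl -s -H "Authorization: Bearer dummy-key" http://164.30.22.166:8000/v1/models
```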
### 3) Run the golden prompt
```bash
python staging_scripts/gui_agent_cli.py \
--prompt "Role: You are a UI/UX testing agent specializing in functional correctness.
Goal: Test all interactive elements in the header navigation on www.telekom.de for functional weaknesses.
Tasks:
1. Navigate to the website
2. Identify and test interactive elements (buttons, links, forms, menus)
3. Check for broken flows, defective links, non-functioning elements
4. Document issues found
Report Format:
Return findings in the 'issues' field as a list of objects:
- element: Name/description of the element
- location: Where on the page
- problem: What doesn't work
- recommendation: How to fix it
If no problems found, return an empty array: []" \
--max-steps 30
```
Golden run artifacts:
- Screenshots: `./results/gui_agent_cli/<timestamp>/screenshots/`
- Text log: `./results/gui_agent_cli/<timestamp>/logs/run.log`
- Optional JSON comm log (if enabled): `./results/gui_agent_cli/<timestamp>/logs/calibration_log_*.json`
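To inspect the optional JSON comm log of the latest run, pretty-printing is usually enough (the log's internal structure is script-specific, so no field names are assumed here):
```bash
# Pretty-print the newest JSON communication log, if one exists
latest=$(ls -t ./results/gui_agent_cli/*/logs/calibration_log_*.json 2>/dev/null | head -n 1)
[ -n "$latest" ] && python -m json.tool "$latest" | less
```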
## Alternative: run the agent via a web interface (Frontend)
Work in progress.
We are currently updating the web-based view and its ECS runner integration. This section will be filled with the correct, up-to-date instructions once the frontend flow supports the current Autonomous UAT Agent + `gui_agent_cli.py` workflow.
## Run the D66 evaluation scripts (staging_scripts)
These scripts are used for D66-style evaluation runs and tend to write their artifacts into `staging_scripts/` (DB, screenshots, JSON).
### UI check (Agent S3)
Typical pattern (URL via env var + optional run control args):
```bash
export AS2_TARGET_URL="https://www.leipzig.de"
export AS2_MAX_STEPS="20"
python staging_scripts/1_UI_check_AS3.py --auto-yes --ask-every 1000
```
Notes:
- Supports `--job-id <id>` (used by runners) and uses `JOB_ID` as a fallback.
- Writes JSON to `./agent_output/raw_json/<job_id>/` and screenshots/overlays to `staging_scripts/Screenshots/...`.
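For example, a runner-style invocation with an explicit job id (the id value is illustrative):
```bash
# Pin the job id so JSON outputs land in a predictable folder
export AS2_TARGET_URL="https://www.leipzig.de"
python staging_scripts/1_UI_check_AS3.py --auto-yes --ask-every 1000 --job-id demo-001
ls ./agent_output/raw_json/demo-001/
```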
### Functional correctness check
```bash
export AS2_TARGET_URL="https://www.leipzig.de"
export AS2_MAX_STEPS="0" # 0 = no limit (script-specific)
python staging_scripts/1_UI_functional_correctness_check.py --auto-yes --ask-every 1000
```
### Visual quality audit
This script currently uses a hardcoded `WEBSITE_URL` near the top of the file. Update it and then run:
```bash
python staging_scripts/2_UX_visual_quality_audit.py --auto-yes --ask-every 10
```
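If the URL sits in a plain `WEBSITE_URL = "..."` assignment near the top of the file (an assumption; check the script first), it can be swapped before the run:
```bash
# Assumes a plain string assignment in the script; verify the result afterwards
sed -i 's|^WEBSITE_URL = .*|WEBSITE_URL = "https://www.leipzig.de"|' \
  staging_scripts/2_UX_visual_quality_audit.py
grep -n "WEBSITE_URL" staging_scripts/2_UX_visual_quality_audit.py | head -n 3
```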
### Task-based UX flow (newsletter)
This script is currently a staging/WIP script; verify it runs in your environment before relying on it for evidence.
## Outputs to expect
Most scripts record one or more of:
- `uxqa.db` (run log DB)
- screenshots/overlays under `staging_scripts/Screenshots/...`
- JSON step outputs under `agent_output/` (paths vary by script)
- calibration CLI outputs under `./results/gui_agent_cli/<timestamp>/`
See [Outputs & Artifacts](./outputs-and-artifacts.md).
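A quick way to take inventory after a run (paths from the list above; some locations exist only for specific scripts):
```bash
# Newest artifacts per location; missing folders are silently skipped
ls -lt staging_scripts/Screenshots/ 2>/dev/null | head
ls -lt agent_output/ 2>/dev/null | head
ls -lt results/gui_agent_cli/ 2>/dev/null | head
ls -l uxqa.db staging_scripts/uxqa.db 2>/dev/null
```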
## Notes on model usage
Some scripts still contain legacy model configs (Claude/Pixtral). The D66 target configuration is documented in [Model Stack](./model-stack.md).