New documentation pages covering UAT Agent setup, ECS and Model configs and how to run agent scripts

This commit is contained in:
Tom Sakretz 2026-01-29 13:31:35 +01:00
parent d86ece2a2a
commit 14f84e78e9
4 changed files with 418 additions and 3 deletions

@@ -1,7 +1,21 @@
 ---
-title: "Autonomous UAT Agent"
-linkTitle: "Autonomous UAT Agent"
+title: "General Documentation"
+linkTitle: "Documentation"
 weight: 10
 description: >
-  Deliverable D66 documentation for the Autonomous UAT Agent (Business + Technical)
+  General documentation for D66 and the Autonomous UAT Agent (primarily technical, English, Markdown)
 ---
+
+# General Documentation (D66)
+
+This section contains the core documentation for D66, focusing on how the Autonomous UAT Agent works and how to run it.
+
+## Pages
+
+- [Overview](./overview.md)
+- [Quickstart](./quickstart.md)
+- [Running Autonomous UAT Agent Scripts](./running-auata-scripts.md)
+- [Workflow Diagram](./agent-workflow-diagram.md)
+- [Model Stack](./model-stack.md)
+- [Outputs & Artifacts](./outputs-and-artifacts.md)
+- [Troubleshooting](./troubleshooting.md)

@@ -0,0 +1,83 @@
---
title: "Agent Workflow Diagram"
linkTitle: "Workflow Diagram"
weight: 5
description: >
  Visual workflow of a typical Agent S (Autonomous UAT Agent) run (gui_agent_cli.py) across Ministral, Holo, and VNC
---
# Agent Workflow Diagram (Autonomous UAT Agent)
This page provides a **visual sketch** of the typical workflow (example: `gui_agent_cli.py`).
## High-level data flow
```mermaid
flowchart LR
    %% Left-to-right overview of one typical agent loop
    user[Operator / Prompt] --> cli["Agent S script (gui_agent_cli.py)"]
    subgraph otc["OTC (Open Telekom Cloud)"]
        subgraph ecsMin["ecs_ministral_L4 (164.30.28.242:8001) - Ministral vLLM"]
            ministral[(Ministral 3 8B - Thinking / Planning)]
        end
        subgraph ecsHolo["ecs_holo_A40 (164.30.22.166:8000) - Holo vLLM"]
            holo[(Holo 1.5-7B - Vision / Grounding)]
        end
        subgraph ecsGui["GUI test target (VNC + Firefox)"]
            vnc[VNC / Desktop]
            browser[Firefox]
        end
    end
    cli -->|"1. plan step (vLLM_THINKING_ENDPOINT)"| ministral
    ministral -->|"next action (click/type/wait)"| cli
    cli -->|"2. capture screenshot"| vnc
    vnc -->|"screenshot (PNG)"| cli
    cli -->|"3. grounding request (vLLM_VISION_ENDPOINT)"| holo
    holo -->|"coordinates + UI element info"| cli
    cli -->|"4. execute action (mouse/keyboard)"| vnc
    vnc --> browser
    cli -->|"logs + screenshots (results/ folder)"| artifacts[("Artifacts: logs, screenshots, JSON comms")]
```
## Sequence (one loop)
```mermaid
sequenceDiagram
    autonumber
    actor U as Operator
    participant CLI as gui_agent_cli.py
    participant MIN as Ministral vLLM<br/>(ecs_ministral_L4)
    participant VNC as VNC Desktop<br/>(Firefox)
    participant HOLO as Holo vLLM<br/>(ecs_holo_A40)
    U->>CLI: Provide goal / prompt
    loop Step loop (until done)
        CLI->>MIN: Plan next step (text-only reasoning)
        MIN-->>CLI: Next action (intent)
        CLI->>VNC: Capture screenshot
        VNC-->>CLI: Screenshot (image)
        CLI->>HOLO: Ground UI element(s) in screenshot
        HOLO-->>CLI: Coordinates + element metadata
        CLI->>VNC: Execute click/type/scroll
    end
    CLI-->>U: Result summary + saved artifacts
```
## Notes
- The **thinking** and **grounding** models are deliberately separate: this split improves coordinate reliability and makes failures easier to debug.
- The agent loop typically produces artifacts (logs + screenshots) which are later copied into D66 evidence bundles.
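To make the first hop of this loop concrete, here is a minimal sketch of a single planning request, using the OpenAI-compatible chat completions API that vLLM serves (endpoint, model path, and API key are taken from the [Model Stack](./model-stack.md) page; the prompt itself is illustrative):
```bash
# One planning request to the thinking endpoint (vLLM, OpenAI-compatible).
# Endpoint/model/key values are from the Model Stack page; the prompt is an example.
curl -s http://164.30.28.242:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy-key" \
  -d '{
    "model": "/home/ubuntu/ministral-vllm/models/ministral-3-8b",
    "temperature": 0.0,
    "messages": [
      {"role": "user", "content": "Goal: open telekom.de and click the cart icon. Reply with the next single UI action."}
    ]
  }'
```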

@@ -0,0 +1,88 @@
---
title: "Model Stack"
linkTitle: "Models"
weight: 4
description: >
  Thinking vs grounding model split for D66 (current state and target state)
---
# Model Stack (D66)
For a visual overview of how the models interact with the VNC-based GUI automation loop, see: [Workflow Diagram](./agent-workflow-diagram.md)
## Requirement
D66 must use **open-source models from European companies**.
## Target setup
- **Thinking / planning:** Ministral
- **Grounding / coordinates:** Holo 1.5
The Agent S framework runs an iterative loop: it uses a reasoning model to decide *what to do next* (plan the next action) and a grounding model to translate UI intent into *pixel-accurate coordinates* on the current screenshot. This split is essential for reliable GUI automation because planning and “where exactly to click” are different problems and benefit from different model capabilities.
## Why split models?
- Reasoning models optimize planning and textual decision making
- Vision/grounding models optimize stable coordinate output
- Separation reduces “coordinate hallucinations” and makes debugging easier
## Current state in repo
- Some scripts and docs still reference historical **Claude** and **Pixtral** experiments.
- **Pixtral is not suitable for pixel-level grounding in this use case**: in our evaluations it did not provide the consistency and coordinate stability required for reliable UI automation.
- In an early prototyping phase, **Anthropic Claude Sonnet** was useful due to its strong instruction following and reasoning quality; however, it does not meet the D66 constraints (open-source + European provider), so it could not be used for the D66 target solution.
## Current configuration (D66)
### Thinking model: Ministral 3 8B (Instruct)
- HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_ministral_L4` (public IP: `164.30.28.242`)
- Flavor: GPU-accelerated | 16 vCPUs | 64 GiB | `pi5e.4xlarge.4`
- GPU: 1 × NVIDIA Tesla L4 (24 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (chat completions)
- Endpoint env var: `vLLM_THINKING_ENDPOINT`
- Current server (deployment reference): `http://164.30.28.242:8001/v1`
- Recommendation: set `vLLM_THINKING_ENDPOINT` explicitly (do not rely on script defaults).
**Operational note:** vLLM is configured to **auto-start on server boot** (OTC ECS restart) via `systemd`.
**Key serving settings (vLLM):**
- `--gpu-memory-utilization 0.90`
- `--max-model-len 32768`
- `--host 0.0.0.0`
- `--port 8001`
**Key client settings (Autonomous UAT Agent scripts):**
- `model`: `/home/ubuntu/ministral-vllm/models/ministral-3-8b`
- `temperature`: `0.0`
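Taken together, these settings imply a serving command along the following lines (a sketch only; the actual invocation is wrapped by the `systemd` unit on `ecs_ministral_L4`):
```bash
# Approximation of the vLLM serving command implied by the settings above;
# the systemd unit on the ECS wraps the real invocation.
vllm serve /home/ubuntu/ministral-vllm/models/ministral-3-8b \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --host 0.0.0.0 \
  --port 8001
```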
### Grounding model: Holo 1.5-7B
- HuggingFace model card: https://huggingface.co/holo-1.5-7b
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_holo_A40` (public IP: `164.30.22.166`)
- Flavor: GPU-accelerated | 48 vCPUs | 384 GiB | `g7.12xlarge.8`
- GPU: 1 × NVIDIA A40 (48 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (multimodal grounding)
- Endpoint env var: `vLLM_VISION_ENDPOINT`
- Current server (deployment reference): `http://164.30.22.166:8000/v1`
**Key client settings (grounding / coordinate space):**
- `model`: `holo-1.5-7b`
- Native coordinate space: `3840×2160` (4K)
- Client grounding dimensions:
  - `grounding_width`: `3840`
  - `grounding_height`: `2160`
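As a rough illustration of what these dimensions mean: Holo returns coordinates in its native `3840×2160` space, so a client driving a desktop at a different resolution has to rescale before clicking (all values below are illustrative; the real conversion lives in the agent scripts):
```bash
# Illustrative only: rescale a coordinate from Holo's native 3840x2160 space
# to the actual desktop resolution before clicking.
HOLO_X=1920; HOLO_Y=1080      # coordinate as returned in the 3840x2160 space
SCREEN_W=1920; SCREEN_H=1080  # actual VNC desktop size
awk -v x="$HOLO_X" -v y="$HOLO_Y" -v sw="$SCREEN_W" -v sh="$SCREEN_H" \
  'BEGIN { printf "click at %d,%d\n", x * sw / 3840, y * sh / 2160 }'
```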
Notes:
- Prompting and output-format hardening (reliability work):
  - `docs/story-026-001-context.md` (Holo output reliability)
  - `docs/story-025-001-context.md` (double grounding / calibration)
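For reference, a grounding request uses the same OpenAI-compatible chat format with the screenshot attached as an image. A minimal sketch (prompt wording and screenshot filename are illustrative; the hardened prompt formats are covered by the stories above):
```bash
# One grounding request to the Holo endpoint (vLLM, OpenAI-compatible multimodal).
# Endpoint/model/key values are from this page; prompt and screenshot are examples.
IMG_B64=$(base64 -w0 screenshot.png)
curl -s http://164.30.22.166:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy-key" \
  -d @- <<EOF
{
  "model": "holo-1.5-7b",
  "temperature": 0.0,
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Return the pixel coordinates of the cart icon."},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,${IMG_B64}"}}
    ]
  }]
}
EOF
```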

@@ -0,0 +1,230 @@
---
title: "Running Autonomous UAT Agent Scripts"
linkTitle: "Run Scripts"
weight: 3
description: >
  How to run the key D66 evaluation scripts and what they produce
---
# Running Autonomous UAT Agent Scripts
All commands below assume you are running from the **Agent-S repository root** (Linux/ECS), i.e. the folder that contains `staging_scripts/`.
The **Autonomous UAT Agent** is the overall UX/UI testing use case built on top of the Agent S codebase and scripts in this repo.
If you are inside the monorepo workspace, first `cd ~/Projects/Agent_S3/Agent-S` on the Ubuntu ECS and then run the same commands.
## One-command recommended run (ECS)
If you run only one thing to produce clean, repeatable evidence (screenshots with click markers), make it the calibration CLI:
```bash
DISPLAY=:1 python staging_scripts/gui_agent_cli.py --prompt "Go to telekom.de and click the cart icon" --max-steps 10
```
This writes screenshots to `./results/gui_agent_cli/<timestamp>/screenshots/`.
## ECS runner notes
- **Working directory matters:** the default output path is relative to the current working directory (it should be the Agent-S repo root on ECS).
- **GUI required:** `pyautogui` needs an X server (`DISPLAY=:1` is assumed by most scripts).
- **Persistence:** if you want results after the task ends, ensure `./results/` is on a mounted volume or copied out as an artifact (see the sketch below).
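One way to bundle a finished run and pull it off the ECS (key path and IP match the golden run section below; adjust both to your setup):
```bash
# On the ECS: bundle the results folder of the current repo checkout
tar czf uat-results-$(date +%Y%m%d-%H%M%S).tar.gz ./results/
# From the Windows workstation (key path / IP as in the golden run section):
# scp -i "C:\Path to KeyPair\KeyPair-ECS.pem" ubuntu@80.158.3.120:~/Projects/Agent_S3/Agent-S/uat-results-*.tar.gz .
```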
## Prerequisites (runtime)
- Linux GUI session (VNC/Xvfb) because these scripts drive a real browser via `pyautogui`.
- A working `DISPLAY` (most of the scripts assume `:1`).
- Network access to the model endpoints (thinking + vision/grounding).
Common environment variables used by the vLLM-backed scripts (example exports below):
- `vLLM_THINKING_ENDPOINT` (default in code if unset)
- `vLLM_VISION_ENDPOINT` (default in code if unset)
- `vLLM_API_KEY` (default: `dummy-key`)
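For a typical ECS session this amounts to something like the following (endpoint values are the current deployment references from the [Model Stack](./model-stack.md) page):
```bash
# Point the scripts at the current vLLM deployments (values from the Model Stack page)
export vLLM_THINKING_ENDPOINT="http://164.30.28.242:8001/v1"
export vLLM_VISION_ENDPOINT="http://164.30.22.166:8000/v1"
export vLLM_API_KEY="dummy-key"   # default; change if your deployment enforces a real key
```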
## Key scripts (repo locations)
Core scripts referenced for D66 demonstrations:
- UI check (Agent S3): `staging_scripts/1_UI_check_AS3.py`
- Functional correctness check: `staging_scripts/1_UI_functional_correctness_check.py`
- Visual quality audit: `staging_scripts/2_UX_visual_quality_audit.py`
- Task-based UX flow (newsletter): `staging_scripts/3_UX_taskflow_newsletter_signup.py`
Calibration / CLI entry point (used for click coordinate scaling validation):
- GUI Agent CLI (Holo click calibration): `staging_scripts/gui_agent_cli.py`
Legacy / historical:
- `staging_scripts/old scripts/agent_s3_1_old.py`
- `staging_scripts/old scripts/agent_s3_ui_test.py`
## Common configuration knobs
Many scripts support these environment variables:
- `AS2_TARGET_URL`: website URL to test
- `AS2_MAX_STEPS`: max steps (varies by script)
- `ASK_EVERY_STEPS`: interactive prompt cadence
Execution environment:
- Linux GUI environment typically expects `DISPLAY=:1`
## Recommended: run gui_agent_cli.py (calibration / click precision)
This is the “clean” CLI entry point for repeatable calibration runs.
Minimal run (prompt mode):
```bash
python staging_scripts/gui_agent_cli.py \
--prompt "Go to telekom.de and click the cart icon" \
--max-steps 30
```
Optional scaling factors for debugging (defaults to `1.0` / `1.0`):
```bash
python staging_scripts/gui_agent_cli.py \
--prompt "Go to telekom.de and click the cart icon" \
--x-scale 2.0 \
--y-scale 2.0 \
--max-steps 30
```
Outputs:
- Default run folder: `./results/gui_agent_cli/<timestamp>/`
- Screenshots: `./results/gui_agent_cli/<timestamp>/screenshots/`
- Text log (stdout/stderr): `./results/gui_agent_cli/<timestamp>/logs/run.log`
If `--enable-logging` is set, the script also writes a structured JSON communication log (Story 026-002) into the same run `logs/` folder by default.
Enable model communication logging (recommended when debugging mis-clicks):
```bash
python staging_scripts/gui_agent_cli.py \
--prompt "Click the Telekom icon" \
--max-steps 10 \
--output-dir ./results/gui_agent_cli/debug_run_telekom_icon \
--enable-logging \
--log-output-dir ./results/gui_agent_cli/debug_run_telekom_icon/logs
```
## Golden run (terminal on ECS)
This is the “golden run” command sequence currently used for D66 evidence generation.
### 1) Connect from Windows
```powershell
ssh -i "C:\Path to KeyPair\KeyPair-ECS.pem" ubuntu@80.158.3.120
```
### 2) Prepare the ECS runtime (GUI + browser)
```bash
# Activate venv
# Recommended: use the Agent S3 venv
source ~/Projects/Agent_S3/Agent-S/venv/bin/activate
# Go to Agent-S repo root
cd ~/Projects/Agent_S3/Agent-S
# Start VNC (DISPLAY=:1) and a browser
vncserver :1
export XAUTHORITY="$HOME/.Xauthority"
export DISPLAY=":1"
firefox &
```
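Before launching a run, it can save time to confirm that both model endpoints respond; vLLM's OpenAI-compatible server exposes a `/v1/models` route for this:
```bash
# Sanity check: both vLLM endpoints should list their served model
curl -s -H "Authorization: Bearer dummy-key" http://164.30.28.242:8001/v1/models
curl -s -H "Authorization: Bearer dummy-key" http://164.30.22.166:8000/v1/models
```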
### 3) Run the golden prompt
```bash
python staging_scripts/gui_agent_cli.py \
--prompt "Role: You are a UI/UX testing agent specializing in functional correctness.
Goal: Test all interactive elements in the header navigation on www.telekom.de for functional weaknesses.
Tasks:
1. Navigate to the website
2. Identify and test interactive elements (buttons, links, forms, menus)
3. Check for broken flows, defective links, non-functioning elements
4. Document issues found
Report Format:
Return findings in the 'issues' field as a list of objects:
- element: Name/description of the element
- location: Where on the page
- problem: What doesn't work
- recommendation: How to fix it
If no problems found, return an empty array: []" \
--max-steps 30
```
Golden run artifacts:
- Screenshots: `./results/gui_agent_cli/<timestamp>/screenshots/`
- Text log: `./results/gui_agent_cli/<timestamp>/logs/run.log`
- Optional JSON comm log (if enabled): `./results/gui_agent_cli/<timestamp>/logs/calibration_log_*.json`
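To inspect the optional JSON comm log of the latest run, pretty-printing is usually enough (the log's internal structure is script-specific, so no field names are assumed here):
```bash
# Pretty-print the newest JSON communication log, if one exists
latest=$(ls -t ./results/gui_agent_cli/*/logs/calibration_log_*.json 2>/dev/null | head -n 1)
[ -n "$latest" ] && python -m json.tool "$latest" | less
```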
## Alternative: run the agent via a web interface (Frontend)
Work in progress.
We are currently updating the web-based view and its ECS runner integration. This section will be filled with the correct, up-to-date instructions once the frontend flow supports the current Autonomous UAT Agent + `gui_agent_cli.py` workflow.
## Run the D66 evaluation scripts (staging_scripts)
These scripts are used for D66-style evaluation runs and tend to write their artifacts into `staging_scripts/` (DB, screenshots, JSON).
### UI check (Agent S3)
Typical pattern (URL via env var + optional run control args):
```bash
export AS2_TARGET_URL="https://www.leipzig.de"
export AS2_MAX_STEPS="20"
python staging_scripts/1_UI_check_AS3.py --auto-yes --ask-every 1000
```
Notes:
- Supports `--job-id <id>` (used by runners) and uses `JOB_ID` as a fallback.
- Writes JSON to `./agent_output/raw_json/<job_id>/` and screenshots/overlays to `staging_scripts/Screenshots/...`.
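For example, a runner-style invocation with an explicit job id (the id value is illustrative):
```bash
# Pin the job id so JSON outputs land in a predictable folder
export AS2_TARGET_URL="https://www.leipzig.de"
python staging_scripts/1_UI_check_AS3.py --auto-yes --ask-every 1000 --job-id demo-001
ls ./agent_output/raw_json/demo-001/
```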
### Functional correctness check
```bash
export AS2_TARGET_URL="https://www.leipzig.de"
export AS2_MAX_STEPS="0" # 0 = no limit (script-specific)
python staging_scripts/1_UI_functional_correctness_check.py --auto-yes --ask-every 1000
```
### Visual quality audit
This script currently uses a hardcoded `WEBSITE_URL` near the top of the file. Update it and then run:
```bash
python staging_scripts/2_UX_visual_quality_audit.py --auto-yes --ask-every 10
```
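If the URL sits in a plain `WEBSITE_URL = "..."` assignment near the top of the file (an assumption; check the script first), it can be swapped before the run:
```bash
# Assumes a plain string assignment in the script; verify the result afterwards
sed -i 's|^WEBSITE_URL = .*|WEBSITE_URL = "https://www.leipzig.de"|' \
  staging_scripts/2_UX_visual_quality_audit.py
grep -n "WEBSITE_URL" staging_scripts/2_UX_visual_quality_audit.py | head -n 3
```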
### Task-based UX flow (newsletter)
This script is currently a staging/WIP script; verify it runs in your environment before relying on it for evidence.
## Outputs to expect
Most scripts record one or more of:
- `uxqa.db` (run log DB)
- screenshots/overlays under `staging_scripts/Screenshots/...`
- JSON step outputs under `agent_output/` (paths vary by script)
- calibration CLI outputs under `./results/gui_agent_cli/<timestamp>/`
See [Outputs & Artifacts](./outputs-and-artifacts.md).
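A quick way to take inventory after a run (paths from the list above; some locations exist only for specific scripts):
```bash
# Newest artifacts per location; missing folders are silently skipped
ls -lt staging_scripts/Screenshots/ 2>/dev/null | head
ls -lt agent_output/ 2>/dev/null | head
ls -lt results/gui_agent_cli/ 2>/dev/null | head
ls -l uxqa.db staging_scripts/uxqa.db 2>/dev/null
```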
## Notes on model usage
Some scripts still contain legacy model configs (Claude/Pixtral). The D66 target configuration is documented in [Model Stack](./model-stack.md).