New documentation pages covering UAT Agent setup, ECS and Model configs and how to run agent scripts
All checks were successful
ci / build (push) Successful in 3m21s
This commit is contained in: parent d86ece2a2a, commit 14f84e78e9
4 changed files with 418 additions and 3 deletions
---
title: "General Documentation"
linkTitle: "Documentation"
weight: 10
description: >
  General documentation for D66 and the Autonomous UAT Agent (primarily technical, English, Markdown)
---
# General Documentation (D66)

This section contains the core documentation for D66, focusing on how the Autonomous UAT Agent works and how to run it.

## Pages

- [Overview](./overview.md)
- [Quickstart](./quickstart.md)
- [Running Autonomous UAT Agent Scripts](./running-auata-scripts.md)
- [Workflow Diagram](./agent-workflow-diagram.md)
- [Model Stack](./model-stack.md)
- [Outputs & Artifacts](./outputs-and-artifacts.md)
- [Troubleshooting](./troubleshooting.md)
---
title: "Agent Workflow Diagram"
linkTitle: "Workflow Diagram"
weight: 5
description: >
  Visual workflow of a typical Agent S (Autonomous UAT Agent) run (gui_agent_cli.py) across Ministral, Holo, and VNC
---

# Agent Workflow Diagram (Autonomous UAT Agent)

This page provides a **visual sketch** of the typical workflow (example: `gui_agent_cli.py`).

## High-level data flow
```mermaid
flowchart LR
    %% Left-to-right overview of one typical agent loop

    user[Operator / Prompt] --> cli["Agent S script (gui_agent_cli.py)"]

    subgraph otc["OTC (Open Telekom Cloud)"]
        subgraph ecsMin["ecs_ministral_L4 (164.30.28.242:8001) - Ministral vLLM"]
            ministral[(Ministral 3 8B - Thinking / Planning)]
        end

        subgraph ecsHolo["ecs_holo_A40 (164.30.22.166:8000) - Holo vLLM"]
            holo[(Holo 1.5-7B - Vision / Grounding)]
        end

        subgraph ecsGui["GUI test target (VNC + Firefox)"]
            vnc[VNC / Desktop]
            browser[Firefox]
        end
    end

    cli -->|"1. plan step (vLLM_THINKING_ENDPOINT)"| ministral
    ministral -->|"next action (click/type/wait)"| cli

    cli -->|"2. capture screenshot"| vnc
    vnc -->|"screenshot (PNG)"| cli

    cli -->|"3. grounding request (vLLM_VISION_ENDPOINT)"| holo
    holo -->|"coordinates + UI element info"| cli

    cli -->|"4. execute action (mouse/keyboard)"| vnc
    vnc --> browser

    cli -->|"logs + screenshots (results/ folder)"| artifacts[("Artifacts: logs, screenshots, JSON comms")]
```
## Sequence (one loop)

```mermaid
sequenceDiagram
    autonumber
    actor U as Operator
    participant CLI as gui_agent_cli.py
    participant MIN as Ministral vLLM<br/>(ecs_ministral_L4)
    participant VNC as VNC Desktop<br/>(Firefox)
    participant HOLO as Holo vLLM<br/>(ecs_holo_A40)

    U->>CLI: Provide goal / prompt

    loop Step loop (until done)
        CLI->>MIN: Plan next step (text-only reasoning)
        MIN-->>CLI: Next action (intent)

        CLI->>VNC: Capture screenshot
        VNC-->>CLI: Screenshot (image)

        CLI->>HOLO: Ground UI element(s) in screenshot
        HOLO-->>CLI: Coordinates + element metadata

        CLI->>VNC: Execute click/type/scroll
    end

    CLI-->>U: Result summary + saved artifacts
```
## Notes

- The **thinking** and **grounding** models are separate on purpose: the split improves coordinate reliability and makes failures easier to debug.
- The agent loop typically produces artifacts (logs + screenshots) that are later copied into D66 evidence bundles.
`content/en/docs/Autonomous UAT Agent/model-stack.md` (new file, 88 lines)
---
title: "Model Stack"
linkTitle: "Models"
weight: 4
description: >
  Thinking vs. grounding model split for D66 (current state and target state)
---

# Model Stack (D66)

For a visual overview of how the models interact with the VNC-based GUI automation loop, see the [Workflow Diagram](./agent-workflow-diagram.md).
## Requirement

D66 must use **open-source models from European companies**.

## Target setup

- **Thinking / planning:** Ministral
- **Grounding / coordinates:** Holo 1.5

The Agent S framework runs an iterative loop: a reasoning model decides *what to do next* (plans the next action), and a grounding model translates that UI intent into *pixel-accurate coordinates* on the current screenshot. This split is essential for reliable GUI automation because planning and "where exactly to click" are different problems that benefit from different model capabilities.
## Why split models?

- Reasoning models are optimized for planning and textual decision making
- Vision/grounding models are optimized for stable coordinate output
- Separation reduces "coordinate hallucinations" and makes debugging easier
## Current state in repo

- Some scripts and docs still reference historical **Claude** and **Pixtral** experiments.
- **Pixtral is not suitable for pixel-level grounding in this use case**: in our evaluations it did not provide the consistency and coordinate stability required for reliable UI automation.
- In an early prototyping phase, **Anthropic Claude Sonnet** was useful due to strong instruction following and reasoning quality; however, it does not meet the D66 constraints (open source + European provider), so it could not be used for the D66 target solution.
## Current configuration (D66)

### Thinking model: Ministral 3 8B (Instruct)

- HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_ministral_L4` (public IP: `164.30.28.242`)
- Flavor: GPU-accelerated | 16 vCPUs | 64 GiB | `pi5e.4xlarge.4`
- GPU: 1 × NVIDIA Tesla L4 (24 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (public image)
- Deployment: vLLM OpenAI-compatible endpoint (chat completions)
- Endpoint env var: `vLLM_THINKING_ENDPOINT`
- Current server (deployment reference): `http://164.30.28.242:8001/v1`
- Recommendation: set `vLLM_THINKING_ENDPOINT` explicitly (do not rely on script defaults).
**Operational note:** vLLM is configured to **auto-start on server boot** (OTC ECS restart) via `systemd`.
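The auto-start behavior can be implemented with a small `systemd` unit. The sketch below is illustrative only: the unit name, service user, and binary/model paths on `ecs_ministral_L4` are assumptions, while the serving flags mirror the ones listed on this page.

```ini
# /etc/systemd/system/vllm-ministral.service (illustrative sketch; actual
# unit name, user, and paths on the ECS may differ)
[Unit]
Description=vLLM OpenAI-compatible server (Ministral)
After=network-online.target

[Service]
User=ubuntu
# The vllm binary location and model directory are assumptions
ExecStart=/home/ubuntu/ministral-vllm/venv/bin/vllm serve \
    /home/ubuntu/ministral-vllm/models/ministral-3-8b \
    --gpu-memory-utilization 0.90 \
    --max-model-len 32768 \
    --host 0.0.0.0 \
    --port 8001
Restart=always

[Install]
WantedBy=multi-user.target
```

Enabled once with `sudo systemctl enable --now vllm-ministral`, the endpoint then comes back automatically after an ECS restart.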
**Key serving settings (vLLM):**

- `--gpu-memory-utilization 0.90`
- `--max-model-len 32768`
- `--host 0.0.0.0`
- `--port 8001`
**Key client settings (Autonomous UAT Agent scripts):**

- `model`: `/home/ubuntu/ministral-vllm/models/ministral-3-8b`
- `temperature`: `0.0`
### Grounding model: Holo 1.5-7B

- HuggingFace model card: https://huggingface.co/holo-1.5-7b
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_holo_A40` (public IP: `164.30.22.166`)
- Flavor: GPU-accelerated | 48 vCPUs | 384 GiB | `g7.12xlarge.8`
- GPU: 1 × NVIDIA A40 (48 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (public image)
- Deployment: vLLM OpenAI-compatible endpoint (multimodal grounding)
- Endpoint env var: `vLLM_VISION_ENDPOINT`
- Current server (deployment reference): `http://164.30.22.166:8000/v1`
**Key client settings (grounding / coordinate space):**

- `model`: `holo-1.5-7b`
- Native coordinate space: `3840×2160` (4K)
- Client grounding dimensions:
  - `grounding_width`: `3840`
  - `grounding_height`: `2160`
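Because Holo answers in its native `3840×2160` coordinate space, a click target has to be rescaled whenever the actual desktop resolution differs. A minimal shell sketch of that mapping (the `1920×1080` screen and all variable names are hypothetical, not taken from the scripts):

```shell
# Rescale a point from Holo's native grounding space onto the real screen.
GROUNDING_WIDTH=3840; GROUNDING_HEIGHT=2160   # model's native space (4K)
SCREEN_WIDTH=1920;    SCREEN_HEIGHT=1080      # hypothetical desktop size
MODEL_X=1920; MODEL_Y=540                     # example point from a grounding response

# Integer scaling: screen = model * screen_dim / grounding_dim
CLICK_X=$(( MODEL_X * SCREEN_WIDTH / GROUNDING_WIDTH ))
CLICK_Y=$(( MODEL_Y * SCREEN_HEIGHT / GROUNDING_HEIGHT ))
echo "click at ${CLICK_X},${CLICK_Y}"   # -> click at 960,270
```

The same ratio logic is what the CLI's `--x-scale` / `--y-scale` debugging factors adjust on top of.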
Notes:

- Prompting and output-format hardening (reliability work):
  - `docs/story-026-001-context.md` (Holo output reliability)
  - `docs/story-025-001-context.md` (double grounding / calibration)
`content/en/docs/Autonomous UAT Agent/running-auata-scripts.md` (new file, 230 lines)
---
title: "Running Autonomous UAT Agent Scripts"
linkTitle: "Run Scripts"
weight: 3
description: >
  How to run the key D66 evaluation scripts and what they produce
---
# Running Autonomous UAT Agent Scripts

All commands below assume you are running from the **Agent-S repository root** (Linux/ECS), i.e. the folder that contains `staging_scripts/`.

The **Autonomous UAT Agent** is the overall UX/UI testing use case built on top of the Agent S codebase and the scripts in this repo.

If you are inside the monorepo workspace, first `cd ~/Projects/Agent_S3/Agent-S` on the Ubuntu ECS and then run the same commands.
## One-command recommended run (ECS)

If you run only one thing to produce clean, repeatable evidence (screenshots with click markers), run the calibration CLI:

```bash
DISPLAY=:1 python staging_scripts/gui_agent_cli.py --prompt "Go to telekom.de and click the cart icon" --max-steps 10
```

This writes screenshots to `./results/gui_agent_cli/<timestamp>/screenshots/`.
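Each invocation creates a fresh timestamped run folder. The layout can be sketched as follows (the exact `<timestamp>` format is an assumption; check your actual `results/` tree):

```shell
# Mirror the run layout: results/gui_agent_cli/<timestamp>/{screenshots,logs}
RUN_DIR="./results/gui_agent_cli/$(date +%Y%m%d_%H%M%S)"
mkdir -p "${RUN_DIR}/screenshots" "${RUN_DIR}/logs"
ls "${RUN_DIR}"
```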
## ECS runner notes

- **Working directory matters:** the default output path is relative to the current working directory (it should be the Agent-S repo root on ECS).
- **GUI required:** `pyautogui` needs an X server (`DISPLAY=:1` is assumed by most scripts).
- **Persistence:** if you want results to survive after the task ends, ensure `./results/` is on a mounted volume or copied out as an artifact.
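For the persistence point above, one simple approach is to bundle the whole results tree into a single archive that can be copied off the machine (paths illustrative):

```shell
# Bundle the results tree into one artifact file.
mkdir -p results/gui_agent_cli        # ensure the tree exists for this demo
tar -czf uat_artifacts.tar.gz results/
ls -lh uat_artifacts.tar.gz
```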
## Prerequisites (runtime)

- Linux GUI session (VNC/Xvfb), because these scripts drive a real browser via `pyautogui`.
- A working `DISPLAY` (most of the scripts assume `:1`).
- Network access to the model endpoints (thinking + vision/grounding).

Common environment variables used by the vLLM-backed scripts:

- `vLLM_THINKING_ENDPOINT` (default in code if unset)
- `vLLM_VISION_ENDPOINT` (default in code if unset)
- `vLLM_API_KEY` (default: `dummy-key`)
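A quick way to check which endpoints a run will use is to resolve the variables the same way the scripts typically do: an explicit environment value wins, otherwise a fallback applies. The fallback URLs below are the deployment references from the [Model Stack](./model-stack.md) page, not necessarily the in-code defaults:

```shell
# Resolve endpoints: env var if set, otherwise a fallback value.
# (Fallback URLs are deployment references, NOT the actual in-code defaults.)
THINKING_ENDPOINT="${vLLM_THINKING_ENDPOINT:-http://164.30.28.242:8001/v1}"
VISION_ENDPOINT="${vLLM_VISION_ENDPOINT:-http://164.30.22.166:8000/v1}"
API_KEY="${vLLM_API_KEY:-dummy-key}"

echo "thinking: ${THINKING_ENDPOINT}"
echo "vision:   ${VISION_ENDPOINT}"
```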
## Key scripts (repo locations)

Core scripts referenced for D66 demonstrations:

- UI check (Agent S3): `staging_scripts/1_UI_check_AS3.py`
- Functional correctness check: `staging_scripts/1_UI_functional_correctness_check.py`
- Visual quality audit: `staging_scripts/2_UX_visual_quality_audit.py`
- Task-based UX flow (newsletter): `staging_scripts/3_UX_taskflow_newsletter_signup.py`

Calibration / CLI entry point (used for click-coordinate scaling validation):

- GUI Agent CLI (Holo click calibration): `staging_scripts/gui_agent_cli.py`

Legacy / historical:

- `staging_scripts/old scripts/agent_s3_1_old.py`
- `staging_scripts/old scripts/agent_s3_ui_test.py`
## Common configuration knobs

Many scripts support these environment variables:

- `AS2_TARGET_URL`: website URL to test
- `AS2_MAX_STEPS`: max steps (varies by script)
- `ASK_EVERY_STEPS`: interactive prompt cadence

Execution environment:

- The Linux GUI environment typically expects `DISPLAY=:1`
## Recommended: run gui_agent_cli.py (calibration / click precision)

This is the "clean" CLI entry point for repeatable calibration runs.

Minimal run (prompt mode):

```bash
python staging_scripts/gui_agent_cli.py \
  --prompt "Go to telekom.de and click the cart icon" \
  --max-steps 30
```
Optional scaling factors for debugging (both default to `1.0`):

```bash
python staging_scripts/gui_agent_cli.py \
  --prompt "Go to telekom.de and click the cart icon" \
  --x-scale 2.0 \
  --y-scale 2.0 \
  --max-steps 30
```
Outputs:

- Default run folder: `./results/gui_agent_cli/<timestamp>/`
- Screenshots: `./results/gui_agent_cli/<timestamp>/screenshots/`
- Text log (stdout/stderr): `./results/gui_agent_cli/<timestamp>/logs/run.log`

If `--enable-logging` is set, the script also writes a structured JSON communication log (Story 026-002) into the same run's `logs/` folder by default.
Enable model communication logging (recommended when debugging mis-clicks):

```bash
python staging_scripts/gui_agent_cli.py \
  --prompt "Click the Telekom icon" \
  --max-steps 10 \
  --output-dir ./results/gui_agent_cli/debug_run_telekom_icon \
  --enable-logging \
  --log-output-dir ./results/gui_agent_cli/debug_run_telekom_icon/logs
```
## Golden run (terminal on ECS)

This is the "golden run" command sequence currently used for D66 evidence generation.

### 1) Connect from Windows

```powershell
ssh -i "C:\Path to KeyPair\KeyPair-ECS.pem" ubuntu@80.158.3.120
```
### 2) Prepare the ECS runtime (GUI + browser)

```bash
# Activate the recommended Agent S3 venv
source ~/Projects/Agent_S3/Agent-S/venv/bin/activate

# Go to the Agent-S repo root
cd ~/Projects/Agent_S3/Agent-S

# Start VNC (DISPLAY=:1) and a browser
vncserver :1
export XAUTHORITY="$HOME/.Xauthority"
export DISPLAY=":1"
firefox &
```
### 3) Run the golden prompt

```bash
python staging_scripts/gui_agent_cli.py \
  --prompt "Role: You are a UI/UX testing agent specializing in functional correctness.
Goal: Test all interactive elements in the header navigation on www.telekom.de for functional weaknesses.
Tasks:
1. Navigate to the website
2. Identify and test interactive elements (buttons, links, forms, menus)
3. Check for broken flows, defective links, non-functioning elements
4. Document issues found
Report Format:
Return findings in the 'issues' field as a list of objects:
- element: Name/description of the element
- location: Where on the page
- problem: What doesn't work
- recommendation: How to fix it
If no problems found, return an empty array: []" \
  --max-steps 30
```
Golden run artifacts:

- Screenshots: `./results/gui_agent_cli/<timestamp>/screenshots/`
- Text log: `./results/gui_agent_cli/<timestamp>/logs/run.log`
- Optional JSON comm log (if enabled): `./results/gui_agent_cli/<timestamp>/logs/calibration_log_*.json`
## Alternative: run the agent via a web interface (Frontend)

Work in progress.

We are currently updating the web-based view and its ECS runner integration. This section will be filled with the correct, up-to-date instructions once the frontend flow supports the current Autonomous UAT Agent + `gui_agent_cli.py` workflow.
## Run the D66 evaluation scripts (staging_scripts)

These scripts are used for D66-style evaluation runs and tend to write their artifacts into `staging_scripts/` (DB, screenshots, JSON).

### UI check (Agent S3)

Typical pattern (URL via env var + optional run-control args):
```bash
export AS2_TARGET_URL="https://www.leipzig.de"
export AS2_MAX_STEPS="20"

python staging_scripts/1_UI_check_AS3.py --auto-yes --ask-every 1000
```
Notes:

- Supports `--job-id <id>` (used by runners) and falls back to the `JOB_ID` env var.
- Writes JSON to `./agent_output/raw_json/<job_id>/` and screenshots/overlays to `staging_scripts/Screenshots/...`.
### Functional correctness check

```bash
export AS2_TARGET_URL="https://www.leipzig.de"
export AS2_MAX_STEPS="0"  # 0 = no limit (script-specific)

python staging_scripts/1_UI_functional_correctness_check.py --auto-yes --ask-every 1000
```
### Visual quality audit

This script currently uses a hardcoded `WEBSITE_URL` near the top of the file. Update it and then run:

```bash
python staging_scripts/2_UX_visual_quality_audit.py --auto-yes --ask-every 10
```
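Until the URL becomes a CLI flag or env var, it can be switched non-interactively with `sed`. The demo below edits a stand-in file; run the same substitution against `staging_scripts/2_UX_visual_quality_audit.py`, assuming the script contains a plain single-line `WEBSITE_URL = "..."` assignment (an assumption about its exact form):

```shell
# Demo on a stand-in file with the assumed assignment shape.
printf 'WEBSITE_URL = "https://www.example.com"\n' > audit_demo.py
sed -i 's|^WEBSITE_URL = .*|WEBSITE_URL = "https://www.leipzig.de"|' audit_demo.py
grep '^WEBSITE_URL' audit_demo.py
```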
### Task-based UX flow (newsletter)

This script is currently a staging/WIP script; verify that it runs in your environment before relying on it for evidence.
## Outputs to expect

Most scripts record one or more of:

- `uxqa.db` (run log DB)
- screenshots/overlays under `staging_scripts/Screenshots/...`
- JSON step outputs under `agent_output/` (paths vary by script)
- calibration CLI outputs under `./results/gui_agent_cli/<timestamp>/`

See [Outputs & Artifacts](./outputs-and-artifacts.md).
## Notes on model usage

Some scripts still contain legacy model configs (Claude/Pixtral). The D66 target configuration is documented in [Model Stack](./model-stack.md).