diff --git a/content/en/docs/Autonomous UAT Agent/_index.md b/content/en/docs/Autonomous UAT Agent/_index.md index 27dd81a..1573c82 100644 --- a/content/en/docs/Autonomous UAT Agent/_index.md +++ b/content/en/docs/Autonomous UAT Agent/_index.md @@ -1,7 +1,21 @@ --- -title: "Autonomous UAT Agent" -linkTitle: "Autonomous UAT Agent" +title: "General Documentation" +linkTitle: "Documentation" weight: 10 description: > - Deliverable D66 documentation for the Autonomous UAT Agent (Business + Technical) + General documentation for D66 and the Autonomous UAT Agent (primarily technical, English, Markdown) --- + +# General Documentation (D66) + +This section contains the core documentation for D66, focusing on how the Autonomous UAT Agent works and how to run it. + +## Pages + +- [Overview](./overview.md) +- [Quickstart](./quickstart.md) +- [Running Autonomous UAT Agent Scripts](./running-auata-scripts.md) +- [Workflow Diagram](./agent-workflow-diagram.md) +- [Model Stack](./model-stack.md) +- [Outputs & Artifacts](./outputs-and-artifacts.md) +- [Troubleshooting](./troubleshooting.md) diff --git a/content/en/docs/Autonomous UAT Agent/agent-workflow-diagram.md b/content/en/docs/Autonomous UAT Agent/agent-workflow-diagram.md new file mode 100644 index 0000000..f894080 --- /dev/null +++ b/content/en/docs/Autonomous UAT Agent/agent-workflow-diagram.md @@ -0,0 +1,83 @@ +--- +title: "Agent Workflow Diagram" +linkTitle: "Workflow Diagram" +weight: 5 +description: > + Visual workflow of a typical Agent S (Autonomous UAT Agent) run (gui_agent_cli.py) across Ministral, Holo, and VNC +--- + +# Agent Workflow Diagram (Autonomous UAT Agent) + +This page provides a **visual sketch** of the typical workflow (example: `gui_agent_cli.py`). + +## High-level data flow + +```mermaid +flowchart LR + %% Left-to-right overview of one typical agent loop + + user[Operator / Prompt] --> cli[Agent S script (gui_agent_cli.py)] + + subgraph otc[OTC (Open Telekom Cloud)] + subgraph ecsMin["ecs_ministral_L4 (164.30.28.242:8001) - Ministral vLLM"] + ministral[(Ministral 3 8B - Thinking / Planning)] + end + + subgraph ecsHolo["ecs_holo_A40 (164.30.22.166:8000) - Holo vLLM"] + holo[(Holo 1.5-7B - Vision / Grounding)] + end + + subgraph ecsGui["GUI test target (VNC + Firefox)"] + vnc[VNC / Desktop] + browser[Firefox] + end + end + + cli -->|1. plan step (vLLM_THINKING_ENDPOINT)| ministral + ministral -->|next action (click/type/wait)| cli + + cli -->|2. capture screenshot| vnc + vnc -->|screenshot (PNG)| cli + + cli -->|3. grounding request (vLLM_VISION_ENDPOINT)| holo + holo -->|coordinates + UI element info| cli + + cli -->|4. execute action (mouse/keyboard)| vnc + vnc --> browser + + cli -->|logs + screenshots (results/ folder)| artifacts[(Artifacts: logs, screenshots, JSON comms)] +``` + +## Sequence (one loop) + +```mermaid +sequenceDiagram + autonumber + actor U as Operator + participant CLI as gui_agent_cli.py + participant MIN as Ministral vLLM\n(ecs_ministral_L4) + participant VNC as VNC Desktop\n(Firefox) + participant HOLO as Holo vLLM\n(ecs_holo_A40) + + U->>CLI: Provide goal / prompt + + loop Step loop (until done) + CLI->>MIN: Plan next step (text-only reasoning) + MIN-->>CLI: Next action (intent) + + CLI->>VNC: Capture screenshot + VNC-->>CLI: Screenshot (image) + + CLI->>HOLO: Ground UI element(s) in screenshot + HOLO-->>CLI: Coordinates + element metadata + + CLI->>VNC: Execute click/type/scroll + end + + CLI-->>U: Result summary + saved artifacts +``` + +## Notes + +- The **thinking** and **grounding** models are separate on purpose: it improves coordinate reliability and makes failures easier to debug. +- The agent loop typically produces artifacts (logs + screenshots) which are later copied into D66 evidence bundles. diff --git a/content/en/docs/Autonomous UAT Agent/model-stack.md b/content/en/docs/Autonomous UAT Agent/model-stack.md new file mode 100644 index 0000000..df902d0 --- /dev/null +++ b/content/en/docs/Autonomous UAT Agent/model-stack.md @@ -0,0 +1,88 @@ +--- +title: "Model Stack" +linkTitle: "Models" +weight: 4 +description: > + Thinking vs grounding model split for D66 (current state and target state) +--- + +# Model Stack (D66) + +For a visual overview of how the models interact with the VNC-based GUI automation loop, see: [Workflow Diagram](./agent-workflow-diagram.md) + +## Requirement + +D66 must use **open-source models from European companies**. + +## Target setup + +- **Thinking / planning:** Ministral +- **Grounding / coordinates:** Holo 1.5 + +The Agent S framework runs an iterative loop: it uses a reasoning model to decide *what to do next* (plan the next action) and a grounding model to translate UI intent into *pixel-accurate coordinates* on the current screenshot. This split is essential for reliable GUI automation because planning and “where exactly to click” are different problems and benefit from different model capabilities. + +## Why split models? + +- Reasoning models optimize planning and textual decision making +- Vision/grounding models optimize stable coordinate output +- Separation reduces “coordinate hallucinations” and makes debugging easier + +## Current state in repo + +- Some scripts and docs still reference historical **Claude** and **Pixtral** experiments. +- **Pixtral is not suitable for pixel-level grounding in this use case**: in our evaluations it did not provide the consistency and coordinate stability required for reliable UI automation. +- In an early prototyping phase, **Anthropic Claude Sonnet** was useful due to strong instruction-following and reasoning quality; however it does not meet the D66 constraints (open-source + European provider), so it could not be used for the D66 target solution. + +## Current configuration (D66) + +### Thinking model: Ministral 3 8B (Instruct) + +- HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512 +- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_ministral_L4` (public IP: `164.30.28.242`) + - Flavor: GPU-accelerated | 16 vCPUs | 64 GiB | `pi5e.4xlarge.4` + - GPU: 1 × NVIDIA Tesla L4 (24 GiB) + - Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image) +- Deployment: vLLM OpenAI-compatible endpoint (chat completions) + - Endpoint env var: `vLLM_THINKING_ENDPOINT` + - Current server (deployment reference): `http://164.30.28.242:8001/v1` + - Recommendation: set `vLLM_THINKING_ENDPOINT` explicitly (do not rely on script defaults). + +**Operational note:** vLLM is configured to **auto-start on server boot** (OTC ECS restart) via `systemd`. + +**Key serving settings (vLLM):** + +- `--gpu-memory-utilization 0.90` +- `--max-model-len 32768` +- `--host 0.0.0.0` +- `--port 8001` + +**Key client settings (Autonomous UAT Agent scripts):** + +- `model`: `/home/ubuntu/ministral-vllm/models/ministral-3-8b` +- `temperature`: `0.0` + +### Grounding model: Holo 1.5-7B + +- HuggingFace model card: https://huggingface.co/holo-1.5-7b +- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_holo_A40` (public IP: `164.30.22.166`) + - Flavor: GPU-accelerated | 48 vCPUs | 384 GiB | `g7.12xlarge.8` + - GPU: 1 × NVIDIA A40 (48 GiB) + - Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image) +- Deployment: vLLM OpenAI-compatible endpoint (multimodal grounding) + - Endpoint env var: `vLLM_VISION_ENDPOINT` + - Current server (deployment reference): `http://164.30.22.166:8000/v1` + +**Key client settings (grounding / coordinate space):** + +- `model`: `holo-1.5-7b` +- Native coordinate space: `3840×2160` (4K) +- Client grounding dimensions: + - `grounding_width`: `3840` + - `grounding_height`: `2160` + +Notes: + +- Prompting and output-format hardening (reliability work): + - `docs/story-026-001-context.md` (Holo output reliability) + - `docs/story-025-001-context.md` (double grounding / calibration) + diff --git a/content/en/docs/Autonomous UAT Agent/running-auata-scripts.md b/content/en/docs/Autonomous UAT Agent/running-auata-scripts.md new file mode 100644 index 0000000..b3d5ec7 --- /dev/null +++ b/content/en/docs/Autonomous UAT Agent/running-auata-scripts.md @@ -0,0 +1,230 @@ +--- +title: "Running Autonomous UAT Agent Scripts" +linkTitle: "Run Scripts" +weight: 3 +description: > + How to run the key D66 evaluation scripts and what they produce +--- + +# Running Autonomous UAT Agent Scripts + +All commands below assume you are running from the **Agent-S repository root** (Linux/ECS), i.e. the folder that contains `staging_scripts/`. + +The **Autonomous UAT Agent** is the overall UX/UI testing use case built on top of the Agent S codebase and scripts in this repo. + +If you are inside the monorepo workspace, first `cd ~/Projects/Agent_S3/Agent-S` on the Ubuntu ECS and then run the same commands. + +## One-command recommended run (ECS) + +If you only run one thing to produce clean, repeatable evidence (screenshots with click markers), run the calibration CLI: + +```bash +DISPLAY=:1 python staging_scripts/gui_agent_cli.py --prompt "Go to telekom.de and click the cart icon" --max-steps 10 +``` + +This writes screenshots to `./results/gui_agent_cli//screenshots/`. + +## ECS runner notes + +- **Working directory matters:** the default output path is relative to the current working directory (it should be the Agent-S repo root on ECS). +- **GUI required:** `pyautogui` needs an X server (`DISPLAY=:1` is assumed by most scripts). +- **Persistence:** if you want results after the task ends, ensure `./results/` is on a mounted volume or copied out as an artifact. + +## Prerequisites (runtime) + +- Linux GUI session (VNC/Xvfb) because these scripts drive a real browser via `pyautogui`. +- A working `DISPLAY` (most of the scripts assume `:1`). +- Network access to the model endpoints (thinking + vision/grounding). + +Common environment variables used by the vLLM-backed scripts: + +- `vLLM_THINKING_ENDPOINT` (default in code if unset) +- `vLLM_VISION_ENDPOINT` (default in code if unset) +- `vLLM_API_KEY` (default: `dummy-key`) + +## Key scripts (repo locations) + +Core scripts referenced for D66 demonstrations: + +- UI check (Agent S3): `staging_scripts/1_UI_check_AS3.py` +- Functional correctness check: `staging_scripts/1_UI_functional_correctness_check.py` +- Visual quality audit: `staging_scripts/2_UX_visual_quality_audit.py` +- Task-based UX flow (newsletter): `staging_scripts/3_UX_taskflow_newsletter_signup.py` + +Calibration / CLI entry point (used for click coordinate scaling validation): + +- GUI Agent CLI (Holo click calibration): `staging_scripts/gui_agent_cli.py` + +Legacy / historical: + +- `staging_scripts/old scripts/agent_s3_1_old.py` +- `staging_scripts/old scripts/agent_s3_ui_test.py` + +## Common configuration knobs + +Many scripts support these environment variables: + +- `AS2_TARGET_URL`: website URL to test +- `AS2_MAX_STEPS`: max steps (varies by script) +- `ASK_EVERY_STEPS`: interactive prompt cadence + +Execution environment: + +- Linux GUI environment typically expects `DISPLAY=:1` + +## Recommended: run gui_agent_cli.py (calibration / click precision) + +This is the “clean” CLI entry point for repeatable calibration runs. + +Minimal run (prompt mode): + +```bash +python staging_scripts/gui_agent_cli.py \ + --prompt "Go to telekom.de and click the cart icon" \ + --max-steps 30 +``` + +Optional scaling factors for debugging (defaults to `1.0` / `1.0`): + +```bash +python staging_scripts/gui_agent_cli.py \ + --prompt "Go to telekom.de and click the cart icon" \ + --x-scale 2.0 \ + --y-scale 2.0 \ + --max-steps 30 +``` + +Outputs: + +- Default run folder: `./results/gui_agent_cli//` +- Screenshots: `./results/gui_agent_cli//screenshots/` +- Text log (stdout/stderr): `./results/gui_agent_cli//logs/run.log` + +If `--enable-logging` is set, the script also writes a structured JSON communication log (Story 026-002) into the same run `logs/` folder by default. + +Enable model communication logging (recommended when debugging mis-clicks): + +```bash +python staging_scripts/gui_agent_cli.py \ + --prompt "Click the Telekom icon" \ + --max-steps 10 \ + --output-dir ./results/gui_agent_cli/debug_run_telekom_icon \ + --enable-logging \ + --log-output-dir ./results/gui_agent_cli/debug_run_telekom_icon/logs +``` + +## Golden run (terminal on ECS) + +This is the “golden run” command sequence currently used for D66 evidence generation. + +### 1) Connect from Windows + +```powershell +ssh -i "C:\Path to KeyPair\KeyPair-ECS.pem" ubuntu@80.158.3.120 +``` + +### 2) Prepare the ECS runtime (GUI + browser) + +```bash +# Activate venv +# Recommended: use the Agent S3 venv +source ~/Projects/Agent_S3/Agent-S/venv/bin/activate + +# Go to Agent-S repo root +cd ~/Projects/Agent_S3/Agent-S + +# Start VNC (DISPLAY=:1) and a browser +vncserver :1 +export XAUTHORITY="$HOME/.Xauthority" +export DISPLAY=":1" +firefox & +``` + +### 3) Run the golden prompt + +```bash +python staging_scripts/gui_agent_cli.py \ + --prompt "Role: You are a UI/UX testing agent specializing in functional correctness. +Goal: Test all interactive elements in the header navigation on www.telekom.de for functional weaknesses. +Tasks: +1. Navigate to the website +2. Identify and test interactive elements (buttons, links, forms, menus) +3. Check for broken flows, defective links, non-functioning elements +4. Document issues found +Report Format: +Return findings in the 'issues' field as a list of objects: +- element: Name/description of the element +- location: Where on the page +- problem: What doesn't work +- recommendation: How to fix it +If no problems found, return an empty array: []" \ + --max-steps 30 +``` + +Golden run artifacts: + +- Screenshots: `./results/gui_agent_cli//screenshots/` +- Text log: `./results/gui_agent_cli//logs/run.log` +- Optional JSON comm log (if enabled): `./results/gui_agent_cli//logs/calibration_log_*.json` + +## Alternative: run the agent via a web interface (Frontend) + +Work in progress. + +We are currently updating the web-based view and its ECS runner integration. This section will be filled with the correct, up-to-date instructions once the frontend flow supports the current Autonomous UAT Agent + `gui_agent_cli.py` workflow. + +## Run the D66 evaluation scripts (staging_scripts) + +These scripts are used for D66-style evaluation runs and tend to write their artifacts into `staging_scripts/` (DB, screenshots, JSON). + +### UI check (Agent S3) + +Typical pattern (URL via env var + optional run control args): + +```bash +export AS2_TARGET_URL="https://www.leipzig.de" +export AS2_MAX_STEPS="20" + +python staging_scripts/1_UI_check_AS3.py --auto-yes --ask-every 1000 +``` + +Notes: + +- Supports `--job-id ` (used by runners) and uses `JOB_ID` as a fallback. +- Writes JSON to `./agent_output/raw_json//` and screenshots/overlays to `staging_scripts/Screenshots/...`. + +### Functional correctness check + +```bash +export AS2_TARGET_URL="https://www.leipzig.de" +export AS2_MAX_STEPS="0" # 0 = no limit (script-specific) + +python staging_scripts/1_UI_functional_correctness_check.py --auto-yes --ask-every 1000 +``` + +### Visual quality audit + +This script currently uses a hardcoded `WEBSITE_URL` near the top of the file. Update it and then run: + +```bash +python staging_scripts/2_UX_visual_quality_audit.py --auto-yes --ask-every 10 +``` + +### Task-based UX flow (newsletter) + +This script is currently a staging/WIP script; verify it runs in your environment before relying on it for evidence. + +## Outputs to expect + +Most scripts record one or more of: + +- `uxqa.db` (run log DB) +- screenshots/overlays under `staging_scripts/Screenshots/...` +- JSON step outputs under `agent_output/` (paths vary by script) +- calibration CLI outputs under `./results/gui_agent_cli//` + +See [Outputs & Artifacts](./outputs-and-artifacts.md). + +## Notes on model usage + +Some scripts still contain legacy model configs (Claude/Pixtral). The D66 target configuration is documented in [Model Stack](./model-stack.md).