Update to running UAT scripts and results

This commit is contained in:
Tom Sakretz 2026-01-29 15:22:46 +01:00
parent 22e6c208ac
commit 242a4b8a79
20 changed files with 268 additions and 167 deletions


@ -6,13 +6,13 @@ description: >
Thinking vs grounding model split for D66 (current state and target state)
---
# Model Stack (D66)
# Model Stack
For a visual overview of how the models interact with the VNC-based GUI automation loop, see: [Workflow Diagram](./agent-workflow-diagram.md)
## Requirement
D66 must use **open-source models from European companies**.
The Autonomous UAT Agent must use **open-source models from European companies**. This has been a project requirement from the very beginning of the project.
## Target setup
@ -45,7 +45,6 @@ The Agent S framework runs an iterative loop: it uses a reasoning model to decid
- Deployment: vLLM OpenAI-compatible endpoint (chat completions)
- Endpoint env var: `vLLM_THINKING_ENDPOINT`
- Current server (deployment reference): `http://164.30.28.242:8001/v1`
- Recommendation: set `vLLM_THINKING_ENDPOINT` explicitly (do not rely on script defaults).
**Operational note:** vLLM is configured to **auto-start on server boot** (OTC ECS restart) via `systemd`.
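The auto-start configuration could take the shape of a systemd unit like the one below. This is a sketch only: the unit name, user, model id, and paths are assumptions, not the deployed configuration.

```ini
# /etc/systemd/system/vllm-thinking.service (illustrative sketch)
[Unit]
Description=vLLM OpenAI-compatible server (thinking model)
After=network-online.target

[Service]
User=ubuntu
# <model-id> is a placeholder; substitute the deployed model
ExecStart=/usr/bin/env vllm serve <model-id> --port 8001
Restart=on-failure

[Install]
WantedBy=multi-user.target
```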
@ -80,9 +79,4 @@ The Agent S framework runs an iterative loop: it uses a reasoning model to decid
- `grounding_width`: `3840`
- `grounding_height`: `2160`
Notes:
- Prompting and output-format hardening (reliability work):
- `docs/story-026-001-context.md` (Holo output reliability)
- `docs/story-025-001-context.md` (double grounding / calibration)


@ -0,0 +1,17 @@
---
title: "Results & Findings"
linkTitle: "Results"
weight: 20
description: >
Results, findings, and evidence artifacts for D66
---
# Results & Findings (D66)
This section contains the outputs that support D66 claims: findings summaries and pointers to logs, screenshots, and run artifacts.
## Pages
- [PoC Validation](./poc-validation.md)
- [Golden Run (Telekom Header Navigation)](./golden-run-telekom-header-nav/)
- [Logs & Artifacts](./logs-and-artifacts.md)


@ -0,0 +1,142 @@
---
title: "Golden Run: Telekom Header Navigation"
linkTitle: "Golden Run (Telekom)"
weight: 3
description: >
Evidence pack (screenshots + logs) for the golden run on www.telekom.de header navigation
---
# Golden Run: Telekom Header Navigation
This page is the evidence pack for the **Autonomous UAT Agent** golden run on **www.telekom.de**.
## Run intent
- Goal: Test interactive elements in the header navigation for functional weaknesses
- Output: Click-marked screenshots + per-run log (and optionally model communication JSON)
## How the run was executed (ECS)
Command (as used in the runbook):
```bash
python staging_scripts/gui_agent_cli.py \
--prompt "Role: You are a UI/UX testing agent specializing in functional correctness.
Goal: Test all interactive elements in the header navigation on www.telekom.de for functional weaknesses.
Tasks:
1. Navigate to the website
2. Identify and test interactive elements (buttons, links, forms, menus)
3. Check for broken flows, defective links, non-functioning elements
4. Document issues found
Report Format:
Return findings in the 'issues' field as a list of objects:
- element: Name/description of the element
- location: Where on the page
- problem: What doesn't work
- recommendation: How to fix it
If no problems found, return an empty array: []" \
--max-steps 30
```
## Artifacts
### In-repo evidence (this page bundle)
Place the evidence files here:
- Screenshots: `screenshots/`
- Text log: `logs/run.log`
- Optional JSON communication log(s): `logs/calibration_log_*.json`
If you have ~15 screenshots, name them in a stable order, e.g.:
- `screenshots/uat_agent_step_001.png` … `screenshots/uat_agent_step_015.png`
### Runtime output location (where they come from)
The CLI defaults to:
- `./results/gui_agent_cli/<timestamp>/screenshots/`
- `./results/gui_agent_cli/<timestamp>/logs/run.log`
Copy the files you want to publish into this page bundle so they render in the docs.
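The copy step can be sketched as a small helper. The function name is illustrative, and the sketch assumes the source files sort lexically in step order:

```shell
# Hedged sketch: copy run screenshots into the page bundle with stable,
# zero-padded names. Assumes the source files sort in step order.
publish_screenshots() {
  src="$1"; dest="$2"; i=1
  mkdir -p "$dest"
  for f in "$src"/*.png; do
    cp "$f" "$dest/uat_agent_step_$(printf '%03d' "$i").png"
    i=$((i + 1))
  done
}
# Example (replace <timestamp> with the actual run folder):
# publish_screenshots ./results/gui_agent_cli/<timestamp>/screenshots ./screenshots
```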
## Screenshot gallery
### Thumbnail grid (recommended for many screenshots)
Click any thumbnail to open the full image.
<div style="display:grid; grid-template-columns: repeat(auto-fit, minmax(240px, 1fr)); gap: 12px; align-items:start;">
<figure style="margin:0;">
<a href="screenshots/uat_agent_step_001.png"><img src="screenshots/uat_agent_step_001.png" alt="UAT agent step 001" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
<figcaption style="text-align:center; font-size:0.9em;">Step 001</figcaption>
</figure>
<figure style="margin:0;">
<a href="screenshots/uat_agent_step_002.png"><img src="screenshots/uat_agent_step_002.png" alt="UAT agent step 002" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
<figcaption style="text-align:center; font-size:0.9em;">Step 002</figcaption>
</figure>
<figure style="margin:0;">
<a href="screenshots/uat_agent_step_003.png"><img src="screenshots/uat_agent_step_003.png" alt="UAT agent step 003" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
<figcaption style="text-align:center; font-size:0.9em;">Step 003</figcaption>
</figure>
<figure style="margin:0;">
<a href="screenshots/uat_agent_step_004.png"><img src="screenshots/uat_agent_step_004.png" alt="UAT agent step 004" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
<figcaption style="text-align:center; font-size:0.9em;">Step 004</figcaption>
</figure>
<figure style="margin:0;">
<a href="screenshots/uat_agent_step_005.png"><img src="screenshots/uat_agent_step_005.png" alt="UAT agent step 005" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
<figcaption style="text-align:center; font-size:0.9em;">Step 005</figcaption>
</figure>
<figure style="margin:0;">
<a href="screenshots/uat_agent_step_006.png"><img src="screenshots/uat_agent_step_006.png" alt="UAT agent step 006" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
<figcaption style="text-align:center; font-size:0.9em;">Step 006</figcaption>
</figure>
<figure style="margin:0;">
<a href="screenshots/uat_agent_step_007.png"><img src="screenshots/uat_agent_step_007.png" alt="UAT agent step 007" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
<figcaption style="text-align:center; font-size:0.9em;">Step 007</figcaption>
</figure>
<figure style="margin:0;">
<a href="screenshots/uat_agent_step_008.png"><img src="screenshots/uat_agent_step_008.png" alt="UAT agent step 008" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
<figcaption style="text-align:center; font-size:0.9em;">Step 008</figcaption>
</figure>
<figure style="margin:0;">
<a href="screenshots/uat_agent_step_010.png"><img src="screenshots/uat_agent_step_010.png" alt="UAT agent step 010" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
<figcaption style="text-align:center; font-size:0.9em;">Step 010</figcaption>
</figure>
<figure style="margin:0;">
<a href="screenshots/uat_agent_step_011.png"><img src="screenshots/uat_agent_step_011.png" alt="UAT agent step 011" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
<figcaption style="text-align:center; font-size:0.9em;">Step 011</figcaption>
</figure>
<figure style="margin:0;">
<a href="screenshots/uat_agent_step_012.png"><img src="screenshots/uat_agent_step_012.png" alt="UAT agent step 012" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
<figcaption style="text-align:center; font-size:0.9em;">Step 012</figcaption>
</figure>
<figure style="margin:0;">
<a href="screenshots/uat_agent_step_013.png"><img src="screenshots/uat_agent_step_013.png" alt="UAT agent step 013" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
<figcaption style="text-align:center; font-size:0.9em;">Step 013</figcaption>
</figure>
</div>
<details>
<summary>Full-size images (stacked)</summary>
{{< figure src="screenshots/uat_agent_step_001.png" caption="Step 001" >}}
{{< figure src="screenshots/uat_agent_step_002.png" caption="Step 002" >}}
{{< figure src="screenshots/uat_agent_step_003.png" caption="Step 003" >}}
{{< figure src="screenshots/uat_agent_step_004.png" caption="Step 004" >}}
{{< figure src="screenshots/uat_agent_step_005.png" caption="Step 005" >}}
{{< figure src="screenshots/uat_agent_step_006.png" caption="Step 006" >}}
{{< figure src="screenshots/uat_agent_step_007.png" caption="Step 007" >}}
{{< figure src="screenshots/uat_agent_step_008.png" caption="Step 008" >}}
{{< figure src="screenshots/uat_agent_step_010.png" caption="Step 010" >}}
{{< figure src="screenshots/uat_agent_step_011.png" caption="Step 011" >}}
{{< figure src="screenshots/uat_agent_step_012.png" caption="Step 012" >}}
{{< figure src="screenshots/uat_agent_step_013.png" caption="Step 013" >}}
</details>
## Notes
- If repo size becomes an issue, publish only a curated subset (e.g. 6–8 key frames) and link to the full run folder externally.

12 screenshot PNGs added as binary files (not shown; 122 KiB – 968 KiB each).

@ -0,0 +1,36 @@
---
title: "Logs & Artifacts"
linkTitle: "Logs & Artifacts"
weight: 2
description: >
Where to find logs, screenshots, and reports relevant to D66
---
# Logs & Artifacts
## Repo locations
- Local calibration and run logs: `logs/`
- Script outputs (varies by run):
- `Backend/IPCEI-UX-Agent-S3/staging_scripts/uxqa.db`
- `Backend/IPCEI-UX-Agent-S3/staging_scripts/Screenshots/`
- `Backend/IPCEI-UX-Agent-S3/staging_scripts/agent_output/`
- Golden run evidence pack (recommended publishing location in docs):
- `docs/D66/results/golden-run-telekom-header-nav/`
## What to capture for D66
- A representative run per capability:
- functional correctness checks
- visual quality audits
- task-based UX smoke tests
- For each run, capture:
- target URL
- timestamp
- key screenshots/overlays
- issue summaries (structured)
## Notes
If needed, we can add a consistent run naming convention and a small “how to export a D66 evidence pack” procedure.
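One possible shape for such an export step, assuming the default output layout of `gui_agent_cli.py` (the function name and metadata file are illustrative, not an existing procedure):

```shell
# Hedged sketch of an evidence-pack export, assuming the default
# results layout of gui_agent_cli.py.
export_evidence_pack() {
  run_dir="$1"    # e.g. ./results/gui_agent_cli/<timestamp>
  pack_dir="$2"   # e.g. docs/D66/results/<run-name>
  mkdir -p "$pack_dir/screenshots" "$pack_dir/logs"
  cp "$run_dir"/screenshots/*.png "$pack_dir/screenshots/" 2>/dev/null
  cp "$run_dir"/logs/run.log "$pack_dir/logs/" 2>/dev/null
  # Record when the pack was exported, as minimal run metadata:
  date -u +"%Y-%m-%dT%H:%M:%SZ" > "$pack_dir/exported_at.txt"
}
```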


@ -0,0 +1,29 @@
---
title: "PoC Validation"
linkTitle: "PoC Validation"
weight: 1
description: >
What was validated and where to find the evidence
---
# PoC Validation Evidence
## What was validated
- Autonomous GUI interaction via the Autonomous UAT Agent (Agent S3-based scripts)
- Generation of UX findings and recommendations
- Production of reproducible artifacts (screenshots, logs)
## Where to find evidence in this repo
- Run logs and calibration logs: `logs/`
- Story evidence and investigation notes:
- `docs/story-025-001-context.md`
- `docs/story-026-001-context.md`
- `docs/story-023-003-coordinate-space-detection.md`
## How to reproduce a run
1. Choose a script in `Backend/IPCEI-UX-Agent-S3/staging_scripts/`
2. Set target URL (if supported) via `AS2_TARGET_URL`
3. Run and capture artifacts (see `docs/D66/documentation/outputs-and-artifacts.md`)
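The three steps above can be sketched as follows; the script choice and URL are only examples, and the final command is echoed rather than executed so the sketch stays side-effect free:

```shell
# Hedged sketch of the reproduction steps: pick a script (step 1),
# set the target URL (step 2), then run it (step 3).
script="staging_scripts/1_UI_functional_correctness_check.py"
export AS2_TARGET_URL="https://www.leipzig.de"
echo "python $script --auto-yes --ask-every 1000"
```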


@ -8,114 +8,9 @@ description: >
# Running Autonomous UAT Agent Scripts
All commands below assume you are running from the **Agent-S repository root** (Linux/ECS), i.e. the folder that contains `staging_scripts/`.
The **Autonomous UAT Agent** is the overall UX/UI testing use case built on top of the Agent S codebase and scripts in this repo.
If you are inside the monorepo workspace, first `cd ~/Projects/Agent_S3/Agent-S` on the Ubuntu ECS and then run the same commands.
## One-command recommended run (ECS)
If you only run one thing to produce clean, repeatable evidence (screenshots with click markers), run the calibration CLI:
```bash
DISPLAY=:1 python staging_scripts/gui_agent_cli.py --prompt "Go to telekom.de and click the cart icon" --max-steps 10
```
This writes screenshots to `./results/gui_agent_cli/<timestamp>/screenshots/`.
## ECS runner notes
- **Working directory matters:** the default output path is relative to the current working directory (it should be the Agent-S repo root on ECS).
- **GUI required:** `pyautogui` needs an X server (`DISPLAY=:1` is assumed by most scripts).
- **Persistence:** if you want results after the task ends, ensure `./results/` is on a mounted volume or copied out as an artifact.
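One way to satisfy the persistence note is to archive the results folder before the session ends. The helper below is a sketch, not an existing script:

```shell
# Hedged sketch: pack the results folder into a timestamped tarball
# that can be copied out (e.g. via scp) before the ECS task ends.
archive_results() {
  src="${1:-./results}"
  out="results_$(date -u +%Y%m%dT%H%M%SZ).tar.gz"
  tar -czf "$out" "$src" && echo "$out"
}
# Example: archive_results ./results
```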
## Prerequisites (runtime)
- Linux GUI session (VNC/Xvfb) because these scripts drive a real browser via `pyautogui`.
- A working `DISPLAY` (most of the scripts assume `:1`).
- Network access to the model endpoints (thinking + vision/grounding).
Common environment variables used by the vLLM-backed scripts:
- `vLLM_THINKING_ENDPOINT` (default in code if unset)
- `vLLM_VISION_ENDPOINT` (default in code if unset)
- `vLLM_API_KEY` (default: `dummy-key`)
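A hedged sketch of setting these explicitly before a run: the thinking endpoint matches the deployment reference in the model-stack page, but the vision endpoint here is a placeholder, not a known address.

```shell
# Export the endpoints explicitly instead of relying on in-code defaults.
export vLLM_THINKING_ENDPOINT="${vLLM_THINKING_ENDPOINT:-http://164.30.28.242:8001/v1}"
# Placeholder only -- substitute your deployment's vision endpoint:
export vLLM_VISION_ENDPOINT="${vLLM_VISION_ENDPOINT:-http://<vision-host>:<port>/v1}"
export vLLM_API_KEY="${vLLM_API_KEY:-dummy-key}"
```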
## Key scripts (repo locations)
Core scripts referenced for D66 demonstrations:
- UI check (Agent S3): `staging_scripts/1_UI_check_AS3.py`
- Functional correctness check: `staging_scripts/1_UI_functional_correctness_check.py`
- Visual quality audit: `staging_scripts/2_UX_visual_quality_audit.py`
- Task-based UX flow (newsletter): `staging_scripts/3_UX_taskflow_newsletter_signup.py`
Calibration / CLI entry point (used for click coordinate scaling validation):
- GUI Agent CLI (Holo click calibration): `staging_scripts/gui_agent_cli.py`
Legacy / historical:
- `staging_scripts/old scripts/agent_s3_1_old.py`
- `staging_scripts/old scripts/agent_s3_ui_test.py`
## Common configuration knobs
Many scripts support these environment variables:
- `AS2_TARGET_URL`: website URL to test
- `AS2_MAX_STEPS`: max steps (varies by script)
- `ASK_EVERY_STEPS`: interactive prompt cadence
Execution environment:
- Linux GUI environment typically expects `DISPLAY=:1`
## Recommended: run gui_agent_cli.py (calibration / click precision)
This is the “clean” CLI entry point for repeatable calibration runs.
Minimal run (prompt mode):
```bash
python staging_scripts/gui_agent_cli.py \
--prompt "Go to telekom.de and click the cart icon" \
--max-steps 30
```
Optional scaling factors for debugging (defaults to `1.0` / `1.0`):
```bash
python staging_scripts/gui_agent_cli.py \
--prompt "Go to telekom.de and click the cart icon" \
--x-scale 2.0 \
--y-scale 2.0 \
--max-steps 30
```
Outputs:
- Default run folder: `./results/gui_agent_cli/<timestamp>/`
- Screenshots: `./results/gui_agent_cli/<timestamp>/screenshots/`
- Text log (stdout/stderr): `./results/gui_agent_cli/<timestamp>/logs/run.log`
If `--enable-logging` is set, the script also writes a structured JSON communication log (Story 026-002) into the same run `logs/` folder by default.
Enable model communication logging (recommended when debugging mis-clicks):
```bash
python staging_scripts/gui_agent_cli.py \
--prompt "Click the Telekom icon" \
--max-steps 10 \
--output-dir ./results/gui_agent_cli/debug_run_telekom_icon \
--enable-logging \
--log-output-dir ./results/gui_agent_cli/debug_run_telekom_icon/logs
```
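When several runs have accumulated under `./results/gui_agent_cli/`, a small helper can locate the newest run folder (assuming the timestamp-named folders sort chronologically; the function name is illustrative):

```shell
# Hedged sketch: return the most recent run folder under a base path.
latest_run_dir() {
  ls -1d "$1"/*/ 2>/dev/null | sort | tail -n 1
}
# Example: latest_run_dir ./results/gui_agent_cli
```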
## Golden run (terminal on ECS)
This is the “golden run” command sequence currently used for D66 evidence generation.
All commands below assume you are running from the **Agent-S repository root** (Linux/ECS), `~/Projects/Agent_S3/Agent-S`. To do that, connect to the server via SSH. You will need a key pair for authentication and an open inbound port in the firewall. For information on how to obtain the key pair and request firewall access, contact [tom.sakretz@telekom.de](mailto:tom.sakretz@telekom.de).
### 1) Connect from Windows
@ -127,7 +22,6 @@ ssh -i "C:\Path to KeyPair\KeyPair-ECS.pem" ubuntu@80.158.3.120
```bash
# Activate venv
# Recommended: use the Agent S3 venv
source ~/Projects/Agent_S3/Agent-S/venv/bin/activate
# Go to Agent-S repo root
@ -140,7 +34,45 @@ export DISPLAY=":1"
firefox &
```
### 3) Run the golden prompt
### 3) One-command recommended run (ECS)
If you only run one thing to produce clean, repeatable evidence (screenshots with click markers), run the calibration CLI:
```bash
python staging_scripts/gui_agent_cli.py --prompt "Go to telekom.de and click the cart icon" --max-steps 10
```
This will produce:
- Screenshots: `./results/gui_agent_cli/<timestamp>/screenshots/`
- Text log: `./results/gui_agent_cli/<timestamp>/logs/run.log`
- Optional JSON comm log (if `--enable-logging` is set): `./results/gui_agent_cli/<timestamp>/logs/calibration_log_*.json`
## Prerequisites (runtime)
- Linux GUI session (VNC/Xvfb) because these scripts drive a real browser via `pyautogui`.
- A working `DISPLAY` (default for all scripts is `:1`).
- Network access to the model endpoints (thinking + vision/grounding).
## Key scripts (repo locations)
The GUI Agent CLI script is the most flexible entry point and is therefore the only one described in detail in this documentation. Paths assume you are in the project root `~/Projects/Agent_S3/Agent-S`.
- GUI Agent CLI: `staging_scripts/gui_agent_cli.py`
Historically, we used purpose-built scripts for individual tasks. We now recommend using `gui_agent_cli.py` as the primary entry point, because the same scenarios can usually be expressed via a well-scoped prompt while keeping the workflow more flexible and easier to maintain. The scripts below are kept for reference and may not reflect the current, preferred workflow.
- UI check (Agent S3): `staging_scripts/1_UI_check_AS3.py`
- Functional correctness check: `staging_scripts/1_UI_functional_correctness_check.py`
- Visual quality audit: `staging_scripts/2_UX_visual_quality_audit.py`
- Task-based UX flow (newsletter): `staging_scripts/3_UX_taskflow_newsletter_signup.py`
## Golden run (terminal on ECS)
This is the “golden run” command sequence currently used for D66 evidence generation. The golden run is a complete workflow that serves as a template for reproducible results.
```bash
python staging_scripts/gui_agent_cli.py \
@ -167,63 +99,14 @@ Golden run artifacts:
- Text log: `./results/gui_agent_cli/<timestamp>/logs/run.log`
- Optional JSON comm log (if enabled): `./results/gui_agent_cli/<timestamp>/logs/calibration_log_*.json`
An example golden run with screenshots and log outputs can be seen in [Results](./results/).
## Alternative: run the agent via a web interface (Frontend)
Work in progress.
We are currently updating the web-based view and its ECS runner integration. This section will be filled with the correct, up-to-date instructions once the frontend flow supports the current Autonomous UAT Agent + `gui_agent_cli.py` workflow.
## Run the D66 evaluation scripts (staging_scripts)
These scripts are used for D66-style evaluation runs and tend to write their artifacts into `staging_scripts/` (DB, screenshots, JSON).
### UI check (Agent S3)
Typical pattern (URL via env var + optional run control args):
```bash
export AS2_TARGET_URL="https://www.leipzig.de"
export AS2_MAX_STEPS="20"
python staging_scripts/1_UI_check_AS3.py --auto-yes --ask-every 1000
```
Notes:
- Supports `--job-id <id>` (used by runners) and uses `JOB_ID` as a fallback.
- Writes JSON to `./agent_output/raw_json/<job_id>/` and screenshots/overlays to `staging_scripts/Screenshots/...`.
### Functional correctness check
```bash
export AS2_TARGET_URL="https://www.leipzig.de"
export AS2_MAX_STEPS="0" # 0 = no limit (script-specific)
python staging_scripts/1_UI_functional_correctness_check.py --auto-yes --ask-every 1000
```
### Visual quality audit
This script currently uses a hardcoded `WEBSITE_URL` near the top of the file. Update it and then run:
```bash
python staging_scripts/2_UX_visual_quality_audit.py --auto-yes --ask-every 10
```
### Task-based UX flow (newsletter)
This script is currently a staging/WIP script; verify it runs in your environment before relying on it for evidence.
## Outputs to expect
Most scripts record one or more of:
- `uxqa.db` (run log DB)
- screenshots/overlays under `staging_scripts/Screenshots/...`
- JSON step outputs under `agent_output/` (paths vary by script)
- calibration CLI outputs under `./results/gui_agent_cli/<timestamp>/`
See [Outputs & Artifacts](./outputs-and-artifacts.md).
## Notes on model usage