Update to running auata scripts and results
|
|
@ -6,13 +6,13 @@ description: >
|
|||
Thinking vs grounding model split for D66 (current state and target state)
|
||||
---
|
||||
|
||||
# Model Stack (D66)
|
||||
# Model Stack
|
||||
|
||||
For a visual overview of how the models interact with the VNC-based GUI automation loop, see: [Workflow Diagram](./agent-workflow-diagram.md)
|
||||
|
||||
## Requirement
|
||||
|
||||
D66 must use **open-source models from European companies**.
|
||||
The Autonomous UAT Agent must use **open-source models from European companies**. This has been a project requirement form the very beginnning of this project.
|
||||
|
||||
## Target setup
|
||||
|
||||
|
|
@ -45,7 +45,6 @@ The Agent S framework runs an iterative loop: it uses a reasoning model to decid
|
|||
- Deployment: vLLM OpenAI-compatible endpoint (chat completions)
|
||||
- Endpoint env var: `vLLM_THINKING_ENDPOINT`
|
||||
- Current server (deployment reference): `http://164.30.28.242:8001/v1`
|
||||
- Recommendation: set `vLLM_THINKING_ENDPOINT` explicitly (do not rely on script defaults).
|
||||
|
||||
**Operational note:** vLLM is configured to **auto-start on server boot** (OTC ECS restart) via `systemd`.
|
||||
|
||||
|
|
@ -80,9 +79,4 @@ The Agent S framework runs an iterative loop: it uses a reasoning model to decid
|
|||
- `grounding_width`: `3840`
|
||||
- `grounding_height`: `2160`
|
||||
|
||||
Notes:
|
||||
|
||||
- Prompting and output-format hardening (reliability work):
|
||||
- `docs/story-026-001-context.md` (Holo output reliability)
|
||||
- `docs/story-025-001-context.md` (double grounding / calibration)
|
||||
|
||||
|
|
|
|||
17
content/en/docs/Autonomous UAT Agent/results/_index.md
Normal file
|
|
@ -0,0 +1,17 @@
|
|||
---
|
||||
title: "Results & Findings"
|
||||
linkTitle: "Results"
|
||||
weight: 20
|
||||
description: >
|
||||
Results, findings, and evidence artifacts for D66
|
||||
---
|
||||
|
||||
# Results & Findings (D66)
|
||||
|
||||
This section contains the outputs that support D66 claims: findings summaries and pointers to logs, screenshots, and run artifacts.
|
||||
|
||||
## Pages
|
||||
|
||||
- [PoC Validation](./poc-validation.md)
|
||||
- [Golden Run (Telekom Header Navigation)](./golden-run-telekom-header-nav/)
|
||||
- [Logs & Artifacts](./logs-and-artifacts.md)
|
||||
|
|
@ -0,0 +1,142 @@
|
|||
---
|
||||
title: "Golden Run: Telekom Header Navigation"
|
||||
linkTitle: "Golden Run (Telekom)"
|
||||
weight: 3
|
||||
description: >
|
||||
Evidence pack (screenshots + logs) for the golden run on www.telekom.de header navigation
|
||||
---
|
||||
|
||||
# Golden Run: Telekom Header Navigation
|
||||
|
||||
This page is the evidence pack for the **Autonomous UAT Agent** golden run on **www.telekom.de**.
|
||||
|
||||
## Run intent
|
||||
|
||||
- Goal: Test interactive elements in the header navigation for functional weaknesses
|
||||
- Output: Click-marked screenshots + per-run log (and optionally model communication JSON)
|
||||
|
||||
## How the run was executed (ECS)
|
||||
|
||||
Command (as used in the runbook):
|
||||
|
||||
```bash
|
||||
python staging_scripts/gui_agent_cli.py \
|
||||
--prompt "Role: You are a UI/UX testing agent specializing in functional correctness.
|
||||
Goal: Test all interactive elements in the header navigation on www.telekom.de for functional weaknesses.
|
||||
Tasks:
|
||||
1. Navigate to the website
|
||||
2. Identify and test interactive elements (buttons, links, forms, menus)
|
||||
3. Check for broken flows, defective links, non-functioning elements
|
||||
4. Document issues found
|
||||
Report Format:
|
||||
Return findings in the 'issues' field as a list of objects:
|
||||
- element: Name/description of the element
|
||||
- location: Where on the page
|
||||
- problem: What doesn't work
|
||||
- recommendation: How to fix it
|
||||
If no problems found, return an empty array: []" \
|
||||
--max-steps 30
|
||||
```
|
||||
|
||||
## Artifacts
|
||||
|
||||
### In-repo evidence (this page bundle)
|
||||
|
||||
Place the evidence files here:
|
||||
|
||||
- Screenshots: `screenshots/`
|
||||
- Text log: `logs/run.log`
|
||||
- Optional JSON communication log(s): `logs/calibration_log_*.json`
|
||||
|
||||
If you have ~15 screenshots, name them in a stable order, e.g.:
|
||||
|
||||
- `screenshots/uat_agent_step_001.png` … `screenshots/uat_agent_step_015.png`
|
||||
|
||||
### Runtime output location (where they come from)
|
||||
|
||||
The CLI defaults to:
|
||||
|
||||
- `./results/gui_agent_cli/<timestamp>/screenshots/`
|
||||
- `./results/gui_agent_cli/<timestamp>/logs/run.log`
|
||||
|
||||
Copy the files you want to publish into this page bundle so they render in the docs.
|
||||
|
||||
## Screenshot gallery
|
||||
|
||||
### Thumbnail grid (recommended for many screenshots)
|
||||
|
||||
Click any thumbnail to open the full image.
|
||||
|
||||
<div style="display:grid; grid-template-columns: repeat(auto-fit, minmax(240px, 1fr)); gap: 12px; align-items:start;">
|
||||
<figure style="margin:0;">
|
||||
<a href="screenshots/uat_agent_step_001.png"><img src="screenshots/uat_agent_step_001.png" alt="UAT agent step 001" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
|
||||
<figcaption style="text-align:center; font-size:0.9em;">Step 001</figcaption>
|
||||
</figure>
|
||||
<figure style="margin:0;">
|
||||
<a href="screenshots/uat_agent_step_002.png"><img src="screenshots/uat_agent_step_002.png" alt="UAT agent step 002" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
|
||||
<figcaption style="text-align:center; font-size:0.9em;">Step 002</figcaption>
|
||||
</figure>
|
||||
<figure style="margin:0;">
|
||||
<a href="screenshots/uat_agent_step_003.png"><img src="screenshots/uat_agent_step_003.png" alt="UAT agent step 003" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
|
||||
<figcaption style="text-align:center; font-size:0.9em;">Step 003</figcaption>
|
||||
</figure>
|
||||
<figure style="margin:0;">
|
||||
<a href="screenshots/uat_agent_step_004.png"><img src="screenshots/uat_agent_step_004.png" alt="UAT agent step 004" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
|
||||
<figcaption style="text-align:center; font-size:0.9em;">Step 004</figcaption>
|
||||
</figure>
|
||||
<figure style="margin:0;">
|
||||
<a href="screenshots/uat_agent_step_005.png"><img src="screenshots/uat_agent_step_005.png" alt="UAT agent step 005" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
|
||||
<figcaption style="text-align:center; font-size:0.9em;">Step 005</figcaption>
|
||||
</figure>
|
||||
<figure style="margin:0;">
|
||||
<a href="screenshots/uat_agent_step_006.png"><img src="screenshots/uat_agent_step_006.png" alt="UAT agent step 006" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
|
||||
<figcaption style="text-align:center; font-size:0.9em;">Step 006</figcaption>
|
||||
</figure>
|
||||
<figure style="margin:0;">
|
||||
<a href="screenshots/uat_agent_step_007.png"><img src="screenshots/uat_agent_step_007.png" alt="UAT agent step 007" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
|
||||
<figcaption style="text-align:center; font-size:0.9em;">Step 007</figcaption>
|
||||
</figure>
|
||||
<figure style="margin:0;">
|
||||
<a href="screenshots/uat_agent_step_008.png"><img src="screenshots/uat_agent_step_008.png" alt="UAT agent step 008" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
|
||||
<figcaption style="text-align:center; font-size:0.9em;">Step 008</figcaption>
|
||||
</figure>
|
||||
<figure style="margin:0;">
|
||||
<a href="screenshots/uat_agent_step_010.png"><img src="screenshots/uat_agent_step_010.png" alt="UAT agent step 010" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
|
||||
<figcaption style="text-align:center; font-size:0.9em;">Step 010</figcaption>
|
||||
</figure>
|
||||
<figure style="margin:0;">
|
||||
<a href="screenshots/uat_agent_step_011.png"><img src="screenshots/uat_agent_step_011.png" alt="UAT agent step 011" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
|
||||
<figcaption style="text-align:center; font-size:0.9em;">Step 011</figcaption>
|
||||
</figure>
|
||||
<figure style="margin:0;">
|
||||
<a href="screenshots/uat_agent_step_012.png"><img src="screenshots/uat_agent_step_012.png" alt="UAT agent step 012" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
|
||||
<figcaption style="text-align:center; font-size:0.9em;">Step 012</figcaption>
|
||||
</figure>
|
||||
<figure style="margin:0;">
|
||||
<a href="screenshots/uat_agent_step_013.png"><img src="screenshots/uat_agent_step_013.png" alt="UAT agent step 013" style="width:100%; height:auto; border:1px solid #ddd; border-radius:6px;" /></a>
|
||||
<figcaption style="text-align:center; font-size:0.9em;">Step 013</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
|
||||
<details>
|
||||
<summary>Full-size images (stacked)</summary>
|
||||
|
||||
{{< figure src="screenshots/uat_agent_step_001.png" caption="Step 001" >}}
|
||||
{{< figure src="screenshots/uat_agent_step_002.png" caption="Step 002" >}}
|
||||
{{< figure src="screenshots/uat_agent_step_003.png" caption="Step 003" >}}
|
||||
{{< figure src="screenshots/uat_agent_step_004.png" caption="Step 004" >}}
|
||||
{{< figure src="screenshots/uat_agent_step_005.png" caption="Step 005" >}}
|
||||
{{< figure src="screenshots/uat_agent_step_006.png" caption="Step 006" >}}
|
||||
{{< figure src="screenshots/uat_agent_step_007.png" caption="Step 007" >}}
|
||||
{{< figure src="screenshots/uat_agent_step_008.png" caption="Step 008" >}}
|
||||
{{< figure src="screenshots/uat_agent_step_010.png" caption="Step 010" >}}
|
||||
{{< figure src="screenshots/uat_agent_step_011.png" caption="Step 011" >}}
|
||||
{{< figure src="screenshots/uat_agent_step_012.png" caption="Step 012" >}}
|
||||
{{< figure src="screenshots/uat_agent_step_013.png" caption="Step 013" >}}
|
||||
|
||||
</details>
|
||||
|
||||
## Notes
|
||||
|
||||
- If repo size becomes an issue, publish only a curated subset (e.g. 6–8 key frames) and link to the full run folder externally.
|
||||
- If you want a thumbnail grid instead of full-width figures, say so and BMad Master will add a compact gallery layout.
|
||||
|
After Width: | Height: | Size: 122 KiB |
|
After Width: | Height: | Size: 880 KiB |
|
After Width: | Height: | Size: 502 KiB |
|
After Width: | Height: | Size: 251 KiB |
|
After Width: | Height: | Size: 777 KiB |
|
After Width: | Height: | Size: 968 KiB |
|
After Width: | Height: | Size: 578 KiB |
|
After Width: | Height: | Size: 527 KiB |
|
After Width: | Height: | Size: 191 KiB |
|
After Width: | Height: | Size: 501 KiB |
|
After Width: | Height: | Size: 881 KiB |
|
After Width: | Height: | Size: 884 KiB |
|
|
@ -0,0 +1,36 @@
|
|||
---
|
||||
title: "Logs & Artifacts"
|
||||
linkTitle: "Logs & Artifacts"
|
||||
weight: 2
|
||||
description: >
|
||||
Where to find logs, screenshots, and reports relevant to D66
|
||||
---
|
||||
|
||||
# Logs & Artifacts
|
||||
|
||||
## Repo locations
|
||||
|
||||
- Local calibration and run logs: `logs/`
|
||||
- Script outputs (varies by run):
|
||||
- `Backend/IPCEI-UX-Agent-S3/staging_scripts/uxqa.db`
|
||||
- `Backend/IPCEI-UX-Agent-S3/staging_scripts/Screenshots/`
|
||||
- `Backend/IPCEI-UX-Agent-S3/staging_scripts/agent_output/`
|
||||
|
||||
- Golden run evidence pack (recommended publishing location in docs):
|
||||
- `docs/D66/results/golden-run-telekom-header-nav/`
|
||||
|
||||
## What to capture for D66
|
||||
|
||||
- A representative run per capability:
|
||||
- functional correctness checks
|
||||
- visual quality audits
|
||||
- task-based UX smoke tests
|
||||
- For each run, capture:
|
||||
- target URL
|
||||
- timestamp
|
||||
- key screenshots/overlays
|
||||
- issue summaries (structured)
|
||||
|
||||
## Notes
|
||||
|
||||
If needed, we can add a consistent run naming convention and a small “how to export a D66 evidence pack” procedure.
|
||||
|
|
@ -0,0 +1,29 @@
|
|||
---
|
||||
title: "PoC Validation"
|
||||
linkTitle: "PoC Validation"
|
||||
weight: 1
|
||||
description: >
|
||||
What was validated and where to find the evidence
|
||||
---
|
||||
|
||||
# PoC Validation Evidence
|
||||
|
||||
## What was validated
|
||||
|
||||
- Autonomous GUI interaction via the Autonomous UAT Agent (Agent S3-based scripts)
|
||||
- Generation of UX findings and recommendations
|
||||
- Production of reproducible artifacts (screenshots, logs)
|
||||
|
||||
## Where to find evidence in this repo
|
||||
|
||||
- Run logs and calibration logs: `logs/`
|
||||
- Story evidence and investigation notes:
|
||||
- `docs/story-025-001-context.md`
|
||||
- `docs/story-026-001-context.md`
|
||||
- `docs/story-023-003-coordinate-space-detection.md`
|
||||
|
||||
## How to reproduce a run
|
||||
|
||||
1. Choose a script in `Backend/IPCEI-UX-Agent-S3/staging_scripts/`
|
||||
2. Set target URL (if supported) via `AS2_TARGET_URL`
|
||||
3. Run and capture artifacts (see `docs/D66/documentation/outputs-and-artifacts.md`)
|
||||
|
|
@ -8,114 +8,9 @@ description: >
|
|||
|
||||
# Running Autonomous UAT Agent Scripts
|
||||
|
||||
All commands below assume you are running from the **Agent-S repository root** (Linux/ECS), i.e. the folder that contains `staging_scripts/`.
|
||||
|
||||
The **Autonomous UAT Agent** is the overall UX/UI testing use case built on top of the Agent S codebase and scripts in this repo.
|
||||
|
||||
If you are inside the monorepo workspace, first `cd ~/Projects/Agent_S3/Agent-S` on the Ubuntu ECS and then run the same commands.
|
||||
|
||||
## One-command recommended run (ECS)
|
||||
|
||||
If you only run one thing to produce clean, repeatable evidence (screenshots with click markers), run the calibration CLI:
|
||||
|
||||
```bash
|
||||
DISPLAY=:1 python staging_scripts/gui_agent_cli.py --prompt "Go to telekom.de and click the cart icon" --max-steps 10
|
||||
```
|
||||
|
||||
This writes screenshots to `./results/gui_agent_cli/<timestamp>/screenshots/`.
|
||||
|
||||
## ECS runner notes
|
||||
|
||||
- **Working directory matters:** the default output path is relative to the current working directory (it should be the Agent-S repo root on ECS).
|
||||
- **GUI required:** `pyautogui` needs an X server (`DISPLAY=:1` is assumed by most scripts).
|
||||
- **Persistence:** if you want results after the task ends, ensure `./results/` is on a mounted volume or copied out as an artifact.
|
||||
|
||||
## Prerequisites (runtime)
|
||||
|
||||
- Linux GUI session (VNC/Xvfb) because these scripts drive a real browser via `pyautogui`.
|
||||
- A working `DISPLAY` (most of the scripts assume `:1`).
|
||||
- Network access to the model endpoints (thinking + vision/grounding).
|
||||
|
||||
Common environment variables used by the vLLM-backed scripts:
|
||||
|
||||
- `vLLM_THINKING_ENDPOINT` (default in code if unset)
|
||||
- `vLLM_VISION_ENDPOINT` (default in code if unset)
|
||||
- `vLLM_API_KEY` (default: `dummy-key`)
|
||||
|
||||
## Key scripts (repo locations)
|
||||
|
||||
Core scripts referenced for D66 demonstrations:
|
||||
|
||||
- UI check (Agent S3): `staging_scripts/1_UI_check_AS3.py`
|
||||
- Functional correctness check: `staging_scripts/1_UI_functional_correctness_check.py`
|
||||
- Visual quality audit: `staging_scripts/2_UX_visual_quality_audit.py`
|
||||
- Task-based UX flow (newsletter): `staging_scripts/3_UX_taskflow_newsletter_signup.py`
|
||||
|
||||
Calibration / CLI entry point (used for click coordinate scaling validation):
|
||||
|
||||
- GUI Agent CLI (Holo click calibration): `staging_scripts/gui_agent_cli.py`
|
||||
|
||||
Legacy / historical:
|
||||
|
||||
- `staging_scripts/old scripts/agent_s3_1_old.py`
|
||||
- `staging_scripts/old scripts/agent_s3_ui_test.py`
|
||||
|
||||
## Common configuration knobs
|
||||
|
||||
Many scripts support these environment variables:
|
||||
|
||||
- `AS2_TARGET_URL`: website URL to test
|
||||
- `AS2_MAX_STEPS`: max steps (varies by script)
|
||||
- `ASK_EVERY_STEPS`: interactive prompt cadence
|
||||
|
||||
Execution environment:
|
||||
|
||||
- Linux GUI environment typically expects `DISPLAY=:1`
|
||||
|
||||
## Recommended: run gui_agent_cli.py (calibration / click precision)
|
||||
|
||||
This is the “clean” CLI entry point for repeatable calibration runs.
|
||||
|
||||
Minimal run (prompt mode):
|
||||
|
||||
```bash
|
||||
python staging_scripts/gui_agent_cli.py \
|
||||
--prompt "Go to telekom.de and click the cart icon" \
|
||||
--max-steps 30
|
||||
```
|
||||
|
||||
Optional scaling factors for debugging (defaults to `1.0` / `1.0`):
|
||||
|
||||
```bash
|
||||
python staging_scripts/gui_agent_cli.py \
|
||||
--prompt "Go to telekom.de and click the cart icon" \
|
||||
--x-scale 2.0 \
|
||||
--y-scale 2.0 \
|
||||
--max-steps 30
|
||||
```
|
||||
|
||||
Outputs:
|
||||
|
||||
- Default run folder: `./results/gui_agent_cli/<timestamp>/`
|
||||
- Screenshots: `./results/gui_agent_cli/<timestamp>/screenshots/`
|
||||
- Text log (stdout/stderr): `./results/gui_agent_cli/<timestamp>/logs/run.log`
|
||||
|
||||
If `--enable-logging` is set, the script also writes a structured JSON communication log (Story 026-002) into the same run `logs/` folder by default.
|
||||
|
||||
Enable model communication logging (recommended when debugging mis-clicks):
|
||||
|
||||
```bash
|
||||
python staging_scripts/gui_agent_cli.py \
|
||||
--prompt "Click the Telekom icon" \
|
||||
--max-steps 10 \
|
||||
--output-dir ./results/gui_agent_cli/debug_run_telekom_icon \
|
||||
--enable-logging \
|
||||
--log-output-dir ./results/gui_agent_cli/debug_run_telekom_icon/logs
|
||||
```
|
||||
|
||||
## Golden run (terminal on ECS)
|
||||
|
||||
This is the “golden run” command sequence currently used for D66 evidence generation.
|
||||
All commands below assume you are running from the **Agent-S repository root** (Linux/ECS), `~/Projects/Agent_S3/Agent-S`. To do that, connect to the server via SSH. You will need a key pair for authentication and an open inbound port in the firewall. For information on how to obtain the key pair and request firewall access, contact [tom.sakretz@telekom.de](mailto:tom.sakretz@telekom.de).
|
||||
|
||||
### 1) Connect from Windows
|
||||
|
||||
|
|
@ -127,7 +22,6 @@ ssh -i "C:\Path to KeyPair\KeyPair-ECS.pem" ubuntu@80.158.3.120
|
|||
|
||||
```bash
|
||||
# Activate venv
|
||||
# Recommended: use the Agent S3 venv
|
||||
source ~/Projects/Agent_S3/Agent-S/venv/bin/activate
|
||||
|
||||
# Go to Agent-S repo root
|
||||
|
|
@ -140,7 +34,45 @@ export DISPLAY=":1"
|
|||
firefox &
|
||||
```
|
||||
|
||||
### 3) Run the golden prompt
|
||||
### 3) One-command recommended run (ECS)
|
||||
|
||||
If you only run one thing to produce clean, repeatable evidence (screenshots with click markers), run the following command CLI:
|
||||
|
||||
```bash
|
||||
python staging_scripts/gui_agent_cli.py --prompt "Go to telekom.de and click the cart icon" --max-steps 10
|
||||
```
|
||||
|
||||
This will produce:
|
||||
|
||||
- Screenshots: `./results/gui_agent_cli/<timestamp>/screenshots/`
|
||||
- Text log: `./results/gui_agent_cli/<timestamp>/logs/run.log`
|
||||
- JSON comm log: `./results/gui_agent_cli/<timestamp>/logs/run.log`
|
||||
|
||||
|
||||
## Prerequisites (runtime)
|
||||
|
||||
- Linux GUI session (VNC/Xvfb) because these scripts drive a real browser via `pyautogui`.
|
||||
- A working `DISPLAY` (default for all scripts is `:1`).
|
||||
- Network access to the model endpoints (thinking + vision/grounding).
|
||||
|
||||
|
||||
## Key scripts (repo locations)
|
||||
|
||||
The GUI Agent CLI script is the most flexible entry point and is therefore the only one described in more detail in this documentation. Assumes you are in project root `~/Projects/Agent_S3/Agent-S`.
|
||||
|
||||
- GUI Agent CLI: `staging_scripts/gui_agent_cli.py`
|
||||
|
||||
Historically, we used purpose-built scripts for individual tasks. We now recommend using `gui_agent_cli.py` as the primary entry point, because the same scenarios can usually be expressed via a well-scoped prompt while keeping the workflow more flexible and easier to maintain. The scripts below are kept for reference and may not reflect the current, preferred workflow.
|
||||
|
||||
- UI check (Agent S3): `staging_scripts/1_UI_check_AS3.py`
|
||||
- Functional correctness check: `staging_scripts/1_UI_functional_correctness_check.py`
|
||||
- Visual quality audit: `staging_scripts/2_UX_visual_quality_audit.py`
|
||||
- Task-based UX flow (newsletter): `staging_scripts/3_UX_taskflow_newsletter_signup.py`
|
||||
|
||||
|
||||
## Golden run (terminal on ECS)
|
||||
|
||||
This is the “golden run” command sequence currently used for D66 evidence generation. The golden run is a complete workflow that works as a template for reproducible outcomes.
|
||||
|
||||
```bash
|
||||
python staging_scripts/gui_agent_cli.py \
|
||||
|
|
@ -167,63 +99,14 @@ Golden run artifacts:
|
|||
- Text log: `./results/gui_agent_cli/<timestamp>/logs/run.log`
|
||||
- Optional JSON comm log (if enabled): `./results/gui_agent_cli/<timestamp>/logs/calibration_log_*.json`
|
||||
|
||||
An example golden run with screenshots and log outputs can be seen in [Results](./results/).
|
||||
|
||||
## Alternative: run the agent via a web interface (Frontend)
|
||||
|
||||
Work in progress.
|
||||
|
||||
We are currently updating the web-based view and its ECS runner integration. This section will be filled with the correct, up-to-date instructions once the frontend flow supports the current Autonomous UAT Agent + `gui_agent_cli.py` workflow.
|
||||
|
||||
## Run the D66 evaluation scripts (staging_scripts)
|
||||
|
||||
These scripts are used for D66-style evaluation runs and tend to write their artifacts into `staging_scripts/` (DB, screenshots, JSON).
|
||||
|
||||
### UI check (Agent S3)
|
||||
|
||||
Typical pattern (URL via env var + optional run control args):
|
||||
|
||||
```bash
|
||||
export AS2_TARGET_URL="https://www.leipzig.de"
|
||||
export AS2_MAX_STEPS="20"
|
||||
|
||||
python staging_scripts/1_UI_check_AS3.py --auto-yes --ask-every 1000
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- Supports `--job-id <id>` (used by runners) and uses `JOB_ID` as a fallback.
|
||||
- Writes JSON to `./agent_output/raw_json/<job_id>/` and screenshots/overlays to `staging_scripts/Screenshots/...`.
|
||||
|
||||
### Functional correctness check
|
||||
|
||||
```bash
|
||||
export AS2_TARGET_URL="https://www.leipzig.de"
|
||||
export AS2_MAX_STEPS="0" # 0 = no limit (script-specific)
|
||||
|
||||
python staging_scripts/1_UI_functional_correctness_check.py --auto-yes --ask-every 1000
|
||||
```
|
||||
|
||||
### Visual quality audit
|
||||
|
||||
This script currently uses a hardcoded `WEBSITE_URL` near the top of the file. Update it and then run:
|
||||
|
||||
```bash
|
||||
python staging_scripts/2_UX_visual_quality_audit.py --auto-yes --ask-every 10
|
||||
```
|
||||
|
||||
### Task-based UX flow (newsletter)
|
||||
|
||||
This script is currently a staging/WIP script; verify it runs in your environment before relying on it for evidence.
|
||||
|
||||
## Outputs to expect
|
||||
|
||||
Most scripts record one or more of:
|
||||
|
||||
- `uxqa.db` (run log DB)
|
||||
- screenshots/overlays under `staging_scripts/Screenshots/...`
|
||||
- JSON step outputs under `agent_output/` (paths vary by script)
|
||||
- calibration CLI outputs under `./results/gui_agent_cli/<timestamp>/`
|
||||
|
||||
See [Outputs & Artifacts](./outputs-and-artifacts.md).
|
||||
|
||||
## Notes on model usage
|
||||
|
||||
|
|
|
|||