---
title: Running Autonomous UAT Agent Scripts
linkTitle: Running Autonomous UAT Agent Scripts
weight: 3
description: How to run the key D66 evaluation scripts and what they produce
---
All commands below assume you are running from the Agent-S repository root (Linux/ECS), i.e. the folder that contains `staging_scripts/`.
The Autonomous UAT Agent is the overall UX/UI testing use case built on top of the Agent S codebase and scripts in this repo.
If you are inside the monorepo workspace, first `cd ~/Projects/Agent_S3/Agent-S` on the Ubuntu ECS and then run the same commands.
## One-command recommended run (ECS)
If you run only one command to produce clean, repeatable evidence (screenshots with click markers), use the calibration CLI:

```shell
DISPLAY=:1 python staging_scripts/gui_agent_cli.py --prompt "Go to telekom.de and click the cart icon" --max-steps 10
```

This writes screenshots to `./results/gui_agent_cli/<timestamp>/screenshots/`.
## ECS runner notes
- Working directory matters: the default output path is relative to the current working directory (it should be the Agent-S repo root on the ECS).
- GUI required: `pyautogui` needs an X server (`DISPLAY=:1` is assumed by most scripts).
- Persistence: if you want results after the task ends, ensure `./results/` is on a mounted volume or copied out as an artifact.
## Prerequisites (runtime)
- Linux GUI session (VNC/Xvfb), because these scripts drive a real browser via `pyautogui`.
- A working `DISPLAY` (most of the scripts assume `:1`).
- Network access to the model endpoints (thinking + vision/grounding).
Common environment variables used by the vLLM-backed scripts:
- `vLLM_THINKING_ENDPOINT` (default in code if unset)
- `vLLM_VISION_ENDPOINT` (default in code if unset)
- `vLLM_API_KEY` (default: `dummy-key`)
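As a sketch, these can be set once per shell session before any run; note the endpoint URLs below are placeholders, not the real in-code defaults:

```shell
# Placeholder endpoints -- substitute your actual vLLM deployments.
export vLLM_THINKING_ENDPOINT="http://localhost:8000/v1"   # hypothetical URL
export vLLM_VISION_ENDPOINT="http://localhost:8001/v1"     # hypothetical URL
export vLLM_API_KEY="dummy-key"                            # matches the documented default
echo "thinking=${vLLM_THINKING_ENDPOINT} vision=${vLLM_VISION_ENDPOINT}"
```

Putting these in the venv's activate hook or a sourced `.env`-style file avoids repeating them for every run.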
## Key scripts (repo locations)
Core scripts referenced for D66 demonstrations:
- UI check (Agent S3): `staging_scripts/1_UI_check_AS3.py`
- Functional correctness check: `staging_scripts/1_UI_functional_correctness_check.py`
- Visual quality audit: `staging_scripts/2_UX_visual_quality_audit.py`
- Task-based UX flow (newsletter): `staging_scripts/3_UX_taskflow_newsletter_signup.py`
Calibration / CLI entry point (used for click coordinate scaling validation):
- GUI Agent CLI (Holo click calibration): `staging_scripts/gui_agent_cli.py`
Legacy / historical:

- `staging_scripts/old scripts/agent_s3_1_old.py`
- `staging_scripts/old scripts/agent_s3_ui_test.py`
## Common configuration knobs
Many scripts support these environment variables:
- `AS2_TARGET_URL`: website URL to test
- `AS2_MAX_STEPS`: max steps (varies by script)
- `ASK_EVERY_STEPS`: interactive prompt cadence
Execution environment:

- Linux GUI environment typically expects `DISPLAY=:1`.
## Recommended: run `gui_agent_cli.py` (calibration / click precision)
This is the “clean” CLI entry point for repeatable calibration runs.
Minimal run (prompt mode):
```shell
python staging_scripts/gui_agent_cli.py \
  --prompt "Go to telekom.de and click the cart icon" \
  --max-steps 30
```
Optional scaling factors for debugging (defaults to 1.0 / 1.0):
```shell
python staging_scripts/gui_agent_cli.py \
  --prompt "Go to telekom.de and click the cart icon" \
  --x-scale 2.0 \
  --y-scale 2.0 \
  --max-steps 30
```
Outputs:

- Default run folder: `./results/gui_agent_cli/<timestamp>/`
- Screenshots: `./results/gui_agent_cli/<timestamp>/screenshots/`
- Text log (stdout/stderr): `./results/gui_agent_cli/<timestamp>/logs/run.log`
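To find the newest run without copying timestamps by hand, a small sketch (assumes the default `./results` layout and at least one completed run):

```shell
# Pick the newest run folder by modification time (assumes default layout).
latest="$(ls -1dt ./results/gui_agent_cli/*/ 2>/dev/null | head -n 1)"
if [ -n "$latest" ]; then
  ls "${latest}screenshots/"          # click-marker screenshots
  tail -n 20 "${latest}logs/run.log"  # end of the text log
else
  echo "no runs found under ./results/gui_agent_cli/"
fi
```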
If `--enable-logging` is set, the script also writes a structured JSON communication log (Story 026-002) into the same run `logs/` folder by default.
Enable model communication logging (recommended when debugging mis-clicks):
```shell
python staging_scripts/gui_agent_cli.py \
  --prompt "Click the Telekom icon" \
  --max-steps 10 \
  --output-dir ./results/gui_agent_cli/debug_run_telekom_icon \
  --enable-logging \
  --log-output-dir ./results/gui_agent_cli/debug_run_telekom_icon/logs
```
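Once such a debug run has completed, the JSON comm log can be inspected with stock tooling; a sketch (the glob assumes the `calibration_log_*.json` naming mentioned elsewhere on this page):

```shell
# Pretty-print the newest comm log, if one has been written yet.
log="$(ls -1t ./results/gui_agent_cli/debug_run_telekom_icon/logs/calibration_log_*.json 2>/dev/null | head -n 1)"
if [ -n "$log" ]; then
  python -m json.tool "$log" | head -n 40
else
  echo "no comm log found yet"
fi
```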
## Golden run (terminal on ECS)
This is the “golden run” command sequence currently used for D66 evidence generation.
### 1) Connect from Windows

```shell
ssh -i "C:\Path to KeyPair\KeyPair-ECS.pem" ubuntu@80.158.3.120
```
### 2) Prepare the ECS runtime (GUI + browser)

```shell
# Activate venv (recommended: the Agent S3 venv)
source ~/Projects/Agent_S3/Agent-S/venv/bin/activate

# Go to Agent-S repo root
cd ~/Projects/Agent_S3/Agent-S

# Start VNC (DISPLAY=:1) and a browser
vncserver :1
export XAUTHORITY="$HOME/.Xauthority"
export DISPLAY=":1"
firefox &
```
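Before launching the agent, it can help to confirm the display is actually reachable; a sketch (assumes `xdpyinfo` from x11-utils is installed on the ECS):

```shell
# Probe the X display the scripts expect (DISPLAY=:1).
if xdpyinfo -display :1 >/dev/null 2>&1; then
  display_status="up"
else
  display_status="down"
fi
echo "DISPLAY :1 is ${display_status}"
```

If the probe reports "down", re-run `vncserver :1` and re-export `DISPLAY` before starting the agent.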
### 3) Run the golden prompt

```shell
python staging_scripts/gui_agent_cli.py \
  --prompt "Role: You are a UI/UX testing agent specializing in functional correctness.
Goal: Test all interactive elements in the header navigation on www.telekom.de for functional weaknesses.
Tasks:
1. Navigate to the website
2. Identify and test interactive elements (buttons, links, forms, menus)
3. Check for broken flows, defective links, non-functioning elements
4. Document issues found
Report Format:
Return findings in the 'issues' field as a list of objects:
- element: Name/description of the element
- location: Where on the page
- problem: What doesn't work
- recommendation: How to fix it
If no problems found, return an empty array: []" \
  --max-steps 30
```
Golden run artifacts:

- Screenshots: `./results/gui_agent_cli/<timestamp>/screenshots/`
- Text log: `./results/gui_agent_cli/<timestamp>/logs/run.log`
- Optional JSON comm log (if enabled): `./results/gui_agent_cli/<timestamp>/logs/calibration_log_*.json`
## Alternative: run the agent via a web interface (Frontend)
Work in progress.
We are currently updating the web-based view and its ECS runner integration. This section will be filled with the correct, up-to-date instructions once the frontend flow supports the current Autonomous UAT Agent + gui_agent_cli.py workflow.
## Run the D66 evaluation scripts (staging_scripts)

These scripts are used for D66-style evaluation runs and tend to write their artifacts (DB, screenshots, JSON) into `staging_scripts/`.
### UI check (Agent S3)
Typical pattern (URL via env var + optional run control args):
```shell
export AS2_TARGET_URL="https://www.leipzig.de"
export AS2_MAX_STEPS="20"
python staging_scripts/1_UI_check_AS3.py --auto-yes --ask-every 1000
```
Notes:

- Supports `--job-id <id>` (used by runners) and uses `JOB_ID` as a fallback.
- Writes JSON to `./agent_output/raw_json/<job_id>/` and screenshots/overlays to `staging_scripts/Screenshots/...`.
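Since the JSON output folder is keyed by the job id, pinning one up front makes the run's artifacts predictable; a sketch using the `JOB_ID` fallback:

```shell
# Derive a unique, predictable job id so JSON lands in a known folder.
JOB_ID="run_$(date +%Y%m%d_%H%M%S)"
export JOB_ID   # picked up as a fallback when --job-id is not passed
echo "JSON output folder: ./agent_output/raw_json/${JOB_ID}/"
```

Passing the same value via `--job-id "$JOB_ID"` should be equivalent, per the notes above.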
### Functional correctness check

```shell
export AS2_TARGET_URL="https://www.leipzig.de"
export AS2_MAX_STEPS="0"   # 0 = no limit (script-specific)
python staging_scripts/1_UI_functional_correctness_check.py --auto-yes --ask-every 1000
```
### Visual quality audit

This script currently uses a hardcoded `WEBSITE_URL` near the top of the file. Update it and then run:

```shell
python staging_scripts/2_UX_visual_quality_audit.py --auto-yes --ask-every 10
```
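To switch targets without opening an editor, one option is an in-place edit; a sketch only, which assumes the assignment literally starts with `WEBSITE_URL` at the beginning of a line (verify against the actual file before relying on it):

```shell
f="staging_scripts/2_UX_visual_quality_audit.py"
if [ -f "$f" ]; then
  # Rewrite the hardcoded URL in place, keeping a .bak copy just in case.
  sed -i.bak 's|^WEBSITE_URL.*|WEBSITE_URL = "https://www.leipzig.de"|' "$f"
  grep '^WEBSITE_URL' "$f"
else
  echo "script not found at $f (run from the Agent-S repo root)"
fi
```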
### Task-based UX flow (newsletter)
This script is currently a staging/WIP script; verify it runs in your environment before relying on it for evidence.
## Outputs to expect
Most scripts record one or more of:
- `uxqa.db` (run log DB)
- screenshots/overlays under `staging_scripts/Screenshots/...`
- JSON step outputs under `agent_output/` (paths vary by script)
- calibration CLI outputs under `./results/gui_agent_cli/<timestamp>/`
See Outputs & Artifacts.
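For hand-offs (e.g. copying D66 evidence off the ECS), the locations above can be bundled into a single archive; a sketch that only packs paths that actually exist:

```shell
# Collect whichever artifact locations exist into one tarball.
out="uat_artifacts_$(date +%Y%m%d_%H%M%S).tar.gz"
found=""
for p in uxqa.db staging_scripts/Screenshots agent_output results/gui_agent_cli; do
  [ -e "$p" ] && found="$found $p"
done
if [ -n "$found" ]; then
  tar czf "$out" $found
  echo "wrote $out"
else
  echo "no artifacts found to bundle"
fi
```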
## Notes on model usage
Some scripts still contain legacy model configs (Claude/Pixtral). The D66 target configuration is documented in Model Stack.