
---
title: Running Autonomous UAT Agent Scripts
linkTitle: Running Autonomous UAT Agent Scripts
weight: 3
description: How to run the key D66 evaluation scripts and what they produce
---

# Running Autonomous UAT Agent Scripts

All commands below assume you are running from the Agent-S repository root on the Linux/ECS host, i.e. the folder that contains `staging_scripts/`.

The Autonomous UAT Agent is the overall UX/UI testing use case built on top of the Agent S codebase and the scripts in this repo.

If you are inside the monorepo workspace, first `cd ~/Projects/Agent_S3/Agent-S` on the Ubuntu ECS and then run the same commands.

If you run only one thing to produce clean, repeatable evidence (screenshots with click markers), run the calibration CLI:

```shell
DISPLAY=:1 python staging_scripts/gui_agent_cli.py --prompt "Go to telekom.de and click the cart icon" --max-steps 10
```

This writes screenshots to `./results/gui_agent_cli/<timestamp>/screenshots/`.
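Once a run finishes, you can locate its screenshots from the shell. A minimal sketch, assuming the default output layout described above:

```shell
# List screenshots from the most recent calibration run
# (assumes the default ./results/gui_agent_cli/<timestamp>/ layout)
latest=$(ls -td ./results/gui_agent_cli/*/ 2>/dev/null | head -n 1)
if [ -n "$latest" ]; then
  ls "${latest}screenshots/"
else
  echo "no runs found under ./results/gui_agent_cli/"
fi
```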

## ECS runner notes

- **Working directory matters:** the default output path is relative to the current working directory (it should be the Agent-S repo root on ECS).
- **GUI required:** `pyautogui` needs an X server (`DISPLAY=:1` is assumed by most scripts).
- **Persistence:** if you want results after the task ends, ensure `./results/` is on a mounted volume or copied out as an artifact.
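The persistence point can be handled with a plain copy after the run. A minimal sketch, where `/tmp/d66_artifacts` is a stand-in for your actual mounted volume or artifact path:

```shell
# Copy the results folder to persistent storage after a run
# (/tmp/d66_artifacts stands in for a real mounted volume)
dest=/tmp/d66_artifacts
mkdir -p "$dest"
cp -r ./results/. "$dest"/ 2>/dev/null || true
echo "artifacts copied to $dest"
```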

## Prerequisites (runtime)

- Linux GUI session (VNC/Xvfb), because these scripts drive a real browser via `pyautogui`.
- A working `DISPLAY` (most of the scripts assume `:1`).
- Network access to the model endpoints (thinking + vision/grounding).

Common environment variables used by the vLLM-backed scripts:

- `vLLM_THINKING_ENDPOINT` (default in code if unset)
- `vLLM_VISION_ENDPOINT` (default in code if unset)
- `vLLM_API_KEY` (default: `dummy-key`)
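A typical shell setup might look like this; the endpoint URLs below are placeholders, so substitute the values for your deployment:

```shell
# Placeholder endpoint configuration for the vLLM-backed scripts
export vLLM_THINKING_ENDPOINT="http://localhost:8000/v1"   # placeholder URL
export vLLM_VISION_ENDPOINT="http://localhost:8001/v1"     # placeholder URL
export vLLM_API_KEY="dummy-key"                            # matches the documented default
```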

## Key scripts (repo locations)

Core scripts referenced for D66 demonstrations:

- UI check (Agent S3): `staging_scripts/1_UI_check_AS3.py`
- Functional correctness check: `staging_scripts/1_UI_functional_correctness_check.py`
- Visual quality audit: `staging_scripts/2_UX_visual_quality_audit.py`
- Task-based UX flow (newsletter): `staging_scripts/3_UX_taskflow_newsletter_signup.py`

Calibration / CLI entry point (used for click-coordinate scaling validation):

- GUI Agent CLI (Holo click calibration): `staging_scripts/gui_agent_cli.py`

Legacy / historical:

- `staging_scripts/old scripts/agent_s3_1_old.py`
- `staging_scripts/old scripts/agent_s3_ui_test.py`

## Common configuration knobs

Many scripts support these environment variables:

- `AS2_TARGET_URL`: website URL to test
- `AS2_MAX_STEPS`: maximum number of steps (exact meaning varies by script)
- `ASK_EVERY_STEPS`: interactive prompt cadence

Execution environment:

- The Linux GUI environment typically expects `DISPLAY=:1`.

## Calibration CLI: gui_agent_cli.py

This is the "clean" CLI entry point for repeatable calibration runs.

Minimal run (prompt mode):

```shell
python staging_scripts/gui_agent_cli.py \
  --prompt "Go to telekom.de and click the cart icon" \
  --max-steps 30
```

Optional scaling factors for debugging (both default to 1.0):

```shell
python staging_scripts/gui_agent_cli.py \
  --prompt "Go to telekom.de and click the cart icon" \
  --x-scale 2.0 \
  --y-scale 2.0 \
  --max-steps 30
```

Outputs:

- Default run folder: `./results/gui_agent_cli/<timestamp>/`
- Screenshots: `./results/gui_agent_cli/<timestamp>/screenshots/`
- Text log (stdout/stderr): `./results/gui_agent_cli/<timestamp>/logs/run.log`

If `--enable-logging` is set, the script also writes a structured JSON communication log (Story 026-002) into the same run's `logs/` folder by default.
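To inspect a structured log after a run, any JSON pretty-printer will do. A minimal sketch, assuming the `calibration_log_*.json` naming used for golden-run artifacts:

```shell
# Pretty-print the newest JSON communication log, if one exists
log=$(ls -t ./results/gui_agent_cli/*/logs/calibration_log_*.json 2>/dev/null | head -n 1)
if [ -n "$log" ]; then
  python3 -m json.tool "$log" | head -n 40
else
  echo "no communication logs found"
fi
```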

Enable model communication logging (recommended when debugging mis-clicks):

```shell
python staging_scripts/gui_agent_cli.py \
  --prompt "Click the Telekom icon" \
  --max-steps 10 \
  --output-dir ./results/gui_agent_cli/debug_run_telekom_icon \
  --enable-logging \
  --log-output-dir ./results/gui_agent_cli/debug_run_telekom_icon/logs
```

## Golden run (terminal on ECS)

This is the "golden run" command sequence currently used for D66 evidence generation.

1) Connect from Windows

```shell
ssh -i "C:\Path to KeyPair\KeyPair-ECS.pem" ubuntu@80.158.3.120
```

2) Prepare the ECS runtime (GUI + browser)

```shell
# Activate the recommended Agent S3 venv
source ~/Projects/Agent_S3/Agent-S/venv/bin/activate

# Go to the Agent-S repo root
cd ~/Projects/Agent_S3/Agent-S

# Start VNC (DISPLAY=:1) and a browser
vncserver :1
export XAUTHORITY="$HOME/.Xauthority"
export DISPLAY=":1"
firefox &
```
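Before launching the agent, it is worth confirming that the display is actually reachable. A minimal check, assuming `xdpyinfo` is installed (package `x11-utils` on Ubuntu):

```shell
# Verify the X display responds before starting a run
if xdpyinfo -display :1 >/dev/null 2>&1; then
  echo "DISPLAY :1 is up"
else
  echo "DISPLAY :1 not reachable - check vncserver and XAUTHORITY"
fi
```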

3) Run the golden prompt

```shell
python staging_scripts/gui_agent_cli.py \
  --prompt "Role: You are a UI/UX testing agent specializing in functional correctness.
Goal: Test all interactive elements in the header navigation on www.telekom.de for functional weaknesses.
Tasks:
1. Navigate to the website
2. Identify and test interactive elements (buttons, links, forms, menus)
3. Check for broken flows, defective links, non-functioning elements
4. Document issues found
Report Format:
Return findings in the 'issues' field as a list of objects:
- element: Name/description of the element
- location: Where on the page
- problem: What doesn't work
- recommendation: How to fix it
If no problems found, return an empty array: []" \
  --max-steps 30
```

Golden run artifacts:

- Screenshots: `./results/gui_agent_cli/<timestamp>/screenshots/`
- Text log: `./results/gui_agent_cli/<timestamp>/logs/run.log`
- Optional JSON comm log (if enabled): `./results/gui_agent_cli/<timestamp>/logs/calibration_log_*.json`
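For hand-off, the newest run folder can be bundled into a single archive. A sketch, with the archive name only a suggestion:

```shell
# Bundle the most recent run folder for evidence hand-off
run=$(ls -td ./results/gui_agent_cli/*/ 2>/dev/null | head -n 1)
if [ -n "$run" ]; then
  tar -czf "d66_evidence_$(date +%Y%m%d_%H%M%S).tar.gz" "$run"
  echo "bundled $run"
else
  echo "no run folders to bundle"
fi
```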

## Alternative: run the agent via a web interface (frontend)

Work in progress: we are currently updating the web-based view and its ECS runner integration. This section will be filled in with up-to-date instructions once the frontend flow supports the current Autonomous UAT Agent + `gui_agent_cli.py` workflow.

## Run the D66 evaluation scripts (staging_scripts)

These scripts are used for D66-style evaluation runs and tend to write their artifacts (DB, screenshots, JSON) into `staging_scripts/`.

### UI check (Agent S3)

Typical pattern (URL via env var, plus optional run-control args):

```shell
export AS2_TARGET_URL="https://www.leipzig.de"
export AS2_MAX_STEPS="20"

python staging_scripts/1_UI_check_AS3.py --auto-yes --ask-every 1000
```

Notes:

- Supports `--job-id <id>` (used by runners) and falls back to the `JOB_ID` environment variable.
- Writes JSON to `./agent_output/raw_json/<job_id>/` and screenshots/overlays to `staging_scripts/Screenshots/...`.
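Because outputs are grouped by job id, a run's JSON can be collected like this; the job id value here is just an example:

```shell
# List per-step JSON for a given job id under the documented output path
job_id="demo-job-1"   # example value; real runs pass --job-id or set JOB_ID
ls ./agent_output/raw_json/"$job_id"/ 2>/dev/null || echo "no JSON yet for job $job_id"
```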

### Functional correctness check

```shell
export AS2_TARGET_URL="https://www.leipzig.de"
export AS2_MAX_STEPS="0"   # 0 = no limit (script-specific)

python staging_scripts/1_UI_functional_correctness_check.py --auto-yes --ask-every 1000
```

### Visual quality audit

This script currently uses a hardcoded `WEBSITE_URL` near the top of the file. Update it, then run:

```shell
python staging_scripts/2_UX_visual_quality_audit.py --auto-yes --ask-every 10
```

### Task-based UX flow (newsletter)

This script is currently staging/WIP; verify that it runs in your environment before relying on it for evidence.

## Outputs to expect

Most scripts record one or more of:

- `uxqa.db` (run-log DB)
- screenshots/overlays under `staging_scripts/Screenshots/...`
- JSON step outputs under `agent_output/` (paths vary by script)
- calibration CLI outputs under `./results/gui_agent_cli/<timestamp>/`

See Outputs & Artifacts.
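To see what a finished run actually wrote to the run-log DB, it can be inspected with `sqlite3`; the table layout is not documented here, so list the tables first:

```shell
# Inspect the run-log DB; table names vary, so list them first
db=staging_scripts/uxqa.db
if [ -f "$db" ]; then
  sqlite3 "$db" ".tables"
else
  echo "uxqa.db not found - run one of the evaluation scripts first"
fi
```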

## Notes on model usage

Some scripts still contain legacy model configurations (Claude/Pixtral). The D66 target configuration is documented in Model Stack.