forked from DevFW-CICD/website-and-documentation

Tom Sakretz 097724558a Minor updates to existing pages

2026-01-30 15:23:28 +01:00

4.7 KiB


title	linkTitle	weight	description
Running Autonomous UAT Agent Scripts	Running Autonomous UAT Agent Scripts	3	How to run the key D66 evaluation scripts and what they produce

Running Autonomous UAT Agent Scripts

The Autonomous UAT Agent is the overall UX/UI testing use case built on top of the Agent S codebase and scripts in this repo.

All commands below assume you are working on the ECS server (Linux) in the Agent-S repository root, ~/Projects/Agent_S3/Agent-S. To get there, connect to the server via SSH. You will need a key pair for authentication and an open inbound port in the firewall. For information on how to obtain the key pair and request firewall access, contact tom.sakretz@telekom.de.

Template for running a script from the command line

1) Connect from Windows

ssh -i "C:\Path to KeyPair\KeyPair-ECS.pem" ubuntu@80.158.3.120

2) Prepare the ECS runtime (GUI + browser)

# Activate venv
source ~/Projects/Agent_S3/Agent-S/venv/bin/activate

# Go to Agent-S repo root
cd ~/Projects/Agent_S3/Agent-S

# Start VNC (DISPLAY=:1) and a browser
vncserver :1
export XAUTHORITY="$HOME/.Xauthority"
export DISPLAY=":1"
firefox &

3) One-command recommended run (ECS)

If you only want to produce clean, repeatable evidence (screenshots with click markers), run the following command:

python staging_scripts/gui_agent_cli.py --prompt "Go to telekom.de and click the cart icon" --max-steps 10

This will produce:

Screenshots: ./results/gui_agent_cli/<timestamp>/screenshots/
Text log: ./results/gui_agent_cli/<timestamp>/logs/run.log
JSON comm log (if enabled): ./results/gui_agent_cli/<timestamp>/logs/calibration_log_*.json
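
Because every run writes into its own <timestamp> folder, it can help to locate the newest run programmatically before collecting evidence. The following is a minimal sketch, not part of the repository: latest_run_dir and list_screenshots are hypothetical helper names, and the newest run is chosen by modification time because the exact timestamp format of the folder names is not specified here.

```python
from pathlib import Path


def latest_run_dir(results_root="results/gui_agent_cli"):
    """Return the most recently modified run directory, or None if there is none."""
    root = Path(results_root)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    # Assumption: modification time is a good enough proxy for "newest run".
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def list_screenshots(run_dir):
    """Return the files in <run_dir>/screenshots/, sorted by name."""
    shots = Path(run_dir) / "screenshots"
    if not shots.is_dir():
        return []
    return sorted(p for p in shots.iterdir() if p.is_file())


if __name__ == "__main__":
    run = latest_run_dir()
    if run is None:
        print("No runs found under results/gui_agent_cli")
    else:
        print("Latest run:", run)
        for shot in list_screenshots(run):
            print("  ", shot.name)
```

Run it from the repository root so that the relative results path resolves correctly.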

Prerequisites (runtime)

Linux GUI session (VNC/Xvfb) because these scripts drive a real browser via pyautogui.
A working DISPLAY (default for all scripts is :1).
Network access to the model endpoints (thinking + vision/grounding).
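
The first two prerequisites can be checked before starting a run. The sketch below is an illustration only: check_runtime is a hypothetical helper that is not part of the repository, it assumes Firefox is the browser in use (as in the setup steps above), and it does not test network access to the model endpoints.

```python
import os
import shutil


def check_runtime(env=None, browser="firefox"):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    env = os.environ if env is None else env
    problems = []
    # The scripts drive a real browser, so an X display must be configured.
    if not env.get("DISPLAY", ""):
        problems.append("DISPLAY is not set (the scripts default to :1)")
    # Look the browser executable up on PATH without starting it.
    if shutil.which(browser, path=env.get("PATH")) is None:
        problems.append(f"browser '{browser}' not found on PATH")
    return problems


if __name__ == "__main__":
    found = check_runtime()
    for problem in found:
        print("WARNING:", problem)
    if not found:
        print("Runtime looks ready.")
```

A set DISPLAY variable does not prove that the VNC server is actually running, so treat a clean result as a quick sanity check rather than a guarantee.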

Key scripts (repo locations)

The GUI Agent CLI script is the most flexible entry point and is therefore the only one described in detail in this documentation. All paths below are relative to the project root, ~/Projects/Agent_S3/Agent-S.

GUI Agent CLI: staging_scripts/gui_agent_cli.py

Historically, we used purpose-built scripts for individual tasks. We now recommend using gui_agent_cli.py as the primary entry point, because the same scenarios can usually be expressed via a well-scoped prompt while keeping the workflow more flexible and easier to maintain. The scripts below are kept for reference and may not reflect the current, preferred workflow.

UI check (Agent S3): staging_scripts/1_UI_check_AS3.py
Functional correctness check: staging_scripts/1_UI_functional_correctness_check.py
Visual quality audit: staging_scripts/2_UX_visual_quality_audit.py
Task-based UX flow (newsletter): staging_scripts/3_UX_taskflow_newsletter_signup.py

Golden run (terminal on ECS)

This is the “golden run” command sequence currently used for D66 evidence generation. It is a complete workflow that serves as a template for reproducible outcomes.

python staging_scripts/gui_agent_cli.py \
  --prompt "Role: You are a UI/UX testing agent specializing in functional correctness.
Goal: Test all interactive elements in the header navigation on www.telekom.de for functional weaknesses.
Tasks:
1. Navigate to the website
2. Identify and test interactive elements (buttons, links, forms, menus)
3. Check for broken flows, defective links, non-functioning elements
4. Document issues found
Report Format:
Return findings in the 'issues' field as a list of objects:
- element: Name/description of the element
- location: Where on the page
- problem: What doesn't work
- recommendation: How to fix it
If no problems found, return an empty array: []" \
  --max-steps 30
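
Long multi-line prompts are easy to mis-quote in a shell. One alternative is to build the same command in Python and pass the prompt as a single argument. This is a minimal sketch under stated assumptions: build_command and run_agent are hypothetical helpers, not part of the repository, and they use only the two flags shown above (--prompt and --max-steps).

```python
import subprocess
import sys


def build_command(prompt, max_steps=30, script="staging_scripts/gui_agent_cli.py"):
    """Build the argument list for gui_agent_cli.py using --prompt and --max-steps."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    # Passing a list avoids shell quoting issues with quotes and newlines in the prompt.
    return [sys.executable, script, "--prompt", prompt, "--max-steps", str(max_steps)]


def run_agent(prompt, max_steps=30):
    """Run the CLI from the current directory and return its exit code."""
    return subprocess.run(build_command(prompt, max_steps), check=False).returncode


if __name__ == "__main__":
    # Assumption: this is started from the Agent-S repository root with the venv active.
    sys.exit(run_agent("Go to telekom.de and click the cart icon", max_steps=10))
```

Storing the prompt text in a file and reading it with Path.read_text() before calling run_agent keeps the golden run prompt under version control.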

Golden run artifacts:

Screenshots: ./results/gui_agent_cli/<timestamp>/screenshots/
Text log: ./results/gui_agent_cli/<timestamp>/logs/run.log
Optional JSON comm log (if enabled): ./results/gui_agent_cli/<timestamp>/logs/calibration_log_*.json

An example golden run with screenshots and log outputs can be seen in Results.
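
The report format requested in the golden run prompt can be checked mechanically once the findings are available as data. The sketch below assumes you have already extracted the 'issues' value as a Python list; where the agent stores it is not specified here, and validate_issues is a hypothetical helper, not part of the repository.

```python
REQUIRED_KEYS = ("element", "location", "problem", "recommendation")


def validate_issues(issues):
    """Return a list of error strings; an empty list means the structure is valid."""
    if not isinstance(issues, list):
        return ["'issues' must be a list"]
    errors = []
    for i, issue in enumerate(issues):
        if not isinstance(issue, dict):
            errors.append(f"issue {i}: expected an object, got {type(issue).__name__}")
            continue
        # Each finding needs all four fields named in the prompt.
        missing = [key for key in REQUIRED_KEYS if key not in issue]
        if missing:
            errors.append(f"issue {i}: missing {', '.join(missing)}")
    return errors


if __name__ == "__main__":
    # An empty array is the valid "no problems found" result.
    print(validate_issues([]))
```

A check like this only confirms the shape of the report. It says nothing about whether the findings themselves are correct.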

Alternative: run the agent via a web interface (Frontend)

Work in progress.

We are currently updating the web-based view and its ECS runner integration. This section will be filled with the correct, up-to-date instructions once the frontend flow supports the current Autonomous UAT Agent + gui_agent_cli.py workflow.

Notes on model usage

Some scripts still contain legacy model configs (Claude/Pixtral). The D66 target configuration is documented in Model Stack.
