| title | linkTitle | weight | description |
|---|---|---|---|
| Running Autonomous UAT Agent Scripts | Running Autonomous UAT Agent Scripts | 3 | How to run the key D66 evaluation scripts and what they produce |
Running Autonomous UAT Agent Scripts
The Autonomous UAT Agent is the overall UX/UI testing use case built on top of the Agent S codebase and scripts in this repo.
All commands below assume you are running from the Agent-S repository root on the Linux ECS host, ~/Projects/Agent_S3/Agent-S. To get there, connect to the server via SSH. You will need a key pair for authentication and an open inbound port in the firewall. To obtain the key pair and request firewall access, contact tom.sakretz@telekom.de.
Template for running a script from the command line
1) Connect from Windows
ssh -i "C:\Path to KeyPair\KeyPair-ECS.pem" ubuntu@80.158.3.120
2) Prepare the ECS runtime (GUI + browser)
# Activate venv
source ~/Projects/Agent_S3/Agent-S/venv/bin/activate
# Go to Agent-S repo root
cd ~/Projects/Agent_S3/Agent-S
# Start VNC (DISPLAY=:1) and a browser
vncserver :1
export XAUTHORITY="$HOME/.Xauthority"
export DISPLAY=":1"
firefox &
3) One-command recommended run (ECS)
If you only want to produce clean, repeatable evidence (screenshots with click markers), run the following command:
python staging_scripts/gui_agent_cli.py --prompt "Go to telekom.de and click the cart icon" --max-steps 10
This will produce:
- Screenshots: ./results/gui_agent_cli/<timestamp>/screenshots/
- Text log: ./results/gui_agent_cli/<timestamp>/logs/run.log
- JSON comm log (optional, if enabled): ./results/gui_agent_cli/<timestamp>/logs/calibration_log_*.json
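Because every run writes into its own <timestamp> directory, it can be handy to locate the newest one programmatically. The following is a minimal sketch, not part of the repository: the helper name latest_run_dir is hypothetical, and it picks the newest directory by modification time because the exact timestamp format is not documented here.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(results_root: Path) -> Optional[Path]:
    """Return the most recently modified run directory, or None if there is none.

    Hypothetical helper: it relies on directory modification times rather
    than on parsing the <timestamp> folder names.
    """
    if not results_root.is_dir():
        return None
    runs = [p for p in results_root.iterdir() if p.is_dir()]
    # max() with default=None handles an empty results folder gracefully.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)
```

Called from the repository root as latest_run_dir(Path("results/gui_agent_cli")), this returns the directory of the last run, or None before the first run has produced any output.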
Prerequisites (runtime)
- Linux GUI session (VNC/Xvfb), because these scripts drive a real browser via pyautogui.
- A working DISPLAY (default for all scripts is :1).
- Network access to the model endpoints (thinking + vision/grounding).
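The first two prerequisites can be checked before starting a run. The sketch below is an illustration under assumptions: preflight_problems is a hypothetical helper that does not exist in the repository, and it only verifies that DISPLAY is set and that a browser executable is on PATH. It does not test network access, since the model endpoints are not listed in this document.

```python
import os
import shutil
from typing import List, Mapping, Optional


def preflight_problems(
    env: Optional[Mapping[str, str]] = None,
    browser: str = "firefox",
) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    env = os.environ if env is None else env
    problems: List[str] = []
    # The scripts expect a GUI session; DISPLAY is normally ":1" on the ECS host.
    if not env.get("DISPLAY"):
        problems.append("DISPLAY is not set (the scripts default to :1)")
    # shutil.which searches the given PATH string for an executable.
    if shutil.which(browser, path=env.get("PATH")) is None:
        problems.append(f"{browser} was not found on PATH")
    return problems
```

Calling preflight_problems() with no arguments inspects the current shell environment, so it is best run after the export lines from step 2.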
Key scripts (repo locations)
The GUI Agent CLI script is the most flexible entry point and is therefore the only one described in detail in this documentation. The paths below assume you are in the project root, ~/Projects/Agent_S3/Agent-S.
- GUI Agent CLI: staging_scripts/gui_agent_cli.py
Historically, we used purpose-built scripts for individual tasks. We now recommend using gui_agent_cli.py as the primary entry point, because the same scenarios can usually be expressed via a well-scoped prompt while keeping the workflow more flexible and easier to maintain. The scripts below are kept for reference and may not reflect the current, preferred workflow.
- UI check (Agent S3): staging_scripts/1_UI_check_AS3.py
- Functional correctness check: staging_scripts/1_UI_functional_correctness_check.py
- Visual quality audit: staging_scripts/2_UX_visual_quality_audit.py
- Task-based UX flow (newsletter): staging_scripts/3_UX_taskflow_newsletter_signup.py
Golden run (terminal on ECS)
This is the “golden run” command sequence currently used for D66 evidence generation. The golden run is a complete workflow that works as a template for reproducible outcomes.
python staging_scripts/gui_agent_cli.py \
--prompt "Role: You are a UI/UX testing agent specializing in functional correctness.
Goal: Test all interactive elements in the header navigation on www.telekom.de for functional weaknesses.
Tasks:
1. Navigate to the website
2. Identify and test interactive elements (buttons, links, forms, menus)
3. Check for broken flows, defective links, non-functioning elements
4. Document issues found
Report Format:
Return findings in the 'issues' field as a list of objects:
- element: Name/description of the element
- location: Where on the page
- problem: What doesn't work
- recommendation: How to fix it
If no problems found, return an empty array: []" \
--max-steps 30
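Long multi-line prompts are easy to break with shell quoting, so one option is to launch the same command from Python. This is a rough sketch, assuming only the two flags shown above (--prompt and --max-steps); build_command and run_agent are hypothetical names, and the meaning of the script's exit code is not documented here.

```python
import os
import subprocess
import sys
from typing import List

SCRIPT = "staging_scripts/gui_agent_cli.py"


def build_command(prompt: str, max_steps: int) -> List[str]:
    """Build the argument list for gui_agent_cli.py.

    The prompt is passed as a single list element, so quotes and newlines
    inside it need no shell escaping.
    """
    return [sys.executable, SCRIPT, "--prompt", prompt, "--max-steps", str(max_steps)]


def run_agent(prompt: str, max_steps: int,
              repo_root: str = "~/Projects/Agent_S3/Agent-S") -> int:
    """Run the agent from the repository root and return the process exit code."""
    completed = subprocess.run(
        build_command(prompt, max_steps),
        cwd=os.path.expanduser(repo_root),
    )
    return completed.returncode
```

The golden run prompt could then be kept in a text file and passed in as run_agent(prompt_text, 30). The venv must be active so that sys.executable points at the interpreter with the agent's dependencies installed.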
Golden run artifacts:
- Screenshots: ./results/gui_agent_cli/<timestamp>/screenshots/
- Text log: ./results/gui_agent_cli/<timestamp>/logs/run.log
- Optional JSON comm log (if enabled): ./results/gui_agent_cli/<timestamp>/logs/calibration_log_*.json
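To confirm that a golden run actually produced the artifacts listed above, a small check like the following may help. It is a sketch only: summarize_run is a hypothetical helper, it relies solely on the paths named in this section, and it makes no assumption about the contents of the log files.

```python
from pathlib import Path
from typing import Any, Dict


def summarize_run(run_dir: Path) -> Dict[str, Any]:
    """Count screenshots and report which log files exist in one run directory."""
    shots = run_dir / "screenshots"
    logs = run_dir / "logs"
    screenshot_count = (
        sum(1 for p in shots.iterdir() if p.is_file()) if shots.is_dir() else 0
    )
    # The JSON comm log is optional, so an empty list is a valid result.
    json_logs = (
        sorted(p.name for p in logs.glob("calibration_log_*.json"))
        if logs.is_dir()
        else []
    )
    return {
        "screenshots": screenshot_count,
        "has_text_log": (logs / "run.log").is_file(),
        "json_logs": json_logs,
    }
```

A run with zero screenshots or a missing run.log is a sign that the agent stopped before taking its first step, which is worth checking before the results are used as evidence.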
An example golden run with screenshots and log outputs can be seen in Results.
Alternative: run the agent via a web interface (Frontend)
Work in progress.
We are currently updating the web-based view and its ECS runner integration. This section will be filled with the correct, up-to-date instructions once the frontend flow supports the current Autonomous UAT Agent + gui_agent_cli.py workflow.
Notes on model usage
Some scripts still contain legacy model configs (Claude/Pixtral). The D66 target configuration is documented in Model Stack.