---
title: Agent Workflow Diagram
linkTitle: UAT Agent Workflow Diagram
weight: 5
description: Visual workflow of a typical Agent S (Autonomous UAT Agent) run (gui_agent_cli.py) across Ministral, Holo, and VNC
---

# Agent Workflow Diagram (Autonomous UAT Agent)

This page provides a visual sketch of the typical workflow (example: gui_agent_cli.py).

## Workflow (fallback without Mermaid)

If Mermaid rendering is not available or fails in your build, this section shows the same workflow as plain text.

```text
Operator/Prompt
  -> gui_agent_cli.py
     -> (1) Planning request  -> Ministral vLLM (thinking)
     <-     Next action intent
     -> (2) Screenshot capture -> VNC Desktop / Firefox
     <-     PNG screenshot
     -> (3) Grounding request  -> Holo vLLM (vision)
     <-     Coordinates + element metadata
     -> (4) Execute action     -> VNC Desktop / Firefox
     -> Artifacts saved        -> results/ (logs, screenshots, JSON)
```

| Step | From | To | What | Output |
|------|------|----|------|--------|
| 0 | Operator | gui_agent_cli.py | Provide goal / prompt | Goal text |
| 1 | gui_agent_cli.py | Ministral vLLM | Plan next step (text) | Next action intent |
| 2 | gui_agent_cli.py | VNC Desktop | Capture screenshot | PNG screenshot |
| 3 | gui_agent_cli.py | Holo vLLM | Ground UI element(s) | Coordinates + element metadata |
| 4 | gui_agent_cli.py | VNC Desktop | Execute click/type/scroll | UI state change |
| 5 | gui_agent_cli.py | results/ | Persist evidence | Logs + screenshots + JSON |
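
Steps 2 and 4 both talk to the VNC desktop. As a rough illustration, the sketch below captures a screenshot and clicks a grounded coordinate over VNC using vncdotool; the library choice, host, port, password, and file names are assumptions of this sketch, not necessarily what gui_agent_cli.py actually does.

```python
# Hedged sketch of steps 2 and 4: screenshot capture and action execution
# over VNC. vncdotool is an assumed transport; host, port, password, and
# file names are placeholders, not what gui_agent_cli.py necessarily uses.
from vncdotool import api

# "host::5901" addresses the VNC server by raw port rather than display number.
client = api.connect("vnc-host::5901", password="secret")

# (2) Capture the current desktop state as a PNG for the grounding model.
client.captureScreen("results/step_003.png")

# (4) Execute the grounded action, e.g. click the coordinates Holo returned.
client.mouseMove(412, 318)
client.mousePress(1)  # button 1 = left click

api.shutdown()  # tear down the vncdotool event loop
```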

## High-level data flow

```mermaid
flowchart LR
  %% Left-to-right overview of one typical agent loop

  user[Operator / Prompt] --> cli[Agent S script<br/>gui_agent_cli.py]

  subgraph OTC["OTC (Open Telekom Cloud)"]
    subgraph MIN_HOST[ecs_ministral_L4]
      MIN[(Ministral 3 8B<br/>Thinking / Planning)]
    end

    subgraph HOLO_HOST[ecs_holo_A40]
      HOLO[(Holo 1.5-7B<br/>Vision / Grounding)]
    end

    subgraph TARGET[GUI test target]
      VNC[VNC / Desktop]
      FF[Firefox]
      VNC --> FF
    end
  end

  cli -->|1. plan step<br/>vLLM_THINKING_ENDPOINT| MIN
  MIN -->|next action<br/>click / type / wait| cli

  cli -->|2. capture screenshot| VNC
  VNC -->|"screenshot (PNG)"| cli

  cli -->|3. grounding request<br/>vLLM_VISION_ENDPOINT| HOLO
  HOLO -->|coordinates + UI element info| cli

  cli -->|4. execute action<br/>mouse / keyboard| VNC

  cli -->|logs + screenshots| artifacts[(Artifacts<br/>logs, screenshots, JSON comms)]
```
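
In code, step 1 of the flow above is a plain chat-completion call against the planning server. A minimal sketch, assuming the OpenAI-compatible API that vLLM exposes; the host, model name, and prompt are placeholders, and the real base URL would come from vLLM_THINKING_ENDPOINT:

```python
# Hedged sketch of step 1: ask the planning model for the next action.
# Host, port, and model name are assumptions; vLLM serves an
# OpenAI-compatible API, so the official openai client can talk to it.
from openai import OpenAI

planner = OpenAI(base_url="http://<ecs_ministral_L4-ip>:8000/v1", api_key="EMPTY")

resp = planner.chat.completions.create(
    model="ministral",  # whatever --served-model-name vLLM was started with
    messages=[
        {"role": "system", "content": "You are a GUI test agent. Propose exactly one next action."},
        {"role": "user", "content": "Goal: log in to the portal. Last observation: start page is loaded."},
    ],
)
print(resp.choices[0].message.content)  # e.g. "click the 'Login' button"
```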

## Sequence (one loop)

```mermaid
sequenceDiagram
  autonumber
  actor U as Operator
  participant CLI as gui_agent_cli.py
  participant MIN as Ministral vLLM (ecs_ministral_L4)
  participant VNC as VNC Desktop (Firefox)
  participant HOLO as Holo vLLM (ecs_holo_A40)

  U->>CLI: Provide goal / prompt

  loop Step loop (until done)
    CLI->>MIN: Plan next step (text-only reasoning)
    MIN-->>CLI: Next action (intent)

    CLI->>VNC: Capture screenshot
    VNC-->>CLI: Screenshot (image)

    CLI->>HOLO: Ground UI element(s) in screenshot
    HOLO-->>CLI: Coordinates + element metadata

    CLI->>VNC: Execute click/type/scroll
  end

  CLI-->>U: Result summary + saved artifacts
```
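
Step 3 above sends the latest screenshot to the vision model. A minimal sketch, assuming the OpenAI-compatible multimodal message format that vLLM serves for vision models; host, model name, file name, and prompt are placeholders, with the real base URL coming from vLLM_VISION_ENDPOINT:

```python
# Hedged sketch of step 3: ground a UI element in the latest screenshot.
# The image_url/base64 message shape is the OpenAI-compatible vision format
# that vLLM accepts; host, model name, and prompt are assumptions.
import base64
from openai import OpenAI

grounder = OpenAI(base_url="http://<ecs_holo_A40-ip>:8000/v1", api_key="EMPTY")

with open("results/step_003.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = grounder.chat.completions.create(
    model="holo",  # whatever --served-model-name vLLM was started with
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Locate the 'Login' button and return x,y pixel coordinates."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)  # e.g. "(412, 318)"
```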

## Notes

- The thinking and grounding models are separated on purpose: this improves coordinate reliability and makes failures easier to debug.
- The agent loop typically produces artifacts (logs + screenshots) which are later copied into D66 evidence bundles; a minimal persistence sketch follows below.
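
As a rough illustration of that persistence step, the sketch below writes one JSON record per step under results/, next to the step's screenshot; the actual layout and field names used by gui_agent_cli.py may differ.

```python
# Hedged sketch of artifact persistence: one JSON record per step under
# results/, next to the step's screenshot. Layout and fields are assumptions.
import json
import time
from pathlib import Path

run_dir = Path("results") / time.strftime("run_%Y%m%d_%H%M%S")
run_dir.mkdir(parents=True, exist_ok=True)

step_record = {
    "step": 3,
    "intent": "click the 'Login' button",   # from Ministral (planning)
    "coordinates": [412, 318],              # from Holo (grounding)
    "screenshot": "step_003.png",           # captured from the VNC desktop
}

# One JSON record per step keeps model I/O auditable for evidence bundles.
with (run_dir / "step_003.json").open("w") as f:
    json.dump(step_record, f, indent=2)
```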