---
title: Model Stack
linkTitle: Model Stack
weight: 4
description: Thinking vs grounding model split for D66 (current state and target state)
---

Model Stack

For a visual overview of how the models interact with the VNC-based GUI automation loop, see: Workflow Diagram

Requirement

The Autonomous UAT Agent must use open-source models from European companies. This has been a project requirement from the very beginning of the project.

Target setup

  • Thinking / planning: Ministral
  • Grounding / coordinates: Holo 1.5

The Agent S framework runs an iterative loop: it uses a reasoning model to decide what to do next (plan the next action) and a grounding model to translate UI intent into pixel-accurate coordinates on the current screenshot. This split is essential for reliable GUI automation because planning and “where exactly to click” are different problems and benefit from different model capabilities.
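
To make the division of labour concrete, here is a minimal sketch of one loop iteration. The helper signatures (`plan`, `ground`, `act`) are hypothetical and only illustrate the data flow, not the actual Agent S implementation:

```python
from typing import Callable, Tuple

# Hypothetical signatures, for illustration only:
PlanFn = Callable[[str, bytes], str]                # (task, screenshot) -> UI intent, e.g. "click the Save button"
GroundFn = Callable[[str, bytes], Tuple[int, int]]  # (UI intent, screenshot) -> (x, y) pixel coordinates
ActFn = Callable[[int, int], bytes]                 # click at (x, y) via VNC, return the next screenshot

def run_step(task: str, screenshot: bytes,
             plan: PlanFn, ground: GroundFn, act: ActFn) -> bytes:
    """One plan -> ground -> act iteration of the GUI automation loop."""
    intent = plan(task, screenshot)     # thinking model: what to do next
    x, y = ground(intent, screenshot)   # grounding model: where exactly to click
    return act(x, y)                    # execute via VNC, capture the next screenshot
```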

Why split models?

  • Reasoning models are optimized for planning and textual decision making
  • Vision/grounding models are optimized for stable coordinate output
  • The separation reduces “coordinate hallucinations” and makes debugging easier

Current state in repo

  • Some scripts and docs still reference historical Claude and Pixtral experiments.
  • Pixtral is not suitable for pixel-level grounding in this use case: in our evaluations it did not provide the consistency and coordinate stability required for reliable UI automation.
  • In an early prototyping phase, Anthropic Claude Sonnet was useful due to its strong instruction-following and reasoning quality; however, it does not meet the D66 constraints (open-source + European provider), so it could not be used in the D66 target solution.

Current configuration (D66)

Thinking model: Ministral 3 8B (Instruct)

  • HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
  • Runs on OTC (Open Telekom Cloud) ECS: ecs_ministral_L4 (public IP: 164.30.28.242)
    • Flavor: GPU-accelerated | 16 vCPUs | 64 GiB | pi5e.4xlarge.4
    • GPU: 1 × NVIDIA Tesla L4 (24 GiB)
    • Image: Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074 (Public image)
  • Deployment: vLLM OpenAI-compatible endpoint (chat completions)
    • Endpoint env var: vLLM_THINKING_ENDPOINT
    • Current server (deployment reference): http://164.30.28.242:8001/v1

Operational note: vLLM is configured to auto-start on server boot (OTC ECS restart) via systemd.
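
Because the endpoint is expected to come back on its own after an ECS restart, a quick reachability check against the OpenAI-compatible API is a simple way to confirm it. The sketch below assumes `vLLM_THINKING_ENDPOINT` already includes the `/v1` suffix, as in the deployment reference above:

```python
import os
import requests

# vLLM's OpenAI-compatible server exposes GET <base>/models.
base_url = os.environ.get("vLLM_THINKING_ENDPOINT", "http://164.30.28.242:8001/v1")
resp = requests.get(f"{base_url}/models", timeout=10)
resp.raise_for_status()
print("Served models:", [m["id"] for m in resp.json()["data"]])
```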

Key serving settings (vLLM):

  • --gpu-memory-utilization 0.90
  • --max-model-len 32768
  • --host 0.0.0.0
  • --port 8001

Key client settings (Autonomous UAT Agent scripts):

  • model: /home/ubuntu/ministral-vllm/models/ministral-3-8b
  • temperature: 0.0
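
A minimal chat-completions call with these settings, assuming the standard `openai` Python client (vLLM ignores the API key, so any placeholder works; the prompt is illustrative):

```python
import os
from openai import OpenAI

# Assumes vLLM_THINKING_ENDPOINT already contains the /v1 suffix,
# e.g. http://164.30.28.242:8001/v1
client = OpenAI(base_url=os.environ["vLLM_THINKING_ENDPOINT"], api_key="EMPTY")

response = client.chat.completions.create(
    model="/home/ubuntu/ministral-vllm/models/ministral-3-8b",  # served model name = local model path
    temperature=0.0,  # deterministic planning output
    messages=[
        {"role": "system", "content": "You plan the next GUI action for a UAT step."},
        {"role": "user", "content": "The login form is visible. What is the next action?"},
    ],
)
print(response.choices[0].message.content)
```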

Grounding model: Holo 1.5-7B

  • HuggingFace model card: https://huggingface.co/holo-1.5-7b
  • Runs on OTC (Open Telekom Cloud) ECS: ecs_holo_A40 (public IP: 164.30.22.166)
    • Flavor: GPU-accelerated | 48 vCPUs | 384 GiB | g7.12xlarge.8
    • GPU: 1 × NVIDIA A40 (48 GiB)
    • Image: Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074 (Public image)
  • Deployment: vLLM OpenAI-compatible endpoint (multimodal grounding)
    • Endpoint env var: vLLM_VISION_ENDPOINT
    • Current server (deployment reference): http://164.30.22.166:8000/v1

Key client settings (grounding / coordinate space):

  • model: holo-1.5-7b
  • Native coordinate space: 3840×2160 (4K)
  • Client grounding dimensions:
    • grounding_width: 3840
    • grounding_height: 2160
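
The grounding dimensions matter because the model answers in its native 3840×2160 space, so coordinates have to be rescaled to the actual screen resolution before clicking. The sketch below assumes the standard `openai` client against the OpenAI-compatible multimodal endpoint; the prompt wording and response parsing are illustrative, not the exact agent implementation:

```python
import base64
import os
from openai import OpenAI

GROUNDING_WIDTH, GROUNDING_HEIGHT = 3840, 2160  # native coordinate space of Holo 1.5

client = OpenAI(base_url=os.environ["vLLM_VISION_ENDPOINT"], api_key="EMPTY")

def ground(instruction: str, screenshot_png: bytes, screen_w: int, screen_h: int) -> tuple[int, int]:
    """Ask the grounding model where to click, then map the answer from the
    3840x2160 grounding space back to the real screen resolution."""
    image_b64 = base64.b64encode(screenshot_png).decode()
    response = client.chat.completions.create(
        model="holo-1.5-7b",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text",
                 "text": f"Return the click position for: {instruction} as 'x,y'."},
            ],
        }],
    )
    # Illustrative parsing: assume the model replies with "x,y" in grounding space.
    gx, gy = (int(v) for v in response.choices[0].message.content.strip().split(","))
    return gx * screen_w // GROUNDING_WIDTH, gy * screen_h // GROUNDING_HEIGHT
```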