---
title: "Model Stack"
linkTitle: "Model Stack"
weight: 4
description: >
  Thinking vs grounding model split for D66 (current state and target state)
---

# Model Stack

For a visual overview of how the models interact with the VNC-based GUI automation loop, see: [Workflow Diagram](./agent-workflow-diagram.md)

## Requirement

The Autonomous UAT Agent must use **open-source models from European companies**. This has been a project requirement from the very beginning.

## Target setup

- **Thinking / planning:** Ministral
- **Grounding / coordinates:** Holo 1.5

The Agent S framework runs an iterative loop: it uses a reasoning model to decide *what to do next* (plan the next action) and a grounding model to translate UI intent into *pixel-accurate coordinates* on the current screenshot. This split is essential for reliable GUI automation because planning and “where exactly to click” are different problems and benefit from different model capabilities.
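One iteration of this loop can be sketched as follows. This is an illustrative outline, not the actual Agent S API: the function name and the injected callbacks (`take_screenshot`, `plan`, `ground`, `execute`) are assumptions made for the example.

```python
def run_uat_step(goal: str, take_screenshot, plan, ground, execute) -> None:
    """One iteration of the Agent S-style loop (names are illustrative).

    plan(goal, screenshot)     -> textual next action from the thinking model
    ground(intent, screenshot) -> (x, y) pixel coordinates from the grounding model
    execute(intent, x, y)      -> performs the click/typing over VNC
    """
    screenshot = take_screenshot()
    intent = plan(goal, screenshot)    # WHAT to do next (reasoning model)
    x, y = ground(intent, screenshot)  # WHERE exactly to act (grounding model)
    execute(intent, x, y)              # act on the GUI, then the loop repeats
```

Injecting the two model calls as separate callbacks mirrors the split described above: the planner never has to emit coordinates, and the grounder never has to reason about the test goal.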
## Why split models?

- Reasoning models optimize planning and textual decision making
- Vision/grounding models optimize stable coordinate output
- Separation reduces “coordinate hallucinations” and makes debugging easier

## Current state in repo

- Some scripts and docs still reference historical **Claude** and **Pixtral** experiments.
- **Pixtral is not suitable for pixel-level grounding in this use case**: in our evaluations it did not provide the consistency and coordinate stability required for reliable UI automation.
- In an early prototyping phase, **Anthropic Claude Sonnet** was useful due to strong instruction-following and reasoning quality; however, it does not meet the D66 constraints (open-source + European provider), so it could not be used for the D66 target solution.

## Current configuration (D66)

### Thinking model: Ministral 3 8B (Instruct)

- HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_ministral_L4` (public IP: `164.30.28.242`)
- Flavor: GPU-accelerated | 16 vCPUs | 64 GiB | `pi5e.4xlarge.4`
- GPU: 1 × NVIDIA Tesla L4 (24 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (chat completions)
- Endpoint env var: `vLLM_THINKING_ENDPOINT`
- Current server (deployment reference): `http://164.30.28.242:8001/v1`

**Operational note:** vLLM is configured to **auto-start on server boot** (OTC ECS restart) via `systemd`.
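One way to wire this up is a `systemd` unit along the following lines. The unit name, file path, and user are assumptions for illustration; the `vllm serve` flags mirror the serving settings documented on this page.

```ini
# /etc/systemd/system/vllm-ministral.service  (path and unit name are assumptions)
[Unit]
Description=vLLM OpenAI-compatible server for Ministral 3 8B
After=network-online.target
Wants=network-online.target

[Service]
User=ubuntu
ExecStart=/usr/bin/env vllm serve /home/ubuntu/ministral-vllm/models/ministral-3-8b \
    --gpu-memory-utilization 0.90 --max-model-len 32768 --host 0.0.0.0 --port 8001
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enabling it with `sudo systemctl enable --now vllm-ministral` makes the endpoint come back automatically after an ECS restart.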

**Key serving settings (vLLM):**

- `--gpu-memory-utilization 0.90`
- `--max-model-len 32768`
- `--host 0.0.0.0`
- `--port 8001`

**Key client settings (Autonomous UAT Agent scripts):**

- `model`: `/home/ubuntu/ministral-vllm/models/ministral-3-8b`
- `temperature`: `0.0`

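A minimal, stdlib-only client sketch using these settings. The helper names are illustrative; the base URL is read from the documented `vLLM_THINKING_ENDPOINT` env var, falling back to the current server.

```python
import json
import os
from urllib import request

# Base URL from the documented env var; fallback is the current deployment.
BASE_URL = os.environ.get("vLLM_THINKING_ENDPOINT", "http://164.30.28.242:8001/v1")

def thinking_request(prompt: str) -> dict:
    """Build a chat-completions body matching the documented client settings."""
    return {
        # vLLM identifies the model by the path it was served from.
        "model": "/home/ubuntu/ministral-vllm/models/ministral-3-8b",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic planning output
    }

def think(prompt: str) -> str:
    """POST the request to the OpenAI-compatible endpoint (network call)."""
    req = request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(thinking_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

`temperature: 0.0` keeps planning output reproducible across runs, which matters when the same UAT scenario is replayed for debugging.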
### Grounding model: Holo 1.5-7B

- HuggingFace model card: https://huggingface.co/holo-1.5-7b
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_holo_A40` (public IP: `164.30.22.166`)
- Flavor: GPU-accelerated | 48 vCPUs | 384 GiB | `g7.12xlarge.8`
- GPU: 1 × NVIDIA A40 (48 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (multimodal grounding)
- Endpoint env var: `vLLM_VISION_ENDPOINT`
- Current server (deployment reference): `http://164.30.22.166:8000/v1`

**Key client settings (grounding / coordinate space):**

- `model`: `holo-1.5-7b`
- Native coordinate space: `3840×2160` (4K)
- Client grounding dimensions:
  - `grounding_width`: `3840`
  - `grounding_height`: `2160`