---
title: "Model Stack"
linkTitle: "Model Stack"
weight: 4
description: >
  Thinking vs grounding model split for D66 (current state and target state)
---
# Model Stack
For a visual overview of how the models interact with the VNC-based GUI automation loop, see: [Workflow Diagram](./agent-workflow-diagram.md)
## Requirement
The Autonomous UAT Agent must use **open-source models from European companies**. This has been a requirement from the very beginning of the project.
## Target setup
- **Thinking / planning:** Ministral
- **Grounding / coordinates:** Holo 1.5
The Agent S framework runs an iterative loop: it uses a reasoning model to decide *what to do next* (plan the next action) and a grounding model to translate UI intent into *pixel-accurate coordinates* on the current screenshot. This split is essential for reliable GUI automation because planning and “where exactly to click” are different problems and benefit from different model capabilities.
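The sketch below is purely illustrative (stub functions standing in for the real Agent S components and model calls) and only shows the division of labour between the two models:
```python
from typing import Tuple

def plan_next_action(task: str, screenshot_png: bytes) -> str:
    """Thinking model (Ministral): decide *what to do next* as text.
    Stubbed here; the real agent sends the task and context to the
    vLLM chat-completions endpoint."""
    return "click the 'Login' button"

def ground_action(action: str, screenshot_png: bytes) -> Tuple[int, int]:
    """Grounding model (Holo 1.5): decide *where on screen* to act.
    Stubbed here; the real agent sends the screenshot to the vision endpoint."""
    return (1920, 1080)

def run_step(task: str, screenshot_png: bytes) -> None:
    action = plan_next_action(task, screenshot_png)  # planning: textual decision
    x, y = ground_action(action, screenshot_png)     # grounding: pixel coordinates
    # The coordinates are then executed against the GUI via the VNC layer (not shown).
    print(f"{action} -> click at ({x}, {y})")
```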
## Why split models?
- Reasoning models optimize planning and textual decision making
- Vision/grounding models optimize stable coordinate output
- Separation reduces “coordinate hallucinations” and makes debugging easier
## Current state in repo
- Some scripts and docs still reference historical **Claude** and **Pixtral** experiments.
- **Pixtral is not suitable for pixel-level grounding in this use case**: in our evaluations it did not provide the consistency and coordinate stability required for reliable UI automation.
- In an early prototyping phase, **Anthropic Claude Sonnet** was useful due to its strong instruction-following and reasoning quality; however, it does not meet the D66 constraints (open-source + European provider), so it could not be used for the D66 target solution.
## Current configuration (D66)
### Thinking model: Ministral 3 8B (Instruct)
- HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_ministral_L4` (public IP: `164.30.28.242`)
- Flavor: GPU-accelerated | 16 vCPUs | 64 GiB | `pi5e.4xlarge.4`
- GPU: 1 × NVIDIA Tesla L4 (24 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (chat completions)
- Endpoint env var: `vLLM_THINKING_ENDPOINT`
- Current server (deployment reference): `http://164.30.28.242:8001/v1`
**Operational note:** vLLM is configured to **auto-start on server boot** (OTC ECS restart) via `systemd`.
**Key serving settings (vLLM):**
- `--gpu-memory-utilization 0.90`
- `--max-model-len 32768`
- `--host 0.0.0.0`
- `--port 8001`
**Key client settings (Autonomous UAT Agent scripts):**
- `model`: `/home/ubuntu/ministral-vllm/models/ministral-3-8b`
- `temperature`: `0.0`
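A minimal sketch of such a client call, assuming the standard `openai` Python package against the OpenAI-compatible vLLM endpoint; the prompt text and `api_key` placeholder are illustrative, while the endpoint variable, model path, and temperature are the values listed above:
```python
import os
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the api_key is only a placeholder
# unless the server was started with an API key configured.
client = OpenAI(base_url=os.environ["vLLM_THINKING_ENDPOINT"], api_key="EMPTY")

response = client.chat.completions.create(
    model="/home/ubuntu/ministral-vllm/models/ministral-3-8b",
    temperature=0.0,  # deterministic planning output
    messages=[{"role": "user", "content": "Which UI action should be performed next?"}],
)
print(response.choices[0].message.content)
```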
### Grounding model: Holo 1.5-7B
- HuggingFace model card: https://huggingface.co/holo-1.5-7b
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_holo_A40` (public IP: `164.30.22.166`)
- Flavor: GPU-accelerated | 48 vCPUs | 384 GiB | `g7.12xlarge.8`
- GPU: 1 × NVIDIA A40 (48 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (multimodal grounding)
- Endpoint env var: `vLLM_VISION_ENDPOINT`
- Current server (deployment reference): `http://164.30.22.166:8000/v1`
**Key client settings (grounding / coordinate space):**
- `model`: `holo-1.5-7b`
- Native coordinate space: `3840×2160` (4K)
- Client grounding dimensions:
- `grounding_width`: `3840`
- `grounding_height`: `2160`
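The sketch below shows a grounding call and the coordinate rescaling implied by these settings; the `openai` package, prompt wording, and reply parsing are assumptions, not the actual script code:
```python
import base64
import os
import re
from openai import OpenAI

GROUNDING_WIDTH, GROUNDING_HEIGHT = 3840, 2160  # Holo 1.5 native coordinate space

client = OpenAI(base_url=os.environ["vLLM_VISION_ENDPOINT"], api_key="EMPTY")

def ground(instruction: str, screenshot_path: str, screen_w: int, screen_h: int):
    """Ask the grounding model for a click position and map it from the
    model's 3840x2160 space to the actual screenshot resolution."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="holo-1.5-7b",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text",
                 "text": f"Return the (x, y) click position for: {instruction}"},
            ],
        }],
    )
    reply = response.choices[0].message.content
    match = re.search(r"(\d+)\D+(\d+)", reply)  # illustrative parsing only
    if match is None:
        raise ValueError(f"No coordinates found in model reply: {reply!r}")
    x_native, y_native = int(match.group(1)), int(match.group(2))

    # Rescale from the native 4K grounding space to the real screen size.
    return (round(x_native * screen_w / GROUNDING_WIDTH),
            round(y_native * screen_h / GROUNDING_HEIGHT))
```
For example, on a 1920×1080 screenshot a native coordinate of `(3840, 2160)` maps to `(1920, 1080)`.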