---
title: "Model Stack"
linkTitle: "Model Stack"
weight: 4
description: >
  Thinking vs grounding model split for D66 (current state and target state)
---
# Model Stack
For a visual overview of how the models interact with the VNC-based GUI automation loop, see: [Workflow Diagram](./agent-workflow-diagram.md)
## Requirement
The Autonomous UAT Agent must use **open-source models from European companies**. This has been a requirement from the very beginning of the project.
## Target setup
- **Thinking / planning:** Ministral
- **Grounding / coordinates:** Holo 1.5
The Agent S framework runs an iterative loop: it uses a reasoning model to decide *what to do next* (plan the next action) and a grounding model to translate UI intent into *pixel-accurate coordinates* on the current screenshot. This split is essential for reliable GUI automation because planning and “where exactly to click” are different problems and benefit from different model capabilities.
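The sketch below is purely illustrative (stub functions standing in for the real Agent S components and model calls) and only shows the division of labour between the two models:
```python
from typing import Tuple

def plan_next_action(task: str, screenshot_png: bytes) -> str:
    """Thinking model (Ministral): decide *what to do next* as text.
    Stubbed here; the real agent sends the task and context to the
    vLLM chat-completions endpoint."""
    return "click the 'Login' button"

def ground_action(action: str, screenshot_png: bytes) -> Tuple[int, int]:
    """Grounding model (Holo 1.5): decide *where on screen* to act.
    Stubbed here; the real agent sends the screenshot to the vision endpoint."""
    return (1920, 1080)

def run_step(task: str, screenshot_png: bytes) -> None:
    action = plan_next_action(task, screenshot_png)  # planning: textual decision
    x, y = ground_action(action, screenshot_png)     # grounding: pixel coordinates
    # The coordinates are then executed against the GUI via the VNC layer (not shown).
    print(f"{action} -> click at ({x}, {y})")
```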
## Why split models?
- Reasoning models optimize planning and textual decision making
- Vision/grounding models optimize stable coordinate output
- Separation reduces “coordinate hallucinations” and makes debugging easier
## Current state in repo
- Some scripts and docs still reference historical **Claude** and **Pixtral** experiments.
- **Pixtral is not suitable for pixel-level grounding in this use case**: in our evaluations it did not provide the consistency and coordinate stability required for reliable UI automation.
- In an early prototyping phase, **Anthropic Claude Sonnet** was useful due to its strong instruction-following and reasoning quality; however, it does not meet the D66 constraints (open-source + European provider), so it could not be used for the D66 target solution.
## Current configuration (D66)
### Thinking model: Ministral 3 8B (Instruct)
- HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_ministral_L4` (public IP: `164.30.28.242`)
- Flavor: GPU-accelerated | 16 vCPUs | 64 GiB | `pi5e.4xlarge.4`
- GPU: 1 × NVIDIA Tesla L4 (24 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (chat completions)
- Endpoint env var: `vLLM_THINKING_ENDPOINT`
- Current server (deployment reference): `http://164.30.28.242:8001/v1`
**Operational note:** vLLM is configured to **auto-start on server boot** (OTC ECS restart) via `systemd`.
**Key serving settings (vLLM):**
- `--gpu-memory-utilization 0.90`
- `--max-model-len 32768`
- `--host 0.0.0.0`
- `--port 8001`
**Key client settings (Autonomous UAT Agent scripts):**
- `model`: `/home/ubuntu/ministral-vllm/models/ministral-3-8b`
- `temperature`: `0.0`
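A minimal sketch of such a client call, assuming the standard `openai` Python package against the OpenAI-compatible vLLM endpoint; the prompt text and `api_key` placeholder are illustrative, while the endpoint variable, model path, and temperature are the values listed above:
```python
import os
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the api_key is only a placeholder
# unless the server was started with an API key configured.
client = OpenAI(base_url=os.environ["vLLM_THINKING_ENDPOINT"], api_key="EMPTY")

response = client.chat.completions.create(
    model="/home/ubuntu/ministral-vllm/models/ministral-3-8b",
    temperature=0.0,  # deterministic planning output
    messages=[{"role": "user", "content": "Which UI action should be performed next?"}],
)
print(response.choices[0].message.content)
```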
### Grounding model: Holo 1.5-7B
- HuggingFace model card: https://huggingface.co/holo-1.5-7b
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_holo_A40` (public IP: `164.30.22.166`)
- Flavor: GPU-accelerated | 48 vCPUs | 384 GiB | `g7.12xlarge.8`
- GPU: 1 × NVIDIA A40 (48 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (multimodal grounding)
- Endpoint env var: `vLLM_VISION_ENDPOINT`
- Current server (deployment reference): `http://164.30.22.166:8000/v1`
**Key client settings (grounding / coordinate space):**
- `model`: `holo-1.5-7b`
- Native coordinate space: `3840×2160` (4K)
- Client grounding dimensions:
- `grounding_width`: `3840`
- `grounding_height`: `2160`
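The sketch below shows a grounding call and the coordinate rescaling implied by these settings; the `openai` package, prompt wording, and reply parsing are assumptions, not the actual script code:
```python
import base64
import os
import re
from openai import OpenAI

GROUNDING_WIDTH, GROUNDING_HEIGHT = 3840, 2160  # Holo 1.5 native coordinate space

client = OpenAI(base_url=os.environ["vLLM_VISION_ENDPOINT"], api_key="EMPTY")

def ground(instruction: str, screenshot_path: str, screen_w: int, screen_h: int):
    """Ask the grounding model for a click position and map it from the
    model's 3840x2160 space to the actual screenshot resolution."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="holo-1.5-7b",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text",
                 "text": f"Return the (x, y) click position for: {instruction}"},
            ],
        }],
    )
    reply = response.choices[0].message.content
    match = re.search(r"(\d+)\D+(\d+)", reply)  # illustrative parsing only
    if match is None:
        raise ValueError(f"No coordinates found in model reply: {reply!r}")
    x_native, y_native = int(match.group(1)), int(match.group(2))

    # Rescale from the native 4K grounding space to the real screen size.
    return (round(x_native * screen_w / GROUNDING_WIDTH),
            round(y_native * screen_h / GROUNDING_HEIGHT))
```
For example, on a 1920×1080 screenshot a native coordinate of `(3840, 2160)` maps to `(1920, 1080)`.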