---
title: "Model Stack"
linkTitle: "Model Stack"
weight: 4
description: >
  Thinking vs grounding model split for D66 (current state and target state)
---

# Model Stack

For a visual overview of how the models interact with the VNC-based GUI automation loop, see: [Workflow Diagram](./agent-workflow-diagram.md)

## Requirement

The Autonomous UAT Agent must use **open-source models from European companies**. This has been a project requirement from the very beginning of the project.

## Target setup

- **Thinking / planning:** Ministral
- **Grounding / coordinates:** Holo 1.5

The Agent S framework runs an iterative loop: it uses a reasoning model to decide *what to do next* (plan the next action) and a grounding model to translate UI intent into *pixel-accurate coordinates* on the current screenshot. This split is essential for reliable GUI automation because planning and “where exactly to click” are different problems and benefit from different model capabilities.

## Why split models?

- Reasoning models optimize planning and textual decision making
- Vision/grounding models optimize stable coordinate output
- Separation reduces “coordinate hallucinations” and makes debugging easier

## Current state in repo

- Some scripts and docs still reference historical **Claude** and **Pixtral** experiments.
- **Pixtral is not suitable for pixel-level grounding in this use case**: in our evaluations it did not provide the consistency and coordinate stability required for reliable UI automation.
- In an early prototyping phase, **Anthropic Claude Sonnet** was useful due to strong instruction-following and reasoning quality; however, it does not meet the D66 constraints (open-source + European provider), so it could not be used for the D66 target solution.
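The plan/ground split described above can be sketched as one loop iteration. This is an illustrative outline, not the Agent S implementation: the `chat`/`ground` callables, the JSON action schema, and the helper names are assumptions; only the endpoint env var names come from this page.

```python
# Minimal sketch of one plan -> ground iteration against two vLLM
# OpenAI-compatible endpoints. The transport functions are injected so
# the control flow can be shown without a live server.
import json
import os

# Env var names as documented below; fallback URLs are the deployment references.
THINKING_ENDPOINT = os.environ.get("vLLM_THINKING_ENDPOINT", "http://164.30.28.242:8001/v1")
VISION_ENDPOINT = os.environ.get("vLLM_VISION_ENDPOINT", "http://164.30.22.166:8000/v1")


def parse_action(reply: str) -> dict:
    """Parse the thinking model's reply into an action dict.

    Assumed schema: {"action": "click", "target": "Save button"}.
    """
    action = json.loads(reply)
    if "action" not in action:
        raise ValueError(f"no 'action' key in reply: {reply!r}")
    return action


def one_step(chat, ground, screenshot_b64: str, goal: str):
    """One iteration: plan with the reasoning model, then resolve pixels.

    `chat(endpoint, messages)` returns the model's text reply;
    `ground(endpoint, image, query)` returns (x, y) for a UI element.
    """
    reply = chat(THINKING_ENDPOINT, [
        {"role": "system", "content": "Reply with exactly one JSON action."},
        {"role": "user", "content": f"Goal: {goal}\nWhat is the next UI action?"},
    ])
    action = parse_action(reply)
    coords = None
    if action["action"] == "click":
        # Only click-style actions need pixel-accurate grounding.
        coords = ground(VISION_ENDPOINT, screenshot_b64, action["target"])
    return action, coords
```

Keeping the two calls behind separate endpoints is what makes the split debuggable: a wrong *plan* and a wrong *click location* show up in different places.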
## Current configuration (D66)

### Thinking model: Ministral 3 8B (Instruct)

- HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_ministral_L4` (public IP: `164.30.28.242`)
- Flavor: GPU-accelerated | 16 vCPUs | 64 GiB | `pi5e.4xlarge.4`
- GPU: 1 × NVIDIA Tesla L4 (24 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (chat completions)
- Endpoint env var: `vLLM_THINKING_ENDPOINT`
- Current server (deployment reference): `http://164.30.28.242:8001/v1`

**Operational note:** vLLM is configured to **auto-start on server boot** (OTC ECS restart) via `systemd`.

**Key serving settings (vLLM):**

- `--gpu-memory-utilization 0.90`
- `--max-model-len 32768`
- `--host 0.0.0.0`
- `--port 8001`

**Key client settings (Autonomous UAT Agent scripts):**

- `model`: `/home/ubuntu/ministral-vllm/models/ministral-3-8b`
- `temperature`: `0.0`

### Grounding model: Holo 1.5-7B

- HuggingFace model card: https://huggingface.co/holo-1.5-7b
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_holo_A40` (public IP: `164.30.22.166`)
- Flavor: GPU-accelerated | 48 vCPUs | 384 GiB | `g7.12xlarge.8`
- GPU: 1 × NVIDIA A40 (48 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (multimodal grounding)
- Endpoint env var: `vLLM_VISION_ENDPOINT`
- Current server (deployment reference): `http://164.30.22.166:8000/v1`

**Key client settings (grounding / coordinate space):**

- `model`: `holo-1.5-7b`
- Native coordinate space: `3840×2160` (4K)
- Client grounding dimensions:
  - `grounding_width`: `3840`
  - `grounding_height`: `2160`
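Because the grounding model answers in its native `3840×2160` coordinate space, a client running against a screen of a different resolution has to rescale the returned point before clicking. A minimal sketch of that mapping, assuming the `grounding_width`/`grounding_height` values above (the function name `to_screen` is illustrative, not from the agent scripts):

```python
# Map a point from Holo 1.5's native grounding space to real screen pixels.
GROUNDING_WIDTH = 3840   # grounding_width from the client settings above
GROUNDING_HEIGHT = 2160  # grounding_height from the client settings above


def to_screen(x: int, y: int, screen_w: int, screen_h: int) -> tuple:
    """Rescale (x, y) from the 3840x2160 grounding space to screen pixels."""
    return (round(x * screen_w / GROUNDING_WIDTH),
            round(y * screen_h / GROUNDING_HEIGHT))
```

For example, on a 1920×1080 screenshot a grounded point of `(1920, 1080)` (the center of the 4K grounding space) maps to `(960, 540)`, the center of the actual screen.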