---
title: "Model Stack"
linkTitle: "Models"
weight: 4
description: >
  Thinking vs grounding model split for D66 (current state and target state)
---
# Model Stack (D66)
For a visual overview of how the models interact with the VNC-based GUI automation loop, see: [Workflow Diagram](./agent-workflow-diagram.md)
## Requirement
D66 must use **open-source models from European companies**.
## Target setup
- **Thinking / planning:** Ministral
- **Grounding / coordinates:** Holo 1.5

The Agent S framework runs an iterative loop: it uses a reasoning model to decide *what to do next* (plan the next action) and a grounding model to translate UI intent into *pixel-accurate coordinates* on the current screenshot. This split is essential for reliable GUI automation because planning and “where exactly to click” are different problems and benefit from different model capabilities.
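
A minimal sketch of this split is shown below, assuming the two vLLM OpenAI-compatible endpoints described later on this page. The prompts, helper names, and the plain-text coordinate handling are illustrative assumptions, not the actual Agent S integration.

```python
"""Illustrative plan/ground split against two vLLM OpenAI-compatible endpoints."""
import base64
import os

from openai import OpenAI

# Both servers expose the OpenAI chat-completions API (endpoint details below).
thinking = OpenAI(base_url=os.environ["vLLM_THINKING_ENDPOINT"], api_key="EMPTY")
grounding = OpenAI(base_url=os.environ["vLLM_VISION_ENDPOINT"], api_key="EMPTY")


def plan_next_action(task: str, history: list[str]) -> str:
    """Thinking model: decide *what to do next* from the task and prior steps."""
    reply = thinking.chat.completions.create(
        model="/home/ubuntu/ministral-vllm/models/ministral-3-8b",
        temperature=0.0,
        messages=[
            {"role": "system", "content": "Plan exactly one GUI action at a time."},
            {"role": "user", "content": f"Task: {task}\nSteps so far: {history}"},
        ],
    )
    return reply.choices[0].message.content  # e.g. "Click the 'Login' button"


def ground_instruction(instruction: str, screenshot_png: bytes) -> str:
    """Grounding model: turn the instruction into coordinates on the screenshot."""
    image_b64 = base64.b64encode(screenshot_png).decode()
    reply = grounding.chat.completions.create(
        model="holo-1.5-7b",
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Give click coordinates for: {instruction}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return reply.choices[0].message.content  # coordinates in the grounding space


if __name__ == "__main__":
    screenshot = open("screenshot.png", "rb").read()  # captured from the VNC session
    action = plan_next_action("Log in to the test portal", history=[])
    print("Planned action:", action)
    print("Grounded output:", ground_instruction(action, screenshot))
    # The agent would now execute the click via VNC and repeat the loop.
```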
## Why split models?
- Reasoning models optimize planning and textual decision making
- Vision/grounding models optimize stable coordinate output
- Separation reduces “coordinate hallucinations” and makes debugging easier
## Current state in repo
- Some scripts and docs still reference historical **Claude** and **Pixtral** experiments.
- **Pixtral is not suitable for pixel-level grounding in this use case**: in our evaluations it did not provide the consistency and coordinate stability required for reliable UI automation.
- In an early prototyping phase, **Anthropic Claude Sonnet** was useful due to its strong instruction-following and reasoning quality; however, it does not meet the D66 constraints (open-source + European provider), so it could not be used for the D66 target solution.
## Current configuration (D66)
### Thinking model: Ministral 3 8B (Instruct)
- HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_ministral_L4` (public IP: `164.30.28.242`)
- Flavor: GPU-accelerated | 16 vCPUs | 64 GiB | `pi5e.4xlarge.4`
- GPU: 1 × NVIDIA Tesla L4 (24 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (chat completions)
- Endpoint env var: `vLLM_THINKING_ENDPOINT`
- Current server (deployment reference): `http://164.30.28.242:8001/v1`
- Recommendation: set `vLLM_THINKING_ENDPOINT` explicitly (do not rely on script defaults).

**Operational note:** vLLM is configured to **auto-start on server boot** (OTC ECS restart) via `systemd`.

**Key serving settings (vLLM):**
- `--gpu-memory-utilization 0.90`
- `--max-model-len 32768`
- `--host 0.0.0.0`
- `--port 8001`

**Key client settings (Autonomous UAT Agent scripts):**
- `model`: `/home/ubuntu/ministral-vllm/models/ministral-3-8b`
- `temperature`: `0.0`
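
For reference, a minimal connectivity check against the thinking endpoint using exactly these client settings (an illustrative snippet, not one of the repo scripts):

```python
# Minimal connectivity check for the thinking endpoint, using the client
# settings listed above (illustrative; not part of the Autonomous UAT Agent repo).
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["vLLM_THINKING_ENDPOINT"],  # set explicitly, e.g. http://164.30.28.242:8001/v1
    api_key="EMPTY",  # vLLM only validates the key if serving was started with --api-key
)
reply = client.chat.completions.create(
    model="/home/ubuntu/ministral-vllm/models/ministral-3-8b",  # served model name/path
    temperature=0.0,
    messages=[{"role": "user", "content": "Reply with the single word OK."}],
)
print(reply.choices[0].message.content)
```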
### Grounding model: Holo 1.5-7B
- HuggingFace model card: https://huggingface.co/holo-1.5-7b
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_holo_A40` (public IP: `164.30.22.166`)
- Flavor: GPU-accelerated | 48 vCPUs | 384 GiB | `g7.12xlarge.8`
- GPU: 1 × NVIDIA A40 (48 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (multimodal grounding)
- Endpoint env var: `vLLM_VISION_ENDPOINT`
- Current server (deployment reference): `http://164.30.22.166:8000/v1`

**Key client settings (grounding / coordinate space):**
- `model`: `holo-1.5-7b`
- Native coordinate space: `3840×2160` (4K)
- Client grounding dimensions:
  - `grounding_width`: `3840`
  - `grounding_height`: `2160`
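
Because the grounding model predicts points in this fixed 3840×2160 space, coordinates have to be mapped onto the actual VNC framebuffer before clicking. A small sketch of that mapping, assuming a simple linear rescale (illustrative helper, not from the repo):

```python
# Rescale a point from Holo's native 3840x2160 grounding space to the
# actual screen resolution of the VNC session (assumed linear mapping).
GROUNDING_WIDTH = 3840
GROUNDING_HEIGHT = 2160

def rescale(x: int, y: int, screen_width: int, screen_height: int) -> tuple[int, int]:
    """Map an (x, y) point from the 4K grounding space onto the live screen."""
    return (
        round(x * screen_width / GROUNDING_WIDTH),
        round(y * screen_height / GROUNDING_HEIGHT),
    )

# Example: a click predicted at (1920, 1080) in the grounding space lands
# at (960, 540) on a 1920x1080 VNC display.
print(rescale(1920, 1080, 1920, 1080))  # -> (960, 540)
```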

Notes:
- Prompting and output-format hardening (reliability work):
  - `docs/story-026-001-context.md` (Holo output reliability)
  - `docs/story-025-001-context.md` (double grounding / calibration)