---
title: "Model Stack"
linkTitle: "Model Stack"
weight: 4
description: >
  Thinking vs grounding model split for D66 (current state and target state)
---

# Model Stack

For a visual overview of how the models interact with the VNC-based GUI automation loop, see: [Workflow Diagram](./agent-workflow-diagram.md)

## Requirement

The Autonomous UAT Agent must use **open-source models from European companies**. This has been a project requirement from the very beginning.

## Target setup

- **Thinking / planning:** Ministral
- **Grounding / coordinates:** Holo 1.5

The Agent S framework runs an iterative loop: it uses a reasoning model to decide *what to do next* (plan the next action) and a grounding model to translate UI intent into *pixel-accurate coordinates* on the current screenshot. This split is essential for reliable GUI automation because planning and “where exactly to click” are different problems and benefit from different model capabilities.
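One iteration of this loop can be sketched as follows. This is an illustrative outline, not the actual Agent S API: the function name and the injected callbacks (`take_screenshot`, `plan`, `ground`, `execute`) are assumptions made for the example.

```python
def run_uat_step(goal: str, take_screenshot, plan, ground, execute) -> None:
    """One iteration of the Agent S-style loop (names are illustrative).

    plan(goal, screenshot)     -> textual next action from the thinking model
    ground(intent, screenshot) -> (x, y) pixel coordinates from the grounding model
    execute(intent, x, y)      -> performs the click/typing over VNC
    """
    screenshot = take_screenshot()
    intent = plan(goal, screenshot)    # WHAT to do next (reasoning model)
    x, y = ground(intent, screenshot)  # WHERE exactly to act (grounding model)
    execute(intent, x, y)              # act on the GUI, then the loop repeats
```

Injecting the two model calls as separate callbacks mirrors the split described above: the planner never has to emit coordinates, and the grounder never has to reason about the test goal.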
## Why split models?

- Reasoning models optimize planning and textual decision making
- Vision/grounding models optimize stable coordinate output
- Separation reduces “coordinate hallucinations” and makes debugging easier

## Current state in repo

- Some scripts and docs still reference historical **Claude** and **Pixtral** experiments.
- **Pixtral is not suitable for pixel-level grounding in this use case**: in our evaluations it did not provide the consistency and coordinate stability required for reliable UI automation.
- In an early prototyping phase, **Anthropic Claude Sonnet** was useful due to strong instruction-following and reasoning quality; however, it does not meet the D66 constraints (open-source + European provider), so it could not be used for the D66 target solution.

## Current configuration (D66)

### Thinking model: Ministral 3 8B (Instruct)

- HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_ministral_L4` (public IP: `164.30.28.242`)
- Flavor: GPU-accelerated | 16 vCPUs | 64 GiB | `pi5e.4xlarge.4`
- GPU: 1 × NVIDIA Tesla L4 (24 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (chat completions)
- Endpoint env var: `vLLM_THINKING_ENDPOINT`
- Current server (deployment reference): `http://164.30.28.242:8001/v1`

**Operational note:** vLLM is configured to **auto-start on server boot** (OTC ECS restart) via `systemd`.
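One way to wire this up is a `systemd` unit along the following lines. The unit name, file path, and user are assumptions for illustration; the `vllm serve` flags mirror the serving settings documented on this page.

```ini
# /etc/systemd/system/vllm-ministral.service  (path and unit name are assumptions)
[Unit]
Description=vLLM OpenAI-compatible server for Ministral 3 8B
After=network-online.target
Wants=network-online.target

[Service]
User=ubuntu
ExecStart=/usr/bin/env vllm serve /home/ubuntu/ministral-vllm/models/ministral-3-8b \
    --gpu-memory-utilization 0.90 --max-model-len 32768 --host 0.0.0.0 --port 8001
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enabling it with `sudo systemctl enable --now vllm-ministral` makes the endpoint come back automatically after an ECS restart.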

**Key serving settings (vLLM):**

- `--gpu-memory-utilization 0.90`
- `--max-model-len 32768`
- `--host 0.0.0.0`
- `--port 8001`

**Key client settings (Autonomous UAT Agent scripts):**

- `model`: `/home/ubuntu/ministral-vllm/models/ministral-3-8b`
- `temperature`: `0.0`

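A minimal, stdlib-only client sketch using these settings. The helper names are illustrative; the base URL is read from the documented `vLLM_THINKING_ENDPOINT` env var, falling back to the current server.

```python
import json
import os
from urllib import request

# Base URL from the documented env var; fallback is the current deployment.
BASE_URL = os.environ.get("vLLM_THINKING_ENDPOINT", "http://164.30.28.242:8001/v1")

def thinking_request(prompt: str) -> dict:
    """Build a chat-completions body matching the documented client settings."""
    return {
        # vLLM identifies the model by the path it was served from.
        "model": "/home/ubuntu/ministral-vllm/models/ministral-3-8b",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic planning output
    }

def think(prompt: str) -> str:
    """POST the request to the OpenAI-compatible endpoint (network call)."""
    req = request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(thinking_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

`temperature: 0.0` keeps planning output reproducible across runs, which matters when the same UAT scenario is replayed for debugging.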
### Grounding model: Holo 1.5-7B

- HuggingFace model card: https://huggingface.co/holo-1.5-7b
- Runs on **OTC (Open Telekom Cloud) ECS**: `ecs_holo_A40` (public IP: `164.30.22.166`)
- Flavor: GPU-accelerated | 48 vCPUs | 384 GiB | `g7.12xlarge.8`
- GPU: 1 × NVIDIA A40 (48 GiB)
- Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (Public image)
- Deployment: vLLM OpenAI-compatible endpoint (multimodal grounding)
- Endpoint env var: `vLLM_VISION_ENDPOINT`
- Current server (deployment reference): `http://164.30.22.166:8000/v1`

**Key client settings (grounding / coordinate space):**

- `model`: `holo-1.5-7b`
- Native coordinate space: `3840×2160` (4K)
- Client grounding dimensions:
  - `grounding_width`: `3840`
  - `grounding_height`: `2160`