---
title: Model Stack
linkTitle: Model Stack
weight: 4
description: Thinking vs grounding model split for D66 (current state and target state)
---
# Model Stack
For a visual overview of how the models interact with the VNC-based GUI automation loop, see: Workflow Diagram
## Requirement

The Autonomous UAT Agent must use open-source models from European companies. This has been a project requirement from the very beginning.
## Target setup
- Thinking / planning: Ministral
- Grounding / coordinates: Holo 1.5
The Agent S framework runs an iterative loop: it uses a reasoning model to decide what to do next (plan the next action) and a grounding model to translate UI intent into pixel-accurate coordinates on the current screenshot. This split is essential for reliable GUI automation because planning and “where exactly to click” are different problems and benefit from different model capabilities.
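As a concrete illustration, the following is a minimal sketch of that split, assuming the two vLLM OpenAI-compatible endpoints described later on this page. The prompt wording, the `DONE` convention, and the fallback URLs are illustrative placeholders, not the actual Agent S internals.

```python
import base64
import os

from openai import OpenAI

# Two separate clients: one text-only planner, one multimodal grounder.
thinking = OpenAI(
    base_url=os.environ.get("vLLM_THINKING_ENDPOINT", "http://164.30.28.242:8001/v1"),
    api_key="EMPTY",  # vLLM only enforces a key when started with --api-key
)
grounding = OpenAI(
    base_url=os.environ.get("vLLM_VISION_ENDPOINT", "http://164.30.22.166:8000/v1"),
    api_key="EMPTY",
)

def plan(task: str, history: list[str]) -> str:
    """Thinking model: decide the next UI action as plain text."""
    reply = thinking.chat.completions.create(
        model="/home/ubuntu/ministral-vllm/models/ministral-3-8b",
        temperature=0.0,  # deterministic planning
        messages=[{"role": "user",
                   "content": f"Task: {task}\nDone so far: {history}\n"
                              "Name the next single UI action, or DONE."}],
    )
    return reply.choices[0].message.content

def ground(action: str, screenshot_png: bytes) -> str:
    """Grounding model: turn the planned action into pixel coordinates."""
    image = "data:image/png;base64," + base64.b64encode(screenshot_png).decode()
    reply = grounding.chat.completions.create(
        model="holo-1.5-7b",
        temperature=0.0,
        messages=[{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": image}},
            {"type": "text", "text": f"Click target for: {action}. Answer with (x, y)."},
        ]}],
    )
    return reply.choices[0].message.content
```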
## Why split models?
- Reasoning models optimize planning and textual decision making
- Vision/grounding models optimize stable coordinate output
- Separation reduces “coordinate hallucinations” and makes debugging easier
## Current state in repo
- Some scripts and docs still reference historical Claude and Pixtral experiments.
- Pixtral is not suitable for pixel-level grounding in this use case: in our evaluations it did not provide the consistency and coordinate stability required for reliable UI automation.
- In an early prototyping phase, Anthropic Claude Sonnet was useful due to its strong instruction-following and reasoning quality; however, it does not meet the D66 constraints (open-source + European provider), so it could not be used for the D66 target solution.
## Current configuration (D66)

### Thinking model: Ministral 3 8B (Instruct)

- HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
- Runs on OTC (Open Telekom Cloud) ECS: `ecs_ministral_L4` (public IP: `164.30.28.242`)
  - Flavor: GPU-accelerated | 16 vCPUs | 64 GiB | `pi5e.4xlarge.4`
  - GPU: 1 × NVIDIA Tesla L4 (24 GiB)
  - Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (public image)
- Deployment: vLLM OpenAI-compatible endpoint (chat completions)
- Endpoint env var: `vLLM_THINKING_ENDPOINT`
- Current server (deployment reference): `http://164.30.28.242:8001/v1`

Operational note: vLLM is configured to auto-start on server boot (OTC ECS restart) via systemd.

Key serving settings (vLLM):

- `--gpu-memory-utilization 0.90`
- `--max-model-len 32768`
- `--host 0.0.0.0`
- `--port 8001`

Key client settings (Autonomous UAT Agent scripts):

- `model`: `/home/ubuntu/ministral-vllm/models/ministral-3-8b`
- `temperature`: `0.0`
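A quick way to check the endpoint and the served model name is an OpenAI-client smoke test. This is a sketch assuming the `vLLM_THINKING_ENDPOINT` env var above is exported; the `api_key` value is a dummy, since vLLM only enforces a key when launched with `--api-key`.

```python
import os

from openai import OpenAI

client = OpenAI(base_url=os.environ["vLLM_THINKING_ENDPOINT"], api_key="EMPTY")
# Should print the served model path, matching the client `model` setting above.
print([m.id for m in client.models.list()])
```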
### Grounding model: Holo 1.5-7B

- HuggingFace model card: https://huggingface.co/holo-1.5-7b
- Runs on OTC (Open Telekom Cloud) ECS: `ecs_holo_A40` (public IP: `164.30.22.166`)
  - Flavor: GPU-accelerated | 48 vCPUs | 384 GiB | `g7.12xlarge.8`
  - GPU: 1 × NVIDIA A40 (48 GiB)
  - Image: `Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074` (public image)
- Deployment: vLLM OpenAI-compatible endpoint (multimodal grounding)
- Endpoint env var: `vLLM_VISION_ENDPOINT`
- Current server (deployment reference): `http://164.30.22.166:8000/v1`

Key client settings (grounding / coordinate space):

- `model`: `holo-1.5-7b`
- Native coordinate space: `3840×2160` (4K)
- Client grounding dimensions: `grounding_width: 3840`, `grounding_height: 2160`
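Because the model answers in its native 4K coordinate space, a client driving a GUI at a different resolution has to rescale the returned point. A minimal sketch of that mapping, assuming the grounding dimensions above (the function name `to_screen` is hypothetical):

```python
# Grounding dimensions from the client settings above.
GROUNDING_WIDTH, GROUNDING_HEIGHT = 3840, 2160

def to_screen(x: int, y: int, screen_w: int, screen_h: int) -> tuple[int, int]:
    """Rescale an (x, y) point from the model's 4K space to screen pixels."""
    return (round(x * screen_w / GROUNDING_WIDTH),
            round(y * screen_h / GROUNDING_HEIGHT))

# Example: a 4K-space answer mapped onto a 1920×1080 VNC display.
print(to_screen(1920, 1080, 1920, 1080))  # -> (960, 540)
```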