The Autonomous UAT Agent must use **open-source models from European companies**. This has been a project requirement from the very beginning.
The Agent S framework runs an iterative loop: it uses a reasoning model to decide *what to do next* (plan the next action) and a grounding model to translate UI intent into *pixel-accurate coordinates* on the current screenshot. This split is essential for reliable GUI automation because planning and “where exactly to click” are different problems and benefit from different model capabilities.
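In outline, one iteration of such a loop looks like the sketch below. This is a minimal illustration of the planner/grounder split, not the Agent S implementation; `call_reasoning_model`, `call_grounding_model`, and the `Action` dataclass are placeholder names standing in for the real model clients.

```python
from dataclasses import dataclass

@dataclass
class Action:
    intent: str   # e.g. "click the 'Save' button"
    x: int        # pixel coordinates produced by the grounding model
    y: int

def call_reasoning_model(task: str, history: list[str]) -> str:
    # Placeholder: in the real agent this is a chat call to the thinking model.
    return "click the 'Save' button"

def call_grounding_model(intent: str, screenshot: bytes) -> tuple[int, int]:
    # Placeholder: in the real agent this is a call to the grounding model
    # with the current screenshot attached.
    return (640, 480)

def run_step(task: str, history: list[str], screenshot: bytes) -> Action:
    # 1) Planning: the reasoning model decides *what* to do next, in plain text.
    intent = call_reasoning_model(task, history)
    # 2) Grounding: the grounding model decides *where* that intent is on screen.
    x, y = call_grounding_model(intent, screenshot)
    return Action(intent=intent, x=x, y=y)
```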
## Why split models?
- Reasoning models are optimized for planning and textual decision making, while grounding models specialize in locating UI elements on the screenshot
- The separation reduces “coordinate hallucinations” and makes failures easier to debug
## Current state in repo
- Some scripts and docs still reference historical **Claude** and **Pixtral** experiments.
- **Pixtral is not suitable for pixel-level grounding in this use case**: in our evaluations it did not provide the consistency and coordinate stability required for reliable UI automation.
- In an early prototyping phase, **Anthropic Claude Sonnet** was useful due to its strong instruction following and reasoning quality; however, it does not meet the D66 constraints (open-source + European provider) and therefore could not be used for the D66 target solution.
## Current configuration (D66)
### Thinking model: Ministral 3 8B (Instruct)
- HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
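As an illustration of how the thinking model can be reached from the agent, the sketch below assumes the model is served locally through vLLM behind an OpenAI-compatible endpoint; the port, prompt, and client wiring are assumptions for illustration, not the repo's actual configuration.

```python
from openai import OpenAI

# Assumes the model is served locally, e.g. with:
#   vllm serve mistralai/Ministral-3-8B-Instruct-2512 --port 8000
# The port and the prompt below are illustrative assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Ministral-3-8B-Instruct-2512",
    messages=[
        {"role": "system", "content": "You are the planning (thinking) model of a UAT agent."},
        {"role": "user", "content": "The login page is shown. What is the next UI action?"},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```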