What it takes to turn an open 3B vision-language model into a GUI grounding agent for about $15: visual grounding, LoRA and QLoRA, and a verifiable-reward GRPO recipe, where SFT does the heavy lifting and RL rewards the whole target, not just its center.
