<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Sft on MdJawad</title><link>https://www.mdjawad.com/tags/sft/</link><description>Recent content in Sft on MdJawad</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Wed, 03 Jun 2026 04:29:47 +0000</lastBuildDate><atom:link href="https://www.mdjawad.com/tags/sft/index.xml" rel="self" type="application/rss+xml"/><item><title>Teaching a 3B VLM to Click: SFT, GRPO, and What Actually Moved the Needle</title><link>https://www.mdjawad.com/posts/teaching-a-3b-vlm-to-click/</link><pubDate>Wed, 03 Jun 2026 09:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/teaching-a-3b-vlm-to-click/</guid><description>What it takes to turn an open 3B vision-language model into a GUI grounding agent for about $15: visual grounding, LoRA and QLoRA, and a verifiable-reward GRPO recipe, where SFT does the heavy lifting and RL rewards the whole target, not just its center.</description><content:encoded><![CDATA[<h2 id="the-agent-that-looks-at-the-screen">The Agent That Looks at the Screen</h2>
<p>For most of their short history, software agents have been polite guests in other people&rsquo;s houses. They acted through doors that someone built for them on purpose: REST APIs, function schemas, accessibility trees, the DOM. If the door existed, the agent could walk through it. If it didn&rsquo;t, the agent was stuck on the porch.</p>
<p>Computer-use models break that habit. Instead of asking an application for a structured handle on itself, the model looks at the screen (the actual rendered pixels a person would see) and acts on it the way a person does: move the cursor here, click, type, scroll, look again. When Anthropic shipped computer use and OpenAI shipped Operator, this stopped being a research demo and became a product category. The pitch is enormous: an agent that can drive <em>any</em> software, on any operating system, with no integration and no permission from the application, because it is simply using the interface the way you do.</p>
<p>Strip away the planning and the multi-step choreography in those demos and you reach a single, unglamorous skill that all of it is built on. Given a screenshot and an instruction in plain English (<em>&ldquo;open the settings menu&rdquo;</em>), put the cursor on the right pixel. That&rsquo;s it. That skill has a name: <strong>visual grounding</strong>. Get it reliably right and the flashy agent becomes possible. Get it wrong and every plan above it collapses, because a perfect chain of reasoning that ends in a click on the wrong button is just a confident mistake.</p>
<p>This post is about teaching that one skill to a small, open model, <a href="https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct">Qwen2.5-VL-3B</a> (Apache-2.0), from a standing start, for about the price of lunch. The whole project ran on a single rented 24 GB GPU and cost roughly <strong>$15</strong>. It is also a chance to watch the ideas this blog has covered one at a time finally snap together into something that does a job: <a href="/posts/quantization-and-gptq/">quantization</a> is what lets a multi-billion-parameter model train on a hobbyist&rsquo;s card; <a href="/posts/flash-attention/">memory-efficient attention</a> is what lets it read a dense, high-resolution screenshot at all; and <a href="/posts/rlhf-to-grpo/">GRPO</a> is the reinforcement-learning algorithm that does the final polishing. None of these are new here. What&rsquo;s new is assembling them into a working computer-use agent.</p>
<p>Two techniques did the teaching, in sequence: supervised fine-tuning, then reinforcement learning. The headline, which I&rsquo;ll defend with numbers rather than vibes: <strong>SFT did almost all of the heavy lifting, and RL earned its keep in a narrower, more interesting way than the hype suggests.</strong></p>
<h2 id="what-computer-use-is-and-why-grounding-is-the-keystone">What &ldquo;Computer Use&rdquo; Is, and Why Grounding Is the Keystone</h2>
<p>It helps to be precise about scope, because &ldquo;computer-use agent&rdquo; covers a huge range of ambition.</p>
<p>At the hard end is full task execution: <em>&ldquo;book me the cheapest flight to Tokyo next month.&rdquo;</em> That requires a live environment, long-horizon planning, recovery from mistakes, and credit assignment across dozens of steps where only the final outcome tells you whether you succeeded. It is genuinely hard, and benchmarks like OSWorld and WebArena exist to measure it. It is also expensive and slow to train against, because every rollout needs a real, stateful machine on the other end.</p>
<p>At the tractable end is grounding, a single step: screenshot plus instruction in, one coordinate out. It needs no simulator. The reward is plain arithmetic (<em>did the predicted point land inside the target element&rsquo;s bounding box?</em>) rather than something a learned judge has to decide. It is also the slice the recent GUI reinforcement-learning literature (UI-R1, GUI-R1, GTA1) converged on, because it is the most learnable, verifiable piece of the problem. Grounding is the keystone: it is the smallest unit of &ldquo;use the computer&rdquo; that you can train, measure, and trust on its own.</p>
<p>So that is the task. Concretely, the model is handed an image and one line of text, and must answer with a pixel:</p>

<div class="cua-gp" id="cua-gp-ca72848780901f6e41aa11b8375d1dc6">
  <style>
    .cua-gp {
      --gp-bg: #0d1117;
      --gp-surface: #161b22;
      --gp-screen: #0a0d12;
      --gp-border: #30363d;
      --gp-text: #e6edf3;
      --gp-muted: #8b949e;
      --gp-accent: #a371f7;
      --gp-accent-dim: rgba(163, 113, 247, 0.15);
      --gp-blue: #58a6ff;
      --gp-green: #39d353;
      --gp-green-dim: rgba(57, 211, 83, 0.12);
      --gp-red: #f97583;
      --gp-red-dim: rgba(249, 117, 131, 0.12);

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--gp-bg);
      color: var(--gp-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .cua-gp,
    :root:not([data-theme="dark"]) .cua-gp {
      --gp-bg: #f8fafc;
      --gp-surface: #ffffff;
      --gp-screen: #eef2f7;
      --gp-border: #e2e8f0;
      --gp-text: #1e293b;
      --gp-muted: #64748b;
      --gp-accent: #8b5cf6;
      --gp-accent-dim: rgba(139, 92, 246, 0.1);
      --gp-blue: #3b82f6;
      --gp-green: #10b981;
      --gp-green-dim: rgba(16, 185, 129, 0.12);
      --gp-red: #ef4444;
      --gp-red-dim: rgba(239, 68, 68, 0.08);
    }

    .cua-gp * { box-sizing: border-box; }

    .cua-gp .gp-header { text-align: center; margin-bottom: 1.25rem; }
    .cua-gp .gp-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem; font-weight: 600; color: var(--gp-accent);
      letter-spacing: 0.08em; text-transform: uppercase; margin: 0 0 0.4rem 0;
    }
    .cua-gp .gp-header p { color: var(--gp-muted); font-size: 0.9rem; margin: 0; }

    .cua-gp .gp-prompt {
      background: var(--gp-surface); border: 1px solid var(--gp-accent);
      border-radius: 8px; padding: 0.85rem 1.1rem; margin-bottom: 1rem;
      display: flex; align-items: center; gap: 0.75rem; flex-wrap: wrap;
    }
    .cua-gp .gp-prompt-label {
      font-family: 'IBM Plex Mono', monospace; font-size: 0.65rem; font-weight: 600;
      color: var(--gp-accent); text-transform: uppercase; letter-spacing: 0.06em;
      background: var(--gp-accent-dim); padding: 0.2rem 0.5rem; border-radius: 4px;
      white-space: nowrap;
    }
    .cua-gp .gp-prompt-text { font-size: 1.02rem; font-weight: 600; }

     
    .cua-gp .gp-screen {
      position: relative;
      background: var(--gp-screen);
      border: 1px solid var(--gp-border);
      border-radius: 10px;
      overflow: hidden;
      user-select: none;
    }
    .cua-gp .gp-titlebar {
      display: flex; align-items: center; gap: 0.4rem;
      padding: 0.6rem 0.85rem;
      background: var(--gp-surface);
      border-bottom: 1px solid var(--gp-border);
    }
    .cua-gp .gp-dot { width: 11px; height: 11px; border-radius: 50%; }
    .cua-gp .gp-dot.r { background: #f97583; }
    .cua-gp .gp-dot.y { background: #d29922; }
    .cua-gp .gp-dot.g { background: #39d353; }
    .cua-gp .gp-titlebar-name {
      margin-left: 0.5rem; font-size: 0.78rem; color: var(--gp-muted);
      font-family: 'IBM Plex Mono', monospace;
    }

    .cua-gp .gp-toolbar {
      display: flex; gap: 0.4rem; padding: 0.9rem;
      justify-content: center; flex-wrap: nowrap;
    }
    .cua-gp .gp-tool {
      flex: 1 1 0; min-width: 0;
      display: flex; flex-direction: column; align-items: center; gap: 0.25rem;
      padding: 0.6rem 0.2rem; border-radius: 8px;
      background: var(--gp-surface); border: 1px solid var(--gp-border);
      transition: border-color 0.3s ease, background 0.3s ease;
    }
    .cua-gp .gp-tool .gp-glyph { font-size: 1.15rem; line-height: 1; }
    .cua-gp .gp-tool .gp-cap {
      font-size: 0.58rem; color: var(--gp-muted);
      font-family: 'IBM Plex Mono', monospace; text-transform: uppercase;
      letter-spacing: 0.03em; white-space: nowrap; overflow: hidden; text-overflow: ellipsis;
      max-width: 100%;
    }
    .cua-gp .gp-tool.gp-target { background: var(--gp-accent-dim); }

     
    .cua-gp .gp-body {
      height: 120px;
      display: flex; align-items: center; justify-content: center;
      color: var(--gp-muted); font-size: 0.8rem; font-style: italic;
      border-top: 1px solid var(--gp-border);
    }

     
    .cua-gp .gp-gtbox {
      position: absolute; border: 2px dashed var(--gp-green); border-radius: 6px;
      box-shadow: 0 0 0 3px var(--gp-green-dim);
      pointer-events: none; opacity: 0;
      transition: opacity 0.4s ease;
    }
    .cua-gp .gp-gtbox.gp-on { opacity: 1; }
    .cua-gp .gp-gtbox-tag {
      position: absolute; top: -1.35rem; left: -2px;
      font-family: 'IBM Plex Mono', monospace; font-size: 0.58rem; font-weight: 700;
      color: var(--gp-green); text-transform: uppercase; letter-spacing: 0.04em;
      white-space: nowrap;
    }

     
    .cua-gp .gp-pred {
      position: absolute; width: 16px; height: 16px; margin: -8px 0 0 -8px;
      border-radius: 50%; z-index: 5;
      transition: left 0.6s cubic-bezier(0.34, 1.3, 0.64, 1), top 0.6s cubic-bezier(0.34, 1.3, 0.64, 1), background 0.4s ease, box-shadow 0.4s ease;
    }
    .cua-gp .gp-pred.hit { background: var(--gp-green); box-shadow: 0 0 0 6px var(--gp-green-dim); }
    .cua-gp .gp-pred.miss { background: var(--gp-red); box-shadow: 0 0 0 6px var(--gp-red-dim); }
    .cua-gp .gp-pred::after {
      content: ''; position: absolute; inset: 4px; border-radius: 50%;
      background: rgba(255,255,255,0.85);
    }

     
    .cua-gp .gp-readout {
      display: grid; grid-template-columns: repeat(3, 1fr); gap: 0.75rem;
      margin-top: 1.1rem;
    }
    .cua-gp .gp-stat {
      background: var(--gp-surface); border: 1px solid var(--gp-border);
      border-radius: 8px; padding: 0.7rem; text-align: center;
    }
    .cua-gp .gp-stat-label {
      font-size: 0.6rem; font-weight: 600; text-transform: uppercase;
      letter-spacing: 0.06em; color: var(--gp-muted); margin-bottom: 0.25rem;
    }
    .cua-gp .gp-stat-value {
      font-family: 'IBM Plex Mono', monospace; font-size: 1.05rem; font-weight: 700;
    }
    .cua-gp .gp-stat-value.pin { color: var(--gp-blue); }
    .cua-gp .gp-stat-value.good { color: var(--gp-green); }
    .cua-gp .gp-stat-value.bad { color: var(--gp-red); }

    .cua-gp .gp-controls {
      display: flex; gap: 0.5rem; margin-top: 1.1rem; flex-wrap: wrap;
      justify-content: center;
    }
    .cua-gp .gp-btn {
      font-family: 'IBM Plex Sans', sans-serif; font-size: 0.8rem; font-weight: 600;
      padding: 0.5rem 1rem; border-radius: 6px; border: 1px solid var(--gp-border);
      background: var(--gp-surface); color: var(--gp-muted); cursor: pointer;
      transition: all 0.2s ease;
    }
    .cua-gp .gp-btn:hover { border-color: var(--gp-accent); color: var(--gp-accent); }
    .cua-gp .gp-btn.gp-sel { background: var(--gp-accent); border-color: var(--gp-accent); color: #fff; }

    .cua-gp .gp-note {
      margin-top: 1rem; font-size: 0.82rem; color: var(--gp-muted);
      line-height: 1.6; text-align: center; min-height: 2.4rem;
    }
    .cua-gp .gp-note strong { color: var(--gp-text); }

    @media (max-width: 600px) {
      .cua-gp .gp-tool .gp-cap { display: none; }
      .cua-gp .gp-readout { grid-template-columns: 1fr; }
    }
  </style>

  <div class="gp-header">
    <h3>Visual Grounding: Screenshot + Instruction &rarr; Click</h3>
    <p>The whole task in one widget. Toggle a training stage; watch where the model clicks.</p>
  </div>

  <div class="gp-prompt">
    <span class="gp-prompt-label">Instruction</span>
    <span class="gp-prompt-text">Open the settings menu</span>
  </div>

  <div class="gp-screen" id="gp-screen-ca72848780901f6e41aa11b8375d1dc6">
    <div class="gp-titlebar">
      <span class="gp-dot r"></span><span class="gp-dot y"></span><span class="gp-dot g"></span>
      <span class="gp-titlebar-name">app — main window</span>
    </div>
    <div class="gp-toolbar" id="gp-toolbar-ca72848780901f6e41aa11b8375d1dc6">
      <div class="gp-tool" data-k="back"><span class="gp-glyph">&#8592;</span><span class="gp-cap">Back</span></div>
      <div class="gp-tool" data-k="fwd"><span class="gp-glyph">&#8594;</span><span class="gp-cap">Forward</span></div>
      <div class="gp-tool" data-k="reload"><span class="gp-glyph">&#10227;</span><span class="gp-cap">Reload</span></div>
      <div class="gp-tool" data-k="home"><span class="gp-glyph">&#9750;</span><span class="gp-cap">Home</span></div>
      <div class="gp-tool" data-k="saved"><span class="gp-glyph">&#9733;</span><span class="gp-cap">Saved</span></div>
      <div class="gp-tool" data-k="alerts" id="gp-neighbor-ca72848780901f6e41aa11b8375d1dc6"><span class="gp-glyph">&#9742;</span><span class="gp-cap">Alerts</span></div>
      <div class="gp-tool gp-target" data-k="settings" id="gp-target-ca72848780901f6e41aa11b8375d1dc6"><span class="gp-glyph">&#9881;</span><span class="gp-cap">Settings</span></div>
    </div>
    <div class="gp-body">application content</div>

    <div class="gp-gtbox" id="gp-gtbox-ca72848780901f6e41aa11b8375d1dc6">
      <span class="gp-gtbox-tag">ground&#8209;truth target</span>
    </div>
    <div class="gp-pred" id="gp-pred-ca72848780901f6e41aa11b8375d1dc6"></div>
  </div>

  <div class="gp-readout">
    <div class="gp-stat">
      <div class="gp-stat-label">Predicted (x, y)</div>
      <div class="gp-stat-value pin" id="gp-coord-ca72848780901f6e41aa11b8375d1dc6">&mdash;</div>
    </div>
    <div class="gp-stat">
      <div class="gp-stat-label">Inside the box?</div>
      <div class="gp-stat-value" id="gp-inbox-ca72848780901f6e41aa11b8375d1dc6">&mdash;</div>
    </div>
    <div class="gp-stat">
      <div class="gp-stat-label">Reward</div>
      <div class="gp-stat-value" id="gp-reward-ca72848780901f6e41aa11b8375d1dc6">&mdash;</div>
    </div>
  </div>

  <div class="gp-controls">
    <button class="gp-btn gp-sel" data-stage="zero" id="gp-b-zero-ca72848780901f6e41aa11b8375d1dc6">Zero-shot</button>
    <button class="gp-btn" data-stage="sft" id="gp-b-sft-ca72848780901f6e41aa11b8375d1dc6">After SFT</button>
    <button class="gp-btn" data-stage="grpo" id="gp-b-grpo-ca72848780901f6e41aa11b8375d1dc6">After GRPO</button>
  </div>

  <div class="gp-note" id="gp-note-ca72848780901f6e41aa11b8375d1dc6"></div>

  <script>
  (function() {
    var uid = 'ca72848780901f6e41aa11b8375d1dc6';
    var screen   = document.getElementById('gp-screen-' + uid);
    var target   = document.getElementById('gp-target-' + uid);
    var neighbor = document.getElementById('gp-neighbor-' + uid);
    var gtbox    = document.getElementById('gp-gtbox-' + uid);
    var pred     = document.getElementById('gp-pred-' + uid);
    var coordEl  = document.getElementById('gp-coord-' + uid);
    var inboxEl  = document.getElementById('gp-inbox-' + uid);
    var rewardEl = document.getElementById('gp-reward-' + uid);
    var noteEl   = document.getElementById('gp-note-' + uid);
    var btns = {
      zero: document.getElementById('gp-b-zero-' + uid),
      sft:  document.getElementById('gp-b-sft-' + uid),
      grpo: document.getElementById('gp-b-grpo-' + uid)
    };

    var notes = {
      zero: 'The base model knows it&rsquo;s a toolbar icon but lands one button over, <strong>just outside the box</strong>. A confident miss; reward 0.',
      sft:  'Supervised fine-tuning imitates the labeled answer, the box&rsquo;s center, and the click lands right on it: <strong>inside the box</strong>, reward 1.',
      grpo: 'GRPO&rsquo;s reward is point-in-box: full marks for landing <em>anywhere</em> inside, with no bonus for the center. This off-center click still scores <strong>1</strong>. On an easy target like this, SFT already succeeds; GRPO&rsquo;s edge is reliability on hard, tiny targets, not tighter centering.'
    };

    var stage = 'zero';

    function rect(el) {
      var s = screen.getBoundingClientRect();
      var r = el.getBoundingClientRect();
      return { x: r.left - s.left, y: r.top - s.top, w: r.width, h: r.height, sw: s.width, sh: s.height };
    }

    
    function predictedPoint() {
      var t = rect(target);
      var n = rect(neighbor);
      if (stage === 'zero') {
        
        return { x: n.x + n.w * 0.5, y: n.y + n.h * 0.5 };
      } else if (stage === 'sft') {
        
        return { x: t.x + t.w * 0.50, y: t.y + t.h * 0.48 };
      }
      
      return { x: t.x + t.w * 0.72, y: t.y + t.h * 0.66 };
    }

    function render() {
      var t = rect(target);
      
      gtbox.style.left   = t.x + 'px';
      gtbox.style.top    = t.y + 'px';
      gtbox.style.width  = t.w + 'px';
      gtbox.style.height = t.h + 'px';
      gtbox.classList.add('gp-on');

      var p = predictedPoint();
      pred.style.left = p.x + 'px';
      pred.style.top  = p.y + 'px';

      var inside = (p.x >= t.x && p.x <= t.x + t.w && p.y >= t.y && p.y <= t.y + t.h);
      pred.classList.toggle('hit', inside);
      pred.classList.toggle('miss', !inside);

      coordEl.textContent = '(' + Math.round(p.x) + ', ' + Math.round(p.y) + ')';
      inboxEl.textContent = inside ? 'yes' : 'no';
      inboxEl.className = 'gp-stat-value ' + (inside ? 'good' : 'bad');
      rewardEl.textContent = inside ? '1.0' : '0.0';
      rewardEl.className = 'gp-stat-value ' + (inside ? 'good' : 'bad');
      noteEl.innerHTML = notes[stage];
    }

    function select(s) {
      stage = s;
      for (var k in btns) btns[k].classList.toggle('gp-sel', k === s);
      render();
    }

    btns.zero.addEventListener('click', function() { select('zero'); });
    btns.sft.addEventListener('click',  function() { select('sft'); });
    btns.grpo.addEventListener('click', function() { select('grpo'); });

    
    requestAnimationFrame(render);
    window.addEventListener('resize', render);
    if (document.fonts && document.fonts.ready) { document.fonts.ready.then(render); }
  })();
  </script>
</div>

<p>The reward is as blunt as it looks in that widget. The model proposes $(x, y)$; we check whether that point falls inside the ground-truth box; it scores 1 if it does and 0 if it doesn&rsquo;t. No partial credit, no geometry beyond a containment test. Hold onto that simplicity. It turns out to be the quiet hero of the reinforcement-learning stage later.</p>
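<p>As a minimal sketch of that containment test: the box format <code>(x_min, y_min, x_max, y_max)</code>, the function name, and the example coordinates below are illustrative assumptions, not the project&rsquo;s actual code. Edges count as inside, matching the widget above.</p>

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), in pixels.
Box = Tuple[float, float, float, float]


def point_in_box_reward(x: float, y: float, box: Box) -> float:
    """Return 1.0 if (x, y) lies inside the box (edges included), else 0.0."""
    x_min, y_min, x_max, y_max = box
    inside = x_min <= x <= x_max and y_min <= y <= y_max
    return 1.0 if inside else 0.0


# Hypothetical target: a settings icon occupying a 40x40 pixel region.
settings_box = (600.0, 20.0, 640.0, 60.0)

center_click = point_in_box_reward(620.0, 40.0, settings_box)      # dead center
off_center_click = point_in_box_reward(632.0, 52.0, settings_box)  # inside, off-center
neighbor_click = point_in_box_reward(570.0, 40.0, settings_box)    # one button over

print(center_click, off_center_click, neighbor_click)
```

<p>The centered and off-center clicks earn the same reward; only the click on the neighboring button scores zero.</p>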
<p><img alt="The trained model placing a click on a real desktop application." loading="lazy" src="/images/posts/teaching-a-3b-vlm-to-click/alignment-sample.png"></p>
<h2 id="why-grounding-is-hard-and-the-architecture-choice-it-forces">Why Grounding Is Hard, and the Architecture Choice It Forces</h2>
<p>The naive way to make a vision-language model output coordinates is to treat the numbers as text. Show it a screenshot, ask for the location, and let it generate the string <code>x=523, y=217</code> token by token, the same way it would generate any other text.</p>
<p>This works astonishingly badly, and the reason is worth sitting with. The model&rsquo;s vision encoder turns the screenshot into a grid of visual tokens, each carrying a positional embedding, a sense of <em>where on the image</em> it lives. But the language head emits ordinary number tokens. Nothing in the architecture connects &ldquo;this patch is two-thirds of the way across the screen&rdquo; to &ldquo;the digits five-two-three.&rdquo; The model has to learn that mapping implicitly, from examples, with no built-in bridge between visual position and textual coordinate. Train it on 1080p screenshots and it tends to fail on a 4K monitor, because the implicit mapping it memorized doesn&rsquo;t transfer to coordinate ranges it never saw. Grounding <em>looks</em> like a perception problem; a lot of it is actually a coordinate-representation problem.</p>
<p>The field has two answers. The first, used by models like Microsoft&rsquo;s Florence-2, is to <strong>normalize</strong> coordinates into an abstract 0–1000 grid, so the model never sees absolute pixels and learns relative position instead. The second, used by <strong>Qwen2.5-VL</strong>, is to lean into absolute scale: its multimodal rotary position embeddings (MRoPE) and native dynamic-resolution processing let it represent boxes and points in the <em>real pixel coordinates of the original image</em>, at whatever size that image happens to be. This gives a much tighter coupling between what the vision encoder sees and what the language head says, which is precisely the alignment grounding lives or dies on.</p>
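<p>To make the difference concrete, here is a small sketch of the normalized representation. It is a generic illustration, not Florence-2&rsquo;s or Qwen2.5-VL&rsquo;s actual tokenization code; the helper names and the example points are assumptions. The same relative spot on a 1080p and a 4K screen has very different absolute pixel values but an identical normalized coordinate, at the price of a small rounding error when mapping back.</p>

```python
def to_normalized(x_px, y_px, width, height, bins=1000):
    """Map an absolute pixel point onto a resolution-independent 0..bins grid."""
    return round(x_px / width * bins), round(y_px / height * bins)


def to_pixels(x_norm, y_norm, width, height, bins=1000):
    """Map a normalized point back to absolute pixels for a given screen size."""
    return round(x_norm / bins * width), round(y_norm / bins * height)


# Hypothetical example: the same relative spot (two-thirds across, one-fifth down)
# expressed in absolute pixels on two screen sizes.
point_1080p = (1280, 216)  # on a 1920x1080 screenshot
point_4k = (2560, 432)     # on a 3840x2160 screenshot

norm_1080p = to_normalized(*point_1080p, 1920, 1080)
norm_4k = to_normalized(*point_4k, 3840, 2160)

# Mapping back to 4K pixels shows the quantization cost of the coarse grid.
round_trip_4k = to_pixels(*norm_4k, 3840, 2160)

print(norm_1080p, norm_4k, round_trip_4k)
```

<p>A model emitting absolute pixels has to produce two different strings for those two screens; a model emitting normalized coordinates produces one, but cannot resolve anything finer than a thousandth of the screen.</p>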
<p>That property is the main reason Qwen2.5-VL-3B is the base model here and the de facto backbone of the GUI-RL literature. It is small enough to fine-tune on one consumer-grade card, it is permissively licensed, and, crucially, it already grounds reasonably well out of the box, so there is real signal for training to amplify rather than a blank slate to fill.</p>
<p>There&rsquo;s a cost to that native-resolution superpower, though, and it shapes every practical decision downstream. A vision transformer cuts an image into patches and turns each into a token; the bigger the image, the more tokens, and full self-attention compute grows quadratically with the token count. For Qwen2.5-VL the rule of thumb is roughly</p>
$$\text{visual tokens} \approx \frac{H \times W}{28 \times 28},$$<p>so a modest screenshot is already hundreds of tokens and a large one is well over a thousand. Slide the resolution and watch what happens to the bill:</p>

<div class="cua-vt" id="cua-vt-ca72848780901f6e41aa11b8375d1dc6">
  <style>
    .cua-vt {
      --vt-bg: #0d1117;
      --vt-surface: #161b22;
      --vt-border: #30363d;
      --vt-text: #e6edf3;
      --vt-muted: #8b949e;
      --vt-accent: #58a6ff;
      --vt-accent-dim: rgba(88, 166, 255, 0.15);
      --vt-grid: rgba(88, 166, 255, 0.5);
      --vt-green: #39d353;
      --vt-orange: #d29922;
      --vt-red: #f97583;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--vt-bg); color: var(--vt-text); line-height: 1.6;
      padding: 1.5rem; border-radius: 12px; margin: 2rem 0;
    }
    [data-theme="light"] .cua-vt,
    :root:not([data-theme="dark"]) .cua-vt {
      --vt-bg: #f8fafc;
      --vt-surface: #ffffff;
      --vt-border: #e2e8f0;
      --vt-text: #1e293b;
      --vt-muted: #64748b;
      --vt-accent: #3b82f6;
      --vt-accent-dim: rgba(59, 130, 246, 0.1);
      --vt-grid: rgba(59, 130, 246, 0.4);
      --vt-green: #10b981;
      --vt-orange: #d97706;
      --vt-red: #ef4444;
    }
    .cua-vt * { box-sizing: border-box; }

    .cua-vt .vt-header { text-align: center; margin-bottom: 1.25rem; }
    .cua-vt .vt-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem; font-weight: 600; color: var(--vt-accent);
      letter-spacing: 0.08em; text-transform: uppercase; margin: 0 0 0.4rem 0;
    }
    .cua-vt .vt-header p { color: var(--vt-muted); font-size: 0.9rem; margin: 0; }

    .cua-vt .vt-main {
      display: grid; grid-template-columns: 220px 1fr; gap: 1.5rem; align-items: center;
    }
    @media (max-width: 600px) { .cua-vt .vt-main { grid-template-columns: 1fr; } }

    .cua-vt .vt-canvas-wrap { display: flex; justify-content: center; }
    .cua-vt .vt-grid {
      width: 200px; height: 200px;
      display: grid; gap: 0;
      background: var(--vt-surface);
      border: 1px solid var(--vt-border); border-radius: 8px; overflow: hidden;
    }
    .cua-vt .vt-cell { border-right: 1px solid var(--vt-grid); border-bottom: 1px solid var(--vt-grid); }

    .cua-vt .vt-readout { display: flex; flex-direction: column; gap: 0.9rem; }

    .cua-vt .vt-row { display: flex; align-items: baseline; justify-content: space-between; gap: 0.5rem; }
    .cua-vt .vt-row-label { font-size: 0.8rem; color: var(--vt-muted); }
    .cua-vt .vt-row-value { font-family: 'IBM Plex Mono', monospace; font-weight: 700; font-size: 0.95rem; }

    .cua-vt .vt-token-num {
      font-family: 'IBM Plex Mono', monospace; font-weight: 700; font-size: 2rem;
      color: var(--vt-accent); line-height: 1;
    }
    .cua-vt .vt-token-unit { font-size: 0.8rem; color: var(--vt-muted); font-weight: 400; }

    .cua-vt .vt-bar-track {
      position: relative; height: 12px; border-radius: 6px;
      background: var(--vt-surface); border: 1px solid var(--vt-border); overflow: hidden;
    }
    .cua-vt .vt-bar-fill {
      position: absolute; left: 0; top: 0; bottom: 0; border-radius: 6px;
      background: var(--vt-accent); transition: width 0.15s ease, background 0.15s ease;
    }
    .cua-vt .vt-bar-caps {
      display: flex; justify-content: space-between;
      font-family: 'IBM Plex Mono', monospace; font-size: 0.6rem; color: var(--vt-muted);
      margin-top: 0.3rem;
    }

    .cua-vt .vt-slider-group { margin-top: 0.4rem; }
    .cua-vt .vt-slider {
      -webkit-appearance: none; appearance: none; width: 100%; height: 5px;
      border-radius: 3px; background: var(--vt-border); outline: none;
    }
    .cua-vt .vt-slider::-webkit-slider-thumb {
      -webkit-appearance: none; appearance: none; width: 18px; height: 18px;
      border-radius: 50%; background: var(--vt-accent); cursor: pointer;
      border: 2px solid var(--vt-bg);
    }
    .cua-vt .vt-slider::-moz-range-thumb {
      width: 18px; height: 18px; border-radius: 50%; background: var(--vt-accent);
      cursor: pointer; border: 2px solid var(--vt-bg);
    }

    .cua-vt .vt-presets { display: flex; gap: 0.4rem; margin-top: 0.75rem; flex-wrap: wrap; }
    .cua-vt .vt-preset {
      font-family: 'IBM Plex Mono', monospace; font-size: 0.65rem; font-weight: 600;
      padding: 0.3rem 0.6rem; border-radius: 4px; border: 1px solid var(--vt-border);
      background: var(--vt-surface); color: var(--vt-muted); cursor: pointer; transition: all 0.2s ease;
    }
    .cua-vt .vt-preset:hover { border-color: var(--vt-accent); color: var(--vt-accent); }

    .cua-vt .vt-formula {
      margin-top: 1.1rem; padding: 0.7rem 1rem; border-radius: 8px;
      background: var(--vt-accent-dim); border: 1px solid var(--vt-accent);
      font-family: 'IBM Plex Mono', monospace; font-size: 0.82rem; text-align: center;
    }
    .cua-vt .vt-status { margin-top: 0.6rem; font-size: 0.8rem; min-height: 1.2rem; }
    .cua-vt .vt-status .vt-flag { font-weight: 700; }
    .cua-vt .vt-status .ok { color: var(--vt-green); }
    .cua-vt .vt-status .down { color: var(--vt-orange); }
    .cua-vt .vt-status .up { color: var(--vt-accent); }
  </style>

  <div class="vt-header">
    <h3>One Screenshot, How Many Tokens?</h3>
    <p>A vision transformer cuts the image into patches. More pixels, more tokens, more memory.</p>
  </div>

  <div class="vt-main">
    <div class="vt-canvas-wrap">
      <div class="vt-grid" id="vt-grid-ca72848780901f6e41aa11b8375d1dc6"></div>
    </div>

    <div class="vt-readout">
      <div>
        <span class="vt-token-num" id="vt-tokens-ca72848780901f6e41aa11b8375d1dc6">256</span>
        <span class="vt-token-unit">visual tokens fed to the language model</span>
      </div>

      <div>
        <div class="vt-bar-track">
          <div class="vt-bar-fill" id="vt-bar-ca72848780901f6e41aa11b8375d1dc6"></div>
        </div>
        <div class="vt-bar-caps">
          <span>min_pixels &middot; 256</span>
          <span>max_pixels &middot; 1,337</span>
        </div>
      </div>

      <div class="vt-row">
        <span class="vt-row-label">Screenshot resolution</span>
        <span class="vt-row-value" id="vt-res-ca72848780901f6e41aa11b8375d1dc6">448 &times; 448</span>
      </div>
      <div class="vt-row">
        <span class="vt-row-label">Raw patches (H&middot;W / 784)</span>
        <span class="vt-row-value" id="vt-raw-ca72848780901f6e41aa11b8375d1dc6">256</span>
      </div>

      <div class="vt-slider-group">
        <input type="range" class="vt-slider" id="vt-slider-ca72848780901f6e41aa11b8375d1dc6" min="392" max="1536" step="28" value="448">
        <div class="vt-presets">
          <button class="vt-preset" data-px="448">phone (448²)</button>
          <button class="vt-preset" data-px="784">tablet (784²)</button>
          <button class="vt-preset" data-px="1024">desktop (1024²)</button>
          <button class="vt-preset" data-px="1456">4K-ish (1456²)</button>
        </div>
      </div>
    </div>
  </div>

  <div class="vt-formula">
    visual&nbsp;tokens &nbsp;&asymp;&nbsp; (H &times; W) / (28 &times; 28)
  </div>
  <div class="vt-status" id="vt-status-ca72848780901f6e41aa11b8375d1dc6"></div>

  <script>
  (function() {
    var uid = 'ca72848780901f6e41aa11b8375d1dc6';
    var grid    = document.getElementById('vt-grid-' + uid);
    var tokens  = document.getElementById('vt-tokens-' + uid);
    var bar     = document.getElementById('vt-bar-' + uid);
    var resEl   = document.getElementById('vt-res-' + uid);
    var rawEl   = document.getElementById('vt-raw-' + uid);
    var slider  = document.getElementById('vt-slider-' + uid);
    var status  = document.getElementById('vt-status-' + uid);
    var presets = document.querySelectorAll('#cua-vt-' + uid + ' .vt-preset');

    var MIN_TOK = 256;   
    var MAX_TOK = 1337;  
    var PATCH = 28;

    function update(L) {
      L = parseInt(L, 10);
      var raw = Math.round((L * L) / (PATCH * PATCH));
      var eff = Math.max(MIN_TOK, Math.min(MAX_TOK, raw));

      resEl.textContent = L + ' × ' + L;
      rawEl.textContent = raw.toLocaleString();
      tokens.textContent = eff.toLocaleString();

      var pct = Math.max(2, Math.min(100, (eff / MAX_TOK) * 100));
      bar.style.width = pct + '%';
      bar.style.background = eff >= MAX_TOK ? 'var(--vt-red)'
        : eff > MAX_TOK * 0.66 ? 'var(--vt-orange)' : 'var(--vt-accent)';

      
      var N = Math.max(4, Math.min(26, Math.round(L / PATCH)));
      grid.style.gridTemplateColumns = 'repeat(' + N + ', 1fr)';
      grid.style.gridTemplateRows = 'repeat(' + N + ', 1fr)';
      var want = N * N;
      while (grid.children.length < want) {
        var c = document.createElement('div'); c.className = 'vt-cell'; grid.appendChild(c);
      }
      while (grid.children.length > want) { grid.removeChild(grid.lastChild); }

      if (raw > MAX_TOK) {
        status.innerHTML = '<span class="vt-flag down">downscaled</span> &mdash; above max_pixels, so Qwen shrinks it back to a ' + MAX_TOK.toLocaleString() + '-token budget. Past this point, detail is lost, not gained.';
      } else if (raw < MIN_TOK) {
        status.innerHTML = '<span class="vt-flag up">upscaled</span> &mdash; below min_pixels, so Qwen enlarges it to the ' + MIN_TOK + '-token floor.';
      } else {
        status.innerHTML = '<span class="vt-flag ok">within budget</span> &mdash; every extra patch is more attention compute and more VRAM. This is why a pixel budget is the first OOM fix.';
      }
    }

    slider.addEventListener('input', function() { update(this.value); });
    presets.forEach(function(b) {
      b.addEventListener('click', function() {
        var px = this.getAttribute('data-px');
        slider.value = px; update(px);
      });
    });

    update(slider.value);
  })();
  </script>
</div>

<p>This is why every serious training run pins a <code>min_pixels</code>/<code>max_pixels</code> budget and why high-resolution screenshots are the number-one cause of out-of-memory crashes. It is also why memory-efficient attention matters here as much as it does for long-context text: reading a screen <em>is</em> a long-context problem, just in two dimensions. (On the actual run, the attention backend was PyTorch&rsquo;s SDPA rather than a hand-rolled FlashAttention kernel, but the principle from the <a href="/posts/flash-attention/">Flash Attention post</a> is exactly what keeps thousands of image-patch tokens tractable.)</p>
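<p>The token arithmetic behind the widget can be sketched in a few lines. This is a minimal illustration, not the Qwen processor&rsquo;s actual resize logic: the 28-pixel patch side and the 256/1337 token bounds are the widget&rsquo;s own values, and the helper name <code>visual_tokens</code> and the extension from square to rectangular images are assumptions made for the example.</p>

```python
PATCH = 28         # pixels per visual-token side (value used by the widget)
MIN_TOKENS = 256   # token floor implied by min_pixels (widget value)
MAX_TOKENS = 1337  # token ceiling implied by max_pixels (widget value)


def visual_tokens(width: int, height: int) -> tuple[int, int]:
    """Return (raw, effective) visual-token counts for a screenshot.

    raw       -- tokens the image would need at its native resolution
    effective -- tokens after clamping to the [MIN_TOKENS, MAX_TOKENS] budget
    """
    raw = round((width * height) / (PATCH * PATCH))
    effective = max(MIN_TOKENS, min(MAX_TOKENS, raw))
    return raw, effective


if __name__ == "__main__":
    for w, h in [(336, 336), (1024, 1024), (1920, 1080)]:
        raw, eff = visual_tokens(w, h)
        print(f"{w}x{h}: raw={raw}, effective={eff}")
```

<p>Under these assumptions a 1024&times;1024 image lands right at the ceiling, and a full-HD screenshot asks for roughly twice the budget before it is scaled back down.</p>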
<h2 id="why-a-hobbyist-can-do-this-for-15">Why a Hobbyist Can Do This for $15</h2>
<p>A 3-billion-parameter model in 16-bit precision is about 6 GB just to <em>store</em>, and full fine-tuning needs several multiples of that for gradients, optimizer state, and activations. Add a thousand-token screenshot&rsquo;s worth of activations on top and a 24 GB card is hopeless. The reason this project fits at all is a stack of compression tricks, each of which this blog has covered before and which here finally pull their weight together.</p>
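<p>A back-of-the-envelope version of that memory math, as a rough sketch. The byte counts are assumptions rather than measurements from this run: 2 bytes per weight and per gradient (bf16), and 8 bytes per parameter for an Adam-style optimizer holding two fp32 moments. Activations are left out entirely, so the real figure is higher.</p>

```python
PARAMS = 3e9  # assumed round parameter count for a "3B" model
GB = 1e9      # decimal gigabytes, to match the ~6 GB figure in the text


def full_finetune_memory_gb(params, weight_bytes=2, grad_bytes=2, optim_bytes=8):
    """Estimate static memory for full fine-tuning, excluding activations.

    All byte sizes are illustrative assumptions:
      weight_bytes -- bf16 weights
      grad_bytes   -- bf16 gradients
      optim_bytes  -- two fp32 Adam moments per parameter
    """
    weights = params * weight_bytes / GB
    grads = params * grad_bytes / GB
    optim = params * optim_bytes / GB
    return {
        "weights": weights,
        "gradients": grads,
        "optimizer": optim,
        "total": weights + grads + optim,
    }


if __name__ == "__main__":
    for name, gb in full_finetune_memory_gb(PARAMS).items():
        print(f"{name:>10}: {gb:5.1f} GB")
```

<p>Even before a single activation is stored, this estimate is well past 24 GB, which is the gap the tricks below have to close.</p>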
<p><strong>LoRA</strong> freezes the pre-trained weights entirely and learns a small, low-rank update alongside them, typically a fraction of a percent of the parameters. You are no longer training the model; you are training a thin adapter that nudges it. Memory for optimizer state and gradients collapses accordingly, and as a bonus the base weights are never overwritten, which limits how much the model can forget of what it already knew.</p>
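<p>To make the low-rank update concrete, here is a small pure-Python sketch. The layer size (2048&times;2048), the rank of 8, and the <code>alpha</code> scaling value are hypothetical choices for illustration, not the settings used in this project, and the helper names are invented for the example.</p>

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (frozen, trainable) parameter counts for one adapted matrix."""
    frozen = d_in * d_out            # the original weight W, never updated
    trainable = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return frozen, trainable


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x).

    W stays frozen; in training only A and B would receive gradients.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    frozen, trainable = lora_param_counts(2048, 2048, rank=8)
    print(f"trainable share of this matrix: {100 * trainable / frozen:.2f}%")

    # With B initialised to zeros, the adapter starts as an exact no-op.
    W = [[1, 2], [3, 4]]
    A = [[1, 1]]      # rank 1, d_in 2
    B = [[0], [0]]    # d_out 2, rank 1
    print(lora_forward(W, A, B, [1, 1], alpha=16, rank=1))
```

<p>Starting <code>B</code> at zero is the usual LoRA convention: the adapted model begins identical to the base model and drifts only as far as training pushes it.</p>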
<p><strong>QLoRA</strong> takes it further: it stores the frozen base in 4-bit precision (the information-theoretically-tuned NF4 format) and trains the bf16 adapter on top. The base weights are quantized for <em>storage</em> and dequantized on the fly for <em>computation</em>. That storage-vs-compute split is the whole trick, and it&rsquo;s the same machinery covered in <a href="/posts/quantization-and-gptq/">the quantization deep-dive</a>. A 3B model that wouldn&rsquo;t fit for full fine-tuning now loads in a handful of gigabytes with room to spare for rollouts.</p>
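<p>The storage-versus-compute split can be shown with a toy quantizer. This sketch is deliberately simplified: it uses 16 evenly spaced levels with one absmax scale per block, whereas NF4 uses non-uniform levels tuned for normally distributed weights. The function names and the sample weights are made up for illustration.</p>

```python
def quantize_block(weights, levels=16):
    """Store a block of floats as 4-bit codes (0..15) plus one float scale."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    half = (levels - 1) / 2                        # 7.5 for 16 levels
    codes = [round((w / absmax) * half + half) for w in weights]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Rebuild approximate floats on the fly, just before they are used."""
    half = (levels - 1) / 2
    return [((c - half) / half) * absmax for c in codes]


if __name__ == "__main__":
    block = [0.12, -0.5, 0.33, 0.0]
    codes, scale = quantize_block(block)       # what sits in memory
    restored = dequantize_block(codes, scale)  # what the matmul sees
    print("codes:   ", codes)
    print("restored:", [round(w, 3) for w in restored])
```

<p>Only the small integer codes and the per-block scale are kept in memory; the float values exist briefly at compute time, which is where the savings come from.</p>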
<p>The rest is good housekeeping that keeps the run on one card: gradient checkpointing (recompute activations in the backward pass instead of storing them), an 8-bit optimizer (quantize the optimizer state too), gradient accumulation (simulate a big batch with many small ones), and a capped pixel budget (fewer visual tokens, less memory).</p>
<p>On the hardware side, 24 GB is the magic number: the largest you can rent cheaply and still call commodity. The choice comes down to two generations of the same class of card:</p>
<table>
  <thead>
      <tr>
          <th>GPU</th>
          <th>Architecture</th>
          <th>VRAM</th>
          <th>~On-demand</th>
          <th>~Spot</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>A10G (<code>g5.xlarge</code>)</td>
          <td>Ampere</td>
          <td>24 GB</td>
          <td>~$1.00/hr</td>
          <td>~$0.40/hr</td>
      </tr>
      <tr>
          <td>L4 (<code>g6.xlarge</code>)</td>
          <td>Ada Lovelace</td>
          <td>24 GB</td>
          <td>~$0.80/hr</td>
          <td>~$0.30/hr</td>
      </tr>
  </tbody>
</table>
<p>The L4&rsquo;s newer Ada Lovelace architecture handles the BF16 math that quantized fine-tuning leans on, as the A10G does, and adds native FP8 support on top. Pleasantly, it&rsquo;s also the cheaper of the two. This project ran on a <code>g6.2xlarge</code> (one L4, a bit more host CPU and RAM for data prep) at roughly $1/hr on-demand, in <code>us-east-1</code>. Across baseline evaluation, the SFT run, the GRPO run, and the inevitable re-runs, the whole thing came to about <strong>$12–15</strong> of GPU time. Spot pricing or the smaller <code>xlarge</code> would have shaved it further. The point is not the exact figure; it&rsquo;s that the entry barrier to <em>actually doing this</em> is now a takeout dinner, not a research grant.</p>
<h2 id="the-recipe-sft-first-then-grpo-and-the-insight-that-makes-rl-worth-it">The Recipe: SFT First, Then GRPO, and the Insight That Makes RL Worth It</h2>
<p>With the base model chosen and the memory math solved, the training itself is two stages.</p>
<p><strong>Stage one is supervised fine-tuning.</strong> Take a few thousand examples of (screenshot, instruction, correct coordinate), and train the model, through its LoRA adapter, to reproduce the answer. The objective is plain next-token cross-entropy on the response only. SFT does two jobs extraordinarily well: it locks the <em>output format</em> so the model reliably emits a clean <code>{&quot;x&quot;: int, &quot;y&quot;: int}</code> and nothing else, and it installs a strong behavioral <em>prior</em>: the sense that &ldquo;close button&rdquo; means top-corner, that toolbar icons cluster along an edge, that a labeled control is where its label is.</p>
<p>It does both jobs fast. On this run the logged cross-entropy collapsed from 4.36 at step 10 to 0.57 by step 30 and then essentially flatlined near 0.41; the model had learned the task (format and prior together) almost immediately, and the remaining 120 steps were polish.</p>
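<p>For readers who want to see what &ldquo;cross-entropy on the response only&rdquo; looks like in data terms, here is a minimal sketch. The token IDs are invented, the helper names are hypothetical, and the box format <code>(x0, y0, x1, y1)</code> is an assumption; the one borrowed convention is <code>-100</code>, the label value PyTorch&rsquo;s cross-entropy loss ignores by default.</p>

```python
import json

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy by default


def center_target(bbox):
    """Turn a box (x0, y0, x1, y1) into the JSON answer for its center point."""
    x0, y0, x1, y1 = bbox
    return json.dumps({"x": round((x0 + x1) / 2), "y": round((y0 + y1) / 2)})


def mask_prompt(prompt_ids, response_ids):
    """Build (input_ids, labels) so that only response tokens carry loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


if __name__ == "__main__":
    print(center_target((100, 40, 140, 60)))

    # Made-up token IDs: 4 prompt tokens (image + instruction), 3 answer tokens.
    input_ids, labels = mask_prompt([11, 12, 13, 14], [901, 902, 903])
    print(input_ids)
    print(labels)
```

<p>The model still reads the whole sequence, but the screenshot and instruction tokens contribute nothing to the loss; only the coordinate answer is graded.</p>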

<div class="cua-sc" id="cua-sc-ca72848780901f6e41aa11b8375d1dc6">
  <style>
    .cua-sc {
      --sc-bg: #0d1117;
      --sc-surface: #161b22;
      --sc-border: #30363d;
      --sc-text: #e6edf3;
      --sc-muted: #8b949e;
      --sc-loss: #f97583;
      --sc-acc: #39d353;
      --sc-accent: #58a6ff;
      --sc-grid: rgba(139,148,158,0.16);
      --sc-mark: #d29922;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--sc-bg); color: var(--sc-text); line-height: 1.6;
      padding: 1.5rem; border-radius: 12px; margin: 2rem 0;
    }
    [data-theme="light"] .cua-sc,
    :root:not([data-theme="dark"]) .cua-sc {
      --sc-bg: #f8fafc;
      --sc-surface: #ffffff;
      --sc-border: #e2e8f0;
      --sc-text: #1e293b;
      --sc-muted: #64748b;
      --sc-loss: #ef4444;
      --sc-acc: #10b981;
      --sc-accent: #3b82f6;
      --sc-grid: rgba(100,116,139,0.16);
      --sc-mark: #d97706;
    }
    .cua-sc * { box-sizing: border-box; }

    .cua-sc .sc-header { text-align: center; margin-bottom: 1rem; }
    .cua-sc .sc-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem; font-weight: 600; color: var(--sc-accent);
      letter-spacing: 0.08em; text-transform: uppercase; margin: 0 0 0.4rem 0;
    }
    .cua-sc .sc-header p { color: var(--sc-muted); font-size: 0.9rem; margin: 0; }

    .cua-sc .sc-legend { display: flex; gap: 1.25rem; justify-content: center; margin-bottom: 0.5rem; }
    .cua-sc .sc-leg { display: flex; align-items: center; gap: 0.4rem; font-size: 0.75rem; color: var(--sc-muted); }
    .cua-sc .sc-leg .sw { width: 16px; height: 3px; border-radius: 2px; }

    .cua-sc .sc-svg-wrap { width: 100%; }
    .cua-sc svg { width: 100%; height: auto; display: block; }

    .cua-sc .sc-line { fill: none; stroke-width: 2.5; stroke-linejoin: round; stroke-linecap: round; }
    .cua-sc .sc-line.loss { stroke: var(--sc-loss); }
    .cua-sc .sc-line.acc { stroke: var(--sc-acc); }
    .cua-sc .sc-dot.loss { fill: var(--sc-loss); }
    .cua-sc .sc-dot.acc { fill: var(--sc-acc); }
    .cua-sc .sc-axis { stroke: var(--sc-border); stroke-width: 1; }
    .cua-sc .sc-grid { stroke: var(--sc-grid); stroke-width: 1; }
    .cua-sc .sc-tick { fill: var(--sc-muted); font-family: 'IBM Plex Mono', monospace; font-size: 10px; }
    .cua-sc .sc-axtitle { font-family:'IBM Plex Mono',monospace; font-size: 10px; font-weight: 700; letter-spacing: 0.04em; }
    .cua-sc .sc-mark-line { stroke: var(--sc-mark); stroke-width: 1.5; stroke-dasharray: 4 3; }
    .cua-sc .sc-mark-label { fill: var(--sc-mark); font-family:'IBM Plex Mono',monospace; font-size: 10px; font-weight: 700; }

    .cua-sc .sc-foot {
      margin-top: 1rem; padding: 0.8rem 1.05rem; border-radius: 8px;
      background: var(--sc-surface); border: 1px solid var(--sc-border);
      font-size: 0.82rem; line-height: 1.6;
    }
    .cua-sc .sc-foot b { color: var(--sc-mark); }
    .cua-sc .sc-controls { display:flex; justify-content:center; margin-top: 0.9rem; }
    .cua-sc .sc-btn {
      font-family:'IBM Plex Sans',sans-serif; font-size:0.78rem; font-weight:600;
      padding:0.45rem 1rem; border-radius:6px; border:1px solid var(--sc-border);
      background: var(--sc-surface); color: var(--sc-muted); cursor:pointer; transition: all 0.2s ease;
    }
    .cua-sc .sc-btn:hover { border-color: var(--sc-accent); color: var(--sc-accent); }
  </style>

  <div class="sc-header">
    <h3>SFT Learns the Task in ~30 Steps</h3>
    <p>Cross-entropy loss and token accuracy across the 150-step run (logged every 10 steps).</p>
  </div>

  <div class="sc-legend">
    <div class="sc-leg"><span class="sw" style="background:var(--sc-loss)"></span>training loss</div>
    <div class="sc-leg"><span class="sw" style="background:var(--sc-acc)"></span>token accuracy</div>
  </div>

  <div class="sc-svg-wrap">
    <svg viewBox="0 0 620 340" id="sc-svg-ca72848780901f6e41aa11b8375d1dc6" preserveAspectRatio="xMidYMid meet" role="img" aria-label="SFT loss and token accuracy over 150 training steps"></svg>
  </div>

  <div class="sc-controls"><button class="sc-btn" id="sc-replay-ca72848780901f6e41aa11b8375d1dc6">&#8635; Replay</button></div>

  <div class="sc-foot">
    Loss collapses from <b>4.36 to 0.57</b> in the first 30 steps, then flatlines near 0.41; token accuracy mirrors it, jumping to ~0.91 and holding. The format and the behavioral prior are learned almost immediately &mdash; the remaining 120 steps are polish.
  </div>

  <script>
  (function() {
    var uid = 'ca72848780901f6e41aa11b8375d1dc6';
    var svg = document.getElementById('sc-svg-' + uid);
    var replay = document.getElementById('sc-replay-' + uid);
    var NS = 'http://www.w3.org/2000/svg';

    var steps = [10,20,30,40,50,60,70,80,90,100,110,120,130,140,150];
    var loss  = [4.359,1.51,0.5738,0.4886,0.4729,0.4214,0.4122,0.4048,0.4148,0.4147,0.3939,0.4099,0.4141,0.3924,0.4145];
    var acc   = [0.4573,0.7415,0.9059,0.9106,0.9095,0.9158,0.9182,0.9195,0.9156,0.9163,0.9207,0.9193,0.9183,0.9218,0.9146];

    
    var L = 48, R = 572, T = 26, B = 286;   
    var LOSS_MAX = 4.5;
    function xOf(s) { return L + (s / 150) * (R - L); }
    function yLoss(v) { return B - (v / LOSS_MAX) * (B - T); }
    function yAcc(v) { return B - (v / 1) * (B - T); }

    function mk(tag, attrs) {
      var e = document.createElementNS(NS, tag);
      for (var k in attrs) e.setAttribute(k, attrs[k]);
      return e;
    }

    function build() {
      svg.innerHTML = '';

      
      [0,1,2,3,4].forEach(function(v) {
        var y = yLoss(v);
        svg.appendChild(mk('line', { class:'sc-grid', x1:L, y1:y, x2:R, y2:y }));
        var t = mk('text', { class:'sc-tick', x:L-8, y:y+3, 'text-anchor':'end' }); t.textContent = v;
        svg.appendChild(t);
      });
      
      [0,0.25,0.5,0.75,1].forEach(function(v) {
        var y = yAcc(v);
        var t = mk('text', { class:'sc-tick', x:R+8, y:y+3, 'text-anchor':'start' }); t.textContent = Math.round(v*100) + '%';
        svg.appendChild(t);
      });
      
      [0,30,60,90,120,150].forEach(function(s) {
        var x = xOf(s);
        var t = mk('text', { class:'sc-tick', x:x, y:B+16, 'text-anchor':'middle' }); t.textContent = s;
        svg.appendChild(t);
      });
      
      svg.appendChild(mk('line', { class:'sc-axis', x1:L, y1:T, x2:L, y2:B }));
      svg.appendChild(mk('line', { class:'sc-axis', x1:R, y1:T, x2:R, y2:B }));
      svg.appendChild(mk('line', { class:'sc-axis', x1:L, y1:B, x2:R, y2:B }));
      
      var xt = mk('text', { class:'sc-tick', x:(L+R)/2, y:B+30, 'text-anchor':'middle' }); xt.textContent = 'training step'; svg.appendChild(xt);
      var lt = mk('text', { class:'sc-axtitle', x:L-30, y:T-8, 'text-anchor':'start', fill:'var(--sc-loss)' }); lt.textContent = 'loss'; svg.appendChild(lt);
      var rt = mk('text', { class:'sc-axtitle', x:R+2, y:T-8, 'text-anchor':'end', fill:'var(--sc-acc)' }); rt.textContent = 'acc'; svg.appendChild(rt);

      
      var mx = xOf(30);
      svg.appendChild(mk('line', { class:'sc-mark-line', x1:mx, y1:T, x2:mx, y2:B }));
      var ml = mk('text', { class:'sc-mark-label', x:mx+5, y:T+12, 'text-anchor':'start' }); ml.textContent = 'task learned'; svg.appendChild(ml);

      
      var lossPts = steps.map(function(s,i){ return xOf(s) + ',' + yLoss(loss[i]); }).join(' ');
      var accPts  = steps.map(function(s,i){ return xOf(s) + ',' + yAcc(acc[i]); }).join(' ');
      var lossLine = mk('polyline', { class:'sc-line loss', points: lossPts });
      var accLine  = mk('polyline', { class:'sc-line acc', points: accPts });
      svg.appendChild(lossLine);
      svg.appendChild(accLine);

      
      steps.forEach(function(s,i) {
        var d1 = mk('circle', { class:'sc-dot loss', cx:xOf(s), cy:yLoss(loss[i]), r:3 });
        d1.appendChild(mk('title', {})); d1.lastChild.textContent = 'step ' + s + ' · loss ' + loss[i].toFixed(3);
        svg.appendChild(d1);
        var d2 = mk('circle', { class:'sc-dot acc', cx:xOf(s), cy:yAcc(acc[i]), r:3 });
        d2.appendChild(mk('title', {})); d2.lastChild.textContent = 'step ' + s + ' · acc ' + (acc[i]*100).toFixed(1) + '%';
        svg.appendChild(d2);
      });

      animate(lossLine);
      animate(accLine);
    }

    function animate(line) {
      try {
        var len = line.getTotalLength();
        line.style.transition = 'none';
        line.style.strokeDasharray = len;
        line.style.strokeDashoffset = len;
        
        line.getBoundingClientRect();
        line.style.transition = 'stroke-dashoffset 1.5s ease';
        line.style.strokeDashoffset = '0';
      } catch (e) {   }
    }

    replay.addEventListener('click', build);
    build();
  })();
  </script>
</div>

<p>But the way we set SFT up carries a subtle flaw, and it is the hinge of this whole story. The training label is a <em>single</em> coordinate, the center of the target box. Cross-entropy punishes the model for predicting anything else, which means it is being trained to hit one exact pixel. Yet the actual task does not care about that pixel at all: <em>any</em> click inside the button works. SFT is optimizing a needlessly strict objective, nagging the model toward a precision the task never asked for, with no way to express &ldquo;anywhere in this region is fine.&rdquo; None of this is a law of supervised learning; it falls out of the label we chose, and sampling targets from across the box instead of its center would blunt it.</p>
<p>Reinforcement learning closes the same gap more directly. As the GTA1 work on GUI grounding put it, SFT &ldquo;rigidly trains the model to predict the exact center of the target element,&rdquo; whereas a reward-driven approach can &ldquo;reward any click that falls within the target element region.&rdquo; Swap the strict imitation loss for a reward that returns 1 for <em>any</em> point inside the box, and the model stops chasing one pixel and starts treating the whole target as correct.</p>
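<p>The reward described here is small enough to write out in full. This is a sketch under the assumption that boxes are given as <code>(x0, y0, x1, y1)</code> in the same pixel space as the click, with edges counted as inside; the function name and the sample coordinates are illustrative.</p>

```python
def point_in_box(x, y, bbox):
    """Return 1.0 if the click (x, y) lands inside bbox, else 0.0.

    bbox is assumed to be (x0, y0, x1, y1); edges count as inside.
    """
    x0, y0, x1, y1 = bbox
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0


if __name__ == "__main__":
    button = (100, 40, 140, 60)  # hypothetical target; its center is (120, 50)
    clicks = {
        "dead center": (120, 50),
        "inside, near a corner": (138, 42),
        "just outside": (150, 50),
    }
    for label, (x, y) in clicks.items():
        print(f"{label:>22}: reward = {point_in_box(x, y, button)}")
```

<p>The corner click and the center click earn the same reward, which is exactly the tolerance that a center-only label cannot express.</p>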

<div class="cua-sg" id="cua-sg-ca72848780901f6e41aa11b8375d1dc6">
  <style>
    .cua-sg {
      --sg-bg: #0d1117;
      --sg-surface: #161b22;
      --sg-screen: #0a0d12;
      --sg-border: #30363d;
      --sg-text: #e6edf3;
      --sg-muted: #8b949e;
      --sg-accent: #a371f7;
      --sg-accent-dim: rgba(163, 113, 247, 0.15);
      --sg-green: #39d353;
      --sg-green-dim: rgba(57, 211, 83, 0.16);
      --sg-orange: #d29922;
      --sg-red: #f97583;
      --sg-red-dim: rgba(249, 117, 131, 0.12);

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--sg-bg); color: var(--sg-text); line-height: 1.6;
      padding: 1.5rem; border-radius: 12px; margin: 2rem 0;
    }
    [data-theme="light"] .cua-sg,
    :root:not([data-theme="dark"]) .cua-sg {
      --sg-bg: #f8fafc;
      --sg-surface: #ffffff;
      --sg-screen: #eef2f7;
      --sg-border: #e2e8f0;
      --sg-text: #1e293b;
      --sg-muted: #64748b;
      --sg-accent: #8b5cf6;
      --sg-accent-dim: rgba(139, 92, 246, 0.12);
      --sg-green: #10b981;
      --sg-green-dim: rgba(16, 185, 129, 0.18);
      --sg-orange: #d97706;
      --sg-red: #ef4444;
      --sg-red-dim: rgba(239, 68, 68, 0.1);
    }
    .cua-sg * { box-sizing: border-box; }

    .cua-sg .sg-header { text-align: center; margin-bottom: 1.25rem; }
    .cua-sg .sg-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem; font-weight: 600; color: var(--sg-accent);
      letter-spacing: 0.08em; text-transform: uppercase; margin: 0 0 0.4rem 0;
    }
    .cua-sg .sg-header p { color: var(--sg-muted); font-size: 0.9rem; margin: 0; }

     
    .cua-sg .sg-modes { display: flex; gap: 0.5rem; justify-content: center; margin-bottom: 1rem; }
    .cua-sg .sg-mode-btn {
      font-family: 'IBM Plex Sans', sans-serif; font-size: 0.8rem; font-weight: 600;
      padding: 0.5rem 1.1rem; border-radius: 6px; border: 1px solid var(--sg-border);
      background: var(--sg-surface); color: var(--sg-muted); cursor: pointer; transition: all 0.2s ease;
    }
    .cua-sg .sg-mode-btn.sel { background: var(--sg-accent); border-color: var(--sg-accent); color: #fff; }

    .cua-sg .sg-screen {
      position: relative; width: 100%; height: 300px;
      background: var(--sg-screen); border: 1px solid var(--sg-border);
      border-radius: 10px; overflow: hidden; cursor: crosshair; touch-action: none;
    }
    @media (max-width: 600px) { .cua-sg .sg-screen { height: 240px; } }

     
    .cua-sg .sg-field {
      position: absolute; inset: 0; opacity: 0; transition: opacity 0.4s ease;
      background: radial-gradient(circle at 50% 50%,
        var(--sg-green) 0%,
        rgba(57,211,83,0.35) 4%,
        rgba(210,153,34,0.30) 11%,
        rgba(249,117,131,0.22) 22%,
        transparent 46%);
    }
    .cua-sg.sg-is-sft .sg-field { opacity: 1; }

     
    .cua-sg .sg-box {
      position: absolute; left: 50%; top: 50%; transform: translate(-50%, -50%);
      width: 30%; height: 34%; border-radius: 8px;
      border: 2px solid var(--sg-green);
      transition: background 0.4s ease;
      pointer-events: none;
    }
    .cua-sg.sg-is-grpo .sg-box { background: var(--sg-green-dim); }
    .cua-sg .sg-box-tag {
      position: absolute; top: -1.4rem; left: -2px;
      font-family: 'IBM Plex Mono', monospace; font-size: 0.6rem; font-weight: 700;
      color: var(--sg-green); text-transform: uppercase; letter-spacing: 0.04em; white-space: nowrap;
    }
     
    .cua-sg .sg-center {
      position: absolute; left: 50%; top: 50%; width: 10px; height: 10px;
      margin: -5px 0 0 -5px; pointer-events: none; opacity: 0; transition: opacity 0.4s ease;
    }
    .cua-sg.sg-is-sft .sg-center { opacity: 1; }
    .cua-sg .sg-center::before, .cua-sg .sg-center::after {
      content: ''; position: absolute; background: var(--sg-green);
    }
    .cua-sg .sg-center::before { left: 4px; top: -3px; width: 2px; height: 16px; }
    .cua-sg .sg-center::after { top: 4px; left: -3px; height: 2px; width: 16px; }

     
    .cua-sg .sg-click {
      position: absolute; width: 18px; height: 18px; margin: -9px 0 0 -9px;
      border-radius: 50%; z-index: 6; cursor: grab;
      background: var(--sg-accent); box-shadow: 0 0 0 5px var(--sg-accent-dim);
      transition: background 0.2s ease, box-shadow 0.2s ease;
    }
    .cua-sg .sg-click::after { content:''; position:absolute; inset:5px; border-radius:50%; background:#fff; }
    .cua-sg .sg-hint {
      position: absolute; bottom: 8px; left: 50%; transform: translateX(-50%);
      font-size: 0.7rem; color: var(--sg-muted); font-family: 'IBM Plex Mono', monospace;
      pointer-events: none; opacity: 0.8;
    }

     
    .cua-sg .sg-scores { display: grid; grid-template-columns: 1fr 1fr; gap: 0.9rem; margin-top: 1.1rem; }
    .cua-sg .sg-card {
      background: var(--sg-surface); border: 1px solid var(--sg-border);
      border-radius: 10px; padding: 0.9rem 1rem;
    }
    .cua-sg .sg-card.active { border-color: var(--sg-accent); }
    .cua-sg .sg-card-title {
      font-family: 'IBM Plex Mono', monospace; font-size: 0.68rem; font-weight: 700;
      text-transform: uppercase; letter-spacing: 0.05em; color: var(--sg-muted); margin-bottom: 0.5rem;
    }
    .cua-sg .sg-card-score { font-family: 'IBM Plex Mono', monospace; font-size: 1.8rem; font-weight: 700; line-height: 1; }
    .cua-sg .sg-card-bar { height: 6px; border-radius: 3px; background: var(--sg-border); margin-top: 0.6rem; overflow: hidden; }
    .cua-sg .sg-card-bar > span { display: block; height: 100%; border-radius: 3px; transition: width 0.15s ease, background 0.15s ease; }
    .cua-sg .sg-card-sub { font-size: 0.72rem; color: var(--sg-muted); margin-top: 0.5rem; }

    .cua-sg .sg-note {
      margin-top: 1.1rem; padding: 0.85rem 1.1rem; border-radius: 8px;
      background: var(--sg-accent-dim); border: 1px solid var(--sg-accent);
      font-size: 0.83rem; line-height: 1.65;
    }
    .cua-sg .sg-note strong { color: var(--sg-accent); }

    @media (max-width: 600px) { .cua-sg .sg-scores { grid-template-columns: 1fr; } }
  </style>

  <div class="sg-header">
    <h3>SFT Imitates One Pixel; RL Rewards the Whole Target</h3>
    <p>Drag the click. Watch how each objective scores the exact same spot.</p>
  </div>

  <div class="sg-modes">
    <button class="sg-mode-btn sel" data-m="sft" id="sg-m-sft-ca72848780901f6e41aa11b8375d1dc6">SFT objective</button>
    <button class="sg-mode-btn" data-m="grpo" id="sg-m-grpo-ca72848780901f6e41aa11b8375d1dc6">GRPO reward</button>
  </div>

  <div class="sg-screen sg-is-sft" id="sg-screen-ca72848780901f6e41aa11b8375d1dc6">
    <div class="sg-field"></div>
    <div class="sg-box" id="sg-box-ca72848780901f6e41aa11b8375d1dc6">
      <span class="sg-box-tag">target element</span>
    </div>
    <div class="sg-center"></div>
    <div class="sg-click" id="sg-click-ca72848780901f6e41aa11b8375d1dc6"></div>
    <div class="sg-hint">drag the click marker</div>
  </div>

  <div class="sg-scores">
    <div class="sg-card active" id="sg-card-sft-ca72848780901f6e41aa11b8375d1dc6">
      <div class="sg-card-title">SFT &middot; cross-entropy to center</div>
      <div class="sg-card-score" id="sg-score-sft-ca72848780901f6e41aa11b8375d1dc6">0.00</div>
      <div class="sg-card-bar"><span id="sg-bar-sft-ca72848780901f6e41aa11b8375d1dc6"></span></div>
      <div class="sg-card-sub">Peaks only at the one labeled pixel; punishes any drift.</div>
    </div>
    <div class="sg-card" id="sg-card-grpo-ca72848780901f6e41aa11b8375d1dc6">
      <div class="sg-card-title">GRPO &middot; point-in-box reward</div>
      <div class="sg-card-score" id="sg-score-grpo-ca72848780901f6e41aa11b8375d1dc6">0.0</div>
      <div class="sg-card-bar"><span id="sg-bar-grpo-ca72848780901f6e41aa11b8375d1dc6"></span></div>
      <div class="sg-card-sub">Flat plateau: 1.0 anywhere inside, 0.0 outside.</div>
    </div>
  </div>

  <div class="sg-note" id="sg-note-ca72848780901f6e41aa11b8375d1dc6"></div>

  <script>
  (function() {
    var uid = 'ca72848780901f6e41aa11b8375d1dc6';
    var screen = document.getElementById('sg-screen-' + uid);
    var box    = document.getElementById('sg-box-' + uid);
    var click  = document.getElementById('sg-click-' + uid);
    var scoreSft = document.getElementById('sg-score-sft-' + uid);
    var scoreGrpo= document.getElementById('sg-score-grpo-' + uid);
    var barSft  = document.getElementById('sg-bar-sft-' + uid);
    var barGrpo = document.getElementById('sg-bar-grpo-' + uid);
    var cardSft = document.getElementById('sg-card-sft-' + uid);
    var cardGrpo= document.getElementById('sg-card-grpo-' + uid);
    var note    = document.getElementById('sg-note-' + uid);
    var btnSft  = document.getElementById('sg-m-sft-' + uid);
    var btnGrpo = document.getElementById('sg-m-grpo-' + uid);

    var mode = 'sft';
    
    var px = 0.585, py = 0.40;
    var dragging = false;

    function compute() {
      var s = screen.getBoundingClientRect();
      var b = box.getBoundingClientRect();
      var mx = px * s.width, my = py * s.height;

      
      click.style.left = (px * 100) + '%';
      click.style.top  = (py * 100) + '%';

      
      var bx0 = b.left - s.left, by0 = b.top - s.top, bx1 = bx0 + b.width, by1 = by0 + b.height;
      var inside = (mx >= bx0 && mx <= bx1 && my >= by0 && my <= by1);

      
      var cx = s.width * 0.5, cy = s.height * 0.5;
      var d = Math.sqrt((mx - cx) * (mx - cx) + (my - cy) * (my - cy)) / s.width;
      var sigma = 0.06;
      var sft = Math.exp(-(d * d) / (2 * sigma * sigma));

      var grpo = inside ? 1.0 : 0.0;

      scoreSft.textContent = sft.toFixed(2);
      scoreGrpo.textContent = grpo.toFixed(1);
      barSft.style.width = (sft * 100) + '%';
      barSft.style.background = sft > 0.66 ? 'var(--sg-green)' : sft > 0.25 ? 'var(--sg-orange)' : 'var(--sg-red)';
      scoreSft.style.color = sft > 0.66 ? 'var(--sg-green)' : sft > 0.25 ? 'var(--sg-orange)' : 'var(--sg-red)';
      barGrpo.style.width = (grpo * 100) + '%';
      barGrpo.style.background = grpo > 0.5 ? 'var(--sg-green)' : 'var(--sg-red)';
      scoreGrpo.style.color = grpo > 0.5 ? 'var(--sg-green)' : 'var(--sg-red)';

      
      var good = (mode === 'sft') ? (sft > 0.5) : (grpo > 0.5);
      click.style.background = good ? 'var(--sg-green)' : 'var(--sg-red)';
      click.style.boxShadow = '0 0 0 5px ' + (good ? 'var(--sg-green-dim)' : 'var(--sg-red-dim)');

      if (inside && sft < 0.5) {
        note.innerHTML = 'This click is <strong>inside the target</strong> &mdash; the task says it&rsquo;s correct. GRPO gives it full reward. But SFT, trained to reproduce the exact center, scores it <strong>' + sft.toFixed(2) + '</strong> and pushes the model to &ldquo;fix&rdquo; a click that was already right. That gap is what RL closes.';
      } else if (inside) {
        note.innerHTML = 'Near the center, both objectives agree. The disagreement lives away from the center &mdash; <strong>drag toward the box edge</strong> and watch SFT&rsquo;s score collapse while GRPO stays at 1.0.';
      } else {
        note.innerHTML = 'Outside the box, both objectives say no. The interesting region is <strong>inside, off-center</strong>: a valid click that SFT still penalizes.';
      }
    }

    function setMode(m) {
      mode = m;
      screen.classList.toggle('sg-is-sft', m === 'sft');
      screen.classList.toggle('sg-is-grpo', m === 'grpo');
      btnSft.classList.toggle('sel', m === 'sft');
      btnGrpo.classList.toggle('sel', m === 'grpo');
      cardSft.classList.toggle('active', m === 'sft');
      cardGrpo.classList.toggle('active', m === 'grpo');
      compute();
    }

    function pointerToFrac(e) {
      var s = screen.getBoundingClientRect();
      var clientX = e.touches ? e.touches[0].clientX : e.clientX;
      var clientY = e.touches ? e.touches[0].clientY : e.clientY;
      px = Math.max(0.02, Math.min(0.98, (clientX - s.left) / s.width));
      py = Math.max(0.04, Math.min(0.96, (clientY - s.top) / s.height));
      compute();
    }

    screen.addEventListener('pointerdown', function(e) { dragging = true; pointerToFrac(e); });
    window.addEventListener('pointermove', function(e) { if (dragging) pointerToFrac(e); });
    window.addEventListener('pointerup', function() { dragging = false; });

    btnSft.addEventListener('click', function() { setMode('sft'); });
    btnGrpo.addEventListener('click', function() { setMode('grpo'); });

    requestAnimationFrame(compute);
    window.addEventListener('resize', compute);
  })();
  </script>
</div>

<p>That is the conceptual reason to add RL on top of SFT, and it predicts something specific that we&rsquo;ll see in the numbers: RL should matter most exactly where SFT&rsquo;s center-point rigidity costs the most, on small, awkward targets where &ldquo;near the center&rdquo; and &ldquo;inside the box&rdquo; are very different bets.</p>
<h2 id="grpo-and-the-verifiable-reward">GRPO and the Verifiable Reward</h2>
<p>The reinforcement-learning algorithm is <strong>GRPO</strong> (Group Relative Policy Optimization), which I covered in depth in <a href="/posts/rlhf-to-grpo/">From RLHF to GRPO</a>, so here&rsquo;s only the part that matters for grounding. For each prompt you sample a <em>group</em> of $G$ candidate answers, score each one, and use the group&rsquo;s own statistics as the baseline:</p>
$$\hat{A}_i = \frac{r_i - \text{mean}(r_1, \dots, r_G)}{\text{std}(r_1, \dots, r_G)}.$$<p>Answers above the group average get reinforced, answers below get suppressed, and the standard deviation normalizes the scale. The elegance is that there is no separate value network (the group is the baseline), which is precisely what makes GRPO cheap enough to run alongside a quantized model on one 24 GB card.</p>
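<p>The group-relative advantage is easy to compute by hand. The sketch below follows the formula above with two additions that are assumptions of this example, not part of the formula: a small <code>eps</code> in the denominator so that a group with identical rewards does not divide by zero, and the population standard deviation (implementations differ on this choice).</p>

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group's mean and spread.

    eps is an illustrative guard against zero spread; it is not in the formula.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # A group of G = 8 sampled clicks: three hits, five misses.
    mixed = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
    print([round(a, 2) for a in group_advantages(mixed)])

    # Every sample hit: no sample is better than the group, so no signal.
    all_hits = [1.0] * 8
    print(group_advantages(all_hits))
```

<p>The second case is worth noticing with a binary reward: when every sample in a group scores the same, all advantages are zero and that prompt contributes no gradient.</p>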
<p>The reward is where grounding gets to show off. There is no learned reward model, no human preference data, no judge that can be gamed. The reward is a deterministic geometry check, the same <code>point_in_box</code> test from the widget at the top of this post, and it has a property most reward functions only dream of: <strong>it is identical to the evaluation metric.</strong> When your training reward <em>is</em> the thing you ultimately measure, the usual gap between &ldquo;what we optimized&rdquo; and &ldquo;what we wanted&rdquo; disappears, and there is nothing to reward-hack. This is the paradigm called RLVR (Reinforcement Learning with Verifiable Rewards), the same idea behind DeepSeek-R1 and Tülu 3, where correctness is checked by a function rather than judged by a model. A small format reward rides along to keep the output parseable, weighted low so it can&rsquo;t dominate:</p>
$$R = 1.0 \cdot R_{\text{point-in-box}} + 0.2 \cdot R_{\text{format}}.$$<p>The hyperparameters came almost verbatim from the GTA1 recipe, which had already mapped this terrain: a group size of $G = 8$, a tiny learning rate of $10^{-6}$, no chain-of-thought (&ldquo;thinking&rdquo; doesn&rsquo;t help pure grounding and costs tokens), and no KL penalty to a reference policy ($\beta = 0$, which lets the coordinates move freely). One constraint deserves a callout because it shaped the whole run: GRPO&rsquo;s reward signal <em>vanishes</em> if the batch is too small. With only a handful of samples per update, the group statistics are too noisy to learn from and the model collapses. The fix is a large effective batch, reached here by accumulating gradients to <strong>128 completions per update</strong> (16 prompts at $G = 8$). On a single GPU you don&rsquo;t get that batch for free; you get it by accumulating patiently.</p>
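<p>The combined reward $R$ can be sketched as a single function. The weights 1.0 and 0.2 come from the formula; everything else is an assumption for illustration, including the strict rule that the completion must be exactly a JSON object with integer <code>x</code> and <code>y</code>, the <code>(x0, y0, x1, y1)</code> box format, and the helper names. The actual parsing used in the run may have been more lenient.</p>

```python
import json


def parse_click(text):
    """Return (x, y) if text is a JSON object with only integer x and y, else None."""
    try:
        obj = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(obj, dict) or set(obj) != {"x", "y"}:
        return None
    x, y = obj["x"], obj["y"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(x, bool) or isinstance(y, bool):
        return None
    if not (isinstance(x, int) and isinstance(y, int)):
        return None
    return x, y


def total_reward(completion, bbox, w_point=1.0, w_format=0.2):
    """R = w_point * point-in-box + w_format * format, per the formula above."""
    click = parse_click(completion)
    if click is None:
        return 0.0  # unparseable output earns neither term
    x, y = click
    x0, y0, x1, y1 = bbox
    hit = 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0
    return w_point * hit + w_format * 1.0


if __name__ == "__main__":
    box = (100, 40, 140, 60)  # hypothetical target element
    for out in ['{"x": 130, "y": 45}', '{"x": 300, "y": 45}', "click the button"]:
        print(f"{out!r:>24} -> {total_reward(out, box):.1f}")
```

<p>Keeping the format term small means a well-formed miss is still clearly worse than a hit, so the model cannot trade accuracy for tidy output.</p>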
<p>On the tooling: of the libraries that implement RL for vision-language models, Hugging Face <strong>TRL</strong>&rsquo;s <code>GRPOTrainer</code> is the one that realistically fits a single 24 GB card. The alternatives (EasyR1, VLM-R1, verl) are excellent but assume a multi-GPU cluster. That single fact quietly decided the stack.</p>
<h2 id="what-actually-happened">What Actually Happened</h2>
<p>Time for numbers. The benchmark is <strong>ScreenSpot-v2</strong>: 1,272 real screenshot-instruction pairs across iOS, Android, macOS, Windows, and the web, scored by point-in-box accuracy. Three checkpoints: the untouched model, after SFT, and after GRPO.</p>

<div class="cua-rs" id="cua-rs-ca72848780901f6e41aa11b8375d1dc6">
  <style>
    .cua-rs {
      --rs-bg: #0d1117;
      --rs-surface: #161b22;
      --rs-border: #30363d;
      --rs-text: #e6edf3;
      --rs-muted: #8b949e;
      --rs-zero: #d29922;
      --rs-sft: #58a6ff;
      --rs-grpo: #a371f7;
      --rs-green: #39d353;
      --rs-grid: rgba(139, 148, 158, 0.18);

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--rs-bg); color: var(--rs-text); line-height: 1.6;
      padding: 1.5rem; border-radius: 12px; margin: 2rem 0;
    }
    [data-theme="light"] .cua-rs,
    :root:not([data-theme="dark"]) .cua-rs {
      --rs-bg: #f8fafc;
      --rs-surface: #ffffff;
      --rs-border: #e2e8f0;
      --rs-text: #1e293b;
      --rs-muted: #64748b;
      --rs-zero: #d97706;
      --rs-sft: #3b82f6;
      --rs-grpo: #8b5cf6;
      --rs-green: #10b981;
      --rs-grid: rgba(100, 116, 139, 0.18);
    }
    .cua-rs * { box-sizing: border-box; }

    .cua-rs .rs-header { text-align: center; margin-bottom: 1.1rem; }
    .cua-rs .rs-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem; font-weight: 600; color: var(--rs-grpo);
      letter-spacing: 0.08em; text-transform: uppercase; margin: 0 0 0.4rem 0;
    }
    .cua-rs .rs-header p { color: var(--rs-muted); font-size: 0.9rem; margin: 0; }

    .cua-rs .rs-tabs { display: flex; gap: 0.4rem; justify-content: center; flex-wrap: wrap; margin-bottom: 1.1rem; }
    .cua-rs .rs-tab {
      font-family: 'IBM Plex Sans', sans-serif; font-size: 0.76rem; font-weight: 600;
      padding: 0.45rem 0.85rem; border-radius: 6px; border: 1px solid var(--rs-border);
      background: var(--rs-surface); color: var(--rs-muted); cursor: pointer; transition: all 0.2s ease;
    }
    .cua-rs .rs-tab:hover { border-color: var(--rs-grpo); color: var(--rs-grpo); }
    .cua-rs .rs-tab.sel { background: var(--rs-grpo); border-color: var(--rs-grpo); color: #fff; }

    .cua-rs .rs-legend { display: flex; gap: 1rem; justify-content: center; flex-wrap: wrap; margin-bottom: 0.9rem; }
    .cua-rs .rs-leg { display: flex; align-items: center; gap: 0.4rem; font-size: 0.74rem; color: var(--rs-muted); }
    .cua-rs .rs-chip { width: 12px; height: 12px; border-radius: 3px; }

     
    .cua-rs .rs-chart {
      display: flex; align-items: flex-end; justify-content: space-around;
      gap: 0.5rem; height: 260px; padding: 0 0.25rem 0 2.2rem;
      position: relative; border-bottom: 1px solid var(--rs-border);
    }
    .cua-rs .rs-yaxis { position: absolute; left: 0; top: 0; bottom: 0; width: 2rem; }
    .cua-rs .rs-ytick {
      position: absolute; right: 0.3rem; transform: translateY(-50%);
      font-family: 'IBM Plex Mono', monospace; font-size: 0.6rem; color: var(--rs-muted);
    }
    .cua-rs .rs-gridline { position: absolute; left: 2.2rem; right: 0; height: 1px; background: var(--rs-grid); }

    .cua-rs .rs-group { display: flex; flex-direction: column; align-items: center; flex: 1 1 0; min-width: 0; height: 100%; justify-content: flex-end; }
    .cua-rs .rs-bars { display: flex; align-items: flex-end; gap: 0.3rem; height: 100%; width: 100%; justify-content: center; }
    .cua-rs .rs-bar {
      position: relative; width: 100%; max-width: 46px; border-radius: 5px 5px 0 0;
      height: 0; transition: height 0.6s cubic-bezier(0.4, 0, 0.2, 1);
    }
    .cua-rs .rs-bar-val {
      position: absolute; top: -1.15rem; left: 50%; transform: translateX(-50%);
      font-family: 'IBM Plex Mono', monospace; font-size: 0.62rem; font-weight: 700; white-space: nowrap;
    }
    .cua-rs .rs-glabel {
      margin-top: 0.5rem; font-size: 0.66rem; color: var(--rs-muted);
      text-align: center; white-space: nowrap; overflow: hidden; text-overflow: ellipsis; max-width: 100%;
    }

    .cua-rs .rs-note {
      margin-top: 1rem; padding: 0.85rem 1.1rem; border-radius: 8px;
      background: var(--rs-surface); border: 1px solid var(--rs-border);
      font-size: 0.82rem; line-height: 1.6; color: var(--rs-text);
    }
    .cua-rs .rs-note strong { color: var(--rs-grpo); }
  </style>

  <div class="rs-header">
    <h3>What Moved the Needle</h3>
    <p>ScreenSpot-v2 · 1,272 samples · point-in-box accuracy</p>
  </div>

  <div class="rs-tabs">
    <button class="rs-tab sel" data-v="overall" id="rs-t-overall-ca72848780901f6e41aa11b8375d1dc6">Overall</button>
    <button class="rs-tab" data-v="texticon" id="rs-t-texticon-ca72848780901f6e41aa11b8375d1dc6">Text vs icon</button>
    <button class="rs-tab" data-v="platform" id="rs-t-platform-ca72848780901f6e41aa11b8375d1dc6">By platform</button>
  </div>

  <div class="rs-legend" id="rs-legend-ca72848780901f6e41aa11b8375d1dc6"></div>

  <div class="rs-chart" id="rs-chart-ca72848780901f6e41aa11b8375d1dc6">
    <div class="rs-yaxis" id="rs-yaxis-ca72848780901f6e41aa11b8375d1dc6"></div>
  </div>

  <div class="rs-note" id="rs-note-ca72848780901f6e41aa11b8375d1dc6"></div>

  <script>
  (function() {
    var uid = 'ca72848780901f6e41aa11b8375d1dc6';
    var chart  = document.getElementById('rs-chart-' + uid);
    var yaxis  = document.getElementById('rs-yaxis-' + uid);
    var legend = document.getElementById('rs-legend-' + uid);
    var note   = document.getElementById('rs-note-' + uid);
    var tabs   = document.querySelectorAll('#cua-rs-' + uid + ' .rs-tab');

    var C = { zero: 'var(--rs-zero)', sft: 'var(--rs-sft)', grpo: 'var(--rs-grpo)' };

    var views = {
      overall: {
        legend: [['Zero-shot','zero'],['+ SFT','sft'],['+ GRPO','grpo']],
        groups: [{ label: 'All targets', bars: [
          { c:'zero', v:71.3 }, { c:'sft', v:82.2 }, { c:'grpo', v:82.5 } ] }],
        note: '<strong>SFT does the heavy lifting:</strong> +10.8 points, zero-shot to SFT. GRPO adds a slight edge (its bar uses the matched 1008&times;1008 canvas eval, the protocol it was trained under). On a near-saturated benchmark, there is little headroom left to win.'
      },
      texticon: {
        legend: [['Zero-shot','zero'],['+ SFT','sft'],['+ GRPO','grpo']],
        groups: [
          { label: 'Text targets', bars: [ {c:'zero',v:77.7},{c:'sft',v:88.3},{c:'grpo',v:89.7} ] },
          { label: 'Icon targets', bars: [ {c:'zero',v:63.0},{c:'sft',v:74.2},{c:'grpo',v:73.1} ] }
        ],
        note: 'Icons &mdash; no text label to latch onto &mdash; are the model&rsquo;s weak spot, and where SFT helps most (63.0 &rarr; 74.2). Text climbs steadily through GRPO; the icon gap is the frontier the easy benchmark can&rsquo;t close.'
      },
      platform: {
        legend: [['Zero-shot','zero'],['+ SFT','sft']],
        groups: [
          { label: 'iOS',      bars:[{c:'zero',v:80.7},{c:'sft',v:88.2}] },
          { label: 'Android',  bars:[{c:'zero',v:71.1},{c:'sft',v:87.2}] },
          { label: 'Shopping', bars:[{c:'zero',v:71.4},{c:'sft',v:82.4}] },
          { label: 'Windows',  bars:[{c:'zero',v:68.6},{c:'sft',v:81.8}] },
          { label: 'Forum',    bars:[{c:'zero',v:75.9},{c:'sft',v:79.7}] },
          { label: 'GitLab',   bars:[{c:'zero',v:67.1},{c:'sft',v:79.5}] },
          { label: 'macOS',    bars:[{c:'zero',v:66.7},{c:'sft',v:76.9}] },
          { label: 'Dev tools',bars:[{c:'zero',v:61.7},{c:'sft',v:70.8}] }
        ],
        note: 'SFT lifts <strong>every</strong> platform, though unevenly: the largest gains are on Android (+16), Windows (+13), and GitLab (+12), the smallest on Forum (+4). The hardest category, dense developer tools, stays the lowest, with the most headroom still on the table.'
      }
    };

    function renderLegend(items) {
      legend.innerHTML = '';
      items.forEach(function(it) {
        var d = document.createElement('div'); d.className = 'rs-leg';
        d.innerHTML = '<span class="rs-chip" style="background:' + C[it[1]] + '"></span>' + it[0];
        legend.appendChild(d);
      });
    }

    function renderYAxis() {
      
      var old = chart.querySelectorAll('.rs-gridline');
      old.forEach(function(o){ o.remove(); });
      yaxis.innerHTML = '';
      [0,25,50,75,100].forEach(function(t) {
        var topPct = (100 - t);
        var tick = document.createElement('div'); tick.className = 'rs-ytick';
        tick.style.top = topPct + '%'; tick.textContent = t;
        yaxis.appendChild(tick);
        var gl = document.createElement('div'); gl.className = 'rs-gridline';
        gl.style.top = topPct + '%'; chart.appendChild(gl);
      });
    }

    function renderChart(view) {
      
      var old = chart.querySelectorAll('.rs-group');
      old.forEach(function(o){ o.remove(); });
      renderYAxis();

      view.groups.forEach(function(g) {
        var grp = document.createElement('div'); grp.className = 'rs-group';
        var bars = document.createElement('div'); bars.className = 'rs-bars';
        g.bars.forEach(function(b) {
          var bar = document.createElement('div'); bar.className = 'rs-bar';
          bar.style.background = C[b.c];
          bar.innerHTML = '<span class="rs-bar-val" style="color:' + C[b.c] + '">' + b.v.toFixed(1) + '</span>';
          bars.appendChild(bar);
          
          requestAnimationFrame(function() {
            requestAnimationFrame(function() { bar.style.height = b.v + '%'; });
          });
        });
        var lab = document.createElement('div'); lab.className = 'rs-glabel'; lab.textContent = g.label;
        grp.appendChild(bars); grp.appendChild(lab);
        chart.appendChild(grp);
      });
    }

    function select(v) {
      tabs.forEach(function(t){ t.classList.toggle('sel', t.getAttribute('data-v') === v); });
      var view = views[v];
      renderLegend(view.legend);
      renderChart(view);
      note.innerHTML = view.note;
    }

    tabs.forEach(function(t) {
      t.addEventListener('click', function() { select(this.getAttribute('data-v')); });
    });

    select('overall');
  })();
  </script>
</div>

<p>Read the bars in order. <strong>Zero-shot, the base model gets 71.3%</strong>, already not bad, which is exactly why it&rsquo;s a good base. <strong>SFT lifts that to 82.2%, a +10.8-point jump</strong>, and it improves every platform while lifting the model&rsquo;s weakest category: icons, which have no text label to latch onto, climb from 63.0% to 74.2%. Just as telling, the number of completely unparseable outputs went from 15 at zero-shot to <strong>zero</strong> after SFT. That is the format lock and the behavioral prior earning their keep. SFT is the heavy lifting, full stop.</p>
<p>Then GRPO. On ScreenSpot-v2 the gain is small: a slight edge on the matched evaluation, with text accuracy ticking up from 88.3% to 89.7% while icon accuracy dips slightly from 74.2% to 73.1%. And here is the honest, useful part: <strong>that small gain is exactly what the theory predicts.</strong> ScreenSpot-v2 is a comparatively easy, near-saturated benchmark; once SFT has the model clicking inside the right box most of the time, there isn&rsquo;t much headroom left for &ldquo;aim better&rdquo; to harvest. The center-point rigidity that RL fixes simply doesn&rsquo;t cost much when the targets are large and well-labeled.</p>
<p>Even at much larger scale the easy benchmark behaves the same way. GTA1, training this recipe on bigger models, reports ScreenSpot-v2 climbing from 90.2 to 92.4 with GRPO: a couple of points, no more. A near-saturated benchmark simply doesn&rsquo;t leave reinforcement learning much to win.</p>
<p>So the honest read of what we measured is this: SFT did the work, and GRPO mostly confirmed it. The place a metric-matching reward should actually earn its keep is harder grounding, the tiny targets in dense, professional UIs where landing <em>inside</em> the box and landing <em>near its center</em> finally come apart. Measuring that cleanly is the obvious next step, and the one I&rsquo;d want in hand before claiming RL moved the needle.</p>
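<p>For readers who want to see how numbers like these are tallied, the scoring can be sketched as follows. The sample schema (<code>"box"</code> and <code>"group"</code> keys) and the choice to count unparseable outputs as misses are assumptions for illustration; the repository&rsquo;s evaluation harness may differ in both.</p>

```python
from collections import defaultdict


def point_in_box(x, y, box):
    """True when (x, y) lies inside box = (x1, y1, x2, y2), edges included."""
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def evaluate(samples, predictions):
    """Score click predictions against ground-truth boxes.

    samples:     list of dicts with "box" and "group" keys (hypothetical schema)
    predictions: list of (x, y) tuples, or None where the output was unparseable
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    unparseable = 0
    for sample, pred in zip(samples, predictions):
        totals[sample["group"]] += 1
        if pred is None:
            unparseable += 1  # assumed: counted as a miss, never dropped
            continue
        if point_in_box(pred[0], pred[1], sample["box"]):
            hits[sample["group"]] += 1
    n = sum(totals.values())
    return {
        "overall": 100.0 * sum(hits.values()) / n,
        "by_group": {g: 100.0 * hits[g] / totals[g] for g in totals},
        "unparseable": unparseable,
    }


# Toy data, not benchmark results.
samples = [
    {"box": (0, 0, 100, 40), "group": "text"},
    {"box": (200, 0, 300, 40), "group": "text"},
    {"box": (10, 10, 26, 26), "group": "icon"},
    {"box": (50, 50, 66, 66), "group": "icon"},
]
predictions = [(50, 20), (250, 20), (40, 40), None]
print(evaluate(samples, predictions))
```

<p>Keeping the unparseable count separate from the accuracy makes it easy to tell a formatting failure from an aiming failure, which is the distinction the zero-shot versus SFT comparison relies on.</p>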
<h2 id="closing-the-loop-from-one-click-to-an-agent">Closing the Loop: From One Click to an Agent</h2>
<p>A grounding model is a function: pixels and text in, one coordinate out. An <em>agent</em> is that function in a loop, wired to a real machine.</p>

<div class="cua-al" id="cua-al-ca72848780901f6e41aa11b8375d1dc6">
  <style>
    .cua-al {
      --al-bg: #0d1117;
      --al-surface: #161b22;
      --al-border: #30363d;
      --al-text: #e6edf3;
      --al-muted: #8b949e;
      --al-accent: #a371f7;
      --al-accent-dim: rgba(163, 113, 247, 0.15);
      --al-blue: #58a6ff;
      --al-green: #39d353;
      --al-red: #f97583;
      --al-red-dim: rgba(249, 117, 131, 0.1);

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--al-bg); color: var(--al-text); line-height: 1.6;
      padding: 1.5rem; border-radius: 12px; margin: 2rem 0;
    }
    [data-theme="light"] .cua-al,
    :root:not([data-theme="dark"]) .cua-al {
      --al-bg: #f8fafc;
      --al-surface: #ffffff;
      --al-border: #e2e8f0;
      --al-text: #1e293b;
      --al-muted: #64748b;
      --al-accent: #8b5cf6;
      --al-accent-dim: rgba(139, 92, 246, 0.12);
      --al-blue: #3b82f6;
      --al-green: #10b981;
      --al-red: #ef4444;
      --al-red-dim: rgba(239, 68, 68, 0.08);
    }
    .cua-al * { box-sizing: border-box; }

    .cua-al .al-header { text-align: center; margin-bottom: 1rem; }
    .cua-al .al-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem; font-weight: 600; color: var(--al-accent);
      letter-spacing: 0.08em; text-transform: uppercase; margin: 0 0 0.4rem 0;
    }
    .cua-al .al-header p { color: var(--al-muted); font-size: 0.9rem; margin: 0; }

    .cua-al .al-goal {
      background: var(--al-surface); border: 1px solid var(--al-accent); border-radius: 8px;
      padding: 0.6rem 1rem; margin-bottom: 1.1rem; text-align: center; font-size: 0.9rem;
    }
    .cua-al .al-goal span {
      font-family: 'IBM Plex Mono', monospace; font-size: 0.62rem; font-weight: 700;
      color: var(--al-accent); text-transform: uppercase; letter-spacing: 0.06em;
      background: var(--al-accent-dim); padding: 0.15rem 0.45rem; border-radius: 4px; margin-right: 0.5rem;
    }

    .cua-al .al-main { display: grid; grid-template-columns: 1fr 1.4fr; gap: 1.5rem; align-items: start; }
    @media (max-width: 760px) { .cua-al .al-main { grid-template-columns: 1fr; } }

    .cua-al .al-loop { position: relative; height: 320px; }
    .cua-al .al-circle { position: absolute; left: 50%; top: 50%; transform: translate(-50%,-50%); width: 260px; height: 260px; }
    @media (max-width: 380px) { .cua-al .al-circle { width: 220px; height: 220px; } }

    .cua-al .al-node {
      position: absolute; width: 84px; height: 84px; border-radius: 50%;
      background: var(--al-surface); border: 2px solid var(--al-border);
      display: flex; flex-direction: column; align-items: center; justify-content: center;
      text-align: center; transition: all 0.35s cubic-bezier(0.4,0,0.2,1);
    }
    .cua-al .al-node .al-ic { font-size: 1.4rem; line-height: 1; margin-bottom: 0.2rem; }
    .cua-al .al-node .al-lb { font-size: 0.56rem; font-weight: 700; letter-spacing: 0.04em; color: var(--al-muted); font-family:'IBM Plex Mono',monospace; }
    .cua-al .al-node.active {
      border-color: var(--al-accent); background: var(--al-accent-dim);
      transform: scale(1.12); box-shadow: 0 8px 24px var(--al-accent-dim);
    }
    .cua-al .al-node.active .al-lb { color: var(--al-accent); }
    .cua-al .al-n0 { top: 0; left: 50%; transform: translateX(-50%); }
    .cua-al .al-n0.active { top: 0; left: 50%; transform: translateX(-50%) scale(1.12); }
    .cua-al .al-n1 { top: 50%; right: 0; transform: translateY(-50%); }
    .cua-al .al-n1.active { top: 50%; right: 0; transform: translateY(-50%) scale(1.12); }
    .cua-al .al-n2 { bottom: 0; left: 50%; transform: translateX(-50%); }
    .cua-al .al-n2.active { bottom: 0; left: 50%; transform: translateX(-50%) scale(1.12); }
    .cua-al .al-n3 { top: 50%; left: 0; transform: translateY(-50%); }
    .cua-al .al-n3.active { top: 50%; left: 0; transform: translateY(-50%) scale(1.12); }

    .cua-al .al-ring {
      position: absolute; inset: 42px; border-radius: 50%;
      border: 2px dashed var(--al-border);
    }
    .cua-al .al-ring-arrow {
      position: absolute; left: 50%; top: -9px; transform: translateX(-50%);
      color: var(--al-muted); font-size: 0.9rem;
    }

     
    .cua-al .al-trace { background: var(--al-surface); border: 1px solid var(--al-border); border-radius: 10px; padding: 1rem; min-height: 320px; }
    .cua-al .al-trace-head { display: flex; justify-content: space-between; align-items: center; margin-bottom: 0.75rem; }
    .cua-al .al-trace-title { font-size: 0.78rem; font-weight: 700; font-family:'IBM Plex Mono',monospace; text-transform: uppercase; letter-spacing: 0.05em; color: var(--al-muted); }
    .cua-al .al-iter { background: var(--al-accent-dim); color: var(--al-accent); padding: 0.15rem 0.6rem; border-radius: 20px; font-size: 0.66rem; font-weight: 700; font-family:'IBM Plex Mono',monospace; }
    .cua-al .al-event { border-left: 3px solid var(--al-border); padding: 0.5rem 0.7rem; margin-bottom: 0.6rem; border-radius: 0 6px 6px 0; background: var(--al-bg); transition: all 0.3s ease; }
    .cua-al .al-event.fresh { border-left-color: var(--al-accent); }
    .cua-al .al-event-type { font-size: 0.72rem; font-weight: 700; margin-bottom: 0.2rem; }
    .cua-al .al-event-body { font-family:'IBM Plex Mono',monospace; font-size: 0.72rem; color: var(--al-muted); line-height: 1.5; word-break: break-word; }
    .cua-al .al-event-body b { color: var(--al-blue); }
    .cua-al .al-done { color: var(--al-green); font-weight: 700; }

    .cua-al .al-controls { display: flex; gap: 0.5rem; justify-content: center; margin-top: 1.1rem; }
    .cua-al .al-btn {
      font-family: 'IBM Plex Sans', sans-serif; font-size: 0.8rem; font-weight: 600;
      padding: 0.5rem 1.1rem; border-radius: 6px; border: 1px solid var(--al-border);
      background: var(--al-surface); color: var(--al-text); cursor: pointer; transition: all 0.2s ease;
    }
    .cua-al .al-btn.primary { background: var(--al-accent); border-color: var(--al-accent); color: #fff; }
    .cua-al .al-btn:hover { filter: brightness(1.08); }
    .cua-al .al-btn:disabled { opacity: 0.5; cursor: not-allowed; filter: none; }

    .cua-al .al-safety {
      margin-top: 1.1rem; padding: 0.75rem 1rem; border-radius: 8px;
      background: var(--al-red-dim); border: 1px solid var(--al-red);
      font-size: 0.78rem; line-height: 1.55;
    }
    .cua-al .al-safety b { color: var(--al-red); }
  </style>

  <div class="al-header">
    <h3>From One Click to an Agent</h3>
    <p>The grounding model is the eyes and the hand. The loop is the rest of the body.</p>
  </div>

  <div class="al-goal"><span>Goal</span> Turn on dark mode</div>

  <div class="al-main">
    <div class="al-loop">
      <div class="al-circle">
        <div class="al-ring"><span class="al-ring-arrow">&#8635;</span></div>
        <div class="al-node al-n0" id="al-n0-ca72848780901f6e41aa11b8375d1dc6"><span class="al-ic">&#128247;</span><span class="al-lb">SCREENSHOT</span></div>
        <div class="al-node al-n1" id="al-n1-ca72848780901f6e41aa11b8375d1dc6"><span class="al-ic">&#127919;</span><span class="al-lb">GROUND</span></div>
        <div class="al-node al-n2" id="al-n2-ca72848780901f6e41aa11b8375d1dc6"><span class="al-ic">&#128433;</span><span class="al-lb">CLICK</span></div>
        <div class="al-node al-n3" id="al-n3-ca72848780901f6e41aa11b8375d1dc6"><span class="al-ic">&#128064;</span><span class="al-lb">OBSERVE</span></div>
      </div>
    </div>

    <div class="al-trace">
      <div class="al-trace-head">
        <span class="al-trace-title">Trace</span>
        <span class="al-iter" id="al-iter-ca72848780901f6e41aa11b8375d1dc6">step 0 / 8</span>
      </div>
      <div id="al-events-ca72848780901f6e41aa11b8375d1dc6"></div>
    </div>
  </div>

  <div class="al-controls">
    <button class="al-btn primary" id="al-start-ca72848780901f6e41aa11b8375d1dc6">&#9654; Run</button>
    <button class="al-btn" id="al-next-ca72848780901f6e41aa11b8375d1dc6">Step</button>
    <button class="al-btn" id="al-reset-ca72848780901f6e41aa11b8375d1dc6">Reset</button>
  </div>

  <div class="al-safety">
    <b>Safety is not optional.</b> A model that can click anything can click the wrong thing, fast. Run the agent in an ephemeral, sandboxed VM and restrict network egress to an allowlist. Grounding accuracy is, among other things, a safety property.
  </div>

  <script>
  (function() {
    var uid = 'ca72848780901f6e41aa11b8375d1dc6';
    var nodes = [
      document.getElementById('al-n0-' + uid),
      document.getElementById('al-n1-' + uid),
      document.getElementById('al-n2-' + uid),
      document.getElementById('al-n3-' + uid)
    ];
    var events = document.getElementById('al-events-' + uid);
    var iter   = document.getElementById('al-iter-' + uid);
    var startB = document.getElementById('al-start-' + uid);
    var nextB  = document.getElementById('al-next-' + uid);
    var resetB = document.getElementById('al-reset-' + uid);

    
    var steps = [
      { p:0, t:'Screenshot', b:'Capture current screen &rarr; 1008&times;1008 frame sent to the model.' },
      { p:1, t:'Ground', b:'Model output: <b>{"box": [712, 18, 744, 50]}</b> &rarr; center = ((712+744)/2, (18+50)/2) = <b>(728, 34)</b>.' },
      { p:2, t:'Click', b:'driver.click(<b>728, 34</b>) &rarr; the Settings gear.' },
      { p:3, t:'Observe', b:'New screen: the settings panel is open. Goal not yet met &rarr; loop.' },
      { p:0, t:'Screenshot', b:'Capture the settings panel &rarr; new frame to the model.' },
      { p:1, t:'Ground', b:'Model output: <b>{"box": [560, 210, 604, 236]}</b> &rarr; center = <b>(582, 223)</b>.' },
      { p:2, t:'Click', b:'driver.click(<b>582, 223</b>) &rarr; the &ldquo;Dark mode&rdquo; toggle.' },
      { p:3, t:'Observe', b:'<span class="al-done">Dark mode is on. Goal met &mdash; the loop exits.</span>' }
    ];

    var i = -1, timer = null, running = false;

    function light(phase) {
      nodes.forEach(function(n, idx){ n.classList.toggle('active', idx === phase); });
    }

    function step() {
      if (i >= steps.length - 1) { stop(); return; }
      i++;
      var s = steps[i];
      light(s.p);
      
      var prev = events.querySelectorAll('.al-event.fresh');
      prev.forEach(function(e){ e.classList.remove('fresh'); });
      var color = ['var(--al-blue)','var(--al-accent)','var(--al-text)','var(--al-green)'][s.p];
      var ev = document.createElement('div');
      ev.className = 'al-event fresh';
      ev.innerHTML = '<div class="al-event-type" style="color:' + color + '">' + s.t + '</div><div class="al-event-body">' + s.b + '</div>';
      events.appendChild(ev);
      iter.textContent = 'step ' + (i + 1) + ' / ' + steps.length;
      if (i >= steps.length - 1) { stop(); startB.textContent = '✓ Done'; startB.disabled = true; }
    }

    function run() {
      if (running) return;
      running = true; startB.disabled = true; nextB.disabled = true;
      timer = setInterval(step, 1400);
    }
    function stop() {
      running = false; clearInterval(timer);
      if (i < steps.length - 1) { startB.disabled = false; nextB.disabled = false; }
    }
    function reset() {
      stop(); i = -1; events.innerHTML = ''; iter.textContent = 'step 0 / ' + steps.length;
      light(-1); startB.textContent = '▶ Run'; startB.disabled = false; nextB.disabled = false;
    }

    startB.addEventListener('click', run);
    nextB.addEventListener('click', function(){ if (!running) step(); });
    resetB.addEventListener('click', reset);
  })();
  </script>
</div>

<p>Each turn: capture a screenshot, send it with the standing instruction to the model, parse the predicted box or point, reduce a box to a click target with the obvious arithmetic ($x_c = (x_1 + x_2)/2$, $y_c = (y_1 + y_2)/2$), execute the click through a driver such as PyAutoGUI (OS-level) or Playwright (browser-level), then capture the new screen and go again until the goal is met. The grounding model is the eyes and the hand; the loop is the rest of the body.</p>
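<p>The loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the project&rsquo;s agent: the four callables (<code>capture</code>, <code>ground</code>, <code>click</code>, <code>goal_met</code>) are hypothetical interfaces injected by the caller, and the step budget of 8 is arbitrary.</p>

```python
def box_center(box):
    """Reduce (x1, y1, x2, y2) to its center point."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2


def run_agent(goal, capture, ground, click, goal_met, max_steps=8):
    """Screenshot, ground, click, observe; repeat until the goal is met.

    All four callables are injected (hypothetical interfaces):
      capture()            returns the current frame
      ground(frame, goal)  returns a box (x1, y1, x2, y2), or None
      click(x, y)          performs the click
      goal_met(frame)      returns True once the goal is visible
    Returns True on success, False if grounding fails or steps run out.
    """
    for _ in range(max_steps):
        frame = capture()
        if goal_met(frame):
            return True
        box = ground(frame, goal)
        if box is None:
            return False  # refuse to click blindly on an unparseable output
        click(*box_center(box))
    return goal_met(capture())


# A fake two-screen environment that replays the trace in the widget above.
state = {"screen": "home"}
clicks = []
BOXES = {"home": (712, 18, 744, 50), "settings": (560, 210, 604, 236)}
NEXT = {"home": "settings", "settings": "dark"}


def fake_click(x, y):
    clicks.append((x, y))
    state["screen"] = NEXT[state["screen"]]


ok = run_agent(
    "Turn on dark mode",
    capture=lambda: state["screen"],
    ground=lambda frame, goal: BOXES.get(frame),
    click=fake_click,
    goal_met=lambda frame: frame == "dark",
)
print(ok, clicks)
```

<p>On a real desktop, <code>capture</code> and <code>click</code> could be backed by <code>pyautogui.screenshot</code> and <code>pyautogui.click</code>. The sketch also leaves out one step a real agent needs: if the model sees a resized frame, its coordinates must be mapped back to screen pixels before clicking.</p>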
<p>This is also where the stakes become real, and the references this project drew on are emphatic about it: a model with unconstrained control of a real screen is a security surface, not a toy. The standard mitigations are not optional. Run the agent inside an ephemeral, sandboxed virtual machine, and restrict its network egress to an allowlist of domains it actually needs. An agent that can click anything can also click the wrong thing, confidently, very fast. The grounding accuracy we spent this whole post chasing is, among other things, a safety property.</p>
<h2 id="lessons">Lessons</h2>
<p>A few things generalize well beyond this particular model.</p>
<p><strong>Scope to the verifiable slice first.</strong> The temptation with computer use is to chase full autonomy immediately. Grounding is the piece you can train, measure, and trust on its own, and everything ambitious is built on it. Earn the keystone before you build the arch.</p>
<p><strong>SFT imitates one pixel; RL rewards the whole target.</strong> Supervised fine-tuning installs the format and the behavioral prior, and it does the overwhelming majority of the work: here, +10.8 points in an hour. The rigidity it leaves behind, imitating a single labeled pixel, isn&rsquo;t a limitation of SFT itself; it&rsquo;s an artifact of how we set it up: training on the box center with a token-level loss. You could blunt it on the SFT side too, by sampling labels from across the target. RL simply takes the more direct route: reward any in-box click and you optimize the metric itself. That correction only shows up where the rigidity was costing you, which means RL&rsquo;s visible payoff lives on the hard examples, not the easy benchmark.</p>
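<p>The SFT-side alternative mentioned above, sampling labels from across the target instead of always using the center, might look like the sketch below. It was not part of this project&rsquo;s recipe; the function name and the 10% inner margin are assumptions chosen for illustration.</p>

```python
import random


def sample_click_label(box, rng, margin=0.1):
    """Draw a training label uniformly from inside the box.

    box:    (x1, y1, x2, y2) ground-truth target
    rng:    a random.Random instance, passed in for reproducibility
    margin: fraction of width/height kept clear of each edge (assumed value),
            so labels never sit right on the boundary
    """
    x1, y1, x2, y2 = box
    mx = margin * (x2 - x1)
    my = margin * (y2 - y1)
    return rng.uniform(x1 + mx, x2 - mx), rng.uniform(y1 + my, y2 - my)


rng = random.Random(0)
box = (712, 18, 744, 50)
labels = [sample_click_label(box, rng) for _ in range(5)]
for x, y in labels:
    print(round(x), round(y))
```

<p>Every sampled label is a correct click by the point-in-box metric, so the supervised target stops implying that only one pixel is right. The token-level loss is unchanged; only the labels vary from epoch to epoch.</p>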
<p><strong>A reward that <em>is</em> your metric is the cheapest cheat code there is.</strong> Point-in-box can&rsquo;t be reward-hacked because there is no gap between it and the thing you actually measure. When you can phrase your objective as a deterministic check, do. Verifiable rewards sidestep an entire category of failure that learned reward models invite.</p>
<p><strong>Match your eval protocol to your training protocol.</strong> This one bit during the run: the RL model trained on images letterboxed to a fixed canvas, and evaluating it on differently-shaped images understated its score until the protocols were matched. A model can only be fairly judged on inputs preprocessed the way its training data was; a mismatched eval will lie to you.</p>
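<p>To make the letterboxing issue concrete, here is a minimal sketch of the coordinate bookkeeping a matched evaluation needs. It assumes aspect-preserving resize with centered padding onto a 1008&times;1008 canvas; the project&rsquo;s actual preprocessing may pad or scale differently, and the function names are hypothetical.</p>

```python
def letterbox_params(width, height, canvas=1008):
    """Scale and padding that fit a width x height image onto a square canvas.

    Assumes the aspect ratio is preserved and the image is centered.
    """
    scale = min(canvas / width, canvas / height)
    pad_x = (canvas - width * scale) / 2
    pad_y = (canvas - height * scale) / 2
    return scale, pad_x, pad_y


def to_canvas(x, y, params):
    """Map a point from original-image pixels to canvas pixels."""
    scale, pad_x, pad_y = params
    return x * scale + pad_x, y * scale + pad_y


def to_image(cx, cy, params):
    """Map a point predicted on the canvas back to original-image pixels."""
    scale, pad_x, pad_y = params
    return (cx - pad_x) / scale, (cy - pad_y) / scale


params = letterbox_params(1920, 1080)
cx, cy = to_canvas(960, 540, params)   # screen center
print(round(cx, 1), round(cy, 1))
print(to_image(cx, cy, params))
```

<p>Ground-truth boxes go through <code>to_canvas</code> before scoring a canvas-space prediction, or predictions go through <code>to_image</code> before scoring against the original boxes. Either works; mixing the two coordinate spaces is what produces an understated score.</p>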
<p><strong>The theory has become a budget line.</strong> Quantization, memory-efficient attention, low-rank adaptation, group-relative RL: the pieces this blog has explained one at a time are now mature enough that assembling them into a working vision agent costs about $15 and an afternoon. The frontier of <em>understanding</em> each technique is still deep; the frontier of <em>using all of them together</em> has dropped to the floor. That&rsquo;s the quietly remarkable part.</p>
<h2 id="try-it-yourself">Try It Yourself</h2>
<p>The full training code (data preparation, the SFT and GRPO scripts, the reward functions, the evaluation harness, and the exact config files for every hyperparameter mentioned here) is open source at the repository below. The recipe is small enough to read in an evening: one base model, one cleaned grounding dataset, a LoRA SFT pass, and a GRPO pass with a geometry-check reward.</p>
<blockquote>
<p><strong>Reproduce it:</strong> Qwen2.5-VL-3B-Instruct · <code>Salesforce/grounding_dataset</code> (≈6k for SFT, ≈1.2k prompts for GRPO) · one 24 GB GPU (NVIDIA L4) · LoRA SFT (r=16, α=32, lr 1e-4, 150 steps) then QLoRA GRPO (G=8, β=0, lr 1e-6, point-in-box + format reward, 150 steps) · evaluate on ScreenSpot-v2. Total cost ≈ $15.</p>
<p><strong>Repository:</strong> <a href="https://github.com/MdJawad/computer-use">github.com/MdJawad/computer-use</a></p></blockquote>
<p>If you read <a href="/posts/rlhf-to-grpo/">the GRPO post</a> for the algorithm and <a href="/posts/quantization-and-gptq/">the quantization post</a> for the compression, this is what they look like pointed at a real, slightly stubborn problem: a small model that, after an hour of supervised teaching and a couple of hours of reinforcement, will look at a screen it has never seen and put the cursor where you asked.</p>

]]></content:encoded></item></channel></rss>