
LLM Test Automation - Code for Scale, Vision for Edge Cases

Why code generation beats visual agents for scalable test automation, and how to strategically use computer vision for edge cases where the DOM is inaccessible.


Krishnanand B

January 1, 2026

We are living through a peak hype cycle for "AI Agents." The prevailing vision is seductive: an intelligent layer sitting on top of your computer, watching your screen and clicking buttons just like a human would. It promises a world where we can bypass APIs and code entirely, relying on visual understanding to get things done.

After spending the last year building production test automation with LLMs, I've learned that both approaches have their place—but they're not equal. For most use cases, code generation is faster, cheaper, and more reliable. Vision shines in specific edge cases where code can't reach.

Recently, a Twitter thread by Manosai caught my eye. He was dissecting a takeaway from Lenny Rachitsky's conversation with OpenAI's Codex lead, and he highlighted a point that matched my own experience:

Point #4:

"Writing code may become the universal way AI accomplishes any task. Rather than clicking through interfaces or building separate integrations, AI performs best when it writes small programs on the fly."

This isn't theory. It's the playbook for the future of automation. Here's why the "Code First" approach is winning—and where "Vision" actually belongs.


Why Code Works Better at Scale

Manosai argues that this insight is "mispriced" by the market. Everyone is rushing to build agents that see, but for most automation tasks, the real value is in agents that code.

Two reasons stand out from my work:

Verification is binary. Frontier models like GPT-4o and Claude are trained on millions of lines of code. Code either runs or it errors out. This creates a tight feedback loop for the model. A visual agent trying to "click the blue button" deals with fuzzy logic—did it miss by a pixel? Did the color render differently on this monitor? A code agent that writes await page.click('#submit-btn') deals with deterministic truth.
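That binary signal is easy to capture in practice. Here's a minimal sketch of the feedback loop, where `runStep` and the result shape are illustrative names (not a real Playwright API): any AI-generated action either completes or throws, and the exact error message is what gets fed back to the model.

```typescript
// Sketch of the binary feedback loop: `action` stands in for any
// AI-generated step (e.g. a Playwright call). The code either runs
// or errors out -- no fuzzy "did I miss by a pixel?" judgment.
type StepResult = { ok: true } | { ok: false; error: string };

async function runStep(action: () => Promise<void>): Promise<StepResult> {
  try {
    await action();
    return { ok: true }; // the code ran: deterministic success
  } catch (err) {
    // the code errored: an exact message the model can reason about
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

A selector that doesn't exist produces a concrete error string, which is far more useful to the model than "the click might not have landed."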

Reasoning gets sharper. When you ask a model to write code, you anchor it in logic and structure. Manosai notes that models show stronger reasoning when working through codebases because it forces structured thinking—unlike the chaos of visual pixels where anything can happen.

In my own work, moving from "visual navigation" to "code generation" improved reliability by an order of magnitude. We stopped asking the AI to be the user, and started asking it to write the script for the user.


The Use Case for Vision: When the DOM Goes Dark

So, is computer vision dead? Not at all.

While code should be the "muscle" that performs actions, vision is the "sense" that guides it. This matters most when the code layer is inaccessible.

I recently worked on a challenging automation project for a high-volume media platform. The entire application was rendered inside HTML5 Canvases and iFrames.

The Problem: Traditional automation tools like Playwright look for DOM elements—buttons, divs, inputs. But a Canvas is just a black box of pixels to the DOM. There was no code to hook into.

The Solution: We didn't make the AI blindly click. We built a hybrid system using an OODA Loop:

  1. Observe: The AI analyzes a screenshot of the canvas.
  2. Orient: We overlaid a grid system on the image. The AI identified specific grid coordinates where elements lived.
  3. Decide: We converted those grid coordinates into precise pixel positions. Once the AI "saw" a button, we stored that location as data.
  4. Act: Code executed the click at those precise coordinates.
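The Orient/Act steps above boil down to one piece of pure math: turning the grid cell the vision model reports into a pixel coordinate that code can click. Here's a sketch under assumptions of my own making — the grid dimensions and "C4"-style cell naming are illustrative, not the exact scheme we used.

```typescript
// Convert a vision-reported grid cell (e.g. "C4") into pixel
// coordinates for a code-driven click on the canvas.
interface Point { x: number; y: number }

function cellToPixel(
  cell: string,   // column letter + row number, e.g. "C4"
  cols: number, rows: number,       // grid dimensions
  width: number, height: number,    // screenshot dimensions in pixels
): Point {
  const col = cell.charCodeAt(0) - "A".charCodeAt(0);
  const row = parseInt(cell.slice(1), 10) - 1;
  // Aim for the centre of the cell so the click lands inside the
  // element rather than on its edge.
  return {
    x: Math.round(((col + 0.5) / cols) * width),
    y: Math.round(((row + 0.5) / rows) * height),
  };
}

// Act step: a plain Playwright mouse click at those coordinates, e.g.
//   const { x, y } = cellToPixel("C4", 10, 10, 1000, 800);
//   await page.mouse.click(x, y);
```

The important part is that vision runs once, during discovery; the coordinates it produces are stored as data, and every subsequent run is a deterministic click.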

The Architecture That Works: LLMs as Code Collectors

Here's what I've learned about building automation that actually ships to production:

Make the LLM an orchestrator, not a performer. The experience of using LLMs as a "code collector"—where they call specific tools and generate a spec file—works far better than fully dynamic solutions. I learned this the hard way. Early on, I built an automation platform that took the fully dynamic route, letting the AI figure out each step without pre-written code. The vision was appealing, but the results weren't deterministic. Tests would pass one run and fail the next. The AI would find creative but inconsistent paths through the same workflow. Cost per run was high, and debugging felt like chasing ghosts.

Build component-based automation. The approach I've landed on maps natural language test cases to reusable Playwright components. Each component (login forms, data tables, navigation menus) exposes functions and gets a matching AI tool with descriptions of when to call it:

const login = await loginComponent(page);
await login.login("email", "password");

The AI reads a test case in plain English, does a "fake run" to collect the right code blocks, and generates a runnable spec file. Once you've built enough components, you can spin up new test cases in minutes.
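To make the "code collector" idea concrete, here's a hypothetical sketch: each tool call the model selects during the fake run records a code block, and the recorded blocks are stitched into a runnable Playwright spec. The class name, method names, and spec template are illustrative assumptions, not the actual platform's API.

```typescript
// Hypothetical sketch of the "code collector": the AI's fake run
// records which component calls it would make, and the recorded
// code blocks are assembled into a Playwright spec file.
interface RecordedCall { tool: string; code: string }

class SpecCollector {
  private calls: RecordedCall[] = [];

  // Called once per tool the model selects during the fake run.
  record(tool: string, code: string): void {
    this.calls.push({ tool, code });
  }

  // Emit a runnable Playwright spec from the collected code blocks.
  toSpec(testName: string): string {
    const body = this.calls.map(c => "  " + c.code).join("\n");
    return [
      `import { test } from '@playwright/test';`,
      ``,
      `test('${testName}', async ({ page }) => {`,
      body,
      `});`,
    ].join("\n");
  }
}
```

The payoff is that the LLM's (expensive, non-deterministic) reasoning happens once at collection time; what ships is a plain spec file that runs like any hand-written test.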

Treat tests as data. Everything in the LLM space is data—errors, logs, test failures. When a test breaks, feed the failure back to the AI and let it fix the code. The maintenance burden drops because the LLM handles the tedious part of debugging.
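One way to sketch that feedback step, assuming a failure shape and prompt wording of my own invention: package the error, the test name, and the current spec source into a repair prompt for the model.

```typescript
// Minimal sketch of "tests as data": package a failure into a
// repair prompt for the model. Field names and prompt wording
// are illustrative assumptions.
interface TestFailure { testName: string; error: string; specSource: string }

function buildRepairPrompt(f: TestFailure): string {
  return [
    `The Playwright test "${f.testName}" failed with:`,
    f.error,
    ``,
    `Current spec file:`,
    f.specSource,
    ``,
    `Rewrite the spec so the test passes, changing only what the error requires.`,
  ].join("\n");
}
```

Because the failing artifact is deterministic code with a concrete error message, the model has everything it needs in one prompt — no screenshots, no guessing about state.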


The Cost vs. Determinism Tradeoff

This architecture isn't just about what's possible—it's about what's practical.

Fully dynamic agents that see screens and decide on the fly are expensive. Every action burns tokens. Hallucinations cause flaky tests. You're paying for uncertainty.

Component-based code generation flips that equation. The AI does the heavy thinking once during test creation. After that, you're running deterministic Playwright scripts. Regression runs cost almost nothing. When tests fail, you have clear logs to feed back to the model for fixes.

The pattern I recommend:

  • Execution (90%): Let the LLM write small programs to interact with APIs and DOM elements. Fast, cheap, verifiable.
  • Assertions (10%): Use vision models to verify state. Did the modal close? Does the chart look right? Is the game showing the correct texture?
  • Discovery (edge cases): Use vision to map territory when the code layer is hidden, then immediately convert that visual insight into structured data.

The Takeaway

Visual agents and code-generating agents aren't competitors—they're teammates. The question is which one should do most of the work.

For the majority of test automation, code generation wins on cost and reliability. You pay for AI reasoning once during test creation, then run deterministic scripts forever. Vision-only approaches burn tokens on every run and introduce uncertainty you don't need.

But when the DOM goes dark—Canvas apps, iFrames, complex visual verification—vision becomes essential. The key is converting what it sees into structured data that code can use.

My recommendation: start with code generation as your default. Bring in vision when you hit a wall. That's the architecture that's worked for me over the last year.


What's your experience with AI test automation? Are you leaning toward code generation, visual agents, or a hybrid? I'd like to hear what's working for you.
