Context Engineering: The Missing Layer in Your LLM Test Strategy
Better prompts plateau. Context engineering breaks through by assembling the right information into every LLM test generation call.

Krishnanand B
March 11, 2026
A ticket just landed in your sprint: add E2E coverage for the checkout page promo code feature. You open Claude, type a prompt, and get a test back in seconds.
Context engineering isn't on your mind yet. The prompt looks reasonable:
```
Write a Playwright test for the checkout page promo code feature.
Test applying a valid promo code and seeing the discount,
and rejecting an expired promo code.
```
Claude produces this:
```ts
import { test, expect } from '@playwright/test';

test('applies valid promo code', async ({ page }) => {
  await page.goto('/checkout');
  await page.fill('#email', 'testuser@example.com');
  await page.fill('#password', 'Test1234!');
  await page.click('.btn-login');
  await page.waitForNavigation();
  await page.fill('.promo-input', 'SAVE20');
  await page.click('.apply-promo-btn');
  await expect(page.locator('.discount-amount')).toHaveText('-$20.00');
});

test('rejects expired promo code', async ({ page }) => {
  await page.goto('/checkout');
  await page.fill('#email', 'testuser@example.com');
  await page.fill('#password', 'Test1234!');
  await page.click('.btn-login');
  await page.waitForNavigation();
  await page.fill('.promo-input', 'EXPIRED2024');
  await page.click('.apply-promo-btn');
  await expect(page.locator('.promo-error')).toHaveText('Promo code expired');
});
```
It compiles. It even runs. Five things are wrong with it:

- CSS selectors everywhere. The team uses `data-testid`, but the model has never seen the style guide.
- Inline login flow. The team has a `loginAsUser` helper. The model wrote its own because it didn't know the helper exists.
- Empty cart. No `seedCart` call, so the checkout page has nothing to apply a promo to. The test will fail before it reaches the promo input.
- No `waitForResponse`. The promo API takes 2+ seconds on staging. Without waiting for `/api/promo/validate`, the assertion fires before the discount renders. Flake city.
- Hardcoded promo codes. The team stores test data in fixture files. `'SAVE20'` might not exist in the staging database next week.
The problem isn't the LLM. It's what the model knows when it starts writing. I hit this same wall building the tools in this series. The fix wasn't a better prompt. It was better context.
What Is Context Engineering?
Most teams try to fix bad LLM output by rewriting the prompt. Add more detail. Be more specific. That helps up to a point, then it plateaus.
Context engineering takes a different angle. Instead of changing how you ask, you change what the model can see when it answers.
You build a system that assembles the right information into every LLM call. Not just your request, but everything the model needs to do the job well.
Think of it as mise en place for your LLM. A chef who starts cooking before prepping ingredients produces chaos. An LLM that starts generating tests without the right reference material does the same.
The information a model needs falls into three categories:
- What shapes its judgment: conventions, style guides, examples of good output. These act as guardrails.
- What it reasons over: source code, API signatures, existing tests, historical patterns. This is the raw material.
- What's happening right now: the specific request and its constraints. This is the trigger.
When you paste a file into a chat window, you're doing context engineering by hand. The goal is to make it automatic. A system that loads the right files, includes the right conventions, and flags the right pitfalls before every generation call.
What Your LLM Is Missing
Go back to those five problems. Every one of them has a fix that someone on your team already knows.
Your senior engineer knows selectors should use `data-testid`. Your test lead built `loginAsUser` six months ago. Someone debugged the flaky promo test last quarter and discovered the 2.3-second API delay. The fixture files with test data constants already exist in the repo.
That knowledge lives in people's heads, Slack threads, code review comments, and wikis. The LLM can't see any of it. So it guesses.
Context engineering fixes this by turning tribal knowledge into explicit files the model can read. Every team will organize this differently. What follows is one approach, built step by step.
Your repo probably already looks something like this:
```
project-root/
├── pages/
│   └── checkout.page.ts
├── helpers/
│   ├── auth.helper.ts
│   └── cart.helper.ts
└── fixtures/
    └── promo-codes.fixture.ts
```
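For concreteness, the promo fixture might look something like this. This is a hypothetical sketch: the constant names (`VALID_PROMO`, `EXPIRED_PROMO`, `BASIC_ITEM`) and object shapes are assumptions, not your actual fixture file.

```typescript
// fixtures/promo-codes.fixture.ts (hypothetical sketch)
// Centralizing test data here means tests never hardcode promo strings.

export const VALID_PROMO = {
  code: 'SAVE20',        // seeded into the staging database
  discount: '-$20.00',   // expected discount label on the checkout page
};

export const EXPIRED_PROMO = {
  code: 'EXPIRED2024',
  error: 'Promo code expired',
};

export const BASIC_ITEM = {
  sku: 'SKU-001',        // any in-stock item works for promo tests
  quantity: 1,
};
```

When the staging database changes, you update one fixture file instead of hunting through every test that mentions a promo code.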
The Page Objects, helpers, and fixtures are all there. But the conventions your team follows, the patterns that make tests pass on the first try, the lessons learned from debugging flaky tests. None of that lives in a file the model can read.
Start with a style guide
This is the highest-value file you can create. Fifteen lines of your team's conventions change every line the model writes:
```md
# Test Style Guide

## Selectors
Always use data-testid attributes. Never use CSS classes or IDs.

## Structure
Use Page Object Model for all page interactions.
Never call page.fill() or page.click() directly in test bodies.

## Auth
Use loginAsUser(page, role) from helpers/auth.helper.ts.
Valid roles: 'customer', 'admin', 'guest'.

## Test Data
Use fixture files in fixtures/. Never hardcode values in tests.

## Async
After any action that triggers an API call, use waitForResponse
before asserting on the result.

## Feature Flags
Use enableFeatureFlag(page, 'FLAG_NAME') before testing any
gated feature.
```
This file alone fixes three of the five problems from the promo code test. The model stops guessing at selectors, uses the existing `loginAsUser` helper, and pulls test data from fixtures instead of hardcoding strings.
Capture your failure patterns
Every team has institutional knowledge about what breaks and why. It usually lives in someone's head or buried in a Slack thread from six months ago. Write it down:
```md
# Known Failure Patterns

## checkout-flow.spec.ts
"Order summary" test flakes because DOM updates after
/api/order/summary resolves. Fix: waitForResponse.

## checkout-payment.spec.ts
Staging DB refresh clears payment tokens Monday mornings
before 9am EST. Tests using saved payment methods will
fail during that window.

## Promo API
/api/promo/validate averages 2.3s on staging.
Always await the response before asserting discount values.

## Feature Flag: PROMO_V2
New promo UI is behind the PROMO_V2 flag. Tests targeting
the new promo flow must enable it in beforeEach.
```
This fixes the remaining two problems. The model learns to wait for the slow promo API and to enable the `PROMO_V2` feature flag.
Add an example test
One good reference test from your best engineer teaches the model more about your team's patterns than ten paragraphs of instructions. Drop it in an examples/ folder.
Add an index file
A CLAUDE.md (or AGENTS.md) at the repo root tells the LLM where to find everything. Without it, the agent doesn't know your context files exist.
```md
# Test Generation Context

Read these files before generating any test:

- .ai/test-style-guide.md (selector strategy, naming, assertion patterns)
- .ai/failure-patterns.md (known flaky tests, async issues)
- .ai/examples/ (reference tests showing team patterns)

For the component under test, also read:

- The Page Object in pages/ (if it exists)
- The fixture file in fixtures/ (for test data constants)
- The component source in src/components/ (for data-testid attributes)
```
The instructions are direct. Tell the agent what to read, when to read it, and where to find it. Tools like Claude Code pick up CLAUDE.md automatically on every turn. Other agents may need the file referenced in a system prompt or loaded at the start of each session.
The result
With those files in place, your repo now looks like this:
```
project-root/
├── .ai/
│   ├── test-style-guide.md
│   ├── failure-patterns.md
│   └── examples/
│       └── checkout-flow.spec.ts
├── pages/
│   └── checkout.page.ts
├── helpers/
│   ├── auth.helper.ts
│   └── cart.helper.ts
├── fixtures/
│   └── promo-codes.fixture.ts
└── CLAUDE.md
```
Four new files. The style guide tells the model how your team writes tests. The failure patterns file tells it what has gone wrong before. Between them, the model can avoid mistakes that took your team weeks to discover.
Your context is yours
The .ai/ folder above is one example. There is no fixed structure for context building. Every team's setup looks different because every team's codebase looks different.
Some teams keep tests alongside their source code. In that case, the codebase itself is the context. Point the agent at the component directory and it can read the source, the co-located tests, and the patterns that connect them. If your team has API documentation in the repo, reference that too. The model can use API docs to set up test data, build request payloads, and validate response shapes without guessing.
The only rule: make the knowledge findable. If your selectors follow a convention, write it down. If your staging environment has quirks, write those down. If your fixtures live in a specific place, tell the agent where.
This is where your experience as an engineer matters most. You know which patterns are worth documenting. You know which failures cost the team days of debugging. You know where the sharp edges are. No template can replace that judgment. The .ai/ folder is a starting point. What you put in it, and how you grow it over time, depends entirely on what your team has learned.
A Context-Engineered Prompt
Here's what changes when those files get wired into the LLM call.
- Gather: The system reads the style guide, pulls the `CheckoutPage` source, loads three example tests, and checks the failure patterns file.
- Assemble: Each source gets a share of the context window. The style guide and Page Object get priority because they shape every line of the generated test. Failure patterns get summarized. If the context window is tight, example tests get trimmed first.
- Call: The assembled context plus your request go to the model as a single prompt.
- Record: After generation, log which sources were used and whether the test passed. This feedback makes the next generation better.
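The gather-and-assemble steps can be sketched as a small function. This is a simplified illustration, not a real tool: the `ContextSource` shape, the four-characters-per-token estimate, and the trimming rule are all assumptions.

```typescript
interface ContextSource {
  name: string;      // e.g. 'Style Guide', 'Page Object', 'Examples'
  content: string;
  priority: number;  // lower number = assembled first, trimmed last
}

// Rough heuristic: ~4 characters per token for English text and code.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Assemble sources into one prompt, skipping sources that would
// blow the token budget -- lowest-priority sources get dropped first.
function assembleContext(
  sources: ContextSource[],
  request: string,
  budgetTokens: number,
): string {
  const ordered = [...sources].sort((a, b) => a.priority - b.priority);
  const parts: string[] = [];
  let used = estimateTokens(request);
  for (const source of ordered) {
    const cost = estimateTokens(source.content);
    if (used + cost > budgetTokens) continue; // trim low-priority sources
    parts.push(`## ${source.name}\n${source.content}`);
    used += cost;
  }
  parts.push(`## Request\n${request}`);
  return parts.join('\n\n');
}
```

With a tight budget, the style guide and Page Object survive while bulky example tests get dropped, which matches the priority order described above.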
The assembled prompt looks like this:
```
## Style Guide
[contents of .ai/test-style-guide.md]

## Page Object: CheckoutPage
export class CheckoutPage {
  async applyPromoCode(code: string) { ... }
  async getDiscountAmount(): Promise<string> { ... }
  async getPromoError(): Promise<string> { ... }
  async waitForPromoResponse() {
    await this.page.waitForResponse('**/api/promo/validate');
  }
}

## Available Helpers
- loginAsUser(page, role): logs in as 'customer', 'admin', or 'guest'
- seedCart(page, items[]): seeds cart with given items via API
- enableFeatureFlag(page, flagName): enables a feature flag for the session

## Known Issues
- /api/promo/validate averages 2.3s on staging. Always await before asserting.
- PROMO_V2 feature flag must be enabled for new promo UI.

## Example Test Pattern
test('completes checkout with standard shipping', async ({ page }) => {
  const checkout = new CheckoutPage(page);
  await loginAsUser(page, 'customer');
  await seedCart(page, [BASIC_ITEM]);
  await checkout.goto();
  await checkout.selectShipping('standard');
  await checkout.waitForOrderSummary();
  await expect(checkout.totalAmount).toHaveText('$29.99');
});

## Request
Write a Playwright test for the promo code feature: apply a valid code
and verify the discount, reject an expired code with an error message.
```
Claude now produces this:
```ts
import { test, expect } from '@playwright/test';
import { CheckoutPage } from '../pages/checkout.page';
import { loginAsUser } from '../helpers/auth.helper';
import { seedCart } from '../helpers/cart.helper';
import { enableFeatureFlag } from '../helpers/feature-flag.helper';
import { VALID_PROMO, EXPIRED_PROMO, BASIC_ITEM } from '../fixtures/promo-codes.fixture';

test.describe('Checkout Promo Codes', () => {
  let checkout: CheckoutPage;

  test.beforeEach(async ({ page }) => {
    await loginAsUser(page, 'customer');
    await enableFeatureFlag(page, 'PROMO_V2');
    await seedCart(page, [BASIC_ITEM]);
    checkout = new CheckoutPage(page);
    await checkout.goto();
  });

  test('applies valid promo code and shows discount', async () => {
    await checkout.applyPromoCode(VALID_PROMO.code);
    await checkout.waitForPromoResponse();
    await expect(checkout.getDiscountAmount()).resolves.toBe(VALID_PROMO.discount);
  });

  test('rejects expired promo code with error', async () => {
    await checkout.applyPromoCode(EXPIRED_PROMO.code);
    await checkout.waitForPromoResponse();
    await expect(checkout.getPromoError()).resolves.toBe('Promo code expired');
  });
});
```
Every improvement traces to a specific context source:

- `data-testid` selectors baked into the Page Object methods (from the style guide)
- `CheckoutPage` page object with `applyPromoCode` and `waitForPromoResponse` (from the injected source code)
- `loginAsUser` helper instead of an inline login flow (from the style guide listing available helpers)
- `seedCart` call in `beforeEach` so the checkout page has items (from the style guide's test data rule)
- `enableFeatureFlag` for `PROMO_V2` (from failure patterns noting the flag requirement)
- `VALID_PROMO` and `EXPIRED_PROMO` constants from fixtures, not hardcoded strings (from the fixture file)
- No race condition on the promo API, because `waitForPromoResponse` is built into the Page Object (which the model read)
Same model. Same request. I've watched this play out on real projects. The first test gets deleted. The second test gets merged with one comment in code review.
Making It Automatic
Context loading should be invisible. Every engineer on the team should get it without copy-pasting files into a chat window.
The CLAUDE.md file we added earlier does this. The agent reads it on every turn, loads the relevant context, and generates tests using your team's actual patterns. No copy-pasting. No per-engineer prompt variation.
I added a file like this to my own project and watched prompt variation across the team drop to zero overnight.
If you've read Components-as-Docs, this will look familiar. That post built a .docs/ folder with structured component references. The index file here follows the same principle: put the map where the agent can see it, and the agent will find what it needs.
Agentic Memory took a similar approach with .ci/memory/, giving the pipeline persistence across runs. Same pattern, different data.
Keep These in Mind
Treat context files like code. They need the same maintenance rigor as any other file in your repo. If your naming conventions changed six months ago and the style guide still says otherwise, the model will follow the outdated rules. Review your context files during sprint retrospectives or whenever patterns shift. We'll cover maintenance strategies in a future post.
Build repeatability into generation. Raw prompts produce inconsistent output. Agent capabilities like Claude Code skills let you encode your generation workflow once and run it the same way every time. The less you rely on one-off prompts, the more consistent your tests become.
Make context engineering your default. A fully loaded prompt with style guide, Page Object, examples, and failure patterns uses more tokens than a bare request. That's the point. Direct prompts without context produce tests you'll rewrite by hand. Context-engineered prompts produce tests that match your team's patterns from the start. Stop treating context as optional overhead. It's the foundation.
Don't fight the context window. Guide the agent to what it needs. You can't load 200 page objects into a single call. You don't need to. Command-line agents like Claude Code are built for file discovery. They can grep, search, and read files on their own. Your job is to give them a map. A folder structure guide inside your context files, or a Mermaid diagram that shows where your page objects, helpers, fixtures, and API docs live, turns the agent from guessing into navigating. Instead of stuffing everything into one prompt, point the agent at the right directories and let it pull what it needs. We'll cover techniques like Mermaid-based test scope maps in upcoming posts.
Context maintenance is test maintenance. When a redesign breaks 40 tests, the same context assembly helps an LLM suggest fixes faster, because it already knows your selector strategy and your patterns. Keeping context files current pays off in both generation and repair.
Start Here
- Create `.ai/` in your repo.
- Write `test-style-guide.md`. Even 15 lines of your team's conventions (selectors, structure, auth patterns) will change the output.
- Add one example test from your best engineer.
- Pick one component. Paste the style guide, the example, and the component source into Claude. Generate a test.
- Open a fresh chat with no context. Ask the same question. Run both tests and compare.
The model is the engine. Context is the fuel. You've been tuning the engine. Time to look at the fuel.
What's Next
This is the first post in a series on context engineering for test automation. Each post builds on the last, and each one ships a working tool.
Here's where we're headed:
- Your E2E Tests Lose Memory After Every Step. We'll build session-aware test agents that remember the full user journey, not just the step that failed.
- Your CI Already Has the Data. Teach It to Remember. Raw CI failure logs contain patterns your team rediscovers every few sprints. We'll build a system that extracts, stores, and surfaces them automatically.
- Teaching Your Test Agent to Remember How, Not Just What. Knowing that a test failed is one thing. Knowing how your team fixed it last time is what turns flaky test debugging from hours into minutes.
- Five Sources of Context. One Test. Zero Guesswork. We'll build a context assembly engine that pulls from style guides, test history, source code, failure patterns, and known issues into one token-budgeted prompt.
Beyond that: multi-agent orchestration, context health monitoring, Mermaid-based test scope maps and more.
Enjoyed this article?
Get in touch to discuss how AI-powered testing can transform your QA processes.