Engineering

From Context Engineering to Harness Engineering

April 16, 2026

TL;DR

  • 2023 was prompt engineering. 2024–2025 was context engineering. 2026 is harness engineering.
  • A harness is the runtime that decides what context, tools, and memory the model sees on every turn. The model itself is the smallest part of an LLM app.
  • A harness has three moving parts: a knowledge engine (search + memory), a tool retriever (load only what the turn needs), and a session controller (state across turns).
  • Vector DBs built for the context-engineering era don't fit. They charge a flat monthly minimum even when your harness is asleep. We built LambdaDB so an idle harness costs $0.
  • There's a ~40-line Python harness at the bottom. Knowledge engine, tool retrieval, session state. One file.

What is harness engineering?

The runtime around the model (memory, tool routing, retrieval, session state) treated as the primary unit of work. The model is interchangeable. The harness is what your users feel.

Prompt engineering tuned the words. Context engineering tuned the window. Harness engineering tunes the system.

The shift, in concrete terms

The same question keeps coming up on Hacker News, r/LangChain, and r/SaaS: why are AI agents still stateless? Teams ship a tuned prompt, a tuned retrieval pipeline, and the agent still forgets across sessions and burns tokens re-loading the same context.

The unit of work moved. It used to be the prompt. Then the context window. Now it's the harness: the system that wraps the model and decides, on every turn, what the model gets to see.

The rigor that lived in clever prompts now lives in the runtime. The model became the most replaceable piece.

Era                   Year        Unit of Work         What you tuned
Prompt Engineering    2023        The prompt           Wording, few-shot examples
Context Engineering   2024–2025   The context window   Chunking, retrieval, compression
Harness Engineering   2026+       The runtime          Memory, tool routing, session state

If you're still in 2024 thinking "I just need better RAG," you'll lose to teams treating the harness as the product.

What is a harness, exactly?

A small program that runs around the model. Every turn, it looks up relevant memory from past sessions, picks which tools are available right now, assembles a context window that fits the budget, calls the model, and persists what just happened so the next turn can use it.
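Sketched as code, that loop is only a few lines. The helpers below are placeholders, not a real API; the working LambdaDB version is at the bottom of the post.

# The per-turn loop, reduced to its shape. Every helper is a stub standing in
# for real work; see the full harness at the end of the post.
def recall(user_id, message): return []             # memory from past sessions
def select_tools(message, k): return []             # tools relevant to this turn
def assemble(memories, budget_tokens): return ""    # context that fits the budget
def call_model(context, tools, message): return ""  # the interchangeable part
def persist(user_id, message, reply): pass          # write back for the next turn

def turn(user_id: str, message: str) -> str:
    memories = recall(user_id, message)
    tools = select_tools(message, k=3)
    context = assemble(memories, budget_tokens=4_000)
    reply = call_model(context, tools, message)
    persist(user_id, message, reply)
    return reply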

Claude Code is a harness. Cursor is a harness. The agents you're shipping at work are harnesses. The model inside is interchangeable. Sonnet 4.6, Opus 4.7, GPT-5, whatever ships next month. Users never feel the model. They feel the harness.

The three components

The same three components show up in every agent codebase I've read.

1. The knowledge engine (memory + retrieval)

Where the agent remembers. Past conversations, indexed documents, tool descriptions, code graphs. All of it lives here and gets queried per turn.

The naive version is "vector DB + embeddings." That breaks fast. Pure vector search returns "Error 222" as "close enough" for "Error 221." You need hybrid search: vectors for meaning, full-text for exact matches, filters for scoping.

A common refrain from the community:

"If your use case required exactness (like searching for 'Error 221' in a manual) a pure vector search would gleefully serve up 'Error 222' as 'close enough,' which is cute in a demo but catastrophic in production."

That's a knowledge engine problem, not a model problem. It's also exactly why the agent-memory threads on r/LangChain and r/LocalLLaMA keep circling back to "FTS + Dense + filter" stacks.
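Concretely, a hybrid lookup for "Error 221" is one call: full-text for the exact token, vectors for semantic neighbors, a filter to scope it, fused with reciprocal rank fusion. This sketch assumes the memory collection and embed() helper from the full example at the bottom of the post.

# Hybrid search in one query: exact-match full-text + semantic kNN, fused with
# RRF and scoped to a single user. Assumes the `memory` collection and embed()
# helper defined in the full harness below.
results = memory.query(
    size=5,
    query={
        "rrf": [
            {"queryString": {"query": "Error 221", "defaultField": "text"}},          # exact match
            {"knn": {"field": "vector", "queryVector": embed("Error 221"), "k": 5}},  # semantic
        ],
    },
    filter={"queryString": {"query": "user_id:alice"}},  # per-user scoping
)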

2. The dynamic tool retriever

If your agent has 100 tools, you can't shove 100 tool descriptions into the context window. Tokens cost real money, and the model gets worse with more options anyway. Paradox of choice applies to LLMs too.

The pattern that works: index your tool descriptions in the same knowledge engine, then retrieve the top 3–5 per turn based on the user's intent.

100 tools. 3 loaded. Zero wasted tokens.
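Getting tools into the knowledge engine is the same upsert you'd use for memory. A sketch, using the agent-tools collection from the example below; the web_search tool and its spec are made up for illustration:

# Register one tool: the description gets embedded for retrieval, and the full
# spec (Anthropic tool format) rides along so it can be passed straight to the
# model. "web_search" is an illustrative example, not a real tool here.
description = "web_search: search the web for current information on a topic"
tools.docs.upsert(docs=[{
    "id": "tool-web-search",
    "text": description,
    "vector": embed(description),
    "spec": {
        "name": "web_search",
        "description": "Search the web for current information on a topic",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}])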

3. The session controller

Most teams underbuild this part. It decides what state survives the turn, what gets summarized vs. dropped vs. kept verbatim, and when a long-running session branches into a sub-task.

Anthropic's engineering posts on Claude Code call this compaction. Same thing. Harness engineering by another name.
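A toy version of the idea, to make it concrete; the thresholds and the summarize() call are placeholders, not how Claude Code actually does it:

# Minimal compaction: keep the most recent turns verbatim, collapse everything
# older into a running summary once the history outgrows a rough budget.
# summarize() stands in for a cheap model call; the numbers are arbitrary.
def compact(history: list[str], keep_verbatim: int = 10, budget_chars: int = 20_000) -> list[str]:
    if sum(len(turn) for turn in history) <= budget_chars:
        return history                                 # under budget: keep everything
    older, recent = history[:-keep_verbatim], history[-keep_verbatim:]
    summary = summarize("\n".join(older))              # compress the old turns
    return [f"Summary of earlier session: {summary}"] + recent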

Why your vector database is the cost driver

Here's the part that frustrates me as someone building this.

Most vector databases (Pinecone, Qdrant Cloud, Weaviate Cloud) were designed for the context-engineering era. The mental model: index your docs, retrieve at query time, done. Pricing followed: a flat monthly minimum, a server always running, capacity provisioned for peak.

Now imagine you're building harnesses. A harness might be one user's personal agent. It runs for 30 seconds when the user types something, then sleeps for 4 hours. You want hundreds of these, each with isolated memory.
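The duty cycle on that workload is brutal for always-on pricing. Quick arithmetic on the numbers above:

# One harness: ~30 seconds of work, then ~4 hours of silence.
active_seconds = 30
idle_seconds = 4 * 60 * 60
duty_cycle = active_seconds / (active_seconds + idle_seconds)
print(f"{duty_cycle:.2%}")  # ≈ 0.21% — an always-on bill charges you for the other ~99.8%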

On Pinecone, you're paying the $50/mo account minimum the moment you turn it on, plus storage and query costs that don't sleep when your users do. The pricing assumes a workload that's up. Harnesses spend most of their life down.

On LambdaDB, $0 when idle. You pay per query. A sleeping harness costs nothing.

This isn't a marketing line. It's an architecture fact. We run entirely on serverless components. There is no server to keep warm. When your harness wakes up, the DB wakes up. When it sleeps, the DB sleeps.

The pain shows up in public. From DEV Community and HN this past year:

"Pinecone's new $50/mo minimum just nuked my hobby project."
"I would much rather spin up a dedicated machine with a lot of memory than pay some of the wildly high fees for a Vector DB otherwise."
"Managed services start cheap but costs skyrocket after vendor lock-in. Got bitten by this."

These users were building harnesses on context-engineering infrastructure. The economics break.

A minimal harness in ~40 lines of Python

A harness doing the three jobs. Memory, tool retrieval, session state. LambdaDB as the knowledge engine, Anthropic's SDK for the model.

import os
from anthropic import Anthropic
from lambdadb import LambdaDB

db = LambdaDB(
    project_api_key=os.environ["LAMBDADB_API_KEY"],
    base_url=os.environ["LAMBDADB_BASE_URL"],
    project_name=os.environ["LAMBDADB_PROJECT"],
)
llm = Anthropic()

memory = db.collection("agent-memory")  # pre-created with text+vector+user_id fields
tools  = db.collection("agent-tools")   # pre-created with text+vector+spec fields

def embed(text: str) -> list[float]:
    # your embedding call. Returns a 1536-dim vector
    ...

def harness(user_id: str, message: str) -> str:
    qv = embed(message)

    # 1. Knowledge engine: hybrid search over user's memory
    past = memory.query(
        size=5,
        query={
            "rrf": [
                {"queryString": {"query": message, "defaultField": "text"}},
                {"knn": {"field": "vector", "queryVector": qv, "k": 5}},
            ],
        },
        filter={"queryString": {"query": f"user_id:{user_id}"}},
        consistent_read=True,
    )

    # 2. Tool retriever: only load tools relevant to this turn
    relevant_tools = tools.query(
        size=3,
        query={"knn": {"field": "vector", "queryVector": qv, "k": 3}},
        consistent_read=True,
    )

    # 3. Session controller: assemble context within budget
    context = "\n".join(f"- {hit.doc['text']}" for hit in past.docs)
    tool_specs = [hit.doc["spec"] for hit in relevant_tools.docs]

    response = llm.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=f"You are a helpful agent.\n\nMemory:\n{context}",
        tools=tool_specs,
        messages=[{"role": "user", "content": message}],
    )

    # Persist what happened
    memory.docs.upsert(docs=[{
        "id":      f"{user_id}-{response.id}",
        "text":    f"User: {message}\nAgent: {response.content[0].text}",
        "vector":  embed(f"{message} {response.content[0].text}"),
        "user_id": user_id,
    }])

    return response.content[0].text

Two search calls, one model call, one upsert. Roughly forty lines, because the database does the work that used to live in glue code.
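The embed() stub is the one piece left to you. One way to fill it in, assuming OpenAI's embeddings endpoint; any provider works as long as the vector dimension matches the collection:

# A possible embed() implementation. text-embedding-3-small returns 1536-dim
# vectors, matching the stub's contract; swap in your own provider freely.
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding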

Three details worth noticing. The hybrid query is a single call: the rrf block fuses full-text (queryString) and vector (knn) results in one request, where most vector DBs make you stitch the two together yourself. The user_id filter is how you scope memory per user; each user gets their own slice, and idle users cost $0 because there's no per-namespace fee and no always-on server. And tool retrieval is the same search call as memory retrieval. Tools are just data.
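Running it is one call per user message. A hypothetical two-turn session, with a made-up user ID:

# Turn one writes memory; turn two can retrieve it, because harness() persists
# the exchange before returning.
print(harness("user-42", "We hit Error 221 deploying the billing service."))
print(harness("user-42", "What was that error code from the billing deploy?"))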

What harness engineering changes about how you build

Stop thinking about "the database" as separate from "the agent." The DB is part of the agent's runtime. Every retrieval call is a thought. Slow DB, slow agent. Expensive DB, expensive agent.

Hybrid search isn't a feature. It's the default. Pure vector loses to grep on exact-match queries. Pure full-text loses to embeddings on semantic queries. You need both, one query, every time.

$0 idle is a hard requirement, not a nice-to-have. Harnesses are bursty by nature. User types, agent works, then nothing for hours. If your infra charges you for the silence, you can't scale past a handful of users without raising a Series A to pay your DB bill.

Branching is the next big primitive. When a parent agent spawns a child for a sub-task, you don't want to re-index the parent's knowledge. You want to fork it instantly, let the child mutate its copy, and merge or discard. We built zero-copy branching for exactly this.

FAQ

Is harness engineering just a rebrand of context engineering? No. Context engineering optimizes a single context window. Harness engineering optimizes the runtime that produces that window. Memory, tool selection, session state across turns.

Do I still need a vector database? You need a knowledge engine, not just a vector DB. Hybrid search (vector + full-text + filter), per-user scoping, and $0 idle pricing are the new requirements. Most legacy vector DBs check only the first box.

How is LambdaDB different from Pinecone for agent memory? Pinecone's pricing assumes an always-on workload. Flat monthly minimum plus storage and query fees that keep ticking when nobody's using your agent. LambdaDB charges per query and goes to $0 when idle. Hybrid query is one call, not three. Zero-copy branching ships in the box for sub-agents.

Where do I start if I'm migrating from context engineering? Rewrite your retrieval layer first. Replace ad-hoc vector calls with a single hybrid search per turn, scoped by user_id. The 40-line harness above is a working starting point.

Where to start

If you're on a context-engineering vector DB and starting to feel the pinch, the migration is shorter than you think. The harness pattern is small enough to rewrite in an afternoon.

The fastest path: Build agent memory in 15 minutes. The harness pattern, end to end.

Going deeper: Quickstart · Hybrid search guide · LambdaDB vs Pinecone

Sign-up takes 30 seconds. Pay-as-you-go per query, $0 when your agent sleeps.

The model is the smallest part. The runtime is the product. Your harness deserves a knowledge engine, not a vector DB built for the last era.