Engram Blog · Published April 11, 2026
Building an AI memory service on Cloudflare Workers
Why we picked D1 + Vectorize + Workers AI over the obvious stacks for Engram, what the latency math looks like, and where the edges of each service actually are.
When we started building Engram, the obvious stack was Node + Postgres + pgvector + OpenAI embeddings, fronted by a container on Fly or Render. It would have worked. It's also what roughly everyone else in the AI-memory space has already built. We picked something else, and nine months later the decision has paid for itself in latency, operational surface, and the monthly AWS bill we don't have. This post is the case for why.
Engram stores verbatim conversation transcripts and makes them searchable by meaning, over the Model Context Protocol. In practice that means three things have to be fast and cheap: writing messages, embedding them, and retrieving the nearest neighbors at query time. The stack we landed on:
- Cloudflare Workers for compute (Hono.js, TypeScript)
- Cloudflare D1 (SQLite at the edge) for messages and metadata
- Cloudflare Vectorize for the semantic index
- Cloudflare Workers AI (bge-base-en-v1.5) for embeddings
Everything is one Worker. No separate embedding microservice, no vector database to operate, no Redis, no queue. The whole data path — ingest, embed, write, index, retrieve — happens inside a single Cloudflare data center, using internal RPC bindings instead of HTTP. That constraint turns out to be the single most important thing about the system.
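For orientation, here is roughly what that looks like from inside the Worker: every dependency is a typed binding on the environment object, not a URL. This is a sketch — the binding names (DB, VECTORIZE, AI) match the snippets later in this post, but they're just names chosen in the Wrangler config; the types come from @cloudflare/workers-types.

// The entire backend surface of the service, as seen from inside the Worker.
// Each field is an in-datacenter RPC binding — no hostnames, no keys, no connection pools.
export interface Env {
  DB: D1Database;            // D1: conversations and messages (source of truth)
  VECTORIZE: VectorizeIndex; // Vectorize: chunk-level embeddings
  AI: Ai;                    // Workers AI: bge-base-en-v1.5 for embedding text
}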
The obvious stack we didn't build
The default shape for a memory service, if you started from scratch tomorrow, is probably:
Client ──HTTPS──▶ Fly/Render container (Node)
                        │
                        ├──▶ Postgres + pgvector (Neon/Supabase/RDS)
                        ├──▶ OpenAI /v1/embeddings
                        └──▶ Redis (rate limiting / cache)

This isn't wrong. It has three properties that turn into problems at scale:
Every dependency is a network hop. Writing one message means: HTTPS from client to Node, TLS handshake to Postgres, TLS handshake to OpenAI, TLS handshake to Redis. Four round trips, each with its own latency floor, each counted against your p99. OpenAI's embedding endpoint alone is a 60–200 ms round trip from most regions. That's before you do any actual work.
Cold starts are real. Node containers take seconds to boot on first request after idle. Postgres connection pools have to warm up. pgvector indexes have to be loaded into shared_buffers. In practice you keep a container warm all the time, which means you pay for idle capacity 24/7 whether anyone is using the service or not.
Ops is never free. Postgres + pgvector needs someone thinking about autovacuum tuning, index bloat, replication, point-in-time recovery, and (sooner or later) the moment the memory index outgrows one machine and you're looking at sharding strategies. Multiply by Redis, by the container runtime, by the secret rotation for OpenAI keys. Every line item is a paging risk.
You can make all of this work. Plenty of teams do. Our bet was that if we could put the entire pipeline inside one Cloudflare data center and talk to every dependency via binding RPC, the latency, cost, and ops surface would collapse far enough that it would reshape what the product could be.
Why Workers instead of Node
Cloudflare Workers are not containers. They're V8 isolates — the same JavaScript engine that runs in Chrome, minus the browser. An isolate spins up in under a millisecond and is destroyed when the request finishes. There's no OS to boot, no runtime to warm, no connection pool to maintain. The execution model is closer to “serverless that actually means it” than anything AWS Lambda ever shipped.
The practical effects, in order of how much they matter:
- No cold start. First request after a long idle still returns in tens of milliseconds. We don't have a “keep a dyno warm” line item.
- No idle cost. When nobody is using Engram, we pay for nothing but the data at rest. That's a very different cost curve than a container that has to stay hot because cold starts would embarrass us.
- Bindings beat HTTP. Calling env.DB.prepare(...).run() is not an HTTP request to D1. It's an RPC message routed internally by the Workers runtime to the D1 service in the same data center. No TLS handshake, no public internet, no authentication overhead. Same story for Vectorize and Workers AI. A write that would have been four TLS handshakes and 60+ ms of latency on a vanilla Node stack is single-digit milliseconds inside a Worker. (A sketch of the single-Worker shape follows this list.)
- Deploys are instant. wrangler deploy takes 10 seconds and the new version is live in all 300+ data centers. Container builds, ECR pushes, and ECS rolling updates are just... not in our operational vocabulary.
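For concreteness, here's a minimal sketch of what that single-Worker shape looks like with Hono. The route paths and handler bodies are illustrative, not our actual router; the binding calls each step makes are shown in the sections that follow.

import { Hono } from "hono";

// One Worker, one Hono app; every dependency is reached through env bindings,
// so a request never leaves the data center it landed in.
const app = new Hono<{ Bindings: Env }>();

app.post("/messages", async (c) => {
  const { conversationId, role, content } = await c.req.json();
  // Ingest path: write to D1, embed via Workers AI, upsert into Vectorize.
  return c.json({ ok: true }, 201);
});

app.get("/search", async (c) => {
  const q = c.req.query("q") ?? "";
  // Retrieval path: embed the query, query Vectorize, hydrate from D1.
  return c.json({ q, results: [] });
});

export default app;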
The tradeoff: Workers have hard limits. 128 MB of memory per isolate, capped CPU time per request (10 ms on the free plan, 30 seconds on paid), no long-running background work, no native binaries, no filesystem. These limits force you into a specific shape — do one thing per request, do it quickly, persist state in bindings, and never try to be clever with in-memory caches. That shape happens to be exactly what a memory service needs.
Why D1 instead of Postgres
D1 is SQLite running as a managed service, colocated with your Worker. Every query is a binding call, not a network hop. We're using it for the core conversation and message tables — the structured metadata that sits next to (but not inside) the vector index.
The motivating property of D1 is that it's in the same data center as your Worker. A query that would be a TCP round trip to a Postgres primary in us-east-1 is a microsecond-scale RPC inside a Cloudflare PoP. For a write-heavy workload like message ingest, that's a ten-to-one latency win. For a read-heavy workload like paginated history fetches, the win is the same.
What you give up: D1 is SQLite, which means no native JSON operators, no stored procedures, no advisory locks, no pub/sub, no LISTEN/NOTIFY, no extensions. There are also real size limits (10 GB per database on the current tier) and query-shape limits (no arbitrary CTE depth, no full-text search extensions). You work around this by (a) keeping your schema simple, (b) doing the clever stuff in the Worker, and (c) sharding by organization if you ever get close to the size cap. We're not close.
The schema is almost embarrassingly plain:
CREATE TABLE conversations (
  id         TEXT PRIMARY KEY,
  org_id     TEXT NOT NULL,
  title      TEXT,
  created_at INTEGER NOT NULL
);

CREATE TABLE messages (
  id              TEXT PRIMARY KEY,
  conversation_id TEXT NOT NULL,
  role            TEXT NOT NULL,
  content         TEXT NOT NULL,
  created_at      INTEGER NOT NULL,
  FOREIGN KEY (conversation_id) REFERENCES conversations(id)
);

CREATE INDEX idx_messages_conv ON messages(conversation_id);
CREATE INDEX idx_conversations_org ON conversations(org_id);
That's it. The vectors live in Vectorize, not in D1. D1 is the source of truth for the raw text; Vectorize is the search index pointed at it. Keeping those separate turned out to simplify nearly everything downstream.
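For illustration, the D1 half of ingest against that schema is a single prepared statement. This is a sketch — ID generation, validation, and error handling are elided, and the helper name is ours:

// Write one message. prepare/bind/run is an in-datacenter RPC, not a pooled TCP connection.
async function insertMessage(
  env: Env,
  msg: { id: string; conversationId: string; role: "user" | "assistant"; content: string }
): Promise<void> {
  await env.DB.prepare(
    `INSERT INTO messages (id, conversation_id, role, content, created_at)
     VALUES (?, ?, ?, ?, ?)`
  )
    .bind(msg.id, msg.conversationId, msg.role, msg.content, Date.now())
    .run();
}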
Why Vectorize instead of pgvector or Pinecone
We looked at three options for the vector index: pgvector (inside the same Postgres we didn't build), a dedicated hosted vector DB like Pinecone or Qdrant Cloud, and Cloudflare Vectorize. We picked Vectorize for the same reason we picked D1 — it's a binding away from the Worker, which means query latency is a function of the vector math, not the network.
Vectorize is less flexible than pgvector or Qdrant. It supports one distance metric per index (we use cosine), fixed dimensionality (768 for us, matching bge-base-en-v1.5), metadata filters with a small set of operators, and a topK ceiling. There's no HNSW tuning dial, no ef_search knob, no hybrid BM25+vector scoring. If you want those, you're going to end up somewhere else. What you get in exchange is that an index query looks like this:
const results = await env.VECTORIZE.query(queryVector, {
topK: 10,
filter: { org_id: orgId },
returnMetadata: "all",
});

One line. No connection. No HTTP. No keys. No auth handshake. The results come back with the message IDs in the metadata, and we hydrate the full message text from D1 in the next step. That whole pipeline — embed query, query Vectorize, hydrate from D1, return to the MCP client — lands in under 100 ms at the p50 for a typical workload.
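The hydration step is one more D1 query keyed off those IDs. This is a sketch that assumes each vector's metadata carries its chunk's message IDs under a message_ids key — the field name and helper are illustrative, not the exact shape of our index:

// Pull the verbatim message text for the top-K matches back out of D1.
async function hydrate(env: Env, results: VectorizeMatches) {
  const ids = results.matches.flatMap((m) => (m.metadata?.message_ids as string[]) ?? []);
  if (ids.length === 0) return [];
  const placeholders = ids.map(() => "?").join(", ");
  const { results: rows } = await env.DB.prepare(
    `SELECT id, conversation_id, role, content, created_at
     FROM messages
     WHERE id IN (${placeholders})
     ORDER BY created_at`
  )
    .bind(...ids)
    .all();
  return rows;
}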
The operational story is the part that actually sold us. Vectorize has zero tunables, which sounds like a downside until you realize that “zero tunables” is another way of saying “zero three-a.m. pages about ef_search.” We've never debugged a Vectorize performance issue. We may eventually. When we do, the fix will probably involve moving to a dedicated vector DB. That's a good future problem.
Why Workers AI for embeddings
Embeddings are the least glamorous part of a memory service and the most expensive if you do it wrong. Every message ingested has to be embedded. Every search query has to be embedded. Doing that via the OpenAI API means a 60–200 ms round trip and a per-token bill. At Engram's ingest volumes, that adds up fast.
Workers AI is Cloudflare's inference-as-binding. It runs a set of open models — including @cf/baai/bge-base-en-v1.5, which is what we use — on GPUs in the same data centers as the Workers runtime. Calling it is a binding, not an API call:
const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
text: [message.content],
});
const vector = data[0]; // 768-dimensional float array

Latency is single-digit to tens of milliseconds. Pricing is per 1M neurons, not per token, which for an embedding workload comes out to roughly an order of magnitude cheaper than OpenAI at our volume. And the privacy story is straightforward: messages never leave Cloudflare's infrastructure, which matters a lot for the “my-company-chat-goes-into-memory” use case.
The model choice deserves its own paragraph. We picked bge-base-en-v1.5 because (a) Workers AI hosts it, (b) 768 dimensions is a sweet spot for Vectorize storage cost, and (c) on our internal retrieval benchmarks it was indistinguishable from text-embedding-3-small on realistic conversation chunks and slightly better on short query-against-long-chunk retrieval, which is the shape most MCP memory searches take. If a better model ships on Workers AI tomorrow, switching is a one-line change and a reindex.
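To keep that switch literally a one-line change, the model ID lives in exactly one place. A sketch of the kind of helper we mean — the name and signature are ours, not a Workers AI API:

// The only line that knows which embedding model we use. Swapping models means
// changing this constant and reindexing existing vectors.
const EMBEDDING_MODEL = "@cf/baai/bge-base-en-v1.5";

// Embed one or more texts; returns one 768-dimensional vector per input.
async function embed(env: Env, texts: string[]): Promise<number[][]> {
  const { data } = await env.AI.run(EMBEDDING_MODEL, { text: texts });
  return data;
}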
The verbatim design choice
The one non-obvious product decision is worth calling out because it shaped the stack more than any other single choice: Engram stores full message text, not extracted summaries or knowledge-graph nodes.
Every other memory product we looked at extracts facts from conversations and discards the original text. “User prefers Postgres for nested documents” is a fact. “I tried Postgres 16 but the JSONB GIN indexes were 30% slower than MongoDB for our specific nested-document query shape, we might revisit when 17 ships” is the actual conversation that produced that fact. The facts are compact. The conversations are what you actually want to retrieve six months later when someone asks “should we revisit?”
Storing verbatim is expensive on naive stacks. You're paying for gigabytes of raw text instead of megabytes of structured facts. But the Workers + D1 + Vectorize combination makes it cheap enough that the product-side argument wins on its own: D1 stores text at roughly the cost of any SQLite deployment, Vectorize only indexes the chunk-level embeddings (not the text), and bandwidth inside Cloudflare is free. The storage bill for a million messages is not a line item we think about.
Chunking happens on a 5-message window with a 1-message overlap, which gives us retrievable units that are small enough to be precise and large enough to carry context. Each chunk gets one embedding, stored in Vectorize with metadata pointing at the conversation and the message range. At query time, we embed the query, fetch the top-10 chunks, hydrate the full message ranges from D1, and return them to the MCP client. The client sees verbatim text, not a summary. The model on the other side decides what matters.
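A sketch of that chunk-and-index step under the parameters above. The metadata fields (conversation_id, message_ids) are our illustrative shape, and in production this runs as part of ingest rather than as a standalone pass:

const WINDOW = 5;              // messages per chunk
const STRIDE = WINDOW - 1;     // leaves a 1-message overlap between adjacent chunks

// Slide the window over the conversation, embed each chunk,
// and upsert one vector per chunk that points back at the rows in D1.
async function indexConversation(
  env: Env,
  conversationId: string,
  messages: { id: string; content: string }[]
): Promise<void> {
  if (messages.length === 0) return;
  const vectors: VectorizeVector[] = [];
  for (let start = 0; start < messages.length; start += STRIDE) {
    const chunk = messages.slice(start, start + WINDOW);
    const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: [chunk.map((m) => m.content).join("\n")],
    });
    vectors.push({
      id: `${conversationId}:${start}`,
      values: data[0],
      metadata: { conversation_id: conversationId, message_ids: chunk.map((m) => m.id) },
    });
    if (start + WINDOW >= messages.length) break; // last window already covered the tail
  }
  await env.VECTORIZE.upsert(vectors);
}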
What it actually costs
Rough numbers for a typical Engram workload (one team, ~50k messages/month, moderate search):
- Workers requests: a few dollars / month. The free plan alone covers the first 100k requests/day.
- D1: under $5/month at this volume. Reads and writes are cheap; storage scales linearly with messages.
- Vectorize: the biggest single line item, but still low-ten-dollars per month for this workload. Scales with stored vectors and queries.
- Workers AI: pennies per 10k embeddings. For a team doing ~50k ingests and ~5k searches per month, this is a rounding error.
The whole stack — compute, database, vector index, inference — comes in under $20/month at this scale. We've done the math on what the equivalent Postgres + pgvector + OpenAI stack would cost on Fly or Render and it's roughly 5–10x that, mostly driven by the always-warm container, the hosted Postgres bill, and OpenAI embedding costs. Your mileage will vary; the point is that the cost curve is flat enough that we can run the free tier without feeling it.
Where the edges are
In the interest of not writing a Cloudflare ad, here's an honest list of what the stack doesn't do well:
D1 query complexity. If you need to join five tables with a recursive CTE and window functions, D1 will do it, but you will miss Postgres. We've stayed on relatively simple queries and moved the complicated work into the Worker.
Vectorize filter expressiveness. The metadata filter language is simple on purpose. You can do org_id = ? and created_at > ?, but if you want full-text filters or nested boolean logic across metadata, you'll find the edges. Our workaround is to over-fetch with a loose filter and then narrow in the Worker, which works fine at topK ≤ 50.
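A sketch of that over-fetch-and-narrow move. The allowedConversations set stands in for whatever predicate Vectorize's filter language can't express; the helper name is illustrative:

// Over-fetch with the loose filter Vectorize can evaluate, then apply the
// stricter predicate in the Worker before hydrating from D1.
async function searchNarrowed(
  env: Env,
  queryVector: number[],
  orgId: string,
  allowedConversations: Set<string>
) {
  const candidates = await env.VECTORIZE.query(queryVector, {
    topK: 50, // over-fetch
    filter: { org_id: orgId },
    returnMetadata: "all",
  });
  return candidates.matches
    .filter((m) => allowedConversations.has(String(m.metadata?.conversation_id)))
    .slice(0, 10); // narrow back down to the topK we actually wanted
}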
Background work. There's no long-running worker pattern. Queue-based ingestion uses Cloudflare Queues, which adds another binding but works fine. If we ever need truly long-running jobs (re-embedding an entire corpus after a model upgrade), we'll probably use Queues + a batch pattern rather than reaching for a different compute model.
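If we do reach for that, the shape would likely be a queue consumer that re-embeds in small, retryable batches. A sketch under the assumption of a queue whose messages carry a chunk ID and its text — all names illustrative, and in a real Worker this queue handler sits alongside the fetch handler in the same default export:

// Queue consumer: a corpus-wide re-embed becomes many short invocations,
// each one well inside Workers' per-request limits.
export default {
  async queue(batch: MessageBatch<{ chunkId: string; text: string }>, env: Env) {
    for (const msg of batch.messages) {
      const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
        text: [msg.body.text],
      });
      // Upsert replaces the whole vector, so in practice you'd re-attach metadata here too.
      await env.VECTORIZE.upsert([{ id: msg.body.chunkId, values: data[0] }]);
      msg.ack(); // acknowledge per message so a failure only retries what's left
    }
  },
};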
Observability. Cloudflare's observability story is good but newer than, say, Datadog on a vanilla container stack. We ship logs to Logpush and traces via Workers Traces. It's fine. If you're used to deep Datadog dashboards you'll notice the difference.
None of these have been deal-breakers. They're the list of things we'd want to know before making the same choice, because nothing in this post is novel — it's mostly “pick the serverless runtime that's been quietly getting better for five years and let it do its job.”
Taking it for a spin
Engram is hosted at getengram.app, free tier available with no self-hosting. The setup is one block of config in your MCP client — Claude Desktop, Claude Code, Cursor, Windsurf, Zed, or anything else that speaks MCP — and your agent has persistent memory. The architecture page has the same story with more diagrams and code; the whitepaper covers the product-side patterns this stack enables. The source is on GitHub if you want to see how any of the pieces above fit together in real code.
Written by the Engram team. Published April 11, 2026. Corrections and war stories: hello@getengram.app.