Torqon | Documentation

Introduction

Torqon gives your AI assistant a persistent memory that survives across conversations. It extracts atomic facts from messages, embeds them into a 1536-dimensional vector store, and retrieves the most semantically relevant ones before every LLM call — all in under 8ms.

<8ms

p95 retrieval latency

1536

vector dimensions
(text-embedding-3-small)

0.72

cosine similarity floor

3 lines

to integrate via MCP

Every message you send passes through a 6-stage pipeline: intent classification, context optimization, vector retrieval, context assembly, LLM generation, and async memory extraction. The retrieval and assembly stages add less than 10ms to your p95. Memory extraction runs fully async — it never blocks a response.

MCP Server

3 lines of config

Add Torqon to Claude Desktop's mcpServers block. Memory starts working immediately — no prompting required.

REST API

Any language, any runtime

POST to api.torqon.dev/chat with your API key. Works from Node, Python, curl — any HTTP client.

SDK coming soon

Native TypeScript & Python

Typed SDK with automatic retries, streaming support, and built-in conversation management.

Quickstart

Get persistent memory in your Claude Desktop setup in under 2 minutes. You need an API key from the dashboard — free tier requires no credit card.

01

Get your API key

Sign up at dashboard.torqon.dev and copy your key from the API Keys tab. Keys are prefixed tq_live_.

02

Add Torqon to Claude Desktop

Open your Claude Desktop config file (~/Library/Application Support/Claude/claude_desktop_config.json on macOS) and add the Torqon MCP server:

claude_desktop_config.json

// Add this to your existing mcpServers block
{
  "mcpServers": {
    "torqon": {
      "command": "npx",
      "args": ["-y", "@torqon/mcp@latest"],
      "env": {
        "TORQON_API_KEY": "tq_live_your_key_here"
      }
    }
  }
}

03

Restart Claude Desktop and you're done

Torqon begins extracting and storing facts from your conversations immediately. Tell Claude something about your project:

Example conversation

You:   I�m building a Next.js 14 app called Torqon with pgvector on Railway.
        We never use mocks in tests. Always a real database.

Claude: Got it. I�ve noted your stack and testing preference.

---  3 days later, new conversation ---

You:   How should I write the migration script?

Claude: For your Railway + pgvector setup, you�ll want to...
        ? Claude remembered your stack from 3 days ago

No prompting required. Torqon operates silently in the background. It decides what to store and what to retrieve automatically based on message intent.

Integration Paths

MCP Server

The easiest path. The MCP server exposes store_memory, retrieve_context, and auto_process tools that Claude calls automatically based on conversation content.

store_memory

Explicitly stores a fact. Claude calls this when you state something it should remember. The fact is embedded and written to your vector store.

Input: message string

retrieve_context

Runs a semantic search against your fact store. Returns up to 12 facts with cosine similarity > 0.72, ranked by relevance.

Input: query string → ranked facts

auto_process

Classifies intent, retrieves relevant memory, assembles context, and calls the LLM — the full pipeline in a single tool call.

Input: message string → response + traceId

optimize_context

Runs the Content Optimizer on a string. Returns a compressed version and the compression ratio. Useful for preprocessing large tool outputs before sending to the model.

Input: text string → compressed text + ratio

REST API

Call Torqon directly from any HTTP client. All endpoints live at https://api.torqon.dev. Authenticate with an API key header on every request.

REST API — send a message with memory

# Store a fact
curl -X POST https://api.torqon.dev/chat \
  -H "X-Api-Key: tq_live_••••••••••••••••" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "My project is called Helios. We use Rust for the backend.",
    "conversationId": "user-session-001",
    "storeOnly": true
  }'

# Ask a question — memory auto-retrieved
curl -X POST https://api.torqon.dev/chat \
  -H "X-Api-Key: tq_live_••••••••••••••••" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What language are we using for the backend?",
    "conversationId": "user-session-001"
  }'

Authentication

All endpoints except GET /, GET /health, and POST /auth/register require an API key. Pass it in the X-Api-Key header.

Header	Value	Required
`X-Api-Key`	`tq_live_<your_key>`	Yes — on all authenticated routes
`Content-Type`	`application/json`	Yes — for POST requests

✓ Valid request

POST /chat
X-Api-Key: tq_live_abc123
Content-Type: application/json

{"message": "..."}

✗ Missing key → 401

POST /chat
Content-Type: application/json

{"message": "..."}

→ {"error":"Missing X-Api-Key header.
    Register at /auth/register"}

POST /chat

The core endpoint. Runs the full pipeline: intent classification → context optimization → memory retrieval → context assembly → LLM generation → async memory extraction. Returns the LLM response with a trace ID for debugging.

Request Body

Field	Type	Required	Description
`message`	string	Yes	The user's message. Torqon classifies intent, optimizes context, and retrieves relevant memory before passing to the LLM.
`conversationId`	string	No	Groups messages into a conversation. If you mention a project name, Torqon auto-namespaces to `userId:projectName` so facts don't bleed across projects.
`storeOnly`	boolean	No	When `true`, skips the LLM call entirely and just queues the message for memory extraction. Useful for bulk-loading context. Returns immediately.
`rawFacts`	boolean	No	When `true`, skips the LLM call and returns the raw retrieved facts instead. Returns up to 5 facts with similarity ≥ 0.68, pipe-delimited as `type:content`.

Response

200 OK — standard response

{
  "response": {
    "content": "You're using Rust for the Helios backend.",
    "usage": {
      "promptTokens": 312,
      "completionTokens": 47,
      "totalTokens": 359
    }
  },
  "traceId": "trace_a1b2c3d4",
  "evaluation": {
    "baselineTokens": 1840,
    "optimizedTokens": 312,
    "tokenSavings": 83.0,
    "memoryUsed": 3,
    "avgSimilarity": 0.87,
    "latencyMs": 641
  }
}

Error codes

Status	Error	Cause
`401`	Missing X-Api-Key header	No `X-Api-Key` header present
`429`	Monthly retrieval limit reached	Free or Pro quota exhausted for this billing cycle
`429`	Monthly optimization limit reached	Context optimization quota exhausted
`500`	Internal pipeline error	LLM timeout or upstream failure

storeOnly vs rawFacts: Use storeOnly to load context in bulk without burning LLM tokens. Use rawFacts to retrieve memory programmatically — for example, to inject Torqon-retrieved facts into your own system prompt rather than letting Torqon call the LLM.

POST /auth/register

Creates a new user account and returns an API key. No authentication required. This is the only unauthenticated write endpoint.

Request

Request body

{
  "email": "you@example.com",
  "name": "Your Name"
}

Response

200 OK

{
  "apiKey": "tq_live_abc123...",
  "userId": "usr_xyz",
  "plan": "free"
}

Store your API key securely — it is only shown once on registration. If you lose it, generate a new one from the dashboard.

GET /health

Health check endpoint. No authentication required. Use this to verify the API is reachable before sending requests.

200 OK

{
  "status": "ok",
  "uptime": 48293
}

Request Pipeline

Every call to POST /chat runs through six stages. Stages 1–5 are synchronous and complete before the response is returned. Stage 6 (memory extraction) is fully asynchronous and never blocks.

01

Intent Classification

Determines whether the message is a retrieval query, a storage event, or both. Uses a regex heuristic cascade — if heuristics resolve intent with high confidence, the LLM fallback is skipped entirely. Project name detection automatically namespaces the conversation: "I'm building Helios" routes to userId:helios.

sync adds <1ms

02

Context Optimization

Messages longer than 30 tokens are routed to the Content Optimizer. The optimizer detects content type (JSON, code, or prose) and applies the appropriate compression strategy before passing the message to the retrieval and LLM stages. Short messages skip optimization entirely.

sync 30–200ms for prose (LLM), <5ms for JSON/code

03

Memory Retrieval

Generates a 1536-dimensional embedding of the (optimized) message using text-embedding-3-small. Runs a cosine similarity query against your fact store, filters results with similarity < 0.72, and returns up to 12 ranked facts. Global search is enabled for authenticated users — facts from all conversations under your account are considered.

sync p95 <8ms end-to-end

04

Context Assembly (CacheAligner)

Assembles the final prompt using the CacheAligner pattern: a single stable system prompt that never changes (ensuring prefix cache hits on every request), and retrieved facts injected into the user message as a [VERIFIED FACTS] block above the question. This structure lets the LLM provider cache the system prompt while facts remain dynamic.

sync adds <1ms

05

LLM Generation

Calls google/gemini-2.5-flash-lite via OpenRouter with the assembled context. Response token usage is tracked and returned in the evaluation block alongside baseline token count (what you would have spent without Torqon).

sync model-dependent latency

06

Async Memory Extraction

After the response is returned, a background event fires to extract and embed atomic facts from the message. This runs via an internal event bus — the user's response is never delayed by storage. Extraction, embedding, and PostgreSQL write typically complete in 3–5 seconds. Immediate re-query within this window may miss freshly stored facts.

async 3–5s after response

Memory Model

Torqon does not store conversation history. It extracts the smallest independently reusable unit of knowledge — an atomic fact — and stores each one as a separate embedding. This means retrieval precision holds even as your fact store grows to thousands of entries.

What gets stored

When you say "I'm building a Next.js 14 app called Helios with pgvector on Railway", Torqon extracts three separate facts:

project_name: "helios"
tech_stack: "Next.js 14"
tech_stack: "pgvector on Railway"

Each fact has its own embedding. Changing one fact only supersedes that specific embedding — not the entire conversation.

Fact supersession

When a new fact contradicts an existing one, the old fact is soft-deleted via the is_current_fact boolean flag — not physically removed.

Old facts are preserved as an audit trail
Retrieval filters WHERE is_current_fact = true
Stale data never surfaces in answers
You can replay any historical state

Storage schema

facts table — simplified schema

CREATE TABLE facts (
  id              UUID       PRIMARY KEY,
  user_id         TEXT       NOT NULL,
  conversation_id TEXT       NOT NULL,
  content         TEXT       NOT NULL,
  fact_type       TEXT       NOT NULL,
  embedding       VECTOR(1536),
  is_current_fact BOOLEAN    DEFAULT true,
  confidence      FLOAT,
  created_at      TIMESTAMPTZ DEFAULT now()
);

-- pgvector cosine distance index
CREATE INDEX facts_embedding_idx
  ON facts USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

Fact Types

Every stored fact is tagged with a type. Types inform retrieval priority and how the fact is presented in the assembled context. Torqon auto-classifies fact type during extraction.

Type	What it captures	Example	Retrieval priority
project_name	The canonical name of the active project. Used to namespace all other facts.	`"helios"`	High — scopes retrieval
tech_stack	Technologies, frameworks, databases. Each technology is stored as a separate fact so changing one doesn't invalidate others.	`"pgvector on Railway"`	High
goal	Active objectives and milestones. When a goal is completed or changed, the old fact is superseded.	`"Ship v1 by Q1 2026"`	Medium
preference	Working constraints, style rules, team conventions. Elevated retrieval priority — violating preferences wastes entire sessions.	`"No mock databases in tests"`	High
decision	Architectural decisions with their rationale. Prevents relitigating solved problems.	`"Use atomic extraction over conversation summaries"`	Medium
context	Background knowledge that shapes how the model should respond — expertise level, role, domain knowledge.	`"Senior backend engineer, new to React"`	Medium

Content Optimizer

Messages longer than 30 tokens are compressed before being passed to retrieval and the LLM. The optimizer detects content type and routes to the appropriate compressor — only prose reaches the LLM compressor, so JSON and code compression adds under 5ms.

Content type	Detection signal	Method	Typical savings	LLM call?
JSON / arrays	Starts with `{` or `[`	Structural — `JSON.parse` → `JSON.stringify`. Arrays > 10 items are sampled: first 3, last 2, with a `…` marker.	20–60%	No
Source code	Contains `function`, `import`, `const`, `def`, `=>`, etc.	Comment stripping — removes `//`, `#`, `/* */` lines and blank lines.	15–40%	No
Prose	Everything else	LLM compression via Gemini 2.5 Flash Lite. Target: under 20% of original token count. Output is a single dense line — names, numbers, decisions, constraints only.	60–85%	Yes

Real example — prose compression

Input — 187 tokens

"We're building a SaaS platform for real-time inventory management. The backend is Node.js with Fastify running on Railway. We use PostgreSQL with pgvector for embeddings. The frontend is Next.js 14 deployed on Vercel. Authentication is handled by Auth.js with Google OAuth. We're planning to add Stripe billing next quarter."

Compressed — 31 tokens (83% savings)

SaaS inventory mgmt; Node.js Fastify Railway; PostgreSQL pgvector; Next.js 14 Vercel; Auth.js Google OAuth; Stripe billing Q2 planned

Optimization quota: Prose compression counts against your monthly optimization quota. JSON and code compression are unlimited. If your optimization quota is exhausted, Torqon falls back to passing the raw (uncompressed) message to the LLM.

Retrieval System

Retrieval is a single pgvector cosine similarity query. There is no re-ranking model, no BM25 hybrid, no post-processing — just one fast query filtered and sorted by similarity score. This keeps p95 latency under 8ms.

retrieval.sql — the actual query Torqon runs

SELECT
  id,
  content,
  fact_type,
  1 - (embedding <=> $1) AS similarity
FROM  facts
WHERE user_id = $2
  AND  is_current_fact = true
  AND  1 - (embedding <=> $1) > 0.72   -- similarity floor
ORDER BY similarity DESC
LIMIT 12;                               -- top-K cap

0.72

cosine similarity floor
facts below this are never surfaced

12

maximum facts per query
trimmed from bottom up by token budget

true

is_current_fact guard
stale facts never retrieved

global

search scope
all conversations under your account

What the similarity threshold means in practice

Similarity range	Meaning	Action
`> 0.90`	Near-identical semantic match — very likely the exact fact being asked about	Always retrieved, highest rank
`0.80 – 0.90`	Strong topical match — same concept, different phrasing	Retrieved and ranked
`0.72 – 0.80`	Related match — same domain, possibly adjacent topic	Retrieved if within top-12
`< 0.72`	Weak match — unrelated or different domain	Never surfaced

rawFacts mode applies a stricter floor of 0.68 and caps results at 5. This mode returns the top facts as a pipe-delimited string (type:content | type:content) for use in your own prompt construction.

CacheAligner

LLM providers like Anthropic and OpenAI cache prompt prefixes — if the start of your prompt is identical across requests, they skip re-processing it and charge you less. CacheAligner keeps the system prompt stable across every request so the prefix cache always hits.

Without CacheAligner — cache miss every time

System: "You are a helpful assistant.
The user's name is Alice. They are
building Helios on Next.js..."

↑ System prompt changes every request.
  Provider re-processes from scratch.
  You pay full input tokens every time.

With CacheAligner — cache hit every time

System: "You are a precise, concise
assistant. When verified facts are
provided, answer from them directly."

User: "[VERIFIED FACTS]
name: Alice
tech_stack: Next.js 14, pgvector

[QUESTION]
How should I write the migration?"

↑ System prompt never changes.
  Facts move into the user message.
  Provider caches the prefix. Cheaper.

The assembled user message format is always: [VERIFIED FACTS]\n{facts}\n\n[QUESTION]\n{message} when facts were retrieved, or just {message} when no facts matched. The system prompt is the same single string on every call, regardless of facts retrieved.

Observability

Every request returns an evaluation block with token accounting and retrieval quality metrics. Detailed traces are available from the dashboard.

Metric	Type	Description
`traceId`	string	Unique ID for this request. Use it to look up the full trace in the dashboard.
`evaluation.baselineTokens`	integer	Tokens you would have spent if you had injected all stored facts naively into the system prompt.
`evaluation.optimizedTokens`	integer	Actual tokens used after optimization and selective retrieval.
`evaluation.tokenSavings`	float	Percentage reduction: `(baseline − optimized) / baseline × 100`.
`evaluation.memoryUsed`	integer	Number of facts injected from memory for this request.
`evaluation.avgSimilarity`	float	Average cosine similarity score of injected facts. A value above 0.85 means highly relevant retrieval.
`evaluation.latencyMs`	integer	LLM generation latency in milliseconds (does not include retrieval or assembly time).

Retrieval Latency

Tracked per request and aggregated in the dashboard. Includes embedding generation + pgvector query time.

Memory Hit Rate

Percentage of requests where at least one fact was retrieved. Low hit rate indicates sparse memory or queries that don't match stored facts well.

Token Savings Over Time

Cumulative comparison of naive token usage vs Torqon-optimized usage across all requests in a billing cycle.

Plans & Limits

Quotas reset on the first of each calendar month. Exceeding a quota returns a 429 with the quota type specified in the error body.

Quota type	Free	Pro	Teams	When exceeded
Memory Stores	500 / mo	10,000 / mo	75,000 / mo	Async extraction skipped; response still returns
Retrievals	500 / mo	20,000 / mo	125,000 / mo	429 returned; no LLM call made
Optimizations	50 / mo	1,000 / mo	7,500 / mo	Falls back to raw (uncompressed) message
Uptime SLA	Best effort	99.9%	99.9%	—

Retrieval quota note: Each call to POST /chat (including storeOnly: true) counts as one retrieval. Use the rawFacts flag to retrieve memory without triggering an LLM call — this still counts as one retrieval.

Roadmap

Now — Phase 1

MCP server integration
REST API with full pipeline
Atomic fact extraction + supersession
Content Optimizer (JSON / code / prose routing)
CacheAligner prefix cache architecture
Per-request evaluation with token accounting
Free + Pro + Teams plans

Q3 2026 — Phase 2

Native TypeScript + Python SDKs
Memory graphs — link related facts with named edges
Recency weighting in retrieval scoring
Adaptive similarity threshold per conversation
Conflict resolution — detect contradictory facts before embedding

Q4 2026 — Phase 3

Multi-model support (Claude, GPT-4o, Llama)
Streaming responses
Memory aging — confidence decay over time
Webhook-based memory push (no polling)
Dashboard analytics: memory hit rate, savings charts, trace explorer