r mcp-rune 0.1.0
SECTION V · GUIDE 15 OF 19
Reading
22 min
Topic
analysis · vector
Spec
v0.1.0-alpha
Source
analysis-memories-guide.md

Analysis Memories

A five-tool feature for running LLM-driven qualitative analysis over large, paginated datasets without dragging raw rows into the model’s context window — and for acting on a subset of that dataset without ever putting the IDs back in context.

The LLM downloads records once into offline storage, stores its own qualitative findings as semantic embeddings, then queries both layers — by meaning, by aggregate, by filter, by stratified sample — until it has enough material to synthesise a final answer.

Table of Contents


Data flow

Two tables back the feature. Only one of them stores vectors — the other is plain JSONB. The five tools are stitched together by the LLM, which drives the loop: ingest once, then read → reason → store, optionally act, then clear when done.

═══════════════════════════════════════════════════════════════════════════════
  SESSION LIFECYCLE  ·  the LLM is the loop driver
═══════════════════════════════════════════════════════════════════════════════

  ① BOOTSTRAP — runs once at the start of an analysis_id
  ──────────────────────────────────────────────────────
  analysis_ingest(model, filters)
       │  GET /api/<model>?page=N  (auto-paginates up to 50)

       ├─────────────────────────────────────► ingested_records
       │                                       (raw rows, JSONB,
       │                                        no embedding, 1h TTL)

       │  per-page summary text
       │  (distributions, numeric stats, date ranges)
       │           │
       │      ═════╪═════ EMBED ═════════════════════════════════
       │           │   MiniLM-L6-v2 (local, 384-dim)
       │           ▼
       └─────► analysis_memories
               (category: "page_summary",
                text + vector + metadata, 1h TTL)


  ② READ — LLM queries to understand the data
  ───────────────────────────────────────────
  analysis_query(analysis_id, mode, …)

       ├── describe / aggregate / filter / sample
       │       │
       │       │  SQL: GROUP BY, JSONB @>, range casts,
       │       │  ROW_NUMBER() partitioned by stratify_by
       │       │  and/or date_bin() proximity buckets
       │       ▼
       │   ingested_records

       └── semantic
               │  ═════ EMBED(query) ═════  →  cosine distance

           analysis_memories
           (page summaries + any findings stored so far)


  ③ WRITE — LLM commits insights it formed in step ②
  ──────────────────────────────────────────────────
  analysis_store(findings[])
       │           no API call — purely LLM → storage

       │  finding text
       │      │
       │ ═════╪═════ EMBED ═════════════════════════════════════
       │      │   MiniLM-L6-v2 (local, 384-dim)
       │      ▼
       └─► analysis_memories
           (category set by LLM, e.g. "quality_issue", "pattern";
            1h TTL by default, or persistent: true)


  ╭─────────────────────────────────────────────────────────────────────╮
  │ Steps ② and ③ repeat as a loop. Each analysis_store call enlarges   │
  │ the pool that the next semantic query in step ② can recall.         │
  │                                                                     │
  │ The very first analysis_store has nothing to build on except the    │
  │ page_summary findings written by step ① — that's by design. Page    │
  │ summaries are the LLM's "starter pack": coarse-grained, automatic,  │
  │ and immediately searchable before the LLM has written anything.     │
  ╰─────────────────────────────────────────────────────────────────────╯


  ③.5 ACT — optional: mutate a subset before teardown
  ────────────────────────────────────────────────────
  analysis_act(analysis_id, model, where?, action, attributes?, dry_run?)

       │  SELECT record_id FROM ingested_records
       │  WHERE analysis_id = ? AND model = ? AND <where predicate>
       │           │
       │           ▼
       │      resolved IDs (server-side only — never returned to context)
       │           │
       │      batches of 50, concurrency 5
       │           │
       │           ▼
       │  PATCH/DELETE /api/<endpoint>/<id>   ── upstream API

       └─► response = { summary: { total, succeeded, failed }, sample_errors }
                                (per-record results stay in the server log)


  ④ TEARDOWN — once the LLM has its final synthesis
  ─────────────────────────────────────────────────
  analysis_clear(analysis_id)
       ├── DELETE FROM ingested_records  WHERE analysis_id = ?
       └── DELETE FROM analysis_memories WHERE analysis_id = ?

Where the embed boundary actually sits. Embedding happens at exactly three points, all crossing the ═══ EMBED ═══ line: ① the page-summary side-effect of analysis_ingest, ③ every analysis_store finding, and ② the query string in semantic mode (not the data being searched — that was already embedded on write). Raw API rows never touch the embedder.

This is why aggregate/filter/sample queries are cheap and deterministic SQL, while semantic queries pay for a single query-side embed and rank rows by cosine distance.

Maintenance note: keep this diagram in sync with the code. If a future change adds a new write path (e.g. embedding raw rows, a third table, a new tool, or removing the page-summary side-effect), update the diagram in the same PR — the value of a high-level picture collapses the moment it stops matching the code. The authoritative sources are analysis-ingest-tool.ts, analysis-store-tool.ts, analysis-query-tool.ts, analysis-act-tool.ts, analysis-clear-tool.ts, and the two pgvector backend files listed under File reference.


When to use it

Use it for:

  • Qualitative analysis over result sets that span many pages (themes, anomalies, “what patterns do you see?”).
  • Distribution / aggregation questions across an entire collection.
  • Representative sampling from skewed datasets where a naive ORDER BY RANDOM() would favour the majority class.

Don’t use it for:

  • Single-record lookups or known-id reads — use find_records.
  • Result sets that fit in one page and that you actually want returned to context — use search_records or list_records_app.
  • Transactional CRUD — use create_model / update_model / delete_model.

The dividing line: if you need the raw data in context, use the data tools. If you need to reason about a dataset that’s too big for context, use the analysis tools.


Setup (integrators)

Prerequisites

  • PostgreSQL with the pgvector extension installed.
  • pg connection pool you own and inject into mcp-rune.

1. Environment variables

VariableRequiredNotes
DATABASE_URLyesStandard postgres://... connection string.
ANALYSIS_ENABLEDyesSet to true to register the four analysis tools. When false/unset, they’re gated out by requiresVectorStorage and won’t appear in tools/list.

2. Run migrations

mcp-rune ships migrations as data under the @mcp-rune/mcp-rune/db/migrations subpath import. The analysis tables (analysis_memories, ingested_records) are tagged feature: 'analysis' — apply them conditionally:

import { migrations } from '@mcp-rune/mcp-rune/db/migrations'

const needed = migrations.filter(
  (m) => m.feature === 'core' || process.env.ANALYSIS_ENABLED === 'true'
)
// ...apply each migration.up against your pool

See the Database section of the root README for the full migration runner snippet.

3. Initialise vector storage at startup

import pg from 'pg'
import { initVectorStorage } from '@mcp-rune/mcp-rune/services'

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL })

initVectorStorage({
  pool, // required — pool injection only; mcp-rune never creates pools
  serviceName: 'my-mcp-server',
  version: '1.0.0',
  retentionDays: 30, // default: 30 — sweep window for tool_memories (operations feature)
  ingestedRecordsRetentionDays: 7, // default: 7 — TTL for ingested_records (analysis feature)
  backgroundCleanupIntervalMs: 6 * 60 * 60 * 1000 // optional — periodic cleanup across all three tables; omit to disable
})

If options.pool is omitted, vector storage stays disabled and the five analysis tools simply won’t show up in the tool list. There’s no error path — the gate is silent by design.

backgroundCleanupIntervalMs is opt-in because short-lived processes (test runs, single-shot scripts) don’t need it; the boot-time sweep already evicts expired rows on startup. Set it for long-running servers where on-access eviction alone may leave orphaned rows behind.

4. Embeddings

Embeddings run locally via @huggingface/transformers using sentence-transformers/all-MiniLM-L6-v2 (384 dimensions, quantised, lazy-loaded on first use). No API keys, no outbound network calls. The first analysis_store or semantic analysis_query in a fresh process pays a one-time model warm-up cost.


The five tools

All five belong to the ANALYSIS tool category, gated by requiresVectorStorage. analysis_ingest and analysis_act call the upstream API; the others operate purely on the local pgvector tables.

analysis_ingest

Downloads records from the model’s API and stores them in ingested_records as JSONB. Only a status summary returns to context — the raw data never inflates the LLM window.

Two ingestion modes:

ModeTriggered byUse for
Top-levelmodelListing/searching records of a single model.
Nestedparent_model + child_resourceFetching child resources (e.g. metadata_errors) for every previously-ingested parent, with auto-resolved parent IDs.

Key inputs:

FieldNotes
modelRequired for top-level mode.
analysis_idRequired. Session key — every later tool uses this to scope its work.
filtersOptional. Routed through validateFilterParams; supports the same operators as search_records.
page / per_pageper_page defaults to 50.
fieldsOptional projection (["id", "name", "status"]). {assoc}_id keys are auto-preserved when you ask for any flattened field from the association (e.g. title_nametitle_id is kept).
ingest_allWhen true, auto-paginates up to 50 pages, reporting progress page-by-page.
resumeWith ingest_all, skips already-stored pages by counting existing rows and continuing from the next page.
parent_model / child_resource / parent_idsNested mode. parent_ids is capped at 25; if omitted, auto-resolved from previously ingested records of parent_model in the same analysis_id. Nested fetches run with a concurrency cap of 5.
user_idService-account impersonation.

Side effects beyond storing records: every successful page also produces a page summary finding (category page_summary) — a one-line digest with field distributions for low-cardinality fields, numeric min/max/avg/median, and date ranges. These summaries are searchable via analysis_query mode: "semantic" even before the LLM stores any findings of its own.

Dedupe: ingested_records has a partial unique index on (analysis_id, model, record_id) WHERE record_id IS NOT NULL, and inserts use ON CONFLICT ... DO UPDATE. Re-ingesting the same page (or running resume after a partial failure) replaces rather than duplicates.

Caps: 50 pages max per ingest_all call; 25 parent IDs max per nested call; 5 concurrent nested fetches.

analysis_store

Stores LLM-generated qualitative findings — patterns, anomalies, conclusions — as semantic embeddings. Not for raw record data (that’s analysis_ingest’s job, already done automatically).

Inputs:

FieldNotes
analysis_idRequired.
findingsArray of { finding, category?, metadata? }. Max 25 per call.
persistentDefault false (1-hour TTL). true keeps the finding around across conversations.

category is free-form but acts as a grouping key for later recall (naming_inconsistency, missing_metadata, etc.). metadata is arbitrary JSON — typically record IDs or field values that justify the finding.

analysis_query

Single unified tool with five modes. Mode is the only required discriminator beyond analysis_id.

ModeRequired paramsReturnsUse for
describeMarkdown table of fields, types, query syntax examples derived from the model’s attributes configDiscovering shape before querying.
semanticqueryFindings + page summaries ranked by cosine similarity. Defaults: top_k=50, threshold 0.5. Filter by category to scope.Recalling stored insights; searching page summaries by meaning.
aggregategroup_by{ value, count } rows sorted by count desc, formatted as a percentage distribution.Distribution of a field across the dataset.
filterwhereMatching data rows. Default limit 20, hard cap 200.Inspecting a specific subset.
sample— (all optional)Sampled data rows. Default 5, hard cap 50. Composes stratify_by, where, and proximity.Representative spot-checks.

where operator syntax (used by filter mode and as a pre-filter in sample mode):

{ "status": "active" }                                    // exact match (JSONB containment)
{ "duration_minutes": { "$gte": 40, "$lte": 120 } }       // numeric range
{ "started_at": { "$gte": "2026-01-01" } }                // date range (auto-cast to timestamptz)

Operators: $gt, $gte, $lt, $lte. The cast (::numeric vs ::timestamptz) is inferred from the value type. Field names are validated against ^[a-zA-Z_][a-zA-Z0-9_]*$ before being interpolated into SQL.

analysis_act

Applies a bulk update or delete to records previously ingested in the session. Resolves matching record IDs server-side from ingested_records using the same where vocabulary as analysis_query mode: "filter", then runs the mutation in batches against the upstream API. Only an aggregate summary returns to context — per-record IDs and results are never echoed back to the LLM.

Annotated destructiveHint: true, requiresAuth: true. Same risk profile as bulk_action_models.

Inputs:

FieldNotes
analysis_idRequired. Must match a prior analysis_ingest call.
modelRequired. Must be a writable model present in the analysis session.
whereOptional. Same operator vocabulary as analysis_query mode: "filter": exact match plus $gt/$gte/$lt/$lte. Omit to match every record of model in the session.
actionRequired. "update" or "delete".
attributesRequired when action: "update", ignored when action: "delete". Applied uniformly to every matched record.
dry_runOptional. When true, returns { matched_count, sample_ids, sample_data, ingestedAtRange } without calling the API. Use it to confirm scope and snapshot age before mutating.
user_idService-account impersonation.

Batching: internal batches of 50, concurrency cap of 5. Higher than bulk_action_models (25) because batches are never surfaced to the LLM — only the aggregate summary is.

Response (live):

{
  "summary": { "total": 312, "succeeded": 308, "failed": 4, "action": "update" },
  "sample_errors": [
    /* first 5 failed records, with status_code and message */
  ]
}

Response (dry-run):

{
  "matched_count": 312,
  "sample_ids": ["d-1", "d-2", ...],            // first 10
  "sample_data": [ /* first 3 rows, each with ingestedAt */ ],
  "ingestedAtRange": {
    "earliest": "2026-05-13T08:14:22Z",
    "latest":   "2026-05-13T08:15:01Z"
  }
}

Snapshot staleness. ingested_records is a point-in-time copy. A long gap between ingest and act means the upstream state may have drifted. The ingestedAt timestamp on the dry-run sample and ingestedAtRange exist so the LLM (and the operator reviewing the call) can judge whether to re-ingest first. There is no automatic revalidation pass — that’s intentional, to keep the cost model predictable.

Failure model. Batches are not atomic across the whole set (same as bulk_action_models). A partial failure mid-run leaves earlier batches applied. sample_errors carries enough information to diagnose patterns; the server log carries the full per-record outcome.

Progress. When the MCP client supplies a progressToken, analysis_act emits one notifications/progress event per completed record.

analysis_clear

Cascade-deletes both analysis_memories and ingested_records for the given analysis_id. Annotated destructiveHint: true. Call it once the synthesis is done.


Stratified sampling

The sample mode composes three dimensions of stratification freely. All three can be combined in one call.

1. Discrete: stratify_by

Distributes sample slots evenly across distinct values of a JSONB field, so minority groups always appear. Implementation: ROW_NUMBER() OVER (PARTITION BY data->>'<field>' ORDER BY RANDOM()) with a per-group budget of CEIL(sample_size / num_groups).

Without stratification, 85 active + 10 draft + 5 archived records with sample_size: 6 would almost always return six active rows. With stratify_by: "status" you get roughly two of each.

2. Temporal: proximity

Date-windowed sampling around an origin date, with optional bucket stratification.

{
  "field": "created_at",
  "origin": "2026-03-15",
  "window": "7 days",
  "bucket": "1 day"
}
  • window and bucket are validated against ^\d+\s+(day|days|week|weeks|month|months|hour|hours|minute|minutes)$.
  • Without bucket: uniform random sampling within the window.
  • With bucket: PostgreSQL date_bin(bucket, value, origin) creates origin-anchored buckets, and the same ROW_NUMBER() budget allocation distributes slots across buckets.

For deeper detail (use cases, edge cases, performance notes), see proximity-sampling-guide.md.

3. Pre-filter: where

Restricts the candidate set before sampling. Same operator vocabulary as filter mode. Useful for “sample from the population that matches X” rather than “sample then filter”.

Composing all three

{
  "mode": "sample",
  "analysis_id": "q1-deal-audit",
  "sample_size": 12,
  "stratify_by": "status",
  "where": { "amount": { "$gte": 10000 } },
  "proximity": {
    "field": "closed_at",
    "origin": "2026-03-15",
    "window": "30 days",
    "bucket": "1 week"
  }
}

Reads as: “From deals over $10k that closed in the 60 days around March 15, give me 12 examples — spread across statuses, spread across weeks.” The filtered CTE applies where + the proximity date range, then the partition key (date_bin(week, closed_at, origin), data->>'status') allocates the budget.


End-to-end workflow

A realistic session: an LLM auditing the book model in an example bookshelf server. The user asks “what’s the state of our library — any quality issues across the collection?

1. Ingest the dataset once, paginated up to the cap:

analysis_ingest({
  analysis_id: "library-audit-2026-05",
  model: "book",
  ingest_all: true,
  per_page: 50,
  fields: ["id", "title", "author", "status", "rating", "updated_at"]
})
// → "Stored 312 record(s) (6 fields per record) across 7 page(s). Analysis: library-audit-2026-05"

Seven page_summary findings are stored automatically alongside the raw rows — each with field distributions and date ranges for that page.

2. Discover the shape before querying:

analysis_query({ analysis_id: "library-audit-2026-05", mode: "describe" })
// → markdown table of book fields, enum values, and copy-pasteable query examples

3. Aggregate to ground the LLM in distributions:

analysis_query({ analysis_id: "library-audit-2026-05", mode: "aggregate", group_by: "status" })
// → "Distribution of \"status\" (312 total):
//      completed: 180 (57.7%)
//      reading: 80 (25.6%)
//      unread: 52 (16.7%)"

4. Filter to investigate one segment:

analysis_query({
  analysis_id: "library-audit-2026-05",
  mode: "filter",
  where: { "status": "completed", "rating": { "$lt": 2 } },
  limit: 20
})
// → up to 20 raw rows the LLM can reason over

5. Sample representatively for spot-checks:

analysis_query({
  analysis_id: "library-audit-2026-05",
  mode: "sample",
  sample_size: 9,
  stratify_by: "status"
})
// → 3 of each status, so the LLM doesn't see only the majority class

6. Store findings as the LLM forms them:

analysis_store({
  analysis_id: "library-audit-2026-05",
  findings: [
    { finding: "12 completed books rated 1 — outliers worth a re-read or de-listing", category: "quality_issue", metadata: { count: 12 } },
    { finding: "Sci-fi authors over-represented in 'reading' status — possible stalled-progress bias", category: "pattern" }
  ]
})

7. Recall semantically near the end of the session, after several rounds of querying:

analysis_query({
  analysis_id: "library-audit-2026-05",
  mode: "semantic",
  query: "issues with rating or quality",
  top_k: 20
})
// → all "quality_issue" findings + any page summaries whose distributions hint at the same

8. Clear once the synthesis lands:

analysis_clear({ analysis_id: "library-audit-2026-05" })
// → "Cleared 312 ingested record(s) and 9 finding(s) for analysis \"library-audit-2026-05\"."

Lifecycle & retention

LayerExpiryEviction
analysis_memories (ephemeral)1 hour from creationOn-access: every recallMemories call deletes expired rows first.
analysis_memories (persistent)Never (until explicit clear)Set with persistent: true at store time.
ingested_records7 days from store (configurable via ingestedRecordsRetentionDays)On-access: every queryRecords call deletes expired rows first. Boot-time sweep on init.
tool_memories (separate feature)retentionDays from initVectorStorageBoot-time sweep + on-access.
Background sweep (opt-in)Every backgroundCleanupIntervalMs msPeriodic cleanup across all three tables. Off by default — set the option to enable for long-lived servers.

Practical implications:

  • The 7-day TTL on ingested_records is the realistic working window for an analysis session — long enough to ingest in the morning and analysis_act in the afternoon (or after a weekend), short enough that an abandoned session eventually frees its disk.
  • analysis_memories is split deliberately: ephemeral findings are throw-away by design; the persistent: true flag is the explicit opt-in for findings that should outlive a session.
  • A session whose ingested_records have expired will return an empty match set from analysis_act and analysis_query. Re-run analysis_ingest with resume: true to rebuild — page summaries will be regenerated.
  • analysis_clear is the explicit teardown for a session that finished cleanly. Don’t rely on TTL for cleanup if you store findings persistently.

Troubleshooting

The four tools don’t appear in tools/list. They’re gated by requiresVectorStorage. Check:

  1. DATABASE_URL is set.
  2. ANALYSIS_ENABLED=true.
  3. initVectorStorage({ pool }) was called at server startup with a real pg.Pool. The init returns false and logs pgvector: no pool provided, vector storage disabled when the pool is missing.
  4. The analysis_memories and ingested_records tables exist (migrations applied).

analysis_ingest reports duplicate-looking counts after a retry. Resolved in commit c1cc813 — the table now has a partial unique index and inserts use ON CONFLICT DO UPDATE. If you’re seeing it, confirm your migrations are up to date (the unique index ships in the relevant migration).

analysis_ingest stops at 50 pages. That’s the MAX_INGEST_PAGES cap, surfaced in the response ((capped at 50 pages)). Tighten filters to reduce the result set, or run multiple sessions with disjoint filters.

analysis_query mode: "filter" returns nothing for what looks like a valid match. where uses JSONB containment for exact match — values must match the stored representation. Numeric fields stored as strings (e.g. via flattened HAL responses) need range operators with the right cast: { "amount": { "$gte": 100 } } infers ::numeric. If your value is a string "100", exact match needs the string form.

Field name rejected with “Invalid field name”. Stratification, range conditions, and proximity all validate the field against ^[a-zA-Z_][a-zA-Z0-9_]*$ to keep them safe to interpolate into SQL. Dotted or hyphenated paths aren’t supported — flatten the data at ingest time via fields or model associations.

First semantic query is slow. @huggingface/transformers lazy-loads and quantises the all-MiniLM-L6-v2 weights on first call (~1–2 s). Subsequent calls are fast.


File reference

PathPurpose
src/mcp/tools/analysis/analysis-ingest-tool.tsanalysis_ingest tool
src/mcp/tools/analysis/analysis-store-tool.tsanalysis_store tool
src/mcp/tools/analysis/analysis-query-tool.tsanalysis_query tool (all five modes)
src/mcp/tools/analysis/analysis-act-tool.tsanalysis_act tool (server-side ID resolution + batched mutation)
src/mcp/tools/analysis/analysis-clear-tool.tsanalysis_clear tool
src/mcp/tools/analysis/base-analysis-tool.tsCategory binding (ANALYSIS, requiresVectorStorage)
src/services/vector-storage.tsVendor-agnostic facade — initVectorStorage, isVectorStorageEnabled, all store/query/clear entry points
src/services/vendor/pgvector/index.tsPool injection, cleanup-on-boot
src/services/vendor/pgvector/analysis-memories.tsFindings table SQL (store, recall, clear, eviction)
src/services/vendor/pgvector/ingested-records.tsRaw-data table SQL (store, aggregate/filter/sample, stratification)
src/services/embeddings.tsLocal all-MiniLM-L6-v2 embeddings
src/db/migrations.tsMigration data (feature: 'analysis' for these tables)

Related guides:

Out of scope for this iteration (tracked separately): a read-only analysis_export companion that returns filtered records to a downloadable artefact; an opt-in revalidation pass that re-fetches each candidate record before analysis_act mutates it to detect drift since ingest. See issue #80 for context.