How it actually works.

For people who want to know the moving parts. Architecture, the algorithm, the boundary lines, and the things we deliberately didn’t build.

The problem, one more time

Every Claude Code or Cursor session on a Rails repo starts cold. The agent walks the codebase to orient itself — list directories, read Gemfile, read db/schema.rb, glob app/models/*.rb, read every model file the question seems to involve. By the time it’s ready to answer “where does billing happen?”, it’s burned 50–200K input tokens just figuring out which files exist and what’s in them.

The maddening part: the agent does this every session. The codebase hasn’t materially changed since yesterday’s session. The model already knew, in some other window, what User has on it. None of that persists.

RubySage is the persistence layer.

The core insight: git already fingerprints your code

You don’t need a clever hash function. You don’t need content-addressing voodoo. Git’s blob SHA does exactly the job you need:

  • Same file content → same SHA, every time.
  • Different file content → different SHA, barring a hash collision, which is not a practical concern for ordinary source files.
  • Cheap to compute (it’s already computed when you commit).
  • Works on uncommitted changes too: run git hash-object <file>, or compute the same value yourself as the SHA-1 of the header "blob <size>\0" followed by the file bytes.

Per-file caching keyed on this fingerprint is the right primitive. If the digest of app/models/user.rb hasn’t changed since the last scan, we know with certainty the extracted artifact is still valid. Reuse it. Move on.

This is what makes incremental scans cheap. A 200-file Rails app where 3 files changed today re-summarizes 3 files, not 200.
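
A minimal Ruby sketch of the fingerprint check, assuming the repository uses Git's default SHA-1 object format. The helper names git_blob_sha and stale_paths are illustrative, not RubySage's actual API. Git hashes a short header ("blob", the byte size, a NUL byte) followed by the file content, which is why the result lines up with git hash-object:

```ruby
require "digest"

# Compute the Git blob SHA for a string of file content.
# Git hashes "blob <bytesize>\0" + content, not the bare content.
def git_blob_sha(content)
  bytes = content.b
  Digest::SHA1.hexdigest("blob #{bytes.bytesize}\0".b + bytes)
end

# Given { path => digest } for the working tree and for the last scan,
# return only the paths whose content changed or that are new.
def stale_paths(current_digests, cached_digests)
  current_digests.reject { |path, digest| cached_digests[path] == digest }.keys
end

current = { "app/models/user.rb" => git_blob_sha("class User; end\n") }
cached  = { "app/models/user.rb" => git_blob_sha("class User; end\n") }
puts stale_paths(current, cached).inspect
```

Only the paths returned by stale_paths would need re-summarizing; every other file reuses its existing artifact.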

It’s also what makes the staleness objection (“but your data goes stale!”) bounded and answerable: the staleness window is “time between code change and next scan.” With a post-commit git hook, that window is seconds.

What an artifact actually contains

When RubySage scans a file, it produces a small structured record:

schema_version: 1
path: app/models/user.rb
kind: model
digest: 7f3a2b1c...   # SHA of the file bytes
public_symbols:
  - User
  - email
  - admin?
  - subscribe!
  - subscriptions
route_mappings: []     # populated for controllers/jobs
summary: |
  ActiveRecord model representing an authenticated user. Has many
  Subscriptions, belongs_to Organization. Validates email presence
  and uniqueness. Exposes admin?/staff? role predicates and a
  subscribe! helper that wraps the Stripe integration in
  BillingService.
audiences: [developer, admin]

Notice what’s there: the structural shape of the file (public symbols, kind, routes) plus a tight prose summary. Notice what’s not there: the raw source. The artifact is a fraction the size of the file it describes. That asymmetry is the whole game.

The lean V1 actually writes these to disk:

host-app/
└── .ruby_sage/
    ├── manifest.json
    ├── artifacts/
    │   └── app/models/user.rb.yml
    └── routes.json

The disk layout is the source of truth. The widget’s database is a derived index rebuilt from disk via rake ruby_sage:index. Disk-first means:

  1. The MCP server (bin/ruby_sage mcp) reads disk directly, no Rails boot — important because MCP is invoked on every agent session and we can’t afford a 2-second Rails boot.
  2. Artifacts can be committed to the repo — the team shares them, CI can pre-bake them, prod doesn’t need to re-summarize anything.
  3. Filesystem-level inotify / fsevents is the natural file-watcher surface — incremental re-scans on save.
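
To make the "reads disk directly, no Rails boot" point concrete, here is a rough sketch that uses only Ruby's standard library. It assumes the layout shown above (artifacts stored at .ruby_sage/artifacts/<source path>.yml); the helper names artifact_path and load_artifact are hypothetical:

```ruby
require "yaml"
require "pathname"

SAGE_DIR = ".ruby_sage" # directory name taken from the layout above

# Map a source path such as "app/models/user.rb" to its artifact file.
def artifact_path(repo_root, source_path)
  Pathname(repo_root).join(SAGE_DIR, "artifacts", "#{source_path}.yml")
end

# Load one artifact as a plain Hash, or nil if the file was never scanned.
# No Rails, no database: just a YAML file read.
def load_artifact(repo_root, source_path)
  file = artifact_path(repo_root, source_path)
  return nil unless file.file?

  YAML.safe_load(file.read)
end
```

Because nothing here touches the application, a process built this way starts in roughly the time it takes to load the YAML library.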

Retrieval: BM25, not vectors (yet)

When a question comes in, RubySage scores every artifact against the query tokens. The current V1 algorithm:

score(artifact, query) =
    Σ(token_match in artifact.summary)        × 1.0
  + Σ(token_match in artifact.public_symbols) × 2.0
  + Σ(token_match in artifact.path)           × 1.5
  + page_context_boost if artifact is on the current page's route

That’s it. Lexical scoring with weighted fields. It’s BM25-shaped, but without IDF weighting or document-length normalization; V1’s corpora are small enough that exact IDF doesn’t move the needle.
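
The scoring rule above could be sketched in Ruby as follows. The field weights come from the formula. The tokenizer (lowercasing, splitting CamelCase and punctuation), the boost value of 3.0, and the choice to count each distinct query token once per field are assumptions made for illustration:

```ruby
require "set"

FIELD_WEIGHTS = { summary: 1.0, public_symbols: 2.0, path: 1.5 }.freeze
PAGE_CONTEXT_BOOST = 3.0 # hypothetical value; the real boost is not specified

# Split "SubscriptionService" into "subscription", "service", and so on.
def tokenize(text)
  text.to_s.gsub(/([a-z0-9])([A-Z])/, '\1 \2').downcase.scan(/[a-z0-9]+/)
end

# Weighted count of query tokens found in each field, plus an optional boost
# when the artifact belongs to the page the user is currently on.
def score(artifact, query, current_route_paths: [])
  query_tokens = tokenize(query).uniq
  total = FIELD_WEIGHTS.sum do |field, weight|
    field_tokens = tokenize(Array(artifact[field]).join(" ")).to_set
    weight * query_tokens.count { |token| field_tokens.include?(token) }
  end
  total += PAGE_CONTEXT_BOOST if current_route_paths.include?(artifact[:path])
  total
end

artifact = {
  path: "app/services/subscription_service.rb",
  public_symbols: ["SubscriptionService", "charge!"],
  summary: "Handles subscription billing through Stripe."
}
puts score(artifact, "Where does subscription billing happen?") # 5.5
```

In this example the summary matches two tokens (2 × 1.0), the symbols match one (1 × 2.0), and the path matches one (1 × 1.5), giving 5.5.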

Why not vectors? Three reasons:

  1. Rails apps have unreasonably strong lexical signals. Model names, route paths, controller actions — these are exactly what the agent asks about. “Where does subscription billing happen?” → tokens subscription, billing → matches against app/services/subscription_service.rb’s public_symbols. A 200-line embedding model is doing a worse job than the symbol index here.
  2. Costs more, ships slower. pgvector + an embedding pass per artifact adds a hard dependency and a per-file API call to scans.
  3. Hybrid retrieval is the V2 story, not the V1 story. When users start asking conceptual questions (“how do we handle webhook idempotency across services?”), exact symbol matching breaks down and embeddings start earning their keep. We add them then.

The Retriever class is designed as a swappable component. V2 hybrid retrieval is a class-substitution, not a rewrite.

The four MCP tools (and why those four)

Once you have artifacts on disk, the question becomes: what tool surface do you expose to an AI coding agent? RubySage V1 went with four:

Tool                  Question it answers
find_relevant_files   “Where in the codebase is X?”
get_file_context      “What's in this file, structurally?”
get_route_handler     “Which file handles this URL?”
search_symbols        “Where is symbol Foo defined?”

A bigger surface was tempting — we considered get_model_associations, get_dependencies, find_callers. Resisted on principle: agents re-train their muscle memory based on what tools exist, and a noisy tool surface trains them to over-call. Better to have four tools an agent uses well than twelve it picks badly from. Watch usage, add more once we’ve seen the gaps.

The boundary is intentional: these tools answer structural questions. “What does this code do?” questions still flow through read_file. RubySage isn’t trying to replace reading source code, just to prevent reading source code when reading the symbol index would do.
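
As an illustration of how thin these structural tools can be, here is a hypothetical in-memory version of two of them. It assumes each artifact is a Hash with string keys shaped like the YAML record shown earlier; the real tool signatures and return formats may differ:

```ruby
# search_symbols: "Where is symbol Foo defined?"
# Returns the paths of every artifact that lists the symbol.
def search_symbols(artifacts, symbol)
  artifacts
    .select { |artifact| Array(artifact["public_symbols"]).include?(symbol) }
    .map { |artifact| artifact["path"] }
end

# get_file_context: "What's in this file, structurally?"
# Returns structural fields only; raw source is never part of the answer.
def get_file_context(artifacts, path)
  artifact = artifacts.find { |candidate| candidate["path"] == path }
  artifact&.slice("path", "kind", "public_symbols", "route_mappings", "summary")
end

artifacts = [
  { "path" => "app/models/user.rb", "kind" => "model",
    "public_symbols" => ["User", "admin?"], "summary" => "User model." }
]
puts search_symbols(artifacts, "admin?").inspect
```

Both functions answer from the index alone, which is the boundary described above: anything that needs the actual method bodies still goes through a normal file read.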

The prompt-caching contract

Anthropic’s prompt cache is the silent multiplier here. When the RubySage widget answers a chat question, the prompt structure is:

[system prompt]
[cached_context]              ← marked with cache_control: ephemeral
[user question + retrieved artifacts]

That cached_context block holds the schema, README, frequently-cited files, and anything else that’s stable across consecutive questions in a session. Anthropic caches it. The next question within ~5 minutes reuses the cache at roughly 10× lower cost for the cached tokens.

The agent-facing MCP tools could do the same trick once we measure which artifact subsets get hit repeatedly. The current MCP tool returns are short enough that caching at that layer is a Phase-1.5 optimization rather than a V1 requirement.
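
A sketch of how that three-part prompt might be assembled as a request body. The field layout follows Anthropic's Messages API as publicly documented (a system array of text blocks, with cache_control on the block that ends the cacheable prefix). The function name, the max_tokens value, and the way artifacts are joined into the user message are assumptions, and nothing here makes a network call:

```ruby
# Build the request body for one chat question.
# Everything up to and including the cache_control block is the stable prefix;
# the user message changes on every question and stays outside the cache.
def build_chat_payload(model:, system_prompt:, cached_context:, question:, artifacts:)
  {
    model: model,
    max_tokens: 1024, # illustrative limit
    system: [
      { type: "text", text: system_prompt },
      { type: "text", text: cached_context, cache_control: { type: "ephemeral" } }
    ],
    messages: [
      {
        role: "user",
        content: "#{question}\n\nRetrieved artifacts:\n#{artifacts.join("\n---\n")}"
      }
    ]
  }
end
```

The ordering matters: if the retrieved artifacts were placed before cached_context, every question would change the prefix and the cache would never be reused.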

The agent-driven scan path

The single most under-appreciated feature in the V1 design is that summarization doesn’t have to cost API tokens.

bundle exec rake ruby_sage:scan:plan

…writes tmp/ruby_sage/INSTRUCTIONS.md + manifest.json. You tell your local Claude Code (or Cursor, or Codex) to read the INSTRUCTIONS.md and produce summaries.json. Then:

bundle exec rake ruby_sage:scan:apply

…ingests the summaries into a completed Scan.

Why this matters: the summarization work happens on your existing developer-tools subscription, not on an Anthropic API key. For most teams that means the cost goes to zero because they’re already paying for Claude Code or Cursor. The gem itself never sees an API key in this flow.

This is also how scan results ship to production: pre-bake in CI using the agent-driven path, commit the artifacts (or upload them as build artifacts), and prod runs zero LLM calls during scans. The expensive work lives in the developer environment where the credentials and cache already exist.
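
One way the apply step could guard against summaries written for an outdated version of a file is to compare digests before ingesting. The sketch below is hypothetical: the source does not document the shapes of manifest.json or summaries.json, so both shapes (noted in the comments) and the function name are assumptions:

```ruby
require "json"

# Assumed shapes, for illustration only:
#   manifest:  { "files" => [{ "path" => "...", "digest" => "..." }, ...] }
#   summaries: [{ "path" => "...", "digest" => "...", "summary" => "..." }, ...]
#
# Accept a summary only if the manifest planned that path with that digest.
def apply_summaries(manifest, summaries)
  expected = manifest.fetch("files").to_h { |file| [file["path"], file["digest"]] }
  accepted, rejected = summaries.partition do |entry|
    expected.key?(entry["path"]) && expected[entry["path"]] == entry["digest"]
  end
  { accepted: accepted, rejected: rejected.map { |entry| entry["path"] } }
end

manifest  = JSON.parse('{"files":[{"path":"app/models/user.rb","digest":"7f3a"}]}')
summaries = JSON.parse('[{"path":"app/models/user.rb","digest":"7f3a","summary":"User model."}]')
puts apply_summaries(manifest, summaries)[:accepted].length
```

Rejected paths would simply be picked up again by the next plan run, since their digests still differ from what is on disk.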

When RubySage helps — and when it doesn't

It would be dishonest to claim RubySage helps with everything. Here’s the actual shape:

Where it earns its place:

  • Orientation at session start (“how is auth handled?”) — agents otherwise burn 50–200K tokens grepping the codebase
  • Multi-file synthesis (“where would I add a new role?”) — needs joint visibility into models, controllers, policies, configs
  • Onboarding (humans and agents): rake ruby_sage:onboard writes AGENT_PRIMER.md and ONBOARDING.md off the same index
  • Anything where the answer requires finding before fixing

Where it doesn’t help — and might slightly hurt:

  • Trivial edits with known location (“change the homepage h1 from ‘Welcome’ to ‘Hello’”): you know the file. grep + an editor wins; RubySage’s MCP round-trip is pure overhead
  • One-line bug fixes the user can articulate by exact symbol — read_file on a known path is faster than retrieval
  • “Refactor this method” on a file already in the conversation context — the model already has what it needs

The honest framing: if you can describe the task in 10 words and already know where it lives, skip RubySage and just do the edit. (If you can do it in 10 words and don’t know where it lives, that’s exactly where RubySage shines.)

The benchmark prompt set deliberately includes trivial-task prompts so we can show what happens in the case where RubySage shouldn’t help. We’d rather publish a chart that honestly says “here’s where it’s a tie or slightly worse” than a chart that hides the unfavorable cells.

The escape hatch: set RUBY_SAGE_DISABLE=1 in your shell to make the MCP server respond with a polite no-op to all tool calls. Useful for sessions where you know you don’t want the overhead. RubySage stays on by default because most real-work sessions benefit from it.
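
A minimal sketch of what such a no-op switch might look like inside a tool dispatcher. The environment variable name comes from the text. The function name, the handler table, and the wording of the reply are hypothetical, and the result shape (a content array of text items) follows the general MCP tool-result convention rather than RubySage's confirmed output:

```ruby
NOOP_RESULT = {
  content: [
    { type: "text", text: "RubySage is disabled (RUBY_SAGE_DISABLE=1); no index lookup performed." }
  ]
}.freeze

# Route a tool call to its handler, unless the escape hatch is set.
# env is injectable so the behavior can be exercised without touching ENV.
def dispatch_tool(name, arguments, handlers:, env: ENV)
  return NOOP_RESULT if env["RUBY_SAGE_DISABLE"] == "1"

  handler = handlers.fetch(name) { raise ArgumentError, "unknown tool: #{name}" }
  handler.call(arguments)
end
```

Answering with a valid but empty result, rather than an error, lets the agent fall back to its normal file reads without retrying the tool.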

Scan cost, and the honest cumulative-token story

The scan is the cost line item people ask about first. The shape on a typical Rails app:

App size              Initial scan               Daily delta
Small (~50 files)     ~20K–50K input tokens      ~2K–8K
Medium (~200 files)   ~80K–200K input tokens     ~10K–30K
Large (~1K files)     ~400K–1M input tokens      ~50K–150K

Two important things about those numbers:

  1. They’re paid once (per file change). Every agent session that day reads the cached index, not the source. The amortization curve crosses zero quickly — even one heavy session per day usually covers the scan cost five times over.
  2. They’re mostly free if you use the agent-driven scan path. rake ruby_sage:scan:plan writes a manifest; your existing Claude Code / Cursor / Codex subscription does the summarization; rake ruby_sage:scan:apply ingests. No API-key tokens spent.
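
The "five times over" figure can be reproduced with back-of-the-envelope arithmetic. The numbers below are picked from the ranges quoted in this article and are not benchmark results. The sketch also assumes, generously, that a session with the index avoids the entire cold-start cost; in practice reading the index still costs some tokens, so real coverage would be lower:

```ruby
# How many times over do the tokens saved in a day cover that day's scan?
def scan_coverage(daily_scan_tokens:, orientation_tokens_saved:, sessions_per_day: 1)
  (orientation_tokens_saved * sessions_per_day) / daily_scan_tokens.to_f
end

# Medium app: daily delta at the top of its range (30K), one heavy session
# that would otherwise have spent 150K tokens on orientation.
puts scan_coverage(daily_scan_tokens: 30_000, orientation_tokens_saved: 150_000) # 5.0
```

With three such sessions a day, the same inputs give a coverage of 15.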

The bigger your codebase, the more RubySage helps — and the bigger the scan cost. The benchmark publishes both numbers (scan cost and per-session savings) so a team with a million-LOC monorepo can see for themselves whether the math works.

One caveat worth flagging: don’t run two concurrent scans on the same repo. The agent-driven scan flow writes to tmp/ruby_sage/; two Claude Code sessions kicking off scan:plan at the same time will race on the manifest. The gem will grow file-locking around this; for now, serialize scans (which you should be doing anyway; two AI sessions clobbering the same files is a different bug).
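
Until the gem ships its own locking, a wrapper along these lines is one way to serialize scans. It is a sketch, not the planned implementation: the lock file name scan.lock, the ScanInProgress error, and the with_scan_lock helper are all made up for illustration. It relies on File#flock, which is advisory and works on local filesystems:

```ruby
require "fileutils"

class ScanInProgress < StandardError; end

# Run the block while holding an exclusive lock on tmp/ruby_sage/scan.lock.
# A second caller fails fast instead of racing on the manifest.
def with_scan_lock(work_dir = "tmp/ruby_sage")
  FileUtils.mkdir_p(work_dir)
  File.open(File.join(work_dir, "scan.lock"), File::RDWR | File::CREAT, 0o644) do |lock|
    # LOCK_NB makes flock return false immediately when someone else holds the lock.
    unless lock.flock(File::LOCK_EX | File::LOCK_NB)
      raise ScanInProgress, "another scan holds #{lock.path}"
    end

    yield
  end # closing the file releases the lock
end
```

A plan or apply step wrapped in with_scan_lock would then either run alone or raise ScanInProgress right away.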

What we deliberately didn't build (and why)

  • Dependency graph: useful for some questions (“who calls this service?”), but grep solves 80% of those today and a real call graph requires runtime tracing or aggressive static analysis. V1.5+ if usage data shows it earns its keep.
  • LLM summaries at retrieval time — the artifact summary is built at scan time, never at retrieval time. Keeps the question/answer latency predictable and the cost amortized.
  • Multi-language: RubySage is Rails-only. The Prism-based signature extraction relies on Ruby’s grammar. A Python/JS equivalent is conceivable but is a different product.
  • Vector store — covered above. V2 hybrid retrieval is on deck once the corpora and question patterns demand it.
  • Auth that’s anything but explicit — the controller hard-fails closed if auth_check isn’t configured. No middle ground.

Where it could grow (V2+)

  • Hybrid retrieval: pgvector embeddings + the current lexical scorer, combined via reciprocal rank fusion
  • Dependency-aware retrieval: “this file imports X; pull X’s artifact too”
  • Workspace-level memory: agents store and retrieve their own facts about the codebase, persisted across sessions, scoped to git branch
  • Cross-repo: a multi-project index for orgs with many Rails apps
  • Read-only tool-loop for admin questions (“how many users signed up yesterday?”) — partially built, needs adoption signal before deepening
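
For the hybrid retrieval item, reciprocal rank fusion is simple enough to sketch now. Each input is a list of artifact paths ordered best-first (for example, one from the lexical scorer and one from an embedding search). The constant k = 60 is the value commonly used for this technique, and the function name and the alphabetical tie-break are assumptions:

```ruby
# Fuse several ranked lists into one.
# Each path earns 1 / (k + rank) from every list it appears in, rank starting at 1.
def reciprocal_rank_fusion(rankings, k: 60)
  scores = Hash.new(0.0)
  rankings.each do |ranking|
    ranking.each_with_index do |path, index|
      scores[path] += 1.0 / (k + index + 1)
    end
  end
  # Highest fused score first; ties broken alphabetically for determinism.
  scores.sort_by { |path, fused| [-fused, path] }.map(&:first)
end

lexical = ["billing_service.rb", "subscription.rb", "user.rb"]
vector  = ["subscription.rb", "user.rb", "billing_service.rb"]
puts reciprocal_rank_fusion([lexical, vector]).inspect
```

Because the fusion uses only ranks, the lexical and embedding scores never need to be put on a common scale, which fits the idea of swapping the Retriever class without touching the scorer.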

What's not the goal

It would be easy to drift this product toward “AI customer support widget for your Rails app.” It isn’t. The point of RubySage is making AI coding agents cheaper and smarter on Rails repos. The chat widget exists because the same retrieval primitive happens to power it; it’s not the headline.