Case 01
Trappan
RAG-grounded extraction. Tiered fallbacks. Production observability.
- TypeScript
- OpenAI API
- Supabase
- pgvector
- Langfuse
- Vitest
- Docker
Overview
Trappan automatically finds and reads quarterly financial reports for Swedish listed real estate companies, then extracts 20+ standardised metrics (EPRA NRV, LTV, occupancy rate, etc.) into a structured database. The hard part is that every company uses different terminology, buries figures in different sections, and publishes in Swedish. The system solves this with a definitions-aware RAG layer that maps company-specific labels to canonical metrics before extraction runs. Everything is built from scratch in TypeScript — no LangChain, no LangGraph.
Contribution: Full-stack design and implementation — orchestrator, harness, RAG pipeline, extraction tiers, observability
Agent surface
- Custom retry harness with confidence-aware best-result tracking across attempts
- RAG-grounded term resolution via pgvector: definition blocks extracted from the PDF, embedded, and matched to a canonical metrics knowledge base
- Two-tier extraction (vision LLM on keyword-matched pages → web search fallback) with explicit confidence escalation, reducing cost on the happy path
Orchestrator: four explicit phases
The orchestrator sequences four phases — cache check, report discovery, extraction, and persistence — with job state written to the database between each step. This means every run is resumable and auditable. A Langfuse trace is opened at the start and a span is attached to each phase, so the full execution tree is visible in the observability dashboard without any extra instrumentation.
- Phase 1 (Cache): checks both a freshness-based metric cache and a date-specific data cache — skips extraction entirely if all requested metrics are available
- Phase 2 (Discovery): delegates to the report finder, which runs a 3-method cascade to locate the correct quarterly PDF
- Phase 3 (Extraction): runs the tiered parser; writes per-run stats including token counts, latency, and which metrics were extracted vs. missed
- Phase 4 (Persist): writes artifact records and a structured run log; the orchestrator returns a typed ExtractionSignal (clean | partial_fallback | full_fallback | failed) for the caller to act on
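The per-phase Langfuse span wiring described above can be sketched with a small wrapper. This is a minimal sketch: the tracer interface below is a simplified stand-in for the Langfuse client, and `withPhaseSpan` is an illustrative name, not project code.

```typescript
// Simplified tracer interface standing in for the Langfuse client (illustrative).
interface Span { end(output?: unknown): void }
interface Trace { span(opts: { name: string }): Span }

// Wrap one orchestrator phase in its own span so the execution tree shows
// cache → discovery → extraction → persistence per run, including failures.
async function withPhaseSpan<T>(
  trace: Trace,
  name: string,
  phase: () => Promise<T>
): Promise<T> {
  const span = trace.span({ name });
  try {
    const result = await phase();
    span.end(result);
    return result;
  } catch (err) {
    span.end({ error: String(err) }); // failed phases still close their span
    throw err;
  }
}
```

Because every phase runs through the same wrapper, no phase needs its own instrumentation code, which is what keeps the dashboard view complete "for free".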
export async function runOrchestrator(
  company: CompanyInput,
  metrics: MetricDef[],
  options?: { cacheMaxAgeHours?: number },
  runContext?: RunContext
): Promise<OrchestratorResult> {
  // Phase 1 — return early if all requested metrics are fresh
  // (metricKeys and missingMetrics are derived from `metrics` and `cached`; elided here)
  const cached = await getCachedMetrics(company.id, metricKeys);
  if (missingMetrics.length === 0) return finalize({ signal: "clean", ... });

  // Phase 2 — find the report PDF
  const discovery = await runWithRetry(
    (note) => findReport({ ...input, _note: note }, { logger: log }),
    { isSuccess: (o) => !!o?.report_url, label: `reportsFinder:${company.id}` }
  );

  // Phase 3 — extract missing metrics
  const { outputs, steps } = await parseMetrics(missingMetrics, reportUrl, ctx);
  await upsertMetrics(company.id, reportUrl, outputs);

  // Phase 4 — persist and return
  return finalize({ signal: deriveSignal(steps, failed), ... });
}
Custom retry harness
Every agent call in the system is wrapped in a generic retry harness that tracks the best result across all attempts — not just the last one. This matters because LLM outputs are non-deterministic: a third attempt might be worse than the first. The harness ranks outputs by a caller-supplied confidence label and promotes only when the new result is strictly better.
- Caller provides isSuccess() and getConfidence() — harness is completely domain-agnostic
- Exponential backoff on transport errors; immediate short-circuit on semantic failures the caller flags as non-retryable
- On retry, a retryNote string is injected into the next LLM call — letting the model know what went wrong without changing the prompt structure
- Test suite covers: first-attempt success, confidence promotion, all-attempts-fail, null returns, and thrown errors
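The confidence comparison the harness relies on can be sketched as a small ranking helper. This is an assumption about how `confidenceRank` might look, using the label set mentioned elsewhere in this document ("high" | "medium" | "low" | "none"); the project's actual ordering may differ.

```typescript
// Hypothetical ranking used to compare attempts; unknown or missing
// labels rank below everything else so they can never displace a result.
const CONFIDENCE_ORDER = ["none", "low", "medium", "high"] as const;

function confidenceRank(label: string | null): number {
  if (label === null) return -1;
  // indexOf returns -1 for labels outside the known set — same as null
  return CONFIDENCE_ORDER.indexOf(label as (typeof CONFIDENCE_ORDER)[number]);
}
```

A strict `>` comparison on these ranks is what makes promotion conservative: a retry that merely ties the current best never replaces it.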
export async function runWithRetry<T>(
  fn: (note: string | null) => Promise<T | null>,
  options: HarnessOptions<T>
): Promise<HarnessResult<T>> {
  const { isSuccess, getConfidence } = options;
  let best: T | null = null;
  let bestConfidence: string | null = null;
  for (let attempt = 1; attempt <= totalAttempts; attempt++) {
    const output = await fn(attempt === 1 ? null : retryNote);
    const confidence = output === null ? null : getConfidence(output);
    // Promote to best only if strictly higher confidence
    const promoteToBest =
      best === null ||
      (isSuccess(output) && !isSuccess(best)) ||
      (isSuccess(output) === isSuccess(best) &&
        confidenceRank(confidence) > confidenceRank(bestConfidence));
    if (promoteToBest) { best = output; bestConfidence = confidence; }
    if (isSuccess(best)) return { output: best, attempts: attempt, succeeded: true };
    if (attempt < totalAttempts) await sleep(backoff(attempt)); // exponential, skipped on last attempt
  }
  return { output: best, attempts: totalAttempts, succeeded: false };
}
Definitions-aware RAG: grounding extraction in the company's own glossary
Swedish real estate companies each define their key metrics slightly differently — one calls it 'Substansvärde per aktie', another uses 'Långsiktigt substansvärde', both meaning EPRA NRV. If extraction runs against generic keyword lists, it misses matches or picks up the wrong values. The solution: before any extraction runs, the system finds the definition pages in the PDF, renders each page as an image, and runs a vision LLM to extract definition blocks (title + description pairs). Each block description is then embedded with text-embedding-3-small and searched against a pgvector table in Supabase containing canonical definitions for all 20+ metrics — in both Swedish and English. When a match exceeds the similarity threshold (0.78), the company-specific label is prepended to that metric's keyword list. Extraction then runs against these resolved terms, not generic defaults.
- Per-metric guardrails enforce correctness beyond embedding similarity — e.g. EPRA NRV matches only if the description explicitly says 'per aktie' or 'per share', blocking company-level totals from being promoted
- Falls back gracefully to default keyword lists if no definition pages are found or no matches exceed the threshold
- pgvector match_metric_definitions RPC handles both similarity scoring and canonical key lookup in a single query
- Definition resolution is traced as its own Langfuse span — resolved vs. defaulted keys are visible per run
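The embed-then-match lookup behind the resolution loop can be sketched as follows. This is a sketch under stated assumptions: the client interfaces below are simplified stand-ins for the OpenAI and Supabase SDKs (passed explicitly here for testability, where the project presumably closes over real clients), and the RPC argument names follow the common pgvector match-function convention rather than the project's exact signature.

```typescript
// Simplified client interfaces standing in for the OpenAI and Supabase SDKs.
interface EmbeddingClient {
  embed(input: string): Promise<number[]>; // e.g. text-embedding-3-small
}
interface DefinitionMatch {
  canonical_key: string;
  similarity: number;
}
interface RpcClient {
  rpc(fn: string, args: Record<string, unknown>): Promise<{ data: DefinitionMatch[] | null }>;
}

// Embed a definition-block description and look up the closest canonical
// metrics via the match_metric_definitions RPC, ordered by similarity.
async function searchDefinitions(
  description: string,
  embeddings: EmbeddingClient,
  db: RpcClient,
  threshold = 0.78 // DEFINITION_MATCH_THRESHOLD from the text
): Promise<DefinitionMatch[]> {
  const queryEmbedding = await embeddings.embed(description);
  const { data } = await db.rpc("match_metric_definitions", {
    query_embedding: queryEmbedding,
    match_threshold: threshold,
    match_count: 3,
  });
  return data ?? [];
}
```

Pushing both the similarity scoring and the canonical-key lookup into one RPC keeps the hot loop at a single round trip per definition block.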
// Embed the extracted block description, search for matching canonical metric
for (const block of blocks) {
  const matches = await searchDefinitions(block.description); // pgvector cosine search
  const best = matches[0];
  if (!best || best.similarity < DEFINITION_MATCH_THRESHOLD) continue;
  if (!passesGuardrails(best.canonical_key, block)) continue; // e.g. EPRA NRV must say "per aktie"
  // Prepend company label — extraction uses this as its primary search term
  resolved[best.canonical_key] = [block.title, ...defaults];
}
Tiered extraction: cost-efficient by default, accurate under pressure
Tier 1 runs a keyword scan across all PDF pages to locate candidate pages, renders those pages as images, and sends them to a vision LLM (gpt-4o-mini). Only pages that actually contain a resolved keyword are rendered — typically 3–8 pages out of 80–120. If confidence comes back as 'high' or 'medium', the pipeline stops. If Tier 1 returns 'low' or 'none', Tier 2 fires: OpenAI's Responses API with web_search_preview browses the public PDF URL directly and extracts the missing metrics. This model is stronger (gpt-4o) and slower. The dual-tier design means most runs pay Tier 1 cost; Tier 2 is reserved for the hard cases.
- Tier 1: keyword scan → candidate pages → render images → vision extraction (gpt-4o-mini)
- Tier 2: web_search_preview on public PDF URL → text-based full-report extraction (gpt-4o)
- Definition pages are excluded from candidate page selection — prevents the model from confusing a metric definition with a reported value
- ExtractionSignal (clean | partial_fallback | full_fallback | failed) tells the caller exactly what happened, enabling downstream decisions (e.g. alert, use stale cache, skip)
- Per-run stats record token counts and latency for both tiers — enabling cost attribution across companies and time periods
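One plausible shape for the ExtractionSignal derivation the bullets describe is sketched below. The signal union comes from the text; the `TierStep` shape and the mapping rules in `deriveSignal` are illustrative assumptions, and the project's exact rules may differ.

```typescript
// The typed outcome the orchestrator returns to its caller (from the text).
type ExtractionSignal = "clean" | "partial_fallback" | "full_fallback" | "failed";

// Illustrative per-tier record; the project's real step type is richer
// (token counts, latency, per-metric hit lists).
interface TierStep {
  tier: "tier1" | "tier2";
  accepted: boolean;       // tier1: confidence came back high/medium
  extractedCount: number;  // metrics this tier produced
}

// Sketch: clean if tier1 alone sufficed; partial vs full fallback depends
// on whether tier1 contributed anything before tier2 fired.
function deriveSignal(steps: TierStep[]): ExtractionSignal {
  const total = steps.reduce((n, s) => n + s.extractedCount, 0);
  if (total === 0) return "failed";
  const usedFallback = steps.some((s) => s.tier === "tier2");
  if (!usedFallback) return "clean";
  const tier1Count = steps.find((s) => s.tier === "tier1")?.extractedCount ?? 0;
  return tier1Count > 0 ? "partial_fallback" : "full_fallback";
}
```

The value of the union over a boolean is that each label maps to a distinct downstream action: "clean" persists silently, the fallback variants can raise cost alerts, and "failed" can fall back to stale cache.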
export async function parseMetrics(metrics, reportUrl, ctx) {
  const pdfData = await loadPdf(reportUrl);

  // Tier 1: keyword scan → render candidate pages → vision LLM
  const tier1 = await runTier1(pdfData, metrics, ctx);
  if (tier1.step.accepted) { // confidence === "high" | "medium"
    return buildResult(pdfData, tier1.outputs, [tier1.step], "tier1");
  }

  // Tier 2: web_search_preview on the live PDF URL
  const tier2 = await runTier2(reportUrl, metrics, tier1.resolvedTerms, ctx);
  return buildResult(pdfData, tier2.outputs, [tier1.step, tier2.step], "tier2_fallback");
}
Report discovery: three-method cascade
Before extraction can run, the system needs to find the correct quarterly PDF for a given company and date. This is genuinely hard: URLs are not standardised, reports are published on company IR pages, CDNs, and IR aggregators, and getting the wrong period (e.g. last year's annual instead of this quarter's interim) would silently corrupt the data. The discovery agent runs three methods in order, stopping at the first success.
- Method 1 — mfn.se: an LLM agent navigates the Swedish IR press release aggregator with a fetch_page tool, inferring the company slug, filtering to the correct year and report type, and returning a direct PDF URL. Correctly scoped to a single period every time.
- Method 2 — Company website crawl: same agent pattern, restricted to the company's own domain. Follows investor relations links to the reports listing page.
- Method 3 — OpenAI web search: last resort, uses web_search_preview to search for the report and extract a PDF URL from the results.
- Period validation runs on every candidate URL: file path years are checked against the target date, and a stale URL (>1 year behind) is rejected before it can poison extraction
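The year check on candidate URLs could be sketched like this. It is a minimal illustration of the rule stated above (reject URLs whose path years lag the target by more than a year); the function name and exact heuristics are assumptions, not project code.

```typescript
// Illustrative period plausibility check on a candidate report URL.
function isPeriodPlausible(candidateUrl: string, targetDate: Date): boolean {
  // Collect all four-digit year tokens from the URL path (e.g. /2024/q3-2024.pdf)
  const years = [...candidateUrl.matchAll(/\b(20\d{2})\b/g)].map((m) => Number(m[1]));
  if (years.length === 0) return true; // no year in the path — cannot reject on this signal alone
  const newest = Math.max(...years);
  // >1 year behind the target date → stale report, reject before extraction
  return targetDate.getFullYear() - newest <= 1;
}
```

Running this on every candidate, regardless of which discovery method produced it, is what keeps a confidently-wrong URL from silently poisoning the database.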
export async function findReport(input, options): Promise<ReportsFinderOutput> {
  // Method 1: mfn.se — structured IR hub, period-accurate
  const mfnResult = await findViaMfn(input, log);
  if (mfnResult.report_url) return mfnResult;

  // Method 2: company website crawl
  const crawlResult = await findViaCrawl(input, log);
  if (crawlResult.report_url) return crawlResult;

  // Method 3: OpenAI web search (last resort)
  return findViaOpenAISearch(input, log);
}
Test suite: unit tests for every non-trivial decision
The test suite covers the harness, keyword search utilities, and all numeric normalisation logic. Tests are pure unit tests — no mocks of the database or LLM APIs, no network calls. Vitest runs in under a second.
- harness.test.ts: 11 cases covering first-attempt success, confidence promotion, exhausted retries, thrown errors, null returns, maxRetries=0
- tools.test.ts: sanitizeString, normalizeMetric (Swedish thousands separator, ambiguous spaces), normalizeSharesMetric (tusental/thousand/million multipliers with double-scale guard), EPRA NRV derivation, market value per sqm, property value change %
- keywordSearch.test.ts: searchKeywords, pruneKeywordMatches (per-metric and total caps, deduplication), buildSnippetsForTier1, buildPagesForTier2
- All numeric edge cases are explicitly tested: division by zero, null inputs, negative values, ambiguous multi-value strings
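The normalisation logic those tests exercise can be illustrated with a minimal sketch. This is a hypothetical reduced version of `normalizeMetric` (the real one also handles multipliers and derived metrics per the bullets above): strip the Swedish space-style thousands separators, convert the decimal comma, and return null rather than guess on unparseable input.

```typescript
// Hypothetical sketch of Swedish-format number normalisation.
// "1 234,5" → 1234.5; "12 345" with a non-breaking space → 12345; "n/a" → null.
function normalizeMetric(raw: string): number | null {
  const cleaned = raw
    .replace(/[\s\u00A0\u202F]/g, "") // ordinary, non-breaking, narrow no-break spaces
    .replace(",", ".");               // Swedish decimal comma
  const value = Number(cleaned);
  return Number.isFinite(value) ? value : null;
}
```

Returning null instead of NaN (or a guessed value) is the property the edge-case tests pin down: a parse failure must surface as a missing metric, never as a wrong number in the database.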
Explore the other cases: Boardflow.