Case 01
Trappan
RAG-grounded extraction. Tiered fallbacks. Production observability.
- TypeScript
- OpenAI API
- Supabase
- pgvector
- Langfuse
- Vitest
- Docker
Overview
Trappan automatically finds and reads quarterly financial reports for Swedish listed real estate companies, then extracts 20+ standardised metrics (EPRA NRV, LTV, occupancy rate, etc.) into a structured database. The hard part is that every company uses different terminology, buries figures in different sections, and publishes in Swedish. The system solves this with a definitions-aware RAG layer that maps company-specific labels to canonical metrics before extraction runs. Everything is built from scratch in TypeScript — no LangChain, no LangGraph.
Contribution: Full-stack design and implementation — orchestrator, harness, RAG pipeline, extraction tiers, observability
Agent surface
- Custom retry harness with confidence-aware best-result tracking across attempts
- RAG-grounded term resolution via pgvector: definition blocks extracted from the PDF, embedded, and matched to a canonical metrics knowledge base
- Two-tier extraction (vision LLM on keyword-matched pages → web search fallback) with explicit confidence escalation, reducing cost on the happy path
Orchestrator: four explicit phases
The orchestrator sequences four phases — cache check, report discovery, extraction, and persistence — with job state written to the database between each step. This means every run is resumable and auditable. A Langfuse trace is opened at the start and a span is attached to each phase, so the full execution tree is visible in the observability dashboard without any extra instrumentation.
- Phase 1 (Cache): checks both a freshness-based metric cache and a date-specific data cache — skips extraction entirely if all requested metrics are available
- Phase 2 (Discovery): delegates to the report finder, which runs a 3-method cascade to locate the correct quarterly PDF
- Phase 3 (Extraction): runs the tiered parser; writes per-run stats including token counts, latency, and which metrics were extracted vs. missed
- Phase 4 (Persist): writes artifact records and a structured run log; the orchestrator returns a typed ExtractionSignal (clean | partial_fallback | full_fallback | failed) for the caller to act on
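The per-phase Langfuse span wiring described above can be sketched with a small wrapper. This is a minimal sketch: the tracer interface below is a simplified stand-in for the Langfuse client, and `withPhaseSpan` is an illustrative name, not project code.

```typescript
// Simplified tracer interface standing in for the Langfuse client (illustrative).
interface Span { end(output?: unknown): void }
interface Trace { span(opts: { name: string }): Span }

// Wrap one orchestrator phase in its own span so the execution tree shows
// cache → discovery → extraction → persistence per run, including failures.
async function withPhaseSpan<T>(
  trace: Trace,
  name: string,
  phase: () => Promise<T>
): Promise<T> {
  const span = trace.span({ name });
  try {
    const result = await phase();
    span.end(result);
    return result;
  } catch (err) {
    span.end({ error: String(err) }); // failed phases still close their span
    throw err;
  }
}
```

Because every phase runs through the same wrapper, no phase needs its own instrumentation code, which is what keeps the dashboard view complete "for free".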
export async function runOrchestrator(
  company: CompanyInput,
  metrics: MetricDef[],
  options?: { cacheMaxAgeHours?: number },
  runContext?: RunContext
): Promise<OrchestratorResult> {
  // Phase 1 — return early if all requested metrics are fresh
  // (metricKeys and missingMetrics are derived from `metrics` and `cached`; elided here)
  const cached = await getCachedMetrics(company.id, metricKeys);
  if (missingMetrics.length === 0) return finalize({ signal: "clean", ... });

  // Phase 2 — find the report PDF
  const discovery = await runWithRetry(
    (note) => findReport({ ...input, _note: note }, { logger: log }),
    { isSuccess: (o) => !!o?.report_url, label: `reportsFinder:${company.id}` }
  );

  // Phase 3 — extract missing metrics
  const { outputs, steps } = await parseMetrics(missingMetrics, reportUrl, ctx);
  await upsertMetrics(company.id, reportUrl, outputs);

  // Phase 4 — persist and return
  return finalize({ signal: deriveSignal(steps, failed), ... });
}
Custom retry harness
Every agent call in the system is wrapped in a generic retry harness that tracks the best result across all attempts — not just the last one. This matters because LLM outputs are non-deterministic: a third attempt might be worse than the first. The harness ranks outputs by a caller-supplied confidence label and promotes only when the new result is strictly better.
- Caller provides isSuccess() and getConfidence() — harness is completely domain-agnostic
- Exponential backoff on transport errors; immediate short-circuit on semantic failures the caller flags as non-retryable
- On retry, a retryNote string is injected into the next LLM call — letting the model know what went wrong without changing the prompt structure
- Test suite covers: first-attempt success, confidence promotion, all-attempts-fail, null returns, and thrown errors
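The confidence comparison the harness relies on can be sketched as a small ranking helper. This is an assumption about how `confidenceRank` might look, using the label set mentioned elsewhere in this document ("high" | "medium" | "low" | "none"); the project's actual ordering may differ.

```typescript
// Hypothetical ranking used to compare attempts; unknown or missing
// labels rank below everything else so they can never displace a result.
const CONFIDENCE_ORDER = ["none", "low", "medium", "high"] as const;

function confidenceRank(label: string | null): number {
  if (label === null) return -1;
  // indexOf returns -1 for labels outside the known set — same as null
  return CONFIDENCE_ORDER.indexOf(label as (typeof CONFIDENCE_ORDER)[number]);
}
```

A strict `>` comparison on these ranks is what makes promotion conservative: a retry that merely ties the current best never replaces it.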
export async function runWithRetry<T>(
  fn: (note: string | null) => Promise<T | null>,
  options: HarnessOptions<T>
): Promise<HarnessResult<T>> {
  const { isSuccess, getConfidence } = options;
  let best: T | null = null;
  let bestConfidence: string | null = null;
  for (let attempt = 1; attempt <= totalAttempts; attempt++) {
    const output = await fn(attempt === 1 ? null : retryNote);
    const confidence = output === null ? null : getConfidence(output);
    // Promote to best only if strictly higher confidence
    const promoteToBest =
      best === null ||
      (isSuccess(output) && !isSuccess(best)) ||
      (isSuccess(output) === isSuccess(best) &&
        confidenceRank(confidence) > confidenceRank(bestConfidence));
    if (promoteToBest) { best = output; bestConfidence = confidence; }
    if (isSuccess(best)) return { output: best, attempts: attempt, succeeded: true };
    if (attempt < totalAttempts) await sleep(backoff(attempt)); // exponential, skipped on last attempt
  }
  return { output: best, attempts: totalAttempts, succeeded: false };
}
Definitions-aware RAG: grounding extraction in the company's own glossary
Swedish real estate companies each define their key metrics slightly differently — one calls it 'Substansvärde per aktie', another uses 'Långsiktigt substansvärde', both meaning EPRA NRV. If extraction runs against generic keyword lists, it misses matches or picks up the wrong values. The solution: before any extraction runs, the system finds the definition pages in the PDF, renders each page as an image, and runs a vision LLM to extract definition blocks (title + description pairs). Each block description is then embedded with text-embedding-3-small and searched against a pgvector table in Supabase containing canonical definitions for all 20+ metrics — in both Swedish and English. When a match exceeds the similarity threshold (0.78), the company-specific label is prepended to that metric's keyword list. Extraction then runs against these resolved terms, not generic defaults.
- Per-metric guardrails enforce correctness beyond embedding similarity — e.g. EPRA NRV matches only if the description explicitly says 'per aktie' or 'per share', blocking company-level totals from being promoted
- Falls back gracefully to default keyword lists if no definition pages are found or no matches exceed the threshold
- pgvector match_metric_definitions RPC handles both similarity scoring and canonical key lookup in a single query
- Definition resolution is traced as its own Langfuse span — resolved vs. defaulted keys are visible per run
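The embed-then-match lookup behind the resolution loop can be sketched as follows. This is a sketch under stated assumptions: the client interfaces below are simplified stand-ins for the OpenAI and Supabase SDKs (passed explicitly here for testability, where the project presumably closes over real clients), and the RPC argument names follow the common pgvector match-function convention rather than the project's exact signature.

```typescript
// Simplified client interfaces standing in for the OpenAI and Supabase SDKs.
interface EmbeddingClient {
  embed(input: string): Promise<number[]>; // e.g. text-embedding-3-small
}
interface DefinitionMatch {
  canonical_key: string;
  similarity: number;
}
interface RpcClient {
  rpc(fn: string, args: Record<string, unknown>): Promise<{ data: DefinitionMatch[] | null }>;
}

// Embed a definition-block description and look up the closest canonical
// metrics via the match_metric_definitions RPC, ordered by similarity.
async function searchDefinitions(
  description: string,
  embeddings: EmbeddingClient,
  db: RpcClient,
  threshold = 0.78 // DEFINITION_MATCH_THRESHOLD from the text
): Promise<DefinitionMatch[]> {
  const queryEmbedding = await embeddings.embed(description);
  const { data } = await db.rpc("match_metric_definitions", {
    query_embedding: queryEmbedding,
    match_threshold: threshold,
    match_count: 3,
  });
  return data ?? [];
}
```

Pushing both the similarity scoring and the canonical-key lookup into one RPC keeps the hot loop at a single round trip per definition block.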
// Embed the extracted block description, search for matching canonical metric
for (const block of blocks) {
  const matches = await searchDefinitions(block.description); // pgvector cosine search
  const best = matches[0];
  if (!best || best.similarity < DEFINITION_MATCH_THRESHOLD) continue;
  if (!passesGuardrails(best.canonical_key, block)) continue; // e.g. EPRA NRV must say "per aktie"
  // Prepend company label — extraction uses this as its primary search term
  resolved[best.canonical_key] = [block.title, ...defaults];
}
Tiered extraction: cost-efficient by default, accurate under pressure
Tier 1 runs a keyword scan across all PDF pages to locate candidate pages, renders those pages as images, and sends them to a vision LLM (gpt-4o-mini). Only pages that actually contain a resolved keyword are rendered — typically 3–8 pages out of 80–120. If confidence comes back as 'high' or 'medium', the pipeline stops. If Tier 1 returns 'low' or 'none', Tier 2 fires: OpenAI's Responses API with web_search_preview browses the public PDF URL directly and extracts the missing metrics. This model is stronger (gpt-4o) and slower. The dual-tier design means most runs pay Tier 1 cost; Tier 2 is reserved for the hard cases.
- Tier 1: keyword scan → candidate pages → render images → vision extraction (gpt-4o-mini)
- Tier 2: web_search_preview on public PDF URL → text-based full-report extraction (gpt-4o)
- Definition pages are excluded from candidate page selection — prevents the model from confusing a metric definition with a reported value
- ExtractionSignal (clean | partial_fallback | full_fallback | failed) tells the caller exactly what happened, enabling downstream decisions (e.g. alert, use stale cache, skip)
- Per-run stats record token counts and latency for both tiers — enabling cost attribution across companies and time periods
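One plausible shape for the ExtractionSignal derivation the bullets describe is sketched below. The signal union comes from the text; the `TierStep` shape and the mapping rules in `deriveSignal` are illustrative assumptions, and the project's exact rules may differ.

```typescript
// The typed outcome the orchestrator returns to its caller (from the text).
type ExtractionSignal = "clean" | "partial_fallback" | "full_fallback" | "failed";

// Illustrative per-tier record; the project's real step type is richer
// (token counts, latency, per-metric hit lists).
interface TierStep {
  tier: "tier1" | "tier2";
  accepted: boolean;       // tier1: confidence came back high/medium
  extractedCount: number;  // metrics this tier produced
}

// Sketch: clean if tier1 alone sufficed; partial vs full fallback depends
// on whether tier1 contributed anything before tier2 fired.
function deriveSignal(steps: TierStep[]): ExtractionSignal {
  const total = steps.reduce((n, s) => n + s.extractedCount, 0);
  if (total === 0) return "failed";
  const usedFallback = steps.some((s) => s.tier === "tier2");
  if (!usedFallback) return "clean";
  const tier1Count = steps.find((s) => s.tier === "tier1")?.extractedCount ?? 0;
  return tier1Count > 0 ? "partial_fallback" : "full_fallback";
}
```

The value of the union over a boolean is that each label maps to a distinct downstream action: "clean" persists silently, the fallback variants can raise cost alerts, and "failed" can fall back to stale cache.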
export async function parseMetrics(metrics, reportUrl, ctx) {
  const pdfData = await loadPdf(reportUrl);

  // Tier 1: keyword scan → render candidate pages → vision LLM
  const tier1 = await runTier1(pdfData, metrics, ctx);
  if (tier1.step.accepted) { // confidence === "high" | "medium"
    return buildResult(pdfData, tier1.outputs, [tier1.step], "tier1");
  }

  // Tier 2: web_search_preview on the live PDF URL
  const tier2 = await runTier2(reportUrl, metrics, tier1.resolvedTerms, ctx);
  return buildResult(pdfData, tier2.outputs, [tier1.step, tier2.step], "tier2_fallback");
}
Report discovery: three-method cascade
Before extraction can run, the system needs to find the correct quarterly PDF for a given company and date. This is genuinely hard: URLs are not standardised, reports are published on company IR pages, CDNs, and IR aggregators, and getting the wrong period (e.g. last year's annual instead of this quarter's interim) would silently corrupt the data. The discovery agent runs three methods in order, stopping at the first success.
- Method 1 — mfn.se: an LLM agent navigates the Swedish IR press release aggregator with a fetch_page tool, inferring the company slug, filtering to the correct year and report type, and returning a direct PDF URL. Correctly scoped to a single period every time.
- Method 2 — Company website crawl: same agent pattern, restricted to the company's own domain. Follows investor relations links to the reports listing page.
- Method 3 — OpenAI web search: last resort, uses web_search_preview to search for the report and extract a PDF URL from the results.
- Period validation runs on every candidate URL: file path years are checked against the target date, and a stale URL (>1 year behind) is rejected before it can poison extraction
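The year check on candidate URLs could be sketched like this. It is a minimal illustration of the rule stated above (reject URLs whose path years lag the target by more than a year); the function name and exact heuristics are assumptions, not project code.

```typescript
// Illustrative period plausibility check on a candidate report URL.
function isPeriodPlausible(candidateUrl: string, targetDate: Date): boolean {
  // Collect all four-digit year tokens from the URL path (e.g. /2024/q3-2024.pdf)
  const years = [...candidateUrl.matchAll(/\b(20\d{2})\b/g)].map((m) => Number(m[1]));
  if (years.length === 0) return true; // no year in the path — cannot reject on this signal alone
  const newest = Math.max(...years);
  // >1 year behind the target date → stale report, reject before extraction
  return targetDate.getFullYear() - newest <= 1;
}
```

Running this on every candidate, regardless of which discovery method produced it, is what keeps a confidently-wrong URL from silently poisoning the database.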
export async function findReport(input, options): Promise<ReportsFinderOutput> {
  // Method 1: mfn.se — structured IR hub, period-accurate
  const mfnResult = await findViaMfn(input, log);
  if (mfnResult.report_url) return mfnResult;

  // Method 2: company website crawl
  const crawlResult = await findViaCrawl(input, log);
  if (crawlResult.report_url) return crawlResult;

  // Method 3: OpenAI web search (last resort)
  return findViaOpenAISearch(input, log);
}
Test suite: unit tests for every non-trivial decision
The test suite covers the harness, keyword search utilities, and all numeric normalisation logic. Tests are pure unit tests — no mocks of the database or LLM APIs, no network calls. Vitest runs in under a second.
- harness.test.ts: 11 cases covering first-attempt success, confidence promotion, exhausted retries, thrown errors, null returns, maxRetries=0
- tools.test.ts: sanitizeString, normalizeMetric (Swedish thousands separator, ambiguous spaces), normalizeSharesMetric (tusental/thousand/million multipliers with double-scale guard), EPRA NRV derivation, market value per sqm, property value change %
- keywordSearch.test.ts: searchKeywords, pruneKeywordMatches (per-metric and total caps, deduplication), buildSnippetsForTier1, buildPagesForTier2
- All numeric edge cases are explicitly tested: division by zero, null inputs, negative values, ambiguous multi-value strings
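The normalisation logic those tests exercise can be illustrated with a minimal sketch. This is a hypothetical reduced version of `normalizeMetric` (the real one also handles multipliers and derived metrics per the bullets above): strip the Swedish space-style thousands separators, convert the decimal comma, and return null rather than guess on unparseable input.

```typescript
// Hypothetical sketch of Swedish-format number normalisation.
// "1 234,5" → 1234.5; "12 345" with a non-breaking space → 12345; "n/a" → null.
function normalizeMetric(raw: string): number | null {
  const cleaned = raw
    .replace(/[\s\u00A0\u202F]/g, "") // ordinary, non-breaking, narrow no-break spaces
    .replace(",", ".");               // Swedish decimal comma
  const value = Number(cleaned);
  return Number.isFinite(value) ? value : null;
}
```

Returning null instead of NaN (or a guessed value) is the property the edge-case tests pin down: a parse failure must surface as a missing metric, never as a wrong number in the database.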
Explore the other cases: Boardflow.