Internal · Datamart · v0.1 — 2026-05-20

A governed data platform where every number has a name, an owner, and a paper trail.

Partners contribute data they trust, a conversational agent answers questions, and quality issues surface before they cause bad decisions.


The problem

Why we're building this

No one trusts the answer.

When a number lands in a report, no one can quickly say where it came from, who uploaded it, or when. Time goes to debating the number in Slack threads instead of analyzing it.

Onboarding new data is bespoke every time.

Every new partner, dataset, or format triggers a one-off process. Validation is inconsistent. Schema drift goes unnoticed until it breaks a downstream report.

Finding data is harder than using it.

Discovery happens through tribal knowledge, even when the data already exists. People recreate datasets that are already in the system because they can't find them.

The opportunity

Four jobs, one platform

A data platform that is governed by design — not by retrofit — so trust, discovery, quality, and conversational access compose into one product instead of four bolted-together tools.

Trusted ingestion. Every file validated, owned, and audited on arrival.

Browseable catalog. Anyone can see what we hold, who owns it, and how clean it is.

Conversational access. Plain-English questions with traceable answers and citations.

Proactive quality. Drift, anomalies, and stale sources surface to owners before they corrupt decisions.
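As an illustration of the "stale sources surface to owners" job, here is a minimal sketch in Python. It assumes each Source has a last-upload timestamp and a single owner; the function names, the dictionary shapes, and the seven-day window used in the usage note are hypothetical and not taken from the platform.

```python
from datetime import datetime, timedelta, timezone


def find_stale_sources(
    last_upload: dict[str, datetime], max_age: timedelta, now: datetime
) -> list[str]:
    """Return the ids of Sources whose latest upload is older than max_age."""
    return sorted(
        source_id
        for source_id, uploaded_at in last_upload.items()
        if now - uploaded_at > max_age
    )


def owner_notifications(stale: list[str], owners: dict[str, str]) -> list[tuple[str, str]]:
    """Pair each stale Source with its owner and a short message."""
    return [
        (owners[source_id], f"Source {source_id} has no upload inside the expected window")
        for source_id in stale
    ]
```

Called with a `max_age` of `timedelta(days=7)`, a Source last uploaded eight days ago would be reported to its owner, while one uploaded yesterday would not.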

The 8-feature pipeline

Eight features. Five phases. One coherent platform.

Each feature is a black box with a clear value, a clear audience, and a clear effort.

Architecture

How the platform fits together

Five layers, top to bottom. Each layer is gated by the one above it; there is no shortcut around the gates. Every store is encrypted; every action lands in an append-only audit trail.

Figure 1. How the platform fits together, by layer, top-down

1 · Authentication

Identity Provider: SSO + MFA · TLS everywhere

All Source Owners and Authorized Users sign in through one gate.

2 · Application surfaces

Upload + Validation

Catalog Dashboard + Data Dictionary

Chat Interface over the Conversational Agent

Quality Agent (scheduled · background)

3 · Authorization · every data access

Per-Source Permission Grant: filters to allowed Sources only · no global override
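A minimal sketch of what "filters to allowed Sources only" could look like in Python. The `PermissionGrant` shape and both function names are assumptions made for illustration. The point of the sketch is what is absent: there is no admin flag or bypass parameter, so the only way to see a Source is to hold a grant for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PermissionGrant:
    """Hypothetical grant record: one user may read one Source."""
    user_id: str
    source_id: str


def allowed_sources(user_id: str, grants: set[PermissionGrant]) -> set[str]:
    """Collect the Source ids this user holds a grant for."""
    return {g.source_id for g in grants if g.user_id == user_id}


def filter_to_allowed(
    user_id: str, requested: list[str], grants: set[PermissionGrant]
) -> list[str]:
    """Drop every requested Source the user has no grant for, keeping order."""
    permitted = allowed_sources(user_id, grants)
    return [source_id for source_id in requested if source_id in permitted]
```

Every surface in layer 2 would call a filter like this before touching a store, so a request for an ungranted Source returns nothing rather than an error that leaks its existence. That last behavior is a design choice of this sketch, not something the source states.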

4 · Encrypted stores · audit append-only

Raw Files (Bronze tier)

Analytical Store (Silver tier)

Document Index

Metadata + Audit DB

5 · Outputs

Data Catalog → Authorized User

Data Dictionary → Authorized User

Data Quality → Authorized User

Data Analysis → Authorized User (BI / business queries / tables; future, not in v1)

Cited answer → Authorized User

Owner notification → Source Owner

How we build

Three principles, validated in production

The constitution has seven principles. Three of them matter most to a sponsor: they are the load-bearing claims behind every other design decision.

Audit by default, not audit as a feature.

Every action produces a durable record from day one, rather than having records added later under audit pressure. Why this matters: "Who uploaded this and when?" becomes a query, not a fire drill.
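To make "becomes a query" concrete, here is a small Python sketch of an append-only audit trail. It is an in-memory stand-in, not the platform's Metadata + Audit DB; the `AuditRecord` fields and the `who_uploaded` helper are assumed names. Records are immutable and the log exposes no update or delete method.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    """One immutable entry in the trail (field names are illustrative)."""
    actor: str
    action: str
    target: str
    at: datetime


class AuditLog:
    """In-memory stand-in for an append-only audit table."""

    def __init__(self) -> None:
        self._records: list[AuditRecord] = []

    def append(
        self, actor: str, action: str, target: str, at: Optional[datetime] = None
    ) -> AuditRecord:
        """Add a record; there is deliberately no update or delete."""
        stamp = at if at is not None else datetime.now(timezone.utc)
        record = AuditRecord(actor, action, target, stamp)
        self._records.append(record)
        return record

    def who_uploaded(self, target: str) -> list[tuple[str, datetime]]:
        """Answer 'who uploaded this and when?' as a simple lookup."""
        return [
            (r.actor, r.at)
            for r in self._records
            if r.action == "upload" and r.target == target
        ]
```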

Swappable backends.

No application code touches a vendor SDK directly; every external service sits behind a project-owned interface. Why this matters: Switching cloud, storage, or AI providers becomes a bounded migration rather than a rewrite. This claim was tested on 2026-05-20 when the platform migrated from AWS to Azure plus Snowflake.
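The pattern described above can be sketched in Python with a structural interface. Everything here is hypothetical: `ObjectStore`, `InMemoryStore`, and `archive_raw_file` are illustrative names, and the in-memory adapter stands in for whatever cloud adapter the project actually uses. Domain code depends only on the interface, so a provider switch would mean writing a new adapter with the same two methods.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """Project-owned interface; domain code imports only this."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in adapter. A cloud adapter would wrap a vendor SDK here instead."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def archive_raw_file(store: ObjectStore, source_id: str, filename: str, data: bytes) -> str:
    """Domain function: knows the key layout, knows nothing about the vendor."""
    key = f"{source_id}/{filename}"
    store.put(key, data)
    return key
```

Because `archive_raw_file` accepts any object with matching `put` and `get` methods, tests can run against `InMemoryStore` while production passes a real adapter.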

Schema declared before persistence.

No data lands without a typed, owned, documented schema. Why this matters: Silent data-shape corruption — the most expensive class of bug — is prevented at the door.
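A minimal Python sketch of "prevented at the door", under the assumption that a schema carries a Source id, an owner, and typed, described columns. The class and function names are invented for illustration. The whole batch is validated before anything is written, so one bad row leaves the store untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: type
    description: str


@dataclass(frozen=True)
class Schema:
    """Typed, owned, documented: the three properties named above."""
    source_id: str
    owner: str
    columns: tuple[Column, ...]


class SchemaViolation(ValueError):
    """Raised when a row does not match the declared schema."""


def validate_row(schema: Schema, row: dict) -> None:
    """Reject rows with missing, extra, or wrongly typed columns."""
    expected = {c.name for c in schema.columns}
    if set(row) != expected:
        raise SchemaViolation(
            f"columns {sorted(row)} do not match declared {sorted(expected)}"
        )
    for column in schema.columns:
        if not isinstance(row[column.name], column.dtype):
            raise SchemaViolation(f"{column.name}: expected {column.dtype.__name__}")


def persist(schema: Schema, rows: list[dict], sink: list[dict]) -> int:
    """Validate every row first, then write; a bad batch writes nothing."""
    for row in rows:
        validate_row(schema, row)
    sink.extend(rows)
    return len(rows)
```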

Phased delivery

Five phases. Each one ships value on its own.

The project can stop after any phase and still be useful. No phase is a half-bridge.

Phase 1 · Planned

Foundation

Ships: Feature 001 File Ingestion. Trusted, audited, governed file uploads through an API.
Phase 2 · Planned

Visibility

Ships: Features 002 Data Dictionary + 003 Catalog Dashboard. The data team stops being the human catalog.
Phase 3 · Planned

Connectors

Ships: Feature 004 External Source Connectors. Partner data flows in without bespoke integrations.
Phase 4 · Planned

Semantic intelligence

Ships: Features 005 + 006. Natural-language query with provable access control and per-answer provenance.
Phase 5 · Planned

Proactive quality + non-tabular reach

Ships: Features 008 + 009 (MVP cuts first). Quality issues surface before damage; the platform reaches documents and media.

Risks + mitigations

What could go wrong, and how the architecture answers it

Five risks a thoughtful stakeholder will worry about. Each one is named — and each one has a specific design mechanism that mitigates it.

Risk	Mitigation
AI agent gives wrong or fabricated answers.	Constitution Principle VII: no AI feature ships without an eval suite. A Citation Verifier blocks fabricated citations at response time. GPT-4o-mini acts as judge — a different model than the generator, preventing self-evaluation bias.
AI agent silently mutates data.	Constitution Principle VI (Read-Only Agents, non-negotiable). Enforced at the Snowflake privilege layer: agent roles have no INSERT/UPDATE/DELETE/DDL grants. Even a successful prompt injection cannot write, because the role has no write privileges.
Partner data leaks across organizations.	Per-Source access control is the only authorization unit; no global override exists. External partners are deferred from v1 specifically to do isolation right (ADR-0002-v2). Permission Grants live in Postgres (authoritative), enforced in Snowflake via SCIM-synced Entra roles.
Vendor lock-in.	Every external dependency sits behind a project-owned interface (Constitution Principle IV). The AWS → Azure substrate transition (2026-05-20) validated this at the most painful possible level: domain code unchanged; only adapter modules rewritten.
Compliance posture insufficient for regulated data.	v1 is restricted to non-regulated data only. A hard code-level block prevents provisioning of regulated-class Sources until ADR-0044 (PII redaction) and ADR-0051 (Conversation deletion) are ratified. This is an enforced gate, not a roadmap promise.
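The Citation Verifier named in the first row of the table could work roughly as follows. This Python sketch is a guess at the idea, not the platform's implementation: it assumes citations appear in the answer as bracketed ids such as `[src-1]`, and that the verifier knows which record ids were actually retrieved for the question. Any cited id outside that set blocks the response.

```python
import re

# Assumed citation format: an id in square brackets, e.g. [src-1].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


class FabricatedCitation(Exception):
    """Raised when an answer cites something that was never retrieved."""


def unknown_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return cited ids that are not among the retrieved records."""
    cited = set(CITATION.findall(answer))
    return sorted(cited - retrieved_ids)


def release_or_block(answer: str, retrieved_ids: set[str]) -> str:
    """Pass the answer through only if every citation can be traced."""
    unknown = unknown_citations(answer, retrieved_ids)
    if unknown:
        raise FabricatedCitation(f"untraceable citations: {unknown}")
    return answer
```

A check like this runs at response time and is independent of the model judge: it does not assess whether the answer is right, only whether each citation points at something real.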

Acknowledged trade-offs: numeric thresholds for several cost/performance gates, CAD conversion (deferred — licensing), and specific frontend choices are intentionally deferred until their data arrives. We refuse to pre-commit to numbers we'd have to retract.

The eight features

File Ingestion Sources

Data Dictionary

Catalog Dashboard

External Source Connectors

Semantic Schema Map

Conversational Data Agent

Data Quality + Opportunity Agent

Non-Tabular Sources

Common questions

What stakeholders ask, answered