Internal · Datamart · v0.1 — 2026-05-20

A governed data platform where every number has a name, an owner, and a paper trail.

Partners contribute data they trust, a conversational agent answers questions, and quality issues surface before they cause bad decisions.

The problem

Why we're building this

No one trusts the answer.

When a number lands in a report, no one can quickly say where it came from, who uploaded it, or when. The time goes to debating the number in Slack threads instead of analyzing it.

Onboarding new data is bespoke every time.

Every new partner, dataset, or format triggers a one-off process. Validation is inconsistent. Schema drift goes unnoticed until it breaks a downstream report.

Finding data is harder than using it.

Discovery happens through tribal knowledge, even when the data already exists. People recreate datasets that are already in the system because they can't find them.

The opportunity

Four jobs, one platform

A data platform that is governed by design — not by retrofit — so trust, discovery, quality, and conversational access compose into one product instead of four bolted-together tools.

Trusted ingestion. Every file validated, owned, and audited on arrival.
Browseable catalog. Anyone can see what we hold, who owns it, and how clean it is.
Conversational access. Plain-English questions with traceable answers and citations.
Proactive quality. Drift, anomalies, and stale sources surface to owners before they corrupt decisions.

The 8-feature pipeline

Eight features. Five phases. One coherent platform.

Each feature is a black box with a clear value, a clear audience, and a clear effort.

Architecture

How the platform fits together

Five layers, top to bottom. Each layer is gated by the one above it; there is no shortcut around the gates. Every store is encrypted; every action lands in an append-only audit trail.

Figure 1. How the platform fits together (by layer, top-down)

1 · Authentication
Identity Provider: SSO + MFA · TLS everywhere
All Source Owners and Authorized Users sign in through one gate.

2 · Application surfaces
Upload + Validation
Catalog Dashboard + Data Dictionary
Chat Interface over the Conversational Agent
Quality Agent (scheduled · background)

3 · Authorization · every data access
Per-Source Permission Grant: filters to allowed Sources only · no global override

4 · Encrypted stores · audit append-only
Raw Files (Bronze tier)
Analytical Store (Silver tier)
Document Index
Metadata + Audit DB

5 · Outputs
Data Catalog → Authorized User
Data Dictionary → Authorized User
Data Quality → Authorized User
Data Analysis → Authorized User
BI / business queries / tables (future, not in v1)
Cited answer → Authorized User
Owner notification → Source Owner
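The authorization layer in Figure 1 can be pictured with a minimal sketch. It assumes a hypothetical `PermissionGrant` record and hypothetical helper names; the real platform keeps grants in its own store, so treat this only as an illustration of "filters to allowed Sources only, no global override".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PermissionGrant:
    """Hypothetical record: one user may read one Source."""
    user_id: str
    source_id: str


def allowed_sources(user_id: str, grants: list[PermissionGrant]) -> set[str]:
    """Return the Source ids granted to this user (empty set if none)."""
    return {g.source_id for g in grants if g.user_id == user_id}


def filter_to_allowed(
    user_id: str, requested: list[str], grants: list[PermissionGrant]
) -> list[str]:
    """Keep only the requested Sources the user holds a grant for.

    There is deliberately no admin flag or wildcard parameter: without a
    grant for a Source, the Source is dropped from the request.
    """
    allowed = allowed_sources(user_id, grants)
    return [s for s in requested if s in allowed]


# Illustrative data only.
grants = [
    PermissionGrant("alice", "sales_2025"),
    PermissionGrant("alice", "inventory"),
    PermissionGrant("bob", "inventory"),
]
print(filter_to_allowed("bob", ["sales_2025", "inventory"], grants))  # ['inventory']
```

Because the filter is the only path to a Source list, a user with no grants gets an empty result rather than an error that might reveal which Sources exist.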

How we build

Three principles, validated in production

The constitution has seven principles. Three of them matter most to a sponsor: they are the load-bearing claims behind every other design decision.

Audit by default, not audit as a feature.

Every action produces a durable record from day one; auditing is not added later under audit pressure. Why this matters: "Who uploaded this and when?" becomes a query, not a fire drill.
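To make "a query, not a fire drill" concrete, here is a minimal in-memory sketch of an append-only trail. The names `AuditEvent` and `AuditTrail` are hypothetical, and a real deployment would persist events in the Metadata + Audit DB rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """One immutable audit record (hypothetical shape)."""
    actor: str
    action: str
    target: str
    at: datetime


class AuditTrail:
    """Append-only log: events can be recorded and read, never edited."""

    def __init__(self) -> None:
        self._events: list[AuditEvent] = []

    def record(self, actor: str, action: str, target: str) -> AuditEvent:
        """Append a timestamped event and return it."""
        event = AuditEvent(actor, action, target, datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def history(self, target: str) -> tuple[AuditEvent, ...]:
        """Return every event for one target, oldest first, as a tuple."""
        return tuple(e for e in self._events if e.target == target)


trail = AuditTrail()
trail.record("alice", "upload", "q1_revenue.csv")
trail.record("bob", "read", "q1_revenue.csv")

# "Who uploaded this and when?" as a lookup.
uploads = [e for e in trail.history("q1_revenue.csv") if e.action == "upload"]
print(uploads[0].actor, uploads[0].at.isoformat())
```

The class exposes no update or delete method, and the frozen dataclass rejects attribute assignment, which is the property the principle relies on.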

Swappable backends.

No application code touches a vendor SDK directly; every external service sits behind a project-owned interface. Why this matters: Switching cloud, storage, or AI providers becomes a bounded migration rather than a rewrite. This claim was tested on 2026-05-20 when the platform migrated from AWS to Azure plus Snowflake.
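A project-owned interface of this kind might look like the sketch below. `BlobStore`, `InMemoryBlobStore`, and `store_upload` are hypothetical names chosen for illustration; the in-memory adapter stands in for whatever vendor-specific adapter the platform actually uses.

```python
from typing import Protocol


class BlobStore(Protocol):
    """Project-owned interface; application code depends only on this."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryBlobStore:
    """Stand-in adapter. A real adapter would wrap a vendor SDK here."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def store_upload(store: BlobStore, filename: str, payload: bytes) -> str:
    """Domain code: knows the interface, not the provider behind it."""
    key = f"raw/{filename}"
    store.put(key, payload)
    return key


store = InMemoryBlobStore()
key = store_upload(store, "partners.csv", b"id,name\n1,Acme\n")
print(key, len(store.get(key)))
```

Swapping providers under this design means writing a new class that satisfies `BlobStore`; `store_upload` and other domain functions stay as they are.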

Schema declared before persistence.

No data lands without a typed, owned, documented schema. Why this matters: Silent data-shape corruption — the most expensive class of bug — is prevented at the door.
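As an illustration of checking data "at the door", the sketch below validates every row against a declared schema before anything is written. The `Schema` and `Column` shapes, the `persist` helper, and the list used as a sink are all assumptions made for this example, not the platform's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    py_type: type
    description: str  # documentation is part of the declaration


@dataclass(frozen=True)
class Schema:
    source_id: str
    owner: str
    columns: tuple[Column, ...]


class SchemaViolation(ValueError):
    """Raised when a row does not match the declared schema."""


def validate_row(schema: Schema, row: dict) -> None:
    """Reject rows with missing, extra, or wrongly typed columns."""
    expected = {c.name for c in schema.columns}
    if set(row) != expected:
        raise SchemaViolation(
            f"columns {sorted(row)} do not match declared {sorted(expected)}"
        )
    for col in schema.columns:
        if not isinstance(row[col.name], col.py_type):
            raise SchemaViolation(
                f"{col.name}: expected {col.py_type.__name__}, "
                f"got {type(row[col.name]).__name__}"
            )


def persist(schema: Schema, rows: list[dict], sink: list[dict]) -> None:
    """Validate every row first; write nothing if any row fails."""
    for row in rows:
        validate_row(schema, row)
    sink.extend(rows)


revenue = Schema(
    source_id="q1_revenue",
    owner="alice",
    columns=(
        Column("region", str, "Sales region code"),
        Column("revenue", float, "Revenue in USD"),
    ),
)

sink: list[dict] = []
persist(revenue, [{"region": "EMEA", "revenue": 12.5}], sink)
print(len(sink))  # 1
```

Validating the whole batch before extending the sink means a single bad row leaves the store untouched instead of partially written.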

Phased delivery

Five phases. Each one ships value on its own.

The project can stop after any phase and still be useful. No phase is a half-bridge.

  1. Phase 1 · Planned

    Foundation

    Ships: Feature 001 File Ingestion. Trusted, audited, governed file uploads through an API.

  2. Phase 2 · Planned

    Visibility

    Ships: Features 002 Data Dictionary + 003 Catalog Dashboard. The data team stops being the human catalog.

  3. Phase 3 · Planned

    Connectors

    Ships: Feature 004 External Source Connectors. Partner data flows in without bespoke integrations.

  4. Phase 4 · Planned

    Semantic intelligence

    Ships: Features 005 + 006. Natural-language query with provable access control and per-answer provenance.

  5. Phase 5 · Planned

    Proactive quality + non-tabular reach

    Ships: Features 008 + 009 (MVP cuts first). Quality issues surface before damage; the platform reaches documents and media.

Risks + mitigations

What could go wrong, and how the architecture answers it

Five risks a thoughtful stakeholder will worry about. Each one is named — and each one has a specific design mechanism that mitigates it.

Risk: AI agent gives wrong or fabricated answers.
Mitigation: Constitution Principle VII: no AI feature ships without an eval suite. A Citation Verifier blocks fabricated citations at response time. GPT-4o-mini acts as judge, a different model than the generator, preventing self-evaluation bias.

Risk: AI agent silently mutates data.
Mitigation: Constitution Principle VI (Read-Only Agents, non-negotiable). Enforced at the Snowflake privilege layer: agent roles have no INSERT/UPDATE/DELETE/DDL grants. Even prompt injection cannot escape because the role has no privilege to write.

Risk: Partner data leaks across organizations.
Mitigation: Per-Source access control is the only authorization unit; no global override exists. External partners are deferred from v1 specifically to do isolation right (ADR-0002-v2). Permission Grants live in Postgres (authoritative), enforced in Snowflake via SCIM-synced Entra roles.

Risk: Vendor lock-in.
Mitigation: Every external dependency sits behind a project-owned interface (Constitution Principle IV). The AWS → Azure substrate transition (2026-05-20) validated this at the most painful possible level: domain code unchanged; only adapter modules rewritten.

Risk: Compliance posture insufficient for regulated data.
Mitigation: v1 is restricted to non-regulated data only. A hard code-level block prevents provisioning of regulated-class Sources until ADR-0044 (PII redaction) and ADR-0051 (Conversation deletion) ratify. Not a roadmap promise but an enforced gate.
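The compliance gate in the last risk above could be as small as the following sketch. The function name, the `"regulated"` / `"non_regulated"` labels, and the way ratified ADRs are passed in are assumptions for illustration; only the two ADR identifiers come from the mitigation text.

```python
# ADRs that must be ratified before regulated-class Sources may exist.
REQUIRED_ADRS = ("ADR-0044", "ADR-0051")


class ProvisioningBlocked(Exception):
    """Raised when a Source cannot be provisioned under current policy."""


def provision_source(name: str, data_class: str, ratified_adrs: set[str]) -> dict:
    """Create a Source record, refusing regulated data until the gate opens.

    `data_class` uses hypothetical labels: "regulated" or "non_regulated".
    """
    if data_class == "regulated":
        missing = [adr for adr in REQUIRED_ADRS if adr not in ratified_adrs]
        if missing:
            raise ProvisioningBlocked(
                f"regulated Source '{name}' blocked until ratified: "
                + ", ".join(missing)
            )
    return {"name": name, "data_class": data_class}


# Non-regulated data is always allowed in this sketch.
print(provision_source("store_hours", "non_regulated", set()))

# Regulated data is refused while either ADR is still unratified.
try:
    provision_source("patient_visits", "regulated", {"ADR-0044"})
except ProvisioningBlocked as exc:
    print(exc)
```

Putting the check inside the provisioning path, rather than in a review checklist, is what makes the gate enforced instead of promised.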

Acknowledged trade-offs: numeric thresholds for several cost/performance gates, CAD conversion (deferred — licensing), and specific frontend choices are intentionally deferred until their data arrives. We refuse to pre-commit to numbers we'd have to retract.

Common questions

What stakeholders ask, answered