Internal · Datamart · v0.1 — 2026-05-20
A governed data platform where every number has a name, an owner, and a paper trail.
Partners contribute data they trust, a conversational agent answers questions, and quality issues surface before they cause bad decisions.
The problem
Why we're building this
No one trusts the answer.
When a number lands in a report, no one can quickly say where it came from, who uploaded it, or when. Debate happens in Slack threads instead of analysis.
Onboarding new data is bespoke every time.
Every new partner, dataset, or format triggers a one-off process. Validation is inconsistent. Schema drift goes unnoticed until it breaks a downstream report.
Finding data is harder than using it.
Discovery happens through tribal knowledge, even when the data already exists. People recreate datasets that are already in the system because they can't find them.
The opportunity
Four jobs, one platform
A data platform that is governed by design — not by retrofit — so trust, discovery, quality, and conversational access compose into one product instead of four bolted-together tools.
The 8-feature pipeline
Eight features. Five phases. One coherent platform.
Each feature is a black box with a clear value, a clear audience, and a clear effort. Each card below lists what the feature does, who benefits, and what it depends on.
001 File Ingestion
- What it does: Accepts CSV, JSON, JSONL, Parquet, and Excel; validates against the declared schema; rejects malformed rows with reason codes; writes an immutable Audit Record for every state change.
- Who benefits: Contributors (one consistent upload path), auditors (every action is queryable), downstream consumers (every record traceable to its origin).
- Depends on: Nothing; this is the foundation everything else builds on.
- Source: .specify/specs/001-file-ingestion-sources/spec.md
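The validate-and-reject behaviour described for File Ingestion can be sketched as follows. This is a minimal illustration under assumptions, not the contents of spec 001: the `Schema` shape, the `Rejection` type, and the reason codes `MISSING_COLUMN`, `UNDECLARED_COLUMN`, and `TYPE_MISMATCH` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical declared schema: column name -> parser that raises ValueError on bad input.
Schema = dict[str, Callable[[str], Any]]


@dataclass(frozen=True)
class Rejection:
    row_number: int
    reason_code: str  # illustrative codes, not taken from spec 001
    detail: str


def validate_rows(rows: list[dict[str, str]], schema: Schema):
    """Split rows into accepted (typed) records and rejections with reason codes."""
    accepted: list[dict[str, Any]] = []
    rejected: list[Rejection] = []
    for number, row in enumerate(rows, start=1):
        missing = sorted(set(schema) - set(row))
        if missing:
            rejected.append(Rejection(number, "MISSING_COLUMN", ",".join(missing)))
            continue
        extra = sorted(set(row) - set(schema))
        if extra:
            rejected.append(Rejection(number, "UNDECLARED_COLUMN", ",".join(extra)))
            continue
        try:
            accepted.append({name: parse(row[name]) for name, parse in schema.items()})
        except ValueError as exc:
            rejected.append(Rejection(number, "TYPE_MISMATCH", str(exc)))
    return accepted, rejected


schema: Schema = {"partner_id": str, "amount": float}
rows = [
    {"partner_id": "p-1", "amount": "12.50"},
    {"partner_id": "p-2", "amount": "twelve"},
    {"partner_id": "p-3"},
]
accepted, rejected = validate_rows(rows, schema)
```

Here the first row is accepted with `amount` parsed to a float, while the second and third are rejected with a reason code and a row number, so a contributor can see exactly which rows failed and why.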
002 Data Dictionary
- What it does: Catalogs every field with its type, definition, owner, sample values, freshness, and quality posture. Lineage links back to the originating Source.
- Who benefits: Analysts (self-serve column definitions), partner liaisons (clear ownership), engineers (no more "what does this column mean?" pings).
- Depends on: 001 File Ingestion (catalog needs validated columns to describe).
- Source: .specify/specs/002-data-dictionary/spec.md
003 Catalog Dashboard
- What it does: Browsable UI over the catalog metadata: filter by owner, recency, freshness, quality posture. Drill down to the source files and audit history.
- Who benefits: Leadership (one-glance posture), ops (find stale Sources), partner-facing teams (show partners their own data without engineering involvement).
- Depends on: 001 File Ingestion + 002 Data Dictionary.
- Source: .specify/specs/003-catalog-dashboard/spec.md
004 External Source Connectors
- What it does: Scheduled pull from partner APIs, databases, or SFTP into the same ingestion pipeline as uploads. Same validation, same audit, same governance.
- Who benefits: Integration team (no bespoke ETL per partner), API-equipped partners (no manual uploads).
- Depends on: 001 File Ingestion (reuses the Source + validation model).
- Source: .specify/specs/004-external-source-connectors/spec.md
005 Semantic Schema Map
- What it does: Glossary, synonyms, join hints, and example queries: the layer that maps "monthly recurring revenue" to the actual columns and joins. A learning layer adds ratified interactions over time.
- Who benefits: Powers the agent (006); becomes the substrate for any future BI tooling.
- Depends on: 002 Data Dictionary (you can't map vocabulary to columns that aren't yet cataloged).
- Source: .specify/specs/005-semantic-schema-map/spec.md
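One way to picture the glossary-plus-synonyms lookup is the sketch below. It is illustrative only: the `Term` type, the `resolve` helper, and the table and column names (`subscriptions.monthly_amount`, `accounts.id`) are hypothetical and not taken from spec 005.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    name: str
    synonyms: tuple[str, ...]
    columns: tuple[str, ...]  # hypothetical column identifiers
    join_hint: str = ""       # hypothetical join condition


GLOSSARY = [
    Term(
        name="monthly recurring revenue",
        synonyms=("mrr",),
        columns=("subscriptions.monthly_amount",),
        join_hint="subscriptions.account_id = accounts.id",
    ),
]


def resolve(phrase: str, glossary: list[Term]) -> Optional[Term]:
    """Return the glossary Term whose name or synonym matches the phrase, if any."""
    wanted = phrase.strip().lower()
    for term in glossary:
        if wanted == term.name or wanted in term.synonyms:
            return term
    return None


hit = resolve("MRR", GLOSSARY)
miss = resolve("churn", GLOSSARY)
```

A phrase that is not in the glossary resolves to `None` rather than a guess, which fits the requirement that the agent refuse questions it cannot ground in cataloged columns.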
006 Conversational Agent
- What it does: Multi-turn conversation. Generates SQL, cites every fact, and refuses requests outside its allowed scope. No approval step for mutating queries is needed because there are none: agents are read-only.
- Who benefits: Anyone with an authorized question: analysts, PMs, sponsors, owners.
- Depends on: 005 Semantic Schema Map + 001 File Ingestion.
- Source: .specify/specs/006-conversational-data-agent/spec.md
008 Data Quality Agent
- What it does: Scheduled detectors run nightly. Findings land in a per-owner queue. Critical staleness triggers an alert within 60 seconds. An Inspector chat lets owners ask deeper questions about a Finding.
- Who benefits: Source owners (see issues before being told), ops (proactive vs reactive), leadership (data quality posture as a metric).
- MVP cut: Scheduled mode + Finding Queue + 3 critical Detectors ship first (L). Inspector + 6 more Detectors follow in v1.1.
- Depends on: 002 Data Dictionary + 006 Conversational Agent.
- Source: .specify/specs/008-data-quality-agent/spec.md
009 Non-Tabular Sources
- What it does: Adds PDFs, raster images, SVG, audio, and video. Extracts text, captions images, transcribes audio. Image vision is gated behind explicit user approval. EXIF GPS is stripped in-memory before any storage write.
- Who benefits: Anyone whose data lives in documents or media, such as partner reports, equipment photos, and recorded calls.
- MVP cut: PDFs + raster images ship first (L). SVG, audio, and video follow in v1.1 and v1.2. CAD deferred (licensing).
- Out of scope in v1: Face-containing images blocked (biometric risk); non-English audio/video transcription; CAD.
- Depends on: 001 + 005 + 006.
- Source: .specify/specs/009-non-tabular-sources/spec.md
Architecture
How the platform fits together
Five layers, top to bottom. Each layer is gated by the one above it; there is no shortcut around the gates. Every store is encrypted; every action lands in an append-only audit trail.
[Architecture diagram: five layers, top to bottom. Surviving labels: SSO + MFA · TLS everywhere (all Source Owners and Authorized Users sign in through one gate); Validation; Data Dictionary; Conversational Agent; scheduled · background; filters to allowed Sources only · no global override; Bronze tier; Silver tier; BI / business queries / tables (future, not in v1).]
How we build
Three principles, validated in production
The constitution has seven principles. Three of them are the ones a sponsor most cares about — they're the load-bearing claims behind every other design decision.
Audit by default, not audit as a feature.
Every action produces a durable record from day one, not added under audit pressure later. Why this matters: "Who uploaded this and when?" becomes a query, not a fire drill.
Swappable backends.
No application code touches a vendor SDK directly; every external service sits behind a project-owned interface. Why this matters: Switching cloud, storage, or AI providers becomes a bounded migration rather than a rewrite. This claim was tested on 2026-05-20 when the platform migrated from AWS to Azure plus Snowflake.
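A minimal sketch of what "every external service sits behind a project-owned interface" can look like in practice. The names `BlobStore`, `InMemoryBlobStore`, and `archive_upload` are hypothetical and used only for illustration; the platform's actual interfaces are not described in this document.

```python
from typing import Protocol


class BlobStore(Protocol):
    """Project-owned interface (hypothetical name); domain code depends only on this."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryBlobStore:
    """Stand-in adapter. A real adapter would wrap a vendor SDK behind the same methods."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def archive_upload(store: BlobStore, source_id: str, filename: str, data: bytes) -> str:
    """Domain code: it knows the interface, never the vendor behind it."""
    key = f"{source_id}/{filename}"
    store.put(key, data)
    return key


store = InMemoryBlobStore()
key = archive_upload(store, "src-001", "jan.csv", b"a,b\n1,2\n")
```

Under this arrangement, changing providers means writing a new class with the same two methods; `archive_upload` and the rest of the domain code stay untouched.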
Schema declared before persistence.
No data lands without a typed, owned, documented schema. Why this matters: Silent data-shape corruption — the most expensive class of bug — is prevented at the door.
Phased delivery
Five phases. Each one ships value on its own.
The project can stop after any phase and still be useful. No phase is a half-bridge.
- Phase 1 · Planned · Foundation
  Ships: Feature 001 File Ingestion. Trusted, audited, governed file uploads through an API.
- Phase 2 · Planned · Visibility
  Ships: Features 002 Data Dictionary + 003 Catalog Dashboard. The data team stops being the human catalog.
- Phase 3 · Planned · Connectors
  Ships: Feature 004 External Source Connectors. Partner data flows in without bespoke integrations.
- Phase 4 · Planned · Semantic intelligence
  Ships: Features 005 + 006. Natural-language query with provable access control and per-answer provenance.
- Phase 5 · Planned · Proactive quality + non-tabular reach
  Ships: Features 008 + 009 (MVP cuts first). Quality issues surface before damage; the platform reaches documents and media.
Risks + mitigations
What could go wrong, and how the architecture answers it
Five risks a thoughtful stakeholder will worry about. Each one is named — and each one has a specific design mechanism that mitigates it.
| Risk | Mitigation |
|---|---|
| AI agent gives wrong or fabricated answers. | Constitution Principle VII: no AI feature ships without an eval suite. A Citation Verifier blocks fabricated citations at response time. GPT-4o-mini acts as judge — a different model than the generator, preventing self-evaluation bias. |
| AI agent silently mutates data. | Constitution Principle VI (Read-Only Agents, non-negotiable). Enforced at the Snowflake privilege layer — agent roles have no INSERT/UPDATE/DELETE/DDL grants. Even prompt injection cannot escape because the role has no privilege to write. |
| Partner data leaks across organizations. | Per-Source access control is the only authorization unit; no global override exists. External partners are deferred from v1 specifically to do isolation right (ADR-0002-v2). Permission Grants live in Postgres (authoritative), enforced in Snowflake via SCIM-synced Entra roles. |
| Vendor lock-in. | Every external dependency sits behind a project-owned interface (Constitution Principle IV). The AWS → Azure substrate transition (2026-05-20) validated this at the most painful possible level: domain code unchanged; only adapter modules rewritten. |
| Compliance posture insufficient for regulated data. | v1 is restricted to non-regulated data only. A hard code-level block prevents provisioning of regulated-class Sources until ADR-0044 (PII redaction) and ADR-0051 (Conversation deletion) ratify. Not a roadmap promise — an enforced gate. |
Acknowledged trade-offs: numeric thresholds for several cost/performance gates, CAD conversion (deferred — licensing), and specific frontend choices are intentionally deferred until their data arrives. We refuse to pre-commit to numbers we'd have to retract.
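The per-Source authorization rule from the risk table (one grant per user per Source, no global override) can be sketched as below. This is an application-level illustration only, with hypothetical names (`PermissionGrant`, `filter_to_allowed`); it does not show the Postgres, Snowflake, or Entra enforcement the table describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PermissionGrant:
    user_id: str
    source_id: str


def allowed_sources(user_id: str, grants: list[PermissionGrant]) -> frozenset[str]:
    """Collect the Sources this user has an explicit grant for."""
    return frozenset(g.source_id for g in grants if g.user_id == user_id)


def filter_to_allowed(user_id: str, requested: list[str], grants: list[PermissionGrant]):
    """Return (permitted, denied). There is deliberately no admin or bypass parameter."""
    allowed = allowed_sources(user_id, grants)
    permitted = [s for s in requested if s in allowed]
    denied = [s for s in requested if s not in allowed]
    return permitted, denied


grants = [PermissionGrant("u-1", "src-001"), PermissionGrant("u-2", "src-002")]
permitted, denied = filter_to_allowed("u-1", ["src-001", "src-002"], grants)
```

Because the function signature has no override flag, the only way to widen access is to add a grant, which is itself an auditable state change.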
Common questions
What stakeholders ask, answered
Every state change — uploads, schema edits, permission grants, agent queries, quality findings — produces a durable, queryable, immutable Audit Record. There are 14 distinct audit categories, each capturing who acted, when, and what changed, with cryptographic content hashes where applicable.
- 14 audit kinds, one append-only Postgres table.
- No edits, no deletes — enforced at the database role level (Constitution Principle I).
- 365-day production retention for audit-path; 30-day for operational logs (ADR-0013-v2).
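As a rough illustration of an append-only record with a content hash, consider the sketch below. It is an in-memory stand-in under assumptions: the field names, the `AuditLog` class, and the `"upload"` kind are hypothetical, and the real guarantee described above comes from database role permissions, not from Python.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AuditRecord:
    kind: str  # one audit category; "upload" below is illustrative
    actor: str
    occurred_at: datetime
    summary: str
    content_sha256: Optional[str] = None  # only where a content hash applies


class AuditLog:
    """In-memory stand-in that exposes append and read only; no update or delete."""

    def __init__(self) -> None:
        self._records: list[AuditRecord] = []

    def append(self, record: AuditRecord) -> None:
        self._records.append(record)

    def by_actor(self, actor: str) -> tuple[AuditRecord, ...]:
        return tuple(r for r in self._records if r.actor == actor)


def record_upload(log: AuditLog, actor: str, filename: str, content: bytes) -> AuditRecord:
    """Append an upload record carrying the SHA-256 of the uploaded bytes."""
    record = AuditRecord(
        kind="upload",
        actor=actor,
        occurred_at=datetime.now(timezone.utc),
        summary=f"uploaded {filename}",
        content_sha256=hashlib.sha256(content).hexdigest(),
    )
    log.append(record)
    return record


log = AuditLog()
rec = record_upload(log, "alice@example.com", "jan.csv", b"a,b\n1,2\n")
```

With records shaped like this, "who uploaded this and when?" reduces to a lookup such as `log.by_actor("alice@example.com")`, and the stored hash lets anyone confirm the file has not changed since upload.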
Owners declare an expected cadence (daily / weekly / monthly / cron) when creating a Source. A daily Temporal Completeness Detector compares actual upload rhythm against that declaration and emits a Finding when it slips. Critical staleness fires an alert to the owner within 60 seconds.
- Owner-declared expected_update_cadence field on every Source (spec 001 FR-002).
- Daily Temporal Completeness Detector per Source (spec 008).
- 60-second Event Grid alert on critical Findings with deep-link into the dashboard.
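The cadence check described above can be sketched as a comparison between the declared cadence and the time since the last upload. The thresholds here are assumptions made for the example: the `MAX_GAP` values and the rule that a gap of twice the allowed window counts as critical are hypothetical, and cron cadences are left out for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical mapping from declared cadence to the longest acceptable gap.
MAX_GAP = {
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=31),
}
CRITICAL_MULTIPLIER = 2  # assumption: twice the allowed gap counts as critical


@dataclass(frozen=True)
class Finding:
    source_id: str
    severity: str
    overdue_by: timedelta


def check_freshness(
    source_id: str,
    expected_update_cadence: str,
    last_upload_at: datetime,
    now: datetime,
) -> Optional[Finding]:
    """Return a Finding when the Source is later than its declared cadence allows."""
    allowed = MAX_GAP[expected_update_cadence]
    gap = now - last_upload_at
    if gap <= allowed:
        return None
    severity = "critical" if gap >= allowed * CRITICAL_MULTIPLIER else "warning"
    return Finding(source_id, severity, gap - allowed)


now = datetime(2026, 5, 20, 12, 0, tzinfo=timezone.utc)
on_time = check_freshness("src-001", "daily", now - timedelta(hours=20), now)
late = check_freshness("src-002", "weekly", now - timedelta(days=9), now)
very_late = check_freshness("src-003", "daily", now - timedelta(days=3), now)
```

In this sketch `on_time` is `None`, `late` is a warning that is two days overdue, and `very_late` is critical; only the critical case would feed the 60-second alert path.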
Hybrid storage matched to content shape. Raw files in object storage, parsed records in a columnar analytics layer, governance metadata in a relational database, document-style semantic content in a managed RAG layer. No single store is asked to do all jobs.
- Raw files: Azure Blob (ADLS Gen2) with intelligent tiering — Bronze layer.
- Parsed tabular records: Snowflake with time-travel and ACID — Silver layer.
- Document semantics: Azure AI Search (hybrid BM25 + vector).
- Structured metadata: Azure Database for PostgreSQL with pgvector.
Tabular today (CSV, JSON, JSONL, Parquet, Excel) plus a non-tabular layer (PDFs, raster images, SVG, audio, video) where text is extracted and indexed, and image analysis is gated behind explicit user approval. Privacy-sensitive metadata (image GPS coordinates) is stripped before any persistence.
- Tabular formats: CSV, JSON, JSONL, Parquet, Excel (spec 001).
- Non-tabular formats: PDF, raster images, SVG, audio, video (spec 009).
- Out of scope in v1: CAD files (licensing), non-English audio/video transcription, face-containing images (biometric risk).