From vendor workbooks to a trusted benchmark platform.
Six governed stages from source workbooks to analyst workbench, with a review gate between automated processing and human approval. Each stage is interactive — hover or tap to read more.
The problem:
The firm receives vendor pricing models as semi-structured Excel workbooks, each formatted differently by its provider, with varying column structures, layouts, and data conventions. Comparing models across providers required manual effort, and the institutional context tied to each model was not captured in any governed, reusable workflow.
What we built:
A governed benchmark data platform in Python, designed around reproducibility, traceability, and human oversight. The core is a six-stage workflow: source workbook intake, CLI-based onboarding and inspection, profile/extract/match processing, human review, controlled rebuild/apply into a governed benchmark database, and analyst consumption through a local Flask workbench. Source workbooks are inspected and registered through a centralized onboarding flow that issues stable model IDs and records provenance in a source registry, without ever bypassing human review.
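The onboarding idea can be sketched roughly as follows. This is an illustrative Python sketch, not the project's actual API: the function names, registry format, and the choice of content-hash-based IDs are all assumptions made for the example.

```python
"""Sketch of a centralized onboarding flow: stable model IDs plus a
provenance entry in a JSON source registry. Names are illustrative."""
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def issue_model_id(workbook: Path) -> str:
    """Derive a stable ID from the workbook's content hash, so
    re-onboarding the same file always yields the same ID."""
    digest = hashlib.sha256(workbook.read_bytes()).hexdigest()[:12]
    return f"model-{digest}"


def register_source(workbook: Path, provider: str, registry_path: Path) -> dict:
    """Record provenance in the source registry; the entry stays
    'pending' until a human reviewer approves it."""
    registry = json.loads(registry_path.read_text()) if registry_path.exists() else {}
    entry = {
        "model_id": issue_model_id(workbook),
        "provider": provider,
        "source_file": workbook.name,
        "registered_at": datetime.now(timezone.utc).isoformat(),
        "review_status": "pending",  # nothing ships until a human approves
    }
    registry[entry["model_id"]] = entry
    registry_path.write_text(json.dumps(registry, indent=2))
    return entry
```

Keying the ID off the file's content hash, rather than a counter, is one way to make re-registration idempotent: the same workbook always maps to the same model ID.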
On the production side, a deterministic rebuild/apply system produces governed benchmark database artifacts backed by release manifests, integrity reports, typed data contracts, diffing, backups, and preflight validation. Atomic write protections now cover the governed data tables as well as governance-adjacent artifacts such as the source registry, intake register, release manifest, and integrity report. A GitHub Actions CI workflow runs the non-browser validation suite on every change, and the project currently has 254 passing tests.
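The atomic-write protection mentioned above typically follows a write-temp-then-rename pattern; a minimal sketch of that pattern (the helper name is hypothetical, not the project's actual utility):

```python
"""Sketch of atomic writes for governed artifacts: write to a temp file
in the same directory, fsync, then atomically replace the target so
readers never observe a partially written file."""
import os
import tempfile
from pathlib import Path


def atomic_write_text(target: Path, text: str) -> None:
    # The temp file must live in the same directory (same filesystem)
    # as the target for os.replace() to be an atomic rename.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes reach disk before the swap
        os.replace(tmp_name, target)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_name)  # clean up the temp file only on failure
        raise
```

With this pattern, a crash mid-write leaves the previous manifest or registry intact rather than a truncated file.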
The consumption layer is a six-page Flask workbench — Overview, Roles, Providers, Confidence, Quality, and Trends — with client-side filtering, AJAX detail panels, manifest-based freshness signaling, and governed CSV exports that include provenance headers. Historical trend analysis is governed by explicit comparability rules: only same-client, same-provider series are compared across model vintages.
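The comparability rule for trend analysis amounts to a grouping constraint, sketched here with illustrative record fields (the real schema is not shown in this write-up):

```python
"""Sketch of the trend-comparability rule: only same-client,
same-provider series are compared across model vintages."""
from itertools import groupby


def comparable_series(records: list[dict]) -> dict:
    """Group benchmark rows into trend series keyed by (client, provider),
    ordered by vintage; rows from different clients or providers are
    never mixed into one series."""
    rows = sorted(records, key=lambda r: (r["client"], r["provider"], r["vintage"]))
    return {
        key: list(group)
        for key, group in groupby(rows, key=lambda r: (r["client"], r["provider"]))
    }
```

Because `groupby` only merges adjacent items, the sort by (client, provider, vintage) both defines the series boundaries and puts each series in vintage order.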
Why it matters:
This is a system where automation handles schema alignment, validation, provenance capture, and drift detection while humans remain in control of every decision that affects benchmark truth. The pipeline can be rebuilt deterministically at any point, every published output is traceable to approved source decisions, and the platform is structured to accumulate benchmark history over time through a governed workflow rather than one-off manual analysis.