№ 11 · Data architecture

Bacardi Nielsen Matching

Nielsen ↔ UPH entity resolution for share-of-market analysis and competitive benchmarking, native to Databricks.

Live2026BacardiDatabricksPySparkMosaic AI Vector SearchClaude LLMs

Summary

A Databricks-native entity-resolution pipeline that lifts external market-measurement data into Bacardi's Universal Product Hierarchy, so syndicated panel volumes become directly comparable with internal sales, finance, and dimensional masters in a single dashboarding layer. A 10-stage architecture with a 4-pass cascade (Override → Exact → Vector Search → LLM judge) replaces ad-hoc Excel reconciliations with deterministic Delta tables, SCD2-versioned matches, and a unified steward review queue.

Details

My role: Data architect
Period: 2026
Status: Live
Stack: DatabricksPySparkMosaic AI Vector SearchClaude LLMsDelta LakeLangGraph

Context

Market-measurement panel data is the only external lens through which Bacardi sees competitor sell-through. The feed arrives at scale across eleven European and Americas markets and nine spirits, wine, and adjacent categories, but its product identity layer does not align with the corporate master: brand, owner, sub-brand, and container size all carry spelling variants, polluted placeholder labels, and free-form text where the master expects canonical codes. Direct key-based joins are impossible — the panel feed's product description column collapses into a handful of dataset-name labels, and true variant identity is fragmented across dozens of heterogeneous category-specific dimension tables. Without resolution, every share-of-market or competitor-benchmarking question gets answered in spreadsheets sitting outside the platform. With resolution, those answers move inside the warehouse, version themselves over time, and feed downstream BI without per-question reconciliation. The mandate: build the resolution pipeline once, version it correctly, and roll it out country by country.

Architecture

Ten Delta Lake stages with one operating principle above all others: no record is ever dropped. Polluted fields are flagged unusable, not excluded, and every source row reaches the final read surface with a definitive match status.

Reference layer — eight SCD2 Delta tables hold mappings, eligibility matrix, unified override, and the steward queue.
Field cleansing — `field_reliability_mask` per row flags usable columns; drift detection over owner and brand runs in parallel.
Harmonize and canonicalize — Pattern-B join brings category dimensions; container size canonicalised by a shared function.
Deduplicate — canonical catalogue with stable hash composition; lineage table preserves source-dataset provenance per product.
Eligibility — structural eligibility is decoupled from the effective check; combinations without master coverage short-circuit to review.
Cascading match — four strict passes: Override force, Exact join, Mosaic AI Vector Search, and Claude LLM judges.
Assembly — row-grain view for forensic audit and an eleven-dimension materialised aggregate; FX fails open with visibility.
Embeddings — three new vector indices on the shared Mosaic AI endpoint already serving the wider MDM landscape.

Key decisions

No record is ever dropped; polluted fields are flagged, not excluded.: Removing rows with placeholder owner or brand was tempting and wrong: they still represent real volume the business needs to see in dashboards even when they cannot be matched. Stewards work the queue rather than the pipeline working around them.
Override precedes regex rules at every level of the pipeline.: When a steward inserts an override, the pipeline self-corrects without a code redeploy. That precedence holds across the four match passes and the container-size canonicalisation, turning the override into the system's evolution lever.
Reuse artifacts from sibling pipelines.: The Vector Search endpoint, the brand canonical mapping, and the embeddings cache are reused from the existing MDM landscape rather than duplicated. An early architectural decision closed the debate and keeps a single shared surface across the entity-resolution projects.
Cascading match with four passes in strict order.: Override → Exact → Vector Search → LLM judge. Cheapest-first: each pass receives only what the previous one did not resolve, and the cost order is also the confidence order, so the human review rate falls without sacrificing precision.
Haiku → Sonnet → Opus cascade with a cap as incident brake.: The hard cap on the most expensive model functions as an anomaly detector, not a budget control. If the cascade starts calling the most expensive model out of band, something upstream — a data drift or a broken override — is asking for investigation.

Lessons learned

Adversarial review reshaped four draft decisions before architectural lock — the council session was load-bearing for the final design (dual SCD2 triggers, JSON-payload override extensibility, cascading mask).
Third-party briefings can carry load-bearing factual errors that only profiling exposes — Phase 0 caught three before architecture began.
Design-log discipline matters at scale: eighty-six numbered decisions across the architecture phase with ten partial supersedes, all traceable.

Status & roadmap

Current state: Phase 0 closed with cross-comparison report and brief-correction profiling; Phase 1 architecture locked via an eleven-step walkthrough with the boss, eighty-six numbered decisions, and ratification by a multi-agent council.
Next steps: Phase 1 implementation kicks off with the build of the ten-stage pipeline, a country pilot starting in the United Kingdom to validate the cascade in production, and multi-market rollout afterwards as each country passes the coverage and match-rate verification.