№ 09 · Data architecture

Bacardi Concepts MDM

Entity resolution pipeline over 6 SFA systems: deduplicates product concepts and matches them against the corporate UPH hierarchy.

Live2026BacardiDatabricksPySparkDelta LakeMosaic AI Vector Search

Summary

Bacardi runs six regional sales-force systems, each carrying its own dialect of the same product catalog. We built a Databricks entity-resolution pipeline that ingests near-million source rows, normalizes them through rule layers and a Claude LLM cascade, deduplicates them into a single golden concept list, and matches every survivor against Bacardi's Universal Product Hierarchy. The output is an SCD2-versioned, embedding-indexed master record powering stewardship and downstream analytics.

Details

My role
Data architecture lead
Period
2026
Status
Live
Stack
DatabricksPySparkDelta LakeMosaic AI Vector SearchClaude LLMsEmbeddingsSCD2

Context

Bacardi's commercial reality is fragmented by design: each region operates its own SFA platform, and the same bottle, brand, or cocktail concept appears under six different schemas with no shared identifier. A French "Bouteille" and an English "Bottle" are literally distinct rows. Direct key-based reconciliation is impossible, and without a unified golden list every downstream initiative — pricing, listings, analytics, the corporate UPH itself — inherits ambiguity, conflicting attributes, and duplicated effort. The mandate was to consolidate the SFA fleet into one deduplicated, audited, governed master list, and to match each concept to the right UPH level (brand, liquid, or product). Two orthogonal problems live inside that mandate: cross-system entity resolution within source data and cross-dataset matching against the corporate hierarchy. The pipeline had to be repeatable, idempotent, and able to evolve from a one-shot seed run into a daily incremental flow without re-running deduplication every night.

Architecture

A staged Databricks pipeline that turns six fragmented SFA catalogs into a single SCD2-versioned, vector-indexed golden master, aligned against the corporate product hierarchy.

  1. Reference seeding — steward overrides, display-name dictionaries, and review tables provisioned in the curated layer.
  2. PRODUCT vs NON_PRODUCT segregation via static dictionary plus Claude Haiku-Sonnet-Opus cascade for ambiguous types.
  3. Harmonization of material type, brand, category, and volume; brand canonicalization against UPH via embedding match.
  4. Exact dedup on a composite key, then semantic dedup with embeddings plus a cluster-level Claude judge.
  5. Match against UPH routed by material type to brand, liquid, or product with brand-filtered cosine and fallback.
  6. Golden enrichment back-populated via MERGE INTO once UPH context is known.
  7. Lineage assembly so every source row maps to its golden concept and UPH entity.
  8. Embeddings published to the Mosaic AI Vector Search index in DELTA_SYNC mode for incremental matching.
  9. Daily incremental routes new rows to AutoApproved, PendingReview, or PendingCreation with override pre-pass.
  10. Unified stewardship queue persists ambiguous matches and feeds steward decisions back as authoritative rules.

Key decisions

Standardize, then deduplicate, then match — never the reverse.
Multilingual source strings ("Bouteille" vs "Bottle") create false non-matches if you dedup before normalizing. And matching before deduplication wastes compute and produces conflicting UPH assignments for what is the same concept.
Cluster-based LLM judge instead of pairwise.
Pair judges create transitivity contradictions — A=B, B=C, A≠C — when an ambiguous member bridges distinct clusters. Presenting whole clusters to Claude Sonnet for partitioning eliminates the paradox by construction and cuts API spend.
UPH as the single source of truth for canonical brand names.
Hardcoded canonical-brand override maps drift, accumulate typos, and resist governance. Replacing them with a live embedding match against the UPH brand table makes the corporate hierarchy itself the authority and turns drift into a measurable review queue.
The seed pipeline is one-shot; the incremental is a separate codebase.
Retrofitting the seed deduplication pipeline into a daily incremental destroys idempotency and breaks the position-based golden_id contract. The incremental is reframed as a matching problem against the frozen golden, powered by Mosaic AI Vector Search in DELTA_SYNC mode.
SCD2 on the golden record before any incremental writes.
Position-based golden IDs cannot survive new arrivals or attribute changes without breaking foreign keys downstream. Promoting the golden table to full SCD2 makes the golden_id a stable business key and the matching history queryable.
Adversarial review before engineering risky specs.
A devil's-advocate pass on migration specs caught semantic bugs — cross-stage column mutation, pre- vs post-LLM type divergence — that the standard engineer-then-reviewer flow would have missed and that would have cost multiple Dev re-deploy cycles.

Lessons learned

  • Embeddings on composite keys (brand | name | category) outperform fuzzy string matching across multilingual catalogs and abbreviations.
  • LLMs reliably unlock sparse records: Claude Haiku resolves the bulk of attribute extraction while larger models handle the long tail cost-effectively.
  • Position-based primary keys are a trap the moment the pipeline goes incremental — design SCD2 stability in from day one if downstream systems will reference the IDs.
  • `NaN != NaN` in pandas merges silently drops rows on nullable join keys; sentinel-fill before any merge is non-negotiable.
  • Each filter or assertion added to a mature pipeline is a latent-bug detector: the first end-to-end run after deploy routinely surfaces accumulated state corruption masked by earlier failures.
  • Hardcoded "at-risk column" lists drift with every schema change; dtype-driven auto-detection is the durable defense pattern.

What it enables

  • A single deduplicated golden concept list reconciling the full SFA fleet, traceable end-to-end from source row to canonical record to UPH entity.
  • Every product concept routed to the appropriate level of the corporate hierarchy — brand, liquid, or product — with explicit auto-match, review, and unmatched bands.
  • A Mosaic AI Vector Search index over the golden master enabling sub-second incremental matching of new source records without re-deduplicating the corpus.
  • SCD2 history on the golden record so attribute evolution and steward decisions stay auditable, reversible, and queryable across time.
  • A unified stewardship queue surfacing ambiguous matches and pending creations to the commercial team, promoting steward decisions back into the pipeline as authoritative overrides.
  • An idempotent, parameterized Databricks job structure where every task maps one-to-one to its wrapper, eliminating drift between environments and making DEV → QA promotion mechanical.

Status & roadmap

Current state
Seed pipeline frozen and validated end-to-end on production data; SCD2 schema and Vector Search index live on the golden master; the incremental pipeline's serial chain — match-against-golden, pending-creation minting, golden evolution with hybrid steward authorization, and revival of soft-deleted goldens — is implemented and running in development.
Next steps
Cutover to QA, deployment of the unified override and review-queue tables in production, integration with the stewardship app for live queue resolution, and rollout of the source-deletion writer that closes the incremental lifecycle.