№ 12 · Data architecture

Bacardi People MDM

People master data from the corporate identity provider: `people_master` table + 3-level manager hierarchy, governed in Databricks.

Live · 2026 · Bacardi · Databricks · Python · Delta Lake

Summary

Build the people golden record for Bacardi: a single trustworthy view of who works where, who they report to, and which commercial structure they sit inside. The pipeline anchors on the corporate identity provider (HR-grade identity) and reconciles deterministically against the SAP commercial master and the unified commercial people view, so downstream analytics and AI agents stop rebuilding their own people lookups.

Details

My role
Data engineer
Period
2026
Status
Live
Stack
Databricks · Python · Delta Lake

Architecture

A seven-stage Python pipeline on Databricks and Delta Lake. It anchors identity in the corporate identity provider, enriches it with SAP and the unified commercial view through additive survivorship, and materializes the result as both a silver and a gold layer.

  1. Cache the reference tables that describe the sales-organization hierarchy.
  2. Filter active members on the identity-provider side and exclude deletion-flagged rows on the SFA side.
  3. Normalize: lower/trim email, upper/trim names, ISO-2 country sanity check.
  4. Deterministic waterfall match anchored on the identity provider's UID, with Employee ID and email as steps.
  5. Route the result to AutoApproved Full or Partial, IdpOnly, or SFA Orphan based on completeness.
  6. Additive survivorship across four zones — distinct sources, no cross-zone field conflicts.
  7. Dual materialization: physical SCD2 silver plus a golden layer exposed via CREATE OR REPLACE VIEW.
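
The normalization stage (stage 3) can be sketched in plain Python. This is a minimal illustration rather than the production code: the field names (`email`, `first_name`, `last_name`, `country`), the `country_ok` flag, and the small ISO-2 sample set are all assumptions made for the example, and the real pipeline presumably applies the same rules to DataFrames on Databricks.

```python
from typing import Optional

# Tiny illustrative subset; a real check would use the full ISO 3166-1 alpha-2 list.
ISO2_SAMPLE = {"US", "GB", "ES", "MX", "IN", "BM"}


def clean(value: Optional[str]) -> Optional[str]:
    """Trim whitespace and turn empty strings into None."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None


def normalize_person(row: dict) -> dict:
    """Return a normalized copy of a raw person row (hypothetical field names)."""
    email = clean(row.get("email"))
    first = clean(row.get("first_name"))
    last = clean(row.get("last_name"))
    country = clean(row.get("country"))
    country = country.upper() if country else None

    return {
        **row,
        "email": email.lower() if email else None,        # lower/trim email
        "first_name": first.upper() if first else None,   # upper/trim names
        "last_name": last.upper() if last else None,
        "country": country,
        "country_ok": country in ISO2_SAMPLE,              # ISO-2 sanity flag
    }


raw = {
    "email": "  Ana.Perez@Example.COM ",
    "first_name": " ana",
    "last_name": "Pérez ",
    "country": "es",
}
normalized = normalize_person(raw)
print(normalized)
```

Flagging a bad country code instead of dropping the row is a choice made for this sketch; the source only says the country gets a sanity check.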

Key decisions

The unified commercial view as the classification-zone feeder.
Identity stays anchored to the corporate identity provider plus the SFA dimension; commercial classification (distributor, team, status) comes from the unified view. This split replicates the architectural pattern already proven on the sister account-matching engagement, keeps distinct zones with distinct sources, and prevents field-level conflicts in survivorship.
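
To illustrate why distinct zones rule out field-level conflicts, here is a minimal sketch of additive survivorship. The zone names, source keys, and field lists are hypothetical; only the idea that each zone has exactly one source, and each field exactly one zone, comes from the description above.

```python
# Hypothetical zone layout: every field belongs to exactly one zone,
# and every zone reads from exactly one source.
ZONES = {
    "identity": {"source": "idp", "fields": ["uid", "email", "manager_uid"]},
    "sfa": {"source": "sfa", "fields": ["sfa_user_id"]},
    "sap": {"source": "sap", "fields": ["employee_id", "sales_org"]},
    "classification": {
        "source": "unified_view",
        "fields": ["distributor", "team", "status"],
    },
}


def check_zones_disjoint(zones: dict) -> None:
    """Raise if any field is claimed by more than one zone."""
    owner = {}
    for zone, spec in zones.items():
        for field in spec["fields"]:
            if field in owner:
                raise ValueError(
                    f"field {field!r} is in both {owner[field]!r} and {zone!r}"
                )
            owner[field] = zone


def build_golden(records_by_source: dict, zones: dict = ZONES) -> dict:
    """Assemble a golden record by adding each zone's fields from its own source."""
    check_zones_disjoint(zones)
    golden = {}
    for spec in zones.values():
        source_row = records_by_source.get(spec["source"]) or {}
        for field in spec["fields"]:
            golden[field] = source_row.get(field)
    return golden


golden = build_golden({
    "idp": {"uid": "u1", "email": "ana@example.com", "manager_uid": "u9"},
    "sap": {"employee_id": "E100", "sales_org": "ES01"},
    # The unified view also carries an email, but email is not in its zone,
    # so it can never overwrite the identity-provider value.
    "unified_view": {"email": "old@example.com", "team": "On-trade", "status": "Active"},
})
print(golden)
```

Because no field appears in two zones, the merge never has to pick a winner between sources; a missing source simply leaves its zone's fields empty.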
Deterministic waterfall match key with no fuzzy matching in v1.
Employee ID comes first when present: it is deterministic, but its coverage is structurally too low to serve as a universal join key. Lowercased, trimmed email follows as the universal bridge. There is no fuzzy, phonetic, or address-proxy fallback in v1, so the behaviour stays auditable, and the fuzzy/ML decision waits for evidence from real data.
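
A minimal sketch of such a two-step waterfall in plain Python follows. The row layout, the `sfa_id` field, and the rule that an ambiguous key (more than one candidate) falls through to the next step are assumptions made for the example, not confirmed details of the pipeline.

```python
from collections import defaultdict

MATCH_STEPS = ["employee_id", "email"]  # order follows the decision above


def build_index(rows: list, key: str) -> dict:
    """Group candidate rows by a key, skipping rows where the key is empty."""
    index = defaultdict(list)
    for row in rows:
        value = row.get(key)
        if value:
            index[value].append(row)
    return index


def waterfall_match(idp_rows: list, other_rows: list, steps=MATCH_STEPS) -> list:
    """Try each key in order; the first step with exactly one candidate wins."""
    indexes = {key: build_index(other_rows, key) for key in steps}
    results = []
    for person in idp_rows:
        matched, step_used = None, None
        for key in steps:
            value = person.get(key)
            candidates = indexes[key].get(value, []) if value else []
            if len(candidates) == 1:  # assumption: ambiguous keys never match
                matched, step_used = candidates[0], key
                break
        results.append({"uid": person["uid"], "match": matched, "step": step_used})
    return results


idp = [
    {"uid": "u1", "employee_id": "E100", "email": "ana@example.com"},
    {"uid": "u2", "employee_id": None, "email": "bo@example.com"},
    {"uid": "u3", "employee_id": None, "email": "cy@example.com"},
]
other = [
    {"sfa_id": "s1", "employee_id": "E100", "email": "other@example.com"},
    {"sfa_id": "s2", "employee_id": None, "email": "bo@example.com"},
]
results = waterfall_match(idp, other)
for r in results:
    print(r["uid"], r["step"])
```

Recording the step that produced each match keeps the result auditable and makes it straightforward to report match rates per step, including the unmatched remainder.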
Dual silver/gold materialization with SCD2 below and a filtered view above.
Silver is a physical SCD2 table where engineers and auditors access the full history of changes; gold is a view filtering current records for analytical consumption. It aligns with the masterdata convention established by the lead vendor and cleanly separates the audit contract from the consumption contract.
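
The relationship between the two layers can be sketched in plain Python, with a list standing in for the silver table and a function standing in for the gold view. The column names (`valid_from`, `valid_to`, `is_current`) and the "current rows only" filter are assumptions for illustration; the exact gold-view filter scope is still listed as an open question.

```python
from datetime import date


def apply_scd2(silver: list, uid: str, attrs: dict, as_of: date) -> list:
    """Record a change for `uid` in place: close the current version, append a new one.

    If the attributes are unchanged, the history is left untouched.
    """
    current = next(
        (r for r in silver if r["uid"] == uid and r["is_current"]), None
    )
    if current is not None:
        if current["attrs"] == attrs:
            return silver  # no change, no new version
        current["valid_to"] = as_of
        current["is_current"] = False
    silver.append({
        "uid": uid,
        "attrs": attrs,
        "valid_from": as_of,
        "valid_to": None,
        "is_current": True,
    })
    return silver


def gold_view(silver: list) -> list:
    """Stand-in for the gold view: only the current version of each person."""
    return [r for r in silver if r["is_current"]]


silver = []
apply_scd2(silver, "u1", {"team": "A"}, date(2026, 1, 1))
apply_scd2(silver, "u1", {"team": "B"}, date(2026, 3, 1))
apply_scd2(silver, "u1", {"team": "B"}, date(2026, 4, 1))  # unchanged, ignored

print(len(silver))        # full history kept in silver
print(gold_view(silver))  # only the current row reaches gold
```

Keeping gold as a filter over silver means there is a single physical copy of the data, so the analytical layer cannot drift from the audited history.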
Fuzzy/ML strategy deferred until a data-driven recommendation lands.
v1 ships only the deterministic cascade. Adding fuzzy or ML for residual unmatched rows waits on the pre-flight against real data. Anchor principle: ship deterministic first, measure, then decide. The sister account-matching engine uses ML because it processes much higher volumes; here SAP joins are deterministic and the scale is smaller, so ML is unlikely to help, but the final call will be data-driven.

Status & roadmap

Current state
Cross-comparison report on the identity-provider extract complete; the strategy log is ratified through four decisions; the production-architecture skeleton is drafted with all stage signatures, zone definitions, and survivorship rules.
In flight
Pre-flight match-rate analysis on real silver data, dry-run of the deterministic waterfall, and a recommendation on whether v2 needs fuzzy or ML based on the unmatched fraction at the close of the analysis.
Next steps
Implement the seven-stage pipeline and resolve four open architectural questions: gold-view filter scope, deletion-flag column, multi-row survivorship rank, and a post-load assertion on manager-chain consistency.
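
One possible shape for the manager-chain assertion, as a hedged sketch: the open question has not been resolved, so the rules below (every manager must exist in the table, and no cycle may appear within the three levels of the hierarchy) are assumptions, as is the `manager_of` mapping from a person's UID to their manager's UID.

```python
def manager_chain(uid: str, manager_of: dict, max_levels: int = 3) -> list:
    """Walk up to `max_levels` managers above `uid`.

    Raises ValueError if the walk revisits a person, which only detects
    cycles that close within `max_levels` hops.
    """
    chain, seen, current = [], {uid}, uid
    for _ in range(max_levels):
        manager = manager_of.get(current)
        if manager is None:
            break  # reached the top of the hierarchy
        if manager in seen:
            raise ValueError(f"cycle in manager chain of {uid!r} at {manager!r}")
        chain.append(manager)
        seen.add(manager)
        current = manager
    return chain


def check_manager_chains(manager_of: dict) -> list:
    """Return (uid, problem) pairs; an empty list means the check passed."""
    problems = []
    for uid, manager in manager_of.items():
        if manager is not None and manager not in manager_of:
            problems.append((uid, "unknown manager"))
            continue
        try:
            manager_chain(uid, manager_of)
        except ValueError:
            problems.append((uid, "cycle"))
    return problems


good = {"ceo": None, "vp": "ceo", "dir": "vp", "rep": "dir"}
bad = {"a": "b", "b": "a", "c": "ghost"}

print(manager_chain("rep", good))
print(check_manager_chains(good))
print(check_manager_chains(bad))
```

Returning a list of problems rather than failing on the first one suits a post-load check, since the whole load can be reported on in a single pass.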