Bacardi People MDM
People master data from the corporate identity provider: `people_master` table + 3-level manager hierarchy, governed in Databricks.
Summary
Build the people golden record for Bacardi: a single trustworthy view of who works where, who they report to, and which commercial structure they sit inside. The pipeline anchors on the corporate identity provider (HR-grade identity) and reconciles deterministically against the SAP commercial master and the unified commercial people view, so downstream analytics and AI agents stop rebuilding their own people lookups.
Details
- My role
- Data engineer
- Period
- 2026
- Status
- Live
- Stack
- Databricks, Python, Delta Lake
Architecture
A seven-stage Python pipeline on Databricks and Delta Lake. It anchors identity in the corporate identity provider, enriches it with SAP and the unified commercial view through additive survivorship, and materializes the result as both a silver and a gold layer.
- Cache the reference table that holds the descriptions of the sales-organization hierarchy.
- Filter active members on the identity-provider side and exclude deletion-flagged rows on the SFA side.
- Normalize: lower/trim email, upper/trim names, ISO-2 country sanity check.
- Deterministic waterfall match anchored on the identity provider's UID, with Employee ID first and email second as the match steps.
- Route the result to AutoApproved Full or Partial, IdpOnly, or SFA Orphan based on completeness.
- Additive survivorship across four zones: each zone has its own source, so no field conflicts arise across zones.
- Dual materialization: physical SCD2 silver plus a golden layer exposed via CREATE OR REPLACE VIEW.
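The normalize, match, and route stages listed above can be sketched in plain Python. This is a minimal illustration, not the production code: the field names (`uid`, `sfa_id`, `employee_id`, `email`, `name`, `country`), the `REQUIRED` completeness fields, and the rule that a step only accepts unambiguous one-to-one hits are all assumptions made for the example. The real pipeline would express the same logic as Spark joins on Databricks.

```python
REQUIRED = ("email", "name", "country")  # hypothetical completeness fields

def normalize(rec):
    """Lower/trim email, upper/trim names, sanity-check the ISO-2 country code."""
    out = dict(rec)
    out["email"] = (rec.get("email") or "").strip().lower() or None
    out["name"] = (rec.get("name") or "").strip().upper() or None
    out["employee_id"] = (rec.get("employee_id") or "").strip() or None
    country = (rec.get("country") or "").strip().upper()
    # Shape check only (two ASCII letters); it does not validate against ISO 3166.
    ok = len(country) == 2 and country.isascii() and country.isalpha()
    out["country"] = country if ok else None
    return out

def waterfall_match(idp_rows, sfa_rows):
    """Return (matches, idp_only, sfa_orphans); each match is (idp, sfa, step)."""
    unmatched_idp = [normalize(r) for r in idp_rows]
    unmatched_sfa = [normalize(r) for r in sfa_rows]
    matches = []
    for step in ("employee_id", "email"):  # Employee ID first, then email
        index = {}
        for s in unmatched_sfa:
            if s[step] is not None:
                index.setdefault(s[step], []).append(s)
        used, still_unmatched = set(), []
        for i in unmatched_idp:
            cands = index.get(i[step], [])
            # Assumption: accept only unambiguous, not-yet-claimed candidates.
            if len(cands) == 1 and id(cands[0]) not in used:
                matches.append((i, cands[0], step))
                used.add(id(cands[0]))
            else:
                still_unmatched.append(i)
        unmatched_idp = still_unmatched
        unmatched_sfa = [s for s in unmatched_sfa if id(s) not in used]
    return matches, unmatched_idp, unmatched_sfa

def route(idp, sfa):
    """Assign a routing bucket from match outcome and (assumed) completeness."""
    if idp is None:
        return "SFA Orphan"
    if sfa is None:
        return "IdpOnly"
    complete = all(idp.get(f) or sfa.get(f) for f in REQUIRED)
    return "AutoApproved Full" if complete else "AutoApproved Partial"
```

Because each step only sees rows the previous step left unmatched, every match records which key produced it, which keeps the result auditable.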
Key decisions
- The unified commercial view as the classification-zone feeder.
- Identity stays anchored to the corporate identity provider plus the SFA dimension; commercial classification (distributor, team, status) comes from the unified view. This split replicates the architectural pattern already proven on the sister account-matching engagement, keeps each zone tied to its own source, and prevents field-level conflicts in survivorship.
- Deterministic waterfall match key with no fuzzy in v1.
- Employee ID comes first when present: it is deterministic, but its coverage is structurally insufficient for a universal join key. Lowercased, trimmed email then serves as the universal bridge. No fuzzy, phonetic, or address-proxy fallback in v1: the behaviour stays auditable, and the fuzzy/ML decision waits on the pre-flight match-rate analysis.
- Dual silver/gold materialization with SCD2 below and a filtered view above.
- Silver is a physical SCD2 table where engineers and auditors access the full history of changes; gold is a view filtering current records for analytical consumption. It aligns with the masterdata convention established by the lead vendor and cleanly separates the audit contract from the consumption contract.
- Fuzzy/ML strategy deferred until a data-driven recommendation lands.
- v1 ships only the deterministic cascade. Adding fuzzy or ML matching for residual unmatched rows waits on the pre-flight analysis against real data. Anchor principle: ship deterministic first, measure, then decide. The sister account-matching engine uses ML because it processes much higher volumes; here SAP joins are deterministic and the scale is smaller, so ML is unlikely to help, but the final call will be data-driven.
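The silver/gold split described in the decisions above can be illustrated with a small in-memory sketch. It assumes hypothetical column names (`valid_from`, `valid_to`, `is_current`) and a hypothetical set of tracked attributes; on Databricks the silver update would more likely be a Delta `MERGE` and the gold layer a `CREATE OR REPLACE VIEW` filtering on the current flag.

```python
from datetime import date

TRACKED = ("manager_uid", "team", "status")  # hypothetical tracked attributes

def apply_scd2(silver, incoming, as_of):
    """Return a new SCD2 history: close changed versions, append new ones."""
    out = [dict(r) for r in silver]  # copy, so the caller's history is untouched
    current = {r["uid"]: r for r in out if r["is_current"]}
    for rec in incoming:
        new_vals = {c: rec.get(c) for c in TRACKED}
        old = current.get(rec["uid"])
        if old is not None:
            if all(old.get(c) == new_vals[c] for c in TRACKED):
                continue  # nothing changed: keep the open version as-is
            old["valid_to"] = as_of  # close the superseded version
            old["is_current"] = False
        new_row = {"uid": rec["uid"], **new_vals,
                   "valid_from": as_of, "valid_to": None, "is_current": True}
        out.append(new_row)
        current[rec["uid"]] = new_row
    return out

def gold_view(silver):
    """Gold is only a filter over silver: current records, no history."""
    return [r for r in silver if r["is_current"]]

# Two loads: u1 changes team on day 2, u2 arrives on day 2.
day1, day2 = date(2026, 1, 1), date(2026, 1, 2)
silver = apply_scd2([], [{"uid": "u1", "team": "A", "status": "active"}], day1)
silver = apply_scd2(silver, [{"uid": "u1", "team": "B", "status": "active"},
                             {"uid": "u2", "team": "A", "status": "active"}], day2)
gold = gold_view(silver)
```

After the second load, silver holds three rows (both versions of `u1` plus `u2`) while gold exposes only the two current ones, which is the audit-versus-consumption separation the decision describes.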
Status & roadmap
- Current state
- Cross-comparison report on the identity-provider extract complete; the strategy log is ratified through four decisions; the production-architecture skeleton is drafted with all stage signatures, zone definitions, and survivorship rules.
- In flight
- Pre-flight match-rate analysis on real silver data, dry-run of the deterministic waterfall, and a recommendation on whether v2 needs fuzzy or ML based on the unmatched fraction at the close of the analysis.
- Next steps
- Implement the seven-stage pipeline and resolve four open architectural questions: gold-view filter scope, deletion-flag column, multi-row survivorship rank, and a post-load assertion on manager-chain consistency.
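What "manager-chain consistency" should mean is still one of the open questions, so the following is only one possible reading, sketched under assumptions: every referenced manager must exist in the table, and walking up to three levels must not revisit a person. The `manager_uid` field and the dictionary layout keyed by UID are hypothetical.

```python
def manager_chain(people, uid, levels=3):
    """Return up to `levels` manager UIDs above `uid`.

    Raises ValueError on a dangling manager reference or a cycle.
    """
    chain, seen, cur = [], {uid}, uid
    for _ in range(levels):
        mgr = people[cur].get("manager_uid")
        if mgr is None:
            break  # reached the top of the hierarchy
        if mgr not in people:
            raise ValueError(f"{cur}: manager {mgr} not found")
        if mgr in seen:
            raise ValueError(f"{uid}: cycle in manager chain at {mgr}")
        chain.append(mgr)
        seen.add(mgr)
        cur = mgr
    return chain

def check_manager_chains(people, levels=3):
    """Collect one error message per person whose chain is inconsistent."""
    errors = []
    for uid in people:
        try:
            manager_chain(people, uid, levels)
        except ValueError as exc:
            errors.append(str(exc))
    return errors
```

A post-load step could then fail the run when `check_manager_chains(...)` returns a non-empty list, or log the messages for review, depending on how strict the assertion is meant to be.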