№ 10 · Data architecture

Bacardi Account Matching

Commercial account matching pipeline against SAP MDG: ML scoring, confidence-tiered routing, and golden record with N:1 mapping.

Live2025BacardiDatabricksPySparkDelta LakeLightGBM

Summary

A Bacardi entity-resolution engine consolidating a corporate ERP master and six regional sales-force systems into a single Golden Record with steward-controlled provenance. Six-stage Databricks pipeline — filter, normalize, features, score, route, assemble — with SCD2 history, deterministic survivorship, and a stewardship app dual-writing alongside it. Production-deployed: full seed run validated, daily incremental MERGE pipeline live in dev, scoped for partner-led QA hand-off.

Details

My role: Data architect / pipeline lead
Period: 2025
Status: Live
Stack: DatabricksPySparkDelta LakeLightGBMrapidfuzzFastAPIDatabricks Apps

Context

Bacardi's customer-account truth was fractured. The corporate ERP master held the structural identity — names, addresses, legal IDs — for hundreds of thousands of accounts. Six regional sales-force systems, covering EMEA, LATAM, and APAC, layered a larger volume of records carrying the commercial DNA: account type, consumer activity, point-of-sale segmentation, visit frequency. The same outlet often existed three or four times across systems, never reconciled, frequently with mutually contradictory attributes. The mandate was a pipeline-and-governance pair, not just a model. The pipeline had to dedupe deterministically where possible, score probabilistically where necessary, route ambiguous cases to humans, and produce an SCD2 Golden Record with a complete audit trail of which source contributed each field. Governance had to keep the corporate ERP authoritative, prevent steward decisions from being overwritten, and let the data team continue editing a Golden Record while the same pipeline ran nightly. Delivery model: the engineering team owned dev; the implementation partner deployed QA and prod via an Azure Data Factory bundle.

Architecture

Two sources, one pipeline, six stages, two write targets (Golden Record + mapping table), and a stewardship app dual-writing into both. Identity is resolved by the mapping table; field values are resolved by ranked survivorship where the corporate ERP always prevails.

Delta detection — per-source watermarks in a protected control table drive a hybrid anti-join + timestamp diff.
Filter & union — both sources unioned into one staging table with `source_type`; deletes excluded, status as metadata not filter.
Normalize — country to ISO-2, blocking keys trimmed, geo coordinates cast; un-normalizable countries quarantined.
Identity short-circuit + features — already-mapped rows skip scoring; only genuinely new rows enter blocking N1/N2/N3.
Score with LightGBM over seven features; exact ERP-key matches bypass at 1.0; LLM only refines the review band.
Routing in one decision per source: AutoApproved, PendingReview with top-5 candidates, or PendingCreation.
SCD2 assembly with MERGE into Golden and mapping; structural fields owned by the ERP, commercial filled by ranked survivorship.
FastAPI + Databricks Apps stewardship app with five screens dual-writing Golden and mapping in a single transaction.

Key decisions

LightGBM, not XGBoost.: LightGBM's native text serialization is independent of NumPy's binary ABI. Databricks Serverless workers run an older NumPy than the training environment; XGBoost's pickle format breaks across that boundary. LightGBM's `.txt` model loads cleanly on every worker.
Features over embeddings.: Account matching is an identity problem, not a semantic-similarity problem. Engineered features — Jaro-Winkler on names and address, Haversine, exact-match flags, TIN equality — are interpretable to stewards reviewing borderline cases, cheap at the hundreds-of-millions-of-pairs scale, and robust to the large fraction of confirmed matches whose names disagree.
Conservative clustering.: Prefer duplicate Goldens that can be merged later over contaminating one Golden by joining different real accounts. False negatives are recoverable through a steward merge tool; false positives corrupt the master and cascade into downstream reporting.
The corporate ERP always wins.: Survivorship is rank-ordered: the ERP master is rank 1, regional systems are ranked 2-4. For any shared field, the highest-ranked source with a non-null value wins. A deterministic secondary sort on source ID prevents non-deterministic Spark window ordering.
Identity vs. field management are different problems.: Once a source row is mapped, its identity is resolved — re-scoring it on every run wastes compute and creates false PendingCreation rows when source fields drift after a steward edits the Golden. The pipeline splits each delta into an "already-mapped" branch (join-and-tag, no model load) and a "genuinely new" branch (full blocking + scoring + routing).
LLM as a targeted blend, not a primary scorer.: Multilingual name normalization and phonetic equivalence are areas where an LLM adds real signal in the review band. The LLM is wired only into the review zone, behind a volume gate, with parallelized batch calls so daily runs stay inside SLA.

Lessons learned

Spark lazy evaluation against mutating Delta tables is a silent data-loss class — always materialize a DataFrame before MERGE-ing into the table it reads from.
Type literals that mirror a physical schema are a maintenance trap; either validate them at runtime against `DESCRIBE TABLE` or eliminate them and trust the source-native type flow.
A producer stage must not drop the temp tables consumed by a downstream lazy view — cleanup belongs in the next-run overwrite or in an orchestrator epilogue.
Adversarial PM review before structural-vs-cosmetic fix decisions catches the "if all entries had this bug, what would we do?" generalization that patching a subset misses.
Untestable-yet-correct logic deserves first-class tracking — an "Empirical Gap" ticket beats a watch-item that drifts and gets lost across sessions.
A two-line MERGE-rule unification (rank arithmetic over hard-coded source exclusions) can be stricter on steward protection and fix a self-refresh drift bug at the same time.

What it enables

A unified Golden Record with full SCD2 history, validated end-to-end across all six stages on the seed run.
A daily incremental pipeline executing as one orchestrated notebook, processing only changed records and MERGE-ing results without overwriting steward decisions.
Per-field provenance for every commercial attribute on every Golden Record — every value answers "which source wrote me, and when."
A stewardship app that creates, edits, and obsoletes Golden Records in real time, dual-writing in a single transaction alongside the pipeline.
Drift detection, queue management, and bulk operations exposed to data stewards through a five-screen web app.
A hand-off contract for QA and production deployment via an Azure Data Factory bundle owned by the implementation partner — the engineering team never touches non-dev environments.

Status & roadmap

Production hardening: Atomicity guards on the close-then-append SCD2 write, parallelized LLM batch calls to bring the review-zone runtime inside the daily SLA, and a persisted quarantine table so rejected rows survive session boundaries.
Model lifecycle: Migration of the LightGBM scorer from a static repo file to a registry-managed model with retraining fed by steward decisions, drift monitoring, and ground-truth enrichment with hard negatives mined from manual rejections.
Recall and coverage: Conditional activation of an approximate-nearest-neighbour pass over the no-match pool, gated on production evidence; and an ERP-side fuzzy fallback for ERP records that arrive after a steward has already created a Golden from a regional source.
Cross-system communication: A bidirectional handshake with the stewardship app to propagate steward edits upstream, eliminating the source-side drift that today forces compensating logic inside the pipeline and pushing the contract back where the data actually originates.