№ 10 · Data architecture

Bacardi Account Matching

Commercial account matching pipeline against SAP MDG: ML scoring, confidence-tiered routing, and golden record with N:1 mapping.

Live2025BacardiDatabricksPySparkDelta LakeLightGBM

Summary

A Bacardi entity-resolution engine consolidating a corporate ERP master and six regional sales-force systems into a single Golden Record with steward-controlled provenance. Six-stage Databricks pipeline — filter, normalize, features, score, route, assemble — with SCD2 history, deterministic survivorship, and a stewardship app dual-writing alongside it. Production-deployed: full seed run validated, daily incremental MERGE pipeline live in dev, scoped for partner-led QA hand-off.

Details

My role
Data architect / pipeline lead
Period
2025
Status
Live
Stack
DatabricksPySparkDelta LakeLightGBMrapidfuzzFastAPIDatabricks Apps

Context

Bacardi's customer-account truth was fractured. The corporate ERP master held the structural identity — names, addresses, legal IDs — for hundreds of thousands of accounts. Six regional sales-force systems, covering EMEA, LATAM, and APAC, layered a larger volume of records carrying the commercial DNA: account type, consumer activity, point-of-sale segmentation, visit frequency. The same outlet often existed three or four times across systems, never reconciled, frequently with mutually contradictory attributes. The mandate was a pipeline-and-governance pair, not just a model. The pipeline had to dedupe deterministically where possible, score probabilistically where necessary, route ambiguous cases to humans, and produce an SCD2 Golden Record with a complete audit trail of which source contributed each field. Governance had to keep the corporate ERP authoritative, prevent steward decisions from being overwritten, and let the data team continue editing a Golden Record while the same pipeline ran nightly. Delivery model: the engineering team owned dev; the implementation partner deployed QA and prod via an Azure Data Factory bundle.

Architecture

Two sources, one pipeline, six stages, two write targets (Golden Record + mapping table), and a stewardship app dual-writing into both. Identity is resolved by the mapping table; field values are resolved by ranked survivorship where the corporate ERP always prevails.

  1. Delta detection — per-source watermarks in a protected control table drive a hybrid anti-join + timestamp diff.
  2. Filter & union — both sources unioned into one staging table with `source_type`; deletes excluded, status as metadata not filter.
  3. Normalize — country to ISO-2, blocking keys trimmed, geo coordinates cast; un-normalizable countries quarantined.
  4. Identity short-circuit + features — already-mapped rows skip scoring; only genuinely new rows enter blocking N1/N2/N3.
  5. Score with LightGBM over seven features; exact ERP-key matches bypass at 1.0; LLM only refines the review band.
  6. Routing in one decision per source: AutoApproved, PendingReview with top-5 candidates, or PendingCreation.
  7. SCD2 assembly with MERGE into Golden and mapping; structural fields owned by the ERP, commercial filled by ranked survivorship.
  8. FastAPI + Databricks Apps stewardship app with five screens dual-writing Golden and mapping in a single transaction.

Key decisions

LightGBM, not XGBoost.
LightGBM's native text serialization is independent of NumPy's binary ABI. Databricks Serverless workers run an older NumPy than the training environment; XGBoost's pickle format breaks across that boundary. LightGBM's `.txt` model loads cleanly on every worker.
Features over embeddings.
Account matching is an identity problem, not a semantic-similarity problem. Engineered features — Jaro-Winkler on names and address, Haversine, exact-match flags, TIN equality — are interpretable to stewards reviewing borderline cases, cheap at the hundreds-of-millions-of-pairs scale, and robust to the large fraction of confirmed matches whose names disagree.
Conservative clustering.
Prefer duplicate Goldens that can be merged later over contaminating one Golden by joining different real accounts. False negatives are recoverable through a steward merge tool; false positives corrupt the master and cascade into downstream reporting.
The corporate ERP always wins.
Survivorship is rank-ordered: the ERP master is rank 1, regional systems are ranked 2-4. For any shared field, the highest-ranked source with a non-null value wins. A deterministic secondary sort on source ID prevents non-deterministic Spark window ordering.
Identity vs. field management are different problems.
Once a source row is mapped, its identity is resolved — re-scoring it on every run wastes compute and creates false PendingCreation rows when source fields drift after a steward edits the Golden. The pipeline splits each delta into an "already-mapped" branch (join-and-tag, no model load) and a "genuinely new" branch (full blocking + scoring + routing).
LLM as a targeted blend, not a primary scorer.
Multilingual name normalization and phonetic equivalence are areas where an LLM adds real signal in the review band. The LLM is wired only into the review zone, behind a volume gate, with parallelized batch calls so daily runs stay inside SLA.

Lessons learned

  • Spark lazy evaluation against mutating Delta tables is a silent data-loss class — always materialize a DataFrame before MERGE-ing into the table it reads from.
  • Type literals that mirror a physical schema are a maintenance trap; either validate them at runtime against `DESCRIBE TABLE` or eliminate them and trust the source-native type flow.
  • A producer stage must not drop the temp tables consumed by a downstream lazy view — cleanup belongs in the next-run overwrite or in an orchestrator epilogue.
  • Adversarial PM review before structural-vs-cosmetic fix decisions catches the "if all entries had this bug, what would we do?" generalization that patching a subset misses.
  • Untestable-yet-correct logic deserves first-class tracking — an "Empirical Gap" ticket beats a watch-item that drifts and gets lost across sessions.
  • A two-line MERGE-rule unification (rank arithmetic over hard-coded source exclusions) can be stricter on steward protection and fix a self-refresh drift bug at the same time.

What it enables

  • A unified Golden Record with full SCD2 history, validated end-to-end across all six stages on the seed run.
  • A daily incremental pipeline executing as one orchestrated notebook, processing only changed records and MERGE-ing results without overwriting steward decisions.
  • Per-field provenance for every commercial attribute on every Golden Record — every value answers "which source wrote me, and when."
  • A stewardship app that creates, edits, and obsoletes Golden Records in real time, dual-writing in a single transaction alongside the pipeline.
  • Drift detection, queue management, and bulk operations exposed to data stewards through a five-screen web app.
  • A hand-off contract for QA and production deployment via an Azure Data Factory bundle owned by the implementation partner — the engineering team never touches non-dev environments.

Status & roadmap

Production hardening
Atomicity guards on the close-then-append SCD2 write, parallelized LLM batch calls to bring the review-zone runtime inside the daily SLA, and a persisted quarantine table so rejected rows survive session boundaries.
Model lifecycle
Migration of the LightGBM scorer from a static repo file to a registry-managed model with retraining fed by steward decisions, drift monitoring, and ground-truth enrichment with hard negatives mined from manual rejections.
Recall and coverage
Conditional activation of an approximate-nearest-neighbour pass over the no-match pool, gated on production evidence; and an ERP-side fuzzy fallback for ERP records that arrive after a steward has already created a Golden from a regional source.
Cross-system communication
A bidirectional handshake with the stewardship app to propagate steward edits upstream, eliminating the source-side drift that today forces compensating logic inside the pipeline and pushing the contract back where the data actually originates.