MegatronLead

How-to

How to deduplicate leads without losing attribution

Conservative match keys, human-reviewed merges, and a multi-source data model that survives every dedupe operation. The recipe that protects attribution.

By Founder, MegatronLead | 8 min read

Builds operational software for multi-market sales organizations. Twenty years across enterprise IT, M365, and revenue operations.

The promise of deduplication is simple: one canonical record per real-world person. The problem is that most dedupe implementations destroy the data they were supposed to protect, specifically the attribution chain that marketing depends on.

The fix is structural, and it lives at the data layer, not in the dedupe logic. Here is the recipe.

Why naive dedupe destroys attribution

A typical CRM stores source as a single field on the contact record: Lead Source, Original Source, Source ID. When two records merge, one source survives. Usually the merge UI lets the admin pick which record's source to keep, but the result is the same: one of the two contributing sources is gone.

For a single-channel organization this is fine. For an organization where the same person can arrive from Meta, then LinkedIn, then HubSpot within weeks, the merge throws away exactly the data the marketing team needs to make budget decisions.

The fix is not better dedupe logic. The fix is a data model that does not have a winner-takes-all source field.

Step 1: Store source as an event

The right data model puts source in a child relation. Each contact has zero or more source events. Each event records:

  • The channel (meta_leadgen, hubspot, salesforce, linkedin_ads, custom webhook, csv).
  • The timestamp.
  • The campaign or ad set, if applicable.
  • A pointer to the original payload (for forensic replay).
  • A reference to the source-system identifier.

The contact record itself can still have a "primary source" field if your analytics need a single value, but that field becomes derived (most recent, first ever, highest weight) rather than authoritative.
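A minimal sketch of this shape, assuming an in-memory model. The class and field names (SourceEvent, Contact, payload_ref, primary_source) are illustrative, not taken from any particular CRM or from MegatronLead's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class SourceEvent:
    channel: str                       # e.g. "meta_leadgen", "linkedin_ads"
    occurred_at: datetime              # when the lead arrived from this channel
    campaign: Optional[str] = None     # campaign or ad set, if applicable
    payload_ref: Optional[str] = None  # pointer to the stored original payload
    source_id: Optional[str] = None    # identifier in the source system


@dataclass
class Contact:
    contact_id: str
    source_events: List[SourceEvent] = field(default_factory=list)

    def primary_source(self, rule: str = "first") -> Optional[str]:
        """Derive a single-value source; the event list stays authoritative."""
        if not self.source_events:
            return None
        ordered = sorted(self.source_events, key=lambda e: e.occurred_at)
        if rule == "first":
            return ordered[0].channel
        if rule == "latest":
            return ordered[-1].channel
        raise ValueError(f"unknown rule: {rule}")


contact = Contact("c-1")
contact.source_events.append(
    SourceEvent("meta_leadgen", datetime(2024, 1, 5, tzinfo=timezone.utc), campaign="spring")
)
contact.source_events.append(
    SourceEvent("linkedin_ads", datetime(2024, 1, 19, tzinfo=timezone.utc))
)
```

Because primary_source is computed from the events, changing the rule later does not require touching stored data.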

The authoritative answer to "what touched this person" is the full source event list. Reports against the model can apply whichever attribution model they want (first-touch, last-touch, U-shaped, time-decay) without re-pulling data.
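To illustrate why the event list is enough, here is a rough sketch of four attribution models computed from the same events. The function name, the 40/20/40 split for U-shaped, and the 7-day half-life for time-decay are assumptions chosen for the example; your analytics team may use different weights.

```python
from datetime import datetime, timezone


def attribute(events, model="first_touch", half_life_days=7.0):
    """events: list of (channel, datetime). Returns {channel: credit}, summing to 1.0."""
    if not events:
        return {}
    ordered = sorted(events, key=lambda e: e[1])
    n = len(ordered)
    if model == "first_touch":
        weights = [1.0] + [0.0] * (n - 1)
    elif model == "last_touch":
        weights = [0.0] * (n - 1) + [1.0]
    elif model == "u_shaped":
        # Assumed split: 40% first, 40% last, 20% shared by the middle touches.
        if n == 1:
            weights = [1.0]
        elif n == 2:
            weights = [0.5, 0.5]
        else:
            weights = [0.4] + [0.2 / (n - 2)] * (n - 2) + [0.4]
    elif model == "time_decay":
        # Each touch is worth half as much per half-life of age, then normalized.
        latest = ordered[-1][1]
        raw = [
            0.5 ** ((latest - ts).total_seconds() / 86400 / half_life_days)
            for _, ts in ordered
        ]
        total = sum(raw)
        weights = [r / total for r in raw]
    else:
        raise ValueError(f"unknown model: {model}")
    credit = {}
    for (channel, _), weight in zip(ordered, weights):
        credit[channel] = credit.get(channel, 0.0) + weight
    return credit


touches = [
    ("meta_leadgen", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    ("linkedin_ads", datetime(2024, 1, 8, tzinfo=timezone.utc)),
    ("hubspot", datetime(2024, 1, 15, tzinfo=timezone.utc)),
]
u_shaped_credit = attribute(touches, "u_shaped")
```

The same touches list feeds every model, which is the point: switching models is a reporting decision, not a data migration.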

Step 2: Set match keys conservatively

Match keys are the rules that decide whether two records are the same person. Conservative match keys are the difference between dedupe that improves your data and dedupe that destroys it.

Three keys to use, in priority order:

Normalized email. Strip whitespace, lowercase, and remove plus-suffixes from the local part. John.Doe+sales@example.com and john.doe@example.com are likely the same person. Match.

E.164-normalized phone. Strip formatting, and prepend the country code based on the lead's market if the source lacks one. +1 555 123 4567 and 15551234567 are the same number.

Name plus company similarity. Fuzzy match on full name combined with exact match on company domain. Lower priority than email or phone because false positives happen (two people named John Smith at large companies).
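The first two keys can be sketched as a pair of normalizers. This is a simplified illustration: the function names and the MARKET_RULES table are assumptions made up for the example, and the table covers only two markets. A production system would more likely rely on a dedicated phone-parsing library than on a hand-written table.

```python
import re
from typing import Optional

# Hypothetical lookup: market -> (country calling code, national number length).
MARKET_RULES = {"US": ("1", 10), "CA": ("1", 10)}


def normalize_email(raw: str) -> Optional[str]:
    """Strip whitespace, lowercase, and drop any plus-suffix from the local part."""
    email = raw.strip().lower()
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        return None
    local = local.split("+", 1)[0]
    if not local:
        return None
    return f"{local}@{domain}"


def normalize_phone(raw: str, market: str) -> Optional[str]:
    """Return an E.164-style string, or None when the number cannot be placed safely."""
    raw = raw.strip()
    digits = re.sub(r"\D", "", raw)
    if not digits or len(digits) > 15:  # E.164 allows at most 15 digits
        return None
    if raw.startswith("+"):
        return "+" + digits             # source already supplied a country code
    rule = MARKET_RULES.get(market)
    if rule is None:
        return None                     # unknown market: refuse to guess
    code, national_len = rule
    if len(digits) == national_len:
        return "+" + code + digits      # derive the country code from the market
    if digits.startswith(code) and len(digits) == len(code) + national_len:
        return "+" + digits             # country code present, plus sign missing
    return None
```

Returning None instead of a best guess keeps the key conservative: a record with no usable phone key simply cannot match on phone.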

Three keys to avoid:

Company alone. Far too aggressive. Two different people at Acme Corp are not the same person.

Phone alone, without country code derivation. The same national number can belong to different people in different countries, so unqualified numbers collide.

Name alone. Common names produce a flood of false positives.

The right tuning produces conservative dedupe: occasional duplicates remain (because the system was not confident enough to merge), but false merges are rare.
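Putting the keys in priority order might look like the sketch below. It assumes records are plain dicts whose email and phone values are already normalized; the field names, the match_key function, and the 0.9 similarity threshold are assumptions for illustration. Name similarity uses difflib from the Python standard library, which is one of several reasonable choices.

```python
from difflib import SequenceMatcher
from typing import Optional

NAME_THRESHOLD = 0.9  # assumed cutoff; tune against your own reviewed matches


def name_similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().casefold(), b.strip().casefold()).ratio()


def match_key(a: dict, b: dict) -> Optional[str]:
    """Return the key the two records match on, or None if not confident."""
    if a.get("email") and a.get("email") == b.get("email"):
        return "email"
    if a.get("phone") and a.get("phone") == b.get("phone"):
        return "phone"
    same_domain = bool(a.get("company_domain")) and (
        a.get("company_domain") == b.get("company_domain")
    )
    if same_domain and a.get("name") and b.get("name"):
        if name_similarity(a["name"], b["name"]) >= NAME_THRESHOLD:
            return "name+company"
    # Company alone, or name alone, is never enough to declare a match.
    return None


lead_a = {"name": "John Smith", "company_domain": "acme.example"}
lead_b = {"name": "john smith ", "company_domain": "acme.example"}
lead_c = {"name": "John Smith", "company_domain": "other.example"}
```

Here lead_a and lead_b match on name plus company, while lead_a and lead_c stay separate because a shared name without a shared domain is treated as insufficient.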

Step 3: Route matches to human review

Auto-merge is a feature you turn on after your match-confidence model is well calibrated. In v1, every match goes to a human review queue.

A human review takes about 30 seconds per match. The reviewer sees both records side by side, with the matching fields highlighted, the existing source events on each record, and a merge button.

If the reviewer agrees, the merge proceeds. If not, the records stay separate.
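As a rough sketch of the routing, assuming a simple in-memory queue: the names MatchCandidate, ReviewQueue, route_match, and the AUTO_MERGE_ENABLED flag are hypothetical and stand in for whatever your platform provides.

```python
from dataclasses import dataclass
from typing import List, Optional

AUTO_MERGE_ENABLED = False  # v1 assumption: every match goes to a human


@dataclass
class MatchCandidate:
    record_a: str
    record_b: str
    matched_on: str                 # e.g. "email", "phone", "name+company"
    decision: Optional[str] = None  # "merge" or "keep_separate" once reviewed
    reviewer: Optional[str] = None


class ReviewQueue:
    def __init__(self) -> None:
        self._items: List[MatchCandidate] = []

    def enqueue(self, candidate: MatchCandidate) -> None:
        self._items.append(candidate)

    def pending(self) -> List[MatchCandidate]:
        """Candidates that no reviewer has decided yet."""
        return [c for c in self._items if c.decision is None]

    def decide(self, candidate: MatchCandidate, reviewer: str, approve: bool) -> bool:
        """Record the reviewer's decision; True means the merge may proceed."""
        candidate.decision = "merge" if approve else "keep_separate"
        candidate.reviewer = reviewer
        return approve


def route_match(candidate: MatchCandidate, queue: ReviewQueue) -> bool:
    """Return True only if the merge may proceed without human review."""
    if AUTO_MERGE_ENABLED:
        return True
    queue.enqueue(candidate)
    return False
```

Keeping the reviewer's name on the candidate means the eventual merge can be attributed to a person rather than to "the system".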

Volume in practice: most organizations see 5 to 15 review-queue items per day per 1000 inbound leads. A part-time operations role can clear the queue. The throughput is fine and the data quality benefit is enormous.

Step 4: Merge as set union, not winner takes all

The merge operation has to preserve information, not destroy it. Specifically:

  • Source events: union. Both records' source events are kept. The survivor inherits everything.
  • Notes: union. Notes from both are kept, attached to the survivor, with their original timestamps and authors.
  • Activity history: union. Calls, emails, meetings, state changes.
  • Custom attributes: most-recent wins, but the displaced value is logged. If the survivor and the duplicate both have a vertical attribute and they disagree, take the more recent and log the disagreement for review.
  • Ownership: explicit decision, audited. The merged contact has one owner. Pick the more active owner, or surface the conflict to a manager.

The record that appears deleted is not actually deleted. It becomes a tombstone with merged_into=<survivor_id>: hidden from default views, queryable for audit, and restorable if the merge was wrong.
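The union rules can be sketched in a few lines. This is an illustration under simplifying assumptions: events and notes are plain strings, each attribute carries an integer updated_at, and the Record class and merge function are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Record:
    record_id: str
    owner: str
    source_events: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)
    # attribute name -> (value, updated_at); updated_at is any sortable timestamp
    attributes: Dict[str, Tuple[str, int]] = field(default_factory=dict)
    merged_into: Optional[str] = None
    merged_records: List[str] = field(default_factory=list)


def merge(survivor: Record, duplicate: Record, owner: str) -> List[Tuple[str, str]]:
    """Fold duplicate into survivor. Returns displaced (attribute, value) pairs for review."""
    displaced = []
    for event in duplicate.source_events:      # union: no source event is dropped
        if event not in survivor.source_events:
            survivor.source_events.append(event)
    for note in duplicate.notes:               # union
        if note not in survivor.notes:
            survivor.notes.append(note)
    for name, (value, updated_at) in duplicate.attributes.items():
        current = survivor.attributes.get(name)
        if current is None:
            survivor.attributes[name] = (value, updated_at)
        elif current[0] != value:
            if updated_at > current[1]:        # most recent wins
                survivor.attributes[name] = (value, updated_at)
                displaced.append((name, current[0]))
            else:
                displaced.append((name, value))
    survivor.owner = owner                     # explicit decision made by the caller
    survivor.merged_records.append(duplicate.record_id)
    duplicate.merged_into = survivor.record_id  # tombstone keeps all of its own fields
    return displaced
```

Note that the duplicate's fields are never cleared. The tombstone marker is the only change made to it, which is what makes a later restore possible.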

Step 5: Keep the merge reversible

Merges are wrong sometimes. The reviewer was tired, the match-confidence was misleading, the two people really were different.

A reversible merge means:

  • The tombstone record is preserved with all original fields.
  • The survivor's merged_records list points back to each tombstone.
  • An "unmerge" operation extracts the original records, removes the merged data from the survivor, and resurrects the tombstones as live records.
  • The unmerge is itself an audited operation.
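One way to make unmerge tractable is to tag every item on the survivor with the record it originally arrived on. The sketch below assumes that approach, which is a design choice for the example and not something the steps above prescribe. Only source events are shown, and the audit trail is a plain list.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Record:
    record_id: str
    # (origin_record_id, event): origin marks which record the event arrived on
    source_events: List[Tuple[str, str]] = field(default_factory=list)
    merged_into: Optional[str] = None
    merged_records: List[str] = field(default_factory=list)


def merge(survivor: Record, duplicate: Record, audit: list) -> None:
    survivor.source_events.extend(duplicate.source_events)
    survivor.merged_records.append(duplicate.record_id)
    duplicate.merged_into = survivor.record_id  # tombstone keeps its original fields
    audit.append(("merge", survivor.record_id, duplicate.record_id))


def unmerge(survivor: Record, tombstone: Record, audit: list) -> None:
    if tombstone.merged_into != survivor.record_id:
        raise ValueError("tombstone was not merged into this survivor")
    # Remove only what the tombstone contributed; later events stay on the survivor.
    survivor.source_events = [
        (origin, event)
        for origin, event in survivor.source_events
        if origin != tombstone.record_id
    ]
    survivor.merged_records.remove(tombstone.record_id)
    tombstone.merged_into = None                # resurrect as a live record
    audit.append(("unmerge", survivor.record_id, tombstone.record_id))


audit_trail: list = []
keeper = Record("a", [("a", "meta_leadgen")])
dupe = Record("b", [("b", "linkedin_ads")])
merge(keeper, dupe, audit_trail)
keeper.source_events.append(("a", "hubspot"))   # activity after the merge
unmerge(keeper, dupe, audit_trail)
```

After the unmerge, keeper holds its own two events, dupe is live again with its one event, and the audit trail shows both operations.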

This is the property most CRMs lack. The CRM treats merge as one-way; the data is gone. A platform built for serious lead operations treats merge as a reversible transformation.

Step 6: Audit every merge

Every merge writes an audit log entry recording who initiated the merge, which records were merged, which fields differed, and what survived. The audit lives in the same tamper-evident log as everything else.
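How the log is made tamper-evident is not specified here. One common technique is a hash chain, where each entry includes the hash of the previous one. The sketch below assumes that technique; the AuditLog class and its field names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditLog:
    def __init__(self) -> None:
        self.entries = []

    def append(self, actor: str, action: str, details: dict) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {"actor": actor, "action": action, "details": details, "prev_hash": prev_hash}
        entry = dict(body, hash=_digest(body))
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited or removed entry breaks it."""
        prev_hash = GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True


log = AuditLog()
log.append("reviewer@example.com", "merge", {
    "survivor": "a",
    "merged": ["b"],
    "fields_differed": ["vertical"],
    "survived": {"vertical": "logistics"},
})
```

The details dict carries the four facts named above: who initiated the merge, which records were involved, which fields differed, and what survived.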

This matters when ownership disputes arise. "Why did my prospect's record get merged into this other one?" is a question with an answer in the audit trail, not in someone's memory.

What this gives you

A dedupe operation built this way does three things at once:

  • Improves data quality. Fewer duplicate records, cleaner reports, fewer "is this person the same as that one" conversations.
  • Preserves attribution. Marketing's ROI math stays intact because no source event is ever destroyed.
  • Maintains trust. Reps and managers know that the merge was reviewed, audited, and reversible. They stop fighting it.

The cost is a slightly more complex data model and a small ongoing operations workload to clear the review queue. The benefit is dedupe that works without breaking the things downstream that depend on the data.

For how MegatronLead's ingestion pipeline implements this, see the platform overview. For the audit log that records every merge, see security and compliance.
