The Identity Problem in Enterprise Data
Every large organization has the same problem: the same real-world entity — a company, a person, a product, a location — appears under different names, identifiers, and formats across different data systems. The CRM calls it "International Business Machines Corp." The ERP calls it "IBM." The procurement system calls it "IBM Corporation." The accounts payable system uses the DUNS number 001000001. The legal database uses the LEI code 5493000IBP32UQZ0KL24.
These are all the same entity. But without a system to recognize them as the same entity, every analysis that joins data across these systems will either miss connections (treating IBM and International Business Machines as different companies) or create false connections (treating different companies with similar names as the same). The downstream consequences are significant: incorrect revenue attribution, missed risk exposures, duplicated customer records, and AI systems that reason incorrectly about the entities in their knowledge base.
Entity resolution — also called record linkage, deduplication, or entity matching — is the process of determining which records across different data sources refer to the same real-world entity. It is one of the foundational data quality problems in enterprise data management, and it has become significantly more tractable with the advent of large language models and semantic similarity techniques.
Traditional Approaches and Their Limits
Traditional entity resolution systems use rule-based matching: if two records have the same tax ID, they are the same entity; if the normalized company name has an edit distance below a threshold, they are probably the same entity; if the address matches and the phone number matches, they are likely the same entity. These rules work well for clean, structured data but break down in the messy reality of enterprise data systems.
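To make that brittleness concrete, here is a minimal rule cascade in Python, using only the standard library. The rules, fields, and threshold are illustrative rather than drawn from any particular product:

```python
from difflib import SequenceMatcher

def rule_based_match(a: dict, b: dict, name_threshold: float = 0.85) -> bool:
    """Illustrative rule cascade: identifier rules first, then fuzzy name match."""
    # Rule 1: a shared tax ID is treated as a definitive match.
    if a.get("tax_id") and a["tax_id"] == b.get("tax_id"):
        return True
    # Rule 2: matching address AND phone is treated as a likely match.
    if a.get("address") and a["address"] == b.get("address") \
            and a.get("phone") and a["phone"] == b.get("phone"):
        return True
    # Rule 3: names within a similarity threshold are a probable match.
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= name_threshold

# Catches a suffix variation...
print(rule_based_match({"name": "International Business Machines Corp."},
                       {"name": "International Business Machines Corporation"}))  # True
# ...but "IBM" vs. "International Business Machines" fails every rule (ratio ~0.18).
print(rule_based_match({"name": "IBM"},
                       {"name": "International Business Machines"}))  # False
```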
The fundamental problem with rule-based approaches is that they require explicit enumeration of all the ways two records can refer to the same entity. This is tractable for a single data source with a known schema, but becomes intractable when integrating dozens of data sources with different schemas, naming conventions, and data quality levels. The rule set grows without bound, rules conflict with each other, and maintenance becomes a full-time job.
Probabilistic approaches, pioneered by Fellegi and Sunter in 1969 and implemented in modern tools like Splink and Dedupe.io, assign match probabilities based on the statistical distribution of field agreements. These approaches are more robust than pure rule-based systems, but they still rely on structured field comparisons and struggle with unstructured text, abbreviations, and cross-language entity names.
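The heart of the Fellegi-Sunter model fits in a few lines. For each field, m is the probability the field agrees given the records truly match, and u is the probability it agrees given they do not; each field contributes a log likelihood ratio, and the sum is compared to a threshold. A minimal sketch with invented m and u values (tools like Splink estimate these from the data using expectation-maximization):

```python
import math

# Invented parameters for illustration. m: P(field agrees | records match),
# u: P(field agrees | records do not match).
FIELD_PARAMS = {
    "name":  {"m": 0.90, "u": 0.01},
    "zip":   {"m": 0.95, "u": 0.10},
    "phone": {"m": 0.80, "u": 0.001},
}

def match_weight(agreements: dict) -> float:
    """Sum of log2 likelihood-ratio weights over observed field agreements."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PARAMS[field]["m"], FIELD_PARAMS[field]["u"]
        if agrees:
            total += math.log2(m / u)              # agreement adds evidence
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return total

# Agreement on name and zip outweighs a phone mismatch (~7.4 bits of evidence).
print(match_weight({"name": True, "zip": True, "phone": False}))
```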
AI-Powered Entity Resolution
The shift to AI-powered entity resolution began with the adoption of transformer-based embedding models for entity representation. Instead of comparing structured fields, AI-powered systems embed entity descriptions — the full text of a company profile, a person's biography, a product description — into dense vectors and use semantic similarity to identify matches.
This approach handles the cases that rule-based systems miss: "Big Blue" and "IBM" have high semantic similarity because they co-occur in the same contexts in the training data. "Alphabet Inc." and "Google" are recognized as related entities because their descriptions overlap significantly. A person described as "the founder of Microsoft" and "Bill Gates" can be matched even without a shared identifier.
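A minimal sketch of that comparison using the sentence-transformers library; the model name below is a common general-purpose default, not a model tuned for entity matching:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

descriptions = [
    "IBM, the multinational technology company headquartered in Armonk, New York.",
    "International Business Machines Corporation, a global IT and consulting firm.",
    "A regional bakery chain based in Portland, Oregon.",
]
# Normalized embeddings so that inner product equals cosine similarity.
embeddings = model.encode(descriptions, normalize_embeddings=True)

similarity = util.cos_sim(embeddings, embeddings)
print(float(similarity[0][1]))  # high: two descriptions of the same company
print(float(similarity[0][2]))  # low: unrelated entities
```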
Large language models take this further. Given two entity descriptions, an LLM can be asked directly: "Are these two records referring to the same real-world entity? Explain your reasoning." The LLM's ability to reason about context, abbreviations, subsidiaries, and historical name changes makes it far more capable than any rule-based or embedding-only system for ambiguous cases. Research from 2023 showed that GPT-4 achieves near-human performance on entity matching benchmarks without any task-specific fine-tuning.
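A sketch of that adjudication step using the OpenAI Python client; the model name and prompt wording are illustrative assumptions, not a reference implementation:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """Are these two records referring to the same real-world entity?
Answer YES or NO on the first line, then explain your reasoning.

Record A: {a}
Record B: {b}"""

def llm_adjudicate(record_a: str, record_b: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; use whatever model your stack standardizes on
        messages=[{"role": "user", "content": PROMPT.format(a=record_a, b=record_b)}],
        temperature=0,  # stable output for reproducible adjudication
    )
    return response.choices[0].message.content

print(llm_adjudicate(
    "Big Blue, Armonk NY, enterprise computing",
    "International Business Machines Corp., DUNS 001000001",
))
```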
The Blocking Problem: Scaling to Millions of Records
The fundamental scalability challenge in entity resolution is the quadratic comparison problem. If you have one million records and need to determine which pairs refer to the same entity, a naive approach requires comparing every record against every other record — roughly 500 billion pairs. Even at one microsecond per comparison, a single pass takes nearly six days of nonstop computation, and realistic matching logic is far slower than a microsecond per pair.
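The arithmetic is easy to verify:

```python
n = 1_000_000
pairs = n * (n - 1) // 2      # unordered pairs
seconds = pairs * 1e-6        # at one microsecond per comparison
print(f"{pairs:,} comparisons, {seconds / 86_400:.1f} days")
# 499,999,500,000 comparisons, 5.8 days
```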
The solution is blocking: a pre-filtering step that reduces the candidate pairs to a manageable set before applying expensive matching logic. Traditional blocking uses simple rules: only compare records that share the same first three characters of the company name, or the same zip code. These rules dramatically reduce the comparison space but introduce false negatives — pairs that are the same entity but don't share the blocking key.
AI-powered blocking uses embedding-based approximate nearest-neighbor search to find candidate pairs. Each entity is embedded into a vector, and the vector index is searched for the k nearest neighbors of each entity. Only these candidate pairs are passed to the full matching pipeline. This approach is both more accurate (it finds matches that rule-based blocking would miss) and more scalable (approximate nearest-neighbor search scales sublinearly with dataset size using algorithms like HNSW). The vector database infrastructure used for enterprise vector search is directly applicable to this blocking problem.
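A minimal blocking sketch using hnswlib, one common open-source HNSW implementation; the index parameters and the random stand-in embeddings are illustrative:

```python
import hnswlib
import numpy as np

# Stand-in data; in practice these are the entity embeddings from the
# embedding model, e.g. 384 dimensions for all-MiniLM-L6-v2.
n, dim = 10_000, 384
embeddings = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(embeddings, np.arange(n))
index.set_ef(64)  # query-time accuracy/speed knob; must be >= k

# Each entity retrieves its k nearest neighbors; k+1 because every
# point finds itself as its own nearest neighbor.
k = 10
labels, _ = index.knn_query(embeddings, k=k + 1)

candidate_pairs = {
    (min(i, int(j)), max(i, int(j)))
    for i, row in enumerate(labels)
    for j in row
    if int(j) != i
}
print(f"{len(candidate_pairs):,} candidate pairs instead of {n * (n - 1) // 2:,}")
```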
Entity Resolution for Knowledge Graph Construction
Entity resolution is a prerequisite for high-quality knowledge graph construction. A knowledge graph that contains duplicate nodes — "IBM" and "International Business Machines" as separate entities — will produce incorrect query results and mislead any AI system that reasons over it. Every relationship that should connect to IBM will be split across two nodes, making it impossible to get a complete picture of IBM's relationships.
The standard pipeline for knowledge graph construction from enterprise data includes entity resolution as a mandatory step: extract entities from source documents, resolve duplicates to create a canonical entity set, then build the graph on top of the resolved entities. This pipeline is described in detail in the Knowledge Graphs for Enterprise AI article.
The resolution step also enables cross-source entity linking: connecting entities in your internal knowledge graph to external knowledge bases like Wikidata, Google's Knowledge Graph, or industry-specific databases like GLEIF's Legal Entity Identifier registry. This cross-source linking dramatically enriches the knowledge graph with publicly available information about entities, without requiring manual data entry.
Building an Entity Resolution Pipeline in 2026
A production entity resolution pipeline in 2026 has four stages: preprocessing, blocking, matching, and clustering.
Preprocessing standardizes entity representations: normalize company names (remove legal suffixes like "Inc.", "LLC", "Corp."), standardize address formats, resolve abbreviations, and extract structured fields from unstructured text. This step dramatically improves the performance of all subsequent stages.
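A minimal name-normalization sketch; the suffix list is a small illustrative sample of what production systems maintain as long, curated tables:

```python
import re

# Illustrative subset; real systems maintain per-jurisdiction suffix tables.
LEGAL_SUFFIXES = r"\b(incorporated|inc|corporation|corp|company|co|llc|ltd|gmbh|plc|sa)\b\.?"

def normalize_company_name(name: str) -> str:
    """Lowercase, strip legal suffixes and punctuation, collapse whitespace."""
    name = name.lower()
    name = re.sub(LEGAL_SUFFIXES, " ", name)
    name = re.sub(r"[^\w\s]", " ", name)  # drop remaining punctuation
    return re.sub(r"\s+", " ", name).strip()

for raw in ["International Business Machines Corp.", "IBM Corporation", "I.B.M."]:
    print(normalize_company_name(raw))
# international business machines / ibm / i b m
```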
Blocking generates candidate pairs using embedding-based nearest-neighbor search. Embed each entity using a model fine-tuned for entity matching — sentence-transformers models work well as a starting point. Retrieve the top-k nearest neighbors for each entity from the vector index. The value of k is a trade-off between recall (higher k catches more true matches) and precision (lower k reduces false positives passed to the matching stage).
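One practical way to choose k is to measure blocking recall against a small labeled sample of known duplicate pairs. A self-contained sketch over a toy similarity matrix:

```python
import numpy as np

def candidates_at_k(sim: np.ndarray, k: int) -> set:
    """Top-k neighbors per row of a similarity matrix, as unordered pairs."""
    pairs = set()
    for i in range(sim.shape[0]):
        for j in np.argsort(-sim[i])[1:k + 1]:  # position 0 is the record itself
            pairs.add((min(i, int(j)), max(i, int(j))))
    return pairs

def blocking_recall(candidates: set, true_pairs: set) -> float:
    return sum(p in candidates for p in true_pairs) / len(true_pairs)

# Toy data: records 0/1 are duplicates, and 2/3/4 are one entity, but the
# direct 2-4 similarity is weak (0.50).
sim = np.array([
    [1.00, 0.92, 0.30, 0.25, 0.20],
    [0.92, 1.00, 0.28, 0.22, 0.18],
    [0.30, 0.28, 1.00, 0.90, 0.50],
    [0.25, 0.22, 0.90, 1.00, 0.85],
    [0.20, 0.18, 0.50, 0.85, 1.00],
])
true_pairs = {(0, 1), (2, 3), (3, 4), (2, 4)}

for k in (1, 2):
    print(k, blocking_recall(candidates_at_k(sim, k), true_pairs))
# k=1 -> 0.75 (the weak 2-4 pair is missed); k=2 -> 1.0
```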
Matching scores each candidate pair using a combination of embedding similarity, field-level comparison (exact match on tax ID, fuzzy match on name), and optionally an LLM for ambiguous cases. The LLM step should be reserved for high-value, ambiguous cases — it is too expensive to apply to all candidate pairs at scale.
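A sketch of that tiered scoring; the blend weights and routing thresholds are illustrative and would need tuning against labeled pairs:

```python
from difflib import SequenceMatcher

def match_score(a: dict, b: dict, embedding_sim: float) -> float:
    # A shared unique identifier short-circuits everything else.
    if a.get("tax_id") and a["tax_id"] == b.get("tax_id"):
        return 1.0
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return 0.6 * embedding_sim + 0.4 * name_sim  # illustrative blend

def route(score: float) -> str:
    """Only the ambiguous middle band is escalated to the expensive LLM step."""
    if score >= 0.85:
        return "match"
    if score <= 0.40:
        return "non-match"
    return "send to LLM"

# Score ~0.82: confident neither way, so this pair goes to the LLM.
print(route(match_score({"name": "IBM Corp"}, {"name": "IBM Corporation"}, 0.91)))
```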
Clustering groups matched pairs into entity clusters using a transitive closure algorithm: if A matches B and B matches C, then A, B, and C are all the same entity. Splink provides an excellent open-source implementation of this full pipeline with built-in support for probabilistic matching and clustering.
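The transitive closure is just connected components over the graph of matched pairs. A minimal sketch using networkx, with invented record IDs; note that a single false-positive match can chain otherwise-unrelated clusters together, which is why the threshold feeding this step matters:

```python
import networkx as nx

# Matched pairs emitted by the scoring stage (source-prefixed record IDs).
matched_pairs = [("crm:42", "erp:7"), ("erp:7", "ap:311"), ("crm:99", "legal:5")]

graph = nx.Graph(matched_pairs)
clusters = list(nx.connected_components(graph))
print(clusters)
# e.g. [{'crm:42', 'erp:7', 'ap:311'}, {'crm:99', 'legal:5'}] (set order may vary)
```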
Further Reading
How entity resolution feeds into knowledge graph construction.
Full technical definition of entity resolution and its role in data integration.
Test semantic similarity between entity descriptions — a core operation in modern entity resolution.
About the Author

Nick Eubanks
Entrepreneur, SEO Strategist & AI Infrastructure Builder
Nick Eubanks is a serial entrepreneur and digital strategist with nearly two decades of experience at the intersection of search, data, and emerging technology. He is the Global CMO of Digistore24, founder of IFTF Agency (acquired), and co-founder of the TTT SEO Community (acquired). A former Semrush team member and recognized authority in organic growth strategy, Nick has advised and built companies across SEO, content intelligence, and AI-driven marketing infrastructure. He is the founder of semantic.io — the definitive reference for the semantic AI era — and the Enterprise Risk Association at riskgovernance.com, where he publishes research on agentic AI governance for enterprise executives. Based in Miami, Nick writes at the frontier of semantic technology, AI architecture, and the infrastructure required to make enterprise AI actually work.