UnSpec 1.0.0


UnSpec

A .NET 9.0 class library implementing the Universal Normalization Algorithm Specification (UNAS) — deterministic normalization of entity identifiers across all world scripts, naming conventions, creative works, geographic regions, and languages.

UnSpec produces stable, comparable identifiers from messy real-world data by normalizing Unicode text, transliterating scripts, parsing culture-specific names, and generating BLAKE3-based composite identifiers with FRBR-layered granularity.

Key Features

  • Unicode Preprocessing Pipeline — NFC normalization, whitespace/punctuation/diacritic canonicalization, case folding with Turkish/German/Greek special cases
  • Script Detection & Transliteration — 15+ script families (Cyrillic, Greek, Arabic, Hebrew, Devanagari, CJK, Georgian, Armenian, Ethiopic, Thai, Bengali, Tamil, Thaana, Tifinagh, and more) to ASCII Latin
  • Person Name Normalization — Culture-aware strategies for Western, Spanish/Portuguese, East Asian, Arabic, South Asian, Icelandic, mononymous, and pseudonymous names
  • Creative Work & Product Normalization — Title preprocessing, article removal (40+ languages), subtitle/edition stripping, Roman numeral and spelled-out number normalization
  • Geographic & Language Normalization — Place name canonicalization and BCP 47 language tag normalization
  • BLAKE3 Identifier Generation — 128-bit deterministic hashes across four FRBR layers (Work, Expression, Manifestation, Item)
  • Collision Detection — Alias registry for managing known variant-to-canonical mappings
  • Versioning — Algorithm version migration with deterministic re-hashing
  • Phone Normalization — E.164-aligned canonical form with country code detection, trunk prefix stripping, vanity number mapping, and metadata for 100+ countries
  • Email Normalization — Provider-aware canonicalization: Gmail dot-stripping, sub-address (+tag) removal for Gmail/Outlook/Yahoo/Proton/iCloud, domain alias resolution
  • Address Normalization — Street abbreviation expansion (St→street, Ave→avenue, 50+ types), directional/unit normalization, ordinal stripping, US state and CA province expansion, postal code cleaning
  • Confidence Tracking — Every result carries a Green/Amber/Red confidence flag indicating normalization quality

The Problem UnSpec Solves

Real-world systems ingest entity data from many sources — user forms, API partners, imports, manual entry — and the same entity arrives in dozens of surface forms:

Source        Person                        Address / Org
CRM Import    Dr. José María García-López   Müller & Söhne GmbH
Web Form      jose garcia lopez             Mueller and Soehne
API Partner   GARCIA LOPEZ, Jose Maria      MULLER SOHNE GMBH
Call Center   García López, J.M.            Müller Söhne

These all refer to the same person and the same organization. Without normalization, your Person, Contact, Address, and Organization tables accumulate duplicates that poison search, reporting, and matching.

UnSpec gives each entity a single deterministic normalized form and a BLAKE3 hash so you can deduplicate, match, and index reliably across any script or culture.

Architecture

┌──────────────────────────────────────────────────────────┐
│                   NormalizationPipeline                   │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌─────────────┐ │
│  │ Encoding │→│Whitespace│→│Punctuatn │→│ Case Folding│→...
│  │   NFC    │ │  Collapse│ │  Symbols │ │  + Special  │ │
│  └──────────┘ └──────────┘ └──────────┘ └─────────────┘ │
└──────────────────────────────────────────────────────────┘
         │
         ▼
┌──────────────────┐     ┌──────────────────────────┐
│ Script Detection │────→│  Transliteration Registry │
│  (per-segment)   │     │  (script → transliterator)│
└──────────────────┘     └──────────────────────────┘
         │
         ▼
┌──────────────────────────────────────────────────────────┐
│               Domain Normalizers                         │
│  PersonName │ CreativeWork │ Product │ Geographic │ Lang │
└──────────────────────────────────────────────────────────┘
         │
         ▼
┌──────────────────────────────────────────────────────────┐
│            BLAKE3 Identifier Generation                  │
│    Work (W) → Expression (E) → Manifestation (M) → Item │
└──────────────────────────────────────────────────────────┘

Every component implements an interface and is independently composable. Pipelines can nest inside other pipelines. Strategies and transliterators can be added or replaced at runtime.
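As a sketch of that extensibility, the snippet below registers a custom transliterator at runtime. TransliterationRegistry comes from the architecture diagram above, but the ITransliterator shape and the Register method are assumptions about the API, not documented signatures:

```csharp
using UnSpec.Transliteration;

// Hypothetical transliterator for a script family not covered out of the box.
// ITransliterator's members (Script, Transliterate) are assumed, not documented.
public sealed class RunicTransliterator : ITransliterator
{
    public string Script => "Runic";

    public string Transliterate(string input) =>
        // Map each code point to an ASCII Latin approximation; one rune
        // shown here purely for illustration.
        input.Replace("\u16A0", "f"); // ᚠ (RUNIC LETTER FEHU) -> "f"
}

// At startup, plug the custom transliterator into the registry so the
// script-detection stage can route Runic segments through it.
var registry = new TransliterationRegistry();
registry.Register(new RunicTransliterator());
```

The same pattern applies to swapping person-name strategies or nesting one pipeline inside another.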

Requirements

  • .NET 9.0+
  • Blake3 NuGet package (pulled automatically)

Quick Start

dotnet add package UnSpec

using UnSpec;
using UnSpec.Identifiers;
using UnSpec.Pipeline;

// Generate a work identifier
var workGen = new WorkIdentifierGenerator();
var workId = workGen.Generate("The Last Unicorn", "Peter S. Beagle", "book");
// → v1.W.a3f7c9e2b1d4f6a8.g

// Generate the full FRBR chain
var exprId = new ExpressionIdentifierGenerator().Generate(workId, "en");
var manId = new ManifestationIdentifierGenerator().Generate(exprId, "hardcover", "Penguin", "1968");
var itemId = new ItemIdentifierGenerator().Generate(manId, "Library of Congress", "LC-123456");

See CONSUMPTION.md for complete API documentation.


Real-World Usage with RDBMS

The following sections show how to integrate UnSpec with relational databases for common business entities. The pattern is always the same: normalize on write, index the hash, query by hash.

Schema Design Pattern

For any entity table, add two computed columns alongside the original data:

┌─────────────────────────────────────────────────────────────┐
│  Raw columns (preserve original)  │  UnSpec columns (query) │
│  first_name, last_name, ...       │  normalized_key          │
│                                   │  normalized_hash         │
│                                   │  confidence              │
└─────────────────────────────────────────────────────────────┘
  • normalized_key — The human-readable canonical form (for debugging and display)
  • normalized_hash — The BLAKE3 identifier (for indexing and matching)
  • confidence — Green/Amber/Red flag (for filtering questionable matches)

Always preserve the original raw data. The normalized columns exist purely for deduplication and lookup.

Person Table

A person arrives as "Dr. José María García-López", "jose garcia lopez", or "GARCIA LOPEZ, Jose Maria". All three should resolve to the same row.

Schema:

CREATE TABLE Person (
    id              BIGINT IDENTITY PRIMARY KEY,
    -- Original data (preserved exactly as received)
    first_name      NVARCHAR(200)   NOT NULL,
    last_name       NVARCHAR(200)   NOT NULL,
    full_name_raw   NVARCHAR(500)   NOT NULL,
    source_system   NVARCHAR(100),
    language_tag    NVARCHAR(20)    DEFAULT 'en',
    -- UnSpec normalized columns
    normalized_key  NVARCHAR(500)   NOT NULL,   -- "garcia-lopez:jose maria"
    normalized_hash CHAR(52)        NOT NULL,   -- "v1.W.a3f7c9e2b1d4f6a8c9d0e1f2a3b4c5d6.g"
    confidence      CHAR(1)         NOT NULL,   -- 'g', 'a', or 'r'
    -- Timestamps
    created_at      DATETIME2       DEFAULT SYSUTCDATETIME(),
    updated_at      DATETIME2       DEFAULT SYSUTCDATETIME()
);

CREATE INDEX IX_Person_NormalizedHash ON Person (normalized_hash);
CREATE INDEX IX_Person_Confidence     ON Person (confidence) WHERE confidence <> 'r';

C# — Populate on insert/update:

using UnSpec.PersonName;
using UnSpec.Identifiers;
using UnSpec.Pipeline;

public class PersonService
{
    private readonly PersonNameNormalizer _nameNorm = new();
    private readonly WorkIdentifierGenerator _idGen = new();

    public (string key, string hash, char confidence) NormalizePerson(
        string fullName, string languageTag)
    {
        var ctx = new NormalizationContext { LanguageTag = languageTag };
        var result = _nameNorm.Normalize(fullName, ctx);

        // Canonical form: "primary:secondary"
        string key = result.ToCanonicalForm();

        // BLAKE3 hash for indexing
        var id = _idGen.Generate(key, "", "person");
        char conf = id.Confidence switch
        {
            Confidence.Green => 'g',
            Confidence.Amber => 'a',
            _                => 'r'
        };

        return (key, id.ToString(), conf);
    }
}

What this buys you:

INSERT 1: "Dr. José María García-López"  → hash: v1.W.7f3a...  (garcia-lopez:jose maria)
INSERT 2: "jose garcia lopez"            → hash: v1.W.7f3a...  (same hash!)
INSERT 3: "GARCIA LOPEZ, Jose Maria"     → hash: v1.W.7f3a...  (same hash!)

-- Find all variants of the same person
SELECT * FROM Person WHERE normalized_hash = 'v1.W.7f3a...';

-- Deduplicate: find persons that appear more than once
SELECT normalized_hash, COUNT(*) as occurrences
FROM Person
WHERE confidence <> 'r'
GROUP BY normalized_hash
HAVING COUNT(*) > 1;

Contact Table

Contacts combine a person with communication channels. Use PersonNameNormalizer for the name, EmailNormalizer for email, and PhoneNormalizer for phone — each has its own domain-specific rules.

Schema:

CREATE TABLE Contact (
    id                  BIGINT IDENTITY PRIMARY KEY,
    -- Original data
    display_name        NVARCHAR(300)   NOT NULL,
    email_raw           NVARCHAR(320),
    phone_raw           NVARCHAR(50),
    country_code        CHAR(2)         DEFAULT 'US',
    language_tag        NVARCHAR(20)    DEFAULT 'en',
    -- UnSpec: person normalization
    person_norm_key     NVARCHAR(500)   NOT NULL,
    person_norm_hash    CHAR(52)        NOT NULL,
    -- UnSpec: email normalization (provider-aware canonical form)
    email_normalized    NVARCHAR(320),
    email_confidence    CHAR(1),
    -- UnSpec: phone normalization (E.164 canonical form)
    phone_normalized    VARCHAR(20),        -- "+{cc}{national}", e.g. "+12125551234"
    phone_confidence    CHAR(1),
    -- UnSpec: composite contact hash (person + email + phone for dedup)
    contact_hash        CHAR(52)        NOT NULL,
    confidence          CHAR(1)         NOT NULL,
    created_at          DATETIME2       DEFAULT SYSUTCDATETIME()
);

CREATE INDEX IX_Contact_PersonHash  ON Contact (person_norm_hash);
CREATE INDEX IX_Contact_ContactHash ON Contact (contact_hash);
CREATE INDEX IX_Contact_Email       ON Contact (email_normalized);
CREATE INDEX IX_Contact_Phone       ON Contact (phone_normalized);

C#:

using UnSpec;
using UnSpec.PersonName;
using UnSpec.Email;
using UnSpec.Phone;
using UnSpec.Identifiers;
using UnSpec.Pipeline;

public class ContactService
{
    private readonly PersonNameNormalizer _nameNorm = new();
    private readonly EmailNormalizer _emailNorm = new();
    private readonly PhoneNormalizer _phoneNorm = new();
    private readonly IdentifierGenerator _idGen = new();

    public ContactNormalized Normalize(
        string displayName, string? email, string? phone,
        string lang, string countryCode = "US")
    {
        var ctx = new NormalizationContext { LanguageTag = lang };

        // Normalize person name
        var person = _nameNorm.Normalize(displayName, ctx);
        var personId = _idGen.Generate(
            person.ToCanonicalForm(), IdentifierLayer.Work, person.Confidence);

        // Normalize email (provider-aware: Gmail dot-strip, sub-address removal, etc.)
        string? emailNorm = null;
        Confidence emailConf = Confidence.Red;
        if (!string.IsNullOrWhiteSpace(email))
        {
            var emailResult = _emailNorm.Normalize(email);
            emailNorm = emailResult.CanonicalForm;
            emailConf = emailResult.Confidence;
        }

        // Normalize phone (E.164 canonical form)
        string? phoneNorm = null;
        Confidence phoneConf = Confidence.Red;
        if (!string.IsNullOrWhiteSpace(phone))
        {
            var phoneResult = _phoneNorm.Normalize(phone, countryCode);
            phoneNorm = phoneResult.CanonicalForm;
            phoneConf = phoneResult.Confidence;
        }

        // Composite hash: person + email + phone
        string compositeInput = $"{person.ToCanonicalForm()}|{emailNorm ?? ""}|{phoneNorm ?? ""}";
        var contactId = _idGen.Generate(
            compositeInput, IdentifierLayer.Work, person.Confidence);

        return new ContactNormalized(person, personId, emailNorm, emailConf,
            phoneNorm, phoneConf, contactId);
    }
}

What this buys you for email:

john.doe+promo@gmail.com     → johndoe@gmail.com   (dots stripped, sub-address removed)
John.Doe@googlemail.com      → johndoe@gmail.com   (same!)
j.o.h.n.d.o.e@Gmail.COM     → johndoe@gmail.com   (same!)

What this buys you for phone:

+1 (212) 555-1234            → +12125551234
1-212-555-1234               → +12125551234  (same!)
212.555.1234  (country=US)   → +12125551234  (same!)
020 7946 0958 (country=GB)   → +442079460958 (trunk prefix stripped)

Dedup query — find contacts that are probably the same person regardless of format:

-- Same person across different entries
SELECT person_norm_key, COUNT(*) as entries
FROM Contact
WHERE confidence IN ('g', 'a')
GROUP BY person_norm_hash, person_norm_key
HAVING COUNT(*) > 1;

-- Same email, different person records (possible duplicate identities)
SELECT email_normalized, COUNT(*) as cnt
FROM Contact
WHERE email_normalized IS NOT NULL AND email_confidence <> 'r'
GROUP BY email_normalized
HAVING COUNT(*) > 1;

-- Same phone, different records
SELECT phone_normalized, COUNT(*) as cnt
FROM Contact
WHERE phone_normalized IS NOT NULL AND phone_confidence <> 'r'
GROUP BY phone_normalized
HAVING COUNT(*) > 1;

Organization Table

Organization names suffer from suffix noise (Inc., Corp., GmbH, Ltd., S.A.), transliteration differences (Müller vs Mueller), and ampersand variants (& vs and). UnSpec handles all of these.

Schema:

CREATE TABLE Organization (
    id              BIGINT IDENTITY PRIMARY KEY,
    -- Original data
    legal_name      NVARCHAR(500)   NOT NULL,
    trade_name      NVARCHAR(500),
    country_code    CHAR(2),
    language_tag    NVARCHAR(20)    DEFAULT 'en',
    -- UnSpec normalized columns
    normalized_key  NVARCHAR(500)   NOT NULL,
    normalized_hash CHAR(52)        NOT NULL,
    confidence      CHAR(1)         NOT NULL,
    created_at      DATETIME2       DEFAULT SYSUTCDATETIME()
);

CREATE INDEX IX_Org_NormalizedHash ON Organization (normalized_hash);

C#:

using UnSpec;
using UnSpec.Identifiers;
using UnSpec.Product;
using UnSpec.Pipeline;

public class OrganizationService
{
    private readonly ProductNormalizer _prodNorm = new();
    private readonly IdentifierGenerator _idGen = new();

    public (string key, string hash, char confidence) NormalizeOrg(
        string legalName, string? countryCode, string lang)
    {
        var ctx = new NormalizationContext { LanguageTag = lang };

        // ProductNormalizer strips legal suffixes (Inc, Corp, GmbH, Ltd, etc.)
        // and normalizes Unicode, ampersands, and diacritics
        var result = _prodNorm.Normalize(legalName, "", "organization", ctx);

        // BLAKE3 identifier for the normalized_hash column. (Never persist
        // String.GetHashCode(): it is randomized per process in .NET.)
        var id = _idGen.Generate(result.Value, IdentifierLayer.Work, result.Confidence);

        char conf = result.Confidence switch
        {
            Confidence.Green => 'g',
            Confidence.Amber => 'a',
            _                => 'r'
        };

        return (result.Value, id.ToString(), conf);
    }
}

What normalizes identically:

Raw Input              Normalized Key
Müller & Söhne GmbH    muller and sohne|organization
Mueller and Soehne     muller and sohne|organization  (after pipeline: ü→u, ö→o)
MÜLLER SÖHNE GMBH      muller and sohne|organization
Muller & Sohne, Inc.   muller and sohne|organization

All four rows produce the same normalized key and therefore the same BLAKE3 identifier; use id.ToString() as normalized_hash for indexing and matching.

Address Table

The AddressNormalizer handles full structured addresses: street abbreviation expansion (St→street, Ave→avenue, 50+ types), directional expansion (N→north, NE→northeast), unit normalization (Apt→apartment, Ste→suite), ordinal stripping (1st→1), US state/CA province expansion, and postal code cleaning.

Schema:

CREATE TABLE Address (
    id                  BIGINT IDENTITY PRIMARY KEY,
    -- Original data
    street_line_1       NVARCHAR(500),
    street_line_2       NVARCHAR(500),
    city_raw            NVARCHAR(200)   NOT NULL,
    state_province_raw  NVARCHAR(200),
    postal_code         NVARCHAR(20),
    country_raw         NVARCHAR(200)   NOT NULL,
    language_tag        NVARCHAR(20)    DEFAULT 'en',
    -- UnSpec: normalized components
    street_normalized   NVARCHAR(500),
    city_normalized     NVARCHAR(200)   NOT NULL,
    state_normalized    NVARCHAR(200),
    postal_normalized   VARCHAR(20),
    country_normalized  NVARCHAR(200)   NOT NULL,
    -- UnSpec: composite address hash for dedup
    address_canonical   NVARCHAR(1000)  NOT NULL,   -- "street|city|state|postal|country"
    address_hash        CHAR(52)        NOT NULL,
    confidence          CHAR(1)         NOT NULL,
    created_at          DATETIME2       DEFAULT SYSUTCDATETIME()
);

CREATE INDEX IX_Address_Hash        ON Address (address_hash);
CREATE INDEX IX_Address_City        ON Address (city_normalized);
CREATE INDEX IX_Address_Postal      ON Address (postal_normalized);

C#:

using UnSpec;
using UnSpec.Address;
using UnSpec.Identifiers;
using UnSpec.Pipeline;

public class AddressService
{
    private readonly AddressNormalizer _addrNorm = new();
    private readonly IdentifierGenerator _idGen = new();

    public (AddressNormalizationResult result, string hash) Normalize(
        string? street1, string? street2, string city,
        string? state, string? postal, string country, string lang)
    {
        var ctx = new NormalizationContext { LanguageTag = lang };

        var result = _addrNorm.Normalize(new AddressInput
        {
            Street1 = street1,
            Street2 = street2,
            City = city,
            StateProvince = state,
            PostalCode = postal,
            Country = country
        }, ctx);

        var id = _idGen.Generate(
            result.CanonicalForm, IdentifierLayer.Work, result.Confidence);

        return (result, id.ToString());
    }
}

What normalizes identically:

Raw Address                                                  Normalized Canonical Form
123 Main St, Apt 4B, Springfield, IL 62704                   123 main street apartment 4b|springfield|illinois|62704|us
123 Main Street, Apartment 4B, Springfield, Illinois 62704   123 main street apartment 4b|springfield|illinois|62704|us
123 MAIN ST APT 4B, SPRINGFIELD, IL                          123 main street apartment 4b|springfield|illinois||

Abbreviation expansion ensures St = Street, Ave = Avenue, Apt = Apartment, N = North, CA = California, ON = Ontario, etc. Ordinals like 1st and 42nd are normalized to 1 and 42.

Note: For geographic matching at the city/country level (without street-level precision), use GeographicNormalizer directly. AddressNormalizer is for full postal address dedup.
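GeographicNormalizer never appears in code in this README. The sketch below assumes it follows the same Normalize(value, context) shape as the other normalizers; the exact signature and output are assumptions:

```csharp
using UnSpec.Geographic;
using UnSpec.Pipeline;

// Hypothetical sketch: city-level matching with GeographicNormalizer.
// The Normalize(value, context) signature mirrors the other normalizers
// in this README and is an assumption, not the documented API.
var geoNorm = new GeographicNormalizer();
var ctx = new NormalizationContext { LanguageTag = "de" };

var result = geoNorm.Normalize("München", ctx);

// The canonical form is "name|geo_type|parent_region", so a city row
// would look something like "munich|city|bavaria".
Console.WriteLine(result.Value);
```

Pair this with IdentifierGenerator, exactly as in the Organization example, to get an indexable hash for city-level dedup.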

When to Use Each Normalizer

  • Person → PersonNameNormalizer: culture-aware name parsing into primary:secondary form. Handles particles (van, von, de, al-), prefixes (Mc/Mac/O'), titles (Dr., Prof.), suffixes (Jr., III), and 8 culture strategies. Use for: Person, Employee, Customer, Author, Patient.
  • Email → EmailNormalizer: provider-aware canonicalization. Gmail: strip dots + sub-address. Outlook/Yahoo/Proton/iCloud: strip sub-address. Domain alias resolution (googlemail→gmail). Use for: Contact, User, Subscriber, Lead (any table with email addresses).
  • Phone → PhoneNormalizer: E.164-aligned canonical form with country code detection, trunk prefix stripping (0→), vanity letter mapping (1-800-FLOWERS), and metadata for 100+ countries. Use for: Contact, Customer, Lead (any table with phone numbers).
  • Address → AddressNormalizer: full postal addresses; street abbreviation expansion (50+ types), directional/unit normalization, ordinal stripping, US state and CA province expansion, postal code cleaning. Use for: Address, Location, Branch, Warehouse, Store (structured postal addresses).
  • Organization → ProductNormalizer: strips legal suffixes (Inc., GmbH, Ltd., S.A.), normalizes & to "and", handles diacritics and case. Use for: Organization, Company, Vendor, Publisher, Employer.
  • Place / City → GeographicNormalizer: canonical name|geo_type|parent_region form with a controlled geo-type vocabulary. Use for: city/country-level geographic matching (not full postal addresses).
  • Creative Work → WorkNormalizer: title normalization (article/subtitle/edition removal) plus creator and type in canonical form. Use for: Book, Film, Album, Game, and media catalog tables.
  • Product / SKU → ProductNormalizer: name + manufacturer + category canonical form with suffix stripping. Use for: Product, SKU, Inventory, Catalog tables.
  • Language Tag → LanguageNormalizer: BCP 47 canonicalization (legacy codes, script suppression, case normalization). Use for: any column storing locale/language codes.
  • Free Text → UnicodePreprocessingPipeline: raw Unicode normalization only (NFC, whitespace, punctuation, case folding, diacritics). Use for: search terms, tags, notes, or any string column.
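The last two entries (LanguageNormalizer and UnicodePreprocessingPipeline) don't appear in code anywhere else in this README. A minimal sketch, assuming each exposes a simple single-string entry point (the method names here are guesses, not documented API):

```csharp
using UnSpec.Language;
using UnSpec.Pipeline;

// Free text: scrub a user-typed search term before querying an indexed
// column. The Process(string) method name is a guess at the entry point.
var pipeline = new UnicodePreprocessingPipeline();
string term = pipeline.Process("  Café  MÜNCHEN  ");
// After NFC, whitespace collapse, case folding, and diacritic stripping
// this should come out as something like "cafe munchen".

// Language tag: canonicalize a legacy locale code to BCP 47 form.
var langNorm = new LanguageNormalizer();
string tag = langNorm.Normalize("iw-IL");
// Per BCP 47, the legacy code "iw" maps to "he", so expect "he-IL".
```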

Pattern: Dedup on Insert

A common pattern is to check the normalized hash before inserting, preventing duplicates at the application layer (the data-access calls below use Dapper):

public async Task<long> UpsertPerson(string fullName, string lang, DbConnection db)
{
    var (key, hash, conf) = _personService.NormalizePerson(fullName, lang);

    // Check for existing match
    var existing = await db.QueryFirstOrDefaultAsync<long?>(
        "SELECT id FROM Person WHERE normalized_hash = @hash",
        new { hash });

    if (existing.HasValue)
        return existing.Value;

    // Insert new
    return await db.ExecuteScalarAsync<long>(
        @"INSERT INTO Person (full_name_raw, language_tag, normalized_key, normalized_hash, confidence)
          VALUES (@raw, @lang, @key, @hash, @conf);
          SELECT SCOPE_IDENTITY();",
        new { raw = fullName, lang, key, hash, conf });
}

Pattern: Fuzzy Match with Confidence Filtering

When matching across systems, use the confidence flag to control strictness:

-- High-confidence matches only (deterministic, unambiguous)
SELECT a.*, b.*
FROM System_A a
JOIN System_B b ON a.normalized_hash = b.normalized_hash
WHERE a.confidence = 'g' AND b.confidence = 'g';

-- Include heuristic matches for review
SELECT a.*, b.*, a.confidence AS conf_a, b.confidence AS conf_b
FROM System_A a
JOIN System_B b ON a.normalized_hash = b.normalized_hash
WHERE a.confidence IN ('g', 'a') AND b.confidence IN ('g', 'a')
ORDER BY
    CASE WHEN a.confidence = 'g' AND b.confidence = 'g' THEN 1
         WHEN a.confidence = 'g' OR  b.confidence = 'g' THEN 2
         ELSE 3 END;

Pattern: Alias Resolution for Known Variants

Some entities normalize to different strings but represent the same thing (Mark Twain vs Samuel Clemens, Munich vs München). Use the alias registry:

using UnSpec.Collision;

var aliases = new InMemoryAliasRegistry();

// Register known alias
aliases.Register(new AliasEntry(
    Type: "person",
    VariantNormalized: "twain:mark",
    CanonicalNormalized: "clemens:samuel",
    CanonicalId: null,
    Source: "manual_review",
    Confidence: Confidence.Green
));

// At query time, check both direct hash match AND alias
public async Task<List<Person>> FindPerson(string name, string lang, DbConnection db)
{
    var (key, hash, _) = _personService.NormalizePerson(name, lang);

    // Check alias registry
    var alias = aliases.Lookup(key, "person");
    string canonicalKey = alias?.CanonicalNormalized ?? key;

    return (await db.QueryAsync<Person>(
        "SELECT * FROM Person WHERE normalized_key = @key",
        new { key = canonicalKey })).AsList();
}

For production systems, implement IAliasRegistry with a database-backed store:

CREATE TABLE NormalizationAlias (
    id                      BIGINT IDENTITY PRIMARY KEY,
    entity_type             NVARCHAR(50)    NOT NULL,   -- 'person', 'organization', etc.
    variant_normalized      NVARCHAR(500)   NOT NULL,
    canonical_normalized    NVARCHAR(500)   NOT NULL,
    canonical_id            CHAR(52),
    source                  NVARCHAR(100)   NOT NULL,
    confidence              CHAR(1)         NOT NULL,
    added_by                NVARCHAR(100),
    added_date              DATE            DEFAULT GETDATE(),
    UNIQUE (entity_type, variant_normalized)
);

CREATE INDEX IX_Alias_Canonical ON NormalizationAlias (entity_type, canonical_normalized);
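A database-backed implementation might look like the following. IAliasRegistry's exact member list is not shown in this README, so Register and Lookup below simply mirror the calls used with InMemoryAliasRegistry above; treat the shapes as assumptions. Data access uses Dapper for brevity:

```csharp
using System.Data;
using Dapper;
using UnSpec;
using UnSpec.Collision;

// Sketch of a SQL-backed alias registry over the NormalizationAlias table.
public sealed class SqlAliasRegistry : IAliasRegistry
{
    private readonly IDbConnection _db;

    public SqlAliasRegistry(IDbConnection db) => _db = db;

    public void Register(AliasEntry entry) => _db.Execute(
        @"INSERT INTO NormalizationAlias
              (entity_type, variant_normalized, canonical_normalized,
               canonical_id, source, confidence)
          VALUES (@type, @variant, @canonical, @id, @source, @conf);",
        new
        {
            type = entry.Type,
            variant = entry.VariantNormalized,
            canonical = entry.CanonicalNormalized,
            id = entry.CanonicalId,
            source = entry.Source,
            conf = entry.Confidence switch
            {
                Confidence.Green => "g",
                Confidence.Amber => "a",
                _                => "r"
            }
        });

    public AliasEntry? Lookup(string variantNormalized, string entityType)
    {
        var row = _db.QueryFirstOrDefault(
            @"SELECT variant_normalized, canonical_normalized, canonical_id,
                     source, confidence
              FROM NormalizationAlias
              WHERE entity_type = @entityType
                AND variant_normalized = @variantNormalized;",
            new { entityType, variantNormalized });

        if (row is null) return null;

        return new AliasEntry(
            Type: entityType,
            VariantNormalized: (string)row.variant_normalized,
            CanonicalNormalized: (string)row.canonical_normalized,
            CanonicalId: (string?)row.canonical_id,
            Source: (string)row.source,
            Confidence: (string)row.confidence switch
            {
                "g" => Confidence.Green,
                "a" => Confidence.Amber,
                _   => Confidence.Red
            });
    }
}
```

The UNIQUE (entity_type, variant_normalized) constraint makes Register fail on duplicate variants; swap the INSERT for a MERGE if you want upsert semantics.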

Running Tests

dotnet test

887 tests covering every normalization stage, transliterator, name strategy, identifier generator, phone/email/address normalizer, and integration scenario.

Project Structure

src/UnSpec/
├── Pipeline/                    # Composable normalization stages
│   └── UnicodePipeline/         # §2: Encoding, whitespace, punctuation, case, diacritics
├── ScriptDetection/             # §3.1: Unicode script identification
├── Transliteration/             # §3.2–3.14: Script-to-Latin transliterators
├── PersonName/                  # §4: Culture-specific name normalization
│   └── Strategies/              # Western, Arabic, East Asian, etc.
├── CreativeWork/                # §5: Title and work normalization
├── Product/                     # §5: Product normalization
├── Phone/                       # E.164 phone normalization
├── Email/                       # Provider-aware email normalization
├── Address/                     # Structured address normalization
├── Geographic/                  # §6: Geographic entity normalization
├── Language/                    # §7: BCP 47 language tag normalization
├── Identifiers/                 # §8: BLAKE3 identifier generation (4 FRBR layers)
├── Collision/                   # §9: Alias registry for collision resolution
├── Versioning/                  # §10: Algorithm version migration
└── Vocabularies/                # Appendices: Controlled vocabularies

License

See LICENSE file.
