Show HN: Splink 3 – multi-backend fuzzy record linkage at scale (FOSS/Python)

RobinL · on Aug 8, 2022

Lead author here. Feel free to ask any questions!

Splink is a tool for deduplicating and linking datasets that lack unique identifiers. For instance, the same customer may have been entered several times across different computer systems, with no shared ID to tie the records together.

There’s a mature academic literature, but before Splink there were no free, open source options that enable linkage of very large datasets. One of the best-practice models is known as the Fellegi-Sunter model, which is the backbone of Splink.
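For anyone unfamiliar with Fellegi-Sunter, here is a minimal sketch of the core idea. It is not Splink's actual code, and the field names, m/u values and prior are made-up numbers purely for illustration. Each field comparison contributes a log2 Bayes factor (a "match weight"), and the weights are added to the prior log odds to give a match probability:

```python
import math

# Hypothetical parameters, for illustration only:
# m = P(field agrees | records are a match)
# u = P(field agrees | records are not a match)
PARAMS = {
    "first_name": {"m": 0.9, "u": 0.01},
    "surname": {"m": 0.95, "u": 0.005},
    "dob": {"m": 0.98, "u": 0.001},
}


def match_weight(m, u, agrees):
    """Log2 Bayes factor contributed by one field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_probability(agreements, params=PARAMS, prior=0.001):
    """Combine per-field weights with the prior into a match probability.

    `agreements` maps field name -> True/False (does the field agree?).
    `prior` is the assumed probability that a random pair is a match.
    Fields are treated as conditionally independent given match status.
    """
    total = math.log2(prior / (1 - prior))  # prior log odds
    for field, agrees in agreements.items():
        p = params[field]
        total += match_weight(p["m"], p["u"], agrees)
    odds = 2 ** total
    return odds / (1 + odds)


all_agree = {"first_name": True, "surname": True, "dob": True}
none_agree = {"first_name": False, "surname": False, "dob": False}
print(match_probability(all_agree))
print(match_probability(none_agree))
```

This sketch only handles exact agree/disagree. Fuzzy linkage in practice uses more than two comparison levels per field (e.g. exact match, close match, no match), each with its own m and u.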

In Splink 3, we’ve introduced support for executing linkage workloads on several different SQL backends: DuckDB, AWS Athena and Spark, using the excellent sqlglot library for transpilation.

An interesting aspect is that the model parameters can be estimated with unsupervised learning (expectation maximisation), so you don’t need labelled data.
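To give a flavour of how estimation without labels can work, here is a toy expectation-maximisation loop for a Fellegi-Sunter model with binary agree/disagree comparisons. It is a simplified sketch under my own assumptions (function name, starting values, clamping and the tiny synthetic dataset are all hypothetical), not how Splink implements it:

```python
def em_fellegi_sunter(patterns, n_iter=50, lam=0.1, m_init=0.9, u_init=0.1):
    """Estimate (lam, m, u) from unlabelled agreement patterns.

    patterns: list of tuples of 0/1, one entry per compared field.
    lam: starting guess for the proportion of pairs that are matches.
    """
    eps = 1e-6  # keep probabilities away from 0 and 1 to avoid division by zero

    def clamp(x):
        return min(max(x, eps), 1 - eps)

    n = len(patterns)
    n_fields = len(patterns[0])
    m = [m_init] * n_fields
    u = [u_init] * n_fields

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match
        post = []
        for g in patterns:
            pm, pu = lam, 1 - lam
            for k, agrees in enumerate(g):
                pm *= m[k] if agrees else 1 - m[k]
                pu *= u[k] if agrees else 1 - u[k]
            post.append(pm / (pm + pu))

        # M-step: re-estimate parameters from the soft assignments
        total = sum(post)
        lam = clamp(total / n)
        for k in range(n_fields):
            agree_match = sum(p for p, g in zip(post, patterns) if g[k])
            agree_non = sum(1 - p for p, g in zip(post, patterns) if g[k])
            m[k] = clamp(agree_match / total)
            u[k] = clamp(agree_non / (n - total))

    return lam, m, u


# Tiny synthetic dataset: 20 pairs agree on everything, the rest mostly disagree
patterns = (
    [(1, 1, 1)] * 20
    + [(0, 0, 0)] * 150
    + [(1, 0, 0)] * 10
    + [(0, 1, 0)] * 10
    + [(0, 0, 1)] * 10
)
lam, m, u = em_fellegi_sunter(patterns)
print(lam, m, u)
```

The point is that the algorithm alternates between scoring pairs with the current parameters and re-estimating the parameters from those scores, so no pair ever needs a human-assigned label.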

You can find out more about the background to Splink here: https://www.robinlinacre.com/introducing_splink/

And if you’re already familiar with Splink, I cover what’s new in Splink 3 here: https://www.robinlinacre.com/splink_3/