I have contributed to the following libraries/packages, which were released alongside research papers:
activeeval implements a generic framework for pool-based active evaluation in Python. Given an unlabelled pool of test instances and a set of classifiers to evaluate, the framework coordinates the acquisition of ground truth labels in order to estimate a performance measure, such as precision or recall. It includes an implementation of our adaptive importance sampling method (Marchant & Rubinstein, 2021), which aims to minimize the asymptotic variance of performance estimates. This approach can reduce labelling costs in important scenarios; for example, significant reductions can be expected when the classes are severely imbalanced, as is the case when evaluating record linkage/entity resolution systems. Scripts for reproducing experiments in the paper are available.
exchanger is an R package that implements a Bayesian model for entity resolution as proposed in (Marchant, Rubinstein & Steorts, 2020). The model adopts the Ewens-Pitman family of infinitely-exchangeable random partitions as a prior on the linkage structure and features an improved hit-miss distortion model. The package is partially implemented in C++ and scales to data sets of modest size (~10k records) without blocking.
dblink is an Apache Spark package for distributed end-to-end Bayesian entity resolution. The package implements extensions to the blink model (Steorts, 2015) as proposed in our JCGS paper. This includes integration of probabilistic blocking and a more efficient partially-collapsed Gibbs sampler. An R interface called dblinkR is also available, in addition to scripts for reproducing experiments in the paper.
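As a rough illustration of the idea behind active evaluation (the first package in this list), the sketch below estimates recall on a severely imbalanced pool using self-normalised importance sampling. It is not the activeeval API and not the adaptive method from the paper: the proposal here is fixed in advance rather than adapted as labels arrive, and every name in it (`estimate_recall`, `query_label`, `proposal`, the synthetic pool) is a hypothetical chosen for this example.

```python
import random


def estimate_recall(query_label, preds, proposal, n_queries, rng):
    """Self-normalised importance-sampling estimate of recall.

    query_label(i) returns the ground truth label (0 or 1) of pool item i.
    proposal holds unnormalised, strictly positive sampling weights.
    """
    n = len(preds)
    total = sum(proposal)
    # Draw items to label (with replacement) according to the proposal.
    sampled = rng.choices(range(n), weights=proposal, k=n_queries)
    num = den = 0.0
    for i in sampled:
        # Importance weight: uniform target probability / proposal probability.
        weight = (1.0 / n) / (proposal[i] / total)
        y = query_label(i)  # the only step that requires a label
        num += weight * y * preds[i]  # weighted true positives
        den += weight * y             # weighted actual positives
    return num / den if den > 0 else float("nan")


# Synthetic pool (an assumption for illustration): about 1% positives.
rng = random.Random(0)
pool_size = 10_000
labels = [1 if rng.random() < 0.01 else 0 for _ in range(pool_size)]
scores = [rng.uniform(0.5, 1.0) if y else rng.uniform(0.0, 0.6) for y in labels]
preds = [1 if s > 0.55 else 0 for s in scores]

# Static proposal that favours high-scoring items; the small constant keeps
# every item reachable so the importance weights stay bounded.
proposal = [0.01 + s ** 3 for s in scores]

true_recall = sum(y * p for y, p in zip(labels, preds)) / sum(labels)
estimate = estimate_recall(lambda i: labels[i], preds, proposal, 2000, rng)
print(f"true recall = {true_recall:.3f}, estimate = {estimate:.3f}")
```

Because the proposal concentrates queries on items that are likely to be positive, far more of the label budget lands on the rare class than under uniform sampling. Since sampling is with replacement, an item drawn more than once would only need to be labelled once in practice.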
I am the lead developer of the following open-source projects: