Deduplication of Identities Using Similarity Search in a Scalable Vector Database

Shraddha Surana; Chinmay Dhawan; Digvijay Gunjal; Rajesh Tamhane

doi:10.36227/techrxiv.170905672.20134318/v1

Abstract

Identity systems increasingly use biometrics to register and uniquely identify individuals. Governments use them to identify and authenticate citizens for voter enrollment, social welfare, border control, KYC, and healthcare. It is therefore essential to ensure that people are not registered multiple times and that duplicates are discovered promptly to prevent fraud. This paper proposes a framework for building a scalable deduplication system using facial biometrics and open-source tools. It examines the use of the open-source ArcFace algorithm to create embeddings of representative facial images and the Milvus vector database to search quickly through millions of images. Such a system helps ensure that duplicate identities are not registered in an identity enrollment system. Across experiments with many parameter combinations, the authors achieve 99.79% accuracy, an F1-score of 89.44%, a false positive identification rate (FPIR) of 0.1%, and a false negative identification rate (FNIR) of 0.1%. The work describes candidate configurations, architectures, and parameters for implementing a highly scalable deduplication system, and elaborates on the effect of each parameter on accuracy and speed. Readers can use this analysis to make an informed decision on the best architecture and combination of parameters for their use case.
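
The core idea in the abstract, comparing a new face embedding against already-enrolled embeddings and flagging close matches, can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy three-dimensional vectors stand in for ArcFace embeddings, the brute-force scan stands in for an indexed Milvus search, and the function names, the 0.5 threshold, and the boolean-list inputs to the error-rate helper are all hypothetical choices made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length embedding")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def find_duplicates(probe, gallery, threshold=0.5):
    """Return (identity, score) pairs whose similarity to the probe
    meets the threshold, best match first.

    gallery maps an identity label to its enrolled embedding.
    The threshold is an assumed value, not one taken from the paper.
    """
    hits = []
    for identity, embedding in gallery.items():
        score = cosine_similarity(probe, embedding)
        if score >= threshold:
            hits.append((identity, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def identification_error_rates(nonmated_flagged, mated_found):
    """Compute (FPIR, FNIR) from per-probe outcomes.

    nonmated_flagged: one bool per probe with no enrolled mate,
        True if the search wrongly returned a candidate.
    mated_found: one bool per probe with an enrolled mate,
        True if the search returned that mate.
    """
    fpir = sum(nonmated_flagged) / len(nonmated_flagged)
    fnir = sum(not found for found in mated_found) / len(mated_found)
    return fpir, fnir


# Toy gallery of enrolled identities (hypothetical data).
gallery = {
    "alice": [1.0, 0.0, 0.0],
    "bob": [0.0, 1.0, 0.0],
}
probe = [0.9, 0.1, 0.0]  # a new enrollment attempt resembling "alice"
matches = find_duplicates(probe, gallery)
```

In this toy run only "alice" clears the threshold, so the new enrollment would be flagged as a likely duplicate of that identity. Raising the threshold lowers FPIR at the cost of a higher FNIR, which is the kind of trade-off the parameter analysis in the paper addresses.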

23 Feb 2024: Submitted to TechRxiv

27 Feb 2024: Published in TechRxiv
