loading page

Achieving k-anonymity of a large-scale database in a distributed memory environment
  • +2
  • Arsème Vadèle Djeufack Nanfack,
  • Gaël Le Mahec,
  • Gilles Dequen,
  • Vianney Kengne Tchendji,
  • Jerry Lacmou Zeutouo
Arsème Vadèle Djeufack Nanfack
MIS research unit, University of Picardie Jules Verne)

Corresponding Author:[email protected]

Author Profile
Gaël Le Mahec
MIS research unit, University of Picardie Jules Verne)
Gilles Dequen
MIS research unit, University of Picardie Jules Verne)
Vianney Kengne Tchendji
Department of Mathematics and Computer Science, University of Dschang
Jerry Lacmou Zeutouo
The joint project-team Avalon INRIA, LIP, ENS-Lyon

Abstract

The k-anonymity problem introduced by Samarati and Sweeney in 1998, guarantees that it is impossible to distinguish user data from at least (k − 1) others in the same database. The methods used to achieve k-anonymity result in an information loss as the data in the database is modified, making it less accurate through a process of generalization or micro-aggregation of the stored data. Mauger et al. proposed a O(n²)-time sequential algorithm that gives good results while minimizing the information loss using their designed metrics. However, their solution is very time-consuming and therefore not suitable for large-scale databases. In this paper, we tackle this problem using parallelism. We propose three coarse-grained parallel algorithms to solve the k-anonymity problem. The first is the straightforward algorithm that runs in O(n ²/p) execution time with O(n) communication rounds, where n is the number of lines in the database and p is the number of processors. The second runs in O(n² /p²) execution time with O²(p) communication rounds. The third runs in O(n² /plog2(p)) execution time with O(np/ (log2(p))²) communications rounds. For the latter two algorithms, we introduce the concept of data reorganization to minimize the information loss when data are partitioned. Experimental results show that for a database of size n = 10^6 , p = 2^7 , and k = 10^2 , second, and third parallel algorithms are respectively 1127.59× and 41.13× faster than the sequential algorithm while achieving anonymity with 4.03% and 2.62% information loss.
13 Jan 2024Submitted to TechRxiv
25 Jan 2024Published in TechRxiv