"False"
Skip to content
printicon
Main menu hidden.

Joint Statistical Seminar: Saloni Kwatra

Time: Tuesday 23 May 2023, 13:00–14:00
Place: MIT.A.346

Abstract: Federated Learning (FL) allows training a shared model across multiple distributed devices or organizations without centralized data collection. In FL, the data remains on the local devices or with the organizations, and only model updates are shared with a central server. The server aggregates the updates received from the different devices and sends the aggregated model back to them; this process continues until the model converges or a maximum number of iterations is reached. Although only the model parameters are shared across devices, exchanging model updates still leads to substantial privacy leakage. Hence, our work focuses on privacy-preserving FL. We proposed an FL framework with decision trees (DTs), in which each device first protects its data using Mondrian k-anonymity and then trains a decision tree classifier. The distributed devices share their tree nodes from the root to the leaves, and the aggregation server recursively merges the DTs into a single merged tree, which is then shared with the distributed devices.
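To make the per-device step concrete, the sketch below pairs a simplified Mondrian k-anonymization with local decision tree training. It is a minimal illustration under stated assumptions (numeric quasi-identifiers, mean-based generalization, a scikit-learn classifier); the function names and parameters are illustrative, not the implementation used in this work.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def mondrian_partition(X, k):
    """Recursively split the records at the median of the widest-range
    attribute, stopping when a split would leave a class smaller than k."""
    def split(idx):
        if len(idx) < 2 * k:          # cannot split without violating k
            return [idx]
        spans = X[idx].max(axis=0) - X[idx].min(axis=0)
        dim = int(np.argmax(spans))            # widest quasi-identifier
        order = idx[np.argsort(X[idx, dim])]   # sort class members on it
        mid = len(order) // 2                  # median split
        return split(order[:mid]) + split(order[mid:])
    return split(np.arange(len(X)))

def generalize(X, partitions):
    """Replace each record's quasi-identifiers with its equivalence
    class mean, so all k or more members become indistinguishable."""
    X_anon = X.astype(float).copy()
    for part in partitions:
        X_anon[part] = X[part].mean(axis=0)
    return X_anon

# Each device would run something like this locally before
# contributing its tree to the server-side merge:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                # toy quasi-identifiers
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy labels
X_anon = generalize(X, mondrian_partition(X, k=10))
local_tree = DecisionTreeClassifier(max_depth=4).fit(X_anon, y)

Raising k makes each equivalence class coarser, which strengthens anonymity but blurs the splits the tree can learn; this is the utility-privacy trade-off discussed below.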

Each device participating in the FL process aims to learn a better machine learning model than it could have learned alone. We studied an FL framework called SimFL, which leverages information from similar samples held by different distributed parties. SimFL uses Locality Sensitive Hashing (LSH) to identify similar samples across the distributed devices. The idea of LSH is that a data sample and its nearest neighbors should be hashed into the same bucket with high probability, while dissimilar samples should be hashed into the same bucket with low probability. The SimFL framework assumes that each distributed device knows the hash values (computed using LSH functions) of every device's records. We show that this assumption is a significant vulnerability in SimFL that risks the privacy of individuals: we implemented two data reconstruction attacks that estimate a user's original data from the hash values computed using LSH. We therefore proposed a modified framework in which Mondrian anonymization is applied before the locality-sensitive hash values are computed. Applying Mondrian k-anonymity before LSH improves the privacy of the participants in FL, because Mondrian k-anonymity creates equivalence classes (anonymized sets) of size k in which all quasi-identifiers are generalized to the same values. Enforcing k-anonymity places dissimilar samples in the same equivalence class, particularly when k is high (what counts as high depends on the size and distribution of the dataset). However, a high k can also worsen the predictive capability of the FL model, so there is a trade-off between utility and privacy.
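As a rough illustration of the bucketing idea (random-projection LSH is one common family; the abstract does not specify which family SimFL uses), the sketch below hashes each sample by the sign pattern of a few random projections: nearby points fall on the same side of most hyperplanes and thus share a bucket with high probability. Names and dimensions are illustrative.

import numpy as np

rng = np.random.default_rng(1)
d, n_planes = 8, 16                          # feature dim, signature length
planes = rng.standard_normal((n_planes, d))  # shared random hyperplanes

def lsh_bucket(x):
    # The sign pattern of the projections is the bucket key.
    return tuple((planes @ x > 0).astype(int))

x      = rng.standard_normal(d)
x_near = x + 0.01 * rng.standard_normal(d)   # a close neighbor
x_far  = rng.standard_normal(d)              # an unrelated sample

print(lsh_bucket(x) == lsh_bucket(x_near))   # usually True
print(lsh_bucket(x) == lsh_bucket(x_far))    # usually False

Each bucket key reveals on which side of every hyperplane a record lies, which is exactly the kind of geometric information a reconstruction attack can exploit; anonymizing with Mondrian before hashing coarsens that information.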

This talk is based on joint work with Vicenç Torra.

Event type: Seminar
Speaker
Saloni Kwatra
Doctoral student
Contact
Alp Yurtsever