Privacy for complex data

Research project

We are developing new data privacy protections built directly into AI, data-driven models and decision-making systems. This makes it possible to use the data for research, analysis or decision-making while ensuring that the privacy of individuals and organisations is fully protected.

Decision-making systems and data-driven models are important tools for researchers and decision-makers. However, they need to be trained continually on high-quality data, which is not always available and may reveal sensitive information. Today, data protection mechanisms for complex data are very limited. There are some solutions for dynamic databases and static graphs, but none for data with complex relationships between objects, dynamic graphs, or measurement data. We are now developing methods to publish anonymised versions of such complex data.

Head of project

Vicenç Torra Professor

+46 90 786 59 48

Project overview

Project period:

2023-04-19 – 2025-05-22

Participating departments and units at Umeå University

Department of Computing Science, Faculty of Science and Technology

External funding

Swedish Research Council

Project description

A large number of data protection mechanisms have been developed for standard databases, commonly known as SQL databases, which consist of one or more tables and have data records described in terms of variables or attributes.

There are also protection mechanisms for building machine learning and statistical models from the data, as well as masking methods for data publication, so that researchers can access an anonymised version of the original data. The latter is important for researchers in machine learning and data science. They need access to the data to explore it and decide which models are best suited. They also need to test the models with different parameters to determine which configuration is best, not only in terms of privacy constraints but also, for example, accuracy, transparency and explainability.

Multiple versions are a risk

Things become more difficult when data contains temporal elements. Releasing multiple versions of the data can lead to disclosure, because intruders can use one version to attack another. In particular, multiple anonymisations of the same data can provide clues to the original information.
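
To make this risk concrete, here is a minimal Python sketch of an intersection attack. It assumes that an intruder can link the same record across two releases and that each release generalises an age into an interval. The attribute, the interval widths and the helper names `generalise` and `intersect` are illustrative assumptions, not methods taken from the project.

```python
def generalise(value, width, offset=0):
    """Return the half-open interval [lo, hi) of the given width that contains value."""
    lo = ((value - offset) // width) * width + offset
    return (lo, lo + width)


def intersect(a, b):
    """Intersect two half-open intervals; return None if they do not overlap."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


true_age = 37  # hypothetical original value, unknown to the intruder

# Two anonymised versions of the same record, generalised on different grids.
release_1 = generalise(true_age, width=10)            # (30, 40)
release_2 = generalise(true_age, width=10, offset=5)  # (35, 45)

# An intruder who sees both versions narrows the range from 10 years to 5.
combined = intersect(release_1, release_2)            # (35, 40)
print(release_1, release_2, combined)
```

Each release on its own hides the age within ten years, but together they leave only five candidate values, which is why releases cannot be protected independently of one another.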

A further difficulty for data privacy arises when there are relationships between the objects we protect. A simple case is when records in a database are correlated, for example because they refer to the same person or to people in the same household.

When things get complex

Complex data, which involve several different variables, are usually stored in NoSQL databases and often include both of these components: temporal elements and relationships between objects. Graph data is a typical example of complex data. Social networks can be represented by what are known as labelled graphs, where nodes represent people and companies, and edges represent relationships between them. Labels represent additional information related to nodes and relationships, such as "friends" or "interests".
We can usually derive information about a node (e.g. a person) from information about its relationships, for example a person's political orientation from data about neighbouring nodes (people and companies).
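
The following Python sketch illustrates this kind of inference on a tiny labelled graph. The node names, the attribute values and the majority-vote rule in `infer_from_neighbours` are hypothetical choices made for illustration. A real intruder could use far more sophisticated inference.

```python
from collections import Counter

# Hypothetical social graph: each node maps to the set of its neighbours.
edges = {
    "ana": {"ben", "cleo", "dev"},
    "ben": {"ana"},
    "cleo": {"ana"},
    "dev": {"ana"},
}

# Attribute values that are publicly visible. The value for "ana" is withheld.
orientation = {"ben": "X", "cleo": "X", "dev": "Y"}


def infer_from_neighbours(node, edges, labels):
    """Guess a node's hidden label as the most common label among its neighbours."""
    counts = Counter(labels[n] for n in sorted(edges[node]) if n in labels)
    if not counts:
        return None  # no labelled neighbours, nothing to infer
    return counts.most_common(1)[0][0]


guess = infer_from_neighbours("ana", edges, orientation)
print(guess)  # "X": two of ana's three neighbours carry that label
```

Removing the sensitive attribute from one node is therefore not enough: the edges and the neighbours' labels also have to be considered when assessing disclosure risk.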

Time as an aspect

Dynamic graphs are graphs that change over time, and this time dimension is a further challenge. Another example of complex data (which can include multiple variables, relationships or hierarchies) is measurement data from, for example, a power grid. Electricity grids are represented by a hierarchical structure, whose levels can be, for example, cities, regions and countries. Grid data has a time dimension because information from households and industries is represented as time series. In addition, the information at the different levels of the hierarchy, i.e. the aggregations, must be consistent. Aggregations are summaries of data at a higher level, such as the number of cancer cases in a city, region or province.
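
The consistency requirement can be illustrated with a small Python sketch. It assumes, purely for illustration, that values are protected by rounding to the nearest multiple of 5. The household readings, the helper `round_to` and the bottom-up repair are hypothetical and are not the project's method.

```python
def round_to(value, base=5):
    """Mask a value by rounding it to the nearest multiple of base."""
    return base * round(value / base)


# Hypothetical consumption readings: household -> time series over two time steps.
household_kwh = {"h1": [12, 7], "h2": [12, 8], "h3": [12, 9]}
n_steps = 2

# Protect each household series on its own.
protected = {h: [round_to(v) for v in series] for h, series in household_kwh.items()}

# True city-level totals per time step, then protected independently of the leaves.
true_totals = [sum(s[t] for s in household_kwh.values()) for t in range(n_steps)]
independent_totals = [round_to(v) for v in true_totals]

# Totals recomputed from the protected household series (bottom-up aggregation).
bottom_up_totals = [sum(s[t] for s in protected.values()) for t in range(n_steps)]

consistent = [a == b for a, b in zip(independent_totals, bottom_up_totals)]
print(independent_totals, bottom_up_totals, consistent)
```

At the first time step the independently protected total (35) disagrees with the sum of the protected households (30). Publishing the bottom-up totals restores consistency, but a protection method still has to check that the aggregates themselves do not disclose too much.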

Objectives and aims

Current data protection mechanisms for complex data are very limited. There are partial solutions for dynamic databases and for static graphs. However, there are no solutions for data with complex relationships between objects (including the temporal component), nor are there effective privacy mechanisms for dynamic graphs and measurement data.

Therefore, the research group at Umeå University will develop methods to provide anonymised versions of complex data (e.g. for open access). In particular, the group focuses on cases where interactions between objects, time aspects and strong relationships between the objects to be protected must all be taken into account. The main focus will be on dynamic graphs and grid data. This will enable the development of privacy-preserving machine learning models that comply with appropriate privacy models. The goal is to openly publish data that can be used to build data-driven models while protecting privacy. It is about balancing openness and transparency with respect for the privacy of individuals and organisations.

Partial objectives

Understanding disclosure risk when dealing with complex data
When we need to release multiple versions of a dataset, an intruder can exploit the information in one version to attack other versions. Hence, we need privacy models that take this time dimension into account. In addition, when an object is related to a set of other objects, protecting those other objects independently is not sufficient to protect it, because conclusions about its characteristics can still be drawn from them. Data protection models and disclosure risk measures must take these relationships into account.
Developing data protection methods for temporal data
The temporal dimension is becoming increasingly important as the databases held by organisations and businesses grow. To build data-driven models from these data, we need efficient algorithms that implement privacy models. We will focus on dynamic graph data, as graphs can be used to model a wide range of problems. We will also look at measurement data, where time series are one of the fundamental components.
Data protection mechanisms must be resilient to transparency attacks
That is, protection should not rely on hiding how the data has been protected. Instead, we expect data to be published together with information on how it has been processed. To provide the strongest privacy guarantees, the published data must be resilient to attacks that use this additional information.
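
As an illustration of a transparency attack, the following Python sketch assumes that one attribute was masked with rank swapping, a masking method in which each value is swapped with another value whose rank lies within a published window. The choice of rank swapping, the data, the window size and the function name `candidate_records` are assumptions made for this example only, and all values are assumed to be distinct.

```python
def candidate_records(published, known_value, window):
    """Return the indices of records that may belong to the target.

    The intruder knows the target's original value and the published swap
    window. Rank swapping only permutes values, so the known value still
    appears in the release, and the target's record must now hold a value
    whose rank lies within the window around it.
    """
    # Record indices ordered by the rank of their published value.
    order = sorted(range(len(published)), key=lambda i: published[i])
    values = [published[i] for i in order]
    r = values.index(known_value)  # rank of the known original value
    lo, hi = max(0, r - window), min(len(values) - 1, r + window)
    # Conservative superset: keep every record inside the rank window.
    return {order[k] for k in range(lo, hi + 1)}


# Hypothetical published attribute, indexed by record number.
published = [50, 10, 40, 20, 30, 60, 70, 80]

candidates = candidate_records(published, known_value=30, window=1)
print(candidates)  # 3 of the 8 records remain plausible
```

Knowing the method and its parameter lets the intruder discard most records before any further linkage. A mechanism that is resilient to transparency attacks has to keep its guarantees even when this information is public.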

Latest update: 2024-03-11