Semantic Parsing of Multimodal Data

Research project

Semantic parsing is an important technique in AI. The project studies algorithms for translating combined media, including text, image, and sound, into mathematical representations that are easier for computers to understand and that make it easier to apply different types of analysis. The goal is partly a theory of graph-based computation adapted for multimodal parsing, and partly graph-based models for representing data, together with a new family of algorithms that work on these models.

The project is financed by the Swedish Research Council.

Head of project

Johanna Björklund Professor

Email

+46 90 786 79 27

Project overview

Project period:

2021-01-01 – 2024-12-31

Participating departments and units at Umeå University

Department of Computing Science

Research area

Computing science

External funding

Swedish Research Council

Project description

A semantic parser is an algorithm that translates unstructured text, typically single sentences, into a formal representation that is easier for computers to understand and work with. We are interested in the case where the semantic representation is a graph, in other words, a network where the nodes represent objects and the edges represent relations between objects. When we say that the parsing is multimodal, we mean that the input is a combination of different types of media. For example, one may want to parse a video that consists of a combination of image frames, audio, and subtitles. Previous work on multimodal parsing focuses on translating media objects to purely numerical representations. The advantage is that it is relatively easy to train such parsers from data, but the downside is that it is difficult to analyse how the trained parser functions internally, and it is more or less impossible to manually correct the parser’s behaviour.
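To make the idea of a graph-based semantic representation concrete, the following is a minimal sketch in Python. It is not the project's data model: the class names `Node` and `SemanticGraph`, the modality labels, the relation names `agent` and `patient`, and the example sentence are all assumptions chosen for illustration. The graph is built by hand here, where a parser would construct it from the input.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """An object in the graph, tagged with the modality it came from (hypothetical)."""
    name: str
    modality: str  # illustrative labels such as "text", "image", "audio"


@dataclass
class SemanticGraph:
    """A directed, edge-labelled graph: nodes are objects, edges are relations."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (source, relation, target) triples

    def add_relation(self, source: Node, relation: str, target: Node) -> None:
        # Adding an edge also registers both of its endpoints as nodes.
        self.nodes.update((source, target))
        self.edges.add((source, relation, target))

    def relations_from(self, source: Node) -> list:
        """Return sorted (relation, target name) pairs for edges leaving source."""
        return sorted(
            (rel, tgt.name) for src, rel, tgt in self.edges if src == source
        )


# Hand-built graph for a hypothetical video frame with the subtitle
# "The dog chases the ball": the event comes from the text, the objects
# from the image.
dog = Node("dog", "image")
ball = Node("ball", "image")
chase = Node("chase", "text")

graph = SemanticGraph()
graph.add_relation(chase, "agent", dog)
graph.add_relation(chase, "patient", ball)

print(graph.relations_from(chase))
```

Unlike a purely numerical representation, a structure of this kind can be inspected and corrected by hand, for instance by removing or relabelling a single edge.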

This project contributes to the research area in three ways.

(1) We develop algorithms for semantic parsing that result in graphs. There is previous work in this direction, but the novelty is that we let the parsers use a more complex type of memory, while breaking down the training and translation processes into simpler steps that give us greater control.

(2) We use the new type of states to integrate multimodal information, so that we can work with more complex media types than plain text.

(3) We develop optimisation techniques to keep the run times low, so that the algorithms are practically applicable.

Semantic parsing of multimodal data is one of the cornerstones of artificial intelligence and, as such, has many application areas. First and foremost, it is key to automating workflows in media, e.g., to search video banks and answer queries about the content, and to automatically edit trailers of a video for different geographic regions. In the field of data mining, semantic parsing adds value because it allows us to extract knowledge graphs from unstructured data, e.g., from video clips on YouTube. Semantic parsing is also useful in robotics, to map words in a natural-language command to concrete objects and actions. Finally, it is of inherent value in machine learning, because it allows us to transfer knowledge between different media domains.

The outcome of the project is a mathematical theory of computation, tailored for multimodal parsing. This consists of graph-based data representations, together with computation models that operate on such representations. The project pushes the boundaries of how much and what kind of structure can be managed efficiently within the framework of supervised machine learning. We aim to advance the state of the art both in unimodal (i.e., textual) semantic parsing and in the broader field of multimodal semantic parsing.

The project will take place over a period of four years and is conducted by the applicant together with two doctoral students and research colleagues in Great Britain, Germany, and Italy.

Latest update: 2021-01-19