Multimodal Semantic Analysis: Making Sense of Language in Context

PhD project Publishing industry needs automated workflows for extracting useful, meaningful and accurate information from media sources combining e.g. text, audio, images, and video. In particular, the analysis, classification, and indexing of video requires algorithms that fuse data from multiple sources, such as face and object recognition, speech recognition or scene classification. This project develops techniques to meet these challenges and evaluate them in real-world applications.

Arezoo Hatefi Ghahfarrokhi is a PhD student within the Industrial Doctoral School at Umeå University.

Project overview

Project period:

2019-03-01 – 2023-12-31

Participating departments and units at Umeå University

Department of Computing Science, Faculty of Science and Technology

Research area

Computing science

Project members

Frank Drewes Professor

Email

+46 90 786 97 90

Johanna Björklund Associate professor

Email

+46 90 786 79 27

Project description

Contextual programmatic advertising is the automatic placement of advertisements on web pages, depending on these web pages’ contents. For programmatic advertising and similar applications, media objects must be analyzed with respect to their meaning, so that advertisements can be placed in a meaningful way. Similar semantic information is also vital to automatic recommendation, where the goal is to recommend material with a related content.

Depending on the type of media object at hand, valuable analysis methods include text and image analysis, object detection, scene classification, and speech recognition. These are all powerful techniques in their own right, but for successful classification it is necessary to fuse their results, because media objects usually consist of a mixture of different types of content. This is most obvious for films which provide audio, video, subtitles, and perhaps IMDB metadata about author, actors, publication year, and so on. However, the same holds, e.g., for news articles consisting of text, images, charts, and the like.

Thus, results from multiple analysis sources must be combined into a representation that captures the combined information in a way that can be stored in an easily accessible manner. The latter is especially important because the analysis is resource demanding and cannot be made on demand: if a user clicks a button “Show me related videos” there is no time for analyzing millions of videos as the user expects a more or less immediate answer.

Thus, the purpose of this research is to be able make a semantic analysis of large amounts of media content and store the results in a way that, later on and on demand, can be used to categorize content, search it, or answer questions about it that are not known at the time of analysis. In all of this, machine learning is a key ingredient, because algorithms have to be trained on real data sets.

Latest update: 2019-12-17