Introduction to Multimodal Models


Authors:

  • Simon Enkido Bethdavid
  • Helena Björnesjö
  • Tove Gustavi
  • Hanna Lilja
  • Magnus Rosell
  • Johan Sabel
  • Edward Tjörnhammar
  • Sebastian Öberg

Publish date: 2024-05-21

Report number: FOI-R--5505--SE

Pages: 52

Written in: English


Keywords:

  • artificial intelligence
  • machine learning
  • deep learning
  • deep neural networks
  • multimodal models
  • language models


Abstract:

During the last decade, there has been an extraordinary development of innovative artificial neural network models. Most of these models were constructed to handle only one modality, where a modality can be thought of as a channel for communication or a type of data, such as text or images. In the last few years, however, machine learning models based on the novel transformer architecture have produced impressive results on tasks that require two or more modalities to be processed jointly. These multimodal capabilities make the models better suited to the variety of problems that arise in our multimodal world.

The most well-known multimodal models so far are those that combine text and images, for instance by generating images from text prompts or by answering questions about images. Similar advances are being made for models that combine text and video. Other models combine text and sound, for music generation or for text-to-speech and speech-to-text conversion. In addition, there are models able to combine more than two modalities, laying the foundation for new solutions to complicated problems in fields such as data fusion and robotics. For example, while industrial robots work well in controlled environments, a multi-purpose robot in an uncontrolled environment must be able to perform agile task and motion planning based on input from a variety of sensors. This ability can be seen in early work on multimodal models for robotics.

This report provides an overview of recent developments in the field of multimodal neural network models. A selection of multimodal models developed in recent years is presented. The focus is on models that process media data, where media data is understood as data primarily intended for human communication, such as text, images, sound, and video.
Although the multimodal models in use today have limitations, their capacity for automatic multimodal reasoning is at times so impressive that we must ask ourselves in what ways multimodal machine learning models may come to affect our lives in the years to come.