Meta is working on an AI that tries to perceive the world like humans

10 May 2023


Learning across a range of modalities, Meta’s open-source ImageBind aims to ‘open the floodgates’ to researchers developing new advanced AI systems.

Meta is open sourcing a new AI model that combines different types of data to process information in a way the company says is akin to how humans perceive the world.

ImageBind, revealed in a research paper yesterday (9 May) and described on Meta’s blog, learns across the six modalities of text, audio, visual, movement, thermal and depth data.

When combined with appropriate hardware, Meta believes the ImageBind AI model could bring machines “one step closer to humans’ ability to learn simultaneously, holistically and directly from many different forms of information – without the need for explicit supervision”.

“ImageBind equips machines with a holistic understanding that connects objects in a photo with how they will sound, their 3D shape, how warm or cold they are, and how they move,” Meta wrote in the blog.

This means it can learn not only from text, visuals and audio but also from sensors that record depth in 3D, thermal sensors that detect infrared radiation, and inertial measurement units (IMUs) that calculate an object’s motion and position.

“Future possibilities include more accurate ways to recognise, connect and moderate content, and to boost creative design, such as generating richer media more seamlessly and creating wider multimodal search functions,” the blogpost added.

ImageBind could also be used, for example, in Make-A-Scene – Meta’s text-to-image AI tool unveiled last summer – to create images from audio, such as an image based on the sounds of a rainforest or a bustling market.

It could have similar uses in Make-A-Video, Meta’s text-to-video AI tool launched in September.

While it is only a research project for now, Meta hopes the AI model will “open the floodgates for researchers” to develop new systems, such as combining 3D and IMU sensors to design immersive, virtual worlds.

“ImageBind could also provide a rich way to explore memories – searching for pictures, videos, audio files or text messages using a combination of text, audio and image,” the company said.

Meanwhile, OpenAI – the ChatGPT creator that triggered the ongoing AI race – is developing a new tool to help us understand how language models work and identify which parts of the model are responsible for which behaviours.

Google-backed rival Anthropic, on the other hand, has been trying to stand out as the AI company that does safety best, with its concept of ‘constitutional AI’, which uses the chatbot itself to moderate content (with human intervention) instead of relying solely on human moderators.


Vish Gain is a journalist with Silicon Republic