Glossary

Multimodal

A multimodal AI understands and produces more than text: it can look at photos, read documents, listen, talk, and often generate images. All major assistants (ChatGPT, Claude, Gemini) are multimodal today.

“Modal” is just academic jargon for “kind of input or output”: text, images, audio, video. A multimodal model handles several of them, which changed what everyday AI use looks like. The keyboard is no longer the only door in.

Concretely: snap a photo of the boiler’s error display and ask “what does E04 mean and can I fix it myself?” Photograph a handwritten recipe from your nonna and have it typed up and halved for two people. Point the camera at a form in a foreign language and ask for a translation. Upload a 30-page PDF and ask where the cancellation clause is. In each case, multimodality spares you typing you would never have bothered to do.

A caveat that saves disappointment: vision is good, not perfect. Models misread small print, handwriting, and complex tables more often than they misread clean text, so for numbers that matter, double-check against the original.

Where you’ll meet this

You’ll meet it as the paperclip or photo icon in ChatGPT, Claude, and Gemini, and as camera input in their mobile apps. Voice mode and image generation are multimodality too. Capabilities differ per app, so our AI chooser matches them to what you actually need.

Put it to work

Free AI for Beginners

Brand new to AI and not sure where to start? Answer a few simple questions and we'll point you to the one assistant that fits what you want to do: honest, with sources, no hype. Start here.
