Glossary
Multimodal
A multimodal AI understands and produces more than text: it can look at photos, read documents, listen, talk, and often generate images. All major assistants (ChatGPT, Claude, Gemini) are multimodal today.
“Modal” comes from “modality,” academic jargon for a kind of input or output: text, images, audio, video. A multimodal model handles several of them, which changed what everyday AI use looks like. The keyboard is no longer the only door in.
Concretely: snap a photo of the boiler’s error display and ask “what does E04 mean and can I fix it myself?” Photograph a handwritten recipe from your nonna and have it typed up and halved for two people. Point the camera at a form in a foreign language and ask for a translation. Upload a 30-page PDF and ask where the cancellation clause is. In each case, multimodality spares you typing you would never have done.
A caveat that saves disappointment: vision is good, not perfect. Models misread small print, handwriting, and complex tables more often than they misread clean text, so for numbers that matter, double-check against the original.
Where you’ll meet this
Look for the paperclip or photo icon in ChatGPT, Claude, and Gemini, and for camera input in their mobile apps. Voice mode and image generation are multimodality too. Capabilities differ per app, so our AI chooser matches them to what you actually need.