OpenAI on Monday announced GPT-4o, a new AI model that the company says is one step closer to “more natural human-computer interaction.” The new model accepts any combination of text, audio and images as input and can produce output in all three formats. It can also recognize emotion, lets users interrupt it mid-speech, and responds at near-human speed during a conversation.
“What’s special about GPT-4o is that it’s GPT-4-level intelligence for everyone, including our free users,” Mira Murati, CTO of OpenAI, said during the livestreamed launch event. “When it comes to ease of use, it’s the first time we’ve taken a big step forward.”
During the presentation, OpenAI demonstrated GPT-4o translating live between English and Italian, helping a researcher solve a linear equation written on paper in real time, and guiding another OpenAI executive through deep-breathing exercises just by listening to his breathing.
The “o” in GPT-4o stands for “omni,” referring to the model’s multimodal capabilities. OpenAI said GPT-4o is trained on text, vision and audio, meaning all inputs and outputs are processed by the same neural network. That sets it apart from the company’s previous models, GPT-3.5 and GPT-4, which let users ask questions by speaking but then transcribed the speech to text before processing it, stripping out tone and emotion and slowing down the interaction.
OpenAI is making the new model available to everyone, including free ChatGPT users, over the next few weeks, and is also releasing a desktop version of ChatGPT initially for Mac, which paid users can use starting today.
OpenAI’s announcement comes a day before Google I/O, the company’s annual developer conference. Shortly after OpenAI unveiled GPT-4o, Google teased a version of Gemini, its own AI chatbot, with similar capabilities.