OpenAI introduces its latest multimodal AI model GPT-4o

The model is being made available to ChatGPT users for free, with advanced features such as GPT-4-level intelligence, responses informed by web browsing, and file uploads. Free usage is capped, however; once the limit is reached, users are switched back to GPT-3.5.

OpenAI has announced the launch of GPT-4o, a cutting-edge multimodal AI model that integrates text, images, and audio in a single system. The latest addition to OpenAI’s suite of AI models sets a new benchmark for AI capabilities, offering superior performance on non-English languages and vision tasks while matching GPT-4 Turbo on English text and coding tasks.

Multimodal Approach Enhances Accuracy and Responsiveness

GPT-4o’s ability to handle multiple data types simultaneously, including text, images, and audio, represents a significant advance in AI technology. By integrating these modalities, GPT-4o can deliver more accurate and responsive human-computer interactions.

Early Access Available Through Azure OpenAI Studio

Developers can now try GPT-4o in the Azure OpenAI Studio early access playground, a preview that lets them explore the model’s capabilities and potential applications.
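For illustration, here is a minimal sketch of calling a GPT-4o deployment through the Azure OpenAI Python SDK. The endpoint, API version, and deployment name are placeholders, not values from the announcement; substitute those from your own Azure OpenAI resource.

```python
# Minimal sketch: calling a GPT-4o deployment via the Azure OpenAI SDK.
# Endpoint, API version, and deployment name below are illustrative
# placeholders -- use the values from your own Azure OpenAI resource.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # assumption; use the version your resource supports
)

response = client.chat.completions.create(
    model="gpt-4o",  # the name of your deployment, not the base model
    messages=[{"role": "user", "content": "Summarize GPT-4o in one sentence."}],
)
print(response.choices[0].message.content)
```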

Matching GPT-4 Turbo in English Tasks

GPT-4o demonstrates impressive performance, matching GPT-4 Turbo on English text and coding tasks while outperforming it on non-English languages and vision tasks, a versatility that suits a wide range of applications.

Key Features

GPT-4o is a multimodal AI model that can process and generate text, images, and audio simultaneously. This allows for more natural and efficient human-computer interactions.

It matches the performance of GPT-4 Turbo on English text and coding tasks, while significantly outperforming it on non-English languages and vision tasks. This makes GPT-4o more capable and accessible globally.

GPT-4o responds to audio inputs in as little as 232 milliseconds, with an average of around 320 milliseconds, comparable to human response times in conversation. This is a major improvement over previous models, which had latencies of several seconds.

The model supports over 50 languages and has enhanced multilingual capabilities across its various functions. This expands its reach to a wider audience worldwide.

GPT-4o can understand and discuss images, enabling tasks such as translating menus, explaining sports rules, and analyzing data and charts in real time. This visual understanding is a key new capability.
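As a rough illustration of the vision capability, the sketch below sends an image to GPT-4o through the OpenAI Python SDK. The image URL and prompt are hypothetical; any publicly reachable image would work.

```python
# Minimal sketch: asking GPT-4o to describe an image via the OpenAI API.
# The image URL is a placeholder chosen for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            # A message can mix text and image parts for vision tasks.
            "content": [
                {"type": "text", "text": "Translate the menu in this photo to English."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/menu.jpg"},  # placeholder
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```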

OpenAI has also expanded its offerings with a ChatGPT desktop app for Mac, featuring a revamped UI for improved interaction.

OpenAI is rolling out GPT-4o’s features gradually: text and image capabilities are already available in ChatGPT, while audio and video functionality will be released to developers and partners in a controlled manner to ensure safety.
