Multimodal
Multimodality refers to a model's ability to understand and generate content across multiple modalities, such as text, images, audio, and video.
Vision, Images, Speech
All recipes (17)
Evaluating Grounded Spatial Reasoning with GPT-5.5
Evals, Images, Reasoning, Vision
May 11, 2026

Build Live Translation Apps with gpt-realtime-translate
Audio, Speech
May 7, 2026

GPT Image Generation Models Prompting Guide
Images, Vision
Apr 21, 2026

Getting the Most out of GPT-5.4 for Vision and Document Understanding
Images, Vision
Mar 6, 2026

Realtime Prompting Guide
Audio, Responses, Speech
Feb 25, 2026

Image Evals for Image Generation and Editing Use Cases
Evals, Images, Vision
Jan 29, 2026

Realtime Eval Guide
Audio, Evals, Responses, Speech
Jan 25, 2026

Gpt-image-1.5 Prompting Guide
Images, Vision
Dec 16, 2025

Transcribing User Audio with a Separate Realtime Request
Audio, Speech
Nov 20, 2025

Generate images with high input fidelity
Images
Jul 17, 2025

MCP-Powered Agentic Voice Framework
Agents SDK, Functions, Speech
Jun 17, 2025

Image Understanding with RAG
Images, Responses, Vision
May 16, 2025

Context Summarization with Realtime API
Audio, Speech, Tiktoken
May 10, 2025

Comparing Speech-to-Text Methods with the OpenAI API
Agents SDK, Audio, Speech
Apr 29, 2025

Generate images with GPT Image
Images
Apr 23, 2025

Multi-Language One-Way Translation with the Realtime API
Audio, Speech
Mar 24, 2025

Vision Fine-tuning on GPT-4o for Visual Question Answering
Completions, Fine-tuning, Vision
Nov 1, 2024