Multimodal

Multimodality refers to a model's ability to understand and generate content across multiple modalities, such as text, images, audio, and video.

Vision, Images, Speech

All recipes (25)

Image Evals for Image Generation and Editing Use Cases
Evals, Images, Vision
Jan 29, 2026
Realtime Eval Guide
Audio, Evals, Responses, Speech
Jan 25, 2026
Gpt-image-1.5 Prompting Guide
Images, Vision
Dec 16, 2025
Transcribing User Audio with a Separate Realtime Request
Audio, Speech
Nov 20, 2025
Realtime Prompting Guide
Audio, Responses, Speech
Aug 28, 2025
Generate images with high input fidelity
Images
Jul 17, 2025
Using Evals API on Image Inputs
Evals, Images
Jul 15, 2025
Practical guide to data-intensive apps with the Realtime API
Audio, Speech
May 29, 2025
Image Understanding with RAG
Images, Responses, Vision
May 16, 2025
Context Summarization with Realtime API
Audio, Speech, Tiktoken
May 10, 2025
ElatoAI - Realtime Speech AI Agents for ESP32 on Arduino
Audio, Speech
May 1, 2025
Comparing Speech-to-Text Methods with the OpenAI API
Agents SDK, Audio, Speech
Apr 29, 2025
Generate images with GPT Image
Images
Apr 23, 2025
Processing and narrating a video with GPT-4.1-mini's visual capabilities and GPT-4o TTS API
Responses, Speech, Vision
Apr 22, 2025
Building a Voice Assistant with the Agents SDK
Audio, Responses, Speech
Mar 27, 2025
Multi-Language One-Way Translation with the Realtime API
Audio, Speech
Mar 24, 2025
Using GPT4 Vision with Function Calling
Chat, Vision
Dec 13, 2024
Optimizing Retrieval-Augmented Generation using GPT-4o Vision Modality
Completions, Vision
Nov 12, 2024
Vision Fine-tuning on GPT-4o for Visual Question Answering
Completions, Fine-tuning, Vision
Nov 1, 2024
How to parse PDF docs for RAG
Embeddings, Vision
Sep 29, 2024
How to combine GPT4o mini with RAG to create a clothing matchmaker app
Embeddings, Vision
Jul 18, 2024
Using GPT4o mini to tag and caption images
Embeddings, Vision
Jul 18, 2024
Introduction to GPT-4o and GPT-4o mini
Completions, Vision, Whisper
Jul 18, 2024
Data Extraction and Transformation in ELT Workflows using GPT-4o as an OCR Alternative
Completions, Vision
Jul 9, 2024
CLIP embeddings to improve multimodal RAG with GPT-4 Vision
Embeddings, Vision
Apr 10, 2024