Google announced Gemini Omni at I/O 2026 on May 19 as the company's first dedicated multimodal AI video model. For developers, the more relevant news landed in the same week: developer access via Google AI Studio is rolling out to early adopters, with broader availability through the standard google-genai SDK that already handles the text and image models in the Gemini family.
This tutorial walks through the practical steps to integrate Gemini Omni into a Node.js or Python application. We will cover authentication, the first generation request, multimodal input handling, programmatic conversational refinement, and a few practical patterns for handling responses cleanly in production code.
Prerequisites
Before you begin, make sure you have:
- Node.js 18+ or Python 3.10+ installed
- A Google AI Studio account with API access enabled
- An API key for the Gemini Omni model (request access through AI Studio if you do not already have one)
- The google-genai SDK installed for your language
Install the SDK:
# Python
pip install google-genai
# Node.js
npm install @google/genai
Set your API key as an environment variable rather than hard-coding it:
export GOOGLE_GENAI_API_KEY="your-api-key-here"
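A missing key is easier to diagnose at startup than in the middle of a request. The helper below is a minimal sketch of a fail-fast lookup, using only the standard library; the function name `load_api_key` is illustrative and not part of any SDK.

```python
import os


def load_api_key(env_var: str = "GOOGLE_GENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it before starting the application."
        )
    return key
```

Calling `load_api_key()` once when the application boots means a misconfigured deployment stops immediately with a readable message.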
Your First Generation Request
The simplest possible Gemini Omni call takes a text prompt and returns a short video clip. Here is the minimum viable code in both languages.
Python:
import os

from google import genai

client = genai.Client(api_key=os.environ["GOOGLE_GENAI_API_KEY"])

response = client.models.generate_video(
    model="gemini-omni-flash",
    contents=(
        "A locked-off close-up of steam rising from a coffee cup, "
        "warm window light, shallow depth of field. 5 seconds."
    ),
)

# Save the result to disk
with open("output.mp4", "wb") as f:
    f.write(response.video_bytes)

print(
    f"Saved {len(response.video_bytes)} bytes. "
    f"SynthID: {response.metadata.synthid_signature}"
)
Node.js:
import { GoogleGenAI } from "@google/genai";
import { writeFileSync } from "fs";

const ai = new GoogleGenAI({
  apiKey: process.env.GOOGLE_GENAI_API_KEY,
});

const response = await ai.models.generateVideo({
  model: "gemini-omni-flash",
  contents: (
    "A locked-off close-up of steam rising from a coffee cup, " +
    "warm window light, shallow depth of field. 5 seconds."
  ),
});

// Save the result to disk
writeFileSync("output.mp4", Buffer.from(response.videoBytes));

console.log(
  `Saved ${response.videoBytes.length} bytes. ` +
    `SynthID: ${response.metadata.synthidSignature}`
);
Run that, and within a few seconds you should have an output.mp4 file in your working directory. The exact response time depends on the model variant and current API load, but typical generation times for Omni Flash are 3 to 8 seconds for a 5-second clip.
Sending Multimodal Input
The text-only example above does not use Gemini Omni's most interesting feature. The model accepts text, images, audio, and short video clips in the same request. To send a reference image and an audio clip together with a text instruction:
Python:
import os
from pathlib import Path

from google import genai

# Pass the key explicitly so this snippet uses the same variable as the setup step
client = genai.Client(api_key=os.environ["GOOGLE_GENAI_API_KEY"])

# The Python examples pass media as raw bytes
reference_image = Path("reference.jpg").read_bytes()
reference_audio = Path("mood-clip.mp3").read_bytes()

response = client.models.generate_video(
    model="gemini-omni-flash",
    contents=[
        {
            "inline_data": {
                "mime_type": "image/jpeg",
                "data": reference_image,
            },
        },
        {
            "inline_data": {
                "mime_type": "audio/mpeg",
                "data": reference_audio,
            },
        },
        {
            "text": (
                "Use this room. Add gentle movement. "
                "Camera dollies forward slowly. 6 seconds."
            ),
        },
    ],
)
Node.js:
import { readFileSync } from "fs";

// Reuses the `ai` client created in the first Node.js example.
// The Node.js examples pass media as base64 strings.
const referenceImage = readFileSync("reference.jpg").toString("base64");
const referenceAudio = readFileSync("mood-clip.mp3").toString("base64");

const response = await ai.models.generateVideo({
  model: "gemini-omni-flash",
  contents: [
    {
      inlineData: {
        mimeType: "image/jpeg",
        data: referenceImage,
      },
    },
    {
      inlineData: {
        mimeType: "audio/mpeg",
        data: referenceAudio,
      },
    },
    {
      text: (
        "Use this room. Add gentle movement. " +
        "Camera dollies forward slowly. 6 seconds."
      ),
    },
  ],
});
A few practical notes on the input format:
- Binary input (image or audio) is passed as a base64 string in the Node.js example and as raw bytes in the Python example
- Audio reference is optional — if you include it, the model attempts to match visual pacing to audio energy
- Reference clips (video input) follow the same pattern with mime type video/mp4
- The text instruction is processed alongside the other inputs, not in sequence
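Building these request parts by hand gets repetitive once several files are involved. The sketch below wraps the `inline_data` and `text` shapes from the Python example in two small helpers. The names `inline_part`, `text_part`, and `MIME_BY_SUFFIX` are illustrative, and the suffix table only covers the media types mentioned in this tutorial, so treat it as an assumption rather than the model's full supported list.

```python
from pathlib import Path

# Assumed mapping, limited to the media types used in this tutorial.
MIME_BY_SUFFIX = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".mp3": "audio/mpeg",
    ".mp4": "video/mp4",
}


def inline_part(path: str) -> dict:
    """Read a media file and wrap it in the inline_data shape shown above."""
    suffix = Path(path).suffix.lower()
    mime_type = MIME_BY_SUFFIX.get(suffix)
    if mime_type is None:
        raise ValueError(f"Unsupported file type for {path!r}: {suffix or 'no suffix'}")
    return {
        "inline_data": {
            "mime_type": mime_type,
            "data": Path(path).read_bytes(),  # raw bytes, as in the Python example
        },
    }


def text_part(instruction: str) -> dict:
    """Wrap a text instruction in the text shape shown above."""
    return {"text": instruction}
```

With these in place, the request body from the Python example could be written as `contents=[inline_part("reference.jpg"), inline_part("mood-clip.mp3"), text_part("Use this room. ...")]`.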
Programmatic Conversational Refinement
The model retains the previous generation in context when you continue a session. This is what allows the conversational editing flow that the marketing emphasises. To use it programmatically, hold a session reference:
session = client.chats.create(model="gemini-omni-flash")

# First generation (reference_image is loaded as in the multimodal example)
response_1 = session.generate_video(
    contents=[
        {
            "inline_data": {
                "mime_type": "image/jpeg",
                "data": reference_image,
            },
        },
        {
            "text": "Static medium shot. Golden hour. 5 seconds.",
        },
    ],
)

# Refinement: the model knows what was generated above
response_2 = session.generate_video(
    contents=[
        {
            "text": "Slow the camera by half and warm the lighting more.",
        },
    ],
)

# Another refinement
response_3 = session.generate_video(
    contents=[
        {
            "text": "Hold the final frame for an extra second.",
        },
    ],
)
Each refinement preserves the parts of the previous output that worked. This is dramatically faster and cheaper than writing new prompts from scratch, and it is the workflow pattern that produces the best results in production code.
Handling SynthID and Response Metadata
Every generation includes metadata that you should log or persist alongside the video output. The most important field is the SynthID signature, which marks the output as machine-generated:
metadata = response.metadata
print(f"SynthID: {metadata.synthid_signature}")
print(f"Generation time: {metadata.generation_ms}ms")
print(f"Model version: {metadata.model_version}")
print(f"Frames: {metadata.frame_count}")
For production applications, store the SynthID signature with the video file. It is the proof that the content is AI-generated and is increasingly important for compliance and disclosure workflows. Several detection services already read the signature for verification purposes.
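One simple way to keep the signature with the file is a JSON sidecar written next to the video. The sketch below assumes the metadata has already been copied into a plain dictionary using the field names from the snippet above; `save_with_metadata` and the added `video_sha256` field are illustrative choices, not part of the API.

```python
import hashlib
import json
from pathlib import Path


def save_with_metadata(video_bytes: bytes, metadata: dict, out_path: str) -> Path:
    """Write the video plus a JSON sidecar holding its generation metadata."""
    video_path = Path(out_path)
    video_path.write_bytes(video_bytes)

    record = dict(metadata)
    # Hash of the stored bytes, so the sidecar can be matched to its file later
    record["video_sha256"] = hashlib.sha256(video_bytes).hexdigest()

    sidecar_path = video_path.with_suffix(".json")
    sidecar_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar_path
```

For example, `save_with_metadata(response.video_bytes, {"synthid_signature": metadata.synthid_signature, "model_version": metadata.model_version}, "output.mp4")` would write `output.mp4` and `output.json` side by side.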
Handling Errors Cleanly
The most common error in real-world usage is hitting the daily rate limit. The API returns a clear RESOURCE_EXHAUSTED status with retry information in the response headers:
import time

from google.genai.errors import ResourceExhaustedError


def generate_with_retry(prompt, max_retries=3):
    """Call the API, waiting and retrying when the rate limit is hit."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return client.models.generate_video(
                model="gemini-omni-flash",
                contents=prompt,
            )
        except ResourceExhaustedError as e:
            last_error = e
            if attempt == max_retries:
                # Out of attempts: fail now instead of sleeping again
                break
            wait_seconds = int(e.retry_after or 60)
            print(
                f"Rate limited. Waiting {wait_seconds}s "
                f"before attempt {attempt + 1} of {max_retries}."
            )
            time.sleep(wait_seconds)
    raise RuntimeError("Exceeded retry budget") from last_error
Other errors worth handling explicitly:
- INVALID_ARGUMENT — usually a malformed input (wrong mime type, prompt too long)
- FAILED_PRECONDITION — the model is in maintenance, retry in a few minutes
- UNAUTHENTICATED — your API key is missing or expired
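The statuses above fall into two groups: those worth retrying after a wait, and those that will fail again until the request or credentials change. The sketch below encodes that split and adds a capped exponential backoff with jitter for cases where the response carries no retry hint. The grouping follows the descriptions in this section; the function names and default timings are assumptions to tune for your own quota.

```python
import random

# Grouping based on the status descriptions in this section.
RETRYABLE_STATUSES = {"RESOURCE_EXHAUSTED", "FAILED_PRECONDITION"}
FATAL_STATUSES = {"INVALID_ARGUMENT", "UNAUTHENTICATED"}


def is_retryable(status: str) -> bool:
    """Return True when waiting and retrying can plausibly succeed."""
    return status in RETRYABLE_STATUSES


def backoff_delay(
    attempt: int,
    base: float = 2.0,
    cap: float = 300.0,
    jitter: float = 0.25,
    rng=random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The delay doubles each attempt, is capped at `cap`, and is then
    stretched by up to `jitter` (25% by default) so that many workers
    do not retry at the same instant.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay * (1 + jitter * rng())
```

A caller would check `is_retryable(status)` first and only then sleep for `backoff_delay(attempt)`, surfacing fatal statuses to the user immediately.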
Cost and Rate Limit Considerations
API pricing for Gemini Omni differs from the consumer subscription tiers. The API charges per generation instead of applying a monthly cap, an arrangement that is more cost-efficient for applications with variable usage patterns. Each generation is billed by output duration in seconds plus a base request fee. For high-volume applications, the per-second rate becomes more important than the base fee.
The current published rates and the daily quota limits for the free developer tier are listed on the Gemini Omni price reference, which is kept current as Google adjusts the figures. If you are building anything that generates more than a handful of clips per day, factor the API cost into your unit economics from the start.
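The billing model described above (a base request fee plus a per-second charge on output duration) is simple enough to put into a small estimator. The sketch below is one way to do that with `Decimal` to avoid floating-point drift in money values. The rates in the usage note are placeholders invented for illustration, not published prices.

```python
from decimal import Decimal


def estimate_cost(duration_seconds, base_fee, per_second_rate) -> Decimal:
    """Cost of one generation: base fee plus per-second rate times duration."""
    return Decimal(str(base_fee)) + Decimal(str(per_second_rate)) * Decimal(
        str(duration_seconds)
    )


def estimate_daily_cost(clips_per_day, duration_seconds, base_fee, per_second_rate) -> Decimal:
    """Cost of a day's worth of identical generations."""
    per_clip = estimate_cost(duration_seconds, base_fee, per_second_rate)
    return per_clip * Decimal(str(clips_per_day))
```

With placeholder rates of 0.01 per request and 0.02 per second, `estimate_cost(5, "0.01", "0.02")` gives 0.11 for a 5-second clip, and 200 such clips a day come to 22.00. Substitute the real figures from the price reference before relying on the output.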
For most developer use cases, the API access tier is the right choice over the consumer subscriptions. The trade-off is that you handle authentication, retries, and rate limiting yourself rather than relying on the Gemini app to do it for you.
A Production Pattern That Works
For production applications integrating Gemini Omni, a clean pattern that has emerged in the first week of availability is:
- Validate inputs locally before sending to the API (reject oversized files early)
- Cache reference images server-side rather than re-uploading on each request
- Use a queue-backed worker to handle generation requests asynchronously
- Persist the SynthID signature and metadata alongside the output for compliance
- Implement retry-with-backoff logic to handle rate limits gracefully
- Log generation costs per request to monitor unit economics over time
A simple Node.js worker following this pattern:
import { Queue, Worker } from "bullmq";
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({
  apiKey: process.env.GOOGLE_GENAI_API_KEY,
});

// BullMQ needs a Redis connection; adjust host and port for your environment
const connection = { host: "localhost", port: 6379 };

// Producers enqueue work with queue.add("generate", { prompt, referenceImageId })
const queue = new Queue("video-generation", { connection });

// loadCachedImage and saveOutput are application-defined helpers
new Worker(
  "video-generation",
  async (job) => {
    const { prompt, referenceImageId } = job.data;

    // Load cached reference (as a base64 string) rather than re-fetching
    const referenceImage = await loadCachedImage(referenceImageId);

    const response = await ai.models.generateVideo({
      model: "gemini-omni-flash",
      contents: [
        {
          inlineData: {
            mimeType: "image/jpeg",
            data: referenceImage,
          },
        },
        {
          text: prompt,
        },
      ],
    });

    // Persist with metadata
    await saveOutput({
      videoBytes: response.videoBytes,
      synthidSignature: response.metadata.synthidSignature,
      durationMs: response.metadata.generationMs,
      cost: response.metadata.billing.amountUsd,
    });
  },
  { connection },
);
This handles the four hardest production concerns — async processing, caching, observability, and compliance metadata — without much added complexity.
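The first two items in the pattern, local validation and server-side caching of reference images, can be sketched without touching the API at all. The Python example below stores each reference under its SHA-256 digest, so identical uploads are written once and a job only needs to carry the digest. The 10 MB limit and all function names are assumptions for illustration; check the actual input limits before choosing a threshold.

```python
import hashlib
from pathlib import Path

# Assumed limit for illustration only.
MAX_REFERENCE_BYTES = 10 * 1024 * 1024


def validate_reference(data: bytes, max_bytes: int = MAX_REFERENCE_BYTES) -> None:
    """Reject empty or oversized files before any API call is made."""
    if not data:
        raise ValueError("Reference file is empty")
    if len(data) > max_bytes:
        raise ValueError(
            f"Reference file is {len(data)} bytes; limit is {max_bytes} bytes"
        )


def cache_reference(data: bytes, cache_dir: str, max_bytes: int = MAX_REFERENCE_BYTES) -> str:
    """Store the file under its SHA-256 digest and return that digest as its ID."""
    validate_reference(data, max_bytes)
    reference_id = hashlib.sha256(data).hexdigest()
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / reference_id
    if not target.exists():  # identical content is only written once
        target.write_bytes(data)
    return reference_id


def load_cached_reference(reference_id: str, cache_dir: str) -> bytes:
    """Read a previously cached reference by its ID."""
    return (Path(cache_dir) / reference_id).read_bytes()
```

The returned digest plays the role of `referenceImageId` in the worker above: the upload handler calls `cache_reference` once, and each queued job passes only the ID.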
Wrapping Up
Gemini Omni's API is straightforward to integrate but it rewards thoughtful application design more than naive direct usage. The biggest wins come from caching reference images, using conversational refinement instead of full regeneration, and handling rate limits gracefully. These are not exotic patterns. They are the standard patterns from working with any generative AI API, applied to a model that happens to output video.
For developers exploring this for the first time, the recommended next step is to set up a test account in Google AI Studio, run the basic generation example above, and iterate from there. Results improve quickly once you start treating the model as a conversational tool rather than a one-shot generator. Most production-grade pipelines built around Gemini Omni will look similar to the worker pattern shown above: async, cached, instrumented, and resilient to rate limits.