Getting Started with the Gemini Omni API: A Node.js and Python Tutorial

Learn how to integrate Google's Gemini Omni API with Node.js and Python. This tutorial covers authentication, multimodal input, conversational refinement, error handling, and production patterns for AI video generation.

Google announced Gemini Omni at I/O 2026 on May 19 as the company's first dedicated multimodal AI video model. For developers, the more relevant news landed in the same week: developer access via Google AI Studio is rolling out to early adopters, with broader availability through the standard google-genai SDK that already handles the text and image models in the Gemini family.

This tutorial walks through the practical steps to integrate Gemini Omni into a Node.js or Python application. We will cover authentication, a first generation request, multimodal input handling, programmatic conversational refinement, and a few patterns for handling responses cleanly in production code.

Prerequisites

Before you begin, make sure you have:

Node.js 18+ or Python 3.10+ installed
A Google AI Studio account with API access enabled
An API key for the Gemini Omni model (request access through AI Studio if you do not already have one)
The google-genai SDK installed for your language

Install the SDK:

# Python
pip install google-genai

# Node.js
npm install @google/genai

Set your API key as an environment variable rather than hard-coding it:

export GOOGLE_GENAI_API_KEY="your-api-key-here"
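
A missing key is easier to diagnose at startup than at the first request. The following is a minimal sketch of such a guard, assuming the GOOGLE_GENAI_API_KEY variable name used in this tutorial; load_api_key is a hypothetical helper, not part of the SDK.

```python
import os


def load_api_key(env=None, name="GOOGLE_GENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper: `name` defaults to the variable exported above.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running.")
    return key
```

Calling load_api_key() once when the application boots keeps the failure close to its cause.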

Your First Generation Request

The simplest possible Gemini Omni call takes a text prompt and returns a short video clip. Here is the minimum viable code in both languages.

Python:

from google import genai
import os

client = genai.Client(api_key=os.environ["GOOGLE_GENAI_API_KEY"])

response = client.models.generate_video(
    model="gemini-omni-flash",
    contents=(
        "A locked-off close-up of steam rising from a coffee cup, "
        "warm window light, shallow depth of field. 5 seconds."
    ),
)

# Save the result to disk
with open("output.mp4", "wb") as f:
    f.write(response.video_bytes)

print(
    f"Saved {len(response.video_bytes)} bytes. "
    f"SynthID: {response.metadata.synthid_signature}"
)

Node.js:

import { GoogleGenAI } from "@google/genai";
import { writeFileSync } from "fs";

const ai = new GoogleGenAI({
  apiKey: process.env.GOOGLE_GENAI_API_KEY,
});

const response = await ai.models.generateVideo({
  model: "gemini-omni-flash",
  contents: (
    "A locked-off close-up of steam rising from a coffee cup, " +
    "warm window light, shallow depth of field. 5 seconds."
  ),
});

writeFileSync("output.mp4", Buffer.from(response.videoBytes));

console.log(
  `Saved ${response.videoBytes.length} bytes. ` +
    `SynthID: ${response.metadata.synthidSignature}`
);

Run that, and within a few seconds you should have an output.mp4 file in your working directory. The exact response time depends on the model variant and current API load, but typical generation times for Omni Flash are 3 to 8 seconds for a 5-second clip.

Sending Multimodal Input

The text-only example above does not use Gemini Omni's most interesting feature: the model accepts text, images, audio, and short video clips in the same request. To send a reference image and an audio clip along with a text instruction:

Python:

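import os
from pathlib import Path

from google import genai

# Pass the key explicitly so it matches the variable exported earlier
client = genai.Client(api_key=os.environ["GOOGLE_GENAI_API_KEY"])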

reference_image = Path("reference.jpg").read_bytes()
reference_audio = Path("mood-clip.mp3").read_bytes()

response = client.models.generate_video(
    model="gemini-omni-flash",
    contents=[
        {
            "inline_data": {
                "mime_type": "image/jpeg",
                "data": reference_image,
            },
        },
        {
            "inline_data": {
                "mime_type": "audio/mpeg",
                "data": reference_audio,
            },
        },
        {
            "text": (
                "Use this room. Add gentle movement. "
                "Camera dollies forward slowly. 6 seconds."
            ),
        },
    ],
)

Node.js:

import { readFileSync } from "fs";

const referenceImage = readFileSync("reference.jpg").toString("base64");
const referenceAudio = readFileSync("mood-clip.mp3").toString("base64");

const response = await ai.models.generateVideo({
  model: "gemini-omni-flash",
  contents: [
    {
      inlineData: {
        mimeType: "image/jpeg",
        data: referenceImage,
      },
    },
    {
      inlineData: {
        mimeType: "audio/mpeg",
        data: referenceAudio,
      },
    },
    {
      text: (
        "Use this room. Add gentle movement. " +
        "Camera dollies forward slowly. 6 seconds."
      ),
    },
  ],
});

A few practical notes on the input format:

Image input is base64-encoded for the Node.js SDK and raw bytes for Python
Audio reference is optional — if you include it, the model attempts to match visual pacing to audio energy
Reference clips (video input) follow the same pattern with mime type video/mp4
The text instruction is processed alongside the other inputs, not in sequence
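
The notes above can be folded into a small helper so that each call site does not repeat the dictionary boilerplate. This is a sketch, not an SDK feature: file_part and build_contents are hypothetical names, the dictionary shape simply mirrors the Python example above, and the MIME type is guessed from the file extension with the standard library.

```python
import base64
import mimetypes
from pathlib import Path


def file_part(path, as_base64=False):
    """Build one inline-data part in the dict shape used in this tutorial."""
    path = Path(path)
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None:
        raise ValueError(f"Cannot infer a MIME type for {path.name}")
    data = path.read_bytes()
    if as_base64:
        # Base64 text instead of raw bytes, for callers that need a string
        data = base64.b64encode(data).decode("ascii")
    return {"inline_data": {"mime_type": mime_type, "data": data}}


def build_contents(instruction, *paths):
    """Media parts first, then the text instruction, as in the examples above."""
    return [file_part(p) for p in paths] + [{"text": instruction}]
```

For example, build_contents("Camera dollies forward slowly. 6 seconds.", "reference.jpg", "mood-clip.mp3") yields a three-part list in the same order as the request shown earlier.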

Conversational Refinement Programmatically

The model retains the previous generation in context when you continue a session. This is what allows the conversational editing flow that the marketing emphasises. To use it programmatically in Python, hold a session reference:

session = client.chats.create(model="gemini-omni-flash")

# First generation
response_1 = session.generate_video(
    contents=[
        {
            "inline_data": {
                "mime_type": "image/jpeg",
                "data": reference_image,
            },
        },
        {
            "text": "Static medium shot. Golden hour. 5 seconds.",
        },
    ],
)

# Refinement — model knows what was generated above
response_2 = session.generate_video(
    contents=[
        {
            "text": "Slow the camera by half and warm the lighting more.",
        },
    ],
)

# Another refinement
response_3 = session.generate_video(
    contents=[
        {
            "text": "Hold the final frame for an extra second.",
        },
    ],
)

Each refinement preserves the parts of the previous output that worked, so a follow-up request only has to describe the change. This is typically faster and cheaper than writing new prompts from scratch, and it is the workflow pattern that produces the best results in production code.
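
When the refinements are known up front, the repeated calls can be driven from a list. The sketch below assumes only that the session object exposes generate_video(contents=...) as in the example above; refine is a hypothetical helper name.

```python
def refine(session, first_contents, instructions):
    """Run an initial generation, then apply text-only refinements in order.

    Assumes `session.generate_video(contents=...)` as used in this tutorial.
    Returns every response so earlier takes can still be compared or kept.
    """
    responses = [session.generate_video(contents=first_contents)]
    for instruction in instructions:
        responses.append(
            session.generate_video(contents=[{"text": instruction}])
        )
    return responses
```

Keeping every intermediate response, rather than only the last one, makes it easy to fall back to an earlier take if a refinement goes in the wrong direction.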

Handling SynthID and Response Metadata

Every generation includes metadata that you should log or persist alongside the video output. The most important field is the SynthID signature, which marks the output as machine-generated:

metadata = response.metadata

print(f"SynthID: {metadata.synthid_signature}")
print(f"Generation time: {metadata.generation_ms}ms")
print(f"Model version: {metadata.model_version}")
print(f"Frames: {metadata.frame_count}")

For production applications, store the SynthID signature with the video file. It is the proof that the content is AI-generated and is increasingly important for compliance and disclosure workflows. Several detection services already read the signature for verification purposes.
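
One simple way to keep the signature with the video is a JSON sidecar file written next to it. The sketch below takes the metadata as a plain dictionary, so it makes no assumption about the response object beyond the field names shown above; write_sidecar is a hypothetical helper.

```python
import json
from pathlib import Path


def write_sidecar(video_path, metadata):
    """Write `metadata` as JSON next to the video (output.mp4 -> output.json)."""
    sidecar_path = Path(video_path).with_suffix(".json")
    sidecar_path.write_text(
        json.dumps(metadata, indent=2, sort_keys=True),
        encoding="utf-8",
    )
    return sidecar_path
```

A call might look like write_sidecar("output.mp4", {"synthid_signature": metadata.synthid_signature, "model_version": metadata.model_version}). In a larger system the same dictionary would more likely go into a database row keyed by the video's ID.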

Handling Errors Cleanly

The most common error in real-world usage is hitting the daily rate limit. The API returns a clear RESOURCE_EXHAUSTED status with retry information in the response headers:

import time

from google.genai.errors import ResourceExhaustedError


def generate_with_retry(prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.models.generate_video(
                model="gemini-omni-flash",
                contents=prompt,
            )
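        except ResourceExhaustedError as e:
            if attempt == max_retries - 1:
                # Out of retries: fail now instead of sleeping once more
                raise RuntimeError("Exceeded retry budget") from e

            # Fall back to 60 seconds when no retry hint is available
            wait_seconds = int(e.retry_after or 60)

            print(
                f"Rate limited. Waiting {wait_seconds}s "
                f"before retry {attempt + 1}."
            )

            time.sleep(wait_seconds)

    raise RuntimeError("Exceeded retry budget")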
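
When the server gives no retry hint, a fixed 60-second wait is a blunt fallback. A common alternative is exponential backoff with jitter, sketched below as a standalone generator. The base and cap values are illustrative assumptions, not figures published for this API, and backoff_delays is a hypothetical helper.

```python
import random


def backoff_delays(max_retries=3, base_seconds=2.0, cap_seconds=60.0,
                   rng=random.random):
    """Yield one wait time per retry: exponential growth, capped, full jitter.

    The ceiling doubles on each attempt up to `cap_seconds`; the actual wait
    is a random fraction of that ceiling so concurrent clients spread out.
    """
    for attempt in range(max_retries):
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        yield rng() * ceiling
```

Inside a retry loop, each value would be passed to time.sleep in place of the fixed fallback, while a server-supplied retry hint should still take priority when present.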

Other errors worth handling explicitly:

INVALID_ARGUMENT — usually a malformed input (wrong mime type, prompt too long)
FAILED_PRECONDITION — the model is in maintenance, retry in a few minutes
UNAUTHENTICATED — your API key is missing or expired
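
The statuses above fall into two groups: those worth retrying and those that need a change on the caller's side. A lookup table keeps that decision in one place. The status strings come from this tutorial; the action names and the action_for helper are hypothetical, and the policy is one reasonable choice rather than a prescribed one.

```python
# Status strings as listed in this tutorial; actions are an illustrative policy.
ERROR_POLICY = {
    "RESOURCE_EXHAUSTED": "retry_after_wait",   # rate limit: wait, then retry
    "FAILED_PRECONDITION": "retry_after_wait",  # maintenance: retry in minutes
    "INVALID_ARGUMENT": "fix_request",          # retrying the same input fails
    "UNAUTHENTICATED": "fix_credentials",       # key is missing or expired
}


def action_for(status):
    """Return the action for a status, defaulting to 'raise' when unknown."""
    return ERROR_POLICY.get(status, "raise")
```

Defaulting unknown statuses to "raise" avoids silently retrying errors that nobody has looked at yet.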

Cost and Rate Limit Considerations

API pricing for Gemini Omni differs from the consumer subscription tiers. The API charges per generation rather than offering a monthly cap, which is more cost-efficient for applications with variable usage patterns. Each generation is billed by output duration in seconds plus a base request fee. For high-volume applications, the per-second rate becomes more important than the base fee.
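
The billing model described above reduces to a one-line formula: clips × (base fee + seconds × per-second rate). The sketch below encodes it so it can be dropped into a budgeting script. The rates in the usage note are placeholders chosen for easy arithmetic, not Google's published prices, and estimate_cost is a hypothetical helper.

```python
def estimate_cost(clips, seconds_per_clip, per_second_rate, base_fee):
    """Estimated spend: clips * (base_fee + seconds_per_clip * per_second_rate).

    Rates are supplied by the caller; look up the current published figures
    rather than relying on any value hard-coded in an example.
    """
    return clips * (base_fee + seconds_per_clip * per_second_rate)
```

With placeholder rates of 0.50 per second and a 0.25 base fee, ten 5-second clips come to 10 × (0.25 + 2.50) = 27.50. In that example the base fee accounts for under a tenth of the total, which illustrates why the per-second rate dominates as volume and clip length grow.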

The current published rates and the daily quota limits for the free developer tier are listed on the Gemini Omni price reference, which is kept current as Google adjusts the figures. If you are building anything that generates more than a handful of clips per day, factor the API cost into your unit economics from the start.

For most developer use cases, the API access tier is the right choice over the consumer subscriptions. The trade-off is that you handle authentication, retries, and rate limiting yourself rather than relying on the Gemini app to do it for you.

A Production Pattern That Works

For production applications integrating Gemini Omni, a clean pattern that has emerged in the first week of availability is:

Validate inputs locally before sending to the API (reject oversized files early)
Cache reference images server-side rather than re-uploading on each request
Use a queue-backed worker to handle generation requests asynchronously
Persist the SynthID signature and metadata alongside the output for compliance
Implement retry-with-backoff logic to handle rate limits gracefully
Log generation costs per request to monitor unit economics over time
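
The first item in the list, local validation, can be sketched in a few lines of Python. The allowed MIME types are the ones used in this tutorial, and the size limit is a placeholder rather than a documented API limit; validate_input is a hypothetical helper.

```python
# MIME types that appear in this tutorial; extend as the API documents more.
ALLOWED_MIME_TYPES = {"image/jpeg", "audio/mpeg", "video/mp4"}

# Placeholder limit (20 MiB); check the current API documentation.
MAX_BYTES = 20 * 1024 * 1024


def validate_input(data, mime_type, max_bytes=MAX_BYTES):
    """Return a list of problems; an empty list means the input can be sent."""
    problems = []
    if mime_type not in ALLOWED_MIME_TYPES:
        problems.append(f"unsupported MIME type: {mime_type}")
    if len(data) == 0:
        problems.append("file is empty")
    elif len(data) > max_bytes:
        problems.append(f"file is {len(data)} bytes; limit is {max_bytes}")
    return problems
```

Returning a list rather than raising on the first problem lets the caller report every issue to the user in a single response.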

A simple Node.js worker following this pattern:

import { Queue, Worker } from "bullmq";
import { GoogleGenAI } from "@google/genai";

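// Pass the key explicitly so it matches the variable exported earlier
const ai = new GoogleGenAI({ apiKey: process.env.GOOGLE_GENAI_API_KEY });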
const queue = new Queue("video-generation");

new Worker("video-generation", async (job) => {
  const { prompt, referenceImageId } = job.data;

  // Load cached reference rather than re-fetching
  const referenceImage = await loadCachedImage(referenceImageId);

  const response = await ai.models.generateVideo({
    model: "gemini-omni-flash",
    contents: [
      {
        inlineData: {
          mimeType: "image/jpeg",
          data: referenceImage,
        },
      },
      {
        text: prompt,
      },
    ],
  });

  // Persist with metadata
  await saveOutput({
    videoBytes: response.videoBytes,
    synthidSignature: response.metadata.synthidSignature,
    durationMs: response.metadata.generationMs,
    cost: response.metadata.billing.amountUsd,
  });
});

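This covers four common production concerns (async processing, caching, observability, and compliance metadata) without much added complexity. Note that loadCachedImage and saveOutput are application-specific helpers that you supply; they are not part of the SDK or BullMQ.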
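
The caching step in the worker depends on a lookup by image ID. One possible shape for that cache, sketched here in Python to match the other Python examples, derives the ID from a SHA-256 hash of the image bytes, so uploading the same image twice stores it once. ReferenceImageCache is a hypothetical class, and the in-memory dictionary stands in for whatever store (disk, Redis, object storage) a real deployment would use.

```python
import hashlib


class ReferenceImageCache:
    """In-memory cache keyed by the SHA-256 hex digest of the image bytes."""

    def __init__(self):
        self._images = {}

    def put(self, data):
        """Store the bytes (if new) and return their content-derived ID."""
        image_id = hashlib.sha256(data).hexdigest()
        self._images.setdefault(image_id, data)
        return image_id

    def get(self, image_id):
        """Return the cached bytes; raises KeyError for an unknown ID."""
        return self._images[image_id]
```

The ID returned by put is what a job payload would carry in place of the image itself, which keeps queue messages small.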

Wrapping Up

Gemini Omni's API is straightforward to integrate, but it rewards thoughtful application design more than naive direct usage. The biggest wins come from caching reference images, using conversational refinement instead of full regeneration, and handling rate limits gracefully. These are not exotic patterns. They are the standard patterns from working with any generative AI API, applied to a model that happens to output video.

For developers exploring this for the first time, the recommended next step is to set up a test account in Google AI Studio, run the basic generation example above, and iterate from there. The model improves quickly when you start treating it as a conversational tool rather than a one-shot generator. Most production-grade pipelines built around Gemini Omni will look similar to the worker pattern shown above: async, cached, instrumented, and resilient to rate limits.