
Integrating OpenAI API in a Next.js App: A Practical Guide


Most OpenAI API tutorials show you how to get a response. This post is about what happens after that — when you're building something that has to work reliably in production for real users.

Streaming is not optional

For any text generation longer than a sentence, streaming is not a nice-to-have — it's essential for UX. A 3-second delay before a response appears feels broken. The same content streamed token-by-token feels alive.

In Next.js with the App Router, streaming works well with Route Handlers returning a ReadableStream:

// app/api/generate/route.ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function POST(req: Request) {
  const { prompt } = await req.json();

  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        const text = chunk.choices[0]?.delta?.content ?? "";
        controller.enqueue(encoder.encode(text));
      }
      controller.close();
    },
  });

  return new Response(readable, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
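On the client side, the plain-text stream can be consumed with the standard Streams API. A minimal sketch (the route URL and the `onChunk` callback are placeholders for whatever your UI does with incoming tokens):

```typescript
// Reads a text stream chunk-by-chunk, invoking onChunk as tokens arrive,
// and resolves with the full concatenated response.
async function consumeTextStream(
  stream: ReadableStream<Uint8Array>,
  onChunk: (text: string) => void
): Promise<string> {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  let full = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    const text = decoder.decode(value, { stream: true });
    full += text;
    onChunk(text); // e.g. append to React state for live rendering
  }
  return full;
}

// Usage against the route above:
// const res = await fetch("/api/generate", {
//   method: "POST",
//   body: JSON.stringify({ prompt }),
// });
// await consumeTextStream(res.body!, (t) => setOutput((prev) => prev + t));
```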

Token management

Token costs are real and can surprise you at scale. Keep a few things in mind:

  • System prompts count. A 500-token system prompt adds to every request. Keep them concise.
  • Conversation history compounds. If you're passing previous messages for context, you're paying for them on every turn. Truncate or summarise history beyond a sensible window (8-10 turns).
  • gpt-4o-mini for most things. For classification, summarisation, and general Q&A, it's good enough and significantly cheaper. Reserve gpt-4o for cases where output quality genuinely matters.
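The history-truncation point can be sketched as a small helper: keep any system messages, drop everything older than the window. This is an illustrative implementation, not a library API, and `maxTurns` is an assumed default:

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Keeps the system prompt(s) plus the most recent `maxTurns` exchanges.
// One turn = one user message + one assistant reply, so 2 messages per turn.
function truncateHistory(messages: ChatMessage[], maxTurns = 8): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-maxTurns * 2)];
}
```

Summarising the dropped messages into a single synthetic message is a reasonable upgrade once conversations get long, at the cost of one extra model call.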

Prompt templates that survive production

Prompt engineering is mostly about constraints. Tell the model what to do, what not to do, what format to use, and what to say when it doesn't know something. Leaving any of these implicit leads to unpredictable output.

A template I use for document-grounded responses:

You are an assistant for {business_name}.

Rules:
- Answer ONLY based on the context provided below.
- If the context doesn't contain the answer, say: "I don't have that information."
- Never make up facts.
- Respond in the same language as the user's message.
- Keep responses concise unless the user asks for detail.

Context:
{retrieved_context}

User question: {user_message}
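In code, I fill this template with a plain function rather than string-replacing placeholders ad hoc; the user question goes in as a separate user-role message so the model treats it as input, not instructions. The function name is my own, not part of any SDK:

```typescript
// Builds the system prompt from the template above. The user's question is
// intentionally NOT interpolated here — it is sent as a separate user message.
function buildSystemPrompt(businessName: string, retrievedContext: string): string {
  return [
    `You are an assistant for ${businessName}.`,
    "",
    "Rules:",
    "- Answer ONLY based on the context provided below.",
    `- If the context doesn't contain the answer, say: "I don't have that information."`,
    "- Never make up facts.",
    "- Respond in the same language as the user's message.",
    "- Keep responses concise unless the user asks for detail.",
    "",
    "Context:",
    retrievedContext,
  ].join("\n");
}
```

Keeping the question out of the system prompt also makes it harder for user input to override the rules.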

The explicit "never make up facts" instruction matters. Without it, models will hallucinate confident-sounding answers. With it, they stay grounded significantly more reliably.

Error handling and fallbacks

OpenAI has downtime. Rate limits are real. Network failures happen. Your AI feature needs graceful degradation — not a blank screen or a 500 error.

Pattern I use: wrap API calls in try/catch, return a structured error response, and have the frontend show a human-readable fallback message ("Our AI assistant is temporarily unavailable — try again in a moment"). Log the error with enough context to debug, but never expose raw error messages to users.

Deployment checklist

  • API keys in environment variables, never in source code
  • Rate limiting on your API route (a simple Redis counter, or an in-memory token bucket for a single instance)
  • Input validation — max length, content filtering if needed
  • Response logging for debugging and cost monitoring
  • Timeout handling (OpenAI requests can take 10-30 seconds for complex queries)
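For the timeout item, a generic race against a deadline works with any promise-returning call (the OpenAI Node SDK also accepts a client-level `timeout` option, which is worth preferring where it fits):

```typescript
// Races a promise against a deadline; used to cap long-running AI calls
// so the request fails fast instead of hanging the route handler.
async function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("AI request timed out")), ms);
  });
  try {
    return await Promise.race([p, deadline]);
  } finally {
    clearTimeout(timer!); // avoid keeping the event loop alive
  }
}

// Usage (30s cap is an assumption, tune to your latency budget):
// const text = await withTimeout(generateCompletion(prompt), 30_000);
```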

The integration itself is straightforward. Making it production-worthy is where the engineering work lives.