Deep Dive

Performance Optimization for Generative UI

How to keep AI-powered interfaces fast: streaming strategies, bundle optimization, and rendering patterns.

Alex · 13 min read

The Performance Paradox

The paradox is simple: 300ms can feel like forever, while 1.2 seconds can feel instant. And in Generative UI this is not theoretical. I had a production case where switching from in-memory caching to streaming skeletons cut perceived load time by 3× — while increasing total time-to-full-component by 80ms.

LLM inference is 200–800ms for a simple response and several seconds for multi-tool ones. CDN, SSG, and edge caching cannot remove this latency: the LLM decision step sits on the critical path of every request. But the interface does not have to feel slow.

This article is not "10 perf tips." It is an attempt to separate when optimization is worth doing from when it is self-deception and engineering gold-plating, and which strategy solves which specific problem. With real numbers from my production, not from benchmarks on blog posts.

When NOT to Optimize

Before reading about the six strategies below, answer three questions:

  1. Have you measured current performance? If not, close this tab and instrument TTFC/TTIC tracking (time to first component and time to interactive component, both defined under "The Metrics That Matter"). Half my clients who showed up with "everything is slow" had a p50 of 600ms and users who were angry about layout shift (CLS), not latency.
  2. Is your p95 already under 1.5 seconds? Then streaming skeletons and optimistic UI will give you ~20% perceptual improvement — at the cost of a week of work. Spend that week on functionality instead.
  3. Do you have under 100 daily active users? A Redis cache at two requests per minute is infrastructure cargo-culting. An in-memory Map will hold up for another year and a half.

Optimization is not "always good." Each strategy below adds complexity, failure modes, and cognitive load. If you have one engineer and the product is still searching for PMF — stream skeletons (Strategy 1) and do nothing else yet. Everything else is premature.

The Trade-Offs Table

Six strategies, their cost, and where each pays off:

Strategy | Complexity | TTFC win | TTIC win | When to use
--- | --- | --- | --- | ---
1. Stream skeletons | Low (hours) | −400 to −600ms | 0 | Always, if you use streamUI
2. Parallel tool calls | Low (hours) | 0 | −30 to −50% | When ≥2 independent fetches in generate
3. Response caching | Medium (days) | 0 | −500 to −800ms on cache hit | Queries repeat ≥10×/day/user
4. Model selection | Low (hours) | 0 | −200 to −500ms | Simple tool selection, no reasoning
5. Bundle optimization | Medium (days) | −100 to −300ms (cold load) | 0 | Bundle > 200KB or mobile-heavy audience
6. Optimistic UI | Medium (days) | −150 to −250ms | 0 | Queries are predictable from keywords

If forced to rank by benefit relative to complexity on a mature product with traffic, the order is: 1 → 4 → 2 → 6 → 3 → 5. Strategies 3 and 5 pay off later than expected and have repeatedly been my "wasted a week" line items.

The Metrics That Matter

Before optimizing, define what you are measuring:

Time to First Component (TTFC): How long until the user sees any AI-generated element, even a loading state. Target: under 200ms. This is achievable by streaming the skeleton immediately while inference runs.

Time to Interactive Component (TTIC): How long until the first real, data-populated component appears. Target: under 800ms. This is roughly the end of LLM inference for the first tool call, plus any data fetching that component needs.

Streaming Completion Time: How long until all generated components have loaded. This varies with the number of tool calls. With streaming, this is less important than TTFC and TTIC.

Layout Shift Score (CLS): Generated components should not shift the page layout as they load. Skeletons must match the size of the final component.
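
The targets above only mean something as percentiles over real samples. A minimal sketch of turning raw timing samples into p50/p95, assuming the nearest-rank method; the function name and sample values are illustrative, not taken from any library or from my production data:

```typescript
// Nearest-rank percentile over a list of timing samples in milliseconds.
// `percentile` is a hypothetical helper; the method is an assumption,
// and other percentile definitions will give slightly different values.
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  if (p < 0 || p > 100) throw new RangeError('p must be between 0 and 100');
  const sorted = [...samples].sort((a, b) => a - b);
  // Rank is 1-based; clamp to 1 so p = 0 returns the smallest sample.
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1]!;
}

// Illustrative TTFC samples, one per request.
const ttfcSamples = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000];
const ttfcP50 = percentile(ttfcSamples, 50);
const ttfcP95 = percentile(ttfcSamples, 95);
```

With few samples the p95 is dominated by one or two slow requests, so collect at least a few hundred before acting on it.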

Strategy 1: Stream Skeletons Immediately

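The single highest-impact optimization is getting a loading skeleton on screen before any real data exists. The Vercel AI SDK's generator pattern supports this directly:

tools: {
  revenueChart: {
    description: 'Display a revenue chart',
    parameters: z.object({
      period: z.string(),
      data: z.array(z.object({ date: z.string(), value: z.number() })),
    }),
    generate: async function* (params) {
      // generate runs once the model has emitted this tool call and its
      // arguments have been parsed, so params are already available here.
      // The skeleton streams to the client right away and covers any
      // slow work that follows.
      yield <ChartSkeleton />;

      // Optionally fetch real data here; the skeleton stays on screen
      // until the final component is returned.
      return <RevenueChart {...params} />;
    },
  },
}

The first yield is sent to the client as soon as generate starts, and each later yield or the final return replaces it. Note what this does and does not cover: generate is invoked after the model has chosen the tool, so the yielded skeleton hides data-fetch time, not inference time. To put something on screen while inference is still running, pass a skeleton through streamUI's initial option or render one on the client at submit time (see Strategy 6). Combining the two is what makes a TTFC under 200ms possible even when TTIC is 800ms.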
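
To make the ordering concrete: a consumer of an async generator receives the yielded placeholder before the awaited fetch behind it has finished. The sketch below uses plain objects in place of JSX so it runs anywhere. `Frame`, `fetchRevenue`, `renderRevenueChart`, and `collectFrames` are hypothetical stand-ins, not SDK APIs.

```typescript
// A frame is whatever would be rendered at that moment.
type Frame = { kind: 'skeleton' } | { kind: 'chart'; points: number[] };

// Hypothetical stand-in for a real data source (20ms of simulated latency).
async function fetchRevenue(period: string): Promise<number[]> {
  await new Promise<void>((resolve) => setTimeout(resolve, 20));
  return period === 'month' ? [10, 20, 30] : [];
}

// Same shape as a tool's generate function: yield a placeholder, return the result.
async function* renderRevenueChart(period: string): AsyncGenerator<Frame, Frame, void> {
  yield { kind: 'skeleton' };                // delivered before the fetch starts
  const points = await fetchRevenue(period); // slow work happens after the yield
  return { kind: 'chart', points };          // replaces the skeleton
}

// Drives the generator roughly the way a streaming renderer might,
// recording each frame in the order it becomes available.
async function collectFrames(gen: AsyncGenerator<Frame, Frame, void>): Promise<Frame[]> {
  const frames: Frame[] = [];
  while (true) {
    const step = await gen.next();
    frames.push(step.value);
    if (step.done) return frames;
  }
}
```

Calling `collectFrames(renderRevenueChart('month'))` resolves to the skeleton frame followed by the chart frame, which is the sequence the user sees.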

Critical detail: The skeleton must match the final component's dimensions. If the skeleton is 100px tall and the loaded component is 300px, you have layout shift that hurts CLS and feels jarring.

// Bad: generic skeleton that mismatches component size
yield <div className="h-8 animate-pulse bg-muted rounded" />;

// Good: skeleton that matches the component
yield (
  <div className="rounded-lg border p-6 h-64">
    <div className="h-4 w-32 animate-pulse bg-muted rounded mb-4" />
    <div className="h-48 w-full animate-pulse bg-muted rounded" />
  </div>
);

Strategy 2: Parallel Tool Calls

When the AI needs to call multiple tools, they should execute in parallel. The Vercel AI SDK handles this automatically — multiple tool calls in a single response run their generate functions concurrently.

But your component's data fetching must not block:

// Slow: sequential data fetching inside generate
generate: async function* ({ userId, period }) {
  yield <DashboardSkeleton />;
  const revenue = await fetchRevenue(userId, period);      // 200ms
  const users = await fetchUsers(userId, period);          // 150ms
  const conversions = await fetchConversions(userId);      // 100ms
  // Total: ~450ms
  return <Dashboard revenue={revenue} users={users} conversions={conversions} />;
},

// Fast: parallel data fetching
generate: async function* ({ userId, period }) {
  yield <DashboardSkeleton />;
  const [revenue, users, conversions] = await Promise.all([
    fetchRevenue(userId, period),
    fetchUsers(userId, period),
    fetchConversions(userId),
  ]);
  // Total: ~200ms (longest fetch wins)
  return <Dashboard revenue={revenue} users={users} conversions={conversions} />;
},

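For independent data sources, Promise.all finishes in roughly the time of the slowest fetch instead of the sum of all of them. One caveat: Promise.all rejects as soon as any single fetch fails, so if a partially filled dashboard is acceptable, Promise.allSettled is the safer choice.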
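
If one failing source should not take down the whole dashboard, the parallel fetch can degrade per field. A minimal sketch, assuming each fetcher returns a single number and that `null` means "render this panel's empty state"; `DashboardData`, `Fetchers`, and `loadDashboard` are illustrative names:

```typescript
interface DashboardData {
  revenue: number | null;
  users: number | null;
  conversions: number | null;
}

interface Fetchers {
  revenue: () => Promise<number>;
  users: () => Promise<number>;
  conversions: () => Promise<number>;
}

// Runs all fetchers concurrently and keeps whatever succeeded.
async function loadDashboard(fetchers: Fetchers): Promise<DashboardData> {
  const [revenue, users, conversions] = await Promise.allSettled([
    fetchers.revenue(),
    fetchers.users(),
    fetchers.conversions(),
  ]);
  // A rejected fetch becomes null instead of rejecting the whole call.
  const valueOrNull = (r: PromiseSettledResult<number>): number | null =>
    r.status === 'fulfilled' ? r.value : null;
  return {
    revenue: valueOrNull(revenue),
    users: valueOrNull(users),
    conversions: valueOrNull(conversions),
  };
}
```

Total latency is still bounded by the slowest fetch; the only change is that a failure surfaces as a missing panel rather than a failed render.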

Strategy 3: Response Caching

Many Generative UI queries are repeated. "Show me this month's revenue dashboard" runs dozens of times per day for the same user with the same underlying data.

Cache at the LLM response level, keyed by a hash of the prompt and relevant context:

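import { createHash } from 'node:crypto';
import type { ReactNode } from 'react';
import { streamUI } from 'ai/rsc'; // '@ai-sdk/rsc' in AI SDK 5

interface CacheEntry {
  value: ReactNode;
  cachedAt: number;
  ttlMs: number;
}

// Per-process cache: lost on restart and not shared between instances.
const responseCache = new Map<string, CacheEntry>();

function getCacheKey(prompt: string, context: object): string {
  // Hash a JSON array so the prompt/context boundary is unambiguous.
  // For personalized data, put the user ID in context. Keep the key order
  // of context stable, because JSON.stringify is order-sensitive.
  return createHash('md5')
    .update(JSON.stringify([prompt, context]))
    .digest('hex');
}

export async function generateUIWithCache(
  prompt: string,
  context: object = {},
  ttlMs: number = 5 * 60 * 1000  // 5 minutes default
) {
  const key = getCacheKey(prompt, context);
  const cached = responseCache.get(key);

  // Serve from cache only while the entry is younger than its TTL
  if (cached && Date.now() - cached.cachedAt < cached.ttlMs) {
    return cached.value;
  }

  const result = await streamUI({ /* ... */ });
  responseCache.set(key, { value: result.value, cachedAt: Date.now(), ttlMs });
  return result.value;
}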

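Once you run more than one server instance, move from the in-memory Map to a shared store such as Redis; Vercel KV and Upstash Redis are edge-compatible options. A shared store can only hold serializable data, so cache the model's decision (tool name and arguments as JSON) and re-render the component from it, rather than caching the React node itself.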

Important: Cache invalidation must match your data's update frequency. A revenue dashboard that caches for 5 minutes is fine. A real-time stock ticker that caches for 5 minutes is wrong.
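
A cache key is only as safe as what goes into it. The sketch below builds a key that is scoped per user, includes the model settings that change the output, and does not depend on object key order. It is one possible scheme under those assumptions; `buildCacheKey`, `stableStringify`, and the `genui:user:` prefix are illustrative, not part of any SDK.

```typescript
import { createHash } from 'node:crypto';

type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// JSON serialization with object keys sorted, so {a, b} and {b, a} hash alike.
function stableStringify(value: Json): string {
  if (value === null || typeof value !== 'object') return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map((v) => stableStringify(v)).join(',')}]`;
  const entries = Object.entries(value)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
  return `{${entries.join(',')}}`;
}

interface CacheKeyInput {
  userId: string;      // keeps one user's entries away from another's
  model: string;       // different models produce different UI
  temperature: number; // so does sampling configuration
  prompt: string;
  context: Json;
}

function buildCacheKey(input: CacheKeyInput): string {
  const digest = createHash('sha256')
    .update(stableStringify([input.model, input.temperature, input.prompt, input.context]))
    .digest('hex');
  // The readable per-user prefix makes pattern-based invalidation possible.
  return `genui:user:${input.userId}:${digest}`;
}
```

Keeping the user ID outside the hash is a deliberate choice: it lets you invalidate everything for one user by prefix without knowing the individual prompts.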

Strategy 4: Model Selection

Not every query needs GPT-4o. Model selection is the highest-leverage cost and latency optimization available.

Model | Latency | Cost | Quality
--- | --- | --- | ---
GPT-4o | 400–800ms | High | Best
GPT-4o-mini | 200–400ms | 10x cheaper | Good
Claude Haiku | 150–300ms | 5x cheaper | Good
Gemini Flash | 100–200ms | 5x cheaper | Good

For most Generative UI tool selection tasks, GPT-4o-mini or Claude Haiku produces results indistinguishable from GPT-4o. Reserve the frontier models for complex reasoning tasks.

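import { openai } from '@ai-sdk/openai';

// Route to the cheaper model when the request is simple.
// contextLength can be characters or tokens as long as you measure it
// consistently; tune both thresholds on your own traffic.
function selectModel(toolCount: number, contextLength: number) {
  if (toolCount <= 5 && contextLength < 500) {
    return openai('gpt-4o-mini');
  }
  return openai('gpt-4o');
}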
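
Routing by tool count works better when the tool count itself stays small. One way to do that is to scope the registry by product domain before picking a model. The sketch below assumes the domain is already known (for example, from the page the user is on). `TOOLS_BY_DOMAIN`, `planRequest`, and the threshold of 20 tools are illustrative assumptions to tune, not measured constants.

```typescript
type ModelId = 'gpt-4o-mini' | 'gpt-4o';

// Hypothetical registry: tool names grouped by product domain.
const TOOLS_BY_DOMAIN: Record<string, readonly string[]> = {
  finance: ['revenueChart', 'stockTicker', 'invoiceTable'],
  weather: ['weatherCard', 'forecastChart'],
};

// Assumed cut-off above which the small model is not trusted to pick tools.
const SMALL_MODEL_MAX_TOOLS = 20;

function selectModelId(toolCount: number): ModelId {
  return toolCount < SMALL_MODEL_MAX_TOOLS ? 'gpt-4o-mini' : 'gpt-4o';
}

// Narrow the tool list to one domain when possible, then choose the model
// based on how many tools the model will actually see.
function planRequest(domain: string | undefined): { tools: readonly string[]; modelId: ModelId } {
  const scoped = domain !== undefined ? TOOLS_BY_DOMAIN[domain] : undefined;
  const tools = scoped ?? Object.values(TOOLS_BY_DOMAIN).flat();
  return { tools, modelId: selectModelId(tools.length) };
}
```

An unknown or missing domain falls back to the full registry, so the worst case is today's behavior rather than a missing tool.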

Strategy 5: Bundle Optimization

Generative UI component libraries can grow large. Every component in your tool registry ships to the browser. Manage this actively.

Lazy load non-critical components:

import dynamic from 'next/dynamic';

// Only load the heavy chart component when it is actually rendered
const HeavyChartComponent = dynamic(
  () => import('@/components/heavy-chart'),
  { loading: () => <ChartSkeleton /> }
);

Separate the component bundle from the tool registry:

// Tool registry: lightweight, shipped early
export const toolDefinitions = {
  revenueChart: {
    description: '...',
    parameters: z.object({ ... }),
  },
};

// Component implementations: lazy loaded when needed
export const toolComponents = {
  revenueChart: dynamic(() => import('@/components/revenue-chart')),
};

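Measure your bundle. @next/bundle-analyzer is a Next.js plugin rather than a standalone CLI: wrap your Next.js config with it, run the build with ANALYZE=true, and look for components that are disproportionately large. A single charting library can add 50KB+ to your bundle.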
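
For reference, the analyzer is wired in through the Next.js config and switched on with an environment variable. A minimal sketch, assuming Next.js 15+ (which accepts next.config.ts; the same code works in next.config.mjs on older versions) and an otherwise empty config:

```typescript
// next.config.ts
import type { NextConfig } from 'next';
import bundleAnalyzer from '@next/bundle-analyzer';

// Only generate the report when explicitly requested, so normal builds stay fast.
const withBundleAnalyzer = bundleAnalyzer({
  enabled: process.env.ANALYZE === 'true',
});

// Placeholder: merge your existing config options here.
const nextConfig: NextConfig = {};

export default withBundleAnalyzer(nextConfig);
```

Running `ANALYZE=true npm run build` then writes the bundle reports as part of the build.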

Strategy 6: Optimistic UI

For queries the system can predict, show an optimistic UI before the AI responds:

'use client';
import { useState, type ReactNode } from 'react';

export function useGenerativeUI() {
  const [ui, setUI] = useState<ReactNode>(null);
  const [optimisticUI, setOptimisticUI] = useState<ReactNode>(null);

  async function generate(prompt: string) {
    const query = prompt.toLowerCase();
    // Immediately show a plausible skeleton based on query type
    if (query.includes('weather')) {
      setOptimisticUI(<WeatherCardSkeleton />);
    } else if (query.includes('stock') || query.includes('price')) {
      setOptimisticUI(<StockTickerSkeleton />);
    } else {
      setOptimisticUI(<GenericSkeleton />);
    }

    try {
      // generateUI is assumed to be your own server action wrapping streamUI
      const result = await generateUI(prompt);
      setUI(result);
    } finally {
      // Clear the skeleton even if generation fails, so it never gets stuck
      setOptimisticUI(null);
    }
  }

  return { ui: optimisticUI ?? ui, generate };
}

Simple keyword matching on the client is zero-latency. Showing a weather skeleton the instant the user submits a weather query feels significantly faster than waiting for the server round-trip.
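
Substring checks like the ones above are easy to fool ("priceless" contains "price"). A stricter variant matches whole words and only commits to a specific skeleton when exactly one trigger group matches. This is a sketch under those assumptions; `pickSkeleton`, the trigger words, and the skeleton kinds are illustrative.

```typescript
type SkeletonKind = 'weather' | 'stock' | 'generic';

interface Trigger {
  kind: Exclude<SkeletonKind, 'generic'>;
  words: ReadonlySet<string>;
}

// Hypothetical trigger vocabulary; keep it short and unambiguous.
const TRIGGERS: readonly Trigger[] = [
  { kind: 'weather', words: new Set(['weather', 'forecast']) },
  { kind: 'stock', words: new Set(['stock', 'stocks', 'ticker']) },
];

function pickSkeleton(prompt: string): SkeletonKind {
  // Split on anything that is not a letter or digit, so punctuation is ignored.
  const tokens = prompt.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  const matches = TRIGGERS.filter((t) => tokens.some((tok) => t.words.has(tok)));
  // Ambiguous or unmatched prompts fall back to the generic skeleton.
  const [only] = matches;
  return matches.length === 1 && only ? only.kind : 'generic';
}
```

The generic fallback is the important part: a wrong specific skeleton costs more than a correct vague one.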

Core Web Vitals Impact

Generative UI affects your Core Web Vitals. Here is what to watch:

Largest Contentful Paint (LCP): If your main content is AI-generated, LCP will reflect the full generation time. Mitigate by generating above-the-fold content first and using streaming to progressively paint the page.

Cumulative Layout Shift (CLS): The biggest risk. If your skeletons do not match component sizes, every component load causes layout shift. Use min-height on skeleton containers to reserve space.

Interaction to Next Paint (INP): Make sure AI generation is triggered by user actions (button clicks, form submits), not passive page load. Passive generation can block interaction handling.

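Event handlers (INP again; INP replaced First Input Delay as a Core Web Vital in March 2024): awaiting generation inside a handler does not block the main thread, but a handler that shows no feedback until the promise resolves leaves the user looking at an unchanged screen. streamUI also runs on the server, so the client should reach it through a server action rather than call it directly. Update state first so the next paint shows a loading state, then kick off generation:

// Slow to respond: nothing changes on screen until generation resolves
async function handleSubmit(e: React.FormEvent) {
  e.preventDefault();
  const result = await streamUI({ ... }); // no feedback while this is pending
  setUI(result.value);
}

// Better: paint a loading state first, update state when ready
function handleSubmit(e: React.FormEvent) {
  e.preventDefault();
  setLoading(true);
  generateUI(prompt)
    .then(setUI)
    .catch(console.error)              // surface failures instead of dropping them
    .finally(() => setLoading(false)); // clear the loading state either way
}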

Measuring What You're Optimizing

Without measurement, optimization is guesswork. Add performance tracking from the start:

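import { streamUI } from 'ai/rsc'; // '@ai-sdk/rsc' in AI SDK 5

// track is your analytics client (not shown)
export async function generateUIWithMetrics(prompt: string) {
  const startTime = performance.now();

  const result = await streamUI({
    /* ... */
    // Check the onFinish event shape for your SDK version. If it does not
    // expose toolCalls, record tool names inside each tool's generate instead.
    onFinish: ({ toolCalls }) => {
      const totalTime = performance.now() - startTime;

      // Send to your analytics / observability platform
      track('genui.generation_complete', {
        prompt_length: prompt.length,
        tool_calls_count: toolCalls?.length ?? 0,
        total_ms: Math.round(totalTime),
        tools_used: toolCalls?.map(c => c.toolName) ?? [],
      });
    },
  });

  return result.value;
}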

Track TTFC and TTIC separately by timing the skeleton yield and the final component return. After a week of data, you will have a clear picture of where time is actually going.
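
Timing the two moments separately needs very little code. A minimal sketch of a per-request timer, assuming you call `mark('skeleton')` right before the first yield and `mark('component')` right before the final return; `createGenerationTimer` and the mark names are illustrative, and the clock is injectable so the logic can be tested without real delays:

```typescript
type Mark = 'skeleton' | 'component';

interface GenerationTimer {
  mark(name: Mark): void;
  report(): { ttfcMs: number | null; tticMs: number | null };
}

// Creates a timer that starts counting immediately.
function createGenerationTimer(now: () => number = () => performance.now()): GenerationTimer {
  const start = now();
  const marks: Partial<Record<Mark, number>> = {};
  return {
    mark(name) {
      // Keep only the first occurrence: TTFC/TTIC are "time to first".
      if (marks[name] === undefined) marks[name] = Math.round(now() - start);
    },
    report() {
      // null means that moment never happened (for example, an error before render).
      return { ttfcMs: marks.skeleton ?? null, tticMs: marks.component ?? null };
    },
  };
}
```

Send `report()` to the same analytics event as the totals, so TTFC, TTIC, and completion time can be compared per request.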

Anti-Patterns I've Already Stepped On

Six places where "optimization" makes things worse, not better — all mistakes I personally shipped in production:

1. Caching a non-deterministic LLM response by prompt hash. GPT-4o with temperature=0.7 will return different UI for the same prompt. The cache "works," but the user sees an interface inconsistent with the previous call — worse than a slow but consistent response. Fix: only cache at temperature=0, or hash by prompt + temperature + seed.

2. A skeleton that differs sharply from the final component. Saw this in production: skeleton for a 5-row table, final table had 50 rows. CLS spiked, the user clicked the wrong target and got angry. Fix: min-height on the container based on average size, plus virtualized lazy rendering for rows.

3. Streaming a skeleton the user sees for under 50ms. On a fast network with p50 TTFC of 250ms, the skeleton flashes and disappears — more annoying than a clean load. Fix: add a 100ms delay before showing the skeleton (setTimeout), or skip it on fast connections (navigator.connection.effectiveType).

4. Optimistic UI that does not match the real response. Showed a weather skeleton, the AI decided the query was actually about news — the user sees jank. Fix: optimistic UI only for the most unambiguous triggers (exact word match, not substring), and a graceful fallback to a generic skeleton on mismatch.

5. A Redis cache with 5-minute TTL on personalized data. Cache key without userId — and user A sees user B's dashboard. That is a data leak, not a perf bug. Fix: userId is always part of the key, separate namespaces for public/private data, audit log on cache hits.

6. GPT-4o-mini for intent classification with 50+ tools. Mini models lose track in long tool registries — they start invoking the wrong tool. Latency savings turn into error-rate growth. Fix: for tool registries with 20+ tools, use GPT-4o, or split the registry by domain with a router.
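
The fix for anti-pattern 3 (the flashing skeleton) fits in a few lines. The sketch below delays the skeleton and cancels it if the work finishes first. `withDelayedSkeleton` is an illustrative helper, and the 100ms default is the value suggested above rather than a measured optimum.

```typescript
// Runs `task`, calling `showSkeleton` only if the task is still pending
// after `delayMs`. Fast responses never trigger the skeleton at all.
async function withDelayedSkeleton<T>(
  task: Promise<T>,
  showSkeleton: () => void,
  delayMs = 100,
): Promise<T> {
  const timer = setTimeout(showSkeleton, delayMs);
  try {
    return await task;
  } finally {
    // Cancel the pending skeleton whether the task succeeded or failed.
    clearTimeout(timer);
  }
}
```

Hiding the skeleton once the result arrives remains the caller's job; this helper only decides whether it is ever shown.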

Concrete Redis Configuration for Production

If you got to Strategy 3 and you actually need it — here is the configuration that works for me:

import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL!, {
  maxRetriesPerRequest: 2,
  enableReadyCheck: true,
  // Do not block the request more than 50ms on cache lookup — better to recompute
  commandTimeout: 50,
});

// TTLs in seconds (the unit Redis EX expects), chosen by data update
// frequency, not "5 minutes for everything"
const TTL_BY_DATA_TYPE = {
  staticReference: 24 * 60 * 60,    // 24h: reference data, docs
  userDashboard:   5 * 60,          // 5min: personal data
  marketData:      30,              // 30s: quotes, news
  realtime:        0,               // 0: skip the cache entirely (Redis rejects a zero expiry)
};

// Eviction settings in redis.conf:
//   maxmemory 512mb (typical MVP)
//   maxmemory-policy allkeys-lru

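The allkeys-lru eviction policy matters more than it looks: under the default noeviction policy, Redis at maxmemory rejects new writes with an error, which is a hard failure instead of graceful degradation. For invalidation, prefer key patterns over tracking individual keys, but note that DEL does not accept glob patterns, so redis.del('user:123:*') will not match anything. Iterate with SCAN and MATCH user:123:*, then delete the keys it returns. Avoid KEYS in production, because it blocks the server while it walks the whole keyspace.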
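
SCAN-based pattern invalidation can be sketched as a cursor loop. The `ScanClient` interface below is a hypothetical minimal subset shaped after ioredis's `scan` and `unlink` methods, so the function can be exercised with a fake; verify the exact signatures against the client version you use.

```typescript
// Minimal client surface this sketch needs (modelled on ioredis; an assumption).
interface ScanClient {
  scan(
    cursor: string,
    matchToken: 'MATCH',
    pattern: string,
    countToken: 'COUNT',
    count: number,
  ): Promise<[string, string[]]>;
  unlink(...keys: string[]): Promise<number>;
}

// Deletes every key matching `pattern` in batches, without blocking Redis
// the way KEYS would. Returns the number of keys removed.
async function deleteByPattern(
  client: ScanClient,
  pattern: string,
  batchSize = 100,
): Promise<number> {
  let cursor = '0';
  let removed = 0;
  do {
    const [next, keys] = await client.scan(cursor, 'MATCH', pattern, 'COUNT', batchSize);
    // A SCAN page may legitimately be empty; only call UNLINK when there is work.
    if (keys.length > 0) removed += await client.unlink(...keys);
    cursor = next;
  } while (cursor !== '0'); // Redis signals the end of iteration with cursor "0"
  return removed;
}
```

With a real client the call would look like `await deleteByPattern(redis, 'user:123:*')`. SCAN gives no snapshot guarantee, so keys written during the loop may survive it; for strict invalidation, version the key prefix instead.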

Real Numbers from My Production

Numbers from one of my Generative UI products (~2000 DAU, 8-tool dashboard, US East region, January 2026):

Metric | Before optimization | After Strategies 1+2+4 | After all 6
--- | --- | --- | ---
TTFC p50 | 580ms | 145ms | 90ms
TTFC p95 | 1100ms | 320ms | 240ms
TTIC p50 | 1400ms | 720ms | 380ms (cache hit)
TTIC p95 | 2800ms | 1500ms | 1300ms (cache miss)
CLS | 0.18 | 0.04 | 0.03
Cost per request | $0.012 | $0.002 | $0.0015
Code complexity | baseline | +~150 LOC | +~600 LOC + Redis

Main observation: Strategies 1+2+4 delivered 80% of the win for 20% of the complexity. Strategies 3, 5, 6 — the remaining 20% improvement for 80% additional complexity. If you do not have a team to run a Redis cluster and an SLA on cache invalidation, Strategies 1+2+4 are the destination, not a waypoint.
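
The "80%" figure can be checked against the latency rows of the table. The snippet below is a quick arithmetic sketch, not part of any tooling: for each metric it computes what share of the total improvement (before vs. all six strategies) was already delivered by Strategies 1+2+4. The row values are copied from the table; the helper name is illustrative.

```typescript
interface Row {
  metric: string;
  before: number;  // ms, before optimization
  partial: number; // ms, after Strategies 1+2+4
  full: number;    // ms, after all six strategies
}

const latencyRows: Row[] = [
  { metric: 'TTFC p50', before: 580, partial: 145, full: 90 },
  { metric: 'TTFC p95', before: 1100, partial: 320, full: 240 },
  { metric: 'TTIC p50', before: 1400, partial: 720, full: 380 },
  { metric: 'TTIC p95', before: 2800, partial: 1500, full: 1300 },
];

// Share of the total win captured by the partial rollout, as a whole percent.
function shareOfTotalWin(row: Row): number {
  return Math.round(((row.before - row.partial) / (row.before - row.full)) * 100);
}

const shares = latencyRows.map((row) => ({ metric: row.metric, percent: shareOfTotalWin(row) }));
// shares: TTFC p50 → 89, TTFC p95 → 91, TTIC p50 → 67, TTIC p95 → 87
```

So "80%" is a fair summary for three of the four rows. The exception is TTIC p50, where the cache-hit path in the full rollout accounts for a third of the gain.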

The Architectural Shift Most Articles Skip

If you are moving from "single render" (one page = one response) to "progressive delivery" (streaming + skeletons), this is not an optimization — it is an architectural change. What changes:

  • Server code is written as async generators, not regular functions — a different mental model.
  • React error boundaries behave differently for streamed content — you need fallback components at every level.
  • SEO and SSR require a separate strategy: streamed AI content is not indexed by default.
  • Tests get harder: snapshot tests for intermediate skeleton states and the final render.

I underestimated the cost of this shift on the first project — budgeted 2 days, spent 2 weeks. On the second project I budgeted 2 weeks up front and shipped on time. If your product does not stream today and you plan to introduce Strategy 1 — budget weeks, not hours.


