Chatbot Response Latency Breakdown
An interactive visualization comparing response time components between simple template-based queries and complex RAG (Retrieval-Augmented Generation) queries with LLM processing.
Interactive Chart
(The live chart is rendered by main.html; open it in a browser to interact with it.)
Overview
This horizontal stacked bar chart provides a detailed breakdown of latency components in chatbot response pipelines, comparing two common scenarios:
- Simple Query (Template): A straightforward query that can be answered using pre-defined templates
- Complex Query (RAG + LLM): A sophisticated query requiring retrieval-augmented generation with LLM processing
The visualization reveals critical performance bottlenecks and highlights the dramatic difference in response times between these two approaches.
Key Insights
Simple Query Performance (180ms)
Template-based responses deliver exceptional speed:
- Intent Classification: 100ms - Understanding user intent
- Template Selection: 50ms - Choosing the appropriate response template
- Response Formatting: 30ms - Final formatting and delivery
- Total: 180ms - Well below the 2-second user expectation threshold
Complex Query Performance (6,100ms)
RAG-enhanced responses require significantly more time:
- Intent Classification: 150ms (2.5%) - Understanding user intent
- Context Retrieval: 300ms (4.9%) - Fetching relevant context
- Vector Search: 600ms (9.8%) - Semantic similarity search
- LLM Generation: 5,000ms (82%) - Language model processing
- Response Formatting: 50ms (0.8%) - Final formatting and delivery
- Total: 6,100ms - Exceeds both the expectation (2s) and abandonment (5s) thresholds
Critical Performance Thresholds
- User Expectation Threshold (2,000ms): Research shows users expect responses within 2 seconds for conversational interfaces
- Abandonment Risk Zone (5,000ms): At 5 seconds, users are likely to abandon the interaction or lose context
- LLM Bottleneck: The LLM generation step accounts for 82% of total latency in complex queries, making it the primary optimization target
Chart Features
- ✅ Horizontal stacked bar chart showing chatbot latency breakdown
- ✅ Accurate data: Simple query (180ms) vs Complex query (6,100ms)
- ✅ Color-coded components with a consistent color scheme
- ✅ Threshold annotations at 2,000ms and 5,000ms
- ✅ LLM callout highlighting the 82% bottleneck
- ✅ Two-line labels for better space efficiency
- ✅ Minimal margins and full-width display
- ✅ Readable tooltips with dark background and white text
- ✅ Fixed chart height of 400px
Interactive Elements
- Hover Tooltips: Show detailed timing for each component along with cumulative time (see the callback sketch after this list)
- Threshold Annotations: Visual indicators for user expectation (2s) and abandonment risk (5s) zones
- LLM Callout: Highlights that LLM generation represents 82% of total latency
- Component Totals: Tooltip footer displays total response time for each scenario
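These tooltip behaviors are typically wired through Chart.js tooltip callbacks. A minimal sketch using standard Chart.js 4.x options follows; the exact wording and styling in main.html may differ:

```javascript
// Part of the Chart configuration object (Chart.js 4.x tooltip API).
options: {
  plugins: {
    tooltip: {
      backgroundColor: 'rgba(0, 0, 0, 0.85)', // dark background
      titleColor: '#ffffff',                  // white text
      bodyColor: '#ffffff',
      footerColor: '#ffffff',
      callbacks: {
        // Timing for the hovered component
        label: (ctx) => `${ctx.dataset.label}: ${ctx.parsed.x}ms`,
        // Running total up to and including the hovered component
        afterLabel: (ctx) => {
          let cumulative = 0;
          for (let i = 0; i <= ctx.datasetIndex; i++) {
            cumulative += ctx.chart.data.datasets[i].data[ctx.dataIndex] || 0;
          }
          return `Cumulative: ${cumulative}ms`;
        },
        // Footer shows the total response time for the hovered scenario
        footer: (items) => {
          const total = items[0].chart.data.datasets.reduce(
            (sum, ds) => sum + (ds.data[items[0].dataIndex] || 0), 0
          );
          return `Total: ${total.toLocaleString()}ms`;
        }
      }
    }
  }
}
```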
Visual Design
- Color-Coded Components:
  - Blue: Intent classification
  - Green/Teal: Retrieval operations (template, context, vector search)
  - Orange: LLM generation (the dominant bottleneck)
  - Purple: Response formatting
- Stacked Waterfall: Each component stacks horizontally to show cumulative time
- Clear Annotations: Dashed vertical lines mark critical performance thresholds
Customization Guide
Changing the Data
To modify the latency values, edit the `data.datasets` array in `main.html`:
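A sketch of the array's likely shape (the dataset labels mirror the chart's components, but the variable and color-key names here are assumptions; check main.html for the exact identifiers):

```javascript
// Latency data: data[0] = simple query (ms), data[1] = complex query (ms).
// The `colors` object is covered under "Customizing Colors" below.
const data = {
  labels: ['Simple Query (Template)', 'Complex Query (RAG + LLM)'],
  datasets: [
    { label: 'Intent Classification', data: [100,  150], backgroundColor: colors.intent },
    { label: 'Template Selection',    data: [ 50,    0], backgroundColor: colors.template },
    { label: 'Context Retrieval',     data: [  0,  300], backgroundColor: colors.context },
    { label: 'Vector Search',         data: [  0,  600], backgroundColor: colors.vector },
    { label: 'LLM Generation',        data: [  0, 5000], backgroundColor: colors.llm },
    { label: 'Response Formatting',   data: [ 30,   50], backgroundColor: colors.formatting }
  ]
};
```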
Important: Ensure that the data array has two values for each component:
- Index 0: Simple query latency (ms)
- Index 1: Complex query latency (ms)
Use 0 if a component doesn't apply to that scenario.
Adjusting Thresholds
Modify the annotation positions to reflect different user expectation thresholds:
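A sketch using standard chartjs-plugin-annotation 3.x options; the annotation IDs and styling here are illustrative:

```javascript
// Part of the Chart configuration object.
options: {
  plugins: {
    annotation: {
      annotations: {
        userExpectation: {
          type: 'line',
          xMin: 2000,                   // user expectation threshold (ms)
          xMax: 2000,
          borderColor: 'rgba(255, 159, 64, 0.8)',
          borderWidth: 2,
          borderDash: [6, 6],
          label: { display: true, content: 'User expectation (2s)', position: 'start' }
        },
        abandonmentRisk: {
          type: 'line',
          xMin: 5000,                   // abandonment risk threshold (ms)
          xMax: 5000,
          borderColor: 'rgba(255, 99, 132, 0.8)',
          borderWidth: 2,
          borderDash: [6, 6],
          label: { display: true, content: 'Abandonment risk (5s)', position: 'start' }
        }
      }
    }
  }
}
```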
Customizing Colors
Update the color scheme in the `colors` object:
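A sketch of what the colors object might contain; the key names and hex values are illustrative, chosen to match the blue / green-teal / orange / purple scheme described above:

```javascript
// Illustrative color map: match the key names to those used in main.html.
const colors = {
  intent:     '#36A2EB', // blue:   intent classification
  template:   '#4BC0C0', // teal:   template selection
  context:    '#2ECC71', // green:  context retrieval
  vector:     '#1ABC9C', // teal:   vector search
  llm:        '#FF9F40', // orange: LLM generation (the bottleneck)
  formatting: '#9966FF'  // purple: response formatting
};
```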
Adding New Components
To add a new latency component:
- Add a color to the `colors` object
- Add a dataset to the `data.datasets` array
- Update the custom legend in the HTML
- Adjust the annotations if needed
Example:
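A hypothetical sketch adding a "Safety Filtering" step (the name, timings, and color are illustrative, not from main.html):

```javascript
// Hypothetical "Safety Filtering" component: all values are illustrative.
colors.safety = '#FF6384';

data.datasets.push({
  label: 'Safety Filtering',
  data: [0, 200],                 // 0ms for simple, 200ms for complex
  backgroundColor: colors.safety
});
```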
Adjusting Chart Scale
To change the maximum time displayed on the x-axis:
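This is controlled by the x-axis scale options. A sketch using standard Chart.js 4.x settings, with 8,000ms as an illustrative new maximum:

```javascript
// Part of the Chart configuration object.
options: {
  indexAxis: 'y',                 // horizontal bars
  scales: {
    x: {
      stacked: true,
      max: 8000,                  // maximum time displayed (ms)
      title: { display: true, text: 'Response Time (ms)' }
    },
    y: { stacked: true }
  }
}
```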
Technical Details
- Library: Chart.js 4.4.0
- Plugins: chartjs-plugin-annotation 3.0.1 (for threshold lines and callouts)
- Chart Type: Horizontal stacked bar chart
- Browser Compatibility: All modern browsers (Chrome, Firefox, Safari, Edge)
- Dependencies: Chart.js and Annotation plugin (loaded from CDN)
- Responsive: Yes, adapts to container width
Performance Optimization Strategies
Based on this latency breakdown, consider these optimization approaches:
For Complex Queries
- LLM Optimization (82% impact):
  - Implement streaming responses to reduce perceived latency
  - Use smaller, faster models for simpler queries
  - Cache frequently requested completions
  - Optimize prompt length and complexity
- Vector Search Optimization (10% impact):
  - Use approximate nearest neighbor (ANN) algorithms
  - Implement hierarchical search indices
  - Pre-filter results to reduce search space
  - Cache popular query embeddings
- Parallel Processing (see the sketch after this list):
  - Run context retrieval and vector search in parallel
  - Pre-fetch likely contexts based on intent classification
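As an illustration of the parallel-processing idea, independent retrieval steps can be overlapped with Promise.all; the function names below are hypothetical stand-ins, not code from main.html:

```javascript
// Sequential: ~300ms + ~600ms = ~900ms of wall-clock time.
// Parallel:   ~max(300, 600)  = ~600ms of wall-clock time.
const [context, neighbors] = await Promise.all([
  retrieveContext(query),  // ~300ms context fetch (hypothetical helper)
  vectorSearch(query)      // ~600ms semantic search (hypothetical helper)
]);
```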
Hybrid Approaches
- Progressive Enhancement: Start with a template response, then enhance with LLM-generated details
- Confidence Thresholds: Use templates for high-confidence queries and the LLM for uncertain cases (see the routing sketch below)
- Streaming: Display partial results as components complete
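A minimal routing sketch of the confidence-threshold idea; every helper here (classifyIntent, hasTemplate, renderTemplate, retrieveContext, streamLLM) is a hypothetical placeholder:

```javascript
// Illustrative hybrid routing: all helpers are hypothetical placeholders.
async function respond(query) {
  const { intent, confidence } = await classifyIntent(query); // ~100-150ms
  if (confidence > 0.9 && hasTemplate(intent)) {
    return renderTemplate(intent, query);       // fast template path (~180ms total)
  }
  const context = await retrieveContext(query); // retrieval + vector search
  return streamLLM(query, context);             // stream tokens to reduce perceived latency
}
```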
Use Cases
This visualization is valuable for:
- Performance Analysis: Identifying bottlenecks in chatbot response pipelines
- Architecture Decisions: Comparing template-based vs. LLM-based approaches
- Optimization Planning: Prioritizing optimization efforts based on component impact
- Stakeholder Communication: Explaining technical tradeoffs to non-technical audiences
- SLA Planning: Setting realistic response time expectations
- Educational Content: Teaching about conversational AI performance characteristics
Related Concepts
- Retrieval-Augmented Generation (RAG): Combining retrieval systems with LLMs
- Vector Search: Semantic similarity search in embedding spaces
- Intent Classification: Understanding user goals from natural language
- Template-Based Responses: Pre-defined responses for common queries
- Latency Budgets: Allocating time across system components
- Streaming Responses: Progressive delivery of LLM outputs