Skip to content

Chatbot Response Latency Breakdown

An interactive visualization comparing response time components between simple template-based queries and complex RAG (Retrieval-Augmented Generation) queries with LLM processing.

Interactive Chart

View Fullscreen

1
<iframe src="main.html" width="100%" height="580" scrolling="no"></iframe>

Overview

This horizontal stacked bar chart provides a detailed breakdown of latency components in chatbot response pipelines, comparing two common scenarios:

  1. Simple Query (Template): A straightforward query that can be answered using pre-defined templates
  2. Complex Query (RAG + LLM): A sophisticated query requiring retrieval-augmented generation with LLM processing

The visualization reveals critical performance bottlenecks and highlights the dramatic difference in response times between these two approaches.

Key Insights

Simple Query Performance (180ms)

Template-based responses deliver exceptional speed: - Intent Classification: 100ms - Understanding user intent - Template Selection: 50ms - Choosing the appropriate response template - Response Formatting: 30ms - Final formatting and delivery - Total: 180ms - Well below the 2-second user expectation threshold

Complex Query Performance (6,100ms)

RAG-enhanced responses require significantly more time: - Intent Classification: 150ms (2.5%) - Understanding user intent - Context Retrieval: 300ms (4.9%) - Fetching relevant context - Vector Search: 600ms (9.8%) - Semantic similarity search - LLM Generation: 5,000ms (82%) - Language model processing - Response Formatting: 50ms (0.8%) - Final formatting and delivery - Total: 6,100ms - Exceeds both expectation (2s) and abandonment (5s) thresholds

Critical Performance Thresholds

  • User Expectation Threshold (2,000ms): Research shows users expect responses within 2 seconds for conversational interfaces
  • Abandonment Risk Zone (5,000ms): At 5 seconds, users are likely to abandon the interaction or lose context
  • LLM Bottleneck: The LLM generation step accounts for 82% of total latency in complex queries, making it the primary optimization target

Chart Features

  • ✅ Horizontal stacked bar chart showing chatbot latency breakdown
  • ✅ Accurate data: Simple query (180ms) vs Complex query (6,100ms)
  • ✅ Color-coded components with the exact color scheme requested
  • ✅ Threshold annotations at 2,000ms and 5,000ms
  • ✅ LLM callout highlighting the 82% bottleneck
  • ✅ Two-line labels for better space efficiency
  • ✅ Minimal margins and full-width display
  • ✅ Readable tooltips with dark background and white text
  • ✅ Proper height at 400px

Interactive Elements

  • Hover Tooltips: Shows detailed timing for each component with cumulative time
  • Threshold Annotations: Visual indicators for user expectation (2s) and abandonment risk (5s) zones
  • LLM Callout: Highlights that LLM generation represents 82% of total latency
  • Component Totals: Tooltip footer displays total response time for each scenario

Visual Design

  • Color-Coded Components:
  • Blue: Intent classification
  • Green/Teal: Retrieval operations (template, context, vector search)
  • Orange: LLM generation (the dominant bottleneck)
  • Purple: Response formatting
  • Stacked Waterfall: Each component stacks horizontally to show cumulative time
  • Clear Annotations: Dashed vertical lines mark critical performance thresholds

Customization Guide

Changing the Data

To modify the latency values, edit the data.datasets array in main.html:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
const data = {
    labels: ['Simple Query\n(Template)', 'Complex Query\n(RAG + LLM)'],
    datasets: [
        {
            label: 'Intent Classification',
            data: [100, 150], // [simple query ms, complex query ms]
            backgroundColor: colors.intentClassification
        },
        // Add more components...
    ]
};

Important: Ensure that the data array has two values for each component: - Index 0: Simple query latency (ms) - Index 1: Complex query latency (ms)

Use 0 if a component doesn't apply to that scenario.

Adjusting Thresholds

Modify the annotation positions to reflect different user expectation thresholds:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
annotation: {
    annotations: {
        expectationLine: {
            xMin: 2000, // Change this value (ms)
            xMax: 2000,
            label: {
                content: 'User expectation threshold (2s)'
            }
        },
        abandonmentLine: {
            xMin: 5000, // Change this value (ms)
            xMax: 5000,
            label: {
                content: 'Abandonment risk zone (5s)'
            }
        }
    }
}

Customizing Colors

Update the color scheme in the colors object:

1
2
3
4
5
6
7
8
const colors = {
    intentClassification: 'rgba(54, 162, 235, 0.8)',
    templateSelection: 'rgba(75, 192, 192, 0.8)',
    contextRetrieval: 'rgba(153, 102, 255, 0.8)',
    vectorSearch: 'rgba(201, 203, 207, 0.8)',
    llmGeneration: 'rgba(255, 159, 64, 0.8)',
    formatting: 'rgba(147, 51, 234, 0.8)'
};

Adding New Components

To add a new latency component:

  1. Add a color to the colors object
  2. Add a dataset to the data.datasets array
  3. Update the custom legend in the HTML
  4. Adjust the annotation if needed

Example:

1
2
3
4
5
6
7
{
    label: 'New Component',
    data: [50, 100], // Latency for simple and complex queries
    backgroundColor: 'rgba(255, 99, 132, 0.8)',
    borderColor: 'rgba(255, 99, 132, 1)',
    borderWidth: 1
}

Adjusting Chart Scale

To change the maximum time displayed on the x-axis:

1
2
3
4
5
6
7
scales: {
    x: {
        min: 0,
        max: 10000, // Change this value (ms)
        // ...
    }
}

Technical Details

  • Library: Chart.js 4.4.0
  • Plugins: chartjs-plugin-annotation 3.0.1 (for threshold lines and callouts)
  • Chart Type: Horizontal stacked bar chart
  • Browser Compatibility: All modern browsers (Chrome, Firefox, Safari, Edge)
  • Dependencies: Chart.js and Annotation plugin (loaded from CDN)
  • Responsive: Yes, adapts to container width

Performance Optimization Strategies

Based on this latency breakdown, consider these optimization approaches:

For Complex Queries

  1. LLM Optimization (82% impact):
  2. Implement streaming responses to reduce perceived latency
  3. Use smaller, faster models for simpler queries
  4. Cache frequently requested completions
  5. Optimize prompt length and complexity

  6. Vector Search Optimization (10% impact):

  7. Use approximate nearest neighbor (ANN) algorithms
  8. Implement hierarchical search indices
  9. Pre-filter results to reduce search space
  10. Cache popular query embeddings

  11. Parallel Processing:

  12. Run context retrieval and vector search in parallel
  13. Pre-fetch likely contexts based on intent classification

Hybrid Approaches

  1. Progressive Enhancement: Start with a template response, then enhance with LLM-generated details
  2. Confidence Thresholds: Use templates for high-confidence queries, LLM for uncertain cases
  3. Streaming: Display partial results as components complete

Use Cases

This visualization is valuable for:

  • Performance Analysis: Identifying bottlenecks in chatbot response pipelines
  • Architecture Decisions: Comparing template-based vs. LLM-based approaches
  • Optimization Planning: Prioritizing optimization efforts based on component impact
  • Stakeholder Communication: Explaining technical tradeoffs to non-technical audiences
  • SLA Planning: Setting realistic response time expectations
  • Educational Content: Teaching about conversational AI performance characteristics
  • Retrieval-Augmented Generation (RAG): Combining retrieval systems with LLMs
  • Vector Search: Semantic similarity search in embedding spaces
  • Intent Classification: Understanding user goals from natural language
  • Template-Based Responses: Pre-defined responses for common queries
  • Latency Budgets: Allocating time across system components
  • Streaming Responses: Progressive delivery of LLM outputs

References