CLI Latency Optimization: 68× Faster with Two-Tier Caching

For AI-driven command-line tools, a good user experience depends on fast response times, even for routine tasks like browsing conversation history. When large language models (LLMs) are integrated into existing CLI workflows, performance bottlenecks often emerge that are rooted not in data volume but in inefficient runtime choices. This article dissects a latency issue encountered with a popular AI CLI tool after model integration, diagnoses its root cause, and presents a two-tier caching solution that cut query time by 68×, from 6,000ms to 88ms. Beyond the specific use case, it outlines performance principles that apply to any tool suffering from “small data, slow queries” syndrome.

The Problem: Crippling Latency in Conversation History Retrieval

After updating a widely used AI CLI tool to leverage a new high-performance LLM, developers and users reported a debilitating issue: every attempt to view conversation history took approximately 6 seconds (6,000ms). This delay was not intermittent—it occurred consistently for every query, making the tool virtually unusable for anyone who frequently switches between conversations (a common workflow for daily users). For a developer juggling dozens of conversation switches per day, this translated to minutes of wasted time, severely impacting productivity.

Initial Misdiagnosis: Ruling Out Data Volume

The first instinct was to blame excessive data size—large datasets are a common culprit for slow queries. However, a thorough audit of the tool’s local storage revealed the opposite:

- Only 11 conversation threads existed
- Total data size was a mere 4.5MB, stored across two locations:
  1. ~/.codex/state_5.sqlite: a SQLite database holding thread metadata (titles, timestamps, model info)
  2. ~/.codex/sessions/2026/05/*.jsonl: JSON Lines files storing full conversation content (one JSON object per line)

A 4.5MB dataset should never take 6 seconds to query on a modern machine. Data volume was clearly not the issue—something far more insidious was at play.

Root Cause: Python Cold-Start Overhead

A deep dive into the tool’s source code uncovered the flaw: every conversation history query triggered a full Python interpreter cold start. The CLI tool relied on inline Python scripts (executed via python -c) to handle database connections, JSON parsing, and output formatting, so each query launched a new Python process and paid the full startup cost again.

Approximate breakdown of the ~6,000ms latency:

- Python interpreter startup: ~2,000ms (initializing the runtime, loading core libraries)
- Module loading (sqlite3, json): ~500ms
- Database connection and SQL query: ~500ms
- JSONL file reading and parsing: ~1,000ms
- Output formatting: ~500ms
- Process termination: ~500ms

These itemized estimates sum to about 5,000ms; the total per query ranged from 5,000 to 7,000ms.
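The cold-start cost described above can be checked on any machine with a small timing harness. The following is a minimal sketch, not the tool's own code: the function name time_cold_start and the choice of five runs are assumptions made for illustration. It launches a fresh interpreter per run, the same way a python -c call does, and reports the mean wall-clock time.

```python
import subprocess
import sys
import time


def time_cold_start(snippet: str = "import sqlite3, json", runs: int = 5) -> float:
    """Return the mean wall-clock seconds to run `python -c <snippet>` in a fresh process."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Each call spawns a brand-new interpreter, as the CLI did for every query.
        subprocess.run([sys.executable, "-c", snippet], check=True)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


if __name__ == "__main__":
    bare = time_cold_start("pass")
    with_modules = time_cold_start("import sqlite3, json")
    print(f"bare interpreter:   {bare * 1000:.0f} ms")
    print(f"with sqlite3, json: {with_modules * 1000:.0f} ms")
```

Absolute numbers vary widely with hardware, antivirus scanning, and disk cache state, so the figures are best compared against each other on the same machine rather than against the breakdown above.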

Worst of all, this process repeated for every single query. Even for 11 threads, each lookup restarted Python from scratch—no caching, no persistence, no optimization. The bottleneck was not the data itself, but the repeated initialization of a heavyweight runtime for lightweight, frequent tasks. This is a classic anti-pattern in CLI tooling: using a general-purpose language for high-frequency, low-complexity operations where native shell tools would suffice.

Iterative Optimization: From Basic Caching to a Two-Tier Architecture

The solution evolved in two key iterations, each addressing critical limitations of the last. The core insight guiding both phases: separate heavy, one-time work from light, frequent queries.

First Iteration: Precomputed JSON Cache

The initial fix focused on eliminating redundant database and JSONL parsing by precomputing a static cache. A Python script (build_cache.py) was created to:

- Connect to the SQLite database and fetch all thread metadata
- Iterate over each thread’s JSONL file and load full conversation messages
- Store all data in a single JSON file (thread_cache.json)

This script ran once, taking just 0.4 seconds to process all 11 threads and 2,515 messages. A second Python script (read_history.py) was then used to read this precomputed cache for subsequent queries.
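A cache builder along these lines could look like the sketch below. The article does not show the real schema of state_5.sqlite or how JSONL files map to threads, so the threads table, its id, title, and updated_at columns, and the <thread_id>.jsonl file naming are all hypothetical stand-ins; only the overall shape (read SQLite, read JSONL, write one JSON file) comes from the description above.

```python
import json
import sqlite3
from pathlib import Path


def build_cache(db_path: Path, sessions_dir: Path, cache_path: Path) -> int:
    """Consolidate thread metadata and JSONL messages into one JSON file.

    Returns the total number of messages written to the cache.
    """
    # Hypothetical schema: threads(id TEXT, title TEXT, updated_at TEXT).
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT id, title, updated_at FROM threads").fetchall()
    finally:
        conn.close()

    threads = []
    total_messages = 0
    for thread_id, title, updated_at in rows:
        messages = []
        # Hypothetical layout: a <thread_id>.jsonl file somewhere under sessions_dir.
        for jsonl_file in sorted(sessions_dir.rglob(f"{thread_id}.jsonl")):
            with jsonl_file.open(encoding="utf-8") as fh:
                # One JSON object per non-empty line.
                messages.extend(json.loads(line) for line in fh if line.strip())
        total_messages += len(messages)
        threads.append(
            {"id": thread_id, "title": title, "updated_at": updated_at, "messages": messages}
        )

    # Write to a temporary file, then rename, so readers never see a half-written cache.
    tmp_path = cache_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps({"threads": threads}), encoding="utf-8")
    tmp_path.replace(cache_path)
    return total_messages
```

Writing to a temporary file and renaming it is a design choice of this sketch rather than something the article describes; it keeps a query that runs during a rebuild from reading a partial file.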

Result: Latency dropped from 6,000ms to ~3,000ms, a 2× improvement. While promising, this fix had a fatal flaw: reading the cache still required starting a Python interpreter, so each query still cost 2–3 seconds, and the delay remained after restarting the machine or terminal. The cold-start problem was mitigated but not eliminated.

Final Solution: Two-Tier Caching Architecture

The breakthrough came from reimagining the workflow with a clear separation of concerns between two layers, each using the right tool for the job:

- Cache Builder Layer (rare, heavy work): Python, retained for its robust SQLite/JSON support
- Query Layer (frequent, light work): PowerShell, which runs in the already-open shell session and so needs no new interpreter process

How the Two-Tier System Works

Cache Construction (On-Demand, Rare):
- The build_cache.py script runs only when the source data changes (e.g., new conversation created)
- It reads SQLite metadata and JSONL content, then writes the consolidated thread_cache.json (0.4s runtime)
Cache Querying (Frequent, Instant):
- A PowerShell script (read_history.ps1) handles all user queries
- It uses native cmdlets (Get-Content | ConvertFrom-Json) to parse the JSON cache without spawning any external processes
- Because the query runs inside the existing shell session, no interpreter has to start, and query time drops to 88ms
Automatic Cache Invalidation (Seamless UX):
- The PowerShell script checks timestamps of the SQLite database and JSON cache on every query
- If the database is newer (data changed), it automatically triggers build_cache.py to refresh the cache
- Users experience no disruption—invalidations happen in the background
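The timestamp check in the invalidation step can be sketched as follows. The article performs this check in PowerShell; the version here is written in Python only to stay consistent with the other sketches, and the function names cache_is_stale and ensure_fresh_cache are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def cache_is_stale(db_path: Path, cache_path: Path) -> bool:
    """True when the cache is missing or older than the SQLite database."""
    if not cache_path.exists():
        return True
    # Compare last-modified times: a newer database means the data changed.
    return db_path.stat().st_mtime > cache_path.stat().st_mtime


def ensure_fresh_cache(db_path: Path, cache_path: Path, builder: Path) -> bool:
    """Run the builder script if the cache is stale; return whether a rebuild ran."""
    if not cache_is_stale(db_path, cache_path):
        return False  # Fast path: no Python process is spawned for the rebuild.
    subprocess.run([sys.executable, str(builder)], check=True)
    return True
```

One caveat with this approach: it only watches the database file. If new messages can be appended to a JSONL file without the database changing, or if SQLite runs in WAL mode and recent writes sit in the separate -wal file, the check may miss an update, so a more careful version would probably compare those files' timestamps as well.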

This architecture leverages the strengths of each tool: Python for complex, one-time data processing; native shell tools for blistering-fast, frequent queries. The result is a system that eliminates cold-start overhead entirely for routine use cases.

Performance Benchmarks: Quantifying the Gains

The two-tier caching solution delivered transformative performance improvements across all key workflows, with results measured in controlled testing environments:

Scenario	Before Optimization	First Iteration (Python Cache)	Final Solution (Two-Tier)	Total Improvement
List Conversation Threads	~6,000ms	~3,000ms	88ms	68×
View Specific Conversation	~6,000ms	~3,000ms	135ms	44×
Cache Rebuild (Data Change)	6s × N Queries	Manual Trigger (Slow)	0.4s (Automatic)	—

These numbers speak for themselves: listing threads, which once took 6 seconds, now takes less than a tenth of a second. The 68× speedup eliminates user frustration and restores the tool’s usability for daily workflows. Even the slower operation, viewing a specific conversation, improves 44×, to 135ms.
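The speedup factors follow directly from the before and after timings in the table. A small sketch of the arithmetic, where the helper name speedup is an assumption for illustration:

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the optimized path is than the original."""
    return before_ms / after_ms


BEFORE_MS = 6000  # Approximate latency before optimization, from the table.
AFTER_MS = {
    "List Conversation Threads": 88,
    "View Specific Conversation": 135,
}

for scenario, after_ms in AFTER_MS.items():
    print(f"{scenario}: {speedup(BEFORE_MS, after_ms):.1f}x")
```

This prints 68.2x for listing threads and 44.4x for viewing a conversation, which the table rounds to 68× and 44×.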

Key Design Principles for Universal Performance Optimization

This case study reveals three timeless principles applicable to any software tool facing latency issues—especially those involving CLI workflows, local data storage, and frequent user queries:

1. Precomputation Beats Real-Time Querying

Whenever possible, compute expensive work upfront and reuse results instead of recalculating them for every user request. Static caches, preprocessed data files, and batch jobs eliminate redundant computation and runtime overhead. In this case, precomputing the JSON cache avoided repeated SQLite/JSONL parsing for every query.

2. Use the Right Tool for the Right Job

A general-purpose language like Python is powerful, but launching its interpreter is costly, so avoid spawning it for high-frequency, lightweight tasks. Shell commands that run inside an already-open session (PowerShell, Bash) avoid that cold-start cost and handle simple file I/O and text processing well. Conversely, reserve Python for complex tasks like database integration, nested data parsing, and batch processing.

3. Automate Cache Invalidation

Caches only work if they stay fresh. Implement automatic invalidation logic to ensure caches update when source data changes—without requiring manual user intervention. Timestamp checks, file watchers, or event triggers keep caches consistent while maintaining a seamless user experience.

Conclusion

The latency issue in the AI CLI tool shows how to identify and resolve “hidden bottlenecks”: problems that have nothing to do with data size and everything to do with inefficient runtime choices. By diagnosing the Python cold-start overhead, iterating from a basic cache to a refined two-tier architecture, and adhering to core performance principles, the team achieved a 68× speedup, transforming a frustrating user experience into a seamless one.

This solution is not unique to AI CLI tools—it applies to any software that handles frequent queries on small-to-medium local datasets. The key takeaway: when facing slow queries, look beyond the data itself and examine the tooling and runtime choices powering your workflows. A small shift in architecture—separating heavy work from light work—can yield order-of-magnitude performance gains.
