In the fast-paced world of AI-driven command-line tools, seamless user experience hinges on snappy response times—even for routine tasks like browsing conversation history. When integrating modern large language models (LLMs) into existing CLI workflows, unexpected performance bottlenecks often emerge, rooted not in data volume but in inefficient runtime choices. This article dissects a critical latency issue encountered with a popular AI CLI tool after model integration, diagnoses its root cause, and presents a highly optimized two-tier caching solution that slashed query time by 68×—from 6,000ms to 88ms. Beyond the specific use case, this analysis outlines universal performance optimization principles applicable to any tool facing “small data, slow queries” syndrome.
The Problem: Crippling Latency in Conversation History Retrieval
After updating a widely used AI CLI tool to leverage a new high-performance LLM, developers and users reported a debilitating issue: every attempt to view conversation history took approximately 6 seconds (6,000ms). This delay was not intermittent—it occurred consistently for every query, making the tool virtually unusable for anyone who frequently switches between conversations (a common workflow for daily users). For a developer juggling dozens of conversation switches per day, this translated to minutes of wasted time, severely impacting productivity.
Initial Misdiagnosis: Ruling Out Data Volume
The first instinct was to blame excessive data size—large datasets are a common culprit for slow queries. However, a thorough audit of the tool’s local storage revealed the opposite:
- Only 11 conversation threads existed
- Total data size was a mere 4.5MB, stored across two locations:
  - ~/.codex/state_5.sqlite: A SQLite database holding thread metadata (titles, timestamps, model info)
  - ~/.codex/sessions/2026/05/*.jsonl: JSON Lines files storing full conversation content (one JSON object per line)
A 4.5MB dataset should never take 6 seconds to query on a modern machine. Data volume was clearly not the issue—something far more insidious was at play.
Root Cause: Python Cold-Start Overhead
A deep dive into the tool’s source code uncovered the critical flaw: every conversation history query triggered a full Python interpreter cold start. The CLI tool relied on inline Python scripts (executed via python -c) to handle database connections, JSON parsing, and output formatting. Each query started a new Python process, which incurred massive, redundant overhead.
Breaking down the 6,000ms latency:
- Python Interpreter Startup: ~2,000ms (initializing the runtime, loading core libraries)
- Module Loading (sqlite3, json): ~500ms
- Database Connection & SQL Query: ~500ms
- JSONL File Reading & Parsing: ~1,000ms
- Output Formatting: ~500ms
- Process Termination: ~500ms
Total: 5,000–7,000ms per query (the estimates above sum to 5,000ms, the low end of that range).
Worst of all, this process repeated for every single query. Even for 11 threads, each lookup restarted Python from scratch—no caching, no persistence, no optimization. The bottleneck was not the data itself, but the repeated initialization of a heavyweight runtime for lightweight, frequent tasks. This is a classic anti-pattern in CLI tooling: using a general-purpose language for high-frequency, low-complexity operations where native shell tools would suffice.
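One way to check whether interpreter start-up dominates on a given machine is to time an empty `python -c` invocation. The following is a minimal sketch, not the tool's own measurement code: the function name `time_cold_start` is hypothetical, and the numbers it prints will vary widely by machine, disk, and antivirus configuration.

```python
import subprocess
import sys
import time


def time_cold_start(code: str = "pass", runs: int = 3) -> float:
    """Return the average wall-clock seconds needed to run `python -c code`."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Each call spawns a brand-new interpreter, as an inline `python -c` script does.
        subprocess.run([sys.executable, "-c", code], check=True)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


if __name__ == "__main__":
    bare = time_cold_start("pass")
    with_imports = time_cold_start("import sqlite3, json")
    print(f"bare interpreter:    {bare * 1000:.0f} ms")
    print(f"with sqlite3 + json: {with_imports * 1000:.0f} ms")
```

Comparing the two figures separates pure interpreter start-up from module loading, which mirrors the first two items in the breakdown above.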
Iterative Optimization: From Basic Caching to a Two-Tier Architecture
The solution evolved in two key iterations, each addressing critical limitations of the last. The core insight guiding both phases: separate heavy, one-time work from light, frequent queries.
First Iteration: Precomputed JSON Cache
The initial fix focused on eliminating redundant database and JSONL parsing by precomputing a static cache. A Python script (build_cache.py) was created to:
- Connect to the SQLite database and fetch all thread metadata
- Iterate over each thread’s JSONL file and load full conversation messages
- Store all data in a single JSON file (thread_cache.json)
This script ran once, taking just 0.4 seconds to process all 11 threads and 2,515 messages. A second Python script (read_history.py) was then used to read this precomputed cache for subsequent queries.
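The article does not show build_cache.py itself, so the following is only a sketch of what such a script might look like. The `threads` table, its `id`, `title`, and `updated_at` columns, and the `<id>.jsonl` file naming are all assumptions made for illustration; the real schema and file layout may differ.

```python
import json
import sqlite3
from pathlib import Path


def build_cache(db_path: Path, sessions_dir: Path, cache_path: Path) -> int:
    """Consolidate thread metadata and messages into one JSON file.

    Returns the number of threads written.
    """
    # ASSUMPTION: a `threads` table with `id`, `title`, `updated_at` columns.
    conn = sqlite3.connect(str(db_path))
    try:
        rows = conn.execute("SELECT id, title, updated_at FROM threads").fetchall()
    finally:
        conn.close()

    threads = []
    for thread_id, title, updated_at in rows:
        # ASSUMPTION: each thread's messages live in `<id>.jsonl`.
        jsonl_path = sessions_dir / f"{thread_id}.jsonl"
        messages = []
        if jsonl_path.exists():
            with jsonl_path.open(encoding="utf-8") as fh:
                # JSON Lines: one JSON object per non-empty line.
                messages = [json.loads(line) for line in fh if line.strip()]
        threads.append(
            {"id": thread_id, "title": title, "updated_at": updated_at, "messages": messages}
        )

    # Write to a temporary file, then rename, so readers never see a half-written cache.
    tmp_path = cache_path.with_name(cache_path.name + ".tmp")
    tmp_path.write_text(json.dumps({"threads": threads}, ensure_ascii=False), encoding="utf-8")
    tmp_path.replace(cache_path)
    return len(threads)
```

Writing to a temporary file and renaming it is a design choice of this sketch rather than something the article describes; it keeps a concurrent reader from parsing a partially written cache.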
Result: Latency dropped from 6,000ms to ~3,000ms, a 2× improvement. While promising, this fix had a fatal flaw: restarting the machine or terminal reintroduced 2–3 seconds of latency, as reading the cache still required starting a Python interpreter. The cold-start problem was mitigated but not eliminated.
Final Solution: Two-Tier Caching Architecture
The breakthrough came from reimagining the workflow with a clear separation of concerns between two layers, each using the right tool for the job:
- Cache Builder Layer (Rare, Heavy Work): Python (retained for its robust SQLite/JSON support)
- Query Layer (Frequent, Light Work): PowerShell (native, zero-overhead shell tooling)
How the Two-Tier System Works
- Cache Construction (On-Demand, Rare):
  - The build_cache.py script runs only when the source data changes (e.g., new conversation created)
  - It reads SQLite metadata and JSONL content, then writes the consolidated thread_cache.json (0.4s runtime)
- Cache Querying (Frequent, Instant):
  - A PowerShell script (read_history.ps1) handles all user queries
  - It uses native cmdlets (Get-Content | ConvertFrom-Json) to parse the JSON cache without spawning any external processes
  - Zero interpreter startup overhead: query time drops to 88ms
- Automatic Cache Invalidation (Seamless UX):
  - The PowerShell script checks timestamps of the SQLite database and JSON cache on every query
  - If the database is newer (data changed), it automatically triggers build_cache.py to refresh the cache
  - Users experience no disruption; invalidations happen in the background
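The timestamp comparison behind the invalidation step can be sketched as follows. In the article this check lives in the PowerShell script; the Python version here only illustrates the logic, and the names `cache_is_stale` and `ensure_fresh_cache` are hypothetical.

```python
import os
from pathlib import Path
from typing import Callable


def cache_is_stale(db_path: Path, cache_path: Path) -> bool:
    """Return True when the cache is missing or older than the database."""
    if not cache_path.exists():
        return True
    # A database modified after the cache was written means the cache may be out of date.
    return os.path.getmtime(db_path) > os.path.getmtime(cache_path)


def ensure_fresh_cache(db_path: Path, cache_path: Path, rebuild: Callable[[], None]) -> bool:
    """Call `rebuild` only when the cache is stale; return whether it ran."""
    if cache_is_stale(db_path, cache_path):
        rebuild()
        return True
    return False
```

A check on the database file alone would miss changes that touch only the JSONL files, and a SQLite database in WAL mode can accumulate writes in its `-wal` file before the main file's timestamp changes. Whether either case matters depends on how the tool writes its data, which the article does not specify.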
This architecture leverages the strengths of each tool: Python for complex, one-time data processing; native shell tools for blistering-fast, frequent queries. The result is a system that eliminates cold-start overhead entirely for routine use cases.
Performance Benchmarks: Quantifying the Gains
The two-tier caching solution delivered transformative performance improvements across all key workflows, with results measured in controlled testing environments:
| Scenario | Before Optimization | First Iteration (Python Cache) | Final Solution (Two-Tier) | Total Improvement |
|---|---|---|---|---|
| List Conversation Threads | ~6,000ms | ~3,000ms | 88ms | 68× |
| View Specific Conversation | ~6,000ms | ~3,000ms | 135ms | 44× |
| Cache Rebuild (Data Change) | 6s × N Queries | Manual Trigger (Slow) | 0.4s (Automatic) | — |
These numbers speak for themselves: listing threads, which once took 6 seconds, now takes less than a tenth of a second. The 68× speedup eliminates user frustration and restores the tool’s usability for daily workflows. Viewing a specific conversation is slightly slower at 135ms, but the 44× improvement is still dramatic.
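The "Total Improvement" column is simply the original latency divided by the final latency. A short sketch reproducing that arithmetic from the figures in the table (the helper name `speedup` is introduced here for illustration):

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """Return how many times faster the new latency is than the old one."""
    return before_ms / after_ms


if __name__ == "__main__":
    final_latencies_ms = {
        "List Conversation Threads": 88,
        "View Specific Conversation": 135,
    }
    for scenario, after_ms in final_latencies_ms.items():
        print(f"{scenario}: {speedup(6000, after_ms):.1f}x")
    # List Conversation Threads: 68.2x
    # View Specific Conversation: 44.4x
```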
Key Design Principles for Universal Performance Optimization
This case study reveals three timeless principles applicable to any software tool facing latency issues—especially those involving CLI workflows, local data storage, and frequent user queries:
1. Precomputation Beats Real-Time Querying
Whenever possible, compute expensive work upfront and reuse results instead of recalculating them for every user request. Static caches, preprocessed data files, and batch jobs eliminate redundant computation and runtime overhead. In this case, precomputing the JSON cache avoided repeated SQLite/JSONL parsing for every query.
2. Use the Right Tool for the Right Job
A general-purpose language like Python is powerful but heavy, so avoid spawning it for high-frequency, lightweight tasks. Native shell tools (PowerShell, Bash) that are already running as the user’s shell add no extra interpreter start-up and excel at simple file I/O and text processing. Conversely, reserve Python for complex tasks like database integration, nested data parsing, and batch processing.
3. Automate Cache Invalidation
Caches only work if they stay fresh. Implement automatic invalidation logic to ensure caches update when source data changes—without requiring manual user intervention. Timestamp checks, file watchers, or event triggers keep caches consistent while maintaining a seamless user experience.
Conclusion
The latency issue in the AI CLI tool is a masterclass in identifying and resolving “hidden bottlenecks”—problems that have nothing to do with data size and everything to do with inefficient runtime choices. By diagnosing the Python cold-start overhead, iterating from a basic cache to a refined two-tier architecture, and adhering to core performance principles, the team achieved a 68× speedup, transforming a frustrating user experience into a seamless one.
This solution is not unique to AI CLI tools—it applies to any software that handles frequent queries on small-to-medium local datasets. The key takeaway: when facing slow queries, look beyond the data itself and examine the tooling and runtime choices powering your workflows. A small shift in architecture—separating heavy work from light work—can yield order-of-magnitude performance gains.
The same principle applies when moving from local CLI tools to API-driven AI applications. Latency often comes from repeated routing, authentication, retries, and unstable upstream services—not just model speed. An API gateway can help move that infrastructure work out of the hot path by centralizing access control, routing, monitoring, and scalability. In this context, 4sapi fits as a unified gateway for AI API access, helping teams manage model integration and keep request flows more predictable.




