Abstract
Claude.ai’s built-in web search has become one of the strongest search experiences among mainstream AI products. It is fast, accurate and citation-friendly. Many developers assume this performance comes from a proprietary search engine or a hidden internal retrieval system.
However, third-party technical analysis suggests that Claude.ai’s backend search is largely powered by Brave Search. The cited content in Claude’s answers overlaps with Brave’s top-ranked results at a rate of 86.7%. This indicates that Claude.ai’s search quality is not mainly the result of a single exclusive search engine. It comes from a set of well-designed engineering optimizations.
This article breaks the topic into two parts.
The first part analyzes five core optimizations behind Claude.ai’s web search system: inline rich snippets, server-side retrieval loops, code-based filtering, prompt caching and selective deep crawling. It also explains why server-side search tools are fundamentally different from client-side tool frameworks.
The second part presents a practical experiment. A self-built agent search pipeline is reconstructed using similar principles. The optimized system improves answer accuracy from 37% to 67%, narrowing the gap with Claude.ai from 43 percentage points to 13 percentage points.
The goal is not to claim that Claude.ai can be fully replicated. It cannot, because server-side execution has structural advantages. The goal is to show which parts of its search performance can be reproduced by developers building their own AI agents and retrieval systems.
Part One: Five Core Engineering Optimizations Behind Claude.ai Web Search
Before discussing individual optimizations, one architectural difference must be clear: Claude.ai uses a server-side web search tool, while many self-built AI systems use client-side tool calling.
This distinction changes everything.
In Anthropic’s server-side search architecture, the model can trigger web_search inside a single HTTP request. Query generation, result retrieval, query refinement and repeated search loops all run on Anthropic’s servers. The client does not need to manage multiple tool calls or return tool_result messages after each search.
A client-side framework works differently. Developers must write their own loop logic. The model asks for a tool call, the client executes the search, the client returns the result, and the model decides what to do next. Every search round requires a new network interaction.
This creates a major performance gap. Server-side search reduces latency, hides iteration complexity and keeps the retrieval process tightly integrated with the model. Client-side search gives developers more control, but it adds network round-trips and orchestration overhead.
Claude.ai’s performance comes from this architectural advantage, combined with five practical engineering optimizations.
1. Inline Rich Snippets: Reducing the Need for Page Crawling
The first optimization is the use of rich inline snippets.
When Claude.ai calls its server-side search tool, each search result includes a large amount of relevant text. In many cases, the returned record contains around 500 words of useful context. This allows the model to answer many questions without separately crawling the full page.
This design is very different from ordinary search APIs. A basic search result usually contains only a title, URL and short description. That is not enough for high-quality question answering. The system then has to fetch pages, parse HTML, remove boilerplate and extract relevant text. Each step adds latency and failure risk.
Brave Search provides richer result data. In addition to the standard description field, which is usually around 250 characters, Brave can return up to five extra_snippets for each result. These snippets are dynamically extracted from the search index according to the user query. Together, they provide much more usable text than a normal search abstract.
One key sign of this mechanism is that the snippet for the same URL can change when the query changes. That behavior is consistent with index-side snippet extraction.
In Claude’s API response structure, these inline snippets are carried through encrypted fields. The content appears in encrypted_content, while citation metadata corresponds to encrypted_index. External developers cannot decrypt or modify these fields. They can only forward them as part of the original response.
This encryption likely serves two purposes.
First, it prevents the API from being abused as a free crawler. Second, it helps respect third-party content rights. The model can use the retrieved context, but developers cannot freely extract and republish the raw snippet data.
Anthropic also uses different designs across different products. Claude.ai prioritizes speed and includes rich inline content. Claude Code, the command-line development tool, takes a more controlled approach. It also uses Brave Search, but its WebSearch tool mainly returns titles and URLs. Users must then call a separate WebFetch tool to crawl pages and extract content, usually with the help of a lightweight Haiku model.
This is a clear trade-off.
Claude.ai favors response speed. Claude Code favors operational control and security.
For developers building their own systems, the lesson is direct: rich snippets can remove many unnecessary page crawls. This is often the fastest way to improve search pipeline performance.
2. Server-Side Closed-Loop Retrieval: Multiple Iterations in One Request
The second optimization is server-side closed-loop retrieval.
Claude.ai does not stop after one search. When needed, the model can generate a query, inspect results, identify missing information, refine the query and search again. This loop runs inside Anthropic’s server environment.
A typical flow looks like this:
The user sees only the final result.
In ordinary cases, the request ends with end_turn. If the content becomes too long and the response is truncated, the system may return pause_turn. In that case, the client forwards the complete message back to continue the task.
This design avoids repeated public network round-trips. A server-side search task may require 7 to 11 internal retrieval rounds, but the client still makes only one request.
In a client-side framework, the same workflow is much slower. Developers need to implement a while loop. Each search call becomes a separate tool interaction. Each round requires another request, another response and another prompt preprocessing step.
This explains why caching is critical in client-side frameworks. Without caching, each tool round forces the model to reprocess large parts of the conversation. Server-side tools reduce this burden because the loop stays inside the provider’s infrastructure.
For self-built systems, full server-side replication is difficult. But developers can still reduce overhead by minimizing round-trips, batching retrieval steps and avoiding unnecessary page crawling.
3. Code-Based Filtering: Using Deterministic Scripts Instead of Extra LLM Calls
The third optimization is code-based filtering.
When Claude receives many search results, it does not always ask another large model to screen them. Instead, it can use programmatic tool calling. The model writes executable filtering logic and runs it in a sandbox environment.
This approach is faster and cheaper than using another LLM call.
The newer web_search_20260209 tool supports code execution. The older web_search_20250305 version does not. With code execution enabled, the model can load search results, apply deterministic filtering rules and keep only the most relevant entries.
The logic may be simple:
A script can complete this in milliseconds. An additional LLM-based screening step can take hundreds of milliseconds or several seconds. It also consumes more tokens.
This matters because many self-built search pipelines overuse LLMs. They call a model to summarize pages, call another model to rank results, and call another model to decide whether more search is needed. This creates high latency and high cost.
Deterministic filtering is often enough for early-stage ranking and cleanup. LLM judgment should be reserved for tasks that actually require reasoning.
There is one trade-off. Code execution does not satisfy Zero Data Retention (ZDR) requirements. Users who need ZDR must use the basic search tool instead of the code-execution version.
Still, the principle remains important: do not use LLM inference where simple code is enough.
4. Prompt Caching: Reducing Latency and Cost Through KV Cache Reuse
The fourth optimization is prompt caching.
Prompt caching is especially important for long multi-round search tasks. Anthropic’s official data indicates that prompt caching can reduce latency by up to 85% and reduce usage cost by 90%. For a 100,000-token prompt, response time can drop from 11.5 seconds to 2.4 seconds after caching is enabled.
The mechanism is based on the Transformer architecture. Once the Key and Value vectors of a token are computed, they do not change when later tokens are added. A KV cache stores these tensors and reuses them in later requests.
Anthropic requires developers to explicitly mark stable prefixes with:
These markers are usually placed after stable parts of the prompt, such as system instructions and tool definitions.
The system indexes cached content using hashes of token sequences. If the same prefix appears again, the model can reuse cached KV tensors instead of recomputing them.
The rules are strict.
Any change to the stable prefix can break the cache. Even a small change in characters, JSON key order, system prompt wording or tool definitions may cause a cache miss.
This means multi-round conversations should follow one rule:
If a system rewrites, compresses or reorganizes previous messages at every round, it destroys the cache. Many self-built agent frameworks make this mistake. They summarize history after every tool call, which seems efficient but prevents prefix reuse.
Research systems such as Prompt Cache, RadixAttention and KVFlow have shown that prefix caching is valuable for multi-agent and multi-round tool-use workflows.
Developers can check whether caching is working by looking at:
When the cache is hit, the number of non-cached tokens in later rounds can drop to single digits.
For client-side agent systems, caching is not optional. It is one of the main ways to keep long search workflows affordable and responsive.
5. Selective Deep Crawling: Avoiding Blind Page Fetching
The fifth optimization is selective deep crawling.
Claude.ai does not blindly crawl every search result. It uses a simple principle:
This is efficient because many factual questions can be answered from snippets alone. For simple queries, 1 to 3 search rounds with rich inline content are often enough.
Blind crawling creates several problems.
It increases latency. It expands context length. It consumes tokens. It may trigger anti-crawling systems. It also introduces noisy HTML content, ads, navigation text and irrelevant sections.
By setting a max_uses parameter, Claude can limit the number of search rounds. Developers can apply the same principle in their own systems. For latency-sensitive applications, setting max_uses to 3 may be enough. For research-heavy workflows, a higher value may be justified.
The key is to treat full-page crawling as an exception, not the default.
This design works especially well when combined with inline snippets. If search results already include strong query-relevant excerpts, most pages do not need to be fetched.
The Complete Claude.ai Search Workflow
When the five optimizations are combined, Claude.ai’s search pipeline can be summarized as follows:
- The model detects whether real-time information is needed.
- Stable prefixes, such as system prompts and tool definitions, hit the prompt cache.
- The server calls Brave Search.
- Each result returns rich encrypted inline snippets built from
descriptionandextra_snippets. - If results are too many, the model uses sandboxed code to filter them quickly.
- The model checks whether the inline content is sufficient.
- If information is incomplete, it refines the query and searches again within the server loop.
- Full-page crawling happens only when inline content cannot answer the question.
- Once enough evidence is collected, the model generates a cited answer and returns it to the client.
Claude Code uses a different workflow. It does not rely on inline snippets in the same way. After search, it often requires a separate WebFetch step to crawl and extract page content. This makes it more controllable, but usually slower.
Part Two: Experimental Report on Replicating a Claude-Level Search Pipeline
The five optimizations above are not only theoretical. They can be partially reproduced in a self-built agent system.
This section presents a controlled experiment. A traditional client-side search pipeline was rebuilt using Claude-inspired engineering principles. The goal was to measure how much performance improvement could be achieved without access to Anthropic’s internal server-side loop.
1. Experimental Setup
The experiment used the following settings:
The baseline results were:
This gave a clear target. The original system was far behind Claude.ai. The question was whether engineering changes alone could close much of the gap.
2. Problems in the Original Framework
The original self-built pipeline used a common pattern:
This approach had four major problems.
First, the search tool returned only short abstracts. It did not provide rich inline snippets. As a result, the system had to crawl pages almost every time.
Second, the pipeline automatically crawled the top three pages for each search. This was inefficient and vulnerable to anti-crawling rules.
Third, content extraction used fixed-length truncation. Each page was truncated to 6,000 characters. This often removed the key answer section while preserving irrelevant content.
Fourth, the framework rewrote and compressed historical messages in each round. This caused persistent prompt-cache failures. The system could run up to 8 execution rounds, but caching rarely helped.
These flaws are common in many self-built AI search systems. They add latency, increase token usage and reduce answer accuracy.
3. Optimized Architecture
The optimized system was rebuilt according to Claude-inspired principles.
| Module | Original Architecture | Optimized Architecture |
|---|---|---|
| Search Engine | Single Serper API with only abstracts | Brave Search as primary source, plus a 9-level fallback chain and 500-word inline snippets |
| Crawling Strategy | Automatically crawls top 3 pages | Prioritizes inline content; selectively crawls only the top URL when needed |
| Content Extraction | Fixed 6,000-character truncation | Exa highlights for embedding-based semantic extraction, without LLM calls |
| Context Management | Rewrites history summaries each round | Append-only message history with explicit prompt caching |
| Iteration Rules | Maximum 8 rounds | Maximum 16 rounds with network retry logic |
The main upgrades were:
- Use rich inline snippets to reduce crawling.
- Use semantic extraction instead of fixed truncation.
- Enable prompt caching for stable prefixes.
- Stop rewriting history during multi-round execution.
- Add fallback search sources when Brave returns insufficient results.
- Increase the iteration limit while keeping cost under control through caching.
These changes did not fully replicate Claude.ai’s server-side architecture. But they addressed most of the obvious weaknesses in the client-side pipeline.
4. Experimental Results
4.1 Overall Accuracy
After applying all optimizations, the system improved from:
The gap with Claude.ai dropped from 43 percentage points to 13 percentage points.
This means the optimized system captured most of the available improvement space. It did not match Claude.ai completely, but it moved much closer.
The result also shows that search quality depends heavily on engineering. The same underlying model can perform very differently depending on retrieval, extraction, caching and iteration design.
4.2 Latency and Iteration Analysis
The test questions were split into passed and failed groups.
For the 20 passed questions, the metrics were:
For the 10 failed questions, the metrics were:
The results show a clear pattern.
Successful tasks usually found enough information through inline snippets. They required fewer rounds and fewer searches. Failed tasks required more iterations and more crawling, but still did not succeed.
This suggests that many failures were not caused by a lack of search attempts. The bottleneck shifted to reasoning, judgment and query strategy.
For the 8 questions that the original framework already answered correctly, the optimized system did not introduce major latency penalties:
This is important. The new architecture improved difficult cases without significantly slowing down previously successful ones.
4.3 Impact of Each Optimization
The experiment also measured the effect of individual optimization points.
Inline Snippets
Across all searches, 78% completed information acquisition without additional crawling. This was the single most important reason for latency reduction.
Multi-Level Fallback Search
Brave Search was used directly 67 times. Fallback tools were triggered 11 times. This helped solve empty-result problems and improved robustness.
Semantic Extraction
All crawling tasks used Exa highlights for semantic extraction. Average response time was 613ms.
By comparison, LLM-based generation and extraction took 3,284ms on average.
Semantic extraction was therefore about 5.4 times faster.
Prompt Caching
After cache hits, later iteration rounds often had only single-digit non-cached token counts. This enabled stable operation across up to 16 rounds without excessive preprocessing cost.
This confirms that prompt caching is essential for long-running client-side agent workflows.
4.4 Remaining Failures
The 10 failed questions fell into three main categories.
The first and largest category was reasoning failure during multi-round analysis. The system retrieved relevant information but made the wrong judgment.
The second category involved niche content with insufficient index coverage. Search engines did not surface the needed information reliably.
The third category involved insufficient proactive search. The model stopped too early or failed to generate the right follow-up query.
This means the bottleneck changed. Before optimization, the main problem was search pipeline engineering. After optimization, the main limitation became the model’s reasoning and search strategy.
That is a meaningful improvement. It shows that the retrieval layer became good enough for deeper model limitations to become visible.
Practical Engineering Principles for Developers
The experiment leads to five practical rules for building high-performance AI search systems.
1. Prioritize Inline Content
Choose search engines or retrieval tools that provide rich snippets. If the system can answer from search results directly, it can avoid many slow and fragile crawl operations.
2. Use Retrieval-Based Extraction
Embedding-based semantic highlights are often faster and more stable than LLM-generated summaries. Use LLMs for reasoning, not for every extraction task.
3. Protect Prompt Caching
Do not rewrite stable prefixes. Do not reorder tool definitions. Do not compress history in every round. Use append-only message history whenever possible.
4. Avoid Fixed Truncation
Fixed character limits are crude. They often remove the exact passage that contains the answer. Extract content according to query relevance instead.
5. Understand Server-Side Advantages
A client-side framework cannot fully reproduce Anthropic’s server-side closed loop. But it can still reproduce much of the benefit through rich snippets, caching, semantic extraction and selective crawling.
The goal is not perfect replication. The goal is to remove avoidable inefficiencies.
Conclusion
Claude.ai’s web search performance is not the result of one secret technology. It comes from the combination of several mature engineering decisions.
The most important optimizations are:
- Rich inline search snippets
- Server-side closed-loop retrieval
- Deterministic code-based filtering
- Prompt caching through KV cache reuse
- Selective deep crawling only when necessary
The replication experiment confirms that these ideas are practical. After applying similar optimizations, a self-built search framework improved from 37% accuracy to 67% accuracy on a 30-question dataset. The gap with Claude.ai narrowed from 43 percentage points to 13 percentage points.
The optimized system also controlled latency effectively. Most successful searches relied on inline snippets and avoided crawling. Semantic extraction was 5.4 times faster than LLM-based extraction. Prompt caching allowed longer iteration chains without excessive overhead.
There is still a gap compared with Claude.ai. Server-side tool execution has structural advantages that client-side systems cannot fully match. But the experiment shows that much of the performance gap can be closed with better engineering.
For teams building AI search, research agents or tool-using assistants, the lesson is clear. Search quality depends not only on the model or the search engine. It also depends on retrieval design, extraction strategy, caching discipline and iteration control.
For teams that need to combine multiple large models and search-related services, 4sapi can serve as a professional API gateway. It provides access to mainstream models at more competitive prices than official channels and is compatible with common development frameworks. This helps developers reduce operating costs while deploying multi-model and multi-tool AI systems.
In the future, AI search competition will not be decided only by the underlying search engine. It will depend on the full engineering pipeline. Systems that combine strong retrieval, efficient caching, selective crawling and reliable tool coordination will deliver the best real-world performance.




