Claude.ai Web Search: 5 Engineering Optimizations

Abstract

Claude.ai’s built-in web search has become one of the strongest search experiences among mainstream AI products. It is fast, accurate and citation-friendly. Many developers assume this performance comes from a proprietary search engine or a hidden internal retrieval system.

However, third-party technical analysis suggests that Claude.ai’s backend search is largely powered by Brave Search. The cited content in Claude’s answers overlaps with Brave’s top-ranked results at a rate of 86.7%. This indicates that Claude.ai’s search quality is not mainly the result of a single exclusive search engine. It comes from a set of well-designed engineering optimizations.

This article breaks the topic into two parts.

The first part analyzes five core optimizations behind Claude.ai’s web search system: inline rich snippets, server-side retrieval loops, code-based filtering, prompt caching and selective deep crawling. It also explains why server-side search tools are fundamentally different from client-side tool frameworks.

The second part presents a practical experiment. A self-built agent search pipeline is reconstructed using similar principles. The optimized system improves answer accuracy from 37% to 67%, narrowing the gap with Claude.ai from 43 percentage points to 13 percentage points.

The goal is not to claim that Claude.ai can be fully replicated. It cannot, because server-side execution has structural advantages. The goal is to show which parts of its search performance can be reproduced by developers building their own AI agents and retrieval systems.

Part One: Five Core Engineering Optimizations Behind Claude.ai Web Search

Before discussing individual optimizations, one architectural difference must be clear: Claude.ai uses a server-side web search tool, while many self-built AI systems use client-side tool calling.

This distinction changes everything.

In Anthropic’s server-side search architecture, the model can trigger web_search inside a single HTTP request. Query generation, result retrieval, query refinement and repeated search loops all run on Anthropic’s servers. The client does not need to manage multiple tool calls or return tool_result messages after each search.

A client-side framework works differently. Developers must write their own loop logic. The model asks for a tool call, the client executes the search, the client returns the result, and the model decides what to do next. Every search round requires a new network interaction.

This creates a major performance gap. Server-side search reduces latency, hides iteration complexity and keeps the retrieval process tightly integrated with the model. Client-side search gives developers more control, but it adds network round-trips and orchestration overhead.

Claude.ai’s performance comes from this architectural advantage, combined with five practical engineering optimizations.

1. Inline Rich Snippets: Reducing the Need for Page Crawling

The first optimization is the use of rich inline snippets.

When Claude.ai calls its server-side search tool, each search result includes a large amount of relevant text. In many cases, the returned record contains around 500 words of useful context. This allows the model to answer many questions without separately crawling the full page.

This design is very different from ordinary search APIs. A basic search result usually contains only a title, URL and short description. That is not enough for high-quality question answering. The system then has to fetch pages, parse HTML, remove boilerplate and extract relevant text. Each step adds latency and failure risk.

Brave Search provides richer result data. In addition to the standard description field, which is usually around 250 characters, Brave can return up to five extra_snippets for each result. These snippets are dynamically extracted from the search index according to the user query. Together, they provide much more usable text than a normal search abstract.

One key sign of this mechanism is that the snippet for the same URL can change when the query changes. That behavior is consistent with index-side snippet extraction.

In Claude’s API response structure, these inline snippets are carried through encrypted fields. The content appears in encrypted_content, while citation metadata corresponds to encrypted_index. External developers cannot decrypt or modify these fields. They can only forward them as part of the original response.

This encryption likely serves two purposes.

First, it prevents the API from being abused as a free crawler. Second, it helps respect third-party content rights. The model can use the retrieved context, but developers cannot freely extract and republish the raw snippet data.

Anthropic also uses different designs across different products. Claude.ai prioritizes speed and includes rich inline content. Claude Code, the command-line development tool, takes a more controlled approach. It also uses Brave Search, but its WebSearch tool mainly returns titles and URLs. Users must then call a separate WebFetch tool to crawl pages and extract content, usually with the help of a lightweight Haiku model.

This is a clear trade-off.

Claude.ai favors response speed. Claude Code favors operational control and security.

For developers building their own systems, the lesson is direct: rich snippets can remove many unnecessary page crawls. This is often the fastest way to improve search pipeline performance.

2. Server-Side Closed-Loop Retrieval: Multiple Iterations in One Request

The second optimization is server-side closed-loop retrieval.

Claude.ai does not stop after one search. When needed, the model can generate a query, inspect results, identify missing information, refine the query and search again. This loop runs inside Anthropic’s server environment.

A typical flow looks like this:

text

User query
→ Model decides whether search is needed
→ Model generates search query
→ Server calls Brave Search
→ Model reads results
→ Model checks whether information is sufficient
→ Model refines query if needed
→ Server searches again
→ Model produces final answer with citations

The user sees only the final result.

In ordinary cases, the request ends with end_turn. If the content becomes too long and the response is truncated, the system may return pause_turn. In that case, the client forwards the complete message back to continue the task.

This design avoids repeated public network round-trips. A server-side search task may require 7 to 11 internal retrieval rounds, but the client still makes only one request.

In a client-side framework, the same workflow is much slower. Developers need to implement a while loop. Each search call becomes a separate tool interaction. Each round requires another request, another response and another prompt preprocessing step.

This explains why caching is critical in client-side frameworks. Without caching, each tool round forces the model to reprocess large parts of the conversation. Server-side tools reduce this burden because the loop stays inside the provider’s infrastructure.

For self-built systems, full server-side replication is difficult. But developers can still reduce overhead by minimizing round-trips, batching retrieval steps and avoiding unnecessary page crawling.

3. Code-Based Filtering: Using Deterministic Scripts Instead of Extra LLM Calls

The third optimization is code-based filtering.

When Claude receives many search results, it does not always ask another large model to screen them. Instead, it can use programmatic tool calling. The model writes executable filtering logic and runs it in a sandbox environment.

This approach is faster and cheaper than using another LLM call.

The newer web_search_20260209 tool supports code execution. The older web_search_20250305 version does not. With code execution enabled, the model can load search results, apply deterministic filtering rules and keep only the most relevant entries.

The logic may be simple:

text

Load search results
→ Match keywords or entities
→ Filter by source, date or title
→ Rank candidates
→ Keep top results

A script can complete this in milliseconds. An additional LLM-based screening step can take hundreds of milliseconds or several seconds. It also consumes more tokens.

This matters because many self-built search pipelines overuse LLMs. They call a model to summarize pages, call another model to rank results, and call another model to decide whether more search is needed. This creates high latency and high cost.

Deterministic filtering is often enough for early-stage ranking and cleanup. LLM judgment should be reserved for tasks that actually require reasoning.

There is one trade-off. Code execution does not satisfy Zero Data Retention (ZDR) requirements. Users who need ZDR must use the basic search tool instead of the code-execution version.

Still, the principle remains important: do not use LLM inference where simple code is enough.

4. Prompt Caching: Reducing Latency and Cost Through KV Cache Reuse

The fourth optimization is prompt caching.

Prompt caching is especially important for long multi-round search tasks. Anthropic’s official data indicates that prompt caching can reduce latency by up to 85% and reduce usage cost by 90%. For a 100,000-token prompt, response time can drop from 11.5 seconds to 2.4 seconds after caching is enabled.

The mechanism is based on the Transformer architecture. Once the Key and Value vectors of a token are computed, they do not change when later tokens are added. A KV cache stores these tensors and reuses them in later requests.

Anthropic requires developers to explicitly mark stable prefixes with:

json

"cache_control": {"type": "ephemeral"}

These markers are usually placed after stable parts of the prompt, such as system instructions and tool definitions.

The system indexes cached content using hashes of token sequences. If the same prefix appears again, the model can reuse cached KV tensors instead of recomputing them.

The rules are strict.

Any change to the stable prefix can break the cache. Even a small change in characters, JSON key order, system prompt wording or tool definitions may cause a cache miss.

This means multi-round conversations should follow one rule:

text

Append new messages only. Do not rewrite historical content.

If a system rewrites, compresses or reorganizes previous messages at every round, it destroys the cache. Many self-built agent frameworks make this mistake. They summarize history after every tool call, which seems efficient but prevents prefix reuse.

Research systems such as Prompt Cache, RadixAttention and KVFlow have shown that prefix caching is valuable for multi-agent and multi-round tool-use workflows.

Developers can check whether caching is working by looking at:

text

cache_creation_input_tokens
cache_read_input_tokens

When the cache is hit, the number of non-cached tokens in later rounds can drop to single digits.

For client-side agent systems, caching is not optional. It is one of the main ways to keep long search workflows affordable and responsive.

5. Selective Deep Crawling: Avoiding Blind Page Fetching

The fifth optimization is selective deep crawling.

Claude.ai does not blindly crawl every search result. It uses a simple principle:

text

Use inline snippets first. Crawl full pages only when necessary.

This is efficient because many factual questions can be answered from snippets alone. For simple queries, 1 to 3 search rounds with rich inline content are often enough.

Blind crawling creates several problems.

It increases latency. It expands context length. It consumes tokens. It may trigger anti-crawling systems. It also introduces noisy HTML content, ads, navigation text and irrelevant sections.

By setting a max_uses parameter, Claude can limit the number of search rounds. Developers can apply the same principle in their own systems. For latency-sensitive applications, setting max_uses to 3 may be enough. For research-heavy workflows, a higher value may be justified.

The key is to treat full-page crawling as an exception, not the default.

This design works especially well when combined with inline snippets. If search results already include strong query-relevant excerpts, most pages do not need to be fetched.

The Complete Claude.ai Search Workflow

When the five optimizations are combined, Claude.ai’s search pipeline can be summarized as follows:

The model detects whether real-time information is needed.
Stable prefixes, such as system prompts and tool definitions, hit the prompt cache.
The server calls Brave Search.
Each result returns rich encrypted inline snippets built from description and extra_snippets.
If results are too many, the model uses sandboxed code to filter them quickly.
The model checks whether the inline content is sufficient.
If information is incomplete, it refines the query and searches again within the server loop.
Full-page crawling happens only when inline content cannot answer the question.
Once enough evidence is collected, the model generates a cited answer and returns it to the client.

Claude Code uses a different workflow. It does not rely on inline snippets in the same way. After search, it often requires a separate WebFetch step to crawl and extract page content. This makes it more controllable, but usually slower.

Part Two: Experimental Report on Replicating a Claude-Level Search Pipeline

The five optimizations above are not only theoretical. They can be partially reproduced in a self-built agent system.

This section presents a controlled experiment. A traditional client-side search pipeline was rebuilt using Claude-inspired engineering principles. The goal was to measure how much performance improvement could be achieved without access to Anthropic’s internal server-side loop.

1. Experimental Setup

The experiment used the following settings:

text

Experimental system:
A self-built agent research framework

Control model:
The same underlying model family used by Claude.ai in the comparison group

Test dataset:
30 questions selected from GAIA and FRAMES

Question type:
Questions that standard models cannot answer without web search

Evaluation metric:
Answer accuracy, judged by whether the final answer contains the standard answer
Case-insensitive matching was used

The baseline results were:

text

Standalone model without search: 0/30 correct
Original self-built framework: 11/30 correct, 37% accuracy
Official Claude.ai: 24/30 correct, 80% accuracy

This gave a clear target. The original system was far behind Claude.ai. The question was whether engineering changes alone could close much of the gap.

2. Problems in the Original Framework

The original self-built pipeline used a common pattern:

text

Search → Crawl → Truncate → Answer

This approach had four major problems.

First, the search tool returned only short abstracts. It did not provide rich inline snippets. As a result, the system had to crawl pages almost every time.

Second, the pipeline automatically crawled the top three pages for each search. This was inefficient and vulnerable to anti-crawling rules.

Third, content extraction used fixed-length truncation. Each page was truncated to 6,000 characters. This often removed the key answer section while preserving irrelevant content.

Fourth, the framework rewrote and compressed historical messages in each round. This caused persistent prompt-cache failures. The system could run up to 8 execution rounds, but caching rarely helped.

These flaws are common in many self-built AI search systems. They add latency, increase token usage and reduce answer accuracy.

3. Optimized Architecture

The optimized system was rebuilt according to Claude-inspired principles.

Module	Original Architecture	Optimized Architecture
Search Engine	Single Serper API with only abstracts	Brave Search as primary source, plus a 9-level fallback chain and 500-word inline snippets
Crawling Strategy	Automatically crawls top 3 pages	Prioritizes inline content; selectively crawls only the top URL when needed
Content Extraction	Fixed 6,000-character truncation	Exa highlights for embedding-based semantic extraction, without LLM calls
Context Management	Rewrites history summaries each round	Append-only message history with explicit prompt caching
Iteration Rules	Maximum 8 rounds	Maximum 16 rounds with network retry logic

The main upgrades were:

Use rich inline snippets to reduce crawling.
Use semantic extraction instead of fixed truncation.
Enable prompt caching for stable prefixes.
Stop rewriting history during multi-round execution.
Add fallback search sources when Brave returns insufficient results.
Increase the iteration limit while keeping cost under control through caching.

These changes did not fully replicate Claude.ai’s server-side architecture. But they addressed most of the obvious weaknesses in the client-side pipeline.

4. Experimental Results

4.1 Overall Accuracy

After applying all optimizations, the system improved from:

text

Original self-built framework: 11/30 correct, 37%
Optimized framework: 20/30 correct, 67%
Official Claude.ai: 24/30 correct, 80%

The gap with Claude.ai dropped from 43 percentage points to 13 percentage points.

This means the optimized system captured most of the available improvement space. It did not match Claude.ai completely, but it moved much closer.

The result also shows that search quality depends heavily on engineering. The same underlying model can perform very differently depending on retrieval, extraction, caching and iteration design.

4.2 Latency and Iteration Analysis

The test questions were split into passed and failed groups.

For the 20 passed questions, the metrics were:

text

Average time: 54 seconds
Median time: 35 seconds
Average rounds: 4.4
Average searches: 2.2
Searches answered by inline snippets only: 89%

For the 10 failed questions, the metrics were:

text

Average time: 93 seconds
Median time: 40 seconds
Average rounds: 8.0
Average searches: 3.4
Searches answered by inline content only: 65%

The results show a clear pattern.

Successful tasks usually found enough information through inline snippets. They required fewer rounds and fewer searches. Failed tasks required more iterations and more crawling, but still did not succeed.

This suggests that many failures were not caused by a lack of search attempts. The bottleneck shifted to reasoning, judgment and query strategy.

For the 8 questions that the original framework already answered correctly, the optimized system did not introduce major latency penalties:

text

Original average time: 31 seconds
Optimized average time: 36 seconds

Original median time: 34.5 seconds
Optimized median time: 32 seconds

This is important. The new architecture improved difficult cases without significantly slowing down previously successful ones.

4.3 Impact of Each Optimization

The experiment also measured the effect of individual optimization points.

Inline Snippets

Across all searches, 78% completed information acquisition without additional crawling. This was the single most important reason for latency reduction.

Multi-Level Fallback Search

Brave Search was used directly 67 times. Fallback tools were triggered 11 times. This helped solve empty-result problems and improved robustness.

Semantic Extraction

All crawling tasks used Exa highlights for semantic extraction. Average response time was 613ms.

By comparison, LLM-based generation and extraction took 3,284ms on average.

Semantic extraction was therefore about 5.4 times faster.

Prompt Caching

After cache hits, later iteration rounds often had only single-digit non-cached token counts. This enabled stable operation across up to 16 rounds without excessive preprocessing cost.

This confirms that prompt caching is essential for long-running client-side agent workflows.

4.4 Remaining Failures

The 10 failed questions fell into three main categories.

The first and largest category was reasoning failure during multi-round analysis. The system retrieved relevant information but made the wrong judgment.

The second category involved niche content with insufficient index coverage. Search engines did not surface the needed information reliably.

The third category involved insufficient proactive search. The model stopped too early or failed to generate the right follow-up query.

This means the bottleneck changed. Before optimization, the main problem was search pipeline engineering. After optimization, the main limitation became the model’s reasoning and search strategy.

That is a meaningful improvement. It shows that the retrieval layer became good enough for deeper model limitations to become visible.

Practical Engineering Principles for Developers

The experiment leads to five practical rules for building high-performance AI search systems.

1. Prioritize Inline Content

Choose search engines or retrieval tools that provide rich snippets. If the system can answer from search results directly, it can avoid many slow and fragile crawl operations.

2. Use Retrieval-Based Extraction

Embedding-based semantic highlights are often faster and more stable than LLM-generated summaries. Use LLMs for reasoning, not for every extraction task.

3. Protect Prompt Caching

Do not rewrite stable prefixes. Do not reorder tool definitions. Do not compress history in every round. Use append-only message history whenever possible.

4. Avoid Fixed Truncation

Fixed character limits are crude. They often remove the exact passage that contains the answer. Extract content according to query relevance instead.

5. Understand Server-Side Advantages

A client-side framework cannot fully reproduce Anthropic’s server-side closed loop. But it can still reproduce much of the benefit through rich snippets, caching, semantic extraction and selective crawling.

The goal is not perfect replication. The goal is to remove avoidable inefficiencies.

Conclusion

Claude.ai’s web search performance is not the result of one secret technology. It comes from the combination of several mature engineering decisions.

The most important optimizations are:

Rich inline search snippets
Server-side closed-loop retrieval
Deterministic code-based filtering
Prompt caching through KV cache reuse
Selective deep crawling only when necessary

The replication experiment confirms that these ideas are practical. After applying similar optimizations, a self-built search framework improved from 37% accuracy to 67% accuracy on a 30-question dataset. The gap with Claude.ai narrowed from 43 percentage points to 13 percentage points.

The optimized system also controlled latency effectively. Most successful searches relied on inline snippets and avoided crawling. Semantic extraction was 5.4 times faster than LLM-based extraction. Prompt caching allowed longer iteration chains without excessive overhead.

There is still a gap compared with Claude.ai. Server-side tool execution has structural advantages that client-side systems cannot fully match. But the experiment shows that much of the performance gap can be closed with better engineering.

For teams building AI search, research agents or tool-using assistants, the lesson is clear. Search quality depends not only on the model or the search engine. It also depends on retrieval design, extraction strategy, caching discipline and iteration control.

For teams that need to combine multiple large models and search-related services, 4sapi can serve as a professional API gateway. It provides access to mainstream models at more competitive prices than official channels and is compatible with common development frameworks. This helps developers reduce operating costs while deploying multi-model and multi-tool AI systems.

In the future, AI search competition will not be decided only by the underlying search engine. It will depend on the full engineering pipeline. Systems that combine strong retrieval, efficient caching, selective crawling and reliable tool coordination will deliver the best real-world performance.

Claude.ai Web Search: 5 Engineering Optimizations

Abstract

Part One: Five Core Engineering Optimizations Behind Claude.ai Web Search

1. Inline Rich Snippets: Reducing the Need for Page Crawling

2. Server-Side Closed-Loop Retrieval: Multiple Iterations in One Request

3. Code-Based Filtering: Using Deterministic Scripts Instead of Extra LLM Calls

4. Prompt Caching: Reducing Latency and Cost Through KV Cache Reuse

5. Selective Deep Crawling: Avoiding Blind Page Fetching

The Complete Claude.ai Search Workflow

Part Two: Experimental Report on Replicating a Claude-Level Search Pipeline

1. Experimental Setup

2. Problems in the Original Framework

3. Optimized Architecture

4. Experimental Results

4.1 Overall Accuracy

4.2 Latency and Iteration Analysis

4.3 Impact of Each Optimization

Inline Snippets

Multi-Level Fallback Search

Semantic Extraction

Prompt Caching

4.4 Remaining Failures

Practical Engineering Principles for Developers

1. Prioritize Inline Content

2. Use Retrieval-Based Extraction

3. Protect Prompt Caching

4. Avoid Fixed Truncation

5. Understand Server-Side Advantages

Conclusion

Recommended reading

ZCode Kimi Error Fix: max_tokens Exceeds 32768

LLM API Gateway Backup Routing: Build Failover Systems

Claude Fable 5 vs Sonnet 5: Technical Deployment Guide

Domestic AI Coding Agents: ZCode, Kimi Work and MiMo Code