Gemini 3.5 Long-Context Summary Test: 100K Words

Large context windows have become a major selling point for modern large language models. Many users assume that if a model can load 100,000 words at once, it should also be able to summarize the full document accurately.

In practice, this is not always true.

A large context window means the model can receive more input. It does not guarantee that the model will pay equal attention to every part of the document. Long-text summarization still faces several common issues, such as information omission, weak retention of middle sections, unstable outputs, and incomplete structure restoration.

This article evaluates Gemini 3.5 using a document of about 100,000 Chinese words. The test compares two common summarization strategies: one-shot summarization and segmented summarization with final aggregation.

The results are based on practical observations. They are not strict academic benchmarks. Model performance may vary depending on document type, prompt design, language style, and model updates. Developers and teams should test with their own business materials before making production decisions.

For teams that need to compare multiple long-context models in batches, an API gateway such as 4sapi can help standardize model calls and simplify cross-model evaluation.

1. Test Preparation and Evaluation Framework

1.1 Test Environment and Platform

Long-document testing can easily produce biased results if the environment is unstable or if different models use inconsistent request settings.

For this test, 4sapi was used as the unified evaluation platform. The platform integrates several mainstream models, including Gemini 3.5, GPT-5.5, Claude 4.8, and DeepSeek. It also supports unified parameter configuration, billing rules, and preprocessing logic.

Using one platform helps keep the testing process consistent. The same document and prompts can be applied across different models, making horizontal comparison easier.
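
One way to keep request settings consistent is to define them once and reuse them for every model. The sketch below is only an illustration of that idea: it does not reflect 4sapi's actual API, and the `RunConfig` fields, the temperature value, and the `build_payloads` helper are all assumed names and values.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunConfig:
    """Settings shared by every model in the comparison (hypothetical fields)."""
    temperature: float = 0.2          # assumed value, not taken from the test
    max_output_words: int = 500       # matches the 500-word limit used in this article
    prompt: str = "Summarize the full document. Cover every chapter."

# Model labels as named in this article; real API identifiers may differ.
MODELS = ["Gemini 3.5", "GPT-5.5", "Claude 4.8", "DeepSeek"]

def build_payloads(document: str, config: RunConfig, models=MODELS) -> list[dict]:
    """Return one payload per model, identical except for the model label."""
    shared = asdict(config)
    return [{"model": m, "document": document, **shared} for m in models]

payloads = build_payloads("...document text...", RunConfig())
```

Because the configuration is frozen and shared, any difference between outputs can be attributed to the model rather than to a stray parameter.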

The test document contains about 100,000 Chinese words. It includes explanatory content, practical cases, and statistical data. This structure is close to many real office and technical documents.

This document size only represents the scale of this test. It does not represent the maximum context capacity of Gemini 3.5. Any official context window limits should be checked against Google’s latest public documentation.

1.2 Five Evaluation Dimensions

To avoid relying only on subjective impressions, the test uses five evaluation dimensions.

Information Coverage: Whether the summary retains the core arguments, key data, and main conclusions of the full document.
Middle Content Retention: Whether the model preserves information from the middle sections, instead of focusing mostly on the beginning and end.
Key Point Accuracy: Whether the model correctly distinguishes important information from secondary details.
Structure Restoration: Whether the summary reflects the original document’s logic, hierarchy, and narrative structure.
Result Stability: Whether repeated runs produce consistent summaries, or whether omissions appear randomly.

Among these dimensions, middle content retention is especially important. Many long-context models can load long documents, but their attention is not always evenly distributed. Content in the middle is more likely to be compressed, simplified, or missed.

This issue is easy to overlook if only one test is performed.
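
Middle content retention can be made less subjective with a rough position-based check. The sketch below assumes a reviewer has picked a few key phrases and noted where each one sits in the source as a fraction between 0 and 1. The function name, the split into thirds, and the plain substring matching are all simplifying assumptions, not the method used in this test.

```python
def retention_by_region(key_facts, summary):
    """key_facts: list of (position, phrase); position in [0, 1) is the phrase's
    relative location in the source. Returns, for each third of the document,
    the fraction of its phrases found in the summary (None if it has no phrases)."""
    regions = {"beginning": [], "middle": [], "end": []}
    text = summary.lower()
    for position, phrase in key_facts:
        if position < 1 / 3:
            name = "beginning"
        elif position < 2 / 3:
            name = "middle"
        else:
            name = "end"
        regions[name].append(phrase.lower() in text)
    return {
        name: (sum(hits) / len(hits) if hits else None)
        for name, hits in regions.items()
    }
```

A noticeably lower value for the middle region across several runs would support the impression that middle sections are being dropped. Substring matching misses paraphrases, so the result is a lower bound rather than a precise score.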

2. Test One: One-Shot Summarization

2.1 Test Setup

In the first test, the complete 100,000-word document was sent to Gemini 3.5 in one request.

The prompt required the model to generate a summary of no more than 500 words. The summary needed to cover core arguments, key data, and major conclusions.

This is the simplest and most common approach. Many users prefer it because it requires only one model call and almost no manual preprocessing.

2.2 Scores and Analysis

The test used a 1-to-5 scoring scale with half-point increments. A higher score means better performance.

Evaluation Dimension	Score	Explanation
Information Coverage	4	Most core ideas were captured, but some marginal details were missing.
Middle Content Retention	3.5	The beginning and ending were summarized well, but some middle sections were weakened.
Key Point Accuracy	4	Most key points were identified correctly, though some secondary details were overemphasized.
Structure Restoration	4	The main logical structure of the original document was preserved.
Result Stability	3.5	Repeated runs showed different omissions in different positions.
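
The article does not report an overall score, but the table can be aggregated in a few lines. The snippet below uses an unweighted mean, which is an assumption: a team that cares most about middle retention might weight that dimension more heavily.

```python
from statistics import mean

# Scores copied from the one-shot table above.
one_shot_scores = {
    "Information Coverage": 4,
    "Middle Content Retention": 3.5,
    "Key Point Accuracy": 4,
    "Structure Restoration": 4,
    "Result Stability": 3.5,
}

average = mean(list(one_shot_scores.values()))
lowest = min(one_shot_scores.values())
weakest = sorted(name for name, score in one_shot_scores.items() if score == lowest)

print(average)  # 3.8
print(weakest)  # ['Middle Content Retention', 'Result Stability']
```

The two weakest dimensions are the ones most closely tied to position in the document and to run-to-run variation.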

2.3 Summary of One-Shot Mode

One-shot summarization is simple and efficient. It works well for ordinary documents where perfect coverage is not required.

However, this mode exposes a common weakness of large-context models: the model retains the beginning and ending more reliably, while the middle sections are more likely to be compressed or omitted.

The omissions are also inconsistent. In repeated tests, the missing content appeared in different places, so a single review pass cannot predict what the next run will drop. This makes quality control harder.

One-shot summarization is suitable for:

Quick reading
Rough document understanding
Low-risk internal reference
Draft-level summaries
Documents where complete coverage is not critical

It is not ideal for important reports, legal materials, compliance documents, or data-heavy content.

3. Test Two: Segmented Summarization plus Aggregation

To reduce middle-content omission, the second test used a divide-and-conquer strategy.

Instead of feeding the entire document into the model at once, the document was split into logical sections. Each section was summarized separately. Then, all section summaries were combined and summarized again into a final full-document summary.

3.1 Operation Steps

The segmented workflow included three steps:

Split the 100,000-word document by chapters and logical sections.
Generate an independent summary for each section.
Combine all section summaries and ask Gemini 3.5 to generate a final integrated summary.

This method requires more manual work and more API calls. However, it gives each part of the original document more attention.
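
The three steps can be sketched as a small function. This is a minimal outline under stated assumptions: `summarize` is a hypothetical callable that sends a prompt to whichever model is being tested and returns its text reply, and the prompt wording is illustrative rather than the exact wording used in this test.

```python
from typing import Callable

def summarize_segmented(sections: list[str], summarize: Callable[[str], str]) -> str:
    """Summarize each section on its own, then summarize the combined section summaries."""
    # Step 2: one independent call per section.
    section_summaries = [
        summarize(f"Summarize this section:\n\n{text}") for text in sections
    ]
    # Step 3: label each partial summary so the final call can see the section order.
    combined = "\n\n".join(
        f"Section {i} summary:\n{s}"
        for i, s in enumerate(section_summaries, start=1)
    )
    return summarize(
        "Combine the section summaries below into one summary of the full document. "
        "Give every section balanced coverage.\n\n" + combined
    )
```

Passing the model call in as a parameter keeps the workflow independent of any specific SDK and makes it easy to test with a stub. For N sections the function makes N + 1 calls.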

3.2 Comparison with One-Shot Summarization

Evaluation Dimension	One-Shot Summarization	Segmented Summarization + Aggregation
Information Coverage	4	4.5
Middle Content Retention	3.5	4.5
Key Point Accuracy	4	4.5
Labor and API Call Cost	Low	Higher

3.3 Summary of Segmented Mode

Segmented summarization performs better in information coverage and middle-section retention.

Because each segment is processed independently, the model is less likely to ignore content buried in the middle of a long document. The final summary is also more complete and balanced.

The trade-off is cost.

Users need to split the document manually or build a preprocessing script. The number of model calls also increases. This means higher token consumption, longer processing time, and more workflow complexity.
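
A preprocessing script for the splitting step can be quite short. The sketch below assumes top-level headings look like "1. Title" at the start of a line. That pattern is only an example, and `heading_pattern` should be adapted to the heading style of the actual document.

```python
import re

def split_by_headings(text: str, heading_pattern: str = r"^\d+\.\s+\S") -> list[str]:
    """Split text into sections, starting a new section at each line that
    matches heading_pattern. Text before the first heading becomes its own section."""
    sections, current = [], []
    for line in text.splitlines():
        if re.match(heading_pattern, line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]
```

With the default pattern, a line such as "1.1 Subsection" does not start a new section, so subsections stay with their parent chapter. Very long chapters may still need a further size-based split.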

Segmented summarization is suitable for:

Important business reports
Legal or policy documents
Research papers
Technical specifications
Long meeting transcripts
Documents with key information distributed across many sections

For high-value documents, this method is usually more reliable than one-shot summarization.

4. Key Findings from the Test

Based on the two tests, three conclusions stand out.

4.1 Context Capacity Does Not Equal Full Understanding

A large context window allows the model to load more content. But it does not guarantee that every part of the document receives equal attention.

In this test, Gemini 3.5 showed stronger retention of the beginning and ending sections. Middle sections were more likely to be simplified or missed.

This is not unique to Gemini. Many large-context LLMs show the same pattern, which is often described as the “lost in the middle” effect.

The real challenge of long-document processing is not only whether the model can load the document. It is whether the model can extract and preserve information evenly across the full text.

4.2 Summary Quality and Cost Must Be Balanced

One-shot summarization is fast and cheap. It is useful when users need a quick overview.

Segmented summarization is more complete. It performs better when the document is important or when missing information could create business risk.

The right strategy depends on the use case.

For casual reading, one-shot summarization may be enough. For formal reports, compliance review, or business-critical analysis, segmented summarization is safer.
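
The cost side of this trade-off can be estimated before running anything. The sketch below is a back-of-the-envelope model with assumed inputs: it ignores prompt overhead and treats every summary, partial or final, as roughly `summary_tokens` long. The numbers in the usage line are illustrative and were not measured in this test.

```python
def estimate_cost(doc_tokens: int, n_sections: int, summary_tokens: int) -> dict:
    """Compare call count and token volume for the two strategies."""
    one_shot = {
        "calls": 1,
        "input_tokens": doc_tokens,
        "output_tokens": summary_tokens,
    }
    segmented = {
        # One call per section, plus the final aggregation call.
        "calls": n_sections + 1,
        # The document is read once; the section summaries are read again when aggregating.
        "input_tokens": doc_tokens + n_sections * summary_tokens,
        "output_tokens": (n_sections + 1) * summary_tokens,
    }
    return {"one_shot": one_shot, "segmented": segmented}

estimate = estimate_cost(doc_tokens=150_000, n_sections=20, summary_tokens=700)
```

Under these assumptions the input volume grows only modestly, since the document is still read once. Most of the added cost comes from the extra calls, the extra output tokens, and the longer end-to-end processing time.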

4.3 Manual Review Is Still Required

AI-generated summaries should not be treated as final versions when accuracy matters.

Even when the summary is fluent and logically structured, it may still miss key data or misjudge priorities. Missing information is especially hard to detect because it simply does not appear in the output.

For important documents, users should always compare the summary with the original text.

Manual review is especially necessary for:

Key figures
Formal conclusions
Compliance statements
Contract terms
Research findings
Policy interpretations
Financial or legal content

AI can reduce summarization workload, but it cannot replace final verification.

5. Practical Checklist for Long-Text Summarization

Based on the test results, the following checklist can help users process long documents more reliably with Gemini 3.5 and other long-context models.

5.1 Choose the Strategy Based on Document Importance

Use segmented summarization for high-value documents. This improves information coverage and reduces the risk of missing middle content.

Use one-shot summarization for ordinary reference documents where speed matters more than completeness.

5.2 Optimize the Layout of Key Information

Do not place critical conclusions or key data only in the middle of a document.

If possible, repeat important findings in the introduction, section summaries, and conclusion. This reduces the risk of omission during summarization.

5.3 Improve Prompt Design

Prompts should explicitly ask the model to cover all sections.

For example:

Summarize the full document. Cover every chapter and pay special attention to the middle sections. Do not focus only on the beginning and conclusion.

This does not completely solve the attention problem, but it can reduce the risk.
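
A slightly stronger variant names every section in the prompt so that a skipped section is easier to spot in the output. The helper below is a hypothetical sketch: the function name and wording are assumptions, and the section titles would come from the document's own table of contents.

```python
def build_coverage_prompt(section_titles: list[str], max_words: int = 500) -> str:
    """Build a prompt that lists every section the summary is expected to cover."""
    numbered = "\n".join(
        f"{i}. {title}" for i, title in enumerate(section_titles, start=1)
    )
    return (
        f"Summarize the full document in no more than {max_words} words.\n"
        "Cover every section listed below, including the middle sections. "
        "Do not focus only on the beginning and conclusion.\n\n"
        f"Sections:\n{numbered}"
    )
```

Listing the sections also gives reviewers a checklist to compare the summary against.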

5.4 Run Multiple Rounds

For important documents, run the summarization task more than once.

Compare the outputs and check whether some points appear in one result but not another. This helps identify unstable omissions.
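
The comparison can be partly automated once each run has been reduced to a set of key points. How those points are extracted (by hand, or by a separate model call) is left open here. The function below is a minimal sketch that only does the set comparison.

```python
def unstable_points(runs: list[set[str]]) -> dict[str, list[int]]:
    """Map each key point that is missing from at least one run
    to the indexes of the runs that do contain it."""
    all_points = set().union(*runs)
    return {
        point: [i for i, run in enumerate(runs) if point in run]
        for point in sorted(all_points)
        if any(point not in run for run in runs)
    }
```

Points that appear in every run are left out of the result, so whatever remains is the list of items to check against the original document first.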

5.5 Review Against the Original Document

Always verify key data, conclusions, and compliance-related statements against the original document.

Do not rely only on the fluency of the summary. A smooth summary can still be incomplete.
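
Part of this review can be scripted. The sketch below flags figures that appear in the summary but not in the original, which catches altered or invented numbers. It relies on exact string matching, so "12.5" and "12.5%" count as different values. It does not detect omitted figures, which still need a manual check.

```python
import re

# Matches integers, decimals, thousands-separated numbers, and optional percent signs.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")

def unverified_numbers(summary: str, original: str) -> list[str]:
    """Return numbers that appear in the summary but nowhere in the original text."""
    source_numbers = set(NUMBER.findall(original))
    return sorted({n for n in NUMBER.findall(summary) if n not in source_numbers})
```

An empty result does not prove the summary is correct. It only means no figure in the summary is absent from the source.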

5.6 Compare Multiple Models for Critical Tasks

For highly important documents, test the same prompt across multiple long-context models.

Different models may preserve different information. Comparing results can help reveal omissions and improve final summary quality.

An API gateway can make this process easier by standardizing calls across models.
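
Once summaries from several models are collected, a simple coverage check shows which model missed which point. The sketch below assumes the summaries are already available as strings keyed by model label and that a reviewer supplies the key phrases. The function name and the case-insensitive substring matching are illustrative choices.

```python
def missing_by_model(summaries: dict[str, str], key_phrases: list[str]) -> dict[str, list[str]]:
    """For each model, list the key phrases its summary does not mention."""
    return {
        model: [p for p in key_phrases if p.lower() not in text.lower()]
        for model, text in summaries.items()
    }
```

A phrase that every model misses deserves a closer look: it may be buried in the source, or it may have been phrased differently in the summaries.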

6. Common Misconceptions About Long-Context Models

Misconception 1: A large context window makes one-shot input reliable

A large context window means the model can receive more text. It does not mean the model will analyze all parts equally.

Ultra-long documents still carry a real risk of middle-content omission.

Misconception 2: A fluent summary means the summary is correct

Fluency is not the same as accuracy.

A summary can read well while missing key information. The only reliable way to detect omissions is to compare it with the original document.

Misconception 3: One summary is enough

Long-text summarization has natural variability. Repeated runs may produce different omissions.

For important documents, multiple outputs should be compared before final use.

Misconception 4: A test score represents absolute model capability

The results in this article come from a specific document and prompt setup.

Performance may change with document type, language style, prompt wording, and model version. Users should test with their own materials.

Misconception 5: Test conclusions remain valid forever

Model vendors continue to improve long-context processing.

A result that is valid today may change after future model updates. Regular retesting is recommended for production workflows.
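
Retesting is easier to act on when new scores are compared against a stored baseline. The sketch below assumes scores are kept per dimension on the same 1-to-5 scale. The `tolerance` default of 0.5 is an arbitrary illustrative threshold.

```python
def regressions(baseline: dict[str, float], current: dict[str, float],
                tolerance: float = 0.5) -> dict[str, float]:
    """Return the score change for each dimension that dropped by more than tolerance."""
    return {
        name: current[name] - baseline[name]
        for name in baseline
        if name in current and baseline[name] - current[name] > tolerance
    }
```

An empty result means no dimension fell beyond the tolerance since the baseline run. Improvements are ignored on purpose, so that the check only flags cases that need attention.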

7. Overall Summary

Gemini 3.5 performs well in loading and processing very long documents. However, this test shows that long-context summarization still has clear limitations.

When summarizing a 100,000-word document, one-shot summarization is efficient but may overlook middle sections. Segmented summarization with final aggregation provides better coverage, but it requires more manual work and more model calls.

The best strategy depends on the document’s importance, time budget, and accuracy requirements.

For low-risk reference documents, one-shot summarization is acceptable. For important business, legal, research, or compliance materials, segmented summarization is more reliable.

Users should also improve prompt design, optimize document structure, run multiple tests, and review final summaries against the original text.

Large-context models will continue to improve. But for now, high-quality long-document summarization still depends on a combination of model capability, document preprocessing, prompt design, and human review.

The key lesson is simple: a large context window is useful, but it is not a guarantee of complete understanding. To get reliable summaries, teams need a structured workflow, not just a bigger input window.