🚀 GPT-5.2 Compaction API: How to Scale AI Agents Beyond Context Limits
Your agent is 47 tool calls deep into a complex workflow. It’s been reading files, querying databases, analyzing documents. Then suddenly — it forgets what you asked for in the first place. Sound familiar?
This is the context window problem, and OpenAI’s new Compaction API in GPT-5.2 offers an elegant solution.
📋 TL;DR
| What | Details |
|---|---|
| 🎯 Problem | AI agents hit context limits during complex workflows |
| 💡 Solution | Native compression via the `/responses/compact` endpoint |
| 📉 Result | 150K tokens → ~15K tokens while preserving task context |
| ⚡ When to use | Multi-step agents, 20+ tool calls, iterative tasks |
🧠 The Context Window Problem
Modern AI agents are impressive. They can browse the web, execute code, query APIs, and orchestrate complex multi-step workflows. But they all share one fundamental limitation: finite context windows.
Even with 128K or 200K token limits, real-world agentic workflows hit the ceiling faster than you’d expect:
| Workflow Type | Typical Token Usage | Risk Level |
|---|---|---|
| 💬 Simple Q&A | 1K–5K tokens | 🟢 Low |
| 💻 Code generation with context | 10K–30K tokens | 🟢 Low |
| 📁 Multi-file refactoring | 50K–100K tokens | 🟡 Medium |
| 🔍 Research agent (10+ sources) | 80K–150K tokens | 🟠 High |
| 🤖 Complex agentic workflow (50+ tool calls) | 150K–300K+ tokens | 🔴 Critical |
When you exceed the limit, you have three bad options:
❌ Truncate early messages — lose critical context
❌ Restart the conversation — lose all progress
❌ Implement custom summarization — add latency and lose fidelity
GPT-5.2’s Compaction API introduces a fourth option: ✅ native, loss-aware compression.
🔧 What Is the Compaction API?
The Compaction API (`/responses/compact`) performs intelligent compression on your conversation state. Instead of naive truncation, it:
- 🔍 Analyzes the full conversation history
- 🎯 Identifies task-relevant information
- 🔐 Compresses it into encrypted, opaque tokens
- 📦 Returns a dramatically smaller payload that preserves semantic meaning
💭 Think of it as a “checkpoint” system for your AI agent. You compress the state at key milestones, then continue with a fresh context window while retaining everything important.
📊 Key Characteristics
| Property | Description | Icon |
|---|---|---|
| Loss-aware | Prioritizes task-relevant information during compression | 🎯 |
| Opaque output | Returns encrypted items — not human-readable | 🔐 |
| Model-specific | Currently works with GPT-5.2 and Responses API | 🔗 |
| Repeatable | Safe to run multiple times in long sessions | 🔄 |
⚠️ Warning
Compacted items are designed for continuation, not inspection. Don’t try to parse or depend on their internal structure — it may change.
⚙️ How It Works
🔄 Basic Flow
```
┌─────────────────────┐
│  📚 Conversation    │
│   (150K tokens)     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  🗜️ Compact         │
│     Endpoint        │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  📦 Compacted       │
│   (~15K tokens)     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  ▶️ Continue        │
│     Workflow        │
└─────────────────────┘
```
💻 Code Example
```python
from openai import OpenAI
import json

client = OpenAI()

# 📍 Step 1: Run your agent workflow
response = client.responses.create(
    model="gpt-5.2",
    input=[
        {
            "role": "user",
            "content": "Analyze all Python files in the repository and suggest refactoring opportunities.",
        },
    ]
)

# ... many tool calls later, context is getting large ...

# 📍 Step 2: Compact the conversation state
output_json = [msg.model_dump() for msg in response.output]

compacted = client.responses.compact(
    model="gpt-5.2",
    input=[
        {
            "role": "user",
            "content": "Analyze all Python files in the repository and suggest refactoring opportunities.",
        },
        output_json[0],  # Include assistant's response
    ]
)

# 📍 Step 3: Continue with compacted state
continuation = client.responses.create(
    model="gpt-5.2",
    input=[
        compacted.output[0],  # Compacted state
        {
            "role": "user",
            "content": "Now implement the top 3 refactoring suggestions.",
        },
    ]
)
```
🧲 What Gets Preserved?

| ✅ Prioritized | ⬇️ Deprioritized |
|---|---|
| Information the model identifies as task-relevant and needed to continue | Details it judges no longer necessary for the task |
✅ When to Use Compaction
🟢 Good Use Cases
🔗 Multi-step agent workflows
```
User request → Plan → Execute step 1 → Execute step 2 → ... → Step N
                                     ↑
                              🗜️ Compact here
```
Compact after completing major phases (e.g., after research, before implementation).
🔧 Long-running sessions with many tool calls
💡 If your agent makes 20+ tool calls, you’re likely approaching context limits. Compact proactively.
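Here's a minimal sketch of that kind of proactive trigger. The helper below is hypothetical: the 20-call threshold is only a heuristic, and it reuses the `responses.compact` call plus the keep-the-system-prompt reset pattern from the repository example later in this post.

```python
TOOL_CALL_THRESHOLD = 20  # ⚙️ Heuristic: tune for your workloads

def maybe_compact(client, history, tool_calls_made):
    """Compact proactively once the agent has made many tool calls."""
    if tool_calls_made < TOOL_CALL_THRESHOLD:
        return history, tool_calls_made

    compacted = client.responses.compact(model="gpt-5.2", input=history)
    # 🔄 Keep the system prompt, replace everything else with the compacted state
    return [history[0], compacted.output[0].model_dump()], 0
```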
🔄 Iterative refinement tasks
```
Code review → fixes → re-review → more fixes
            ↑                    ↑
       🗜️ Compact           🗜️ Compact
```
Each cycle adds tokens. Compact between cycles.
📚 Research and synthesis
Gathering information from multiple sources, then synthesizing. Compact after gathering, before synthesis.
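A rough sketch of that boundary (illustrative names only; it assumes the compact call and response shapes shown earlier, with one model call per source during gathering):

```python
def research_then_synthesize(client, sources, question):
    """📚 Gather from many sources, 🗜️ compact once, then ✍️ synthesize from the compacted state."""
    history = [{"role": "user", "content": f"Research this question: {question}"}]

    # 📚 Gather phase: one call per source, so history grows quickly
    for source in sources:
        response = client.responses.create(
            model="gpt-5.2",
            input=history + [{"role": "user", "content": f"Summarize the relevant parts of {source}."}],
        )
        history.extend(item.model_dump() for item in response.output)

    # 🗜️ Compact exactly once, at the gather/synthesis boundary
    compacted = client.responses.compact(model="gpt-5.2", input=history)

    # ✍️ Synthesis starts from the compacted state, not the raw history
    return client.responses.create(
        model="gpt-5.2",
        input=[compacted.output[0], {"role": "user", "content": "Synthesize the findings into a report."}],
    )
```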
🔴 When NOT to Use
| Scenario | Why Not | Icon |
|---|---|---|
| Short conversations | Overhead not worth it | 💬 |
| Single-turn completions | Nothing to compress | 1️⃣ |
| When you need to inspect history | Compacted items are opaque | 🔍 |
| Real-time streaming | Adds latency | ⏱️ |
| Cross-model continuation | Compacted items are model-specific | 🔀 |
💎 Best Practices
1️⃣ Compact at Milestones, Not Every Turn
```python
# ❌ Bad: Compacting too frequently
for step in workflow_steps:
    response = execute_step(step)
    compacted = client.responses.compact(...)  # Wasteful

# ✅ Good: Compact at logical breakpoints
response = execute_research_phase()
compacted = client.responses.compact(...)  # After major phase
response = execute_implementation_phase(compacted)
```
2️⃣ Monitor Context Usage Proactively
📊 Pro tip: Don’t wait until you hit the limit. Track token usage and compact when you reach ~70% capacity.
```python
def should_compact(response, threshold=0.7):
    usage = response.usage.total_tokens
    limit = 128000  # or your model's limit
    return usage / limit > threshold
```
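Wiring that check into an agent loop might look like this (a sketch only; it reuses `should_compact` from above and the keep-the-system-prompt reset used in the repository example later in this post):

```python
def run_agent_turn(client, history):
    """Run one agent turn, compacting proactively when the context is ~70% full."""
    response = client.responses.create(model="gpt-5.2", input=history)
    history.extend(item.model_dump() for item in response.output)

    if should_compact(response):
        compacted = client.responses.compact(model="gpt-5.2", input=history)
        # 🗜️ Keep the system prompt, swap the rest for the compacted state
        history = [history[0], compacted.output[0].model_dump()]

    return history, response
```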
3️⃣ Keep Prompts Consistent When Resuming
⚠️ Behavior can drift if you change your system prompt after compaction. Keep instructions functionally identical.
```python
# ❌ Bad: Changing instructions mid-workflow
system_prompt_v1 = "You are a code reviewer..."
# ... compact ...
system_prompt_v2 = "You are a senior engineer..."  # Different!

# ✅ Good: Consistent instructions
system_prompt = "You are a code reviewer..."
# ... compact ...
# Continue with same system_prompt
```
4️⃣ Handle Compaction Failures Gracefully
```python
try:
    compacted = client.responses.compact(model="gpt-5.2", input=history)
except Exception as e:
    # 🔄 Fallback: truncate oldest messages
    history = history[-10:]
    logging.warning(f"Compaction failed, using truncation: {e}")
```
5️⃣ Don’t Parse Compacted Items
```python
# ❌ Bad: Trying to extract information
compacted_text = compacted.output[0]["content"]
data = json.loads(compacted_text)  # Will fail or break later

# ✅ Good: Treat as opaque
next_input = [compacted.output[0], new_user_message]
```
🌍 Context Management Across LLM Providers
OpenAI’s Compaction API isn’t just a new feature — it’s a fundamentally different approach to a problem every LLM provider faces.
📊 Provider Comparison
| Provider | Context Window | Native Compaction | Strategy |
|---|---|---|---|
| 🟢 OpenAI GPT-5.2 | 128K | ✅ Yes | Compress & continue |
| 🟠 Anthropic Claude | 200K | ❌ No | Larger window + prompt caching |
| 🔵 Google Gemini 2.0 | 2M | ❌ No | Massive window eliminates the problem |
| 🟣 Mistral Large | 128K | ❌ No | Manual workarounds |
| ⚪ Meta Llama 3 | 128K | ❌ No | Open source — build your own |
| 🟤 Cohere Command R+ | 128K | ❌ No | RAG-first architecture |
| ⚫ xAI Grok | 128K | ❌ No | Manual workarounds |
🎯 Three Approaches to the Context Problem
- 🗜️ Compress (OpenAI): Native compaction preserves semantic meaning while reducing token count. Best for workflows needing continuity without 2M-token pricing.
- 📐 Expand (Google): A 2M-token window means most workflows simply never hit the limit. Tradeoff: cost per token and latency at scale.
- 📤 Offload (everyone else): Manual summarization, RAG retrieval, sliding window, custom middleware. You're on your own.
🏗️ What This Means for Your Architecture
| Your Situation | Recommendation |
|---|---|
| 🎯 OpenAI-first? | Use Compaction API — it’s purpose-built for this |
| 🔀 Multi-provider? | Abstract your context management layer. Don’t depend on provider-specific features without a fallback |
| 💰 Cost-sensitive? | Google’s 2M window might be cheaper than repeated compaction calls — benchmark both |
| 🔒 Privacy-critical? | Compacted items are opaque. If you need auditability, consider manual summarization |
💡 The trend is clear: context management is becoming a first-class API concern, not just a prompt engineering problem. Expect other providers to follow with similar features.
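For the multi-provider case in the table above, one way to keep compaction behind your own context-management layer with a provider-agnostic fallback is sketched below. The class and function names are hypothetical, not a library API:

```python
from typing import Protocol

class ContextManager(Protocol):
    def shrink(self, history: list[dict]) -> list[dict]: ...

class OpenAICompaction:
    """Wraps the native compact endpoint described in this post."""

    def __init__(self, client, model: str = "gpt-5.2"):
        self.client = client
        self.model = model

    def shrink(self, history: list[dict]) -> list[dict]:
        compacted = self.client.responses.compact(model=self.model, input=history)
        # 🗜️ Keep the system prompt, replace the rest with the compacted state
        return [history[0], compacted.output[0].model_dump()]

class TruncationFallback:
    """Provider-agnostic fallback: keep the system prompt plus the most recent messages."""

    def __init__(self, keep_last: int = 10):
        self.keep_last = keep_last

    def shrink(self, history: list[dict]) -> list[dict]:
        return [history[0]] + history[-self.keep_last:]

def shrink_history(manager: ContextManager, history: list[dict]) -> list[dict]:
    """✂️ Shrink with the preferred strategy, falling back to truncation on failure."""
    try:
        return manager.shrink(history)
    except Exception:
        return TruncationFallback().shrink(history)
```

Your agent code only ever calls `shrink_history`, so swapping providers (or dropping compaction entirely) doesn't touch the workflow logic.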
⚖️ Compaction vs. Alternatives
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| 🗜️ Compaction API | Native, loss-aware, preserves semantics | Opaque output, GPT-5.2 only | Production agents |
| 📜 Sliding window | Simple to implement | Loses early context entirely | Simple chatbots |
| 📝 Summarization prompt | Transparent, model-agnostic | Adds latency, lossy, costs tokens | Debugging/auditing |
| 🔍 RAG retrieval | Scalable to huge contexts | Requires infrastructure, retrieval errors | Knowledge bases |
| ✂️ Message truncation | Zero overhead | Loses information randomly | Last resort |
🎯 For agentic workflows where task continuity matters, Compaction API is currently the cleanest solution — assuming you’re on GPT-5.2.
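For comparison, here's what the summarization-prompt alternative from the table can look like. This is a minimal, model-agnostic sketch whose output is plain text you can log and audit, at the cost of an extra model call and some fidelity:

```python
def summarize_history(client, history, model: str = "gpt-5.2"):
    """📝 Replace the running history with a plain-text summary (transparent but lossy)."""
    summary = client.responses.create(
        model=model,
        input=history + [{
            "role": "user",
            "content": "Summarize this conversation so far: the task, key findings, decisions, and open items.",
        }],
    )
    summary_text = summary.output_text  # 🔍 Readable and auditable, unlike compacted items
    return [history[0], {"role": "user", "content": f"Summary of the work so far:\n{summary_text}"}]
```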
🛠️ Practical Example: Code Review Agent
Here’s a complete example of a code review agent that uses compaction to handle large repositories:
```python
from openai import OpenAI

client = OpenAI()

def review_repository(repo_path: str):
    """
    🔍 Review all files in a repository, using compaction for scale.
    """
    history = [
        {
            "role": "system",
            "content": "You are a senior code reviewer. Analyze code for bugs, security issues, and improvements."
        },
        {
            "role": "user",
            "content": f"Review the repository at {repo_path}. List all files first."
        }
    ]

    files_reviewed = 0
    compaction_interval = 10  # 🗜️ Compact every 10 files

    while True:
        response = client.responses.create(
            model="gpt-5.2",
            input=history
        )

        # 📝 Add response to history
        history.append(response.output[0].model_dump())
        files_reviewed += 1

        # 🗜️ Check if we should compact
        if files_reviewed % compaction_interval == 0:
            print(f"🗜️ Compacting after {files_reviewed} files...")
            compacted = client.responses.compact(
                model="gpt-5.2",
                input=history
            )
            # 🔄 Reset history with compacted state
            history = [
                history[0],  # Keep system prompt
                compacted.output[0].model_dump()
            ]

        # ✅ Check if review is complete
        if is_review_complete(response):
            break

        # ➡️ Continue to next file
        history.append({
            "role": "user",
            "content": "Continue to the next file."
        })

    return generate_final_report(history)
```
🎬 Conclusion
The Compaction API solves a real problem that every production AI agent faces: context limits kill complex workflows.
📌 Key Takeaways
| # | Takeaway |
|---|---|
| 1️⃣ | Use it for long, tool-heavy workflows where losing context means losing progress |
| 2️⃣ | Compact at milestones, not every turn |
| 3️⃣ | Keep prompts consistent when resuming from compacted state |
| 4️⃣ | Treat compacted items as opaque — don’t try to parse them |
| 5️⃣ | Have a fallback for when compaction fails |
🚀 As AI agents become more capable, they’ll need to handle increasingly complex, multi-hour workflows. Native context management features like Compaction are a step toward making that practical.
📚 Further Reading
| Resource | Link |
|---|---|
| 📖 OpenAI Conversation State Guide | platform.openai.com/docs/guides/conversation-state |
| 📖 GPT-5.2 Prompting Guide | cookbook.openai.com/examples/gpt-5/gpt-5-2_prompting_guide |
| 📖 Compact a Response API Reference | platform.openai.com/docs/api-reference/responses/compact |
Found this useful? Share it with your team! 🙌
