Stop Drowning Your AI Assistant: How to Build a Context-Saving MCP Server for VS Code Copilot

Senior @ FAST NUCES. Have a passion for learning: Full Stack, System Design, Agentic, GenAI. Find my projects on GitHub. Cat lover and always up for a chat!

If you use VS Code Copilot’s Agent Mode, especially on the free GitHub Copilot Student Plan (though its available models have been severely depleted), you already know it feels like a free superpower granted by your student ID. You give it a high-level goal, and it executes terminal commands, writes code, and debugs errors entirely on its own.

But as of June 1, 2026, GitHub changed the rules of the game. Copilot shifted away from simple monthly request limits to a strict, usage-based consumption model fueled by GitHub AI Credits.

This shift has exposed Agent Mode’s glaring Achilles’ heel: Context Bloat.

[Image: Context Bloating]

When an AI agent runs a command that dumps hundreds of lines of installation trackers, compilation logs, or verbose test results, that raw text gets crammed directly into the LLM’s context window. This creates two immediate problems:

  1. The “Lost in the Middle” Phenomenon: LLMs struggle to find relevant information when buried under walls of repetitive log noise. Your agent suddenly forgets your initial instructions.

  2. Token Exhaustion: You burn through your plan’s token rate limits on useless lines of text such as Fetching packages... or progress bars.

To solve this, I built mcp-headroom: a lightweight middleware proxy using Anthropic’s open-source Model Context Protocol (MCP). It intercepts verbose terminal outputs, counts tokens dynamically using tiktoken, and surgically trims the repetitive middle noise while keeping the critical headers and error statuses perfectly intact.

Here is exactly how it works, how the code is structured, and how you can set it up in your local environment over a weekend.

The Core Blueprint: How it Works

Instead of letting Copilot Chat run raw commands in a blind terminal instance, we expose a specialized tool called run_compact_command.

When Copilot needs to install packages, run tests, or inspect logs, it calls our local MCP server via standard input/output (stdio). The server runs the command in a background sub-process, analyzes the output token count, and applies a smart head/tail truncation rule if it exceeds our token budget.

Step 1: Bootstrapping the FastMCP Server

We use the official Python mcp SDK alongside tiktoken to build a clean, production-ready server. Create a file named server.py and set up the structural foundation:

from __future__ import annotations
import subprocess
import shlex
from typing import Optional
import tiktoken
from mcp.server.fastmcp import FastMCP

# Configuration limits
DEFAULT_TOKEN_THRESHOLD = 800  # Token limit before triggering compaction
HEAD_LINES = 20  # Lines to keep from the top
TAIL_LINES = 20  # Lines to keep from the bottom
ENCODING_NAME = "cl100k_base"  # Core tokenizer mapping


mcp = FastMCP(
    name="mcp-headroom",
    instructions="Use run_compact_command instead of the built-in terminal for verbose or repetitive shell outputs to conserve token budgets.",
)

_enc: Optional[tiktoken.Encoding] = None 


def _encoder() -> tiktoken.Encoding: 
    global _enc 
    if _enc is None: 
        _enc = tiktoken.get_encoding(ENCODING_NAME) 
    return _enc 


def _token_count(text: str) -> int: 
    return len(_encoder().encode(text))
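
One practical wrinkle with the counter above: tiktoken fetches its vocabulary file the first time an encoding is loaded, and encode() rejects text containing special-token strings unless told otherwise. The sketch below is one possible way to guard against both; make_token_counter and its four-characters-per-token fallback are hypothetical and not part of mcp-headroom.

```python
from typing import Callable


def make_token_counter(encoding_name: str = "cl100k_base") -> Callable[[str], int]:
    """Return a token-counting function, preferring tiktoken when it is usable."""
    try:
        import tiktoken  # treated as optional in this sketch

        encoder = tiktoken.get_encoding(encoding_name)
        # disallowed_special=() encodes strings like "<|endoftext|>" as plain
        # text instead of raising when they appear in a log.
        return lambda text: len(encoder.encode(text, disallowed_special=()))
    except Exception:
        # Hypothetical fallback: assume roughly four characters per token.
        # This is a coarse estimate, not a real tokenizer.
        return lambda text: (len(text) + 3) // 4


count_tokens = make_token_counter()
noisy = "Fetching packages...\n" * 200
print(count_tokens("pip install numpy"), count_tokens(noisy))
```

Either branch returns a positive integer that grows with the input, which is all the compaction threshold needs. The exact numbers differ between the real tokenizer and the estimate.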

[Image: File Tree]

Step 2: Coding the Compaction Engine

The heart of this project is the _compact utility.

If an output’s token count crosses our threshold budget (800 tokens), it splits the string, grabs the first 20 lines (where initialization details and version choices live), and the last 20 lines (where compilation status, final summaries, or error stack traces sit).

def _compact(text: str, threshold: int = DEFAULT_TOKEN_THRESHOLD) -> tuple[str, int, int]:
    """Return (output, original_tokens, saved_tokens), trimming the middle if over budget."""
    original_tokens = _token_count(text)
    if original_tokens <= threshold:
        return text, original_tokens, 0

    lines = text.splitlines()
    total_lines = len(lines)
    # With too few lines, head and tail would overlap and duplicate output.
    if total_lines <= HEAD_LINES + TAIL_LINES:
        return text, original_tokens, 0

    head = lines[:HEAD_LINES]
    tail = lines[-TAIL_LINES:]
    omitted_lines = total_lines - HEAD_LINES - TAIL_LINES

    # Build an informative placeholder for the model
    middle_text = "\n".join(lines[HEAD_LINES : total_lines - TAIL_LINES])
    middle_tokens = _token_count(middle_text)
    placeholder = (
        f"\n[... Truncated {omitted_lines} lines of repetitive output "
        f"({middle_tokens:,} tokens omitted) ...]\n"
    )

    compacted = "\n".join(head) + placeholder + "\n".join(tail)
    final_tokens = _token_count(compacted)
    saved_tokens = original_tokens - final_tokens
    # If the placeholder costs more than the trimmed middle, keep the original.
    if saved_tokens <= 0:
        return text, original_tokens, 0
    return compacted, original_tokens, saved_tokens
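
To see the head/tail rule in action without installing anything, here is a minimal, self-contained sketch on a synthetic log. It swaps tiktoken for a stand-in word counter (rough_tokens), and compact_demo is a simplified illustration rather than the server's actual function.

```python
HEAD_LINES = 20
TAIL_LINES = 20


def rough_tokens(text: str) -> int:
    """Stand-in counter (whitespace-separated words), not a real tokenizer."""
    return len(text.split())


def compact_demo(text: str, threshold: int = 800) -> str:
    """Keep the first and last lines of over-budget output, drop the middle."""
    lines = text.splitlines()
    if rough_tokens(text) <= threshold or len(lines) <= HEAD_LINES + TAIL_LINES:
        return text
    omitted = len(lines) - HEAD_LINES - TAIL_LINES
    marker = f"[... Truncated {omitted} lines ...]"
    return "\n".join(lines[:HEAD_LINES] + [marker] + lines[-TAIL_LINES:])


# Synthetic log: 1 header line, 500 progress lines, 1 final error line.
log = "\n".join(
    ["Collecting numpy"]
    + [f"Downloading chunk {i} of 500" for i in range(1, 501)]
    + ["ERROR: build failed"]
)
result = compact_demo(log).splitlines()
print(len(log.splitlines()), "->", len(result))  # 502 -> 41
print(result[0], "|", result[20], "|", result[-1])
```

The header and the final error line both survive, and the marker line tells the model exactly how much was cut, so it can ask for a rerun with a higher threshold if the middle turns out to matter.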

Step 3: Exposing the Tool to Copilot Chat

Next, we expose our core tool run_compact_command using the @mcp.tool() decorator. Notice how detailed the docstring is.

This is crucial because it serves as the natural language prompt that tells the LLM exactly when and why to pick this tool over its native terminal.

@mcp.tool()
def run_compact_command(
    command: str,
    timeout: int = 60,
    token_threshold: int = DEFAULT_TOKEN_THRESHOLD,
    working_dir: Optional[str] = None,
) -> dict:
    """
    Run a shell command, capture combined stdout + stderr, and compact the
    output if it exceeds token_threshold. Prefer this over the built-in terminal
    for package installs, test runners, build systems, or log dumps to prevent
    wasting your active GitHub AI Credits.
    """
    try:
        result = subprocess.run(
            shlex.split(command),
            capture_output=True,
            text=True,
            timeout=timeout,
            cwd=working_dir,
        )
        raw_output = result.stdout + (
            ("\n--- stderr ---\n" + result.stderr) if result.stderr.strip() else ""
        )
        exit_code = result.returncode
    except subprocess.TimeoutExpired:
        raw_output = f"[mcp-headroom] Command timed out after {timeout}s."
        exit_code  = -1
    except Exception as exc:
        raw_output = f"[mcp-headroom] Unexpected error: {exc}"
        exit_code  = -1
    compacted, orig_tok, saved_tok = _compact(raw_output, threshold=token_threshold)
    final_tok    = orig_tok - saved_tok
    was_compacted = saved_tok > 0
    banner = (
        f"[Headroom Alert: Saved {saved_tok:,} tokens "
        f"({orig_tok:,} → {final_tok:,}) | exit {exit_code}]"
        if was_compacted
        else f"[Headroom: Output within threshold ({orig_tok:,} tokens) | exit {exit_code}]"
    )
    return {
        "output":          compacted + f"\n\n{banner}",
        "exit_code":       exit_code,
        "original_tokens": orig_tok,
        "final_tokens":    final_tok,
        "tokens_saved":    saved_tok,
        "was_compacted":   was_compacted,
        "headroom_banner": banner,
    }
if __name__ == "__main__":
    mcp.run(transport="stdio")
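
One behavior of the tool above is worth checking before you rely on it: the docstring says "shell command", but the command is tokenized with shlex.split and run without a shell, so operators such as | or > are passed to the program as literal arguments. The sketch below shows the difference. The to_argv helper and its use_shell flag are hypothetical, and the shell branch assumes a POSIX sh is on PATH.

```python
import shlex


def to_argv(command: str, use_shell: bool = False) -> list[str]:
    """Hypothetical helper: build argv, optionally delegating to a POSIX shell."""
    if use_shell:
        # Assumes `sh` exists; the shell then interprets | and > itself.
        return ["sh", "-c", command]
    return shlex.split(command)


plain = to_argv("pip install numpy pandas")
piped = to_argv("pytest -q | tail -n 5")
via_shell = to_argv("pytest -q | tail -n 5", use_shell=True)

print(plain)      # ['pip', 'install', 'numpy', 'pandas']
print(piped)      # ['pytest', '-q', '|', 'tail', '-n', '5']
print(via_shell)  # ['sh', '-c', 'pytest -q | tail -n 5']
```

Keeping the no-shell default is the safer choice when an LLM composes the command string, since nothing in it gets expanded or chained. Routing through a shell is a trade-off to make deliberately.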

Step 4: Hooking it into VS Code

To introduce our new server to VS Code’s Copilot Agent ecosystem, add a workspace configuration file at .vscode/mcp.json. This tells the editor where to find your virtual environment's Python binary and how to execute the script over stdio. The path below assumes macOS or Linux; on Windows, a virtual environment keeps its interpreter at .venv\Scripts\python.exe instead.

{
  "servers": {
    "mcp-headroom": {
      "type": "stdio",
      "command": "${workspaceFolder}/.venv/bin/python",
      "args": ["${workspaceFolder}/server.py"]
    }
  }
}
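
If you share the project across operating systems, one option is to generate this file instead of hand-editing it. The following is a sketch under assumptions: build_mcp_config is a hypothetical helper, and it assumes the virtual environment lives in .venv at the workspace root as in the configuration above.

```python
import json
import sys


def build_mcp_config(server_name: str = "mcp-headroom") -> dict:
    """Build the .vscode/mcp.json content with a platform-appropriate venv path."""
    # venv places the interpreter in Scripts/ on Windows and bin/ elsewhere.
    if sys.platform == "win32":
        python_path = "${workspaceFolder}/.venv/Scripts/python.exe"
    else:
        python_path = "${workspaceFolder}/.venv/bin/python"
    return {
        "servers": {
            server_name: {
                "type": "stdio",
                "command": python_path,
                "args": ["${workspaceFolder}/server.py"],
            }
        }
    }


config_text = json.dumps(build_mcp_config(), indent=2)
print(config_text)  # paste or write this into .vscode/mcp.json
```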

[Image: Selected MCP Server]

Putting It To The Test

To see the real-world utility of mcp-headroom, I tasked Copilot with an installation command inside an active project environment:

[Image: Prompt For Testing]

Prompt: Using run_compact_command, run pip install numpy pandas scikit-learn matplotlib and tell me how many tokens were saved.

Initially, the installation was heavy enough to hit our default 60-second timeout. Copilot recognized the timeout failure, read the error message returned by our server, adjusted its parameters, and re-ran the command with a longer timeout.
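
The retry Copilot performed can be approximated in plain Python. This is only an illustrative sketch: run_with_retry and its doubling policy are hypothetical and not something mcp-headroom implements, and each retry re-runs the command from scratch, so it only suits commands that are safe to repeat.

```python
import subprocess
import sys


def run_with_retry(
    argv: list[str], timeout: float, max_attempts: int = 6
) -> tuple[str, int, float]:
    """Run argv, doubling the timeout after each subprocess.TimeoutExpired."""
    for _ in range(max_attempts):
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
            return result.stdout, result.returncode, timeout
        except subprocess.TimeoutExpired:
            timeout *= 2  # hypothetical backoff policy
    return f"Command timed out after {max_attempts} attempts.", -1, timeout


# A child process that sleeps 0.3s cannot finish within the first 0.1s budget.
slow_command = [sys.executable, "-c", "import time; time.sleep(0.3); print('done')"]
output, exit_code, final_timeout = run_with_retry(slow_command, timeout=0.1)
print(output.strip(), exit_code, final_timeout)
```

In the article's setup the model makes this decision itself, which is why returning a readable timeout message, rather than raising, matters: the message is what lets the agent choose a larger timeout value.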

Here is the exact raw JSON object returned back across the protocol bridge to the IDE:

[Image: JSON Output]

The Hard Metrics

Here is what flattening the redundant progress-tracking output achieved during this single standard package installation:

[Image: Statistics]

Instead of drowning in 14 lines of identical progress tracker cache outputs, the underlying agent model instantly parsed the clean text output, verified the exit code, and confirmed that numpy, pandas, scikit-learn, and matplotlib were successfully initialized in the local environment.

Key Takeaway

Building your first MCP server can feel intimidating, but beneath the buzzwords, it’s remarkably elegant: just a clean JSON-RPC bridge built on top of simple standard I/O streams. You don’t need complex cloud infrastructure to give an LLM superpowers — you just need a way to translate your local system’s reality into data the model can actually parse.

In an era where developer AI platforms are moving toward strict, usage-based credit models, token optimization is no longer optional. It is a fundamental engineering constraint. By inserting a custom, token-aware lens between your local workspace and your AI assistant, you achieve three things simultaneously:

  • Protect your context boundaries from drowning in raw log noise.

  • Keep your agent sharp and clear of “Lost in the Middle” hallucinations.

  • Defend your wallet against runaway token consumption.

The Horizon is Wide Open

Log compaction is just the entry point. The research into Agentic AI workflows is an ongoing, rapidly shifting space. We are moving away from the idea of a single “all-knowing” AI assistant and moving toward an ecosystem of modular micro-tools.

Depending on your engineering needs, you can chain entirely different MCP servers together to create a specialized local network for your agent:

                  ┌─── mcp-headroom  (Token Compactor)
                  ├─── mcp-db-sleuth (Schema Introspection)
AI Coding Agent ──┼─── mcp-guardian  (Architecture Linter)
                  └─── mcp-git-lens  (Smart Commit Automator)

By swapping and combining specialized protocol hooks, you can craft a hyper-contextual environment tailored to your exact tech stack and to how you code.

The codebase for mcp-headroom is entirely modular. You can easily expand its heuristics dictionary to handle massive multi-stage Docker builds, heavy npm install structures, or deep database query dumps.
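
As one example of such an extension, a pre-pass could strip progress-bar redraws and collapse runs of identical lines before the head/tail rule ever runs. The sketch below is hypothetical (collapse_noise is not part of mcp-headroom) and assumes progress bars redraw themselves with carriage returns, as many installers do.

```python
def collapse_noise(text: str) -> str:
    """Drop overwritten progress frames and collapse consecutive duplicate lines."""
    collapsed: list[str] = []
    repeat = 0
    # Split on "\n" only: str.splitlines() would also split on "\r".
    for raw in text.split("\n"):
        # Progress bars redraw with "\r"; keep only the final frame of each line.
        line = raw.rstrip("\r").split("\r")[-1]
        if collapsed and line == collapsed[-1]:
            repeat += 1
            continue
        if repeat:
            collapsed.append(f"[previous line repeated {repeat} more time(s)]")
            repeat = 0
        collapsed.append(line)
    if repeat:
        collapsed.append(f"[previous line repeated {repeat} more time(s)]")
    return "\n".join(collapsed)


sample = (
    "Downloading 10%\rDownloading 60%\rDownloading 100%\n"
    "Fetching packages...\n"
    "Fetching packages...\n"
    "Fetching packages...\n"
    "Done"
)
print(collapse_noise(sample))
```

Run on the sample, this keeps the final "Downloading 100%" frame, one "Fetching packages..." line plus a repeat note, and "Done". Because it removes noise wherever it occurs, it complements the positional head/tail cut rather than replacing it.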

Clone the code, hook it into your workspace, and start experimenting. Let me know in the comments how you’re configuring your MCP pipeline and exactly how many tokens you manage to save!

References

Copilot’s Token Based Billing article

A collection of MCP Servers to take inspirations from

O

This is a solid approach. The head/tail truncation pattern is elegant — keeping the context where errors usually surface while dropping the noisy middle. I've been thinking about a similar problem from a different angle: when you're running long scraping jobs via MCP, the log output can get massive. A token-aware middleware like mcp-headroom could slot right into that pipeline.

Curious — have you tested this with Docker build outputs? Those tend to be the worst offenders for context bloat.

M

Thank you! Right now nope, was thinking of a project of mine to integrate this into when i get the time.