The Race to Shrink AI's Long-Context Memory Cost

At 1M tokens an AI model's KV cache can top 300GB — more than its weights. TurboQuant, OSCAR and EpiCache each shrink it to make long context cheaper.

Published 2026-06-18 · 6 min read

Evgenii Arsentev · PhD

Three research teams have shipped new ways to compress the KV cache, the fast-growing scratchpad that quietly dominates the cost of long-context AI. A MarkTechPost analysis published June 18, 2026 lines up TurboQuant from Google and NYU, OSCAR from Together AI, and EpiCache from Apple. The numbers explain why this corner of AI infrastructure suddenly matters: for Llama-3.1-70B at full (16-bit) precision the cache costs about 0.31 MB per token, which works out to roughly 40 GB at 128,000 tokens and more than 300 GB at one million tokens, larger than the model's own 140 GB of weights.

The KV (key-value) cache is where a model stores the attention vectors for every token it has already seen, at every layer, so it doesn't have to recompute them on each new word. It grows linearly with the length of your conversation or document and with how many requests are processed at once. Past a certain length, generating each token is bottlenecked not by raw compute but by shuttling this giant cache in and out of memory. Shrink the cache and you make long context faster, cheaper, and able to fit on smaller hardware.
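The figures quoted in the opening paragraph can be reproduced with a few lines of arithmetic. The sketch below assumes the publicly documented Llama-3.1-70B layout (80 layers, 8 key/value heads, 128 dimensions per head) and 16-bit cache entries. The 3 TB/s memory bandwidth is a hypothetical round number used only to illustrate the memory bottleneck, not a measurement.

```python
# Back-of-envelope KV cache sizing.
# Architecture numbers: assumed Llama-3.1-70B configuration.
LAYERS = 80          # transformer layers
KV_HEADS = 8         # key/value heads (grouped-query attention)
HEAD_DIM = 128       # dimensions per head
BYTES_PER_VALUE = 2  # assumed 16-bit (FP16/BF16) cache entries

# Hypothetical accelerator bandwidth, not a measured figure.
ASSUMED_BANDWIDTH_BYTES_PER_S = 3e12


def kv_bytes_per_token(layers=LAYERS, kv_heads=KV_HEADS,
                       head_dim=HEAD_DIM, bytes_per_value=BYTES_PER_VALUE):
    """Bytes per token: one key and one value vector per head, per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(tokens, batch=1):
    """Total cache size; linear in context length and in concurrent requests."""
    return kv_bytes_per_token() * tokens * batch


def min_seconds_per_token(tokens, bandwidth=ASSUMED_BANDWIDTH_BYTES_PER_S):
    """Lower bound on decode time if the whole cache is read once per new token."""
    return kv_cache_bytes(tokens) / bandwidth


print(f"per token: {kv_bytes_per_token() / 2**20:.4f} MiB")
for tokens in (128_000, 1_000_000):
    size_gb = kv_cache_bytes(tokens) / 1e9
    floor_ms = min_seconds_per_token(tokens) * 1e3
    print(f"{tokens:>9,} tokens: {size_gb:6.1f} GB, at least {floor_ms:.1f} ms per token")
```

Because every term is a plain product, halving the bits per value halves both the memory footprint and the bandwidth-bound floor on per-token latency.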

Three different attacks on the same bottleneck

TurboQuant (to appear at ICLR 2026) is the model-agnostic option: it needs no calibration data and works on any model without modification. It applies a random rotation to the cached vectors so their coordinates behave like clean Gaussian noise, then quantizes aggressively. The result is near full-precision quality at 4x compression: essentially lossless at around 3.5 bits per channel, with only marginal degradation at 2.5 bits. Theoretical guarantees put it within a factor of about 2.7 of the information-theoretic optimum.
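To see why rotating before quantizing helps, here is a simplified illustration, not the published TurboQuant algorithm. It swaps in a randomized Hadamard transform as a cheap stand-in for a random rotation and uses a plain uniform quantizer. All function names and the test vector with two outlier channels are made up for the example.

```python
import math
import random


def hadamard_rotate(x, signs):
    """Orthogonal transform: random sign flips, then a normalized
    Walsh-Hadamard transform. len(x) must be a power of two."""
    y = [v * s for v, s in zip(x, signs)]
    n = len(y)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    scale = 1 / math.sqrt(n)
    return [v * scale for v in y]


def hadamard_unrotate(y, signs):
    """Inverse: the normalized transform is its own inverse; then undo the signs."""
    x = hadamard_rotate(y, [1] * len(y))
    return [v * s for v, s in zip(x, signs)]


def quantize_roundtrip(x, bits):
    """Uniform scalar quantization over [min, max], returned already dequantized."""
    lo, hi = min(x), max(x)
    levels = 2 ** bits - 1
    step = (hi - lo) / levels or 1.0
    return [lo + round((v - lo) / step) * step for v in x]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


rng = random.Random(0)
n = 128
x = [rng.gauss(0, 1) for _ in range(n)]
x[5], x[77] = 40.0, -35.0            # two outlier channels stretch the range
signs = [rng.choice((-1, 1)) for _ in range(n)]

plain = quantize_roundtrip(x, 4)
rotated = hadamard_unrotate(
    quantize_roundtrip(hadamard_rotate(x, signs), 4), signs)

mse_plain = mse(x, plain)
mse_rotated = mse(x, rotated)
print(f"4-bit error without rotation: {mse_plain:.3f}")
print(f"4-bit error with rotation:    {mse_rotated:.3f}")
```

Without the rotation, the two outliers force a wide quantization grid and most ordinary values collapse onto the same level. The rotation spreads the outlier energy across all coordinates, so the same 4 bits cover a much narrower range.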

OSCAR is the deploy-it-today option. It is attention-aware — it rotates keys and values along the directions the model actually cares about — and it ships with real plumbing: integration into the SGLang serving engine, a mixed-precision cache that keeps recent tokens in full precision and older history compressed, and precomputed setups for models like Qwen3 and GLM-4.7. At about 2.28 effective bits it stays within 1.42 points of full precision on Qwen3-8B and basically ties it on larger models, while delivering up to 8x memory reduction and up to roughly 3x faster decoding at 100,000-token context.
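The mixed-precision idea, recent tokens kept exact and older history compressed, can be sketched with a toy cache. This is a hypothetical illustration, not OSCAR's code or the SGLang integration: the class name, the fixed clipping range, and the plain uniform quantizer are all assumptions.

```python
from collections import deque


class MixedPrecisionCache:
    """Toy cache: the newest `window` vectors stay as floats,
    older ones are stored as small integer codes."""

    def __init__(self, window, bits=2, clip=4.0):
        self.window = window
        self.levels = 2 ** bits - 1
        self.clip = clip          # assumed symmetric value range [-clip, clip]
        self.recent = deque()     # full-precision vectors, newest last
        self.old = []             # quantized vectors, oldest first

    def _encode(self, vec):
        step = 2 * self.clip / self.levels
        codes = []
        for v in vec:
            v = max(-self.clip, min(self.clip, v))   # clamp into range
            codes.append(round((v + self.clip) / step))
        return codes

    def _decode(self, codes):
        step = 2 * self.clip / self.levels
        return [c * step - self.clip for c in codes]

    def append(self, vec):
        """Add a new token's vector; demote the oldest recent one if over budget."""
        self.recent.append(list(vec))
        if len(self.recent) > self.window:
            self.old.append(self._encode(self.recent.popleft()))

    def read_all(self):
        """Every cached vector in token order, dequantizing the old ones."""
        return [self._decode(c) for c in self.old] + [list(v) for v in self.recent]


cache = MixedPrecisionCache(window=2, bits=2)
for t in range(5):
    cache.append([0.5 * t, -0.5 * t])
print(len(cache.old), "compressed,", len(cache.recent), "full precision")
```

The design choice is that the tokens most likely to matter for the next prediction never pay a quantization penalty, while the long tail of history carries most of the memory savings.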

EpiCache, from Apple, solves a different problem: long, multi-turn conversations. Instead of crushing every number, it groups the chat history into semantic "episodes," keeps a compressed cache per episode, and pulls in only the ones relevant to your current question. It reports up to 40% higher accuracy than simpler "just forget old tokens" approaches, near-full accuracy at 4–6x compression, up to 3.5x lower peak memory, and about 2.4x lower latency.
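The episode idea can be sketched as clustering followed by retrieval. This is a toy under loud assumptions: the two-dimensional "embeddings" are invented by hand, the clustering is a naive k-means on cosine similarity, and the helper names are hypothetical. A real system would use a sentence encoder and attach a compressed KV cache to each episode; none of that is shown here.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster_episodes(embeddings, k, iterations=10):
    """Tiny k-means on cosine similarity; returns (centroids, episode per turn)."""
    centroids = [list(e) for e in embeddings[:k]]   # naive init: first k turns
    assign = [0] * len(embeddings)
    for _ in range(iterations):
        assign = [max(range(k), key=lambda c: cosine(e, centroids[c]))
                  for e in embeddings]
        for c in range(k):
            members = [e for e, a in zip(embeddings, assign) if a == c]
            if members:
                centroids[c] = mean(members)
    return centroids, assign


def pick_episode(query_embedding, centroids):
    """Index of the episode whose centroid best matches the query."""
    return max(range(len(centroids)),
               key=lambda c: cosine(query_embedding, centroids[c]))


# Hand-made embeddings: axis 0 is "travel", axis 1 is "programming".
turns = [
    ("Plan a trip to Lisbon", [1.0, 0.1]),
    ("Fix my Python import error", [0.1, 1.0]),
    ("Which Lisbon hotels are central?", [0.9, 0.2]),
    ("Why does pip fail in my venv?", [0.2, 0.9]),
]
centroids, assign = cluster_episodes([e for _, e in turns], k=2)
chosen = pick_episode([0.8, 0.3], centroids)   # a travel-flavoured question
print("episodes per turn:", assign, "-> load episode", chosen)
```

Only the chosen episode's cache would need to be resident while answering, which is where the peak-memory savings in this style of approach come from.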

Why this matters to you

Most people will never touch any of these settings, and that's the point: this is the invisible plumbing behind features you already use. It is a big reason providers can offer million-token context windows, let you paste in whole books or codebases, and keep prices falling. For anyone running an open model locally, cache compression can be the difference between a long document fitting on a single consumer GPU or not. The teams are clear that these are not rivals: TurboQuant wins on portability, OSCAR on shipping today, EpiCache on long chats. They can also be stacked for compounding savings, which means long context should keep getting cheaper across the board.

What I'd actually do

If you run local models, this is the most practical takeaway: when your tool offers KV-cache quantization (often labelled 2-bit, 4-bit or "INT2/INT4 cache"), it's usually safe to turn on at 4-bit — you'll fit far longer context on the same GPU with little quality loss. For everyone else, just read it as a signal: long context and big-document workflows are getting cheaper, so it's worth designing your habits around them rather than rationing tokens.
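To get a feel for what a 4-bit cache buys on a local machine, the rough estimate below asks how many tokens fit in a fixed memory budget. The defaults assume a Llama-3.1-8B-style layout (32 layers, 8 key/value heads, 128 dimensions per head), and the 8 GB budget is a hypothetical amount left over after loading weights. Real tools also store per-block scales, so actual capacity will be somewhat lower.

```python
def max_context_tokens(cache_budget_gb, bits, layers=32, kv_heads=8, head_dim=128):
    """Tokens that fit in the budget, ignoring quantization scales and other overhead."""
    bits_per_token = 2 * layers * kv_heads * head_dim * bits   # keys + values
    return int(cache_budget_gb * 1e9 * 8 // bits_per_token)


BUDGET_GB = 8   # hypothetical memory left for the cache
for bits in (16, 8, 4):
    tokens = max_context_tokens(BUDGET_GB, bits)
    print(f"{bits:>2}-bit cache: about {tokens:,} tokens")
```

Under these assumptions, moving from 16-bit to 4-bit roughly quadruples the context that fits in the same memory.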

Author

Evgenii Arsentev

PhD · Chief Product Officer at a tech company

Source: marktechpost.com