
LLMs Can't Forget — And That's Why DRAM Prices Are Exploding

Why AI keeps demanding more memory, and what engineers are doing about it

LLMs are memory-hungry in both training and inference. As a result, demand for the DRAM that goes into GPUs has climbed sharply, and prices have risen with it. The price increase is starting to weigh on AI companies and data centers, and a growing number of efforts are trying to cut DRAM consumption at the system-design stage, from the start. A recent EE Times article, “What the DRAM Crunch Teaches Us About System Design,” examines this trend. Its argument is blunt: the era of “cheap, abundant DRAM” is essentially over, and the center of gravity in system design is shifting from compute to data movement and memory management.

This piece looks at why LLMs need so much memory in the first place, and at the directions academia and industry are taking to push past that limit.

A Machine That Holds Every Word Raw

The natural starting point is to ask why LLMs demand so much memory. Answering that means looking at what actually happens during a conversation with an AI.

If a person speaks roughly 130–150 words per minute, an hour-long, in-depth conversation with an AI works out to roughly 8,000–9,000 words of data going back and forth. After that much conversation, a person doesn’t try to keep every detail. We abstract: we boil what was said down to a few load-bearing traits. Talk about a mutual friend for an hour and what stays in your head is something like “the cheerful one,” “the one who spends a lot,” “the one who works as an engineer.”

Today’s LLMs have essentially no mechanism that compresses growing context into a more compact form as the conversation lengthens. They hold the conversation by storing every token’s Key-Value pair as roughly thousand-dimensional vectors in DRAM, position by position, untouched once written. That accumulated store is the KV cache. To hold an hour’s worth of context (around 30,000 tokens) for a model like Llama 3 70B at BF16 precision, close to 10GB of DRAM per user has to stay pinned in real time. Remember that an H100 carries 80GB per GPU, and even the more recent H200 and B200 sit at 141GB and 192GB, and it becomes vivid that as few as a dozen or so simultaneous users in long conversations is enough to push a single GPU’s memory up against its physical limit.
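A quick sanity check on that figure, as a back-of-the-envelope sketch: plugging in Llama 3 70B’s published configuration (80 layers, 8 grouped KV heads, head dimension 128) lands right around 10GB.

```python
# Back-of-the-envelope KV-cache size for Llama 3 70B.
# Config from the published model card: 80 layers, 8 KV heads (GQA),
# head dim 128; BF16 = 2 bytes per value. The token count is the
# hour-of-conversation assumption from the text.
n_layers, n_kv_heads, d_head, bytes_per_val = 80, 8, 128, 2

# Both K and V are cached per token per layer, hence the factor of 2.
bytes_per_token = 2 * n_layers * n_kv_heads * d_head * bytes_per_val

n_tokens = 30_000  # roughly an hour of conversation
cache_gb = bytes_per_token * n_tokens / 1024**3
print(f"{bytes_per_token / 1024:.0f} KB/token, {cache_gb:.1f} GB total")
# -> 320 KB/token, 9.2 GB pinned for a single user's context
```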

So how is the field trying to address this?

Mathematical Compression: Shrinking the Memory Footprint

The first direction is straightforward: store the data itself in a smaller form.

DeepSeek-V2’s MLA (Multi-head Latent Attention) compresses the KV cache into a low-dimensional latent space and only restores the full form at the moment computation actually needs it; DeepSeek reports cutting KV cache size by more than 90% compared to standard multi-head caching. GQA (Grouped-Query Attention) lets multiple attention heads share a single set of keys and values, eliminating the redundancy of holding the same information many times over.
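To make the GQA idea concrete, here is a minimal sketch; the head counts and shapes are illustrative placeholders, not any particular model’s configuration:

```python
import torch

# Grouped-Query Attention in miniature: several query heads share one
# KV head, so the cache stores n_kv_heads sets of K/V instead of
# n_q_heads. All sizes below are illustrative, not a real config.
n_q_heads, n_kv_heads, d_head, seq_len = 8, 2, 64, 128
group = n_q_heads // n_kv_heads  # 4 query heads per KV head

q = torch.randn(n_q_heads, seq_len, d_head)
k = torch.randn(n_kv_heads, seq_len, d_head)  # cached: 4x smaller than MHA
v = torch.randn(n_kv_heads, seq_len, d_head)

# Expand each KV head to its query-head group only at compute time;
# what lives in DRAM between steps is just the small k and v above.
k_full = k.repeat_interleave(group, dim=0)
v_full = v.repeat_interleave(group, dim=0)

attn = torch.softmax(q @ k_full.transpose(-2, -1) / d_head**0.5, dim=-1)
out = attn @ v_full  # (n_q_heads, seq_len, d_head)
```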

Both approaches keep the larger Transformer scaffolding intact and trim the inefficiency that lives inside it. Since the underlying memory model doesn’t change, neither is a root-cause fix. Still, in terms of “the same amount of information in less space,” they’re the most readily deployable short-term answers right now.

Build Models That Actually Summarize

The second direction is more ambitious: rebuild how a Transformer remembers in the first place. Instead of stacking up every token, these architectures keep information as a compressed ‘state’ and update it as the conversation moves.

Mamba and SSMs (State Space Models) take a conversation, no matter how long, and fold it into a single state vector that is progressively updated. The key result is constant memory usage: as input length grows, DRAM usage doesn’t grow with it. That’s a different kind of machine from a Transformer, where memory rises linearly with context length. Where the Transformer holds every line of dialogue verbatim, this approach is closer to following a film by tracking only its load-bearing narrative beats. Attention Sinks makes a more pragmatic compromise: instead of preserving every past token, keep only the very beginning of the conversation and the most recent context, squeezing memory efficiency out of that selective retention.
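The flavor of that fixed-size state is easy to show in a few lines. This is a cartoon of a plain linear state-space recurrence, not Mamba’s actual selective-scan kernel; the matrices and dimensions are made-up placeholders:

```python
import numpy as np

# However long the stream, the model's "memory" is one fixed-size
# state vector h, updated token by token. A, B, C and all dimensions
# here are placeholders, not Mamba's real parameterization.
d_state, d_in = 16, 8
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(d_state, d_state))  # state transition
B = rng.normal(size=(d_state, d_in))                # input projection
C = rng.normal(size=(d_in, d_state))                # readout

h = np.zeros(d_state)  # total memory: d_state floats, O(1) in seq length
for x in rng.normal(size=(100_000, d_in)):  # stream of 100k "tokens"
    h = A @ h + B @ x  # fold the new token into the compressed state
    y = C @ h          # read a prediction out of that state
# No KV cache: DRAM use never grew past the single state vector.
```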

I find this second direction the most interesting, because it answers most directly the underlying weakness flagged earlier — the absence of a generalization-and-abstraction mechanism in current LLMs.

Reach Beyond DRAM: System-Level Memory Management

The third direction reorganizes memory management not inside the model but at the system layer.

PagedAttention (vLLM) lifts the operating system’s virtual memory paradigm directly into LLM serving. It manages memory in pages to avoid fragmentation, and when DRAM runs short, it is designed so that CPU memory, and beyond that external storage, can be pulled in as another memory tier. On the hardware side, PIM (Processing-In-Memory) keeps coming up: the idea is to embed compute logic into the memory itself, eliminating the data-movement cost between CPU/GPU and DRAM at the source. The shift the EE Times article points to, a center of gravity moving from compute to data movement, is most visibly underway here, at the hardware level.
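A toy version of PagedAttention’s bookkeeping makes the OS analogy concrete. The block size and allocator below are invented for illustration; vLLM’s real allocator is considerably more involved:

```python
# The KV cache lives in fixed-size physical blocks; each sequence keeps
# a block table mapping logical positions to blocks, like an OS page
# table. BLOCK_SIZE and the free-list policy are illustrative only.
BLOCK_SIZE = 16  # tokens per physical block (assumed)

class BlockAllocator:
    def __init__(self, n_blocks):
        self.free = list(range(n_blocks))  # indices of free physical blocks

    def alloc(self):
        if not self.free:
            # A real system would spill to the next tier here
            # (CPU memory, then external storage).
            raise MemoryError("GPU DRAM exhausted")
        return self.free.pop()

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.n_tokens = 0

    def append_token(self):
        # A new block is claimed only when the current one fills up, so
        # waste is bounded by one partial block per sequence.
        if self.n_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.n_tokens += 1

alloc = BlockAllocator(n_blocks=1024)
seq = Sequence(alloc)
for _ in range(40):
    seq.append_token()
print(seq.block_table)  # 3 blocks cover 40 tokens; no large contiguous slab
```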

Closing: Memory Constraints Redrawing the System

These three directions are all attacking the same problem from different layers. Mathematical compression works inside the model, architectural innovation rebuilds how memory itself works, and system-level layering operates over the infrastructure, each pulling at memory efficiency in its own way. Of these, I think the breakthrough most likely to actually shift the problem will come where hardware innovations like PIM meet architectural ones like Mamba. Each direction on its own can only manage partial trade-offs, but where they converge, the structure of the problem itself can change.

In the end, the DRAM crunch is putting one question to the AI industry: how long are we going to keep holding every piece of data raw? If pre-LLM system design centered on “how fast can we compute,” the center is now moving toward “how do we run intelligence inside this much tighter memory budget.” The ability to see that shift first, and to design around it, is going to be, I suspect, a meaningful part of what separates the companies that hold up in AI services from the ones that don’t.