A Quick Summary
Long-horizon tasks are becoming a dominant AI workload, driving huge amounts of new token consumption.
This shift is unlocking a new wave of infrastructure opportunities. We see three promising areas:
- Decouple memory from compute on chips to address the worsening memory shortage
- Build throughput-first inference clouds to improve economics compared to current inference providers, which are generally optimized for low latency
- Create a generalized context “garbage collector” to prune and maintain context windows
A Shifting Paradigm
From 2023 through early 2025, most enterprise AI applications revolved around synchronous, chat-centric experiences. The dominant workload profile was low-latency with relatively little persistent context.

But by late 2025, a different pattern began to emerge. New apps started to focus on long-running background tasks - where latency requirements were more forgiving, but the amount of input and reasoning data was often orders of magnitude larger. Claude Code is the platonic example here, but we see long-horizon tasks becoming the dominant workload type across the application layer.
For example, in AI-Native Services (another major focus area for our firm), long-horizon tasks account for the bulk of inference. After all, if a service delivers outcomes asynchronously, then it follows that much of its inference load can also be asynchronous.
Why this Matters - the OLAP Parallel
The clearest parallel is the rise of the great OLAP players of the last decade. As companies began running analytical queries over cloud-scale datasets, workloads became dramatically more data-intensive. This shift demanded a new stack and produced several category-defining companies worth over $200B today.

It’s an imperfect comparison, but long-horizon workloads share similarly shaped data and latency profiles with large-scale analytical ones, and we believe they will likewise necessitate a new infrastructure paradigm.
Here are three bold opportunities for founders looking to build in the long-horizon era.
Opportunity 1: Deliver AI its Snowflake Moment - Unshackle memory from compute
In long-horizon tasks, memory, not compute, is usually the bottleneck because of their decode-heavy nature.
In these tasks, models perform a myriad of intermediate reasoning steps and tool calls against various data stores and MCP servers. The output tokens from these actions drive up context length immensely, and the KV cache grows roughly 1:1 with every token added.
The ever-ballooning cache can quickly consume all available memory on even a high-end GPU. Without effective pruning or compression, even a moderately sized 70-billion-parameter model can hit the memory wall in under 100,000 tokens (note: 100K tokens is less than 300 pages of text - not a lot for a long-running workload). Frontier models today are around ~1 trillion parameters, and as we push towards 10-trillion-parameter models (like Mythos), the problem will only intensify.
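To make the memory wall concrete, here is a back-of-envelope KV-cache sizing calculation. The architecture numbers are illustrative assumptions for a 70B-class model with grouped-query attention (roughly Llama-3-70B-shaped: 80 layers, 8 KV heads, head dimension 128, fp16); exact figures vary by model.

```python
# Back-of-envelope KV-cache sizing for a 70B-class model.
# Architecture numbers are illustrative assumptions, not any specific model's spec.

def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for keys and values; one entry per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

per_token = kv_cache_bytes(1)          # ~320 KiB of cache per token
cache_100k = kv_cache_bytes(100_000)   # ~32.8 GB of cache at 100K tokens
weights_fp16 = 70e9 * 2                # ~140 GB of weights in fp16

print(f"KV cache per token:      {per_token / 1024:.0f} KiB")
print(f"KV cache at 100K tokens: {cache_100k / 1e9:.1f} GB")
print(f"Weights (fp16):          {weights_fp16 / 1e9:.0f} GB")
```

Even with the weights sharded across several GPUs, tens of gigabytes of per-request cache eat quickly into an 80 GB accelerator, which is why context length, not parameter count alone, hits the wall first.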
This is not a theoretical problem. The effects of the memory crunch are already being felt, as seen by the surging stock prices of major memory manufacturers.
There are a few interesting vectors currently being explored to address this memory shortage:
- Framework-level optimizations (e.g., SGLang, vLLM) that allocate memory more efficiently
- Frontier chips that increase on-chip/local memory capacity (e.g., Fractile)
However, to achieve effectively unlimited memory, we need to decouple memory from compute - essentially doing for inference what Snowflake did for analytics in the 2010s.
The challenge is that decoupling storage from compute is extremely difficult for inference. In OLAP, you can service a single query with a small number of large database scans, requiring relatively few network calls, which makes remote storage viable. In inference, attention requires the model to access and compute over the KV cache for every single generated token; serving that access pattern from network storage would create a massive amount of I/O.
A simple analogy:
Imagine you had to copy a book word for word by hand. If the book is on your desk (local memory), you can copy continuously. If the book is in a library across town (remote memory), and you must walk there and back home for every word, you’ll be unimaginably slow.
Inference has typically required the book to be on the desk.
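To put rough numbers on the analogy: each decoded token touches the full KV cache, so the bandwidth needed scales with cache size times decode rate. The figures below are illustrative assumptions (a ~32.8 GB cache, a modest 50 tokens/sec decode rate, and a ~100 GB/s network link), not measurements.

```python
# Why remote KV storage is hard: every decoded token reads the whole cache.
# All numbers below are illustrative assumptions.

def required_bandwidth_gbs(cache_gb, tokens_per_sec):
    # Attention touches the full KV cache once per generated token.
    return cache_gb * tokens_per_sec  # GB/s

cache_gb = 32.8      # KV cache at ~100K tokens for a 70B-class model
decode_rate = 50     # tokens/sec, a modest decode rate
network_gbs = 100    # ~800 Gb/s NIC expressed in GB/s

need = required_bandwidth_gbs(cache_gb, decode_rate)
print(f"Bandwidth needed: {need:.0f} GB/s vs ~{network_gbs} GB/s over the network")
```

A more-than-10x gap between required and available bandwidth is why the "book" has historically had to stay on the desk (in HBM), and why optics and other bandwidth plays are the interesting levers here.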
Despite the challenge, we are seeing emerging approaches that aim to loosen the locality constraint and drive us closer towards the holy grail of unlimited networked memory. Notable techniques we’ve seen include transferring data over optics instead of copper wires to improve bandwidth.
If someone successfully externalizes memory while maintaining reasonable latency, they will unleash unlimited memory to fuel this generation of long-horizon agents. This would be the Snowflake moment for AI.
Opportunity 2: Build the Throughput Cloud - Embrace latency to deliver superior economics
Today’s neoclouds were built for speed. They emerged to support first-generation, chat-based AI applications, where ultra-low latency was everything.
To achieve that, they rely on a large fleet of always-on premium GPUs, often sacrificing utilization in favor of responsiveness.
But long-horizon workloads flip the requirements. Context is vastly larger, but latency requirements are way more forgiving.
This creates an opportunity to build a new inference architecture that is purpose-built for high-throughput workloads where latency is less of a concern. Such an architecture should in theory be able to achieve meaningfully better inference economics.
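A toy model makes the economics intuition concrete. In a memory-bandwidth-bound decode regime, each step streams the model weights once regardless of batch size, so larger batches amortize that cost across more requests, trading per-request latency for total throughput. All numbers here are illustrative assumptions (fp16 70B-class weights, H100-class HBM bandwidth, a hypothetical per-request KV footprint).

```python
# Toy model of decode economics: bigger batches amortize weight reads.
# All numbers are illustrative assumptions in a bandwidth-bound regime.

def decode_tokens_per_sec(batch, weights_gb=140, hbm_gbs=3350, kv_gb_per_req=8):
    # Each decode step streams the weights once (shared by the whole batch)
    # plus each request's own KV cache.
    bytes_per_step_gb = weights_gb + batch * kv_gb_per_req
    steps_per_sec = hbm_gbs / bytes_per_step_gb   # per-request token rate
    return batch * steps_per_sec                  # total tokens/sec

for b in (1, 8, 64):
    print(f"batch={b:3d}: {decode_tokens_per_sec(b):7.1f} tok/s total")
```

Total throughput rises with batch size while each request's token rate falls, which is exactly the trade a latency-tolerant, throughput-first cloud can make. It also shows why the gains are capped: as KV traffic comes to dominate the weight reads, further batching yields diminishing returns.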
There are a few things founders building in this category need to be aware of:
- Batching and pooling are only part of the answer
  - These are the most obvious levers to improve chip utilization, but gains here are fundamentally capped
  - This is also not a durable advantage - every major inference provider is already doing this to some extent
- Model and hardware routing are a more likely source of lasting differentiation
  - We believe coordinating different models across heterogeneous hardware for prefill and decode is one of the most compelling opportunities to improve economics. It is also technically very difficult to do, making it all the more interesting for startups to tackle.
- Category definition and education will be harder here than in other inference markets
  - In multimodal inference, companies like Fal are winning by owning mindshare for a distinct modality (“image and video inference”)
  - Long-horizon inference is more abstract. It’s not a new modality, but a different workload shape. As a result, it risks being perceived as incremental to text inference
Opportunity 3: Create the Garbage Collector for Context
As we push for longer and longer running tasks, the amount of total state that we want to hold will always outstrip the available space in the context window.
And the brutal reality is that retrieval begins to degrade long before the context window is filled. This is all the more reason why we need solutions that can intelligently compact, replace, and eliminate excessive context as it builds up in the window. This is a major portion of the work involved whenever someone talks about “building a harness.”
Frankly, this is the category where it’s still most unclear to us whether a standalone, generalizable solution can work.
- Classic system memory garbage collection operates on deterministic rules: remove objects that are no longer referenced, and compact the rest using fixed algorithms.
- In contrast, context management requires judgment. It involves deciding what information is still relevant and how to summarize it without losing important signals. Because this varies significantly by use case, it is unclear whether a fully generalized solution is possible.
But we are open and eager to be proven wrong!
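For concreteness, here is a minimal sketch of the mechanical half of a context garbage collector: keep the newest messages verbatim within a token budget and fold everything older into a single summary slot. The `summarize` hook is hypothetical - in practice it would be an LLM call, and it is exactly where the judgment-heavy, use-case-specific part lives.

```python
# Minimal sketch of budget-based context compaction.
# `summarize` is a hypothetical hook; a real system would use an LLM pass.

def summarize(messages):
    # Placeholder for the judgment-heavy step (deciding what still matters).
    return f"[summary of {len(messages)} earlier messages]"

def compact(messages, budget, n_tokens=lambda m: len(m.split())):
    used, keep = 0, []
    # Keep the newest messages verbatim until the token budget is spent...
    for m in reversed(messages):
        if used + n_tokens(m) > budget:
            break
        keep.append(m)
        used += n_tokens(m)
    keep.reverse()
    evicted = messages[: len(messages) - len(keep)]
    # ...and replace everything older with a single summary entry.
    return ([summarize(evicted)] if evicted else []) + keep
```

The eviction policy above is deterministic, like classic garbage collection; everything hard about the category is hidden inside `summarize`, which is why a fully generalized solution remains an open question.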

Closing Thoughts
Agents are taking on longer and longer tasks, placing unprecedented strain on chip memory, inference budgets, and context windows. We believe this is the moment for bold founders to tackle these challenges head-on and build a new generation of defining infrastructure companies. If you’re building or thinking about these challenges, we would love to chat.
Long live long-horizon.