Hyperspace Peer-to-Peer Distributed Cache
Compute once. Cache globally. Serve verified.
White blocks dropping into the grid = cache hits (instant, no compute). Teal dots flying to peer nodes = cache misses (real inference, then cached for everyone). Watch the hit rate climb as the cache fills — fewer teal dots over time.
Every AI lab independently computes identical responses to identical prompts. Hyperspace eliminates this redundancy with a three-layer distributed cache: the first node to answer a question pays the compute cost. Every subsequent node—across the entire planet—gets it verified and free.
At scale.
Labs scale linearly. The network scales logarithmically.
Big AI Lab
- 2x users = 2x GPU cost (linear)
- Every query is a cost center
- KV cache siloed per datacenter
- No sharing across competitors
- $10–50B datacenter CapEx
Hyperspace Network
- 2x users = higher hit rate (sub-linear cost)
- Every query enriches the global cache
- KV state shared across all nodes
- Verified via erasure coding + DAP
- Users' existing GPUs — no datacenter
| Network | Combined hit rate | Energy saved/yr | CO2 avoided/yr |
|---|---|---|---|
| 10K nodes | 30–45% | 110–165 MWh | 47–71 tons |
| 100K nodes | 50–70% | 7,300–15,300 MWh | 3,100–6,600 tons |
| 1M nodes | 65–80% | 142,000–175,000 MWh | 61,000–75,000 tons |
| 10M nodes | 75–90% | 411,000–493,000 MWh | 176,000–211,000 tons |
Assumes 50 req/node/day, ~3 Wh/inference on consumer GPU, 0.429 kg CO2/kWh global average. Range reflects workload mix.
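The CO2 column follows mechanically from the energy column at the stated grid intensity; a quick check in TypeScript against the 10K-node row:

```typescript
// CO2 avoided follows directly from energy saved at the stated
// global-average grid intensity of 0.429 kg CO2 per kWh.
const KG_CO2_PER_KWH = 0.429

function co2TonsAvoided(mwhSaved: number): number {
  const kwh = mwhSaved * 1_000          // MWh -> kWh
  return (kwh * KG_CO2_PER_KWH) / 1_000 // kg -> metric tons
}

co2TonsAvoided(110) // ≈ 47 tons (10K-node lower bound)
co2TonsAvoided(165) // ≈ 71 tons (10K-node upper bound)
```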
Most inference is redundant.
Google reports that only 15% of daily searches are truly novel—the rest are repeats or close variants[1]. Natural language queries follow Zipf's power-law distribution: a small number of popular queries account for the vast majority of traffic[2]. LLM inference inherits this property, with two additional layers of redundancy unique to transformer workloads.
Response-level duplication. Enterprise chatbots report 70–80% of queries falling into a small number of intent categories[3]. API workloads reuse identical templates. The same coding questions, writing requests, and explanations recur across millions of users. Conservative estimate: 30–50% exact or near-exact duplicates for a diverse open network; higher for API-heavy workloads.
Prefix-level duplication. System prompts are identical across 100% of requests within an application—Anthropic, OpenAI, and Google all launched prompt caching products in 2024 specifically because of this[4][5]. SGLang's RadixAttention reports up to 5x speedup from prefix sharing across multi-turn and structured workloads[6]. Moonshot AI's Mooncake system built an entire KV-cache-centric disaggregated architecture around this[7]. Conservative estimate: 60–80% of prefill tokens are shared.
Combined: with response caching catching exact duplicates and KV prefix caching catching shared prefill computation, we estimate 70–90% of total inference compute is redundant at network scale.
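One way to see how the layer estimates combine, as a back-of-envelope model: response hits are fully free, and on the remaining requests prefix caching eliminates only the shared portion of prefill. The prefill fraction f of total compute is our assumption here, not taken from the cited sources:

```typescript
// r: exact/near-exact response-duplicate rate      (0.30-0.50 above)
// p: share of prefill tokens with a cached prefix  (0.60-0.80 above)
// f: prefill's share of total compute (assumed, workload-dependent)
function redundantShare(r: number, p: number, f: number): number {
  // Response hits are fully free; on the remaining (1 - r) requests,
  // prefix caching eliminates the shared portion of prefill.
  return r + (1 - r) * p * f
}

redundantShare(0.40, 0.70, 0.80) // ≈ 0.74
redundantShare(0.50, 0.80, 0.90) // ≈ 0.86
```

The upper end of the band assumes prefill-heavy workloads; generation-heavy traffic lands lower.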
Three layers. One cache.
Each layer catches a different class of redundancy. Together, only 6–30% of requests require full GPU inference. Every completed inference enriches all layers for future requests.
Distributed Response Cache
Every completed inference is stored locally and announced to the network. Future identical requests—from any node, anywhere—are served from cache with cryptographic proof linking the response to its original computation.
On a local miss, the node queries the DHT for /cache/&lt;hash&gt; providers, then stream-fetches via /hyperspace/cache/1.0.0:

```typescript
const hash = SHA256(modelId + params + prompt)

// 1. Local SQLite check (instant)
const local = cache.get(hash)
if (local) return local // ~1ms

// 2. DHT lookup (network)
const providers = await dht.findProviders('/cache/' + hash)
for (const peer of providers) {
  const stream = await node.openStream(peer, '/hyperspace/cache/1.0.0')
  stream.send({ type: 'query', requestHash: hash })
  const msg = await stream.receive()
  if (msg.type === 'hit') {
    // 3. Verify the cache proof before trusting the response
    if (CacheProofGenerator.verifyIntegrity(msg.proof)) {
      cache.set(hash, msg.response) // store locally
      dht.provide('/cache/' + hash) // re-announce as a provider
      return msg.response           // ~50-200ms
    }
  }
}
// 4. Miss everywhere: run inference, then cache and announce the result
```

KV Prefix Cache
The most expensive part of LLM inference is prefill—computing Key-Value attention states for the input. For a given model and token sequence, this is deterministic: same model + same tokens = same KV state. There is zero reason to compute it more than once globally.
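One way to exploit this determinism is to hash the token stream in fixed blocks so each block's hash chains over everything before it, then look up the longest cached prefix under the /kv-prefix/&lt;model&gt;/&lt;hash&gt; keys listed later in this document. A sketch under our assumptions; the block size and helper names are illustrative, not the shipped protocol:

```typescript
// Assumed handles (same primitives as the response-cache snippet above).
declare function SHA256(data: string): string
declare const dht: { findProviders(key: string): Promise<string[]> }

const BLOCK = 256 // tokens per block (illustrative)

// Block-wise prefix hash chain: identical (model, token-prefix) pairs
// always produce the same chain of hashes.
function prefixHashes(modelId: string, tokens: number[]): string[] {
  const hashes: string[] = []
  let chain = SHA256(modelId)
  for (let i = 0; i + BLOCK <= tokens.length; i += BLOCK) {
    chain = SHA256(chain + tokens.slice(i, i + BLOCK).join(','))
    hashes.push(chain)
  }
  return hashes
}

// Longest-prefix lookup: walk from the longest chain hash down and
// take the first one that has providers on the DHT.
async function findCachedPrefix(modelId: string, tokens: number[]) {
  const hashes = prefixHashes(modelId, tokens)
  for (let i = hashes.length - 1; i >= 0; i--) {
    const providers = await dht.findProviders(`/kv-prefix/${modelId}/${hashes[i]}`)
    if (providers.length > 0) return { blocks: i + 1, hash: hashes[i], providers }
  }
  return null // no shared prefix anywhere: run full prefill
}
```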
| Model | 512 tok | 2K tok | 8K tok |
|---|---|---|---|
| Qwen 0.5B | 4 MB | 15 MB | 60 MB |
| Qwen 7B | 38 MB | 150 MB | 600 MB |
| Gemma-3 27B | 150 MB | 600 MB | 2.4 GB |
| Qwen 32B | 175 MB | 700 MB | 2.8 GB |
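These figures follow the standard transformer estimate, 2 (K and V) × layers × KV heads × head dim × bytes per element × tokens; exact values also depend on each model's attention configuration and quantization. A sketch with illustrative architecture numbers (not taken from the table):

```typescript
// Estimate KV cache size in bytes for a given model shape.
function kvBytes(layers: number, kvHeads: number, headDim: number,
                 bytesPerElem: number, tokens: number): number {
  return 2 * layers * kvHeads * headDim * bytesPerElem * tokens
}

// e.g. a GQA model with 28 layers, 4 KV heads, head dim 128, fp16:
kvBytes(28, 4, 128, 2, 2048) / 1e6 // ≈ 117 MB at 2K tokens
```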
On a Hyperspace node, this state is captured via context.getStateData().

Fully peer-to-peer.
KV cache is ephemeral, high-volume, and low-stakes. If a cached chunk is wrong or unavailable, the worst case is re-running prefill. The entire caching layer operates within the P2P network—no centralized infrastructure required.
| Mechanism | Where it lives | Why |
|---|---|---|
| KZG commitment | GossipSub + DHT | 48 bytes, verified locally by any node |
| Erasure chunks | Routing network | Distributed, fetched on demand |
| DAP probing | P2P protocol | Reputation-backed, no validator votes |
| Prefix registry | CRDT (loro) | Delta-only sync, converges in seconds |
| Cache proofs | Ed25519 signatures | Verified locally by any node |
| Incentives | Points system | Capability weight: 8% (response) + 10% (KV) |
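Putting the table together, a fetch on the KV layer might look like the following sketch. The chunk counts, message shapes, and helpers kzgVerify, rsDecode, and DATA_CHUNKS are assumptions for illustration; the protocol ID and DHT key come from the tables in this document:

```typescript
// Assumed handles and helpers: node/dht as in the snippets above;
// kzgVerify, rsDecode, and DATA_CHUNKS are illustrative stand-ins.
declare const dht: { findProviders(key: string): Promise<string[]> }
declare const node: { openStream(peer: string, proto: string): Promise<any> }
declare function kzgVerify(commitment: Uint8Array, index: number, chunk: Uint8Array): boolean
declare function rsDecode(chunks: { index: number; chunk: Uint8Array }[]): Uint8Array
declare const DATA_CHUNKS: number // chunks needed to reconstruct

// Fetch a cached KV prefix: pull erasure-coded chunks over
// /hyperspace/kv-cache/1.0.0, verify each against the 48-byte KZG
// commitment, then Reed-Solomon-decode the KV state.
async function fetchKvPrefix(modelId: string, prefixHash: string, commitment: Uint8Array) {
  const providers = await dht.findProviders(`/kv-prefix/${modelId}/${prefixHash}`)
  const got: { index: number; chunk: Uint8Array }[] = []
  for (const peer of providers) {
    const stream = await node.openStream(peer, '/hyperspace/kv-cache/1.0.0')
    stream.send({ type: 'fetch', prefixHash })
    const msg = await stream.receive()
    if (msg.type === 'chunk' && kzgVerify(commitment, msg.index, msg.chunk)) {
      got.push({ index: msg.index, chunk: msg.chunk })
      if (got.length >= DATA_CHUNKS) break // enough to reconstruct
    }
  }
  // Low stakes by design: if reconstruction fails, just re-run prefill.
  return got.length >= DATA_CHUNKS ? rsDecode(got) : null
}
```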
Hyperspace breakthrough: the routing network as cache fabric.
Hyperspace's routing network provides the missing infrastructure for distributed caching at scale: deterministic delivery guarantees, anonymous availability probing, and built-in incentives for nodes to store and serve popular cache entries. No centralized coordination. No servers. Just the network.
Topics, protocols, and DHT keys.
| Type | Identifier | Purpose |
|---|---|---|
| GossipSub | hyperspace/cache/announcements | Response cache availability |
| GossipSub | hyperspace/kv-prefix/announcements | KV prefix availability |
| Protocol | /hyperspace/cache/1.0.0 | Response cache fetch |
| Protocol | /hyperspace/kv-cache/1.0.0 | KV chunk fetch + prefix routing |
| DHT | /cache/<requestHash> | Response cache provider discovery |
| DHT | /kv-prefix/<model>/<hash> | KV prefix provider discovery |
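As a usage sketch, announcing a freshly cached response touches three of these identifiers. The cache/dht handles are the same assumed primitives as in the lookup snippet above; gossipsub stands in for a pubsub handle subscribed to the announcement topics:

```typescript
// Assumed handles: cache/dht as above; gossipsub is a pubsub handle
// subscribed to the topics in the table.
declare const cache: { set(key: string, value: unknown): void }
declare const dht: { provide(key: string): Promise<void> }
declare const gossipsub: { publish(topic: string, msg: unknown): Promise<void> }

const TOPIC_CACHE = 'hyperspace/cache/announcements'

// After completing an inference, make the result discoverable:
// store it locally, become a DHT provider, and gossip availability.
async function announceResponse(hash: string, response: unknown, proof: Uint8Array) {
  cache.set(hash, { response, proof })           // local SQLite store
  await dht.provide('/cache/' + hash)            // provider record on the DHT
  await gossipsub.publish(TOPIC_CACHE, { hash }) // push availability to peers
}
```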
Where four fields converge.
The Hyperspace distributed cache is not a single innovation. It sits at the intersection of four research domains, each contributing a critical piece that makes the system possible.
The endgame.
Open-weight models are converging on quality with closed models. Training remains competitive—labs differentiate on data, architecture, and RLHF. But inference is a commodity. Two copies of Qwen-32B running the same prompt produce the same KV state and the same response, byte for byte.
A global distributed cache makes this explicit: training is competitive, inference is shared. The first person to ask a question pays the compute cost. Everyone after them gets it for free, with cryptographic proof it is authentic. The marginal cost of intelligence approaches zero.
No lab—no matter how well-funded—can match this. They cannot share caches across competitors. They scale linearly. The network scales logarithmically.