Hyperspace Peer-to-Peer Distributed Cache

Compute once. Cache globally. Serve verified.


Every AI lab independently computes identical responses to identical prompts. Hyperspace eliminates this redundancy with a three-layer distributed cache: the first node to answer a question pays the compute cost. Every subsequent node, anywhere on the planet, gets the same answer verified and free.

3 layers · Response + KV prefix + routing
70–90% · Requests skip full inference
Sub-linear · Cost scaling with users

At scale.

Labs scale linearly. The network scales logarithmically.

Big AI Lab

  • 2x users = 2x GPU cost (linear)
  • Every query is a cost center
  • KV cache siloed per datacenter
  • No sharing across competitors
  • $10–50B datacenter CapEx

Hyperspace Network

  • 2x users = higher hit rate (sub-linear)
  • Every query enriches the global cache
  • KV state shared across all nodes
  • Verified via erasure coding + DAP
  • Users' existing GPUs — no datacenter
Network    | Combined hit rate | Energy saved/yr     | CO2 avoided/yr
10K nodes  | 30–45%            | 110–165 MWh         | 47–71 tons
100K nodes | 50–70%            | 7,300–15,300 MWh    | 3,100–6,600 tons
1M nodes   | 65–80%            | 142,000–175,000 MWh | 61,000–75,000 tons
10M nodes  | 75–90%            | 411,000–493,000 MWh | 176,000–211,000 tons

Assumes 50 req/node/day, ~3 Wh/inference on consumer GPU, 0.429 kg CO2/kWh global average. Range reflects workload mix.
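As a worked check, the footnote's assumptions reproduce the 10M-node row directly (a sketch; the smaller rows fold in different workload mixes):

// Worked check of the 10M-node row from the footnote's assumptions
const nodes = 10_000_000
const reqPerDay = 50                      // requests per node per day
const whPerInference = 3                  // ~3 Wh on a consumer GPU
const kgCo2PerKwh = 0.429                 // global grid average

const totalMwh = nodes * reqPerDay * 365 * whPerInference / 1e6  // 547,500 MWh/yr
const savedMwh = [0.75, 0.90].map(hit => totalMwh * hit)         // ≈ 411,000–493,000 MWh
const tonsCo2 = savedMwh.map(mwh => mwh * kgCo2PerKwh)           // MWh × kg/kWh = tons (1000s cancel)
                                                                 // ≈ 176,000–211,000 tons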

Most inference is redundant.

Today: every lab computes the same thing independently.

  OpenAI    prefill "You are a helpful assistant..." → full inference
  Anthropic prefill "You are a helpful assistant..." → full inference
  Google    prefill "You are a helpful assistant..." → full inference
  Node A    prefill "You are a helpful assistant..." → full inference
  Node B    prefill "You are a helpful assistant..." → full inference

  5x the compute, same result.

Hyperspace: compute once, serve globally.

  Node A  computes → stores → announces to DHT
  Node B  cache hit → instant, verified
  Node C  cache hit → instant, verified

  1x compute, same result.

Google reports that only 15% of daily searches are truly novel—the rest are repeats or close variants[1]. Natural language queries follow Zipf's power-law distribution: a small number of popular queries account for the vast majority of traffic[2]. LLM inference inherits this property, with two additional layers of redundancy unique to transformer workloads.

Response-level duplication. Enterprise chatbots report 70–80% of queries falling into a small number of intent categories[3]. API workloads reuse identical templates. The same coding questions, writing requests, and explanations recur across millions of users. Conservative estimate: 30–50% exact or near-exact duplicates for a diverse open network; higher for API-heavy workloads.

Prefix-level duplication. System prompts are identical across 100% of requests within an application—Anthropic, OpenAI, and Google all launched prompt caching products in 2024 specifically because of this[4][5]. SGLang's RadixAttention reports up to 5x speedup from prefix sharing across multi-turn and structured workloads[6]. Moonshot AI's Mooncake system built an entire KV-cache-centric disaggregated architecture around this[7]. Conservative estimate: 60–80% of prefill tokens are shared.

Combined: with response caching catching exact duplicates and KV prefix caching catching shared prefill computation, we estimate 70–90% of total inference compute is redundant at network scale.

Three layers. One cache.

Each layer catches a different class of redundancy. Together, only 6–30% of requests require full GPU inference. Every completed inference enriches all layers for future requests.

inference request
  ↓ SHA-256(prompt + model + params)

Layer 1 — Response Cache · 85% hit rate

Same prompt = instant cached response. Check local SQLite, then DHT for providers, then stream fetch with CacheProof verification. Fetchers re-announce as providers (popularity amplification).

  hit → instant · <1ms
  ↓ miss

Layer 2 — KV Prefix Cache · 60% prefix match

Same prefix tokens = skip prefill. Erasure-coded KV attention state distributed via routing network. Route inference to node with matching state in VRAM. DAP probing verifies availability.

  hit → skip prefill · ~0.9s
  ↓ miss

Layer 3 — Full Inference · 6% of requests

P2P routing: registry → DHT → PEX → gossip. Hedged requests for tail latency. On completion: response stored in L1, KV state stored in L2. Every inference makes the next one cheaper.

  ~4s · result → L1 + L2
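A quick check of the funnel arithmetic, using the illustrative hit rates shown above (real rates vary with workload):

// Fraction of requests that fall through to full inference
const l1Hit = 0.85                            // Layer 1 response cache hit rate
const l2Hit = 0.60                            // Layer 2 prefix match rate on L1 misses

const reachesL2 = 1 - l1Hit                   // 0.15
const fullInference = reachesL2 * (1 - l2Hit) // 0.15 × 0.40 = 0.06 → 6%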

Distributed Response Cache

Every completed inference is stored locally and announced to the network. Future identical requests—from any node, anywhere—are served from cache with cryptographic proof linking the response to its original computation.

01 Store · On inference complete: SHA-256 the request, store in SQLite, announce to DHT, broadcast to GossipSub.
02 Resolve · Check local SQLite, then DHT /cache/<hash> for providers, then stream fetch via /hyperspace/cache/1.0.0.
03 Verify · CacheProof: SHA-256(requestHash || responseHash || proofHash || computedAt). Ed25519 signed. 24h TTL.
04 Amplify · Fetchers re-announce as providers. Popular responses replicate naturally. Most-requested = most available.
05 Semantic · Local cosine similarity (0.92 threshold). "What is ML?" matches "Explain machine learning" without re-inference.
const hash = SHA256(modelId + params + prompt)

// 1. Local SQLite check (instant)
const local = cache.get(hash)
if (local) return local                 // ~1ms

// 2. DHT lookup (network)
const providers = await dht.findProviders('/cache/' + hash)
for (const peer of providers) {
  const stream = await node.openStream(peer, '/hyperspace/cache/1.0.0')
  stream.send({ type: 'query', requestHash: hash })

  const msg = await stream.receive()
  if (msg.type === 'hit') {
    // 3. Verify cache proof before trusting the response
    if (CacheProofGenerator.verifyIntegrity(msg.proof)) {
      cache.set(hash, msg.response)     // 4. store locally...
      dht.provide('/cache/' + hash)     //    ...and re-announce (amplify)
      return msg.response               // ~50-200ms
    }
    // invalid proof → skip this peer, try the next provider
  }
}

return null                             // miss → fall through to Layer 2
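Step 05's semantic fallback can be sketched as a local nearest-neighbor check. The embed() helper and the entry shape below are hypothetical; only the 0.92 threshold comes from the design above.

// Semantic cache: match paraphrased prompts against locally cached entries
const SIM_THRESHOLD = 0.92

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    na  += a[i] * a[i]
    nb  += b[i] * b[i]
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb))
}

async function semanticLookup(prompt, entries) {
  const v = await embed(prompt)          // hypothetical local embedding model
  let best = null
  for (const e of entries) {             // entries: { vector, response }
    const sim = cosine(v, e.vector)
    if (sim >= SIM_THRESHOLD && (!best || sim > best.sim)) best = { sim, e }
  }
  // "What is ML?" vs "Explain machine learning" should clear the threshold
  return best ? best.e.response : null   // null → continue to DHT lookup
}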

KV Prefix Cache

The most expensive part of LLM inference is prefill—computing Key-Value attention states for the input. For a given model and token sequence, this is deterministic: same model + same tokens = same KV state. There is zero reason to compute it more than once globally.

Model       | 512 tok | 2K tok | 8K tok
Qwen 0.5B   | 4 MB    | 15 MB  | 60 MB
Qwen 7B     | 38 MB   | 150 MB | 600 MB
Gemma-3 27B | 150 MB  | 600 MB | 2.4 GB
Qwen 32B    | 175 MB  | 700 MB | 2.8 GB
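These figures follow the standard transformer KV-cache size formula. A minimal sketch (the layer and head counts below are illustrative assumptions, not official model configs):

// KV bytes = 2 (K and V) × layers × kv_heads × head_dim × tokens × bytes/element
function kvCacheBytes(layers, kvHeads, headDim, tokens, bytesPerElem = 2 /* fp16 */) {
  return 2 * layers * kvHeads * headDim * tokens * bytesPerElem
}

// e.g. an assumed 7B-class GQA config: 28 layers, 4 KV heads, head_dim 128
kvCacheBytes(28, 4, 128, 2048)   // 117,440,512 bytes ≈ 117 MB at 2K tokens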
01 Compute · A node runs prefill. The engine exports KV attention state via context.getStateData().
02 Encode · Reed-Solomon erasure coding: k=32, n=64 chunks. KZG polynomial commitment (48 bytes). Any 32 reconstruct the full state (sketched after this list).
03 Distribute · Chunks pushed via routing network. Commitment gossiped across P2P mesh. DHT tracks chunk providers.
04 Probe · DAP-style availability probing. Probes are indistinguishable from real requests. Failure = reputation penalty.
05 Route · Requests routed to nodes with matching KV prefix in VRAM. No transfer needed—the request goes to the cache.
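A sketch of the encode/distribute/reconstruct round-trip. The ReedSolomon, kzg, routing, and chunkKey interfaces are hypothetical stand-ins, not a real library API; only the k=32/n=64 parameters, the 48-byte commitment, and the announcement topic come from the design above.

// Hypothetical codec interfaces: stand-ins, not a real library API
const rs = new ReedSolomon({ dataShards: 32, parityShards: 32 })  // k=32, n=64

// 01–02: export KV state, encode, commit
const kvState = context.getStateData()        // exported KV attention state
const chunks = rs.encode(kvState)             // 64 chunks; any 32 reconstruct
const commitment = kzg.commit(chunks)         // 48-byte polynomial commitment

// 03: push chunks over the routing network, gossip the commitment
await Promise.all(chunks.map((c, i) => routing.push(chunkKey(prefixHash, i), c)))
gossip.publish('hyperspace/kv-prefix/announcements', { prefixHash, commitment })

// A consuming node fetches any 32 chunks, verifies each, rebuilds
const got = await routing.fetchAny(prefixHash, 32)
for (const { index, chunk, proof } of got) {
  if (!kzg.verifyChunk(commitment, index, chunk, proof)) throw new Error('bad chunk')
}
const rebuilt = rs.reconstruct(got)           // byte-identical KV state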

Fully peer-to-peer.

KV cache is ephemeral, high-volume, and low-stakes. If a cached chunk is wrong or unavailable, the worst case is re-running prefill. The entire caching layer operates within the P2P network—no centralized infrastructure required.

Mechanism       | Where it lives     | Why
KZG commitment  | GossipSub + DHT    | 48 bytes, verified locally by any node
Erasure chunks  | Routing network    | Distributed, fetched on demand
DAP probing     | P2P protocol       | Reputation-backed, no validator votes
Prefix registry | CRDT (loro)        | Delta-only sync, converges in seconds
Cache proofs    | Ed25519 signatures | Verified locally by any node
Incentives      | Points system      | Capability weight: 8% (response) + 10% (KV)

Hyperspace breakthrough: the routing network as cache fabric.

Hyperspace's routing network provides the missing infrastructure for distributed caching at scale: deterministic delivery guarantees, anonymous availability probing, and built-in incentives for nodes to store and serve popular cache entries. No centralized coordination. No servers. Just the network.

KV Cache Distribution via Routing Network

Node A computes the KV state and exports ~150 MB
  → Erasure Code: Reed-Solomon 32-of-64, KZG commitment (48B)
  → routing network (R1 … R6)
  → Node B downloads any 32 chunks and reconstructs the KV state

Deterministic Delivery

Bounded retries. Predictable latency. Capacity-weighted tree routing. Chunks arrive even under adversarial load.

Anonymous DAP Probing

Peers probe random chunks. Probes look like real requests. Can't serve probes while withholding from real users.

Incentive-Aligned Storage

Nodes earn points for serving chunks. Popular data = more requests = more reward. Cache fills itself with what matters.
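What makes probes anonymous is that they reuse the ordinary chunk-fetch path end to end. A sketch (the reputation helper is hypothetical; the protocol ID comes from the table below):

// Anonymous availability probe: identical wire format to a real chunk fetch
async function dapProbe(peer, prefixHash, commitment) {
  const index = Math.floor(Math.random() * 64)      // pick a random chunk of 64
  const stream = await node.openStream(peer, '/hyperspace/kv-cache/1.0.0')
  stream.send({ type: 'fetch', hash: prefixHash, index })  // looks like real demand

  const msg = await stream.receive()
  const ok = msg.type === 'chunk' &&
             kzg.verifyChunk(commitment, index, msg.chunk, msg.proof)

  // A peer can't serve probes while withholding from real users,
  // because it can't tell them apart. Failures cost reputation.
  reputation.record(peer, ok)
}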

Topics, protocols, and DHT keys.

Type      | Identifier                         | Purpose
GossipSub | hyperspace/cache/announcements     | Response cache availability
GossipSub | hyperspace/kv-prefix/announcements | KV prefix availability
Protocol  | /hyperspace/cache/1.0.0            | Response cache fetch
Protocol  | /hyperspace/kv-cache/1.0.0         | KV chunk fetch + prefix routing
DHT       | /cache/<requestHash>               | Response cache provider discovery
DHT       | /kv-prefix/<model>/<hash>          | KV prefix provider discovery
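Wiring these identifiers into a node, in the style of the earlier sketches (the handler names are hypothetical):

// Announcements: subscribe to both cache topics
gossip.subscribe('hyperspace/cache/announcements', onResponseAnnounce)
gossip.subscribe('hyperspace/kv-prefix/announcements', onKvPrefixAnnounce)

// Fetch protocols: serve L1 responses and L2 chunks to peers
node.handle('/hyperspace/cache/1.0.0', serveResponseCache)
node.handle('/hyperspace/kv-cache/1.0.0', serveKvChunks)

// Discovery: announce what this node can serve
dht.provide('/cache/' + requestHash)
dht.provide('/kv-prefix/' + modelId + '/' + prefixHash)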

Where four fields converge.

The Hyperspace distributed cache is not a single innovation. It sits at the intersection of four research domains, each contributing a critical piece that makes the system possible.

AI Inference

KV attention state is deterministic. Same model + same tokens = identical Key-Value cache, always. This is the fundamental property that makes global caching possible—the output of prefill is a pure function of the input, so it can be computed once and reused everywhere. vLLM, SGLang, and Mooncake exploit this within a single datacenter. We exploit it across a planet.

Cryptographic Verification

KZG polynomial commitments + Ed25519 cache proofs. A 48-byte commitment lets any node verify any chunk of a 600 MB KV cache without downloading the whole thing. Cache proofs create an unforgeable chain from cached response back to original inference. Trust without re-computation.

Distributed Systems: Routing

Stake-secured routing network with DAP probing. The Hyperspace routing network provides deterministic packet delivery, anonymous availability probing, and micropayment-incentivized forwarding. Erasure-coded KV cache chunks ride the same infrastructure as blockchain data blobs—same delivery guarantees, same liveness verification.

Mathematics: Erasure Coding

Reed-Solomon codes + power-law query distributions. Erasure coding means any 32 of 64 chunks reconstruct the full KV state—no single point of failure. Zipf's law guarantees that popular prefixes are requested exponentially more often than rare ones, so the cache naturally concentrates on the queries that matter most.

The endgame.

Open-weight models are converging in quality with closed models. Training remains competitive—labs differentiate on data, architecture, and RLHF. But inference is a commodity. Two copies of Qwen-32B running the same prompt produce the same KV state and the same response, byte for byte.

A global distributed cache makes this explicit: training is competitive, inference is shared. The first person to ask a question pays the compute cost. Everyone after them gets it for free, with cryptographic proof it is authentic. The marginal cost of intelligence approaches zero.

No lab—no matter how well-funded—can match this. They cannot share caches across competitors. They scale linearly. The network scales logarithmically.

Sources.

[1] Google Search: 15% of daily queries are novel. Cited by Pandu Nayak (VP Search), reaffirmed by Danny Sullivan (Search Liaison).
[2] Power laws: Clauset, Shalizi & Newman, "Power-law distributions in empirical data," SIAM Review, 2009. arXiv:0706.1062
[3] Chatbot intents: Rasa, IBM Watson, Microsoft Bot Framework: 70–80% of customer queries fall into the top 20–50 intents.
[4] Anthropic: Claude prompt caching (2024): cached tokens at 1/10th cost, 5-minute TTL, system prompt reuse.
[5] OpenAI: Prompt caching (Oct 2024): 50% discount, automatic for shared 1024+ token prefixes.
[6] SGLang: Zheng et al., "SGLang: Efficient Execution of Structured Language Model Programs," 2024. arXiv:2312.07104
[7] Mooncake: Qin et al., "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving," 2024. arXiv:2407.00079