Hyperspace Peer-to-Peer Distributed Cache

Compute once. Cache globally. Serve verified.


Every AI lab independently computes identical responses to identical prompts. Hyperspace eliminates this redundancy with a three-layer distributed cache: the first node to answer a question pays the compute cost. Every subsequent node, anywhere on the planet, gets the same answer verified and free.

3 layers · Response + KV prefix + routing
70–90% · Requests skip full inference
Sub-linear · Cost scaling with users

At scale.

Labs scale linearly. The network scales logarithmically.

Big AI Lab

  • 2x users = 2x GPU cost (linear)
  • Every query is a cost center
  • KV cache siloed per datacenter
  • No sharing across competitors
  • $10–50B datacenter CapEx

Hyperspace Network

  • 2x users = higher hit rate (sub-linear)
  • Every query enriches the global cache
  • KV state shared across all nodes
  • Verified via erasure coding + DAP
  • Users' existing GPUs — no datacenter
Network    | Combined hit rate | Energy saved/yr     | CO2 avoided/yr
10K nodes  | 30–45%            | 110–165 MWh         | 47–71 tons
100K nodes | 50–70%            | 7,300–15,300 MWh    | 3,100–6,600 tons
1M nodes   | 65–80%            | 142,000–175,000 MWh | 61,000–75,000 tons
10M nodes  | 75–90%            | 411,000–493,000 MWh | 176,000–211,000 tons

Assumes 50 req/node/day, ~3 Wh/inference on consumer GPU, 0.429 kg CO2/kWh global average. Range reflects workload mix.
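As a worked check, the footnote's assumptions reproduce the 10M-node row directly (a sketch; the smaller rows fold in different workload mixes):

// Worked check of the 10M-node row from the footnote's assumptions
const nodes = 10_000_000
const reqPerDay = 50                      // requests per node per day
const whPerInference = 3                  // ~3 Wh on a consumer GPU
const kgCo2PerKwh = 0.429                 // global grid average

const totalMwh = nodes * reqPerDay * 365 * whPerInference / 1e6  // 547,500 MWh/yr
const savedMwh = [0.75, 0.90].map(hit => totalMwh * hit)         // ≈ 411,000–493,000 MWh
const tonsCo2 = savedMwh.map(mwh => mwh * kgCo2PerKwh)           // MWh × kg/kWh = tons (1000s cancel)
                                                                 // ≈ 176,000–211,000 tons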

Most inference is redundant.

Today: every lab computes the same thing independently.

  OpenAI    prefill "You are a helpful assistant..." → full inference
  Anthropic prefill "You are a helpful assistant..." → full inference
  Google    prefill "You are a helpful assistant..." → full inference
  Node A    prefill "You are a helpful assistant..." → full inference
  Node B    prefill "You are a helpful assistant..." → full inference

  5x the compute, same result.

Hyperspace: compute once, serve globally.

  Node A  computes → stores → announces to DHT
  Node B  cache hit → instant, verified
  Node C  cache hit → instant, verified

  1x compute, same result.

Google reports that only 15% of daily searches are truly novel—the rest are repeats or close variants[1]. Natural language queries follow Zipf's power-law distribution: a small number of popular queries account for the vast majority of traffic[2]. LLM inference inherits this property, with two additional layers of redundancy unique to transformer workloads.

Response-level duplication. Enterprise chatbots report 70–80% of queries falling into a small number of intent categories[3]. API workloads reuse identical templates. The same coding questions, writing requests, and explanations recur across millions of users. Conservative estimate: 30–50% exact or near-exact duplicates for a diverse open network; higher for API-heavy workloads.

Prefix-level duplication. System prompts are identical across 100% of requests within an application—Anthropic, OpenAI, and Google all launched prompt caching products in 2024 specifically because of this[4][5]. SGLang's RadixAttention reports up to 5x speedup from prefix sharing across multi-turn and structured workloads[6]. Moonshot AI's Mooncake system built an entire KV-cache-centric disaggregated architecture around this[7]. Conservative estimate: 60–80% of prefill tokens are shared.

Combined: with response caching catching exact duplicates and KV prefix caching catching shared prefill computation, we estimate 70–90% of total inference compute is redundant at network scale.

Three layers. One cache.

Each layer catches a different class of redundancy. Together, only 6–30% of requests require full GPU inference. Every completed inference enriches all layers for future requests.

inference request
  ↓ SHA-256(prompt + model + params)

Layer 1 — Response Cache · 85% hit rate

Same prompt = instant cached response. Check local SQLite, then DHT for providers, then stream fetch with CacheProof verification. Fetchers re-announce as providers (popularity amplification).

  hit → instant · <1ms
  ↓ miss

Layer 2 — KV Prefix Cache · 60% prefix match

Same prefix tokens = skip prefill. Erasure-coded KV attention state distributed via routing network. Route inference to node with matching state in VRAM. DAP probing verifies availability.

  hit → skip prefill · ~0.9s
  ↓ miss

Layer 3 — Full Inference · 6% of requests

P2P routing: registry → DHT → PEX → gossip. Hedged requests for tail latency. On completion: response stored in L1, KV state stored in L2. Every inference makes the next one cheaper.

  ~4s · result → L1 + L2
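A quick check of the funnel arithmetic, using the illustrative hit rates shown above (real rates vary with workload):

// Fraction of requests that fall through to full inference
const l1Hit = 0.85                            // Layer 1 response cache hit rate
const l2Hit = 0.60                            // Layer 2 prefix match rate on L1 misses

const reachesL2 = 1 - l1Hit                   // 0.15
const fullInference = reachesL2 * (1 - l2Hit) // 0.15 × 0.40 = 0.06 → 6%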

Distributed Response Cache

Every completed inference is stored locally and announced to the network. Future identical requests—from any node, anywhere—are served from cache with cryptographic proof linking the response to its original computation.

01 Store · On inference complete: SHA-256 the request, store in SQLite, announce to DHT, broadcast to GossipSub.
02 Resolve · Check local SQLite, then DHT /cache/<hash> for providers, then stream fetch via /hyperspace/cache/1.0.0.
03 Verify · CacheProof: SHA-256(requestHash || responseHash || proofHash || computedAt). Ed25519 signed. 24h TTL.
04 Amplify · Fetchers re-announce as providers. Popular responses replicate naturally. Most-requested = most available.
05 Semantic · Local cosine similarity (0.92 threshold). "What is ML?" matches "Explain machine learning" without re-inference.
const hash = SHA256(modelId + params + prompt)

// 1. Local SQLite check (instant)
const local = cache.get(hash)
if (local) return local                 // ~1ms

// 2. DHT lookup (network)
const providers = await dht.findProviders('/cache/' + hash)
for (const peer of providers) {
  const stream = await node.openStream(peer, '/hyperspace/cache/1.0.0')
  stream.send({ type: 'query', requestHash: hash })

  const msg = await stream.receive()
  if (msg.type === 'hit') {
    // 3. Verify cache proof before trusting the response
    if (CacheProofGenerator.verifyIntegrity(msg.proof)) {
      cache.set(hash, msg.response)     // 4. store locally...
      dht.provide('/cache/' + hash)     //    ...and re-announce (amplify)
      return msg.response               // ~50-200ms
    }
    // invalid proof → skip this peer, try the next provider
  }
}

return null                             // miss → fall through to Layer 2
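Step 05's semantic fallback can be sketched as a local nearest-neighbor check. The embed() helper and the entry shape below are hypothetical; only the 0.92 threshold comes from the design above.

// Semantic cache: match paraphrased prompts against locally cached entries
const SIM_THRESHOLD = 0.92

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    na  += a[i] * a[i]
    nb  += b[i] * b[i]
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb))
}

async function semanticLookup(prompt, entries) {
  const v = await embed(prompt)          // hypothetical local embedding model
  let best = null
  for (const e of entries) {             // entries: { vector, response }
    const sim = cosine(v, e.vector)
    if (sim >= SIM_THRESHOLD && (!best || sim > best.sim)) best = { sim, e }
  }
  // "What is ML?" vs "Explain machine learning" should clear the threshold
  return best ? best.e.response : null   // null → continue to DHT lookup
}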

KV Prefix Cache

The most expensive part of LLM inference is prefill—computing Key-Value attention states for the input. For a given model and token sequence, this is deterministic: same model + same tokens = same KV state. There is zero reason to compute it more than once globally.

Model       | 512 tok | 2K tok | 8K tok
Qwen 0.5B   | 4 MB    | 15 MB  | 60 MB
Qwen 7B     | 38 MB   | 150 MB | 600 MB
Gemma-3 27B | 150 MB  | 600 MB | 2.4 GB
Qwen 32B    | 175 MB  | 700 MB | 2.8 GB
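These figures follow the standard transformer KV-cache size formula. A minimal sketch (the layer and head counts below are illustrative assumptions, not official model configs):

// KV bytes = 2 (K and V) × layers × kv_heads × head_dim × tokens × bytes/element
function kvCacheBytes(layers, kvHeads, headDim, tokens, bytesPerElem = 2 /* fp16 */) {
  return 2 * layers * kvHeads * headDim * tokens * bytesPerElem
}

// e.g. an assumed 7B-class GQA config: 28 layers, 4 KV heads, head_dim 128
kvCacheBytes(28, 4, 128, 2048)   // 117,440,512 bytes ≈ 117 MB at 2K tokens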
01 Compute · A node runs prefill. The engine exports KV attention state via context.getStateData().
02 Encode · Reed-Solomon erasure coding: k=32, n=64 chunks. KZG polynomial commitment (48 bytes). Any 32 reconstruct the full state (sketched after this list).
03 Distribute · Chunks pushed via routing network. Commitment gossiped across P2P mesh. DHT tracks chunk providers.
04 Probe · DAP-style availability probing. Probes are indistinguishable from real requests. Failure = reputation penalty.
05 Route · Requests routed to nodes with matching KV prefix in VRAM. No transfer needed—the request goes to the cache.
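A sketch of the encode/distribute/reconstruct round-trip. The ReedSolomon, kzg, routing, and chunkKey interfaces are hypothetical stand-ins, not a real library API; only the k=32/n=64 parameters, the 48-byte commitment, and the announcement topic come from the design above.

// Hypothetical codec interfaces: stand-ins, not a real library API
const rs = new ReedSolomon({ dataShards: 32, parityShards: 32 })  // k=32, n=64

// 01–02: export KV state, encode, commit
const kvState = context.getStateData()        // exported KV attention state
const chunks = rs.encode(kvState)             // 64 chunks; any 32 reconstruct
const commitment = kzg.commit(chunks)         // 48-byte polynomial commitment

// 03: push chunks over the routing network, gossip the commitment
await Promise.all(chunks.map((c, i) => routing.push(chunkKey(prefixHash, i), c)))
gossip.publish('hyperspace/kv-prefix/announcements', { prefixHash, commitment })

// A consuming node fetches any 32 chunks, verifies each, rebuilds
const got = await routing.fetchAny(prefixHash, 32)
for (const { index, chunk, proof } of got) {
  if (!kzg.verifyChunk(commitment, index, chunk, proof)) throw new Error('bad chunk')
}
const rebuilt = rs.reconstruct(got)           // byte-identical KV state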

Fully peer-to-peer.

KV cache is ephemeral, high-volume, and low-stakes. If a cached chunk is wrong or unavailable, the worst case is re-running prefill. The entire caching layer operates within the P2P network—no centralized infrastructure required.

Mechanism       | Where it lives     | Why
KZG commitment  | GossipSub + DHT    | 48 bytes, verified locally by any node
Erasure chunks  | Routing network    | Distributed, fetched on demand
DAP probing     | P2P protocol       | Reputation-backed, no validator votes
Prefix registry | CRDT (loro)        | Delta-only sync, converges in seconds
Cache proofs    | Ed25519 signatures | Verified locally by any node
Incentives      | Points system      | Capability weight: 8% (response) + 10% (KV)

Hyperspace breakthrough: the routing network as cache fabric.

Hyperspace's routing network provides the missing infrastructure for distributed caching at scale: deterministic delivery guarantees, anonymous availability probing, and built-in incentives for nodes to store and serve popular cache entries. No centralized coordination. No servers. Just the network.

KV Cache Distribution via Routing Network

Node A computes the KV state and exports ~150 MB
  → Erasure Code: Reed-Solomon 32-of-64, KZG commitment (48B)
  → routing network (R1 … R6)
  → Node B downloads any 32 chunks and reconstructs the KV state

Deterministic Delivery

Bounded retries. Predictable latency. Capacity-weighted tree routing. Chunks arrive even under adversarial load.

Anonymous DAP Probing

Peers probe random chunks. Probes look like real requests. Can't serve probes while withholding from real users.

Incentive-Aligned Storage

Nodes earn points for serving chunks. Popular data = more requests = more reward. Cache fills itself with what matters.
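What makes probes anonymous is that they reuse the ordinary chunk-fetch path end to end. A sketch (the reputation helper is hypothetical; the protocol ID comes from the table below):

// Anonymous availability probe: identical wire format to a real chunk fetch
async function dapProbe(peer, prefixHash, commitment) {
  const index = Math.floor(Math.random() * 64)      // pick a random chunk of 64
  const stream = await node.openStream(peer, '/hyperspace/kv-cache/1.0.0')
  stream.send({ type: 'fetch', hash: prefixHash, index })  // looks like real demand

  const msg = await stream.receive()
  const ok = msg.type === 'chunk' &&
             kzg.verifyChunk(commitment, index, msg.chunk, msg.proof)

  // A peer can't serve probes while withholding from real users,
  // because it can't tell them apart. Failures cost reputation.
  reputation.record(peer, ok)
}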

Topics, protocols, and DHT keys.

Type      | Identifier                         | Purpose
GossipSub | hyperspace/cache/announcements     | Response cache availability
GossipSub | hyperspace/kv-prefix/announcements | KV prefix availability
Protocol  | /hyperspace/cache/1.0.0            | Response cache fetch
Protocol  | /hyperspace/kv-cache/1.0.0         | KV chunk fetch + prefix routing
DHT       | /cache/<requestHash>               | Response cache provider discovery
DHT       | /kv-prefix/<model>/<hash>          | KV prefix provider discovery
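Wiring these identifiers into a node, in the style of the earlier sketches (the handler names are hypothetical):

// Announcements: subscribe to both cache topics
gossip.subscribe('hyperspace/cache/announcements', onResponseAnnounce)
gossip.subscribe('hyperspace/kv-prefix/announcements', onKvPrefixAnnounce)

// Fetch protocols: serve L1 responses and L2 chunks to peers
node.handle('/hyperspace/cache/1.0.0', serveResponseCache)
node.handle('/hyperspace/kv-cache/1.0.0', serveKvChunks)

// Discovery: announce what this node can serve
dht.provide('/cache/' + requestHash)
dht.provide('/kv-prefix/' + modelId + '/' + prefixHash)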

Where four fields converge.

The Hyperspace distributed cache is not a single innovation. It sits at the intersection of four research domains, each contributing a critical piece that makes the system possible.

AI Inference

KV attention state is deterministic. Same model + same tokens = identical Key-Value cache, always. This is the fundamental property that makes global caching possible—the output of prefill is a pure function of the input, so it can be computed once and reused everywhere. vLLM, SGLang, and Mooncake exploit this within a single datacenter. We exploit it across a planet.

Cryptographic Verification

KZG polynomial commitments + Ed25519 cache proofs. A 48-byte commitment lets any node verify any chunk of a 600 MB KV cache without downloading the whole thing. Cache proofs create an unforgeable chain from cached response back to original inference. Trust without re-computation.

Distributed Systems: Routing

Stake-secured routing network with DAP probing. The Hyperspace routing network provides deterministic packet delivery, anonymous availability probing, and micropayment-incentivized forwarding. Erasure-coded KV cache chunks ride the same infrastructure as blockchain data blobs—same delivery guarantees, same liveness verification.

Mathematics: Erasure Coding

Reed-Solomon codes + power-law query distributions. Erasure coding means any 32 of 64 chunks reconstruct the full KV state—no single point of failure. Zipf's law guarantees that popular prefixes are requested exponentially more often than rare ones, so the cache naturally concentrates on the queries that matter most.

The endgame.

Open-weight models are converging in quality with closed models. Training remains competitive—labs differentiate on data, architecture, and RLHF. But inference is a commodity. Two copies of Qwen-32B running the same prompt produce the same KV state and the same response, byte for byte.

A global distributed cache makes this explicit: training is competitive, inference is shared. The first person to ask a question pays the compute cost. Everyone after them gets it for free, with cryptographic proof it is authentic. The marginal cost of intelligence approaches zero.

No lab—no matter how well-funded—can match this. They cannot share caches across competitors. They scale linearly. The network scales logarithmically.

Sources.

[1] Google Search: 15% of daily queries are novel. Cited by Pandu Nayak (VP Search), reaffirmed by Danny Sullivan (Search Liaison).
[2] Power laws: Clauset, Shalizi & Newman, "Power-law distributions in empirical data," SIAM Review, 2009. arXiv:0706.1062
[3] Chatbot intents: Rasa, IBM Watson, Microsoft Bot Framework: 70–80% of customer queries fall into the top 20–50 intents.
[4] Anthropic: Claude prompt caching (2024): cached tokens at 1/10th cost, 5-minute TTL, system prompt reuse.
[5] OpenAI: Prompt caching (Oct 2024): 50% discount, automatic for shared 1024+ token prefixes.
[6] SGLang: Zheng et al., "SGLang: Efficient Execution of Structured Language Model Programs," 2024. arXiv:2312.07104
[7] Mooncake: Qin et al., "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving," 2024. arXiv:2407.00079