The surveillance-capitalism business model that defines the Internet today is only going to get more imposing. The vast majority of our online requests today are serviced and logged by centralized infrastructure, and that infrastructure is more centralized than most of us expect.
While our collective hivemind takes rightful pride in the successful pushes that have improved this situation, most notably encryption in transit (HTTPS), we are still very much losing the war on metadata. Even when the payload is opaque, the who, when, and where of data access reveals an unfortunate amount about our social networks and our behavior.
This isn’t a fundamental tradeoff, but we need to invest in and evolve our systems to protect ourselves from the second-order effects of metadata collection.
Efficiency & Privacy
Centralization is not an inherent evil, and it is on the path of least resistance for improved performance. It is the second-order effects that are the main risk.
Caching data at the “edge”, physically closer to the user, is a natural performance optimization that minimizes speed-of-light latency. In a well-designed system this should align with our privacy goals: fewer hops in the network see requests and traffic. Similar performance pressures lead to single entities controlling constrained back-haul infrastructure (efficient spanning trees). This by itself is not a problem, but it is natural for these powerful entities to then want to extract value from the data they transit from their privileged positions. Especially where infrastructure providers extend into smarter ‘value added’ services, this secondary effect of value extraction leads to unfortunate designs for the collection, logging, and eventually manipulation of traffic.
With the rise of advanced traffic analysis and machine learning, the “anonymity” we like to assume, that our requests aren’t analyzed because they are ‘hidden’ within a huge volume of traffic, is no longer realistic. As analytical capabilities increase, the power structures exploiting this data will become more effective and will work even harder to embed themselves into core infrastructure.
What does a better structure look like?
To build a CDN resilient to modern passive and active surveillance, we need to go quite a bit beyond encryption. We need infrastructure and systems designed to limit metadata leakage. The good news is that there are already both good research ideas and deployed systems that chip away at many parts of this problem.
Decoupling Identity from Intent (Oblivious HTTP)
The most immediate path, already charted by the IETF’s Oblivious HTTP work (RFC 9458) and Apple’s Private Relay, is to have an independent entity relay traffic between the client and the content provider. The intermediary then knows the user’s IP address but not the piece of data being asked for, while the content provider knows the content served but not the user’s identity.
This “de-linking” is important, but it is not by itself the end of the story. In the last decade, we have seen how easy it is to fingerprint the traffic signatures associated with visiting a website (which will involve loading a range of resources, each of a different size). A more effective mental model may be to think about the traffic patterns that would be generated by a series of back-and-forth conversations. Protecting metadata in this ‘repeated game’ scope will yield different systems than limiting scope to a single request.
Differential Privacy and Cover Traffic
These fingerprinting concerns have been the impetus for a range of research looking at defenses. One important piece of intuition that has emerged from this field is that we must be willing to stray from optimal efficiency. There are a number of ways to do this: we could inject ‘fake’ traffic, fit requests into a pre-defined pattern, or increase latency to grow an anonymity set.
Some examples of systems taking different approaches in this design space include:
- Nym adds differential cover traffic to make an argument for statistical deniability in its mixnet design, while Tor trades off its resistance to a “global passive adversary” against latency and practicality concerns.
- Pond was a proof-of-concept messenger demonstrating usage-agnostic communication patterns.
- Mullvad offers the ability to add cover traffic to reduce classifiability of individual webpages.
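One of the options above, fitting traffic into a pre-defined pattern, can be sketched as size bucketing: every payload is padded up to one of a few fixed sizes before it is encrypted, so an observer learns only the bucket and not the exact length. This is a minimal illustration and not how any of the listed systems implement it; the names `MIN_BUCKET`, `bucket_size`, `pad`, and `unpad` and the 1 KiB floor are assumptions made for the example.

```python
MIN_BUCKET = 1024  # hypothetical smallest padded size, in bytes


def bucket_size(n: int, min_bucket: int = MIN_BUCKET) -> int:
    """Return the smallest bucket (min_bucket times a power of two) that holds n bytes."""
    size = min_bucket
    while size < n:
        size *= 2
    return size


def pad(payload: bytes) -> bytes:
    """Prefix a 4-byte length header, then zero-pad up to the bucket size.

    Padding is meant to be applied before encryption, so the padding
    bytes themselves are never visible on the wire.
    """
    framed = len(payload).to_bytes(4, "big") + payload
    return framed + bytes(bucket_size(len(framed)) - len(framed))


def unpad(padded: bytes) -> bytes:
    """Recover the original payload using the length header."""
    length = int.from_bytes(padded[:4], "big")
    return padded[4:4 + length]
```

The cost is visible immediately: a 1,021-byte payload no longer fits the 1 KiB bucket once the header is added and is sent as 2 KiB. That wasted bandwidth is the departure from optimal efficiency described above.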
Private Information Retrieval (PIR)
“Private Information Retrieval” refers to the class of systems that answer a specific question: how can a user retrieve an item from a database (or cache) without the database learning which item was selected? While historically computationally expensive, recent advances suggest that sub-second, privacy-preserving cache lookups can be possible at scale.
- Kohaku is an Ethereum wallet project that demonstrates using PIR to hide reads.
- iPhone’s Live Caller ID Lookup is currently the largest deployment of PIR.
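To make the core question concrete, here is a sketch of the classic two-server XOR construction for PIR. It assumes two servers that hold identical copies of the database and do not collude, and it assumes equal-size blocks. Deployed systems may well use different constructions (often single-server ones built on homomorphic encryption), so treat this as an illustration of the idea only. All function names are made up for the example.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_queries(n: int, index: int) -> tuple[list[int], list[int]]:
    """Client: build two bit-vectors that differ only at position `index`.

    Each vector on its own is uniformly random, so neither server
    learns anything about `index` unless the two servers collude.
    """
    q1 = [secrets.randbelow(2) for _ in range(n)]
    q2 = list(q1)
    q2[index] ^= 1
    return q1, q2


def answer(db: list[bytes], query: list[int]) -> bytes:
    """Server: XOR together every block whose query bit is set."""
    acc = bytes(len(db[0]))
    for block, bit in zip(db, query):
        if bit:
            acc = xor_bytes(acc, block)
    return acc


def reconstruct(a1: bytes, a2: bytes) -> bytes:
    """Client: blocks selected by both queries cancel, leaving the wanted one."""
    return xor_bytes(a1, a2)
```

A lookup is then `reconstruct(answer(db, q1), answer(db, q2))` with `q1, q2 = make_queries(len(db), i)`. Note that each server touches roughly half the database per query, which is where the historical computational expense comes from.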
Content Addressing and Blinding
A variety of more exotic techniques for data transfer have been explored in the context of content-addressed systems like BitTorrent and IPFS. A number of useful ideas have resurfaced in these contexts:
- Files, and data generally, can be thought of as a series of fixed-size ‘chunks’, which helps with speed and is already a prerequisite for the preceding constructions.
- By requesting data by its hash, the response becomes verifiable by the client, so we can split who is responsible for ‘availability’ (any other peer) from who vouches for what the data is (the source that led us to the data in the first place). It also means that we don’t have to go to a single origin, and can more naturally take advantage of caches.
- We can separate ‘discovery’ (the DNS equivalent of figuring out who might have data) from the transfer of the individual blocks from those peers, and get past a standard client-server model with minimal additional cognitive complexity.
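The chunking and verification ideas in this list can be sketched in a few lines. This is a simplified illustration, not the BitTorrent or IPFS format: it uses a flat list of SHA-256 digests as the manifest, where real systems use richer structures such as Merkle trees. `CHUNK_SIZE`, `build_manifest`, `fetch_verified`, and the `fetch` callback are hypothetical names for the example.

```python
import hashlib
from typing import Callable

CHUNK_SIZE = 256 * 1024  # hypothetical chunk size, in bytes


def chunk(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def address(block: bytes) -> str:
    """The content address of a block is the hex SHA-256 of its bytes."""
    return hashlib.sha256(block).hexdigest()


def build_manifest(data: bytes, size: int = CHUNK_SIZE) -> list[str]:
    """Source side: publish the ordered list of chunk addresses."""
    return [address(c) for c in chunk(data, size)]


def fetch_verified(manifest: list[str], fetch: Callable[[str], bytes]) -> bytes:
    """Client side: fetch each chunk from any peer and check it against its address."""
    blocks = []
    for digest in manifest:
        block = fetch(digest)  # `fetch` may talk to an untrusted cache or peer
        if address(block) != digest:
            raise ValueError(f"chunk {digest} failed verification")
        blocks.append(block)
    return b"".join(blocks)
```

The manifest is the only thing that has to come from the trusted source. The chunks themselves can come from whichever cache or peer is closest, because a tampered chunk fails the hash check.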
Reducing Centralization & Segmenting Information
There have been a number of projects in the last year, mostly riding the wave of interest in ‘DePIN’ (decentralized physical infrastructure networks), that have looked at economic models for how protocols could directly split earnings with participating network nodes. This extends the coordination ideas from cryptocurrencies to how things like CDNs could be constructed, incentivizing a decentralized group of participants to operate caches and other parts of the overall network around the world.
These systems sit somewhat orthogonal to a body of prior research on ‘Sybils’, which indicates that some additional coordination mechanism is needed to actually reduce centralization. Conceptually, if you set up incentives so that many small participants earn more by forming a network than one big central player would, the large player can generally split up their resources and make themselves look like multiple smaller entities (called ‘Sybils’). This means there needs to be some mechanism to confirm that different entities are really ‘independent’, if that is a desired property. A number of mechanisms have been proposed for this, using social networks or various forms of human identity, though all come with caveats.
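A small worked example shows why the incentive alone is not enough. Assume a purely hypothetical payout rule in which a node’s reward grows with the square root of the capacity it contributes, a rule chosen here only because it favors small nodes. It is not taken from any real protocol.

```python
import math


def reward(capacity: float) -> float:
    """Hypothetical payout rule: sublinear in capacity, so small nodes earn more per unit."""
    return math.sqrt(capacity)


def total_reward(identities: list[float]) -> float:
    """Total payout to one operator across all identities it controls."""
    return sum(reward(c) for c in identities)


# One operator with 100 units of capacity, registered honestly as one node.
honest = total_reward([100.0])      # 10.0

# The same operator presenting the same capacity as four 'independent' nodes.
sybil = total_reward([25.0] * 4)    # 20.0
```

Under this rule the operator doubles its payout without adding any capacity, and the network is no less centralized than before. Any rule that rewards smallness invites this unless identities are costly to create or can be shown to be independent.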
What’s next?
We are missing two important pieces in the story of privacy-preserving content delivery. The first is that there is currently no Schelling point for this effort. Existing centralized players have so far been disincentivized from investing in this direction because it is at odds with their business model, and no credible community effort has yet emerged.
The second is that much of the market is driven by price. The reason there was a substantial shift from Amazon S3 to Cloudflare R2 was not a technical innovation, but that Cloudflare was able to leverage its infrastructure position to provide the same service at a lower price. The shift that allows for subsequent disruption is likely partially regulatory: the collection and exploitation of metadata needs to carry liability, which would in turn lead customers to switch to a ‘safer’, more privacy-preserving alternative.
There is hope! Code is becoming cheaper to generate and deploy, so the marginal cost of building is dropping. On the flip side, the value of a Shared Private CDN will grow with usage. This feels like a situation where the trick will be to generate enough excitement and activation energy.
We don’t just need better protocols; we also need coordination. But the growing incentives make me optimistic that a better system will emerge here.


