Category: Academics

  • When do we get a Privacy-Preserving CDN?

    The surveillance-capitalism business model that defines the Internet today is only going to get more imposing. The vast majority of our online requests today are serviced and logged by centralized infrastructure – even more centralized than what we probably expect.

    While our collective hivemind takes rightful pride in the successful pushes that have improved this situation, most notably encryption in transit (HTTPS), we are still very much losing the war on metadata. Even when the payload is opaque, the who, when, and where of data access reveals an unfortunate amount about our social networks and our behavior.

    This isn’t a fundamental tradeoff – but we need to invest in and evolve our systems to protect ourselves from the second-order effects of metadata collection.

    Efficiency & Privacy

    Centralization is not an inherent evil, and it is on the path of least resistance for improved performance. It is the second-order effects that are the main risk.

    Caching data at the “edge”, physically closer to the user, is a natural performance optimization that minimizes speed-of-light constraints. This should be aligned with our privacy goals – in a well-designed system, fewer hops in the network will see requests and traffic. Similar performance pressures lead to single entities controlling constrained back-haul infrastructure (efficient spanning trees). This by itself is not a problem, but it is natural for these powerful entities to then want to leverage the value of the data they are transiting from their privileged positions. Especially where infrastructure providers extend into smarter ‘value added’ services, this secondary effect of value extraction leads to unfortunate designs for the collection, logging, and eventually manipulation of traffic.

    With the rise of advanced traffic analysis and machine learning, the “anonymity” we enjoy from assuming that our requests aren’t analyzed, because of the scale of traffic they are ‘hidden’ within, is no longer realistic. As analytical capabilities increase, the power structures exploiting this data will become more effective and will work even harder to embed themselves into core infrastructure.

    What does a better structure look like?

    To build a CDN resilient to modern passive and active surveillance, we need to go quite a bit beyond encryption. We need the infrastructure and system designed to limit metadata leakage. The good news is that there are both good research ideas and deployed systems that chip away at many parts of this problem already.

    Decoupling Identity from Intent (Oblivious HTTP)

    The most immediate path, already charted by IETF drafts and Apple’s Private Relay, is to have an independent entity relay traffic between the client and content. This can mean that the intermediary will know the user’s IP but not the piece of data being asked for, and the content provider knows the content served, but not the user’s identity.

    This “de-linking” is important, but it is not by itself the end of the story. In the last decade, we have seen how easy it is to fingerprint the traffic signatures associated with visiting a website (which will involve loading a range of resources, each of a different size). A more effective mental model may be to think about the traffic patterns that would be generated by a series of back-and-forth conversations. Protecting metadata in this ‘repeated game’ scope will yield different systems than limiting scope to a single request. 

    Differential Privacy and Cover Traffic

    These fingerprinting concerns have been the impetus for a range of research looking at defenses. One important piece of intuition that has emerged from this field is that we must be willing to stray from optimal efficiency. There are a number of ways to do this: we could inject ‘fake’ traffic, fit requests into a pre-defined pattern, or increase latency to grow an anonymity set.

    Some examples of systems taking different approaches in this design space include:

    • Nym adds differential cover traffic to make an argument for statistical deniability in its mixnet design, while Tor trades off its resistance to a “global passive adversary” against latency and practicality concerns.
    • Pond was a proof of concept messenger demonstrating usage-agnostic communication patterns.
    • Mullvad offers the ability to add cover traffic to reduce classifiability of individual webpages.
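
    As a rough illustration of “fitting requests into a pre-defined pattern”, the sketch below pads every payload up to one of a few fixed bucket sizes, so an observer of (already encrypted) traffic sees only the bucket and not the exact length. The bucket sizes and function names are hypothetical and not taken from any of the systems above; padding also only helps when applied before encryption.

```python
# Hypothetical bucket sizes in bytes; a real system would tune these
# against its own traffic distribution.
BUCKETS = [1024, 4096, 16384, 65536]


def padded_size(n: int) -> int:
    """Smallest bucket that fits n bytes; beyond the largest bucket,
    round up to a multiple of the largest bucket."""
    for bucket in BUCKETS:
        if n <= bucket:
            return bucket
    largest = BUCKETS[-1]
    return -(-n // largest) * largest  # ceiling division


def pad(payload: bytes) -> bytes:
    """Prefix a 4-byte length, then zero-fill up to the bucket size."""
    body = len(payload).to_bytes(4, "big") + payload
    return body + b"\x00" * (padded_size(len(body)) - len(body))


def unpad(padded: bytes) -> bytes:
    """Recover the original payload using the length prefix."""
    n = int.from_bytes(padded[:4], "big")
    return padded[4:4 + n]
```

    The cost is visible directly: a 10-byte response and a 1000-byte response both occupy 1024 bytes on the wire, which is the efficiency we give up to make them indistinguishable by size.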

    Private Information Retrieval (PIR)

    “Private Information Retrieval” refers to the class of systems that answer a specific question: how can a user retrieve an item from a database (or cache) without the database learning which item was selected? While historically computationally expensive, recent advances suggest that sub-second, privacy-preserving cache lookups can be possible at scale.

    • Kohaku is an Ethereum wallet project demonstrating the use of PIR to hide reads
    • iPhone Live Caller ID is currently the largest user of PIR
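
    To make the question concrete, here is a minimal sketch of the classic two-server, XOR-based PIR construction. It assumes two servers that hold identical copies of the database and do not collude, and records of equal length. This is a textbook variant chosen for brevity, not necessarily what the projects above use; single-server deployments generally rely on heavier cryptography such as homomorphic encryption. All function names are illustrative.

```python
import secrets


def make_queries(n_items: int, index: int) -> tuple[list[int], list[int]]:
    """Client: build two bit-vectors that differ only at `index`.
    Each vector on its own is uniformly random, so neither server
    learns which item is wanted."""
    q1 = [secrets.randbelow(2) for _ in range(n_items)]
    q2 = q1.copy()
    q2[index] ^= 1
    return q1, q2


def answer(db: list[bytes], query: list[int]) -> bytes:
    """Server: XOR together every record whose query bit is set."""
    acc = bytes(len(db[0]))  # all-zero record
    for record, bit in zip(db, query):
        if bit:
            acc = bytes(a ^ b for a, b in zip(acc, record))
    return acc


def reconstruct(a1: bytes, a2: bytes) -> bytes:
    """Client: every record selected in both queries cancels out,
    leaving only the record at the wanted index."""
    return bytes(x ^ y for x, y in zip(a1, a2))
```

    Note that each server touches the whole database for every query, which is where the historical computational cost comes from.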

    Content Addressing and Blinding

    A variety of more exotic techniques for data transfer have been explored in the context of content-addressed systems like BitTorrent and IPFS. A number of useful ideas have resurfaced in these contexts:

    • Files and data generally can be thought of as a series of fixed-size ‘chunks’, which helps with speed, and is already a prerequisite for the preceding constructions.
    • By requesting data by its hash, the response becomes verifiable by the client – so we can split who is responsible for ‘availability’ (any other peer) vs what the data is (the source leading us to get data in the first place). It also means that we don’t have to go to a single origin, but are more naturally able to take advantage of caches.
    • We can separate ‘discovery’ (the DNS equivalent of figuring out who might have data) from the transfer of the individual blocks from those peers, and get past a standard client-server model with minimal additional cognitive complexity.
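
    The first two ideas can be sketched in a few lines. This is a minimal illustration, assuming SHA-256 as the hash and plain dictionaries standing in for peers; the chunk size, function names, and manifest format are hypothetical and not those of BitTorrent or IPFS.

```python
import hashlib

CHUNK_SIZE = 4  # tiny, for illustration only


def chunk(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def address(block: bytes) -> str:
    """A block's address is the hash of its content."""
    return hashlib.sha256(block).hexdigest()


def publish(data: bytes, store: dict[str, bytes]) -> list[str]:
    """Store each block under its address; return the ordered manifest."""
    manifest = []
    for block in chunk(data):
        store[address(block)] = block
        manifest.append(address(block))
    return manifest


def fetch(manifest: list[str], peers: list[dict[str, bytes]]) -> bytes:
    """Ask peers in turn for each block, accepting only verified content."""
    blocks = []
    for addr in manifest:
        for peer in peers:
            block = peer.get(addr)
            if block is not None and address(block) == addr:
                blocks.append(block)
                break
        else:
            raise LookupError(f"no peer returned a valid block for {addr}")
    return b"".join(blocks)
```

    Because the client checks every block against the address it asked for, an untrusted cache can serve the data without being trusted for its integrity; only the manifest has to come from the source.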

    Reducing Centralization & Segmenting Information

    There have been a number of projects in the last year, mostly riding on the wave of interest in ‘DePIN’ (decentralized physical infrastructure networks), that looked at economic models for how protocols could directly split earnings with participating network nodes. This extends the coordination-system ideas from cryptocurrencies to how things like CDNs could be constructed to incentivize a decentralized group of participants to operate caches and other parts of the overall network around the world.

    These systems sit somewhat orthogonal to a set of prior research on ‘Sybils’, which indicates there’s an additional coordination system of some sort needed to actually reduce centralization. Conceptually, if you set up incentives so that many small participants forming a network earn more than one big central player, the large central player can generally split up their resources and make themselves look like multiple smaller entities (called ‘Sybils’). This means there needs to be some mechanism to confirm that different entities are really ‘independent’ if that is a desired property. A number of mechanisms – using social networks, or various forms of human identity – have been proposed for this, though all with caveats.
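
    A small worked example of why this matters, under an assumed reward rule: suppose each identity earns the square root of the capacity it contributes, a made-up concave function chosen only because it favors many small participants. Nothing here reflects the rules of any actual DePIN protocol.

```python
import math


def reward(capacity: float) -> float:
    """Hypothetical per-identity reward: concave, so small nodes earn
    more per unit of capacity than large ones."""
    return math.sqrt(capacity)


def total_reward(capacities: list[float]) -> float:
    """Total earned by one operator across all identities it controls."""
    return sum(reward(c) for c in capacities)


# One operator with 100 units of capacity, registered honestly as one node...
as_one_node = total_reward([100.0])

# ...versus the same operator posing as 100 independent 1-unit nodes.
as_sybils = total_reward([1.0] * 100)
```

    Here the same hardware earns ten times more when split into fake identities, so without a way to check independence the incentive intended to reward decentralization mostly rewards pretending to be decentralized.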

    What’s next?

    We are missing two important pieces in the story of privacy-preserving content delivery. The first is that there is currently no Schelling point for this effort. Existing centralized players have so far been disincentivized from investing in this direction because it is at odds with their business model, and no credible community effort has yet emerged.

    The second is that much of the market is driven by price. The reason there was a substantial shift from Amazon S3 to Cloudflare R2 was not because of a technical innovation, but because Cloudflare was able to leverage their infrastructure position to provide the same service at a cheaper price. The shift that allows for subsequent disruption is likely partially regulatory: the collection and exploitation of metadata needs to carry liability, which in turn leads customers to switch to a ‘safer’ or more privacy-preserving alternative.

    There is hope! Code is becoming cheaper to generate and deploy, so the marginal cost of building is dropping. On the flip side, the value of a Shared Private CDN will grow with usage. This feels like a situation where the trick will be to get enough excitement and activation energy.

    We don’t just need better protocols; we also need coordination. Still, there is hope, and the increasing incentives make me optimistic that a better system will emerge here.

  • Retrieval Constraints

    A couple of months ago I wrote up some of the edges that I’ve encountered in thinking about how to structure decentralized data transfer systems. These are an extension of the limitations that were initially encountered in BitTorrent-style tit-for-tat exchanges, and have now matured into a much more extensive field looking at incentives and other mechanisms that can be leveraged to create robust systems.

    See the long-form essay on mirror

    My top take-away from this line of thought is that, within our initial framing of how data transfer might happen, we still end up relying on reputation, both to estimate how well experience transfers and to judge whether past behavior will predict subsequent performance.

  • Private Retrieval

    It’s very exciting to have a public face to the thoughts around how to enable effective private access to data.

    Research Announcement

    EthCC Announcement

    The basic hypothesis here is that there’s a high-leverage opportunity to attract thought around scaling the range of anonymous database or data transfer techniques to reach something with better properties than the systems we have today.

    I’ve learned a lot about what goes into running a grant fund already in my minor involvement helping to set up this program, and am excited to see the next stage of its lifecycle as we begin to engage with proposals and grantees.

  • What's Left for private Messaging

    I had the privilege to address the annual Chaos Communication Congress (36C3) in Leipzig last week about the state and remaining issues in private communications.

    The recording of the video has been made available by the CCC, and I have also posted the slides.

    The TL;DR for me is that many of the trade-offs balance the stability of the user experience against privacy mechanisms – and finding more ergonomic user interactions will be as important to improving the ecosystem as new system designs.

    I am particularly excited by the number of ongoing efforts to reduce trust in central servers. Many of the mechanistic trade-offs we face are due to the topology of our systems. With systems designed for fully anonymous interaction, like mixnets, PIR, and oblivious messaging, we can model and mitigate threats from much more realistic adversaries than we do with popular channels today. (For instance, consider an office which has received a whistleblowing message. If the receiving investigators want to identify the source, they likely control the local network and also have the ability to send messages to the account that initiated the conversation. Our current designs will find it quite difficult to protect a user in this scenario.)

  • NextGen Korea Scholars

    I had the incredible opportunity to spend the end of last week in Washington DC with the CSIS NextGen Scholars program meeting the US policy makers who define the US policy towards the DPRK.

    It was fascinating to see the process that has been put in place for weighing the different factors that go into these decisions, and how at the same time there really is truth to the almost inconceivable notion that the best any of us can hope for is that Trump and Kim Jong Un will have a successful summit and be able to make progress based on some unexpected personal trust.

    I am hopeful I was able to offer some insight into what life is like in the country, and perhaps was able to offer some sense of the value provided by engagements like PUST.

    Several tweets provide a sense of who we got to meet.

  • Ethics of Censorship Measurement

    I gave a talk this past summer at DEFCON on the ethical quandary that continues to play a role in the academic discussion of network censorship measurement. Over the course of my PhD studies, there was a significant arc of time where the community yielded to caution as the issues around ethics were better understood.

    These issues have not gone away, and in the intervening six months since this talk, we’ve seen new groups re-develop techniques deemed problematic by the prevailing winds of the academic community.

    Watch on YouTube

    Slides

  • Corporate Censorship

    One of the most interesting lines of inquiry within the Censored Planet project at the University of Michigan is trying to pull apart the different actors involved in Internet censorship. One of the interesting quirks is that a significant factor in why content might not be available to users is that the web publishers themselves have limited who they’ll respond to.

    This relates to existing phenomena like the increased balkanization of the web, where regions and nations promote domestic services and networks, but is as much a function of where lucrative markets are and a reaction to the background of fraud and malicious online traffic.

    One outcome of this research is a set of measurements looking at how and where CDNs limit access, which will be presented tomorrow at IMC.

    Like many parts of the Internet, a take-away here is that attribution is hard.

  • NextGen Scholar

    Excited to be included in the 2018 class of CSIS NextGen Scholars.

  • Scalable Remote Measurement of Application-Layer Censorship

    Quite exciting to see another step in remote measurement systems at USENIX Security in August. This particular piece is on how to recover DPI policies at scale.

  • Open Letter to the Cuba Internet Task Force

    The following is a response to an invitation to participate in the recently formed Cuba Internet Task Force.

    Task Force Representatives:
    I will not be joining the Cuba Internet Task Force, or Subcommittees, because I believe the harm done by the existence of these committees outweighs any potential benefit of the recommendations that can come from them.

    In recent years, Cuba has increasingly normalized Internet usage, through expansion and cost reduction of WiFi, through the advent of Airbnb as a major source of tourism revenue, and through growing traffic capacity.

    In the scope of my work, I have documented the flourishing community wireless networks operating in tandem with official Internet service from ETECSA. These community efforts already address the “last mile” problem, and it is not hard to imagine the future where they are consolidated or integrated to provide Internet-to-the-home for many more Cubans.

    These efforts are hindered by the perception by the Cuban government that the Internet and its associated ‘freedom’ are being forced upon them by the United States. In the wake of the creation of this task force, Cuban media has focused on the implied pressure, and private individuals in the Cuban technology sector have come under increased scrutiny.

    Instead of attempting to influence the policies of another sovereign nation, I encourage us to reflect more on our internal policies. US government sanctions currently require a wide range of US-based education and reference sites to block Cuban traffic. Likewise, limitations preventing Cubans from connecting to US-invested undersea cables are partially responsible for the scarcity and cost of Cuban Internet connections. Reducing these sanctions can allow Cubans to become a market for US companies, and will provide additional incentives for widespread connectivity across the country.