Tag: web

  • When do we get a Privacy-Preserving CDN?

    The surveillance-capitalism business model that defines the Internet today is only going to get more imposing. The vast majority of our online requests today are serviced and logged by centralized infrastructure – even more centralized than what we probably expect.

    While our collective hivemind takes rightful pride in the successful pushes that have improved this situation, most notably encryption in transit (HTTPS), we are still very much losing the war on metadata. Even when the payload is opaque, the who, when, and where of data access gives away an unfortunate amount of insight into our social networks and our behavior.

    This isn’t a fundamental tradeoff – but we need to invest in and evolve our systems to protect ourselves from the second-order effects of metadata collection.

    Efficiency & Privacy

    Centralization is not an inherent evil, and it is on the path of least resistance for improved performance. It is the second-order effects that are the main risk.

    Caching data at the “edge”, physically closer to the user, is a natural performance optimization that minimizes speed-of-light delays. This should be aligned with our privacy goals: in a well-designed system, fewer hops in the network will see requests and traffic. Similar performance pressures lead to single entities controlling constrained back-haul infrastructure (efficient spanning trees). This itself is not a problem, but it is natural for these powerful entities to then want to leverage the value of the data they are transiting from their privileged positions. Especially where infrastructure providers extend into smarter ‘value added’ services, this secondary effect of value extraction leads to unfortunate designs for collection, logging, and eventually manipulation of traffic.

    With the rise of advanced traffic analysis and machine learning, the “anonymity” we enjoy by assuming that our requests are hidden within the sheer scale of traffic, and so go unanalyzed, is no longer realistic. As analytical capabilities increase, the power structures exploiting this data will become more effective and will work even harder to embed themselves into core infrastructure.

    What does a better structure look like?

    To build a CDN resilient to modern passive and active surveillance, we need to go quite a bit beyond encryption. We need the infrastructure and system designed to limit metadata leakage. The good news is that there are both good research ideas and deployed systems that chip away at many parts of this problem already.

    Decoupling Identity from Intent (Oblivious HTTP)

    The most immediate path already charted by IETF drafts and Apple’s Private Relay is to have an independent entity relay traffic between the client and content. This can mean that the intermediary will know the user’s IP but not the piece of data being asked for, and the content provider knows the content served, but not the user’s identity.

    This “de-linking” is important, but it is not by itself the end of the story. In the last decade, we have seen how easy it is to fingerprint the traffic signatures associated with visiting a website (which will involve loading a range of resources, each of a different size). A more effective mental model may be to think about the traffic patterns that would be generated by a series of back-and-forth conversations. Protecting metadata in this ‘repeated game’ scope will yield different systems than limiting scope to a single request. 

    Differential Privacy and Cover Traffic

    These fingerprinting concerns have been the impetus for a range of research looking at defenses. One important piece of intuition that has emerged from this field is that we must be willing to stray from optimal efficiency. There are a number of ways to do this: we could inject ‘fake’ traffic, fit requests into a pre-defined pattern, or increase latency to grow an anonymity set.

    Some examples of systems taking different approaches in this design space include:

    • Nym adds differential cover traffic to make an argument for statistical deniability in its mixnet design, while Tor trades off its resistance to a “global passive adversary” against latency and practicality concerns.
    • Pond was a proof of concept messenger demonstrating usage-agnostic communication patterns.
    • Mullvad offers the ability to add cover traffic to reduce classifiability of individual webpages.
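
    As a rough illustration of the “fit requests into a pre-defined pattern” option mentioned above, the sketch below pads every payload up to one of a few fixed bucket sizes before it would be encrypted, so an observer only learns the bucket and not the exact length. The bucket sizes and the function names `pad_to_bucket` and `unpad` are assumptions made for this example; they are not taken from Nym, Tor, Pond, or Mullvad.

```python
# Hypothetical bucket sizes in bytes; a real system would tune these
# against its own traffic and bandwidth budget.
BUCKETS = [1024, 4096, 16384, 65536]

HEADER_LEN = 4  # bytes used to record the true payload length


def pad_to_bucket(payload: bytes, buckets=BUCKETS) -> bytes:
    """Pad payload (plus a length header) to the smallest bucket that fits."""
    needed = len(payload) + HEADER_LEN
    for size in buckets:
        if needed <= size:
            header = len(payload).to_bytes(HEADER_LEN, "big")
            # Zero padding is fine here because the result is meant to be
            # encrypted before it leaves the client.
            return header + payload + bytes(size - needed)
    raise ValueError("payload exceeds the largest bucket; split it first")


def unpad(padded: bytes) -> bytes:
    """Recover the original payload from a padded message."""
    length = int.from_bytes(padded[:HEADER_LEN], "big")
    return padded[HEADER_LEN:HEADER_LEN + length]
```

    The cost is visible directly: a 10-byte request and a 1,000-byte request both occupy 1,024 bytes on the wire, which is the kind of departure from optimal efficiency described above.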

    Private Information Retrieval (PIR)

    “Private Information Retrieval” refers to the class of systems that answer a specific question: how can a user retrieve an item from a database (or cache) without the database learning which item was selected? While historically computationally expensive, recent advances suggest that sub-second, privacy-preserving cache lookups can be possible at scale.

    • Kohaku is an Ethereum wallet project demonstrating the use of PIR to hide reads.
    • iPhone Live Caller ID is currently the largest user of PIR.
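
    To make the PIR question concrete, here is a minimal sketch of the classic two-server, XOR-based scheme. It assumes two servers that hold identical copies of the database and do not collude, and records of equal length. This is the textbook construction, chosen because it fits in a few lines; it is not a description of how the projects listed above implement PIR, and all function names are made up for the example.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_queries(n: int, index: int) -> tuple[list[int], list[int]]:
    """Build two bit-vectors that differ only at `index`.

    Each vector on its own is uniformly random, so neither server
    learns which record the client wants.
    """
    q1 = [secrets.randbelow(2) for _ in range(n)]
    q2 = list(q1)
    q2[index] ^= 1
    return q1, q2


def server_answer(db: list[bytes], selection: list[int]) -> bytes:
    """Server side: XOR together every record whose selection bit is 1."""
    acc = bytes(len(db[0]))
    for record, bit in zip(db, selection):
        if bit:
            acc = xor_bytes(acc, record)
    return acc


def retrieve(db_a: list[bytes], db_b: list[bytes], index: int) -> bytes:
    """Client side: query both replicas and combine their answers."""
    q1, q2 = make_queries(len(db_a), index)
    return xor_bytes(server_answer(db_a, q1), server_answer(db_b, q2))
```

    Every record selected by both queries cancels out in the final XOR, leaving only the record at `index`. Each server still touches the whole database per query, which is where the historical cost of PIR comes from.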

    Content Addressing and Blinding

    A variety of more exotic techniques for data transfer have been explored in the contexts of content addressed systems like BitTorrent and IPFS. A number of useful ideas have resurfaced in these contexts:

    • Files and data generally can be thought of as a series of fixed-size ‘chunks’, which helps with speed, and is already a prerequisite for the preceding constructions.
    • By requesting data by its hash, the response becomes verifiable by the client – so we can split who is responsible for ‘availability’ (any other peer) vs what the data is (the source leading us to get data in the first place). It also means that we don’t have to go to a single origin, but are more naturally able to take advantage of caches.
    • We can separate ‘discovery’ (the DNS equivalent of figuring out who might have data) from the transfer of the individual blocks from those peers, and get past a standard client-server model with minimal additional cognitive complexity.
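
    The first two ideas in this list can be sketched in a few lines. The example below splits data into chunks, names each chunk by its SHA-256 hash, and verifies whatever an untrusted peer returns against the requested hash. The peer is modeled as a plain dictionary, and the chunk size and function names are assumptions for illustration; BitTorrent and IPFS each use their own chunking and hashing formats.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # hypothetical chunk size in bytes


def chunk(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into chunks of `size` bytes (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def address(block: bytes) -> str:
    """Content address of a block: the hex SHA-256 digest of its bytes."""
    return hashlib.sha256(block).hexdigest()


def manifest(data: bytes, size: int = CHUNK_SIZE) -> list[str]:
    """Ordered list of chunk addresses; this is what a trusted source hands out."""
    return [address(c) for c in chunk(data, size)]


def fetch_verified(addr: str, peer: dict[str, bytes]) -> bytes:
    """Fetch a block from an untrusted peer and check it against its address."""
    block = peer[addr]
    if address(block) != addr:
        raise ValueError("peer returned data that does not match the requested hash")
    return block
```

    Because the check depends only on the hash, the client needs to trust the source of the manifest but not the cache that serves the bytes.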

    Reducing Centralization & Segmenting Information

    There have been a number of projects in the last year, mostly riding on the wave of interest in ‘DePIN’ (decentralized physical infrastructure networks), that looked at economic models for how protocols could directly split earnings with participating network nodes. This extends the coordination-system ideas from cryptocurrencies to how things like CDNs could be constructed to incentivize a decentralized group of participants to operate caches and other parts of the overall network around the world.

    These systems sit somewhat orthogonal to a set of prior research on ‘Sybils’, which indicates there’s an additional coordination system of some sort needed to actually reduce centralization. Conceptually, if you set up incentives so that there are more rewards (and an incentive) for many small participants to form a network rather than a big central player, the large central player can generally split up their resources and make themselves look like multiple smaller entities (called ‘Sybils’). This means there needs to be some mechanism to confirm that different entities are really ‘independent’ if that is a desired property. A number of mechanisms – using social networks, or various forms of human identity have been proposed for this, though all with caveats.

    What’s next?

    We are missing two important pieces in the story of privacy preserving content delivery. The first is that there is currently no Schelling point for this effort. Existing centralized players have so far been disincentivized from investing in this direction, because it is at odds with their business model, and no credible community effort has yet emerged.

    The second is that much of the market is driven by price. The reason there was a substantial shift from Amazon S3 to Cloudflare R2 was not because of a technical innovation, but because Cloudflare was able to leverage their infrastructure position to provide the same service at a cheaper price. The shift that allows for subsequent disruption is likely partially regulatory: the collection and exploitation of metadata needs to carry liability, which in turn would lead customers to switch to a ‘safer’ or more privacy-preserving alternative.

    There is hope! Code is becoming cheaper to generate and deploy, so the marginal cost of building is dropping. On the flip side, the value of a Shared Private CDN will grow with usage. This feels like a situation where the trick will be to get enough excitement and activation energy.

    We don’t just need better protocols; we also need the coordination, but there is hope and increasing incentives that make me optimistic that a better system here will emerge.

  • Private Retrieval

    It’s very exciting to have a public face to the thoughts around how to enable effective private access to data.

    Research Announcement

    EthCC Announcement

    The basic hypothesis here is that there’s a high-leverage opportunity to attract thought around scaling the range of anonymous database or data transfer techniques to reach something with better properties than the systems we have today.

    I’ve learned a lot about what goes into running a grant fund already in my minor involvement helping to set up this program, and am excited to see the next stage of its lifecycle as we begin to engage with proposals and grantees.

  • Ethics of Censorship Measurement

    I gave a talk this past summer at DEFCON on the ethical quandary that continues to play a role in the academic discussion of network censorship measurement. Over the course of my PhD studies, there was a significant arc of time where the community yielded to caution as the issues around ethics were better understood.

    These issues have not gone away, and in the intervening six months since this talk, we’ve seen new groups re-develop techniques deemed problematic by the prevailing winds of the academic community.

    Watch on YouTube

    Slides

  • Open Letter to the Cuba Internet Task Force

    The following is a response to an invitation to participate in the recently formed Cuba Internet Task Force.

    Task Force Representatives:
    I will not be joining the Cuba Internet Task Force, or Subcommittees, because I believe the harm done by the existence of these committees outweighs any potential benefit of the recommendations that can come from them.

    In recent years, Cuba has increasingly normalized Internet usage, through expansion and cost reduction of WiFi, through the advent of AirBNB as a major source of tourism revenue, and through growing traffic capacity.

    In the scope of my work, I have documented the flourishing community wireless networks operating in tandem with official Internet service from ETECSA. These community efforts already address the “last mile” problem, and it is not hard to imagine the future where they are consolidated or integrated to provide Internet-to-the-home for many more Cubans.

    These efforts are hindered by the perception by the Cuban government that the Internet and its associated ‘freedom’ are being forced upon them by the United States. In the wake of the creation of this task force, Cuban media has focused on the implied pressure, and private individuals in the Cuban technology sector have come under increased scrutiny.

    Instead of attempting to influence the policies of another sovereign nation, I encourage us to reflect more on our internal policies. US government sanctions currently require a wide range of US-based education and reference sites to block Cuban traffic. Likewise, limitations preventing Cubans from connecting to US-invested undersea cables are partially responsible for the scarcity and cost of Cuban Internet connections. Reducing these sanctions can allow Cubans to become a market for US companies, and will provide additional incentives for widespread connectivity across the country.

  • HRPC Research

    It’s great to see that Research into Human Rights Protocol Considerations has been published as an RFC. It is an interesting document exploring how the technical protocols of the Internet interact with our real-world values.

  • Messaging Threat models

    I talked yesterday at Bornhack about the current state of secure messaging and the different primitives and threats that groups are working to address.

    The talk is on YouTube.

    The slides are on this site, as are the directions for dogfooding the talek system.

  • Initial Measurements of the Cuban Street Network

    Internet access in Cuba is severely constrained, due to limited availability, slow speeds, and high cost. Within this isolated environment, technology enthusiasts have constructed a disconnected but vibrant IP network that has grown organically to reach tens of thousands of households across Havana. We present the first detailed characterization of this deployment, which is known as the SNET, or Street Network. Working in collaboration with SNET operators, we describe the network’s infrastructure and map its topology, and we measure bandwidth, available services, usage patterns, and user demographics. Qualitatively, we attempt to answer why the SNET exists and what benefits it has afforded its users. We go on to discuss technical challenges the network faces, including scalability, security, and organizational issues. To our knowledge, the SNET is the largest isolated community-driven network in existence, and its structure, successes, and obstacles show fascinating contrasts and similarities to those of the Internet at large.

    Talks

    The Internet in Cuba: A Story of Community Resilience. Chaos Communication Congress. 2017

    Publication

    Pujol, Eduardo E. P., Will Scott, Eric Wustrow, and J. Alex Halderman. “Initial measurements of the Cuban street network.” In Proceedings of the 2017 Internet Measurement Conference, pp. 318-324. ACM, 2017. Slides

  • IETF 98

    Last week I talked briefly about the state of open internet measurement for network anomalies at IETF 98. This was my first time attending an IETF in-person meeting, and it was very useful in getting a better understanding of how to navigate the standards process, how it’s used by others, and what value can be gained from it.

    A couple highlights that I took away from the event:

    There’s a concern throughout the IETF about solving the privacy leaks in existing protocols for general web access. There are three major points in the protocol that need to be addressed and are under discussion as part of this: The first is coming up with a successor to DNS that provides confidentiality. This, I think, is going to be the most challenging point. The second is coming up with a SNI equivalent that doesn’t send the requested domain in plain-text. The third is adapting the current public certificate transparency process to provide confidentiality of the specific domains issued certificates, while maintaining the accountability provided by the system.

    Confidential DNS

    There are two proposals with traction for encrypting DNS that I’m aware of. Neither fully solves the problem, but both provide reasonable ways forward. The first is DNSCrypt, a protocol with support from entities like Yandex and Cloudflare. It maintains a stateless UDP protocol, and encrypts requests and responses against server and client keys. There are working client proxies for most platforms, although installation on mobile is hacky, and a set of running providers. The other alternative, which was represented at IETF and seems to be preferred by the standards community, is DNS over TLS. The benefit here is that there’s no new protocol, meaning less code that needs to be audited to gain confidence in the security properties of the system. There are some working servers and client proxies available for this, but the community seems more fragmented, unfortunately.
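
    To illustrate why DNS over TLS involves “no new protocol”, the sketch below builds an ordinary DNS query and adds the two-byte length prefix that DNS already uses over TCP; DNS over TLS (RFC 7858, port 853) carries exactly these bytes inside a TLS session. The function names are invented for this example, and the network step is deliberately left out.

```python
import struct


def build_query(name: str, query_id: int, qtype: int = 1) -> bytes:
    """Encode a minimal DNS query: one question, recursion desired.

    qtype 1 is an A record; callers should pick a random query_id.
    """
    # Header: ID, flags (0x0100 = RD), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN
    return header + question


def frame_for_stream(message: bytes) -> bytes:
    """Prefix a DNS message with its two-byte length, as stream transports require."""
    return struct.pack("!H", len(message)) + message
```

    Sending the framed bytes would mean opening a TCP connection to a resolver on port 853 and wrapping it with Python’s `ssl` module; the resolver on the other end still sees the full query, which is the trust problem discussed next.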

    The eventual problem that isn’t yet addressed is that you still need to trust some remote party with your DNS query, and neither protocol changes the underlying model, in which the work of DNS resolution is performed by someone chosen by the local network. Current proxies allow the client to choose who this is instead, but that doesn’t remove the trust issue, and doesn’t work well with captive portals or scale to widespread deployment. It also doesn’t prevent that third party from tracking the chain of DNS requests made by the client and getting a pretty good idea about what the client is doing.

    Hidden SNI

    SNI, or Server Name Indication, is a process that occurs at the beginning of an HTTPS request where the client tells the server which domain it wants to talk to. This is a critical part of the protocol, because it allows a single IP address to host HTTPS servers for multiple domains. Unfortunately, it also allows the network to detect and potentially block requests at a domain, rather than IP, granularity.

    Proposals for encrypting the SNI have been around for a couple years. Unfortunately, they did not get included in TLS 1.3, which means that it will be a while before the next iteration of the standard offers a chance to include this update.

    The good news was that there seems to be continued interest in figuring out ways to protect the SNI of client requests, though no current proposal I’m aware of.

    Certificate Transparency Privacy

    Certificate Transparency is an addition to the HTTPS system that enforces additional accountability in the certificate authority system. It requires certificate authorities (CAs) to publish a log of all certificates they issue, so that third parties can audit the list and make sure they haven’t secretly mis-issued certificates. While a great feature for accountability and web security, it also opens an additional channel where the list of domains with SSL certificates can be enumerated. This includes internal or private domains that the owner would like to remain obscure.

    As Google and others have moved to require the CT log from all authorities through requirements on browser certificate validity, this issue is again at the fore. There’s been work on addressing this problem, including a cryptographic proposal and the IETF proposal for domain label redaction, which seems to be advancing through the standards process.

    There remains a ways to go to migrate to protocols which provide some protection against a malicious network, but there’s willingness and work to get there, which is at least a start.

  • Another Strike against Domain Fronting

    In 2014, Domain Fronting became the newest obfuscation technique for covert, difficult to censor communication. Even today, the Meek Pluggable transport serves ~400GB of Tor traffic each day, at a cost of ~$3000/month.

    The basic technique is to make an HTTPS connection to the CDN directly, and then once the encryption has begun, make the HTTP request to the actual backing site instead. Since many CDNs use the same “front-end cache” servers for incoming requests to all of the different sites they host, there is a disconnect between the software handling SSL, and the routing web server proxying requests to where they need to go.
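
    The disconnect described above can be modeled without any networking. In the toy sketch below, a censor can only inspect the cleartext SNI, while a CDN front end terminates TLS for the SNI and then routes on the encrypted Host header. The domain names, the `ORIGINS` table, and the `enforce_match` flag are all hypothetical; the flag stands in for the kind of check a CDN can add to stop fronting.

```python
from dataclasses import dataclass


@dataclass
class Request:
    sni: str          # sent in the clear in the TLS ClientHello
    host_header: str  # sent inside the encrypted HTTP request


# Hypothetical CDN customers mapped to their origin servers.
ORIGINS = {
    "allowed.example": "origin-1.internal",
    "blocked.example": "origin-2.internal",
}


def censor_allows(req: Request, blocklist: set[str]) -> bool:
    """An on-path censor can only match on the cleartext SNI."""
    return req.sni not in blocklist


def cdn_route(req: Request, enforce_match: bool = False) -> str:
    """Front end: accept TLS for req.sni, then route on the Host header."""
    if req.sni not in ORIGINS:
        raise LookupError("no certificate configured for this SNI")
    if enforce_match and req.sni != req.host_header:
        raise PermissionError("SNI and Host header differ")
    return ORIGINS[req.host_header]
```

    A request with `sni="allowed.example"` and `host_header="blocked.example"` passes a censor that blocks only `blocked.example`, yet is routed to the blocked site’s origin. With `enforce_match=True` the same request is rejected.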

    Even as the technique became widely adopted in 2014-2015, its demise was already predicted, with practitioners in the censorship circumvention community focused on how long it could be made to last until the next mechanism was found. This prediction rested on two points:

    1. The CDN companies will find themselves in a difficult position politically, since they are now in the position of supporting circumvention while also maintaining a relationship with the censoring countries.
    2. The technique has security and cost implications that make it not great for either the CDNs, or the practitioners.

    We’ve seen both of these predictions mature.

    Cloudflare explicitly doesn’t support this mechanism of circumvention, and coincidentally has major Chinese partnerships and has worked to deploy into China. Google has also limited the technique at times as they have struggled with abuse (although this is moot in China, since the Google cloud doesn’t work there as a CDN).

    In terms of cost, the most notable incident is the “Great Cannon”, which not only targeted GitHub, as widely reported, but also caused a significant amount of traffic to go to Amazon-hosted pages run by GreatFire, a dissident news organization, costing them significant amounts of money. GreatFire had been providing a free browser that operated by proxying all traffic through domain fronting. Due to a separate and less reported Chinese “DDoS”, they ended up with a monthly bill for several tens of thousands of dollars and had to turn down the service.

    The latest strike against domain fronting is seen in posts by Cobalt Strike and FireEye reporting that the technique is also gaining adoption for malware C&C. This abuse case will further disincentivize CDNs from allowing the practice to continue, since there will now be many legitimate western voices actively calling on them to stop. Enterprises attempting to track threats on their networks, and CDN customers wanting to not be blamed for attacks, will both begin putting more pressure on the CDNs to remove the ability for different domains to be intermixed, and we should expect to see a continued drop in the willingness of providers to offer such a service.

  • State of internet censorship video

    Video from my CCC talk last week is here.