# HOPE 2020

**Censorship Is No Longer Interpreted as Damage**  
*(And What We Can Do About It)*

## Introduction

Hi and hello! My name is Michał Woźniak, although most people know me as [`rysiek`](https://mastodon.social/@rysiek/). No, I am not related to Steve Wozniak. Yes, I do have opinions about Apple. Some of them even on-topic.

For the last 5 years I have been working for the Organized Crime and Corruption Reporting Project. Unsurprisingly, we had to deal with our share of traffic spikes and censorship attempts. I'd like to share some of the know-how gathered and code written throughout this time.

Opinions I voice in this talk are my own and do not necessarily represent those of my employers.

### Censorship as damage

I started my on-line adventures, as I imagine many of you have, in simpler times when the Internet was this new amazing technology, the information superhighway, the series of tubes, where a blog could be set up in a broom closet by everyone and their dog, and nobody could tell them apart (the dog and the everyone, that is; although I am sure there were and are many who still cannot tell a blog from a broom closet).

In this Cyberland Before Time, "*the Internet would interpret censorship as damage, and route around it*". Governments seemed comically helpless as far as anything digital was concerned. Whenever they tried to block anything, the Internet would get a strong, systemic allergic reaction known as the *Streisand Effect*, and information would still flow.

### The game has changed

But that was then. Today, effective web censorship is within reach not only of China, which is willing to invest heavily in building its own in-house technology and capacity, but also of less dedicated yet no less censorship-eager regimes like the UK, Russia, or Azerbaijan.

Case in point: a few years ago I saw a local independent media site blocked completely in Kazakhstan, even though the site was behind CloudFlare. How? The block was based on the Server Name Indication, and almost certainly used off-the-shelf hardware sold by one of the major companies. Why even decrypt the TLS traffic if the domain name is transmitted in clear text anyway?

Point is, censorship technology has taken huge strides in the last decade. Hardware has become more powerful and less expensive, and predefined filters are available for sale on the open market. At the same time, and no less importantly, more and more content has become accessible solely through fewer and fewer gatekeepers, making the censors' job easier.

### Centralization as damage

These gatekeepers include Facebook, and CloudFlare, and Amazon, and Google, and Apple. Don't get me wrong, I am no fan of the other Big Tech companies either, but in the context of this talk these are the most relevant.

For years, activists like myself warned that putting all our content in the Big Tech basket would end badly; that it creates single points of failure which are difficult to work around when issues arise. Today, we have specific examples of Facebook, CloudFlare, Amazon, Google, Apple and other Big Tech players thwarting attempts at keeping the information flowing:
- Facebook by [experimenting with their Explore feed on live tissue in Serbia](https://www.nytimes.com/2017/11/15/opinion/serbia-facebook-explore-feed.html) (and costing independent media sites there up to a half of traffic)
- CloudFlare having an [almost half-hour outage, affecting 12 million websites, due to an error in a regular expression](https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/)
- [Google](https://www.theregister.com/2018/04/19/google_domain_fronting/) and [Amazon](https://signal.org/blog/looking-back-on-the-front/) killing [Domain Fronting](https://en.wikipedia.org/wiki/Domain_fronting) and screwing Signal over, suspiciously soon after [RosKomNadzor complained](https://github.com/signalapp/Signal-Android/issues/7745)
- [Apple removing VPN apps](https://www.reuters.com/article/us-china-apple-vpn/apple-says-it-is-removing-vpn-services-from-china-app-store-idUSKBN1AE0BQ) from their Chinese AppStore, at the behest of Beijing.

These give a good overview of the possible issues: from innocent mistakes, through callous experimentation, to outright willful complicity. And with the sheer number of users affected, this becomes a huge problem.

### Technological homeopathy

At the same time these companies' services are billed as solutions to Internet censorship. Let's take Encrypted SNI, which supposedly solves the issue of SNI-based censorship: it can only work with large providers like CloudFlare, where an innocent-looking domain can be used to front for a blocked site. Thus, it creates an incentive for more websites to move behind CloudFlare.

This just moves the problem: sooner or later China and others will find ways to influence CloudFlare's decisions to host or drop websites, just like they clearly can influence Google or Apple already.

It's as if we were trying to fix problems caused by centralization by just a bit more centralization. We need something better than what basically amounts to technological homeopathy.

## The basics

I'm making some assumptions in this talk.

For one, I am focusing on websites that do not get The Guardian's level of traffic, although a lot of the suggestions mentioned here could work for such websites too. On the other hand, reading [the Guardian's guidelines on front-end development](https://github.com/guardian/frontend#core-development-principles-lines-in-the-sand) can be very informative for smaller sites as well.

Secondly, I am assuming an example site is a public one where visitors do not log in to access content. This holds true for the vast majority of independent news sites. And even if your situation is different, some advice here might still be useful.

Third, I am assuming the organization running the site has *some* technical capacity and willingness to invest time and effort to stay away from the gatekeepers when possible. Deploying some of the strategies in this talk can be done even with very limited tech capacity, while providing quite noticeable benefits. As always, there are no silver bullets.

Finally, I am assuming we're trying to deal with a targeted web blockage, not an all-out shutting down of the Internet in a particular region.

### Usual suspects

The [Tor Browser](https://www.torproject.org/download/), VPNs, [Psiphon](https://psiphon3.com/en/index.html) and many other projects offer ways for determined users to access blocked resources. They're invaluable and effective (if you can, you should by all means deploy a Tor Hidden Service!), but they do not work at scale: it is not reasonable to expect the whole population of a country to download the Tor Browser, or use VPNs.

Instead, I want to focus on strategies website admins can employ to make their content available to visitors without requiring anything specific from the visitors themselves.

### Enter Blockchain

Ha ha, no. Good one! But if you buy into the ICO of rysiekcoin, I am happy to tell you more about how and why Blockchain is useless here.

### Don't: absolute URLs

Before we go further, here are a few reasonably simple *don'ts*.

Do *not* hard-code absolute URLs in your templates, or save absolute URLs in your database. It's tricky to get WordPress to not do that, but it is possible. Same with many static site generators.

You want to be able to move domains in case you're blocked, or host your website securely on a Tor Hidden Service if you choose to, or do any of the other more funky things — and hard-coding or saving the current domain along with every link and every image URL will make this considerably harder (or even impossible on short notice).
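One cheap safeguard is to check the generated output before deploying. Below is a minimal Node sketch of such a check; the output directory and the domain are assumptions, adjust them to your setup.

```js
// check-absolute-urls.js: fail the build if the generated output contains
// hard-coded absolute URLs pointing at the current domain.
// `public` and `example.com` are placeholders for your own setup.
const fs = require('fs');
const path = require('path');

const OUTPUT_DIR = 'public';
const OWN_DOMAIN = 'example.com';

let hits = 0;

function walk(dir) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      walk(full);
    } else if (/\.(html|css|js|xml)$/i.test(entry.name)) {
      const matches = fs.readFileSync(full, 'utf8')
        .match(new RegExp('https?://' + OWN_DOMAIN, 'g'));
      if (matches) {
        console.log(`${full}: ${matches.length} hard-coded URL(s)`);
        hits += matches.length;
      }
    }
  }
}

walk(OUTPUT_DIR);
process.exit(hits > 0 ? 1 : 0);
```

Wire it into your build or CI, and hard-coded domains get caught before they spread through the whole site.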

### Don't: [barnacle trackers](https://barnacles.online/)

Including a lot of JS libraries, media, and fonts from third party servers makes your site brittle. Don't. If you need them, self-host them. This is not going to affect the loading time of your site negatively in any meaningful way, and in fact [self-hosting might even improve the loading times](https://www.tunetheweb.com/blog/should-you-self-host-google-fonts/).

Also, your *NoScript*-using visitors will greatly appreciate that they don't have to allow fetching scripts from a dozen random domains. But more importantly, it will make doing some of the more funky anti-censorship stuff considerably easier.

If you need to de-googlify your fonts, [there's a script for that](https://git.occrp.org/libre/fonts-degooglifier) (not to mention it's not that hard to do manually).
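For the curious, the manual approach boils down to downloading the stylesheet, mirroring the font files it references, and rewriting the URLs. A rough Node sketch (the font family and paths are just examples, and the linked script handles many more edge cases):

```js
// degooglify-fonts.js: rough sketch of self-hosting a Google font.
// Needs Node 18+ for the built-in fetch(); paths and font family are examples.
const fs = require('fs');

const CSS_URL = 'https://fonts.googleapis.com/css2?family=Lato&display=swap';
const FONT_DIR = 'static/fonts';

(async () => {
  fs.mkdirSync(FONT_DIR, { recursive: true });
  let css = await (await fetch(CSS_URL)).text();

  // mirror every url(...) referenced in the stylesheet, point the CSS at the local copy
  for (const match of css.matchAll(/url\((https:\/\/[^)]+)\)/g)) {
    const remote = match[1];
    const name = remote.split('/').pop();
    const data = Buffer.from(await (await fetch(remote)).arrayBuffer());
    fs.writeFileSync(`${FONT_DIR}/${name}`, data);
    css = css.replace(remote, `/fonts/${name}`);
  }

  fs.writeFileSync(`${FONT_DIR}/fonts.css`, css);
  console.log(`Wrote ${FONT_DIR}/fonts.css; link it instead of the Google Fonts URL.`);
})();
```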

### Don't: JBOP (Just a Bunch of Plugins)

If you're using a CMS, be very careful with plugins. They are one of the main attack vectors: often developed by a guy-and-his-dog kind of team, and as soon as a plugin becomes somewhat popular, the [capacity to handle reported bugs and publish security updates in a timely manner goes away quickly](https://www.zdnet.com/article/thousands-of-wordpress-sites-backdoored-with-malicious-code/). At the same time, plugins (usually) have full access to the database and the filesystem.

There have also been cases of [plugins being sold by their original developers to third parties that then issued back-doored updates](https://www.bleepingcomputer.com/news/security/backdoor-found-in-wordpress-plugin-with-more-than-300-000-installations/).

This is less of a problem with static site generators, obviously.

## Handling organic traffic

For a dynamic site running on a CMS like WordPress, every request is very resource-intensive: it means getting the data from the database, parsing the templates, laying out the HTML, and then serving it to the user. And then this same thing has to happen for another, identical request. And another...

I've been asked to deal with "targeted DDoSes" against websites more times than I can remember. Each and every one of these turned out to be... organic traffic due to a popular news item.

***For dynamic websites any sufficiently high organic traffic is indistinguishable from a DDoS.***

Before we can start worrying about censorship, we need to deal with self-inflicted damage.

### Go static

Ideally, go with a static site. There are [plenty of static site generators](https://www.staticgen.com/). For a while the main problem with them was that publishing content required directly editing source files (Markdown and such). There are, however, static site generators [that have a user-friendly admin interface](https://www.staticgen.com/).

One of the advantages of SSGs is that you end up with static HTML, JS, CSS, and image/media files. Which means you can redeploy your website to a different hosting provider quickly and hassle-free. You can also do other, more fancy stuff... we'll talk about that in a moment.

Another advantage is that the code needed to update the site and the code serving the site to the visitors (or attackers) is completely different. There is no database to be SQL-injected, there is no PHP code to be exploited, and no third-party plugins to be back-doored. Just static files and the webserver.

### Microcaching (and... caching)

If you can't go static (or can, but also want to be fancy), you can start doing some serious caching at the edge. No, you don't need CloudFlare for this. Developing your own caching strategy is a lot of work, and there are several pitfalls, but here's the good news: [I already did that for you](https://0xacab.org/rysiek/fasada).

*Fasada* is an NginX config that has been tested and improved while in production for over 5 years. The basic idea is [microcaching](https://www.nginx.com/blog/benefits-of-microcaching-nginx/) — caching dynamic resources for short periods of time, so that when a spike in requests comes, most of them get served from cache, but the content stays reasonably *fresh* all the time.

Of course, all static resources (CSS, JS, images, etc.) can and should be cached for way longer. No need to cache them for years; an hour or two should suffice.
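To make the idea concrete, here is a minimal sketch of microcaching in NginX terms. This is *not* Fasada, which does considerably more; the upstream, domain, and durations below are placeholders.

```nginx
# Minimal microcaching sketch (belongs in the http {} context); not Fasada itself.
proxy_cache_path /var/cache/nginx/micro levels=1:2 keys_zone=micro:10m
                 max_size=1g inactive=10m use_temp_path=off;

upstream backend {
    server 127.0.0.1:8080;               # your CMS back-end (placeholder)
}

server {
    listen 80;                            # TLS configuration omitted for brevity
    server_name example.com;              # placeholder

    location / {
        proxy_pass http://backend;
        proxy_cache micro;
        proxy_cache_valid 200 301 302 10s;   # dynamic pages: cache for mere seconds
        proxy_cache_lock on;                 # collapse concurrent misses into one back-end request
        proxy_cache_use_stale error timeout updating
                              http_500 http_502 http_503 http_504;
        add_header X-Cache-Status $upstream_cache_status;
    }

    location ~* \.(css|js|png|jpe?g|gif|svg|woff2?)$ {
        proxy_pass http://backend;
        proxy_cache micro;
        proxy_cache_valid 200 2h;            # static assets: an hour or two is plenty
    }
}
```

The `proxy_cache_use_stale` line is what lets cached content survive a back-end outage, which comes in handy in the next section.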

### Spread out

Once you have some edge caching deployed, you can easily add capacity by adding cheap edge VPSes. Since almost all requests get cached, the limiting factor becomes bandwidth. This means you can add a lot of edge nodes (thus adding plenty of available bandwidth) before you need to add back-end capacity.

The additional benefit is that all the cached content (which usually would mean all the popular content) on your site could stay up and on-line even if the back-end is down for whatever reason (updates, maintenance, accidental screw-up, etc.) — this is one of the things the Fasada config was designed for.

## Coping Strategies

Now that we can handle a reasonable amount of organic traffic, we can start thinking about targeted malicious actions (break-ins, DDoS) and censorship. 

### Whack-a-mole

Having multiple edge nodes (and your back-end server not directly exposed) also means you can play whack-a-mole with most DDoSes.

When a DDoS targets a site, each botnet node makes the necessary DNS requests and then targets the resulting IP addresses, to maximize the bandwidth used for the actual attack. If you have a way of quickly deploying to new IP addresses (get a new cheap VPS?) and a low TTL on your domain, you can move all legitimate traffic (which is going to be checking with DNS servers often) to a new edge node, while the DDoS pummels the old IP addresses. Win-win.

This can also be used to deal with censorship, by moving to a new domain (and perhaps IPs) every few months, whenever the current ones get blocked — provided that the absolute URLs are not hard-coded anywhere on your site.

Both of these approaches require manual intervention and are somewhat labour-intensive, but they can be effective in the short term. Censors do move slowly.

### WebArchive your site

Censors often miss the fact that censored websites can be accessed via the [Wayback Machine](https://archive.org/web/). So the next step is to make sure your website is available there.

Check if snapshots are made on a regular basis (if not, you can reach out to the [Archive-It](https://archive-it.org/) service), and verify that the content displays correctly. This is where simplifying your site and self-hosting all resources starts to pay off.

This is also extremely useful in case of any calamity. Of course we all make back-ups, but having a public back-up of all content on WebArchive is definitely a nice additional safety net!
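If you want to check this programmatically, the Wayback Machine has a public availability endpoint. A quick Node sketch (Node 18+ for the built-in `fetch()`), assuming the endpoint behaves as documented:

```js
// wayback-check.js: ask the Wayback Machine whether a snapshot of a URL exists.
const site = process.argv[2] || 'https://example.com/';   // URL to check (placeholder)

(async () => {
  const api = 'https://archive.org/wayback/available?url=' + encodeURIComponent(site);
  const data = await (await fetch(api)).json();
  const snap = data.archived_snapshots && data.archived_snapshots.closest;

  if (snap && snap.available) {
    console.log(`Latest snapshot: ${snap.url} (taken ${snap.timestamp})`);
  } else {
    console.log('No snapshot found; time to get the site archived.');
  }
})();
```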

### Move Zip (for great justice)

If you're running a static site, zip it up and publish it as a zipped bundle. That way interested parties can just download the zip file and distribute it over the sneakernet, if all else fails.

Ideally, automate it and script it, and make sure that the extracted files work well when viewed locally in a browser. This might require a bit of fiddling with URLs, but it's totally worth it.
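A minimal sketch of such automation, assuming the `zip` command-line tool is available and that the generator writes its output to `public/` (both are assumptions):

```js
// bundle-site.js: zip up the static build so it can travel over sneakernet.
const { execSync } = require('child_process');

const OUTPUT_DIR = 'public';                            // placeholder output directory
const stamp = new Date().toISOString().slice(0, 10);    // e.g. 2020-07-25
const archive = `site-${stamp}.zip`;

// -r: recurse into the directory, -X: drop extra file attributes
execSync(`zip -r -X ${archive} ${OUTPUT_DIR}`, { stdio: 'inherit' });
console.log(`Wrote ${archive}; test it by extracting it and opening index.html locally.`);
```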

<DEMO TIME: OCCRP PROJECTS ZIP>

And once you have a zip, why not push it to some random places, using services like Dropbox or Mega? That way an up-to-date version of your site will be available in case of emergencies. All you need to do then is publicize the URLs.

### Get the word out

Finally, having [a landing page documenting your anti-censorship strategies](https://www.occrp.org/en/aboutus/bypassing-censorship) is definitely worth it. It might not be directly available when the site gets censored, but it will be available in various caches and all places you made your content available at. That way you maximize the chance people who need it will find information on how to access your site when it's blocked.

## Going experimental

So far we have not discussed anything particularly surprising. Yes, there was a bunch of best practices we all kind of knew but mostly ignored, some code to test out, and perhaps an idea or two that were somewhat interesting.

I'll fix that right after this disclaimer: while the suggestions I made so far have been tested and deployed successfully in production, the ideas below are mostly at a proof-of-concept stage. If anyone is interested in testing them out or working on them together, please get in touch!

### Funky protocols

If we're doing a static site, we can also deploy it using peer-to-peer protocols, which are very good at avoiding censorship, such as [IPFS](https://ipfs.io/) or BitTorrent. Again, it is a good idea to add information on how to access such content to your anti-censorship landing page (for example, by pointing visitors to IPFS gateways or BitTorrent clients).
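Publishing a static build to IPFS can be as simple as adding the output directory through a local node. A sketch, assuming go-ipfs is installed and initialized and the output lands in `public/`:

```js
// publish-ipfs.js: push the static build to IPFS via the local `ipfs` CLI.
const { execSync } = require('child_process');

const OUTPUT_DIR = 'public';   // placeholder: your generator's output directory

// `ipfs add -r -Q` adds the directory recursively and prints only the root CID
const cid = execSync(`ipfs add -r -Q ${OUTPUT_DIR}`).toString().trim();

console.log(`Site root CID: ${cid}`);
console.log(`Reachable through any gateway, e.g. https://ipfs.io/ipfs/${cid}/`);
```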

Of course, this goes against what I said earlier: that we're trying to find ways to evade censorship that do not require the visitors to do anything special. So why am I mentioning this?

Because both BitTorrent and IPFS have JavaScript implementations that work in the browser.

### Service Workers

Let's talk about [Service Workers](https://developer.mozilla.org/en-US/docs/Web/API/Service_Worker_API) for a moment. They're a reasonably new (but widely implemented) browser API that lets a website load some JavaScript code and instruct the browser to run it to handle *all* requests related to that website.

After the user visits the site once, the service worker kicks in the moment the user navigates to the site again in their browser, *before* the site is loaded. This means we can use service workers to fetch content over alternative protocols, as long as those protocols have implementations that work in the browser. Even if the site is now blocked.
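The mechanism itself is small. A minimal sketch (not Samizdat's code; the cache name and file names are illustrative):

```js
// On the page itself: register the service worker on the first successful visit.
if ('serviceWorker' in navigator) {
  navigator.serviceWorker.register('/sw.js');
}
```

```js
// sw.js: from now on, every request for this origin passes through here,
// even when the site itself can no longer be reached directly.
self.addEventListener('install', (event) => {
  event.waitUntil(self.skipWaiting());   // take over without waiting for old workers
});

self.addEventListener('fetch', (event) => {
  if (event.request.method !== 'GET') return;   // only cache idempotent requests
  event.respondWith(
    fetch(event.request)
      .then((response) => {
        // keep a copy of everything that still loads normally
        const copy = response.clone();
        caches.open('site-v1').then((cache) => cache.put(event.request, copy));
        return response;
      })
      // if the network fails (outage, block, ...), fall back to the local cache
      .catch(() => caches.match(event.request))
  );
});
```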

### Samizdat

That's exactly the idea behind [Samizdat](https://samizdat.is/). On the most basic level it's a bit of JS that installs a service worker, which then uses different techniques to get the content in case a regular `fetch()` doesn't work for whatever reason. The only thing the user needs to do is to be able to visit the site *once*.

<DEMO TIME: SAMIZDAT>

Currently Samizdat implements regular [`fetch()`](https://0xacab.org/rysiek/samizdat/-/blob/master/plugins/fetch.js), [local cache](https://0xacab.org/rysiek/samizdat/-/blob/master/plugins/cache.js) (to show any cached content quickly in case a request fails), and an [IPFS transport](https://0xacab.org/rysiek/samizdat/-/blob/master/plugins/gun-ipfs.js). But it is entirely reasonable to implement any other kind of transport, like:
- pulling the content from the Web Archive
- pulling the content from a Dropbox or Google Drive folder
- getting it via [WebTorrent](https://webtorrent.io/)
- or using [`dat://`](https://dat.foundation/)
- hitting some fallback IPs directly (assuming they're not blocked)
- [Snowflake](https://snowflake.torproject.org/)

Whatever is possible in JS in the browser can be used in a Samizdat transport plugin.
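To make that concrete, here is a purely hypothetical sketch of what such a transport could look like. This is not Samizdat's actual plugin API; the names are invented, and `Promise.any` needs a reasonably recent browser.

```js
// Hypothetical transport: try to satisfy a request from the Wayback Machine
// instead of the (possibly blocked) origin. Names and API shape are invented.
const webArchiveTransport = {
  name: 'web-archive',
  fetch: async (request) => {
    // a partial timestamp makes the Wayback Machine pick the closest snapshot
    const mirrored = 'https://web.archive.org/web/2020/' + request.url;
    const response = await fetch(mirrored);
    if (!response.ok) throw new Error('no archived copy');
    return response;
  },
};

// Inside the service worker: race the regular fetch against the fallback
// transports and answer with whichever source delivers content first.
self.addEventListener('fetch', (event) => {
  event.respondWith(
    Promise.any([
      fetch(event.request),
      webArchiveTransport.fetch(event.request),
    ])
  );
});
```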

This is where it all comes together. Making sure our content is as static as can be, and pushing it to different unexpected locations or publishing it using non-standard protocols, combined with the Service Worker API, means we can now effectively work around censorship, as long as the user is able to visit the site once.

### Assumptions and Limitations

Samizdat requires the visitor to be able to visit the site once via regular HTTPS, to load the service worker. That is a big ask, but in my experience all censorship is leaky. Most users will be able to access some blocked sites every now and then — when travelling, from a location using a VPN, or perhaps when the block goes down temporarily for whatever reason.

This is not usually a problem for the censors, since most of the time the site remains blocked. But if visiting a site *once* means a person is able to access it regardless of the block from that point on, that changes the game. And it's not an easy thing to monitor, since any attempt to access a blocked site still looks blocked in the censors' logs; there is no information that a different protocol or end-point was used.

### Privacy considerations

Of course, there are privacy considerations with Samizdat. Using peer-to-peer protocols could flag users for additional scrutiny in oppressive regimes. Using one of the other possible options (like pulling content from Dropbox) could be a better choice in certain cases.

Then again, simply trying to visit a blocked site is probably already enough of a red flag.

And with Samizdat, it's possible to *improve* the safety of the users: if Samizdat is configured to *not* use regular `fetch()` once the service worker is installed, no further requests would show up as going to the blocked site, even when the visitor navigates to it. Content would be pulled from cache or using the other methods, without ever making the tell-tale blocked request.

Or imagine hosting a completely innocent-looking static site which, when accessed after the service worker loads, displays completely different content. A simple crawl of the site would not reveal the real content (since the service worker would not get a chance to run), but for visitors with regular browsers the site would work as if the "hidden" content were normally available there.

### Further work

There is plenty of work that needs to be done around Samizdat:
- deployment procedure needs to be streamlined (for example, a WordPress plugin to automagically push content to IPFS upon publication would be a very good idea!)
- mobile support
- additional transport plugins need to be implemented
- documentation needs to be greatly improved

## Similar projects

There is, and has been, a number of projects similar to Samizdat. Here's a non-exhaustive list.

I had a nice chat with [NetBlocks](https://netblocks.org/) people at MozFest; it turns out they had a similar idea some years ago, when Service Workers were a new standard. In case of a blockage, they used a simple `fetch()` to a list of fall-back IP addresses not exposed in the DNS for a given domain.

Apparently this worked well enough; the problem was that Service Workers were not available on mobile back then. This has since changed. The project seems to have been called Lazarus, but I was not able to find any sources.
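A rough sketch of that idea, with all the details invented (a real deployment also has to sort out TLS certificates for end-points addressed by IP):

```js
// On network failure, retry the same path against fall-back end-points
// that are not published in the site's DNS. Addresses are placeholders.
const FALLBACK_HOSTS = ['203.0.113.10', 'fallback.example.net'];

self.addEventListener('fetch', (event) => {
  event.respondWith((async () => {
    try {
      return await fetch(event.request);
    } catch (err) {
      for (const host of FALLBACK_HOSTS) {
        const url = new URL(event.request.url);
        url.hostname = host;               // same path, different end-point
        try { return await fetch(url.href); } catch (e) { /* try the next one */ }
      }
      throw err;
    }
  })());
});
```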

The [lunet](https://github.com/gozala/lunet) project explored the same basic idea. Currently the project seems defunct.

[NewNode](https://clostra.com/newnode.html) attempts to implement a decentralized protocol for fetching content, in the form of a library that is easy to use in mobile apps. [Push app](https://www.pushapp.press/) is a mobile app generator. It already implements Tor as a fall-back in case regular HTTPS requests to the back-end fail. OCCRP has been working on integrating NewNode into it (turns out it's not that easy).

### Browsers should lead

Important Sidenote: All of this should not be necessary. It is, because browser vendors do not seem interested in implementing decentralized protocols. There is some [good cooperation](https://blog.torproject.org/tor-heart-firefox) between Mozilla and the Tor Project (including the [Tor Uplift](https://wiki.mozilla.org/Security/Tor_Uplift) project), and there is [Project Fusion](https://wiki.mozilla.org/Security/Fusion), aiming to implement Tor directly in Firefox, but... well, that wiki page was last edited over 2.5 years ago.

Furthermore, there are [things browser vendors can do](https://sammacbeth.eu/blog/2019/03/22/dat-for-firefox-1.html) to make peer-to-peer protocols more viable in the browser. [Beaker Browser](https://beakerbrowser.com/docs/faq/) implements the `dat://` protocol directly, showing that this is possible.

And yes, I am focusing on Mozilla Firefox here, because we damn well know Google is not going to be interested in implementing any of this in Chrome/Chromium (instead peddling Google Shield and AMP, their own Internet centralization efforts).

Not to mention Safari, which [did not even implement the Service Worker API fully](https://developer.mozilla.org/en-US/docs/Web/API/Client#Browser_compatibility).

The thing is, Mozilla is positioned in such a way that it could lead here. Once Firefox implements Tor, or the building blocks required for truly peer-to-peer protocols, other browsers could be shamed into doing so too.


## The future needs to be decentralized

The Internet's original strengths and promise stemmed from its decentralized nature. We need to find a way to decentralize it again.

This will require work and not taking the easy way out. It's on all of us and the small decisions we make, like including a JS file from a third-party server, or putting our site behind one of the behemoth CDNs.

But it is necessary if we don't want to end up with glorified cable TV.