An internet with an elephant memory

Turns out I always had a penchant for run-on sentences. I have counted over 3 of them in a college application essay I wrote in 2004. It is sitting there, on my Dropbox account, where I moved my “important” documents from an old Yahoo! email account. It’s been there, untouched, seemingly for eternity. Barring a catastrophic event, like Dropbox going out of business or me getting hacked, I suspect it’ll be there for at least 15 more years.

My early days of computing, in the mid-90s, were colored with paranoia about losing data. Not that I had much “content” digitally, but whatever I had was important. Nevertheless, despite my best efforts, I managed to lose data constantly. Floppy disks would break, hard drives would fail. Sometimes computers would just get shipped to dad’s work, with everything wiped.

During the first years of the 21st century, things started to get worse. As digital cameras (remember Mavica?) became more prevalent, I started to produce more things digitally. Yet my ability to keep things safe didn’t seem to keep up. External hard drives, The Backup, were cumbersome. I never used them with the militaristic rigor the process demanded, and whatever drive I used, I handled it with little to no care. Multiple years of boarding school memories vanished with a single external drive going kaput after a drop from a top bunk. Same with years of carefully downloaded MP3s with thousands of ID3 tags organized with obsessive zeal. Turns out a magnetic platter with a read head flying a few nanometers above it (seriously) isn’t very durable. Who knew?

Something flipped though, in the second half of that decade. The cloud and the internet companies arrived, and I welcomed them with open arms. First, Google changed the entire storage conversation by introducing Gmail with 1GB of free storage. It was released on April 1st, as an anti-joke of sorts. Can’t blame them, as it seemed too good to be real compared to the paltry 2MB of storage provided by Hotmail. As if that wasn’t enough, Google then upped the ante with a storage counter that kept going up forever. The symbolism of that counter was more important than the increasing storage. The idea wasn’t that you had a lot of storage, it was that storage was no longer a worry. Storage was electricity. It was boundless and virtually free. Things Just Worked.

And it all seemed to pick up speed from there on. For those pesky MP3s, first iTunes came along, making your content easily downloadable again. Flickr made an attempt at being The Place for your photos, but depending on your view either blew that early lead or decided to focus less on the network and more on the photos. Facebook, of course, realized that photos were the key to building an emotional bond with its users, and users with each other, and innovated to allow people to upload unlimited amounts of photos, for free, forever. Today, for most people, the only place where they talk about gigabytes is their data plan, their conduit to the magical place where storage is unlimited.

A 5 Megabyte IBM hard drive in 1956

But with the ever decreasing costs of storage, and ever faster broadband (remember that term?), we seemed to forget the idea that maybe this kind of data storage is unnatural. And it comes with its own set of challenges that we are just realizing now. Cloud storage is great for being so durable and immutable. But the business models that encourage companies to store more and more data each day are becoming a moral hazard of enormous size. Users enjoy the high availability (the cloud is everywhere, and it always works, unlike that hard drive sitting in your drawer) but they don’t understand the risks involved. Our data gets collected by everyone, technically with our legal consent but not really by our expressed intention. And when data goes to places it shouldn’t go, and gets used by people who shouldn’t use it, all we get is an apology and, if we are lucky, a tool to show how badly our data custodians have screwed up.

Let’s dive into two technical concepts. One is the durability of data; that is, whatever you put online stays forever. The other is immutability, the idea that data never changes. The first one is obvious, the second is a bit technical. Another way to think of immutability is reversibility. The data of course changes; the idea behind immutability is more that change itself is modeled, and is reversible.
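To make “change itself is modeled, and is reversible” a bit more concrete, here is a minimal sketch in Python. The VersionedRecord class and its method names are made up for illustration; this is not how any particular company stores data, just one simple way to picture it.

```python
class VersionedRecord:
    """Keeps every version of a value instead of overwriting it (illustrative only)."""

    def __init__(self):
        self._versions = []

    def set(self, value):
        # A change is recorded as a new version; older versions are never touched.
        self._versions.append(value)

    def current(self):
        return self._versions[-1]

    def history(self):
        return tuple(self._versions)

    def revert(self):
        # Reverting is itself a change: the previous version is appended again.
        if len(self._versions) < 2:
            raise ValueError("nothing to revert to")
        self._versions.append(self._versions[-2])


essay = VersionedRecord()
essay.set("first draft")
essay.set("second draft")
essay.revert()

print(essay.current())  # first draft
print(essay.history())  # ('first draft', 'second draft', 'first draft')
```

Notice what is missing: there is no method that removes anything. Every past version stays in the history, which is exactly the property the rest of this piece takes issue with.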

These are great technical ideals to uphold, and most technology companies are very good at upholding them. I think, however, we need to take into account something else. I’ll call it deletability. This isn’t a new concept (and apparently not a word). GDPR, the big daddy of privacy regulations worldwide, calls it the right to erasure, with “right to be forgotten” in quotes in the title of the article. A right FKA “to be forgotten”.

The right to be forgotten bothered everyone, from techies to journalists, when it first came up. It was uncomfortable. As someone who is in the former group, I found the idea of taking an algorithm, something that only looks at data, something “impartial”, and actively writing an edge case into its code just irksome. It’s a code smell, they would say. That stuff would never pass code review. And the journalistic implications seemed sketchy too; what did it mean when a company, say a monopolistic search engine that’s the choke point for billions of people’s access to information online, hides it?

But consider the alternative. What does it mean, for our society and for individuals, that everything everyone did is available, forever, stored immutably, but only as recorded by whoever recorded it, without surrounding context? It is madness. Such archives are tools of tyrants, not compassionate leaders. That one is obvious. But they also might just be bad business, once you realize business operates in a society, not on a different plane of reality.

A storage pod designed by Backblaze, capable of storing 4.8 Petabytes

Everyone is rightfully mad at Facebook for being a comedically bad caretaker of millions of people’s data. A company founded in a dorm room, which eventually moved to the swamps of Menlo Park, was able to leak 270,000, sorry 50 million, wait actually 87 million, people’s data to a bunch of political consultants. I called this a “breach of trust” on Twitter, which became the term Facebook decided to call it also (Thanks?). But even this term doesn’t do it justice; what Facebook did was just leave a couple of network cables plugged into their database lying around to see who would use them first.

Is Facebook alone in this? Of course not. Facebook will close whatever barn doors are left, say they are sorry, and that’ll be the end of it. The data that leaked has been leaked (and definitely way more of it leaked than has been reported) but the rest is still stored forever on Facebook’s servers. And the same is true for Google, and Amazon, and even, to some degree, Apple. Along with all the apps that you download once, sign in to with Facebook or Google, and never think about again.

We are conditioned to believe this is normal, but it doesn’t have to be. This is the result of choices others made for you, not a natural progression of events. The persistence of your data in some cloud is not the logical extension of technological progress.

We don’t even have a good understanding of what “our data” really means. Some data we enter into these databases ourselves, like our profile photos and names. Some of it gets put in there by surveillance, such as your browsing history. And a third category is data that’s been derived, such as what Google thinks your interests are or how Facebook thinks you feel. This obfuscation and muddying of the term “data” is frustrating. We generally understand the first, know the second happens (though people also underestimate the surveillance), and the third mostly goes unexplored.

A complaint about wrong grade of copper from 1750 BC

But the real problem is, we assume this data is untouchable. We are accustomed to thinking that once the data hits the cloud, it is there forever. But that doesn’t have to be the case. You can, and you should be able to, delete the data, forever. And if you want to not delete it, but take it to a competitor, you should be able to do it with the click of a button, not via a cumbersome process (that you can’t do on your phone) involving downloading files and uploading them somewhere else. I can change phone providers more easily than I can change social networks, and you are telling me this is a technical problem?

Imagine a different world. Imagine an internet where things are ephemeral by default, and persistent only by choice. This is already part of the charm of many new social apps, like Snapchat. Deletion is the default, durability an option. People become more open to expressing themselves when they know that something they’ve said won’t be held against them by anyone. Except, of course, Snapchat did actually keep this data; but that really speaks to the bad defaults in law that allowed them to do it (while misleading the users).
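“Ephemeral by default, persistent by choice” can be pictured as a tiny key-value store where everything expires unless you explicitly ask to keep it. The sketch below is a toy under stated assumptions: the EphemeralStore class, its keep flag, and the 10-second lifetime are all invented for illustration, and the clock is injectable only so the example can be run without waiting.

```python
import time


class EphemeralStore:
    """Items expire after default_ttl seconds unless saved with keep=True."""

    def __init__(self, default_ttl=10.0, clock=time.monotonic):
        self._default_ttl = default_ttl
        self._clock = clock
        self._items = {}  # key -> (value, expiry time, or None for "keep")

    def put(self, key, value, keep=False):
        expires_at = None if keep else self._clock() + self._default_ttl
        self._items[key] = (value, expires_at)

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            # Expired: remove the entry from the store, don't just hide it.
            del self._items[key]
            return None
        return value


# A fake clock, so the example does not have to wait 10 real seconds.
now = [0.0]
store = EphemeralStore(default_ttl=10.0, clock=lambda: now[0])
store.put("snap", "a silly photo")           # ephemeral by default
store.put("diploma", "scan.pdf", keep=True)  # persistent by choice

now[0] = 11.0
print(store.get("snap"))     # None
print(store.get("diploma"))  # scan.pdf
```

One honest caveat: this toy only removes an expired item when somebody asks for it. Until then the item is still sitting in memory, so a real system would also need something that sweeps expired entries on a schedule. That gap between “looks deleted” and “is deleted” is the same one Snapchat fell into.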

Techies will balk at the idea that users should be allowed to delete anything at whim. The first line of defense is obvious: it’s hard. Most technology stacks, even those (or especially those) with simple user interfaces, like Uber, have heaps of systems behind the surface. The same data is duplicated to different systems, as some are used for analysis, some for accounting, some for supporting other features. This helps product development, companies argue. But this is just picking one side, technology, over the other, humans.

There’s no insurmountable technical reason that data that’s been uploaded somewhere cannot be deleted later. It might be hard (read: expensive), cumbersome (read: expensive), and sometimes outright annoyingly complicated (read: really expensive), but it’s doable. And more importantly, the pain associated with implementing such deletion steps should be seen as another code smell. As privacy by design advocates argue, deletability should be a concern on day one, not an afterthought.
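As a rough illustration of what “deletability on day one” could look like when the same data lives in several systems, here is a sketch where every system that stores user data has to register a delete handler up front. The DeletionRegistry class and the two fake systems (plain dictionaries standing in for analytics and accounting) are hypothetical, not a description of any real stack.

```python
class DeletionRegistry:
    """Every system that stores user data registers a delete handler."""

    def __init__(self):
        self._handlers = {}

    def register(self, system, handler):
        self._handlers[system] = handler

    def delete_user(self, user_id):
        # Fan the request out to every registered system and report
        # how many records each one removed.
        return {name: handler(user_id) for name, handler in self._handlers.items()}


# Hypothetical stand-ins for real systems: plain dicts keyed by user id.
analytics = {"u1": ["click", "scroll"], "u2": ["click"]}
accounting = {"u1": ["invoice-1"]}


def dict_deleter(store):
    def delete(user_id):
        # Remove the user's records and return how many were dropped.
        return len(store.pop(user_id, []))
    return delete


registry = DeletionRegistry()
registry.register("analytics", dict_deleter(analytics))
registry.register("accounting", dict_deleter(accounting))

report = registry.delete_user("u1")
print(report)     # {'analytics': 2, 'accounting': 1}
print(analytics)  # {'u2': ['click']}
```

The point of the design is less the code than the rule around it: a new system does not get to hold user data until it has registered a handler. The expensive part in real life is everything this sketch skips, such as backups, caches, and copies nobody remembers making.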

If every product manager in Silicon Valley thought about how their teams would eventually have to delete the data, we wouldn’t be in this mess in the first place. If the right to erasure was part of the technical calculus, alongside the maintenance and performance requirements worked out by tech leads, deletion would also work. If every engineer thought about the data she’s sending over the wire when she logs an error message or sends it through a PubSub system, she would be writing better code in the first place. The data wouldn’t seep into the machinery, like a viral infection that you can’t even diagnose, incubating for years and years, only to have an outbreak that almost destroys Western democracy.
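Here is one small way an engineer could think about the data in an error message before it leaves the process: a filter for Python’s standard logging module that scrubs anything shaped like an email address. The RedactEmails class, the deliberately simple regular expression, and the "payments" logger name are my own assumptions for the sake of the sketch; real personal data comes in many more shapes than email addresses.

```python
import logging
import re

# Deliberately simple pattern; real-world email matching is messier.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class RedactEmails(logging.Filter):
    """Replaces anything that looks like an email before it is written."""

    def filter(self, record):
        # getMessage() merges msg and args, so the final text gets redacted.
        record.msg = EMAIL.sub("[redacted]", record.getMessage())
        record.args = None
        return True  # keep the record, just without the personal data


logger = logging.getLogger("payments")
logger.addFilter(RedactEmails())

# The address never reaches the log output.
logger.warning("card declined for %s", "jane@example.com")
```

A filter like this is a safety net, not a substitute for the habit. Data that never gets written to a log is data nobody has to hunt down and delete later.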

The second line of defense against deletability is that sometimes you do have to keep data for legal reasons. And that’s fine. But that misses the point; the idea is not to make data disappear at anyone’s whim. If your service is collecting data that might be usable in a law enforcement setting, the first reaction should be to consider whether you should be collecting it in the first place. This isn’t to make law enforcement harder, but to make it more directed and respectful of individuals’ rights. Law enforcement should cooperate with companies with accountability and oversight, not simply knock on their door.

And this notion of deletability should extend not just to data people create, but also to data that is derived from their activity, or can be traced to them some other way. Advertising giants like Google and Facebook, and the cottage industry of ad-tech firms (also lovingly called surveillance capitalism), create scarily detailed (but hilariously inaccurate) dossiers on everyone. There are reasonable arguments that such data should never be collected or generated in the first place, or should be limited. But it has commercial value, as well as academic and social value.

What is unacceptable is that users have no control over this dragnet data collection, slaughterhouse-like processing, and disposal. People should be opting in, not opting out. Those who opt in should be able to change their minds at a later time, and not be treated like people trying to exit their timeshare contracts.

Technology firms aren’t alone in this, of course. They are just playing the game that retailers, with their loyalty programs and credit cards, have been playing for years. Same with any issuer of credit cards or payment programs. This sort of profiling is not new. But these firms are playing it way better than anyone, with unparalleled success (for their shareholders and employees, mostly) and with even fewer of the risks really accounted for. The increased scrutiny is a function of their size and growth, and of the new expectations of their technical competence and excellence.

And we should do more. We should be investing in technologies like end-to-end encryption, where data is only visible to the intended recipients, as opposed to everyone who follows the corporate handbook. The current solutions in this space are still new, and too hard to use for mainstream adoption. And even if usability issues were fully solved (which is an open problem), many people would miss the features they came to rely on from cloud storage. Some of the problems here can be attacked with more client-side (i.e. on-device) computation, and some with technologies like differential privacy. Nevertheless, even encrypted storage systems should support deletability on day one, since no encryption technology is perfect, and other risks will always persist.
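Differential privacy can sound abstract, so here is a minimal sketch of its best known building block, the Laplace mechanism, applied to a simple count. The function names, the example ages, and the choice of epsilon are all made up for illustration, and Python’s random module is not suitable for real privacy guarantees; a production system would use a vetted library.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential samples with the same
    # mean is Laplace-distributed with that mean as its scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng):
    """Noisy count of the items matching predicate.

    A count changes by at most 1 when one person is added or removed
    (sensitivity 1), so the noise is drawn with scale 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)  # seeded only so the sketch is repeatable
ages = [23, 35, 41, 29, 52, 38]

# The true answer is 4; what gets released is 4 plus random noise.
noisy = private_count(ages, lambda age: age >= 35, epsilon=0.5, rng=rng)
print(noisy)
```

A smaller epsilon means more noise and more privacy; a larger one means a more accurate answer and less privacy. Over many people the noise averages out, which is why aggregate statistics stay useful while any one person’s presence in the data is blurred.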

I bought a new computer a few months ago. I picked it up at the store, came home, connected to my home wifi, and I was ready to go. My music on Spotify, my email on FastMail and Gmail, most of my documents on iCloud; I didn’t need to worry about much else. This is a giant improvement over what I would have had to do many years ago. The idea that technology exists to make my life easier, where I can delegate storage of my data to others, is a good one.

This, however, should be an informed choice that users make. As more of the world moves to the digital realm, the more oppressive it becomes to have your data controlled by someone else, alongside a billion other people’s. As more of you becomes your data, and more of it moves online, the more there is at stake. There are great benefits to be had, but the good stuff doesn’t increase in direct proportion to the risks. A bigger data pool doesn’t necessarily mean better features, but it definitely means a bigger target. We should optimize our economies and our companies to recognize the moral and economic hazards, and then minimize them. Doing what is easiest for technology would be taking the easy way out.