The Secret Liabilities of Data

This post is cross-posted from my joint newsletter with Ranjan Roy, The Margins. Please check it out, and consider subscribing.


I love stretching analogies to the point they break. Take the tired cliche that data is the new oil. It makes some sense, considering the biggest data hoarders seem to be doing great, like the oil companies once were and still are. But there are big flaws in the analogy. The point of oil is not just that it is very valuable, which it is, but that it’s a physical good that gets consumed. Data, on the other hand, is exceedingly cheap to store, and more importantly free to copy. Try doing that with carbon molecules!

But a different way to look at the “data is the new oil” debate is to consider whether more data is actually an asset or a liability. True, you can derive an insane amount of value from many types of data if you are Amazon. But what if you are Facebook or Google, and store billions of people’s private data? Obviously, these companies’ quarterly earnings belie any notion that private data is purely a liability, but I’d argue there are plenty of unaccounted liabilities there too.

Some of it is probably solely a function of the amount of data stored. Somewhat ironically, the quantity takes on an intangible quality when it crosses a certain threshold; you are a far bigger target if you have a dossier on the entire population than on just one-quarter of it.

The other factor is the nature of the data stored. It’s one thing to remember what people like to watch, as Netflix does (which is still surprisingly controversial), but what if you accidentally leak hundreds of millions of people’s passwords? Millions of keys to millions of castles, like an apple pie by the window, waiting to be snatched.

So, Facebook did not exactly do that, but came scarily close. Brian Krebs, a well-known security reporter, reported that Facebook has been storing the passwords for a few hundred million people in plaintext, without being hashed or salted (more on this later), leaving them both accessible to anyone within the company and ready to leak at a moment’s notice.

Before we go on any further, we need to discuss what hashing is and how it ties the whole analogy together.

What the hash is a hash?

In order to check whether someone entered their password correctly, you don’t necessarily need to know their password; you only need to know what their password “could do”. Imagine the password as a physical object casting a shadow through a linen cloth; you could tell that only that password could cast that shadow, without ever knowing what the password itself looked like. This way, you could just remember what the cast shadow looks like and store that in your database. You would still look at the password (as it is entered), but not keep a copy of the exact password with millions of other passwords lying next to it. Neat!

The key insight here is to never actually store the passwords. Rather, you store the hash of the password (or rather a salted hash, but let’s ignore that for now). Hashing is a one-way mathematical function where you can generate a hash (cast a shadow) from a given input, the password in this case, but you can’t go the other way in a reasonable time. Since you can’t practically regenerate the password from the hash itself, the stored hashes are far less useful to an attacker when they (inevitably) leak in a security breach.
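For the technically inclined, here is a minimal sketch of the shadow trick in Python. It is an illustration under my own assumptions, not how Facebook or anyone else actually does it: the function names `hash_password` and `verify_password` are made up for this example, and I picked PBKDF2 with a random salt from the standard library simply because it ships with Python. The salt is just a random value stored next to the hash so that two people with the same password end up with different shadows.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Assumption for this sketch: a deliberately slow key-derivation function
# (PBKDF2-HMAC-SHA256) with an illustrative iteration count.
ITERATIONS = 100_000


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest). Only these two values are stored, never the password."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for each stored password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, stored_digest: bytes) -> bool:
    """Re-cast the shadow from the entered password and compare it to the stored one."""
    _, candidate = hash_password(password, salt)
    # Constant-time comparison, so timing does not reveal how many bytes matched.
    return hmac.compare_digest(candidate, stored_digest)
```

At signup you would store the `salt` and `digest` returned by `hash_password`; at login you call `verify_password` with whatever the user typed. The password itself only ever exists in memory for the duration of the check.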

This is only one of the big unaccounted-for liabilities that come with storing so much data in one place, whether that data is passwords or other personal information. Facebook, in a single mishap, could have exposed the passwords of anywhere from 200 to 600 million people. The mere range, 400 million, is larger than the US population.

Password re-use is the obvious big problem here, and one that won’t be solved easily until we collectively find a better way to authenticate users. Currently, password managers such as 1Password, or those included with all major operating systems, remain the next best option. The fact remains, however, that the overwhelming majority of people still use the same password to secure their latest cat food purchase and their finances. A single leaked password is all it takes, for most people, for their entire online identity to be at risk.

Then why are we not losing our collective minds over what Facebook did?

Weeks since last Facebook Crisis: 0

There are a couple of reasons. First of all, Facebook claims the plaintext passwords were never leaked outside of Facebook, and while tens of thousands of people technically could access them, a much smaller number actually looked around where the passwords were stored, and not necessarily at the passwords themselves. There’s also no evidence of any intentional access or misuse. It appears that the damage was contained within Menlo Park. This time.

There’s a more salient point though, that it is in Facebook’s interest to keep your data as safe as possible. This is where our analogy comes back to the rescue. The end-game for Facebook is to be the broker of your identity, and in order to do that, they need to keep your data as safe as possible. For you to keep using Facebook’s products so that they can suck in more and more of your private data to their servers (and not share it with others, because as I discussed “privacy is good now”), you need to trust them to keep your identity safe.

And maybe the other reason is that accidentally storing plaintext passwords is less of a one-off bug and more of a rite of passage for any company that stores passwords. It has happened to Twitter and GitHub, as Krebs reports, but they are simply the most well-known offenders. A common joke format on Twitter is publicly shaming some organization by sharing screenshots of its customer service representatives asking you questions about your password, which is a tell that they can see your passwords. There is even a Tumblr (which itself got hacked) just to out these companies.

Given that an average user has 100 or more accounts, that most companies will not even flag this as an issue, and that even fewer of those who notice will publicly come out and apologize (why would you?), it’s quite likely that your passwords are in a database or a log file somewhere, waiting to be looked at by some starry-eyed engineer. This is the world we created for ourselves.

Bugs, or just Organizational Chart Artifacts

I speculated on Twitter that this is less of a “bug” and more of a systemic issue, based on my previous experience of how such things happen. The main point I made is that companies like Facebook operate in a far less centralized fashion than it appears from outside. There are teams that build these secure systems, such as credential stores (where your email and hashed password go) and associated “adaptors” that help you use them. And there are also teams that need to launch products, who sometimes find those adaptors unfit for their needs.

If your options are to convince the team that builds the storage system to work with you, which takes time, or just hack up some solution to save the day, generally you pick the latter. In an environment where up and to the right numbers are prized more heavily than practicing good security hygiene, it’s the more rational choice. You can always apologize with a blog post later, with an aspirational title to remind everyone what you did NOT do, but what you SHOULD HAVE done.

Of course, there’s no reason to think that’s what happened. But I would be surprised if it was too far off. If you are technically inclined, just replace “credential store” with “logs”. To the discussion at hand, the difference is quite immaterial. Data is data (and data is data is data too), access methods be damned.
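To make the “logs” version concrete, here is a hedged sketch of the kind of guardrail that keeps a stray password out of a log file. Everything in it is my own assumption for illustration: the `redact` helper, the `RedactPasswords` filter, and the list of field names it looks for are hypothetical, and a real system would need a far more thorough approach than one regular expression.

```python
import logging
import re

# Hypothetical pattern: catches fields such as password=..., pwd=... or
# "password": "..." in a log message. The field names are assumptions.
_SENSITIVE = re.compile(
    r'("?(?:password|passwd|pwd)"?\s*[:=]\s*)("[^"]*"|\S+)',
    re.IGNORECASE,
)


def redact(message: str) -> str:
    """Replace the value of any password-like field with a placeholder."""
    return _SENSITIVE.sub(r"\1[REDACTED]", message)


class RedactPasswords(logging.Filter):
    """Logging filter that scrubs password-like fields before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the final message first so values passed as args are covered too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True  # keep the record, just with the secret removed
```

You would attach it to a handler with `handler.addFilter(RedactPasswords())`. The catch is the same as in the story above: it only helps on the code paths that someone remembered to wire it into.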

Stewart Brand, the publisher of The Whole Earth Catalog, famously quipped “Information wants to be free”, which of course is a more poetic way to describe the zero marginal cost of copying information. It has long become somewhat of a battle cry for certain corners of the internet who used it to criticize digital rights management schemes.

But I always found it more interesting to see how the same saying applies to databases, which all eventually become free as in liberty, much to the chagrin of their owners. And when such data becomes free, it does not necessarily make the world a better place, but puts people’s private information at risk. Should we leave information this free?

There’s also a second part of Brand’s soundbite that got a lot less press: “Information also wants to be expensive”. Our collective inability to decide on the price (and the cost) of data seems to have a long history.

As we collect more and more data, and put it in more and fewer places at the same time, this discomfort becomes more troubling. Trillions of dollars in value are created out of an asset that we don’t know how to properly value. We are only now barely recognizing the negative externalities of decades of oil production and consumption, and it took us almost destroying the planet. We should do a better job with data.


What I’m Reading

I’ve been on an economics and strategy binge lately. Lots of “big things” to think about and keep in mind as global rules are being rewritten.

Schumpeter on Strategy: Columbia Professor and venture capitalist Jerry Neumann has one of the most thoughtful VC blogs, and this piece on Schumpeter is an “Intro to Strategy” course in itself. Jerry is a good writer too; make sure you follow him.

The mainstream of economics, then as now, pretty much tries to describe the economy as if it shouldn’t change. If it is changing, it’s changing towards an equilibrium, where it won’t have to change any more. Schumpeter noticed that this is not how it works. Both the economy as a whole and individual businesses change constantly. His model of the latter, in his Theory of Economic Development, explains how some entrepreneurs make an unusually large amount of money.

Economics After Neoliberalism: Changing gears now. Renowned Harvard Kennedy School economist (and compatriot of yours truly) Dani Rodrik has been arguing that market dogmatism is finally on its way out, and that this can be a saving grace. The piece spawned a great discussion in responses by other economists and policy makers. Not light reading, but worthwhile.

Economics does have its universals, of course, such as market-based incentives, clear property rights, contract enforcement, macroeconomic stability, and prudential regulation. These higher-order principles are associated with efficiency and are generally presumed to be conducive to superior economic performance. But these principles are compatible with an almost infinite variety of institutional arrangements with each arrangement producing a different distributional outcome and a different contribution to overall prosperity.
