With Big Data Comes Big Responsibility

It’s getting harder to suppress the sense of an impending doom. With the latest Equifax hack, the question of data stewardship has been propelled to the mainstream, again. There are valid calls to reprimand those responsible, and even to shut down the company altogether. After all, if a company whose business is safekeeping information can’t keep the information safe, what other option is there?

The increased attention to the topic is welcome, but the outrage misses a key point. The Equifax hack is unfortunate, but it is not a black swan. It is merely the latest incarnation of a problem that will only get worse unless we all do something.

The main issue is this: any mass collection of personally identifiable data is a liability. Individuals whose data is vacuumed en masse, the companies who do the vacuuming, and the legislators should become aware of the risks. It is fashionable to say “data is the new oil,” but the analogy only goes so far, especially when you consider the current situation of the oil-rich countries. Silicon Valley itself is especially vulnerable here.

Big parts of the tech industry in the Bay Area are built on the mass collection of such private data, and on deriving some value from it. A significant part of that value comes, somewhat depressingly, from ever more precise ad targeting. The problem with this model was long known, if not tacitly admitted by its creators, but it wasn’t until the Snowden revelations that a real national debate picked up. With the recent brouhaha following the 2016 elections, and a real risk of an authoritarian government in the US, the questions are louder this time.

Public outcry does help, but change is very slow. Part of the reason is that the business models are wildly successful. Combined, Alphabet (née Google) and Facebook are a trillion dollar duopoly. With a cottage industry built around these two companies, and practically all stakeholders in the area either beholden or financially tied to the industry, the motivation to change is small. Some companies, like Apple, try to raise the issue to a higher plane of morality, partly for ethical reasons, partly for competitive ones. But the data keeps getting collected at an ever increasing pace, and it’s getting more and more likely that a catastrophic event will occur.

Let’s first talk about how data gets exposed. Hacking, or unauthorized access, is the most talked about way, but it’s far from the only one. A lot of the time, it’s just a matter of a small mistake. Take Dropbox. The cloud storage company once allowed anyone to log into anyone else’s account by entirely ignoring the password check. The bug was caught quickly, but it’s a stark reminder that small mistakes can happen. And that is a point worth pondering, separate from the hack Dropbox also suffered.

As easily as data is collected and stored, it’s even easier for it to change hands. Companies and their assets change hands, and so do the jurisdictions they live in. The Russian tech sector is a prime example. Pavel Durov, the founder of the oddly popular instant messaging platform Telegram, first built VKontakte, a Russian social network much more popular than Facebook in the country. But then came the Russian government with demands for censorship. Durov left the country, and the social network is now owned by a figure much closer to the government. And there’s always LiveJournal, which likewise got sold to a Russian company, with all its data now under Russian jurisdiction.

And sometimes, companies open themselves up to being hacked. Once an internet darling, Yahoo! was put in the spotlight when its own security team found a poorly designed hacking tool, installed by none other than the company itself. Reportedly designed to track certain child pornography related emails for the government, the tool was built without the knowledge of the company’s Chief Security Officer, Alex Stamos, a well-regarded security professional. He departed the company soon after, only to join Facebook. And again, this comes on top of the Yahoo! hack that affected 1 billion users and almost derailed a multi-billion dollar acquisition.

Government surveillance is a touchy subject, and the moral decisions are always fuzzy, with someone left unhappy. Governments should use the tools at their disposal to keep their citizens safe, and this might sometimes require uncomfortable measures. This doesn’t mean they should be given direct access to millions of people’s private data, however. Intelligence efforts should be directed, not dragnet. Living in a liberal democracy requires a certain amount of discomfort, not pure order.

But it is hard to deny the evidence at hand: from one-time liberal darlings like Turkey to known autocratic regimes like China, any government will find it impossible to resist the temptation to take a peek at the data, one way or another.

Governments are made up of people, just like corporations are. The solutions to these problems won’t be easy; with so much already built, tearing it all down is neither an option nor even desirable. The industries built on data add value and employ thousands, if not millions. But we have to start somewhere, as individuals, technology companies, and legislators.

First, individuals need to be more cognizant of their decisions about their data. Some of it will require education, from a much younger age. But even today, for many, there are a lot of easy steps one can take.

For many uses, more private, less surveillance-oriented tools already exist. Instant messaging tools like WhatsApp (bought by Facebook for a whopping $19 billion) are easy to use while employing cutting-edge end-to-end encryption technology borrowed from Signal. One can wonder if essentially playing spies is worth the hassle, but the risks are real, and getting more so every day, even for members of Congress in the US.

For regular browsing, things are in worse shape. Practically every site on the internet tracks you across every other site; shopping and news sites are particularly bad. Users are fighting back with tools that are sometimes clunky and equally overzealous. Thanks to the aggressive adoption of ads, some intrusive and some outright malicious, ad blocking is on the rise around the world. It is hard to fault consumers; most would benefit from using an independently owned ad blocker like uBlock Origin, or a browser like Brave that has such technology built in. Apple recently updated its browser Safari on both macOS and iOS to “intelligently” curb cross-site tracking.

For things like email and cloud storage, things are trickier. For many users, their data is safer with a big company with a competent security team than with a smaller service provider. There’s a balance here; while big providers are much juicier targets (including for governments, who can request data legally), they also have the benefit of being hardened by such attacks. Companies like Google use their own services, further incentivizing them to safeguard data, at least from hackers.

However, even then, most people would benefit from tightening security beyond the default settings. For users of Gmail, Dropbox, and virtually any other cloud service, using two-factor authentication coupled with a password manager is a must.

And more broadly, going back to cognizance, individuals must be aware of the data they provide and be at least minimally informed. When you sign up for a new service, before sharing all your data with it, see if it at least has a way to delete or export that data. Even if you never use either of those options, they can be good signs that the company treats your data properly, instead of letting it seep into its machinery.

For creators of such technology, things are harder but there’s hope. The first step is obvious: companies should treat personally identifiable data as a liability and collect as little as possible, and only for a specific purpose. This is also the general philosophy behind the EU’s new General Data Protection Regulation (GDPR). Instead of collecting as much data as possible, hoping to find a good use for it later, companies should only collect data when they need it. And most importantly, they should delete the data when they are done with it, instead of hoarding it.

Moreover, companies should invest in technologies that do not need to collect data at all, such as using client-side computation instead of server-side. Apple is the prime example here; the company uses machine learning models that are generated on the server, on aggregate data, for things like image recognition or speech synthesis on the devices themselves. Perhaps as a sign of poetic justice, the intelligent cross-site tracking prevention Apple built into its browser is based on data collected in aggregate form, instead of in a personally identifiable fashion.

It is not clear if such technologies can keep up with a server-based solution, where iteration is much faster, but the investments might pay dividends. Today’s smartphones easily compete with the servers of just a few years ago in performance. Things will only get better.

And for times when mass collection of data is required, companies should invest in techniques that allow aggregate collection instead of collecting personally identifying data. There are huge benefits to collecting data from big populations, and the patterns that emerge from such data can benefit everyone. Again, Apple is a good example here, though Uber is also worth mentioning. Both companies aggressively use a technique called differential privacy, where private data is essentially scrambled enough that individuals are no longer identifiable while the overall patterns remain. This way, Uber analysts can view traffic patterns in a city, or even do precise analysis for a given time, without knowing any individual’s trips.
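To make the idea concrete, here is a minimal sketch of one textbook form of differential privacy, the Laplace mechanism applied to a counting query. It is not a description of how Apple or Uber actually implement it; the trip records, the function names (laplace_noise, private_count), and the epsilon value are all hypothetical, and the sketch assumes each rider contributes at most one record, so that adding or removing one person changes the count by at most 1.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records matching `predicate`.

    Assumes each individual appears in at most one record, so the count
    has sensitivity 1 and Laplace noise with scale 1/epsilon is enough.
    A smaller epsilon means more noise and stronger privacy.
    """
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: one (rider_id, pickup_zone, hour) record per rider.
trips = [(rider, "downtown" if rider % 3 == 0 else "airport", 8 + rider % 4)
         for rider in range(300)]

# The analyst only ever sees the noisy answer, never the raw trips.
noisy_downtown = private_count(trips, lambda t: t[1] == "downtown", epsilon=0.5)
```

Any single answer is off by a few trips, which is what hides an individual rider, while the aggregate pattern (roughly a third of pickups are downtown) survives. Real deployments also have to account for repeated queries, which this sketch ignores.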

And more generally, companies should invest in and actively work on technologies that reduce the reliance on individuals’ private data. As mentioned, a big ad industry will not go away overnight, but it can be transformed into something more responsible. Technologists are known for their innovative spirit, not defeatism.

End-to-end encryption is another promising technology. While popular for instant messaging, the technology is still in its infancy for things like cloud storage and email. There are challenges; the technology is notoriously hard to use, and recovery is problematic when someone forgets the secret that protects their encryption key, such as their password. Maybe most importantly, encryption makes the data entirely opaque to storage companies, severely limiting the value they can provide on top of it.

However, there are solutions, some already invented, some being worked on. WhatsApp showed that end-to-end encryption can be deployed at massive scale and made easy to use. Other companies, like Keybase, work on more user-friendly ways to do group chat, and possibly storage, while also working on a new paradigm for identity. And there are also more futuristic technologies like homomorphic encryption. Still in the research phase, this technology, if it works as expected, might make it possible to build cloud storage services where the core data stays private while still being searchable or indexable. Technology companies should direct more of their research and development resources to such areas, not just to better ways to collect and analyze data.

And lastly, legislators need to wake up to the issue before it is too late. The US government should enshrine the privacy of individuals as a right, instead of treating it as a commercial matter. Moreover, mass collection of personally identifiable data needs to be brought under supervision.

The current model, where an executive responsible for leaking 140M US consumers’ data can get away with a slap on the wrist and a $90M payday, does not work. Stronger punishment would help, but preventing such leaks at the source by limiting the size, fidelity, or longevity of the data would be better.

Moreover, legislators should work with the industry to better educate consumers about the risks. Companies will be unwilling to share details about what is possible with the data they have on their users (and unsuspecting visitors), but it is better for consumers to make informed decisions in the long run. Target made the headlines when it reportedly figured out a woman was pregnant before she could tell her parents. Customers should be made aware of such borderline creepy technology before they become subject to it, all the more so considering Target itself was also a victim of multiple major hacks. Facebook was recently the subject of a similar report, in which the company discovered a family member of a tech reporter (the same reporter who broke the Target story), and it remains unclear to everyone how. Individuals should not feel this powerless against corporations.

The current wave of negative press against Silicon Valley, caused mostly by the haphazard way social networks were used to amplify messages from subversive actors, is emotionally charged but is not wholly undeserved. Legislators can and should help technology companies earn back people’s trust, by allowing informed debate about their capabilities. A bigger public backlash, when it happens, would make today’s pessimism seem like a nice day in the park.

There are huge benefits to mass amounts of data. There is virtually no industry that wouldn’t benefit from having more data. Cities can make better traffic plans, medical researchers can study diseases and health trends, and governments can make better policy decisions. And it can be commercially beneficial too; with more data we can make better machine learning tools, from cars that can drive themselves to medical devices that can identify a disease early on. Even data that is collected for boring purposes can become useful; Google’s main revenue source is selling ads on top of its search results, a service no user would want to get rid of.

Data might be the new oil, but only with mindful, responsible management of it will the future look like Norway, rather than Venezuela or Iraq. In its essence, personally identifiable data in huge troves is a big liability. And the benefits we currently derive from such data are mostly used for things like better ad targeting. No one wants to go back to a time without Google or Facebook. But it is possible to be more responsible with the data. The onus is on everyone.