This is a post I wrote in January 2018 for an online magazine, but it never got published. I finally got the OK to publish it on my blog, in light of the current Facebook and Cambridge Analytica revelations. Previous posts on those are here and here.
It’s getting hard to suppress a sense of impending doom. With the latest Equifax hack, the question of data stewardship has been propelled into the mainstream again. There are valid calls to reprimand those responsible, and even to shut down the company altogether. After all, if a company whose business is safekeeping information can’t keep the information safe, what other option is there?
The increased attention to the topic is welcome, but the outrage misses the key point. The Equifax hack is unfortunate, but it is not a black swan event. It is merely the latest incarnation of a problem that will only get worse unless we do something.
The main issue is this: any mass collection of personally identifiable data is a liability. The individuals whose data is vacuumed up en masse, the companies that do the vacuuming, and the legislators should all be aware of the risks. It is fashionable to say “data is the new oil”, but the analogy falls apart when you consider the current situation of the oil-rich countries in the world. Silicon Valley itself is especially vulnerable here.
Large parts of the Silicon Valley industry are built on the mass collection of such private data and on deriving value from it, mostly through ad targeting. Public outcry does help, but the warning signs are largely ignored because the business models are extremely lucrative. Some companies, like Apple, try to raise the issue to a higher plane, but it’s unclear if it will make a difference.
Let’s first talk about how data gets exposed. Hacking, or unauthorized access, is the most talked about, but it’s far from the only way. Sometimes, it’s just a matter of a small mistake. Take Dropbox. The cloud storage company once allowed anyone to log into anyone else’s account by entirely ignoring the password check. The case was caught quickly, but it’s a stark reminder that small mistakes can happen.
The problems with companies owning such data get even murkier when you factor in that companies do go bankrupt, change business models, or just get sold to the highest bidder. What you thought was a good steward of your data might just become an online casino the next day.
And sometimes, the companies open themselves up to being hacked. Once an internet darling, Yahoo! was put in the spotlight when its own security team found a poorly designed hacking tool, installed by none other than the company itself. The tool was initially designed to track certain child pornography related emails for the government, and its discovery also played a part in the Yahoo CSO’s departure to Facebook. (Note: Since I wrote this in January of 2018, Alex Stamos is reportedly also leaving Facebook. This news came right after the Cambridge Analytica revelations, but stems from his clashes with Facebook management overall.)
Of course, government surveillance is a touchy subject, requiring one to make moral decisions that might differ from others’. But it is hard to deny the evidence at hand: from once liberal darlings like Turkey to known autocratic regimes like China, any government will find it impossible to resist the temptation to take a peek at the data, one way or another.
The solutions to these problems won’t be easy; with so much already built, tearing it all down is neither an option nor even desirable. The industries built on this data add value and employ thousands, if not millions. But we have to start somewhere, as individuals, companies, and legislators.
First, individuals need to be more cognizant of their decisions about their data. Some of it will require education, from a much younger age. But even today, for many, there are a lot of easy steps.
For many people, more private, less surveillance-oriented tools already exist. Instant messaging tools like WhatsApp (ironically owned by Facebook) are easy to use while relying on an end-to-end encryption technology borrowed from Signal.
Browsing the internet today without an ad blocker essentially means signing in to every single website and having your data sold to the highest bidder. Thanks to an overzealous adoption of ads, both intrusive and sometimes malicious, ad blocking is on the rise around the world. It is hard to fault consumers; most would benefit from using an independently owned ad blocker like uBlock Origin, or a browser like Brave that has such technology built in.
For things like email and cloud storage, things get murkier. Ironically, for many, their data is safer with a big corporation that has a competent security team than with a mom-and-pop store. There’s a balance here; while big providers are much juicier targets, they also have the benefit of being hardened by such attacks. However, even then, most people would benefit from increasing the security beyond the default settings. For users of Gmail, Dropbox, and virtually any other cloud storage technology, using two-factor authentication coupled with a password manager is a must.
And more broadly, going back to cognizance, individuals must be aware of the data they provide and be at least minimally informed. When you sign up for a new service, before sharing all your data with them, see if they at least have a way to delete it or export it. Even if you never use either of those options, they can be good signs that the company treats your data properly, instead of letting it seep into their machinery.
For creators of such technology, things are harder, but there’s hope. The first step is obvious: companies should treat personally identifiable data as a liability and collect as little as possible, and only for a specific purpose. This is also the general philosophy behind GDPR. Instead of collecting as much data as possible and hoping to find a good use for it, companies should only collect data when they need to.
Moreover, companies should invest in technologies that do not require collecting data at all, such as using client-side analysis instead of server-side analysis. Apple is the prime example here; the company uses machine learning models that are generated on the server, on aggregate data, for things like image recognition on the client. It is not clear if such technologies can keep up with a server-based solution where iteration is much faster, but the investments are paying dividends.
And for times when mass collection of data is required, companies should invest in techniques that allow aggregate collection instead of collecting personally identifying data. There are huge benefits to collecting data from big populations, and the patterns that emerge from such data can benefit everyone. Again, Apple is a good example here, though Uber is also worth noting. Both companies aggressively use a technique called differential privacy, where private data is essentially scrambled enough that individuals cannot be identified while the overall patterns remain. This way, Uber analysts can view traffic patterns in a city, or even do precise analysis for a given time, without knowing any individual’s trips.
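To make the idea a bit more concrete, here is a minimal sketch of one textbook form of differential privacy, the Laplace mechanism applied to a count. This is not how Apple or Uber actually implement it; the `private_count` helper, the made-up trip records, and the `epsilon` value are all assumptions for illustration only.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two independent exponential draws with the same
    # rate follows a Laplace distribution centered at zero with this scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    # Hypothetical helper: returns a noisy count instead of the exact one.
    true_count = sum(1 for r in records if predicate(r))
    # Assumption: each person contributes at most one record, so adding or
    # removing one person changes the count by at most 1 (sensitivity 1).
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Made-up data: one trip per rider, a third of them ending downtown.
rng = random.Random(0)
trips = [
    {"rider": i, "zone": "downtown" if i % 3 == 0 else "airport"}
    for i in range(3000)
]

noisy = private_count(trips, lambda t: t["zone"] == "downtown", epsilon=0.5, rng=rng)
print(f"noisy downtown trip count: {noisy:.1f}")
```

The noisy count stays close to the real one, so the citywide pattern survives, but the presence or absence of any single rider is hidden in the noise. A smaller `epsilon` means more noise and stronger privacy. Real deployments also have to handle people who contribute many records and to budget repeated queries, which this sketch ignores.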
And more generally, companies should invest in and actively work on technologies that reduce the reliance on individuals’ private data. As mentioned, a big ad industry will not go away overnight, but it can be transformed into something more responsible.
End-to-end encryption is one example. While popular for instant messaging, the technology is still in its infancy for things like cloud storage and email. There are many challenges: the technology is generally hard to use, and recovery is problematic when someone forgets their encryption key, such as their password. More importantly, encryption makes the data entirely opaque to companies, leaving them unable to provide value on top of it and largely commoditizing them. However, there are solutions, some already invented, some being worked on. Companies like WhatsApp showed that end-to-end encryption can be deployed at massive scale and made easy to use. Other companies like Keybase work on more user-friendly ways to do group chat, and possibly storage, while also working on a new paradigm for identity. And there are also more futuristic technologies like homomorphic encryption. It is still in the research phase, but if it works as expected, it might make it possible to build cloud storage services where the core data stays private while still being searchable and indexable.
And lastly, legislators need to wake up to the issue before it is too late. The US government should enshrine the privacy of individuals as a right, instead of treating it as a commercial matter. Moreover, mass collection of personally identifiable data needs to be brought under supervision.
The current model, where an executive responsible for leaking 140M US consumers’ data can get away with a slap on the wrist and a $90M payday, does not work. Stronger punishment would help, but preventing such leaks at the source by limiting the size, fidelity, or longevity of the data would be better.
Moreover, legislators should work with the industry to better educate consumers about the risks. Companies will be unwilling to share details about what is possible with the data they have on their users (and unsuspecting visitors), but it is better for consumers to make informed decisions in the long run. The current wave of negative press against Silicon Valley, caused mostly by the haphazard way social networks were used to amplify messages from subversive actors, is emotionally charged but not wholly undeserved. Technology companies can and should work to earn back people’s trust, and legislators should help by allowing informed debate about their capabilities.
None of this is to say that there are no benefits to mass amounts of data; the benefits are huge. There are virtually no industries that wouldn’t benefit from it. Cities can make better traffic plans, medical researchers can study diseases and health trends, and governments can make better policy decisions. And it can be commercially beneficial too: with more data we can make better machine learning tools, from cars that can drive themselves to medical devices that can identify a disease early on.
However, personally identifiable data in huge troves is a big liability. And the benefits we currently derive from such data mostly go to things like better ad targeting. No one wants to go back to a time without Google or Facebook. But it is possible to be more responsible with the data. The onus is on everyone.