Fighting Spam at Facebook

A couple of days ago, I wrote about how “Fake News” on Facebook is a spam problem caused, or at least exacerbated, by the economics of attention. Since there’s a limited amount of attention people can give in a day, and Facebook controls so much of it, if you can reverse engineer the mechanics of the News Feed, you can fan out your message, or “boost” it in Facebook parlance, to millions of people at a minuscule cost.

On this blog, I use a combination of my experience as a software engineer, what is reported in the press, and some light rumor trading to explore ideas. But it is hard not to come off as navel-gazing. No one writes about spam at Facebook when it’s not a problem. And these systems are complex, involving hundreds of people working on them over many years, each with their own compromises. The inner workings aren’t entirely hidden (though they are hidden more than they should be), but they aren’t easily accessible to an outsider either.

Luckily, I had some help. Melanie Ensign is a former co-worker of mine at Uber. And more importantly, she used to work at Facebook with the teams fighting spam. Melanie has been at Uber for almost a year now, but I think her experience from Facebook (where a lot of Uber’s security team hails from, including the CSO) is well worth exploring.

Through a few tweets (embedded at the bottom of this post), she told me how Facebook used to combat spam, and why certain approaches worked better than others. It’s worth noting that her comments were about Facebook posts, not about ads (a bit more on those in a bit).

Ensign says that Facebook fights spam primarily by targeting not the content of the posts, but the accounts that post them. She says “systems were trained [to] detect spam based on behavior of accounts spreading malware. It’s never really been about *content* until now”. She adds that monitoring content is tricky for several reasons.

The first is obvious: with more than 2 billion monthly active users and billions of pieces of content posted on the site every day, it’s a big, unwieldy undertaking to even start monitoring that much content. Facebook is an engineering force to behold, but scaling an operation like that, building systems to analyze that much unstructured data, and doing it effectively in real time, is not a simple task.

The second reason is the obvious risk of censorship. Facebook admittedly wanted to keep a neutral position on the content posted on its site (save for legal requirements). False negatives are bad; you let in “spam”. But a false positive is akin to censorship. This might be less controversial now that Facebook works with fact-checkers to annotate content. But the Facebook promise was always one of extreme, sometimes admittedly labored, editorial impartiality.

The third is harder to appreciate, but one I can understand. When you build systems that recognize spammy content, you inherently give away your secret; the filter becomes much easier to work around. Ensign points to the Ray-Ban spam that was going around on Facebook a couple of years ago. She says that since the content itself carried the proper bona fides, the team fighting spam instead relied on the characteristics of the accounts posting it. Facebook engineers who presented at the Spam Fighting @Scale conference shared similar insights; “Fake accounts are a common vector of abuse across multiple platforms.” and “It is possible to fight spam effectively without having access to content, making it possible to support end-to-end encrypted platforms and still combat abuse” are two that are worth mentioning.
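To see why content rules leak their own logic, consider a toy keyword filter. The blocked phrases and code below are purely my own illustration, not anything Facebook shipped; the moment a spammer figures out (or infers) the rule, a trivial rewrite slips past it:

```python
# A toy content filter: hypothetical blocklist, purely for illustration.
BLOCKED_PHRASES = ["ray ban sale", "cheap sunglasses"]

def is_spam_content(post: str) -> bool:
    """Flag a post if it contains a known spam phrase."""
    text = post.lower()
    return any(phrase in text for phrase in BLOCKED_PHRASES)

print(is_spam_content("Huge Ray Ban sale, 90% off!"))        # True: caught
print(is_spam_content("Huge R.a.y B.a.n s.a.l.e, 90% off"))  # False: trivially evaded
```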
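Account behavior is much harder to disguise. Going back to Ensign’s point, here is a minimal sketch of what behavior-based scoring might look like; the signals, weights, and thresholds are my own illustrative guesses, not Facebook’s actual features. Notice that nothing here reads the content of a post, which is also how, per the @Scale talk, abuse can be fought even on end-to-end encrypted platforms:

```python
from dataclasses import dataclass

@dataclass
class AccountActivity:
    """Illustrative behavioral signals; invented for this sketch."""
    account_age_days: int
    posts_last_hour: int
    distinct_links_last_day: int
    friend_request_accept_rate: float  # fraction of requests accepted, 0.0 to 1.0

def spam_score(a: AccountActivity) -> float:
    """Toy heuristic: higher means more likely a spam account.

    A production system would be a learned model over many more
    signals; the point is only that every input here describes the
    account's behavior, not the content it posts.
    """
    score = 0.0
    if a.account_age_days < 7:
        score += 0.3  # brand-new accounts carry more risk
    if a.posts_last_hour > 20:
        score += 0.4  # burst posting is a classic spam pattern
    if a.distinct_links_last_day > 10:
        score += 0.2  # fanning a link out broadly and quickly
    if a.friend_request_accept_rate < 0.1:
        score += 0.1  # recipients mostly rejecting its requests
    return min(score, 1.0)

# A day-old account blasting out links scores high, whatever it posts.
print(spam_score(AccountActivity(1, 50, 30, 0.05)))  # 1.0
```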

Running a user generated content site is hard. When I first started at Digg, the thing that shocked me most was how much of Digg’s engineering went into keeping the site even remotely clean. We had tools that worked to recognize botnets, stolen accounts, and everything in between. Every submission was automatically evaluated for many characteristics, such as “spamminess”, adult content, and a few more. There were tools to block certain content, but only in certain countries. Brigades, as they were called, would form on Yahoo! Groups to kick stuff off the site, or promote it. As we plugged one hole, some social media consultant would find a new way to use Digg’s various tools to send traffic to his or her ad-infested site.
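In spirit, the screening looked something like the sketch below; the check names, thresholds, and geo-blocklist are reconstructions from memory for illustration, not Digg’s actual code:

```python
# Illustrative reconstruction of a submission-screening pipeline.
# Every check is a stub standing in for a real classifier.

def spamminess(sub: dict) -> float:
    # stand-in for a model scoring the link, title, and submitter history
    return 0.9 if "free-stuff" in sub["url"] else 0.1

def adult_content(sub: dict) -> float:
    return 0.0  # stand-in for an image/text classifier

CHECKS = [spamminess, adult_content]

# Hypothetical per-country blocks: visible elsewhere, hidden in these.
GEO_BLOCKS = {"example.com/banned-topic": {"DE", "FR"}}

def evaluate(sub: dict, viewer_country: str) -> str:
    if any(check(sub) > 0.8 for check in CHECKS):
        return "quarantined"  # held for human review
    if viewer_country in GEO_BLOCKS.get(sub["url"], set()):
        return "hidden"  # blocked only in certain countries
    return "visible"

print(evaluate({"url": "free-stuff.biz/deals"}, "US"))      # quarantined
print(evaluate({"url": "example.com/banned-topic"}, "DE"))  # hidden
print(evaluate({"url": "example.com/banned-topic"}, "US"))  # visible
```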

And there was also the scaling. Digg was a lean and mean engineering organization compared to Facebook, but still, we always struggled with scaling challenges, and so did everyone else. The Fail Whale might be all gone now, but running a site that’s under attack 24/7, with features being added left and right causing unforeseen performance issues, and scaling an organization to support more than a few million users is an exercise that few can appreciate. Keeping a site that complicated up and running at that scale is a challenge. Doing that while keeping the site fast is a whole other beast. Former users of Friendster or Orkut might feel the same way; performance issues were a big part of what drove users away from those sites.

I stand by my initial assessment of the problem. Facebook built a massive attention pool, sliced and diced it, packaged it nicely, and is now making bank selling it to the highest bidder. The problems it faces, from spammy content to fake news, are inherent to its medium of exchange: attention. Sketchy characters flock to frothy marketplaces, like bees to honey. What makes or breaks a marketplace is being the trustworthy intermediary between the buyer and the seller. By being so large, and so influential, Facebook owns this problem.

And to be clear, this is not a dig (or Digg?) at the company; I rely on Facebook to keep in touch with friends scattered around the world. My WhatsApp groups, like those of many others, are my support system. As the world’s address book, it is where any business, activist, or community organizer finds customers, supporters, or members. And of course, while I have no significant others working at the company, nor any financial exposure to it, I do have close friends who are former or current employees.

My main qualm with social networks has always been the commercialization of individual and collective attention spans. As we spend more of our waking hours plugged in, and move more and more of our political discourse, both in the United States and around the world, to these walled gardens with rules written by a few people living in California, we risk losing more than just the integrity of a single election.