Published June 23rd 2017
Where has my data blog series been? Well it’s been a busy time of year for us and it’s been really exciting announcing so much new data at “Now You Know 2017” (our user conference in Denver). I’ve had a few people poke me to ask where I went, so now that the dust has settled, I’m going to be talking about the exciting things our engineers are working on.
As before, the focus is on transparency, so the odd detail might change or I may simply get something wrong. Take what you read here as me thinking out loud at a given point in time rather than official marketing collateral.
A few months ago I did a blog on how we get Instagram data. Since then, I’ve had a pile of requests for something similar around Facebook…so here it is.
OK, I lied… this isn’t really a commonly asked question. While I do get this question a lot for Instagram, there seems to be a good degree of understanding in the industry that Facebook data is hard to get. However, I thought I’d artificially ask it of myself as it’s useful context for those who don’t know.
As well as being an advertising business, Twitter has a significant part of their business built on selling data. They can do this without any significant ethical concerns because a key part of the power of Twitter is its openness. For the most part, conversations on Twitter are happening in the public domain. So for Twitter, it’s super easy: we have access to products like their Firehose, which gives unlimited access to their data.
However, for Facebook, their business is essentially built on advertising and so selling private conversations between consumers could be perceived as damaging end user trust and thus hurting the product for potential advertisers.
For this reason, they don’t sell their raw data and private conversations are private. If you are evaluating a social listening product and a vendor claims to have the Facebook Firehose, I would gently suggest you treat this (as well as any other promise they make) with skepticism.
Facebook have hardened their stance on this over the last couple of years. They used to offer a search API that would allow us to ask for all public posts that related to specific search terms. This allowed us to query the public bit of Facebook with complex search terms, much the same as we can with Twitter. However, they closed off that API and getting Facebook data became harder.
It’s actually not that bad – there’s still quite a lot of data we do have access to. We currently offer three main sorts of Facebook data coverage:
Well, as above, there is no Firehose of all Facebook data that we can tap into. Also, as above, there is no search API, so we crawl the public APIs to retrieve posts and comments for Facebook pages, which is certainly trickier than just asking Facebook to send us all data on a particular topic.
However, there are two main resources that we have to manage to be able to do this effectively. The first is kind of obvious, but given the volume of pages that our customers expect, it’s also non-trivial: compute capacity. We need enough servers to crawl every post and comment made since our last crawl, and to do so quickly enough to meet market demands. This is within the scope of our control. That is to say, if we need more compute, we can just pay for more capacity.
The second constraint, however, is not directly within the scope of our control.
The public APIs require a token, which is essentially like an invitation to come into the party. No invite, no getting in the door. Fortunately, Facebook aren’t mean with their tokens.
Every user gets at least one. You get one just for being an active user.
Each token gives you a small number of data requests that you can make per hour. If your user account is the administrator of any pages, you also get extra tokens that give more data requests for each page. These page tokens tend to be bigger. So it’s basically a good news/bad news situation.
On the one hand, the bad news is that we can’t buy more tokens. On the other hand, if a user authenticates their Facebook account with our app, we get to generate a token and our coverage goes up by a small amount.
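To make the token idea a bit more concrete, here is a minimal sketch of how a crawler might share its requests across a pool of tokens. It is an illustration only, not our actual implementation: the `Token` and `TokenPool` classes, the owner names and the hourly limits are all made up for the example, and the real limits are set by Facebook.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """One access token with a hypothetical hourly request budget."""
    owner: str
    hourly_limit: int  # assumed number; the real limit is decided by Facebook
    used: int = 0

    @property
    def remaining(self) -> int:
        return self.hourly_limit - self.used


class TokenPool:
    """Spends each request on the token with the most budget left this hour."""

    def __init__(self) -> None:
        self.tokens: List[Token] = []

    def add(self, token: Token) -> None:
        # Called whenever another user authenticates with the app.
        self.tokens.append(token)

    def capacity(self) -> int:
        # Total requests the pool can still make in the current hour.
        return sum(t.remaining for t in self.tokens)

    def acquire(self) -> Optional[Token]:
        # Returns None when every token is exhausted (or the pool is empty),
        # in which case the crawler has to wait for the next hour.
        best = max(self.tokens, key=lambda t: t.remaining, default=None)
        if best is None or best.remaining <= 0:
            return None
        best.used += 1
        return best


pool = TokenPool()
pool.add(Token(owner="user-a", hourly_limit=2))        # ordinary user token
pool.add(Token(owner="page-admin-b", hourly_limit=5))  # larger page token
print(pool.capacity())  # 7
```

The point of the sketch is that capacity only grows when another token is added to the pool, which is why each extra authenticated account matters.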
We don’t currently require very much when creating a channel; you can just add as many as you want. However, as you’ll read later in this article, authenticating with multiple Facebook accounts has very significant benefits for your crawling. This is especially true if those accounts happen to have admin rights on the pages you want to crawl.
Facebook gives us a bump to our rate limit for each active Facebook account that authenticates. So anyone who creates firstname.lastname@example.org, email@example.com… firstname.lastname@example.org sadly won’t make any difference to their coverage. You have to authenticate with active Facebook users.
As you (now) know, we can’t do arbitrary searches across Facebook via an API, but there is a sort of hack to achieve a similar result.
Because channels data is stored in the same archive alongside all of our other social data, once it is in there it is available to be queried by regular Brandwatch Analytics queries.
What this means is that you can create channels in the app to pull in data from Facebook pages that are relevant to your industry, which can then be text matched to a more specific query in the app.
For example, if you’re a mobile phone manufacturer, you may want to catch mentions of your phones or competitors’ products. We can’t just query Facebook directly, but you can create channels for the major gadget and phone sites’ Facebook pages such as The Verge, Engadget, Android Central, etc. You then create a query for your products, and any of the posts or comments from the channels that text match your query will be retrieved and displayed in your dashboards alongside the Twitter, News, Web, Instagram and Reddit data.
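The channel-plus-query approach can be sketched roughly as follows. This is a toy illustration rather than how Brandwatch Analytics queries actually work: the `Mention` class, the `matches` and `run_query` helpers, the product name and the sample posts are all invented, and a simple whole-word term match stands in for the real query syntax.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Mention:
    channel: str  # Facebook page the post or comment was crawled from
    text: str


def matches(text: str, terms: List[str]) -> bool:
    """True if any term appears in the text as a whole word or phrase (case-insensitive)."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)
        for term in terms
    )


def run_query(archive: List[Mention], terms: List[str]) -> List[Mention]:
    """Filter already-crawled channel data; nothing is requested from Facebook here."""
    return [m for m in archive if matches(m.text, terms)]


# Posts previously pulled in via channels (made-up sample data).
archive = [
    Mention("The Verge", "The new Acme Phone X review is in"),
    Mention("Engadget", "Loving my acme phone so far"),
    Mention("Android Central", "Tablet deals this week"),
]

for mention in run_query(archive, ["Acme Phone"]):
    print(mention.channel, "->", mention.text)
```

The search happens entirely over data that channels have already brought into the archive, so coverage depends on picking the right pages to crawl in the first place.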
Here’s the thing… we actually currently treat these three scenarios the same. Essentially, you authenticate with Facebook via our app and then create a channel by inputting the URL of the page, and that’s it.
It’s worth noting that because we do it this way, we are able to give coverage of non-owned channels (scenarios 2 & 3, above), which is not true of some of our competitors. Some of them only support owned page crawling and offer incidental data from non-owned pages (collected because another customer retrieved it).
However, because we treat them all the same (basically as non-owned pages), which is essentially the lowest common denominator, we’re missing out on some of the advantages of focusing on their differences.
We’re going to be splitting out the experience for Owned Facebook pages from Non-owned Facebook pages.
Scaling: The final piece of the jigsaw is that once we have improved the token collection and allocation process, we’re going to be looking at massively growing our compute capacity. This might sound like irrelevant techie detail, but it is this that has enabled us to hugely increase our Instagram coverage over the last few months and we want to give Facebook the same treatment.
“But wait,” I hear you say. “I’m an agency and we’re working on behalf of the brand, but we don’t have admin access to their Facebook pages”. Good question; I’m glad you raised it.
We’re going to be including an email invite system to allow you to invite the social media manager or whoever has access whether they work in your organization or not. You can see some of the conceptual mockups above.
It’s worth noting that the whole purpose of the project is giving you the levers to pull to improve your Facebook coverage. If you don’t use this feature, you will still get some Facebook data; this is a mechanism to make it better. The benefits will be:
There was a data aggregator that had exclusive access to anonymized Facebook data.
This kind of data is a little different to ‘regular social data’ in that rather than querying for the raw data and rolling it up to the insights in our platform, the anonymized data works by passing a set of conditions into the API and receiving answers out.
So rather than saying “send me all conversation around mortgages” and then us using this to provide dashboards, you ask it “what kinds of people talk about mortgages?” and the system delivers anonymized demographics or topics back.
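The difference between the two styles of question can be sketched like this. It is purely illustrative and does not reflect the real anonymized API: the `Post` class, the `age_group` field, both function names and the sample data are assumptions made up for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Post:
    text: str
    age_group: str  # hypothetical demographic attribute


def raw_query(posts: List[Post], topic: str) -> List[str]:
    """'Regular' social data: hand back the matching conversation itself."""
    return [p.text for p in posts if topic.lower() in p.text.lower()]


def aggregate_query(posts: List[Post], topic: str) -> Counter:
    """Anonymized style: hand back only counts per demographic, never the text."""
    return Counter(p.age_group for p in posts if topic.lower() in p.text.lower())


posts = [
    Post("Comparing mortgage rates today", "25-34"),
    Post("Our mortgage was finally approved", "25-34"),
    Post("Thinking about paying off the mortgage early", "35-44"),
    Post("Great weather this weekend", "18-24"),
]

print(aggregate_query(posts, "mortgage"))
```

With the first function you can read the posts; with the second you only ever see the rolled-up answer, which is the trade-off described above.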
On the one hand, this is super useful, because it’s answering questions from within Facebook’s closed pool of 1.9Bn users. On the other hand, it’s a very different experience, because you can’t dive down into the actual conversation.
We trialled this service with a few of our customers, but the price point of the service made it challenging at the time. Facebook now plan to offer this service directly, and so we no longer offer the service through the third party anonymized API. We are in conversations with Facebook and hope to be one of the first to bring something to market when the API is widely available to our customers.
As always, if you have further questions or requests for a future topic, please leave a comment below.