Ten Questions on Facebook Data Product

By Amy Collins on June 23rd 2017

Where has my data blog series been? Well it’s been a busy time of year for us and it’s been really exciting announcing so much new data at “Now You Know 2017” (our user conference in Denver). I’ve had a few people poke me to ask where I went, so now that the dust has settled, I’m going to be talking about the exciting things our engineers are working on.

As before, the focus is on transparency, so the odd detail might change or I may simply get something wrong. Take what you read here as me thinking out loud at a given point in time rather than official marketing collateral.

A few months ago I did a blog on how we get Instagram data. Since then, I’ve had a pile of requests for something similar around Facebook…so here it is.

1. Do you have the Facebook Firehose?

OK, I lied… so this isn’t really a commonly asked question. While I do get this question a lot for Instagram, there seems to be a good degree of understanding in the industry that Facebook data is hard to get. However I thought I’d artificially ask it of myself as it’s useful context for those that don’t know.

As well as being an advertising business, Twitter has a significant part of their business built on selling data. They can do this without any significant ethical concerns because a key part of the power of Twitter is its openness. For the most part, conversations on Twitter are happening in the public domain. So for Twitter, it’s super easy: we have access to products like their Firehose, which gives unlimited access to their data.

However, for Facebook, their business is essentially built on advertising and so selling private conversations between consumers could be perceived as damaging end user trust and thus hurting the product for potential advertisers.

For this reason, they don’t sell their raw data and private conversations are private. If you are evaluating a social listening product and a vendor claims to have the Facebook Firehose, I would gently suggest you treat this (as well as any other promise they make) with skepticism.

Facebook have hardened their stance on this over the last couple of years. In the past they used to offer a search API that would allow us to ask for all public posts that related to specific search terms. This allowed us to query the public bit of Facebook with complex search terms, much the same as we can with Twitter. However, they closed off that API and getting Facebook data became harder.

2. Sounds tough… So what access do we have?

It’s actually not that bad – there’s still quite a lot of data we do have access to. We currently offer three main sorts of Facebook data coverage:

  1. Owned Facebook Pages – We can get posts and comments as well as metrics such as likes for pages that you own. The main use cases for this are campaign and performance metrics, community management and insights.
  2. Non-owned Facebook Pages – We can get posts and comments as well as metrics such as likes for pages that you don’t own. The use cases here are mostly benchmarking, competitor intelligence and fan page engagement.
  3. Incidental Facebook Coverage – this is a little known benefit but creating Facebook channels also adds that new data to your regular Brandwatch queries. Some people use this as a hack to boost their general Facebook coverage.

3. What constraints do we face gathering Facebook data?

Well, as above, there is no Firehose of all Facebook data that we can tap into. Also, as above, there is no search API, so we crawl the public APIs to retrieve posts and comments for Facebook pages, which is certainly trickier than just asking Facebook to send us all data on a particular topic.

However there are two main resources that we have to manage to be able to do this effectively. The first is kind of obvious, but based on the volumes of pages that our customers are expecting, it’s also non trivial: Compute capacity. We have to have enough servers to crawl every post and comment since the last time we crawled quickly enough to meet market demands. This is within the scope of our control. That is to say – if we need more compute, we can just add more; we can pay for more capacity.

The second constraint, however, is not directly within the scope of our control.

The public APIs require a token, which is essentially like an invitation to come into the party. No invite, no getting in the door. Fortunately, Facebook aren’t mean with their tokens.

Every user gets at least one. You get one just for being an active user.

It gives you a small number of data requests that you can make per hour. If your user account is the administrator of any pages, you also get extra tokens that give more data requests for each page. These page tokens tend to be bigger. So it’s basically a good news/bad news situation.

On the one hand, the bad news is that we can’t buy more tokens. On the other hand, if a user authenticates their Facebook account with our app, we get to generate a token and our coverage goes up by a small amount.   
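To put some numbers on that, here is a rough sketch of how a pool of tokens turns into coverage. Everything in it is an assumption for illustration: the Token class, the "user" and "page" labels, the requests-per-page figure and especially the hourly limits, which are made-up numbers rather than Facebook's real ones. It is not how our crawler is actually written.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """A hypothetical access token with an hourly request allowance."""
    owner: str
    kind: str          # "user" or "page" (illustrative labels only)
    hourly_limit: int  # invented figure, not a real Facebook limit


def hourly_budget(tokens):
    """Total requests per hour available across every token in the pool."""
    return sum(t.hourly_limit for t in tokens)


def pages_crawlable(tokens, requests_per_page, crawls_per_hour):
    """How many pages the pool can cover at a given crawl frequency."""
    cost_per_page = requests_per_page * crawls_per_hour
    return hourly_budget(tokens) // cost_per_page


# Two users have authenticated; one of them also administers a page.
pool = [
    Token("alice", "user", 200),
    Token("bob", "user", 200),
    Token("alice", "page", 1000),
]

print(hourly_budget(pool))           # 1400
print(pages_crawlable(pool, 10, 4))  # 35
```

With these invented numbers, one more user authenticating adds 200 requests an hour, which is five more pages at four crawls an hour. The real figures will differ, but the shape of the trade-off is the same.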

4. Is there anything I can do to improve our Facebook coverage?

We don’t currently require very much when creating a channel; you can just add as many as you want. However, as you’ll read later in this article, authenticating with multiple Facebook accounts has very significant benefits to your crawling. This is especially true if those accounts happen to have admin rights on the pages you want to crawl.

5. So can I just create fake Facebook accounts to boost my coverage?

Alas, No.

Facebook gives us a bump to our rate limit for each active Facebook account that authenticates. So anyone that creates facebook1@mycompany.com, facebook2@mycompany.com … facebookn@mycompany.com sadly won’t make any difference to their coverage. You have to authenticate with active Facebook users.

6. Tell me more about that Incidental Facebook Coverage thing…?

As you (now) know, we can’t do arbitrary searches across Facebook via an API, but there is a sort of hack to achieve a similar result.

Because channels data is stored in the same archive alongside all of our other social data, once it is in there it is available to be queried by regular Brandwatch Analytics queries.

What this means is that you can create channels in the app to pull in data from Facebook pages that are relevant to your industry, which can then be text matched to a more specific query in the app.

For example, if you’re a mobile phone manufacturer, you may want to catch mentions of your phones or competitors’ products. We can’t just query Facebook directly, but you can create channels for the major gadget and phone sites’ Facebook pages such as The Verge, Engadget, Android Central, etc. You then create a query for your products and any of the posts or comments from the channels that text match your query will be retrieved and displayed in your dashboards alongside the Twitter, News, Web, Instagram and Reddit data.
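If it helps to see the idea in code, here is a minimal sketch of text matching a query against a shared archive. It is a deliberately simplified stand-in and not our real query engine: the matches and run_query functions, the record fields and the "Acme Phone X" product are all hypothetical, and the matching is plain whole-phrase search.

```python
import re


def matches(query_terms, text):
    """True if any query term appears in the text as a whole word or phrase."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
        for term in query_terms
    )


def run_query(query_terms, archive):
    """Return every archived mention that text matches the query."""
    return [m for m in archive if matches(query_terms, m["text"])]


# Channel data sits in the same (hypothetical) archive as other sources.
archive = [
    {"source": "facebook", "channel": "Gadget Site A",
     "text": "Hands on with the new Acme Phone X"},
    {"source": "facebook", "channel": "Gadget Site B",
     "text": "Best laptops of the year"},
    {"source": "twitter", "channel": None,
     "text": "Loving my acme phone x so far"},
]

hits = run_query(["Acme Phone X"], archive)
for hit in hits:
    print(hit["source"], "-", hit["text"])
```

The point is that the query never goes to Facebook. It only sees Facebook posts that a channel has already pulled into the archive, which is why adding relevant channels widens what a query can find.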

7. How do we get data for these three different scenarios?

Here’s the thing…we actually currently treat these three scenarios the same. Essentially you authenticate with Facebook via our app and then create a channel, inputting the url of the page and that’s it.

Facebook Page Channel Creation

It’s worth noting that because we do it this way, we are able to give coverage of non-owned channels (scenarios 2 & 3, above), which is not true of some of our competitors. Some of them only support owned page crawling and offer incidental data from non-owned pages (collected because another customer retrieved it).

However, because we treat them all the same (basically as non-owned pages), which is essentially the lowest common denominator, we’re missing out on some of the advantages of focusing on their differences.

8. So what’s changing?

We’re going to be splitting out the experience for Owned Facebook pages from Non-owned Facebook pages.

  • For owned pages:
    Facebook Page Channel Creation for Owned Pages
    We’re going to be giving you the opportunity to authenticate with a Facebook account that is an administrator of that page.
    This gives us a special token (like a virtual key), which unlocks a higher API rate limit, meaning that we can offer lower latency and more data. This token can also give us access to other APIs and allow us to deliver new features for page owners. I can’t talk much about these potential new features today, but watch this space.
  • For non owned pages:
    Facebook Page Channel Creation for Non Owned Pages
    If you don’t own the page, then we obviously can’t expect you to give us the page admin token for that page. For this reason we’re going to continue to collect user tokens to provide the rate limit to support the crawling required. We’re going to start by asking for a single user token per non-owned channel that you want to crawl. To make this process as easy for you as possible, we plan to offer a simple email-based invite system.

Scaling: The final piece of the jigsaw is that once we have improved the token collection and allocation process, we’re going to be looking at massively growing our compute capacity. This might sound like irrelevant techie detail, but it is this that has enabled us to hugely increase our Instagram coverage over the last few months and we want to give Facebook the same treatment.

9. What if I don’t have the page admin account?

“But wait,” I hear you say. “I’m an agency and we’re working on behalf of the brand, but we don’t have admin access to their Facebook pages”. Good question; I’m glad you raised it.

We’re going to be including an email invite system to allow you to invite the social media manager or whoever has access whether they work in your organization or not. You can see some of the conceptual mockups above.

10. How do all these changes benefit us, the customer?

It’s worth noting that the whole purpose of the project is giving you the levers to pull to improve your Facebook coverage. If you don’t use this feature, you will still get some Facebook data; this is a mechanism to make it better. The benefits will be:

  1. Predictability: Part of this program is about agreeing SLAs with the engineering team. In the past, when users didn’t give us tokens or enough tokens for their requested channels, our crawlers could get rate limited. So then the conversation went “If I get you enough tokens, can we hold ourselves to an SLA?”. So that’s basically where we’re going. We’ll be publishing specifications about how often we retrieve the data and for those customers that have added their account credentials we’re going to reliably meet those targets.
  2. More data: The higher rate limits that we will gain through the new token management will allow us to get you more data.
  3. New features: Over time we will build new features and experiences around Owned and Non-Owned use cases. This may include accessing new APIs or specialized dashboards around the use cases for the different scenarios.

Bonus Question! 11. What about Facebook Topic/Insights data?

There was a data aggregator that had exclusive access to anonymized Facebook data.

This kind of data is a little different to ‘regular social data’ in that rather than querying for the raw data and rolling it up to the insights in our platform, the anonymized data works by passing a set of conditions into the API and receiving answers out.

So rather than saying “send me all conversation around mortgages” and then us using this to provide dashboards, you ask it “what kinds of people talk about mortgages?” and the system delivers anonymized demographics or topics back.
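A toy sketch of the difference, purely as illustration: the functions, field names and sample posts below are all invented, and this is not the third party's actual API. The first function hands back the matching posts themselves, while the second only ever returns counts per demographic bucket.

```python
from collections import Counter


def raw_query(posts, term):
    """Regular social data: return the matching posts themselves."""
    return [p for p in posts if term in p["text"].lower()]


def insights_query(posts, term, dimension):
    """Anonymized style: return only counts per bucket, never the posts."""
    matching = (p for p in posts if term in p["text"].lower())
    return dict(Counter(p[dimension] for p in matching))


# Hypothetical sample data with a made-up demographic field.
posts = [
    {"text": "Comparing mortgage rates", "age_band": "25-34"},
    {"text": "Our mortgage was approved", "age_band": "35-44"},
    {"text": "Mortgage advice needed", "age_band": "25-34"},
    {"text": "Holiday photos", "age_band": "18-24"},
]

print(len(raw_query(posts, "mortgage")))               # 3
print(insights_query(posts, "mortgage", "age_band"))   # {'25-34': 2, '35-44': 1}
```

With the first style we receive the text and can build whatever dashboard we like on top of it. With the second, the caller only ever sees the rolled-up answer.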

On the one hand this is super useful, because it’s answering questions from within Facebook’s closed pool of 1.9Bn users. On the other hand it’s a very different experience, because you can’t dive down into the actual conversation.

We trialled this service with a few of our customers, but the price point of the service made it challenging at the time. Facebook now plan to offer this service directly, and so we no longer offer the service through the third party anonymized API. We are in conversations with Facebook and hope to be one of the first to bring something to market when the API is widely available to our customers.

As always – if you have further questions or requests for a future topic – please leave a comment below.


Amy Collins

@amy_co106

Amy is VP of Product for Data at Brandwatch. Her responsibilities include developing the data strategy, data partnerships and acquisition, and owning the roadmap of new data products. She's a keen sailor and is mum to two little girls.

  • RoTRiMa

    Thank you very much for the information Amy, we really need this about Facebook :)

    I know how Instagram coverage improved over the last months but I can not understand how this (growing compute capacity) will work on Facebook which has no API for public posts ?

  • Excellent news Amy!

  • Amy Collins

    Hello there,

    Thanks for your question.

    Public posts made to a Facebook page are accessible through the API. The more of these pages we track, the more the compute burden increases; the higher volume the pages are (in terms of posts and comments), the more it increases again; and the frequency with which we check them pushes the compute requirements still further. So the project to elastically scale it will allow us to track many more, larger pages with lower latency.

    Cheers

    Amy

  • RoTRiMa

    Thank you very much Amy, now I understood :)

    Kind regards,

    Summani

  • Rebecca Dembach

    Hi Amy, thank you very much for this great article! Unfortunately, I’ve got another problem: I’ve encouraged my colleagues to authenticate via Facebook, and for most of them everything went fine. But as colleagues who are also admins of Facebook pages wanted to authenticate as well, they got an alert saying “Brandwatch wants to administer your pages”. Of course, they are very reluctant to click “ok” – which is very unfortunate as their authentication would get us better data. Can you tell me, what this means and how I can assure my colleagues that Brandwatch won’t administer their pages? Thanks in advance and kind regards from Germany – Rebecca

  • Rebecca Dembach

    Thank you very much, Amy! I think the metaphor fits very well – and it was helpful in providing a decision proposal for my colleagues.

    Furthermore, after reading another great article of yours (https://www.brandwatch.com/blog/amy-collins-data-5-exciting-announcements-facebook-data/) I think, I understand that you only require the public profile – as minimum permission – from people who help to increase our data for non owned channels.

    So – and please, correct me, if I’m wrong – what I missed was the explication that the privacy conditions under which admins in general authenticate with our Brandwatch client account vary from those of regular users. This is important, I think, especially for people who administrate multiple Facebook pages, as they can’t choose the specific pages they want to authenticate for. Do you think, it would be possible to add the option to choose in further development?

  • Amy Collins

    Yeah – good questions…

    It’s technically possible that we could build some kind of menu where, after you’ve granted us permission, we ask Facebook for a list of pages that the user is admin of, then give you the choice of which ones we save tokens for, but I think this is not quite what you’re asking… I think you’re asking whether they can grant permissions to our app just for selected pages. When a user connects an app to Facebook (whether it be Candy Crush, Ebay or Brandwatch) you have to give that app permission to talk to your account… There are different types of permissions, but these tend to be quite coarse grained rather than super granular. Even to have the option to ask for page a, or b, or c, it’s an account wide permission of ‘app can do stuff with all my pages’. This then allows us to save the tokens for your pages.

    As I started off by saying, in theory, we could give you the choice of which ones we saved, but this would just be an illusion of free choice, because there’d be nothing technically to stop us from storing what we wanted or just getting it all later… Again – To borrow from the (now stretched) metaphor… You gave the cleaner a key to your house and said – “You can come and clean, but don’t look in the second bedroom (because there’s ummm totally nothing at all in there for you to see… just don’t go in there OK? OK? It’s just an ordinary spare bedroom that doesn’t need cleaning)”. Ultimately… you may trust the cleaner not to do that… But essentially, there’s nothing to stop them from doing so once you’ve given them the key to your house. It all comes down to the question: ‘Do I trust this person with a key to my house?’

    All I can really say to that last question is what I said in my previous message – we’re business partners that you contracted to provide you services.

    We don’t have any plans to build the “which page tokens should we store” menu. Mostly because:
    1. It’s a feature that gives you the illusion of security rather than actual security
    2. It’s my view that we should spend our most precious resource (our engineers’ time), building cool analytics features, new visualizations or plumbing in new data sources – stuff that gives you more insights.

    However, one thing we are doing is investigating some of the lower combinations of permissions that we can request to see if we can still make the process work with less access. We may come to the conclusion that we can’t, but we are looking into it. I’ll let you know in an upcoming blog if we manage to do this.

    Have a lovely weekend

    Amy

  • Rebecca Dembach

    Hi Amy – and thank you very much for your detailed and very comprehensible answer!