
Ten Questions on Facebook Data Product

By Amy Collins on June 23rd 2017

Where has my data blog series been? Well, it's been a busy time of year for us, and it's been really exciting announcing so much new data at "Now You Know 2017" (our user conference in Denver). I've had a few people poke me to ask where I went, so now that the dust has settled, I'm going to be talking about the exciting things our engineers are working on.

As before, the focus is on transparency, so the odd detail might change or I may simply get something wrong. Take what you read here as me thinking out loud at a given point in time rather than official marketing collateral.

A few months ago I did a blog on how we get Instagram data. Since then, I’ve had a pile of requests for something similar around Facebook…so here it is.

1. Do you have the Facebook Firehose?

OK, I lied… so this isn't really a commonly asked question. While I do get this question a lot for Instagram, there seems to be a good degree of understanding in the industry that Facebook data is hard to get. However, I thought I'd artificially ask it of myself as it's useful context for those who don't know.

As well as being an advertising business, Twitter has a significant part of its business built on selling data. They can do this without any significant ethical concerns because a key part of the power of Twitter is its openness. For the most part, conversations on Twitter happen in the public domain. So for Twitter it's super easy: we have access to products like their Firehose, which gives unlimited access to their data.

However, for Facebook, their business is essentially built on advertising and so selling private conversations between consumers could be perceived as damaging end user trust and thus hurting the product for potential advertisers.

For this reason, they don't sell their raw data, and private conversations stay private. If you are evaluating a social listening product and a vendor claims to have the Facebook Firehose, I would gently suggest you treat this (as well as any other promises they make) with skepticism.

Facebook have hardened their stance on this over the last couple of years. In the past they used to offer a search API that would allow us to ask for all public posts that related to specific search terms. This allowed us to query the public bit of Facebook with complex search terms, much the same as we can with Twitter. However, they closed off that API and getting Facebook data became harder.

2. Sounds tough… So what access do we have?

It’s actually not that bad – there’s still quite a lot of data we do have access to. We currently offer three main sorts of Facebook data coverage:

  1. Owned Facebook Pages – We can get posts and comments as well as metrics such as likes for pages that you own. The main use cases for this are campaign and performance metrics, community management and insights.
  2. Non-owned Facebook Pages – We can get posts and comments as well as metrics such as likes for pages that you don't own. The use cases here are mostly benchmarking, competitor intelligence and fan page engagement.
  3. Incidental Facebook Coverage – this is a little known benefit but creating Facebook channels also adds that new data to your regular Brandwatch queries. Some people use this as a hack to boost their general Facebook coverage.

3. What constraints do we face gathering Facebook data?

Well, as above, there is no Firehose of all Facebook data that we can tap into. Also, as above, there is no search API, so we crawl the public APIs to retrieve posts and comments for Facebook pages, which is certainly trickier than just asking Facebook to send us all data on a particular topic.
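To make the crawling idea concrete, here's a minimal sketch of a per-page crawler. Everything here is illustrative: the page ID, token, API version and field list are placeholders, and the pagination shape follows the Graph API's cursor convention rather than any actual Brandwatch code.

```python
import urllib.parse

GRAPH_BASE = "https://graph.facebook.com/v2.9"  # a Graph API version current in 2017

def page_feed_url(page_id, access_token, after=None):
    """Build a request URL for a page's public feed (posts plus comments)."""
    params = {
        "access_token": access_token,
        "fields": "message,created_time,comments{message,created_time}",
        "limit": 100,
    }
    if after:
        params["after"] = after  # cursor returned by the previous request
    return GRAPH_BASE + "/" + page_id + "/feed?" + urllib.parse.urlencode(params)

def crawl_page(page_id, access_token, fetch):
    """Follow the feed's cursor pagination until no 'next' link remains.

    `fetch` (URL -> decoded JSON dict) is injected so the paging logic can
    be exercised without network access.
    """
    posts, after = [], None
    while True:
        data = fetch(page_feed_url(page_id, access_token, after=after))
        posts.extend(data.get("data", []))
        paging = data.get("paging", {})
        if not paging.get("next"):
            return posts
        after = paging.get("cursors", {}).get("after")
```

Note that every page needs its own sweep like this, repeated on a schedule, which is where the compute and token constraints below come from.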

However, there are two main resources that we have to manage to be able to do this effectively. The first is kind of obvious, but based on the volumes of pages that our customers are expecting, it's also non-trivial: compute capacity. We have to have enough servers to crawl every post and comment since the last time we crawled, quickly enough to meet market demands. This is within the scope of our control. That is to say, if we need more compute, we can just add more; we can pay for more capacity.

The second constraint, however, is not directly within the scope of our control.

The public APIs require a token, which is essentially like an invitation to come into the party. No invite, no getting in the door. Fortunately, Facebook aren’t mean with their tokens.

Every user gets at least one. You get one just for being an active user.

It gives you a small number of data requests that you can make per hour. If your user account is the administrator of any pages, you also get extra tokens that allow more data requests for each page. These page tokens tend to be bigger. So it's basically a good news/bad news situation.

On the one hand, the bad news is that we can’t buy more tokens. On the other hand, if a user authenticates their Facebook account with our app, we get to generate a token and our coverage goes up by a small amount.   
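As an entirely hypothetical sketch of what managing that token budget looks like: each token carries an hourly request allowance, and a crawler wants to spread work so that no single token hits its rate limit early. Something like:

```python
import heapq

class TokenPool:
    """Toy allocator for crawl requests across authenticated tokens.

    Always draws from the token with the most remaining hourly budget,
    so the pool drains evenly instead of exhausting one token first.
    """
    def __init__(self):
        self._heap = []  # max-heap simulated by negating the remaining budget

    def add_token(self, token, hourly_budget):
        heapq.heappush(self._heap, (-hourly_budget, token))

    def acquire(self):
        """Return a token with budget left, or None if all are exhausted."""
        if not self._heap:
            return None
        neg_remaining, token = heapq.heappop(self._heap)
        if neg_remaining == 0:
            heapq.heappush(self._heap, (neg_remaining, token))
            return None
        heapq.heappush(self._heap, (neg_remaining + 1, token))  # spend one request
        return token
```

Each authenticated user adds a token to the pool, and an admin adds a page token with a bigger budget, which is why authentication matters so much in the questions that follow.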

4. Is there anything I can do to improve our Facebook coverage?

We don't currently require very much when creating a channel; you can just add as many as you want. However, as you'll read later in this article, authenticating with multiple Facebook accounts has very significant benefits for your crawling. This is especially true if those accounts happen to have admin rights on the pages you want to crawl.

5. So can I just create fake Facebook accounts to boost my coverage?

Alas, No.

Facebook gives us a bump to our rate limit for each active Facebook account that authenticates. So anyone that creates facebook1@mycompany.com, facebook2@mycompany.com … facebookn@mycompany.com sadly won't make any difference to their coverage. You have to authenticate with active Facebook users.

6. Tell me more about that Incidental Facebook Coverage thing…?

As you (now) know, we can’t do arbitrary searches across Facebook via an API, but there is a sort of hack to achieve a similar result.

Because channel data is stored in the same archive alongside all of our other social data, once it is in there it is available to be queried by regular Brandwatch Analytics queries.

What this means is that you can create channels in the app to pull in data from Facebook pages that are relevant to your industry, which can then be text matched to a more specific query in the app.

For example, if you're a mobile phone manufacturer, you may want to catch mentions of your phones or competitors' products. We can't just query Facebook directly, but you can create channels for the major gadget and phone sites' Facebook pages such as The Verge, Engadget, Android Central, etc. You then create a query for your products, and any of the posts or comments from the channels that text match your query will be retrieved and displayed in your dashboards alongside the Twitter, News, Web, Instagram and Reddit data.
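A toy version of that text-matching step might look like the following. The posts and the "AcmePhone" query term are invented for illustration; real Brandwatch queries support full boolean syntax, not just keyword matching.

```python
import re

def matches_query(text, terms):
    """Crude text match: does the post mention any of the query terms?

    Sketches only the idea of filtering channel data by keyword,
    using case-insensitive whole-word matching.
    """
    return any(re.search(r"\b" + re.escape(t) + r"\b", text, re.IGNORECASE)
               for t in terms)

channel_posts = [  # hypothetical posts pulled in from gadget sites' pages
    "The Verge reviews the new AcmePhone X",
    "Ten tips for better travel photos",
]
query_terms = ["acmephone"]
hits = [p for p in channel_posts if matches_query(p, query_terms)]
```

Only the first post survives the filter, which is the "incidental coverage" effect: the channel pulls in everything from the page, and the query picks out what's relevant to you.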

7. How do we get data for these three different scenarios?

Here's the thing… we actually currently treat these three scenarios the same. Essentially, you authenticate with Facebook via our app, create a channel, input the URL of the page, and that's it.

Facebook Page Channel Creation

It's worth noting that because we do it this way, we are able to give coverage of non-owned channels (scenarios 2 & 3, above), which is not true of some of our competitors. Some of them only support owned page crawling and offer incidental data for non-owned pages (collected because another customer retrieved it).

However, because we treat them all the same (basically as non-owned pages), which is essentially the lowest common denominator, we're missing out on some of the advantages of focusing on their differences.

8. So what’s changing?

We’re going to be splitting out the experience for Owned Facebook pages from Non-owned Facebook pages.

  • For owned pages:
    Facebook Page Channel Creation for Owned Pages
    We're going to be giving you the opportunity to authenticate with a Facebook account that is an administrator of that page.
    This gives us a special token (like a virtual key), which unlocks a higher API rate limit, meaning that we can offer lower latency and more data. This token can also give us access to other APIs and allow us to deliver new features for page owners. I can't talk much about these potential new features today, but watch this space.
  • For non owned pages:
    Facebook Page Channel Creation for Non Owned Pages
    If you don't own the page, then we obviously can't expect you to give us the page admin token for that page. For this reason, we're going to continue to collect user tokens to provide the rate limit to support the crawling required. We're going to start by asking for a single user token per non-owned channel that you want to crawl. To make this process as easy for you as possible, we plan a super simple, email-based invite system.

Scaling: The final piece of the jigsaw is that once we have improved the token collection and allocation process, we’re going to be looking at massively growing our compute capacity. This might sound like irrelevant techie detail, but it is this that has enabled us to hugely increase our Instagram coverage over the last few months and we want to give Facebook the same treatment.

9. What if I don’t have the page admin account?

“But wait,” I hear you say. “I’m an agency and we’re working on behalf of the brand, but we don’t have admin access to their Facebook pages”. Good question; I’m glad you raised it.

We're going to be including an email invite system to allow you to invite the social media manager, or whoever has access, whether they work in your organization or not. You can see some of the conceptual mockups above.

10. How do all these changes benefit us, the customer?

It's worth noting that the whole purpose of the project is giving you the levers to pull to improve your Facebook coverage. If you don't use this feature, you will still get some Facebook data; this is a mechanism to make it better. The benefits will be:

  1. Predictability: Part of this program is about agreeing SLAs with the engineering team. In the past, when users didn't give us tokens, or enough tokens for their requested channels, our crawlers could get rate limited. So the conversation became: "If I get you enough tokens, can we hold ourselves to an SLA?" That's basically where we're going. We'll be publishing specifications for how often we retrieve the data, and for those customers that have added their account credentials we're going to reliably meet those targets.
  2. More data: The higher rate limits that we will gain through the new token management will allow us to get you more data.
  3. New features: Over time we will build new features and experiences around Owned and Non-Owned use cases. This may include accessing new APIs or specialized dashboards around the use cases for the different scenarios.
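To make the predictability point concrete, here's a back-of-the-envelope sketch (all numbers invented; real Graph API limits vary by token type) of how a pooled token budget translates into a refresh interval, and why every extra authenticated token helps:

```python
import math

def refresh_interval_minutes(pages, requests_per_sweep,
                             hourly_budget_per_token, tokens):
    """Minutes between full re-crawls of every page, given the pooled budget.

    Illustrative arithmetic only: total hourly budget divided by the cost
    of one full sweep gives sweeps per hour, hence minutes per sweep.
    """
    total_budget_per_hour = hourly_budget_per_token * tokens
    sweeps_per_hour = total_budget_per_hour / (pages * requests_per_sweep)
    return math.ceil(60 / sweeps_per_hour)

# 50 pages, ~4 requests each per sweep, 200 requests/hour/token:
two_tokens = refresh_interval_minutes(50, 4, 200, 2)   # 30-minute refresh
four_tokens = refresh_interval_minutes(50, 4, 200, 4)  # halves to 15 minutes
```

Doubling the authenticated tokens halves the achievable latency, which is exactly the lever the token collection and invite system is meant to give you.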

Bonus Question! 11. What about Facebook Topic/Insights data?

There was a data aggregator that had exclusive access to anonymized Facebook data.

This kind of data is a little different to ‘regular social data’ in that rather than querying for the raw data and rolling it up to the insights in our platform, the anonymized data works by passing a set of conditions into the API and receiving answers out.

So rather than saying “send me all conversation around mortgages” and then us using this to provide dashboards, you ask it “what kinds of people talk about mortgages?” and the system delivers anonymized demographics or topics back.

On the one hand this is super useful, because it's answering questions from within Facebook's closed pool of 1.9Bn users. On the other hand it's a very different experience, because you can't dive down into the actual conversation.

We trialled this service with a few of our customers, but the price point of the service made it challenging at the time. Facebook now plan to offer this service directly, and so we no longer offer it through the third-party anonymized API. We are in conversations with Facebook and hope to be one of the first to bring something to market when the API is widely available to our customers.

As always, if you have further questions or requests for a future topic, please leave a comment below.



Amy Collins

@amy_co106

Amy is VP of Product for Data at Brandwatch. Her responsibilities include developing the data strategy, data partnerships and acquisition and for owning the roadmap of new data products. She's a keen sailor and is mum to two little girls.

  • RoTRiMa

    Thank you very much for the information Amy, we really need this about Facebook :)

    I know how Instagram coverage improved over the last months but I can not understand how this (growing compute capacity) will work on Facebook which has no API for public posts ?

  • Excellent news Amy!

  • Amy Collins

    Hello there,

    Thanks for your question.

    Public posts made to a Facebook page are accessible through the API. The more of these pages we track, the higher the compute burden; the higher volume those pages are (in terms of posts and comments), the higher the burden again; and the frequency at which we check them pushes the compute requirements still further. So the project to elastically scale will allow us to track many more, larger pages with lower latency.

    Cheers

    Amy

  • RoTRiMa

    Thank you very much Amy, now I understood :)

    Kind regards,

    Summani