Published February 16th 2017
In my last blog, I wrote about our big focus in Quarter 1 of 2017, AsiaPac data coverage.
I didn’t talk about our other major stream of work: improvements to our in-house crawling technologies. Our Data Crawling team own all of our crawled sources such as Facebook, Instagram, blogs, forums and news sites.
I’m going to switch up the format of this blog and do it as a Q&A on how our Instagram data coverage works.
This is a really common question.
Just in case you don’t know, the term “firehose” is industry slang for a paid, full feed of all the data from a given source. Probably the most well known provider of a firehose is Twitter; we pay them for data and we get 100% of tweets for any customer Query. We have similar arrangements with Disqus and a number of other providers.
Unlike Twitter, there is no such thing as a firehose for Instagram.
Instagram is part of the Facebook empire – which is unashamedly an ads company. Well, technically, it positions itself as a media company, but essentially is a hugely successful advertising machine. Selling data for analytics really doesn’t figure in their business model. Facebook are focussed only on generating a good customer experience for their users and getting their advertisers to spend more; Facebook see data sharing as a way of helping advertisers provide users with more targeted and useful advertising. Providing consumer insights to businesses is not their focus and that’s the reason neither Facebook nor Instagram offer a firehose or any kind of pay-for-data business.
Any data provider that tells you that they have “100% of Instagram data” or the “Instagram firehose” should be treated with extreme skepticism.
We are in the same boat as everyone else: we have access to the Instagram Public APIs and crawl its data.
Some vendors skip this step and pay an aggregator to crawl for them. We decided against doing this, meaning we’re in control of what data we collect (basically more stuff customers want, less spam and fewer irrelevant mentions). Like all providers, because this is a public API rather than a paid-for one, we are rate limited, so we have to be very smart about how we balance the things we crawl.
Because there is no complex search API (as with some other data providers, such as Twitter), we can’t pass in all the rich, boolean logic that you would normally specify in a Brandwatch Query.
The API provides a number of endpoints to get Instagram data. For the non-developers amongst you, an endpoint is just a place where we can request data from the API in a particular format. We currently crawl posts for requested hashtags and posts made by specific users. We also recently added comment crawling for users’ posts.
It is useful to conceptually split this into two activities: first, getting posts from Instagram, and second, matching them against your Queries.
Getting posts from Instagram: Because there is no complex search API, we query the Instagram tags endpoint.
This gives us a list of posts for a given hashtag. We compile a list of all hashtags referenced across our customer Queries (where users have specified the hashtag: operator). This list is then worked through and we request the posts for each hashtag.
The call returns a page of posts, and we then iterate to gather multiple pages’ worth. Each post is then stored and made available for matching against customer Queries.
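To make the two steps above concrete, here is a minimal sketch in Python. It is an illustration only, not our production crawler: the `hashtag:` parsing is deliberately naive, and `fake_fetch_page`, the cursor values and the `max_pages` limit are all hypothetical stand-ins for a real call to the tags endpoint.

```python
import re
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

# Naive pattern for the hashtag: operator (assumption: tags are plain word characters).
HASHTAG_OPERATOR = re.compile(r"hashtag:(\w+)", re.IGNORECASE)


def collect_hashtags(queries: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated hashtags referenced across all Queries."""
    tags = set()
    for query in queries:
        for tag in HASHTAG_OPERATOR.findall(query):
            tags.add(tag.lower())
    return sorted(tags)


# A page fetcher takes (hashtag, cursor) and returns (posts, next_cursor).
PageFetcher = Callable[[str, Optional[str]], Tuple[List[dict], Optional[str]]]


def crawl_hashtag(tag: str, fetch_page: PageFetcher, max_pages: int = 3) -> Iterator[dict]:
    """Yield posts for one hashtag, following page cursors up to max_pages."""
    cursor = None
    for _ in range(max_pages):
        posts, cursor = fetch_page(tag, cursor)
        yield from posts
        if cursor is None:  # no further pages for this hashtag
            break


# Hypothetical in-memory stand-in for the real endpoint, so the sketch runs offline.
FAKE_PAGES = {
    ("cats", None): ([{"id": 1, "text": "I'm so glad I don't have a dog. #cats"}], "p2"),
    ("cats", "p2"): ([{"id": 2, "text": "Nap time #cats"}], None),
}


def fake_fetch_page(tag: str, cursor: Optional[str]) -> Tuple[List[dict], Optional[str]]:
    return FAKE_PAGES.get((tag, cursor), ([], None))


queries = ["hashtag:cats OR hashtag:Kittens", "dog", "hashtag:cats AND cute"]
hashtags = collect_hashtags(queries)
archive = [post for tag in hashtags for post in crawl_hashtag(tag, fake_fetch_page)]
```

Note that the Query for “dog” contributes nothing to the hashtag list: only Queries that use the hashtag: operator drive what gets requested.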
Matching it against customer Queries: Once a post is in the Brandwatch data archive, it is eligible to be matched to any customer Query regardless of whether it referenced the hashtag that found the post.
For example, if customer 1 creates a Query for hashtag:cats and we retrieve an Instagram post where the text says “I’m so glad I don’t have a dog. #cats”, we would match it against customer 1’s Query, but also against customer 2’s Query that was for any reference to the word “dog” (even without mention of a hashtag).
We call this “incidental data” – it’s a kind of network effect where the data we retrieve for one customer can benefit all of our customers.
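The cats and dog example can be sketched in a few lines of Python. This is a heavily simplified matcher, assuming a Query is either a single `hashtag:` term or a single keyword; real Brandwatch Queries support far richer boolean logic, and the `matches` function and customer names here are purely illustrative.

```python
import re


def matches(query: str, text: str) -> bool:
    """Simplified matcher: supports only `hashtag:x` or a single keyword."""
    # Tokenise the post into lowercase words, keeping a leading '#' on hashtags.
    words = set(re.findall(r"#?\w+", text.lower()))
    query = query.lower()
    if query.startswith("hashtag:"):
        return "#" + query[len("hashtag:"):] in words
    return query in words


post_text = "I'm so glad I don't have a dog. #cats"

customer_queries = {
    "customer 1": "hashtag:cats",  # the Query that caused the post to be crawled
    "customer 2": "dog",           # benefits incidentally from the same post
    "customer 3": "hashtag:dogs",  # no match: the post has no #dogs hashtag
}

matched = [name for name, query in customer_queries.items() if matches(query, post_text)]
```

Here `matched` contains customer 1 and customer 2, even though only customer 1’s hashtag led to the post being retrieved.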
For the record, I’m not actually a cat person.
This capability exists within Brandwatch Analytics Channels for Instagram.
Using it is pretty simple: rather than creating a regular Query, you create a Channel and specify the Instagram user that you want to add. We then begin crawling that user’s content and any comments or likes.
When you create a Channel, the first time we crawl we get the most recent 100 posts made in the last seven days. For each post we get up to 150 comments. We then revisit the user page every one to two hours (depending on how long the last crawling cycle took and the levels of hardware provisioned at that time) and get the top 100 posts and up to 150 comments for each.
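As a rough illustration of those limits, the sketch below applies the caps to a list of posts. It assumes that “top” means most recent and that the seven-day window applies only to the first crawl; the function name, field names and sample data are hypothetical.

```python
from datetime import datetime, timedelta

MAX_POSTS = 100                          # posts fetched per crawl of a Channel
MAX_COMMENTS = 150                       # comments fetched per post
FIRST_CRAWL_WINDOW = timedelta(days=7)   # look-back applied on the first crawl only


def select_for_crawl(posts, now, first_crawl):
    """Pick the posts (and comments) a single crawl of a Channel would keep.

    posts: list of dicts with 'created' (datetime) and 'comments' (list).
    """
    ordered = sorted(posts, key=lambda p: p["created"], reverse=True)
    if first_crawl:
        ordered = [p for p in ordered if now - p["created"] <= FIRST_CRAWL_WINDOW]
    return [{**p, "comments": p["comments"][:MAX_COMMENTS]} for p in ordered[:MAX_POSTS]]


# Sample data: one post per day going back ten days, each with 200 comments.
now = datetime(2017, 2, 16, 12, 0)
posts = [
    {"id": i, "created": now - timedelta(days=i), "comments": list(range(200))}
    for i in range(10)
]

first = select_for_crawl(posts, now, first_crawl=True)
revisit = select_for_crawl(posts, now, first_crawl=False)
```

With this sample data the first crawl keeps only the posts from the last seven days, while a revisit considers all ten; in both cases each post is trimmed to 150 comments.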
You may have seen a message in Brandwatch asking you to authenticate with Instagram, or you may have noticed the authentication menu in the top right of the app:
The challenge is that tokens can only be used so often before they wear out and have to regenerate for a period. So the more tokens we have, the more crawling capacity we have. Each Instagram user account can generate a token, which we can store and use to increase our crawling capacity.
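One way to picture why more tokens means more capacity is a simple token pool, sketched below. The class, the per-window call budget and the window length are all assumptions made for illustration; they are not Instagram’s actual limits or our real implementation.

```python
class TokenPool:
    """Hand out tokens that still have call budget left in their current window."""

    def __init__(self, tokens, calls_per_window, window_seconds):
        self.calls_per_window = calls_per_window
        self.window_seconds = window_seconds
        # token -> (calls used in the current window, time the window started)
        self._usage = {token: (0, 0.0) for token in tokens}

    def acquire(self, now):
        """Return a usable token at time `now`, or None if every token is resting."""
        for token, (used, started) in self._usage.items():
            if now - started >= self.window_seconds:
                used, started = 0, now  # window elapsed: the budget has regenerated
            if used < self.calls_per_window:
                self._usage[token] = (used + 1, started)
                return token
        return None


# Two tokens, each allowed two calls per hour (hypothetical numbers).
pool = TokenPool(["token_a", "token_b"], calls_per_window=2, window_seconds=3600)
granted = [pool.acquire(now=0) for _ in range(5)]
after_rest = pool.acquire(now=3600)
```

In this run the first four requests are served (two per token), the fifth gets `None` because both tokens are exhausted, and an hour later the first token is usable again. Doubling the number of tokens in the pool doubles the calls that can be made per window.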
In various places within Brandwatch, we offer a higher level of service to people that have authenticated with us and provided a token. Currently these two areas are:
We have the following suggestions to get the best out of our data retrieval processes.
We want more. I think we need to evolve our Instagram coverage in three key areas:
Right now, number one is our priority. In fact, we have just formed a new engineering team dedicated to this and they are making rapid progress on some major architectural improvements to radically increase the number of posts that we’re retrieving. I will have more to share in the coming weeks, so watch for future posts to the blog.