Just How Much of the Web Can We Crawl?

As a social media monitoring company, we depend on the quality of our coverage to perform as a business. But what exactly does that mean?

Even that term – social media monitoring – isn’t quite accurate. We don’t just monitor the social web; we listen to all online activity.

Social Media can mean lots of different things to different people, and many will argue that the web has always been as social as it is in its current Twitter and Facebook-infested form, through comparatively archaic things like email, Usenet and IM.

The boundaries of what is social and what isn’t ultimately don’t matter too much to us. We crawl the entire web to make sure that if anyone is talking about you or what you’re interested in tracking, we’ll be able to find it.

So if we’re not just listening for content published on social sites, what does Brandwatch track?

This article should help you make sense of exactly what the 70,000,000+ pages we crawl each day include, with a look at some of the most popular sites we cover.

You can also take a look at this intro document that gives an overview of the sites we get asked about most.


News websites

Information is the currency of the digital age, and tracking articles published on news websites is one of the most important applications of our tool. PR departments and campaign managers can easily keep tabs on which news sites their stories are reaching, as Brandwatch includes thousands of the most important news sites available. Below are just a few of the major news sites we cover.

[Image: a selection of the major news sites we cover]

We use a blacklist approach to crawling. This means that we attempt to crawl every news website there is, excluding those behind a paywall, and then remove the spam and irrelevant mentions. The result is comprehensive coverage of most major news websites, alongside many smaller, local news sites.


Forums

As with news websites, our forum coverage operates on the blacklist method: we crawl many thousands of forums, then eliminate those that aren’t useful.

We prefer this technique to a whitelist approach because it starts from the broadest possible coverage and trims it down, rather than building up a useful list from nothing.
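To make the difference concrete, here is a minimal sketch in Python. It is an illustration only and not a description of our production crawler: the domain names, `BLACKLIST`, `WHITELIST` and the helper functions are all invented for the example.

```python
from urllib.parse import urlparse

# Hypothetical blacklist: domains already judged to be spam or irrelevant.
BLACKLIST = {"spam-farm.example", "link-scheme.example"}

# Hypothetical whitelist, for comparison: only hand-approved domains.
WHITELIST = {"bigdaily.example"}


def domain_of(url):
    """Return the lower-cased host name of a URL ('' if there is none)."""
    return (urlparse(url).hostname or "").lower()


def blacklist_filter(urls):
    """Keep every URL unless its domain has been explicitly excluded."""
    return [u for u in urls if domain_of(u) not in BLACKLIST]


def whitelist_filter(urls):
    """Keep a URL only if its domain has been explicitly approved."""
    return [u for u in urls if domain_of(u) in WHITELIST]


# Made-up URLs standing in for pages discovered during a crawl.
discovered = [
    "https://bigdaily.example/politics/story",
    "https://smalltown-gazette.example/local/fair",
    "https://spam-farm.example/buy-now",
]

kept_by_blacklist = blacklist_filter(discovered)
kept_by_whitelist = whitelist_filter(discovered)
```

In this toy run the blacklist filter keeps the small local site that nobody had reviewed yet, while the whitelist filter drops it. That is the trade-off in miniature: a blacklist has to keep weeding out spam, but it does not miss sources simply because no one has approved them.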

We’re also able to extract individual comments on forum threads, and the only forums we’re missing are those that have politely asked us not to crawl them or have their privacy settings set to private.
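One common way for a site to ask crawlers to stay away is a robots.txt file. The sketch below assumes that mechanism purely for illustration (the article does not say how such requests reach us), and the crawler name `examplebot`, the forum URL and the `may_crawl` helper are hypothetical. It uses Python’s standard `urllib.robotparser` module and parses an inline file so that it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one crawler is excluded entirely, and every
# other crawler is kept out of the /private/ area only.
ROBOTS_TXT = """\
User-agent: examplebot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


thread_url = "https://forum.example/threads/1"
private_url = "https://forum.example/private/members"

excluded_bot_allowed = may_crawl(ROBOTS_TXT, "examplebot", thread_url)
other_bot_allowed = may_crawl(ROBOTS_TXT, "otherbot", thread_url)
other_bot_private = may_crawl(ROBOTS_TXT, "otherbot", private_url)
```

A crawler that respects these rules would skip the whole forum when running as `examplebot`, and would skip only the private area under any other name.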

Our forum coverage can also include image boards such as 4chan, social bookmarking sites like StumbleUpon or even review sites like TripAdvisor.


Social networks

Social media sites account for a significant share of the content that our clients are most interested in listening to, hence the name ‘social media monitoring’.

Sites like LinkedIn and Facebook are notoriously difficult to retrieve data from, as both networks have a stringent set of scraping rules and privacy controls that prevent us from taking everything published on those platforms.

We do, however, have relationships with a number of the key networks, including our status as a Twitter Certified Product, to ensure our coverage is as good as it can possibly be. These relationships can give us 100% access to a network’s social data.

Regional preferences, such as the popularity of RenRen and Weibo in China or Orkut in Brazil and India, are also considerations we take into account when determining which sites to crawl.


Blogs

A huge chunk of the internet is made up of blogs. This includes hubs for leading internet discourse, fierce diatribes against just about anything and endless porn-focused spam disasters.

We use sophisticated systems to extract only the relevant content from sprawling blog networks like Tumblr, Blogspot and WordPress, producing a list of blogs to crawl that runs to eight digits and is updated every day.

We’re also sure to include all industry blogs, from corporate-produced articles to mainstream sites like Wired and TechCrunch.


Multimedia content

As video and image-based content becomes more prevalent, we strive to make sure that our coverage reflects that.

While 100% coverage of these sites is again not viable, for the same reasons that other social networks are difficult to cover in full, we are able to extract a significant percentage of content from sites such as YouTube, Flickr and Dailymotion, as well as some content from other sites such as Pinterest and Instagram.

We are always working on new ways in which we can track emerging and popular image and video sites to increase our coverage.


Other types of sites

Not all sites can be pigeonholed into pre-defined genres. Personal portfolios, review archives, corporate articles and other miscellaneous websites make up a large proportion of the sites online, and consequently of the sites we cover.

It’s tough to wedge all these sites under one single umbrella, but rest assured that if the site is reasonably significant (written by a human and with real visitors), we’ll be listening.


Languages

As well as catering for regional markets through the sites we crawl, we’re sensitive to the language that mentions are published in. We can track mentions in the 27 languages listed below, and we’re adding more every month. Our renowned sentiment analysis is also available for most of the languages we track.

  •  Arabic BETA
  •  Brazilian Portuguese
  •  Chinese (Simplified)
  •  Chinese (Traditional)
  •  Czech BETA
  •  Danish
  •  Dutch
  •  Egyptian Arabic BETA
  •  English
  •  European Portuguese
  •  Farsi BETA
  •  Finnish
  •  French
  •  German
  •  Greek
  •  Gulf Arabic BETA
  •  Hebrew BETA
  •  Italian
  •  Japanese
  •  Korean BETA
  •  Norwegian
  •  Polish
  •  Romanian BETA
  •  Russian
  •  Spanish
  •  Swedish
  •  Turkish
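If you want to check a language against this list programmatically, a simple lookup is enough. The sketch below copies the names and BETA labels from the list above. The `language_status` helper and its return values are invented for illustration and are not part of any Brandwatch API.

```python
# Languages from the list above that carry no BETA label.
STABLE = {
    "Brazilian Portuguese", "Chinese (Simplified)", "Chinese (Traditional)",
    "Danish", "Dutch", "English", "European Portuguese", "Finnish", "French",
    "German", "Greek", "Italian", "Japanese", "Norwegian", "Polish",
    "Russian", "Spanish", "Swedish", "Turkish",
}

# Languages from the list above that are labelled BETA.
BETA = {
    "Arabic", "Czech", "Egyptian Arabic", "Farsi", "Gulf Arabic",
    "Hebrew", "Korean", "Romanian",
}


def language_status(name):
    """Classify a language name as 'supported', 'beta' or 'not listed'."""
    if name in BETA:
        return "beta"
    if name in STABLE:
        return "supported"
    return "not listed"


total_languages = len(STABLE | BETA)
korean_status = language_status("Korean")
english_status = language_status("English")
```

The lookup is exact and case-sensitive, so a real integration would probably want to normalise names or use language codes instead.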

You can now see just how much of the internet we’re able to cover (and this is just the tip of the iceberg), as well as the considerations we have to weigh when crawling the web. If you’d like to know more about how comprehensive our coverage is, the specific quality of our crawling for each site, how our spam filtering works or any other question about our data, please don’t hesitate to get in touch with us on Twitter, Facebook or by email.