Calling All European Coders: What Could You Build with This Web Crawler Hadoop Database?

Last week we announced that Seznam.cz was opening up part of its search technology by providing access to a cluster of crawled data. Today, we are happy to give you more details.

Seznam.cz's full-text search technology is based on Hadoop and HBase. The teams will have access to a test cluster of up to 100 million documents from the Internet, all of them pre-crawled and sorted into entities such as domains, webservers and URLs. Each of these entities has its own attributes for fast analysis and sorting of each web page in the cluster.

More specifically, the three entities are:

  • Domains – these mirror the DNS name structure; domains are organized as a tree whose root entity is the special domain “.”.
  • Webservers – a “webserver” is a specialization of a “domain” (webserver = domain + port). It gathers URL statistics and other attributes that relate to the webserver as a whole (for example, the content of robots.txt is webserver-relevant).
  • URLs – a URL represents a document on a webserver and is always related to some “webserver”. It contains all attributes relevant to a single web page.

Each entity has a key. The key looks like a modified URL – the hostname parts are in reverse order, and the rest of the URL is lowercased and cleaned up. It is possible to recognize an entity type from its key value. For example (a sketch of this normalization follows the listing):

  • URL: http://www.montkovo.cz/Cenik/?utm_source=azet.sk&utm_medium=kampan11
  • URL-key: cz.montkovo.!80/cenik
  • webserver-key: cz.montkovo.!80
  • domain-key: cz.montkovo.
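
To make the key scheme concrete, here is a minimal Java sketch that reproduces the example above. The exact cleanup rules (which query parameters are dropped, how the “www.” prefix, trailing slashes and non-default ports are handled) are not spelled out in this post, so treat the helper below as an approximation rather than our actual implementation:

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class KeyBuilder {

    // Domain key: hostname lowercased, parts reversed, trailing dot appended,
    // e.g. "www.montkovo.cz" -> "cz.montkovo." (dropping "www." is a guess
    // based purely on the example above).
    static String domainKey(String host) {
        String h = host.toLowerCase();
        if (h.startsWith("www.")) {
            h = h.substring(4);
        }
        List<String> parts = Arrays.asList(h.split("\\."));
        Collections.reverse(parts);
        return String.join(".", parts) + ".";
    }

    // Webserver key: domain key plus "!" and the port (80 assumed when none is given).
    static String webserverKey(URI uri) {
        int port = uri.getPort() == -1 ? 80 : uri.getPort();
        return domainKey(uri.getHost()) + "!" + port;
    }

    // URL key: webserver key plus the lowercased path; the query string and the
    // trailing slash are stripped here, which matches the single example above.
    static String urlKey(URI uri) {
        String path = uri.getPath() == null ? "" : uri.getPath().toLowerCase();
        if (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return webserverKey(uri) + path;
    }

    public static void main(String[] args) {
        URI u = URI.create("http://www.montkovo.cz/Cenik/?utm_source=azet.sk&utm_medium=kampan11");
        System.out.println(urlKey(u));               // cz.montkovo.!80/cenik
        System.out.println(webserverKey(u));         // cz.montkovo.!80
        System.out.println(domainKey(u.getHost()));  // cz.montkovo.
    }
}
```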

The whole database is sorted by key (ascending), so all URLs on the same webserver are co-located and can be processed one after another.
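
Because of this ordering, reading everything that belongs to one webserver amounts to a single prefix scan over the row keys. The sketch below assumes plain HBase client access and a made-up table name, since this post does not describe exactly how the cluster is exposed:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class WebserverScan {
    public static void main(String[] args) throws Exception {
        // "urls" is a made-up table name; the real table layout is not described in this post.
        HTable table = new HTable(HBaseConfiguration.create(), "urls");
        try {
            // Every URL key of this webserver starts with its webserver key,
            // so one range scan walks them in order, one after another.
            byte[] prefix = Bytes.toBytes("cz.montkovo.!80");
            Scan scan = new Scan(prefix);              // start scanning at the prefix...
            scan.setFilter(new PrefixFilter(prefix));  // ...and stop once keys no longer match it

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}
```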

Here is a list of common attributes for each entity:

Domain entity

  • Key
  • IP address of the domain (if exists)
  • Number of direct sub-domains
  • Number of all sub-domains
  • Number of all webservers in all sub-domains
  • Number of all known URLs (URLs related to all sub-domains). We call this URL state “key-only”.
  • Number of all downloaded URLs. State “content”.
  • Number of all processed URLs (i.e. parsed, with basic features extracted). State “derivative”.
  • Number of redirects
  • Number of errors (i.e. URLs with downloading or processing error)
  • Average document download latency

Webserver entity

  • Key
  • Webserver homepage (key to that URL)
  • Content of robots.txt (robots exclusion protocol) relevant to our crawler
  • Number of all known URLs (state key-only) related to this webserver.
  • Number of all downloaded URLs (state content) related to this webserver.
  • Number of all processed URLs (state derivative) related to this webserver.
  • Number of redirects
  • Number of errors
  • Average document download latency

URL entity

  • Key
  • URL as seen on the web
  • Last download date
  • Last HTTP status
  • Type of the URL – there are several (not downloaded, web page, redirect, error, …). Note that the type of the URL is not the same as the HTTP status. For example, the HTTP status may be 200 OK while the URL type is redirect, because we detected a software redirect within the page content.
  • Attributes specific for different URL types:
    • Not downloaded page
      • We have no explicit information about this page. Only factors that can be predicted (for example document language) and off-page signals (like pagerank) are available.
      • Prediction of document language
      • Prediction of explicit content (porn)
      • Pagerank – classic PR value calculated from link graph
      • Link distance from webserver homepage
      • List of backward links, each containing:
        • Key of the source page
        • Anchor texts relevant to this link
        • HTML title of the source page
        • Pagerank of the source page
    • Web page (i.e. downloaded page with regular content)
      • Alternative URLs for the page – each page could be presented under multiple different URLs. This is a scored list of those possibilities.
      • Detected document’s Content-Type
      • Downloaded content
      • Content version – date/time of the content download. Could differ from the last download date (e.g. after a 304 Not Modified response).
      • Major language – the language identified as “most relevant” for this page. It could differ from the most frequent language on the page (e.g. a different language for body text vs. menus).
      • Homepage – flag indicating whether this page is the webserver’s homepage
      • Pagerank – classic pagerank value
      • Link distance of this page from webserver’s homepage
      • Derivative (attributes obtained by further processing):
        • Document charset
        • Detected languages on page with their frequencies
        • Explicit content flag – detected porn
        • Document title
        • Document <meta description …>
        • Document content parsed down to a DOM tree
        • Forward links found on the page
      • List of backward links. Each one has:
        • Key of the source document
        • Anchor texts (extracted from source document) relevant to this link
        • HTML title of the source page
        • Pagerank of the source page
    • Redirect
      • Target URL key
      • Homepage – flag indicating that this redirect is part of a redirect chain to the webserver’s homepage
    • Error
      • The same info as for “not downloaded page”
      • We could provide more attributes if needed, for example the date of the last download when the page was still OK.
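
To give a flavour of how this model can be processed with Hadoop, here is a rough MapReduce sketch that counts known URLs per webserver using nothing but the row-key structure described earlier. It assumes the entities sit in a single HBase table read through a standard TableMapper, and that every URL key contains a “/” – neither of which is guaranteed by this post:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Emits (webserver key, 1) for every row whose key looks like a URL key;
// a summing reducer then gives the number of known URLs per webserver.
public class UrlsPerWebserverMapper extends TableMapper<Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
            throws IOException, InterruptedException {
        String key = Bytes.toString(rowKey.get());
        int slash = key.indexOf('/');
        if (slash < 0) {
            // No path part: by the key scheme above this is a domain or webserver
            // row (or a homepage URL, depending on how those are keyed), so skip it.
            return;
        }
        // Everything before the first "/" is taken as the webserver key,
        // per the example keys shown earlier.
        context.write(new Text(key.substring(0, slash)), ONE);
    }
}
```

Paired with a summing reducer (and the usual TableMapReduceUtil job setup), this would recompute the per-webserver “number of all known URLs” attribute from scratch; the same pattern extends to most of the attributes listed above.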

With all this data at your disposal, what could you build? The cluster will be updated and new entries can be added as per team requests. We are looking for the best ideas in the area of Data, Search and Analytics.

Wherever you are in Europe, we will pay for your flight ticket and your accommodations for 3 months in Prague so that you can participate in our accelerator program. Why don’t you start your application now?


If you have any questions about the database, post them as a comment below.