Calling all European Coders: What Could you Build with this Web Crawler Hadoop Database?

Last week we announced that Seznam.cz was opening part of its search technology by providing a cluster of data. Today, we are happy to give you more details.

Seznam.cz’s full-text search technology is based on Hadoop and HBase. The teams will have access to a test cluster of up to 100 million documents from the Internet, all of them pre-crawled and sorted into entities such as domains, webservers and URLs. Each of these entities carries its own attributes for fast analysis and sorting of each web page in the cluster.

More specifically, the three entities are:

  • Domains – these mirror the DNS name structure and are organized as a tree. The root entity is the special domain “.”.
  • Webservers – a “webserver” is a specialization of a “domain” (webserver = domain + port). It gathers URL statistics and other attributes that relate to a webserver as a whole (for example, the content of robots.txt).
  • URLs – a URL represents a document on a webserver and always belongs to a “webserver”. It contains all attributes relevant to a single web page.

Each entity has a key. The key looks like a modified URL – the hostname parts are in reverse order, and the rest of the URL is lowercased and cleaned up. The entity type can be recognized from the key value. For example:

  • URL: http://www.montkovo.cz/Cenik/?utm_source=azet.sk&utm_medium=kampan11
  • URL-key: cz.montkovo.!80/cenik
  • webserver-key: cz.montkovo.!80
  • domain-key: cz.montkovo.
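
To make the key format concrete, here is a minimal Python sketch that reproduces the example above. It is only a guess at the rules: the function names `make_keys` and `entity_type` are hypothetical, and dropping the leading “www”, stripping the query string and trailing slash, and keeping “/” for a homepage are assumptions inferred from this single example, not the actual Seznam.cz implementation.

```python
from urllib.parse import urlsplit


def make_keys(url):
    """Return (domain_key, webserver_key, url_key) for a URL.

    Illustrative only: the cleanup rules are guessed from one example.
    """
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    # Assumption: a leading "www" label is dropped, as in the example.
    if labels[0] == "www":
        labels = labels[1:]
    # Hostname labels in reverse order, with a trailing dot.
    domain_key = ".".join(reversed(labels)) + "."
    # Assumption: default port 80 for http and 443 for https.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    webserver_key = f"{domain_key}!{port}"
    # Assumption: query string and trailing slash are removed and the
    # path is lowercased; an empty path is kept as "/".
    path = parts.path.lower().rstrip("/") or "/"
    return domain_key, webserver_key, webserver_key + path


def entity_type(key):
    """Guess the entity type from the shape of a key (hypothetical rule)."""
    if "!" not in key:
        return "domain"
    port_and_path = key.partition("!")[2]
    return "url" if "/" in port_and_path else "webserver"


print(make_keys("http://www.montkovo.cz/Cenik/?utm_source=azet.sk&utm_medium=kampan11"))
# ('cz.montkovo.', 'cz.montkovo.!80', 'cz.montkovo.!80/cenik')
```

The “!” separator is what makes the type recognizable here: no “!” means a domain, “!port” alone means a webserver, and “!port/…” means a URL.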

The whole database is sorted by key (ascending), so all URLs on the same webserver are co-located and can be processed one after another.
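
Why does the sort order matter? Because every key on one webserver shares a prefix, a single range scan returns all of its URLs. The sketch below imitates that with a sorted Python list and `bisect`. The sample keys and the helper `scan_prefix` are made up for illustration, and the real cluster would be read through Hadoop/HBase rather than an in-memory list.

```python
import bisect


def scan_prefix(sorted_keys, prefix):
    """Return all keys starting with prefix from an ascending sorted list."""
    # Jump to the first key that is >= prefix ...
    i = bisect.bisect_left(sorted_keys, prefix)
    found = []
    # ... then read forward while the prefix still matches.
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        found.append(sorted_keys[i])
        i += 1
    return found


# Hypothetical sample keys, in the format described above.
keys = sorted([
    "cz.seznam.!80/",
    "cz.montkovo.!80/kontakt",
    "cz.montkovo.",
    "cz.montkovo.!8080/x",
    "cz.montkovo.!80",
    "cz.montkovo.!80/cenik",
    "cz.seznam.",
])

print(scan_prefix(keys, "cz.montkovo.!80/"))
# ['cz.montkovo.!80/cenik', 'cz.montkovo.!80/kontakt']
```

Note that the prefix ends with “/”, so URLs on port 8080 of the same host are not mixed in.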

Here is a list of common attributes for each entity:

Domain entity

  • Key
  • IP address of the domain (if exists)
  • Number of direct sub-domains
  • Number of all sub-domains
  • Number of all webservers in all sub-domains
  • Number of all known URLs (URLs related to all sub-domains). We call this URL state “key-only”.
  • Number of all downloaded URLs. State “content”.
  • Number of all processed URLs (i.e. parsed and extracted basic features). State “derivative”.
  • Number of redirects
  • Number of errors (i.e. URLs with downloading or processing error)
  • Average document download latency

Webserver entity

  • Key
  • Webserver homepage (key to that URL)
  • Content of Robots.txt (robot exclusion protocol) relevant to our crawler
  • Number of all known URLs (state key-only) related to this webserver.
  • Number of all downloaded URLs (state content) related to this webserver.
  • Number of all processed URLs (state derivative) related to this webserver.
  • Number of redirects
  • Number of errors
  • Average document download latency

URL entity

  • Key
  • URL as seen on the web
  • Last download date
  • Last HTTP status
  • Type of the URL – one of several (not downloaded, web page, redirect, error, …). Note that the type of the URL is not the same as the HTTP status. For example, the HTTP status may be 200 OK while the URL type is redirect, because we detected a software redirect within the page content.
  • Attributes specific for different URL types:
    • Not downloaded page
      • We have no explicit information about this page. Only factors that could be predicted (for example document language) and off-page signals (like pagerank) are available.
      • Prediction of document language
      • Prediction of explicit content (porn)
      • Pagerank – classic PR value calculated from link graph
      • Link distance from webserver homepage
      • List of backward links, each containing:
        • Key of the source page
        • Anchor texts relevant to this link
        • HTML title of the source page
        • Pagerank of the source page
    • Web page (i.e. downloaded page with regular content)
      • Alternative URLs for the page – each page can be presented under multiple different URLs. This is a scored list of those possibilities.
      • Detected document’s Content-Type
      • Downloaded content
      • Content version – date/time of content download. Can differ from the last download date (for example, when the server answered 304 Not Modified)
      • Major language – the language identified as “most relevant” for this page. Can differ from the most frequent language on the page (for example, a different language for body text vs. menus)
      • Homepage – flag if this page is webserver’s homepage
      • Pagerank – classic pagerank value
      • Link distance of this page from webserver’s homepage
      • Derivative (attributes obtained by further processing):
        • Document charset
        • Detected languages on page with their frequencies
        • Explicit content flag – detected porn
        • Document title
        • Document <meta description …>
        • Document content parsed down to a DOM tree
        • Forward links found on the page
      • List of backward links. Each one has:
        • Key of the source document
        • Anchor texts (extracted from source document) relevant to this link
        • HTML title of the source page
        • Pagerank of the source page
    • Redirect
      • Target URL key
      • Homepage – flag that this redirect is part of redirect chain to a webserver’s homepage
    • Error
      • The same info as for “not downloaded page”
      • We could provide more if needed, for example the date of the last download when the page was OK.
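
The difference between URL type and HTTP status is easier to see in code. The sketch below is one possible reading, not our production logic: it assumes a “software redirect” means an HTML meta refresh tag, and the names `MetaRefreshFinder` and `url_type` are hypothetical.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # Typical content value: "0; url=http://example.com/new"
        for part in attr.get("content", "").split(";"):
            name, sep, value = part.strip().partition("=")
            if sep and name.strip().lower() == "url":
                self.target = value.strip().strip("'\"")


def url_type(http_status, html):
    """Classify a fetched URL; the rules are illustrative assumptions."""
    if 300 <= http_status < 400:
        return "redirect"
    if http_status >= 400:
        return "error"
    finder = MetaRefreshFinder()
    finder.feed(html)
    # A 200 OK page can still be a redirect if its content says so.
    return "redirect" if finder.target else "web page"


page = '<html><head><meta http-equiv="refresh" content="0; url=http://example.com/new"></head></html>'
print(url_type(200, page))
# redirect
```

Here the server answered 200 OK, yet the page is classified as a redirect because of its content.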

With all this data at your disposal, what could you build? The cluster will be updated and new entries can be added as per team requests. We are looking for the best ideas in the area of Data, Search and Analytics.

Wherever you are in Europe, we will pay for your flight ticket and your accommodations for 3 months in Prague so that you can participate in our accelerator program. Why don’t you start your application now?


If you have any questions about the database, post them as a comment below.

How do you Start an Internet Business?

As Internet professionals we get asked the same question over and over. “I have an idea for an internet company. How do I start?”

It’s simpler than it seems

My answer is invariably as follows:

“Start building… Make a mockup and show it to people around you. Listen to the feedback, throw away your first version and do it again with all the knowledge you have gained from future users. Do that over and over. Don’t stop until they convince you or you convince yourself it’s a bad idea. In that case, kill it and wait for the next idea.”

You read that right: make a mockup and show it to people. This is the only possible first step you can take if you want to turn your internet idea into a business. Don’t think your first step is to write a business plan. That’s the last thing you should do. Right now you have no business, let alone a product. What you need to know now is whether your idea is a product or service people will want. Have you already figured out what they will pay you for? The best way to start is to put your idea in front of them and listen to what they say as they discover your mocked-up service.

If you feel that you are not a UX designer, that’s OK. You don’t have to be. You just need to be able to put your idea into some kind of wireframe without design. It’s actually not that hard to create a mockup, whether it is for a website or for a mobile app. You will find a few great tools if you just Google “Wireframing Tools”. I personally use Keynotopia.com, a set of templates for PowerPoint and Keynote. Their homepage says:

“… testing app ideas in 30 minutes or less.”

And… it’s true. If you know how to use PowerPoint, you can create a mockup. I’m not affiliated with them and gain nothing if you buy their product. I’m just a happy customer and I use them all the time. They are always my first port of call when I have an idea (well, after the mind map, but for that I just use good old pen and paper).

Once you have your mockup ready, you either have to build your site or app yourself, pay people in cash or equity to build it, or find someone who can finance the idea in exchange for equity in the company that would be incorporated around it.

It gets better

If you are going to need investors, good news: your mockups are your best friends again. Show them your idea! Either they will want the app/service or they will not. The rest is financial detail for them. These details are important of course, but now you have the interest of an investor. What’s important is that you can go from having an idea all the way to raising seed funding with just mockups and a clear monetisation strategy. All it took was iterating on mockups following user interviews.

That’s how you start an internet business. By creating a mockup of the idea.

Once you have the funds, you need to recruit developers and designers and manage them. Sites like 99designs.com, odesk.com and Freelancer.com are your best bet.

It’s mockup time Baby!

Once again, the first thing they will want to see is what your idea is. Well… How convenient! You happen to have mockups available so they can instantly understand your idea and transform it into reality.

Source: Keynotopia.com

Once the product is developed you will need to become an online marketing wizard, and hope people will talk about it and refer new users every day. But that’s not how you start. That’s how you grow, and the topic of another blog post.

You can also apply to an accelerator like StartupYard. Not only will you receive funding to start developing your first version, but you will also be exposed to hundreds of professionals and specialists who will become your mentors during the three-month program and who will help you develop your idea and reach even more people. And guess what? One of the first things you will do at StartupYard is create a mockup and show it to people, refining your idea again and again until it’s perfect.


Make your Pitch “Real” From Day One

 This is one in a series of posts about the skills, tools and prep work Founders need for success in an accelerator. 
 

What a Pitch Really Is

Raising money is a deeply complex issue for a startup. We won’t reinvent the wheel here and now and tell you whether you’re even ready to try doing it. But we will talk about your “pitch.”

Founders often enter the pitch under the false assumption that investors are looking for someone like them. Someone who feels like a peer, whose job it is to convince them. Paul Graham of Y Combinator wrote about this recently:

“When people hurt themselves lifting heavy things, it’s usually because they try to lift with their back. The right way to lift heavy things is to let your legs do the work. Inexperienced founders make the same mistake when trying to convince investors. They try to convince with their pitch. Most would be better off if they let their startup do the work—if they started by understanding why their startup is worth investing in, then simply explained this well to investors.” – Paul Graham (Full Article here)

But your pitch is more than just a magical set of keywords that unlocks a golden elevator, filled with swimsuit models holding champagne flutes and suitcases full of money. The secret handshake theory of business is only attractive to those who aren’t sure who their customers are: investors or actual potential clients.

No, your pitch might be closer to your “identity” as an early-stage startup. Your pitch is not just your idea. That bears repeating. Your pitch is not just your idea: it is a demonstration of why you are a good bet. It’s also the first and consequently most important way that people will know you. VCs, angels, accelerators and even potential employees know you by your pitch. And avoiding the biggest mistakes can be as important as making the best-sounding pitch.

A Pitch is Creating a New Reality

One in which your product is real, and one in which it is something that customers need, and will pay for. This is why your pitch is not your idea. Your idea is plastic, and can change, but the reality you are pitching has to be real. Your product solves real problems.

And your pitch starts from day one. You should come up with a pitch that makes sense before going any further, because if you can’t sell your product, there may be little point in building it. This ice-cream shotgun is still genius, by the way, but the pitch didn’t work out, so I’m waiting for the market to present a need before investing.

Ok... maybe not.

What’s in the Pitch

This is a “positioning template” first suggested by Geoffrey Moore in Crossing the Chasm, a modern day “bible” for technology marketing. See if the pitch you have at the top of your head addresses each of these points in a meaningful way:

For (target customers)
Who (have the following problem)
Our product is a (describe the product or solution)
That provides (cite the breakthrough capability)
Unlike (reference competition),
Our product/solution (describe the key point of competitive differentiation)

In this post we’re primarily concerned with the first half of this template.

Create the Problem

All products and innovations address problems, and this is the For, the Who and the Unlike of your pitch: a person, company or institution with an issue, need, lack or goal.

When entering your pitch, you should have in mind a typical customer who has a common enough problem. This works for anything; you just need to be creative. You are not inventing the need, but you are formulating its basis.

Nobody needed an electric typewriter in 1924 when IBM obtained patents that would later be used in its first commercial models. But there were key deficiencies in the design of manual typewriters that caused common, known problems. Problems that could be solved. IBM identified those deficiencies and attempted to eliminate them.

There was no work not being done because of these deficiencies; nobody was sitting around waiting for the automatic typewriter, but companies and individuals still invested in the new technology, not because they were aware of how automatic typewriters would revolutionize business, but because it solved problems they knew they already had.

Here’s a pitch for the electric typewriter, in this frame (freely invented by me):

There are over 50 million typists, secretaries, students and amateur writers in America all grappling with the same issues. Current typewriters on the market have frequent jams, rust easily, cause pain in the fingers due to the difficulty of depressing keys, and create type which is often uneven and illegible. Our new automatic typewriter solves all of those problems, using revolutionary new technology that prevents jamming, ensures even spacing, and is easy on the fingers. It produces clear, legible, even type at a totally unprecedented speed, allowing typists to work more productively, more quickly, and more happily. Better yet, it is cheaper to manufacture than a manual typewriter, accepting universally interchangeable parts.

You’ll find every element of the above template present. Who the product is for (and size of the market), what their problems are, what the competition offers, what our product is, and how it solves all of those problems, along with a litany of killer features, and even a case for profitability. It presents the investor with a world that is broken (typing sucks and it’s expensive), and then presents the solution (typing made easier and cheaper).

That is the sort of pitch that grew IBM’s revenue by a factor of 20 in 20 years, and its profits by a factor of 7 in the same time period. All based on solving basic deficiencies in its marketplace.

Not all products are as glamorous to you and me as the automatic typewriter. But think about the executives who funded its development in 1925. They didn’t touch typewriters. They had secretaries who did that, and they dictated letters or scratched notes on paper. Typewriters were manual labor, and beneath their pay grades.

It was a dark and stormy paper jam.

Give the Solution

These people had to be convinced that a problem existed, and that others, office managers, schools, and institutions of government, would buy the solution, before they invested in buying the patents and funding its development as a commercial product. The pitch provides the investor with a reason why the product is needed, the evidence that there is a market for it, the evidence that the market will accept it, and the evidence that this will be a profitable venture.

And this kind of pitch can be given in 30 seconds, or in 30 minutes. If it makes the problem and the solution real, it can win an investment.

So however tedious the problems that you’re solving are, if you believe there’s a market for the solutions you offer, you have to make those problems real to investors. Presenting a killer solution, even when the status quo still works, is the key to making the problem real for an investor. Make that investor see the current state of affairs as a net loss, instead of a zero sum.
