Calling all European Coders: What Could You Build with this Web Crawler Hadoop Database?

Last week we announced that Seznam.cz was opening part of its search technology by providing a cluster of data. Today, we are happy to give you more details.

Seznam.cz’s full-text search technology is based on Hadoop and HBase. The teams will have access to a test cluster of up to 100 million documents from the Internet, all of them pre-crawled and sorted into entities such as domains, webservers and URLs. Each of these entities has its own attributes for fast analysis and sorting of each web page in the cluster.

More specifically, the three entities are:

  • Domains – these mirror the DNS name structure and are organized as a tree. The root entity is the special domain “.”.
  • Webservers – a “webserver” is a specialization of a “domain” (webserver = domain + port). Webservers gather URL statistics and other attributes that relate to a webserver as a whole (for example, the content of robots.txt applies to the whole webserver).
  • URLs – a URL represents a document on a webserver. A URL always belongs to a webserver. It contains all attributes relevant to a single web page.

Each entity has a key. The key looks like a modified URL: the hostname parts are in reverse order, and the rest of the URL is lowercased and cleaned up. The entity type can be recognized from the key value. For example:

  • URL: http://www.montkovo.cz/Cenik/?utm_source=azet.sk&utm_medium=kampan11
  • URL-key: cz.montkovo.!80/cenik
  • webserver-key: cz.montkovo.!80
  • domain-key: cz.montkovo.
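
The key derivation can be sketched in a few lines of Python. This is only an illustration that reproduces the single example above; the exact clean-up rules are not spelled out in this post, so dropping a leading “www”, the query string and the trailing slash are assumptions, and the function names `entity_keys` and `entity_type` are hypothetical.

```python
from urllib.parse import urlsplit

# Default ports used when the URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def entity_keys(url: str) -> dict:
    """Derive domain, webserver and URL keys from a URL (illustrative only)."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    # Assumption: a leading "www" label is dropped, as in the example.
    if labels and labels[0] == "www":
        labels = labels[1:]
    # Hostname parts in reverse order, terminated by a dot.
    domain_key = ".".join(reversed(labels)) + "."
    port = parts.port or DEFAULT_PORTS.get(parts.scheme, 80)
    webserver_key = f"{domain_key}!{port}"
    # Assumption: "cleaned up" means lowercased path, no trailing slash,
    # no query string. The real rules may differ.
    path = parts.path.lower().rstrip("/")
    return {
        "domain": domain_key,
        "webserver": webserver_key,
        "url": webserver_key + path,
    }


def entity_type(key: str) -> str:
    """Guess the entity type from the shape of the key."""
    if "!" not in key:
        return "domain"
    _, _, rest = key.partition("!")
    return "url" if "/" in rest else "webserver"


keys = entity_keys(
    "http://www.montkovo.cz/Cenik/?utm_source=azet.sk&utm_medium=kampan11"
)
print(keys["url"], entity_type(keys["url"]))
```

Note that this sketch does not distinguish a webserver key from the key of that webserver’s homepage URL; how the real database does that is not described here.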

The whole database is sorted by key in ascending order, so all URLs on the same webserver are co-located and can be processed one after another.
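
Because of this ordering, a single pass over the keys is enough to process one webserver at a time. Here is a minimal Python sketch, assuming the URL keys arrive as an already sorted stream of strings in the format shown above. The helper names and the sample keys other than the montkovo.cz one are made up for illustration, and the real cluster would be read through Hadoop/HBase rather than from a Python list.

```python
from itertools import groupby


def webserver_of(url_key: str) -> str:
    """Return the webserver part of a URL key: everything up to the path."""
    head, _, rest = url_key.partition("!")
    port = rest.split("/", 1)[0]
    return f"{head}!{port}"


def count_urls_per_webserver(sorted_url_keys):
    """Yield (webserver_key, url_count) pairs in one pass over sorted keys.

    groupby only merges consecutive items, which is exactly what the
    sorted, co-located layout provides.
    """
    for webserver, group in groupby(sorted_url_keys, key=webserver_of):
        yield webserver, sum(1 for _ in group)


# Hypothetical sample keys, already in ascending order.
sample = [
    "cz.montkovo.!80/cenik",
    "cz.montkovo.!80/kontakt",
    "cz.seznam.!80/napoveda",
]
for webserver, count in count_urls_per_webserver(sample):
    print(webserver, count)
```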

Here is a list of common attributes for each entity:

Domain entity

  • Key
  • IP address of the domain (if one exists)
  • Number of direct sub-domains
  • Number of all sub-domains
  • Number of all webservers in all sub-domains
  • Number of all known URLs (URLs in all sub-domains). We call this URL state “key-only”.
  • Number of all downloaded URLs. State “content”.
  • Number of all processed URLs (i.e. parsed and extracted basic features). State “derivative”.
  • Number of redirects
  • Number of errors (i.e. URLs with downloading or processing error)
  • Average document download latency

Webserver entity

  • Key
  • Webserver homepage (key to that URL)
  • Content of Robots.txt (robot exclusion protocol) relevant to our crawler
  • Number of all known URLs (state key-only) related to this webserver.
  • Number of all downloaded URLs (state content) related to this webserver.
  • Number of all processed URLs (state derivative) related to this webserver.
  • Number of redirects
  • Number of errors
  • Average document download latency

URL entity

  • Key
  • URL as seen on the web
  • Last download date
  • Last HTTP status
  • Type of the URL – one of several (not downloaded, web page, redirect, error, …). Note that the type of a URL is not the same as its HTTP status. For example, the HTTP status can be 200 OK while the URL type is redirect, because we detected a software redirect within the page content.
  • Attributes specific to the different URL types:
    • Not downloaded page
      • We have no explicit information about this page. Only factors that could be predicted (for example document language) and off-page signals (like pagerank) are available.
      • Prediction of document language
      • Prediction of explicit content (porn)
      • Pagerank – classic PR value calculated from link graph
      • Link distance from webserver homepage
      • List of backward links, each containing:
        • Key of the source page
        • Anchor texts relevant to this link
        • HTML title of the source page
        • Pagerank of the source page
    • Web page (i.e. downloaded page with regular content)
      • Alternative URLs for the page – each page can be reachable under several different URLs. This is a scored list of those alternatives.
      • Detected document’s Content-Type
      • Downloaded content
      • Content version – date/time of content download. Can differ from the last download date (for example, after a 304 Not Modified response)
      • Major language – language identified as “most relevant” for this page – could be different from most frequent language on page (different lang for body text vs. menus)
      • Homepage – flag if this page is webserver’s homepage
      • Pagerank – classic pagerank value
      • Link distance of this page from webserver’s homepage
      • Derivative (attributes obtained by further processing):
        • Document charset
        • Detected languages on page with their frequencies
        • Explicit content flag – detected porn
        • Document title
        • Document <meta description …>
        • Document content parsed down to a DOM tree
        • Forward links found on the page
      • List of backward links. Each one has:
        • Key of the source document
        • Anchor texts (extracted from source document) relevant to this link
        • HTML title of the source page
        • Pagerank of the source page
    • Redirect
      • Target URL key
      • Homepage – flag that this redirect is part of a redirect chain to a webserver’s homepage
    • Error
      • The same info as for “not downloaded page”
      • We could provide more if needed, for example the date of the last download when the page was OK.
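
To make the distinction between URL type and HTTP status concrete, here is a small Python sketch of how a URL record might be modeled on the client side. All class and field names are hypothetical and chosen only for illustration; they are not the actual HBase schema, and the enum covers only the four types named in the list above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class UrlType(Enum):
    # Only the four types named above; the real list may be longer.
    NOT_DOWNLOADED = "not downloaded"
    WEB_PAGE = "web page"
    REDIRECT = "redirect"
    ERROR = "error"


@dataclass
class Backlink:
    source_key: str
    anchor_texts: List[str]
    source_title: str
    source_pagerank: float


@dataclass
class UrlRecord:
    key: str
    url: str
    url_type: UrlType
    last_http_status: Optional[int] = None
    redirect_target_key: Optional[str] = None
    backlinks: List[Backlink] = field(default_factory=list)


def is_software_redirect(record: UrlRecord) -> bool:
    """True when the server answered 200 OK but the page was typed as a redirect."""
    return record.url_type is UrlType.REDIRECT and record.last_http_status == 200


# Hypothetical record: HTTP 200, yet classified as a redirect.
record = UrlRecord(
    key="cz.montkovo.!80/cenik",
    url="http://www.montkovo.cz/Cenik/",
    url_type=UrlType.REDIRECT,
    last_http_status=200,
    redirect_target_key="cz.montkovo.!80/ceny",
)
print(is_software_redirect(record))
```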

With all this data at your disposal, what could you build? The cluster will be updated and new entries can be added as per team requests. We are looking for the best ideas in the area of Data, Search and Analytics.

Wherever you are in Europe, we will pay for your flight ticket and your accommodations for 3 months in Prague so that you can participate in our accelerator program. Why don’t you start your application now?


If you have any questions about the database, post them as a comment below.

Focus on Copywriting: Sell without Selling

This is part of our series exploring the skills, resources and experience Founders need when entering and working in an accelerator.

While most startups can’t afford an in-house copywriter, most companies also can’t afford not to have someone focus on copywriting, at least some of the time.

Why you Should Focus on Copywriting

“Language ties together the worlds of reality and possibility”

Last year, Jason Cohen wrote about developing his “story” over the many years of selling Smart Bear, a successful code-review tool. I’ll quote part of the text here (with permission), and encourage you to read it in its entirety.

At first when someone asked what my company’s tool suite was, I would say:

“Smart Bear makes data-mining tools for version control systems”.

It’s a description so esoteric that, although accurate, not even a hardcore geek would have any idea what it is, much less why it’s useful. Years later, when it was clear that code review software became our sole focus, I got better at describing it:

You know how Word has “track changes” where you can make modifications and comments and show them to someone else? We do that for software developers, integrating with their tools instead of Word and working within their standard practices.

Better, yes, and for a while I thought I nailed it, but still no press. Eventually (thanks to helpful journalists) I realized I was still just describing what it is rather than why anyone cares. I left it up to the reader to figure out why they should get excited.

Eventually I developed stories like the following, each tuned to a certain category of listener. Here’s the one for the journalists:

It’s always fun to tell a journalist like you that we enable software developers to review each other’s code because your reaction is always: “Wait a minute, you’re seriously telling me they don’t do this already?” The idea of editing and review is so embedded in your industry you can’t imagine life without it, and you’re right! You know better than anyone how another set of eyeballs finds important problems.

Of course two heads are better than one, but developers traditionally work in isolation, mainly because there’s a dearth of tools which help teams bridge the social gap of an ocean, integrate with incumbent tools, and are lightweight enough to still be fun and relevant.

That’s what we do: Bring the benefits of peer review to software development.

Now the reason for excitement is clear: We’re transforming how software is created, applying the age old techniques of peer review to an industry that needs it but where it’s traditionally too hard to do.

– See more at Cohen’s own Smart Bear Blog

Part of the conundrum of good copywriting is that it is virtually impossible to test. A homepage layout can be dissected into precise quanta of effectiveness: how many visitors, how many clicks, how many page views. The flow of traffic is orderly and can be controlled. You can A/B test AdWords ads and landing pages and see “what works,” but you’ll only ever be finding out what doesn’t fail as much, not how well you could be doing. There’s no A/B test for a truly novel approach, one that builds momentum for your site and your products, because two novel approaches will not be binary in nature. They will not be comparable at all.

Copy doesn’t work like code, but a lot of web entrepreneurs assume it does: if the copy “doesn’t work,” it must be the fault of the copy, not of what the copy supports (i.e., the product or the company).

Build Your “Story,” And Your Voice

Copywriting, done well, can increase a site’s conversion rate enormously. It can entice new customers and woo old ones to stick around. But most early startups stick with their old copywriting for too long.

You can’t test copy like you can a layout or a button or a piece of code: it’s too complex. There are too many emotions, too many subtle cultural cues, and too many ways in which people read, all of them different.* Even more, it’s reactive: your copy has to evolve with time, coming to acknowledge your existing customers and community, and what your products mean to them, along with attracting new customers. Jason Cohen’s “story,” as it evolved above, changed to acknowledge whom he was talking to about his products, and what they needed to really understand about them. He went from a programmer with an idea to a trustworthy person with a solid background in helping people with his products. And his story showed that.

                        ________________________________________________

*Our CEO Cedric Maloux disagrees with me on this point. 

Cedric: I used to have 4 different homepages. All similar, except for one headline. And I was measuring which one was leading to more sales in real time. The software would show one or the other and measure the reactions. This is A/B testing at its best: you can test a headline, but you can’t test all the copy.
 
Cedric makes a fair point here. Headlines, tag lines and slogans often work more like static features of a website than the rest of the copy does. Because they don’t take on all of the same responsibilities as normal marketing copy, you can and should treat them as testable. These are the elements of your copy that stand up best to focus-grouping and testing, because their purposes are narrower: namely, to attract clicks and push a visitor to go further.

                        _______________________________________________

This is not to say that you should never speak in technical jargon, but that you should always know whom you’re talking to, how much they know, and how hard they’re listening. Your copy needs to evolve to reflect the culture of your company and your customers.

But in the data-driven world of online marketing, these organic, real, contextually rich evolutions are rarely allowed to happen. It’s rare in this world to see something closer to the corporate ad-agency driven model, in which a creative and an account executive sell campaigns to a client, who then uses the creative output to tell a new story. More often it’s the case that the better is the enemy of the good: that founders and CEOs are unwilling to try anything that smacks of the entirely new, because it can’t be reliably tested, and requires a great deal of faith. Even though truly original great ideas have, necessarily, never been tried before.

In the current startup ecosystem, ambitions for zero-cost growth have become dangerously intertwined with risk aversion: companies shrink from the prospect of *losing* small levels of growth in a gambit for gaining more. And not a small number of companies have played the same tune for too long, failing to pivot their messaging until their revenue has shrunk so far that it is too late, and changes will only appear desperate and cynical (which they will be).

This is a shame in some respects, as the quality of language on a website is just as important in conveying impressions of honesty, competence, and skill as a quality design is. Perhaps even more so, as web design becomes increasingly automated and pre-packaged. Copy cannot be automated or prepackaged. It always has to be unique. Language ties together the worlds of reality and possibility. It is the medium in which you make your ideas real for your customers: in which you construct the reality of your products, and a future world in which your customers use them. That’s a vitally important thing to focus on.


Write Honestly: Sell Without Selling

I once had a great sales manager who taught me what he called the “7 Things” that you have to keep in mind when you talk to a customer. It was based on a simple principle: when you talk to customers, you are always selling something.

Before we go all Glengarry Glen Ross here, this is not the same as the old adage: “Always be Closing.”

The important thing is to remember what your relationship to a customer is, and to be very honest about that fact. You will never sell someone something they don’t want to buy. And even if you do manage it once, they will never buy twice, so you shouldn’t sugar coat or lie about your products, ever. You don’t need to. Just follow these “7 Things.”

Trust: In you and the product. Let the customer do what they would normally do.

Understanding: Be as simple and clear as possible. The customer is not smarter than you.

Emotions: Use humor, use evocative words, show love and caring. Show passion.

What to do: Buy, sign up, share…

When to do it: Now?

What I get out of it: Speak about effects of the product, not the features.

When it will happen: Examples, case studies, quotes, and testimonials

The list is a simple one to follow, and you should look for every point to be covered in some way in your communications with customers (e.g., on your homepage, landing pages, email contacts, and other sales material).

Most important of the above is trust. My sales manager would say this: “If I asked you to show me 4 fingers, what would you do?” I held up 4 fingers on one hand. He said, “Exactly. Now, if I held up 2 fingers on each hand, you would think I was being a smartass.” This is to say, that trust is established by doing what the customer expects, and by showing that you understand the customer well. You have thought this through, and you understand what the customer needs.

Then you can access emotions. Emotions can be descriptive words, or appeals to imagination. But emotions must be appealed to after trust and understanding are established. Customers are looking for an emotional connection to anything they buy. If they feel they’re dealing with a real person, who wants and cares about their business, then they will be more than ready to come back and buy again.

What the Customer Gets Out of It

Surprisingly, this is an element of a lot of online copywriting that gets completely lost. Companies don’t talk about what their products mean to people. They just talk about what their products do and are. That’s a major problem.

While you might describe your product idea as: “A non-SQL back-end solution for tracking PPC traffic ROI,” I might describe that same idea as: “A tool that helps online businesses figure out whether their web ad dollars are being spent wisely.” While my version tells you practically nothing important about how the product works, it does tell you what the product does. It speaks about effects rather than features of the product. It focuses on what is important to a client, an investor, or a customer: what the product accomplishes.

This is especially important for non-B2B products, but it applies any time the client is significantly different, as an entity, from your company. And even if you’re an IT company selling IT resources to be used by IT people, the person actually in charge of buying those products or services is probably not the one who will be using them.


Your copywriter has to concern him or herself with these distinctions: how product copywriting (such as product instructions, help menus, drop downs, and user messages), and marketing copywriting, such as homepages, campaign pages, and marketing communications, are fundamentally different, and meant often for fundamentally different people. I can’t tell you how many times I’ve seen the same mistakes made: websites for complex and expensive products that use a product copywriting style, right on the homepage. You might as well add a light-box on the homepage that pops up and says: “If you haven’t already committed to buying this product, don’t bother going forward.” The person who is viewing your homepage may not be a customer, and treating them like a customer (with product copywriting), is often a big mistake.

Because before a person is your customer, you need to establish trust. And that means giving that person a way of understanding who you are, and what you do, and of liking you. If you haven’t done that, then the customer is taking a risk in buying from you. And in most sales, you’ll lose that customer.

If it’s work for the buyer to figure out what your product is, then it’s going to be nearly impossible to sell to them. And unless you’re in the enviable position of a product company that has its client-base beating down its door to buy the latest release, then you need to think about this. A lot of the time, the person your product is meant for, and the person your product will be used by, are two totally different people. And you need to assume the worst.


Why I Agreed to be CEO of StartupYard

I did not hesitate long. But I did hesitate.

The Challenge

An accelerator’s success depends first and foremost on the potential of the companies it helps to grow. We’ll either find amazing teams or we will not. The main parameter here is that our application forms are opening for 6 weeks starting…. Today. This means I had 3 weeks to get settled into this position, and will have 6 weeks, starting now, to recruit the best candidates for the accelerator. One thing’s for sure, if we find those hands-on entrepreneurs who combine business sense with uniquely great ideas, they will gain some fantastic knowledge during our 3 Months Acceleration Program. Still, 9 weeks (including 2 weeks of holidays), is an insanely short period of time in which to accomplish anything like this.

The Plan

However, when the Board of StartupYard told me that Seznam.cz was opening part of its proprietary search technology for the future teams, they piqued my interest even more. Suddenly we have one of the only companies in the world that is still #1 in search in its home country against Google, and they’re going to let founders build products on top of processed web data that they will collect and prepare. By providing this level of abstraction, new connections and services can be imagined within these data. It’s all down to the creativity of the founders to come up with some kick-ass business ideas. Who could say no?

The second thing that interested me was that StartupYard had decided to become a specialized accelerator. From now on, every new batch of companies will belong to one vertical segment of the IT industry. This year, thanks to Seznam.cz, it will be Data (with Search and Analytics underlying Data). That’s why the data sandbox they will make available is so interesting if you are working in these fields. Future rounds will include mobile games, payments, etc…

The Name of the Game: Data


By specializing, we aim to bring together European teams all working on similar Data problems. Our mentors work or have worked on Data projects in companies like GoodData, Google, Yahoo, IBM, Ericsson or Seznam. We see this as a tremendous opportunity for the selected founders to learn from talented specialists. By specializing, we also hope to foster cooperation between the teams. For this reason I wanted to make sure it will be an easy decision for anybody not living in Prague to join the program. Therefore, StartupYard will offer, for the first time:

  • 3 months of free accommodations in Prague
  • Free lunch

This is important to me. It should not cost founders money to join an accelerator. Period. If founders have cash, they should use it to sustain themselves while they develop their business, or invest it directly in their company. I moved to Prague to start a business 9 years ago and I have never regretted it. I want to make that decision an easy one for the next generation.

We’ll See You in Prague

I’ve been mentoring at StartupYard for 2 years now. Not all years were equal, but it is a great start-up in itself, and Prague is the best capital city in Europe. Bar none. I hope we will be able to attract some fantastic founders with brilliant ideas and give them all the knowledge and support they need to thrive. I know how exhilarating growth and success are, and how hard failure can be. I have experienced a lot working 17 years as a start-up founder and CEO. I will make sure they are ready for all of it. Applications are now open, and will stay open until January 31st, 2014. If you are working on anything related to Data, Search or Analytics, you should really check us out! We look forward to seeing you in Prague this spring.

About Cedric Maloux


Cedric Maloux, originally of Paris, has been in the startup world for nearly 20 years. He sold his first company in 2000, and has raised capital from the top VCs in Europe. He served as CEO of Geewa, a struggling Czech gaming company, and turned it into one of the top 10 developers for Facebook, with Pool Live Tour. He’s an avid poker player and recently launched an app for poker bankroll management. He also created Pressly.ai, a tool to instantly create professional press releases online. He has lived in Prague for nearly a decade. Cedric has been a mentor at StartupYard for over two years, and was invited by the Board of Directors this past fall to take the reins as CEO.

Seznam Opens Part of Its Full Text Search Technology for StartupYard 2014

Today is a great day for the future startups of StartupYard 2014!

Seznam.cz has decided to open part of its full text search technology to the teams that apply for the 2014 round, to help them with their large data projects that need extracted data from the Internet.

Seznam.cz started as a one-man band and over the past 17 years has become a major, influential Czech technology company and media house in one, and it has kept Google from gaining a majority or monopoly in the Czech market. Seznam.cz’s full-text search technology is based on Hadoop and HBase. The 2014 StartupYard teams will have access to a test cluster of up to 100 million documents from the Internet, all of them pre-crawled and sorted into entities such as domains, webservers and URLs. Each of these entities has its own attributes for fast analysis and sorting of each web page in the cluster.


“We have made a basic analysis for each webpage in the cluster so the teams know its content as a derivative with many parameters such as the language used and meta-descriptions. All of the documents in the cluster are regularly updated and more parameters and content can be added if the teams need and request them,”

Marek Nový, Head of Business Development at Seznam.cz

Good news comes in pairs. Today StartupYard is accepting applications for our April 2014 round. This round will focus exclusively on teams working in Data, Search and Analytics.

Teams will receive free accommodations in Prague, free lunch and one paid return flight from anywhere in Europe.

We will be setting our sights on the best teams in these areas from all over Europe, selecting only 6 of 10 finalists for our 3-month, English-language-only Acceleration Program. The comprehensive program will cover all aspects of creating and growing a business, from legal and accounting, to hiring, code review and company culture, with access to 90 specialised mentors, media trainers, a professional native English copywriter and blogger, and perks worth €250,000.

Prague attracts many data companies, so we felt it would be to the benefit of our upcoming class if we specialised. We have a very clear objective: to attract the most ambitious projects from all across Europe in the fields of Data, Search or Analytics. To achieve that, we have put together a package we hope no entrepreneur in Europe could refuse.

Applications close at the end of January. Please help us spread the news by sharing this article!


Welcome 6 New Teams for 2013 Spring Program

It’s been a long selection process: 6 weeks. But we’re now proud to announce 6 amazing teams. We accepted applications from 12 countries and selected 6 startups from 5 countries. The third cohort begins its 6-month acceleration program on March 11 in Prague.

Who are our StartupYard Spring 2013 heroes?

Hlidacky.cz
We help parents meet the most reliable babysitters. Hlidacky.cz operates in the Czech Republic now.
Co-founders: David Hrachovy, Petr Sigut and Vaclav Kuna (Czech Republic)
Web: www.hlidacky.cz

HowDoI Tutorials
Interactive tool for website owners to create tutorials and guides in a few minutes.
Co-founders: Lukas Haraga, Jiri Otahal and Michal Pustka (Czech Republic)
Web: www.howdoitutorials.com

StartitUp
Step-By-Step Startup Guide with action items that will help entrepreneurs get their startup from idea, to product, to traction, and to funding.
Co-founders: Edward Liu and Yitao Sun (USA)
Web: www.startitup.co

Travelatus
Event travel made easy. It helps visitors of concerts, festivals and conferences to plan their trips.
Co-founders: Valentin Dombrovsky, Vitaliy Korobkin and Denis Volkov (Russia)
Web: www.travelatus.com

works.io
The essential career tool for professional fine artists: keep portfolios and CVs up-to-date with ease.
Co-founders: Abe Han and Patrick Urwyler (Canada, Switzerland)
Web: www.works.io

Yummy Food Delivery
How to eat healthy at your desk. Its focus is B2B delivery.
Co-founders: Kristina Sediva and Tomas Netrval (Czech Republic)
Facebook page: Yummy Food Delivery