Since you work at Blekko, there are more points to discuss that were not addressed in the article.
For example:
i) Are search engines web scrapers?
ii) Should search engines pay the scraped sites if they are charging to access their indexed data? Some of the scraped sites probably have a specific license forbidding the search engine from selling their information in any way.
iii) Regarding Internet policies, is it fair/unfair that a site has a robots.txt configuration that avoids being indexed by any search engine other than Google? I would call this "search neutrality".
i) no, search engines organize the web. They are essentially 'pre-paid' computation.
Take the classic example, a search for 'bilbo baggins'. That is a request to identify documents on the web that refer to Bilbo Baggins and to return their locations.
1) It is absolutely true that you could sit down at your computer and look at each site, from aol to zillow, read all their pages, and note the ones that mention Bilbo. Then you could go back and order the list from the sites with more of the information you were looking for to those with less. Along the way you would find some sites that would not open up to you unless you had an account; those you could not visit.
2) A search engine can look at all the sites, note which ones mention Bilbo along with a bunch of other terms, and essentially "pre-compute" the list you were looking for. Along the way it will find sites that, through their robots.txt file, say "We'd rather you not look here," and it will respect that by not indexing those sites.
In both cases, figuring out which web pages have information about Bilbo on them is creating 'new' information out of existing data. You can do it on your own and it will cost you time, or you can do it with a few thousand machines and it will cost you money. Either way you get a list of possible sites.
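That "pre-computation" is, at its heart, an inverted index: read every page once up front, then answer any query with a cheap lookup. Here's a toy sketch of the idea (the page URLs and text are made up for illustration; real engines add ranking, scale, and much more):

```python
# A toy inverted index: pre-compute, once, which pages mention which terms,
# so that a later query is a cheap lookup instead of a fresh crawl of every site.
pages = {
    "bobs-atlas.example/middle-earth": "an atlas of middle earth and the shire",
    "wiki.example/bilbo-baggins": "bilbo baggins is a hobbit of the shire",
    "zillow.example/homes": "homes for sale near you",
}

# Build the index: term -> set of pages containing that term.
index = {}
for url, text in pages.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(url)

def search(*terms):
    """Return the pages that contain all of the given terms."""
    results = [index.get(t, set()) for t in terms]
    return set.intersection(*results) if results else set()

print(search("bilbo", "baggins"))
```

The crawl (building `index`) is the expensive part you pay for once; every query after that is just set intersection.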
That list forms a distribution, where there are a lot of sites that don't care one way or the other if you read them, sites that won't show up because they asked to be excluded, and sites that will show up because they paid to be included. Some sites really want you to find them, some sites really don't. A good search engine caters to both types.
ii) If a search engine were taking the page, copying it, and then showing that instead of the page itself (this is what got Google's news product in trouble), then it's pretty clear that it should not do that. But in terms of location information? The sites themselves derive a huge economic benefit from being in the index that isn't reflected at all back to the search engine that sent traffic there [1], so on a pure economic basis the search engine is on the losing side of that transaction. However, the marginal cost of additional transactions is small (search engines are general purpose), so they make a small amount on large volumes.
To put your question in more specific terms, where is the economic value in a list like this: Bob's Middle Earth atlas, the Wikipedia entry on Bilbo Baggins, a Middle Earth web ring, the IMDb pages on characters in the "Lord of the Rings" movies?
Is it that Bob has an atlas of Middle Earth? Or is it the list itself? Who made the list, Bob or the search engine? (Or some human curator of a bookmarks page [2]?)
iii) It is completely up to the site to allow or disallow access to its content by search engines. Some sites only allow themselves to be indexed by Google, and they find they get less search-engine-directed traffic that way. Some sites don't allow anyone to index them and get no traffic (sometimes they are surprised by this, sometimes they don't care, and sometimes they are angry that the only way for people to find them is to be in a search engine index).
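For the curious, a robots.txt that admits only Google's crawler looks roughly like this (an illustrative minimal file; real ones usually carry more rules):

```
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```

An empty Disallow in the Googlebot block means "crawl everything"; the wildcard block turns every other crawler away from the whole site.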
[1] Google broke that by creating AdSense for Content, which set up a pretty interesting conflict of interest for them.
[2] Good luck finding a bookmarks page these days :-)
> ii) Should search engines pay the scraped sites if they are charging to access their indexed data? Some of the scraped sites probably have a specific license forbidding the search engine from selling their information in any way.
Any reputable search engine will respect robots.txt.
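For what it's worth, checking robots.txt is trivial; Python's standard library ships a parser for the robots exclusion protocol. A minimal sketch (the policy and bot name here are hypothetical; a real crawler would fetch the live file with `set_url()` and `read()`):

```python
# A well-behaved crawler consults robots.txt before fetching any page.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse a hypothetical policy directly from lines; in practice you would call
# rp.set_url("https://example.com/robots.txt") followed by rp.read().
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyBot", "https://example.com/public/page"))   # True
```

If `can_fetch` says no, a reputable crawler simply skips the URL.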