The Scraping Problem and Ethics (osvdb.org)
241 points by codezero on May 7, 2014 | hide | past | favorite | 126 comments


This is one of the more interesting policy questions on the web. Our search engine crawls a lot of blogs and whatnot on the web; criminals who want to find unpatched WordPress sites try to scrape our crawl by sending automated (scripted) queries to find them. We have developed a number of defenses over the years and pretty regularly ban them[1]. Here is the weird part though: if they hired 300 people on Mechanical Turk and paid them each a dollar to do one search, it probably would get them more information. Look at folks like 80legs or other 'distributed' scrapers. They exist almost solely to subvert these service terms. Are they evil? Creative?

One of the things that stands out is called out in this article. The people involved really want this information, so much that they are willing to expend time and effort to construct scraping bots and what have you. Why not just buy it? How is it that someone gets a request from their boss to get some information, but their boss expects them to get it for free? Can you imagine if they said, "We need pens, pencils, notebooks, staplers, the works for the office here. Oh, and you can't spend any money getting that stuff, just get it here." Would they construct some elaborate raid on a nearby office supply store using a mercenary army of criminals? Why do that with information?

We did an experiment where we would 'grep the web' for you, basically run a regex over a multi-billion page crawl, give you the first 50 results for "free" and you could buy the complete set. I think we sold exactly one of those.
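(For the curious, a pass like that is conceptually just a regex applied to every response body in a crawl. A minimal sketch, assuming the crawl is stored as WARC files readable with the warcio package; the glob path and the regex itself are made-up examples:)

    import glob
    import re

    from warcio.archiveiterator import ArchiveIterator  # pip install warcio

    PATTERN = re.compile(rb'wp-content/plugins/[\w-]+')  # example regex

    def grep_crawl(warc_glob, pattern, free_limit=50):
        """Yield (url, match) for crawled pages whose body matches the regex."""
        found = 0
        for path in glob.glob(warc_glob):
            with open(path, 'rb') as fh:
                for record in ArchiveIterator(fh):
                    if record.rec_type != 'response':
                        continue
                    m = pattern.search(record.content_stream().read())
                    if m:
                        yield record.rec_headers.get_header('WARC-Target-URI'), m.group(0)
                        found += 1
                        if found >= free_limit:  # the free tier stopped here
                            return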

It is a weird thing, the OP captured it perfectly.

[1] It's a violation of the terms of service.


I think there is a great meta-question in here, about business models for digital data and software.

Here you have a great case study, about an organization that tried to do a volunteer model, and it didn't work. Then they pivoted to a commercial model, but fundamentally they still believe in a free tier. But they have to cripple that free tier pretty thoroughly, and even still people abuse it.

I have a product I'm working on, that some people are apparently willing to spend lots of money on. Ideally I would have some kind of low tier, so that people without lots of money would be able to use it too. But I can't figure out a way to segment the product so that everybody pays what they can afford without bad apples abusing the low tier and ruining it for everybody. The result is that I may end up only selling it to customers with deep pockets, even though the product is much more broadly applicable.


They made the process of paying for the software a laborious pain in the ass. They are desperately trying to extract money from those who can pay, which sadly drives away those who can pay but don't want an involved process.

I have worked at lots of companies where I had a monthly budget of 10k+ that I could spend on whatever I wanted, but any sort of complex deal (anything that couldn't just go on the CC as a line item) meant bringing in legal and other groups -- which instantly killed any interest.

"Licensing is based on the data needed (e.g. all of it vs subset), how it is used (e.g. internal only, external, product integration), etc."

What a goddamn horror show. I simply want a product, I want to pay for it, and I want to use it. Turning on Dropbox for Business was a decision made in about 5 minutes... "You all like it, already using it, awesome! I will get the team set up." -- 5 minutes later I had given Dropbox $3800.

I really think they are getting in their own way for no benefit. They have created a very high barrier to EVEN HAVING A DISCUSSION about buying the product. So, if I don't know exactly how I will use it -- I can't purchase it. Stupidity.


Publishes data in public. Can't get people to pay for it. Blames people for theft. The real thieves are the ones separating people from their wallets over data that is publicly available, withholding it from those who won't pay.


Perhaps it won't come as a surprise but I've been toying with this question for a couple of decades now. Specifically, what are the economics of information? In the 'goods' economy there are some interesting mechanisms that inform the question of value; these include, but are not limited to, the cost to acquire materials, develop expertise in manufacturing, and manage the supply lines from raw material to finished good. Accountants will talk about the "cost of goods sold" as a grouping function for these costs. In the 'information' economy the manufacturing part is pretty trivial, you just replicate copies, but the assembling part can be quite difficult. This leads to an interesting inversion where it can cost a lot to assemble something and nothing to 'manufacture' it. And that doesn't even begin to touch on what it is about information that makes it valuable in the first place.

What is the difference in value between a CD with the latest release of Ubuntu burned on it, and the download? Between the download and a bootable flash drive?

There is a great experiment you can run which goes like this: At one end of an athletic field, place a chess board with a queen on one of the squares. At the other end of the field, have a table where people can get a quest. Offer to pay a person $5 if they will walk to the end of the field, note where the queen is, and come back and tell the quest giver. At the midpoint of the field, set up an information seller who offers to sell you the location of the queen for anywhere between 20 and 80% of the reward price.

This simple experiment lets you see all sorts of mechanisms in play that control information value. On the one hand you can see the range of values people apply to their own time (acquisition cost), and their willingness to retain value (do they then go back mid-queue at the sign-up table and start offering to sell the information for some fraction of the price to anyone?). At what threshold do people start trying to break the rules (a notion that is similar to price inelasticity but has a component like 'black market demand')?

Interesting questions to be sure.


Perfect price discrimination is hard.


> We did an experiment where we would 'grep the web' for you, basically run a regex over a multi-billion page crawl, give you the first 50 results for "free" and you could buy the complete set. I think we sold exactly one of those.

Sounds like you didn't advertise it right. I've been looking for a "grep the web" service for a while.

We're going to use https://builtwith.com/ to achieve that exact thing. It will cost me $295: https://builtwith.com/plans

For that $295 I will get a list of all domains using a rival technology... a list of sales leads.

I'm still short of what I want to have... a prioritised list of sales leads.

So I will write my own scraper to go through those results and scrape every one of those so that I can pull some info from the HTML page to tell me how large those customers might be. I'm not aiming for the largest (easy to find, costly to win), nor the smallest (time consuming and pointless to win), but the median.

I would definitely pay for a "grep the web" that allowed me to match pages by text signatures in the HTML, and then extract part of the DOM as values, and return the list of "url + extracted values" for the matching hits.
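A minimal sketch of that workflow, assuming requests and BeautifulSoup; the signature string and the selector are hypothetical placeholders:

    import csv

    import requests
    from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

    SIGNATURE = 'rival-widget.js'        # hypothetical text signature
    SELECTOR = 'meta[name="generator"]'  # hypothetical DOM value to extract

    def grep_pages(urls, out_path='hits.csv'):
        """Write url + extracted value for every page matching the signature."""
        with open(out_path, 'w', newline='') as fh:
            writer = csv.writer(fh)
            writer.writerow(['url', 'value'])
            for url in urls:
                try:
                    html = requests.get(url, timeout=10).text
                except requests.RequestException:
                    continue  # skip unreachable pages
                if SIGNATURE not in html:
                    continue  # no signature match, not a hit
                tag = BeautifulSoup(html, 'html.parser').select_one(SELECTOR)
                writer.writerow([url, tag.get('content', '') if tag else ''])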

I'd consider that to be worth similar amounts to what BuiltWith are charging, but I'd add more and would go up to $500, assuming that the results come with the extracted values as a CSV file of some kind and the quality and completeness of the report is high.

People will pay for "grep the web", especially if you sell it to them as something they know they really want: "sales leads".


Pot calling the kettle black.

Crawling others' websites then selling ads. <--- Blekko

Scraping others' websites then selling ads. <--- ????

Selling access to user-generated content. <--- OSVDB

Amusing to watch these folks argue about ethics.

Who owns the copyrights in this data? Surely not the one who is demanding that you pay a license fee. These "services" are middlemen, plain and simple.

This might be why McAfee was wondering about how much manual curation is done.

Maybe that is the only possible theory of how OSVDB could assert any rights in the data (and only in a select few jurisdictions).


If they're a middleman, and it's such an easy "service", why didn't McAfee just bypass them and get the data from the original sources, rather than do the wrong thing?


That's a valid question and one I have considered myself.

So what drives the folks at McAfee to do this?

Maybe it is the same thinking that drives programmers to not want to write code.

"Don't reinvent the wheel."

"Code reuse."

"Use a shared library or a scripting language with batteries included."

Why crawl the web when Google has already done it for you?

And so on.

Personally, I do not have trouble understanding why McAfee would do this.

What I have trouble understanding is why OSVDB would think they could crawl some public data and then charge a fee to access it.

It is the "sale" of "free" information that puzzles me.

By all means go ahead and try; you may well recoup your outlays for compiling the free data and even make a profit.

But should we really be surprised when someone does not want to pay?


Since you work at Blekko, there are more points to discuss that were not addressed in the article.

For example:

i) Are search engines web scrapers?

ii) Should search engines pay the scraped sites if they are charging to access their indexed data? Probably some of the scraped sites have a specific license forbidding the search engine from selling their information in any way.

iii) Regarding Internet policies, is it fair/unfair that a site has a robots.txt configuration to avoid being indexed by a search engine other than Google? I would call this "search neutrality".


i) No, search engines articulate the web. They are essentially 'pre-paid' computation.

Take the classic example, a search for 'bilbo baggins'. What that is, is a request to identify documents on the web that have referred to Bilbo Baggins and return their locations.

1) It is absolutely true that you could sit down at your computer and look at each site, from AOL to Zillow, read all their pages, and note the ones that mention Bilbo. Then you could go back and order the list from the ones that had more of the information you were looking for to the ones with less useful information. Along the way you would find some sites that would not open up to you unless you had an account; those you could not visit.

2) A search engine can look at all the sites, it can note which sites mention Bilbo along with a bunch of other terms and can essentially "pre-compute" that list you were looking for. Along the way it will find sites that, through their robots.txt file, will say "We'd rather you not look here." and it will respect that, thus not indexing those sites.

In both cases figuring out which web pages have information about Bilbo on them is creating 'new' information out of existing data. You can do it on your own and it will cost you time, or you can do it with a few thousand machines and it will cost you money. Either way you get a list of possible sites.
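(The robots.txt check in step 2 is routine plumbing; a sketch with Python's standard urllib.robotparser, using a made-up crawler name:)

    from urllib import robotparser

    rp = robotparser.RobotFileParser('https://example.com/robots.txt')
    rp.read()

    # Only fetch pages the site has not asked crawlers to skip.
    if rp.can_fetch('ExampleCrawler/1.0', 'https://example.com/some/page'):
        print('OK to fetch and index')
    else:
        print('Site asked to be excluded; skip it')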

That list forms a distribution, where there are a lot of sites that don't care one way or the other if you read them, sites that won't show up because they asked to be excluded, and sites that will show up because they paid to be included. Some sites really want you to find them, some sites really don't. A good search engine caters to both types.

ii) If a search engine was taking the page, copying it, and then showing that instead of showing the page (this is what got Google's news product in trouble) then it's pretty clear that they should not do that. But in terms of location information? The sites themselves derive a huge economic benefit from being in the index that isn't reflected at all back to the search engine that sent traffic there [1], so on a pure economic basis the search engine is on the losing side of that transaction. However, the marginal cost of additional transactions is small (search engines are general purpose) so they make a small amount on large volumes.

To put your question in more specifics, where is the economic value in the list: Bob's Middle Earth atlas, the Wikipedia entry on Bilbo Baggins, a Middle Earth web ring, the IMDb pages on characters in the "Lord of the Rings" movies?

Is it that Bob has an atlas of Middle Earth? Or is it the list itself? Who made the list? Bob or the search engine? (or some human curator of a bookmarks page[2])

iii) It is completely up to the site to allow or disallow access to its content by search engines. Some sites do only allow themselves to be indexed by Google, and they find they get less search-engine-directed traffic that way. Some sites don't allow anyone to index them and they get no traffic (sometimes they are surprised by this, sometimes they don't care, sometimes they are angry that the only way for people to find them is to be in a search engine index).

[1] Google broke that by creating AdSense for Content and created a pretty interesting conflict of interest for themselves.

[2] Good luck finding a bookmarks page these days :-)


> ii) Should search engines pay the scraped sites if they are charging to access their indexed data? Probably some of the scraped sites have a specific license forbidding the search engine from selling their information in any way.

Any reputable search engine will respect robots.txt.


I'm assuming that the vast majority of the time, it's a case of an individual or small group that wants the information, not a real corporation. Usually they have no intention of paying for anything or playing by the rules.

I have absolutely no explanation regarding McAfee though, considering the billions McAfee and its parent company, Intel, make in revenue yearly.


Could this be a company culture thing at McAfee?


I sure hope not.


Looks like 80legs, which you mention, rebranded themselves as "Datafiniti" at some point recently.

http://blog.datafiniti.net/?p=230


which in Italian literally means 'end of data'... kind of ironic, isn't it?


> Look at folks like 80legs or other 'distributed' scrapers. They exist almost solely to subvert these service terms.

Not sure I understand. Is there a whole ecosystem of web businesses that feed off your search engine for free, or did you mean from several legit search engines, or...?


This is exactly the argument MPAA makes against people torrenting movies and tv shows, using words such as "criminals", "evil", in an attempt to polarize people who are simply consuming information over the internet in the most efficient way possible, without any 3rd party oversight and censorship, the way information on the internet is always bound to be, the way equilibrium is achieved.

The argument that if people really want this information, they should pay for it doesn't work. Basically, you are punishing people for automating manual labor. The fact that you can hire a bunch of people and tell them to manually copy data from the website and achieve the same result means that it shouldn't be treated any differently from automating the process itself.

Terms of service on a website are not the law. If there is something you don't want people to have access to, then don't publish it online at all.


I'm more and more concerned that the legal and cultural environment for web scraping would make it hard for a company like Google or Yahoo to be founded today.

The internet isn't about "don't take my stuff", it's about spreading that stuff around. I'm confused by people who want to make their data public, but want to control exactly how people access it.


Agreed.

I talked about this a while back: https://news.ycombinator.com/item?id=6572937

Someone decided to ban all bots from accessing their site except for Google and Bing. So much for "if you're so worried about Google just use another search engine".


I'm confused by people who want to make their data public, but want to control exactly how people access it.

You mean like TV? Or radio? Or print...


To be fair those mediums all kind of suck in their own way.


>make it hard for a company like Google or Yahoo to be founded today.

Is that a bad thing?

>The internet isn't about "don't take my stuff", it's about spreading that stuff around.

Try asking Google if they want to "share" their database of crawled data.

>I'm confused by people who want to make their data public, but want to control exactly how people access it.

Me too.


>Is that a bad thing?

Yes, no question about it. More players in the market means more competition and more competition means a better service.


>More players in the market means more competition and more competition means a better service

I'm not sure if this is a joke. So, I'll refrain from replying.


Could you explain your point of view? I'd like to understand both sides here, and I think I understand why more competition would be good, but not why it wouldn't be.


>I think I understand why more competition would be good, but not why it wouldn't be.

That is not what I said at all. I don't believe that more competition necessarily leads to a better product/service. In fact I believe that in most cases it does not. In capitalist economies, companies try their hardest to avoid competing. Competition forces companies to reduce costs, not necessarily to increase quality. The quality of a product is not some number that people can read and go "oh yeah, this product is better". Marketing people try hard to invent such pointless numbers (e.g. megapixels in cameras, horsepower in cars, etc.). Also, many CEOs don't have the first clue how the product is actually made, much less how to increase its quality. They rely on these same 'marketing numbers' that their underlings serve them with, so they too can go "oh yeah, this number is increasing, so our product is getting better".

It seems like a lot of people are brainwashed into believing this free market utopia where things just automatically get better because everyone is competing and the customer is this genius who can figure out which company is delivering a better product.

It's sort of like thinking "Well, if I'm nice to everyone, everyone will be nice to me." And then you realize that the real world is a dark place filled with assholes, where slavery is still rampant and many of the goods and services we consume are dependent on the exploitation of natural resources or other fellow humans.

Sorry if reading all that bummed you out. I'm really quite a cheerful person :P


The OSVDB website contains no signup page for commercial access. No pricing either; sign-up is purely via contacting someone. From my experience, whenever I see this, I just refuse to use the service and look elsewhere. Contacting someone is annoying and opens you up to repeat sales calls. Perhaps they should make commercial API access easier to obtain rather than complain about scrapers.


It's not really reasonable to say "I don't like the way you market your goods... so you really shouldn't be concerned with people stealing them."


I have to disagree. As creators, we should expect to see the behavior our designs encourage.

The fact is, if it's harder to buy something, would-be buyers choose another route.

Imagine something of trivial value that was very arduous to obtain. Say, the scores to last weekend's football game are only available via sending the NFL 2 cents taped onto a postcard then getting a user account and password back in the mail. Yes, you could apply for an account and who cares about the 2 cents? But most people wouldn't and who could blame them? There'd certainly be a market for pirated 'score data'.

But, if you just put ads on your site (the NFL does), voila! You make the same amount of money and people are happy to use your service. Netflix, Hulu, iTunes Store and Spotify all get this. If you make it a pain to do something, you can't feign surprise when everyone goes around you.

Is it legal to circumvent something just because it's arduous? No. Is it ethical? Not completely, but downloading pirated football scores wouldn't keep me from running for office.


Not in the long term obviously, but you can't make a product extremely difficult to buy and then complain that everyone is taking the easy route.


Talking to someone on the phone is not "extremely difficult." Large companies buy stuff by talking to people on the phone all the time.


Perhaps a method used exclusively by very large companies is that way for a reason? Of course purchasing by talking to someone on the phone is extremely difficult. The only reason people sell that way is to make sure they can hassle you as much as possible during the sale.


And it sucks as a method of buying something.


And I have a policy that if a company does not practice open pricing by publishing their prices online, I will not do business with them.

Fuck that nonsense of pricing based on whatever they think they can scam me out of.


Apropos of nothing: customers often have an exaggerated notion of how important it is to, e.g., an enterprise software company that that company land their account.

A conversation I've had a few times:

"We need it to do $THING_IT_WON'T_DO."

"In that case, it probably isn't a great fit for your needs."

"You don't understand. I won't buy it if it doesn't do that."

"I think I do understand. That's fine. You might consider trying $COMPETITOR, although you should know their minimum spend is $1,000 a month."

"That's outrageous. You have a $29 plan."

"Yes. So you should go with the competitor if that requirement is worth $971 a month to you."

"No, I want to spend $29, but I absolutely need that."

"I understand where you're coming from, but we do not offer that feature, and if we did, we would charge prices close to what our competitor does for it."

"You're not working with me here."

"I'm trying to find a resolution which works for you, but including that feature at $29 doesn't make business sense for me, so I won't do it."

"Put me on the phone with your boss."

"I'm afraid that isn't possible, as I sort of run things around here."

"What sort of businessman turns customers away."

"You're not a customer. If you were, you would be purchasing a product I sell for the amount I sell it for. That isn't happening. That's fine. Have a nice day."


That's fine. 'patio11 has lots of stories about that.

However, the lack of up-front pricing data isn't being used to justify "I won't do business with you." It's justifying "I'm going to take all your stuff anyway."


Tell that to the 95% of the population on this site who torrent TV shows, movies and music.


Even ignoring the difference between taking something for personal use and taking something for commercial use (as that is a distinction that does not affect the legality of the matter) "because other people do it" is not a valid defence.


It's not a defence, I think you can see from my tone I don't support stealing media just because you don't like how it is distributed.

I'm saying rather this argument won't find much favour here unless everybody is a hypocrite.


Did you read the article? In both cases the offending parties were told to contact RBS to obtain a proper license/account.


That's exactly the problem that GP is talking about.

McAfee were just being the usual assholes, but the first guy mentioned in the blog post could have been converted into a paying customer if the pricing scheme were clearly outlined on the web. Since "Contact us" account tiers are usually reserved for the very high end, he probably assumed that it would cost him an arm and a leg.

It shouldn't be too difficult to come up with a handful of moderately priced account tiers, each targeting a different type of customer, as well as an open-ended tier at the end ("Contact us") for those with special needs and deep pockets.


So many times I say this to people. Sometimes the reason people aren't buying is because they can't find the price, and aren't willing to pay the mental price of talking to a sales person to find out.

I understand the sales psychology in making sure you enter into a proper discussion with people to make sure their needs are right, and in extracting the maximum consumer surplus from them. But there is a non-trivial pricing point where this makes sense, and by not having any anonymous sign-up, you're cutting off your sales curve below this point.

I think it's very easy to convince yourself it's easier just to sign up customers over the phone, but doing so without at least testing takeup of simpler tiers is an incomplete picture.


I get this line of thinking if you can sell your offering for $100 a month. But what if your minimum offering is $2500/mo? Correct me if I'm wrong, but I've yet to see a company - one that only sells through enterprise sales - put up a sign saying their plans start at $2,500.

In most cases I don't believe these services are forcing customers to talk on the phone because they think they will convert them. Chances are that sign-up isn't as easy as one or two clicks, and involved setup and understanding of the customer's needs is required before services can be rendered.


If your signup process is understandably complicated and you require a minimum commitment of $2500/mo, then it's probably okay for you to tell potential customers to give you a call.

If your signup process can be automated and you only charge $25/mo, and you still tell people to give you a call, then you will lose business because 1) the friction of a phone call is worth more than the difference between your offer and a competing offer; 2) it's impossible to tell whether your offer is worth the friction in the first place, because your pricing is unknown; and 3) people just assume that you'll charge $2500/mo because that's the usual price point where people say "contact us". If someone else comes along and offers an inferior product for $35/mo, they'll get all the business because monkey psychology.


Yes, that's what I am saying. There is a pricing point at which having a customized sales process absolutely makes sense.

What I am saying along with that is that - by doing this - you're ignoring a lot of the market at (or even just below) your price point. That might be OK - as long as it is a conscious decision to do so, including acknowledging that your market is completely above that point.

In the case being discussed here, it would seem that there is an interest below this price point.


If I were in OSVDB's shoes, I would call these people out in an email and ask them to pay a licensing fee. McAfee has always had a shady past; even after they shook themselves clean of John, they have a history of scam-like behaviour to make a quick buck.

You should be rate-limiting how many requests free API users can make, like Twitter, Facebook, and every other Internet provider do via their APIs. Make it harder for people to obtain the information (to the best of your abilities), and paying becomes more of an option, because they won't go to the trouble of scraping it if doing so is hard and time-consuming.
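A minimal token-bucket sketch of that kind of per-key limit (the rate and burst numbers here are arbitrary):

    import time

    class TokenBucket:
        """Allow `rate` requests per second, with bursts up to `capacity`."""
        def __init__(self, rate=1.0, capacity=10):
            self.rate, self.capacity = rate, capacity
            self.tokens, self.stamp = float(capacity), time.monotonic()

        def allow(self):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.stamp) * self.rate)
            self.stamp = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    buckets = {}  # api_key -> TokenBucket

    def check(api_key):
        """True if this key may proceed; on False, respond with HTTP 429."""
        return buckets.setdefault(api_key, TokenBucket()).allow()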

Think of your offering as a car. You currently have no car alarm or immobiliser; if you install an immobiliser and car alarm, you will make it very hard for a thief to steal your car.


On the other hand, large (even non-profit) organisations are precisely the ones who have the resources to scrape stealthily and widely, as would a loosely-organised community of users... it's not hard to come up with algorithms to respect the rate limits, balancing the load across multiple IPs and accounts, and producing access patterns that don't look any different from the rest of the site traffic.


Large organizations also tend to have risk-averse lawyers. I'm not a fan of the CFAA, but if what Aaron Swartz did was illegal, then so is this.


Like much security, it's not about making it impossible, it's about making it a lot less convenient / a bit harder.

At some point the effort to circumvent costs more in man-hours than just buying the product.


You can get scraping libraries fairly easily. In my more shady past I developed a library like that and shared it: an HTTP client library with automatic proxy rotation that stayed friendly to rate limits.

When I used it (which was almost a decade ago) I never ran into problems; plug in a list of 10,000 proxies and scrape away.

Not condoning that, which is a bit hypocritical of me; at the time I was mostly doing what I was told and I thought I was clever. Now that I'm in a position to have a positive impact, I do buy data and pay appropriate licence fees on all software/data purchases, which still baffles some of my programmers, who constantly ask "why not crack it?", "you know I found a .zip on Google with the data, why buy it?", and so forth.

I don't know what in programmer culture makes it so hard for us to pay for something, some people put some effort behind that software / data collection, and it's only fair to pay them.


I know that paying for things was annoying when I didn't have money. Now that I have cash, I'm willing to pay (reasonable amounts of ) money for digital things.

I still hesitate when it comes to thousand-dollar licenses when it's for my personal use, though.


Speaking as devil's advocate, it might just be more convenient to steal the data.

Maybe I want to use your data casually once, and I don't want to sign up and give you all my contact details and subscribe to your annual plan with all the other optional extras.

Tough shit, you say? I'll just steal it then, and not because I can't afford it, but because you're making it hard to pay.


But how is it stealing when one does not lose inventory? If one person scrapes a page, did you lose the source code? Does it not remain available for the next visitor? What possible loss do you incur that is directly tied to your data? When you make data public with the intent of it being readily accessible by the public, how can you claim theft when you are achieving what you set out to do? Does the accelerated rate of access suddenly become a theft? Does one need to pay a third party to avoid the pain associated with manual hand labor, for simply hosting data which is available to the public? Help me understand.


"but how is it stealing when one does not lose inventory?"

Scraping is not necessarily a no-victim situation. Even today after this stuff has gotten cheaper, you're costing them bandwidth fees, and likely increasing their server storage and CPU fees if it's on a metered hosting service, which is quite likely nowadays. If you degrade their site's functionality, you may chase away paying customers.

We need not hypothesize crazy third-order effects; you are taking money out of their pockets by the act of scraping itself, independent of the question of the value of the content.

"What about Google? etc." - robots.txt-honoring scrapers that don't hammer the sites at least have a plausible claim to permission. Scrapers are quite likely to be ignoring the robots.txt.


> Scraping is not necessarily a no-victim situation. Even today after this stuff has gotten cheaper, you're costing them bandwidth fees, and likely increasing their server storage and CPU fees if it's on a metered hosting service, which is quite likely nowadays. If you degrade their site's functionality, you may chase away paying customers.

While technically correct, you are conflating the issues, because in none of the cases (that I've seen mentioned so far in this thread) is the problem with the bandwidth/storage/CPU costs of retrieval to any significant extent.

Instead, it appears that almost all of the costs are incurred before retrieval: curating, sorting, etc.

I'm not arguing that it's okay, but it's no more stealing/thievery than downloading movies or music is.


Even if you limit the rate, they will just create multiple accounts on multiple servers. There is no way you will make them pay for it as you could find it hard trying to prove it's them scraping in the first place.


If I were in their shoes I would track their IPs and send them bogus data along the lines of "Please pay for a commercial license."


You need to be careful sending bogus data: in some jurisdictions this could be argued to be deliberate targeted commercial sabotage. You would no doubt eventually win any resulting legal argument, assuming you could afford to carry the argument on to that conclusion. Sending no data, or limited data, would be safe though.

A better method would be to set "default" pricing (something high but not ridiculous, that could easily be negotiated downwards if they contact you) and make access beyond a few requests a click-through (or better: have them respond to an email before progressing further) where they agree to that pricing if they are using the information commercially.


You don't have to return bogus data; you can return an HTTP error code: 402 Payment Required.
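For instance, with Flask (a sketch; the detection test and the entry renderer are stubs I've made up):

    from flask import Flask, abort, request

    app = Flask(__name__)

    def looks_like_scraper(req):
        return False  # stub: plug in your rate-limit / pattern heuristics

    @app.route('/show/<int:vuln_id>')
    def show(vuln_id):
        if looks_like_scraper(request):
            # 402 Payment Required: no bogus data, just a clear signal.
            abort(402, description='Contact us for a commercial license.')
        return render_entry(vuln_id)  # hypothetical renderer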


Exactly.

The problem I see is with giving out bad data while leading people to believe they have obtained useful information (that they then embarrass themselves by using/re-distributing).

I did misread the grandparent post though: he was suggesting the bogus data was the message, and I read it as handing out fake data to the scraper along with the message, to be seen should a human be looking.


Yeah I can see why fake data could give you a harder time in terms of lawsuits. Similar to people who fight hotlinking of their content by replacing images with something offensive.


>"You need to be careful sending bogus data: in some jurisdictions this could be argued to be deliberate targeted commercial sabotage." //

That sounds pretty ludicrous. Do you have anything to back it up - case law, a settlement report? It would be analogous to serving a fake image to combat hotlinking, or a fake page to combat framing.


Something that is obviously fake or otherwise different (like the image or frame break-out examples) would also be fine.

But leading someone to believe they have correct data, when what they have is potentially embarrassing when used, could be something they'd take objection to. Even if not, there are two other points of risk: your reputation if something goes wrong and you accidentally give bad data to your paying clients, and your reputation if someone, paying or otherwise, shows off the bad data as "the sort of crap these people try to sell".

I don't have any specific references, but it is something I would be careful of as there have certainly been similarly ludicrous (IMO) cases on unrelated matters in the past (yes the right side would win, assuming they can afford to).

I may be being too cynical here, then again maybe not...

I did misread the grandparent post though, and this isn't what he was talking about. He was suggesting the bogus data was the message, and I read it as handing out fake data to the scraper along with the message, to be seen should a human be looking.


Hasn't the mapping industry already set a precedent in this field? They provide maps with small, wrong roads/trails specifically to identify who is stealing their data.


Pretty much, but precedents set in the "real world" don't always carry over to "on the Internet", even when the correlation is stark, obvious, and indisputable to most people's eyes.


I can't help recalling a post here a couple of years ago about the concept of "hellbanning" scammers on ecommerce sites--in short, making it look like everything is going fine, while actually isolating them completely from your business logic. Orders with stolen cards appear to go through, and send confirmation emails, but no real order is generated... In this case, you could transparently poison the results served to identified scrapers with 5% bogus vulnerabilities.

Or is that about as ethical as spiking trees to prevent illegal logging?
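Mechanically it would be a small thing; a sketch of that kind of selective poisoning, where is_flagged and fake_entry are hypothetical hooks you'd supply:

    import random

    POISON_RATE = 0.05  # ~5% of entries swapped for plausible fakes

    def serve_results(entries, client):
        """Return results verbatim, or lightly poisoned for flagged clients."""
        if not is_flagged(client):  # hypothetical scraper check
            return entries
        return [fake_entry(e) if random.random() < POISON_RATE else e
                for e in entries]  # fake_entry is hypothetical too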


Heh, I've always called this the map maker's trick (found an article about it here: https://theweek.com/article/index/241967/trap-streets-the-cr...), although I guess that is specific to putting a small amount of fake data in your dataset to prove someone else used it. The hellbanning metaphor does fit for returning large amounts of poisoned results.

It could be like spiking trees I guess, but that depends on the potential for harm. I just looked it up and was surprised to find that only one injury has ever been reported due to tree spiking. I guess it's a better talking point than actual tactic.

And like lukejduncan said, this is definitely done in practice.


The point of spiking trees isn't to hurt people, in fact the tree-huggers tell the loggers that the trees have been spiked so that they won't attempt to cut the trees. The point of the tactic is to prevent damage to the forest, not to actually hurt anyone.


I've definitely seen this done in practice.


This also goes way beyond scraping, in the sense of small start-ups doing all kinds of hacks or clever manipulations to stay within the bounds of a free trial service.

As someone who sees both sides… I don't know what to say. I'm running test landing pages on Heroku with New Relic pinging the sites every minute to ensure the dynos keep spinning and my users don't experience downtime. While I'm careful to stay within fair use, this is at best obnoxious, because if everyone did this, Heroku would certainly need to redefine what's free. From my POV though, I am a bootstrapped entrepreneur supporting 5 landing pages. I simply don't have the resources to pay for a dyno and test everything I have in my head, especially not combined with the many other resources I'd need to start paying for as well.

Or consider the kid in Florida who used Parse's free account for hundreds of thousands of users. [1] (The article was on HN a few weeks ago; this was not its central point, just something I took away relevant to this comment.)

Part of the cause I think is that we live in a world where we're so used to having things be free, it becomes an entitlement. Another is that all these examples of start-up hacks and hustle stories, we kind of laud, don't we? Everyone talks about how Airbnb scraped Craigslist and got a huge boon that way, but few in critical tones. Should we? Or is that how competition and new products get created (i.e., if the scraping hadn't happened, perhaps Airbnb and the whole sharing economy would be less successful today).

These are philosophical questions, and I don't really have a solution, but they are things to think about.

[1] http://pando.com/2014/04/30/how-a-florida-kids-stupid-app-sa...


I was once developing a program for automated trading for my own personal use. In the beginning I made periodic requests to my broker, scraped out the price info, and made orders if the price was right. Once I had something sort of working I decided that I wanted to do this the right way and ask them for permission. They told me I would have to subscribe to a feed through some other program. After many hours of reading the manual and trying to figure out how to transfer the data to my application I again got it sort of working. But I had already started to lose interest and other things had come up in my life.

How is this relevant? Well, I would have much preferred paying for scraping to learning some new API. It increased the transaction cost. If you further have to negotiate with a partner company, that sounds like even more transaction cost, from having to send emails back and forth and from the mental effort of negotiating.


Beyond the moral discussion, I think a market for web scraping is a good thing because currently there are a lot of unconnected people trying to buy/sell this service.

Freelancer sites have a lot of offerings for web scraping, but this niche has its own issues.


paying [the site in question] for scraping I mean


Pretty sad. Not that I really expect much better from McAfee. They've always been a bunch of scam artists/cons in my mind.

I don't have time to look, but a couple of things struck me as odd. Is there a reason you don't lock down your request limits? Also, why don't you secure access? Allow free accounts to be made and issue API keys for them; at least that way you can much more easily rate limit access to the APIs, and heavily rate limit front-end web requests down to reasonable numbers.


Shouldn't you at least be disallowing /show/* in robots.txt? Not that scrapers are necessarily going to respect this... but the way you're set up, it seems like this is semi-legit behavior.


Maybe they want Google to be able to crawl their database (which it has clearly done, as you'll see if you do a search.) That also raises some questions...


Not completely by the specification, but I think this one works as expected.

    user-agent: *
    disallow: /

    user-agent: Googlebot
    allow: /


I think that userbinator's point is that they DO conditionally allow scraping, which makes their position even more tenuous IMO.


By a single search engine which probably provides the vast majority of their traffic.

While I don't necessarily agree with the concept of only allowing google to index your site, comparing a search engine which feeds you business to a company reselling your data with no attribution is not really fair in my opinion.


Neither of the two parties mentioned is likely to resell the data directly; both are likely to create a derivative work which they will exploit commercially, as will Google. What part of the ToS makes one OK and the other forbidden? How does this work when a smart scraper can just pull from the Google or archive.org cache?


Not sure it's a good idea to block all search engines but Google.


Probably not, but if they don't want to be scraped, the first place they should notify about this is robots.txt. I was just stating an example that you can allow some bots and not others if you like.

Of course, forging your user agent, disobeying robots.txt or scraping after you were told not to, is wrong ethically.


Web scraping isn't a crime - the simple act of downloading the data should not be a problem here. (The reuse of the data might be, depending, but we don't have that information right now.)

This doesn't even rise to the level of what Weev did.


Scraping isn't a crime, but by the same token, neither is rate limiting and banning IPs.


It can be a crime if it is a violation of the site's terms of service and the requests violate the host's robots.txt.

I'm not sure whether there is a legal precedent though, in some cases you could call it a denial of service if the requests are not rate limited, and in other cases it might be considered an inappropriate access (see Weev, though he eventually won appeal).


[deleted]


Actually, Terms of Service violations fall under the Computer Fraud and Abuse Act, since ToS agreements can lay out under which circumstances that authorization for access to computer systems is given. That sort of obscene generality is the reason for proposals such as Aaron's Law, but to my knowledge there are no such protections today.


Totally correct. Parent comment deleted for lack of usefulness. Thank you for the correction.


While what you have stated is certainly the position held by the US AG, I don't believe it's been tested in court yet.


He's just making the point that it's unethical, which it is. Even if it boils down to simple bandwidth theft.


I feel the opposite. I don't think it's unethical, even if it might be illegal.

Even calling this "bandwidth theft" is quite the hyperbole—if the server can't handle the bandwidth, then rate limit the requests.

I think if you're serving out pages to the public, you don't really get to tell me what kind of browser I'm allowed to download it with. As long as I'm speaking HTTP, it seems fair.

Sadly, the law has been slowly creeping against this mentality... Lately I feel like I'm some old internet hippy with these views. On a site called "Hacker News", no less.


> Sadly, the law has been slowly creeping against this mentality... Lately I feel like I'm some old internet hippy with these views. On a site called "Hacker News", no less.

I guess it's because every day more and more people on here are finding themselves on the other side of the fence, i.e. finding that some of their users are ripping off their content/site.


Maybe people need to come up with a better business model for websites than "Put up some 'content' and sell ads next to it."

Sell a product or service.


In agreement with my sibling comment: if I ask you politely for some bandwidth, and you give it to me, it's hard to describe that as "bandwidth theft".


Why not put all information behind a login page and force people to sign up? It looks like this site will be used by only a few people anyways. You can then also track scrapers by login.


As a somewhat unrelated side note, I'm actually working on providing a free API written in Golang for CVE and CVSS entries. You can find it here: https://github.com/jordan-wright/cve-api


Is it ethical to scrape from a scraper?


Is it ethical to scrape from a service you DO pay for, instead of using their API or other interface?


Ethical at least requires respecting robots.txt.


If robots.txt said to jump off a bridge, would your spider do it?


Who watches the watchmen?


It's hard to get excited about this. The data owner supposedly has a mission to make the data available, but then is concerned with such a thing. It doesn't strike me as that difficult or expensive to make a database of that size available.


We had a former Google Maps PM give a talk about building APIs, and he addressed scraping. Something he pointed out is that the traffic pattern from scrapers differs from regular users. Regular users make requests for the same things often. Scrapers usually only request any given item once. The really obvious ones do it in order from the beginning. He had a hilarious map showing the requests of someone requesting business information starting from the top left corner of the map. They got really good information on all of the businesses between the North Pole and Greenland before being blocked.
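A toy version of that heuristic: flag a client whose recent requests walk item ids in order (the window and threshold here are arbitrary):

    from collections import defaultdict

    history = defaultdict(list)  # client -> recent item ids

    def sequential_scraper(client, item_id, window=20, threshold=0.9):
        """True once a client's recent requests are mostly sequential ids."""
        ids = history[client]
        ids.append(item_id)
        del ids[:-window]  # keep only the last `window` requests
        if len(ids) < window:
            return False
        steps = sum(1 for a, b in zip(ids, ids[1:]) if b == a + 1)
        return steps / (len(ids) - 1) >= threshold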


I see that the "Open Source" Vulnerability Database changed its name to the "Open Sourced" Vulnerability Database in July 2013.

https://web.archive.org/web/20130714002216/http://www.osvdb....

I guess they were hoping no one would notice the subtle but substantial change to their service.


Given these people's immediate reaction to resort to scraping, I'm honestly kind of surprised they didn't set up basic practices of at least going through proxies (or free-riding off of Tor, ugh).


Maybe it's the supervillain in me, but I think I would try to come up with software to recognize the scraping attempts, and rather than ban them just have it generate fake data on the fly.


>including an expansive watch list capability,

I would avoid the word expansive. Too similar to expensive. Especially when emailing people who are not native English speakers.


Looks like they forgot to conceal the full name and email of the guy from s21sec in one of the quoted emails...


I would hope they sent a bill.


Amateurs. If you are going to crawl, distribute it across unique IP addresses.


What is the price? I assume the price is in the 5-10k+/month range, which is a reasonable amount for data like this.


What about Aaron Swartz? He essentially did the same exact thing to the computers at MIT and he is somehow a freedom fighter when someone doing the same thing to the osvdb website is considered "unethical".

This is straight from the open security foundation website:

"We believe that security information and services should be easily accessible for all who have the need for such information and services"


While your sentiment is reasonable, I think the main difference here is that McAfee and others mentioned host their own private vuln databases and do not share them with anyone, so they were scraping to increase their own private resources for commercial use.

Aaron was scraping private resources to share publicly.


More than that, he was scraping private resources that were freely populated. He was not robbing content creators of their money. He was circumventing a paywall to what should be free data.


That's quite the mental leap. He was circumventing a paywall, but that's not robbing anyone of money because it should have been free in the first place? Well, it wasn't free, even if you think it should be. Thus the paywall.


Some of the content on JSTOR was actually public domain content. At the time, JSTOR did not make this publicly available. It is definitely debatable to consider the downloading and distribution of public domain material to be robbery.

Since then, JSTOR has released these documents freely themselves[1]. They have also stated that this was their intention all along.

[1] http://about.jstor.org/news/jstor%E2%80%93free-access-early-...


To that point, the people who submitted to the journal did so knowing that it costs money to access, it wasn't as if they were tricked into contributing to a private pay-wall journal.


For most scientists there isn't another system, and certainly when the system was established nobody was envisaging a time when publishers would make hundreds of % markup on every access of a paper. The prices reflect print publishing costs and the historical lack of a cheap distribution system like the internet. Now we have a hangover where people's careers are judged on their ability to publish in high-impact journals, all of which charge several thousand pounds extra to make the paper Open Access. So we're not tricked, but we do it under duress.

The system is changing, Aaron helped.


Is OSVDB any different? "Open Source" Vulnerability Database implies openness. The maintainers are trying to implement a paywall.

I feel bad for OSVDB from a sysadmin perspective, but if Aaron's case was so polarizing for essentially the same thing, why isn't everyone jumping on the Hate Train here?


You might argue the difference is that the information that Aaron was after was already paid for with public money. The OSVDB sounds like it is a bunch of people who aren't otherwise paid to maintain this information.


>You might argue the difference is that the information that Aaron was after was already paid for with public money.

This is naive. Just because research is backed by public money doesn't mean the publications are automatically free to the public. If your argument was valid, you could use it to demand access to the emails of every FBI employee. Just because something is funded by the public doesn't automatically make every component of it open to the public.



Having to file a FOIA request that could take years and be denied isn't exactly open.


Scraping is not an ethics discussion. It's what you do with the data that falls into that topic. Selling email lists scraped from websites to spammers would be one example. Using email lists to prevent spam would not be.

The article is basically saying "I want to charge people for information I have made public online, they won't pay, so they are obviously thieves by refusing to do it manually by hand like they are supposed to."

Gimme a fucking break here. If you don't want the information to disseminate, DO NOT PUBLISH IT ONLINE.



