Hacker News
Oh Shit: How to Break a Large Website (and how not to) -- PDF version (scribd.com)
38 points by ALee on Feb 22, 2010 | hide | past | favorite | 27 comments


Somewhat OT: is it just me or does anyone else find it utterly ironic that scribd, a site dedicated to helping people avoid dealing with the PDF browser plugin mess, is intentionally going through the trouble of converting their source document to PDF just so they can display it in their viewer?

There's dogfooding, and then there's this. I'm not sure what to make of it.


Why is it ironic? I wouldn't call converting their source document to PDF "trouble," since PDF is already the standard format for publishing presentations. Every scientist I know who makes presentations in PowerPoint exports to PDF before sending them to other people, mainly to minimize version conflicts and to keep people from ripping off their content. PS and PDF are also the standard output of LaTeX.

scribd exists because PDF browser plugins kind of suck, not because PDF itself sucks.


The epic shittiness of Acrobat Reader 7 is just how scribd got traction -- they don't really give a shit about PDF viewers -- their raison d'être is to be the YouTube of documents.

Their Flash viewer is terrible (doesn't work at all on Linux) and their site is largely hostile to the idea of you actually reading the content. I had an involved email conversation with one of the founders after slamming their iPaper flash viewer here before, but nothing ultimately came of it.

Choose life. Choose Google's pure-JS document viewer: http://docs.google.com/viewer


"their site is largely hostile to the idea of you actually reading the content"

Yeah; I don't get that, although this item helps to explain why.


>> "PDF browser plugins kind of suck"

Apparently this is true on windows. Works fantastically on Mac though, even iPhone.


> PS and PDF are also the standard output of LaTeX.

Erm, I thought that DVI was the default output medium.


I'm pretty sure that pdflatex is used far more widely than plain latex these days, given that it is simpler to use and makes it easier to take advantage of PDF features.


I meant standard as in: that's what people turn their LaTeX into when they publish it. But I take the blame for the bad wording; latex itself does output DVI :)


Am I the only person who's shocked by the idea that one of Scribd's developers was repeatedly taking down the website by pushing code live without testing it or understanding the queries it ran?

I guess maybe he's just exaggerating for entertainment value... It's good to see that he's sharing his mistakes for others to learn from, and going into detail on why things went wrong. However, it's really depressing not to see any attention paid to the root cause: why the hell wasn't he testing his code?


Half of these are things that I could see doing quite easily -- for example, although I happen to know that "select count(*) from users where lower(users.user_name) = 'patio11'" will result in a full table scan on MySQL, that category of mistake is easy to make and has bitten me a time or three. And it is easy to miss if your testing methodology doesn't include "Now run it against a data set about as big as the production DB." So that would be a pattern to look out for both when coding and, critically, when designing processes to make sure this sort of thing doesn't happen to you.
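The full-table-scan pattern above is easy to sketch. This is a toy in-memory illustration (hypothetical data, not Scribd's schema): applying lower() in the WHERE clause means every row must be examined, while normalizing once at write time and indexing the normalized value turns the lookup into an exact match.

```ruby
# Toy illustration of why lower(user_name) = '...' forces a full scan,
# and the usual fix: store a pre-lowercased copy and look *that* up.
# (Hypothetical usernames; an array stands in for the users table.)
users = %w[Patio11 ALee tptacek].map.with_index(1) { |name, id| { id: id, user_name: name } }

# Full scan: lower() must be applied to every row at query time.
scan_hit = users.find { |u| u[:user_name].downcase == "patio11" }

# Indexed lookup: normalize once at write time, then do an exact match.
index_on_lower = users.each_with_object({}) { |u, idx| idx[u[:user_name].downcase] = u }
index_hit = index_on_lower["patio11"]

raise unless scan_hit == index_hit  # same answer, very different cost at scale
```

In MySQL terms the fix would be an indexed `user_name_lower` column (or a case-insensitive collation), so the query never has to wrap the indexed column in a function.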

Then there's "code that throws an exception anywhere you execute it every time you execute it was pushed to production." This points to a rather different type of process error.


I know you're making a general point here, but for the example given, I think it's generally best to refresh your dev data from production regularly (maybe after each large release). That way you're always testing on data 'close' to the current production state, which would help you avoid any data-related issues like this.


I think that is a good idea if you're able to get away with it. Scribd can. We can't do it at my day job.

There are hundreds of different ways to break the personal information privacy law if we screw up anonymization of test data. For example, supposing we just take the naive approach and overwrite all names, addresses, emails, phone numbers, etc etc etc. Should be fine, right? Except, uh oh, the data is still personally identifiable: the data will tell you that a female student who took CS103 and English 101 last year was given a semester of medical leave. If a copy of that dataset leaks and someone in the department realizes "Hey wait, the only person in that double major is Hanako... medical leave... Hanako was pregnant last year?!?", then our company just made the front page news, we made our customer look horrible (and likely owe them and Hanako several tens of millions of yen in we're-so-sorry money), and we just broke the information privacy law something fierce.

Incidentally, engineers not treating test data with the same "This CD is nuclear waste" precautions we treat the production data set is a frequent cause of breaches like this. Somebody decides to work from home for the day, gets his laptop stolen, bam front page news. I nearly got in severe trouble for leaving a printout of the student roster on the printer fifteen feet from a door somebody could tailgate through -- the only thing that saved my keister was that I could show that the student roster I printed out was fake. (Lesson learned about producing good test data: don't produce too good test data.)
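The re-identification risk described above can be shown with a toy roster (entirely hypothetical data): even after the name column is blanked out, the remaining attributes form a quasi-identifier that singles out one record.

```ruby
# Hypothetical roster showing why blanking names isn't anonymization:
# the remaining attributes (a "quasi-identifier") can still single
# out one person.
roster = [
  { name: "XXXX", gender: "F", courses: %w[CS103 ENG101],  status: "medical leave" },
  { name: "XXXX", gender: "M", courses: %w[CS103 MATH201], status: "enrolled" },
  { name: "XXXX", gender: "F", courses: %w[ENG101 HIST110], status: "enrolled" },
]

# An insider who knows "the only woman taking both CS103 and ENG101
# is Hanako" recovers her record despite the scrubbed name column.
matches = roster.select { |r| r[:gender] == "F" && (%w[CS103 ENG101] - r[:courses]).empty? }
matches.length  # => 1, so the "anonymous" record is re-identified
```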


Why is this being downvoted? The privacy concerns are valid, but they are a symptom of a different problem, which is controlling machine/network access.

Previously I've worked in banking tech and this procedure was always followed (with data scrubbing), but it was also low risk since we were already on a secure network in a secure building.

For a startup with sensitive live data, though, I can see why it's best to keep prod data out of test environments and thereby lower the chances that an unlocked laptop or a lost printout contains sensitive data. It's clearly a cheaper and better solution than keeping everyone in a secure building and limiting their remote access.


My vote goes to entertainment value, because the stories don't completely add up. When he was implementing the case-insensitive logins he would, of course, first have to run a query to see if there are any users whose login names collide without case sensitivity. That query would run (according to the slides) really slowly, because it would have to essentially do something like:

    SELECT a.id, b.id, a.username
    FROM users AS a, users AS b
    WHERE a.id != b.id
      AND a.username = b.username
      AND BINARY a.username != BINARY b.username
At that point he'd realize the query is taking forever, and use mysql-top or something equivalent to kill the query.

Edit: I realize that most likely the username column had a UNIQUE index already, making this query unnecessary.

--

In his find_in_batches example, where he has to make some change that applies to all users, he makes it sound like the User.all is the big problem because it does a `SELECT * FROM users`. But even when you break the work up into parts you have to be careful, because by the time you're done processing, a few of the existing users may have changed already. So once your database becomes too large to do your updates in one big atomic transaction block, you almost always have to follow a process like this:

1. find set of rows that match the problem (in batches)

2. apply the (expensive) fix/migration on those rows

3. go back to 1 until the result set is empty

4. wrap (1) and (2) in a transaction (fine, since each batch only touches a few rows)

Of course there are a hundred ways to upgrade a live database safely, but there always must be some kind of process in place otherwise your database is going to go down all the time.
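The batch loop described above can be sketched in a few lines of plain Ruby. This is a minimal toy version (an in-memory array stands in for the users table, and the "fix" of stripping whitespace from emails is hypothetical); real code would wrap each batch's find + fix in a database transaction.

```ruby
# Minimal sketch of the batch-migrate loop: find a small batch of rows
# that still match the problem, fix them, repeat until none remain.
BATCH_SIZE = 2

# Hypothetical problem: emails were stored with stray whitespace.
users = [
  { id: 1, email: " a@x.com " },
  { id: 2, email: "b@x.com" },
  { id: 3, email: " c@x.com" },
]

needs_fix = ->(u) { u[:email] != u[:email].strip }

loop do
  batch = users.select(&needs_fix).first(BATCH_SIZE)  # step 1: find a batch
  break if batch.empty?                               # step 3: done when empty
  batch.each { |u| u[:email] = u[:email].strip }      # step 2: apply the fix
end
```

Because each iteration re-runs the "find" step, rows changed (or fixed by other means) mid-migration are picked up or skipped correctly, which is the point of looping until the result set is empty.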


I am completely unsurprised, and I'm not being sarcastic.

They probably think that they're using an "agile" methodology.


pdf link anyone? i couldn't find it on scribd, and the slides aren't showing for me (chrome 5.0, mac 10.5). thx.


"Oh Shit: You need to upload 3 public documents to download that PDF."


Unless you're on mobile!


First page displays as nearly blank (just Scribd logo), but clicking through to other (78) pages seems to work fine (Safari 4.0.4) - as does DL of pdf.


I would like to see scribd links banished from here.

Please, just link to the PDF, not to an idiotic site that puts a login barrier and other barriers around downloading the PDF.

Every time I land on scribd, I make a throwaway account (with mailinator or whatever), and use that to download the PDF.

Edit: ... and I can't help but notice the irony that breaking scribd makes it more obvious to more people that just providing the data (in PDF, in HTML, whatever) is better than burying it inside scribd. Thus nudging the world in the right direction.


Who'd have thought that "knowing your code / sql" would be a problem when using software that writes most of it for you!

I'm sure Rails is great for getting an application up quickly, but once you get some traffic and the application gets more complex, surely you'd end up hacking it so much that you might as well have written it from scratch in PHP or similar?

That article just confirmed that I made the right decision when staying well clear of Rails.


No, this is absolutely not the case.

There are plenty of large sites, e.g. yellowpages.com and urbandictionary.com (which switched from PHP), that are built using Rails.

Even if you have scaling problems on the order of, say, Twitter's, your application code still isn't your bottleneck; talking to your database is. No amount of framework (or lack thereof, as you propose/promote) will save you from the ramifications of poorly-constructed SQL.
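The "database round-trips dominate" point can be made with a toy model (no real ORM or database involved; the data and the fetch function are invented for illustration): the classic N+1 pattern, where a naive per-record loop issues one query per row, versus a single batched query.

```ruby
# Toy model: count round-trips for the N+1 pattern vs one batched
# query. A hash stands in for the database; a lambda counts "queries".
query_count = 0
db = { 1 => "alice", 2 => "bob", 3 => "carol" }
fetch = ->(ids) { query_count += 1; db.values_at(*ids) }

# N+1 style: one query per id, as a naive per-record loop might issue.
[1, 2, 3].each { |id| fetch.call([id]) }
n_plus_one = query_count

# Batched style: a single IN (...) query.
query_count = 0
fetch.call([1, 2, 3])
batched = query_count

[n_plus_one, batched]  # => [3, 1]
```

Whether the SQL is hand-written or ORM-generated, it's the round-trip count and query shape that matter, which is the parent comment's point.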


"ramifications of poorly-constructed SQL"

Surely the best defense against this is writing every query yourself rather than letting Rails do it? (As implied in the article).

My argument isn't Ruby execution speed vs PHP execution speed, it's Ruby writing its own SQL vs me writing it.

I agree 100% that the database is usually the bottleneck, it certainly is for my site.


I quickly got distracted by the F-16 flight manual in the "More from this user" sidebar. Downloading...


I'm surprised any website running at scale is letting an ORM generate SQL queries.


Just out of interest - why? Do you not trust developers to keep an eye out for bad generated queries, or do you have a problem with ORM generated SQL in general?


I'm sure the management at Scribd is happy that this ad (http://content.yieldmanager.edgesuite.net/atoms/2e/71/ab/d0/...) is annoying readers of their own presentation.



