How Perl Saved the Human Genome Project (1996)

codex · on May 5, 2013

Perl and the human genome are almost perfectly matched; both are almost incomprehensible, with no central design, accreted haphazardly over a long time.

kbenson · on May 5, 2013

> with no central design

Any supporting evidence to that? If you look into it, you may be surprised.

Here's a hint, Perl doesn't necessarily optimize for the same things other languages optimize for.

nijk · on May 5, 2013

Perl 5 has no spec, just a bunch of features that seem helpful. It has a philosophy, not a design.

kamaal · on May 5, 2013

Perl has no central design? Seriously?

The very purpose it has survived three decades is because it solves a great variety of problems, in a very centralized design philosophy which no other tool has addressed till today.

nijk · on May 5, 2013

You tried to make a too-strong statement and then backed down when honesty got the best of you. Perl has a philosophy, not a design.

jgrahamc · on May 4, 2013

Nice. Also, The Perl Script That Powered The Alan Turing Petition: http://blog.jgc.org/2012/07/perl-script-powered-alan-turing....

bane · on May 5, 2013

Honest question from somebody that doesn't know any better, with the data from the HGP available for quite some time now, it doesn't appear to me (as a layperson) to have had the promised impact or suddenly providing a genetic map that will allow us to quickly find and target genetic diseases and other undesirable traits: would anybody knowledgeable in the field be able to provide some insight into what kind of impact the HGP data has had?

epistasis · on May 5, 2013

In terms of science research, the impact has been unparalleled. It's hard to find a molecular biology paper that doesn't owe a ton to having the full human genome sequence. There have also been technology side effects. Just like defense projects fueled the early market for silicon-based semiconductors, the HGP kick-started a market for sequencing machines that is generating a huge revolution in sequencing technology where costs are now falling 5x-10x per year, which is quite a bit faster than Moore's law.

In terms of finding genetic causes of disease, this happens every day, and is practically mundane, but there are two complications towards getting to cures. First, most disease is far far more complicated than a single gene; any single gene may account for just a percent or two of what we call the same disease. Second, knowing which gene is broken does not provide a cure for that disease; even for a given small molecule, determining if it will interact with a gene's protein or have any effect on that protein's structure or function is a task that physics has not been able to tackle. Additionally, the genome has only been available for a mere decade, and for many if not most diseases, the process of going from a known gene target an approved drug is going to take far longer than 10 years.

So the HGP has fueled a huge amount of discovery, is the foundation of nearly all human biology research, and is completely indispensable, but in terms of new cures for various diseases it has not delivered, yet, but really it shouldn't have to.

ramanujan · on May 5, 2013

Genomics is now in the clinic in a big way, albeit for diagnostics rather than cures. Two recent links you might find informative:

http://techcrunch.com/2013/04/23/counsyl http://techcrunch.com/2013/04/25/sv-angel-health-informatics

ben1040 · on May 5, 2013

Here's something from the news last year. Without the reference genome this wouldn't be possible.

http://www.nytimes.com/2012/07/08/health/in-gene-sequencing-...

This was just one person, sure, but there are a lot of smart people working very hard on making this scale.

bane · on May 5, 2013

So I guess my question is, and I've seen lots of articles and breakthroughs in genome sequencing, is what use has the actual data been in the HGP? It seems to me that things like this are about as based on the HGP data as velco is to NASA. It's an enormously beneficial spinout technology that happened to have developed as a side-benefit of the main work. I don't know if I'd go so far as to say the sequencing or velcro would have never been developed without the main research focii, but it didn't hurt.

ben1040 · on May 6, 2013

The HGP reference genome is pretty much essential to the "whole genome" analysis done on humans and that's the big direction in research right now. I work in cancer and disease genomics doing data analysis software and all of the analysis methodology goes back to this reference in some way.

Sequencing technology has gotten to a point where it's just blown Moore's Law absolutely out of the water and we can't throw more compute at the analysis problem, we have to make it smarter. The reference genome is used in how that's been made smarter.

It helps to discuss a little bit about how the HGP reference was produced, and why producing it took 10 years and three billion dollars.

The HGP process first had a map made, where the genome was broken into lots of smaller segments. The idea was that this reduced your problem space; any segment of DNA produced from a sample from that portion of the genome came from that area. Then that segment was broken into lots and lots of smaller chunks and then read on the sequencing machines in 600-800 base segments. By the time that sequencing technology reached "max level," the state of the art machine could generate 96 of those segments in an hour's time.

Then you'd calculate overlaps and assemble those smaller "reads" back into a sequence of that chunk you chose from the map. Then someone would audit the computer-generated assembly by hand, possibly ordering up more lab work to fill any gaps or resolve areas of crummy data. Repeat for the next chunk from the map.

Now here's how things work, when we need to do any sort of genomic analysis on an individual:

New technology has the ability to sequence human genomes at deep coverage in 11 days[1], and cranks out 6 billion reads 100bp long from places all over the genome. Computationally, this is an absolutely different animal. You can't feasibly try to re-assemble these reads into a human. So, what we do is use string matching algorithms to "map" a 100bp read back to where it most likely came from, using the HGP genome as a reference.[2] Since obviously your DNA does not match the HGP reference base-for-base, and mismatches/insertions+deletions are really where the interesting data is anyway, there's some leeway for mismatches in the mapping.

At that point, by mapping reads back to where they came from, we end up with a data file that represents an individual's genome. You're able to walk across the genome base for base and ask "So, base 347 of Chromosome 7 is a T in the reference, what is the most likely base on Joe's genome at this point given the reads we have that span this base?"

Mapping things to the reference also allows us to attempt to find really interesting stuff that can cause disease, such as structural variations in the genome. These are instances where large segments are removed, duplicated, inverted, or picked up and moved somewhere else relative to where they "should be."

[1] http://www.illumina.com/systems/hiseq_comparison.ilmn

[2] http://bio-bwa.sourceforge.net/ is the tool that's most popular these days.

micro_cam · on May 5, 2013

And set us back years as well.

Too much perl code is essentially write once and forget. It gets results quick but it is a disaster for repeatability which is an essential part of science. I've worked on bioinformatics perl projects where bugs canceled each other out (ie code that was supposed to clear an array and repopulate it with corrected values did neither so the original values were returned). And I've spent far too many hours trying to figure out what a perl script that is the reference implementation for a certain procedure actually does.

Their certainly still are scientists who use it but python and R are gaining ground for good reason.

Wiring together analysis pipelines with pipes as they describe is, however, an excellent technique regardless of language.

Mithaldu · on May 5, 2013

> Too much perl code is essentially write once and forget.

Please stop repeating this misconception. People who put little effort into learning how to program write "write-once" code in any language. Perl had the "misfortune" of being the only dynamic language on the block for a long time, leading to many people reaching for it to get things done without bothering to actually learn the language, thus creating a vast corpus of low quality code.

(It does not help that the definitive resource of Perl for bioinformatics people, which i've seen in libraries like those of the Genome Campus in Cambridge, isn't worthy of being used as toilet paper, yet influenced a whole generation of scientists.)

> I've spent far too many hours trying to figure out what a perl script does

How often do you reach for perltidy when you do this?

PySlice · on May 5, 2013

I don't think that is a misconception.

Perl is the language with the highest occurrence of "subtle" and "ambiguous" in its documentation and tutorials that I have ever seen. Humans may be subtle and ambiguous, but programming languages should be clear and precise.

Mithaldu · on May 5, 2013

If you contrast the original statement i quoted as a misconception, you will find that your opinion still is not the same as the actual misconception.

bsima · on May 5, 2013

What is this "definitive resource" and what resource would you recommend for learning perl? I'm learning now (self-teaching) and don't want to learn the wrong way.

Here's the book I've been using: http://bixsolutions.net/

adrianhoward · on May 5, 2013

The two books I'd strongly recommend:

* chromatic's "Modern Perl" - which has a free CC licenced version (although it would be nice for you to pay some money to support the good work of the author and publisher ;-) http://www.onyxneon.com/books/modern_perl/index.html

* Curtis Poe's "Beginning Perl" http://ofps.oreilly.com/titles/9781118013847/

(Bias warning: Curtis is a mate - but it is a very good book on the topic. Honest ;-)

bsima · on May 7, 2013

Thanks for these - much appreciated!

Mithaldu · on May 5, 2013

I do not remember the exact title anymore. It was a book that taught how to build web applications by interleaving data retrieval and html printing, instead of building it up in proper MVC fashion, to name one example of the damage it did.

As for learning Perl, the best resource is http://perl-tutorial.org

You will find an explanation there of how to judge a learning resource and a list of the most current texts that teach modern Perl.

bsima · on May 7, 2013

Thanks man, much appreciated :)

nijk · on May 5, 2013

"Proper" MVC fashion.

Mithaldu · on May 5, 2013

Please elaborate. :)

Moto7451 · on May 5, 2013

I've had to deal with my fair share of C# which was written in a "write once" manner. My favorite being when a particular dev wrote an entire Silverlight application while ignoring how the framework functioned. The best bit of obfuscation were sections of code littered with constructs like "((DesiredType)parent.parent.parent.parent.parent).method" where each parent was an Object reference.

Any language can be used to write a riddle. Even Python[1][2]. It's the community's values that determine code quality. The banner of "impossible to read" hangs over the Perl community and it has been my experience that it reminds us to write much better code.

[1] http://www.python.org/search/hypermail/python-1993/0232.html [2] http://p-nand-q.com/python/obfuscated_python.html

micro_cam · on May 6, 2013

Good perl programers write very readable code. The problem is that academic code bases often have a lot of involvement from novice new programers.

Mithaldu · on May 5, 2013

> The banner of "impossible to read" hangs over the Perl community

It hangs there, put there by other people. Sadly it correctly would hang above people writing perl but outside the community. We have many efforts going on to get people to write nice and readable Perl.

nijk · on May 5, 2013

It is insane to claim that R is a language of repeatable science.

Python is barely better.

None of these languages come anywhere near a reliable verifiable formal model of what they claim to compute.

sciurus · on May 4, 2013

17 years later, Perl still seems to be the go-to tool in bioinformatics.

epistasis · on May 4, 2013

In my personal experience, I see no new Perl scripts these days. Python has completely replaced Perl for new code.

Mithaldu · on May 4, 2013

If you say something like this you also need to quantify your personal experience. Does it consist of a 3 person department? Or one school with a pretty big bio informatics department? Or do you contract out to many different schools?

In my experience as a global contractor, there are at least big players who use Perl exclusively, and in Perl irc channels i see bio informatics people very regularly.

epistasis · on May 4, 2013

I would hope that any one person's "personal experience" is not taken so seriously that it needs to be quantified.

UCSC has switched from teaching Perl to teach Python in intro bioinformatics classes (they get 5-10 new PhD students a year, plus at least that many masters students, I think?). Nobody that I encountered at my postdoc institution used Perl, maybe about 60 people who programmed, and I knew enough of ~20 people's habits to know their programming choices. R, Python, and C/C++ were commonplace, with a few weirdos using Ruby, and of course lots of shell, awk, and sed glue to keep it all together. The Broad has been somewhat successful on shoving Java into people's pipelines with GATK and Picard, but it's not a welcome addition to many people's habits, and I haven't encountered any significantly used project around next gen sequencing that is based in Perl. For example, in the RNA world, all the common tools like Tophat, Mapsplice, and ChimeraScan all use Python in ways that Perl would have been used 10-15 years ago.

I have active collaborations with about five different labs that also have in-house informatics for everything from microarrays, to ChIP-Seq, to RNA-seq, exome to whole-genome resequencing, and nobody uses Perl for any of it.

That said, it's really easy to collaborate with lots of informatics people and never even need to know what they use internally. What's this about contracting in bioinformatics? I've never encountered such a thing.

Mithaldu · on May 5, 2013

It seems the main difference here is that you're mainly talking to people in the USA, while i talk mainly to people in Europe.

As for contracting: Oftentimes bioinformatics people will realize that they're not good programmers at all and call in people to help make their code less of a rat's nest and learn from them at the same time. It happens quite often here in Europe.

kevinwuhoo · on May 5, 2013

One point of evidence can be the BioPerl and BioPython communities. On github there are currently 111 stars and 57 forks for BioPerl while there are 350 stars and 179 forks for BioPython.

BioPerl: https://github.com/bioperl/bioperl-live

BioPython: https://github.com/biopython/biopython

Mithaldu · on May 5, 2013

Fair point. I do suspect that this is in part due to a difference in user demographics.

People using BioPerl just use it to get something done and don't care to join the larger community. In Perl we call it the DarkPAN.

Meanwhile anyone wishing to use Python will at least have needed to make an effort to learn the language, since there is nothing else quite like it; and as such probably more likely to invest further effort.

But, i do admit that this is conjecture.

sillywabbit · on May 4, 2013

I see a mix. I use Python, but during undergrad one of the required classes taught Perl. Now at a different university, the local bioinformatics group does a "Perl for Biologists" class, and I know at least one lab that still writes new stuff in Perl to stay consistent with legacy code.

bitcartel · on May 4, 2013

Do the classes teach "Modern Perl" e.g. Moose, DBIx, Critic ?

sillywabbit · on May 4, 2013

The undergrad class was back in 2011 and just went through the basics using Learning Perl, 5th Edition. I haven't personally been to the workshop here, but from their schedule it looks like they spend several sessions on basic programming concepts then spend one on OOP and hop to BioPerl, interacting with FTP sites, then "practical examples". The lab I know that uses Perl does use Moose though.

smegel · on May 4, 2013

Does python have a comparable library/framework to bioperl?

sillywabbit · on May 4, 2013

BioPython - http://biopython.org/

aswanson · on May 4, 2013

Thank god. May the snoopy-cursing syntax rest in pieces.

aswanson · on May 4, 2013

My most sincere apologies to lovers of the snoopy-cursing syntax. In return, I only ask that you refrain from anything remotely nearing programming language design.

bsima · on May 4, 2013

I'm learning Perl now, specifically for bioinformatics. Appreciated this article

draegtun · on May 4, 2013

Here's an HN post (with comments) from couple of years ago on same article (though it was via DrDobbs) - https://news.ycombinator.com/item?id=1568109

mrmagooey · on May 5, 2013

The PUG that I'm a member of had a very interesting presentation of PyCogent (http://pycogent.org/) which is meant to be a Python based successor to BioPerl. IANA bioinformatics researcher so have no idea as to the actual relative strengths of each, but the PyCogent guys appear to have put the hard yards in (~8 years of development and still going)

manish_gill · on May 4, 2013

Is there any point for new programmers to learn Perl, when they have the choice of Ruby (which supposedly is inspired from Perl) ?

timr · on May 4, 2013

Perl is very different than Ruby. It's also very different than Python. It's also faster and more concise than both.

You should learn Perl because you're likely to encounter a lot of Perl code in the wild, and because you can learn something from it. Knowing how to generate a Perl one-liner that does something incredible will take your CLI skills to a new level in a way that knowing Ruby will not.

Perl is still one of my go-to languages for sysadmin scripts, because it's so concise and powerful. It's a long-beard language.

protospork · on May 4, 2013

Thank you, that's a better-written answer than anything I could get down. I'd just like to add that (as the article does a pretty good job of saying) perl is /unparalleled/ as a text-processing language. There is no other language that even comes close to perl's ease and utility for string manipulation.

timr · on May 5, 2013

That's a good point. When I think over what I use Perl for in practice, it's usually some form of text processing. Perl is just the swiss army knife of I/O munging.

sliverstorm · on May 5, 2013

This is why Perl is my main workhorse. I process text almost exclusively.

obviouslygreen · on May 5, 2013

The only part of this I'm going to take issue with is "You should learn Perl because you're likely to encounter a lot of Perl code in the wild."

"...because you can learn something from it" is a far better reason, because it's actually valid. Now, specifying what you're suggesting could be learned would be a good step, but I'd take it as-is without a problem.

The former, though? No, Perl is no longer occupying visible roles to the point where any new programmer can assume they'll be seeing it at all, much less "encounter a lot of Perl in the wild." It's out here, sure, but it's hidden in commands we use or scripts invoked by background tasks non-systems-programmers have absolutely no reason to touch.

Learning to do irresponsibly illegible things in Perl will teach you some great regex and... Perl. Learning Ruby (or Python, or Java, or even PHP) will get you actual work -- that doesn't involve being sequestered in backwards academia -- far more often now than Perl.

Mithaldu · on May 5, 2013

> Learning Ruby (or Python, or Java, or even PHP) will get you actual work far more often now than Perl.

Please only say such things when you can back them up. Yes, learning Java will get you more jobs (though, who really wants to do Java for a living?), but you'll still enounter Perl in more jobs than the other languages you mentioned:

http://www.indeed.com/jobtrends?q=Perl%2C+Ruby%2C+Python%2C+...

PySlice · on May 5, 2013

What one can learn from Perl:

When programming in a language you hate, don't take naming arguments for granted. It could be worse: you could be programming in Perl, where all arguments come in a flattened array.

kbenson · on May 5, 2013

This is really not much of an issue, if you bother to learn the language. If you don't, then yes, things will be confusing.

    sub named_params {
      my %P = @_; # named params in hash %P
      my $foo = $P{bar} + $P{baz};
      return $foo * $P{multiplier};
    }
    
    my $rvalue = named_params( foo => 1, bar => 2, multiplier => 3 );

Or, some people prefer to just pass in a hash reference:

    sub named_params_hashref {
      my $P = shift;
      my $foo = $P->{bar} + $P->{baz};
      return $foo * $P->{multiplier};
    }
    my $rvalue = named_params_hashref({
      foo => 1,
      bar => 2,
      multiplier => 3,
    });

But of course, why bother letting common sense, and a single line of code stand in the way of your chance to deliver a pithy remark?

tudorconstantin · on May 5, 2013

just like in javascript - oh wait, that is the shiny new trend in programming, there it can't be bad thing the lack of named params.

berntb · on May 5, 2013

"PySlice" with karma 25 is probably your trolling account. If not, ASK about stuff you have no clue about. It is better than looking like a little troll kid, having "fun" lowering the quality of HN by stupid language war comments.

>>where all arguments come in a flattened array

That is optional, through the power of CPAN (and kbenson's methods, of course). Examples of alternatives:

http://search.cpan.org/~barefoot/Method-Signatures-20130222/...

http://search.cpan.org/~flora/MooseX-Declare-0.35/lib/MooseX...

http://search.cpan.org/~ether/MooseX-Method-Signatures-0.44/...

(Depending on if you use Moose or not.)

One of the cool parts of Perl is that the language is extensible, that is why it e.g. has a better OO than Python, Ruby, etc. (Or simple typing of parameter arguments, see the modules above. Or why MooseX::Declare has ... ah, look yourself, troll.)

nijk · on May 5, 2013

"the" language is then more of a loosely couples family of languages.

berntb · on May 7, 2013

Not half as much as any Lisp variant with good macros.

Edit: The possibility to extend your environment is generally seen as a good thing, unless you believe in cleanliness of bloo... languages for ideological reasons.

nijk · on May 5, 2013

That is a lie. Some of the arguments are provided as global variables like $_, (no need to remember its name, since its value will be implicitly used whenever you omit a necessary argument) whose value is decided by something akin to Clippy choosing what seems most handy.

kbenson · on May 5, 2013

$_ is not global, it's lexical. It's value is not haphazardly chosen, it is the topic for structures that topicalize.

You seem to be proving what I said earlier[1]. You think you understand what's going on, but in truth you don't. Feel free to continue making assertions regarding things you don't know, I can't stop you, but be aware that to do so is dishonest.

[1]: https://news.ycombinator.com/item?id=5657901

Mithaldu · on May 5, 2013

$_ is very clearly and sharply defined. Only because you never took even a slightly look at the documentation does not mean you should just make up things. :)

abraininavat · on May 5, 2013

Every article about Perl leads to the same pattern of comments. Most people think Perl is horrible and lends itself to incomprehensible code. And Perl people have their backs against a wall, furiously defending their language with prevarications like Perl had the "misfortune" of being the only dynamic language on the block for a long time, leading to many people reaching for it to get things done without bothering to actually learn the language, thus creating a vast corpus of low quality code.

Sure, couldn't have anything to do with the language. The whole rest of the world just doesn't get it.

kbenson · on May 5, 2013

I think it's more a case of people think they understand Perl, or can make assumptions because of their existing C/PHP/Shell programming experience and apply it to Perl without problem, and that is not always the case. The fact is, Perl is fundamentally different, but looks just similar enough to fool people.

If it looked like Lisp, people would be less likely to think that it's just a matter of applying their C experience, but alas, it generally looks pretty familiar, if a bit messy, to users of other imperative languages.

If you are trying to understand, write or change a Perl script, and you don't know what context a statement takes place in, or what I mean by context in this case, then you don't know what you are doing. (I mean you in the general sense, not as an indictment against the parent).

Mithaldu · on May 5, 2013

> The whole rest of the world just doesn't get it.

Never claimed that. If you read the quote you chose, you will find that i claimed that many did not even bother to try and get it, because they got things done and did not have any reason to get it.