
tl;dr - this does work to a point. It won't necessarily teach you the idioms and community practices that come with experience, but what you do learn is surprisingly sticky.

I had the great pleasure, years ago in my undergrad Operating Systems class, of being handed the class assignment "write an OS in Java"...which of course went to a group of students who had never seen Java. By the end of the semester we had written the core guts of a multi-tasking OS, a couple of shells, the display systems to handle things like a unix-like console, a sane piping system, all the major userland utilities (sans some of the compiler things, but things like ls, cat, ps, etc.), a simple text editor, an intra-system messaging system, etc. etc. etc.

It was a great curriculum and really was the first time we, as CS students, had the chance to spend time understanding the subject matter without focusing on stupid language tricks like we had in our various C and C++ classes. The code we wrote was fairly straightforward (we were learning the language as we went, so we kept to the KISS method) and we focused instead on the material. It was probably among the hardest and best classes I've ever had on any subject.

Did I know Java at the end of it?

To a point -- I knew the pidgin dialect we wrote the OS in. A few semesters later I took a fluff software engineering course and had to hack out various Java server bits, and I had a roughshod time of it as I ran head first into the now-common overengineeringitis that plagues modern Java development. I found the syntax and most of the standard library familiar, but the idiomatic ways of writing the code, the community practices, and the shibboleths were nearly impenetrable without years buried in an enterprise software house.

I swore off Java and never looked back...moving on to Perl and Python for a spell. (Incidentally, my standard "learn a new language" project is to write a simple non-lexical phrase extractor. It touches I/O, data structures, database connectivity, program flow, and, if I get daring, multi-threading and a few other odds and ends, and it usually gives me a pretty good idea of how a language works.)

Now, years later, taking a look at Android dev, I'm finding that writing code for the platform, even though it's Java, is like writing code for our old OS. It's pretty simple, there's great library support, and I don't have to wrap simple method calls in hundreds of lines of framework boilerplate nonsense. It's actually pretty fun.

But I've definitely been drawing heavily on that pidgin dialect of Java that I learned way back when -- it's kinda like riding a bicycle, except a few bits have changed here and there. So yeah, I think I did "learn" the language, and it's been amazing how much of it I can recall, given that it's been a decade since I did any coding in it.

(this method also handily solves the "I need a project, a goal, to learn the language, otherwise I'm just twiddling bits" problem).



What is a non-lexical phrase extractor?

I googled it but it leads back to this page.


<META>

> I googled it but it leads back to this page.

It blows my mind how often this happens to me, even though I understand how and why. Especially since it's usually just a few minutes after the original comment is written. Google is awesome.

</META>


It's a fun way to find sequential tokens (words) that have a high probability of being the names of people.

- Take a lexicon (list) of single words in a given language (English for example).

- Take a block of text and think of it as an ordered sequence of tokens. News articles work really well for this approach; books not as much.

Example (from http://www.cnn.com/2012/11/02/showbiz/movies/flight-review-c...): With its spectacular plane crash -- I would rate it fractionally behind the air disasters director Robert Zemeckis staged in "Cast Away" and Joe Carnahan in "The Grey," but still more than gut-wrenching enough to make you think about taking the train, next time -- "Flight" immediately raises the stakes on your typical addiction drama. But that's essentially what it is -- with a courtroom finish for extra lift.

- Stream through the text and throw away any word that is in your lexicon.

---- ---- ----------- ----- ----- -- - ----- ---- -- ------------ ------ --- --- --------- ------- Robert Zemeckis ------ -- "---- ----" --- Joe Carnahan -- "--- ----," --- ----- ---- ---- ------------- ------ -- ---- --- ----- ---- ----- --- -----, ---- ---- -- "------" ----------- ------ --- ------ -- ---- ------- --------- -----. --- ----'- ----------- ---- -- -- -- ---- - --------- ------ --- ----- ----.

- Treat each group of adjacent remaining tokens as a separate object; in this case we have 2.

- Write these non-lexical (not in your original lexicon) sequences out:

Robert Zemeckis

Joe Carnahan

Boom, you just made an entity extractor that plucks names out of text without having to model the English language too rigorously. It generally works in most languages that have a low intersection between name-part tokens and lexicon tokens. And it can be brutally fast.
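The steps above can be sketched roughly like this in Python. This is only a minimal illustration, not the commenter's actual code: the tiny `LEXICON` set, the `extract_phrases` name, and the letters-only tokenizer are all assumptions made for the example, and a real lexicon would hold tens of thousands of words.

```python
import re

# Hypothetical toy lexicon (an assumption for this sketch); a real one
# would be a full word list for the language.
LEXICON = {"the", "air", "disasters", "director", "staged", "in",
           "cast", "away", "and", "grey"}

def extract_phrases(text, lexicon):
    """Return runs of consecutive tokens that are not in the lexicon."""
    phrases, run = [], []
    # Assumed tokenizer: runs of ASCII letters; punctuation is ignored.
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() in lexicon:
            # A lexical word ends the current non-lexical run, if any.
            if run:
                phrases.append(" ".join(run))
                run = []
        else:
            run.append(token)
    if run:
        phrases.append(" ".join(run))
    return phrases

sample = ("the air disasters director Robert Zemeckis staged in "
          "Cast Away and Joe Carnahan in The Grey")
print(extract_phrases(sample, LEXICON))
# prints ['Robert Zemeckis', 'Joe Carnahan']
```

Because this version ignores punctuation entirely, two names separated only by a comma or a period would be glued into one run.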

Where this gets interesting is in suppressing junk and tweaking the algorithm around things like parentheses, apostrophes, and sentence boundaries. There are lots of little edge cases like this that you have to be mindful of. Numbers in the text, for example, are never parts of names but aren't in your lexicon, so you have to figure out what to do with those. And then you can use other heuristics to improve the results. Suppose another sentence just had "Zemeckis" in it: that's a name. But then suppose another sentence had a token not in your lexicon like "Samoflange"...do you count that as a name? What about lexical tokens that are names, like "Bush"? So you can try things like only counting sequences that have 2 or more tokens (like "Robert Zemeckis") and ignoring ones that have only 1.
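A few of these tweaks might look something like the sketch below. The specific choices are assumptions for illustration, not the commenter's rules: numbers and any punctuation mark end a run, a trailing possessive 's is stripped, and runs shorter than a hypothetical `min_tokens` parameter are dropped. The tokenizer only understands ASCII letters.

```python
import re

# Assumed tokenizer: a word (letters, optional internal apostrophes),
# or "other" (a number or any single non-space character).
TOKEN = re.compile(r"(?P<word>[A-Za-z]+(?:'[A-Za-z]+)*)|(?P<other>\d+|\S)")

def extract_names(text, lexicon, min_tokens=2):
    """Non-lexical runs, broken at punctuation and numbers, filtered by length."""
    names, run = [], []

    def flush():
        # Keep the current run only if it is long enough, then reset it.
        if len(run) >= min_tokens:
            names.append(" ".join(run))
        run.clear()

    for match in TOKEN.finditer(text):
        if match.lastgroup != "word":
            flush()  # numbers and punctuation always end a run
            continue
        token = match.group()
        if token.lower().endswith("'s"):
            token = token[:-2]  # treat "Zemeckis's" as "Zemeckis"
        if token.lower() in lexicon:
            flush()
        else:
            run.append(token)
    flush()
    return names

lexicon = {"directed", "it", "saw", "film", "in"}  # hypothetical toy lexicon
text = "Robert Zemeckis directed it. Samoflange saw Zemeckis's film in 2012."
print(extract_names(text, lexicon))                # prints ['Robert Zemeckis']
print(extract_names(text, lexicon, min_tokens=1))
# prints ['Robert Zemeckis', 'Samoflange', 'Zemeckis']
```

The length filter shows the trade-off from the paragraph above: it suppresses "Samoflange" but also loses the bare "Zemeckis". Lexical names like "Bush" are not handled here at all.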

And it goes on and on -- endless tweaks to improve the quality of the names you get and suppress non-name sequences.

To make the project more interesting, try storing your lexicon in a database or some kind of index so you can search it quickly. I like to use SQLite files with indexes on the lexicon table myself, but it's a fun assignment to try different things, like in-memory tries.
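For the SQLite variant, a rough sketch using Python's standard `sqlite3` module could look like the following. The table name `lexicon`, the column `word`, the index name, and both helper functions are assumptions for the example; it uses an in-memory database by default, but a file path works the same way.

```python
import sqlite3

def build_lexicon_db(words, path=":memory:"):
    """Create (or open) a lexicon table with a unique index on the word column."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS lexicon (word TEXT NOT NULL)")
    # The index is what makes per-token lookups fast on a large lexicon.
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS idx_lexicon_word ON lexicon (word)"
    )
    # Store words lowercased; duplicates are skipped thanks to the unique index.
    conn.executemany(
        "INSERT OR IGNORE INTO lexicon (word) VALUES (?)",
        ((w.lower(),) for w in words),
    )
    conn.commit()
    return conn

def in_lexicon(conn, word):
    """Return True if the lowercased word is present in the lexicon table."""
    row = conn.execute(
        "SELECT 1 FROM lexicon WHERE word = ? LIMIT 1", (word.lower(),)
    ).fetchone()
    return row is not None

conn = build_lexicon_db(["The", "director", "staged", "the"])
print(in_lexicon(conn, "Director"))  # prints True
print(in_lexicon(conn, "Zemeckis"))  # prints False
```

Swapping `token.lower() in lexicon` for a call like `in_lexicon(conn, token)` is then the only change the extractor loop needs.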

If you want to try threading, you can try playing around with searching the text at different start and end points (thread 1 searches the first 25% of the text, thread 2 the second 25%, etc.) or have different threads search different articles.
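The "different threads search different articles" option is probably the easier of the two to sketch, since splitting one text at fixed offsets can cut a name in half at a boundary. Below is a minimal version using `concurrent.futures` from the standard library. The function names and the `workers` parameter are assumptions for the example, and in standard CPython the global interpreter lock means this is more an exercise in the threading API than a real speedup for pure-Python work.

```python
import re
from concurrent.futures import ThreadPoolExecutor

def extract_phrases(text, lexicon):
    """Return runs of consecutive tokens that are not in the lexicon."""
    phrases, run = [], []
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() in lexicon:
            if run:
                phrases.append(" ".join(run))
                run = []
        else:
            run.append(token)
    if run:
        phrases.append(" ".join(run))
    return phrases

def extract_from_articles(articles, lexicon, workers=4):
    """Run the extractor over each article on a small thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input articles.
        return list(pool.map(lambda a: extract_phrases(a, lexicon), articles))

lexicon = {"the", "director", "and", "in", "film"}  # hypothetical toy lexicon
articles = ["the director Robert Zemeckis", "and Joe Carnahan in the film"]
print(extract_from_articles(articles, lexicon))
# prints [['Robert Zemeckis'], ['Joe Carnahan']]
```

The lexicon is only read, never modified, so sharing one set across the threads is safe here.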

You can try all kinds of different things to keep it interesting and as you start abstracting the problem you can play with all kinds of different control and data structures to accomplish the task. Trying to make this as fast as possible (with all the heuristics turned on) can also be a fun challenge.


My standard "learn a new language" project is usually a calculator... it touches many aspects, but I have grown bored of it. A phrase extractor is simple but can grow much more complex; I might try it for my next language!


Another good one I find is a poker application!



