I just came across the new book from Manning and Schütze, the authors of “Foundations of Statistical NLP” (FSNLP), which I mentioned in my previous post. They co-authored this one with Prabhakar Raghavan (Verity, Yahoo, Stanford, etc.). The book is titled “Introduction to Information Retrieval.” It is still in pre-print, and you can obtain an electronic version (HTML, PDF) from the book site.
Others have compared it to MG and say it’s better than FSNLP in that it’s more in-depth and has more practical examples. I will post my thoughts as I get through it.
I have been diving deeper into NLP-related code than I have before, using GATE as a platform for experiments. I chose GATE over some of the other platforms out there (e.g. UIMA, RapidMiner) because it is fairly actively developed, has a good set of built-in modules, and has a clean, simple API and plug-in architecture. The plug-in architecture has meant that wrappers for many of the other NLP packages already exist.
I built a couple of simple annotators that are useful building blocks for traditional heuristics used for NLP on some classes of documents. For example, in news stories the first paragraph or two matters more than the rest of the document, and the first sentence or two of a paragraph matters more than the rest of that paragraph. The first annotator assigns indexes: each sentence gets its position within its paragraph and within the document, and each paragraph gets its position within the document. One could, for example, use these indexes in a downstream annotator to assign higher scores to entities or noun phrases that occur earlier in the text. None of this is rocket science, and it has all been explored in the academic literature with many more complicated variations.
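To make the indexing idea concrete, here is a rough sketch in plain Python. It does not use the GATE API, and it is not the annotator described above. The names `SentenceInfo`, `index_sentences` and `position_weight`, the blank-line paragraph split, the naive punctuation-based sentence split, and the `decay` factor are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceInfo:
    text: str
    paragraph_index: int      # position of the paragraph in the document (0-based)
    index_in_paragraph: int   # position of the sentence in its paragraph (0-based)
    index_in_document: int    # position of the sentence in the document (0-based)


def index_sentences(document: str) -> List[SentenceInfo]:
    """Assign paragraph- and document-relative indexes to every sentence.

    Assumption: paragraphs are separated by blank lines, and sentences end
    in '.', '!' or '?' followed by whitespace. A real sentence splitter
    handles abbreviations, quotes, and so on.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    result = []
    doc_index = 0
    for p_index, paragraph in enumerate(paragraphs):
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        for s_index, sentence in enumerate(sentences):
            result.append(SentenceInfo(sentence, p_index, s_index, doc_index))
            doc_index += 1
    return result


def position_weight(info: SentenceInfo, decay: float = 0.5) -> float:
    """Hypothetical score: earlier paragraphs and earlier sentences weigh more."""
    return (decay ** info.paragraph_index) * (decay ** info.index_in_paragraph)
```

A downstream step could multiply an entity's or noun phrase's score by `position_weight` of the sentence it occurs in, so that mentions in the lead paragraph count for more than mentions near the end of the story.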
The second annotator annotates each document with frequency tables for noun and verb phrases. Again, a simple heuristic like taking the top N noun phrases or entities in a document gets you a long way toward interesting topic data. I ended up modifying the noun phrase and verb phrase chunkers in the process as well. All in all, these small projects have been a good, quick introduction to the guts of GATE.
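The top-N heuristic is easy to sketch outside GATE as well. The snippet below assumes the phrases have already been extracted by some chunker and arrive as plain strings. The function names and the lowercase normalization are illustrative assumptions, not part of the annotator described above.

```python
from collections import Counter
from typing import Iterable, List


def phrase_frequencies(phrases: Iterable[str]) -> Counter:
    """Build a frequency table over phrases, ignoring case and blank entries."""
    return Counter(p.strip().lower() for p in phrases if p.strip())


def top_phrases(phrases: Iterable[str], n: int = 3) -> List[str]:
    """Return the n most frequent phrases, most frequent first."""
    return [phrase for phrase, _ in phrase_frequencies(phrases).most_common(n)]


# Hypothetical chunker output for one news story.
noun_phrases = [
    "the Fed", "interest rates", "The Fed",
    "inflation", "interest rates", "the fed",
]
print(top_phrases(noun_phrases, n=2))
```

Lowercasing is the crudest possible normalization. Stemming or merging phrases that share a head noun would usually give cleaner topic candidates.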
My next project is to wrap some other NLP packages in the GATE plugin API, so that I can try out some other chunkers and POS taggers.
A plug for the book Foundations of Statistical NLP: it’s a great text for anyone starting out. Written in 1999, it’s still quite useful and relevant and covers most of the techniques in use now. Highly recommended.
I finally decided to start my own blog. I always wondered how people find the time to write up blog entries. I will shortly find out. The title of this post, apart from being my own rhetorical question, is also the title of David Cowan’s blog.