I have been digging deeper into NLP-related code than I have before, using GATE as a platform for experiments. I chose GATE over some of the other platforms out there (e.g. UIMA, RapidMiner) because it is fairly actively developed, has a good set of built-in modules, and has a clean, simple API and plug-in architecture. That plug-in architecture means that wrappers for many of the other NLP packages already exist.
I built a couple of simple annotators that are useful building blocks for traditional heuristics used for NLP on some classes of documents. For example, in news stories the first paragraph or two matters more than the rest of the document, and the first sentence or two of a paragraph matters more than the rest of that paragraph. The first annotator assigns position indexes: each sentence gets its index within its paragraph and within the document, and each paragraph gets its index within the document. One could, for example, use these indexes in a downstream annotator to assign higher scores to entities or noun phrases that occur earlier in the text. None of this is rocket science, and it has been explored in the academic literature with many more complicated variations.
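To make the indexing idea concrete, here is a minimal sketch in plain Java. It does not use the GATE API and is not the annotator described above: the class and method names (PositionIndexer, index, positionWeight) are made up for illustration, paragraphs are assumed to be separated by blank lines, sentences are split with a naive punctuation rule, and the weighting formula is just one arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of GATE: assigns position indexes to sentences.
class PositionIndexer {

    /** One sentence plus its positions; all indexes are 0-based. */
    static final class IndexedSentence {
        final String text;
        final int paragraphIndex;       // paragraph position within the document
        final int sentenceInParagraph;  // sentence position within its paragraph
        final int sentenceInDocument;   // sentence position within the document

        IndexedSentence(String text, int paragraphIndex,
                        int sentenceInParagraph, int sentenceInDocument) {
            this.text = text;
            this.paragraphIndex = paragraphIndex;
            this.sentenceInParagraph = sentenceInParagraph;
            this.sentenceInDocument = sentenceInDocument;
        }
    }

    /** Assumption: blank lines separate paragraphs; ., ! or ? ends a sentence. */
    static List<IndexedSentence> index(String document) {
        List<IndexedSentence> out = new ArrayList<>();
        int paragraphIndex = 0;
        int sentenceInDocument = 0;
        for (String paragraph : document.trim().split("\\n\\s*\\n")) {
            if (paragraph.trim().isEmpty()) {
                continue;
            }
            int sentenceInParagraph = 0;
            for (String sentence : paragraph.trim().split("(?<=[.!?])\\s+")) {
                if (sentence.isEmpty()) {
                    continue;
                }
                out.add(new IndexedSentence(sentence, paragraphIndex,
                        sentenceInParagraph, sentenceInDocument));
                sentenceInParagraph++;
                sentenceInDocument++;
            }
            paragraphIndex++;
        }
        return out;
    }

    /** Illustrative weight only: earlier paragraphs and earlier sentences score higher. */
    static double positionWeight(IndexedSentence s) {
        return 1.0 / ((1 + s.paragraphIndex) * (1 + s.sentenceInParagraph));
    }
}
```

A downstream scorer could multiply an entity's or noun phrase's base score by positionWeight of the sentence it appears in. In a real pipeline the sentence and paragraph boundaries would come from existing annotations rather than from regular expressions.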
The second annotator annotates each document with frequency tables for noun and verb phrases. Again, a simple heuristic like taking the top N noun phrases or entities in a document gets you a long way toward interesting topic data. I ended up modifying the noun phrase and verb phrase chunkers in the process as well. All in all, these small projects have been a good, quick introduction to the guts of GATE.
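The frequency-table heuristic is easy to sketch outside of GATE as well. The following plain Java example assumes the phrases have already been extracted by a chunker and arrive as strings. The names (PhraseFrequency, countPhrases, topN) are hypothetical, and the normalization (trimming and lowercasing) is an assumption, not necessarily what the annotator above does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of GATE: builds a phrase frequency table.
class PhraseFrequency {

    /** Counts phrases after trimming and lowercasing; blank phrases are skipped. */
    static Map<String, Integer> countPhrases(List<String> phrases) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String phrase : phrases) {
            String key = phrase.trim().toLowerCase(Locale.ROOT);
            if (!key.isEmpty()) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns up to n entries, highest count first; ties are broken alphabetically. */
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.min(Math.max(n, 0), entries.size());
        return new ArrayList<>(entries.subList(0, limit));
    }
}
```

Calling topN(countPhrases(nounPhrases), 5) on the noun phrases of a single document gives a rough list of candidate topics. Keeping separate tables for noun phrases and verb phrases, as described above, just means calling countPhrases once per phrase type.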
My next project is to wrap some other NLP packages in the GATE plug-in API so that I can try out some other chunkers and POS taggers.
A plug for the book Foundations of Statistical Natural Language Processing (Manning and Schütze): it’s a great text for anyone starting out. Written in 1999, it’s still quite useful and relevant and covers most of the techniques in use now. Highly recommended.