I have been using data integration and transformation software for almost a decade now and have used packages that manage data flow for very different use cases. Simulink, a product from the MathWorks, manages data flow in the context of simulating control systems and digital signal processing systems. Products like Ascential DataStage (now called IBM WebSphere DataStage) and Informatica, in the traditional category of ETL, are graphical front-ends for extracting, transforming, and loading data contained in SQL databases. The data engine in Endeca’s product manages processing and cleansing of both unstructured textual data and structured data (e.g., from databases). Finally, some data processing tools focus only on unstructured data (e.g., to aid information extraction) or on tasks like data mining, providing an infrastructure that allows one to easily swap in different algorithms and evaluate performance with different datasets, with the ultimate goal of building a good model (in the classification sense).
In addition to these tools, one can also use scripting languages like Perl to manipulate and modify text or XSLT to manipulate and modify XML content.
There are some interesting commonalities and differences between these categories of data processing tools. Over the next couple of posts, I will talk about each of these classes of tools in more depth, including sample use cases for each, the data structures that they use (and hence what types of data they can easily handle), and what lessons one might draw from each if one were going to build a new data transformation tool. Note that this is by no means an exhaustive set of these kinds of tools, but rather a small collection of those that I have used and know a little bit about.
I just came across the new book by Manning and Schütze, the authors of “Foundations of Statistical NLP” (FSNLP), which I mentioned in my previous post. They have co-authored this one with Prabhakar Raghavan (Verity, Yahoo, Stanford, etc.). The book is titled “Introduction to Information Retrieval.” It is still in preprint, and you can obtain an electronic version (HTML, PDF) from the book site.
Others have compared it to MG and say it’s better than FSNLP in that it’s more in-depth and has more practical examples. I will post my thoughts as I get through it.
I have been diving deeper into NLP-related code than I have before, using GATE as a platform for experiments. I chose GATE over some of the other platforms out there (e.g., UIMA, RapidMiner) since it is fairly actively developed, has a good set of built-in modules, and has a clean, simple API and plug-in architecture. The latter has meant that plug-in wrappers for many of the other NLP packages already exist.
I built a couple of simple annotators that are useful building blocks for traditional heuristics used for NLP on some classes of documents. For example, with news stories, the first paragraph or two is more important than the rest of the document, and the first sentence or two of a paragraph is more important than the rest of the paragraph. The first annotator assigns indexes to each sentence (relative to the paragraph and the document) and to each paragraph (relative to the document). One could, for example, use these indexes in a downstream annotator to assign higher scores to entities or noun phrases that occur earlier in the text. None of this is rocket science, and all of it has been explored in the academic literature, along with many more complicated variations.
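To make the indexing idea concrete, here is a minimal sketch in Python. It is not the GATE annotator itself and does not use the GATE API. The names (`SentenceAnnotation`, `index_sentences`, `position_weight`), the blank-line paragraph splitting, the naive regex sentence splitter, and the weighting formula are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class SentenceAnnotation:
    text: str
    paragraph_index: int        # position of the paragraph in the document
    sentence_in_paragraph: int  # position of the sentence within its paragraph
    sentence_in_document: int   # position of the sentence within the document


def index_sentences(document: str) -> list:
    """Assign paragraph and sentence indexes (all zero-based).

    Assumes paragraphs are separated by blank lines and that sentences
    end with '.', '!' or '?' followed by whitespace. A real pipeline
    would use a proper sentence splitter instead.
    """
    annotations = []
    doc_counter = 0
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    for p_idx, paragraph in enumerate(paragraphs):
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        for s_idx, sentence in enumerate(sentences):
            annotations.append(
                SentenceAnnotation(sentence, p_idx, s_idx, doc_counter)
            )
            doc_counter += 1
    return annotations


def position_weight(ann: SentenceAnnotation) -> float:
    """Hypothetical weighting: earlier paragraphs and sentences score higher."""
    return 1.0 / (1 + ann.paragraph_index) + 1.0 / (1 + ann.sentence_in_paragraph)


if __name__ == "__main__":
    doc = "A one. A two.\n\nB one."
    for ann in index_sentences(doc):
        print(ann, position_weight(ann))
```

A downstream step could multiply an entity's score by `position_weight` of the sentence it occurs in, which is one simple way to favor mentions near the top of a news story.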
The second annotator annotates each document with frequency tables for noun and verb phrases. Again, a simple heuristic, like taking the top N noun phrases or entities in a document, gets you a long way toward interesting topic data. I ended up modifying the noun phrase and verb phrase chunkers in the process as well. All in all, these small projects have been a good, quick introduction to the guts of GATE.
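The top-N heuristic is easy to sketch in Python. The snippet below is not the GATE annotator. It assumes the phrases have already been produced by a chunker, and the function names and the lowercasing normalization are hypothetical choices for illustration.

```python
from collections import Counter


def phrase_frequencies(phrases):
    """Build a frequency table from already-chunked phrases.

    Phrases are lowercased and stripped so that trivial variants
    ("The central bank" vs "the central bank") are counted together.
    """
    return Counter(p.strip().lower() for p in phrases if p.strip())


def top_phrases(phrases, n):
    """Return the n most frequent phrases as (phrase, count) pairs."""
    return phrase_frequencies(phrases).most_common(n)


if __name__ == "__main__":
    noun_phrases = [
        "the central bank",
        "interest rates",
        "The central bank",
        "inflation",
        "interest rates",
        "the central bank",
    ]
    print(top_phrases(noun_phrases, 2))
```

In practice one would probably also want to drop determiners or very short phrases before counting, but even this raw table gives a rough picture of what a document is about.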
My next project is to wrap some other NLP packages in the GATE plugin API, so that I can try out some other chunkers and POS taggers.
A plug for the book Foundations of Statistical NLP: it’s a great text for anyone starting out. Written in 1999, it’s still quite useful and relevant and covers most of the techniques in use now. Highly recommended.
I finally decided to start my own blog. I always wondered how people manage to have time to write up blog entries. I will shortly find out. The title of this post, apart from being my own rhetorical question, is also the title of David Cowan’s blog.