I have been using data integration and transformation software for almost a decade now, and I have used packages that manage data flow for very different use cases. Simulink, a product from The MathWorks, manages data flow for simulating control systems and digital signal processing systems. Products like Ascential DataStage (now called IBM WebSphere DataStage) and Informatica, in the traditional ETL category, are graphical front ends for extracting, transforming, and loading data contained in SQL databases. The data engine in Endeca’s product processes and cleanses both unstructured textual data and structured data (e.g., from databases). Finally, some data processing tools focus only on unstructured data (e.g., to aid information extraction), while others focus on tasks like data mining, providing an infrastructure that makes it easy to swap in different algorithms and evaluate performance on different datasets, with the ultimate goal of building a good model (in the classification sense).
In addition to these tools, one can also use a scripting language like Perl to manipulate text, or XSLT to transform XML content.
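To give a flavor of the scripting approach, here is a minimal sketch of a text cleanup step. It is written in Python rather than Perl purely for illustration; the function name `normalize_record` and the assumption that dates arrive as MM/DD/YYYY are hypothetical and not taken from any particular tool.

```python
import re


def normalize_record(line):
    """Clean one line of text (hypothetical example record format)."""
    # Collapse runs of whitespace (spaces, tabs) into single spaces
    # and drop leading/trailing whitespace.
    line = re.sub(r"\s+", " ", line).strip()
    # Assumption: dates appear as MM/DD/YYYY; rewrite them as YYYY-MM-DD.
    line = re.sub(r"\b(\d{2})/(\d{2})/(\d{4})\b", r"\3-\1-\2", line)
    return line


if __name__ == "__main__":
    print(normalize_record("  Acme   Corp \t 03/15/2008 "))
```

A script like this is quick to write for a one-off job, but each new rule is another hand-written regular expression, which is part of why the graphical and engine-based tools above exist.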
There are some interesting commonalities and differences between these categories of data processing tools. Over the next couple of posts, I will talk about each of these classes of tools in more depth, including sample use cases for each, the data structures they use (and hence what types of data they can easily handle), and what lessons one might draw from each when building a new data transformation tool. Note that this is by no means an exhaustive list of such tools, but rather a small collection of those that I have used and know a little bit about.