<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>A World About to Change... &#187; data transformation</title>
	<atom:link href="http://www.vinaysethmohta.com/blog/category/data-transformation/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.vinaysethmohta.com/blog</link>
	<description></description>
	<lastBuildDate>Tue, 29 Jun 2010 03:13:38 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Hadoop and Hive</title>
		<link>http://www.vinaysethmohta.com/blog/2010/06/18/hadoop-and-hive/</link>
		<comments>http://www.vinaysethmohta.com/blog/2010/06/18/hadoop-and-hive/#comments</comments>
		<pubDate>Sat, 19 Jun 2010 01:56:29 +0000</pubDate>
		<dc:creator>vinaysethmohta</dc:creator>
				<category><![CDATA[business intelligence]]></category>
		<category><![CDATA[data transformation]]></category>

		<guid isPermaLink="false">http://www.vinaysethmohta.com/blog/?p=49</guid>
		<description><![CDATA[I&#8217;ve been using Hadoop and Hive for the last six months and have been pretty impressed with how well they work.  To state the obvious, if you can correctly formulate your query, nothing beats this approach.  It&#8217;s been very useful for doing cohort analysis and large-scale lifetime value computations on a relatively high traffic [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been using <a href="http://hadoop.apache.org/">Hadoop</a> and <a href="http://hadoop.apache.org/hive/">Hive</a> for the last six months and have been pretty impressed with how well they work.  To state the obvious, if you can correctly formulate your query, nothing beats this approach.  It&#8217;s been very useful for doing cohort analysis and large-scale lifetime value computations on a <a href="http://kayak.com">relatively high-traffic site</a>.  There are of course limits to what you want to keep in Hadoop / Hive; however, the convenience and the growing feature set are steadily loosening those limits.</p>
<p>Hive is not a good backend store for a BI product, since it offers no caching at all.  However, a workflow where you crunch data in Hadoop/Hive and then export the results to a MySQL table (or an Endeca instance) for use in a BI tool works very well.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.vinaysethmohta.com/blog/2010/06/18/hadoop-and-hive/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Outsourced business intelligence</title>
		<link>http://www.vinaysethmohta.com/blog/2008/08/27/outsourced-business-intelligence/</link>
		<comments>http://www.vinaysethmohta.com/blog/2008/08/27/outsourced-business-intelligence/#comments</comments>
		<pubDate>Wed, 27 Aug 2008 14:42:12 +0000</pubDate>
		<dc:creator>vinaysethmohta</dc:creator>
				<category><![CDATA[business intelligence]]></category>
		<category><![CDATA[data transformation]]></category>
		<category><![CDATA[entrepreneurship]]></category>

		<guid isPermaLink="false">http://www.vinaysethmohta.com/blog/?p=24</guid>
		<description><![CDATA[As more infrastructure moves into the cloud, we have also started to see a migration of applications into the cloud. Salesforce and CRM were the early movers. More recently, I have been seeing entrepreneurs explore how to move much larger applications (like SAP) and application stacks (like BI) into the cloud. My most recent discovery [...]]]></description>
			<content:encoded><![CDATA[<p>As more infrastructure moves into the cloud, we have also started to see a migration of applications into the cloud.  Salesforce and CRM were the early movers. More recently, I have been seeing entrepreneurs explore how to move much larger applications (like SAP) and application stacks (like BI) into the cloud.</p>
<p>My most recent discovery is <a href="http://www.gooddata.com">Good Data</a>.  They&#8217;re Cambridge-based and offer outsourced BI.  They have <a href="http://www.xconomy.com/boston/2008/07/23/good-data-gets-2-million-for-cloud-based-business-intelligence/">pulled down funding</a> from Esther Dyson, Tim O&#8217;Reilly and others and have a small war chest to try out their ideas.  Three areas of market reaction that I&#8217;m particularly curious about:</p>
<ol>
<li>Are companies willing to let their most valuable data out of their doors?  Clearly, they have been open to it in particular areas of their business as they have moved CRM, web analytics for their retail sites, and marketing analytics off-site.  However, in this case, they&#8217;re potentially moving the entire BI stack out.  Would they be willing to move their financial data out?
<p>Good Data&#8217;s success does not depend on whether companies choose to ship out their most private data. There&#8217;s plenty of pain around BI in organizations that doesn&#8217;t involve sensitive data. However, it will be interesting to see how corporate attitudes evolve as people get more used to sending data off-site for analysis.</p></li>
<li>Is Good Data able to improve the user experience around business intelligence?  When we were bringing the Endeca business intelligence offering to market, the two frustrations that we addressed for our users were:
<ul>
<li>IT often takes a long time to turn reports around</li>
<li>The reports that they do provide are static, and the tools and UI for manipulating them are really useful only to the analytics &#8220;high priests&#8221;</li>
</ul>
<p>The former is something Good Data can address just by providing infrastructure online; the latter is much harder, whether you are an outsourced provider or in-house.  It often requires the participation of business users and a mapping of the analysis to the business processes involved.  Metrics and analytics do not mean much unless you understand what the numbers are telling you!</p></li>
<li>Can Good Data (and others in this category) truly provide outsourced BI without having a significant services component to their business?  This question follows from the previous point: analytics become much more useful once you have some business context and know how to encode that context into numbers and interpret the resulting analysis.
<p>Furthermore, I have always believed that truly embedding BI in an organization requires making decision-making part of each business process rather than an after-the-fact activity.  Such an approach often requires a deep services component that cannot be standardized easily.</p></li>
</ol>
<p>Which brings me to my final thought:  Will the very existence of these tools and the pre-defined templates that they come with result in some standardization of business processes across organizations? Has anyone modeled their sales process around what Salesforce.com provides out of the box? Will that lead to better efficiency?  Why should this happen now when it didn&#8217;t happen during the ERP days?  Back then, companies spent tens of millions to change the software rather than change their business processes.  Presumably that was because a company&#8217;s business processes are one of its core sources of value.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.vinaysethmohta.com/blog/2008/08/27/outsourced-business-intelligence/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Data transformation tools</title>
		<link>http://www.vinaysethmohta.com/blog/2007/09/30/data-transformation-tools/</link>
		<comments>http://www.vinaysethmohta.com/blog/2007/09/30/data-transformation-tools/#comments</comments>
		<pubDate>Sun, 30 Sep 2007 05:04:07 +0000</pubDate>
		<dc:creator>vinaysethmohta</dc:creator>
				<category><![CDATA[data transformation]]></category>

		<guid isPermaLink="false">http://www.vinaysethmohta.com/blog/?p=8</guid>
		<description><![CDATA[I have been using data integration and transformation software for almost a decade now and have used packages that manage dataflow for very different use cases. Simulink, a product from the MathWorks, manages data flow in the context of doing simulations of control systems and digital signal processing systems. Products like Ascential DataStage (now called [...]]]></description>
			<content:encoded><![CDATA[<p>I have been using data integration and transformation software for almost a decade now and have used packages that manage dataflow for very different use cases.  <a title="Simulink" href="http://www.mathworks.com/products/simulink/">Simulink</a>, a product from the <a title="MathWorks" href="http://www.mathworks.com">MathWorks</a>, manages data flow in the context of doing simulations of control systems and digital signal processing systems.  Products like Ascential DataStage (now called <a title="IBM WebSphere DataStage" href="http://www-306.ibm.com/software/data/integration/datastage/">IBM WebSphere DataStage</a>) and <a title="Informatica" href="http://www.informatica.com/">Informatica</a>, in the traditional category of ETL, are graphical front-ends for extracting, transforming, and loading data contained in SQL databases.  The data engine in Endeca&#8217;s product manages processing and cleansing of both unstructured textual data and structured data (e.g. from databases).  Finally, some data processing tools focus only on unstructured data (e.g. to aid information extraction) or focus on tasks like data mining, providing an infrastructure that allows one to easily swap in different algorithms and evaluate performance with different datasets, with the ultimate goal of building a good model (in the classification sense).</p>
<p>In addition to these tools, one can also use scripting languages like Perl to manipulate and modify text or XSLT to manipulate and modify XML content.</p>
<p>There are some interesting commonalities and differences between these categories of data processing tools.  Over the next couple of posts, I will talk about each of these classes of tools in more depth, including sample use cases for each, the data structures that they use (and hence what types of data they can easily handle), and what lessons one might draw from each if one were going to build a new data transformation tool.  Note that this is by no means an exhaustive set of these kinds of tools, but rather a small collection of those that I have used and know a little bit about.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.vinaysethmohta.com/blog/2007/09/30/data-transformation-tools/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
