Epistemological Modesty and Big Data

I am currently “reading” (listening to the audiobook, actually) David Brooks’ The Social Animal. First off, if you care about understanding human behavior in any realm, it is a fantastic read. I have always enjoyed Brooks’ columns at the NYT. His writing in this book is just as engaging and thought-provoking.

I listened to a segment today about epistemological modesty. Beyond introducing the concept, it made the point that what we do know can be known in different ways. The example Brooks gives:

For example, if you were asked what day in the spring you should plant corn, you could consult a scientist. You could calculate the weather patterns, consult the historical record, and find the optimal temperature range and date at each latitude and altitude. On the other hand, you could ask a farmer. Folk wisdom in North America decrees that corn should be planted when oak leaves are the size of a squirrel’s ear. Whatever the weather in any particular year, this rule will guide the farmer to the right date.

For the big data practitioner, one of the keys to setting your work apart is learning about these additional sources of patterns and wisdom (and thus data) so that you can go acquire them and use them in your analyses.

And by the way, big data practitioners should add epistemological modesty to their personal toolkit. An over-reliance on data without an appreciation of human behavior and context can lead to incorrect conclusions that are “supported by the data.”

Digital Marketing and BI Integrated into Business Process

With my evaluation of the major SEM platform vendors over the last few months as well as my time with demand-side platforms for display, I’ve finally experienced a productized version of business intelligence integrated with business process that scales across many different businesses.

While evaluating (and now using) SEM platforms, I have been impressed by:

1) how well-suited the tools are for the day-to-day activities of the paid search marketer
2) the dramatic improvements in process and time savings realized by having reporting and intelligence integrated with the ability to act

We are able to find outliers and immediately cut or increase spend. Tasks that used to take several hours can now be completed in minutes.

Demand-side platforms (DSPs) for display advertising are poised to deliver similar operational improvements to display marketing.

Abstracting up one level, I find these tools fascinating in that they’ve been able to productize (and consequently scale) the integration of BI and business processes. What is it about digital marketing that allows this approach to work, as opposed to, say, a tool built on top of Salesforce? The data sources are fairly homogeneous (the three major search engines), and the data itself is even more so (campaign structure, bids, clicks, and CPCs constitute the majority of the data).

The major difference between users is how they define conversion and revenue; however, even here, every marketer has a revenue number and a fairly standard set of conversion events (registration, revenue event, login, etc.). You specify the definition of conversion and then upload one or more related metrics to support analysis and decision-making in the tool. Bidding is driven by revenue and cost, and improving CTR and revenue per click allows marketers to focus on ad copy and landing pages.

Thus, in this case, limited data and a fairly homogeneous, cross-vertical definition of success metrics allow a scalable, tight integration of business process and business intelligence.

With regard to the platforms, you do start hitting limits once you get very sophisticated, e.g. custom bidding algorithms, complex keyword expansion, etc. But even in this area, the vendors are increasingly offering solutions that are more sophisticated than what most in-house digital marketing and analytics operations will develop, e.g. attribution models for bidding, automated keyword expansion driven by search query reports (SQRs), etc.

I’m looking forward to watching these products expand over the next year and simultaneously watching how digital marketing processes evolve.

Small businesses and BI

I saw Dayna Grayson’s post about the winners of NBVP’s Seed Competition. One of the winners was Profitably, which provides analytics and BI on data in QuickBooks. It joins a growing list of companies selling SaaS BI on standard data schemas to SMBs (small and medium businesses). Another recent one that a colleague pointed out is Metricly. I’ve signed up for the betas on each site, so I will post more when I get to use them.

In most cases, the vendors have taken a horizontal platform (whether QuickBooks or Salesforce) and developed a reporting / analytics offering on top of it. The real question for me goes back to some of the posts I wrote when I started this blog – these plays are interesting, but they don’t get to the heart of the highest-value problems:

  • understanding a business (i.e. vertical) and offering BI / reporting that factors in specific challenges
  • integrating BI as part of the business process of an organization (e.g. within Quickbooks usage or within Salesforce)

Some of the major players (e.g. Siebel Analytics) have offered a “matrix” (i.e. functional area x vertical) suite of applications for several years (e.g. sales for manufacturing or finance for retail). More accurately, they at least offered one several years back; however, these don’t seem to have gotten much traction in their target market, as I don’t hear about them anymore. I also haven’t seen other enterprise software vendors that sell to large companies universally adopt this packaging for their go-to-market. Presumably, they would have if the market were responding strongly.

Summing a column of numbers

Often, when I’m doing data validation with the output from Hive and comparing it to another system, it’s useful to get Unix to do some summing for me.  awk, cut, sort, and uniq are quite handy in these cases, and often much faster than modifying and re-running Hive queries.  Here’s my bag of tricks:

- Summing the 5th column of numbers:

cat tmp.txt | awk '{t+=$5}END{print t}'

You may also find the above command useful for processing the output of ls -l.
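For example, to total the file sizes in the current directory (the size is the 5th field of ls -l output; the leading “total” line has no 5th field, so it contributes nothing to the sum):

ls -l | awk '{t+=$5}END{print t}'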

- cut, uniq, sort, grep, and wc are great for filtering and computing aggregates over a column of values, e.g. for filtering a list of session ids in the 17th field, keeping only the duplicates (uniq -c left-pads its counts, so grep -v ' 1 ' drops the ids that appear exactly once):

cut -f17 | sort -n | uniq -c | grep -v ' 1 '
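Similarly, wc makes counting the distinct session ids in that same field a one-liner (hits.tsv here is a stand-in for whatever file you’re working with):

cut -f17 hits.tsv | sort -u | wc -l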

- And finally, a random aside that I learned about last week – bash treats the output of the following two commands differently:

echo $output

echo "$output"

If $output has no line breaks, then the output will be identical.  If it does have line breaks, then the former will flatten the multiple lines into a single line (word splitting turns each newline into a plain space), while the latter will print out $output exactly as it was captured.
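A quick way to see the difference for yourself (the printf is just a stand-in for however $output was captured):

output=$(printf 'line one\nline two')
echo $output      # word splitting flattens this to: line one line two
echo "$output"    # prints the two lines intact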


A colleague recently pointed me at Datameer, an analytics front-end for Hadoop.  As their website and datasheet mention, they use a familiar spreadsheet interface for large data.  I recently saw a demo of the product, and I thought they had done a nice implementation of joins through a graphical user interface targeted at non-ETL experts.  At least based on the demo, I thought anyone who has decent experience with Excel would be able to effectively use it.  Note that it is not a tool targeting “BI for the masses”; it is definitely more of an analyst’s or an IT expert’s tool.

An added bonus is that it easily integrates all the traditional data sources as well, e.g. it’s easy to join a MySQL table against a “table” in Hadoop / Hive.  It would be cool if they could automatically discover the schema of your Hive tables and your traditional DB tables. They may already be able to; I didn’t see anything specific on their website indicating that, though.

Tools like Datameer are a great addition to any BI practitioner’s toolset.  As datasets get larger, Hadoop allows ready access to large amounts of data at a reasonable price.  Datameer will now allow us to make that data more broadly accessible to a larger group of analysts.

Hive Annoyances

As I mentioned in my prior post, I’ve been using Hadoop / Hive for six months now.  My top three frustrations with Hive v0.4.0:

1) The lack of a decent-quality CLI (command line interface) for Hive.  Editing a query in Hive is very limited: you can’t use custom keybindings (ideally, the CLI would pick up your editing mode – emacs or vi – from your inputrc), and history and history search are poor.  My workaround to date has been to use the shell’s command line instead and execute each query with:

hive -e "... query ..."'

Another benefit of using the Bash shell as your Hive CLI is that you can include variables in your SQL statements, e.g. in my script, I can do something like:

hive -e "select * from foo where ts='$TS'"

2) Hive’s inability to map columns correctly.  I have been unable to reproduce this behavior systematically, but with complex queries containing multiple selects, Hive gets confused about the order of columns.  In particular, if the order of the columns in your Hive query does not match the order of the columns on disk in the file backing the table, you get lots of junk back.

3) Lack of documentation and confusing documentation.  The Hive Language Manual looks to be the definitive reference for Hive; however, it is missing quite a bit and mixes in material from various versions of Hive, including the most recent, unreleased version, v0.6.0.

Gripes aside, Hadoop / Hive have been a great platform, and we’re continuing to expand our usage of them daily.

Hadoop and Hive

I’ve been using Hadoop and Hive for the last six months and have been pretty impressed with how well they work.  To state the obvious, if you can correctly formulate your query, nothing beats this approach.  It’s been very useful for doing cohort analysis and large-scale lifetime value computations on a relatively high-traffic site.  There are, of course, limits to what you want to keep in Hadoop / Hive; however, the convenience and the growing feature set keep pushing those limits out.
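To give a flavor, a basic cohort query boils down to something like the following sketch (the users and orders tables and their columns are invented for illustration):

hive -e "
select substr(u.signup_date, 1, 7) as cohort_month,
       count(distinct u.user_id) as users,
       sum(o.revenue) as total_revenue
from users u join orders o on (u.user_id = o.user_id)
group by substr(u.signup_date, 1, 7)
"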

Hive is not a good backend store for a BI product, since it offers no caching at all.  However, a workflow where you crunch data in Hadoop / Hive and then export the results to a MySQL table (or an Endeca instance) for use in a BI tool works very well.
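At its simplest, that export step is just a couple of commands, e.g. (the table and file names are hypothetical, and MySQL needs local-infile enabled):

hive -e "select dt, count(1) from sessions group by dt" > daily_sessions.tsv
mysql --local-infile -e "load data local infile 'daily_sessions.tsv' into table daily_sessions" report_db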


It’s been over a year since I last posted.  A second child and a new job (focused on digital marketing at KAYAK) are the major updates; a new non-professional blog and a lot more time spent on digital photography have been among the other changes over the last year.

In the context of digital marketing, I’ve spent additional time looking into BI and analytics tools, so more on that shortly.

Is running a company easier than picking a cereal?

NPR’s Fresh Air interviewed Jonah Lehrer today; he has a new book out about how humans make decisions.  A couple of topics dominated the interview:

  • An overload of choices makes decision-making much more difficult, since our prefrontal cortex can only handle a small number of choices at a time (somewhere between 5 and 12).
  • Several experiments suggest that emotion is a critical ingredient for enabling fast decision-making and preventing analysis paralysis.

Of course, Jonah Lehrer is not the first person to cover this topic. I remember reading Barry Schwartz’s The Paradox of Choice a couple of years ago, and it included references to similar ideas.

What intrigues and concerns me is the implication of this research for business intelligence.  It would suggest that:

  • Additional data / information can lead people (and hence companies) astray even as they assume that more data is always better.  A lot of work has been done in this area.  I vaguely remember reading about such research in a Malcolm Gladwell-genre book.
  • If people are making decisions based on data that may not be relevant, then we are, yet again, underestimating the role of randomness in the success or failure of companies (and CEOs).  Someone else recently wrote about this, but I do not recall who.
  • Emotion is a critical ingredient of important decision-making; however, emotion can be significantly influenced by many factors other than data.  This in turn raises the question: just how important / relevant is the data?  Is it more often used to rationalize emotional decisions than to arrive at rational ones?

While businesses will continue to move towards being run more and more by data, research like this heightens the importance of the non-technical parts of a good business intelligence implementation – understanding the data that you have available, truly understanding what you’re measuring and how it may impact the business, and, of course, good data visualization in the final implementation.

P.S. The title of this blog post refers to the difficulty the author, Jonah Lehrer, has in picking a cereal due to the vast number of choices available.

Physical interfaces

When I was working on a travel-related start-up idea last year, I had a call with a senior executive from a major web travel company. What she said really surprised me: the majority of America still decides where to go using the following technique – when they see a compelling destination in a magazine, they rip out the corresponding page and put it in a folder. Then, when it comes time to book travel, they look through that folder and decide where they are going to go. Whether or not it’s true, it is a reminder of an obvious fact: people use the physical interface of paper / a magazine to mark information for later retrieval. Other examples include bookmarking, dog-earing a page, etc.

A friend recently showed me a demo of a product idea that he’s been playing with that brings a physical-world interface to the iPhone via its touch / gesture capabilities. What truly struck me during the demo was not how useful the feature was in itself, but rather the value of the action going from an abstract one (i.e. mouse motion / click translated into a visual representation on the monitor) back to a physical one – something that you touch and interact with (you can’t feel the texture – yet!).

Many have talked about the advantages of a physical book over the many e-versions that we have seen over the years. And of course, simply mimicking the act of turning a page with a gesture on the iPhone screen does not dramatically change the user experience of reading. However, gestures on the iPhone (and other similar touch interfaces) do get one step closer to the physical world that we are used to, and for certain applications, that will be good enough for success.