Hive Annoyances

As I mentioned in my prior post, I’ve been using Hadoop / Hive for six months now.  My top three frustrations with Hive v0.4.0:

1) The lack of a decent command line interface (CLI) for Hive.  Query editing in the Hive CLI is very limited. You can’t use custom keybindings; ideally, the CLI would pick up its editing mode (emacs or vi) from your inputrc.  History and history search are poor. My workaround to date has been to use the shell’s command line instead and run each query with:

hive -e "... query ..."

Another benefit of using the Bash shell as your Hive CLI is that you can include shell variables in your SQL statement. For example, in a script I can do something like:

TS=$1
hive -e "select * from foo where ts='$TS'"
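One caveat with this approach is that the shell pastes the variable straight into the query text, so a value containing a quote changes the SQL. Here is a minimal sketch of the same script with a basic check added. The table foo and column ts come from the example above; the function name build_query and the set of allowed characters are my own assumptions, so adjust them to whatever your ts values look like.

```shell
#!/bin/sh
# Build the query text for a given timestamp value.
# build_query is a hypothetical helper, not part of Hive.
build_query() {
  ts="$1"
  # Reject empty values and anything outside a conservative character set
  # (assumed here: letters, digits, and _ : . -), so that a stray quote
  # cannot break out of the SQL string literal.
  case "$ts" in
    ''|*[!A-Za-z0-9_:.-]*)
      echo "invalid ts value: $ts" >&2
      return 1
      ;;
  esac
  printf "select * from foo where ts='%s'\n" "$ts"
}

# Run the query only when an argument was given and hive is on the PATH.
if [ "$#" -ge 1 ] && command -v hive >/dev/null 2>&1; then
  query=$(build_query "$1") && hive -e "$query"
fi
```

Keeping the query construction in a function also means you can print the generated SQL and inspect it without starting Hive.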

2) Hive’s inability to map columns correctly.  I have not been able to reproduce this behavior systematically, but in complex queries with multiple selects, Hive gets confused about the order of columns.  In particular, if the order of the columns in your Hive query does not match the order of the columns on disk in the file backing the table, you get lots of junk back.

3) Missing and confusing documentation.  The Hive Language Manual looks to be the definitive reference for Hive; however, it has significant gaps and mixes material from various versions of Hive, including the most recent, unreleased version, v0.6.0.

Gripes aside, Hadoop / Hive have been a great platform, and we’re continuing to expand our usage of them daily.
