Often, when I’m doing data validation with the output from Hive and comparing it to another system, it’s useful to get Unix to do some summing for me.  awk, cut, sort, and uniq are quite handy in these cases, and often much faster than modifying and re-running Hive queries.  Here’s my bag of tricks:

- Summing the 5th column of numbers:

cat tmp.txt | awk '{t+=$5}END{print t}'

The same awk command is also handy on the output of ls -l, where the 5th column is the file size in bytes, so it totals the sizes of the files listed.
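One caveat: awk splits on any whitespace by default, so a field that contains a space shifts every column after it. Here is a minimal sketch of a tab-only variant, assuming the file is tab-delimited. The sample rows and the tmp_file variable are made up for illustration.

```shell
# Build a small tab-separated sample file (hypothetical data standing in for query output).
tmp_file=$(mktemp)
printf 'a\tb\tc\td\t10\n'   >  "$tmp_file"
printf 'a\tb c\tc\td\t20\n' >> "$tmp_file"   # second field contains a space
printf 'a\tb\tc\td\t12.5\n' >> "$tmp_file"

# -F'\t' splits on tabs only, so the space in "b c" does not shift the columns.
# "t + 0" makes awk print 0 rather than an empty line when the input is empty.
total=$(awk -F'\t' '{ t += $5 } END { print t + 0 }' "$tmp_file")
echo "$total"   # prints 42.5

rm -f "$tmp_file"
```

Without -F'\t', the second row would be split into six fields, and its 5th field would be the non-numeric "d", which awk counts as 0.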

- cut, uniq, sort, grep, and wc are great for filtering a column of values and computing aggregates over it, e.g. for taking the session IDs in the 17th field and keeping only the ones that appear more than once:

cut -f17 tmp.txt | sort -n | uniq -c | grep -v '^ *1 '
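As a rough sketch of two alternative ways to get the same answer, uniq -d prints only the repeated lines, and awk can filter on the count column directly. The session IDs below are invented sample data, and the variable names are only for illustration.

```shell
# Hypothetical sample: one session ID per line, as cut -f17 would emit them.
ids=$(printf '%s\n' 1001 1002 1001 1003 1002 1001)

# uniq -d prints one copy of each line that occurs more than once in sorted input.
dupes=$(printf '%s\n' "$ids" | sort | uniq -d)
echo "$dupes"

# To keep the counts, filter on the first column of uniq -c output (the count)
# and print "id count" pairs.
dupe_counts=$(printf '%s\n' "$ids" | sort | uniq -c | awk '$1 > 1 { print $2, $1 }')
echo "$dupe_counts"
```

With this sample, the first command lists 1001 and 1002, and the second reports that 1001 appears 3 times and 1002 appears twice.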

- And finally, a random aside that I learned about last week: bash handles the following two commands differently:

echo $output

echo "$output"

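If $output contains no line breaks (and no other runs of whitespace), the two commands print the same thing.  If it does contain line breaks, the unquoted form flattens the lines into a single line, because bash splits the unquoted value on whitespace and echo joins the resulting words with single spaces.  The quoted form passes the value to echo as one argument, so $output is printed exactly as it was captured.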
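A small demonstration of the difference, using made-up sample text. The variable names flat, kept, flat_lines, and kept_lines exist only for this illustration.

```shell
# Capture multi-line text in a variable (hypothetical sample data).
output=$(printf 'line one\nline two\nline three')

# Unquoted: the shell splits the value on whitespace and echo rejoins the words with spaces.
flat=$(echo $output)

# Quoted: the value reaches echo as a single argument, newlines intact.
kept=$(echo "$output")

# Count the lines in each result; grep -c '' counts every line, including empty ones.
flat_lines=$(printf '%s\n' "$flat" | grep -c '')
kept_lines=$(printf '%s\n' "$kept" | grep -c '')

echo "unquoted: $flat_lines line(s)"   # prints "unquoted: 1 line(s)"
echo "quoted: $kept_lines line(s)"     # prints "quoted: 3 line(s)"
```

The unquoted form is also subject to filename expansion, so a value containing * or ? can expand to file names. Quoting the variable avoids both surprises.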

