As DataHive Consulting, we have been remiss in not mentioning Hive until now, especially since we think it is the easiest way to start using Hadoop for those just making the jump from structured to unstructured data. For those just starting to look into Big Data, Apache Hive is data warehouse software built on top of Hadoop that supports the management, querying, and analysis of distributed datasets. It includes ETL (extract, transform, load) tools, MapReduce-based queries, metadata storage, and indexing. Most importantly, it can all be managed through HiveQL, a query language similar to SQL. Although it lacks full ACID functionality at this point, Hive is a quick way to use Hadoop for those who have SQL and/or MapReduce framework experience.
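To give a feel for how close HiveQL is to SQL, here is a minimal sketch. The table name `page_views`, its columns, and the HDFS path are hypothetical names we made up for illustration. Python is used only to hold and print the statements; actually running them would require a Hive installation and a client (the Hive command line, or a library such as PyHive), which is not shown here.

```python
# Illustrative HiveQL statements held as Python strings.
# ASSUMPTION: the table, columns, and HDFS path below are hypothetical.

# Define a table over comma-delimited text files.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS page_views (
  user_id STRING,
  url     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
""".strip()

# Move a file that already sits in HDFS into the table's storage location.
LOAD_DATA = """
LOAD DATA INPATH '/data/page_views.csv' INTO TABLE page_views
""".strip()

# A familiar SQL-style aggregate; Hive compiles it into jobs on the cluster.
TOP_URLS = """
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10
""".strip()

HIVEQL_STATEMENTS = [CREATE_TABLE, LOAD_DATA, TOP_URLS]


def first_keyword(statement: str) -> str:
    """Return the leading keyword of a statement, upper-cased."""
    return statement.split()[0].upper()


if __name__ == "__main__":
    # Print each statement with a terminating semicolon, ready to paste
    # into a Hive session.
    for statement in HIVEQL_STATEMENTS:
        print(statement + ";\n")
```

Anyone comfortable with SQL should be able to read the `SELECT` without a reference; the Hive-specific parts are mostly in how tables are declared and loaded.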
Here are a few of our favorite starting points for learning more about Hive:
- The Apache Software Foundation’s Getting Started Page
- Hortonworks’ Simple Hive ‘Cheat Sheet’ for SQL Users
- Qubole’s Hive Function Cheat Sheet
- For DB2/Informix environments, IBM and Cloudera have a nice intro to Hadoop
- Hortonworks also has a basic Hive tutorial using Sean Lahman’s fantastic baseball database
Where are you picking up your Hive tips? Please feel free to share in the comments!