
Wednesday, May 14, 2014

Hive External Table Creation and Partitions

Hive enables you to create tables and run SQL-like queries on data stored in HDFS. There are two types of tables in Hive: managed tables, created with CREATE TABLE, and external tables, created with CREATE EXTERNAL TABLE.
We will look at external tables.

This post uses Hive 0.12.0, tested against Hadoop 1.0.3.

1. External Table

External tables let you run Hive queries without Hive copying or deleting any data on HDFS. If you use not only Hive but also Pig or plain MapReduce jobs on the same data, external tables are the better choice.

hive> CREATE EXTERNAL TABLE logs(
id string,
country string,
type string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';
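The FIELDS TERMINATED BY '\t' clause means each row of the table is one line of tab-separated values. As a minimal sketch of what a matching data file looks like (the /tmp path, file name, and sample values here are made up for illustration):

```shell
# Sketch: produce a tab-separated file matching the logs table schema
# (id, country, type). Paths and values are hypothetical.
mkdir -p /tmp/hive_demo
printf '1\tTR\tclick\n2\tUS\tview\n3\tDE\tclick\n' > /tmp/hive_demo/logs.tsv

# Each line is one row; fields are separated by a literal TAB character.
cat /tmp/hive_demo/logs.tsv

# To make the rows visible to the external table, the file would be placed
# directly under the table's LOCATION (not in a subdirectory), e.g.:
#   hadoop fs -put /tmp/hive_demo/logs.tsv /data/logs/
```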

This statement:
  • Creates an external table named logs on the HDFS path /data/logs.
  • Specifies the columns/fields of a row and their data types.
  • States that the fields in a row are separated by the TAB character.
  • Does not copy any data; Hive will use the data at the given location.
  • Hive does not even check whether the given location exists, which is useful when you want to save data to that location later. However, when the data is saved, it should sit directly under the given directory.
After creating the table, you can run your queries.
> SHOW TABLES;
This will list the existing tables.

> SELECT COUNT(*) FROM logs;
This will count the rows of the logs table.
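Because an external table reads the files in place, you can cross-check COUNT(*) against a plain line count of the same files. A sketch using a local directory as a stand-in for HDFS (the /tmp path is hypothetical; on a real cluster the equivalent would be `hadoop fs -cat /data/logs/* | wc -l`):

```shell
# Sketch: cross-check Hive's COUNT(*) with a raw line count.
# /tmp/hive_demo_logs stands in for the HDFS directory /data/logs.
mkdir -p /tmp/hive_demo_logs
printf '1\tTR\tclick\n2\tUS\tview\n' > /tmp/hive_demo_logs/part-00000

# Local equivalent of: hadoop fs -cat /data/logs/* | wc -l
cat /tmp/hive_demo_logs/* | wc -l
```

If the two numbers differ, a common cause is files sitting in subdirectories, which the table will not read.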

2. External Table with Partitions

Partitions let you categorize your data even further and can speed up your queries. They also enable you to use data spread over multiple directories.

For example, suppose you have log data stored in multiple directories named by date: /data/logs/2014_01_01, /data/logs/2014_01_02 and /data/logs/2014_01_03. You want to query the data in these directories and use the date as an additional filter.

hive> CREATE EXTERNAL TABLE logs_by_date(
id string,
country string,
type string)
PARTITIONED BY (date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

hive> ALTER TABLE logs_by_date ADD PARTITION (date = '2014-01-01') LOCATION '/data/logs/2014_01_01/';
hive> ALTER TABLE logs_by_date ADD PARTITION (date = '2014-01-02') LOCATION '/data/logs/2014_01_02/';
hive> ALTER TABLE logs_by_date ADD PARTITION (date = '2014-01-03') LOCATION '/data/logs/2014_01_03/';
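Each ADD PARTITION statement maps one partition value onto an existing directory. A sketch of the directory layout the table expects, built locally as a stand-in for HDFS (the /tmp base path and sample rows are hypothetical):

```shell
# Sketch: the per-date directory layout behind the partitioned table.
# /tmp/hive_demo_part stands in for /data/logs on HDFS.
base=/tmp/hive_demo_part
for d in 2014_01_01 2014_01_02 2014_01_03; do
    mkdir -p "$base/$d"
    # one sample row per day's directory
    printf '1\tTR\tclick\n' > "$base/$d/part-00000"
done

# One directory per partition; each holds that day's data files.
ls "$base"
```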

These statements:
  • Create an external table named logs_by_date.
  • Add three partitions to the logs_by_date table.
  • Add an additional column, date, to the table. This column does not exist in the raw log files; its value comes from the partition definition. However, it can be used just like any other column.

You can list the partitions of a table:
> SHOW PARTITIONS logs_by_date;

> SELECT COUNT(*) FROM logs_by_date WHERE date='2014-01-01';
This will count the rows under the /data/logs/2014_01_01 directory.

3. Drop External Table

You can drop an external table by typing:
hive> DROP TABLE logs;

This will not delete any data on HDFS; it will only delete the table's metadata.







Sunday, May 11, 2014

Installation of Hive

Hive gives you great capability when it comes to querying data on HDFS. Its SQL syntax is very similar to MySQL's, so you can start running your queries in a very short time. Hive takes your SQL and turns it into MapReduce jobs.

This installation is done on a CentOS machine. The machine Hive is installed on contains the Hadoop binaries and is used as a Hadoop client.

1. Download Hive
You can get Hive from http://www.apache.org/dyn/closer.cgi/hive/. I will continue with hive-0.12.0; in my country the mirror link is http://ftp.itu.edu.tr/Mirror/Apache/hive/hive-0.12.0/hive-0.12.0.tar.gz

> cd /your/desired/path
> wget http://ftp.itu.edu.tr/Mirror/Apache/hive/hive-0.12.0/hive-0.12.0.tar.gz

After downloading hive, extract it.

> tar -xzvf hive-0.12.0.tar.gz

2. Setup environment variables
If they are not already set, set the following environment variables. The recommended way is to put them in a shell script under /etc/profile.d. Create env_variables.sh and write:

export JAVA_HOME=/path/to/java/home
export HADOOP_HOME=/path/to/hadoop/home
export HIVE_HOME=/path/to/hive/home

To run the hive command from anywhere, we must add it to the PATH variable. Append the following to env_variables.sh. If java and hadoop are not on the PATH either, add them as well.

export PATH=$PATH:$HIVE_HOME/bin
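Putting the pieces together, env_variables.sh might look like the following (the /opt paths are placeholders, not required locations — substitute your own install directories):

```shell
# Sketch of /etc/profile.d/env_variables.sh.
# The /opt/... paths below are hypothetical placeholders.
export JAVA_HOME=/opt/java
export HADOOP_HOME=/opt/hadoop-1.0.3
export HIVE_HOME=/opt/hive-0.12.0
export PATH=$PATH:$HIVE_HOME/bin

# Quick sanity check that Hive's bin directory made it onto PATH:
echo "$PATH" | grep -q "$HIVE_HOME/bin" && echo "PATH ok"
```

After creating the file, log out and back in (or source the script) so the variables take effect.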

3. If you have installed Hive on a Hadoop node or on a Hadoop client machine, it will pick up the namenode and jobtracker addresses from Hadoop's configuration files (core-site.xml and mapred-site.xml).

4. If you run Hive with its defaults, it stores the metadata of Hive tables in a local Derby database.
In this case, where you run Hive matters, because Hive creates this database under the directory you run it from. This has two complications:

  • Only one connection is allowed; others cannot run Hive jobs from that directory.
  • If you run Hive from another location, you cannot see the previous table definitions.

To overcome this, we must create the metastore database on a database server. I will use MySQL.

5. Create a schema named hive_metastore and set its character set to latin1 with the default collation.

sql> CREATE SCHEMA hive_metastore DEFAULT CHARACTER SET latin1;

6. Create a new configuration file named hive-site.xml under "hive-0.12.0/conf" and enter your database connection properties.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://server_address/hive_metastore?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>myusername</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>mypassword</value>
  <description>password to use against metastore database</description>
</property>
</configuration>
7. For Hive to connect to the MySQL server, you must place the MySQL JDBC driver under "hive-0.12.0/lib". You can download it from http://dev.mysql.com/downloads/connector/j/

8. Now you can type the hive command and run:
hive> SHOW TABLES;
This will print the existing tables.