Wednesday, May 14, 2014

Hive External Table Creation and Partitions

Hive enables you to create tables and run SQL-like queries on HDFS data. There are 2 types of tables in Hive: managed tables, created with CREATE TABLE, and external tables, created with CREATE EXTERNAL TABLE.
We will look at external tables.

Hive 0.12.0 is tested with Hadoop 1.0.3.

1. External Table

External tables let you run Hive queries without Hive copying or deleting any data on HDFS. If you are using not just Hive but also Pig or plain MapReduce jobs on the same data, it is better to use external tables.

hive> CREATE EXTERNAL TABLE logs(
id string,
country string,
type string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';

This statement:
  • Creates an external table named logs on the HDFS path /data/logs.
  • Specifies the columns/fields in a row and their data types.
  • States that the columns/fields in a row are separated by the TAB character.
  • Does not copy any data; Hive will use the data at the given location.
  • Does not even check whether the given location exists, which is useful when you want to save data to that location later. However, the data must sit directly under the given directory when it is saved (see the example below).
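For example, you can put a tab-separated file directly under the table's location with plain HDFS commands and it becomes queryable right away (the file name logs_2014.tsv is just an illustration):

> hadoop fs -mkdir /data/logs
> hadoop fs -put logs_2014.tsv /data/logs/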
After table creation, you can run your queries.
> SHOW TABLES;
This will list the existing tables.

> SELECT COUNT(*) FROM logs;
This will count the rows in the logs table.

2. External Table with Partitions

Partitions let you categorize your data even further and can speed up your queries. They also enable you to use data spread across multiple directories.

For example, suppose you have log data stored in multiple directories named by date: /data/logs/2014_01_01, /data/logs/2014_01_02 and /data/logs/2014_01_03. You want to query the data in these directories and use the date as an additional filter.

hive> CREATE EXTERNAL TABLE logs_by_date(
id string,
country string,
type string)
PARTITIONED BY (date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

hive> ALTER TABLE logs_by_date ADD PARTITION (date = '2014-01-01') LOCATION '/data/logs/2014_01_01/';
hive> ALTER TABLE logs_by_date ADD PARTITION (date = '2014-01-02') LOCATION '/data/logs/2014_01_02/';
hive> ALTER TABLE logs_by_date ADD PARTITION (date = '2014-01-03') LOCATION '/data/logs/2014_01_03/';

These statements:
  • Create an external table named logs_by_date.
  • Add partitions to the logs_by_date table, each pointing to one of the date directories.
  • Add a new column, date, to the table. This column does not exist in the raw log files; it is derived from the partition definition. However, it can be used just like any other column.

You can list the partitions of the table:
> SHOW PARTITIONS logs_by_date;

This will count the rows under the /data/logs/2014_01_01 directory:
> SELECT COUNT(*) FROM logs_by_date WHERE date='2014-01-01';

3. Drop External Table

You can drop your external tables by typing:
hive> DROP TABLE logs;

This will not delete any data on HDFS; it only deletes the table's metadata.
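Since the data stays in place, you can verify it is still there with a plain HDFS listing:

> hadoop fs -ls /data/logs

If you later re-create the external table with the same LOCATION, the old data is immediately queryable again.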







Hadoop Control Scripts - Start/Stop Daemons

Hadoop includes some control scripts to start and stop Hadoop daemons on a cluster or on a single machine.

Hadoop 1.0.3 is used on a cluster running the CentOS operating system.


1. masters/slaves Files

Hadoop includes masters and slaves files to run specified commands on all cluster machines. These files are located under $HADOOP_HOME/conf.

Suppose you have a cluster of 4 servers: server1, server2, server3 and server4. On server1, you want to start the namenode, jobtracker and secondary namenode. On the other 3 servers, you want to run the datanode and tasktracker daemons.

masters file
server1


slaves file
server2
server3
server4

If you run Hadoop on a single machine, write your server name in both files.

2. Control Scripts

Let's look at the scripts. They are located under $HADOOP_HOME/bin.

1. start-dfs.sh

This script should be run on the namenode.
> $HADOOP_HOME/bin/start-dfs.sh

  • Starts the namenode on the server on which the script is run.
  • Starts datanodes on the servers listed in the slaves file.
  • Starts the secondary namenode on the servers listed in the masters file.


2. start-mapred.sh

This script should be run on the jobtracker node.
> $HADOOP_HOME/bin/start-mapred.sh

  • Starts the jobtracker on the server on which the script is run.
  • Starts tasktrackers on the servers listed in the slaves file.
  • This script does not use the masters file.


3. stop-dfs.sh

This script should be run on the namenode.
> $HADOOP_HOME/bin/stop-dfs.sh

  • Stops the namenode on the server on which the script is run.
  • Stops datanodes on the servers listed in the slaves file.
  • Stops the secondary namenode on the servers listed in the masters file.


4. stop-mapred.sh

This script should be run on the jobtracker node.
> $HADOOP_HOME/bin/stop-mapred.sh

  • Stops the jobtracker on the server on which the script is run.
  • Stops tasktrackers on the servers listed in the slaves file.


5. hadoop-daemon.sh

This script should be run on the datanode/tasktracker nodes.
It operates according to the given parameters and acts on the server on which it is run.

Start/Stop datanode
>$HADOOP_HOME/bin/hadoop-daemon.sh start datanode
>$HADOOP_HOME/bin/hadoop-daemon.sh stop datanode

Start/Stop tasktracker
>$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
>$HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker

6. start-all.sh

This script runs start-dfs.sh and start-mapred.sh, starting all daemons using the slaves and masters files. It should be run on the namenode/jobtracker node.
>$HADOOP_HOME/bin/start-all.sh
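After starting, you can check which daemons are running on a node with the JDK's jps command. On the namenode/jobtracker server the output should look roughly like this (the process ids are just examples):

> jps
1234 NameNode
2345 JobTracker
3456 SecondaryNameNode
4567 Jps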

7. stop-all.sh

This script runs stop-dfs.sh and stop-mapred.sh, stopping all daemons using the slaves and masters files. It should be run on the namenode/jobtracker node.
>$HADOOP_HOME/bin/stop-all.sh







Tuesday, May 13, 2014

Hadoop Web Interfaces

Hadoop provides several web interfaces to monitor your Hadoop cluster. These web interfaces give a lot of information without you having to deal with shell commands.

I am using a Hadoop 1.0.3 cluster installed on CentOS machines.

1. NameNode web interface

http://<namenode-server>:50070/

  • Enables the user to browse the HDFS file system
  • Gives a summary of the cluster
  • This is the default address and port. They can be changed via the dfs.http.address property in hdfs-site.xml (see the example below)
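For example, to move the NameNode web interface to a different port, you could add something like this to hdfs-site.xml (0.0.0.0:50071 is only an illustrative value; the default is 0.0.0.0:50070):

<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:50071</value>
</property>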

2. JobTracker web interface

http://<jobtracker-server>:50030/

  • Gives information about the tasktracker nodes and the MapReduce capacity of the Hadoop cluster
  • Lists running/completed/retired jobs
  • Enables the user to browse job history
  • The address and port can be changed via the mapred.job.tracker.http.address property in mapred-site.xml

3. TaskTracker web interface

http://<tasktracker-server>:50060/

  • Lists running and non-running tasks on that node
  • The address and port can be changed via the mapred.task.tracker.http.address property in mapred-site.xml

4. Secondary NameNode web interface

http://<secondaryNN-server>:50090/

The address and port can be changed via the dfs.secondary.http.address property in hdfs-site.xml.


5. DataNode web interface

http://<datanode-server>:50075/
  • Enables browsing of the data on a specific datanode
  • The address and port can be changed via the dfs.datanode.http.address property in hdfs-site.xml

Sunday, May 11, 2014

Installation of Hive

Hive gives you great capability when it comes to querying data on HDFS. Its SQL syntax is very similar to MySQL's, and you can start running your queries in a very short time. Hive takes your SQL and creates MapReduce jobs out of it.

This installation is done on a CentOS machine. The machine on which Hive is installed contains the Hadoop binaries and is used as a Hadoop client.

1. Download Hive
You can get Hive from http://www.apache.org/dyn/closer.cgi/hive/. I will continue with hive-0.12.0; the mirror link in my country is http://ftp.itu.edu.tr/Mirror/Apache/hive/hive-0.12.0/hive-0.12.0.tar.gz

> cd /your/desired/path
> wget http://ftp.itu.edu.tr/Mirror/Apache/hive/hive-0.12.0/hive-0.12.0.tar.gz

After downloading Hive, extract it.

> tar -xzvf hive-0.12.0.tar.gz

2. Setup environment variables
If they are not already set, set the following environment variables. The recommended way is to put them in a shell script under /etc/profile.d. Create env_variables.sh and write:

export JAVA_HOME=/path/to/java/home
export HADOOP_HOME=/path/to/hadoop/home
export HIVE_HOME=/path/to/hive/home

To run the hive command from anywhere, we must add it to the PATH variable. Append the following to env_variables.sh. If java and hadoop are not on the PATH either, add them as well.

export PATH=$PATH:$HIVE_HOME/bin

3. If you have installed Hive on a Hadoop node or on a Hadoop client machine, it will find the namenode and jobtracker addresses from Hadoop's configuration files (core-site.xml and mapred-site.xml).
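For reference, these are the entries Hive picks up from those files; the host name and ports below are placeholders, use your own values:

core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://server1:9000</value>
</property>

mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>server1:9001</value>
</property>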

4. If you run Hive with its defaults, it will store the metadata related to Hive tables in a local Derby database.
In this case, where you run Hive becomes important, because Hive creates this database under the directory you run it from. This has 2 complications:

  • Only one connection is allowed at a time; others cannot run Hive jobs using that directory.
  • If you run Hive in another location, you cannot see previous table definitions.

To overcome this, we should create the metastore database on a database server. I will use MySQL.

5. Create a schema named hive_metastore and set its character set to latin1 with the default collation.

sql> CREATE SCHEMA hive_metastore DEFAULT CHARACTER SET latin1;

6. Create a new configuration file under "hive-0.12.0/conf" named hive-site.xml and enter your database connection properties.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://server_address/hive_metastore?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>myusername</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>mypassword</value>
  <description>password to use against metastore database</description>
</property>
</configuration>

7. For Hive to connect to the MySQL server, you must place the MySQL JDBC driver under "hive-0.12.0/lib". You can download it from http://dev.mysql.com/downloads/connector/j/
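Assuming you have already downloaded the connector, copying the jar is enough (the exact file name depends on the connector version; mysql-connector-java-5.1.30-bin.jar is only an example):

> cp mysql-connector-java-5.1.30-bin.jar /your/desired/path/hive-0.12.0/lib/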

8. Now you can type the hive command and run:
hive> SHOW TABLES;
This will print the existing tables.





MySQL Performance Tuning

To speed up the performance of your application, tuning MySQL can be crucial. Although there are many aspects to MySQL performance tuning, the following will give you a basic understanding and some tips about tuning a MySQL server.

These configurations are tested on Percona Server 5.5, which is forked from MySQL Server. They should also apply to a standard MySQL server. The Percona server is running on a CentOS machine.

1. Quick Server Configuration

If you are looking for a quick configuration for your server (MySQL Enterprise/Community or Percona Server), you can use the following link:

https://tools.percona.com/wizard

This link provides a wizard that asks about the needs of your application and tries to create a my.cnf file that fulfills them.

2. Quick Server Performance Check

Before dealing with performance issues, you should know which points are misconfigured or lacking. You can use the following script to get an overview of your server's performance.

http://www.day32.com/MySQL/

When you run this script, it will give an analysis of different configuration points.
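A minimal sketch of running it, assuming the script linked on that page is still named tuning-primer.sh (check the page for the current file name):

> wget http://www.day32.com/MySQL/tuning-primer.sh
> sh tuning-primer.sh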

3. Configuration Parameters

MySQL is configured via system variables. All system variables have a default value. These variables can be given at server startup on the command line or in configuration files. On Linux this configuration file is my.cnf; its Windows counterpart is my.ini. Some system variables are dynamic, meaning that they can be modified from the mysql shell while the server is running (an example follows below).
To see current values:
mysql> SHOW VARIABLES;
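For example, query_cache_size is a dynamic variable, so it can be changed without restarting the server (the value is only an example; add it to my.cnf as well to keep it after a restart):

mysql> SET GLOBAL query_cache_size = 134217728;
mysql> SHOW VARIABLES LIKE 'query_cache_size';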

3.1. Query Cache (query_cache_size)

Quick Configuration (my.cnf file):
query_cache_size=128M
query_cache_limit=1M
query_cache_type=1

query_cache_size is the total amount of memory allocated to the query cache.
query_cache_limit is the maximum size of a SELECT result that can be cached. Query results bigger than this limit are not cached.
query_cache_type has 3 options (see the example after this list):
  • 0: the query cache is disabled
  • 1: all queries are cached, except the ones beginning with SELECT SQL_NO_CACHE
  • 2: only queries beginning with SELECT SQL_CACHE are cached
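For example, with the per-query hints (test is just an example table):

mysql> SELECT SQL_NO_CACHE * FROM test;
mysql> SELECT SQL_CACHE * FROM test;

The first query is never cached, even with query_cache_type=1; the second one is cached even when query_cache_type=2.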

Key points:
  • The query cache stores the text of a SELECT statement together with the corresponding result.
  • If an identical statement is received later, the server retrieves the result from the query cache rather than parsing and executing the statement again.
  • The query cache is shared among sessions, so a result set generated by one client can be sent in response to the same query issued by another client.
  • MySQL does not return stale data: when table data is changed, cached queries that use that table are flushed.
  • Queries must be exactly the same to be counted as identical. For example, the following queries are not identical:
select * from test;
SELECT * FROM test;

The query cache can be useful in an environment where you have tables that do not change very often and for which the server receives many identical queries. In an environment where tables are constantly changing, maintaining the query cache can hurt performance.

Monitor Query Cache:
Following command gives query cache statistics:
mysql> SHOW GLOBAL STATUS LIKE 'Qcache%';
+-------------------------+-----------+
| Variable_name           | Value     |
+-------------------------+-----------+
| Qcache_free_blocks      | 1         |
| Qcache_free_memory      | 134182560 |
| Qcache_hits             | 3694      |
| Qcache_inserts          | 19        |
| Qcache_lowmem_prunes    | 0         |
| Qcache_not_cached       | 9         |
| Qcache_queries_in_cache | 14        |
| Qcache_total_blocks     | 31        |
+-------------------------+-----------+
8 rows in set (0.00 sec)
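One rough way to read these numbers: the hit rate is approximately Qcache_hits / (Qcache_hits + Qcache_inserts + Qcache_not_cached). For the output above that is 3694 / (3694 + 19 + 9) ≈ 0.99, so about 99% of SELECTs were answered from the cache. Also watch Qcache_lowmem_prunes: if it keeps growing, query_cache_size is too small.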


3.2. Table Cache (table_open_cache/table_cache)

When different clients try to operate on the same table simultaneously, the table is opened independently for each client. If the number of open tables reaches the maximum specified by table_open_cache, performance can degrade.
table_cache was renamed to table_open_cache in the MySQL 5.1.3 release.

Quick Configuration (my.cnf file):
table_open_cache=2048
open-files-limit=65536

table_open_cache is the maximum number of open tables that MySQL can have.
open_files_limit is the maximum number of file descriptors that MySQL can have; it cannot exceed the limit imposed by the operating system.
On Unix systems, you can get the maximum number of file descriptors per process by typing:
> ulimit -n


Key Points:
  • table_open_cache is related to max_connections. If there are 100 concurrent connections and your joins involve 10 tables, it is wise to set table_open_cache to 100*10=1000.
  • If MyISAM usage is heavy, consider that a MyISAM table takes 2 file descriptors instead of 1. Additionally, if you use the same MyISAM table again in a query, for example joining it with itself, 1 additional file descriptor is used.
  • MySQL is limited by the operating system in terms of file descriptors. On Unix systems, you can see the per-process file descriptor limit with ulimit -n. You can raise MySQL's own limit with the open-files-limit configuration variable. If this limit is reached, MySQL gives a "Too many open files" error (see the check below).
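You can check how close MySQL is to the file descriptor limit by comparing the current status with the configured limit:

mysql> SHOW GLOBAL STATUS LIKE 'Open_files';
mysql> SHOW GLOBAL VARIABLES LIKE 'open_files_limit';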

Monitor Table Cache:
You can monitor and observe the table cache by typing:
mysql> SHOW GLOBAL STATUS LIKE '%Opened_tables%';

Tip: Run this command several times during the busiest times of the server. If the number keeps increasing, you should consider increasing table_open_cache.

3.3. Key Buffer Size (key_buffer_size)

The MyISAM index buffer is shared by all clients, and key_buffer_size specifies the size of this buffer in memory. The maximum is 4GB-1 on 32-bit machines and larger on 64-bit machines. It is recommended to keep this value at around 25% of system memory. If you are using MyISAM tables heavily, you can try increasing it.

Quick Configuration (my.cnf file):
key_buffer_size=4G


Key Points:
  • Under heavy MyISAM usage, this variable can be crucial.

Monitor Key Buffer Size:
You can list the status of operations related to key reads and writes:
mysql> SHOW GLOBAL STATUS LIKE 'Key_%';
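A common way to read this output: the index miss rate is Key_reads / Key_read_requests (physical index reads over total index read requests). As a rule of thumb, a ratio well below 0.01 is usually fine; if it is much higher under heavy MyISAM usage, increasing key_buffer_size may help.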

4. Further Reading

You can get more detail from the following sites:
http://www.mysqlperformanceblog.com/
http://dev.mysql.com/doc/refman/5.0/en/server-parameters.html
http://dev.mysql.com/doc/refman/5.1/en/query-cache.html
http://dev.mysql.com/doc/refman/5.0/en/table-cache.html