Wednesday, October 22, 2014

Hadoop Job Scheduling - Fair Scheduler

The default job scheduler for Hadoop uses a FIFO queue. The job submitted first gets all the resources it needs, which can make other jobs wait for a long time. To better utilize the resources of your cluster, you can change the Hadoop job scheduler.

It was tested with Hadoop 1.0.3 on Ubuntu 12.04.

1. Default Scheduler

The default scheduler uses a FIFO queue. This means that the job submitted first gets all the resources it needs, and the ones submitted later get the remaining resources until the cluster runs at full capacity. Jobs that do not get the chance to start have to wait for the running jobs to finish.

2. Why Need Another Scheduler?

Default job scheduling can hurt your applications on several levels.

  • You can have ad hoc queries, for example Hive queries, that are expected to finish in less than a minute.
  • You can have multiple report jobs, for example implemented as Pig/Hive scripts, that are waiting to be viewed by users.
  • You can have scheduled recurring jobs, for example Oozie workflows, that need to run every hour/day and update a data source in your system.
  • These jobs can be submitted by different users, and you may want to prioritize according to the user.

With the default scheduler, these cases cannot be handled properly.

We can use different scheduling algorithms; we will look into two of them:

  • Fair Scheduler
  • Capacity Scheduler

These two schedulers propose slightly different solutions.

3. Fair Scheduler

In the fair scheduler, every user has a separate pool of jobs. However, you don't have to assign a new pool to each user. When submitting jobs, a queue name is specified (mapred.job.queue.name); this queue name can also be used to create job pools.

Cluster resources are fairly shared between these pools. User Hadooper1 can submit a single job and user Hadooper2 can submit many jobs; they get an equal share of the cluster on average. However, custom pools can be created by defining minimum map and reduce slots and setting a weight for the pool.

Jobs in a pool can be scheduled using fair scheduling or first-in-first-out (FIFO) scheduling. If fair scheduling is selected for the jobs inside the pool, these jobs get an equal share of resources over time. Otherwise, the first submitted job is served first.

The fair scheduler also supports preemption of jobs in other pools. This means that if a pool does not get half of its fair share for a configurable time period, it can kill tasks in other pools to reclaim capacity.

4. Capacity Scheduler

The capacity scheduler gives each user a fair amount of cluster resources. Each user is given a FIFO queue for storing jobs.

The fair scheduler gives a fair share of pool resources to the jobs inside the pool (although a pool can also be configured as a FIFO queue). In contrast, the capacity scheduler maintains a FIFO queue for the user's jobs, so the first submitted job can get all the resources given to that user.

5. Enabling Fair Scheduler

Quick Setup:
To enable the fair scheduler in Hadoop 1.0.3, you just need to add the following to your mapred-site.xml.
The hadoop-fairscheduler-1.0.3.jar is under the HADOOP_HOME/lib directory by default.

<property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
    <name>mapred.fairscheduler.poolnameproperty</name>
    <value>mapred.job.queue.name</value>
</property>

In this configuration, the map-reduce queue name is used as the pool name.
To view the fair scheduler status, you can visit http://jobtracker:50030/scheduler.

Advanced Setup:
You can customize fair scheduler in two places.

  • HADOOP_CONF_DIR/mapred-site.xml
  • HADOOP_CONF_DIR/fair-scheduler.xml (the allocation file)
The Hadoop documentation provides a good explanation of these configuration parameters. A sample allocation file is sketched below.
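Here is a minimal sketch of what such an allocation file could look like. The pool and user names are made up; the element names follow the Hadoop 1.0.x fair scheduler documentation, so check the docs for your exact version.

<?xml version="1.0"?>
<allocations>
  <pool name="reports">
    <minMaps>5</minMaps>
    <minReduces>5</minReduces>
    <maxRunningJobs>25</maxRunningJobs>
    <weight>2.0</weight>
  </pool>
  <user name="hadooper1">
    <maxRunningJobs>6</maxRunningJobs>
  </user>
  <userMaxJobsDefault>3</userMaxJobsDefault>
  <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
</allocations>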

Further Reading

1. http://hadoop.apache.org/docs/r1.0.4/fair_scheduler.html
2. http://blog.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/









Hadoop Versions Explained

There is confusion around Hadoop versions. You can see versions like 0.20+, 1.0+ and 2.0+. We have the old MR1 and the new MR2 with YARN. You can see the list of Hadoop releases on the Apache release page, where releases span from 0.23+ to 1.0+ and on to 2.0+.

There are a few blog posts that try to explain the evolution of the Hadoop version tree. The most complete ones are here and here. The author has multiple posts that reflect new versions, and we can expect to see more updates on his blog. Other good posts that explain the history of the versions are here from Cloudera's blog and here from Hortonworks's blog.

To summarize the version tree:

  • Version 0.20.205 was renamed to Hadoop 1.0.0. Later 1.0+ releases continue from here. This line provides the old MR1 API.
  • Version 0.23 serves as the basis for the Hadoop 2.0+ releases. This line provides the new MR2 API and YARN.
  • Version 0.23 also continues to get releases in its own tree. As I understand it, this branch gets only point releases and does not implement any new features.
I hope this helps to clear the confusion.







Sunday, October 5, 2014

Hadoop 1.0.3 Installation on Ubuntu

I will describe the required steps to install Hadoop 1.0.3 on a single node in pseudo-distributed mode. At the end of this post, you will be able to browse HDFS (Hadoop Distributed File System) and run map-reduce jobs on your single-node Hadoop cluster.

Hadoop 1.0.3 is used and installed on Ubuntu 12.04 LTS.

Prerequisites

1. Java

For installing Hadoop, you must have Java 1.5+ (Java 5 or above). I will continue with Java 1.7.

> sudo apt-get install openjdk-7-jdk

If you have different Java versions installed on your machine, you can select the new Java 1.7 by typing:
> update-alternatives --config java
You can then select your desired java version.

You can check the current Java version by typing:
> java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.1) (7u65-2.5.1-4ubuntu1~0.12.04.2)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

2. SSH

Hadoop uses ssh to connect to and manage its nodes. This is also valid for a single-node setup.

The openssh client is included in Ubuntu by default. However, you should also have the openssh server installed.
> dpkg --get-selections | grep -v deinstall | grep openssh

If the list does not contain the client or the server, you can install the missing one:
> sudo apt-get install openssh-client
> sudo apt-get install openssh-server

Now, you should connect to localhost:
> ssh localhost

This will ask for your password. Hadoop needs to establish connections without entering a password.

To enable this:
Generate public and private keys
> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Authorize the key by adding it to the list of authorized keys
> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

You should now be able to connect without a password:
> ssh localhost

 3. Disable IPv6

To disable IPv6, open /etc/sysctl.conf and add the following lines to the end of the file:

> vi /etc/sysctl.conf

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

You should restart your machine for the changes to take effect.
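Alternatively, you can reload the settings without a reboot and verify the result (the cat command should print 1 if IPv6 is disabled):

> sudo sysctl -p
> cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1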

4. Dedicated User for Hadoop

Although it is not necessary, you can create a dedicated user for Hadoop. This will help you separate Hadoop management from other applications.

Create a user named hadoopuser and assign it to a group named hadoopgroup. You can get more detail about creating users and groups in this post.
> sudo groupadd hadoopgroup
> sudo useradd hadoopuser -m
> sudo usermod -aG hadoopgroup hadoopuser

Installation

1. Get Hadoop

You can get your desired Hadoop version from the Apache download mirrors.
I will download Hadoop 1.0.3:
> cd /home/hadoopuser
> wget http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz

Extract the Hadoop package under the home directory:
> cd /home/hadoopuser
> sudo tar -xzvf hadoop-1.0.3.tar.gz
> sudo chown -R hadoopuser:hadoopgroup /home/hadoopuser/hadoop-1.0.3

2. Set your environment variables

I will set the HADOOP_HOME environment variable. You can get more detail about setting environment variables in this post.

I will make HADOOP_HOME accessible system-wide, not per user. To do this, first create a system_env.sh under the /etc/profile.d folder.
> vi /etc/profile.d/system_env.sh

export HADOOP_HOME=/home/hadoopuser/hadoop-1.0.3
export PATH=$PATH:$HADOOP_HOME/bin

3. Configuration

You can use the following configuration files as a starting point.

$HADOOP_HOME/conf/hadoop-env.sh

Set JAVA_HOME variable in this file:
> vi /home/hadoopuser/hadoop-1.0.3/conf/hadoop-env.sh

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre

$HADOOP_HOME/conf/core-site.xml

<configuration>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:10000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoopuser/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
</configuration>

$HADOOP_HOME/conf/mapred-site.xml

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:10001</value>
        <description>The host and port that the MapReduce job tracker runs
        at.  If "local", then jobs are run in-process as a single map
          and reduce task.
        </description>
    </property>
</configuration>

$HADOOP_HOME/conf/hdfs-site.xml

<configuration>
<property>
   <name>dfs.webhdfs.enabled</name>
   <value>true</value>
</property>
<property>
   <name>dfs.permissions</name>
   <value>false</value>
</property>
</configuration>

4. Format HDFS

To start using your Hadoop cluster, you should format HDFS. This is done when the cluster is first set up. If you format an existing HDFS, the data stored on it will be removed.
> hadoop namenode -format

5. Start the Cluster

Hadoop provides several control scripts that enable you to start/stop the Hadoop daemons.

To start the Hadoop cluster, run:
> /home/hadoopuser/hadoop-1.0.3/bin/start-all.sh

This will start all 5 daemons: NameNode, SecondaryNameNode, JobTracker, TaskTracker and DataNode. You can check whether these daemons are running by typing:
> jps
or
> ps aux | grep hadoop

6. Explore the Cluster

Hadoop provides several web interfaces to monitor your cluster. You can browse these interfaces.
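With the defaults used in this post, the two interfaces you will probably visit most are the NameNode and JobTracker UIs on their default ports (these are covered in more detail in the Hadoop Web Interfaces post):

http://localhost:50070/   (NameNode - browse HDFS, cluster summary)
http://localhost:50030/   (JobTracker - running/completed jobs)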











Thursday, June 26, 2014

How to Schedule Hadoop Jobs - Apache Oozie

I will explain scheduling Hadoop jobs with Apache Oozie.

You have large volumes of data collected and stored in HDFS. Your analysis process may differ according to the nature of your data and the needs of your business. To further investigate, let's go through the following scenarios:

1. Running in Scheduled Time Intervals

You want to analyze your data at scheduled time intervals.

For example, you have an e-commerce site and want to put your customers into categories like sport, book, electronics and so on. You want to run this analysis every day, every week or every month.

As another example, you want to create a report that shows impression/purchase counts (and other statistics) for your products. You want to generate this report every night at 01:00 am.

2. Running When Data is Present

You want to run an analysis when a specific data feed arrives.

Like the previous example, you have an e-commerce site and you are collecting event logs. However, you also have a dependency on another data set being available on HDFS.

3. Running Dependent Analyses

You want to run a sequence of analyses that depend on each other's output.

You want to implement a basic suggestion system. First, you will analyze the impression/purchase counts of your products (your products also have associated categories). Second, you will analyze the interest areas of your customers. Then you want to merge these two outputs and match products with customers. At the end, you will offer specific products to specific customers.

4. Minimal Technical Effort/Minimal Dependency

From a technical view, you want to build a scalable and extensible scheduling system. You want to run your analyses with minimal effort and minimal language dependencies.

One Good Solution - Oozie

What we use in our system is Apache Oozie. We have some of the use cases stated above, and the others will likely become valid for us in the near future.


Excerpt from Oozie web page:

Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.

Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).
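To give a feeling for what a coordinator looks like, here is a minimal sketch of a coordinator.xml that runs a workflow once a day. The application name, paths and schema version are illustrative only:

<coordinator-app name="daily-report" frequency="${coord:days(1)}"
                 start="2014-07-01T01:00Z" end="2015-07-01T01:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
    <action>
        <workflow>
            <app-path>hdfs://localhost:10000/apps/daily-report-wf</app-path>
        </workflow>
    </action>
</coordinator-app>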

In later posts, I will explain Oozie workflow and coordinator applications and workflow actions with examples.












Pig 0.11.1 Installation on Ubuntu

I will try to outline a basic Pig 0.11.1 installation.

This was tested on Ubuntu 12.04 with Java 1.7 installed. Hadoop 1.0.3 is used on the same machine as Pig.


1. Download Pig


You can download Pig from http://www.apache.org/dyn/closer.cgi/pig. In my location, download link is:
http://apache.bilkent.edu.tr/pig/pig-0.11.1/pig-0.11.1.tar.gz

> cd /home/myuser/hadoop/
> wget http://apache.bilkent.edu.tr/pig/pig-0.11.1/pig-0.11.1.tar.gz

After downloading Pig, extract it

> tar -xzvf pig-0.11.1.tar.gz

This will extract the files to /home/myuser/hadoop/pig-0.11.1

2. Set Environment Variables

If not set, set the following environment variables.
The recommended way is to put these variables in a shell script under /etc/profile.d. Create env_variables.sh and write:

export JAVA_HOME=/path/to/java/home
export HADOOP_HOME=/path/to/hadoop/home
export PIG_HOME=/path/to/pig/home

To run the pig command from anywhere, we must add it to the PATH variable. Append the following to env_variables.sh. If java and hadoop are also not on the PATH, add them as well.

export PATH=$PATH:$PIG_HOME/bin 

3. Hadoop Cluster Information

If you have set up the $HADOOP_HOME environment variable, Pig will find the namenode and jobtracker addresses from Hadoop's configuration files (core-site.xml and mapred-site.xml).

If it cannot find your cluster, you can add Hadoop's configuration directory to Pig's classpath:
export PIG_CLASSPATH=$HADOOP_HOME/conf/

4. Map-Reduce Mode

Pig supports local and map-reduce mode. We will try map-reduce mode.
You can run an existing pig script with:

> pig myscript.pig

You can get your script with parameters substituted (dry-run) with:
> pig -r myscript.pig

You can enter the grunt shell and run your Pig statements there.
> pig
2014-06-26 23:40:31,977 [main] INFO  org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53
2014-06-26 23:40:31,978 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/myuser/pig_1403815231975.log
2014-06-26 23:40:31,992 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/myuser/.pigbootup not found
2014-06-26 23:40:32,124 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:10000
2014-06-26 23:40:32,948 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:10001

grunt> logs = LOAD '/data/logs' using PigStorage() as (id:int, log:chararray);
grunt> DUMP logs;



Monday, June 23, 2014

Hadoop Trash - Recover Your Data

Hadoop gives you the capability to recover your deleted files. When files are deleted, they are moved to the .Trash folder under the user's home directory (for example "/user/myuser/.Trash") and remain there for a minimum period of time before being deleted permanently. You can recover your files by copying them from the .Trash folder to your desired path.
However, Hadoop trash only stores files that are deleted from the filesystem shell. Files that are deleted programmatically are deleted immediately, though you can use the trash programmatically via the org.apache.hadoop.fs.Trash class.

Hadoop 1.0.3 is used on Ubuntu 12.04 machine.

1. Enable Trash

By default, the trash feature is disabled.

To enable it, add the following property to core-site.xml on the NameNode machine:

<property>
  <name>fs.trash.interval</name>
  <value>60</value>
  <description>Number of minutes after which the checkpoint
  gets deleted. If zero, the trash feature is disabled.
  </description>
</property>
As the description states, deleted files will be moved to the .Trash folder and remain there for 60 minutes before being deleted permanently. A thread checks the trash and removes the files that have remained longer than this interval.

In Hadoop 1.0.3, the interval at which this thread runs is not exposed as a configuration property (it is not in core-default.xml or the code), so it cannot be changed. However, in newer versions you can configure it:


<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>15</value>
  <description>Number of minutes between trash checkpoints.
  Should be smaller or equal to fs.trash.interval.
  Every time the checkpointer runs it creates a new checkpoint
  out of current and removes checkpoints created more than
  fs.trash.interval minutes ago.
  </description>
</property>


2. fs -rm/-rmr Commands

If you use "hadoop fs -rm" or "hadoop fs -rmr" commands, these files will be moved to trash and you can restore them under .Trash directory.

> hadoop fs -rmr /data/logs/data.log

Moved to trash: hdfs://localhost:10000/data/logs/data.log
> hadoop fs -mv /user/myuser/.Trash/Current/data/logs/data.log /data/recovered_data.log

3. skipTrash

You can delete your files immediately by using the skipTrash option:

> hadoop fs -rmr -skipTrash /data/logs/data.log

Deleted hdfs://localhost:10000/data/logs/data.log

4. fs -expunge

You can empty your .Trash folder with the expunge command. This will delete the files in the .Trash folder and create a new checkpoint:

> hadoop fs -expunge

14/06/20 20:25:20 INFO fs.Trash: Created trash checkpoint: /user/myuser/.Trash/1406202025


//TODO
In later versions, client-side configuration of the trash enables the trash feature for the user running the client. It is a TODO for me to try this out in Hadoop 1.0.3.




Hadoop Does Not Stop - Missing Pid Files

In our cluster, we have experienced that if the pid files of the Hadoop daemons go missing, the daemons will not stop. If the daemons do not stop properly and you try to kill them forcefully (kill -9), Hadoop can end up in an erroneous state. For example, if you kill the datanode daemon, blocks can go missing.

Hadoop 1.0.3 running on a pseudo-distributed single node cluster is used on Ubuntu 12.04

1. Pid Files

Hadoop stores process ids in files under the /tmp directory by default.
The files are named as follows:

  • hadoop-myuser-namenode.pid
  • hadoop-myuser-datanode.pid
  • hadoop-myuser-jobtracker.pid
  • hadoop-myuser-tasktracker.pid
  • hadoop-myuser-secondarynamenode.pid

These files store the process ids as text.
If these files are deleted and you try to run the Hadoop stop scripts, the scripts cannot find the Hadoop daemons and cannot stop them.

2. Create Pid Files

2.1. First, learn the process ids of the running Hadoop daemons with the jps command or ps aux | grep hadoop. If jps is not on the PATH, you can try it with the full path.
> jps

25564 JobTracker
24896 NameNode
25168 DataNode
25878 TaskTracker
12726 Jps
25448 SecondaryNameNode

2.2. Find out the directory where the pid files should be stored. It is /tmp by default. However, this path can be changed in $HADOOP_HOME/conf/hadoop-env.sh.

# The directory where pid files are stored. /tmp by default.
# export HADOOP_PID_DIR=/var/hadoop/pids

2.3. Go to the pid directory and create the missing files. Write the corresponding process ids and save.
> vi hadoop-myuser-namenode.pid
> vi hadoop-myuser-datanode.pid
> vi hadoop-myuser-jobtracker.pid
> vi hadoop-myuser-tasktracker.pid
> vi hadoop-myuser-secondarynamenode.pid
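Instead of editing the files one by one with vi, you can also write them in a single pass. A small sketch, using the process ids from the jps output above:

> echo 24896 > /tmp/hadoop-myuser-namenode.pid
> echo 25168 > /tmp/hadoop-myuser-datanode.pid
> echo 25564 > /tmp/hadoop-myuser-jobtracker.pid
> echo 25878 > /tmp/hadoop-myuser-tasktracker.pid
> echo 25448 > /tmp/hadoop-myuser-secondarynamenode.pid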


2.4. Change the permissions of these files so that the user running the Hadoop daemons can read and write them.
I personally use chown (given that the file permissions are 664):
> chown -R myuser:myuser /tmp/hadoop*.pid

2.5. Then you can stop your daemons as explained in this post:
> $HADOOP_HOME/bin/stop-all.sh

stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

You can check with jps or ps aux | grep hadoop
> jps

14237 Jps






Hadoop - Map and Reduce Slots

We have seen how the number of map and reduce tasks is calculated here. I will continue the topic with the configuration of available map and reduce slots in the cluster. Hadoop uses these slots to run map and reduce tasks, and the number of slots is fixed with certain properties.

Hadoop 1.0.3 is used on Ubuntu 12.04 machine.

1. Map Slots

The number of map slots can be defined in mapred-site.xml under the $HADOOP_HOME/conf directory on the tasktracker nodes.
This value is 2 by default, as defined in mapred-default.xml:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>


To change its value, open mapred-site.xml and set:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>


You should configure all your tasktracker nodes this way. After setting these properties, you should restart your tasktracker daemons. You can see the new values on the jobtracker web interface at http://<jobtracker-server>:50030

2. Reduce Slots

Like map slots, the number of reduce slots can be defined in mapred-site.xml under the $HADOOP_HOME/conf directory on the tasktracker nodes.
This value is 2 by default, as defined in mapred-default.xml:

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>


To change its value, open mapred-site.xml and set:

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
</property>


You should configure all your tasktracker nodes this way.

3. How to decide the number of map/reduce slots?

To decide the number of map slots, you should look at the number of processors on the machine.
As a general practice, you can allow 2 tasks per CPU core. If your tasktracker machine has 8 cores, you can have 16 task slots counting both map and reduce tasks.
You can identify the number of CPU cores with:
> lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Stepping:              9
CPU MHz:               1600.000
BogoMIPS:              6784.81
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7

I have 8 CPU cores, as stated by the CPU(s) line. I will configure mapred.tasktracker.reduce.tasks.maximum and mapred.tasktracker.map.tasks.maximum as 8-1 = 7, since the tasktracker and datanode daemons will also consume resources.


Saturday, June 21, 2014

Hadoop - Safe Mode

I will explain some details of Hadoop safe mode and related properties/commands.

Hadoop 1.0.3 is used on Ubuntu 12.04.

1. Safe Mode

When the namenode first starts, it loads its image file into memory. In this state, the namenode has the metadata of its filesystem; however, it does not have the locations of the blocks. As datanodes begin to check in with the block lists they have, the namenode builds a map of blocks and their locations. Until a certain percentage of block locations have been reported, the namenode waits for more datanodes to check in.

During this time, the namenode stays in safe mode. It does not replicate blocks and does not allow writes, deletions or renames. It gives a read-only view to clients: directory listings work, and file reads will succeed if the blocks have already been reported by checked-in datanodes. When the configured percentage is met, the namenode waits another 30 seconds and then leaves safe mode.

Safe mode is beneficial. For example, replication is stopped, which prevents unnecessary replication of a file when it is just a matter of time before datanodes check in with that file's block locations.

2. Configuration Parameters

dfs.replication.min 
Default value: 1
Description: The minimum number of replicas that have to be written for a write to be successful.

dfs.safemode.threshold.pct 
Default value: 0.999
Description: The proportion of blocks in the system that must meet the minimum replication level defined by dfs.replication.min before the namenode will exit safe mode. Setting this value to 0 or less forces the namenode not to start in safe mode. Setting this value to more than 1 means the namenode never exits safe mode.

dfs.safemode.extension 
Default value: 30,000
Description: The time, in milliseconds, to extend safe mode by after the minimum replication condition defined by dfs.safemode.threshold.pct has been satisfied. For small clusters (tens of nodes), it can be set to 0.
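As an example, on a small test cluster you could relax these settings in hdfs-site.xml as shown below. The values are only illustrative, not recommendations:

<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0.95</value>
</property>
<property>
  <name>dfs.safemode.extension</name>
  <value>0</value>
</property>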

3. Safe Mode Commands

Get the safe mode status:
> hadoop dfsadmin -safemode get
Safe mode is ON

The namenode web interface also gives information about the safe mode status.

Leave safe mode:
> hadoop dfsadmin -safemode leave
Safe mode is OFF

Check the file system status:
> hadoop fsck /





Hadoop - Small Files Problem

I will explain the Hadoop small files problem. I will also mention some approaches that deal with this problem.

Hadoop 1.0.3 is used on Ubuntu 12.04


Hadoop is designed to work with large data, and HDFS is also efficient with large volumes of data. If there are lots of small files that are significantly smaller than the HDFS block size (which is 64MB by default), both HDFS and Hadoop mapreduce will suffer.

1. Implications on HDFS

HDFS stores the metadata of every file, directory and block in memory. One file using one block takes about 300 bytes of the namenode's memory. If there are 10 million such files, this gives us about 3GB of memory usage. If this number accumulates over time, the namenode cannot maintain its operation.
In addition to that, HDFS is designed to stream large amounts of data. There is an overhead when accessing small files and gathering the files under a directory from different datanodes.

2. Implications on MapReduce

When a job is submitted, Hadoop first calculates the input splits using the FileInputFormat class. If files are smaller than the split size (generally equal to the HDFS block size), FileInputFormat creates a split per file. For each split, a map task is created. For example, if you have 100000 10KB files (totaling 1GB), you will have 100000 map tasks. Each map task will process very little data, and there will be the overhead of creating all those map tasks. If the files were 64 MB each, there would only be 16 splits and 16 map tasks.

3. Some Options

Sequence Files: Small files can be written as sequence files.
Har Files: Hadoop archive files can be used to group multiple files into one. This way, the namenode only stores the Har file's information. However, as input to mapreduce, Har files do not offer any improvement (see the example below).
CombineFileInputFormat: This class can be used to combine small files into one map task. This does not offer any improvement on HDFS.
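As a quick illustration of the Har option, an archive can be created and browsed from the shell. The paths and archive name below are made up:

> hadoop archive -archiveName logs.har -p /data/small_logs /data/archives
> hadoop fs -ls har:///data/archives/logs.har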

Hadoop - Number of Map and Reduce Tasks

I will try to explain how the number of map and reduce tasks is calculated.

Tests are done with Hadoop 1.0.3 on an Ubuntu 12.04 machine.

1. Number of Map Tasks

When you submit a map-reduce job (or a pig/hive job), Hadoop first calculates the input splits; each input split's size generally equals the HDFS block size. For example, for a file of 1GB, there will be 16 input splits if the block size is 64MB. However, the split size can be configured to be less/more than the HDFS block size. Calculation of input splits is done by org.apache.hadoop.mapred.FileInputFormat. For each of these input splits, a map task is started.


First, let's investigate properties that govern the input split size:
mapred.min.split.size 
Default value: 1
Description: The minimum size chunk that map input should be split into. Note that some file formats may have minimum split sizes that take priority over this setting.

mapred.max.split.size 
Default value: This configuration cannot be set in Hadoop 1.0.3, it is calculated in code. However in later versions, its default value is Long.MAX_VALUE, that is 9223372036854775807.
Description: The largest valid size in bytes for a file split.

dfs.block.size
Default value: 64 MB, that is 67108864
Description: The default block size for new files.

If you are using a newer Hadoop version, some of the above properties are deprecated. You can check from here.


Configure the properties:
You can set mapred.min.split.size in mapred-site.xml
<property>
  <name>mapred.min.split.size</name>
  <value>127108864</value>
</property>

You can set dfs.block.size in hdfs-site.xml
<property>
  <name>dfs.block.size</name>
  <value>67108864</value>
</property>

Split Size Calculation Examples:
Calculation of the input split size is done in FileInputFormat as:
Math.max(minSize, Math.min(maxSize, blockSize));
mapred.min.split.size | mapred.max.split.size    | dfs.block.size | Input Split Size | Input Splits (1GB file)
1 (default)           | Long.MAX_VALUE (default) | 64MB (default) | 64MB             | 16
1 (default)           | Long.MAX_VALUE (default) | 128MB          | 128MB            | 8
128MB                 | Long.MAX_VALUE (default) | 64MB           | 128MB            | 8
1 (default)           | 32MB                     | 64MB           | 32MB             | 32


Configuring the input split size to be larger than the block size decreases data locality and can degrade performance.
According to the table above, if the file size is 1GB, there will be 16, 8, 8 and 32 input splits respectively.

What if the input files are too small?
FileInputFormat splits files that are larger than the split size. What if our input files are too small? In that case, FileInputFormat creates an input split per file. For example, if you have 100 10KB files and the input split size is 64MB, there will be 100 input splits. The total file size is only 1MB, but we will have 100 input splits and 100 map tasks. This is known as the Hadoop small files problem. You can look at CombineFileInputFormat for a solution. Apache Pig combines small input files into one map by default.

So the number of map tasks depends on
- Total size of input
- Input split size
- Structure of input files (small files problem)

2. Number of Reducer Tasks

The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the JobConf, which is set by the setNumReduceTasks() method, and Hadoop simply creates this number of reduce tasks to be run.
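For example, if your job's driver uses ToolRunner/GenericOptionsParser, the reducer count can also be passed from the command line instead of being hard-coded. The jar and class names below are made up:

> hadoop jar myjob.jar com.example.MyJob -D mapred.reduce.tasks=4 /input /output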

Wednesday, May 14, 2014

Hive External Table Creation and Partitions

Hive enables you to create tables and run SQL-like queries on HDFS data. There are two types of tables in Hive: managed tables, created with CREATE TABLE, and external tables, created with CREATE EXTERNAL TABLE.
We will look at external tables.

Hive 0.12.0 is tested with Hadoop 1.0.3.

1. External Table

External tables let you run Hive queries without Hive copying or deleting any data on HDFS. If you are using not just Hive but also Pig and mapreduce jobs, it is better to use external tables.

hive> CREATE EXTERNAL TABLE logs(
id string,
country string,
type string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';

This statement:
  • Creates an external table named logs on the HDFS path /data/logs.
  • Specifies the columns/fields and their data types in a row.
  • States that the columns/fields in a row are separated by the TAB character.
  • With this statement, Hive does not copy any data. It will use the data at the given location.
  • Hive does not even check whether the given location exists; this is useful when you want to save data in that location later. However, the data should be directly under the given folder when it is saved.
After table creation, you can run your queries.
> SHOW TABLES;
This will list existing tables

> SELECT COUNT(*) FROM logs;
This will query given table.

2. External Table with Partitions

Partitions let you categorize your data even further and can speed up your queries. Partitions enable you to use data in multiple directories.

For example, you have log data stored in multiple directories named with dates: /data/logs/2014_01_01, /data/logs/2014_01_02 and /data/logs/2014_01_03. You want to query the data in these directories and use the date as an additional filter.

hive> CREATE EXTERNAL TABLE logs_by_date(
id string,
country string,
type string)
PARTITIONED BY (date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

hive> ALTER TABLE logs_by_date ADD PARTITION (date = '2014-01-01') LOCATION '/data/logs/2014_01_01/';
hive> ALTER TABLE logs_by_date ADD PARTITION (date = '2014-01-02') LOCATION '/data/logs/2014_01_02/';
hive> ALTER TABLE logs_by_date ADD PARTITION (date = '2014-01-03') LOCATION '/data/logs/2014_01_03/';

These statements:
  • Create an external table named logs_by_date.
  • Add partitions to the logs_by_date table.
  • An additional column, date, is added to the table. This column does not exist in the raw log files; it is derived from the directory name. However, this column can be used just like any other column.

You can list partitions of table:
> SHOW PARTITIONS logs_by_date;

This will query count of rows under /data/logs/2014_01_01 directory.
> SELECT COUNT(*) FROM logs_by_date WHERE date='2014-01-01';

3. Drop External Table

You can drop your external tables by typing:
hive> DROP TABLE logs;

This will not delete any data on HDFS; it will only delete the table's metadata.







Hadoop Control Scripts - Start/Stop Daemons

Hadoop includes some control scripts to start and stop Hadoop daemons on a cluster or on a single machine.

Hadoop 1.0.3 is used on a cluster running Centos operating system. 


1. masters/slaves Files

Hadoop includes masters and slaves files to run specified commands on all cluster machines. These files are located under $HADOOP_HOME/conf.

Suppose you have a cluster of 4 servers: server1, server2, server3 and server4. On server1, you want to start namenode, jobtracker and secondary namenode. On other 3 servers, you want to run datanode and tasktracker daemons.

masters file
server1


slaves file
server2
server3
server4

If you run Hadoop on a single machine, write your server name in both files.

2. Control Scripts

Let's look at the scripts. These scripts are located under $HADOOP_HOME/bin.

1. start-dfs.sh

This script should be run on the namenode node.
> $HADOOP_HOME/bin/start-dfs.sh

  • Starts the namenode on the server on which the script is run.
  • Starts datanodes on the servers listed in the slaves file.
  • Starts the secondary namenode on the servers listed in the masters file.


2. start-mapred.sh

This script should be run on the jobtracker node.
> $HADOOP_HOME/bin/start-mapred.sh

  • Starts the jobtracker on the server on which the script is run.
  • Starts tasktrackers on the servers listed in the slaves file.
  • This script does not use the masters file.


3. stop-dfs.sh

This script should be run on the namenode node.
> $HADOOP_HOME/bin/stop-dfs.sh

  • Stops the namenode on the server on which the script is run.
  • Stops datanodes on the servers listed in the slaves file.
  • Stops the secondary namenode on the servers listed in the masters file.


4. stop-mapred.sh

This script should be run on the jobtracker node.
> $HADOOP_HOME/bin/stop-mapred.sh

  • Stops the jobtracker on the server on which the script is run.
  • Stops tasktrackers on the servers listed in the slaves file.


5. hadoop-daemon.sh

This script should be run on the datanode/tasktracker nodes.
This script operates according to the given parameters. By default, it acts on the server on which the script is run.

Start/Stop datanode
>$HADOOP_HOME/bin/hadoop-daemon.sh start datanode
>$HADOOP_HOME/bin/hadoop-daemon.sh stop datanode

Start/Stop tasktracker
>$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
>$HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker

6. start-all.sh

This script runs start-dfs.sh and start-mapred.sh, starting all daemons using the slaves and masters files. This script should be run on the namenode/jobtracker node.
>$HADOOP_HOME/bin/start-all.sh

7. stop-all.sh

This script runs stop-dfs.sh and stop-mapred.sh, stopping all daemons using the slaves and masters files. This script should be run on the namenode/jobtracker node.
>$HADOOP_HOME/bin/stop-all.sh







Tuesday, May 13, 2014

Hadoop Web Interfaces

Hadoop provides several web interfaces to monitor your Hadoop cluster. These web interfaces give a lot of information without dealing with shell commands.

I am using a Hadoop 1.0.3 cluster installed on Centos machines.

1. NameNode web interface

http://<namenode-server>:50070/

  • Enables user to browse HDFS file system
  • Gives summary about the cluster
  • This is the default address and port. Address and port can be changed by dfs.http.address property in hdfs-site.xml

2. JobTracker web interface

http://<jobtracker-server>:50030/

  • Gives information about the tasktracker nodes and mapreduce capacity of Hadoop cluster
  • Lists running/completed/retired jobs
  • Enables user to browse job history
  • Address and port can be changed by mapred.job.tracker.http.address property in mapred-site.xml

3. TaskTracker web interface

http://<tasktracker-server>:50060/

  • Lists running and non-running tasks on that tasktracker node
  • Address and port can be changed by mapred.task.tracker.http.address property in mapred-site.xml

4. Secondary NameNode web interface

http://<secondaryNN-server>:50090/

Address and port can be changed by dfs.secondary.http.address property in hdfs-site.xml


5. DataNode web interface

http://<datanode-server>:50075/
  • Enables browsing of data on specific datanode
  • Address and port can be changed by dfs.datanode.http.address property in hdfs-site.xml

Sunday, May 11, 2014

Installation of Hive

Hive gives you great capability when it comes to querying data on HDFS. Its SQL syntax is very similar to MySQL's, and you can start running your queries in a very short time. Hive takes your SQL and creates map-reduce jobs out of it.

This installation is done on a Centos machine. The machine Hive is installed on contains the Hadoop binaries and is used as a Hadoop client.

1. Download Hive
You can get Hive from http://www.apache.org/dyn/closer.cgi/hive/. I will continue with hive-0.12.0; in my country the link is http://ftp.itu.edu.tr/Mirror/Apache/hive/hive-0.12.0/hive-0.12.0.tar.gz

> cd /your/desired/path
> wget  http://ftp.itu.edu.tr/Mirror/Apache/hive/hive-0.12.0/hive-0.12.0.tar.gz

After downloading hive, extract it.

> tar -xzvf hive-0.12.0.tar.gz

2. Set up environment variables
If not set, set the following environment variables.
The recommended way is to put these variables in a shell script under /etc/profile.d. Create env_variables.sh and write:

export JAVA_HOME=/path/to/java/home
export HADOOP_HOME=/path/to/hadoop/home
export HIVE_HOME=/path/to/hive/home

To run the hive command from anywhere, we must add it to the PATH variable. Append the following to env_variables.sh. If java and hadoop are also not on the PATH, add them as well.

export PATH=$PATH:$HIVE_HOME/bin

3. If you have installed Hive on a Hadoop node or on a Hadoop client machine, it will find namenode and jobtracker addresses from Hadoop's configuration files (core-site.xml and mapred-site.xml).

4. If you run Hive with its defaults, it will store the metadata related to Hive tables in a local Derby database.
In this case, where you run Hive becomes important, because it creates this database under the directory you run Hive in. This has 2 complications:

  • Only one connection is allowed. Others cannot run Hive jobs under that directory.
  • If you run Hive from another location, you cannot see the previous table definitions.

To overcome this, we must create the metastore database on a database server. I will use MySQL.

5. Create a schema named hive_metastore and set its character set to "latin1 - default collation".

sql> CREATE SCHEMA hive_metastore DEFAULT CHARACTER SET latin1;
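You will also need a MySQL user that Hive can connect with. A minimal sketch; the username and password here are placeholders matching the hive-site.xml below:

sql> CREATE USER 'myusername'@'%' IDENTIFIED BY 'mypassword';
sql> GRANT ALL PRIVILEGES ON hive_metastore.* TO 'myusername'@'%';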

6. Create a new configuration file named hive-site.xml under "hive-0.12.0/conf" and enter your database connection properties.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://server_address/hive_metastore?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>myusername</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>mypassword</value>
  <description>password to use against metastore database</description>
</property>
</configuration>
7. For Hive to connect to the MySQL server, you must place the MySQL JDBC driver under "hive-0.12.0/lib". You can download it from http://dev.mysql.com/downloads/connector/j/

8. Now you can type the hive command and run:
hive> SHOW TABLES;
This will print existing tables.





Mysql Performance Tuning

To speed up the performance of your application, tuning MySQL can be crucial. Although there are many aspects to MySQL performance tuning, the following will give a basic understanding of, and tips about, tuning a MySQL server.

These configurations are tested on Percona Server 5.5, which is forked from the MySQL server. They should also apply to a standard MySQL server. The Percona server is running on a Centos machine.

1. Quick Server Configuration

If you are looking for a quick configuration for your server (MySQL Enterprise/Community or Percona Server), you can use following link:

https://tools.percona.com/wizard

This link provides a wizard that asks about the needs of your application and tries to create a my.cnf file that will fulfill them.

2. Quick Server Performance Check

Before dealing with performance issues, you should know the misconfigured/lacking points. You can use the following script to get an overview of your server's performance.

http://www.day32.com/MySQL/

When you run this script, it will give an analysis of different configuration points.

3. Configuration Parameters

MySQL is configured via system variables. All system variables have a default value. These variables can be given at server startup on the command line or in configuration files. On Linux this configuration file is my.cnf and its Windows counterpart is my.ini. Some system variables are dynamic, meaning that they can be modified within the mysql shell while the server is running.
To see current values:
mysql> SHOW VARIABLES;

3.1. Query Cache (query_cache_size)

Quick Configuration (my.cnf file):
query_cache_size=128M
query_cache_limit=1M
query_cache_type=1

query_cache_size is the amount of memory allocated to cache query results.
query_cache_limit is the maximum size of a SELECT result that can be cached. Query results bigger than this limit are not cached.
query_cache_type has 3 options:
  • 0: the query cache is disabled
  • 1: all queries are cached, except the ones beginning with SELECT SQL_NO_CACHE
  • 2: only queries beginning with SELECT SQL_CACHE are cached

Key points:
  • The query cache stores the text of a SELECT statement together with the corresponding result.
  • If an identical statement is received later, the server retrieves the results from the query cache rather than parsing and executing the statement again
  • The query cache is shared among sessions, so a result set generated by one client can be sent in response to the same query issued by another client.
  • MySQL does not return stale data; when table data is changed, cached queries using that table are flushed.
  • Queries must be exactly the same to be counted as identical. For example, the following queries are not identical:
select * from test;
SELECT * FROM test;

The query cache can be useful in an environment where you have tables that do not change very often and for which the server receives many identical queries. In an environment where tables are constantly changing, maintaining the query cache can cause a performance impact.
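If most of your workload is cache-friendly but a few queries change too often to be worth caching, you can opt those out per query (assuming query_cache_type=1; the table name is made up):

mysql> SELECT SQL_NO_CACHE * FROM orders WHERE created_at > NOW() - INTERVAL 1 HOUR;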

Monitor Query Cache:
Following command gives query cache statistics:
mysql> SHOW GLOBAL STATUS LIKE 'Qcache%';
+-------------------------+-----------+
| Variable_name           | Value     |
+-------------------------+-----------+
| Qcache_free_blocks      | 1         |
| Qcache_free_memory      | 134182560 |
| Qcache_hits             | 3694      |
| Qcache_inserts          | 19        |
| Qcache_lowmem_prunes    | 0         |
| Qcache_not_cached       | 9         |
| Qcache_queries_in_cache | 14        |
| Qcache_total_blocks     | 31        |
+-------------------------+-----------+
8 rows in set (0.00 sec)


3.2. Table Cache (table_open_cache/table_cache)

When different clients try to operate on the same table simultaneously, the table is opened independently for each client. If the open tables count reaches the maximum number specified by table_open_cache, tables have to be closed and reopened, and performance can degrade.
table_cache was renamed to table_open_cache in the MySQL 5.1.3 release.

Quick Configuration (my.cnf file):
table_open_cache=2048
open-files-limit=65536

table_open_cache is the maximum number of open tables that MySQL can have.
open_files_limit is the maximum number of file descriptors that MySQL can have. This number is still bounded by the operating-system-wide limit.
On Unix systems, you can get the maximum number of file descriptors per process by typing:
> ulimit -n


Key Points:
  • table_open_cache is related to max_connections. If there are 100 concurrent connections and your joins include 10 tables, it is wise to set table_open_cache to 100*10=1000.
  • If MyISAM usage is heavy, you should consider that a MyISAM table takes 2 file descriptors instead of 1. Additionally, if you use the same MyISAM table again in a query, for example joining it with itself, 1 additional file descriptor is used.
  • MySQL is limited by operating system limitations in terms of file descriptors. On Unix systems, you can see the file descriptor limit per process with ulimit -n. You can raise MySQL's limit with the open-files-limit variable in the configuration. If this limit is reached, MySQL gives a "Too many open files" error.

Monitor Table Cache:
You can monitor the table cache by typing:
mysql> SHOW GLOBAL STATUS LIKE '%Opened_tables%';

Caveat: Run this command several times during the busiest times of the server. If the number keeps increasing, you should consider increasing table_open_cache.

3.3. Key Buffer Size (key_buffer_size)

MyISAM indexes are shared by all clients, and key_buffer_size specifies the limit for this buffer in memory. The limit is 4GB-1 on 32-bit machines and larger on 64-bit machines. It is recommended that this value be close to 25% of system memory. If you are using MyISAM tables heavily, you can try increasing this value.

Quick Configuration (my.cnf file):
key_buffer_size=4G


Key Points:
  • Under heavy MyISAM usage, this variable can be crucial.

Monitor Key Buffer Size:
You can list status of operations related to key reads and writes.
mysql> SHOW GLOBAL STATUS LIKE 'Key_%';

4. Further Reading

You can get more detail from following sites.
http://www.mysqlperformanceblog.com/
http://dev.mysql.com/doc/refman/5.0/en/server-parameters.html
http://dev.mysql.com/doc/refman/5.1/en/query-cache.html
http://dev.mysql.com/doc/refman/5.0/en/table-cache.html






Wednesday, April 9, 2014

Create User and Groups in Linux

How can I create users on Linux? How can I create groups and assign users to a group? Although there is more to creating a user or group, it comes in handy to learn the basics.

These commands have been run on Centos 6.5. They should work on RedHat derivatives. On other Linux distributions, the behavior can differ.


1. Create a User

The useradd command creates a user. The new username is written to the system files as needed and a home directory for the user is created. In Centos and other RedHat derivatives, a default group with the same name as the user is also created and the user is assigned to this group.

Creates user "javauser". This also creates a group name "javauser" and assign the user to it. You can see groups of new user by typing "groups javauser".
> useradd javauser


Creates a user and assigns it to an existing group.
> useradd -g devgroup javauser


Creates a user and assigns it to a secondary group. A new group with the same name as the user is also created and assigned as the primary group.
> useradd -G testgroup javauser


You can specify multiple secondary groups.
> useradd -G testgroup,sysgroup javauser


Creates a user but prevents the default behaviour of creating a group with the same name. Creating such a group is specific to RedHat derivatives.
> useradd -n javauser


2. Set Password for User

passwd command sets or updates a user's password.

Sets or updates password for a user.
> passwd javauser
When you enter the password and confirm it, password is set.


3. Change Groups of User

To change the groups of an existing user, you can use the usermod command. A user has one primary group and multiple secondary groups.

Updates primary group of a user to specifed group. Former primary group is removed from the list of user's groups.
> usermod -g gr1 javauser
gr1 becomes the primary group of javauser.


Sets specified group(s) as secondary groups of user. Former secondary groups are unassigned if they are not specified in new group list. Groups are passed as a comma seperated list with no spaces between.
> usermod -G gr2,gr3 javauser
javauser now has gr2 and gr3 in its secondary groups.


Appends specified group(s) to the user's groups.
> usermod -a -G gr4 javauser
gr4 will be added to javauser's groups.


4. Delete a User

The userdel command deletes the specified user.
However, processes started by the user, files owned by that user and jobs created by the user must be handled separately and in a planned order.

Deletes javauser
> userdel javauser


Deletes user and its home directory.
> userdel -r javauser



5. Create a Group


groupadd command creates a group.

Creates a group name mygroup
> groupadd mygroup



6. List Groups of a User

groups command list primary and secondary groups of a user


Prints groups of javauser
> groups javauser


Prints groups of the effective user (current working user)
> groups



7. Further Reading

You can find more in the manuals of these commands:
http://linux.die.net/man/8/useradd
http://linux.die.net/man/8/groupadd
http://linux.die.net/man/8/userdel
http://linux.die.net/man/8/usermod





Sunday, April 6, 2014

HDFS Security: Authentication and Authorization

Hadoop Distributed File System (HDFS) security features can be confusing at first, but they in fact follow simple rules. We can examine the topic in two parts: authentication and authorization.

I am testing with Hadoop 1.0.3 on Centos 5.8 server. In my client, I am using Centos 6.5.



1. Authentication

Authentication is determining whether someone is really who he claims to be.

Hadoop supports two different authentication mechanisms specified by the hadoop.security.authentication property which is defined in core-default.xml and its site specific version core-site.xml.

  • simple (no authentication)
  • kerberos

By default, simple is selected.


1.1. simple (no authentication)

If you have installed Hadoop with its defaults, no authentication is performed. This means that any Hadoop client can claim to be any user on HDFS.

The identity of the client process is determined by the host operating system. In Unix-like systems, this is equivalent to whoami, that is, the username the user is working under. Hadoop does not provide user creation or management.


Let's say, in your client machine you have typed:

> useradd hadoopuser
> su hadoopuser
> hadoop dfs -ls /

hadoopuser will be sent as the user identity to the NameNode. After that, permission checks (which are part of the authorization process) will be done.



Implications

This mode of operation poses great risk.

Think about this scenario:

1. In your Hadoop cluster, you have started the NameNode daemon as user "hdfsuser". In Hadoop, the user running the NameNode is the super-user and can do anything on HDFS. Permission checks never fail for the super-user.
2. There is a client machine with the Hadoop binaries installed, and this machine has network access to the Hadoop cluster.
3. The owner of the client machine knows the NameNode address and port. He also knows that the super-user is "hdfsuser", but does not know its password.
4. To access HDFS and do much more, all he needs to do is create a new user "hdfsuser" on the client machine. Then he can run Hadoop shell commands as this user.
> useradd hdfsuser
> su hdfsuser
> hadoop dfs -rmr /
When he runs these commands, Hadoop believes the client's claim that "he is hdfsuser" and executes the requested command. As a result, all data on HDFS is deleted.


1.2. kerberos

I have no hands-on experience with this mode.


2. Authorization

Authorization is the function of specifying access rights to resources.
First we can examine the HDFS permissions model and then see how permission checks are done.


2.1. HDFS Permissions Model


HDFS implements a permissions model for files and directories that shares much of the POSIX model.


We can continue by an example. If you run:
> hadoop dfs -ls /

You will see something like:

Found 2 items
drwxr-xr-x   - someuser   somegroup          0 2014-03-05 15:01 /tmp
drwxr-xr-x   - someuser   somegroup          0 2014-03-03 11:01 /data

1. Each file and directory is associated with an owner and a group. In the example, someuser is the owner of /tmp and /data directories. somegroup is the group of these directories.

2. The files or directories have separate permissions for the user that is the owner, for other users that are members of the group, and for all other users.

In the example, drwxr-xr-x determines this behaviour. The first letter, d, specifies whether it is a directory or not. rwx shows the owner permissions, r-x shows the permissions for group users, and the last r-x shows the permissions for other users.

3. For files, the r permission is required to read the file, and the w permission is required to write or append to the file. For directories, the r permission is required to list the contents of the directory, the w permission is required to create or delete files or directories, and the x permission is required to access a child of the directory.


4. When a new file/directory is created, owner is the user of client process and group is the group of its parent directory.


One note:
In HDFS owner and group values are just Strings, you can assign non-existent username and groups to a file/directory.


2.2. Permission Checking

Authorization takes place after the user is authenticated. At this point, the NameNode knows the username, let's say hdfsuser, and then it tries to get the groups of hdfsuser.

This group mapping is configured by the hadoop.security.group.mapping property in core-default.xml and its site-specific version core-site.xml. The default implementation achieves this by simply running the groups command for the user. Then it maps the username to the returned groups.


Group mapping is done on NameNode machine NOT on the client machine. This is an important realization to make. hdfsuser can have different groups on client and NameNode machines; but what goes into mapping is the groups on NameNode.


When NameNode has username and its groups list, permission checks can be done.

  • If the user name matches the owner of the file/directory, then the owner permissions are tested;
  • Else if the group of the file/directory matches any member of the groups list, then the group permissions are tested;
  • Otherwise the other permissions of the file/directory are tested.


If a permissions check fails, the client operation fails.


2.3. Super-User

The user running the NameNode daemon is the super-user, and permission checks for the super-user never fail.


2.4. Super Group

If you set dfs.permissions.supergroup in hdfs-site.xml, you can make members of given group also super-users. By default, it is set to supergroup in hdfs-default.xml.


<property>
    <name>dfs.permissions.supergroup</name>
    <value>admingroup</value>
</property>

In later Hadoop versions, property is named as dfs.permissions.superusergroup.


2.5. Disable HDFS Permissions

If you set dfs.permissions to false, permission checking is disabled. However, this does not change the mode, owner or group of files/directories.
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>

In later Hadoop versions, property is named as dfs.permissions.enabled



Tuesday, April 1, 2014

Shell Types and Shell Config Files: /etc/profile, /etc/bashrc, ~/.bash_profile, ~/.bashrc

Setting JAVA_HOME must be easy, right? However, this can sometimes get tricky. To better understand your version of "JAVA_HOME is not set" errors, it can be instructive to look at shell types, config files and their working principles. 

I am using Centos 6.5. The explanations below should also work for Red Hat derivatives. On other Linux distributions, things could differ.


1. Environment Variables 

1.1. Process Locality

If you open two shells and define an environment variable in first shell, second shell will not be able to see new environment variable.

1. Open two terminals
2. Create an environment variable in first:
> export SHELLNUMBER=1

3. In second shell, type:
> echo $SHELLNUMBER
This will print nothing.

1.2. Inheritance

If you define a variable in a shell and open a sub-shell, new variable will be available to sub-shell.
1. Open a terminal
2. Create an environment variable:
> export PARENT=1
3. Create a sub-shell by typing:
> bash
4. List variables:
> env | grep PARENT
This will print 
PARENT=1

We can continue as follows to also test process locality:
5. Create another variable in sub-shell and list variables:
> export PARENT2=2
> env | grep PARENT
This will print:
PARENT=1
PARENT2=2
6. Now exit sub-shell and list variables:
> exit
> env | grep PARENT
This will print only
PARENT=1

As you can see, variables defined in sub-shell are not available to parent shell.

1.3. Case-Sensitivity

Environment variables are case sensitive meaning that JAVA_HOME and Java_Home are different variables. It is a common practice to use capital letters and underscore signs.

2. Shell Types

We can group shells in two categories: interactive/non-interactive and login/non-login shells. 


2.1. Interactive/non-interactive shells

An interactive shell is one whose input and output are connected to a terminal, or one started with the -i flag.
A non-interactive shell is one where user input is not needed, such as a shell script.

You can check whether the shell you are working in is interactive (it probably is) by typing:
> echo $-
If output contains i, it is interactive.

To see a non interactive shell, create a shell script name test.sh with contents:
echo $-
Then run with:
> bash test.sh
The output will not contain i.

2.2. Login and non-login shells

A login shell is the shell you get when you log in, or one started with the -l flag.
A login shell usually prompts for a user and password. This is the case when you ssh to remotely log in to a Linux machine.
Other Examples:
> bash -l

> su -l 

A non-login shell does not prompt for a user and password. This is the case when you are already logged in and type /bin/bash. An interactive non-login shell is also started when you open a terminal in a graphical environment.

3. Config Files

Files under /etc usually provide global settings, and files under the user's home directory provide user-specific settings. User-specific files can override the global settings.

An interactive login shell reads /etc/profile and ~/.bash_profile. In Centos, ~/.bash_profile also reads ~/.bashrc and then /etc/bashrc files.

An interactive non-login shell gets its parent environment and reads ~/.bashrc and /etc/bashrc for additional configuration.

A non-interactive non-login shell expands the $BASH_ENV variable if it is not null and reads the specified file. Otherwise, it only gets its parent environment.

You can test this behaviour as follows:
1. Open a terminal. We will use this as our main shell, do not close it.
2. Create a shell script named test.sh with contents:
env | grep SCRIPTVAR 

3. Define a new variable in /etc/bashrc file:
export SCRIPTVAR=1 
This variable is not available to our main shell, since /etc/bashrc would have to be read again.

4. Will it be available to shell script? Run test.sh:
> bash test.sh
This will not print our new variable

5. Set BASH_ENV variable to /etc/bashrc (just for testing, not for daily usage):
> export BASH_ENV=/etc/bashrc

6. Run script again and it prints SCRIPTVAR=1.