Wednesday, May 14, 2014

Hive External Table Creation and Partitions

Hive lets you create tables and run SQL-like queries on data stored in HDFS. There are two types of tables in Hive: managed tables, created with CREATE TABLE, and external tables, created with CREATE EXTERNAL TABLE.
We will look at external tables.

The examples below were tested with Hive 0.12.0 on Hadoop 1.0.3.

1. External Table

External tables let you run Hive queries without Hive copying or deleting any data on HDFS. If you are using not just Hive but also Pig or MapReduce jobs on the same data, it is better to use external tables.

hive> CREATE EXTERNAL TABLE logs(
id string,
country string,
type string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';

This statement:
  • Creates an external table named logs on the HDFS path /data/logs.
  • Specifies the columns/fields of a row and their data types.
  • States that the columns/fields in a row are separated by the TAB character (a sample row is shown after this list).
  • Does not copy any data; Hive will use the data at the given location.
  • Hive does not even check whether the given location exists, which is useful when you want to save data to that location later. However, the data must sit directly under the given directory when it is saved.
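
For illustration, a row in a file under /data/logs could look like the lines below, where <TAB> stands for an actual TAB character (the values themselves are made up):

1<TAB>TR<TAB>click
2<TAB>US<TAB>view

Each TAB-separated field maps, in order, to the id, country and type columns of the logs table.
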
After creating the table, you can run your queries.
> SHOW TABLES;
This will list the existing tables.

> SELECT COUNT(*) FROM logs;
This will count the rows of the logs table.

2. External Table with Partitions

Partitions let you categorize your data further and can speed up your queries. Partitions also enable you to use data spread across multiple directories.

For example, suppose you have log data stored in multiple directories named by date: /data/logs/2014_01_01, /data/logs/2014_01_02 and /data/logs/2014_01_03. You want to query the data in these directories and be able to use the date as an additional filter.

hive> CREATE EXTERNAL TABLE logs_by_date(
id string,
country string,
type string)
PARTITIONED BY (date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

hive> ALTER TABLE logs_by_date ADD PARTITION (date = '2014-01-01') LOCATION '/data/logs/2014_01_01/';
hive> ALTER TABLE logs_by_date ADD PARTITION (date = '2014-01-02') LOCATION '/data/logs/2014_01_02/';
hive> ALTER TABLE logs_by_date ADD PARTITION (date = '2014-01-03') LOCATION '/data/logs/2014_01_03/';

These statements:
  • Create an external table named logs_by_date.
  • Add partitions to the logs_by_date table.
  • Add a date column to the table. This column does not exist in the raw log files; it is derived from the directory names. However, it can be used just like any other column.

You can list the partitions of the table:
> SHOW PARTITIONS logs_by_date;

This will count the rows under the /data/logs/2014_01_01 directory:
> SELECT COUNT(*) FROM logs_by_date WHERE date='2014-01-01';
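
Since the date column behaves like any other column, you can also aggregate over it. A minimal sketch, assuming the three partitions above have been added:

> SELECT date, COUNT(*) FROM logs_by_date GROUP BY date;

This returns one row count per date partition.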

3. Drop External Table

You can drop an external table by typing:
hive> DROP TABLE logs;

This will not delete any data on HDFS; it only deletes the table's metadata.
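
Because the data stays on HDFS, you can simply recreate the table over the same location and query it again. A minimal sketch, reusing the statement from section 1:

hive> CREATE EXTERNAL TABLE logs(
id string,
country string,
type string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';
hive> SELECT COUNT(*) FROM logs;

The count will be the same as before the DROP, since only the table metadata was removed.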

1 comment:

  1. Does partition pruning help with a Hive external table in text format in Spark? Suppose I run a query like spark.sql("""select * from hiveexternaltable where partitioncol='partitionvalue'""")
