Thursday, June 26, 2014

How to Schedule Hadoop Jobs - Apache Oozie

In this post, I will explain how to schedule Hadoop jobs with Apache Oozie.

You have large volumes of data collected and stored in HDFS. Your analysis process may differ according to the nature of your data and the needs of your business. To investigate further, let's go through the following scenarios:

1. Running at Scheduled Time Intervals

You want to analyze your data at scheduled time intervals.

For example, you have an e-commerce site and want to put your customers into categories like sports, books, electronics, etc. You want to run this analysis every day, every week, or every month.

As another example, you want to create a report that shows the impression/purchase counts (and other statistics) of your products, and you want to generate this report every night at 01:00 am.
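As a preview of how Oozie expresses this, a minimal coordinator definition that triggers a report workflow once a day might look like the sketch below. The application name, HDFS path, and dates are made-up placeholders, and the start time is given in UTC, so it should be shifted to match your local 01:00 am:

<coordinator-app name="nightly-report-coord"
                 frequency="${coord:days(1)}"
                 start="2014-06-27T01:00Z" end="2015-06-27T01:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- HDFS directory containing the report workflow.xml -->
            <app-path>${nameNode}/user/analytics/apps/nightly-report</app-path>
        </workflow>
    </action>
</coordinator-app>

The frequency accepts EL functions such as ${coord:days(1)}, ${coord:days(7)} or ${coord:months(1)}, so daily, weekly, and monthly runs are just different frequency values.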

2. Running When Data Is Present

You want to run an analysis when a specific data feed arrives.

As in the previous example, you have an e-commerce site and are collecting event logs. However, your analysis also depends on another data set being available on HDFS.
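Oozie models this with datasets and input events: the coordinator is still materialized on a schedule, but each run waits until its input instance actually exists. A sketch, in which the feed name, URI template, and done flag are all assumptions:

<coordinator-app name="feed-driven-coord"
                 frequency="${coord:days(1)}"
                 start="2014-06-27T00:00Z" end="2015-06-27T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <!-- the external data feed this analysis depends on -->
        <dataset name="partner-feed" frequency="${coord:days(1)}"
                 initial-instance="2014-06-27T00:00Z" timezone="UTC">
            <uri-template>${nameNode}/data/partner-feed/${YEAR}/${MONTH}/${DAY}</uri-template>
            <!-- the run is held back until this flag file appears -->
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="partner-feed">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${nameNode}/user/analytics/apps/feed-analysis</app-path>
        </workflow>
    </action>
</coordinator-app>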

3. Running Dependent Analyses

You want to run a sequence of analyses that depend on each other's output.

You want to implement a basic suggestion system. First, you will analyze the impression/purchase counts of your products (your products also have associated categories). Second, you will analyze the interest areas of your customers. Then you will merge these two outputs and match products with customers. In the end, you will offer specific products to specific customers.
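This kind of dependency chain maps directly to an Oozie workflow DAG: each action's ok transition points to the next analysis, and any failure routes to a kill node. The skeleton below assumes three Pig scripts whose names are purely illustrative:

<workflow-app name="suggestion-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="product-stats"/>

    <!-- step 1: impression/purchase counts per product -->
    <action name="product-stats">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>product_stats.pig</script>
        </pig>
        <ok to="customer-interests"/>
        <error to="fail"/>
    </action>

    <!-- step 2: interest areas per customer -->
    <action name="customer-interests">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>customer_interests.pig</script>
        </pig>
        <ok to="match-products"/>
        <error to="fail"/>
    </action>

    <!-- step 3: join the two outputs and produce suggestions -->
    <action name="match-products">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>match_products.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Suggestion workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>

Note that steps 1 and 2 do not actually depend on each other, so they could also run in parallel inside a fork/join pair; the linear chain here just keeps the sketch short.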

4. Minimal Technical Effort/Minimal Dependency

From a technical point of view, you want to build a scalable and extensible scheduling system, and you want to run your analyses with minimal effort and minimal language dependencies.

One Good Solution - Oozie

What we use in our system is Apache Oozie. We already have some of the use cases stated above, and the others will likely become relevant for us in the near future.


An excerpt from the Oozie web page:

Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.

Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).
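For completeness, submitting one of the coordinators sketched above usually comes down to a small job.properties file plus one CLI call. The host names and paths here are placeholders:

nameNode=hdfs://namenode:8020
jobTracker=jobtracker:8032
oozie.coord.application.path=${nameNode}/user/analytics/apps/nightly-report-coord

$ oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

Oozie answers with a job id that you can later pass to oozie job -info to watch the coordinator materialize its actions.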

In later posts, I will explain Oozie workflow and coordinator applications and workflow actions with examples.