You have large volumes of data collected and stored in HDFS. Your analysis process may differ according to the nature of your data and the needs of your business. To investigate further, let's go through the following scenarios:
1. Running in Scheduled Time Intervals
You want to analyze your data at scheduled time intervals. For example, you run an e-commerce site and want to group your customers into categories such as sports, books, and electronics. You want to run this analysis every day, every week, or every month.
As another example, you want to create a report that shows the impression/purchase counts (and other statistics) for your products. You want to generate this report every night at 01:00, as in the sketch below.
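This scenario maps naturally to an Oozie coordinator application. Below is a minimal sketch of a coordinator.xml that triggers a workflow every night at 01:00 UTC; the application name, date range, and HDFS path are hypothetical placeholders, not values from a real deployment.

```xml
<coordinator-app name="nightly-product-report"
                 frequency="${coord:days(1)}"
                 start="2014-01-01T01:00Z" end="2015-01-01T01:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- Hypothetical path to the workflow application that builds the report -->
      <app-path>hdfs://namenode/apps/product-report</app-path>
    </workflow>
  </action>
</coordinator-app>
```

The frequency expression and the start time together determine when instances are materialized, so this coordinator launches one workflow run per day at 01:00 UTC.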
2. Running When Data is Present
You want to run an analysis when a specific data feed arrives. As in the previous example, you run an e-commerce site and collect event logs. However, you also depend on another dataset becoming available on HDFS, as sketched below.
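Oozie coordinators also support data-availability triggers. The sketch below (with hypothetical dataset names and paths) declares a dataset and an input event; the workflow is started only once the expected directory, flagged by a _SUCCESS file, appears on HDFS.

```xml
<coordinator-app name="event-log-analysis"
                 frequency="${coord:days(1)}"
                 start="2014-01-01T02:00Z" end="2015-01-01T02:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- The external data feed this analysis depends on (hypothetical path) -->
    <dataset name="productFeed" frequency="${coord:days(1)}"
             initial-instance="2014-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://namenode/data/product-feed/${YEAR}/${MONTH}/${DAY}</uri-template>
      <!-- Oozie waits for this flag file before starting the workflow -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- Require today's instance of the feed to be present -->
    <data-in name="input" dataset="productFeed">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://namenode/apps/event-log-analysis</app-path>
    </workflow>
  </action>
</coordinator-app>
```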
3. Running Dependent Analyses
You want to run a sequence of analyses that depend on each other's output. Say you want to implement a basic suggestion system. First, you analyze the impression/purchase counts of your products (your products also have associated categories). Second, you analyze the interest areas of your customers. Then you merge these two outputs and match products with customers. In the end, you offer specific products to specific customers. A workflow sketch for this pipeline follows below.
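In Oozie, such a pipeline is expressed as a workflow DAG. Below is a sketch of a workflow.xml for this suggestion pipeline: the two independent analyses run in parallel under a fork, and the merge step starts only after both succeed at the join. The action names are hypothetical, and the map-reduce job configurations are omitted for brevity.

```xml
<workflow-app name="product-suggestion" xmlns="uri:oozie:workflow:0.4">
  <start to="fork-analyses"/>

  <!-- The two analyses are independent, so they can run in parallel -->
  <fork name="fork-analyses">
    <path start="product-stats"/>
    <path start="customer-interests"/>
  </fork>

  <action name="product-stats">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- configuration of the impression/purchase count job goes here -->
    </map-reduce>
    <ok to="join-analyses"/>
    <error to="fail"/>
  </action>

  <action name="customer-interests">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- configuration of the customer interest job goes here -->
    </map-reduce>
    <ok to="join-analyses"/>
    <error to="fail"/>
  </action>

  <!-- The merge step depends on the output of both analyses -->
  <join name="join-analyses" to="merge-and-match"/>

  <action name="merge-and-match">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- configuration of the product/customer matching job goes here -->
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Suggestion pipeline failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```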
4. Minimal Technical Effort/Minimal Dependency
From a technical point of view, you want to build a scalable and extensible scheduling system. You want to run your analyses with minimal effort and minimal language dependencies.

One Good Solution - Oozie
What we use in our system is Apache Oozie. We have some of the use cases stated above, and others will likely become relevant for us in the near future.

An excerpt from the Oozie web page:
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).
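To give a feel for how little glue is involved: an application is deployed by uploading its XML to HDFS and submitting it with a small properties file through the Oozie CLI. The host names and paths below are assumptions for illustration.

```properties
# job.properties (hypothetical values)
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
oozie.coord.application.path=${nameNode}/apps/product-report
```

```bash
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run
```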
In later posts, I will explain Oozie workflow and coordinator applications, as well as workflow actions, with examples.