I currently have a simple Hadoop Hive query, looking at 12 months' worth of data, which used to run successfully via Hue. However, because read thresholds have been reduced (I have no control over this), the query now always fails as too many Maps are being used. On top of that, the process was being run manually and the results downloaded to CSV. This script will allow the process to be scheduled and run automatically every month.
So this gig will need to do the following via a Bash script:
- Drop 2 tables in Hive if they exist. We have to do this because Hive does not allow deleting all rows from a table.
- Create 2 tables in Hive
- Dynamically calculate 2 variables which contain the start (dtStart) and end (dtEnd) dates for the last 12-month period, e.g. if the script is executed on 2014-09-28, the date range will be 2013-09-01 to 2014-08-31. Clearly this range will change every time the script is executed.
- Run a Hive query in a loop to insert data into the first table. The loop advances one month at a time (dtStart = dtStart + 1 month) until all 12 months are done (dtStart >= dtEnd), and each iteration's query uses the date range for that specific month. It would probably be easiest to use the first day of each month, i.e. >= dtStart and < dtStart + 1 month.
- When the loop completes, a second Hive query is executed which aggregates the first table and inserts the result into the second table.
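The date calculation in the steps above could be sketched with GNU `date` (the `-d` relative-date syntax is GNU coreutils specific; the `ref` date is hard-coded here purely for illustration — the real script would use today's date):

```shell
# Reference date; in the real script this would simply be "today"
ref="2014-09-28"

# dtEnd: first day of the current month (exclusive upper bound of the range)
dtEnd=$(date -d "$ref" +%Y-%m-01)

# dtStart: first day of the month twelve months earlier
dtStart=$(date -d "$dtEnd -12 months" +%Y-%m-01)

echo "$dtStart $dtEnd"   # 2013-09-01 2014-09-01
```

A half-open range [dtStart, dtEnd) is equivalent to the inclusive 2013-09-01 to 2014-08-31 range in the example, and lines up with the >= / < comparison suggested for the loop.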
Note: this script must include some error checking in case of query failure. The second query must not be executed if the first table does not contain all 12 months; it might be simplest to exit the script if any of the queries in the loop fail. Failure will most likely be caused by exceeding the threshold on the number of Maps (unlikely at the moment, but you never know) or by a timeout due to other processes. I highly doubt the second query, after the loop, will fail. I can provide table names, table structures, and the 2 INSERT statements, or you could use something generic and I will change it afterwards.
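Putting the requirements together, a minimal skeleton might look like the following. The table names, column names, `source_table`, and `event_date` are all invented placeholders (the real DDL and INSERT statements would be substituted in), and it assumes the `hive` CLI and GNU `date` are available. Every Hive call is checked and the script exits on the first failure, so the aggregation step can never run against an incomplete first table:

```shell
#!/usr/bin/env bash
set -euo pipefail              # abort on errors, unset variables, pipe failures

# Command used to talk to Hive; overridable (e.g. for testing, or beeline)
HIVE="${HIVE:-hive}"

# Placeholder table names -- substitute the real ones
MONTHLY_TABLE="monthly_data"
SUMMARY_TABLE="monthly_summary"

run_hive() {
    # Run one Hive statement; exit the whole script if it fails
    if ! "$HIVE" -e "$1"; then
        echo "Hive query failed, aborting: $1" >&2
        exit 1
    fi
}

recreate_tables() {
    # Hive cannot delete all rows, so drop and recreate both tables
    run_hive "DROP TABLE IF EXISTS ${MONTHLY_TABLE}"
    run_hive "DROP TABLE IF EXISTS ${SUMMARY_TABLE}"
    run_hive "CREATE TABLE ${MONTHLY_TABLE} (/* real columns here */ id INT)"
    run_hive "CREATE TABLE ${SUMMARY_TABLE} (/* real columns here */ id INT)"
}

load_months() {
    # dtEnd = first day of the current month (exclusive upper bound);
    # dtStart = first day of the month twelve months earlier
    local dtEnd dtStart next
    dtEnd=$(date +%Y-%m-01)
    dtStart=$(date -d "$dtEnd -12 months" +%Y-%m-01)

    # ISO dates compare correctly as strings, so < is safe here
    while [[ "$dtStart" < "$dtEnd" ]]; do
        next=$(date -d "$dtStart +1 month" +%Y-%m-01)
        run_hive "INSERT INTO TABLE ${MONTHLY_TABLE}
                  SELECT /* columns */ FROM source_table
                  WHERE event_date >= '${dtStart}' AND event_date < '${next}'"
        dtStart="$next"
    done
}

aggregate() {
    # Only reached if all twelve monthly inserts succeeded
    run_hive "INSERT INTO TABLE ${SUMMARY_TABLE}
              SELECT /* aggregates */ FROM ${MONTHLY_TABLE} GROUP BY id"
}

# Run the full pipeline only when RUN_PIPELINE=1, so the functions can
# also be sourced and exercised individually
if [[ "${RUN_PIPELINE:-0}" == "1" ]]; then
    recreate_tables
    load_months
    aggregate
fi
```

Because the inserts run month by month, each Hive job only scans one month of data, which should keep the Map count well under the reduced threshold.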
Please let me know your quote on rate and timeframe to build this. Thanks in advance.
Hello,
I have used Bash in many automation scripts.
I have no experience with Hadoop, but that won't be a problem for me.
I can start working tomorrow afternoon.
Thanks.