Friday, August 28, 2015

AWS Datapipeline and Python Script

I had a task that needed to be done quickly. Then I learned that there is a language that can handle certain tasks more efficiently and easily than other languages, and it is called "Python".

Python has lots of powerful libraries for getting tasks done quickly, and it works much like a scripting language.
I had no prior experience with this wonderful language, but I thought I would give it a try, and guess what! It is similar to other languages like Java and C#, and it needs less time for certain tasks because of the wide library support in Python.

My task involved doing some manipulations on AWS resources and saving the results to other AWS resources. Python has a powerful library called Boto that is very easy to work with. Have a look at the boto library here
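To give a feel for what working with boto looks like, here is a minimal sketch. It is not my actual script: the transform() step, the bucket names and the key names are made up for illustration, and I am assuming the "AWS resources" are S3 objects. The function takes a connection object like the one returned by boto.connect_s3(), reads one object, changes it and writes the result to another bucket.

```python
def transform(text):
    """Hypothetical manipulation: upper-case the whole text."""
    return text.upper()


def copy_with_transform(conn, src_bucket, src_key, dst_bucket, dst_key):
    """Read one S3 object, transform it, and save the result elsewhere.

    `conn` is assumed to behave like a boto S3 connection, for example
    the object returned by boto.connect_s3().
    """
    source = conn.get_bucket(src_bucket).get_key(src_key)
    if source is None:
        # boto's get_key() returns None when the key does not exist
        raise KeyError("missing key: %s/%s" % (src_bucket, src_key))

    data = source.get_contents_as_string()
    if isinstance(data, bytes):
        data = data.decode("utf-8")

    result = transform(data)

    # new_key() only creates a local handle; the upload happens on set_contents_*
    target = conn.get_bucket(dst_bucket).new_key(dst_key)
    target.set_contents_from_string(result)
    return len(result)


# Usage (needs boto installed and AWS credentials configured):
#   import boto
#   copy_with_transform(boto.connect_s3(),
#                       "my-input-bucket", "in.txt",
#                       "my-output-bucket", "out.txt")
```

Passing the connection in as a parameter keeps the function easy to try out with a stand-in object before pointing it at real buckets.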

I was able to complete my work quickly using Python and ended up with a Python script. Next came a requirement to schedule this script so that it runs daily at a certain time and performs its task.

As we already use AWS resources heavily for our work, it was not a difficult decision to choose AWS Datapipeline to do this for us using EMR clusters.

So now I had everything I needed: my script was ready, and I could schedule it using AWS Datapipeline. But a question popped up in my mind: can a Python script actually be scheduled using Datapipeline?
FYI, I was also new to Datapipeline.

I decided to research that, and after a lot of effort (searching on the internet ;) and various rounds of trial and error with the Datapipeline options), I was able to schedule my Python script, which uses the boto library, on AWS Datapipeline.


So, here are the steps to schedule a Python script on AWS Datapipeline, to make it easy for you guys:
Step 1: Have your Python script ready.
Step 2: Have an AWS account and access to the AWS console.
Step 3: Choose Datapipeline in the console and start creating a pipeline.
Step 4: Choose EmrActivity as the source, provide the S3 path of your script in "input", and provide an output path pointing to another S3 bucket location.
Step 5: In order to run Python from the EMR cluster, you need to add "preStepCommand" : "" .
Step 6: Choose the EMR cluster and the desired hardware configuration.
Step 7: Schedule your job. You can also add preconditions so that Datapipeline checks that they are fulfilled before each run.
Step 8: Set up logging to your S3 logs directory so that you can check for problems in your job and debug issues using those logs.
Step 9: Set up SNS topics and subscribe to them for job completion and job failure notifications.
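
The steps above can also be written down as a pipeline definition instead of clicking through the console. Below is a rough sketch in Python that builds such a definition as a dict and prints it as JSON. This is not my exact pipeline: the object ids, the instance type, the role names and all S3 paths and topic ARNs are placeholders I made up. The EMR step string and the preStepCommand value are left as parameters because they depend on your own setup.

```python
import json


def build_pipeline_definition(script_s3_path, output_s3_path, log_s3_uri,
                              success_topic_arn, failure_topic_arn,
                              emr_step, start_date_time, pre_step_command=""):
    """Build a Datapipeline definition dict; ids and sizes are placeholders."""
    activity = {
        "id": "RunPythonScript",
        "type": "EmrActivity",
        "runsOn": {"ref": "Cluster"},
        "input": {"ref": "ScriptInput"},        # Step 4: script path as input
        "output": {"ref": "ResultOutput"},      # Step 4: output location
        "step": emr_step,                       # caller-supplied EMR step
        "onSuccess": {"ref": "SuccessAlarm"},   # Step 9
        "onFail": {"ref": "FailureAlarm"},      # Step 9
    }
    if pre_step_command:
        # Step 5: the command itself is not spelled out above,
        # so it is only added when the caller provides one.
        activity["preStepCommand"] = pre_step_command

    def alarm(alarm_id, topic_arn, outcome):
        return {"id": alarm_id, "type": "SnsAlarm", "topicArn": topic_arn,
                "role": "DataPipelineDefaultRole",
                "subject": "Python job " + outcome,
                "message": "The scheduled Python job " + outcome + "."}

    objects = [
        {"id": "Default",
         "scheduleType": "cron",
         "schedule": {"ref": "DailySchedule"},
         "pipelineLogUri": log_s3_uri,          # Step 8: logs in S3
         "role": "DataPipelineDefaultRole",
         "resourceRole": "DataPipelineDefaultResourceRole"},
        {"id": "DailySchedule", "type": "Schedule",   # Step 7: run daily
         "period": "1 day", "startDateTime": start_date_time},
        {"id": "ScriptExists", "type": "S3KeyExists",  # Step 7: precondition
         "s3Key": script_s3_path},
        {"id": "ScriptInput", "type": "S3DataNode",
         "filePath": script_s3_path,
         "precondition": {"ref": "ScriptExists"}},
        {"id": "ResultOutput", "type": "S3DataNode",
         "directoryPath": output_s3_path},
        {"id": "Cluster", "type": "EmrCluster",        # Step 6: hardware
         "masterInstanceType": "m3.xlarge",            # placeholder size
         "coreInstanceType": "m3.xlarge",
         "coreInstanceCount": "1"},
        activity,
        alarm("SuccessAlarm", success_topic_arn, "succeeded"),
        alarm("FailureAlarm", failure_topic_arn, "failed"),
    ]
    return {"objects": objects}


definition = build_pipeline_definition(
    script_s3_path="s3://example-bucket/scripts/job.py",      # placeholder
    output_s3_path="s3://example-output-bucket/results/",     # placeholder
    log_s3_uri="s3://example-bucket/logs/",                   # placeholder
    success_topic_arn="arn:aws:sns:us-east-1:111122223333:job-ok",
    failure_topic_arn="arn:aws:sns:us-east-1:111122223333:job-failed",
    emr_step="<your EMR step here>",                          # placeholder
    start_date_time="2015-08-29T02:00:00",
)
print(json.dumps(definition, indent=2))
```

The printed JSON should be usable as a starting point for importing a definition into Datapipeline, but treat it as a sketch and check each field against what your own pipeline shows in the console.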

Finally, have fun and let Datapipeline do the rest of the hard work for you.