Friday, August 28, 2015

AWS Datapipeline and Python Script

I had a task that was supposed to be done quickly. Then I got to know that there is a language that can do certain tasks more efficiently and easily than other languages, and it's called "Python".

Python has lots of powerful libraries to perform tasks quickly, and it works like a scripting language.
Although I had no prior experience with this wonderful language, I thought of giving it a try, and guess what: it is similar to other languages like Java and C#, and it also needs less time to do certain tasks because of the large amount of library support in Python.

Let me tell you, I had a task in which I needed to do some manipulations on AWS resources and save the result in other AWS resources, and Python has a powerful library called Boto which is very easy to work with. Have a look at the boto library here.
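To give a feel for the kind of manipulation I mean, here is a minimal sketch that reads an object from one S3 bucket, transforms it, and saves the result to another bucket. It uses boto3 (the current incarnation of Boto), and the bucket names, keys, and the process() helper are placeholders made up for illustration, not the actual resources from my task:

import boto3

# Placeholder resource names; substitute your own buckets and keys.
SOURCE_BUCKET = "my-source-bucket"
SOURCE_KEY = "input/data.csv"
RESULT_BUCKET = "my-result-bucket"
RESULT_KEY = "output/result.csv"

def process(text):
    # Stand-in for whatever manipulation your task actually needs.
    return text.upper()

def main():
    s3 = boto3.client("s3")

    # Read the source object from S3.
    obj = s3.get_object(Bucket=SOURCE_BUCKET, Key=SOURCE_KEY)
    body = obj["Body"].read().decode("utf-8")

    # Do the manipulation and save the result to the other S3 location.
    result = process(body)
    s3.put_object(Bucket=RESULT_BUCKET, Key=RESULT_KEY, Body=result.encode("utf-8"))

if __name__ == "__main__":
    main()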

I was able to complete my work quickly using Python and created a Python script. Then a requirement came up to schedule this script so that it runs daily at a certain time and performs its task.

As we were already using AWS resources heavily for our work, it was not a difficult decision to choose AWS Data Pipeline to do this work for us using EMR clusters.

So now I had all of the resources: my script was ready, and I could also schedule it using AWS Data Pipeline. But a question popped up in my mind: can I actually schedule a Python script using Data Pipeline or not?
FYI, I was also new to Data Pipeline.

I decided to research it, and after a lot of effort, searching on the internet ;) and various hit-and-trial runs with Data Pipeline options, I was successfully able to schedule my Python script, which uses the boto library, on AWS Data Pipeline.


So, here are the steps to schedule a Python script on AWS Data Pipeline, so that it will be easy for you guys (a boto3 sketch of the same setup follows this list):
Step 1: Have your Python script ready.
Step 2: Have your AWS account and open the console.
Step 3: Choose Data Pipeline and start creating a pipeline.
Step 4: Choose EmrActivity as the activity, provide the S3 path of your script as the "input", and provide another S3 bucket location as the output path.
Step 5: In order to run Python from the EMR cluster, you need to add "preStepCommand" : "" .
Step 6: Choose an EMR cluster and the desired hardware configuration.
Step 7: Schedule your job; you can also add preconditions so that Data Pipeline checks that they are fulfilled before each run.
Step 8: Set up logs in your S3 logs directory so that you can check for problems in your job and debug issues using those logs.
Step 9: Set up SNS topics and subscribe to them for job-completion and job-failure notifications.
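For anyone who prefers to set this up from code instead of the console, here is a rough sketch of the same idea using boto3's Data Pipeline client. The boto3 calls (create_pipeline, put_pipeline_definition, activate_pipeline) are real, but the object names, S3 paths, instance types, schedule, and the preStepCommand/step placeholders are illustrative assumptions rather than the exact definition I used, and required details such as IAM roles are omitted:

import boto3

dp = boto3.client("datapipeline")

# Create an empty pipeline; the name and uniqueId are placeholders.
pipeline = dp.create_pipeline(name="python-script-daily", uniqueId="python-script-daily-001")
pipeline_id = pipeline["pipelineId"]

# Minimal definition: a daily schedule, an EMR cluster, and an EmrActivity.
# Field values below are illustrative; IAM roles and other required fields are omitted.
objects = [
    {"id": "Default", "name": "Default", "fields": [
        {"key": "scheduleType", "stringValue": "cron"},
        {"key": "schedule", "refValue": "DailySchedule"},
        {"key": "pipelineLogUri", "stringValue": "s3://my-bucket/datapipeline-logs/"},
    ]},
    {"id": "DailySchedule", "name": "DailySchedule", "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 day"},
        {"key": "startDateTime", "stringValue": "2015-08-29T02:00:00"},
    ]},
    {"id": "MyEmrCluster", "name": "MyEmrCluster", "fields": [
        {"key": "type", "stringValue": "EmrCluster"},
        {"key": "masterInstanceType", "stringValue": "m3.xlarge"},
        {"key": "coreInstanceType", "stringValue": "m3.xlarge"},
        {"key": "coreInstanceCount", "stringValue": "1"},
        {"key": "terminateAfter", "stringValue": "2 Hours"},
    ]},
    {"id": "RunPythonScript", "name": "RunPythonScript", "fields": [
        {"key": "type", "stringValue": "EmrActivity"},
        {"key": "runsOn", "refValue": "MyEmrCluster"},
        # The preStepCommand and step values are placeholders, as in Step 5 above.
        {"key": "preStepCommand", "stringValue": "<command that prepares Python and boto on the cluster>"},
        {"key": "step", "stringValue": "<EMR step that launches s3://my-bucket/scripts/my_script.py>"},
    ]},
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
dp.activate_pipeline(pipelineId=pipeline_id)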

Finally, have fun, and let Data Pipeline do the hard work for you.



Saturday, March 23, 2013

Thread Safety in Application Servers

Application servers need to be multithreaded to handle simultaneous client requests. WCF, ASP.NET, and Web Services applications are implicitly multithreaded; the same holds true for Remoting server applications that use a network channel such as TCP or HTTP. This means that when writing code on the server side, you must consider thread safety if there's any possibility of interaction among the threads processing client requests. Fortunately, such a possibility is rare; a typical server class is either stateless (no fields) or has an activation model that creates a separate object instance for each client or each request. Interaction usually arises only through static fields, sometimes used for caching parts of a database in memory to improve performance.
For example, suppose you have a RetrieveUser method that queries a database:
// User is a custom class with fields for user data
internal User RetrieveUser (int id) { ... }
If this method were called frequently, you could improve performance by caching the results in a static Dictionary. Here's a solution that takes thread safety into account:
static class UserCache
{
  static Dictionary <int, User> _users = new Dictionary <int, User>();
 
  internal static User GetUser (int id)
  {
    User u = null;
 
    lock (_users)
      if (_users.TryGetValue (id, out u))
        return u;
 
    u = RetrieveUser (id);   // Method to retrieve user from database
    lock (_users) _users [id] = u;
    return u;
  }
}
We must, at a minimum, lock around reading and updating the dictionary to ensure thread safety. In this example, we choose a practical compromise between simplicity and performance in locking. Our design actually creates a very small potential for inefficiency: if two threads simultaneously called this method with the same previously unretrieved id, the RetrieveUser method would be called twice, and the dictionary would be updated unnecessarily. Locking once across the whole method would prevent this, but would create a worse inefficiency: the entire cache would be locked up for the duration of calling RetrieveUser, during which time other threads would be blocked from retrieving any user.

If we talk specifically about WCF, then you must read about WCF instance context and concurrency. To get the gist quickly, please read the link, and for better clarification go to MSDN.