We have a large integrated database containing a very diverse range of variables. We identify a master population of people using simple business rules that check whether each person meets some initial parameters, then keep narrowing that population with further logic until we know whether each person meets the given requirement. This logic is not terribly complex.
After the population is identified we create a large list of transactions associated with these people, then apply further business rules to these transactions to determine if the associated people are ‘in’ or ‘out’ of a final population of interest.
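To make the shape of the workload concrete, here is a minimal sketch in plain Python of the staged process described above. Everything in it is an assumption for illustration: the field names (`person_id`, `age`, `region`, `amount`), the rules themselves, and the `min_total` threshold are hypothetical stand-ins, not our actual business rules or schema.

```python
from collections import defaultdict

# Hypothetical sample data; field names and values are illustrative only.
people = [
    {"person_id": 1, "age": 34, "region": "north"},
    {"person_id": 2, "age": 17, "region": "north"},
    {"person_id": 3, "age": 52, "region": "south"},
    {"person_id": 4, "age": 41, "region": "north"},
]
transactions = [
    {"person_id": 1, "amount": 120.0},
    {"person_id": 1, "amount": 30.0},
    {"person_id": 3, "amount": 500.0},
    {"person_id": 4, "amount": 10.0},
]


def master_population(people):
    # Stage 1: a simple initial rule (assumed here: adults only).
    return [p for p in people if p["age"] >= 18]


def narrow_population(population):
    # Stage 2: further narrowing (assumed here: one region only).
    return [p for p in population if p["region"] == "north"]


def classify(population, transactions, min_total=100.0):
    # Stage 3: join transactions to the population by person_id,
    # aggregate per person, then apply a rule to the aggregate.
    ids = {p["person_id"] for p in population}
    totals = defaultdict(float)
    for t in transactions:
        if t["person_id"] in ids:
            totals[t["person_id"]] += t["amount"]
    # People with no transactions get a total of 0.0 and fall 'out'.
    return {
        pid: ("in" if totals[pid] >= min_total else "out")
        for pid in sorted(ids)
    }


result = classify(narrow_population(master_population(people)), transactions)
print(result)
```

Each stage is a row filter, a join on the person key, or a grouped aggregate, which is the kind of work that engines such as Spark or Hive can split across partitions of the data.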
Our current T-SQL-based solution does not perform well, and we’re looking for alternatives in the Hadoop stack, particularly ways to use Hadoop’s parallel processing capabilities to speed up our processing and let us handle larger data problems.
We’re looking for suggestions on which Hadoop or surrounding technologies (including Spark, R integrated with Hadoop, etc.) would help us with this problem and others like it.
We would really appreciate it.