Apologies if this is somewhat abstract. I would like to ask this question because many of you have probably come across a situation like mine.
We need to build a machine learning based application. As part of it, we need to join files containing a large amount of data, around 10 GB in total. We are loading the files into DataFrames and caching them as well.
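For context, here is a simplified sketch of what the pipeline looks like (I am assuming Spark with Scala here; the file paths, column names, and join keys are placeholders, not our real schema):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MLFeatureJoins")
  .getOrCreate()

// Load the input files (roughly 10 GB in total) as DataFrames and cache them.
// Paths and column names are placeholders.
val customers = spark.read.parquet("/data/customers").cache()
val orders    = spark.read.parquet("/data/orders").cache()
val events    = spark.read.parquet("/data/events").cache()

// Several joins like these feed the downstream ML steps.
val joined = customers
  .join(orders, Seq("customer_id"))
  .join(events, Seq("customer_id"))
```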
However, there are quite a few joins that we need to perform. Could you please suggest some optimization techniques you have come across to keep the application performing at an acceptable speed?
Would persisting to disk with the Kryo serializer improve performance? Our workload is mainly join operations; the other operations are not that computationally heavy.
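Concretely, this is the kind of configuration I have in mind (a minimal sketch; the storage level, input paths, and join key are assumptions on my part, and whether this actually helps DataFrame joins is exactly what I am unsure about):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Register Kryo as the serializer for JVM objects that Spark shuffles or caches.
val spark = SparkSession.builder()
  .appName("MLFeatureJoins")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// Placeholder inputs, as in the sketch above.
val customers = spark.read.parquet("/data/customers")
val orders    = spark.read.parquet("/data/orders")

// Persist the intermediate join result to disk rather than keeping it in memory.
val joined = customers.join(orders, Seq("customer_id"))
joined.persist(StorageLevel.DISK_ONLY)
joined.count()  // force the join so the persisted result is materialized
```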