Quantcast
Channel: Hortonworks » All Replies
Viewing all articles
Browse latest Browse all 3435

join performance

$
0
0

There are 3 tables.
client table only having 300 records. I need this table to get client name.
person_detail table is having 180 million records and partition on client_id field. person_detail table is parquet.
emp_cat table has 1000 records. Need to put distinct as this table has dups.

Insert overwrite directory ‘/apps/dev/report_output.txt’
select distinct
report_query1.Client,
report_query1.Name,
report_query1.Age,
report_query1.Address,
Query1.category
from
(select distinct
B.clientid Client,
A.client_name Name,
B.age,
B.address,
B.cat_id
from client A JOIN
(select distinct
client_id client_id,
age age,
address address,
cat_id cat_id
from person_detail
where client_id = ‘5454’
and entry_date > ‘2015-01-01’ and entry_date < ‘2015-03-31’) B
on (A.clientid = B.clientid)
) report_query1
LEFT OUTER JOIN (select distinct client_id, cat_id, category from emp_cat) Query1
on report_query1.cat_id = Query1.cat_id
and report_query1.client_id = Query1.client_id

Question:
I am not able to explain plan for this table. Can someone help.
This query output is coming in 30 mins. How can i improve performance. Can I make all 3 tables parquet.
I have to store data in some file. Is it right way or do we have some better way.
I have fetched data for person_detail table in subquery as this table is partitioned on client id and it will reduce the data for join. Is it right?
What else I can do to improve performance.
I have kept big table in right side for mapside join.


Viewing all articles
Browse latest Browse all 3435

Trending Articles