One mapper for a CSV import – ORC export

I have a small HDP 2.3 cluster (5 nodes) that I’ve set up to get some experience with Hive + Tez. I’m creating one or two tables using what I believe is fairly simple DDL, as follows:

CREATE TABLE SomeTable_csv(valueA int, valueB int, valueC timestamp)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/some/long/path'
TBLPROPERTIES ("skip.header.line.count"="1");

I then load a single file into that table. After that, I try to turn it into a table backed by an ORC file, so I use something similar:

CREATE TABLE SomeTable(valueA int, valueB int, valueC timestamp)
STORED AS ORC
LOCATION '/some/long/path';

INSERT OVERWRITE TABLE SomeTable
 SELECT valueA, valueB, valueC FROM SomeTable_csv;

This apparently produces a Tez job with 1 mapper and 1 reducer and takes hours to run (the input file is about 50 GB). I expected it to make a reasonable attempt to use more mappers, since it’s a simple mapping process, possibly aligning splits to the HDFS block size. Any hints about what I can do to get the process to break the work up into more than one mapper? I can split the input file before importing it into HDFS, but it seems like that’s something the job should be doing itself.
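
For example, would something along these lines with the Tez split-grouping sizes be the right direction? (This assumes the input file isn’t compressed with a non-splittable codec like gzip; the byte values below are just illustrative guesses, not tuned recommendations.)

-- tez.grouping.min-size / tez.grouping.max-size bound how many bytes Tez
-- packs into a single grouped split (i.e. one mapper task) when Hive runs on Tez.
-- The sizes here are placeholder guesses, not known-good values for this cluster.
SET tez.grouping.min-size=134217728;   -- 128 MB lower bound per grouped split
SET tez.grouping.max-size=1073741824;  -- 1 GB upper bound per grouped split

INSERT OVERWRITE TABLE SomeTable
 SELECT valueA, valueB, valueC FROM SomeTable_csv;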

Thanks

