I have a small HDP 2.3 cluster (5 nodes) that I've set up to get some experience with Hive + Tez. I have a couple of tables that I'm creating with what I believe is fairly simple DDL:
CREATE TABLE SomeTable_csv(valueA int, valueB int, valueC timestamp)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',' ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/some/long/path'
TBLPROPERTIES ("skip.header.line.count"="1");
And then I load a single file into that table. I then try to turn it into a table backed by an ORC file, so I do something similar:
CREATE TABLE SomeTable(valueA int, valueB int, valueC timestamp)
STORED AS ORC
LOCATION '/some/long/path';

INSERT OVERWRITE TABLE SomeTable
SELECT valueA, valueB, valueC FROM SomeTable_csv;
This produces a Tez job with 1 mapper and 1 reducer that takes hours to run (the input file is about 50 GB). I expected Hive to make a reasonable attempt to use more mappers, since this is a simple mapping process, possibly aligning splits to the HDFS block size. Any hints on how I can get the job to split the work across more than one mapper? I could split the input file before importing it into HDFS, but it seems like that's something the job itself should be handling.
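From what I've read, Tez groups input splits according to its grouping settings, so something like the following might force more mappers. The byte values below are just examples I picked for illustration, not tuned recommendations:

```sql
-- Tez combines file splits into "grouped splits"; shrinking the group
-- size range should yield more mappers. Values are illustrative only.
SET tez.grouping.min-size=134217728;   -- 128 MB lower bound per grouped split
SET tez.grouping.max-size=1073741824;  -- 1 GB upper bound per grouped split

-- The underlying input format's maximum split size may also matter
-- for plain text files:
SET mapreduce.input.fileinputformat.split.maxsize=268435456;  -- 256 MB
```

Is this the right knob to be turning, or is something else preventing the file from being split at all?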
Thanks