Channel: Hortonworks » All Replies

Sqoop Incremental Import (lastmodified) giving Duplicate rows for updated rows


Hi All,

I’m using:
-Apache Hadoop 2.4.0
-Apache Hive 0.13.1
-Apache Sqoop 1.4.4
-hdp-connector-for-teradata-1.3.2.2.1.5.0-695-distro
-Teradata 15.0.0.8

I’m trying to do an incremental import from Teradata into Hive tables using Sqoop.

From the Sqoop documentation:
An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
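
The selection rule quoted above can be sketched in Python. This is only an illustration of the documentation's wording ("more recent than"), not Sqoop's code: the function name `select_incremental` and the sample rows are hypothetical, and the exact bounds Sqoop generates in its query may differ.

```python
from datetime import datetime

def select_incremental(rows, check_column, last_value):
    """Return rows whose check column is more recent than last_value.

    Assumption: a strict '>' comparison, following the wording of the
    documentation quoted above; Sqoop's real query bounds may differ.
    """
    return [row for row in rows if row[check_column] > last_value]

# Hypothetical source rows: id 1 was last changed before the saved
# last-value, id 2 was changed after it.
source_rows = [
    {"id": 1, "last_modified_col": datetime(2014, 10, 3, 15, 0, 0)},
    {"id": 2, "last_modified_col": datetime(2014, 10, 3, 16, 0, 0)},
]
last_value = datetime(2014, 10, 3, 15, 29, 48)

picked = select_incremental(source_rows, "last_modified_col", last_value)
print([row["id"] for row in picked])  # prints [2]
```

Note that the rule only decides which source rows are fetched; it says nothing about what happens to older copies of those rows already present in the target.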

I followed these steps:

STEP 1: One-time activity
I did a full import of the source table into a Hive table.

STEP 2: One-time activity
Created a Sqoop job for the incremental import:
sqoop job --create incr1 -- import --connection-manager org.apache.sqoop.teradata.TeradataConnManager --connect jdbc:teradata://192.168.199.137/testdb123 --username testdb123 --password testdb123 --table Paper_STAGE --incremental lastmodified --check-column last_modified_col --last-value "2014-10-03 15:29:48.66" --split-by id --hive-table paper_stage --hive-import

STEP 3: Run periodically from a scheduler or Oozie
Execute the Sqoop job for the incremental import whenever I need the updated or newly added rows:
sqoop job --exec incr1

The source table has a unique primary key and a last-modified column that is set to the current timestamp on every update.
Newly added rows are imported correctly, but updated rows show up as duplicates.
Instead of updating the existing row, Sqoop adds a new row with the same id and the new timestamp.
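
The symptom, and the result being asked for, can be made concrete with a small Python sketch. It is illustrative only and does not use Sqoop or Hive: `append_rows`, `merge_latest`, the `title` column, and the sample data are all hypothetical. Appending the incremental rows leaves two copies of an updated id, whereas a merge that keeps the newest row per key leaves one.

```python
from datetime import datetime

def append_rows(target, incoming):
    """Mimic the observed behaviour: incoming rows are simply added."""
    return target + incoming

def merge_latest(target, incoming, key, check_column):
    """Keep one row per key: the one with the newest check-column value."""
    latest = {}
    for row in target + incoming:
        current = latest.get(row[key])
        if current is None or row[check_column] > current[check_column]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])

# Hypothetical state of the Hive table after the full import.
hive_rows = [
    {"id": 1, "title": "old", "last_modified_col": datetime(2014, 10, 3, 15, 0)},
    {"id": 2, "title": "unchanged", "last_modified_col": datetime(2014, 10, 3, 15, 5)},
]
# Hypothetical incremental batch: id 1 was updated, id 3 is new.
delta = [
    {"id": 1, "title": "new", "last_modified_col": datetime(2014, 10, 4, 9, 0)},
    {"id": 3, "title": "added", "last_modified_col": datetime(2014, 10, 4, 9, 30)},
]

appended = append_rows(hive_rows, delta)
merged = merge_latest(hive_rows, delta, "id", "last_modified_col")

print(len(appended))                            # prints 4 (id 1 appears twice)
print([(r["id"], r["title"]) for r in merged])  # prints [(1, 'new'), (2, 'unchanged'), (3, 'added')]
```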

Is this currently not supported in Sqoop?
I ask because I found these:

http://stackoverflow.com/questions/19093417/sqoop-import-lastmodified-gives-duplicate-records-it-doesnt-merger

http://grokbase.com/p/cloudera/cdh-user/13a4n03jrh/sqoop-import-lastmodified-gives-duplicate-records-merger-does-not-happen

https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/xAbXEduvahU

https://issues.cloudera.org/browse/DISTRO-464

Is there a way to avoid the duplicates and get a single merged row for each updated row in the source table?
Please advise on any alternatives for handling this.

Thanks,
-Nirmal

