Sqoop Best Practices

Below are some of the learnings I wanted to jot down for quick reference in the future. These are lessons I learned while working on real-world projects.
Just quick notes of things to remember and to ask yourself:
  • Are we Sqooping from a view or a table? Does your connector support view imports?
  • Set up the correct Sqoop connector:
    • Teradata connector
    • Oracle connector
  • Are you importing directly into Hive? Then remember to use the --hive-drop-import-delims option if your data contains Hive's default delimiters, as these can break your Hive table and give a wrong row count if you use the default options in Hive.
  • Do a sample schema check of the table you are going to import. Look for fields that are good candidates for the split column; generally primary keys are good enough. Verify by random sampling that records are balanced across the split ranges.
  • Decide on the mapper count to use while Sqooping; by default Sqoop uses 4. You can increase it after talking to your DBA, but don't make it too high; keep it around 10 so you do not overwhelm the database.
  • Check the order of the fields Sqoop dumps. This is important for Teradata tables, which can return two different column orders depending on whether you run a plain select * or a select * ordered by column id. If you are dumping directly into Hive this is taken care of, since Sqoop can create the table for you. You can also specify the column order in the Sqoop command itself (the --columns option); how you do it is up to you.
  • After the Sqoop job is done, remember to check the Map output records counter in the job output and match it with the row count in the database. Then, as a second check, match it with the Hive row count if you are importing into Hive. If you are not using Hive, use wc -l to count the number of lines in the output (see the verification sketch below).
  • Run a count query in the source database you imported from and compare it with what you got in Hadoop after the import.
  • Sample 10 records from the source database and do the same from Hadoop. Compare the format of the fields, and check for white space or anything else mismatching between the two.
  • Ask the data users what their preference for null values is. Sqoop provides two parameters, --null-string and --null-non-string, that tell Sqoop what to write for null values; read up on them and remember to use them. If you plan to use the data with Hive and want a Hive-friendly null value, note that Hive uses \N to represent null. Read my blog post on null behavior in Hive and Sqoop for more details. The sample command below shows these options in context.
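
Putting several of the points above together, here is a minimal sketch of an import command. It is an illustration only: the JDBC URL, username, table, split column and Hive table name are all placeholders, and the mapper count is something to agree with your DBA.

$ sqoop import \
      --connect jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL \
      --username etl_user -P \
      --table SALES.ORDERS \
      --split-by ORDER_ID \
      --num-mappers 10 \
      --hive-import \
      --hive-table orders \
      --hive-drop-import-delims \
      --null-string '\\N' \
      --null-non-string '\\N'

Note that '\\N' on the command line becomes \N after shell and Java escaping, which is exactly the null marker Hive expects.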

Most importantly, document everything above. Keep a log of all of these checks and form a Standard Operating Procedure around them for your organization, so that everyone follows the same steps.
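
To make the count checks concrete, here is a rough sketch; the connection details and table names are the same placeholders as in the import command above:

# Row count on the source side, run from the same edge node
$ sqoop eval \
      --connect jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL \
      --username etl_user -P \
      --query "SELECT COUNT(*) FROM SALES.ORDERS"

# Row count on the Hive side after the import
$ hive -e "SELECT COUNT(*) FROM orders"

# For a plain HDFS import (no Hive), count lines in the output files
$ hadoop fs -cat /user/etl/orders/part-* | wc -l

All the numbers should match; if they do not, embedded delimiters and null handling are the usual suspects.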
How about sharing your best practices?
To be updated: I will also add details of each step with examples.

Manually Archive Folders in Outlook 2010

Go to File > Mailbox Cleanup > Archive.

Select the option "Archive all folders and subfolders".

Specify the cutoff date; items older than that date will be archived.

Click OK.

The job will start in the background and you can watch its progress in the Outlook status bar.

Hadoop on AIX

After hours of juggling in my effort to install Hadoop (CDH4) on AIX, I concluded that it cannot be done easily. (I know some other flavors claim they can.)

Sorry if you came here from a Google search hoping to find a solution.

I talked with Cloudera as well, and they said NO :(

My situation was that I wanted to install the Hadoop client on an AIX system so that data could be sent from AIX to the Hadoop cluster using simple Hadoop put commands.

I installed Java (IBM JDK 1.6) on the AIX system.

Downloaded the CDH4 tarball onto the system.

Set the environment variables

And when I ran

$ hadoop fs -ls /

I got this message:

ls: failure to login

At first I thought it was because HDFS was not allowing me to log in to the cluster, so on the Hadoop cluster I created a user with the same name I was logged in with on AIX.

Even after that, I got the same error.

I thought to ask Cloudera, since this client had enterprise support :), so I raised a query with them.

They said the IBM JDK doesn't support some Hadoop-specific login mechanism, and hence the Hadoop client cannot be installed on AIX.

So my next job is to introduce a design change and work out how to get data into the cluster from the AIX machine:

  • WebHDFS
  • HttpFS
  • NFSProxy

These are some of the options I can propose for the client.
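
For reference, here is a rough sketch of the WebHDFS route, which needs nothing on the AIX side beyond curl. The hostnames, port, path and user are placeholders, and WebHDFS has to be enabled on the cluster. Creating a file is a two-step exchange: the namenode answers the first request with a 307 redirect, and the file body then goes to the datanode URL from the Location header.

# Step 1: ask the namenode where to write (returns 307 plus a Location header)
$ curl -i -X PUT \
      "http://namenode.example.com:50070/webhdfs/v1/user/etl/data.csv?op=CREATE&overwrite=true&user.name=etl"

# Step 2: send the file body to the datanode URL from the Location header
$ curl -i -X PUT -T data.csv "<Location URL from step 1>"

HttpFS speaks the same REST API but through a single gateway host, which is often easier to allow through firewalls.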

What are your experiences with installing Hadoop on AIX? Please do share.

Thanks for reading :)

no libsvnjavahl-1 in java.library.path Eclipse SVN error

Fix

Go to Windows > Preferences > Team > SVN.

Under SVN interface, select SVNKit.