Sqoop Best Practices

Below are some of the learning's which i thought to jot down for quick reference in the future also. These are lessons which i learnt while working in real world.
Just quick notes of things to remember and ask yourself
  • Are we sqooping from View or Table ?  Does your connector support view import?
  • Setup correct Sqoop Connector
    • Teradata connector
    • Oracle Connector
  • Are you importing direct to Hive then remember to use –hive-delims-drop option if your data contains default Hive delims , as this may break your Hive Table and give wrong row count if you use default options in Hive
  • Do a sample schema check of the table you are going to import. Check the fields which can be good candidate for split column. Generally primary keys are good enough. See by random sampling that records are balanced.
  • Decide on mappers count to use while Sqooping , default Sqoop use 4. You can increase it after talking to your DBA. Don’t make it too much keep it around 10 to not to overwhelm the database
  • Check the order of fields which Sqoop dumps , this is important for Teradata tables which gives two different order depending upon if your do plain select * or you do select * order by column id. If you are directly dumping to Hive then this would be taken care as Sqoop can create table for you. You can also give column order within the Sqoop command , it all depends on you how you want to do.
  • After Sqoop job is done remember to see Map Output Record count for the Job output. Match it with the count of rows in the database. Then at second instance match it with count of hive count if you are importing to hive. If you are not using hive then use wc function to find number of lines in output
  • Do a query in source database from where you imported to have count of records , compare it with what you have got in Hadoop after import.
  • Sample 10 records from source database , and also do the same from Hadoop. Compare the format of fields. Check for any white spaces or anything mismatching between the two
  • Ask from data users what is the preference for null values. Sqoop provides two parameters which you can use to tell to sqoop what to do for null values --null-string and --null-non-string. Read more about it and remember to use that.. If you are planning to use data with Hive and want to have hive friendly null value then for your reference hive uses \N to represent null. Read my blog post on null behavior for hive and Sqoop for more details.

Most importantly , document all what i have said. Keep a log of all of the things above. Form a Standard Operating Procedure around the steps mentioned above for your organization so that all follow the steps.
How about sharing your best practices ?
To be Updated
Also adding up details of each step with example

No comments:

Post a Comment

Please share your views and comments below.

Thank You.