
Null behaviour in Sqoop and Hive

By default, Sqoop imports null values as the literal string "null".

So any record containing a null value would appear in HDFS like this:

|N|null|Jagat Singh|BigData

The problem with such an import is that we cannot write Hive queries such as "show all records where the column is not null", because Hive does not treat the string "null" as a real null.

To understand why, keep in mind that the default Hive representation for null is

\N

So for Hive to treat a value as null, it should be imported as \N.

The corresponding Sqoop import to make this work would be

sqoop import  …--null-string '\\N' --null-non-string '\\N'

Note that the \N parameter has to be escaped properly on the command line.


Note
\N (a backslash followed by a capital N), not to be confused with \n, which is only a single newline character.
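
As an illustrative sketch (the database, table and column names below are made up, and it assumes a Hive table has been defined over the imported directory), an import that keeps nulls Hive-friendly, followed by a Hive query that now behaves as expected, could look like this:

sqoop import --connect jdbc:mysql://localhost:3306/testdb --username dbuser -P \
  --table employees \
  --null-string '\\N' --null-non-string '\\N' \
  --target-dir /user/hadoop/employees

-- in Hive, \N is read back as a real null, so this filter now works
SELECT * FROM employees WHERE department IS NOT NULL;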

Hope this helps

Sqoop Best Practices

Below are some of the learnings I thought I would jot down for quick reference in the future. These are lessons I learnt while working on real-world projects.
Just quick notes of things to remember and ask yourself:
  • Are we Sqooping from a view or a table? Does your connector support importing from views?
  • Set up the correct Sqoop connector
    • Teradata connector
    • Oracle connector
  • If you are importing directly into Hive, remember to use the --hive-drop-import-delims option if your data contains Hive's default delimiters, as these can break your Hive table and give a wrong row count if you use the default options in Hive.
  • Do a sample schema check of the table you are going to import. Check which fields would be good candidates for the split column; generally the primary key is good enough. Verify by random sampling that the records are balanced across it.
  • Decide on the number of mappers to use while Sqooping; by default Sqoop uses 4. You can increase it after talking to your DBA, but don't make it too high: keep it around 10 so you do not overwhelm the database.
  • Check the order of fields that Sqoop dumps. This is important for Teradata tables, which can return two different orders depending on whether you do a plain select * or a select * order by column id. If you are importing directly into Hive this is taken care of, since Sqoop can create the table for you. You can also specify the column order within the Sqoop command; it is up to you how you want to do it.
  • After the Sqoop job is done, remember to look at the Map Output Records count for the job. Match it with the row count in the database. Then, as a second check, match it with the Hive count if you are importing into Hive. If you are not using Hive, use wc -l to find the number of lines in the output.
  • Run a count query in the source database you imported from and compare it with what you got in Hadoop after the import.
  • Sample 10 records from the source database and do the same from Hadoop. Compare the format of the fields and check for any whitespace or other mismatches between the two.
  • Ask the data users what their preference is for null values. Sqoop provides two parameters, --null-string and --null-non-string, that tell Sqoop what to write for null values; read up on them and remember to use them. If you plan to use the data with Hive and want Hive-friendly nulls, note that Hive uses \N to represent null (pulled together with the other points in the example command after this list). Read my blog post on null behaviour in Hive and Sqoop for more details.
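
Pulling a few of these points together, a sketch of what such an import could look like (the connection string, table and split column here are hypothetical):

sqoop import \
  --connect jdbc:oracle:thin:@//db.example.com:1521/ORCL \
  --username etl_user -P \
  --table CUSTOMER \
  --split-by CUSTOMER_ID \
  --num-mappers 8 \
  --hive-import --hive-drop-import-delims \
  --null-string '\\N' --null-non-string '\\N'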

Most importantly, document all of the above. Keep a log of all of these things and form a Standard Operating Procedure around the steps mentioned above for your organisation, so that everyone follows them.
How about sharing your best practices?
To be updated
I will also add details of each step with examples.

Sqoop2 Build and Installation

The steps to build and set up a Sqoop2 installation are documented at https://cwiki.apache.org/SQOOP/sqoop2-manual-setup.html

If you want a one-click shell script to do the above, you can copy the following code.

Just save the code below in a shell script file, say

sqoop2_setup.sh

chmod 777 sqoop2_setup.sh

Change the workspace_dir path below depending on where you want the Sqoop2 setup to take place.

#!/bin/sh

# Change the path for workspace directory below
# And run the script
# After its complete you can start sqoop2 by going to http://localhost:8080/sqoop/
# For more details see https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Manual+Setup
workspace_dir=/home/hadoop/sqoop

# Create the workspace directory if it does not already exist, then move into it
mkdir -p $workspace_dir
cd $workspace_dir
svn co https://svn.apache.org/repos/asf/sqoop/branches/sqoop2
cd sqoop2
mvn install

cd ..
mkdir s2test
cd s2test

mkdir bin conf logs repository

cd ..
wget http://apache.mirrors.hoobly.com/tomcat/tomcat-7/v7.0.27/bin/apache-tomcat-7.0.27.tar.gz
tar zxvf apache-tomcat-7.0.27.tar.gz
rm apache-tomcat-7.0.27.tar.gz
mv apache-tomcat-7.0.27 s2test/server

# Paths below are relative to $workspace_dir; s2test was created earlier
echo "export JAVA_OPTS=-Dsqoop.config.dir=$workspace_dir/s2test/conf" >> s2test/server/bin/setenv.sh
chmod +x s2test/server/bin/setenv.sh

echo "# File s2test/conf/sqoop_bootstrap.properties
sqoop.config.provider=org.apache.sqoop.core.PropertiesConfigurationProvider" >> s2test/conf/sqoop_bootstrap.properties

echo "
# File s2test/conf/sqoop.properties
# Log4J system
org.apache.sqoop.log4j.appender.file=org.apache.log4j.RollingFileAppender
org.apache.sqoop.log4j.appender.file.File=$workspace_dir/s2test/logs/sqoop.log
org.apache.sqoop.log4j.appender.file.MaxFileSize=25MB
org.apache.sqoop.log4j.appender.file.MaxBackupIndex=5
org.apache.sqoop.log4j.appender.file.layout=org.apache.log4j.PatternLayout
org.apache.sqoop.log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} [%l] %m%n

org.apache.sqoop.log4j.debug=true
org.apache.sqoop.log4j.rootCategory=WARN, file
org.apache.sqoop.log4j.category.org.apache.sqoop=DEBUG
org.apache.sqoop.log4j.category.org.apache.derby=INFO

# Repository
org.apache.sqoop.repository.provider=org.apache.sqoop.repository.JdbcRepositoryProvider
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.derby.DerbyRepositoryHandler
org.apache.sqoop.repository.jdbc.transaction.isolation=READ_COMMITTED
org.apache.sqoop.repository.jdbc.maximum.connections=10
org.apache.sqoop.repository.jdbc.url=jdbc:derby:$workspace_dir/s2test/repository/db;create=true
org.apache.sqoop.repository.jdbc.create.schema=true
org.apache.sqoop.repository.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
org.apache.sqoop.repository.jdbc.user=sa
org.apache.sqoop.repository.jdbc.password=
org.apache.sqoop.repository.sysprop.derby.stream.error.file=$workspace_dir/s2test/logs/derbyrepo.log"  >> s2test/conf/sqoop.properties

cp $workspace_dir/sqoop2/server/target/sqoop.war $workspace_dir/s2test/server/webapps/
rm -rf $workspace_dir/s2test/server/webapps/sqoop2/
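
Once the script finishes, start the bundled Tomcat (which now hosts the Sqoop webapp) and open the URL mentioned in the script header. A sketch, assuming the same workspace_dir as above:

cd /home/hadoop/sqoop/s2test/server/bin
./startup.sh
# then browse to http://localhost:8080/sqoop/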

Sqoop2 Architecture and current status

Sqoop is a tool used to transfer data between RDBMS systems and Hadoop HDFS.

Sqoop is undergoing a major architectural change, with discussions going on around the Sqoop2 feature proposals.

A very good overview of what changes are needed in Sqoop, and why, is documented in the proposal on the Apache wiki below.

https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop

You can also go through the Sqoop2 presentation.

Current Status

The design discussions are being tracked in Sqoop JIRA 365.

The summary and goals for the Sqoop2 architecture have been documented on the Apache Sqoop wiki:

https://cwiki.apache.org/SQOOP/sqoop-2.html

Weekly meetings are being organized to discuss the progress of the work

https://cwiki.apache.org/SQOOP/sqoop2-weekly-meeting-minutes.html

The list of JIRAs against the task division is documented below

https://cwiki.apache.org/SQOOP/sqoop-2-jiras.html

If you want to build and set up a Sqoop2 installation, you can also read the following:

http://jugnu-life.blogspot.in/2012/06/sqoop2-build-and-installation.html

Sample data for practice with Hadoop

Go to http://www.infochimps.com/datasets

Filter by free data sets.

Filter by downloadable data only.

Choose the data type that is of interest to you.

Happy Hadooping :)

 

Oracle Date mapped to TimeStamp while importing with Sqoop


The current version of Sqoop (1.4.1) maps the Oracle DATE type to Timestamp, because the Oracle JDBC driver does this. Read the discussion below:

http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-faq-090281.html#08_01

How to solve this

While importing with Sqoop, pass the driver-specific argument as in the example below:

$ sqoop import -Doracle.jdbc.mapDateToTimestamp=false --connect jdbc:oracle:thin:@//db.example.com/foo --table bar

Setting the mapDateToTimestamp property to false makes the driver revert to the default 9i/10g behaviour and map DATE to Date.

org.apache.hbase#hbase;0.92.0-SNAPSHOT: not found

If you are using Sqoop 1.4.1 and you try to build it, you can get an error such as

org.apache.hbase#hbase;0.92.0-SNAPSHOT: not found

This is because HBase 0.92.0 has since been released, so the snapshot dependency can no longer be resolved.

Just make the changes shown in the diff below in build.xml and run the build again:

https://reviews.apache.org/r/4169/diff/

Sqoop free form query example

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --target-dir importOnlyEmpName -e 'Select Name from Employee_Table where $CONDITIONS' --m 1

The free-form query is given after -e or --query.

We can write the query in single quotes or double quotes. Just read the notes below from the official documentation.

sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --target-dir importOnlyEmpName -e "Select Name from Employee_Table where (employee_Name='David' OR Salary>'2000') AND \$CONDITIONS" --m 1

Example of Sqoop free form query with where clause

The above query selects just Name from the table Employee_Table, which has other columns besides Name.
Importance of $CONDITIONS in free form query
It is worth noting the importance of $CONDITIONS in a free-form query (this thread explains it well; the information below comes from there).
If you run a parallel import, the map tasks will execute your query with different values substituted in for $CONDITIONS. e.g., one mapper may execute "select bla from foo WHERE (id >= 0 AND id < 10000)", and the next mapper may execute "select bla from foo WHERE (id >= 10000 AND id < 20000)" and so on.
Sqoop does not parse your SQL statement into an abstract syntax tree, which would allow it to modify your query without textual hints. You are free to add further constraints (read the thread), but the literal string "$CONDITIONS" does need to appear in the WHERE clause of your query so that Sqoop can textually replace it with its own refined constraints.
Setting -m 1 is the only way to force a non-parallel import. You still need $CONDITIONS in the query because Sqoop queries the database for column type information etc. in the client before executing the import job, but does not want actual rows returned to the client. So it will execute your query with $CONDITIONS set to '1 = 0' to ensure that it receives type information but no records.
Notes from Sqoop documentation
If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"
The facility of using free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries such as queries that have sub-queries or joins leading to ambiguous projections can lead to unexpected results.
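
As an illustrative sketch (the id and Salary columns are hypothetical), a parallel free-form import needs both $CONDITIONS and an explicit --split-by, for example:

sqoop import --connect jdbc:mysql://localhost:3306/sqoop \
  --username root -P \
  --query 'SELECT e.Name, e.Salary FROM Employee_Table e WHERE $CONDITIONS' \
  --split-by e.id \
  --target-dir employeeNamesParallel \
  -m 4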

Sqoop --target-dir example

Example of importing into a target directory in HDFS using Sqoop.

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --m 1

The above command will import the data present in Employee_Table in the sqoop database into an HDFS directory named employeeImportAll.


After the import is done we can check that the data is present.

Just see the output of each of the 3 commands one by one.

hadoop@jj-VirtualBox:~$ hadoop fs -ls
hadoop@jj-VirtualBox:~$ hadoop fs -ls /user/hadoop/employeeImportAll
hadoop@jj-VirtualBox:~$ hadoop fs -cat /user/hadoop/employeeImportAll/part-m-00000


All the results are present as comma-separated files.

ERROR tool.ImportTool: Error during import: No primary key could be found for table

12/03/05 23:44:31 ERROR tool.ImportTool: Error during import: No primary key could be found for table Employee_Table. Please specify one with --split-by or perform a sequential import with '-m 1'.

Sample query on which I got this error:

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll

Explanation

While performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column for this. By default Sqoop will identify the primary key column (if present) in a table and use it as the splitting column.

The low and high values of the splitting column are retrieved from the database, and the map tasks operate on evenly sized components of the total range.

For example , if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.

Solution

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --m 1

Just add --m 1; it tells Sqoop to do a sequential import with a single mapper.

Another solution is to tell Sqoop to use a particular column as the split column:

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --split-by columnName

 

Setting up development environment for Sqoop

The post on the official Sqoop wiki explains well the process of setting up a development environment for Sqoop.

I am using the following:

Ubuntu 11.10
Eclipse (downloaded)
Ant 1.8 (already on my system)
Subclipse (SVN plugin for Eclipse)
Make, already present in Ubuntu
Asciidoc, which I downloaded from the software repository in Ubuntu, easy part :)
Java 1.6, already there on my system

All set :)

Sqoop import with where clause

If you are following on from the previous Sqoop import tutorial http://jugnu-life.blogspot.in/2012/03/sqoop-import-tutorial.html, let's try a conditional import from the RDBMS with Sqoop.

$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret -m 1

The Sqoop command above would import all the rows present in the table Customer.

Let's say the Customer table looks something like this:

CustomerName    DateOfJoining
Adam            2012-12-12
John            2002-1-3
Emma            2011-1-3
Tina            2009-3-8

Now let's say we want to import only those customers who joined after 2005-1-1.

We can modify the Sqoop import as:

$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret --where "DateOfJoining > '2005-1-1' "

This would import only 3 records from above table.
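
To verify, you can cat the imported files. Assuming the default target directory (a directory named after the table under your HDFS home directory, since no --target-dir was given), something like:

hadoop fs -cat Customer/part-m-*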

Happy sqooping :)

Sqoop installation tutorial

Sqoop is a tool used to import/export data between RDBMS systems and HDFS.

It can be downloaded from the Apache website. As of writing this post Sqoop is an Apache Incubator project, but it should become a full top-level project in the near future.

Sqoop is a client tool; you are not required to install it on all nodes of the cluster. The best practice is to install it just on a client (or edge node) of the cluster. In case you are worried about traffic between the machine where you install Sqoop and the database, note that the data transfer happens directly between the cluster and the database.
 
Installation steps

You can download the latest version of Sqoop from the Apache website:
http://sqoop.apache.org/

The installation is fairly simple to start off with for development purposes.

Download the latest Sqoop binary file.

Extract it into some folder.

Set SQOOP_HOME and add Sqoop to the PATH variable so that we can run the sqoop commands directly.

For example, I downloaded Sqoop into the following directory and my environment variables look like this:

export SQOOP_HOME="/home/hadoop/software/sqoop-1.4.3"
export PATH=$PATH:$SQOOP_HOME/bin
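
To make these settings persist across sessions you could append them to your shell profile (the path is just my example location) and then verify the installation:

echo 'export SQOOP_HOME=/home/hadoop/software/sqoop-1.4.3' >> ~/.bashrc
echo 'export PATH=$PATH:$SQOOP_HOME/bin' >> ~/.bashrc
source ~/.bashrc
sqoop version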

Sqoop can connect to various types of databases.

For example it can talk to MySQL, Oracle and Postgres databases. It uses JDBC to connect to them, so it needs the JDBC driver for each database it talks to.

The JDBC driver jar for each database can be downloaded from the web; for example, the MySQL jar is available at the link below.

Download the MySQL Connector/J jar and store it in the lib directory under the Sqoop home folder.

That's it.

Just test your installation by typing

$ sqoop help

You should see the list of Sqoop commands along with their usage.

Happy sqooping :)

Sqoop import tutorial

This tutorial explains how to use Sqoop to import data from an RDBMS into HDFS. The tutorial is divided into multiple posts to cover the various functionalities offered by Sqoop import.

The general syntax for import is

$ sqoop-import (generic-args) (import-args)

Argument                              Description
--connect <jdbc-uri>                  Specify JDBC connect string
--connection-manager <class-name>     Specify connection manager class to use
--driver <class-name>                 Manually specify JDBC driver class to use
--hadoop-home <dir>                   Override $HADOOP_HOME
--help                                Print usage instructions
-P                                    Read password from console
--password <password>                 Set authentication password
--username <username>                 Set authentication username
--verbose                             Print more information while working
--connection-param-file <filename>    Optional properties file that provides connection parameters

Example run

$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret -m 1


When we run this Sqoop command, it will try to connect to the MySQL database named CompanyDatabase with username root, password mysecret, and with one map task.


Generally it is not recommended to give the password on the command line; instead it is advisable to use the -P parameter, which makes Sqoop prompt for the password on the console.
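
The same import, rewritten to prompt for the password instead of exposing it on the command line:

$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root -P -m 1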


One more thing we should notice is the use of localhost as the database address. If you are running your Hadoop cluster in distributed mode, then you should give the full hostname or IP of the database, since the map tasks run on other machines where localhost would point to the wrong host.
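
For example, a connect string using a resolvable hostname (db.example.com and the port are placeholders) would look like:

jdbc:mysql://db.example.com:3306/CompanyDatabase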

Sqoop Got error creating database manager

Error 1

12/02/28 18:47:04 ERROR tool.BaseSqoopTool: Got error creating database manager: java.io.IOException: No manager for connect string: jdbc.mysql://192.168.0.13/hadoopsqoops
at com.cloudera.sqoop.ConnFactory.getManager(ConnFactory.java:119)
at com.cloudera.sqoop.tool.BaseSqoopTool.init(BaseSqoopTool.java:189)
at com.cloudera.sqoop.tool.ImportTool.init(ImportTool.java:81)
at com.cloudera.sqoop.tool.ImportTool.run(ImportTool.java:411)
at com.cloudera.sqoop.Sqoop.run(Sqoop.java:144)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.cloudera.sqoop.Sqoop.runSqoop(Sqoop.java:180)
at com.cloudera.sqoop.Sqoop.runTool(Sqoop.java:219)
at com.cloudera.sqoop.Sqoop.runTool(Sqoop.java:228)
at com.cloudera.sqoop.Sqoop.main(Sqoop.java:237)


The cause here is a malformed connect string: it uses jdbc.mysql instead of jdbc:mysql, so Sqoop cannot find a matching connection manager. See also:

http://stackoverflow.com/questions/8455559/got-error-creating-database-manager-error-in-sqoop-import-query

The corrected command:

sqoop import --connect jdbc:mysql://192.168.0.13:3306/hadoopSqoop --username root --password root --table widgets -m 1






Error 3

Host '192.168.0.100' is not allowed to connect to this MySQL server

This is purely a MySQL error related to permissions.
Grant the user permission to connect from the other host:
http://groups.google.com/group/mailing.database.mysql/browse_frm/thread/759d9fff020fce27
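
A sketch of the usual fix, run in the MySQL console on the database server (the user and password below are placeholders):

GRANT ALL PRIVILEGES ON hadoopSqoop.* TO 'root'@'192.168.0.100' IDENTIFIED BY 'yourpassword';
FLUSH PRIVILEGES;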


Error 4



Error 5




Error 6

12/02/28 20:26:37 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-hadoop/compile/6d62f2246e8b34f9a6fb09aea4bbfe6c/widgets.java to /home/hadoop/./widgets.java
java.io.IOException: Destination '/home/hadoop/./widgets.java' already exists

hadoop@localhost's password:
localhost: Permission denied (publickey,password).
hadoop@jj-VirtualBox:~$
hadoop@jj-VirtualBox:~$ ^C
hadoop@jj-VirtualBox:~$ chmod go-w $HOME $HOME/.ssh
hadoop@jj-VirtualBox:~$ chmod 600 $HOME/.ssh/authorized_keys
hadoop@jj-VirtualBox:~$ chown `whoami` $HOME/.ssh/authorized_keys
hadoop@jj-VirtualBox:~$ start-dfs.sh
Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-namenode-jj-VirtualBox.out
localhost: starting datanode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-datanode-jj-VirtualBox.out
localhost: starting secondarynamenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-secondarynamenode-jj-VirtualBox.out
hadoop@jj-VirtualBox:~$ start-mapred.sh
Warning: $HADOOP_HOME is deprecated.

starting jobtracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-jobtracker-jj-VirtualBox.out
localhost: starting tasktracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-tasktracker-jj-VirtualBox.out
hadoop@jj-VirtualBox:~$ sqoop import --connect jdbc:mysql://192.168.0.13:3306/hadoopSqoop --username root --password root --table widgets -m 1
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.

12/02/28 20:26:34 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
12/02/28 20:26:34 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
12/02/28 20:26:34 INFO tool.CodeGenTool: Beginning code generation
12/02/28 20:26:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `widgets` AS t LIMIT 1
12/02/28 20:26:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `widgets` AS t LIMIT 1
12/02/28 20:26:34 INFO orm.CompilationManager: HADOOP_HOME is /home/hadoop/software/hadoop-1.0.1/libexec/..
12/02/28 20:26:37 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-hadoop/compile/6d62f2246e8b34f9a6fb09aea4bbfe6c/widgets.java to /home/hadoop/./widgets.java
java.io.IOException: Destination '/home/hadoop/./widgets.java' already exists
at org.apache.commons.io.FileUtils.moveFile(FileUtils.java:1811)
at com.cloudera.sqoop.orm.CompilationManager.compile(CompilationManager.java:227)
at com.cloudera.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:83)
at com.cloudera.sqoop.tool.ImportTool.importTable(ImportTool.java:337)
at com.cloudera.sqoop.tool.ImportTool.run(ImportTool.java:423)
at com.cloudera.sqoop.Sqoop.run(Sqoop.java:144)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.cloudera.sqoop.Sqoop.runSqoop(Sqoop.java:180)
at com.cloudera.sqoop.Sqoop.runTool(Sqoop.java:219)
at com.cloudera.sqoop.Sqoop.runTool(Sqoop.java:228)
at com.cloudera.sqoop.Sqoop.main(Sqoop.java:237)
12/02/28 20:26:37 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/6d62f2246e8b34f9a6fb09aea4bbfe6c/widgets.jar
12/02/28 20:26:37 WARN manager.MySQLManager: It looks like you are importing from mysql.
12/02/28 20:26:37 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
12/02/28 20:26:37 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
12/02/28 20:26:37 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
12/02/28 20:26:37 INFO mapreduce.ImportJobBase: Beginning import of widgets
12/02/28 20:26:37 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `widgets` AS t LIMIT 1
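
The root cause in the log above is that a widgets.java generated by an earlier run is still sitting in the home directory, so the newly generated file cannot be moved there. One way to deal with it is to remove the stale file, or point code generation at a scratch directory on future runs (the paths below follow the log above):

rm /home/hadoop/widgets.java
# or keep generated sources out of the home directory on future runs
sqoop import --connect jdbc:mysql://192.168.0.13:3306/hadoopSqoop --username root -P --table widgets -m 1 --outdir /tmp/sqoop-gen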

Sqoop Call to localhost/127.0.0.1:8021 failed

Encountered IOException running import job: java.net.ConnectException: Call to localhost/127.0.0.1:8021 failed on connection exception: java.net.ConnectException: Connection refused

This simply indicates that Hadoop is not running (the JobTracker listens on port 8021 here); just check your Hadoop services.
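
A quick way to check, assuming a Hadoop 1.x single-node setup like the one used elsewhere in these posts:

jps                # NameNode, DataNode, JobTracker and TaskTracker should all be listed
start-dfs.sh       # start the HDFS daemons if they are missing
start-mapred.sh    # start the JobTracker and TaskTracker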

Could not load db driver class: com.mysql.jdbc.Driver Sqoop

While working with Sqoop I got this error:

java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver

Solution

The MySQL JDBC driver jar is missing. Download the MySQL Connector/J jar and save it into the sqoop/lib folder.

You can download it from http://www.mysql.com/downloads/connector/j/
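
For example (the version number below is just a placeholder, use whichever release you downloaded):

tar zxvf mysql-connector-java-5.1.25.tar.gz
cp mysql-connector-java-5.1.25/mysql-connector-java-5.1.25-bin.jar $SQOOP_HOME/lib/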

More details in the solution post:
https://groups.google.com/a/cloudera.org/group/sqoop-user/browse_thread/thread/a7075a8557c00007?pli=1

The MySQL JDBC Driver was removed from Sqoop distribution in order to ensure that the default distribution is fully Apache license compliant.