Note
\N (a backslash followed by a capital N) is not to be confused with \n, which is a single newline character.
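For context, this matters when overriding Sqoop's null substitution strings on the command line. A minimal sketch (the connection details and table name are placeholders), where the doubled backslash keeps \N intact through shell and property parsing:
$ sqoop import --connect jdbc:mysql://localhost:3306/sqoop --table Employee_Table --username root -P --null-string '\\N' --null-non-string '\\N'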
The steps to build and set up a Sqoop2 installation are documented at https://cwiki.apache.org/SQOOP/sqoop2-manual-setup.html
If you want a single shell script that performs all of the above steps, you can copy the following code.
Just save the code below in a shell script file, say
sqoop2_setup.sh
chmod +x sqoop2_setup.sh
Change the workspace_dir path in the script depending on where you want the Sqoop2 setup to take place.
#!/bin/sh
# Change the path for workspace directory below
# And run the script
# After its complete you can start sqoop2 by going to http://localhost:8080/sqoop/
# For more details see https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Manual+Setup
workspace_dir=/home/hadoop/sqoop
# Create the workspace if it does not already exist, then switch to it
mkdir -p $workspace_dir
cd $workspace_dir || exit 1
svn co https://svn.apache.org/repos/asf/sqoop/branches/sqoop2
cd sqoop2
# Build Sqoop2 from source
mvn install
cd ..
mkdir s2test
cd s2test
mkdir bin conf logs repository
cd ..
# Download Tomcat, which will host the Sqoop2 web application
wget http://apache.mirrors.hoobly.com/tomcat/tomcat-7/v7.0.27/bin/apache-tomcat-7.0.27.tar.gz
tar zxvf apache-tomcat-7.0.27.tar.gz
rm apache-tomcat-7.0.27.tar.gz
mv apache-tomcat-7.0.27 s2test/server
# Working directory is s2test created earlier
echo "export JAVA_OPTS=-Dsqoop.config.dir=$workspace_dir/s2test/conf" >> s2test/server/bin/setenv.sh
chmod +x s2test/server/bin/setenv.sh
echo "# File s2test/conf/sqoop_bootstrap.properties
sqoop.config.provider=org.apache.sqoop.core.PropertiesConfigurationProvider" >> s2test/conf/sqoop_bootstrap.properties
echo "
# File s2test/conf/sqoop.properties
# Log4J system
org.apache.sqoop.log4j.appender.file=org.apache.log4j.RollingFileAppender
org.apache.sqoop.log4j.appender.file.File=$workspace_dir/s2test/logs/sqoop.log
org.apache.sqoop.log4j.appender.file.MaxFileSize=25MB
org.apache.sqoop.log4j.appender.file.MaxBackupIndex=5
org.apache.sqoop.log4j.appender.file.layout=org.apache.log4j.PatternLayout
org.apache.sqoop.log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} [%l] %m%n
org.apache.sqoop.log4j.debug=true
org.apache.sqoop.log4j.rootCategory=WARN, file
org.apache.sqoop.log4j.category.org.apache.sqoop=DEBUG
org.apache.sqoop.log4j.category.org.apache.derby=INFO
# Repository
org.apache.sqoop.repository.provider=org.apache.sqoop.repository.JdbcRepositoryProvider
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.derby.DerbyRepositoryHandler
org.apache.sqoop.repository.jdbc.transaction.isolation=READ_COMMITTED
org.apache.sqoop.repository.jdbc.maximum.connections=10
org.apache.sqoop.repository.jdbc.url=jdbc:derby:$workspace_dir/s2test/repository/db;create=true
org.apache.sqoop.repository.jdbc.create.schema=true
org.apache.sqoop.repository.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
org.apache.sqoop.repository.jdbc.user=sa
org.apache.sqoop.repository.jdbc.password=
org.apache.sqoop.repository.sysprop.derby.stream.error.file=$workspace_dir/s2test/logs/derbyrepo.log" >> s2test/conf/sqoop.properties
cp $workspace_dir/sqoop2/server/target/sqoop.war $workspace_dir/s2test/server/webapps/
# Remove any previously exploded copy of the webapp so Tomcat redeploys the fresh war
rm -rf $workspace_dir/s2test/server/webapps/sqoop/
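Once the script finishes, you can start the bundled Tomcat and check that the Sqoop2 web application responds (paths assume the workspace layout created above):
$ $workspace_dir/s2test/server/bin/startup.sh
$ curl http://localhost:8080/sqoop/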
Sqoop is a tool used to transfer data between RDBMS systems and Hadoop HDFS.
Sqoop is undergoing a major architectural change, with discussions going on around the Sqoop2 feature proposals.
A very good overview of what changes are needed in Sqoop, and why, is documented in the proposal on the Apache blog below.
https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop
You can also go through the Sqoop2 presentation.
Current Status
The design discussions are being tracked in Sqoop JIRA 365.
The summary and goals for the Sqoop2 architecture have been documented at the Apache Sqoop wiki:
https://cwiki.apache.org/SQOOP/sqoop-2.html
Weekly meetings are being organized to discuss the progress of the work:
https://cwiki.apache.org/SQOOP/sqoop2-weekly-meeting-minutes.html
The list of JIRAs against the task division is documented below:
https://cwiki.apache.org/SQOOP/sqoop-2-jiras.html
If you want to build and set up a Sqoop2 installation, you can also read the following:
http://jugnu-life.blogspot.in/2012/06/sqoop2-build-and-installation.html
Go to http://www.infochimps.com/datasets
Filter by free data sets.
Filter by downloadable data only.
Choose the data type that is of interest to you.
Happy Hadooping :)
If you are using sqoop 1.4.1 and you try to build it, you may get an error like:
org.apache.hbase#hbase;0.92.0-SNAPSHOT: not found
This is because HBase 0.92.0 has since been released, so the SNAPSHOT artifact is no longer available.
Just make the changes shown in the review below in build.xml and run the build again:
https://reviews.apache.org/r/4169/diff/
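If you prefer not to edit the file by hand, the change amounts to pointing the build at the released HBase artifact instead of the SNAPSHOT. A rough (hypothetical) one-liner, though the exact file and version string may differ in your checkout, so check the linked diff first:
$ sed -i 's/0.92.0-SNAPSHOT/0.92.0/' build.xml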
When a free-form query is wrapped in double quotes, use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS". Sqoop replaces this token with its own bounding conditions in the WHERE clause of each parallel task. Use of complex queries, such as queries that have sub-queries or joins leading to ambiguous projections, can lead to unexpected results.
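For example, a free-form query import might look like the following sketch (the split column and target directory here are illustrative; with --query you must supply --target-dir, plus either --split-by or -m 1):
$ sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --query "SELECT * FROM Employee_Table WHERE \$CONDITIONS" --split-by id --target-dir employeeQueryImport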
Example: Import using Sqoop into a target directory in HDFS
$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --m 1
The above command imports the data present in Employee_Table in the sqoop database into a directory named employeeImportAll in HDFS.
After the import is done we can check whether the data is present.
Just look at the output of each of the 3 commands, one by one.
hadoop@jj-VirtualBox:~$ hadoop fs -ls
hadoop@jj-VirtualBox:~$ hadoop fs -ls /user/hadoop/employeeImportAll
hadoop@jj-VirtualBox:~$ hadoop fs -cat /user/hadoop/employeeImportAll/part-m-00000
All the records are present in comma-separated files.
12/03/05 23:44:31 ERROR tool.ImportTool: Error during import: No primary key could be found for table Employee_Table. Please specify one with --split-by or perform a sequential import with '-m 1'.
Sample query on which I got this error:
$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll
Explanation
While performing parallel imports, Sqoop needs a criterion by which it can split the workload, and it uses a splitting column for this. By default Sqoop will identify the primary key column (if present) in a table and use it as the splitting column.
The low and high values of the splitting column are retrieved from the database, and the map tasks operate on evenly sized components of the total range.
For example, if you had a table with a primary key column id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes, each executing SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
Solution
$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --m 1
Just add --m 1; it tells Sqoop to perform a sequential import with a single mapper.
Alternatively, you can tell Sqoop to use a particular column as the split column:
$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --split-by columnName
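For example, assuming Employee_Table has a numeric EmployeeID column (a hypothetical column name here), you could split the import across four mappers:
$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --split-by EmployeeID --m 4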
If you are following on from the previous sqoop import tutorial http://jugnu-life.blogspot.in/2012/03/sqoop-import-tutorial.html , let's try to do a conditional import from an RDBMS in sqoop.
$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret -m 1
The sqoop command above would import all the rows present in the table Customer.
Let's say the Customer table looks something like this:
CustomerName | DateOfJoining |
---|---|
Adam | 2012-12-12 |
John | 2002-1-3 |
Emma | 2011-1-3 |
Tina | 2009-3-8 |
Now let's say we want to import only those customers who joined after 2005-1-1.
We can modify the sqoop import as:
$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret --where "DateOfJoining > '2005-1-1' "
This would import only 3 records from the table above (Adam, Emma, and Tina), since John joined before 2005-1-1.
Happy sqooping :)
This tutorial explains how to use sqoop to import data from an RDBMS to HDFS. The tutorial is divided into multiple posts to cover the various functionalities offered by sqoop import.
The general syntax for import is
$ sqoop-import (generic-args) (import-args)
Argument | Description |
---|---|
--connect <jdbc-uri> | Specify JDBC connect string |
--connection-manager <class-name> | Specify connection manager class to use |
--driver <class-name> | Manually specify JDBC driver class to use |
--hadoop-home <dir> | Override $HADOOP_HOME |
--help | Print usage instructions |
-P | Read password from console |
--password <password> | Set authentication password |
--username <username> | Set authentication username |
--verbose | Print more information while working |
--connection-param-file <filename> | Optional properties file that provides connection parameters |
Example run
$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret -m 1
When we run this sqoop command, it tries to connect to the MySQL database named CompanyDatabase with username root, password mysecret, and one map task.
Generally it is not recommended to give the password on the command line; instead, it is advisable to use the -P parameter, which prompts for the password on the console.
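For example, the same import with a console password prompt (everything else unchanged):
$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root -P -m 1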
One more thing to notice is the use of localhost as the database address. If you are running your hadoop cluster in distributed mode, you should give the full hostname or IP of the database server instead, since localhost would resolve differently on each worker node.
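For example, the connect string might then use a resolvable hostname (db.example.com is a placeholder):
$ sqoop import --connect jdbc:mysql://db.example.com:3306/CompanyDatabase --table Customer --username root -P -m 1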