How to install Sqoop
Sqoop stands for "SQL to Hadoop" and is a connector between Hadoop and the relational database world outside it.
The install is nice and easy. There are a couple of ways to do it, but we're going to try the easiest, which is to add the Cloudera Distribution for Hadoop (CDH) repository and use the built-in Linux apt commands.
Step 1: Add the CDH repository and update
1) Create a file called /etc/apt/sources.list.d/cloudera.list
2) Find out which distribution of Linux you are running:
$ lsb_release -a
3) Add the following to the file created in step 1, replacing MY_DIST with the distribution name found in step 2:
deb http://archive.cloudera.com/debian MY_DIST-cdh3 contrib
deb-src http://archive.cloudera.com/debian MY_DIST-cdh3 contrib
e.g. running lucid you would need
deb http://archive.cloudera.com/debian lucid-cdh3 contrib
deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib
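The steps above can be sketched as one shell snippet. This is only a sketch: it writes the list file into the current directory rather than straight into /etc/apt/sources.list.d, and it falls back to lucid for illustration if lsb_release isn't installed.

```shell
# Generate the cloudera.list contents for the running distribution.
# Assumption: `lsb_release -cs` prints the codename (e.g. "lucid");
# the "lucid" fallback is just for illustration.
DIST=$(lsb_release -cs 2>/dev/null || echo lucid)
cat > cloudera.list <<EOF
deb http://archive.cloudera.com/debian ${DIST}-cdh3 contrib
deb-src http://archive.cloudera.com/debian ${DIST}-cdh3 contrib
EOF
cat cloudera.list   # then: sudo cp cloudera.list /etc/apt/sources.list.d/
```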
4) Run apt-get update:
$ sudo apt-get update
5) Install curl:
$ sudo apt-get install curl
6) Add the Cloudera public key to your repository:
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
Step 2: Update the local repository and install Sqoop
$ sudo apt-get update
$ sudo apt-get install sqoop
In fact, now we have added the CDH3 repository we can install many of the other parts of the Hadoop ecosystem (Hive, Pig, etc.) just by using the sudo apt-get install command.
That's it: Sqoop is installed. If you have MySQL running locally on the same machine you can run commands such as
$ sqoop export --connect jdbc:mysql://localhost/htest --table hresult --username mscbi --password mscbi --export-dir /user/bigdatablog/outdir
This will export the contents of the files in /user/bigdatablog/outdir to a table called hresult in the database htest.
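For context, sqoop export reads plain text files from the HDFS directory and turns each line into a row; by default it expects the fields to be comma-separated. A hypothetical input file might look like this (the filename and column values are made up for illustration):

```shell
# Hypothetical contents of one file in /user/bigdatablog/outdir:
# one record per line, fields comma-separated (Sqoop's default).
printf '1,pass,2011-06-01\n2,fail,2011-06-02\n' > part-00000
cat part-00000
```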
Step 3: Install the SQL Server connector (if you need it)
There's a connector for SQL Server distributed by Microsoft; to use it you need to download the SQL Server JDBC driver from microsoft.com.
The installation is, again, pretty simple. Run the following:
$ export SQOOP_HOME=/usr/lib/sqoop
$ export SQOOP_CONF_DIR=/usr/lib/sqoop/conf
and also add these commands to your .bashrc file (see Installing Hadoop for instructions on how to edit this file).
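For example, you can append them in one go (the paths assume the standard CDH package locations used above):

```shell
# Append the Sqoop environment variables to ~/.bashrc so they are
# set in every new shell. Paths assume the CDH3 package install above.
cat >> ~/.bashrc <<'EOF'
export SQOOP_HOME=/usr/lib/sqoop
export SQOOP_CONF_DIR=/usr/lib/sqoop/conf
EOF
grep SQOOP ~/.bashrc   # quick check that the lines were added
```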
Go to microsoft.com and get the tar.gz file for the SQL Server JDBC driver.
Then go to the Downloads directory, unzip the tar, and copy the jar file to /usr/lib/sqoop/lib:
$ cd Downloads (assumes starting from the home directory)
$ tar -zxvf sqljdbc_3.0.1301.101_enu.tar.gz (Check the version number you downloaded)
$ sudo cp sqljdbc_3.0/enu/sqljdbc4.jar /usr/lib/sqoop/lib
Now we have the JDBC driver in place, we're ready to install the SQL Server connector. Assuming you downloaded it in the step above, unzip the tar, move the directory to /usr/local, set the environment variable and permissions, then install it:
$ tar -zxvf sqoop-sqlserver-1.0.tar.gz
$ sudo mv sqoop-sqlserver-1.0 /usr/local
$ export MSSQL_CONNECTOR_HOME=/usr/local/sqoop-sqlserver-1.0
In the Install Hadoop example we set up a hadoop group; change the ownership of the Sqoop directories now and you'll save time fiddling around with permissions later.
$ sudo chown -R bigdatablog:hadoop /usr/lib/sqoop
$ sudo chown -R bigdatablog:hadoop /etc/sqoop
Now install the connector
$ cd /usr/local/sqoop-sqlserver-1.0
$ ./install.sh
That should be it. The following should list all the databases on the server UKTAG1218, logging on as ID = dba with password dba (BTW, it's not a production system with that id and password):
$ sqoop list-databases --connect 'jdbc:sqlserver://UKTAG1218;username=dba;password=dba'
and
$ sqoop export --connect 'jdbc:sqlserver://uktag1218;username=dba;password=dba;database=MS_test' --export-dir outdir --fields-terminated-by '\t' --table htest
will take the files in the HDFS directory outdir and load them into a table called htest in a database called MS_test. Note that --fields-terminated-by '\t' splits the file into columns separated by tabs; the default is to load comma-separated (CSV) files.
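To make the difference concrete, here is the same record (values are made up for illustration) in the default comma-separated form and in the tab-separated form that --fields-terminated-by '\t' expects:

```shell
# The same hypothetical record in both formats:
printf '1,widget,9.99\n'   > csv-part-00000   # default: comma-separated
printf '1\twidget\t9.99\n' > tsv-part-00000   # with --fields-terminated-by '\t'
cat tsv-part-00000
```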