How to install Sqoop
Sqoop stands for "SQL to Hadoop" and is a connector between Hadoop and the relational database world outside it.
The install is nice and easy. There are a couple of ways to do it, but we're going to try the easiest, which is to add the Cloudera Distribution for Hadoop (CDH) repository and use the built-in Linux apt commands.
Step 1: Add the CDH repository and update
1) Create a file called /etc/apt/sources.list.d/cloudera.list
2) Find out which distribution of Linux you are running:
$ lsb_release -a
3) Add the following to the file created in step 1, replacing MY_DIST with the distribution name found in step 2:
deb http://archive.cloudera.com/debian MY_DIST-cdh3 contrib
deb-src http://archive.cloudera.com/debian MY_DIST-cdh3 contrib
e.g. running lucid you would need
deb http://archive.cloudera.com/debian lucid-cdh3 contrib
deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib
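The steps above can be sketched as one shell snippet. This is only a sketch: it writes the list file into the current directory rather than straight into /etc/apt/sources.list.d, and it falls back to lucid for illustration if lsb_release isn't installed.

```shell
# Generate the cloudera.list contents for the running distribution.
# Assumption: `lsb_release -cs` prints the codename (e.g. "lucid");
# the "lucid" fallback is just for illustration.
DIST=$(lsb_release -cs 2>/dev/null || echo lucid)
cat > cloudera.list <<EOF
deb http://archive.cloudera.com/debian ${DIST}-cdh3 contrib
deb-src http://archive.cloudera.com/debian ${DIST}-cdh3 contrib
EOF
cat cloudera.list   # then: sudo cp cloudera.list /etc/apt/sources.list.d/
```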
4) Run apt-get update:
$ sudo apt-get update
5) Install curl:
$ sudo apt-get install curl
6) Add the Cloudera public key to your repository:
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
Step 2: Update the local repository and install Sqoop
$ sudo apt-get update
$ sudo apt-get install sqoop
In fact, now we have added the CDH3 repository we can install many of the other parts of the Hadoop ecosystem (Hive, Pig, etc.) just by using the sudo apt-get install command.
That's it: Sqoop is installed. If you have MySQL running locally on the same machine you can run commands such as
$ sqoop export --connect jdbc:mysql://localhost/htest --table hresult --username mscbi --password mscbi --export-dir /user/bigdatablog/outdir
This will export the contents of the files in /user/bigdatablog/outdir to a table called hresult in the database htest.
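For context, sqoop export reads plain text files from the HDFS directory and turns each line into a row; by default it expects the fields to be comma-separated. A hypothetical input file might look like this (the filename and column values are made up for illustration):

```shell
# Hypothetical contents of one file in /user/bigdatablog/outdir:
# one record per line, fields comma-separated (Sqoop's default).
printf '1,pass,2011-06-01\n2,fail,2011-06-02\n' > part-00000
cat part-00000
```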
Step 3: Install the SQL Server connector (if you need it)
There's a connector for SQL Server distributed by Microsoft; to use it you need to download the SQL Server JDBC driver from microsoft.com.
The installation is, again, pretty simple. Run the following:
$ export SQOOP_HOME=/usr/lib/sqoop
$ export SQOOP_CONF_DIR=/usr/lib/sqoop/conf
and also add these commands to your .bashrc file (see Installing Hadoop for instructions on how to edit this file).
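For example, you can append them in one go (the paths assume the standard CDH package locations used above):

```shell
# Append the Sqoop environment variables to ~/.bashrc so they are
# set in every new shell. Paths assume the CDH3 package install above.
cat >> ~/.bashrc <<'EOF'
export SQOOP_HOME=/usr/lib/sqoop
export SQOOP_CONF_DIR=/usr/lib/sqoop/conf
EOF
grep SQOOP ~/.bashrc   # quick check that the lines were added
```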
Go to microsoft.com and get the tar.gz file for the SQL Server JDBC driver.
Then go to the Downloads directory, unzip the tar, and copy the jar file to /usr/lib/sqoop/lib:
$ cd Downloads (assumes starting from the home directory)
$ tar -zxvf sqljdbc_3.0.1301.101_enu.tar.gz (Check the version number you downloaded)
$ sudo cp sqljdbc_3.0/enu/sqljdbc4.jar /usr/lib/sqoop/lib
Now we have the JDBC driver in place, we're ready to install the SQL Server connector. Assuming you downloaded it in the step above, unzip the tar, move the directory to /usr/local, set the environment variable and permissions, then install it:
$ tar -zxvf sqoop-sqlserver-1.0.tar.gz
$ sudo mv sqoop-sqlserver-1.0 /usr/local
$ export MSSQL_CONNECTOR_HOME=/usr/local/sqoop-sqlserver-1.0
In the Install Hadoop example we set up a hadoop group; change the ownership of the Sqoop directories now and you'll save time fiddling around with permissions later.
$ sudo chown -R bigdatablog:hadoop /usr/lib/sqoop
$ sudo chown -R bigdatablog:hadoop /etc/sqoop
Now install the connector
$ cd /usr/local/sqoop-sqlserver-1.0
$ ./install.sh
That should be it. The following should list all the databases on the server UKTAG1218, logging on as ID = dba with password dba (BTW, it's not a production system with that id and password):
$ sqoop list-databases --connect 'jdbc:sqlserver://UKTAG1218;username=dba;password=dba'
and
$ sqoop export --connect 'jdbc:sqlserver://uktag1218;username=dba;password=dba;database=MS_test' --export-dir outdir --fields-terminated-by '\t' --table htest
will take the files in the HDFS directory outdir and load them into a table called htest in a database called MS_test. Note that --fields-terminated-by '\t' splits the file into columns separated by tabs; the default is to load comma-separated (CSV) files.
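To make the difference concrete, here is the same record (values are made up for illustration) in the default comma-separated form and in the tab-separated form that --fields-terminated-by '\t' expects:

```shell
# The same hypothetical record in both formats:
printf '1,widget,9.99\n'   > csv-part-00000   # default: comma-separated
printf '1\twidget\t9.99\n' > tsv-part-00000   # with --fields-terminated-by '\t'
cat tsv-part-00000
```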