_覓 | 覦覈襦 | 豕蠏手 | 殊螳 | 譯殊碁 |
FrontPage › UpAndRunningWithHadoopDB
|
|
HadoopDB is an interesting project going on at Yale under the Prof. Daniel Abadi's supervision that I've been meaning to play with for some time now. I initially read the paper describing HadoopDB last year and intended to document how to setup a HadoopDB system using MySQL but I got busy with school work and never got around to it. Since I have a little more free time now that I've finished my thesis, I figured it was about time I got down to playing around with HadoopDB and describing how to setup a HadoopDB system using MySQL as the single node database. With that, I'm going to describe how to get up and running with HadoopDB. If you have not read the paper before starting, I strongly encourage you to give it a read. Its very well written and not that difficult to get through.
In this guide, I'm installing on Ubuntu Server 10.04 64-bit. Thus, I will be using the Ubuntu package manager heavily. I have not tested on other platforms but a lot of what is described here should apply to other platforms such as CentOS. This guide is only on how to set up a single node system. It would not be difficult to extend what is contained here for setting up a multi-node system which I may write about in the future. [edit]
1 Installing Hadoop #Before installing Hadoop, Java needs to be installed. As of 10.04, the Sun JDK packages have been dropped from the Multiverse section of the Ubuntu archive. You can still install the Sun JDK if you wish but for this article, I used OpenJDK without issues:
sudo apt-get install openjdk-6-jdk Before getting into the installation of Hadoop, I encourage you to read Michael Noll's in-depth guide to installing Hadoop on Ubuntu. I borrow from his articles a lot here.
First, create a user account and group that Hadoop will run as: sudo groupadd hadoop sudo useradd -m -g hadoop -d /home/hadoop -s /bin/bash -c "Hadoop software owner" hadoop Next, we download Hadoop and create directories for storing the software and data. For this article, Hadoop 0.20.2 was used:
cd /opt sudo wget http://www.gtlib.gatech.edu/pub/apache/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz sudo tar zxvf hadoop-0.20.2.tar.gz sudo ln -s /opt/hadoop-0.20.2 /opt/hadoop sudo chown -R hadoop:hadoop /opt/hadoop /opt/hadoop-0.20.2 sudo mkdir -p /opt/hadoop-data/tmp-base sudo chown -R hadoop:hadoop /opt/hadoop-data/ Alternatively, Cloudera has created Deb packages that can be used if you wish. I have not used them before so can't comment on how they work.
Next, we need to configure SSH for the hadoop user. This is required by Hadoop in order to manage any nodes. su - hadoop ssh-keygen -t rsa cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys When the ssh-keygen command is run, be sure to leave the passphrase as blank so that you will not be prompted for a password.
We will want to update the .bashrc file for the hadoop user with appropriate environment variables to make administration easier: export JAVA_HOME=/usr/lib/jvm/java-6-openjdk export HADOOP_HOME=/opt/hadoop export HADOOP_CONF=$HADOOP_HOME/conf export HADOOP_PATH=$HADOOP_HOME/bin export HIVE_HOME=/opt/hive export HIVE_PATH=$HIVE_HOME/bin export PATH=$HIVE_PATH:$HADOOP_PATH:$PATH We will cover installing Hive later in this article but for now, leave that environment variable in there. For the remainder of this article, I will be referring to various locations such as the Hadoop installation directory using the environment variables defined above. Next, we want configure Hadoop. There are 3 configuration files in Hadoop that we need to modify:
core-site.xml <configuration> <property> <name>hadoop.tmp.dir</name> <value>/opt/hadoop-data/tmp-base</value> <description>A base for other temporary directories</description> </property> <property> <name>fs.default.name</name> <value>localhost:54311</value> <description> The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem. </description> </property> <property> <name>hadoopdb.config.file</name> <value>HadoopDB.xml</value> <description>The name of the HadoopDB cluster configuration file</description> </property> </configuration> mapred-site.xml <configuration> <property> <name>mapred.job.tracker</name> <value>localhost:54310</value> <description> The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. </description> </property> </configuration> hdfs-site.xml <configuration> <property> <name>dfs.replication</name> <value>1</value> <description> Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. </description> </property> </configuration> Notice the reference to the HadoopDB XML file. We will cover that later but it is necessary for using HadoopDB to have that property in your configuration.
Next, we need to modify the $HADOOP_CONF/hadoop-env.sh file so that the JAVA_HOME variable is correctly set in that file. Thus, I have the following 2 lines in my hadoop-env.sh file: # The java implementation to use. Required. export JAVA_HOME=/usr/lib/jvm/java-6-openjdk Next, we need to format the Hadoop filesystem:
$ hadoop namenode -format 10/05/07 14:24:12 INFO namenode.NameNode: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting NameNode STARTUP_MSG: host = hadoop1/127.0.1.1 STARTUP_MSG: args = [-format] STARTUP_MSG: version = 0.20.2 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010 ************************************************************/ 10/05/07 14:24:12 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop 10/05/07 14:24:12 INFO namenode.FSNamesystem: supergroup=supergroup 10/05/07 14:24:12 INFO namenode.FSNamesystem: isPermissionEnabled=true 10/05/07 14:24:12 INFO common.Storage: Image file of size 96 saved in 0 seconds. 10/05/07 14:24:12 INFO common.Storage: Storage directory /opt/hadoop-data/tmp-base/dfs/name has been successfully formatted. 10/05/07 14:24:12 INFO namenode.NameNode: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at hadoop1/127.0.1.1 ************************************************************/ $ The above is the output from a successful format. Now, we can finally start our single-node Hadoop installation:
$ start-all.sh starting namenode, logging to /opt/hadoop/bin/../logs/hadoop-hadoop-namenode-hadoop1.out localhost: starting datanode, logging to /opt/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop1.out localhost: starting secondarynamenode, logging to /opt/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-hadoop1.out starting jobtracker, logging to /opt/hadoop/bin/../logs/hadoop-hadoop-jobtracker-hadoop1.out localhost: starting tasktracker, logging to /opt/hadoop/bin/../logs/hadoop-hadoop-tasktracker-hadoop1.out $ Again, if you don't see output similar to the above, something went wrong. The log files under /opt/hadoop/logs are quite helpful for trouble-shooting.
[edit]
2 Installing MySQL #sudo apt-get install mysql-server We don't need to perform any special configuration of MySQL for HadoopDB. Just make sure to take note of what password you specify for the root user since we will perform all work with HadoopDB as the root user (this is not mandatory but what I did to keep things simple).
Next, we need to install the MySQL JDBC driver. For this article, I used Connector J. After downloading the jar file, we need to copy it into Hadoop's lib directory so it has access to it: cp mysql-connector-java-5.1.12-bin.jar $HADOOP_HOME/lib Its worth noting that in the paper, the authors do say that initially they used MySQL with HadoopDB but switched to PostgreSQL. The main reason cited is due to the poor join algorithms in MySQL which I assume to mean the fact that only nested loop join is supported in MySQL. I don't attempt to make any comparison of HadoopDB running with MySQL versus PostgreSQL but I wanted to point out the authors observation.
[edit]
3 Download HadoopDB #Now we can download HadoopDB. I'm going to download both the jar file and check out the source from Subversion: Now we can download HadoopDB. After downloading the jar file, we need to copy it into Hadoop's lib directory so it has access to it:
cp hadoopdb.jar $HADOOP_HOME/lib I also checked out the source code from Subversion in case I needed to re-build the jar file at any time:
vn co https://hadoopdb.svn.sourceforge.net/svnroot/hadoopdb hadoopdb [edit]
4 Install Hive #Hive is used by HadoopDB as a SQL interface to their system. Its not a requirement for working with HadoopDB but it is another way to interact with HadoopDB so I'll cover how to install it.
First, we need to create directories in HDFS: hadoop fs -mkdir /tmp hadoop fs -mkdir /user/hive/warehouse hadoop fs -chmod g+w /tmp hadoop fs -chmod g+w /user/hive/warehouse tar zxvf SMS_dist.tar.gz sudo mv dist /opt/hive sudo chown -R hadoop:hadoop hive Since we already setup the environment variables related to Hive earlier when we were installing Hadoop, everything we need should now be in our path:
$ hive Hive history file=/tmp/hadoop/hive_job_log_hadoop_201005081717_1990651345.txt hive> create describe exit from load quit set hive> quit; $ [edit]
5 Data #We want to some data to play around with for testing purposes. For this article, I'm going to use the data from the paper published last summer: 'A Comparison of Approaches to Large-Scale Data Analysis'. Documentation on how to re-produce the benchmarks in that paper are prodivded in the link I gave to the paper. For this article, since I'm only running one Hadoop node and have absolutely no interest in generating lots of data I modified the scripts provided to produce tiny amounts of data:
svn co http://graffiti.cs.brown.edu/svn/benchmarks/ cd benchmarks/datagen/teragen Within the benchmarks/datagen/teragen folder, there is a Perl script named teragen.pl that is reponsible for the generation of data. I modified that script for my purposes to look like:
use strict; use warnings; my $CUR_HOSTNAME = `hostname -s`; chomp($CUR_HOSTNAME); my $NUM_OF_RECORDS_1TB = 10000000000; my $NUM_OF_RECORDS_535MB = 100; my $BASE_OUTPUT_DIR = "/data"; my $PATTERN_STRING = "XYZ"; my $PATTERN_FREQUENCY = 108299; my $TERAGEN_JAR = "teragen.jar"; my $HADOOP_COMMAND = $ENV{'HADOOP_HOME'}."/bin/hadoop"; my %files = ( "535MB" => 1, ); system("$HADOOP_COMMAND fs -rmr $BASE_OUTPUT_DIR"); # IGNORE ERRORS foreach my $target (keys %files) { my $output_dir = $BASE_OUTPUT_DIR."/SortGrep$target"; my $num_of_maps = $files{$target}; my $num_of_records = ($target eq "535MB" ? $NUM_OF_RECORDS_535MB : $NUM_OF_RECORDS_1TB); print "Generating $num_of_maps files in '$output_dir'\n"; ## ## EXEC: hadoop jar teragen.jar 10000000000 /data/SortGrep/ XYZ 108299 100 ## my @args = ( $num_of_records, $output_dir, $PATTERN_STRING, $PATTERN_FREQUENCY, $num_of_maps ); my $cmd = "$HADOOP_COMMAND jar $TERAGEN_JAR ".join(" ", @args); print "$cmd\n"; system($cmd) == 0 || die("ERROR: $!"); } # FOR exit(0); We then run the above Perl script to generate data that will be loaded in to HDFS. HadoopDB comes with a data partitioner that can partition data into a specified number of partitions. This is not particularly important for this article since we are running a single-node cluster so we only have 1 partition. The idea is that a separate partition can be bulk-loaded into a separate database node and indexed appropriately. For us, we just need to create a database and table in our MySQL database. Since we only have 1 partition, the database name will reflect that. The procedure to load the data set we generated into our single MySQL node is:
hadoop fs -get /data/SortGrep535MB/part-00000 my_file mysql -u root -ppassword mysql> create database grep0; mysql> use grep0; mysql> create table grep ( -> key1 char(10), -> field char(90) -> ); load data local infile 'my_file' into table grep fields terminated by '|' (key1, field); We now have data loaded into both HDFS and MySQL. The data we are working with is from the grep benchmark which is not the best benchmark for HadoopDB since it is un-structured data. However, since this article is just about how to setup HadoopDB and not testing its preformance, I didn't really worry about that much.
[edit]
6 HadoopDB Catalog and Running a Job #The HadoopDB catalog is stored as an XML in HDFS. A tool is provided that generates this XML file from a properties file. For this article, the properties file I used is:
#Properties for Catalog Generation ################################## nodes_file=machines.txt relations_unchunked=grep, EntireRankings relations_chunked=Rankings, UserVisits catalog_file=HadoopDB.xml ## #DB Connection Parameters ## port=3306 username=root password=password driver=com.mysql.jdbc.Driver url_prefix=jdbc\:mysql\:// ## #Chunking properties ## chunks_per_node=0 unchunked_db_prefix=grep chunked_db_prefix=cdb ## #Replication Properties ## dump_script_prefix=/root/dump_ replication_script_prefix=/root/load_replica_ dump_file_u_prefix=/mnt/dump_udb dump_file_c_prefix=/mnt/dump_cdb ## #Cluster Connection ## ssh_key=id_rsa-gsg-keypair The machines.txt file must exist and for this article, my machines.txt file had only 1 entry: localhost
Then in order to generate the XML file and store it in HDFS, the following is performed: java -cp $HADOOP_HOME/lib/hadoopdb.jar edu.yale.cs.hadoopdb.catalog.SimpleCatalogGenerator \ > Catalog.properties hadoop dfs -put HadoopDB.xml HadoopDB.xml Please not that the above tool is quite fragile and expects the input properties file to be in a certain format with certain fields. Its pretty easy to break the tool which is understandable given this is a research project.
We are now ready to run a HadoopDB job! The HadoopDB distribution comes with a bunch of benchmarks that were used in the paper that was published on HadoopDB. The data I generated in this article corresponds to the data that was used for their benchmarks so I can use jobs that have already been written in order to test my setup. I'm using the grep task from the paper to search for a pattern in the data I loaded earlier. Thus, to kick off a job I do: java -cp $CLASSPATH:hadoopdb.jar edu.yale.cs.hadoopdb.benchmark.GrepTaskDB \ > -pattern %wo% -output padraig -hadoop.config.file HadoopDB.xml Running the job, I see output like the following:
java -cp $CLASSPATH:hadoopdb.jar edu.yale.cs.hadoopdb.benchmark.GrepTaskDB \ > -pattern %wo% -output padraig -hadoop.config.file HadoopDB.xml 10/05/08 18:01:41 INFO exec.DBJobBase: grep_db_job 10/05/08 18:01:41 INFO exec.DBJobBase: SELECT key1, field FROM grep WHERE field LIKE '%%wo%%'; 10/05/08 18:01:41 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 10/05/08 18:01:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 10/05/08 18:01:41 INFO mapred.JobClient: Running job: job_local_0001 10/05/08 18:01:41 INFO connector.AbstractDBRecordReader: Data locality failed for hadoop1.localdomain 10/05/08 18:01:41 INFO connector.AbstractDBRecordReader: Task from hadoop1.localdomain is connecting to chunk 0 on host localhost with db url jdbc:mysql://localhost:3306/grep0 10/05/08 18:01:41 INFO connector.AbstractDBRecordReader: SELECT key1, field FROM grep WHERE field LIKE '%%wo%%'; 10/05/08 18:01:41 INFO mapred.MapTask: numReduceTasks: 0 10/05/08 18:01:41 INFO connector.AbstractDBRecordReader: DB times (ms): connection = 245, query execution = 2, row retrieval = 36 10/05/08 18:01:41 INFO connector.AbstractDBRecordReader: Rows retrieved = 3 10/05/08 18:01:41 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting 10/05/08 18:01:41 INFO mapred.LocalJobRunner: 10/05/08 18:01:41 INFO mapred.TaskRunner: Task attempt_local_0001_m_000000_0 is allowed to commit now 10/05/08 18:01:41 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/home/hadoop/padraig 10/05/08 18:01:41 INFO mapred.LocalJobRunner: 10/05/08 18:01:41 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done. 10/05/08 18:01:42 INFO mapred.JobClient: map 100% reduce 0% 10/05/08 18:01:42 INFO mapred.JobClient: Job complete: job_local_0001 10/05/08 18:01:42 INFO mapred.JobClient: Counters: 6 10/05/08 18:01:42 INFO mapred.JobClient: FileSystemCounters 10/05/08 18:01:42 INFO mapred.JobClient: FILE_BYTES_READ=115486 10/05/08 18:01:42 INFO mapred.JobClient: FILE_BYTES_WRITTEN=130574 10/05/08 18:01:42 INFO mapred.JobClient: Map-Reduce Framework 10/05/08 18:01:42 INFO mapred.JobClient: Map input records=3 10/05/08 18:01:42 INFO mapred.JobClient: Spilled Records=0 10/05/08 18:01:42 INFO mapred.JobClient: Map input bytes=3 10/05/08 18:01:42 INFO mapred.JobClient: Map output records=3 10/05/08 18:01:42 INFO exec.DBJobBase: grep_db_job JOB TIME : 1747 ms. $ The results are stored in HDFS and I also specified I wanted the results put in an output directory named padraig. Inspecting the results I see:
$ cd padraig $ cat part-00000 ~k~MuMq= w0000000000{XSq#Bq6,3xd.tg_Wfa"+woX1e_L*]H-UE%+]L]DiT5#QOS5< vkrvkB8 6i0000000000.h9RSz'>Kfp6l~kE0FV"aP!>xnL^=C^W5Y}lTWO%N4$F0 Qu@:]-N4-(J%+Bm*wgF^-{BcP^5NqA ]&{`H%]1{E0000000000Z[@egp'h9! BV8p~MuIuwoP4;?Zr' :!s=,@!F8p7e[9VOq`L4%+3h.*3Rb5e=Nu`>q*{6=7 $ I can verify this result by going the data stored in MySQL and performing the same query on it:
mysql> select key1, field from grep where field like '%wo%'; +--------------------------------+------------------------------------------------------------------------------------------+ | key1 | field | +--------------------------------+------------------------------------------------------------------------------------------+ | ~k~MuMq= | w0000000000{XSq#Bq6,3xd.tg_Wfa"+woX1e_L*]H-UE%+]L]DiT5#QOS5< | | vkrvkB8 | 6i0000000000.h9RSz'>Kfp6l~kE0FV"aP!>xnL^=C^W5Y}lTWO%N4$F0 Qu@:]-N4-(J%+Bm*wgF^-{BcP^5NqA | | ]&{`H%]1{E0000000000Z[@egp'h9! | BV8p~MuIuwoP4;?Zr' :!s=,@!F8p7e[9VOq`L4%+3h.*3Rb5e=Nu`>q*{6=7 | +--------------------------------+------------------------------------------------------------------------------------------+ 3 rows in set (0.00 sec) mysql> Thus, I can see the same rows were returned by the HadoopDB job.
Conclusion I didn't get to use the Hive interface to HadoopDB as I had issues getting it going. If I get it going in the future, I'll likely write about it. HadoopDB is a pretty interesting project and I enjoyed reading the paper on it a lot. A demo of HadoopDB will be given at SIGMOD this year which should be interesting. Overall, I think its a pretty interesting project but I'm not sure how active it is. Based on the fact that a demo is being given at SIGMOD, I'm sure there is research being done on it but compared to other open source projects its difficult to tell how much development is occuring. I'm sure this has more to do with the fact that it is a research project first and foremost whose source code just happens to be available. It would be nice to see a mailing list or something pop up around this project though. For example, if I wanted to contribute a patch, its not really clear how I should go about doing that and whether it will be integrated or not. I do think its some interesting research though and I'll be keeping my eye on it and trying to mess around with it whenever I have spare time. Next thing I want to look into regarding HadoopDB is hooking it up to the column-orientated database [=http://monetdb.cwi.nl/ MonetDB] which I will write about if I get the chance.
鏤
|
覲企 蟆讀 蟯 企. |