Drawing on references and experience found online, I finally completed the setup and testing of a Hadoop pseudo-distributed cluster.
Before installing the Oracle JDK, uninstall OpenJDK. First check whether OpenJDK is installed:
java -version
rpm -qa | grep jdk
OpenJDK turns out to be installed; uninstall it with:
rpm -e --nodeps tzdata-java-201xc-.el6.....
rpm -e --nodeps java-1.x.x-openjdk-1.x.x
Uninstallation complete.
The following installs JDK 1.7. If the download is a self-extracting .bin package (the file name below is an older JDK 6 example kept from the original notes; substitute the file you actually downloaded):
sh jdk-6u17-linux-i586-rpm.bin
If the download is an rpm package, then:
chmod 755 jdk-7-linux-x64.rpm
rpm -ivh jdk-7-linux-x64.rpm
The rpm installs directly into /usr/java/ as jdk1.7.0, in which case the third step below (creating the /usr/bin symlinks) is not needed.
vi /etc/profile
# add the following
export JAVA_HOME=/usr/java/jdk1.7.0
export JAVA_BIN=/usr/java/jdk1.7.0/bin
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME JAVA_BIN PATH CLASSPATH
Save and exit, then:
source /etc/profile
cd /usr/bin
ln -s -f /usr/java/jdk1.7.0/jre/bin/java
ln -s -f /usr/java/jdk1.7.0/bin/javac
java -version
The screen output:
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
tar -xzf hadoop-0.21.0.tar.gz
or, to extract into a target directory:
tar -zvxf hadoop-0.21.0.tar.gz -C <target directory>
Locate the install:
find / -name "hadoop*"
/root/hadoop/hadoop-0.21.0
vim ~/.bash_profile
export JAVA_HOME=/usr/java/jdk1.7.0
export HADOOP_HOME=/root/hadoop/hadoop-0.21.0
export PATH=$PATH:$HADOOP_HOME/bin
Write the three exports above into ~/.bash_profile (or into /etc/profile, then run source /etc/profile to make them take effect).
Run:
hadoop version
The output is:
Hadoop 0.21.0
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707
Compiled by chrisdo on Fri Feb 19 08:07:34 UTC 2010
Each component in Hadoop is configured using an XML file. Core properties go in core-site.xml, HDFS properties go in hdfs-site.xml, and MapReduce properties go in mapred-site.xml. These files are all located in the conf subdirectory.
For the default settings, see the docs shipped with Hadoop under HADOOP_INSTALL/docs/ (core-default.html, hdfs-default.html, mapred-default.html).
Hadoop running modes:
(1) Standalone (or local) mode: no daemons; everything runs in a single JVM. Uses the local filesystem and the local MapReduce job runner.
(2) Pseudo-distributed mode: daemons run on the local machine. HDFS + MapReduce daemons.
(3) Fully distributed mode: daemons run on a cluster of machines. HDFS + MapReduce daemons.
To run Hadoop in a particular mode, you need to do two things: set the appropriate properties, and start the Hadoop daemons.
Key configuration properties for different modes:
| Component | Property | Standalone | Pseudo-distributed | Fully distributed |
|---|---|---|---|---|
| Core | fs.default.name | file:/// (default) | hdfs://localhost/ | hdfs://namenode/ |
| HDFS | dfs.replication | N/A | 1 | 3 (default) |
| MapReduce | mapred.job.tracker | local (default) | localhost:8021 | jobtracker:8021 |

Standalone mode requires no changes to the configuration files.
Edit the three configuration files under ${HADOOP_HOME}/conf/, adding the following properties:
In core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

In hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

In mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
First check whether SSH is installed; simply running the ssh command will tell you.
Configure passwordless SSH login:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
Once that is set up, run:
ssh localhost
to test that you can log in without a password.
Run:
hadoop namenode -format
As long as the three configuration files above were set up correctly, this step raises no problems.
The output includes a message like:
11/10/09 11:23:35 INFO common.Storage: Storage directory /tmp/hadoop-root/dfs/name has been successfully formatted.
To start the HDFS and MapReduce daemons, type:
% start-dfs.sh
% start-mapred.sh
Running start-all.sh instead starts everything at once, but I ran into a problem here:
namenode running as process 17031. Stop it first.
localhost: Error: JAVA_HOME is not set.
localhost: Error: JAVA_HOME is not set.
jobtracker running as process 17793. Stop it first.
localhost: Error: JAVA_HOME is not set.
Neither script would run, even though JAVA_HOME was definitely set! (This may be related to the version used here, Hadoop 0.21.0.) (PS: reading this sentence now cracks me up O(∩_∩)O)
The solutions found online were all nonsense. Only after reading the scripts themselves did I find that JAVA_HOME must be added manually to the hadoop-env.sh script under conf/; after that, both scripts ran fine.
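Concretely, the fix is a single line (a minimal sketch; the path matches the JDK installed earlier):
# in ${HADOOP_HOME}/conf/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.7.0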
Startup succeeded! To verify that Hadoop started properly:
jps
Running this lists what has started: NameNode, JobTracker, SecondaryNameNode, and so on. If the NameNode failed to start, first run "bin/stop-all.sh" to stop everything, then re-format the namenode and start again, as sketched below.
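A typical recovery sequence looks like this (a sketch, run with $HADOOP_HOME as the working directory; note that re-formatting wipes the data in HDFS):
bin/stop-all.sh
hadoop namenode -format
bin/start-all.sh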
Normal output looks like:
20738 TaskTracker
17793 JobTracker
20840 Jps
20495 SecondaryNameNode
20360 DataNode
17031 NameNode
Once startup succeeds, check the cluster status:
hadoop dfsadmin -report
or through the web UI at http://10.80.18.191:50070
The HDFS filesystem can be checked with:
hadoop fsck /
This part was probably adapted from someone else's blog, and I can no longer find the original post. Really ~~~~(>_<)~~~~
vi /tmp/test.txt
(Open it and type some arbitrary content, e.g. "mu ha ha ni da ye da ye da", then save and exit.)
hadoop dfs -copyFromLocal /tmp/test.txt firstTest
(Note: if firstTest does not exist in HDFS yet, it is created automatically; to list what HDFS already contains, use "hadoop dfs -ls /".)
The output is:
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
11/10/09 14:38:05 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/10/09 14:38:05 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
It warns that the command is deprecated =。=; the fs subcommand has now replaced dfs.
The original author got this wrong: firstTest is not a directory but the file that was placed into HDFS. Its content can be viewed with:
hadoop fs -cat firstTest
The content matches that of test.txt above.
List the HDFS directory:
hadoop dfs -ls /
hadoop jar hadoop-mapred-examples-0.20.2.jar wordcount firstTest result
or, depending on the release:
hadoop jar hadoop-0.20.2-examples.jar wordcount firstTest result
(Note: this command means "run wordcount on the firstTest file and write the counts into result"; if the result directory does not exist, it is created automatically.)
Running this here failed with:
Exception in thread "main" java.io.IOException: Error opening job jar: hadoop-mapred-example0.21.0.jar
i.e. the jar could not be found.
Solution:
hadoop fs -rmr result
Delete the result output left over from the previous attempt.
You must change into the Hadoop install directory, i.e. the directory containing hadoop-mapred-examples-0.21.0.jar:
cd /root/hadoop/hadoop-0.21.0
hadoop jar hadoop-mapred-examples-0.21.0.jar wordcount firstTest result
The job now runs, with output like:
11/10/09 15:28:29 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/10/09 15:28:29 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
11/10/09 15:28:29 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/10/09 15:28:30 INFO input.FileInputFormat: Total input paths to process : 1
11/10/09 15:28:30 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
11/10/09 15:28:30 INFO mapreduce.JobSubmitter: number of splits:1
11/10/09 15:28:30 INFO mapreduce.JobSubmitter: adding the following namenodes' delegation tokens:null
11/10/09 15:28:30 INFO mapreduce.Job: Running job: job_201110091132_0001
11/10/09 15:28:31 INFO mapreduce.Job: map 0% reduce 0%
11/10/09 15:28:43 INFO mapreduce.Job: map 100% reduce 0%
11/10/09 15:28:49 INFO mapreduce.Job: map 100% reduce 100%
11/10/09 15:28:51 INFO mapreduce.Job: Job complete: job_201110091132_0001
11/10/09 15:28:51 INFO mapreduce.Job: Counters: 33
  FileInputFormatCounters
    BYTES_READ=42
  FileSystemCounters
    FILE_BYTES_READ=78
    FILE_BYTES_WRITTEN=188
    HDFS_BYTES_READ=143
    HDFS_BYTES_WRITTEN=52
  Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
  Job Counters
    Data-local map tasks=1
    Total time spent by all maps waiting after reserving slots (ms)=0
    Total time spent by all reduces waiting after reserving slots (ms)=0
    SLOTS_MILLIS_MAPS=6561
    SLOTS_MILLIS_REDUCES=3986
    Launched map tasks=1
    Launched reduce tasks=1
  Map-Reduce Framework
    Combine input records=5
    Combine output records=5
    Failed Shuffles=0
    GC time elapsed (ms)=112
    Map input records=5
    Map output bytes=62
    Map output records=5
    Merged Map outputs=1
    Reduce input groups=5
    Reduce input records=5
    Reduce output records=5
    Reduce shuffle bytes=78
    Shuffled Maps =1
    Spilled Records=10
    SPLIT_RAW_BYTES=101
hadoop dfs -cat result/part-r-00000
(Note: the results are written by default to a file named "part-r-*****"; use "hadoop dfs -ls result" to see which files the result directory contains.)
hadoop dfs -ls result
The output is:
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
11/10/09 15:50:52 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/10/09 15:50:52 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
Found 2 items
-rw-r--r-- 1 root supergroup 0 2011-10-09 15:28 /user/root/result/_SUCCESS
-rw-r--r-- 1 root supergroup 52 2011-10-09 15:28 /user/root/result/part-r-00000
part-r-00000 is the result file. Note that this path lives inside HDFS; it cannot be accessed directly from the OS filesystem.
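To inspect it from the OS side, copy it out of HDFS first (a sketch; the local destination path is arbitrary):
hadoop fs -get result/part-r-00000 /tmp/wordcount-result.txt
cat /tmp/wordcount-result.txt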
View the result within HDFS:
hadoop dfs -cat result/part-r-00000
Then clean up for another run:
hadoop fs -rmr firstTest
hadoop fs -rmr result
And run again:
hadoop dfs -copyFromLocal /tmp/test.txt firstTest
hadoop jar /root/hadoop/hadoop-0.21.0/hadoop-mapred-examples-0.21.0.jar wordcount firstTest result
Check the output:
hadoop dfs -cat result/part-r-00000
The output is:
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
11/10/09 16:25:02 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/10/09 16:25:02 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
ceshi 1
Below I excerpt an article found online.
(Looking at that sentence now, I really was something back then... The original text follows.)
I hadn't written Java in more than three years and was very rusty, so this was practice =。=
1. In Eclipse, create a new Java project HadoopTest, then create a class (tick the option to auto-generate a main method).
2. Import hadoop-0.20.2-core.jar into the project: right-click the project => Build Path => Configure Build Path, then add hadoop-0.20.2-core.jar as an external JAR.
3. Write the test source code, as follows:
import java.io.IOException;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;

public class TestOne {

    public static void main(String[] args) {
        // Load the default Hadoop configuration
        Configuration conf = new Configuration();
        try {
            FileSystem fs = FileSystem.get(conf);
            Path f = new Path("hdfs://rhel-h1:9000/user/root/test1.txt");
            // Create the file in HDFS, overwriting it if it already exists
            FSDataOutputStream os = fs.create(f, true);
            for (int i = 0; i < 100; ++i) {
                os.writeChars("test" + i);
            }
            os.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
4. Compile the project and export it as a JAR (exporting only the compiled classes, not the referenced libraries).
5. Upload the JAR to the Hadoop master and run:
hadoop jar HadoopTest.jar TestOne
Because the HDFS path in the original test code was miswritten as "hdfs:///localhost/user/root/test1.txt" =。=, the generated test file landed under /localhost/user/root in HDFS as a test1 text file. The intent was to put it under /user/root, so the line should be: Path f = new Path("hdfs:///user/root/test1");
No matter; the test served its purpose. View the result:
hadoop fs -cat /user/root/test1.txt
To write an actual MapReduce job, you only need to implement the corresponding interfaces; the development process is much the same. The map phase extends the Mapper class from the org.apache.hadoop.mapreduce package and overrides the map method; the reduce phase does the same with the Reducer class and the reduce method. In main, the two are wired in via job.setMapperClass(Map.class) and job.setReducerClass(Reduce.class).
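A minimal sketch of that pattern (my own illustrative class names, written against the org.apache.hadoop.mapreduce API; on 0.21 the job object is built with new Job(conf, ...), since Job.getInstance() only arrived in later releases):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCount {

    // Map phase: extend Mapper and override map(), emitting (word, 1) per token
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: extend Reducer and override reduce(), summing the counts per word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "my word count");
        job.setJarByClass(MyWordCount.class);
        // Wire in the map and reduce implementations
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged the same way as TestOne above, it could then be run with something like: hadoop jar HadoopTest.jar MyWordCount firstTest result2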
hadoop fs -ls                 # lists /user/root; with no path given, this defaults to the current user's home directory
hadoop fs -ls /               # lists the contents of the HDFS root directory
hadoop fs -lsr /              # lists the root directory and every level of subdirectory beneath it
hadoop fs -put aatest.txt .   # uploads aatest.txt from the local current directory; the trailing . means the HDFS default /user/$USER directory, where $USER is the login user name
hadoop fs -cat aatest.txt     # views the uploaded file
hadoop fs -cat aatest.txt | head
hadoop fs -tail aatest.txt    # views the end of the file (the last kilobyte)
hadoop fs -get aatest.txt .   # downloads the file from HDFS; here the trailing . is the local current directory
hadoop fs -rm aatest.txt      # deletes the file
hadoop fsck / -files -blocks  # checks files and blocks
hadoop dfsadmin -report       # reports the status of all DataNodes
hadoop fs -copyFromLocal localfilename HDFSfilename   # uploads a local file to HDFS
hadoop fs -copyToLocal xxxx   # copies from HDFS to local
hadoop fs -mkdir xxx          # creates a directory
I ran a test with the put command here: I uploaded two files of a few tens of MB each to HDFS (on a two-node distributed deployment). Through the monitoring page at http://10.80.18.191:50070 you can see that the first file landed on rhel-h3, which gained a block, and the second on rhel-h2, which likewise gained a block. From the plain HDFS command output, you cannot tell at all which node a file sits on.
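That said, fsck will print block locations if asked (a sketch; aatest.txt stands in for whichever file was uploaded):
hadoop fsck /user/root/aatest.txt -files -blocks -locations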
hadoop fs -rmr xxx   # deletes a directory recursively
hadoop job ...       # with further arguments, operates on currently running jobs, e.g. -list, -kill (example below)
hadoop balancer      # balances disk usage across the cluster
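For instance, to list running jobs and kill one (a sketch; the job ID here is borrowed from the wordcount run above):
hadoop job -list
hadoop job -kill job_201110091132_0001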
Most of this blog note dates from when I had just started work and 健哥 asked me to get Hadoop going; I have made only small edits since, and the body is largely unchanged. It took roughly two days of fiddling at the time. The notes originally lived in a plain-text notepad; some time later, maybe three months or half a year, I moved them into Youdao Cloud Notes and dressed them up with markdown. Today I redid the pseudo-distributed setup from the official docs in less than half an afternoon, then came back to read this post, and laughed at myself. It felt like looking at my own old practice copybooks.
Laughing aside, rereading this note brings the experience vividly back: my first contact with Hadoop, working through the difficulties one by one to stand up the cluster, and taking such detailed notes. It touches me a little. Thinking of the drive I had as a newcomer, and then of myself now, it is hard to say anything. That is why I cannot bring myself to rewrite this post, even though it now looks really crude to me. I will write up this round's OpenJDK + Hadoop setup in a separate post soon. Perhaps in a year that one will look just as crude.
Sometimes, with sayings like "be humble as a valley; great wisdom can look like folly" and "every master keeps an apprentice's heart", saying and doing turn out to be entirely different things. The reasons are many and hard to spell out; in the end, one simply fades into the crowd.
"What looks most ordinary is often the most remarkable; what seems easy was won through hardship." I am glad I got to see the me of a year and more ago. Onward.