Hadoop Pseudo-Distributed Setup Notes
The company needs a log analysis system, and this heavy task has finally landed on me! While I'm still excited, let me record this Hadoop setup process for future reference.
Overview
- HDFS: mainly used to store large-scale files
- MapReduce: mainly used to process files
- Zookeeper: distributed coordination service used when building a Hadoop cluster
- HBase: distributed database
- Hive: data warehouse
- Pig: data flow processing
- Mahout: data mining
- Flume: mainly used to collect logs and import them into HBase
- Sqoop: mainly used to transfer data
Hadoop core
- Hadoop Common: the base module that provides common support for the other Hadoop modules
- Hadoop HDFS: a highly reliable, high-throughput distributed file system, conceptually similar to the Windows file system
- NameNode: stores metadata such as file names, replica counts, and storage paths
- DataNode: when HDFS stores a large file it splits it into blocks, and each block is stored as multiple replicas
- Hadoop MapReduce: a distributed offline parallel computing framework
- Map:
- 1. Read the input file and parse it into key/value pairs (key = byte offset of the line, value = line content)
- 2. Override the map method and write the business logic that emits new key/value pairs
- 3. Partition the emitted key/value pairs
- 4. Sort and group the data by key; values with the same key are grouped together
- Reduce:
- 1. The outputs of the Map tasks are copied to the different reduce nodes according to their partitions
- 2. The Map outputs are merged and sorted; here you write the reduce function to implement your own logic, processing the input key/values and emitting new key/value pairs (see the word-count sketch below)
- Hadoop YARN: a newer MapReduce framework responsible for job scheduling and resource management; it consists of the ResourceManager, which allocates resources, and the NodeManager, which manages its own machine's resources and runs the tasks.
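To make the Map and Reduce phases above concrete, here is a minimal word-count style sketch against the Hadoop 2.x Java API. It is only an illustration of the two phases (class names are my own and the job/driver setup is omitted):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: the input key is the byte offset of the line, the value is the line content;
    // we emit a (word, 1) pair for every token.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: all values for the same key arrive together after the shuffle/sort;
    // we sum them and emit (word, count).
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}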
Basic environment setup
1. The Hadoop server runs 32-bit CentOS 6. First make sure the local machine can reach the server, and fix the Linux IP to 192.168.11.128 (a sketch of the interface config follows below).
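On CentOS 6 the static IP is typically set by editing the interface config file, roughly like this (the device name and gateway below are assumptions; adjust them to your own VM network):
$ vim /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.11.128
NETMASK=255.255.255.0
# GATEWAY is an assumption; adjust it to your network
GATEWAY=192.168.11.2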

$ service network restart
3. After the restart, you can check the network status with ifconfig
$ ifconfig
eth0 Link encap:Ethernet HWaddr 00:0C:29:EE:EA:B4
inet addr:192.168.11.128 Bcast:192.168.11.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:feee:eab4/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:265653 errors:0 dropped:0 overruns:0 frame:0
TX packets:151108 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:311155544 (296.7 MiB) TX bytes:22442571 (21.4 MiB)
Interrupt:19 Base address:0x2024
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:21033 errors:0 dropped:0 overruns:0 frame:0
TX packets:21033 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:7065599 (6.7 MiB) TX bytes:7065599 (6.7 MiB)
4. Stop the firewall
$ service iptables stop
5. Prevent the firewall from starting again after a reboot (disable it permanently)
$ chkconfig iptables off
6. Disable the SELinux subsystem: edit /etc/sysconfig/selinux, change 'SELINUX=enforcing' to 'SELINUX=disabled', and save with :wq
$ vim /etc/sysconfig/selinux
After editing it looks like this; press Esc and type ":wq" to save
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
# targeted - Targeted processes are protected,
# mls - Multi Level Security protection.
SELINUXTYPE=targeted
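The change in the file only takes effect after a reboot; to stop enforcement immediately for the current session, setenforce can be used (getenforce shows the current mode):
$ setenforce 0
$ getenforce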
7. Change the hostname to hadoop
You can make it take effect immediately with the hostname command
$ hostname hadoop
To change it permanently, edit /etc/sysconfig/network, modify HOSTNAME, and save, as follows
$ vim /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hadoop
8. Configure the IP mapping: edit the hosts file and map the server address to the name hadoop
$ vim /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.11.128 hadoop
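A quick way to confirm the mapping works:
$ ping -c 1 hadoop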
9. Reboot the machine
$ reboot
Configure passwordless SSH login
With the above in place, the server (192.168.11.128) can now be reached directly via the hostname hadoop, so you can log in with ssh.
1. Log in via ssh with the password, then exit; at this point a .ssh directory is automatically created in the user's home directory
$ ssh hadoop
root@hadoop's password:
Last login: Fri Sep 8 06:56:17 2017 from 192.168.11.1
$ exit
logout
Connection to hadoop closed.
[root@hadoop hadoop-2.8.1]#
2. Generate the SSH key pair; press Enter three times at the prompts
$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
/root/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
d4:e4:d8:4f:9f:b5:46:bb:b6:7c:97:e1:7f:f6:9d:f4 root@hadoop
The key's randomart image is:
+--[ RSA 2048]----+
| . |
| * |
| o + . ..|
| . o ..oo|
| S . o+ |
| ...|
| .+o|
| +oX|
| =E|
+-----------------+
$
3. Use ssh-copy-id to copy the key to the machine that should allow passwordless login (since this is a single-node setup, that is this machine itself); you need to enter the target machine's password
$ ssh-copy-id hadoop
root@hadoop's password:
Now try logging into the machine, with "ssh 'hadoop'", and check in:
.ssh/authorized_keys
to make sure we haven't added extra keys that you weren't expecting.
[root@hadoop hadoop-2.8.1]#
4. Reconnect via ssh; if it no longer asks for a password, the configuration is correct
$ ssh hadoop
Last login: Fri Sep 8 06:56:17 2017 from 192.168.11.1
[root@hadoop ~]# exit
logout
Connection to hadoop closed.
[root@hadoop hadoop-2.8.1]#
Determine the build environment
Before installing Hadoop, first decide which version to install, and then set up the environment according to that version's requirements. I strongly advise against installing the officially pre-built binaries directly, since you may run into all kinds of problems along the way; build it yourself whenever possible.
1. Download Hadoop; the official source download page is http://hadoop.apache.org/releases.html
2. Create the necessary directories
$ mkdir /home/hadoop
$ mkdir /home/hadoop/resource
$ mkdir /home/hadoop/apps
$ ll /home/hadoop/
total 12
drwxr-xr-x. 10 root root 4096 Sep 7 16:00 apps
drwxr-xr-x. 3 root root 4096 Sep 7 15:21 resource
3. Upload the downloaded hadoop-2.8.1-src.tar.gz to /home/hadoop/resource
4. Extract it and enter the hadoop-2.8.1-src directory; you can see the general structure, where BUILDING.txt describes the environment needed to build Hadoop
$ tar -zxf hadoop-2.8.1-src.tar.gz
$ cd hadoop-2.8.1-src
$ ll
total 220
-rw-rw-r-- 1 root root 15623 May 23 16:14 BUILDING.txt
drwxr-xr-x 4 root root 4096 Sep 1 09:14 dev-support
drwxr-xr-x 4 root root 4096 Sep 7 15:37 hadoop-assemblies
drwxr-xr-x 4 root root 4096 Sep 7 15:37 hadoop-build-tools
drwxrwxr-x 3 root root 4096 Sep 7 15:49 hadoop-client
drwxr-xr-x 11 root root 4096 Sep 7 15:40 hadoop-common-project
drwxr-xr-x 3 root root 4096 Sep 7 15:50 hadoop-dist
drwxr-xr-x 9 root root 4096 Sep 7 15:44 hadoop-hdfs-project
drwxr-xr-x 10 root root 4096 Sep 7 15:47 hadoop-mapreduce-project
drwxr-xr-x 4 root root 4096 Sep 7 15:37 hadoop-maven-plugins
drwxr-xr-x 3 root root 4096 Sep 7 15:49 hadoop-minicluster
drwxr-xr-x 4 root root 4096 Sep 7 15:37 hadoop-project
drwxr-xr-x 3 root root 4096 Sep 7 15:37 hadoop-project-dist
drwxr-xr-x 19 root root 4096 Sep 7 15:50 hadoop-tools
drwxr-xr-x 4 root root 4096 Sep 7 15:46 hadoop-yarn-project
-rw-rw-r-- 1 root root 99253 May 23 16:14 LICENSE.txt
-rw-rw-r-- 1 root root 15915 May 23 16:14 NOTICE.txt
drwxrwxr-x 2 root root 4096 Jun 1 23:24 patchprocess
-rw-rw-r-- 1 root root 20477 May 28 15:36 pom.xml
-rw-r--r-- 1 root root 1366 May 19 22:30 README.txt
-rwxrwxr-x 1 root root 1841 May 23 16:14 start-build-env.sh
[root@hadoop hadoop-2.8.1-src]#
5. Check the build requirements
$ vim BUILDING.txt
Below is the content of BUILDING.txt: it requires a Unix system, JDK 1.7+, Maven 3.0 or later, and so on
Build instructions for Hadoop
----------------------------------------------------------------------------------
Requirements:
* Unix System
* JDK 1.7+
* Maven 3.0 or later
* Findbugs 1.3.9 (if running findbugs)
* ProtocolBuffer 2.5.0
* CMake 2.6 or newer (if compiling native code), must be 3.0 or newer on Mac
* Zlib devel (if compiling native code)
* openssl devel (if compiling native hadoop-pipes and to get the best HDFS encryption performance)
* Linux FUSE (Filesystem in Userspace) version 2.6 or above (if compiling fuse_dfs)
* Internet connection for first build (to fetch all Maven and Hadoop dependencies)
----------------------------------------------------------------------------------
The easiest way to get an environment with all the appropriate tools is by means
of the provided Docker config.
This requires a recent version of docker (1.4.1 and higher are known to work).
On Linux:
Install Docker and run this command:
Configure the JDK environment
1. Download JDK 1.7 as required, upload jdk-7u80-linux-i586.tar.gz, and extract it to the apps directory
$ cd /home/hadoop/apps
$ tar -zxf ../resource/jdk-7u80-linux-i586.tar.gz
2. Enter the extracted JDK directory and copy its path; you can use pwd to print the full path of the current directory
$ cd jdk1.7.0_80/
$ pwd
/home/hadoop/apps/jdk1.7.0_80
3. Edit the profile file and add the copied JDK path to the PATH variable
$ vim /etc/profile
Append the following to the end of profile. Note that, unlike Windows, the separator on the second line is a colon ":", not a semicolon ";".
export JAVA_HOME=/home/hadoop/apps/jdk1.7.0_80
PATH=${JAVA_HOME}/bin:${PATH}
4. Reload the environment variables with source
$ source /etc/profile
$
5. Verify that the Java environment is configured correctly
$ java -version
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) Client VM (build 24.80-b11, mixed mode)
[root@hadoop jdk1.7.0_80]#
Note: if you accidentally break the profile file, source will report errors, and what's even more annoying is that the PATH variable gets wiped out, so apart from built-ins like cd almost nothing can be executed. In that case you can use the command below to temporarily add the common command directories back to PATH. This is only temporary and will be lost on reboot, so after running it restore the profile file to its original state and then repeat step 3.
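A temporary workaround along these lines usually brings the basic commands back (the exact directories are an assumption; adjust as needed):
$ export PATH=/bin:/usr/bin:/sbin:/usr/sbin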
Configure the Maven environment
This is much the same as configuring the Java environment: just download Maven. I actually used Maven 3.5 here while Hadoop asks for 3.0, but that caused no problems; if the build fails, you can switch to 3.0.
Simply extract apache-maven-3.5.0-bin.tar.gz into the apps directory and then configure the PATH variable by appending the following:
export MAVEN_HOME=/home/hadoop/apps/apache-maven-3.5.0
# Note: do not add a new PATH=... line for every new variable; instead, extend this single line with ${NEW_VARIABLE}/bin followed by a colon
PATH=${MAVEN_HOME}/bin:${JAVA_HOME}/bin:${PATH}
Verify that Maven is installed correctly
$ mvn -v
Apache Maven 3.5.0 (ff8f5e7444045639af65f6095c62210b5713f426; 2017-04-03T12:39:06-07:00)
Maven home: /home/hadoop/apps/apache-maven-3.5.0
Java version: 1.7.0_80, vendor: Oracle Corporation
Java home: /home/hadoop/apps/jdk1.7.0_80/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-358.el6.i686", arch: "i386", family: "unix"
Dependencies
These are required for the steps below; just install this pile of packages with yum
$ yum -y install svn ncurses-devel gcc*
$ yum -y install lzo-devel zlib-devel autoconf automake libtool cmake openssl-devel
Install CMake
1. Download cmake-3.0.2.tar.gz into the resource directory; download page: http://www.cmake.org/download/
$ wget https://cmake.org/download/cmake-3.0.2.tar.gz
2. Extract cmake-3.0.2.tar.gz into the apps directory
$ cd ../apps
$ tar -zxf ../resource/cmake-3.0.2.tar.gz
3. Enter the extracted apps/cmake-3.0.2 directory, run the bootstrap script, and install with make
$ cd /home/hadoop/apps/cmake-3.0.2
$ ./bootstrap
$ make
$ make install
4. Verify the installation
$ cmake --version
cmake version 2.8.12.2
Install Protocol Buffers
Protocol Buffers 2.5.0 is used here; I downloaded it directly from CSDN.
1. After downloading, extract it into the apps directory
$ cd /home/hadoop/apps
$ tar -zxf /home/hadoop/protobuf-2.5.0.tar.gz
$ cd protobuf-2.5.0
$ ls
aclocal.m4 config.h.in configure.ac examples java Makefile.am protobuf.pc stamp-h1
autogen.sh config.log CONTRIBUTORS.txt generate_descriptor_proto.sh libtool Makefile.in protobuf.pc.in vsprojects
CHANGES.txt config.status COPYING.txt gtest ltmain.sh missing python
config.guess config.sub depcomp install-sh m4 protobuf-lite.pc README.txt
config.h configure editors INSTALL.txt Makefile protobuf-lite.pc.in src
2. Next, add the googletest dependency for Protocol Buffers. The installer can normally download this dependency automatically, but since I'm behind the Great Firewall it has to be added manually first; if you have unrestricted access, skip this and go straight to step 3. Here it is downloaded from Google's GitHub:
$ wget https://codeload.github.com/google/googletest/zip/release-1.5.0.zip
$ unzip release-1.5.0.zip
$ mv googletest-release-1.5.0 gtest
3. Run ./configure to generate the build scripts
$ ./configure
If a dependency is missing, install this:
$ yum -y install gcc-c++
4. Build and install with make
$ make
$ make check
$ make install
5. Create the file /etc/ld.so.conf.d/libprotobuf.conf
$ vim /etc/ld.so.conf.d/libprotobuf.conf
Write the following content into the file and save with :wq
/usr/local/lib
6. Reload the linker configuration
$ sudo ldconfig
7. Verify the installation
$ protoc --version
libprotoc 2.5.0
Install FindBugs and Ant
The versions again follow Hadoop's requirements: findbugs-1.3.9.zip and apache-ant-1.9.4.tar.gz.
The installation process is the same as for the JDK and Maven: extract them into apps and add the environment variables. The profile file now contains the following entries in total:
export MAVEN_HOME=/home/hadoop/apps/apache-maven-3.5.0
export JAVA_HOME=/home/hadoop/apps/jdk1.7.0_80
export ANT_HOME=/home/hadoop/apps/apache-ant-1.9.4
export FINDBUGS_HOME=/home/hadoop/apps/findbugs-1.3.9
PATH=${MAVEN_HOME}/bin:${JAVA_HOME}/bin:${PATH}:${ANT_HOME}/bin:${FINDBUGS_HOME}/bin
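After appending these, reload the profile and do a quick sanity check, for example:
$ source /etc/profile
$ ant -version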
Build Hadoop
This is a rather terrifying process. Before building Hadoop, it's best to do one last thing: edit Maven's settings.xml and add the Aliyun Maven mirror, otherwise the build that follows will be very painful. I've learned this the hard way.
1. Add the Aliyun mirror to Maven
$ vim /home/hadoop/apps/apache-maven-3.5.0/conf/settings.xml
Add the following inside the mirrors element of settings.xml
<mirror>
<id>alimaven</id>
<name>aliyun maven</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
<mirrorOf>central</mirrorOf>
</mirror>
2. Enter the Hadoop source directory and run the Maven build command, as sketched below.
Then comes a long wait. People online say it takes about 30 minutes; I suspect it can take even longer, depending on your network speed and machine.
3. Go into the output directory to have a look; this is what the build produced
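The usual invocation (based on BUILDING.txt's instructions for a binary distribution with native code; adjust the profiles and flags to your needs) looks roughly like this:
$ cd /home/hadoop/resource/hadoop-2.8.1-src
$ mvn clean package -Pdist,native -DskipTests -Dtar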
$ cd /home/hadoop/resource/hadoop-2.8.1-src/hadoop-dist/target
$ ll
total 570704
drwxr-xr-x 2 root root 4096 Sep 7 15:50 antrun
drwxr-xr-x 3 root root 4096 Sep 7 15:50 classes
-rw-r--r-- 1 root root 2118 Sep 7 15:50 dist-layout-stitching.sh
-rw-r--r-- 1 root root 651 Sep 7 15:50 dist-tar-stitching.sh
drwxr-xr-x 9 root root 4096 Sep 7 15:50 hadoop-2.8.1
-rw-r--r-- 1 root root 194397578 Sep 7 15:50 hadoop-2.8.1.tar.gz
-rw-r--r-- 1 root root 30234 Sep 7 15:50 hadoop-dist-2.8.1.jar
-rw-r--r-- 1 root root 389866715 Sep 7 15:50 hadoop-dist-2.8.1-javadoc.jar
-rw-r--r-- 1 root root 27733 Sep 7 15:50 hadoop-dist-2.8.1-sources.jar
-rw-r--r-- 1 root root 27733 Sep 7 15:50 hadoop-dist-2.8.1-test-sources.jar
drwxr-xr-x 2 root root 4096 Sep 7 15:50 javadoc-bundle-options
drwxr-xr-x 2 root root 4096 Sep 7 15:50 maven-archiver
drwxr-xr-x 3 root root 4096 Sep 7 15:50 maven-shared-archive-resources
drwxr-xr-x 3 root root 4096 Sep 7 15:50 test-classes
drwxr-xr-x 2 root root 4096 Sep 7 15:50 test-dir
4. Copy hadoop-2.8.1 to the apps directory and inspect it
$ cp -r /home/hadoop/resource/hadoop-2.8.1-src/hadoop-dist/target/hadoop-2.8.1 /home/hadoop/apps/hadoop-2.8.1
The content of hadoop-2.8.1 is exactly what I need
$ cd /home/hadoop/apps/hadoop-2.8.1
$ ll
total 148
drwxr-xr-x 2 root root 4096 Sep 7 15:50 bin
drwxr-xr-x 3 root root 4096 Sep 7 15:50 etc
drwxr-xr-x 2 root root 4096 Sep 7 15:50 include
drwxr-xr-x 3 root root 4096 Sep 7 15:50 lib
drwxr-xr-x 2 root root 4096 Sep 7 15:50 libexec
-rw-r--r-- 1 root root 99253 Sep 7 15:50 LICENSE.txt
-rw-r--r-- 1 root root 15915 Sep 7 15:50 NOTICE.txt
-rw-r--r-- 1 root root 1366 Sep 7 15:50 README.txt
drwxr-xr-x 2 root root 4096 Sep 7 15:50 sbin
drwxr-xr-x 3 root root 4096 Sep 7 15:50 share
Hadoop pseudo-distributed configuration
1. Create the directory /home/hadoop/data/temp
$ mkdir -p /home/hadoop/data/temp
2. Configure etc/hadoop/core-site.xml:
<configuration>
<!-- Either one of these two properties is enough -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop:8020</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop:8020</value>
</property>
<!-- temp directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/data/temp</value>
</property>
</configuration>
3. Configure etc/hadoop/hadoop-env.sh and set JAVA_HOME to the following path (your own JAVA_HOME):
export JAVA_HOME=/home/hadoop/apps/jdk1.7.0_80
4. Configure hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<!-- Number of replicas; with only one machine this is fixed at 1 (the default is 3) -->
<value>1</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop:9001</value>
</property>
</configuration>
5. Format HDFS (run from the hadoop-2.8.1 directory)
$ ./bin/hdfs namenode -format
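The HDFS commands in the next section assume the NameNode and DataNode are already running; if they are not yet started, bring HDFS up first and check with jps (a minimal sketch, run from the hadoop-2.8.1 directory):
$ ./sbin/start-dfs.sh
$ jps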
Run Hadoop
Create a user directory in HDFS, upload the configuration files as input, and run the bundled grep example:
$ ./bin/hdfs dfs -mkdir -p /user/hadoop
$ ./bin/hdfs dfs -put ./etc/hadoop/*.xml input
$ ./bin/hdfs dfs -ls input
$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'
$ ./bin/hdfs dfs -cat output/*
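To pull the job output back to the local filesystem instead of reading it from HDFS, something like this also works:
$ ./bin/hdfs dfs -get output ./output
$ cat ./output/*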
Configure YARN
1. Rename mapred-site.xml.template
$ mv ./etc/hadoop/mapred-site.xml.template ./etc/hadoop/mapred-site.xml
2. Edit mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
3. Edit etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Start YARN
Start HDFS and YARN in turn
$ ./sbin/start-dfs.sh
$ ./sbin/start-yarn.sh
Start the history server
$ ./sbin/mr-jobhistory-daemon.sh start historyserver
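To confirm everything is running, jps should now list NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, and JobHistoryServer; with the default ports (assuming you haven't changed them) the web UIs are at http://hadoop:50070 (HDFS), http://hadoop:8088 (YARN), and http://hadoop:19888 (job history).
$ jps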
This is an original article by 郭胜凯 (Guo Shengkai); please credit the source when reposting [Github][2].