The company needs a log analysis system, and this big job has finally landed on me! While the excitement lasts, I'm writing down this Hadoop setup process so I can refer back to it later.

Overview

  • HDFS: stores large-scale files
  • MapReduce: processes those files
  • Zookeeper: the distributed coordination service used to build Hadoop clusters
  • HBase: a distributed database
  • Hive: a data warehouse
  • Pig: data-flow processing
  • Mahout: data mining
  • Flume: collects logs and imports them into HBase
  • Sqoop: transfers data between Hadoop and relational databases

Hadoop core components

  • Hadoop Common: the base module that provides common support for the other Hadoop modules
  • Hadoop HDFS: a highly reliable, high-throughput distributed file system, similar in spirit to an ordinary file system such as the one in Windows 7
    • NameNode: holds the metadata, mainly file names, replica counts, and block storage locations
    • DataNode: stores the data itself; when HDFS stores a large file it splits it into blocks and keeps multiple replicas of each block
  • Hadoop MapReduce: a distributed, offline, parallel computing framework (a small shell illustration follows this list)
    • Map
      • 1. Read the input file and parse it into key/value pairs, where the key is the byte offset of a line and the value is the line's content
      • 2. Override the map method with your own business logic and emit new key/value pairs
      • 3. Partition the emitted key/value pairs
      • 4. Sort and group the data by key, so that all values with the same key end up together
    • Reduce:
      • 1. Copy the output of the Map tasks, partition by partition, to the corresponding reduce nodes
      • 2. Merge and sort the output of the Map tasks, then write a reduce function with your own logic that turns each input key and its values into new key/value pairs for the final output
  • Hadoop YARN: the newer MapReduce framework, responsible for job scheduling and resource management. It consists of the ResourceManager, which allocates resources across the cluster, and the NodeManager, which manages its own machine's resources and runs the tasks.
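To make the Map and Reduce roles a bit more concrete: the streaming jar shipped in the distribution we build later in this post lets ordinary shell programs act as the mapper and reducer. This is only a hedged sketch that assumes the hadoop-2.8.1 directory from the end of this article and an existing input directory in HDFS; cat plays the mapper and wc the reducer:

$ ./bin/hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.8.1.jar \
      -input input -output streaming-output \
      -mapper /bin/cat -reducer /usr/bin/wc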

Basic environment setup

1. The Hadoop server runs 32-bit CentOS 6. First make sure the local machine and the server can reach each other, and pin the Linux box to the fixed IP 192.168.11.128 (a sketch of the static-IP config follows).
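For reference, here is a minimal sketch of the static-IP configuration on CentOS 6, assuming the interface is eth0; the GATEWAY and DNS1 values are placeholders for a typical VMware NAT setup and must be adjusted to your own network:

$ vim /etc/sysconfig/network-scripts/ifcfg-eth0

DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.11.128
NETMASK=255.255.255.0
# GATEWAY and DNS1 below are assumptions; change them to match your network
GATEWAY=192.168.11.2
DNS1=192.168.11.2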

2. Restart the network service

$ service network restart

3. Once it has restarted, check the network status with ifconfig

$ ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0C:29:EE:EA:B4  
          inet addr:192.168.11.128  Bcast:192.168.11.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feee:eab4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:265653 errors:0 dropped:0 overruns:0 frame:0
          TX packets:151108 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:311155544 (296.7 MiB)  TX bytes:22442571 (21.4 MiB)
          Interrupt:19 Base address:0x2024 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:21033 errors:0 dropped:0 overruns:0 frame:0
          TX packets:21033 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:7065599 (6.7 MiB)  TX bytes:7065599 (6.7 MiB)

4. Turn off the firewall

$ service iptables stop

5. Keep the firewall from coming back up after a reboot (disable it permanently)

$ chkconfig iptables off

6. Disable the SELinux subsystem: edit /etc/sysconfig/selinux and change 'SELINUX=enforcing' to 'SELINUX=disabled', then save with :wq

$ vim /etc/sysconfig/selinux

After the change the file looks like this (press Esc, then type ":wq" to save):

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
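If you prefer not to open an editor, the same change can be made with a one-liner (a sketch, assuming the file still contains the default SELINUX=enforcing line); setenforce 0 additionally turns SELinux off for the current session without waiting for a reboot:

$ sed -i 's/^SELINUX=enforcing$/SELINUX=disabled/' /etc/sysconfig/selinux
$ setenforce 0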

7. Change the hostname to hadoop
You can apply it immediately with the hostname command

$ hostname hadoop

To make the change permanent, edit /etc/sysconfig/network, set HOSTNAME, and save, like this:

$ vim /etc/sysconfig/network

NETWORKING=yes
HOSTNAME=hadoop

8. Configure the IP mapping: edit the hosts file and map the server's address to the name hadoop

$ vim /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.11.128  hadoop
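A quick way to confirm the mapping is to ping the new name; the replies should come from 192.168.11.128:

$ ping -c 3 hadoop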

9. Reboot the machine

$ reboot

Configure passwordless SSH login

With everything above in place, the server (192.168.11.128) can be reached directly by the hostname hadoop, so we can now log in over ssh.

1. Log in with ssh and enter the password, then exit. This automatically creates a .ssh directory under the user's home directory.

$ ssh hadoop
root@hadoop 's password:
Last login: Fri Sep 8 06:56:17 2017 from 192.168.11.1
$ exit
logout
Connection to hadoop closed.
[root@hadoop hadoop-2.8.1]# 

2. Generate an SSH key pair; just press Enter three times at the prompts

$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
/root/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
d4:e4:d8:4f:9f:b5:46:bb:b6:7c:97:e1:7f:f6:9d:f4 root@hadoop
The key's randomart image is:
+--[ RSA 2048]----+
|          .      |
|         *       |
|        o + .  ..|
|       .   o ..oo|
|        S   . o+ |
|              ...|
|              .+o|
|              +oX|
|               =E|
+-----------------+
$ 

3. Use ssh-copy-id to copy the public key to each machine that should allow passwordless login (here that's just this single machine). You'll be asked for the target machine's password.

$ ssh-copy-id hadoop
root@hadoop's password: 
Now try logging into the machine, with "ssh 'hadoop'", and check in:

  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.

[root@hadoop hadoop-2.8.1]# 
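If ssh-copy-id is not available for some reason, appending the public key by hand does the same job (a sketch, assuming the default key path ~/.ssh/id_rsa.pub):

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys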

4. Reconnect over ssh; if you are no longer prompted for a password, the configuration is correct

$ ssh hadoop
Last login: Fri Sep  8 06:56:17 2017 from 192.168.11.1
[root@hadoop ~]# exit
logout
Connection to hadoop closed.
[root@hadoop hadoop-2.8.1]# 

Pin down the build environment

Before installing Hadoop, first decide which version to install, then configure the environment that version requires. I strongly advise against installing the officially pre-built binaries directly; you can run into all sorts of problems with them, so build it yourself if at all possible.

1. Download Hadoop. The official source download page is http://hadoop.apache.org/releases.html

2. Create a few directories we'll need

$ mkdir /home/hadoop
$ mkdir /home/hadoop/resource
$ mkdir /home/hadoop/apps
$ ll /home/hadoop/
total 12
drwxr-xr-x. 10 root root 4096 Sep  7 16:00 apps
drwxr-xr-x.  3 root root 4096 Sep  7 15:21 resource
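As a small convenience, the three mkdir calls can be collapsed into one with brace expansion (-p also creates the parent directory):

$ mkdir -p /home/hadoop/{resource,apps}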

3. Upload the downloaded hadoop-2.8.1-src.tar.gz to /home/hadoop/resource

4. Unpack it and enter the hadoop-2.8.1-src directory. You can see the rough layout of the source tree; BUILDING.txt describes the environment Hadoop needs for the build.

$ tar -zxf hadoop-2.8.1-src.tar.gz
$ cd hadoop-2.8.1-src
$ ll
total 220
-rw-rw-r-- 1 root root 15623 May 23 16:14 BUILDING.txt
drwxr-xr-x 4 root root 4096 Sep 1 09:14 dev-support
drwxr-xr-x 4 root root 4096 Sep 7 15:37 hadoop-assemblies
drwxr-xr-x 4 root root 4096 Sep 7 15:37 hadoop-build-tools
drwxrwxr-x 3 root root 4096 Sep 7 15:49 hadoop-client
drwxr-xr-x 11 root root 4096 Sep 7 15:40 hadoop-common-project
drwxr-xr-x 3 root root 4096 Sep 7 15:50 hadoop-dist
drwxr-xr-x 9 root root 4096 Sep 7 15:44 hadoop-hdfs-project
drwxr-xr-x 10 root root 4096 Sep 7 15:47 hadoop-mapreduce-project
drwxr-xr-x 4 root root 4096 Sep 7 15:37 hadoop-maven-plugins
drwxr-xr-x 3 root root 4096 Sep 7 15:49 hadoop-minicluster
drwxr-xr-x 4 root root 4096 Sep 7 15:37 hadoop-project
drwxr-xr-x 3 root root 4096 Sep 7 15:37 hadoop-project-dist
drwxr-xr-x 19 root root 4096 Sep 7 15:50 hadoop-tools
drwxr-xr-x 4 root root 4096 Sep 7 15:46 hadoop-yarn-project
-rw-rw-r-- 1 root root 99253 May 23 16:14 LICENSE.txt
-rw-rw-r-- 1 root root 15915 May 23 16:14 NOTICE.txt
drwxrwxr-x 2 root root 4096 Jun 1 23:24 patchprocess
-rw-rw-r-- 1 root root 20477 May 28 15:36 pom.xml
-rw-r--r-- 1 root root 1366 May 19 22:30 README.txt
-rwxrwxr-x 1 root root 1841 May 23 16:14 start-build-env.sh
[root@hadoop hadoop-2.8.1-src]# 

5. Check the build requirements

$ vim BUILDING.txt

Below is the relevant part of BUILDING.txt: it requires a Unix system, JDK 1.7+, Maven 3.0 or later, and so on.

Build instructions for Hadoop

----------------------------------------------------------------------------------
Requirements:

* Unix System
* JDK 1.7+
* Maven 3.0 or later
* Findbugs 1.3.9 (if running findbugs)
* ProtocolBuffer 2.5.0
* CMake 2.6 or newer (if compiling native code), must be 3.0 or newer on Mac
* Zlib devel (if compiling native code)
* openssl devel (if compiling native hadoop-pipes and to get the best HDFS encryption performance)
* Linux FUSE (Filesystem in Userspace) version 2.6 or above (if compiling fuse_dfs)
* Internet connection for first build (to fetch all Maven and Hadoop dependencies)

----------------------------------------------------------------------------------
The easiest way to get an environment with all the appropriate tools is by means
of the provided Docker config.
This requires a recent version of docker (1.4.1 and higher are known to work).

On Linux:
    Install Docker and run this command:

Set up the JDK

1. Download JDK 1.7 as required, upload jdk-7u80-linux-i586.tar.gz, and unpack it into the apps directory

$ cd /home/hadoop/apps
$ tar -zxf ../resource/jdk-7u80-linux-i586.tar.gz

2. Enter the unpacked jdk1.7.0_80 directory and copy its path; pwd prints the full path of the current directory

$ cd jdk1.7.0_80/
$ pwd
/home/hadoop/apps/jdk1.7.0_80

3. Edit the profile file and add the copied JDK path to the PATH variable

$ vim /etc/profile

Append the following at the end of profile. Note that, unlike on Windows, the separator in the second line is ':' (a colon), not ';' (a semicolon).

export JAVA_HOME=/home/hadoop/apps/jdk1.7.0_80
PATH=${JAVA_HOME}/bin:${PATH}

4. Reload the environment variables with source

$ source /etc/profile
$ 

5. Verify that the Java environment is set up correctly

$ java -version
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) Client VM (build 24.80-b11, mixed mode)
[root@hadoop jdk1.7.0_80]# 

A side note: if you accidentally break the profile file, source will fail, and what's worse, the PATH variable can end up empty, so apart from built-ins like cd almost no command will run. In that case you can use a command like the one below to temporarily put the common command directories back on PATH. It is only temporary and is lost on reboot, so restore profile to its original state right away and then repeat step 3.
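The command I mean is roughly the following; it simply puts the standard binary directories back on PATH for the current shell session (the exact list is my assumption, add whatever else you need):

$ export PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin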

Set up Maven

This is much the same as the Java setup: just download Maven. I actually used Maven 3.5 while Hadoop asks for 3.0; that caused no problems for me, but if the build runs into trouble you can switch back to 3.0.

Simply unpack apache-maven-3.5.0-bin.tar.gz into the apps directory and then extend PATH; append the following at the end of profile:

export MAVEN_HOME=/home/hadoop/apps/apache-maven-3.5.0
# Note: don't add a new PATH=... line for every variable; instead append ${NEW_HOME}/bin: to the existing PATH line
PATH=${MAVEN_HOME}/bin:${JAVA_HOME}/bin:${PATH}

Verify that Maven is installed correctly

$ mvn -v
Apache Maven 3.5.0 (ff8f5e7444045639af65f6095c62210b5713f426; 2017-04-03T12:39:06-07:00)
Maven home: /home/hadoop/apps/apache-maven-3.5.0
Java version: 1.7.0_80, vendor: Oracle Corporation
Java home: /home/hadoop/apps/jdk1.7.0_80/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-358.el6.i686", arch: "i386", family: "unix"

Assorted dependencies

These are required for the steps below; just install the whole pile with yum:

$ yum -y install svn ncurses-devel gcc*
$ yum -y install lzo-devel zlib-devel autoconf automake libtool cmake openssl-devel

Install CMake

1. Download cmake-3.0.2.tar.gz into the resource directory; the download page is http://www.cmake.org/download/

$ wget https://cmake.org/download/cmake-3.0.2.tar.gz

2. Unpack cmake-3.0.2.tar.gz into the apps directory

$ cd ../apps
$ tar -zxf ../resource/cmake-3.0.2.tar.gz

3. Enter the unpacked apps/cmake-3.0.2 directory, run the bootstrap script, then build and install with make

$ cd /home/hadoop/apps/cmake-3.0.2
$ ./bootstrap
$ make
$ make install

4. Verify the installation

$ cmake --version
cmake version 2.8.12.2
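Note that the version printed above is still the old system cmake pulled in by yum. The freshly built 3.0.2 normally lands in /usr/local/bin (an assumption based on the default bootstrap prefix), so if the old version keeps showing up, clear the shell's command cache or call it by its full path:

$ hash -r
$ /usr/local/bin/cmake --version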

Install ProtocolBuffer

Here I used ProtocolBuffer 2.5.0, which I simply downloaded from CSDN.

1. After downloading, unpack it into the apps directory

$ cd /home/hadoop/apps
$ tar -zxf /home/hadoop/protobuf-2.5.0.tar.gz
$ cd protobuf-2.5.0
$ ls
aclocal.m4    config.h.in    configure.ac      examples                      java       Makefile.am          protobuf.pc     stamp-h1
autogen.sh    config.log     CONTRIBUTORS.txt  generate_descriptor_proto.sh  libtool    Makefile.in          protobuf.pc.in  vsprojects
CHANGES.txt   config.status  COPYING.txt       gtest                         ltmain.sh  missing              python
config.guess  config.sub     depcomp           install-sh                    m4         protobuf-lite.pc     README.txt
config.h      configure      editors           INSTALL.txt                   Makefile   protobuf-lite.pc.in  src

2. Next, add the googletest dependency for ProtocolBuffer. The build can normally fetch it automatically, but from behind the Great Firewall that doesn't work, so it has to be added by hand; if you are outside the wall, skip this and jump straight to step 3. Download it from Google's GitHub:

$ wget https://codeload.github.com/google/googletest/zip/release-1.5.0.zip
$ unzip release-1.5.0.zip
$ mv googletest-release-1.5.0 gtest

3. Run ./configure to generate the build scripts

$ ./configure

If it complains about missing dependencies, install this:

$ yum -y install gcc-c++

4. Build and install with make

$ make
$ make check
$ make install

5. Create the file /etc/ld.so.conf.d/libprotobuf.conf

$ vim /etc/ld.so.conf.d/libprotobuf.conf 

Write the following into the file and save with :wq

/usr/local/lib
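The same file can also be written in one go without opening vim (a sketch):

$ echo "/usr/local/lib" | sudo tee /etc/ld.so.conf.d/libprotobuf.conf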

6. Reload the dynamic linker configuration

$ sudo ldconfig

7. Verify the installation

$ protoc --version
libprotoc 2.5.0

Install FindBugs and Ant

The versions again follow Hadoop's requirements: findbugs-1.3.9.zip and apache-ant-1.9.4.tar.gz.

The procedure is the same as for the JDK and Maven: unpack into apps, then add the environment variables. At this point the profile file contains everything listed below (an extraction sketch follows the block).

export MAVEN_HOME=/home/hadoop/apps/apache-maven-3.5.0
export JAVA_HOME=/home/hadoop/apps/jdk1.7.0_80
export ANT_HOME=/home/hadoop/apps/apache-ant-1.9.4
export FINDBUGS_HOME=/home/hadoop/apps/findbugs-1.3.9

PATH=${MAVEN_HOME}/bin:${JAVA_HOME}/bin:${PATH}:${ANT_HOME}/bin:${FINDBUGS_HOME}/bin
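For completeness, the extraction steps are along these lines, assuming both archives were uploaded to /home/hadoop/resource:

$ cd /home/hadoop/apps
$ tar -zxf ../resource/apache-ant-1.9.4.tar.gz
$ unzip ../resource/findbugs-1.3.9.zip
$ source /etc/profile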

Build Hadoop

This is a rather brutal process. Before building Hadoop, it's worth doing one last thing: edit Maven's settings.xml and add the Aliyun Maven mirror, otherwise the build that follows will be very painful; I learned that the hard way.

1. Add the Aliyun repository to Maven

$ vim /home/hadoop/apps/apache-maven-3.5.0/conf/settings.xml

Add the following inside the mirrors element of settings.xml:

<mirror>
    <id>alimaven</id>
    <name>aliyun maven</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <mirrorOf>central</mirrorOf>
</mirror>

2. Enter the Hadoop source directory and run the Maven build command (see the sketch below).
Then comes the long wait. People online say it takes about 30 minutes; I suspect it can take even longer, depending on network speed and machine specs.
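The build command itself is essentially the one BUILDING.txt gives for a binary distribution with native code; I skipped the tests and the docs profile, so treat the exact flags as my choice rather than the only option:

$ cd /home/hadoop/resource/hadoop-2.8.1-src
$ mvn clean package -Pdist,native -DskipTests -Dtar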

3. Go into the output directory and take a look; this is what the build produced

$ cd /home/hadoop/resource/hadoop-2.8.1-src/hadoop-dist/target
$ ll
total 570704
drwxr-xr-x 2 root root      4096 Sep  7 15:50 antrun
drwxr-xr-x 3 root root      4096 Sep  7 15:50 classes
-rw-r--r-- 1 root root      2118 Sep  7 15:50 dist-layout-stitching.sh
-rw-r--r-- 1 root root       651 Sep  7 15:50 dist-tar-stitching.sh
drwxr-xr-x 9 root root      4096 Sep  7 15:50 hadoop-2.8.1
-rw-r--r-- 1 root root 194397578 Sep  7 15:50 hadoop-2.8.1.tar.gz
-rw-r--r-- 1 root root     30234 Sep  7 15:50 hadoop-dist-2.8.1.jar
-rw-r--r-- 1 root root 389866715 Sep  7 15:50 hadoop-dist-2.8.1-javadoc.jar
-rw-r--r-- 1 root root     27733 Sep  7 15:50 hadoop-dist-2.8.1-sources.jar
-rw-r--r-- 1 root root     27733 Sep  7 15:50 hadoop-dist-2.8.1-test-sources.jar
drwxr-xr-x 2 root root      4096 Sep  7 15:50 javadoc-bundle-options
drwxr-xr-x 2 root root      4096 Sep  7 15:50 maven-archiver
drwxr-xr-x 3 root root      4096 Sep  7 15:50 maven-shared-archive-resources
drwxr-xr-x 3 root root      4096 Sep  7 15:50 test-classes
drwxr-xr-x 2 root root      4096 Sep  7 15:50 test-dir

4. Copy hadoop-2.8.1 into the apps directory and check it

$ cp -r /home/hadoop/resource/hadoop-2.8.1-src/hadoop-dist/target/hadoop-2.8.1 /home/hadoop/apps/hadoop-2.8.1

The contents of hadoop-2.8.1 are exactly what we need:

$ cd /home/hadoop/apps/hadoop-2.8.1
$ ll
total 148
drwxr-xr-x 2 root root  4096 Sep  7 15:50 bin
drwxr-xr-x 3 root root  4096 Sep  7 15:50 etc
drwxr-xr-x 2 root root  4096 Sep  7 15:50 include
drwxr-xr-x 3 root root  4096 Sep  7 15:50 lib
drwxr-xr-x 2 root root  4096 Sep  7 15:50 libexec
-rw-r--r-- 1 root root 99253 Sep  7 15:50 LICENSE.txt
-rw-r--r-- 1 root root 15915 Sep  7 15:50 NOTICE.txt
-rw-r--r-- 1 root root  1366 Sep  7 15:50 README.txt
drwxr-xr-x 2 root root  4096 Sep  7 15:50 sbin
drwxr-xr-x 3 root root  4096 Sep  7 15:50 share

Hadoop pseudo-distributed configuration

1. Create the directory /home/hadoop/data/temp

$ mkdir -p /home/hadoop/data/temp

2. Configure etc/hadoop/core-site.xml:

<configuration>
    <!-- Only one of these two is needed; fs.defaultFS is the current name and fs.default.name is the deprecated alias -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop:8020</value>
    </property>
    <property>  
        <name>fs.default.name</name>  
        <value>hdfs://hadoop:8020</value>  
    </property>
    <!-- temp directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/data/temp</value>
    </property>
</configuration>

3. Configure etc/hadoop/hadoop-env.sh and set JAVA_HOME to your JDK path:

export JAVA_HOME=/home/hadoop/apps/jdk1.7.0_80

4. Configure etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <!-- Number of replicas; with only one machine this is fixed at 1 (the default is 3) -->
        <value>1</value>
    </property>
  <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop:9001</value>
  </property>
</configuration>

5. Format HDFS (run from the hadoop-2.8.1 directory)

$ ./bin/hdfs namenode -format

Run Hadoop

Create a user directory in HDFS, upload the configuration files into an input directory, run the bundled grep example, and print the result:
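One note from my own runs before the commands below: the relative path input resolves to the current user's HDFS home directory (/user/root when running as root), and -put with several source files requires the destination directory to exist, so it is safer to create it first (a hedged extra step):

$ ./bin/hdfs dfs -mkdir -p input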

$ ./bin/hdfs dfs -mkdir -p /user/hadoop
$ ./bin/hdfs dfs -put ./etc/hadoop/*.xml input
$ ./bin/hdfs dfs -ls input
$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'
$ ./bin/hdfs dfs -cat output/*

Configure YARN

1. Rename mapred-site.xml.template

$ mv ./etc/hadoop/mapred-site.xml.template ./etc/hadoop/mapred-site.xml

2. Edit mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

3. Edit etc/hadoop/yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Start YARN

Start HDFS and YARN in turn

$ ./sbin/start-dfs.sh
$ ./sbin/start-yarn.sh
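If everything started cleanly, jps should list the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager processes; the NameNode web UI on port 50070 and the ResourceManager UI on port 8088 are another quick sanity check:

$ jps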

Start the job history server

$ ./sbin/mr-jobhistory-daemon.sh start historyserver

This is an original article by 郭胜凯; please credit the source [Github][2] when reposting.
