Elasticsearch Study Notes, Part 11: The Analyze API and the IK Analyzer


灵动的艺术 · Published 2018-10-19 15:24:03

Analyze API

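The analyze API shows how a piece of full text is broken into tokens by an analyzer. In the request body we specify the analyzer to use and the text to analyze: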

GET /_analyze
{
  "analyzer": "standard",
  "text": "歌唱我们亲爱的祖国从今走向走向繁荣富强 "
}

The analysis result is as follows:

{
  "tokens": [
    {
      "token": "歌",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "唱",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "我",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "们",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "亲",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "爱",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "的",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "祖",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    },
    {
      "token": "国",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<IDEOGRAPHIC>",
      "position": 8
    },
    {
      "token": "从",
      "start_offset": 9,
      "end_offset": 10,
      "type": "<IDEOGRAPHIC>",
      "position": 9
    },
    {
      "token": "今",
      "start_offset": 10,
      "end_offset": 11,
      "type": "<IDEOGRAPHIC>",
      "position": 10
    },
    {
      "token": "走",
      "start_offset": 11,
      "end_offset": 12,
      "type": "<IDEOGRAPHIC>",
      "position": 11
    },
    {
      "token": "向",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<IDEOGRAPHIC>",
      "position": 12
    },
    {
      "token": "走",
      "start_offset": 13,
      "end_offset": 14,
      "type": "<IDEOGRAPHIC>",
      "position": 13
    },
    {
      "token": "向",
      "start_offset": 14,
      "end_offset": 15,
      "type": "<IDEOGRAPHIC>",
      "position": 14
    },
    {
      "token": "繁",
      "start_offset": 15,
      "end_offset": 16,
      "type": "<IDEOGRAPHIC>",
      "position": 15
    },
    {
      "token": "荣",
      "start_offset": 16,
      "end_offset": 17,
      "type": "<IDEOGRAPHIC>",
      "position": 16
    },
    {
      "token": "富",
      "start_offset": 17,
      "end_offset": 18,
      "type": "<IDEOGRAPHIC>",
      "position": 17
    },
    {
      "token": "强",
      "start_offset": 18,
      "end_offset": 19,
      "type": "<IDEOGRAPHIC>",
      "position": 18
    }
  ]
}

This is clearly not the result we want: every Chinese character has become a separate token.
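
The same request can also be sent from a script. The following is a minimal sketch in Python using only the standard library. The base URL `http://localhost:9200` and the function names are assumptions for illustration (a local node without authentication), not part of the original notes.

```python
import json
import urllib.request

# Assumption: a local single-node cluster without authentication.
DEFAULT_BASE_URL = "http://localhost:9200"


def build_analyze_request(analyzer, text, base_url=DEFAULT_BASE_URL):
    """Build (but do not send) an HTTP request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # _analyze accepts both GET and POST
    )


def token_strings(response):
    """Extract just the token text from a parsed _analyze response."""
    return [entry["token"] for entry in response["tokens"]]


def analyze(analyzer, text, base_url=DEFAULT_BASE_URL):
    """Send the request and return the list of token strings."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as reply:
        return token_strings(json.load(reply))
```

Calling `analyze("standard", "歌唱我们亲爱的祖国")` against a running node should return one string per character, matching the response shown above. Building the request is kept separate from sending it so that the request can be inspected without a cluster.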

IK Analyzer

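As the example above shows, the default standard analyzer in Elasticsearch splits Chinese text into individual characters instead of words. That is not what we want, so we need a Chinese analysis plugin for Elasticsearch, elasticsearch-analysis-ik (IK for short), to solve the problem.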

Component	Supported names
Analyzer	ik_smart, ik_max_word
Tokenizer	ik_smart, ik_max_word
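
The two names differ in granularity: according to the IK project's README, ik_max_word produces the finest-grained split and ik_smart the coarsest. A pairing shown there is ik_max_word at index time and ik_smart at search time. The sketch below builds such a field definition in Python. The field name `content` and the helper name are hypothetical, and how the `properties` fragment is wrapped in a full mapping depends on your Elasticsearch version.

```python
import json


def ik_text_field(index_analyzer="ik_max_word", search_analyzer="ik_smart"):
    """Return a mapping definition for one text field analyzed with IK."""
    return {
        "type": "text",
        "analyzer": index_analyzer,
        "search_analyzer": search_analyzer,
    }


# "content" is a hypothetical field name used only for illustration.
properties = {"properties": {"content": ik_text_field()}}
print(json.dumps(properties, indent=2))
```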

IK Version Compatibility

IK version	ES version
master	6.x -> master
6.3.0	6.3.0
6.2.4	6.2.4
6.1.3	6.1.3
5.6.8	5.6.8
5.5.3	5.5.3

Installation

Download or Build

Option 1

Download the version you need from:
https://github.com/medcl/elasticsearch-analysis-ik/releases
Create the installation directory:

cd your-es-root/plugins/ && mkdir ik

Unzip the downloaded zip package into your-es-root/plugins/ik.

Option 2

Install using the elasticsearch-plugin command:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
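
Because the plugin version has to match the Elasticsearch version (see the compatibility table above), the download URL changes with each release. The sketch below derives it from the pattern of the 6.3.0 URL in the command above. The function name is hypothetical, and the code does not check that a release for the given version actually exists.

```python
import re

# URL pattern inferred from the 6.3.0 example; other versions are assumed
# to follow the same layout.
RELEASE_URL = (
    "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/"
    "v{version}/elasticsearch-analysis-ik-{version}.zip"
)


def ik_release_url(es_version):
    """Return the IK release URL for an exact version such as '6.3.0'."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", es_version):
        raise ValueError("expected a version like '6.3.0', got %r" % es_version)
    return RELEASE_URL.format(version=es_version)


print(ik_release_url("6.3.0"))
```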

Restart Elasticsearch

./bin/elasticsearch -d
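
After the restart, one way to confirm that the plugin was loaded is the `_cat/plugins` API, for example `GET /_cat/plugins?format=json`. The sketch below filters such a response in Python. It assumes each row is an object with `name`, `component`, and `version` keys and that IK registers under the component name `analysis-ik`. The sample rows are made up for illustration.

```python
def nodes_with_plugin(cat_plugins_rows, component="analysis-ik"):
    """Return the sorted names of nodes that report the given plugin."""
    return sorted(
        row["name"]
        for row in cat_plugins_rows
        if row.get("component") == component
    )


# Hypothetical sample rows, not output from a real cluster.
sample_rows = [
    {"name": "node-1", "component": "analysis-ik", "version": "6.3.0"},
    {"name": "node-2", "component": "analysis-icu", "version": "6.3.0"},
]
print(nodes_with_plugin(sample_rows))
```

An empty list for a real response would suggest that the plugin directory or version is wrong on every node.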

IK Analyzer Results

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "歌唱我们亲爱的祖国从今走向走向繁荣富强"
}

The result is as follows:

{
  "tokens": [
    {
      "token": "歌唱",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "我们",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "亲爱的",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "祖国",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "从今",
      "start_offset": 9,
      "end_offset": 11,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "走向",
      "start_offset": 11,
      "end_offset": 13,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "走向",
      "start_offset": 13,
      "end_offset": 15,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "繁荣富强",
      "start_offset": 15,
      "end_offset": 19,
      "type": "CN_WORD",
      "position": 7
    }
  ]
}
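
The offsets in this response can be checked against the input text. The short Python sketch below verifies that each token equals the slice of the text between its `start_offset` and `end_offset`, and that the tokens tile the text without gaps. The helper names are hypothetical. Python string indices count code points, which agrees with the offsets here because every character in the sample is in the Basic Multilingual Plane. Analyzers that lowercase or otherwise transform tokens would not pass the slice check in general.

```python
TEXT = "歌唱我们亲爱的祖国从今走向走向繁荣富强"

# (token, start_offset, end_offset) copied from the ik_smart response above.
IK_SMART_TOKENS = [
    ("歌唱", 0, 2),
    ("我们", 2, 4),
    ("亲爱的", 4, 7),
    ("祖国", 7, 9),
    ("从今", 9, 11),
    ("走向", 11, 13),
    ("走向", 13, 15),
    ("繁荣富强", 15, 19),
]


def mismatched_tokens(text, tokens):
    """Return the tokens whose offsets do not slice back to the token text."""
    return [(tok, s, e) for tok, s, e in tokens if text[s:e] != tok]


def covers_text(text, tokens):
    """True if the tokens, in order, tile the text with no gaps or overlaps."""
    position = 0
    for _, start, end in tokens:
        if start != position:
            return False
        position = end
    return position == len(text)


print(mismatched_tokens(TEXT, IK_SMART_TOKENS))  # → []
print(covers_text(TEXT, IK_SMART_TOKENS))        # → True
```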