Data Mining and Machine Learning

L1. Information Retrieval
(Focus on Text Retrieval) Definition: finding, within a large collection of text, the information most relevant to a query.

Chapter 1

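Moore’s Law: Technology performance doubles and prices halve every 18 months.
Disk capacity: 10^6 bytes = 1 MB, 10^9 bytes = 1 GB, 10^12 bytes = 1 TB.
What can you store on 1 TB?
Assume an audio file sampled at 16 kHz, with each sample stored in 16 bits.
One second of audio then takes 16000*16 bits;
1 TB = 10^12 bytes = 8*10^12 bits.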

The calculation in Python is as follows:

Python 3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:37:50) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 16000*16
>>> b = 10**12*8
>>> days = b/(a*60*60*24)
>>> print(days)
361.68981481481484
>>>

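In other words, 1 TB of storage can hold about 362 days of the audio described above.
At the PB scale (1 PB = 1000 TB), the capacity becomes 362*1000 days,
which is nearly a thousand years.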
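The same arithmetic can be wrapped in a small function to check the TB and PB figures together. This is only a sketch: the name storage_days and its parameters are made up for illustration, and it assumes uncompressed single-channel audio and 365-day years.

```python
def storage_days(capacity_bytes, sample_rate_hz=16_000, bits_per_sample=16):
    """Days of uncompressed mono audio that fit in capacity_bytes (hypothetical helper)."""
    bits_per_second = sample_rate_hz * bits_per_sample
    seconds = capacity_bytes * 8 / bits_per_second  # 8 bits per byte
    return seconds / (60 * 60 * 24)

TB = 10**12  # bytes
PB = 10**15  # bytes, i.e. 1000 TB

days_tb = storage_days(TB)
years_pb = storage_days(PB) / 365  # assumes 365-day years

print(round(days_tb), "days per TB")
print(round(years_pb), "years per PB")
```

Changing sample_rate_hz or bits_per_sample shows how quickly the estimate shrinks for higher-quality audio.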
Google, in my view the greatest company in the world today, bar none, processes 20 PB of data every day.

2. Why do we store these big corpora? It is a huge expense.
Because we believe in the potential value of the data.
What is the problem?
It is a problem of semantics.
Humans can resolve semantic ambiguity, but computers cannot.
For example:
I saw the man on the hill with the telescope.
(What is the real meaning?)

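With a basic understanding of IR (Information Retrieval), let's look at the difference between IR and database retrieval.

Database retrieval has the following characteristics:
First. Data characteristics: the data has attributes, structure, and relationships between items, and queries resemble the way a person would ask.
Second. Queries are structured and strict.
Third. The data must be updated promptly.
Fourth. A specific query returns a specific result.
The characteristics of IR differ from those of database retrieval.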

Definitions:
Mining: Digging deep into the earth to find hidden, valuable materials.

Data Mining: Analysis of large data corpora, that is, corpora which are too large for human inspection.

Information Retrieval Components
Documents: Identify words which are ‘important’ for discriminating between documents, and how important they are.
Index: Specifies the relationships between these ‘keywords’ and the documents.
The query
Matching: Measuring the similarity between the query and each document.
Retrieved documents
Assessment and Relevance Feedback.
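
The components above can be sketched as a toy keyword index. This is a minimal illustration, not a real IR system: the corpus, the stop-word list, and the function names (tokenize, build_index, search) are all assumptions made for the example, and the similarity measure is simply the summed count of query keywords in each document.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; the IDs and texts are made up for illustration.
DOCS = {
    "d1": "the man on the hill",
    "d2": "the man with the telescope",
    "d3": "data mining of large corpora",
}

# Assumed stop-word list: common words that do not discriminate between documents.
STOP_WORDS = {"the", "on", "with", "of"}

def tokenize(text):
    """Documents: lowercase, split on whitespace, keep only the 'important' words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_index(docs):
    """Index: map each keyword to {doc_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][doc_id] = count
    return index

def search(index, query):
    """Matching: score each document by the summed counts of the query keywords."""
    scores = Counter()
    for word in tokenize(query):
        for doc_id, count in index.get(word, {}).items():
            scores[doc_id] += count
    # Retrieved documents, best match first.
    return scores.most_common()

INDEX = build_index(DOCS)
results = search(INDEX, "man telescope")
print(results)
```

Here d2 ranks above d1 because it contains both query keywords, while d3 is not retrieved at all. Assessment and relevance feedback are left out of the sketch.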

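Structure within a piece of text:
Words (keywords; some words are more important than others)
Sentences (Grammar/Syntax): words arranged in a particular order, which aids understanding and helps resolve ambiguity.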
