Data Mining & Machine Learning Study Notes: Introductory Machine Learning Notes (1)
Data Mining and Machine Learning
L1. Information Retrieval
(Focus on Text Retrieval) Definition: finding, in a large collection of text, the information most relevant to a query.
Chapter 1
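Moore’s Law: Technology performance doubles and prices halve every 18 months.
Disk capacity: 10^6 bytes = 1 MB, 10^9 bytes = 1 GB, 10^12 bytes = 1 TB.
What can you store on 1 TB?
Assume an audio file sampled at 16 kHz, with each sample stored in 16 bits.
One second of audio therefore takes 16000*16 bits.
1 TB = 10^12 bytes, i.e. 10^12*8 bits.
The calculation in Python: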
Python 3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:37:50) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 16000*16
>>> b = 10**12*8
>>> days = b/(a*60*60*24)
>>> print(days)
361.68981481481484
>>>
In other words, 1 TB of storage holds about 362 days of such audio.
At petabyte (PB) scale the capacity is 362*1000 days,
which is nearly a thousand years.
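The same arithmetic can be wrapped in a small function so that other capacities or audio formats are easy to try. This is only a sketch, assuming uncompressed single-channel audio; the name `storage_days` and its parameters are illustrative and do not come from the lecture.

```python
def storage_days(capacity_bytes, sample_rate_hz=16_000, bits_per_sample=16):
    """Days of uncompressed mono audio that fit in capacity_bytes (assumed format)."""
    bits_per_second = sample_rate_hz * bits_per_sample
    capacity_bits = capacity_bytes * 8
    seconds = capacity_bits / bits_per_second
    return seconds / (60 * 60 * 24)


TB = 10**12  # bytes, decimal units as in the notes
PB = 10**15  # bytes

tb_days = storage_days(TB)
pb_years = storage_days(PB) / 365  # 365-day years, ignoring leap years

print(f"1 TB holds about {tb_days:.0f} days of audio")
print(f"1 PB holds about {pb_years:.0f} years of audio")
```

With these assumptions the TB figure matches the interpreter session above, and the PB figure comes out a little under 1000 years.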
Google, currently the greatest company in the world bar none, processes 20 PB of data every day.
2. Why do we store these big corpora? It is a huge expense.
Because we believe in the latent value of the data.
What are the problems?
This is a problem in semantics.
Humans can resolve semantic ambiguity, but computers cannot.
For example:
I saw the man on the hill with the telescope.
(What’s the real)
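With a basic picture of IR (Information Retrieval) in place, consider how IR differs from Database Retrieval.
Database Retrieval has the following characteristics:
First, data characteristics: the data has attributes and structure, records are related to one another, and queries resemble the way a person would ask.
Second, queries are structured and strict.
Third, the data must be updated promptly.
Fourth, a specific query returns a specific result.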
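To make the four points concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `students` table and its rows are invented for illustration. It shows attributes and structure, a strict query, an in-place update, and an exact result set.

```python
import sqlite3

# In-memory database; the table and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO students (id, name, year) VALUES (?, ?, ?)",
    [(1, "Alice", 2019), (2, "Bob", 2020), (3, "Carol", 2020)],
)


def names_in_year(year):
    """Strict, structured query: only rows whose year matches exactly."""
    rows = conn.execute(
        "SELECT name FROM students WHERE year = ? ORDER BY name", (year,)
    ).fetchall()
    return [row[0] for row in rows]


before = names_in_year(2020)

# Data is updated in place, and the next query reflects it immediately.
conn.execute("UPDATE students SET year = 2021 WHERE name = ?", ("Bob",))
after = names_in_year(2020)

print(before, after)
```

Every returned row either satisfies the condition or it does not; there is no notion of a row being "somewhat relevant", which is the main contrast with IR.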
The characteristics of IR differ from those of database retrieval.
Definitions:
Mining: Digging deep into the earth to find hidden, valuable materials.
Data Mining: Analysis of large data corpora, i.e. corpora which are too large for human inspection.
Information Retrieval Components
Documents: Identify words which are ‘important’ for discriminating between documents, and how important they are.
Index: Specifies the relationships between these ‘keywords’ and the documents.
The query
Matching: Measuring the similarity between the query and each document.
Retrieved documents
Assessment and Relevance Feedback.
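The components above can be sketched as a toy pipeline. The notes do not say which weighting or similarity measure the course uses, so this sketch assumes TF-IDF weights and cosine similarity, two common choices. The three documents and the names `tokenize`, `index` and `search` are made up for illustration, and relevance feedback is not implemented.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus.
docs = {
    "d1": "the telescope on the hill",
    "d2": "the man saw the hill",
    "d3": "data mining finds hidden value in data",
}


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


# Documents: term counts, plus IDF as a measure of how discriminating a word is.
tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
df = Counter()
for counts in tf.values():
    df.update(counts.keys())
idf = {term: math.log(len(docs) / df[term]) for term in df}

# Index: keyword -> {document: TF-IDF weight}.
index = defaultdict(dict)
for doc_id, counts in tf.items():
    for term, count in counts.items():
        index[term][doc_id] = count * idf[term]

# Length of each document vector, needed for cosine similarity.
doc_norm = {
    doc_id: math.sqrt(sum(index[term][doc_id] ** 2 for term in counts))
    for doc_id, counts in tf.items()
}


def search(query):
    """Matching: rank documents by cosine similarity to the query."""
    q_counts = Counter(tokenize(query))
    q_weights = {t: c * idf[t] for t, c in q_counts.items() if t in idf}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_norm == 0:
        return []
    scores = defaultdict(float)
    for term, q_w in q_weights.items():
        for doc_id, d_w in index[term].items():
            scores[doc_id] += q_w * d_w
    ranked = [
        (doc_id, score / (q_norm * doc_norm[doc_id]))
        for doc_id, score in scores.items()
        if doc_norm[doc_id] > 0
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


results = search("telescope hill")
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Unlike a database query, the result is a ranked list. Here `d1` should score well above `d2`, because "telescope" is rarer in the corpus than "hill" and therefore carries more weight.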
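Structure within a piece of text:
Words (keywords; some words are more important than others)
Sentences (Grammar/Syntax): words arranged in a particular order, which aids understanding and helps remove ambiguity.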
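A small illustration of why order matters: two sentences can contain exactly the same words and still mean different things. The sentences below are made up for this example.

```python
from collections import Counter

a = "the dog bites the man".split()
b = "the man bites the dog".split()

# Viewed as unordered bags of words, the two sentences are identical.
same_words = Counter(a) == Counter(b)

# Viewed as ordered sequences, they differ, and so does the meaning.
same_order = a == b

print(same_words, same_order)
```

A retrieval system that only counts words cannot tell these two sentences apart; sentence structure carries the missing information.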