Data Mining & Machine Learning Study Notes: Introductory Machine Learning Notes (Part 1)

兴趣使然的程序猿 · Published 2020-01-11 12:50:35

Data Mining and Machine Learning

L1. Information Retrieval
(Focus on Text Retrieval) Definition: finding, within a large collection of text, the information most relevant to a query.

Chapter 1

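Moore’s Law: Technology performance doubles and prices halve every 18 months.
Disk capacity: 10^6 bytes = 1 MB, 10^9 bytes = 1 GB, 10^12 bytes = 1 TB.
What can you store on 1 TB?
Assume audio sampled at 16 kHz, with each sample stored in 16 bits.
One second therefore takes 16000 * 16 = 256,000 bits;
1 TB = 10^12 bytes = 8 * 10^12 bits.

The calculation in Python: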

Python 3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:37:50) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 16000*16
>>> b = 10**12*8
>>> days = b/(a*60*60*24)
>>> print(days)
361.68981481481484
>>>

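In other words, 1 TB of storage can hold about 362 days of audio in the format above.
At petabyte scale (1 PB = 1000 TB) the capacity is 362 * 1000 days,
which is nearly a thousand years.
Google, in my view the greatest company in the world, bar none, processes 20 PB of data every day.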
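
The step from terabytes to petabytes can be checked with the same arithmetic. The following is a minimal sketch; the function name `audio_days` and its parameter names are my own, and it assumes uncompressed single-channel audio and 365-day years.

```python
def audio_days(capacity_bytes, sample_rate_hz=16_000, bits_per_sample=16):
    """Days of uncompressed single-channel audio that fit in capacity_bytes."""
    bits_per_second = sample_rate_hz * bits_per_sample
    seconds = capacity_bytes * 8 / bits_per_second  # 8 bits per byte
    return seconds / (60 * 60 * 24)

TB = 10**12
PB = 10**15

days_per_tb = audio_days(TB)
years_per_pb = audio_days(PB) / 365  # assumes 365-day years

print(f"1 TB holds about {days_per_tb:.0f} days of audio")
print(f"1 PB holds about {years_per_pb:.0f} years of audio")
```

This prints roughly 362 days for 1 TB and roughly 991 years for 1 PB, which matches the "nearly a thousand years" estimate.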

2. Why do we store these big corpora? It is a huge expense.
Because we believe in the latent value of the data.
What is the problem?
It is a problem of semantics.
Humans can resolve semantic ambiguity, but computers cannot.
For example:
I saw the man on the hill with the telescope.
（What’s the real）

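With a basic understanding of IR (Information Retrieval) in place, let's look at how IR differs from database retrieval.

Database retrieval has the following characteristics:
First. Data characteristics: the data has attributes, structure, and relationships between records, and queries resemble the way a person would ask.
Second. Queries are structured and strict.
Third. The data must be kept up to date.
Fourth. A specific query returns a specific result.
The characteristics of IR differ from those of database retrieval.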
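
The contrast can be sketched in a few lines. This is only an illustration under my own assumptions: the table, the rows, the documents, and the word-overlap score are all invented for the example, and the overlap score stands in for whatever similarity measure a real IR system would use. The point it tries to show is that a strict database query either matches a record or does not, while text retrieval assigns every document a degree of match and ranks them.

```python
import sqlite3

# Hypothetical toy data, invented for this illustration.
rows = [
    (1, "Moore", "hardware"),
    (2, "Shannon", "information theory"),
]
docs = {
    "d1": "disk capacity doubles as hardware improves",
    "d2": "retrieval of text documents from large corpora",
    "d3": "ranking text by relevance to a query",
}

# Database retrieval: a structured query either matches a row or it does not.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, topic TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
exact = conn.execute(
    "SELECT name FROM people WHERE topic = ?", ("hardware",)
).fetchall()
conn.close()

# Text retrieval: every document gets a score, and results are ranked.
def overlap_score(query, text):
    """Number of distinct query words that also occur in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

query = "text retrieval"
ranked = sorted(docs, key=lambda d: overlap_score(query, docs[d]), reverse=True)

print(exact)   # rows that satisfy the condition exactly
print(ranked)  # all documents, best match first
```

Here the SQL query returns only the row whose `topic` equals `"hardware"`, whereas the text query returns all three documents in order of how many query words they share.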

Definitions:
Mining: Digging deep into the earth to find hidden, valuable materials.

Data Mining: Analysis of large data corpora, that is, corpora which are too large for human inspection.

Information Retrieval Components
Documents: Identify words which are ‘important’ for discriminating between documents, and how important they are.
Index: Specifies the relationships between these ‘keywords’ and the documents.
The query
Matching: Measuring the similarity between the query and each document.
Retrieved documents
Assessment and Relevance Feedback.
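
The first few components (documents, index, query, matching, retrieved documents) can be sketched as a tiny inverted index. This is a rough illustration, not the method from the lecture: the corpus is invented, whitespace tokenization is a simplification, and TF-IDF is used only as one common way to express "how important" a word is for discriminating between documents. The names `tokenize`, `idf`, and `search` are my own.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for this illustration.
documents = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "data mining finds patterns in large data corpora",
}

def tokenize(text):
    """Lower-case and split on whitespace (a deliberate simplification)."""
    return text.lower().split()

# Index: term -> {doc_id: number of times the term occurs in that document}
index = defaultdict(dict)
for doc_id, text in documents.items():
    for term, count in Counter(tokenize(text)).items():
        index[term][doc_id] = count

def idf(term):
    """Terms found in fewer documents discriminate better, so they weigh more."""
    return math.log(len(documents) / len(index[term]))

def search(query):
    """Matching: score each document by the summed tf * idf of the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        if term not in index:
            continue  # a term absent from the corpus contributes nothing
        for doc_id, tf in index[term].items():
            scores[doc_id] += tf * idf(term)
    # Retrieved documents, best match first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

results = search("the data mining")
print(results)
```

With this corpus, "the" occurs in two of the three documents and so carries little weight, while "data" and "mining" occur only in `d3`, which therefore ranks first. Assessment and relevance feedback are not covered by this sketch.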

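Structure within a piece of text:
Words (keywords; some words are more important than others)
Sentences (grammar/syntax): words arranged in a particular order, which aids understanding and helps resolve ambiguity.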
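
A small sketch of why word order matters: two sentences (invented for this example) contain exactly the same words, so a representation that only counts words cannot tell them apart, while the ordered sequences differ.

```python
from collections import Counter

# Hypothetical example sentences, invented for this illustration.
a = "dog bites man"
b = "man bites dog"

same_bag = Counter(a.split()) == Counter(b.split())  # compares word counts only
same_sequence = a.split() == b.split()               # compares words in order

print(same_bag, same_sequence)
```

The word counts are identical but the sequences are not, so the difference in meaning lives entirely in the ordering.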
